How can I scrape tens of thousands of Goodreads book reviews with Python + Selenium without the page crashing?

New here, please bear with me.

I'm using Python + Selenium to scrape the reviews of a single book on Goodreads. Goodreads loads every review on one page: each click of "load more" appends more content to the same page. Once 3,000-5,000 reviews are loaded, the page crashes. I'm using Edge with its memory limit set to unlimited. What should I do in this situation?

I'd like to collect the complete set of reviews, more than ten thousand in total.
I've already tried reducing the number of page operations and spacing each load 5 s apart, but past roughly 3,000 reviews the page still crashes intermittently.

AI Bot

Answer

To scrape tens of thousands of Goodreads reviews without crashing, try the following strategies to optimize your Selenium script:

  1. Use a headless browser
    Running headless (e.g. Chrome Headless or Firefox Headless) cuts memory and CPU usage, which makes the script more stable.

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options)
  2. Increase page-load waits
    You've already tried waiting 5 seconds after each load; try a longer pause, or better, use smarter wait conditions (WebDriverWait with expected_conditions).

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    wait = WebDriverWait(driver, 10)  # raise the wait to 10 seconds
    # example wait condition
    element = wait.until(EC.presence_of_element_located((By.ID, "some-id")))
  3. Process reviews in batches
    Don't keep every review on the page at once; load and process them in batches, e.g. load 500 reviews, extract them, then fetch the next batch, as in the sketch below.
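    A minimal sketch of batching, which also prunes already-scraped nodes from the DOM so the page never grows unbounded. The CSS selectors '.ReviewCard' and 'button.LoadMore' are assumptions; inspect the live page and adjust:

    import time
    from selenium.webdriver.common.by import By

    reviews = []
    target_count = 10000  # however many you need
    while len(reviews) < target_count:
        # grab whatever review cards are currently in the DOM
        cards = driver.find_elements(By.CSS_SELECTOR, '.ReviewCard')  # assumed selector
        reviews.extend(card.text for card in cards)
        # delete the already-scraped nodes so memory stays bounded
        driver.execute_script(
            "document.querySelectorAll('.ReviewCard').forEach(el => el.remove())")
        try:
            driver.find_element(By.CSS_SELECTOR, 'button.LoadMore').click()  # assumed selector
        except Exception:
            break  # no "load more" button left
        time.sleep(5)  # give the next batch time to load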
  4. Use Selenium Grid or distribute the work
    If a single machine can't handle the volume, consider Selenium Grid to spread the task across several machines (see the sketch below).
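    A minimal sketch, assuming a Grid hub is already listening at localhost:4444:

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # the hub dispatches the session to any available node
    driver = webdriver.Remote(command_executor='http://localhost:4444/wd/hub',
                              options=options)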
  5. Adjust Edge's settings
    Even though you set the memory limit to unlimited, Edge may have other constraints that trigger the crash. Try Chrome or another browser, or slim Edge down (disable extensions, clear the cache, and so on).
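    A sketch with Edge itself; these are standard Chromium flags (the second mainly matters on Linux):

    from selenium import webdriver

    options = webdriver.EdgeOptions()
    options.add_argument('--disable-extensions')     # drop extension overhead
    options.add_argument('--disable-dev-shm-usage')  # avoids renderer crashes from a small /dev/shm
    driver = webdriver.Edge(options=options)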
  6. Catch exceptions and retry
    Add exception handling to your script so that when a crash happens it automatically retries loading the reviews.

    try:
        load_reviews()  # placeholder for your review-loading logic
    except Exception as e:
        print(f"Error: {e}")
        # reload the page, or apply some other recovery strategy
        driver.refresh()
  7. Check Goodreads' anti-scraping measures
    Make sure your script isn't tripping Goodreads' anti-bot defenses. If Goodreads detects scraping behavior, it may deliberately break the page or return bogus data. Randomized pacing, as sketched below, can help.
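    One low-effort mitigation is jittered pacing between "load more" clicks; the bounds here are arbitrary:

    import random
    import time

    time.sleep(random.uniform(3, 8))  # looks less mechanical than a fixed 5 s pause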

Combining these strategies should let you scrape large volumes of Goodreads reviews far more reliably. That said, respect Goodreads' terms of service and applicable law, and avoid over-scraping or misusing the data.

1 Answer
✓ Accepted

Log in first, use Selenium to grab the token, then fetch the reviews directly from the API:

import logging
import json
import re
import requests

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S")


def get_driver():
    # Attach to an already-running Chrome started with
    # --remote-debugging-port=9222, so you can log in and pass any
    # human-verification check manually before the script takes over
    chrome_driver_path = r'D:\programs\chromedriver\chromedriver.exe'
    debugger_address = '127.0.0.1:9222'
    service = Service(chrome_driver_path)
    options = webdriver.ChromeOptions()
    options.add_experimental_option('debuggerAddress', debugger_address)
    driver = webdriver.Chrome(service=service, options=options)
    return driver

def get_resource_id(book_url, data):
    '''Extract the work's resource id from the embedded Next.js page state.'''
    pattern = r'/show/(\d+)'
    match = re.search(pattern, book_url)
    if match:
        book_id = match.group(1)
        key = f'getBookByLegacyId({{\"legacyId\":\"{book_id}\"}})'
        book_ref = data.get("props", {}).get("pageProps", {}).get("apolloState", {}).get("ROOT_QUERY", {}).get(key, {}).get('__ref', '')
        if not book_ref:
            return ''
        # the work __ref looks like 'Work:<id>'; drop the 5-char 'Work:' prefix
        resource_id = data.get("props", {}).get("pageProps", {}).get("apolloState", {}).get(book_ref, {}).get("work", {}).get('__ref', '')[5:]
        return resource_id
    return ''

def get_params(book_url):
    result = {
        'jwtToken': '',
        'resource_id' : ''
    }
    try:
        driver.get(book_url)
        wait = WebDriverWait(driver, 10)
        # wait until the embedded Next.js state is present, then parse it
        wait.until(EC.presence_of_element_located((By.ID, '__NEXT_DATA__')))
        soup = BeautifulSoup(driver.page_source, 'lxml')
        script_tag = soup.find('script', id='__NEXT_DATA__')
        data = json.loads(script_tag.string)
        # dump the raw page state for inspection while debugging
        with open('temp.json', 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=4)
        result['jwtToken'] = data.get("props", {}).get("pageProps", {}).get("jwtToken")
        result['resource_id'] = get_resource_id(book_url, data)
    except Exception as e:
        logging.exception(e)
    return result

def get_next_reviews(jwt_token, prev_page_token, resource_id, limit=100):
    '''Fetch one page of reviews from Goodreads' GraphQL endpoint.'''
    result = {
        'code': -1,
        'reviews': [],
        'next_page_token': '',
        'total_count': 0
    }
    try:
        headers = {
            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
            'Referer': 'https://www.goodreads.com/',
            'Origin': 'https://www.goodreads.com',
        }
        if not jwt_token:
            headers['X-Api-Key'] = X_API_KEY
        else:
            headers['Authorization'] = jwt_token
        
        # in testing, limit maxes out at 100; larger values return no data
        params = {"operationName":"getReviews","variables":{"filters":{"resourceType":"WORK","resourceId":resource_id},"pagination":{"after":prev_page_token,"limit":limit}},"query":"query getReviews($filters: BookReviewsFilterInput!, $pagination: PaginationInput) {\n  getReviews(filters: $filters, pagination: $pagination) {\n    ...BookReviewsFragment\n    __typename\n  }\n}\n\nfragment BookReviewsFragment on BookReviewsConnection {\n  totalCount\n  edges {\n    node {\n      ...ReviewCardFragment\n      __typename\n    }\n    __typename\n  }\n  pageInfo {\n    prevPageToken\n    nextPageToken\n    __typename\n  }\n  __typename\n}\n\nfragment ReviewCardFragment on Review {\n  __typename\n  id\n  creator {\n    ...ReviewerProfileFragment\n    __typename\n  }\n  recommendFor\n  updatedAt\n  createdAt\n  spoilerStatus\n  lastRevisionAt\n  text\n  rating\n  shelving {\n    shelf {\n      name\n      webUrl\n      __typename\n    }\n    taggings {\n      tag {\n        name\n        webUrl\n        __typename\n      }\n      __typename\n    }\n    webUrl\n    __typename\n  }\n  likeCount\n  viewerHasLiked\n  commentCount\n}\n\nfragment ReviewerProfileFragment on User {\n  id: legacyId\n  imageUrlSquare\n  isAuthor\n  ...SocialUserFragment\n  textReviewsCount\n  viewerRelationshipStatus {\n    isBlockedByViewer\n    __typename\n  }\n  name\n  webUrl\n  contributor {\n    id\n    works {\n      totalCount\n      __typename\n    }\n    __typename\n  }\n  __typename\n}\n\nfragment SocialUserFragment on User {\n  viewerRelationshipStatus {\n    isFollowing\n    isFriend\n    __typename\n  }\n  followersCount\n  __typename\n}\n"}

        interface_url = 'https://kxbwmqov6jgg3daaamb744ycu4.appsync-api.us-east-1.amazonaws.com/graphql'
        res = requests.post(interface_url, headers=headers, json=params)
        if res.status_code != 200:
            raise Exception(res.text)
        data = res.json()
        reviews = data.get("data", {}).get('getReviews', {}).get('edges', [])
        next_page_token = data.get("data", {}).get('getReviews', {}).get('pageInfo', {}).get('nextPageToken', '')
        total_count = data.get("data", {}).get('getReviews', {}).get('totalCount', 0)
        result['code'] = 1
        result['reviews'] = reviews
        result['next_page_token'] = next_page_token
        result['total_count'] = total_count
    except Exception as e:
        logging.exception(e)
    return result

def get_book_reviews(book_url):
    try:
        # after passing any human-verification check, grab jwtToken and
        # resource_id from the book page first
        logging.info("Fetching parameters")
        params = get_params(book_url)
        if not params['jwtToken'] and not X_API_KEY:
            logging.error('No jwtToken found and X_API_KEY is not set')
            return
        if not params['resource_id']:
            logging.error('No resource_id found')
            return
        logging.info(f'{params=}')
        logging.info('Starting to fetch reviews')
        jwt_token = params['jwtToken']
        prev_page_token = ''
        resource_id = params['resource_id']
        all_reviews = []
        while True:
            result = get_next_reviews(jwt_token, prev_page_token, resource_id)
            if result['code'] != 1:
                logging.error("网络错误或token过期")
                break
            if not result['reviews']:
                break
            total_count = result['total_count']
            all_reviews.extend(result['reviews'])
            prev_page_token = result['next_page_token']
            logging.info(f"已获取 {len(all_reviews)} / {total_count}")
            if not result['next_page_token']:
                break
        # with open('temp.json', 'w', encoding='utf-8') as f:
        #     json.dump(all_reviews, f, indent=4)
        # return just the review texts
        return [item.get("node", {}).get('text', '') for item in all_reviews]
    except Exception as e:
        logging.exception(e)


X_API_KEY = ''  # optional fallback key, used when no jwtToken is found
driver = get_driver()
book_url = 'https://www.goodreads.com/book/show/12651.The_Social_Contract?from_search=true'
reviews = get_book_reviews(book_url)
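
To save the collected texts afterwards, a small follow-up reusing the json import above (assumes the reviews list returned by get_book_reviews):

# persist the scraped review texts
if reviews:
    with open('reviews.json', 'w', encoding='utf-8') as f:
        json.dump(reviews, f, ensure_ascii=False, indent=4)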