I've seen several solutions for scraping multiple pages from a website, but I can't get any of them to work with my code.
Right now I have the following code, which scrapes the first page. I'd like to create a loop that scrapes all of the site's pages (from page 1 to page 5):
import time

import pandas as pd
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Configure Chrome with a fixed window size and a random user agent.
options = Options()
options.add_argument("window-size=1400,600")
ua = UserAgent()
user_agent = ua.random
print(user_agent)
options.add_argument(f'user-agent={user_agent}')

driver = webdriver.Chrome('/Users/raduulea/Documents/chromedriver', options=options)
driver.get('https://www.immoweb.be/fr/recherche/immeuble-de-rapport/a-vendre/liege/4000?page=1')
time.sleep(10)  # give the page time to finish loading

# Parse the rendered page and collect one list per field.
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
results = soup.find_all("div", {"class": "result-xl"})

title = []
address = []
price = []
surface = []
desc = []

for result in results:
    title.append(result.find("div", {"class": "title-bar-left"}).get_text().strip())
    address.append(result.find("span", {"class": "result-adress"}).get_text().strip())
    price.append(result.find("div", {"class": "xl-price rangePrice"}).get_text().strip())
    surface.append(result.find("div", {"class": "xl-surface-ch"}).get_text().strip())
    desc.append(result.find("div", {"class": "xl-desc"}).get_text().strip())

df = pd.DataFrame({"Title": title, "Address": address, "Price": price, "Surface": surface, "Description": desc})
df.to_csv("output.csv")
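For the page-1-to-5 case specifically, one straightforward pattern (a sketch continuing from the script above, assuming each results page is reachable by changing the page query parameter, as the URL suggests, and that every page uses the same markup) is to move the fetch-and-parse step into a loop:

# Sketch: replaces the single driver.get(...) call and parsing block above.
# Assumes pages 1-5 all exist and share the same result markup (untested).
base_url = 'https://www.immoweb.be/fr/recherche/immeuble-de-rapport/a-vendre/liege/4000?page={}'

for page in range(1, 6):  # pages 1 through 5
    driver.get(base_url.format(page))
    time.sleep(10)  # wait for the current page to render
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for result in soup.find_all("div", {"class": "result-xl"}):
        title.append(result.find("div", {"class": "title-bar-left"}).get_text().strip())
        address.append(result.find("span", {"class": "result-adress"}).get_text().strip())
        price.append(result.find("div", {"class": "xl-price rangePrice"}).get_text().strip())
        surface.append(result.find("div", {"class": "xl-surface-ch"}).get_text().strip())
        desc.append(result.find("div", {"class": "xl-desc"}).get_text().strip())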
Try the code below. It will loop through all of the pages, not just the first 5: it checks whether the Next button is available and clicks it if so, and otherwise breaks out of the while loop.
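A minimal sketch of that approach (not the answer's exact code; the Next-button selector "a.pagination-next" is an assumption, so inspect the site's pagination markup for the real one):

import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome('/Users/raduulea/Documents/chromedriver')
driver.get('https://www.immoweb.be/fr/recherche/immeuble-de-rapport/a-vendre/liege/4000?page=1')

rows = []
while True:
    time.sleep(10)  # wait for the current page to render
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for result in soup.find_all("div", {"class": "result-xl"}):
        rows.append({
            "Title": result.find("div", {"class": "title-bar-left"}).get_text().strip(),
            "Address": result.find("span", {"class": "result-adress"}).get_text().strip(),
            "Price": result.find("div", {"class": "xl-price rangePrice"}).get_text().strip(),
            "Surface": result.find("div", {"class": "xl-surface-ch"}).get_text().strip(),
            "Description": result.find("div", {"class": "xl-desc"}).get_text().strip(),
        })
    try:
        # Click the Next button if it exists; on the last page it is missing,
        # which raises NoSuchElementException and ends the loop.
        driver.find_element(By.CSS_SELECTOR, "a.pagination-next").click()
    except NoSuchElementException:
        break

driver.quit()
pd.DataFrame(rows).to_csv("output.csv")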