Python crawler for scraping Zhihu questions

import requests
from bs4 import BeautifulSoup

url = 'http://www.zhihu.com/#signin'
url1 = 'http://www.zhihu.com/login/email'
url2 = 'http://www.zhihu.com/'

ans = requests.get(url)
soup = BeautifulSoup(ans.content, 'html.parser')  # name a parser explicitly
# match the hidden input by name, so an unrelated hidden field isn't picked up
_xsrf = soup.find('input', attrs={'name': '_xsrf'})['value']
print(_xsrf)

postdata = {'_xsrf': _xsrf,
            'password': '000000000',
            'email': '000000000'}

headers = {
    'Host': 'www.zhihu.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate',
    'Referer': 'http://www.zhihu.com/',
    # a hardcoded If-None-Match ETag from an old session was dropped here:
    # replaying it can provoke a 304 Not Modified with an empty body
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive'
}

ans1 = requests.post(url1, data=postdata, headers=headers, cookies=ans.cookies)
ans2 = requests.get(url2, cookies=ans1.cookies)
print(ans2.text)
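As a side note, matching the hidden input by its name attribute is more robust than grabbing the first type="hidden" input, since a sign-in page may carry several hidden fields. A self-contained sketch against sample HTML (the snippet is made up for illustration, not Zhihu's actual markup):

```python
from bs4 import BeautifulSoup

# Sample HTML standing in for the sign-in page (invented for illustration)
html = '''
<form>
  <input type="hidden" name="utm_source" value="tracking">
  <input type="hidden" name="_xsrf" value="abc123">
</form>
'''

soup = BeautifulSoup(html, 'html.parser')
# find('input', type='hidden') would return the tracking field here;
# matching on the name attribute targets the token directly
xsrf = soup.find('input', attrs={'name': '_xsrf'})['value']
print(xsrf)  # abc123
```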

3 answers

OP, your approach is too low-level; you should just switch to Scrapy.

First clear your browser cookies; the login page will then present a captcha. Fetch the captcha from its URL, include its text in the form you submit, and the login will succeed.
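That flow could be sketched with a requests.Session, which carries cookies across requests automatically. The captcha endpoint (`captcha.gif`) and the `captcha` form field are assumptions about how such login forms typically work, not a confirmed Zhihu API, and the `_xsrf` token would be extracted from the sign-in page as in the question's code:

```python
import requests


def build_login_payload(xsrf, email, password, captcha):
    # Assemble the login form fields; 'captcha' is the assumed field name
    return {'_xsrf': xsrf, 'email': email,
            'password': password, 'captcha': captcha}


def login(email, password, xsrf):
    # One Session keeps the cookies from every step of the exchange
    s = requests.Session()
    s.get('http://www.zhihu.com/#signin')

    # Hypothetical captcha endpoint: save the image, read it by hand
    img = s.get('http://www.zhihu.com/captcha.gif')
    with open('captcha.gif', 'wb') as f:
        f.write(img.content)
    captcha = input('captcha text: ')

    payload = build_login_payload(xsrf, email, password, captcha)
    return s.post('http://www.zhihu.com/login/email', data=payload)
```

Calling `login('you@example.com', 'secret', xsrf)` then returns the response of the login POST, and further requests on the same Session stay logged in.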
