Why can't the images downloaded by my Python 3 crawler be opened?


I wrote a crawler in Python 3, but the images it saves cannot be opened. Why? Here is the code:

# -*- coding:utf8 -*-
import requests
from bs4 import BeautifulSoup
url = 'http://www.meizitu.com/a/5582.html'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'lxml')
imgs = soup.select('#picture > p > img')
mm_imgs = []
for img in imgs:
    src = img.get('src')
    mm_imgs.append(src)
    for mm in mm_imgs:
        filename = '/'+(str(mm)[-20:]).replace('/','-')

        target = "./{}".format(filename)

    with open(target, "wb") as fs:
        fs.write(req.content)

    print("%s => %s" % (mm, target))

2 answers

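After you get each image's src you never request it. Instead you write the content of the original url, and that content is the HTML page, not an image.
You need to send a separate request for each image src, and include a User-Agent header in that request.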

# -*- coding:utf8 -*-
import os

import requests
from bs4 import BeautifulSoup

url = 'http://www.meizitu.com/a/5582.html'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'lxml')
imgs = soup.select('#picture > p > img')

# Create the output directory if it does not exist yet
os.makedirs('uploads', exist_ok=True)

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
for img in imgs:
    src = img.get('src')
    # Build a flat file name from the tail of the image URL
    filename = src[-18:].replace('/', '-')
    target = "uploads/{}".format(filename)
    # Request the image itself; req.content is the HTML page, not the image
    r = requests.get(src, headers=headers)
    with open(target, "wb") as fs:
        fs.write(r.content)

    print("%s => %s" % (src, target))
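
One way to confirm the diagnosis above is to look at what the saved bytes actually are. The sketch below is only an illustration and not part of the original answer: `looks_like_image` and `IMAGE_SIGNATURES` are hypothetical names, and the check simply compares the first bytes against the well-known JPEG, PNG and GIF signatures. An HTML page written to a `.jpg` file fails this check.

```python
# Hypothetical helper, not from the original answer.
# Leading "magic" bytes of common image formats.
IMAGE_SIGNATURES = (
    b'\xff\xd8\xff',        # JPEG
    b'\x89PNG\r\n\x1a\n',   # PNG
    b'GIF87a',              # GIF (1987 variant)
    b'GIF89a',              # GIF (1989 variant)
)


def looks_like_image(data):
    """Return True if data starts with a known image signature."""
    # bytes.startswith accepts a tuple of candidate prefixes
    return data.startswith(IMAGE_SIGNATURES)


# Bytes that begin like a JPEG file versus bytes that begin like an HTML page
print(looks_like_image(b'\xff\xd8\xff\xe0' + b'\x00' * 16))
print(looks_like_image(b'<!DOCTYPE html><html>'))
```

In the download loop, calling `looks_like_image(r.content)` before writing the file would flag responses that are HTML or an error page rather than an image.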

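This site also applies some anti-crawler measures, so you need to send your own headers with each request:

headers = {
    'Host': 'mm.chinasareview.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
}

Otherwise the default headers openly tell the site that you are a crawler.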
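
The Host value in those headers is hard-coded, so it only matches URLs served from that one host. As a rough sketch (my own assumption, not something the original answer does), a hypothetical helper `headers_for` could derive Host from each URL instead. The User-Agent string is the one quoted in the answer, and the image path in the example call is made up.

```python
from urllib.parse import urlparse

# User-Agent string quoted in the answer
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'
)


def headers_for(url):
    """Build headers whose Host matches the given URL (hypothetical helper)."""
    return {
        'Host': urlparse(url).netloc,  # host part of the URL, e.g. 'www.meizitu.com'
        'User-Agent': USER_AGENT,
    }


# The path below is invented for illustration; only the host comes from the answer
print(headers_for('http://mm.chinasareview.com/example.jpg'))
```

It would be used as `requests.get(src, headers=headers_for(src))`. Since the Host header is normally filled in from the URL automatically when using requests, sending only User-Agent, as in the first answer, may well be enough.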

With these headers the images can still be crawled.
