Python crawler: the fetched HTML source shows the content as encoded character references, but the value is lost when the data is actually processed.


Page used for practice:
https://detail.tmall.com/item...

The code is meant to extract the model name of the phone:

def handle_starttag(self, tag, attrs):
    if tag == 'tr' and not self.finish:
        for variable, value in attrs:
            if variable == 'class' and value == 'tm-tableAttrSub':
                self.target_tr = True
    if tag == 'th' and self.target_tr and not self.finish:
        self.processing = 'th'
    if tag == 'td' and self.target_tr and self.target_th and not self.finish:
        # print 'value:',value
        self.processing = 'td'

def handle_data(self, data):
    if self.processing == 'th' and data.find('型号') > -1 and not self.finish and self.target_tr:
        self.target_th = True
        self.processing = ''
    if self.processing == 'td' and not self.finish and self.target_tr and self.target_th:
        self.finish = True
        self.target_th = False
        self.target_tr = False
        self.temp = data
        self.processing = ''
        print 'phoneName', data

HTML fragment that was fetched:
<tr><th> 型号</th><td>&nbsp;&#32418;&#31859;&#25163;&#26426;3</td>

Actual output:
phoneName 3

Expected output:
phoneName 红米手机3

How can I get the correct content of the fetched HTML fragment into data?

2 answers

I recommend using the BeautifulSoup library to process the page text.

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36'
}

url = 'https://detail.tmall.com/item.htm?id=525793357336&rn=7a5abaee6ca91c8c11c1472e38cea795&abbucket=15&sku_properties=10004:653780895;5919063:6536025;12304035:48072'
html = requests.get(url, headers=headers).content
soup = BeautifulSoup(html, 'lxml')

for item in soup.find_all('tr'):
    for phone in item.find_all('th', text=' 型号'):
        if phone.get_text() == ' 型号':
            print(item.find('td').get_text().strip())

Output:
红米手机3

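If adding BeautifulSoup and lxml is not an option, roughly the same lookup can be sketched with the standard library alone. The sketch below assumes Python 3.5+, where `HTMLParser` converts character references before calling `handle_data` (`convert_charrefs=True` is the default). The class name `ModelParser` and its attributes are hypothetical, and the input is the fragment from the question rather than the live page.

```python
from html.parser import HTMLParser


class ModelParser(HTMLParser):
    """Capture the text of the <td> that follows a <th> containing '型号'."""

    def __init__(self):
        # With convert_charrefs=True, references such as &#32418; reach
        # handle_data already decoded, as part of the surrounding text.
        super().__init__(convert_charrefs=True)
        self.current = None      # 'th', 'td' or None
        self.label_seen = False  # True once the '型号' header was found
        self.model = None

    def handle_starttag(self, tag, attrs):
        if tag in ('th', 'td'):
            self.current = tag

    def handle_endtag(self, tag):
        if tag in ('th', 'td'):
            self.current = None

    def handle_data(self, data):
        if self.model is not None:
            return  # already found, ignore the rest of the page
        if self.current == 'th':
            self.label_seen = '型号' in data
        elif self.current == 'td' and self.label_seen:
            # strip() also removes the non-breaking space left by &nbsp;
            self.model = data.strip()


parser = ModelParser()
parser.feed('<tr><th> 型号</th><td>&nbsp;&#32418;&#31859;&#25163;&#26426;3</td></tr>')
parser.close()
print(parser.model)
```

Feeding the whole document in one call keeps each text run in a single `handle_data` call. If the page is fed in chunks, the text of one cell may arrive in several pieces and would need to be accumulated.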
>>> import html
>>> print(html.unescape('&nbsp;&#32418;&#31859;&#25163;&#26426;3'))
 红米手机3
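
html.unescape (standard library, Python 3.4+) converts named and numeric character references into the corresponding Unicode characters.
The &nbsp; reference decodes to a non-breaking space (U+00A0), which can simply be replaced with an ordinary space.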
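A likely reason the parser in the question prints only `3`: in Python 2's `HTMLParser` (and in Python 3 when `convert_charrefs=False`), numeric references such as `&#32418;` are passed to `handle_charref` and named ones such as `&nbsp;` to `handle_entityref`, so `handle_data` only ever receives the literal `3`. The following is a minimal sketch, not the asker's code, written for Python 3. The class name `CellTextCollector` is hypothetical. It collects all three kinds of events and decodes the references with `html.unescape`.

```python
import html
from html.parser import HTMLParser


class CellTextCollector(HTMLParser):
    """Collect the full text of every <td>, including character references."""

    def __init__(self):
        # convert_charrefs=False mimics the Python 2 behaviour: references
        # are reported separately instead of being merged into handle_data.
        super().__init__(convert_charrefs=False)
        self.in_td = False
        self.parts = []   # pieces of the cell currently being read
        self.cells = []   # finished cell texts

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == 'td' and self.in_td:
            self.in_td = False
            text = ''.join(self.parts)
            # Replace the non-breaking space from &nbsp; and trim the result.
            self.cells.append(text.replace('\xa0', ' ').strip())

    def handle_data(self, data):
        if self.in_td:
            self.parts.append(data)

    def handle_charref(self, name):
        # name is e.g. '32418' or 'x7ea2'; rebuild the reference and decode it.
        if self.in_td:
            self.parts.append(html.unescape('&#%s;' % name))

    def handle_entityref(self, name):
        # name is e.g. 'nbsp'
        if self.in_td:
            self.parts.append(html.unescape('&%s;' % name))


fragment = '<tr><th> 型号</th><td>&nbsp;&#32418;&#31859;&#25163;&#26426;3</td></tr>'
collector = CellTextCollector()
collector.feed(fragment)
collector.close()
print(collector.cells)
```

The same idea carries over to the original state machine: append to a buffer in `handle_data`, `handle_charref` and `handle_entityref`, and read the buffer when the closing `</td>` arrives, instead of taking the first `handle_data` call as the whole value.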
