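I'm writing a Python crawler. The HTML source I fetch contains the text I want in an encoded form, but when I actually process the data I can't get that value.
The page I'm practicing on:
https://detail.tmall.com/item...
The goal of the code is to extract the model name of the phone: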
    def handle_starttag(self, tag, attrs):
        if tag == 'tr' and not self.finish:
            for variable, value in attrs:
                if variable == 'class' and value == 'tm-tableAttrSub':
                    self.target_tr = True
        if tag == 'th' and self.target_tr and not self.finish:
            self.processing = 'th'
        if tag == 'td' and self.target_tr and self.target_th and not self.finish:
            # print 'value:',value
            self.processing = 'td'

    def handle_data(self, data):
        if self.processing == 'th' and data.find('型号') > -1 and not self.finish and self.target_tr:
            self.target_th = True
            self.processing = ''
        if self.processing == 'td' and not self.finish and self.target_tr and self.target_th:
            self.finish = True
            self.target_th = False
            self.target_tr = False
            self.temp = data
            self.processing = ''
            print 'phoneName', data
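The HTML fragment I get back:
<tr><th> 型号</th><td>& nbsp;& #32418;& #31859;& #25163;& #26426;3</td>
(Pasting the original content directly makes the forum decode and render it, so I added a space after each &; remove those spaces to get the actual source.)
The final output:
phoneName 3
But the output I expect is:
phoneName 红米手机3
Could anyone tell me how to get the correct content of this HTML fragment into data?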
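The output above is consistent with how the standard-library parser reports references: plain text goes to handle_data, while numeric references such as &#32418; go to handle_charref and named ones such as &nbsp; go to handle_entityref, so handle_data only ever sees the trailing "3". Below is a minimal sketch of collecting all three kinds of pieces for the target cell. It is written for Python 3 (html.parser with convert_charrefs=False, which mimics the Python 2 behaviour); the class name ModelParser and its attributes are made up for illustration and are not the asker's original class. In Python 2 the same two hooks exist on HTMLParser.HTMLParser, but the decoding would need unichr and htmlentitydefs instead of html.unescape.

```python
from html import unescape
from html.parser import HTMLParser


class ModelParser(HTMLParser):
    """Collects the text of the <td> that follows a <th> containing '型号'."""

    def __init__(self):
        # convert_charrefs=False makes the parser report references
        # separately from plain text, as the Python 2 parser does.
        super().__init__(convert_charrefs=False)
        self.in_th = False
        self.in_td = False
        self.want_td = False  # True once the '型号' header has been seen
        self.parts = []       # pieces of the target cell, in document order
        self.model = None

    def handle_starttag(self, tag, attrs):
        if tag == 'th':
            self.in_th = True
        elif tag == 'td' and self.want_td:
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == 'th':
            self.in_th = False
        elif tag == 'td' and self.in_td:
            # The cell is complete: join the pieces and drop outer whitespace.
            self.model = ''.join(self.parts).strip()
            self.in_td = False
            self.want_td = False

    def handle_data(self, data):
        if self.in_th and '型号' in data:
            self.want_td = True
        elif self.in_td:
            self.parts.append(data)

    def handle_charref(self, name):
        # Numeric reference, e.g. &#32418; arrives with name == '32418'.
        if self.in_td:
            self.parts.append(unescape('&#%s;' % name))

    def handle_entityref(self, name):
        # Named reference, e.g. &nbsp; arrives with name == 'nbsp'.
        if self.in_td:
            self.parts.append(unescape('&%s;' % name))


# The fragment from the question, with the spaces after each & removed.
snippet = '<tr><th> 型号</th><td>&nbsp;&#32418;&#31859;&#25163;&#26426;3</td></tr>'
parser = ModelParser()
parser.feed(snippet)
parser.close()
print('phoneName', parser.model)
```

The cell text is assembled only when the closing </td> is seen, because a single cell can arrive as many separate callbacks.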
I'd recommend using the BeautifulSoup library to process the page text.
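As a rough sketch of what that could look like, assuming the bs4 package is installed and that the model name sits in the td right after the th containing '型号' (as in the fragment from the question), something like the following should work. The variable names are illustrative only, and the fragment is the one from the question with the spaces after each & removed.

```python
from bs4 import BeautifulSoup

# Fragment from the question; the real page would be the fetched HTML.
snippet = '<tr><th> 型号</th><td>&nbsp;&#32418;&#31859;&#25163;&#26426;3</td></tr>'

# 'html.parser' is the builder that ships with Python, so no extra
# parser library is needed; references are decoded automatically.
soup = BeautifulSoup(snippet, 'html.parser')

phone_name = None
for th in soup.find_all('th'):
    if '型号' in th.get_text():
        # The value is assumed to be in the sibling <td> of this header.
        td = th.find_next_sibling('td')
        if td is not None:
            phone_name = td.get_text(strip=True)
        break

print('phoneName', phone_name)
```

BeautifulSoup hands back the cell as one decoded string, so there is no need to track parser state or stitch character references together by hand.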