What is wrong with this regex?

huang


Posted on
2015-10-26

小_秦


Updated on
2015-10-26

Here is the situation: I want to scrape the answers and images from this Zhihu question: http://www.zhihu.com/question/20937691

The code is as follows:

# -*- coding:utf-8 -*-

import urllib
import urllib2
import re
import sys
reload(sys)
sys.setdefaultencoding( "utf-8" )

#url = 'http://www.zhihu.com/question/20937691'

class Spider:
    def __init__(self):
        self.siteurl = 'http://www.zhihu.com/question'

    def getPage(self,pageIndex):
        url = self.siteurl + '/' + str(pageIndex)
        request = urllib2.Request(url)
        response = urllib2.urlopen(request)
        return response.read().decode('utf-8')

    def getContent(self,pageIndex):
        page = self.getPage(pageIndex)
        pattern = re.compile('<div class="zm-editalble.*?>::before(.*?)<noscript>.*?</noscript><img class ="origin_image.*? src=(.*?) style="width.*?></img>::after</div>',re.S)
        items = re.findall(pattern,page)
        for item in items:
            print item[0],item[1]

spider = Spider()
spider.getContent(20937691)

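This is only a start; I have not written the rest yet. The problem is that when I run it, the only output is "finished in ..s" and nothing else is printed. Why is that? A smaller question: where in the regex line can I break and continue on the next line? It is too long to keep on one line. Thanks.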

python2.7


3 answers

✓ Accepted

Using regex to scrape web pages is itself the biggest problem here!

Arnie97


Posted on
2015-10-26

Updated on
2015-10-26

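Let me answer the smaller question first:
Python borrows many features from C, and one of them is that several string literals written next to each other are joined into a single string. For example: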

pattern = re.compile('<div class="zm-editalble.*?>'
                     '::before(.*?)'
                     '<noscript>.*?</noscript>'
                     '<img class ="origin_image.*?'
                     ' src=(.*?) style="width.*?></img>'
                     '::after</div>', re.S)
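
As a quick check of the point above, the following sketch compares the one-line pattern from the question with the split version. The variable names `single`, `split` and `same` are made up for this illustration; the pattern text is copied unchanged from the question, typos included.

```python
import re

# One-line version of the pattern, exactly as written in the question.
single = '<div class="zm-editalble.*?>::before(.*?)<noscript>.*?</noscript><img class ="origin_image.*? src=(.*?) style="width.*?></img>::after</div>'

# The same pattern split across several lines. Adjacent string literals
# inside the parentheses are joined by the compiler, so no "+" is needed.
split = ('<div class="zm-editalble.*?>'
         '::before(.*?)'
         '<noscript>.*?</noscript>'
         '<img class ="origin_image.*?'
         ' src=(.*?) style="width.*?></img>'
         '::after</div>')

# Both spellings produce the identical string, hence the identical regex.
same = (single == split)
print(same)

# The compiled object keeps the joined text in its .pattern attribute.
compiled = re.compile(split, re.S)
```

Splitting only changes how the source is laid out; it does not change what the pattern matches.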

Also, it does not seem to need to be this complicated...

def getContent(self,pageIndex):
    page = self.getPage(pageIndex)
    pattern = re.compile('data-original="(.+?)"')
    items = re.findall(pattern,page)
    for item in items:
        print item
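
To see the difference on a concrete input, here is a rough sketch that runs both patterns against a made-up HTML fragment (an assumption for illustration; Zhihu's real markup may differ). One likely reason the original prints nothing: `::before` and `::after` are CSS pseudo-elements that browser developer tools display, and they are not part of the HTML text that `urllib2` downloads, so a pattern that requires them cannot match. With no matches, `findall` returns an empty list and the `for` loop body never runs.

```python
import re

# Hypothetical fragment of downloaded HTML, invented for this example.
sample_html = (
    '<div class="zm-editable-content clearfix">'
    'Some answer text'
    '<noscript><img src="http://example.com/a_b.jpg"></noscript>'
    '<img class="origin_image" src="//example.com/placeholder.gif" '
    'data-original="http://example.com/a_r.jpg" style="width:100px">'
    '</div>'
)

# Pattern from the question: it requires the literal text "::before" and
# "::after" (plus the misspelled class name), none of which is in the HTML.
original = re.compile('<div class="zm-editalble.*?>'
                      '::before(.*?)'
                      '<noscript>.*?</noscript>'
                      '<img class ="origin_image.*?'
                      ' src=(.*?) style="width.*?></img>'
                      '::after</div>', re.S)

# Short pattern from this answer: it only looks for the attribute itself.
simple = re.compile('data-original="(.+?)"')

original_hits = original.findall(sample_html)
simple_hits = simple.findall(sample_html)

print(original_hits)  # [] so a loop over it prints nothing
print(simple_hits)    # ['http://example.com/a_r.jpg']
```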

Finally, if you are writing a scraper, I recommend Beautiful Soup...
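
To illustrate the parser-based approach without assuming Beautiful Soup is installed, here is a rough sketch that uses Python 3's standard-library `html.parser` as a stand-in; Beautiful Soup offers a more convenient interface for the same idea. The class name `ImageCollector` and the sample HTML are made up for this example, and the thread's own code targets Python 2.7, where the module is named `HTMLParser` instead.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the data-original attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        if tag != 'img':
            return
        for name, value in attrs:
            if name == 'data-original' and value:
                self.urls.append(value)


# Hypothetical HTML, invented for this example; real pages may differ.
sample_html = (
    '<div class="zm-editable-content clearfix"><p>text</p>'
    '<img class="origin_image" data-original="http://example.com/1_r.jpg" src="x.gif">'
    '<img src="http://example.com/avatar.jpg">'
    '<img class="origin_image" data-original="http://example.com/2_r.jpg" src="y.gif"/>'
    '</div>'
)

collector = ImageCollector()
collector.feed(sample_html)
collector.close()

urls = collector.urls
print(urls)
```

The parser works on tags and attributes, so it does not depend on attribute order or on the exact spacing inside a tag, which is where hand-written patterns tend to break.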

importcjj


Posted on
2015-10-27

import re
import urllib2

response = urllib2.urlopen('http://www.zhihu.com/question/20937691').read()
pattern = re.compile(r'<div class="zm-editable-content clearfix">.+?</div>',re.S)
pattern.findall(response)

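It would be even better to log in first, because without logging in the page content may be incomplete; sites like this usually behave that way.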
