Fetching website data with Python

Here is my code, but my regular expression still doesn't work. Can anyone help?

#coding=utf-8
import urllib2;
import re;
def getProvince(mainUrl):
    req = urllib2.Request(mainUrl);
    resp = urllib2.urlopen(req);
    respHtml = resp.read();
    # print "respHtml",respHtml;
    #<a href="/lelist/listxian.aspx?id=D44C1502B7D5BEA1" class="cunpaddingl4">安徽</a>
    #re.search('<h1\s+?class="h1user">(?P<h1user>.+?)</h1>', respHtml);
    foundA_lable = re.search('<a\s+?class=cunpaddingl4>(?P<cunpaddingl4>.+?)</a>',respHtml);
    print "foundA_lable =",foundA_lable;
    if foundA_lable:
        province = foundA_lable.group("cunpaddingl4");
        print u"cunpaddingl4 =",province;
    else :
        print u"No data matched";
        
print getProvince("http://www.yigecun.com/");


1 answer

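I tweaked your code a little. The main change is switching to re.findall(), which returns a list, and then looping over that list to print each item.
bs4 would also make this easy to do.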
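
To see why the original pattern found nothing, both patterns can be checked offline against the sample tag from the code comment. This is a minimal sketch, written for Python 3 (the code in this thread is Python 2). The second tag in sample_html and its id value are made up for illustration. The original pattern expects class to follow `<a` directly and to be unquoted, while the real tag has an href attribute first and a quoted class value.

```python
import re

# First tag is copied from the comment in the question's code.
# Second tag is hypothetical, added only to show that findall returns several items.
sample_html = (
    '<a href="/lelist/listxian.aspx?id=D44C1502B7D5BEA1" class="cunpaddingl4">安徽</a>'
    '<a href="/lelist/listxian.aspx?id=0000000000000000" class="cunpaddingl4">江苏</a>'
)

# Pattern from the question: class must come right after "<a", with no quotes.
original = re.compile(r'<a\s+?class=cunpaddingl4>(?P<cunpaddingl4>.+?)</a>')

# Pattern from the answer: skips one whitespace-free attribute (the href),
# then matches the quoted class value.
revised = re.compile(r'<a\s+\S+\s+class="cunpaddingl4">(?P<cunpaddingl4>.+?)</a>')

original_match = original.search(sample_html)  # no match, because href comes first
provinces = revised.findall(sample_html)       # one string per matched tag

print(original_match)
print(provinces)
```

Because the pattern has exactly one group, findall returns a plain list of the captured strings rather than match objects, which is why the loop in the full code no longer calls .group().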

#coding=utf-8
import urllib2
import re

def getProvince(mainUrl):
    req = urllib2.Request(mainUrl)
    resp = urllib2.urlopen(req)
    respHtml = resp.read()
    # print "respHtml",respHtml;
    #<a href="/lelist/listxian.aspx?id=D44C1502B7D5BEA1" class="cunpaddingl4">安徽</a>
    #re.search('<h1\s+?class="h1user">(?P<h1user>.+?)</h1>', respHtml);
    foundA_lable = re.findall(r'<a\s+\S+\s+class="cunpaddingl4">(?P<cunpaddingl4>.+?)</a>', respHtml)
    print "foundA_lable =", len(foundA_lable)
    for province in foundA_lable:
        # province = foundA_lable.group("cunpaddingl4")
        print u"cunpaddingl4 =", province
        
getProvince("http://www.yigecun.com/")
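
The regex above depends on the attribute order and on the href containing no spaces. A parser avoids that, which is presumably what the bs4 suggestion is about. Since bs4 is a third-party package, the sketch below swaps in html.parser from the Python 3 standard library as a dependency-free stand-in for the same idea. It is not the answerer's code: the class name ProvinceLinkParser is made up, and the second and third sample tags are hypothetical.

```python
from html.parser import HTMLParser


class ProvinceLinkParser(HTMLParser):
    """Collects the text of every <a> tag whose class list contains target_class."""

    def __init__(self, target_class="cunpaddingl4"):
        super().__init__()
        self.target_class = target_class
        self.provinces = []
        self._inside_target = False

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        self._inside_target = self.target_class in classes
        if self._inside_target:
            self.provinces.append("")  # start a new entry for this link

    def handle_data(self, data):
        if self._inside_target:
            self.provinces[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside_target = False


# First tag is from the comment in the question; the other two are hypothetical.
sample_html = """
<a href="/lelist/listxian.aspx?id=D44C1502B7D5BEA1" class="cunpaddingl4">安徽</a>
<a href="/about.aspx" class="other">关于</a>
<a class="cunpaddingl4 extra" href="/lelist/listxian.aspx?id=0000000000000000">江苏</a>
"""

parser = ProvinceLinkParser()
parser.feed(sample_html)
parser.close()
provinces = parser.provinces
print(provinces)
```

The third tag puts class before href and carries a second class name. The regex version would skip it, while the parser still picks it up.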