How to correctly scrape a web page's title with Python

Posted on
2012-10-02

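I fetch the page content with urllib and then use the re module to match the title with a regular expression, but some titles come out wrong. For example, the code below, which fetches Apple's page, prints App. Testing shows that the match result m is fine; the problem appears to be in strip().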

# -*- coding: utf-8 -*-
import urllib
import re

url='http://apple.com'
html = urllib.urlopen(url).read()
#print html
m = re.search("<title>.*</title>", html)
print m.group() # this prints <title>Apple</title>
print m.group().strip("</title>") # the problem seems to be here

8 Answers

tjureyoung

✓ Accepted answer

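There is a basic mistake here. HTML cannot be parsed with regular expressions, because its grammar is more expressive than what regular expressions can describe; see the explanation linked in the original answer for the details.
For parsing HTML like this, I recommend a third-party library such as mechanize.
My code is as follows: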

import mechanize
import cookielib
if __name__=='__main__':
    br = mechanize.Browser()
    br.set_cookiejar(cookielib.LWPCookieJar()) # Cookie jar
    
    br.set_handle_equiv(True) # Browser Option
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')] 
    br.open("http://apple.com")
    print br.title()

The output is Apple.
For detailed mechanize usage, see the reference linked in the original answer.

To install mechanize, just use easy_install.

cute

Posted on
2012-10-02

The general approach is to parse the page with an HTML parser.

For example, using the lxml package:

from lxml import html
doc = html.parse('http://www.apple.com/')
title = doc.find('.//title').text
print title

Or use BeautifulSoup:

import urllib
from BeautifulSoup import BeautifulSoup
content = urllib.urlopen('http://www.apple.com/').read()
soup = BeautifulSoup(content)
print soup.find('title').string  # .string is the text inside <title>, without the tags
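
If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library alone. This is a minimal Python 3 sketch, not part of the original answer; the names TitleParser and extract_title are made up for illustration, and it works on an HTML string that has already been downloaded and decoded.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._parts = []
        self.title = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives as "title".
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_title:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._parts).strip()


def extract_title(html_text):
    """Return the page title, or None if the document has no <title>."""
    parser = TitleParser()
    parser.feed(html_text)
    return parser.title


sample = "<html><head><title>Apple</title></head></html>"
print(extract_title(sample))  # Apple
```

Unlike a regular expression, this does not care about tag case or about line breaks inside the title.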

pynix

Posted on
2012-10-02

re.findall(r"<title>(.*)</title>","<title>Apple</title>")

Regular expressions support capture groups...

Yukir

Posted on
2012-10-02

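The key is to use () to capture the part you want. Note that .* will not always match: . means any character except a newline, so a title that spans several lines is missed unless the re.DOTALL flag is set.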
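
The newline behavior is easy to check on a small string. The following Python 3 sketch uses a made-up HTML snippet (an assumption, not Apple's real page) whose title is spread over several lines:

```python
import re

# Hypothetical page whose <title> contains line breaks.
html = "<html><head><title>\n  Apple\n</title></head></html>"

# Without re.DOTALL, "." stops at the first newline, so nothing matches.
no_flag = re.search(r"<title>(.*)</title>", html)

# With re.DOTALL, "." also matches newlines; the lazy ".*?" stops at the
# first closing tag, and re.IGNORECASE also accepts <TITLE>.
with_flag = re.search(r"<title>(.*?)</title>", html, re.DOTALL | re.IGNORECASE)
title = with_flag.group(1).strip()

print(no_flag)  # None
print(title)    # Apple
```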

hbprotoss

Posted on
2012-10-02

Updated on
2012-10-02

import re
pattern = re.compile(r"(?<=<title>)[\w\W]*(?=</title>)")  # the pattern must be a quoted string
print pattern.search("<title>Apple</title>").group()  # Apple

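The important parts are the two expressions (?<=...) and (?=...): a lookbehind and a lookahead, which require the tags to be present without including them in the match.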
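
As a rough comparison, here is a Python 3 sketch that puts the lookaround pattern next to the capture-group pattern from the other answers. The sample string is made up, and the lazy `*?` is a small change from the greedy pattern above so that the match stops at the first closing tag. It also shows one limitation worth knowing: the re module only accepts fixed-width lookbehind, so a lookbehind for a `<title>` tag with arbitrary attributes is rejected at compile time.

```python
import re

html = "<head><title>Apple</title></head>"

# Lookaround version: the match itself is only the text between the tags.
lookaround = re.search(r"(?<=<title>)[\w\W]*?(?=</title>)", html)

# Capture-group version: the tags are part of the match, group(1) is the text.
grouped = re.search(r"<title>([\w\W]*?)</title>", html)

print(lookaround.group())  # Apple
print(grouped.group(1))    # Apple

# A variable-width lookbehind (here, a tag with optional attributes)
# cannot be compiled by the re module.
try:
    re.compile(r"(?<=<title[^>]*>)[\w\W]*?(?=</title>)")
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

print(lookbehind_error is not None)  # True
```

The capture-group form has no such restriction, which is one reason it is usually the simpler choice here.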

hoozecn

Posted on
2012-10-02

Updated on
2012-10-02

Here is the help text for strip:

`Help on method_descriptor:

strip(...)
S.strip([chars]) -> string or unicode

Return a copy of the string S with leading and trailing
whitespace removed.
If chars is given and not None, remove characters in chars instead.
If chars is unicode, S will be converted to unicode before stripping`

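strip() treats its argument as a set of characters, not as a substring. The characters in "</title>" include l and e, so the trailing "le" of "Apple" is stripped along with the tag.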
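
A small Python 3 sketch reproduces this and shows one way to remove the tags as whole substrings instead. The helper name strip_tag is made up for illustration; it is not a standard-library function.

```python
tag = "<title>Apple</title>"

# strip() removes any leading/trailing characters found in the set
# {'<', '/', 't', 'i', 'l', 'e', '>'}, so the final "le" of "Apple" goes too.
stripped = tag.strip("</title>")
print(stripped)  # App


def strip_tag(text, open_tag="<title>", close_tag="</title>"):
    """Remove open_tag and close_tag only when they appear as whole substrings."""
    if text.startswith(open_tag) and text.endswith(close_tag):
        return text[len(open_tag):len(text) - len(close_tag)]
    return text


print(strip_tag(tag))  # Apple
```

On Python 3.9 and later, str.removeprefix() and str.removesuffix() do the same substring-based removal.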

greatghoul

Posted on
2012-10-18

If you do want to parse with a regular expression, you can do it like this:

import re
import urllib

html = urllib.urlopen('http://apple.com').read()
m = re.search(r'<title>(.*)</title>', html, flags=re.I)  # re.I also matches <TITLE>
print 'Title: ', m.group(1) if m else ''  # empty string when no title is found

Or you can use pyquery:

# -*- coding: utf-8 -*-
from pyquery import PyQuery as pq

d = pq(url='http://apple.com')
print 'Title: ', d('title').text()
