Using Python's re.findall to find all the URLs in a web page

Posted on
2016-01-17

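I am working through the exercises in the Python Practice Book and am trying to solve the following problem:
Problem 8: Write a program links.py that takes URL of a webpage as argument and prints all the URLs linked from that webpage.
The solution is required to use Python's re module.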

The problem: with the regular expression (src|href)\=\".*?\", re.findall does not return a list of URLs. Instead it returns ['src', 'src', 'href', 'href', 'href', 'href', 'href', 'href', 'href', 'href', 'href', 'src', 'src', 'src', 'href', 'href'...]
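
A minimal reproduction of this behavior might look like the following. The HTML snippet is made up for illustration; the pattern is the one quoted above.

```python
import re

# Hypothetical HTML snippet, used only for illustration.
html = '<a href="http://example.com/a">A</a> <img src="/logo.png">'

# The pattern has exactly one capturing group, (src|href), so findall
# returns only the text matched by that group, not the whole match.
attrs = re.findall(r'(src|href)\=\".*?\"', html)
print(attrs)  # ['href', 'src']
```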

python regular-expressions


2 answers

Posted on
2016-01-17

✓ Accepted

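When the pattern contains capturing groups, findall returns the parts matched by the (...) groups rather than the whole match. I suggest changing the regex to (src|href)\=\"(.*?)\" so that the URL is inside parentheses as well. findall will then return every parenthesized part of each match, as a list of (attribute, URL) tuples. If you only want the URLs, make the first group non-capturing: (?:src|href)\=\"(.*?)\"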
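
As a rough sketch of the difference between capturing and non-capturing groups in findall, assuming a made-up HTML snippet and illustrative variable names:

```python
import re

# Hypothetical HTML snippet, used only for illustration.
html = '<a href="http://example.com/a">A</a> <img src="/logo.png">'

# Two capturing groups: findall returns a list of (attribute, url) tuples.
pairs = re.findall(r'(src|href)="(.*?)"', html)
urls_from_pairs = [url for _attr, url in pairs]

# A non-capturing group (?:...) for the attribute name leaves only one
# capturing group, so findall returns just the URLs as a list of strings.
urls = re.findall(r'(?:src|href)="(.*?)"', html)

print(pairs)  # [('href', 'http://example.com/a'), ('src', '/logo.png')]
print(urls)   # ['http://example.com/a', '/logo.png']
```

Note that this pattern only handles double-quoted attribute values with no spaces around the equals sign; real pages may need a looser pattern.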

Posted on
2016-01-18

Extracting web page content with regular expressions is tedious and error-prone. I recommend using BeautifulSoup or XPath instead.
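
To illustrate the parser-based approach without a third-party dependency, here is a minimal sketch that uses the standard library's html.parser module as a stand-in for BeautifulSoup or XPath. The class name LinkCollector and the sample HTML are assumptions made for this example, and this does not satisfy the exercise's requirement to use re.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the values of href and src attributes from start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes written without a value.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


# Hypothetical HTML snippet; the single-quoted src value would be missed
# by a regex that only looks for double quotes.
html = """<a href="http://example.com/a">A</a> <img src='/logo.png'>"""

collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['http://example.com/a', '/logo.png']
```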
