BeautifulSoup cannot find the contents of specific tags?

Posted on 2016-07-12

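Goal: use the page https://movie.douban.com/tv/#!type=tv&tag=%E8%8B%B1%E5%89%A7&sort=rank&page_limit=20&page_start=0 to scrape every British TV series rated above x.
Problem: after fetching the page content, find_all cannot find the contents of the <div class="wp"> tag,
and <a class="item"> cannot be found either, while other content is found normally. (I'm a beginner...)
Partial page source: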

<script id="subject-tmpl" type="text/tmpl">
        <% if (playable) { %>
        <a class="item" target="_blank" href="<%= url%>?tag=<%= tag%>&from=gaia_video">
        <% } else {%>
        <a class="item" target="_blank" href="<%= url%>?tag=<%= tag%>&from=gaia">
        <% } %>
            <div class="cover-wp" data-isnew="<%= is_new%>" data-id="<%= id%>">
                <img src="<%= cover%>" alt="<%= title%>" data-x="<%= cover_x%>" data-y="<%= cover_y%>" onload="loadImg(this)" />
            </div>
            <p>
                <% if (is_new) { %>
                    <span class="green">
                        <img src="https://img3.doubanio.com/f/movie/caa8f80abecee1fc6f9d31924cef8dd9a24c7227/pics/movie/ic_new.png" width="16" class="new" />
                    </span>
                <% } %>

                <%= title%>

                <% if (rate !== '0.0') { %>
                    <strong><%= rate%></strong>
                <% } else {%>
                    <span>暂无评分</span>
                <% } %>
            </p>
            <% if (is_beetle_subject) { %>
                <div style="width:140px;float:left;margin-top:-22px;cursor:default">
                    <img class="biz-beetle-rec" style="width:88px;height:12px;opacity:1" src="https://img3.doubanio.com/img/biz/beetle/home/biz-beetle-icon@2x.png">
                </div>
            <% } %>
        </a>
    </script>

Code (the part with the problem):

#-*-coding:utf-8-*-
from bs4 import BeautifulSoup
import urllib2


def get_text(x):
    soup = BeautifulSoup(x, "html.parser", from_encoding="utf-8")
    # print soup
    texts_1 = soup.find_all("a", class_="item")
    print type(texts_1)
    print len(texts_1)
    print texts_1
    texts_2 = soup.find_all("script", id="subject-tmpl")
    print type(texts_2)
    print len(texts_2)
    print texts_2
    texts_3 = soup.find_all("div", class_="cover-wp")
    print type(texts_3)
    print len(texts_3)
    print texts_3
urls = {"电影": "https://movie.douban.com/explore#!type=movie&tag=%E7%83%AD%E9%97%A8&sort=rank&page_limit=20&page_start=0",
        "英剧": "https://movie.douban.com/tv/#!type=tv&tag=%E8%8B%B1%E5%89%A7&sort=rank&page_limit=20&page_start=0",
        "美剧": "https://movie.douban.com/tv/#!type=tv&tag=%E7%BE%8E%E5%89%A7&sort=rank&page_limit=20&page_start=0"}

content = urllib2.urlopen(urls["英剧"]).read()
# print content
get_text(content)

Output:

<class 'bs4.element.ResultSet'>
0
[]
<class 'bs4.element.ResultSet'>
1
[<script id="subject-tmpl" type="text/tmpl">\n        <% if (playable) { %>\n        <a class="item" target="_blank" href="<%= url%>?tag=<%= tag%>&from=gaia_video">\n        <% } else {%>\n        <a class="item" target="_blank" href="<%= url%>?tag=<%= tag%>&from=gaia">\n        <% } %>\n            <div class="cover-wp" data-isnew="<%= is_new%>" data-id="<%= id%>">\n                <img src="<%= cover%>" alt="<%= title%>" data-x="<%= cover_x%>" data-y="<%= cover_y%>" onload="loadImg(this)" />\n            </div>\n            <p>\n                <% if (is_new) { %>\n                    <span class="green">\n                        <img src="https://img3.doubanio.com/f/movie/caa8f80abecee1fc6f9d31924cef8dd9a24c7227/pics/movie/ic_new.png" width="16" class="new" />\n                    </span>\n                <% } %>\n\n                <%= title%>\n\n                <% if (rate !== '0.0') { %>\n                    <strong><%= rate%></strong>\n                <% } else {%>\n                    <span>\u6682\u65e0\u8bc4\u5206</span>\n                <% } %>\n            </p>\n            <% if (is_beetle_subject) { %>\n                <div style="width:140px;float:left;margin-top:-22px;cursor:default">\n                    <img class="biz-beetle-rec" style="width:88px;height:12px;opacity:1" src="https://img3.doubanio.com/img/biz/beetle/home/biz-beetle-icon@2x.png">\n                </div>\n            <% } %>\n        </a>\n    </script>]
<class 'bs4.element.ResultSet'>
0
[]
[Finished in 0.9s]

python sublime-text beautifulsoup

1 Answer

ivechan

Posted on 2016-07-12

✓ Accepted

    <script type="text/tmpl" id="subject-info-tmpl">
        <div class="wp">

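As the source shows, this div sits inside a <script type="text/tmpl"> block. It only becomes part of the DOM after the page's JavaScript runs and renders the template.
You are only downloading the HTML source, though; none of the JavaScript in it is executed.
The HTML parser behind BeautifulSoup therefore treats everything inside the script tag as plain text rather than as DOM nodes, so div class="wp" is not part of the parse tree and find_all returns nothing.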
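
The behaviour described above can be reproduced offline with a small sketch. It assumes Python 3 and the bs4 package (the question itself uses Python 2), and the html string is a hypothetical, shortened stand-in for the downloaded page, not the real Douban source.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the downloaded page source.
html = """
<html><body>
<div class="nav">navigation</div>
<script id="subject-tmpl" type="text/tmpl">
    <a class="item" target="_blank" href="<%= url%>">
        <div class="cover-wp" data-id="<%= id%>"></div>
    </a>
</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Markup inside <script> is kept as a single text node,
# so no <a> or <div> tag objects are created for it.
links = soup.find_all("a", class_="item")
covers = soup.find_all("div", class_="cover-wp")
print(len(links), len(covers))

# The <script> element itself is a real tag, and its body is available as text.
script = soup.find("script", id="subject-tmpl")
template_text = script.string
print('class="item"' in template_text)

# An ordinary <div> outside the script block is found as usual.
print(len(soup.find_all("div", class_="nav")))
```

This mirrors the output in the question: the two searches for tags inside the template come back empty, while the script tag and tags outside it are found.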

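As for a fix, you are better off writing a regular expression against the raw source; bs4 alone probably won't get you there.
For how to write the regex, see the book 《正则表达式必知必会》 (the Chinese edition of "Sams Teach Yourself Regular Expressions in 10 Minutes"); about 10 minutes with it should be enough.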
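
A minimal sketch of the regex approach, using only the standard re module under Python 3. The html string is again a hypothetical, shortened stand-in for the page, and the patterns are illustrative assumptions, not a tested scraper for the live site. Note that this only recovers the template text: placeholders such as <%= rate%> are filled in by JavaScript at runtime, so the actual ratings are not present in the downloaded HTML.

```python
import re

# Hypothetical, shortened stand-in for the downloaded page source.
html = '''<html><body>
<script id="subject-tmpl" type="text/tmpl">
    <a class="item" target="_blank" href="<%= url%>?tag=<%= tag%>&from=gaia">
        <div class="cover-wp" data-id="<%= id%>"></div>
        <p><%= title%> <strong><%= rate%></strong></p>
    </a>
</script>
</body></html>'''

# Capture everything between the opening and closing tags of the template script.
# re.DOTALL lets "." match newlines, since the template spans several lines.
script_pattern = re.compile(
    r'<script[^>]*\bid="subject-tmpl"[^>]*>(.*?)</script>', re.DOTALL)
match = script_pattern.search(html)
template_text = match.group(1) if match else ""

# Class names of the <a> and <div> tags that appear inside the template text.
class_names = re.findall(r'<(?:a|div)\s+class="([^"]+)"', template_text)

# Names of the template placeholders written as <%= name %>.
placeholders = re.findall(r'<%=\s*(\w+)\s*%>', template_text)

print(class_names)
print(placeholders)
```

The same patterns can be run against the string returned by the download step in the question, since regular expressions work on the raw text and do not care whether the markup is inside a script block.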
