I opened Weibo this morning and, wow, the first post pushed to my feed was a juicy piece of gossip.
So I dug up the source right away. The gist: Leehom's ex-wife couldn't take it any longer and publicly called him out... The blog post is as follows:
At first I still had some doubts: just two days earlier, Leehom had acknowledged the divorce in a post of his own:
What that post conveyed was an amicable split on good terms, in a peaceful tone. In hindsight the wording seems a little off, but that's not for me to worry about.
I don't chase stars and am basically indifferent to celebrities big or small, but I've known of Leehom ever since the Wahaha mineral-water bottles many years ago...
I don't remember exactly when, but Wahaha eventually replaced him as its endorser, and there was plenty of criticism online at the time. Now it seems...
So, with a melon-eater's curiosity, I read Lee Jinglei's Weibo post. Wow, the world really does owe Leehom an Oscar...
And how could I pass up the comment section? So I decided to use Python to crawl the comment-section data. The main code is as follows:
import re
import requests

# Crawl one page of comments
def get_one_page(url):
    headers = {
        'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3880.4 Safari/537.36',
        'Host': 'weibo.cn',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Cookie': 'your own Cookie',
        'DNT': '1',
        'Connection': 'keep-alive'
    }
    # Fetch the page HTML
    response = requests.get(url, headers=headers, verify=False)
    # Crawl succeeded
    if response.status_code == 200:
        # Return the HTML document, to be passed to the parser
        return response.text
    return None

# Parse a page and save the comment text
def save_one_page(html):
    comments = re.findall('<span class="ctt">(.*?)</span>', html)
    for comment in comments[1:]:
        # Strip any leftover HTML tags
        result = re.sub('<.*?>', '', comment)
        # Skip replies (they start with 回复@)
        if '回复@' not in result:
            with open('comments.txt', 'a+', encoding='utf-8') as fp:
                fp.write(result + '\n')
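The two functions above only handle a single page. A minimal driver loop might look like the sketch below; the URL pattern and page range are assumptions (weibo.cn paginates comments with a page query parameter), so substitute the real post ID and range for your target post:

# Hypothetical driver: COMMENT_ID is a placeholder for the real post ID
base_url = 'https://weibo.cn/comment/COMMENT_ID?page={}'
for page in range(1, 51):
    html = get_one_page(base_url.format(page))
    if html:
        save_one_page(html)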
I won't go over the crawling analysis here; if anything is unclear, take a look at my earlier write-up on crawling the Weibo comment section. Now that we have the data, let's use Python to see what the TOP10 words are. The main code is as follows:
import jieba
from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.globals import ThemeType

# Load the stop-word list
stop_words = []
with open('stop_words.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    for line in lines:
        stop_words.append(line.strip())
content = open('comments.txt', 'r', encoding='utf-8').read()
# Segment the text with jieba
word_list = jieba.cut(content)
words = []
for word in word_list:
    if word not in stop_words:
        words.append(word)
# Count word frequencies
wordcount = {}
for word in words:
    if word != ' ':
        wordcount[word] = wordcount.get(word, 0) + 1
# Keep the 10 most frequent words
wordtop = sorted(wordcount.items(), key=lambda x: x[1], reverse=True)[:10]
wx = []
wy = []
for w in wordtop:
    wx.append(w[0])
    wy.append(w[1])

# Horizontal bar chart of the top words
(
    Bar(init_opts=opts.InitOpts(theme=ThemeType.MACARONS))
    .add_xaxis(wx)
    .add_yaxis('数量', wy)
    .reversal_axis()
    .set_global_opts(
        title_opts=opts.TitleOpts(title='评论词 TOP10'),
        yaxis_opts=opts.AxisOpts(name='词语'),
        xaxis_opts=opts.AxisOpts(name='数量'),
    )
    .set_series_opts(label_opts=opts.LabelOpts(position='right'))
).render_notebook()
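As an aside, the manual counting loop can be written more compactly with the standard library's collections.Counter; this sketch is equivalent to the dict-based count above:

from collections import Counter

# Same result as the wordcount dict + sorted(...) above
wordtop = Counter(w for w in words if w != ' ').most_common(10)
wx = [word for word, _ in wordtop]
wy = [count for _, count in wordtop]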
Take a look at the effect:
I'll refrain from commenting on that. Next, let's generate a word cloud to get a feel for the comment section. The main code is as follows:
import jieba
import numpy as np
from PIL import Image
from wordcloud import WordCloud

def jieba_():
    # Load the stop-word list
    stop_words = []
    with open('stop_words.txt', 'r', encoding='utf-8') as f:
        lines = f.readlines()
        for line in lines:
            stop_words.append(line.strip())
    content = open('comments.txt', 'r', encoding='utf-8').read()
    # Segment the text with jieba
    word_list = jieba.cut(content)
    words = []
    for word in word_list:
        if word not in stop_words:
            words.append(word)
    global word_cloud
    # Join the words with commas
    word_cloud = ','.join(words)

def cloud():
    # Open the word-cloud background image
    cloud_mask = np.array(Image.open('bg.png'))
    # Configure the word cloud
    wc = WordCloud(
        # White background
        background_color='white',
        # Mask image that shapes the cloud
        mask=cloud_mask,
        # Maximum number of words shown
        max_words=200,
        # A Chinese font so the words render correctly
        font_path='./fonts/simhei.ttf',
        # Maximum font size
        max_font_size=100
    )
    global word_cloud
    # Generate the word cloud
    x = wc.generate(word_cloud)
    # Render it as an image
    image = x.to_image()
    # Show the image
    image.show()
    # Save the image
    wc.to_file('melon.png')

jieba_()
cloud()
Take a look at the effect:
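If you want the cloud to pick up the colors of the background image instead of the theme defaults, the wordcloud package ships an ImageColorGenerator. A minimal sketch; these lines would go at the end of cloud(), where wc and cloud_mask are in scope:

from wordcloud import ImageColorGenerator

# Recolor the generated cloud with the colors of the mask image
image_colors = ImageColorGenerator(cloud_mask)
wc.recolor(color_func=image_colors)
wc.to_file('melon_colored.png')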
The source code has been tidied up; if you need it, reply wlh in the backend of the public account Python 2nd to get it.