Reviews of the 2022 Spring Festival films are split: "Water Gate Bridge" falls short of expectations and "The Four Seas" collapses! Find out with Python

Although movie theaters in some areas were forced to close because of the epidemic, enthusiasm for moviegoing during the 2022 Spring Festival remained high: the first day of the Lunar New Year recorded the second-highest single-day total box office in Chinese film history.

On opening day, the first day of the Lunar New Year, the box office data showed "Water Gate Bridge of Changjin Lake" in the lead, with "The Four Seas", "This Killer Is Not So Calm", and "Miracle · Stupid Child" chasing one another closely. The animated film "Boonie Bears" also performed strongly, while "Sniper" regrettably lagged behind.

So far, the Douban ratings of the live-action films have diverged.

In this article, we use Python to crawl the Douban short reviews of these films. The crawling process is not analyzed in detail here; if anything is unclear, you can refer to the linked article "Douban movie". The main implementation code is as follows:

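import random
import time

import pandas as pd
import requests
from lxml import etree


def spider():
    # Douban login endpoint
    url = 'https://accounts.douban.com/j/mobile/login/basic'
    headers = {"User-Agent": 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)'}
    # Short-review page URL of the target film. To page through the results, `start` takes a
    # formatted number: each page holds 20 reviews, so the offset grows by 20 per page
    url_comment = 'https://movie.douban.com/subject/35215390/comments?start=%d&limit=20&sort=new_score&status=P'
    # Login form; replace the placeholders with your own Douban account
    data = {
        'ck': '',
        'name': 'your_username',
        'password': 'your_password',
        'remember': 'false',
        'ticket': ''
    }
    # Log in once so that later requests reuse the session cookies
    session = requests.session()
    session.post(url=url, headers=headers, data=data)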
    # Four lists for user names, star ratings, dates, and review text
    users = []
    stars = []
    times = []
    content = []
    # Fetch 500 reviews at 20 per page, which is also the limit Douban allows
    for i in range(0, 500, 20):
        # Fetch the HTML
        resp = session.get(url_comment % i, headers=headers)
        # A status code of 200 means success
        print('offset', i, 'status code:', resp.status_code)
        # Pause for 0-1 seconds to avoid having the IP banned
        time.sleep(random.random())
        # Parse the HTML
        selector = etree.HTML(resp.text)
        # Use XPath to select all reviews on the page
        comments = selector.xpath('//div[@class="comment"]')
        # Loop over the reviews and extract the details
        for comment in comments:
            # User name
            user = comment.xpath('.//h3/span[2]/a/text()')[0]
            # Star rating (the digit that follows "allstar" in the class name)
            star = comment.xpath('.//h3/span[2]/span[2]/@class')[0][7:8]
            # Review time
            date_time = comment.xpath('.//h3/span[2]/span[3]/@title')
            # The time is sometimes missing, so check before indexing
            if len(date_time) != 0:
                date_time = date_time[0]
                # Keep only the date part (YYYY-MM-DD)
                date_time = date_time[:10]
            else:
                date_time = None
            # Review text
            comment_text = comment.xpath('.//p/span/text()')[0].strip()
            # Append everything to the lists
            users.append(user)
            stars.append(star)
            times.append(date_time)
            content.append(comment_text)
    # Wrap the lists in a dictionary
    comment_dic = {'user': users, 'star': stars, 'time': times, 'comments': content}
    # Convert to a DataFrame
    comment_df = pd.DataFrame(comment_dic)
    # Save the data; the word-cloud code reads this file
    comment_df.to_csv('comment.csv')
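
The slice `[7:8]` in the scraper relies on the rating span's class looking like `allstar40 rating`. If a reviewer left no rating and that span is missing, the slice would return whatever character sits at that position in some other class string. Below is a minimal sketch of a more defensive parser, assuming the `allstarNN` class format implied by the code above; the helper name `parse_star` is hypothetical and not part of the original source.

```python
import re

# Assumed format of the rating class, e.g. "allstar40 rating" for 4 stars.
_STAR_RE = re.compile(r"allstar(\d)0")


def parse_star(class_attr):
    """Return the star count as an int, or None if the class holds no rating."""
    match = _STAR_RE.search(class_attr or "")
    return int(match.group(1)) if match else None


print(parse_star("allstar40 rating"))  # 4
print(parse_star("comment-time"))      # None
```

Returning `None` for unrated reviews keeps the `star` column clean, so pandas can treat those rows as missing values instead of stray characters.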

With the review data in hand, a word cloud gives an intuitive picture of what viewers are saying. The main code is as follows:

import jieba
import pandas as pd
import stylecloud
from IPython.display import Image

df = pd.read_csv("comment.csv", index_col=0)
cts_list = df['comments'].values.tolist()
# Join all reviews into one string, dropping newlines and spaces
cts_str = "".join([str(i).replace('\n', '').replace(' ', '') for i in cts_list])
# Load the stop words, one per line
with open('stop_words.txt', 'r', encoding='utf-8') as f:
    stop_words = {line.strip() for line in f}
# Segment the text with jieba and drop the stop words
words = [word for word in jieba.cut(cts_str) if word not in stop_words]
cts_str = '，'.join(words)
print(cts_str)
# Generate the word cloud image
stylecloud.gen_stylecloud(text=cts_str, max_words=300,
                          collocations=False,
                          font_path="SIMLI.TTF",
                          icon_name="fas fa-arrow-circle-right",
                          size=800,
                          output_name="comment.png")
# Display the image (in a Jupyter notebook)
Image(filename="comment.png")
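
A word cloud shows relative weight but not exact counts. As a complement, here is a minimal sketch that lists the most frequent words after stop-word filtering. It assumes the text has already been segmented (for example, the `words` list built above); `top_words` is a hypothetical helper name and the sample data is made up.

```python
from collections import Counter


def top_words(words, stop_words, n=10):
    """Return the n most common words, skipping stop words and blank tokens."""
    stop = set(stop_words)  # set membership checks are fast
    counts = Counter(w for w in words if w.strip() and w not in stop)
    return counts.most_common(n)


# Tiny made-up sample standing in for jieba output.
sample = ["电影", "的", "好看", "电影", "，", "好看", "电影"]
print(top_words(sample, stop_words=["的", "，"], n=2))
# [('电影', 3), ('好看', 2)]
```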

First, let's look at "The Four Seas". Why has its word of mouth failed to spread across the four seas? Let's see what the audience has to say:

Next, "This Killer Is Not So Calm". For a comedy, its Douban rating is not bad. Let's see what the audience has to say:

Next, "Water Gate Bridge of Changjin Lake", whose current rating is lower than that of the first film. Let's see what the audience has to say:

Next, "Miracle · Stupid Child", whose rating and box office are both fairly solid. Let's see what the audience has to say:

Finally, "Sniper", directed by Zhang Yimou (nicknamed "Guoshi") and his daughter. Its box office is weak, yet its rating currently ranks first. Let's see what the audience has to say:

To get the source code, reply m2022 in the backend of the WeChat official account Python小二.


Python小二
