Although some areas were forced to close movie theaters because of the epidemic, the first day of the Lunar New Year recorded the second-highest single-day box office total in Chinese film history, so enthusiasm for watching movies during the 2022 Spring Festival is clearly still very high.
On that opening day, "Shuimen Bridge of Changjin Lake" led the box office; "The Four Seas", "This Killer Is Not So Calm", and "Miracle · Stupid Child" were chasing one another closely; the animated film "Bears" performed strongly; and "Sniper" was a disappointment.
At present, several of the live-action films have received their ratings on Douban.
In this article, we use Python to crawl the Douban reviews of these films. The analysis behind the crawling process is not described here; if anything is unclear, you can refer to: Douban movie. The main implementation code is as follows:
import random
import time

import pandas as pd
import requests
from lxml import etree


def spider():
    # Douban login endpoint
    url = 'https://accounts.douban.com/j/mobile/login/basic'
    headers = {"User-Agent": 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)'}
    # URL of 龙岭迷窟. To page dynamically, start takes a formatted number;
    # each short-review page holds 20 entries, so start grows by 20 per page
    url_comment = 'https://movie.douban.com/subject/35215390/comments?start=%d&limit=20&sort=new_score&status=P'
    data = {
        'ck': '',
        'name': 'your_username',
        'password': 'your_password',
        'remember': 'false',
        'ticket': ''
    }
    session = requests.session()
    session.post(url=url, headers=headers, data=data)
    # Initialize 4 lists to store user names, star ratings, times, and review text
    users = []
    stars = []
    times = []
    content = []
    # Fetch 500 reviews, 20 per page; this is also the upper limit Douban allows
    for i in range(0, 500, 20):
        # Fetch the HTML
        response = session.get(url_comment % i, headers=headers)
        # Status code 200 means success
        print('offset', i, 'status code:', response.status_code)
        # Pause 0-1 seconds to avoid an IP ban
        time.sleep(random.random())
        # Parse the HTML
        selector = etree.HTML(response.text)
        # Use xpath to get all reviews on the page
        comments = selector.xpath('//div[@class="comment"]')
        # Loop over the reviews and extract the details
        for comment in comments:
            # User name
            user = comment.xpath('.//h3/span[2]/a/text()')[0]
            # Star rating
            star = comment.xpath('.//h3/span[2]/span[2]/@class')[0][7:8]
            # Time
            date_time = comment.xpath('.//h3/span[2]/span[3]/@title')
            # Some reviews have no time, so check first
            if len(date_time) != 0:
                date_time = date_time[0]
                date_time = date_time[:10]
            else:
                date_time = None
            # Review text
            comment_text = comment.xpath('.//p/span/text()')[0].strip()
            # Append everything to the lists
            users.append(user)
            stars.append(star)
            times.append(date_time)
            content.append(comment_text)
    # Wrap the lists in a dictionary
    comment_dic = {'user': users, 'star': stars, 'time': times, 'comments': content}
    # Convert to a DataFrame
    comment_df = pd.DataFrame(comment_dic)
    # Save the data (the word-cloud step reads comment.csv)
    comment_df.to_csv('comment.csv')
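The crawler reads the star rating by slicing one character out of the class attribute, which gives a meaningless value when a reviewer left no rating. A minimal sketch of a more defensive parser follows. It assumes rated reviews carry a class such as "allstar40 rating" and unrated ones some other class; that markup is an assumption about the page, and parse_star and page_offsets are hypothetical helper names, not part of the original code.

```python
import re

# Assumption: a rated review has a class like "allstar40 rating",
# where the digit after "allstar" is the number of stars (1-5).
_STAR_RE = re.compile(r"allstar(\d)0")


def parse_star(class_attr):
    """Return the star rating encoded in a class attribute, or None if absent."""
    match = _STAR_RE.search(class_attr or "")
    return int(match.group(1)) if match else None


def page_offsets(total=500, page_size=20):
    """Return the values used for the start= query parameter, one per page."""
    return list(range(0, total, page_size))


print(parse_star("allstar40 rating"))
print(parse_star("comment-time"))
print(len(page_offsets()))
```

Returning None for unrated reviews keeps the star column clean when the data is later loaded into a DataFrame.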
With the review data in hand, we can get an intuitive feel for it through a word cloud. The main code is as follows:
import jieba
import pandas as pd
import stylecloud
from IPython.display import Image

df = pd.read_csv("comment.csv", index_col=0)
cts_list = df['comments'].values.tolist()
# Join all reviews into one string, dropping newlines and spaces
cts_str = "".join([str(i).replace('\n', '').replace(' ', '') for i in cts_list])
# Load the stop-word list, one word per line
stop_words = []
with open('stop_words.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    for line in lines:
        stop_words.append(line.strip())
# Segment the text with jieba
word_list = jieba.cut(cts_str)
words = []
for word in word_list:
    if word not in stop_words:
        words.append(word)
cts_str = ','.join(words)
print(cts_str)
stylecloud.gen_stylecloud(text=cts_str, max_words=300,
                          collocations=False,
                          font_path="SIMLI.TTF",
                          icon_name="fas fa-arrow-circle-right",
                          size=800,
                          output_name="comment.png")
# Display the generated image (works in a Jupyter notebook)
Image(filename="comment.png")
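Before reading a word cloud, it can help to check which words will dominate it, since word size follows frequency. The sketch below counts the tokens left after stop-word filtering. The sample tokens are hypothetical stand-ins for the output of jieba.cut, and top_words is an illustrative helper, not part of the original code.

```python
from collections import Counter


def top_words(tokens, stop_words, n=10):
    """Count tokens that are not stop words and return the n most common."""
    stop = set(stop_words)  # set lookup is faster than scanning a list
    counts = Counter(t for t in tokens if t.strip() and t not in stop)
    return counts.most_common(n)


# Hypothetical pre-segmented tokens, standing in for the output of jieba.cut
sample_tokens = ["剧情", "的", "演员", "剧情", "，", "感动", "剧情", "演员"]
sample_stop_words = ["的", "，"]
print(top_words(sample_tokens, sample_stop_words, n=2))
```

If punctuation or filler words show up near the top of this list, they probably belong in stop_words.txt.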
First, let's look at "The Four Seas". Why hasn't its word of mouth spread across the four seas? Here is what the audience has to say:
Next is "This Killer Is Not So Calm". For a comedy, its Douban rating is not bad. Here is what the audience has to say:
Next is "Shuimen Bridge of Changjin Lake". Its current rating is lower than that of the first film. Here is what the audience has to say:
Next is "Miracle · Stupid Child", whose rating and box office are both fairly satisfactory. Here is what the audience has to say:
Finally, "Sniper", directed by "Guoshi" Zhang Yimou and his daughter. Its box office is weak, but its rating currently ranks first. Here is what the audience has to say:
To get the source code, reply m2022 in the backend of the public account "Python second".