The National Day holiday is half over. Have you been out travelling, or have you stayed at home? Do you know where your friends went? In this article, we briefly analyze ticket data from piao.qunar.com.

Data scraping

First, we open the website piao.qunar.com and enter a provincial-level administrative division in the search box. Taking Zhejiang as an example, as shown in the figure:

Scroll to the bottom of the page, press F12 to open the developer tools, and click through to the next page to see the request URL, as shown in the figure:

Looking at the URL, we can see that keyword and page are dynamic: keyword is the search term and page is the page number. To crawl page by page, we can assign both dynamically. Switching the developer tools to the Response tab shows that the returned data is in JSON format, as shown in the figure:
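As a minimal sketch of the dynamic assignment described above, the URL can be assembled from its query parameters with the standard library. The helper name build_list_url and the sample response body are hypothetical; only the base URL, the parameter names, and the data/sightList/sightName keys come from the crawler in this article.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://piao.qunar.com/ticket/list.json"


def build_list_url(keyword, page):
    """Build the list URL for one keyword and page number (hypothetical helper)."""
    query = urlencode({
        "from": "mpl_search_suggest",
        "keyword": keyword,  # urlencode percent-encodes the text as UTF-8
        "page": page,
    })
    return f"{BASE_URL}?{query}"


# Hypothetical, trimmed-down response body with the shape the crawler reads
sample_body = '{"data": {"sightList": [{"sightName": "西湖", "star": "5A"}]}}'

url = build_list_url("浙江", 2)
sights = json.loads(sample_body)["data"]["sightList"]
print(url)
print([s["sightName"] for s in sights])
```

Encoding the keyword explicitly avoids relying on the HTTP client to escape non-ASCII characters in the URL.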

Here we use the 34 provincial-level administrative divisions as keywords and crawl the result pages for each one. The main crawling code is as follows:

import json
import random
import time

import pandas as pd
import requests

# Assumed request headers; the source does not show how headers is defined
headers = {'User-Agent': 'Mozilla/5.0'}


def get_city_data(cities, pages):
    cityNames = []
    sightNames = []
    stars = []
    scores = []
    qunarPrices = []
    saleCounts = []
    districtses = []
    points = []
    intros = []
    frees = []
    addresses = []
    for city in cities:
        for page in range(1, pages + 1):
            try:
                print(f'Crawling {city}, page {page}...')
                # Random pause between requests to avoid hammering the server
                time.sleep(random.uniform(1.5, 2.5))
                url = f'https://piao.qunar.com/ticket/list.json?from=mpl_search_suggest&keyword={city}&page={page}'
                print('url:', url)
                result = requests.get(url, headers=headers, timeout=(2.5, 5.5))
                status = result.status_code
                if status == 200:
                    # Data for one page
                    response_info = json.loads(result.text)
                    print('data:', response_info)
                    sight_list = response_info['data']['sightList']
                    for sight in sight_list:
                        sightName = sight['sightName']  # name
                        star = sight.get('star', None)  # star rating
                        score = sight.get('score', 0)  # score
                        qunarPrice = sight.get('qunarPrice', 0)  # price
                        saleCount = sight.get('saleCount', 0)  # sales
                        districts = sight.get('districts', None)  # administrative district
                        point = sight.get('point', None)  # coordinates
                        intro = sight.get('intro', None)  # introduction
                        free = sight.get('free', True)  # whether admission is free
                        address = sight.get('address', None)  # address
                        cityNames.append(city)
                        sightNames.append(sightName)
                        stars.append(star)
                        scores.append(score)
                        qunarPrices.append(qunarPrice)
                        saleCounts.append(saleCount)
                        districtses.append(districts)
                        points.append(point)
                        intros.append(intro)
                        frees.append(free)
                        addresses.append(address)
            except Exception as e:
                # Skip pages that fail to download or parse, but say so
                print(f'Skipping {city} page {page}: {e}')
                continue
    city_dic = {'cityName': cityNames, 'sightName': sightNames, 'star': stars,
                'score': scores, 'qunarPrice': qunarPrices, 'saleCount': saleCounts,
                'districts': districtses, 'point': points, 'intro': intros,
                'free': frees, 'address': addresses}
    city_df = pd.DataFrame(city_dic)
    city_df.to_csv('cities.csv', index=False)
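The field extraction can also be tried without touching the network. The sketch below is one possible variant, not the article's method: it collects one dict per scenic spot instead of eleven parallel lists and writes CSV with the standard library. The helper sights_to_rows, the FIELD_DEFAULTS table, and the sample page are hypothetical; the field names and defaults mirror the crawler.

```python
import csv
import io

# Field names and defaults mirror the crawler
FIELD_DEFAULTS = {
    "star": None, "score": 0, "qunarPrice": 0, "saleCount": 0,
    "districts": None, "point": None, "intro": None, "free": True, "address": None,
}


def sights_to_rows(city, response_info):
    """Turn one decoded page into a list of flat row dicts (hypothetical helper)."""
    rows = []
    for sight in response_info["data"]["sightList"]:
        row = {"cityName": city, "sightName": sight["sightName"]}
        for field, default in FIELD_DEFAULTS.items():
            row[field] = sight.get(field, default)
        rows.append(row)
    return rows


# Hypothetical page with one complete and one sparse record
page = {"data": {"sightList": [
    {"sightName": "A", "star": "5A", "qunarPrice": 120, "saleCount": 900, "free": False},
    {"sightName": "B"},
]}}
rows = sights_to_rows("浙江", page)

# Write to an in-memory buffer; a real run would open cities.csv instead
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Keeping each record together means a missing field can never shift one column out of step with the others.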

Data analysis

Now that the data is available, let's analyze it briefly.

Location distribution

First, let's look at how the scenic spots are distributed geographically, starting with the number of scenic spots in each region. The main code is as follows:

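from collections import Counter

import pandas as pd
from pyecharts import options as opts
from pyecharts.charts import Map
from pyecharts.globals import ThemeType

# Assumed: df is loaded from the cities.csv written by the crawler
df = pd.read_csv('cities.csv')
# Collect the province (column 0) of every spot with at least one sale (column 5 is saleCount)
cities = []
for city in df[(df.iloc[:, 5] > 0)].iloc[:, 0]:
    if city != "":
        cities.append(city)
data = Counter(cities).most_common(100)
gx = []
gy = []
for c in data:
    gx.append(c[0])
    gy.append(c[1])
(
    Map(init_opts=opts.InitOpts(theme=ThemeType.MACARONS, height="500px"))
    .add('数量', [list(z) for z in zip(gx, gy)], 'china')  # series name: count
    .set_global_opts(
        title_opts=opts.TitleOpts(title='各地景区数量分布'),  # title: number of scenic spots by region
        visualmap_opts=opts.VisualMapOpts(max_=150, is_piecewise=True),
    )
).render_notebook()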
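The counting step that feeds the map can be checked on its own. This is a small sketch with hypothetical (province, saleCount) rows standing in for the scraped data; it only shows how the name and count pairs passed to the map are produced.

```python
from collections import Counter

# Hypothetical (province, saleCount) rows standing in for the scraped data
rows = [
    ("浙江", 120), ("浙江", 0), ("北京", 45), ("浙江", 8),
    ("北京", 3), ("四川", 0), ("浙江", 60),
]

# Keep spots with at least one sale, then count spots per province
counted = Counter(province for province, sales in rows if sales > 0)

# most_common returns (name, count) tuples, largest count first
pairs = [list(item) for item in counted.most_common(100)]
print(pairs)
```

Provinces whose spots all have zero sales (四川 in this sample) drop out entirely, so they would appear uncoloured on the map.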

The result looks like this:

Next, let's look at ticket sales by region. The main code is as follows:

# Total sales per province; selecting the saleCount column yields a Series,
# so each value passed to the map is a plain number rather than a one-element list
df_sum = df.groupby('cityName')['saleCount'].sum()
(
    Map(init_opts=opts.InitOpts(theme=ThemeType.ROMANTIC, height="500px"))
    .add('销量', [list(z) for z in zip(df_sum.index.tolist(), df_sum.tolist())], 'china')  # series name: sales
    .set_global_opts(
        title_opts=opts.TitleOpts(title='各地景区销量分布'),  # title: scenic spot sales by region
        visualmap_opts=opts.VisualMapOpts(max_=150000, is_piecewise=True),
    )
).render_notebook()
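The per-province totals can be reproduced in plain Python to sanity-check the numbers. The sketch below uses hypothetical rows and also derives an upper bound for the colour scale from the data, as one possible alternative to the fixed max_ value used in the chart; the name scale_max is an assumption for illustration.

```python
from collections import defaultdict

# Hypothetical (province, saleCount) rows
rows = [("浙江", 1200), ("北京", 450), ("浙江", 300), ("四川", 75), ("北京", 50)]

# Sum sales per province
totals = defaultdict(int)
for province, sales in rows:
    totals[province] += sales

# Name and value pairs in the shape the map series expects
pairs = [[province, total] for province, total in totals.items()]

# One possible upper bound for the colour scale: the largest provincial total
scale_max = max(totals.values())
print(pairs, scale_max)
```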

The result looks like this:

The most popular scenic spots

Which scenic spots make the top 10 by sales, and what do they cost? The main code is as follows:

from pyecharts.charts import Bar

# Sort ascending so that the last 10 rows are the best sellers
sort_sale = df.sort_values(by='saleCount', ascending=True)
(
    Bar(init_opts=opts.InitOpts(theme=ThemeType.MACARONS, width='125%'))
    .add_xaxis(list(sort_sale['sightName'])[-10:])
    .add_yaxis('销量', sort_sale['saleCount'].values.tolist()[-10:])  # sales
    .add_yaxis('价格', sort_sale['qunarPrice'].values.tolist()[-10:])  # price
    .reversal_axis()  # swap the axes to get horizontal bars
    .set_global_opts(
        title_opts=opts.TitleOpts(title='最热景区TOP10'),  # title: top 10 most popular scenic spots
        yaxis_opts=opts.AxisOpts(name='名称', axislabel_opts=opts.LabelOpts(rotate=-30)),  # axis name: name
        xaxis_opts=opts.AxisOpts(name='销量/价格'),  # axis name: sales / price
    )
    .set_series_opts(label_opts=opts.LabelOpts(position="right"))
).render_notebook()

The result looks like this:

From the figure, we can see that most of the top 10 scenic spots are priced under 500 yuan, which is fairly affordable. If your friend likes lively places, he or she may well have gone to one of these popular spots.
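The top 10 selection relies on sorting in ascending order and slicing from the end. A minimal sketch with hypothetical records shows the idea; the ascending order is presumably kept because a horizontal bar chart draws the first category at the bottom, which leaves the best seller at the top.

```python
# Hypothetical (name, saleCount, price) records
spots = [
    ("A", 500, 80), ("B", 9000, 120), ("C", 150, 30),
    ("D", 7200, 450), ("E", 3100, 60),
]

TOP_N = 3

# Sort by sales, smallest first, then keep the last N entries
ascending = sorted(spots, key=lambda s: s[1])
top = ascending[-TOP_N:]

names = [s[0] for s in top]
sales = [s[1] for s in top]
prices = [s[2] for s in top]
print(names, sales, prices)
```

Slicing the three lists from the same sorted sequence keeps each name aligned with its own sales and price values.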

Next, let's look at how the popular scenic spots are described. Here we take the introductions of the top 100 by sales and view them as a word cloud. The main code is as follows:

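import stylecloud

sort_sale = df.sort_values(by='saleCount', ascending=True)
# Assumption: cts_str is the joined intro text of the 100 best-selling spots;
# how it was built is not shown in the source
cts_str = ' '.join(sort_sale['intro'].tail(100).dropna().astype(str))
stylecloud.gen_stylecloud(text=cts_str, max_words=100,
                          collocations=False,
                          font_path="SIMLI.TTF",
                          icon_name="fab fa-firefox",
                          size=800,
                          output_name="hot.png")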

The result looks like this:

The most expensive scenic spots

Which scenic spots make the top 10 by price, and how well do they sell? The main code is as follows:

# Sort ascending so that the last 10 rows are the most expensive spots
sort_price = df.sort_values(by='qunarPrice', ascending=True)
(
    Bar(init_opts=opts.InitOpts(theme=ThemeType.ROMA))
    .add_xaxis(list(sort_price['sightName'])[-10:])
    .add_yaxis('价格', sort_price['qunarPrice'].values.tolist()[-10:])  # price
    .add_yaxis('销量', sort_price['saleCount'].values.tolist()[-10:])  # sales
    .reversal_axis()  # swap the axes to get horizontal bars
    .set_global_opts(
        title_opts=opts.TitleOpts(title='最豪景区TOP10'),  # title: top 10 most expensive scenic spots
        yaxis_opts=opts.AxisOpts(name='名称'),  # axis name: name
        xaxis_opts=opts.AxisOpts(name='价格/销量'),  # axis name: price / sales
    )
    .set_series_opts(label_opts=opts.LabelOpts(position="right"))
).render_notebook()

The result looks like this:

If your friend is a big spender who loves to travel, he or she has probably gone to one of these high-priced scenic spots.

Let's look at how the high-priced scenic spots are described. Again we take the top 100, this time by price, and view their introductions as a word cloud.

The main code is as follows:

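sort_price = df.sort_values(by='qunarPrice', ascending=True)
# Assumption: cts_str is the joined intro text of the 100 most expensive spots;
# how it was built is not shown in the source
cts_str = ' '.join(sort_price['intro'].tail(100).dropna().astype(str))
stylecloud.gen_stylecloud(text=cts_str, max_words=100,
                          collocations=False,
                          font_path="SIMLI.TTF",
                          icon_name="fas fa-yen-sign",  # yen sign icon for the most expensive spots
                          size=800,
                          output_name="money.png")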

The result looks like this:

Scenic spot star ratings

Let's look at the number of 5A scenic spots in each provincial-level administrative division. The main code is as follows:

# Number of 5A scenic spots per province
df_sum = df[df['star'] == '5A'].groupby('cityName').count()['star']
(
    Bar(init_opts=opts.InitOpts(theme=ThemeType.MACARONS))
    .add_xaxis(df_sum.index.values.tolist())
    .add_yaxis('数量', df_sum.values.tolist())  # series name: count
    .set_global_opts(
        title_opts=opts.TitleOpts(title='各地5A景区数量'),  # title: number of 5A scenic spots by region
        datazoom_opts=[opts.DataZoomOpts(), opts.DataZoomOpts(type_='inside')],
    )
).render_notebook()

The result looks like this:

If your friend loves traveling and has a soft spot for 5A scenic spots, he or she may have gone to a region that has 5A scenic spots.

Finally, let's look at the star ratings of the top 200 scenic spots by sales. The main code is as follows:

from pyecharts.charts import Pie

# Star ratings of the 200 best-selling spots
sort_data = df.sort_values(by=['saleCount'], ascending=True)
rates = list(sort_data['star'])[-200:]
gx = ["3A", "4A", "5A"]
gy = [
    rates.count("3A"),
    rates.count("4A"),
    rates.count("5A")
]
(
    Pie(init_opts=opts.InitOpts(theme=ThemeType.MACARONS))
    .add("", list(zip(gx, gy)), radius=["40%", "70%"])
    # title: star ratings of the top 200 scenic spots by sales
    .set_global_opts(title_opts=opts.TitleOpts(title="销量TOP200景区星级比例", pos_left="left"))
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%", font_size=12))
).render_notebook()

The result looks like this:

From the figure, we can see that among the rated scenic spots in the top 200, more than 90% are 4A or 5A.
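The percentages in the pie chart are shares of the three counted ratings only, so spots without a 3A, 4A, or 5A rating do not enter the total. A small sketch with hypothetical ratings makes the arithmetic explicit.

```python
from collections import Counter

LEVELS = ("3A", "4A", "5A")

# Hypothetical star ratings of best-selling spots; None marks an unrated spot
stars = ["5A", "4A", "5A", None, "3A", "4A", "5A", "4A", None, "5A"]

# Count only the three rating levels that the pie chart shows
counts = Counter(s for s in stars if s in LEVELS)
rated = sum(counts.values())

# Share of each level among the rated spots, in percent
shares = {level: round(100 * counts[level] / rated, 1) for level in LEVELS}
print(shares)
```

In this sample, two of the ten spots are unrated, so the shares are computed over eight spots rather than ten.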

That's it for this article. We ran a simple analysis of several indicators in Qunar's ticket sales data, which can serve as a rough reference. If you are interested, you can go on to analyze other indicators.

To get the source code, reply qunar in the Python小二 official account.


Python小二