The National Day holiday is half over. Are you out traveling, or staying at home? Do you know where your friends went? In this article, we crawl the ticket data from qunar.com (piao.qunar.com) and run a simple analysis on it.

Data crawling

First, open piao.qunar.com and enter the name of a provincial-level administrative division in the search box. Taking Zhejiang as an example, the result is shown in the figure:

Scroll down the page, press F12 to open the developer tools, and click the next-page button to see the request URL, as shown in the figure:

Looking at the URL, we can see that keyword and page are dynamic parameters: keyword is the search term we entered and page is the page number. To crawl page by page, we only need to assign these two values dynamically. Switching the developer tools to the Response tab shows that the returned data is in JSON format, as shown in the figure:
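To make the two dynamic parameters concrete, here is a minimal sketch of how the paged URLs could be built. The helper names `build_list_url` and `page_urls` are hypothetical; the endpoint and the fixed `from` value are taken from the URL used in this article.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL observed in the developer tools.
BASE_URL = "https://piao.qunar.com/ticket/list.json"


def build_list_url(keyword, page):
    """Return the list URL for one search keyword and one page number."""
    params = {
        "from": "mpl_search_suggest",  # fixed value seen in the observed URL
        "keyword": keyword,            # the search term, e.g. a province name
        "page": page,                  # page number, starting at 1
    }
    # urlencode percent-encodes non-ASCII keywords such as Chinese names.
    return f"{BASE_URL}?{urlencode(params)}"


def page_urls(keyword, pages):
    """Return the URLs for pages 1..pages of one keyword."""
    return [build_list_url(keyword, page) for page in range(1, pages + 1)]


if __name__ == "__main__":
    for url in page_urls("浙江", 3):
        print(url)
```

Building the query string with `urlencode` rather than by hand avoids mistakes with `&` separators and with encoding of Chinese keywords.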

Here we use the 34 provincial-level administrative divisions as keywords and crawl page by page. The main crawling code is as follows:

import json
import random
import time

import pandas as pd
import requests

# `headers` (the request headers) is defined elsewhere in the full script.

def get_city_data(cities, pages):
    cityNames = []
    sightNames = []
    stars = []
    scores = []
    qunarPrices = []
    saleCounts = []
    districtses = []
    points = []
    intros = []
    frees = []
    addresses = []
    for city in cities:
        for page in range(1, pages + 1):
            try:
                print(f'Crawling {city}, page {page}')
                # Random pause between requests to avoid hitting the site too fast
                time.sleep(random.uniform(1.5, 2.5))
                url = f'https://piao.qunar.com/ticket/list.json?from=mpl_search_suggest&keyword={city}&page={page}'
                print('url:', url)
                result = requests.get(url, headers=headers, timeout=(2.5, 5.5))
                status = result.status_code
                if status == 200:
                    response_info = json.loads(result.text)
                    print('response_info:', response_info)
                    sight_list = response_info['data']['sightList']
                    for sight in sight_list:
                        # NOTE: the defaults for score, saleCount, districts,
                        # point and address are assumed; those lines were
                        # garbled in the source.
                        sightName = sight['sightName']            # name
                        star = sight.get('star', None)            # star rating
                        score = sight.get('score', 0)             # score
                        qunarPrice = sight.get('qunarPrice', 0)   # price
                        saleCount = sight.get('saleCount', 0)     # sales
                        districts = sight.get('districts', None)  # province/city/district
                        point = sight.get('point', None)          # coordinates
                        intro = sight.get('intro', None)          # introduction
                        free = sight.get('free', True)            # free of charge or not
                        address = sight.get('address', None)      # address
                        cityNames.append(city)
                        sightNames.append(sightName)
                        stars.append(star)
                        scores.append(score)
                        qunarPrices.append(qunarPrice)
                        saleCounts.append(saleCount)
                        districtses.append(districts)
                        points.append(point)
                        intros.append(intro)
                        frees.append(free)
                        addresses.append(address)
            except Exception:
                # Skip this page on any request or parsing error
                continue
    city_dic = {'cityName': cityNames, 'sightName': sightNames, 'star': stars,
                'score': scores, 'qunarPrice': qunarPrices, 'saleCount': saleCounts,
                'districts': districtses, 'point': points, 'intro': intros,
                'free': frees, 'address': addresses}
    city_df = pd.DataFrame(city_dic)
    city_df.to_csv('cities.csv', index=False)
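The parsing step can be tried out without touching the network. The sketch below runs the same `data` -> `sightList` lookup on a made-up payload; the sample values and the helper name `extract_sights` are hypothetical, and only a few of the fields are shown.

```python
import json

# Made-up payload that only mimics the shape used in this article
# (data -> sightList -> one dict per scenic spot); it is not real qunar.com data.
SAMPLE_TEXT = json.dumps({
    "data": {
        "sightList": [
            {"sightName": "Example Lake", "star": "5A", "score": 4.8,
             "qunarPrice": 60, "saleCount": 1200},
            {"sightName": "Example Hill"},  # optional fields missing
        ]
    }
})


def extract_sights(text, city):
    """Turn one JSON response body into a list of flat row dicts."""
    payload = json.loads(text)
    rows = []
    # Fall back to an empty list so an unexpected payload yields no rows.
    for sight in payload.get("data", {}).get("sightList", []):
        rows.append({
            "cityName": city,
            "sightName": sight["sightName"],         # treated as required
            "star": sight.get("star"),               # None when unrated
            "score": sight.get("score", 0),
            "qunarPrice": sight.get("qunarPrice", 0),
            "saleCount": sight.get("saleCount", 0),
        })
    return rows


if __name__ == "__main__":
    for row in extract_sights(SAMPLE_TEXT, "Zhejiang"):
        print(row)
```

Using `dict.get` with a default for the optional fields means a scenic spot with missing data still produces a complete row instead of raising `KeyError`.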

Data analysis

Now that we have the data, let’s do a little analysis.

Location distribution

First, let's look at where the scenic spots are located.

We start with the number of scenic spots in each province. The main code is as follows:

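from collections import Counter

from pyecharts import options as opts
from pyecharts.charts import Map
from pyecharts.globals import ThemeType

# df is assumed to be the DataFrame loaded from cities.csv
cities = []
# Column 5 is saleCount, column 0 is cityName: keep spots with sales > 0
for city in df[(df.iloc[:, 5] > 0)].iloc[:, 0]:
    if city != "":
        cities.append(city)
data = Counter(cities).most_common(100)
gx = []
gy = []
for c in data:
    gx.append(c[0])  # province name
    gy.append(c[1])  # number of scenic spots
(
    Map(init_opts=opts.InitOpts(theme=ThemeType.MACARONS, height="500px"))
    .add('number', [list(z) for z in zip(gx, gy)], 'china')
    .set_global_opts(
        # The original title was garbled; this one is a stand-in
        title_opts=opts.TitleOpts(title='Number of scenic spots by region'),
        visualmap_opts=opts.VisualMapOpts(max_=150, is_piecewise=True),
    )
).render_notebook()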

Take a look at the results:

Next, let's look at scenic spot sales across the country. The main code is as follows:

df_item = df[['cityName', 'saleCount']]
df_sum = df_item.groupby('cityName').sum()
(
    Map(init_opts=opts.InitOpts(theme=ThemeType.ROMANTIC, height="500px"))
    # Take the saleCount column so each pair is [name, number];
    # df_sum.values.tolist() would give [name, [number]] instead
    .add('sales',
         [list(z) for z in zip(df_sum.index.values.tolist(), df_sum['saleCount'].tolist())],
         'china')
    .set_global_opts(
        title_opts=opts.TitleOpts(title='Scenic spot sales by region'),
        visualmap_opts=opts.VisualMapOpts(max_=150000, is_piecewise=True),
    )
).render_notebook()
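The map series is fed a list of name/value pairs, so it helps to see the aggregated data in that shape. The following is a minimal plain-Python sketch of the group-and-sum step on made-up rows; the helper name `sales_by_province` and the sample numbers are hypothetical.

```python
from collections import defaultdict

# Made-up rows in the same shape as the crawled data (cityName, saleCount).
rows = [
    {"cityName": "浙江", "saleCount": 120},
    {"cityName": "浙江", "saleCount": 80},
    {"cityName": "江苏", "saleCount": 50},
]


def sales_by_province(rows):
    """Sum saleCount per cityName and return [name, total] pairs."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["cityName"]] += row["saleCount"]
    # Each pair holds a plain number, not a nested list.
    return [[name, total] for name, total in totals.items()]


pairs = sales_by_province(rows)
print(pairs)  # [['浙江', 200], ['江苏', 50]]
```

Whatever tool does the aggregation, the second element of each pair should be a single number so the visual map can color the provinces by value.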

Take a look at the results:

The hottest scenic spots

Next, let's look at the TOP10 scenic spots by sales. How much do they cost? The main code is as follows:

from pyecharts.charts import Bar

# Sort ascending so the 10 best sellers are the last 10 rows
sort_sale = df.sort_values(by='saleCount', ascending=True)
(
    Bar(init_opts=opts.InitOpts(theme=ThemeType.MACARONS, width="125%"))
    .add_xaxis(list(sort_sale['sightName'])[-10:])
    .add_yaxis('sales', sort_sale['saleCount'].values.tolist()[-10:])
    .add_yaxis('price', sort_sale['qunarPrice'].values.tolist()[-10:])
    .reversal_axis()  # horizontal bars
    .set_global_opts(
        # The original title prefix was garbled; only "TOP10" survived
        title_opts=opts.TitleOpts(title='Best-selling scenic spots TOP10'),
        yaxis_opts=opts.AxisOpts(name='name', axislabel_opts=opts.LabelOpts(rotate=-30)),
        xaxis_opts=opts.AxisOpts(name='sales/price'),
    )
    .set_series_opts(label_opts=opts.LabelOpts(position="right"))
).render_notebook()

Take a look at the results:

From the chart, we can see that most of the TOP10 scenic spots are priced under 500 yuan, which is fairly affordable. If your friend loves crowds, he or she probably went to one of these popular scenic spots.

Now let's look at how the popular scenic spots introduce themselves. Here we take the top 100 by sales and view their introductions as a word cloud. The main code is as follows:

sort_sale = df.sort_values(by='saleCount', ascending=True)
stylecloud.gen_stylecloud(text=cts_str, max_words=100,
                          collocations=False,
                          font_path="SIMLI.TTF",
                          icon_name="fab fa-firefox",
                          size=800,
                          output_name="hot.png")
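The snippet above passes `cts_str` to the word cloud without showing how it is built. A minimal sketch of one way to assemble it, assuming it is simply the introductions of the best-selling spots joined into one string; the helper name `top_intro_text` and the sample rows are hypothetical, and any word segmentation of the Chinese text is left out.

```python
# Made-up rows standing in for the crawled data; `intro` may be missing (None).
rows = [
    {"sightName": "A", "saleCount": 300, "intro": "lake and hills"},
    {"sightName": "B", "saleCount": 900, "intro": None},
    {"sightName": "C", "saleCount": 500, "intro": "old town"},
]


def top_intro_text(rows, n=100):
    """Join the intros of the n best-selling rows into one string."""
    best = sorted(rows, key=lambda r: r["saleCount"], reverse=True)[:n]
    # Skip rows without an intro so None never ends up in the text.
    return " ".join(r["intro"] for r in best if r["intro"])


cts_str = top_intro_text(rows, n=2)
print(cts_str)  # old town
```

With the real data, `n=100` would correspond to the top 100 by sales mentioned above.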

Take a look at the results:

The most expensive scenic spots

Let's take a look at the TOP10 scenic spots by ticket price. How are they selling? The main code is as follows:

# Sort ascending so the 10 most expensive spots are the last 10 rows
sort_price = df.sort_values(by='qunarPrice', ascending=True)
(
    Bar(init_opts=opts.InitOpts(theme=ThemeType.ROMA))
    .add_xaxis(list(sort_price['sightName'])[-10:])
    .add_yaxis('price', sort_price['qunarPrice'].values.tolist()[-10:])
    .add_yaxis('sales', sort_price['saleCount'].values.tolist()[-10:])
    .reversal_axis()  # horizontal bars
    .set_global_opts(
        # The original title prefix was garbled; only "TOP10" survived
        title_opts=opts.TitleOpts(title='Most expensive scenic spots TOP10'),
        yaxis_opts=opts.AxisOpts(name='name'),
        xaxis_opts=opts.AxisOpts(name='price/sales'),
    )
    .set_series_opts(label_opts=opts.LabelOpts(position="right"))
).render_notebook()

Take a look at the results:

If your friend is wealthy and loves to travel, there is a good chance that he or she went to one of these high-priced scenic spots.

Now let's look at the introductions of the high-priced scenic spots. Again we take the top 100, this time by price, and view them as a word cloud.

The main code is as follows:

sort_price = df.sort_values(by='qunarPrice', ascending=True)
stylecloud.gen_stylecloud(text=cts_str, max_words=100,
                          collocations=False,
                          # font_path was garbled in the source; the font from
                          # the previous word cloud is assumed here
                          font_path="SIMLI.TTF",
                          icon_name="fas fa-yen-sign")

Take a look at the results:

Scenic spot star ratings

Let's take a look at the number of 5A scenic spots in each provincial-level administrative division. The main code is as follows:

# Count the 5A scenic spots per province
df_sum = df[df['star'] == '5A'].groupby('cityName').count()['star']
(
    Bar(init_opts=opts.InitOpts(theme=ThemeType.MACARONS))
    .add_xaxis(df_sum.index.values.tolist())
    .add_yaxis('number', df_sum.values.tolist())
    .set_global_opts(
        title_opts=opts.TitleOpts(title='Number of 5A scenic spots by region'),
        datazoom_opts=[opts.DataZoomOpts(), opts.DataZoomOpts(type_='inside')],
    )
).render_notebook()

Take a look at the results:

If your friend loves traveling and is fond of 5A scenic spots, he or she may have gone to a province with many 5A scenic spots.

Finally, let's look at the share of each star rating among the top 200 scenic spots by sales. The main code is as follows:

from pyecharts.charts import Pie

# Sort ascending so the 200 best sellers are the last 200 rows
sort_data = df.sort_values(by=['saleCount'], ascending=True)
rates = list(sort_data['star'])[-200:]
gx = ["3A", "4A", "5A"]
gy = [rates.count("3A"), rates.count("4A"), rates.count("5A")]
(
    Pie(init_opts=opts.InitOpts(theme=ThemeType.MACARONS))
    # The outer radius was lost in the source; "70%" is an assumed value
    .add("", list(zip(gx, gy)), radius=["40%", "70%"])
    .set_global_opts(title_opts=opts.TitleOpts(title="Proportion of star ratings in TOP200 scenic spots", pos_left="left"))
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%", font_size=12))
).render_notebook()

Take a look at the results:

The chart shows that more than 90% of the rated scenic spots in the top 200 are 4A or 5A.
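The shares read off the pie chart can also be computed directly. Below is a minimal sketch, assuming the shares are taken among entries rated 3A, 4A or 5A only (unrated entries are ignored); the helper name `star_shares` and the sample values are hypothetical and are not the crawled data.

```python
from collections import Counter


def star_shares(stars, levels=("3A", "4A", "5A")):
    """Return the percentage of each star level among the rated entries."""
    counts = Counter(s for s in stars if s in levels)
    total = sum(counts.values())
    if total == 0:
        # No rated entries: report zero for every level
        return {level: 0.0 for level in levels}
    return {level: round(100 * counts[level] / total, 1) for level in levels}


# Made-up star values, not the crawled data.
sample = ["5A", "4A", "4A", "3A", None, "5A", "4A", "4A", "5A", "4A"]
shares = star_shares(sample)
print(shares)
```

Adding the 4A and 5A entries of the result gives the combined share that the sentence above refers to.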

That's all for this article. We ran a simple analysis of a few indicators from qunar.com's ticket sales data, which can serve as a rough reference. If you are interested, you can go on to analyze the other indicators.

Source code is available from Python qunar.