
As the economy develops, people's incomes keep rising and they have more money and time at their disposal. Once basic needs are met, people look for a better quality of life and spiritual satisfaction, and travelling is one of the best ways to broaden one's knowledge, cultivate the mind and exercise the body. One survey shows that nearly 67% of people choose to travel when time permits. As the number of tourists grows, the means of transport, styles of travel and destinations are also becoming more diverse. The purpose of this article is to analyze the currently popular tourist areas, as well as the characteristics of travelers, their travel times and related topics. Let's cut to the chase.


Data collection

Analyzing the main page

Open the home page of the Tuniu travel notes site:

trips.tuniu.com/

As shown in the figure, the travel notes written by travelers are divided into three categories: recommended, popular, and latest releases. Here we need to crawl 10,000 travel notes from the recommended category for analysis. When we click to the next page, each page switch initiates a new Ajax request, so we can try to crawl the travel-note data through the requested address. First open the developer tools and click the second page to capture the request shown below.

The data addresses of page two and page three are shown below.

trips.tuniu.com/travels/ind…_=1612599945466
trips.tuniu.com/travels/ind…_=1612600065558

  • sortType: travel-note category. 1 is recommended, 2 is popular, 3 is newly released
  • page: page number
  • limit: number of travel notes per page
  • _ : timestamp (see the sketch after this list)
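
As a quick illustration, here is a hedged sketch of how these parameters assemble into a list URL; `build_list_url` is a hypothetical helper for illustration, not part of the original code:

```python
from time import time
from urllib.parse import urlencode

def build_list_url(sort_type=1, page=1, limit=10):
    """Build the ajax-list URL from the parameters described above (illustrative only)."""
    params = {
        'sortType': sort_type,     # 1 = recommended, 2 = popular, 3 = newly released
        'page': page,              # page number
        'limit': limit,            # travel notes per page
        '_': int(time() * 1000),   # millisecond timestamp
    }
    return 'https://trips.tuniu.com/travels/index/ajax-list?' + urlencode(params)
```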

Opening this address in a new tab returns the data shown in the figure below (viewing it in the developer tools is clearer). We don't actually need most of these fields, so let's go to the details page and look at its address.

www.tuniu.com/trips/31327… www.tuniu.com/trips/31327…

It's easy to see that the trailing number is simply the id field in the JSON data, so the addresses of the sub-pages can be built directly from it.
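
A minimal sketch of that idea, assuming the JSON layout shown above (data → rows → id, the same fields used in the code later on):

```python
import json

# page_text is the JSON string returned by the ajax-list request
rows = json.loads(page_text)['data']['rows']
detail_urls = [f"https://www.tuniu.com/trips/{row['id']}" for row in rows]
```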


Analyzing the sub-pages

According to the needs of the analysis, we need to crawl three pieces of information from each sub-page: tag (the travel-note tags), img_count (the number of pictures in the note) and destination (the destination), as shown in the figure below. The page is static, which reduces the difficulty of crawling.


Crawler design

First, crawl the JSON data of each travel note on the main page and save it to a tuniu.csv file. Then read the id column from tuniu.csv and build the sub-page addresses by splicing on the URL prefix (www.tuniu.com/trips/)…


Main page data collection

Format the URL for the specified page range and return the list of URLs. In actual crawling it is not recommended to add a huge number of URLs to the coroutine task list at once: the asyncio library uses select() internally, and select() limits the number of open file descriptors, which on Windows defaults to 509. Keep this in mind when setting the range or you will get an error (a batching sketch follows the get_page_text coroutine below).

def get_url_list():
    url_list = []
    url = 'https://trips.tuniu.com/travels/index/ajax-list?sortType=3&page=%d&limit=10&_=1611556898746'
    # Format the URL for each page in the range and add it to the list
    for page in range(1, 510):
        url_list.append(url % page)
    return url_list

The data-collection coroutine uses aiohttp + asyncio together with proxy IPs to avoid being identified as a crawler. Because the stability of the proxy IPs varies, a timeout is set; if a request times out, another IP is chosen and the request is retried. The data is then formatted and saved.
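
The code below relies on a global proxy_list that the article never shows; a placeholder sketch might look like this (the addresses are dummies, substitute your own proxy pool):

```python
# Placeholder only: the original article does not show how proxy_list is built.
# Fill it with host:port entries from your own proxy pool.
proxy_list = [
    '123.45.67.89:8888',
    '98.76.54.32:3128',
]
```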

import aiohttp
from random import choice

async def get_page_text(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False), trust_env=True) as session:
        while True:
            try:
                async with session.get(url=url, proxy='http://' + choice(proxy_list), headers=headers, timeout=5) as response:
                    # Decode the response as UTF-8; while the coroutine is suspended
                    # on await, the event loop can execute other tasks
                    page_text = await response.text(encoding='utf-8')
                    # If the request did not succeed, pick another IP and retry
                    if response.status != 200:
                        continue
                    print(f"{url.split('&')[1]} crawled!")
                    break
            except Exception as e:
                print(e)
                # Catch the exception and retry the request
                continue
    return save_detail_url(page_text)
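
For context, here is a hedged sketch of how these coroutines might be driven; the batch loop (crawl_in_batches) is my own assumption, added only to respect the select() limit discussed earlier, and is not shown in the original article:

```python
import asyncio

async def crawl_in_batches(batch_size=500):
    # Process the URL list in chunks so no more than batch_size requests
    # are queued at once (keeps us under the Windows select() limit)
    url_list = get_url_list()
    for start in range(0, len(url_list), batch_size):
        batch = url_list[start:start + batch_size]
        await asyncio.gather(*(get_page_text(u) for u in batch))

if __name__ == '__main__':
    asyncio.run(crawl_in_batches())
```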

The collected data is a JSON string; parse it into a dictionary, pull out the values we need and save them in CSV format.

import json
import os
import pandas as pd

def save_detail_url(page_text):
    # Convert the JSON string into a dictionary
    page_data = json.loads(page_text)
    detail_data = page_data['data']['rows']
    # Create a DataFrame to hold the fields we want to keep
    df = pd.DataFrame(columns=['id', 'name', 'summary', 'stamp', 'authorId', 'viewCount', 'likeCount', 'commentCount', 'imgId', 'publishTime', 'picUrl', 'authorName', 'authorHeadImg', 'authorIndentity', 'bindOrder', 'bindBanner', 'bindSchedule', 'hasLike'])
    for dic in detail_data:
        df = df.append(dic, ignore_index=True)

    # Remove newline, carriage-return and tab characters from the summary
    df['summary'] = df['summary'].str.replace('\r|\n|\t', ' ', regex=True)
    # Write the header line only if the file does not exist yet
    if os.path.exists(r'C:\Users\pc\Desktop\tuniu.csv'):
        df.to_csv(r'C:\Users\pc\Desktop\tuniu.csv', index=False, mode='a', header=False)
    else:
        df.to_csv(r'C:\Users\pc\Desktop\tuniu.csv', index=False, mode='a')


The collected data is as follows:


Sub-page data collection

Read the id column of the CSV file with pandas, concatenate each id onto the URL prefix of the travel-note detail page, and return the URL list.

def get_url_list(read_path):
    df = pd.read_csv(read_path, low_memory=False, usecols=['id'])
    df = df.loc[0:500]
    url_list = ['https://www.tuniu.com/trips/%i' % i for i in df['id']]
    return url_list

Compared with the main page, the sub-pages have more anti-crawling measures, so a Cookie value is added to the headers when crawling.

async def get_page_text(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
        'Cookie': 'your cookies'
    }
    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False), trust_env=True) as session:
        while True:
            try:
                async with session.get(url=url, headers=headers, timeout=5) as response:
                    # Decode the response as UTF-8; while the coroutine is suspended
                    # on await, the event loop can execute other tasks
                    page_text = await response.text(encoding='utf-8')
                    # If the request did not succeed, retry it
                    if response.status != 200:
                        print(response.status)
                        continue
                    id = url.split('/')[-1]
                    print(f"{url.split('/')[-1]} crawled!")
                    break
            except Exception as e:
                # Catch the exception and retry the request
                continue
    return save_to_csv(page_text, id)

The data is formatted and saved using xpath. Note that it is worth saving the page data locally and locating elements against that saved copy, because the HTML actually returned differs from the page rendered in the browser. For example, when locating tag (the travel-note tags), the xpath expression on the rendered page is not the same as the one needed for the crawled HTML, as shown below. The traveler's destination needs the same care: an xpath expression copied directly from the rendered page comes back empty at run time. The figure below shows the label's position on the rendered page. Looking at the locally saved page data, the div whose class attribute is "poi-title" turns out to be empty, so we have to look for the destination name directly in the page source and obtain the location information by locating other tags instead. The figure below shows the location label. The formatting and saving code follows:

import re
from lxml import etree

def save_to_csv(page_text, id):
    tree = etree.HTML(page_text)
    # with open('tuniuDetail.html', 'w', encoding='utf-8') as fp:
    #     fp.write(page_text)
    tag_div_list = tree.xpath('//*[@id="vueApp"]/div[1]/div[1]/div')
    tag_list = []
    # Collect the travel-note tag text from each div
    for div in tag_div_list:
        tag_list.append(div.xpath('./text()')[0])
    # Count the images
    img_count = len(tree.xpath('//div[@class="sdk-trips-image"]'))
    pattern = re.compile(r'(?<="poiIds":\[)(.*?)}')
    destination_data = tree.xpath('/html/body/script[4]/text() | /html/body/script[3]/text()')
    poiIds = pattern.search(destination_data[0]).group()
    # If poiIds is an empty list, fall back to an empty name
    if ']' in poiIds:
        poiIds = '{"name": ""}'

    destination = eval(poiIds)['name']
    df = pd.DataFrame(columns=['id', 'tag', 'img_count', 'destination'])
    df = df.append({'id': id, 'tag': ' '.join(tag_list), 'img_count': img_count, 'destination': destination}, ignore_index=True)
    if os.path.exists(r'C:\Users\pc\Desktop\tuniu_detail.csv'):
        df.to_csv(r'C:\Users\pc\Desktop\tuniu_detail.csv', index=False, mode='a', header=False)
    else:
        df.to_csv(r'C:\Users\pc\Desktop\tuniu_detail.csv', index=False, mode='a')


The collected sub-page data is as follows:


Merged data set

The main-page CSV file and the tuniu_detail.csv file are merged on the id column.

def merge():
    df1 = pd.read_csv(r'C:\Users\pc\Desktop\tuniu_test.csv')
    df2 = pd.read_csv(r'C:\Users\pc\Desktop\tuniu_detail.csv')
    df1['summary'] = df1['summary'].str.replace('\r|\n|\t', ' ', regex=True)
    merge_df = pd.merge(df1, df2, on='id', how="inner")
    merge_df.to_csv(r'C:\Users\pc\Desktop\tuniu_info.csv', index=False)

Merged data:


Data analysis and visualization

Top 10 popular destinations

We analyze the destination information in the collected data to see where travelers generally go.

import matplotlib.pyplot as plt
import seaborn as sns

# Set the background style and enable Chinese font display
sns.set_style('darkgrid', {'font.sans-serif': ['simhei', 'FangSong']})

def tuniu_destination(read_path):
    df = pd.read_csv(read_path, low_memory=False, usecols=['destination'])
    # Count the ten most frequent destinations and their occurrence counts
    des_count = df['destination'].value_counts().head(10)
    des, count = list(des_count.index), list(des_count.values)

    new_df = pd.DataFrame(columns=['destination', 'count'])
    new_df['destination'], new_df['count'] = des, count

    sns_plot = sns.barplot(x='destination', y='count', palette='ch:.25', data=new_df)
    plt.xlabel('place')
    plt.ylabel('number')
    plt.show()
    # Save the figure
    # sns_plot.figure.savefig(r'C:\Users\pc\Desktop\des_count.jpg', dpi=1000)

As can be seen from the figure, the top three, Sanya, Lijiang and Chengdu, are all well-known tourist destinations. Every year many visitors come from elsewhere, which promotes the local economy to some extent and in turn attracts even more tourists.


Travel characteristics of travelers

Based on the information in the travel tags, we analyze some characteristics of how people travel.

from wordcloud import WordCloud

def tuniu_tag(read_path):
    df = pd.read_csv(read_path, low_memory=False, usecols=['tag'])
    all_tag = []
    for lst in df['tag']:
        tag_list = lst.split('#')
        for i in range(1, len(tag_list)):
            all_tag.append(tag_list[i])
    tag_str = ' '.join(all_tag)
    # Draw the tag word cloud
    word = WordCloud(max_words=100,
                     max_font_size=200,
                     width=1000,
                     height=800,
                     font_path=r'C:\Windows\Fonts\simhei.ttf'
                     ).generate(tag_str)
    word.to_file(r"C:\Users\pc\Desktop\tag.jpg")

From the tag word cloud above, we can see that the most frequent tags include road trips, culture, natural wonders, photography, food and homestays, which are exactly the characteristics of most travelers' trips. In recent years many rural homestays have opened, so travelers can also experience the local customs along the way.


Travel season

Using the publishing time of the travel notes as a rough proxy for the travel time, let's see what seasonal characteristics travel shows.

def tuniu_season(read_path):
    df = pd.read_csv(read_path, low_memory=False, usecols=['publishTime'])
    season_dic = {'spring': 0, 'summer': 0, 'autumn': 0, 'winter': 0}
    for publishTime in df['publishTime']:
        month = publishTime.split('-')[1]
        if month in ['01', '02', '03']:
            season_dic['spring'] += 1
        elif month in ['04', '05', '06']:
            season_dic['summer'] += 1
        elif month in ['07', '08', '09']:
            season_dic['autumn'] += 1
        elif month in ['10', '11', '12']:
            season_dic['winter'] += 1
    # Pie chart colours
    colors = ['#81ecec', '#ff7675', '#6c5ce7', '#F5F6CE']
    plt.pie(list(season_dic.values()), labels=list(season_dic.keys()), autopct='%.1f%%', colors=colors, startangle=90, counterclock=False)
    plt.legend(loc='upper right', bbox_to_anchor=(1.2, 0.2))
    plt.savefig(r'C:/Users/pc/Desktop/season.jpg', dpi=1000)
    plt.show()

As can be seen from the chart, roughly the same number of people travel in spring and autumn, around the average level; winter has the most and summer the least.


Travel times in popular areas

The climate varies from place to place, so next we analyze how the number of travelers changes by month in the three popular destinations (Sanya, Lijiang and Chengdu).

def des_season(read_path):
    df = pd.read_csv(read_path, low_memory=False, usecols=['destination', 'publishTime'])
    new_df = pd.DataFrame(columns=['des', 'month'])
    dic = {'Sanya': 0, 'Lijiang': 0, 'Chengdu': 0}
    # Take at most 200 travel notes for each of the three destinations
    for i in range(len(df)):
        if df['destination'][i] in ['Sanya', 'Lijiang', 'Chengdu']:
            if dic[df['destination'][i]] < 200:
                dic[df['destination'][i]] += 1
                new_df = new_df.append({'des': df['destination'][i], 'month': int(df['publishTime'][i].split('-')[1])}, ignore_index=True)
    s1 = new_df.loc[new_df['des'] == 'Sanya']['month'].value_counts()
    s1.sort_index(inplace=True)
    s2 = new_df.loc[new_df['des'] == 'Lijiang']['month'].value_counts()
    s2.sort_index(inplace=True)
    s3 = new_df.loc[new_df['des'] == 'Chengdu']['month'].value_counts()
    s3.sort_index(inplace=True)
    df_1 = pd.DataFrame({'des': 'Sanya', 'month': s1.index, 'count': s1.values})
    df_2 = pd.DataFrame({'des': 'Lijiang', 'month': s2.index, 'count': s2.values})
    df_3 = pd.DataFrame({'des': 'Chengdu', 'month': s3.index, 'count': s3.values})
    # Merge the three destination DataFrames
    res_df = pd.concat([df_1, df_2, df_3], ignore_index=True)
    # Draw the line chart
    fig = sns.relplot(x=res_df['month'], y=res_df['count'], hue=res_df['des'], height=3, kind='line', estimator=None, data=new_df)
    plt.show()
    fig.savefig(r'C:\Users\pc\Desktop\des_season.jpg', dpi=1000)

Looking at the chart above, we can see that travel times in these three popular cities differ somewhat. Sanya receives more tourists in December and January, while Lijiang and Chengdu both peak in November. With this information, travel time can be arranged more sensibly.
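
Putting it together, the analysis functions above might be driven from a single entry point, as in the hedged sketch below; the path is the merged file produced earlier, and the call order is simply my assumption:

```python
if __name__ == '__main__':
    # Merged data set produced by merge() above
    path = r'C:\Users\pc\Desktop\tuniu_info.csv'
    tuniu_destination(path)   # top-10 destinations bar chart
    tuniu_tag(path)           # tag word cloud
    tuniu_season(path)        # travel-season pie chart
    des_season(path)          # monthly trend for Sanya, Lijiang and Chengdu
```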


Conclusion

Looking back, the most time-consuming part was the sub-page data collection. I started out crawling through proxy IPs, paused for a while, and when I resumed a few days later I found that any request sent through a proxy IP was detected and always returned the wrong data, so I had to give up the proxies and crawl with my own IP, which is less efficient.

