Preface

The main goal is to crawl the follower data of a given GitHub user and then run a simple round of visual analysis on the crawled data. Let’s have some fun.

The development tools

Python version: 3.6.4

Related modules:

bs4 module;

requests module;

argparse module;

pyecharts module;

and some modules that come with Python.

Environment setup

Install Python and add it to the PATH environment variable, then install the required third-party modules with pip.
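For example, assuming pip is available on the command line (the exact package versions are not pinned here), the third-party modules can be installed with:

pip install beautifulsoup4 requests pyecharts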

Data crawl

It feels like we haven’t used BeautifulSoup for a while, so let’s use it today to parse the pages and pull out the data we want. Take my own account as the example:

We first grab the usernames of all followers, which sit in span tags with the classes link-gray pl-1 and link-gray on the followers page.

BeautifulSoup makes it easy to extract them:

"Get followers user name"
def getfollowernames(self) :
    print('[INFO]: getting %s all followers usernames... ' % self.target_username)
    page = 0
    follower_names = []
    headers = self.headers.copy()
    while True:
        page += 1
        followers_url = f'https://github.com/{self.target_username}? page={page}&tab=followers'
        try:
            response = requests.get(followers_url, headers=headers, timeout=15)
            html = response.text
            if 've reached the end' in html:
                break
            soup = BeautifulSoup(html, 'lxml')
            for name in soup.find_all('span', class_='link-gray pl-1') :print(name)
                follower_names.append(name.text)
            for name in soup.find_all('span', class_='link-gray') :print(name)
                if name.text not in follower_names:
                    follower_names.append(name.text)
        except:
            pass
        time.sleep(random.random() + random.randrange(0.2))
        headers.update({'Referer': followers_url})
    print('[INFO]: successfully get %s of % S followers user name... ' % (self.target_username, len(follower_names)))
    return follower_names
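The snippets in this post are methods of a small crawler class that provides target_username and headers; that class isn’t shown here, so the wrapper below is only a hypothetical sketch of how the pieces might be wired together:

# Hypothetical wrapper, NOT the original class from the article: it only supplies
# the attributes (target_username, headers) that the method above expects.
class GithubFollowerSpider():
    def __init__(self, target_username):
        self.target_username = target_username
        # A browser-like User-Agent; the real headers used in the article are not shown
        self.headers = {'User-Agent': 'Mozilla/5.0'}

# Attach the function above as a method of the hypothetical class
GithubFollowerSpider.getfollowernames = getfollowernames

if __name__ == '__main__':
    spider = GithubFollowerSpider('CharlesPikachu')
    follower_names = spider.getfollowernames()
    print('%d followers found' % len(follower_names))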

Then, using these usernames, we can visit each follower’s profile page and capture that user’s detailed data. Each profile URL has the following structure:

https://github.com/ + username
For example: https://github.com/CharlesPikachu

The data we want to capture includes the display name, location, and the numbers of repositories, stars, followers, following, and contributions in the last year.

Similarly, we use BeautifulSoup to extract this information:

follower_infos = {}
for idx, name in enumerate(follower_names):
    print('[INFO]: Crawling the details of user %s...' % name)
    user_url = f'https://github.com/{name}'
    try:
        response = requests.get(user_url, headers=self.headers, timeout=15)
        html = response.text
        soup = BeautifulSoup(html, 'lxml')
        # -- Get the display name
        username = soup.find_all('span', class_='p-name vcard-fullname d-block overflow-hidden')
        if username:
            username = [name, username[0].text]
        else:
            username = [name, '']
        # -- Location
        position = soup.find_all('span', class_='p-label')
        if position:
            position = position[0].text
        else:
            position = ''
        # -- Numbers of repositories, stars, followers, following
        overview = soup.find_all('span', class_='Counter')
        num_repos = self.str2int(overview[0].text)
        num_stars = self.str2int(overview[2].text)
        num_followers = self.str2int(overview[3].text)
        num_followings = self.str2int(overview[4].text)
        # -- Contributions (last year)
        num_contributions = soup.find_all('h2', class_='f4 text-normal mb-2')
        num_contributions = self.str2int(num_contributions[0].text.replace('\n', '').replace(' ', '')
                            .replace('contributioninthelastyear', '').replace('contributionsinthelastyear', ''))
        # -- Save the data
        info = [username, position, num_repos, num_stars, num_followers, num_followings, num_contributions]
        print(info)
        follower_infos[str(idx)] = info
    except:
        # Skip users whose pages fail to download or parse
        pass
    # Pause briefly between requests
    time.sleep(random.random() + random.randrange(0, 2))
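The code also relies on a self.str2int helper that isn’t included in this excerpt; its job is to turn GitHub’s counter strings (e.g. “1.2k” or “9,437”) into integers. A minimal sketch of such a helper, under that assumption, could look like this:

def str2int(self, s):
    '''Convert a GitHub counter string such as "128", "9,437" or "1.2k" to an int.
    Hypothetical reconstruction, not the original helper from the article.'''
    s = s.strip().replace(',', '')
    if s.endswith('k'):
        return int(float(s[:-1]) * 1000)
    return int(float(s))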

Data visualization

Let’s use my own follower data as the sample, which comes to about 1,200 people.

Let’s first look at how their contributions over the past year are distributed:

The user with the most contributions, fengjixuchui, made 9,437 of them in the past year. That works out to more than 20 a day on average, which is impressively diligent.
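The plotting code isn’t shown here, but as a rough sketch (assuming pyecharts 1.x and the follower_infos dictionary built above; the bin edges are arbitrary), a distribution bar chart like this one could be drawn as follows:

from pyecharts import options as opts
from pyecharts.charts import Bar

def plot_contribution_distribution(follower_infos, save_path='contributions.html'):
    '''Bucket the per-user contribution counts and render a bar chart.'''
    # num_contributions is the last element of each info list built during the crawl
    counts = [info[-1] for info in follower_infos.values()]
    bins = [(0, 0), (1, 10), (11, 100), (101, 500), (501, 1000), (1001, float('inf'))]
    labels = ['0', '1-10', '11-100', '101-500', '501-1000', '1000+']
    values = [sum(low <= c <= high for c in counts) for low, high in bins]
    chart = (
        Bar()
        .add_xaxis(labels)
        .add_yaxis('followers', values)
        .set_global_opts(title_opts=opts.TitleOpts(title='Contributions in the last year'))
    )
    chart.render(save_path)

The same pattern (bucket the values, then draw a Bar) applies to the repository, star, and follower distributions below.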

Now take a look at the distribution of repositories owned by each person:

I expected a monotonically decreasing curve, but I underestimated you all.

Let’s look at the distribution of the number of stars each person has given:

Not bad. At least not all of them are “lurkers”.

Props to the guy named Lifa123, who has actually given out 18,700 stars 👍. That’s just showing off.

Take a look at the distribution of followers among these 1,000+ people:

At a quick glance, many of these friends have more followers than me. As expected, the real masters are hidden among the people.

I share a Python data-crawling case every day. The next article will cover crawling and simple analysis of A-share listed company data with Python.

All done~ For the complete source code, see my profile or message me to get the related files.