Preface

The main goal is to crawl the follower data of a specified GitHub user and run a quick round of simple visual analysis on the crawled data. Let's get started!

Development tools

Python version: 3.6.4

Related modules:

bs4 module;

requests module;

argparse module;

pyecharts module;

And some of the modules that come with Python.

Environment setup

Install Python, add it to your PATH environment variable, and then install the required third-party modules with pip.
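For reference, the third-party modules can be installed in one go from a terminal (argparse ships with Python; lxml is included here because the code below uses it as the BeautifulSoup parser):

pip install beautifulsoup4 requests lxml pyecharts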

Data crawling

It feels like we haven't used BeautifulSoup in a long time, so let's use it today to parse the web pages and get the data we want. I'll take my own account as an example:

We'll first grab the usernames of all the followers, which live in <span> tags on the followers page (the link-gray spans targeted in the code below):

BeautifulSoup is a convenient way to extract them:

"" def getFollowerNames (self): print('[INFO]: Getting %s of all followers' usernames... ' % self.target_username) page = 0 follower_names = [] headers = self.headers.copy() while True: page += 1 followers_url = f'https://github.com/{self.target_username}?page={page}&tab=followers' try: response = requests.get(followers_url, headers=headers, timeout=15) html = response.text if 've reached the end' in html: break soup = BeautifulSoup(html, 'lxml') for name in soup.find_all('span', class_='link-gray pl-1'): print(name) follower_names.append(name.text) for name in soup.find_all('span', class_='link-gray'): print(name) if name.text not in follower_names: follower_names.append(name.text) except: pass time.sleep(random.random() + random.randrange(0, 2)) headers.update({'Referer': followers_url}) print('[INFO]: Successfully got %s of %s followers usernames... ' % (self.target_username, len(follower_names))) return follower_names

Then we can visit each follower's homepage using these usernames and grab that user's detailed data. Each homepage link is constructed as follows:

https://github.com/ + username

For example: https://github.com/CharlesPikachu

The data we want to capture includes: the username, position, number of repositories, number of stars, number of followers, number of followings, and number of contributions in the last year.

Again, we use BeautifulSoup to extract this information:

for idx, name in enumerate(follower_names):
    print('[INFO]: Crawling details of user %s...' % name)
    user_url = f'https://github.com/{name}'
    try:
        response = requests.get(user_url, headers=self.headers, timeout=15)
        html = response.text
        soup = BeautifulSoup(html, 'lxml')
        # --username
        username = soup.find_all('span', class_='p-name vcard-fullname d-block overflow-hidden')
        if username:
            username = [name, username[0].text]
        else:
            username = [name, '']
        # --position
        position = soup.find_all('span', class_='p-label')
        if position:
            position = position[0].text
        else:
            position = ''
        # --number of repos, stars, followers, followings
        overview = soup.find_all('span', class_='Counter')
        num_repos = self.str2int(overview[0].text)
        num_stars = self.str2int(overview[2].text)
        num_followers = self.str2int(overview[3].text)
        num_followings = self.str2int(overview[4].text)
        # --contributions in the last year
        num_contributions = soup.find_all('h2', class_='f4 text-normal mb-2')
        num_contributions = self.str2int(num_contributions[0].text.replace('\n', '').replace(' ', '')
                                         .replace('contributioninthelastyear', '')
                                         .replace('contributionsinthelastyear', ''))
        # --save the data
        info = [username, position, num_repos, num_stars, num_followers, num_followings, num_contributions]
        print(info)
        follower_infos[str(idx)] = info
    except:
        pass
    time.sleep(random.random() + random.randrange(0, 2))
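The snippet relies on a str2int helper that is not shown here. Here is a minimal sketch of one plausible implementation, assuming GitHub renders counters as plain numbers ("342"), comma-separated numbers ("9,437"), or abbreviated ones with a k suffix ("1.2k"):

def str2int(self, string):
    # normalize '9,437' -> '9437', then handle the '1.2k' style abbreviation
    string = string.strip().replace(',', '')
    if string.endswith('k'):
        return int(float(string[:-1]) * 1000)
    return int(string)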

Data visualization

Let's take my own followers as the example; there are about 1,200 of them.

First, let's look at the number of contributions they made over the past year:
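The charts in this article were rendered with pyecharts, but the plotting code itself is not shown. Here is a minimal sketch of how such a bar chart could be drawn, assuming pyecharts >= 1.0 and the follower_infos dict built above (the first field of each info list is [login, full name]; the last field is the contribution count; function and file names are illustrative):

from pyecharts import options as opts
from pyecharts.charts import Bar

def drawContributionsBar(follower_infos, save_path='contributions.html'):
    # keep the 20 most active followers, sorted by contributions in the last year
    infos = sorted(follower_infos.values(), key=lambda x: x[-1], reverse=True)[:20]
    logins = [info[0][0] for info in infos]
    contributions = [info[-1] for info in infos]
    bar = Bar()
    bar.add_xaxis(logins)
    bar.add_yaxis('contributions in the last year', contributions)
    bar.set_global_opts(title_opts=opts.TitleOpts(title='Follower contributions'),
                        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=45)))
    bar.render(save_path)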

The most active was Fengjixuchui, with 9,437 contributions in the past year, which averages out to more than 20 contributions a day. That is a lot of work.

Next, let's look at the distribution of the number of repositories each person owns:

I thought it would be a monotonically decreasing curve, but I underestimated you all.

Then let's look at the distribution of the number of stars people have given out:

Not bad; at least they're not all freeloading. A special mention goes to lifa123, who has given others 18,700 stars 👍. That is seriously impressive.

Finally, let's look at the distribution of follower counts among those 1,000+ people:
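Again, the original chart code is not shown. Here is a minimal sketch of how the follower counts could be bucketed and drawn as a pie chart, under the same pyecharts >= 1.0 assumption and with illustrative bucket boundaries (num_followers is field index 4 of each info list):

from pyecharts import options as opts
from pyecharts.charts import Pie

def drawFollowersPie(follower_infos, save_path='followers.html'):
    # bucket each follower by how many followers they themselves have
    buckets = {'0-10': 0, '11-100': 0, '101-1000': 0, '1000+': 0}
    for info in follower_infos.values():
        num_followers = info[4]
        if num_followers <= 10:
            buckets['0-10'] += 1
        elif num_followers <= 100:
            buckets['11-100'] += 1
        elif num_followers <= 1000:
            buckets['101-1000'] += 1
        else:
            buckets['1000+'] += 1
    pie = Pie()
    pie.add('', [list(item) for item in buckets.items()])
    pie.set_global_opts(title_opts=opts.TitleOpts(title='Follower count distribution'))
    pie.render(save_path)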

Some friends have more followers than I do. As expected, true masters hide among the crowd.

If you enjoyed this article, please like and follow; I share Python crawler cases every day. In the next article, I'll share how to crawl and do a simple analysis of A-share company data with Python.

All done~ For the complete source code, see my profile page or send me a private message.