• How I Automated My Job Search by Building a Web Crawler from Scratch
  • Originally written by Zhia Hwa Chong
  • Translated by: the Juejin Translation Project (xitu/gold-miner)
  • Permalink to this article: github.com/xitu/gold-m…
  • Translator: Starrier
  • Proofreader: liruochen1998

How it all began

At midnight on a Friday, my friends were out having a good time while I was still working at my computer.

Oddly enough, I didn’t feel left out.

I was doing something I found really fun and genuinely rewarding.

I had just graduated from college and desperately needed a job. When I moved to Seattle, my backpack was full of college textbooks and clothes; I could fit everything I owned in the trunk of my 2002 Honda Civic.

I wasn't much of a networker at the time, so I decided to solve the job problem in the best way I knew how: I set out to build an app to help me, and this article is about how I did it. 😃

Starting with Craigslist

I was in my room, working frantically on software that would help me collect and respond to people looking for software engineers on Craigslist. Craigslist is essentially an Internet marketplace where you can find items for sale, services, community posts, and more.

Craigslist

At the time, I had never built a fully-fledged application. Most of what I had done in college was academic projects: building and parsing binary trees, computer graphics, and language-processing models.

I was a complete newbie.

Still, I had heard of a popular programming language called Python. I knew very little about it, but I worked tirelessly to learn.

So I decided to kill two birds with one stone and learn the new language by building a small application with it.

The journey of prototyping

I had a used BenQ laptop that my brother gave me when I was in college, and that is what I used for development.

It was by no means the best development environment: I was using an older version of Python (2.4) and Sublime Text. But the process of writing an application from scratch was a genuinely exciting experience.

I didn't yet know what I was doing, so I tried all kinds of things to figure out what I needed. My first step was to work out how I could easily access Craigslist's data.

I looked to see whether Craigslist had a publicly available REST API. Sadly for me, there was no such interface.

However, I found another good thing.

Craigslist has an RSS feed available for personal use. An RSS feed is essentially a computer-readable summary of the updates a website sends out. In this case, the RSS feed let me fetch new job listings as soon as they were published. It was perfect for my needs.

Example of an RSS feed

Next, I needed a way to read the RSS feed. I didn't want to browse it manually, because that would be a time sink no different from browsing Craigslist itself.

This is when I began to realize the power of Google. There is a joke that software engineers spend most of their time looking for answers on Google. I think there’s some truth to that.

I did a Google search and found a useful post on Stack Overflow describing how to search a Craigslist RSS feed. It turned out Craigslist offers this filtering feature for free: all I had to do was pass in a query parameter with the particular keyword I was interested in.

I was focused on finding software-related jobs in Seattle, so I used a specific Seattle URL to search for listings containing the keyword "software."

Seattle.craigslist.org/search/sss?…

Great, it worked. And it was beautiful.

An example Seattle RSS feed for the keyword "software"
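
Reading such a feed from Python takes only a few lines. Here is a minimal sketch using the feedparser library (my choice for illustration; the article doesn't name the library it used), with a hypothetical full URL standing in for the truncated one above:

```python
# A minimal sketch of reading a filtered Craigslist RSS feed.
# FEED_URL is a hypothetical stand-in for the truncated URL above;
# the exact query parameters are assumptions for illustration.
import feedparser

FEED_URL = "https://seattle.craigslist.org/search/sss?query=software&format=rss"

def fetch_listings(feed_url):
    """Return (title, link) pairs for every entry in the feed."""
    feed = feedparser.parse(feed_url)
    return [(entry.title, entry.link) for entry in feed.entries]

if __name__ == "__main__":
    for title, link in fetch_listings(FEED_URL):
        print(title, "->", link)
```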

Beautiful Soup, one of the best tools I've used

To my disbelief, my method worked. But it had problems.

First, the number of listings was limited. My data didn't include all the available job postings in Seattle; the results returned were only a subset of the whole. I wanted to cast as wide a net as possible, so I needed a list of every available position.

Second, I realized that the RSS feed didn't contain any contact information. That was a bit of a letdown: I could find the listings, but I couldn't contact the posters unless I filtered through them manually.

A screenshot of the Craigslist reply link

I am a person of many skills and interests, but repetitive manual work is not one of them. I could have hired someone to do it for me, but I was barely scraping by on one-dollar ramen, which meant I couldn't splurge on a side project.

It was a dead end, but not the end.

Continuous iteration

From my first failed attempt, I learned that Craigslist has an RSS feed and that every entry in it links to the actual posting.

Great: if I could get to the actual posting, then maybe I could scrape its email address? 🧐 That meant I needed a way to pull email addresses out of the original postings.

Once again I turned to trusty Google and searched for "ways to parse a website."

On Google, I found a cool little Python library called Beautiful Soup. Essentially, it's a great tool that lets you parse an entire DOM tree and make sense of a web page's structure.

My needs were simple: I needed an easy-to-use tool that could collect data from web pages. BeautifulSoup checked both boxes. Rather than spend more time picking the perfect tool, I chose one that worked and moved on. Here is a list of alternatives that do something similar.

BeautifulSoup homepage

Tip: I found an excellent guide that describes how to use Python and BeautifulSoup for web scraping. If you are interested in learning how to scrape the web, I suggest you read it.

With this new tool, my workflow was set.

My Workflow

I was now ready for my next task: scraping email addresses from the actual postings.

Now, here's the cool thing about open source technology: it's free and it works great! It's like getting free ice cream and a freshly baked chocolate chip cookie on a hot summer day.

BeautifulSoup lets you search for specific HTML tags, or markers, on a web page. And Craigslist structures its pages so consistently that finding email addresses was a breeze. The tag was something like "email-reply-link," which basically signals that an email link is available.

From there, everything was easy. I relied on BeautifulSoup's built-in functions, and with a few simple operations I could easily pluck the email address out of a Craigslist posting.
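
To make that concrete, here is a sketch of what such an extraction might look like with requests and BeautifulSoup. The "email-reply-link" class comes from the description above; the real Craigslist markup may differ, and the function name is my own:

```python
# A sketch of pulling the reply address out of a single Craigslist posting.
# The <a class="email-reply-link"> structure follows the article's
# description and may not match the real page markup.
import requests
from bs4 import BeautifulSoup

def extract_email(posting_url):
    """Return the mailto address from a posting page, or None if absent."""
    html = requests.get(posting_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("a", class_="email-reply-link")
    if link and link.get("href", "").startswith("mailto:"):
        return link["href"][len("mailto:"):]
    return None
```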

Putting it all together

In less than an hour, I had my first MVP: a web crawler that collected email addresses and responded to people looking for software engineers within a 100-mile radius of Seattle.

Code screenshots

I added various extras to the original script to make it nicer to use. For example, I saved the results to a CSV file and an HTML page so that I could parse them more quickly.
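
Saving results like these takes little more than Python's standard csv module. A sketch, with illustrative field names (the article doesn't show its actual schema):

```python
# A sketch of dumping scraped results to a CSV file.
# The field names are illustrative assumptions.
import csv

def save_results(rows, path="listings.csv"):
    """rows is a list of dicts like {"title": ..., "link": ..., "email": ...}."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "link", "email"])
        writer.writeheader()
        writer.writerows(rows)
```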

Of course, there were many other noteworthy features and limitations, such as:

  • The ability to log which email addresses I had already sent messages to
  • Throttling rules to prevent edge cases, such as emailing people I had already contacted
  • Some email addresses required a captcha before they would be displayed, to deter bots (like mine)
  • Craigslist doesn't allow crawlers on its pages, so if I ran the script too often I got banned. (I tried to "trick" Craigslist by switching between different VPNs, but it didn't work.)
  • I still couldn't retrieve every posting on Craigslist

The last one was the kicker. But I figured that if a posting had been up for a while, the person who posted it might not be looking anymore. It was a tradeoff, but one I felt I could live with.
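
The logging and throttling extras in the list above boil down to remembering who has already been emailed. A minimal sketch of that idea follows; the file format and function names are my own invention, not the author's code:

```python
# A sketch of the throttling rule: keep contacted addresses on disk
# so that repeat listings never get a second email.
def load_contacted(path="contacted.txt"):
    """Return the set of addresses already emailed."""
    try:
        with open(path) as f:
            return set(line.strip() for line in f)
    except FileNotFoundError:
        return set()

def mark_contacted(email, path="contacted.txt"):
    """Append a newly contacted address to the log."""
    with open(path, "a") as f:
        f.write(email + "\n")
```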

The whole experience was like a game of Tetris. I knew what my end goal was; the real challenge was fitting the right pieces together to reach it. Each piece of the puzzle took me on a different journey. It was challenging, but I enjoyed it, and I learned something new every time.

Lessons learned

It was an eye-opening experience. I ended up learning a bit about how the Internet (and Craigslist) works, and about how various tools can work together to solve a problem, and I got a cool little story to share with my friends.

In a sense, it's like how technology works today. You discover a huge, complex problem that you need to solve, and you don't see any immediate, obvious solution. So you break the large, complex problem down into manageable chunks, and then solve each chunk in turn.

Looking back, my problem was this: how can I use this excellent directory on the Internet to quickly reach people with a specific interest? There was no known product or solution at the time, so I broke the problem down into parts (see the sketch after this list):

  1. Find all the listings on the platform
  2. Collect contact information for each listing
  3. Send an email whenever contact information is available
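
In skeleton form, those three steps might fit together as in the sketch below, reusing the helpers sketched earlier; send_reply() is a hypothetical stand-in for the emailing step, which the article doesn't show:

```python
# A hypothetical skeleton of the three-step breakdown above,
# reusing fetch_listings, extract_email, load_contacted, and
# mark_contacted from the earlier sketches.
def run(feed_url):
    contacted = load_contacted()                   # previously emailed people
    for title, link in fetch_listings(feed_url):   # 1. find all the listings
        email = extract_email(link)                # 2. collect contact info
        if email and email not in contacted:       # 3. email when available
            send_reply(email, title)               # hypothetical send step
            mark_contacted(email)
            contacted.add(email)
```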

That's all there is to it. The technology was only a means to an end. If I could have done it with an Excel spreadsheet, I would have. But I'm no Excel master, so I took the approach that made the most sense to me at the time.

There is room for improvement

There are many things I can improve on:

  • I started with a language I wasn't familiar with, so there was a learning curve at first. It wasn't too bad, though, because Python is easy to learn, and I strongly recommend it as a first language for any software enthusiast.
  • I relied too heavily on open source technology, and open source software comes with its own set of problems. Several of the libraries I used were no longer in active development, so I ran into trouble early on: I couldn't import some libraries, and others failed for seemingly innocuous reasons.
  • Working on a project by yourself can be fun, but it can also be stressful. You need a lot of motivation to get things done. The project was quick and easy to start, but I still spent a few weekends polishing it. As it dragged on, I started to lose motivation, and when I landed a job I gave the project up completely.

Resources and tools I use

The Hitchhiker's Guide to Python — overall, a great resource for learning Python. I recommend Python as a first programming language for beginners, and in my article I discuss how I used it to land offers from several top companies.

BeautifulSoup — the utility I used to build my web crawler.

Web Scraping with Python — a practical guide to learning how to use Python for web scraping.

Lean Startup — From this book, I learned the idea of rapid prototyping and creating an MVP to test. I think the ideas here apply to many different areas, and it helped me with this project.

Evernote — I used Evernote to organize my thoughts for this article. Highly recommended — I use it for everything I do.

My Laptop — this is my current laptop at home, set up as a workstation. It's much easier to use than the old BenQ, but both are fine for everyday programming work.

Credits:

Brandon O'Brien, my mentor, who proofread this article and provided valuable feedback on how to improve it.

Leon Tager, my colleague and friend, who guided and inspired me with much-needed financial wisdom.

You can sign up for industry news and random tidbits, and be the first to know when I publish a new article.


Zhia Chong is a software engineer at Twitter. He works on the ads measurement team in Seattle, measuring advertisers' impact and return on investment. The team is hiring!

You can find him on Twitter and LinkedIn.

Thanks to Open Source Portfolio.

If you find any errors in the translation or other areas that need improvement, you are welcome to revise the translation and open a PR in the Juejin Translation Project, for which you can also earn corresponding bonus points. The permalink at the top of this article is the link to this article's Markdown file on GitHub.


The Juejin Translation Project is a community that translates high-quality technical articles from around the Internet, sourced from English articles shared on Juejin. The content covers Android, iOS, front end, back end, blockchain, product, design, artificial intelligence, and other fields. For more high-quality translations, please follow the Juejin Translation Project, its official Weibo account, and its Zhihu column.