Preface

I recently got into TypeScript, learned the basics, and had some free time over the weekend to see what I could build with it. Some friends had been using it to write crawlers, and the topic interested me, so I tried writing a simple crawler myself. Today I'd like to share it with you.

Preparation

Before writing a crawler, we need to do some preparatory work. Let’s take a look at the following:

The target

The figure below shows the content we will crawl this time: we need to get the information for each course and finally write it into a JSON file. The destination URL is the course listing page used in the code below.

Environment setup

Start by creating a new folder, then initialize the project by executing the following commands in sequence:

npm init -y

Generate the TypeScript configuration file tsconfig.json:

tsc --init

Install TypeScript:

npm install typescript -D

Next, let's install a few more modules the project needs:

  • ts-node (runs TS directly, saving us from compiling it for debugging)
  • superagent (for sending requests)
  • cheerio (for parsing the retrieved HTML structure)

Modules such as superagent are written in JavaScript, so TypeScript cannot understand them on its own; each needs a "translation file", known here as a type declaration module. Let's install the two declaration modules we need:

  • @types/superagent
  • @types/cheerio
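To get a feel for what these @types packages contain, here is a minimal hand-written declaration sketch. The module name some-js-lib and its API are made up for illustration; the real @types/superagent files are far richer:

```typescript
// A hypothetical declaration file, e.g. types/some-js-lib/index.d.ts.
// It contains no runtime code -- only type information that lets the
// TypeScript compiler understand a plain-JavaScript module.
declare module 'some-js-lib' {
  // Shape of the object the library's get function resolves with
  export interface Response {
    text: string;
    status: number;
  }
  // The library exports a get function returning a promise
  export function get(url: string): Promise<Response>;
}
```

With this declaration in place, `import { get } from 'some-js-lib'` would type-check even though the library itself ships no types.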

Once the dependencies listed above are installed, our environment setup is essentially complete.

Project launch test

Let’s modify the package.json scripts configuration:

"scripts": {
  "dev": "ts-node ./src/app.ts"
}

Create an app.ts file in src and write:

console.log('I want to write a crawler.');

Then run npm run dev; if you see the output "I want to write a crawler.", everything is working.

Analyze web pages and capture data

Analyze page tag structure

Before writing the crawler, we need to analyze the structure of the target page and observe which tags hold the information we need, so that we can crawl more accurately. In our example, pick any course, right-click, and choose Inspect, as shown below. We can see that each course sits inside an li tag:

  • The course name is stored in the data-name attribute of the corresponding li tag
  • Course enrollment information is stored in the p tag with class="one"
  • Course price information is stored in the span tag with class="price", inside the p tag with class="two"
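To make the mapping concrete, here is a hand-written approximation of one course item (the exact markup is an assumption, not copied from the page), with the three pieces of information pulled out by plain regexes just for illustration; the real crawler uses cheerio selectors instead:

```typescript
// Hand-written sample of one course item, approximating the structure
// described above (the real page's markup may differ).
const sampleLi = `
<li data-name="TypeScript Basics">
  <p class="one">1234 people enrolled</p>
  <p class="two"><span class="price">¥99</span></p>
</li>`;

// Pull each field out of this one snippet with simple regexes,
// mirroring what the selectors will do later.
const name = sampleLi.match(/data-name="([^"]+)"/)?.[1];
const enrolled = sampleLi.match(/class="one">([^<]+)</)?.[1];
const price = sampleLi.match(/class="price">([^<]+)</)?.[1];

console.log(name, enrolled, price);
```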

Fetching the data

After the above analysis, we can start writing our code. First, import the modules we need:

import superagent from 'superagent';
import cheerio from 'cheerio';
const fs = require('fs'); // will be used to write files later
  • Let's start by defining a class with a method that fetches the target page. We use superagent to make the request and ES6 async/await to handle the asynchronous operation. If we print the result after the request completes, we'll find that the HTML structure is stored in the text property of the returned object, so we simply return that value.
class Grabcourse {
  // Store the destination web page address
  private url: string = 'https://coding.imooc.com/?c=fe&sort=0&unlearn=0&page=1';
  // Get the HTML structure of the page to be crawled
  async getHtml() {
    const courseHtml = await superagent.get(this.url);
    return courseHtml.text;
  }
  constructor() {
    this.getHtml();
  }
}

new Grabcourse();
  • Next we need to parse the HTML we fetched. The cheerio module makes it easy to pick out the tags we want, because it supports jQuery-style syntax; anyone familiar with jQuery will pick it up quickly.

    We define another method on the class that parses the retrieved HTML: it accepts a string parameter and returns the parsed result.
async loadhtml(html: string) {
  return cheerio.load(html);
}
  • Now we need to extract the information for each course, so we define another method on the class. The '.course-list li' selector matches every course; we iterate over the matches, pull out the other values we need, and store them in an array. I set the function's parameter type to any, otherwise TS throws a warning. We also need an array to hold the course information, and since we know each item has only a course name, course enrollment information, and a price, we can define an interface:
interface Course {
  courseName: string;
  courseType: string;
  coursePrice: string;
}

Then add the following line to the class so that every item in the array must be of the Course type:

private courseItems: Course[] = [];
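A quick standalone check of how the interface constrains the array (the type names mirror the ones above; the sample values are made up):

```typescript
interface Course {
  courseName: string;
  courseType: string;
  coursePrice: string;
}

const courseItems: Course[] = [];

// Every pushed item must have exactly these three string fields;
// a missing or misspelled field is a compile-time error.
courseItems.push({
  courseName: 'TS Crawler in Action', // made-up sample value
  courseType: '1024 people enrolled', // made-up sample value
  coursePrice: '¥199',                // made-up sample value
});

// courseItems.push({ courseName: 'x' }); // would NOT compile: fields missing

console.log(JSON.stringify(courseItems));
```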

The code for getting all the course information:

// Get course information
  async getCourseInfo($element: any) {
    $element('.course-list li').each((idx: any, ele: any) => {
      const courseName = $element(ele).attr('data-name');
      // Strip whitespace with replace
      const courseType = $element(ele).find('.one').text().replace(/\s/g, '');
      const coursePrice = $element(ele).find('.two .price').text();
      this.courseItems.push({
        courseName,
        courseType,
        coursePrice,
      });
    });
    return this.courseItems;
  }
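The enrollment text comes back padded with whitespace from the page layout, and a global \s replace cleans it up. A standalone check (the sample value is made up):

```typescript
// Enrollment text as it might come back from .text(): padded with
// newlines and spaces from the page layout (made-up sample value).
const raw = '\n    1234 people enrolled   \n';

// A global \s replace removes every whitespace character.
const cleaned = raw.replace(/\s/g, '');

console.log(cleaned); // -> "1234peopleenrolled"
```

Note that this removes all whitespace, including the spaces between words, which is fine for the compact enrollment strings on the target page.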

Next we need a method that saves the course information we've collected; again, we define it on the class:

// Save the lessons obtained
async saveCourseItems(result: Course[]) {
  const data = {
    course: result
  };
  // Save the logic
  fs.writeFile('./course.json', JSON.stringify(data), (err: any) => {
    if (err) {
      console.error(err);
      return;
    }
    console.log('File write succeeded.');
  });
}
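The same save-and-reload logic can be checked in isolation. This sketch writes to a temp directory instead of ./course.json, using the synchronous fs APIs for brevity (the sample payload is made up):

```typescript
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

// Made-up sample payload matching the { course: [...] } shape used above
const data = {
  course: [{ courseName: 'Demo', courseType: '10 people', coursePrice: '¥1' }]
};

// Write to a fresh temp file so we don't touch the real course.json
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'grab-'));
const file = path.join(dir, 'course.json');
fs.writeFileSync(file, JSON.stringify(data));

// Read it back and parse, confirming the round trip
const parsed = JSON.parse(fs.readFileSync(file, 'utf8'));
console.log(parsed.course[0].courseName);
```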

Following the principle of high cohesion and low coupling, we defined each step as a separate method. Now we tie these methods together in a single method and call it from the constructor, so that when we new Grabcourse, the crawling steps run one after another. It also makes the code easier to read.

async initSpride() {
  const html = await this.getHtml();
  const $element = await this.loadhtml(html);
  const courseItems = await this.getCourseInfo($element);
  this.saveCourseItems(courseItems);
}
constructor() {
  this.initSpride();
}
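One thing worth noting about this pattern: a constructor cannot be async, so initSpride is simply kicked off and its promise is not awaited. A minimal sketch of the same fire-and-forget pattern, with a made-up fetch step standing in for the real request:

```typescript
class Demo {
  result = '';

  // Stand-in for a real network request (made up for illustration)
  async fetchData(): Promise<string> {
    return 'html';
  }

  async init() {
    this.result = await this.fetchData();
  }

  constructor() {
    // Fire-and-forget: the constructor returns before init() finishes,
    // so callers who need the result must await init() themselves.
    this.init();
  }
}

const d = new Demo();
// d.result is still '' at this point; it is filled in once init() resolves
d.init().then(() => console.log(d.result)); // -> "html"
```

This is fine for a run-to-completion script like ours, but in a larger program you would usually expose the init promise so callers can await it.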

Finally, run npm run dev and you'll see that a course.json file is generated. Open it and you'll see the data shown in the figure below, which means our crawler worked.

The complete code

import superagent from 'superagent';
import cheerio from 'cheerio';
const fs = require('fs');

interface Course {
  courseName: string;
  courseType: string;
  coursePrice: string;
}

class Grabcourse {
  private url: string = 'https://coding.imooc.com/?c=fe&sort=0&unlearn=0&page=1';
  private courseItems: Course[] = [];

  // Get the HTML structure of the page to be crawled
  async getHtml() {
    const courseHtml = await superagent.get(this.url);
    return courseHtml.text;
  }
  // Parse the HTML
  async loadhtml(html: string) {
    return cheerio.load(html);
  }
  // Get course information
  async getCourseInfo($element: any) {
    $element('.course-list li').each((idx: any, ele: any) => {
      const courseName = $element(ele).attr('data-name');
      const courseType = $element(ele).find('.one').text().replace(/\s/g, '');
      const coursePrice = $element(ele).find('.two .price').text();
      this.courseItems.push({
        courseName,
        courseType,
        coursePrice,
      });
    });
    return this.courseItems;
  }
  // Save the lessons obtained
  async saveCourseItems(result: Course[]) {
    const data = {
      course: result
    };
    // Save the logic
    fs.writeFile('./course.json', JSON.stringify(data), (err: any) => {
      if (err) {
        console.error(err);
        return;
      }
      console.log('File write succeeded.');
    });
  }
  async initSpride() {
    const html = await this.getHtml();
    const $element = await this.loadhtml(html);
    const courseItems = await this.getCourseInfo($element);
    this.saveCourseItems(courseItems);
  }
  constructor() {
    this.initSpride();
  }
}

new Grabcourse();