There are only two kinds of programming languages: the ones people complain about and the ones nobody uses.


First, choose a site to crawl. For practice we will use Quotes to Scrape, a sandbox site whose data comes from GoodReads. It is maintained by the company behind Python's open-source crawler framework Scrapy (Scrapinghub, now Zyte) and is the site Scrapy recommends for practicing crawlers.

Initialize the project

Execute the following commands in order.

  1. Generate the package.json file: npm init -y

  2. Generate the tsconfig.json configuration file: tsc --init

  3. Install ts-node: npm install ts-node --save-dev

  4. Install the TypeScript dependency: npm install typescript --save-dev

  5. Install superagent and its type declarations: npm install superagent @types/superagent --save

    @types/superagent is the .d.ts type declaration package for superagent. Since superagent is written in JavaScript, TypeScript relies on .d.ts files to understand what its .js files export; the .d.ts declarations supply type completion and type information for the methods and properties exported by superagent's .js code (see the sketch after this list).

    💡 Superagent is a lightweight HTTP request library based on Node.js

  6. Install cheerio and its type declarations: npm install cheerio @types/cheerio --save

    💡 Cheerio is a lightweight implementation of jQuery's core functionality (DOM manipulation), mainly used for DOM manipulation on the server

  7. Create a src directory and create an index.ts file in it

    // src/index.ts
    console.log("hello world");
  8. Add a dev startup script to package.json

    // package.json
    "scripts": {
      "dev": "ts-node src/index.ts"
    }
  9. Run npm run dev in the console; when "hello world" appears, initialization is complete.
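
To make the idea from step 5 concrete, here is a minimal, hypothetical declaration sketch; the module name and function below are invented for illustration and are not part of superagent's actual typings:

// example.d.ts: a hypothetical declaration file for a plain-JS module.
// It only describes types; the implementation stays in the library's .js files.
declare module 'some-js-lib' {
  export function fetchText(url: string): Promise<string>;
  export const version: string;
}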

Write the crawler

Create a Crawler class

// src/index.ts
class Crawler {
  private page = 1;
  private url = `http://quotes.toscrape.com/page/${this.page}/`;

  constructor() {
    console.log(this.url);
  }
}

const crawler = new Crawler();

Use superagent to send requests

// src/index.ts
import superagent from 'superagent';

class Crawler {
  private page = 1;
  private url = `http://quotes.toscrape.com/page/${this.page}/`;
  private html = '';

  async getHtml() {
    const result = await superagent.get(this.url);
    console.log(result.text);
    this.html = result.text;
  }

  constructor() {
    this.getHtml();
  }
}

const crawler = new Crawler();

Because superagent.get() returns a Promise, getHtml() is declared async and uses await.
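
The same request could also be written in Promise-chaining style; await is simply a more readable way to consume the Promise. A minimal sketch:

// Promise-chaining equivalent of the awaited call above
superagent
  .get('http://quotes.toscrape.com/page/1/')
  .then((result) => {
    console.log(result.text); // raw HTML of the page
  })
  .catch((err) => console.error(err));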

Run npm run dev; the console prints the page's HTML, which means the request succeeded.

Fetch data using Cheerio

Cheerio's API is basically the same as jQuery's, so it has the same quirk as jQuery: a selection is not a plain array, so you cannot loop over it with Array.prototype.map(); instead you use Cheerio's own .map() (whose callback receives (index, element)) and call .get() to convert the result into an ordinary array.
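
A minimal, standalone sketch of that difference (separate from the crawler code below):

// Cheerio's .map() mirrors jQuery's: the callback receives (index, element),
// and the result stays a Cheerio collection until .get() unwraps it into a plain array.
import cheerio from 'cheerio';

const $ = cheerio.load('<ul><li>a</li><li>b</li></ul>');
const texts = $('li').map((i, el) => $(el).text()).get(); // ['a', 'b']
console.log(texts);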

// src/index.ts
import superagent from 'superagent';
import cheerio from 'cheerio';

interface Quote {
  text: string;
  author: string;
  tagList: Array<string>;
}

interface QuoteData {
  time: number;
  data: Array<Quote>;
}

class Crawler {
  private page = 1;
  private url = `http://quotes.toscrape.com/page/${this.page}/`;

  constructor() {
    this.init();
  }

  async init() {
    const html = await this.getHtml();
    const quoteList = this.getQuoteData(html);
    console.log(quoteList);
  }

  async getHtml() {
    const result = await superagent.get(this.url);
    return result.text;
  }

  getQuoteData(html: string) {
    const $ = cheerio.load(html);
    // each .quote element on the page holds one quote
    const quotes = $('.quote');
    const quoteList: Array<Quote> = [];
    quotes.map((index, element) => {
      const text = $(element).find('.text').text();
      const author = $(element).find('.author').text();
      const tags = $(element).find('.tag');
      // collect the tag texts into a plain string array
      const tagList = tags
        .map((tagIndex, tagElement) => $(tagElement).text())
        .get() as Array<string>;
      quoteList.push({
        text,
        author,
        tagList,
      });
    });
    return {
      time: new Date().getTime(),
      data: quoteList,
    };
  }
}

const crawler = new Crawler();

After npm run dev executes, the console outputs the list of quotes scraped from the site.
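
For reference, the logged QuoteData object has roughly this shape (the values below are placeholders, not real scraped output):

// Illustrative shape only; placeholder values
const example: QuoteData = {
  time: 1600000000000,
  data: [
    { text: '<quote text>', author: '<author name>', tagList: ['<tag1>', '<tag2>'] },
  ],
};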

Save the data to a JSON file

Create a data folder in the project root. Each time data is fetched, save it into the data folder as a file named in the format quotes_[timestamp].json.

// src/index.ts
import superagent from 'superagent';
import cheerio from 'cheerio';
import fs from 'fs';
import path from 'path';

interface Quote {
  text: string;
  author: string;
  tagList: Array<string>;
}

interface QuoteData {
  time: number;
  data: Array<Quote>;
}

class Crawler {
  private page = 1;
  private url = `http://quotes.toscrape.com/page/${this.page}/`;

  constructor() {
    this.init();
  }

  async init() {
    const html = await this.getHtml();
    const quoteList = this.getQuoteData(html);
    this.saveJson(quoteList);
  }

  async getHtml() {
    const result = await superagent.get(this.url);
    return result.text;
  }

  getQuoteData(html: string) {
    const $ = cheerio.load(html);
    const quotes = $('.quote');
    const quoteList: Array<Quote> = [];
    quotes.map((index, element) => {
      const text = $(element).find('.text').text();
      const author = $(element).find('.author').text();
      const tags = $(element).find('.tag');
      const tagList = tags
        .map((tagIndex, tagElement) => $(tagElement).text())
        .get() as Array<string>;
      quoteList.push({
        text,
        author,
        tagList,
      });
    });
    return {
      time: new Date().getTime(),
      data: quoteList,
    };
  }

  saveJson(quoteInfo: QuoteData) {
    // write the quotes as pretty-printed JSON into the data directory
    const filePath = path.resolve(__dirname, `../data/quotes_${quoteInfo.time}.json`);
    fs.writeFileSync(filePath, JSON.stringify(quoteInfo, null, 2));
  }
}

const crawler = new Crawler();

With that, a simple TypeScript crawler is complete.