This article covers crawling the list of universities in each province, plus the Baidu Baike (Baidu Encyclopedia) details for each university.

Create the university entity class

The entity has quite a few fields, because we want to display fairly detailed university data.

import { Entity, Column, PrimaryGeneratedColumn } from 'typeorm';

@Entity()
export class College {

  @PrimaryGeneratedColumn()
  id: number;

  @Column()
  name: string;

  @Column({
    nullable: true,
  })
  createdYear: string;

  @Column({
    nullable: true
  })
  type: string;


  @Column({
    nullable: true,
    type: 'simple-array',
  })
  tags: string[];

  @Column({
    nullable: true
  })
  category: string;


  @Column({
    nullable: true,
    type: 'simple-array',
  })
  schoolFellow: string[];

  @Column({
    nullable: true
  })
  department: string;

  @Column({
    nullable: true
  })
  website: string;

  @Column({
    nullable: true,
  })
  code: string;

  @Column({
    nullable: true
  })
  motto: string;

  @Column({
    nullable: true
  })
  location: string;

  @Column({
    nullable: true
  })
  provinceAbbr: string;

  @Column({
    nullable: true
  })
  create_ts: string;

  @Column({
    nullable: true
  })
  update_ts: string;


  @Column({
    nullable: true,
    type: 'text',
  })
  des: string;

}

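A quick note on the 'simple-array' column type used above: TypeORM persists the whole array in a single text column, joined by commas. The sketch below is an illustration of that idea only (not TypeORM's internals), with hypothetical helper names:

```typescript
// Illustration only (not TypeORM internals): a 'simple-array' column
// persists a string[] as one comma-separated text value, roughly like this.
function toSimpleArray(values: string[]): string {
  return values.join(',');
}

function fromSimpleArray(raw: string): string[] {
  return raw === '' ? [] : raw.split(',');
}

const stored = toSimpleArray(['985', '211']); // "985,211"
const restored = fromSimpleArray(stored);     // ["985", "211"]
```

One consequence of this storage scheme is that the values themselves must not contain commas, so it suits short tags like the ones we store here.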

Crawl the list of colleges

Let's crawl the university list for each province from https://daxue.eol.cn/ and extract just a few simple fields: name, location, provinceAbbr, and create_ts. Then this.collegeRepository.save(collegeListInfo) stores each university in the database.

Note that the each callback that parses the DOM is an async anonymous function, because saving to the database inside it uses await.

  async crawlerCollegeaNameByProvince(province) {
    // wrap the crawler in a Promise so the service method can be awaited
    const crawl = () => {
      return new Promise((resolve, reject) => {
        const c = new Crawler({
          maxConnections: 10,
          callback: (error, res, done) => {
            if (error) {
              console.log(error);
              reject(error);
            } else {
              const $ = res.$;
              const items = [];
              $('.table-x').find('tbody').find('tr').each(async (idx, element) => {
                // the first two rows are table headers, skip them
                if (idx === 0 || idx === 1) { return; }
                const $element = $(element);
                const collegeListInfo = {
                  name: $element.find('td').eq(1).text(),
                  location: $element.find('td').eq(4).text(),
                  provinceAbbr: province,
                  create_ts: moment().format('YYYY-MM-DD HH:mm:ss'),
                };
                await this.collegeRepository.save(collegeListInfo);
                items.push(collegeListInfo);
              });
              resolve(items);
            }
            done();
          },
        });
        c.queue(`https://daxue.eol.cn/${province}.shtml`);
      });
    };
    return await crawl();
  }
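One subtlety worth knowing: a jQuery-style each() does not wait for async callbacks, so the promise above may resolve before every save has actually finished. A minimal sketch of a safer pattern, using plain arrays rather than the crawler API (the function and names here are illustrative, not from the article's code):

```typescript
// Sketch: collect the per-row promises and await them all before returning,
// instead of relying on each()'s async callbacks having finished.
async function saveAll(names: string[]): Promise<string[]> {
  const saved: string[] = [];
  const pending: Promise<void>[] = [];
  names.forEach((name) => {
    pending.push(
      (async () => {
        await Promise.resolve(); // stand-in for repository.save()
        saved.push(name);
      })(),
    );
  });
  await Promise.all(pending); // now every "save" is done
  return saved;
}
```

For our purposes the rows still get saved either way, since each insert is independent; collecting the promises only matters if callers need the resolved list to be complete.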

We can pass the province parameter in the controller:

 @Post('crawlerCollegeList')
  reptile(@Body('province') province: string) {
    return this.collegeService.crawlerCollegeaNameByProvince(province);
  }

Crawl Baidu Baike university details

Some sites set up checks to block crawlers, so we need to add request headers that disguise the crawler as a normal browser. The header contents can be copied from the browser's developer tools and added to our crawler code. Accept-Encoding tells the server which content encodings the client can accept; common values are gzip, deflate, and br.

For the Baidu Baike crawler, we need to set Accept-Encoding to br, otherwise parsing always fails with an error.


  async crawlerCollegeInfo() {
    const crawlerFun = (name, id) => {
      return new Promise((resolve, reject) => {
        const c = new Crawler({
          rateLimit: 4000,
          maxConnections: 1,
          callback: (error, res, done) => {
            if (error) {
              reject(error);
            } else {
              const $ = res.$;
              const collegeBasicInfo: College = new College();
              // concatenate the summary paragraphs into the description
              $('.lemma-summary').find('.para').each((index, element) => {
                const $element = $(element);
                if (collegeBasicInfo.des) {
                  collegeBasicInfo.des = collegeBasicInfo.des + $element.text() + '\n';
                } else {
                  collegeBasicInfo.des = $element.text() + '\n';
                }
              });
              collegeBasicInfo.website = $('.baseBox').find('.dl-baseinfo').last().find('dl').last().find('dd').find('a').text();
              // walk the info-box labels and match them against our field map
              const allParamsElement = $('.basic-info').find('dt');
              allParamsElement.each((index, element) => {
                const $element = $(element);
                const labelTitle = $element.text().replace(/\s+/g, '');
                for (const key in ParamCrawler) {
                  if (labelTitle === ParamCrawler[key]) {
                    if (key === 'createdYear') {
                      // keep only the leading year portion of the value
                      collegeBasicInfo.createdYear = $element.next().text().substr(0, 5);
                    } else {
                      collegeBasicInfo[key] = $element.next().text().replace(/[\r\n]/g, '');
                    }
                  }
                }
              });
              resolve(collegeBasicInfo);
            }
            done();
          },
        });
        c.queue({
          uri: `https://baike.baidu.com/item/${name}`,
          headers: {
            'accept-encoding': 'br',
          },
        });
      });
    };
    const response = await this.collegeRepository.find();
    response.forEach(async (item) => {
      const basicInfo = await crawlerFun(encodeURIComponent(item.name), item.id);
      await this.collegeRepository.update(item.id, basicInfo);
    });
  }

As you can see, we first fetch the list of universities and then crawl the Baidu Baike page for each one. This approach is admittedly rough, but since we are just learning how to use the crawler, we keep it simple for now: each crawl resolves with a processed entity object that is written back to the database.
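One rough edge is that forEach fires all the update promises at once. A hedged sketch of processing the list strictly one item at a time, which is gentler on the crawled site (a generic helper introduced here for illustration, not part of the article's code):

```typescript
// Sketch: run an async worker over items strictly one at a time,
// so each item is fully handled before the next request starts.
async function processSequentially<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (const item of items) {
    results.push(await worker(item)); // finish this item before the next
  }
  return results;
}
```

In our case the crawler's own rateLimit already throttles the HTTP requests, so this mainly matters if you want the surrounding method to finish only after every update is done.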

Let’s add a get list interface:

 controller:
 @Post('list')
  findAll(@Body() condition: Object): Promise<College[]> {
    return this.collegeService.findAll(condition);
  }
  
  service:
  findAll(condition: Object): Promise<College[]> {
    return this.collegeRepository.find({
      where: condition
    });
  }

Now we can simply POST a filter object, such as a province abbreviation, to the /college/list interface to get the matching results.
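Conceptually, find({ where: condition }) matches rows whose columns equal every key in the condition. An in-memory sketch of that matching, for illustration only (not TypeORM's implementation):

```typescript
// Sketch of equality matching like find({ where: condition }):
// a row matches when every key in the condition equals the row's value.
function matchesWhere<T extends Record<string, unknown>>(
  row: T,
  condition: Partial<T>,
): boolean {
  return Object.entries(condition).every(([key, value]) => row[key] === value);
}

const rows = [
  { name: 'Peking University', provinceAbbr: 'bj' },
  { name: 'Fudan University', provinceAbbr: 'sh' },
];
const matched = rows.filter((row) => matchesWhere(row, { provinceAbbr: 'bj' }));
```

An empty condition matches every row, which is why POSTing {} to /college/list returns the full table.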

Update/delete/obtain specific university information

Sometimes we need to update the data manually because of errors or inaccuracies. So how do we build add, delete, update, and query (CRUD) REST APIs in Nest?

  controller:
  @Get('detail/:id')
  findOne(@Param('id') id: string): Promise<College> {
    return this.collegeService.findOne(id);
  }

  @Put(':id')
  update(
    @Param('id') id: string,
    @Body() updateCollege: College,
  ): Promise<void> {
    return this.collegeService.update(id, updateCollege);
  }

  @Delete(':id')
  remove(@Param('id') id: string): Promise<void> {
    return this.collegeService.remove(id);
  }
  
  service:
  async update(id, updateCollege: College): Promise<void> {
    delete updateCollege.id;
    updateCollege.update_ts = moment().format('YYYY-MM-DD HH:mm:ss');
    await this.collegeRepository.update(id, updateCollege);
  }

  async findOne(id: string): Promise<College> {
    await this.checkCollegeExist(id);
    return this.collegeRepository.findOne(id);
  }

  async remove(id: string): Promise<void> {
    await this.checkCollegeExist(id);
    await this.collegeRepository.delete(id);
  }
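The delete updateCollege.id line in the update service matters: without it, the request body could overwrite the primary key of the row being updated. A small sketch of the same idea as a pure helper (a hypothetical function, not from the article's code):

```typescript
// Sketch: strip the primary key from the patch before calling update(),
// mirroring `delete updateCollege.id` in the service above.
function toUpdatePatch<T extends { id?: number }>(entity: T): Omit<T, 'id'> {
  const { id, ...patch } = entity;
  return patch;
}

const examplePatch = toUpdatePatch({ id: 1, name: 'Tsinghua University' });
```

Compared with mutating the incoming object, this variant leaves the caller's object untouched, which can be a nicer contract for a service method.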

Finally, let's look at the university information we crawled: