Write a simple crawler

Deno's support for third-party libraries is still limited, so an extra step is needed to load them. JSPM can be used to import third-party libraries:

import  xxx from "https://dev.jspm.io/xxx";

1. Fetch the page HTML

/**
 * @description Load a page and convert it to an HTML string
 * @param url
 */
export async function loadHtml(url:string){
    const res = await fetch(url);
    const htmlData = new Uint8Array(await res.arrayBuffer());
    let htmlString = "";
    // Convert each byte to one character (only safe for single-byte text)
    for (let i = 0; i < htmlData.length; i++) {
        htmlString += String.fromCharCode(htmlData[i]);
    }
    return htmlString;
}
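The loop in loadHtml maps every byte to one character, which works for ASCII but splits multi-byte UTF-8 characters apart. A minimal sketch of an alternative, assuming the page is UTF-8 encoded, is to decode the bytes with the standard TextDecoder. The helper name `decodeHtml` and the default encoding are assumptions for illustration, not part of the original code.

```typescript
// Sketch: decode raw response bytes into a string with TextDecoder.
// `decodeHtml` and the default "utf-8" encoding are assumptions.
function decodeHtml(bytes: Uint8Array, encoding = "utf-8"): string {
  // TextDecoder reassembles multi-byte sequences that a per-byte
  // String.fromCharCode loop would turn into separate characters.
  return new TextDecoder(encoding).decode(bytes);
}

// Usage: "é" takes two bytes in UTF-8 (0xC3 0xA9).
const sampleBytes = new TextEncoder().encode("café");
console.log(decodeHtml(sampleBytes)); // "café"
```

If the target site uses another encoding, the second argument would need to match it.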

2. Import cheerio

Cheerio is similar to jQuery. DOM nodes are selected with a selector, and a node's attributes are read with attr:

import  cheerio from "https://dev.jspm.io/cheerio";

3. Capture page information

/**
 * Get the sub-URLs from an HTML string
 * @param htmlString 
 */
export function getSubUrl(htmlString:any){
    let $ = cheerio.load(htmlString);
    let lis = $(".pg-goods-list li a");
    let hrefList:any = [];
    lis.each((i:number,elem:any)=>{
        hrefList.push($(elem).attr('href'))
     });
    let subUrlList  = [...new Set(hrefList)];
    return subUrlList;
}
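The href values collected by getSubUrl may be relative paths or undefined, so they usually need to be resolved before they can be fetched. The following is a rough sketch of that step using the standard URL constructor. The function name `toAbsoluteUrls`, the `baseUrl` parameter, and the example.com addresses are hypothetical and not part of the original code.

```typescript
// Sketch: turn collected href values into unique absolute URLs.
// `toAbsoluteUrls` and `baseUrl` are hypothetical names for illustration.
function toAbsoluteUrls(hrefs: Array<string | undefined>, baseUrl: string): string[] {
  const seen = new Set<string>();
  for (const href of hrefs) {
    if (!href) continue; // attr('href') can return undefined
    try {
      // new URL(relative, base) resolves the href against the page URL
      seen.add(new URL(href, baseUrl).href);
    } catch {
      // skip values that cannot be parsed as a URL
    }
  }
  return Array.from(seen);
}

// Usage: duplicates and missing values are dropped.
const links = toAbsoluteUrls(
  ["/item/1", "/item/1", undefined, "https://example.com/item/2"],
  "https://example.com/list",
);
console.log(links);
```

A Set keeps the deduplication behaviour of the original function while the URL constructor handles the path resolution.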

4. Notes

Make sure to pass the --allow-net permission flag when you run the ts file. To allow only a single domain, use --allow-net=xxx.com.

deno run --allow-net .\src\job.ts