Foreword: Google and Baidu are already more than enough for everyday searching. The search engine implemented here is only a small exercise to make some follow-up work of my own more convenient.

Theoretical framework

To build a search engine, the first thing to consider is the overall architecture:

  • Page crawling
  • Storage
  • Analysis
  • Search implementation
  • Display

Page crawling

First, I will take the simplest approach and fetch pages with HttpClient. Some might say that this misses a lot of Web 2.0 sites whose content is rendered by JavaScript. Yes, that's right.

Storage

Then, storage. I will use the file system directly as the physical store and load all results into memory when the search runs. Some people might say this eats up memory; yes, it does, but I can throw a lot of swap space at it in exchange for performance.

Analysis

For the analysis part, I plan to run a word-segmentation algorithm over each page, compute word frequencies, and build an inverted index for the article. However, considering the access performance of the unoptimized file system, I will not store an inverted index over every word of every article. My plan is to take only the segmented words whose frequency falls roughly in the 20~50 range, plus the words from the site title, as that site's keywords, and build the inverted index from those. To make the description less blank and abstract, here is the final structure: each file is named after a segmented word, and the file stores all the website domains that contain that keyword. It's a bit like the storage underneath ElasticSearch, except that I haven't done any optimization.
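As a made-up illustration (the real layout is produced by the MDUtils.getMdContent helper used in the implementation below, which is not shown in this article), a keyword file such as 编程.md would contain one block per site, with a line starting with #### for the domain, followed by the title and the word-frequency list:

#### example-blog-a.com
Example Blog A 的编程笔记
[编程 35, 算法 22, 教程 18, ...]

#### example-site-b.com
...

When the index is loaded back (see loadWord below), only the #### lines are read, so the domain is the essential piece of information in each entry.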

Search implementation

For the search implementation, I plan to load the files above directly into memory and keep them in a HashMap for easy lookup.

Display

To make it easy to use on demand, I will present the results through a Chrome extension.

OK, now that the theoretical framework is ready, let's start implementing it.

Implementation

Page crawling

As mentioned, page fetching is done directly with HttpClient, but it also involves parsing the outbound links on each page. Before talking about link extraction, let me describe my crawling strategy.

Imagine the whole Internet as a huge graph in which websites are connected to one another by links. Although a large number of sites are isolated islands, that does not stop us from crawling the vast majority of them. So the scheme adopted here is a breadth-first traversal starting from multiple seed sites: for each site, only the home page is fetched, all outbound links on it are extracted, and those links become the next crawl targets.

The code for grabbing the page is as follows:

import com.chaojilaji.auto.autocode.generatecode.GenerateFile;
import com.chaojilaji.auto.autocode.standartReq.SendReq;
import com.chaojilaji.auto.autocode.utils.Json;
import com.chaojilaji.moneyframework.model.OnePage;
import com.chaojilaji.moneyframework.model.Word;
import com.chaojilaji.moneyframework.service.Nlp;
import com.chaojilaji.moneyframework.utils.DomainUtils;
import com.chaojilaji.moneyframework.utils.HtmlUtil;
import com.chaojilaji.moneyframework.utils.MDUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.stereotype.Service;
import org.springframework.util.StringUtils;

import java.io.*;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;

public class HttpClientCrawl {
    private static Log logger = LogFactory.getLog(HttpClientCrawl.class);

    public Set<String> oldDomains = new ConcurrentSkipListSet<>();
    public Map<String, OnePage> onePageMap = new ConcurrentHashMap<>(400000);
    public Set<String> ignoreSet = new ConcurrentSkipListSet<>();
    public Map<String, Set<String>> siteMaps = new ConcurrentHashMap<>(50000);

    public String domain;

    public HttpClientCrawl(String domain) {
        this.domain = DomainUtils.getDomainWithCompleteDomain(domain);
        // Domains that should never be crawled
        String[] ignores = {"gov.cn", "apac.cn", "org.cn", "twitter.com"
                , "baidu.com", "google.com", "sina.com", "weibo.com"
                , "github.com", "sina.com.cn", "sina.cn", "edu.cn", "wordpress.org", "sephora.com"};
        ignoreSet.addAll(Arrays.asList(ignores));
        loadIgnore();
        loadWord();
    }

    private Map<String, String> defaultHeaders() {
        Map<String, String> ans = new HashMap<>();
        ans.put("Accept", "application/json, text/plain, */*");
        ans.put("Content-Type", "application/json");
        ans.put("Connection", "keep-alive");
        ans.put("Accept-Language", "zh-CN,zh;q=0.9");
        ans.put("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36");
        return ans;
    }

    public SendReq.ResBody doRequest(String url, String method, Map<String, Object> params) {
        String urlTrue = url;
        SendReq.ResBody resBody = SendReq.sendReq(urlTrue, method, params, defaultHeaders());
        return resBody;
    }

    // Load previously ignored domains from disk
    public void loadIgnore() {
        File directory = new File(".");
        try {
            String file = directory.getCanonicalPath() + "/moneyframework/generate/ignore/demo.txt";
            BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(new File(file))));
            String line = "";
            while ((line = reader.readLine()) != null) {
                String x = line.replace("[", "").replace("]", "").replace(" ", "");
                String[] y = x.split(",");
                ignoreSet.addAll(Arrays.asList(y));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Load already-known domains from a seed file
    public void loadDomains(String file) {
        File directory = new File(".");
        try {
            File file1 = new File(directory.getCanonicalPath() + "\\" + file);
            logger.info(directory.getCanonicalPath() + "\\" + file);
            if (!file1.exists()) {
                file1.createNewFile();
            }
            BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file1)));
            String line = "";
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                OnePage onePage = new OnePage(line);
                if (!oldDomains.contains(onePage.getDomain())) {
                    onePageMap.put(onePage.getDomain(), onePage);
                    oldDomains.add(onePage.getDomain());
                }
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Store a page's keywords; each entry in s is "word count"
    public void handleWord(List<String> s, String domain, String title) {
        for (String a : s) {
            String x = a.split(" ")[0];
            String y = a.split(" ")[1];
            Set<String> z = siteMaps.getOrDefault(x, new ConcurrentSkipListSet<>());
            if (Integer.parseInt(y) >= 10) {
                if (z.contains(domain)) continue;
                z.add(domain);
                siteMaps.put(x, z);
                GenerateFile.appendFileWithRelativePath("moneyframework/domain/markdown", x + ".md", MDUtils.getMdContent(domain, title, s.toString()));
            }
        }
        // Words from the title are always treated as keywords
        Set<Word> xxxx = Nlp.separateWordAndReturnUnit(title);
        for (Word word : xxxx) {
            String x = word.getWord();
            Set<String> z = siteMaps.getOrDefault(x, new ConcurrentSkipListSet<>());
            if (z.contains(domain)) continue;
            z.add(domain);
            siteMaps.put(x, z);
            GenerateFile.appendFileWithRelativePath("moneyframework/domain/markdown", x + ".md", MDUtils.getMdContent(domain, title, s.toString()));
        }
    }

    // Rebuild siteMaps from the keyword files already on disk
    public void loadWord() {
        File directory = new File(".");
        try {
            File file1 = new File(directory.getCanonicalPath() + "\\moneyframework/domain/markdown");
            if (file1.isDirectory()) {
                int fileCnt = 0;
                File[] files = file1.listFiles();
                for (File file : files) {
                    fileCnt++;
                    try {
                        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
                        String line = "";
                        siteMaps.put(file.getName().replace(".md", ""), new ConcurrentSkipListSet<>());
                        while ((line = reader.readLine()) != null) {
                            line = line.trim();
                            if (line.startsWith("####")) {
                                siteMaps.get(file.getName().replace(".md", "")).add(line.replace("####", "").trim());
                            }
                        }
                    } catch (Exception e) {

                    }
                    if ((fileCnt % 1000) == 0) {
                        logger.info((fileCnt * 100.0) / files.length + "%");
                    }
                }
            }
            for (Map.Entry<String, Set<String>> xxx : siteMaps.entrySet()) {
                oldDomains.addAll(xxx.getValue());
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Breadth-first crawl starting from this.domain; only home pages are fetched
    public void doTask() {
        String root = "http://" + this.domain + "/";
        Queue<String> urls = new LinkedList<>();
        urls.add(root);
        Set<String> tmpDomains = new HashSet<>();
        tmpDomains.addAll(oldDomains);
        tmpDomains.add(DomainUtils.getDomainWithCompleteDomain(root));
        int cnt = 0;
        while (!urls.isEmpty()) {
            String url = urls.poll();
            SendReq.ResBody html = doRequest(url, "GET", new HashMap<>());
            cnt++;
            if (html.getCode().equals(0)) {
                // Unreachable; remember the domain so it is not tried again
                ignoreSet.add(DomainUtils.getDomainWithCompleteDomain(url));
                try {
                    GenerateFile.createFile2("moneyframework/generate/ignore", "demo.txt", ignoreSet.toString());
                } catch (IOException e) {
                    e.printStackTrace();
                }
                continue;
            }
            OnePage onePage = new OnePage();
            onePage.setUrl(url);
            onePage.setDomain(DomainUtils.getDomainWithCompleteDomain(url));
            onePage.setCode(html.getCode());
            String title = HtmlUtil.getTitle(html.getResponce()).trim();
            if (!StringUtils.hasText(title) || title.length() > 100 || title.contains("�")) {
                title = "没有";
            }
            onePage.setTitle(title);
            String content = HtmlUtil.getContent(html.getResponce());
            Set<Word> words = Nlp.separateWordAndReturnUnit(content);
            List<String> wordStr = Nlp.print2List(new ArrayList<>(words), 10);
            handleWord(wordStr, DomainUtils.getDomainWithCompleteDomain(url), title);
            onePage.setContent(wordStr.toString());
            if (html.getCode().equals(200)) {
                // Extract outbound links and queue any new domains
                List<String> domains = HtmlUtil.getUrls(html.getResponce());
                for (String domain : domains) {
                    int flag = 0;
                    String[] aaa = domain.split("\\.");
                    if (aaa.length >= 4) {
                        continue;
                    }
                    for (String i : ignoreSet) {
                        if (domain.endsWith(i)) {
                            flag = 1;
                            break;
                        }
                    }
                    if (flag == 1) continue;
                    if (StringUtils.hasText(domain.trim())) {
                        if (!tmpDomains.contains(domain)) {
                            tmpDomains.add(domain);
                            urls.add("http://" + domain + "/");
                        }
                    }
                }
                logger.info(this.domain + " queue size is " + urls.size());
                if (cnt >= 2000) {
                    break;
                }
            } else {
                // Retry over https if the plain-http request did not return 200
                if (url.startsWith("http:")) {
                    urls.add(url.replace("http:", "https:"));
                }
            }
        }
    }
}

Here SendReq.sendReq is a self-implemented method that downloads a page by calling HttpClient. If you want to crawl Web 2.0 sites, consider wrapping Playwright inside it instead.
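SendReq itself is not shown in this article, so here is a minimal sketch of what such a downloader could look like with Apache HttpClient 4.x. The ResBody shape, the timeouts, and the convention that code 0 means the request failed are assumptions modeled on how the crawler above uses it, not the author's actual code.

import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.nio.charset.StandardCharsets;
import java.util.Map;

public class SimpleDownloader {

    // Hypothetical response holder mirroring how the crawler uses SendReq.ResBody
    public static class ResBody {
        private Integer code;     // HTTP status, or 0 when the request failed
        private String responce;  // page body (field name mirrors the crawler above)
        public Integer getCode() { return code; }
        public void setCode(Integer code) { this.code = code; }
        public String getResponce() { return responce; }
        public void setResponce(String responce) { this.responce = responce; }
    }

    public static ResBody get(String url, Map<String, String> headers) {
        ResBody res = new ResBody();
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(5000)
                .setSocketTimeout(5000)
                .build();
        try (CloseableHttpClient client = HttpClients.custom()
                .setDefaultRequestConfig(config).build()) {
            HttpGet get = new HttpGet(url);
            headers.forEach(get::setHeader);
            try (CloseableHttpResponse response = client.execute(get)) {
                res.setCode(response.getStatusLine().getStatusCode());
                HttpEntity entity = response.getEntity();
                res.setResponce(entity == null ? "" : EntityUtils.toString(entity, StandardCharsets.UTF_8));
            }
        } catch (Exception e) {
            res.setCode(0); // the crawler treats code 0 as "unreachable, ignore this domain"
            res.setResponce("");
        }
        return res;
    }
}

Because everything goes through this one method, swapping HttpClient for a headless browser such as Playwright later would not require changes to the rest of the crawler.

Next comes the HtmlUtil utility class, which formats the HTML: it removes tags and the various garbled characters caused by special characters.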

import org.apache.commons.lang3.StringEscapeUtils;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlUtil {


    public static String getContent(String html) {
        String ans = "";
        try {
            html = StringEscapeUtils.unescapeHtml4(html);
            html = delHTMLTag(html);
            html = htmlTextFormat(html);
            return html;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return ans;
    }

    public static String delHTMLTag(String htmlStr) {
        String regEx_script = "<script[^>]*?>[\\s\\S]*?<\\/script>"; // regular expression for script blocks
        String regEx_style = "<style[^>]*?>[\\s\\S]*?<\\/style>";    // regular expression for style blocks
        String regEx_html = "<[^>]+>";                               // regular expression for HTML tags

        Pattern p_script = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE);
        Matcher m_script = p_script.matcher(htmlStr);
        htmlStr = m_script.replaceAll(""); // Filter script tags

        Pattern p_style = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE);
        Matcher m_style = p_style.matcher(htmlStr);
        htmlStr = m_style.replaceAll(""); // Filter style tags

        Pattern p_html = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE);
        Matcher m_html = p_html.matcher(htmlStr);
        htmlStr = m_html.replaceAll(""); // Filter HTML tags

        return htmlStr.trim();
    }

    public static String htmlTextFormat(String htmlText) {
        // Strip whitespace and a handful of special characters that show up in page text
        // (the original stripped a number of additional special characters as well)
        return htmlText
                .replaceAll(" +", "")
                .replaceAll("\n", "")
                .replaceAll("\r", "")
                .replaceAll("\t", "")
                .replaceAll(",", "")
                .replaceAll("⎛ ⎝", "")
                .replaceAll("⎠ ⎞", "")
                .replaceAll("!!!!!", "")
                .replaceAll("✔", "");
    }

    public static List<String> getUrls(String htmlText) {
        Pattern pattern = Pattern.compile("(http|https):\\/\\/[A-Za-z0-9_\\-\\+.:?&@=\\/%#,;]*");
        Matcher matcher = pattern.matcher(htmlText);
        Set<String> ans = new HashSet<>();
        while (matcher.find()) {
            ans.add(DomainUtils.getDomainWithCompleteDomain(matcher.group()));
        }
        return new ArrayList<>(ans);
    }

    public static String getTitle(String htmlText) {
        // Grab whatever sits between <title> and </title>
        Pattern pattern = Pattern.compile("(?<=title\\>).*(?=</title)");
        Matcher matcher = pattern.matcher(htmlText);
        while (matcher.find()) {
            return matcher.group();
        }
        return "";
    }
}

In addition to removing tags and special characters as mentioned above, methods are implemented here to extract all URLs and the page title (several Java libraries provide equivalent functionality).
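For comparison, a library such as jsoup (not used in this project) can do the same title, text, and link extraction without hand-written regular expressions. A minimal sketch:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class JsoupExample {
    public static void main(String[] args) {
        String html = "<html><head><title>demo</title></head>"
                + "<body><a href=\"http://example.com/\">link</a></body></html>";
        Document doc = Jsoup.parse(html);
        String title = doc.title();               // page title
        String text = doc.body().text();          // tag-free text content
        List<String> urls = new ArrayList<>();
        for (Element a : doc.select("a[href]")) { // all anchor links
            urls.add(a.attr("href"));
        }
        System.out.println(title + " / " + text + " / " + urls);
    }
}

The trade-off is an extra dependency; the regex approach above keeps the project dependency-free at the cost of robustness.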

Storage

The code above actually already contains the calls that handle storage and analysis; here they are pulled out separately.

public void handleWord(List<String> s, String domain, String title) {
        for (String a : s) {
            String x = a.split(" ")[0];
            String y = a.split(" ")[1];
            Set<String> z = siteMaps.getOrDefault(x, new ConcurrentSkipListSet<>());
            if (Integer.parseInt(y) >= 10) {
                if (z.contains(domain)) continue;
                z.add(domain);
                siteMaps.put(x, z);
                GenerateFile.appendFileWithRelativePath("moneyframework/domain/markdown", x + ".md", MDUtils.getMdContent(domain, title, s.toString()));
            }
        }
        Set<Word> xxxx = Nlp.separateWordAndReturnUnit(title);
        for (Word word : xxxx) {
            String x = word.getWord();
            Set<String> z = siteMaps.getOrDefault(x, new ConcurrentSkipListSet<>());
            if (z.contains(domain)) continue;
            z.add(domain);
            siteMaps.put(x, z);
            GenerateFile.appendFileWithRelativePath("moneyframework/domain/markdown", x + ".md", MDUtils.getMdContent(domain, title, s.toString()));
        }
}

The storage method is handleWord, where s is the segmentation result of a page (word offsets are not stored, so strictly speaking this is not a true inverted index), domain is the domain name itself, and title is the page title. It calls GenerateFile, a custom file-creation utility class. Part of its code is as follows:

public static void createFileRecursion(String fileName, Integer height) throws IOException {
    Path path = Paths.get(fileName);
    if (Files.exists(path)) {
        // TODO: 2021/11/13 the file already exists
        return;
    }
    if (Files.exists(path.getParent())) {
        // TODO: 2021/11/13 the parent exists, create the file (or directory) directly
        if (height == 0) {
            Files.createFile(path);
        } else {
            Files.createDirectory(path);
        }
    } else {
        createFileRecursion(path.getParent().toString(), height + 1);
        // TODO: 2021/11/13 the parent is guaranteed to exist now, create the path itself
        createFileRecursion(fileName, height);
    }
}

public static void appendFileWithRelativePath(String folder, String fileName, String value) {
    File directory = new File(".");
    try {
        fileName = directory.getCanonicalPath() + "/" + folder + "/" + fileName;
        createFileRecursion(fileName, 0);
    } catch (IOException e) {
        e.printStackTrace();
    }
    try {
        BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(new FileOutputStream(fileName, true));
        bufferedOutputStream.write(value.getBytes());
        bufferedOutputStream.flush();
        bufferedOutputStream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Analysis

The analysis here mainly consists of word segmentation and word-frequency statistics on the processed page content; HanLP is used for this.

import com.chaojilaji.moneyframework.model.Word;
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.Segment;
import com.hankcs.hanlp.seg.common.Term;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Nlp {

    // Tokens consisting only of digits or punctuation (ASCII and full-width) are ignored
    private static Pattern ignoreWords = Pattern.compile("[,.0-9_\\-,、。;;::!!??()()【】\\[\\]/*?\"“”+:|%~<>]+");


    public static Set<Word> separateWordAndReturnUnit(String text) {
        Segment segment = HanLP.newSegment().enableOffset(true);
        Set<Word> detectorUnits = new HashSet<>();
        Map<Integer, Word> detectorUnitMap = new HashMap<>();
        List<Term> terms = segment.seg(text);
        for (Term term : terms) {
            Matcher matcher = ignoreWords.matcher(term.word);
            if (!matcher.find() && term.word.length() > 1 && !term.word.contains("�")) {
                Integer hashCode = term.word.hashCode();
                Word detectorUnit = detectorUnitMap.get(hashCode);
                if (Objects.nonNull(detectorUnit)) {
                    detectorUnit.setCount(detectorUnit.getCount() + 1);
                } else {
                    detectorUnit = new Word();
                    detectorUnit.setWord(term.word.trim());
                    detectorUnit.setCount(1);
                    detectorUnitMap.put(hashCode, detectorUnit);
                    detectorUnits.add(detectorUnit);
                }
            }
        }
        return detectorUnits;
    }

    // Return at most cnt "word count" strings, highest frequency first,
    // skipping words that occur 50 times or more
    public static List<String> print2List(List<Word> tmp, int cnt) {
        PriorityQueue<Word> words = new PriorityQueue<>();
        List<String> ans = new ArrayList<>();
        for (Word word : tmp) {
            words.add(word);
        }
        int count = 0;
        while (!words.isEmpty()) {
            Word word = words.poll();
            if (word.getCount() < 50) {
                ans.add(word.getWord() + " " + word.getCount());
                count++;
                if (count >= cnt) {
                    break;
                }
            }
        }
        return ans;
    }
}

separateWordAndReturnUnit performs the word segmentation and word-frequency counting; the Word structure it returns looks like this:

public class Word implements Comparable {
    private String word;
    private Integer count = 0;

    // ... getters and setters omitted ...

    @Override
    public int compareTo(Object o) {
        if (this.count >= ((Word) o).count) {
            return -1;
        } else {
            return 1;
        }
    }
}

The print2List method sorts the list and outputs the top entries. The built-in sort would work just as well; I used a priority queue on the theory that a max-heap has lower time complexity than a full quicksort when only the top few entries are needed, but with this little data the optimization is overkill.
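For reference, here is a minimal sketch of the same top-N selection done with the built-in sort instead of a priority queue. It would sit next to print2List in the Nlp class, needs a java.util.Comparator import, and is an alternative rather than the code actually used:

    public static List<String> print2ListSorted(List<Word> tmp, int cnt) {
        List<Word> copy = new ArrayList<>(tmp);
        copy.sort(Comparator.comparingInt(Word::getCount).reversed()); // highest frequency first
        List<String> ans = new ArrayList<>();
        for (Word word : copy) {
            if (word.getCount() < 50) {            // same frequency cap as print2List
                ans.add(word.getWord() + " " + word.getCount());
                if (ans.size() >= cnt) {
                    break;
                }
            }
        }
        return ans;
    }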

Search implementation

The search implementation essentially uses two HashMaps to view the results along two dimensions: one keyed by domain name and the other keyed by keyword. With that, the Chrome extension used for presentation can do two things:

  • Click the extension icon in the top-right corner while on a site to see that site's keywords and related sites
  • Run a multi-keyword search from the extension's options page

The loading code is rather crude, just simple file reading and string processing, so it is not pasted here. One point worth noting, though: the data needs to be reloaded periodically, because the content keeps changing and the crawler is writing new data all the time; there also needs to be a feedback path to tell the crawler which sites should become new crawl targets.
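Since that loading code is not pasted, here is a minimal sketch of what a periodic loader might look like. The field names siteMaps and stringSiteMap and the Site shape are inferred from the endpoints below, and the directory plus the #### convention come from the crawler above; everything else (the Spring @Scheduled reload, the ten-minute interval, the class name DemoService) is an assumption rather than the author's code.

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;

@Service
public class DemoService {

    // Hypothetical Site holder; the real class at least exposes getTitle() and getKeywords()
    public static class Site {
        private String title;
        private String keywords;
        public String getTitle() { return title; }
        public String getKeywords() { return keywords; }
        public void setTitle(String title) { this.title = title; }
        public void setKeywords(String keywords) { this.keywords = keywords; }
    }

    // keyword -> domains that contain it
    public volatile Map<String, Set<String>> siteMaps = new ConcurrentHashMap<>();
    // domain -> site details (title, keyword list)
    public volatile Map<String, Site> stringSiteMap = new ConcurrentHashMap<>();

    @Scheduled(fixedDelay = 10 * 60 * 1000) // reload every 10 minutes
    public void reload() {
        File dir = new File("moneyframework/domain/markdown");
        File[] files = dir.listFiles();
        if (files == null) return;
        Map<String, Set<String>> fresh = new ConcurrentHashMap<>();
        for (File file : files) {
            String word = file.getName().replace(".md", "");
            Set<String> domains = new ConcurrentSkipListSet<>();
            try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (line.startsWith("####")) {   // each "####" line holds a domain
                        domains.add(line.replace("####", "").trim());
                    }
                    // building Site objects (title, keyword list) is omitted in this sketch
                }
            } catch (Exception ignored) {
            }
            fresh.put(word, domains);
        }
        siteMaps = fresh; // swap in the fresh index
    }
}

Note that @Scheduled only fires if scheduling is enabled, for example with @EnableScheduling on a configuration class.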

The back end needs to provide the following endpoints to the extension:

  • Get site keywords by domain name
@GetMapping("/api/v1/keywords")
@ResponseBody
public String getKeyWords(String domain) {
    try {
        Site site = demoService.stringSiteMap.get(DomainUtils.getDomainWithCompleteDomain(domain));
        if (Objects.nonNull(site)) {
            String keyWords = site.getKeywords();
            keyWords = keyWords.replace("[", "").replace("]", "");
            String[] keyWordss = keyWords.split(",");
            StringBuffer ans = new StringBuffer();
            for (int i = 0; i < keyWordss.length; i++) {
                ans.append(keyWordss[i]).append("\n");
            }
            return ans.toString();
        }
    } catch (Exception e) {
    }
    return "The site is not in storage.";
}
  • Get similar sites by domain name
@GetMapping("/api/v1/relations")
@ResponseBody
public String getRelationDomain(String domain) {
    try {
        Site site = demoService.stringSiteMap.get(DomainUtils.getDomainWithCompleteDomain(domain));
        String keyWords = site.getKeywords();
        keyWords = keyWords.replace("[", "").replace("]", "");
        String[] keyWordss = keyWords.split(",");
        Set<String> tmp = new HashSet<>();
        int cnt = 0;
        for (int i = 0; i < keyWordss.length; i++) {
            String keyword = keyWordss[i];
            String key = keyword.split(" ")[0];
            if (IgnoreUtils.checkIgnore(key)) continue;
            cnt++;
            Set<String> x = demoService.siteMaps.get(key);
            if (Objects.nonNull(x)) {
                for (String y : x) {
                    String yy = demoService.stringSiteMap.get(y).getKeywords();
                    int l = yy.indexOf(key);
                    if (l != -1) {
                        // Walk forward from the keyword to read out its frequency value
                        String yyy = "";
                        int flag = 0;
                        for (int j = l; j < yy.length(); j++) {
                            if (yy.charAt(j) == ',' || yy.charAt(j) == ']') {
                                break;
                            }
                            if (flag == 1) {
                                yyy = yyy + yy.charAt(j);
                            }
                            if (yy.charAt(j) == ' ') {
                                flag = 1;
                            }
                        }
                        if (Integer.parseInt(yyy) >= 20) {
                            tmp.add(y + "--" + key + "--" + yyy);
                        }
                    } else {
                        // Boolean titleContains = demoService.stringSiteMap.get(y).getTitle().contains(key);
                        // if (titleContains) {
                        //     tmp.add(y + "----" + key + "----");
                        // }
                    }
                }
            }
            if (cnt >= 4) {
                break;
            }
        }
        StringBuffer ans = new StringBuffer();
        for (String s : tmp) {
            ans.append("<a href=\"http://" + s.split("--")[0] + "\">" + s + "</a><br>");
        }
        return ans.toString();

    } catch (Exception e) {
        // e.printStackTrace();
    }
    return "This site has no similar site at present";
}
  • Get relevant websites by multiple keywords
@GetMapping("/api/v1/keyresult")
@ResponseBody
public String getKeyResult(String key, String key2, String key3, Integer page, Integer size) {
    Set<String> x = new HashSet<>(demoService.siteMaps.get(key));
    if (StringUtils.hasText(key2)) {
        key2 = key2.trim();
        if (StringUtils.hasText(key2)) {
            Set<String> x2 = demoService.siteMaps.get(key2);
            x.retainAll(x2);
        }
    }
    if (StringUtils.hasText(key3)) {
        key3 = key3.trim();
        if (StringUtils.hasText(key3)) {
            Set<String> x3 = demoService.siteMaps.get(key3);
            x.retainAll(x3);
        }
    }
    if (Objects.nonNull(x) && x.size() > 0) {
        Set<String> tmp = new HashSet<>();
        for (String y : x) {
            String yy = demoService.stringSiteMap.get(y).getKeywords();
            int l = yy.indexOf(key);
            if (l != -1) {
                String yyy = "";
                int flag = 0;
                for (int j = l; j < yy.length(); j++) {
                    if (yy.charAt(j) == ',') {
                        break;
                    }
                    if (flag == 1) {
                        yyy = yyy + yy.charAt(j);
                    }
                    if (yy.charAt(j) == ' ') {
                        flag = 1;
                    }
                }
                tmp.add(y + "--" + demoService.stringSiteMap.get(y).getTitle() + "--" + key + "--" + yyy);
            } else {
                Boolean titleContains = demoService.stringSiteMap.get(y).getTitle().contains(key);
                if (titleContains) {
                    tmp.add(y + "--" + demoService.stringSiteMap.get(y).getTitle() + "--" + key + "---- title contains");
                }
            }

        }
        StringBuffer ans = new StringBuffer();
        List<String> temp = new ArrayList<>(tmp);
        for (int i = (page - 1) * size; i < temp.size() && i < page * size; i++) {
            String s = temp.get(i);
            ans.append("<a href=\"http://" + s.split("--")[0] + "\" style=\"font-size: 20px\">"
                       + s.split("--")[1] + "</a> <p style=\"font-size: 15px\">" + s.split("--")[0] + "     " + s.split("--")[3]
                       + "</p><hr color=\"silver\" size=1/>");
        }
        return ans.toString();
    }
    return "Not included yet";
}
  • Tell the crawler to take a site as a new crawl target
@GetMapping("/api/v1/demo")
@ResponseBody
public void demo(String key) {
    new Thread(new Runnable() {
        @Override
        public void run() {
            HttpClientCrawl clientCrawl = new HttpClientCrawl(key);
            try {
                clientCrawl.doTask();
            } catch (Exception e) {
                e.printStackTrace();
            }
            finally {
                clientCrawl.oldDomains.clear();
                clientCrawl.siteMaps.clear();
                clientCrawl.onePageMap.clear();
                clientCrawl.ignoreSet.clear();
            }
        }
    }).start();
}

This is an informal project, so the code is written simply and casually; please forgive that.

Display

The presentation part is a Chrome extension, and the quickest way to get one working is to grab an existing extension from GitHub and tinker with it; you can dig deeper into how extensions work later.

The implementation in between is no different from writing an ordinary web page, so it is omitted.

The final result looks like this (screenshot of the extension popup), and the search page looks like this (screenshot of the search results).

Finally, I'd like to ask readers a favor: if you have a long-standing collection of technical websites, please share it in the comments section; I'm currently bottlenecked on finding new crawl targets. Sites from all walks of life are welcome.