1. Background:

With all the hair I have been losing lately, I figured the weekend was for resting. But while taking a nap I was woken up by a WeChat ding. I took a look: it was that female classmate. In short, what she meant was: "Can you help me with x x x x x X X x?" As a male student, could I possibly say no? m(o_ _)m

2. Introduction:

A quick description of the "requirement": she wanted me to help her download a batch of exam questions and answers from Zikao365, at the following link: www.zikao365.com/shiti/downl…


3. The crawler

Now that we know what to implement, let's dissect the web page. Here are the steps.

1. Capture the request in the browser

2. Work out which parameters need to be passed


Copy the captured request (for example, with Chrome DevTools' "Copy as cURL") and paste it into a text editor to inspect it:

curl 'https://www.zikao365.com/shiti/downlist_search.shtm' \
  -H 'Connection: keep-alive' \
  -H 'Cache-Control: max-age=0' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'Origin: https://www.zikao365.com' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'Sec-Fetch-Mode: navigate' \
  -H 'Sec-Fetch-User: ?1' \
  -H 'Sec-Fetch-Dest: document' \
  -H 'Referer: https://www.zikao365.com/shiti/downlist_search.shtm' \
  -H 'Accept-Language: zh-CN,zh;q=0.9,zh-TW;q=0.8,en-US;q=0.7,en;q=0.6' \
  --data-raw 'KeyWord=%C2%ED%BF%CB%CB%BC%D6%F7%D2%E5&monthHidden=%C8%AB%B2%BF&month=&yearHidden=%C8%AB%B2%BF&year=' \
  --compressed

The key parameter is KeyWord=%C2%ED%BF%CB%CB%BC%D6%F7%D2%E5 — the search keyword, URL-encoded with the site's legacy GBK charset rather than UTF-8.
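As a standalone sanity check (not part of the crawler itself), decoding that value with the GBK charset recovers the original Chinese keyword:

```java
import java.net.URLDecoder;

public class KeywordDecode {
    public static void main(String[] args) throws Exception {
        // The site URL-encodes the search keyword in GBK, not UTF-8,
        // so decoding with the GBK charset recovers the original text.
        String decoded = URLDecoder.decode("%C2%ED%BF%CB%CB%BC%D6%F7%D2%E5", "GBK");
        System.out.println(decoded); // prints "马克思主义" (Marxism)
    }
}
```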

3. How to handle paging

At this point we can fetch one page of list data, so which parameter do we pass to get the second page? Click through to the next page and look at the captured request: www.zikao365.com/shiti/downl… As shown above, the URL has changed slightly: a new parameter named page appears, and page=2 means page 2.
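The paging scheme boils down to appending the page parameter to the search URL. A minimal sketch (the KeyWord value is the GBK-encoded keyword captured earlier):

```java
public class PageUrls {
    public static void main(String[] args) {
        // Build the URL for each results page by varying the page parameter;
        // page=1 is the first page, page=2 the second, and so on.
        String base = "https://www.zikao365.com/shiti/downlist_search.shtm";
        String keyWord = "%C2%ED%BF%CB%CB%BC%D6%F7%D2%E5"; // GBK-encoded search keyword
        for (int page = 1; page <= 2; page++) {
            System.out.println(base + "?page=" + page + "&KeyWord=" + keyWord);
        }
    }
}
```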

Paging is no problem, so next let's see how to get the download link inside each entry.

4. Finding the download link

Click into one of the entries; the detail page looks like this: www.zikao365.com/shiti/downl…

I was about to happily click "Download now", only to find that it requires logging in. What a hassle.

There are only three ways forward:

  1. Ask the female classmate for an account and see what the download looks like when logged in
  2. Register an account myself, which is too much trouble
  3. Try to bypass the login check and download directly (obviously I chose the third option)

Suspecting that the button's href is a dead link and the real download is wired up in an onclick event, open F12 > Elements and inspect: sure enough, the actual file URL sits in an inline script, in a variable named path next to an addClick call.
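That idea can be tested in isolation. Given a script fragment of the shape the page uses (the sample URL below is an illustrative example, not a live page), a regex pulls out the path value; the crawler code further down does the same thing with manual substring cuts:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PathExtract {
    public static void main(String[] args) {
        // Hypothetical inline-script fragment shaped like the detail page's:
        // the real download URL sits in a 'path' variable next to addClick().
        String script = "var path = 'http://download.zikao365.com/shiti/19158.pdf'; addClick(path);";
        // Accept either single or double quotes around the URL.
        Matcher m = Pattern.compile("path\\s*=\\s*['\"]([^'\"]+)['\"]").matcher(script);
        if (m.find()) {
            System.out.println(m.group(1)); // prints the .pdf URL
        }
    }
}
```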

OK~~ at this point we have a complete picture of how Zikao365's list data, detail pages, and download files fit together, so it's time to start coding!!

The implementation is fairly simple, so I'll post it all here:

public static void main(String[] args) throws Exception {
    HttpClient httpClient = HttpClient.buildHttpClient();
    String urlTemplate = "https://www.zikao365.com/shiti/downlist_search.shtm?page=${page}&KeyWord=${KeyWord}";
    String url = null;
    for (int i = 1; i < 6; i++) {
        // Start from the first page
        url = urlTemplate.replace("${page}", i + "");
        url = url.replace("${KeyWord}", "%C2%ED%BF%CB%CB%BC%D6%F7%D2%E5");
        // The captured request was a POST, but I wrote it as a GET; the server accepts it, and GET is more convenient here.
        Request request = httpClient.buildRequest(url).GET();
        Response<String> response = request.execute(BodyHandlers.ofString(Charset.forName("gbk")));
        log.debug("response: [{}]", response.getBody());
        // Parse the HTML into a Document
        Document document = Jsoup.parse(response.getBody());
        // Collect the hyperlink of every entry on this page
        Elements a = document.select(".bot.clearfix li > a");
        List<String> hrefs = a.stream().map(v -> v.attr("href")).collect(Collectors.toList());
        for (String href : hrefs) {
            // Equivalent to clicking through to the detail page of one exam paper
            response = httpClient.buildRequest(href).execute(BodyHandlers.ofString(Charset.forName("gbk")));
            String html = response.getBody();
            // Parse the detail page
            document = Jsoup.parse(html);
            // Title of the current exam paper
            String title = document.selectFirst(".main div > h1").text();
            // Manually cut the file path out of the inline scripts
            String[] scripts = StringUtils.substringsBetween(html, "<script", "</script>");
            String path = Arrays.stream(scripts).filter(v -> v.contains("addClick") && v.contains("path")).findFirst().get();
            path = path.substring(path.indexOf("path") + 4);
            // Normalize ' to " here, since some pages use single quotes, e.g.: var path = 'http://download.zikao365.com/shiti/19158.pdf'
            path = path.replaceAll("'", "\"");
            path = path.substring(path.indexOf("\"") + 1);
            path = path.substring(0, path.indexOf("\""));
            // File extension
            String suffix = path.substring(path.lastIndexOf("."));
            log.info("title:[{}], path: [{}], suffix: [{}]", title, path, suffix);
            // Download the file to a local directory
            // Alternatively, use the commons-io utility class, e.g.: FileUtils.copyURLToFile(new URL(path), Paths.get("C:\\Users\\houyu\\Desktop\\Marxism", title + suffix).toFile());
            httpClient.buildRequest(path).execute(BodyHandlers.ofFile(Paths.get("C:\\Users\\houyu\\Desktop\\Marxism", title + suffix)));
        }
    }
}

Note: a few libraries and utility classes are used here:

  • HttpClient: a wrapper around HttpURLConnection (my own utility class, not the JDK one).

  • commons-lang3: a commonly used utility toolkit (StringUtils here).

  • Jsoup: an HTML parser that brings CSS selectors to Java; sweet to use.

4. Download

Log:

The downloaded files are as follows, 108 in total:

5. Finally

When I sent the 108 files to the female classmate…


"If you want to study, just call me, so you'll have a legitimate reason to chat." Have you ever met a female classmate like this?


WeChat public account: IT_loading

CSDN:blog.csdn.net/JinglongSou…

Blog: shaines.cn

E-mail: [email protected]

Programmer [back] is a Java back-end developer focused on programming and in love with technology, passionate about [Java back-end] and [data crawling]. Tips and practical content shared from time to time!! Welcome to follow "IT_loading", a public account of nothing but hands-on, practical content.