Preface

Recently I bought a personal domain name and a cloud server, planning to build my own personal website. Since that needs too much preparation, I put it on hold for now and used GitHub to build a static Pages site instead. The setup was a process of twists and turns, mainly because the domain name address configuration wasted a lot of time, but overall it went smoothly. The website address is chenchangyuan.cn (an empty blog with a nice style for now; content will be added later).

I used Git + npm + Hexo, then configured it on GitHub. There are many tutorials online; if you have any questions, feel free to leave a comment.

I did Java for a few years, but due to a change of job responsibilities at my company I gradually shifted, and now I mainly do front-end development.

So I want to use Java to crawl my articles, and then convert the crawled HTML into MD (not yet implemented; guidance is welcome).

1. Get all the URLs of your blog

Check the blog address:

www.cnblogs.com/ccylovehs/d…

Traverse the listing pages according to the number of posts you have written.

Collect the detail-page addresses in a Set; a detail-page address looks like www.cnblogs.com/ccylovehs/p…
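
As a minimal standalone sketch (the post ids are made up for illustration), this is the behavior the TreeSet provides: duplicate links, which can appear when the same post is linked more than once on a listing page, are dropped, and the collected URLs stay sorted:

import java.util.Set;
import java.util.TreeSet;

public class UrlSetDemo {
    public static void main(String[] args) {
        Set<String> urls = new TreeSet<String>();
        urls.add("https://www.cnblogs.com/ccylovehs/p/9111111.html"); // hypothetical id
        urls.add("https://www.cnblogs.com/ccylovehs/p/9111111.html"); // duplicate, silently dropped
        urls.add("https://www.cnblogs.com/ccylovehs/p/9222222.html"); // hypothetical id
        System.out.println(urls.size()); // prints 2
    }
}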

2. Generate an HTML file from each detail page URL

Iterate through the Set, generating the HTML files in turn.

The files are stored in the C://data//blog directory, with the file name taken from capture group 1 of the URL pattern.
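
For example (a minimal sketch, with a made-up post id), group(1) of the pattern captures only the 9123456.html part of the URL, which becomes the file name:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FileNameDemo {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("https://www.cnblogs.com/ccylovehs/p/([0-9]+.html)");
        Matcher m = p.matcher("https://www.cnblogs.com/ccylovehs/p/9123456.html");
        if (m.find()) {
            System.out.println(m.group());  // the whole match: the full URL
            System.out.println(m.group(1)); // capture group 1: 9123456.html
        }
    }
}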



3. Code implementation

package com.blog.util;

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author Jack Chen
 * */
public class BlogUtil {
    /**
     * URL_PAGE: cnblogs listing page URL (the page number is appended)
     * URL_PAGE_DETAIL: detail page URL pattern
     * PAGE_COUNT: number of listing pages
     * urlLists: Set of detail page URLs (a TreeSet prevents duplicates)
     * p: compiled detail page pattern
     * */
    public final static String URL_PAGE = "https://www.cnblogs.com/ccylovehs/default.html?page=";
    public final static String URL_PAGE_DETAIL = "https://www.cnblogs.com/ccylovehs/p/([0-9]+.html)";
    public final static int PAGE_COUNT = 3;
    public static Set<String> urlLists = new TreeSet<String>();
    public final static Pattern p = Pattern.compile(URL_PAGE_DETAIL);

    public static void main(String[] args) throws Exception {
        // Step 1: collect the detail page URLs from every listing page
        for(int i = 1; i <= PAGE_COUNT; i++) {
            getUrls(i);
        }
        // Step 2: download each detail page into its own HTML file
        for(Iterator<String> i = urlLists.iterator(); i.hasNext();) {
            createFile(i.next());
        }
    }

    /**
     * Downloads one detail page and saves it under C://data//blog//,
     * using capture group 1 (for example 9123456.html) as the file name.
     * @param url detail page URL
     * @throws Exception
     */
    private static void createFile(String url) throws Exception {
        Matcher m = p.matcher(url);
        m.find();
        String fileName = m.group(1);
        String prefix = "C://data//blog//";
        File file = new File(prefix + fileName);
        PrintStream ps = new PrintStream(file);

        URL u = new URL(url);
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();
        conn.connect();
        BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "utf-8"));
        String str;

        while((str = br.readLine()) != null) {
            ps.println(str);
        }
        ps.close();
        br.close();
        conn.disconnect();
    }

    /**
     * Scans one listing page and adds every detail page URL it finds to urlLists.
     * @param idx listing page number
     * @throws Exception
     */
    private static void getUrls(int idx) throws Exception {
        URL u = new URL(URL_PAGE + idx);
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();
        conn.connect();
        BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "utf-8"));
        String str;
        while((str = br.readLine()) != null) {
            if(str.contains("https://www.cnblogs.com/ccylovehs/p/")) {
                Matcher m = p.matcher(str);
                if(m.find()) {
                    System.out.println(m.group(1));
                    urlLists.add(m.group());
                }
            }
        }
        br.close();
        conn.disconnect();
    }
}
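
One possible refinement, not part of the original code: try-with-resources closes the reader and the output stream even if an exception occurs mid-download, and a URL that does not match the pattern is skipped instead of throwing. A sketch of createFile rewritten that way:

private static void createFile(String url) throws Exception {
    Matcher m = p.matcher(url);
    if (!m.find()) {
        return; // not a detail page URL, nothing to save
    }
    File file = new File("C://data//blog//" + m.group(1));
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.connect();
    // Both resources are closed automatically, even on an exception
    try (BufferedReader br = new BufferedReader(
             new InputStreamReader(conn.getInputStream(), "utf-8"));
         PrintStream ps = new PrintStream(file)) {
        String str;
        while ((str = br.readLine()) != null) {
            ps.println(str);
        }
    } finally {
        conn.disconnect();
    }
}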

4. Conclusion

If you think it is useful, please move your mouse and give me a star. Your encouragement is my biggest motivation.

Github.com/chenchangyu…

Because I don’t want to generate the MD files one by one by hand, the next step is to convert the HTML files into MD files in batches, so as to fill out my personal blog. To be continued.
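
As a starting point for that step, here is a sketch of batch HTML-to-Markdown conversion. The library choice, flexmark-java's html2md converter (artifact com.vladsch.flexmark:flexmark-html2md-converter), is my assumption, not something this post has settled on:

import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.vladsch.flexmark.html2md.converter.FlexmarkHtmlConverter;

public class Html2MdBatch {
    public static void main(String[] args) throws Exception {
        // Assumes flexmark-html2md-converter is on the classpath
        FlexmarkHtmlConverter converter = FlexmarkHtmlConverter.builder().build();
        try (DirectoryStream<Path> htmlFiles =
                 Files.newDirectoryStream(Paths.get("C://data//blog//"), "*.html")) {
            for (Path html : htmlFiles) {
                String source = new String(Files.readAllBytes(html), StandardCharsets.UTF_8);
                String md = converter.convert(source);
                // Write 1234567.html -> 1234567.md next to the source file
                Path out = Paths.get(html.toString().replaceAll("\\.html$", ".md"));
                Files.write(out, md.getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}

One caveat: the crawler above saves the entire page, navigation included, so in practice the post body would need to be cut out of the HTML before converting.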