This is the second article in the Java web crawler series. The previous article was a simple introduction to writing a web crawler in Java. This article covers what to do when a crawler runs into a website that requires login.

Crawlers often run into pages that require login, for example when writing ticket-grabbing scripts. Whenever personal information is involved, login is required. There are two main solutions. The first is to set cookies manually: log in on the website, copy the cookies after login, and set them as the Cookie header of the crawler's HTTP requests. This suits low-frequency, short-lived collection, because cookies expire. For long-term collection the cookies would have to be replaced frequently, which is not practical. The second is to simulate the login in the program and obtain the cookies from the login response. This suits long-term collection from a site: each run logs in first, so there is no need to worry about cookie expiration.
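To make the point about cookie expiration concrete, here is a minimal sketch that inspects the lifetime a server declares in a Set-Cookie header, using only the JDK's java.net.HttpCookie. The class name CookieLifetime, the cookie name sid, and the sample header values are illustrative assumptions, not values taken from Douban.

```java
import java.net.HttpCookie;

class CookieLifetime {
    /** Describes how long the first cookie in a Set-Cookie header value stays valid. */
    static String describe(String setCookieHeader) {
        // HttpCookie.parse accepts the header value with or without a leading "Set-Cookie:" token
        HttpCookie cookie = HttpCookie.parse(setCookieHeader).get(0);
        long maxAge = cookie.getMaxAge(); // -1 when the server gave no explicit lifetime
        if (maxAge < 0) {
            return cookie.getName() + ": session cookie (no explicit lifetime)";
        }
        if (cookie.hasExpired()) {
            return cookie.getName() + ": already expired";
        }
        return cookie.getName() + ": valid for about " + maxAge + " s";
    }

    public static void main(String[] args) {
        // Hypothetical header values, for illustration only
        System.out.println(describe("sid=abc; Max-Age=3600; Path=/"));
        System.out.println(describe("sid=abc; Max-Age=0"));
    }
}
```

A cookie copied by hand stops working once this lifetime runs out or the server invalidates the session, which is why the manual approach only fits short collection periods.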

To illustrate both approaches, I use the nickname on a Douban personal home page as an example and fetch this login-only information in both ways. The information to fetch is shown in the figure below:

Simpleton is called simplicity

Manually setting cookies

Setting cookies manually is simple: log in on Douban, and after a successful login the cookies carrying the user information are available. The Douban login link is https://accounts.douban.com/passport/login. Copy the Cookie value from the Request Headers in the browser and use it in code like the following:

// Imports needed at the top of CrawleLogin.java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/**
 * Manually set cookies.
 * Log in on the website first, then view the cookies in Request Headers.
 * @param url
 * @throws IOException
 */
public void setCookies(String url) throws IOException {

    Document document = Jsoup.connect(url)
            // Manually set cookies
            .header("Cookie", "your cookies")
            .get();
    // Parse the page if a document was returned
    if (document != null) {
        // Get the Douban nickname node
        Element element = document.select(".info h1").first();
        if (element == null) {
            System.out.println(".info h1 tag not found");
            return;
        }
        // Read the nickname text from the node
        String userName = element.ownText();
        System.out.println("My Douban nickname is: " + userName);
    } else {
        System.out.println("Wrong !!!!!");
    }
}

As you can see, this code is the same as for a website that needs no login, except for the extra .header("Cookie", "your cookies"). Paste the cookies copied from the browser there, then write a main method to run it:

public static void main(String[] args) throws Exception {
    // Personal center URL
    String user_info_url = "https://www.douban.com/people/150968577/";
    new CrawleLogin().setCookies(user_info_url);
}

Running main gives the following result:

Simpleton is called simplicity
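As an alternative to pasting the whole string into the Cookie header, the copied value can be split into name/value pairs. The sketch below is one way to do that with plain Java. The class name CookieHeaderParser and the sample cookie names are hypothetical, and the parsing is deliberately simple (it splits on ";" and on the first "=" of each pair).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CookieHeaderParser {
    /** Splits a copied header value such as "k1=v1; k2=v2" into an ordered name -> value map. */
    static Map<String, String> parse(String cookieHeader) {
        Map<String, String> cookies = new LinkedHashMap<>();
        if (cookieHeader == null) {
            return cookies;
        }
        for (String pair : cookieHeader.split(";")) {
            String trimmed = pair.trim();
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty fragments and fragments without a name
            }
            // Split on the first '=' only, because cookie values may contain '='
            cookies.put(trimmed.substring(0, eq).trim(), trimmed.substring(eq + 1).trim());
        }
        return cookies;
    }

    public static void main(String[] args) {
        // Hypothetical cookie names and values, for illustration only
        Map<String, String> cookies = parse("sid=abc123; token=\"x=y\"");
        System.out.println(cookies.keySet()); // prints [sid, token]
    }
}
```

The resulting map can be passed to Jsoup's cookies(Map) method on the connection instead of calling header("Cookie", ...), which makes it easier to log or replace a single cookie.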

Simulated login

Simulated login makes up for the shortcoming of setting cookies manually, but it introduces a harder problem: captchas. They now come in many varieties and many are challenging, such as performing a series of operations on given pictures, which is hard to automate and not something you can write casually. So the developer has to weigh the pros and cons of each method. Douban, the site used today, has no captcha at login, which makes it relatively simple.

The most important part of simulated login is finding the real login request and its parameters. A simple trick helps: enter a wrong account and password on the login page, so that the page does not redirect and the login request is easy to find. Here is how to find the login request on Douban: enter a wrong user name and password on the login page, click login, and check the request that was sent in the browser's Network panel, as shown in the picture below:

https://accounts.douban.com/j/mobile/login/basic

// Additional imports: java.util.HashMap, java.util.Map, org.jsoup.Connection

/**
 * Simulate a Douban login with Jsoup and visit the personal center.
 * Enter a wrong account and password when logging in to Douban to view the parameters required for login.
 * First construct the login request parameters and obtain the cookies after a successful login,
 * then set those cookies on the next request and request again.
 * @param loginUrl login URL
 * @param userInfoUrl personal center URL
 * @throws IOException
 */
public void jsoupLogin(String loginUrl, String userInfoUrl) throws IOException {

    // Construct the login parameters
    Map<String, String> data = new HashMap<>();
    data.put("name", "your_account");
    data.put("password", "your_password");
    data.put("remember", "false");
    data.put("ticket", "");
    data.put("ck", "");
    Connection.Response login = Jsoup.connect(loginUrl)
            .ignoreContentType(true) // Ignore content type validation
            .followRedirects(false) // Disable redirection
            .postDataCharset("utf-8")
            .header("Upgrade-Insecure-Requests", "1")
            .header("Accept", "application/json")
            .header("Content-Type", "application/x-www-form-urlencoded")
            .header("X-Requested-With", "XMLHttpRequest")
            .header("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36")
            .data(data)
            .method(Connection.Method.POST)
            .execute();
    login.charset("UTF-8");
    // The login response now holds the cookies generated by the successful login
    // Construct a request to access the personal center
    Document document = Jsoup.connect(userInfoUrl)
            // Reuse the cookies from the login response
            .cookies(login.cookies())
            .get();
    if (document != null) {
        Element element = document.select(".info h1").first();
        if (element == null) {
            System.out.println(".info h1 tag not found");
            return;
        }
        String userName = element.ownText();
        System.out.println("My Douban nickname is: " + userName);
    } else {
        System.out.println("Wrong !!!!!");
    }
}

This code has two parts: the first simulates the login, and the second parses the Douban home page. It sends two requests. The first is the simulated login, which obtains the cookies. The second carries those cookies, so it can access pages that require login. Modify the main method as follows:

public static void main(String[] args) throws Exception {
    // Personal center URL
    String user_info_url = "https://www.douban.com/people/150968577/";

    // Login interface
    String login_url = "https://accounts.douban.com/j/mobile/login/basic";

    // new CrawleLogin().setCookies(user_info_url);
    new CrawleLogin().jsoupLogin(login_url,user_info_url);
}

Run the main method and get the following result:

Simpleton is called simplicity
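The login parameters that jsoupLogin puts in its map are sent as an application/x-www-form-urlencoded body, matching the Content-Type header of the login request. Jsoup does this encoding itself, so the following is only a sketch of what the body roughly looks like on the wire. The class name FormBody and the sample password are illustrative, and URLEncoder.encode(String, Charset) assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class FormBody {
    /** Encodes parameters as key=value pairs joined by '&', in map iteration order. */
    static String encode(Map<String, String> params) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            body.add(URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Same parameter names as the login request; the values are placeholders
        Map<String, String> data = new LinkedHashMap<>();
        data.put("name", "your_account");
        data.put("password", "p@ss word");
        data.put("remember", "false");
        data.put("ticket", "");
        data.put("ck", "");
        // prints name=your_account&password=p%40ss+word&remember=false&ticket=&ck=
        System.out.println(encode(data));
    }
}
```

Comparing this string with the form data shown in the browser's Network panel is a quick way to confirm that the crawler sends the same parameters as the real login page.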

Besides Jsoup, HttpClient can also simulate login. With HttpClient the simulated login is less complicated than with Jsoup, because an HttpClient instance keeps a session like a browser does. After login the HttpClient instance stores the cookies, and later requests made with the same instance carry them. The HttpClient simulated login code is as follows:

// Imports needed at the top of CrawleLogin.java (Apache HttpClient 4.x)
import java.net.URI;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.client.methods.RequestBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

/**
 * HttpClient simulated login. It is similar to Jsoup; the difference is that HttpClient has the concept of a session.
 * There is no need to set cookies within the same HttpClient; they are cached by default.
 * @param loginUrl
 * @param userInfoUrl
 */
public void httpClientLogin(String loginUrl, String userInfoUrl) throws Exception {

    CloseableHttpClient httpclient = HttpClients.createDefault();
    HttpUriRequest login = RequestBuilder.post()
            .setUri(new URI(loginUrl)) // login url
            .setHeader("Upgrade-Insecure-Requests", "1")
            .setHeader("Accept", "application/json")
            .setHeader("Content-Type", "application/x-www-form-urlencoded")
            .setHeader("X-Requested-With", "XMLHttpRequest")
            .setHeader("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36")
            // Set the account information
            .addParameter("name", "your_account")
            .addParameter("password", "your_password")
            .addParameter("remember", "false")
            .addParameter("ticket", "")
            .addParameter("ck", "")
            .build();
    // Simulate login
    CloseableHttpResponse response = httpclient.execute(login);
    if (response.getStatusLine().getStatusCode() == 200) {
        // Construct a request to access the personal center
        HttpGet httpGet = new HttpGet(userInfoUrl);
        CloseableHttpResponse user_response = httpclient.execute(httpGet);
        HttpEntity entity = user_response.getEntity();
        // Read the response body as a string
        String body = EntityUtils.toString(entity, "utf-8");

        // The body is a plain string, so check whether it contains the nickname
        System.out.println("Found \"Simpleton is called simplicity\"? " + body.contains("Simpleton is called simplicity"));
    } else {
        System.out.println("HttpClient simulated login to Douban failed!!!!");
    }
}

Running this code prints true, which means the nickname was found in the page.
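The session behaviour described above comes down to a cookie store: the client records the Set-Cookie headers of each response and attaches matching cookies to later requests. Apache HttpClient has its own cookie store for this. The sketch below only illustrates the idea with the JDK's java.net.CookieManager so that it runs without extra dependencies. The class name SessionCookieDemo, the example.com hosts, and the sid cookie are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class SessionCookieDemo {
    private final CookieManager jar = new CookieManager();

    /** Records the Set-Cookie header values of a response received from the given URI. */
    void remember(String uri, String... setCookieHeaders) {
        try {
            jar.put(URI.create(uri), Map.of("Set-Cookie", List.of(setCookieHeaders)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the Cookie header value the jar would attach to a request for the given URI. */
    String cookieHeaderFor(String uri) {
        try {
            List<String> values = jar.get(URI.create(uri), Collections.emptyMap())
                    .getOrDefault("Cookie", Collections.emptyList());
            return String.join("; ", values);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SessionCookieDemo session = new SessionCookieDemo();
        // Pretend the login response set a session cookie
        session.remember("https://accounts.example.com/login", "sid=abc123; Path=/");
        // A later request to the same host carries the cookie automatically
        System.out.println(session.cookieHeaderFor("https://accounts.example.com/people/1/"));
    }
}
```

This is why httpClientLogin never touches cookies explicitly: as long as both requests go through the same client instance, the stored cookies travel with the second request.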

There are two ways to solve the crawler login problem. One is to set cookies manually, which suits short-term or one-off collection and has a low cost. The other is simulated login, which suits long-term collection from a site. Its cost is fairly high, especially when difficult captchas are involved, but the benefit is that once it works you no longer have to replace cookies by hand.

That is all for handling login in a Java crawler; I hope it helps. The next article covers what to do when a crawler meets asynchronously loaded data. If you are interested in crawlers, feel free to follow along so we can learn from each other and improve together.

Source: source code

If anything in this article falls short, corrections are welcome, so that we can learn and improve together.

Finally

A small plug: you are welcome to scan the QR code and follow the WeChat official account “The technical blog of the flathead brother” so we can improve together.