Preface

When developing crawlers, we often run into captchas, a gatekeeping measure in anti-crawling. Cracking them is very costly: it requires experience in deep learning, JS reverse engineering, and so on, and even then the success rate is not guaranteed. Worse, even if you succeed, the moment the target site changes the method or algorithm used to generate captchas, all that hard cracking work is wasted. So cracking captchas, which are anti-human by design, is really not for beginners.

So what can you do to log in without dealing with the captcha at all? When a frontal approach isn't working, we don't have to stick to it; we can try a workaround. This article introduces a simple and feasible way to circumvent captchas that has proved very effective in practice. At present there is not much information about this method on the Internet, so this article gives a brief introduction as a reference.

The overall idea

Today’s websites generally do not require users to re-enter their credentials on every visit; once login has been verified, users can access logged-in content directly on subsequent visits. How does this work? The website stores a piece of encrypted information in the browser, rather like giving you a room card with an expiration date: every time you come back, you swipe the card and get in. Most of this encrypted information exists in the form of cookies, which carry session information. On each request, the browser sends the Cookie along, and the server compares it with the Session it holds; if they match, the logged-in content can be accessed normally. If you are not familiar with this mechanism, please refer to this article. This is exactly the mechanism we use to bypass captchas, as shown below.
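To make the mechanism concrete, here is a minimal sketch of a Cookie/Session check. It is illustrative only (an in-memory store, invented names), not code from any real site: the server stores session data keyed by a session ID, hands the ID to the browser as a cookie, and on later requests simply looks the ID up again.

```javascript
// In-memory session store, keyed by session ID (illustrative only).
const sessions = {};

// On successful login the server creates a session and issues a cookie.
function login(userId) {
  const sessionId = 'sess-' + Math.random().toString(36).slice(2);
  sessions[sessionId] = { userId, createdAt: Date.now() };
  // Sent to the browser as e.g. "Set-Cookie: sid=sess-..."
  return sessionId;
}

// On each subsequent request, the cookie value is compared against the
// server-side session store; if it matches, the user counts as logged in.
function isLoggedIn(cookieValue) {
  return Boolean(sessions[cookieValue]);
}

const sid = login('user-1');
console.log(isLoggedIn(sid));        // true: valid session cookie
console.log(isLoggedIn('bogus-id')); // false: unknown cookie
```

This is why stealing our own Cookie works: any client that presents a valid session cookie is treated as already logged in, captcha or not.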

Therefore, our idea is not to crack the captcha by any means necessary, such as a captcha-solving platform or a home-grown deep learning algorithm, but to use cookies, a simple browser storage mechanism, to bypass the login captcha and thereby capture the data behind the login.

Now let’s introduce the specific implementation. The overall implementation idea is shown in the figure below.

In a nutshell, the steps are as follows:

  1. Log in to the target website manually in the browser (entering the captcha as needed), so that its Cookie is stored in the browser;
  2. A browser plug-in then reads the corresponding website's Cookie from the browser;
  3. The plug-in sends the obtained Cookie to a backend API;
  4. The API writes the received Cookie to the database;
  5. The crawler retrieves the Cookie from the database and carries it on every request.

In this way, we have completed the whole process of capturing a website's logged-in content. The database and API (the yellow part in the figure) can of course be replaced with a local file, but for a production environment we still recommend an API plus a database.
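The last step can be sketched as follows. The record shape and the axios usage are assumptions for illustration, not ArtiPub's actual crawler code: stored cookie records are joined into a single `Cookie` request header that the crawler attaches to every request.

```javascript
// Join stored cookie records into one "Cookie" request header.
// Each record has at least {name, value}, as saved by the plug-in.
function buildCookieHeader(cookies) {
  return cookies.map(c => `${c.name}=${c.value}`).join('; ');
}

// Hypothetical records as they might come back from the database:
const stored = [
  { name: 'sessionid', value: 'abc123', domain: '.example.com' },
  { name: 'csrftoken', value: 'xyz789', domain: '.example.com' },
];

const header = buildCookieHeader(stored);
console.log(header); // "sessionid=abc123; csrftoken=xyz789"

// The crawler then carries it on every request, e.g. with axios:
// axios.get('https://example.com/me', { headers: { Cookie: header } })
```

Any HTTP client that lets you set request headers can carry the cookies this way.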

Browser plug-in

Introduction

A browser plug-in (extension) mainly extends the browser's functionality, letting users run practical tools directly inside the browser. This article uses a Chrome plug-in, because it is relatively simple to develop and Chrome has a large user base.

Chrome plug-ins are made up of different components: background scripts, content scripts, optional pages, UI elements, and various logic files. Plug-in components are built with standard web technologies: HTML, CSS, and JavaScript. Which components a plug-in needs depends on its functionality; not every setting has to be configured.

Development

manifest.json

Development of a Chrome plugin starts with manifest.json, a configuration file similar to package.json in Node.js. It tells the browser basic information about the plug-in, such as its name, description, and version; it also declares the scopes it can run in and the permissions it may use, as well as entry files, icon information, and more.

A more complete manifest.json is shown below.

{
  "name": "ArtiPub",
  "version": "0.1.4",
  "manifest_version": 2,
  "description": "ArtiPub login assistant, helps log in to Juejin, SegmentFault, CSDN, Zhihu and other technical platforms",
  "icons": {
    "16": "icon.png",
    "48": "icon.png",
    "128": "icon.png"
  },
  "browser_action": {
    "default_title": "ArtiPub Login Assistant",
    "default_icon": "icon.png",
    "default_popup": "popup.html"
  },
  "permissions": [
    "cookies",
    "http://*/*",
    "https://*/*",
    "storage"
  ]
}

The most important part here is permissions, which declares the permissions the plug-in needs. Most APIs, including the storage API, must be registered under the permissions field. To use certain features, such as cookies, you must add cookies under permissions.

The example here is the manifest.json of the plug-in for ArtiPub, an open-source tool for publishing articles to multiple platforms.

HTML

In manifest.json we also see popup.html, which is an entry file: the HTML for the pop-up that appears when the Chrome plugin icon is clicked. There is a lot we can do here, such as embedding UI components, adding buttons, and including the necessary JS files.

For convenience, we use React + TypeScript to develop the UI part of the plug-in. The final output will be popup.html, manifest.json, and some other static files.

Popup.tsx is shown below. Please note the chrome.cookies.getAllCookieStores part in the onGetLoginInfo method; this is the core logic for getting cookies.

import {Button, Card, Input} from 'antd';
import * as React from 'react';
import axios from 'axios';
import './Popup.scss';

interface AppProps {
}

interface AppState {
  allowedDomains: string[];
  configVisible: boolean;
  url: string;
  fetched: boolean;
  loading: boolean;
}

export default class Popup extends React.Component<AppProps, AppState> {
  constructor(props: AppProps, state: AppState) {
    super(props, state);
  }

  componentDidMount() {
    // Example of how to send a message to eventPage.ts.
    chrome.runtime.sendMessage({popupMounted: true});

    this.setState({
      allowedDomains: [],
      configVisible: false,
      url: localStorage.getItem('url') || 'http://localhost:3000',
      fetched: false,
      loading: false
    });
  }

  async onGetLoginInfo() {
    this.setState({
      loading: true
    });

    // Get the required domain names from the backend
    const response = await axios.get(this.state.url + '/platforms');
    const platforms = response.data.data;
    this.setState({
      allowedDomains: platforms.map((d: any) => d.name)
    });

    // This is the core logic to get cookies:
    // walk through all cookie stores
    chrome.cookies.getAllCookieStores(cookieStores => {
      // console.log(cookieStores);

      // Iterate through all stores
      cookieStores.forEach(store => {
        // Get the cookies corresponding to the store
        chrome.cookies.getAll({storeId: store.id}, cookies => {
          // Filter cookies for the desired domain names
          const data = cookies.filter(c => {
            for (let domain of this.state.allowedDomains) {
              if (c.domain.match(domain)) {
                return true;
              }
            }
            return false;
          });

          // Send cookies to the backend
          axios.post(this.state.url + '/cookies', data)
            .then(() => {
              this.setState({fetched: true});
            })
            .finally(() => {
              this.setState({loading: false});
            });
        });
      });
    });
  }

  onConfig() {
    this.setState({
      configVisible: !this.state.configVisible,
    });
  }

  onUrlChange(ev: any) {
    localStorage.setItem('url', ev.target.value);
    this.setState({
      url: ev.target.value,
    });
  }

  render() {
    let btn = (
      <Button type="primary"
              onClick={this.onGetLoginInfo.bind(this)}>One-click access to login information</Button>
    );
    if (this.state && this.state.loading) {
      btn = (
        <Button type="primary" loading={true}>Fetching...</Button>
      );
    } else if (this.state && this.state.fetched) {
      btn = (
        <Button className="success" type="primary">Obtained successfully</Button>
      );
    }
    let input;
    if (this.state && this.state.configVisible) {
      input = (
        <Input value={this.state.url}
               className="input-url"
               placeholder="URL"
               onChange={this.onUrlChange.bind(this)}/>
      );
    }
    return (
      <Card className="artipub-container">
        <h2>
          ArtiPub
          <Button type="primary" shape="circle" icon="tool"
                  className="config-btn"
                  onClick={this.onConfig.bind(this)}/>
        </h2>
        {input}
        {btn}
      </Card>
    );
  }
}

Packaging the plug-in

Refer to ArtiPub’s plug-in repository here. Just run npm run build in it; all the packed static files will be in the build directory.

Importing the plug-in

The next step is to import the plug-in into the browser.

Click Settings in Chrome, then click Extensions, as shown below.

Then click Load Unpacked, as shown below, to load an unpacked extension, and select the build directory you just built.

The plug-in is then loaded. You should see the plug-in icon in the upper right corner.

Using the plug-in

Before using the plugin, you need to log in to the target site manually and make sure its Cookie has been saved in the browser. You also need to make sure the backend service is running; please refer to ArtiPub's backend API for receiving cookies. Then open the browser plug-in you just imported and click "One-click access to login information" to import the Cookie.

Background API and database

The backend API for receiving cookies is relatively simple; the code is shown briefly below.

const models = require('../models')

module.exports = {
  addCookies: async (req, res) => {
    const cookies = req.body
    for (let i = 0; i < cookies.length; i++) {
      const c = cookies[i]
      let cookie = await models.Cookie.findOne({
        domain: c.domain,
        name: c.name
      })
      if (cookie) {
        // The cookie already exists: update its fields
        for (let k in c) {
          if (c.hasOwnProperty(k)) {
            cookie[k] = c[k]
          }
        }
      } else {
        // The cookie does not exist: create a new one
        cookie = new models.Cookie({ ...c })
      }
      await cookie.save()
    }
    res.json({
      status: 'ok'
    })
  }
}
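The find-or-create logic above can be illustrated with a plain-JS sketch. The `upsertCookies` helper is hypothetical, standing in for the Mongoose calls: update the record if one with the same domain and name already exists, otherwise insert a new one.

```javascript
// Upsert cookie records into a plain array acting as the "database"
// (illustrative stand-in for models.Cookie.findOne / save).
function upsertCookies(store, incoming) {
  for (const c of incoming) {
    const existing = store.find(x => x.domain === c.domain && x.name === c.name);
    if (existing) {
      // The cookie already exists: overwrite its fields in place
      Object.assign(existing, c);
    } else {
      // The cookie does not exist: insert a copy
      store.push({ ...c });
    }
  }
  return store;
}

const db = [{ domain: '.example.com', name: 'sid', value: 'old' }];
upsertCookies(db, [
  { domain: '.example.com', name: 'sid', value: 'new' },   // updates existing
  { domain: '.example.com', name: 'token', value: 'abc' }, // inserts new
]);
console.log(db.length);   // 2
console.log(db[0].value); // "new"
```

Keying on domain plus name keeps the table free of duplicates when the plug-in is clicked repeatedly.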

The database we use is the highly flexible MongoDB; if you prefer another database, such as MySQL or SQLite, that works too.

The crawler

Let’s stick with the ArtiPub example here, which uses Puppeteer as its crawler engine. The core cookie-setting code is as follows.

  async setCookies() {
    const cookies = await models.Cookie.find({ domain: { $regex: this.platform.name } })
    for (let i = 0; i < cookies.length; i++) {
      const c = cookies[i]
      await this.page.setCookie({
        name: c.name,
        value: c.value,
        domain: c.domain,
      })
    }
  }

For a detailed crawler code, see github.com/crawlab-tea…

In fact, once the corresponding Cookie has been saved in the database, all you need to do is take it out and attach it to requests to the target website. It doesn't matter which framework you use, be it Scrapy or requests.

Next, you can run the crawler to bypass the authentication login.

Conclusion

This article presents a crawler design that uses a Chrome plugin to bypass captcha-protected logins. The core idea is to use the Chrome browser's ability to read cookies for other domains, then apply those cookies in crawler requests, thereby sidestepping the login captcha. Compared with cracking the captcha directly, this design is very simple, low cost, easy to operate, and offers high stability and a high success rate. ArtiPub, an open-source article publishing platform, uses this design to bypass login verification and publish articles; in practice the approach has proved very effective, and every platform could be bypassed smoothly. The drawback is that it requires some front-end experience, especially in browser plug-in development, but none of it is difficult and can be picked up in half an hour. In a production environment, browser plug-ins can be an effective way to counter anti-crawling measures, and you are welcome to try this approach in crawlers that require login.

References

  • Chrome plug-in development tutorial: developer.chrome.com/extensions/…
  • ArtiPub: github.com/crawlab-tea…