Crawler Systems and Data Processing in Practice


Original price: ¥899.00

More than 300 participants have joined

Lowest price: ¥399.00




About the instructor



Yang Zhen, Senior Software Architect


He has worked at well-known companies including the China engineering institutes of Sun Microsystems and Microsoft (Asia), Tencent's Beijing wireless Internet business unit, and Perfect World. Early in his career he was responsible for developing the Java virtual machine kernel, mobile products, and search engines. He now leads a senior research and development team of more than 50 people that builds products based on big data and artificial intelligence. The team's work covers image processing (face recognition, object detection), natural language processing (relationship extraction, machine translation, automatic text classification), recommendation systems, search engines, knowledge graphs, graph databases, crawlers, big data storage and mining, distributed system architecture, and Web and mobile product development.

Course highlights



1. The course focuses on application cases of data acquisition for artificial intelligence, so that participants learn the methods and techniques of data acquisition in a range of application fields.

2. It covers technologies and solutions for obtaining data from Google, Wikipedia, Weibo, WeChat official accounts, Taobao, JD.com (Jingdong), and other websites.

3. It introduces data sources and acquisition methods for tasks such as image recognition, object detection, entity type recognition, text classification, relationship extraction, structured information, and chatbots.

4. Basic crawler technologies, such as HTTP and Python, are taught through pre-recorded videos, while the live sessions concentrate on crawler application scenarios.


Course format



Classes will begin on April 17, 2018

Live streaming, 12 sessions, 2 hours each

Twice a week (Tuesdays and Fridays, 20:00-22:00)

After each live session, a recording is available to rewatch online as often as you like, valid for one year


Course outline



Lesson 1 Static web crawling: basic crawler techniques

1. HTML

2. CSS selectors

3. Introduction to JavaScript

4. lxml and XPath

6. The first crawler: Mafengwo travel notes
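
As a small illustration of what static page crawling involves, the sketch below extracts link targets from an HTML string. It is not taken from the course: the outline names lxml and XPath, but this example swaps in Python's standard-library `html.parser` so that it runs without extra packages. The `LinkExtractor` class, the `extract_links` helper, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in the given HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Hypothetical page content, standing in for a downloaded static page.
SAMPLE_HTML = """
<html><body>
  <a href="/notes/1.html">First note</a>
  <a href="/notes/2.html">Second note</a>
  <a name="no-href">Anchor without a link</a>
</body></html>
"""

print(extract_links(SAMPLE_HTML))
```

In a real crawler the HTML would come from an HTTP response, and the extracted links would feed the queue of pages to visit next.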


Lesson 2 Login and dynamic web page crawling

1. Forms

2. Website login and cookies

3. Headless browser: PhantomJS

4. Browser driver: Selenium

5. Dynamic web data acquisition


Lesson 3 Crawling Weibo

1. Analysis of the distribution and structure of the Weibo website

2. Crawling via dynamic pages

3. Reverse engineering Weibo's network interfaces

4. Crawling Weibo via the API


Lesson 4 Crawling WeChat official accounts

1. AnyProxy packet capture tool

2. Interface analysis of WeChat official accounts

3. Using Node.js to redirect the interface

4. Background data acquisition and saving

5. Use the interface to directly obtain all historical messages

6. Anti-crawler architecture design for WeChat official accounts


Lesson 5 CAPTCHA handling; JD.com and Taobao data crawling and storage case studies

1. Image comparison based on distance

2. Digit recognition with Tesseract OCR

3. Other CAPTCHA recognition approaches

4. Crawling JD.com data

5. Crawling Taobao data


Lesson 6 Multithreaded and multiprocess crawlers

1. Threads and processes

2. Python multithreading constraints

3. Crawling data with multiple threads concurrently

4. Crawling data with multiple processes concurrently

5. Logging system design
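
To give a feel for multithreaded crawling, here is a minimal sketch using the standard-library `concurrent.futures.ThreadPoolExecutor`. It is an illustration under stated assumptions, not the course's code: `fake_fetch` is a hypothetical stand-in that sleeps instead of making a network request, and the `example.com` URLs are placeholders. Threads suit this kind of I/O-bound waiting; in CPython the global interpreter lock keeps threads from running Python bytecode in parallel, which is one reason CPU-bound work is usually moved to multiple processes instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Hypothetical stand-in for an HTTP request: wait briefly, return a fake body."""
    time.sleep(0.05)  # simulates network latency (I/O-bound waiting)
    return url, f"<html>content of {url}</html>"


def crawl_concurrently(urls, max_workers=4):
    """Fetch all URLs with a pool of worker threads and return {url: body}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(pool.map(fake_fetch, urls))


# Placeholder URLs; a real crawler would take these from its task queue.
URLS = [f"http://example.com/page/{i}" for i in range(8)]
pages = crawl_concurrently(URLS)
print(len(pages))
```

With four workers, the eight simulated requests overlap their waiting time instead of running one after another.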


Lesson 7 Storing Weibo data: distributed databases and their applications

1. SQL and NoSQL

2. The Hadoop framework

3. HDFS

4. HBase

5. MongoDB

6. Redis

7. Distributed crawlers built on a distributed database


Lesson 8 Multi-machine parallel Weibo crawling: distributed system design

1. Daemon processes

2. Socket programming

3. Master design

4. Slave design

5. Task scheduling and communication protocol

6. Deploying crawlers on a distributed cluster


Lesson 9 PageRank, dynamic reordering of web pages, and ways of dealing with anti-crawler techniques

1. PageRank calculation model and derivation

2. Reordering the web page crawl order

3. Website service architecture

4. Find and utilize distributed servers

5. Multi-IP techniques and routing control

6. A crawler architecture that can handle almost any anti-crawler rule
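
The PageRank model and crawl-order rearrangement listed above can be sketched roughly as follows. This is a minimal power-iteration version under assumptions of my own, not the course's derivation: the `pagerank` function, the damping factor of 0.85, the fixed iteration count, and the four-page `GRAPH` are all illustrative, and every link target is assumed to appear as a key in the graph.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Estimate PageRank by power iteration.

    graph maps each page to the list of pages it links to.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, out_links in graph.items():
            if out_links:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[node] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: A links to B and C, B to C, C to A, D to C.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(GRAPH)

# Visit higher-ranked pages first when reordering the crawl queue.
crawl_order = sorted(ranks, key=ranks.get, reverse=True)
print(crawl_order)
```

Sorting pending URLs by an importance score like this is one simple way to decide which pages to fetch or refresh first; a production crawler would combine it with other signals such as update frequency.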


Lesson 10 Introduction to the Scrapy crawler framework

1. Examples

2. Framework analysis

3. Automatic generation of crawlers

4. The console

5. Pipelines

6. Middleware


Lesson 11 Automatic text extraction, web page classification, and machine learning applications for text

1. Automatic extraction of text

2. Text classification

3. Basis of web page classification

4. Word segmentation and feature extraction

5. Linear regression

6. SVM

7. Logistic regression

8. Classifying web pages

9. Multiple classifiers

Lesson 12 Information retrieval: principles and applications of search engines

1. Introduction to search engine architecture

2. Forward and inverted indexes

3. The Boolean model

4. The vector space model

5. The probabilistic model

6. TF-IDF

7. Elasticsearch
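
The inverted table (inverted index), Boolean model, and TF/IDF topics above fit together as in the rough sketch below. It is an illustration only: the three-document corpus and the helper names `build_inverted_index`, `boolean_and`, and `tf_idf` are hypothetical, tokenization is a plain whitespace split, and the weighting shown is just one common TF-IDF variant among several.

```python
import math
from collections import Counter, defaultdict

# Hypothetical mini corpus: document id -> text.
DOCS = {
    1: "python crawler tutorial",
    2: "distributed crawler design",
    3: "python search engine",
}


def build_inverted_index(docs):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.split()).items():
            index[term][doc_id] = count
    return index


def boolean_and(index, terms):
    """Boolean model: return ids of documents containing every query term."""
    postings = [set(index.get(term, {})) for term in terms]
    return set.intersection(*postings) if postings else set()


def tf_idf(index, docs, term, doc_id):
    """Term frequency in the document times log inverse document frequency."""
    postings = index.get(term, {})
    if doc_id not in postings:
        return 0.0
    tf = postings[doc_id] / len(docs[doc_id].split())
    idf = math.log(len(docs) / len(postings))
    return tf * idf


index = build_inverted_index(DOCS)
print(boolean_and(index, ["python", "crawler"]))
print(tf_idf(index, DOCS, "tutorial", 1) > tf_idf(index, DOCS, "python", 1))
```

A term that appears in fewer documents ("tutorial") gets a higher weight than one that appears in more ("python"), which is the intuition the TF-IDF score is meant to capture.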


To sign up, ask questions, or view the course, click "Read the original text" at the bottom of the article.