Sergeant:

Fat Sir, I've been using Golang for development projects lately and have gradually gotten a feel for it. Then one day it occurred to me that I'd like to use Golang to crawl data from websites, such as the weather forecast or the sentence of the day, but I found that the data on these sites is generated dynamically by JavaScript. I don't know how to fetch that dynamic data for my own use, for example, to email it to myself after crawling it.

Fat Sir brushed back his long hair and said gently to the sergeant: young man, Golang is very efficient for application development, and of course it can easily crawl data from websites, dynamic data included. Let me explain.

Installing Golang

This step is mainly for readers who have not installed Golang on Linux yet. If you already have Golang installed, you can skip this brief installation section.

Download the Golang package

  • [Domestic mirror] https://studygolang.com/dl — download the latest Go installation package; depending on your system, choose Windows, Linux, or macOS
  • [If you can access the Internet freely] visit the Go installation documentation: https://docs.studygolang.com/doc/install

Unpack the Golang archive

tar -C /usr/local -xzf go1.16.linux-amd64.tar.gz

Configure Golang

  • Add the binary directory of Go to the PATH environment variable

    vim /etc/profile
    export GOROOT=/usr/local/go
    export PATH=$PATH:$GOROOT/bin

Reload the configuration

source /etc/profile
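To confirm that Go is now on your PATH, you can check the installed version; the output shown below is just what this particular release would print:

    go version
    # go version go1.16 linux/amd64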

Using the chromedp framework

The chromedp framework is open source on GitHub, so you can use it with confidence. If you have ideas, you are welcome to contribute to it on GitHub.

You can download it with the following command

go get github.com/chromedp/chromedp

Writing the actual code

Sergeant, you want to crawl a daily-sentence website, so I'll give you an example: let's crawl http://news.iciba.com/ and grab the sentence that the site updates every day.

[Screenshot: the daily sentence shown on news.iciba.com]

Start coding

// Required imports: "context", "log", "time", "github.com/chromedp/chromedp"
func GetHttpHtmlContent(url string, selector string, sel interface{}) (string, error) {
    options := []chromedp.ExecAllocatorOption{
        chromedp.Flag("headless", true), // headless mode; set to false while debugging to watch the browser
        chromedp.Flag("blink-settings", "imagesEnabled=false"),
        chromedp.UserAgent(`Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36`),
    }
    // Start from the default allocator options and append ours
    options = append(chromedp.DefaultExecAllocatorOptions[:], options...)

    c, _ := chromedp.NewExecAllocator(context.Background(), options...)

    // Create the chromedp context
    chromeCtx, cancel := chromedp.NewContext(c, chromedp.WithLogf(log.Printf))
    // Run an empty task so the Chrome instance is started ahead of time
    chromedp.Run(chromeCtx, make([]chromedp.Action, 0, 1)...)

    // Create a context with a 40-second timeout
    timeoutCtx, cancel := context.WithTimeout(chromeCtx, 40*time.Second)
    defer cancel()

    var htmlContent string
    err := chromedp.Run(timeoutCtx,
        chromedp.Navigate(url),
        chromedp.WaitVisible(selector),
        chromedp.OuterHTML(sel, &htmlContent, chromedp.ByJSPath),
    )
    if err != nil {
        log.Printf("Run err : %v\n", err)
        return "", err
    }
    //log.Println(htmlContent)
    return htmlContent, nil
}
  • GetHttpHtmlContent is the function that crawls dynamic data from a website. Its main purpose is to fetch data generated by JavaScript (it does not cover static data).

  • The first parameter, url, is the address of the site we want to crawl, as shown above

  • The second parameter, selector, is the HTML selector for the data we want to crawl. Open the site in Chrome, press F12 -> click the element-picker icon in the upper left corner -> then click the data we want to crawl -> and you can see the corresponding HTML source (here the data is generated dynamically by JavaScript)

    [Screenshot: the target element highlighted in Chrome DevTools]

    Right-click the item-bottom node -> Copy -> Copy selector to get the following result


    body > div.screen > div.banner > div.swiper-container-place > div > div.swiper-slide.swiper-slide-0.swiper-slide-visible.swiper-slide-active > a.item.item-big > div.item-bottom

    This string is what we pass as the second argument, selector, to the GetHttpHtmlContent function

  • For the third argument, we can write the following for now

    document.querySelector("body") // grab the content under the body
  • The return value is the crawled data, a string whose content is HTML

Here is a more detailed explanation of the code above

  • chromedp.Flag("headless", true) sets chromedp to headless mode, i.e. the command-line version of Chrome without a GUI. If this flag is set to false instead, chromedp will launch the Chrome browser installed in our environment and display the page while the program runs, which is handy for debugging

  • chromedp.Flag("blink-settings", "imagesEnabled=false") disables image loading

  • htmlContent receives the result of the crawl; it is a string whose content is HTML

  • chromedp.ByJSPath tells chromedp how to interpret the selector passed to OuterHTML; depending on your needs, this query option can be replaced with any of the following (a short sketch follows the list)

  • chromedp.ByNodeID

  • chromedp.BySearch

  • chromedp.ByID

  • chromedp.ByQueryAll

  • chromedp.ByQuery

  • chromedp.ByFunc

  • Those are the main chromedp interfaces involved here
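For example, chromedp.ByQuery interprets the selector as a CSS query (document.querySelector under the hood). Below is a minimal sketch of the same grab step using ByQuery instead of ByJSPath; the helper name getBodyByQuery and the simplified context setup are my own assumptions, not part of the original code:

// Illustrative variant: fetch the page body with a CSS query instead of a JS path.
// Required imports: "context", "time", "github.com/chromedp/chromedp"
func getBodyByQuery(url, waitSelector string) (string, error) {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()
    ctx, cancel = context.WithTimeout(ctx, 40*time.Second)
    defer cancel()

    var htmlContent string
    err := chromedp.Run(ctx,
        chromedp.Navigate(url),
        chromedp.WaitVisible(waitSelector),
        chromedp.OuterHTML("body", &htmlContent, chromedp.ByQuery), // "body" is a CSS selector here
    )
    return htmlContent, err
}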

Sergeant: Using this framework I get a string of HTML, but I don't know how to parse it. How can I find the sentence of the day that I just saw on the page?

Fat Sir: Don't worry, I'll walk you through it step by step, live. Look, we have already finished the core step: the data has been retrieved. Now let me show you a little magic: goquery can solve the problem of parsing that HTML string.


Using the goquery third-party library

Sergeant, I wrote a small helper function that I can show you

goquery is also open source on GitHub; you can download the library with the following command

go get github.com/PuerkitoBio/goquery

Start coding

// Required imports: "log", "strings", "github.com/PuerkitoBio/goquery"
func GetSpecialData(htmlContent string, selector string) (string, error) {
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent))
    if err != nil {
        log.Println(err)
        return "", err
    }

    var str string
    // Find every node matching the selector and keep its text
    dom.Find(selector).Each(func(i int, selection *goquery.Selection) {
        str = selection.Text()
    })
    return str, nil
}
  • The first argument, htmlContent, is the data that chromedp crawled above; it is a string whose content is HTML

  • The second argument is the HTML selector. For this site, it can be set to .chinese, as in

    GetSpecialData(htmlContent, ".chinese")
  • The return value is the content we want to capture; take it and do whatever you like with it, no pressure.
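Putting the two functions together, an end-to-end sketch might look like the following. The main function, the wait selector, and the printed output are illustrative assumptions; GetHttpHtmlContent, GetSpecialData, and the .chinese selector come from the code above:

package main

import (
    "fmt"
    "log"
)

func main() {
    // Step 1: let chromedp render the page and hand back the full HTML
    htmlContent, err := GetHttpHtmlContent(
        "http://news.iciba.com/",
        "div.swiper-container-place",     // element to wait for before grabbing the page
        `document.querySelector("body")`, // JS path used by chromedp.OuterHTML
    )
    if err != nil {
        log.Fatalf("crawl failed: %v", err)
    }

    // Step 2: let goquery pull the daily sentence out of the HTML
    sentence, err := GetSpecialData(htmlContent, ".chinese")
    if err != nil {
        log.Fatalf("parse failed: %v", err)
    }
    fmt.Println("Daily sentence:", sentence)
}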

Here are some uses of goquery

This is mainly about how to write and use the various HTML selectors. The types are briefly listed below, with a small sketch after the list; if you need a more detailed explanation, you can leave me a message

  • A selector based on the HTML Element
  • The ID selector
  • The Class selector
  • Property selector
  • Parent > Child selector
  • Element + Next Adjacent selector
  • Element ~next sibling selector
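Here is a small hedged sketch exercising several of these selector styles against a made-up HTML snippet; the snippet and the function name selectorDemo are purely illustrative:

// Illustrative only: several goquery selector styles on a tiny snippet.
// Required imports: "fmt", "log", "strings", "github.com/PuerkitoBio/goquery"
func selectorDemo() {
    html := `<div id="box" class="news"><p class="chinese">first line</p><p>second line</p></div>`
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(doc.Find("p").First().Text())      // element selector
    fmt.Println(doc.Find("#box").Length())         // ID selector
    fmt.Println(doc.Find(".chinese").Text())       // class selector
    fmt.Println(doc.Find("[class=news]").Length()) // attribute selector
    fmt.Println(doc.Find("div > p").Length())      // parent > child selector
    fmt.Println(doc.Find(".chinese + p").Text())   // adjacent sibling selector
    fmt.Println(doc.Find(".chinese ~ p").Text())   // general sibling selector
}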

Fat Sir: Sergeant, was what I said clear? Do you know how to use it now?

Sergeant: Got it~ I still need to practice more and try data from different sites to see the effect

Fat Sir: Hey, sergeant, did you just say that you want to send an email to yourself after processing the data?

Sergeant: Yeah, yeah, that's the other question. I don't know where to run the program. On my own computer? But I shut my computer down every day; when I rest, my computer rests with me. What should I do?

Fat Sir: Easy to handle. I recommend you use an Alibaba Cloud server

How to deploy your program to an Alibaba Cloud server

With your own cloud server you can conveniently run your monitoring programs, or anything else that needs to keep running, 24/7 without interruption. I have been using one recently and it really is good. For the specific Alibaba Cloud purchase steps, you can scan the QR code or click the link below to buy; it is genuinely worth trying. For usage and basic configuration, leave me a message and I will share the details.

Of course, if you need the full source code of this little example, you can also leave me a message. Let's put every idea into practice together and climb upward step by step.

Fat Sir: Sergeant, I need to remind you that the Alibaba Cloud server will automatically shut down your running program once your login session ends

Sergeant: Huh? And you're still asking me to buy a server? Are you ripping me off?

Fat Sir: Don't worry, what I recommend is definitely good, and there's a solution for that too

The screen tool

The screen tool lets us keep an executable deployed on the Alibaba Cloud server running without interruption

Principle:

Screen opens a separate process on the server that performs background tasks exclusively.

Specific operation:

  • Installation

    # Ubuntu/Debian
    sudo apt-get install screen
    # CentOS
    yum install screen
  • Create screen window

    screen -S name    # for example: screen -S ssh
  • Check the process

    screen -ls


  • Attach to a session

    screen -r -d <session id>    # for example: screen -r -d 5295
  • Close the Screen process

    screen -S <session name> -X quit
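For example, to keep the crawler from this article running after you log out of the server, the flow might look like this; the binary name crawler is just an assumption:

    # build the program (the name "crawler" is illustrative)
    go build -o crawler main.go

    # start a named screen session and run the program inside it
    screen -S crawler
    ./crawler

    # detach with Ctrl+A then D; the program keeps running after you log out
    # reattach later with:
    screen -r -d crawler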

If you need one, you can buy an Alibaba Cloud server through the link below. There are discounts for new users at the moment, personally tested. Don't ask who I am, I'm just a living Lei Feng.

https://www.aliyun.com/activity?taskCode=messenger2101&recordId=337686&usercode=&share_source=copy_link

Author: Nezha