Web Crawler in Practice with Golang

Preface

Crawlers have traditionally been Python's forte. I had worked with Scrapy and written simple crawler widgets early on, but then I became interested in Golang and decided to write a crawler for practice. As a Golang newcomer I am bound to make mistakes; corrections are welcome.

General approach

  • Since many pages are dynamic these days, we use WebDriver to drive Chrome (or another browser) so the page finishes rendering before we capture the data. (I started with PhantomJS, but it is no longer maintained and wasn't very efficient.)
  • Most crawlers run on Linux, so we use Chrome's headless mode.
  • The captured data is saved to a CSV file and then sent out by email.

Shortcomings

  • Because pages have to be rendered, crawling is quite slow; even with image loading disabled, the speed is not ideal.
  • Since I am just getting started, I haven't added any concurrency, for fear of blowing up memory.
  • The data is written to a file instead of a database, which is not a final solution.

Required libraries

  • github.com/tebeka/selenium
    • The Golang port of Selenium; it implements most of the functionality we need.
  • gopkg.in/gomail.v2
    • The library used for sending mail; it hasn't been updated in a long time, but it's good enough.

Downloading dependency packages

  • I originally planned to manage dependencies with dep, but it has quite a few pitfalls, and since I haven't figured it out well enough to avoid misleading people, I gave up on it for now.
  • Download the dependency packages with go get:
go get github.com/tebeka/selenium
go get gopkg.in/gomail.v2

Code implementation

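The snippets below share a few imports and package-level variables that the post leaves implicit. Here is a reconstruction, inferred from how the functions use them; the exact declarations are my assumption, not the original source:

package main

import (
    "encoding/csv"
    "fmt"
    "log"
    "os"
    "strings"
    "time"

    "github.com/tebeka/selenium"
    "github.com/tebeka/selenium/chrome"
    "gopkg.in/gomail.v2"
)

// Shared state used by the functions below (assumed declarations).
var (
    service   *selenium.Service
    webDriver selenium.WebDriver
    csvFile   *os.File
    writer    *csv.Writer
    dateTime  string
)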
  • Start ChromeDriver, which drives the Chrome browser.
func StartChrome() {
    opts := []selenium.ServiceOption{}
    caps := selenium.Capabilities{
        "browserName": "chrome",
    }

    // Disable image loading to speed up rendering.
    imagCaps := map[string]interface{}{
        "profile.managed_default_content_settings.images": 2,
    }
    chromeCaps := chrome.Capabilities{
        Prefs: imagCaps,
        Path:  "",
        Args: []string{
            "--headless", // run Chrome in headless mode
            "--no-sandbox",
            "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/604.4.7 (KHTML, like Gecko) Version/11.0.2 Safari/604.4.7",
        },
    }
    caps.AddChrome(chromeCaps)

    // Start ChromeDriver; the port number can be customized.
    var err error
    service, err = selenium.NewChromeDriverService("/opt/google/chrome/chromedriver", 9515, opts...)
    if err != nil {
        log.Printf("Error starting the ChromeDriver server: %v", err)
    }

    // Connect to the Chrome WebDriver.
    webDriver, err = selenium.NewRemote(caps, fmt.Sprintf("http://localhost:%d/wd/hub", 9515))
    if err != nil {
        panic(err)
    }

    // A pitfall left by the target site: without this cookie,
    // Linux gets served the mobile version of the page.
    webDriver.AddCookie(&selenium.Cookie{
        Name:  "defaultJumpDomain",
        Value: "www",
    })

    err = webDriver.Get(urlBeijing)
    if err != nil {
        panic(fmt.Sprintf("Failed to load page: %s\n", err))
    }
    log.Println(webDriver.Title())
}

With the code above we can start Chrome and navigate to the target site, ready for the data-capture step.

  • Initialize the CSV file where the captured data is stored.
func SetupWriter() {
    // Go's reference-time date layout (Google's quirky taste...)
    dateTime = time.Now().Format("2006-01-02 15:04:05")
    os.Mkdir("data", os.ModePerm)
    var err error
    csvFile, err = os.Create(fmt.Sprintf("data/%s.csv", dateTime))
    if err != nil {
        panic(err)
    }
    // Write a UTF-8 BOM so spreadsheet apps decode the file correctly.
    csvFile.WriteString("\xEF\xBB\xBF")
    writer = csv.NewWriter(csvFile)
    writer.Write([]string{"model", "mileage", "first registration", "price", "location", "store"})
}

Data capture

This part is the core business logic. Every site is crawled differently, but the idea is the same: use XPath, CSS selectors, class names, tag names, and so on to locate elements and read their content. The Selenium API covers most of the operations needed; see the tebeka/selenium source for details. The core types are WebDriver and WebElement, illustrated briefly below. After that, I walk through capturing used-car data for Beijing from Second-hand Car Home (che168.com); other sites can be handled by adapting the same process.
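
To make those locator strategies concrete, here is a small sketch; the selector values are made-up placeholders, not Che168's real markup:

// Hypothetical lookups illustrating the tebeka/selenium locator API.
elem, err := webDriver.FindElement(selenium.ByCSSSelector, "div.car-price") // placeholder selector
if err != nil {
    log.Println("element not found:", err)
} else {
    text, _ := elem.Text()               // visible text of the element
    link, _ := elem.GetAttribute("href") // any attribute, fetched by name
    log.Println(text, link)
}
// The other strategies work the same way:
//   selenium.ByXPATH     e.g. "//*[@id='viewlist_ul']"
//   selenium.ByClassName e.g. "carinfo"
//   selenium.ByTagName   e.g. "ul"
//   selenium.ByID        e.g. "viewlist_ul"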

  • Open the Second-hand Car Home site (che168.com) in Safari and grab the URL of the Beijing used-car listing page:
const urlBeijing = "https://www.che168.com/beijing/list/#pvareaid=104646"
  • Right-click the page and choose "Inspect Element" to enter developer mode; you can see that all the data lives under this node:
<ul class="fn-clear certification-list" id="viewlist_ul">

Right-click that node and choose Copy → Copy XPath to get the element's XPath:

//*[@id="viewlist_ul"]

Then fetch it in code:

listContainer, err := webDriver.FindElement(selenium.ByXPATH, "//*[@id=\"viewlist_ul\"]")

The returned WebElement object is the parent container of all the data. To get the concrete values, each child element has to be located, which developer mode makes easy to work out.

The developer tools show that each listing has the class carinfo; since there is more than one such element, use:

lists, err := listContainer.FindElements(selenium.ByClassName, "carinfo")

This returns the set of all child elements. To extract the data from each one, traverse the set:

for i := 0; i < len(lists); i++ {
    var urlElem selenium.WebElement
    // On the first page the listing items start at a different offset.
    if pageIndex == 1 {
        urlElem, err = webDriver.FindElement(selenium.ByXPATH, fmt.Sprintf("//*[@id='viewlist_ul']/li[%d]/a", i+13))
    } else {
        urlElem, err = webDriver.FindElement(selenium.ByXPATH, fmt.Sprintf("//*[@id='viewlist_ul']/li[%d]/a", i+1))
    }
    if err != nil {
        break
    }
    url, err := urlElem.GetAttribute("href")
    if err != nil {
        break
    }
    webDriver.Get(url)
    title, _ := webDriver.Title()
    log.Printf("Current page title: %s\n", title)
    modelElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[1]/h2")
    var model string
    if err != nil {
        log.Println(err)
        model = "N/A"
    } else {
        model, _ = modelElem.Text()
    }
    log.Printf("model=[%s]\n", model)
    ...
    writer.Write([]string{model, miles, date, price, position, store})
    writer.Flush()
    webDriver.Back() // go back to the list page and repeat the steps
}

The full source is below. I'm a beginner, so go easy on me~

// StartCrawler crawls the listing pages and writes each record to the CSV file.
func StartCrawler() {
    log.Println("Start Crawling at ", time.Now().Format("2006-01-02 15:04:05"))
    pageIndex := 0
    for {
        listContainer, err := webDriver.FindElement(selenium.ByXPATH, "//*[@id=\"viewlist_ul\"]")
        if err != nil {
            panic(err)
        }
        lists, err := listContainer.FindElements(selenium.ByClassName, "carinfo")
        if err != nil {
            panic(err)
        }
        log.Println("Records on this page: ", len(lists))
        pageIndex++
        log.Printf("Crawling page %d...\n", pageIndex)
        for i := 0; i < len(lists); i++ {
            var urlElem selenium.WebElement
            if pageIndex == 1 {
                urlElem, err = webDriver.FindElement(selenium.ByXPATH, fmt.Sprintf("//*[@id='viewlist_ul']/li[%d]/a", i+13))
            } else {
                urlElem, err = webDriver.FindElement(selenium.ByXPATH, fmt.Sprintf("//*[@id='viewlist_ul']/li[%d]/a", i+1))
            }
            if err != nil {
                break
            }
            url, err := urlElem.GetAttribute("href")
            if err != nil {
                break
            }
            webDriver.Get(url)
            title, _ := webDriver.Title()
            log.Printf("Current page title: %s\n", title)
            modelElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[1]/h2")
            var model string
            if err != nil {
                log.Println(err)
                model = "N/A"
            } else {
                model, _ = modelElem.Text()
            }
            log.Printf("model=[%s]\n", model)
            priceElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[2]/div/ins")
            var price string
            if err != nil {
                log.Println(err)
                price = "N/A"
            } else {
                price, _ = priceElem.Text()
                price = fmt.Sprintf("%s万", price) // append the site's unit (万, i.e. ×10,000 yuan)
            }
            log.Printf("price=[%s]\n", price)
            milesElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[4]/ul/li[1]/span")
            var miles string
            if err != nil {
                log.Println(err)
                // Some detail pages use a slightly different layout; try the fallback path.
                milesElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[3]/ul/li[1]/span")
                if err != nil {
                    log.Println(err)
                    miles = "N/A"
                } else {
                    miles, _ = milesElem.Text()
                }
            } else {
                miles, _ = milesElem.Text()
            }
            log.Printf("miles=[%s]\n", miles)
            timeElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[4]/ul/li[2]/span")
            var date string
            if err != nil {
                log.Println(err)
                timeElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[3]/ul/li[2]/span")
                if err != nil {
                    log.Println(err)
                    date = "N/A"
                } else {
                    date, _ = timeElem.Text()
                }
            } else {
                date, _ = timeElem.Text()
            }
            log.Printf("time=[%s]\n", date)
            positionElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[4]/ul/li[4]/span")
            var position string
            if err != nil {
                log.Println(err)
                positionElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[3]/ul/li[4]/span")
                if err != nil {
                    log.Println(err)
                    position = "N/A"
                } else {
                    position, _ = positionElem.Text()
                }
            } else {
                position, _ = positionElem.Text()
            }
            log.Printf("position=[%s]\n", position)
            storeElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[1]/div/div/div")
            var store string
            if err != nil {
                log.Println(err)
                store = "N/A"
            } else {
                store, _ = storeElem.Text()
                // Strip the site's label text (Chinese literals taken from the page).
                store = strings.Replace(store, "商家|", "", -1)
                if strings.Contains(store, "店") {
                    store = strings.Replace(store, "店", "", -1)
                }
            }
            log.Printf("store=[%s]\n", store)
            writer.Write([]string{model, miles, date, price, position, store})
            writer.Flush()
            webDriver.Back() // go back to the list page and repeat the steps
        }
        log.Printf("Page %d captured, moving on to the next page...\n", pageIndex)
        nextButton, err := webDriver.FindElement(selenium.ByClassName, "page-item-next")
        if err != nil {
            log.Println("All data captured!")
            break
        }
        nextButton.Click()
    }
    log.Println("Crawling Finished at ", time.Now().Format("2006-01-02 15:04:05"))
    sendResult(dateTime)
}
  • Send E-mail

The complete code follows; it is fairly simple, so I won't belabor it.

func sendResult(fileName string) {
    email := gomail.NewMessage()
    email.SetAddressHeader("From", "re**[email protected]", "Zhang**")
    email.SetHeader("To", email.FormatAddress("li**[email protected]", "Li**"))
    email.SetHeader("Cc", email.FormatAddress("zhang**[email protected]", "Zhang**"))
    email.SetHeader("Subject", "Used-car data") // the original subject string was garbled; reconstructed
    email.SetBody("text/plain;charset=utf-8", "This week's captured used-car data, please check!\n")
    email.Attach(fmt.Sprintf("data/%s.csv", fileName))
    dialer := &gomail.Dialer{
        Host:     "smtp.163.com",
        Port:     25,                    // assumed; the port value was lost in the source
        Username: "re**[email protected]", // assumed to match the From address
        Password: "${smtp_password}",    // your SMTP server password
        SSL:      false,
    }
    if err := dialer.DialAndSend(email); err != nil {
        log.Println("Failed to send mail! err: ", err)
        return
    }
    log.Println("Mail sent successfully!")
}
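
The ${smtp_password} value above is a placeholder. One way to keep the real credential out of the source is to read it from an environment variable; a minimal sketch, where the variable name SMTP_PASSWORD is my own choice:

// Read the SMTP password from the environment instead of hard-coding it.
password := os.Getenv("SMTP_PASSWORD") // SMTP_PASSWORD is an assumed name
if password == "" {
    log.Fatal("SMTP_PASSWORD is not set")
}
dialer := &gomail.Dialer{
    Host:     "smtp.163.com",
    Port:     25,
    Username: "re**[email protected]",
    Password: password,
    SSL:      false,
}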
  • Finally, release the resources:
defer service.Stop()   // stop ChromeDriver
defer webDriver.Quit() // close the browser
defer csvFile.Close()  // close the file stream
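
These defer calls belong in the program's entry point. A minimal sketch of main under that assumption, wiring together the functions defined above:

func main() {
    SetupWriter()          // create the CSV file
    defer csvFile.Close()  // close the file stream
    StartChrome()          // start ChromeDriver and open the listing page
    defer service.Stop()   // stop ChromeDriver
    defer webDriver.Quit() // close the browser
    StartCrawler()         // capture the data; it mails the CSV when done
}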

Conclusion

  • I am new to Golang and this crawler is just a practice project; the code is rough and not engineered at all. I hope it doesn't mislead anyone.
  • There are few other Golang crawler projects to learn from, so this write-up includes some of my own findings, which I hope will help others.
  • Finally, a recommendation: Pholcus, an amazing crawler framework written in Go, with powerful features and one of the more complete options at present.