Install Unirest

For data requests in Ruby we could use open-uri, but it is not particularly convenient for the various types of requests, so we use Unirest for network data requests.

gem install unirest # Install Unirest

Using Unirest

Unirest requires Ruby 2.0 or later and is very simple to use. The common methods are as follows (see unirest.io for details).

Create a request

response = Unirest.post "http://httpbin.org/post", 
                        headers:{ "Accept" => "application/json" }, 
                        parameters:{ :age => 23, :foo => "bar" }

response.code # Status code
response.headers # Response headers
response.body # Parsed body
response.raw_body # Unparsed body

An asynchronous request

# Passing a block makes the request asynchronous; the block receives the response
Unirest.post("http://httpbin.org/post",
             headers: { "Accept" => "application/json" },
             parameters: { :age => 23, :foo => "bar" }) do |response|
  response.code # Status code
  response.headers # Response headers
  response.body # Parsed body
  response.raw_body # Unparsed body
end

GET request with Basic authentication

response = Unirest.get "http://httpbin.org/get", auth:{:user=>"username", :password=>"password"}
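
With HTTP Basic authentication, the credentials travel in an Authorization header as the Base64 encoding of user:password. The sketch below builds that header value by hand using only core Ruby, assuming that Unirest's auth option maps to standard Basic authentication. The helper name basic_auth_header is hypothetical and not part of Unirest.

```ruby
# Hypothetical helper (not part of Unirest): builds the value of the
# Authorization header used by HTTP Basic authentication.
def basic_auth_header(user, password)
  # "m0" packs the string as Base64 without any line breaks
  encoded = ["#{user}:#{password}"].pack("m0")
  "Basic #{encoded}"
end

puts basic_auth_header("username", "password")
```

Base64 is an encoding, not encryption, so requests that carry such a header should go over HTTPS.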

Install Nokogiri

After crawling the data, we need to parse it. If the data structure is simple, we can use regular expressions directly; if it is complex, we need Nokogiri to work with the HTML DOM. If you are not familiar with the structure of the DOM, see the HTML DOM tutorial (http://www.runoob.com/htmldom/htmldom-tutorial.html).

gem install nokogiri
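
For simple, regular markup a regular expression can be enough. The sketch below pulls the href values out of a small HTML fragment with String#scan. The fragment and the pattern are illustrative assumptions, and the pattern only handles double-quoted href attributes.

```ruby
# Illustrative HTML fragment (made up for this example)
html = <<~HTML
  <ul>
    <li><a href="/html/html-tutorial.html">HTML</a></li>
    <li><a href="/css/css-tutorial.html">CSS</a></li>
  </ul>
HTML

# Capture whatever sits between the double quotes of each href attribute.
# scan returns one array of captures per match, so flatten yields plain strings.
links = html.scan(/href="([^"]*)"/).flatten

puts links
```

As soon as the markup is nested or irregular, patterns like this become fragile, which is where Nokogiri comes in.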

Using Nokogiri

Import packages

require 'rubygems'
require 'nokogiri'

Open an HTML document

page = Nokogiri::HTML(open("index.html"))
puts page.class # => Nokogiri::HTML::Document

# You can also use Unirest directly to request the data
response = Unirest.get "http://httpbin.org/get"
page = Nokogiri::HTML(response.body)

Parse a URL directly with open-uri

The document is fetched directly over HTTP.

require 'rubygems'
require 'nokogiri'
require 'open-uri'

# Use URI.open: since Ruby 3.0, a plain open no longer fetches URLs
page = Nokogiri::HTML(URI.open("http://en.wikipedia.org/"))
puts page.class   # => Nokogiri::HTML::Document

CSS selectors

Parsing document nodes

page.css('title') # Find all 'title' tags in the page; returns an array-like NodeSet
page.css('li')[0].text # Get the text of the first 'li' tag in the page
page.css('li')[0]['href'] # Get the value of the 'href' attribute of the first 'li' tag
page.css("li[data-category='news']") # Get the 'li' tags whose 'data-category' attribute is 'news'
page.css('div#funstuff')[0] # Get the 'div' node whose id is 'funstuff'
page.css('div#reference a') # Get all 'a' tags inside the 'div' whose id is 'reference'

For more about Nokogiri, see "Parsing HTML with Nokogiri".

Install Spreadsheet

Spreadsheet is a gem written in Ruby that makes it convenient to work with Excel files. We need to store the crawled data locally so it can be recorded and processed later.

gem install spreadsheet

require 'spreadsheet'

# Set the encoding Spreadsheet uses when handling Excel files
Spreadsheet.client_encoding = "UTF-8"

# Create a Spreadsheet workbook object, which is equivalent to an Excel file
book = Spreadsheet::Workbook.new

# Create a worksheet in the Excel file and name it "Test Excel"
sheet1 = book.create_worksheet :name => "Test Excel"

# Define a cell format
default_format = Spreadsheet::Format.new(:weight => :bold, # bold font
                                         :size => 14,
                                         :horizontal_align => :merge, # merge cells
                                         :color => "red",
                                         :border => 1,
                                         :border_color => "black",
                                         :pattern => 1,
                                         :pattern_fg_color => "yellow")
# Note that :pattern needs to be set manually, as above

# Get a row object
test_row = sheet1.row(0)

# Apply the format to the first two cells (column indexes 0 and 1)
test_row.set_format(0, default_format)
test_row.set_format(1, default_format)

# Write data into the cells
test_row[0] = "row 1 col 1"
test_row[1] = "row 1 col 2"

# Write the Spreadsheet object to a file to produce the spreadsheet
book.write 'book2.xls'

The crawler

Crawling the tutorial list and link data from RUNOOB.COM (http://www.runoob.com/) is not really a full crawler. It is an exercise in combining several Ruby gems to request, filter, and store data, which are the necessary tools for building a more complex crawler. Using these gems proficiently also shows off Ruby's simplicity.

#!/usr/bin/ruby
require 'unirest'
require 'nokogiri'
require 'open-uri'
require 'spreadsheet'

# Request the page
response = Unirest.get "http://www.runoob.com/"
page = Nokogiri::HTML(response.body)

# Get the category blocks
datas = page.css('div.codelist')
puts datas.count

# Create a workbook
Spreadsheet.client_encoding = "utf-8"
book = Spreadsheet::Workbook.new

# Create a new sheet
sheet = book.create_worksheet :name => "my excel"

index = 0
datas.each do |category|
  puts category.css('h2').text # Get the name of the category
  items = category.css('a.item-top')
  items.each do |item|
    sheet.row(index)[0] = item.css('h4').text # Write the name of the tutorial
    sheet.row(index)[1] = item['href'] # Write the tutorial link
    index += 1
  end
end

book.write '/users/ssbun/desktop/runoob.xls' # Write to a local file (watch your path)

You should then see an XLS file on your desktop. Open it to view the data inside.