Crabstick is ashamed to say that two months have passed since he published his first article on Jianshu, and he has not written a single post since. Once again he finds himself worrying about how little ink he has in his belly; his writing and thinking skills clearly still need work.

Thank you

Crabstick's first article on Jianshu showed readers how to build a personal website and get their own blog running in half an hour. So far it has received 2,868 views, 169 likes, 102 comments, and 32 followers. Crabstick is flattered, and once again thanks his Jianshu friends for their support and encouragement. Thank you for giving crabstick confidence and determination. Thank you!

About

Why did you suddenly think of writing a blog like this?

Crabstick was building a music app and kept running into the same problem: no music resources, and no solid back-end database to draw on. He searched Baidu and Google for the APIs of every music platform he could find, but each one was either inaccessible or incomplete, so no full workflow could be built on it. That is how crabstick got the idea of writing his own API, with the data crawled from the official NetEase Cloud Music website. Of course, a data source like this is not legitimate, so Jianshu friends may use it for private development only, never for commercial purposes





Don’t tell anyone

This crawler is written in NodeJS, so some NodeJS knowledge helps, but once you understand crabstick's logic you can implement it in any other language

You got it? Crabstick’s driving





Speedy Crab, let’s go

Ha ha, without further ado, to the point.

Let’s take a look at the home page of NetEase Cloud Music





NetEase Cloud Music homepage

Crabstick's goal is to grab the eight playlists shown by default on the home page. We start by creating a new NodeJS project

# Create a folder and enter it
mkdir <your project name>
cd <your project name>
# Initialize the NodeJS project
npm init
# Answer the npm init prompts to generate package.json
# Install the required modules
npm install --save express superagent cheerio

Express handles the routing for our API, superagent requests the pages of the NetEase Cloud Music website, and cheerio processes the HTML that comes back. The API documentation for all three modules can be found on their official websites

To crawl data from a web page, you need a reasonable understanding of its HTML structure. After studying NetEase Cloud Music's home page, crabstick found the structure a little more complicated than expected, but just hold on to the handrail: crabstick drives very steadily

First visit the homepage of NetEase Cloud Music, open the debug window (F12), click the Elements option, and you will see the main HTML structure of the homepage as follows





Home page code structure

Use the shortcut Ctrl + Shift + C to enter element selection mode. When the mouse moves over any element on the page, the Elements panel automatically jumps to that element's position in the code





Select elements

<a title="May there be years to look back on and grow old with love." href="/playlist?id=316387203" class="msk" data-res-id="316387203" data-res-type="13" data-res-action="log" data-res-data="recommendclick|0|featured|user-playlist"></a>

The href is the location the link points to, data-res-id is the playlist ID, and data-res-type="13" indicates the resource type, which we do not care about for now. Crabstick does not know what data-res-action and data-res-data are used for, so we just take the parts we need

// Define our playlist object structure
{
    id: 'Playlist ID',
    title: 'Playlist name',
    href: 'Playlist link',
    type: 'Type',
    cover: 'Playlist cover image'
}
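
To make the mapping from the link's attributes to this structure concrete, here is a minimal sketch in plain JavaScript. The helper name toPlaylist and the cover URL are made up for illustration; the attribute names and values come from the a tag shown above.

```javascript
// Hypothetical helper: map the <a> tag's attributes (plus a cover image URL)
// onto the playlist structure defined above.
function toPlaylist(attrs, coverUrl) {
    return {
        id: attrs['data-res-id'],
        title: attrs.title,
        href: attrs.href,
        type: attrs['data-res-type'],
        cover: coverUrl
    };
}

var samplePlaylist = toPlaylist({
    title: 'May there be years to look back on and grow old with love.',
    href: '/playlist?id=316387203',
    'data-res-id': '316387203',
    'data-res-type': '13'
}, 'http://example.com/cover.jpg'); // placeholder cover URL, not a real one

console.log(samplePlaylist.id, samplePlaylist.type);
```

Later in the article the same values are read from the page with cheerio instead of being typed in by hand.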

In the HTML we can see that the a link is wrapped in an li tag, and the li in turn is wrapped in the ul tag we were hoping for. Let's focus the Elements panel on that ul





A list of tags

There are eight li elements in the ul. If you are interested, expand them one by one: they are exactly the eight playlists we decided to fetch above. The ul tag has a class named m-cvrlst, so we can click the Console tab and enter the following code

document.getElementsByClassName('m-cvrlst');




The results

The result is not what we expected. Even after making sure the class name is correct, we get an empty collection, although a ul element with the class m-cvrlst clearly exists in the page. Those still on board, follow driver crabstick back for a closer look

When we focus on the ul element, the bottom of the developer tools shows its ancestor hierarchy. It is not hard to spot an iframe element with the id g_iframe among the ul's ancestors





The IFrame element

While the young passengers in crabstick's car smile knowingly, we click the element marked by the red arrow to locate the iframe





Positioning elements

We see that there is also a Document object in the iframe, indicating that the iframe loaded a different URL. Knowing that, we proceed to execute the following code in the Console panel

g_iframe.contentDocument.getElementsByClassName('m-cvrlst');




The execution result

This is exactly what we wanted to see. The young passengers can relax and start thinking about how to get the iframe's HTML through code. The iframe loads its page only after the current page has been returned from the server: its src attribute points to about:blank, which means that navigation inside the iframe is controlled by JavaScript. The embedded page therefore loads later, so the iframe's content is not there when we request music.163.com directly

Let's write a few lines of code to verify this. Time to put the NodeJS project we just created to use

Create a new test.js file and write the following code

// Load the Express module
var app = require('express')();
// Load the superagent module
var request = require('superagent');
// Load the cheerio module
var cheerio = require('cheerio');

// Specify the access route
app.get('/', function(req, res, next){

    // Request the NetEase Cloud Music home page
    request.get('http://music.163.com')
        .end(function(err, _response){

            if (!err) {
                // If no error occurred, _response.text is the HTML returned for the page
                var html = _response.text;
                // cheerio initialization, similar to jQuery
                var $ = cheerio.load(html);
                // Print the iframe
                console.log('Iframe internal structure: ' + $('#g_iframe').html());

                res.send('Hello');

            } else {
                // Hand the error over to Express
                return next(err);
            }
        });
});

// Listen on port 3000
app.listen(3000, function(){
    console.log('Server start!');
});
# Run the following commands in a terminal, and do not close the window
cd <your project name>
node test
# If "Server start!" is printed, everything is working; anything else indicates an error
# Then visit this address in the browser and watch the console output
localhost:3000




The output

Sorry, the car was going a bit fast; everyone calm down. Passengers who cannot follow the code yet do not need to understand it right away. Just follow the steps to verify our guess: the iframe had not been loaded when we requested the NetEase Cloud Music home page, so we cannot get its content, and the DOM structure printed for the iframe is empty. This code will be explained later when we write the API





Sorry, this train is going a little fast

Chrome's developer tools give us a handy way forward. To get the contents of the iframe, we need to know which URL the iframe loaded after the page finished loading. Open the developer tools (F12) and click Sources





Sources options window

This will list all the resources currently loaded on the page, and we can see that at the bottom of the list, there is a child node called contentFrame. Click to open this node





ContentFrame node

Comparing the two screenshots, we find that they load the same page but from different addresses; interested readers can compare what is loaded in each. Back to the iframe: we can now be fairly sure that what is loaded under the contentFrame node is exactly the URL the iframe loaded. We can verify this by expanding the first child of the contentFrame node and opening the discover page





Loaded page

We see that discover is also an HTML document, but is it the one we need? To verify, press Ctrl + F inside discover and search for m-cvrlst. If all goes well





The search results

Haha, this is exactly what we need. How do we find out which URL the discover page points to? Hover the mouse over discover and you will see the URL music.163.com/discover (the tooltip could not be captured in a screenshot, so young passengers will need to check for themselves).

Now that we have the URL, we can officially write our crawler. In our NodeJs project, we will create index.js

// Initialize express
var app = require('express')();

/**
 * Register a route.
 * The first argument is the route path, the second is the handler
 * that runs when the path is requested, for example:
 * app.get('/recommendLst', function(req, res){});
 */
app.get('/', function(req, res){

    // Return the Hello World string to requests for localhost:3000/
    res.send('Hello World!');

});

/**
 * Start the express service, listening on port 3000 of the local machine
 */
var server = app.listen(3000, function(){
    // This callback runs once express has started successfully
    var port = server.address().port;

    console.log(`Express app listening at http://localhost:${port}`);
});
# Run in the terminal
node index
# Then visit in the browser
http://localhost:3000/




node index





localhost:3000

Congratulations, the first Express Hello World program is running successfully. For details, please refer to the Hello World example provided on the Express website

Use superagent to access the Discover page

By now Jianshu friends should know crabstick's routine well, so without further ado, let's hit the road

First we use superagent to request our own localhost:3000; if nothing goes wrong, we get back the Hello World string returned by localhost:3000/

// Load the superagent module
var request = require('superagent');

app.get('/test', function(req, res){

    request.get('http://localhost:3000/')
        .end(function(err, _response){

            if (!err) {
                // No error occurred while fetching the data
                var result = 'Data obtained: ' + _response.text;

                console.log(result);
                res.send(result);

            } else {
                console.log('Get data error!');
                // Send a response so the browser is not left waiting
                res.send('Get data error!');
            }
        });
});
# The server must be restarted after its code changes for them to take effect
node index
# Then visit in the browser
http://localhost:3000/test




Console interface





localhost:3000/test page

More about the superagent API can be found in its official documentation

Next we use superagent's get function to request the discover page, and open a localhost:3000/recommendLst API that returns the recommended playlist data

// Expose the /recommendLst API with Express
app.get('/recommendLst', function(req, res){

    // Use superagent to request the discover page
    request.get('http://music.163.com/discover')
        .end(function(err, _response){

            if (!err) {
                // The request succeeded
                var dom = _response.text;

                console.log(dom);
                res.send('get success');

            } else {
                console.log('Get data error!');
                // Send a response so the browser is not left waiting
                res.send('get error');
            }
        });
});
# Restart the service
node index
# Then visit in the browser
http://localhost:3000/recommendLst

If your console prints output like the following (the screenshot is truncated)





Console output

then the request for the discover page succeeded!

Use Cheerio to process the returned HTML

Crabstick taps the brakes for a short rest. Jianshu friends, take a breath and have a sip of water. Then please stand firm, hold on, and fasten your seat belts; the car is about to set off again…

Let's first use cheerio on some simple HTML. Look

// Load the cheerio module
var cheerio = require('cheerio');

app.get('/testCheerio', function(req, res){

    // A <div> is assumed here; the selector below only requires id="test"
    var $ = cheerio.load('<div id="test">This is a sample text</div>');

    $('#test').css('color', 'red');

    res.send( $.html() );

});

As usual,

# Restart the service
node index
# Then visit in the browser
http://localhost:3000/testCheerio

The result is as follows





The results

Likewise, more on the cheerio API can be found in its official documentation

Now let's use cheerio to process the HTML fetched by superagent

// Expose the /recommendLst API with Express
app.get('/recommendLst', function(req, res){

    // Initialize the return object
    var resObj = {
        code: 200,
        data: []
    };

    // Use superagent to request the discover page
    request.get('http://music.163.com/discover')
        .end(function(err, _response){

            if (!err) {
                // The request succeeded
                var dom = _response.text;

                // Use cheerio to load dom
                var $ = cheerio.load(dom);
                // Define the array we want to return
                var recommendLst = [];
                // Get the ul element of.m-cvrlst
                $('.m-cvrlst').eq(0).find('li').each(function(index, element){

                    // Get the a link
                    var cvrLink = $(element).find('.u-cover').find('a');
                    console.log(cvrLink.html());
                    // Get the playlist cover image
                    var cover = $(element).find('.u-cover').find('img').attr('src');
                    // Organize a single recommended playlist object structure
                    var recommendItem = {
                        id: cvrLink.attr('data-res-id'),
                        title: cvrLink.attr('title'),
                        href: 'http://music.163.com' + cvrLink.attr('href'),
                        type: cvrLink.attr('data-res-type'),
                        cover: cover
                    };
                    // Place a single object in an array
                    recommendLst.push(recommendItem);

                });

                // Replace the returned object
                resObj.data = recommendLst;

            } else {
                resObj.code = 404;
                console.log('Get data error ! ');
            }

            // Response data
            res.send( resObj );

        });

});
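
One small detail in the code above: the absolute href is built by string concatenation. As an alternative sketch (not part of crabstick's original code), the URL class built into current Node.js versions can resolve the relative link against the site's base address. The names BASE_URL and toAbsoluteUrl are made up for illustration.

```javascript
// Base address of the site, as used throughout this article
var BASE_URL = 'http://music.163.com';

// Hypothetical helper: resolve a relative href such as "/playlist?id=..."
// against BASE_URL using the built-in WHATWG URL class.
function toAbsoluteUrl(href) {
    return new URL(href, BASE_URL).href;
}

console.log(toAbsoluteUrl('/playlist?id=316387203'));
// prints "http://music.163.com/playlist?id=316387203"
```

Unlike plain concatenation, this also leaves an href that is already absolute untouched.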

The code is simple: Jianshu friends who have practiced with superagent and cheerio above will not be frightened by it. With that, our API for fetching the home page recommendations is complete. To see the result, restart the server and send a request; you know the drill!





Access to the results

Looks messy? Don't worry, crabstick's car is insured, haha. Jianshu friends will want the Chrome extension JSONView (available from the Chrome Web Store; you will need your own proxy to reach it, and the free version of Lantern is enough for simple needs).





After formatting with JSONView

Look back at the single recommended playlist object structure we defined earlier: isn't it easy to read now? Next, let's look at what happens when the request fails, for example when the network is down





Access failures
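
Both outcomes share the same envelope: a code plus the data. A minimal sketch of a helper that builds it is shown below. The name buildResponse is hypothetical, and the 200/404 codes simply mirror the convention used in the code above.

```javascript
// Hypothetical helper mirroring the resObj convention of this article:
// code 200 with the data on success, code 404 with an empty value on failure.
function buildResponse(err, data, emptyValue) {
    if (err) {
        return { code: 404, data: emptyValue };
    }
    return { code: 200, data: data };
}

// Success: the playlists are passed through
console.log(buildResponse(null, [{ id: '316387203' }], []));
// Failure: the caller still receives an (empty) array
console.log(buildResponse(new Error('network down'), null, []));
```

Keeping the data field's type the same in both cases means the front end can always iterate over it without extra checks.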

Get playlist details based on playlist ID

This requires us to study the DOM structure again. After the process above, crabstick believes Jianshu friends are fully confident about this. Get in the car, let's go

Return to the home page of NetEase Cloud and click on any playlist





Required resource information

From the screenshot we can see that the information we need is everything crabstick has boxed in above. From it we can define the object structure for a playlist's details

{
    id: 'Playlist ID',
    title: 'Playlist name',
    owner: 'Name of the playlist owner; for now only the user name is considered, not the user details',
    create_time: 'Creation time',
    collection_count: 'Number of collections (favorites)',
    share_count: 'Number of shares',
    comment_count: 'Number of comments',
    tags: ['tags'],
    desc: 'Playlist description',
    song_count: 'Total number of songs',
    play_count: 'Total number of plays'
}

Crabstick believes Jianshu friends can pull these fields out one by one. Let's define the API first

// Define an API to get playlist details based on playlist ID
app.get('/playlist/:playlistId', function(req, res){

    var playlistId = req.params.playlistId;
    res.send(playlistId);

});

/:playlistId matches whatever dynamic parameter you enter. Let's see it in action, and don't forget to restart the server





Browser access
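
Since whatever appears in place of :playlistId ends up in the request we send to NetEase, it can be worth checking it first. The sketch below assumes playlist IDs are purely numeric, as in the 316387203 example earlier. isValidPlaylistId is a hypothetical helper, not part of Express.

```javascript
// Assumption: playlist IDs consist of digits only, like 316387203 above.
// Route parameters arrive as strings, so only strings are accepted here.
function isValidPlaylistId(playlistId) {
    return typeof playlistId === 'string' && /^\d+$/.test(playlistId);
}

console.log(isValidPlaylistId('316387203')); // prints true
console.log(isValidPlaylistId('abc'));       // prints false
```

Inside the route handler, a failed check could return an error code right away instead of requesting the playlist page.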

The effect is obvious: this is how we get the ID of the playlist whose details we need. As for finding where each element sits in the page, crabstick will not walk Jianshu friends through it; having followed the tutorial above, you should be quite familiar with the process. Crabstick will paste the source code directly. Friendly reminder: pay attention to the comments

(Warning…. The speed is increasing, please stand firmly and fasten your seat belt…..)

// Define an API to get playlist details based on playlist ID
app.get('/playlist/:playlistId', function(req, res){

    // Get the playlist ID
    var playlistId = req.params.playlistId;
    // Define the return object
    var resObj = {
        code: 200,
        data: {}
    };

    /**
     * Request the page with superagent.
     * Why do we request http://music.163.com/playlist?id=${playlistId} here?
     * Jianshu friends should remember the iframe on the NetEase Cloud Music home page:
     * go to the Sources tab of the developer tools
     * and check which URL the iframe loaded on the playlist page
     */
    request.get(`http://music.163.com/playlist?id=${playlistId}`)
        .end(function(err, _response){

            if (!err) {
                // Define a playlist object
                var playlist = {
                    id: playlistId
                };

                // Without decodeEntities: false, Chinese characters
                // come out as Unicode entities
                var $ = cheerio.load(_response.text, {decodeEntities: false});
                // Get the playlist DOM
                var dom = $('#m-playlist');
                // Playlist title
                playlist.title = dom.find('.tit').text();
                // Playlist owner
                playlist.owner = dom.find('.user').find('.name').text();
                // Create time
                playlist.create_time =  dom.find('.user').find('.time').text();
                // The number of songs collected
                playlist.collection_count = dom.find('#content-operation').find('.u-btni-fav').attr('data-count');
                // Number of shares
                playlist.share_count = dom.find('#content-operation').find('.u-btni-share').attr('data-count');
                // Number of comments
                playlist.comment_count = dom.find('#content-operation').find('#cnt_comment_count').html();
                // Tags
                playlist.tags = [];
                dom.find('.tags').eq(0).find('.u-tag').each(function(index, element){
                    playlist.tags.push($(element).text());
                });
                // Playlist description
                playlist.desc = dom.find('#album-desc-more').html();
                // Total number of songs
                playlist.song_count = dom.find('#playlist-track-count').text();
                // Total number of plays
                playlist.play_count = dom.find('#play-count').text();

                resObj.data = playlist;

            } else {
                resObj.code = 404 ;
                console.log('Get data error! ');
            }

            res.send( resObj );

        });


});

Execution result (again, restart the server, then browser access)





Browser Access Results

Why don't we load all the songs of the playlist in this same API? Haha, anyone who has looked at the HTML file loaded by the iframe would not ask. Let's take a look

Open the developer tools (F12), click the Sources tab, expand the contentFrame child node, and click the file whose name begins with playlist





The iframe page

We can see that the song table has not been rendered in the page as returned. Looking more closely, the place where the song list should be shows a "loading..." prompt. But look, what is this





Look

Crabstick copied this content and ran it through a JSON parser





Analytical results

Obviously, this is exactly what we need. But the JSON string is very long, and handling it inside the playlist details request would be time-consuming. Crabstick ran roughly the following lines of code on the JSON string

// This step takes about 1s
var str = JSON.stringify('Copied string');
// This step takes about 1s
console.log(str.length);  // 75151
// This step takes about 2s
var str = JSON.parse(str);
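
The timings in the comments above are rough impressions. To measure the parsing step on your own machine, a sketch like the following can help. It uses console.time from the Node.js standard library, and the song list is synthetic: the field names id, name, and duration are made up and only serve to produce a reasonably long string.

```javascript
// Synthetic stand-in for the copied song list; the field names are made up.
var songs = [];
for (var i = 0; i < 1000; i++) {
    songs.push({ id: i, name: 'song ' + i, duration: 240000 });
}
var jsonText = JSON.stringify(songs);

// Time only the parsing step
console.time('JSON.parse');
var parsedSongs = JSON.parse(jsonText);
console.timeEnd('JSON.parse');

// Length of the string and number of songs recovered from it
console.log(jsonText.length, parsedSongs.length);
```

Measured results depend on the machine and on what else is done with the data, so treat any single number with caution.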

On top of that, we would certainly have to apply our own logic to the JSON, which would only make things slower. A better approach is to load the full song list asynchronously after the playlist itself has loaded. All our API has to do is return the JSON, leaving the parsing logic to the front-end JS code. Crabstick pastes another piece of code

// Define an API to get all songs of a playlist based on the playlist ID
app.get('/song_list/:playlistId', function(req, res){

    // Get the playlist ID
    var playlistId = req.params.playlistId;
    // Define the return object
    var resObj = {
        code: 200,
        data: []
    };

    request.get(`http://music.163.com/playlist?id=${playlistId}`)
        .end(function(err, _response){

            if (!err) {
                // The HTML was returned successfully
                var $ = cheerio.load(_response.text,{decodeEntities: false});
                // Get the playlist DOM
                var dom = $('#m-playlist');

                resObj.data = JSON.parse( dom.find('#song-list-pre-cache').find('textarea').html() );

            } else {
                resObj.code = 404 ;
                console.log('Get data error! ');
            }

            res.send( resObj );

        });


});
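
One thing to watch in the code above: JSON.parse throws if the text it receives is not valid JSON, for example when the page layout changes. A minimal sketch of a more forgiving parse step follows. safeJsonParse is a hypothetical helper, not part of cheerio or superagent.

```javascript
// Hypothetical helper: parse JSON text, returning a fallback instead of throwing.
function safeJsonParse(text, fallback) {
    if (typeof text !== 'string') {
        // Nothing usable was extracted from the page
        return fallback;
    }
    try {
        return JSON.parse(text);
    } catch (e) {
        return fallback;
    }
}

console.log(safeJsonParse('[{"id":1}]', []).length);            // prints 1
console.log(safeJsonParse('<html>not json</html>', []).length); // prints 0
console.log(safeJsonParse(null, []).length);                    // prints 0
```

In the route handler, the fallback case could also set resObj.code to the error value so the caller can tell the difference.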

Restart the server and visit the API in the browser (friendly tip from crabstick: if your computer is not very powerful, disable the JSONView extension before opening the page, otherwise the browser may crash)





Browser Access Results
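
On the front-end side, handling this response could look roughly like the sketch below. extractSongs is a hypothetical helper that takes the response body as a string; the id field on each song is made up, since the real song structure is whatever NetEase returns.

```javascript
// Hypothetical front-end helper: takes the body returned by
// /song_list/:playlistId as a string and returns the song array,
// or an empty array if the API reported a failure.
function extractSongs(responseText) {
    var resObj = JSON.parse(responseText);
    if (resObj.code !== 200 || !Array.isArray(resObj.data)) {
        return [];
    }
    return resObj.data;
}

// Sample bodies in the { code, data } shape used by the API
var okBody = JSON.stringify({ code: 200, data: [{ id: 1 }, { id: 2 }] });
var failBody = JSON.stringify({ code: 404, data: [] });

console.log(extractSongs(okBody).length);   // prints 2
console.log(extractSongs(failBody).length); // prints 0
```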

Conclusion

OK, the destination is not far off and crabstick's car is slowing down. Thank you for riding along this far; crabstick hopes the article helped. Let's review the APIs we have written

  • Get the recommended playlists
    http://localhost:3000/recommendLst
  • Get playlist details
    http://localhost:3000/playlist/:playlistId
  • Get all songs of a playlist
    http://localhost:3000/song_list/:playlistId

In this long post, crabstick has not only analyzed the reasoning behind the three APIs step by step, but also shown Jianshu friends in detail how to analyze a website's DOM and crawl its data with code. It is an API tutorial and, at the same time, a beginner's guide to NodeJS crawlers (blowing his own trumpet, haha, let him enjoy it for a while…). If you have any questions, leave a comment or send crabstick a private message, and he will reply as soon as he sees it

The source code

Crabstick has put the source code on GitHub as WangyiyunAPI and will keep updating the API in the future. If it helps you, please give crabstick a Star, thank you

