Preface

Novice Village Mission

Here are a few of the things I studied in my spare time during the last six months of 2018, for those just starting out with crawlers:

  • Article crawlers (Sogou WeChat articles, a book, Toutiao (Today's Headlines), etc.)
  • Music data crawler (mainly based on Node.js and the community API for NetEase Cloud Music)
  • WeChat article like-count crawler

Recommended tools for crawling

  • Charles (packet capture tool)
  • AnyProxy (also a packet capture tool, but programmable, which makes it more fun)
  • ADB (Android Debug Bridge): Python scripts can use it to control an Android phone and run tasks automatically
  • ADB can also be used to write automated test scripts

Back to crawling Juejin (Nuggets) articles

At first I tried the plain approach to get the Juejin article content, using PHP curl and phpQuery. The article content came back empty, and resources such as images are wrapped in tags that don't parse properly.

Apparently Juejin articles are loaded asynchronously. If you retrieve an article with curl, you will find that its content is encrypted. (I didn't have much time to study the encryption rules.)
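The plain curl approach described above can be sketched roughly as follows. This is only an illustration, not the original script: the function names fetchHtml and articleBodyText are hypothetical, and a regular expression stands in for phpQuery to keep the example dependency-free. It fetches the raw HTML without running any JavaScript and reports the visible text inside the first article element; for a page that is rendered client-side, that text is empty or the element is missing altogether.

```php
<?php
// Hypothetical helper: fetch raw HTML with the curl extension (no JavaScript is executed).
function fetchHtml(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('curl failed: ' . curl_error($ch));
    }
    return $html;
}

// Hypothetical helper: visible text of the first <article> element,
// or null when the raw HTML contains no such element.
function articleBodyText(string $html): ?string
{
    if (!preg_match('~<article\b[^>]*>(.*?)</article>~is', $html, $match)) {
        return null;
    }
    return trim(strip_tags($match[1]));
}
```

Calling articleBodyText(fetchHtml($url)) and getting null or an empty string back is a quick sign that the content is filled in by JavaScript after the page loads.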

My idea was to simply grab the rendered data directly. In general, this can be done with a simulator or PhantomJS. The simulator worked fine locally, but it kept throwing errors on the server, so I switched to PhantomJS.

Here are the dependencies and considerations for this project:

1. Things to note (use a domestic Composer mirror)

Method 1: Modify the global configuration file of Composer (recommended). Open the CLI for Windows users or the console for Linux and Mac users and run the following command:

composer config -g repo.packagist composer https://packagist.phpcomposer.com

Method 2: Modify the composer.json configuration file of the current project. Open a command-line window (Windows users) or console (Linux or Mac users), go to the root directory of your project (where the composer.json file is located), and run the following command:

composer config repo.packagist composer https://packagist.phpcomposer.com

2. Dependencies that need to be installed

composer require "jonnyw/php-phantomjs:4.*"

3. Special requirements in a Linux environment (modify the dependency's files)

namespace JonnyW\PhantomJs\DependencyInjection;

    /**
     * Load service container.
     *
     * @access public
     * @return void
     */
    public function load($file = null)
    {
        $loader = new YamlFileLoader($this, new FileLocator(__DIR__.'/../Resources/config'));
        $loader->load('config.yml');
        $loader->load('services.yml');

        $this->setParameter('phantomjs.cache_dir', sys_get_temp_dir());
        $this->setParameter('phantomjs.resource_dir', __DIR__.'/../Resources');
    }


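Edit php.ini to enable the system functions the library needs: php-phantomjs launches the PhantomJS binary as a child process, so the process functions it calls must not be listed in disable_functions.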
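One way to check this before editing php.ini is to read the disable_functions setting from PHP itself. The sketch below rests on assumptions: the article does not name the functions involved, so the list used here (proc_open, proc_close, exec) is a guess at what a process-launching library typically needs, and parseDisabledFunctions and blockedFunctions are hypothetical helper names.

```php
<?php
// Hypothetical helper: turn a disable_functions value ("exec, proc_open") into a clean list.
function parseDisabledFunctions(string $iniValue): array
{
    $names = array_map('trim', explode(',', $iniValue));
    return array_values(array_filter($names, function ($name) {
        return $name !== '';
    }));
}

// Hypothetical helper: which of the required functions are currently disabled?
function blockedFunctions(array $required, string $iniValue): array
{
    return array_values(array_intersect($required, parseDisabledFunctions($iniValue)));
}

// Assumed list; adjust it to whatever the error message on your server names.
$required = ['proc_open', 'proc_close', 'exec'];
$blocked = blockedFunctions($required, (string) ini_get('disable_functions'));

echo $blocked === []
    ? "No required function is disabled.\n"
    : 'Remove from disable_functions in php.ini: ' . implode(', ', $blocked) . "\n";
```

Run it with the same PHP setup that serves the crawler (CLI and web server often load different php.ini files), then restart the PHP service after changing the setting.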

The code

use JonnyW\PhantomJs\Client; // import at the top of the file

        // Inside the method that returns the article HTML:
        $client = Client::getInstance();

        // Set the PhantomJS binary location
        $client->getEngine()->setPath(ROOT_PATH . 'public' . DS . 'phantomjs');
        $client->getEngine()->addOption('--load-images=false');
        $client->getEngine()->addOption('--ignore-ssl-errors=true');

        $url = 'https://juejin.cn/post/6844903728646979597';
        $request = $client->getMessageFactory()->createRequest($url, 'GET');

        $timeout = 10000; // Set timeout (milliseconds)
        $request->setTimeout($timeout);

        $response = $client->getMessageFactory()->createResponse();
        $client->send($request, $response);

        $str = $response->getContent();

        // Locate the rendered <article> element
        $num1 = strpos($str, '<article');
        $num2 = strpos($str, '</article>');

        $re_data = substr($str, $num1, $num2 - $num1);
        $re_data .= '</article>';

        // Lazy-loaded images keep their URL in data-src; rename it to src
        $re_data = str_replace("data-src", "src", $re_data);
        // file_put_contents('3.html', $re_data);
        return $re_data;
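The string slicing at the end of the snippet assumes that both the opening and closing article tags are present in the rendered page. A slightly more defensive variant might look like the sketch below; extractArticle is a hypothetical name, and returning null when a tag is missing is a choice made for this sketch, not something the original code does.

```php
<?php
// Hypothetical helper: cut the <article>...</article> fragment out of rendered HTML
// and rename lazy-load data-src attributes to src so images display.
function extractArticle(string $html): ?string
{
    $start = strpos($html, '<article');
    if ($start === false) {
        return null; // the page did not render an article
    }

    // Search for the closing tag only after the opening one.
    $end = strpos($html, '</article>', $start);
    if ($end === false) {
        return null; // truncated or malformed markup
    }

    $fragment = substr($html, $start, $end - $start) . '</article>';
    return str_replace('data-src', 'src', $fragment);
}

$rendered = '<div><article><img data-src="a.png"><p>Hi</p></article></div>';
echo extractArticle($rendered), "\n";
// prints: <article><img src="a.png"><p>Hi</p></article>
```

Like the original, this replaces every occurrence of data-src in the fragment, including any that appear in the article text itself.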

See the effect

Conclusion

This crawler is rudimentary: it is not yet fully automated, distributed, or multithreaded. PHP is the best programming language in the world. I plan to write an article about programming languages this evening.