This material is for exchange and learning only. Using it for any activity that violates the laws and regulations of your country (region) is forbidden, and everyone must abide by the Network Security Law.

Tips: This is only meant to convey the idea; a real project also has to handle details such as keeping the proxy pool available.

Practical steps

  1. Framework and core library deployment
  2. Process that periodically updates the proxy pool
  3. Process that periodically crawls the list page
  4. The main process periodically reads the list page tasks from Redis and dispatches each item to an asynchronous task

The environment

  • CentOS 7.2
  • PHP7.2
  • Swoole 4.3.5
  • Google Chrome 78.0.3904.108
  • ChromeDriver 78.0.3904.105

Composer

  • facebook/webdriver = 1.7
  • easyswoole/easyswoole = 3.1.18
  • easyswoole/curl = 1.0.1

Framework and core library deployment

1. Install EasySwoole 3.1.18

[[email protected] phpseleniumdemo]# composer require easyswoole/easyswoole=3.1.18
[[email protected] phpseleniumdemo]# php vendor/easyswoole/easyswoole/bin/easyswoole install
(EasySwoole ASCII-art banner)
install success,enjoy!

2. Install the core libraries facebook/webdriver and easyswoole/curl

[[email protected] phpseleniumdemo]# composer require facebook/webdriver=1.7
[[email protected] phpseleniumdemo]# composer require easyswoole/curl=1.0.1

3. Start the server and confirm that no errors are reported

[[email protected] phpseleniumdemo]# php easyswoole start
(EasySwoole ASCII-art banner)
main server      SWOOLE_WEB
listen address   0.0.0.0
listen port      9501
sub server1      CONSOLE => [email protected]:9500
....

Process that periodically updates the proxy pool

Tips: You need to supply your own proxy resources. The code here is only an example; the proxies are not actually used.

1. Create the project home directory

[[email protected] phpseleniumdemo]# mkdir App
# Map the App namespace to the App/ directory in composer.json
[[email protected] phpseleniumdemo]# cat composer.json
{
    "autoload": {
        "psr-4": {
            "App\\": "App/"
        }
    },
    "require": {
        "easyswoole/easyswoole": "3.1.18",
        "facebook/webdriver": "^1.7",
        "easyswoole/curl": "1.0.1"
    }
}
# Regenerate the composer autoloader
[[email protected] phpseleniumdemo]# composer dump-autoload
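To make the "psr-4" mapping above concrete, here is a minimal sketch of how a class name under the App\ prefix resolves to a file under App/. The function psr4Path() is a hypothetical helper written only for illustration; Composer's generated autoloader performs this lookup for you.

```php
<?php
// Illustrative only: Composer's real autoloader does this lookup internally.
function psr4Path(string $class, string $prefix, string $baseDir): ?string
{
    // The class must live under the mapped namespace prefix.
    if (strncmp($class, $prefix, strlen($prefix)) !== 0) {
        return null;
    }
    // Strip the prefix and turn namespace separators into directory separators.
    $relative = substr($class, strlen($prefix));
    return $baseDir . str_replace('\\', '/', $relative) . '.php';
}

echo psr4Path('App\\Process\\UpdateProxyPool', 'App\\', 'App/'), PHP_EOL;
// prints "App/Process/UpdateProxyPool.php"
```

This is why the process classes in the following steps are placed in App/Process and declared with the namespace App\Process.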

2. Create the process directory (the proxy pool update runs as a child process that starts together with the project)

[[email protected] phpseleniumdemo]# mkdir App/Process  

3. Periodically crawl the proxy pool (a Redis List is used so that the newest proxy IPs sit at the head; the crawler takes a proxy from the head each time, and each proxy IP is used only once)


Full code link


      
<?php
/**
 * Created by PhpStorm.
 * User: ar414.com@gmail.com
 * Date: 2019/12/7
 * Time: 21:00
 */
namespace App\Process;

use App\Lib\Curl;
use App\Lib\Kv;
use EasySwoole\Component\Process\AbstractProcess;

class UpdateProxyPool extends AbstractProcess
{
    // The proxies returned by this API support SOCKS5 only
    private $proxyListApi = "http://www.zdopen.com/ShortS5Proxy/GetIP/?api=%s&akey=%s&order=2&type=3";
    const PROXY_KV_KEY = 'spider:proxy:list';
    const TIMER = 15; // Seconds to sleep between refreshes

    protected function initProxyListApi()
    {
        // $this->proxyListApi = sprintf($this->proxyListApi,$_ENV['PROXY_LIST_API'],$_ENV['PROXY_LIST_KEY']);
        $this->proxyListApi = sprintf($this->proxyListApi, '20191231231237085', '72axxxae0fe34');
    }

    public function run($arg)
    {
        $this->initProxyListApi();
        // Depends on: composer require easyswoole/curl=1.0.1
        while (true)
        {
            $ret = Curl::get($this->proxyListApi);
            var_dump($ret);
            if ($ret) {
                $ret = json_decode($ret, true);
                // Only accept a decoded response with code 10001 and a non-empty proxy list
                if (is_array($ret) && $ret['code'] == 10001 && isset($ret['data']['proxy_list']) && !empty($ret['data']['proxy_list'])) {
                    foreach ($ret['data']['proxy_list'] as $proxy) {
                        $proxyItem = $proxy['ip'] . ':' . $proxy['port'];
                        // Push to the head of the list so the newest proxy is read first
                        Kv::redis()->lPush(self::PROXY_KV_KEY, $proxyItem);
                    }
                }
            }
            sleep(self::TIMER);
        }
    }
}
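The consuming side of the pool is not shown in this article. As a rough sketch of the "newest at the head, each proxy used once" idea, the snippet below pushes to the head and pops from the head. FakeRedisList is an in-memory stand-in written for this example so it runs without a Redis server, and takeProxy() is a hypothetical helper; in the real project the same two calls (lPush and lPop) would go to whatever client Kv::redis() returns, which is assumed here to behave like phpredis.

```php
<?php
// In-memory stand-in for a Redis list; only the two calls used here are modelled.
final class FakeRedisList
{
    private $items = [];

    // LPUSH: insert at the head of the list and return the new length.
    public function lPush(string $key, string $value): int
    {
        if (!isset($this->items[$key])) {
            $this->items[$key] = [];
        }
        array_unshift($this->items[$key], $value);
        return count($this->items[$key]);
    }

    // LPOP: remove and return the head, or false when the list is empty.
    public function lPop(string $key)
    {
        if (empty($this->items[$key])) {
            return false;
        }
        return array_shift($this->items[$key]);
    }
}

// Hypothetical helper: take the newest proxy and remove it, so it is used only once.
function takeProxy($redis, string $key): ?string
{
    $proxy = $redis->lPop($key);
    return ($proxy === false || $proxy === null) ? null : $proxy;
}

$redis = new FakeRedisList();
$redis->lPush('spider:proxy:list', '10.0.0.1:1080'); // older entry (made-up address)
$redis->lPush('spider:proxy:list', '10.0.0.2:1080'); // newest entry, now at the head
echo takeProxy($redis, 'spider:proxy:list'), PHP_EOL;
// prints "10.0.0.2:1080"
```

Because the entry is popped rather than read, a second call returns the next proxy and an empty pool yields null, which the caller has to handle.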

4. Register the proxy pool update process so that it starts together with the project (link to the full code)

public static function mainServerCreate(EventRegister $register)
{
        // Proxy pool update process
        ServerManager::getInstance()->getSwooleServer()->addProcess((new \App\Process\UpdateProxyPool('UpdateProxyPool', []))->getProcess());
}

Process that periodically crawls the list page

List page crawl process (full code link)


      
<?php
/**
 * Created by PhpStorm.
 * User: ar414.com@gmail.com
 * Date: 2019/12/7
 * Time: 22:01
 */

namespace App\Process;

use App\Lib\ChromeDriver;
use App\Lib\Kv;
use EasySwoole\Component\Process\AbstractProcess;
use EasySwoole\EasySwoole\Logger;

class ListSpider extends AbstractProcess
{
    const API = 'https://www.188-sb.com/SportsBook.API/web?lid=1&zid=3&pd=%23AC%23B151%23C1%23D50%23E10%23F163%23&cid=42&ctid=42';

    const LIST_KV_KEY = 'spider:list';
    const TIMER = 20; // Execute once every 20 seconds

    public function run($arg)
    {
        while (true)
        {
            // Initialised here so the cleanup below is safe even if the driver fails to start
            $driver = null;
            try
            {
                $driver = (new ChromeDriver(true))->getDriver();
                $driver->get(self::API);
                $listStr = $driver->getPageSource();
                var_dump($listStr);
                file_put_contents("/www/wwwroot/blog/phpseleniumdemo/listStr.html", $listStr);
                // Capture the value of every "PD=...;" field (U makes the match lazy)
                preg_match_all("/PD=(.*);/U", $listStr, $list);
                $list = array_unique($list[1]);
                if ($list)
                {
                    Kv::redis()->set(self::LIST_KV_KEY, json_encode($list));
                }
                var_dump('done');
            }
            catch (\Throwable $throwable)
            {
                Logger::getInstance()->log($throwable->getMessage(), 'ListSpiderError');
                var_dump($throwable->getMessage());
            }
            finally
            {
                // quit() closes every window and ends the browser session
                if ($driver !== null)
                {
                    $driver->quit();
                }
            }
            sleep(self::TIMER);
        }
    }
}
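The extraction step can be tried on its own. The sketch below assumes each PD field in the page source ends with a semicolon; extractItemPaths() is a hypothetical helper and the sample string is made up for illustration, not real API output. It also reindexes the result, because array_unique() keeps the original keys and json_encode() would otherwise emit a JSON object instead of an array whenever a duplicate leaves a gap.

```php
<?php
// Hypothetical helper mirroring the extraction done in ListSpider.
function extractItemPaths(string $pageSource): array
{
    // U makes (.*) lazy, so each match stops at the first ';' after "PD=".
    preg_match_all('/PD=(.*);/U', $pageSource, $matches);
    // array_unique() preserves keys; array_values() reindexes from 0
    // so json_encode() produces a JSON array.
    return array_values(array_unique($matches[1]));
}

// Made-up sample input: the first two entries share the same PD value.
$sample = 'ID=1;PD=#AC#B1#C1#;NA=a;|ID=2;PD=#AC#B1#C1#;|ID=3;PD=#AC#B1#C2#;';
echo json_encode(extractItemPaths($sample)), PHP_EOL;
// prints ["#AC#B1#C1#","#AC#B1#C2#"]
```

Since the consumer decodes with json_decode($ret, true) and only iterates the values, either JSON shape works there; reindexing simply keeps the stored value predictable.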

The main process periodically reads the list page tasks from Redis and dispatches each item to an asynchronous task

1. Complete code link

public static function mainServerCreate(EventRegister $register)
    {
        // Proxy pool update process
        ServerManager::getInstance()->getSwooleServer()->addProcess((new \App\Process\UpdateProxyPool('UpdateProxyPool', []))->getProcess());
        // List page crawl process
        ServerManager::getInstance()->getSwooleServer()->addProcess((new \App\Process\ListSpider('ListSpider', []))->getProcess());

        $register->set($register::onWorkerStart, function (\swoole_server $server, $workerId) {
            // Register the timer in worker 0 only, so it does not run once per worker
            if ($workerId == 0)
            {
                // Runs every 30000 ms (30 seconds)
                Timer::getInstance()->loop(30000, function () {
                    $ret = Kv::redis()->get(ListSpider::LIST_KV_KEY);
                    if ($ret) {
                        $ret = json_decode($ret, true);
                        foreach ($ret as $item) {
                            // Hand each list item to an asynchronous task
                            TaskManager::async(function () use ($item) {
                                (new ItemSpider(true))->run($item);
                                return true;
                            }, function () use ($item) {
                                var_dump("{$item} Done");
                            });
                        }
                    }
                });
            }
        });
    }
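The read-and-dispatch step inside the timer can be isolated and exercised without Swoole. The following is a sketch under stated assumptions: dispatchItems() is a hypothetical helper, and the callable stands in for the TaskManager::async(...) call used in the project. It adds guards for an empty cache and for a payload that does not decode to an array.

```php
<?php
// Hypothetical helper: decode the cached list and pass each item to $dispatch.
// Returns the number of items dispatched.
function dispatchItems($cached, callable $dispatch): int
{
    if (!is_string($cached) || $cached === '') {
        return 0; // nothing cached yet
    }
    $items = json_decode($cached, true);
    if (!is_array($items)) {
        return 0; // corrupt or unexpected payload
    }
    $count = 0;
    foreach ($items as $item) {
        $dispatch($item);
        $count++;
    }
    return $count;
}

$seen = [];
$n = dispatchItems('["#AC#B1#C1#","#AC#B1#C2#"]', function ($item) use (&$seen) {
    $seen[] = $item; // the project would call TaskManager::async(...) here
});
echo $n, PHP_EOL;
// prints 2
```

Keeping the decoding separate from the task call makes it straightforward to check the guard conditions before wiring the callable to real asynchronous tasks.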

2. ItemSpider logic


      
<?php
/**
 * Created by PhpStorm.
 * User: ar414.com@gmail.com
 * Date: 2019/12/7
 * Time: 22:35
 */

namespace App\Spider;

use App\Lib\ChromeDriver;
use EasySwoole\EasySwoole\Logger;
use Facebook\WebDriver\WebDriverBy;
use Facebook\WebDriver\WebDriverExpectedCondition;

class ItemSpider
{
    public function run($itemPath)
    {
        $driver = (new ChromeDriver(true))->getDriver();
        // List items use '#' as the separator; the page URL uses '/'
        $itemPath = str_replace('#', '/', $itemPath);
        $url = "https://www.188-sb.com/#{$itemPath}";
        var_dump($url);
        try
        {
            $driver->get($url);
            // Wait until the market group buttons are rendered
            $driver->wait(ChromeDriver::WAIT_SECONDS)->until(
                WebDriverExpectedCondition::visibilityOfElementLocated(
                    WebDriverBy::className('gl-MarketGroupButton_Text')
                )
            );
            Logger::getInstance()->console("The title is '" . $driver->getTitle() . "'\n");
            Logger::getInstance()->console("The current URI is '" . $driver->getCurrentURL() . "'\n");
            $body = $driver->getPageSource();
            var_dump($body);
            //TODO: clean the data and store it in the database
        }
        catch (\Throwable $throwable)
        {
            Logger::getInstance()->log($throwable->getMessage(), 'Bet365ApiRun');
        }
        finally
        {
            // quit() closes every window and ends the browser session
            $driver->quit();
        }
    }
}
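To see what URL ItemSpider ends up requesting, the path rewrite can be run in isolation. buildItemUrl() is a hypothetical helper and the item path is a made-up value in the same '#'-separated shape as the list entries.

```php
<?php
// Hypothetical helper mirroring the URL construction in ItemSpider::run().
function buildItemUrl(string $itemPath, string $base = 'https://www.188-sb.com/'): string
{
    // '#' separators in the item path become '/', then the path is appended
    // after the '#' fragment marker of the base URL.
    return $base . '#' . str_replace('#', '/', $itemPath);
}

echo buildItemUrl('#AC#B1#C1#'), PHP_EOL;
// prints "https://www.188-sb.com/#/AC/B1/C1/"
```

Everything after the '#' is a URL fragment that the page's own JavaScript interprets, which is why a real browser driven through ChromeDriver is used here instead of a plain HTTP request.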

3. Run

[[email protected] phpseleniumdemo]# php easyswoole start