I don't know how many people use netdisk search engines these days, but I have put a lot of work into Quzhuanpan (去转盘网), and it now has a fair number of users. Netdisk resources share one chronic problem: a shared resource can become invalid. Yet many search engines make no attempt to detect this, especially the ones built on Google Custom Search, which have little technical content. Their webmasters only care about making money and give very little thought to user experience. This post is the latest in a series of technical blog posts about the site.

I have already made almost all of its technical details public, and this article adds one more piece:

First, a review of the earlier posts: Baidu netdisk crawler; Java word segmentation algorithm; automatic database backup; proxy server crawling; inviting friends to register.

# coding: utf-8
"""
@author: haoning
@create time: 2015.8.5
"""
from __future__ import division  # true division
from Queue import Queue
from collections import OrderedDict
import copy
import datetime
import json
import math
import os
import random
import platform
import re
import threading, errno
import time
import urllib2
import MySQLdb as MDB

DB_HOST = '127.0.0.1'
DB_USER = 'root'
DB_PASS = 'root'

def gethtml(url):
    try:
        print "url", url
        req = urllib2.Request(url)
        response = urllib2.urlopen(req, None, 8)  # 8-second timeout
        html = response.read()  # read the page body
        return html
    except Exception, e:
        print "e", e

if __name__ == '__main__':
    while 1:
        #url='http://pan.baidu.com/share/link?uk=1813251526&shareid=540167442'
        url = "http://pan.baidu.com/s/1qXQD2Pm"
        html = gethtml(url)
        print html

Result: e HTTP Error 403: Forbidden. In other words, Baidu applies anti-crawler protection to this URL. After looking at many sites, I happened to try the following link:

http://pan.baidu.com/share/li…

if __name__ == '__main__':

   while 1:
       url='http://pan.baidu.com/share/link?uk=1813251526&shareid=540167442'
       #url="http://pan.baidu.com/s/1qXQD2Pm"
       html=gethtml(url)
       print html

Result: <title>Baidu cloud disk - link does not exist</title>. As you know, a page with this title means the share has become invalid. It also seems that Baidu has no anti-crawler protection on this URL form. Nice.
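The title check described above can be sketched in a few lines of Python 3. This is only an illustration, not the site's code: the function names are hypothetical, the marker string 链接不存在 ("link does not exist") is the decoded flag of the baidu rule in the configuration file shown later in this post, and the sample titles in the usage lines are made up.

```python
import re

# Decoded flag of the "baidu" rule from the post's rule.properties.
INVALID_MARKER = "链接不存在"


def extract_title(html):
    """Return the text of the first <title> element, or '' if there is none."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def looks_invalid(html):
    """True when the page title carries the 'link does not exist' marker."""
    return INVALID_MARKER in extract_title(html)


if __name__ == "__main__":
    # Hypothetical page bodies, for illustration only.
    print(looks_invalid("<title>百度云 网盘-链接不存在</title>"))
    print(looks_invalid("<title>some shared file</title>"))
```

A regular expression is enough for a single tag here; the Java utility below uses Jsoup for the same job.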

In fact, a Baidu netdisk share can be reached through two URL forms:

One is http://pan.baidu.com/s/1qXQD2Pm, which ends with a short code.

The other is http://pan.baidu.com/share/li… , keyed by shareid + uk. The former is known to be protected against crawlers; the latter currently is not. After testing with Python, I translated the code into Java, because Quzhuanpan is written in Java. Here is the code:
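Before the Java code, here is a small Python 3 sketch of how the two URL forms could be told apart. It is an assumption-laden illustration: `classify_share_url` is a hypothetical helper, and the `/share/link` path with `uk` and `shareid` query parameters is taken from the test URL used in the Python snippet above.

```python
from urllib.parse import urlparse, parse_qs


def classify_share_url(url):
    """Return ('short', code), ('long', (uk, shareid)) or ('unknown', None)."""
    parts = urlparse(url)
    if parts.netloc != "pan.baidu.com":
        return ("unknown", None)
    # Short form: /s/<short code>
    if parts.path.startswith("/s/") and len(parts.path) > 3:
        return ("short", parts.path[3:])
    # Long form: /share/link?uk=...&shareid=...
    if parts.path == "/share/link":
        query = parse_qs(parts.query)
        uk = query.get("uk", [None])[0]
        shareid = query.get("shareid", [None])[0]
        if uk and shareid:
            return ("long", (uk, shareid))
    return ("unknown", None)


if __name__ == "__main__":
    print(classify_share_url("http://pan.baidu.com/s/1qXQD2Pm"))
    print(classify_share_url(
        "http://pan.baidu.com/share/link?uk=1813251526&shareid=540167442"))
```

A checker could use such a helper to prefer the long form whenever uk and shareid are known. The Java utility itself follows.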

package com.tray.common.utils;

import static org.junit.Assert.*;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Properties;
import java.util.Random;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.junit.Test;

/**
 * Resource validation utility
 * 
 * @author hui
 * 
 */
public class ResourceCheckUtil {
    private static Map<String, String[]> rules;
    static {
        loadRule();
    }

    /**
     * Load the rule library
     */
    public static void loadRule() {
        try {
            InputStream in = ResourceCheckUtil.class.getClassLoader()
                    .getResourceAsStream("rule.properties");
            Properties p = new Properties();
            p.load(in);
            Set<Object> keys = p.keySet();
            Iterator<Object> iterator = keys.iterator();
            String key = null;
            String value = null;
            String[] rule = null;
            rules = new HashMap<String, String[]>();
            while (iterator.hasNext()) {
                key = (String) iterator.next();
                value = (String) p.get(key);
                rule = value.split("\\|");
                rules.put(key, rule);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static String httpRequest(String url) {
        try {
            URL u = new URL(url);
            Random random = new Random();
            HttpURLConnection connection = (HttpURLConnection) u
                    .openConnection();
            connection.setConnectTimeout(3000); // 3-second timeout
            connection.setReadTimeout(3000); 
            connection.setDoOutput(true);
            connection.setDoInput(true);
            connection.setUseCaches(false);
            connection.setRequestMethod("GET");
            
            String[] user_agents = {
                    "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11",
                    "Opera/9.25 (Windows NT 5.1; U; en)",
                    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
                    "Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)",
                    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12",
                    "Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9",
                    "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7",
                    "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0 "
            };
            int index = random.nextInt(user_agents.length); // any entry, including the last one
            /*connection.setRequestProperty("Content-Type",
                    "text/html;charset=UTF-8");*/
            connection.setRequestProperty("User-Agent",user_agents[index]);
            /*connection.setRequestProperty("Accept-Encoding","gzip, deflate, sdch");
            connection.setRequestProperty("Accept-Language","zh-CN,zh;q=0.8");
            connection.setRequestProperty("Connection","keep-alive");
            connection.setRequestProperty("Host","pan.baidu.com");
            connection.setRequestProperty("Cookie","");
            connection.setRequestProperty("Upgrade-Insecure-Requests","1");*/
            InputStream in = connection.getInputStream();

            BufferedReader br = new BufferedReader(new InputStreamReader(in,
                    "utf-8"));
            StringBuffer sb = new StringBuffer();
            String line = null;
            while ((line = br.readLine()) != null) {
                sb.append(line);
            }
            return sb.toString();

        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

        return null;
    }

     @Test
     public void test7() throws Exception {
         System.out.println(isExistResource("http://pan.baidu.com/s/1jGjBmyq",
         "baidu"));
         System.out.println(isExistResource("http://pan.baidu.com/s/1jGjBmyqa",
         "baidu"));
        
         System.out.println(isExistResource("http://yunpan.cn/cQx6e6xv38jTd","360"));
         System.out.println(isExistResource("http://yunpan.cn/cQx6e6xv38jTdd",
         "360"));
        
         System.out.println(isExistResource("http://share.weiyun.com/ec4f41f0da292adb89a745200b8e8b57","weiyun"));
         System.out.println(isExistResource("http://share.weiyun.com/ec4f41f0da292adb89a745200b8e8b57dd",
         "weiyun"));
        
         System.out.println(isExistResource("http://cloud.letv.com/s/eiGLzuSes","leshi"));
         System.out.println(isExistResource("http://cloud.letv.com/s/eiGLzuSesdd",
         "leshi"));
     }

    /**
     * Get the text of a tag on the given page
     * 
     * @param url
     * @param tagName
     *            tag name
     * @return
     */
    private static String getHtmlContent(String url, String tagName) {
        String html = httpRequest(url);
        if(html==null){
            return "";
        }
        Document doc = Jsoup.parse(html);
        //System.out.println("doc======"+doc);
        Elements tag=null;
        if(tagName.equals("<h3>")){ // for Weiyun
            tag=doc.select("h3");
        }
        else if(tagName.equals("class")){ // for 360
            tag=doc.select("div[class=tip]");
        }
        else{
            tag= doc.getElementsByTag(tagName);
        }
        //System.out.println("tag======"+tag);
        String content="";
        if(tag!=null&&!tag.isEmpty()){
            content = tag.get(0).text();
        }
        return content;
    }

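    /**
     * Check whether a shared resource still looks valid.
     * 
     * @param url
     *            share URL to check
     * @param ruleName
     *            key of the rule in rule.properties, e.g. "baidu"
     * @return 1 if the page passes the rule, 0 otherwise (including on errors)
     */
    public static int isExistResource(String url, String ruleName) {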
        try {
            String[] rule = rules.get(ruleName);
            String tagName = rule[0];
            String opt = rule[1];
            String flag = rule[2];
            /*System.out.println("ruleName"+ruleName);
            System.out.println("tagName"+tagName);
            System.out.println("opt"+opt);
            System.out.println("flag"+flag);
            System.out.println("url"+url);*/
            String content = getHtmlContent(url, tagName);
            //System.out.println("content="+content);
            if(ruleName.equals("baidu")){
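                if(content.contains("百度云升级")){ // "Baidu Cloud upgrade" page; original note: treat an upgrade as nonexistent (but 1 is the value the other branches return for a live resource)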
                    return 1;
                }
            }
            String regex = null;
            if ("eq".equals(opt)) {
                regex = "^" + flag + "$";
            } else if ("bg".equals(opt)) {
                regex = "^" + flag + ".*$";
            } else if ("ed".equals(opt)) {
                regex = "^.*" + flag + "$";
            } else if ("like".equals(opt)) {
                regex = "^.*" + flag + ".*$";
            }else if("contain".equals(opt)){
                if(content.contains(flag)){
                    return 0;
                }
                else{
                    return 1;
                }
            }
            if(content.matches(regex)){
                return 1;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return 0;
    }

    // public static void main(String[] args)throws Exception {
    // final Path p = Paths.get("C:/Users/hui/Desktop/6-14/");
    // final WatchService watchService =
    // FileSystems.getDefault().newWatchService();
    // p.register(watchService, StandardWatchEventKinds.ENTRY_MODIFY);
    // new Thread(new Runnable() {
    //
    // public void run() {
    // while(true){
    // System.out.println("Checking...");
    // try {
    // WatchKey watchKey = watchService.take();
    // List<WatchEvent<?>> watchEvents = watchKey.pollEvents();
    //
    // for(WatchEvent<?> event : watchEvents){
    // //TODO take different actions depending on the event type
    // System.out.println("File ["+p.getFileName()+"/"+event.context()+"] had a ["+event.kind()+"] event");
    // }
    // watchKey.reset();
    //
    // } catch (Exception e) {
    // e.printStackTrace();
    // }
    // }
    // }
    // }).start();
    // }
    
//    @Test
//    public void testName() throws Exception {
//        System.out.println(new String("\u8BF7\u8F93\u5165\u63D0\u53D6\u7801".getBytes("utf-8"), "utf-8"));
//    }

}

Note that the code was also meant to be compatible with 360, Weiyun, Letv cloud and other netdisks. Some of those netdisks have since shut down, as we all know, but the code is still there. That is how a programmer thinks: keep the design open to extension. Note also that the code relies on a configuration file, which I attach here:

360=class|contain|\u5206\u4EAB\u8005\u5DF2\u53D6\u6D88\u6B64\u5206\u4EAB
baidu=title|contain|\u94FE\u63A5\u4E0D\u5B58\u5728
weiyun=<h3>|contain|\u5206\u4EAB\u8D44\u6E90\u5DF2\u7ECF\u5220\u9664
leshi=title|ed|\u63D0\u53D6\u6587\u4EF6
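Each rule line has the shape name=tag|operator|flag. How the Java method isExistResource interprets the operator can be sketched roughly in Python 3 as below. This is an approximation under stated assumptions: `parse_rule` and `resource_exists` are hypothetical names, plain string comparisons are swapped in for the regular expressions the Java code builds, and the flag is written in readable Chinese instead of escapes.

```python
def parse_rule(line):
    """Split 'name=tag|op|flag' into (name, (tag, op, flag))."""
    name, _, value = line.partition("=")
    tag, op, flag = value.split("|")
    return name.strip(), (tag, op, flag)


def resource_exists(content, op, flag):
    """Mirror the Java return values: 1 means 'looks alive', 0 means 'not'."""
    if op == "contain":
        # Here the flag is an "invalid" marker: finding it means the share is gone.
        return 0 if flag in content else 1
    if op == "eq":
        return 1 if content == flag else 0
    if op == "bg":  # content begins with the flag
        return 1 if content.startswith(flag) else 0
    if op == "ed":  # content ends with the flag
        return 1 if content.endswith(flag) else 0
    if op == "like":  # flag appears anywhere in the content
        return 1 if flag in content else 0
    return 0  # unknown operator: the Java code ends up returning 0 as well


if __name__ == "__main__":
    name, (tag, op, flag) = parse_rule("baidu=title|contain|链接不存在")
    print(name, tag, op, resource_exists("百度云 网盘-链接不存在", op, flag))
```

Note the asymmetry: with contain, a hit means the resource is invalid, while with the other operators a hit means it is still there.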

Sorry, the flags are Unicode escapes, so you will have to decode them yourself. If you don't know how, search Baidu for: Unicode transcoding tool
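If you would rather not use an online tool, a few lines of Python 3 will do the decoding. This sketch assumes the flags are standard backslash-u escapes (the form Java's Properties.load decodes by itself when it reads the file); `decode_escapes` is a hypothetical helper name.

```python
def decode_escapes(text):
    """Decode Java-style Unicode escapes (backslash, 'u', four hex digits)."""
    return text.encode("ascii").decode("unicode_escape")


# Flags copied from the configuration file, as raw strings so that
# Python does not interpret the escapes before we do.
FLAGS = {
    "360": r"\u5206\u4EAB\u8005\u5DF2\u53D6\u6D88\u6B64\u5206\u4EAB",
    "baidu": r"\u94FE\u63A5\u4E0D\u5B58\u5728",
    "weiyun": r"\u5206\u4EAB\u8D44\u6E90\u5DF2\u7ECF\u5220\u9664",
    "leshi": r"\u63D0\u53D6\u6587\u4EF6",
}

if __name__ == "__main__":
    for name, flag in FLAGS.items():
        print(name, decode_escapes(flag))
    # 360 分享者已取消此分享
    # baidu 链接不存在
    # weiyun 分享资源已经删除
    # leshi 提取文件
```

In English the four flags read roughly: "the sharer has cancelled this share", "link does not exist", "the shared resource has been deleted" and "extract file".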

That covers Quzhuanpan's invalid-link check. I have now published the code in full. If you like this blog, please bookmark it and follow me.

I have set up a QQ group where everyone is welcome to talk tech together, group number: 512245829. Friends who prefer Weibo can follow: entertainment can be rotating