Baidu Cloud (the Baidu network disk) hosts a huge amount of shared content: software, video self-study courses of every kind, e-books, and even movies and BT torrents, yet Baidu Cloud itself offers no search over any of it. Whenever I want to track down a piece of software or an American TV series, it is a real pain, so I tried to build a search system for Baidu Cloud resources.

  1. Resource crawler approach: the most important thing for a search engine is a large body of resources; once the resources are in hand, building full-text retrieval over them is the straightforward part. So the first step is working out how to crawl Baidu Cloud's shared resources. Open the home page of any Baidu Cloud sharer, yun.baidu.com/share/home?uk=xxxxxx&view=share#category/type=0, and you will see that the sharer has both subscriptions and fans. By recursively walking these follow and fan lists you can collect a large number of sharer UKs, and each UK in turn yields that user's shared resources, as sketched below.
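    A minimal sketch of this traversal, under the assumption that pagination, proxying, and persistence are stripped out; the endpoints and JSON field names are the same ones the full crawler in section 6 uses:

      # Breadth-first walk over Baidu Cloud sharers: start from one known UK,
      # print that user's shares, then enqueue the UKs found in its follow/fan lists.
      import requests

      URL_SHARE = 'http://yun.baidu.com/pcloud/feed/getsharelist?auth_type=1&start=0&limit=20&query_uk={uk}'
      URL_FOLLOW = 'http://yun.baidu.com/pcloud/friend/getfollowlist?query_uk={uk}&limit=20&start=0'
      URL_FANS = 'http://yun.baidu.com/pcloud/friend/getfanslist?query_uk={uk}&limit=20&start=0'

      def crawl(seed_uk, max_users=100):
          seen = set([seed_uk])
          queue = [seed_uk]
          s = requests.Session()
          while queue and len(seen) < max_users:
              uk = queue.pop(0)
              # this user's shared files
              shares = s.get(URL_SHARE.format(uk=uk)).json()
              for rec in shares.get('records', []):
                  print(rec['title'])
              # discover more sharers through the follow and fan lists
              follows = s.get(URL_FOLLOW.format(uk=uk)).json()
              fans = s.get(URL_FANS.format(uk=uk)).json()
              new_uks = [u['follow_uk'] for u in follows.get('follow_list', [])]
              new_uks += [u['fans_uk'] for u in fans.get('fans_list', [])]
              for new_uk in new_uks:
                  if new_uk not in seen:
                      seen.add(new_uk)
                      queue.append(new_uk)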

  2. System implementation environment:

    • Language: Python

    • Operating system: Linux

    • Other middleware: Nginx, MySQL, Sphinx

  3. The system consists of several independent parts:

    • A standalone resource crawler built on requests

    • A resource indexer based on the open-source full-text search engine Sphinx

    • A simple website built with Django + Bootstrap 3, deployed with nginx 1.8 + FastCGI (flup) + Python. Demo site: http://www.itjujiao.com (a rough sketch of how the site can query the index follows this list).
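    For reference, a minimal sketch of how the site side can query the Sphinx index: searchd speaks the MySQL wire protocol (SphinxQL), so the same MySQLdb driver can be reused. The index name baiduyun_share and the port 9306 are illustrative assumptions, not the actual configuration.

      import MySQLdb as mdb

      def search_shares(keyword, limit=20):
          # connect to searchd's SphinxQL listener, not to MySQL itself
          conn = mdb.connect(host='127.0.0.1', port=9306, charset='utf8')
          cur = conn.cursor()
          cur.execute('SELECT id FROM baiduyun_share WHERE MATCH(%s) LIMIT %s',
                      (keyword, limit))
          ids = [row[0] for row in cur.fetchall()]
          cur.close()
          conn.close()
          # the matching document ids are then looked up in the MySQL share tables
          return ids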

  4. Follow-up optimizations:

    • Word segmentation: the segmentation-based search results are currently not very good, and I would appreciate pointers from anyone with experience here. For example, searching for “Kung Fu Panda: Secrets of the Scroll” returns nothing, yet searching for “Kung Fu Panda” does return a result set (Kung Fu Panda 3.English.Chinese-English subtitles.mp4, Kung Fu Panda 2.Kung.Fu.Panda.2.2011.BDrip.720p.Mandarin-Cantonese-English-Taiwanese audio.special-effects Chinese-English subtitles.mp4, Kung Fu Panda 3 (Korean version) 2016.mkv, etc.), and so does searching for “Secrets of the Scroll” ([USA] Kung Fu Panda: Secrets of the Scroll.2016.1080p.mp4, Kung Fu Panda: Secrets of the Scroll.HD1280 ultra-clear Chinese-English dual subtitles.mp4, etc.)

    • Much of the crawled data turns out to be duplicated shared resources, so I am considering using MD5 as the basis for deduplication; a rough sketch follows.
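    A minimal sketch of the MD5-based deduplication idea; as an illustrative assumption the key here is a hash of the normalized file name (hashing other share metadata would work the same way):

      import hashlib

      seen_keys = set()

      def is_duplicate(filename):
          # filename is expected to be a unicode/text string;
          # the MD5 of the normalized name serves as the dedup key
          key = hashlib.md5(filename.strip().lower().encode('utf-8')).hexdigest()
          if key in seen_keys:
              return True
          seen_keys.add(key)
          return False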

  5. PS:

    • The crawler has collected roughly 40 million (4000W) records so far, and Sphinx's memory requirements are huge, which is a real pitfall. Baidu rate-limits crawler IPs, so I wrote a simple Xicidaili proxy collector and made the HTTP proxy used by requests configurable (a small sketch follows this list).

    • Word segmentation uses Sphinx's built-in support, which does handle Chinese. Chinese segmentation is unigram-based (one character per token), which matches far too loosely, so the results are not particularly good. For example, searching for the keyword “Ip Man 3” also brings back results like “The Problem of the Leaf, 3rd Edition”, which is not what I expect. English tokenization likewise has plenty of room for improvement; for example, searching for Xart does not return X-Art results, which is actually the result set I want (you know).

    • The database is MySQL; because of the practical record limit of a single table, the resource table is split into 10 tables (a sketch of the routing follows this list). The first Sphinx run does a full index build; after that, incremental indexing is used.
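    A minimal sketch of routing requests through the collected HTTP proxies; the two proxy addresses are the ones from PROXY_LIST in the code below, and failure handling/rotation is omitted:

      import random
      import requests

      PROXIES = ['http://42.121.33.160:809', 'http://218.97.195.38:81']

      def fetch(url):
          proxy = random.choice(PROXIES)
          # requests expects a scheme -> proxy-URL mapping
          return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)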
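    And a minimal sketch of routing records across the 10 share tables; the split key (userid modulo 10) and the table names share_0 .. share_9 are illustrative assumptions:

      def share_table_for(userid):
          # pick one of the 10 physical share tables for this user
          return 'share_%d' % (int(userid) % 10)

      def insert_share(dbcurr, userid, filename, shareid):
          # dbcurr is an open MySQLdb cursor, as in the crawler below
          table = share_table_for(userid)
          dbcurr.execute(
              'INSERT INTO ' + table + '(userid, filename, shareid, status) VALUES(%s, %s, %s, 0)',
              (userid, filename, shareid))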

  6. The crawler part of the implementation (just some rough, stream-of-thought code):

    #coding: utf8
    
    import re
    import time
    from Queue import Queue
    import threading
    import json
    import requests
    import MySQLdb as mdb
    
    DB_HOST = '127.0.0.1'
    DB_USER = 'root'
    DB_PASS = ''
    
    
    re_start = re.compile(r'start=(\d+)')
    re_uid = re.compile(r'query_uk=(\d+)')
    re_pptt = re.compile(r'&pptt=(\d+)')
    re_urlid = re.compile(r'&urlid=(\d+)')
    
    ONEPAGE = 20
    ONESHAREPAGE = 20
    
    # Baidu Cloud API endpoints: a user's share list, follow list and fan list
    URL_SHARE = 'http://yun.baidu.com/pcloud/feed/getsharelist?auth_type=1&start={start}&limit=20&query_uk={uk}&urlid={id}'
    URL_FOLLOW = 'http://yun.baidu.com/pcloud/friend/getfollowlist?query_uk={uk}&limit=20&start={start}&urlid={id}'
    URL_FANS = 'http://yun.baidu.com/pcloud/friend/getfanslist?query_uk={uk}&limit=20&start={start}&urlid={id}'
    
    QNUM = 1000
    hc_q = Queue(20)
    hc_r = Queue(QNUM)
    
    success = 0
    failed = 0
    
    # HTTP proxy pool (collected from Xicidaili): each entry holds a host, a port and a few bookkeeping fields
    PROXY_LIST = [[0, 10, "42.121.33.160", 809, "", "", 0],
                    [5, 0, "218.97.195.38", 81, "", "", 0],
                    ]
    
    # Fetch worker: pull (type, url) tasks from hc_q, perform the HTTP GET,
    # and push the response text together with its URL onto hc_r for parsing.
    def req_worker(inx):
        s = requests.Session()
        while True:
            req_item = hc_q.get()
            
            req_type = req_item[0]
            url = req_item[1]
            r = s.get(url)
            hc_r.put((r.text, url))
            print "req_worker#", inx, url
            
    # Parse worker: consume (response, url) pairs from hc_r, decode the JSON,
    # and write follow / fan / share records plus follow-up page tasks to MySQL.
    def response_worker():
        dbconn = mdb.connect(DB_HOST, DB_USER, DB_PASS, 'baiduyun', charset='utf8')
        dbcurr = dbconn.cursor()
        dbcurr.execute('SET NAMES utf8')
        dbcurr.execute('set global wait_timeout=60000')
        while True:
            
            metadata, effective_url = hc_r.get()
            #print "response_worker:", effective_url
            try:
                tnow = int(time.time())
                id = re_urlid.findall(effective_url)[0]
                start = re_start.findall(effective_url)[0]
                if True:
                    if 'getfollowlist' in effective_url: #type = 1
                        follows = json.loads(metadata)
                        uid = re_uid.findall(effective_url)[0]
                        if "total_count" in follows.keys() and follows["total_count"]>0 and str(start) == "0":
                            for i in range((follows["total_count"]-1)/ONEPAGE):
                                try:
                                    dbcurr.execute('INSERT INTO urlids(uk, start, limited, type, status) VALUES(%s, %s, %s, 1, 0)' % (uid, str(ONEPAGE*(i+1)), str(ONEPAGE)))
                                except Exception as ex:
                                    print "E1", str(ex)
                                    pass
                        
                        if "follow_list" in follows.keys():
                            for item in follows["follow_list"]:
                                try:
                                    dbcurr.execute('INSERT INTO user(userid, username, files, status, downloaded, lastaccess) VALUES(%s, "%s", 0, 0, 0, %s)' % (item['follow_uk'], item['follow_uname'], str(tnow)))
                                except Exception as ex:
                                    print "E13", str(ex)
                                    pass
                        else:
                            print "delete 1", uid, start
                            dbcurr.execute('delete from urlids where uk=%s and type=1 and start>%s' % (uid, start))
                    elif 'getfanslist' in effective_url: #type = 2
                        fans = json.loads(metadata)
                        uid = re_uid.findall(effective_url)[0]
                        if "total_count" in fans.keys() and fans["total_count"]>0 and str(start) == "0":
                            for i in range((fans["total_count"]-1)/ONEPAGE):
                                try:
                                    dbcurr.execute('INSERT INTO urlids(uk, start, limited, type, status) VALUES(%s, %s, %s, 2, 0)' % (uid, str(ONEPAGE*(i+1)), str(ONEPAGE)))
                                except Exception as ex:
                                    print "E2", str(ex)
                                    pass
                        
                        if "fans_list" in fans.keys():
                            for item in fans["fans_list"]:
                                try:
                                    dbcurr.execute('INSERT INTO user(userid, username, files, status, downloaded, lastaccess) VALUES(%s, "%s", 0, 0, 0, %s)' % (item['fans_uk'], item['fans_uname'], str(tnow)))
                                except Exception as ex:
                                    print "E23", str(ex)
                                    pass
                        else:
                            print "delete 2", uid, start
                            dbcurr.execute('delete from urlids where uk=%s and type=2 and start>%s' % (uid, start))
                    else:
                        shares = json.loads(metadata)
                        uid = re_uid.findall(effective_url)[0]
                        if "total_count" in shares.keys() and shares["total_count"]>0 and str(start) == "0":
                            for i in range((shares["total_count"]-1)/ONESHAREPAGE):
                                try:
                                    dbcurr.execute('INSERT INTO urlids(uk, start, limited, type, status) VALUES(%s, %s, %s, 0, 0)' % (uid, str(ONESHAREPAGE*(i+1)), str(ONESHAREPAGE)))
                                except Exception as ex:
                                    print "E3", str(ex)
                                    pass
                        if "records" in shares.keys():
                            for item in shares["records"]:
                                try:
                                    dbcurr.execute('INSERT INTO share(userid, filename, shareid, status) VALUES(%s, "%s", %s, 0)' % (uid, item['title'], item['shareid']))
                                except Exception as ex:
                                    #print "E33", str(ex), item
                                    pass
                        else:
                            print "delete 0", uid, start
                            dbcurr.execute('delete from urlids where uk=%s and type=0 and start>%s' % (uid, str(start)))
                    dbcurr.execute('delete from urlids where id=%s' % (id, ))
                    dbconn.commit()
            except Exception as ex:
                print "E5", str(ex), id
    
            
            pid = re_pptt.findall(effective_url)
            
            if pid:
                print "pid>>>", pid
                ppid = int(pid[0])
                PROXY_LIST[ppid][6] -= 1
        dbcurr.close()
        dbconn.close()
        
    # Scheduler: turn pending rows of the urlids table into request tasks;
    # when urlids is drained, seed new rows from users that have not been processed yet.
    def worker():
        global success, failed
        dbconn = mdb.connect(DB_HOST, DB_USER, DB_PASS, 'baiduyun', charset='utf8')
        dbcurr = dbconn.cursor()
        dbcurr.execute('SET NAMES utf8')
        dbcurr.execute('set global wait_timeout=60000')
        while True:
    
            #dbcurr.execute('select * from urlids where status=0 order by type limit 1')
            # NOTE: the active query below skips type=0 (share-list pages); the
            # commented-out query above would crawl share lists as well.
            dbcurr.execute('select * from urlids where status=0 and type>0 limit 1')
            d = dbcurr.fetchall()
            #print d
            if d:
                id = d[0][0]
                uk = d[0][1]
                start = d[0][2]
                limit = d[0][3]
                type = d[0][4]
                dbcurr.execute('update urlids set status=1 where id=%s' % (str(id),))
                url = ""
                if type == 0:
                    url = URL_SHARE.format(uk=uk, start=start, id=id).encode('utf-8')
                elif  type == 1:
                    url = URL_FOLLOW.format(uk=uk, start=start, id=id).encode('utf-8')
                elif type == 2:
                    url = URL_FANS.format(uk=uk, start=start, id=id).encode('utf-8')
                if url:
                    hc_q.put((type, url))
                    
                #print "processed", url
            else:
                dbcurr.execute('select * from user where status=0 limit 1000')
                d = dbcurr.fetchall()
                if d:
                    for item in d:
                        try:
                            dbcurr.execute('insert into urlids(uk, start, limited, type, status) values("%s", 0, %s, 0, 0)' % (item[1], str(ONESHAREPAGE)))
                            dbcurr.execute('insert into urlids(uk, start, limited, type, status) values("%s", 0, %s, 1, 0)' % (item[1], str(ONEPAGE)))
                            dbcurr.execute('insert into urlids(uk, start, limited, type, status) values("%s", 0, %s, 2, 0)' % (item[1], str(ONEPAGE)))
                            dbcurr.execute('update user set status=1 where userid=%s' % (item[1],))
                        except Exception as ex:
                            print "E6", str(ex)
                else:
                    time.sleep(1)
                    
            dbconn.commit()
        dbcurr.close()
        dbconn.close()
            
        
    # Start 16 daemon fetch threads and one scheduler thread, then run the
    # parser loop in the main thread.
    for item in range(16):
        t = threading.Thread(target = req_worker, args = (item,))
        t.setDaemon(True)
        t.start()
    
    s = threading.Thread(target = worker, args = ())
    s.setDaemon(True)
    s.start()
    
    response_worker()