
Contributor: Alan

Abstract: CamScanner is an important tool for scanning and retaining documents. This article uses a Requests crawler to synchronize scanned files from the mobile client to the computer.

1. Background

In audit work, large quantities of documents need to be scanned and retained, so CamScanner has become the mainstream mobile scanning tool. Because the CamScanner web client does not support downloading scanned files in bulk, synchronizing files from the mobile client to the computer becomes a real need in certain restricted environments. This article uses a Requests crawler to download the scanned files, Tkinter to build an interactive interface, and PyInstaller to package an EXE for offline use.

2. Fiddler packet capture

Fiddler installation and certificate trust settings are not covered here.

(1) user/login request

Using the Firefox browser, visit the CamScanner login page (https://www.camscanner.com/user/login), click Login, and enter your user name and password. Observing the capture in Fiddler, the web form under user/login request No. 247 contains the submitted parameters, and the returned JSON contains a URL that had already been accessed (No. 249) among the URLs captured by Fiddler, so it should be the redirect URL (see Figure 2-1).
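As a minimal sketch of that finding, the redirect URL can be read from the login response body; the sample body below is illustrative, not a real server reply:

```python
import json

# Hypothetical user/login response body; the capture shows the real one
# carries the redirect URL in its "data" field.
sample_body = '{"data": "https://www.camscanner.com/files/holder"}'
redirect_url = json.loads(sample_body).get("data")
print(redirect_url)
```

This is the URL that request No. 249 then visits.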

Figure 2-1

(2) files/holder request

This request loads the page shown after login. Replaying it in Fiddler returned only a value of 0 rather than the logged-in page, so the request headers and cookies probably need to be simulated as well. After executing the request with headers and cookies, the logged-in page was obtained successfully; the cookies include the _csc, _csl, Hm_lvt_8f0191b1f1b207d4f6e0d42e771d6fde, _oa, _cdn, JSESSID, Hm_lpvt_8f0191b1f1b207d4f6e0d42e771d6fde, _ct, _isus, _cslt, S2, _cssu, and _csste parameters. The logged-in page should contain the list of all files the user has synchronized to the cloud, but the file list did not appear in the response, so it was presumably fetched from another URL. In the Fiddler capture, the file list was found in the response to request No. 270 in Figure 2-1 (see Figure 2-2).

Figure 2-2

(3) doc/list request

This request returns the file list. Executing a doc/list request with the headers and cookies in Fiddler successfully retrieves it.
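Assuming the doc/list response has the shape used by the code later in this article (a "data" object holding a "list" array; the sample body below is made up), the file list can be pulled out like this:

```python
import json

# Illustrative doc/list response body; the field names follow the
# parsing code in get_file_list below, the values are invented.
sample_body = '{"data": {"list": [{"doc_id": "abc123", "title": "invoice"}]}}'
file_list = json.loads(sample_body)["data"]["list"]
print(file_list[0]["title"])
```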

(4) doc/downloadpdf request

When a file is downloaded from Firefox, the requests doc/downloadpdf and stat/download appear in Fiddler. doc/downloadpdf returns the download address corresponding to the unique identifier doc_id of the given file (as shown in Figure 2-3), while stat/download merely confirms the download status. Therefore we are mainly concerned with doc/downloadpdf requests.
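The request body sent to doc/downloadpdf wraps the doc_id in a small JSON document, as the code later in this article does; a sketch with a made-up doc_id:

```python
import json

doc_id = "0123456789abcdef"  # hypothetical doc_id taken from the file list
# The form field "json_download" carries a JSON string naming the doc,
# suffixed with ".jdoc", matching get_downloadfile_url_one below.
params = {"json_download": json.dumps({"docs": ["%s.jdoc" % doc_id]})}
print(params["json_download"])
```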

Figure 2-3

3. Requests crawler implementation

After the packet capture in section 2, we can start implementing the code. Since the program must be executable at work without a Python environment, Selenium is set aside for now and Requests is used instead. The program consists of three main parts: login_page, get_file_list, and get_downloadfile_url_one.

(1) Program initialization

Set up the requests session, headers, and so on, and use the logging package to log within the application, as follows:

import json
import logging
import time
import urllib.request

import requests


class downloadscan(object):
    def __init__(self, username, password):
        self.session = requests.session()
        self.url = "https://www.camscanner.com/user/login"
        self.username, self.password = username, password
        self.respage = None
        self.headers = {
            "Host": "www.camscanner.com",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0",
            "Accept": "text/plain, */*; q=0.01",
            "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
            "Accept-Encoding": "gzip, deflate, br",
            "Content-Type": "application/x-www-form-urlencoded",
            "X-Requested-With": "XMLHttpRequest",
            "Connection": "keep-alive",
            "Cache-Control": "max-age=0, no-cache",
            "Pragma": "no-cache"}
        self.file_list = None
        self.tmp_cookie = None
        self.download_res = {"filename": None, "address": None}
        self.status = True
        self.logger = logging.getLogger(__name__)
        self.logger.setLevel(level=logging.INFO)
        handler = logging.FileHandler("./camscanner/log/log")
        handler.setLevel(logging.INFO)
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        handler.setFormatter(formatter)
        self.logger.addHandler(handler)

(2) Log in: login_page

Post the login form and set status to record the login state (success or failure); see login_page for details. Then follow the redirect with the redirect_to_holder function and update the login status, as follows:

    def login_page(self):
        params = {"act": "submit", "redirect_uri": "", "area_code": "86",
                  "username": self.username, "password": self.password, "rememberme": False}
        self.respage = self.session.post(self.url, data=params, headers=self.headers)
        self.logger.info(self.url + " status_code: " + str(self.respage.status_code))
        self.generate_header("login")
        if str(self.respage.status_code) == "200" and self.status == True:
            self.logger.info("++++++++oh my god, login succeed!++++++")
        else:
            self.status = False

    def redirect_to_holder(self):
        if "data" in json.loads(self.respage.text):
            self.url = json.loads(self.respage.text)["data"]
            self.respage = self.session.get(self.url, headers=self.headers)
            self.logger.info(self.url + " status_code: " + str(self.respage.status_code))
            self.status = True
        else:
            self.status = False
        if str(self.respage.status_code) == "200" and self.status == True:
            self.logger.info("++++++++oh my god, redirect login succeed!++++++")
        else:
            self.status = False

(3) Set the request headers

    def generate_header(self, action):
        if action == "login":
            self.headers["Referer"] = self.url
            self.headers["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
            self.headers["Upgrade-Insecure-Requests"] = "1"
            self.tmp_cookie = requests.utils.dict_from_cookiejar(self.respage.cookies)
            if "_csc" in self.tmp_cookie and "JSESSID" in self.tmp_cookie and "S2" in self.tmp_cookie and "_cssu" in self.tmp_cookie:
                self.headers["Cookie"] = "_csc=%s; _csl=zh-cn; _oa=86,%s; JSESSID=%s; Hm_lvt_8f0191b1f1b207d4f6e0d42e771d6fde=%s; _ct=1; _isus=1; _cslt=1; S2=%s; _cssu=%s; _csste=%s; Hm_lvt_8f0191b1f1b207d4f6e0d42e771d6fde=%s; _cssrd=1" % (self.tmp_cookie["_csc"], self.username, self.tmp_cookie["JSESSID"], int(time.time()), self.tmp_cookie["S2"], self.tmp_cookie["_cssu"], int(time.time()), int(time.time()))
            else:
                self.status = False
                self.logger.warning(self.url + " Login failed, headers setting failed")
        else:
            self.headers["Cache-Control"] = 'no-cache'
            self.headers["Content-Type"] = "application/x-www-form-urlencoded; charset=utf-8"
            self.headers["Accept"] = "text/plain, */*; q=0.01"
            self.headers["Referer"] = 'https://www.camscanner.com/files/holder'
            if "Upgrade-Insecure-Requests" in self.headers:
                self.headers.pop("Upgrade-Insecure-Requests")
            if "_csc" in self.tmp_cookie and "JSESSID" in self.tmp_cookie and "S2" in self.tmp_cookie and "_cssu" in self.tmp_cookie:
                self.headers["Cookie"] = "_csc=%s; _csl=zh-cn; _oa=86,%s; JSESSID=%s; Hm_lvt_8f0191b1f1b207d4f6e0d42e771d6fde=%s; _ct=1; _isus=1; _cslt=1; S2=%s; _cssu=%s; _csste=%s; Hm_lvt_8f0191b1f1b207d4f6e0d42e771d6fde=%s; _cssrd=1" % (self.tmp_cookie["_csc"], self.username, self.tmp_cookie["JSESSID"], int(time.time()), self.tmp_cookie["S2"], self.tmp_cookie["_cssu"], int(time.time()), int(time.time()))
            else:
                self.status = False
                self.logger.warning(self.url + " Login failed, headers setting failed")

(4) Obtain the file list

    def get_file_list(self):
        self.url = "https://www.camscanner.com/doc/list"
        self.respage = self.session.post(self.url, headers=self.headers)
        # self.save_all()
        self.logger.info(self.url + " status_code: " + str(self.respage.status_code))
        self.file_list = json.loads(self.respage.text)["data"]["list"]
        self.logger.info("++++++++enheng, get file list succeed+++++++++++")

(5) Obtain the download address: get_downloadfile_url_one

    def get_downloadfile_url_one(self, doc_id, title):
        self.generate_header("get download url")
        self.url = "https://www.camscanner.com/doc/downloadpdf"
        self.params = {"json_download": json.dumps({"docs": ["%s.jdoc" % doc_id]})}
        times = 0
        tmp = None
        while times < 5 and tmp is None:
            self.respage = self.session.post(self.url, data=self.params, headers=self.headers)
            times += 1
            if "data" in self.respage.json():
                tmp = self.respage.json()["data"]
        self.download_res["address"] = tmp
        self.download_res["filename"] = title + tmp[tmp.rfind("."):]
        self.logger.info("get %s address is : %s" % (title, tmp))
        urllib.request.urlretrieve(self.download_res["address"], self.download_res["filename"])
        self.logger.info("%s, download complete" % self.download_res["filename"])
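The article does not show how these methods are chained together; a plausible driver is sketched below. The flow and method names follow the class above, the file-name helper mirrors the logic in get_downloadfile_url_one, and the credentials and field names are placeholders:

```python
def build_local_name(title, download_url):
    # Mirror get_downloadfile_url_one: keep whatever extension the
    # returned download address ends with (normally ".pdf").
    return title + download_url[download_url.rfind("."):]

def sync_all(username, password):
    # Hypothetical end-to-end flow over the downloadscan class above.
    d = downloadscan(username, password)
    d.login_page()
    d.redirect_to_holder()
    if not d.status:
        return
    d.get_file_list()
    for item in d.file_list:
        d.get_downloadfile_url_one(item["doc_id"], item["title"])
```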

4. Tkinter interaction and PyInstaller EXE packaging

To make the program usable outside a Python environment, the Tkinter package is used to build a simple interactive interface, mainly a login window (Figure 4-1). If the login to the CamScanner web version succeeds, the program switches to the synchronization window (Figure 4-2) and generates a user_info file locally. When Start synchronization is clicked (Figure 4-3), files already downloaded locally are skipped, files not yet downloaded are downloaded directly, and the local file list is updated. To handle synchronization failures and account or password changes, a re-login function is provided that deletes the original user_info file and regenerates it (Figure 4-4). The results are shown in Figures 4-1 to 4-4:

Figure 4-1 Login page

Figure 4-2 Login succeeded

Figure 4-3 Synchronization starts

Figure 4-4 Log in again
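The user_info handling described above (written after a successful login, deleted and regenerated on re-login) can be sketched as follows. The file name, format, and field names here are assumptions for illustration, not the article's actual implementation:

```python
import json
import os

USER_INFO_PATH = "user_info"  # hypothetical location next to the EXE

def save_user_info(username, password, path=USER_INFO_PATH):
    # Persist credentials locally after a successful login.
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"username": username, "password": password}, f)

def load_user_info(path=USER_INFO_PATH):
    # Return saved credentials, or None if no file has been generated yet.
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def reset_user_info(path=USER_INFO_PATH):
    # "Log in again": delete the old file so a fresh one can be generated.
    if os.path.exists(path):
        os.remove(path)
```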

To package the tested code, this article uses a relatively simple command:

pyinstaller -F -w downloadscan.py -i camscanner.ico

5. Conclusion

To download the full code of this article (downloadscan.py) and the packaged program (downloadscan.exe), scan the QR code below to follow the public account and reply "Capture the package".