About the data sources

This project was written in early July 2017. It mainly uses Python to crawl and analyze data from Wangdaizhijia and Renrendai. Wangdaizhijia is the largest P2P data platform in China, and Renrendai ranks among the top 20 P2P platforms in China. The source code addresses are given in the sections below.

Data crawling

Packet capture analysis

The packet capture tool is mainly the Network tab of Chrome Developer Tools. Wangdaizhijia's data is all JSON returned by Ajax, while Renrendai has both Ajax-returned JSON and data rendered directly into the HTML pages.

Request example

Constructing the request

The request is constructed according to the results of the packet capture analysis. This project uses Python's Requests library to simulate HTTP requests. The concrete code:

import requests

class SessionUtil():
    def __init__(self,headers=None,cookie=None):
        self.session=requests.Session()
        if headers is None:
            # Default headers; only User-Agent really matters for these sites
            self.headers={"Accept":"application/json, text/javascript, */*; q=0.01",
                          "X-Requested-With":"XMLHttpRequest",
                          "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36",
                          "Accept-Encoding":"gzip, deflate, sdch, br",
                          "Accept-Language":"zh-CN,zh;q=0.8"}
        else:
            self.headers=headers
        self.cookie=cookie
    def getReq(self,url):
        return self.session.get(url,headers=self.headers).text
    def addCookie(self,cookie):
        self.headers['cookie']=cookie
    def postReq(self,url,param):
        return self.session.post(url,data=param).text

When setting the request headers, the only key field is "User-Agent". Neither Wangdaizhijia nor Renrendai has anti-crawling measures; the "Referer" field does not even need to be set to avoid cross-domain errors.
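As a quick illustration, here is a minimal usage sketch of SessionUtil. The URL is the Wangdaizhijia Ajax endpoint used later in this article, and the platId value 59 is the Lufax id that appears in the analysis section; the snippet itself is illustrative, not part of the original crawler.

session=SessionUtil()
# Fetch one platform's 30-day data through the Ajax interface found by packet capture
returnStr=session.getReq('http://wwwservice.wdzj.com/api/plat/platData30Days?platId=59')
print(returnStr)  # raw JSON text returned by the interface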

Crawler example

Here is an example of a crawler

import json
import time
from databaseUtil import DatabaseUtil
from sessionUtil import SessionUtil
from dictUtil import DictUtil
from logUtil import LogUtil
import traceback
def handleData(returnStr):
    jsonData=json.loads(returnStr)
    platData=jsonData.get('data').get('platOuterVo')
    return platData
def storeData(jsonOne,conn,cur,platId):
    # All fields of the platOuterVo JSON object, in insert order
    columns=['actualCapital','aliasName','association','associationDetail','autoBid',
             'autoBidCode','bankCapital','bankFunds','bidSecurity','bindingFlag',
             'businessType','companyName','credit','creditLevel','delayScore',
             'delayScoreDetail','displayFlg','drawScore','drawScoreDetail','equityVoList',
             'experienceScore','experienceScoreDetail','fundCapital','gjlhhFlag','gjlhhTime',
             'gruarantee','inspection','juridicalPerson','locationArea','locationAreaName',
             'locationCity','locationCityName','manageExpense','manageExpenseDetail',
             'newTrustCreditor','newTrustCreditorCode','officeAddress','onlineDate',
             'payment','paymode','platBackground','platBackgroundDetail',
             'platBackgroundDetailExpand','platBackgroundExpand','platEarnings',
             'platEarningsCode','platName','platStatus','platUrl','problem','problemTime',
             'recordId','recordLicId','registeredCapital','riskCapital','riskFunds',
             'riskReserve','riskcontrol','securityModel','securityModelCode',
             'securityModelOther','serviceScore','serviceScoreDetail','startInvestmentAmout',
             'term','termCodes','termWeight','transferExpense','transferExpenseDetail',
             'trustCapital','trustCreditor','trustCreditorMonth','trustFunds','tzjPj',
             'vipExpense','withTzj','withdrawExpense']
    values=[jsonOne.get(col) for col in columns]+[platId]
    # Parameterized insert: avoids the quoting errors and SQL injection risk
    # of building the statement by string concatenation
    sql=('insert into problemPlatDetail ('+','.join(columns)+',platId) values ('
         +','.join(['%s']*len(values))+')')
    cur.execute(sql,values)
    conn.commit()

conn,cur=DatabaseUtil().getConn()
session=SessionUtil()
logUtil=LogUtil("problemPlatDetail.log")
cur.execute('select platId from problemPlat')
data=cur.fetchall()
print(data)
mylist=list()
for i in range(0,len(data)):
    platId=str(data[i].get('platId'))
    mylist.append(platId)
print(mylist)
for i in mylist:
    url='http://wwwservice.wdzj.com/api/plat/platData30Days?platId='+i
    try:
        data=session.getReq(url)
        platData=handleData(data)
        dictObject=DictUtil(platData)
        storeData(dictObject,conn,cur,i)
    except Exception as e:
        traceback.print_exc()
cur.close()
conn.close()
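DatabaseUtil, DictUtil, and LogUtil are small project helpers, presumably defined in the crawler repository linked below; their source is not shown here. As a hypothetical sketch (an assumption, not the repository code) of what DictUtil must roughly do for the insert above to work, it wraps the parsed JSON and guarantees every value comes back as a string:

# Hypothetical sketch of the DictUtil helper: wrap a dict and always
# return a string, never None, so every value can be bound to a
# VARCHAR column in the insert statement.
class DictUtil():
    def __init__(self,data):
        self.data=data if data is not None else {}
    def get(self,key):
        value=self.data.get(key)
        return '' if value is None else str(value)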

The entire process is to construct the request and parse each response: JSON return values are parsed with the json library, and HTML pages with the BeautifulSoup library (the lxml library is recommended for complex HTML pages). The parsed results are stored in a MySQL database.
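For the server-rendered Renrendai pages, the parsing step with BeautifulSoup might look like the sketch below. The URL and the CSS selector are illustrative placeholders; the real page structure has to come from the packet capture.

from bs4 import BeautifulSoup

html=session.getReq('https://www.renrendai.com/loan.html')  # placeholder URL
soup=BeautifulSoup(html,'lxml')  # lxml parser, recommended for complex pages
for item in soup.select('div.loan-item'):  # placeholder selector
    print(item.get_text(strip=True))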

The crawler code

Crawler code address (note: the crawler code runs under both Python 2 and Python 3; I deployed it on an Aliyun server and ran it with Python 2).

Data analysis

Python's NumPy, pandas, and Matplotlib are used for the data analysis, and Haizhi BDP is used for the map visualizations.

Time series analysis

Reading the data

The data is read into a pandas DataFrame for analysis. The following is an example of reading the problem-platform data:

problemPlat=pd.read_csv('problemPlat.csv',parse_dates=True)  # problem platforms
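Note that the time slicing and resampling below require a DatetimeIndex. parse_dates=True only tries to parse the index, so the date column should also be set as the index when reading. A sketch, where the column name 'date' is an assumption about the CSV layout:

import pandas as pd

# Sketch: set a DatetimeIndex so '2012':'2017' slicing and resample('M') work.
# The column name 'date' is an assumption about the CSV.
problemPlat=pd.read_csv('problemPlat.csv',index_col='date',parse_dates=True)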

Data structure

Time series analysis

e.g. how the number of problem platforms changes over time:

problemPlat['id']['2012':'2017'].resample('M',how='count').plot(title='P2P problem platforms')  # number of problem P2P platforms over time
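As an aside, the how= keyword of resample was deprecated and later removed from pandas; on current versions the equivalent call is:

# Equivalent on later pandas versions, where resample(..., how=...) was removed
problemPlat.loc['2012':'2017','id'].resample('M').count().plot(title='P2P problem platforms')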

Graphical display

Spatial analysis

Completed with Haizhi BDP (plotting map distributions in Python is complicated, and I had not learned it at the time)

Number of problem platforms by province

Transaction volume of platforms by province

Scale distribution analysis

e.g. the code for the nationwide transaction volume distribution in June:

juneData['amount'].hist(normed=True)
juneData['amount'].plot(kind='kde',style='k--')# June transaction probability distribution

Kernel density plot

np.log10(juneData['amount']).hist(normed=True)
np.log10(juneData['amount']).plot(kind='kde',style='k--')  # probability distribution after taking log10

Graphical display

Correlation analysis

e.g. the trend of the rolling correlation coefficient between Lufax's trading volume and all platforms' trading volume:

lujinData=platVolume[platVolume['wdzjPlatId']==59]
corr=pd.rolling_corr(lujinData['amount'],allPlatDayData['amount'],50,min_periods=50).plot(title='Trend of correlation between Lufax trading volume and all-platform trading volume')
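pd.rolling_corr was likewise removed in later pandas versions; the rolling-window API is now a method on the Series itself:

# Equivalent on later pandas versions, where pd.rolling_corr was removed
corr=lujinData['amount'].rolling(window=50,min_periods=50).corr(allPlatDayData['amount'])
corr.plot(title='Trend of correlation between Lufax trading volume and all-platform trading volume')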

Graphical display

Classification comparison

Comparison of transaction volume between auto-loan platforms and all platforms:

carFinanceDayData=carFinanceData.resample('D').sum()['amount']
fig,axes=plt.subplots(nrows=1,ncols=2,sharey=True,figsize=(14,7))
carFinanceDayData.plot(ax=axes[0],title='Auto Loan Platform Transaction Volume')
allPlatDayData['amount'].plot(ax=axes[1],title='Transaction Volume on all P2P Platforms')

Trend prediction

e.g. forecasting the trend of Lufax's trading volume (done with Facebook's Prophet library):

from fbprophet import Prophet

# Prophet expects a dataframe with columns ds (date) and y (value)
lujinAmount=platVolume[platVolume['wdzjPlatId']==59].copy()
lujinAmount['y']=lujinAmount['amount']
lujinAmount['ds']=lujinAmount['date']
m=Prophet(yearly_seasonality=True)
m.fit(lujinAmount)
future=m.make_future_dataframe(periods=365)
forecast=m.predict(future)
m.plot(forecast)
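Prophet can also decompose the forecast into its trend and yearly-seasonality components, which is a useful sanity check on the fitted model:

m.plot_components(forecast)  # plots trend and yearly seasonality separately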

Graphical display of trend forecast

Data analysis code

Data analysis code address (note: the data analysis code only runs in a Python 3 environment). A rendered example of the executed code is also available, so you can view the code and its graphical output without installing a Python environment.

Afterword

This is the first project I wrote after switching from Java web development to data work, and it is also my first Python project. I did not encounter many difficulties along the way. Generally speaking, the barriers to entry for crawlers, data analysis, and Python itself are all quite low. If you want to get started with Python crawlers, the book Python Network Data Collection (Web Scraping with Python) is recommended.