
Author’s brief introduction

Hou Huiyang

A veteran coder in his 30s

About the author: Has worked at Renren, Meituan, Xiaomi, Baidu, and Didi, successively as senior programmer, architect, strategy engineer, and R&D director.

Main Contents of this issue

Python Data Scraping & Analysis & Machine Learning & Mining & Neural Networks

01

Data scraping

1. Background research

1) Check robots.txt to learn what restrictions the site places on crawling;

2) pip install builtwith; pip install python-whois (identify the site's technology stack and owner)
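A minimal background-research sketch using these two packages (the target URL is a placeholder):

import builtwith
import whois

# identify the technologies the site is built with
print(builtwith.parse('http://example.com'))
# look up the domain registration (owner, registrar, dates)
print(whois.whois('example.com'))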

2. Data scraping:

1) Dynamically loaded content:

Use Selenium:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import sys

reload(sys)
sys.setdefaultencoding('utf8')

driver = webdriver.Chrome("/Users/didi/Downloads/chromedriver")
driver.get('http://xxx')

# fill in the login form
elem_account = driver.find_element_by_name("UserName")
elem_password = driver.find_element_by_name("Password")
elem_code = driver.find_element_by_name("VerificationCode")
elem_account.clear()
elem_password.clear()
elem_code.clear()
elem_account.send_keys("username")
elem_password.send_keys("pass")
elem_code.send_keys("abcd")
time.sleep(10)
driver.find_element_by_id("btnSubmit").submit()
time.sleep(5)

# simulate a search
driver.find_element_by_class_name("txtKeyword").send_keys(u"x")
driver.find_element_by_class_name("btnSerch").click()
# ...
ulist = []
pnum = 1  # page counter; the paging loop is elided in the original
dw = driver.find_elements_by_xpath('//li[@class="min"]/dl/dt/a')
for item in dw:
    url = item.get_attribute('href')
    if url:
        ulist.append(url)
        print(url + "---" + str(pnum))
        print("##################")

2) Statically loaded content

(1) regular expressions;

(2) lxml;

(3) bs4 (BeautifulSoup)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import requests
from lxml import etree
from bs4 import BeautifulSoup

url = 'http://xxx'
response = requests.get(url)

# (1) regular expression
pattern = r'src="(http://imgsrc\.baidu\.com.+?\.jpg)" pic_ext="jpeg"'
urls = re.findall(pattern, response.text)

# (2) lxml
html = etree.HTML(response.content)
res = html.xpath('//div[@class="d_post_content j_d_post_content "]/img[@class="BDE_Image"]/@src')

# (3) bs4: parse the response and create a BeautifulSoup object
soup = BeautifulSoup(response.text, 'lxml')
urls = soup.find_all('img', 'BDE_Image')

3) Anti-crawling and how to counter it (a minimal sketch follows this list):

(1) request frequency;

(2) request headers;

(3) IP proxies;
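A minimal sketch of the three counter-measures with requests; the header values, proxy address, and URL are placeholders, not from the original article:

import time
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # disguise as a browser
    'Referer': 'http://example.com',
}
proxies = {'http': 'http://127.0.0.1:8080'}  # hypothetical proxy server

for page in range(1, 4):
    resp = requests.get('http://example.com/list?page=%d' % page,
                        headers=headers, proxies=proxies)
    print(resp.status_code)
    time.sleep(2)  # throttle the request frequency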

4) Crawler frameworks (a minimal spider is sketched after this list):

(1) Scrapy

(2) Portia
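For comparison, a minimal Scrapy spider; the spider name, start URL, and selector are hypothetical:

import scrapy

class UrlSpider(scrapy.Spider):
    name = "urls"                        # hypothetical spider name
    start_urls = ['http://example.com']  # placeholder URL

    def parse(self, response):
        # yield every link on the page as an item
        for href in response.css('a::attr(href)').extract():
            yield {'url': response.urljoin(href)}

Run it with: scrapy runspider url_spider.py -o urls.json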

02

Data analysis

1. Commonly used data analysis libraries:

NumPy: the foundation library for vector-based computation. http://www.numpy.org/

1) list => matrix (ndarray)

2) ndim: number of dimensions; shape: number of rows and columns; size: number of elements

SciPy: an extension of NumPy that covers advanced mathematics, signal processing, statistics, etc. https://www.scipy.org/

Pandas: a package built on NumPy for quickly constructing high-level data structures: Series and DataFrame. http://pandas.pydata.org/

1) NumPy is list-like, while Pandas is dict-like.

Matplotlib: plotting library.

1) a powerful drawing tool;

2) supports scatter plots, line charts, bar charts, etc.

pip2 install numpy
>>> import numpy as np
>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> a ** 2
array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

pip2 install scipy
>>> import numpy as np
>>> from scipy import linalg
>>> a = np.array([[1, 2], [3, 4]])
>>> linalg.det(a)
-2.0

pip2 install pandas
>>> import pandas as pd
>>> df = pd.DataFrame({'A': pd.date_range("20170802", periods=5),
...                    'B': pd.Series([11, 22, 33, 44, 55]),
...                    'C': pd.Categorical(["t", "a", "b", "c", "g"])})
>>> df
           A   B  C
0 2017-08-02  11  t
1 2017-08-03  22  a
2 2017-08-04  33  b
3 2017-08-05  44  c
4 2017-08-06  55  g

pip2 install matplotlib
>>> import matplotlib.pyplot as plt
>>> plt.plot([1, 2, 3])
[<matplotlib.lines.Line2D object at 0x113f88f50>]
>>> plt.ylabel("didi")
<matplotlib.text.Text object at 0x110b21c10>
>>> plt.show()

2. Advanced data analysis library:

Scikit-learn: machine learning framework. http://scikit-learn.org/

The scikit-learn algorithm cheat-sheet (the figure) starts by asking whether you have at least 50 samples: if not, you need more data; if yes, you move on to the classifiers and keep following the chart to a suitable estimator.

As the figure shows, the algorithms fall into four categories: classification, regression, clustering, and dimensionality reduction.

 KNN:

#!/usr/local/bin/python
# -*- coding: utf-8 -*-
'''Predict iris species
https://en.wikipedia.org/wiki/Iris_flower_data_set'''
# import modules
from __future__ import print_function
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# load the data
iris = datasets.load_iris()
iris_X = iris.data    # sepal length/width and petal length/width
iris_y = iris.target  # species: 0, 1, 2
print(iris_X)
print(iris_y)
print(iris.target_names)

# split into training and test sets (10% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(iris_X, iris_y, test_size=0.1)

# create and train the KNN classifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

# compare predictions against the ground truth
predicts = knn.predict(X_test)
print("#########################")
print(X_test)
print(predicts)
print(y_test)
# prediction accuracy
print(knn.score(X_test, y_test))

A sample run prints the 15 test samples' feature matrix, the predicted and true label arrays, and an accuracy around 0.93 (the original run printed 0.933333333333).

Linear Regression

#!/usr/local/bin/python
# -*- coding: utf-8 -*-
'''Linear regression on the Boston housing dataset'''
# import modules
from __future__ import print_function
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LinearRegression

# load the data
loaded_data = datasets.load_boston()
data_X = loaded_data.data
data_y = loaded_data.target
print(data_X)
print(data_y)

# train the model
model = LinearRegression()
model.fit(data_X, data_y)

print(model.predict(data_X[:4, :]))  # predicted values
print(data_y[:4])                    # true values
print("#########################")

# generate a synthetic regression dataset and visualize it
# (n_features, n_targets, and noise are assumed values)
X, y = datasets.make_regression(n_samples=100, n_features=1, n_targets=1, noise=10)
plt.scatter(X, y)  # scatter plot
plt.show()

03

Data mining

1. Mining keywords:

The algorithm involved: TF-IDF

Reference: http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
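In brief, TF-IDF weights a term by its frequency in a document times the log-inverse of its document frequency, so words that are frequent here but rare elsewhere score highest. A tiny pure-Python illustration on a toy corpus (invented for this sketch):

import math

corpus = [
    ["didi", "travel", "platform"],
    ["taxify", "travel", "company"],
    ["didi", "taxify", "partnership"],
]
doc = corpus[0]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / float(len(doc))    # term frequency
    df = sum(1 for d in corpus if term in d)  # document frequency
    idf = math.log(len(corpus) / (1.0 + df))  # smoothed inverse document frequency
    return tf * idf

for term in set(doc):
    print("%s %.3f" % (term, tf_idf(term, doc, corpus)))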

news.txt (sample text, translated):

August 1, 2017, Beijing, China / Tallinn, Estonia: Didi Chuxing today announced a strategic partnership with Taxify, a leading mobility company in Europe and Africa. Didi will support Taxify's deeper market expansion and technology innovation across multiple markets through investment and collaboration on intelligent transportation technology R&D. Didi Chuxing is the world's leading mobile travel platform. Relying on artificial intelligence technology, Didi provides diversified travel services, including taxi, private car, express, luxury car, and Hitch (ride-sharing), to more than 400 million users in over 400 cities. While providing flexible employment and income opportunities for more than 17 million drivers, Didi also uses AI technology to help city managers build integrated, sustainable smart transportation solutions. Taxify, founded in Estonia in 2013, is the fastest growing mobility company in Europe and Africa. Its taxi and private-car sharing network covers central cities in Europe, Africa, and West Asia, reaching 18 countries, including Hungary, Romania, Poland, the Baltic states, South Africa, Nigeria, and Kenya, with more than 2.5 million users. Cheng Wei, founder and CEO of Didi Chuxing, said: "Taxify provides excellent, innovative travel services in diverse markets. We are both committed to harnessing the power of mobile Internet technology to meet rapidly evolving consumer needs and to help transform and upgrade traditional transportation industries. I believe this collaboration will contribute to building cross-regional smart transportation links between the Asian, European, and African markets." Markus Villig, founder and CEO of Taxify, said: "Taxify will build on this strategic partnership to consolidate our strong position in our core markets in Europe and Africa. We believe Didi is the ideal partner to help us become the most popular and efficient travel option in Europe and Africa."

#!/usr/local/bin/python
# -*- coding: utf-8 -*-
'''Extract the top keywords of a document with TF-IDF'''
import os
import codecs
import pandas
import jieba
import jieba.analyse

# a frame to hold one row per analyzed file
tagDF = pandas.DataFrame(columns=['filePath', 'content', 'tag1', 'tag2', 'tag3', 'tag4', 'tag5'])
try:
    with open('./houhuiyang/news.txt', 'r') as f:
        content = f.read().strip()
        tags = jieba.analyse.extract_tags(content, topK=5)  # TF-IDF
        tagDF.loc[len(tagDF)] = ["./news.txt", content, tags[0], tags[1], tags[2], tags[3], tags[4]]
        print(tagDF)
except Exception, ex:
    print(ex)

The computed Top 5 keywords of the article: travel (chuxing), Didi, Taxify, Europe-Africa, and transportation.

2. Sentiment analysis

Sentiment lexicon resource: http://www.keenage.com/html/c_bulletin_2007.htm

1) The simplest approach is based on a sentiment dictionary (a toy sketch follows this list);

2) The more complex approach is based on machine learning;
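A toy version of the dictionary-based approach, with hand-picked word lists invented for this sketch:

# score = (#positive words - #negative words) / #words
positive = {'excellent', 'innovative', 'popular', 'efficient'}
negative = {'bad', 'slow', 'unreliable'}

def sentiment_score(text):
    words = text.lower().split()
    pos = sum(1 for w in words if w in positive)
    neg = sum(1 for w in words if w in negative)
    return float(pos - neg) / max(len(words), 1)

print(sentiment_score("Didi is an excellent and innovative travel company"))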

pip2 install nltk
>>> import nltk
>>> from nltk.corpus import stopwords  # stop words
>>> nltk.download()  # download the corpora on first use
>>> t = "Didi is a travel company"
>>> word_list = nltk.word_tokenize(t)
>>> filtered_words = [word for word in word_list if word not in stopwords.words('english')]
>>> filtered_words
['Didi', 'travel', 'company']

Two families of approaches: 1) heuristic; 2) machine learning / statistical: HMM, CRF.

Processing flow: raw_text -> tokenize [pos tag] -> lemma / stemming [pos tag] -> stopwords -> word_list

04

Python distributed computing

pip2 install mrjob

pip2 install pyspark

1) Python multithreading;

2) Python multiprocessing;

3) The global interpreter lock (GIL);

4) Inter-process communication: Queue;

5) Process Pool;

6) Python higher-order functions (a combined sketch of the items above follows):

map / reduce / filter
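A combined sketch of items 2)-6), using only the standard library (Python 2, to match the rest of the article):

# -*- coding: utf-8 -*-
from multiprocessing import Pool, Process, Queue

def square(x):
    return x * x  # CPU-bound work: multiple processes sidestep the GIL

def producer(q):
    q.put("hello from the child process")  # inter-process communication via Queue

if __name__ == '__main__':
    # process pool: spread work across CPU cores
    pool = Pool(4)
    print(pool.map(square, range(10)))
    pool.close()
    pool.join()

    # Queue-based communication between two processes
    q = Queue()
    p = Process(target=producer, args=(q,))
    p.start()
    print(q.get())
    p.join()

    # higher-order functions (Python 2: map/filter return lists, reduce is built in)
    print(map(square, [1, 2, 3]))
    print(filter(lambda n: n % 2, [1, 2, 3]))
    print(reduce(lambda a, b: a + b, [1, 2, 3]))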

7) MapReduce based on Linux pipes: cat word.log | python mapper.py | python reducer.py | sort -k2 -r

word.log:
Beijing Chengdu Shanghai Beijing Shanxi Tianjin Guangzhou

#!/usr/local/bin/python
# -*- coding: utf-8 -*-
'''mapper'''
import sys

try:
    for lines in sys.stdin:
        line = lines.split()
        for word in line:
            if len(word.strip()) == 0:
                continue
            count = "%s,%d" % (word, 1)
            print(count)
except IOError, ex:
    print(ex)

#!/usr/local/bin/python
# -*- coding: utf-8 -*-
'''reducer'''
import sys

try:
    word_dict = {}
    for lines in sys.stdin:
        line = lines.split(",")
        if len(line) != 2:
            continue
        word_dict.setdefault(line[0], 0)
        word_dict[line[0]] += int(line[1])
    for key, val in word_dict.items():
        stat = "%s %d" % (key, val)
        print(stat)
except IOError, ex:
    print(ex)
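Since pyspark was installed above, here is the same word count as a minimal PySpark sketch (local mode, reading the word.log from the example):

from pyspark import SparkContext

sc = SparkContext("local", "wordcount")
counts = (sc.textFile("word.log")
          .flatMap(lambda line: line.split())  # split lines into words
          .map(lambda word: (word, 1))         # emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))    # sum the counts per word
print(counts.collect())
sc.stop()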

05

Neural networks

Both TensorFlow and PyTorch come in CPU and GPU versions.

1) A neural network built with TensorFlow is static: the computation graph is defined before it runs;

2) A neural network built with PyTorch (http://pytorch.org/#pip-install-pytorch) is dynamic: the graph is built as the code runs.

Data:

A Scalar has magnitude but no direction, e.g. 1, 2, 3.

A Vector has both magnitude and direction; concretely, a one-dimensional array of numbers such as (1, 2).

A Matrix is a set of numbers formed by rows of vectors, e.g. [1, 2; 3, 4].

A Tensor generalizes these to numbers arranged along any number of dimensions. As the figure shows, a matrix is just a two-dimensional slice of a three-dimensional tensor, and locating a scalar inside a three-dimensional tensor takes three coordinates.

Both TensorFlow and PyTorch use a data structure called a tensor to represent all of their data.

# -*- coding: utf-8 -*-
# author: Hou Huiyang
from __future__ import print_function
import torch
import numpy as np
from torch.autograd import Variable
import torch.nn.functional as F
import matplotlib.pyplot as plt

# numpy <-> torch conversion
np_data = np.arange(6).reshape((2, 3))
torch_data = torch.from_numpy(np_data)
tensor2np = torch_data.numpy()
print(
    "\nnp_data", np_data,        # matrix
    "\ntorch_data", torch_data,  # tensor
    "\ntensor to numpy", tensor2np
)

# abs
data = [-1, -2, 1, 2, 3]
tensor = torch.FloatTensor(data)
print(torch.abs(tensor))

# matrix multiplication
data = [[1, 2], [3, 10]]
tensor = torch.FloatTensor(data)
print(torch.mm(tensor, tensor))

# Variable wraps a tensor and records operations for autograd
tensor_v = torch.FloatTensor([[1, 2], [3, 4]])
variable = Variable(tensor_v, requires_grad=True)
t_out = torch.mean(tensor_v * tensor_v)  # x^2
v_out = torch.mean(variable * variable)
print(tensor_v, variable, t_out, v_out)
v_out.backward()      # backpropagation
print(variable.grad)  # gradient

'''
y = Wx      linear
y = AF(Wx)  nonlinear; AF = activation function: relu / sigmoid / tanh
'''
x = torch.linspace(-5, 5, 200)
x = Variable(x)
x_np = x.data.numpy()
y_relu = F.relu(x).data.numpy()
y_sigmoid = F.sigmoid(x).data.numpy()
y_tanh = F.tanh(x).data.numpy()
# y_softplus = F.softplus(x).data.numpy()

plt.figure(1, figsize=(8, 6))
# plt.subplot(221)
plt.plot(x_np, y_relu, c="red", label="relu")
plt.ylim(-1, 5)
plt.legend(loc="best")
plt.show()

# plt.subplot(222)
plt.plot(x_np, y_sigmoid, c="red", label="sigmoid")
plt.ylim(-0.2, 1.2)
plt.legend(loc="best")
plt.show()

# plt.subplot(223)
plt.plot(x_np, y_tanh, c="red", label="tanh")
plt.ylim(-1.2, 1.2)
plt.legend(loc="best")
plt.show()

Build a simple neural network

# -*- coding: utf-8 -*-
# author: Hou Huiyang
import torch
from torch.autograd import Variable
import torch.nn.functional as F
import matplotlib.pyplot as plt

# noisy quadratic data: y = x^2 + noise
x = torch.unsqueeze(torch.linspace(-1, 1, 100), dim=1)
y = x.pow(2) + 0.2 * torch.rand(x.size())
x, y = Variable(x), Variable(y)
# print(x)
# print(y)
# plt.scatter(x.data.numpy(), y.data.numpy())
# plt.show()

class Net(torch.nn.Module):  # inherit from torch.nn.Module
    def __init__(self, n_features, n_hidden, n_output):
        super(Net, self).__init__()  # run the torch Module __init__
        self.hidden = torch.nn.Linear(n_features, n_hidden)  # hidden layer
        self.predict = torch.nn.Linear(n_hidden, n_output)   # output layer

    def forward(self, x):
        x = F.relu(self.hidden(x))  # activation on the hidden layer
        x = self.predict(x)         # linear output
        return x

net = Net(1, 10, 1)  # 1 input feature, 10 hidden units, 1 output
print(net)

optimizer = torch.optim.SGD(net.parameters(), lr=0.5)
loss_func = torch.nn.MSELoss()  # mean squared error

plt.ion()  # interactive plotting
for t in range(100):
    prediction = net(x)
    loss = loss_func(prediction, y)  # compute the error
    # backpropagate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if t % 5 == 0:
        plt.cla()
        plt.scatter(x.data.numpy(), y.data.numpy())
        plt.plot(x.data.numpy(), prediction.data.numpy(), "r-", lw=5)
        plt.text(0.5, 0, 'Loss=%.4f' % loss.data[0],
                 fontdict={'size': 20, 'color': 'red'})
        plt.pause(0.1)
plt.ioff()
plt.show()

06

Mathematical foundations

1. Limits:

Orders of infinitesimals;

2. Differential calculus:

Derivatives:

1) The derivative is the slope of the curve; it reflects how fast the curve changes;

2) The second derivative describes how fast the slope itself changes; it reflects the convexity or concavity of the curve;

Taylor series approximation

Newton's method and gradient descent (a sketch follows this subsection);

Jensen's inequality:

For a convex function f: f(E[x]) <= E[f(x)].
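A minimal gradient-descent sketch minimizing f(x) = (x - 3)^2; the step size and iteration count are arbitrary choices:

# f(x) = (x - 3)^2, so f'(x) = 2 * (x - 3)
def gradient(x):
    return 2 * (x - 3)

x = 0.0   # starting point
lr = 0.1  # learning rate (step size)
for _ in range(100):
    x -= lr * gradient(x)  # step opposite the gradient
print(x)  # converges to ~3.0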

Probability theory:

1. Integral calculus:

Newton-Leibniz formula

2. Probability space

Random variables and probability: integrating the probability density function; conditional probability; conjugate distributions;

Probability distribution:

1) Two-point distribution / Bernoulli distribution;

2) Binomial distribution;

3) Poisson distribution;

4) Uniform distribution;

5) Exponential distribution;

6) Normal distribution/Gaussian distribution;

The law of large numbers and the central limit theorem (simulated below)
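A quick NumPy simulation of the central limit theorem; the sample size and the underlying uniform distribution are arbitrary choices:

import numpy as np

# means of 10,000 samples, each of size 100, drawn from Uniform(0, 1)
sample_means = np.random.rand(10000, 100).mean(axis=1)
print(sample_means.mean())  # close to 0.5, the population mean
print(sample_means.std())   # close to (1/sqrt(12))/sqrt(100) ~ 0.029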

Linear algebra:

1) Matrices;

2) Linear regression (see the sketch below);
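Linear regression ties the two together: the least-squares weights solve the normal equation w = (X^T X)^(-1) X^T y. A small NumPy sketch on made-up data:

import numpy as np

# made-up data: y ~ 2x + 1 plus noise
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + np.random.randn(50)

# design matrix with a bias column of ones
X = np.column_stack([x, np.ones_like(x)])
# normal equation: solve (X^T X) w = X^T y
w = np.linalg.solve(X.T.dot(X), X.T.dot(y))
print(w)  # approximately [2.0, 1.0]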

– THE END –