
Author: A New Vision of Programming

Source: Jianshu

For those who are new to Python, the link below leads to a basic Python video tutorial:

https://v.douyu.com/author/y6AZ4jn9jwKW

Why Python for data analysis?

Python is both a dynamic, object-oriented scripting language and a simple, easy-to-understand programming language. It is easy to get started with, and its code is highly readable: well-written Python reads almost like plain English. This property is often described as **"pseudo-code"**, and it lets you focus on the problem you are solving instead of wrestling with syntax.

In addition, Python is open source and has many excellent libraries for data analysis and other domains. More importantly, Python works well with Hadoop, the open source big data platform. For data analysts who want to move into big data analysis, learning Python is therefore a very cost-effective choice.

Python's many advantages have made it one of the most popular programming languages, and companies around the world use it, including YouTube, Google, and Alibaba Cloud (Aliyun).

Programming basics

To learn how to use Python for data analysis, I suggest that the first step is to build a foundation in Python programming: get to know Python's data structures (lists, tuples, dictionaries, arrays, and so on) and learn how Python functions and modules work.
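As a quick illustration of these basics, here is a minimal sketch (the names are invented for the example):

```python
# A minimal tour of the core Python data structures mentioned above.
numbers = [3, 1, 4, 1, 5]          # list: ordered and mutable
point = (2.0, 3.0)                 # tuple: ordered and immutable
ages = {"alice": 30, "bob": 25}    # dictionary: key-value mapping

def mean(values):
    """Return the arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)

print(sorted(numbers))   # [1, 1, 3, 4, 5]
print(ages["alice"])     # 30
print(mean(numbers))     # 2.8
```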

Data analysis process

Python is a powerful tool for data analysis. Once you have mastered the programming basics, you can gradually enter the wonderful world of data analysis. In the author's view, a complete data analysis project can be roughly divided into the following five stages:

1. Data acquisition

Companies that need data analysts generally have their own databases, and analysts can use SQL queries to pull the data they want from them. Python already has interface packages for the major databases, such as pymssql for SQL Server, PyMySQL for MySQL, and cx_Oracle for Oracle.
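As a sketch of what such a query might look like with PyMySQL (the host, credentials, and table are hypothetical):

```python
# Hypothetical example: pull rows from a MySQL table with PyMySQL.
import pymysql

conn = pymysql.connect(host="localhost", user="analyst",
                       password="secret", database="sales")
try:
    with conn.cursor() as cursor:
        # Parameterized query: fetch orders above a threshold.
        cursor.execute(
            "SELECT order_id, amount FROM orders WHERE amount > %s", (100,))
        for order_id, amount in cursor.fetchall():
            print(order_id, amount)
finally:
    conn.close()
```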

External data can be obtained in two main ways: one is to download public data sets published on various websites; the other is to write crawler code that collects data automatically. If you want to use a Python crawler to retrieve data, the following tools are commonly used (a short sketch combining the first two follows the list):

  • Requests – handles the HTTP requests made while crawling data.
  • BeautifulSoup – parses the XML and HTML fetched while crawling into objects that are easy to traverse and process.
  • Scapy – a packet-manipulation library that can decode packets for most network protocols.
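As a minimal sketch, here is how Requests and BeautifulSoup might be combined to fetch and parse a page (the URL is a placeholder):

```python
# Fetch a page with Requests and parse it with BeautifulSoup.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.string)            # the page title
for link in soup.find_all("a"):     # every hyperlink on the page
    print(link.get("href"))
```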

2. Data storage

For projects with a small amount of data, Excel is enough for storage and processing; for projects with a large amount of data, a database is more efficient and convenient for storage and management.
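As a minimal sketch of the database route, here is a small table round-tripped through SQLite (the file and table names are made up):

```python
import sqlite3
import pandas as pd

# Toy data standing in for a real project's records.
df = pd.DataFrame({"city": ["Beijing", "Shanghai"], "sales": [1200, 950]})

conn = sqlite3.connect("analysis.db")
df.to_sql("sales", conn, if_exists="replace", index=False)   # write the table
back = pd.read_sql("SELECT * FROM sales", conn)              # read it back
conn.close()

print(back)
```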

3. Data preprocessing

Data preprocessing is also called data cleaning. In most cases, the data we get arrives in inconsistent formats and suffers from problems such as outliers and missing values, and the preprocessing steps differ from project to project. In the author's experience, 80% of data analysis work is spent wrangling data. If Python is chosen as the data cleaning tool, the NumPy and Pandas libraries are the workhorses (a short cleaning sketch follows the list):

  • NumPy – the foundation of scientific computing in Python. It is well suited to linear algebra, Fourier transforms, and random-number generation, handles multidimensional arrays efficiently, and integrates with a wide range of other tools.
  • Pandas – built on top of NumPy, Pandas provides richer data structures and a large set of operations on them, such as time-series handling.
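As a short sketch of typical cleaning steps (the data is fabricated for the example):

```python
import numpy as np
import pandas as pd

# A column with a missing value and an obvious outlier.
df = pd.DataFrame({"price": [10.0, np.nan, 12.5, 999.0, 11.2]})

df["price"] = df["price"].fillna(df["price"].median())   # fill missing values
low, high = df["price"].quantile([0.05, 0.95])
df["price"] = df["price"].clip(low, high)                # cap extreme outliers

print(df)
```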

4. Modeling and analysis

At this stage, first make sure the structure of the data is clear, then select a model according to the project's requirements.

Common data mining models include classification, regression, clustering, and dimensionality reduction models.

At this stage, Python again has a good set of tool libraries to support the modeling work (a Scikit-learn sketch follows the list):

  • Scikit-learn – a machine learning library for Python. It covers common tasks such as data preprocessing, classification, regression, dimensionality reduction, and model selection.

  • TensorFlow – suited to deep learning projects. Such projects tend to be data-intensive and ultimately demand higher accuracy.
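As a minimal modeling sketch with Scikit-learn, using its bundled iris dataset so the example is self-contained:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split the data, fit a classifier, and evaluate it on held-out data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```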

5. Visual analysis

The last step of data analysis is writing the data analysis report, which is also where data visualization comes in. For visualization, the mainstream Python tools currently include (a short plotting sketch follows the list):

  • Matplotlib – mainly used for two-dimensional plotting; it makes it easy to chart data and supports a variety of output formats.
  • Seaborn – a statistical visualization module built on Matplotlib that works seamlessly with Pandas.
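As a short plotting sketch with both libraries (the data and file names are made up):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 6]})

plt.plot(df["x"], df["y"], marker="o")   # simple 2D line chart
plt.savefig("line.png")                  # one of several output formats
plt.close()

sns.regplot(x="x", y="y", data=df)       # scatter with a fitted trend line
plt.savefig("trend.png")
```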

Breaking the process down this way, each stage has its own set of knowledge points to master.

As this walkthrough shows, Python is well supported across the entire data analysis workflow, whether for data extraction, data preprocessing, modeling and analysis, or data visualization.