Author: Unreal good

Source: Hang Seng LIGHT Cloud Community

Background introduction

Quantitative analysis usually means working through large amounts of data to uncover the relationships within it and, ultimately, to extract the data we need. Doing this with plain Python alone is cumbersome. Is there a simpler tool that lets us analyze data quickly and efficiently?

Pandas is a powerful tool set for analyzing structured data.

This article is aimed at readers who already know basic Python syntax. Those who still need to learn Python can find tutorials in the community (developer.hs.net/course/?nav…).

Basic concepts

Pandas is a free, open-source, third-party Python library that provides high-performance, easy-to-use data structures for data analysis, chiefly Series and DataFrame.

Pandas is built on NumPy (high-performance array and matrix computing) and can be used together with Python's other scientific computing libraries. It is used for data mining and analysis, and it also provides data cleaning functions.

Pandas has been used in many fields since its birth, including finance, statistics, social sciences, and architectural engineering.

Before using Pandas, it is important to understand what it does. Pandas is roughly the Python equivalent of Excel: it works with tables (known as DataFrames) and can apply a wide variety of transformations to data, along with many other functions.

Data structures

DataFrame

A DataFrame is a tabular data structure that contains an ordered set of columns, each of which can be of a different value type (numeric, string, Boolean). DataFrame has both row and column indexes, and can be thought of as a dictionary of Series (with a common index).

The DataFrame constructor is as follows:

pandas.DataFrame( data, index, columns, dtype, copy)

Parameter Description:

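  • data: the input data (ndarray, Series, list, dict, etc.).
  • index: the index values, that is, the row labels.
  • columns: the column labels. The default value is RangeIndex (0, 1, 2, …, n).
  • dtype: the data type. It is inferred from the data if not specified.
  • copy: whether to copy the input data. The default value is False (newer pandas versions default to None, with behavior that depends on the input type).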
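The parameters above can be illustrated with a minimal sketch. The sample values, row labels, and column names here are made up for illustration and are not from the original article:

```python
import pandas as pd

# Hypothetical sample data: two columns given as a dict of lists.
raw = {"price": [10.5, 11.0, 9.8], "volume": [100, 150, 120]}

# index sets the row labels; columns selects and orders the columns.
df = pd.DataFrame(raw, index=["a", "b", "c"], columns=["volume", "price"])

print(df)
print(df.dtypes)      # each column keeps its own dtype
print(df["price"])    # a single column comes back as a Series
```

Selecting one column returns a Series that shares the DataFrame's row index, which is what "a dictionary of Series with a common index" means in practice.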

Series

A Series is like a column in a table, like a one-dimensional array, and can hold any data type.

A Series consists of an index and a column of values. Its constructor is as follows:

pandas.Series( data, index, dtype, name, copy)

Parameter Description:

  • data: the input data (ndarray, list, dict, scalar, etc.).
  • index: the index labels. If not specified, integer labels starting from 0 are used by default.
  • dtype: the data type. It is inferred from the data by default.
  • name: the name of the Series.
  • copy: whether to copy the input data. The default value is False.
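As a small sketch of these parameters, the following builds a Series with custom index labels and a name. The ticker symbols and prices are hypothetical values chosen only for illustration:

```python
import pandas as pd

# Hypothetical closing prices, labelled by made-up ticker symbols.
s = pd.Series([10.5, 11.0, 9.8], index=["AAA", "BBB", "CCC"], name="close")

print(s)
print(s["BBB"])   # look up a value by its index label
print(s.name)     # the name given to the Series
print(s.dtype)    # dtype inferred from the data
```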

Quick start

Importing the library

Import Pandas in your code:

import pandas as pd

If the import fails, either the environment is misconfigured or Pandas is not installed. Install it with:

pip install pandas

Series object manipulation

The Series() constructor creates a Series object, whose methods and attributes can then be called:

import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)
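To show what calling methods and attributes on a Series looks like, here is a short sketch using a few commonly used ones. The numeric values are arbitrary sample data:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([3, 1, 4, 1, 5]))

print(s.index)     # default RangeIndex starting at 0
print(s.values)    # the underlying NumPy array
print(s.sum())     # total of all values
print(s.mean())    # arithmetic mean
print(s[s > 2])    # boolean filtering keeps only the matching entries
```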

DataFrame object operation

A DataFrame object is created with DataFrame() as follows:

import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)
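A DataFrame with several named columns is the more common case. The sketch below, using made-up column names and values, shows a few typical operations: column arithmetic, row filtering, and inspecting the shape.

```python
import pandas as pd

# Hypothetical sample data with two named columns.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

df["amount"] = df["price"] * df["qty"]   # element-wise column arithmetic
expensive = df[df["price"] > 15]          # keep rows matching a condition

print(df.shape)      # (rows, columns)
print(df.head(2))    # first two rows
print(expensive)
```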

Reading file data

Local .csv files can be read with the read_csv() function:

data = pd.read_csv('file.csv')
data = pd.read_csv('file.csv', nrows=1000, skiprows=[1,5], encoding='gbk')

Parameter Meanings:

  • 'file.csv': the file name, which can include the full path to the file on the system
  • nrows: the number of rows to read, counted from the start of the file
  • skiprows: the lines to skip when reading the file, given either as a list of line numbers (starting from 0) or as a number of lines to skip from the top
  • encoding: the encoding of the file being read
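The interaction between nrows and skiprows is easier to see on a small input. The sketch below uses an in-memory string in place of a real file (an assumption made so the example is self-contained), so encoding is not needed here; the names and scores are made up.

```python
import io
import pandas as pd

# Stand-in for a file on disk: a header line plus five data lines.
csv_text = "name,score\nann,1\nbob,2\ncat,3\ndan,4\neve,5\n"

# skiprows=[1, 3] drops file lines 1 and 3 (0-based; line 0 is the header),
# i.e. "ann" and "cat". nrows=2 then keeps the first two remaining data rows.
df = pd.read_csv(io.StringIO(csv_text), nrows=2, skiprows=[1, 3])
print(df)
```

With a real file, the first argument would be the file path and encoding could be passed as in the call above.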

Similar to read_csv, read_excel reads data from Excel files.

Writing file data

Pandas provides the to_csv() function to convert a DataFrame to CSV data. To write the CSV data to a file, pass a file path or a file object to the function. Otherwise, the CSV data is returned as a string.

data.to_csv('my_new_file.csv', index=False)

Parameter Meanings:

  • index: whether to write the row index. The index is written by default

Similar to to_csv, to_excel writes data to Excel files.
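The effect of the index parameter can be checked without touching the disk, since to_csv() returns a string when no path is given. This is a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["ann", "bob"], "score": [1, 2]})

with_index = df.to_csv()                 # no path given: returns a string
without_index = df.to_csv(index=False)   # leave out the row index column

print(with_index)
print(without_index)
```

With the index included, the header line starts with an empty field for the index column; with index=False only the data columns are written.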

Conclusion

The Pandas toolkit helps us process and analyze data quickly. This article will be updated in the future.