This article first appeared on my personal public account, TechFlow. Original writing is not easy, so a follow is much appreciated.


In this article, we will look at the DataFrame, one of the most important data structures in Pandas.

In the last article, we introduced Series and noted that a Series is essentially a one-dimensional array on top of which pandas provides a number of convenient APIs. A DataFrame can be understood simply as a dict of Series that are spliced together into a two-dimensional table. It also provides many interfaces for table-level and batch data processing, which greatly reduces the difficulty of working with data.

Create a DataFrame

A DataFrame is a tabular data structure with two indexes, a row index and a column index, which makes it easy to look up the corresponding row or column and, in turn, to find and process data.

Let's start with the simplest task: how to create a DataFrame.

Create from a dictionary


We create a dict in which each key is a column name and each value is a list. When we pass this dict to the DataFrame constructor, it creates a DataFrame that uses the keys as column names and the lists as the column values.

When we output a DataFrame in Jupyter, its content is automatically rendered as a table.
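As a minimal sketch of the dict-based construction described above (the column names and values here are made-up sample data, not from the original article):

```python
import pandas as pd

# Hypothetical sample data: each key becomes a column name,
# each list holds the values of that column.
data = {
    "name": ["Alice", "Bob", "Carol"],
    "age": [25, 32, 41],
    "score": [88.5, 92.0, 79.5],
}

df = pd.DataFrame(data)
print(df)
```

All lists need to have the same length, since each one becomes a full column of the table.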

Create from NumPy data

We can also create a DataFrame from a two-dimensional NumPy array. If we pass in the array without specifying column names, pandas labels the columns with integers starting from 0.
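A small illustration of this default behavior, using an arbitrary 3x4 array chosen only for the example:

```python
import numpy as np
import pandas as pd

# An arbitrary two-dimensional array: 3 rows, 4 columns.
arr = np.arange(12).reshape(3, 4)

# No column names given, so pandas numbers the columns 0..3.
df = pd.DataFrame(arr)
print(df.columns.tolist())  # [0, 1, 2, 3]
```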


We can specify the column names by passing a list of strings to the columns parameter.
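For instance, with the same kind of array and some placeholder column names (the names "a" to "d" are just for illustration):

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(3, 4)

# The list passed to `columns` must have one name per array column.
df = pd.DataFrame(arr, columns=["a", "b", "c", "d"])
print(df)
```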


Read from file

Pandas can also create a DataFrame by reading data from files in a variety of formats, such as Excel, CSV, and JSON, and even from databases.

For each of these structured formats pandas provides a dedicated API, for example read_excel, read_csv, and read_json.
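The sketch below shows what these readers look like in use. To keep it runnable without any files on disk, it reads from in-memory strings via io.StringIO; in practice you would normally pass a file path instead. The data and the file name in the comment are made up.

```python
import io
import pandas as pd

# Made-up CSV and JSON content standing in for real files.
csv_text = "id,name,score\n1,Alice,88.5\n2,Bob,92.0\n"
json_text = '[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]'

df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text), orient="records")

# Excel works the same way with a path, e.g. pd.read_excel("data.xlsx")
# (hypothetical file name; an Excel engine such as openpyxl must be installed).

print(df_csv)
print(df_json)
```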


If the format is something less standard, that is not a problem either. We can use read_table, which reads data from all kinds of text files and builds a DataFrame according to the delimiter and other parameters we pass in. For example, in the last article, where we verified the effect of PCA dimensionality reduction, we read data from a file in .data format. The columns in that file are separated by spaces rather than by commas as in CSV or by tabs, so we read it by passing the sep parameter to specify the delimiter.
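Here is a rough sketch of reading space-delimited text with read_table. The content below is invented for the example and is not the file from the PCA article; an in-memory string stands in for the file path.

```python
import io
import pandas as pd

# Invented space-delimited content with a header line.
text = "x y z\n5.1 3.5 1.4\n4.9 3.0 1.4\n"

# read_table splits on tabs by default, so we override the delimiter.
df = pd.read_table(io.StringIO(text), sep=" ")
print(df)
```

If the columns are separated by a varying number of spaces, a regular expression such as sep=r"\s+" is usually a better fit than a single space.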


The header parameter indicates which line or lines of the file should be used as column names. By default the first line is used, which is equivalent to header=0. If the data has no column names, we need to specify header=None; otherwise the first row of data is consumed as the column names. Passing a list of line numbers produces multilevel column names, but we rarely need those, so the default or None is what we use most often.

Of all the ways to create a DataFrame, the last one, reading from a file, is the most common. When we do machine learning or take part in Kaggle competitions, the data is always given to us as files, and we rarely need to build it ourselves. In real work the data may not be stored in files, but it still comes from some source, typically a big data platform from which the model obtains its training data.

So in general we rarely use the other ways of creating a DataFrame. It is enough to know a little about them and to focus on reading from files.
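To make the difference concrete, the following sketch reads the same header-less data twice, once with the default and once with header=None (the numbers are arbitrary):

```python
import io
import pandas as pd

# Arbitrary data with NO header line.
text = "1,2,3\n4,5,6\n"

# Default behavior: the first data row is swallowed as column names.
df_default = pd.read_csv(io.StringIO(text))

# header=None: all rows are kept as data, columns are numbered 0, 1, 2.
df_none = pd.read_csv(io.StringIO(text), header=None)

print(df_default.shape)  # (1, 3)
print(df_none.shape)     # (2, 3)
```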

Common operations

Here are some common pandas operations that I already knew before I studied pandas systematically. The reason is simply that they are used so often that they have become common knowledge.

View the data

When we evaluate a DataFrame in Jupyter, it prints all of the data, with an ellipsis in the middle if there are too many rows. For a DataFrame with a large amount of data, we usually do not display it directly but look only at the first or last few rows. There are two APIs for this.

The method that displays the first few rows is called head. It takes an argument that specifies how many rows to show, counting from the beginning; if the argument is omitted, it shows 5 rows.
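A quick sketch of head on a small made-up DataFrame with ten rows:

```python
import pandas as pd

# Made-up data: ten rows so that head() has something to cut off.
df = pd.DataFrame({
    "id": list(range(10)),
    "value": [i * i for i in range(10)],
})

first_default = df.head()   # first 5 rows
first_three = df.head(3)    # first 3 rows
print(first_three)
```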


Since there is an API for the first few rows, there is naturally one for the last few as well, called tail. It lets us view the specified number of rows at the end of the DataFrame.
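The counterpart for the end of the table, again on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "id": list(range(10)),
    "value": [i * i for i in range(10)],
})

# Last 2 rows; the original row index is preserved.
last_two = df.tail(2)
print(last_two)
```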


Adding, deleting, modifying, and querying columns

As we mentioned earlier, a DataFrame is essentially a dict of Series. Since it behaves like a dict, we can get a particular Series by its key, that is, by its column name.

There are two ways to get a specified column from a DataFrame: attribute access, with a dot followed by the column name, or dict-style lookup with square brackets.
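Both styles side by side, on a hypothetical DataFrame with made-up columns:

```python
import pandas as pd

# Hypothetical sample data used only for illustration.
df = pd.DataFrame({
    "city": ["Paris", "Tokyo", "Lima"],
    "age": [25, 32, 41],
})

by_attribute = df.age     # attribute-style access
by_key = df["age"]        # dict-style access

# Both return the same Series.
print(by_attribute.equals(by_key))  # True
```

Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so the bracket form is the safer default.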


We can also read multiple columns at once, in which case only the dict-style lookup is supported. It accepts a list and finds the data for the columns named in that list. The result is a new DataFrame made up of the selected columns.
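Selecting several columns might look like this (the data is again made up):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Tokyo", "Lima"],
    "age": [25, 32, 41],
    "score": [88.5, 92.0, 79.5],
})

# Note the double brackets: the inner pair is the list of column names.
subset = df[["city", "score"]]
print(type(subset).__name__)  # DataFrame
```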


We can use del to delete a column we no longer need.
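A short sketch of deleting a column in place, using made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Tokyo", "Lima"],
    "age": [25, 32, 41],
    "score": [88.5, 92.0, 79.5],
})

# Removes the column from df itself, just like deleting a dict key.
del df["score"]
print(df.columns.tolist())  # ['city', 'age']
```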


To create a new column, we can assign to the DataFrame just as we would assign to a key in a dict.
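For example, assigning a single value to a new column name (the column name "country" is hypothetical) fills every row with that value:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Tokyo", "Lima"],
    "age": [25, 32, 41],
})

# A scalar is broadcast to all rows of the new column.
df["country"] = "unknown"
print(df)
```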


The assigned object does not have to be a scalar; it can also be an array.
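A sketch of assigning an array as a new column, with invented values. The array length has to match the number of rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Tokyo", "Lima"],
    "age": [25, 32, 41],
})

# One value per row; a length mismatch would raise a ValueError.
df["rank"] = np.array([2, 1, 3])
print(df)
```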


Modifying a column is just as easy: assigning to an existing column name overwrites the original data.
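Overwriting an existing column could look like the following, where the new values are computed from the old ones (sample data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Tokyo", "Lima"],
    "age": [25, 32, 41],
})

# Assigning to an existing name replaces that column's data.
df["age"] = df["age"] + 1
print(df["age"].tolist())  # [26, 33, 42]
```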

Convert to a NumPy array

Sometimes we need the data of a DataFrame as a NumPy array, which we can get through the .values attribute.


Each column in a DataFrame has its own type, but a NumPy array holds data of a single type. When converting, pandas finds a common type for all the columns, which is why we often end up with the object dtype. So it is a good idea to check the column types before using .values, to avoid errors caused by an unexpected type.
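The sketch below illustrates the common-type behavior on two small made-up DataFrames: one purely numeric, one mixing numbers and strings.

```python
import pandas as pd

# Integer and float columns share a common numeric type.
df_num = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})

# Adding a string column leaves only `object` as a common type.
df_mixed = pd.DataFrame({"a": [1, 2], "name": ["x", "y"]})

print(df_num.dtypes)          # inspect per-column types first
print(df_num.values.dtype)    # float64
print(df_mixed.values.dtype)  # object
```

Recent pandas versions also offer the to_numpy() method for the same purpose.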

Conclusion

In today's article, we learned about the relationship between DataFrame and Series, along with some DataFrame basics and common usage. Although a DataFrame can be thought of as a dict of Series, it is really a data structure in its own right, with its own APIs that support many sophisticated operations, which makes it a powerful tool for processing data.

According to statistics from professional institutions, an algorithm engineer spends about 70% of their time on data processing, while the time spent actually writing and tuning models may be less than 20%. This shows how necessary and important data processing is, and pandas is one of the best tools for it in the Python world.

If you liked this post and are able to, please give me some triple support (follow, read, like).
