Pandas (2)

The article directories

- Review the above
- DataFrame
- - DataFrame case demonstration of key attributes
  - - The columns property
    - The index attribute
    - Attribute values
    - Dtypes properties
    - Size and Shape properties
- The data type of pandas
- - Converters converts the data type
  - Parse the time data type
- Pandas Keeps the data
- - Save the data to CSV
  - - The experiment
  - Save the data to JSON
  - - The sample
  - Save data to Excel
  - - Write data to multiple tables
  - Save data to MySQL
- Use Apply to process each row of data
- - demo

Review the above

DataFrame

The DataFrame is the most important class for pandas. The DataFrame is used to analyze data by analyzing its properties and by using DataFrame methods.

DataFrame case demonstration of key attributes

The CSV file still uses the test.csv file from the previous article.

The columns property

import pandas as pd

csv_file = 'test.csv'

df = pd.read_csv(csv_file)

print(df.columns)
print(type(df.columns))
Copy the code

Index(['a'.'b'], dtype='object')
<class 'pandas.core.indexes.base.Index'>	# pandasA key attribute ofCopy the code

Print (df.columns. Values) print(‘a’ ‘b’]

Is the name of all columns.

Now find all columns that contain ‘a’ :

for name in df.columns.values:
    if name.find('a') > =0:
        print(name)
Copy the code

That’s one way to do it. Is there a simpler way, like lambda functions, that there is a way.

list(filter(lambda x:x.find('a') > =0,df.columns.values))
# filter generator arguments: The first is a function pointer, and the second is an array passed as an argument to the function pointer.
Filter returns a tuple, so use a list to enforce it
Copy the code

The index attribute

print(df.index)
print(type(df.index))
Copy the code

RangeIndex(start=0, stop=3, step=1)
<class 'pandas.core.indexes.range.RangeIndex'>
Copy the code

It starts at 0 and ends at 3, with a step of 1.

Attribute values

print(df.values)
print(type(df.values))
Copy the code

[[ 1.  2.]
 [ 3.  4.]
 [nan  5.]]
<class 'numpy.ndarray'>
Copy the code

As you can see, it’s a multidimensional array.

Dtypes properties

print(df.dtypes['a'])
print(type(df.dtypes))
Copy the code

float64
<class 'pandas.core.series.Series'>
Copy the code

I’m not going to say much about this, because I covered it in the last article.

Size and Shape properties

These two just look at it,

print(df.size)
print(df.shape)
Copy the code

6	# How many elements are there
(3.2) # rows and columns
Copy the code

The data type of pandas

There is no point in talking.

Basic data types: discrete: objec (string), bool, timedelta continuous: int, float, datatime

Data type conversion: Astype method to_numeric method to_datatime method

I will not repeat what I said in the last article.

Converters converts the data type

As mentioned earlier, you can change the data type of the specified column by specifying the dtype parameter at read time, but that is too general and sometimes you can’t change it, like yesterday when I encountered a NAN and didn’t let me force an int.

def toint(x) :
    try:
        return int(float(x))	# you don't know what type of data is being passed in, or do you want to change it to float
    except:
        return 0
    
df = pd.read_csv(csv_file,converters={'a':toint})
Copy the code

Parse the time data type

For ease of operation, I have added a time column to the CSV file:

a,b,data
1.2.2018-10-1
3.4.2020-12-12
,5
Copy the code

df = pd.read_csv(csv_file,parse_dates=['data'])
print(df.dtypes['data'])
print(df.head())
Copy the code

datetime64[ns]

     a  b       data
0  1.0  2 2018-10- 011  3.0  4 2020-12-12
2  NaN  5        NaT
Copy the code

Pandas Keeps the data

Save the data to CSV

To_csv () is a DataFrame method.

Path_or_buf: file saving position sep: delimiter. If no file is written, the default value is ", ". Columns: specifies the column to saveCopy the code

The experiment

Save to CSV

df = pd.read_csv(csv_file,parse_dates=['data'])

df.to_csv('test2.csv',sep='\t')
Copy the code

	a	b	data
0	1.0	2	2020-10-21
1	3.0	4	2020-10-21
2		5	
Copy the code

Specifies that columns save and do not save index columns

df = pd.read_csv(csv_file,parse_dates=['data'])

df.to_csv('test3.csv',columns='a',index = False)
Copy the code

a
1.0
3.0
""
Copy the code

Specify index column

df = pd.read_csv(csv_file,parse_dates=['data'])

df.to_csv('test4.csv',index_label='ID')
# feels more like giving a name to the index column
Copy the code

ID,a,b,data
0.1.0.2.2020-10-21
1.3.0.4.2020-10-21
2.5.Copy the code

Save the data to JSON

The sample

I didn’t get the data, so I don’t have any examples. I’ll have it when I run the data in a couple of days.

Save data to Excel

DataFrame.to_excel(excel_writer, sheet_name='Sheet1', columns=None, header=True, index=True)
Copy the code

3. It seems that there is no good definition, the previous have been said.

Let’s just do it.

df = pd.read_csv(csv_file,parse_dates=['data'])

df.to_excel('text1.xlsx',sheet_name='a')
Copy the code

There e is a problem writing this date.

How to do? Turn object to write, try, test available,

Also note that the corresponding Excel should not be manually opened on the computer at the time of writing.

Write data to multiple tables

ExcelWriter is required

df = pd.read_csv(csv_file,parse_dates=['data'])

with pd.ExcelWriter('test4.xlsx') as f:
    df.to_excel(f,sheet_name='a',columns=['a'],index=True,index_label='ID')
    df.to_excel(f, sheet_name='b', columns=['b'.'data'], index=True, index_label='ID')
Copy the code

Please check the results for yourself.

Save data to MySQL

def to_sql(
        self,
        name,
        con,
        schema=None,
        if_exists="fail",
        index=True,
        index_label=None,
        chunksize=None,
        dtype=None,
        method=None.)
Copy the code

Parameter definition :(the defined parameters will not be described here)

Schema: Used to create database objects. Default values are generally used.

If_exists: What if a table exists?

Fail: raises ValueError. Replace: deletes the table before inserting data. Delete table (s); delete table (s); Append: Inserts new data. If there are primary keys, avoid primary key conflicts. Columns of a DataFrame correspond to columns of a table. DF index is written to the database by default as a column of dataCopy the code

Index: Writes the index as a column to the database. The default value is True, which means that the index of DF is written to the database by default. Index_label is the column name

Index_label: Column name when the index is written to the database. The default value is index. If DF is a multilevel index, index_Label should be a sequence

Chunksize: Batch processing, how many pieces of data are processed at a time. Default all, generally no what use, unless the amount of data is too large, obvious feeling card can be batch processing.

Dtype: a dictionary that specifies the data type of the column. The key is the column name and the value is a string form of SQLAlchemy Types or SQlite3. If a new table is created, the type must be specified; otherwise, the default type is the type with the largest storage capacity. For example, a VARCHar type becomes a text type, which wastes a lot of space. If you want to add data, you do not need to specify this parameter.

Method: What type of insert statement?

None: Single line by default. 'multi' : multi-linecallableInsert function as callback function, write function name, unused.Copy the code

Here I want to say: if you are not in a hurry to access the database, and the amount of data is large, you can write directly into CSV, later directly import data from CSV into the database.

Use Apply to process each row of data

DataFrame.apply(func,axis)
Copy the code

Func: a function pointer used for internal processing axis: a parameter passed to a function pointer

1: Processes data by row0: Processes data by columnCopy the code

demo

Let’s power column A.

df = pd.read_csv(csv_file,dtype = {'data':object})
print(df.head())

df['aa'] = df.apply(lambda x: x.a*x.a,axis = 1) You can try column processing instead

print(df.head())
Copy the code

   ID    a  b        data
0   0  1.0  2  2020-10-21
1   1  3.0  4  2020-10-21
2   2  NaN  5         NaN

   ID    a  b        data   aa
0   0  1.0  2  2020-10-21  1.0
1   1  3.0  4  2020-10-21  9.0
2   2  NaN  5         NaN  NaN
Copy the code

The usual ones are handled by row.

Here for the time being, the next article will enter into a little summary of actual combat.