Pandas is a powerful toolkit for analyzing structured data. Built on NumPy and its high-performance matrix operations, it is used for data mining, data analysis, and data cleaning.

This article collects commonly used pandas (and related-tool) snippets for future reference.

Abbreviations:

df: any pandas DataFrame object

s: any pandas Series object

Note: some attributes and methods can be used on both df and s.
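
The snippets below assume pandas has been imported and that df and s already exist. As a minimal sketch, with made-up sample data and illustrative column names such as name, team, and Q1:

```python
import pandas as pd

# Hypothetical sample data, only to give the snippets something to run against
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cid', 'Dee'],
    'team': ['A', 'B', 'A', 'C'],
    'Q1':   [89, 36, 57, 93],
})
s = df['Q1']  # any Series object
```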

Recommended Resources:

  • Pandas online tutorial www.gairuo.com/p/pandas-tu…

  • Pandas: Data Processing and Analysis in Python

Environment setup

```bash
# Anaconda / Miniconda downloads:
# https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/
# https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/
# https://docs.conda.io/en/latest/miniconda.html

# Excel-related packages: xlrd, openpyxl, xlsxwriter
# Web-parsing packages: requests, lxml, html5lib, BeautifulSoup4
pip install scipy
# A domestic (China) mirror can be specified for faster installation
pip install pandas -i https://pypi.tuna.tsinghua.edu.cn/simple
```

Managing multiple Python environments with conda:

```bash
# Create a new environment named <env name>
conda create -n py39 python=3.9
# Delete an environment
conda remove -n py39 --all
# Activate an environment
conda activate py39
# Exit the environment
conda deactivate
# List all virtual environments and show the current one
conda info -e
```

Jupyter Notebook shortcut keys

Start Jupyter Notebook: run `jupyter notebook`

| Shortcut | Function |
| --- | --- |
| Tab | Code completion hints |
| Shift + Enter | Run the current cell and move to the next one |
| Shift + Tab (1–3 times) | Show the docstring of the function or method |
| D, D | Press D twice to delete the current cell |
| A / B | Insert a cell above / below |
| M / Y | Switch the cell to Markdown / code mode |

Importing library packages

```python
import pandas as pd  # (2022-02-12)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

Import data

```python
pd.read_csv('file.csv', names=['col1', 'col2'])  # import data from a CSV file
pd.read_excel('file.xlsx', sheet_name='Sheet1', header=0)  # import data from an Excel file
pd.read_json(json_string)   # import data from a JSON-formatted string
pd.read_html(url)           # parse a URL, string, or HTML file and extract its tables
pd.read_clipboard()         # grab the clipboard contents and pass them to read_table()
pd.DataFrame(dict)          # import from a dict: keys are column names, values are data
from io import StringIO
pd.read_csv(StringIO(web_data.text))  # import from a string
```
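
As a quick, self-contained check of the read functions above, the following sketch reads CSV data from an in-memory string (the data and column names are invented):

```python
from io import StringIO
import pandas as pd

csv_text = "id,name,score\n1,Ann,90\n2,Bob,82\n"  # made-up data
df = pd.read_csv(StringIO(csv_text))  # works the same as passing a file path
print(df.head())
```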

Exporting data

```python
df.to_csv('filename.csv')  # export data to a CSV file
df.to_sql(table_name, connection_object)  # export data to an SQL table
df.to_json(filename)  # export data to a text file in JSON format
# Others
df.to_html()      # render as HTML code
df.to_markdown()  # render as Markdown
df.to_string()    # render as a formatted string
df.to_latex(index=False)  # LaTeX tabular, longtable
df.to_dict('split')       # dict; formats: list/series/records/index
df.to_clipboard(sep=',')  # copy to the system clipboard

# Write two DataFrames to one Excel file as separate sheets
writer = pd.ExcelWriter('new.xlsx')
df_1.to_excel(writer, sheet_name='first', index=False)
df_2.to_excel(writer, sheet_name='second', index=False)
writer.save()

# Same thing with a context manager
with pd.ExcelWriter('new.xlsx') as writer:
    df1.to_excel(writer, sheet_name='first')
    df2.to_excel(writer, sheet_name='second')

# xlsxwriter supports custom output features such as merged cells, colors, charts, etc.
# https://xlsxwriter.readthedocs.io/working_with_pandas.html
```

Creating a test object

```python
pd.DataFrame(np.random.rand(20, 5))  # create a DataFrame of random numbers, 20 rows x 5 columns
pd.Series(my_list)                   # create a Series object from the iterable my_list
df.index = pd.date_range('1900/1/30', periods=df.shape[0])  # add a date index
df = pd.util.testing.makeDataFrame()       # create a random data set
df = pd.util.testing.makePeriodFrame()     # create a random data set with a period index
df = pd.util.testing.makeTimeDataFrame()   # create a random data set with a datetime index
df = pd.util.testing.makeMixedDataFrame()  # create a random mixed-type data set
```

Viewing, checking, statistics, attributes

```python
df.head(n)      # view the first n rows of the DataFrame
df.tail(n)      # view the last n rows of the DataFrame
df.info()       # view index, data types, and memory info
df.describe()   # summary statistics of numeric columns
df.dtypes       # data type of each field
df.axes         # show row and column labels
df.mean()       # mean of every column
df.mean(1)      # mean of every row (axis=1); likewise below
df.corr()       # correlation coefficients between columns
df.count()      # number of non-null values in each column
df.max()        # maximum of each column
df.min()        # minimum of each column
df.median()     # median of each column
df.std()        # standard deviation of each column
df.var()        # variance
s.mode()        # mode
s.prod()        # product of all values
s.cumprod()     # cumulative product
df.cumsum(axis=0)  # cumulative sum
s.unique()      # distinct values
df.idxmax()     # index label of the maximum value in each column
df.idxmin()     # index label of the minimum value
df.columns      # all column names
df.team.unique()  # distinct values of one column
# Unique values and counts of a Series; normalize=True gives proportions
s.value_counts(dropna=False)
# Unique values and counts for every column of a DataFrame
df.apply(pd.Series.value_counts)
df.duplicated()       # duplicated rows
df.drop_duplicates()  # drop duplicated rows
# set_option, reset_option, describe_option configure display options
pd.get_option()
# Maximum number of rows/columns to display; None means unlimited
pd.options.display.max_rows = None
pd.options.display.max_columns = None
df.col.argmin()  # positional index of the minimum value (.argmax() for the maximum)
df.col.idxmin()  # index label of the minimum value (.idxmax() for the maximum)
# Cumulative statistics
ds.cumsum()   # cumulative sum of all preceding values
ds.cumprod()  # cumulative product
ds.cummax()   # cumulative maximum
# Rolling-window calculations
ds.rolling(x).sum()   # sum of each window of x adjacent elements
ds.rolling(x).mean()  # mean of each window of x adjacent elements
ds.rolling(x).var()   # variance of each window of x adjacent elements
ds.rolling(x).std()   # standard deviation of each window of x adjacent elements
ds.rolling(x).min()   # minimum of each window of x adjacent elements
ds.rolling(x).max()   # maximum of each window of x adjacent elements
```
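
A short worked example of a few of the inspection calls above, using hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'B', 'A', 'C'], 'Q1': [89, 36, 57, 93]})
print(df.describe())                            # summary statistics of numeric columns
print(df['team'].value_counts(normalize=True))  # unique values as proportions
print(df['Q1'].rolling(2).mean())               # rolling mean over windows of 2 rows
```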

Data cleaning

```python
df.columns = ['a', 'b', 'c']  # rename the columns
df.columns = df.columns.str.replace(' ', '_')  # replace spaces in column names with underscores
df.loc[df.AAA >= 5, ['BBB', 'CCC']] = 555  # replace data
df['pf'] = df.site_id.map({2: 'mini-program', 3: 'APP'})  # map enum codes to names
pd.isnull()   # check for null values in a DataFrame; returns a Boolean array
pd.notnull()  # check for non-null values in a DataFrame; returns a Boolean array
df.drop(['name'], axis=1)  # drop a column
df.drop([0, 10], axis=0)   # drop rows
df.dropna()                # drop all rows that contain null values
df.dropna(axis=1)          # drop all columns that contain null values
df.dropna(axis=1, thresh=n)  # drop columns with fewer than n non-null values
df.fillna(value={'prov': 'unknown'})  # replace null values of a specified column
s.astype(float)  # cast the Series to float
df.index.astype('datetime64[ns]')  # convert the index to datetime
s.replace(1, 'one')                  # replace all values equal to 1 with 'one'
s.replace([1, 3], ['one', 'three'])  # replace 1 with 'one' and 3 with 'three'
df.rename(columns=lambda x: x + 1)   # rename columns in bulk
df.rename(columns={'old_name': 'new_name'})  # rename selected columns
df.set_index('column_one')           # change the index column
df.rename(index=lambda x: x + 1)     # rename the index in bulk
df.columns = ['UID', 'name']         # reset the header names
df['id'] = df['status']              # copy a column
df.loc[:, ::-1]  # reverse the column order
df.loc[::-1]     # reverse the row order
df.loc[::-1].reset_index(drop=True)  # reverse the row order and rebuild the index
```

Data processing: filtering and sorting

```python
df.round(2)  # round the whole DataFrame to two decimals
df.round({'A': 1, 'C': 2})  # round specified columns
df[df.index == 'Jude']  # query by index; use .index
# Query with multiple conditions
df[(df['team'] == 'A') & (df['Q1'] > 80) & df.utype.isin(['returning user', 'returning visitor'])]
df[df.order.isnull()]  # filter rows where the value is null
df[df.team.isin(['A', 'B'])]  # like SQL's "where ... in"
df[(df.team == 'B') & (df.Q1 == 17)]
df[~(df['team'] == 'A') | (df['Q1'] > 80)]  # not, or
df[df.name.str.contains('x')]  # values containing the given characters
df.sort_values(col1)   # sort by column col1, ascending by default
df.col1.sort_values()  # same, returns a Series
df.sort_values(col2, ascending=False)  # sort descending
df.sort_values([col1, col2], ascending=[True, False])  # col1 ascending, then col2 descending
df2 = pd.get_dummies(df)  # one-hot encode enum columns
# Working with a MultiIndex
dd.set_index(['utype', 'site_id', 'p_day'], inplace=True)
dd.loc['new visitor', 2, '2019-06-22'].plot.barh()  # loc takes the index values in order
df[:100]  # first 100 rows; a single row cannot be selected as df[100]
df1 = df.loc[0:, ['designer_id', 'name']]  # take only the specified columns
# Split ages into five bins and label them
ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32])
pd.cut(ages, [0, 5, 20, 30, 50, 100],
       labels=["infant", "youth", "middle-aged", "prime", "elderly"])
daily_index.difference(df_work_day.index)  # difference between two indexes
# Formatting
df.columns.tolist()
df.values.tolist()
data.apply(np.mean)         # apply np.mean to every column of the DataFrame
data.apply(np.max, axis=1)  # apply np.max to every row
df.insert(1, 'three', 12, allow_duplicates=False)  # insert a column (position, name, value)
df.pop('class')  # remove a column and return it
# Add a row
df.append(pd.DataFrame({'one': 2, 'two': 3, 'three': 4.4}, index=['f']), sort=True)
# Add computed columns
iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength']).head()
df.assign(rate=lambda d: d.x / d.y)  # column names here are placeholders
# shift moves the column by one period
df['increase'] = df['GDP'] - df['GDP'].shift(-1)
df.tshift(1)  # shift the time index by one period
df['increase'] = df['GDP'].diff(-1)  # same as above; diff compares neighbouring rows directly
df.apply(lambda x: x / x.max(), axis=1)  # e.g. retention: divide each row by its maximum
# Write into 'value' the value of the column named by 'best' in each row
df['value'] = df.lookup(df['name'], df['best'])
s.where(s > 1, 10)  # keep values meeting the condition, replace the rest with 10
df + 1  # add 1 to every value; same as df.add(1)
# Pipe: chain function calls, f(df) == df.pipe(f)
def gb(df, by):
    result = df.copy()
    result = result.groupby(by).sum()
    return result
df.pipe(gb, by='team')  # call it
df.rolling(2).sum()     # rolling-window calculation
df.expanding(2).sum()   # expanding-window calculation
df['C'] = df.eval('A+B')  # new column C from columns A and B
df.quantile(.5)  # quantile (0.5 is the median)
s.rank()  # ranking; methods: average, min, max, first, dense; default is average
df.explode('A')  # explode list-like values of column A into multiple rows
# Map enum codes to labels
status = {0: 'not started', 1: 'running', 2: 'finished'}
df['taskStatus'] = df['taskStatus'].apply(status.get)
df.assign(amount=0)  # add a column
df.loc[('bar', 'two'), 'A']        # query a MultiIndex
df.query('i0 == "b" & i1 == "b"')  # another MultiIndex query
df.index.get_level_values(2).unique()  # unique values of one MultiIndex level
df.astype('str').applymap(lambda x: x.replace('.00', ''))  # drop trailing ".00"
df.insert(3, 'double', df['values'] * 2)  # insert a "double the values" column in position 3
df['gender'] = df.gender.map({'male': 'M', 'female': 'F'})  # enum conversion
df['Col_sum'] = df.apply(lambda x: x.sum(), axis=1)  # sum of each row as a new column
col_list = list(df)[2:]  # columns from the third one onwards
df['total_days'] = df[col_list].sum(axis=1)  # total number of days
df.loc['col_sum'] = df.apply(lambda x: x.sum())  # sum of each column as a new row
# Keep/reorder only the specified columns
df.reindex(['col_1', 'col_5'], axis="columns")
df.reindex(['col_1', 'col_5'])
```
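
To make the filtering and sorting one-liners concrete, here is a minimal runnable sketch over invented data:

```python
import pandas as pd

df = pd.DataFrame({
    'team': ['A', 'B', 'A', 'C'],
    'Q1':   [89, 36, 57, 93],
    'Q2':   [21, 37, 60, 96],
})
high_a = df[(df['team'] == 'A') & (df['Q1'] > 80)]                # boolean-mask filtering
ranked = df.sort_values(['team', 'Q1'], ascending=[True, False])  # multi-key sort
df['total'] = df[['Q1', 'Q2']].sum(axis=1)                        # row-wise sum into a new column
print(high_a, ranked, df, sep='\n\n')
```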

Data selection

```python
df.loc[df['team'] == 'B', ['name']]  # query by condition, showing only the name column
df.loc[0, 'A':'B']  # row 0, fields A through B
df.loc[2018:1990, 'first_industry_added_value':'third_industry_added_value']
df.loc[0, ['A', 'B']]  # df.loc[row slice, fields]
df.iloc[0, :]  # first row; iloc accepts numbers only
df.iloc[0, 0]  # first element of the first column
dc.query('site_id > 8 and utype == "returning"').head()  # and/or also work, as do & and |
# Iterators and their usage
for idx, row in df.iterrows():
    df.loc[idx, 'link'] = f'http://www.gairuo.com/p/{slug}.html'
for i in df.name: print(i)  # iterate over a column
for label, content in df.items(): print(label, content)  # iterate over every column
for row in df.itertuples(): print(row)
df.at[2018, 'total']  # unlike loc, returns a single value only
df.iat[1, 2]          # like iloc, returns a single value only
df.nlargest(3, ['population', 'GDP'])  # top three rows
df.take([0, 3])  # select rows by position
ds.truncate(before=2, after=4)  # select rows by index (e.g. time); date index labels are supported
df.iloc[:, 0]  # take one column of a DataFrame as a Series
float(str(val).rstrip('%'))  # convert a percentage string to a number
df.reset_index(inplace=True)  # drop the index back into columns
```

Data processing: GroupBy and pivot tables

```python
df.groupby(col1)          # return a GroupBy object grouped by column col1
df.groupby([col1, col2])  # return a GroupBy object grouped by multiple columns
# Pivot table grouped by col1, taking the max of col2 and col3
df.pivot_table(index=col1, values=[col2, col3], aggfunc=max, as_index=False)
df.pivot_table(index=['site_id', 'utype'], values=['uv_all', 'regist_num'],
               aggfunc=['max', 'mean'])
# Melt: unpivot the other columns into rows
pd.melt(df, id_vars=["day"], var_name='city', value_name='temperature')
# A crosstab is a special pivot table for counting group frequencies
pd.crosstab(df.Nationality, df.Handedness)
# Sort after groupby, then aggregate per group
(df[(df.p_day >= '20190101')]
 .groupby(['p_day', 'name'])
 .agg({'uv': sum})
 .sort_values(['p_day', 'uv'], ascending=[False, False])
 .groupby(level=0).head(5)
 .unstack()
 .plot())
# Take the first record per group (also: max, min, last, size, count)
df.groupby('settlement_type').first()
# Aggregate grouped statistics ('max', 'mean', 'median',
# 'prod', 'sum', 'std', 'var', ...)
df1 = df.groupby(by='designer_id').agg({'settlement_amount': sum})
df.groupby(by=df.pf).ip.nunique()       # groupby with distinct count
df.groupby(by=df.pf).ip.value_counts()  # groupby with value counts
df.groupby('name').agg(['sum', 'median', 'count'])
```
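
A compact, self-contained example of groupby and pivot_table (all data invented):

```python
import pandas as pd

df = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B'],
    'name': ['Ann', 'Bob', 'Cid', 'Dee'],
    'Q1':   [89, 36, 57, 93],
    'Q2':   [21, 37, 60, 96],
})
print(df.groupby('team').agg({'Q1': 'sum', 'Q2': 'mean'}))               # per-group aggregation
print(df.pivot_table(index='team', values=['Q1', 'Q2'], aggfunc='max'))  # pivot table of maxima
```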

Data merging

```python
# Append the specified columns into a new single-column table
ndf = (df['nominee1']
       .append(df['nominee2'], ignore_index=True)
       .append(df['nominee3'], ignore_index=True))
ndf = pd.DataFrame(ndf, columns=(['name']))
pd.concat([df1, df2], axis=1)  # append the columns of df2 to df1
# Concatenate rows from several files
df1 = pd.read_csv('111.csv', sep='\t')
df2 = pd.read_csv('222.csv', sep='\t')
excel_list = [df1, df2]
# result = pd.concat(excel_list).fillna('')[:].astype('str')
result = pd.concat(excel_list)
result.to_excel('333.xlsx', index=False)
# Merge all Excel files in a given directory
import glob
files = glob.glob("data/cs/*.xls")
dflist = []
for i in files:
    dflist.append(pd.read_excel(i, usecols=['ID', 'time']))
df = pd.concat(dflist)
# SQL-style join on the columns of df1 and df2
df1.join(df2, on=col1, how='inner')
# Merge two tables on a key
df_all = pd.merge(df_sku, df_spu,
                  how='left',
                  left_on=df_sku['product_id'],
                  right_on=df_spu['p.product_id'])
```

Time and time-series handling

```python
df['date'] = df['time'].dt.date  # take the date part of a datetime column
df["date"] = pd.to_datetime(df['time'])  # format the specified field as a datetime type
df['time'] = df['time'].dt.tz_convert('Asia/Shanghai')  # convert the time zone
# Convert to a given format; precision below seconds may be lost
df['time'] = df['time'].dt.strftime("%Y-%m-%d %H:%M:%S")
dc.index = pd.to_datetime(dc.index, format='%Y%m%d', errors='ignore')  # convert the index to datetime
pd.Timedelta(days=2)  # a two-day time delta
pd.Timestamp.now()
pd.to_datetime('today')
# Check whether a time falls in the current year / on today's date
pd.datetime.today().year == df.start_work.dt.year
df.time.astype('datetime64[ns]').dt.date == pd.to_datetime('today')
# Define a day-level time delta
import datetime
days = lambda x: datetime.timedelta(days=x)
days(2)
pd.Timedelta(days=2)  # same thing, provided by pandas
# Unix timestamps
pd.to_datetime(ted.film_date, unit='ms')
# Aggregate by month (month end)
df.set_index('date').resample('M')['quantity'].sum()
df.set_index('date').groupby('name')['ext price'].resample("M").sum()
# Aggregate by day (the index is a datetime type)
df.groupby(by=df.index.date).agg({'uu': 'count'})
# Aggregate by weekday
df.groupby(by=df.index.weekday).uu.count()
# Aggregate by month
df.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum()
df.groupby(pd.Grouper(key='day', freq='1M')).sum()
# Aggregate by year
df.groupby(['name', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()
# Monthly average by resampling
df['Close'].resample('M').mean()
# https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
# Build a time range of business days
rng = pd.date_range(start="6/1/2016", end="6/30/2016", freq='B')
# Resample by day and forward-fill missing values
df.asfreq('D', method='pad')
# Time zones
df.tz_convert('Europe/Berlin')
df.time.tz_localize(tz='Asia/Shanghai')
df['Time'] = df['Time'].dt.tz_localize('UTC').dt.tz_convert('Asia/Shanghai')  # convert to Beijing time
from pytz import all_timezones  # list of all time zones
print(all_timezones)
# Duration: the interval between two times
df['duration'] = pd.to_datetime(df['end']) - pd.to_datetime(df['begin'])
# Compare against a given time
df['time'] < pd.to_datetime('2019-12-11 20:00:00', format='%Y-%m-%d %H:%M:%S')
```
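
A minimal resampling sketch with generated dates (the values are random and purely illustrative):

```python
import pandas as pd
import numpy as np

rng = pd.date_range('2019-01-01', periods=90, freq='D')
sales = pd.DataFrame({'date': rng,
                      'quantity': np.random.randint(1, 10, len(rng))})
monthly = sales.set_index('date').resample('M')['quantity'].sum()  # month-end totals
print(monthly)
```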

Commonly used tricks

```python
df = pd.read_csv('111.csv', sep='\t').fillna('')[:].astype('str')  # read and cast everything to str
# Show correlations and sort them
dd.corr().total_order_num.sort_values(ascending=False)
# Parse lists / JSON strings
import ast
ast.literal_eval("[{'id': 7, 'name': 'Funny'}]")
# The Series apply method applies a function to
# every element in a Series and returns a Series
ted.ratings.apply(str_to_list).head()
# lambda is a shorter alternative
ted.ratings.apply(lambda x: ast.literal_eval(x))
# an even shorter alternative is to apply the
# function directly (without lambda)
ted.ratings.apply(ast.literal_eval)
# Use apply() on an index
df.index.to_series().apply()
```

Style display

```python
# https://pbpython.com/styling-pandas.html
df['per_cost'] = df['per_cost'].map('{:,.2f}%'.format)  # show as a percentage
# Background color for the specified columns
df.style.applymap(lambda x: 'background-color: grey' if x > 0 else '',
                  subset=pd.IndexSlice[:, ['B', 'C']])
# Highlight the maximum and minimum values
df.style.highlight_max(color='lightgreen').highlight_min(color='#cd4f39')
df.style.format('{:.2%}', subset=pd.IndexSlice[:, ['B']])  # show the specified column as percentages
# Per-column style formats
format_dict = {'sum': '${0:,.0f}', 'date': '{:%Y-%m}', 'pct_of_total': '{:.2%}',
               'c': str.upper}
# Apply several styles at once
(df.style.format(format_dict)
 # shade the specified column by value, from light to dark
 .background_gradient(subset=['sum_num'], cmap='BuGn')
 # horizontal in-cell bars representing magnitude
 .bar(color='#FFA07A', vmin=100_000, subset=['sum'], align='zero')
 .bar(color='lightgreen', vmin=0, subset=['pct_of_total'], align='zero')
 # negative values in red, rising values in green
 .bar(color=['#ffe4e4', '#bbf9ce'], vmin=0, vmax=1, subset=['growth'], align='zero')
 .set_caption('2018 Sales Performance')
 .hide_index())
# Add a background color to specified rows
def background_color(row):
    if row.pv_num >= 10000:
        return ['background-color: red'] * len(row)
    elif row.pv_num >= 100:
        return ['background-color: yellow'] * len(row)
    return [''] * len(row)
df.style.apply(background_color, axis=1)  # usage
```

In-table histograms: sparklines

```python
import sparklines
import numpy as np

def sparkline_str(x):
    bins = np.histogram(x)[0]
    sl = ''.join(sparklines.sparklines(bins))
    return sl

sparkline_str.__name__ = "sparkline"

# Plot the trend per group
df.groupby('name')['quantity', 'ext price'].agg(['mean', sparkline_str]).round(2)

# Sparkline images
# https://hugoworld.wordpress.com/2019/01/26/sparklines-in-jupyter-notebooks-ipython-and-pandas/
def sparkline(data, figsize=(4, 0.25), **kwargs):
    """Creates a sparkline."""
    # Turn off the max column width so the images won't be truncated
    pd.set_option('display.max_colwidth', -1)
    # Turning off the max column will display all the data
    # if gathering into sets / array we might want to restrict to a few items
    pd.set_option('display.max_seq_items', 3)
    # Monkey-patch the DataFrame so the sparklines are displayed
    pd.DataFrame._repr_html_ = lambda self: self.to_html(escape=False)
    from matplotlib import pyplot as plt
    import base64
    from io import BytesIO
    data = list(data)
    *_, ax = plt.subplots(1, 1, figsize=figsize, **kwargs)
    ax.plot(data)
    ax.fill_between(range(len(data)), data, len(data) * [min(data)], alpha=0.1)
    ax.set_axis_off()
    img = BytesIO()
    plt.savefig(img)
    plt.close()
    return '<img src="data:image/png;base64,{}"/>'.format(
        base64.b64encode(img.getvalue()).decode())

# Usage
df.groupby('name')['quantity', 'ext price'].agg(['mean', sparkline])
df.apply(sparkline, axis=1)
```

Visualization

```
kind : str
    - 'line' : line plot (default)
    - 'bar' : vertical bar plot
    - 'barh' : horizontal bar plot
    - 'hist' : histogram
    - 'box' : boxplot
    - 'kde' : Kernel Density Estimation plot
    - 'density' : same as 'kde'
    - 'area' : area plot
    - 'pie' : pie plot
```
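The kind parameter can be passed directly to plot(); for instance, a minimal sketch with made-up numbers (the 'kde' plot additionally requires SciPy):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Q1': [89, 36, 57, 93], 'Q2': [21, 37, 60, 96]})
df.plot(kind='bar')        # same as df.plot.bar()
df['Q1'].plot(kind='kde')  # density estimate of a single column
plt.show()
```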

Common methods:

```python
df88.plot.bar(y='rate', figsize=(20, 10))  # figure size in inches
df_1[df_1.p_day > '2019-06-01'].plot.bar(
    x='p_day', y=['total_order_num', 'order_user'], figsize=(16, 6))  # bar chart
# One line per site: home_remain of each site; stack means stacking,
# unstack means "do not stack"
(df[(df.p_day >= '2019-05-1') & (df.utype == 'returning')]
 .groupby(['p_day', 'site_id'])['home_remain']
 .sum()
 .unstack()
 .plot.line())
# Line chart with multiple lines
dd.plot.line(x='p_day', y=['uv_all', 'home_remain'])
dd.loc['new visitor', 2].plot.scatter(x='order_user', y='paid_order_user')  # scatter plot
dd.plot.bar(color='blue')  # bar chart
s.plot.line(ylim=0)  # restrict the range with ylim=(0, 100); likewise for x
# Line colors: https://matplotlib.org/examples/color/named_colors.html
# Line styles: ('-', '--', '-.', ':')
# Line markers: https://matplotlib.org/api/markers_api.html
# grid=True shows the grid, etc.: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
s.plot.line(color='green', linestyle='-', marker='o')
# Draw two plots together
[df['quantity'].plot.kde(), df['quantity'].plot.hist()]
# Color cells from light to dark according to their value
import seaborn as sns
cm = sns.light_palette("green", as_cmap=True)
df.style.background_gradient(cmap=cm, axis=1)
# Convert the data to a 2-D array of (date, value) pairs
[i for i in zip([i.strftime('%Y-%m-%d') for i in s.index.to_list()], s.to_list())]
# hvplot is used much like plot: https://hvplot.pyviz.org/user_guide/Plotting.html
import hvplot.pandas
# Print the SQLite CREATE TABLE statement
print(pd.io.sql.get_schema(fdf, 'table_name'))
```

Jupyter Notebook issues

```python
# Chinese characters in charts: matplotlib font settings in notebooks
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (15.0, 8.0)  # fixed figure size
plt.rcParams['font.sans-serif'] = ['SimHei']  # fixes Chinese character display
# Show the output of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'  # the default is 'last'
# Make the notebook adapt to the full page width
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
# White background: <style>#notebook_panel {background: #ffffff;}</style>
# Embed page content in a Jupyter notebook
from IPython.display import IFrame
IFrame('https://arxiv.org/pdf/1406.2661.pdf', width=800, height=450)
# Markdown: a cell does not support pasting more than one image;
# fix for only one image showing per opened file:
# in /site-packages/notebook/static/notebook/js/main.min.js change
#   key = utils.uuid().slice(2,6)+encodeURIandParens(blob.name);
# to
#   key = utils.uuid().slice(2,6)+Object.keys(that.attachments).length;
# https://github.com/ihnorton/notebook/commit/55687c2dc08817da587977cb6f19f8cc0103bab1
# Run a shell command: ! <command>
# Online visualization tool: https://plot.ly/create
```

Slideshow (slides)

```bash
pip install RISE
```

- Alt+R: start / exit the slideshow.
- "," (comma) hides the two large control buttons on the left; "t" shows a slide overview; "/" blanks the screen.
- Slide: a main page; switch with the left/right arrow keys.
- Sub-slide: a secondary page; switch with the up/down arrow keys.
- Fragment: hidden at first and revealed on the same page when you press the space bar or an arrow key, giving a dynamic effect.
- Skip: a cell that is not shown in the slideshow.
- Notes: speaker notes, likewise not shown in the slideshow.

Author: Li Qinghui, a data product expert who leads the data product team at an e-commerce company and specializes in raising the company's level of data use through data governance, data analysis, and data-driven operations.

Pandas: Using Python for Data Processing and Analysis

Why it is recommended: the book covers the common needs and pain points of pandas users and gives a comprehensive account of pandas' features, usage, and principles. It is a solid introduction for beginners who want to learn pandas systematically, and a handy desk reference for experienced Python engineers.