Chapter 01 Fundamentals of Pandas Chapter 02 DataFrame Operations Chapter 03 Data Analysis Into Pandas Chapter 04 Selecting Subsets of Data Chapter 05 Boolean Indexes Chapter 06 Index Alignment Chapter 07 Grouping Aggregation, Filtering, and Transformation Chapter 08 Data cleaning chapter 09 Merging Pandas Chapter 10 Time Series Analysis Chapter 11 Visualization using Matplotlib, Pandas, Seaborn


 In[1]: import pandas as pd
        import numpy as np

1. Check indexes

In[2]: college = pd.read_csv('data/college.csv')
       columns = college.columns
       columns
Out[2]: Index(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY',
       'RELAFFIL', 'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS',
       'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
       'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF',
       'CURROPER', 'PCTPELL', 'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10',
       'GRAD_DEBT_MDN_SUPP'],
      dtype='object')
In[3]: columns.values
Out[3]: array(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY',
       'RELAFFIL', 'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS',
       'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
       'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF',
       'CURROPER', 'PCTPELL', 'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10',
       'GRAD_DEBT_MDN_SUPP'], dtype=object)
In[4]: columns[5]
Out[4]: 'WOMENONLY'
# select several discrete positions (1, 8, 10) with a list of integers
In[5]: columns[[1, 8, 10]]
Out[5]: Index(['CITY', 'SATMTMID', 'UGDS'], dtype='object')
In[6]: columns[-7:-4]
Out[6]: Index(['PPTUG_EF', 'CURROPER', 'PCTPELL'], dtype='object')
In[7]: columns.min(), columns.max(), columns.isnull().sum()
Out[7]: ('CITY', 'WOMENONLY', 0)
In[8]: columns + '_A'
Out[8]: Index(['INSTNM_A', 'CITY_A', 'STABBR_A', 'HBCU_A', 'MENONLY_A',
       'WOMENONLY_A', 'RELAFFIL_A', 'SATVRMID_A', 'SATMTMID_A',
       'DISTANCEONLY_A', 'UGDS_A', 'UGDS_WHITE_A', 'UGDS_BLACK_A',
       'UGDS_HISP_A', 'UGDS_ASIAN_A', 'UGDS_AIAN_A', 'UGDS_NHPI_A',
       'UGDS_2MOR_A', 'UGDS_NRA_A', 'UGDS_UNKN_A', 'PPTUG_EF_A',
       'CURROPER_A', 'PCTPELL_A', 'PCTFLOAN_A', 'UG25ABV_A',
       'MD_EARN_WNE_P10_A', 'GRAD_DEBT_MDN_SUPP_A'],
      dtype='object')
In[9]: columns > 'G'
Out[9]: array([ True, False,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True], dtype=bool)
# trying to mutate an Index raises an error
In[10]: columns[1] = 'city'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-1e9e8e3125de> in <module>()
----> 1 columns[1] = 'city'
TypeError: Index does not support mutable operations
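The immutability shown above can be sketched with a toy DataFrame (the column names and data here are hypothetical, not the college set): item assignment on an Index raises, but replacing the whole Index at once is supported.

```python
import pandas as pd

# a tiny illustrative DataFrame (hypothetical data, not the college set)
df = pd.DataFrame({'INSTNM': ['a'], 'CITY': ['b']})

try:
    df.columns[1] = 'city'      # item assignment on an Index raises TypeError
    mutated = True
except TypeError:
    mutated = False

# the supported route: build a new Index and assign it wholesale
df.columns = [c.lower() for c in df.columns]
```

`rename` offers a label-by-label alternative when only a few names change.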

There's more

Index objects support set operations: union, intersection, difference, and symmetric difference

# take slices
In[11]: c1 = columns[:4]
        c1
Out[11]: Index(['INSTNM', 'CITY', 'STABBR', 'HBCU'], dtype='object')

In[12]: c2 = columns[2:5]
        c2
Out[12]: Index(['STABBR', 'HBCU', 'MENONLY'], dtype='object')
# union
In[13]: c1.union(c2)
Out[13]: Index(['CITY', 'HBCU', 'INSTNM', 'MENONLY', 'STABBR'], dtype='object')

In[14]: c1 | c2
Out[14]: Index(['CITY', 'HBCU', 'INSTNM', 'MENONLY', 'STABBR'], dtype='object')
# symmetric difference
In[15]: c1.symmetric_difference(c2)
Out[15]: Index(['CITY', 'INSTNM', 'MENONLY'], dtype='object')

In[16]: c1 ^ c2
Out[16]: Index(['CITY', 'INSTNM', 'MENONLY'], dtype='object')
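The four set operations can be sketched on small hand-made Index objects. The ordering of the results can vary between pandas versions, so the checks compare sets rather than sequences:

```python
import pandas as pd

c1 = pd.Index(['INSTNM', 'CITY', 'STABBR', 'HBCU'])
c2 = pd.Index(['STABBR', 'HBCU', 'MENONLY'])

union = c1.union(c2)                  # labels in either index
inter = c1.intersection(c2)           # labels in both
diff = c1.difference(c2)              # labels only in c1
sym = c1.symmetric_difference(c2)     # labels in exactly one of the two
```

The named methods are preferable to the `|` and `^` operators, whose meaning on Index objects has shifted across pandas releases.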

2. Take the Cartesian product

In[17]: s1 = pd.Series(index=list('aaab'), data=np.arange(4))
        s1
Out[17]:
a    0
a    1
a    2
b    3
dtype: int64

In[18]: s2 = pd.Series(index=list('cababb'), data=np.arange(6))
        s2
Out[18]:
c    0
a    1
b    2
a    3
b    4
b    5
dtype: int64
In[19]: s1 + s2
Out[19]:
a    1.0
a    3.0
a    2.0
a    4.0
a    3.0
a    5.0
b    5.0
b    7.0
b    8.0
c    NaN
dtype: float64

There's more

When the two sets of index elements are identical and in the same order, no Cartesian product is produced; the indexes are aligned by position instead. In the following example, the two Series are identical and the result keeps its integer dtype:

In[20]: s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
        s2 = pd.Series(index=list('aaabb'), data=np.arange(5))
        s1 + s2
Out[20]:
a    0
a    2
a    4
b    6
b    8
dtype: int64
If the index elements are the same but in a different order, a Cartesian product is produced:

In[21]: s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
        s2 = pd.Series(index=list('bbaaa'), data=np.arange(5))
        s1 + s2
Out[21]:
a    2
a    3
a    4
a    3
a    4
a    5
a    4
a    5
a    6
b    3
b    4
b    4
b    5
dtype: int64
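The size of the aligned result can be predicted label by label: each shared label contributes (count in s1) × (count in s2) rows, and a label present on only one side contributes its own count of NaN rows. A small sketch of that bookkeeping, reusing the Series from In[17] and In[18]:

```python
import pandas as pd
import numpy as np

s1 = pd.Series(np.arange(4), index=list('aaab'))    # a x3, b x1
s2 = pd.Series(np.arange(6), index=list('cababb'))  # a x2, b x3, c x1

total = s1 + s2
# 'a' pairs 3*2 = 6 times, 'b' pairs 1*3 = 3 times,
# 'c' exists only in s2 and yields a single NaN row: 6 + 3 + 1 = 10 rows
```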

3. Index explosion

# set the index to RACE
In[22]: employee = pd.read_csv('data/employee.csv', index_col='RACE')
        employee.head()
Out[22]:


# select BASE_SALARY twice and check whether the two Series are the same object
In[23]: salary1 = employee['BASE_SALARY']
        salary2 = employee['BASE_SALARY']
        salary1 is salary2
Out[23]: True
# the result is True: both names refer to the same object, so changing one changes the other.
# To get genuinely new data, use the copy method:
In[24]: salary1 = employee['BASE_SALARY'].copy()
        salary2 = employee['BASE_SALARY'].copy()
        salary1 is salary2
Out[24]: False
# sort one of the two copies; their indexes now hold the same labels in different orders
In[25]: salary1 = salary1.sort_index()
        salary1.head()
Out[25]:
RACE
American Indian or Alaskan Native    78355.0
American Indian or Alaskan Native    26125.0
American Indian or Alaskan Native    98536.0
American Indian or Alaskan Native        NaN
American Indian or Alaskan Native    55461.0
Name: BASE_SALARY, dtype: float64

In[26]: salary2.head()
Out[26]:
RACE
Hispanic/Latino    121862.0
Hispanic/Latino     26125.0
White               45279.0
White               63166.0
White               56347.0
Name: BASE_SALARY, dtype: float64
In[27]: salary_add = salary1 + salary2

In[28]: salary_add.head()
Out[28]:
RACE
American Indian or Alaskan Native    138702.0
American Indian or Alaskan Native    156710.0
American Indian or Alaskan Native    176891.0
American Indian or Alaskan Native    159594.0
American Indian or Alaskan Native    127734.0
Name: BASE_SALARY, dtype: float64
# also add salary1 to itself; comparing the lengths shows that salary_add has exploded from 2,000 rows to over 1.17 million
In[29]: salary_add1 = salary1 + salary1
        len(salary1), len(salary2), len(salary_add), len(salary_add1)
Out[29]: (2000, 2000, 1175424, 2000)

There's more

# count the occurrences of each index value; the Cartesian product forms per label,
# so the length of salary_add is the sum of the squared counts
In[30]: index_vc = salary1.index.value_counts(dropna=False)
        index_vc
Out[30]:
Black or African American            700
White                                665
Hispanic/Latino                      480
Asian/Pacific Islander               107
NaN                                   35
American Indian or Alaskan Native     11
Others                                 2
Name: RACE, dtype: int64

In[31]: index_vc.pow(2).sum()
Out[31]: 1175424
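The squared-counts arithmetic can be checked on a toy Series with a duplicated index (hypothetical labels). Reversing the Series keeps the same labels in a different order, which triggers the per-label Cartesian product:

```python
import pandas as pd
import numpy as np

s1 = pd.Series(np.arange(6), index=list('aaabbc'))
s2 = s1.iloc[::-1]            # same labels, different order

boom = s1 + s2
# each label pairs with every copy of itself: 3**2 + 2**2 + 1**2 = 14 rows
expected = int(s1.index.value_counts().pow(2).sum())
```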

4. Populate values with unequal indexes

# set the index to playerID
In[32]: baseball_14 = pd.read_csv('data/baseball14.csv', index_col='playerID')
        baseball_15 = pd.read_csv('data/baseball15.csv', index_col='playerID')
        baseball_16 = pd.read_csv('data/baseball16.csv', index_col='playerID')
        baseball_14.head()
Out[32]:


# use the difference method to find which players appear in baseball_14 but not in baseball_15 or baseball_16
In[33]: baseball_14.index.difference(baseball_15.index)
Out[33]: Index(['corpoca01', 'dominma01', 'fowlede01', 'grossro01', 'guzmaje01',
       'hoeslj01', 'krausma01', 'preslal01', 'singljo02'],
      dtype='object', name='playerID')

In[34]: baseball_14.index.difference(baseball_16.index)
Out[34]: Index(['congeha01', 'correca01', 'gattiev01', 'gomezca01', 'lowrije01',
       'rasmuco01', 'tuckepr01', 'valbulu01'],
      dtype='object', name='playerID')
# column H holds each player's hit count for each of the last three seasons
In[35]: hits_14 = baseball_14['H']
        hits_15 = baseball_15['H']
        hits_16 = baseball_16['H']
        hits_14.head()
Out[35]:
playerID
altuvjo01    225
cartech02    115
castrja01    103
corpoca01     40
dominma01    121
Name: H, dtype: int64
In[36]: (hits_14 + hits_15).head()
Out[36]:
playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01      NaN
corpoca01      NaN
Name: H, dtype: float64
# congeha01 and corpoca01 each have a record in only one of the two seasons,
# so plain addition loses them; use fill_value to keep their values
In[37]: hits_14.add(hits_15, fill_value=0).head()
Out[37]:
playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01     46.0
corpoca01     40.0
Name: H, dtype: float64
# add the 2016 data as well
In[38]: hits_total = hits_14.add(hits_15, fill_value=0).add(hits_16, fill_value=0)
        hits_total.head()
Out[38]:
playerID
altuvjo01    641.0
bregmal01     53.0
cartech02    193.0
castrja01    243.0
congeha01     46.0
Name: H, dtype: float64
In[39]: hits_total.hasnans
Out[39]: False
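The fill_value behaviour can be reproduced with a two-player mini-sketch, loosely based on the outputs above (the 225/200 split for altuvjo01 is an assumption consistent with the 425 total):

```python
import pandas as pd

hits_14 = pd.Series({'altuvjo01': 225, 'corpoca01': 40})
hits_15 = pd.Series({'altuvjo01': 200, 'congeha01': 46})

plain = hits_14 + hits_15                    # unmatched labels become NaN
filled = hits_14.add(hits_15, fill_value=0)  # unmatched labels keep their value
```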

How it works

# if an element is missing in both Series, the result is still missing even with fill_value
In[40]: s = pd.Series(index=['a', 'b', 'c', 'd'], data=[np.nan, 3, np.nan, 1])
        s
Out[40]:
a    NaN
b    3.0
c    NaN
d    1.0
dtype: float64

In[41]: s1 = pd.Series(index=['a', 'b', 'c'], data=[np.nan, 6, 10])
        s1
Out[41]:
a     NaN
b     6.0
c    10.0
dtype: float64

In[42]: s.add(s1, fill_value=5)
Out[42]:
a     NaN
b     9.0
c    15.0
d     6.0
dtype: float64

In[43]: s1.add(s, fill_value=5)
Out[43]:
a     NaN
b     9.0
c    15.0
d     6.0
dtype: float64

There's more

# select some columns from baseball_14
In[44]: df_14 = baseball_14[['G', 'AB', 'R', 'H']]
        df_14.head()
Out[44]:


In[45]: df_15 = baseball_15[['AB', 'R', 'H', 'HR']]
        df_15.head()
Out[45]:


# adding the two produces missing values wherever the rows or columns don't align
In[46]: (df_14 + df_15).head(10).style.highlight_null('yellow')
Out[46]:


# even with fill_value=0, some values are still missing, because some row/column combinations appear in neither input
In[47]: df_14.add(df_15, fill_value=0).head(10).style.highlight_null('yellow')
Out[47]:
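A minimal sketch with two hand-made DataFrames (hypothetical rows p1/p2) shows why fill_value cannot rescue cells that are absent from both inputs:

```python
import pandas as pd

df_a = pd.DataFrame({'G': [5], 'AB': [10]}, index=['p1'])
df_b = pd.DataFrame({'AB': [12], 'HR': [2]}, index=['p2'])

summed = df_a.add(df_b, fill_value=0)
# ('p1', 'AB') exists in df_a only -> 10 + 0 = 10.0
# ('p2', 'G') and ('p1', 'HR') exist in neither frame -> NaN despite fill_value
```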


5. Append columns from different DataFrames

# select the DEPARTMENT and BASE_SALARY columns from employee
In[48]: employee = pd.read_csv('data/employee.csv')
        dept_sal = employee[['DEPARTMENT', 'BASE_SALARY']]

# sort by DEPARTMENT ascending and BASE_SALARY descending
In[49]: dept_sal = dept_sal.sort_values(['DEPARTMENT', 'BASE_SALARY'],
                                        ascending=[True, False])

# use drop_duplicates to keep the first (highest-salary) row of each department
In[50]: max_dept_sal = dept_sal.drop_duplicates(subset='DEPARTMENT')
        max_dept_sal.head()
Out[50]:


# set DEPARTMENT as the index of both DataFrames
In[51]: max_dept_sal = max_dept_sal.set_index('DEPARTMENT')
        employee = employee.set_index('DEPARTMENT')

# assign the new column; the two DataFrames align on their DEPARTMENT indexes
In[52]: employee['MAX_DEPT_SALARY'] = max_dept_sal['BASE_SALARY']

In[53]: pd.options.display.max_columns = 6

In[54]: employee.head()
Out[54]:


# verify with query: no employee earns more than the maximum salary of their department
In[55]: employee.query('BASE_SALARY > MAX_DEPT_SALARY')
Out[55]:
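The alignment that makes this work can be sketched with a three-row toy frame (hypothetical departments and salaries): the source Series has a unique index, so each duplicated DEPARTMENT label in the target picks up its department's maximum.

```python
import pandas as pd

emp = pd.DataFrame({'DEPARTMENT': ['HPD', 'HFD', 'HPD'],
                    'BASE_SALARY': [60000, 80000, 100000]}).set_index('DEPARTMENT')

# one row per department -> a unique index, safe to assign from
max_dept_sal = emp.groupby(level='DEPARTMENT')['BASE_SALARY'].max()

emp['MAX_DEPT_SALARY'] = max_dept_sal   # aligns on DEPARTMENT, broadcasting per label
```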


How it works

# randomly sample 10 rows from dept_sal without replacement
In[56]: np.random.seed(1234)
        random_salary = dept_sal.sample(n=10).set_index('DEPARTMENT')
        random_salary
Out[56]:


# random_salary's index contains duplicate DEPARTMENT labels, so there is no
# unique way to align it with employee's index and the assignment fails
In[57]: employee['RANDOM_SALARY'] = random_salary['BASE_SALARY']
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-57-1cbebe15fa39> in <module>()
----> 1 employee['RANDOM_SALARY'] = random_salary['BASE_SALARY']
ValueError: cannot reindex from a duplicate axis

There's more

# assign only the first three rows of max_dept_sal; all other rows become missing
In[58]: employee['MAX_SALARY2'] = max_dept_sal['BASE_SALARY'].head(3)

In[59]: employee.MAX_SALARY2.value_counts()
Out[59]:
140416.0    29
100000.0    11
64251.0      5
Name: MAX_SALARY2, dtype: int64

In[60]: employee.MAX_SALARY2.isnull().mean()
Out[60]: 0.97750000000000004

6. Highlight the maximum value per column

In[61]: pd.options.display.max_rows = 8

# read the college data set, with INSTNM as the index
In[62]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
        college.dtypes
Out[62]:
CITY                   object
STABBR                 object
HBCU                  float64
MENONLY               float64
                       ...
PCTFLOAN              float64
UG25ABV               float64
MD_EARN_WNE_P10        object
GRAD_DEBT_MDN_SUPP     object
Length: 26, dtype: object
# MD_EARN_WNE_P10 holds strings, not numbers
In[63]: college['MD_EARN_WNE_P10'].iloc[0]
Out[63]: '30300'

In[64]: college.MD_EARN_WNE_P10.iloc[0]
Out[64]: '30300'
In[65]: college.MD_EARN_WNE_P10.sort_values(ascending=False).head()
Out[65]:
INSTNM
Sharon Regional Health System School of Nursing    PrivacySuppressed
Northcoast Medical Training Academy                PrivacySuppressed
Success Schools                                    PrivacySuppressed
Louisiana Culinary Institute                       PrivacySuppressed
Bais Medrash Toras Chesed                          PrivacySuppressed
Name: MD_EARN_WNE_P10, dtype: object
# convert these columns with to_numeric; errors='coerce' turns unparsable strings into NaN
In[66]: cols = ['MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP']
        for col in cols:
            college[col] = pd.to_numeric(college[col], errors='coerce')
        college.dtypes.loc[cols]
Out[66]:
MD_EARN_WNE_P10       float64
GRAD_DEBT_MDN_SUPP    float64
dtype: object
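The coercion can be demonstrated on a throwaway Series containing the same kind of 'PrivacySuppressed' strings:

```python
import pandas as pd

raw = pd.Series(['30300', 'PrivacySuppressed', '45000'])
clean = pd.to_numeric(raw, errors='coerce')   # unparsable strings become NaN
```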
In[67]: college_n = college.select_dtypes(include=[np.number])
        college_n.head()
Out[67]:


# columns with exactly two unique values are binary indicators
In[68]: criteria = college_n.nunique() == 2
        criteria.head()
Out[68]:
HBCU          True
MENONLY       True
WOMENONLY     True
RELAFFIL      True
SATVRMID     False
dtype: bool
In[69]: binary_cols = college_n.columns[criteria].tolist()
        binary_cols
Out[69]: ['HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL', 'DISTANCEONLY', 'CURROPER']
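A sketch of the nunique trick on a tiny frame (hypothetical values): a column with only two distinct values is flagged as binary.

```python
import pandas as pd

df = pd.DataFrame({'HBCU': [0.0, 1.0, 0.0],
                   'SATVRMID': [430.0, 565.0, 590.0]})

# columns with exactly two distinct values are treated as binary indicators
binary_cols = df.columns[df.nunique() == 2].tolist()
```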
In[70]: college_n2 = college_n.drop(labels=binary_cols, axis='columns')
        college_n2.head()
Out[70]:


In[71]: max_cols = college_n2.idxmax()
        max_cols
Out[71]:
SATVRMID                  California Institute of Technology
SATMTMID                  California Institute of Technology
UGDS                           University of Phoenix-Arizona
UGDS_WHITE            Mr Leon's School of Hair Design-Moscow
                                         ...
PCTFLOAN                              ABC Beauty College Inc
UG25ABV                       Dongguk University-Los Angeles
MD_EARN_WNE_P10                 Medical College of Wisconsin
GRAD_DEBT_MDN_SUPP    Southwest University of Visual Arts-Tucson
Length: 18, dtype: object
In[72]: unique_max_cols = max_cols.unique()
        unique_max_cols[:5]
Out[72]: array(['California Institute of Technology',
       'University of Phoenix-Arizona',
       "Mr Leon's School of Hair Design-Moscow",
       'Velvatex College of Beauty Culture',
       'Thunderbird School of Global Management'], dtype=object)
# use these rows and highlight the maximum of each column
In[73]: college_n2.loc[unique_max_cols].style.highlight_max()
Out[73]:


There's more

# use the axis parameter to highlight the maximum of each row instead
In[74]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
        college_ugds = college.filter(like='UGDS_').head()
        college_ugds.style.highlight_max(axis='columns')
Out[74]:


 In[75]: pd.Timedelta(1, unit='Y')
Out[75]: Timedelta('365 days 05:49:12')

7. Reproduce idxmax with method chaining

# preprocess the data as before
In[76]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
        cols = ['MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP']
        for col in cols:
            college[col] = pd.to_numeric(college[col], errors='coerce')
        college_n = college.select_dtypes(include=[np.number])
        criteria = college_n.nunique() == 2
        binary_cols = college_n.columns[criteria].tolist()
        college_n = college_n.drop(labels=binary_cols, axis='columns')

In[77]: college_n.max().head()
Out[77]:
SATVRMID         765.0
SATMTMID         785.0
UGDS          151558.0
UGDS_WHITE         1.0
UGDS_BLACK         1.0
dtype: float64
# compare each value with its column's maximum
In[78]: college_n.eq(college_n.max()).head()
Out[78]:


# use the any method to select rows that contain at least one True value
In[79]: has_row_max = college_n.eq(college_n.max()).any(axis='columns')
        has_row_max.head()
Out[79]:
INSTNM
Alabama A & M University               False
University of Alabama at Birmingham    False
Amridge University                     False
University of Alabama in Huntsville    False
Alabama State University               False
dtype: bool
In[80]: college_n.shape
Out[80]: (7535, 18)

In[81]: has_row_max.sum()
Out[81]: 401
# the count is surprisingly high because many of the percentage columns share a maximum of 1;
# use cumsum to accumulate the True values
In[83]: college_n.eq(college_n.max()).cumsum()
Out[83]:


# some columns, such as SATVRMID and SATMTMID, have only one maximum, but UGDS_WHITE has
# many: 109 schools are 100 percent white. Calling cumsum a second time leaves the value 1
# only at each column's first maximum
>>> college_n.eq(college_n.max()).cumsum().cumsum()


# again use any to select rows containing at least one True value
In[84]: has_row_max2 = college_n.eq(college_n.max())\
                                .cumsum()\
                                .cumsum()\
                                .eq(1)\
                                .any(axis='columns')
        has_row_max2.head()
Out[84]:
INSTNM
Alabama A & M University               False
University of Alabama at Birmingham    False
Amridge University                     False
University of Alabama in Huntsville    False
Alabama State University               False
dtype: bool
In[85]: has_row_max2.sum()
Out[85]: 16
In[86]: idxmax_cols = has_row_max2[has_row_max2].index
        idxmax_cols
Out[86]: Index(['Thunderbird School of Global Management',
       'Southwest University of Visual Arts-Tucson', 'ABC Beauty College Inc',
       'Velvatex College of Beauty Culture',
       'California Institute of Technology',
       'Le Cordon Bleu College of Culinary Arts-San Francisco',
       'MTI Business College Inc', 'Dongguk University-Los Angeles',
       "Mr Leon's School of Hair Design-Moscow",
       'Haskell Indian Nations University', 'LIU Brentwood',
       'Medical College of Wisconsin', 'Palau Community College',
       'California University of Management and Sciences',
       'Cosmopolitan Beauty and Tech School',
       'University of Phoenix-Arizona'],
      dtype='object', name='INSTNM')
In[87]: set(college_n.idxmax().unique()) == set(idxmax_cols)
Out[87]: True
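The equivalence can be verified on a small frame where a column's maximum repeats (hypothetical data): both routes report only the first occurrence of each column's maximum.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 3, 3], 'y': [9, 2, 9]}, index=['a', 'b', 'c'])

# double cumsum marks only the FIRST occurrence of each column's maximum with 1
first_max_rows = df.eq(df.max()).cumsum().cumsum().eq(1).any(axis='columns')
chain_labels = set(first_max_rows[first_max_rows].index)

idxmax_labels = set(df.idxmax().unique())   # idxmax also reports first occurrences
```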

There's more

In[88]: %timeit college_n.idxmax().values
1.11 ms ± 50.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In[89]: %timeit college_n.eq(college_n.max())\
                         .cumsum()\
                         .cumsum()\
                         .eq(1)\
                         .any(axis='columns')\
                         [lambda x: x].index
5.26 ms ± 35.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

8. Find the most common maximum

In[90]: pd.options.display.max_rows = 40

In[91]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
        college_ugds = college.filter(like='UGDS_')
        college_ugds.head()
Out[91]:


# use idxmax along the columns to find the race with the highest share in each row
In[92]: highest_percentage_race = college_ugds.idxmax(axis='columns')
        highest_percentage_race.head()
Out[92]:
INSTNM
Alabama A & M University               UGDS_BLACK
University of Alabama at Birmingham    UGDS_WHITE
Amridge University                     UGDS_BLACK
University of Alabama in Huntsville    UGDS_WHITE
Alabama State University               UGDS_BLACK
dtype: object
In[93]: highest_percentage_race.value_counts(normalize=True)
Out[93]:
UGDS_WHITE    0.670352
UGDS_BLACK    0.151586
UGDS_HISP     0.129473
UGDS_UNKN     0.023422
UGDS_ASIAN    0.012074
UGDS_AIAN     0.006110
UGDS_NRA      0.004073
UGDS_NHPI     0.001746
UGDS_2MOR     0.001164
dtype: float64
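The idxmax(axis='columns') plus value_counts(normalize=True) pattern in miniature, with two hypothetical schools:

```python
import pandas as pd

ugds = pd.DataFrame({'UGDS_WHITE': [0.7, 0.2],
                     'UGDS_BLACK': [0.2, 0.7]},
                    index=['School A', 'School B'])

top_race = ugds.idxmax(axis='columns')          # winning column label per row
share = top_race.value_counts(normalize=True)   # fraction of schools per label
```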

There's more

# for schools where black students are the largest group, find the distribution of the second-largest group
In[94]: college_black = college_ugds[highest_percentage_race == 'UGDS_BLACK']
        college_black = college_black.drop('UGDS_BLACK', axis='columns')
        college_black.idxmax(axis='columns').value_counts(normalize=True)
Out[94]:
