Introduction

Before pandas 1.0, text data could only be stored with the object dtype. Version 1.0 added a dedicated extension type called StringDtype. This article explains how to work with text data in pandas.

Creating text data

Let's look at some common ways to build a Series from text:

In [1]: pd.Series(['a', 'b', 'c'])
Out[1]: 
0    a
1    b
2    c
dtype: object

If we want to use the new StringDtype, we can request it explicitly:

In [2]: pd.Series(['a', 'b', 'c'], dtype="string")
Out[2]: 
0    a
1    b
2    c
dtype: string

In [3]: pd.Series(['a', 'b', 'c'], dtype=pd.StringDtype())
Out[3]: 
0    a
1    b
2    c
dtype: string

Or use astype to convert an existing Series:

In [4]: s = pd.Series(['a', 'b', 'c'])

In [5]: s
Out[5]: 
0    a
1    b
2    c
dtype: object

In [6]: s.astype("string")
Out[6]: 
0    a
1    b
2    c
dtype: string
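As a small sketch of how the conversion treats missing values, the following converts an object Series that contains NaN. The data and variable names here are illustrative and are not part of the original session.

```python
import numpy as np
import pandas as pd

# An object-dtype Series with one missing value (illustrative data).
raw = pd.Series(['a', np.nan, 'c'], dtype=object)

# astype("string") converts to the dedicated string dtype;
# the missing entry stays missing instead of becoming the text "nan".
converted = raw.astype("string")

print(converted.dtype)
print(converted.isna().tolist())
```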

String methods

Strings can be converted to lowercase or uppercase, and their lengths can be counted:

In [24]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
   ....:               dtype="string")
   ....: 

In [25]: s.str.lower()
Out[25]: 
0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string

In [26]: s.str.upper()
Out[26]: 
0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: string

In [27]: s.str.len()
Out[27]: 
0       1
1       1
2       1
3       4
4       4
5    <NA>
6       4
7       3
8       3
dtype: Int64
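One practical difference between the two dtypes shows up in the result of len. The sketch below stores the same illustrative values both ways, with the dtype given explicitly in each case, and compares the result dtypes.

```python
import numpy as np
import pandas as pd

values = ['A', 'Aaba', np.nan, 'dog']

# Same data stored two ways (dtype stated explicitly for both).
as_object = pd.Series(values, dtype=object)
as_string = pd.Series(values, dtype="string")

# With object dtype the missing value forces a float result;
# with the string dtype the result is a nullable integer.
object_lengths = as_object.str.len()
string_lengths = as_string.str.len()

print(object_lengths.dtype)
print(string_lengths.dtype)
```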

Leading and trailing whitespace can also be stripped:

In [28]: idx = pd.Index([' jack', 'jill ', ' jesse ', 'frank'])

In [29]: idx.str.strip()
Out[29]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')

In [30]: idx.str.lstrip()
Out[30]: Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')

In [31]: idx.str.rstrip()
Out[31]: Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')
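The strip family also accepts a set of characters to remove instead of whitespace. A minimal sketch, using made-up labels padded with underscores:

```python
import pandas as pd

# Illustrative labels padded with underscores rather than spaces.
idx = pd.Index(['__jack', 'jill__', '_jesse_', 'frank'])

# Passing a string removes any of its characters from the chosen side(s).
both_sides = idx.str.strip('_')
left_only = idx.str.lstrip('_')

print(list(both_sides))
print(list(left_only))
```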

String operations on columns

Because column labels are strings held in an Index, the same .str methods work on df.columns:

In [32]: df = pd.DataFrame(np.random.randn(3, 2),
   ....:                   columns=[' Column A ', ' Column B '], index=range(3))
   ....: 

In [33]: df
Out[33]: 
    Column A    Column B 
0    0.469112   -0.282863
1   -1.509059   -1.135632
2    1.212112   -0.173215

In [34]: df.columns.str.strip()
Out[34]: Index(['Column A', 'Column B'], dtype='object')

In [35]: df.columns.str.lower()
Out[35]: Index([' column a ', ' column b '], dtype='object')
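These calls return a new Index and do not modify the DataFrame, so the result has to be assigned back. A short sketch of chaining several steps to normalize column labels follows; the zero-filled frame is a stand-in for real data.

```python
import numpy as np
import pandas as pd

# Placeholder frame with untidy column labels.
df = pd.DataFrame(np.zeros((3, 2)), columns=[' Column A ', ' Column B '])

# Strip whitespace, lowercase, then replace inner spaces with underscores.
# Assign the result back, since .str methods return a new Index.
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

print(list(df.columns))
```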

Split and replace strings

split breaks each string into a list of substrings:

In [38]: s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")

In [39]: s2.str.split('_')
Out[39]: 
0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

To access an element of the list produced by split, use get or [] indexing:

In [40]: s2.str.split('_').str.get(1)
Out[40]: 
0       b
1       d
2    <NA>
3       g
dtype: object

In [41]: s2.str.split('_').str[1]
Out[41]: 
0       b
1       d
2    <NA>
3       g
dtype: object

Pass expand=True to spread the split result across multiple columns:

In [42]: s2.str.split('_', expand=True)
Out[42]: 
      0     1     2
0     a     b     c
1     c     d     e
2  <NA>  <NA>  <NA>
3     f     g     h

You can limit the number of splits with n:

In [43]: s2.str.split('_', expand=True, n=1)
Out[43]: 
      0     1
0     a   b_c
1     c   d_e
2  <NA>  <NA>
3     f   g_h
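The rsplit method listed in the summary works the same way but counts splits from the right. A brief sketch on the same data as above:

```python
import numpy as np
import pandas as pd

s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")

# With n=1, rsplit cuts at the last underscore instead of the first.
right = s2.str.rsplit('_', expand=True, n=1)

print(right)
```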

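replace substitutes text in each string, and the pattern can be a regular expression. Since pandas 2.0 the regex parameter defaults to False, so it must be set explicitly when the pattern is a regular expression or when case is given:

s3.str.replace('^.a|dog', 'XX-XX ', case=False, regex=True)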
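The definition of s3 is not shown in this article, so the sketch below assumes a small illustrative Series in its place. It passes regex=True explicitly, which recent pandas versions require for a regular-expression pattern.

```python
import pandas as pd

# Assumed stand-in for s3; the original definition is not shown in the text.
s3 = pd.Series(['Aaba', 'CABA', 'dog', 'cat', 'B'], dtype="string")

# '^.a' matches any character followed by 'a' at the start of the string;
# 'dog' matches anywhere. case=False makes both alternatives case-insensitive.
replaced = s3.str.replace('^.a|dog', 'XX-XX ', case=False, regex=True)

print(replaced.tolist())
```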

Concatenating strings

We can join the strings of a Series into a single string using cat:

In [64]: s = pd.Series(['a', 'b', 'c', 'd'], dtype="string")

In [65]: s.str.cat(sep=',')
Out[65]: 'a,b,c,d'
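cat has two other common uses: controlling how missing values are rendered, and joining a Series element by element with another list-like. A minimal sketch with illustrative data:

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'], dtype="string")
t = pd.Series(['a', 'b', np.nan, 'd'], dtype="string")

# Missing values are skipped by default ...
joined_skip = t.str.cat(sep=',')
# ... unless na_rep supplies a placeholder for them.
joined_fill = t.str.cat(sep=',', na_rep='-')

# Passing another list-like concatenates element by element.
pairs = s.str.cat(['A', 'B', 'C', 'D'])

print(joined_skip, joined_fill, pairs.tolist())
```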

Indexing with .str

When a Series holds strings, you can index with .str[] to get the character at a given position in each element. Positions past the end of a string yield <NA>. For example:

In [99]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,
   ....:                'CABA', 'dog', 'cat'],
   ....:               dtype="string")
   ....: 

In [100]: s.str[0]
Out[100]: 
0       A
1       B
2       C
3       A
4       B
5    <NA>
6       C
7       d
8       c
dtype: string

In [101]: s.str[1]
Out[101]: 
0    <NA>
1    <NA>
2    <NA>
3       a
4       a
5    <NA>
6       A
7       o
8       a
dtype: string
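The same accessor also takes slices and negative positions. A short sketch on a reduced, illustrative version of the Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Aaba', 'A', np.nan, 'dog'], dtype="string")

# A slice never fails on short strings; it returns what is available.
first_two = s.str[:2]
# Negative positions count from the end of each string.
last_char = s.str[-1]

print(first_two)
print(last_char)
```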

extract

extract pulls the capture groups of a regular expression out of each string. It takes an expand parameter, which defaulted to False before version 0.23 and defaults to True from 0.23 onward. With expand=True, a DataFrame is always returned. With expand=False, the result is a Series or Index when the pattern has one capture group, and a DataFrame when it has several.

The pattern passed to extract must contain at least one capture group.

In [102]: pd.Series(['a1', 'b2', 'c3'],
   .....:           dtype="string").str.extract(r'([ab])(\d)', expand=False)
   .....: 
Out[102]: 
      0     1
0     a     1
1     b     2
2  <NA>  <NA>

The example above matches each string in the Series against a regular expression with two groups: the first captures a letter and the second captures a digit. The third element, c3, does not match [ab], so its row is <NA>.

Note that only the text captured by groups is extracted.

The following pattern has a single group, so only the digit is extracted:

In [106]: pd.Series(['a1', 'b2', 'c3'],
   .....:           dtype="string").str.extract(r'[ab](\d)', expand=False)
   .....: 
Out[106]: 
0       1
1       2
2    <NA>
dtype: string

You can also name the output columns by using named groups:

In [103]: pd.Series(['a1', 'b2', 'c3'],
   .....:           dtype="string").str.extract(r'(?P<letter>[ab])(?P<digit>\d)',
   .....:                                       expand=False)
   .....: 
Out[103]: 
  letter digit
0      a     1
1      b     2
2   <NA>  <NA>
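To make the effect of expand concrete, the sketch below runs the same single-group pattern with both settings and checks the type of each result. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['a1', 'b2', 'c3'], dtype="string")

# One capture group with expand=False gives a Series ...
as_series = s.str.extract(r'[ab](\d)', expand=False)
# ... while expand=True always gives a DataFrame, here with one column.
as_frame = s.str.extract(r'[ab](\d)', expand=True)

print(type(as_series).__name__, type(as_frame).__name__, as_frame.shape)
```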

extractall

extractall is similar to extract. The difference is that extract returns only the first match in each string, while extractall returns every match. For example:

In [112]: s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"],
   .....:               dtype="string")
   .....: 

In [113]: s
Out[113]: 
A    a1a2
B      b1
C      c1
dtype: string

In [114]: two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'

In [115]: s.str.extract(two_groups, expand=True)
Out[115]: 
  letter digit
A      a     1
B      b     1
C      c     1

extract stops after it matches a1.

In [116]: s.str.extractall(two_groups)
Out[116]: 
        letter digit
  match             
A 0          a     1
  1          a     2
B 0          b     1
C 0          c     1

extractall matches a1 and then goes on to match a2. The extra index level, match, numbers the matches within each string.
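Because the result of extractall carries a match level in its index, a single match number can be selected with xs. The sketch below picks match 0 for every row, which lines up with what extract returns; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
two_groups = r'(?P<letter>[a-z])(?P<digit>[0-9])'

# One row per match, indexed by (original label, match number).
all_matches = s.str.extractall(two_groups)

# Cross-section on the "match" level keeps only the first match per string.
first_matches = all_matches.xs(0, level="match")

print(len(all_matches))
print(first_matches)
```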

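contains and match

contains, match, and fullmatch test whether each string matches a regular expression. contains looks for the pattern anywhere in the string, match anchors it at the start of the string, and fullmatch requires the entire string to match. The examples below use a variable named pattern whose definition is not shown; the outputs are consistent with pattern = r'[0-9][a-z]'.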

In [127]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
   .....:           dtype="string").str.contains(pattern)
   .....: 
Out[127]: 
0    False
1    False
2     True
3     True
4     True
5     True
dtype: boolean

In [128]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
   .....:           dtype="string").str.match(pattern)
   .....: 
Out[128]: 
0    False
1    False
2     True
3     True
4    False
5     True
dtype: boolean

In [129]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
   .....:           dtype="string").str.fullmatch(pattern)
   .....: 
Out[129]: 
0    False
1    False
2     True
3     True
4    False
5    False
dtype: boolean
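The three checks can be compared side by side in a self-contained sketch. The pattern value here is an assumption, chosen because it reproduces the outputs shown above. The last lines use the na parameter to decide what a missing value should map to.

```python
import numpy as np
import pandas as pd

# Assumed pattern: one digit followed by one lowercase letter.
pattern = r'[0-9][a-z]'
values = pd.Series(['1', '2', '3a', '3b', '03c', '4dx'], dtype="string")

anywhere = values.str.contains(pattern)    # pattern found anywhere
at_start = values.str.match(pattern)       # pattern anchored at the start
whole = values.str.fullmatch(pattern)      # entire string must match

# na=False treats a missing value as "no match" instead of <NA>.
with_missing = pd.Series(['3a', np.nan], dtype="string")
filled = with_missing.str.contains(pattern, na=False)

print(anywhere.tolist(), at_start.tolist(), whole.tolist(), filled.tolist())
```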

Summary of String methods

Finally, here is a summary of the string methods:

Method Description
cat() Concatenate strings
split() Split strings on delimiter
rsplit() Split strings on delimiter working from the end of the string
get() Index into each element (retrieve i-th element)
join() Join strings in each element of the Series with passed separator
get_dummies() Split strings on the delimiter returning DataFrame of dummy variables
contains() Return boolean array if each string contains pattern/regex
replace() Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence
repeat() Duplicate values (s.str.repeat(3) equivalent to x * 3)
pad() Add whitespace to left, right, or both sides of strings
center() Equivalent to str.center
ljust() Equivalent to str.ljust
rjust() Equivalent to str.rjust
zfill() Equivalent to str.zfill
wrap() Split long strings into lines with length less than a given width
slice() Slice each string in the Series
slice_replace() Replace slice in each string with passed value
count() Count occurrences of pattern
startswith() Equivalent to str.startswith(pat) for each element
endswith() Equivalent to str.endswith(pat) for each element
findall() Compute list of all occurrences of pattern/regex for each string
match() Call re.match on each element, returning a boolean
extract() Call re.search on each element, returning DataFrame with one row for each element and one column for each regex capture group
extractall() Call re.findall on each element, returning DataFrame with one row for each match and one column for each regex capture group
len() Compute string lengths
strip() Equivalent to str.strip
rstrip() Equivalent to str.rstrip
lstrip() Equivalent to str.lstrip
partition() Equivalent to str.partition
rpartition() Equivalent to str.rpartition
lower() Equivalent to str.lower
casefold() Equivalent to str.casefold
upper() Equivalent to str.upper
find() Equivalent to str.find
rfind() Equivalent to str.rfind
index() Equivalent to str.index
rindex() Equivalent to str.rindex
capitalize() Equivalent to str.capitalize
swapcase() Equivalent to str.swapcase
normalize() Return Unicode normal form. Equivalent to unicodedata.normalize
translate() Equivalent to str.translate
isalnum() Equivalent to str.isalnum
isalpha() Equivalent to str.isalpha
isdigit() Equivalent to str.isdigit
isspace() Equivalent to str.isspace
islower() Equivalent to str.islower
isupper() Equivalent to str.isupper
istitle() Equivalent to str.istitle
isnumeric() Equivalent to str.isnumeric
isdecimal() Equivalent to str.isdecimal
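As one example of a method from the table that was not demonstrated above, here is a sketch of get_dummies, which turns delimiter-separated labels into indicator columns. The data is illustrative.

```python
import numpy as np
import pandas as pd

# Each element lists zero or more labels separated by '|'.
labels = pd.Series(['a', 'a|b', np.nan, 'a|c'], dtype="string")

# One column per distinct label; a row is marked where the label is present.
dummies = labels.str.get_dummies(sep='|')

print(dummies)
```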

This article has been included in http://www.flydean.com/06-python-pandas-text/
