Pandas advanced tutorial: processing text data

Introduction

Before 1.0, there was only one form to store text data, and that was object. After 1.0, a new data type called StringDtype was added. Today I will explain to you those things in text in Pandas.

Create text DF

First look at the common examples of using text to build DF:

In [1]: pd.Series(['a', 'b', 'c'])
Out[1]: 
0    a
1    b
2    c
dtype: object

If you want to use the new StringDtype, you can do this:

In [2]: pd.Series(['a', 'b', 'c'], dtype="string")
Out[2]: 
0    a
1    b
2    c
dtype: string

In [3]: pd.Series(['a', 'b', 'c'], dtype=pd.StringDtype())
Out[3]: 
0    a
1    b
2    c
dtype: string

Or use astype to convert:

In [4]: s = pd.Series(['a', 'b', 'c'])

In [5]: s
Out[5]: 
0    a
1    b
2    c
dtype: object

In [6]: s.astype("string")
Out[6]: 
0    a
1    b
2    c
dtype: string

String methods

String can be converted to uppercase, lowercase and count its length:

In [24]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
   ....:               dtype="string")
   ....: 

In [25]: s.str.lower()
Out[25]: 
0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string

In [26]: s.str.upper()
Out[26]: 
0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: string

In [27]: s.str.len()
Out[27]: 
0       1
1       1
2       1
3       4
4       4
5    <NA>
6       4
7       3
8       3
dtype: Int64

You can also perform a trip operation:

In [28]: idx = pd.Index([' jack', 'jill ', ' jesse ', 'frank'])

In [29]: idx.str.strip()
Out[29]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')

In [30]: idx.str.lstrip()
Out[30]: Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')

In [31]: idx.str.rstrip()
Out[31]: Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')

String operation of columns

Because columns are represented by String, you can manipulate columns in the usual String way:

In [34]: df.columns.str.strip()
Out[34]: Index(['Column A', 'Column B'], dtype='object')

In [35]: df.columns.str.lower()
Out[35]: Index([' column a ', ' column b '], dtype='object')

In [32]: df = pd.DataFrame(np.random.randn(3, 2),
   ....:                   columns=[' Column A ', ' Column B '], index=range(3))
   ....: 

In [33]: df
Out[33]: 
    Column A    Column B 
0    0.469112   -0.282863
1   -1.509059   -1.135632
2    1.212112   -0.173215

Split and replace String

Split can split a String into an array.

In [38]: s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")

In [39]: s2.str.split('_')
Out[39]: 
0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

To access the characters in the array after split, you can do this:

In [40]: s2.str.split('_').str.get(1)
Out[40]: 
0       b
1       d
2    <NA>
3       g
dtype: object

In [41]: s2.str.split('_').str[1]
Out[41]: 
0       b
1       d
2    <NA>
3       g
dtype: object

Use expand=True to expand the split array into multiple columns:

In [42]: s2.str.split('_', expand=True)
Out[42]: 
      0     1     2
0     a     b     c
1     c     d     e
2  <NA>  <NA>  <NA>
3     f     g     h

You can specify the number of split columns:

In [43]: s2.str.split('_', expand=True, n=1)
Out[43]: 
      0     1
0     a   b_c
1     c   d_e
2  <NA>  <NA>
3     f   g_h

replace is used to replace characters, and regular expressions can also be used in the replacement process:

s3.str.replace('^.a|dog', 'XX-XX ', case=False)

String connection

Use cat to connect String:

In [64]: s = pd.Series(['a', 'b', 'c', 'd'], dtype="string")

In [65]: s.str.cat(sep=',')
Out[65]: 'a,b,c,d'

Use .str to index

pd.Series will return a Series. If the Series is a string, the characters in the column can be accessed through index. For example:

In [99]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,
   ....:                'CABA', 'dog', 'cat'],
   ....:               dtype="string")
   ....: 

In [100]: s.str[0]
Out[100]: 
0       A
1       B
2       C
3       A
4       B
5    <NA>
6       C
7       d
8       c
dtype: string

In [101]: s.str[1]
Out[101]: 
0    <NA>
1    <NA>
2    <NA>
3       a
4       a
5    <NA>
6       A
7       o
8       a
dtype: string

extract

Extract is used to decompress data from String. It receives an expand parameter. Before version 0.23, this parameter defaulted to False. If it is false, extract will return Series, index or DF. If expand=true, then DF will be returned. After version 0.23, the default is true.

extract is usually used with regular expressions.

In [102]: pd.Series(['a1', 'b2', 'c3'],
   .....:           dtype="string").str.extract(r'([ab])(\d)', expand=False)
   .....: 
Out[102]: 
      0     1
0     a     1
1     b     2
2  <NA>  <NA>

The above example decomposes each string in the Series according to regular expressions. The first part is characters, and the back part is numbers.

Note that only the group data in the regular expression will be extracted.

The following will only extract numbers:

In [106]: pd.Series(['a1', 'b2', 'c3'],
   .....:           dtype="string").str.extract(r'[ab](\d)', expand=False)
   .....: 
Out[106]: 
0       1
1       2
2    <NA>
dtype: string

You can also specify the name of the column as follows:

In [103]: pd.Series(['a1', 'b2', 'c3'],
   .....:           dtype="string").str.extract(r'(?P<letter>[ab])(?P<digit>\d)',
   .....:                                       expand=False)
   .....: 
Out[103]: 
  letter digit
0      a     1
1      b     2
2   <NA>  <NA>

extractall

Similar to extract, there is extractall. The difference is that extract will only match the first time, while extractall will do all the matches. For example:

In [112]: s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"],
   .....:               dtype="string")
   .....: 

In [113]: s
Out[113]: 
A    a1a2
B      b1
C      c1
dtype: string

In [114]: two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'

In [115]: s.str.extract(two_groups, expand=True)
Out[115]: 
  letter digit
A      a     1
B      b     1
C      c     1

After extract matches a1, it will not continue.

In [116]: s.str.extractall(two_groups)
Out[116]: 
        letter digit
  match             
A 0          a     1
  1          a     2
B 0          b     1
C 0          c     1

After extractall matches a1, it will also match a2.

contains and match

contains and match are used to test whether the DF contains specific data:

In [127]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
   .....:           dtype="string").str.contains(pattern)
   .....: 
Out[127]: 
0    False
1    False
2     True
3     True
4     True
5     True
dtype: boolean

In [128]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
   .....:           dtype="string").str.match(pattern)
   .....: 
Out[128]: 
0    False
1    False
2     True
3     True
4    False
5     True
dtype: boolean

In [129]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
   .....:           dtype="string").str.fullmatch(pattern)
   .....: 
Out[129]: 
0    False
1    False
2     True
3     True
4    False
5    False
dtype: boolean

String method summary

Finally, summarize the String method:

Method	Description
cat()	Concatenate strings
split()	Split strings on delimiter
rsplit()	Split strings on delimiter working from the end of the string
get()	Index into each element (retrieve i-th element)
join()	Join strings in each element of the Series with passed separator
get_dummies()	Split strings on the delimiter returning DataFrame of dummy variables
contains()	Return boolean array if each string contains pattern/regex
replace()	Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence
repeat()	Duplicate values (s.str.repeat(3) equivalent to x * 3)
pad()	Add whitespace to left, right, or both sides of strings
center()	Equivalent to str.center
ljust()	Equivalent to str.ljust
rjust()	Equivalent to str.rjust
zfill()	Equivalent to str.zfill
wrap()	Split long strings into lines with length less than a given width
slice()	Slice each string in the Series
slice_replace()	Replace slice in each string with passed value
count()	Count occurrences of pattern
startswith()	Equivalent to str.startswith(pat) for each element
endswith()	Equivalent to str.endswith(pat) for each element
findall()	Compute list of all occurrences of pattern/regex for each string
match()	Call re.match on each element, returning matched groups as list
extract()	Call re.search on each element, returning DataFrame with one row for each element and one column for each regex capture group
extractall()	Call re.findall on each element, returning DataFrame with one row for each match and one column for each regex capture group
len()	Compute string lengths
strip()	Equivalent to str.strip
rstrip()	Equivalent to str.rstrip
lstrip()	Equivalent to str.lstrip
partition()	Equivalent to str.partition
rpartition()	Equivalent to str.rpartition
lower()	Equivalent to str.lower
casefold()	Equivalent to str.casefold
upper()	Equivalent to str.upper
find()	Equivalent to str.find
rfind()	Equivalent to str.rfind
index()	Equivalent to str.index
rindex()	Equivalent to str.rindex
capitalize()	Equivalent to str.capitalize
swapcase()	Equivalent to str.swapcase
normalize()	Return Unicode normal form. Equivalent to unicodedata.normalize
translate()	Equivalent to str.translate
isalnum()	Equivalent to str.isalnum
isalpha()	Equivalent to str.isalpha
isdigit()	Equivalent to str.isdigit
isspace()	Equivalent to str.isspace
islower()	Equivalent to str.islower
isupper()	Equivalent to str.isupper
istitle()	Equivalent to str.istitle
isnumeric()	Equivalent to str.isnumeric
isdecimal()	Equivalent to str.isdecimal

This article has been included in http://www.flydean.com/06-python-pandas-text/
The most popular interpretation, the most profound dry goods, the most concise tutorial, and many tips you don't know are waiting for you to discover!
Welcome to pay attention to my official account: "programs those things", know the technology, know you better!

Pandas advanced tutorial: processing text data

Introduction

Create text DF

String methods

String operation of columns

Split and replace String

String connection

Use .str to index

extract

extractall

contains and match

String method summary

flydean

引用和评论

在stable diffussion中完美修复AI图片

python与nodejs哪个性能高

Anaconda安装教程以及Anaconda和pip配置国内镜像

如何减少跨团队交付摩擦？——基于 DevOps 与敏捷的最佳实践

Python 描述符

科学计算编程涉及到的技术栈简介

使用 chardet 判断文件编码需要注意的坑——过大的文件会导致高耗时