Introduction
Before pandas 1.0, text data could only be stored with the object dtype. Pandas 1.0 added a dedicated data type for text, StringDtype. This article walks through how to work with text in pandas.
Creating text data
First, here is the common way to build a Series from text:
In [1]: pd.Series(['a', 'b', 'c'])
Out[1]:
0 a
1 b
2 c
dtype: object
If you want to use the new StringDtype, you can do this:
In [2]: pd.Series(['a', 'b', 'c'], dtype="string")
Out[2]:
0 a
1 b
2 c
dtype: string
In [3]: pd.Series(['a', 'b', 'c'], dtype=pd.StringDtype())
Out[3]:
0 a
1 b
2 c
dtype: string
Or use astype to convert:
In [4]: s = pd.Series(['a', 'b', 'c'])
In [5]: s
Out[5]:
0 a
1 b
2 c
dtype: object
In [6]: s.astype("string")
Out[6]:
0 a
1 b
2 c
dtype: string
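As a quick check of the three routes shown above, the following sketch builds the same Series each way and confirms that every result carries a StringDtype. The variable names are illustrative only.

```python
import pandas as pd

# Three ways to get a StringDtype Series
by_alias = pd.Series(['a', 'b', 'c'], dtype="string")
by_dtype = pd.Series(['a', 'b', 'c'], dtype=pd.StringDtype())
by_astype = pd.Series(['a', 'b', 'c']).astype("string")

for s in (by_alias, by_dtype, by_astype):
    # Each Series reports a StringDtype instance as its dtype
    print(isinstance(s.dtype, pd.StringDtype))
```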
String methods
Strings can be converted to lowercase or uppercase, and you can compute their lengths. Missing values stay missing and are shown as <NA>:
In [24]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
....: dtype="string")
....:
In [25]: s.str.lower()
Out[25]:
0 a
1 b
2 c
3 aaba
4 baca
5 <NA>
6 caba
7 dog
8 cat
dtype: string
In [26]: s.str.upper()
Out[26]:
0 A
1 B
2 C
3 AABA
4 BACA
5 <NA>
6 CABA
7 DOG
8 CAT
dtype: string
In [27]: s.str.len()
Out[27]:
0 1
1 1
2 1
3 4
4 4
5 <NA>
6 4
7 3
8 3
dtype: Int64
You can also strip whitespace from both ends, or from the left or right side only:
In [28]: idx = pd.Index([' jack', 'jill ', ' jesse ', 'frank'])
In [29]: idx.str.strip()
Out[29]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')
In [30]: idx.str.lstrip()
Out[30]: Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')
In [31]: idx.str.rstrip()
Out[31]: Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')
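Because each of these methods returns a new object, they can be chained. The sketch below, on made-up data, strips whitespace, lowercases the result, and then measures the lengths. The names raw, clean and lengths are illustrative only.

```python
import numpy as np
import pandas as pd

raw = pd.Series([' Jack', 'JILL ', np.nan, ' Jesse '], dtype="string")

# Strip surrounding whitespace, then normalise case; the missing row stays missing
clean = raw.str.strip().str.lower()
print(clean)

# Lengths are computed per element; the missing row has no length
lengths = clean.str.len()
print(lengths)
```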
String operations on column names
Column labels are stored in an Index of strings, so the same .str methods work on df.columns:
In [32]: df = pd.DataFrame(np.random.randn(3, 2),
....: columns=[' Column A ', ' Column B '], index=range(3))
....:
In [33]: df
Out[33]:
Column A Column B
0 0.469112 -0.282863
1 -1.509059 -1.135632
2 1.212112 -0.173215
In [34]: df.columns.str.strip()
Out[34]: Index(['Column A', 'Column B'], dtype='object')
In [35]: df.columns.str.lower()
Out[35]: Index([' column a ', ' column b '], dtype='object')
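A common use of string methods on column labels is cleaning them up in one pass. The following is a minimal sketch that assumes the same DataFrame layout as above and assigns the cleaned labels back to df.columns. The underscore convention is just one possible choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 2),
                  columns=[' Column A ', ' Column B '], index=range(3))

# Strip whitespace, lowercase, and replace inner spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_', regex=False)
print(df.columns.tolist())
```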
Splitting and replacing strings
split breaks each string into a list of substrings:
In [38]: s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")
In [39]: s2.str.split('_')
Out[39]:
0 [a, b, c]
1 [c, d, e]
2 <NA>
3 [f, g, h]
dtype: object
To access an element of the list produced by split, use get or [] indexing:
In [40]: s2.str.split('_').str.get(1)
Out[40]:
0 b
1 d
2 <NA>
3 g
dtype: object
In [41]: s2.str.split('_').str[1]
Out[41]:
0 b
1 d
2 <NA>
3 g
dtype: object
Use expand=True to expand the split array into multiple columns:
In [42]: s2.str.split('_', expand=True)
Out[42]:
0 1 2
0 a b c
1 c d e
2 <NA> <NA> <NA>
3 f g h
Use n to limit the number of splits. With n=1 each string is split at most once, which gives two columns:
In [43]: s2.str.split('_', expand=True, n=1)
Out[43]:
0 1
0 a b_c
1 c d_e
2 <NA> <NA>
3 f g_h
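The split options above can be tried side by side. This sketch reuses the s2 data from the session and adds rsplit, which works like split but starts from the right-hand end. The names second, left and right are illustrative only.

```python
import numpy as np
import pandas as pd

s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")

# Second element of each split list; the missing row stays missing
second = s2.str.split('_').str[1]

# n=1 performs at most one split from the left, giving two columns
left = s2.str.split('_', expand=True, n=1)

# rsplit does the same, but starts from the right-hand end
right = s2.str.rsplit('_', expand=True, n=1)

print(second, left, right, sep='\n')
```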
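replace substitutes matched text, and the pattern can be a regular expression. Since pandas 2.0 the regex parameter defaults to False, so pass regex=True when the pattern is a regular expression (s3 here is a Series of strings):
s3.str.replace('^.a|dog', 'XX-XX ', case=False, regex=True)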
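Since s3 is not defined in the snippet above, here is a self-contained sketch on a small, hypothetical Series. It passes regex=True explicitly so the pattern is treated as a regular expression, and contrasts that with a literal replacement using regex=False.

```python
import pandas as pd

# Hypothetical data standing in for s3
s3 = pd.Series(['Aaba', 'CABA', 'dog', 'cat'], dtype="string")

# '^.a' matches any first character followed by 'a'; 'dog' matches literally.
# case=False makes the match case-insensitive.
replaced = s3.str.replace('^.a|dog', 'XX-XX ', case=False, regex=True)
print(replaced.tolist())

# With regex=False the pattern is treated as a plain substring,
# so '.' replaces only a literal dot
literal = pd.Series(['a.b', 'axb'], dtype="string").str.replace('.', '-', regex=False)
print(literal.tolist())
```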
String concatenation
Use cat to concatenate strings:
In [64]: s = pd.Series(['a', 'b', 'c', 'd'], dtype="string")
In [65]: s.str.cat(sep=',')
Out[65]: 'a,b,c,d'
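cat has a few more modes worth knowing. The sketch below shows element-wise concatenation with a second list-like, and how missing values are handled with and without na_rep. The names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'], dtype="string")

# Collapse the whole Series into one string
joined = s.str.cat(sep=',')

# Element-wise concatenation with another list-like of the same length
paired = s.str.cat(['A', 'B', 'C', 'D'])

# Missing values are skipped unless na_rep supplies a placeholder
t = pd.Series(['a', 'b', np.nan, 'd'], dtype="string")
skipped = t.str.cat(sep=',')
filled = t.str.cat(sep=',', na_rep='-')

print(joined, paired.tolist(), skipped, filled)
```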
Indexing with .str
Indexing with .str[] returns a Series. When the Series holds strings, .str[i] gives the character at position i of each element. If a string is too short, the result for that row is <NA>. For example:
In [99]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,
....: 'CABA', 'dog', 'cat'],
....: dtype="string")
....:
In [100]: s.str[0]
Out[100]:
0 A
1 B
2 C
3 A
4 B
5 <NA>
6 C
7 d
8 c
dtype: string
In [101]: s.str[1]
Out[101]:
0 <NA>
1 <NA>
2 <NA>
3 a
4 a
5 <NA>
6 A
7 o
8 a
dtype: string
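The same accessor also accepts negative positions and slices. A minimal sketch on a shortened, hypothetical version of the Series above:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Aaba', 'B', np.nan, 'dog'], dtype="string")

first = s.str[0]    # first character of each string
last = s.str[-1]    # negative positions count from the end
head = s.str[:2]    # slices work too; short strings are returned as they are
third = s.str[2]    # 'B' has no position 2, so that row is missing

print(first, last, head, third, sep='\n')
```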
extract
extract pulls capture groups out of each string using a regular expression. It takes an expand parameter. With expand=False, extract returns a Series, an Index, or a DataFrame, depending on the input and on the number of capture groups. With expand=True, it always returns a DataFrame. The default was False before version 0.23 and has been True since.
The pattern passed to extract must contain at least one capture group.
In [102]: pd.Series(['a1', 'b2', 'c3'],
.....: dtype="string").str.extract(r'([ab])(\d)', expand=False)
.....:
Out[102]:
0 1
0 a 1
1 b 2
2 <NA> <NA>
The example above splits each string with a regular expression: the first group matches a letter and the second matches a digit. An element that does not match, such as 'c3', produces a row of <NA>.
Note that only the capture groups in the regular expression are extracted.
The following extracts only the digit:
In [106]: pd.Series(['a1', 'b2', 'c3'],
.....: dtype="string").str.extract(r'[ab](\d)', expand=False)
.....:
Out[106]:
0 1
1 2
2 <NA>
dtype: string
You can also name the resulting columns by using named groups:
In [103]: pd.Series(['a1', 'b2', 'c3'],
.....: dtype="string").str.extract(r'(?P<letter>[ab])(?P<digit>\d)',
.....: expand=False)
.....:
Out[103]:
letter digit
0 a 1
1 b 2
2 <NA> <NA>
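To make the effect of expand concrete, this sketch runs a one-group pattern both ways and then uses named groups. It reuses the data from the session above. The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(['a1', 'b2', 'c3'], dtype="string")

# One capture group with expand=False gives a Series
digits = s.str.extract(r'[ab](\d)', expand=False)

# The same pattern with expand=True gives a one-column DataFrame
digits_df = s.str.extract(r'[ab](\d)', expand=True)

# Named groups become column names
named = s.str.extract(r'(?P<letter>[ab])(?P<digit>\d)')

print(type(digits).__name__, type(digits_df).__name__, list(named.columns))
```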
extractall
extractall is similar to extract. The difference is that extract returns only the first match in each string, while extractall returns every match. For example:
In [112]: s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"],
.....: dtype="string")
.....:
In [113]: s
Out[113]:
A a1a2
B b1
C c1
dtype: string
In [114]: two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
In [115]: s.str.extract(two_groups, expand=True)
Out[115]:
letter digit
A a 1
B b 1
C c 1
After extract matches a1, it will not continue.
In [116]: s.str.extractall(two_groups)
Out[116]:
letter digit
match
A 0 a 1
1 a 2
B 0 b 1
C 0 c 1
After extractall matches a1, it will also match a2.
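The result of extractall is indexed by two levels: the original label and a match counter. The sketch below inspects that index and then selects match 0 with xs, which leaves only the first match for each row.

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'

all_matches = s.str.extractall(two_groups)

# Two-level index: the original label, then the match number
print(list(all_matches.index))

# Keep only the first match of each row
first_only = all_matches.xs(0, level='match')
print(first_only)
```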
contains and match
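contains, match, and fullmatch test whether each string matches a pattern. contains looks for the pattern anywhere in the string, match requires it at the start of the string, and fullmatch requires the whole string to match. In the examples below, pattern is a regular expression: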
In [127]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
.....: dtype="string").str.contains(pattern)
.....:
Out[127]:
0 False
1 False
2 True
3 True
4 True
5 True
dtype: boolean
In [128]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
.....: dtype="string").str.match(pattern)
.....:
Out[128]:
0 False
1 False
2 True
3 True
4 False
5 True
dtype: boolean
In [129]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
.....: dtype="string").str.fullmatch(pattern)
.....:
Out[129]:
0 False
1 False
2 True
3 True
4 False
5 False
dtype: boolean
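The session above does not show how pattern is defined. The outputs are consistent with a pattern of one digit followed by one lowercase letter, so the sketch below assumes r'[0-9][a-z]'. That value is an assumption, not something stated in the original session.

```python
import pandas as pd

# Assumed pattern: one digit followed by one lowercase letter
pattern = r'[0-9][a-z]'
s = pd.Series(['1', '2', '3a', '3b', '03c', '4dx'], dtype="string")

anywhere = s.str.contains(pattern)   # pattern found anywhere in the string
at_start = s.str.match(pattern)      # pattern must match at the start
whole = s.str.fullmatch(pattern)     # pattern must match the entire string

print(pd.DataFrame({'contains': anywhere, 'match': at_start, 'fullmatch': whole}))
```

'03c' passes only contains because the digit-letter pair starts at the second character, and '4dx' fails only fullmatch because of the trailing 'x'.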
String method summary
Finally, here is a summary of the string methods:
Method | Description |
---|---|
cat() | Concatenate strings |
split() | Split strings on delimiter |
rsplit() | Split strings on delimiter working from the end of the string |
get() | Index into each element (retrieve i-th element) |
join() | Join strings in each element of the Series with passed separator |
get_dummies() | Split strings on the delimiter returning DataFrame of dummy variables |
contains() | Return boolean array if each string contains pattern/regex |
replace() | Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence |
repeat() | Duplicate values (s.str.repeat(3) equivalent to x * 3) |
pad() | Add whitespace to left, right, or both sides of strings |
center() | Equivalent to str.center |
ljust() | Equivalent to str.ljust |
rjust() | Equivalent to str.rjust |
zfill() | Equivalent to str.zfill |
wrap() | Split long strings into lines with length less than a given width |
slice() | Slice each string in the Series |
slice_replace() | Replace slice in each string with passed value |
count() | Count occurrences of pattern |
startswith() | Equivalent to str.startswith(pat) for each element |
endswith() | Equivalent to str.endswith(pat) for each element |
findall() | Compute list of all occurrences of pattern/regex for each string |
match() | Call re.match on each element, returning a boolean |
extract() | Call re.search on each element, returning DataFrame with one row for each element and one column for each regex capture group |
extractall() | Call re.findall on each element, returning DataFrame with one row for each match and one column for each regex capture group |
len() | Compute string lengths |
strip() | Equivalent to str.strip |
rstrip() | Equivalent to str.rstrip |
lstrip() | Equivalent to str.lstrip |
partition() | Equivalent to str.partition |
rpartition() | Equivalent to str.rpartition |
lower() | Equivalent to str.lower |
casefold() | Equivalent to str.casefold |
upper() | Equivalent to str.upper |
find() | Equivalent to str.find |
rfind() | Equivalent to str.rfind |
index() | Equivalent to str.index |
rindex() | Equivalent to str.rindex |
capitalize() | Equivalent to str.capitalize |
swapcase() | Equivalent to str.swapcase |
normalize() | Return Unicode normal form. Equivalent to unicodedata.normalize |
translate() | Equivalent to str.translate |
isalnum() | Equivalent to str.isalnum |
isalpha() | Equivalent to str.isalpha |
isdigit() | Equivalent to str.isdigit |
isspace() | Equivalent to str.isspace |
islower() | Equivalent to str.islower |
isupper() | Equivalent to str.isupper |
istitle() | Equivalent to str.istitle |
isnumeric() | Equivalent to str.isnumeric |
isdecimal() | Equivalent to str.isdecimal |
This article has been included in http://www.flydean.com/06-python-pandas-text/