In solitude, you can get everything except character. (Stendhal, "The Red and the Black")
Overview
The previous article, "python uses regular expressions to extract the value of a specific field from a json string", used the re module only briefly and did not cover its other methods. To get a more comprehensive understanding of re in Python, I am recording my own learning process here.
When crawling web pages, you need various tools to parse the page data, such as etree, BeautifulSoup, and scrapy, but the most powerful one is regular expressions. The following summarizes the methods of Python's re module.
Python provides support for regular expressions through the re module. The general steps to use re are:
- Use re.compile(regular expression) to compile the string form of the regular expression into a Pattern instance
- Use the Pattern instance to process the text and obtain the matching result (a Match instance)
- Use the Match instance to obtain information and perform other operations
A simple example:
# -*- coding: utf-8 -*-
import re
if __name__ == '__main__':
    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'hello')
    # Match the text with the Pattern; None is returned when nothing matches
    match = pattern.match('hello world!')
    if match:
        # Use the Match object to get group information
        print(match.group())  # Output: hello
Defining regular expressions with raw strings neatly avoids the escape-character problem.
A raw string is written as:
r''
In a raw string Python leaves backslashes untouched, so there is no need to double every backslash by hand, and the expressions are more intuitive to read.
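As a small illustration of the point above, the sketch below writes the same pattern twice, once as an ordinary string with doubled backslashes and once as a raw string (the r'' prefix). The sample text is made up for the example.

```python
import re

# Ordinary string: every regex backslash has to be doubled.
escaped = '\\d+\\.\\d+'
# Raw string: the pattern is written exactly as the regex engine sees it.
raw = r'\d+\.\d+'

same = (escaped == raw)
found = re.findall(raw, 'pi is 3.14, e is 2.72')

print(same)   # True
print(found)  # ['3.14', '2.72']
```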
1. Use re
re.compile(strPattern[, flag]):
This method is a factory method of the Pattern class, used to compile regular expressions in the form of strings into Pattern objects.
The first parameter: regular expression string
The second parameter (optional): the matching mode. Several modes can be combined with the bitwise OR operator '|' so that they take effect at the same time, for example re.I | re.M.
The optional values are as follows:
- re.I (re.IGNORECASE): Ignore case (the full name is in parentheses, the same below)
- re.M (re.MULTILINE): Multi-line mode, changes the behavior of '^' and '$'
- re.S (re.DOTALL): Dot-matches-all mode, changes the behavior of '.' so that it also matches newlines
- re.L (re.LOCALE): Makes the predefined character classes \w \W \b \B \s \S depend on the current locale (in Python 3 this flag is only allowed with bytes patterns)
- re.U (re.UNICODE): Makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode (already the default for str patterns in Python 3)
- re.X (re.VERBOSE): Verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added. The following two regular expressions are equivalent:
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
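To make the effect of the flags more concrete, here is a minimal sketch that combines re.I and re.M with '|' and compares re.S against the default behavior of '.'. The sample text is an assumption chosen for the example.

```python
import re

text = 'Hello\nhello world'

# No flags: '^' matches only at the very start, and case must match.
plain = re.findall(r'^hello', text)
# re.I ignores case; re.M lets '^' match at the start of every line.
combined = re.findall(r'^hello', text, re.I | re.M)
# By default '.' does not match a newline; re.S changes that.
dot_default = re.match(r'Hello.hello', text)
dot_all = re.match(r'Hello.hello', text, re.S)

print(plain)                 # []
print(combined)              # ['Hello', 'hello']
print(dot_default)           # None
print(repr(dot_all.group())) # 'Hello\nhello'
```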
re also provides many module-level functions that carry out regular expression operations. Each of them can be replaced by the corresponding method of a Pattern instance. Their only advantage is that you write one less line of re.compile(), but you also do not get a compiled Pattern object that you can keep and reuse. These functions are introduced together with the instance methods of the Pattern class. For example, the example above can be abbreviated as:
m = re.match(r'hello', 'hello world!')
print(m.group())  # Output: hello
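The trade-off between the two styles can be sketched as follows. The list of lines is hypothetical; both forms give the same result, and the compiled form simply keeps one Pattern object around for all calls.

```python
import re

lines = ['hello world', 'goodbye world', 'hello again']

# Compile once, then reuse the same Pattern object for every line.
pattern = re.compile(r'hello')
with_pattern = [bool(pattern.match(line)) for line in lines]

# The module-level function receives the pattern string on every call.
with_module = [bool(re.match(r'hello', line)) for line in lines]

print(with_pattern)  # [True, False, True]
print(with_module)   # [True, False, True]
```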
2. Use Pattern
A Pattern object is a compiled regular expression. Text can be matched and searched through the methods that Pattern provides.
A Pattern object cannot be instantiated directly; it must be obtained with re.compile().
2.1 Properties of the Pattern object
Pattern provides several readable attributes for obtaining information about the expression:
- pattern : The expression string used during compilation.
- flags : The matching mode used during compilation, as a number.
- groups : The number of groups in the expression.
- groupindex : A dictionary whose keys are the aliases of the named groups in the expression and whose values are the numbers of those groups. Groups without aliases are not included.
# -*- coding: utf-8 -*-
import re
if __name__ == '__main__':
    text = 'hello world'
    p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)
    print("p.pattern:", p.pattern)
    print("p.flags:", p.flags)
    print("p.groups:", p.groups)
    print("p.groupindex:", p.groupindex)
The output is as follows:
p.pattern: (\w+) (\w+)(?P<sign>.*)
p.flags: 48
p.groups: 3
p.groupindex: mappingproxy({'sign': 3})
2.2 Methods of the Pattern object
1. match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):
This method tries to match the pattern starting at index pos of string. If the pattern can be matched through to its end, a Match object is returned.
If the pattern fails to match along the way, or endpos is reached before the pattern is finished, None is returned.
The default values of pos and endpos are 0 and len(string).
re.match() cannot specify these two parameters. Its flags parameter specifies the matching mode used when the pattern is compiled.
Note: This method does not require a complete match. If characters remain in the string when the pattern ends, the match still counts as successful. For a complete match, add the boundary anchor '$' at the end of the expression.
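The behavior of pos, endpos, and partial matching can be checked with a short sketch like the one below. The sample strings are made up; Pattern.fullmatch() is not covered in this article but exists in Python 3.4 and later as another way to require a complete match.

```python
import re

pattern = re.compile(r'\d+')
text = 'ab123cd'

no_match = pattern.match(text)       # 'a' at index 0 is not a digit
from_pos = pattern.match(text, 2)    # start matching at index 2
cut_off = pattern.match(text, 2, 4)  # treat the text as if it ended at index 4

print(no_match)          # None
print(from_pos.group())  # 123
print(cut_off.group())   # 12

# match() succeeds even when characters remain after the pattern ends.
prefix_ok = re.match(r'\d+', '123abc')
# A trailing '$' requires the match to reach the end of the string.
anchored = re.match(r'\d+$', '123abc')
# Pattern.fullmatch() (Python 3.4+) also requires the whole string to match.
full = pattern.fullmatch('123abc')

print(prefix_ok.group())  # 123
print(anchored)           # None
print(full)               # None
```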
2. search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):
This method looks for a substring of string that the pattern can match.
It tries to match the pattern starting at index pos of string. If the pattern can be matched through to its end, a Match object is returned.
If it cannot match, pos is increased by 1 and the match is tried again. If there is still no match when pos reaches endpos, None is returned.
The default values of pos and endpos are 0 and len(string).
re.search() cannot specify these two parameters. Its flags parameter specifies the matching mode used when the pattern is compiled.
A simple example:
# -*- coding: utf-8 -*-
import re
if __name__ == '__main__':
    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'world')
    # Use search() to find a matching substring; None is returned when there is none
    # In this example match() would not succeed
    match = pattern.search('hello world!')
    if match:
        # Use the Match object to get group information
        print(match.group())  # Output: world
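Pay attention to the difference between the two methods: match() only tries the pattern at the starting position, while search() moves forward through the string until it finds a match.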
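A side-by-side sketch of the two methods on the same text makes the difference visible. The variable names are only for illustration.

```python
import re

pattern = re.compile(r'world')
text = 'hello world!'

m1 = pattern.match(text)     # anchored at index 0, so it fails
m2 = pattern.search(text)    # scans forward and finds 'world'
m3 = pattern.match(text, 6)  # match() succeeds when pos points at 'world'

print(m1)         # None
print(m2.span())  # (6, 11)
print(m3.span())  # (6, 11)
```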
3. split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):
Splits string wherever the pattern matches and returns the pieces as a list.
maxsplit specifies the maximum number of splits; if it is not specified, every match is used as a split point.
# -*- coding: utf-8 -*-
import re
if __name__ == '__main__':
    p = re.compile(r'\d+')
    # Split the string on the digits
    print(p.split('one1two2three3four4'))  # Output: ['one', 'two', 'three', 'four', '']
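The maxsplit parameter can be tried out with a sketch like this. The second call goes slightly beyond the text above: when the pattern contains a capturing group, split() also keeps the matched separators in the result.

```python
import re

text = 'one1two2three3four4'

# maxsplit=2: only the first two matches are used as split points.
limited = re.split(r'\d+', text, maxsplit=2)
# With a capturing group, the matched separators are kept in the list.
with_seps = re.split(r'(\d+)', text, maxsplit=2)

print(limited)    # ['one', 'two', 'three3four4']
print(with_seps)  # ['one', '1', 'two', '2', 'three3four4']
```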
4. findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):
Searches string and returns all matching substrings as a list.
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re
if __name__ == '__main__':
    p = re.compile(r'\d+')
    # Find all the numbers and return them as a list
    print(p.findall('one1two2three3four4'))  # Output: ['1', '2', '3', '4']
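One detail worth checking by hand is how findall() behaves when the pattern contains groups: the list then holds the group contents rather than the whole match. A minimal sketch, with sample text chosen for the example:

```python
import re

text = 'one1two2three3'

# No group: each item is the whole match.
whole = re.findall(r'[a-z]+\d', text)
# One group: each item is the text captured by that group.
one_group = re.findall(r'[a-z]+(\d)', text)
# Several groups: each item is a tuple of the captured texts.
two_groups = re.findall(r'([a-z]+)(\d)', text)

print(whole)       # ['one1', 'two2', 'three3']
print(one_group)   # ['1', '2', '3']
print(two_groups)  # [('one', '1'), ('two', '2'), ('three', '3')]
```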
5. finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):
Searches string and returns an iterator that visits each match result (a Match object) in order.
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re
if __name__ == '__main__':
    p = re.compile(r'\d+')
    # Get an iterator that visits each match result (Match object) in order
    for m in p.finditer('one1two2three3four4'):
        print(m.group())  # Output, one per line: 1 2 3 4
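Because finditer() yields Match objects rather than plain strings, position information is available for every match. The sketch below collects each matched text together with its span; the sample text is an assumption.

```python
import re

p = re.compile(r'\d+')
text = 'one1two22three333'

# Each item pairs the matched text with its (start, end) indexes.
positions = [(m.group(), m.span()) for m in p.finditer(text)]

print(positions)  # [('1', (3, 4)), ('22', (7, 9)), ('333', (14, 17))]
```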
6. sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):
Replaces every matched substring in string with repl and returns the resulting string.
When repl is a string, it can use \id, \g<id>, or \g<name> to reference a group. \0 is not a group reference; use \g<0> for the whole match.
When repl is a function, it should accept a single parameter (a Match object) and return the replacement string (group references inside the returned string are not expanded).
count specifies the maximum number of replacements; if it is not specified, all matches are replaced.
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re
if __name__ == '__main__':
    p = re.compile(r'(\w+) (\w+)')
    s = 'i say, hello world!'
    print(p.sub(r'\1 \2 hi', s))  # Output: i say hi, hello world hi!
    def func(m):
        return m.group(1).title() + ' ' + m.group(2).title()
    print(p.sub(func, s))  # Output: I Say, Hello World!
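The named-group reference and the count parameter can be sketched as follows. The group names first and second are hypothetical, introduced only for this example.

```python
import re

p = re.compile(r'(?P<first>\w+) (?P<second>\w+)')
s = 'i say, hello world!'

# \g<name> references a named group; here the two words are swapped.
swapped = p.sub(r'\g<second> \g<first>', s)
# \g<0> stands for the whole match.
wrapped = p.sub(r'[\g<0>]', s)
# count=1 replaces only the first match.
first_only = p.sub(r'\g<second> \g<first>', s, count=1)

print(swapped)     # say i, world hello!
print(wrapped)     # [i say], [hello world]!
print(first_only)  # say i, hello world!
```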
7. subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):
The difference between the subn() method and the sub() method is that the returned results are different:
The result returned by the subn() method is a tuple: (string after replacement, number of replacements)
The result returned by the sub() method is a string: the replaced string
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re
if __name__ == '__main__':
    p = re.compile(r'(\w+) (\w+)')
    s = 'i say, hello world!'
    print(p.subn(r'\1 \2 hi', s))  # Output: ('i say hi, hello world hi!', 2)
    def func(m):
        return m.group(1).title() + ' ' + m.group(2).title()
    print(p.subn(func, s))  # Output: ('I Say, Hello World!', 2)
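One possible use of the replacement count is to tell whether anything changed at all. A minimal sketch, assuming a whitespace-collapsing pattern chosen for illustration:

```python
import re

p = re.compile(r'\s+')

# Unpack the tuple into the new string and the number of replacements.
result, n = p.subn(' ', 'a  b   c')
# When nothing matches, the string is returned unchanged with a count of 0.
unchanged, zero = p.subn(' ', 'abc')

print(result, n)        # a b c 2
print(unchanged, zero)  # abc 0
```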
3. Use Match
The Match object is the result of a match and contains a lot of information about the match. You can use the readable attributes or methods provided by Match to obtain this information.
3.1 Properties of the Match object
- string : The text used for matching.
- re : Pattern object used for matching.
- pos : The index in the text at which the regular expression starts searching. The value is the same as the parameter of the same name of Pattern.match() and Pattern.search().
- endpos : The index in the text at which the regular expression stops searching. The value is the same as the parameter of the same name of Pattern.match() and Pattern.search().
- lastindex : The index of the last captured group. If no group was captured, it is None.
- lastgroup : The alias of the last captured group. If that group has no alias or no group was captured, it is None.
# -*- coding: utf-8 -*-
import re
if __name__ == '__main__':
    text = 'hello world'
    p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)
    match = p.match(text)
    if match:
        print("match.re:", match.re)
        print("match.string:", match.string)
        print("match.endpos:", match.endpos)
        print("match.pos:", match.pos)
        print("match.lastgroup:", match.lastgroup)
        print("match.lastindex:", match.lastindex)
# The output is as follows:
# match.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)', re.DOTALL)
# match.string: hello world
# match.endpos: 11
# match.pos: 0
# match.lastgroup: sign
# match.lastindex: 3
3.2 Methods of the Match object
1. group([group1, …]):
Returns the strings captured by one or more groups. When several arguments are given, the result is returned as a tuple.
group() accepts group numbers or aliases;
Number 0 stands for the entire matched substring;
When no argument is given, group(0) is returned;
Groups that did not capture anything return None;
2. groups([default]):
Returns the strings captured by all groups as a tuple, equivalent to calling group(1, 2, ..., last);
default is the value used in place of groups that did not capture anything, and it defaults to None;
3. groupdict([default]):
Returns a dictionary whose keys are the aliases of the named groups and whose values are the substrings those groups captured. Groups without aliases are not included. default has the same meaning as above.
4. start([group]):
Returns the start index in string of the substring captured by the given group (the index of the substring's first character). group defaults to 0.
5. end([group]):
Returns the end index in string of the substring captured by the given group (the index of the substring's last character + 1). group defaults to 0.
6. span([group]):
Returns (start(group), end(group)).
7. expand(template):
Substitutes the matched groups into template and returns the result. template can use \id, \g<id>, or \g<name> to reference a group, but \0 cannot be used. \id and \g<id> are equivalent, but \10 is treated as the tenth group; to express \1 followed by the character '0', use \g<1>0.
# -*- coding: utf-8 -*-
import re
if __name__ == '__main__':
    m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')
    print("m.group(0,1,2,3):", m.group(0, 1, 2, 3))
    print("m.groups():", m.groups())
    print("m.groupdict():", m.groupdict())
    print("m.start(2):", m.start(2))
    print("m.end(2):", m.end(2))
    print("m.span(2):", m.span(2))
    print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))
# Output:
# m.group(0,1,2,3): ('hello world!', 'hello', 'world', '!')
# m.groups(): ('hello', 'world', '!')
# m.groupdict(): {'sign': '!'}
# m.start(2): 6
# m.end(2): 11
# m.span(2): (6, 11)
# m.expand(r'\2 \1\3'): world hello!
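The default parameter and the behavior of groups that capture nothing are easier to see with an optional group. The sketch below uses a made-up pattern whose second group (named second for illustration) does not take part in the match, and ends with the \g<1>0 form of expand().

```python
import re

# The second group is optional, so here it captures nothing.
m = re.match(r'(\w+)(?: (?P<second>\w+))?', 'hello')

pair = m.group(1, 2)           # a group that captured nothing gives None
filled = m.groups('n/a')       # default replaces the None entries
named = m.groupdict('n/a')     # same idea for named groups
missing_span = m.span(2)       # (-1, -1) when the group did not participate

# \g<1>0 means group 1 followed by a literal '0', not group 10.
expanded = re.match(r'(\d)', '7').expand(r'\g<1>0')

print(pair)          # ('hello', None)
print(filled)        # ('hello', 'n/a')
print(named)         # {'second': 'n/a'}
print(missing_span)  # (-1, -1)
print(expanded)      # 70
```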
Reference article
https://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html