In solitude, you can get everything except character. (Stendhal, "The Red and the Black")

Overview

The previous article, on using regular expressions in Python to extract the value of a specific field from a JSON string, made only simple use of the re module and did not cover its other methods. To understand and use re in Python more thoroughly, I am recording my own learning process here.

When crawling web page data, you need various tools to parse the data in the page, such as etree, BeautifulSoup, and scrapy, but the most powerful one is regular expressions. The following summarizes the methods of Python's re module.

Python supports regular expressions through the re module. The general steps for using re are:

  1. Use re.compile(regular expression) to compile the string form of the regular expression into a Pattern instance
  2. Use the Pattern instance to process the text and obtain the matching result (a Match instance)
  3. Use the Match instance to obtain information and perform other operations

A simple example:

# -*- coding: utf-8 -*-
import re

if __name__ == '__main__':
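    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'hello')

    # Use the Pattern to match the text and get the result; None is returned if it cannot match
    match = pattern.match('hello world!')

    if match:
        # Use the Match to get group information
        print(match.group()) # Output: hello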
        

Defining regular expressions with raw strings neatly solves the problem of escape characters.

A raw string is written as: r''

In a raw string, Python does not treat the backslash as an escape character, so there is no need to double every backslash by hand, and the expressions are more intuitive to read.
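
As a small illustration of the point above, the sketch below compiles the same expression once as an ordinary string and once as a raw string. The sample text and the variable names are made up for this example.

```python
import re

# The text contains three characters: a, a backslash, and d.
text = 'a\\d'

# Ordinary string: each backslash is doubled once for the Python literal
# and once more for the regular expression engine.
plain = re.compile('\\\\d')

# Raw string: the backslashes reach the regular expression engine unchanged.
raw = re.compile(r'\\d')

# Both literals produce the same three-character expression.
print(plain.pattern == raw.pattern)  # True

found = raw.search(text).group()
print(found)  # \d
```

Both patterns match a literal backslash followed by the letter d; the raw form is simply easier to read.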

1. Use re

re.compile(strPattern[, flag]):

This method is a factory method of the Pattern class, used to compile regular expressions in the form of strings into Pattern objects.

The first parameter: regular expression string

The second parameter (optional): the matching mode. Values can be combined with the bitwise OR operator '|' so that several take effect at the same time, such as re.I | re.M.

The optional values are as follows:

  • re.I (re.IGNORECASE) : Ignore case (the full name is given in parentheses, likewise below)
  • re.M (re.MULTILINE) : Multi-line mode; changes the behavior of '^' and '$' so that they also match at the start and end of each line
  • re.S (re.DOTALL) : Dot-matches-all mode; changes the behavior of '.' so that it also matches a newline
  • re.L (re.LOCALE) : Makes the predefined character classes \w \W \b \B and case-insensitive matching depend on the current locale setting; in Python 3 it can only be used with bytes patterns
  • re.U (re.UNICODE) : Makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode; in Python 3 this is already the default for str patterns
  • re.X (re.VERBOSE) : Verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added. The following two regular expressions are equivalent:

    a = re.compile(r"""\d +  # the integral part
                       \.    # the decimal point
                       \d *  # some fractional digits""", re.X)
    b = re.compile(r"\d+\.\d*")
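
To see how the matching modes change a result, here is a minimal sketch that combines re.I and re.M and then uses re.S on its own. The sample text and the variable names are invented for illustration.

```python
import re

text = 'Hello\nhello world'

# re.I ignores case; re.M lets '^' match at the start of every line.
both_lines = re.findall(r'^hello', text, re.I | re.M)
print(both_lines)  # ['Hello', 'hello']

# Without re.M, '^' only matches at the very start of the string.
first_line = re.findall(r'^hello', text, re.I)
print(first_line)  # ['Hello']

# By default '.' does not match a newline; re.S changes that.
no_dotall = re.match(r'Hello.hello', text)
with_dotall = re.match(r'Hello.hello', text, re.S)
print(no_dotall)                               # None
print(with_dotall.group() == 'Hello\nhello')   # True
```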

The re module also provides many module-level functions that perform regular expression operations. Each of them can be replaced by the corresponding method of a Pattern instance. Their only advantage is that you write one less line (the re.compile() call), but you also do not get a compiled Pattern object to reuse. These functions are introduced together with the instance methods in the section on the Pattern class. For instance, the example above can be shortened to:

m = re.match(r'hello', 'hello world!')
print(m.group())

2. Use Pattern

A Pattern object is a compiled regular expression. Text can be matched and searched through the methods that Pattern provides.

A Pattern object cannot be instantiated directly; it must be obtained with re.compile().

2.1 Properties of the Pattern object

Pattern provides several readable attributes for obtaining information about expressions:

  1. pattern : The expression string used during compilation.
  2. flags : The matching mode used during compilation, as a number.
  3. groups : The number of groups in the expression.
  4. groupindex : A dictionary whose keys are the aliases of the groups that have one and whose values are the corresponding group numbers. Groups without aliases are not included.

# -*- coding: utf-8 -*-
import re

if __name__ == '__main__':
    text = 'hello world'
    p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)

    print("p.pattern:", p.pattern)
    print("p.flags:", p.flags)
    print("p.groups:", p.groups)
    print("p.groupindex:", p.groupindex)

The output is as follows:

p.pattern: (\w+) (\w+)(?P<sign>.*)
p.flags: 48
p.groups: 3
p.groupindex: mappingproxy({'sign': 3})

2.2 Methods of the Pattern object

1. match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):

Tries to match the pattern starting at index pos of string. If the pattern matches there, the corresponding Match object is returned.

If the pattern fails to match at some point, or endpos is reached before the pattern has been fully matched, None is returned.

The default values of pos and endpos are 0 and len(string), respectively.

re.match() cannot specify these two parameters. Its flags parameter specifies the matching mode used when pattern is compiled.

Note: This method does not require the whole string to match. If characters remain in string after the pattern has finished matching, the match still counts as a success. To require a complete match, add the boundary character '$' to the end of the expression.
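
The behavior described above can be sketched as follows. The sample strings and variable names are made up. Pattern.fullmatch() is not covered in this article; it is included here only as another way to require that the whole string matches.

```python
import re

p = re.compile(r'\d+')

# match() only tries at the start position; trailing text is allowed.
prefix = p.match('123abc')
print(prefix.group())        # 123
print(p.match('abc123'))     # None

# pos moves the start position (available on Pattern methods only).
from_pos = p.match('abc123', 3)
print(from_pos.group())      # 123

# endpos limits how far the match may extend.
limited = p.match('123456', 0, 2)
print(limited.group())       # 12

# Two ways to require that the whole string matches.
with_dollar = re.match(r'\d+$', '123abc')
with_fullmatch = p.fullmatch('123abc')
whole = p.fullmatch('123')
print(with_dollar, with_fullmatch)  # None None
print(whole.group())                # 123
```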

2. search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):

This method looks for a substring of string that the pattern can match.

It tries to match the pattern starting at index pos of string. If the pattern can be matched completely, a Match object is returned.

If it cannot, pos is increased by 1 and the match is tried again. If no match has been found by the time pos equals endpos, None is returned.

The default values of pos and endpos are 0 and len(string), respectively.

re.search() cannot specify these two parameters. Its flags parameter specifies the matching mode used when pattern is compiled.

A simple example:

# -*- coding: utf-8 -*-
import re

if __name__ == '__main__':
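    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'world')

    # Use search() to find a matching substring; None is returned if no substring matches
    # match() would not succeed in this example
    match = pattern.search('hello world!')

    if match:
        # Use the Match to get group information
        print(match.group()) # Output: world

Note the difference between match() and search(): match() only tries to match at the start position, while search() scans the string for the first position where the pattern matches.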
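
A side-by-side sketch of the two methods on the same text may make the difference concrete. The variable names are made up for this example.

```python
import re

pattern = re.compile(r'world')
text = 'hello world!'

# match() fails because the text does not start with 'world'.
by_match = pattern.match(text)
print(by_match)  # None

# search() scans forward and finds the substring at index 6.
by_search = pattern.search(text)
print(by_search.group(), by_search.span())  # world (6, 11)

# Anchoring the expression with '^' makes search() behave like match() here.
anchored = re.search(r'^world', text)
print(anchored)  # None
```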

3. split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):

Splits string wherever the pattern matches and returns the pieces as a list.

maxsplit specifies the maximum number of splits; if it is not specified, the string is split at every match.

# -*- coding: utf-8 -*-
import re

if __name__ == '__main__':
    p = re.compile(r'\d+')
    # Split the string on digits
    print(p.split('one1two2three3four4')) # Output: ['one', 'two', 'three', 'four', '']
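
The example above does not use maxsplit, so here is a short sketch of it. The last call additionally shows that a capturing group in the pattern keeps the separators in the result, a behavior of split() that the text above does not mention. Variable names are made up.

```python
import re

p = re.compile(r'\d+')
s = 'one1two2three3four4'

# Stop after two splits; the rest of the string is returned as the last item.
limited = p.split(s, maxsplit=2)
print(limited)  # ['one', 'two', 'three3four4']

# With a capturing group, the matched separators are kept in the list.
with_separators = re.split(r'(\d+)', s)
print(with_separators)
# ['one', '1', 'two', '2', 'three', '3', 'four', '4', '']
```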

4. findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):

Searches string and returns all matching substrings as a list.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re

if __name__ == '__main__':
    p = re.compile(r'\d+')
    # Find all the numbers and return them as a list
    print(p.findall('one1two2three3four4')) # Output: ['1', '2', '3', '4']
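
One detail the description above leaves out: when the pattern contains groups, findall() returns the captured groups rather than the whole match. A minimal sketch with a made-up sample string:

```python
import re

s = 'one1two2three3'

# No groups: each item is the whole matched substring.
whole = re.findall(r'[a-z]+\d', s)
print(whole)  # ['one1', 'two2', 'three3']

# One group: each item is the text captured by that group.
one_group = re.findall(r'[a-z]+(\d)', s)
print(one_group)  # ['1', '2', '3']

# Several groups: each item is a tuple of the captured texts.
two_groups = re.findall(r'([a-z]+)(\d)', s)
print(two_groups)  # [('one', '1'), ('two', '2'), ('three', '3')]
```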

5. finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):

Searches string and returns an iterator that yields each match result (a Match object) in order.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re

if __name__ == '__main__':
    p = re.compile(r'\d+')
    # finditer() returns an iterator that yields each match result (a Match object) in order
    for m in p.finditer('one1two2three3four4'):
        print(m.group())  # Output (one per line): 1 2 3 4
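
Because finditer() yields Match objects rather than plain strings, the position of each match is available as well. The sketch below, with a made-up sample string, collects the text and the indices of every match.

```python
import re

p = re.compile(r'\d+')
s = 'one1two22three333'

# Each Match object carries the matched text and its position in s.
positions = [(m.group(), m.start(), m.end()) for m in p.finditer(s)]
print(positions)  # [('1', 3, 4), ('22', 7, 9), ('333', 14, 17)]
```

This is the main reason to prefer finditer() over findall() when the location of a match matters.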

6. sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):

Replaces each matched substring in string with repl and returns the resulting string.
When repl is a string, it can reference groups with \id, \g<id>, or \g<name>. \0 cannot be used for the whole match; write \g<0> instead.
When repl is a function, it must accept a single argument (a Match object) and return the replacement string (group references in the returned string are not expanded).
count specifies the maximum number of replacements; if it is not specified, all matches are replaced.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re

if __name__ == '__main__':
    p = re.compile(r'(\w+) (\w+)')
    s = 'i say, hello world!'

    print(p.sub(r'\1 \2 hi', s))  # Output: i say hi, hello world hi!

    def func(m):
        return m.group(1).title() + ' ' + m.group(2).title()

    print(p.sub(func, s))  # Output: I Say, Hello World!
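
The example above uses only numbered references. The following sketch shows the \g<name> form, the \g<0> reference to the whole match, and the count parameter. The group aliases first and second are made up for this illustration.

```python
import re

p = re.compile(r'(?P<first>\w+) (?P<second>\w+)')
s = 'i say, hello world!'

# Reference groups by alias to swap each pair of words.
swapped = p.sub(r'\g<second> \g<first>', s)
print(swapped)  # say i, world hello!

# \g<0> stands for the whole matched substring.
bracketed = p.sub(r'[\g<0>]', s)
print(bracketed)  # [i say], [hello world]!

# count=1 replaces only the first match.
first_only = p.sub(r'\2 \1', s, count=1)
print(first_only)  # say i, hello world!
```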

7. subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):

The difference between the subn() method and the sub() method is that the returned results are different:

The result returned by the subn() method is a tuple: (string after replacement, number of replacements)

The result returned by the sub() method is a string: the replaced string

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re

if __name__ == '__main__':
    p = re.compile(r'(\w+) (\w+)')
    s = 'i say, hello world!'

    print(p.subn(r'\1 \2 hi', s))  # Output: ('i say hi, hello world hi!', 2)

    def func(m):
        return m.group(1).title() + ' ' + m.group(2).title()

    print(p.subn(func, s))  # Output: ('I Say, Hello World!', 2)

3. Use Match

The Match object is the result of a match and contains a lot of information about the match. You can use the readable attributes or methods provided by Match to obtain this information.

3.1 Properties of the Match object

  1. string : The text used for matching.
  2. re : The Pattern object used for matching.
  3. pos : The index in the text at which the regular expression starts searching. The value is the same as the parameter of the same name passed to Pattern.match() or Pattern.search().
  4. endpos : The index in the text at which the regular expression stops searching. The value is the same as the parameter of the same name passed to Pattern.match() or Pattern.search().
  5. lastindex : The index of the last captured group. If no group was captured, it is None.
  6. lastgroup : The alias of the last captured group. If that group has no alias, or no group was captured, it is None.

# -*- coding: utf-8 -*-
import re

if __name__ == '__main__':
    text = 'hello world'
    p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)
    match = p.match(text)
    if match:
        print("match.re:", match.re)
        print("match.string:", match.string)
        print("match.endpos:", match.endpos)
        print("match.pos:", match.pos)
        print("match.lastgroup:", match.lastgroup)
        print("match.lastindex:", match.lastindex)
        
        
# Output:
# match.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)', re.DOTALL)
# match.string: hello world
# match.endpos: 11
# match.pos: 0
# match.lastgroup: sign
# match.lastindex: 3

3.2 Methods of the Match object

1. group([group1, …]):

Returns one or more of the substrings captured by groups; when several arguments are given, the result is returned as a tuple.

The arguments to group() can be group numbers or aliases.

Number 0 stands for the entire matched substring.

When no argument is given, group(0) is returned.

A group that captured nothing returns None.

2. groups([default]):


Returns the strings captured by all groups as a tuple, equivalent to calling group(1, 2, …, last).

default is the value used in place of groups that captured nothing; it defaults to None.

3. groupdict([default]):

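Returns a dictionary whose keys are the aliases of the groups that have one and whose values are the substrings captured by those groups. Groups without aliases are not included. default has the same meaning as above.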
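
The default parameter only matters when a group takes no part in the match. A minimal sketch, using a made-up pattern whose second group (alias second) is optional:

```python
import re

# The second group is optional, so it may capture nothing.
p = re.compile(r'(\w+)(?: (?P<second>\w+))?')
m = p.match('hello')

print(m.group(2))          # None
print(m.groups())          # ('hello', None)
print(m.groups(''))        # ('hello', '')
print(m.groupdict())       # {'second': None}
print(m.groupdict('n/a'))  # {'second': 'n/a'}
```

Passing a default such as an empty string can save a None check when the results are joined or formatted later.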

4. start([group]):

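Returns the start index in string of the substring captured by the specified group (the index of the substring's first character). group defaults to 0.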

5. end([group]):


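Returns the end index in string of the substring captured by the specified group (the index of the substring's last character + 1). group defaults to 0.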

6. span([group]):


Returns (start(group), end(group)).

7. expand(template):

Substitutes the matched groups into template and returns the result. template can reference groups with \id, \g<id>, or \g<name>; \0 cannot be used. \id and \g<id> are equivalent, but \10 is treated as a reference to group 10. To write the character '0' directly after group 1, use \g<1>0.

# -*- coding: utf-8 -*-
import re

if __name__ == '__main__':
    m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')
    print("m.group(0,1,2,3):", m.group(0, 1, 2, 3))
    print("m.groups():", m.groups())
    print("m.groupdict():", m.groupdict())
    print("m.start(2):", m.start(2))
    print("m.end(2):", m.end(2))
    print("m.span(2):", m.span(2))
    print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))


# Output:
# m.group(0,1,2,3): ('hello world!', 'hello', 'world', '!')
# m.groups(): ('hello', 'world', '!')
# m.groupdict(): {'sign': '!'}
# m.start(2): 6
# m.end(2): 11
# m.span(2): (6, 11)
# m.expand(r'\2 \1\3'): world hello!
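
The \10 versus \g<1>0 distinction in expand() is easy to get wrong, so here is a minimal sketch of it. The pattern has only two groups, so a reference to group 10 is invalid; the variable names are made up.

```python
import re

m = re.match(r'(\w+) (\w+)', 'hello world')

# \g<1>0 means group 1 followed by the character '0'.
group_then_zero = m.expand(r'\g<1>0')
print(group_then_zero)  # hello0

# \10 is read as a reference to group 10, which this pattern does not have.
raised = False
try:
    m.expand(r'\10')
except re.error:
    raised = True
print(raised)  # True
```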

Reference article

Python official document

https://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html


惜鸟