Python Regular Expressions (2)

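Some predefined character sets can be written more compactly with escape codes. re recognizes six of them, in three pairs: each pair is one letter in lowercase and uppercase, and the two cases have opposite meanings.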

\d : a digit
\D : a non-digit
\w : a word character (letter, digit, or underscore)
\W : a non-word character
\s : whitespace (tab, space, newline, etc.)
\S : non-whitespace
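
As a rough illustration of the six escape codes listed above, the sketch below runs each one over a made-up sample string. The string and the variable names are assumptions chosen only for demonstration.

```python
import re

# Hypothetical sample text: letters, digits, a space, a tab and an underscore.
sample = "ab 12\tcd_3"

digits = re.findall(r'\d', sample)            # each single digit
non_digit_runs = re.findall(r'\D+', sample)   # runs of non-digit characters
word_runs = re.findall(r'\w+', sample)        # runs of letters, digits, underscore
non_word_runs = re.findall(r'\W+', sample)    # runs of everything else
whitespace = re.findall(r'\s', sample)        # each whitespace character
non_space_runs = re.findall(r'\S+', sample)   # runs of non-whitespace

print(digits)       # ['1', '2', '3']
print(word_runs)    # ['ab', '12', 'cd_3']
print(whitespace)   # [' ', '\t']
```

Note that the underscore counts as a word character, so 'cd_3' comes back as a single \w+ run.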

To constrain where in the text a match may occur, use anchors, which are written much like escape codes.

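^ start of the string, or of each line in multiline mode
$ end of the string, or of each line in multiline mode
\A start of the string
\Z end of the string
\b empty string at the beginning or end of a word
\B empty string not at the beginning or end of a word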

import re
the_str = "This is some text -- with punctuation"
re.search(r'^\w+', the_str).group(0)       # 'This'
re.search(r'\A\w+', the_str).group(0)      # 'This'
re.search(r'\w+\S*$', the_str).group(0)    # 'punctuation'
re.search(r'\w+\S*\Z', the_str).group(0)   # 'punctuation'
re.search(r'\w*t\W*', the_str).group(0)    # 'text -- ' (includes the trailing space)
re.search(r'\bt\w+', the_str).group(0)     # 'text'
re.search(r'\Bt*\B', the_str).group(0)     # '' (empty match inside "This", because t* may match zero characters)
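
The difference between ^/$ and \A/\Z only shows up on multi-line text. The following is a minimal sketch, assuming a made-up two-line string and using the standard re.MULTILINE flag; the variable names are illustrative only.

```python
import re

# Hypothetical two-line sample text.
text = "first line\nsecond line"

# Without re.MULTILINE, ^ matches only at the very start of the string.
default_starts = re.findall(r'^\w+', text)

# With re.MULTILINE, ^ also matches right after each newline.
multiline_starts = re.findall(r'^\w+', text, re.MULTILINE)

# \A is not affected by re.MULTILINE: it still means "start of the string".
string_starts = re.findall(r'\A\w+', text, re.MULTILINE)

# The same contrast holds for $ and \Z at the end.
multiline_ends = re.findall(r'\w+$', text, re.MULTILINE)
string_ends = re.findall(r'\w+\Z', text, re.MULTILINE)

print(default_starts)     # ['first']
print(multiline_starts)   # ['first', 'second']
print(string_starts)      # ['first']
print(multiline_ends)     # ['line', 'line']
print(string_ends)        # ['line']
```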

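Groups let you break a match into parts. Put simply, each pair of parentheses () in a regular expression defines a group. Use group() to retrieve the text matched by one group: 0 is the text matched by the whole expression, and the groups are numbered from 1, left to right, one per pair of parentheses. groups() returns the matches of all groups as a tuple.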

import re  
the_str = "--aabb123bbaa"  
pattern = r'(\W+)([a-z]+)(\d+)(\D+)'  
match = re.search(pattern, the_str)    
match.groups()    # ('--', 'aabb', '123', 'bbaa') 
match.group(0)    # '--aabb123bbaa'  
match.group(1)    # '--'  
match.group(2)    # 'aabb'  
match.group(3)    # '123'  
match.group(4)    # 'bbaa'

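Python extends the grouping syntax so that each group can be given a name and then referenced by that name. The syntax is (?P<name>pattern). groupdict() returns a dictionary that maps each group name to its match.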

import re  
the_str = "--aabb123bbaa"  
pattern = r'(?P<not_al_and_num>\W+)(?P<al>[a-z]+)(?P<num>\d+)(?P<not_num>\D+)'  
match = re.search(pattern, the_str)    
match.groups()    # ('--', 'aabb', '123', 'bbaa')  
match.groupdict() # {'not_al_and_num': '--', 'not_num': 'bbaa', 'num': '123', 'al': 'aabb'}  
match.group(0)                    # '--aabb123bbaa'  
match.group(1)                    # '--'  
match.group(2)                    # 'aabb'  
match.group(3)                    # '123'  
match.group(4)                    # 'bbaa'   
match.group('not_al_and_num')     # '--'
match.group('al')                 # 'aabb'
match.group('num')                # '123'
match.group('not_num')            # 'bbaa'

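Take care when calling group(): it works only when there was a match. When nothing matches, search() returns None, and calling group() on None raises an AttributeError. Whenever a match is not guaranteed and you need its result, check the match object first.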
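
One way to do that check is sketched below. The helper name first_number and the sample inputs are hypothetical, introduced only for this illustration.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    match = re.search(r'\d+', text)
    if match is None:
        # No match: search() returned None, so calling group() would fail.
        return None
    return match.group(0)

print(first_number("abc123def"))   # 123
print(first_number("no digits"))   # None
```

Returning None here is just one design choice. Raising a descriptive exception or returning a default value would work as well, depending on what the caller expects.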

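re supports a number of flags, passed through the optional flags parameter that search(), compile(), and related functions accept. Flags enable special behavior such as case-insensitive matching or multi-line searching.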

import re  
the_str = "this Text"  
re.findall(r'\bt\w+', the_str)   # ['this']  
re.findall(r'\bt\w+', the_str, re.IGNORECASE) # ['this', 'Text']
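
Flags can also be passed to compile() and combined with the bitwise OR operator. The sketch below assumes a made-up three-line string and combines re.IGNORECASE with re.MULTILINE; the variable names are illustrative.

```python
import re

# Combine flags with | so that both apply to the compiled pattern.
pattern = re.compile(r'^t\w+', re.IGNORECASE | re.MULTILINE)

# Hypothetical sample text with one word per line.
text = "this\nText\nother"

# ^ matches at the start of every line, and t matches both 't' and 'T'.
matches = pattern.findall(text)
print(matches)   # ['this', 'Text']
```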

There are many more search options. See the documentation for details: http://docs.python.org/2/library/re.html#module-re

木头lbj
