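A Hands-On Regular Expression Exercise

Background

The day before yesterday a friend of mine (really not me 🤣🤣) ran into this situation:

There is a block of HTML in which many tags carry inline styles. The requirement is to keep only a given set of properties (at the time: font-size, text-decoration and color) and remove everything else.

For example, <p style="display:block;color:#333">abc</p> should become <p style="color:#333">abc</p>

The first thing that came to mind was, of course, a regex replacement.
That said, rather than spending a lot of time on one complex regex, it is better to simplify the approach. More on that later, since there is still plenty to learn from writing the complex version.

If you are interested, try the challenge first: use a single regex replace call, with no (anonymous) function as the replacement, i.e. do it in one pass using only backreferences. Test cases are provided below.

Test cases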

<span style="color:#fff">only color</span>
<span style="text-decoration:underline;">only text-decoration</span>
<span style="font-size:20px">only font-size</span>

<span style="font-size:30px;color:rgba(22,33,44)">both font-size and color</span>
<span style="font-size:20px;text-decoration:underline">both font-size and text-decoration</span>

<p style="font-size:10px;text-decoration:underline;color:#fff">all</p>
<p style="font-size:10px;text-decoration:underline;color:#fff;">all and end with semi</p>

<a style="display:block;font-size:10px;text-decoration:underline;color:rgba(1,2,3,0.1);">mix</a>

<a style="display:block;position:fixed">all clear</a>

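Background knowledge

Some background needed for the solution. Unless stated otherwise, everything below uses JavaScript syntax.

Backreferences

When a regular expression contains parenthesized groups (ignoring the special forms covered below), the text matched by a group can be referenced again, either inside the expression itself or in the replacement string. Examples:

Backreferences inside the expression
Q: Write a regex that detects whether a string contains a consecutively repeated character (assume all characters are digits or letters); for example, abcc contains a repeated c.
A: /(\w)\1/
Explanation: \1 refers to whatever the first group matched.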
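As a quick check of the answer above, here is a minimal sketch; the variable name `hasRepeat` is only illustrative.

```javascript
// \1 must match exactly the text that group 1 just captured,
// so the pattern finds any word character followed by itself.
const hasRepeat = /(\w)\1/

console.log(hasRepeat.test('abcc'))      // true
console.log(hasRepeat.test('abcd'))      // false
console.log('abcc'.match(hasRepeat)[0])  // cc
```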

Backreferences in the replacement string
Q: Replace the string Today is 2020-04-17 with Today is 04/17/2020
A:

const text = 'Today is 2020-04-17'
const pattern = /(\d+)-(\d+)-(\d+)/
text.replace(pattern, '$2/$3/$1')

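Explanation: backreferences can also be used in the replacement string, again referring to the matched groups.

Sub-groups are numbered from 1 upward. Index 0 stands for the complete match in engines such as PHP ($0); in a JavaScript replacement string the complete match is written $& instead.

How group numbers are assigned, and more complex cases, are not covered here; see the reference articles for more.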
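In JavaScript replacement strings, the complete match is inserted with `$&`, while `$0` has no special meaning and is kept literally. The sketch below reuses the date example; the variable names are illustrative.

```javascript
const dateText = 'Today is 2020-04-17'
const datePattern = /(\d+)-(\d+)-(\d+)/

// $& inserts the whole match; $1..$3 insert the sub-groups.
const wrapped = dateText.replace(datePattern, '[$&]')
const swapped = dateText.replace(datePattern, '$2/$3/$1')
// $0 is not a replacement pattern in JavaScript, so it stays as text.
const literal = dateText.replace(datePattern, '$0')

console.log(wrapped) // Today is [2020-04-17]
console.log(swapped) // Today is 04/17/2020
console.log(literal) // Today is $0
```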

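Non-capturing groups

Use ?: to mark a group as non-capturing.
Use it when () is needed for grouping but the group should not become a sub-group (its text is still part of the complete match). This keeps group numbers under control.

Q: Add Beijing time zone information to a time in HH:mm:ss format and replace the hour with 15; for example, Now is 07:00:02 becomes Now is T15:00:02 +08:00:00
A: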

const text = 'Now is 07:00:02'
const pattern = /(?:[01]\d|2[0-3])(:[0-5]\d)(:[0-5]\d)/
text.replace(pattern, 'T15$1$2 +08:00:00')

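Explanation: the hour is matched by ([01]\d|2[0-3]) and the minutes and seconds by (:[0-5]\d).
All of these () are necessary here. Making the hour group non-capturing keeps the minute and second groups numbered 1 and 2 rather than 2 and 3.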
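To make the numbering difference visible, the following sketch compares the capturing and non-capturing variants on the same input. The names `capturing` and `nonCapturing` are only illustrative.

```javascript
const timeText = 'Now is 07:00:02'

// Capturing hour group: minutes and seconds become groups 2 and 3.
const capturing = /([01]\d|2[0-3])(:[0-5]\d)(:[0-5]\d)/
// Non-capturing hour group: minutes and seconds are groups 1 and 2.
const nonCapturing = /(?:[01]\d|2[0-3])(:[0-5]\d)(:[0-5]\d)/

const capturedGroups = timeText.match(capturing).slice(1)
const nonCapturedGroups = timeText.match(nonCapturing).slice(1)
const withZone = timeText.replace(nonCapturing, 'T15$1$2 +08:00:00')

console.log(capturedGroups)    // [ '07', ':00', ':02' ]
console.log(nonCapturedGroups) // [ ':00', ':02' ]
console.log(withZone)          // Now is T15:00:02 +08:00:00
```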

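Zero-width lookahead assertions

Zero-width means the assertion consumes no characters: the text it checks does not become part of the complete match, and the assertion itself is not a numbered group (capturing groups placed inside a positive assertion still capture, which the unordered-matching trick below relies on). The same applies to the assertions that follow.

A lookahead is written ?=. For example, x(?=y) matches x only when it is immediately followed by y.

Q: Write a regex that extracts the numbers standing before a currency symbol in 123% 1234$ 5555& 432￥
A: /\d+(?=[\$￥])/
Explanation: the lookahead ensures that only numbers directly before $ or ￥ match.
Without the lookahead, i.e. /\d+[\$￥]/, the currency symbol would become part of the result, which the task does not want.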
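The difference between the two patterns can be seen side by side in this small sketch (variable names are illustrative):

```javascript
const prices = '123% 1234$ 5555& 432￥'

// With the lookahead the currency symbol is checked but not consumed.
const withLookahead = prices.match(/\d+(?=[\$￥])/g)
// Without it the symbol becomes part of each match.
const withoutLookahead = prices.match(/\d+[\$￥]/g)

console.log(withLookahead)    // [ '1234', '432' ]
console.log(withoutLookahead) // [ '1234$', '432￥' ]
```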

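Zero-width negative lookahead assertions

A negative lookahead is written ?!. For example, x(?!y) matches x only when it is not immediately followed by y.

Q: Write a regex that extracts the numbers not followed by % in This bag has 50% discount, but still need $300
A: /\d+(?![%\d])/
Explanation: negating only %, i.e. /\d+(?!%)/, also matches 5, because the engine backtracks from 50 to 5, and 5 is followed by 0 rather than %. Excluding a following digit as well forces a run of digits to be treated as a whole.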
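The backtracking effect described above is easy to reproduce. A minimal sketch, with illustrative variable names:

```javascript
const ad = 'This bag has 50% discount, but still need $300'

// Only % is excluded: the engine backtracks from "50" to "5".
const loose = ad.match(/\d+(?!%)/g)
// A following digit is excluded too, so "50" cannot be split.
const strict = ad.match(/\d+(?![%\d])/g)

console.log(loose)  // [ '5', '300' ]
console.log(strict) // [ '300' ]
```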

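Zero-width lookbehind assertions

A lookbehind is written ?<=. For example, (?<=y)x matches x only when it is immediately preceded by y (note that some browsers may not support it).

Q: Write a regex that extracts the content of href in <a class="link" href="https://segmentfault.com">Segmentfault</a>
A: /(?<=href=").*?(?=")/
Explanation: a lookahead and a lookbehind are used together; they pin down both ends without being included in the result.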
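A short sketch of the answer in use. For engines without lookbehind support, a plain capturing group gives the same value; that fallback is my own addition, not part of the original answer.

```javascript
const link = '<a class="link" href="https://segmentfault.com">Segmentfault</a>'

// Lookbehind and lookahead: the whole match is the href value.
const viaAssertions = link.match(/(?<=href=").*?(?=")/)[0]
// Fallback without lookbehind: read the value from group 1.
const viaGroup = link.match(/href="(.*?)"/)[1]

console.log(viaAssertions) // https://segmentfault.com
console.log(viaGroup)      // https://segmentfault.com
```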

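Unordered matching

Motivation
Start with a common example: extracting the src value from an img tag. It is usually written as
/<img.*?src="(.*?)".*?\/>/
Since only one attribute is needed, a non-greedy match of any characters on both sides of it is enough.
Raising the difficulty
Now suppose both src and class must be extracted (assume both are always present). This is the question from an old forum thread, which is also where the answer comes from.
Take the string <img class="image logo" name="logo" id="logo" src="logo.png" /> (assume the quotes are always ")
Looking only at this one test case and building on the previous example, it seems it could be written like this
/<img.*?class="(.*?)".*?src="(.*?)".*?\/>/
Let's test it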
```
const text = '<img class="image logo" name="logo" id="logo" src="logo.png" />'
const pattern = /<img.*?class="(.*?)".*?src="(.*?)".*?\/>/
text.match(pattern)
```
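It does work, but there is a problem.
What if src and class appear in the opposite order? Like this
<img src="logo.png" name="logo" id="logo" class="image logo" />
The match now fails, because this regex has an obvious built-in order: it only matches when class comes before src, whereas in practice the order is not fixed.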
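The failure can be reproduced with the two tags from the text. This is a minimal sketch; the variable names are illustrative.

```javascript
const ordered = /<img.*?class="(.*?)".*?src="(.*?)".*?\/>/

const classFirst = '<img class="image logo" name="logo" id="logo" src="logo.png" />'
const srcFirst = '<img src="logo.png" name="logo" id="logo" class="image logo" />'

const okResult = classFirst.match(ordered)
const failResult = srcFirst.match(ordered)

console.log(okResult.slice(1)) // [ 'image logo', 'logo.png' ]
console.log(failResult)        // null
```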
The 4th reply in the original thread leads to this answer (modified here)
/<img(?=(?:(?!class=).)*class="(.*?)")(?:(?!src=).)*src="(.*?)".*?\/>/
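Analysis

First, the gray parts at both ends (<img and .*?\/>) act as boundaries, and ② (class="(.*?)") and ④ (src="(.*?)") are written as before, so they can be skipped.
Now the green part, the lookahead (?=(?:(?!class=).)*class="(.*?)"). Ignoring the ?: in ①, it uses ((?!class=).)* where .*? was used before.
Since class is assumed to be always present here (that is, only that case needs to match; the same applies below), the two are equivalent.
If class may be absent, ① must use the former, i.e. (?=(?:(?!class=).)*(?:class="(.*?)")?)
With .*? the overall match still succeeds, but the class content never ends up in the group. My understanding is that, because the trailing group is optional, the lazy .*? can stop right away with the group matching nothing, so it is as if the group were not there.
The green part is wrapped in ?= as a whole. My understanding is that the green part consumes no characters, so even if what it matches lies after the position where the rest of the pattern matches, the rest is unaffected (the first attempt failed precisely because it did affect what followed). At the same time, groups inside ?= still capture, which is what makes the match unordered. Here is a diagram analyzing this regex. (The point is to understand it...)
The blue part, (?:(?!src=).)*src="(.*?)", is really the same construction. Since src is assumed to be present and is the last attribute to look for, the ?= can be dropped and the part written inline.
In fact, since both src and class are assumed to be present here, the equivalence noted above for the green part lets the pattern be simplified to
/<img(?=.*?class="(.*?)").*?src="(.*?)".*?\/>/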
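As a check that both the thread's pattern and the simplified one are order-independent, the sketch below runs them against the two attribute orders. The names `tempered` and `simplified` are only illustrative labels.

```javascript
const tempered = /<img(?=(?:(?!class=).)*class="(.*?)")(?:(?!src=).)*src="(.*?)".*?\/>/
const simplified = /<img(?=.*?class="(.*?)").*?src="(.*?)".*?\/>/

const tagClassFirst = '<img class="image logo" name="logo" id="logo" src="logo.png" />'
const tagSrcFirst = '<img src="logo.png" name="logo" id="logo" class="image logo" />'

// Group 1 is always class and group 2 is always src,
// regardless of the order in which they appear in the tag.
const unorderedResults = []
for (const re of [tempered, simplified]) {
  for (const tag of [tagClassFirst, tagSrcFirst]) {
    unorderedResults.push(tag.match(re).slice(1))
  }
}
console.log(unorderedResults) // four times [ 'image logo', 'logo.png' ]
```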

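Summary
General forms for unordered matching (target is the expression to match, or part of it):
- target always present: (?=.*?target)
- target may be absent: (?=(?:(?!target).)*(?:target)?)

In addition, the ?= can be omitted for the last target.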
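The reason the optional case needs the `(?:(?!target).)*` prefix rather than `.*?` shows up clearly in a small experiment. This is a sketch with illustrative names, using class as the target.

```javascript
const tagWithClass = '<img src="logo.png" class="image logo" />'
const tagWithoutClass = '<img src="a.png" />'

// Lazy dot + optional target: the lookahead succeeds immediately
// with the optional group matching nothing, so group 1 stays unset.
const lazyOptional = /<img(?=.*?(?:class="(.*?)")?)/
// Tempered dot + optional target: the scan first runs up to `class=`,
// so the optional group gets a real chance to match.
const temperedOptional = /<img(?=(?:(?!class=).)*(?:class="(.*?)")?)/

const lazyGroup = tagWithClass.match(lazyOptional)[1]
const temperedGroup = tagWithClass.match(temperedOptional)[1]
const absentGroup = tagWithoutClass.match(temperedOptional)[1]

console.log(lazyGroup)     // undefined
console.log(temperedGroup) // image logo
console.log(absentGroup)   // undefined
```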

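Solving the problem

Once the background above is understood, the problem itself is straightforward.

First, inline styles always sit inside style, so the boundaries are (?<=style=") and .*?(?=") (or, without assertions, simply style=" and .*?").
Simplify the CSS values to handle (add more possibilities as needed):
- font-size: 20px, i.e. \d+px
- color: rgb(1,2,3), rgba(1,2,3,0), #ffffff, red, i.e. (?:(?:rgba?\(.*?\))|(?:#?\w+))
- text-decoration: underline, i.e. \w+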
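To see what these simplified value patterns accept, the sketch below anchors each one with ^ and $ and tries a few values. The anchoring and the variable names are my own additions for testing; the final regex uses the patterns unanchored.

```javascript
const fontSizeValue = /^\d+px$/
const colorValue = /^(?:(?:rgba?\(.*?\))|(?:#?\w+))$/
const decorationValue = /^\w+$/

const colorSamples = ['rgb(1,2,3)', 'rgba(1,2,3,0)', '#ffffff', 'red']
const colorChecks = colorSamples.map(v => colorValue.test(v))

console.log(colorChecks)                          // [ true, true, true, true ]
console.log(fontSizeValue.test('20px'))           // true
console.log(fontSizeValue.test('1.5em'))          // false: only integer px is covered
console.log(decorationValue.test('underline'))    // true
console.log(decorationValue.test('line-through')) // false: \w does not match "-"
```

The two false results are examples of the "add more possibilities as needed" caveat: values outside the simplified forms would need wider patterns.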

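Because the CSS properties may be absent and can appear in any order, the second form of unordered matching is used:

(?=(?:(?!text\-decoration:).)*(text\-decoration:\w+)?)
(?=(?:(?!font\-size:).)*(font\-size:\d+px)?)
(?:(?!color:).)*(color:(?:(?:rgba?\(.*?\))|(?:#?\w+)))?

Put together, they form the answer. Here is the analysis diagram (the lookbehind was removed in the diagram because it is not supported there).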

const text = `
<span style="color:#fff">only color</span>
<span style="text-decoration:underline;">only text-decoration</span>
<span style="font-size:20px">only font-size</span>

<span style="font-size:30px;color:rgba(22,33,44)">both font-size and color</span>
<span style="font-size:20px;text-decoration:underline">both font-size and text-decoration</span>

<p style="font-size:10px;text-decoration:underline;color:#fff">all</p>
<p style="font-size:10px;text-decoration:underline;color:#fff;">all and end with semi</p>

<a style="display:block;font-size:10px;text-decoration:underline;color:rgba(1,2,3,0.1);">mix</a>

<a style="display:block;position:fixed">all clear</a>
`

const pattern = /(?<=style=")(?=(?:(?!text\-decoration:).)*(text\-decoration:\w+)?)(?=(?:(?!font\-size:).)*(font\-size:\d+px)?)(?:(?!color:).)*(color:(?:(?:rgba?\(.*?\))|(?:#?\w+)))?.*?(?=")/g
text.replace(pattern, '$1;$2;$3')
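For reference, here is what the replacement produces on three of the test cases when each is run on its own. This is a sketch: `keepPattern`, `samples` and `replaced` are illustrative names, and the inputs assume, as the test cases do, that style is the only quoted attribute on the line.

```javascript
const keepPattern = /(?<=style=")(?=(?:(?!text\-decoration:).)*(text\-decoration:\w+)?)(?=(?:(?!font\-size:).)*(font\-size:\d+px)?)(?:(?!color:).)*(color:(?:(?:rgba?\(.*?\))|(?:#?\w+)))?.*?(?=")/g

const samples = [
  '<span style="color:#fff">only color</span>',
  '<a style="display:block;font-size:10px;text-decoration:underline;color:rgba(1,2,3,0.1);">mix</a>',
  '<a style="display:block;position:fixed">all clear</a>',
]

// Groups that did not participate are replaced by empty strings,
// which is where the leftover ";" characters come from.
const replaced = samples.map(s => s.replace(keepPattern, '$1;$2;$3'))

console.log(replaced[0]) // <span style=";;color:#fff">only color</span>
console.log(replaced[1]) // <a style="text-decoration:underline;font-size:10px;color:rgba(1,2,3,0.1)">mix</a>
console.log(replaced[2]) // <a style=";;">all clear</a>
```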

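Remaining issues

The regex above achieves unordered matching, but if the output must also keep the properties in their original order, the replacement has to be produced by a function (an example follows later).
The regex does not match the ; that may follow each property; the ; is added in the replacement string instead, so the result can contain extra ; characters. That does not stop the result from being used again, but it never feels quite perfect.
If ; is matched inside the pattern instead, then inputs that have no trailing ; and appear in a certain order can produce results such as text-decoration:underlinefont-size:20px; (the 5th test case, for example).
So a perfect answer still requires processing the replacement with a function.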
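One possible way to do that post-processing is sketched below: keep the same regex, but let a replacement function drop the groups that did not match, restore the original order, and join the rest with ;. The helper name `keepAllowedStyles` is hypothetical, and the sketch assumes each kept declaration appears verbatim inside the matched style text.

```javascript
const stylePattern = /(?<=style=")(?=(?:(?!text\-decoration:).)*(text\-decoration:\w+)?)(?=(?:(?!font\-size:).)*(font\-size:\d+px)?)(?:(?!color:).)*(color:(?:(?:rgba?\(.*?\))|(?:#?\w+)))?.*?(?=")/g

// Hypothetical helper: rebuild the style value from the captured groups.
function keepAllowedStyles(html) {
  return html.replace(stylePattern, (match, decoration, fontSize, color) =>
    [decoration, fontSize, color]
      .filter(Boolean)                                       // drop groups that did not match
      .sort((a, b) => match.indexOf(a) - match.indexOf(b))   // restore source order
      .join(';'))
}

console.log(keepAllowedStyles('<span style="color:#fff">only color</span>'))
// <span style="color:#fff">only color</span>
console.log(keepAllowedStyles('<span style="font-size:20px;text-decoration:underline">x</span>'))
// <span style="font-size:20px;text-decoration:underline">x</span>
console.log(keepAllowedStyles('<a style="display:block;position:fixed">all clear</a>'))
// <a style="">all clear</a>
```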

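A simpler solution

For anyone with little regex experience, spending a long time on a regex this complex, which may also be hard to debug, is not appealing, and most people would choose another approach. My friend did too. Below is a PHP example (it also uses a function to produce the replacement).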

$text = <<<text
<span style="color:#fff">only color</span>
<span style="text-decoration:underline;">only text-decoration</span>
<span style="font-size:20px">only font-size</span>

<span style="font-size:30px;color:rgba(22,33,44)">both font-size and color</span>
<span style="font-size:20px;text-decoration:underline">both font-size and text-decoration</span>

<p style="font-size:10px;text-decoration:underline;color:#fff">all</p>
<p style="font-size:10px;text-decoration:underline;color:#fff;">all and end with semi</p>

<a style="display:block;font-size:10px;text-decoration:underline;color:rgba(1,2,3,0.1);">mix</a>

<a style="display:block;position:fixed">all clear</a>
text;

$pattern = '/(?<=style=").*?(?=")/';
$res = preg_replace_callback($pattern, function ($matches) {
    $groups = explode(';', $matches[0]);
    $res = [];
    foreach($groups as $group)
    {
        if (preg_match('/^(font\-size|color|text\-decoration)/', $group))
        {
            $res[] = $group;
        }
    }
    return implode(';', $res);
}, $text);

JS version

const pattern = /(?<=style=").*?(?=")/g
text.replace(pattern, match => {
    // match is the complete match, i.e. the content of style
    const groups = match.split(';')
    let res = []
    for (let group of groups)
    {
        if (/^(font\-size|color|text\-decoration)/.test(group))
        {
            res.push(group)
        }
    }
    return res.join(';')
})

So simple and easy to understand! 😅😅

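Comparison

I compared the two approaches in PHP, and the latter still wins by a narrow margin (PHP 7.2.19, CentOS 7.7).
Sometimes simple really is more efficient 😅

// Repeat the test cases 10000 times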
$text = str_repeat($text, 10000);

echo 'text length is ' . strlen($text) . "\n";

$pattern = '/(?<=style=")(?=(?:(?!text\-decoration:).)*(text\-decoration:\w+)?)(?=(?:(?!font\-size:).)*(font\-size:\d+px)?)(?:(?!color:).)*(color:(?:(?:rgba?\(.*?\))|(?:#?\w+)))?.*?(?=")/';
$start = microtime(true);
preg_replace($pattern, '$1;$2;$3', $text);
echo 'without callback cost ' . (microtime(true) - $start) . "\n";

$pattern = '/(?<=style=").*?(?=")/';
$start = microtime(true);
preg_replace_callback($pattern, function ($matches) {
    $groups = explode(';', $matches[0]);
    $res = [];
    foreach($groups as $group)
    {
        if (preg_match('/^(font\-size|color|text\-decoration)/', $group))
        {
            $res[] = $group;
        }
    }
    return implode(';', $res);
}, $text);
echo 'with callback cost ' . (microtime(true) - $start) . "\n";

/*
    text length is 6570000
    without callback cost 0.10138607025146
    with callback cost 0.087990045547485
*/
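A comparable measurement can be sketched in JavaScript. This is only a rough illustration: it repeats a single test line rather than the full test text, it assumes a runtime that provides the global `performance.now()` (recent Node.js versions and browsers), and the timings will differ from the PHP figures above. The names `timeIt`, `bigText`, `viaRegex` and `viaCallback` are illustrative.

```javascript
const sampleLine = '<a style="display:block;font-size:10px;text-decoration:underline;color:rgba(1,2,3,0.1);">mix</a>\n'
const bigText = sampleLine.repeat(10000)

const singleRegex = /(?<=style=")(?=(?:(?!text\-decoration:).)*(text\-decoration:\w+)?)(?=(?:(?!font\-size:).)*(font\-size:\d+px)?)(?:(?!color:).)*(color:(?:(?:rgba?\(.*?\))|(?:#?\w+)))?.*?(?=")/g
const styleOnly = /(?<=style=").*?(?=")/g
const allowed = /^(font-size|color|text-decoration)/

// Run fn once and print how long it took.
function timeIt(label, fn) {
  const start = performance.now()
  const result = fn()
  console.log(label, (performance.now() - start).toFixed(2), 'ms')
  return result
}

const viaRegex = timeIt('without callback cost', () =>
  bigText.replace(singleRegex, '$1;$2;$3'))
const viaCallback = timeIt('with callback cost', () =>
  bigText.replace(styleOnly, m =>
    m.split(';').filter(decl => allowed.test(decl)).join(';')))
```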

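Finally

My knowledge is limited; if anything here is lacking or wrong, please point it out.

References:
正则基础之捕获组 (Regex Basics: Capturing Groups)
正则基础之非捕获分组 (Regex Basics: Non-Capturing Groups)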