Regular Expression Complete Tutorial (Slightly Longer)

Introduction

Dear readers, if you clicked on this article, you are probably very interested in regular expressions.

You probably also understand how important they are. In my opinion, command of regular expressions is an indirect measure of a programmer's skill.

There are many regular expression tutorials on the Internet, and I believe you have read some of them.

What sets this article apart is its goal: every reader who works through it carefully should come away with a substantial improvement.

The article has seven chapters, which together cover every aspect of regular expressions in the JavaScript language.

If any part of the article is unclear, please leave a message, and Lao Yao will answer in as much detail as he can.

The specific chapters are as follows:

Introduction
Chapter 1 Regular Expression Character Matching Strategy
Chapter 2 Regular Expression Position Matching Strategy
Chapter 3 The Role of Regular Expression Parentheses
Chapter 4 Principles of Regular Expression Backtracking
Chapter 5 Splitting Regular Expressions
Chapter 6 Construction of Regular Expressions
Chapter 7 Regular Expression Programming
Postscript

Briefly, what does each chapter discuss?

A regular expression is a matching pattern: it matches either characters or positions.

Chapters 1 and 2 explain the basics of regular expressions from this angle.

In a regular expression you can use parentheses to capture data, and then refer to the captured groups either through the API or through backreferences inside the regex itself.

This is the topic of Chapter 3, which explains the role of parentheses.

To learn regular expressions well, you need to understand how matching works.

Chapter 4 explains the principle of backtracking. In addition, Chapter 6 explains how regular expressions work as a whole.

You should be able not only to read other people's regexes, but also to write your own.

Chapter 5 is about splitting a regular expression apart, from the reader's point of view, and Chapter 6 is about building a regular expression, from the writer's point of view.

We learn regular expressions in order to use them in real applications.

Chapter 7 explains how to use regexes in code and points to note about the related APIs.

How should you read this article?

My advice is to read it twice. The first time, read quickly and do not stop to dig into details. Note down any questions that come up; many of them may well be answered by the time you finish. Then, if you have time, read it a second time with those questions in mind.

Take a deep breath, and let us begin our regular expression journey. I will be waiting for you at the end.

Chapter 1 Regular Expression Character Matching Strategy

Regular expressions are patterns that match either a character or a position. Remember this sentence.

However, when learning regular expressions, most people find character matching the messy part.

After all, there are many metacharacters, and they seem unsystematic and hard to remember. This chapter addresses that problem.

Contents:

Two kinds of fuzzy matching
Character groups
Quantifiers
Branch structure
Case analysis

1. Two kinds of fuzzy matching

A regex that can only match exactly is of little use. For example, /hello/ can only match the substring "hello" in a string.

var regex = /hello/;
console.log( regex.test("hello") ); 
// => true

Regular expressions are powerful because they enable fuzzy matching.

Fuzzy matching is "fuzzy" in two directions: horizontal and vertical.

1.1 Horizontal fuzzy matching

Horizontal fuzziness means that the length of the string a regex can match is not fixed; it can vary.

This is achieved with quantifiers. For example, {m,n} means the preceding item occurs at least m and at most n times in a row.

For example, /ab{2,5}c/ matches a string whose first character is "a", followed by 2 to 5 "b" characters, and finally the character "c". The test is as follows:

var regex = /ab{2,5}c/g;
var string = "abc abbc abbbc abbbbc abbbbbc abbbbbbc";
console.log( string.match(regex) ); 
// => ["abbc", "abbbc", "abbbbc", "abbbbbc"]

Note: The regex used in this example is /ab{2,5}c/g . The trailing g is a regex modifier (flag) that stands for global matching: find, in order, all substrings of the target string that satisfy the pattern. The emphasis is on "all", not just "the first". g is the first letter of the word global.
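To make the effect of g concrete, here is a small sketch that runs the same pattern with and without the flag. The variable names are chosen purely for illustration.

```javascript
var text = "abc abbc abbbc";

// Without g: match returns only the first match, along with its index.
var firstMatch = text.match(/ab{2,5}c/);
console.log(firstMatch[0], firstMatch.index);
// => abbc 4

// With g: match returns every matching substring as a plain array.
var allMatches = text.match(/ab{2,5}c/g);
console.log(allMatches);
// => ["abbc", "abbbc"]
```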

1.2 Vertical fuzzy matching

Vertical fuzziness means that, at a given position in the match, the character need not be one specific character; there are several possibilities.

This is achieved with character groups. For example, [abc] means the character can be any one of "a", "b", "c".

For example, /a[123]b/ can match the following three strings: "a1b", "a2b", "a3b". The test is as follows:

var regex = /a[123]b/g;
var string = "a0b a1b a2b a3b a4b";
console.log( string.match(regex) ); 
// => ["a1b", "a2b", "a3b"]

That is the core of this chapter. Once you have mastered horizontal and vertical fuzzy matching, you can solve most character matching problems.

What follows expands on these ideas. If you are already familiar with them, you can skip ahead to the case analysis section of this chapter.

2. Character group

It should be emphasized that although it is called a character group (character class), it matches only a single character. For example, [abc] matches one character, which can be "a", "b", or "c".

2.1 Range notation

What if the character group contains many characters? You can use range notation.

For example, [123456abcdefGHIJKLM] can be written as [1-6a-fG-M] , using the hyphen - to abbreviate each run.

Because the hyphen has this special meaning, how do you match any one of the three characters "a", "-", and "z"?

You cannot write [a-z] , because that means any lowercase letter.

You can write [-az] , [az-] , or [a\-z] : put the hyphen at the beginning, put it at the end, or escape it. In each case the engine will not interpret it as a range.
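The three spellings can be checked side by side. The sketch below uses a hypothetical helper, testAll, and anchors each pattern with ^ and $ so that exactly one character is tested at a time.

```javascript
var inputs = ["a", "-", "z", "m"];

// Hypothetical helper: test every input against one regex.
function testAll(regex) {
  return inputs.map(function (s) { return regex.test(s); });
}

// Hyphen first, hyphen last, hyphen escaped: all three treat "-" literally.
console.log( testAll(/^[-az]$/) );
console.log( testAll(/^[az-]$/) );
console.log( testAll(/^[a\-z]$/) );
// => [true, true, true, false] in each case

// Hyphen in the middle: a range from "a" to "z", so "-" fails and "m" passes.
console.log( testAll(/^[a-z]$/) );
// => [true, false, true, true]
```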

2.2 Excluded character groups

Vertical fuzzy matching has one more case: a character can be anything except, say, "a", "b", or "c".

This is where excluded (negated) character groups come in. For example, [^abc] means any character except "a", "b", and "c". A ^ (caret) as the first character of a character group indicates negation.

Range notation works here too, of course.

2.3 Common shorthand

With character groups understood, some common symbols become easy to understand as well, because they are simply built-in shorthands.

\d is [0-9] . It represents a single digit. Mnemonic: digit.
\D is [^0-9] . It represents any character except a digit.
\w is [0-9a-zA-Z_] . It represents digits, uppercase and lowercase letters, and the underscore. Mnemonic: w is short for word; these are also known as word characters.
\W is [^0-9a-zA-Z_] . Non-word characters.
\s is [ \t\v\n\r\f] . It represents whitespace, including space, horizontal tab, vertical tab, newline, carriage return, and form feed. (In JavaScript, \s also matches some other Unicode whitespace, such as \u00a0 and the line separators \u2028 and \u2029 .) Mnemonic: s is the first letter of space.
\S is [^ \t\v\n\r\f] . Non-whitespace characters.
. is [^\n\r\u2028\u2029] . The wildcard, representing almost any character. The exceptions are newline, carriage return, line separator, and paragraph separator. Mnemonic: think of each dot in an ellipsis ... as a placeholder for anything.

What if you want to match truly any character? You can use any of [\d\D] , [\w\W] , [\s\S] , or [^] .
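The difference between the wildcard and the "any character" groups shows up as soon as a newline is involved. A minimal sketch (the test string is made up for illustration):

```javascript
var twoLines = "a\nb";

// . does not match the newline between "a" and "b".
console.log( /^a.b$/.test(twoLines) );
// => false

// Each of these character groups matches any character, newline included.
console.log( /^a[\d\D]b$/.test(twoLines) );
console.log( /^a[\w\W]b$/.test(twoLines) );
console.log( /^a[\s\S]b$/.test(twoLines) );
console.log( /^a[^]b$/.test(twoLines) );
// => true (all four)
```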

3. Quantifiers

Quantifiers are also called repetition. Once you understand exactly what {m,n} means, you only need to memorize a few shorthand forms.

3.1 Shorthand forms

{m,} means at least m occurrences.
{m} is equivalent to {m,m} : exactly m occurrences.
? is equivalent to {0,1} : present or absent. Mnemonic: the question mark asks, "is it there?"
+ is equivalent to {1,} : at least one occurrence. Mnemonic: a plus sign means adding on, and there must be one to begin with before you can add to it.
* is equivalent to {0,} : any number of occurrences, possibly none. Mnemonic: look at the stars in the sky; there may be none, there may be a few scattered ones, and there may be too many to count.
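The equivalences listed above can be verified directly. In this sketch, check is a hypothetical helper and the sample strings are chosen so that "b" appears zero, one, and two times.

```javascript
var samples = ["ac", "abc", "abbc"];

// Hypothetical helper: test every sample against one regex.
function check(regex) {
  return samples.map(function (s) { return regex.test(s); });
}

// ? behaves like {0,1}
console.log( check(/^ab?c$/), check(/^ab{0,1}c$/) );
// => [true, true, false] for both

// + behaves like {1,}
console.log( check(/^ab+c$/), check(/^ab{1,}c$/) );
// => [false, true, true] for both

// * behaves like {0,}
console.log( check(/^ab*c$/), check(/^ab{0,}c$/) );
// => [true, true, true] for both
```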

3.2 Greedy and lazy matching

See the following example:

var regex = /\d{2,5}/g;
var string = "123 1234 12345 123456";
console.log( string.match(regex) ); 
// => ["123", "1234", "12345", "12345"]

The regex /\d{2,5}/ matches 2 to 5 consecutive digits, so it can match runs of 2, 3, 4, or 5 digits.

But it is greedy and matches as many as it can. Offer it 6 and it takes 5; offer it 3 and it takes 3. As long as it stays within its limits, the more the better.

Greed is not always a good thing (see the last example of the article). Lazy matching, by contrast, matches as few as possible:

var regex = /\d{2,5}?/g;
var string = "123 1234 12345 123456";
console.log( string.match(regex) ); 
// => ["12", "12", "34", "12", "34", "12", "34", "56"]

Here /\d{2,5}?/ says that although 2 to 5 repetitions are all acceptable, once 2 is enough it does not try for more.

You make a quantifier lazy by adding a question mark after it, so the lazy forms are:

{m,n}?
{m,}?
??
+?
*?

Mnemonic for lazy matching: put a question mark after the quantifier, as if asking, "Are you satisfied yet? Do you have to be so greedy?"

4. Multiple branch selection

A single pattern gives you horizontal and vertical fuzzy matching. A multi-branch structure lets you choose among several sub-patterns.

The form is (p1|p2|p3) , where p1 , p2 , and p3 are sub-patterns separated by | (the pipe character), meaning any one of them.

For example, to match "good" and "nice", use /good|nice/ . The test is as follows:

var regex = /good|nice/g;
var string = "good idea, nice try.";
console.log( string.match(regex) ); 
// => ["good", "nice"]

There is one thing to watch out for, though. For example, when I use /good|goodbye/ to match the string "goodbye", the result is "good":

var regex = /good|goodbye/g;
var string = "goodbye";
console.log( string.match(regex) ); 
// => ["good"]

If you change the regex to /goodbye|good/ , the result is:

var regex = /goodbye|good/g;
var string = "goodbye";
console.log( string.match(regex) ); 
// => ["goodbye"]

In other words, the branch structure is also lazy: once an earlier branch matches, the later ones are not tried.
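One nuance is worth a hedged illustration: "lazy" here describes which branch is tried first, not a refusal to ever try the later ones. If the rest of the pattern fails after an earlier branch has matched, the engine backtracks and tries the next branch. In the sketch below, the anchors ^ and $ are added purely to force that situation.

```javascript
// Unanchored: the first branch "good" succeeds, so matching stops there.
console.log( "goodbye".match(/good|goodbye/)[0] );
// => "good"

// Anchored: after "good", $ fails (the text continues with "bye"),
// so the engine backtracks and tries the second branch.
var anchored = /^(good|goodbye)$/;
console.log( anchored.test("goodbye") );
// => true
console.log( "goodbye".match(anchored)[0] );
// => "goodbye"
```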

5. Case analysis

Matching characters comes down to combining character groups, quantifiers, and branch structures.

Let's practice on a few examples (in each case, the regex shown is not the only way to write it):

5.1 Matching hexadecimal color values

The regex should match:

#ffbbad
#Fc01DF
#FFF
#ffE

Analysis:

A hexadecimal character can be expressed with the character group [0-9a-fA-F] .

The character can appear 3 or 6 times, so we need quantifiers and a branch structure.

When using a branch structure, pay attention to the order of the branches.

The regex is as follows:

var regex = /#([0-9a-fA-F]{6}|[0-9a-fA-F]{3})/g;
var string = "#ffbbad #Fc01DF #FFF #ffE";
console.log( string.match(regex) ); 
// => ["#ffbbad", "#Fc01DF", "#FFF", "#ffE"]
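Why the order matters can be seen by swapping the two branches. This is only a sketch; the names wrongOrder and rightOrder are made up for the comparison.

```javascript
var colors = "#ffbbad #FFF";

// 3-digit branch first: it succeeds on the first three hex digits
// of "#ffbbad", so the 6-digit branch is never tried there.
var wrongOrder = /#([0-9a-fA-F]{3}|[0-9a-fA-F]{6})/g;
console.log( colors.match(wrongOrder) );
// => ["#ffb", "#FFF"]

// 6-digit branch first: the longer form wins when it is available.
var rightOrder = /#([0-9a-fA-F]{6}|[0-9a-fA-F]{3})/g;
console.log( colors.match(rightOrder) );
// => ["#ffbbad", "#FFF"]
```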

5.2 Matching a time

Take the 24-hour clock as an example.

The regex should match:

23:59
02:07

Analysis:

There are 4 digits in total. The first digit can be [0-2] .

When the first digit is 2, the second digit can be [0-3] ; otherwise, the second digit can be [0-9] .

The third digit is [0-5] and the fourth digit is [0-9] .

The regex is as follows:

var regex = /^([01][0-9]|[2][0-3]):[0-5][0-9]$/;
console.log( regex.test("23:59") ); 
console.log( regex.test("02:07") ); 
// => true
// => true

Suppose "7:9" must also match, that is, the leading 0 of the hour and of the minute may be omitted.

The regex then becomes:

var regex = /^(0?[0-9]|1[0-9]|[2][0-3]):(0?[0-9]|[1-5][0-9])$/;
console.log( regex.test("23:59") ); 
console.log( regex.test("02:07") ); 
console.log( regex.test("7:9") ); 
// => true
// => true
// => true
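It is also worth checking inputs that should be rejected. The sketch below names the two regexes strict and relaxed (names assumed here for illustration) and feeds them a few out-of-range values.

```javascript
var strict = /^([01][0-9]|[2][0-3]):[0-5][0-9]$/;
var relaxed = /^(0?[0-9]|1[0-9]|[2][0-3]):(0?[0-9]|[1-5][0-9])$/;

// Hour 24 and minute 60 are out of range for both versions.
console.log( strict.test("24:00"), relaxed.test("24:00") );
// => false false
console.log( strict.test("23:60"), relaxed.test("23:60") );
// => false false

// Only the relaxed version accepts missing leading zeros.
console.log( strict.test("7:9"), relaxed.test("7:9") );
// => false true
```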

5.3 Matching a date

Take the yyyy-mm-dd format as an example.

The regex should match:

2017-06-10

Analysis:

Year: four digits suffice, so [0-9]{4} can be used.

Month: there are 12 months, in two cases, 01, 02, ..., 09 and 10, 11, 12, so (0[1-9]|1[0-2]) can be used.

Day: at most 31, so (0[1-9]|[12][0-9]|3[01]) can be used.

The regex is as follows:

var regex = /^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$/;
console.log( regex.test("2017-06-10") ); 
// => true
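Note that this regex checks only the shape of the date, so a string such as "2017-02-31" passes. One possible way to tighten this, sketched below, is to combine the regex with the built-in Date object. The function name isValidDate is hypothetical, and the sketch reads the year, month, and day through capturing groups, which Chapter 3 covers.

```javascript
var dateRegex = /^([0-9]{4})-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$/;

// Hypothetical helper: the regex validates the shape, Date validates the calendar.
function isValidDate(s) {
  var m = dateRegex.exec(s);
  if (!m) return false;
  var year = Number(m[1]), month = Number(m[2]), day = Number(m[3]);
  // Date rolls invalid days over (Feb 31 becomes Mar 3), so compare the parts back.
  var d = new Date(Date.UTC(year, month - 1, day));
  return d.getUTCFullYear() === year &&
         d.getUTCMonth() === month - 1 &&
         d.getUTCDate() === day;
}

console.log( dateRegex.test("2017-02-31") ); // => true (shape only)
console.log( isValidDate("2017-02-31") );    // => false
console.log( isValidDate("2016-02-29") );    // => true (leap year)
console.log( isValidDate("2017-06-10") );    // => true
```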

5.4 Windows file paths

The regex should match:

F:\study\javascript\regex\regular expression.pdf
F:\study\javascript\regex\
F:\study\javascript
F:\

Analysis:

The overall pattern is: drive letter:\folder\folder\folder\

To match "F:\", use [a-zA-Z]:\\ . The drive letter is case-insensitive, and note that the \ character must be escaped.

File and folder names cannot contain certain special characters, so we use the excluded character group [^\\:*<>|"?\r\n/] to represent the legal characters. A name also cannot be empty; it needs at least one character, so we use the quantifier + . To match "folder\", then, we can use [^\\:*<>|"?\r\n/]+\\ .

"folder\" can also appear any number of times, giving ([^\\:*<>|"?\r\n/]+\\)* . Here the parentheses group the subexpression.

The last part of the path can be "folder" without a trailing \ , so we also add ([^\\:*<>|"?\r\n/]+)? .

Putting it all together gives a seemingly complicated regex:

var regex = /^[a-zA-Z]:\\([^\\:*<>|"?\r\n/]+\\)*([^\\:*<>|"?\r\n/]+)?$/;
console.log( regex.test("F:\\study\\javascript\\regex\\regular expression.pdf") ); 
console.log( regex.test("F:\\study\\javascript\\regex\\") ); 
console.log( regex.test("F:\\study\\javascript") ); 
console.log( regex.test("F:\\") ); 
// => true
// => true
// => true
// => true

Note that in a JS string literal, \ must itself be escaped.
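The escaping is easy to confirm by printing a string and its length, and the same regex can be tried on a few paths that should fail. The sample paths below are made up for illustration.

```javascript
var pathRegex = /^[a-zA-Z]:\\([^\\:*<>|"?\r\n/]+\\)*([^\\:*<>|"?\r\n/]+)?$/;

// "F:\\" in source code is the three-character string F:\
var root = "F:\\";
console.log(root, root.length);
// => F:\ 3

// An illegal character inside a name is rejected.
console.log( pathRegex.test("F:\\study\\java|script") );
// => false

// A missing backslash after the drive letter is rejected.
console.log( pathRegex.test("F:study") );
// => false

// Two backslashes in a row would mean an empty folder name.
console.log( pathRegex.test("F:\\\\study") );
// => false
```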

5.5 Matching an id

The task: from

<div id="container" class="main"></div>

extract id="container".

The first regex that comes to mind might be:

var regex = /id=".*"/
var string = '<div id="container" class="main"></div>';
console.log(string.match(regex)[0]); 
// => id="container" class="main"

Because . is a wildcard, it also matches double quotes, and the quantifier * is greedy. So when the match reaches the double quote after container , it does not stop; it keeps going until the last double quote.

The solution is to use lazy matching:

var regex = /id=".*?"/
var string = '<div id="container" class="main"></div>';
console.log(string.match(regex)[0]); 
// => id="container"

This still has a problem, though: it is relatively inefficient, because the way it matches involves "backtracking" (mentioned only in passing here; Chapter 4 explains it in detail). It can be optimized as follows:

var regex = /id="[^"]*"/
var string = '<div id="container" class="main"></div>';
console.log(string.match(regex)[0]); 
// => id="container"

Chapter 1 Summary

There are many more cases related to character matching; we cannot list them all.

Mastering character groups and quantifiers handles most common situations. In other words, once you know these two, you can consider yourself past the entry level of JS regex.

Chapter 2 Regular Expression Position Matching Strategy

Regular expressions are patterns that match either a character or a position. Remember this sentence.

However, most people do not pay much attention to matching positions when learning regular expressions.

This chapter covers everything about matching positions in regex.

Contents:

What is a position?
How to match positions?
Properties of positions
Analysis of several application examples

1. What is a position?

A position is the place between adjacent characters.

[Figure: arrows pointing at the positions between adjacent characters]

2. How to match positions?

In ES5 there are 6 anchors:

^ $ \b \B (?=p) (?!p)

2.1 ^ and $

^ (caret) matches the beginning of the string; in multiline mode it matches the beginning of each line.

$ (dollar sign) matches the end of the string; in multiline mode it matches the end of each line.

For example, we can replace the beginning and the end of a string with "#" (positions can be replaced with characters!):

var result = "hello".replace(/^|$/g, '#');
console.log(result); 
// => "#hello#"

In multiline mode (the m flag), both anchors work line by line, which deserves our attention:

var result = "I\nlove\njavascript".replace(/^|$/gm, '#');
console.log(result);
/*
#I#
#love#
#javascript#
*/
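For contrast, here is the same replacement with and without the m flag. This is a small sketch; the variable names are illustrative.

```javascript
var lines = "I\nlove\njavascript";

// Without m: ^ and $ match only at the start and end of the whole string.
var withoutM = lines.replace(/^|$/g, '#');
console.log( withoutM === "#I\nlove\njavascript#" );
// => true

// With m: ^ and $ match at the start and end of every line.
var withM = lines.replace(/^|$/gm, '#');
console.log( withM === "#I#\n#love#\n#javascript#" );
// => true
```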

2.2 \b and \B

\b is a word boundary: the position between \w and \W , which also includes the position between ^ and \w and the position between \w and $ .

For example, the \b positions in the file name "[JS] Lesson_01.mp4" are as follows:

var result = "[JS] Lesson_01.mp4".replace(/\b/g, '#');
console.log(result); 
// => "[#JS#] #Lesson_01#.#mp4#"

Why is this so? It deserves a closer look.

First, we know that \w is shorthand for the character group [0-9a-zA-Z_] , that is, any letter, digit, or underscore. And \W is shorthand for the excluded character group [^0-9a-zA-Z_] , that is, any character other than \w .

Now we can see where each "#" in "[#JS#] #Lesson_01#.#mp4#" comes from.

The first "#" sits between "[" and "J", a position between \W and \w .
The second "#" sits between "S" and "]", a position between \w and \W .
The third "#" sits between the space and "L", a position between \W and \w .
The fourth "#" sits between "1" and ".", a position between \w and \W .
The fifth "#" sits between "." and "m", a position between \W and \w .
The sixth "#" is at the end of the string; the preceding character "4" is \w , so this is a position between \w and $ .

Once \b is understood, \B is easy.

\B is the opposite of \b : a position that is not a word boundary. Take all the positions in a string, remove the \b positions, and what remains are the \B positions.

Specifically, these are the positions between \w and \w , between \W and \W , between ^ and \W , and between \W and $ .

For example, replacing every \B in the example above with "#":

var result = "[JS] Lesson_01.mp4".replace(/\B/g, '#');
console.log(result); 
// => "#[J#S]# L#e#s#s#o#n#_#0#1.m#p#4"
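A common practical use of these two anchors is matching a whole word versus a fragment inside a word. A minimal sketch, with a made-up sentence:

```javascript
var sentence = "cat concat cats";

// Plain /cat/ matches inside other words too.
console.log( sentence.match(/cat/g).length );
// => 3

// \bcat\b requires a word boundary on both sides: only the standalone word.
console.log( sentence.replace(/\bcat\b/g, "dog") );
// => "dog concat cats"

// \Bcat requires a non-boundary before "cat": only the one inside "concat".
console.log( sentence.replace(/\Bcat/g, "dog") );
// => "cat condog cats"
```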

2.3 (?=p) and (?!p)

(?=p) , where p is a sub-pattern, is the position immediately before p .

For example, (?=l) means the position in front of an "l" character:

var result = "hello".replace(/(?=l)/g, '#');
console.log(result); 
// => "he#l#lo"

(?!p) is the opposite of (?=p) . For example:

var result = "hello".replace(/(?!l)/g, '#');

console.log(result); 
// => "#h#ell#o#"

The formal names of the two are positive lookahead assertion and negative lookahead assertion respectively.

Later versions of the language (ES2018) also support positive lookbehind and negative lookbehind.

Specifically (?<=p) and (?<!p) .
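By analogy with the two lookahead examples above, the lookbehind forms can be tried on the same string. This sketch assumes an engine that supports lookbehind; older engines will throw a syntax error on these literals.

```javascript
// (?<=l) is the position right after an "l".
var afterL = "hello".replace(/(?<=l)/g, '#');
console.log(afterL);
// => "hel#l#o"

// (?<!l) is every position that is NOT right after an "l".
var notAfterL = "hello".replace(/(?<!l)/g, '#');
console.log(notAfterL);
// => "#h#e#llo#"
```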

Some books call these four constructs lookaround, that is, looking to the right or to the left.

But most books do not emphasize that these four constructs are positions.

For example, (?=p) is usually explained as: the characters that follow must match p , but those characters are not included in the match.

In my view, (?=p) is as easy to understand as ^ : it is simply the position before p .

3. Properties of positions

A position can be thought of as the empty string "" .

For example, the string "hello" is equivalent to the following form:

"hello" == "" + "h" + "" + "e" + "" + "l" + "" + "l" + "" + "o" + "";

It is also equivalent to:

"hello" == "" + "" + "hello"

So there is no problem in writing /^hello$/ as /^^hello$$$/ :

var result = /^^hello$$$/.test("hello");
console.log(result); 
// => true

Or even something more complex:

var result = /(?=he)^^he(?=\w)llo$\b\b$/.test("hello");
console.log(result); 
// => true

In other words, the position between two characters can be written as several positions.

Thinking of a position as an empty string is a very effective way to understand positions.

4. Related cases

4.1 A regex that matches nothing

Suppose you are asked to write a regex that never matches anything.

Easy: /.^/

This regex requires one character, and that character must be followed by the start of the string, which is impossible.
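A quick sanity check of this claim, run over a handful of made-up sample strings:

```javascript
var never = /.^/;
var samples = ["", "a", "abc", "a\nb", "\n"];

// some() returns true if at least one sample matches.
var anyMatch = samples.some(function (s) { return never.test(s); });
console.log(anyMatch);
// => false
```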

4.2 Thousands separators for numbers

For example, change "12345678" to "12,345,678".

The corresponding positions need to be replaced with ",".

How do we approach this?

4.2.1 Get the last comma

Use (?=\d{3}$) :

var result = "12345678".replace(/(?=\d{3}$)/g, ',')
console.log(result); 
// => "12345,678"

4.2.2 Get all the commas

A comma goes wherever the digits after it form groups of three, that is, \d{3} occurs at least once.

The quantifier + handles this:

var result = "12345678".replace(/(?=(\d{3})+$)/g, ',')
console.log(result); 
// => "12,345,678"

4.2.3 Handle the remaining cases

After writing a regex, we should verify it against more cases. Doing so reveals a problem:

var result = "123456789".replace(/(?=(\d{3})+$)/g, ',')
console.log(result); 
// => ",123,456,789"

The regex above only says: counting back from the end, wherever the number of digits is a multiple of 3, replace the position in front with a comma. Hence the problem.

How do we fix it? We require that the matched position not be the start of the string.

We know ^ matches the start, but how do we say that a position is not the start?

Easy: (?!^) . Did you think of it? The test is as follows:

var string1 = "12345678",
string2 = "123456789",
reg = /(?!^)(?=(\d{3})+$)/g;

var result = string1.replace(reg, ',')
console.log(result); 
// => "12,345,678"

result = string2.replace(reg, ',');
console.log(result); 
// => "123,456,789"

What if the value has a decimal part? For example: 1234567.123

var string3 = '1234567.123',
reg = /(?!^)(?=(\d{3})+\b(?!$))/g; // \b(?!$) restricts matches to the part before the decimal point; \b[^$] also works
var result = string3.replace(reg, ',')
console.log(result); 
// => '1,234,567.123'

4.2.4 Supporting other forms

Suppose you want to turn "12345678 123456789" into "12,345,678 123,456,789".

Now we need to modify the regex, replacing the start ^ and the end $ with \b :

var string = "12345678 123456789",
reg = /(?!\b)(?=(\d{3})+\b)/g;

var result = string.replace(reg, ',')
console.log(result); 
// => "12,345,678 123,456,789"

How should we understand (?!\b) ?

It requires a position that is not a \b position. In fact, (?!\b) is just \B .

So the final regex becomes: /\B(?=(\d{3})+\b)/g .
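The final regex can be wrapped in a small helper. The sketch below is one possible arrangement, not the only one: the function name formatThousands is hypothetical, and it splits off the fractional part first, because applied to a whole number string the regex would also insert commas into a fractional part longer than three digits.

```javascript
// Hypothetical helper: add thousands separators to the integer part only.
function formatThousands(value) {
  var parts = String(value).split(".");
  parts[0] = parts[0].replace(/\B(?=(\d{3})+\b)/g, ",");
  return parts.join(".");
}

console.log( formatThousands("12345678") );     // => "12,345,678"
console.log( formatThousands("123456789") );    // => "123,456,789"
console.log( formatThousands("123") );          // => "123"
console.log( formatThousands("1234567.5678") ); // => "1,234,567.5678"

// Without the split, the fractional digits get a comma as well:
console.log( "1234.5678".replace(/\B(?=(\d{3})+\b)/g, ",") );
// => "1,234.5,678"
```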

4.3 Password validation

The password must be 6 to 12 characters long, consist of digits, lowercase letters, and uppercase letters, and include at least two of these kinds of characters.

This is easier to check with several regexes. Writing it as a single regex is harder.

So let's take up the challenge and see how deep our understanding of positions is.

4.3.1 Simplify

Leaving aside the condition "must include at least two kinds of characters", we can easily write:

var reg = /^[0-9A-Za-z]{6,12}$/;

4.3.2 Check whether it contains a certain kind of character

Suppose the password must contain a digit. What do we do? We can use (?=.*[0-9]) .

So the regex becomes:

var reg = /(?=.*[0-9])^[0-9A-Za-z]{6,12}$/;

4.3.3 Contain two kinds of characters at the same time

For example, to require both a digit and a lowercase letter, we can use (?=.*[0-9])(?=.*[a-z]) .

So the regex becomes:

var reg = /(?=.*[0-9])(?=.*[a-z])^[0-9A-Za-z]{6,12}$/;

4.3.4 Answer

We can restate the original problem as one of the following cases:

Contains both digits and lowercase letters
Contains both digits and uppercase letters
Contains both lowercase and uppercase letters
Contains digits, lowercase letters, and uppercase letters

These 4 cases are joined by OR (in fact, the fourth case can be omitted).

The final answer is:

var reg = /((?=.*[0-9])(?=.*[a-z])|(?=.*[0-9])(?=.*[A-Z])|(?=.*[a-z])(?=.*[A-Z]))^[0-9A-Za-z]{6,12}$/;
console.log( reg.test("1234567") ); // false: all digits
console.log( reg.test("abcdef") ); // false: all lowercase letters
console.log( reg.test("ABCDEFGH") ); // false: all uppercase letters
console.log( reg.test("ab23C") ); // false: fewer than 6 characters
console.log( reg.test("ABCDEF234") ); // true: uppercase letters and digits
console.log( reg.test("abcdEF234") ); // true: all three kinds

4.3.5 Explanation

The regex above looks complicated, but once you understand the second step, the rest follows.

/(?=.*[0-9])^[0-9A-Za-z]{6,12}$/

For this regex, we only need to figure out (?=.*[0-9])^ .

Taken apart, it is (?=.*[0-9]) and ^ .

Together they denote a position in front of the start of the string (which is of course the start itself, the same position; recall the empty string analogy above).

(?=.*[0-9]) means that the characters following this position match .*[0-9] , that is, any number of arbitrary characters followed by a digit.

In plain words: the characters following the start of the string must include a digit.

Note: the ^ is required. Without it, the match may begin at any position in the string, so the total length is no longer controlled. Compare:

// with ^
var reg = /(?=.*[0-9])^[0-9A-Za-z]{6,12}$/;
console.log( reg.test("ABCDEFGH1234") ); // true
console.log( reg.test("ABCDEFGH12345") ); // false

// without ^
var reg = /(?=.*[0-9])[0-9A-Za-z]{6,12}$/;
console.log( reg.test("ABCDEFGH1234") ); // true
console.log( reg.test("ABCDEFGH12345") ); // true

4.3.6 Another solution

"At least two kinds of characters" means the password cannot be all digits, all lowercase letters, or all uppercase letters.

So how do we express "cannot be all digits"? This is where (?!p) comes in!

The corresponding regex is:

var reg = /(?!^[0-9]{6,12}$)^[0-9A-Za-z]{6,12}$/;

And for all three "cannot" conditions?

The final answer is:

var reg = /(?!^[0-9]{6,12}$)(?!^[a-z]{6,12}$)(?!^[A-Z]{6,12}$)^[0-9A-Za-z]{6,12}$/;
console.log( reg.test("1234567") ); // false: all digits
console.log( reg.test("abcdef") ); // false: all lowercase letters
console.log( reg.test("ABCDEFGH") ); // false: all uppercase letters
console.log( reg.test("ab23C") ); // false: fewer than 6 characters
console.log( reg.test("ABCDEF234") ); // true: uppercase letters and digits
console.log( reg.test("abcdEF234") ); // true: all three kinds
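For comparison, here is a sketch of the "several regexes" approach mentioned at the start of this case. The function name isValidPassword is hypothetical; it checks the overall shape first and then counts how many kinds of characters are present.

```javascript
// Hypothetical helper: the same rule expressed with several simple regexes.
function isValidPassword(s) {
  // Overall shape: 6 to 12 characters, digits and letters only.
  if (!/^[0-9A-Za-z]{6,12}$/.test(s)) return false;
  // Count the kinds of characters that appear.
  var kinds = 0;
  if (/[0-9]/.test(s)) kinds++;
  if (/[a-z]/.test(s)) kinds++;
  if (/[A-Z]/.test(s)) kinds++;
  return kinds >= 2;
}

console.log( isValidPassword("1234567") );   // => false
console.log( isValidPassword("abcdef") );    // => false
console.log( isValidPassword("ab23C") );     // => false
console.log( isValidPassword("ABCDEF234") ); // => true
console.log( isValidPassword("abcdEF234") ); // => true
```

It is longer than the single regex, but each step can be read and changed on its own, for example to require three kinds instead of two.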

Chapter 2 Summary

There are many more cases related to position matching; we cannot list them all.

Mastering the 6 anchors for matching positions gives us a new tool for solving regex problems.

Chapter 3 The Role of Regular Expression Parentheses

Every programming language has parentheses. Regular expressions are a language too, and parentheses make that language more powerful.

How comfortably someone uses parentheses is an indirect measure of how well they have mastered regex.

The role of parentheses can be summed up in a few words: parentheses provide grouping, so that we can refer to the group.

A group can be referenced in two places: in JavaScript, and within the regex itself.

Although the content of this chapter is relatively simple, I want to cover it at some length.

Contents:

Grouping and branch structure
Capturing groups
Backreferences
Non-capturing groups
Related cases

1. Grouping and branch structure

These two are the most intuitive, and the most basic, functions of parentheses.

1.1 Grouping

We know that /a+/ matches consecutive occurrences of "a". To match consecutive occurrences of "ab", we need /(ab)+/ .

The parentheses provide grouping, so that the quantifier + applies to "ab" as a whole. The test is as follows:

var regex = /(ab)+/g;
var string = "ababa abbb ababab";
console.log( string.match(regex) ); 
// => ["abab", "ab", "ababab"]

1.2 Branch structure

In the multi-branch structure (p1|p2) , the role of the parentheses is self-evident: they delimit the set of alternative subexpressions.

For example, to match the following strings:

I love JavaScript
I love Regular Expression

we can use the regex:

var regex = /^I love (JavaScript|Regular Expression)$/;
console.log( regex.test("I love JavaScript") );
console.log( regex.test("I love Regular Expression") );
// => true
// => true

If you remove the parentheses, giving /^I love JavaScript|Regular Expression$/ , the two branches become ^I love JavaScript and Regular Expression$ , so the strings matched are "I love JavaScript" and "Regular Expression", which is of course not what we want.
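The difference is easy to observe with inputs that should not match. In this sketch the names noParens and withParens, and the test sentences, are made up for illustration.

```javascript
var noParens = /^I love JavaScript|Regular Expression$/;
var withParens = /^I love (JavaScript|Regular Expression)$/;

// Without parentheses, each anchor belongs to only one branch.
console.log( noParens.test("I love JavaScript, sometimes") );
// => true (first branch: only the start is anchored)
console.log( noParens.test("I hate Regular Expression") );
// => true (second branch: only the end is anchored)

// With parentheses, both anchors apply to both alternatives.
console.log( withParens.test("I love JavaScript, sometimes") );
// => false
console.log( withParens.test("I hate Regular Expression") );
// => false
```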

2. Referencing groups

This is an important role of parentheses: with it we can extract data and perform more powerful replacement operations.

To take advantage of it, you must use the API of the host language.

Take a date as an example. Assuming the format is yyyy-mm-dd, we can first write a simple regex:

var regex = /\d{4}-\d{2}-\d{2}/;

Then add parentheses:

var regex = /(\d{4})-(\d{2})-(\d{2})/;

Why write the regex this way?

2.1 Extract data

For example, to extract the year, month, and day, you can do this:

var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";
console.log( string.match(regex) ); 
// => ["2017-06-12", "2017", "06", "12", index: 0, input: "2017-06-12"]

match returns an array: the first element is the overall match, followed by the content matched by each group (each pair of parentheses), then the index of the match, and finally the input text. (Note: the format of the array returned by match differs depending on whether the regex has the g modifier.)
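To illustrate that note: with the g modifier, match returns only the overall matches and drops the groups. One way to still reach the groups of every match, sketched here with a made-up input string, is to call exec in a loop.

```javascript
var dates = "2017-06-12 and 2018-01-02";

// With g, match returns the full matches only; group contents are not included.
console.log( dates.match(/(\d{4})-(\d{2})-(\d{2})/g) );
// => ["2017-06-12", "2018-01-02"]

// exec on a g regex continues from the previous match on each call
// and returns null when there are no more matches.
var dateRegex = /(\d{4})-(\d{2})-(\d{2})/g;
var years = [];
var m;
while ((m = dateRegex.exec(dates)) !== null) {
  years.push(m[1]); // m[1] is the first group: the year
}
console.log(years);
// => ["2017", "2018"]
```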

Alternatively, you can use the exec method of the regex object:

var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";
console.log( regex.exec(string) ); 
// => ["2017-06-12", "2017", "06", "12", index: 0, input: "2017-06-12"]

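The groups can also be read from the properties $1 to $9 of the RegExp constructor (these static properties are a legacy feature, so the arrays returned by match and exec are generally the safer choice):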

var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";

regex.test(string); // any regex operation will do, for example:
//regex.exec(string);
//string.match(regex);

console.log(RegExp.$1); // "2017"
console.log(RegExp.$2); // "06"
console.log(RegExp.$3); // "12"

2.2 Replacement

For example, how do we convert the yyyy-mm-dd format to mm/dd/yyyy ?

var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";
var result = string.replace(regex, "$2/$3/$1");
console.log(result); 
// => "06/12/2017"

Here, in the second argument of replace , $1 , $2 , and $3 refer to the corresponding groups. This is equivalent to the following form:

var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";
var result = string.replace(regex, function() {
    return RegExp.$2 + "/" + RegExp.$3 + "/" + RegExp.$1;
});
console.log(result); 
// => "06/12/2017"

It is also equivalent to:

var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";
var result = string.replace(regex, function(match, year, month, day) {
    return month + "/" + day + "/" + year;
});
console.log(result); 
// => "06/12/2017"

3. Backreferences

Besides referring to groups through the API, you can also refer to them within the regex itself. Only groups that have already appeared can be referenced, hence the name backreference.

Take the date as an example again.

Suppose we want a regex that matches the following three formats:

2016-06-12
2016/06/12
2016.06.12

The first regex that comes to mind might be:

var regex = /\d{4}(-|\/|\.)\d{2}(-|\/|\.)\d{2}/;
var string1 = "2017-06-12";
var string2 = "2017/06/12";
var string3 = "2017.06.12";
var string4 = "2016-06/12";
console.log( regex.test(string1) ); // true
console.log( regex.test(string2) ); // true
console.log( regex.test(string3) ); // true
console.log( regex.test(string4) ); // true

Here / and . need to be escaped. Although this regex matches the required cases, it also matches data such as "2016-06/12" .

What if we want the separators to be consistent? This is where a backreference is needed:

var regex = /\d{4}(-|\/|\.)\d{2}\1\d{2}/;
var string1 = "2017-06-12";
var string2 = "2017/06/12";
var string3 = "2017.06.12";
var string4 = "2016-06/12";
console.log( regex.test(string1) ); // true
console.log( regex.test(string2) ); // true
console.log( regex.test(string3) ); // true
console.log( regex.test(string4) ); // false

Note the \1 inside: it refers back to the earlier group (-|\/|\.). Whatever that group matched (for example -), \1 matches that same character.

Once the meaning of \1 is clear, \2 and \3 follow naturally: they refer to the second and third groups respectively.
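
As a small sketch of how the group number shifts when more groups are added: if the year, month and day are captured as well, the separator becomes the second group, so the backreference has to be written as \2. The function name splitDate is our own:

```javascript
// Group 1: year, group 2: separator, group 3: month, group 4: day.
var dateWithSeparator = /(\d{4})(-|\/|\.)(\d{2})\2(\d{2})/;

function splitDate(str) {
    var m = dateWithSeparator.exec(str);
    if (m === null) {
        return null; // separators differ or the format is wrong
    }
    return { year: m[1], separator: m[2], month: m[3], day: m[4] };
}

console.log( splitDate("2017/06/12") );
// => { year: "2017", separator: "/", month: "06", day: "12" }
console.log( splitDate("2016-06/12") );
// => null
```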

At this point you probably have three questions.

3.1 What about nested parentheses?

The opening (left) parenthesis determines the group number. For example:

var regex = /^((\d)(\d(\d)))\1\2\3\4$/;
var string = "1231231233";
console.log( regex.test(string) ); // true
console.log( RegExp.$1 ); // 123
console.log( RegExp.$2 ); // 1
console.log( RegExp.$3 ); // 23
console.log( RegExp.$4 ); // 3

We can walk through this regex:

The first character is a digit, say 1,
The second character is a digit, say 2,
The third character is a digit, say 3,
Next is \1 , the content of the first group: look at the group that starts at the first opening parenthesis, which matched 123,
Next is \2 : find the second opening parenthesis; the corresponding group matched 1,
Next is \3 : find the third opening parenthesis; the corresponding group matched 23,
Last is \4 : find the fourth opening parenthesis; the corresponding group matched 3.

Read it carefully and this question should become clear.
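
The same numbering can be checked without the RegExp.$n properties by looking at the array returned by exec, as in this short sketch (variable names are ours):

```javascript
var nested = /^((\d)(\d(\d)))\1\2\3\4$/;

// slice(1) drops the whole match and keeps only the captured groups,
// ordered by the position of their opening parenthesis.
var groups = nested.exec("1231231233").slice(1);
console.log(groups);
// => ["123", "1", "23", "3"]
```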

3.2 What does \10 mean?

Another question: does \10 mean the 10th group, or \1 followed by 0 ?

The answer is the former, although \10 rarely appears in a regex. The test is as follows:

var regex = /(1)(2)(3)(4)(5)(6)(7)(8)(9)(#) \10+/;
var string = "123456789# ######"
console.log( regex.test(string) );
// => true
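
If what you actually want is the first group followed by a literal 0, the two have to be separated explicitly. Two common ways, shown here as a sketch, are wrapping the backreference in a non-capturing group or putting the digit in a character group:

```javascript
// (?:\1) ends the backreference before the literal 0.
var groupThenZero = /(a)(?:\1)0/;
console.log( groupThenZero.test("aa0") ); // true

// [0] is another way to keep the digit apart from \1.
var groupThenZeroClass = /(a)\1[0]/;
console.log( groupThenZeroClass.test("aa0") ); // true
console.log( groupThenZeroClass.test("aa") );  // false
```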

3.3 What happens when the referenced group does not exist?

A backreference refers to an earlier group, so what happens when the regex refers to a group that does not exist? The regex does not report an error; it simply matches the escaped character itself. For example, \2 matches "\2". Note that "\2" here means the character produced by escaping "2".

var regex = /\1\2\3\4\5\6\7\8\9/;
console.log( regex.test("\1\2\3\4\5\6\7\8\9") ); 
console.log( "\1\2\3\4\5\6\7\8\9".split("") );

The result printed by the Chrome browser:
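
To make the behavior concrete without relying on a console screenshot, here is a hedged sketch. In a regex without the u flag, \2 with no second group is read as a legacy octal escape, that is, the character with code 2, while \8 and \9 match the digits themselves. With the u flag the same pattern is a syntax error. The RegExp constructor and \u escapes are used so that the snippet also runs in strict mode:

```javascript
var noSuchGroup = new RegExp("\\2");          // same pattern as the literal /\2/
console.log( noSuchGroup.test("\u0002") );    // true: the character with code 2
console.log( noSuchGroup.test("2") );         // false: not the digit "2"
console.log( new RegExp("\\8").test("8") );   // true: \8 matches the digit itself

var unicodeModeError = null;
try {
    new RegExp("\\2", "u");                   // no group 2 exists
} catch (e) {
    unicodeModeError = e;
}
console.log( unicodeModeError instanceof SyntaxError ); // true
```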

4. Non-capturing grouping

The groups seen so far capture the data they match for later reference, so they are called capturing groups.

If you only want the basic grouping function of parentheses and do not need to reference the group, neither through the API nor inside the regex, you can use the non-capturing group (?:p) instead. For example, the first example in this chapter can be rewritten as:

var regex = /(?:ab)+/g;
var string = "ababa abbb ababab";
console.log( string.match(regex) ); 
// => ["abab", "ab", "ababab"]
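
The difference becomes visible in the result of exec, as this short comparison sketches: the capturing version returns an extra entry for the group, the non-capturing version does not.

```javascript
var capturing = /(ab)+/.exec("ababa");
var nonCapturing = /(?:ab)+/.exec("ababa");

// With a quantified capturing group, the entry holds the last repetition.
console.log( capturing.length, capturing[0], capturing[1] );
// => 2 "abab" "ab"
console.log( nonCapturing.length, nonCapturing[0] );
// => 1 "abab"
```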

5. Related cases

That completes the discussion of what parentheses do. In short, they provide groups for us to use; how to use them is up to us.

5.1 String trim method simulation

The trim method removes the whitespace at the beginning and end of a string. There are two approaches.

The first matches the leading and trailing whitespace and replaces it with an empty string:

function trim(str) {
    return str.replace(/^\s+|\s+$/g, '');
}
console.log( trim("  foobar   ") ); 
// => "foobar"

The second matches the entire string and uses a reference to extract the relevant part:

function trim(str) {
    return str.replace(/^\s*(.*?)\s*$/g, "$1");
}
console.log( trim("  foobar   ") ); 
// => "foobar"

The lazy quantifier *? is used here; otherwise the greedy .* would also swallow the whitespace before the end of the string, leaving nothing for the trailing \s* to remove.

Of course, the former is more efficient.
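
Besides efficiency, the two versions also differ on strings that contain a line break, because . does not match a newline. The sketch below uses our own function names to keep the two versions apart and compares them with the built-in String.prototype.trim:

```javascript
function trimByAnchors(str) {
    return str.replace(/^\s+|\s+$/g, '');
}
function trimByGroup(str) {
    return str.replace(/^\s*(.*?)\s*$/g, "$1");
}

var multiline = "  foo\nbar  ";
console.log( JSON.stringify(trimByAnchors(multiline)) ); // "foo\nbar"
// (.*?) cannot cross the newline, so the regex never matches
// and the string is returned unchanged.
console.log( JSON.stringify(trimByGroup(multiline)) );   // "  foo\nbar  "
console.log( JSON.stringify(multiline.trim()) );         // "foo\nbar"
```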

5.2 Convert the first letter of each word to uppercase

function titleize(str) {
    return str.toLowerCase().replace(/(?:^|\s)\w/g, function(c) {
        return c.toUpperCase();
    });
}
console.log( titleize('my name is epeli') ); 
// => "My Name Is Epeli"

The idea is to find the first letter of each word. A capturing group would work here just as well as the non-capturing one.
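
For comparison, here is the same function written with a capturing group (the name titleizeCapturing is ours). The callback only uses its first argument, the whole match, so the captured group is simply ignored:

```javascript
function titleizeCapturing(str) {
    return str.toLowerCase().replace(/(^|\s)\w/g, function(whole) {
        // whole is the optional whitespace plus the first letter, e.g. " n"
        return whole.toUpperCase();
    });
}
console.log( titleizeCapturing('my name is epeli') );
// => "My Name Is Epeli"
```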

5.3 Camel case

function camelize(str) {
    return str.replace(/[-_\s]+(.)?/g, function(match, c) {
        return c ? c.toUpperCase() : '';
    });
}
console.log( camelize('-moz-transform') ); 
// => "MozTransform"

The group (.) captures the first letter of a word. A word is defined here as being preceded by one or more hyphens, underscores, or whitespace characters. The ? after the group handles the case where the separators at the end of str are not followed by any character, for example when str is '-moz-transform ' (with trailing whitespace).
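
The effect of the ? can be seen by comparing the function with a variant that omits it. The variant camelizeWithoutOptional is written here only for the comparison:

```javascript
function camelize(str) {
    return str.replace(/[-_\s]+(.)?/g, function(match, c) {
        return c ? c.toUpperCase() : '';
    });
}
// Same regex without the ? after the group.
function camelizeWithoutOptional(str) {
    return str.replace(/[-_\s]+(.)/g, function(match, c) {
        return c.toUpperCase();
    });
}

// The input ends with a space that is not followed by any character.
console.log( JSON.stringify(camelize('-moz-transform ')) );
// => "MozTransform"
console.log( JSON.stringify(camelizeWithoutOptional('-moz-transform ')) );
// => "MozTransform "
```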

5.4 Dasherize (hyphen-separated form)

function dasherize(str) {
    return str.replace(/([A-Z])/g, '-$1').replace(/[-_\s]+/g, '-').toLowerCase();
}
console.log( dasherize('MozTransform') ); 
// => "-moz-transform"

This is the reverse of camel casing.
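
A quick round trip, sketched below with both functions repeated so that the snippet runs on its own, shows the two working as inverses for this input. It also shows that the round trip is not lossless in general, since underscores and spaces all become hyphens:

```javascript
function camelize(str) {
    return str.replace(/[-_\s]+(.)?/g, function(match, c) {
        return c ? c.toUpperCase() : '';
    });
}
function dasherize(str) {
    return str.replace(/([A-Z])/g, '-$1').replace(/[-_\s]+/g, '-').toLowerCase();
}

var dashed = dasherize('MozTransform');
var restored = camelize(dashed);
console.log(dashed, restored);
// => -moz-transform MozTransform

// Underscores are normalized to hyphens, so the original form is not recovered.
console.log( camelize(dasherize('moz_transform')) );
// => "mozTransform"
```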

5.5 HTML escaping and unescaping

// Convert HTML special characters to their equivalent entities
function escapeHTML(str) {
    var escapeChars = {
      '¢' : 'cent',
      '£' : 'pound',
      '¥' : 'yen',
      '€': 'euro',
      '©' :'copy',
      '®' : 'reg',
      '<' : 'lt',
      '>' : 'gt',
      '"' : 'quot',
      '&' : 'amp',
      '\'' : '#39'
    };
    // Build a character group from the keys of the map.
    var regex = new RegExp('[' + Object.keys(escapeChars).join('') + ']', 'g');
    return str.replace(regex, function(match) {
        return '&' + escapeChars[match] + ';';
    });
}
console.log( escapeHTML('<div>Blah blah blah</div>') );
// => "&lt;div&gt;Blah blah blah&lt;/div&gt;"

This uses a regex built with the RegExp constructor and then replaces each matched character with its entity, which has little to do with this chapter.

The reverse process, however, does use parentheses to provide a reference, and it is also very simple:

// Convert entities back to their equivalent HTML characters
function unescapeHTML(str) {
    var htmlEntities = {
      nbsp: ' ',
      cent: '¢',
      pound: '£',
      yen: '¥',
      euro: '€',
      copy: '©',
      reg: '®',
      lt: '<',
      gt: '>',
      quot: '"',
      amp: '&',
      apos: '\''
    };
    return str.replace(/\&([^;]+);/g, function(match, key) {
        if (key in htmlEntities) {
            return htmlEntities[key];
        }
        return match;
    });
}
console.log( unescapeHTML('&lt;div&gt;Blah blah blah&lt;/div&gt;') );
// => "<div>Blah blah blah</div>"

The key parameter receives the content of the captured group, which is then used as the key into the object.
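
One detail worth noting: escapeHTML above emits the numeric entity &#39; for the apostrophe, while the table in unescapeHTML only contains the name apos, so "&#39;" is left untouched by the reverse function. A minimal sketch of a companion step that decodes decimal numeric entities, again with a captured group, could look like this. The function name and the range check are our own additions:

```javascript
function unescapeNumericEntities(str) {
    // The group (\d{1,7}) captures the decimal code point.
    return str.replace(/&#(\d{1,7});/g, function(match, digits) {
        var codePoint = Number(digits);
        // Leave out-of-range values untouched instead of throwing a RangeError.
        return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
    });
}
console.log( unescapeNumericEntities('it&#39;s') );
// => "it's"
```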

5.6 Matching paired tags

The following should match:

<title>regular expression</title>
<p>laoyao bye bye</p>

The following should not match:

<title>wrong!</p>

To match an opening tag, you can use the regex <[^>]+> ,

To match a closing tag, you can use <\/[^>]+> ,

But to match paired tags, you need a backreference, for example:

var regex = /<([^>]+)>[\d\D]*<\/\1>/;
var string1 = "<title>regular expression</title>";
var string2 = "<p>laoyao bye bye</p>";
var string3 = "<title>wrong!</p>";
console.log( regex.test(string1) ); // true
console.log( regex.test(string2) ); // true
console.log( regex.test(string3) ); // false

The opening tag <[^>]+> is changed to <([^>]+)> ; the parentheses provide a group for the later backreference. The closing tag then uses the backreference: <\/\1> .

In addition, [\d\D] means the character is either a digit or a non-digit, so it matches any character.
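
Note that the group ([^>]+) captures everything between < and >, attributes included, so an opening tag with attributes no longer pairs with its closing tag. As a hedged sketch (not a general HTML parser), the group can be restricted to the tag name. The pattern tagNameOnly and the sample string are our own:

```javascript
var wholeTagCaptured = /<([^>]+)>[\d\D]*<\/\1>/;
// Capture only the tag name; [^>]* then skips any attributes.
var tagNameOnly = /<([a-zA-Z][\w-]*)\b[^>]*>[\d\D]*<\/\1>/;

var withAttribute = '<p class="note">laoyao bye bye</p>';
console.log( wholeTagCaptured.test(withAttribute) );    // false
console.log( tagNameOnly.test(withAttribute) );         // true
console.log( tagNameOnly.test("<title>wrong!</p>") );   // false
```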

Chapter 3 Summary

There are far too many examples of parentheses in regular expressions to list them all.

The important thing is to understand that parentheses provide grouping, from which we can extract data; with that, the rest follows.

The code in the examples has not been analyzed in much detail, but I believe you can follow it.

Chapter 4 Principles of Regular Expression Backtracking

To learn regular expressions, you need to understand a few matching principles.

When studying how matching works, one word comes up again and again: "backtracking".

It sounds sophisticated, and many people indeed do not understand it.

This chapter therefore gives a brief explanation of what backtracking really is.

The contents include:

Matching without backtracking
Matching with backtracking
Common backtracking forms

1. Matching without backtracking

Suppose our regex is /ab{1,3}c/ ; its visual form is:

When the target string is "abbbc", there is no "backtracking" at all. The matching process is:

Here the subexpression b{1,3} means that the character "b" appears 1 to 3 times in a row.

2. Matching with backtracking

If the target string is "abbc", backtracking occurs along the way.

Step 5 in the figure is marked in red, indicating that the attempt failed. At this point b{1,3} has already matched two "b" characters; when it tries for a third, it finds that the next character is "c". So b{1,3} is considered finished, the state returns to the previous one (step 6, which is the same as step 4), and finally the subexpression c matches the character "c". At that point the whole expression has matched successfully.

Step 6 in the figure is the "backtracking".

That may not feel like much yet, so here is another example. The regex is:

The target string is "abbbc" and the matching process is:

Steps 7 and 10 are backtracking. Step 7 is the same as step 4, where b{1,3} has matched two "b" characters; step 10 is the same as step 3, where b{1,3} has matched only one "b", which is also its final result.

Here is another clear case of backtracking. The regex is /".*"/ .

The target string is "acd"ef (including the double quotes), and the matching process is:

The failed attempts to match the double quote are omitted from the figure. Evidently .* has a large impact on efficiency.

To reduce this unnecessary backtracking, the regex can be changed to /"[^"]*"/ .
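
Apart from efficiency, the two forms can also give different results once the string contains more than one quoted segment, as this small sketch shows (the sample string is ours):

```javascript
var text = '"a" and "b"';

// .* runs to the end of the string and backtracks only to the LAST quote.
console.log( text.match(/".*"/)[0] );
// => "a" and "b"

// [^"]* stops at the first closing quote, so no backtracking is needed.
console.log( text.match(/"[^"]*"/)[0] );
// => "a"
```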

3. Common backtracking forms

The way a regular expression matches a string has a formal name: backtracking.

Backtracking is also called trial and error. Its basic idea is: starting from some state of the problem (the initial state), search all the "states" reachable from it; when a path reaches a dead end (it cannot go any further), go back one or several steps, start again from another possible "state", and continue searching until every "path" (state) has been tried. This method of repeatedly moving forward and going back to look for a solution is called backtracking. (Quoted from Baidu Encyclopedia.)

It is essentially a depth-first search algorithm. The step of going back to an earlier state is what is called "backtracking". As the description above shows, backtracking happens when a path fails; that is, when an attempt to match fails, the next step is usually to backtrack.
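
The depth-first idea can be sketched with a toy matcher. This is not how a real engine is implemented: it only understands "character ch repeated min to max times, greedily", it only matches from index 0, and the way it counts a backtrack (returning to a saved state after a failed attempt) is our own convention. The pattern /ab{1,3}bbc/ in the last test is chosen here purely for illustration:

```javascript
function matchWithBacktracking(tokens, str) {
    var backtracks = 0;

    // Match tokens[ti..] against str from index si;
    // n copies of tokens[ti] have already been consumed.
    function run(ti, si, n) {
        if (ti === tokens.length) {
            return true; // every token matched
        }
        var token = tokens[ti];
        if (n < token.max) {
            // Greedy: first try to consume one more repetition (go deeper).
            if (str[si] === token.ch && run(ti, si + 1, n + 1)) {
                return true;
            }
            // That path failed. If n repetitions are already enough,
            // return to this saved state: a backtrack.
            if (n >= token.min) {
                backtracks++;
            }
        }
        // Move on to the next token, provided the minimum count is met.
        return n >= token.min && run(ti + 1, si, 0);
    }

    return { matched: run(0, 0, 0), backtracks: backtracks };
}

function lit(ch) { return { ch: ch, min: 1, max: 1 }; }
var b1to3 = { ch: "b", min: 1, max: 3 };

// Model of /ab{1,3}c/
var simple = [lit("a"), b1to3, lit("c")];
console.log( matchWithBacktracking(simple, "abbbc") ); // { matched: true, backtracks: 0 }
console.log( matchWithBacktracking(simple, "abbc") );  // { matched: true, backtracks: 1 }

// Model of /ab{1,3}bbc/
var harder = [lit("a"), b1to3, lit("b"), lit("b"), lit("c")];
console.log( matchWithBacktracking(harder, "abbbc") ); // { matched: true, backtracks: 2 }
```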

Good, now we understand. So where do regular expressions in JS produce backtracking?

3.1 Greedy quantifier

The previous examples all involved greedy quantifiers. Take b{1,3} : because it is greedy, it tries the possibilities from most to fewest. It first tries "bbb" and then checks whether the whole regex can match. If not, it gives back one "b" and tries again on the basis of "bb". If that still fails, it gives back another one and tries again. And if that fails too? Then the match simply fails.

Although the local match is greedy, the overall match must still succeed. As the saying goes, with the skin gone, what can the hair attach to?

At this point we cannot help asking: what happens when several greedy quantifiers sit next to each other and compete?

The answer: whoever comes first is served first, because the search is depth-first. The test is as follows:

var string = "12345";
var regex = /(\d{1,3})(\d{1,3})/;
console.log( string.match(regex) );
// => ["12345", "123", "45", index: 0, input: "12345"]

Among them, the preceding \d{1,3} matches "123", and the following \d{1,3} matches "45".

3.2 Lazy quantifier

A lazy quantifier is a greedy quantifier followed by a question mark. It matches as little as possible, for example:

var string = "12345";
var regex = /(\d{1,3}?)(\d{1,3})/;
console.log( string.match(regex) );
// => ["1234", "1", "234", index: 0, input: "12345"]

Among them, \d{1,3}? only matches one character "1", and the following \d{1,3} matches "234".

Although lazy quantifiers are not greedy, they can backtrack too. For example, the regex is:

The target string is "12345", and the matching process is:

"I know you are not greedy and are easily satisfied, but for the sake of the overall match there is no other way: I have to give you more." So the text finally matched by \d{1,3}? is "12", two digits rather than one.
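
The regex for this example is not shown in the text. The described outcome is what you would get if the earlier lazy regex were anchored at both ends, so as an assumption the sketch below uses /^(\d{1,3}?)(\d{1,3})$/ :

```javascript
var lazyAnchored = /^(\d{1,3}?)(\d{1,3})$/;
var result = "12345".match(lazyAnchored);

// The lazy group first tries "1", but then the second group can take at most
// "234" and $ fails, so the lazy group is forced to grow to "12".
console.log( result[0], result[1], result[2] );
// => 12345 12 345
```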

3.3 Branch structure

We know that branches are lazy as well. For example, when /can|candy/ is matched against the string "candy", the result is "can": branches are tried one by one, and once an earlier branch succeeds, the later ones are not tried.

With a branch structure, an earlier sub-pattern may match locally, but if the rest of the expression then fails to match, the engine goes on to try the remaining branches. This can also be regarded as a kind of backtracking.

For example, the regex is:

The target string is "candy", and the matching process is:

In step 5 of that process, the engine does not return to an earlier state of the same sub-pattern, but it does return to the branch structure and tries the next possibility. So it can be considered a kind of backtracking.
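
Both behaviors can be checked with a short sketch. The anchored regex /^(?:can|candy)$/ is our assumption for the omitted example: the first branch matches "can" locally, $ then fails, and the engine falls back to the second branch.

```javascript
// Unanchored: the first branch succeeds, so "candy" is never tried.
console.log( "candy".match(/can|candy/)[0] );
// => "can"

// Anchored: "can" matches locally but $ fails, so the second branch is tried.
console.log( "candy".match(/^(?:can|candy)$/)[0] );
// => "candy"
```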

Chapter 4 Summary

In fact, backtracking is easy to grasp.

In short: because there are many possibilities, the engine has to try them one by one, until either the overall match succeeds at some step, or the last attempt fails and the overall match is known to be unsuccessful.

The greedy quantifier "tries" like a customer haggling over clothes: the price is too high, make it cheaper; still no, cheaper again.
The lazy quantifier "tries" like a seller raising the price: that is too little, give a bit more; still too little, a bit more again.
The branch structure "tries" like shopping around: this shop will not do, try the next one; still no good, try another.

Because of backtracking, matching is less efficient. Compared with what? Compared with DFA engines.

The JS regex engine is an NFA engine; NFA is short for "nondeterministic finite automaton".

Regex engines in most languages are NFA engines. Why are they so popular?

Answer: "My matching may be slow, but I compile quickly, and I am more fun, too."

Chapter 5 Splitting Regular Expressions

How well a language is mastered can be measured from two angles: reading and writing.

You should be able not only to solve problems yourself, but also to understand other people's solutions. This holds for code, and for regular expressions as well.

Regex is a little different from other languages: it is usually just a string of characters, with no notion of a "statement".

Correctly splitting a long regex into separate pieces is therefore the key to deciphering this "book from heaven".

This chapter addresses that problem and covers:

Structures and Operators
Points to Note
Case analysis

1. Structures and Operators

Programming languages generally have operators, and wherever there are operators there is a question: when many operations appear together, which is performed first? To avoid ambiguity, the language itself defines the order of operations, the so-called precedence.

In regular expressions, operators are embodied in structures, that is, special wholes formed from special characters and ordinary characters.

What structures are there in JS regular expressions?

Literals, character groups, quantifiers, anchors, groups, alternation branches, and backreferences.

Their meanings are briefly reviewed below (skip this if you already know them):

Literal: matches one specific character, whether or not it needs escaping. For example, a matches the character "a", \n matches a newline, and \. matches a literal dot.
Character group: matches one character out of several possibilities. For example, [0-9] matches a digit and has the short form \d . There is also the negated character group, which matches any character except the listed ones, such as [^0-9] , a non-digit, with the short form \D .
Quantifier: indicates that a character appears consecutively. For example, a{1,3} means the character "a" appears 1 to 3 times in a row. There are also common shorthand forms, such as a+ , which means "a" appears at least once in a row.
Anchor: matches a position rather than a character. For example, ^ matches the beginning of the string, \b matches a word boundary, and (?=\d) matches a position that is followed by a digit.
Grouping: uses parentheses to treat something as a whole. For example, (ab)+ means the two characters "ab" appear one or more times in a row; the non-capturing form (?:ab)+ can also be used.
branch , multiple sub-expressions can be selected, such as abc|bcd , the expression matches the "
