Linux Three Musketeers' Awk Getting Started Guide

Hello everyone, I'm xindoo. It has been almost a month and a half since I last posted a technical article, because I have been very busy recently. On weekdays, apart from eating and sleeping, I am either at work or on the way to work, and on the weekend I just want ...
Today is 1024 Programmers' Day, and I am taking time out of my busy schedule to join in the fun by posting an article I have wanted to write for a long time: a simple tutorial on awk, a command-line tool. Anyone who knows me knows that I started out in operations. I did not pick up many skills there, but I did become very fluent with a few command-line tools, and awk is one of them. Later, after I switched to development, that fluency let me solve many small problems quickly. The convenience and efficiency of the command line have amazed my colleagues many times.

Combining command-line tools through pipes can solve many problems extremely quickly. I will not go into that here; if you are interested, you can read my blog post "Some of the Linux commands I often use". Today's protagonist is awk, a very powerful text processing tool. I use it daily to clean, filter, and inspect data, and even to compute simple statistics. It is no exaggeration to say that work that takes some people hours, or that they cannot do at all, I can finish with awk in minutes. To others it looks like black magic.

That may sound abstract, so here is a concrete example. A colleague once needed to split a text file with tens of millions of lines (more than 500 MB) evenly into two files. What he actually wanted was to divide tens of millions of users evenly and randomly into two groups for a comparison experiment. How would you do it? I did it with two awk one-liners: about 20 seconds to type the commands and half a minute to run them.

cat users.txt |awk 'NR%2==0 {print $1}' > 0.txt
cat users.txt |awk 'NR%2==1 {print $1}' > 1.txt
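
The same split can probably also be done in a single pass over the file, because awk can redirect print to a file name computed for each line. The following is only a minimal sketch: it runs in a temporary directory, and the five user IDs are made-up sample data, not the original file.

```shell
# Work in a scratch directory so no real file is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample input: one user ID per line.
printf '%s\n' u1 u2 u3 u4 u5 > users.txt

# NR % 2 is 0 or 1; concatenated with ".txt" it names the output file,
# so even-numbered lines go to 0.txt and odd-numbered lines to 1.txt.
awk '{ print $1 > (NR % 2 ".txt") }' users.txt

even_ids=$(cat 0.txt)   # lines 2 and 4
odd_ids=$(cat 1.txt)    # lines 1, 3 and 5
printf 'even lines:\n%s\nodd lines:\n%s\n' "$even_ids" "$odd_ids"
```

Note that splitting by line parity is only as random as the order of the input file; if the file is sorted in some meaningful way, it may need to be shuffled first.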

I have said all this just to show how powerful awk is. So what is awk? Many beginners think of awk as a text processing tool; together with grep and sed it is known as one of the Three Musketeers of Linux text processing. In fact awk is more than a text processing tool: it is also a programming language, one that provides many built-in variables and functions for text processing (detailed later), which is what makes text processing so easy. Now follow me, from the basics to more advanced topics, to learn how to use awk.

Basic use

The basic usage of awk is awk + program + text file. It can also read its input from a Linux pipe. The two forms are as follows.

awk program textfile 
cat textfile | awk program

Awk processes data line by line, which means that its instructions are executed once for each line of input, as in the following example.

cat a.txt| awk '{print $1, $3}'

The command above prints the first and third columns of every line in the file. Column numbers start from 1, and $0 has a special meaning: it refers to the entire line. By default, awk splits columns on spaces or tabs. When a text file separates columns with some other character (such as -), the -F option specifies the separator.

cat a.txt| awk -F'-' '{print $1, $3}'
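
To see field splitting in action without needing a file, the input can be piped in directly. This is a small sketch; the two dash-separated lines are invented sample data.

```shell
# Hypothetical sample: three dash-separated columns per line.
sample='a-b-c
d-e-f'

# $1 and $3 are the first and third fields of each line.
fields=$(printf '%s\n' "$sample" | awk -F'-' '{ print $1, $3 }')

# $0 is the whole line, unchanged.
whole=$(printf '%s\n' "$sample" | awk -F'-' '{ print $0 }')

printf '%s\n' "$fields"
printf '%s\n' "$whole"
```

The comma in print joins the fields with the output field separator, which is a single space unless changed.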

Built-in variables

One reason awk is so good at processing text is that it provides many built-in variables that make it easy to get information about the text, such as the current line number (NR), the number of columns in the current line (NF), and the name of the file being processed (FILENAME). Here are some of them:

Variable	Meaning
$0	The entire current line
$1~$n	Columns 1 through n of the current line
NF	Number of columns in the current line
NR	Number of the current line, starting from 1
FS	The input field separator; whitespace (spaces or tabs) by default
RS	The input record separator; a newline by default
OFS	The output field separator; a space by default
ORS	The output record separator; a newline by default
ARGC	Number of command-line arguments
ARGV	Array of command-line arguments
FILENAME	Name of the current input file
IGNORECASE	If true, matching ignores case (gawk only)
ARGIND	Index in ARGV of the file currently being processed (gawk only)

For example, to print the line number and the number of columns for every line of a text file a.txt whose columns are separated by |, I can write:

cat a.txt | awk -F'|' '{print NR, NF}'
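
A slightly fuller sketch of the same idea, combining several built-in variables at once. The three input lines are made up, and setting OFS with -v is just one way to change the output separator.

```shell
# Hypothetical sample with a varying number of |-separated columns.
sample='id|name|city
1|alice|paris
2|bob'

# NR = line number, NF = column count, $NF = the last column of the line.
# OFS="," makes print join its arguments with commas instead of spaces.
report=$(printf '%s\n' "$sample" | awk -F'|' -v OFS=',' '{ print NR, NF, $NF }')

printf '%s\n' "$report"
```

Because NF holds the number of columns, $NF always refers to the last column, however many there are.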

In my blog post "Awk implementation of SQL-like join operations" I combined several built-in variables to do more complex processing across multiple text files, and it was easy to implement.
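
That post is not reproduced here. As a rough illustration of the idea only, and not necessarily the approach used there, one common idiom for joining two files on a key compares NR with FNR (the line number within the current file, a standard variable not listed in the table above) to tell the first file from the second. The file names and contents below are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical inputs: users.txt maps id -> name, orders.txt maps id -> item.
printf '%s\n' '1 alice' '2 bob' > users.txt
printf '%s\n' '1 book' '2 pen' '1 lamp' '3 cup' > orders.txt

# While the first file is read, NR equals FNR, so names are stored by id.
# For the second file, only rows whose id was stored are printed,
# which behaves like an inner join on the first column.
joined=$(awk 'NR == FNR { name[$1] = $2; next }
              $1 in name { print $1, name[$1], $2 }' users.txt orders.txt)

printf '%s\n' "$joined"
```

The order with id 3 is dropped because no user with that id exists in the first file.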

Built-in functions

Besides built-in variables, awk also has many commonly used functions built in. I will not go into detail here; see https://www.runoob.com/w3cnote/awk-built-in-functions.html for the full list. The built-in functions fall mainly into the following categories:

Arithmetic functions
String functions
Time functions
Bit manipulation functions
Other functions

These built-in functions can complete most common operations. If these built-in functions are not enough, as I said just now, awk is a programming language and you can implement whatever functions you need.
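
As a quick taste, here is a sketch that calls a handful of string and arithmetic functions from a BEGIN block, so no input file is needed. Time and bit manipulation functions are left out because they are not part of POSIX awk and may be missing in some implementations.

```shell
out=$(awk 'BEGIN {
    s = "hello awk"
    print length(s)                  # string length
    print toupper(s)                 # upper-case copy
    print substr(s, 7, 3)            # 3 characters starting at position 7
    print index(s, "awk")            # 1-based position of a substring
    n = split("a,b,c", parts, ",")   # split into an array, returns the count
    print n, parts[1], parts[3]
    print int(7 / 2)                 # truncate to an integer
    printf "%.2f\n", 2 / 3           # formatted output, as in C
}')

printf '%s\n' "$out"
```

String positions in awk start at 1, not 0, which is why substr(s, 7, 3) returns the last three characters here.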

Syntax

Next, let's look at the basics of awk as a programming language, beyond one-liners on the command line.

Variables

Let's start with variables. Besides the built-in variables mentioned above, you can also define your own. As in Python, variables need no declaration and can be used directly, and awk is weakly typed. For example, to compute the sum and average of the second column of a text file, you can write:

cat a.txt |awk '{sum += $2; cnt += 1} END {print sum, sum/cnt}'
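
One detail worth guarding against: if the input is empty, cnt is never set and sum/cnt becomes a division by zero. A minimal sketch with a check in the END block follows; the sample numbers and the "no data" message are made up for illustration.

```shell
# The awk program is kept in a shell variable so it can be reused below.
prog='{ sum += $2; cnt += 1 }
END { if (cnt > 0) print sum, sum / cnt; else print "no data" }'

# Hypothetical sample: the second column holds the numbers to aggregate.
nums='a 10
b 20
c 60'

stats=$(printf '%s\n' "$nums" | awk "$prog")   # sum and average
empty=$(printf '' | awk "$prog")               # empty input hits the else branch

printf '%s\n' "$stats" "$empty"
```

Uninitialized variables in awk act as 0 in numeric context and as an empty string in string context, which is why sum and cnt need no setup.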

Here sum and cnt are our own variables; you simply write them as you use them. Besides simple variables, awk also supports a more complex data structure, the map (associative array). For example, suppose we have a group of people's daily weight records for the last month and want to know each person's average weight this month. The data looks like this, with three columns: name, date, and weight.

张三 2021-10-01 67.7
李四 2021-10-01 83.9
张三 2021-10-02 68.1
李四 2021-10-02 85.0
张三 2021-10-03 68.3
张三 2021-10-01 67.9
李四 2021-10-03 84.0
...

Using maps in awk, you can keep each person's weight total (sum) and record count (cnt) separately, then print the results once all the data has been processed. The code is as follows:

cat a.txt|awk '{sum[$1] += $3;cnt[$1] += 1} END {for (key in sum) {print key, sum[key]/cnt[key]}}'
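
Here is a self-contained sketch of the same aggregation that can be pasted into a shell. It uses the weight figures shown above but with hypothetical ASCII names (zhangsan, lisi). Since the iteration order of for (key in sum) is not defined, the output is piped through sort to make it stable.

```shell
# Sample records: name, date, weight (names are ASCII stand-ins).
weights='zhangsan 2021-10-01 67.7
lisi 2021-10-01 83.9
zhangsan 2021-10-02 68.1
lisi 2021-10-02 85.0
zhangsan 2021-10-03 68.3
zhangsan 2021-10-01 67.9
lisi 2021-10-03 84.0'

# sum[] and cnt[] are maps keyed by the name in column 1.
averages=$(printf '%s\n' "$weights" |
    awk '{ sum[$1] += $3; cnt[$1] += 1 }
         END { for (key in sum) printf "%s %.2f\n", key, sum[key] / cnt[key] }' |
    sort)

printf '%s\n' "$averages"
```

printf with %.2f is used instead of print so that the averages are shown with a fixed number of decimals.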

Conditionals

As you may have noticed in the examples above, sometimes a condition is needed. In the first example, splitting the text, I split the file in two according to the parity of the line number, so different logic has to run depending on the condition. Conditions in awk are also very simple.

awk 'expr { statement }' # the statement block in braces runs only when expr is true

END, which has appeared several times above, means that the following code block runs only after all lines have been processed. Its counterpart is BEGIN, whose code block runs before any input is processed, so it is usually used for initialization. Your own conditions can be written the same way. In addition, awk supports if else, written as follows:

cat a.txt |awk '{if (NR%2==1) print NR, $1 ; else print NR, $2}'  # on odd lines print the line number and the first column; otherwise print the line number and the second column
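
The three kinds of rules described above (BEGIN, a condition, and END) can be combined in one program. A small sketch follows; the input numbers, the threshold of 10, and the variable name hits are arbitrary choices for illustration.

```shell
report=$(printf '%s\n' 5 12 7 30 | awk '
    BEGIN   { print "big numbers:" }      # runs once, before any input
    $1 > 10 { print NR, $1; hits += 1 }   # runs only for lines that match
    END     { print "matched", hits }     # runs once, after all input
')

printf '%s\n' "$report"
```

Each input line is tested against every condition in order, so a program can hold as many condition and block pairs as needed.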

Loops

Awk also supports for and while loops, which work the same way as for and while loops in C:

for (initialisation; condition; increment/decrement)
    action

while (condition) 
    action
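
A typical use of loops in text processing is walking over the columns of a line. The sketch below sums the fields with a for loop and rebuilds them in reverse order with a while loop; the input lines are made up.

```shell
totals=$(printf '%s\n' '1 2 3' '10 20' | awk '{
    total = 0
    for (i = 1; i <= NF; i++) {    # visit every field, left to right
        total += $i
    }

    reversed = $NF                 # start from the last field
    i = NF - 1
    while (i >= 1) {               # walk back toward the first field
        reversed = reversed " " $i
        i--
    }

    print "sum=" total, "reversed=" reversed
}')

printf '%s\n' "$totals"
```

Here $i means "the field whose number is stored in i", so the same loop works for lines with any number of columns.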

As an example, here is awk printing all prime numbers between 0 and 100. It combines the loops and conditionals described above. Apart from variable definitions, it is basically the same as C.

BEGIN {
   i = 2;
   while (i < 100) {
      isPrime = 1;
      for (j = 2; j < i; j++) {
          if (i % j == 0) {
              isPrime = 0;
          }
      }
      if (isPrime == 1) {
          print i;
      }
      i += 1; 
   }
}

If the code is too long to fit comfortably on the command line, you can save it in a file and run it with awk -f, for example:

awk -f getPrime.awk
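
To try this end to end, the sketch below writes a shortened variant of the prime program to a file in a temporary directory and runs it with -f. The limit variable, passed with -v, is a hypothetical addition so the upper bound is not hard-coded; it is not part of the program shown earlier.

```shell
workdir=$(mktemp -d)

# Save the awk program to a file. The quoted heredoc keeps the shell
# from touching anything inside it.
cat > "$workdir/getPrime.awk" <<'EOF'
BEGIN {
    for (i = 2; i < limit; i++) {
        isPrime = 1
        for (j = 2; j * j <= i; j++) {   # divisors up to sqrt(i) suffice
            if (i % j == 0) { isPrime = 0; break }
        }
        if (isPrime) print i
    }
}
EOF

# -v sets an awk variable before the program starts; -f names the file.
primes=$(awk -v limit=20 -f "$workdir/getPrime.awk")

printf '%s\n' "$primes"
```

A program that consists only of a BEGIN block never reads input, so no file or pipe is needed when running it.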

Functions

Defining a function in awk is also very simple, much like in JavaScript. For details, see https://www.runoob.com/w3cnote/awk-user-defined-functions.html

function isPrime(n) {
   # try every candidate divisor j of the argument n
   for (j = 2; j < n; j++) {
      if (n % j == 0) {
         return 0;
      }
   }
   return 1;
}

BEGIN {
   i = 2;
   while (i < 100) {
      if (isPrime(i)) {
          print i;
      }
      i += 1; 
   }
}
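
One thing to keep in mind is that variables used inside an awk function are global unless they appear in the parameter list. The usual idiom is to list local variables as extra parameters that callers never pass. The following sketch shows this; the extra spaces before j are only a convention, and the global j in BEGIN exists purely to demonstrate that it is left alone.

```shell
primes=$(awk '
# j is listed as an extra parameter, so it is local to the function.
# Callers simply do not pass it.
function isPrime(n,    j) {
    if (n < 2) return 0
    for (j = 2; j * j <= n; j++) {
        if (n % j == 0) return 0
    }
    return 1
}

BEGIN {
    j = "untouched"                  # a global that shares the name j
    for (i = 2; i < 30; i++) {
        if (isPrime(i)) printf "%d ", i
    }
    print j                          # the global was not modified
}')

printf '%s\n' "$primes"
```

Without the extra parameter, the loop inside the function would overwrite the global j on every call.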

None of the syntax above will be unfamiliar to anyone who has learned a programming language. It is very simple.

Concluding remarks

As a language, awk seems very niche, and it appears to have no advantage over more mainstream programming languages. But it focuses solely on text processing, and in that area it excels. Still, there is a real trend: with the emergence of distributed text search tools (such as Elasticsearch), fewer and fewer people use awk. Perhaps such excellent command-line tools will gradually be forgotten by a new generation of programmers, lost in the long river of history... So I hope this article helps more people get to know awk.

In addition, this article is only a brief introduction to the basic features of awk. To become proficient you will need to consult other material yourself and, above all, practice a lot. Today I also listened to the live stream of the CSDN 1024 online event and happened to hear some top programmers' advice for ordinary programmers. It was all familiar advice, really. Everyone knows the principles, yet most people still end up mediocre, and the core reason is a lack of practice and accumulation. Without accumulating small steps, you cannot travel a thousand miles; without gathering small streams, you cannot form a river.
