A few hundred lines of code to implement a JSON parser

crossoverJie

Foreword

While I was writing gscript, I kept wondering whether there was a more practical tool that applies compiler principles. After all, writing a real language is not easy, and getting it used in practice is harder still.

Then I happened to see someone mention JSON parsers. This kind of tool is everywhere in our daily development and is very widely used.

I had wondered before how one is implemented, but anything that touched on compiler principles used to scare me off. After this period of practice, though, I found that implementing a JSON parser is not that difficult; it only needs some knowledge of the compiler front end.

Thanks to JSON's light weight and simple syntax, the core code of a JSON parser takes only about 800 lines.

First, let's take a look at the result:

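import (
    "fmt"
    "testing"

    "github.com/crossoverJie/gjson"
    "github.com/stretchr/testify/assert" // assumed: the assert calls below match testify's API
)
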
func TestJson(t *testing.T) {
    str := `{
   "glossary": {
       "title": "example glossary",
        "age":1,
        "long":99.99,
        "GlossDiv": {
           "title": "S",
            "GlossList": {
               "GlossEntry": {
                   "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                       "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML", true, null]
                   },
                    "GlossSee": "markup"
               }
           }
       }
   }
}`
    decode, err := gjson.Decode(str)
    assert.Nil(t, err)
    fmt.Println(decode)
    v := decode.(map[string]interface{})
    glossary := v["glossary"].(map[string]interface{})
    assert.Equal(t, glossary["title"], "example glossary")
    assert.Equal(t, glossary["age"], 1)
    assert.Equal(t, glossary["long"], 99.99)
    glossDiv := glossary["GlossDiv"].(map[string]interface{})
    assert.Equal(t, glossDiv["title"], "S")
    glossList := glossDiv["GlossList"].(map[string]interface{})
    glossEntry := glossList["GlossEntry"].(map[string]interface{})
    assert.Equal(t, glossEntry["ID"], "SGML")
    assert.Equal(t, glossEntry["SortAs"], "SGML")
    assert.Equal(t, glossEntry["GlossTerm"], "Standard Generalized Markup Language")
    assert.Equal(t, glossEntry["Acronym"], "SGML")
    assert.Equal(t, glossEntry["Abbrev"], "ISO 8879:1986")
    glossDef := glossEntry["GlossDef"].(map[string]interface{})
    assert.Equal(t, glossDef["para"], "A meta-markup language, used to create markup languages such as DocBook.")
    glossSeeAlso := glossDef["GlossSeeAlso"].(*[]interface{})
    assert.Equal(t, (*glossSeeAlso)[0], "GML")
    assert.Equal(t, (*glossSeeAlso)[1], "XML")
    assert.Equal(t, (*glossSeeAlso)[2], true)
    assert.Equal(t, (*glossSeeAlso)[3], "")
    assert.Equal(t, glossEntry["GlossSee"], "markup")
}

As this test case shows, strings, booleans, floats, integers, arrays, and arbitrary nesting are all supported.

Implementation principle

Here is a brief description of the implementation principle, which essentially comes down to two steps:

  1. Lexical analysis: parse the raw input JSON string into tokens, that is, symbols such as "{", "obj", "age", "1", "[" and "]", and classify each of them.
  2. Syntax analysis: read the generated token set as a stream and finally build the tree structure shown in the figure, that is, a JSONObject.

Let's focus on what these two steps do.

Lexical analysis

 BeginObject  {
String  "name"
SepColon  :
String  "cj"
SepComma  ,
String  "object"
SepColon  :
BeginObject  {
String  "age"
SepColon  :
Number  10
SepComma  ,
String  "sex"
SepColon  :
String  "girl"
EndObject  }
SepComma  ,
String  "list"
SepColon  :
BeginArray  [

In fact, lexical analysis is the process of constructing a finite automaton (DFA). Its purpose is to generate such a set of tokens, and we need to classify these tokens so that they can be processed in the subsequent syntax analysis.

For example, the left curly brace "{" is a BeginObject, which marks the start of an object declaration, while "}" is an EndObject, which marks its end.

Here "name" is treated as a String, and likewise "[" represents BeginArray.

Here I define the following token types:

 type Token string
const (
    Init        Token = "Init"
    BeginObject       = "BeginObject"
    EndObject         = "EndObject"
    BeginArray        = "BeginArray"
    EndArray          = "EndArray"
    Null              = "Null"
    Null1             = "Null1"
    Null2             = "Null2"
    Null3             = "Null3"
    Number            = "Number"
    Float             = "Float"
    BeginString       = "BeginString"
    EndString         = "EndString"
    String            = "String"
    True              = "True"
    True1             = "True1"
    True2             = "True2"
    True3             = "True3"
    False             = "False"
    False1            = "False1"
    False2            = "False2"
    False3            = "False3"
    False4            = "False4"
    // SepColon :
    SepColon = "SepColon"
    // SepComma ,
    SepComma = "SepComma"
    EndJson  = "EndJson"
)

As you can see, true/false/null each have several token types. Ignore that for now; it is explained later.

Take this JSON as an example: {"age":1}. Its state transitions are shown below:

In short, we traverse the string character by character, update a global state, and perform different operations depending on the value of that state.

Part of the code is as follows:

Interested readers can run a single test case in the debugger, which makes it easy to understand:

https://github.com/crossoverJie/gjson/blob/main/token_test.go
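
The traversal described above can be sketched roughly as follows. This is only a minimal illustration and not the actual gjson code: the names tokenType, token, tokenize and the st* states are hypothetical, and the sketch only handles objects, strings without escapes, non-negative integers, colons and commas.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenType and token are hypothetical names used only in this sketch.
type tokenType string

const (
	tBeginObject tokenType = "BeginObject"
	tEndObject   tokenType = "EndObject"
	tString      tokenType = "String"
	tNumber      tokenType = "Number"
	tSepColon    tokenType = "SepColon"
	tSepComma    tokenType = "SepComma"
)

type token struct {
	T     tokenType
	Value string
}

// Lexer states: the single "global state" that is updated while scanning.
const (
	stInit   = iota // between tokens
	stString        // inside a quoted string
	stNumber        // inside a run of digits
)

// tokenize scans the input byte by byte and acts on the current state.
func tokenize(input string) ([]token, error) {
	var tokens []token
	var buf strings.Builder
	state := stInit
	for i := 0; i < len(input); i++ {
		c := input[i]
		switch state {
		case stString:
			buf.WriteByte(c)
			if c == '"' { // closing quote: emit the string token
				tokens = append(tokens, token{tString, buf.String()})
				buf.Reset()
				state = stInit
			}
			continue
		case stNumber:
			if c >= '0' && c <= '9' {
				buf.WriteByte(c)
				continue
			}
			// A non-digit ends the number; c itself is handled below.
			tokens = append(tokens, token{tNumber, buf.String()})
			buf.Reset()
			state = stInit
		}
		switch c {
		case '{':
			tokens = append(tokens, token{tBeginObject, "{"})
		case '}':
			tokens = append(tokens, token{tEndObject, "}"})
		case ':':
			tokens = append(tokens, token{tSepColon, ":"})
		case ',':
			tokens = append(tokens, token{tSepComma, ","})
		case '"':
			buf.WriteByte(c)
			state = stString
		case '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
			buf.WriteByte(c)
			state = stNumber
		case ' ', '\t', '\r', '\n':
			// whitespace between tokens is skipped
		default:
			return nil, fmt.Errorf("unexpected character %q at offset %d", c, i)
		}
	}
	switch state {
	case stString:
		return nil, fmt.Errorf("unterminated string")
	case stNumber:
		tokens = append(tokens, token{tNumber, buf.String()})
	}
	return tokens, nil
}

func main() {
	tokens, err := tokenize(`{"name":"cj", "age":10}`)
	if err != nil {
		panic(err)
	}
	for _, tk := range tokens {
		fmt.Printf("%s  %s\n", tk.T, tk.Value)
	}
}
```

The state only changes when a string or a number starts or ends; single-character tokens such as { or : are emitted immediately.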

Take this JSON as an example:

 func TestInitStatus(t *testing.T) {
    str := `{"name":"cj", "age":10}`
    tokenize, err := Tokenize(str)
    assert.Nil(t, err)
    for _, tokenType := range tokenize {
        fmt.Printf("%s  %s\n", tokenType.T, tokenType.Value)
    }
}

The final generated token set is as follows:

 BeginObject  {
String  "name"
SepColon  :
String  "cj"
SepComma  ,
String  "age"
SepColon  :
Number  10
EndObject  }

Check in advance

Because JSON's syntax is so simple, some rules can even be checked during lexical analysis.

For example:
JSON allows the null value. When our string contains nu or nul, this kind of incomplete null can be rejected early by throwing an exception.

That is, once the first character is detected as n, the following characters must be u->l->l; otherwise an exception is thrown.

Similarly, for floating-point numbers, when a value contains more than one decimal point, an exception must also be thrown.

This is also why, as mentioned earlier, true/false/null need multiple intermediate states.
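
A minimal sketch of these early checks, assuming one intermediate state per matched character in the spirit of the Null1/Null2/Null3 token types above. The function names checkNull and checkNumber and the st* states are hypothetical and not taken from gjson, and checkNumber only looks at digits and decimal points.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical intermediate states, one per matched character of "null".
const (
	stInit  = iota
	stNull1 // seen "n"
	stNull2 // seen "nu"
	stNull3 // seen "nul"
	stNull  // seen "null"
)

// checkNull walks the input one byte at a time and rejects the first
// character that does not continue the sequence n -> u -> l -> l.
func checkNull(input string) error {
	state := stInit
	for i := 0; i < len(input); i++ {
		c := input[i]
		switch {
		case state == stInit && c == 'n':
			state = stNull1
		case state == stNull1 && c == 'u':
			state = stNull2
		case state == stNull2 && c == 'l':
			state = stNull3
		case state == stNull3 && c == 'l':
			state = stNull
		default:
			return fmt.Errorf("unexpected %q at offset %d", c, i)
		}
	}
	if state != stNull {
		return errors.New("incomplete null literal")
	}
	return nil
}

// checkNumber accepts digits with at most one decimal point and reports
// whether the value is a float.
func checkNumber(input string) (bool, error) {
	if input == "" {
		return false, errors.New("empty number")
	}
	isFloat := false
	for i := 0; i < len(input); i++ {
		c := input[i]
		switch {
		case c >= '0' && c <= '9':
			// digits are always fine
		case c == '.' && !isFloat:
			isFloat = true
		default:
			return false, fmt.Errorf("unexpected %q at offset %d", c, i)
		}
	}
	return isFloat, nil
}

func main() {
	fmt.Println(checkNull("null"))    // <nil>
	fmt.Println(checkNull("nul"))     // incomplete null literal
	fmt.Println(checkNumber("99.99")) // true <nil>
	fmt.Println(checkNumber("1.2.3")) // false unexpected '.' at offset 3
}
```

The same pattern extends to true and false, which is why each of them needs its own chain of intermediate states.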

Generate JSONObject tree

Before discussing how the JSONObject tree is generated, let's first look at this problem: given a sequence of brackets, determine whether it is legal.

  • [<()>] is legal.
  • [<()>) is not legal.

How do we solve it? It is actually very simple: a stack is all you need, as shown in the following figure:

Using the properties of the stack, we traverse the data in order. When a left symbol is encountered, it is pushed onto the stack. When a right symbol is encountered, it is compared with the element at the top of the stack, and if they match, that element is popped.

If they do not match, the format is wrong; if the stack is empty once the traversal is finished, the data is legal.
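
The bracket check above can be written in a few lines of Go. This is a small sketch; the function name isBalanced is just an illustrative choice.

```go
package main

import "fmt"

// isBalanced reports whether every bracket in s is closed by the matching
// bracket in the right order.
func isBalanced(s string) bool {
	// pairs maps each closing bracket to the opening bracket it must match.
	pairs := map[rune]rune{')': '(', ']': '[', '>': '<', '}': '{'}
	var stack []rune
	for _, c := range s {
		switch c {
		case '(', '[', '<', '{':
			// Left symbol: push it onto the stack.
			stack = append(stack, c)
		case ')', ']', '>', '}':
			// Right symbol: it must match the top of the stack.
			if len(stack) == 0 || stack[len(stack)-1] != pairs[c] {
				return false
			}
			stack = stack[:len(stack)-1]
		}
	}
	// Anything left on the stack was never closed.
	return len(stack) == 0
}

func main() {
	fmt.Println(isBalanced("[<()>]")) // true
	fmt.Println(isBalanced("[<()>)")) // false
}
```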

In fact, if you look closely, JSON's syntax is similar:

 {
    "name": "cj",
    "object": {
        "age": 10,
        "sex": "girl"
    },
    "list": [
        {
            "1": "a"
        },
        {
            "2": "b"
        }
    ]
}

BeginObject:{ and EndObject:} must appear in pairs, and whatever sits between them must also be paired.
For data like "age":10, there must be a value after the colon; otherwise the format is illegal.

So, based on the bracket-matching principle above, we can use a similar method to parse the token set.

We also need to create a stack. When we encounter BeginObject, we push a Map onto the stack. When we encounter a String key, we push that key onto the stack as well.

When a value is encountered, the key is popped off the stack, and the key/value pair is written into the map that is now at the top of the stack.
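
The push and pop steps above can be sketched like this. It is a simplified illustration under my own assumptions, not the code in parse.go: the token struct and buildObject are hypothetical names, arrays are left out, and separators are simply skipped without the status checks described next.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// token is a hypothetical type for this sketch: a type name plus the raw text.
type token struct {
	T     string
	Value string
}

// buildObject turns a token list into nested maps using a single stack that
// holds maps (open objects) and strings (keys waiting for their value).
func buildObject(tokens []token) (map[string]interface{}, error) {
	var stack []interface{}
	var root map[string]interface{}

	top := func() interface{} {
		if len(stack) == 0 {
			return nil
		}
		return stack[len(stack)-1]
	}
	// attach pops the pending key and writes key/value into the map below it.
	// A key is only ever pushed on top of a map, so the assertion is safe.
	attach := func(v interface{}) error {
		key, ok := top().(string)
		if !ok || len(stack) < 2 {
			return errors.New("value without a key")
		}
		stack = stack[:len(stack)-1]
		stack[len(stack)-1].(map[string]interface{})[key] = v
		return nil
	}

	for _, tk := range tokens {
		var err error
		switch tk.T {
		case "BeginObject":
			stack = append(stack, map[string]interface{}{})
		case "String":
			s := strings.Trim(tk.Value, `"`)
			if _, onMap := top().(map[string]interface{}); onMap {
				stack = append(stack, s) // a key: wait for its value
			} else {
				err = attach(s) // a value: pair it with the pending key
			}
		case "Number":
			var n int
			if n, err = strconv.Atoi(tk.Value); err == nil {
				err = attach(n)
			}
		case "EndObject":
			obj, ok := top().(map[string]interface{})
			if !ok {
				err = errors.New("unexpected EndObject")
				break
			}
			stack = stack[:len(stack)-1]
			if len(stack) == 0 {
				root = obj // the outermost object is finished
			} else {
				err = attach(obj) // a nested object is itself a value
			}
		case "SepColon", "SepComma":
			// separators are skipped in this sketch
		default:
			err = fmt.Errorf("unsupported token %s", tk.T)
		}
		if err != nil {
			return nil, err
		}
	}
	if root == nil || len(stack) != 0 {
		return nil, errors.New("unbalanced object")
	}
	return root, nil
}

func main() {
	// Tokens for {"name":"cj","object":{"age":10}}
	tokens := []token{
		{"BeginObject", "{"}, {"String", `"name"`}, {"SepColon", ":"}, {"String", `"cj"`},
		{"SepComma", ","}, {"String", `"object"`}, {"SepColon", ":"},
		{"BeginObject", "{"}, {"String", `"age"`}, {"SepColon", ":"}, {"Number", "10"},
		{"EndObject", "}"}, {"EndObject", "}"},
	}
	fmt.Println(buildObject(tokens))
}
```

Whether a String token is a key or a value is decided purely by what is on top of the stack: a map means a key is expected, a pending key means this string is its value.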

Of course, a global state is also needed while traversing the tokens, so this is again a finite state machine.

For example, when we reach a token whose type is String and whose value is "name", the next token is expected to be a colon.

So we record the current status as StatusColon. When a later token is parsed as SepColon, we check whether the current status is StatusColon; if it is not, the syntax is wrong and an exception can be thrown.

It is also worth noting that the status is actually a set, because several different states may legally come next.

{"e":[1,[2,3],{"d":{"f":"f"}}]}

For example, after we parse a SepColon colon, what follows may be a value, a BeginObject {, or a BeginArray [.

Therefore all three cases have to be considered here, and the others can be deduced by analogy.
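
One common way to represent such a set of expected statuses is a bit mask. The sketch below is my own illustration of that idea: the status names follow the ones used in this article in lower-case form, but the flag layout and the expect helper are assumptions and may differ from what parse.go actually does.

```go
package main

import (
	"errors"
	"fmt"
)

// status is a bit set: each flag is one thing that may legally come next.
type status uint

const (
	statusColon       status = 1 << iota // a ':' may come next
	statusObjectValue                    // a plain value inside an object
	statusBeginObject                    // a '{' may come next
	statusBeginArray                     // a '[' may come next
	statusArrayValue                     // a plain value inside an array
	statusEndArray                       // a ']' may come next
	statusEndObject                      // a '}' may come next
)

var errUnexpected = errors.New("unexpected token")

// expect returns an error when actual is not part of the expected set.
func expect(expected, actual status) error {
	if expected&actual == 0 {
		return errUnexpected
	}
	return nil
}

func main() {
	// After a key such as "name", only a colon is legal.
	expected := statusColon
	fmt.Println(expect(expected, statusColon)) // <nil>

	// After the colon, three different things are legal.
	expected = statusObjectValue | statusBeginObject | statusBeginArray
	fmt.Println(expect(expected, statusBeginArray)) // <nil>
	fmt.Println(expect(expected, statusEndObject))  // unexpected token

	// After a BeginArray there are four legal cases.
	expected = statusArrayValue | statusBeginArray | statusBeginObject | statusEndArray
	fmt.Println(expect(expected, statusEndArray)) // <nil>
}
```

Combining flags with | builds the set, and a single & test checks membership, so each token only needs one comparison against the current expectation.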

The specific analysis process can refer to the source code:
https://github.com/crossoverJie/gjson/blob/main/parse.go


Although JSON can be parsed with the help of a stack, you may have noticed a problem:
It is very easy to miss rules. For example, there are the three cases after a colon just mentioned, and there are even four cases after a BeginArray (StatusArrayValue, StatusBeginArray, StatusBeginObject, StatusEndArray).

Such code is not very intuitive to read, and it is easy to miss part of the grammar, which then only gets fixed once a problem shows up.

Since the problem has been raised, there is naturally a solution: the recursive descent algorithm commonly used in syntax analysis.

We only need to write the algorithm recursively, following JSON's grammar definition. The resulting code is very clear to read, and no rule gets missed.

The complete JSON syntax can be found here:
https://github.com/antlr/grammars-v4/blob/master/json/JSON.g4
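
To give a feel for the approach, here is a compact recursive descent sketch with one function per grammar rule. It is an illustration under several assumptions and not the planned gjson implementation: the parser type and its methods are hypothetical, the token stream is faked by splitting a space-separated string, numbers are integers only, true/false/null are omitted, trailing tokens after the first value are not checked, and syntax errors are reported through panic and recover to keep the code short.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parser is a hypothetical type for this sketch: a token list and a read position.
type parser struct {
	tokens []string
	pos    int
}

// accept consumes tok if it is the current token and reports whether it did.
func (p *parser) accept(tok string) bool {
	if p.pos < len(p.tokens) && p.tokens[p.pos] == tok {
		p.pos++
		return true
	}
	return false
}

// expect is like accept but aborts the parse on a mismatch.
func (p *parser) expect(tok string) {
	if !p.accept(tok) {
		panic(fmt.Sprintf("expected %q at token %d", tok, p.pos))
	}
}

// value : object | array | STRING | NUMBER
func (p *parser) value() interface{} {
	tok := ""
	if p.pos < len(p.tokens) {
		tok = p.tokens[p.pos]
	}
	switch {
	case tok == "{":
		return p.object()
	case tok == "[":
		return p.array()
	case len(tok) >= 2 && tok[0] == '"' && tok[len(tok)-1] == '"':
		p.pos++
		return tok[1 : len(tok)-1] // STRING without its quotes
	}
	n, err := strconv.Atoi(tok) // NUMBER: integers only in this sketch
	if err != nil {
		panic(fmt.Sprintf("expected a value at token %d", p.pos))
	}
	p.pos++
	return n
}

// object : '{' (pair (',' pair)*)? '}' where pair : STRING ':' value
func (p *parser) object() map[string]interface{} {
	obj := map[string]interface{}{}
	p.expect("{")
	for first := true; !p.accept("}"); first = false {
		if !first {
			p.expect(",")
		}
		key, ok := p.value().(string)
		if !ok {
			panic(fmt.Sprintf("expected a string key before token %d", p.pos))
		}
		p.expect(":")
		obj[key] = p.value()
	}
	return obj
}

// array : '[' (value (',' value)*)? ']'
func (p *parser) array() []interface{} {
	arr := []interface{}{}
	p.expect("[")
	for first := true; !p.accept("]"); first = false {
		if !first {
			p.expect(",")
		}
		arr = append(arr, p.value())
	}
	return arr
}

// parse runs the parser and turns a syntax panic into an error.
func parse(tokens []string) (v interface{}, err error) {
	defer func() {
		if r := recover(); r != nil {
			v, err = nil, fmt.Errorf("parse error: %v", r)
		}
	}()
	return (&parser{tokens: tokens}).value(), nil
}

func main() {
	// The lexer is faked here: tokens are simply separated by spaces.
	tokens := strings.Fields(`{ "e" : [ 1 , [ 2 , 3 ] , { "d" : { "f" : "f" } } ] }`)
	fmt.Println(parse(tokens))
}
```

Each function mirrors one grammar rule, so the question of what may follow a colon or a BeginArray is answered by the call structure itself instead of a hand-maintained status set.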

I also plan to change the implementation to recursive descent in the next version.

Summary

So far, only very basic JSON parsing has been implemented, with no performance optimization at all. Compared with the official JSON package, the performance is far worse:

 cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkJsonDecode-12            372298             15506 ns/op             512 B/op         12 allocs/op
BenchmarkDecode-12                141482             43516 ns/op           30589 B/op        962 allocs/op
PASS

There are also some basic features that have not been implemented yet, such as using reflection to map the parsed JSONObject onto a custom struct, as well as the feature I ultimately want to support, JSON arithmetic:

 gjson.Get("glossary.age+long*(a.b+a.c)")

So far I have not found a similar library that implements this feature. It should be very interesting once it is really finished. If you are interested, please stay tuned.

Source code:
https://github.com/crossoverJie/gjson
