
This article was first published on https://robberphex.com/go-does-not-need-a-java-style-gc/ .

Modern languages like Go, Julia, and Rust don't require garbage collectors as complicated as the ones used by Java and C#. But why is this?

To answer that, we need to understand how garbage collectors work and how different languages allocate memory. Let's start with why Java needs such a sophisticated garbage collector.

This article will cover many different garbage collector topics:

  • Why does Java rely on a fast GC? I will cover some design choices in the Java language itself that put a lot of pressure on the GC.
  • Memory fragmentation and its impact on GC design, and why it matters for Java but much less for Go.
  • Value types and how they change the GC picture.
  • The generational garbage collector, and why Go doesn't need it.
  • Escape analysis: a technique Go uses to reduce GC pressure.
  • Compacting garbage collection: important in Java, but why doesn't Go need it?
  • Concurrent garbage collection: Go solves many GC challenges by running the collector concurrently with the program. Why is this harder to do in Java?
  • Common criticisms of the Go GC, and why many of the assumptions behind them are flawed or simply wrong.

Why Java needs fast GC more than other languages

Basically, Java outsources memory management entirely to its garbage collector. This turned out to be a huge mistake, but to explain why, I need to go into more detail.

Let's start from the beginning. It is 1991 and work on Java is getting under way. Garbage collectors are all the rage, and the research looks very promising, so Java's designers bet that advanced garbage collectors would solve all the challenges of memory management.

For this reason, all objects in Java—except for basic types such as integers and floating-point values—are designed to be allocated on the heap. When discussing memory allocation, we usually distinguish the so-called heap and stack.

The stack is very fast, but space is limited and it can only hold objects whose lifetime is bounded by the function call. In other words, the stack only works for local variables.

The heap can be used for all objects. Java basically ignores the stack and chooses to allocate everything on the heap, except for basic types such as integers and floating points. Whenever you write new Something() in Java, it consumes memory on the heap.

However, this kind of memory management is actually quite expensive in memory terms. You might think that an object wrapping a 32-bit integer needs only 4 bytes of memory:

class Knight {
   int health;
}

However, in order for the garbage collector to work, Java stores a header that contains:

  • Type: identifies the class the object belongs to.
  • Lock: used for synchronized statements.
  • Mark: used by the mark-and-sweep garbage collector.

This header is typically 16 bytes, so the ratio of header to actual data is 4:1. In the OpenJDK C++ sources, the base class of every Java object is defined as:

class oopDesc {
    volatile markOop  _mark;   // for mark and sweep
    Klass*           _klass;   // the type
}

Memory fragmentation

The next problem is memory fragmentation. When Java allocates an array of objects, it really creates an array of references pointing to objects that may end up scattered across the heap. This is bad for performance, because modern microprocessors don't read single bytes: since starting a memory transfer is relatively slow, every time the CPU accesses an address it reads a contiguous chunk of memory.

This contiguous chunk of memory is called a cache line. The CPU has its own caches, much smaller than main memory, which hold recently accessed data because that data is likely to be accessed again. If memory is fragmented, cache lines end up fragmented too, the CPU caches fill with data you don't need, and the cache hit rate drops.

How Java overcomes memory fragmentation

To overcome these shortcomings, Java maintainers have invested heavily in advanced garbage collectors. They introduced compaction: moving objects so they sit in adjacent blocks of memory. This is expensive, because copying data from one location to another consumes CPU cycles, and so does updating every reference to the moved objects.

The garbage collector cannot update references while they are in use, so updating them requires suspending all threads. Historically this has meant Java programs pausing completely for hundreds of milliseconds while objects are moved, references are fixed up, and unused memory is reclaimed.

Adding complexity

To reduce these long pauses, Java uses so-called generational garbage collectors. They are based on the following premise:

Most objects a program allocates are released again soon after. So if the GC focuses its effort on the most recently allocated objects, the overall GC workload should go down.

This is why Java divides allocated objects into two groups:

  • Old objects: objects that have survived multiple mark-and-sweep cycles. On each cycle a generation counter is updated to track the object's "age".
  • Young objects: objects with a small "age", meaning they were allocated recently.

Java scans recently allocated objects more aggressively, checking whether they can be freed or need to be moved. As an object's "age" grows, it is moved out of the young generation.

All these optimizations add complexity. They require more development effort and more money spent on highly skilled developers.

How modern languages avoid the same flaws as Java

Modern languages don't need garbage collectors as complicated as Java's and C#'s, because they were designed not to lean on the garbage collector the way Java does.

type Point struct {
    X, Y int
}
var points [15000]Point

In the Go example above, we allocate 15,000 Point values with a single memory allocation, producing a single pointer. In Java, the same thing requires 15,000 separate allocations, each producing a reference that must be managed individually, and each Point object carries the 16-byte header described earlier. In Go, Julia, or Rust you won't see such headers; objects normally don't carry them at all.

In Java, GC tracks and manages 15,000 individual objects. Go only needs to track one object.
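
To make this concrete, here is a small sketch (not from the original article) that shows the Go array really is one header-free block; unsafe.Sizeof is used purely for illustration:

package main

import (
    "fmt"
    "unsafe"
)

type Point struct {
    X, Y int
}

func main() {
    var points [15000]Point
    // The array is a single contiguous allocation: 15000 * 16 bytes on a
    // 64-bit platform, with no per-element headers and no pointer chasing.
    fmt.Println(unsafe.Sizeof(points)) // 240000
}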

Value types

Most languages other than Java support value types. The following code defines a rectangle with Min and Max points that define its extent.

type Rect struct {
   Min, Max Point
}

In Go this becomes one contiguous block of memory. In Java it becomes a Rect object referencing two separate objects, Min and Max. So in Java a Rect requires 3 memory allocations, while in Go, Rust, C/C++, and Julia it requires only 1.

On the left, Java-style fragmented memory; on the right, the contiguous block of memory you get in Go, C/C++, Julia, and similar languages.

The lack of value types caused serious problems when Git was ported to Java: without them, it is hard to get good performance. As Shawn O. Pearce put it on the JGit developer mailing list:

JGit struggles with the lack of an efficient way to represent a SHA-1. In C you can just write unsigned char[20] and have it inlined into the container's memory allocation. A byte[20] in Java costs an extra 16 bytes of memory and is slower to access, because the bytes live in a non-adjacent memory area away from the container object. We tried to work around this by converting a byte[20] into five ints, but that costs extra CPU instructions.

See what we are dealing with here? In Go, I can do the same thing as in C/C++ and define a struct like this:

type Sha1 struct {
   data [20]byte
}

These bytes sit in a single contiguous block of memory, whereas Java would store a pointer to bytes located somewhere else.
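
To illustrate the point, here is a small sketch (not from the original article): a hypothetical Commit container holding Sha1 values inline, mirroring the JGit complaint above. The sizes reported by unsafe.Sizeof show that Go adds no object headers and stores the hashes directly inside the container:

package main

import (
    "fmt"
    "unsafe"
)

type Sha1 struct {
    data [20]byte
}

// Commit is a made-up container type, used only for illustration.
type Commit struct {
    id   Sha1
    tree Sha1
}

func main() {
    fmt.Println(unsafe.Sizeof(Sha1{}))   // 20: just the data, no header
    fmt.Println(unsafe.Sizeof(Commit{})) // 40: both hashes stored inline
}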

The Java designers have realized that they messed up and that developers really do need value types to get good performance. You might call that an exaggeration, but then you have to explain Project Valhalla: Oracle's effort to bring value types to Java, motivated by exactly the issues I'm describing here.

Value types are not enough

So will Project Valhalla solve Java's problem? No. It only brings Java up to the level of C#. C# appeared a few years after Java, by which time it was clear the garbage collector was not as magical as everyone had thought, so its designers added value types.

However, in terms of memory-management flexibility, this still does not put C#/Java on the same level as languages like Go and C/C++. Java does not support real pointers. In Go, I can write:

var ptr *Point = &rect.Min // store a pointer to rect.Min in ptr
*ptr = Point{2, 4}         // replace the rect.Min value

Just like in C/C++, in Go you can take the address of an object or of one of its fields and store it in a pointer. You can then pass that pointer around and use it to modify the field it points to. This means you can create large value objects and still pass pointers to them to functions, avoiding copies and improving performance. C# is in a better position than Java here, because it has limited support for pointers. The previous Go example can be written in C# as:

unsafe void foo() {
   ref var ptr = ref rect.Min;
   ptr = new Point(2, 4);
}

However, C#'s pointer support comes with caveats that don't apply to Go:

  • Code that uses pointers must be marked unsafe, which makes the code less safe and more likely to crash.
  • Pointers may only point to pure value types allocated on the stack (all of the struct's fields must themselves be value types).
  • Pointers have to be pinned with the fixed keyword, which switches off garbage collection for the pinned objects.

So the normal, safe way to use value types in C# is to copy them, since that requires no unsafe or fixed code regions. But for large value types, copying can hurt performance. Go has none of these problems: you can create pointers to objects managed by the garbage collector, and there is no need to specially mark the code that uses them, as in C#.

Custom secondary allocators

With real pointers, you can do things that value types alone cannot. One example is building a secondary allocator, such as the arena allocator Chandra Sekar S gives as an example:

type Arena []Node

// Alloc returns a pointer to a Node carved out of a pre-allocated block,
// refilling the arena in batches of 10,000 nodes when it runs empty.
func (arena *Arena) Alloc() *Node {
    if len(*arena) == 0 {
        *arena = make([]Node, 10000)
    }

    n := &(*arena)[len(*arena)-1]
    *arena = (*arena)[:len(*arena)-1]
    return n
}
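
The snippet assumes a Node type that is not shown. A minimal definition consistent with how it is used here and in buildTree below would be:

type Node struct {
    item        int
    left, right *Node
}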

Why is this useful? If you look at microbenchmarks such as binary-tree construction, you will usually find that Java has a big advantage over Go. That's because binary-tree benchmarks mostly measure how fast the garbage collector can allocate objects. Java is very fast here because it uses a bump allocator: it only has to increment a pointer, while Go has to search for a suitable free spot in memory. With the Arena allocator, however, you can build binary trees quickly in Go too:

func buildTree(item, depth int, arena *Arena) *Node {
    n := arena.Alloc()
    if depth <= 0 {
        *n = Node{item, nil, nil}
    } else {
        *n = Node{
              item,
              buildTree(2*item-1, depth-1, arena),
              buildTree(2*item, depth-1, arena),
        }
    }
    return n
}
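
Tying the two pieces together, a minimal driver could look like this (a hypothetical main, not part of the original benchmark code):

func main() {
    var arena Arena
    // Build a depth-18 binary tree; every node comes out of the arena,
    // so the whole tree costs only a handful of slice allocations.
    tree := buildTree(1, 18, &arena)
    _ = tree
}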

This is why real pointers matter: without them, you could not take a pointer to an element inside a contiguous block of memory, the way the arena allocator does here:

n := &(*arena)[len(*arena)-1]

Problems with Java Bump Allocator

The bump allocator used by the Java GC is similar to an arena allocator: getting the next object is just a matter of advancing a pointer. The difference is that the developer never has to ask for it explicitly, which may seem smarter, but it causes problems Go does not have:

  • Sooner or later the memory has to be compacted, which means moving data and fixing up pointers. An arena allocator never has to do this.
  • In a multithreaded program, a bump allocator needs a lock (unless you use thread-local storage). That erodes its performance advantage: either the lock slows things down, or thread-local storage causes fragmentation that has to be compacted later.

Ian Lance Taylor, one of the creators of Go, explained the problem with bump allocators:

Generally speaking, it is likely to be somewhat more efficient to allocate memory using a set of per-thread caches, and at that point you have lost the advantages of a bump allocator. So I would assert that, in general, with many caveats, there is no real advantage to using a compacting memory allocator for a multithreaded program.

Generational GC and escape analysis

The Java garbage collector has more work to do because Java allocates far more objects. Why? We just covered it: without value types and real pointers, allocating large arrays or complex data structures always ends up creating huge numbers of small objects. That is why Java needs a generational GC.

Needing fewer allocations already works in Go's favor, but Go has another trick. Both Go and Java perform escape analysis when compiling functions.

Escape analysis means looking at the pointers created inside a function and determining whether they escape the function's scope.

func escapingPtr() []int {
   values := []int{4, 5, 10}
   return values
}

func nonEscapingPtr() int {
    values := []int{4, 5, 10}
    var total int = addUp(values)
    return total
}
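
The second example calls addUp, which the original snippet leaves out. An assumed helper that simply sums the slice would be:

func addUp(values []int) int {
    total := 0
    for _, v := range values {
        total += v
    }
    return total
}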

In the first example, values refers to a slice, which is essentially a pointer to an array. It escapes because it is returned, so values must be allocated on the heap.

In the second example, however, the pointer to values never leaves the function nonEscapingPtr, so values can be allocated on the stack, which is very fast and cheap. Note that escape analysis itself only determines whether a pointer escapes.
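
If you want to see these decisions for yourself, the Go compiler can print its escape analysis results. A quick sketch of the command (escape.go is a hypothetical file containing the functions above; the exact output wording varies between compiler versions):

go build -gcflags="-m" escape.go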

Limitations of Java escape analysis

Java also does escape analysis, but uses it in a more limited way. From the Oracle documentation covering the HotSpot virtual machine in Java SE 16:

It does not replace a heap allocation with a stack allocation for objects that do not globally escape.

However, Java uses another technique called scalar replacement, which avoids having to put whole objects on the stack. Essentially, it breaks an object apart and places its primitive members on the stack; remember that Java can already put primitives like int and float there. But as Piotr Kołaczkowski discovered in 2021, in practice scalar replacement fails to kick in even in very trivial cases.

Instead, the main practical benefit of Java's escape analysis is avoiding locks: if a pointer is known never to be used outside a function, the object it points to cannot need locking either.

Advantages of Go language escape analysis

Go, by contrast, uses escape analysis to decide which objects can be allocated on the stack. This greatly reduces the number of short-lived heap objects, exactly the objects a generational GC is designed to exploit. Remember, the whole point of a generational GC is that recently allocated objects tend to die young. But in Go, most heap objects are likely to be long-lived, because the short-lived ones are usually caught by escape analysis and never reach the heap.

Unlike Java, Go can apply escape analysis to complex objects. Java typically only succeeds with simple objects such as byte arrays; even the built-in ByteBuffer cannot be stack-allocated through scalar replacement.

Modern languages do not need a compacting GC

You may have read garbage-collection experts claiming that Go is more likely than Java to run out of memory because of fragmentation. The argument goes like this: because Go's garbage collector does not compact memory, memory becomes fragmented over time, and once it is fragmented enough you eventually cannot find a spot to fit a new object.

However, this problem is greatly reduced for two reasons:

  1. Go does not allocate as many small objects as Java does. It can allocate large arrays of objects as a single memory block.
  2. Modern memory allocators, such as Google's TCMalloc or Intel's Scalable Malloc, do not fragment memory.

When Java was designed, memory fragmentation was considered a big, possibly unsolvable problem for memory allocators. But as early as 1998, not long after Java appeared, researchers were already tackling it. Here is a quote from a paper by Mark S. Johnstone and Paul R. Wilson:

This substantially strengthens our previous results, which show that the problem of memory fragmentation is often misunderstood, and that a good allocator strategy can provide good memory usage for most programs.

Therefore, many of the assumptions made when designing Java's memory allocation strategy are no longer correct.

Generational GC vs. concurrent GC pauses

Java's generational GC strategy is about making each collection cycle shorter. To move data and fix up pointers, Java has to stop everything, and if the pause lasts too long it hurts the program's performance and responsiveness. With a generational GC, less data is examined on each cycle, which shortens these pauses.

However, Go solves the same problem with some alternative strategies:

  1. Because memory never has to be moved and pointers never have to be fixed up, there is less work to do in each GC cycle. The Go GC only does mark and sweep: it walks the object graph looking for objects that can be freed.
  2. It runs concurrently, so a separate GC thread can look for objects to free without stopping the other threads.

Why can Go run its GC concurrently while Java cannot? Because Go never fixes up pointers or moves objects in memory, there is no risk of following a pointer to an object that has just been moved but whose pointer has not yet been updated. And an object with no remaining references cannot suddenly gain one just because another thread is running, so there is no danger in reclaiming "dead" objects in parallel.

How does this play out? Say you have 4 threads working in a Go program, and one of them spends a total of 4 seconds doing GC work, spread out over the program's run.

Now imagine that the GC of an equivalent Java program needs only 2 seconds of GC work. Which program loses the least time to garbage collection? It sounds like the Java program, right? Wrong!

The Java program stops all 4 worker threads for those 2 seconds, which means 2 × 4 = 8 seconds of work is lost. The Go program only loses the 4 seconds of the single thread doing GC work; the other threads keep running. So although Go spends more time on GC in total, each pause has less impact on the program, because not all threads are stopped. A slower concurrent GC can therefore outperform a faster GC that has to stop all threads to do its work.

What if garbage is generated faster than it can be cleaned up?

A popular argument against concurrent garbage collectors is that active worker threads may generate garbage faster than the GC threads can collect it. In the Java world, this is called a "concurrent mode failure".

In this case, the runtime has no choice but to stop the program completely and wait for the GC cycle to finish. So when Go claims its GC pause times are very low, that claim only holds as long as the GC has enough CPU time and headroom to keep ahead of the main program.

But Go has a clever trick for getting around this, which Rick Hudson, one of the people behind the Go GC, described: Go uses a so-called Pacer.

If necessary, the Pacer slows down allocation while speeding up marking. At a high level, the Pacer stops a goroutine that is doing a lot of allocation and puts it to work doing marking. The amount of work is proportional to the goroutine's allocation. This speeds up the garbage collector while slowing down the mutator.

Goroutines are a bit like green threads multiplexed onto a pool of OS threads. Basically, Go takes the goroutines that are generating a lot of garbage and puts them to work helping the GC clean it up. They keep doing GC work until the collector is running faster than the goroutines producing the garbage.

In short

Advanced garbage collectors solve real problems in Java, but modern languages such as Go and Julia avoid creating those problems in the first place, so they don't need a Rolls-Royce garbage collector. With value types, escape analysis, real pointers, multi-core processors, and modern allocators, many of the assumptions behind Java's design fall apart. They no longer apply.

The old GC trade-offs no longer apply

Mike Hearn has a very popular story on Medium criticizing claims made about the Go GC: Modern garbage collection.

Hearn's key message is that GC design is always about trade-offs: because Go aims for low-latency collection, it must suffer on many other metrics. It is an interesting read, because it covers the trade-offs in GC design in a lot of detail.

First of all, what does low latency mean here? Go GC pauses average around 0.5 milliseconds, while various Java collectors can take hundreds of milliseconds.

I think the problem with Mike Hearn's argument is that it rests on a flawed premise: that memory-usage patterns are the same in every language. As this article has shown, that is not the case at all. Go creates far fewer objects for the GC to manage, and escape analysis keeps many objects off the heap in the first place.

Is old technology itself bad?

Hearn implies that old and simple collection techniques are somehow inferior:

Stop-the-world (STW) mark/sweep is the GC algorithm most commonly taught in undergraduate computer science classes. When doing job interviews, I sometimes ask candidates to talk a bit about GC, and almost always, they either see GC as a black box and know nothing about it, or they assume this very old technique is no longer used today.

Yes, it may be old, but it is this old technique that allows the GC to run concurrently, which the "modern" moving techniques do not. That matters even more in today's world of multi-core hardware.

Go is not C#

Another statement:

Because Go is a relatively ordinary imperative language with value types, its memory access patterns are probably comparable to C#, where the generational hypothesis certainly holds, and therefore .NET uses a generational collector.

But in fact, it isn't. C# developers tend to avoid large value objects, because pointer-based code cannot be used safely. We have to assume C# developers prefer copying value types over using pointers, since copying is what the CLR lets them do safely, and that naturally carries more overhead.

As far as I know, C# also does not use escape analysis to reduce the number of short-lived objects on the heap. Nor is C# built around running huge numbers of lightweight tasks; Go can enlist its goroutines to speed up collection, as described for the Pacer above.

Memory compaction

Compaction: because there is no compaction, your program will eventually fragment its heap (heap fragmentation is discussed further below). You also cannot benefit from packing things tightly into cache.

Here, Mike Hearn's view of allocators is simply out of date. Modern allocators such as TCMalloc essentially eliminate this problem.

Program throughput: Since GC must do a lot of work for each cycle, this steals CPU time from the program itself and reduces its speed.

This does not apply when you have a concurrent GC: all the other threads can keep running while the GC works, unlike in Java, where the whole world has to stop.

Heap overhead

Hearn raises the issue of concurrent mode failure: the risk that the Go GC cannot keep up with the rate at which garbage is produced.

Heap overhead: because collecting the heap via mark/sweep is very slow, you need lots of free space to make sure you don't hit a "concurrent mode failure". The default heap overhead is 100%, which doubles the amount of memory your program needs.

I am skeptical of this claim, because real-world comparisons I have seen tend to show Go programs using less memory. Not to mention that it ignores the Pacer, which conscripts the goroutines producing lots of garbage into cleaning it up.
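
For context, that 100% figure corresponds to Go's default GOGC=100 setting, and it is tunable. Here is a minimal sketch using the standard runtime/debug API; the value 50 is just an illustration, not a recommendation:

package main

import (
    "fmt"
    "runtime/debug"
)

func main() {
    // GOGC=100 (the default) starts a new collection once the heap has grown
    // 100% beyond the live data left after the previous cycle. Lowering it
    // trades extra CPU time for a smaller heap; raising it does the opposite.
    previous := debug.SetGCPercent(50)
    fmt.Println("previous GOGC value:", previous)
}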

Why low latency is also important for Java

We live in a world of Docker and microservices, which means many small programs talking to each other. Imagine a request passing through a chain of several services: a major pause in any one of them has a knock-on effect, stalling every service that is waiting on it. If the next service in the pipeline is stuck in a stop-the-world garbage collection, the whole chain waits.

So the latency/throughput trade-off is no longer a free choice in GC design: when multiple services work together, high latency itself drags throughput down. Java's preference for high-throughput, high-latency GC made sense in the monolithic world; it no longer fits the world of microservices.

This is the fundamental problem with Mike Hearn's view. He argues that there is no silver bullet, only trade-offs, and tries to give the impression that Java's trade-offs are just as valid. But trade-offs have to be chosen for the world we actually live in.

In short, I think Go has made a lot of smart moves and strategic choices, and dismissing them as just one more set of trade-offs that anyone could have made misses the point.

