One Monday morning, Xiaotao made a cup of hot coffee ☕️ as usual and was about to open the project to start a new day's work. Suddenly, Xiaowen next door shouted: "Look, the user support group has blown up..."
User A: "Something is wrong with the Git service — code pushes are failing!"
User B: "Help me take a look, the pipeline runs are reporting errors..."
User C: "Our system goes live today, and now the deployment page won't even open. I'm starting to panic!"
User D: ...
Xiaotao had to put down his coffee, switch the screen to the bastion host, and log in to the server with a series of practiced operations. "Oh, it turns out the code released last weekend missed a parameter validation and caused a panic," Xiaotao told Xiaowen, pointing at the container log on the screen.
Ten minutes later, Xiaowen updated the online system with a fixed build, and the users' problems were resolved.
Although the fault was fixed, Xiaotao fell into thought: "Why didn't we notice the system's anomaly before the users did? Even now, troubleshooting still means logging in to the bastion host to read container logs. Is there a faster way to pinpoint the cause of an online failure, in less time?"
At this point, Xiao L, who was sitting opposite, said: "We keep telling users that we help them achieve observability of their systems — it's time for Erda itself to be observed too."
Xiaotao: "Then what should we do...?"
Typically, we build independent distributed tracing, monitoring, and logging systems to help development teams diagnose and observe microservice systems. Meanwhile, Erda itself already provides full-featured service observation capabilities, and some tracing systems in the community (such as Apache SkyWalking and Jaeger) provide their own observability — which gave us another idea: use the platform's own capabilities to observe the platform itself.
In the end, we chose to implement Erda's observability on the Erda platform itself, for the following reasons:
- The platform already provides service observation capabilities; introducing an external system would mean duplicated construction and higher resource costs for running the platform.
- The development team uses the platform itself to troubleshoot faults and performance issues day to day, and eating our own dog food also helps improve the product.
- For core components of the observability system, such as Kafka and the data computation components, we use the SRE team's inspection tools for out-of-band coverage, triggering alarm messages when problems occur.
The Erda microservice observation platform provides observation and diagnosis tools from different perspectives, such as APM, user experience monitoring, distributed tracing, and log analysis. Following the principle of using each tool where it fits best, we also process the different kinds of observation data generated by Erda separately. Read on for the implementation details.
OpenTelemetry data access
In a previous article, we introduced how to ingest Jaeger traces into Erda. Our first thought was to use the Jaeger Go SDK as the tracing implementation, but OpenTracing, the main protocol Jaeger implements, is no longer maintained, so we turned our attention to the new generation of observability standards: OpenTelemetry.
OpenTelemetry is a CNCF observability project formed by the merger of OpenTracing and OpenCensus. It aims to provide a standardized, vendor-neutral solution in the observability field, covering the standardization of the data model, collection, processing, and export.
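One concrete piece of that standardization is context propagation via the W3C Trace Context `traceparent` header. As a minimal, stdlib-only illustration (not Erda or OpenTelemetry SDK code), a sketch of parsing that header might look like this:

```go
package main

import (
	"fmt"
	"strings"
)

// TraceContext holds the four fields of a W3C traceparent header:
// version-traceid-spanid-flags, e.g. "00-<32 hex>-<16 hex>-01".
type TraceContext struct {
	Version, TraceID, SpanID, Flags string
}

// parseTraceparent splits the header into its dash-separated fields
// and checks the trace-id and span-id lengths.
func parseTraceparent(h string) (TraceContext, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return TraceContext{}, fmt.Errorf("malformed traceparent: %q", h)
	}
	return TraceContext{parts[0], parts[1], parts[2], parts[3]}, nil
}

func main() {
	tc, err := parseTraceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
	if err != nil {
		panic(err)
	}
	fmt.Println(tc.TraceID, tc.SpanID)
}
```

In practice the OpenTelemetry SDK handles this propagation for you; the sketch only shows what the standard puts on the wire.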
As shown in the figure below, to ingest OpenTelemetry trace data into the Erda observability platform, we need to implement a receiver for the otlp protocol in the gateway component, and implement a new span analysis component on the data-consumption side to transform otlp data into Erda APM's observability data model.
OpenTelemetry data access and processing flow
The gateway component is a lightweight Golang implementation; its core logic is to parse the otlp proto data and to add authentication and rate limiting for tenant data.
Key code reference: receivers/opentelemetry
The span_analysis component is implemented on Flink. Using a DynamicGap time window, it aggregates and analyzes OpenTelemetry span data to produce the following metrics:
- service_node: describes the nodes and instances of a service
- service_call_*: describes service and interface call metrics, covering HTTP, RPC, DB, and cache
- service_call_*_error: describes failed service calls, covering HTTP, RPC, DB, and cache
- service_relation: describes the call relationships between services
At the same time, span_analysis converts otlp spans into Erda's standard span model, streams both the metrics above and the converted span data to Kafka, where they are consumed and stored by the Erda observability platform's existing data-consumption components.
Key code reference: analyzer/tracing
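For intuition, the per-window aggregation can be sketched in a few lines of Go (the real component runs on Flink, and the metric keys below are simplified stand-ins for the metrics listed above):

```go
package main

import "fmt"

// Span is a drastically simplified trace span; the real otlp model
// carries many more attributes.
type Span struct {
	Service  string
	CallType string // "http", "rpc", "db", "cache"
	Error    bool
}

// aggregate mimics one time window of span analysis: it folds spans
// into service_call_* and service_call_*_error style counters.
func aggregate(spans []Span) map[string]int {
	metrics := map[string]int{}
	for _, s := range spans {
		metrics["service_call_"+s.CallType+":"+s.Service]++
		if s.Error {
			metrics["service_call_"+s.CallType+"_error:"+s.Service]++
		}
	}
	return metrics
}

func main() {
	spans := []Span{
		{Service: "gateway", CallType: "http", Error: false},
		{Service: "gateway", CallType: "http", Error: true},
		{Service: "apm", CallType: "db", Error: false},
	}
	m := aggregate(spans)
	fmt.Println(m["service_call_http:gateway"])       // 2
	fmt.Println(m["service_call_http_error:gateway"]) // 1
}
```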
With the above in place, Erda's ingestion and processing of OpenTelemetry trace data was complete.
Next, let's look at how Erda's own services connect to OpenTelemetry.
Golang non-invasive call interception
As a cloud-native PaaS platform, Erda is naturally developed in Golang, the most popular language in the cloud-native field. However, in Erda's early days we did not pre-plant any tracing instrumentation in the platform's logic, so even with OpenTelemetry's out-of-the-box Go SDK, we would have had to add manual spans throughout the core logic — a costly task.
In my previous experience with Java and .NET Core projects, AOP was used to implement cross-cutting, non-business logic such as performance metrics and trace instrumentation. Although Golang does not provide a Java Agent-like mechanism for modifying code logic at runtime, we were inspired by the monkey project. After comparing and testing monkey, pinpoint-apm/go-aop-agent, and gohook, we chose gohook as the basis of Erda's AOP approach, and finally provided automatic tracing instrumentation in erda-infra.
For the principle behind monkey, see monkey-patching-in-go.
Taking the automatic tracking of http-server as an example, our core implementation is as follows:
```go
//go:linkname serverHandler net/http.serverHandler
type serverHandler struct {
	srv *http.Server
}

//go:linkname serveHTTP net/http.serverHandler.ServeHTTP
//go:noinline
func serveHTTP(s *serverHandler, rw http.ResponseWriter, req *http.Request)

//go:noinline
func originalServeHTTP(s *serverHandler, rw http.ResponseWriter, req *http.Request) {}

var tracedServerHandler = otelhttp.NewHandler(http.HandlerFunc(func(rw http.ResponseWriter, r *http.Request) {
	injectcontext.SetContext(r.Context())
	defer injectcontext.ClearContext()
	s := getServerHandler(r.Context())
	originalServeHTTP(s, rw, r)
}), "", otelhttp.WithSpanNameFormatter(func(operation string, r *http.Request) string {
	u := *r.URL
	u.RawQuery = ""
	u.ForceQuery = false
	return r.Method + " " + u.String()
}))

type _serverHandlerKey int8

const serverHandlerKey _serverHandlerKey = 0

func withServerHandler(ctx context.Context, s *serverHandler) context.Context {
	return context.WithValue(ctx, serverHandlerKey, s)
}

func getServerHandler(ctx context.Context) *serverHandler {
	return ctx.Value(serverHandlerKey).(*serverHandler)
}

//go:noinline
func wrappedHTTPHandler(s *serverHandler, rw http.ResponseWriter, req *http.Request) {
	req = req.WithContext(withServerHandler(req.Context(), s))
	tracedServerHandler.ServeHTTP(rw, req)
}

func init() {
	hook.Hook(serveHTTP, wrappedHTTPHandler, originalServeHTTP)
}
```
After solving automatic tracing in Golang, we hit another thorny problem in asynchronous scenarios: because execution switches goroutines, the TraceContext is not passed to the next goroutine automatically. Drawing on Java's Future and C#'s Task asynchronous programming models, we implemented an asynchronous API that automatically propagates the trace context:
```go
future1 := parallel.Go(ctx, func(ctx context.Context) (interface{}, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://www.baidu.com/api_1", nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	byts, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return string(byts), nil
})

future2 := parallel.Go(ctx, func(ctx context.Context) (interface{}, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://www.baidu.com/api_2", nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	byts, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return string(byts), nil
}, parallel.WithTimeout(10*time.Second))

body1, err := future1.Get()
if err != nil {
	return nil, err
}
body2, err := future2.Get()
if err != nil {
	return nil, err
}
return &pb.HelloResponse{
	Success: true,
	Data:    body1.(string) + body2.(string),
}, nil
```
Closing thoughts
After using OpenTelemetry to feed the trace data generated by Erda's own platform calls into Erda's APM, the first benefit we gained was an intuitive view of Erda's runtime topology:
Erda runtime topology
From this topology, we can see problems in Erda's own architecture, such as circular dependencies between services and isolated outlier services. Guided by our own observation data, we can also gradually optimize Erda's call structure with each version iteration.
For our SRE team next door, the alarm messages that Erda APM automatically generates from call anomalies let them learn of the platform's abnormal state the moment it occurs:
Finally, for our development team, the observation data makes it easy to gain insight into slow calls on the platform, and to analyze failures and performance bottlenecks based on traces:
Xiao L: "Besides the above, we can also connect the platform's logs, page load speed, and so on to Erda's observability platform with similar ideas."
Xiaotao suddenly realized: "I see — so 'nesting doll' observability can be played like this! From now on I can drink my coffee and do my own work with confidence 😄."
We are committed to solving the problems and needs of community users in real production environments. If you have any questions or suggestions, follow the [Erda] official account and leave us a message, join the Erda user group, or discuss with us on GitHub!