RocketMQ-Streams focuses on the scenario of "large data volume -> high filtering -> light window computing". Its core strengths are light resource usage and high performance, which give it a clear advantage in resource-sensitive scenarios: it can be deployed with as little as 1 core and 1 GB of memory, and through extensive filtering optimizations it runs 2-5 times faster than other big data engines. It is widely used in security, risk control, edge computing, and message queue stream computing.
RocketMQ-Streams is compatible with Flink SQL and Flink udf/udtf/udaf, and deeper integration with the Flink ecosystem is planned. It can run standalone, or be published as a Flink task and run in a Flink cluster. For scenarios that already have a Flink cluster, it keeps the light-resource advantage while enabling unified deployment, operation, and maintenance.
RocketMQ-Streams Features and Application Scenarios
RocketMQ-Streams application scenarios
• Computing scenario: suited to "large data volume -> high filtering -> light window computing". Unlike mainstream computing engines, which require the complex process of deploying a cluster, then writing, publishing, tuning, and running a task, RocketMQ-Streams itself is just a lib package: once a stream task is written against the SDK, it can be run directly. It supports the computing features big data development requires: Exactly-Once semantics, flexible windows (tumbling, sliding, session), dual-stream join, high throughput, low latency, and high performance. It can run with as little as 1 core and 1 GB of memory.
• SQL engine: RocketMQ-Streams can also be regarded as an SQL engine. It is compatible with Flink SQL syntax and supports Flink udf/udtf/udaf extensions. SQL hot upgrade is supported: after writing SQL, submit it through the SDK and the hot release of the SQL is complete.
• ETL engine: RocketMQ-Streams can also be regarded as an ETL engine. In many big data scenarios, data from a source must go through ETL and be aggregated into unified storage. Built-in functions such as grok and regular-expression parsing can be combined in SQL to complete the data ETL.
• Development SDK: it is also a data development SDK package, and most of its components can be used standalone. For example, Source/Sink hides the details of data sources and data storage behind a unified programming interface: with one set of code, input and output can be switched without code changes.
RocketMQ-Streams Design Ideas
Design target
• Few dependencies and easy deployment: a single instance runs with 1 core and 1 GB of memory, and scale can be expanded at will.
• Provide the required big data features: Exactly-Once semantics, flexible windows (tumbling, sliding, session), dual-stream join, high throughput, low latency, and high performance.
• Keep cost controllable: low resource usage with high performance.
• Be compatible with Flink SQL and UDF/UDTF, making it easier for non-technical personnel to use.
Design Ideas
• Adopt a shared-nothing distributed architecture that relies on the message queue for load balancing and fault tolerance. A single instance can start on its own, and capacity is expanded simply by adding instances; concurrency depends on the number of shards.
• Use message queue shards for shuffle, and message queue load balancing for fault tolerance.
• Use storage for state backup, achieving Exactly-Once semantics. Structured remote storage enables fast start, with no need to wait for local storage to recover.
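The shuffle idea above can be sketched in a few lines. This is an illustrative toy, not the RocketMQ-Streams implementation: records are hashed by key onto a fixed set of message queue shards, so all records for one key always reach the same consumer instance, and adding instances only re-balances which shards each instance consumes.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of shuffle via message queue shards.
public class ShardShuffle {
    // Map a key to one of `shardCount` message queue shards.
    public static int shardFor(String key, int shardCount) {
        // Mask the sign bit rather than Math.abs (which overflows for Integer.MIN_VALUE).
        return (key.hashCode() & Integer.MAX_VALUE) % shardCount;
    }

    public static void main(String[] args) {
        int shards = 4;
        String[] keys = {"ip-10.0.0.1", "ip-10.0.0.2", "ip-10.0.0.1"};
        Map<Integer, Integer> perShard = new HashMap<>();
        for (String k : keys) {
            int shard = shardFor(k, shards);
            perShard.merge(shard, 1, Integer::sum);
            System.out.println(k + " -> shard " + shard);
        }
        // The repeated key lands on the same shard both times.
    }
}
```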
RocketMQ-Streams Features and Innovations
Detailed explanation of RocketMQ-Streams SDK
Hello World
By convention, let's start with an example to get to know RocketMQ-Streams.
• namespace: tasks in the same namespace can run in one process and share configuration
• pipelineName: the job name
• DataStreamSource: creates the source node
• map: a user function, extensible by implementing MapFunction
• toPrint: prints the result
• start: starts the task
• Running the above code starts one instance. For multiple concurrent instances, simply start more of them; each instance consumes part of the RocketMQ data.
• Run result: the original message is concatenated with "--" and printed.
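Put together, the Hello World described by the bullets above looks roughly like the following sketch against the SDK. The topic, consumer group, and nameserver address are placeholders, and the exact method signatures should be checked against the release you use:

```java
import org.apache.rocketmq.streams.client.StreamBuilder;
import org.apache.rocketmq.streams.client.source.DataStreamSource;

public class HelloRocketMQStreams {
    public static void main(String[] args) {
        // Tasks in the same namespace can run in one process and share configuration.
        DataStreamSource source = StreamBuilder.dataStream("namespace", "pipeline");
        source
            .fromRocketmq("topic_xxx", "consumer_group_xxx", "127.0.0.1:9876") // source node
            .map(message -> message + "--") // user function: append "--" to each message
            .toPrint(1)                     // print the result
            .start();                       // start the task and begin consuming
    }
}
```

Starting a second process with the same code adds a second instance, and the message queue's load balancing splits the shards between them.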
RocketMQ-Streams SDK
• Use StreamBuilder as the starting point: create a DataStreamSource by setting a namespace and jobName.
• DataStreamSource: through the from method, set the source and create a DataStream object.
• DataStream provides a variety of operations, each producing a different stream:
• The to operation produces a DataStreamAction
• The window operation produces a WindowStream, used to configure window parameters
• The join operation produces a JoinStream, used to configure join conditions
• The split operation produces a SplitStream, used to configure split conditions
• Other operations produce a DataStream
• DataStreamAction starts the whole task and can also configure various strategy parameters for it; both asynchronous and synchronous start are supported.
RocketMQ-Streams operator
There are two deployment modes for SQL: the first is to run the client directly to start the SQL (the first red box in the figure); the second is to build a server cluster and submit SQL through the client to achieve hot deployment (the second red box).
RocketMQ-Streams SQL extension, supports multiple extension methods:
• Extend SQL capabilities through Flink UDF, UDTF, and UDAF, introduced with create function in SQL. One limitation: the UDF must not use the contents of Flink's FunctionContext in its open method.
• Extend SQL through built-in functions. The syntax is the same as Flink's: the function name is the name of the built-in function, and the class name is fixed. As shown in the figure below, a now function is introduced to output the current time. The system ships more than 200 built-in functions, which can be introduced as needed.
• Extend through extension functions. Implementing a function is very simple: annotate the class with Function and annotate each method to be published with FunctionMethod, setting the name under which the function is published. If system information is needed, the first two parameters can be IMessage and AbstractContext; if not, the parameters can be written directly, with no format requirements. As shown in the figure below, a now function is created; both styles can be used. It can be called as currentTime=now(), which adds a variable with key=currentTime and value=the current time to the Message.
• Publish existing Java code as functions: configure the class name, method name, and desired function name of the Java code through policy configuration, and copy the Java jar package into the jar directory. The figure below shows application examples of several of these extensions.
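As a sketch of the extension-function style, the `now` function described above might be written as follows. The Function/FunctionMethod annotations and the IMessage/AbstractContext types follow the rocketmq-streams modules, but the exact package paths shown here are assumptions to be checked against the release:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.rocketmq.streams.common.context.AbstractContext;
import org.apache.rocketmq.streams.common.context.IMessage;
import org.apache.rocketmq.streams.script.annotation.Function;
import org.apache.rocketmq.streams.script.annotation.FunctionMethod;

// Publishes a `now` function, callable from SQL or scripts as currentTime=now().
@Function
public class NowFunction {

    // Variant that receives system information as its first two parameters.
    @FunctionMethod("now")
    public String now(IMessage message, AbstractContext context) {
        return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date());
    }
}
```

Calling `currentTime=now()` then adds a variable with key=currentTime and value=the formatted current time to the Message.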
RocketMQ-Streams architecture and principle implementation
Overall structure
Source implementation
• Source must provide at-least-once consumption semantics, implemented through a checkpoint system message: before the offset is committed, a checkpoint message is sent to notify all operators to flush their in-memory state.
• Source supports automatic load balancing and fault tolerance across shards.
• When a shard is removed, the data source sends a removal system message so that operators can complete shard cleanup.
• When a new shard appears, a new-shard system message is sent so that operators can complete shard initialization.
• The data source starts a consumer through the start method to fetch messages.
• The raw message is encoded, packaged together with additional header information into a Message, and delivered to subsequent operators.
Sink implementation
• Sink balances real-time delivery against throughput.
• To implement a Sink, just inherit the AbstractSink class and implement the batchInsert method. batchInsert means writing a batch of data to storage; the subclass implements it by calling the storage interface, using the storage's batch interface where possible to improve throughput.
• The typical write path is Message -> cache -> flush -> storage. The system strictly guarantees that each batch written to storage does not exceed batchSize; anything larger is split into multiple batch writes.
• Sink has a cache; data is written to the cache by default and flushed to storage in batches to improve throughput (one cache per shard).
• Auto-flush can be enabled: each shard then has a thread that periodically flushes cached data to storage to improve real-time delivery. Implementation class: DataSourceAutoFlushTask.
• The cache can also be flushed to storage by calling the flush method directly.
• Sink's cache has memory protection: when the number of cached messages exceeds batchSize, a flush is forced to release memory.
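The cache-and-flush behavior above can be sketched with a toy sink. This is illustrative only, not the real AbstractSink; a list of lists stands in for the storage:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the Sink write path: Message -> cache -> flush -> storage,
// with every batch written to storage capped at batchSize.
public class BatchingSink {
    private final int batchSize;
    private final List<String> cache = new ArrayList<>();
    private final List<List<String>> storage = new ArrayList<>(); // stands in for real storage

    public BatchingSink(int batchSize) { this.batchSize = batchSize; }

    // Messages land in the cache; memory protection forces a flush once the
    // cache reaches batchSize.
    public void write(String message) {
        cache.add(message);
        if (cache.size() >= batchSize) {
            flush();
        }
    }

    // Split the cache into batches of at most batchSize and "batchInsert" each.
    public void flush() {
        for (int i = 0; i < cache.size(); i += batchSize) {
            batchInsert(cache.subList(i, Math.min(i + batchSize, cache.size())));
        }
        cache.clear();
    }

    // A real subclass would call the storage's batch interface here.
    protected void batchInsert(List<String> batch) {
        storage.add(new ArrayList<>(batch));
    }

    public List<List<String>> storedBatches() { return storage; }

    public static void main(String[] args) {
        BatchingSink sink = new BatchingSink(3);
        for (int i = 0; i < 7; i++) sink.write("msg-" + i);
        sink.flush(); // flush the remainder
        System.out.println(sink.storedBatches().size() + " batches written");
    }
}
```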
RocketMQ-Streams Exactly-ONCE
• Source guarantees that a checkpoint system message is sent when the offset is committed, and every component that receives it completes its save operation. Messages are consumed at least once.
• Every message carries a header that encapsulates its QueueId and offset.
• When a component stores data, it also stores the QueueId and the maximum offset it has processed; duplicate messages are deduplicated against this max offset.
• Memory protection: there may be multiple flushes within one checkpoint cycle (triggered by entry count), keeping memory usage controllable.
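The offset-based deduplication can be sketched as follows; this toy class is illustrative, not part of the SDK. Because consumption is at-least-once, replayed messages arrive with offsets at or below the stored per-queue maximum and are skipped:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of deduplication by (QueueId, max offset) watermark.
public class OffsetDeduplicator {
    private final Map<String, Long> maxOffsetByQueue = new HashMap<>();

    // Returns true if the message is new and should be processed.
    public boolean shouldProcess(String queueId, long offset) {
        long max = maxOffsetByQueue.getOrDefault(queueId, -1L);
        if (offset <= max) {
            return false; // a replay delivered by at-least-once consumption
        }
        maxOffsetByQueue.put(queueId, offset);
        return true;
    }

    public static void main(String[] args) {
        OffsetDeduplicator dedup = new OffsetDeduplicator();
        System.out.println(dedup.shouldProcess("queue-0", 1)); // new
        System.out.println(dedup.shouldProcess("queue-0", 2)); // new
        System.out.println(dedup.shouldProcess("queue-0", 2)); // replay: skipped
        System.out.println(dedup.shouldProcess("queue-1", 1)); // other queue: new
    }
}
```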
RocketMQ-Streams Window
• Supports tumbling, sliding, and session windows, with both event time and natural time (the time the message enters the operator).
• Supports a high-performance mode and a high-reliability mode. The high-performance mode does not rely on remote storage, but window data is lost during shard switching.
• Fast start: no need to wait for local storage to recover. On failure or shard switching, data is recovered asynchronously from remote storage, while remote storage is accessed directly for computation in the meantime.
• Uses message queue load balancing for scale-out and scale-in: each queue belongs to a group, and a group is consumed by only one machine at a time.
• Normal computation relies on local storage and has computing performance similar to Flink's.
Three trigger modes are supported, balancing watermark delay against real-time requirements.
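For intuition, here is a self-contained sketch of how an event timestamp maps onto a tumbling window; it is illustrative only, not the real window operator:

```java
// Toy sketch of tumbling-window assignment by event time: each timestamp
// falls into exactly one window whose start is aligned to the window size.
public class TumblingWindowSketch {
    // Start (in ms) of the window containing the event timestamp.
    public static long windowStart(long eventTimeMs, long windowSizeMs) {
        return eventTimeMs - (eventTimeMs % windowSizeMs);
    }

    public static void main(String[] args) {
        long size = 60_000; // 1-minute tumbling windows
        long[] events = {5_000, 59_999, 60_000, 125_000};
        for (long t : events) {
            long start = windowStart(t, size);
            System.out.println("event@" + t + " -> window [" + start + ", " + (start + size) + ")");
        }
    }
}
```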
Application of RocketMQ-Streams in cloud security
In the context of security applications
• When the public cloud offering moved to the proprietary cloud, intrusion detection computing ran into resource problems: the big data cluster was not provisioned by default, and provisioning one took at least six high-end machines, which customers who had bought Cloud Shield found hard to accept on top of their purchase.
• Upgrades for proprietary cloud users were hard to operate and maintain, making it impossible to ship new capabilities and bug fixes quickly.
Stream Computing in Security Applications
• Build a lightweight computing engine around the characteristics of security workloads (large data volume -> high filtering -> light window computing): analysis showed that every rule pre-filters before the heavy statistics, window, and join operations, and the filtering rate is high. Given this characteristic, statistics and joins can be implemented with a lighter-weight solution.
• Cover 100% of proprietary cloud rules (regular expressions, joins, statistics) through RocketMQ-Streams.
• Light resources: memory usage is 1/70 of the public cloud engine's and CPU usage is 1/6. Fingerprint-filtering optimization improves performance by more than 5x, resource usage does not grow linearly with the number of rules, and new rules add no resource pressure. By reusing the resources of the previous regex engine, more than 95% of sites can be supported without adding physical resources.
• Support tens of millions of threat intelligence entries through highly compressed dimension tables: 10 million entries require only 330 MB of memory.
• Through the client/server deployment mode, both SQL and the engine can be hot released, so rules can be brought online quickly; this matters especially in network protection scenarios.
RocketMQ-Streams future plans
New version download address: https://github.com/apache/rocketmq-streams/releases/tag/rocketmq-streams-1.0.0-preview