This month, the HStreamDB team has mainly been preparing for the final development and release of v0.9, further improving and testing the new features it will bring, such as the stream partition model improvements, the new cluster mechanism, and HStream IO. The major clients have also been upgraded to adapt to v0.9.

Stream partition model improvements

In previous versions, HStreamDB adopted a transparent partition model: the number of partitions in each stream was dynamically adjusted according to the write load, and the partitions inside a stream were invisible to users. The advantage of this model is that it keeps the user-facing concepts simple while retaining implementation flexibility: the number of partitions can scale dynamically with the load, and the required data ordering is preserved during scaling.

The main disadvantage of this model is that users cannot directly perform partition-level operations or exercise fine-grained control; for example, they cannot read the data of a specific partition starting from an arbitrary position. To address this, we decided to open up partitions and give users direct control over them, so that users can:

  • Control the routing of data between partitions via a partitionKey
  • Read the data of any shard directly from a specified position
  • Manually control the dynamic scaling of partitions within a stream

In terms of implementation, HStreamDB adopts a key-range-based partitioning mechanism: all shards under a stream jointly cover the entire key space, each shard owns a contiguous subspace (key range), and scaling shards out or in corresponds to splitting or merging these subspaces. Scaling does not copy or migrate old data; instead, the parent shard is sealed, new data automatically flows into the child shards, and the data of the parent shard remains readable. With this design, the dynamic scaling of partitions is more controllable and fast, and it avoids the inefficiency and data-ordering problems that redistributing old data would bring. This is, in fact, also the internal working mechanism behind the previous transparent partitioning.
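
As a minimal, illustrative sketch of this mechanism (not HStreamDB's actual internals), the Java code below routes a record to the shard whose key range contains the hash of its partitionKey; the class and method names, the choice of hash, and the assumption that the shards jointly cover the whole key space are all made up for the example.

import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Each shard owns a contiguous, inclusive key range within a hashed key space.
final class Shard {
    final long shardId;
    final long startKey;
    final long endKey;

    Shard(long shardId, long startKey, long endKey) {
        this.shardId = shardId;
        this.startKey = startKey;
        this.endKey = endKey;
    }
}

final class ShardRouter {
    // Index shards by the start of their key range; the owning shard of a hashed
    // key is the one with the greatest startKey that is <= the key.
    private final TreeMap<Long, Shard> byStartKey = new TreeMap<>();

    ShardRouter(List<Shard> shards) {
        for (Shard s : shards) {
            byStartKey.put(s.startKey, s);
        }
    }

    // Route a record to the shard whose key range contains hash(partitionKey).
    // Assumes the shards cover the key space starting at 0, so a floor entry always exists.
    Shard lookupShard(String partitionKey) {
        long hashed = hash(partitionKey);
        Map.Entry<Long, Shard> entry = byStartKey.floorEntry(hashed);
        return entry.getValue();
    }

    // Placeholder hash for the example; the real mapping from partitionKey to the
    // key space may differ.
    private static long hash(String partitionKey) {
        return Integer.toUnsignedLong(partitionKey.hashCode());
    }
}

Under this scheme, splitting a shard amounts to dividing its key range into two child ranges and sealing the parent shard, without moving the data already written.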

The partition model improvements described above will be included in the upcoming v0.9 release (the ability to manually control partition splitting and merging is not included for now).

HStream IO update

HStream IO is an internal data integration framework that will be released with HStreamDB v0.9. It includes source connectors, sink connectors, the IO runtime, and other components, and it connects HStreamDB with various external systems, helping data circulate efficiently across the entire enterprise data stack and release its real-time value.

After adding CDC source connectors for multiple databases last month, this month we added sink connector support for MySQL and PostgreSQL. The embedded IO runtime has also been improved and enhanced in several aspects, including connector parameter checking, configuration document generation, and safe task exit. SQL commands are also provided so that users can conveniently create and manage IO tasks through the CLI. Examples are as follows:

create source connector source01 from mysql with ("host" = "127.0.0.1", "port" = 3306, "user" = "root", "password" = "password", "database" = "d1", "table" = "t1", "stream" = "stream01");

create sink connector sink01 to postgresql with ("host" = "127.0.0.1", "port" = 5432, "user" = "postgres", "password" = "postgres", "database" = "d1", "table" = "t1", "stream" = "stream01");

show connectors;

pause connector source01;

resume connector source01;

drop connector source01;

HStream MetaStore

Currently, HStreamDB uses ZooKeeper to store system metadata, such as the replication properties of shards and the task allocation and scheduling information of cluster nodes. This brings additional complexity to deploying, operating, and maintaining HStreamDB, for example the need to manage a separate ZooKeeper cluster.

To this end, we plan to remove HStreamDB's direct dependency on ZooKeeper and introduce a dedicated HStream MetaStore component (HMeta for short). HMeta will provide a set of abstract metadata storage interfaces that can, in principle, be implemented on top of a variety of storage systems. We are currently developing a default implementation based on rqlite (https://github.com/rqlite/rqlite). rqlite is built on SQLite and Raft, written in Go, and is very lightweight and easy to deploy and manage.
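
To give a sense of what such an abstraction might look like, here is a hypothetical sketch of a metadata storage interface; the actual HMeta interface is still under development and will likely differ.

import java.util.Optional;

// Hypothetical metadata storage abstraction; not the actual HMeta API.
interface MetaStore {
    // Read the value stored under a metadata key, if any.
    Optional<byte[]> get(String key);

    // Write a value, using an expected version for optimistic concurrency control.
    void put(String key, byte[] value, long expectedVersion);

    // Remove a metadata entry.
    void delete(String key);
}

A default implementation of such an interface could be backed by rqlite, while ZooKeeper or other storage systems remain possible behind the same abstraction.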

The development of HMeta is still in progress. As mentioned in our previous newsletter, HServer's new cluster mechanism no longer relies on ZooKeeper, and this month we have also migrated HStore's EpochStore to HMeta. This feature will not be included in the upcoming v0.9 release; it needs more testing, and we plan to release it in v0.10.

Client update

This month, the clients have also received a number of upgrades to adapt to HStreamDB v0.9. Taking hstreamdb-java as an example, the main changes include the following (a usage sketch follows the list):

  • createStream now supports specifying the initial number of partitions
  • Added a listShards method
  • producer and bufferedProducer have been adapted to the new partition model
  • Added a Reader class, which can be used to read any partition
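
The snippet below is a hypothetical sketch of how these additions might be used together; the exact method names and signatures are assumptions based on the list above and may differ from the released hstreamdb-java API.

import io.hstream.HStreamClient;

public class ClientSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical usage; signatures are assumptions and may differ from the released client.
        try (HStreamClient client =
                 HStreamClient.builder().serviceUrl("hstream://127.0.0.1:6570").build()) {
            // Create a stream with replication factor 3 and an assumed initial shard count of 4.
            client.createStream("stream01", (short) 3, 4);

            // List the shards of the stream (assumed to expose shard ids and key ranges).
            client.listShards("stream01").forEach(shard -> System.out.println(shard));

            // The new Reader class (omitted here) can then be built to read a chosen
            // shard starting from a specified position.
        }
    }
}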

Clients for other languages (Golang, Python) will also include support for v0.9.

Other

Some other noteworthy features completed this month include:

  • Added the advertised-listeners configuration to HServer, which solves the problem of external clients accessing HStreamDB when it is deployed in a complex network environment.
  • Improved the bootstrap process when the HServer cluster starts.

Copyright statement: This article is original content by EMQ; please indicate the source when reprinting.

Original link: https://hstream.io/zh/blog/hstreamdb-newsletter-202207

