Overview of Cassandra for MySQL billion-level data migration


wrote more and less tables, such as user browsing article records. The data volume was more than 100 million, and the storage performance with MySQL was not very good. So I tried to migrate to a NoSQL database and chose ScyllaDB after comparison.

ScyllaDB introduction

Scylla is an open source, distributed, decentralized, flexible and scalable, highly available, fault-tolerant, adjustable consistency, row-oriented database.

Scylla re-implemented Cassandra in C++ to improve performance and take advantage of multi-core services. It solves some of the pitfalls of Apache Cassandra. It is an embedded replacement database for Apache Cassandra. The drivers between the two databases are compatible with each other.

and ScyllaDB

  • Scylla is a p2p architecture, there is no single point of failure, Hbase is a master-slave structure, you need to choose the master and maintain the master-slave relationship;
  • Scylla is an AP system, and it can also be made into a CP system by adjusting parameters (as long as R+W>N). HBase is a CP system. If there is a strong data consistency requirement, use HBase;
  • Scylla is a data storage and data management system. HBase is only responsible for data management, it needs to cooperate with HDFS and zookeeper to build a cluster;
  • Scylla has better write performance, because the data hash is scattered, the read performance is not as good as HBase. The HBase writing process is more cumbersome, but its data is sorted in the regin, and the read performance is better;
  • Both are suitable for storing time-series data, such as sensor data, website visits, user behavior, stock market transaction data, etc.;
  • Scylla is good at data access because the write performance is better.

Docker builds ScyllaDB cluster

docker run --name scylla-node2 -p 8042:9042 -p 8160:9160 -p 1000:10000 -p 8180:9180 -v /data/scylladb2:/var/lib/scylla -d scylladb/scylla --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' scylla)"

docker run --name scylla-node3 -p 10042:9042 -p 10160:9160 -p 1100:10000 -p 10180:9180 -v /data/scylladb3:/var/lib/scylla -d scylladb/scylla --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' scylla)"

docker run --name scylla-node4 -p 11042:9042 -p 11160:9160 -p 1200:10000 -p 11180:9180 -v /data/scylladb4:/var/lib/scylla -d scylladb/scylla --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' scylla)"

Next, first introduce the basic information of Cassandra.

Data model

Cassandra's data model draws on the design ideas of Google BigTable, and has the following concepts:

  1. Keyspace (keyspace): It is equivalent to the database in the relational database model and is the uppermost command space.
  2. Key: Corresponding to the primary key in the SQL database. In Cassandra, each row of data records is stored in the form of key/value, where key is a unique identifier.
  3. Column family: A column family is a container for an ordered set of rows. In turn, each row is an ordered set of columns.
  4. Column: The most basic data structure unit, composed of name and value, which contains the timestamp and time-to-live attributes (can be viewed through the writetime() or ttl() function.
  5. Super column: Super column is a special column, so it is also a key-value pair. But the super column stores the mapping of the sub-columns. Cassandra allows the value in key/value to be a map (key/value_list), that is, a column has multiple sub-columns.

The basic attributes of Keyspace in Cassandra are

  • replication factor : It is the number of computers in the cluster that will receive copies of the same data.
  • Replica Placement Strategy : It's just a strategy to place the replica in the ring. We have strategies such as Simple Strategy (Rack Aware Strategy), Old Network Topology Strategy (Rack Aware Strategy) and Network topology strategy (data center sharing strategy).
  • column family : Keyspace is a container for one or more column family lists. Conversely, a column family is a container for row sets. Each row contains a sequence, the column family represents the structure of the data, and each key space has at least one and usually many column families.

The syntax for creating Keyspace is as follows

 CREATE KEYSPACE Keyspace name
 WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};

Three main keys

Cassandra mainly consists of three types of keys:

  1. primary key: the key used to determine a piece of data, composed of partition key and clustering key (clustering key is not necessary)
  2. partition key: used for data partitioning. When reading or writing data from the cluster, a function called Partitioner is used to calculate the hash value of the partition key. The hash value is used to determine the node that contains the hash value of the row
  3. clustering key: mainly used for range search and sorting (equivalent to the key in SortedMap)

Cassandra will perform a hash calculation on the partition key and decide which NODE to put the record on. The partition key can be a single primary key or a composite primary key, but the primary key cannot be duplicated (records with duplicate primary keys are likely to be affected by the clustering key. Used for Range Query (range search), and when using it, it needs to provide information in the order of table creation.

CREATE TABLE brows_record_by_target (
        target_id bigint,
        create_user bigint,
        TYPE text,
        app_id text,
        create_time timestamp,
        id bigint,
        PRIMARY KEY ((target_id, TYPE, app_id), create_user, create_time))
    (create_user DESC, create_time DESC);

Among them (target_id, TYPE, app_id) is the partition key, create_user, create_time are the cluster keys, and the whole combination is PRIMARY KEY.


Core components

The core components of Cassandra include:
Gossip : Point-to-point communication protocol, used to exchange node location and status information with each other. When a node starts, it immediately stores Gossip information locally, but when the node information changes, the historical information needs to be cleaned, such as IP changes. Through the Gossip protocol, each node regularly exchanges data between itself and the node with which it has exchanged information every second. Each information exchanged has a version number, so that when new data is available, old data can be overwritten, in order to ensure data For the accuracy of exchange, all nodes must use the same cluster list. Such nodes are also called seeds.

Partitioner : Responsible for distributing data in the cluster. It determines which nodes will place the first copy. Generally, Hash is used as the primary key, and each row of data is distributed to different nodes to ensure the cluster’s availability. Extensibility.

Replica placement strategy : The replication strategy, which determines which node to place the replicated data and the number of replicated copies.

Snitch : Define a network topology to determine how to place replicated data and route requests efficiently.

cassandra.yaml : The main configuration file, which sets the initial configuration of the cluster, table cache parameters, tuning parameters and resource usage, timeout settings, client connections, backups and security

Data distribution

Starting from Cassandra 1.2, each node of Cassandra can have many tokens (hash ranges). This new model is called virtual node. Virtual nodes allow each node to have many hash ranges in the cluster. The virtual node also uses consistent hashing to distribute data, but it does not need to generate and distribute tokens. Each node randomly obtains m virtual nodes from the hash ring according to the configured virtual nodes (m is the number of configured virtual nodes).


Benefits of using virtual points

  1. There is no need to calculate and allocate the token of each node in the cluster.
  2. When adding or removing nodes, the cluster will automatically rebalance the load. When a node joins the cluster, it will share part of the data from other nodes in the cluster; if a node fails, the data on that node will be evenly distributed to other nodes in the cluster
  3. Rebuilding dead nodes is faster.
  4. Allows the use of machines with different performance in the same cluster


Simple use of CQL

CQL detailed guide

Benchmarking SQL differentiation

  • Support json mode select\insert
  • update will be added if it does not exist
  • Update where conditions must include all primary keys, and cannot update fields of type static
  • Additions, deletions, and changes must be made on the primary key column. For a primary key of the form (A, B), if the query condition does not have a partition key A, the query statement needs to enable allow filtering.
  • Too many partition key fields will cause a lot of trouble for future queries. When building a table, you must first consider the data model to avoid falling pits later.

query limit

  • When using the primary key to query, because the partition key is implemented by hash, you can only use =;
  • The clustering key is implemented by sortedMap and can be used for range search and equal sign search. For example: Define PRIMARY KEY ((mainland), state, uid) , then (mainland), state, uid constitutes a joint primary key, where mainland is the partition key, and state and UID are clustering keys.

    select * from users where mainland = 'northamerica' 
    and state > '1' and state < '5'; 
  • Compound partition key queries must satisfy key_part_one=' 'and key_part_two =''
  • In addition, the overall data of Cassandra can be understood as a huge nested Map. You can only go deeper one by one in order, and you can't skip a certain layer in the middle. E.g:

    select * from users where location = 'shanghai' and uid < 5;

    Since no state information is provided, it is not feasible.

In short, the storage in Cassandra is a 2-level nested Map: Partition Key –> Custering Key –> Data

index query

Index: Index can be created on all columns except partition key, of course, except for some types, such as collection types. When using the index to query, the specific restrictions are:

  • Index column can only be queried with =
  • If one of the query conditions is based on index query, then other non-indexed non-primary key fields can be filtered by adding an ALLOW FILTERING. E.g. SELECT * FROM user WHERE c>0 ALLOW FILTERING;

In summary, adding an index to a column is equivalent to turning the column into a partition key to some extent.

阅读 3.3k


354 声望
21 粉丝
0 条评论


354 声望
21 粉丝