Cassandra is a distributed structured data store (a NoSQL database). Its storage model is richer than that of key-value stores (such as Redis) but more limited than that of document databases (such as MongoDB). It is well suited to applications such as data analysis or data warehouses that need fast lookups over large volumes of data.
Cassandra's cluster features are rich, and there are many scenarios to consider. To use a cluster well, you must understand a number of cluster concepts; they are introduced below.
Correspondence with relational database concepts:
keyspace -> table -> column corresponds to a relational database's database -> table -> column.
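The mapping is visible directly in CQL; a minimal sketch, where the keyspace, table, and column names are made up for illustration:

```sql
-- keyspace ~ database
CREATE KEYSPACE my_ks
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- table ~ table, containing columns
CREATE TABLE my_ks.users (
    user_id uuid PRIMARY KEY,
    name    text,
    email   text
);
```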
The main configuration of the cluster (in cassandra.yaml):
cluster_name: the cluster name; all nodes in the same cluster must use the same name.
seeds: the seed nodes; the IP addresses of machines in the cluster, separated by commas.
storage_port: the port for server-to-server communication within Cassandra; it generally does not need to be changed, but make sure no firewall blocks it.
listen_address: the address this node uses for server-to-server communication within the cluster; if left blank, it defaults to the machine's hostname.
native_transport_port: the CQL native transport port, used by CQL clients to interact with the server.
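A minimal cassandra.yaml sketch of these settings (the cluster name and IP addresses are placeholders):

```yaml
cluster_name: 'MyCluster'            # must be identical on every node
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "192.168.1.101,192.168.1.102"
storage_port: 7000                   # server-to-server communication
listen_address: 192.168.1.101        # this node's address
native_transport_port: 9042          # CQL client connections
```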
The default Cassandra ports:
7000 is the cluster communication port (7001 if SSL is enabled).
9042 is the client connection port for the native protocol (CQL).
7199 is the JMX port.
9160 is the port of the deprecated Thrift interface.
The main directory settings:
data_file_directories: one or more directories where data files are stored.
commitlog_directory: the directory where commit log files are stored.
saved_caches_directory: the directory where saved caches are stored.
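For example, in cassandra.yaml (the paths below are the usual package-install defaults, shown as placeholders):

```yaml
data_file_directories:
  - /var/lib/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog
saved_caches_directory: /var/lib/cassandra/saved_caches
```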
Cluster-related concepts:
Data Center
A logical set composed of multiple racks; for example, the interconnected machines in the same building.
Rack
A logical set composed of multiple nodes adjacent to each other; for example, all the physical machines in one rack.
Gossip and Failure Detection
Gossip is a peer-to-peer protocol used for failure detection; it tracks the status of other nodes and runs once per second.
Failure detection is implemented with the Phi Accrual Failure Detector, which computes a level of suspicion indicating the likelihood that a node has failed.
This is more flexible than a traditional heartbeat and avoids its unreliability: an unresponsive node may just be suffering brief network congestion, which is especially common on public clouds.
Snitches
A snitch defines the proximity of each node in the cluster relative to the other nodes, which is used to decide which nodes to read from and write to.
The manually defined mode is generally used: configure endpoint_snitch: GossipingPropertyFileSnitch in cassandra.yaml,
and at the same time configure the dc and rack of the current node in cassandra-rackdc.properties, for example (the dc and rack names are arbitrary):
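```properties
# cassandra-rackdc.properties
dc=dc1
rack=rack1
```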
Rings and Tokens
Cassandra represents the data managed by the cluster as a ring. Each node in the ring is assigned one or more data ranges, described by tokens, which determine its position in the ring. A token is a 64-bit integer ID used to identify each partition, ranging from -2^63 to 2^63-1.
The partition key is run through the hash algorithm to produce a token, which determines the node on which the data is stored.
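The token computed from a partition key can be inspected with CQL's token() function (the keyspace and table names are the hypothetical ones used above):

```sql
-- Returns the Murmur3 token (a 64-bit integer) of each row's partition key
SELECT token(user_id), user_id FROM my_ks.users;
```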
Virtual Nodes
A virtual node is called a vnode for short: the original per-node token range is split into multiple smaller token ranges, so each node holds multiple token ranges. By default, each node is assigned 256 token ranges (adjustable via num_tokens), i.e. 256 vnodes. Vnodes are enabled by default since 2.0; on nodes with weaker hardware, num_tokens can be reduced accordingly.
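The vnode count is a per-node setting in cassandra.yaml:

```yaml
num_tokens: 256   # the default; reduce on lower-powered nodes
```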
Partitioners
A partitioner determines which vnode each piece of data is stored on: it is a hash function that computes the hash value of each row's partition key.
The code is in the org.apache.cassandra.dht package (DHT stands for distributed hash table); at present Murmur3Partitioner is the one mainly used.
Replication Strategies
The first replica is stored on the corresponding vnode; the storage locations of the other replicas are determined by the replica strategy (also called the replica placement strategy).
There are two main strategies:
SimpleStrategy
Places the replicas on consecutive nodes around the ring, starting from the node indicated by the partitioner.
NetworkTopologyStrategy
Allows a different replication factor to be specified for each data center. Within a data center, it allocates replicas to different racks to maximize availability.
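The strategy is chosen per keyspace; a sketch with hypothetical keyspace and data center names:

```sql
-- SimpleStrategy: a single replication factor for the whole cluster
CREATE KEYSPACE demo_simple
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- NetworkTopologyStrategy: a replication factor per data center
CREATE KEYSPACE demo_topo
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2};
```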
Consistency Levels
According to the CAP theorem, consistency, availability, and partition tolerance cannot all be satisfied at once; Cassandra achieves tunable consistency by setting the minimum number of nodes that must respond to a read or write.
The available consistency levels are ANY, ONE, TWO, THREE, QUORUM, and ALL, of which QUORUM and ALL are strongly consistent. The strong consistency condition is R + W > N, where R is the number of replicas read, W is the number of replicas written, and N is the replication factor.
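For example, with replication factor N = 3, QUORUM means 2 replicas, so QUORUM reads plus QUORUM writes give R + W = 2 + 2 = 4 > 3, i.e. strong consistency. In cqlsh the level can be set for the session:

```sql
CONSISTENCY QUORUM;   -- applies to subsequent reads and writes in this cqlsh session
```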
Queries and Coordinator Nodes
A client can connect to any node to execute read and write operations; the connected node is called the coordinator node, and it must handle read and write consistency, for example by writing to multiple nodes and reading from multiple nodes.
Write operation execution process
When a write is performed, the data is first written directly to the commit log file and the dirty flag in the commit log is set to 1. The data is then written to an in-memory memtable (each memtable corresponds to one table). When a memtable reaches its size limit, it is flushed to an SSTable on disk, and the dirty flag in the commit log is set back to 0.
Caching
There are three types of cache:
key cache
Caches the mapping from partition keys to row index entries; stored on the JVM heap.
row cache
Caches commonly read rows; stored off-heap.
counter cache
Improves counter performance.
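The key and row caches can also be tuned per table in CQL (the table name is the hypothetical one used above):

```sql
ALTER TABLE my_ks.users
  WITH caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'};
```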
Hinted Handoff
A write high-availability feature. When a write request is sent to the coordinator, a replica node may be unavailable for various reasons (network, hardware, etc.). In that case the coordinator temporarily saves the write request as a hint and replays it once the replica node comes back online. By default, hints are retained for three hours.
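The relevant cassandra.yaml settings, shown with their shipped defaults:

```yaml
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000   # 3 hours; hints stop being generated for a node down longer than this
```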
Tombstones
SSTable files cannot be modified, so deleting data is treated as an update: the value is overwritten with a tombstone, which suppresses the original value until compaction runs.
This is governed by the garbage collection grace seconds setting (gc_grace_seconds), which defaults to 864,000 seconds, i.e. 10 days.
Tombstones older than this period are cleaned up, and a node that has been unavailable for longer than this period is replaced.
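gc_grace_seconds is a per-table option (table name hypothetical):

```sql
ALTER TABLE my_ks.users WITH gc_grace_seconds = 864000;  -- 10 days, the default
```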
Bloom Filters
A Bloom filter is a fast, non-deterministic algorithm for testing whether an element is in a set; it maps the data set onto a bit array and acts as a special kind of cache. It may return false positives, but it saves unnecessary disk reads.
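The false-positive rate is tunable per table (a lower value uses more memory; table name hypothetical):

```sql
ALTER TABLE my_ks.users WITH bloom_filter_fp_chance = 0.01;
```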
Compaction
SSTables are immutable; compaction regenerates a new SSTable file that no longer contains unnecessary data, such as deleted data.
There are three strategies (a CQL sketch follows the list):
SizeTieredCompactionStrategy (STCS)
The default strategy; suited to write-intensive workloads.
LeveledCompactionStrategy (LCS)
Suited to read-intensive workloads.
DateTieredCompactionStrategy (DTCS)
Suited to time- or date-based data.
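The strategy is a per-table setting (table name hypothetical):

```sql
ALTER TABLE my_ks.users
  WITH compaction = {'class': 'SizeTieredCompactionStrategy'};
```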
Anti-Entropy and Repair
Cassandra uses the anti-entropy protocol, a gossip-style protocol used to repair data across replicas.
There are two cases:
read repair
When a read finds data that is not up to date, a repair is triggered.
Anti-Entropy repair
Run the repair manually through nodetool.
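For example (run on a node; the keyspace name is hypothetical):

```shell
nodetool repair          # repair all keyspaces on this node
nodetool repair my_ks    # repair a single keyspace
```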
Merkle Trees
Merkle trees, named after Ralph Merkle and also called hash trees, are a kind of binary tree in which each parent node is the hash of its direct children. They are used to reduce network I/O during repair.
Staged Event-Driven Architecture (SEDA)
Cassandra uses a staged event-driven architecture; see the paper "SEDA: An Architecture for Well-Conditioned, Scalable Internet Services".
A stage consists of an event queue, an event processor, and a thread pool.
A controller decides stage scheduling and thread allocation. The main code is in org.apache.cassandra.concurrent.StageManager.
The following operations are executed as stages:
Read (local reads)
Mutation (local writes)
Gossip
Request/response (interactions with other nodes)
Anti-entropy (nodetool repair)
Read repair
Migration (making schema changes)
Hinted handoff
System Keyspaces
system_traces
Stores query tracing data.
system_schema
- keyspaces
- tables
- columns
Store the definitions of keyspaces, tables, and columns
- materialized_views
Stores the available views
- functions
User-defined functions
- types
User-defined types
- triggers
Trigger configuration for each table
- aggregates
Aggregate definitions
system_auth
Stores roles and permissions for authentication and authorization.
system
- local
- peers
Store node information
- available_ranges
- range_xfers
Store token ranges
- materialized_views_builds_in_progress
- built_materialized_views
Track view construction
- paxos
Stores Paxos state
- batchlog
Stores the status of atomic batch operations
- size_estimates
Stores the estimated number of partitions for each table, used for Hadoop integration