Introduction to Cassandra
Apache Cassandra
In this article, we will cover the essentials of Cassandra, all the basics you need to understand before diving deeper and integrating it with other solutions.
What is Cassandra?
Cassandra is an open source, distributed database management system (DBMS) designed to handle very large databases.
Cassandra offers different features such as:
- Scalable architecture
- Failure detection and recovery
- Robust data protection
- Data compression
- CQL (Cassandra Query Language)
Information
| Author: | Avinash Lakshman, Prashant Malik |
|---|---|
| Developer: | Apache Software Foundation |
| Initial release: | 2008 |
Architecture
Cassandra has the structure of a cluster:
At a glance we can see:
Apache Cassandra has a logical ring-like network topology.
Cassandra’s architecture is staged event-driven (SEDA).
- It creates groups of threads (stages) to handle tasks.
It manages and distributes the database across servers, which optimizes processing and improves performance.
Components of Cassandra
Architecture
- Node: Server where the information is stored.
- Data Center: A collection (group) of nodes.
- Cluster: It is a collection of data centers.
Storage
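Commit log: Every write operation is first recorded in a log on disk, so that it can be recovered after a failure.
Memtable: An in-memory structure to which the information is written after the commit log.
SSTable: An immutable file on disk to which the contents of the memtable are flushed.
Bloom filter: An extremely fast probabilistic structure that works somewhat like a cache. It tells whether an SSTable might contain a given key, so reads can skip SSTables that certainly do not contain it.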
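The way these four components cooperate can be illustrated with a minimal in-memory sketch. It is not Cassandra's implementation: the class names `BloomFilter` and `ToyNode`, the `memtable_limit` parameter, and the use of salted MD5 digests are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive several bit positions from salted MD5 digests of the key.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class ToyNode:
    """Hypothetical single node: commit log -> memtable -> SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stands in for the on-disk log
        self.memtable = {}     # in-memory writes
        self.sstables = []     # (bloom filter, data) pairs, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record the operation
        self.memtable[key] = value            # 2. apply it in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. persist when full

    def flush(self):
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.append((bloom, dict(self.memtable)))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newest SSTable first; the Bloom filter lets us skip most misses.
        for bloom, data in reversed(self.sstables):
            if bloom.might_contain(key) and key in data:
                return data[key]
        return None


node = ToyNode(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)   # second write fills the memtable and triggers a flush
node.write("a", 3)   # newer value stays in the memtable
print(node.read("a"), node.read("b"), node.read("missing"))
```

Reading the newest data first is what lets the newer value of `a` win over the older one already flushed to an SSTable.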
Installation of Cassandra
Minimum requirements to deploy Cassandra:
- 2 cores
- 8 GB of RAM
- Java v8
Install Java:
$ sudo apt install openjdk-8-jdk
Install the dependencies required by the Cassandra server:
$ sudo apt-get install gnupg2 wget curl unzip apt-transport-https -y
Add the Cassandra GPG key:
$ wget -q -O - https://www.apache.org/dist/cassandra/KEYS | sudo apt-key add -
Add the Cassandra repository to APT:
$ sudo sh -c 'echo "deb http://www.apache.org/dist/cassandra/debian 311x main" > /etc/apt/sources.list.d/cassandra.list'
Update the package index and install the latest version of Apache Cassandra:
$ sudo apt update
$ sudo apt install cassandra
Verify that Cassandra is installed:
$ nodetool status
output:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.0.100 128.91 KiB 256 100.0% 5cb85aad-68e9-4521-9464-a9b0899b2c76 rack1
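The two-letter code at the start of each node line combines the status and state legends from the header (`UN` means Up and Normal). As a rough sketch, a line like the one above could be decoded as follows. The helper `parse_status_line` is hypothetical and assumes exactly the column layout of this sample output.

```python
STATUS = {"U": "Up", "D": "Down"}
STATE = {"N": "Normal", "L": "Leaving", "J": "Joining", "M": "Moving"}


def parse_status_line(line):
    """Decode one node line of `nodetool status` (layout assumed as in the sample)."""
    fields = line.split()
    code = fields[0]
    return {
        "status": STATUS[code[0]],
        "state": STATE[code[1]],
        "address": fields[1],
        "load": " ".join(fields[2:4]),  # value and unit, e.g. "128.91 KiB"
        "tokens": int(fields[4]),
        "owns": fields[5],
        "host_id": fields[6],
        "rack": fields[7],
    }


sample = "UN 192.168.0.100 128.91 KiB 256 100.0% 5cb85aad-68e9-4521-9464-a9b0899b2c76 rack1"
print(parse_status_line(sample))
```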
Operating Cassandra
- systemctl: Command-line tool used to manage the Cassandra service.
  - Start, stop, restart.
- Users and groups:
  - Group: cassandra
  - User: cassandra
  - Both have full permissions.
Configuration
The main configuration file is cassandra.yaml.
Its location depends on the type of installation:
Cassandra package installations: /etc/cassandra
The configuration properties are grouped into the following sections:
Quick start: The minimum properties required to configure a cluster.
Commonly used: Most commonly used properties when configuring Cassandra.
Performance optimization: Performance tuning and system resource utilization.
Advanced: Properties for advanced users or properties that are used less frequently.
Security: Server and client security settings.
Quick start properties
cluster_name: 'Test Cluster'
: The name of the cluster. It must be the same on every server that belongs to the cluster.
seeds: "127.0.0.1"
: Replace the default with a comma-separated list of the IPs of the servers in the cluster that act as seeds, the contact points other nodes use to join.
endpoint_snitch: SimpleSnitch
: The snitch tells Cassandra about the network topology, so that requests are routed efficiently and replicas are distributed among the servers. SimpleSnitch is suited to simple single-data center deployments. In a cluster that spans several racks or data centers, a topology-aware snitch such as RackInferringSnitch can be used.
listen_address: localhost
: To accept connections from other servers as well, set this to the real IP of the server.
listen_on_broadcast_address: false
: When set to true, the node also listens on its broadcast_address in addition to listen_address, which is needed in some multi-network setups.
rpc_address: localhost
: The address on which client connections are accepted. Set it so that external applications using the database can connect.
Features of Cassandra
Among the most important features we can find:
Scalable architecture: a masterless design in which all nodes are the same, offering operational simplicity and easy horizontal scalability.
Active-everywhere design: all nodes can be written to and read from.
Linear scale performance: nodes can be added without downtime, and each added node brings a performance gain.
Continuous availability: eliminates single points of failure and provides constant uptime.
Transparent failure detection and recovery: failed nodes can be easily restored or replaced.
Flexible and dynamic data model: supports modern data types with fast reads and writes.
Robust data protection: a commit log design prevents data loss, and built-in backup and restore keeps data protected and secure.
Tunable data consistency: supports strong or eventual data consistency across a widely distributed cluster.
Multi-data center replication: replication across data centers (in different geographies) and across multiple cloud availability zones, for both writes and reads.
Data compression: data can be compressed by up to 80% with little resource overhead.
CQL (Cassandra Query Language): a SQL-like language that makes the transition from a relational database very easy.
Directory Structure
Cassandra’s directory structure consists of a directory for each keyspace, containing one directory per table with the data files inside.
Backup and snapshot directories are stored inside the respective table directories.
As we can see in the following image:
Separate table directories
Cassandra provides fine-grained control of table storage on disk.
Tables are written to disk using a separate directory for each table.
Data files are stored using this directory and file name format:
/data/data/ks1/cf1-5be396077b811e3a3ab9dc4b9ac088d/ks1-cf1-hc-1-Data.db
Where ks1 is the name of the keyspace. Including the keyspace name in the file name helps identify the data when it is streamed or bulk loaded.
The hexadecimal string 5be396077b811e3a3ab9dc4b9ac088d in this example is the table ID, appended to the table name.
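To make the naming scheme concrete, the following sketch splits the example path into its parts. The function `describe_data_file` is a hypothetical helper, and it assumes the layout `<keyspace>/<table>-<id>/<file>` seen in the example.

```python
from pathlib import PurePosixPath


def describe_data_file(path):
    """Split a data file path into keyspace, table name, table ID and file name."""
    p = PurePosixPath(path)
    table_dir = p.parent.name                # "<table>-<id>"
    table, _, table_id = table_dir.rpartition("-")
    return {
        "keyspace": p.parent.parent.name,    # directory above the table directory
        "table": table,
        "table_id": table_id,
        "file": p.name,
    }


info = describe_data_file(
    "/data/data/ks1/cf1-5be396077b811e3a3ab9dc4b9ac088d/ks1-cf1-hc-1-Data.db"
)
print(info)
```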
Cassandra creates subdirectories for each table.
Because each table has its own subdirectory, very active tables can be moved to faster media, such as SSDs, to improve performance.
Tables can also be spread across all attached storage devices to balance I/O.
Functionalities
As an open source project, Cassandra offers a range of functionalities and integrations that can be configured depending on the type of environment being managed.
Scalability
Cassandra is scalable and elastic (vertically or horizontally), allowing new machines to be added to increase performance with no downtime.
Cassandra does not use a master-slave architecture. Writes can simply be directed to any available node, without shutting down the system.
Cassandra scales horizontally to meet growth requirements in data size and request rates. Horizontal scalability consists of adding nodes to the ring, and each additional node provides a linear improvement in computation and storage.
Cassandra assumes that nodes can fail at any time, and it automatically adjusts itself to make the best use of the CPU.
It uses available memory resources and makes heavy use of advanced compression and caching techniques to make the best use of limited memory and storage capacity.
Data partitioning
Apache Cassandra is a distributed database that stores data in a cluster of nodes.
A partition key is used to partition data between nodes.
Cassandra applies consistent hashing to the partition key to decide which node stores each piece of data.
Because data is partitioned by hashed partition keys, the node holding a given key can be located directly, which improves query response time.
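A minimal sketch of consistent hashing on a ring follows. The `HashRing` class, the node names, and the `vnodes` parameter are illustrative assumptions, and MD5 is only a stand-in hash (Cassandra's default partitioner uses Murmur3).

```python
import bisect
import hashlib


def token_for(value):
    # Illustrative token function; not the hash Cassandra uses by default.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=8):
        # Each node owns several tokens, which spreads its share around the ring.
        self._ring = sorted(
            (token_for(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    def node_for(self, partition_key):
        # The owner is the first token at or after the key's token, wrapping around.
        idx = bisect.bisect_left(self._tokens, token_for(partition_key))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
bigger = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"user:{i}" for i in range(1000)]
moved = [k for k in keys if ring.node_for(k) != bigger.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys moved after adding node-d")
```

The point of the design is visible in the last lines: when a node is added, only the keys that now belong to the new node change owner, while the rest stay where they were.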
Replication
Cassandra stores replicas of data on several nodes to ensure reliability and fault tolerance (redundancy).
A replication strategy determines where replicas are placed.
The total number of replicas of each row across the cluster is called the “replication factor”.
If you add additional Cassandra nodes to the cluster, the default replication factor is not affected.
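One simple placement rule is to put the first replica on the node that owns the key's token and the remaining replicas on the next nodes clockwise around the ring. The sketch below illustrates that rule under stated assumptions: the `RING` tokens, node names, and the function `replicas_for` are made up for the example.

```python
import bisect

# Hypothetical ring: (token, node) pairs sorted by token.
RING = [(0, "node-a"), (25, "node-b"), (50, "node-c"), (75, "node-d")]


def replicas_for(token, ring, replication_factor):
    """Walk the ring clockwise from the owner, collecting distinct nodes."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    replicas = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == replication_factor:
            break
    return replicas


print(replicas_for(30, RING, 3))   # ['node-c', 'node-d', 'node-a']

# Adding a node changes which nodes hold the replicas, not how many there are.
bigger_ring = sorted(RING + [(40, "node-e")])
print(replicas_for(30, bigger_ring, 3))   # ['node-e', 'node-c', 'node-d']
```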
Consistency level
One of the most important advantages of Cassandra is that it lets you tune the balance between availability and consistency of the data by configuring the replication factor and consistency level properties.
The consistency level is defined per query and determines how many replicas must respond before the result is returned to the client.
The replication factor determines how many replicas exist, and writes are always sent to all of them.
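The interplay between the two settings comes down to simple arithmetic. The sketch below assumes the consistency levels ONE, QUORUM and ALL and the usual rule that a read is guaranteed to see the latest write when the read and write replica counts together exceed the replication factor. The function names are hypothetical.

```python
def quorum(replication_factor):
    """A majority of the replicas."""
    return replication_factor // 2 + 1


LEVELS = {
    "ONE": lambda rf: 1,
    "QUORUM": quorum,
    "ALL": lambda rf: rf,
}


def required_acks(level, replication_factor):
    """Number of replicas that must respond for the given consistency level."""
    return LEVELS[level](replication_factor)


def reads_see_latest_write(read_level, write_level, replication_factor):
    """True when the read and write replica sets are guaranteed to overlap."""
    r = required_acks(read_level, replication_factor)
    w = required_acks(write_level, replication_factor)
    return r + w > replication_factor


print(required_acks("QUORUM", 3))                    # 2
print(reads_see_latest_write("QUORUM", "QUORUM", 3)) # True
print(reads_see_latest_write("ONE", "ONE", 3))       # False
```

Lower levels such as ONE respond faster and tolerate more node failures, at the cost of possibly returning stale data.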
Compaction
Cassandra offers several compaction strategies. Choosing the right strategy for the workload ensures the best performance for both queries and compaction itself:
Size Tiered Compaction Strategy (STCS): The default compaction strategy. It is most useful for workloads that are not pure time series on spinning disks, or when the I/O cost of LCS is too high.
Leveled Compaction Strategy (LCS): Optimized for read-heavy workloads or workloads with many updates and deletes. It is not a good choice for immutable time series data.
Time Window Compaction Strategy (TWCS): Designed for mostly immutable time series data written with a TTL.
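Whatever the strategy, the core operation is the same: several SSTables are merged into one, keeping only the newest version of each key. The following is a simplified sketch of that merge step, not of any particular strategy. The `compact` function and the `TOMBSTONE` marker are illustrative, and the sketch drops deleted keys immediately, whereas a real cluster retains deletion markers for a grace period.

```python
TOMBSTONE = object()  # marker standing in for a deleted value


def compact(sstables):
    """Merge SSTables given as dicts of key -> (timestamp, value)."""
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            # Keep the entry with the highest timestamp for each key.
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # Drop deleted keys and return the result sorted by key.
    return {
        key: entry
        for key, entry in sorted(merged.items())
        if entry[1] is not TOMBSTONE
    }


older = {"a": (1, "x"), "b": (1, "y")}
newer = {"a": (2, "z"), "b": (3, TOMBSTONE), "c": (2, "w")}
print(compact([older, newer]))   # {'a': (2, 'z'), 'c': (2, 'w')}
```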
Backups
Cassandra stores data in immutable SSTable files. Backups in Cassandra are copies of those SSTable files.
Snapshots: A snapshot is a copy of the SSTable files of a table at a given point in time.
Incremental backups: A copy of the SSTable files of a table created through a hard link. Incremental backups are combined with snapshots to reduce backup time and disk space.
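Because SSTable files never change once written, a backup can be taken by hard-linking them instead of copying their contents. The sketch below shows the idea with the standard library on a temporary directory. The `snapshot` function and the `demo` snapshot name are hypothetical, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile


def snapshot(table_dir, snapshot_name):
    """Hard-link every file in table_dir into snapshots/<snapshot_name>."""
    snap_dir = os.path.join(table_dir, "snapshots", snapshot_name)
    os.makedirs(snap_dir)
    for name in os.listdir(table_dir):
        src = os.path.join(table_dir, name)
        if os.path.isfile(src):  # skips the snapshots directory itself
            os.link(src, os.path.join(snap_dir, name))
    return snap_dir


with tempfile.TemporaryDirectory() as table_dir:
    data_file = os.path.join(table_dir, "ks1-cf1-hc-1-Data.db")
    with open(data_file, "wb") as f:
        f.write(b"immutable sstable bytes")

    snap_dir = snapshot(table_dir, "demo")
    linked = os.path.join(snap_dir, "ks1-cf1-hc-1-Data.db")

    # Both names point at the same data on disk, so no extra space is used.
    print(os.path.samefile(data_file, linked))
```

Since the link shares the data with the original file, the snapshot is created almost instantly and only starts to consume extra space once the original file is removed by compaction.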
Backups serve the following needs:
To store a copy of data for durability.
To be able to restore a table if the table data is lost due to a node/partition/network failure.
To be able to transfer SSTable files to a different machine; for portability.
Competitors
MongoDB: a cross-platform, document-oriented, non-relational (NoSQL) database. It is an open source document database that stores data in documents made up of key-value pairs.
DynamoDB: a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale. DynamoDB offers integrated security, continuous backups, automated multi-region replication, in-memory caching and data export tools.