Cassandra is a distributed NoSQL data store that offers high availability, data durability, and horizontal scalability. Cassandra is eventually consistent like Amazon's Dynamo and provides a column family data model like Google's BigTable. Cassandra is not a drop-in replacement for a traditional RDBMS, but for some use cases (big data, logging, etc.) it can make perfect sense.
Cassandra needs to be configured once it has been downloaded and installed. A few decisions must be made up front: which nodes will be seed nodes and how data will be partitioned across the nodes.
Seed nodes are nodes in a cluster that should be contacted by newly created nodes when joining the cluster. Once the new node has contacted the seed node, it can begin listening to Gossip traffic.
Data partitioning consists of three elements: a partitioner, replica placement strategy and topology.
When a new row is inserted into the cluster, the partitioner determines which node(s) it should be written to. Cassandra allows pluggable partitioners such as the RandomPartitioner and the ByteOrderedPartitioner. The RandomPartitioner (the default) uses consistent hashing to distribute data across the cluster. The ByteOrderedPartitioner stores rows in sorted order and allows range scans over the row keys.
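A rough sketch of how consistent hashing maps row keys to nodes; the node names are hypothetical and the hash is only similar in spirit to the real RandomPartitioner:

```python
import hashlib
from bisect import bisect_right

def token(key: str) -> int:
    """Hash a key onto the ring; MD5 here stands in for the real partitioner's hash."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Assign each node a token; a real cluster configures tokens explicitly.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def primary_node(self, row_key: str) -> str:
        """The first node whose token is >= the key's token, wrapping around the ring."""
        i = bisect_right(self.tokens, token(row_key)) % len(self.tokens)
        return self.entries[i][1]

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.primary_node("user:42"))  # always the same node for the same key
```

Because the hash spreads keys uniformly around the ring, rows end up evenly distributed, which is the property the RandomPartitioner trades away row ordering to get.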
The cluster topology consists of the number of nodes, the number of data centers, and the distribution of nodes across racks.
Replica Placement Strategy
The replication strategy tells Cassandra how to replicate data across the cluster. Having data live on more than one node keeps it durable in case of node failure. The easiest strategy to use is the SimpleStrategy, which treats all nodes as if they are in the same data center. The replication factor then defines how many nodes in the cluster will contain a copy of the data; a replication factor of 2 means that a single row of data will live on two nodes in the cluster. The other major strategy is the NetworkTopologyStrategy, which allows the definition of multiple data centers and can replicate data to nodes in each of those data centers as a way to keep data durable. As an example, if there are three data centers, A, B, and C, each with four nodes, and data should live on at least one node in each data center, the replication factor would be A:1, B:1, C:1.
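Replica selection under SimpleStrategy can be sketched as a clockwise walk around the ring starting at the primary node; the node names below are hypothetical:

```python
def replicas(sorted_nodes, primary_index, replication_factor):
    """Collect replication_factor distinct nodes, walking clockwise from the primary."""
    chosen = []
    i = primary_index
    while len(chosen) < replication_factor:
        node = sorted_nodes[i % len(sorted_nodes)]  # wrap around the ring
        if node not in chosen:
            chosen.append(node)
        i += 1
    return chosen

ring_order = ["node-a", "node-b", "node-c", "node-d"]  # nodes in token order
print(replicas(ring_order, 1, 2))  # ['node-b', 'node-c']
```

NetworkTopologyStrategy refines the same walk by skipping nodes until the per-data-center replica counts are satisfied, which is why the A:1, B:1, C:1 style of replication factor is expressed per data center.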
The snitch is the component that defines how nodes are grouped together in a cluster. The snitch can be used to define data centers and which nodes belong to each data center.
The default snitch is the SimpleSnitch which groups all nodes in a single data center. This snitch works best when using the SimpleStrategy.
The RackInferringSnitch is useful if the IP address of each node can be broken down by rack and data center. The names of the data centers will be numbers derived from the IP address, which is broken down as 10.D.R.N (D=data center, R=rack, N=node).
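The address breakdown amounts to a simple parse of the second and third octets; the IP below is made up:

```python
def infer_location(ip: str):
    """Split an address of the form 10.D.R.N into (data_center, rack)."""
    _first, dc, rack, _node = ip.split(".")
    return dc, rack

print(infer_location("10.3.7.12"))  # ('3', '7'): data center 3, rack 7
```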
The PropertyFileSnitch can be used to name data centers and assign nodes to them. This snitch requires an external properties file named cassandra-topology.properties. Each line in the properties file has an entry of the form IP=DataCenter:Rack (for example, 10.10.10.1=MYDATACENTER:MYRACK).
Cassandra is often referred to as “schema-less”, but a structure can still be defined using keyspaces and column families. A keyspace is much like an RDBMS schema and provides a container for column families, which are like tables. Unlike tables, column families do not have a set number of columns, allowing each row to define which columns exist on the fly. Allowing rows to contain different columns can lead to unpredictable disk usage.
As stated above, the keyspace is a container of column families and serves the additional purpose of defining the “replica placement strategy”.
The column family holds rows of data that can have varying numbers of columns. Each row stores the row key (primary key), the column name, the column value and a timestamp.
Each column family has a key cache that keeps keys in memory and defaults to a size of 200,000.
The row cache is used to keep entire rows in memory. This feature is disabled by default since it can have negative performance impacts for column families with very large rows.
The row key is the only primary index supported in Cassandra, although there is support for secondary indices. Secondary indices are best suited to low-cardinality data such as state abbreviations, not high-cardinality data such as email addresses.
Table as Index Pattern
One way to compensate for the weak secondary index support is to use the Table as Index pattern. This pattern removes the need to use built-in secondary indices by using tables that provide an inverted index. A drawback is that application code must be used to maintain indices.
Here is how the pattern works. An index table is created for each column of the indexed table that needs an index. When a column value is written, the index table row keyed by that value is found, and the row key from the indexed table is added to it. A delete works the opposite of an add, and an update performs a delete followed by an add.
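A minimal in-memory sketch of the pattern, using Python dicts in place of real column families; the table and column names are hypothetical:

```python
users = {}           # indexed table: row_key -> {column: value}
users_by_state = {}  # index table:   state   -> set of row keys holding it

def add_user(row_key, state):
    users[row_key] = {"state": state}
    users_by_state.setdefault(state, set()).add(row_key)

def delete_user(row_key):
    # A delete works the opposite of an add: remove the row key from the index.
    state = users.pop(row_key)["state"]
    users_by_state[state].discard(row_key)

def update_user_state(row_key, new_state):
    delete_user(row_key)  # an update is a delete followed by an add
    add_user(row_key, new_state)

add_user("u1", "TX")
add_user("u2", "TX")
update_user_state("u2", "CA")
print(sorted(users_by_state["TX"]))  # ['u1']
```

The drawback called out above shows up clearly: nothing keeps `users` and `users_by_state` in sync except application code, so every write path must touch both tables.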
A row is not like a standard table row: each row may have a differing number of columns. Below is a diagram of a set of rows and how they would be stored. The first row has three columns and the last three have only two columns. Each row has a different set of columns, which is why the column name and column value must be stored together. An easy way to think of a row is as a map with keys and values.
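Under that map analogy, a handful of rows might be sketched as nested maps, each column stored as a name with its (value, timestamp) pair; all names and values here are made up:

```python
rows = {
    "row1": {"name": ("Ann", 1), "email": ("ann@example.com", 1), "age": ("34", 1)},
    "row2": {"name": ("Bob", 1), "age": ("41", 1)},
    "row3": {"name": ("Cam", 1), "city": ("Austin", 1)},
}

# Column sets differ per row, which is why name and value travel together.
print(sorted(rows["row2"]))  # ['age', 'name']
```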
Data modeling with Cassandra is nothing like traditional data modeling. Designing a relational structure involves structuring data logically and normalizing the resulting tables. Data retrieval then involves writing queries to extract that data in a meaningful way. Cassandra requires that this process be reversed and the queries be designed before the structure is created. Another difference is that Cassandra recommends de-normalizing data to make reads and writes faster.
Cassandra does not use SQL like a traditional database, but instead uses CQL (Cassandra Query Language).
Data is stored to disk in a structure known as an SSTable (Sorted String Table). SSTables are immutable and can generate several files between compactions.
A read request must go through several stages prior to being fulfilled. Cassandra uses a bloom filter to quickly determine if a row key exists on a node. The bloom filter is a space efficient structure that lives entirely in memory and makes false negatives impossible, but false positives are possible. Next, the row key index is checked to see which SSTable contains the data.
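As a sketch of why the bloom filter check is cheap and safe, here is a toy filter; the size and hash count are arbitrary, not Cassandra's:

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: false positives possible, false negatives impossible."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        # Derive several bit positions per key from independent-ish hashes.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # If any bit is unset the key was definitely never added.
        return all(self.bits & (1 << p) for p in self._positions(key))

bf = BloomFilter()
bf.add("row-key-1")
print(bf.might_contain("row-key-1"))  # True: an added key is never missed
```

A "maybe" answer still forces the index lookup described above, but a "no" answer lets the node skip the SSTable entirely without touching disk.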
When a record is sent to the cluster to be written, the row key is hashed to identify the node which will be the primary node and then additional nodes are determined. Once a node receives the write request, the record is immediately written to the commit log. The commit log is read from upon node start to recover any writes that had yet to be written to SSTables prior to node shutdown. Once the record is written to the commit log, it is written to a memtable which lives in memory. The memtables are written to disk in the form of SSTables after a threshold has been reached.
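The write path above can be sketched in a few lines; the flush threshold and the in-memory stand-ins for the commit log, memtable, and SSTables are all illustrative:

```python
commit_log = []      # durable, append-only record of writes (for recovery)
memtable = {}        # in-memory table of recent writes
sstables = []        # immutable sorted tables, flushed to "disk"
FLUSH_THRESHOLD = 3  # rows per memtable; arbitrary for the sketch

def flush():
    # SSTable = Sorted String Table: rows sorted by key, then made immutable.
    sstables.append(tuple(sorted(memtable.items())))
    memtable.clear()

def write(row_key, columns):
    commit_log.append((row_key, columns))  # commit log first, so a crash loses nothing
    memtable[row_key] = columns
    if len(memtable) >= FLUSH_THRESHOLD:
        flush()

for i in range(4):
    write(f"key{i}", {"col": i})
print(len(sstables), len(memtable))  # 1 1: one flush happened, one row still in memory
```

Recovery on startup is then just replaying `commit_log` entries that never made it into an SSTable, which is exactly the role the text above assigns to the commit log.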
Records do not actually get deleted; they get tombstoned. Tombstoned records are removed during major compactions.
Compaction is a way for Cassandra to merge multiple SSTables into a single SSTable. This process is very heavy on CPU and memory.
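A minimal sketch of what compaction does with overlapping SSTables and tombstones, using hypothetical (row key, value, timestamp) triples where a None value marks a tombstone:

```python
def compact(tables):
    """Merge SSTables, keeping only the newest entry per row and dropping tombstones."""
    merged = {}
    for table in tables:
        for row_key, value, timestamp in table:
            current = merged.get(row_key)
            if current is None or timestamp > current[1]:
                merged[row_key] = (value, timestamp)
    # Emit a single sorted table, with tombstoned rows removed for good.
    return sorted((k, v, ts) for k, (v, ts) in merged.items() if v is not None)

old = [("a", "1", 1), ("b", "2", 1)]
new = [("a", "9", 2), ("b", None, 2)]  # "b" was deleted (tombstone)
print(compact([old, new]))  # [('a', '9', 2)]
```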
Hinted handoff is the process of replaying write operations to a node that was down when those writes originally occurred.
A repair should be executed on nodes to fix any data discrepancies that may occur.
As data gets written to the cluster, the same row may need to be written to multiple nodes based on the replication factor. There is no guarantee that all nodes get the data at the same time, so a read request sent to one node may get an answer that is completely different from a read request sent to another node. Eventually all nodes will have the same data and become consistent. Cassandra does allow the application to define a consistency level when a query is issued. This allows a read request to specify, for example, that all nodes must agree before the data is returned, at a cost to performance.
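The trade-off behind tunable consistency can be sketched with a little arithmetic: with N replicas, a write acknowledged by W nodes and a read that consults R nodes must overlap on at least one up-to-date replica whenever R + W > N. The numbers below are illustrative:

```python
def is_strongly_consistent(n, r, w):
    """True when every read set must intersect every write set (R + W > N)."""
    return r + w > n

N = 3                # replication factor
quorum = N // 2 + 1  # 2 of 3 nodes

print(is_strongly_consistent(N, quorum, quorum))  # True: quorum reads + quorum writes
print(is_strongly_consistent(N, 1, 1))            # False: eventual consistency only
```

Reading and writing at quorum buys consistent answers at the cost of waiting on more nodes, which is exactly the performance trade-off described above.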
There are several tools that come with Cassandra:
- CQLSH – This tool is used to interact with the cluster using CQL.
- cassandra-cli – This older command-line tool allows low-level interaction with the cluster.
- nodetool – This tool provides statistics about the cluster and performs management operations such as repair.
Homepage – http://cassandra.apache.org/
Commercial Offering – http://www.datastax.com
NodeTool – http://wiki.apache.org/cassandra/NodeTool