Introduction to Cassandra

Have you heard of Cassandra? Wikipedia describes her quite aptly:
Apache Cassandra is a free and open-source distributed NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
Built for high scalability without compromising on performance, Cassandra is a strong fit for mission-critical data.
This blog helps engineers understand what Cassandra is, how it works, why we need it in our applications, and how to use its features and capabilities.

Basics First

There is a very famous theorem in the database world, the CAP theorem, which states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

  • Consistency – the data is the same on all nodes in the cluster. Whichever node the user reads from or writes to, the user sees the same data.
  • Availability – at any point in time, the database is accessible for reads and writes, with no downtime.
  • Partition Tolerance – the cluster continues to function even if communication breaks down between nodes. The nodes are up but cannot talk to each other, and the system should still work as expected.

According to this theorem, a distributed system cannot satisfy all three of these guarantees at the same time. Put simply, any database can be CA, CP, or AP, but not all three. Since network partitions cannot be ruled out in a real cluster, the practical choice is usually between consistency and availability while a partition lasts.
Well, imagine if you were able to create a new database system that supports all of CAP! That would be a priceless innovation in the database world, and a very lucrative one.

Where does Cassandra fit in CAP?

Cassandra is classified as AP in CAP. It is a database that prioritizes Availability and Partition tolerance.
But believe me, the beautiful feature of this database is that we can tune it to also meet Consistency. That surely piques the curiosity of IT folks. I will come to that soon.
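
To give a flavour of what that tuning means, here is a minimal sketch of the usual rule of thumb: if the number of replicas that must acknowledge a read (R) plus the number that must acknowledge a write (W) exceeds the replication factor (RF), every read overlaps the latest write. The function names and the replication factor of 3 are my own assumptions for illustration; this is plain arithmetic, not a call into any Cassandra driver.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority of the replica set."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


RF = 3  # assumed replication factor, for illustration only

# Quorum reads and quorum writes overlap on at least one replica.
print(quorum(RF))
print(reads_see_latest_write(quorum(RF), quorum(RF), RF))

# Reading from one replica and writing to one replica favours availability
# and latency, but a read may miss the latest write.
print(reads_see_latest_write(1, 1, RF))
```

The trade-off is the CAP trade-off in miniature: asking more replicas to respond buys consistency, and costs availability when some of those replicas are unreachable.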

What is NoSQL?

Having reasonable working experience with NoSQL databases, I can assure you that NoSQL is still a ‘buzzword’ in the database world.
For easy understanding, I would like to list what people say about NoSQL and my opinion on each point:

  1. NoSQL is horizontally scalable – Agreed. It scales out by adding nodes rather than by moving to a bigger machine.
  2. NoSQL violates ACID principles – Not all NoSQL databases do; it depends on the database, since most of them support ACID partially. For example, MongoDB, HBase, and (depending on configuration) Cassandra provide durability, and MongoDB and HBase support row-level locking. Full multi-statement transactions in the relational sense, however, are traditionally missing or limited in NoSQL databases.
  3. NoSQL is a key-value store architecture – Largely. Key-based access is at the core of many NoSQL databases (others use document, wide-column, or graph models), and it supports fast writes and reads.
  4. NoSQL is for Big Data – Agreed

Yes, all of these characterize NoSQL in the database world. Relaxing ACID and adopting a key-value style structure breaks with the core principles of relational databases, which is why it is also called "Not only SQL". I would say that NoSQL sacrifices these principles to provide performance and data scalability.
In effect, NoSQL says: take care of ACID in your client code, and in exchange I will give you performance.

Coming to Cassandra

Cassandra is a NoSQL database, and it is not a Master-Slave database, which means all the nodes in a Cassandra cluster are the same. It is a peer-to-peer distributed database with a masterless architecture. (P.S. Throughout this blog, NODE denotes a Cassandra node.)
Masterless Architecture
In Master-Slave databases like MongoDB or HBase, there is downtime if the Master goes down, and we need to wait for the next Master to come up. That’s not the case in Cassandra. It has no special nodes, i.e. the cluster has no masters, no slaves, and no elected leaders. This enables Cassandra to be highly available with no single point of failure. This is the reason it supports ‘A’ in CAP.
As mentioned, it is a distributed database system, which means a single logical database is spread across a cluster of nodes, so the data needs to be spread evenly among all participating nodes. Cassandra does this by dividing the data evenly across the cluster, with each node responsible for part of the data. And because no node depends on a master, nodes can keep serving requests even when some of them cannot communicate with each other. This is how it is able to support ‘P’ in CAP. So it is a database which supports AP.
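
The idea of each node owning part of the data can be sketched with a small token ring. This is a simplified illustration, not Cassandra's actual implementation: the `TokenRing` class, the node names, and the use of MD5 as the hash are my own assumptions (Cassandra's default partitioner uses Murmur3), and replication is left out entirely.

```python
import bisect
import hashlib
from collections import Counter


class TokenRing:
    """Toy consistent-hashing ring: each node owns several token ranges."""

    def __init__(self, nodes, tokens_per_node=64):
        self._owners = {}
        for node in nodes:
            for i in range(tokens_per_node):
                # Give every node several positions on the ring so that
                # ownership is spread more evenly.
                self._owners[self._hash(f"{node}#{i}")] = node
        self._tokens = sorted(self._owners)

    @staticmethod
    def _hash(key: str) -> int:
        # MD5 is only a stand-in hash for this sketch.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def owner(self, partition_key: str) -> str:
        """Return the node whose token is the first at or after the key's token."""
        token = self._hash(partition_key)
        index = bisect.bisect_left(self._tokens, token) % len(self._tokens)
        return self._owners[self._tokens[index]]


ring = TokenRing(["node1", "node2", "node3", "node4", "node5"])
counts = Counter(ring.owner(f"user-{i}") for i in range(10_000))
for node, count in sorted(counts.items()):
    print(node, count)
```

Because the owner is computed from the key alone, any node can work out where a row lives without asking a master, which is what makes the masterless design possible.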
This answers the question of why we need Cassandra in our applications. Applications that demand zero downtime need a masterless architecture, and that’s where Cassandra delivers value. In simple words, writes and reads can happen on any node in the cluster at any point in time. The example below shows a sample cluster formation for a 5-node Cassandra setup.

Better Throughput
Another highlight of Cassandra is that its throughput grows as the number of nodes increases. The diagram below demonstrates this.

Cassandra is designed to scale linearly: if two nodes can handle 100K transactions per second, then four nodes can handle 200K transactions per second, and so on.
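
The arithmetic behind that claim is simple enough to write down. The sketch below assumes perfectly linear scaling from the two-node baseline quoted above; the function name is hypothetical, and real clusters will deviate depending on workload, replication factor, and hardware.

```python
def projected_throughput(nodes: int,
                         baseline_nodes: int = 2,
                         baseline_tps: int = 100_000) -> int:
    """Idealized linear-scaling estimate of transactions per second."""
    per_node_tps = baseline_tps / baseline_nodes
    return int(per_node_tps * nodes)


for n in (2, 4, 8):
    print(n, "nodes:", projected_throughput(n), "transactions per second")
```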
It does not stop here. There is definitely more to Cassandra, and one can keep exploring to learn more.

Hari Prasanth Loganathan
