NoSQL playground


An introductory guide to three popular NoSQL databases for Java developers

It's hard to imagine a microservices-based project without at least one NoSQL database. It feels like they've taken over from old-fashioned relational SQL databases. However, the DB-Engines ranking still lists SQL databases as the top 4 most used (image 1.1.), but this is mostly due to legacy projects maintaining their existing databases and licenses and not being able to justify the budget for a migration. This conclusion is supported by the trend chart (image 1.2.), which shows that over the last 6 years the most popular SQL databases have gained no momentum, while NoSQL databases are on a constant rise.

Image 1.1. DB-Engines ranking: top 30 databases
Image 1.2. DB-Engines trend chart from 2013 to July 2019
The top 3 lines are Oracle, MySQL and MS SQL Server. Below them are MongoDB, Cassandra, Elasticsearch and Neo4j, respectively.

So with this growing trend, you will probably want to add a NoSQL database to your microservices project sooner rather than later. Let's imagine a scenario: there's a business demand for a new set of features that semantically belong to one new microservice, so as a technical decision maker you get started on the architecture. You choose Spring Boot and the rest of the Spring stack where applicable, together with Spring Data to abstract the business logic and data into repositories and entities, and finally you need to choose which Spring Data implementation to go with. For this purpose we can ignore the fact that Spring Data also has implementations for SQL databases, which would probably fit the rest of the Spring stack better than plain Hibernate, but I have yet to see someone choose a SQL database with Spring Data in production. Instead, let's delve into some of the most used NoSQL databases of different types, and try to help you, as the decision maker here, choose which one to go with.

NoSQL Basics Crash Course

We have to digress first and set some basics for our playground. NoSQL, contrary to popular belief, doesn't really exclude SQL. In fact, it stands for "Not only SQL". Many NoSQL databases actually have SQL-like language support, because that is really what makes the most sense when querying data.

So what really distinguishes and classifies NoSQL databases from SQL databases? It boils down to one thing in the end: flexibility! A NoSQL database typically doesn't require a rigid structure, doesn't force you to suffer from your design decisions and curse the day you set a column to number instead of varchar... No, a NoSQL database will try to be your friend and let you adapt your model on the fly. Nothing comes for free though, so what is the price of this trade-off? Well, without rigid rules and structure, your data could suffer from inconsistency; you might end up with loose, unreferenced data unless you are very careful in your application later... With more power comes more responsibility! But it's worth it, as NoSQL databases are typically faster and can hold more data!

We can classify NoSQL databases by how they store data:

  • Column based
  • Document based
  • Key-Value
  • Graph based
  • Multi-Model

Spring Data, in blunt terms, is the JPA of NoSQL databases. It adds a layer of abstraction analogous to JPA's, while the implementation is left to a specific NoSQL database driver. It also adds its own very cool features: a no-code implementation of the repository pattern and the creation of database queries from method names.

In this text I will cover the most popular and well-known column NoSQL database, Apache Cassandra; document NoSQL database, MongoDB; and graph NoSQL database, Neo4j. I'll share some brief thoughts on each based on my experience so far.

First the pioneer – MongoDB

MongoDB has been with us for a while now; it actually had its 10-year anniversary this year. This is why MongoDB is probably the database most associated with NoSQL. We can thank it for the NoSQL breakthrough.

It sure has pivoted over time and listened to its great community. The current stable release supports multi-document ACID transactions, type conversions and Kubernetes integration... But before getting there, let's approach this from a typical Java + Spring Data developer's angle.

It's all about documents and JSON in MongoDB; a typical database looks something like the image below.

Image 2.1. MongoDB example data

However, if you use Spring Data, you won't be seeing those JSONs directly (well, at least until things go wrong). You will create domain objects, map them with Spring Data annotations, and use repositories to persist and retrieve data. If things go as planned, your Spring Data repositories will only contain methods described with intuitive names such as:

List<Person> findPersonByCountryOrderByAgeAsc(String country);
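For illustration, a minimal document class and repository behind such a method could look like the sketch below; the class and field names are assumptions, not taken from a real project:

```java
import java.util.List;

import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.mapping.Document;
import org.springframework.data.mongodb.repository.MongoRepository;

// Hypothetical domain object mapped to the "person" collection.
@Document(collection = "person")
public class Person {

    @Id
    private String id;
    private String name;
    private String country;
    private int age;

    // getters and setters omitted for brevity
}

// Spring Data derives the query from the method name:
// filter by country, sort ascending by age. No implementation needed.
interface PersonRepository extends MongoRepository<Person, String> {
    List<Person> findPersonByCountryOrderByAgeAsc(String country);
}
```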

But as your model grows, you'll be bound to make decisions on how to split up documents, how to reference one from another, or whether to use nested documents... After a few sprints of work you'll probably have a query in your Spring Data repository that looks something like this:

@Query("{name:?0, country:?1, region:?2, occupations: {$elemMatch: {name:?3, status: {$exists: false}, tasks: {$exists: true, $not: {$size: 0}}}}}")

List<Person> findPersonByNameCountryRegionHavingAnOccupationWithTasks(String name, String country, String region, String occupationName);

If a method has a @Query annotation inside a Spring Data repository, the method name will not be parsed to create a database query automatically, so the method name in this case can be whatever you see fit. At this point you are probably thinking about how much effort it would be to just drop the existing data and remodel it... MongoDB gives you enormous flexibility when modeling your data, and even if you no longer feel your model really fits the original idea, there are so many functions in MongoDB that can still be used to query your data that you'll be able to live with your choices for a while... However, when you descend into using projections and aggregations, your indexes might not be used as intended anymore, at which point performance will suffer. That is surely when you need to start remodeling.

Spring Data and community support for MongoDB is quite amazing. Your repositories can return Java 8 Streams instead of collections, which gives you flexibility in tuning performance. Unit test support is driven by the community, so you will have to choose from implementations available on GitHub. The Flapdoodle MongoDB library has proven to be very reliable, with an embedded MongoDB starting and shutting down with your unit tests: https://github.com/flapdoodle-oss/de.flapdoodle.embed.mongo.
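Wiring this in is mostly a matter of adding the test dependency. A minimal sketch of such a test, assuming Spring Boot auto-configures the embedded instance when de.flapdoodle.embed.mongo is on the test classpath (the repository and entity names are hypothetical):

```java
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.data.mongo.DataMongoTest;

import static org.junit.jupiter.api.Assertions.assertNotNull;

// Sketch only: with de.flapdoodle.embed.mongo on the test classpath,
// Spring Boot starts an embedded MongoDB for this test slice and
// shuts it down afterwards.
@DataMongoTest
class PersonRepositoryTest {

    @Autowired
    private PersonRepository repository; // hypothetical repository

    @Test
    void savesAndReadsBack() {
        Person saved = repository.save(new Person());
        assertNotNull(saved.getId()); // id assigned by the embedded instance
    }
}
```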

If you eventually do have to look deep into MongoDB documents and JSON data, there are some great clients. Personally, I prefer RoboMongo: https://robomongo.org/.

Let’s draw a graph – Neo4j

Graphs are so practical... Ask a child to explain how it sees the world, and it will probably draw something resembling a graph. That's exactly the biggest selling point of Neo4j: it's very intuitive. If you are pitching Neo4j to management, it will take you up to 2 minutes to create a simple model representing their business using the amazing Neo4j browser console (depicted in image 3.1.). Now the non-technical decision makers have something to cling to, something they are much more comfortable with than JSONs or CSV files.

Image 3.1. Simple Neo4j graph created using the browser console

Neo4j is fully implemented using Java and Scala. It consists of labeled nodes and directed typed relationships between nodes. Both can have properties.

Image 3.2. Example labeled property graph data model

It is a NoSQL database, so your model is flexible, and there are no restrictions on what a certain node can have in terms of properties and labels. A node can simultaneously be a Person and a Writer. Relationships, on the other hand, can have only one type, but also have no limits in terms of attributes. If you need to connect the same nodes twice to model their association, you just create two relationships with different types.

Image 3.3. Double relationship between the same nodes
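In Cypher, creating such a pair of differently typed relationships between the same two nodes is a one-liner; the labels, property values and relationship types below are made up for illustration:

```cypher
// Two relationships of different types between the same pair of nodes.
CREATE (p:Person {name: "John"})-[:WROTE]->(b:Book {title: "Graphs"}),
       (p)-[:READS]->(b);
```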

Although Neo4j is a NoSQL database, it does support referential integrity. You will not be able to delete a node as long as there are relationships pointing to it. So Neo4j combines the worlds of SQL and NoSQL, using the best of both: flexibility plus consistency and structure. Its performance is also amazing, if you get your indexes right...
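For example, a plain DELETE on a still-connected node is rejected; Cypher's DETACH DELETE makes the intent explicit by removing the relationships together with the node (the node shown is hypothetical):

```cypher
// DELETE alone fails while relationships still point at the node;
// DETACH DELETE removes the node and all of its relationships.
MATCH (p:Person {name: "John"})
DETACH DELETE p;
```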

Writing a query in Neo4j means using an SQL-like language called Cypher, which also tries to depict a graph and is, in my opinion, quite intuitive. Querying persons that speak the German language would look something like:

MATCH (p:Person)-[s:SPEAKS]->(l:Language) WHERE l.name = "German" RETURN p;

You can't really have something resemble a graph more closely than that using ASCII characters...

The Spring Data implementation for Neo4j is built on Neo4j OGM (Object Graph Mapping), so it resembles Hibernate-like annotations and mechanisms in addition to implementing the Spring Data API. You can specify entities, relationships, listeners, and custom mappers for unsupported types (e.g. if you want to persist an object of a custom type Circle, you would write your own mapper to and from a supported type such as String).
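As a sketch, an OGM-mapped node entity matching the earlier SPEAKS query might look like this; the class and field names, and the Language entity it refers to, are assumptions for illustration:

```java
import java.util.Set;

import org.neo4j.ogm.annotation.GeneratedValue;
import org.neo4j.ogm.annotation.Id;
import org.neo4j.ogm.annotation.NodeEntity;
import org.neo4j.ogm.annotation.Relationship;

// Hypothetical entity mapped to nodes labeled :Person.
@NodeEntity(label = "Person")
public class Person {

    @Id
    @GeneratedValue
    private Long id;

    private String name;

    // Maps the outgoing (:Person)-[:SPEAKS]->(:Language) relationships;
    // Language would be another @NodeEntity.
    @Relationship(type = "SPEAKS")
    private Set<Language> languages;
}
```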

Unit testing capabilities are provided by official Neo4j releases, so it is very comfortable to develop and test your code. In case you want to have lots of test data already prepared when booting up an embedded test Neo4j database, you can just create the data once, then copy the Neo4j /data folder from the hard drive into the /test/resources folder, and use that as the test data source.

To run optimally, Neo4j requires an amount of RAM matching the data size on disk, so when you start talking about hundreds of gigabytes of data, or even terabytes, Neo4j doesn't really work for you anymore... Luckily, if you optimize your model and decouple your business data from e.g. audit data, storing the latter in e.g. Elasticsearch, you will not have performance or hardware shortages with Neo4j.

Neo4j also can't achieve the write throughput required to run in a distributed big data environment. This is due to the fact that a Neo4j cluster requires one node to act as master, and all write queries go to it.

Neo4j doesn't support keyspaces, but if you really want to decouple your data, nothing stops you from running two Neo4j instances in different microservices...

Thinking bigger – Apache Cassandra

Now Cassandra... As my current CTO once stated: "There is nothing in the world that writes as fast as Cassandra." Although I am sure there are perfectionists out there who could find examples where this statement is not true, I still very much agree with it based on my experience, as far as NoSQL databases go... Or any database, for that matter.

Cassandra is a de facto Big Data database. But when does data become "Big Data" and justify using Cassandra to store it?! I've worked with Neo4j databases that had hundreds of millions of nodes, but that still didn't quite qualify as "Big Data"; it was rather modeled graph data with a lot of content. I've also worked with Mongo databases that grew to tens of gigabytes in size, but that didn't come even near "Big Data". I've devised my own blunt definition: "Big Data means terabytes". If you also need good write throughput from your Big Data database, then Cassandra is definitely the DB you want to go with.

Why is Cassandra not a valid option for anything less than Big Data? Well, to achieve its fascinating write throughput and data consistency, Cassandra requires a lot of top-notch hardware. Cassandra is always set up in a cluster, where data is distributed across nodes in a ring-like manner.

Image 4.1. Cassandra ring

Among the configuration options when setting up a Cassandra keyspace, maybe the biggest decision you need to make is choosing the replication factor. This determines the number of nodes where replicas are placed. A replication factor of one means there is only one copy of each row in the Cassandra cluster ring. A factor of two means two copies of each row, where each copy is on a different node. This practically means that if you have a replication factor of 3 and two nodes go down, you will still have all your data. Of course, the trade-off for a higher replication factor is higher write latency.

Image 4.2. Cassandra write replication
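The replication factor is set when the keyspace is created. A sketch for a hypothetical keyspace named example, keeping three copies of each row:

```sql
-- Sketch: three replicas of every row. SimpleStrategy suits a single
-- datacenter or test setup; multi-datacenter production clusters would
-- typically use NetworkTopologyStrategy instead.
CREATE KEYSPACE example
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
```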

So what's behind Cassandra's incredible write performance? What's even more mind-blowing is that Cassandra achieves this without using indexes. The answer is data partitioning. When you define a table, you have already defined an index, sort of... You have to choose which table columns will be used to form the key you query upon.

Let's model a person with columns: name, lastname, email, address, age. The most likely querying scenario here probably uses name and lastname. So we'll create that table in an example keyspace with those two columns as the primary key:

CREATE TABLE example.person (name text, lastname text, email text, address text, age int, PRIMARY KEY (name, lastname));

This means we can query persons with simple SQL using those two columns to filter:

SELECT * FROM example.person WHERE name = 'John' AND lastname = 'Doe';

We'll always get a very fast result back. This also means we can only use those two columns to query the person table. What happens when we've got the email address, but not the name and lastname? Well, with the current table we can't do anything else but use Spring Data to get all rows from Cassandra and then filter in memory... Since we are in Big Data territory, this probably won't come cheap for the heap... The solution: create another table using email as the primary key and write every person twice.

CREATE TABLE example.person_by_email (name text, lastname text, email text, PRIMARY KEY (email));

We’ll query person_by_email using the email, and then use the “name” and “lastname” we obtained to query the person table.
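With Spring Data Cassandra, that two-step lookup might be sketched as below; the repository interfaces, entity types and method names are assumptions for illustration:

```java
import java.util.Optional;

// Hypothetical service combining the lookup table and the main table.
class PersonLookupService {

    private final PersonByEmailRepository personByEmailRepository;
    private final PersonRepository personRepository;

    PersonLookupService(PersonByEmailRepository byEmail, PersonRepository persons) {
        this.personByEmailRepository = byEmail;
        this.personRepository = persons;
    }

    Optional<Person> findPersonByEmail(String email) {
        // Step 1: query the lookup table partitioned by email.
        return personByEmailRepository.findByEmail(email)
                // Step 2: use the obtained name and lastname against the main table.
                .flatMap(p -> personRepository.findByNameAndLastname(p.getName(), p.getLastname()));
    }
}
```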

The write overhead we saw here is a usual Cassandra practice, and a data table will have multiple entry points, depending on business requirements. The overhead is very much justified when you get a response within milliseconds while querying millions of rows of data.

So which one is the best?

Of course, as you might expect, there is no single answer. Some pros and cons were given during the brief introduction of these 3 databases.

If you are in Big Data territory, Cassandra is surely a good choice, but there are also similar databases offered as managed cloud services, such as Azure Cosmos DB (https://cosmos.azure.com/).

Otherwise, your decision between MongoDB and Neo4j could be based on the nature of your data: does it suit a graph? The other factor could be whether you want constraints, as Neo4j gives you the possibility of SQL-like constraints in the NoSQL world... If those two don't apply, MongoDB is usually faster and less memory-consuming than Neo4j.

In any case, there is nothing stopping you from firing up a Spring Boot + Spring Data microservice, and trying out each of these databases individually. This playground is totally safe, as long as it’s not on production.


Danijel Gornjaković

Senior IT-Consultant / Software Engineer @PRODYNA

Danijel is a Java oriented Software developer, working in PRODYNA Belgrade offices since our opening in 2013.
He studied Computer Science at the Faculty of Electrical Engineering in Belgrade, where he mastered with the thesis: “Visualization of the ID3 decision tree algorithm in an e-learning software”.
He is passionate about technology and is always aiming to gain new and improve existing knowledge.
Currently he is excited to work with Cloud Native technologies and Big Data datasets.
Danijel enjoys his leisure hours spending quality time with his wife and daughter.
He is a big sports fan, and is playing football and basketball casually with colleagues and friends.
