What? NoSQL? Yes, you read that correctly: NoSQL. I forget when and where I first heard about it, but this data store technology caught my attention again when I attended the second Bancakan 2.0 meetup last March. While listening to the speaker, lynxluna, I remembered HBase, a scalable distributed database that is part of the Apache Hadoop project. For your information, Apache Hadoop is one implementation of the MapReduce framework.
What is NoSQL?
So, what the hell is NoSQL? Here is the definition of NoSQL in Wikipedia:
NoSQL is a movement promoting a loosely defined class of non-relational data stores that break with a long history of relational databases. These data stores may not require fixed table schemas, usually avoid join operations and typically scale horizontally. Academics and papers typically refer to these databases as structured storage.
To keep it simple, NoSQL provides non-relational data stores. Unlike an RDBMS, NoSQL doesn’t use the relational data model. Instead, it uses various techniques to represent its data. Curt Monash, quoting Dwight Merriman, founder of 10gen (the creator of MongoDB), describes on his blog the data models used in NoSQL:
In Dwight’s opinion, as I understand it, NoSQL data models come in four general kinds.
- Key-value stores, more or less pure. I.e., they store keys+BLOBs (Binary Large OBjects), except that the “Large” part of “BLOB” may not come into play.
- Table-oriented, more or less. The major examples here are Google’s BigTable, and Cassandra.
- Document-oriented, where a “document” is more like XML than free text. MongoDB and CouchDB are the big examples here.
- Graph-oriented. To date, this is the smallest area of the four. I’m reserving judgment as to whether I agree it’s properly included in HVSP and NoSQL.
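To make the four kinds a little more concrete, here is a rough sketch of how the same forum post might look in each model, using plain Python structures. Everything in it (the `post:42` key, the field names, the `meta` and `stats` column families, the `authors_of` helper) is made up for illustration and does not reflect the actual API of any of the products mentioned.

```python
import json

# 1. Key-value: the store only sees a key and an opaque blob.
key_value_store = {
    "post:42": b'{"author": "alice", "title": "Hello NoSQL"}',
}

# 2. Table-oriented (BigTable-like): row key -> column family -> column -> value.
table_store = {
    "post:42": {
        "meta": {"author": "alice", "title": "Hello NoSQL"},
        "stats": {"views": "10"},
    },
}

# 3. Document-oriented: a nested, self-describing document.
document_store = {
    "posts": [
        {
            "_id": 42,
            "author": "alice",
            "title": "Hello NoSQL",
            "replies": [{"author": "bob", "text": "Nice post"}],
        },
    ],
}

# 4. Graph-oriented: nodes plus labeled edges between them.
graph_nodes = {"alice", "bob", "post:42"}
graph_edges = [
    ("alice", "WROTE", "post:42"),
    ("bob", "REPLIED_TO", "post:42"),
]


def authors_of(post_id):
    """Hypothetical helper: follow WROTE edges that point at post_id."""
    return [src for src, label, dst in graph_edges
            if label == "WROTE" and dst == post_id]


# The key-value store cannot look inside the blob; the application decodes it.
decoded = json.loads(key_value_store["post:42"])
```

Notice that only the application knows the blob in the key-value version is JSON, while the other three models expose some structure to the store itself.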
What’s the problem with RDBMS?
So, what’s wrong with the relational data model in an RDBMS? Nothing, really. The best feature of an RDBMS is data integrity. Mike Kavis explained that an RDBMS requires data to be normalized in order to provide quality results and to prevent orphan records and redundant data. To retrieve the needed data quickly, an RDBMS uses primary and secondary keys and indexes. The problems arise as the size of the data grows. Mike Kavis explained:
But all of the good intentions that the RDBMS has for ensuring data integrity comes with a cost. Normalizing data requires more tables, which requires more table joins, thus requiring more keys and indexes. As databases start to grow into the terabytes, performance starts to significantly fall off. Often, hardware is thrown at the problem which can be expensive both from a capital standpoint and from an ongoing maintenance and support standpoint.
Let’s take a forum as an example. When we first launch a forum, it doesn’t have any traffic yet. But as the forum grows and lots of members generate a massive number of posts, our database server will suffer. Every data access request means the server has to query the database. To keep data retrieval as quick as possible, the database has to be indexed, and the indexes have to be kept up to date. The more data there is, the more time the indexing takes.
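Here is a small sketch of what that forum could look like as a normalized relational schema, using Python's built-in `sqlite3` module with an in-memory database. The table names, column names, and sample rows are all hypothetical; the point is only to show that even listing the posts of one thread already needs two joins and an index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized schema: members, threads and posts live in separate tables.
conn.executescript("""
    CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE threads (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE posts (
        id        INTEGER PRIMARY KEY,
        thread_id INTEGER NOT NULL REFERENCES threads(id),
        member_id INTEGER NOT NULL REFERENCES members(id),
        body      TEXT NOT NULL
    );
    -- Secondary index so that posts of a thread can be found quickly.
    CREATE INDEX idx_posts_thread ON posts(thread_id);
""")

conn.executemany("INSERT INTO members (id, name) VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])
conn.execute("INSERT INTO threads (id, title) VALUES (?, ?)",
             (1, "What is NoSQL?"))
conn.executemany(
    "INSERT INTO posts (thread_id, member_id, body) VALUES (?, ?, ?)",
    [(1, 1, "First post"), (1, 2, "A reply")],
)

# Showing one thread page needs two joins across three tables.
rows = conn.execute("""
    SELECT threads.title, members.name, posts.body
    FROM posts
    JOIN threads ON threads.id = posts.thread_id
    JOIN members ON members.id = posts.member_id
    WHERE posts.thread_id = ?
    ORDER BY posts.id
""", (1,)).fetchall()

for title, name, body in rows:
    print(title, "|", name, "|", body)
```

With two rows this is instant, of course. The cost Mike Kavis describes shows up when every insert has to update the indexes and every page view runs joins like this over a very large `posts` table.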
- Replication is not an answer to all performance problems. Although updates on the slave are more optimized than if you ran the updates normally, if you use MyISAM tables, table-locking will still occur, and databases under high-load could still struggle.
- Replication is not a guarantee that the slave will be in sync with the master at any one point in time. Even assuming the connection is always up, a busy slave may not yet have caught up with the master, so you can’t simply interchange SELECT queries across master and slave servers.
Of all the issues mentioned above, maybe the most important one is scalability. As in the forum example earlier, higher traffic puts more and more load on the server. And replicating the database weakens the ACID features of a relational database. Dare Obasanjo gave an interesting point of view about relational databases and large-scale websites:
What tends to happen once you’ve built a partitioned/sharded SQL database architecture is that you tend to notice that you’ve given up most of the features of an ACID relational database. You give up the advantages of the relationships by eschewing foreign keys, triggers and joins since these are prohibitively expensive to run across multiple databases. Denormalizing the data means that you give up on Atomicity, Consistency and Isolation when updating or retrieving results. And the end all you have left is that your data is Durable (i.e. it is persistently stored) which isn’t much better than you get from a dumb file system. Well, actually you also get to use SQL as your programming model which is nicer than performing direct file I/O operations.
What can NoSQL do about this?
Unlike relational databases, NoSQL focuses on the scalability of data storage. The data is replicated in many places. The key difference is that NoSQL doesn’t rely (just) on indexes. It uses various techniques to partition and replicate the data, depending on the data model, as mentioned earlier.
Let’s use the key-value store model as the example. This is the basic technique for NoSQL, and Amazon’s Dynamo is a well-known example of it. In Dynamo, the data is partitioned dynamically over a set of nodes using consistent hashing. Each data item is identified by a key, and this key is hashed to a position on a ring. To find the node where the item will be placed, Dynamo walks the ring clockwise from that position and picks the first node it meets. The item is then duplicated on the nodes that follow that node on the ring. Consistent hashing is used here to ensure that adding or removing a node affects only that node’s immediate neighbors. This partitioning scheme lets the data storage scale incrementally. For a detailed explanation of how Dynamo works, you can read its paper. As for the other models, maybe I’ll write separate posts about them in the future.
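The placement idea described above can be sketched in a few lines of Python. This is a simplified illustration and not Dynamo's actual implementation: the `HashRing` class, the `ring_position` function, and the node names are my own assumptions, and it leaves out the virtual nodes that the Dynamo paper uses to spread the load more evenly.

```python
import bisect
import hashlib


def ring_position(value):
    """Hash a string to a position on the ring (MD5 used as a stable hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        # The ring is a list of (position, node) pairs sorted by position.
        self._ring = sorted((ring_position(n), n) for n in nodes)

    def add_node(self, node):
        bisect.insort(self._ring, (ring_position(node), node))

    def nodes_for(self, key):
        """Return the nodes storing `key`: the first node clockwise from the
        key's position, followed by its successors on the ring."""
        if not self._ring:
            return []
        positions = [pos for pos, _ in self._ring]
        # First node whose position is larger than the key's; wrap around.
        start = bisect.bisect_right(positions, ring_position(key)) % len(self._ring)
        count = min(self.replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = ["key-%d" % i for i in range(1000)]

before = {k: ring.nodes_for(k)[0] for k in keys}
ring.add_node("node-d")
after = {k: ring.nodes_for(k)[0] for k in keys}

# Keys whose first node changed after adding node-d.
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys moved")
```

Every key in `moved` now has `node-d` as its first node, and all the other keys stay where they were. With a naive `hash(key) % number_of_nodes` scheme, adding a node would instead reshuffle most of the keys.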
The key-value data store model provides high-performance reads and writes. The data is schema-less and uses primitive data types, avoiding the complexity of the table hierarchy in a relational database. The price of this simplicity is the data integrity that relational databases provide (although there may be non-relational databases that provide integrity too).
So, is it the end for relational databases?
Well, no. It depends on the project. If the project will generate a large amount of data and needs fast queries, then we can consider using NoSQL. Big companies like Facebook, Twitter, Digg, and Google produce and maintain a lot of data, so it probably makes sense for them to use NoSQL for their data stores. But if the data has a tight structure and its integrity is the main concern, we can use a relational database. Just use the right tool for the right job. NoSQL is not a replacement for relational databases; it’s just another useful technique.
This post is just the beginning of my journey into NoSQL. I hope I can write more about this and about MapReduce. See you in my next post.
Links of information about NoSQL: http://nosql-database.org/links.html
Credit: Movie Seating webcomic by Randall Munroe of xkcd.com.