Choosing between Columnar and Document Databases

SK Database Team

Among NoSQL databases, columnar and document databases are two popular variants, and choosing the right one is often a real dilemma. While both are great in their own way, they are best suited for clearly different use cases.

A columnar database stores data in columns rather than rows. The term is often used loosely to cover column-family (wide-column) stores as well, although those organize data somewhat differently. Each column is stored and can be accessed and queried independently. This makes it ideal for analytics and reporting applications on large datasets.

In simple terms, data or records are stored in a columnar fashion rather than row-wise as in the case of traditional databases. For example, in a customer database table, the "customer name" column would contain all the names of the customers, the "address" column would contain all the addresses, and so on, much like in spreadsheets. One difference, among others, is that spreadsheets have a limited number of rows or records, whereas columnar databases can scale to petabytes and beyond.
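The difference between the two layouts can be sketched in a few lines of Python. This is only an in-memory illustration, not how any particular database lays out data on disk; the customer records, the field names and the `to_columns` helper are all hypothetical.

```python
# Hypothetical customer records, stored row-wise (one dict per record).
rows = [
    {"customer_name": "Asha", "address": "12 Lake Rd", "balance": 120.0},
    {"customer_name": "Ben", "address": "4 Hill St", "balance": 75.5},
    {"customer_name": "Chen", "address": "9 Park Ave", "balance": 310.25},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one column reads only that column's values;
# the names and addresses are never touched.
total_balance = sum(columns["balance"])

print(columns["customer_name"])
print(total_balance)
```

In the row-wise layout, the same aggregate would have to walk every record and pick out one field from each.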

This columnar storage approach allows fast querying of specific columns and efficient analytical queries on large datasets, because a query needs to read only the columns it refers to. That makes it a great fit for data warehousing applications. Columnar databases are optimized for read-heavy workloads and big data analytics.

One of the main advantages is compression: because the values in a column share the same type and attributes, they can be compressed using specialized algorithms, which can lead to substantial storage savings. Also, since columns can be accessed and queried independently, one can balance performance against storage space by selectively compressing the columns that are not frequently queried.
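To see why same-typed column data compresses well, consider run-length encoding, one simple scheme that suits columns with long runs of repeated values. The sketch below is an illustration only and is not taken from any specific product; the `country` column and both function names are made up for the example.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    decoded = []
    for value, count in pairs:
        decoded.extend([value] * count)
    return decoded


# Hypothetical "country" column, sorted so that equal values sit together.
country = ["IN", "IN", "IN", "US", "US", "DE", "DE", "DE", "DE"]

encoded = rle_encode(country)
print(encoded)
print(len(country), "values stored as", len(encoded), "pairs")
```

The same trick is far less effective on row-wise data, where neighbouring values belong to different fields and rarely repeat.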

Parallel processing is another advantage of columnar databases. The whole dataset can be broken into columnar chunks that are distributed across multiple computing nodes, which can then run queries in parallel. This enables columnar databases to scale horizontally and handle very large datasets.
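The chunk-and-combine idea can be sketched as follows. Worker threads stand in for computing nodes here, which is a simplifying assumption: a real cluster would ship chunks across machines. The function names, the chunk size and the sample column are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(column, chunk_size):
    """Break one column into fixed-size chunks (the last may be shorter)."""
    return [column[i:i + chunk_size] for i in range(0, len(column), chunk_size)]


def parallel_sum(column, chunk_size=4, workers=3):
    """Sum each chunk on its own worker, then combine the partial sums."""
    chunks = split_into_chunks(column, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


# Hypothetical "order amount" column.
order_amounts = list(range(1, 11))

total = parallel_sum(order_amounts)
print(total)
```

Aggregates such as sums, counts, minimums and maximums combine cleanly from partial results, which is what makes this pattern a natural fit for analytical queries.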

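Some examples are Amazon Redshift, a column-oriented data warehouse, and Apache Cassandra and Apache HBase, which are wide-column (column-family) stores commonly grouped under the same heading.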

Document databases have a completely different design philosophy. Here, data is stored in the form of semi-structured documents, such as JSON or XML. As a result, complex and hierarchical data originating from web and mobile applications can be stored and queried much more efficiently.

Normalization of unstructured or semi-structured data is a mammoth task. Document databases circumvent this problem by storing such data using nested structures, such as arrays and sub-documents, without the need for normalization. The data is modeled and stored in a way similar to the application structure. This allows for storing complex and hierarchical data in a natural and flexible manner.
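As a rough illustration, a single order could be held as one nested document instead of being split across several normalized tables. The document below is a hypothetical example; the field names are assumptions, and Python's standard `json` module stands in for the database's storage format.

```python
import json

# One hypothetical order, kept together with its sub-document and array.
order_document = {
    "order_id": "o-1001",
    "customer": {
        "name": "Asha",
        "address": {"city": "Pune", "zip": "411001"},
    },
    "items": [
        {"sku": "pen", "qty": 2, "price": 1.5},
        {"sku": "notebook", "qty": 1, "price": 4.0},
    ],
}

# The document round-trips through JSON with its nesting intact.
serialized = json.dumps(order_document)
restored = json.loads(serialized)

# The application reads the nested data in the same shape it wrote it.
order_total = sum(item["qty"] * item["price"] for item in restored["items"])
print(restored["customer"]["address"]["city"], order_total)
```

A normalized relational design would typically spread the same information over separate order, customer, address and order-item tables.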

For the sake of understanding, we can consider document databases to be similar to object-oriented databases. Each document represents an object, with the document fields representing object attributes. For example, each document stores an entity or an object such as a customer or an order, and it contains all the attributes associated with that entity. However, document databases handle relationships between documents by using references or embedding, as opposed to the foreign keys and joins of traditional relational databases.
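The two ways of relating documents, embedding and referencing, might look like this in a minimal in-memory sketch. The collections, the `_id` and `customer_id` field names and the `customer_name` helper are hypothetical and chosen only for illustration.

```python
# Hypothetical "customers" collection, keyed by document id.
customers = {
    "c-1": {"_id": "c-1", "name": "Asha"},
}

# Embedding: the customer data lives inside the order document itself.
order_embedded = {"_id": "o-1", "customer": {"name": "Asha"}, "total": 7.0}

# Referencing: the order stores only the id of a separate customer document.
order_referenced = {"_id": "o-2", "customer_id": "c-1", "total": 12.0}


def customer_name(order, customer_collection):
    """Resolve the customer's name for either style of order document."""
    if "customer" in order:
        # Embedded: no second lookup is needed.
        return order["customer"]["name"]
    # Referenced: follow the id to the other collection.
    return customer_collection[order["customer_id"]]["name"]


print(customer_name(order_embedded, customers))
print(customer_name(order_referenced, customers))
```

Embedding keeps a read to a single document at the cost of duplicating data, while referencing avoids duplication but needs a second lookup.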

This structure provides the ability to perform ad-hoc querying. You can run queries for specific fields within a document or perform complex queries across multiple documents. This is why document databases are often considered well suited to write-heavy workloads: they allow very fast insertion and retrieval of data while still supporting fast and flexible queries.
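A toy version of ad-hoc querying over documents is sketched below. It filters an in-memory list by dotted field paths; it is not the query API of any real document database, and the `get_path` and `find` helpers and the sample orders are invented for the example.

```python
def get_path(document, dotted_path):
    """Follow a dotted path such as 'customer.city'; return None if absent."""
    current = document
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def find(collection, criteria):
    """Return the documents whose fields equal every value in criteria."""
    return [
        doc
        for doc in collection
        if all(get_path(doc, path) == value for path, value in criteria.items())
    ]


# Hypothetical order documents; note that their fields are not uniform.
orders = [
    {"_id": "o-1", "status": "paid", "customer": {"name": "Asha", "city": "Pune"}},
    {"_id": "o-2", "status": "open", "customer": {"name": "Ben", "city": "Pune"}},
    {"_id": "o-3", "status": "paid", "customer": {"name": "Chen"}},
]

paid_in_pune = find(orders, {"status": "paid", "customer.city": "Pune"})
print([doc["_id"] for doc in paid_in_pune])
```

The third order has no city at all, and the query simply skips it; no schema change is needed to store or query such documents.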

Some examples of document databases are MongoDB, Couchbase and Amazon DocumentDB.

To summarize, a columnar database is best suited for large-scale data warehousing and analytics, while a document database is best suited for web and mobile applications that require fast insertion and retrieval of semi-structured data. Columnar databases are optimized for read-heavy workloads and are very efficient at querying specific columns of a table, while document databases are optimized for write-heavy workloads and are very efficient at inserting and retrieving semi-structured documents.

Both have their advantages and disadvantages, and the choice of which one to use will depend on the specific requirements of the application.

And since the two kinds of database are optimized differently, they can also complement each other to get the best performance. For example, behind a given application, a document database may handle transactional operations while a columnar database runs analytical queries on the large dataset originating from these transactions!
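One way to picture this pairing is a small pipeline that flattens order documents into columns and then aggregates over them. This is a sketch under stated assumptions rather than a description of any real integration: the order documents, the column names and both functions are hypothetical, and plain Python lists stand in for the two databases.

```python
from collections import defaultdict

# Hypothetical order documents, as the transactional side might store them.
orders = [
    {"order_id": "o-1", "items": [
        {"sku": "pen", "qty": 2, "price": 1.5},
        {"sku": "notebook", "qty": 1, "price": 4.0},
    ]},
    {"order_id": "o-2", "items": [
        {"sku": "pen", "qty": 4, "price": 1.5},
    ]},
]


def documents_to_columns(documents):
    """Flatten nested order items into one column list per field."""
    columns = {"order_id": [], "sku": [], "qty": [], "price": []}
    for doc in documents:
        for item in doc["items"]:
            columns["order_id"].append(doc["order_id"])
            columns["sku"].append(item["sku"])
            columns["qty"].append(item["qty"])
            columns["price"].append(item["price"])
    return columns


def revenue_by_sku(columns):
    """Analytical query that reads only the sku, qty and price columns."""
    revenue = defaultdict(float)
    for sku, qty, price in zip(columns["sku"], columns["qty"], columns["price"]):
        revenue[sku] += qty * price
    return dict(revenue)


columns = documents_to_columns(orders)
print(revenue_by_sku(columns))
```

The application keeps writing whole documents, while the analytical side works on the flattened columns without touching the transactional store.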