MongoDB and SQL databases are two polar opposite sides of the backend world. The former deals with chaotic unstructured data, while the latter works with organized structured data. Both worlds have their own advantages and disadvantages and are meant for different types of use cases. In this article, we will do an in-depth comparison between MongoDB vs SQL databases, (to be precise MySQL database) and will also touch upon the important topic of how we can perform MongoDB analytics similar to the ease with which analytics are done on its SQL counterparts.
MongoDB vs MySQL
As we discussed, we will compare MongoDB with MySQL which is a well-known SQL database and most of our audience will be able to relate with it. But it could have been any other SQL database also like Oracle, MS SQL Server, PostgreSQL, etc for our comparison.
MongoDB belongs to the family of NoSQL databases which is used for storing unstructured documents in JSON format. It was first launched in the year 2009 and has since become one of the leading databases in NoSQL space.
MySQL is an open-source SQL relational database, which is used for storing structured data in a table-like format. It was first launched in the year 1995 and is now managed by Oracle. Since it is free, it has become a very popular choice in demand for SQL databases.
Paradigm of SQL vs NoSQL
SQL databases, also known as relational databases, were designed to store data that has a structured schema. The schema represents the design of the database to which the data should adhere to. In a structured schema, data is saved in a row-column format known as a Table and can be retrieved using queries formatted in the Structured Query Language (SQL).
SQL relational databases were the only viable commercial data storage solution until the 2000s when the internet and web 2.0 boom started to generate a large amount of unstructured data. Such unstructured data could not be mapped to table-like schemas properly and thus arose the need for a different class of databases to support such unstructured data.
This is when NoSQL databases started arriving on the scene. These new databases needed to support this different type of data that was unstructured and not suitable for schemas; data like key-value stores, documents, text, graphs, and wide columns. MongoDB, for example, mainly supports unstructured Documents.
The accumulation of unstructured data was one big step in the direction of the Big Data era, but on the flip side, since the data stored was unstructured, it was not possible to query that data using SQL. SQL up until this point was the standard for querying and analytics and was well known by developers. We will touch upon this point later.
How Data is stored
In MySQL, the data is stored in tables, where the column denotes the attribute and row represents a particular record. These tables, in turn, reside inside the databases. In MongoDB, data is stored in collections that are analogous to MySQL tables. A collection can consist of many documents in which data is stored in JSON format of key-value. There can be hundreds of such collections inside a MongoDB database.
The SQL databases have a relational property where different tables are related to each other with foreign keys, primary keys. E.g. EmployeeID column which would be a primary key of Employee table would be present as a foreign key in the Payments table, thus connecting the two tables with the referential property. This relationship ensures that there is no payment entry of an employee whose details are not present in the master Employee table. This is why SQL databases like MySQL are also called relational databases.
On the other hand, in MongoDB, we cannot establish such a relationship between the unstructured data of the collections. Hence it is considered as a non-relational database.
The architecture of SQL databases like MySQL is governed by the principles of ACID property.
ACID stands for Atomicity, Consistency, Isolation, and Durability. These properties focus on the consistency and reliability of the transaction done in the database.
MongoDB is built on the principles of CAP Theorem which focuses on Consistency, Availability, and Partition. Unlike the ACID properties of SQL databases, CAP theorem focuses on availability of data in the case of MongoDB.
To conclude, SQL databases safeguard reliability of transactions whereas MongoDB ensures high availability of data.
MySQL database or the SQL databases, in general, can be scaled only vertically by increasing memory size, disk space or computing power of the server. Vertical scaling can be expensive with costs growing rapidly for large databases with high query volume.
NoSQL databases like MongoDB support horizontal scaling, also known as sharding. In this case, instead of increasing the server configuration a new server is added for the purpose of scalability. This approach is usually less expensive because a cluster of low-cost commodity hardware can together meet the requirements to support high query volume in a cost-effective manner.
Reliability and Availability
Reliability and availability are key metrics to measure how robust any database system is. Most of the SQL databases were originally designed for standalone servers. To mitigate the risk of failure, their architecture moved towards a distributed database, where the database runs on a cluster of nodes, thus increasing resilience. Even if one node in the cluster is down the database would still be up and running on other nodes.
NoSQL databases like MongoDB were originally designed keeping resilience in mind. It runs on a cluster of commodity hardware and replicates the data across the nodes for high reliability and availability. Unlike SQL databases, reliability and availability is an integral feature of MongoDB architecture and not an afterthought. Hence the automatic failover in MongoDB is swifter and less complex compared to MySQL and other SQL databases.
MySQL databases, like any other SQL databases, have a predefined schema to which the data should comply. For example, the number of columns in a table along with its data type has to be defined while creating the table. Any data that is saved in the table should match the table structure, otherwise, it will give an error.
On the other hand, in MongoDB, there is no need to predefine any schema. A collection can store different types of documents without any problem. There is nothing to worry about if a new type of document arrives, it can easily be saved.
The dynamic nature of MongoDB schema is useful because most of the data that is being generated by internet applications and IoT devices are non-structured which cannot be saved in a traditional SQL database.
Additionally, many companies will store data before they know how it will be used later. This is common with mobile apps storing log data and user activity. As the company gets their apps on the market, they collect data without an end goal. Later, they may discover that this data gives them valuable information of what features need to be added. With unstructured databases, it is easier to do this sort of unplanned data collection because there is no need to define a schema ahead of time.
Query and Analytics
MySQL database can be queried with the help of Structured Query Language or SQL. In fact, MySQL follows ANSI SQL standards which is a common SQL standard adopted by almost all relational databases like Oracle, PostgreSQL, Sybase, etc.
SQL queries are developer-friendly and well established. SQL can be used to perform advanced analytics functions like filters, joins, merge, and aggregation on the data as well. This makes SQL a powerful option for performing advanced analytics.
MongoDB does not support the traditional SQL queries the way MySQL does. MongoDB does, however, support document querying, but the feature is underdeveloped and limited–especially compared to SQL. One example of this is that MongoDB queries do not support joins, which is a crucial operation to derive information from multiple sources of data.
So MongoDB is useful for storing unstructured data but it does not offer a mature query language to perform advanced analytics. This sounds like a deal-breaker for many commercial use-cases but fortunately, there are some options.
I need to do analytics on MongoDB data: What are my options?
The ease of storing large amounts of unstructured document data in MongoDB is an important factor for many use-cases. That’s why rather than ignoring MongoDB because of its limited query support the industry evolved to create multiple solutions to support advanced analytics on MongoDB. Here are a few good options.
Option 1: Import MongoDB Data into SQL Data Warehouse
If we cannot perform analytics on MongoDB, we can load the data into a SQL data warehouse and then run the familiar SQL queries there for analytics. For doing this we can either write a custom batch ETL process or use tools like Panoply or Xplenty.
This is a powerful approach that has worked well for many companies, but it does have some limitations. Although it serves the purpose, it comes with the additional overhead of building and maintaining a data warehouse, not to mention the cost associated with ETL or ELT processes.
Data warehousing might be a good solution for a big enterprise, but you may not like to invest in a data warehouse as a small company or a young startup. Even if a data warehouse is the right fit for most of your company’s data but if some of your data has more stringent handling requirements or is subject to restrictive regulations it may be necessary to keep it out of your data warehouse. One other place that issues can crop up is if you need to do quick experimental analytics with outside datasets that you either cannot or do not want to move into your data warehouse.
Takeaway: Data warehouses are a great option but can come with some expense and limitations. They can also remove the benefit of using NoSQL databases by forcing your to assign your data into a relational schema.
Option 2: MongoDB BI Connectors
MongoDB realized that they also have to give some options for MongoDB analytics. So they came up with the MongoDB BI Connector that can be used with popular business intelligence tools like Tableau, Cognos, Qlik to name the few. This connector acts as an intermediate interface between the BI tools and MongoDB which converts the SQL query into MongoDB query and converts the results back in SQL format when passing it to BI tools.
The MongoDB BI connector indeed makes the work simple compared to other options on the list that we discussed. Using the MongoDB BI connector saves the overhead of building data warehouses or custom Python applications for MongoDB analytics.
A drawback still exists here, what if you would like to join between MongoDB data and MySQL data or any other SQL data. One option is to import MongoDB data into a MySQL database and then perform analytics there. But this essentially brings us back to the data warehousing option and the overhead that we already discussed above.
Takeaway: MongoDB BI connector is a good option if you would like to connect your existing BI tool with MongoDB but it suffers from the drawback that you cannot join data from multiple heterogeneous sources.
Option 3: Data Virtualization with Knowi
Data virtualization is a process in which an application can access data from multiple sources and present it to front end users by abstracting the underlying technicalities. This means the user will have a consistent view of data from multiple sources and a seamless experience.
Knowi is built on data virtualization and does exactly that. It can connect to MongoDB natively and give users the same experience of running the SQL query on MongoDB data as if they are working on a SQL database.
Moreover, it can also join data from multiple heterogeneous sources seamlessly. This means, if we want to join MongoDB data with MySQL data, it can be done just by providing the two sources and the joining field. Another feature worth noting is that native MongoDB queries are also supported inside Knowi.
Takeaway: Compared to other options in the list, data virtualization is the best option if you are looking for a lightweight and seamless experience of performing advanced analytics on MongoDB data. It is an excellent option if you have multiple data sources and need to do cross-database joins or if you simply need to maintain a flexible data infrastructure as you scale.
Option 4: Custom coding with Python and PyMongo
Another option is to build a custom Python application to connect to MongoDB, fetch data from it and perform analytics on it. PyMango is a MongoDB driver for Python to enable this. In fact, using PyMongo we can not only fetch MongoDB data but also write data back into MongoDB.
This can be a good option compared to a data warehouse and would excel at exploratory data analysis but may not always be the right fit for commercial applications.
Takeaway: Unless you specifically need a custom solution that is lightweight and can be used easily for exploratory data analysis, you might still want to look at data warehousing or a BI solution.
Option 5: Translation
One last option is translating the SQL queries into MongoDB queries. This is very similar to what the MongoDB connector does but it is done as 3rd party implementations. The team at Dremio has done a good job building out a translation engine that attempts to resolve this issue. Translation systems let you write out SQL queries, interpret them, and reformat them into NoSQL queries. This is a good option for some simple use-cases but may run into issues for more complex applications that require things like cross-database joins. It also introduces a delay that can cause problems for high throughput analytics. However, if your use-case is fairly straightforward and you don’t foresee it getting more complex in the future, translation may be a great option.
Takeaway: This option is good if you need to work with simple analytical queries on a single database source.
In this post we covered a thorough comparison between MongoDB vs SQL database and saw various options of performing analytics on MongoDB data. Let us summarize our discussion below.
- SQL databases are used to store structured data while NoSQL databases like MongoDB are used to save unstructured data.
- MongoDB is used to save unstructured data in JSON format.
- MongoDB does not support advanced analytics and joins like SQL databases support.
- There are many good solutions available to support MongoDB analytics, including: data virtualization, translation, the MongoDB connector, and data warehousing with an ETL or ELT process.