One of the challenges that arises when building ODBC drivers for many of the different data sources that people demand is how to intuitively map the different types of data to the relational model that is used by ODBC and SQL.

MongoDB is one of those interesting challenges, where the data is stored in Binary JSON (BSON) documents. I’m going to give you the low-down on how the Simba MongoDB ODBC driver maps that data to allow BI tools to easily analyze your data in place. I’m going to use JSON as a way to illustrate how we map the data. BSON isn’t the same but it’s similar enough that the concepts carry over very closely.

First up, you need to understand how data is organized in MongoDB. As mentioned, data is stored in BSON documents, which are organized into collections. If you are used to the relational world, collections could roughly be considered tables and documents could be considered rows, although that’s not very accurate for documents as you’ll see.

Let’s assume we have a collection called People, and a representative document in this collection is as follows:

This document contains some base fields (“_id” and “Name”), as well as a nested document (“Dates”) and a nested array (“Interests”).

There are many ways this could be represented when mapping to the relational model. Through experimentation and real-world feedback, we’ve found the following two ways to be effective, intuitive ways to map the data.

The first way is the simple way, where every field is flattened out to a separate column. When you look at the flattened view of the above document, it looks like the following:

People

_id Name Dates.StartDate Dates.EndDate Interests__0 Interests__1
299 Kyle long ago employed Running Movies

This method tends to work well when the structure of the documents is consistent across a collection, or if you’re treating MongoDB in a more relational manner and thus don’t have any nested arrays. It usually breaks down a bit when you have nested arrays in your documents, which can be illustrated as follows. If you have a Customer collection which tracks the customer orders in a nested array, it’s possible that a single customer may have many hundreds of orders. Because the flattening method breaks out each element in the array into a separate column, you’ll end up with many columns in your table.

In addition, filtering on the array becomes problematic because each element in the array is a separate column. Using the above example, if I wanted to find the names of all the people who were interested in Movies, I would have to write the following SQL:

If you think about the Customer collection outlined above, you can see that it quickly becomes impractical to filter large nested arrays.

Simba solved this problem by using what we call virtual tables. These can be thought of as views on the data that have relationships. Using the above example again, we actually map the data into two separate collections, one for the base fields and one for the nested array.

People_vt_main

People._id Name Dates.StartDate Dates.EndDate
299 Kyle long ago employed

People_vt_Interests

People._id Interests_index Interests.value
299 0 Running
299 1 Movies

You can also optionally have the driver return the count of elements in the array, as well as a few other things. With this new mapping, we can express the previous query as:

This second query is more complex, however this query will remain the same no matter how many elements are added to our arrays for all of the people in the collection. The first query will need to expand to filter on all of the columns representing the array, so in realistic scenarios the second query is much simpler.

Another point is that we are using joins, however because we’re joining virtual tables on the same collection together, our driver can optimize this and the query will execute as if this were one collection. So don’t worry, we’ve made the data easier to analyze while still maintaining the performance you expect.

This mapping of MongoDB’s nested documents to relational schema is done transparently by the Simba MongoDB ODBC driver, allowing an analyst to use familiar SQL queries to work with a NoSQL data source.

If you have any questions, you can try out the driver yourself or contact us at solutions@simba.com.