A lot has changed in MongoDB since the last major release of the Simba MongoDB ODBC and JDBC drivers. Performance is a major focus area for the new release of the Simba drivers. Extensive query performance improvements have been realized through use of the MongoDB Aggregation Framework, and write performance has also improved significantly with the new MongoDB write protocol. Authentication support, such as Kerberos, is one of many additional features added in this release; refer to the Simba MongoDB driver product page for the complete list. Data throughput, important as it is (especially in big data scenarios), is secondary to ensuring your NoSQL data is structured in a way that can be consumed by your application of choice. How you represent your MongoDB data in a spreadsheet may differ from how you want it represented in a BI tool. This two-part post explores the various methods of representing your MongoDB data in a structured manner.

Sampling A MongoDB Collection

DSN Advanced Configuration for MongoDB

The Simba MongoDB driver has both an automatic method of determining a schema for a MongoDB collection as well as a method for manually creating and customising a schema. In either case, sampling of the documents in the collection is performed to determine a common schema. When configuring a MongoDB DSN, the following options influence how document sampling is performed: Sampling Method, Documents to sample, and Step Size.

Sampling Method can be either forward or backward. Forward starts with the first document in a collection and then looks at later documents; backward starts at the end of the collection and then looks at earlier documents. Why choose one over the other? One of the reasons applications use a NoSQL data store such as MongoDB is the flexibility to change what is stored without having to modify the entire schema of the data store. Consider a generic order entry application for a company that sells goods in only one country. Initially, the application will likely not bother capturing “Country” as one of the order entry fields, and thus it doesn’t show up as a field in the collection. However, if the company later decides to sell goods in other countries, the application gets updated to capture this information, and orders stored in the collection now include a country field. Depending on how many documents are sampled, sampling from the beginning of the collection may not detect the addition of the country field in later documents, so sampling backward would be more appropriate. Conversely, if fields are dropped in later documents, sampling forward would be more appropriate to ensure those dropped fields are detected.
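To make the trade-off concrete, here is a small Python sketch (an illustration of the idea, not the driver's actual algorithm) that infers a schema as the union of field names seen in a sample, taken from either end of a collection whose later documents gained a country field:

```python
def infer_schema(documents, sample_size, method="forward"):
    """Infer a schema as the union of field names across sampled documents."""
    if method == "backward":
        documents = list(reversed(documents))
    schema = set()
    for doc in documents[:sample_size]:
        schema.update(doc.keys())
    return schema

# Early orders lack a country field; later orders include it.
orders = [{"item": "widget", "qty": 2}] * 8 + \
         [{"item": "widget", "qty": 1, "country": "CA"}] * 2

# Forward sampling of 5 documents misses the new field...
print(infer_schema(orders, 5, "forward"))   # {'item', 'qty'}
# ...while backward sampling detects it.
print(infer_schema(orders, 5, "backward"))  # {'item', 'qty', 'country'}
```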

Documents to sample is fairly straightforward in that it specifies how many documents to examine to determine a schema. The sample size needs to be big enough to ensure an accurate representation of all fields in the collection. If the collection contains documents that vary considerably in structure, then typically a larger sample size is needed. However, the larger the sample size, the bigger the performance hit when sampling. This is particularly important for automatic schema detection, which is discussed later in this post, but can also be mitigated with manual schema generation, which will be covered in a future post.

Step size is how many documents to skip when sampling. For example, a step size of 5 means every 5th document in the collection is sampled. Stepping can be useful to ensure each part of a large collection gets sampled when examining every document would be prohibitive from a performance standpoint.
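How the three options interact can be sketched as follows; the function below is illustrative, and the exact positions the driver reads may differ:

```python
def sampled_indices(collection_size, documents_to_sample, step_size, method="forward"):
    """Return the document positions that would be examined: every
    step_size-th document, starting from the chosen end of the
    collection, up to documents_to_sample positions in total."""
    if method == "forward":
        candidates = range(0, collection_size, step_size)
    else:
        candidates = range(collection_size - 1, -1, -step_size)
    return list(candidates)[:documents_to_sample]

# A 100-document collection, sampling 4 documents with a step size of 5:
print(sampled_indices(100, 4, 5, "forward"))   # [0, 5, 10, 15]
print(sampled_indices(100, 4, 5, "backward"))  # [99, 94, 89, 84]
```

Note that with a large step size and a small sample count, only one end of the collection is ever examined, which is exactly why the choice of sampling method matters.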

Automatic Schema Detection

Automatic schema detection is performed whenever the Simba MongoDB driver does not have any predefined schema definition configured (manual schema definition will be covered in a later post). In this scenario, each time the application connects to MongoDB the Simba driver will sample the collection according to the settings discussed above and communicate the schema back to the application. Any arrays in a document will be represented as virtual tables to the application, and any embedded sub-documents are flattened. Consider the following document:

{
       "_id" : ObjectId("56c25056471b4b0a1ec9f95b"),
       "address" : {
              "building" : "1007",
              "coord" : [-73.856077, 40.848447],
              "street" : "Morris Park Ave",
              "zipcode" : "10462"
       },
       "borough" : "Bronx",
       "cuisine" : "Bakery",
       "grades" : [{
                     "date" : ISODate("2014-03-03T00:00:00Z"),
                     "grade" : "A",
                     "score" : 2
              }, {
                     "date" : ISODate("2013-09-11T00:00:00Z"),
                     "grade" : "A",
                     "score" : 6
              }, {
                     "date" : ISODate("2013-01-24T00:00:00Z"),
                     "grade" : "A",
                     "score" : 10
              }, {
                     "date" : ISODate("2011-11-23T00:00:00Z"),
                     "grade" : "A",
                     "score" : 9
              }, {
                     "date" : ISODate("2011-03-10T00:00:00Z"),
                     "grade" : "B",
                     "score" : 14
              }],
       "name" : "Morris Park Bake Shop",
       "restaurant_id" : "30075445"
}

Automatic schema detection will generate three tables:


| _Id | Address Building | Address Street | Address Zipcode | Borough | Cuisine | Name | Restaurant Id |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 56c25056471b4b0a1ec9f95b | 1007 | Morris Park Ave | 10462 | Bronx | Bakery | Morris Park Bake Shop | 30075445 |

This table represents the fields in the root document and any embedded document fields that are not arrays. Note that the embedded address document has been flattened (with the exception of the “coord” array).


| _Id | Restaurants Coord Dim1 Idx | Restaurants Coord |
| --- | --- | --- |
| 56c25056471b4b0a1ec9f95b | 0 | -73.856077 |
| 56c25056471b4b0a1ec9f95b | 1 | 40.848447 |

The array representing a restaurant’s coordinates is represented by a virtual table. The main restaurants table would be joined to this table to get a list of restaurants and their associated coordinates. It could be argued that rather than representing coordinates as an array, they could instead be latitude and longitude fields in the embedded address document as a single restaurant is only ever going to have one set of coordinates.


| _Id | Restaurants Grades Dim1 Idx | Date | Grade | Score |
| --- | --- | --- | --- | --- |
| 56c25056471b4b0a1ec9f95b | 0 | 2014-03-03 | A | 2 |
| 56c25056471b4b0a1ec9f95b | 1 | 2013-09-11 | A | 6 |
| 56c25056471b4b0a1ec9f95b | 2 | 2013-01-24 | A | 10 |
| 56c25056471b4b0a1ec9f95b | 3 | 2011-11-23 | A | 9 |
| 56c25056471b4b0a1ec9f95b | 4 | 2011-03-10 | B | 14 |

This table represents the grades array for a restaurant. It too is a virtual table that would be joined to the main restaurant table to get a list of all restaurants and their associated grades.
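The flattening and virtual-table expansion shown above can be sketched in a few lines of Python. This is a simplification of what the driver does, assuming dotted column names for embedded fields and an index column per array element; the driver's actual column naming and type mapping are more involved:

```python
def flatten(doc, prefix=""):
    """Flatten embedded sub-documents into dotted columns; collect arrays
    separately so they can become virtual tables."""
    row, arrays = {}, {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            sub_row, sub_arrays = flatten(value, name + ".")
            row.update(sub_row)
            arrays.update(sub_arrays)
        elif isinstance(value, list):
            arrays[name] = value
        else:
            row[name] = value
    return row, arrays

restaurant = {
    "_id": "56c25056471b4b0a1ec9f95b",
    "address": {"building": "1007", "coord": [-73.856077, 40.848447],
                "street": "Morris Park Ave", "zipcode": "10462"},
    "borough": "Bronx", "cuisine": "Bakery",
    "grades": [{"grade": "A", "score": 2}, {"grade": "B", "score": 14}],
    "name": "Morris Park Bake Shop", "restaurant_id": "30075445",
}

main_row, arrays = flatten(restaurant)
# Each array becomes a virtual table keyed by _id plus an index column,
# which is what the virtual tables above join back on.
virtual_tables = {
    name: [{"_id": main_row["_id"], "idx": i,
            **(elem if isinstance(elem, dict) else {"value": elem})}
           for i, elem in enumerate(values)]
    for name, values in arrays.items()
}
```

Running this against the sample document yields one main row plus two virtual tables, `address.coord` and `grades`, mirroring the three tables shown above.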

In the normalized view of the above, the main restaurants table is joined to each of the two virtual tables, with _Id being the obvious key field.

Automatic schema generation will suffice for many scenarios, but what if you want to simplify the schema for all your users, or the size of your data is negatively impacting the performance of automatic schema generation? In the next blog post, we’ll look at how to manually define a MongoDB schema for the driver and customise it for specific application scenarios. In the meantime, download a free evaluation of Simba’s new MongoDB driver today and give it a try!