In a previous post, I discussed the automatic schema generation capabilities of Simba’s new MongoDB driver. Most of the time this is perfectly adequate but there are some circumstances in which you may wish to provide the schema to the driver rather than having it generate a schema on the fly. This post will describe the use of the new Schema Editor that is bundled with the Simba driver to create and customise schemas for use by ODBC and JDBC applications to access MongoDB collections.
When Should You Predefine a Schema?
As discussed previously, by default the Simba MongoDB driver samples a collection to determine the structure or schema of the data. Without a predefined schema, this sampling occurs each time an application connects with the driver to MongoDB. If the structure of documents changes quite frequently then a larger sample size is required to get a good representation of the collection’s structure. The larger the sample size, the bigger the performance impact. Defining a schema once and providing it to the driver eliminates this performance hit. Remember though that if the document structure in you collection is constantly changing then new changes won’t be represented in this schema and automatic sampling might remain a better option.
Another common reason for customising a schema is that field names used in data stores are notorious for being cryptic, abbreviated and unsuitable (looking at you “ORDERNUM” and “CUSTID”) for use in analytic tools where the consumer is not the database administrator or application developer. Using the Schema Editor, we’ll see how a field’s name can be edited or even removed entirely from a schema if it’s not suitable for analytic purposes.
Finally, there may be circumstances in which it is more appropriate to flatten array elements in a document rather than the default behaviour which is to create a virtual table for each array.
Using the Schema Editor to Generate a Schema
The Schema Editor can be launched either from the application short cut, from the /<driver>/Tools/SchemaEditor folder or, in the case of ODBC, from the ODBC DSN configuration dialog (the latter will automatically pass connection information to the Editor). The initial Schema Editor screen gives you the choice of creating a new schema or modifying an existing schema. Choose “Create A New Schema Definition” and, after providing the ODBC or JDBC connection information, you’ll be presented with some familiar sampling options discussed in my previous post. For this post, I’m using the restaurants sample database for which the default sampling options will suffice.
After sampling, expand the tree on the left and you’ll see we have a database that contains a restaurants collection which contains details about restaurants, the type of food they offer, where they are located and also what sort of grades they have obtained over the years. This restaurants collection is treated as a table from an ODBC / JDBC perspective. Further expanding restaurants we see that we have two virtual tables as well: restaurants_coord and restaurants_grades. These two tables represent arrays that are embedded in a document. For restaurants_grades, an array clearly makes sense as new elements (i.e. grades) will be added to the array over time. For the coordinates of a restaurant’s location, it is debatable whether an array was the best choice to model this data as there will only every be one pair of latitude/longitude coordinates for each restaurant. This particular semantic is only understood by the application creating the data though so automatic schema generation is going to represent this as a virtual table that could contain varying numbers of records for each document. Fortunately, we can change this in the Schema Editor.
Select the restaurants_coord table from the tree on the left and you’ll see on the right there are three fields in
this virtual table: _id (the key field used to link it to the main restaurants table), an index field for the array, and then the restaurants_coord element itself. Highlight this last item and select “Move to Parent”. You’ll be asked how many elements to move to the parent restaurant table. We want the latitude and longitude so select 2. Now if we look at the parent restaurant table, we see two additional fields have been added: restaurants_coord_0 and restaurants_coord_1. These aren’t the most descriptive names so double click on the field in the SQL Name column and change the name to something more descriptive (longitude and latitude come to mind…)
Because we moved the coordinate fields to the parent table, we can further simplify the schema by hiding the restaurants_coord virtual table. Right click on the restaurants_coord table on the left and choose “Hide Table”. Now when an application connects using the driver, there will only be two tables available: the parent restaurants table and the virtual table containing restaurant grades.
Now that we have a simplified our schema, we need to save it. There are two options for persisting a schema: it can be stored within the MongoDB database or saved to a file. If the collection will only have one schema used with it then storing it centrally within MongoDB is ideal – especially if multiple users that will be using the same schema. If multiple schemas are needed (perhaps some schemas only expose certain tables in the collection) then saving it to a file is more appropriate. To publish the schema to the MongoDB database, choose “Publish Schema Map” from the Connection menu. To save the schema as a local file instead, choose “Save” from the File menu.
The final step is to ensure the driver now uses this schema definition rather than its default behaviour of sampling the data after connecting.
For ODBC, this is done using ODBC Administrator to configure the settings for your MongoDB DSN. When editing the DSN settings, click on the “Advanced Options…” button. In the Metadata section of the Advanced Options dialog, you can specify the mechanism (where the schema is stored) and, if “Local File” is chosen, where that file is stored.
For JDBC, this defined in the connection URL used to connect to MongoDB. To use a schema stored in the database the connection URL would look like:
To use a schema stored in a local file, the following needs to be specified on the connection URL:
For more information on how to configure ODBC and JDBC connections as well as the various other configuration options that are available, consult the appropriate install guide.
Adding Structure to Semi-Structured Data
For small data sets or data sets with a structure that stays consistent between documents, automatic schema generation through sampling provides a low-touch method of accessing semi-structured data from your ODBC/JDBC application. For finer control of the schema, the Schema Editor is a powerful tool that can be used to customise a schema to suite your specific needs when accessing NoSQL data.
Download a free evaluation of Simba’s new MongoDB driver today and give the Schema Editor a try!