In SimbaEngine X version 10.0.5, we are pleased to announce a significant performance optimization in our support for equijoins, the most commonly used means of combining data from two different tables in both SQL and NoSQL data sources.
The SQL Engine feature in SimbaEngine SDK is used to convert standard SQL to your data source’s native API statements regardless of the type of your data source. If your data source does not natively support joins, you can leverage our SQL Engine to implement support for joins in the driver.
A join combines rows from two or more tables based on a join condition. An equijoin is a join whose condition is an equality comparison between two fields (for example, table1.column1 = table2.column2).
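To make the definition concrete, here is a minimal sketch of an equijoin. It uses Python's built-in sqlite3 module purely for illustration, not SimbaEngine X, and the table contents are invented for the example.

```python
import sqlite3


def equijoin_demo():
    """Run a small equijoin on an in-memory SQLite database (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE table1 (column1 INTEGER, name TEXT);
        CREATE TABLE table2 (column2 INTEGER, city TEXT);
        INSERT INTO table1 VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
        INSERT INTO table2 VALUES (1, 'London'), (2, 'New York'), (4, 'Oslo');
        """
    )
    # The join condition is an equality between one field from each table.
    rows = conn.execute(
        "SELECT table1.name, table2.city "
        "FROM table1 JOIN table2 ON table1.column1 = table2.column2 "
        "ORDER BY table1.column1"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for name, city in equijoin_demo():
        print(name, city)
```

Only rows whose key appears in both tables are returned; the rows with keys 3 and 4 have no match and are dropped.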
Joining can involve manipulating very large amounts of data. For this reason, it was important for us to optimize the join implementation to maximize speed while keeping memory usage under control. In SimbaEngine X version 10.0.5, we have implemented a “hybrid hash” join algorithm, which is optimized for equijoins. Our version of this state-of-the-art algorithm is specially tuned for SimbaEngine X to maximize performance while giving the user the flexibility to control the amount of memory allocated to the driver.
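For readers unfamiliar with the technique, the following is a simplified textbook-style sketch of a hybrid hash join, not the SimbaEngine X implementation. The function name, its parameters, and the fixed partition count are all assumptions made for illustration. A production implementation would typically size the in-memory portion from the configured memory limit rather than always keeping exactly one partition resident.

```python
import pickle
import tempfile
from collections import defaultdict


def _read_all(spill_file):
    """Yield every pickled row previously written to a spill file."""
    spill_file.seek(0)
    while True:
        try:
            yield pickle.load(spill_file)
        except EOFError:
            return


def hybrid_hash_join(build_rows, probe_rows, build_key, probe_key, num_partitions=4):
    """Yield (build_row, probe_row) pairs with equal keys (hypothetical sketch)."""
    def partition_of(key):
        return hash(key) % num_partitions

    # Partition 0 stays in memory; partitions 1..n-1 spill to temporary files.
    build_spill = {p: tempfile.TemporaryFile() for p in range(1, num_partitions)}
    probe_spill = {p: tempfile.TemporaryFile() for p in range(1, num_partitions)}
    try:
        # Phase 1: partition the build side, hashing partition 0 immediately.
        in_memory = defaultdict(list)
        for row in build_rows:
            key = build_key(row)
            p = partition_of(key)
            if p == 0:
                in_memory[key].append(row)
            else:
                pickle.dump(row, build_spill[p])

        # Phase 2: probe rows in partition 0 are joined right away;
        # the rest are spilled alongside their matching build partition.
        for row in probe_rows:
            key = probe_key(row)
            p = partition_of(key)
            if p == 0:
                for match in in_memory.get(key, ()):
                    yield match, row
            else:
                pickle.dump(row, probe_spill[p])
        in_memory.clear()

        # Phase 3: join each spilled partition pair, one at a time,
        # so only one partition's hash table is resident at once.
        for p in range(1, num_partitions):
            table = defaultdict(list)
            for row in _read_all(build_spill[p]):
                table[build_key(row)].append(row)
            for row in _read_all(probe_spill[p]):
                for match in table.get(probe_key(row), ()):
                    yield match, row
    finally:
        for f in list(build_spill.values()) + list(probe_spill.values()):
            f.close()
```

The design trade-off is that rows landing in the resident partition are joined without any disk I/O, while the remaining partitions bound peak memory to roughly one partition's hash table at a time.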
On a single join on tables of 100K rows or more, the new implementation is an order of magnitude faster, independent of the allocated memory limit. For example, when joining on strings, we measure a speedup of up to 20x for a table as small as one million rows. When joining on integers on the same table, we measure a speedup of up to 30x.
Customers who are using the SQL Engine component of the SimbaEngine SDK will benefit from this optimization immediately.
Any statement passed to your driver that contains a join with an equijoin condition will see a considerable performance improvement. The boost is more apparent with larger tables that require longer execution times, for example datasets larger than 10K rows.