Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s build-in map-reduce functionality to improve overall performance.
In part 1 of this article, I described the use of MongoDB to solve a specific Chemoinformatics problem, namely the computation of molecular similarities. Depending on the target Tanimoto coefficient, the MongoDB solution is able to screen a database of a million compounds in subsecond time. To make this possible, queries only return chemical compounds which, in theory, are able to satisfy the particular target Tanimoto. Even though this optimization is in place, the number of compounds returned by this query increases significantly when the target Tanimoto is lowered. The example code on the GitHub repository for instance, imports and indexes ~25000 chemical compounds. When a target Tanimoto of 0.8 is employed, the query returns ~700 compounds. When the target Tanimoto is lowered to 0.6, the number of returned compounds increases to ~7000. Using the MongoDB explain functionality, one is able to observe that the internal MongoDB query execution time increases slightly, compared to the execution overhead to transfer the full list of 7000 compounds to the remote Java application. Hence, it would make more sense to perform the calculations local to where the data is stored. Welcome to MongoDB’s build-in map-reduce functionality!
1. MongoDB molecular similarity map-reduce query
Map-reduce is a conceptual framework, introduced by Google, to enable the processing of huge datasets using a large number of processing nodes. The general idea is that a larger problem is divided in a set of smaller subproblems that can be answered (i.e. solved) by an individual processing node (the map-step). Afterwards, the individual solutions are combined again to produce the final answer to the larger problem (the reduce-step). By making sure that the individual map and reduce steps can be computed independently of each other, this divide-and-conquer technique can be easily parallellized on a cluster of processing nodes. Let's start by refactoring our solution to use MongoDB's map-reduce functionality.