
The joy of algorithms and NoSQL: a MongoDB (Map-Reduce) example (part 2)

09.02.2011

Part 1 of this article described the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses how that solution can be refactored to use MongoDB's built-in map-reduce functionality to improve overall performance.

In part 1 of this article, I described the use of MongoDB to solve a specific chemoinformatics problem, namely the computation of molecular similarities. Depending on the target Tanimoto coefficient, the MongoDB solution is able to screen a database of a million compounds in subsecond time. To make this possible, queries only return chemical compounds which, in theory, are able to satisfy the target Tanimoto coefficient. Even with this optimization in place, the number of compounds returned by the query increases significantly as the target Tanimoto is lowered. The example code on the GitHub repository, for instance, imports and indexes ~25000 chemical compounds. With a target Tanimoto of 0.8, the query returns ~700 compounds; when the target is lowered to 0.6, the number of returned compounds increases to ~7000. Using MongoDB's explain functionality, one can observe that the internal query execution time increases only slightly; the bulk of the extra cost lies in transferring the full list of 7000 compounds to the remote Java application.
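This can be verified directly from the Java driver. The sketch below assumes the compoundquery object and compoundsCollection from part 1 of this article, and reads the "millis" (server-side execution time) and "n" (number of returned documents) fields of the classic explain document:

// Inspect how MongoDB executes the compound selection query (sketch; assumes
// the compoundquery and compoundsCollection objects defined in part 1)
DBObject explanation = compoundsCollection.find(compoundquery).explain();

// "millis" holds the server-side execution time, "n" the number of returned documents
System.out.println("Server-side execution time (ms): " + explanation.get("millis"));
System.out.println("Number of compounds returned: " + explanation.get("n"));

Hence, it would make more sense to perform the calculations local to where the data is stored. Welcome to MongoDB's built-in map-reduce functionality!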

1. MongoDB molecular similarity map-reduce query

Map-reduce is a conceptual framework, introduced by Google, to enable the processing of huge datasets using a large number of processing nodes. The general idea is that a larger problem is divided into a set of smaller subproblems that can each be answered (i.e. solved) by an individual processing node (the map-step). Afterwards, the individual solutions are combined to produce the final answer to the larger problem (the reduce-step). Because the individual map and reduce steps can be computed independently of each other, this divide-and-conquer technique is easily parallelized on a cluster of processing nodes.
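As a minimal illustration of the pattern itself, consider the following self-contained Java sketch: the map-step computes a partial result per chunk of data, and the reduce-step combines the partial results into the final answer. The chunked summation is purely illustrative and not part of the MongoDB solution:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MapReduceSketch {
    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(3, 1, 4, 1, 5, 9, 2, 6);

        // Map-step: each chunk is summed independently (and could run on a separate node)
        List<Integer> partialSums = new ArrayList<Integer>();
        int chunkSize = 3;
        for (int start = 0; start < data.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, data.size());
            int sum = 0;
            for (int value : data.subList(start, end)) {
                sum += value;
            }
            partialSums.add(sum);
        }

        // Reduce-step: combine the independent partial answers into the final result
        int total = 0;
        for (int partial : partialSums) {
            total += partial;
        }
        System.out.println("total = " + total); // 31
    }
}

With the pattern clear, let's start by refactoring our solution to use MongoDB's map-reduce functionality.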

// Calculate the essential numbers
int maxnumberofcompoundfingerprints = (int) (fingerprintsToFind.size() / 0.6);
int minnumberofcompoundfingerprints = (int) (fingerprintsToFind.size() * 0.6);
int numberoffingerprintstoconsider = fingerprintsToFind.size() - minnumberofcompoundfingerprints;

List<String> fingerprintsToConsider = fingerprintsToFind.subList(0, numberoffingerprintstoconsider + 1);

// Find all compounds that satisfy the specified conditions
DBObject compoundquery =   
    QueryBuilder.start(FINGERPRINTS_PROPERTY).in(fingerprintsToConsider)
                .and(FINGERPRINTCOUNT_PROPERTY).lessThanEquals(maxnumberofcompoundfingerprints)
                .and(FINGERPRINTCOUNT_PROPERTY).greaterThanEquals(minnumberofcompoundfingerprints)
                .get();

// The map function: count how many of the target fingerprints this compound shares
String map = "function() {  " +
                 "var found = 0; " +
                 "var fingerprintslength = this.fingerprints.length; " +
                 "for (var i = 0; i < fingerprintslength; i++) { " +
                     "if (fingerprintstofind[this.fingerprints[i]] === true) { found++; } " +
                 "} " +
                 "if (found >= minnumberofcompoundfingerprints) { emit(this.compound_cid, {found : found, total : this.fingerprint_count}); } " +
             "}";

// Create the map-reduce command. MongoDB requires a reduce function even though
// each compound emits at most one value, so a simple pass-through reduce is used.
String reduce = "function(key, values) { return values[0]; }";
MapReduceCommand mr = new MapReduceCommand(compoundsCollection, map, reduce, null, MapReduceCommand.OutputType.INLINE, compoundquery);

// Create a hashmap of the fingerprints to find (to speed up the JavaScript lookups)
Map<String, Boolean> tofind = new HashMap<String, Boolean>();
for (String fingerprinttofind : fingerprintsToFind) {
    tofind.put(fingerprinttofind, true);
}

// Set the map-reduce scope: these variables become available inside the JavaScript functions
Map<String, Object> scope = new HashMap<String, Object>();
scope.put("fingerprintstofind", tofind);
scope.put("minnumberofcompoundfingerprints", minnumberofcompoundfingerprints);
mr.setScope(scope);

// Execute the map reduce
MapReduceOutput out = compoundsCollection.mapReduce(mr);

// Iterate the results
for (DBObject result : out.results()) {
    String compound_cid = (String) result.get("_id");
    DBObject value = (DBObject) result.get("value");

    // Calculate the Tanimoto coefficient from the emitted statistics
    double totalcount = (Double) value.get("total");
    double found = (Double) value.get("found");
    double tanimoto = found / (totalcount + fingerprintsToFind.size() - found);

    // We still need to check whether the Tanimoto coefficient really reaches the required similarity
    if (tanimoto >= 0.6) {
        System.out.println(compound_cid + " " + (int) (tanimoto * 100) + "%");
    }
}

The map-step of MongoDB's map-reduce implementation takes a MongoDB document as input and emits one (or more) answers (which, in essence, are again MongoDB documents). Executing our map-step on all compound documents in the compounds collection would not be very efficient. Instead, we want to limit the execution of our map-step to those documents that can theoretically match the target Tanimoto coefficient. Luckily, we already defined this query, namely the compound selection query described in part one of this article! By employing this query, only compounds that match it are pushed through the map-step. A MongoDB map (and reduce) function is expressed in JavaScript. In our case, we calculate the number of unique fingerprint patterns that are shared by the target and the input compound. In case the minimum number of fingerprint patterns is reached, the map-step emits a document containing the PubChem identifier (as id) and some essential statistics (as values). A reduce-step would normally be employed to aggregate these answers into a final result. In our case however, we are interested in the individual results for each compound (document): as each compound is emitted at most once, the pass-through reduce function never has to combine anything (a sketch of what an aggregating reduce could look like follows below). When this map-reduce function is executed, only 27 potentially matching compounds are returned, instead of the 7000 compounds returned by the previous Java query!
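For completeness, this is what a reduce function could look like if aggregation were required. The function below is purely hypothetical (our solution emits each compound_cid exactly once, so it would never run); it merely illustrates how multiple values emitted under the same key would be combined:

// Hypothetical reduce function, shown only to illustrate the mechanism; the
// molecular similarity map-reduce emits each compound_cid once and never needs it
String aggregatingreduce = "function(key, values) { " +
                               "var combined = {found : 0, total : 0}; " +
                               "for (var i = 0; i < values.length; i++) { " +
                                   "combined.found += values[i].found; " +
                                   "combined.total += values[i].total; " +
                               "} " +
                               "return combined; " +
                           "}";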

One would expect the execution time of the map-reduce query to be considerably faster than the Java solution. Unfortunately, this is not the case. First of all, interpreted JavaScript is many times slower than Java. Secondly, although the individual map and reduce steps could in principle be parallelized across multiple CPU cores, MongoDB's map-reduce implementation runs single-threaded. To circumvent this limitation, one can use MongoDB sharding. Simply explained, instead of putting all data on a single MongoDB node, multiple MongoDB nodes are employed, each responsible for storing a part of the total data set. When executing our map-reduce function, each node executes the map-reduce steps on its part of the data in parallel. Hence, when sharding across a cluster of 4 MongoDB nodes, the map-reduce query executes almost 4 times faster, catching up with the performance of the Java solution. Apart from the MongoDB sharding configuration itself, no changes are required to the map-reduce function. Hence, scaling horizontally with MongoDB is a breeze ...
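Configuring the sharded cluster itself is outside the scope of this article, but once a mongos router is running, distributing the compounds collection boils down to two admin commands. The sketch below issues them through the Java driver; the database name ("chemdb"), collection name and shard key are illustrative assumptions:

// Hypothetical sharding setup through a mongos router on localhost; database,
// collection and shard key names are assumptions for illustration
Mongo mongo = new Mongo("localhost", 27017);
DB admin = mongo.getDB("admin");

// Enable sharding for the database that holds the compounds ...
admin.command(new BasicDBObject("enablesharding", "chemdb"));

// ... and shard the compounds collection on its PubChem identifier
admin.command(new BasicDBObject("shardcollection", "chemdb.compounds")
                  .append("key", new BasicDBObject("compound_cid", 1)));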

2. Conclusion

MongoDB's map-reduce performance is a bit of a disappointment; MongoDB itself currently advises against using it for (near) real-time computations. Version 2.0 of MongoDB should drastically improve map-reduce performance, as the JavaScript engine will be replaced by a faster execution platform. In the meantime, map-reduce performance can already be boosted by splitting the load across multiple MongoDB shards.
