By nivanov
via jroller.com
Published: Dec 06 2007 / 03:17
I’ve recently had several interesting discussions on how to process massive amounts of data on the grid (specifically, with GridGain). Imagine that you have, say, 100TB of data either in files (thousands of files on NAS) or in a database (spread over dozens of instances and NAS). Let’s say you are storing textual blogs and you need to calculate a tag cloud (i.e. find the 20 most frequent tags in those blogs). What’s the best approach?
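To make the tag cloud problem concrete, here is a minimal plain-Java sketch of one way to structure it: count tags per data split, merge the partial counts, then pick the top tags. It is not GridGain code and is not taken from the original post. The class and method names (`TagCloudSketch`, `countTags`, `mergeCounts`, `topTags`) are hypothetical, and it assumes each blog has already been parsed into a list of tags. On a grid, each `countTags` call would presumably run on the node closest to its split.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TagCloudSketch {

    // "Map" step: count tags inside one split (for example, the blogs in one file).
    // Each inner list holds the tags of a single blog (assumed already parsed).
    public static Map<String, Long> countTags(List<List<String>> blogsInSplit) {
        Map<String, Long> counts = new HashMap<>();
        for (List<String> tags : blogsInSplit) {
            for (String tag : tags) {
                counts.merge(tag, 1L, Long::sum);
            }
        }
        return counts;
    }

    // "Reduce" step: sum the partial counts produced by all splits.
    public static Map<String, Long> mergeCounts(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((tag, n) -> total.merge(tag, n, Long::sum));
        }
        return total;
    }

    // Return the k most frequent tags; ties are broken alphabetically.
    public static List<String> topTags(Map<String, Long> counts, int k) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

Note that the sketch merges full per-split counts rather than per-split top-20 lists: a tag that ranks 21st in every split could still be among the 20 most frequent overall, so truncating before the merge can give a wrong answer.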
Tags: java, open source
Comments
joecoder replied:
The author discusses "affinity split", but doesn't actually describe how he increases data affinity in the scenarios he describes. He also implies that external resources like Network Attached Storage (NAS) will scale linearly to a large number of simultaneous users. As the number of simultaneous data accesses on the NAS increases, the performance per connection will drop, and you will not see the linear increase in data processing performance that the author promises.
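The contention argument can be illustrated with a rough back-of-the-envelope model, sketched here in Java. It assumes the NAS has a fixed aggregate bandwidth shared evenly among readers; the class name `NasScalingSketch`, the method names, and the figures in the usage note are hypothetical and not measurements from any real system.

```java
public class NasScalingSketch {

    // Aggregate read throughput in MB/s: grows linearly with the node count
    // until the shared NAS bandwidth (an assumed fixed cap) is saturated.
    public static double aggregateThroughput(int nodes, double perNodeMBps, double nasLimitMBps) {
        return Math.min(nodes * perNodeMBps, nasLimitMBps);
    }

    // Throughput each node actually sees once the cap is shared evenly.
    public static double perNodeThroughput(int nodes, double perNodeMBps, double nasLimitMBps) {
        return aggregateThroughput(nodes, perNodeMBps, nasLimitMBps) / nodes;
    }
}
```

With an assumed 100 MB/s per node and a 1000 MB/s NAS cap, 4 nodes read 400 MB/s in total, but 20 nodes still read only 1000 MB/s, so each node drops to 50 MB/s and adding nodes no longer speeds up the I/O-bound part of the job.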
It's also worth checking out other technologies that support this style of computing, for example Java Parallel Processing Framework (JPPF), GigaSpaces and Terracotta.
ronslow replied:
Wow - this technology is amazing. Absolutely zero deployment.
Voters For This Link (15)
Voters Against This Link (3)