Enigma pulls data from tens of thousands of public data sets, and then offers up an interface that makes it pretty straightforward to trawl through the whole lot in search of the data points that you actually need. The company’s Marc DaCosta introduced it as a “search and discovery platform for public data.”
Cloudera has a great toolkit for working with Hadoop. Specifically, it is focused on building distributed systems and services on top of the Hadoop ecosystem.
A data warehouse preserves data. It can be argued that a data warehouse preserves only data. This, however, is false. To an extent, a data warehouse must also preserve processing details.
This Refcard is a collection of code examples that introduces the reader to the principal Data Mining tasks using Python.
Day to day, it's easy to lose sight of what it means to live in the future we've made, to take it for granted.
Data scientists are in short supply! Or at least that’s a headline you can find nearly everywhere. There are people trying desperately to hire them and also people trying hard to jump into the perceived gap and become one. Meanwhile, there’s plenty of skepticism over whether the role is real or a function of all of the hype.
The Big Data revolution is bringing a lot of changes these days: cloud computing, NoSQL, columnar stores, and virtualization, to mention just a few of the fast-moving technologies that are transforming how we manage our data and run our IT operations.
"Big data companies are terrible at communicating their value proposition. It needs good storytellers and marketers who can talk about its business value."
I was excited to hear the news that Packt Publishing (http://www.packtpub.com/) was releasing a new book dedicated to Splunk, “Implementing Splunk: Big Data Reporting and Development for Operational Intelligence” (http://www.packtpub.com/implementing-splunk/book). A majority of the documentation and information on Splunk has been produced by Splunk itself, so I was eager to see whether “Implementing Splunk” would be a fresh take on the large amount of information currently out there. “Implementing Splunk” was written by Vincent Bumgarner, who has been designing software for close to 20 years, has been working with Splunk since 2007, and has been helping companies use the application as a business intelligence, reporting, and analytics tool.
At Strata 2013, Microsoft's Dave Campbell talks about how the Xbox leverages big data...
I was playing around with SymPy, a symbolic math package for Python, and ran across nsimplify. It takes a floating point number and tries to simplify it: as a fraction with a small denominator, square root of a small integer, an expression involving famous constants, etc.
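A minimal sketch of what that looks like in practice (assuming SymPy is installed; the specific floats here are my own illustrative inputs, not from the post):

```python
from sympy import nsimplify, Rational, pi

# nsimplify tries to find an exact symbolic expression close to a float.
frac = nsimplify(0.125)                       # recovers the fraction 1/8

# Passing candidate constants lets it search expressions built from them.
guess = nsimplify(3.141592653589793, [pi])    # recognizes the float as pi
```

Without a hint list, `nsimplify` leans toward small-denominator rationals; supplying constants like `pi` or `sqrt(2)` widens the search to expressions involving them.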
Today, Olivier Scaillet gave a great talk on fast recursive projections. After lunch, we discussed financial model complexity, mentioning that sometimes, traders and quants are lost, and it might be good to spend more time on basics than on very advanced stuff.
It seems “big data” means something different to everyone. In the great debate/hype about big data there’s no lack of opinion on the topic, and what it means seems mostly to depend on an individual’s product, skill set, and business challenges.
My previous post looked at rolling 5 six-sided dice as an approximation of a normal distribution. If you wanted a better approximation, you could roll dice with more sides, or you could roll more dice. Which helps more?
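One way to make the comparison concrete is excess kurtosis, which is 0 for a normal distribution. A fair s-sided die has excess kurtosis -6(s²+1)/(5(s²-1)), and for a sum of n independent dice that value shrinks by a factor of n. A small sketch (my own framing, not code from the post):

```python
def die_excess_kurtosis(sides: int) -> float:
    # Excess kurtosis of a single fair die with faces 1..sides.
    return -6.0 * (sides**2 + 1) / (5.0 * (sides**2 - 1))

def sum_excess_kurtosis(n_dice: int, sides: int) -> float:
    # Summing n independent dice divides the excess kurtosis by n.
    return die_excess_kurtosis(sides) / n_dice

baseline   = sum_excess_kurtosis(5, 6)    # 5d6, the setup from the previous post
more_dice  = sum_excess_kurtosis(10, 6)   # double the dice
more_sides = sum_excess_kurtosis(5, 12)   # double the sides
```

By this measure, doubling the dice halves the distance from normal, while doubling the sides barely moves it: the per-die value is already close to its limit of -6/5 no matter how many sides you add.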
I gave a talk recently at the Mathematical Finance Days, held at HEC Montréal on Monday and Tuesday, on advanced methods in trees, with (as mentioned in the subtitle of the first slide) some thoughts on teaching mathematical finance.
Stanford's Amr Awadallah argues that Hadoop is the "data operating system of the future."
A handful of dice can make a decent normal random number generator, good enough for classroom demonstrations. This Python code calculates the normal distribution of the sum of the dice.
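The post's code isn't reproduced here, but the idea can be sketched as computing the exact distribution of a sum of dice by repeated convolution, then reading off its mean (this is my own minimal version, not the original code):

```python
from collections import Counter

def dice_sum_pmf(n_dice: int, sides: int = 6) -> dict:
    # Exact probability mass function of the sum of n fair dice,
    # built up one die at a time by convolving distributions.
    pmf = {0: 1.0}
    for _ in range(n_dice):
        nxt = Counter()
        for total, p in pmf.items():
            for face in range(1, sides + 1):
                nxt[total + face] += p / sides
        pmf = dict(nxt)
    return pmf

pmf = dice_sum_pmf(5)                       # 5 six-sided dice
mean = sum(k * p for k, p in pmf.items())   # 17.5 for 5d6
```

Plotting this pmf against a normal curve with mean 17.5 and variance 5·35/12 shows the bell shape is already quite close, which is why a handful of dice works for classroom demos.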
Schema design in NoSQL is very different from schema design in an RDBMS. Once you get something like HBase up and running, you may find yourself staring blankly at a shell, lost in the possibilities of creating your first table.
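For the blank-shell moment, a first table in the HBase shell can be as small as one column family; the table and column names below are purely illustrative:

```
create 'users', {NAME => 'info', VERSIONS => 1}
put 'users', 'user1', 'info:email', 'alice@example.com'
get 'users', 'user1'
scan 'users', {LIMIT => 10}
```

The key design decisions live in the row key and column families rather than in columns, since HBase columns are created on the fly with each `put`.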
Source-control backing is a decade-long obsession of mine, and now I'm thinking about “open data.” If something can be represented by a textual document, is structured or regular, and tends towards completeness over time, then source-control is a viable alternative to a relational schema (or a document store).
In this seriously in-depth Pycon talk, we learn how to use Python to scrape data from web sources not conventionally built to supply it.
I ran into an interesting problem today. I was working with the first project where we legitimately needed Solr soft commits and in testing my configuration I wanted to prove to myself that the soft commits were performing as expected.
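For context, soft-commit behavior is driven by `solrconfig.xml`; a typical sketch (the interval values here are illustrative, not the project's actual settings) looks like:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flush to stable storage; openSearcher=false keeps it cheap. -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: make new documents searchable without an fsync. -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

A simple way to check it's working is to index a document and query for it before the next hard commit fires: if it shows up within the soft-commit window, the soft commits are doing their job.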
As the Big Data hype machine continues its relentless attempt to gobble everything in its path, new business units and entire new domains buying into the promise find themselves faced with unanticipated data volume and complexity.
Via LinkedIn TechTalks, Rob Bekkerman delves into the basics of machine learning.
This article shows how to develop a MongoDB application quickly with ZK & Grails.
This error may occur if you’re using sort=geodist() in your Solr spatial / geographic search. The reason is probably that you have an empty pt= value or that the parameter is missing altogether.
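A working geodist() sort needs both the spatial field and a current point; a request sketch (the field name `store` and the coordinates are illustrative) looks like:

```
http://localhost:8983/solr/select?q=*:*&sfield=store&pt=45.15,-93.85&sort=geodist()+asc
```

If either `sfield` or `pt` is missing or empty, geodist() has nothing to measure from, which triggers the error.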