Arthur Charpentier 05/07/13
1245 views
0 replies
A nice post was recently published on the rsnippets blog, about the tikzDevice R package. This package is – indeed – awesome.
Greg Duncan 05/07/13
172 views
0 replies
I've blogged about this data set a number of times. Many, many people in the LitSupport, eDiscovery industry use it (like about everyone), it's been available for almost a decade, and now this is found?
Itamar Syn-hershko 05/06/13
1811 views
0 replies
Hebrew search is not an easy task, and HebMorph is a project I started several years ago to address that problem. After a certain period of inactivity I'm back actively working on it. I'm also happy to say there are already several live systems using it to enable Hebrew searches in their applications.
Gary Sieling 05/06/13
1927 views
0 replies
This demonstrates basic language features – case classes, iteration, anonymous functions, etc.
Arthur Charpentier 05/06/13
1541 views
0 replies
Just a short post to share some code used to generate animated graphs with R. Assume that we would like to illustrate the law of large numbers, and the convergence of the average value of a binomial sample.
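The convergence described in this teaser can be sketched in a few lines. The post itself uses R and animated graphs; the version below is only an illustrative Python sketch, not the author's code. The function name `running_means` and the parameters `trials`, `p`, and `seed` are assumptions chosen for the example.

```python
import random


def running_means(n_draws, trials=10, p=0.5, seed=0):
    """Return the running average of n_draws Binomial(trials, p) samples.

    All parameter names and defaults are hypothetical, picked for illustration.
    """
    rng = random.Random(seed)
    total = 0
    means = []
    for i in range(1, n_draws + 1):
        # One binomial draw: count successes in `trials` Bernoulli(p) trials.
        draw = sum(rng.random() < p for _ in range(trials))
        total += draw
        means.append(total / i)
    return means


means = running_means(5000)
# By the law of large numbers the running mean settles near trials * p = 5.0.
print(means[9], means[99], means[-1])
```

Plotting `means` against the draw index gives the static version of the picture the post animates.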
Bootstrap Mark... 05/06/13
162 views
0 replies
Ask any business intelligence or analytics vendor to give a few customer examples of how their product is being used with data stored in Hadoop, and you’re likely to get some blank stares. That’s not a slam on the BI vendors. It’s partially an indicator of the state of maturity in the Hadoop market, and partially a reflection of just how hard BI is, regardless of the data source.
Paul Miller 05/05/13
2102 views
0 replies
Enigma pulls data from tens of thousands of public data sets, and then offers up an interface that makes it pretty straightforward to trawl through the whole lot in search of the data points that you actually need. As the company’s Marc DaCosta introduced it, a “search and discovery platform for public data.”
Joe Stein 05/05/13
2110 views
0 replies
Cloudera has a great toolkit to work with Hadoop. Specifically it is focused on building distributed systems and services on top of the Hadoop Ecosystem.
Steven Lott 05/04/13
2401 views
0 replies
A data warehouse preserves data. It can be argued that a data warehouse preserves only data. This, however, is false. To an extent, a data warehouse must also preserve processing details.
Giuseppe Vettigli 05/04/13
513 views
0 replies
This Refcard is a collection of code examples that introduces the reader to the principal Data Mining tasks using Python.
Eric Gregory 05/03/13
1975 views
0 replies
Day to day, it's easy to lose sight of what it means to live in the future we've made, to take it for granted.
Christopher Taylor 05/03/13
3987 views
0 replies
Data scientists are in short supply! Or at least that’s a headline you can find nearly everywhere. There are people trying desperately to hire them and also people trying hard to jump into the perceived gap and become one. Meanwhile, there’s plenty of skepticism over whether the role is real or a function of all of the hype.
Sam Taha 05/03/13
2170 views
1 reply
The Big Data revolution is bringing a lot of change these days: cloud computing, NoSQL, columnar stores, and virtualization, to mention just a few of the fast-moving technologies that are transforming how we manage our data and run our IT operations.
Ravi Kalakota 05/03/13
1481 views
0 replies
"Big data companies are terrible at communicating their value proposition. It needs good storytellers and marketers who can talk about its business value."
Vince Sesto 05/02/13
2340 views
0 replies
I was excited to hear that Packt Publishing (http://www.packtpub.com/) was releasing a new book dedicated to Splunk, “Implementing Splunk - Big Data Reporting and Development for Operational Intelligence” (http://www.packtpub.com/implementing-splunk/book). Most of the documentation and information on Splunk has been produced by Splunk itself, so I was eager to see whether “Implementing Splunk” would be a fresh take on the large amount of information already out there. The book was written by Vincent Bumgarner, who has been designing software for close to 20 years, has worked with Splunk since 2007, and has been helping companies use the application as a business intelligence, reporting, and analytics tool.
Eric Gregory 05/02/13
1783 views
0 replies
At Strata 2013, Microsoft's Dave Campbell talks about how the Xbox leverages big data...
John Cook 05/02/13
1565 views
0 replies
I was playing around with SymPy, a symbolic math package for Python, and ran across nsimplify. It takes a floating point number and tries to simplify it: as a fraction with a small denominator, square root of a small integer, an expression involving famous constants, etc.
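As a rough illustration of what the teaser describes, here is a minimal sketch of `sympy.nsimplify` on two inputs. It assumes SymPy is installed; the inputs are chosen for this listing and are not taken from the post.

```python
from sympy import Rational, nsimplify, pi

# A float that is exactly representable comes back as a small fraction.
quarter = nsimplify(0.25)
print(quarter)

# With a loose tolerance, nsimplify settles for a simple rational
# approximation of pi instead of keeping the symbol.
approx_pi = nsimplify(pi, tolerance=0.01)
print(approx_pi)

# Both results are exact SymPy rationals, not floats.
print(isinstance(quarter, Rational), isinstance(approx_pi, Rational))
```

The optional second argument of `nsimplify` takes a list of constants (for example `[pi]`) to try in the simplified expression, which is how results involving famous constants can appear.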
Arthur Charpentier 05/02/13
2800 views
0 replies
Today, Olivier Scaillet gave a great talk on fast recursive projections. After lunch, we discussed financial model complexity, mentioning that sometimes, traders and quants are lost, and it might be good to spend more time on basics than on very advanced stuff.
Christopher Taylor 05/01/13
1974 views
0 replies
It seems big data means something different to everyone. In the great debate/hype about big data, there’s no lack of opinion on the topic and it seems to mostly depend on an individual’s product, skill set and business challenges.
John Cook 05/01/13
1055 views
0 replies
My previous post looked at rolling 5 six-sided dice as an approximation of a normal distribution. If you wanted a better approximation, you could roll dice with more sides, or you could roll more dice. Which helps more?
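One way to explore the question is to compare excess kurtosis, which is 0 for a normal distribution. This is only a sketch under that assumption: the post may well use a different measure of closeness, and the function name `excess_kurtosis` is hypothetical. For a fair die with k sides the excess kurtosis is -6(k² + 1) / (5(k² - 1)), and for a sum of n independent dice it is that value divided by n.

```python
from fractions import Fraction


def excess_kurtosis(n_dice, sides):
    """Exact excess kurtosis of the sum of n_dice fair dice with `sides` faces."""
    # Excess kurtosis of one discrete uniform variable on 1..sides.
    single = Fraction(-6 * (sides**2 + 1), 5 * (sides**2 - 1))
    # For a sum of independent, identically distributed terms it scales as 1/n.
    return single / n_dice


baseline = excess_kurtosis(5, 6)      # five six-sided dice
more_sides = excess_kurtosis(5, 20)   # same count, twenty-sided dice
more_dice = excess_kurtosis(10, 6)    # twice as many six-sided dice

for label, value in [("5d6", baseline), ("5d20", more_sides), ("10d6", more_dice)]:
    print(label, float(value))
```

By this particular measure, adding sides can never push the value past -1.2 / n, while adding dice shrinks it toward 0 without limit.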
Arthur Charpentier 05/01/13
1069 views
0 replies
I recently gave a talk at the Mathematical Finance Days, held at HEC Montréal on Monday and Tuesday, on advanced methods in trees, with (as mentioned in the subtitle of the first slide) some thoughts on teaching mathematical finance.
Eric Gregory 05/01/13
1093 views
0 replies
Stanford's Amr Awadallah argues that Hadoop is the "data operating system of the future."
John Cook 04/30/13
2466 views
0 replies
A handful of dice can make a decent normal random number generator, good enough for classroom demonstrations. This Python code calculates the normal distribution of the sum of the dice.
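The post's own code is not reproduced in this listing. As a stand-in, the sketch below computes the exact distribution of the sum of a handful of fair dice by repeated convolution, which is the quantity one would compare against a normal curve. The function name `dice_sum_pmf` is an assumption made for this example.

```python
from fractions import Fraction


def dice_sum_pmf(n_dice, sides=6):
    """Exact probability mass function of the sum of n_dice fair dice."""
    pmf = {0: Fraction(1)}  # before any roll, the sum is 0 with certainty
    for _ in range(n_dice):
        nxt = {}
        for total, prob in pmf.items():
            for face in range(1, sides + 1):
                # Each face is equally likely, so split the mass evenly.
                nxt[total + face] = nxt.get(total + face, 0) + prob / sides
        pmf = nxt
    return pmf


pmf = dice_sum_pmf(5)
mean = sum(s * p for s, p in pmf.items())
var = sum((s - mean) ** 2 * p for s, p in pmf.items())
print(float(mean), float(var))
```

Subtracting `mean` from a rolled sum and dividing by the square root of `var` rescales it to an approximately standard normal value.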
Chase Seibert 04/30/13
2152 views
0 replies
Schema design in NoSQL is very different from schema design in an RDBMS. Once you get something like HBase up and running, you may find yourself staring blankly at a shell, lost in the possibilities of creating your first table.
Paul Hammant 04/30/13
3949 views
0 replies
Source-control backing is a decade-long obsession of mine, and now I'm thinking about “open data.” If something can be represented by a textual document, is structured or regular, and tends towards completeness over time, then source-control is a viable alternative to a relational schema (or a document store).