A nice post about the tikzDevice R package was recently published on the rsnippets blog. This package is – indeed – awesome.
I've blogged about this data set a number of times. Many people in the LitSupport and eDiscovery industry use it (practically everyone), it's been available for almost a decade, and only now is this being found?
Hebrew search is not an easy task, and HebMorph is a project I started several years ago to address that problem. After a certain period of inactivity I'm back actively working on it. I'm also happy to say there are already several live systems using it to enable Hebrew searches in their applications.
This demonstrates basic language features – case classes, iteration, anonymous functions, etc.
Just a short post to share some code used to generate animated graphs with R. Assume we would like to illustrate the law of large numbers, and the convergence of the average value from a binomial sample.
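The post's animation uses R; below is a minimal, non-animated Python sketch of the same convergence idea. The sample size and success probability are my own illustrative choices, not values from the post.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

n, p = 10_000, 0.3   # sample size and Bernoulli success probability (illustrative)
running_sum = 0
running_means = []
for i in range(1, n + 1):
    running_sum += 1 if random.random() < p else 0
    running_means.append(running_sum / i)

# By the law of large numbers, the running mean drifts toward p as i grows.
print("after 10 draws:", running_means[9])
print("after", n, "draws:", running_means[-1])
```

Plotting `running_means` against the draw index (in R or matplotlib) gives the familiar curve settling onto p.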
Ask any business intelligence or analytics vendor to give a few customer examples of how their product is being used with data stored in Hadoop, and you’re likely to get some blank stares. That’s not a slam on the BI vendors. It’s partially an indicator of the state of maturity in the Hadoop market, and partially a reflection of just how hard BI is, regardless of the data source.
Enigma pulls data from tens of thousands of public data sets, and then offers up an interface that makes it pretty straightforward to trawl through the whole lot in search of the data points that you actually need. As the company's Marc DaCosta introduced it, Enigma is a "search and discovery platform for public data."
Cloudera has a great toolkit for working with Hadoop. Specifically, it is focused on building distributed systems and services on top of the Hadoop ecosystem.
A data warehouse preserves data. It can be argued that a data warehouse preserves only data. This, however, is false. To an extent, a data warehouse must also preserve processing details.
This Refcard is a collection of code examples that introduces the reader to the principal Data Mining tasks using Python.
Day to day, it's easy to lose sight of what it means to live in the future we've made, to take it for granted.
Data scientists are in short supply! Or at least that’s a headline you can find nearly everywhere. There are people trying desperately to hire them and also people trying hard to jump into the perceived gap and become one. Meanwhile, there’s plenty of skepticism over whether the role is real or a function of all of the hype.
There are a lot of changes occurring these days with the Big Data revolution: cloud computing, NoSQL, columnar stores, and virtualization, to mention just a few of the fast-moving technologies that are transforming how we manage our data and run our IT operations.
"Big data companies are terrible at communicating their value proposition. It needs good storytellers and marketers who can talk about its business value."
I was excited to hear the news that Packt Publishing (http://www.packtpub.com/) was releasing a new book dedicated to Splunk called "Implementing Splunk: Big Data Reporting and Development for Operational Intelligence" (http://www.packtpub.com/implementing-splunk/book). The majority of the documentation and information on Splunk has been produced by Splunk itself, so I was eager to see whether "Implementing Splunk" would be a fresh take on the large amount of information already out there. The book was written by Vincent Bumgarner, who has been designing software for close to 20 years, has been working with Splunk since 2007, and has been helping companies use the application as a business intelligence, reporting, and analytics tool.
At Strata 2013, Microsoft's Dave Campbell talks about how the Xbox leverages big data...
I was playing around with SymPy, a symbolic math package for Python, and ran across nsimplify. It takes a floating point number and tries to simplify it: as a fraction with a small denominator, square root of a small integer, an expression involving famous constants, etc.
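A few illustrative calls to nsimplify (my own examples, not from the post). Passing candidate constants as a hint is how you steer it toward symbolic forms:

```python
from sympy import nsimplify, sqrt, pi

# A float with a small-denominator rational representation
print(nsimplify(0.125))                            # 1/8

# Hinting with candidate constants helps it find symbolic forms
print(nsimplify(1.7320508075688772, [sqrt(3)]))    # sqrt(3)
print(nsimplify(6.283185307179586, [pi]))          # 2*pi
```

Without a useful hint, nsimplify may fall back to a plain rational approximation of the float.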
Today, Olivier Scaillet gave a great talk on fast recursive projections. After lunch, we discussed financial model complexity, mentioning that sometimes, traders and quants are lost, and it might be good to spend more time on basics than on very advanced stuff.
It seems big data means something different to everyone. In the great debate/hype about big data, there’s no lack of opinion on the topic and it seems to mostly depend on an individual’s product, skill set and business challenges.
My previous post looked at rolling 5 six-sided dice as an approximation of a normal distribution. If you wanted a better approximation, you could roll dice with more sides, or you could roll more dice. Which helps more?
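One back-of-the-envelope way to compare (my own sketch, not the post's method): excess kurtosis is zero for a normal distribution, has a known closed form for a fair d-sided die, and scales as 1/n for a sum of n independent dice.

```python
def die_excess_kurtosis(d):
    # Excess kurtosis of a single fair d-sided die (discrete uniform on 1..d):
    # standard closed form -6*(d**2 + 1) / (5*(d**2 - 1)).
    return -6 * (d**2 + 1) / (5 * (d**2 - 1))

def sum_excess_kurtosis(n, d):
    # Cumulants add for independent variables, so the excess kurtosis
    # of a sum of n iid dice is 1/n times that of one die.
    return die_excess_kurtosis(d) / n

# More sides: 5 dice, 6 vs 20 sides -- the kurtosis barely moves.
print(sum_excess_kurtosis(5, 6), sum_excess_kurtosis(5, 20))
# More dice: 5 vs 20 six-sided dice -- the kurtosis shrinks by a factor of 4.
print(sum_excess_kurtosis(5, 6), sum_excess_kurtosis(20, 6))
```

By this measure, adding dice helps far more than adding sides: a single die's excess kurtosis stays near -1.2 no matter how many sides it has, while each doubling of the number of dice halves the sum's deviation from normality.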
I gave a talk recently at the Mathematical Finance Days, organized at HEC Montréal on Monday and Tuesday, on advanced methods in trees, with (as mentioned in the subtitle of the first slide) some thoughts on teaching mathematical finance.
Stanford's Amr Awadallah argues that Hadoop is the "data operating system of the future."
A handful of dice can make a decent normal random number generator, good enough for classroom demonstrations. This Python code calculates the distribution of the sum of the dice and compares it with the normal distribution.
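The post's code itself isn't reproduced here; below is a minimal sketch of the idea, computing the exact distribution of the sum of 5 six-sided dice by repeated convolution and comparing it to the matching normal density. The function names and sample totals are my own.

```python
import math

def dice_sum_pmf(n_dice=5, sides=6):
    # Exact pmf of the sum of n fair dice, built up one die at a time.
    pmf = {0: 1.0}
    for _ in range(n_dice):
        new = {}
        for total, p in pmf.items():
            for face in range(1, sides + 1):
                new[total + face] = new.get(total + face, 0.0) + p / sides
        pmf = new
    return pmf

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

n, s = 5, 6
pmf = dice_sum_pmf(n, s)
mu = n * (s + 1) / 2                    # mean of the sum of n dice
sigma = math.sqrt(n * (s**2 - 1) / 12)  # standard deviation of the sum
for total in (10, 17, 18, 25):
    print(total, round(pmf[total], 4), round(normal_pdf(total, mu, sigma), 4))
```

The exact probabilities and the normal density agree closely near the center of the distribution, which is what makes the classroom demonstration work.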
Schema design in NoSQL is very different from schema design in an RDBMS. Once you get something like HBase up and running, you may find yourself staring blankly at a shell, lost in the possibilities of creating your first table.
Source-control backing is a decade-long obsession of mine, and now I'm thinking about “open data.” If something can be represented by a textual document, is structured or regular, and tends towards completeness over time, then source-control is a viable alternative to a relational schema (or a document store).