To create a predictive model, feature engineering (defining the set of input) is a key part if not the most important. In this post, I'd like to share my experience in how to come up with the initial set of features and how to evolve it as we learn more.
Earlier in the week I showed a way to plot back to back charts using R’s ggplot library but looking back on the code it felt like it was a bit hacky to ‘glue’ two charts together using a grid. I wanted to find a better way.
Let's face it - computing was created to analyze data and machine learning represents the state-of-the-art in making sense of data. For many years it has been out of reach for the common developer.
A colleague had some questions about the book named above. Some of which were irrational. I'll try to tackle the rational questions since emphasis my point on ways not to ask questions about books.
When modelling data with ARIMA models, it is sometimes useful to plot the inverse characteristic roots. The following functions will compute and plot the inverse roots for any fitted ARIMA model (including seasonal models).
We got a customer question about a map/reduce index that produced the wrong results. The problem was a problem between the conceptual model and the actual model of how Map/Reduce actually works.
Econometrics is often “theory driven” while statistics tends to be “data driven”. I discovered this in the interview for my current job when someone criticized my research for being “data driven” and asked me to respond.
I thought it would be fun to try out a few different Python object relational mappers (ORMs) besides SQLAlchemy. I recently stumbled across a project known as peewee. For this article, we will take the examples from my SQLAlchemy tutorial and port it to peewee to see how it stands up.
I thought it’d be interesting to create some visualisations around the times that people RSVP ‘yes’ to the various Neo4j events that we run in London. I tried to use ggplot to create a bar chart of the data. Unfortunately that resulted in this error:
In my continued playing around with R I wanted to map a custom function over two lists comparing each item with its corresponding items.
Rolling forecasts are commonly used to compare time series models. Here are a few of the ways they can be computed using R. I will use ARIMA models as a vehicle of illustration, but the code can easily be adapted to other univariate time series models.
The past two weeks have seen a great deal of discussion around the recent computational social science study of Kramer, Guillory and Hancock (2014) “Experimental evidence of massive-scale emotional contagion through social networks” .
We’ve been using Apache Camel a fair amount recently as our ingestion pipeline of choice. It presents a fairly nice DSL for wiring together different data sources, performing transformations, and finally sending data to Solr.
This visualization displays the data for one random NYC yellow taxi on a single day in 2013. See where it operated, how much money it made, and how busy it was over 24 hours.
When a Solr schema changes, us Solr devs know what’s next — a large reindex of all of our data to capture any changes to index-time analysis. When we deliver solutions to our customers, we frequently need to build this in as a feature.
In this post, I show how the use of this filter combined with a Synonym Filter configured to take advantage of auto phrasing, can help to solve an ongoing problem in Lucene/Solr – how to deal with multi-term synonyms.
Monday, I will be giving the closing talk of the R in Insurance Conference, in London, on Bayesian Computations for Actuaries, as to be more specific, Getting into Bayesian Wizardry… (with the eyes of a muggle actuary).
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (July 4 to July 11). This week's topic's include designing data architecture, python in universities and other links, R tricks and Big Data white papers.
Our healthcare system is still (mostly) based on capitalism: more patients + more visits = more money. Within such a system, it is not in the best interest of healthcare providers to have healthy patients.
If you’re trying to navigate your way through the Big Data landscape and can’t see the wood for the trees, allow us to show you the way by cutting right to the chase with this new series of concise and right-to-the-point reports.
One of our modules in our project is an elasticsearch cluster. In order to fine tune the configuration (shards, replicas, mapping, etc.) and the queries, we created a JMeter environment.
Some data writings worth reading from Freakonometrics.
Today’s interview from DevOps Days Austin features Sumo Logic’s co-founder and CTO, Christian Beedgen. If you’re not familiar with Sumo Logic it’s a log management and analytics service. I caught up with Christian right after he got off stage on day one.
In my continued playing around with plyr’s ddply function I was trying to group a data frame by one of its columns and return a count of the number of rows with specific values and ran into a strange (to me) error message.
In this post, I will illustrate how I envision the corporate architecture that will enable companies to achieve the data dream that integrates Fast and Big.