When modelling data with ARIMA models, it is sometimes useful to plot the inverse characteristic roots. The following functions will compute and plot the inverse roots for any fitted ARIMA model (including seasonal models).
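The functions referred to are written in R; as a minimal sketch of the underlying computation, here is a Python version (assuming `numpy` is available and that `phi` holds the AR coefficients) that finds the inverse roots of the AR characteristic polynomial:

```python
import numpy as np

def inverse_ar_roots(phi):
    """Inverse roots of the AR characteristic polynomial
    1 - phi_1*z - ... - phi_p*z^p."""
    # np.roots expects coefficients from highest degree to lowest:
    # [-phi_p, ..., -phi_1, 1]
    coeffs = np.r_[-np.asarray(phi)[::-1], 1.0]
    roots = np.roots(coeffs)
    return 1.0 / roots

# AR(1) with phi = 0.5: the single characteristic root is z = 2,
# so its inverse is 0.5
inv = inverse_ar_roots([0.5])
```

For a stationary model all inverse roots lie inside the unit circle, which is why plotting them (rather than the roots themselves) gives an easy visual check.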
We got a customer question about a map/reduce index that produced the wrong results. The root cause was a mismatch between the user's conceptual model and how Map/Reduce actually works.
Econometrics is often “theory driven” while statistics tends to be “data driven”. I discovered this in the interview for my current job when someone criticized my research for being “data driven” and asked me to respond.
I thought it would be fun to try out a few different Python object relational mappers (ORMs) besides SQLAlchemy. I recently stumbled across a project known as peewee. For this article, we will take the examples from my SQLAlchemy tutorial and port it to peewee to see how it stands up.
I thought it’d be interesting to create some visualisations around the times that people RSVP ‘yes’ to the various Neo4j events that we run in London. I tried to use ggplot to create a bar chart of the data. Unfortunately that resulted in this error:
In my continued playing around with R I wanted to map a custom function over two lists, comparing each item with its corresponding item in the other list.
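The post itself is about R, but the same pairwise-mapping idea is easy to illustrate in Python, where `map` accepts multiple iterables and walks them in parallel (the lists here are made up for illustration):

```python
# Hypothetical data: two lists whose corresponding items we compare
a = [10, 20, 30]
b = [12, 18, 30]

def compare(x, y):
    return x - y

# map() consumes both lists in lock-step, like R's Map/mapply
diffs = list(map(compare, a, b))
# equivalently: [compare(x, y) for x, y in zip(a, b)]
```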
Rolling forecasts are commonly used to compare time series models. Here are a few of the ways they can be computed using R. I will use ARIMA models as a vehicle of illustration, but the code can easily be adapted to other univariate time series models.
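The post's code is in R; as a language-neutral sketch of the rolling-origin idea, here is a Python loop where a naive last-value forecast stands in for refitting an ARIMA model at each origin (the function name and toy series are my own):

```python
def rolling_one_step_errors(series, min_train=3):
    """Rolling-origin evaluation: at each time t, train on
    series[:t] and forecast one step ahead. A naive last-value
    forecast stands in for an actual model refit."""
    errors = []
    for t in range(min_train, len(series)):
        train = series[:t]       # expanding training window
        forecast = train[-1]     # one-step-ahead naive forecast
        errors.append(series[t] - forecast)
    return errors

errs = rolling_one_step_errors([1, 2, 3, 5, 8])
```

Swapping the naive forecast for a model fit inside the loop gives the usual rolling comparison of time series models.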
The past two weeks have seen a great deal of discussion around the recent computational social science study of Kramer, Guillory and Hancock (2014) “Experimental evidence of massive-scale emotional contagion through social networks” .
We’ve been using Apache Camel a fair amount recently as our ingestion pipeline of choice. It presents a fairly nice DSL for wiring together different data sources, performing transformations, and finally sending data to Solr.
This visualization displays the data for one random NYC yellow taxi on a single day in 2013. See where it operated, how much money it made, and how busy it was over 24 hours.
When a Solr schema changes, we Solr devs know what’s next — a full reindex of all of our data to capture any changes to index-time analysis. When we deliver solutions to our customers, we frequently need to build this in as a feature.
In this post, I show how the use of this filter combined with a Synonym Filter configured to take advantage of auto phrasing, can help to solve an ongoing problem in Lucene/Solr – how to deal with multi-term synonyms.
On Monday, I will be giving the closing talk of the R in Insurance Conference in London, on Bayesian Computations for Actuaries — or, to be more specific, Getting into Bayesian Wizardry… (with the eyes of a muggle actuary).
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (July 4 to July 11). This week's topics include designing data architecture, Python in universities and other links, R tricks, and Big Data white papers.
Our healthcare system is still (mostly) based on capitalism: more patients + more visits = more money. Within such a system, it is not in the best interest of healthcare providers to have healthy patients.
If you’re trying to navigate your way through the Big Data landscape and can’t see the wood for the trees, allow us to show you the way by cutting right to the chase with this new series of concise and right-to-the-point reports.
One of the modules in our project is an elasticsearch cluster. In order to fine-tune the configuration (shards, replicas, mapping, etc.) and the queries, we created a JMeter environment.
Some data writings worth reading from Freakonometrics.
Today’s interview from DevOps Days Austin features Sumo Logic’s co-founder and CTO, Christian Beedgen. If you’re not familiar with Sumo Logic it’s a log management and analytics service. I caught up with Christian right after he got off stage on day one.
In my continued playing around with plyr’s ddply function, I was trying to group a data frame by one of its columns and return a count of the number of rows with specific values, and ran into a strange (to me) error message.
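The post works with R's ddply; for readers more at home in Python, the same group-and-count-matching-rows operation can be sketched with the standard library (the `(group, value)` rows here are invented for illustration):

```python
from collections import Counter

# Hypothetical rows: (group, value) pairs standing in for a data frame
rows = [("A", 1), ("A", 0), ("B", 1), ("B", 1), ("A", 1)]

# Count, per group, how many rows have value == 1
counts = Counter(g for g, v in rows if v == 1)
```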
In this post, I will illustrate how I envision the corporate architecture that will enable companies to achieve the data dream that integrates Fast and Big.
In the past few years, almost every function of our lives has become dependent on real-time applications. Whether it is updating our friends on every move we make via social media or shopping on e-commerce websites, we have become completely dependent on getting the correct information quickly.
Over the last couple of years I’ve worked on several proof-of-concept style Neo4j projects, and on a lot of them people have wanted to work with their entire data set, which I don’t think makes sense so early on.
There is frequent conversation about the explosive growth of Big Data in the age of wearables, compulsive social media and ever more capable computers, but when it comes down to gleaning useful insights from data, data scientists face more challenges with variety than with sheer volume.
In continuing my analysis of the London Neo4j meetup group using R I wanted to see which days of the week we organise meetups and how many people RSVP affirmatively by the day.
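The analysis in the post is done in R; the core grouping-by-weekday step can be sketched in Python with the standard library (the RSVP dates below are made up for illustration):

```python
from collections import Counter
from datetime import date

# Hypothetical RSVP dates standing in for the meetup data
rsvp_dates = [date(2014, 7, 2), date(2014, 7, 9), date(2014, 7, 10)]

# strftime('%A') gives the weekday name, so Counter tallies RSVPs per day
by_day = Counter(d.strftime("%A") for d in rsvp_dates)
```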