HBase is a database that provides real-time, random read and write access to tables meant to store billions of rows and millions of columns. It is designed to run on a cluster of commodity servers and to automatically scale as more servers are added, while retaining the same performance.
The biggest strength of Hector Cuesta’s new book may be that it brings together in one place information on tools that are used together but whose documentation is scattered. The book is a great source of sample code, and more.
Spring Data Solr is an extension to the Spring Data project which aims to simplify the usage of Apache Solr in Spring applications. In the following post, the author will show how you can use Spring Data repositories to access Solr features in Spring applications.
So this actually happened: Mark Zuckerberg, CEO of Facebook, attended the annual NIPS (Neural Information Processing Systems Foundation) conference in Lake Tahoe, Nevada. This is pretty remarkable, given the way the hype around deep learning has increased this year.
In previous posts, the author mentioned that he wanted to keep using Lucene to build on top of existing knowledge and experience, but to do so while scaling reliably and without too much pain. Elasticsearch turned out to be a perfect fit for that, and in this article, you'll learn why.
This recent article discusses how to debug Hive (Hadoop) through an anecdote regarding a customer's struggling Hive job. According to the author, there are downsides to working with Hadoop, and sometimes it does not offer a lot of information in terms of what has gone wrong.
This installment of Arthur Charpentier's regular collection of data science-related links includes thoughts on how to think like a data scientist, a tutorial on getting started with multilevel modeling in R, changes in the patterns of collaboration in science, and more.
This set of slides on the applications of Apache Solr and Hadoop in search takes an interesting look at one of the key uses of Solr and Hadoop. The slides give a brief overview of the technologies, then explore a wide variety of different subjects in Solr and Hadoop.
When the author went on Toronto Open Data’s website and found a dataset of licensed child care centers throughout Toronto, he thought he might have a fun time analyzing a topic that he thankfully has not had to deal with thus far! In this article, you'll find the process of mapping buildings and the R code to do it.
This presentation from Hilary Mason at devs love bacon is an introduction to machine learning for those who have no prior experience with it. Take a look if you're interested in a quick, fun overview to help you get started.
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone. This week's best include a reflection on curing cancer with data visualization, how to compare word counts in two text documents using R, and working with Java 8 Lambda expressions and JDBC.
In this article, you'll find a top 100 list of the most popular Java libraries, based on 10,000 GitHub projects and an analysis of the top trends in Java. Like the author, you may be surprised by some of the results.
Recently, Yelp made available a sample dataset from the greater Phoenix metropolitan area including around 11,000 businesses and 8,000 check-in sets. We are interested in finding out whether it is possible to visually cluster businesses by category based on their check-in data.
The possibilities in the field of mobile healthcare seem enormous. In the UK at least, much of community health is delivered in a labour intensive way, with professionals either going out to households or patients coming into GP surgeries.
Experienced developers interested in learning more about programming in R have a fantastic resource in John Cook's "R programming for those coming from other languages." Cook's guide is to-the-point and concise, and focuses on the information needed to become productive with R, without a lot of fluff.
Data access, specifically SQL access from within Java, has never been nice. This is in large part because the JDBC API involves a lot of ceremony. In this article, you'll learn how to make SQL access easier in Java using Java 8 Lambda expressions and Streams.
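The linked article's own code is not reproduced here. As a minimal sketch of the general idea, assuming nothing beyond the standard `java.sql` interfaces, the row-by-row loop can be hidden behind a helper that takes a lambda. The names `RowMapper`, `mapRows`, and `fakeNames` are hypothetical and exist only for this illustration; `fakeNames` is an in-memory stand-in for a real query result so the sketch runs without a database.

```java
import java.lang.reflect.Proxy;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class LambdaJdbcSketch {

    /** Hypothetical callback: turns the current row of a ResultSet into a value. */
    @FunctionalInterface
    public interface RowMapper<T> {
        T map(ResultSet rs) throws SQLException;
    }

    /** Walks the ResultSet once and collects one mapped value per row. */
    public static <T> List<T> mapRows(ResultSet rs, RowMapper<T> mapper) throws SQLException {
        List<T> out = new ArrayList<>();
        while (rs.next()) {
            out.add(mapper.map(rs));
        }
        return out;
    }

    /**
     * Illustration-only stub: an in-memory ResultSet that supports just
     * next() and getString(...). Every getString call returns the current
     * name, regardless of the column requested.
     */
    public static ResultSet fakeNames(String... names) {
        Iterator<String> it = Arrays.asList(names).iterator();
        String[] current = new String[1];
        return (ResultSet) Proxy.newProxyInstance(
            LambdaJdbcSketch.class.getClassLoader(),
            new Class<?>[] { ResultSet.class },
            (proxy, method, args) -> {
                switch (method.getName()) {
                    case "next":
                        if (!it.hasNext()) {
                            return false;
                        }
                        current[0] = it.next();
                        return true;
                    case "getString":
                        return current[0];
                    default:
                        throw new UnsupportedOperationException(method.getName());
                }
            });
    }

    public static void main(String[] args) throws SQLException {
        // The lambda replaces the hand-written while loop and per-row extraction.
        List<String> upper = mapRows(fakeNames("ada", "grace"),
                rs -> rs.getString("name").toUpperCase());
        System.out.println(upper); // prints [ADA, GRACE]
    }
}
```

With a real database, the `ResultSet` would come from a `PreparedStatement` opened in a try-with-resources block, and the returned list can be processed further with `stream()`.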
You’ve just found a bowl you know nothing about. You start pulling out marbles and the first 99 marbles are red. Will the 100th marble be red as well? D’oh. But is it really that obvious? How can you be sure?
This recent article discusses the emerging field of web science and the increasing popularity of Python as a language for data analysis, compared with previous standards such as Stata and R.
One of the questions the author tends to get is what happens with a SolrCloud cluster when ZooKeeper fails. Not a single ZooKeeper instance failure, but the whole ensemble not being accessible. Because the answer to this question is easy to verify, the author decided to show what happens when ZooKeeper fails.
This installment of Arthur Charpentier's regular collection of data science-related links includes 5 ways to work with Big Data in R, "Statistical inference in massive data sets," how to analyze your network of Facebook friends with R, and more.
Recently, the Apache Lucene and Solr PMC announced version 4.6 of the Apache Lucene library and the Apache Solr search server. This is the next release in the 4.x line of both Apache Lucene and Apache Solr.
The Python vs. R article cited in this post clarifies the reasons why a programming language is a better choice than a "tool" or "platform."
This installment of Arthur Charpentier's data science-related links includes an article on Lucien Le Cam and Bayes, a discussion of the multilingual cyberspace, a visualization of income disparity over time, and more.
As an old Spring Data fan, when I found out Spring Data offered a Solr module, I jumped at the chance to try it.