VMware’s Serengeti – Virtualized Hadoop at Cloud-scale

10.23.2012

Not long ago I covered the topic of Big Data adoption in the enterprise. In it, I described how Serengeti enables enterprises to respond to common Hadoop implementation challenges resulting from the lack of usable enterprise-grade tools and the shortage of infrastructure deployment skills.

With the latest release of the open source Project Serengeti, VMware continues its mission to deliver the easiest and most reliable virtualized Big Data platform. One of the most distinctive attributes of a Serengeti Hadoop deployment is that it can easily coexist with other workloads on existing infrastructure.

Serengeti-deployed Hadoop clusters can also be configured with either a local or a shared, scale-out data storage architecture. This storage layer can even be shared across multiple HDFS-based analytical workloads, and in the future it could potentially be extended to other, non-HDFS-based data engines.

The elasticity of the underlying vSphere virtualization platform helps Serengeti achieve new levels of efficiency. This architecture enables organizations to share their existing infrastructure with Big Data analytical workloads and deliver optimal storage capacity and performance.

Driving new levels of efficiency

While the idea of a dynamically scalable Hadoop cluster capable of using spare data center capacity was part of the Serengeti Project from the beginning, the recent enhancement of its on-demand compute capacity makes this notion much easier to implement.

Using the new Hadoop Virtualization Extensions (HVE), which resulted from VMware’s work with the Apache Hadoop community, Serengeti can now scale compute nodes up and shut them down on demand based on resource availability, while fully leveraging data locality. HVE makes Hadoop truly aware of the underlying virtualization, which in turn allows Hadoop to deliver the same level of performance already experienced by other vSphere workloads. HVE will first be available in the Greenplum HD 1.2 distribution, making enterprise Hadoop deployments more elastic and secure and enabling quick, efficient analysis of data already in HDFS within minutes, not hours. And when another, perhaps more important, workload demands these previously unused compute cycles, Serengeti releases them back to the pool.

Expediting access to business insight

So, why is all this dynamic capability important? The discipline of managing Big Data infrastructure is relatively immature, and enterprise IT is under immense pressure to deliver a dynamic analytic platform that greatly expedites the time it takes to derive actionable insight from data.

This period, commonly referred to as Time To Insight (TTI), is the time it takes an average user to extract an actionable business insight from newly discovered data. Think of it as the time it takes to attach or upload the necessary data set, execute a specific MapReduce job, and consume the resulting HDFS data from an external analytical tool-set through a SQL connection to a Hive server.

This process of turning data into actionable information has traditionally been a challenge. Increasingly large volumes of data, presented in a variety of formats, make analytics of any sort more complex. Doing this ever faster demands a whole new level of infrastructure agility. Serengeti drastically shortens the current Big Data TTI.

Ease of use and granularity of control

This latest update further simplifies deployment of Hadoop along with Hadoop ecosystem components like HDFS, MapReduce, Pig, and Hive. Just like before, in its simplest configuration this often-daunting task can be performed with a single command in under ten minutes.
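
For example, a basic deployment from the Serengeti command-line shell can look something like the sketch below. The cluster name is illustrative, and the exact command syntax may vary between Serengeti releases:

cluster create --name myHadoopCluster

This one command provisions a default Hadoop cluster, with Serengeti handling VM provisioning, node placement, and Hadoop configuration behind the scenes.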

Perhaps the most impressive part of the latest release is that, along with this unparalleled speed of deployment and ease of use, Serengeti also delivers the necessary granularity of control over each deployment. This level of control applies both to infrastructure configuration parameters like storage type, node placement, or High Availability (HA) status, and to the Hadoop system configuration itself, down to specific values for environment variables and HDFS, MapReduce, and logging properties.
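
To illustrate, the node group portion of a Serengeti cluster spec file might look roughly like the following sketch. The values are illustrative, and the exact attribute names may differ slightly between Serengeti releases:

"nodeGroups": [
  {
    "name": "worker",
    "roles": ["hadoop_datanode", "hadoop_tasktracker"],
    "instanceNum": 5,
    "cpuNum": 2,
    "memCapacityMB": 4096,
    "storage": {
      "type": "LOCAL",
      "sizeGB": 50
    },
    "haFlag": "off"
  }
]

Here the number of worker instances, per-node resources, storage type, and HA setting are all declared in one place, alongside the Hadoop configuration properties discussed below.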

As an example, users may wish to select the job scheduling method that best suits their situation. Hadoop originally scheduled jobs in the order they were submitted, so a First-In First-Out (FIFO) scheduler is used by default. When a Hadoop cluster is shared by a variety of long- and short-running jobs, it may be preferable to use the fair scheduler or the capacity scheduler, which allows shorter jobs to complete in a reasonable time instead of waiting for the long-running jobs to finish.

Using Serengeti, the user can indicate the selection of the fair scheduler with the following lines in the Serengeti spec file:

"configuration": {
  "hadoop": {
    "mapred-site.xml": {
      "mapreduce.jobtracker.taskscheduler": "org.apache.hadoop.mapred.FairScheduler"
    }
  }
}

The cluster config command takes the above cluster spec file, makes the required configuration change, and restarts the JobTracker so that the modified configuration takes effect. This takes much of the burden of configuring Hadoop off the user and makes tuning the cluster a very simple operation.
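
From the Serengeti command-line shell, applying the updated spec file to a running cluster looks roughly like the line below; the cluster name and file path are illustrative, and the exact option names may differ between Serengeti releases:

cluster config --name myHadoopCluster --specFile /home/serengeti/myHadoopCluster.json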

This level of control over Hadoop deployment applies to both the initial deployment as well as subsequent system tuning.

HA protection for critical Hadoop components

The benefit of deploying Hadoop on VMware’s time-tested virtualization technology is the ability to leverage the very same enterprise-grade enhancements that enterprise IT expects. The two specific features that lend themselves to Serengeti’s emphasis on ease of use are High Availability (HA) and Fault Tolerance (FT), both of which can be enabled for the entire Hadoop cluster with a single click.

HA – Protection against host and VM failures

It’s easy to configure High Availability for Hadoop’s NameNode and JobTracker, traditionally considered the single points of failure in any Hadoop deployment, whether based on shared or local storage.

With a single click, Serengeti brings the very same High Availability to the entire Hadoop stack, including the Hive server, at both the host and VM level, with automatic failure detection and restart within minutes on any available host in the cluster. Any in-progress Hadoop jobs are automatically paused and resumed once the NameNode is back up.

FT – Provides Continuous Protection

This notion of protection can be taken even further to deliver true zero downtime for a Hadoop system and prevent data loss using the Fault Tolerance (FT) feature, not only for the NameNode and JobTracker but also for other components in the Hadoop cluster.

This is achieved through VMware’s FT, working in concert with HA/DRS, which keeps a single identical shadow VM running in lockstep on a separate host to deliver zero-downtime, zero-data-loss failover for virtual machines in the case of hardware failures. This solution does not require complex clustering or specialized hardware; it is a single, common mechanism for all applications and operating systems.

Distribution of your choice

Serengeti is not biased toward any particular Hadoop provider; its features apply equally to any of the currently supported 1.0-based distributions.

As we have shown, Serengeti greatly simplifies access to actionable business insight from large volumes of data on existing infrastructure by dynamically provisioning the necessary platform. This new capability lets enterprise users focus on the data and its algorithms rather than the underlying infrastructure.

Mark Chmarny

About Mark Chmarny

During his 15+ year career, Mark Chmarny has worked across various industries and most recently as a Cloud Architect at EMC where he developed Cloud Computing solutions for both Service Provider and Enterprise customers. At VMware, Mark is a Data Solution Evangelist in the Cloud Application Platform group. Mark received a Mechanical Engineering degree from Technical University in Vienna, Austria and a BA in Communication Arts from Multnomah University in Portland, OR.
Published at DZone with permission of its author, Stacey Schneider.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)