HP Technology at Work: The must-read IT business eNewsletter
Hadoop deep dive: How it enables big data processing
Any exploration of big data will eventually lead you to Hadoop. And yet, despite its extensive coverage in industry media, few people fully understand what it is and what it means for big data.
Whether you’re looking at Hadoop to fulfill a distinct need or simply hoping to sharpen your competitive edge, it represents great potential for the business world. To help you tease out that potential for your business, here is a basic overview of Hadoop’s capabilities and how they might fit with your infrastructure.
Why Hadoop now?
“Hadoop is pretty hot in the world of [business intelligence] and data warehousing because it’s helping enable new types of analysis,” says Philip Russom, research director for data management at TDWI. The rapid increase of big data in the past several years, as well as the growing awareness of its potential value, has sharpened interest in these new types of analysis.
Russom cites several trends driving the need for a greater range of data analytics options. The increased use of social media, the proliferation of mobile devices and the increase in sensor and GPS data are all generating reams of information that must be sorted and stored. But it’s the potential for business intelligence hidden within this data that is sparking interest in more robust data processing and analytics solutions.
“Lots of people go into advanced analytics just to understand the current state of their business,” says Russom. Locating the source of sudden customer churn is one example. Developing a better understanding of the customer in general is another. “Any kind of understanding you can get from the customer will provide some form of financial reward,” says Russom.
Isn’t it just a big database?
Hadoop is often confused with a database or database management system. But it’s actually a framework built on a distributed file system that can load and sort huge amounts of data (think tens of terabytes) by spreading the work across many servers. When combined with other complementary technologies, Hadoop can enable processes far beyond those that a typical database management system can achieve.
Many people are fuzzy on the concept of Hadoop because it isn’t one distinct solution. “People talk about [Hadoop] like it’s one big monolithic thing, but it’s a series of products,” says Russom. TDWI describes the family of interrelated Hadoop solutions as an “ecosystem.”
The Apache Hadoop open source project comprises eleven sub-projects. Hadoop Common, Hadoop Distributed File System (HDFS) and Hadoop MapReduce perform the primary data storage and processing functions. Other sub-projects perform more distinct functions. For example, Avro is a data serialization system, Hive is a data warehouse infrastructure that provides data summarization and ad hoc querying and Pig is a high-level data-flow language and execution framework for parallel computation. You can read about all of the sub-projects on the Apache Hadoop website.
What does it do?
Hadoop consists of two major components: data storage and data processing. The data storage component is based on a distributed file system. These types of systems exist in other forms, but Hadoop stands out in that it distributes data across a large set of server nodes, which provides robust capacity. The data processing component is based on a paradigm called MapReduce, which was developed by Google to support distributed computing on large data sets across clusters of nodes.
In the “map” step, a master node splits the input into smaller pieces and distributes them to a network of worker nodes. A worker may split its piece further, so the input can fan out across several layers of nodes, depending on how large or demanding the job is. Each worker node processes its piece of the data, or query, and passes its answer back to the master node. In the “reduce” step, those answers are combined into one primary output. With multiple nodes working simultaneously, data can be loaded and processed very quickly.
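The map-and-combine flow described above can be illustrated with a minimal, single-machine sketch in Python. It is not Hadoop code and does not use any Hadoop API. The function names (map_phase, shuffle, reduce_phase, run_job) and the sample input are assumptions made up for this illustration, and the "worker nodes" are simulated by an ordinary loop.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single answer."""
    return key, sum(values)

def run_job(chunks):
    """Run map, shuffle and reduce in sequence (hypothetical driver)."""
    mapped = []
    for chunk in chunks:  # on a real cluster, each chunk would go to a different worker
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Three chunks stand in for pieces of a large file spread across nodes.
chunks = ["big data big insight", "data at scale", "big scale"]
word_counts = run_job(chunks)
print(word_counts)
```

Because each chunk is mapped independently of the others, the map work can be spread across as many nodes as are available. Only the grouping and combining steps need to see results from more than one chunk.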
Another significant factor that sets Hadoop apart: it can process multiple data types simultaneously. You can load structured, semi-structured and unstructured data into its distributed file system and process it all at the same time. This capability provides significantly more freedom in how you manage data and what you get from it. So Hadoop promises to be the multi-structured data platform that complements data warehouses and other databases of mostly structured data.
What’s it good for?
Hadoop on its own is best suited for large batch processing. It can process tens of terabytes of file-based data very quickly. With these basic but powerful functions, Hadoop is great for searching within large data sets, aggregating information, or performing basic math functions like sums and averages.
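As a rough illustration of this kind of batch aggregation, the Python sketch below computes a sum and an average per group from data split across several simulated workers. It is a single-machine sketch, not Hadoop code. The function names (map_partial, reduce_partials), the region labels and the amounts are all hypothetical. Each worker returns a partial (sum, count) pair rather than its own average, because averaging the workers' averages would give the wrong answer when workers hold different numbers of records.

```python
from collections import defaultdict

def map_partial(records):
    """Map: one worker reduces its own records to a (sum, count) pair per key."""
    partial = defaultdict(lambda: [0.0, 0])
    for region, amount in records:
        partial[region][0] += amount
        partial[region][1] += 1
    return {region: tuple(pair) for region, pair in partial.items()}

def reduce_partials(partials):
    """Reduce: merge the workers' partial results into final totals."""
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for region, (subtotal, count) in partial.items():
            totals[region][0] += subtotal
            totals[region][1] += count
    return {
        region: {"sum": total, "count": count, "average": total / count}
        for region, (total, count) in totals.items()
    }

# Each inner list stands in for the records held by one worker node.
worker_inputs = [
    [("east", 10.0), ("west", 20.0)],
    [("east", 30.0)],
    [("west", 40.0), ("west", 60.0)],
]
summary = reduce_partials(map_partial(records) for records in worker_inputs)
print(summary)
```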
All very useful stuff, but Hadoop’s potential expands significantly when it’s combined with other technologies. “We see Hadoop as a complement,” says Scott VanValkenburgh, Senior Director of Alliance Management at SAS, a developer of business analytics software and services, and also an HP partner. “Getting value from it is based more on building the right architecture, which is most likely a mix of Hadoop alongside other solutions.”
Vendors are reacting quickly to develop new solutions to get more from Hadoop. For example, SAS is using Hadoop to move toward pushing processing into the database, keeping the data and the engines that work with it in closer proximity to speed up processing. HP Vertica recently released a Vertica Connector for Hadoop, which provides seamless, bi-directional integration between Hadoop and Vertica, boosting the speed and volume of data analytics in Vertica.
Where is all this going?
With the current momentum behind Hadoop, there’s no question that it’s here to stay. But it’s best to think of Hadoop as a starting point for interesting developments to come.
“Hadoop is another important tool that can be used for analytics versus being a replacement for existing technologies,” said Mark Troester, SAS Senior Product Marketing Consultant and thought leader strategist. “It complements advanced analytics and allows you to work more efficiently with data than you can with other technologies.”
So, even if you aren’t thinking about starting a Hadoop project, you should probably pay attention to where it’s going. It may light the way for new business models and processing in the future.