In the big data environment, Hadoop is the fast-growing open source batch processing system emerging as the popular choice for companies handling large volumes of multi-structured data. Hadoop excels with its distributed processing framework, and is primarily used for extracting, storing, and processing both structured and unstructured data. However, Hadoop’s downfall is its inability to produce real-time analysis. Typically, once Hadoop has processed the data, the data is then moved into SQL-based environments for real-time analysis. The major disadvantage of this system architecture is the increased costs when data is pushed to SQL-based engines. Hadoop offers full scalability, and if real-time analysis capabilities were integrated, it would become one of the most efficient and cost-effective solutions available to customers. As a result, real-time analysis capabilities in Hadoop represent an increasingly significant opportunity that few vendors currently offer.
If big data technology vendors are able to deliver a real-time query option within Hadoop, it would completely eliminate the need for data to be moved into SQL based environments. Last spring, Cloudera released the Impala query engine, which scales its open source massively parallel processing (MPP) solution across Hadoop, allowing customers to run SQL queries on data stored in Hadoop. This eliminates the need for multiple platforms or sets of data, as everything is done on the same Hadoop database. EMC followed with HAWQ, its SQL querying engine which scales its Greenplum MPP solution across Hadoop. Twitter recently released Summingbird, which merges Hadoop and Storm, Twitter’s open source distributed real-time computation system, in one open source system. This hybrid system combines the best of both worlds – Hadoop processes the large volumes of data while Storm handles the real-time analytics.
The need for these solutions is becoming more apparent, and with the first few solutions already available, others are not far behind. One issue is that most of the current solutions are still slower than queries against relational databases. If a solution becomes available that can run Hadoop queries as quickly as queries against relational databases, the impact on the big data analytical landscape will be significant.
By Sarah Forman
Research Assistant, M2M & Embedded Technology