
Big Data Information Access: Spark, Presto and Apache Hive (Oh My)


As revelations about Cambridge Analytica have come to light, many people have been outraged over just how much of our data large Internet companies can gather -- and potentially mismanage. But big data-oriented information gathering isn't new. Companies have been taking user information to discover patterns, trends, and associations for years. It’s the reason that services like Facebook remain free.

In recent years, big data has gone mainstream -- as companies of all sizes dedicate significant resources toward gathering user information. Still, sifting through that data can be a daunting task.

Tools for Big Data Gathering, Insights

According to new research from Qubole, the “big data-as-a-service company,” more than three-quarters of companies rely on multiple open source engines to gain insights from their data. Of the engines available:

  • Spark and Presto are the fastest growing.
  • Presto usage has surged 420 percent in compute hours, while Spark has grown 365 percent in the total number of commands run.
  • The third largest engine, Apache Hive, also saw growth, with the number of commands increasing 129 percent throughout the year.

Founding Father?

Interestingly, both Presto and Apache Hive were originally created by Facebook. Presto came along after the social media giant realized it needed a quicker and more flexible alternative to Hive. Since it open-sourced the SQL engine, many big tech names have adopted the technology. Companies like Netflix, Uber, Lyft, and Airbnb all use the engine to analyze their own data. (Along with other engines, as the Qubole study shows.)

According to Qubole’s co-founder and CEO, Ashish Thusoo, this demonstrates that more organizations understand the value of data and continue to adopt new technologies and processes in order to find new insights and revenue opportunities. “With three-quarters of businesses actively using multiple engines for their data programs, it’s clear that data activation strategies are becoming more nuanced in matching the best tool for the individual job,” he says.

Growth... And More Growth

The study also revealed that the number of users accessing each platform grew throughout the year, showing that businesses are making the data more accessible and reducing bottlenecks within data teams. Presto saw the largest increase in users running commands on the platform, with 255 percent growth in the past year. Spark and Hadoop saw 171 percent and 136 percent growth, respectively.

Still, plenty of big data challenges remain. Chief among them: a scant nine percent of businesses have woven big data and analytics into their organization’s DNA to the point where it’s central to strategy, decision-making, execution, investments and revenue generation, all supported by an executive advocate, a recent AtScale study found.