Apache Spark: The Future of Big Data Science?

Apache Spark is the go-to tool for Data Science at scale. It is an open source, distributed compute platform and the first tool in the Data Science toolbox built specifically with Data Science in mind. In this blog, I want to talk about why I think Spark is the future of Data Science at scale and why Capgemini are supporting the Spark London Meetup Group.

We all know that data volumes are growing at an alarming rate, and in order to get the best value out of these datasets businesses need to be able to analyze the full breadth and depth of this data. Traditionally, storing it has been achieved with big data platforms and NoSQL datastores such as Hadoop, MongoDB, Elasticsearch and countless others. What has been lacking is the ability to process this data for analytics. Analytics has either been achieved by writing complex MapReduce jobs or by picking particular aspects to analyze with Python or R. This works well in a lot of use cases: typically a machine learning application only needs to be trained on a small part of the data, or the feature engineering and population work reduces the data naturally. However, when the need does arise to work with big datasets (and this is only likely to grow), data science has been at a bit of a loss. That is no longer true with Apache Spark.

I believe that Spark is different from the myriad other solutions to this problem because it allows Data Scientists to write simple code to perform distributed computing, and the functionality available in Spark is growing at an incredible rate. Much has been made in the Data Science community of Spark's ability to train Machine Learning models at scale, and this is a key benefit, but I think the real value comes from being able to put an entire analytics pipeline into Spark: from data ingestion and ETL, through data wrangling and feature engineering, to training and execution of models. What's more, with Spark Streaming and GraphX, Spark can provide a much more complete analytics solution.

Spark 2.0 is already available as a preview and a full release is imminent. It will represent a real step forward: with the unification of Datasets and DataFrames, everything you want to do analytically with DataFrames becomes much faster. This is also true for Spark Streaming, with the "unending DataFrame".

It is for this reason that here at Capgemini we are supporting the London Apache Spark Meetup group. We want to support the development of this key technology because it helps the community and it will help our clients. The meetup group is free for anyone to join and discusses all aspects of Spark: http://www.meetup.com/Spark-London/

If you want to learn more about some of the work we do here at Capgemini, please take a look at my previous blog on integrating machine learning with multiple analytical techniques (http://ow.ly/4ntB3D) and blogs from my colleagues on Machine Learning in the public sector (http://ow.ly/4nuEp0), Network Analytics at Scale (http://bit.ly/29FCkMq) and Data Mining techniques (http://bit.ly/29GrbvJ).

Finally, if you are interested in joining our innovative team please see our job specs: Data Scientist (http://bit.ly/1UWmhwn), Big Data Analytics Architect (http://bit.ly/29RLhTB) and Big Data Engineer (http://bit.ly/1OpX5HV).

Matt Thomson is a senior data scientist at Capgemini.