Tag Archives: Machine Learning

Spark Summit Europe Amsterdam

comSysto at Spark Summit Europe

At the end of October 2015 the first European Spark Summit took place at the Beurs van Berlage center in Amsterdam. It was the third Spark Summit this year dedicated to Apache Spark. Four of comSysto’s engineers traveled to Amsterdam for three intense days of Spark. This post summarizes highlights from the training and talks, as well as some of our general thoughts about Spark.


comSysto at Spark Summit Europe Amsterdam 2015

Keynotes

There were a total of nine keynotes over two days; here are our favorites:

Matei Zaharia, the creator of Spark, gave a state-of-the-union keynote focusing on the rapid adoption and overall growth of Spark as an Apache Foundation project. Spark now has over 600 contributors and is one of the most active Apache projects. 51% of Spark users deploy in the cloud. Python’s popularity as a Spark language grew by 20%, and people are also picking up R as a fourth language choice. The introduction of the new DataFrame API was the main challenge this year; more performance optimizations are coming with Project Tungsten. Zaharia also gave a peek at the upcoming Spark 1.6 features: mainly a type-safe DataFrame API named the Dataset API, the integration of DataFrames into the Spark Streaming and GraphX APIs, and more Tungsten features (in-memory cache, SSD storage).

Martin Odersky gave a keynote on Spark being the “ultimate Scala collections”. Spark is an example of a Scala DSL that defines lazy collection operations and adds pairwise operations (e.g. reduceByKey). Scala will adopt some of these concepts, such as collection views, cacheable collections and pairwise operations on sequences of pairs, as a result of Spark using them extensively. On the other hand, Spark can benefit from Scala’s rich type system as well as the upcoming Spores feature for compile-time checks of closure captures that might get distributed across nodes. There is obviously a lot of exchange between the two communities, from which both can benefit.

Talks

Magellan: Geospatial Analytics on Spark

It is promising to see a library addressing the handling of geospatial data and operations in Spark. There are many libraries available for encoding, parsing and storing geospatial data in various formats; however, when trying to express more advanced operations such as geospatial joins, unions or intersections in a distributed fashion, you were on your own. Spatial operations often involve a join of multiple geospatial layers, which maps well to RDD operations. Magellan provides optimized geospatial predicates and operations on top of Spark’s DataFrame API. For primitive spatial operations it depends on ESRI’s Geometry API, and it aims at implementing the OpenGIS Simple Features for SQL specification.

Streaming Analytics with Spark, Kafka, Cassandra and Akka

Helena Edelson gave a presentation on rethinking classical data processing architectures to meet the flood of data we face today. LinkedIn, for example, generates 2.5 trillion events per day, amounting to 1 petabyte of streaming data. The Lambda Architecture style provides guidelines for handling both batch and stream processing of massive datasets, but implementing it is still hard. Edelson discussed some technology choices for implementing different aspects of Lambda: Spark/Scala for distributed computing, Mesos for cluster resource management, Akka for concurrent and fault-tolerant application logic, Cassandra for distributed data storage and Kafka for real-time ingestion of streaming data: the SMACK stack. In particular, colocating Cassandra and Spark nodes for data locality seems like a good choice. Code for her reference application killrweather can be found on GitHub.

Spark DataFrames

Michael Armbrust from Databricks talked about Spark’s DataFrame API and its integration with Spark ML. A DataFrame is a distributed collection of rows organized into named columns and a unified interface for interacting with data from Scala, Java, Python or R. The main advantage of DataFrames over RDDs is Spark’s ability to optimize program execution: since DataFrames carry more information about the structure of the data, better performance can usually be achieved than with regular RDDs. DataFrame operations are also language agnostic: for example, transformations written in Python no longer have to be shipped to the worker nodes and executed by a slower Python interpreter, but are compiled into the same execution plans as Scala code. Regarding integration with Spark ML, a more streamlined version of MLlib built on top of the DataFrame API was presented. Databricks also introduced the Spark ML Pipeline abstraction: a practical machine learning pipeline often involves a sequence of data pre-processing, feature extraction, model fitting and validation stages, which previously had to be wired up manually and was error prone. Spark ML Pipelines provide an abstraction for those common data processing steps. It is nice to see that the programming interface has matured, and we think we will see plenty of new features in the upcoming releases.
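To give an impression of what the Pipeline abstraction looks like in code, here is a minimal, hypothetical sketch over a toy DataFrame; the column names and data are made up, and it assumes a Spark 1.5-era spark-shell session where sc and sqlContext are predefined:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

import sqlContext.implicits._

// Toy DataFrame standing in for pre-computed trip features (entirely made up).
val trips = Seq(
  ("driverA", 13.4, 2.1),
  ("driverB", 22.0, 3.5)
).toDF("driver", "avgSpeed", "maxAccel")

// Pre-processing stages: index the label column, assemble feature columns into a vector.
val indexer = new StringIndexer().setInputCol("driver").setOutputCol("label")
val assembler = new VectorAssembler()
  .setInputCols(Array("avgSpeed", "maxAccel"))
  .setOutputCol("features")

// Model fitting stage.
val lr = new LogisticRegression().setMaxIter(10)

// The Pipeline chains pre-processing, feature assembly and model fitting into one estimator.
val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))
val model = pipeline.fit(trips)      // fit all stages on the training data
val scored = model.transform(trips)  // apply the fitted stages to (new) data

The point of the abstraction is that all stages are declared once and then fitted or applied as a single unit, instead of hand-wiring each pre-processing step.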

Productionizing Spark and the Spark Job Server

The talk by Evan Chan focused on setting up and tuning Spark clusters and how to avoid common pitfalls: from choosing the right cluster mode to debugging Spark applications and collecting Spark context metrics. Another step towards making Spark production-ready is the Spark Job Server, which turns a Spark cluster into a “cluster as a service” by adding a REST management interface. Spark Job Server provides its own metadata store for storing and sharing jobs, configurations and job jars. It sits on top of your streaming or batch workloads and manages jobs and Spark contexts for you. Since the Job Server creates the context, an existing Spark context can be re-used or a new one can be created, allowing for low-latency queries and RDD sharing among jobs. Security, authentication and all cluster managers are supported. Spark Job Server has also found its way into the latest DataStax Enterprise distribution.

Spark Training

On the first day Databricks offered four parallel training sessions on Spark. We chose the “Data Science with Apache Spark” training by Jon Bates, since our main use cases include exploratory data analysis and machine learning. Offering a training at that scale (hundreds of participants) is definitely a challenge, but it was well executed. Databricks provided access to their cloud platform for all participants, which gave everyone the opportunity to use browser-based “notebooks” for exploring and executing lab code against their own Spark clusters in the cloud (AWS). Compared to small-scale trainings there were obviously fewer opportunities to ask questions, and the pace of the presentation and the amount of material were tremendous: there was a lot to digest. However, the quality of the tutorial content and the opportunity to continue using the platform for some weeks after the training made up for that.

Conclusion

Spark is a promising tool for handling all kinds of large-scale data processing tasks, which are getting more and more common at companies across all industries. IBM calls Spark “Potentially the Most Significant Open Source Project of the Next Decade” and is committing to it by investing $300 million over the next few years and assigning more than 3,500 researchers and developers to Spark-related projects. Microsoft, for instance, is using Spark and Cassandra to process over 10 TB of event data per day from its Office 365 products. The diverse ecosystem of languages and tools offered by Spark is definitely a unique feature, making the switch from exploratory data analysis to application development a lot smoother. Deploying complete stacks (such as SMACK) on a computing cluster or in the cloud still seems challenging at the moment. The current focus lies on exploratory tools (notebooks) and languages (Python) tailored towards data scientists, as well as on deployment topics. Discussions on developing full-stack applications and integrating Spark into existing systems, however, are still rare.

At comSysto we explore Spark during our labs, at data science challenges and by implementing prototypes. For data-intensive projects and for implementing Lambda Architectures we currently regard Spark as one of the primary options.

You want to shape a fundamental change in dealing with data in Germany? Then join our Big Data Community Alliance!

Machine Learning with Spark: Kaggle’s Driver Telematics Competition

Do you want to learn how to apply high-performance distributed computing to real-world machine learning problems? Then this article on how we used Apache Spark to participate in an exciting Kaggle competition might be of interest.

The Lab

At comSysto we regularly engage in labs, where we assess emerging technologies and share our experiences afterwards. While we were planning our next lab, kaggle.com came out with an interesting data science challenge:

AXA has provided a dataset of over 50,000 anonymized driver trips. The intent of this competition is to develop an algorithmic signature of driving type. Does a driver drive long trips? Short trips? Highway trips? Back roads? Do they accelerate hard from stops? Do they take turns at high speed? The answers to these questions combine to form an aggregate profile that potentially makes each driver unique. [1]

We signed up for the competition to take our chances and to get more hands-on experience with Spark. For more information on how Kaggle works check out their data science competitions.

This first post describes our approach to exploring the data set, the feature extraction process we used, and how we identified drivers given the features. We mostly used APIs and libraries provided by Spark. Spark is a “fast and general computation engine for large scale data processing” that provides APIs for Python, Scala, Java and, most recently, R, as well as an interactive REPL (spark-shell). What makes Spark attractive is the proposition of a “unified stack” that covers multiple processing models on a local machine or a cluster: batch processing, streaming data, machine learning, graph processing, SQL queries and interactive ad-hoc analysis.
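As a small illustration of the interactive workflow we used throughout the lab, the following spark-shell snippet loads a single trip file of the competition data into an RDD of coordinate pairs; the path is a placeholder, and each file starts with an x,y header followed by one coordinate pair per second:

// Inside spark-shell, where sc (the SparkContext) is already available.
val lines = sc.textFile("drivers/1/1.csv")        // placeholder path to one trip file

val points = lines
  .filter(line => !line.startsWith("x"))          // drop the "x,y" header line
  .map { line =>
    val Array(x, y) = line.split(",").map(_.toDouble)
    (x, y)
  }

println(s"Trip duration in seconds: ${points.count()}")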

For computations on the entire data set we used a comSysto cluster with 3 nodes of 8 cores (i7) and 16GB RAM each, providing us with 24 cores and 48GB RAM in total. The cluster runs the MapR Hadoop distribution with the MapR-provided Spark libraries. The main advantage of this setup is a high-performance file system (mapr-fs), which also offers regular NFS access. For more details on the technical insights and challenges, stay tuned for the second part of this post.

Telematic Data

Let’s look at the data provided for the competition. We first expected the data to contain various features about drivers and their trips, but the raw data only contained pairs of anonymized coordinates (x, y) of a trip, e.g. (1.3, 4.4), (2.1, 4.8), (2.9, 5.2), … The trips were re-centered to the same origin (0, 0) and randomly rotated around the origin (see Figure 1).

Figure 1: Anonymized driver data from Kaggle’s Driver Telematics competition [1]

At this point our enthusiasm suffered a little setback: how should we identify a driver simply by looking at anonymized trip coordinates?

Defining a Telematic Fingerprint

It seemed that if we wanted useful and significant machine learning data, we would have to derive it ourselves from the provided raw data. Our first approach was to establish a “telematic fingerprint” for each driver. This fingerprint was composed of a list of features that we found meaningful and distinguishing. In order to get the driver’s fingerprint we used the following features:

Distance: The sum of the Euclidean distances between every two consecutive coordinates.

Absolute distance: The Euclidean distance between the first and last point.

Trip’s total time stopped: The total time during which the driver was stopped.

Trip’s total time: The total number of entries for a certain trip (if we assume that every trip’s coordinates are recorded once per second, the number of entries in a trip equals the duration of that trip in seconds).

Speed: To calculate the speed at a certain point, we took the Euclidean distance between that coordinate and the previous one. Assuming that the coordinate units are meters and that the entries are recorded at a frequency of one second, this yields a value in m/s. The actual unit is, however, irrelevant, since we do not interpret the values semantically but only compare them with other drivers/trips. For the speed we stored the 10th, 25th, 50th, 80th and 98th percentiles; we did the same for acceleration, deceleration and centripetal acceleration.

Acceleration: The difference between the speed at one coordinate and the speed at the previous one (while the speed is increasing).

Deceleration: The difference between the speed at one coordinate and the speed at the previous one (while the speed is decreasing).

Centripetal acceleration: We used the formula a = v² / r, where v is the speed and r is the radius of the circle that the turning path would form. We already have the speed at every point, so the only missing piece is the radius. To calculate the radius we take the current, previous and subsequent points (coordinates). This feature is an indicator of “aggressiveness” in driving style: a high average centripetal acceleration indicates turning at higher speeds. A sketch of these per-point computations is shown below.
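The per-point computations behind these features are straightforward. The following is a minimal Scala sketch (not our actual lab code) that derives speeds, accelerations and centripetal accelerations from a single trip, using the circumradius of three consecutive points as the turning radius:

type Point = (Double, Double)

def distance(a: Point, b: Point): Double =
  math.hypot(b._1 - a._1, b._2 - a._2)

// Speeds between consecutive samples; with one sample per second this is "units per second".
def speeds(trip: Seq[Point]): Seq[Double] =
  trip.sliding(2).collect { case Seq(a, b) => distance(a, b) }.toSeq

// Acceleration/deceleration: differences between consecutive speeds.
def accelerations(trip: Seq[Point]): Seq[Double] =
  speeds(trip).sliding(2).collect { case Seq(v1, v2) => v2 - v1 }.toSeq

// Radius of the circle through three consecutive points (circumradius r = abc / 4A).
def circumradius(p1: Point, p2: Point, p3: Point): Double = {
  val (a, b, c) = (distance(p1, p2), distance(p2, p3), distance(p1, p3))
  val area = 0.5 * math.abs((p2._1 - p1._1) * (p3._2 - p1._2) - (p3._1 - p1._1) * (p2._2 - p1._2))
  if (area == 0) Double.PositiveInfinity else (a * b * c) / (4 * area)
}

// Centripetal acceleration v^2 / r at every interior point of the trip.
def centripetalAccelerations(trip: Seq[Point]): Seq[Double] =
  trip.sliding(3).zip(speeds(trip).iterator).collect {
    case (Seq(p1, p2, p3), v) => (v * v) / circumradius(p1, p2, p3)
  }.toSeq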

From all derived features we computed a driver profile (“telematic fingerprint”) over all trips of that driver. From experience we know that the average speed differs between driving in the city and driving on the highway, so the average speed over all trips of a driver may not reveal much by itself. For better results we would need to map trip features such as average speed or maximum speed to different trip types like inner-city trips, long-distance highway trips, rural road trips, etc.

Data Statistics: Around 2700 drivers with 200 trips each, resulting in about 540,000 trips. All trips together contain 360 million X/Y coordinates, which means – as they are tracked per second – we have 100,000 hours of trip data.

Machine Learning

After the initial data preparation and feature extraction we could turn towards selecting and testing machine learning models for driver prediction.

Clustering

The first task was to categorize the trips: we decided to use an automated clustering algorithm (k-means) to build categories that should reflect the different trip types. The categories were derived from all trips of all drivers, which means they are not specific to a certain driver. A first look at the extracted features and computed categories revealed that some of the categories indeed depend on the trip length, which is an indicator of the trip type. Based on the cross-validation results we decided to use 8 categories for our final computations. The computed cluster IDs were added to the features of every trip and used for further analysis.
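Sketched with Spark’s RDD-based MLlib API, the clustering step looks roughly like this; the features file path and its layout are placeholders:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// One line per trip, containing only numerical feature columns (placeholder path/layout).
val tripFeatures = sc.textFile("maprfs:///features/trip-features.csv")
  .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
  .cache()

// Train k-means with the 8 categories we settled on and a fixed iteration budget.
val kmeansModel = KMeans.train(tripFeatures, 8, 20)

// Assign a cluster ID to every trip; these IDs are appended to the feature set.
val clusterIds = tripFeatures.map(kmeansModel.predict)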

Prediction

For the driver prediction we used a Random Forest algorithm to train a model for each driver, which can predict the probability that a given trip (identified by its features) belongs to a specific driver. The first task was to build a training set. This was done by taking all (around 200) trips of a driver and labeling them with “1” (match), then randomly choosing (also about 200) trips of other drivers and labeling them with “0” (no match). This training set is then fed into the Random Forest training algorithm, which results in a Random Forest model for each driver. Afterwards the model was used for cross-validation (i.e. evaluating the error rate on an unseen test data set) and to compute the submission for the Kaggle competition. Based on the cross-validation results we decided to use 10 trees and a maximum tree depth of 12 for the Random Forest model (with 23 features).
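A condensed sketch of the per-driver training step with MLlib’s Random Forest is shown below; the labeled RDD of training examples is assumed to be prepared as described above, and the parameter values are the ones mentioned in the text:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

// labeled: RDD[LabeledPoint] with label 1.0 for the driver's own trips and 0.0 for
// randomly sampled trips of other drivers (assumed to be built beforehand).
val Array(training, test) = labeled.randomSplit(Array(0.8, 0.2))

val numClasses = 2
val categoricalFeaturesInfo = Map.empty[Int, Int]  // all features are continuous
val numTrees = 10
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 12
val maxBins = 32

val forest = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Simple hold-out error estimate on the unseen test split.
val testError = test.map(p => if (forest.predict(p.features) == p.label) 0.0 else 1.0).mean()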

An interesting comparison between the different ensemble learning algorithms for prediction (Random Forest and Gradient-Boosted Trees (GBT) from Spark’s Machine Learning Library (MLlib)) can be found on the Databricks Blog.

Pipeline

Our workflow is split into several self-contained steps implemented as small Java applications that can be directly submitted to Spark via the “spark-submit” command. We used Hadoop sequence files and CSV files for input and output. The steps are as follows:


Figure 2: ML pipeline for predicting drivers

Converting the raw input files: We are faced with about 550,000 small CSV files, each containing a single trip of one driver. Loading all these files for each run of our model would be a major performance issue, therefore we converted all input files into a single Hadoop sequence file which is served from the mapr-fs file system.
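A minimal sketch of this conversion step (paths and the key format are placeholders; the competition data is laid out as drivers/<driverId>/<tripId>.csv):

// wholeTextFiles yields (filePath, fileContent) pairs, one per small CSV file.
val rawTrips = sc.wholeTextFiles("maprfs:///drivers/*")

val keyedTrips = rawTrips.map { case (path, content) =>
  // e.g. ".../drivers/1/12.csv" becomes the key "1/12"
  val key = path.split("/").takeRight(2).mkString("/").stripSuffix(".csv")
  (key, content)
}

// Pack everything into a single Hadoop sequence file on mapr-fs.
keyedTrips.saveAsSequenceFile("maprfs:///input/trips-sequence")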

Extracting the features and computing statistics: We load the trip data from the sequence file, compute all the features described above as well as statistics such as variance and mean of features using the Spark RDD transformation API and write the results to a CSV file.

Computing the clusters: We load the trip features and statistics and use the Spark MLlib API to compute the clusters that categorize the trips using k-means. The features CSV is enriched with the clusterID for each trip.

Random Forest training: For the actual model training we load the features for each trip together with some configuration values for the model parameters (e.g. maxDepth, crossValidation) and start a Random Forest model training for each driver with labeled training data and optional test data for cross-validation analysis. We serialize each Random Forest model to disk using Java serialization. In its current version, Spark provides native saving and loading of model instances, as well as the option to configure alternative serialization strategies.

For the actual Kaggle submission we simply load the serialized models, predict the likelihood of each trip belonging to that driver and save the result in the required CSV format.
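A sketch of this final step, assuming the models were written with plain Java serialization and that the trips to score are available as an RDD of (tripId, featureVector) pairs; all names and paths are placeholders, and averaging the individual tree votes is just one way to turn the forest’s 0/1 prediction into a likelihood:

import java.io.{FileInputStream, ObjectInputStream}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel

// Load a previously serialized model from local disk (placeholder path).
def loadModel(path: String): RandomForestModel = {
  val in = new ObjectInputStream(new FileInputStream(path))
  try in.readObject().asInstanceOf[RandomForestModel] finally in.close()
}

val driverId = 1
val model = loadModel(s"models/driver_$driverId.model")

// driverTrips: RDD[(String, Vector)] with one entry per trip of this driver (assumed to exist).
val submissionLines = driverTrips.map { case (tripId, features) =>
  // Average the votes of the individual trees to get a rough likelihood instead of
  // the plain 0/1 class returned by model.predict.
  val score = model.trees.map(_.predict(features)).sum / model.numTrees
  s"${driverId}_$tripId,$score"
}

submissionLines.saveAsTextFile(s"submission/driver_$driverId")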

Results and Conclusions

This blog post describes our approach and methodology for solving the Kaggle driver competition using Apache Spark. Our prediction model based on Random Forest decision trees was able to predict the driver with an accuracy of around 74 percent, which placed us at position 670 on the Kaggle leaderboard at the time of submission. Not bad for two days of work; however, there are many possible improvements we identified during the lab.

To learn more about the implementation details, technical challenges and lessons learned regarding Spark, stay tuned for the second part of this post.

You want to shape a fundamental change in dealing with data in Germany? Then join our Big Data Community Alliance!

Sources:
1. https://www.kaggle.com/c/axa-driver-telematics-analysis

Taking part in a Kaggle competition – Our experience

As part of one of comSysto’s labs (three days of focused work on a topic that we’re interested in), four of us took part in the Kaggle competition “Personalize Expedia Hotel Searches”. For this competition we applied machine learning algorithms to rank hotels in order to maximize purchases on Expedia. This blog post describes our approach as well as our experience. Continue reading