
Spark Summit Europe Amsterdam

comSysto at Spark Summit Europe

At the end of October 2015 the first European Spark Summit took place at the Beurs van Berlage center in Amsterdam. The conference was the third of its kind this year dedicated to Apache Spark. Four of comSysto’s engineers traveled to Amsterdam for three intense days of Spark. This post summarizes highlights from the training and talks, as well as some of our general thoughts about Spark.


comSysto at Spark Summit Europe Amsterdam 2015

Keynotes

There were nine keynotes over two days. Here are our favorites:

Matei Zaharia, the creator of Spark, gave a state-of-the-union keynote focusing on the rapid adoption and overall growth of Spark as an Apache Software Foundation project. Spark now has over 600 contributors and is one of the most active Apache projects. 51% of Spark users are deploying in the cloud. The popularity of Python as a Spark language grew by 20%, and people are also picking up R as a fourth language choice. Introducing the new DataFrame API was the main challenge this year, and more performance optimizations are coming with Project Tungsten. Zaharia also gave a peek at the upcoming Spark 1.6 features: mainly a type-safe DataFrame API named the Dataset API, the integration of DataFrames into the Spark Streaming and GraphX APIs, and more Tungsten features (in-memory cache, SSD storage).

Martin Odersky gave a keynote on Spark being the “ultimate Scala collections”. Spark is an example of a Scala DSL that defines lazy collection operations and adds pairwise operations (e.g. reduceByKey). Because Spark uses them so extensively, Scala will adopt some of these concepts, such as collection views, cacheable collections and pairwise operations on sequences of pairs. Spark, in turn, can benefit from Scala’s rich type system as well as the upcoming Spores feature, which checks at compile time what a closure captures before it is distributed across nodes. There is obviously a lot of exchange between the two communities, and both can benefit from it.

Talks

Magellan: Geospatial Analytics on Spark

It is promising to see a library addressing the handling of geospatial data and operations in Spark. Many libraries are available for encoding, parsing and storing geospatial data in various formats. However, when trying to express more advanced operations such as geospatial joins, unions or intersections in a distributed fashion, you were on your own. Spatial operations often involve a join of multiple geospatial layers, which maps well to RDD operations. Magellan provides optimized geospatial predicates and operations on top of Spark’s DataFrame API. For primitive spatial operations it depends on ESRI’s Geometry API, and it aims at implementing the OpenGIS Simple Features for SQL API.

Streaming Analytics with Spark, Kafka, Cassandra and Akka

Helena Edelson gave a presentation on rethinking classical data processing architectures to cope with today’s flood of data. LinkedIn, for example, generates 2.5 trillion events per day, amounting to 1 petabyte of streaming data. The Lambda Architecture style provides guidelines for handling both batch and stream processing of massive datasets, but implementing it is still hard. Edelson discussed some technology choices for implementing different aspects of Lambda: Spark/Scala for distributed computing, Mesos for cluster resource management, Akka for concurrent and fault-tolerant application logic, Cassandra for distributed data storage and Kafka for real-time ingestion of streaming data. Together they form the SMACK stack. Colocating Cassandra and Spark nodes for data locality seems like an especially good choice. The code for her reference application killrweather can be found on GitHub.

Spark DataFrames

Michael Armbrust from Databricks talked about Spark’s DataFrame API and its integration with Spark ML. A DataFrame is a distributed collection of rows organized into named columns and a unified interface for interacting with data in Scala, Java, Python or R. The main advantage of DataFrames over RDDs is Spark’s ability to optimize program execution. Since DataFrames carry more information about the structure of the data, the optimizer can usually achieve better performance than with regular RDDs. DataFrame operations are also language agnostic: a query written in Python is compiled to the same execution plan as its Scala equivalent, so the work is no longer shipped to the worker nodes as Python functions and executed in a slower Python interpreter. Regarding integration with Spark ML, a more streamlined version of MLlib built on top of the DataFrame API was presented. Databricks also introduced the Spark ML Pipeline abstraction: a practical machine learning pipeline often involves a sequence of data pre-processing, feature extraction, model fitting and validation stages. Previously these had to be wired together manually, which was error prone. Spark ML Pipelines provide an abstraction for those common data processing steps. It is nice to see that the programming interface has matured, and we expect plenty of new features in the upcoming releases.

Productionizing Spark and the Spark Job Server

The talk by Evan Chan focused on setting up and tuning Spark clusters and how to avoid common pitfalls: from choosing the right cluster mode to debugging Spark applications and collecting Spark context metrics. Another step towards making Spark production ready is using the Spark Job Server, which turns a Spark cluster into a “cluster as a service” by adding a REST management interface. Spark Job Server provides its own metadata store for storing and sharing jobs, configurations and job jars. It sits on top of your streaming or batch workloads and manages jobs and Spark contexts for you. Since the Job Server is creating the context, an existing Spark context can be re-used or a new one can be created, allowing for low latency queries and RDD sharing among jobs. Security, Authentication and all cluster managers are supported. Spark Job Server also found its way into the latest DataStax Enterprise distribution.

Spark Training

On the first day Databricks offered four training sessions on Spark in parallel. We chose the “Data Science with Apache Spark” training by Jon Bates since our main use cases include exploratory data analysis and machine learning. Offering a training at that scale (hundreds of participants) is definitely a challenge, but it was well executed. Databricks provided access to their cloud platform for all participants, which gave everyone the opportunity to use browser-based “notebooks” for exploring and executing lab code against their own Spark clusters in the cloud (AWS). Compared to small-scale trainings there were obviously fewer opportunities to ask questions, and both the pace of the presentation and the amount of material were tremendous: there was a lot to digest. However, the quality of the tutorial content and the opportunity to continue using the platform for some weeks after the training made up for that.

Conclusion

Spark is a promising tool for handling all kinds of large-scale data processing tasks, which are becoming more and more common at companies across all industries. IBM calls Spark “Potentially the Most Significant Open Source Project of the Next Decade” and is committing to it by investing $300 million over the next few years and assigning more than 3,500 researchers and developers to Spark-related projects. Microsoft, for instance, uses Spark and Cassandra to process over 10TB of event data per day from its Office 365 products. The diverse ecosystem of languages and tools offered by Spark is a distinguishing feature that makes the switch from exploratory data analysis to application development a lot smoother. Deploying complete stacks (such as SMACK) on a computing cluster or in the cloud still seems challenging at the moment. The current focus lies on exploratory tools (notebooks) and languages (Python) tailored towards data scientists, as well as on deployment topics. Discussions on developing full-stack applications and integrating Spark into existing systems, however, are still rare.

At comSysto we explore Spark during our labs, at data science challenges and by implementing prototypes. For data intensive projects and for implementing lambda architectures we currently regard Spark as one of the primary options.

Do you want to help shape a fundamental change in how data is handled in Germany? Then join our Big Data Community Alliance!

Machine Learning with Spark: Kaggle’s Driver Telematics Competition

Do you want to learn how to apply high-performance distributed computing to real-world machine learning problems? Then this article on how we used Apache Spark to participate in an exciting Kaggle competition might be of interest.

The Lab

At comSysto we regularly engage in labs, where we assess emerging technologies and share our experiences afterwards. While planning our next lab, kaggle.com came out with an interesting data science challenge:

AXA has provided a dataset of over 50,000 anonymized driver trips. The intent of this competition is to develop an algorithmic signature of driving type. Does a driver drive long trips? Short trips? Highway trips? Back roads? Do they accelerate hard from stops? Do they take turns at high speed? The answers to these questions combine to form an aggregate profile that potentially makes each driver unique. [1]

We signed up for the competition to take our chances and to get more hands-on experience with Spark. For more information on how Kaggle works, check out their data science competitions.

This first post describes how we explored the data set, the feature extraction process we used and how we identified drivers given the features. We mostly used APIs and libraries provided by Spark. Spark is a “fast and general computation engine for large scale data processing” that provides APIs for Python, Scala, Java and most recently R, as well as an interactive REPL (spark-shell). What makes Spark attractive is the proposition of a “unified stack” that covers multiple processing models on a local machine or a cluster: batch processing, streaming data, machine learning, graph processing, SQL queries and interactive ad-hoc analysis.

For computations on the entire data set we used a comSysto cluster of 3 nodes with 8 cores (i7) and 16GB RAM each, providing us with 24 cores and 48GB RAM in total. The cluster runs the MapR Hadoop distribution with the Spark libraries provided by MapR. The main advantage of this setup is a high-performance file system (mapr-fs) which also offers regular NFS access. For more details on the technical insights and challenges, stay tuned for the second part of this post.

Telematic Data

Let’s look at the data provided for the competition. At first we expected the data to contain different features regarding drivers and their trips, but the raw data only contained the anonymized coordinate pairs (x, y) of each trip, e.g. (1.3, 4.4), (2.1, 4.8), (2.9, 5.2), … The trips were re-centered to the same origin (0, 0) and randomly rotated around the origin (see Figure 1).

Figure 1: Anonymized driver data from Kaggle’s Driver Telematic competition [1]

At this point our enthusiasm suffered a small setback: how should we identify a driver simply by looking at anonymized trip coordinates?

Defining a Telematic Fingerprint

It seemed that if we wanted useful and significant machine learning data, we would have to derive it ourselves using the provided raw data. Our first approach was to establish a “telematic fingerprint” for each driver. This fingerprint was composed of a list of features that we found meaningful and distinguishing. In order to get the driver’s fingerprint we used the following features:

Distance: The summation of all the euclidean distances between every two consecutive coordinates.

Absolute Distance: The euclidean distance between the first and last point.

Trip’s total time stopped: The total time that the driver has stopped.

Trip’s total time: The total number of entries for a certain trip (if we assume that every trip’s records are recorded every second, the number of entries in a trip would equal the duration of that trip in seconds)

Speed: To calculate the speed at a certain point, we calculated the euclidean distance between one coordinate and the previous one. Assuming that the coordinates are given in meters and that one entry is recorded per second, the result is in m/s. The actual unit is irrelevant, though, since we do not do any semantic analysis on it and only compare it with other drivers and trips. For the speed we stored the 10th, 25th, 50th, 80th and 98th percentiles. We did the same for acceleration, deceleration and centripetal acceleration.

Acceleration: We set the acceleration to the difference between the speed at one coordinate and the speed at the previous one (when we are increasing speed).

Deceleration: We set the deceleration to the difference between the speed at one coordinate and the speed at the previous one (when we are decreasing speed).

Centripetal acceleration: We used the formula:

a = v² / r

where v is the speed and r is the radius of the circle that the turning curve path would form. We already have the speed at every point so the only thing that is missing is the radius. For calculating the radius we take the current, previous and subsequent points (coordinate). This feature is an indicator of “aggressiveness” in driving style: high average of centripetal acceleration indicates turning at higher speeds.
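The features above can be sketched in plain Java. This is only an illustration under the same assumptions as in the text (one sample per second, coordinates in meters), not the code we ran on the cluster. The class and method names are hypothetical, the radius is taken as the circumradius of the triangle formed by the three points, and the nearest-rank percentile is an assumed choice.

```java
import java.util.Arrays;

class TripFeatures {

    /** Euclidean distance between two (x, y) points. */
    static double dist(double[] a, double[] b) {
        return Math.hypot(b[0] - a[0], b[1] - a[1]);
    }

    /** Distance: sum of the distances between consecutive points. */
    static double pathLength(double[][] trip) {
        double sum = 0.0;
        for (int i = 1; i < trip.length; i++) {
            sum += dist(trip[i - 1], trip[i]);
        }
        return sum;
    }

    /** Absolute distance: straight line from the first to the last point. */
    static double absoluteDistance(double[][] trip) {
        return dist(trip[0], trip[trip.length - 1]);
    }

    /** Speed per step; equals the step length when sampling once per second. */
    static double[] speeds(double[][] trip) {
        double[] v = new double[trip.length - 1];
        for (int i = 1; i < trip.length; i++) {
            v[i - 1] = dist(trip[i - 1], trip[i]);
        }
        return v;
    }

    /** Radius of the circle through three points; infinite if they are collinear. */
    static double circumradius(double[] p, double[] q, double[] r) {
        double a = dist(p, q);
        double b = dist(q, r);
        double c = dist(p, r);
        double cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0]);
        double area = Math.abs(cross) / 2.0;
        if (area == 0.0) {
            return Double.POSITIVE_INFINITY;
        }
        return a * b * c / (4.0 * area);
    }

    /** Centripetal acceleration a = v^2 / r; 0 on a straight segment (infinite radius). */
    static double centripetalAcceleration(double v, double radius) {
        return v * v / radius;
    }

    /** Nearest-rank percentile (an assumed definition), p in 1..100. */
    static double percentile(double[] values, int p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

Acceleration and deceleration follow the same pattern: take the differences between consecutive entries of speeds(...) and keep the positive or the negative ones respectively.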

From all derived features we computed a driver profile (“telematic fingerprint”) over all trips of that driver. From experience we know that the average speed differs between driving in the city and driving on the highway. The average speed over all trips of a driver therefore probably does not reveal much. For better results we would need to map trip features such as average speed or maximum speed to different trip types like inner city trips, long distance highway trips, rural road trips, etc.

Data Statistics: Around 2700 drivers with 200 trips each, resulting in about 540,000 trips. All trips together contain 360 million X/Y coordinates, which means – as they are tracked per second – we have 100,000 hours of trip data.

Machine Learning

After the initial data preparation and feature extraction we could turn towards selecting and testing machine learning models for driver prediction.

Clustering

The first task was to categorize the trips: we decided to use an automated clustering algorithm (k-means) to build categories which should reflect the different trip types. The categories were derived from all trips of all drivers, which means they are not specific to a certain driver. A first look at the extracted features and computed categories revealed that some of the categories are indeed dependent on the trip length, which is an indicator for the trip type. From the cross validation results we decided to use 8 categories for our final computations. The computed cluster IDs were added to the features of every trip and used for further analysis.

Prediction

For the driver prediction we used a Random Forest algorithm to train a model for each driver, which can predict the probability of a given trip (identified by its features) belonging to a specific driver. The first task was to build a training set. This was done by taking all (around 200) trips of a driver and labeling them with “1” (match), and then randomly choosing about 200 trips of other drivers and labeling them with “0” (no match). This training set is then fed into the Random Forest training algorithm, which results in a Random Forest model for each driver. Afterwards the model was used for cross validation (i.e. evaluating the error rate on an unseen test data set) and to compute the submission for the Kaggle competition. From the cross validation results we decided to use 10 trees and a maximum tree depth of 12 for the Random Forest model (having 23 features).
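The construction of the per-driver training set can be sketched as follows. This is a minimal plain-Java illustration of the labeling scheme described above, not our Spark job: the class TrainingSetBuilder, its Example type and the map of feature vectors per driver are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

class TrainingSetBuilder {

    /** One labeled example: 1.0 = trip of the target driver, 0.0 = trip of another driver. */
    static final class Example {
        final double label;
        final double[] features;

        Example(double label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    /**
     * Builds a balanced training set for one driver: all of the driver's own
     * trips as positives plus an equally sized random sample of other drivers'
     * trips as negatives.
     */
    static List<Example> build(int driverId,
                               Map<Integer, List<double[]>> tripsByDriver,
                               Random rnd) {
        List<Example> examples = new ArrayList<>();
        List<double[]> own = tripsByDriver.get(driverId);
        for (double[] features : own) {
            examples.add(new Example(1.0, features));
        }

        // Pool the trips of all other drivers and shuffle them.
        List<double[]> others = new ArrayList<>();
        for (Map.Entry<Integer, List<double[]>> entry : tripsByDriver.entrySet()) {
            if (entry.getKey() != driverId) {
                others.addAll(entry.getValue());
            }
        }
        Collections.shuffle(others, rnd);

        // Take as many negatives as there are positives (or all, if fewer exist).
        int negatives = Math.min(own.size(), others.size());
        for (double[] features : others.subList(0, negatives)) {
            examples.add(new Example(0.0, features));
        }
        return examples;
    }
}
```

Passing a seeded Random keeps the sampled negatives reproducible between runs, which makes cross validation results easier to compare.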

An interesting comparison between the different ensemble learning algorithms for prediction (Random Forest and Gradient-Boosted Trees (GBT) from Spark’s Machine Learning Library (MLlib)) can be found on the Databricks Blog.

Pipeline

Our workflow is split into several self-contained steps implemented as small Java applications that can be submitted directly to Spark via the “spark-submit” command. We used Hadoop Sequence files and CSV files for input and output. The steps are as follows:


Figure 2: ML pipeline for predicting drivers

Converting the raw input files: We are faced with about 550,000 small CSV files each containing a single trip of one driver. Loading all the files for each run of our model can be a major performance issue, therefore we converted all input files into a single Hadoop Sequence file which is served from the mapr-fs file system.

Extracting the features and computing statistics: We load the trip data from the sequence file, compute all the features described above as well as statistics such as variance and mean of features using the Spark RDD transformation API and write the results to a CSV file.

Computing the clusters: We load the trip features and statistics and use the Spark MLlib API to compute the clusters that categorize the trips using k-means. The features CSV is enriched with the clusterID for each trip.

Random Forest Training: For the actual model training we load the features for each trip together with some configuration values for the model parameters (e.g. maxDepth, crossValidation) and start a Random Forest model training for each driver with labeled training data and optional test data for cross validation analysis. We serialize each Random Forest model to disk using Java serialization. In its current version Spark provides native saving and loading of model result instances, as well as options for configuring alternative serialization strategies.

For the actual Kaggle submission we simply load the serialized models, predict the likelihood of each trip belonging to that driver and save the result in the required CSV format.

Results and Conclusions

This blog post describes our approach and methodology for solving the Kaggle Driver Competition using Apache Spark. Our prediction model based on Random Forest decision trees was able to predict the driver with an accuracy of around 74 percent, which placed us at position 670 on the Kaggle leaderboard at the time of submission. Not bad for 2 days of work, although we identified many possible improvements during the lab.

To learn more about the implementation details, technical challenges and lessons learned regarding Spark stay tuned for the second part of this post.


Sources:
1. https://www.kaggle.com/c/axa-driver-telematics-analysis

Combining Logstash and Graylog for Log Management

A brief, incomplete overview

When working in a classic IT infrastructure you often face the problem that developers only have access to test or development environments, but not to production. To fix bugs or to get a glance at the system running in production, they need access to the log files, which security requirements often rule out. As a result, the operations team has to provide these files to the developers, which can take a while.

A solution to these problems is to provide a log management server and grant the developers access via a UI. Besides commercial tools like Splunk, the de facto market leader in this area, there are some quite promising open source solutions that scale very well and may provide enough features to get the job done.

The advantage of using open source technology is that you can – but do not have to – buy subscriptions. Furthermore, software like Splunk and Log Analysis has pricing plans that depend on the volume of logs you ship daily. You therefore pay more whenever the log volume increases, whether because the log level was raised to help analyze bugs in production or simply because more services are deployed.

Last but not least, there are of course cloud solutions like Loggly. You can basically ship your log events to a cloud service, which then takes care of the rest. You do not have to provide any infrastructure yourself. This is a very good solution unless the security policy of your organization prohibits shipping data to the cloud.

Of course this overview is incomplete. I just picked some tools for a brief introduction. If you think something is missing, feel free to blog or comment about it.

Open Source Log Management

The famous ELK-Stack

At the moment, probably the best-known open source log management solution is the ELK-Stack. It is called a stack because it is not one software package but a combination of well-known open source tools. The components are:

  • Elasticsearch is a document oriented database optimized for searching. It is easily scalable and can manage a huge amount of data.
  • Logstash is a log forwarder with many features. It offers many types of inputs, filters and outputs, and it can handle a range of codecs, such as JSON.
  • Finally, Kibana is the UI where you can view the log entries and create very sophisticated and colorful dashboards.

For all its strengths, the ELK-Stack has some drawbacks that make it a less than optimal choice under some circumstances.

Kibana has no user management. If you want user management you have to purchase commercial support from Elastic to get a license for Shield.

Next, there is no housekeeping for the Elasticsearch database. Logstash creates an index for each day, and you have to remove it manually once you no longer need it.
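What such manual housekeeping has to decide can be sketched in a few lines of Java. The example assumes the default Logstash index naming scheme (logstash-YYYY.MM.dd) and only selects the indices that are older than a retention period. The class name IndexHousekeeping and the retention parameter are hypothetical, and the actual removal (for instance an HTTP DELETE on each selected index in Elasticsearch) is deliberately left out.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;

class IndexHousekeeping {

    // Assumed naming scheme: the Logstash default "logstash-YYYY.MM.dd".
    private static final String PREFIX = "logstash-";
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("uuuu.MM.dd");

    /** Returns the daily indices whose date lies more than keepDays before today. */
    static List<String> expired(List<String> indexNames, LocalDate today, int keepDays) {
        LocalDate cutoff = today.minusDays(keepDays);
        List<String> result = new ArrayList<>();
        for (String name : indexNames) {
            if (!name.startsWith(PREFIX)) {
                continue; // not a Logstash index, leave it alone
            }
            try {
                LocalDate day = LocalDate.parse(name.substring(PREFIX.length()), DAY);
                if (day.isBefore(cutoff)) {
                    result.add(name);
                }
            } catch (DateTimeParseException e) {
                // Name does not end in a date; skip it rather than guess.
            }
        }
        return result;
    }
}
```

Keeping the selection separate from the deletion makes it easy to log or review the list before anything is removed.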

Graylog

Graylog is an alternative log management platform that addresses the drawbacks of the ELK stack and is quite mature. It consists of a UI and a server part. Graylog uses Elasticsearch as the database for the log messages and MongoDB for application data.

The UI does basically what a UI does. It makes the data accessible in a web browser.

The server part provides a consistent management of the log files. The Graylog server has the following features:

  • Several inputs: HTTP, TCP, SYSLOG, AMQP, …
  • Classification for Log Messages (Streams)
  • User Management and Access Control for the defined streams
  • Simple Dashboards created from streams
  • Housekeeping in Elasticsearch
  • Outputs to forward the messages of a particular stream

Moreover, Graylog can easily be deployed in a clustered environment, so that you get high availability and load distribution.

To create a full solution, it makes sense to combine Graylog with Logstash, which takes a small patch to Logstash and a custom Graylog plugin.

As a standard for log events, Graylog promotes usage of the Graylog Extended Log Format (GELF). This is basically a JSON format containing the following information:

  • timestamp (Unix): time of the log event
  • host: host where the event originates
  • short_message: the message text

A GELF message can contain many other optional fields as well as user-defined fields. The timestamp is particularly important: it lets you see the log messages ordered by the time they were created rather than the time they entered the system.
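To make the format concrete, here is a small Java sketch that assembles such a message by hand. It relies on details of the GELF 1.1 payload format that are not spelled out above, namely the version field and the underscore prefix for user-defined fields, so treat those as assumptions to check against the specification. The class GelfMessage and its minimal JSON escaping are hypothetical helpers; a real application would normally use a JSON library.

```java
import java.util.Locale;
import java.util.Map;

class GelfMessage {

    /** Minimal JSON string escaping for the characters that must not appear raw. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds a GELF JSON document; user-defined fields get the "_" prefix. */
    static String toJson(String host, String shortMessage, double unixSeconds,
                         Map<String, String> userFields) {
        StringBuilder sb = new StringBuilder("{");
        sb.append("\"version\":\"1.1\"");
        sb.append(",\"host\":\"").append(escape(host)).append('"');
        sb.append(",\"short_message\":\"").append(escape(shortMessage)).append('"');
        // Seconds since the Unix epoch with millisecond precision.
        sb.append(",\"timestamp\":").append(String.format(Locale.ROOT, "%.3f", unixSeconds));
        for (Map.Entry<String, String> field : userFields.entrySet()) {
            sb.append(",\"_").append(escape(field.getKey())).append("\":\"")
              .append(escape(field.getValue())).append('"');
        }
        return sb.append('}').toString();
    }
}
```

Because the timestamp is set by the sender, the message keeps its creation time no matter how long it waits in a queue before reaching Graylog.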

Putting it all together

Unfortunately it’s a little bit challenging to make Logstash talk to Graylog and vice versa. The main problem is that Graylog expects each message to end with a NULL delimiter, whereas Logstash terminates messages with \n. The same mismatch applies in the other direction: Logstash expects \n when receiving log messages, while Graylog sends them with the NULL delimiter.
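The mismatch itself is small, as the following Java sketch shows. It is neither the Logstash patch nor the Graylog plugin mentioned in this article, only an illustration of the re-framing both of them have to perform. The class GelfFraming is hypothetical, and the byte-wise swap assumes that each message is a single JSON document in UTF-8 without raw newline or NUL bytes inside it, which holds for properly escaped JSON.

```java
class GelfFraming {

    /** Replaces every occurrence of one delimiter byte with another. */
    static byte[] swapDelimiter(byte[] data, byte from, byte to) {
        byte[] out = data.clone();
        for (int i = 0; i < out.length; i++) {
            if (out[i] == from) {
                out[i] = to;
            }
        }
        return out;
    }

    /** Logstash-style framing (\n) to Graylog-style framing (NULL byte). */
    static byte[] newlineToNull(byte[] data) {
        return swapDelimiter(data, (byte) '\n', (byte) 0);
    }

    /** Graylog-style framing (NULL byte) to Logstash-style framing (\n). */
    static byte[] nullToNewline(byte[] data) {
        return swapDelimiter(data, (byte) 0, (byte) '\n');
    }
}
```

A small relay between the two systems could apply these functions to the byte stream in each direction. The options below avoid the need for such a relay.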

Sending messages from Logstash to Graylog

1. Use a message broker like RabbitMQ. Logstash can write to RabbitMQ, Graylog can read. This solution decouples both applications, so that the Graylog server can be shut down while Logstash is still producing log messages.

2. Use the HTTP input in Graylog to receive messages from Logstash. This solution has some drawbacks. The biggest might be that if Graylog is down, Logstash discards the message after a failed send attempt.

3. Use the GELF TCP input and patch Logstash. Unfortunately, the line separator in the Logstash “json_lines” codec cannot be changed. A patch that makes it configurable is currently open as a pull request and will hopefully be merged soon. The big advantage of using the Logstash TCP output is that Logstash queues messages that cannot be sent and retries sending them.

Sending messages from Graylog to Logstash

Sending messages from Graylog to Logstash might not seem to make sense at first. It does, however, if you want to create a file-based archive of log files on a NAS or in AWS S3.

As mentioned above, the line ending is a problem here as well. Fortunately, Graylog provides a plugin API, so I created a plugin that forwards log messages to a Logstash instance, which can then write the log files.

The plugin is hosted on GitHub and licensed under the Apache License 2.0.

Conclusion

As described in this article, you can combine Logstash and Graylog with little effort to build an enterprise-ready, flexible, scalable and access-controlled log management system. Graylog and Elasticsearch, as the central components, allow the described setup to scale out and handle a huge load of data.

Graylog, Logstash and Elasticsearch are all high-quality open source tools with a great community and many users. All three are also commercially supported by the companies behind them.

Finally, there is one important note for all the Kibana lovers: it is of course possible to deploy Kibana in parallel to Graylog. You can then build nice dashboards with Kibana and still have features like user management and Elasticsearch housekeeping in Graylog.


Teamgeist on Android Wear

The whole IT world is currently talking about wearables, so I wanted to take a closer look at the Android Wear API in a lab. A first use case was found quickly: our Teamgeist app recently gained the ability to give kudos.

Kudos would display nicely on an Android Wear watch. There would be two actions: one to vote for a kudo, and one to open the Teamgeist app.

Integrating with the Teamgeist app would require a new interface. To get to know the Android Wear API, we therefore settle in the following for an Android app that creates and sends kudos.

A little research made it clear that this use case does not need a dedicated Android Wear app at all. A regular Android app that sends messages directly to the watch through the Notifications API is enough. Applications written specifically for Android Wear will be covered in a later tutorial.

Preparation

A few things we need before we can get started:

  • IntelliJ (14) as the IDE
  • Android SDK with the API packages for level 19 (4.4.2) and level 20 (4.4W) and the Android Support Library v4 (20) installed
  • For lack of a real Android Wear device, we start an emulated one from the AVD Manager

To pair with a phone, we need the Android Wear app from the Play Store on the phone. Pairing the emulated Wear device with a phone connected via USB only works after the following command has been entered on the command line (in the platform-tools directory of the android-sdk):

~/development/android-sdk-mac_86/platform-tools$ adb -d forward tcp:5601 tcp:5601

Only once the command has run without errors can the emulated watch be connected to the phone from the Android Wear app on the phone. If the phone is disconnected from the computer and plugged in again, the command has to be run again. A detailed description is available from Google or here.

Creating a new Android app

After we have successfully paired the emulator with the phone, the first notifications already appear on the watch, for example for incoming mail.

To send notifications ourselves, we create a new project in IntelliJ. On the first screen we select Android on the left and Gradle: Android Module on the right. On the following page we have to make a few settings, such as the version of the target SDK.

Note: We could also have chosen 4.3 here, since the Android Wear app is supported from Android 4.3 onwards.

On the next pages we leave the settings as they are, and on the last screen we only choose the folder for our project.

Cleaning up the generated project

In our Teamgeist app the first thing we need is, of course, our Teamgeist logo, which we add to the drawables 🙂

In activity_main.xml we delete the TextView and create a button instead.

<Button
    android:layout_width="wrap_content"
    android:layout_height="wrap_content"
    android:text="Sende Kudos"
    android:id="@+id/kudo_button"
    android:layout_centerVertical="true"
    android:layout_centerHorizontal="true"/>

To work with the button in Java, we get a reference to it in the MainActivity#onCreate() method and set an OnClickListener right away.

@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_main);

    Button kudoButton = (Button)findViewById(R.id.kudo_button);
    kudoButton.setOnClickListener(new View.OnClickListener() {
        @Override
        public void onClick(View view) {
          // our notification code goes here
        }
    });
}

If we start our app now, it should open on the phone showing a “Sende Kudos” button on a white background.

Sending a first notification

To send a first notification, we still have to add the v4 Support Library to our project. To do so, we add a line to the dependencies section of our build.gradle file.

dependencies {
    compile fileTree(dir: 'libs', include: ['*.jar'])
    compile "com.android.support:support-v4:20.0.+"
}

When the v4 Support Library is added to a project for the first time, IntelliJ detects this and, after asking for confirmation, creates a separate repository for it.

Now we can access the Notification API in the onClick method of the OnClickListener created earlier and add the following code.

@Override
public void onClick(View view) {
  //1. Erstellen eines NotificationCompat.Builder mit Hilfe des Builder Patterns
  Notification notification =
    new NotificationCompat.Builder(MainActivity.this)
      .setSmallIcon(R.drawable.teamgeist_logo)
      .setContentTitle("Notifications?")
      .setContentText("Congratulations, you have sent your first notification")
      .build();

  //2. Wir benötigen einen NotificationManager
  NotificationManagerCompat notificationManager =
    NotificationManagerCompat.from(MainActivity.this);

  //3. Versenden der Notification mittels NotificationManager und NotificationBuilder
  int notificationId = 1;
  notificationManager.notify(notificationId, notification);

}
  1. Als erstes wird mit Hilfe des NotificationCompat.Builder und dem Builder Pattern eine Notification erstellt. Hier setzen wir zu Beginn einen Titel, einen Text und ein Bild.
  2. Dann benötigen wir zum versenden einen NotificationManager. Den erhalten wir mit dem Aufruf der from() Methode von der Klasse NotificationManagerCompat.
  3. Danach sind wir bereit die Notification über die notify Methode des NotificationManagers zu verschicken. Die notificationId dient hierbei zur Unterscheidung von verschiedenen Notifications einer App.

Wenn wir die App jetzt deployen, starten und auf “Kudo senden” drücken kriegen wir unsere erste eigene Notification auf der Uhr.


Background image

Android derives a matching background color from the app icon. A custom image looks much better, though. We achieve this by additionally calling setLargeIcon on the builder.

new NotificationCompat.Builder(MainActivity.this)
 .setLargeIcon(BitmapFactory.decodeResource(getResources(), R.drawable.teamgeist_logo))
 .setSmallIcon(R.drawable.teamgeist_logo)
 .setContentTitle("Notifications?")
 .setContentText("Congratulations, you have sent your first notification")
 .build();

With that, the notification on the watch also gets the ghost as its background.


User interaction

We can add various user interactions to the notification. With a PendingIntent, for example, a specific activity of our app is launched and data is passed to it via “extras”. We create the PendingIntent in a method of its own.

private PendingIntent createContentIntent() {
    Intent viewIntent = new Intent(MainActivity.this, MainActivity.class);
    viewIntent.putExtra("EventNotified", "1");
    PendingIntent viewPendingIntent =
          PendingIntent.getActivity(MainActivity.this, 0, viewIntent, 0);
    return viewPendingIntent;
}

We pass this PendingIntent to the builder by calling setContentIntent.

new NotificationCompat.Builder(MainActivity.this)
 .setLargeIcon(BitmapFactory.decodeResource(getResources(), R.drawable.teamgeist_logo))
 .setSmallIcon(R.drawable.teamgeist_logo)
 .setContentTitle("Notifications?")
 .setContentText("Congratulations, you have sent your first notification")
 .setContentIntent(createContentIntent())
 .build();

Swiping the notification to the left reveals our new action.


If we now tap “Open on phone”, the associated activity opens on the phone, in our case the MainActivity. Unfortunately, the notification so far remains on the watch. To remove it there, we have to check whether the app was started through the user interaction and, if so, cancel the notification. For this we create the method cancelNotificationOnUserInteraction and call it in MainActivity#onCreate.

private void cancelNotificationOnUserInteraction() {
    Intent intent = getIntent();
    Bundle extras = intent.getExtras();
    if (extras != null && "1".equals(extras.getString("EventNotified"))) {
        NotificationManagerCompat.from(this).cancel(1);
    }
}

Besides this default action we can add further “actions”. To do so, we create an Action object with the following method,

private NotificationCompat.Action showInBrowser() {
    // Implicit intent that opens the Teamgeist web app in a browser
    Intent browserIntent = new Intent(Intent.ACTION_VIEW);
    Uri webUri = Uri.parse("http://app.teamgeist.io");
    browserIntent.setData(webUri);
    PendingIntent browserPendingIntent =
            PendingIntent.getActivity(this, 0, browserIntent, 0);

    // Icon, label and PendingIntent of the additional action
    return new NotificationCompat.Action(
            android.R.drawable.ic_dialog_map, "Open in Browser", browserPendingIntent);
}

and pass the object to the builder via the addAction method.

new NotificationCompat.Builder(MainActivity.this)
 .setLargeIcon(BitmapFactory.decodeResource(getResources(), R.drawable.teamgeist_logo))
 .setSmallIcon(R.drawable.teamgeist_logo)
 .setContentTitle("Notifications?")
 .setContentText("Congratulations, you have sent your first notification")
 .setContentIntent(createContentIntent())
 .addAction(showInBrowser())
 .build();

We can now swipe the notification to the left twice and get a further action to choose from. Tapping “Open in Browser” now opens our Teamgeist website on the phone.


With an action like this we would implement the voting feature. The app on the phone would then have to transmit the vote to the Teamgeist server.

What else is there?

This brings us to the end of our first Android Wear lab. Besides these actions there are further Wear-specific notification features. One is the ability to extend a notification by more than one “page”; another is grouping notifications. Probably the best-known feature, however, is the ability to reply to a notification by voice.

All of these are potential topics for our next Android lab. And of course we would like to connect the app to our Teamgeist server in order to receive real kudos and “vote” for them ;-).

A first timer’s experience at Strata Europe

I am totally overwhelmed by the impressions I got as a first timer at a Strata Europe conference, held this year from 19 to 21 November in Barcelona. The seven(!) different tracks included high-quality talks covering topics from a theoretical standpoint as well as from an architectural and tooling view. The most interesting talks for me personally, however, were those in which the speakers shared their experiences with real-world applications of the presented concepts.


As a perfect starting point I joined the D3 tutorial by Sebastian Gutierrez (DashingD3js.com). There I learned the basics of how D3 works, which enables me now to understand and possibly customize all the libraries that are built on top.


Interview with MapR’s M.C. Srivas about Apache Drill

Recently we had M.C. Srivas, CTO and co-founder of MapR Technologies, as a speaker at our Munich Hadoop User Group. He gave a nice talk about the Apache Drill project, which develops a tool providing fast interactive SQL on Hadoop and other data sources. We took the opportunity to ask Srivas a thing or two about Drill and his view on it.


Map-Reducing Everywhere – comSysto and MapR at TDWI Europe 2014

This year’s TDWI Europe conference takes place in Munich from June 23rd to 25th. The conference is one of the major hubs for the Data Warehousing and Business Intelligence scene, and comSysto and our partner MapR are happy to be giving one of the talks.
