Category Archives: Architecture

Ice cream sales break microservices, Hystrix to the rescue

In November 2015, we had the opportunity to spend three days with a greenfield project in order to get to know Spring Cloud Netflix. At comSysto, we always try to evaluate technologies before their potential use in customer projects to make sure we know their pros and cons. Of course, we had read about several aspects, but we never really got our hands dirty using it. This had to change!

Besides coming up with a simple scenario that can be completed within a few days, our main focus was on understanding potential problems in distributed systems. First of all, any distributed system comes with the ubiquitous problem of failing services that should not break the entire application. This is most prominently addressed by Netflix’ “Simian Army” which intentionally breaks random parts of the production environment.

However, we rather wanted to provoke problems arising under heavy load due to capacity limitations. Therefore, we intentionally designed a distributed application with a bottleneck that turned into an actual problem with many simultaneous requests.

Our Use Case

Our business case is an ice-selling company that operates at locations all over the world. At each location there are ice-selling robots. At the company’s headquarters we want to show an aggregated report about the ice-selling activities for each country.

All our components are implemented as dedicated microservices using Spring Boot and Spring Cloud Netflix. Service discovery is implemented using Eureka server. The communication between the microservices is RESTful.

Architecture overview of our distributed system with the deployment setup during the experiments.

There is a basic location-service, which knows about all locations equipped with ice-selling-robots. The data from all these locations has to be part of the report.

For every location, there is one instance of the corresponding microservice representing an ice-selling-robot. Every ice-selling-robot locally stores the total amount of ice cream sold and the remaining stock. Each of them continuously pushes this data to the central current-data-service and fails at a certain rate, which is configured via a central Config Server.

For the sake of simplicity, the current-data-service stores this information in-memory. Every time it receives an update from one of the ice-selling-robots, it takes the new value and forgets about the old one. Old values are also forgotten if their timestamp is too old.
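
To illustrate the idea, a minimal sketch of such an in-memory store could look like the following (class, field, and method names are made up for this post, not taken from our actual code; the expiry threshold is an assumption):

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class CurrentDataStore {

    // hypothetical expiry threshold after which a value is considered too old
    private static final Duration MAX_AGE = Duration.ofMinutes(5);

    private final Map<String, Measurement> latestByLocation = new ConcurrentHashMap<>();

    // every update simply replaces the previous value for that location
    public void update(String locationId, long value) {
        latestByLocation.put(locationId, new Measurement(value, Instant.now()));
    }

    // values whose timestamp is too old are treated as absent
    public Optional<Long> currentValue(String locationId) {
        Measurement m = latestByLocation.get(locationId);
        if (m == null || m.receivedAt.isBefore(Instant.now().minus(MAX_AGE))) {
            return Optional.empty();
        }
        return Optional.of(m.value);
    }

    private static final class Measurement {
        final long value;
        final Instant receivedAt;

        Measurement(long value, Instant receivedAt) {
            this.value = value;
            this.receivedAt = receivedAt;
        }
    }
}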

The current-data-service offers an interface through which the current value of the total amount of ice cream sold or of the remaining stock can be retrieved for a single location. This interface is used by an aggregator-service, which can generate and deliver an aggregated report on demand. For all locations provided by the location-service, the current data is retrieved from the current-data-service and aggregated by summing up the individual values grouped by the locations’ country. The resulting report consists of the summed-up values per country and data type (total ice cream sold and remaining stock).
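
Conceptually, the aggregation boils down to a group-by-country sum. A small sketch of that step (the LocationValue type is illustrative and not one of the project’s actual classes, and only one data type is shown):

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ReportAggregator {

    // illustrative value object: current data of one location plus its country
    public static class LocationValue {
        final String country;
        final long soldIceCream;

        public LocationValue(String country, long soldIceCream) {
            this.country = country;
            this.soldIceCream = soldIceCream;
        }
    }

    // sum the per-location values grouped by country
    public Map<String, Long> soldIceCreamByCountry(List<LocationValue> locationValues) {
        return locationValues.stream()
                .collect(Collectors.groupingBy(v -> v.country,
                         Collectors.summingLong(v -> v.soldIceCream)));
    }
}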

Because the connection between aggregator-service and current-data-service is quite slow, generating the report takes a lot of time (we simply simulated this slow connection with a wifi connection, which is slow compared to an internal service call on the same machine). Therefore, an aggregated report cache has been implemented as a fallback; switching to this fallback is handled by Hystrix. At a fixed interval, a simple scheduled job provides the cache with the most current report.
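
The cache refresh itself can be as simple as a Spring scheduled task; a hedged sketch with made-up class names (the 30s interval matches the historize-job-rate we used in our tests, and @EnableScheduling is assumed to be present on a configuration class):

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;

@Component
public class ReportHistorizeJob {

    private final RestTemplate restTemplate;

    public ReportHistorizeJob(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }

    // every 30 seconds: fetch a fresh report and push it into the cache service
    @Scheduled(fixedRate = 30000)
    public void historizeReport() {
        Report report = restTemplate.getForObject("http://aggregator-service/", Report.class);
        restTemplate.postForObject("http://aggregated-report-cache/", report, Void.class);
    }
}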

The reporting-service is the only service with a graphical user interface. It generates a very simplistic HTML-based dashboard that the business section of our company can use to get an overview of all the different locations. The data presented to the user is retrieved from the aggregator-service. Because this service is expected to be slow and prone to failure, a fallback is implemented which retrieves the last report from the aggregated-report-cache. With this, the user can always request a report within an acceptable response time, even though it might be slightly outdated. This is a typical example of maintaining maximum service quality in case of partial failure.

The reporting “dashboard”.

We used a Spring Cloud Dashboard from the open source community for showing all registered services:

Spring Cloud Dashboard in action.

The circuit breaker around the getReport call can be monitored from the Hystrix dashboard.

Hystrix dashboard for the reporting-service under load. All circuits are closed, but 19% of all getReport requests failed and were hence successfully redirected to the cached version.

Understanding the Bottleneck

When using Hystrix, all connectors to external services typically have a thread pool of limited size to isolate system resources. As a result, the number of concurrent (or “parallel”) calls from the reporting-service to the aggregator-service is limited by the size of that thread pool. This way we can easily overstress the capacity for on-demand generated reports, forcing the system to fall back to the cached report.

The relevant part of the reporting-service’s internal declaration looks as depicted in the following code snippet (note the descriptive URLs that are resolved by Eureka). The primary method getReport() is annotated with @HystrixCommand and configured to use the cached report as fallbackMethod:

@HystrixCommand(
 fallbackMethod="getCachedReport",
 threadPoolKey="getReportPool"
)
public Report getReport() {
 return restTemplate.getForObject("http://aggregator-service/", Report.class);
}

public Report getCachedReport() {
 return restTemplate.getForObject("http://aggregated-report-cache/", Report.class);
}
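
The size of the getReportPool thread pool (5 in our experiments, see the test setup below) can be configured directly on the command. A sketch using the standard Hystrix javanica properties:

@HystrixCommand(
 fallbackMethod="getCachedReport",
 threadPoolKey="getReportPool",
 threadPoolProperties={
  // limit concurrent getReport() executions to 5 threads
  @HystrixProperty(name="coreSize", value="5")
 }
)
public Report getReport() {
 return restTemplate.getForObject("http://aggregator-service/", Report.class);
}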

In order to be able to distinguish primary and fallback calls from the end user’s point of view, we decided to include a timestamp in every served report to indicate the delta between the creation and serving time of a report. Thus, as soon as the reporting-service delegates incoming requests to the fallback method, the age of the served report starts to increase.
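
In code, this is nothing more than storing the creation time in the report and computing the difference when it is served; a trivial sketch of the idea (not the project’s actual Report class):

import java.time.Duration;
import java.time.Instant;

public class Report {
    // set once when the report is generated by the aggregator-service
    private final Instant createdAt = Instant.now();

    // the age shown to the end user: time between creation and serving of the report
    public Duration age() {
        return Duration.between(createdAt, Instant.now());
    }
}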

Testing

With our bottleneck set up, testing and observing the runtime behavior is fairly easy. Using JMeter we configured a testing scenario with simultaneous requests to the reporting-service.

Basic data of our scenario:

  • aggregation-server instances: 1
  • test duration: 60s
  • hit rate per thread: 500ms
  • historize-job-rate: 30s
  • thread pool size for the getReport command: 5

Using the described setup, we conducted different test runs with a JMeter thread pool size (= number of concurrent simulated users) of 3, 5, and 7. Analyzing the served reports’ timestamps led us to the following conclusions:

Using a JMeter thread count below the size of the service thread pool results in a 100% success rate for the reporting-service calls. Setting both pool sizes equal already produces a small but noticeable error rate. Finally, setting the JMeter thread count higher than the service thread pool size results in growing numbers of failures and fallbacks, also forcing the circuit breaker into short-circuit states.

Our measured results are as follows (note that the average report age would be 15s when always using the cached version given our historize-job-rate of 30s):

  • 3 JMeter threads: 0.78s average report age
  • 5 JMeter threads: 1.08s average report age
  • 7 JMeter threads: 3.05s average report age

After obtaining these results, we changed the setup to eliminate the slow connection by deploying the current-data-service to the same machine as the aggregator-service. Thus, the slow connection has been removed and replaced with an internal, fast one. With the new setup we conducted an additional test run, obtaining the following result:

  • 7 JMeter threads, fast network: 0.74s average report age

By eliminating one part of our bottleneck, the report age drops significantly, to a value just below that of the first test run.

Remedies

The critical point of the entire system is the aggregation due to its slow connection. To address the issue, different measures can be taken.

First, it is possible to scale out by adding additional service instances. Unfortunately, this was hard to test given the hardware at hand.

Second, the slow connection itself could be optimized, as our additional measurement shows.

Last but not least, we could also design our application for always using the cache assuming that all users should see the same report. In our simplistic scenario this would work, but of course that is not what we wanted to analyze in the first place.

Our Lessons Learned

Let us instead share a few take-aways based on our humble experience of building a simple example from scratch.

Spring Boot makes it really easy to build and run dozens of services, but really hard to figure out what is wrong when things do not work out of the box. Unfortunately, the available Spring Cloud documentation is not always sufficient. Nevertheless, Eureka works like a charm when it comes to service discovery. Simply use the name of the target service in a URL and pass it to a RestTemplate. That’s all! Everything else is handled transparently, including client-side load balancing with Ribbon! In another lab on distributed systems, we spent a lot of time working around exactly this issue. This time, everything was just right.
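
For completeness, a minimal sketch of what this looks like on the client side (the class name is illustrative; depending on the Spring Cloud version, the @LoadBalanced qualifier may or may not be required for Ribbon to resolve the logical host name):

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;
import org.springframework.cloud.client.loadbalancer.LoadBalanced;
import org.springframework.context.annotation.Bean;
import org.springframework.web.client.RestTemplate;

@SpringBootApplication
@EnableDiscoveryClient
public class ReportingServiceApplication {

    // this RestTemplate resolves logical names like "aggregator-service" via Eureka/Ribbon
    @LoadBalanced
    @Bean
    public RestTemplate restTemplate() {
        return new RestTemplate();
    }

    public static void main(String[] args) {
        SpringApplication.run(ReportingServiceApplication.class, args);
    }
}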

Furthermore, our poor deployment environment (3 MacBooks…) made serious performance analysis very hard. Measuring the effect of scaling out is nearly impossible on a developer machine due to its physical resource limitations. Having multiple instances of the same service doesn’t give you anything if a single one already pushes the CPU to its limits. Luckily, there are almost infinite resources in the cloud nowadays which can be allocated in no time if required. It could be worth considering this option right away when working on microservice applications.

In Brief: Should you use Spring Cloud Netflix?

So what is our recommendation after all?

First, we were totally impressed by how Eureka makes service discovery as easy as it can be. Given you are running Spring Boot, starting a Eureka server and making each microservice a Eureka client is nothing more than adding a few dependencies and annotations. On the other hand, we did not evaluate its integration in other environments.
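
As an illustration, a Eureka server is essentially this (plus the spring-cloud-starter-eureka-server dependency and a minimal configuration; the class name is our own):

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.netflix.eureka.server.EnableEurekaServer;

@SpringBootApplication
@EnableEurekaServer
public class EurekaServerApplication {
    public static void main(String[] args) {
        SpringApplication.run(EurekaServerApplication.class, args);
    }
}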

Second, Hystrix is very useful for preventing cascading errors throughout the system, but it cannot be used in a production environment without suitable monitoring unless you have a soft spot for flying blind. Also, it introduces a few pitfalls during development. For example, when debugging a Hystrix command, the calling code will probably detect a timeout in the meantime, which can lead to completely different behavior. However, if you have the tools and skills to handle the additional complexity, Hystrix is definitely a winner.

In fact, this restriction applies to microservice architectures in general. You have to go a long way to be able to run one – but once you are there, you can scale almost infinitely. Feel free to have a look at the code we produced on GitHub or discuss whatever you are up to at one of our user groups.

Combining Logstash and Graylog for Log Management

A little incomplete overview

When working in a classic IT infrastructure you often face the problem that developers only have access to test or development environments, but not to production. In order to fix bugs or to have a glance at the system running in production, access to log files is needed. This is often not possible due to security requirements. The result is that the operations guys need to provide these files to the developers, which can take a certain amount of time.

A solution to these problems is to provide a log management server and grant the developers access via a UI. Besides commercial tools like Splunk, which is the de-facto market leader in this area, there are some quite promising open source solutions which scale very well and may provide enough features to get the job done.

The advantage of using open source technology is that you can – but do not have to – buy subscriptions. Furthermore, software like Splunk and Log Analysis comes with pricing plans that depend on the amount of logs you ship daily. The problem is that you have to pay more as the volume of logs increases, be it due to a raised log level that helps analyze bugs in production or simply because more services are deployed.

Last but not least, there are of course cloud solutions like Loggly. You can basically ship your log events to a cloud service, which then takes care of the rest. You do not have to provide any infrastructure yourself. This is a very good solution unless the security policy of your organization prohibits shipping data to the cloud.

Of course this overview is incomplete. I just picked some tools for a brief introduction. If you think something is missing, feel free to blog or comment about it.

Open Source Log Management

The famous ELK-Stack

At the moment, probably the most famous open source log management solution is the ELK stack. It is called a stack because it is not a single software package but a combination of well-known open source tools. The components are:

  • Elasticsearch is a document-oriented database optimized for searching. It is easily scalable and can manage a huge amount of data.
  • Logstash is a log forwarder with many features. There are many types of inputs, filters and outputs. Moreover, Logstash can handle a bunch of codecs, like JSON for example.
  • Finally, Kibana is the UI where you can view the log entries and create very sophisticated and colorful dashboards.

Despite all the good things about the ELK stack, there are some drawbacks which can make it a less than optimal choice under some circumstances.

Kibana has no user management. If you want user management you have to purchase commercial support from Elastic to get a license for Shield.

Next, there is no housekeeping for the Elasticsearch database. Logstash creates one index per day, and you have to remove old indices manually if you do not need them anymore.

Graylog

Graylog is an alternative log management platform that addresses the drawbacks of the ELK stack and is quite mature. It provides a UI and a server part. Moreover, Graylog uses Elasticsearch as the database for log messages and MongoDB for application data.

The UI does basically what a UI does. It makes the data accessible in a web browser.

The server part provides a consistent management of the log files. The Graylog server has the following features:

  • Several inputs: HTTP, TCP, SYSLOG, AMQP, …
  • Classification for Log Messages (Streams)
  • User Management and Access Control for the defined streams
  • Simple Dashboards created from streams
  • Housekeeping in Elasticsearch
  • Outputs to forward the messages of a particular stream

Moreover, Graylog can easily be deployed in a clustered environment, so that you get high availability and load distribution.

In order to create a full solution, it makes sense to combine Graylog with Logstash, with a little patching of Logstash and a custom Graylog plugin.

As a standard for log events, Graylog promotes usage of the Graylog Extended Log Format (GELF). This is basically a JSON format containing the following information:

  • Timestamp (Unix): time of log event
  • Host: host where the event originates
  • short_message: message

A GELF message can contain many other optional fields as well as user-defined fields. The timestamp is really important in order to see the log messages ordered by creation time rather than by the time they entered the system.
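
For illustration, a minimal GELF payload might look like this (the host name and the underscore-prefixed field are made-up examples of the optional and user-defined parts):

{
  "version": "1.1",
  "host": "web01.example.org",
  "short_message": "Order processed",
  "timestamp": 1450276200.123,
  "level": 6,
  "_application": "shop-backend"
}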

Putting it all together

Unfortunately, it is a little bit challenging to make Logstash talk to Graylog and vice versa. The main problem is that Graylog expects a message to be terminated with a NULL delimiter, whereas Logstash terminates it with \n. Likewise, Logstash expects \n when receiving log messages, while Graylog sends them with the NULL delimiter.
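
To make the framing issue tangible, here is a hedged sketch of what a sender has to do so that Graylog’s GELF TCP input accepts a message (plain Java, no Logstash involved; host, port and payload are placeholders):

import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class GelfTcpSender {

    // sends a single GELF message, terminated by a NULL byte as Graylog expects
    public static void send(String host, int port, String gelfJson) throws Exception {
        try (Socket socket = new Socket(host, port);
             OutputStream out = socket.getOutputStream()) {
            out.write(gelfJson.getBytes(StandardCharsets.UTF_8));
            out.write(0); // NULL delimiter instead of '\n'
            out.flush();
        }
    }
}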

Sending messages from Logstash to Graylog

1. Use a message broker like RabbitMQ. Logstash can write to RabbitMQ, and Graylog can read from it. This solution decouples both applications, so that the Graylog server can be shut down while Logstash is still producing log messages.

2. Use the HTTP input in Graylog to receive messages from Logstash. This solution has some drawbacks. The biggest might be that if Graylog is down, Logstash discards the message after a failed send attempt.

3. Use the GELF TCP input and patch Logstash. Unfortunately, there is no way to change the line separator in the Logstash “json_lines” codec. This could be done in a patch which is currently open as a pull request. Hopefully, it will be merged soon. The big advantage of using the Logstash TCP output is that Logstash queues messages which cannot be sent and retries sending them.

Sending messages from Graylog to Logstash

Sending messages from Graylog to Logstash might not make sense at first glance. But if you think of creating a file-based archive of log files on a NAS or in AWS S3, it does make sense.

As mentioned above, there is again a problem with the line ending. Fortunately, Graylog provides a plugin API, so I created a plugin which can forward log messages to a Logstash instance, which can then write the log files.

The plugin is hosted on GitHub and licensed under the APL 2.0.

Conclusion

As described in this article, you can combine Logstash and Graylog with little effort in order to build an enterprise-ready, flexible, scalable and access-controlled log management system. With Graylog and Elasticsearch as central components, the described setup is able to scale out and can handle a huge load of data.

Graylog, Logstash and Elasticsearch are all high-quality open source tools with a great community and many users. All these products are also commercially supported by the companies behind them.

Finally, there is one important note for all the Kibana lovers: of course, it is possible to deploy Kibana in parallel to Graylog. Then you can build nice dashboards with Kibana and still have features like user management and Elasticsearch housekeeping in Graylog.


Developing a Modern Distributed System – Part II: Provisioning with Docker

As described in an earlier blog post, “Bootstrapping the Project”, comSysto’s performance and continuous delivery guild members are currently evaluating several aspects of distributed architectures in the spare time left alongside customer project work. In the initial lab, a few colleagues of mine built the starting point for a project called Hash-Collision:

Hash-collision's system structure

They focused on the basic application structure and architecture to get a showcase up and running as quickly as possible and left us the following tools for running it:

  • one simple shell script to set up the environment locally based on many assumptions
  • one complex shell script that builds and runs all services
  • hardcoded dependency on a local RabbitMQ installation

First Attempt: Docker containers as a runtime environment

In search of a more sophisticated runtime environment, we went down the most obvious path and chose to get our hands on the hype technology of late 2014: Docker. I assume that most people have a basic understanding of Docker and what it is, so I will not spend too much time on its motivation here. Basically, it is a tool inspired by the idea of ‘write once, run anywhere’, but on a higher level of abstraction than that famous programming language. Docker cannot only make an application portable, it allows shipping all dependencies such as web servers, databases and even operating systems as one or multiple well-defined images and using the very same configuration from development all the way to production. Even though we did not even have any production or pre-production environments, we wanted to give it a try. Being totally enthusiastic about containers, we chose the most container-like place we could find and locked ourselves in there for 2 days.


One of the nice things about Docker is that it encourages developers to re-use existing infrastructure components by design. Images are defined incrementally by selecting a base image, and building additional functionality on top of it. For instance, the natural way to create a Tomcat image would be to choose a base image that already brings a JDK and install Tomcat on top of it. Or even simpler, choose an existing Tomcat image from the Docker Hub. As our services are already built as fat JARs with embedded web servers, things were almost trivial.

Each service should run in a standardized container with the executed JAR file being the sole difference. Therefore, we chose to use only one service image and inject the correct JAR using Docker volumes for development. On top of that, we needed additional standard containers for nginx (dockerfile/nginx) and RabbitMQ (dockerfile/rabbitmq). Each service container has a dependency on RabbitMQ to enable communication, and the nginx container needs to know where the Routing service resides to fulfill its role as a reverse proxy. All other dependencies can be resolved at runtime via any service discovery mechanism.

As a first concrete example, this is the Dockerfile for our service image. Based on Oracle’s JDK 8, there is not much left to do except for running the JAR and passing in a few program arguments:

FROM dockerfile/java:oracle-java8
MAINTAINER Christian Kroemer (christian.kroemer@comsysto.com)
CMD /bin/sh -c 'java -Dport=${PORT} -Damq_host=${AMQ_PORT_5672_TCP_ADDR} -Damq_port=${AMQ_PORT_5672_TCP_PORT} -jar /var/app/app.jar'

After building this image, it is ready for usage in the local Docker repository and can be used like this to run a container:

# start a new container based on our docker-service-image
docker run docker-service-image
# link it with a running rabbitmq container to resolve the amq dependency
docker run --link rabbitmq:amq docker-service-image
# do not block and run it in background
docker run --link rabbitmq:amq -d docker-service-image
# map the container http port 7000 to the host port 8000
docker run --link rabbitmq:amq -d -p 8000:7000 docker-service-image
# give an environment parameter to let the embedded server know it has to start on port 7000
docker run --link rabbitmq:amq -d -e "PORT=7000" -p 8000:7000 docker-service-image
# inject the user service fat jar
docker run --link rabbitmq:amq -d -e "PORT=7000" -v HASH_COLLISION_PATH/user/build/libs/user-1.0-all.jar:/var/app/app.jar -p 8000:7000 docker-service-image

Very soon we ended up with a handful of such bash commands that we pasted into our shells over and over again. Obviously, we were not exactly happy with that approach, started to look for more powerful tools in the Docker ecosystem, and stumbled over fig (which was not yet deprecated in favor of docker-compose at that time).

Moving on: Docker Compose for some degree of service orchestration

Docker-compose is a tool that simplifies the orchestration of Docker containers all running on the same host system based on a single docker installation. Any `docker run` command can be described in a structured `docker-compose.yml` file and a simple `docker-compose up` / `docker-compose kill` is enough to start and stop the entire distributed application. Furthermore, commands such as `docker-compose logs` make it easy to aggregate information for all running containers.


Here is an excerpt from our `docker-compose.yml` that illustrates how self-explanatory those files can be:

rabbitmq:
  image: dockerfile/rabbitmq
  ports:
    - ":5672"
    - "15672:15672"
user:
  build: ./service-image
  ports:
    - "8000:7000"
  volumes:
    - ../user/build/libs/user-1.0-all.jar:/var/app/app.jar
  environment:
    - PORT=7000
  links:
    - rabbitmq:amq

Semantically, the definition of the user service is equivalent to the last sample command given above except for the handling of the underlying image. The value given for the `build` key is the path to a directory that contains a `Dockerfile` which describes the image to be used. The AMQ service, on the other hand, uses a public image from the Docker Hub and hence uses the key `image`. In both cases, docker-compose will automatically make sure the required image is ready to use in the local repository before starting the container. A single `docker-compose.yml` file consisting of one such entry for each service is now sufficient for starting up the entire distributed application.

An Aside: Debugging the application within a Docker container

For being able to debug an application running in a Docker container from the IDE, we need to take advantage of remote debugging as for any physical remote server. For doing that, we defined a second service debug image with the following `Dockerfile`:

FROM dockerfile/java:oracle-java8
MAINTAINER Christian Kroemer (christian.kroemer@comsysto.com)
CMD /bin/sh -c 'java -Xdebug -Xrunjdwp:transport=dt_socket,address=10000,server=y,suspend=n -Dport=${PORT} -Damq_host=${AMQ_PORT_5672_TCP_ADDR} -Damq_port=${AMQ_PORT_5672_TCP_PORT} -jar /var/app/app.jar'

This will make the JVM listen for a remote debugger on port 10000 which can be mapped to any desired host port as shown above.

What we got so far

With a local installation of Docker (on a Mac using boot2docker http://boot2docker.io/) and docker-compose, starting up the whole application after checking out the sources and building all JARs is now as easy as:

  • boot2docker start (follow instructions)
  • docker-compose up -d (this will also fetch / build all required images)
  • open http://$(boot2docker ip):8080/overview.html

Note that several problems originate from boot2docker on the Mac. For example, containers cannot be accessed via `localhost`, but only via the IP of a VM, since boot2docker runs Docker inside a VirtualBox image.

In an upcoming blog post, I will outline one approach to migrate this setup to a more realistic environment with multiple hosts using Vagrant and Ansible on top. Until then, do not forget to check if Axel Fontaine’s Continuous Delivery Training at comSysto is just what you need to push your knowledge about modern deployment infrastructure to the next level.

Developing a Modern Distributed System – Part III: Provisioning with Docker, Vagrant and Ansible

Part II of our blog post series on ‘Developing a Modern Distributed System’ featured our first steps with Docker. In a second lab in early 2015, we tried to better understand the required changes in a production-like deployment. Without the assumption of all containers running on the same host – which makes no sense for a scalable architecture – Docker links and docker-compose are no longer valid approaches. We wanted to get the following three-node setup to work:

3-node-architecture

First of all, we created an automated Docker Hub build linked to our GitHub repository for rebuilding images on each commit. With that, the machines running the containers no longer had to build the images from Dockerfiles themselves. We used Vagrant to run three standard Ubuntu VMs and Ansible to provision them, which included:

  • install Docker
  • upload service JARs that should be linked into the containers
  • upload static resources for nginx’s `/var/www` folder
  • run docker containers with correct parameterization (as we did with docker-compose before), still with some hardcoding to wire up different hosts

Why Ansible? First, some tool is required to avoid manually typing commands via ssh in multiple simultaneous sessions in an environment with multiple hosts. Second, Ansible was an easy choice because some of us already had experience using it while others wanted to give it a try. And last but not least, labs at comSysto are just the right place to experiment with unconventional combinations, see where their limitations are, and prove they can still work! We actually achieved that, but after a `vagrant destroy` it took a full 20 minutes on a developer machine to be up and running again. Not exactly the round-trip time you want to have while crafting your code… We needed to optimize.

The biggest improvement came from using a custom Vagrant base box with a ready-to-go Docker environment. Besides installing Docker, we also pre-fetched all images from the Docker Hub right away, which brings a huge productivity boost on slow internet connections. Even if images change, the large base images are typically pretty stable, hence download times could be reduced dramatically. The Docker image itself could also be optimized by using a minimal JDK base image such as jeanblanchard/busybox-java:8 instead of dockerfile/java:oracle-java8, which is built on top of Ubuntu.
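
With that base image, the service `Dockerfile` from Part II stays almost identical; a hedged sketch of what the slimmer variant could look like:

FROM jeanblanchard/busybox-java:8
MAINTAINER Christian Kroemer (christian.kroemer@comsysto.com)
CMD /bin/sh -c 'java -Dport=${PORT} -Damq_host=${AMQ_PORT_5672_TCP_ADDR} -Damq_port=${AMQ_PORT_5672_TCP_PORT} -jar /var/app/app.jar'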

Furthermore, we used CoreOS instead of Ubuntu as the operating system to get the base box smaller and faster to start up. CoreOS is a minimal OS designed to run Docker containers and do pretty much nothing on top of that. That also means it does not contain Python which is required to provision the VM using Ansible. Fortunately, Ansible can be installed using a specific coreos-bootstrap role.

Provisioning the running VMs with updated versions of our services, instead of destroying and rebuilding them from scratch, gave us a round-trip time of little more than a minute, of which around 30 seconds were required to rebuild all fat JARs.

Let’s have a closer look at a few aspects of that solution. First, we start a standard CoreOS box with Vagrant, and provision it with the following Ansible playbook:

- hosts: all
  gather_facts: False
  roles:
    - defunctzombie.coreos-bootstrap
- hosts: all
  gather_facts: False
  tasks:
    - name: Prepare latest Docker images to make start of fresh VMs fast in case these are still up-to-date
      command: docker pull {{item}}
      with_items:
        - dockerfile/rabbitmq
        - chkcomsysto/hash-collision-service
        - chkcomsysto/hash-collision-service-debug
        - chkcomsysto/hash-collision-nginx

Using `vagrant package` and `vagrant box add` we immediately create a snapshot of that VM state and make it available as a base box for further usage. After that, the VM has fulfilled its destiny and is destroyed right away. The new base box is used in the `Vagrantfile` of the environment for our application which, again, is provisioned using Ansible. Here is a snippet of the corresponding playbook:

- hosts: service
  sudo: yes
  gather_facts: False
  tasks:
    - name: Create directory for runnable JARs
      file:
        path: /var/hash-collision
        state: directory
    - name: Upload User service runnable JAR
      copy:
        src: ../../../user/build/libs/user-1.0-all.jar
        dest: /var/hash-collision/user-1.0-all.jar
    - name: Pull latest Docker images for all services
      command: docker pull chkcomsysto/hash-collision-service-debug
    - name: Start User service as a Docker container
      command: >
        docker run -d
        -p 7000:7000 -p 17000:10000
        --expose 7000 --expose 10000
        -e PORT=7000 -e HOST=192.168.60.6 -e AMQ_PORT_5672_TCP_ADDR=192.168.60.5 -e AMQ_PORT_5672_TCP_PORT=5672
        -v /var/hash-collision/user-1.0-all.jar:/var/app/app.jar
        chkcomsysto/hash-collision-service-debug

Where this leaves us

As we have virtualized pretty much everything, the only prerequisite left was a local Vagrant installation based on VirtualBox. After running a `quickstart-init-box.sh` script to build the Vagrant base box from scratch once, executing a `quickstart-dev-mode.sh` script was sufficient to build the application, start up three VMs with Vagrant, provision them with Ansible, and insert sample data. For a full development round-trip on a running system, another `refresh-dev-mode.sh` script was meant to build the application, provision the running VMs with Ansible, and again insert sample data (note that this is always required, as we were still using in-memory storage without any persistence).

This setup allows us to run the entire distributed application in a multi-host environment during development. A more realistic approach would of course be to work on one component at a time, verify its implementation via unit tests and integrate it by starting this particular service from the IDE configured against an environment that contains the rest of the application. It is easy to see how our solution could be modified to support that use case.

Next Steps & Challenges

For several reasons, the current state feels pretty immature. First, we need to inject each container’s own IP and global port into it. It is questionable whether a container should even need to know its identity. An alternative would be to get rid of the heartbeat approach in which every service registers itself, and instead build service discovery based on Docker metadata with etcd or the like.

Another area for improvements is our almost non-existent delivery pipeline. While uploading JARs into VMs is suitable for development, it is far from ideal for a production delivery. The Docker image should be self-contained, but this requires a proper build pipeline and artifact repository that automates all the way from changes in the service code to built JARs and fully functional Docker images ready to be started anywhere. Non-local deployments, e.g. on AWS, are also an interesting area of research in which the benefits of Docker are supposed to shine.

Last but not least, we need to work on all kinds of monitoring which is a critical part of any distributed application. Instead of using ssh to connect to a VM or remote server and then ssh again into the containers running there to see anything, it would be more appropriate to include a dedicated log management service (e.g. ELK) and send all logs there right away. On top of that, well-defined metrics to monitor the general health and state of services can be an additional source of information.

Obviously, there is a lot left to explore and learn in upcoming labs! Also crazy about DevOps, automation and self-contained deployment artifacts instead of 20th-century-style-delivery? See it all in action at our Continuous Delivery Training!

Teamgeist on Android Wear

The whole IT world is currently talking about wearables, so I wanted to take a closer look at the Android Wear API in one of our labs. The first use case was found quickly: our Teamgeist app recently gained the ability to hand out kudos.

Kudos

Kudos would look great on an Android Wear watch. There would be two actions: one to “vote” for a kudo, the other to open the Teamgeist app.

For an integration with the Teamgeist app we would need a new interface. To get to know the Android Wear API, we therefore settle in the following for an Android app that creates and sends kudos.

After a short bit of research it became clear that this use case does not require a dedicated Android Wear app at all. A normal Android app that sends messages directly to the watch via the Notifications API is sufficient. Apps written specifically for Android Wear will be covered in a later tutorial.

Preparation

A few things we need before we can get started:

  • IntelliJ (14) as the IDE
  • Android SDK with the API packages for level 19 (4.4.2) and 20 (4.4W) and the Android Support Library V4 (20) installed

Android SDK

  • For lack of a real Android Wear device, we start an emulated one from the AVD Manager

AVD Wear

To pair with a phone, we need the Android Wear app from the Play Store on the phone. Pairing the emulated Wear device with a phone connected via USB only works after the following command has been entered on the command line (in the platform-tools directory of the android-sdk):

~/development/android-sdk-mac_86/platform-tools$ adb -d forward tcp:5601 tcp:5601

Only once the command has been executed without errors can the emulated watch be connected to the phone from within the Android Wear app on the phone. If the phone is disconnected from the computer and reconnected, the command has to be executed again. A detailed description is available from Google or here.

Creating a new Android app

Once we have successfully paired the emulator with the phone, the first notifications already appear on the watch, e.g. incoming mails.

To be able to send notifications ourselves, we create a new project in IntelliJ. On the first screen we select Android on the left and “Gradle: Android Module” on the right. On the following page we have to make a few settings, e.g. the version of the target SDK.

Target SDK

Note: We could have chosen 4.3 here as well, since the Android Wear app is supported from Android 4.3 onwards.

On the next pages we leave the settings as they are, and on the last screen we only select the folder for our project.

Cleaning up the generated project

In our Teamgeist app we naturally need our Teamgeist mascot first of all, so we add it to the drawables 🙂

teamgeist_logo

In activity_main.xml we delete the TextView and create a button instead.

<Button
    android:layout_width="wrap_content"
    android:layout_height="wrap_content"
    android:text="Sende Kudos"
    android:id="@+id/kudo_button" android:layout_centerVertical="true" android:layout_centerHorizontal="true"/>

To work with the button in Java, we obtain a reference to it in the MainActivity#onCreate() method and immediately set an OnClickListener.

@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_main);

    Button kudoButton = (Button)findViewById(R.id.kudo_button);
    kudoButton.setOnClickListener(new View.OnClickListener() {
        @Override
        public void onClick(View view) {
          // our notification code goes here
        }
    });
}

If we start our app now, it should open on the phone with a “Sende Kudos” button on a white background.

Sending a first notification

To send a first notification, we still have to add the V4 support library to our project. To do this, we add a line to the dependencies section of our build.gradle file.

dependencies {
    compile fileTree(dir: 'libs', include: ['*.jar'])
    compile "com.android.support:support-v4:20.0.+"
}

The first time the V4 support library is added to a project, IntelliJ recognizes this and, after asking, creates a dedicated repository for it.

Now we can access the Notification API in the onClick method of the previously created OnClickListener and add the following code.

@Override
public void onClick(View view) {
  // 1. Create a notification with the help of NotificationCompat.Builder and the builder pattern
  Notification notification =
    new NotificationCompat.Builder(MainActivity.this)
      .setSmallIcon(R.drawable.teamgeist_logo)
      .setContentTitle("Notifications?")
      .setContentText("Congratulations, you have sent your first notification")
      .build();

  // 2. We need a NotificationManager
  NotificationManagerCompat notificationManager =
    NotificationManagerCompat.from(MainActivity.this);

  // 3. Send the notification via the NotificationManager and the built notification
  int notificationId = 1;
  notificationManager.notify(notificationId, notification);

}
  1. First, a notification is created using the NotificationCompat.Builder and the builder pattern. Initially we set a title, a text and an image.
  2. Then we need a NotificationManager for sending. We obtain it by calling the from() method of the NotificationManagerCompat class.
  3. After that we are ready to send the notification via the notify method of the NotificationManager. The notificationId is used to distinguish between different notifications of an app.

If we now deploy the app, start it and press the “Sende Kudos” button, we receive our first own notification on the watch.

simple_notification

Background image

Based on the app icon, Android determines a similar background color. A custom image looks much better, though. We achieve this by additionally calling setLargeIcon on the builder.

new NotificationCompat.Builder(MainActivity.this)
 .setLargeIcon(BitmapFactory.decodeResource(getResources(), R.drawable.teamgeist_logo))
 .setSmallIcon(R.drawable.teamgeist_logo)
 .setContentTitle("Notifications?")
 .setContentText("Congratulations, you have sent your first notification")
 .build();

With this, the notification on the watch also gets the Teamgeist mascot as its background.

simple_notification_with_background

User interaction

We can add different user interactions to the notification. With a PendingIntent, for example, a specific activity of our app is invoked and data is passed to it via “extras”. We create the PendingIntent in a separate method.

private PendingIntent createContentIntent() {
    Intent viewIntent = new Intent(MainActivity.this, MainActivity.class);
    viewIntent.putExtra("EventNotified", "1");
    PendingIntent viewPendingIntent =
          PendingIntent.getActivity(MainActivity.this, 0, viewIntent, 0);
    return viewPendingIntent;
}

We pass this intent to the builder by calling setContentIntent.

new NotificationCompat.Builder(MainActivity.this)
 .setLargeIcon(BitmapFactory.decodeResource(getResources(), R.drawable.teamgeist_logo))
 .setSmallIcon(R.drawable.teamgeist_logo)
 .setContentTitle("Notifications?")
 .setContentText("Congratulations, you have sent your first notification")
 .setContentIntent(createContentIntent())
 .build();

By swiping the notification to the left, our new action appears.

PendingIntent

If we now click “Open on phone”, the configured activity opens on the phone, in our case the MainActivity. Unfortunately, the notification so far remains on the watch. To remove it there, we have to check whether the app was started through this user interaction and, if so, cancel the notification. For this we create the method cancelNotificationOnUserInteraction and call it in the MainActivity#onCreate method.

private void cancelNotificationOnUserInteraction() {
    Intent intent = getIntent();
    Bundle extras = intent.getExtras();
    if (extras != null && "1".equals(extras.getString("EventNotified"))) {
        NotificationManagerCompat.from(this).cancel(1);
    }
}

In addition to this standard action, we can add further “actions”. To do so, we create an Action object with the following method,

private NotificationCompat.Action showInBrowser() {
    Intent browserIntent = new Intent(Intent.ACTION_VIEW);
    Uri geoUri = Uri.parse("http://app.teamgeist.io");
    browserIntent.setData(geoUri);
    PendingIntent browserPendingIntent =
            PendingIntent.getActivity(this, 0, browserIntent, 0);

    return new NotificationCompat.Action(
            android.R.drawable.ic_dialog_map, "Open in Browser", browserPendingIntent);
}

and pass the object to the builder via the addAction method.

new NotificationCompat.Builder(MainActivity.this)
 .setLargeIcon(BitmapFactory.decodeResource(getResources(), R.drawable.teamgeist_logo))
 .setSmallIcon(R.drawable.teamgeist_logo)
 .setContentTitle("Notifications?")
 .setContentText("Congratulations, you have sent your first notification")
 .setContentIntent(createContentIntent())
 .addAction(showInBrowser())
 .build();

We can now swipe the notification to the left twice and get another action to choose from. Clicking “Open in Browser” opens our Teamgeist website on the phone.

OpenInBrowserAction

With an action like this we would implement the voting feature. The app on the phone would then have to transmit the vote to the Teamgeist server.

What else is there?

This brings us to the end of our first Android Wear lab. Besides these actions, there are special Wear notification features: the possibility to extend a notification with more than one “page”, for example, or to group notifications. Probably the best-known feature, however, is the ability to reply to a notification by voice.

All of these are potential topics for our next Android lab. And of course we want to connect the app to our Teamgeist server to receive real kudos and “vote” for them ;-).

Developing a Modern Distributed System – Part I: Bootstrapping the Project

A few months ago, our performance and continuous delivery guild decided to gain more hands-on experience with distributed software architectures. As companies like Twitter or Netflix have open-sourced a lot of components from their software stacks, this seemed like a great place to get started. A good introduction is a blog post about the Twitter software stack. However, we did not want to stare at architecture diagrams but rather get our hands dirty and build something ourselves. Do you remember the good old days when the Java Pet Store was new and fancy? We needed something similar for modern distributed architectures and finally settled on building a clone of the popular programmer Q&A site Stack Overflow: Hash-Collision was born. With Hash-Collision we want to address different issues in distributed systems, such as:

  • Decomposition of the application into individual services
  • Bootstrapping of the whole system
  • Routing
  • Distributed service communication
  • Security among services
  • UI integration
