Developing a Modern Distributed System – Part III: Provisioning with Docker, Vagrant and Ansible

Part II of our blog post series on ‘Developing a Modern Distributed System’ featured our first steps with Docker. In a second lab in early 2015, we tried to better understand the required changes in a production-like deployment. Without the assumption of all containers running on the same host – which makes no sense for a scalable architecture – Docker links and docker-compose are no longer valid approaches. We wanted to get the following three-node setup to work:

[Figure: three-node architecture]

First of all, we created an automated Docker Hub build linked to our GitHub repository to rebuild the images on each commit. With that, the machines running the containers no longer had to build the images from Dockerfiles themselves. We used Vagrant to run three standard Ubuntu VMs and Ansible to provision them, which included the following steps:

  • install Docker
  • upload service JARs that should be linked into the containers
  • upload static resources for nginx’s `/var/www` folder
  • run Docker containers with the correct parameters (as we did with docker-compose before), still with some hardcoding to wire up the different hosts

Why Ansible? First, some tool is required to avoid manually typing commands in multiple simultaneous SSH sessions in an environment with multiple hosts. Second, Ansible was an easy choice because some of us already had experience with it while others wanted to give it a try. And last but not least, labs at comSysto are just the right place to experiment with unconventional combinations, see where their limitations are and prove they can still work! We actually achieved that, but after a `vagrant destroy` it took a full 20 minutes on a developer machine to be up and running again. Not exactly the round-trip time you want to have while crafting your code… We needed to optimize.

The biggest improvement came from using a custom Vagrant base box with a ready-to-go Docker environment. Besides installing Docker, we also pre-fetched all images from the Docker Hub right away, which brings a huge productivity boost on slow internet connections. Even if images change, the large base images are typically pretty stable, so download times could be reduced dramatically. The Docker image itself could also be optimized by using a minimal JDK base image such as jeanblanchard/busybox-java:8 instead of dockerfile/java:oracle-java8, which is built on top of Ubuntu.

Furthermore, we used CoreOS instead of Ubuntu as the operating system to make the base box smaller and faster to start up. CoreOS is a minimal OS designed to run Docker containers and do pretty much nothing on top of that. That also means it does not contain Python, which Ansible requires on the target machine in order to provision the VM. Fortunately, a Python interpreter can be installed on the VM using a dedicated coreos-bootstrap role, after which Ansible can manage the machine as usual.

Provisioning the running VMs with updated versions of our services, instead of destroying and rebuilding them from scratch, gave us a round-trip time of slightly more than a minute, of which around 30 seconds were required to rebuild all fat JARs.

Let’s have a closer look at a few aspects of that solution. First, we start a standard CoreOS box with Vagrant, and provision it with the following Ansible playbook:

- hosts: all
  gather_facts: False
  roles:
    - defunctzombie.coreos-bootstrap
- hosts: all
  gather_facts: False
  tasks:
    - name: Prepare latest Docker images to make start of fresh VMs fast in case these are still up-to-date
      command: docker pull {{item}}
      with_items:
        - dockerfile/rabbitmq
        - chkcomsysto/hash-collision-service
        - chkcomsysto/hash-collision-service-debug
        - chkcomsysto/hash-collision-nginx

Using `vagrant package` and `vagrant box add` we immediately create a snapshot of that VM state and make it available as a base box for further usage. After that, the VM has fulfilled its destiny and is destroyed right away. The new base box is used in the `Vagrantfile` of the environment for our application which, again, is provisioned using Ansible. Here is a snippet of the corresponding playbook:
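The base-box round trip described above might be scripted roughly as follows. This is a minimal sketch, not the actual `quickstart-init-box.sh`: the box name `hash-collision-base`, the file name `base.box`, and the `DRY_RUN` switch are assumptions made for illustration. With `DRY_RUN=1` the commands are only printed, so the sequence can be inspected without touching any VM.

```shell
#!/bin/sh
# Sketch: bake a provisioned VM into a reusable Vagrant base box.
# BOX_NAME and BOX_FILE are hypothetical names, not taken from the project.
BOX_NAME="${BOX_NAME:-hash-collision-base}"
BOX_FILE="${BOX_FILE:-base.box}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

build_base_box() {
  # Boot the VM; provisioning pulls the Docker images.
  run vagrant up || return 1
  # Export the current VM state into a box file.
  run vagrant package --output "$BOX_FILE" || return 1
  # Register the box locally, replacing an older one of the same name.
  run vagrant box add --force --name "$BOX_NAME" "$BOX_FILE" || return 1
  # The VM has done its job and can go.
  run vagrant destroy -f || return 1
}
```

Calling `build_base_box` from the directory that contains the base box's `Vagrantfile` would run the four steps in order and stop at the first failure.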

- hosts: service
  sudo: yes
  gather_facts: False
  tasks:
    - name: Create directory for runnable JARs
      file:
        path: /var/hash-collision
        state: directory
    - name: Upload User service runnable JAR
      copy:
        src: ../../../user/build/libs/user-1.0-all.jar
        dest: /var/hash-collision/user-1.0-all.jar
    - name: Pull latest Docker images for all services
      command: docker pull chkcomsysto/hash-collision-service-debug
    - name: Start User service as a Docker container
      command: >
        docker run -d
        -p 7000:7000 -p 17000:10000
        --expose 7000 --expose 10000
        -e PORT=7000 -e HOST=192.168.60.6 -e AMQ_PORT_5672_TCP_ADDR=192.168.60.5 -e AMQ_PORT_5672_TCP_PORT=5672
        -v /var/hash-collision/user-1.0-all.jar:/var/app/app.jar
        chkcomsysto/hash-collision-service-debug
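The `docker run` task above hardcodes ports and IP addresses. One way to see which parts actually vary per host is to assemble the same command line from parameters. The following is only a sketch under that assumption: the function name `service_run_cmd` and its argument order are hypothetical, and the function prints the command rather than executing it.

```shell
#!/bin/sh
# Sketch: assemble the `docker run` command line from per-host parameters.
# Usage: service_run_cmd <jar> <port> <alt_port> <host_ip> <amq_ip>
#   jar       file name of the fat JAR under /var/hash-collision
#   port      service port, published under the same number on the host
#   alt_port  host port that maps to container port 10000
#   host_ip   address of the VM that runs the container
#   amq_ip    address of the VM that runs RabbitMQ
service_run_cmd() {
  jar="$1"; port="$2"; alt_port="$3"; host_ip="$4"; amq_ip="$5"
  # echo joins its arguments with single spaces, giving one command line.
  echo "docker run -d" \
    "-p $port:$port -p $alt_port:10000" \
    "--expose $port --expose 10000" \
    "-e PORT=$port -e HOST=$host_ip" \
    "-e AMQ_PORT_5672_TCP_ADDR=$amq_ip -e AMQ_PORT_5672_TCP_PORT=5672" \
    "-v /var/hash-collision/$jar:/var/app/app.jar" \
    "chkcomsysto/hash-collision-service-debug"
}
```

`service_run_cmd user-1.0-all.jar 7000 17000 192.168.60.6 192.168.60.5` prints the same command as the playbook task. In the playbook itself, the equivalent step would be to move these five values into inventory or group variables.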

Where this leaves us

As we have virtualized pretty much everything, the only prerequisite left was a local Vagrant installation based on VirtualBox. After running a `quickstart-init-box.sh` script to build the Vagrant base box from scratch once, executing a `quickstart-dev-mode.sh` script was sufficient to build the application, start up three VMs with Vagrant, provision them with Ansible, and insert sample data. For a full development round-trip on a running system, another `refresh-dev-mode.sh` script was meant to build the application, provision the running VMs with Ansible, and again insert sample data (not that this is always required as we were still using in-memory storage without any persistence).
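The refresh round trip can be pictured as three steps run in sequence. The sketch below is not the real `refresh-dev-mode.sh`: the build command `./gradlew build` and the sample-data script `./insert-sample-data.sh` are placeholders assumed for illustration, and only `vagrant provision` is a known command. As before, `DRY_RUN=1` prints the steps instead of running them.

```shell
#!/bin/sh
# Sketch: rebuild, re-provision the running VMs, reload sample data.
# BUILD_CMD and SAMPLE_DATA_CMD are hypothetical placeholders.
BUILD_CMD="${BUILD_CMD:-./gradlew build}"
SAMPLE_DATA_CMD="${SAMPLE_DATA_CMD:-./insert-sample-data.sh}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

refresh_dev_mode() {
  # Rebuild the fat JARs (intentionally unquoted so the command is split into words).
  run $BUILD_CMD || return 1
  # Re-run the Ansible provisioner against the VMs that are already up.
  run vagrant provision || return 1
  # In-memory storage is empty after a restart, so load the sample data again.
  run $SAMPLE_DATA_CMD || return 1
}
```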

This setup allows us to run the entire distributed application in a multi-host environment during development. A more realistic approach would of course be to work on one component at a time, verify its implementation via unit tests and integrate it by starting this particular service from the IDE configured against an environment that contains the rest of the application. It is easy to see how our solution could be modified to support that use case.

Next Steps & Challenges

For several reasons, the current state feels pretty immature. First, we need to inject each container's own IP and global port into it. It is questionable whether a container should even need to know its identity. An alternative would be to get rid of the heartbeat approach, in which every service registers itself, and instead build service discovery based on Docker metadata with etcd or the like.
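To make the etcd idea a little more concrete, registration could happen on the host, outside the container, so that the service never needs to know its own address. The sketch below prints a `curl` call against etcd's v2 keys HTTP API. The endpoint `http://127.0.0.1:2379`, the `/services/<name>` key layout, and the 30-second TTL are assumptions for illustration, not part of our setup.

```shell
#!/bin/sh
# Sketch: host-side registration of a running service in etcd (v2 keys API).
# ETCD_URL, the /services/<name> key layout, and the default TTL are assumptions.
ETCD_URL="${ETCD_URL:-http://127.0.0.1:2379}"

# Usage: etcd_register_cmd <service> <host_ip> <host_port> [ttl_seconds]
# Prints the curl call that would store the endpoint under the service's key.
# The entry expires after the TTL unless it is written again.
etcd_register_cmd() {
  service="$1"; host_ip="$2"; host_port="$3"; ttl="${4:-30}"
  echo curl -s -X PUT "$ETCD_URL/v2/keys/services/$service" \
    -d "value=$host_ip:$host_port" -d "ttl=$ttl"
}
```

A host-side agent could take the published port from `docker port <container> 7000` and re-run the registration periodically. Clients would then read the key instead of relying on heartbeats sent by the service itself.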

Another area for improvement is our almost non-existent delivery pipeline. While uploading JARs into VMs is suitable for development, it is far from ideal for a production delivery. The Docker image should be self-contained, but this requires a proper build pipeline and artifact repository that automate every step from changes in the service code to built JARs and fully functional Docker images ready to be started anywhere. Non-local deployments, e.g. on AWS, are also an interesting area of research in which the benefits of Docker are supposed to shine.

Last but not least, we need to work on all kinds of monitoring, which is a critical part of any distributed application. Instead of using SSH to connect to a VM or remote server and then again into the containers running there to see anything, it would be more appropriate to include a dedicated log management service (e.g. ELK) and send all logs there right away. On top of that, well-defined metrics to monitor the general health and state of services can be an additional source of information.
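One possible starting point for central logging is Docker's own logging drivers, which forward a container's stdout and stderr without any change to the service. The sketch below builds the flags for the `syslog` driver. The helper name `log_flags` is hypothetical, and it assumes the log management service accepts syslog over UDP on the given host and port.

```shell
#!/bin/sh
# Sketch: logging flags that make Docker forward a container's output
# to a remote syslog endpoint (assumed to be offered by the log service).
# Usage: log_flags <log_host> [port]
log_flags() {
  log_host="$1"; log_port="${2:-514}"
  printf '%s\n' "--log-driver=syslog --log-opt syslog-address=udp://$log_host:$log_port"
}
```

The flags would be spliced into the existing command, for example `docker run -d $(log_flags 192.168.60.5) chkcomsysto/hash-collision-service-debug`, where the unquoted substitution is deliberately split into separate arguments.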

Obviously, there is a lot left to explore and learn in upcoming labs! Also crazy about DevOps, automation and self-contained deployment artifacts instead of 20th-century-style-delivery? See it all in action at our Continuous Delivery Training!
