Most Docker tutorials out there (the ones in this blog included) assume that you have a single host running all your containers, or a few hosts that you manage manually. While this is nice and simple for explaining the basic concepts, it is probably not the way you want to run your applications in production. In most cases you will have a cluster of servers all running different containers that need to talk to each other and keep functioning properly, even when some of those servers suddenly go offline.

This is the realm of container orchestration and scheduling, a topic that is extremely hot these days, particularly with big players in the industry working on new projects to abstract away most of the complexities inherent to running distributed containers. Amazon recently opened up their EC2 Container Service, Google has Kubernetes, and Mesosphere is becoming pretty popular with the underlying Apache Mesos project.

One additional project that has been gaining a lot of attention in this area is CoreOS. In this post I’m going to try to explore CoreOS and give a basic overview of the problem that it tries to solve, how it works and how to work effectively with it.

Introduction

The idea behind CoreOS is the same as with any other cluster management system: you stop thinking about your individual servers and start thinking about your data center (a cluster of individual servers) as a whole. In other words, you no longer say “run this container on server 1 and this other container on server 2” but “run these 2 containers in my data center” and let the cluster manager take care of where and how to do that.

This also means that if one or more of your servers die, the cluster manager will take the containers that were running on those servers and redistribute them across the remaining healthy nodes.

Following this philosophy, CoreOS is an open source lightweight operating system that comes together with a set of simple tools.
The main building block behind CoreOS is Docker. Since CoreOS doesn’t come with a package manager, everything you want to run on it has to run as a container. It should be noted that while CoreOS fully supports Docker, they are also working on their own container runtime called rkt.

To start and manage all these containers CoreOS uses Fleet. Fleet is based on systemd and extends it to work at the cluster level. In other words, while systemd works as a single-machine init system, fleet works as a cluster-wide init system.

To coordinate the different nodes and let Fleet know where to run your containers, CoreOS provides etcd, a distributed key/value store with strong consistency and partition tolerance. Etcd uses the Raft consensus algorithm to handle communication between the different nodes; this is the same algorithm used by Consul, by the way.

I will go into some detail about each of these tools and how they work together with Docker. But first, let's set up our CoreOS cluster running on Vagrant.

Bootstrapping a CoreOS cluster

The CoreOS documentation has very comprehensive guides for running CoreOS on anything from bare metal hardware to cloud providers to virtualization platforms. You can follow the step-by-step guide here to run a basic cluster locally on Vagrant.

The short version is that you clone this repo and then do some minimal configuration. The relevant files for this part are config.rb and user-data.

The Vagrantfile is pretty generic and reads all the configuration it needs from config.rb, so there's no need to change anything there. The config.rb file looks something like the following:

# Size of the CoreOS cluster created by Vagrant
$num_instances=6

# Official CoreOS channel from which updates should be downloaded
$update_channel='stable'

# Customize VMs
$vm_memory = 2048
$vm_cpus = 2

# Enable port forwarding from guest(s) to host machine, syntax is: { 80 => 8080 }, auto correction is enabled by default.
# 4001 is the default etcd port, we need this if we want to run fleetctl locally on the host
$forwarded_ports = {4001 => 4001}

The file is pretty self-explanatory. You can see that we define the size of our cluster (6 instances) and give the VMs some extra memory and CPUs to run with. Lastly, we forward port 4001, which is the default port used by etcd. We'll see why we want to do this in a bit.

Then we have the user-data file:

#cloud-config
coreos:
  etcd2:
    #generate a new token for each unique cluster from https://discovery.etcd.io/new
    discovery: https://discovery.etcd.io/<token>
    # multi-region and multi-cloud deployments need to use $public_ipv4
    advertise-client-urls: http://$public_ipv4:2379
    initial-advertise-peer-urls: http://$private_ipv4:2380
    # listen on both the official ports and the legacy ports
    # legacy ports can be omitted if your application doesn't depend on them
    listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001
    listen-peer-urls: http://$private_ipv4:2380,http://$private_ipv4:7001
  fleet:
    public-ip: $public_ipv4
  flannel:
    interface: $public_ipv4
  units:
    - name: etcd2.service
      command: start
    - name: fleet.service
      command: start
    - name: docker-tcp.socket
      command: start
      enable: true
      content: |
        [Unit]
        Description=Docker Socket for the API

        [Socket]
        ListenStream=2375
        Service=docker.service
        BindIPv6Only=both

        [Install]
        WantedBy=sockets.target

This is the only file you need to modify before starting the cluster. Go to https://discovery.etcd.io/new in your browser and copy the URL that you get as a response. Now open user-data and paste that URL where it says discovery: https://discovery.etcd.io/<token>.
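
If you prefer to do this from the command line, something along these lines should work (a rough sketch, assuming GNU sed and that user-data still contains the literal <token> placeholder):

# Request a fresh discovery URL and substitute it into user-data
$ TOKEN_URL=$(curl -s https://discovery.etcd.io/new)
$ sed -i "s|https://discovery.etcd.io/<token>|$TOKEN_URL|" user-data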

You can now do a vagrant up and wait while your cluster gets created. When it’s done you should be able to run vagrant status and see the 6 nodes running.
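
If everything went well, the output of vagrant status should look roughly like this (the node names come from the Vagrantfile in the repo, so yours may differ slightly):

$ vagrant status

core-01                   running (virtualbox)
core-02                   running (virtualbox)
core-03                   running (virtualbox)
core-04                   running (virtualbox)
core-05                   running (virtualbox)
core-06                   running (virtualbox)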

Trying out etcd

Now that you have a CoreOS cluster up and running, we can start to play around with the different tools that ship with it. Let's start with etcd, the distributed key/value store.

You’ll need 2 open terminals for this (or tabs, or splits or whatever you use). We’ll ssh into core-01 in one of them (with vagrant ssh core-01) and core-02 in the other (vagrant ssh core-02). Which nodes you ssh into is irrelevant, as long as they are different.

CoreOS comes with a tool to read and write from etcd, called etcdctl. But etcd also exposes an HTTP API that is really intuitive and easy to use. In fact, etcdctl is just a facade in front of this API. We’ll see how to use both here.

Let's start by writing a value. From core-01 run etcdctl set /key1 value1. This command adds a new key/value pair to etcd where the key is key1 and the value is value1. Now from the second node, you can read the value with etcdctl get /key1. You should see value1 as the response.

Note how etcd replicated the value that you wrote on the first node to the second one almost instantaneously. In fact, it replicated the value to all nodes in the cluster not just the two you are ssh’ed into. This is the power of a distributed store.

If you wanted to use the HTTP API you could accomplish the same thing using curl instead of etcdctl. We can write a second key/value pair this way. From core-01 run curl -L -X PUT http://127.0.0.1:4001/v2/keys/key2 -d value="value2". To read the value from core-02, run curl -L http://127.0.0.1:4001/v2/keys/key2. Admittedly, the etcdctl tool simplifies things a little bit, but both options are there to choose from.

Another really interesting thing that etcd provides is a TTL (time to live) for each entry. This is quite useful when we use etcd for things like service discovery, where we don't want to be reading stale values. To use it you simply pass the --ttl parameter when you set a value. To see this in action go back to core-01 and run etcdctl set /key3 value3 --ttl 15. This adds the new key with a TTL of 15 seconds. If you go to core-02 now and run etcdctl get /key3 you should see its value (provided it took you less than 15 seconds to get there). Now wait for a while and run the same get again. The key is gone!
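
The HTTP API exposes the same feature through a ttl form parameter. A rough equivalent of the command above (with a different key name, just for illustration) would be:

$ curl -L -X PUT http://127.0.0.1:4001/v2/keys/key4 -d value="value4" -d ttl=15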

Finally, if you want to list all the keys currently stored within etcd you can use the etcdctl ls command. This will print the keys available at the root level. If you want to print keys at every level you can pass the --recursive flag (as in etcdctl ls --recursive).

Etcd provides some other cool functionality (like atomic test-and-set updates, directories and event notifications) that is well documented if you run etcdctl help.
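
As a small taste of those features, this is roughly what watching a key and doing an atomic compare-and-swap look like with etcdctl (the key names are just the ones we used above):

# Block until /key1 changes and print its new value
$ etcdctl watch /key1

# Only update /key1 if its current value is value1
$ etcdctl set /key1 new-value --swap-with-value value1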

Starting your first Fleet unit

As I mentioned in the introduction, CoreOS comes with a cluster manager called Fleet. You use the fleetctl tool to interact with the cluster. To see this in action, ssh into one of the nodes and run fleetctl list-machines:

$ vagrant ssh core-01

core-01$ fleetctl list-machines

MACHINE         IP              METADATA
0b9fd6f8...     172.17.8.104    -
128a2e32...     172.17.8.103    -
2addf739...     172.17.8.101    -
3f608471...     172.17.8.106    -
73c0b7fc...     172.17.8.105    -
eabc97ed...     172.17.8.102    -

We can see that our 6 CoreOS nodes are automatically recognized by Fleet as being part of the same cluster.

Like I said before, we now only care about our cluster and not about individual nodes. These nodes are completely ephemeral and we should assume they can come and go without prior notice. For this reason, it doesn't matter from which node we run the previous fleetctl command. We could've ssh'ed into core-06 and the result would've been exactly the same.

Using fleetctl from your host

Being able to run fleetctl from within any node is great, but even better is being able to run it from outside the cluster as well. For now our cluster is running locally on Vagrant, but the same setup could be running in AWS and we would probably like to control it from our laptop without having to ssh into individual instances first.

Luckily, we can do this easily using the --tunnel flag. From your laptop run:

$ fleetctl --tunnel 127.0.0.1:2222 list-machines

MACHINE         IP              METADATA
0b9fd6f8...     172.17.8.104    -
128a2e32...     172.17.8.103    -
2addf739...     172.17.8.101    -
3f608471...     172.17.8.106    -
73c0b7fc...     172.17.8.105    -
eabc97ed...     172.17.8.102    -

This basically tunnels all communication with your cluster over SSH using the IP and port specified. Port 2222 is the default port that Vagrant uses to SSH into your VM (you can see this by running vagrant ssh-config).
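
For reference, the relevant part of the vagrant ssh-config output looks roughly like this (trimmed, and the identity file path will depend on your machine):

$ vagrant ssh-config core-01

Host core-01
  HostName 127.0.0.1
  User core
  Port 2222
  IdentityFile /Users/you/.vagrant.d/insecure_private_key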

If you get a message saying something like Failed initializing SSH client: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain, make sure that the Vagrant insecure ssh key is added to your ssh-agent by running ssh-add ~/.vagrant.d/insecure_private_key.

To make fleetctl commands a bit less verbose we can actually put the tunnel configuration into an environment variable:

$ export FLEETCTL_TUNNEL=127.0.0.1:2222

Then we'll be able to run fleetctl just as we would from inside one of our nodes:

$ fleetctl list-machines

MACHINE         IP              METADATA
0b9fd6f8...     172.17.8.104    -
128a2e32...     172.17.8.103    -
2addf739...     172.17.8.101    -
3f608471...     172.17.8.106    -
73c0b7fc...     172.17.8.105    -
eabc97ed...     172.17.8.102    -

Running Fleet units

Having our nodes up and running with Fleet is great, but by itself the cluster is not doing anything useful. We want to start telling it to run some services for us. This is where Fleet units come into play.

As I mentioned before, Fleet can be seen as systemd working at the cluster level instead of at the individual machine level. As such, in order to run anything with Fleet you submit regular systemd unit files combined with some Fleet-specific properties.

A unit file defines what process you want to run and gives Fleet some hints to help it determine how and where that process should be executed. To get started, let's see what the unit file to run our good old python service would look like:

$ cat python-test.service

[Unit]
Description=Python service
Requires=docker.service
After=docker.service

[Service]
TimeoutStartSec=0
Restart=on-failure

ExecStartPre=-/usr/bin/docker kill python-service
ExecStartPre=-/usr/bin/docker rm python-service
ExecStartPre=/usr/bin/docker pull jlordiales/python-micro-service

ExecStart=/usr/bin/docker run --name python-service -P jlordiales/python-micro-service

ExecStop=/usr/bin/docker stop python-service

Let's go through the unit file and see what each section is doing. The first line simply sets a description for our unit, which is helpful when looking at all the units that are currently running. The following 2 lines, Requires and After, specify ordering dependencies between units (the full documentation can be seen here). Since we are running a docker container we need the docker process to be started first. This dependency also means that if the docker unit is stopped, this python-test.service unit will also be stopped.

We then have the [Service] section, which effectively describes how our service should run. We first tell systemd not to wait for a completion signal from our service (with TimeoutStartSec=0). Next, we ask systemd to restart our container whenever it exits unexpectedly (with an exit code different from 0). This is extremely useful if we want a self-healing cluster, and we'll see how it works in a moment.

Finally, the Exec* commands tell systemd how to run our container. The ExecStartPre commands run before our container is started and are basically there to set up the environment and ensure that our main process can run smoothly. In our example, we make sure that no container with the same name is running by doing a docker kill and docker rm. Note that these 2 lines are prefixed with a - before the command to run. This is very important because by default systemd executes the commands in the order they are specified and stops as soon as one of them returns a non-zero exit code. By prefixing a command with - systemd will ignore its exit code and continue with the next one. We need to do that for docker kill and docker rm because those commands will fail if there is no container named python-service.

The 2 remaining lines are pretty self-explanatory. ExecStart is the command that starts the main process for our unit. In our case we run our container as we usually would, specifying a name and -P to expose its ports. One important thing to notice here is that we don't pass the -d flag to docker (to run in detached mode). If we did, the unit would run for a few seconds and then exit, because the container would not be started as a child of the unit's PID, which basically means that from the unit's point of view there is nothing to keep running. The ExecStop command on the last line will do a docker stop whenever we tell systemd to stop our unit.

So now that we have our unit file, how do we run it? Well, first we need to load the unit into our cluster because so far it's only a text file that we have edited in our local environment (outside of any of the CoreOS hosts). We can do this with fleetctl submit python-test.service. To see that the unit was actually submitted we can run fleetctl list-unit-files, which gives us the list of all units that our cluster knows about. You can even look at the contents of the unit with fleetctl cat python-test.service.
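
The submit and list steps look roughly like this (the hash and the exact column values will of course differ):

$ fleetctl submit python-test.service

$ fleetctl list-unit-files

UNIT                    HASH    DSTATE          STATE           TARGET
python-test.service     52d0b9f inactive        inactive        -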

With the unit file submitted we can now do:

$ fleetctl start python-test.service

Unit python-test.service launched on 0b9fd6f8.../172.17.8.104

In this case, Fleet decided that the node 172.17.8.104 was good enough to run our container. If we want to see all the currently running units we can do that with:

$ fleetctl list-units

UNIT                    MACHINE                         ACTIVE  SUB
python-test.service     0b9fd6f8.../172.17.8.104        active  running

We can also check the status of any given unit:

$ fleetctl status python-test.service

● python-test.service - Python service
   Loaded: loaded (/run/fleet/units/python-test.service; linked-runtime; vendor preset: disabled)
   Active: active (running) since Sat 2015-07-04 1022 ; 50s ago
  Process: 1726 ExecStartPre=/usr/bin/docker pull jlordiales/python-micro-service (code=exited, status=0/SUCCESS)
  Process: 1719 ExecStartPre=/usr/bin/docker rm python-service (code=exited, status=1/FAILURE)
  Process: 1655 ExecStartPre=/usr/bin/docker kill python-service (code=exited, status=1/FAILURE)
 Main PID: 1776 (docker)
   Memory: 8.3M
   CGroup: /system.slice/python-test.service
           └─1776 /usr/bin/docker run --name python-service -P jlordiales/python-micro-service

Jul 04 1007 core-04 docker[1726]: 595ded12b855: Pulling fs layer
Jul 04 1009 core-04 docker[1726]: 595ded12b855: Download complete
Jul 04 1009 core-04 docker[1726]: 7e0b582bc16d: Pulling metadata
Jul 04 1010 core-04 docker[1726]: 7e0b582bc16d: Pulling fs layer
Jul 04 1022 core-04 docker[1726]: 7e0b582bc16d: Download complete
Jul 04 1022 core-04 docker[1726]: 7e0b582bc16d: Download complete
Jul 04 1022 core-04 docker[1726]: Status: Downloaded newer image for jlordiales/python-micro-service:latest
Jul 04 1022 core-04 systemd[1]: Started Python service.
Jul 04 1022 core-04 docker[1776]: * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
Jul 04 1022 core-04 docker[1776]: * Restarting with stat

There we can see that our process is active and running. We can also see the exit codes of the 3 ExecStartPre instructions we discussed before. Finally, we can see some of the output from each process. To make sure that our container is running and responding where Fleet says it is, we can grab the machine ID from the output of fleetctl list-units (0b9fd6f8 in our case) and ssh directly into it with:

$ fleetctl list-units

UNIT                    MACHINE                         ACTIVE  SUB
python-test.service     0b9fd6f8.../172.17.8.104        active  running

$ fleetctl ssh 0b9fd6f8

core-04$ docker ps

CONTAINER ID        IMAGE                                    COMMAND             CREATED             STATUS              PORTS                     NAMES
bd6681b7eef3        jlordiales/python-micro-service:latest   "python app.py"     19 minutes ago      Up 19 minutes       0.0.0.0:32768->5000/tcp   python-service

core-04$ curl localhost:32768

Hello World from bd6681b7eef3

Self-healing nodes

When I was describing the unit file for the python unit, I briefly mentioned a property called Restart=on-failure, which means that systemd will automatically restart the process if it exits with an exit code different from 0. Let's see if this really works in our example. We'll ssh again into the node running our container and kill it to see what happens:

$ fleetctl ssh 0b9fd6f8

core-04$ docker ps

CONTAINER ID        IMAGE                                    COMMAND             CREATED             STATUS              PORTS                     NAMES
bd6681b7eef3        jlordiales/python-micro-service:latest   "python app.py"     29 minutes ago      Up 29 minutes       0.0.0.0:32768->5000/tcp   python-service

core-04$ docker kill python-service

core-04$ docker ps

CONTAINER ID        IMAGE                                    COMMAND             CREATED             STATUS              PORTS                     NAMES
1da6c1b5cafc        jlordiales/python-micro-service:latest   "python app.py"     45 seconds ago      Up 44 seconds       0.0.0.0:32769->5000/tcp   python-service

Awesome! We killed the first running container and within seconds systemd started a new one for us. If, however, we use docker stop instead of docker kill (therefore stopping the container gracefully), systemd won't try to restart it.

That is great if the process is killed for some reason, but what happens if the entire node disappears all of a sudden? I won't show it here, but you can easily simulate this by doing a vagrant halt on the VM where your unit was placed. Fleet will detect that the node is dead and redistribute all the units that were running on that node across the rest of the cluster.
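
If you want to try it yourself, the rough sequence would be something like the following (the node name is just an example, and it can take Fleet a little while to notice that the machine is gone):

# From your host, stop the VM where the unit was running
$ vagrant halt core-04

# After a short wait, the unit should show up on a different machine
$ fleetctl list-units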

High availability services

One of the main benefits of using Fleet to manage our units is that it becomes really easy to run a highly available service with multiple instances running on different nodes. This, combined with the self-healing property we discussed in the previous section, gives you a lot of power to do pretty cool stuff.

This replication of a service across your nodes is enabled by something called template unit files. This basically means that you can write a regular unit file like the one we wrote before and use it as a template to instantiate new units. The only difference is in the name of the unit file, which should now follow the pattern <name>@.<suffix>. For example, our previous python-test.service becomes python-test@.service.

Let's rename our unit file and see how we can start as many instances of our python container as we want. But first, remove the unit we loaded before with fleetctl destroy python-test.service. Now we can rename our unit and submit it to our cluster in the same way we did for the first one:

$ mv python-test.service python-test@.service
$ fleetctl submit python-test@.service

With our template loaded in the cluster we can now start instances of that template using the name and some suffix after the @. For instance:

$ fleetctl start python-test@1 python-test@2 python-test@random

$ fleetctl list-units

UNIT                            MACHINE                         ACTIVE  SUB
python-test@1.service           0b9fd6f8.../172.17.8.104        active  running
python-test@2.service           128a2e32.../172.17.8.103        active  running
python-test@random.service      3f608471.../172.17.8.106        active  running

Here we can see that we started 3 instances of our python container. We can also use the shell's brace expansion to start multiple instances at once:

$ fleetctl start python-test@{3..5}

$ fleetctl list-units

UNIT                            MACHINE                         ACTIVE  SUB
python-test@1.service           0b9fd6f8.../172.17.8.104        active  running
python-test@2.service           128a2e32.../172.17.8.103        active  running
python-test@3.service           73c0b7fc.../172.17.8.105        active  running
python-test@4.service           eabc97ed.../172.17.8.102        active  running
python-test@5.service           2addf739.../172.17.8.101        active  running
python-test@random.service      3f608471.../172.17.8.106        active  running

Telling Fleet where to run your containers

By default Fleet makes no guarantees as to where in the cluster your units will run. In the last example from the previous section we started 6 different instances of our python unit, and it just so happens that Fleet decided to run one on each node.

So what do we do if we have dependencies between our different units? Imagine for instance that you have 2 different containers, one running your application and one running a monitoring agent for that application. In that case you always want those 2 running on the same node. Similarly, if you run multiple instances of your service to scale horizontally, you want those instances on different nodes.

To do this, Fleet provides a set of fleet-specific options that allow you to control how Fleet's scheduling engine works. We'll see how these work with some examples. But first we'll need another unit file:

$ cat hello-world@.service

[Unit]
Description=MyApp
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill busybox1
ExecStartPre=-/usr/bin/docker rm busybox1
ExecStartPre=/usr/bin/docker pull busybox
ExecStart=/usr/bin/docker run --name busybox1 busybox /bin/sh -c "while true; do echo Hello World; sleep 1; done"
ExecStop=/usr/bin/docker stop busybox1

This simply runs a container that will keep printing a Hello World message to stdout.
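
Once an instance of it is running (we'll start hello-world@1 in a moment), you can check that it's actually printing something by asking Fleet for its journal:

$ fleetctl journal hello-world@1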

Running units together

Now imagine that we want to run this hello-world unit together with our python-test one, but we want to make sure that these 2 always run together on the same node. We can use the MachineOf Fleet attribute to achieve this.

$ cat python-test@.service

[Unit]
Description=Python service
Requires=docker.service
After=docker.service

[Service]
TimeoutStartSec=0
Restart=on-failure

ExecStartPre=-/usr/bin/docker kill python-service
ExecStartPre=-/usr/bin/docker rm python-service
ExecStartPre=/usr/bin/docker pull jlordiales/python-micro-service

ExecStart=/usr/bin/docker run --name python-service -P jlordiales/python-micro-service

ExecStop=/usr/bin/docker stop python-service

[X-Fleet]
MachineOf=hello-world@%i.service

We added the [X-Fleet] section to our unit file specifying that our unit should only be placed wherever there is also a hello-world unit running. Let's see what happens when we submit these 2 units to our cluster (if you still have the python-test@ instances from the previous section running, destroy them first so the updated unit file can be submitted):

$ fleetctl submit python-test@.service hello-world@.service
$ fleetctl start python-test@1 hello-world@1

$ fleetctl list-units

UNIT                    MACHINE                         ACTIVE  SUB
hello-world@1.service   0b9fd6f8.../172.17.8.104        active  running
python-test@1.service   0b9fd6f8.../172.17.8.104        active  running

As we expected, the 2 units were scheduled on the same node. The same thing would happen if we start multiple instances of each unit at the same time:

$ fleetctl start python-test@{2..4} hello-world@{2..4}

$ fleetctl list-units

UNIT                    MACHINE                         ACTIVE  SUB
hello-world@1.service   0b9fd6f8.../172.17.8.104        active  running
hello-world@2.service   128a2e32.../172.17.8.103        active  running
hello-world@3.service   2addf739.../172.17.8.101        active  running
hello-world@4.service   3f608471.../172.17.8.106        active  running
python-test@1.service   0b9fd6f8.../172.17.8.104        active  running
python-test@2.service   128a2e32.../172.17.8.103        active  running
python-test@3.service   2addf739.../172.17.8.101        active  running
python-test@4.service   3f608471.../172.17.8.106        active  running

This dependency between units also means that if the unit we depend on (hello-world in our example) is destroyed, then all the units that depended on it (python-test in our example) will also be destroyed. We can see this if we do:

$ fleetctl destroy hello-world@4

$ fleetctl list-units
UNIT                    MACHINE                         ACTIVE  SUB
hello-world@1.service   0b9fd6f8.../172.17.8.104        active  running
hello-world@2.service   128a2e32.../172.17.8.103        active  running
hello-world@3.service   2addf739.../172.17.8.101        active  running
python-test@1.service   0b9fd6f8.../172.17.8.104        active  running
python-test@2.service   128a2e32.../172.17.8.103        active  running
python-test@3.service   2addf739.../172.17.8.101        active  running

We removed hello-world@4 and Fleet automatically removed python-test@4 as well.

Running units away from each other

We saw how to run multiple units while guaranteeing that they will always be placed on the same node. How about the opposite scenario: running 2 or more units while making sure that they are never put on the same node? We can use Fleet's Conflicts option to achieve this. Let's change our python-test@.service unit file to use this new option:

$ cat python-test@.service

[Unit]
Description=Python service
Requires=docker.service
After=docker.service

[Service]
TimeoutStartSec=0
Restart=on-failure

ExecStartPre=-/usr/bin/docker kill python-service
ExecStartPre=-/usr/bin/docker rm python-service
ExecStartPre=/usr/bin/docker pull jlordiales/python-micro-service

ExecStart=/usr/bin/docker run --name python-service -P jlordiales/python-micro-service

ExecStop=/usr/bin/docker stop python-service

[X-Fleet]
Conflicts=hello-world@%i.service

If we submit this new unit file and run multiple instances of our 2 services, we'll see that Fleet places them on different nodes. If Fleet cannot find a placement that satisfies the constraints specified in the unit files, it will simply refuse to schedule them.
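
A related trick, sketched below, is to make a template conflict with other instances of itself, so that no two copies of the same service ever end up sharing a node (Conflicts supports glob matching on unit names):

[X-Fleet]
Conflicts=python-test@*.service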

Conclusion

Container orchestration and scheduling is an exciting and relatively new area that is under heavy development by different players. CoreOS presents an easy and lightweight approach using etcd, Fleet and Docker as its backbone. In this post we saw how easy it is to create a local CoreOS cluster with Vagrant and run highly available and self-healing services with the help of Fleet.

By combining a few simple configuration values, we can ensure that our services are distributed across different regions and availability zones. This, combined with the fact that we can run CoreOS on pretty much any cloud provider or hardware, enables us to build fairly complex architectures with very little manual intervention.
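
As a hint of how that region and availability zone awareness can work (we didn't cover it in detail in this post), Fleet lets you attach arbitrary metadata to each machine in its cloud-config section and then target that metadata from a unit's [X-Fleet] section. The values below are purely illustrative:

# In the cloud-config of the nodes belonging to a given region:
fleet:
  metadata: "region=us-east-1,az=us-east-1a"

# In the unit file that should only run in that region:
[X-Fleet]
MachineMetadata=region=us-east-1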