Most Docker tutorials that you’ll find out there (the ones in this blog included) assume that you have a single host running all your containers, or a few hosts that you manage manually. While this keeps things nice and simple for explaining the basic concepts, it is probably not the way you want to run your applications in production. In most cases you will have a cluster of servers all running different containers that need to talk to each other and keep functioning properly, even when some of those servers suddenly go offline.
This is the area of orchestration and scheduling of containers, a topic that is extremely hot these days, particularly with big players in the industry working on new projects to abstract away most of the complexities inherent in running distributed containers. Amazon recently opened up their EC2 Container Service, Google has Kubernetes, and Mesosphere is becoming pretty popular with the underlying Apache Mesos project.
One additional project that has been gaining a lot of attention in this area is CoreOS. In this post I’m going to try to explore CoreOS and give a basic overview of the problem that it tries to solve, how it works and how to work effectively with it.
The idea behind CoreOS is the same as with any other cluster management system. You stop thinking about your individual servers and how they work together. Instead you think about your data center (a cluster of individual servers). In other words, you no longer say “run this container in server 1 and this other container in server 2” but “run these 2 containers in my data center” and let the cluster manager take care of where and how to do that.
This also means that if one or more of your individual servers die, the cluster manager will take the containers that were running on those servers and redistribute them across the remaining healthy nodes.
Following this philosophy, CoreOS is a lightweight open source operating system that comes together with a set of simple tools.
The main building block behind CoreOS is Docker. Since CoreOS doesn’t come with a package manager, everything you want to run on it has to run as a container. It should be noted that while CoreOS fully supports Docker, they are also working on their own container runtime called rkt.
To start and manage all these containers CoreOS uses Fleet. Fleet is based on systemd and extends it in order to work at the cluster level. In other words, while systemd works as a single machine init system, fleet works as a cluster init system.
To coordinate all the different nodes and let Fleet know where to run your containers, CoreOS provides etcd, a distributed key/value store with a strong consistency and partition tolerance model. Etcd uses the Raft consensus algorithm to keep the different nodes in agreement; this is the same algorithm used by Consul, by the way.
I will go into some detail about each of these tools and how they can work together with Docker. But first, let's set up our CoreOS cluster running on Vagrant.
Bootstrapping a CoreOS cluster
The CoreOS documentation has very comprehensive guides to run CoreOS on anything from bare metal hardware, to cloud providers, to virtualization platforms. You can follow the step-by-step guide here to run a basic cluster locally on Vagrant.
The short version is that you can clone this repo and then do some minimal configuration. The relevant files for this part are the Vagrantfile and config.rb. The Vagrantfile is pretty generic and reads all the configuration it needs from config.rb, so no need to change anything there. This latter file looks something like the following (a sketch; the variable names follow the coreos-vagrant conventions, except for the port-forwarding entry, which is an assumption about how this setup exposes etcd):
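```
# config.rb (sketch) — read by the Vagrantfile at startup
$num_instances = 6      # size of the cluster
$vm_memory = 2048       # extra memory per node, in MB
$vm_cpus = 2            # extra CPUs per node
$forwarded_ports = {4001 => 4001}  # hypothetical name: forward etcd's default port
```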
The file is pretty self-explanatory. You can see that we define the size of our cluster (6 instances) and we give it some extra memory and CPUs to run on. Lastly, we forward port 4001, which is the default port used by etcd. We’ll see why we want to do this in a bit.
Then we have the user-data file. This is the only file you need to modify before starting the cluster. Go to https://discovery.etcd.io/new in your browser and copy the URL that you get as a response there. Now go to user-data and paste that URL where the discovery token goes. You can now do a vagrant up and wait while your cluster gets created. When it’s done you should be able to run vagrant status and see the 6 nodes running, roughly like this (output abridged):
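```
$ vagrant status
core-01    running (virtualbox)
core-02    running (virtualbox)
core-03    running (virtualbox)
core-04    running (virtualbox)
core-05    running (virtualbox)
core-06    running (virtualbox)
```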
Trying out etcd
Now that you have a CoreOS cluster up and running, we can start to play around with the different tools that are shipped with it. Let's start with etcd, the distributed key/value store.
You’ll need 2 open terminals for this (or tabs, or splits or whatever you use).
We’ll ssh into core-01 in one of them (with vagrant ssh core-01) and core-02 in the other (vagrant ssh core-02). Which nodes you ssh into is irrelevant, as long as they are different.
CoreOS comes with a tool to read and write from etcd, called etcdctl. But etcd also exposes an HTTP API that is really intuitive and easy to use. In fact, etcdctl is just a facade in front of this API. We’ll see how to use both here.
Let's start by writing a value. From core-01 do an etcdctl set /key1 value1. This command adds a new key/value pair to etcd where the key is key1 and the value is value1. Now from the second node, you can read the value with etcdctl get /key1. You should see value1 as a response. The whole exchange looks like this:
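```
core@core-01 ~ $ etcdctl set /key1 value1
value1

core@core-02 ~ $ etcdctl get /key1
value1
```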
Note how etcd replicated the value that you wrote on the first node to the second one almost instantaneously. In fact, it replicated the value to all nodes in the cluster not just the two you are ssh’ed into. This is the power of a distributed store.
If you wanted to use the HTTP API you could have accomplished the same thing with curl instead of etcdctl. To write a second key/value pair in this way, from core-01 you can do curl -L -X PUT http://127.0.0.1:4001/v2/keys/key2 -d value="value2". Now to read the value from core-02 you can do a curl -L http://127.0.0.1:4001/v2/keys/key2. The etcdctl tool simplifies things a little bit, but both options are there to choose from.
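For reference, the raw HTTP responses look roughly like this (the index values will differ in your cluster):

```
core@core-01 ~ $ curl -L -X PUT http://127.0.0.1:4001/v2/keys/key2 -d value="value2"
{"action":"set","node":{"key":"/key2","value":"value2","modifiedIndex":102,"createdIndex":102}}

core@core-02 ~ $ curl -L http://127.0.0.1:4001/v2/keys/key2
{"action":"get","node":{"key":"/key2","value":"value2","modifiedIndex":102,"createdIndex":102}}
```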
Another really interesting thing that etcd provides is a TTL (time to live) for each entry. This is quite useful when we use etcd for things like service discovery, where we don’t want to be reading stale values. To use it you simply pass the --ttl parameter when you set a value.
To see this in action go back to core-01 and do an etcdctl set /key3 value3 --ttl 15. This will add the new key with a TTL of 15 seconds. If you go to core-02 now and do an etcdctl get /key3 you should see its value (provided it took you less than 15 seconds to do that). Now wait for a while and run the same get again. The key is gone!
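The sequence looks something like this (the exact error format may vary with your etcd version):

```
core@core-02 ~ $ etcdctl get /key3
value3

# more than 15 seconds later...
core@core-02 ~ $ etcdctl get /key3
Error: 100: Key not found (/key3)
```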
Finally, if you want to list all the currently stored keys within etcd you can use the etcdctl ls command. This will print the keys available at the root level. Alternatively, if you want to print keys at any level you can pass the --recursive flag (as in etcdctl ls --recursive).
Etcd provides some other cool functionalities (like atomic test-and-set updates, directories, and event notifications) that are well covered in the official documentation.
Starting your first Fleet unit
As I mentioned in the introduction, CoreOS comes with a cluster manager called Fleet. You’ll use the fleetctl tool to interact with the cluster. To see this in action ssh into one of the nodes and do a fleetctl list-machines. You should see something like this (the machine IDs will of course differ in your cluster):
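```
$ fleetctl list-machines
MACHINE     IP            METADATA
14c2ec89... 172.17.8.101  -
1767105b... 172.17.8.102  -
406c45f4... 172.17.8.103  -
0b9fd6f8... 172.17.8.104  -
8c845b69... 172.17.8.105  -
d38eec30... 172.17.8.106  -
```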
We can see that our 6 CoreOS nodes are automatically recognized by Fleet as being part of the same cluster.
Like I said before, we now only care about our cluster and not our individual nodes. These nodes are completely ephemeral and we should assume they can come and go without notice. For this reason, it doesn’t matter from which node we run the previous fleetctl command. We could’ve sshed into core-06 and the result would’ve been exactly the same.
Using fleetctl from your host
Being able to run fleetctl from within any node is great, but even better is to be able to run it from outside the cluster as well. For now our cluster is running locally on Vagrant, but the same setup could be running in AWS and we would probably like to control it from our laptop without having to ssh into individual instances first.
Luckily, we can do this easily using the --tunnel flag. From your laptop run something like the following (any fleetctl subcommand works the same way):
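```
$ fleetctl --tunnel 127.0.0.1:2222 list-machines
```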
This basically tunnels all communication with your cluster over SSH using the IP and port specified. Port 2222 is the default port that Vagrant uses to SSH into your first VM (you can see this by running vagrant ssh-config).
If you get a message saying something like Failed initializing SSH client: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain, make sure that your Vagrant insecure ssh key is added to your ssh-agent (typically with ssh-add ~/.vagrant.d/insecure_private_key). To make our fleetctl commands a bit less verbose we can actually put the tunnel configuration into an environment variable:
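```
$ export FLEETCTL_TUNNEL=127.0.0.1:2222
```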
Then we’ll be able to run Fleet just as we would if we were inside one of our nodes:
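```
$ fleetctl list-machines   # same output as when running inside the cluster
```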
Running Fleet units
Having our nodes up and running with Fleet is great but it is not doing anything useful by itself. We want to start telling our cluster to run some services for us. This is where Fleet Units come into play.
As I mentioned before, Fleet can be seen as systemd working at the cluster level instead of at the individual machine level. As such, in order to run anything with Fleet you need to submit regular systemd unit files combined with some Fleet-specific properties.
A unit file defines what process you want to run and gives Fleet some hints to help it determine how and where that process should be executed. To get started, let's see what the unit file to run our good old python service could look like. Here is a reconstruction (the image name my-user/python-test is a placeholder for whatever image you built in the earlier posts):
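```
# python-test.service
[Unit]
Description=Python test service
Requires=docker.service
After=docker.service

[Service]
TimeoutStartSec=0
Restart=on-failure
ExecStartPre=-/usr/bin/docker kill python-test
ExecStartPre=-/usr/bin/docker rm python-test
ExecStartPre=/usr/bin/docker pull my-user/python-test
ExecStart=/usr/bin/docker run --name python-test -P my-user/python-test
ExecStop=/usr/bin/docker stop python-test
```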
Let's go through the unit file and see what each section is doing. The first line simply sets a description for our unit, which is helpful when looking at all the units that are currently running. The following two lines, Requires and After, specify ordering dependencies between units (the full documentation can be seen here). Since we are running a docker container we need the docker process to be started first. This dependency also means that if the docker unit is stopped, this python-test.service unit will also be stopped.
We then have the [Service] section, which effectively describes how our service should run. We first disable the startup timeout with TimeoutStartSec=0, since pulling a Docker image can take a while. Next, we ask systemd to restart our container whenever it exits unexpectedly (exit code different than 0). This is extremely useful if we want to have a self-healing cluster, and we’ll see how this works in a moment.
Finally, the Exec* commands tell systemd how to run our container. The ExecStartPre commands are run before our container is started and are basically there to set up the environment so that our main process can run smoothly. In our example, we make sure that no container with the same name is running by doing a docker kill and docker rm, and we pull the latest version of the image. Note that the first 2 lines are prefixed with a - before the command to run. This is very important because by default systemd will execute the commands in the order they are specified and will stop as soon as one of them returns a non-zero exit code. By prefixing a command with -, systemd will ignore its exit code and continue executing the next one. We need to do that for docker kill and docker rm because those commands will fail if there is no container named python-test.
The 2 remaining lines are pretty self-explanatory. ExecStart is the command that will start the main process for our unit. In our case we run our container as we usually would, specifying a name and the -P flag to expose its ports. One important thing to notice here is that we don’t pass the -d flag to docker (to run in detached mode). If we did, the unit would run for a few seconds and then exit, because the container would not be started as a child of the unit’s PID, which basically means that from the unit’s point of view there is nothing left to keep alive.
The ExecStop command in the last line will do a docker stop whenever we tell systemd to stop our unit.
So now that we have our unit file, how do we run it? Well, first we need to load the unit into our cluster, because so far this is only a text file that we have edited in our local environment (outside of any of the CoreOS hosts). We can do this with fleetctl submit python-test.service. To see that the unit was actually submitted we can do a fleetctl list-unit-files, which should give us the list of all units that our cluster knows about. You can even look at the contents of the unit with fleetctl cat python-test.service.
With the unit file submitted we can now start it (your machine ID and IP will of course differ):
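```
$ fleetctl start python-test.service
Unit python-test.service launched on 0b9fd6f8.../172.17.8.104
```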
In this case, Fleet decided that the node 172.17.8.104 was good enough to run our container. If we want to see all the currently running units we can do that with:
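```
$ fleetctl list-units
UNIT                 MACHINE                   ACTIVE  SUB
python-test.service  0b9fd6f8.../172.17.8.104  active  running
```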
We can also check the status of any given unit (output abridged; the timestamps and PIDs below are illustrative):
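```
$ fleetctl status python-test.service
● python-test.service - Python test service
   Loaded: loaded (/run/fleet/units/python-test.service; linked-runtime)
   Active: active (running) since Thu ...
  Process: ExecStartPre=/usr/bin/docker kill python-test (code=exited, status=1/FAILURE)
  Process: ExecStartPre=/usr/bin/docker rm python-test (code=exited, status=1/FAILURE)
  Process: ExecStartPre=/usr/bin/docker pull my-user/python-test (code=exited, status=0/SUCCESS)
 Main PID: 2517 (docker)
   ...
```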
There we can see that our process is active and running. We can also see the
exit code of the 3 ExecStartPre instructions we discussed before. Finally, we
can see some of the output from each process.
To make sure that our container is running and responding where Fleet says it is, we can get the machine ID from the output of fleetctl list-units that we saw before (0b9fd6f8 in our case) and ssh directly into it with:
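```
$ fleetctl ssh 0b9fd6f8
core@core-04 ~ $ docker ps   # find the host port that -P mapped
core@core-04 ~ $ curl localhost:32769   # port illustrative; use the one docker ps shows
```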
When I was describing the unit file for the python unit, I briefly mentioned the Restart=on-failure line, which means that systemd will automatically restart the process if it exits with an exit code different than 0. Let's see if this really works in our example. We’ll ssh again into the node running our container and kill it to see what happens:
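```
core@core-04 ~ $ docker kill python-test
python-test
core@core-04 ~ $ docker ps    # right away: no python-test container running
core@core-04 ~ $ docker ps    # a few seconds later: a brand new one (ids illustrative)
CONTAINER ID   IMAGE                 COMMAND       CREATED         NAMES
5fd39766c442   my-user/python-test   "python..."   4 seconds ago   python-test
```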
Awesome! We killed the running container and within seconds systemd started a new one for us. If however we use docker stop instead of docker kill (therefore stopping the container gracefully) systemd won’t try to restart it.
That is great if the process is killed for some reason, but what happens if the entire node disappears all of a sudden? I won’t show it here but you can easily simulate this by doing a vagrant halt on the VM where your unit was placed. Fleet will detect that the node is dead and redistribute all the units that were running in that node across the rest of the cluster.
High availability services
One of the main benefits of using Fleet to manage our units is that it becomes really easy to run a highly available service with multiple instances running in different nodes. This, combined with the self-healing property we discussed in the previous section, gives you a lot of power to do pretty cool stuff.
This replication of any given service across your nodes is enabled by something called template unit files. This basically means that you can write a regular unit file like the one we wrote before and use it as a template to instantiate new units. The only difference is in the name of the unit file, which should now follow the pattern <name>@.<suffix>. For example, our previous python-test.service should be renamed to python-test@.service.
Let's rename our unit file and see how we can start as many instances of our python container as we want. But first, remove the unit file we loaded before with fleetctl destroy python-test.service. Now we can rename our unit and submit it to our cluster in the same way as we did for the first one:
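```
$ fleetctl destroy python-test.service
Destroyed python-test.service
$ mv python-test.service python-test@.service
$ fleetctl submit python-test@.service
```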
With our template loaded in the cluster we can now start instances of that template using the name and some suffix after the @. For instance (the placements will vary in your cluster):
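```
$ fleetctl start python-test@1.service python-test@2.service python-test@3.service
Unit python-test@1.service launched on 14c2ec89.../172.17.8.101
Unit python-test@2.service launched on 1767105b.../172.17.8.102
Unit python-test@3.service launched on 0b9fd6f8.../172.17.8.104
```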
Here we can see that we started 3 instances of our python container. We can also use the shell's brace expansion to start multiple instances at once:
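```
$ fleetctl start python-test@{4..6}.service
```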
Telling Fleet where to run your containers
By default Fleet makes no guarantees as to where in the cluster your units will run. In the last example from the previous section we saw that we started 6 different instances of our python unit and it just so happens that Fleet decided to run one on each node.
So what do we do if we have dependencies between our different units? Imagine for instance that you have 2 different containers, one running your application and one running a monitoring agent for that application. In that case you want to keep those 2 always running on the same node. Similarly, if you want to run multiple instances of your service to scale horizontally, you want those to run on different nodes.
To do this, Fleet provides a set of fleet-specific options that allow you to control how the scheduling engine of Fleet will work. We’ll see how these work with some examples. But first we’ll need another unit file (again a sketch, this time using the stock busybox image):
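```
# hello-world@.service
[Unit]
Description=Hello World
Requires=docker.service
After=docker.service

[Service]
TimeoutStartSec=0
Restart=on-failure
ExecStartPre=-/usr/bin/docker kill hello-world
ExecStartPre=-/usr/bin/docker rm hello-world
ExecStart=/usr/bin/docker run --name hello-world busybox /bin/sh -c "while true; do echo Hello World; sleep 1; done"
ExecStop=/usr/bin/docker stop hello-world
```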
This simply runs a container that will keep printing a Hello World message to stdout.
Running units together
Now imagine that we want to run this hello-world unit together with our python-test one, but we want to make sure that these 2 always run together on the same node. We can use the MachineOf Fleet attribute to achieve this (a sketch of the extra section, where %i expands to the instance suffix):
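```
# added at the bottom of python-test@.service
[X-Fleet]
MachineOf=hello-world@%i.service
```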
We added the [X-Fleet] section to our unit file specifying that our unit should only be placed wherever there’s also a hello-world unit running.
Let's see what happens when we submit these 2 units into our cluster (the placement will vary, but the pair always lands together):
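```
$ fleetctl start hello-world@1.service python-test@1.service
Unit hello-world@1.service launched on 1767105b.../172.17.8.102
Unit python-test@1.service launched on 1767105b.../172.17.8.102
```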
As we expected, the 2 units were scheduled on the same node. The same thing would happen if we started multiple instances of each unit at the same time:
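```
$ fleetctl start hello-world@{2..3}.service python-test@{2..3}.service
$ fleetctl list-units   # each python-test@N shares a machine with its hello-world@N
```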
This dependency between units also means that if the unit we depend on (hello-world in our example) is destroyed, then all the units that were dependent on it (python-test in our example) will also be destroyed. We can see this if we do:
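```
$ fleetctl destroy hello-world@1.service
Destroyed hello-world@1.service
$ fleetctl list-units   # python-test@1.service is gone as well
```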
Running units away from each other
We saw how to run multiple units while guaranteeing that they will always be placed on the same node. How about the opposite scenario: running 2 or more units while making sure that they are never put on the same node?
We can use Fleet’s Conflicts option to achieve this. Let's change our python-test@.service unit file to use this new option (the glob matches every instance of the template):
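```
[X-Fleet]
Conflicts=python-test@*.service
```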
If we submit this new unit file and run multiple instances of our 2 services, we’ll see that Fleet will place them on different nodes. If Fleet cannot find a distribution that satisfies the constraints specified in the unit files, then it will simply refuse to schedule them.
Container orchestration and scheduling is an exciting and relatively new area that is under heavy development by different players. CoreOS presents an easy and lightweight approach using etcd, Fleet and Docker as its backbone. In this post we saw how easy it is to create a local CoreOS cluster with Vagrant and run highly available and self-healing services with the help of Fleet.
By combining a few simple configuration values, we can ensure that our services are distributed across different regions and availability zones. This, combined with the fact that we can run CoreOS on pretty much any cloud provider or hardware, enables us to have very complex architectures with pretty much no manual intervention.