Orchestrating your containers with CoreOS, an introduction
Most Docker tutorials that you’ll find out there (the ones in this blog included) assume that you have a single host running all your containers, or a few hosts that you manage manually. While this is nice and simple for explaining the basic concepts, it is probably not how you want to run your applications in production. In most cases you will have a cluster of servers, all running different containers that need to talk to each other and keep functioning properly, even when some of those servers suddenly go offline.
This is the area of orchestration and scheduling of containers, a topic that is extremely hot these days, particularly with big players in the industry working on new projects to abstract away most of the complexities inherent in running distributed containers. Amazon recently opened up its EC2 Container Service, Google has Kubernetes, and Mesosphere is becoming pretty popular with the underlying Apache Mesos project.
One additional project that has been gaining a lot of attention in this area is CoreOS. In this post I’m going to explore CoreOS and give a basic overview of the problem it tries to solve, how it works and how to work effectively with it.
Introduction
The idea behind CoreOS is the same as with any other cluster management system. You stop thinking about your individual servers and how they work together. Instead you think about your data center (a cluster of individual servers). In other words, you no longer say “run this container in server 1 and this other container in server 2” but “run these 2 containers in my data center” and let the cluster manager take care of where and how to do that.
This also means that if one or more of your individual servers die the cluster manager will take the containers that were running in those servers and distribute them across the remaining healthy nodes.
Following this philosophy, CoreOS is a lightweight open source operating system that comes together with a set of simple tools.
The main building block behind CoreOS is Docker. Since CoreOS doesn’t come with a package manager, everything you want to run on it has to run as a container. It should be noted that while CoreOS fully supports Docker, they are also working on their own container runtime called rkt.
To start and manage all these containers CoreOS uses Fleet. Fleet is based on systemd and extends it in order to work at the cluster level. In other words, while systemd works as a single machine init system, fleet works as a cluster init system.
To coordinate the different nodes and let Fleet know where to run your containers, CoreOS provides etcd, a distributed key/value store with strong consistency and partition tolerance. Etcd uses the Raft consensus algorithm to handle the communication between the different nodes; this is the same algorithm used by Consul, by the way.
I will go into some detail about each of these tools and how they can work together with Docker. But first, let’s set up our CoreOS cluster running on Vagrant.
Bootstrapping a CoreOS cluster
The CoreOS documentation has very comprehensive guides for running CoreOS on anything from bare metal hardware to cloud providers to virtualization platforms. You can follow the step-by-step guide here to run a basic cluster locally on Vagrant.
The short version is that you can clone this repo and then do some minimal configuration. The relevant files for this part are config.rb and user-data. The Vagrantfile is pretty generic and reads all the configuration it needs from config.rb, so there is no need to change anything there. The config.rb file looks something like the following:
# Size of the CoreOS cluster created by Vagrant
$num_instances=6
# Official CoreOS channel from which updates should be downloaded
$update_channel='stable'
# Customize VMs
$vm_memory = 2048
$vm_cpus = 2
# Enable port forwarding from guest(s) to host machine, syntax is: { 80 => 8080 }, auto correction is enabled by default.
# 4001 is the default etcd port, we need this if we want to run fleetctl locally on the host
$forwarded_ports = {4001 => 4001}
The file is pretty self-explanatory. You can see that we define the size of our cluster (6 instances) and give the VMs some extra memory and CPUs to run on. Lastly, we forward port 4001, which is the default port used by etcd. We’ll see why we want to do this in a bit.
Then we have the user-data file:
#cloud-config

coreos:
  etcd2:
    #generate a new token for each unique cluster from https://discovery.etcd.io/new
    discovery: https://discovery.etcd.io/<token>
    # multi-region and multi-cloud deployments need to use $public_ipv4
    advertise-client-urls: http://$public_ipv4:2379
    initial-advertise-peer-urls: http://$private_ipv4:2380
    # listen on both the official ports and the legacy ports
    # legacy ports can be omitted if your application doesn't depend on them
    listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001
    listen-peer-urls: http://$private_ipv4:2380,http://$private_ipv4:7001
  fleet:
    public-ip: $public_ipv4
  flannel:
    interface: $public_ipv4
  units:
    - name: etcd2.service
      command: start
    - name: fleet.service
      command: start
    - name: docker-tcp.socket
      command: start
      enable: true
      content: |
        [Unit]
        Description=Docker Socket for the API

        [Socket]
        ListenStream=2375
        Service=docker.service
        BindIPv6Only=both

        [Install]
        WantedBy=sockets.target
This is the only file you need to modify before starting the cluster. Go to https://discovery.etcd.io/new in your browser and copy the URL that you get as a response. Now open user-data and paste that URL where it says discovery: https://discovery.etcd.io/<token>.
You can now do a vagrant up and wait while your cluster gets created. When it’s done you should be able to run vagrant status and see the 6 nodes running.
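The output should look roughly like this (the exact format depends on your Vagrant version and provider):
$ vagrant status
Current machine states:
core-01                   running (virtualbox)
core-02                   running (virtualbox)
core-03                   running (virtualbox)
core-04                   running (virtualbox)
core-05                   running (virtualbox)
core-06                   running (virtualbox)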
Trying out etcd
Now that you have a CoreOS cluster up and running, we can start to play around with the different tools that ship with it. Let’s start with etcd, the distributed key/value store.
You’ll need 2 open terminals for this (or tabs, or splits or whatever you use).
We’ll ssh into core-01 in one of them (with vagrant ssh core-01) and into core-02 in the other (vagrant ssh core-02). Which nodes you ssh into is irrelevant, as long as they are different.
CoreOS comes with a tool to read and write from etcd, called etcdctl. But etcd also exposes an HTTP API that is really intuitive and easy to use. In fact, etcdctl is just a facade in front of this API. We’ll see how to use both here.
Let’s start by writing a value. From core-01, run etcdctl set /key1 value1. This command adds a new key/value pair to etcd, where the key is key1 and the value is value1. Now, from the second node, you can read the value with etcdctl get /key1. You should see value1 as the response.
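Put together, the exchange between the two terminals looks something like this (etcdctl echoes the value back when you set it):
core-01$ etcdctl set /key1 value1
value1
core-02$ etcdctl get /key1
value1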
Note how etcd replicated the value that you wrote on the first node to the second one almost instantaneously. In fact, it replicated the value to all nodes in the cluster not just the two you are ssh’ed into. This is the power of a distributed store.
If you wanted to use the HTTP API, you could have accomplished the same thing using curl instead of etcdctl. We can write a second key/value pair in this way. From core-01 you can do curl -L -X PUT http://127.0.0.1:4001/v2/keys/key2 -d value="value2". Now, to read the value from core-02, you can do curl -L http://127.0.0.1:4001/v2/keys/key2.
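Since the API speaks JSON, the responses to those two commands should look roughly like this (the index values will differ in your cluster):
core-01$ curl -L -X PUT http://127.0.0.1:4001/v2/keys/key2 -d value="value2"
{"action":"set","node":{"key":"/key2","value":"value2","modifiedIndex":8,"createdIndex":8}}
core-02$ curl -L http://127.0.0.1:4001/v2/keys/key2
{"action":"get","node":{"key":"/key2","value":"value2","modifiedIndex":8,"createdIndex":8}}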
Admittedly, the etcdctl tool simplifies things a little bit, but both options are there to choose from.
Another really interesting feature that etcd provides is a TTL (time to live) for each entry. This is quite useful when we use etcd for things like service discovery, where we don’t want to be reading stale values. To use it you simply pass the --ttl parameter when you set a value.
To see this in action, go back to core-01 and run etcdctl set /key3 value3 --ttl 15. This will add the new key with a TTL of 15 seconds. If you go to core-02 now and run etcdctl get /key3, you should see its value (provided it took you less than 15 seconds to do that). Now wait for a while and run the same get again. The key is gone!
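The whole sequence looks something like this (the exact error output may vary between etcd versions):
core-01$ etcdctl set /key3 value3 --ttl 15
value3
core-02$ etcdctl get /key3
value3
core-02$ sleep 15
core-02$ etcdctl get /key3
Error: 100: Key not found (/key3) [10]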
Finally, if you want to list all the keys currently stored within etcd, you can use the etcdctl ls command. This will print the keys available at the root level. Alternatively, if you want to print keys at any level, you can pass the --recursive flag (as in etcdctl ls --recursive).
Etcd provides some other cool functionality (like atomic test-and-set updates, directories and event notifications) that is well documented; run etcdctl help to explore it.
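As a quick taste of the test-and-set functionality, here is a sketch based on the etcd v2 command set (check etcdctl help for the exact flags and error format in your version). The first set succeeds because the current value of /key1 matches the expected one; the second fails because it no longer does:
core-01$ etcdctl set /key1 value2 --swap-with-value value1
value2
core-01$ etcdctl set /key1 value3 --swap-with-value value1
Error: 101: Compare failed ([value1 != value2]) [11]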
Starting your first Fleet unit
As I mentioned in the introduction, CoreOS comes with a cluster manager called Fleet. You’ll use the fleetctl tool to interact with the cluster.
To see this in action, ssh into one of the nodes and run fleetctl list-machines:
$ vagrant ssh core-01
core-01$ fleetctl list-machines
MACHINE IP METADATA
0b9fd6f8... 172.17.8.104 -
128a2e32... 172.17.8.103 -
2addf739... 172.17.8.101 -
3f608471... 172.17.8.106 -
73c0b7fc... 172.17.8.105 -
eabc97ed... 172.17.8.102 -
We can see that our 6 CoreOS nodes are automatically recognized by Fleet as being part of the same cluster.
Like I said before, we now only care about our cluster and not our individual nodes. These nodes are completely ephemeral and we should assume they can come and go without notice. For this reason, it doesn’t matter from which node we run the previous fleetctl command. We could’ve ssh’ed into core-06 and the result would’ve been exactly the same.
Using fleetctl from your host
Being able to run fleetctl from within any node is great, but even better is being able to run it from outside the cluster as well. For now our cluster is running locally on Vagrant, but the same setup could be running in AWS and we would probably like to control it from our laptop without having to ssh into individual instances first.
Luckily, we can do this easily using the --tunnel flag. From your laptop run:
$ fleetctl --tunnel 127.0.0.1:2222 list-machines
MACHINE IP METADATA
0b9fd6f8... 172.17.8.104 -
128a2e32... 172.17.8.103 -
2addf739... 172.17.8.101 -
3f608471... 172.17.8.106 -
73c0b7fc... 172.17.8.105 -
eabc97ed... 172.17.8.102 -
This basically tunnels all communication with your cluster over SSH, using the IP and port specified. Port 2222 is the default port that Vagrant uses to SSH into your VM (you can see this by running vagrant ssh-config).
If you get a message saying something like “Failed initializing SSH client: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain”, make sure that your Vagrant insecure ssh key is added to your ssh-agent by running ssh-add ~/.vagrant.d/insecure_private_key.
To make fleetctl commands a bit less verbose, we can put the tunnel configuration into an environment variable:
$ export FLEETCTL_TUNNEL=127.0.0.1:2222
Then we’ll be able to run Fleet just as we would if we were inside one of our nodes:
$ fleetctl list-machines
MACHINE IP METADATA
0b9fd6f8... 172.17.8.104 -
128a2e32... 172.17.8.103 -
2addf739... 172.17.8.101 -
3f608471... 172.17.8.106 -
73c0b7fc... 172.17.8.105 -
eabc97ed... 172.17.8.102 -
Running Fleet units
Having our nodes up and running with Fleet is great but it is not doing anything useful by itself. We want to start telling our cluster to run some services for us. This is where Fleet Units come into play.
As I mentioned before, Fleet can be seen as systemd working at the cluster level instead of at the individual machine level. As such, in order to run anything with Fleet you need to submit regular systemd unit files combined with some Fleet-specific properties.
A unit file defines what process you want to run and gives Fleet some hints to help it determine how and where that process should be executed. To get started, let’s see what the unit file to run our good old python service would look like:
$ cat python-test.service
[Unit]
Description=Python service
Requires=docker.service
After=docker.service
[Service]
TimeoutStartSec=0
Restart=on-failure
ExecStartPre=-/usr/bin/docker kill python-service
ExecStartPre=-/usr/bin/docker rm python-service
ExecStartPre=/usr/bin/docker pull jlordiales/python-micro-service
ExecStart=/usr/bin/docker run --name python-service -P jlordiales/python-micro-service
ExecStop=/usr/bin/docker stop python-service
Let’s go through the unit file and see what each section is doing. The first line simply sets a description for our unit, which is helpful when looking at all the units that are currently running. The following 2 lines, Requires and After, specify ordering dependencies between units (the full documentation can be seen here). Since we are running a docker container, we need the docker process to be started first. This dependency also means that if the docker unit is stopped, this python-test.service unit will also be stopped.
We then have the [Service] section, which effectively describes how our service should run. We first tell systemd not to wait for a completion signal from our service (with TimeoutStartSec=0). Next, we ask systemd to restart our container whenever it exits unexpectedly (exit code different than 0). This is extremely useful if we want to have a self-healing cluster and we’ll see how this works in a moment.
Finally, the Exec* commands tell systemd how to run our container. The ExecStartPre commands run before our container is started and are basically there to set up the environment and ensure that our main process can run smoothly. In our example, we make sure that no container with the same name is running by doing a docker kill and docker rm. Note that these 2 lines are prefixed with a - before the command to run. This is very important because by default systemd will execute the commands in the order they are specified and will stop as soon as one of them returns a non-zero exit code. By prefixing a command with -, systemd will ignore its exit code and continue executing the next one. We need to do that for docker kill and docker rm because those commands will fail if there is no container named python-service.
The 2 remaining lines are pretty self-explanatory. ExecStart is the command that will start the main process for our unit. In our case we run our container as we usually would, specifying a name and the -P flag to expose its ports. One important thing to notice here is that we don’t pass the -d flag to docker (to run in detached mode). If we did, the unit would run for a few seconds and then exit, because the container would not be started as a child of the unit’s PID, which basically means that from the unit’s point of view there is nothing to run.
The ExecStop command in the last line will do a docker stop whenever we tell systemd to stop our unit.
So now that we have our unit file, how do we run it? Well, first we need to load the unit into our cluster, because so far this is only a text file that we have edited in our local environment (outside of any of the CoreOS hosts). We can do this with fleetctl submit python-test.service. To see that the unit was actually submitted, we can run fleetctl list-unit-files, which should give us the list of all units that our cluster knows about. You can even look at the contents of the unit with fleetctl cat python-test.service.
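The whole sequence looks roughly like this (the hash will obviously be different for you):
$ fleetctl submit python-test.service
$ fleetctl list-unit-files
UNIT                    HASH     DSTATE     STATE      TARGET
python-test.service     d4c61bf  inactive   inactive   -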
With the unit file submitted we can now do:
$ fleetctl start python-test.service
Unit python-test.service launched on 0b9fd6f8.../172.17.8.104
In this case, Fleet decided that the node 172.17.8.104 was good enough to run our container. If we want to see all the currently running units we can do that with:
$ fleetctl list-units
UNIT MACHINE ACTIVE SUB
python-test.service 0b9fd6f8.../172.17.8.104 active running
We can also check the status of any given unit:
$ fleetctl status python-test.service
● python-test.service - Python service
Loaded: loaded (/run/fleet/units/python-test.service; linked-runtime; vendor preset: disabled)
Active: active (running) since Sat 2015-07-04 10:22; 50s ago
Process: 1726 ExecStartPre=/usr/bin/docker pull jlordiales/python-micro-service (code=exited, status=0/SUCCESS)
Process: 1719 ExecStartPre=/usr/bin/docker rm python-service (code=exited, status=1/FAILURE)
Process: 1655 ExecStartPre=/usr/bin/docker kill python-service (code=exited, status=1/FAILURE)
Main PID: 1776 (docker)
Memory: 8.3M
CGroup: /system.slice/python-test.service
└─1776 /usr/bin/docker run --name python-service -P jlordiales/python-micro-service
Jul 04 10:07 core-04 docker[1726]: 595ded12b855: Pulling fs layer
Jul 04 10:09 core-04 docker[1726]: 595ded12b855: Download complete
Jul 04 10:09 core-04 docker[1726]: 7e0b582bc16d: Pulling metadata
Jul 04 10:10 core-04 docker[1726]: 7e0b582bc16d: Pulling fs layer
Jul 04 10:22 core-04 docker[1726]: 7e0b582bc16d: Download complete
Jul 04 10:22 core-04 docker[1726]: 7e0b582bc16d: Download complete
Jul 04 10:22 core-04 docker[1726]: Status: Downloaded newer image for jlordiales/python-micro-service:latest
Jul 04 10:22 core-04 systemd[1]: Started Python service.
Jul 04 10:22 core-04 docker[1776]: * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
Jul 04 10:22 core-04 docker[1776]: * Restarting with stat
There we can see that our process is active and running. We can also see the exit codes of the 3 ExecStartPre instructions we discussed before. Finally, we can see some of the output from each process.
To make sure that our container is running and responding where Fleet says it is, we can grab the machine ID from the output of fleetctl list-units that we saw before (0b9fd6f8 in our case) and ssh directly into it with:
$ fleetctl list-units
UNIT MACHINE ACTIVE SUB
python-test.service 0b9fd6f8.../172.17.8.104 active running
$ fleetctl ssh 0b9fd6f8
core-04$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
bd6681b7eef3 jlordiales/python-micro-service:latest "python app.py" 19 minutes ago Up 19 minutes 0.0.0.0:32768->5000/tcp python-service
core-04$ curl localhost:32768
Hello World from bd6681b7eef3
Self-healing nodes
When I was describing the unit file for the python unit, I briefly showed a property called Restart=on-failure, which means that systemd will automatically restart the process if it exits with an exit code different than 0. Let’s see if this really works in our example. We’ll ssh again into the node running our container and kill it to see what happens:
$ fleetctl ssh 0b9fd6f8
core-04$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
bd6681b7eef3 jlordiales/python-micro-service:latest "python app.py" 29 minutes ago Up 29 minutes 0.0.0.0:32768->5000/tcp python-service
core-04$ docker kill python-service
core-04$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
1da6c1b5cafc jlordiales/python-micro-service:latest "python app.py" 45 seconds ago Up 44 seconds 0.0.0.0:32769->5000/tcp python-service
Awesome! We killed the first running container and within seconds systemd started a new one for us. If however we use docker stop instead of docker kill (therefore stopping the container gracefully), systemd won’t try to restart it.
That is great if the process is killed for some reason, but what happens if the entire node disappears all of a sudden? I won’t show it here, but you can easily simulate this by doing a vagrant halt on the VM where your unit was placed. Fleet will detect that the node is dead and re-distribute all the units that were running on that node across the rest of the cluster.
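If you want to try it yourself, the sequence would look something like this, assuming the unit landed on core-04 as it did here (which node picks it up afterwards is entirely up to Fleet):
$ vagrant halt core-04
$ fleetctl list-units
UNIT                    MACHINE                     ACTIVE    SUB
python-test.service     128a2e32.../172.17.8.103    active    running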
High availability services
One of the main benefits of using Fleet to manage our units is that it becomes really easy to run a highly available service, with multiple instances running on different nodes. This, combined with the self-healing property we discussed in the previous section, gives you a lot of power to do pretty cool stuff.
This replication of any given service across your nodes is enabled by something called template unit files. This basically means that you can write a regular unit file like the one we wrote before and use it as a template to instantiate new units. The only difference is in the name of the unit file, which should now follow the pattern <name>@.<suffix>. For example, our previous python-test.service should be renamed to python-test@.service.
Let’s rename our unit file and see how we can start as many instances of our python container as we want. But first, remove the unit we loaded before with fleetctl destroy python-test.service. Now we can rename our unit file and submit it to our cluster in the same way as we did for the first one:
$ mv python-test.service python-test@.service
$ fleetctl submit python-test@.service
With our template loaded in the cluster, we can now start instances of that template using the name and some suffix after the @. For instance:
$ fleetctl start python-test@1 python-test@2 python-test@random
$ fleetctl list-units
UNIT MACHINE ACTIVE SUB
python-test@1.service 0b9fd6f8.../172.17.8.104 active running
python-test@2.service 128a2e32.../172.17.8.103 active running
python-test@random.service 3f608471.../172.17.8.106 active running
Here we can see that we started 3 instances of our python container. We can also use the shell’s brace expansion to start multiple instances at once:
$ fleetctl start python-test@{3..5}
$ fleetctl list-units
UNIT MACHINE ACTIVE SUB
python-test@1.service 0b9fd6f8.../172.17.8.104 active running
python-test@2.service 128a2e32.../172.17.8.103 active running
python-test@3.service 73c0b7fc.../172.17.8.105 active running
python-test@4.service eabc97ed.../172.17.8.102 active running
python-test@5.service 2addf739.../172.17.8.101 active running
python-test@random.service 3f608471.../172.17.8.106 active running
Telling Fleet where to run your containers
By default Fleet makes no guarantees as to where in the cluster your units will run. In the last example from the previous section we saw that we started 6 different instances of our python unit and it just so happens that Fleet decided to run one on each node.
So what do we do if we have dependencies between our different units? Imagine, for instance, that you have 2 different containers: one running your application and one running a monitoring agent for that application. In that case you always want to keep those 2 running on the same node. Similarly, if you run multiple instances of your service to scale horizontally, you want those instances to run on different nodes.
To do this, Fleet provides a set of fleet-specific options that allow you to control how Fleet’s scheduling engine works. We’ll see how these work with some examples. But first we’ll need another unit file:
$ cat hello-world@.service
[Unit]
Description=MyApp
After=docker.service
Requires=docker.service
[Service]
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill busybox1
ExecStartPre=-/usr/bin/docker rm busybox1
ExecStartPre=/usr/bin/docker pull busybox
ExecStart=/usr/bin/docker run --name busybox1 busybox /bin/sh -c "while true; do echo Hello World; sleep 1; done"
ExecStop=/usr/bin/docker stop busybox1
This simply runs a container that will keep printing a Hello World message to stdout.
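Once an instance of it is running (we’ll start some in a moment), you can check its output from anywhere in the cluster with fleetctl journal, which tails the systemd journal of a unit wherever it happens to be scheduled. The timestamps and PIDs below are illustrative:
$ fleetctl journal hello-world@1
Jul 04 11:02:03 core-04 docker[2231]: Hello World
Jul 04 11:02:04 core-04 docker[2231]: Hello World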
Running units together
Now imagine that we want to run this hello-world unit together with our python-test one, but we want to make sure that these 2 always run together on the same node. We can use the MachineOf Fleet attribute to achieve this.
$ cat python-test@.service
[Unit]
Description=Python service
Requires=docker.service
After=docker.service
[Service]
TimeoutStartSec=0
Restart=on-failure
ExecStartPre=-/usr/bin/docker kill python-service
ExecStartPre=-/usr/bin/docker rm python-service
ExecStartPre=/usr/bin/docker pull jlordiales/python-micro-service
ExecStart=/usr/bin/docker run --name python-service -P jlordiales/python-micro-service
ExecStop=/usr/bin/docker stop python-service
[X-Fleet]
MachineOf=hello-world@%i.service
We added the [X-Fleet] section to our unit file, specifying that our unit should only be placed wherever there’s also a hello-world unit running. The %i placeholder expands to the instance suffix after the @, so python-test@1 will follow hello-world@1, python-test@2 will follow hello-world@2, and so on.
Let’s see what happens when we submit these 2 units into our cluster:
$ fleetctl submit python-test@.service hello-world@.service
$ fleetctl start python-test@1 hello-world@1
$ fleetctl list-units
UNIT MACHINE ACTIVE SUB
hello-world@1.service 0b9fd6f8.../172.17.8.104 active running
python-test@1.service 0b9fd6f8.../172.17.8.104 active running
As we expected, the 2 units were scheduled on the same node. The same thing would happen if we start multiple instances of each unit at the same time:
$ fleetctl start python-test@{2..4} hello-world@{2..4}
$ fleetctl list-units
UNIT MACHINE ACTIVE SUB
hello-world@1.service 0b9fd6f8.../172.17.8.104 active running
hello-world@2.service 128a2e32.../172.17.8.103 active running
hello-world@3.service 2addf739.../172.17.8.101 active running
hello-world@4.service 3f608471.../172.17.8.106 active running
python-test@1.service 0b9fd6f8.../172.17.8.104 active running
python-test@2.service 128a2e32.../172.17.8.103 active running
python-test@3.service 2addf739.../172.17.8.101 active running
python-test@4.service 3f608471.../172.17.8.106 active running
This dependency between units also means that if the unit we depend on (hello-world in our example) is destroyed, then all the units that were dependent on it (python-test in our example) will also be destroyed. We can see this if we do:
$ fleetctl destroy hello-world@4
$ fleetctl list-units
UNIT MACHINE ACTIVE SUB
hello-world@1.service 0b9fd6f8.../172.17.8.104 active running
hello-world@2.service 128a2e32.../172.17.8.103 active running
hello-world@3.service 2addf739.../172.17.8.101 active running
python-test@1.service 0b9fd6f8.../172.17.8.104 active running
python-test@2.service 128a2e32.../172.17.8.103 active running
python-test@3.service 2addf739.../172.17.8.101 active running
We removed hello-world@4 and Fleet automatically removed python-test@4 as well.
Running units away from each other
We saw how to run multiple units while guaranteeing that they will always be placed on the same node. How about the opposite scenario: running 2 or more units while making sure that they are never put on the same node?
We can use Fleet’s Conflicts option to achieve this. Let’s change our python-test@.service unit file to use this new option:
$ cat python-test@.service
[Unit]
Description=Python service
Requires=docker.service
After=docker.service
[Service]
TimeoutStartSec=0
Restart=on-failure
ExecStartPre=-/usr/bin/docker kill python-service
ExecStartPre=-/usr/bin/docker rm python-service
ExecStartPre=/usr/bin/docker pull jlordiales/python-micro-service
ExecStart=/usr/bin/docker run --name python-service -P jlordiales/python-micro-service
ExecStop=/usr/bin/docker stop python-service
[X-Fleet]
Conflicts=hello-world@%i.service
If we submit this new unit file and run multiple instances of our 2 services, we’ll see that Fleet places them on different nodes. If Fleet cannot find a distribution that satisfies the constraints specified in the unit files, it will simply refuse to schedule them.
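MachineOf and Conflicts are not the only scheduling options available. As one last sketch (assuming the nodes were booted with a metadata entry such as metadata: region=us-east-1 in the fleet section of their cloud-config, which our Vagrant user-data doesn’t set), an [X-Fleet] section like the following would pin a unit to a region while also keeping its instances on separate nodes:
[X-Fleet]
# only schedule on nodes whose fleet metadata includes region=us-east-1
MachineMetadata=region=us-east-1
# never place two instances of this template on the same node
Conflicts=python-test@*.service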
Conclusion
Container orchestration and scheduling is an exciting and relatively new area that is under heavy development by different players. CoreOS presents an easy and lightweight approach using etcd, Fleet and Docker as its backbone. In this post we saw how easy it is to create a local CoreOS cluster with Vagrant and run highly available and self-healing services with the help of Fleet.
By combining a few simple configuration values, we can ensure that our services are distributed across different regions and availability zones. This, combined with the fact that we can run CoreOS on pretty much any cloud provider or on our own hardware, enables us to build fairly complex architectures with almost no manual intervention.