Tiny Docker Pieces, Loosely Joined
Much of the following discussion is probably obvious stuff to long-time users of Docker. Consider this post as an attempt by a relative Docker newbie to provide assistance to other newcomers to the Docker project.
In that spirit, if I say something wrong or misleading, please let me know, so I can correct it!
Also, pay attention to the date of the post. It was written circa Docker version 0.7.0, but for a fast-moving project like Docker, it's possible that the information below may quickly become obsolete.
Persistent Data with Docker
One of the first stumbling blocks, for me, when I started playing with Docker was figuring out the best way to persist data.
I understood the concept that Docker containers were designed to be small, usually running just a single process. And, that they were quick to build, which means that you don't ever have to upgrade a running container. Rather, you create a new one and throw away the old one.
But, what wasn't obvious to me was, how do you marry the idea of ephemeral containers with the need for persistent data? What's the best way to make sure your data outlives the brief lifetime of a Docker container?
To figure that out, I needed to understand Volumes.
In my opinion, the best resource for learning about how to use volumes is Michael Crosby's blog post, "Advanced Docker Volumes." In it, he discusses an example using 3 Docker containers: Minecraft, MapGenerator, and WebServer. The MapGenerator reads game data files from the Minecraft container, generates maps, and then writes the maps to the WebServer container. This is highly instructive, showing not only how to utilize a separate Docker container for each process, but also showing how to use the -volumes-from option to share data among different containers.
But, what if you want to upgrade to the next version of Minecraft? That should be easy...you simply create a new Docker image with the latest version, right? However, if you destroy the old Minecraft container, and then launch your new Minecraft container, you no longer have access to the game data produced by the original Minecraft container. That's because the application and the data are bundled together in the same container. (Note that the game data hasn't been deleted, since it lives on in the host filesystem at /var/lib/docker/volumes. There's just no easy way for the new Minecraft container to access it.)
What you need is a way to separate the application from the data.
I've seen this question of how to persist data come up on IRC (usually regarding databases), and the typical answer is to run the application (database server) in one container, and store the data files in a separate data-only container.
I was never quite sure how a "data-only container" worked in practice, but luckily, I found a great example (called "Desktop Integration") in the contrib directory of the docker repository that showed how to dockerize a desktop application by using two Docker containers: one for Firefox, and a second for the data.
My goal for Docker Global Hack Day was to apply the lessons learned from that example in my own project.
My #dockerhackday Project
I'm a huge fan of ReadtheDocs (and have made a few contributions over the years), but unfortunately, not every project writes their documentation in reStructuredText. I've noticed several projects on Github that keep their documentation in Markdown. So, for the hack day, I set myself the task of implementing a tool to view these Markdown docs in HTML. The steps needed to achieve this were:
- Clone the git repository of the project.
- Run all of the Markdown files in the docs directory through pandoc to convert them to HTML, and save the results in a new directory.
- Serve the directory of HTML files so I can view the docs in my browser.
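The conversion step above is the heart of the workflow. Here's a minimal Python sketch of that step, assuming pandoc is on the PATH; the function names (`html_path`, `convert_tree`) are my own illustration, not the actual script from the project:

```python
import subprocess
from pathlib import Path

def html_path(md_file, src_root, dest_root):
    """Map a markdown file under src_root to its .html twin under dest_root."""
    rel = Path(md_file).relative_to(src_root)
    return Path(dest_root) / rel.with_suffix(".html")

def convert_tree(src_root, dest_root):
    """Run every .md file under src_root through pandoc, preserving layout."""
    for md in Path(src_root).rglob("*.md"):
        out = html_path(md, src_root, dest_root)
        out.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(["pandoc", str(md), "-o", str(out)], check=True)
```

The path mapping is kept as a separate pure function so the directory layout logic can be checked without invoking pandoc at all.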
Though it would be easy enough to accomplish this using a single Docker container, I gave myself the further requirement of utilizing multiple Docker containers, each one handling a separate step in the process, and all of them sharing a single, data-only Docker container.
Much of the code is below, but you can also view the full repository for the project on Github: docker-data-only-container-demo.
Step 1: Build a Data-Only Image
As it turns out, there's not much to building a data-only image. The "Desktop Integration" example is already pretty simple, but mine was even more bare bones. In fact, here's the entirety of my Dockerfile (stripped of extraneous comments):
FROM stackbrew/busybox:latest
MAINTAINER Tom Offermann <firstname.lastname@example.org>

# Create data directory
RUN mkdir /data

# Create /data volume
VOLUME /data
That's it? But, this image doesn't do anything when you run it!
Well, a data-only container doesn't need to do anything. All it does is gain access to storage space that is outside of its own container filesystem. (See the "Under the Hood" section in Michael Crosby's post for details on how this works.)
I can even create a "data-only" container without first building an image (as above) just by running this command:
$ docker run -v /data busybox /bin/sh
It turns out that a "data-only" container is one of those concepts that's really, really simple...once you get it. Perhaps it's so simple that those who already understand it don't even see the need to explain it. (Or, perhaps I'm just a little slow...)
Step 2: Run a Data-Only Image
So, we've established that a data-only container doesn't actually "do" anything. Let's run one anyway, and then figure out what it's good for. First, build it:
$ docker build -t data .
Next, run it:
$ docker run -name pandoc-data data true
A couple of subtleties surprised me about this docker run command.

First, while you have to run a command when you start up a container, it doesn't really matter what the command is. It can even be a command that essentially does nothing, as true does here.

Second, you don't have to daemonize a data-only container by passing the -d option. In fact, the docker container exits immediately (after running true), but even so, it is still usable as a data volume, even in a stopped state.
Step 3: Create Additional Images
To use a data-only container, I need to create additional Docker containers to act upon it. For my pandoc-conversion project, I ended up creating three images:
# Dockerfile for git-clone
# Clone a git repository, find the directory of doc files in Markdown,
# and save the doc directory to /data/md.

FROM stackbrew/ubuntu:precise
MAINTAINER Tom Offermann <email@example.com>

RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > /etc/apt/sources.list
RUN apt-get update
RUN apt-get -y install git python python-pip
RUN pip install --upgrade pip docopt

VOLUME /data

ADD git-clone-docs.py /git-clone-docs.py
RUN chmod 755 /git-clone-docs.py

ENTRYPOINT ["/git-clone-docs.py"]
CMD ["-h"]
# Dockerfile for pandoc-convert
# Convert a directory of markdown files to directory of html files.

FROM stackbrew/ubuntu:precise
MAINTAINER Tom Offermann <firstname.lastname@example.org>

RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > /etc/apt/sources.list
RUN apt-get update
RUN apt-get -y install pandoc rsync python python-pip
RUN pip install --upgrade pip
RUN pip install docopt

ADD convert-directory.py /convert-directory.py
RUN chmod 755 /convert-directory.py

ENTRYPOINT ["/convert-directory.py"]
CMD ["-h"]
# Dockerfile for Node http-server

FROM stackbrew/ubuntu:precise
MAINTAINER Tom Offermann <email@example.com>

RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > /etc/apt/sources.list
RUN echo "deb http://ppa.launchpad.net/chris-lea/node.js/ubuntu precise main" >> /etc/apt/sources.list
RUN apt-get update

# Faster to add GPG key directly, rather than install python-software-properties
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C7917B12
RUN apt-get update
RUN apt-get install -y nodejs
RUN npm install http-server -g

ENTRYPOINT ["http-server"]
CMD ["--help"]
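The git-clone image's entrypoint script has two jobs: clone the repository, then locate its docs directory. Here's a rough Python sketch of that idea; the function names and the list of candidate directory names are my own guesses at the approach, not the project's actual git-clone-docs.py:

```python
import subprocess
from pathlib import Path

# Conventional docs directory names to look for; the real script's logic may differ.
DOC_DIR_NAMES = ("docs", "doc", "documentation")

def find_docs_dir(repo_root):
    """Return the first conventional docs directory under repo_root, or None."""
    for name in DOC_DIR_NAMES:
        candidate = Path(repo_root) / name
        if candidate.is_dir():
            return candidate
    return None

def clone_and_find_docs(repo_url, workdir):
    """Clone repo_url into workdir, then locate its docs directory."""
    subprocess.run(["git", "clone", "--depth", "1", repo_url, str(workdir)],
                   check=True)
    return find_docs_dir(workdir)
```

Splitting the lookup into its own function keeps the "where are the docs?" heuristic easy to test without touching the network.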
Each of the 3 images has a single command as the ENTRYPOINT. For git-clone and pandoc-convert, I wrote custom scripts whose functions are described in the Dockerfile comments, while the http-server image's entrypoint is the Node http-server.
Step 4: Use the Data-Only Image
To implement the Markdown conversion workflow, I run each of the three containers in sequence, and each of them utilizes the pandoc-data data-only container.
$ docker run -volumes-from pandoc-data git-clone https://github.com/isaacs/npm
The -volumes-from option to docker run gives the git-clone container access to the pandoc-data data-only container. It clones a git repository (npm's repo, in this case) and places the doc directory in /data/md.
$ docker run -volumes-from pandoc-data pandoc-convert /data/md /data/html
This time, it's pandoc-convert that can access the pandoc-data volume. It takes the markdown files in /data/md that were downloaded in the previous step, converts them to HTML, and saves the results in /data/html.
$ docker run -d -volumes-from pandoc-data -p 8080:8080 http-server /data/html
Finally, http-server serves the /data/html directory from the pandoc-data volume on port 8080, so I can view the docs in my browser.
What's So Great About Data-Only Containers?
Since I've spelled it out in such exhaustive detail, this may all seem like much ado about nothing...but, now that I finally grok the idea, I can see why data-only containers are so useful:
- It's the simplest way to share data among multiple containers.
- It makes it easy to upgrade the application (or process) that operates on the data. For example, say we wanted to generate PDF files in addition to HTML files. All we need to do to accomplish that is to create an updated pandoc-convert image and run it with -volumes-from pandoc-data; the data carries over untouched.
- Adding additional processes that operate on the data is equally easy. What if you wanted to push the HTML files to Amazon S3, in addition to serving them locally? Just create a new container that reads /data/html from the data-only container and uploads it.
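To make the S3 idea concrete, a hypothetical push-to-S3 container's script could enumerate /data/html and decide on a key for each file; this sketch covers only that planning step (the actual upload, e.g. via an S3 client library, is omitted, and `upload_plan` is an invented name):

```python
from pathlib import Path

def upload_plan(html_root, key_prefix="docs/"):
    """Return (local file, S3 key) pairs for every file under html_root."""
    root = Path(html_root)
    return [(p, key_prefix + p.relative_to(root).as_posix())
            for p in sorted(root.rglob("*")) if p.is_file()]
```

Because the new container only reads from the shared volume, it can be added (or removed) without touching the git-clone, pandoc-convert, or http-server images at all.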
Not too long ago, I thought I was beginning to get the Docker approach when I separated a database server and a web server into separate containers. But, as I start thinking along the lines that each Docker container should have just one piece of functionality or one responsibility, I realize that there are a lot of advantages to breaking your overall application into lots of tiny Docker pieces.
The more I learn about Docker, the smaller my Docker images get, and the more numerous they become.