Tiny Docker Pieces, Loosely Joined

06 December 2013

During the most recent #dockerhackday, I explored the concept of data-only Docker containers by setting up a workflow around pandoc ("a universal document converter").

Note:

Much of the following discussion is probably obvious to long-time users of Docker. Consider this post an attempt by a relative Docker newbie to help other newcomers to the Docker project.

In that spirit, if I say something wrong or misleading, please let me know, so I can correct it!

Also, pay attention to the date of the post. It was written circa Docker version 0.7.0, but for a fast-moving project like Docker, it's possible that the information below may quickly become obsolete.

Moby Dock, the Docker Whale

Persistent Data with Docker

One of the first stumbling blocks, for me, when I started playing with Docker was figuring out the best way to persist data.

I understood that Docker containers were designed to be small, usually running just a single process, and that they were quick to build, which means that you never have to upgrade a running container. Instead, you create a new one and throw away the old one.

But what wasn't obvious to me was how to marry the idea of ephemeral containers with the need for persistent data. What's the best way to make sure your data outlives the brief lifetime of a Docker container?

To figure that out, I needed to understand Volumes.

Docker Volumes

In my opinion, the best resource for learning about how to use volumes is Michael Crosby's blog post, "Advanced Docker Volumes." In it, he discusses an example using 3 Docker containers: Minecraft, MapGenerator, and WebServer. The MapGenerator reads game data files from the Minecraft container, generates maps, and then writes the maps to the WebServer container. This is highly instructive, showing not only how to utilize a separate Docker container for each process, but also showing how to use the -volumes-from option to share data among different containers.

But what if you want to upgrade to the next version of Minecraft? That should be easy: you simply create a new Docker image with the latest version, right? However, if you destroy the old Minecraft container and then launch your new Minecraft container, you no longer have access to the game data produced by the original Minecraft container. That's because the application and the data are bundled together in the same container. (Note that the game data hasn't been deleted, since it lives on in the host filesystem under /var/lib/docker/volumes. There's just no easy way for the new Minecraft container to access it.)

What you need is a way to separate the application from the data.

Data-Only Containers

I've seen this question of how to persist data come up on IRC (usually regarding databases), and the typical answer is to run the application (database server) in one container, and store the data files in a separate data-only container.

I was never quite sure how a "data-only container" worked in practice, but luckily, I found a great example (called "Desktop Integration") in the contrib directory of the docker repository that showed how to dockerize a desktop application by using two Docker containers: one for Firefox, and a second for the data.

My goal for Docker Global Hack Day was to apply the lessons learned from that example in my own project.

My #dockerhackday Project

I'm a huge fan of Read the Docs (and have made a few contributions over the years), but unfortunately, not every project writes its documentation in reStructuredText. I've noticed several projects on GitHub that keep their documentation in Markdown. So, for the hack day, I set myself the task of implementing a tool to view these Markdown docs as HTML. The steps needed to achieve this were:

1. Clone the git repository of the project.
2. Run all of the Markdown files in the docs directory through pandoc to convert them to HTML, and save the results in a new directory.
3. Serve the directory of HTML files so I can view the docs in my browser.

Though it would be easy enough to accomplish this using a single Docker container, I gave myself the further requirement of utilizing multiple Docker containers, each one handling a separate step in the process, and all of them sharing a single, data-only Docker container.

Much of the code is below, but you can also view the full repository for the project on GitHub: docker-data-only-container-demo.

Step 1: Build a Data-Only Image

As it turns out, there's not much to building a data-only image. The "Desktop Integration" example is already pretty simple, but mine was even more bare bones. In fact, here's the entirety of my Dockerfile (stripped of extraneous comments):

FROM stackbrew/busybox:latest
MAINTAINER Tom Offermann <tom@offermann.us>

# Create data directory
RUN mkdir /data

# Create /data volume
VOLUME /data

Wait...what?!?

That's it? But, this image doesn't do anything when you run it!

Well, a data-only container doesn't need to do anything. All it does is gain access to storage space that is outside of its own container filesystem. (See the "Under the Hood" section in Michael Crosby's post for details on how this works.)

I can even create a "data-only" container without first building an image (as above) just by running this command:

$ docker run -v /data busybox /bin/sh

It turns out that a "data-only" container is one of those concepts that's really, really simple...once you get it. Perhaps it's so simple that those who already understand it don't even see the need to explain it. (Or, perhaps I'm just a little slow...)

Step 2: Run a Data-Only Image

So, we've established that a data-only container doesn't actually "do" anything. Let's run one anyway, and then figure out what it's good for. First, build it:

$ docker build -t data .

Next, run it:

$ docker run -name pandoc-data data true

A couple of subtleties surprised me about this docker run command.

First, while you have to run a command when you start up a container, it doesn't really matter what the command is. It can even be a command that essentially does nothing, as true does here.

Second, you don't have to daemonize a data-only container by passing the -d option. In fact, the docker container exits immediately (after running true), but even so, it is still usable as a data volume, even in a stopped state.

Step 3: Create Additional Images

To use a data-only container, I need additional Docker containers to act upon it, and those containers need images. For my pandoc-conversion project, I ended up creating three images:

git-clone

# Dockerfile for git-clone
# Clone a git repository, find the directory of doc files in Markdown,
# and save the doc directory to /data/md.

FROM stackbrew/ubuntu:precise
MAINTAINER Tom Offermann <tom@offermann.us>

RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > /etc/apt/sources.list

RUN apt-get update
RUN apt-get -y install git python python-pip
RUN pip install --upgrade pip docopt

VOLUME /data

ADD git-clone-docs.py /git-clone-docs.py
RUN chmod 755 /git-clone-docs.py

ENTRYPOINT ["/git-clone-docs.py"]
CMD ["-h"]

pandoc-convert

# Dockerfile for pandoc-convert
# Convert a directory of markdown files to directory of html files.

FROM stackbrew/ubuntu:precise
MAINTAINER Tom Offermann <tom@offermann.us>

RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > /etc/apt/sources.list

RUN apt-get update
RUN apt-get -y install pandoc rsync python python-pip

RUN pip install --upgrade pip
RUN pip install docopt

ADD convert-directory.py /convert-directory.py
RUN chmod 755 /convert-directory.py

ENTRYPOINT ["/convert-directory.py"]
CMD ["-h"]

http-server

# Dockerfile for Node http-server

FROM stackbrew/ubuntu:precise
MAINTAINER Tom Offermann <tom@offermann.us>

RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > /etc/apt/sources.list
RUN echo "deb http://ppa.launchpad.net/chris-lea/node.js/ubuntu precise main" >> /etc/apt/sources.list
RUN apt-get update

# Faster to add GPG key directly, rather than install python-software-properties
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C7917B12

RUN apt-get update
RUN apt-get install -y nodejs
RUN npm install http-server -g

ENTRYPOINT ["http-server"]
CMD ["--help"]

Each of the 3 images has a single command as the ENTRYPOINT. For git-clone and pandoc-convert, I wrote custom scripts whose functions are described in the Dockerfile comments, while the http-server image's entrypoint is the Node http-server.

Step 4: Use the Data-Only Image

To implement the Markdown conversion workflow, I run each of the three containers in sequence, and each of them utilizes the pandoc-data data-only container.

$ docker run -volumes-from pandoc-data git-clone https://github.com/isaacs/npm

Using the -volumes-from option to docker run gives the git-clone container access to the pandoc-data data-only container. It clones a git repository (npm's repo, in this case) and places the doc directory in the /data/md directory.
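To make that step more concrete, here is a minimal Python sketch of the kind of logic a script like git-clone-docs.py could contain. It is not the script from the repository: the function names, the choice of "doc" and "docs" as candidate directory names, and the shallow clone are all assumptions made for illustration.

```python
import os
import shutil
import subprocess
import tempfile

# Assumption: files with these extensions count as Markdown docs.
MARKDOWN_EXTENSIONS = (".md", ".markdown")


def clone_repo(url, dest):
    """Shallow-clone url into dest using the git command-line client."""
    subprocess.check_call(["git", "clone", "--depth", "1", url, dest])


def contains_markdown(directory):
    """Return True if any file below directory has a Markdown extension."""
    for _, _, filenames in os.walk(directory):
        if any(name.lower().endswith(MARKDOWN_EXTENSIONS) for name in filenames):
            return True
    return False


def find_docs_dir(repo_root, names=("doc", "docs")):
    """Return the first directory named like a docs directory that holds
    Markdown files, or None if there is no such directory."""
    for current, dirnames, _ in os.walk(repo_root):
        dirnames.sort()  # deterministic traversal order
        if ".git" in dirnames:
            dirnames.remove(".git")  # never descend into git metadata
        for name in dirnames:
            candidate = os.path.join(current, name)
            if name in names and contains_markdown(candidate):
                return candidate
    return None


def save_docs(docs_dir, dest="/data/md"):
    """Replace dest with a fresh copy of docs_dir."""
    if os.path.exists(dest):
        shutil.rmtree(dest)
    shutil.copytree(docs_dir, dest)


def main(url, dest="/data/md"):
    """Clone url into a scratch directory and copy its docs to dest."""
    workdir = tempfile.mkdtemp()
    try:
        repo = os.path.join(workdir, "repo")
        clone_repo(url, repo)
        docs_dir = find_docs_dir(repo)
        if docs_dir is None:
            raise SystemExit("no Markdown docs directory found")
        save_docs(docs_dir, dest)
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
```

Because /data is a volume shared from pandoc-data, whatever main() writes to /data/md is still there after the git-clone container exits.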

$ docker run -volumes-from pandoc-data pandoc-convert /data/md /data/html

This time, it's pandoc-convert that can access the pandoc-data volume, and it takes the markdown files in /data/md that were downloaded in the previous step, converts them to HTML, and saves the results in /data/html.
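The conversion step can be sketched in a similar way. The following is a hypothetical stand-in for convert-directory.py, not the author's script: it assumes that only files ending in .md or .markdown should be converted, that the output tree mirrors the input tree, and that pandoc is on the PATH. The helper names are made up for this example.

```python
import os
import subprocess

# Assumption: files with these extensions are the ones to convert.
MARKDOWN_EXTENSIONS = (".md", ".markdown")


def plan_conversions(src, dest):
    """Pair every Markdown file under src with a mirrored .html path under dest."""
    pairs = []
    for current, dirnames, filenames in os.walk(src):
        dirnames.sort()  # deterministic traversal order
        for filename in sorted(filenames):
            stem, ext = os.path.splitext(filename)
            if ext.lower() not in MARKDOWN_EXTENSIONS:
                continue
            relative = os.path.relpath(current, src)
            out_dir = os.path.normpath(os.path.join(dest, relative))
            pairs.append((os.path.join(current, filename),
                          os.path.join(out_dir, stem + ".html")))
    return pairs


def convert_directory(src, dest):
    """Run pandoc once per Markdown file, creating output directories as needed."""
    for source, target in plan_conversions(src, dest):
        target_dir = os.path.dirname(target)
        if not os.path.isdir(target_dir):
            os.makedirs(target_dir)
        # -s asks pandoc for a standalone HTML document rather than a fragment.
        subprocess.check_call(
            ["pandoc", "-f", "markdown", "-t", "html", "-s", "-o", target, source]
        )
```

Calling convert_directory("/data/md", "/data/html") inside the container would correspond to the command line above. Keeping the planning separate from the pandoc call makes it easy to check which files would be touched without running pandoc at all.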

$ docker run -d -volumes-from pandoc-data -p 8080:8080 http-server /data/html

Finally, http-server serves the /data/html directory from the pandoc-data container.
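The three commands above can also be assembled by a small driver. This sketch only rebuilds the argument lists shown in this step. The function names and the dry_run switch are made up for illustration, and the flag spelling is a parameter because this post uses the single-dash -volumes-from form from its era, while later Docker releases spell it --volumes-from.

```python
import subprocess


def build_pipeline(repo_url, data_container="pandoc-data", port=8080,
                   volumes_flag="-volumes-from"):
    """Return the three docker run commands of the workflow, in order."""
    share = [volumes_flag, data_container]  # every step mounts the same volume
    publish = "%d:%d" % (port, port)
    return [
        ["docker", "run"] + share + ["git-clone", repo_url],
        ["docker", "run"] + share + ["pandoc-convert", "/data/md", "/data/html"],
        ["docker", "run", "-d"] + share + ["-p", publish, "http-server", "/data/html"],
    ]


def run_pipeline(repo_url, dry_run=True):
    """Print each command; execute it as well only when dry_run is False."""
    for command in build_pipeline(repo_url):
        print(" ".join(command))
        if not dry_run:
            subprocess.check_call(command)
```

With the default dry_run=True, run_pipeline("https://github.com/isaacs/npm") only prints the three command lines, so the sequence can be reviewed before anything is started.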

What's So Great About Data-Only Containers?

Since I've spelled it out in such exhaustive detail, this may all seem like much ado about nothing...but, now that I finally grok the idea, I can see how data-only containers are so useful.

It's the simplest way to share data among multiple containers.
It makes it easy to upgrade the application (or process) that operates on the data. For example, say we wanted to generate PDF files in addition to HTML files. All we need to do to accomplish that is to create an updated pandoc-convert Docker image.
Adding more processes that operate on the data is equally easy. What if you wanted to push the HTML files to an Amazon S3 bucket, in addition to serving them locally? Just create a new s3sync Docker image.

Not too long ago, I thought I was beginning to get the Docker approach when I separated a database server and a web server into separate containers. But as I start thinking along the lines that each Docker container should have just one piece of functionality or one responsibility, I realize that there are a lot of advantages to breaking your overall application into lots of tiny Docker pieces.

The more I learn about Docker, the smaller my Docker images get, and the more numerous they become.