Sarah Ting

Zero downtime deploys with Docker and php-fpm/nginx

I’m still in the middle of moving an existing web application into Docker! I’ve written about it a little previously in Laravel/PHP/nginx Alpine Dockerfile.

For my personal projects I refuse to use cloud hosts due to the cost. This means I get to have lots of fun figuring out the deployment pipeline that I normally wouldn’t have to worry about on a cloud host with managed container orchestration.

I have three production servers set up —

  • Web server (nginx container, certbot container, and app container with the web image)
  • Jobs server (app container with the jobs image)
  • Scheduler server (app container with the scheduler image)

This post pertains to the deploy script for the web server. My previous deployment process for this project was handled by a single bash script that did the following (roughly sketched after the list) —

  • Pull the most recent release branch into the web server
  • Build the project into a new release folder
  • Update the symlink of the web folder to point towards the new release folder
  • Gracefully reload apache
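
Something along these lines — the paths, repo layout, and build commands here are placeholders rather than my exact setup —

#!/bin/bash
set -e

# Hypothetical layout: a persistent checkout that gets pulled, plus a
# timestamped release folder per deploy
cd /var/www/repo && git pull origin release

RELEASE="/var/www/releases/$(date +%s)"
cp -r /var/www/repo "$RELEASE"
cd "$RELEASE"
composer install --no-dev
npm ci && npm run build

# Repoint the web root at the new release, then reload apache gracefully
ln -sfn "$RELEASE" /var/www/current
apachectl graceful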

This is fast and easy but has a few obvious stability issues. Since this will be deprecated anyway, let’s move on to talking about the Docker release strategy instead.


😱 docker-compose: The naive solution

My first attempt was to just toss a docker-compose YML onto each staging server so I could docker-compose up to bring the application up.

I didn’t have high expectations, and my suspicions were confirmed — if I updated the image version and re-upped the application, the website would go down (502) while the new container was being deployed. This is caused by the old container being removed before the new one is running.
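
The redeploy itself amounted to nothing more than bumping the tag and re-upping the service (the service name and tag variable here are just illustrative) —

# naive redeploy — compose stops and removes the running app container
# before starting the new one, so requests 502 in the gap
export APP_TAG=v1.2.3
docker-compose pull app
docker-compose up -d app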


🤔 docker-compose --scale?

I found one solution published by Tines that seemed to make sense. It describes creating a deployment bash script which takes the following steps (sketched roughly after the list) —

  1. Use the --scale flag of docker-compose to put up a new container alongside the old.
  2. Wait for the new container to respond on the provisioned port.
  3. Reload nginx so that nginx is now aware of the new container.
  4. Stop the old container.
  5. Reload nginx again so that nginx removes the old container from the upstream.
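
In rough shape that script looks something like this — the compose service name "app", the nginx container name, and the port check are placeholders; see the original post for the full version —

# capture the currently running container before scaling up
OLD_CONTAINER=$(docker-compose ps -q app)

# 1. put a second app container up alongside the old one
docker-compose up -d --no-recreate --scale app=2 app

# 2. wait for the new container to respond on its provisioned port
#    (exactly how depends on how ports are published — this is a placeholder)
NEW_CONTAINER=$(docker-compose ps -q app | grep -v "$OLD_CONTAINER")
NEW_PORT=$(docker port "$NEW_CONTAINER" 80 | head -n 1 | cut -d: -f2)
until curl -fsS "http://localhost:${NEW_PORT}" > /dev/null; do sleep 1; done

# 3. reload nginx so it becomes aware of the new container
docker exec nginx nginx -s reload

# 4. stop the old container
docker stop "$OLD_CONTAINER"

# 5. reload nginx again so the old container drops out of the upstream
docker exec nginx nginx -s reload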

I tried this out, but it still produced a noticeable second or so of downtime, which I could reproduce just by having the website open in one window and continuously refreshing. It didn’t present as an error message; instead the page would hang and load forever.

This downtime happens between steps 4 and 5 above, and occurs because nginx routes requests to a container that is being stopped. I thought I might be able to fix this by adjusting the nginx settings, but I still had other ideas I wanted to try, so I moved on to the next solution.


🚀 Back to bash!

Servers For Hackers’ Shipping Docker course suggests a more manual solution that explicitly replaces the upstream docker container name in the nginx configuration. I made some additions of my own to suit my use case and cobbled together the following script.

This does the following —

  1. Put up a new container
  2. Spam the healthcheck for the container until it passes
    1. I decided to replace the curl command with the container’s built-in healthcheck, as I think this more reliably reflects the container’s status (see the note on defining the healthcheck after this list). This comes with the notable downside of a delay before the first healthcheck passes (caused by the initial healthcheck always occurring after the configured healthcheck interval, per the documentation).
    2. I don’t mind the wait, but I also added an optional -n flag which skips the healthcheck and immediately swaps the new container in. I could improve on this by skipping the healthcheck for the new container when the old container is already failing its own healthcheck, since that means the site is already down.
  3. Once the healthcheck passes:
    1. Explicitly replace the new container name in the nginx configuration, then reload nginx. This step aborts if the nginx configuration test fails.
    2. Once the swap is complete, remove the old container and any dangling images.
  4. If the healthcheck fails, or times out (after 2 minutes):
    1. Abort and clean up the new containers.
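
One prerequisite worth calling out — docker inspect only has a Health.Status to report if the container actually has a healthcheck defined. Mine is defined on the image itself, but it can also be attached when the container is started; a minimal sketch, where the check command and intervals are placeholders rather than what I actually use —

docker run -d \
  --health-cmd="curl -fsS http://localhost:8080/up || exit 1" \
  --health-interval=10s \
  --health-timeout=3s \
  --health-retries=3 \
    acme:v1.2.3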

This works seamlessly without any downtime or lag between deployments.

#!/bin/bash

set -e

help()
{
   echo ""
   echo "Usage: $0 -n -v0.0.0"
   echo -e "\t-n Skip container healthcheck"
   echo -e "\t-v Release version number"
   exit 1
}

timeout()
{
  echo "Timed out while waiting for health check"
  echo "Cleaning up"
  docker stop $NEW_CONTAINER
  docker rm $NEW_CONTAINER
  echo "Removed container $NEW_CONTAINER"
  exit 1
}

while getopts nv: option
do
  case "${option}" in
    v) DOCKER_TAG="${OPTARG}";;
    n) SKIP_HEALTH_CHECK=1;;
    ?) help;;
  esac
done

# Generate docker container names
OLD_CONTAINER=$(docker ps -a -q --filter="name=acme_app")
NEW_CONTAINER="acme_app_${DOCKER_TAG}_`date +"%s"`"

# Put up the new container
docker run -d \
  --restart=always \
  --name="$NEW_CONTAINER" \
    acme:$DOCKER_TAG

if [ -z "$SKIP_HEALTH_CHECK" ]; then
  # wait for new container to be available
  echo "Waiting for container to pass health check $NEW_CONTAINER"
  START_TIME=$(date +%s)
  until [ "$(docker inspect -f '{{.State.Health.Status}}' $NEW_CONTAINER)" == "healthy" ]; do
      ELAPSED_TIME=$(($(date +%s) - $START_TIME))
      if [[ $ELAPSED_TIME -gt 120 ]]; then
        timeout
      fi
      sleep 0.1;
  done;
fi
echo "Started new container $NEW_CONTAINER"

# (Update Nginx)
# Here I update my nginx configs to point to $NEW_CONTAINER
# It's just a sed find and replace
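#
# e.g. something along these lines — the config path and the fastcgi_pass
# pattern are placeholders, not my actual layout:
#
#   sed -i -E "s/(fastcgi_pass +)[^;]+;/\1${NEW_CONTAINER}:9000;/" \
#     /etc/nginx/conf.d/acme.conf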

# Config test Nginx
# (run the test inside the if so `set -e` doesn't abort the script before
# the failure can be reported below)
if docker exec -T nginx nginx -t; then
    # Reload Nginx
    docker exec -T nginx nginx -s reload

    # Clean up old app containers
    if [ -n "$OLD_CONTAINER" ]; then
      echo "Removing old app container $OLD_CONTAINER"
      docker stop $OLD_CONTAINER
      docker rm $OLD_CONTAINER
      echo "Removed old app container $OLD_CONTAINER"
    fi

    # Clean up dangling images
    echo "Clean up dangling"
    DANGLING_IMAGES=$(docker image ls -f "dangling=true" -q)
    if [ -n "$DANGLING_IMAGES" ]; then
        docker image rm $DANGLING_IMAGES
    fi
else
    echo "ERROR: Nginx configuration test failed!"
    exit 1
fi
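
With all of that in place, a deploy is just a matter of calling the script with the new release tag (deploy.sh being whatever you’ve named the script above) —

./deploy.sh -v1.2.3        # wait for the healthcheck before swapping
./deploy.sh -n -v1.2.3     # swap immediately, skip the healthcheck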

⛵ CICD

I’m using GitHub Actions (GHA) for my CICD; the deploy flow is super straightforward and triggers automatically on a push to the release branch —

  1. Bump the version and tag the release.
  2. Build and push the new images to the docker repository.
  3. Remotely trigger the above deploy script using the release tag from step 1 to deploy to the staging server (sketched after this list).
  4. After confirming staging, I have a separate GHA workflow that deploys the latest release tag to the production servers.
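
Step 3 (and its production counterpart in step 4) ultimately just runs the deploy script on the target server with the new tag — e.g. over SSH, something along these lines, where the host, path, and tag variable are placeholders:

ssh deploy@staging.example.com "cd /srv/acme && ./deploy.sh -v${RELEASE_TAG}"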

My tests run before merging to master, but maybe it’d be safer to add a test suite run before a release too… 🤔


That’s all! I’ll be testing and slowly rolling this out over the next few days.