
---
title: Accidental updates
date: 2018-04-02T21:11:41+01:00
author: butlerx
tags: [Docker, Postgres]
---

Why you need to lock versions

## The Preamble

Redbrick runs a service called HackMD, a web-based markdown editor. At the start of the year HackMD version one came out, and in early February we decided to update it.

Unlike most updates in Redbrick, this one isn't really a big deal, as HackMD runs inside a Docker container with its only external dependency being Postgres 9. We don't actually have a central Postgres database in Redbrick (a problem for another day), just a MySQL one. So we run Postgres inside another container.

The update process was meant to be simple: update the Dockerfile, run `docker-compose up --build`, and wait....
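For context, the compose file looked roughly like this. It's a minimal sketch rather than the exact file, so the service names, port, and volume path are made up; the important detail is the `postgres` image with no tag:

```yaml
version: '3'
services:
  hackmd:
    build: .            # HackMD itself, built from the local Dockerfile
    ports:
      - "3000:3000"
    depends_on:
      - database
  database:
    image: postgres     # no tag, which silently means postgres:latest
    volumes:
      - ./postgres:/var/lib/postgresql/data
```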

## What Actually Went Wrong

And we waited, but HackMD didn't come back. One tail of the logs later and the problem was obvious: Postgres had restarted. But not only that, it was trying, and failing, to upgrade itself to Postgres 10.

The problem was that Docker had pulled the latest version of Postgres. That was fine when HackMD was first installed a year earlier, when the default Docker tags still pointed to Postgres 9, but it wasn't so great afterwards.

## The Fix

The fix was pretty simple: add a version tag for the database image to the docker-compose.yml. A quick vim and `docker-compose up` later, and HackMD was back with lots of new bells and whistles.
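The change itself is one line on the database service. Again a sketch, and the exact tag shown here is illustrative; the point is pinning the image to a Postgres 9 release rather than whatever `latest` happens to be:

```yaml
  database:
    image: postgres:9.6   # pinned, instead of a bare "postgres" that resolves to :latest
```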

So we went through all the other compose files and changed them to use tagged versions of their dependencies too.

Problem solved. Wash our hands and move on. Well, unfortunately not.

## The Solution

Over the next couple of months a couple of problems with HackMD were mentioned, but no one really looked into them. That was until it affected an admin: we couldn't publish our roadmap for incoming admins.

So back to the logs, and yep, a database issue. It seemed that Postgres couldn't find some of the keys. The initial thought was that we had missed a database migration way back when we upgraded. So we execed into the container, ran the migration script, and .... nothing, there weren't any migrations to run.

Back to the drawing board. We start reviewing configs and double-checking them against the repo, but everything seems right. Next we decide to go to the heart of the problem: the database itself. One `docker run` later and we have a psql shell. We start digging through tables, trying to find the missing key, when we get a duplicate key error and an alias.
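Getting that shell is a one-liner either way; the container, network, user, and database names below are assumptions based on docker-compose defaults:

```sh
# Option 1: exec into the running database container and open psql
docker exec -it hackmd_database_1 psql -U hackmd hackmd

# Option 2: run a throwaway client container on the compose network
docker run --rm -it --network hackmd_default postgres:9 \
  psql -h database -U hackmd hackmd
```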

BINGO

The odd thing was that when you looked up the alias there was only one entry for it. We couldn't delete the duplicate entry, as it didn't exist, and we couldn't modify the table entries, as we got duplicate key errors.

A bit of googling later and we had a solution: copy the table, delete it, and restore it. Is this the database version of turning it off and on again?

```sql
hackmd=# SELECT DISTINCT * INTO notes FROM "Notes";
SELECT 268
hackmd=# DROP TABLE "Notes";
DROP TABLE
hackmd=# ALTER TABLE notes RENAME TO "Notes";
ALTER TABLE
hackmd=# REINDEX DATABASE hackmd;
REINDEX
hackmd=# VACUUM(FULL, ANALYZE, VERBOSE);
```

It turns out the table's indexes were wrong. While Postgres' attempt to update itself to 10 had failed, it had modified the indexes on some of the tables, and reverting the container didn't magically fix the database inside it.

So, the tl;dr:

- Always lock your container versions
- Containers don't magically fix things
- And validate your database after modifying it (one way to check is sketched below)
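On that last point, one cheap check is to ask Postgres which indexes it no longer considers usable. This query wasn't part of the fix above, just the kind of validation that makes problems like this show up sooner:

```sql
-- list indexes Postgres has flagged as invalid or not ready for use
SELECT c.relname AS index_name, i.indisvalid, i.indisready
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE NOT i.indisvalid OR NOT i.indisready;
```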