---
title: 'Paying the Debt: MySQL 5.5 to 8.0'
date: '2019-02-18T09:11:41+01:00'
author: m1cr0man
twitter: m1cr0m4n
description: Fixing MySQL, with bonus containerization
tags:
- MySQL
---
# Paying the Debt: Upgrading MySQL 5.5 to Percona 8.0
It cannot be denied that Redbrick's services have been a bit wonky recently.
Our uptime has been all over the place and our systems have been crumbling. As
much as we would love to blame this on our hardware, this is not the case. RB's
tech stack has been a spaghetti monster for quite some time, having multiple
appendages added to it over the years. Sufficient time has not been committed
to keeping things up to date, and the tech debt has finally caught up with us
in a major way.
## SELECT issues FROM mysql\_history;
It was diagnosed some time in 2018 that the reason for the intermittent
downtime of a number of our sites was our single all-serving MySQL 5.5 instance.
This runs on our server Metharme - an Ubuntu _12.04_ box with quite a beefy
spec which houses our Apache sites, MySQL and a plethora of other services,
all deployed monolithically on the bare metal. Needless to say, any updates
to this box are a feat of human engineering.
In November things became desperate. We had to restart the MySQL instance
every 3 hours or so to keep our services online. We tried using a cron job
to automate this (which was a bad enough idea already) but to our horror we
discovered that cron is broken on Metharme - an issue for another day.
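For the record, the stopgap we attempted was the sort of crontab entry shown below. The service name, paths and log file are assumptions for illustration - and on Metharme it never actually ran, because cron itself was broken:

```shell
# Restart MySQL every 3 hours - a stopgap, not a fix.
0 */3 * * * /usr/sbin/service mysql restart >> /var/log/mysql-restart.log 2>&1
```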
On top of the monitoring alerts themselves, we started to receive a lot of
emails from our users stating that their site was down, to the point where
we built up some 10-mail threads with people in a mixed attempt to resolve
and diagnose the issues. Something else had to be done.
## SELECT * FROM investigation;
Diagnosing this failure was very difficult. There are hundreds of websites
on Redbrick's infrastructure which rely on this service. There are also
hundreds of databases in our MySQL instance, some of which date back to
MySQL 4.X. Narrowing down a culprit was akin to finding a needle in a haystack.
From space.
Below you can see our MySQL Queries counter, which clearly shows our 3-hour
restart hack. Where it flatlines is where the service was down.
<img src="/img/2019-02-18-metharme.webp" alt="Metharme stats" />
None of our graphs indicated any sort of hardware limitation to be the culprit,
and our only idea at this point was to (at least try to) upgrade our instance.
The Apache logs and metrics were no help either, as we do not have per-site
insights into our performance (since we only have one Apache instance).
## UPDATE mysql SET version = 8.0;
Updating MySQL in-place wasn't an option. The latest available package for
Ubuntu 12.04 was already installed on Metharme. The best thing we could do
was deploy MySQL containerised onto Zeus, our container server, and redirect
traffic to it.
There was a lot of discussion around what version to use. We looked at MySQL 5.7
and 8.0, MariaDB 11, and Percona 8.0. The logic was that if we were going
to upgrade to a breaking version with some confidence, it would be Percona.
Should that fail catastrophically, we would fall back to MySQL 5.7. Maria
was quickly ruled out as an option, as it cannot replicate between MySQL
and itself due to some breaking changes in its design (namely GTID changes).
Putting it in a container was simply a means to an end. We ended up
attaching a bridge network and giving the container a "real" IP
address on our network - one which looks and behaves like a baremetal host.
This means that we have an IP address for MySQL which can "float" to wherever,
or whatever, is hosting the instance, and it also doesn't complicate our
routing any further.
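As a rough sketch, the container setup looks something like the commands below. The network driver, interface name, subnet, addresses and container names are all illustrative assumptions, not our actual configuration:

```shell
# Create a network whose containers get routable IPs on the LAN.
# (A macvlan driver is one way to achieve this; values here are made up.)
docker network create -d macvlan \
  --subnet=192.168.0.0/24 --gateway=192.168.0.1 \
  -o parent=eth0 rb-net

# Run Percona with a fixed "real" IP that can later float to another host.
docker run -d --name percona80 \
  --network rb-net --ip 192.168.0.50 \
  -v /srv/mysql:/var/lib/mysql \
  percona:8.0
```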
Another goal was to "actually" update the databases. This meant maintaining
as many defaults as possible in the new instance - such as the `SQL_MODE`, which
controls what sort of error tolerance it provides. We only needed to set the
following things in the end:
```
[mysqld]
default_authentication_plugin=mysql_native_password
character_set_server=utf8mb3
collation_server=utf8mb3_unicode_ci
```
As it turns out, there's almost no PHP support for the newer password hashes,
and on top of the work required to convert the old passwords, this ended up
being a necessary change.
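For illustration, pinning an individual account to the legacy plugin (rather than via the server-wide default above) looks like the statement below - the account name, host and password are placeholders:

```shell
# Hypothetical account; keeps old PHP mysqli/PDO clients connecting.
mysql -u root -p -e \
  "ALTER USER 'webuser'@'%' IDENTIFIED WITH mysql_native_password BY 's3cret';"
```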
## UPDATE mysql SET host="Zeus";
This upgrade had to be seamless. Shutting down MySQL (despite the terrible
uptime) wasn't an option until the replacement was production ready.
In order to pull this off, we restored a backup from Metharme to Zeus and
enabled replication from the old to the new instance. Despite the countless
warnings and edge cases painted all over the documentation, this worked
almost perfectly. I say almost - as some tables were so old they couldn't be
restored and others were corrupt in the 5.5 instance itself. Of the ~584
databases we have, only 12 ended up being left behind.
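In outline, the backup-and-replicate dance looks roughly like the following. Hostnames, credentials and the binlog coordinates are placeholders, and the exact flags depend on the versions involved:

```shell
# 1. Dump everything from the 5.5 instance, embedding the binlog position.
mysqldump --all-databases --master-data=2 --single-transaction \
  -h metharme -u root -p > all.sql

# 2. Load the dump into the new Percona 8.0 instance.
mysql -h zeus -u root -p < all.sql

# 3. Point the new instance at the old one and start replicating.
mysql -h zeus -u root -p -e "
  CHANGE MASTER TO MASTER_HOST='metharme', MASTER_USER='repl',
    MASTER_PASSWORD='s3cret', MASTER_LOG_FILE='mysql-bin.000042',
    MASTER_LOG_POS=4;
  START SLAVE;"
```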
Once the backup + replication was set up and working, we began testing a few
sites on the new instance. One of the first was [the wiki](https://wiki.redbrick.dcu.ie/mw/Main_Page),
which went perfectly. After changing the MediaWiki configs we were immediately
able to read and change wiki entries on the new instance.
Next we moved a random tribute member's site. The webapp they are using is
called [Koken](http://koken.me/) and provides an image gallery blog. It's the
perfect outdated-PHP worst-case scenario for the test, and after changing
some compatibility flags in Percona 8.0 and changing the `SQL_MODE` for
the app's connections, it worked fine!
## INSERT INTO mysql\_history(result,issues) VALUES ("SUCCESS", 0);
Fortunately, the admins of old made at least one good decision. MySQL has its
own DNS entry, and all our services reference it. In order to switch
everything to the new instance, all we had to do was update the DNS entry
to point to the new IP address and it would "just work".
To our own surprise and relief, this turned out to be true. After changing the
address it took us a moment to verify that the switch had actually been made, as
the only visible difference was in the graphs on either side. Every site
appeared to continue operating normally.
## DROP DATABASE metharme;
With the DNS now switched, there was no real turning back. Data was already
being written to the new instance and not replicated back to MySQL 5.5. With
that, we immediately shut down Metharme's MySQL. Most would consider this an
insane move given the whole 30 seconds of testing we had between the DNS switch
and the shutdown of the old instance, but we were confident that our new
instance was operating normally and that it was safe to do so.
Metrics on the new instance have been deceptively low, however. Here's a screenshot
of our metrics the day this post was made.
<img src="/img/2019-02-18-zeus.webp" alt="Zeus stats" />
As you can see, we're seeing a fraction of the traffic we had before, and things
are just as performant as they always were. This would
suggest that the service(s) to blame for killing 5.5 were either hard-coded to use
the old IP address or aren't compatible with Percona 8.0.
To date, we have encountered no issues with MySQL, nor have any users
contacted us about site problems. It's safe to say, MySQL is fixed.

m1cr0man && The admin Team