Recently, Ceph “Luminous” V12.2.2 was released, a bug fix release for the latest stable release series of Ceph. It contains some urgently awaited fixes, e.g. for “Bluestore” memory leaks, and admins around the world started upgrading immediately.
Just before Christmas, I had to handle a “situation” with such an upgraded Ceph cluster. It had been working for months, coming from pre-Luminous times, had been upgraded to V12.2.1 a few weeks earlier, and was now brought to V12.2.2 in preparation for introducing “Bluestore” OSDs. Admittedly, the cluster wasn’t in perfect shape, but “HEALTH_OK” was reported before and right after the upgrade to V12.2.2.
Things started to go wrong when the first OSDs were taken “out” in preparation for “Bluestore” OSDs, step 2 of the official docs. The cluster reported “too many PGs per OSD” and showed slow requests that didn’t seem to go away. Worse, the cluster began to show signs of blocked requests, such as unresponsive clients and hanging CephFS access. After some time, these were confirmed by “ceph -s”, where slow requests turned into blocked requests after 4064 seconds, taking the cluster to HEALTH_ERR. Additionally, the PG rearrangement triggered by taking out the first OSDs came to a halt, leaving the cluster with persistently high numbers of misplaced and degraded PGs. Overall, the cluster became unusable.
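For context, the “out” step mentioned above is a standard Ceph operation. A minimal sketch of the commands involved might look like the following (the OSD id 0 is just an example; these need to be run against a live cluster with admin credentials):

```shell
# Mark an OSD "out" so Ceph starts migrating its PGs to other OSDs
# (replace 0 with the id of the OSD being retired)
ceph osd out 0

# Watch overall cluster state; this is where warnings such as
# "too many PGs per OSD" and slow/blocked requests show up
ceph -s

# Get a more detailed breakdown of the current health warnings
ceph health detail
```

Normally one would wait for the rebalancing triggered by `ceph osd out` to finish (cluster back to HEALTH_OK) before touching the next OSD; in the situation described here, that rebalancing never completed.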