Recently, Ceph “Luminous” V12.2.2 was released, a bug fix release for the latest stable release of Ceph. It contains some urgently awaited fixes, e.g. for “Bluestore” memory leaks, and admins around the world started upgrading immediately.
Just before Christmas, I had to handle a “situation” with such an upgraded Ceph cluster. It had been working for months, coming from pre-Luminous times, had been upgraded to V12.2.1 a few weeks earlier and was now brought to V12.2.2 in preparation for introducing “Bluestore” OSDs. Admittedly, the cluster wasn’t in perfect shape, but “HEALTH_OK” was reported before and right after the upgrade to V12.2.2.
Things started to go wrong when the first OSDs were taken “out” in preparation for the “Bluestore” conversion (step 2 of the official docs). The cluster reported “too many PGs per OSD” and showed slow requests that didn’t seem to go away. What’s worse, the cluster started to show signs of blocked requests, like unresponsive clients and hanging CephFS access. After some time, these were confirmed by “ceph -s”, where slow requests turned into blocked requests after 4064 seconds, taking the cluster to HEALTH_ERR. Additionally, the PG rearrangement, started by taking out the first OSDs, came to a halt and left the cluster with still high numbers of misplaced and degraded PGs. Overall, the cluster became unusable.
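For reference, the commands below are roughly what we used to watch the situation unfold; output details vary between Ceph releases, so take this as a sketch rather than a reference:

    # overall cluster state, including the slow/blocked request counters
    ceph -s
    # per-issue breakdown, naming the OSDs with blocked requests
    ceph health detail
    # follow the recovery / rebalancing progress (or, in our case, the lack of it)
    ceph -w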
A new feature introduced mid-way
As already mentioned, the cluster wasn’t optimally configured: the number of placement groups per OSD was way above what the Ceph team recommends, even before running “Luminous”. This may have caused some performance degradation and higher memory and CPU demands on the Ceph cluster nodes, but it didn’t keep the cluster from doing its job. So why the sudden warning about “too many PGs per OSD (xxx > max 200)”, which hadn’t shown up earlier and at first seemed merely annoying? Enter “PG overdose protection”. This is something that was introduced with Ceph “Luminous”, but for some reason it doesn’t automatically affect every Luminous cluster with more than 200 PGs per OSD. Rather, it takes something to trigger it, whatever that may be. The linked article mentions two new monitor parameters, “mon_max_pg_per_osd” and “osd_max_pg_per_osd_hard_ratio”. But setting “mon_max_pg_per_osd” won’t make the warning go away, or so it seemed.
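In case you want to check how close your own OSDs are to that limit: the per-OSD PG count is visible in the PGS column of “ceph osd df”, and the per-pool pg_num values show where those PGs come from. A minimal sketch:

    # per-OSD utilization; the PGS column holds the number of PGs mapped to each OSD
    ceph osd df tree
    # pg_num / pgp_num per pool
    ceph osd pool ls detail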
But what’s much worse is what hides in that linked article, where it states: “If any individual OSD is ever asked to create more PGs than it should it will simply refuse and ignore the request.” In other words: while your migrated pre-Luminous cluster, with PGs per OSD above the default limit of 200, will continue to work without any indication of trouble, any operation that would bring additional PGs onto such an OSD (or fill new OSDs above that limit) will simply come to a halt!
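There is no dedicated error state for this, so spotting it takes some guesswork. As far as I can tell, looking for PGs that are stuck inactive (e.g. hanging in “activating”) is the quickest check; something along these lines should do:

    # list PGs that have been stuck in an inactive state for a while
    ceph pg dump_stuck inactive
    # or filter the PG listing by state
    ceph pg ls activating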
And that’s exactly what we faced with the above cluster. Without any further error report, and with only that general warning, cluster reorganization simply stopped. What’s worse, this caused e.g. cache tier write-backs to stall, and in the end Ceph became inoperative from a user’s point of view.
The solution
Once we recognized that we might be hitting that limit, we changed the cluster configuration, setting “mon_max_pg_per_osd” in the “[mon]” section of the cluster’s configuration file, and restarted the monitors. Unfortunately, this didn’t work (and led us to believe we were barking up the wrong tree). Only later, by trial and error, did we notice that adding the parameter (and an accompanying “osd_max_pg_per_osd_hard_ratio”) to the “[global]” section made the problem go away, after restarting all affected OSDs.
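For reference, this is roughly what the working configuration boiled down to. The values below are purely illustrative; pick numbers that actually cover your real PGs-per-OSD figure:

    [global]
    # raise the soft limit above the cluster's actual PGs-per-OSD count
    mon_max_pg_per_osd = 400
    # the hard limit at which OSDs refuse new PGs is
    # mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio
    osd_max_pg_per_osd_hard_ratio = 2.5

followed by a restart of the affected OSDs, e.g. “systemctl restart ceph-osd@<id>” on the respective nodes.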
(Update: In a recent message on the mailing list it was mentioned that the daemon actually acting on the “mon_max_pg_per_osd” parameter is the “manager” / mgr. So restarting that one, after moving the parameter(s) to the global section, is what made the message go away.)
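A minimal sketch of what that amounts to, assuming the mgr id equals the short hostname (as is common with ceph-deploy style setups):

    # restart the active manager daemon on its host
    systemctl restart ceph-mgr@$(hostname -s)
    # double-check the value the running daemon actually uses (via its admin socket)
    ceph daemon mgr.$(hostname -s) config get mon_max_pg_per_osd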
Once the changes were effective and the cluster had had time to re-place all the PGs it felt needed a new storage location, not only did the degradation go away, but the “too many PGs per OSD” warning vanished as well. Until then, it had kept showing despite the reconfigured maximum, and it didn’t return even when rebalancing the PG tree again later on.
So there are two things to learn: it seems that not every parameter starting with a “mon_” prefix affects only the monitors; some influence the operation of other Ceph daemons, too. And in its current implementation, the Ceph cluster may be brought into a state where OSDs simply cease operations, without further notice and without reporting an error. At least the latter is confirmed and reported on the mailing list; the tracker issue is at http://tracker.ceph.com/issues/22440. Hopefully we’ll see a fix in an update soon.