Ceph caching for image pools

Running Ceph storage for a small or mid-size private cloud can easily become challenging, and finding supporting information is not always easy.

A major concern will likely be the overall speed of the Ceph cluster as seen by the clients. Equally important is the money required to build and operate the cluster. So how do you balance the two? Will you need SSDs? Will you really need 10G networking?

Here’s my report of what started as a demo environment and moved on to what you may call a production system.

The starting point

My guess is that nobody starts an OpenStack+Ceph environment from scratch, nor as a newbie to IT – you’ll need a significant amount of knowledge when starting this endeavor, and typically there’ll be an active production system that needs to be taken to a new level. If you’re lucky (as we were), you’ll get a chance to test-drive the new environment before making the jump into production use with all its constraints. Still, the result will have to compare to what was in place before, and you’ll be trying to wrap your head around the new technologies by applying your existing knowledge.

Our previous world was a small cluster of Xen servers, using a self-built Fiber Channel environment for storage. The Xen servers were based on “SUSE Linux Enterprise Server”; storage was provided by two OpenSUSE servers using a complex setup to provide NFS, SaMBa and block storage (with SCST as the Fiber Channel target). The whole solution had evolved over many years, growing in both size and complexity, but it always remained a test-bed and learning environment for us, and a playground for getting hands-on experience with new technology. (Which, by the way, was the main motivation to set up our own private cloud, too.)

Focusing on the storage part, the latest version of our servers can be described by the following cornerstones:

  • Two physical servers in a single enclosure (SuperMicro Twin²), with 12 SAS disk bays per server, were the main storage machines
  • An MD-RAID6 served as the mass-storage back-end, with up to 10 SAS HDDs (2.5″)
  • An MD-RAID1 with two SSDs served as a cache, created via Linux’ “bcache” feature
  • LVM and a broad mix of logical volumes, one per storage resource (be it NFS, SaMBa share or the disks of a specific Fiber Channel initiator)
  • DRBD to mirror the LVs of both machines to a third server, our “backup machine”
  • NFS, SaMBa, SCST, rsync and other services on top of that, handing out the storage resources via two “1 Gbps” Ethernet connections per server (bundled via LACP)
  • A third (older) server, acting solely as a machine to drive data backups of our storage
    • RAID6 via an Areca card, with room for 16 3.5″ HDDs (one RAID6 set per main server)
    • DRBD to receive the mirrored data from the main servers’ LVs
    • A software solution to copy the mirrored volumes to external disks for round-robin, off-site storage (for disaster recovery purposes)
    • independently running a Bacula instance to create historical backups of any important data, fed by corresponding agents inside the various virtual machines and hardware servers, pushing the data to an external tape library

So in effect, we had two storage servers with redundantly stored data (RAID6 locally and DRBD copies on the third, older server), serving all virtual machines and even the larger storage requirements of physical servers. This setup was put to the test during a major disk outage on one of the two servers, where DRBD saved production by bypassing the failed SAS layer and feeding the data from the DRBD copy on the old server. A complete server failure, however, would not have been covered: we would have had to manually recreate the more important data stores on the remaining server until the failed server was available again.

The latest step in the evolution of that setup was to integrate “bcache”. Before that, we had significant I/O waits on the two main servers, because we’re running 30-something virtual Linux machines off that storage… and with many small requests (mostly writes) from these VMs, the I/O bandwidth had a hard time coping. Once we switched to using bcache (right at the “physical volume” layer), those small writes mostly went to the SSDs and the server load dropped to between zero and 2 percent on average. The improvement was felt when working with the VMs: they definitely weren’t “laggy” anymore.

Going for a private cloud

Running a private cloud promised to be very favorable for us for a number of reasons, but it was a whole new experience. Design paradigms had shifted, and we had a lot to learn, a lot to change and quite a few design requirements to drop.

Something we weren’t able to achieve was getting our OpenStack environment to work with our existing SCST (Fiber Channel) storage servers. This hurt the most, because we had invested significantly in that technology (switches and HBAs) and were pretty happy with the I/O performance we were able to achieve. Switching to NFS seemed like chickening out, and it would, after all, use our 1G Ethernet instead of our 4G Fiber Channel infrastructure.

Still, in the end, we decided to give Ceph a try. It would be running across our 1G Ethernet, but at least we’d get hands-on with that new technology and eliminate the current dependency on single servers.

Try and fail, re-try and re-fail

Unfortunately, we were in no position to request the hardware for a completely new Ceph storage infrastructure. All we could do was to try to integrate the Ceph services with our existing storage hardware servers.

We knew it’d lead to inferior results, but we opted to start with running Ceph OSDs on logical volumes, using our existing bcache’d RAID and LVM environment. As our private cloud was, at that point in time, not much more than a proof of concept, we only had to provide small amounts of disk space and were prepared to add new LV/OSD combos when required. And interestingly, it did give surprisingly fast results – at least if you take our pretty low expectations into account. In the end, though, this road led nowhere, or rather, was a dead end.

For those not that familiar with Ceph: Ceph itself creates redundancy for the data it handles, so putting Ceph storage on RAID means redundantly storing data on multiple, already (RAID-)redundant block devices. That’s a waste of both capacity and processing time. And it adds complexity.
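
To illustrate (a generic example, using the pool name that appears later in this article): redundancy in Ceph is simply a per-pool property, handled well above the disk layer.

# show how many replicas Ceph keeps of each object in a pool
ceph@cephnode01:~/ceph_deploy> ceph osd pool get images size
# and, if needed, change it
ceph@cephnode01:~/ceph_deploy> ceph osd pool set images size 3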

So we switched tactics, moving to separate physical disks for Ceph. This is still a small installation, so buying super-speed disks was not an option. At least we got enterprise-grade SAS HDDs, and given the very positive results we had seen when introducing bcache to our original design, we decided to keep data and journal on the same disk, but to add a layer of bcache between the HDD and the OSD process.
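
For illustration, preparing one such bcache’d OSD looks roughly like this – a sketch with assumed device names (/dev/sdX for the data HDD, /dev/sdY1 for a partition on the caching SSD), not a literal transcript of our commands:

# create the backing device on the HDD and the cache device on the SSD partition
make-bcache -B /dev/sdX
make-bcache -C /dev/sdY1
# attach the backing device to the cache set (UUID as reported by “bcache-super-show /dev/sdY1”)
echo <cset-uuid> > /sys/block/bcache0/bcache/attach
# small, frequent writes should be absorbed by the SSD, so switch to write-back mode
echo writeback > /sys/block/bcache0/bcache/cache_mode
# hand the resulting bcache device to Ceph, data and journal on the same device
ceph-deploy osd prepare cephnode01:/dev/bcache0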

Having migrated our data from LV-based OSDs to HDD-based OSDs and then moved virtual machines to the new private cloud (so their disk images moved from the old SCST-based Fiber Channel to the new Ceph environment), we were able to free up disks in the RAID6 setup and use them as further (bcached) HDDs for more OSDs.

Unfortunately, although the SSDs used to create the servers’ common bcache device were enterprise-grade and write-optimized, those multiple OSDs (and the remaining data from the old environment, especially NFS resources) created too much contention even for these SSDs. Performance monitoring showed that the HDDs were not at their I/O limits most of the time, but the SSDs were all too often at 100% of their I/O capacity. This led to a rather unpleasant situation, with sluggish VM responses and sometimes even many seconds of pure waiting before anything showed up in the user interface. On top of that came the other major difference to our original setup: a significant reduction in streaming throughput, roughly from a previously sustained 130 MB/s down to 25 to 40 MB/s. The latter issue wasn’t nice, but the former problem of sluggishness was driving us mad.

But where’s the bottleneck – only the HDD/bcache combo? Or was it networking, too? Whenever you read up on Ceph, the recommendation you’ll find is “10G networking”. What we have is 1G (or 2G, thanks to the LACP link aggregation). We run pretty close monitoring on our systems, which includes tracking the bandwidth used on all active Ethernet switch ports. Traffic rarely maxed out on the Ceph server ports, and if it did, then only during scrubs (which we restricted to night-time, for obvious reasons). Typical day-time load was rather in the area of 240, sometimes 400 Mbps – these 1G links still had plenty of bandwidth left to offer.
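
If you don’t have switch-port monitoring in place, the same sanity check can be done on the Ceph nodes themselves – a small sketch, assuming sysstat is installed and the bonded interface is called bond0:

# per-interface throughput, sampled every 5 seconds, 12 samples
sar -n DEV 5 12
# or a quick live view of the bonded interface’s byte counters
watch -n 2 'grep bond0 /proc/net/dev'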

Introducing a caching tier

With all that positive experience with bcache, the idea of running an SSD caching tier in front of the main Ceph storage tier with its slow devices seemed very appealing to us – especially as we had already been able to verify that our main access pattern was one of many small writes to the (virtual) device layer. Our bcache devices were set to large dirty-block ratios, but still typically held less than 5 GB of “dirty blocks” per server: it was mainly the same blocks being re-written most of the time, for small log entries and other similar items.
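
Those numbers can be read straight from bcache’s sysfs interface – a sketch, assuming the device shows up as bcache0:

# amount of dirty data currently held on the caching SSD for this backing device
cat /sys/block/bcache0/bcache/dirty_data
# the dirty percentage bcache tries to maintain by throttling background writeback
cat /sys/block/bcache0/bcache/writeback_percent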

When looking up information on the caching tier in the Ceph documentation, you will likely stumble over the following paragraph:

Cache tiering will degrade performance for most workloads. Users should use extreme caution before using this feature.

[…]

The following configurations are known to work poorly with cache tiering.

  • RBD with replicated cache and erasure-coded base: This is a common request, but usually does not perform well. Even reasonably skewed workloads still send some small writes to cold objects, and because small writes are not yet supported by the erasure-coded pool, entire (usually 4 MB) objects must be migrated into the cache in order to satisfy a small (often 4 KB) write. Only a handful of users have successfully deployed this configuration, and it only works for them because their data is extremely cold (backups) and they are not in any way sensitive to performance.
  • RBD with replicated cache and base: RBD with a replicated base tier does better than when the base is erasure coded, but it is still highly dependent on the amount of skew in the workload, and very difficult to validate. The user will need to have a good understanding of their workload and will need to tune the cache tiering parameters carefully.

Well, yes. Having a good understanding of your workload is always a prerequisite for a good design. If your RADOS block devices are mostly used for CPU-intensive processing and the main I/O load stems from small writes in mostly the same areas, then introducing a caching tier is something that will really help to reduce the write latency, from the VMs’ point of view.
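
If you are unsure about your own pattern, a short fio run from inside one of the VMs gives a quick impression of small-write latency – a hedged sketch against a scratch RBD-backed disk (/dev/vdb is just an example, and the run destroys its contents):

# 4k random writes with direct I/O for 60 seconds; watch the completion latency figures
fio --name=smallwrites --filename=/dev/vdb --rw=randwrite --bs=4k \
    --iodepth=16 --ioengine=libaio --direct=1 --runtime=60 --time_based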

So while trying to locate advice on and experience with setting up an SSD-based cache tier in Ceph, various interesting pieces of information popped up. For instance, that a simple mirroring setup with two OSDs will already provide some latency improvement, since Ceph clients won’t have to wait for three (or even more) replicas to be updated. On the other hand, the source quoted above mentions that using a cache will almost always lead to a drop in performance.

We were biased. We had had pretty good experiences with SSD caching at the device level. So we decided to give Ceph’s caching feature a try.

Our implementation

Using an SSD caching tier with only two SSDs perfectly fit our current hardware situation – two “real” NAS/SAN hardware servers, with an older node serving as a third node for replication and backup. So we checked the market prices of current write-optimized SSDs, recovered from fainting, and went with two Seagate “ST800FM0173” SSDs, giving us 800 GB apiece. We would have bought the Toshiba “PX05SMB040”, but it was not shipping yet, and the Seagate SSDs were priced at a comparable level while offering double the capacity.

Installing the SSDs and creating the OSDs

Installing the SSDs in the server HDD bays went as smoothly as expected, while adding them to the Ceph cluster was a bit of a hassle:

I was used to adding new OSDs via “ceph-deploy”, but found no way to specify the “root” to which the new OSDs were to be added (after creating that root via crushmap edits). So I figured I’d set the “noin” option, so that the Ceph cluster would keep quiet after adding the new OSDs (letting me learn their names) until I had properly configured the crushmap.
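
Setting and clearing the flag itself is straightforward:

# keep new OSDs from being marked “in” automatically
ceph@cephnode01:~/ceph_deploy> ceph osd set noin
# … prepare the OSDs and edit the crushmap …
# then clear the flag again
ceph@cephnode01:~/ceph_deploy> ceph osd unset noin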

Well thought, badly done. Despite the “noin” option being set (I could see this when running “ceph -s”, which even reported a WARNING state because of the active flag), the new OSD was right in and active after running “ceph-deploy osd prepare …”. Immediately, the cluster started re-balancing and our performance dropped drastically – the existing disks’ I/O utilization maxed out, as did some network interfaces. Even after setting the new OSD to “out” explicitly, the process continued. This can likely be attributed to the fact that after the “osd prepare” step, probably followed by an automatic activation via systemd, the new OSD had been added to the *server’s* tree node in the crushmap. So even after marking the OSD as out, the placement groups needed to be re-arranged to match the new layout.
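
A hedged side note: Ceph’s “osd crush update on start” option controls whether a starting OSD registers itself at its host’s location in the crushmap. Setting it to false before preparing new OSDs should prevent exactly this automatic placement – this is a pointer to the docs rather than something we had in place at the time:

# ceph.conf, distributed to the OSD nodes before running “ceph-deploy osd prepare”
[osd]
osd crush update on start = false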

As things were already as slow as could be, I figured I might as well add the second SSD, too (which is where I learned that even setting the new OSD to “out” wouldn’t help avoid rebalancing all PGs). Well, after an hour of slow storage, things had settled far enough that I could adapt the crushmap to our new needs.

The crushmap

While preparing this whole change, I came across different approaches to setting up the crushmap. What I opted for was a mix of what’s in the Ceph docs (which assume that all caching-tier OSDs live on separate Ceph nodes) and co-locating OSDs of different roots on the same node.

To read out the current map and create an editable version, you can use the following two commands:

ceph@cephnode01:~/ceph_deploy> ceph osd getcrushmap -o crush_map.bin
ceph@cephnode01:~/ceph_deploy> crushtool -d crush_map.bin -o crush_map.txt

The resulting file “crush_map.txt” may look similar to this:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host cephnode01 {
  id -2 # do not change unnecessarily
  # weight 3.640
  alg straw
  hash 0 # rjenkins1
  item osd.3 weight 0.910
  item osd.4 weight 0.910
  item osd.0 weight 0.910
  item osd.1 weight 0.910
}
host cephnode02 {
  id -3 # do not change unnecessarily
  # weight 3.640
  alg straw
  hash 0 # rjenkins1
  item osd.6 weight 0.910
  item osd.7 weight 0.910
  item osd.2 weight 0.910
  item osd.5 weight 0.910
}
host cephnode03 {
  id -4 # do not change unnecessarily
  # weight 3.640
  alg straw
  hash 0 # rjenkins1
  item osd.10 weight 0.910
  item osd.11 weight 0.910
  item osd.8 weight 0.910
  item osd.9 weight 0.910
}
root default {
  id -1 # do not change unnecessarily
  # weight 10.920
  alg straw
  hash 0 # rjenkins1
  item cephnode01 weight 3.640
  item cephnode02 weight 3.640
  item cephnode03 weight 3.640
}

# rules
rule replicated_ruleset {
  ruleset 0
  type replicated
  min_size 1
  max_size 10
  step take default
  step chooseleaf firstn 0 type host
  step emit
}

# end crush map

So what you can see is that there are 12 OSDs, seemingly all of the same size (as their (generated) weight values are identical), spread evenly across the three existing Ceph nodes. There’s a single tree with a root named “default”, under which all three nodes are bundled.

Once the SSDs were added, two new OSDs were defined (osd.12 and osd.13) and added to “cephnode01” and “cephnode02”:

...
device 12 osd.12
device 13 osd.13
...
host cephnode01 {
  id -2 # do not change unnecessarily
  # weight 3.640
  alg straw
  hash 0 # rjenkins1
  item osd.3 weight 0.910
  item osd.4 weight 0.910
  item osd.0 weight 0.910
  item osd.1 weight 0.910
  item osd.12 weight 0.720
}

host cephnode02 {
  id -3 # do not change unnecessarily
  # weight 3.640
  alg straw
  hash 0 # rjenkins1
  item osd.6 weight 0.910
  item osd.7 weight 0.910
  item osd.2 weight 0.910
  item osd.5 weight 0.910
  item osd.13 weight 0.720
}
...

Having the two new OSDs added to the host buckets is what caused the re-adjustment of all PGs across the disks – but at least, with both OSDs now marked “out”, no PGs reside on osd.12 or osd.13 and they can be moved within the tree.

To create the caching tier, you’ll have to create a separate pool and attach it to the main pool, stating that you want to use it as a cache tier.

To create that separate caching pool on SSDs only (and to prevent other pools from using the SSDs), you need a way to specify that these SSD-OSDs – and usually nothing else – are to back the pool. This is done by adding a separate root to the crushmap, assigning the new OSDs to that new tree, and defining an additional rule set that will use resources from the new root only.

Actually, the most obvious approach seemed to be the best one for me, too: I removed the new OSDs from the “node groups”, created a new root and added the OSDs to that root explicitly. A new ruleset explicitly using the new root rounded things off:

host cephnode01 {
  id -2 # do not change unnecessarily
  # weight 3.640
  alg straw
  hash 0 # rjenkins1
  item osd.3 weight 0.910
  item osd.4 weight 0.910
  item osd.0 weight 0.910
  item osd.1 weight 0.910
}

host cephnode02 {
  id -3 # do not change unnecessarily
  # weight 3.640
  alg straw
  hash 0 # rjenkins1
  item osd.6 weight 0.910
  item osd.7 weight 0.910
  item osd.2 weight 0.910
  item osd.5 weight 0.910
}

...

root ssd {
  id -5
  alg straw
  hash 0 # rjenkins1
  item osd.12 weight 0.720
  item osd.13 weight 0.720
}

...

rule ssd_ruleset {
  ruleset 1
  type replicated
  min_size 1
  max_size 10
  step take ssd
  step chooseleaf firstn 0 type osd
  step emit
}

Of course, applying that new map would cause another major re-balancing of PGs across the old OSDs, since I changed the layout again by removing the SSD OSDs from the “default” root.

ceph@cephnode01:~/ceph_deploy> crushtool -c crush_map.txt -o crush_map.bin.new
ceph@cephnode01:~/ceph_deploy> ceph osd setcrushmap -i crush_map.bin.new
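
For completeness: a similar layout can also be built without decompiling the map at all, using the CRUSH-related CLI commands – a sketch that should be equivalent to the hand-edited map above, although we went the crushtool route ourselves:

# create the new root bucket
ceph@cephnode01:~/ceph_deploy> ceph osd crush add-bucket ssd root
# place the two SSD OSDs under it (this moves them out of their host buckets)
ceph@cephnode01:~/ceph_deploy> ceph osd crush set osd.12 0.72 root=ssd
ceph@cephnode01:~/ceph_deploy> ceph osd crush set osd.13 0.72 root=ssd
# a replicated rule drawing only from the new root, with the OSD as failure domain
ceph@cephnode01:~/ceph_deploy> ceph osd crush rule create-simple ssd_ruleset ssd osd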

So, some time later everything had settled and setting up the cache tier could start. The Ceph docs list a few steps that will get the ball rolling:

ceph@cephnode01:~/ceph_deploy> ceph osd pool create images-cache 1024 1024 replicated ssd_ruleset
ceph@cephnode01:~/ceph_deploy> ceph osd tier add images images-cache
ceph@cephnode01:~/ceph_deploy> ceph osd tier cache-mode images-cache writeback
ceph@cephnode01:~/ceph_deploy> ceph osd tier set-overlay images images-cache

The above four steps create the new pool to be used as a cache. By referencing our new crushmap rule set in the first command, we make sure that this new pool only uses the new SSDs. The second command makes the new pool a tier of the original, HDD-based pool, and the third command sets the cache to write-back mode. The last command makes the new tier an overlay, so that all client requests are directed to the cache tier.
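
Whether the tiering relationship is really in place can be double-checked in the pool details – a quick look, keeping in mind that the exact output format differs between Ceph releases:

# the cache pool should report its cache_mode and the pool it is a tier of,
# while the base pool should reference the cache pool as its read/write tier
ceph@cephnode01:~/ceph_deploy> ceph osd dump | grep images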

(Fine) tuning

At this stage, the cache is already active and warming up. So how about closing all terminal sessions, shutting down the PC and starting that well-deserved, weeks-long vacation trip?

Don’t go just yet.

First of all, you’ll notice error messages, or rather warnings, that your configuration is not complete yet (“HEALTH_WARN; 1 cache pools are missing hit_sets”). The simplest resolution is to continue reading the Ceph docs, which hint at setting a Bloom filter:

ceph@cephnode01:~/ceph_deploy> ceph osd pool set images-cache hit_set_type bloom

The next thing you’ll probably notice is a growing number of degraded objects. But wait – we set up a two-OSD pool, but never specified the replication factor, which defaults to “3”. So indeed a third copy is missing for every object created in the cache pool… easy to fix:

ceph@cephnode01:~/ceph_deploy> ceph osd pool set images-cache size 2

Now everything looks nice and clean in the logs. But how will things evolve? When will Ceph start writing “dirty” data from the cache pool back to the main storage pool (that’s called “flushing”), and – since it is caching read accesses as well – when will it evict rarely used, clean objects to free up space (“eviction”)? Because while we’re chatting along, the cache is already growing and starting to use up the space provided by the SSDs, and we actually want to control the limits.

The Ceph docs list a number of settings, and how to use them properly really warrants a separate evaluation, taking into account the specific usage patterns of the Ceph clients. For a start, I decided to limit the space used by *this* cache (so I can add more caches later on, e.g. for CephFS pools) and to set some thresholds at which Ceph should kick in, flushing dirty objects and evicting unused clean objects:

ceph@cephnode01:~/ceph_deploy> ceph osd pool set images-cache hit_set_count 4
ceph@cephnode01:~/ceph_deploy> ceph osd pool set images-cache hit_set_period 14400
ceph@cephnode01:~/ceph_deploy> ceph osd pool set images-cache target_max_bytes 322122547200
ceph@cephnode01:~/ceph_deploy> ceph osd pool set images-cache cache_target_dirty_ratio 0.4
ceph@cephnode01:~/ceph_deploy> ceph osd pool set images-cache cache_target_full_ratio 0.8

Again: These are probably not the right values for your setup – read up on what these commands try to achieve, try to understand how Ceph is used in your case and what access pattern you’re trying to speed up, and set values that you consider sane for your use case.
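
The values in effect can be read back the same way they were set, which is handy for documenting the current state (which keys “pool get” accepts depends somewhat on the Ceph release):

# read back the cache-tier settings configured above
ceph@cephnode01:~/ceph_deploy> ceph osd pool get images-cache target_max_bytes
ceph@cephnode01:~/ceph_deploy> ceph osd pool get images-cache cache_target_dirty_ratio
ceph@cephnode01:~/ceph_deploy> ceph osd pool get images-cache cache_target_full_ratio
ceph@cephnode01:~/ceph_deploy> ceph osd pool get images-cache hit_set_count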

Monitoring

I’ve not yet seen the optimal monitoring for Ceph caching tiers – but here are a few commands that may get you started and help you assess whether your pool is working correctly (or at least in the way you configured it 😉).

First of all, you may want to check your crushmap setup, which should reflect the tree changes configured earlier on:

ceph@cephnode01:~/ceph-deploy> ceph osd tree
ID WEIGHT   TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-5  1.43997 root ssd                                            
12  0.71999     osd.12             up  1.00000          1.00000 
13  0.71999     osd.13             up  1.00000          1.00000 
-1 10.92000 root default                                        
-2  3.64000     host cephnode01                                   
 3  0.90999         osd.3          up  1.00000          1.00000 
 4  0.90999         osd.4          up  1.00000          1.00000 
 0  0.90999         osd.0          up  1.00000          1.00000 
 1  0.90999         osd.1          up  1.00000          1.00000 
-3  3.64000     host cephnode02                                   
 6  0.90999         osd.6          up  1.00000          1.00000 
 7  0.90999         osd.7          up  1.00000          1.00000 
 2  0.90999         osd.2          up  1.00000          1.00000 
 5  0.90999         osd.5          up  1.00000          1.00000 
-4  3.64000     host cephnode03                                      
10  0.90999         osd.10         up  1.00000          1.00000 
11  0.90999         osd.11         up  1.00000          1.00000 
 8  0.90999         osd.8          up  1.00000          1.00000 
 9  0.90999         osd.9          up  1.00000          1.00000 
ceph@cephnode01:~/ceph-deploy>

The output nicely shows the new root named “ssd”, carrying the two new OSDs created for the SSDs.

The defined pools and their read/write operations are visible through “rados df”:

ceph@cephnode01:~/ceph-deploy> rados df
pool name                 KB      objects       clones     degraded      unfound           rd        rd KB           wr        wr KB
...
images            1865050136       444162        10111            0           0   3105636134  79607169027   2006036393  40607777028
images-cache       246043065        64599            0            0           0     67086980   1764818843     28318945    635723497
...
ceph@cephnode01:~/ceph-deploy>

When watching how these numbers evolve (e.g. via “watch -d rados df”) you should see that most write (and probably read) operations go to the cache pool instead of putting load on the original pool. If most write operations are cached, then increasing “wr” and “wr KB” numbers should only show up on the original pool during flushes (Ceph moving “dirty” objects from the cache tier to the back-end tier).

But how much dirty data is currently on the cache? “ceph df detail” will tell:

ceph@cephnode01:~/ceph-deploy> ceph df detail
GLOBAL:
    SIZE       AVAIL     RAW USED     %RAW USED     OBJECTS 
    12652G     6749G        5902G         46.65        497k 
POOLS:
    NAME             ID     CATEGORY     USED      %USED     MAX AVAIL     OBJECTS     DIRTY     READ       WRITE  
...
    images           1      -            1778G     14.06         1831G      444161      433k      2961M      1913M 
    images-cache     27     -             234G      1.86          504G       64618     28187     65538k     27682k 
...
ceph@cephnode01:~/ceph-deploy> 

You’ll notice that most of your pools, namely those that are not cache tiers, report as many dirty objects as they have objects at all. This is due to the reporting logic and can be read as “none of these objects have been migrated to some back-end pool”. But if you look at the caching pool, you’ll see that about 43% of the objects are “dirty” and hence 57% are “clean” – this is the result of the commands above, which set the dirty ratio to “0.4”.
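
Should you ever need to empty the cache completely (for instance before detaching or removing the tier), rados can flush and evict everything in one go – note that this will of course put load on the back-end pool while it runs:

# write back all dirty objects and evict everything from the cache pool
ceph@cephnode01:~/ceph-deploy> rados -p images-cache cache-flush-evict-all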

Conclusion

Setting up the caching tier wasn’t difficult, but it does have some pitfalls. Attention should be paid to how you add the new OSDs, because currently, “ceph-deploy” will add new OSDs to the server’s crushmap container and cause an I/O-intensive re-balance.

From a throughput perspective, adding the caching tier did not provide much improvement. This points at other bottlenecks that need looking after – most likely network-added latency, which may indeed need more investigation. But from a human user’s point of view, many of the small delays and the overall lagginess are no longer felt, giving the whole system a more responsive feel.

Running mostly an RBD workload on a replicated pool, we have many virtual disk images in our Ceph storage, actively used mainly for small writes. Since adding the caching tier and giving it a few days to warm up, we can definitely see that close to all writes and many reads are served by the cache. So our assumption about the nature of our workload was correct, and none of the general warnings, especially those hinting at decreased performance, seem to apply. But of course, more time will be spent on fine-tuning the cache setup, and we’ll add more caching tiers for other pools – mostly CephFS, which we had wanted to avoid until a speedy caching solution was available to us.
