Recently, we received reports of SCST-related error messages from one of our SAN servers:
kernel: [1082754.895993] : dev_vdisk: ***ERROR***: write() returned -28 from 4096
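The return value in that message is a negative errno code. On Linux, 28 is ENOSPC ("No space left on device"), which a quick one-liner confirms:

```shell
# -28 from write() is -ENOSPC: the kernel could not allocate a data block
python3 -c 'import errno, os; print(errno.errorcode[28], os.strerror(28))'
# ENOSPC No space left on device
```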
A quick analysis showed that these errors had started on a working day right after the holiday season, but during the evening hours (so no one was actively working on-site). From then on, we saw roughly 10 to 30 similar messages per incident, with incidents occurring 45 to 90 minutes apart. What caught our eye was that the incidents always fell on a quarter-of-an-hour mark.
The timing of the incidents clearly pointed at some cron job, but none set up on the SAN server matched all the incident times. Checking the logs of backup jobs for resources on this SAN server revealed no correlation either.
While the error message itself did not mention the resource SCST was acting on, the PID from the syslog entry identified the resource’s worker thread:
sanserver:/root # ps ax|grep 4821
4821 ? S 1:21 [ourResource-lu1_2]
27522 pts/1 S+ 0:00 grep --color=auto 4821
(This of course only works if SCST has not been restarted since the incident.)
That resource is a file-based virtual disk of one of our VMs. Even before checking the VM for cron jobs that might cause trouble, a quick glance at the SAN server's file systems revealed that the one holding the virtual disk image was filled to the brim. "Touching" a new file still worked, because that only creates an empty file, using one of the still-available inodes; something like "echo testcontent > testfile" failed, since the latter would have required a free data block.
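The difference between the two tests comes down to data blocks versus inodes, and `df` can report both. (The mount point below is a placeholder, not the server's real path.)

```shell
# Data blocks: 100% used here meant no room for actual file contents
df -h /srv/vmimages
# Inodes: still plenty free, which is why creating an *empty* file succeeded
df -i /srv/vmimages
```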
At first glance, one might ask why the virtual disk would grow at all. The technical reason in this specific instance lies in the fact that the VM in question had been migrated from a different installation, where the admin had created a 250 GB disk image. Only 15 GB were actually used, so copying the image one to one seemed a waste of SAN server disk space. Instead, the image file was copied in "sparse mode" to a 20 GB file system, which at the time still left 5 GB of head room.
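The effect of such a sparse copy is easy to reproduce with a scratch file (the real image was presumably moved with something like `cp --sparse=always`; the file names here are just for illustration):

```shell
# A sparse file: large apparent size, almost no blocks allocated
truncate -s 250M demo.img
# Copying with --sparse=always preserves (or even creates) the holes
cp --sparse=always demo.img copy.img
ls -lh copy.img   # apparent size: 250M
du -h copy.img    # allocated size: close to zero
rm demo.img copy.img
```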
Over time, some process inside the VM created more and more log data and thus filled the blank spaces of the virtual disk. The disk image needed more actual disk space, and voilà, the file system ran out of space.
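This kind of growth is easy to demonstrate: writing into a hole of a sparse file allocates real blocks, so the on-disk footprint creeps toward the apparent size while the file's length never changes.

```shell
truncate -s 250M demo.img
du -h demo.img   # ~0: the file is all holes
# Writing data (like the VM's ever-growing logs) turns holes into real blocks
dd if=/dev/urandom of=demo.img bs=1M count=10 conv=notrunc status=none
du -h demo.img   # now roughly 10M allocated, apparent size still 250M
rm demo.img
```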
Resolving the issue
The fix was easy and could be applied live: we increased the size of the file system holding the virtual disk image. And from now on we will both pay more attention to the free disk space in this particular SAN server area and run clean-up jobs that remove unnecessary data inside the VM.
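As a sketch of the "pay more attention" part, a small check along these lines can run from cron and warn before the file system fills up again; the path and threshold below are placeholders, not our actual setup:

```shell
#!/bin/sh
# Warn when the file system holding the disk images gets too full.
# FS and LIMIT are hypothetical values -- adjust to the local setup.
FS=/srv/vmimages
LIMIT=90
USED=$(df --output=pcent "$FS" | tail -n 1 | tr -dc '0-9')
if [ "$USED" -ge "$LIMIT" ]; then
    echo "WARNING: $FS is ${USED}% full, sparse disk images may soon fail to grow" >&2
fi
```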