servers – Pervasive Code

Minimizing I/O when migrating from Virtualbox on Veracrypt to Proxmox on ZFS

Jamie Flournoy — Sun, 14 Aug 2022 16:28:03 +0000

I’m moving large-ish (half-TB or larger) files between hosts. It’s important to avoid extra copies in this workflow, since each pass over one of the larger files to read the whole thing (and there are several of them) takes hours. I managed to decrypt VM disk images, transform them from one disk image format to another, copy them from one host to another, calculate SHA-256 hashes on both sides to verify data integrity, compress and encrypt them on the destination, and to display a progress bar, all without any additional copies. One big block-device read on the source end and one big block-device write on the destination end is all of the disk I/O that’s happening.

See below for how I did this.

I have an oldish home file server, consisting of a 2015-era Xeon CPU, 32 GiB of RAM, and a heap of RAID-mirrored hard disks and a couple of SSDs. It’s filling a role that I didn’t have a name for before, but apparently the right name these days is a “homelab”: built for virtualization, it hosts a bunch of small guest VMs doing various things and not interfering with each other the way the do-everything predecessor of this server did. It lets me experiment with different operating systems and software in a sandboxed environment, and lets me isolate changes so upgrades to one service don’t break everything else.

Well, that was the idea, anyway. I got about 50% of the way through migrating everything off of the host OS and onto VM guests, and I was enjoying the resulting newness of guest OSs and the new tools that were available (example: newer borgmatic releases will directly ping monitoring services like healthchecks.io to say that the backup succeeded).

But the pain of managing VMs with the Virtualbox GUI made me reconsider, and eventually life happened and I got distracted from this project and just left it half-finished. Accessing the Virtualbox GUI is done via X11, and somehow the latest version of Virtualbox that I can use with the host OS has some serious redrawing issues, that leave windows half-drawn in unpredictable ways. It’s nearly unusable, and I’ve fiddled with it a lot to try and get it to work properly, with no luck. Also, sometimes the GUI just locks up and I have to kill the GUI app. Rarely, the GUI locks up and kill/relaunch doesn’t fix the stuckness (the re-run GUI just locks up immediately), and my only recourse is to reboot the host. (I could live without the GUI, but then I can’t see the VM console anymore. I could use RDP instead of X11, but that doesn’t work properly either, even after quite a bit of futzing.) It’s just janky as hell. Also, even when I’m not managing a VM and I’m just leaving it running, Virtualbox on this host occasionally makes my guest VMs go crazy with “access beyond end of device” kernel errors, and it seems I’m not the only one experiencing this. I don’t know why it’s doing this but it sure would be nice to have a stable and usable virtualization setup. I’ve lived with this for several years, because it’s a time sink and I have other things to do. It’s not great.

Well, now it has a disk failing (it’s a disk in a smallish RAID mirror, so it’s not a huge problem to just move the data off and stop using the remaining drive) and I have some free time to try again, and I also have a gaming PC that can be used for experimental installations and migrations while the existing server just keeps on truckin’. So I’m trying again, with the notion of abandoning the setup of Virtualbox VMs on manually-mounted Veracrypt-encrypted disk images, in favor of something newer and better and easier to manage. I had previously planned to just use LUKS to manage full-disk encryption, but as it turns out, ZFS is finally available for Linux, and Proxmox will happily manage VMs atop a ZFS-based system, so I decided to try that.

It’s great so far. I like ZFS, a lot. It solves a bunch of problems that would otherwise need to be solved by a stack of software, such as Linux Software RAID + LUKS + LVM + ext4.

It took extra work to get the whole zpool to be encrypted, since you can only do that at zpool creation time and the Proxmox installer doesn’t support ZFS encryption (even though the underlying OS, Debian 11 “bullseye”, does support ZFS encryption). Fortunately, I found an excellent guide to setting up a ZFS encrypted root on Debian Bullseye and a guide to installing Proxmox on an existing Debian Bullseye host, so that’s done now.

However, the VM disk images on the source side are .vdi disk image files located on Veracrypt disk images, whereas the destination VM disk images need to be in either qcow2 format in the filesystem, or RAW format in an encrypted and compressed ZFS zvol. So the files will need to go through a lot of processing between origin and destination:

The files need to be converted from VDI to RAW format.
I want to have a SHA-256 hash computed on the source, and again on the destination, so that I can be certain that the file copy actually succeeded without introducing any data corruption.
They need to be copied from the source to destination.
I want them to be encrypted at rest on both sides.
Finally, I’d like a progress bar that tells me how the process is going.

The decryption and encryption part is pretty straightforward: Veracrypt volumes mounted as a block device do that transparently, as do ZFS zvols if they are encrypted.

ZFS also offers transparent compression, and zstd is available to make the compression so fast that it barely adds any overhead (smaller I/Os with a bit of added CPU overhead are generally faster than uncompressed I/O), so I’m using that with the zvols too. Just mark it as compression=zstd at creation time, and everything in it gets compression, unless you mark it as compression=off. (Incompressible data is automatically detected and left uncompressed, also transparent to the user.) It’s very, very slick, and apparently all the ZFS cool kids are just turning compression on for everything, unless they have a good reason not to (such as “it’s a huge volume of very incompressible data”, like a file server for compressed video). You can mark the top-level (zpool) as being compressed, and then later mark a nested dataset as not-compressed, and the nested dataset won’t be compressed. So it’s a no-brainer to just turn compression on via compression=zstd, and override that choice only on a per-dataset basis later.
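As a rough sketch of that inherit-then-override pattern: the pool name mypool matches the redacted command line later in this post, while mypool/video is a made-up dataset name used only for illustration. The run helper is also hypothetical; it prints each command and only executes it when DRY_RUN=0, so the sketch can be read (or run) on a machine without ZFS.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print each command, and run it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [[ ${DRY_RUN:-1} == 0 ]]; then "$@"; fi
}

# Turn compression on for a whole pool, then opt one nested dataset out.
# Usage: configure_compression POOL EXEMPT_DATASET
configure_compression() {
  local pool=$1 exempt=$2
  # Child datasets and zvols inherit this property from the pool's root dataset.
  run zfs set compression=zstd "$pool"
  # Override the inherited value on a single nested dataset.
  run zfs create -o compression=off "$exempt"
  # Show the effective value and where it came from (local vs. inherited).
  run zfs get -r -o name,value,source compression "$pool"
}
```

Calling configure_compression mypool mypool/video with the default DRY_RUN=1 just prints the three zfs commands. Note that zstd needs OpenZFS 2.0 or newer.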

The VDI to RAW conversion is done on the fly by qemu-nbd, used like qemu-nbd --read-only -c /dev/nbd0 somefile.vdi, exposing the data in the VDI as a read-only block device in RAW format.

That avoids something gross like making a local copy of the file with qemu-img or VBoxManage clonehd, which would take hours and require a ton of disk space. When I say “a ton” I mean more than the .vdi, since .vdi files can be created to expand and shrink based on how much data is stored on them in the same way that a .qcow2 file can, whereas RAW files are allocated up-front to their full size. The RAW files can be sparse files, but even with a sparse .vdi file, VBoxManage clonehd doesn’t preserve the sparseness in the output RAW image file, so you’d have to re-run virt-sparsify on the non-sparse RAW file, producing a third, sparse RAW file in addition to the non-sparse RAW file and the original sparse VDI file. Also, virt-sparsify needs a temporary directory with enough space to store a temporary full copy of the original image, so that’s a total of 2 sparse and 2 non-sparse copies of the image file that you have to find space for, and there’s extra I/O to make all those copies, which eats up a huge amount of time (in my case, it would add several days of non-stop I/O). virt-sparsify can run with --in-place, but then you risk data loss if something goes wrong.

So, anyway, using qemu-nbd --read-only avoids all of that: it’s a read-only view of the contents of the .vdi file. No extra copies are made on disk; all of the transformation happens in RAM.

Progress output is done with pv (pipe viewer), which can read from a block device. It gives a nice progress display as it copies data. If you’re not familiar with pv, it’s darn handy.

Now all you need to do is avoid reading the RAW data twice (once to copy it, then once again to compute the SHA-256 hash), which is easy with tee and Bash process substitution. This is a trick I haven’t used before: process substitution directs file-output from one command to STDIN of another command, and combined with tee, you can copy tee’s STDIN to two subcommands’ STDIN, so they operate on the data in parallel.
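Here is a small local sketch of that tee-plus-process-substitution trick, with no network or block devices involved. The function name copy_and_hash is made up for illustration, and it uses sha256sum from GNU coreutils on the assumption that it is available (shasum -a 256 computes the same digest).

```bash
#!/usr/bin/env bash
set -euo pipefail

# Copy SRC to DST and print the SHA-256 of the copied bytes,
# reading SRC only once.
# Usage: copy_and_hash SRC DST
copy_and_hash() {
  local src=$1 dst=$2 out
  # tee writes every byte twice: once into the >(...) process substitution,
  # which is sha256sum's stdin, and once to its own stdout, which is DST.
  # sha256sum prints to the command substitution's pipe, so $(...) does not
  # return until the hash has actually been written.
  out=$(tee >(sha256sum) < "$src" > "$dst")
  # sha256sum prints "<digest>  -"; keep only the digest.
  printf '%s\n' "${out%% *}"
}
```

For example, copy_and_hash big.img /mnt/backup/big.img writes the copy and prints its digest after a single pass over big.img (both paths are placeholders).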

Copying to the other server is streamed over SSH, with “Compression no” in the SSH config since they are on the LAN next to each other, and OpenSSH’s compression options are not terribly efficient so they would actually bottleneck the transfer.

On the destination side, using tee with Bash process substitution in the same way again allows computation of the SHA-256 to happen in parallel with the writing of the file to the ZFS zvol block device, with no additional I/O.

So, in summary: decryption is done on the fly by Veracrypt since the .vc image is mounted as a block device. Transformation from .vdi to raw is done on the fly by qemu-nbd which mounts the .vdi image as a block device. tee running on the source allows streaming copy-and-also-hash to happen on the source side, and another tee running on the destination allows streaming write-and-also-hash to happen on the destination side. SSH does the actual transferring of bytes from source to destination. I saw data rates of ~100MiB/s which is about the limit of the write speed of the destination zvols, since they are compressed and encrypted and reside on mirrored 5200 RPM disks currently attached via external USB 3 disk enclosures. (dd if=/dev/zero of=/the/zvol/path on the destination didn’t do any better than that, so I know that all the source transformations and network copying aren’t causing a bottleneck.)

Here’s the actual command line, redacted appropriately:

SIZEBYTES=$(blockdev --getsize64 /dev/nbd0) ; pv -s $SIZEBYTES /dev/nbd0 | \
  tee >(shasum -a 256 --tag >> /root/shasums.txt ) |  \
  ssh root@dest 'tee >(dd of=/dev/zvol/mypool/vms/myvmname/mydiskname conv=fsync) >(shasum -a 256 --tag >> /root/shasums.txt ) | cat > /dev/null'

It’s a doozy, but you get the SHA-256 values in /root/shasums.txt on both sides, a progress display thanks to pv, and a transformation from Veracrypt+VDI to ZFS compressed-encrypted-zvol-RAW and a network transfer, all with one big read and one big write. It still takes hours, but you only have to do this to each image file once and it’s all migrated over, with a data integrity check so you know it worked.
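The last step, comparing the two hash logs, could be sketched like this. It assumes both files hold the BSD-style lines that shasum --tag writes, with the digest as the last field on each line, and that the destination's file has already been copied next to the source's. The function names are invented for this example.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print the digest (last field) from the last non-empty line of a hash log.
last_digest() {
  awk 'NF { d = $NF } END { print d }' "$1"
}

# Succeed only if the most recent digests in the two logs are identical.
# Usage: hashes_match SOURCE_LOG DEST_LOG
hashes_match() {
  local a b
  a=$(last_digest "$1")
  b=$(last_digest "$2")
  [[ -n $a && $a == "$b" ]]
}
```

Something like hashes_match shasums.src.txt shasums.dest.txt && echo OK would then confirm the most recent transfer (those two file names are placeholders).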

~~Smash that like and subscribe button~~ Let me know in the replies below if this was helpful, if you find an error, or just have suggestions.

Solved: Ubuntu 20.04 update makes boot time increase by 4 minutes

Jamie Flournoy — Sun, 26 Sep 2021 21:42:07 +0000

I’m doing software development on an Ubuntu 20.04 VM running on my MacBook Pro via Parallels Desktop 17. I recently noticed that the “Software Update” app in the Ubuntu desktop machine was saying there were a lot of things to update, which makes sense since I just created this VM a couple of months ago via a fancy Ansible playbook, and I hadn’t gotten around to updating it until yesterday.

Well, after the update, it started taking a really long time to boot. There were two places where it hung for 2 minutes for no apparent reason, meaning that the boot process took about 4-1/2 minutes.

I’ve figured out what happened and found a workaround, but I’ll mention a few of the things I looked at along the way since they’re sort of interesting and may be relevant if your issue is similar but not caused by the same thing.

A bit of background: this VM was created by the Parallels installer wizard, which asks the user some questions and then downloads the Ubuntu 20.04 installer and feeds it an unattended-install ISO image based on your answers. It’s actually pretty slick, and I don’t mind the fact that I can’t use my custom scripts-and-YAML frontend to virtualbox’s VBoxManage that creates a VM and its disk images from scratch and then installs Ubuntu 20.04 on that VM. (I use that frontend when making Linux-on-Virtualbox-on-Linux VMs, but I don’t have any automation for creating Parallels VMs at the moment.)

I still use my Ansible playbook that installs all of the things that all of my VMs need, which in the case of this VM (a development server for me to use when working on software development for a single customer) includes the goodies I like on a shell server, the tweaks I like to make to Ubuntu servers in general, and all of the things that I need for this particular host. So this host’s configuration is mostly under formal configuration management, and I can build a copy of it pretty quickly.

That’s why my first reaction to a messed up VM was to just make it from scratch again, since I figured that I could probably do that (and clear out whatever gremlins were in the updated one) faster than I could troubleshoot the actual problem.

Well, that didn’t work. I made a new VM disk and disconnected the old one, running the Ubuntu 20.04 installer directly (without the Parallels wizard) and choosing the most boring options available to populate the new VM disk. It worked fine, so I applied the software update. That update made it act really weird: black screen, no POST, no GRUB bootloader, no kernel output. WTF? Had I hosed the VM configuration somehow?

I made a whole new VM instance and ran the installer again, and it worked fine. I ran the software update to get it up to date with the latest packages for Ubuntu 20.04, and I got the same result(!): bricked. No POST, no GRUB, no kernel output, just hangs with a blank screen at power-on. WT-actual-F?

It was at this point that I really wished I had thought to use the snapshots feature of Parallels to create a pre-update state that I could revert to if the update went badly. Ugh. Lesson learned: always use the snapshot feature when you can, even if this machine is so easy to recreate (and contains only data that’s in a Git repo elsewhere) that it’s not worth backing up. Reverting to an old snapshot is still faster than rebuilding the machine. Snapshots cost nothing and take seconds, vs. reinstalling which takes an hour or so.

I dug and dug and dug on various Linux and Ubuntu support forums for quite a while for a clue as to what the heck was wrong. In the meantime, buried under all of my browser windows, the “bricked” VMs actually booted. Huh?

What I was actually seeing was not a pair of non-bootable VMs, but a pair of VMs that played dead for a long time: a subsecond POST display that never actually rendered on the VM’s virtual display (but did appear for an instant on the thumbnail of the VM’s display that’s shown in the Parallels Control Center window), then a deliberately hidden (by the Ubuntu installer’s settings in /etc/default/grub) GRUB menu with a 0-second delay before booting, and a “quiet splash” boot screen, because the Ubuntu desktop installer prioritizes pretty boot screens over being able to see what’s going wrong. For some reason the splash screen doesn’t work on this configuration (missing image files? VM video driver incompatibility?), so the pretty Ubuntu logo wasn’t showing at all for the 4+ minutes it was taking to get to the GUI login screen. But it was booting, eventually.

Editing /etc/default/grub to contain GRUB_TIMEOUT=10, GRUB_TIMEOUT_STYLE=menu, and GRUB_CMDLINE_LINUX_DEFAULT="" (and then running sudo update-grub) ended the reign of the useless black boot screen.
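If you’d rather script that edit than do it by hand, a minimal sketch might look like the following. The function names are invented for this example, it assumes GNU sed (for -i), and it is meant to be pointed at a copy of the file first. It only touches uncommented KEY=... lines and appends the setting if none exists.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Set KEY=VALUE in a shell-style config file: replace the existing
# assignment if there is one, otherwise append a new line.
set_grub_var() {
  local file=$1 key=$2 value=$3
  if grep -q "^${key}=" "$file"; then
    sed -i "s|^${key}=.*|${key}=${value}|" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Apply the three settings described above to the given file.
show_boot_messages() {
  local file=$1
  set_grub_var "$file" GRUB_TIMEOUT 10
  set_grub_var "$file" GRUB_TIMEOUT_STYLE menu
  set_grub_var "$file" GRUB_CMDLINE_LINUX_DEFAULT '""'
}
```

After running show_boot_messages on the real /etc/default/grub (as root), sudo update-grub is still needed for the change to take effect.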

So, it turns out all three installations (the original one, the new one on the original VM with a new blank virtual disk, and the one on the new VM) had the same boot-time problem. I had thought that maybe my Ansible script had messed up the /etc/fstab somehow, since it was something disk-device related that was apparently taking 2 minutes to fail.

I had figured out that it was disk-device related because the two 2-minute kernel boot freezeups were around things that said “Failed to start udev Wait for Complete Device Initialization” and “ata8: SATA link down”.

There were a bunch of forum threads assisting people who had messed up their /etc/fstab so that the kernel couldn’t find their swap file, so I just disabled swap entirely, but that didn’t fix my problem.

Some people blamed LVM (specifically, the use of /dev/mapper/…. vs. UUID="" to identify the filesystem on the LVM volume, and possibly using the wrong UUID), and that seemed plausible since my Ansible script rewrites the /etc/fstab to use UUIDs for everything instead of /dev/sda1 etc. That was a red herring also; after all, a freshly installed system that used the /dev/mapper style to point to the LVM root volume still had this problem. I’ll probably switch to using the /dev/mapper/xyz style instead of UUIDs in my Ansible script, since I’m really just trying to use a stable disk identifier instead of a /dev/sdX identifier, but clearly that’s not what caused the issue since it persists even with a fresh Ubuntu 20.04 install.

So, it’s not swap and it’s not LVM. Those are the only things in /etc/fstab, so it’s not that.

What’s left? Some kind of disk device problem? How could that be the case with a fresh VM and a brand new disk device and an /etc/fstab created by the Ubuntu installer? Is it possible that everybody with Ubuntu 20.04 who has applied this update has this same issue?

The actual problem & solution:

As it turns out, yeah, apparently this is a problem that a lot of people are experiencing. Issue with Parallels, Ubuntu VM and the CD-ROM on the Parallels forum describes this exact problem and identifies the source and a workaround.

The issue is caused by a Linux kernel bug, which of course is already fixed. But it’s recent, so the fix isn’t available via “apt update” on Ubuntu 20.04 yet.

The workaround is to not have an empty virtual CD drive on the VM, which is pretty easy to solve by either deleting your virtual CD device, or ensuring that an ISO image is in the drive which won’t be booted (by changing the boot order, or by attaching a blank .iso image to the virtual CD drive, or both).

So, that’s what I did on Saturday night. How was yours? ;)

Appendix:

This post on linuxquestions.org shows a quick way to create an empty ISO:

mkdir /tmp/directory/
mkisofs -o /tmp/cd.iso /tmp/directory/

A few things that I learned about while chasing the red herrings:

systemd-analyze critical-chain is a pretty cool way to look into your systemd services to see which ones are making your startup process take too long. This wasn’t actually that useful since the delay happens before systemd starts, but it’s interesting anyway.

systemd-analyze showed me that there’s a deprecated service called systemd-udev-settle.service, which is what was adding the second 2-minute wait during boot time. There’s a thread about it on the Ubuntu subreddit that suggests that it’s not needed anyway. I use LVM on this host and disabling that service via systemctl mask systemd-udev-settle.service didn’t break it, so I guess it isn’t needed anymore (at least, not for a setup with a virtual disk like this, that doesn’t take any time to spin up and become accessible).

Karmic on Xen with Bad /etc/fstab = PAIN

Jamie Flournoy — Mon, 08 Feb 2010 00:32:20 +0000

Argh! I spent about 5 hours yesterday troubleshooting a failed Ubuntu Jaunty -> Karmic (9.04->9.10) upgrade. It worked fine until I rebooted and then failed to boot. Here’s how I fixed it.

It failed to boot, saying this:

One or more mounts listed in /etc/fstab cannot yet be mounted
/ : waiting for /dev/xvda1
/tmp : waiting for (null)
/swap : waiting for /dev/xvda9

I tried a lot of stuff and finally solved it. My solution is on the Ubuntu Forum, here: One or more of the mounts listed in /etc/fstab/ cannot yet be mounted (Karmic).

Ubuntu 9.10 (Karmic Koala) upgrade notes

Jamie Flournoy — Sun, 07 Feb 2010 04:24:57 +0000

Once again Ubuntu Linux proves itself to be easy to upgrade. Going from 9.04 to 9.10 (one release newer, since their numbering is based on dates) was easy, but included the standard sprinkling of manual re-customization that I’ve come to expect from Debian-based systems.

I did the Network Upgrade for Servers.

I had to re-customize these files since I’m not running with 100% default configuration:

/etc/monit/monit
/etc/monit/monitrc
/etc/dovecot/dovecot.conf
/etc/apache2/apache2.conf
/etc/php/apache2/php.ini

I basically did a manual diff side by side in Emacs and copied my changes over into the new config files. Reboot, no problems. Nice.

Fancier Stubbing of GeoKit for Rails unit tests

Jamie Flournoy — Fri, 24 Jul 2009 00:00:44 +0000

I’m working on a Rails app that uses the ym4r_gm plugin, getting Google to do the geocoding for Thentic. I liked the idea of stubbing the web service call, because all those calls to an external service add up to over 20 seconds of test suite run time(!). That’s almost half of the 50 second run time of my unit tests (and 50 seconds is much too long for a unit test suite).

I found a good starting point at geokit stubbing for faster tests. I also wanted a way to stub a geocoding failure, and a way to prevent any unit tests from using the real geocoding web service.

Here’s how I did it.

In test/test_helper.rb, add this:

class ActiveSupport::TestCase
  def setup
    GeoKit::Geocoders::MultiGeocoder.stubs(:geocode).raises(RuntimeError,
      'Use mock_geocoding_success! or mock_geocoding_failure! in your test')
  end

  def mock_geocoding_success!
    geocode_payload = GeoKit::GeoLoc.new(:lat => 123.456, :lng => 123.456)  
    geocode_payload.success = true
    GeoKit::Geocoders::MultiGeocoder.expects(:geocode).returns(geocode_payload)
  end
  
  def mock_geocoding_failure!
    geocode_payload = GeoKit::GeoLoc.new
    geocode_payload.success = false
    GeoKit::Geocoders::MultiGeocoder.expects(:geocode).returns(geocode_payload)
  end
end

What this does is to force you to choose either one of those mock_geocoding methods before you call the geocode method. To me this seems like a good idea since the integration tests that exercise the full application stack should probably be written using Cucumber and Webrat (which is what I’m using).

You will probably want to merge my one-line setup method into your existing setup code in test_helper, if any. Also note that this uses Mocha for mocking.

Ubuntu 8.10 and 9.04 (Intrepid Ibex and Jaunty Jackalope) upgrade notes

Jamie Flournoy — Sat, 30 May 2009 22:54:18 +0000

I’m at WordCamp San Francisco today and decided that running a year old version of WordPress (on a year old version of Ubuntu Linux) was undesirable. So, with the confidence that comes from many relatively easy Ubuntu OS upgrades, I charged ahead. For (I think) the second time ever, things went badly. Here’s what I did and how I fixed it.

First, I had to figure out what release of Ubuntu was currently installed:
lsb_release -a

I was on “hardy”, a.k.a. the Hardy Heron release, a.k.a. Ubuntu 8.04 LTS.

I had not bothered to install Ubuntu 8.10 / “Intrepid Ibex” because I didn’t have a reason to when it was released. I now wanted to upgrade to Ubuntu 9.04 “Jaunty Jackalope” which has WordPress 2.7.1, the current release (as of today).

The way to upgrade from 8.04 to 9.04 is to upgrade to 8.10 first. So I did that:

Intrepid Upgrades: Network Upgrade for Ubuntu Servers worked really well. I had to do a little bit of manual file merging as usual (I still don’t understand why dpkg can’t merge changes from the old file into a new file) but that was it. Easy!

When I rebooted the VPS, it kernel panicked: can’t mount the root filesystem. Oh crap. /dev/xvda1 is missing? Really? I told the VPS to hard reboot and it came up fine. But that’s a little scary. (I think this is something more related to my VPS hosting provider than Ubuntu, but I haven’t upgraded my laptop VMWare Ubuntu VPSs yet so I’m not sure.)

The second stage didn’t go so well. I did the same sort of simple upgrade: the Jaunty Network Upgrade for Ubuntu Servers instructions are the same as the ones for Intrepid. Upgrade, edit a couple of config files, reboot. Kernel panic again, same reason, reboot. Should work, right?

It booted, but had no network access. I was able to log in via my VPS hosting provider’s SSH remote console feature, so I was able to see that /etc/init.d/networking was failing to start. It was the same problem that’s described in Ubuntu 9.04 in an OpenVZ VE. Adding that one line to /etc/init.d/networking fixed the problem. Reboot, all better.

So if you’re doing this upgrade on a VPS, make sure you’ve added that little 1-line hack after you do the Jaunty upgrade and before you reboot.

CentOS 5.3 Minimal VPS Install Guide

Jamie Flournoy — Sat, 30 May 2009 16:52:14 +0000

I just did this yesterday; you can pretty much just follow my CentOS 5.1 Minimal VPS Install Guide.

The differences are:

When you get to the “More Minimizing” section, yum -C grouplist will show a package called “Yum Utilities” which you probably want to leave installed.
The Deployment_Guide-en-US file is not there so you don’t need to remove it.

That’s it.

I should also note that downloading a 3.9GB DVD ISO image in order to build a ~700MB installed OS may not be very efficient. I didn’t bother looking for a network installer but that might be the way to get this done faster.

Recommended mount options for ext3

Jamie Flournoy — Fri, 16 May 2008 03:54:30 +0000

The details of the various mount options for the ext3 filesystem are fairly well documented, but as with many things in the Unix world, knowledge is far easier to come by than wisdom. That’s a pithy way of saying that I had to do some digging to find recommendations, as opposed to explanations. So here are my recommendations for ext3 users (which encompasses the majority of the Linux-using world, as far as I can tell).

noatime

First of all, do yourself a favor and disable atime updates, using the noatime mount option. This yields a huge performance boost.

This is done by adding noatime to the appropriate lines in /etc/fstab (do it once for each ext3 filesystem that’s listed), in the fourth column, which probably says defaults now.
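For a server with many ext3 entries, that fstab edit can be previewed with a small filter like the one below. This is only a sketch: add_noatime is a made-up name, it prints the modified file to stdout rather than editing it in place, and because awk rebuilds any line it changes, the columns on those lines end up separated by single spaces.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print an fstab with noatime appended to the options (fourth) column of
# every ext3 entry that does not already have it.
# Usage: add_noatime /etc/fstab
add_noatime() {
  awk '
    # Pass comments, blank lines, and short entries through untouched.
    /^[[:space:]]*#/ || NF < 4 { print; next }
    # ext3 entries without noatime get it added to the options column.
    $3 == "ext3" && $4 !~ /(^|,)noatime(,|$)/ { $4 = $4 ",noatime" }
    { print }
  ' "$1"
}
```

Comparing add_noatime /etc/fstab against the original with diff shows exactly which lines would change before anything is written.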

To make this change to a live, running filesystem, remount the drive (adjust this so that the right disk device is specified at the end of the line):

sudo mount -o noatime,nodiratime,remount,rw /dev/xvda1

(My understanding is that the noatime implies the nodiratime option, but I decided to add it just in case this was not true.)

atime is a relative of the well-known file modification and change timestamps (mtime and ctime), but it tracks access to file data. That means that if you read one byte from a file, even if it’s cached in RAM, you’re actually also triggering a write to the inode for that file, so that its atime can be updated. (If you want to slap your forehead now in disbelief, be my guest.) And if you read a ton of little files (which happens rather often in the unix world), that means a ton of writes to update all of their inodes. You don’t want that, right?

But do you need it? Almost certainly not. It’s required by the POSIX standard, and the need for it to be present and turned on is well debated by people more knowledgeable about this in this thread from the Linux kernel mailing list. The summary of their argument is that it’s the kernel’s job to remain standards compliant, and only the distributor or user has enough information to know that they don’t care about that part of the standard and can safely disable it. I can understand that point of view.

Well, I did the reading, and you can safely disable it, unless you’re using mutt. If you’re using mutt, or if you’re just nervous about disabling something that somebody somewhere says you might maybe need someday, then disable atime for every filesystem that doesn’t have your mail spool on it, and use the relatime mode on that drive. (relatime is a clever hack that simulates atime behavior while skipping the disk write in certain cases.)

Journaling mode

Ext3 is a journaling filesystem, which is generally a good thing. There are three modes of operation for ext3’s journaling functionality, but which to use?

“It depends” is not very satisfying, so an easy rule of thumb would be to use data=journal if you really, really want to ensure the durability of your data, and data=ordered if you can tolerate a teeny tiny chance of data corruption.

I measured all three journaling modes by running time sudo rsnapshot hourly on a VPS that backed up VPSs on the same physical server to a dedicated backup disk. In other words, the source was on the same physical server as the destination but they were on different disks.

rsnapshot uses hard links to share file data across backup sets, so backing up an unchanged directory twice takes hardly any additional space compared to backing it up once. But it does need to do a bunch of disk reads and writes to make all the linked directory entries when it does this, so there is a fair amount of I/O involved: more than what rsync would need to just update a local directory to match the remote directory, but far less than what would be needed to make a separate copy of every file for each backup.

In abstract terms, the I/O for this backup process involves a lot of small reads and writes, and a very small number of medium or large writes for changed files. All of these occur as fast as the disk can service them, and the disk is quiet aside from this activity.

Here’s what I measured (in three test runs per journal type):

Journal Type	Real Time
data=journal	2m05s, 2m57s, 2m51s
data=writeback	2m03s, 1m18s, 1m22s
data=ordered	2m12s, 1m30s, 1m20s

For this application, data=journal takes about twice as long as the others once you set aside the first run of each mode (roughly 2m50s to 3m versus 1m20s to 1m30s), while data=ordered runs just as fast as data=writeback while providing some additional protection.
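To make the comparison concrete, here is a little arithmetic sketch over the timings in the table. The helper names are made up, the averages use integer division, and treating the first run of each mode as an outlier is an interpretation of the numbers rather than something that was measured separately.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Convert a time like "2m05s" to whole seconds.
to_seconds() {
  local t=$1 m s
  m=${t%%m*}            # minutes: everything before the "m"
  s=${t#*m}; s=${s%s}   # seconds: between the "m" and the trailing "s"
  # 10# forces base 10 so that "05" is not parsed as octal.
  echo $(( 10#$m * 60 + 10#$s ))
}

# Integer mean, in seconds, of any number of "XmYYs" timings.
mean_seconds() {
  local total=0 n=0 t
  for t in "$@"; do
    total=$(( total + $(to_seconds "$t") ))
    n=$(( n + 1 ))
  done
  echo $(( total / n ))
}
```

Using only the second and third runs, mean_seconds 2m57s 2m51s gives 174 for data=journal, against 80 for data=writeback (1m18s, 1m22s) and 85 for data=ordered (1m30s, 1m20s).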

So data=writeback is useless in my case, and the fact that data=ordered is the default makes sense. You get almost the same level of data protection as with data=journal, but with the performance of data=writeback. Different I/O patterns will give different results, but I suspect that the pattern I tested with is the most common in real server usage. (Note that in ext3’s v1 journal format, data=journal was the only journal behavior.)

My inclination is to stick with the default setting, even using data=ordered on database servers, since the database is doing its own higher-level journaling in the form of a transaction log. I’m basing this recommendation on this detail from the Gentoo article:

When appending data to files, data=ordered mode provides all of the integrity guarantees offered by ext3’s full data journaling mode. However, if part of a file is being overwritten and the system crashes, it’s possible that the region being written will contain a combination of original blocks interspersed with updated blocks.

Since a database transaction log is generally appended to rather than overwritten, my understanding is that it will protect against the above scenario in which data=ordered can cause a mix of old and new data. The database’s data files may have a mix of old and new data, but the transaction log would not show that the transaction has been completed yet, so it would be re-run during recovery and the remaining old data would be removed. I think.

The usage pattern where data that you really care about is overwritten regularly (as opposed to logs, which simply append) is rare in my experience, except in the case of database servers which are covered by their own logs as I just mentioned. So I don’t know of a particular application type that demands the full data journaling mode.

Anyway, I recommend against data=writeback altogether, unless you don’t mind some data corruption if there’s a power failure. The speed gain I measured isn’t worth the risk, in my opinion.
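To check which mode a filesystem is actually mounted with, you can look for the data= option in the kernel's mount table. The following is a minimal sketch, not part of the original benchmark: journal_mode is a hypothetical helper name, and it assumes a table in /proc/mounts format (device, mount point, filesystem type, options). If no data= option is listed, the kernel may simply be using its default without reporting it.

```shell
#!/bin/sh
# journal_mode MOUNTPOINT [TABLE]
# Hypothetical helper: print the data= journaling mode listed for MOUNTPOINT.
# TABLE defaults to /proc/mounts; any file in the same format works.
journal_mode() (
    mount_point=$1
    table=${2:-/proc/mounts}
    awk -v mp="$mount_point" '
        $2 == mp {
            found = 1
            mode = "not listed"          # no data= option shown for this mount
            n = split($4, opts, ",")     # field 4 is the comma-separated option list
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^data=/) mode = substr(opts[i], 6)
        }
        END { if (found) print mode }    # prints nothing if MOUNTPOINT is not mounted
    ' "$table"
)
```

For example, journal_mode / reports the mode for the root filesystem. The mode itself is chosen with the data= mount option, for instance in the options field of /etc/fstab.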

Save power and heat: spin down backup drives when idle

Jamie Flournoy — Fri, 16 May 2008 00:48:16 +0000

Here’s a tip for those of you who, like me, back up your data to hard disks instead of tapes. Backing up to the same hard disk doesn’t protect you much (if the disk failed, you’d lose the data and the backup at once), so presumably you’re backing up to a separate physical drive. That means that the backup drive need not spin 24/7. Instead, it only needs to spin at backup time.

Fortunately most disks can be told to spin down when idle, like laptop drives do. For the main disks of a server this is probably not worth the trouble, but for backup drives it can save you a lot of power and heat. Excessive heat kills hard drives, so this can also prolong the life of your backup drive.

On Linux this is accomplished with hdparm:

sudo hdparm -S 6 /dev/sda

The numeric value is on a nonlinear scale, so read about the -S option in man hdparm to make sure you pick the value you intended.
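As a rough illustration of that scale, here is a small sketch that translates a -S value into the timeout it requests, following the description in the hdparm man page. standby_timeout is a hypothetical helper name, not part of hdparm, and the drive firmware has the final say on what it actually does.

```shell
#!/bin/sh
# standby_timeout VALUE
# Hypothetical helper: describe the spin-down timeout requested by
# "hdparm -S VALUE", per the encoding documented in hdparm(8).
standby_timeout() (
    v=$1
    case $v in
        ''|*[!0-9]*) echo "invalid"; exit 1 ;;   # accept only non-negative integers
    esac
    if   [ "$v" -eq 0 ];   then echo "disabled"
    elif [ "$v" -le 240 ]; then echo "$(( v * 5 )) seconds"           # units of 5 seconds
    elif [ "$v" -le 251 ]; then echo "$(( (v - 240) * 30 )) minutes"  # units of 30 minutes
    elif [ "$v" -eq 252 ]; then echo "21 minutes"
    elif [ "$v" -eq 253 ]; then echo "vendor-defined"
    elif [ "$v" -eq 254 ]; then echo "reserved"
    elif [ "$v" -eq 255 ]; then echo "21 minutes 15 seconds"
    else echo "invalid"; exit 1
    fi
)
```

With this, standby_timeout 6 shows that the command above asks for a 30 second timeout, while standby_timeout 241 corresponds to 30 minutes.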

You can check the current status (active or standby) like this:

sudo hdparm -C /dev/sda

Gotchas:
External drives aren’t managed directly by the kernel, so hdparm won’t work on them.
My hardware RAID card won’t pass through these commands.
Linux RAID will not pass these commands through, though you may be able to apply the commands individually to RAID set members (which is what I was able to do).

I don’t know how to do this on a per-drive basis on Mac OS X (pmset apparently affects every drive on the system), or at all on Windows. If you know, feel free to post a comment explaining how it’s done.

Retroactively Minimizing Installed Packages on CentOS 5.1

Jamie Flournoy — Tue, 15 Apr 2008 01:16:44 +0000

In my CentOS 5.1 Minimal VPS Install Guide I describe how to install a very lean set of OS packages when starting from scratch. But what if the VPS is preinstalled for you by a hosting provider? There will be things preinstalled that you don’t need, which will slow down backups and updates, and waste the relatively tiny amount of disk space that VPS plans offer. So here are some instructions to help you identify and remove packages that you don’t need, when they’ve already been installed.

The first thing you need is a list of minimal packages that your server must have in order to function. This is somewhat subjective, so you may wish to customize it, but here is a roughly minimal list of yum package names for CentOS 5.1. Save that on your CentOS machine as minimal_package_names.txt.

Next, you need a way to compare this list to the list of what you have installed. Here’s a command line that I used:

yum list installed | awk 'split($1,a,".") { if (NR>2){ print a[1] } }' \
> installed_package_names.txt ; diff installed_package_names.txt \
minimal_package_names.txt | grep '<' | colrm 1 2

(The awk command is there to strip out the version number and architecture from the package name.)

Now you can run that command and see a list of package names that are not in your minimal_package_names.txt list. You can switch that grep command so it looks for ‘>’ instead of ‘<’, and see things that you consider minimal which are not currently installed. Then it’s just a matter of “yum install foo” and “yum remove foo”. I encourage you to use “yum info foo” to make removal decisions one by one, since someone at the ISP probably took the time to research them and thought you might find them useful. You should probably also remove packages in small groups or one by one, because you might be surprised at the dependencies you find. I was surprised to find that uninstalling postgresql-libs would cause httpd (Apache) to be removed as well. But if you want to automate it, just tack | xargs yum remove on the end of that command, and it will automatically remove them all at once.
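The diff approach compares the two files line by line, so it can report extra noise if the lists are not in the same order. An alternative, which is not what I originally used, is to sort both lists and compare them with comm. This is a minimal sketch: list_only_in_first is a hypothetical helper name, and it assumes each file holds one package name per line.

```shell
#!/bin/sh
# list_only_in_first FIRST SECOND
# Hypothetical helper: print the lines that appear in FIRST but not in SECOND.
# Both files are sorted into temporary copies because comm needs sorted input.
list_only_in_first() (
    first=$(mktemp)
    second=$(mktemp)
    LC_ALL=C sort -u "$1" > "$first"
    LC_ALL=C sort -u "$2" > "$second"
    LC_ALL=C comm -23 "$first" "$second"   # -23 hides lines unique to SECOND and lines in both
    rm -f "$first" "$second"
)
```

Running list_only_in_first installed_package_names.txt minimal_package_names.txt lists the installed packages that are not on the minimal list, and swapping the two arguments lists the minimal packages that are not installed.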

Using this as a starting point, you can change your “minimal” packages list to fit your preferences, or even as a quick and dirty alternative to using Kickstart.