I’m doing software development on an Ubuntu 20.04 VM running on my MacBook Pro via Parallels Desktop 17. I recently noticed that the “Software Update” app on the Ubuntu desktop was reporting a lot of pending updates, which makes sense: I created this VM only a couple of months ago via a fancy Ansible playbook, and I hadn’t gotten around to updating it until yesterday.
Well, after the update, the VM started taking a really long time to boot. There were two places where it hung for two minutes each for no apparent reason, stretching the boot process to about 4½ minutes.
I’ve figured out what happened and found a workaround, but I’ll mention a few of the things I looked at along the way since they’re sort of interesting and may be relevant if your issue is similar but not caused by the same thing.
A bit of background: this VM was created by the Parallels installer wizard, which asks the user some questions and then downloads the Ubuntu 20.04 installer and feeds it an unattended-install ISO image based on your answers. It’s actually pretty slick, and I don’t mind the fact that I can’t use my custom scripts-and-YAML frontend to VirtualBox’s VBoxManage that creates a VM and its disk images from scratch and then installs Ubuntu 20.04 on that VM. (I use that frontend when making Linux-on-VirtualBox-on-Linux VMs, but I don’t have any automation for creating Parallels VMs at the moment.)
I still use my Ansible playbook that installs all of the things that all of my VMs need, which in the case of this VM (a development server for me to use when working on software development for a single customer) includes the goodies I like on a shell server, the tweaks I like to make to Ubuntu servers in general, and all of the things that I need for this particular host. So this host’s configuration is mostly under formal configuration management, and I can build a copy of it pretty quickly.
That’s why my first reaction to a messed up VM was to just make it from scratch again, since I figured that I could probably do that (and clear out whatever gremlins were in the updated one) faster than I could troubleshoot the actual problem.
Well, that didn’t work. I made a new VM disk and disconnected the old one, running the Ubuntu 20.04 installer directly (without the Parallels wizard) and choosing the most boring options available to populate the new VM disk. It worked fine, so I applied the software update. That update made it act really weird: black screen, no POST, no GRUB bootloader, no kernel output. WTF? Had I hosed the VM configuration somehow?
I made a whole new VM instance and ran the installer again, and it worked fine. I ran the software update to get it up to date with the latest packages for Ubuntu 20.04, and I got the same result(!): bricked. No POST, no GRUB, no kernel output, just hangs with a blank screen at power-on. WT-actual-F?
It was at this point that I really wished I had thought to use the snapshots feature of Parallels to create a pre-update state that I could revert to if the update went badly. Ugh. Lesson learned: always use the snapshot feature when you can, even if this machine is so easy to recreate (and contains only data that’s in a Git repo elsewhere) that it’s not worth backing up. Reverting to an old snapshot is still faster than rebuilding the machine. Snapshots cost nothing and take seconds, vs. reinstalling which takes an hour or so.
I dug and dug and dug on various Linux and Ubuntu support forums for quite a while for a clue as to what the heck was wrong. In the meantime, buried under all of my browser windows, the “bricked” VMs actually booted. Huh?
What I was actually seeing was not a pair of non-bootable VMs, but a pair of VMs that played dead for a long time. The POST display flashed by in under a second and never actually rendered on the VM’s virtual display (though it did appear for an instant on the thumbnail of the VM’s display shown in the Parallels Control Center window). Then came a GRUB menu deliberately hidden by the Ubuntu installer’s settings in /etc/default/grub, with a 0-second delay before booting, followed by a “quiet splash” boot screen, because the Ubuntu desktop installer prioritizes pretty boot screens over being able to see what’s going wrong. For some reason the splash screen doesn’t work in this configuration (missing image files? VM video driver incompatibility?), so the pretty Ubuntu logo wasn’t showing at all during the 4+ minutes it took to reach the GUI login screen. But it was booting, eventually.
Editing /etc/default/grub to contain GRUB_TIMEOUT=10, GRUB_TIMEOUT_STYLE=menu, and GRUB_CMDLINE_LINUX_DEFAULT="" (and then running sudo update-grub) ended the reign of the useless black boot screen.
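For reference, here’s what the relevant part of my /etc/default/grub looks like after the edit (a sketch of just the lines I changed; the rest of the file is untouched):

```shell
# /etc/default/grub (excerpt): show the GRUB menu for 10 seconds,
# and boot without "quiet splash" so kernel messages are visible.
GRUB_TIMEOUT=10
GRUB_TIMEOUT_STYLE=menu
GRUB_CMDLINE_LINUX_DEFAULT=""
```

Remember that the edit only takes effect once sudo update-grub has regenerated /boot/grub/grub.cfg from it.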
So, it turns out all three installations (the original one, the new one on the original VM with a new blank virtual disk, and the one on the new VM) had the same boot-time problem. I had thought that maybe my Ansible script had messed up the /etc/fstab somehow, since it was something disk-device related that was apparently taking 2 minutes to fail.
I had figured out that it was disk-device related because the two 2-minute boot freezes happened around messages that said “Failed to start udev Wait for Complete Device Initialization” and “ata8: SATA link down”.
There were a bunch of forum threads assisting people who had messed up their /etc/fstab so that the kernel couldn’t find their swap file, so I just disabled swap entirely, but that didn’t fix my problem.
Some people blamed LVM (specifically, using /dev/mapper/… rather than UUID= to identify the filesystem on the LVM volume, and possibly using the wrong UUID), and that seemed plausible, since my Ansible script rewrites /etc/fstab to use UUIDs for everything instead of /dev/sda1 and friends. That was a red herring too: a freshly installed system that used the /dev/mapper style to point to the LVM root volume still had this problem. I’ll probably switch my Ansible script to the /dev/mapper/xyz style instead of UUIDs, since all I really want is a stable disk identifier rather than a /dev/sdX name, but clearly that’s not what caused the issue, since it persists even with a fresh Ubuntu 20.04 install.
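For context, here’s how the two styles look as an /etc/fstab entry for the root filesystem. These lines are illustrative only; the volume group name and UUID are made up, not copied from my machine:

```
# Stable device-mapper name assigned by LVM:
/dev/mapper/vgubuntu-root  /  ext4  errors=remount-ro  0  1

# Same filesystem, identified by its UUID (as reported by blkid):
# UUID=2f9e8c1a-4b6e-4c1d-9a7f-123456789abc  /  ext4  errors=remount-ro  0  1
```

Both styles avoid the unstable /dev/sdX enumeration, which is the whole point; and neither one was the cause of the boot delay.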
So, it’s not swap and it’s not LVM. Those are the only entries in /etc/fstab, so it’s not /etc/fstab at all.
What’s left? Some kind of disk device problem? How could that be the case with a fresh VM and a brand new disk device and an /etc/fstab created by the Ubuntu installer? Is it possible that everybody with Ubuntu 20.04 who has applied this update has this same issue?
The actual problem & solution:
As it turns out, yeah, apparently a lot of people are hitting this. The thread “Issue with Parallels, Ubuntu VM and the CD-ROM” on the Parallels forum describes this exact problem and identifies the cause and a workaround.
The issue is caused by a Linux kernel bug, which of course has already been fixed upstream. But the fix is recent, so it hasn’t reached the Ubuntu 20.04 package repositories yet.
The workaround is to not leave an empty virtual CD drive attached to the VM. That’s easy to arrange: either delete the virtual CD device, or keep an ISO image in the drive and make sure it won’t be booted (by changing the boot order, by using a blank .iso image, or both).
So, that’s what I did on Saturday night. How was yours? ;)
This post on linuxquestions.org shows a quick way to create an empty ISO:
mkisofs -o /tmp/cd.iso /tmp/directory/
A few things that I learned about while chasing the red herrings:
systemd-analyze critical-chain is a pretty cool way to inspect your systemd units and see which ones are dragging out the startup process. It wasn’t actually that useful here, since the delay happens before systemd starts, but it’s interesting anyway.
Running systemd-analyze also showed me that there’s a deprecated unit called systemd-udev-settle.service, which was adding the second 2-minute wait during boot. There’s a thread about it on the Ubuntu subreddit suggesting that it isn’t needed anyway. I use LVM on this host, and disabling the unit via systemctl mask systemd-udev-settle.service didn’t break anything, so I guess it really isn’t needed anymore (at least not for a setup like this, with a virtual disk that takes no time to spin up and become accessible).