systemd-nspawn and OSTree
I run my servers and my laptop using Fedora Atomic. This is the next Fedora CoreOS which is curently in devloppment. The current Fedora release is only available through the Silverblue channel, so that’s what I’m using currently. It features atomic updates and clean package layering. You can easily rollback.
Now, I have a physical server to mess up with things locally, and I wanted to create multiple virtual environments where I can test different things. In particular I want to try a Rancher k3s deployment on bare metal. The only thing is that physical machines are difficult to clone, so I wanted to setup some kind of virtual environment on top of it to be able to start up a different OS distribution later on without requiring me to buy a new machine and reconfigure it from scratch.
instead of going the virtual machine route, I wanted to test systemd-nspawn and
systemd-machined to manage my “virtual” machines wuing machinectl
. Also, I
wanted to boot an OSTree distribution both as host and guest. OSTree is nice
because it allows to share a single /ostree
between various systems. here is
how I did it.
Create the OSTree deployment
You need to create a new OSTree deployment without impacting the host. First, create a new OS inside your OSTree:
[root@host host]# ostree admin os-init guest
ostree/deploy/guest initialized as OSTree root
You now have a guest OS registered, but it does not have any deployment. A
deployment is a system image you layer on top of resident data (/var
). To
create a deployment, you need an OSTree commit hash to reference it. Let’s start
with the same commit as the host.
Let’s view your deployments:
[root@host host]# ostree admin status
fedora 131642a5b3da24dc47f9ff8191f5856bbeced08275ea823fa543327fd20f4862.0 (staged)
Version: 30.20190715.0
origin: <unknown origin type>
* fedora 8e721e68bfa7c813361f3067d202b2fd907634e192059451ffa8365ca5733224.0
Version: 30.20190713.0
origin: <unknown origin type>
fedora 38a1f86b1168dda0e8ed80782afb3e56bb1918b1984c1dc0d7216f63f865a7a0.0 (rollback)
Version: 30.20190701.0
origin: <unknown origin type>
here you see I performed an updated (staged) but did not reboot to it. I also
have the current deployment and the rollback deployment. the hashes are not
commit ids yet. Perhaps there is another way to get the commit ids, but I did a
rpm-ostree status
to get it:
[root@host host]# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
ostree://fedora:fedora/30/x86_64/silverblue
Version: 30.20190715.0 (2019-07-15T00:36:07Z)
BaseCommit: fa23b170d892d78935de422064e73a84057119ea315257a9030dc7659e726a7b
StateRoot: fedora
GPGSignature: Valid signature by F1D8EC98F241AAF20DF69420EF3C111FCFC659B9
Diff: 11 upgraded, 1 added
LayeredPackages: net-tools strace systemd-container tcpdump tmux
● ostree://fedora:fedora/30/x86_64/silverblue
Version: 30.20190713.0 (2019-07-13T00:38:50Z)
BaseCommit: f8aeed3ee64bddecb6ebfd382db9c4b19c5f5502cb24fdb4eed788b9c4558f62
StateRoot: fedora
GPGSignature: Valid signature by F1D8EC98F241AAF20DF69420EF3C111FCFC659B9
LayeredPackages: net-tools systemd-container tcpdump tmux
ostree://fedora:fedora/30/x86_64/silverblue
Version: 30.20190701.0 (2019-07-01T00:36:21Z)
BaseCommit: f5efad3126a7bc90b88c0a3ba382c15eb3119e123eef3b1f1e2c4a7f7480fcc1
StateRoot: fedora
GPGSignature: Valid signature by F1D8EC98F241AAF20DF69420EF3C111FCFC659B9
LayeredPackages: net-tools tcpdump tmux
Let’s take commit
fa23b170d892d78935de422064e73a84057119ea315257a9030dc7659e726a7b
as a base
commit for our new guest
deployment.
[root@host host]# ostree admin deploy --not-as-default --os=guest fa23b170d892d78935de422064e73a84057119ea315257a9030dc7659e726a7b
Relabeling /var (no stamp file 'var/.ostree-selabeled' found)
Bootloader updated; bootconfig swap: yes; deployment count change: 0
You need to specify --not-as-default
else when the host is going to reboot, it
will reboot to the guest deployment. That would not be nice.
Setup systemd-nspawn to start the guest
Because systemd-nspawn is accessing /sysroot
and I believe SELinux is not
configured to allow that by default, you’ll have to find a way for SELinux to
accept what we are doing. You can just test things out first by setting SELinux
to permissive mode:
[root@host host]# setenforce Permissive
(you might want to configure this to keep it across reboots)
Now, you need to create a dropin to systemd-nspawn@guest.service
to set up
option to systemd-nspawn:
[root@host host]# cat /etc/systemd/system/systemd-nspawn@guest.service.d/10-service.conf
[Service]
ExecStart=
Environment=SYSTEMD_NSPAWN_LOCK=0
ExecStart=/bin/sh -xc ' \
deployment=`ostree admin status | sed -ne \'s/. %i \\(\\S*\\)\\( .*\\)*$/\\1/; T; p; q\'`; \
exec /usr/bin/systemd-nspawn --quiet --keep-unit --boot --link-journal=try-guest --network-veth --settings=override --machine=%i \
--directory=/sysroot \
--bind /boot:/boot \
--bind +/sysroot/ostree/deploy/%i/var:/var \
--pivot-root /ostree/deploy/%i/deploy/$deployment:/sysroot \
$SYSTEMD_NSPAWN_FLAGS \
'
RestartForceExitStatus=
Restart=always
RestartSec=3s
In particular:
-
-U
is removed because we do not want systemd-nspawn to update the UID/GID in /ostree. Unfortunately, this article does not cover how we can separate UIDs/GIDs between the host and the guest. -
--directory=/sysroot
is telling systemd-nspawn to boot/sysroot
as root directory. This is the filesystem root. -
--bind /boot:/boot
is telling to keep the boot partition mounted. This is necessary for the ostree command-line to work (it looks for files in/boot/loader/entries
) -
--bind +/sysroot/ostree/deploy/%i/var:/var
is telling to mount the persistent/var
directory. Else we would have a complete system read-only. -
--pivot-root /ostree/deploy/%i/deploy/$deployment:/sysroot
is telling to pivot-root to the specific deployment (computed from the sed pipeline above) -
Restart=always
andRestartSec=3s
gives enough time to systemd-machined to deregister the machine so the machine can restart without errors (Failed to register machine: Machine 'guest' already exists
)
With that, you can now enable and start the service, and you have a sub-machine ready.
[root@host host]# systemctl enable --now systemd-nspawn@guest.service
[root@host host]# machinectl list
MACHINE CLASS SERVICE OS VERSION ADDRESSES
guest container systemd-nspawn fedora 30 -
2 machines listed.
[root@host host]# machinectl shell guest
Connected to machine guest. Press ^] three times within 1s to exit session.
Welcome to Fedora Silverblue. This terminal is running on the
immutable host system. You may want to try out the Toolbox for a
more traditional environment that allows package installation
with DNF.
For more information, see the documentation.
[root@guest roothome]# ostree admin status
fedora 8e721e68bfa7c813361f3067d202b2fd907634e192059451ffa8365ca5733224.0
Version: 30.20190713.0
origin: <unknown origin type>
* guest fa23b170d892d78935de422064e73a84057119ea315257a9030dc7659e726a7b.0
Version: 30.20190715.0
origin refspec: fa23b170d892d78935de422064e73a84057119ea315257a9030dc7659e726a7b
fedora 38a1f86b1168dda0e8ed80782afb3e56bb1918b1984c1dc0d7216f63f865a7a0.0
Version: 30.20190701.0
origin: <unknown origin type>
You will also need to enable machines.target
:
[root@host host]$ systemctl enable machines.target
Created symlink /etc/systemd/system/multi-user.target.wants/machines.target → /usr/lib/systemd/system/machines.target.
Configure network bridge
To make the guest machine accessible from the network, it needs to be bridged (or otherwise connected) to the host network. By default a network interface pair is created but left unconfigured. We’ll see how to bridge it to the host network.
First, you need to create a bridge on the host. We’ll use systemd-networkd which is suitable for servers. Ensure it is enabled and started:
[root@host host]# systemctl status systemd-networkd
● systemd-networkd.service - Network Service
Loaded: loaded (/usr/lib/systemd/system/systemd-networkd.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2019-07-15 14:34:30 CEST; 22s ago
Then you need to create a bridge device:
[root@host host]# cat /etc/systemd/network/br0.netdev
[NetDev]
Name=br0
Kind=bridge
Assign the upstream network interfaces to the bridge (beware, if you have static DHCP rules, the host MAC address will change to the MAC address of the bridge, which is deterministic):
[root@host host]# cat /etc/systemd/network/br0-bind.network
[Match]
Name=en*
[Network]
Bridge=br0
And enable DHCP on the bridge:
[root@host host]# cat /etc/systemd/network/br0.network
[Match]
Name=br0
[Network]
DHCP=ipv4
And add a systemd-nspawn flag to bridge the machine:
[root@host host]# cat /etc/systemd/system/systemd-nspawn@guest.service.d/10-flags.conf
[Service]
Environment=SYSTEMD_NSPAWN_FLAGS=--network-bridge=br0
Then, you can restart systemd-networkd
and systemd-nspawn@guest
.
In case you want to run Kubernetes in it
Short answer: it’s difficult and need further work
You need to make a few changes. First, you need to disable conntrack tweaking in
kube-proxy. kube-proxy be default attempts to tweak the nf_conntrack kernel
module, which is not possible in the nspawn jail because
/sys/module/nf_conntrack/parameters/hashsize
does not exists. Add to the
kube-proxy command-line:
--conntrack-max-per-core=0
If you use btrfs, you may have problems with kubelet which cannot find device
major and minor number for its data directories. I had a strange error like
Failed to start ContainerManager failed to get rootfs info: failed to get device for dir "/var/lib/rancher/k3s/agent/kubelet": could not find device with major: 0, minor: 44 in cached partitions map
which followed a warning stat failed on /dev/sda5 with error: no such file or directory
.
What kubelet is doing (in fact it’s delegated to cadvisor) is it’s trying to
find the device number for the data directory. If the device major number is 0,
it means that the mount point is a device-less mount point. In this case, a
btrfs subvolume. To get the proper device, it detects if the filesystem is btrfs
and is looking up the device number of the btrfs device itself (/dev/sda5
in
my case). Because systemd-nspawn mask the block devices, it fails.
The solution is to add to the systemd-nspawn command-line before the pivot root:
--bind /dev:/dev
To get /proc/sys
writeable, you need to start systemd-nspawn with:
Environment=SYSTEMD_NSPAWN_API_VFS_WRITABLE=1
Then, you’ll have a problem because the cgroups hierarchy is read-only. This is because cgroups-v1 cannot be delegated properly and there might be conflicts. cgroups-v2 might help there but it’s still early.