Mumbling about computers

Booting x86-64: from firmware to PID1


Introduction

The path from firmware to userland is surprisingly complex, and many of its details are critical for the process to succeed.

In this post I'll explain as many details as possible of booting a fairly standard system with ESP, /boot and / partitions, where the / partition is configured with software RAID.

We'll investigate:

  • EDK2: UEFI reference implementation
  • GRUB2: Bootloader
  • Linux: Kernel
  • mdadm: Software raid implementation

Sadly, not all of these are hosted in a way that lets me link to references stably (mostly, they are released as tarballs). In those cases, I'm going to link to a third-party mirror1.

Storage devices

Storage in computers relies on physical disks, which usually come in two2 types: mechanical hard drives, with spinning platters which store data magnetically; and solid-state drives (SSDs), which use flash memory for faster access times and improved reliability.

Linux exposes these storage devices through the "block I/O layer", which represents storage media as sequences of fixed-size blocks, typically 512 bytes or 4KB in size.

Access to the block layer happens through device files which are represented in /dev (usually /dev/sdX for SATA/SAS drives, /dev/nvmeXnY for NVMe drives, but you can also find hdX, fdX, cdromX, ...).

These devices are identified as block devices during initialization by their respective drivers, which create the device files with type S_IFBLK.
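
For example, here is a minimal sketch (the device path is just an example; substitute whatever node exists on your system) that checks whether a /dev entry is a block device:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

int main(void) {
    struct stat st;

    // example path; substitute any device node present on your system
    if (stat("/dev/nvme0n1", &st) != 0) {
        perror("stat");
        return 1;
    }

    // S_ISBLK tests the S_IFBLK file type that the driver set at creation
    printf("block device? %s (major %u, minor %u)\n",
           S_ISBLK(st.st_mode) ? "yes" : "no",
           major(st.st_rdev), minor(st.st_rdev));
    return 0;
}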

Firmware initialization

x86-64 computers require device-specific firmware to perform hardware initialization during power-on, so that they can load the operating system and hand over execution.

There are two standardized types of firmware to handle the booting process: (Legacy) BIOS and UEFI.

BIOS

The boot process on Legacy BIOS (1981) is very simple and I won't go into much detail.

BIOS will iterate through all devices and validate whether they are bootable by checking if they have a Master Boot Record on the first sector (which is defined as 512 bytes, even if modern devices have non-512-byte sectors).

Whether there's an MBR present or not is defined by finding the MBR signature (0x55 0xAA) as the last two bytes in the sector. The layout is also quite simple:
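
As a sketch (the field names are mine, not from any particular header), the whole 512-byte sector can be described by a packed C struct:

#include <stdint.h>

struct mbr_partition_entry {   // 16 bytes per partition, 4 entries max
    uint8_t  status;           // 0x80 = bootable
    uint8_t  first_chs[3];
    uint8_t  type;
    uint8_t  last_chs[3];
    uint32_t first_lba;
    uint32_t sector_count;     // 32 bits of 512-byte sectors, hence the 2TiB limit
} __attribute__((packed));

struct mbr {
    uint8_t  bootstrap[446];   // the code BIOS jumps into
    struct mbr_partition_entry partitions[4];
    uint16_t signature;        // 0xAA55, i.e. bytes 0x55 0xAA on disk
} __attribute__((packed));     // exactly 512 bytes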

If indeed there's an MBR on the disk, the sector will be loaded at physical address 0x7C00, then BIOS will jmp there. Good luck!

BIOS is very simple but also quite limited. Having only 16 bytes of information per partition limits their maximum size to 2TiB (along with a maximum of 4 partitions). The small bootstrap area (446 bytes of code is not a lot, even with x86's dense instruction format) requires multiple stages of boot loaders to be chained.

UEFI

Given the limitations of BIOS, Intel started UEFI in ~2005, which overcame all of these limitations, but also became a massive beast3.

At system power-on, the firmware needs to determine which disk to boot from, so it will check a set of variables (BootOrder, Boot####) in NVRAM for a pre-defined boot order (using gRT->GetVariable()).
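
A minimal EDK2-style sketch of that lookup, assuming the usual gRT and gEfiGlobalVariableGuid globals from the EDK2 headers (the function name is mine):

#include <Uefi.h>
#include <Library/UefiRuntimeServicesTableLib.h>  // provides gRT
#include <Guid/GlobalVariable.h>                  // provides gEfiGlobalVariableGuid

EFI_STATUS
ReadBootOrder (VOID)
{
  // BootOrder is a list of UINT16 indices into the Boot#### variables
  UINT16 BootOrder[64];
  UINTN  Size = sizeof (BootOrder);

  EFI_STATUS Status = gRT->GetVariable (
                             L"BootOrder",
                             &gEfiGlobalVariableGuid,
                             NULL,    // attributes, not needed here
                             &Size,   // in: buffer size, out: bytes written
                             BootOrder
                             );
  // EFI_NOT_FOUND means there is no stored boot order: fall back to enumeration
  return Status;
}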

If no valid boot order is stored in NVRAM, then UEFI will enumerate all connected storage devices by calling LocateHandle() with a ByProtocol search for EFI_BLOCK_IO_PROTOCOL.

Once a disk is selected for booting, the firmware reads the GUID Partition Table (GPT) header.

The first 512 bytes of the disk usually contain a "protective" MBR, whose only purpose is to prevent old, GPT-unaware tools from considering the drive empty (and potentially overwriting data); it plays no further role in the boot process.

The second logical block (LBA 1) contains the GPT header; there's also a backup GPT header at the very end of the disk.
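
The header itself is a small, fixed-layout structure; here is a sketch of its fields (per the UEFI spec, with field names of my choosing):

#include <stdint.h>

struct gpt_header {                 // at LBA 1, backup copy at the last LBA
    uint8_t  signature[8];          // "EFI PART"
    uint32_t revision;
    uint32_t header_size;
    uint32_t header_crc32;
    uint32_t reserved;
    uint64_t my_lba;                // LBA of this header
    uint64_t alternate_lba;         // LBA of the backup header
    uint64_t first_usable_lba;
    uint64_t last_usable_lba;
    uint8_t  disk_guid[16];
    uint64_t partition_entry_lba;   // where the partition entry array starts
    uint32_t num_partition_entries;
    uint32_t partition_entry_size;  // usually 128 bytes
    uint32_t partition_array_crc32;
} __attribute__((packed));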

So, at this point, UEFI has:

  1. Initialized the hardware sufficiently to enumerate all block devices
  2. Filtered said devices down to the ones with a valid GPT header
  3. Picked a device to boot based on user preference (choice saved to NVRAM) or other algorithm (usually "first valid device")

How does UEFI find the bootloader?

First, it looks for the EFI System Partition (ESP) on the disk, which has the partition type GUID C12A7328-F81F-11D2-BA4B-00A0C93EC93B.

This partition must be formatted as FAT (12, 16 or 32) (which is probably good, FAT is very simple).

Inside that GUID-tagged, FAT-formatted partition, it will look for a file in the path \EFI\BOOT\BOOTX64.EFI (obviously, the filename is architecture dependent)4.

UEFI will load that file, which is a bootloader5 (in our case, GRUB2) in Portable Executable format, into memory and prepare to execute it.

The bootloader

In a UEFI system, the bootloader must be a Position Independent Executable (PIE), as there is no guarantee of the physical address at which it will be loaded.

When the bootloader is executed, it's still running in the UEFI environment, so it must be careful to not clobber any of UEFI's state, which is defined in a MemoryMap and can be obtained by calling the GetMemoryMap service.

Allocations should be performed with the AllocatePages or AllocatePool boot services so UEFI has an up to date view on memory usage and does not clobber GRUB's data.
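
A sketch of such an allocation, assuming the gBS boot services table pointer from the EDK2 headers (the function name is mine):

#include <Uefi.h>
#include <Library/UefiBootServicesTableLib.h>  // provides gBS

EFI_STATUS
AllocateScratchBuffer (EFI_PHYSICAL_ADDRESS *Addr)
{
  // Ask the firmware for 16 pages (64 KiB); the allocation is recorded in the
  // UEFI memory map, so the firmware won't hand the same range to anyone else.
  return gBS->AllocatePages (
                AllocateAnyPages,  // let the firmware pick the address
                EfiLoaderData,     // memory type the OS can reclaim later
                16,                // number of 4 KiB pages
                Addr
                );
}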

Once GRUB is loaded, it will load its configuration from $PREFIX/grub.cfg (prefix is set at build time) which looks something like this:

search.fs_uuid 7ce5b857-b91e-4dab-a23c-44b631e6ebab root 
set prefix=($root)'/grub'
configfile $prefix/grub.cfg

This configuration must be placed next to the GRUB image (executable), in the ESP.

So, what does GRUB do once this configuration is read? It will scan all block devices for a partition with a matching UUID, then load a new configuration.

But how do you identify the correct partition?

From the GPT header we know the start & end of every partition, and very conveniently, filesystems usually implement the concept of a superblock, which places significant metadata at a fixed position within the partition, both for self-reference and external identification.

Each filesystem, however, places this information at different offsets6, so we need an initial way of identifying the filesystem; we can do so through their identifying magic numbers, which are defined in their documentation:

Filesystem  Offset   Magic value  Docs
XFS         0x00000  "XFSB"       link (didn't find official doc)
Ext4        0x00438  0xEF53       link
BTRFS       0x10040  "_BHRfS_M"   link

When looking at the second partition on my disk:

$ dd if=/dev/nvme0n1p2 bs=1 count=$((1024+64)) | hexdump
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
*
00000400  00 60 d1 01 00 0a 45 07  4c 0d 5d 00 df 77 7f 02
00000410  b0 01 93 01 00 00 00 00  02 00 00 00 02 00 00 00
00000420  00 80 00 00 00 80 00 00  00 20 00 00 3f d1 09 65
00000430  1f cd b0 63 82 00 ff ff  53 ef 01 00 01 00 00 00
                                   ^^ ^^

we can see the 0x53 0xef (little endian) on the last row, so this is an EXT4 filesystem.

On EXT4 we can find the UUID at 0x468 (0x68 within the superblock, 1128 in decimal):

$ dd if=/dev/nvme0n1p2 bs=1 count=16 skip=1128 | hexdump
7c e5 b8 57 b9 1e 4d ab  a2 3c 44 b6 31 e6 eb ab

which matches the UUID in GRUB configuration, so this must be the partition we are looking for (and indeed, it is my /boot partition).
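
Doing the same probe programmatically is just a couple of preads; a minimal sketch (the device path is whichever partition is being inspected):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // example path; point this at the partition you want to probe
    int fd = open("/dev/nvme0n1p2", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint16_t magic;
    uint8_t  uuid[16];
    pread(fd, &magic, sizeof(magic), 0x438);  // ext4 magic, stored little-endian
    pread(fd, uuid, sizeof(uuid), 0x468);     // ext4 UUID (0x68 into the superblock)

    if (magic == 0xEF53) {
        printf("ext4, uuid=");
        for (int i = 0; i < 16; i++)
            printf("%02x", uuid[i]);
        printf("\n");
    }
    close(fd);
    return 0;
}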

Through the magic of an EXT4 implementation, GRUB will traverse the filesystem and find the second level configuration, which looks something like this (trimmed):

menuentry 'Linux'  {
        #       v path to kernel          v cmdline
        linux   /vmlinuz-6.8.0-40-generic root=UUID=9c5e17bc-8649-40db-bede-b48e10adc713
        #       v path to initrd
        initrd  /initrd.img-6.8.0-40-generic
}

In the config we see 3 things:

  • The kernel image (/vmlinuz-6.8.0-40-generic): Kernel that we are booting
  • The kernel cmdline: Boot-time parameters to the kernel
  • The initrd: A root filesystem to mount in a RAM-disk, to bootstrap the real root filesystem.

Loading the kernel and initrd

GRUB will load the kernel image (vmlinuz-6.8.0-40-generic) at LOAD_PHYSICAL_ADDR (which is 0x100_000 = 1MiB, per the boot protocol).

GRUB also needs to load the initrd somewhere, and while any address is valid, GRUB prefers as high as possible7.

How does the kernel find out the address and size for the initrd and cmdline arguments?

Again, it's specified in the boot protocol — the Kernel's Real Mode header must be read from the kernel image (at offset 0x1f0) and written back somewhere, with updated fields for cmdline/initrd addresses (among others).
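
An abridged sketch of that header, keeping only the fields relevant here (offsets per the boot protocol documentation; the struct name and trimming are mine):

#include <stdint.h>

// Abridged: only the fields GRUB fills in for the cmdline and initrd;
// the real header has many more fields before, between, and after these.
struct setup_header_excerpt {
    // ...
    uint32_t ramdisk_image;  // 0x218: physical address where the initrd was loaded
    uint32_t ramdisk_size;   // 0x21c: size of the initrd in bytes
    // ...
    uint32_t cmd_line_ptr;   // 0x228: physical address of the kernel cmdline string
    // ...
};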

But if the header can be written anywhere, again, how does the kernel find it? Well, GRUB must place the physical address of the header in the rsi register before jumping into kernel code.

Transferring control to the kernel

In the 64-bit boot protocol, the kernel's entry point is at a fixed offset: 0x200.

How does the kernel load the initrd?

During early initialization, the kernel will save the rsi register into r15, which is callee-saved.

After the early initialization is done, the kernel will copy r15 to rdi (because rdi is the first argument to a function in the System V calling convention), then it will call x86_64_start_kernel, which is the C entrypoint of the kernel.

From there we are on more familiar land, and the kernel will eventually call copy_bootdata(__va(real_mode_data)); (here __va will get the virtual address from a physical address).

copy_bootdata, finally, will copy the kernel cmdline and the boot params (which include the initrd address) onto kernel-managed memory.

At this point, the kernel will finally call start_kernel() and the non-platform-specific kernel code will start executing.

Eventually, do_populate_rootfs is called8, after which the kernel will call run_init_process(ramdisk_execute_command);9, and we are in (crude) userland!

Early userspace

We have a PID1, but it's not a good PID1; we are in a ramdisk with limited utility. The goal of the initrd is, usually10, to perform some bootstrapping action and pivot to a useful workload as soon as possible, so let's see what that looks like.

From the GRUB config we saw an interesting parameter in the kernel's commandline arguments:

linux   /vmlinuz-6.8.0-40-generic root=UUID=9c5e17bc-8649-40db-bede-b48e10adc713

root means the root filesystem, and it's tagged with UUID=..., so we get to go on another hunt for disks.

This time though, the previous table of filesystem magic numbers was not useful when looking at the contents of the drive:

0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0001000 4efc a92b 0001 0000 0001 0000 0000 0000
...
0003000 0000 0000 0000 0000 0000 0000 0000 0000
*
8100400 e000 0095 7f00 0257 f98c 001d d553 023d

We see.. nothing useful?

Let me clarify: what this dump is trying to show is that there are only zeroes in the ranges 0x0000-0x1000 (XFS and ext4 place their signatures in this range) and 0x3000-0x8100400 (BTRFS places its signature here), so the data must be stored in some other way.

In this case, the root filesystem is on a RAID1 array managed by mdadm. How do we verify it? Another magic number table!

Name   Offset  Magic value  Docs
mdadm  0x1000  0xa92b4efc   link

We know we want to interact with the md module to assemble the virtual device, so we first need to load the module.

Loading kernel modules

Kernel modules are stored as ELF (Executable and Linkable Format) files, which allows the kernel to:

  • Verify the module's compatibility with the running kernel, by checking the .modinfo section
  • Resolve symbols (function names, variables) between the module and the kernel
  • Perform necessary relocations to load the module at any available kernel address

The userspace task of loading a module is relatively simple; we need to open the file and call the finit_module syscall on the file descriptor:

use std::fs::File;
use std::os::unix::io::AsRawFd;
use libc::{c_char, finit_module}; // assuming the libc crate for the syscall wrapper

let file = File::open(path)?;
let fd = file.as_raw_fd();

let params_cstr = std::ffi::CString::new("module parameters")?;

unsafe {
    finit_module(fd, params_cstr.as_ptr() as *const c_char, 0)
};

Now the module is loaded, but how do we tell the module to create the virtual device?

First, we need to find the right disks among the sea of block devices, conveniently the kernel will group them up for easy listing in /sys/class/block:

$ ls /sys
ls: cannot access '/sys': No such file or directory

Right. Initrd. Okay, we can mount the sysfs with the mount syscall:

unsigned long mountflags = 0;
const void *data = NULL;

mount("none", "/sys", "sysfs", mountflags, data);

Now we can actually look at the block devices

$ ls /sys/class/block
/sys/class/block/nvme0n1   -> ../../devices/pci0000:00/0000:00:03.1/0000:2b:00.0/nvme/nvme0/nvme0n1
/sys/class/block/nvme0n1p1 -> ../../devices/pci0000:00/0000:00:03.1/0000:2b:00.0/nvme/nvme0/nvme0n1/nvme0n1p1
/sys/class/block/nvme0n1p2 -> ../../devices/pci0000:00/0000:00:03.1/0000:2b:00.0/nvme/nvme0/nvme0n1/nvme0n1p2
/sys/class/block/nvme0n1p3 -> ../../devices/pci0000:00/0000:00:03.1/0000:2b:00.0/nvme/nvme0/nvme0n1/nvme0n1p3
/sys/class/block/nvme0n1p4 -> ../../devices/pci0000:00/0000:00:03.1/0000:2b:00.0/nvme/nvme0/nvme0n1/nvme0n1p4
/sys/class/block/nvme1n1   -> ../../devices/pci0000:00/0000:00:03.2/0000:2c:00.0/nvme/nvme1/nvme1n1
/sys/class/block/nvme1n1p1 -> ../../devices/pci0000:00/0000:00:03.2/0000:2c:00.0/nvme/nvme1/nvme1n1/nvme1n1p1
/sys/class/block/nvme1n1p2 -> ../../devices/pci0000:00/0000:00:03.2/0000:2c:00.0/nvme/nvme1/nvme1n1/nvme1n1p2
/sys/class/block/nvme1n1p3 -> ../../devices/pci0000:00/0000:00:03.2/0000:2c:00.0/nvme/nvme1/nvme1n1/nvme1n1p3
/sys/class/block/nvme1n1p4 -> ../../devices/pci0000:00/0000:00:03.2/0000:2c:00.0/nvme/nvme1/nvme1n1/nvme1n1p4

but which ones make up the array?

Well, we can look at the first few kilobytes of each and check if they have a valid superblock, which looks something like this:

pub struct MdpSuperblock1 {
    pub array_info: ArrayInfo,
    pub device_info: DeviceInfo,
    // ...
}
#[repr(C, packed)]
pub struct ArrayInfo {
    pub magic: u32,
    pub major_version: u32,
    feature_map: u32,
    pad0: u32,
    set_uuid: [u8; 16],
    set_name: [u8; 32],
    ctime: u64,          // lo 40 bits are seconds, top 24 are microseconds (or 0)
    pub level: u32,      // -4 (multipath), -1 (linear), 0, 1, 4, 5
    pub layout: u32,     // used for raid5, raid6, raid10, and raid0
    pub size: u64,       // in 512b sectors
    pub chunksize: u32,  // in 512b sectors
    pub raid_disks: u32, // count
    // ...
}
#[repr(C)]
pub struct DeviceInfo {
    pub data_offset: u64,
    pub data_size: u64,
    pub super_offset: u64,
    pub dev_number: u32,
    pub device_uuid: [u8; 16],
    // ...
}
// Other structs omitted

If we interpret the byte range 0x1000-0x1064 as an ArrayInfo, we can validate (through the Superblock magic) whether the block device is in fact a member of a software RAID, and if yes, member of which array by reading the UUID.
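
A minimal sketch of that check in C, reading the magic and set_uuid straight off a block device (the helper name is mine):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Returns 1 if the device has a v1.x md superblock at offset 0x1000 (metadata 1.2)
static int is_md_member(const char *dev, uint8_t uuid_out[16]) {
    int fd = open(dev, O_RDONLY);
    if (fd < 0) return 0;

    uint8_t sb[64];
    ssize_t n = pread(fd, sb, sizeof(sb), 0x1000);
    close(fd);
    if (n != (ssize_t)sizeof(sb)) return 0;

    // the magic is the first u32 of the superblock, little-endian on disk
    uint32_t magic = sb[0] | sb[1] << 8 | (uint32_t)sb[2] << 16 | (uint32_t)sb[3] << 24;
    if (magic != 0xa92b4efc) return 0;

    memcpy(uuid_out, sb + 16, 16);  // set_uuid sits 16 bytes into the superblock
    return 1;
}

Running this over the /dev counterpart of each entry in /sys/class/block tells us which devices are RAID members; matching set_uuids group them into the same array.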

In this case, nvme0n1p3 and nvme1n1p3 are members of the array with UUID 3373544e:facdb6ce:a5f48e39:c6a4a29e.

Having identified the devices, and the properties of the array (2 disks in RAID1), how do we assemble it? I could say mdadm --assemble, but I am no chicken.

Due to reasons, Linux11 has implemented a cop-out syscall: ioctl (I/O Control), which allows encoding device-specific "driver calls" (in contrast to "system calls").

In our case, we'll use ioctl to interact with the md (Multiple Devices) driver for array operations, as defined in the docs.

For any ioctl, we need an open file descriptor that is linked to the module, so we are going to make a device node with "major" device number 0x9:

const char *devpath = "/dev/.tmp.md.150649:9:0";

mknodat(AT_FDCWD, devpath, S_IFBLK | 0600, makedev(0x9, 0));
int fd = openat(AT_FDCWD, devpath, O_RDWR | O_EXCL | O_DIRECT);
// fd is 4

To inform the md driver about the array composition we need to populate a struct mdu_array_info_t, defined here, with information gathered from the superblock (defined above as ArrayInfo and DeviceInfo), which we read from the block devices.

With this information, we can build the mdu_array_info_t, which only requires basic information about the array (RAID level, disk count, present disks, etc).

We can inform the md driver about the array information:

const mdu_array_info_t array_info = {
  .major_version = 1,
  .minor_version = 2,
  .level = 1,
  .nr_disks = 2,
  // ...
};
ioctl(4, SET_ARRAY_INFO, &array_info);

Then, for each disk, we have to populate a struct mdu_disk_info_t, defined here; this struct is even more basic than the previous one.

const mdu_disk_info_t disk_info = {
  .number = 0,
  .major = 7, // block device major number
  .minor = 1, // block device minor number
  .raid_disk = 0,
  .state = (1 << MD_DISK_SYNC) | (1 << MD_DISK_ACTIVE),
};
ioctl(4, ADD_NEW_DISK, &disk_info);

Then we start the array (confusingly, without populating the mdu_param_t struct):

ioctl(4, RUN_ARRAY, NULL);

We can verify that the array is up12:

$ cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md125 : active raid1 nvme1n1p3[0] nvme0n1p3[1]
      157154304 blocks super 1.2 [2/2] [UU]
      bitmap: 2/2 pages [8KB], 65536KB chunk

Like before, we can now call the mount syscall to use our newly-assembled filesystem:

unsigned long mountflags = 0;
const void *data = NULL;

mount("/dev/md125", "/rootfs", "ext4", mountflags, data);

Going to userspace

Now we have all our lovely files at /rootfs but that's not good enough — we want a real root filesystem and we want it at /.

There's a syscall, pivot_root, which does precisely what we need:

moves the root mount to the directory put_old and makes new_root the new root mount.

// put_old must be a directory at or underneath new_root
syscall(SYS_pivot_root, "/rootfs", "/rootfs/old-root");
chdir("/");

This marks the end of the tasks for the initrd! Hooray!

Now we only need to replace ourselves by passing control to the real init process13:

execl("/sbin/init", "/sbin/init", (char *)NULL);

Putting it all together

On boot, the processor will start to execute UEFI code, which will:

  • Look for bootable block devices
    • By checking for a GPT header
  • Pick a device with an EFI System Partition
  • Execute the bootloader at \EFI\BOOT\BOOTX64.EFI
    • Which must be a Portable Executable

The bootloader will:

  • Read its configuration within the EFI System Partition
  • Look for block devices containing partitions referenced in the configuration
    • Checking each partition listed in the GPT for matching UUIDs
    • Load modules for the filesystems in the referenced partitions
  • Load second-level configuration (which points to the kernel and initrd)
  • Load the kernel and initrd from the filesystem
  • Jump into the kernel

The kernel will:

  • Populate the root filesystem from the initrd
  • Execute /init from the (temporary) root filesystem

The (temporary) init process will:

  • Look for block devices referenced in the root= kernel commandline argument
  • Mount the device
    • Optionally, assemble a virtual device from RAID members
  • Pivot the root filesystem to the real filesystem
  • Unmount the temporary rootfs
  • Replace itself (exec) with the new init (/sbin/init)

And the disk layout looks something like this:

Closing thoughts

I went into this because I noticed that a freshly initialized array has surprisingly little inside of it:

$ mdadm -v --detail --scan /dev/md0
ARRAY /dev/md0 level=raid1 num-devices=2 metadata=1.2 name=framework:arrayname UUID=6865ede1:f1a76b48:45071d66:a55755bd
   devices=/dev/loop2,/dev/loop23
$ hexdump -C /dev/loop2
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  fc 4e 2b a9 01 00 00 00  00 00 00 00 00 00 00 00  |.N+.............|
00001010  68 65 ed e1 f1 a7 6b 48  45 07 1d 66 a5 57 55 bd  |he....kHE..f.WU.|
00001020  66 72 61 6d 65 77 6f 72  6b 3a 61 72 72 61 79 6e  |framework:arrayn|
00001030  61 6d 65 00 00 00 00 00  00 00 00 00 00 00 00 00  |ame.............|
00001040  8e f3 b8 66 00 00 00 00  01 00 00 00 00 00 00 00  |...f............|
00001050  00 18 03 00 00 00 00 00  00 00 00 00 02 00 00 00  |................|
*
00001090  08 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000010a0  00 00 00 00 00 00 00 00  a5 cc 7f 3a a6 bc 32 e7  |...........:..2.|
000010b0  97 b7 72 89 75 54 b6 2f  00 00 08 00 10 00 00 00  |..r.uT./........|
000010c0  8e f3 b8 66 00 00 00 00  11 00 00 00 00 00 00 00  |...f............|
000010d0  ff ff ff ff ff ff ff ff  6f 63 64 60 80 00 00 00  |........ocd`....|
000010e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
06400000

which made me think "Just a couple of bytes? Surely the rest of the boot stack is also small"

I was so naive.

I ended up spending about a week in this rabbit hole, of which about 3 days were reading kernel, GRUB and util-linux code; somehow I ended up re-implementing mdadm (here, at a PoC stage) to make sure I understood it. The rest of the time was "thinking"/"processing" and writing this post.

I wanted to go into filesystems (both FAT for ESP and ext4 for rootfs), but it seemed larger than everything else combined and I'm a chicken after all. The docs are "fairly clear".

I also wanted to prepare a QEMU image which would go through all of this, and I probably will, but not right now.

I am left with some questions and observations:

  • This process has three separate entities (UEFI, GRUB, Linux) listing block devices and inspecting their contents
    • It feels unnecessary, but I'm not sure it can be done flexibly in another way (simpler ways probably lead back to something like BIOS or petitboot?)
  • Why does the kernel need to be loaded at LOAD_PHYSICAL_ADDR? Shouldn't any address be good?
  • How exactly does GRUB set the rsi register to the real mode header? I didn't get it from the source code.

  1. this post includes enough rabbit holes, I didn't want to also go explore how to host these sources in a nice way. 

  2. You can also have floppy disks, cd-roms, tape, or anything really, but it's not really important here. 

  3. The specification is 2145 pages, and mentions things like "bytecode virtual machine", "HTTP Boot", "Bluetooth", "Wi-Fi", "IPsec". 

  4. The filename (or a fallback) may be configured in some UEFI implementations, but you can't depend on it, so everyone uses the default. 

  5. sometimes you don't need a bootloader, such as when using the EFI Boot Stub, running UEFI applications or "hobby" kernels (which are usually UEFI applications). 

  6. What happens if you plop down multiple superblocks on the same block device? 

  7. I assume this is due to historical reasons, when booting in real mode, the CPU can only access 1MiB memory, and placing the kernel as high as possible limits memory fragmentation 

  8. I'm not super clear on how the __init calls get scheduled for modules, I see that rootfs_initcall generates a function in a specific section, but what ends up calling it? 

  9. ramdisk_execute_command defaults to /init but can be overridden in rdinit_setup, which looks at the rdinit= kernel commandline argument 

  10. There are use cases for initrds which never pass on to a real userland, for example: flashing an OS from an initramfs 

  11. Windows has the ~same idea with DeviceIoControl 

  12. mount procfs like we did with sysfs 

  13. there's some cleanup to do, at least unmounting the /old-root to free some RAM