Learning about PCI-e: Emulating a custom device
This post is part of the series Learning about PCI-e
- Learning about PCI-e: Emulating a custom device (this article)
- Learning about PCI-e: Driver & DMA
- Learning about PCI-e: Implementing an option ROM
Ever since reading about the FuryGpu, I've been curious about how PCI-e works and what it would take to build a simple display adapter. In this post I will try to document what I learn during the process.
This project is probably way larger than I can even imagine, so I'm going to take an iterative approach: the smallest steps that further my understanding and produce something tangible.
My current, limited, understanding of PCI-e:
When you plug an adapter (card) into a PCI-e slot, magic happens:
- The card becomes enumerable by the kernel on the PCI-e bus
- The kernel loads a driver based on the vendor/device code exposed by the card.
- The driver knows how to communicate with the card (writing/reading specific memory offsets)
A good first step to further my understanding: "build" a "device" that can be enumerated by Linux, and read/write to it.
As an example, my graphics card currently enumerates like this:
c1:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Phoenix1 (rev cb) (prog-if 00 [VGA controller])
Subsystem: Device f111:0006
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 81
IOMMU group: 4
Region 0: Memory at 8000000000 (64-bit, prefetchable) [size=256M]
Region 2: Memory at 90000000 (64-bit, prefetchable) [size=2M]
Region 4: I/O ports at 1000 [size=256]
Region 5: Memory at 90500000 (32-bit, non-prefetchable) [size=512K]
Capabilities: <access denied>
Kernel driver in use: amdgpu
Kernel modules: amdgpu
Building a real device at my current level of understanding would be slightly insane, so I'm going to emulate one in QEMU.
QEMU
Adding a new device to QEMU's source tree requires being able to build QEMU itself, which is straightforward:
$ ./configure --target-list="x86_64-softmmu" --enable-debug
$ make -j8
The resulting binary can then be executed with ./build/qemu-system-x86_64.
To validate that the binary works, I copied my laptop's kernel (cp -t . /boot/vmlinuz) and booted it:
$ ./build/qemu-system-x86_64 -kernel vmlinuz \
-display none -m 256 \
-chardev stdio,id=char0 -serial chardev:char0 \
-append 'console=ttyS0 quiet panic=-1'
[ 3.232653] Initramfs unpacking failed: write error
[ 3.882868] Failed to execute /init (error -2)
[ 3.885107] Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
[ 3.885863] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.2.0-39-generic #40-Ubuntu
[ 3.886198] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
This error is expected: without a disk attached to the VM, there's no /init to execute.
The easiest way to execute some code is to package it up in an initramfs, in cpio format.
To build a basic initramfs:
- Download a static busybox and place it into a dir called initramfs
- Write this init script as initramfs/init.sh and make it executable
#!/busybox sh
/busybox mkdir /sys
/busybox mkdir /proc
/busybox mkdir /dev
/busybox mount -t proc null /proc
/busybox mount -t sysfs null /sys
/busybox mknod /dev/mem c 1 1
/busybox lspci
exec /busybox sh
- Create a CPIO archive
cd initramfs && find . -print0 | cpio --null -H newc -o | gzip -9 > ../initramfs.gz
We can pass this initramfs.gz file to the kernel to be executed (also note rdinit in the kernel's arguments):
$ ./build/qemu-system-x86_64 -enable-kvm -kernel vmlinuz -initrd initramfs.gz \
-chardev stdio,id=char0 -serial chardev:char0 \
-append 'quiet console=ttyS0,115200 rdinit=/init.sh' \
-display none -m 256 -nodefaults
00:01.0 Class 0601: 8086:7000
00:00.0 Class 0600: 8086:1237
00:01.3 Class 0680: 8086:7113
00:01.1 Class 0101: 8086:7010
This shows that we can execute arbitrary commands, and that there are 4 devices on the PCI bus of an otherwise "empty" virtual machine.
From this point on, I'll refer to these QEMU command-line options as $OPTS.
Creating a very minimal device
To create a PCI-e device in QEMU, we only need to provide a few things:
- A state definition (as a struct)
- A function to register our device
- Init/realize functions
We can create a gpu.c file in hw/misc (slightly abbreviated, find the file here):
#define TYPE_PCI_GPU_DEVICE "gpu"
#define GPU_DEVICE_ID 0x1337
static void pci_gpu_register_types(void) {
    static InterfaceInfo interfaces[] = {
        { INTERFACE_CONVENTIONAL_PCI_DEVICE },
        { },
    };
    static const TypeInfo gpu_info = {
        .name = TYPE_PCI_GPU_DEVICE,
        .parent = TYPE_PCI_DEVICE,
        .instance_size = sizeof(GpuState),
        .instance_init = gpu_instance_init,
        .class_init = gpu_class_init,
        .interfaces = interfaces,
    };
    type_register_static(&gpu_info);
}
static void gpu_class_init(ObjectClass *class, void *data) {
    PCIDeviceClass *k = PCI_DEVICE_CLASS(class);
    k->realize = pci_gpu_realize;
    k->exit = pci_gpu_uninit;
    k->vendor_id = PCI_VENDOR_ID_QEMU;
    k->device_id = GPU_DEVICE_ID;
    k->class_id = PCI_CLASS_OTHERS;
}
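For reference, the pieces the abbreviated listing leaves out look roughly like this. This is a sketch modeled on QEMU's hw/misc/edu.c; the exact bodies are assumptions, the linked file is authoritative:

/* Sketch of the parts elided above (modeled on hw/misc/edu.c). */
struct GpuState {
    PCIDevice pdev;              /* parent object, must come first */
};

/* Provides the GPU() cast macro used later. */
DECLARE_INSTANCE_CHECKER(GpuState, GPU, TYPE_PCI_GPU_DEVICE)

static void gpu_instance_init(Object *obj) {
    /* nothing to initialize per-instance yet */
}

static void pci_gpu_realize(PCIDevice *pdev, Error **errp) {
    /* BARs get wired up here, see the next section */
}

static void pci_gpu_uninit(PCIDevice *pdev) {
    /* nothing to tear down yet */
}

/* Register our type with QEMU at startup. */
type_init(pci_gpu_register_types)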
We also need to update the build system to build this device:
$ git diff hw
diff --git a/hw/misc/Kconfig b/hw/misc/Kconfig
index 1e08785..a2e533e 100644
--- a/hw/misc/Kconfig
+++ b/hw/misc/Kconfig
@@ -25,6 +25,11 @@ config PCI_TESTDEV
     default y if TEST_DEVICES
     depends on PCI

+config GPU
+    bool
+    default y if TEST_DEVICES
+    depends on PCI && MSI_NONBROKEN
+
 config EDU
     bool
     default y if TEST_DEVICES
diff --git a/hw/misc/meson.build b/hw/misc/meson.build
index 86596a3..ca704f4 100644
--- a/hw/misc/meson.build
+++ b/hw/misc/meson.build
@@ -1,4 +1,6 @@
system_ss.add(when: 'CONFIG_APPLESMC', if_true: files('applesmc.c'))
+system_ss.add(when: 'CONFIG_GPU', if_true: files('gpu.c'))
system_ss.add(when: 'CONFIG_EDU', if_true: files('edu.c'))
system_ss.add(when: 'CONFIG_FW_CFG_DMA', if_true: files('vmcoreinfo.c'))
system_ss.add(when: 'CONFIG_ISA_DEBUG', if_true: files('debugexit.c'))
We can then rebuild QEMU with make (this time it's super quick) and run:
$ ./build/qemu-system-x86_64 $OPTS -device gpu
00:01.0 Class 0601: 8086:7000
00:04.0 Class 00ff: 1234:1337 # <<<< our device
00:00.0 Class 0600: 8086:1237
00:01.3 Class 0680: 8086:7113
00:03.0 Class 0200: 8086:100e
00:01.1 Class 0101: 8086:7010
At this point, I also built lspci statically (add -static to OPTS and set HWDB to no) and put it in the initramfs for a more complete lspci output, which gave me:
00:04.0 Class [00ff]: Device [1234:1337]
Subsystem: Device [1af4:1100]
Physical Slot: 4
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Doing something with the device
At this point, the 'GPU' does absolutely nothing other than show up on the bus.
To advertise a memory region to the CPU, the region needs to be visible in the card's configuration space. We do this by configuring a BAR (Base Address Register), which holds the base address and size of the memory region.
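As an aside, it's worth spelling out how a single 32-bit register encodes both pieces of information: the device hardwires the BAR's low address bits to zero, so firmware can discover the size by writing all-ones and reading back. A sketch of the probe for our 1 MiB region (pci_config_read32/pci_config_write32 are hypothetical stand-ins for whatever config-space accessor the platform provides; on the QEMU side, pci_register_bar handles all of this for us):

uint32_t old = pci_config_read32(dev, PCI_BASE_ADDRESS_0);
pci_config_write32(dev, PCI_BASE_ADDRESS_0, 0xFFFFFFFF);
uint32_t probe = pci_config_read32(dev, PCI_BASE_ADDRESS_0); /* reads back 0xFFF00000 */
uint32_t size = ~(probe & ~0xFu) + 1;                        /* 0x100000 = 1 MiB */
pci_config_write32(dev, PCI_BASE_ADDRESS_0, old);            /* restore the address */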
First, define the memory region and create some memory to operate on:
struct GpuState {
    PCIDevice pdev;
+    MemoryRegion mem;
+    unsigned char data[0x100000];
};
Then, when the device is instantiated (realized), we need to register the memory region and tell QEMU what to do with read/write operations (find the file here):
static uint64_t gpu_mem_read(void *opaque, hwaddr addr, unsigned size) {
    GpuState *gpu = opaque;
    uint64_t got = 0;
    /* Copy `size` bytes out of the backing store (little-endian,
       matching the x86 guest). */
    memcpy(&got, &gpu->data[addr], size);
    printf("Tried to read 0x%x bytes at 0x%lx = 0x%lx\n", size, addr, got);
    return got;
}
static void gpu_mem_write(void *opaque, hwaddr addr, uint64_t val, unsigned size) {
    GpuState *gpu = opaque;
    uint64_t bitcount = ((uint64_t)size) << 3;
    /* Guard the shift: shifting a 64-bit value by 64 is undefined. */
    uint64_t mask = (bitcount >= 64) ? ~0ULL : (1ULL << bitcount) - 1;
    uint64_t sizedval = val & mask;
    /* Store all `size` bytes of the value, not only the lowest one. */
    memcpy(&gpu->data[addr], &sizedval, size);
    printf("Tried to write 0x%lx [0x%lx] (0x%x bytes) at 0x%lx\n", val, sizedval, size, addr);
}
static const MemoryRegionOps gpu_mem_ops = {
    .read = gpu_mem_read,
    .write = gpu_mem_write,
};
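/* MemoryRegionOps can also constrain how the region is accessed.
 * These fields are optional and not set here, but for example
 * (an aside, not part of the post's file):
 *
 *   .endianness = DEVICE_NATIVE_ENDIAN,
 *   .valid = { .min_access_size = 1, .max_access_size = 8 },
 *
 * Without .valid, QEMU's memory core defaults to 1- to 4-byte accesses. */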
static void pci_gpu_realize(PCIDevice *pdev, Error **errp)
{
    GpuState *gpu = GPU(pdev);
    memory_region_init_io(&gpu->mem, OBJECT(gpu), &gpu_mem_ops, gpu, "gpu-mem", 1 * MiB);
    pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &gpu->mem);
}
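For comparison, the AMD card in the first lspci dump exposes 64-bit, prefetchable regions. Registering one of those only takes extra type flags in pci_register_bar, at the cost of the BAR consuming two configuration-space slots. A sketch, assuming a hypothetical gpu->vram MemoryRegion existed:

pci_register_bar(pdev, 0,
                 PCI_BASE_ADDRESS_SPACE_MEMORY |
                 PCI_BASE_ADDRESS_MEM_TYPE_64 |    /* 64-bit BAR, occupies slots 0 and 1 */
                 PCI_BASE_ADDRESS_MEM_PREFETCH,    /* prefetchable */
                 &gpu->vram);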
The new memory region shows up in lspci!
00:02.0 Class [00ff]: Device [1234:1337]
Subsystem: Device [1af4:1100]
Physical Slot: 2
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Region 0: Memory at feb00000 (32-bit, non-prefetchable) [size=1M] <<<<< here
Now we have one memory region, 32-bit addressable and 1 MiB large, and we can interact with it by reading and writing to it:
$ /busybox mknod /dev/mem c 1 1 # create /dev/mem to be able to read/write arbitrary memory
$ /busybox devmem 0xfeb00000 16 # read 2 bytes (16 bits)
0x0000
$ /busybox devmem 0xfeb00000 16 4 # write a '4', as a 16-bit value
$ /busybox devmem 0xfeb00000 16 # read 2 bytes again
0x0004
And QEMU logged:
Tried to read 0x2 bytes at 0x0 = 0x0
Tried to write 0x4 [0x4] (0x2 bytes) at 0x0
Tried to read 0x2 bytes at 0x0 = 0x4
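For what it's worth, busybox's devmem is essentially a thin wrapper around mmap()-ing /dev/mem, so the same 16-bit poke can be done from a small C program. A sketch, assuming the BAR is still at 0xfeb00000 (QEMU assigns the address at boot, so check lspci first):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    off_t bar0 = 0xfeb00000;             /* BAR 0 address from lspci */
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    volatile uint16_t *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, bar0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    *p = 4;                              /* the 16-bit write from above */
    printf("read back: 0x%04x\n", *p);   /* the 16-bit read */

    munmap((void *)p, 4096);
    close(fd);
    return 0;
}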
If we wanted multiple memory regions, we'd need to duplicate gpu_mem_read, gpu_mem_write, and gpu_mem_ops, then call memory_region_init_io and pci_register_bar with those parameters, as sketched below.
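A sketch of what that could look like; gpu_reg_read, gpu_reg_write, and the gpu->regs field are hypothetical names for illustration, not code from the post's file:

/* A second pair of handlers, analogous to gpu_mem_read/gpu_mem_write. */
static const MemoryRegionOps gpu_reg_ops = {
    .read = gpu_reg_read,
    .write = gpu_reg_write,
};

static void pci_gpu_realize(PCIDevice *pdev, Error **errp)
{
    GpuState *gpu = GPU(pdev);

    /* BAR 0: the 1 MiB data region from before. */
    memory_region_init_io(&gpu->mem, OBJECT(gpu), &gpu_mem_ops, gpu, "gpu-mem", 1 * MiB);
    pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &gpu->mem);

    /* BAR 1: a hypothetical 4 KiB register region (add MemoryRegion regs to GpuState). */
    memory_region_init_io(&gpu->regs, OBJECT(gpu), &gpu_reg_ops, gpu, "gpu-regs", 4 * KiB);
    pci_register_bar(pdev, 1, PCI_BASE_ADDRESS_SPACE_MEMORY, &gpu->regs);
}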
That's it for now. Next time, we'll tackle DMA and a simple kernel driver.