Cursing a process' vDSO for time hacking
We found... unexpected behavior in Python's Event object: if the system's clock moves backwards while an Event is being waited on, it will seemingly hang "forever".
Event.wait(5) # time moves backwards during these 5 seconds
So I started digging into CPython's source code to try and find the reason why.
class Event:
    def __init__(self):
        self._cond = Condition(Lock())
        ...

    def wait(self, timeout=None):
        ...
        signaled = self._cond.wait(timeout)
        ...
The Event class initializes a Condition and later executes wait on it.
class Condition:
    ...
    def wait(self, timeout=None):
        ...
        waiter = _allocate_lock()
        ...
        if timeout > 0:
            gotit = waiter.acquire(True, timeout)
Condition itself waits by executing acquire on waiter, a lock created by _allocate_lock, which is just an alias to _thread.allocate_lock.
I couldn't find any exact matches for allocate_lock, but this fuzzy match is promising:
PyThread_type_lock PyThread_allocate_lock(void);
There are some functions that act on a PyThread_type_lock, but PyThread_acquire_lock_timed seems like the best match for the scenario:
PyLockStatus PyThread_acquire_lock_timed(PyThread_type_lock lock, PY_TIMEOUT_T microseconds, int intr_flag) {
    ...
    if (microseconds > 0) {
#ifdef HAVE_SEM_CLOCKWAIT
        monotonic_abs_timeout(microseconds, &ts);
#else
        MICROSECONDS_TO_TIMESPEC(microseconds, ts);
        ...
#endif
    }
    while (1) {
        if (microseconds > 0) {
            ... // wait
        }
    }
}
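Here's the first suspicious bit: MICROSECONDS_TO_TIMESPEC most definitely looks like something that will be affected by the system clock rewinding, as it is defined as a macro around gettimeofday. The timeout is turned into an absolute wall-clock deadline, so if the clock then jumps backwards, that deadline moves further into the future by the size of the jump: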
#define MICROSECONDS_TO_TIMESPEC(microseconds, ts) \
do { \
struct timeval tv; \
gettimeofday(&tv, NULL); \
tv.tv_usec += microseconds % 1000000; \
tv.tv_sec += microseconds / 1000000; \
tv.tv_sec += tv.tv_usec / 1000000; \
tv.tv_usec %= 1000000; \
ts.tv_sec = tv.tv_sec; \
ts.tv_nsec = tv.tv_usec * 1000; \
} while(0)
Turns out, our CPython was not built with HAVE_SEM_CLOCKWAIT because we use glibc 2.17 (from ~2012, yay CentOS).
The solution to the actual problem we are facing is to either use glibc >= 2.30 (2019), the first release that provides sem_clockwait, or not use Event.wait() with a timeout, and instead use a different waiting mechanism.
We opted to go with using a different mechanism, which is pretty crude but sufficient for now:
for i in range(0, timeout):
    if event.is_set():
        break
    time.sleep(1)
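The loop above can be wrapped into a reusable helper. This is only a minimal sketch, not the code we shipped: the name wait_polling and the interval parameter are made up for illustration. Like the original loop, it never computes an absolute wall-clock deadline; it only counts how much sleeping it has requested, so it still depends on time.sleep itself being unaffected by clock jumps.

```python
import threading
import time


def wait_polling(event, timeout, interval=1.0):
    """Poll `event` for up to `timeout` seconds, sleeping `interval` between checks.

    Hypothetical helper: it tracks the requested sleep time instead of
    comparing against a wall-clock deadline.
    """
    remaining = float(timeout)
    while remaining > 0 and not event.is_set():
        step = min(interval, remaining)
        time.sleep(step)
        remaining -= step
    # Same contract as Event.wait(): True if the event ended up set.
    return event.is_set()
```

Returning event.is_set() keeps the same contract as Event.wait(timeout), so the helper can be swapped in without touching the callers.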
So, how to test that the bug is fixed? We could've introduced the same bug if time.sleep used gettimeofday internally!
A proper test is complicated: changing the system clock usually requires root permissions and is not trivial to execute in CI, so we have to resort to mocking.
There are some existing projects that deal with this:
- freezegun has a nice user API, but only mocks calls coming from Python code, leaving us without validating whether we are using "unsafe" functions from CPython itself.
- python-libfaketime also has a nice API, and as it is based on libfaketime, it manages to also mock the calls done by CPython itself.
However, libfaketime ends up being extremely janky, as it uses LD_PRELOAD to override symbols:
- The process must be re-executed with LD_PRELOAD set.
- This requires a shared object to be placed on disk, for the linker to pick up.
- As there is no method of coordination between the pre-loaded object and user's code, communication is done via environment variables or files in the user's home.
The goal for today is to replace gettimeofday (and friends) with something that:
- Does not require the LD_PRELOAD trick
- Is user-space controllable
The vDSO
gettimeofday is an external function, provided by a shared library coming from the kernel: the "vDSO" (virtual dynamic shared object).
The kernel automatically maps this shared library into the address space of all programs. Programs do not know that this is happening; they merely resolve external symbols (in this case gettimeofday) to a function they can call.
The point of the vDSO mechanism is to speed up some system calls: there is overhead in executing a syscall due to context-switching into the kernel, and this vastly dominates the time it takes to execute the actual requested function.
By mapping these functions directly to userspace, the context switch is bypassed and the necessary time to execute gettimeofday drops dramatically.
As this function must be dynamically loaded into the process at startup, all the information for linking should be available to the process itself, and indeed it is mapped in its own special region ([vdso]):
$ grep vdso /proc/self/maps
7ffd9f1d1000-7ffd9f1d3000 r-xp 00000000 00:00 0 [vdso]
In this example, the vDSO memory is mapped in the range 0x7ffd9f1d1000-0x7ffd9f1d3000, with permissions to read and execute (but not write; this is a standard security practice called W^X).
We can write a small program that reads its own vDSO memory mapping range and writes it out to disk, so we'll be able to look at it:
$ ./extract_vdso > output && file output
output: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, stripped
The vDSO memory area is in ELF format, which can be scanned for dynamic symbols:
$ objdump -T output
DYNAMIC SYMBOL TABLE:
0000000000000c10 w DF .text 0000000000000005 LINUX_2.6 clock_gettime
0000000000000bd0 g DF .text 0000000000000005 LINUX_2.6 __vdso_gettimeofday
0000000000000c20 w DF .text 0000000000000060 LINUX_2.6 clock_getres
0000000000000c20 g DF .text 0000000000000060 LINUX_2.6 __vdso_clock_getres
0000000000000bd0 w DF .text 0000000000000005 LINUX_2.6 gettimeofday
0000000000000be0 g DF .text 0000000000000029 LINUX_2.6 __vdso_time
0000000000000cb0 g DF .text 000000000000009c LINUX_2.6 __vdso_sgx_enter_enclave
0000000000000be0 w DF .text 0000000000000029 LINUX_2.6 time
0000000000000c10 g DF .text 0000000000000005 LINUX_2.6 __vdso_clock_gettime
0000000000000000 g DO *ABS* 0000000000000000 LINUX_2.6 LINUX_2.6
0000000000000c80 g DF .text 0000000000000025 LINUX_2.6 __vdso_getcpu
0000000000000c80 w DF .text 0000000000000025 LINUX_2.6 getcpu
Each of these symbols is just a function that can be called, and at the same time they are just some bytes in memory, which we could overwrite.
The steps could be summarized as:
- Find the vDSO memory range by scanning /proc/self/maps.
- Read the vDSO ELF blob.
- Overwrite vDSO area with a user-provided function.
Find the process' vDSO address
Thanks to the proc filesystem, a process can inspect its own metadata very easily; /proc/self/maps contains all the relevant information in an easy-to-parse, whitespace-delimited format (trimmed down):
Address | Perms | Offset | Path |
---|---|---|---|
5604dff9a000-5604dff9c000 | r--p | 000000 | /usr/bin/cat |
5604e121d000-5604e123e000 | rw-p | 000000 | [heap] |
7f38a9bd8000-7f38a9c02000 | r-xp | 002000 | ld-linux-x86-64.so.2 |
7fff378cb000-7fff378ec000 | rw-p | 000000 | [stack] |
7fff3794f000-7fff37953000 | r--p | 000000 | [vvar] |
7fff37953000-7fff37955000 | r-xp | 000000 | [vdso] |
Where each line represents a single, contiguous range.
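Parsing that format takes only a few lines. The sketch below is a Python illustration of the idea rather than the code from the project; Mapping, parse_maps and find_vdso are names invented for this example, and it only keeps the address range, permissions and path columns.

```python
from typing import List, NamedTuple, Optional


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    path: str


def parse_maps(text: str) -> List[Mapping]:
    """Parse the contents of a /proc/<pid>/maps file into Mapping tuples."""
    mappings = []
    for line in text.splitlines():
        # Columns: address range, perms, offset, device, inode, optional path.
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start, end = (int(part, 16) for part in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        mappings.append(Mapping(start, end, fields[1], path))
    return mappings


def find_vdso(text: str) -> Optional[Mapping]:
    """Return the mapping labelled [vdso], or None if there is none."""
    return next((m for m in parse_maps(text) if m.path == "[vdso]"), None)
```

On Linux, calling find_vdso(open("/proc/self/maps").read()) should return the range of the calling process' own vDSO.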
Parsing the vDSO
In a similar vein, we can read our own memory at /proc/self/mem, skip ahead to the vDSO range (based on the /proc/self/maps metadata) and read the vDSO ELF blob. Thanks to the goblin crate, interpreting these bytes as a walkable structure is trivial.
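The "seek and read" part can be sketched in Python as follows. read_memory and looks_like_elf are hypothetical helper names, and the start and end arguments are assumed to come from the [vdso] line of /proc/self/maps.

```python
ELF_MAGIC = b"\x7fELF"


def read_memory(start, end, mem_path="/proc/self/mem"):
    """Return the bytes in the range [start, end) of a memory file."""
    chunks = []
    remaining = end - start
    with open(mem_path, "rb", buffering=0) as mem:
        mem.seek(start)
        while remaining > 0:
            chunk = mem.read(remaining)
            if not chunk:
                break  # short read: nothing more to return
            chunks.append(chunk)
            remaining -= len(chunk)
    return b"".join(chunks)


def looks_like_elf(blob):
    """Cheap sanity check: every ELF image starts with these 4 bytes."""
    return blob[:4] == ELF_MAGIC
```

Checking the magic bytes before handing the blob to a parser is a quick way to confirm that the address range was read correctly.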
The ELF format (1, 2, 3) is reasonable and well documented.
A short summary of the ELF format that only covers the parts relevant for this task:
- The ELF header points to a Program Header Table and a Section Header Table.
- The Section Header Table holds an array of Section Header.
- Each Section Header points to a Section.
- A symbol table Section (here, the dynamic symbol table) holds an array of Symbol.
- Each Symbol has:
  - Name: an offset into the STRTAB
  - Size: the amount of bytes of code
  - Address: the offset, relative to the start of the ELF, where the first byte of program is.
This is probably better explained via a diagram:
After extraction, this is what some symbols look like:
DynSym { name: "clock_gettime", address: 3088, size: 5 },
DynSym { name: "__vdso_gettimeofday", address: 3024, size: 5 },
where address is just the offset from the start of the ELF, not including the vDSO address in memory.
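For readers who want to see the walk without goblin, here is a hand-rolled Python sketch of the same extraction. It is not what the project uses, and it assumes a little-endian ELF64 blob whose section headers are present in the bytes that were read; DynSym mirrors the structure printed above, and dynamic_symbols is a name made up for this example.

```python
import struct
from typing import List, NamedTuple

EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # Elf64_Ehdr, 64 bytes
SHDR = struct.Struct("<IIQQQQIIQQ")        # Elf64_Shdr, 64 bytes
SYM = struct.Struct("<IBBHQQ")             # Elf64_Sym, 24 bytes
SHT_DYNSYM = 11


class DynSym(NamedTuple):
    name: str
    address: int
    size: int


def dynamic_symbols(blob: bytes) -> List[DynSym]:
    """List the named dynamic symbols of a little-endian ELF64 image."""
    header = EHDR.unpack_from(blob, 0)
    ident, shoff, shentsize, shnum = header[0], header[6], header[11], header[12]
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("not a little-endian ELF64 image")

    sections = [SHDR.unpack_from(blob, shoff + i * shentsize) for i in range(shnum)]
    symbols = []
    for _name, sh_type, _flags, _addr, offset, size, link, _info, _align, entsize in sections:
        if sh_type != SHT_DYNSYM:
            continue
        strtab_offset = sections[link][4]  # sh_link names the string table section
        entsize = entsize or SYM.size
        for i in range(size // entsize):
            st_name, _st_info, _st_other, _shndx, value, st_size = SYM.unpack_from(
                blob, offset + i * entsize
            )
            begin = strtab_offset + st_name
            name = blob[begin:blob.index(b"\0", begin)].decode()
            if name:  # entry 0 is the reserved, unnamed symbol
                symbols.append(DynSym(name, value, st_size))
    return symbols
```

As in the output above, address is the raw value stored in the symbol table, so the vDSO base address still has to be added before using it.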
Calling functions by address
Having the address of a function and its signature is all that's really needed to execute it.
As an example, calling a function defined in our own program through its address:
extern "C" fn my_gettimeofday(tp: *mut libc::timeval, tz: *mut c_void) -> libc::c_int {
    // ...
    0 // gettimeofday reports success by returning 0
}
// Virtual Address of the function
let fptr = my_gettimeofday as *const ();
// Assign a type to the function, so the compiler will let us call it
let code: extern "C" fn(tp: *mut libc::timeval, tz: *mut c_void) -> libc::c_int =
    unsafe { std::mem::transmute(fptr) };
// Buffer that the function writes its result into
let mut tv = libc::timeval { tv_sec: 0, tv_usec: 0 };
// Call the function
(code)(&mut tv, std::ptr::null_mut());
// Observe the result
println!(
    "called my_gettimeofday manually {} {}",
    tv.tv_sec, tv.tv_usec
);
This worked as expected! We can now attempt to directly call the addresses obtained from the vDSO; if it works, it will validate the mechanism used to extract them.
We can replace fptr in the previous example:
let fptr = ((vdso_range.start as u64) + dynsym.address) as *const ();
Which also worked!
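The same "address plus signature" idea can be shown from Python with ctypes. This is only an illustrative sketch: it takes the address of libc's gettimeofday rather than one extracted from the vDSO, it assumes a Linux x86-64 system where both timeval fields are C longs, and address_of and call_gettimeofday_at are names invented here.

```python
import ctypes


class Timeval(ctypes.Structure):
    # Assumption: Linux x86-64 layout, both fields are C longs.
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_usec", ctypes.c_long)]


# The signature we claim the code at the address has.
GETTIMEOFDAY_SIG = ctypes.CFUNCTYPE(
    ctypes.c_int, ctypes.POINTER(Timeval), ctypes.c_void_p
)


def address_of(symbol):
    """Resolve a symbol among the libraries already loaded and return its address."""
    libc = ctypes.CDLL(None)
    return ctypes.cast(getattr(libc, symbol), ctypes.c_void_p).value


def call_gettimeofday_at(address):
    """Treat the raw integer `address` as a gettimeofday-shaped function and call it."""
    code = GETTIMEOFDAY_SIG(address)
    tv = Timeval()
    rc = code(ctypes.byref(tv), None)
    return rc, tv.tv_sec, tv.tv_usec
```

Nothing checks that the address really holds a function with that signature; as with the transmute above, getting it wrong crashes the process.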
Overwrite the vDSO
We now know:
- How to extract dynamic symbols from the vDSO
- The name and address of the interesting symbols
- How to call functions by address (if we know the signature)
With this knowledge, we should be able to overwrite some vDSO symbols with our own code, something like:
let addr = (vdso_range.start as u64) + dynsym.address;
unsafe {
    std::ptr::write_bytes((addr + 0) as *mut u8, 0xC3, 1); // RET
}
This code, sadly, immediately exits the program, with return code 139 (Segmentation Fault).
Writing to this memory address wasn't allowed by the operating system.
From the memory map, we knew the process has no write permissions to the vDSO pages:
$ grep vdso /proc/self/maps
7ffd9f1d1000-7ffd9f1d3000 r-xp 00000000 00:00 0 [vdso]
but the process should be the owner of these pages, and able to change the permissions:
unsafe {
    libc::mprotect(
        r.start as *mut libc::c_void,
        r.end - r.start,
        libc::PROT_EXEC | libc::PROT_WRITE | libc::PROT_READ,
    );
}
Verifying that the write bit is set:
$ grep vdso /proc/self/maps
7ffd9f1d1000-7ffd9f1d3000 rwxp 00000000 00:00 0 [vdso]
With the pages being writable, we can attempt to write to the vDSO again:
let addr = (vdso_range.start as u64) + dynsym.address;
unsafe {
    std::ptr::write_bytes((addr + 0) as *mut u8, 0xC3, 1); // RET
    std::ptr::write_bytes((addr + 1) as *mut u8, 0x90, 15); // NOP
}
Which succeeds! We can dump the state of the vDSO before and after to verify the changes:
Before:
<__vdso_clock_gettime@@LINUX_2.6>:
c10: e9 9b fb ff ff jmp 7b0 <LINUX_2.6@@LINUX_2.6+0x7b0>
c15: 66 66 2e 0f 1f 84 00 data16 cs nop WORD PTR [rax+rax*1+0x0]
c1c: 00 00 00 00
After:
<__vdso_clock_gettime@@LINUX_2.6>:
c10: c3 ret
c11: 90 nop
c12: 90 nop
c13: 90 nop
c14: 90 nop
...
Now the only thing left is actually placing our function in this area.
However, these symbols must be multiples of the ELF section alignment in size.
In this case that is 16 bytes, and some of the symbols are just 16 bytes, drastically limiting the functions that can be placed in this space.
I thought about modifying the ELF itself and re-writing the vDSO to have as much space as necessary, but that would also shift the following symbols and, generally, any code that has already run might have kept a reference to the original function address around, which wouldn't work anymore.
What we can do with 16 bytes, though, is build a trampoline and use it to land in an "unrestricted-size" function.
This, conceptually, is very easy: overwrite the code to execute with a single jmp $DST instruction. In practice, I had a bunch of problems:
- In x86_64, you can't jump to an absolute address that's represented as an immediate; it must be one of {indirect, relative, in a register}
- There's a million opcodes for MOV, it really wasn't clear which one I should use
So I cheated, and let nasm and objdump deal with it for me; I wrote:
global _start
section .text
_start:
mov rax, 0x12ff34ff56ff78ff
jmp rax
Which nasm -f elf64 compiled for me, and objdump -M intel dumped:
<_start>:
0: 48 b8 ff 78 ff 56 ff movabs rax,0x12ff34ff56ff78ff
7: 34 ff 12
a: ff e0 jmp rax
The function to overwrite the vDSO now looks like this:
// MOV RAX, <address>
std::ptr::write_bytes((addr + 0) as *mut u8, 0x48, 1);
std::ptr::write_bytes((addr + 1) as *mut u8, 0xB8, 1);
// addr + 2 is not 8-byte aligned, so the immediate needs an unaligned write
std::ptr::write_unaligned((addr + 2) as *mut u64, dst_address);
// JMP RAX
std::ptr::write_bytes((addr + 10) as *mut u8, 0xFF, 1);
std::ptr::write_bytes((addr + 11) as *mut u8, 0xE0, 1);
// NOP the remaining space, unnecessary, but useful when debugging
std::ptr::write_bytes((addr + 12) as *mut u8, 0x90, padding_size);
Re-dumped a modified vDSO and got...
<__vdso_clock_gettime@@LINUX_2.6>:
c10: 48 b8 30 4f 87 bb 65 movabs rax,0x5565bb874f30
c17: 55 00 00
c1a: ff e0 jmp rax
c1c: 90 nop
c1d: 90 nop
c1e: 90 nop
c1f: 90 nop
Success!!
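The byte layout of the trampoline is easy to check in isolation. The Python sketch below builds the same 16 bytes without touching any memory; encode_trampoline is a hypothetical helper name, and the opcodes are the ones objdump printed above.

```python
import struct


def encode_trampoline(dst_address, slot_size=16):
    """Encode `movabs rax, dst_address; jmp rax`, NOP-padded to `slot_size` bytes."""
    code = (
        b"\x48\xb8"                        # REX.W + B8: movabs rax, imm64
        + struct.pack("<Q", dst_address)   # 64-bit immediate, little-endian
        + b"\xff\xe0"                      # jmp rax
    )
    if len(code) > slot_size:
        raise ValueError("trampoline does not fit in the slot")
    return code + b"\x90" * (slot_size - len(code))  # pad with NOPs
```

Feeding it the address from the dump, encode_trampoline(0x5565bb874f30).hex() gives 48b8304f87bb65550000ffe090909090, which is the byte sequence at c10-c1f above.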
Now that we can jump into any user-controlled address, we only need to write code that matches the original function signatures:
extern "C" fn my_gettimeofday(tp: *mut libc::timeval, _tz: *mut c_void) -> libc::c_int {
    if !tp.is_null() {
        unsafe {
            (*tp).tv_sec = 666;
            (*tp).tv_usec = 999;
        }
    }
    0 // gettimeofday returns 0 on success
}
This works! Any caller of gettimeofday within this Rust program (or anything linking this crate) will see time coming from this function.
User provided functions
While a PoC that returns a constant was a lot of work, it's also pretty useless.
I'm not entirely sure what the proper way to do this is -- the extern functions that are the trampoline's targets can't be closures, so for now I'm using static variables (there is only one gettimeofday anyway).
type ClockGetTimeOfDayCb = fn() -> TimeVal;
lazy_static! {
    static ref CLOCK_GTOD_CB: RwLock<Option<ClockGetTimeOfDayCb>> = RwLock::new(None);
}
If a user passes a function matching the signature specified in ClockGetTimeOfDayCb, we can proxy the gettimeofday call back to them:
extern "C" fn my_gettimeofday(tp: *mut libc::timeval, _tz: *mut c_void) -> libc::c_int {
    // TODO: Support TZ
    if !tp.is_null() {
        // Note: panics if no callback has been registered yet
        let res = CLOCK_GTOD_CB.read().unwrap().unwrap()();
        unsafe {
            (*tp).tv_sec = res.seconds;
            (*tp).tv_usec = res.micros;
        }
    }
    0 // gettimeofday returns 0 on success
}
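The "global slot plus fixed C entry point" pattern is not specific to Rust. As a rough illustration only, here is the same shape in Python with ctypes; this is not how py-tpom is implemented, and Timeval, set_callback and the (seconds, micros) tuple convention are assumptions made for this sketch.

```python
import ctypes
import threading


class Timeval(ctypes.Structure):
    # Assumption: Linux x86-64 layout, both fields are C longs.
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_usec", ctypes.c_long)]


GTOD_SIG = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.POINTER(Timeval), ctypes.c_void_p)

_callback_lock = threading.Lock()
_callback = None  # the single global slot, like CLOCK_GTOD_CB


def set_callback(callback):
    """Register a function returning (seconds, micros), or None to clear it."""
    global _callback
    with _callback_lock:
        _callback = callback


@GTOD_SIG
def my_gettimeofday(tp, _tz):
    """C-callable entry point that proxies to the registered callback."""
    with _callback_lock:
        callback = _callback
    # A NULL pointer is falsy; with no callback registered, tp is left untouched.
    if tp and callback is not None:
        seconds, micros = callback()
        tp.contents.tv_sec = seconds
        tp.contents.tv_usec = micros
    return 0
```

Unlike the unwrap-based version, this sketch leaves the output untouched when no callback is registered instead of failing; which behavior is preferable is a design choice.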
This is what I set out to achieve, so I'm calling it a success!
Python bindings
As a bit of an extra, I made a separate crate with Python bindings, via PyO3, which now lives at py-tpom, and can show the usefulness of such a thing:
def test_time_changes():
    target = datetime(2012, 1, 14, 1, 2, 3)
    assert datetime.now() != target
    with Freezer(target):
        assert datetime.now() == target
    assert datetime.now() != target
You can find the source code here.