Most UI Applications are Broken Real-time Applications

I’ve been programming for a long time. When I say long time, I mean decades, with an S. Hopefully that’s long enough. In that time my experience has primarily been programming for contemporary platforms, e.g. Linux, Windows, macOS on desktop-class or server-class CPU architectures. Recently, I embarked on building a MIDI engine for a system with significantly less processing power.

Soon after I started, I ran into the issue of guaranteeing that it was impossible for the queue of input events to overflow. This essentially boils down to making sure that each event handler never runs longer than some maximum amount of time. Then it hit me: I’ve heard this before. A maximum amount of time: I’m building a real-time system.
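
To make that concrete, here is a back-of-the-envelope sketch of the budget, assuming the input is classic 31250-baud DIN MIDI; the numbers are mine, not from the actual project:

    /* Rough per-event time budget, assuming classic DIN MIDI. */
    #include <stdio.h>

    int main(void) {
        /* DIN MIDI runs at 31250 baud with 10 bits on the wire per byte
           (start + 8 data + stop), so about 3125 bytes per second. */
        const double bytes_per_sec = 31250.0 / 10.0;
        /* A common channel message (e.g. note-on) is 3 bytes long. */
        const double msg_interval_s = 3.0 / bytes_per_sec;

        /* If messages can arrive back-to-back at this rate, the worst-case
           handler time must stay below the inter-arrival time, or the input
           queue eventually overflows no matter how deep it is. */
        printf("per-message budget: %.3f ms\n", msg_interval_s * 1000.0);
        return 0;
    }

That prints a budget of roughly 1ms per message, and that budget has to cover the worst case, not the average.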

Once I realized that I also had to take real-time constraints into account while building, it drove a lot of the engineering decisions I made in a specific direction. In particular, the worst-case time of every sequence of code must be accounted for; average-case time is irrelevant for correctness. Under this discipline, algorithms with better worst-case time but worse average-case time are preferred, branching usually must be to the faster path, and adding fast paths to slow algorithms is not helpful. It was interesting work and it changed how I thought about building systems in a profound way.
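
As a toy illustration of that mindset (my example, not code from the engine): an amortized-O(1) growable array occasionally pays a large realloc-and-copy spike, while a fixed-capacity ring buffer does the same small amount of work on every push. The ring buffer loses on average but wins on the bounded worst case, which is what matters here.

    /* Fixed-capacity ring buffer: queue_push() is O(1) in the worst case,
       with no allocation and no copying, ever. The capacity is chosen up
       front for the worst-case burst. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define QUEUE_CAP 256

    struct event { unsigned char bytes[3]; };

    struct event_queue {
        struct event buf[QUEUE_CAP];
        size_t head, tail;   /* head == tail means empty */
    };

    static bool queue_push(struct event_queue *q, struct event e) {
        size_t next = (q->tail + 1) % QUEUE_CAP;
        if (next == q->head)
            return false;    /* full: QUEUE_CAP was sized too small */
        q->buf[q->tail] = e;
        q->tail = next;
        return true;
    }

    int main(void) {
        static struct event_queue q;                  /* zero-initialized, empty */
        struct event note_on = { { 0x90, 60, 100 } };
        printf("pushed: %d\n", queue_push(&q, note_on));
        return 0;
    }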

Armed with this new awareness, I began to notice the lack of real-time discipline in other applications, including my own. This was a jarring experience: how could I have never noticed this before? The biggest shock during this period came when I realized that most mainstream desktop UI applications are fundamentally broken.

When I click a mouse button, when I press a key on the keyboard, I expect a response in a bounded amount of time. Bounded amount of time? We’ve heard this before! UI applications are also real-time systems. How much time is this bounded amount of time? 100ms, or maybe 250ms. Well, take your pick; the key point is that the response time should not be indefinite. I should never see a beach ball of death. Never.

Library Functions are not Real-time

One of the fundamental problems is that many UI applications on Windows, Linux, and macOS call functions that are not specified to run in a bounded amount of time. Here’s a basic example: many applications don’t think twice about doing file IO in a UI event handler. That results in a tolerable amount of latency most of the time on standard disk drives, but what if the file is stored on a network drive? It could take much longer than a second to service the file request, leaving the application temporarily hung and the user with no idea what is happening. The network drive is operating correctly; the UI application isn’t.

So all we have to do is avoid file system IO functions on the main thread? Not a big deal. That doesn’t mean UI applications are fundamentally broken. That’s just one broken application, and it’s still relatively easy to fix.

It’s not just file system IO functions, though. File IO belongs to a broader class of functions called blocking functions: functions that are specified not to return until some external event happens. So correct UI applications cannot call any blocking function from their main threads.
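
The usual workaround is for the event handler to only start the work and hand the blocking call to another thread. Here is a minimal sketch using POSIX threads; handle_open_file() and the thread-per-request structure are mine for illustration, and a real application would more likely use a persistent worker with a request queue:

    /* Compile with -pthread. The main/UI thread never blocks on the file. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void *load_file_worker(void *arg) {
        char *path = arg;
        /* The blocking call happens here, off the main thread. If the file
           lives on a slow network drive, only this worker stalls. */
        FILE *f = fopen(path, "rb");
        if (f) {
            /* ... read the file, then post a "file loaded" event back to the
               UI event queue (toolkit-specific, omitted) ... */
            fclose(f);
        }
        free(path);
        return NULL;
    }

    /* Called on the main thread in response to a click; it only starts the
       work, so it returns quickly. */
    static void handle_open_file(const char *path) {
        pthread_t worker;
        char *copy = strdup(path);
        if (copy == NULL)
            return;
        if (pthread_create(&worker, NULL, load_file_worker, copy) != 0) {
            free(copy);
            return;
        }
        pthread_detach(worker);
    }

    int main(void) {
        handle_open_file("/tmp/example.txt");  /* stand-in for a UI event */
        pthread_exit(NULL);  /* a real app would run its event loop here */
    }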

It gets worse. Literally none of the standard library functions on contemporary systems are guaranteed to return within any bounded amount of time. If you want to write a correct UI application, you technically cannot call any of them. I’m talking malloc(). Each call risks taking longer than the maximum time allotted to respond to the event.

You may think I am being excessively pedantic with the previous point. Maybe you think, “No sane implementation of any standard library function will take more than 500us on good data. It’s good enough to avoid blocking functions on the main thread.” I have two words for you: virtual memory.

Virtual Memory

Windows, Linux, and macOS are all virtual memory operating systems. When applications allocate memory, they are not actually allocating physical memory; they are telling the operating system that they will be using a certain memory region for a certain purpose. This enables lots of functionality, but in particular it allows the operating system to save physical memory by transparently writing memory pages out to a hard disk and reading them back in when the application accesses them again. This means that any memory access can block on a hard disk access.

This is a transparent process that is not under the control of the application. Thus, if any given memory access can block on IO from a disk drive, the system is fundamentally not real-time, and therefore UI applications on such a system are fundamentally broken.

This doesn’t seem like a common problem, but whole-system “out of memory” conditions are not that uncommon. When the system is in this state, it starts rapidly paging memory out to the hard disk. UI applications will be affected, and this will cause your system to hang without warning and with no way to intervene, since keypresses cannot be processed. From a user standpoint, this is worse than a kernel panic. This type of failure has happened to me multiple times on Linux, so I know it’s a problem there. Perhaps Windows and macOS engineers have already considered this issue, but I doubt it.

Is there a way to fix this? At least on Linux there is the mlock() family of functions, which tell the operating system to bring the process’s memory pages into RAM and keep them there. There are likely similar functions available on Windows and macOS. Of course there are still complications, e.g. is the application or the operating system responsible for locking memory pages? How does the application know which pages to lock? How does the operating system know which pages to lock?
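
As a sketch of what the blunt, application-side version could look like on Linux (it assumes the process’s memory footprint is small enough to pin entirely, and it needs CAP_IPC_LOCK or a raised RLIMIT_MEMLOCK):

    /* Ask the kernel to keep every current and future page of this process
       resident in RAM, so a page fault never turns into a disk read. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        /* MCL_CURRENT: lock pages already mapped.
           MCL_FUTURE:  also lock pages mapped from now on (heap growth,
           new thread stacks, future mmaps). */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }

        /* ... run the UI / real-time work ... */

        munlockall();
        return 0;
    }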

Real-time Scheduling

The final fundamental issue with implementing real-time UI on top of contemporary mainstream platforms is the lack of real-time scheduling for the active UI application. These systems are time-sharing systems, meaning that a process’s execution can be indefinitely paused if there are many other processes competing for use of the CPU.

Imagine you have background processes running at 100% CPU when an event comes in to the active UI application. The operating system may wait 100ms or more before scheduling the UI application to process the event, potentially causing a delayed response to the user that violates the real-time constraint (NB: a timeslice on the order of 100ms, i.e. a 10Hz scheduling quantum, is common for time-sharing systems).

There is a solution for this as well: the window manager or equivalent can tell the OS to give scheduling priority to whichever UI application has active focus. This means that while the active UI application needs the CPU, background processes are starved. There are complications with adapting a solution like this to existing systems as well, e.g. what should happen when the active UI application runs into an infinite loop? What about multi-process UI applications?
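
On Linux the underlying mechanism already exists in the form of real-time scheduling classes; a window manager could in principle apply something like the following to whichever client holds focus. This is a sketch of the OS facility, not a design from this post; pid 0 means the calling process, and a window manager would pass the focused client’s pid instead:

    /* Move a process into the SCHED_RR real-time class so ordinary
       time-shared (SCHED_OTHER) background processes cannot preempt it.
       Requires CAP_SYS_NICE or a suitable RLIMIT_RTPRIO. */
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        struct sched_param sp = { .sched_priority = 10 };
        if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {
            perror("sched_setscheduler");
            return 1;
        }

        /* ... the event loop now runs ahead of background load ... */
        return 0;
    }

Linux’s default real-time throttling (real-time tasks get at most about 950ms of every second) would at least blunt the infinite-loop concern, though it doesn’t answer the multi-process question.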

Conclusion

Hopefully I’ve at least convinced you that mainstream UI applications are built on poor foundations. Do these UI applications work? Sure, most of the time they work, but when they fail due to bad real-time assumptions, they fail in an annoying way. Beach ball of death. It’s unacceptable for workstation-class interactive systems to ever fail this way. I want to use responsive, correct applications. In the future, the UIs people depend on will take real-time constraints into account across the entire stack.

As far as I can tell, fixing this in a meaningful way will require large ecosystem-level changes and broad awareness. Lots of wide-reaching architectural decisions that ignore these issues have accumulated over decades. I’m tempted to abandon Windows, macOS, and Linux as the main platforms with which I interact.

Send any comments to @cejetvole

Rian Hunter
2023-09-21

Edit: The “Real-time Scheduling” section was added shortly after initial publication.