Thursday, July 21, 2022

Jurassic Cloud

Linux, modeled after the UNIX operating systems, has long followed the traditional blocking, buffered, and cached I/O model. In essence, every I/O operation is indirect and served from a special memory pool known as the “page cache”. When we write, Linux first copies the data into a page in the page cache and asynchronously flushes it to disk later. This is why we have the notorious and costly sync family of syscalls (sync, fsync, fdatasync), which force Linux to write the data to disk immediately. Likewise, when we read, Linux consults the page cache for an existing copy of the data and only fetches from disk if the data is stale or missing.
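
To make the buffered path concrete, here is a minimal sketch of that write-then-flush sequence (the file name and message are made up for illustration, and error handling is trimmed): write() returns as soon as the data sits in the page cache, and only the explicit fsync() forces it onto the disk.

/* Sketch of the buffered I/O path: write() lands in the page cache,
 * fsync() forces the data to stable storage. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const char msg[] = "hello, page cache\n";

    int fd = open("example.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Returns once the data has been copied into the page cache. */
    if (write(fd, msg, strlen(msg)) < 0) { perror("write"); return 1; }

    /* Blocks until the kernel has actually flushed the file to disk. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}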

This design has a lot of advantages, like merging read and write operations, but the real benefit is that operations are executed against a much faster medium: RAM. With RAM so much faster than disk, why fuss over the complex control mechanisms and semantics of a non-blocking API? A simple blocking API made a lot of sense from both a safety and an ease-of-use perspective. At least it did, until disks themselves became so much faster…

As usual, the devil was in the details: the blocking API came at a price, the overhead of a context switch. Context switches happen (among other reasons) when a userspace program blocks and the kernel switches to another task. A context switch may take 1-5µs on a modern computer - completely negligible compared to a 10ms magnetic disk seek, but disastrously expensive compared to the 10µs latency of a modern NVMe drive.
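
A quick back-of-the-envelope calculation using the figures above shows why the ratio matters (the 2µs context switch is an assumed mid-range value):

/* Rough comparison of context-switch overhead against device latency,
 * using the numbers quoted above. */
#include <stdio.h>

int main(void) {
    double ctx_switch_us = 2.0;      /* assumed mid-range context switch */
    double hdd_seek_us   = 10000.0;  /* ~10ms magnetic disk seek */
    double nvme_read_us  = 10.0;     /* ~10µs NVMe read latency */

    /* Prints ~0.02% for the disk seek and ~20% for the NVMe read. */
    printf("overhead vs HDD seek:  %.2f%%\n", 100.0 * ctx_switch_us / hdd_seek_us);
    printf("overhead vs NVMe read: %.2f%%\n", 100.0 * ctx_switch_us / nvme_read_us);
    return 0;
}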

This problem alone was enough to send a myriad of programmers scrambling to write async I/O frameworks, but to no avail. Until recently, Linux did not have a proper non-blocking I/O API that worked with disks (it does now - check out io_uring!), which led to horrible hacks and compromises.
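
For the curious, below is a minimal sketch of what a disk read looks like with io_uring via the liburing helper library (the file name is made up, error handling is trimmed, and you link with -luring). The read is queued and submitted without blocking on the device; the final wait is only there so the sketch has something to print - a real application would keep doing other work and reap completions as they arrive.

/* Minimal io_uring read sketch using liburing. */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[4096];
    struct io_uring ring;
    struct io_uring_cqe *cqe;

    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Set up a small submission/completion ring pair. */
    io_uring_queue_init(8, &ring, 0);

    /* Queue a read of the first 4KiB; submission does not block on the disk. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);

    /* Wait for the completion event (a real application would do other
     * work here instead of waiting immediately). */
    io_uring_wait_cqe(&ring, &cqe);
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}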



from Hacker News https://ift.tt/QxEBdOR
