Saturday, June 26, 2021

Patching until the COWs come home

Please consider subscribing to LWN

Subscriptions are the lifeblood of LWN.net. If you appreciate this content and would like to see more of it, your subscription will help to ensure that LWN continues to thrive. Please visit this page to join up and keep LWN on the net.

March 22, 2021

This article was contributed by Vlastimil Babka

The kernel's memory-management subsystem is built upon many concepts, one of which is called "copy on write", or "COW". The idea behind COW is conceptually simple, but its details are tricky and its past is troublesome. Any change to its implementation can have unexpected consequences and cause subtle breakage for existing workloads. So it is somewhat surprising that last year we saw two major changes the kernel's COW code; less surprising is the fact that, both times, these changes had unexpected consequences and broke things. Some of the resulting problems are still not fixed today, almost ten months after the first change, while the original reason for the changes — a security vulnerability — is also not fully fixed. Read on for a description of COW, the vulnerability, and the initial fix; the concluding article in the series will describe the complications that arose thereafter.

Copy on write is a standard mechanism for sharing a single instance of an object between processes in a situation where each process has the illusion of an independent, private copy of that object. Examples include memory pages shared between processes or data extents shared between files. To see how COW is used in the memory-management subsystem, consider what happens when a process calls fork(): the pages in that process's private memory areas should no longer be shared between the parent and child. But, instead of creating new copies of those pages for the child process during the fork() call, the kernel will simply map the parent's pages in the child's page tables. Importantly, the page-table entries in both parent and child are set as read-only (write-protected).

If either process attempts to write to one of these pages, a page fault will occur, and the kernel's page-fault handler will create a new copy of the page, replacing the page-table entry (PTE) in the faulting process with a PTE that references the new page, but which allows the write to proceed. This action is often referred to as "breaking COW". If the other process then tries to write to that same page, another page fault will occur, as that process's PTE is still marked read-only. But now the page-fault handler will recognize that the page is no longer shared, so the PTE can just be made writable and the process can resume.

The benefits of this scheme are lower memory consumption and a reduction of CPU time spent copying pages during fork() calls. Often the price of copying is never paid for many of the pages because the child might call exit() or exec() before either the parent or the child writes to those pages.

While the COW mechanism looks simple, the devil is in the details, as has been shown already in the past. The recent trouble in this area started in 2020; it resulted in two major changes while attempting to fix a vulnerability — which is actually still not fixed in all scenarios — and resulted in many corner cases, some of which are still not fully ironed out.

The trouble begins

The first public sign of issues with the COW mechanism appeared in the form of commit 17839856fd58 ("gup: document and work around 'COW can break either way' issue") at the end of May 2020. The changelog doesn't fully describe the problem scenario, but what is there is ominous enough:

End result: the get_user_pages() call might result in a page pointer that is no longer associated with the original VM, and is associated with - and controlled by - another VM having taken it over instead.

Any doubts about whether the commit fixed a security vulnerability vanish when one notices the Reported-by tag mentioning Jann Horn; presumably Horn's report went through the appropriate non-public security channels. The practice of making fixes to some vulnerabilities immediately public without explicitly marking them as such is not new, especially in the COW area. Nevertheless, the related Project Zero issue was made public in August, and CVE-2020-29374 was assigned in December; both point to the above-mentioned commit as the fix.

As the Project Zero issue includes proof-of-concept (PoC) code, we can look at the fix with that code in mind and not rely on the incomplete commit log. The most important parts of the PoC are the following:

    static void *data;

    posix_memalign(&data, 0x1000, 0x1000);
    strcpy(data, "BORING DATA");

    if (fork() == 0) {
        // child
        int pipe_fds[2];
        struct iovec iov = {.iov_base = data, .iov_len = 0x1000 };
        char buf[0x1000];

        pipe(pipe_fds);
        vmsplice(pipe_fds[1], &iov, 1, 0);
        munmap(data, 0x1000);

        sleep(2);
        read(pipe_fds[0], buf, 0x1000);
        printf("read string from child: %s\n", buf);
   } else {
        // parent
        sleep(1);
        strcpy(data, "THIS IS SECRET");
   }

The code starts by allocating an anonymous, private page and writing some data there; it then calls fork(). At that point, the page becomes a COW page — it is write-protected for the parent process by making the corresponding page-table entry read-only, and for the child process an identical PTE is created. Then, while the parent is blocked inside sleep(), the child creates a pipe and passes the page to that pipe with vmsplice(), a system call that is similar to write() but which allows a zero-copy data transfer of the page's contents. In order to achieve that, the kernel takes a reference on the source page (by increasing its reference count) through get_user_page() or one of its variants; the set of these functions is often referred to as "GUP". The child then unmaps the page from its own page tables (but retains the reference in the pipe) and goes to sleep.

The parent wakes up from its sleep and writes new data to the page. The page table entry is write-protected, so the write causes a page fault. The page-fault handler can tell that this is fault on a COW page because the the mapping allows write access while the PTE is write protected. If there were more processes mapping the page then the content would have to be copied (breaking COW), but if there is a single mapping, the page can be just made writeable. The kernel relies on the value returned by page_mapcount() to determine how many mappings exist.

Here is the problem: page_mapcount() at this point in the PoC's execution includes only the parent's mapping, because the child has already called munmap() on that page. This function does not take into account the fact that the child can still access the parent's page through the pipe; it ignores the elevated page reference count. Thus, the kernel allows the parent process to write new data into the page, which is no longer considered to be a COW page. Finally, the child wakes up and reads that new data from the pipe, which might include sensitive information that the parent did not expect the child to see.

Corralling the problem

One might rightfully ask why this potential of leaking data from parent to child can matter in practice, as both processes are normally executing the code from the same binary and the fork() only acts as a branch in the code. So we can assume that, either the binary is trusted and thus the child process is too, or it is not and then we probably should not let the parent access any sensitive data in the first place. And, in the scenario where fork() from a trusted binary is followed by an exec() of a potentially malicious binary, exec() removes all shared pages from the address space of the child process before loading the new binary.

But, as the Project Zero issue mentions, there are environments, such as Android, where each process is forked from a zygote process without a subsequent exec(), for performance reasons. That could lead to a situation that looks a lot like the PoC exploit for this bug.

Moreover, the vmsplice() syscall might just be a symptom of a broader issue, since there are many other callers of the GUP functions in the kernel. So it is a good idea in general not to let a child process hold on to a page shared through the COW mechanism with the parent while letting the parent write new contents to the page.

To prevent exploits of this behavior, commit 17839856fd58 made it impossible to get a reference (even a read-only reference) via GUP to a COW-shared page. All such attempts now result in breaking COW and returning a reference to the new copy instead. Thus, in the PoC code above, calling vmsplice() now causes the child process to replace the shared COW page in the corresponding page table entry with a new page, which is then passed to the pipe. Afterward, the child no longer has any way to access the parent's page and the new contents written there.

The commit notes the possibility of worse performance for some GUP users, especially those that rely on a lockless variant of the interface like get_user_pages_fast(). The changelog continues that finer-grained rules could be added later for situations where it is clear that it is safe to keep sharing the COW page because it can never be overwritten with new, potentially sensitive contents. The system-wide zero-page would be one example of this sort of situation. But otherwise, Linus Torvalds (the author of the change) expected no fundamental issues with this aggressively COW-breaking approach for GUP. Linux 5.8 was duly released with this commit.

And this, one might think, was the end of the problem. But, as was mentioned at the outset, COW is a complicated and subtle beast. In truth, the problems were just beginning. The second half of this article will delve into how the COW fix led to a stampede of new problems that still have yet to be completely solved.


(

Log in

to post comments)



from Hacker News https://ift.tt/3dt6zaa

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.