Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964847AbbEUPxp (ORCPT ); Thu, 21 May 2015 11:53:45 -0400 Received: from mx1.redhat.com ([209.132.183.28]:36760 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754830AbbEUPxj (ORCPT ); Thu, 21 May 2015 11:53:39 -0400 Date: Thu, 21 May 2015 17:52:51 +0200 From: Andrea Arcangeli To: Kirill Smelkov Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org, qemu-devel@nongnu.org, kvm@vger.kernel.org, linux-api@vger.kernel.org, Pavel Emelyanov , Sanidhya Kashyap , zhang.zhanghailiang@huawei.com, Linus Torvalds , "Kirill A. Shutemov" , Andres Lagar-Cavilla , Dave Hansen , Paolo Bonzini , Rik van Riel , Mel Gorman , Andy Lutomirski , Hugh Dickins , Peter Feiner , "Dr. David Alan Gilbert" , Johannes Weiner , "Huangpeng (Peter)" Subject: Re: [PATCH 00/23] userfaultfd v4 Message-ID: <20150521155251.GB4643@redhat.com> References: <1431624680-20153-1-git-send-email-aarcange@redhat.com> <20150521131111.GA8932@teco.navytux.spb.ru> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150521131111.GA8932@teco.navytux.spb.ru> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 12341 Lines: 282 Hi Kirill, On Thu, May 21, 2015 at 04:11:11PM +0300, Kirill Smelkov wrote: > Sorry for maybe speaking up too late, but here is additional real Not too late, in fact I don't think there's any change required for this at this stage, but it'd be great if you could help me to review. > Since arrays can be large, it would be slow and thus not practical to [..] > So I've implemented a scheme where array data is initially PROT_READ > protected, then we catch SIGSEGV, if it is write and area belongs to array In the case of postcopy live migration (for qemu and/or containers) and postcopy live snapshotting, splitting the vmas is not an option because we may run out of them. If your PROT_READ areas are limited perhaps this isn't an issue but with hundreds GB guests (currently plenty in production) that needs to live migrate fully reliably and fast, the vmas could exceed the limit if we were to use mprotect. If your arrays are very large and the PROT_READ aren't limited, using userfaultfd this isn't only an optimization for you too, it's actually a must to avoid a potential -ENOMEM. > Also, since arrays could be large - bigger than RAM, and only sparse > parts of it could be needed to get needed information, for reading it > also makes sense to lazily load data in SIGSEGV handler with initial > PROT_NONE protection. Similarly I heard somebody wrote a fastresume to load the suspended (on disk) guest ram using userfaultfd. That is a slightly less fundamental case than postcopy because you could do it also with MAP_SHARED, but it's still interesting in allowing to compress or decompress the suspended ram on the fly with lz4 for example, something MAP_PRIVATE/MAP_SHARED wouldn't do (plus there's the additional benefit of not having an orphaned inode left open even if the file is deleted, that prevents to unmount the filesystem for the whole lifetime of the guest). > This is very similar to how memory mapped files work, but adds > transactionality which, as far as I know, is not provided by any > currently in-kernel filesystem on Linux. That's another benefit yes. > The gist of virtual memory-manager is this: > > https://lab.nexedi.cn/kirr/wendelin.core/blob/master/include/wendelin/bigfile/virtmem.h > https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c (vma_on_pagefault) I'll check it more in detail ASAP, thanks for the pointers! > For operations it currently needs > > - establishing virtual memory areas and connecting to tracking it That's the UFFDIO_REGISTER/UNREGISTER. > - changing pages protection > > PROT_NONE or absent - initially absent is what works with -mm already. The lazy loading already works. > PROT_NONE -> PROT_READ - after read Current UFFDIO_COPY will map it using vma->vm_page_prot. We'll need a new flag for UFFDIO_COPY to map it readonly. This is already contemplated: /* * There will be a wrprotection flag later that allows to map * pages wrprotected on the fly. And such a flag will be * available if the wrprotection ioctl are implemented for the * range according to the uffdio_register.ioctls. */ #define UFFDIO_COPY_MODE_DONTWAKE ((__u64)1<<0) __u64 mode; If the memory protection framework exists (either through the uffdio_register.ioctl out value, or through uffdio_api.features out-only value) you can pass a new flag (MODE_WP) above to transition from "absent" to "PROT_READ". > PROT_READ -> PROT_READWRITE - after write This will need to add UFFDIO_MPROTECT. > PROT_READWRITE -> PROT_READ - after commit UFFDIO_MPROTECT again (but harder if going from rw to ro, because of a slight mess to solve with regard to FAULT_FLAG_TRIED, in case you want to run this UFFDIO_MPROTECT without stopping the threads that are accessing the memory concurrently). And this should only work if the uffdio_register.mode had MODE_WP set, so we don't run into the races created by COWs (gup vs fork race). > PROT_READWRITE -> PROT_NONE or absent (again) - after abort UFFDIO_MPROTECT again, but you won't be able to read the page contents inside the memory manager thread (the one working with userfaultfd). The manager at all times if forbidden to touch the memory it is tracking with userfaultfd (if it does it'll deadlock, but kill -9 will get rid of it). gdb ironically because it is using an underoptimized access_process_vm wouldn't hang, because FAULT_FLAG_RETRY won't be set in handle_userfault in the gdb context, and it'll just receive a sigbus if by mistake the user tries to touch the memory. Even if it will hung later as get_user_pages_locked|unlocked gets used there too, kill -9 would solve gdb too. Back to the problem of accessing the UFFDIO_MPROTECT(PROT_NONE) memory: to do that a new ioctl should be required. I'd rather not go back to the route of UFFDIO_REMAP, but it could copy the data using the kernel address. It could be simply a reverse UFFDIO_COPY. We could add a UFFDIO_COPY_MODE_REVERSE flag to the "mode" of UFFDIO_COPY to mean "read source from kernel address and write destination in user address". By default it reads the source from user address and write the destination in kernel address (to be atomic). If you want to put data back before lifting the PROT_NONE, UFFDIO_COPY could be used in the standard way but with a UFFDIO_COPY_MODE_OVERWRITE flag that just overwrites the contents of the old page if it's not mapped (protnone), or just get rid of the old page (currently it'd return -EEXIST if the pte is not none). So the process would be: UFFDIO_COPY(dst_tmpaddr, src_addr, mode=REVERSE) UFFDIO_COPY(src_addr, dst_tmpaddr, mode=OVERWRITE) Then if you also set mode=READONLY in the last UFFDIO_COPY, it'll create a wrprotected mapping atomically before giving visibility to the new page contents: UFFDIO_COPY(src_addr, dst_tmpaddr, mode=OVERWRITE|WP) > PROT_READ -> PROT_NONE or absent (again) - on reclaim Same as above. > - working with aliasable memory (thus taken from tmpfs) > > there could be two overlapping-in-file mapping for file (array) > requested at different time, and changes from one mapping should > propagate to another one -> for common parts only 1 page should > be memory-mapped into 2 places in address-space. Why isn't the manager thread taking care of calling UFFDIO_MPROTECT in two places? And UFFDIO_COPY would fill the page and replace the old page and the effect would be visible as far as the "data" is concerned, but the protection bits would be more naturally different for each mapping, like a double mmap call is also required to map such an area in two places. You could have a MAP_PRIVATE vma with PROT_READ, you can't create a writable pte into it, just because you called UFFDIO_MPROTECT(PROT_READ|PROT_WRITE) in a different mapping of the same tmpfs page. NOTE: the availability of the UFFDIO_MPROTECT|COPY on tmpfs ares would still depend on UFFDIO_REGISTER to return the respecteve ioctl id in the uffdio_register.ioctl (out value of the register ioctl). > so what is currently lacking on userfaultfd side is: > > - ability to remove / make PROT_NONE already mapped pages > (UFFDIO_REMAP was recently dropped) > > - ability to arbitrarily change pages protection (e.g. RW -> R) > > - inject aliasable memory from tmpfs (or better hugetlbfs) and into > several places (UFFDIO_REMAP + some mapping copy semantic). I think UFFDIO_COPY if added with OVERWRITE|REVERSE|WP flags is an ok substitute for UFFDIO_REMAP. If UFFDIO_COPY sees the page is protnone during the REVERSE copy (to extract the memory atomically), it can also skip the tlb flush (and obviously there's no tlb flush in the reverse direction). If the page was not protnone, it can turn it in protnone, do a tlb flush, and then copy it to the destination address using the userland mapping. UFFDIO_MPROTECT is definitely necessary for postcopy live snapshotting too (the reverse UFFDIO_COPY is not, it never deals with PROT_NONE and it never cares about missing faults). MPROTECT(PROT_NONE) so far seems needed only by this and perhaps UML (and perhaps qemu linux-user). I posted in another email why these features aren't implemented yet == There will be some complications in adding the wrprotection/protnone feature: if faults could already happen when the wrprotect/protnone is armed, the handle_userfault() could be invoked in a retry-fault, that is not ok without allowing the userfault to return VM_FAULT_RETRY even during a refault (i.e. FAULT_FLAG_TRIED set but FAULT_FLAG_ALLOW_RETRY not set). The invariants of vma->vm_page_prot and pte/trans_huge_pmd permissions must also not break anywhere. These are the two main reasons why these features that requires to flip protection bits are left implemented later and made visible later with uffdio_api.feature flags and/or through uffdio_register.ioctl during UFFDIO_REGISTER. == > The performance currently is not great, partly because of page clearing > when getting ram from tmpfs, and partly because of mprotect/SIGSEGV/vmas > overhead and other dumb things on my side. Also the page faults get slowed down when the rbtree grows a lot, userfaultfd won't let the rbtree grow. > I still wanted to show the case, as userfaultd here has potential to > remove overhead related to kernel. That's very useful and interesting feedback! Could you review the API to be sure we don't have to modify it when we extend it like described above? 1) tmpfs returning uffdio_register.ioctl |= UFFDIO_MPROTECT|UFFDIO_COPY when enabled 2) UFFDIO_MPROTECT(PROT_NONE|READ|WRITE|EXEC) and in turn UFFDIO_COPY_MODE_REVERSE|UFFDIO_COPY_MODE_OVERWRITE|UFFDIO_COPY_MODE_WP being available if uffdio_register.ioctl includes UFFDIO_MPROTECT (and uffdio_api.features will then include UFFD_FEATURE_PAGEFAULT_WP to signal the uffd_msg.pagefault.flag WP is available [bit 1], and UFFDIO_REGISTER_MODE_WP can be used in uffdio_register.mode) All of it could just check for uffdio_api.features & UFFD_FEATURE_PAGEFAULT_WP being set, but you'd still have to check for UFFDIO_MPROTECT being set in uffdio_register.ioctl for tmpfs areas (or to know it's not available yet on hugetlbfs), so I think it's more robust to check UFFDIO_MPROTECT ioctl being set in uffdio_register.ioctl to assume all mprotection and writeprotect tracking features are available for that specific range. The feature flag will just tell that UFFDIO_REGISTER_MODE_WP can be used in the register ioctl, that is something you need to know before in order to "arm" the VMA for wrprotect faults. For your usage I think you probably want to set UFFDIO_REGISTER_MODE_WP|UFFDIO_REGISTER_MODE_MISSING and you'll be told through uffdio_msg.flags if it's a WP or MISSING fault. You won't be told if it's missing because of PROT_NONE or absent. On a side note: all of the above is completely orthognal from the non-cooperative usage: as far as memory protection features it doesn't need any, it just needs to track more events like fork/mremap to adjust its offsets as the memory manager is not part of the app and it has no way to orchestrate by other means. Doing it all at once (non-cooperative + full memory protection) looked too much. We should just try to get the API right in a way that won't require an UFFD_API bump passed to uffdio_api.api. Even then, if an api bump is required, that's not a big deal, until recently the non-cooperative usage already did the API bumb but we accomodated the read(2) API to avoid it. Thinking at the worst case scenario, if the API gets bumped the only thing that has to remain fixed is the ioctl number of the UFFDIO_API and the uffdio_api structure. Everything else can be upgraded without risk of ABI breakage, even the ioctl numbers can be reused (except the very UFFDIO_API). When the non-cooperative usage bumped the API it actually kept all ioctl the same except the read(2) format. Comments welcome, thanks! Andrea -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/