Date: Wed, 23 Dec 2020 13:51:59 -0500
From: Andrea Arcangeli
To: Peter Xu
Cc: Yu Zhao, Linus Torvalds, Andy Lutomirski, Nadav Amit, linux-mm, lkml,
    Pavel Emelyanov, Mike Kravetz, Mike Rapoport, stable, Minchan Kim,
    Will Deacon, Peter Zijlstra
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
In-Reply-To: <20201223162416.GD6404@xz-x1>

On Wed, Dec 23, 2020 at 11:24:16AM -0500, Peter Xu wrote:
> I think this is not against Linus's example - where cpu2 does not have
> the tlb cached so it sees RO, while cpu3 does have the tlb cached so
> cpu3 can still modify it. So IMHO there's no problem here.
>
> But I do think in step 2 here we overlooked the _PAGE_UFFD_WP bit. Note
> that if it's uffd-wp wr-protection, it's always applied along with the
> removal of the write bit in change_pte_range():
>
> 	if (uffd_wp) {
> 		ptent = pte_wrprotect(ptent);
> 		ptent = pte_mkuffd_wp(ptent);
> 	}
>
> So instead of being handled as a COW page, do_wp_page() will always
> trap userfaultfd-wp first, hence no chance to race with COW.
>
> COW could only trigger after another uffd-wp-resolve ioctl which could
> remove the _PAGE_UFFD_WP bit, but with Andrea's theory unprotect will
> only happen after all wr-protects complete, which guarantees that when
> reaching the COW path the tlb must have been flushed anyway. Then no
> one should be modifying the page anymore even without the pgtable lock
> in the COW path.
>
> So IIUC what Linus proposed on "flushing tlb within pgtable lock" seems
> to work, but it just may cause more tlb flushes than Andrea's proposal,
> especially when the protection range is large (it's common to have a
> huge protection range for e.g. VM live snapshotting, where we'll
> protect all guest rw ram).
>
> My understanding of the current issue is that either we can take
> Andrea's proposal (although I think the group rwsem may not be
> extremely better than a per-mm rwsem, which I don't know... at least
> not worse than that?), or we can also go the other way (also as Andrea
> mentioned) so that when wr-protecting:
>
> - for a <=2M range (pmd or less), we take the read rwsem, but flush the
>   tlb within the pgtable lock
>
> - for a >2M range, we take the write rwsem directly but flush the tlb
>   once

I fully agree indeed.
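For reference, the trap Peter refers to is the check at the very top of
do_wp_page() in mm/memory.c; abridged here from memory of the v5.10
tree, so treat it as a sketch rather than the literal source:

	static vm_fault_t do_wp_page(struct vm_fault *vmf)
		__releases(vmf->ptl)
	{
		struct vm_area_struct *vma = vmf->vma;

		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
			/*
			 * uffd-wp protected pte: hand the fault to the
			 * uffd monitor instead of falling through to
			 * the COW logic, so a COW copy cannot race with
			 * a pending wr-protect.
			 */
			pte_unmap_unlock(vmf->pte, vmf->ptl);
			return handle_userfault(vmf, VM_UFFD_WP);
		}
		/* ... normal COW/reuse handling follows ... */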
I still have to fully understand how clear_refs_write was fixed.

clear_refs_write doesn't even have a "catcher" in the page fault, yet it
clears pte_write under mmap_read_lock and doesn't flush before releasing
the PT lock. It appears way more broken than userfaultfd_writeprotect
ever was.

	static inline void clear_soft_dirty(struct vm_area_struct *vma,
			unsigned long addr, pte_t *pte)
	{
		/*
		 * The soft-dirty tracker uses #PF-s to catch writes
		 * to pages, so write-protect the pte as well. See the
		 * Documentation/admin-guide/mm/soft-dirty.rst for full
		 * description of how soft-dirty works.
		 */
		[...]

https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-will@kernel.org/

"Although this is fine when simply aging the ptes, in the case of
clearing the "soft-dirty" state we can end up with entries where
pte_write() is false, yet a writable mapping remains in the TLB. Fix
this by avoiding the mmu_gather API altogether: managing both the
'tlb_flush_pending' flag on the 'mm_struct' and explicit TLB
invalidation for the soft-dirty path, much like mprotect() does already"

NOTE: about the above comment, mprotect takes mmap_write_lock. The code
change in the commit above still runs under mmap_read_lock, so there's
no similarity with mprotect whatsoever.

Did you test what happens when clear_refs_write leaves stale writable
TLB entries and do_wp_page copies the page while another CPU is still
writing to it? Does clear_refs_write have a selftest as aggressive and
racy as the uffd selftest to reproduce that workload?

The race fixed with the group lock internal to uffd is only fixable that
way thanks to the marker+catcher, NUMA-balancing style; that's not there
for clear_refs_write even after the above commit. Nor does the patch
appear to add a synchronous tlb flush inside the PT lock.
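If I read the patch right, it ends up with roughly the shape below
(paraphrased, not the literal diff): the ptes are wrprotected under the
PT lock with no flush, and a single flush_tlb_mm() runs only after the
whole walk, still under mmap_read_lock, with the pending-flush counter
as the only hint left for other code paths:

	/* clear_refs_write(), CLEAR_REFS_SOFT_DIRTY path (paraphrased) */
	if (type == CLEAR_REFS_SOFT_DIRTY) {
		mmu_notifier_range_init(&range, MMU_NOTIFY_SOFT_DIRTY,
					0, NULL, mm, 0, -1UL);
		mmu_notifier_invalidate_range_start(&range);
		/* from here mm_tlb_flush_pending(mm) reads true */
		inc_tlb_flush_pending(mm);
	}

	/* wrprotects the ptes under the PT lock, with no flush before
	   each PT lock is released */
	walk_page_range(mm, 0, mm->highest_vm_end,
			&clear_refs_walk_ops, &cp);

	if (type == CLEAR_REFS_SOFT_DIRTY) {
		mmu_notifier_invalidate_range_end(&range);
		/* one flush, long after the PT locks were dropped */
		flush_tlb_mm(mm);
		dec_tlb_flush_pending(mm);
	}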
Overall, it would be unfair if clear_refs_write were allowed to do the
same thing that userfaultfd_writeprotect has to do, but with
mmap_read_lock, while userfaultfd_writeprotect is forced to take
mmap_write_lock.

In my view there are 3 official ways to deal with this:

1) set a marker in the pte/hugepmd inside the PT lock, and add a catcher
   for the marker in the page fault to send the page fault to a dead end
   while there are stale TLB entries

   cases: userfaultfd_writeprotect() and NUMA balancing

2) take mmap_write_lock

   case: mprotect

3) flush the TLB before the PT lock release (see the sketch at the end
   of this mail)

   case: KSM write_protect_page

Where does the below patch land in the 3 cases? It doesn't fit any of
them, and what it does still looks insufficient to fix the userfault
memory corruption.

https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-will@kernel.org/

It'd be backwards if the complexity of the kernel were increased for
clear_refs_write but forbidden for the O(1) API that delivers the exact
same information (the clear_refs_write API delivers that info in O(N)).

We went the last mile to guarantee uffd can always be used fully
securely, unprivileged and by default, in current Linus's tree and in
future Android out of the box. With the current defaults it's simply
impossible to use uffd to enlarge any kernel use-after-free race window,
so even that concern is gone. From now on those research testcases will
stick to fuse instead, I guess.
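For completeness, here is the case 3 pattern, abridged from memory from
write_protect_page() in mm/ksm.c (so details may differ; the point is
that the PT lock is held across both the flush and the final set_pte):

	/* PT lock held throughout */
	swapped = PageSwapCache(page);
	entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
	/*
	 * pte cleared AND TLB flushed under the PT lock: no CPU can
	 * keep writing through a stale writable TLB entry from here.
	 */
	if (page_mapcount(page) + 1 + swapped != page_count(page)) {
		/* O_DIRECT or similar I/O in flight: back out */
		set_pte_at(mm, pvmw.address, pvmw.pte, entry);
		goto out_unlock;
	}
	if (pte_dirty(entry))
		set_page_dirty(page);

	entry = pte_mkclean(pte_wrprotect(entry));
	/* only now does the wrprotected pte become visible */
	set_pte_at_notify(mm, pvmw.address, pvmw.pte, entry);

Thanks,
Andrea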