From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
    Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
    Mike Rapoport, Yang Shi,
Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , Liang Zhang , Pedro Gomes , Oded Gabbay , linux-mm@kvack.org, David Hildenbrand Subject: [PATCH v1 14/15] mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page Date: Tue, 8 Mar 2022 15:14:36 +0100 Message-Id: <20220308141437.144919-15-david@redhat.com> In-Reply-To: <20220308141437.144919-1-david@redhat.com> References: <20220308141437.144919-1-david@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 2.79 on 10.5.11.11 X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H5,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_NONE, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Whenever GUP currently ends up taking a R/O pin on an anonymous page that might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault on the page table entry will end up replacing the mapped anonymous page due to COW, resulting in the GUP pin no longer being consistent with the page actually mapped into the page table. The possible ways to deal with this situation are: (1) Ignore and pin -- what we do right now. (2) Fail to pin -- which would be rather surprising to callers and could break user space. (3) Trigger unsharing and pin the now exclusive page -- reliable R/O pins. Let's implement 3) because it provides the clearest semantics and allows for checking in unpin_user_pages() and friends for possible BUGs: when trying to unpin a page that's no longer exclusive, clearly something went very wrong and might result in memory corruptions that might be hard to debug. So we better have a nice way to spot such issues. This change implies that whenever user space *wrote* to a private mapping (IOW, we have an anonymous page mapped), that GUP pins will always remain consistent: reliable R/O GUP pins of anonymous pages. As a side note, this commit fixes the COW security issue for hugetlb with FOLL_PIN as documented in: https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com The vmsplice reproducer still applies, because vmsplice uses FOLL_GET instead of FOLL_PIN. Note that follow_huge_pmd() doesn't apply because we cannot end up in there with FOLL_PIN. This commit is heavily based on prototype patches by Andrea. Co-developed-by: Andrea Arcangeli Signed-off-by: Andrea Arcangeli Signed-off-by: David Hildenbrand --- include/linux/mm.h | 39 +++++++++++++++++++++++++++++++++++++++ mm/gup.c | 42 +++++++++++++++++++++++++++++++++++++++--- mm/huge_memory.c | 3 +++ mm/hugetlb.c | 27 ++++++++++++++++++++++++--- 4 files changed, 105 insertions(+), 6 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 63ee06001189..45fe97213a27 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3003,6 +3003,45 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags) return 0; } +/* + * Indicates for which pages that are write-protected in the page table, + * whether GUP has to trigger unsharing via FAULT_FLAG_UNSHARE such that the + * GUP pin will remain consistent with the pages mapped into the page tables + * of the MM. 
 include/linux/mm.h | 39 +++++++++++++++++++++++++++++++++++++++
 mm/gup.c           | 42 +++++++++++++++++++++++++++++++++++++++---
 mm/huge_memory.c   |  3 +++
 mm/hugetlb.c       | 27 ++++++++++++++++++++++++---
 4 files changed, 105 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 63ee06001189..45fe97213a27 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3003,6 +3003,45 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
 	return 0;
 }
 
+/*
+ * Indicates for which pages that are write-protected in the page table,
+ * whether GUP has to trigger unsharing via FAULT_FLAG_UNSHARE such that the
+ * GUP pin will remain consistent with the pages mapped into the page tables
+ * of the MM.
+ *
+ * Temporary unmapping of PageAnonExclusive() pages or clearing of
+ * PageAnonExclusive() has to protect against concurrent GUP:
+ * * Ordinary GUP: Using the PT lock
+ * * GUP-fast and fork(): mm->write_protect_seq
+ * * GUP-fast and KSM or temporary unmapping (swap, migration):
+ *   clear/invalidate+flush of the page table entry
+ *
+ * Must be called with the (sub)page that's actually referenced via the
+ * page table entry, which might not necessarily be the head page for a
+ * PTE-mapped THP.
+ */
+static inline bool gup_must_unshare(unsigned int flags, struct page *page)
+{
+	/*
+	 * FOLL_WRITE is implicitly handled correctly as the page table entry
+	 * has to be writable -- and if it references (part of) an anonymous
+	 * folio, that part is required to be marked exclusive.
+	 */
+	if ((flags & (FOLL_WRITE | FOLL_PIN)) != FOLL_PIN)
+		return false;
+	/*
+	 * Note: PageAnon(page) is stable until the page is actually getting
+	 * freed.
+	 */
+	if (!PageAnon(page))
+		return false;
+	/*
+	 * Note that PageKsm() pages cannot be exclusive, and consequently,
+	 * cannot get pinned.
+	 */
+	return !PageAnonExclusive(page);
+}
+
 typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
 extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
 			       unsigned long size, pte_fn_t fn, void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 7127e789d210..9e4864130202 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -568,6 +568,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		}
 	}
 
+	if (!pte_write(pte) && gup_must_unshare(flags, page)) {
+		page = ERR_PTR(-EMLINK);
+		goto out;
+	}
 	/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
 	if (unlikely(!try_grab_page(page, flags))) {
 		page = ERR_PTR(-ENOMEM);
@@ -820,6 +824,11 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
  * When getting pages from ZONE_DEVICE memory, the @ctx->pgmap caches
  * the device's dev_pagemap metadata to avoid repeating expensive lookups.
  *
+ * When getting an anonymous page and the caller has to trigger unsharing
+ * of a shared anonymous page first, -EMLINK is returned. The caller should
+ * trigger a fault with FAULT_FLAG_UNSHARE set. Note that unsharing is only
+ * relevant with FOLL_PIN and !FOLL_WRITE.
+ *
  * On output, the @ctx->page_mask is set according to the size of the page.
  *
  * Return: the mapped (struct page *), %NULL if no mapping exists, or
@@ -943,7 +952,8 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
  * is, *@locked will be set to 0 and -EBUSY returned.
  */
 static int faultin_page(struct vm_area_struct *vma,
-		unsigned long address, unsigned int *flags, int *locked)
+		unsigned long address, unsigned int *flags, bool unshare,
+		int *locked)
 {
 	unsigned int fault_flags = 0;
 	vm_fault_t ret;
@@ -968,6 +978,11 @@ static int faultin_page(struct vm_area_struct *vma,
 		 */
 		fault_flags |= FAULT_FLAG_TRIED;
 	}
+	if (unshare) {
+		fault_flags |= FAULT_FLAG_UNSHARE;
+		/* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */
+		VM_BUG_ON(fault_flags & FAULT_FLAG_WRITE);
+	}
 
 	ret = handle_mm_fault(vma, address, fault_flags, NULL);
 	if (ret & VM_FAULT_ERROR) {
@@ -1189,8 +1204,9 @@ static long __get_user_pages(struct mm_struct *mm,
 		cond_resched();
 
 		page = follow_page_mask(vma, start, foll_flags, &ctx);
-		if (!page) {
-			ret = faultin_page(vma, start, &foll_flags, locked);
+		if (!page || PTR_ERR(page) == -EMLINK) {
+			ret = faultin_page(vma, start, &foll_flags,
+					   PTR_ERR(page) == -EMLINK, locked);
 			switch (ret) {
 			case 0:
 				goto retry;
@@ -2346,6 +2362,11 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			goto pte_unmap;
 		}
 
+		if (!pte_write(pte) && gup_must_unshare(flags, page)) {
+			put_compound_head(head, 1, flags);
+			goto pte_unmap;
+		}
+
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 
 		/*
@@ -2529,6 +2550,11 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		return 0;
 	}
 
+	if (!pte_write(pte) && gup_must_unshare(flags, head)) {
+		put_compound_head(head, refs, flags);
+		return 0;
+	}
+
 	*nr += refs;
 	SetPageReferenced(head);
 	return 1;
@@ -2589,6 +2615,11 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return 0;
 	}
 
+	if (!pmd_write(orig) && gup_must_unshare(flags, head)) {
+		put_compound_head(head, refs, flags);
+		return 0;
+	}
+
 	*nr += refs;
 	SetPageReferenced(head);
 	return 1;
@@ -2623,6 +2654,11 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return 0;
 	}
 
+	if (!pud_write(orig) && gup_must_unshare(flags, head)) {
+		put_compound_head(head, refs, flags);
+		return 0;
+	}
+
 	*nr += refs;
 	SetPageReferenced(head);
 	return 1;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 617ed753996e..0f02c121c884 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1389,6 +1389,9 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
 
+	if (!pmd_write(*pmd) && gup_must_unshare(flags, page))
+		return ERR_PTR(-EMLINK);
+
 	if (!try_grab_page(page, flags))
 		return ERR_PTR(-ENOMEM);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7161f33aff9a..c5d63baa2957 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5956,6 +5956,25 @@ static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
 	}
 }
 
+static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t *pte,
+					       bool *unshare)
+{
+	pte_t pteval = huge_ptep_get(pte);
+
+	*unshare = false;
+	if (is_swap_pte(pteval))
+		return true;
+	if (huge_pte_write(pteval))
+		return false;
+	if (flags & FOLL_WRITE)
+		return true;
+	if (gup_must_unshare(flags, pte_page(pteval))) {
+		*unshare = true;
+		return true;
+	}
+	return false;
+}
+
 long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 struct page **pages, struct vm_area_struct **vmas,
 			 unsigned long *position, unsigned long *nr_pages,
@@ -5970,6 +5989,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
 		spinlock_t *ptl = NULL;
+		bool unshare = false;
 		int absent;
 		struct page *page;
 
@@ -6020,9 +6040,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * both cases, and because we can't follow correct pages
 		 * directly from any kind of swap entries.
 		 */
-		if (absent || is_swap_pte(huge_ptep_get(pte)) ||
-		    ((flags & FOLL_WRITE) &&
-		      !huge_pte_write(huge_ptep_get(pte)))) {
+		if (absent ||
+		    __follow_hugetlb_must_fault(flags, pte, &unshare)) {
 			vm_fault_t ret;
 			unsigned int fault_flags = 0;
 
@@ -6030,6 +6049,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			spin_unlock(ptl);
 			if (flags & FOLL_WRITE)
 				fault_flags |= FAULT_FLAG_WRITE;
+			else if (unshare)
+				fault_flags |= FAULT_FLAG_UNSHARE;
 			if (locked)
 				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
 					FAULT_FLAG_KILLABLE;
-- 
2.35.1