Date: Tue, 12 Oct 2021 10:06:49 +0200
From: Jan Kara
To: Suren Baghdasaryan
Cc: stable@vger.kernel.org, gregkh@linuxfoundation.org, jannh@google.com,
    torvalds@linux-foundation.org, vbabka@suse.cz, peterx@redhat.com,
    aarcange@redhat.com, david@redhat.com, jgg@ziepe.ca,
    ktkhai@virtuozzo.com, shli@fb.com, namit@vmware.com, hch@lst.de,
    oleg@redhat.com, kirill@shutemov.name, jack@suse.cz,
    willy@infradead.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, kernel-team@android.com
Subject: Re: [PATCH 1/1] gup: document and work around "COW can break either way" issue
Message-ID: <20211012080649.GE9697@quack2.suse.cz>
References: <20211012015244.693594-1-surenb@google.com>
In-Reply-To: <20211012015244.693594-1-surenb@google.com>

On Mon 11-10-21 18:52:44, Suren Baghdasaryan wrote:
> From: Linus Torvalds
>
> commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.
> > Doing a "get_user_pages()" on a copy-on-write page for reading can be > ambiguous: the page can be COW'ed at any time afterwards, and the > direction of a COW event isn't defined. > > Yes, whoever writes to it will generally do the COW, but if the thread > that did the get_user_pages() unmapped the page before the write (and > that could happen due to memory pressure in addition to any outright > action), the writer could also just take over the old page instead. > > End result: the get_user_pages() call might result in a page pointer > that is no longer associated with the original VM, and is associated > with - and controlled by - another VM having taken it over instead. > > So when doing a get_user_pages() on a COW mapping, the only really safe > thing to do would be to break the COW when getting the page, even when > only getting it for reading. > > At the same time, some users simply don't even care. > > For example, the perf code wants to look up the page not because it > cares about the page, but because the code simply wants to look up the > physical address of the access for informational purposes, and doesn't > really care about races when a page might be unmapped and remapped > elsewhere. > > This adds logic to force a COW event by setting FOLL_WRITE on any > copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page > pointer as a result. > > The current semantics end up being: > > - __get_user_pages_fast(): no change. If you don't ask for a write, > you won't break COW. You'd better know what you're doing. > > - get_user_pages_fast(): the fast-case "look it up in the page tables > without anything getting mmap_sem" now refuses to follow a read-only > page, since it might need COW breaking. Which happens in the slow > path - the fast path doesn't know if the memory might be COW or not. > > - get_user_pages() (including the slow-path fallback for gup_fast()): > for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with > very similar semantics to FOLL_FORCE. > > If it turns out that we want finer granularity (ie "only break COW when > it might actually matter" - things like the zero page are special and > don't need to be broken) we might need to push these semantics deeper > into the lookup fault path. So if people care enough, it's possible > that we might end up adding a new internal FOLL_BREAK_COW flag to go > with the internal FOLL_COW flag we already have for tracking "I had a > COW". > > Alternatively, if it turns out that different callers might want to > explicitly control the forced COW break behavior, we might even want to > make such a flag visible to the users of get_user_pages() instead of > using the above default semantics. > > But for now, this is mostly commentary on the issue (this commit message > being a lot bigger than the patch, and that patch in turn is almost all > comments), with that minimal "enable COW breaking early" logic using the > existing FOLL_WRITE behavior. > > [ It might be worth noting that we've always had this ambiguity, and it > could arguably be seen as a user-space issue. > > You only get private COW mappings that could break either way in > situations where user space is doing cooperative things (ie fork() > before an execve() etc), but it _is_ surprising and very subtle, and > fork() is supposed to give you independent address spaces. > > So let's treat this as a kernel issue and make the semantics of > get_user_pages() easier to understand. 
>
> [surenb: backport notes
>   Since gup_pgd_range does not exist, made appropriate changes on
>   the gup_huge_pgd, gup_huge_pd and gup_pud_range calls instead.
>   Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_huge_pgd,
>   gup_huge_pd and gup_pud_range.
>   Removed FOLL_PIN usage in should_force_cow_break since it's
>   missing in the earlier kernels.]

I'd be really careful with backporting this to stable. If I remember
right, this change caused a lot of userspace breakage which needed to
be fixed up later. There is a nice summary at
https://lwn.net/Articles/849638/ and https://lwn.net/Articles/849876/,
and some problems are still being found...

								Honza
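One note for readers of the diff below: should_force_cow_break() relies
on is_cow_mapping(), which this patch does not add. As a reference
sketch (the helper's exact location and return type vary between kernel
versions, so check the target tree), the predicate is essentially:

static inline bool is_cow_mapping(vm_flags_t flags)
{
        /* MAP_PRIVATE and able to become writable: pages can be
         * shared now and copied on a later write fault. */
        return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

That "shared now, copied later" case is exactly what the forced COW
break has to cover.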
> Reported-by: Jann Horn
> Tested-by: Christoph Hellwig
> Acked-by: Oleg Nesterov
> Acked-by: Kirill Shutemov
> Acked-by: Jan Kara
> Cc: Andrea Arcangeli
> Cc: Matthew Wilcox
> Signed-off-by: Linus Torvalds
> [surenb: backport to 4.4 kernel]
> Cc: stable@vger.kernel.org # 4.4.x
> Signed-off-by: Suren Baghdasaryan
> ---
>  mm/gup.c         | 48 ++++++++++++++++++++++++++++++++++++++++--------
>  mm/huge_memory.c |  7 +++----
>  2 files changed, 43 insertions(+), 12 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 4c5857889e9d..c80cdc408228 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -59,13 +59,22 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>  }
>
>  /*
> - * FOLL_FORCE can write to even unwritable pte's, but only
> - * after we've gone through a COW cycle and they are dirty.
> + * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
> + * but only after we've gone through a COW cycle and they are dirty.
>   */
>  static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
>  {
> -	return pte_write(pte) ||
> -		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> +	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
> +}
> +
> +/*
> + * A (separate) COW fault might break the page the other way and
> + * get_user_pages() would return the page from what is now the wrong
> + * VM. So we need to force a COW break at GUP time even for reads.
> + */
> +static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
> +{
> +	return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
>  }
>
>  static struct page *follow_page_pte(struct vm_area_struct *vma,
> @@ -509,12 +518,18 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		if (!vma || check_vma_flags(vma, gup_flags))
>  			return i ? : -EFAULT;
>  		if (is_vm_hugetlb_page(vma)) {
> +			if (should_force_cow_break(vma, foll_flags))
> +				foll_flags |= FOLL_WRITE;
>  			i = follow_hugetlb_page(mm, vma, pages, vmas,
>  					&start, &nr_pages, i,
> -					gup_flags);
> +					foll_flags);
>  			continue;
>  		}
>  	}
> +
> +	if (should_force_cow_break(vma, foll_flags))
> +		foll_flags |= FOLL_WRITE;
> +
>  retry:
>  	/*
>  	 * If we have a pending SIGKILL, don't keep faulting pages and
> @@ -1346,6 +1361,10 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
>  /*
>   * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
>   * the regular GUP. It will only return non-negative values.
> + *
> + * Careful, careful! COW breaking can go either way, so a non-write
> + * access can get ambiguous page results. If you call this function without
> + * 'write' set, you'd better be sure that you're ok with that ambiguity.
>   */
>  int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  			  struct page **pages)
> @@ -1375,6 +1394,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  	 *
>  	 * We do not adopt an rcu_read_lock(.) here as we also want to
>  	 * block IPIs that come from THPs splitting.
> +	 *
> +	 * NOTE! We allow read-only gup_fast() here, but you'd better be
> +	 * careful about possible COW pages. You'll get _a_ COW page, but
> +	 * not necessarily the one you intended to get depending on what
> +	 * COW event happens after this. COW may break the page copy in a
> +	 * random direction.
>  	 */
>
>  	local_irq_save(flags);
> @@ -1385,15 +1410,22 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  		next = pgd_addr_end(addr, end);
>  		if (pgd_none(pgd))
>  			break;
> +		/*
> +		 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
> +		 * because get_user_pages() may need to cause an early COW in
> +		 * order to avoid confusing the normal COW routines. So only
> +		 * targets that are already writable are safe to do by just
> +		 * looking at the page tables.
> +		 */
>  		if (unlikely(pgd_huge(pgd))) {
> -			if (!gup_huge_pgd(pgd, pgdp, addr, next, write,
> +			if (!gup_huge_pgd(pgd, pgdp, addr, next, 1,
>  					  pages, &nr))
>  				break;
>  		} else if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) {
>  			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
> -					 PGDIR_SHIFT, next, write, pages, &nr))
> +					 PGDIR_SHIFT, next, 1, pages, &nr))
>  				break;
> -		} else if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
> +		} else if (!gup_pud_range(pgd, addr, next, 1, pages, &nr))
>  			break;
>  	} while (pgdp++, addr = next, addr != end);
>  	local_irq_restore(flags);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6404e4fcb4ed..fae45c56e2ee 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1268,13 +1268,12 @@ out_unlock:
>  }
>
>  /*
> - * FOLL_FORCE can write to even unwritable pmd's, but only
> - * after we've gone through a COW cycle and they are dirty.
> + * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
> + * but only after we've gone through a COW cycle and they are dirty.
>   */
>  static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
>  {
> -	return pmd_write(pmd) ||
> -		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
> +	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
>  }
>
>  struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> --
> 2.33.0.882.g93a45727a2-goog
>

-- 
Jan Kara
SUSE Labs, CR