Date: Tue, 2 May 2023 13:54:41 +0100
From: Lorenzo Stoakes
To: Christian Borntraeger
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton,
 Jason Gunthorpe, Jens Axboe, Matthew Wilcox, Dennis Dalessandro,
 Leon Romanovsky, Christian Benvenuti, Nelson Escobar, Bernard Metzler,
 Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
 Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
 Bjorn Topel, Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
 "David S . Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
 Christian Brauner, Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
 Jesper Dangaard Brouer, John Fastabend, linux-fsdevel@vger.kernel.org,
 linux-perf-users@vger.kernel.org, netdev@vger.kernel.org,
 bpf@vger.kernel.org, Oleg Nesterov, Jason Gunthorpe, John Hubbard,
 Jan Kara, "Kirill A . Shutemov", Pavel Begunkov, Mika Penttila,
 David Hildenbrand, Dave Chinner, Theodore Ts'o, Peter Xu, Matthew Rosato
Subject: Re: [PATCH v6 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings
Message-ID: <7d56b424-ba79-4b21-b02c-c89705533852@lucifer.local>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote:
> On 02.05.23 at 01:11, Lorenzo Stoakes wrote:
> > Writing to file-backed dirty-tracked mappings via GUP is inherently broken
> > as we cannot rule out folios being cleaned and then a GUP user writing to
> > them again and possibly marking them dirty unexpectedly.
> >
> > This is especially egregious for long-term mappings (as indicated by the
> > use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
> > we have already done in the slow path.
>
> Hmm, does this interfere with KVM on s390 and PCI interpretation of interrupt delivery?
> It would no longer work with file-backed memory, correct?
>
> See
> arch/s390/kvm/pci.c
>
> kvm_s390_pci_aif_enable
> which does have
> FOLL_WRITE | FOLL_LONGTERM
> to
>

Does this memory map a dirty-tracked file?
It's kind of hard to dig into where the address originates from without
going through a ton of code. In the worst case, if the fast path doesn't
find a whitelisted mapping, it'll fall back to the slow path, which
explicitly checks for a dirty-tracked filesystem.

We can reintroduce a flag to permit exceptions if this is really broken.
Are you able to test? I don't have an s390 sat around :)

> >
> > We have access to less information in the fast path as we cannot examine
> > the VMA containing the mapping, however we can determine whether the folio
> > is anonymous and then whitelist known-good mappings - specifically hugetlb
> > and shmem mappings.
> >
> > While we obtain a stable folio for this check, the mapping might not be, as
> > a truncate could nullify it at any time. Since doing so requires mappings
> > to be zapped, we can synchronise against a TLB shootdown operation.
> >
> > For some architectures TLB shootdown is synchronised by IPI, against which
> > we are protected as the GUP-fast operation is performed with interrupts
> > disabled. However, other architectures which specify
> > CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation.
> >
> > In these instances, we acquire an RCU lock while performing our checks. If
> > we cannot get a stable mapping, we fall back to the slow path, as otherwise
> > we'd have to walk the page tables again and it's simpler and more effective
> > to just fall back.
> >
> > It's important to note that there are no APIs allowing users to specify
> > FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can
> > always rely on the fact that if we fail to pin on the fast path, the code
> > will fall back to the slow path which can perform the more thorough check.
> >
> > Suggested-by: David Hildenbrand
> > Suggested-by: Kirill A . Shutemov
> > Signed-off-by: Lorenzo Stoakes
> > ---
> >  mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 85 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 0f09dec0906c..431618048a03 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -18,6 +18,7 @@
> >  #include
> >  #include
> >  #include
> > +#include
> >  #include
> >  #include
> >
> > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> >  	return folio;
> >  }
> >
> > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > +static bool stabilise_mapping_rcu(struct folio *folio)
> > +{
> > +	struct address_space *mapping = READ_ONCE(folio->mapping);
> > +
> > +	rcu_read_lock();
> > +
> > +	return mapping == READ_ONCE(folio->mapping);
> > +}
> > +
> > +static void unlock_rcu(void)
> > +{
> > +	rcu_read_unlock();
> > +}
> > +#else
> > +static bool stabilise_mapping_rcu(struct folio *)
> > +{
> > +	return true;
> > +}
> > +
> > +static void unlock_rcu(void)
> > +{
> > +}
> > +#endif
> > +
> > +/*
> > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> > + * FOLL_WRITE pin is permitted for a specific folio.
> > + *
> > + * This assumes the folio is stable and pinned.
> > + *
> > + * Writing to pinned file-backed dirty tracked folios is inherently problematic
> > + * (see comment describing the writeable_file_mapping_allowed() function). We
> > + * therefore try to avoid the most egregious case of a long-term mapping doing
> > + * so.
> > + *
> > + * This function cannot be as thorough as that one as the VMA is not available
> > + * in the fast path, so instead we whitelist known good cases.
> > + *
> > + * The folio is stable, but the mapping might not be. When truncating for
> > + * instance, a zap is performed which triggers TLB shootdown. IRQs are disabled
> > + * so we are safe from an IPI, but some architectures use an RCU lock for this
> > + * operation, so we acquire an RCU lock to ensure the mapping is stable.
> > + */
> > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > +{
> > +	bool ret;
> > +
> > +	/* hugetlb mappings do not require dirty tracking. */
> > +	if (folio_test_hugetlb(folio))
> > +		return true;
> > +
> > +	if (stabilise_mapping_rcu(folio)) {
> > +		struct address_space *mapping = folio_mapping(folio);
> > +
> > +		/*
> > +		 * Neither anonymous nor shmem-backed folios require
> > +		 * dirty tracking.
> > +		 */
> > +		ret = folio_test_anon(folio) ||
> > +			(mapping && shmem_mapping(mapping));
> > +	} else {
> > +		/* If the mapping is unstable, fallback to the slow path. */
> > +		ret = false;
> > +	}
> > +
> > +	unlock_rcu();
> > +
> > +	return ret;
> > +}
> > +
> >  /**
> >   * try_grab_folio() - Attempt to get or pin a folio.
> >   * @page: pointer to page to be grabbed
> > @@ -123,6 +195,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> >   */
> >  struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> >  {
> > +	bool is_longterm = flags & FOLL_LONGTERM;
> > +
> >  	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
> >  		return NULL;
> >
> > @@ -136,8 +210,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> >  		 * right zone, so fail and let the caller fall back to the slow
> >  		 * path.
> >  		 */
> > -		if (unlikely((flags & FOLL_LONGTERM) &&
> > -			     !is_longterm_pinnable_page(page)))
> > +		if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
> >  			return NULL;
> >
> >  		/*
> > @@ -148,6 +221,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> >  		if (!folio)
> >  			return NULL;
> >
> > +		/*
> > +		 * Can this folio be safely pinned? We need to perform this
> > +		 * check after the folio is stabilised.
> > +		 */
> > +		if ((flags & FOLL_WRITE) && is_longterm &&
> > +		    !folio_longterm_write_pin_allowed(folio)) {
> > +			folio_put_refs(folio, refs);
> > +			return NULL;
> > +		}
> > +
> >  		/*
> >  		 * When pinning a large folio, use an exact count to track it.
> >  		 *