Message-ID: <521ecc3f-c45a-74bb-9c2b-2d6284589e15@suse.cz>
Date: Wed, 26 Oct 2022 12:52:01 +0200
Subject: Re: amusing SLUB compaction bug when CC_OPTIMIZE_FOR_SIZE
From: Vlastimil Babka
To: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Matthew Wilcox, Hugh Dickins, David Laight, Joel Fernandes,
 Andrew Morton, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 rcu@vger.kernel.org, "Paul E. McKenney"
References: <35502bdd-1a78-dea1-6ac3-6ff1bcc073fa@suse.cz>
 <7dddca4c-bc36-2cf0-de1c-a770bef9e1b7@suse.cz>
X-Mailing-List: linux-kernel@vger.kernel.org

On 10/25/22 16:08, Vlastimil Babka wrote:
> On 10/25/22 15:47, Hyeonggon Yoo wrote:
>> On Mon, Oct 24, 2022 at 04:35:04PM +0200, Vlastimil Babka wrote:
>>
>> [,,,]
>>
>>> diff --git a/mm/slab.c b/mm/slab.c
>>> index 59c8e28f7b6a..219beb48588e 100644
>>> --- a/mm/slab.c
>>> +++ b/mm/slab.c
>>> @@ -1370,6 +1370,8 @@ static struct slab *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
>>>  
>>>  	account_slab(slab, cachep->gfporder, cachep, flags);
>>>  	__folio_set_slab(folio);
>>> +	/* Make the flag visible before any changes to folio->mapping */
>>> +	smp_wmb();
>>>  	/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
>>>  	if (sk_memalloc_socks() && page_is_pfmemalloc(folio_page(folio, 0)))
>>>  		slab_set_pfmemalloc(slab);
>>> @@ -1387,9 +1389,11 @@ static void kmem_freepages(struct kmem_cache *cachep, struct slab *slab)
>>>  
>>>  	BUG_ON(!folio_test_slab(folio));
>>>  	__slab_clear_pfmemalloc(slab);
>>> -	__folio_clear_slab(folio);
>>>  	page_mapcount_reset(folio_page(folio, 0));
>>>  	folio->mapping = NULL;
>>> +	/* Make the mapping reset visible before clearing the flag */
>>> +	smp_wmb();
>>> +	__folio_clear_slab(folio);
>>>  
>>>  	if (current->reclaim_state)
>>>  		current->reclaim_state->reclaimed_slab += 1 << order;
>>> diff --git a/mm/slub.c b/mm/slub.c
>>> index 157527d7101b..6dc17cb915c5 100644
>>> --- a/mm/slub.c
>>> +++ b/mm/slub.c
>>> @@ -1800,6 +1800,8 @@ static inline struct slab *alloc_slab_page(gfp_t flags, int node,
>>>  
>>>  	slab = folio_slab(folio);
>>>  	__folio_set_slab(folio);
>>> +	/* Make the flag visible before any changes to folio->mapping */
>>> +	smp_wmb();
>>>  	if (page_is_pfmemalloc(folio_page(folio, 0)))
>>>  		slab_set_pfmemalloc(slab);
>>>  
>>> @@ -2008,8 +2010,10 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
>>>  	}
>>>  
>>>  	__slab_clear_pfmemalloc(slab);
>>> -	__folio_clear_slab(folio);
>>>  	folio->mapping = NULL;
>>> +	/* Make the mapping reset visible before clearing the flag */
>>> +	smp_wmb();
>>> +	__folio_clear_slab(folio);
>>>  	if (current->reclaim_state)
>>>  		current->reclaim_state->reclaimed_slab += pages;
>>>  	unaccount_slab(slab, order, s);
>>> -- 
>>> 2.38.0
>>
>> Do we need to try this with memory barriers before frozen refcount lands in?
>
> There was IIRC an unresolved issue with frozen refcount tripping the page
> isolation code so I didn't want to be depending on that.
>
>> It's quite complicated and IIUC there is a still theoretical race:
>>
>> At isolate_movable_page:        At slab alloc:                      At slab free:
>>                                 folio = alloc_pages(flags, order)
>>
>> folio_try_get()
>> folio_test_slab() == false
>>                                 __folio_set_slab(folio)
>>                                 smp_wmb()
>>
>>                                                                     call_rcu(&slab->rcu_head, rcu_free_slab);
>>
>>
>> smp_rmb()
>> __folio_test_movable() == true
>>
>>                                                                     folio->mapping = NULL;
>>                                                                     smp_wmb()
>>                                                                     __folio_clear_slab(folio);
>> smp_rmb()
>> folio_test_slab() == false
>>
>> folio_trylock()
>
> There's also between above and below:
>
> 	if (!PageMovable(page) || PageIsolated(page))
> 		goto out_no_isolated;
>
> 	mops = page_movable_ops(page);
>
> If we put another smp_rmb() before the PageMovable test, could that have
> helped? It would assure we observe the folio->mapping = NULL; from the "slab
> free" side?
>
> But yeah, it's getting ridiculous. Maybe there's a simpler way to check two
> bits in two different bytes atomically. Or maybe it's just an impossible
> task, I feel I just dunno computers at this point.

After more thought, I think I just made a mistake by doing two
folio_test_slab() tests around a single __folio_test_movable(). What I was
supposed to do was two __folio_test_movable() tests around a single
folio_test_slab()... I hope. That should take care of your scenario, or do
you see another one? Thanks.

----8<----
From 5ca1c10f6411d73ad579b58d4fa10326bf77cf0a Mon Sep 17 00:00:00 2001
From: Matthew Wilcox
Date: Mon, 24 Oct 2022 16:11:27 +0200
Subject: [PATCH] mm/migrate: make isolate_movable_page() skip slab pages

In the next commit we want to rearrange struct slab fields to allow a
larger rcu_head. Afterwards, the page->mapping field will overlap with
SLUB's "struct list_head slab_list", where the value of the prev pointer
can become LIST_POISON2, which is 0x122 + POISON_POINTER_DELTA.
Unfortunately, bit 1 being set there can make PageMovable() return a false
positive and cause a GPF, as reported by lkp [1].

I think the real problem here is that isolate_movable_page() is
insufficiently paranoid. Looking at the gyrations that GUP and the page
cache do to convince themselves that the page they got really is the page
they wanted, there are a few missing pieces (eg checking that you actually
got a refcount on _this_ page and not some random other page you were
temporarily part of a compound page with).

This patch does four things:

 - Turns one of the comments into English. There are some others
   which I'm still scratching my head over.
 - Uses a folio to help distinguish which operations are being done
   to the head vs the specific page (this is somewhat an abuse of the
   folio concept, but it's acceptable)
 - Adds the aforementioned check that we're actually operating on the
   page that we think we want to be.
 - Adds a check that the folio isn't secretly a slab.

We could put the slab check in PageMapping and call it after taking
the folio lock, but that seems pointless. It's the acquisition of
the refcount which stabilises the slab flag, not holding the lock.

[ vbabka@suse.cz: add memory barriers to SLAB and SLUB's page allocation
  and freeing, and their counterparts to isolate_movable_page(), to make
  the checks for folio_test_slab() and __folio_test_movable() SMP safe ]

[1] https://lore.kernel.org/all/208c1757-5edd-fd42-67d4-1940cc43b50f@intel.com/

Signed-off-by: Vlastimil Babka
---
 mm/migrate.c | 40 ++++++++++++++++++++++++++++------------
 mm/slab.c    |  6 +++++-
 mm/slub.c    |  6 +++++-
 3 files changed, 38 insertions(+), 14 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 1379e1912772..f0f58e42c1d4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -60,6 +60,7 @@
 
 int isolate_movable_page(struct page *page, isolate_mode_t mode)
 {
+	struct folio *folio = page_folio(page);
 	const struct movable_operations *mops;
 
 	/*
@@ -71,16 +72,31 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
 	 * the put_page() at the end of this block will take care of
 	 * release this page, thus avoiding a nasty leakage.
 	 */
-	if (unlikely(!get_page_unless_zero(page)))
+	if (unlikely(!folio_try_get(folio)))
 		goto out;
 
+	/* Recheck the page is still part of the folio we just got */
+	if (unlikely(page_folio(page) != folio))
+		goto out_put;
+
 	/*
-	 * Check PageMovable before holding a PG_lock because page's owner
-	 * assumes anybody doesn't touch PG_lock of newly allocated page
-	 * so unconditionally grabbing the lock ruins page's owner side.
+	 * Check movable flag before taking the folio lock because
+	 * we use non-atomic bitops on newly allocated page flags so
+	 * unconditionally grabbing the lock ruins page's owner side.
+	 * Make sure we don't have a slab folio here as its usage of the
+	 * mapping field can cause a false positive movable flag.
 	 */
-	if (unlikely(!__PageMovable(page)))
-		goto out_putpage;
+	if (unlikely(!__folio_test_movable(folio)))
+		goto out_put;
+	/* Pairs with smp_wmb() in slab allocation, e.g. SLUB's alloc_slab_page() */
+	smp_rmb();
+	if (unlikely(folio_test_slab(folio)))
+		goto out_put;
+	/* Pairs with smp_wmb() in slab freeing, e.g. SLUB's __free_slab() */
+	smp_rmb();
+	if (unlikely(!__folio_test_movable(folio)))
+		goto out_put;
+
 	/*
 	 * As movable pages are not isolated from LRU lists, concurrent
 	 * compaction threads can race against page migration functions
@@ -92,8 +108,8 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
 	 * lets be sure we have the page lock
 	 * before proceeding with the movable page isolation steps.
 	 */
-	if (unlikely(!trylock_page(page)))
-		goto out_putpage;
+	if (unlikely(!folio_trylock(folio)))
+		goto out_put;
 
 	if (!PageMovable(page) || PageIsolated(page))
 		goto out_no_isolated;
@@ -107,14 +123,14 @@ int isolate_movable_page(struct page *page, isolate_mode_t mode)
 	/* Driver shouldn't use PG_isolated bit of page->flags */
 	WARN_ON_ONCE(PageIsolated(page));
 	SetPageIsolated(page);
-	unlock_page(page);
+	folio_unlock(folio);
 
 	return 0;
 
 out_no_isolated:
-	unlock_page(page);
-out_putpage:
-	put_page(page);
+	folio_unlock(folio);
+out_put:
+	folio_put(folio);
 out:
 	return -EBUSY;
 }
diff --git a/mm/slab.c b/mm/slab.c
index 59c8e28f7b6a..219beb48588e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1370,6 +1370,8 @@ static struct slab *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 
 	account_slab(slab, cachep->gfporder, cachep, flags);
 	__folio_set_slab(folio);
+	/* Make the flag visible before any changes to folio->mapping */
+	smp_wmb();
 	/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
 	if (sk_memalloc_socks() && page_is_pfmemalloc(folio_page(folio, 0)))
 		slab_set_pfmemalloc(slab);
@@ -1387,9 +1389,11 @@ static void kmem_freepages(struct kmem_cache *cachep, struct slab *slab)
 
 	BUG_ON(!folio_test_slab(folio));
 	__slab_clear_pfmemalloc(slab);
-	__folio_clear_slab(folio);
 	page_mapcount_reset(folio_page(folio, 0));
 	folio->mapping = NULL;
+	/* Make the mapping reset visible before clearing the flag */
+	smp_wmb();
+	__folio_clear_slab(folio);
 
 	if (current->reclaim_state)
 		current->reclaim_state->reclaimed_slab += 1 << order;
diff --git a/mm/slub.c b/mm/slub.c
index 99ba865afc4a..5e6519d5169c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1800,6 +1800,8 @@ static inline struct slab *alloc_slab_page(gfp_t flags, int node,
 
 	slab = folio_slab(folio);
 	__folio_set_slab(folio);
+	/* Make the flag visible before any changes to folio->mapping */
+	smp_wmb();
 	if (page_is_pfmemalloc(folio_page(folio, 0)))
 		slab_set_pfmemalloc(slab);
 
@@ -2000,8 +2002,10 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
 	int pages = 1 << order;
 
 	__slab_clear_pfmemalloc(slab);
-	__folio_clear_slab(folio);
 	folio->mapping = NULL;
+	/* Make the mapping reset visible before clearing the flag */
+	smp_wmb();
+	__folio_clear_slab(folio);
 	if (current->reclaim_state)
 		current->reclaim_state->reclaimed_slab += pages;
 	unaccount_slab(slab, order, s);
-- 
2.38.0
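
For readers who want to poke at the check ordering outside the kernel, here is a rough userspace sketch in C11. It is not kernel code and not part of the patch. The struct model_folio type and every model_* helper are hypothetical names made up for illustration. The acquire/release fences are only approximate stand-ins for smp_rmb()/smp_wmb(). The PAGE_MAPPING_* values are the ones I believe the kernel headers use, and POISON_POINTER_DELTA is assumed to be the x86-64 default. The alloc helper stores the poison value right away, which simplifies "some later list operation writes LIST_POISON2 into the overlapping field". The sketch shows two things: LIST_POISON2 alone satisfies the movable test, and the movable / slab / movable sequence from the patch rejects such a folio.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values: low mapping bits as in the kernel headers, x86-64 poison delta. */
#define PAGE_MAPPING_ANON     UINT64_C(0x1)
#define PAGE_MAPPING_MOVABLE  UINT64_C(0x2)
#define PAGE_MAPPING_FLAGS    (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
#define POISON_POINTER_DELTA  UINT64_C(0xdead000000000000)
#define LIST_POISON2          (UINT64_C(0x122) + POISON_POINTER_DELTA)

/* 0x122 & 0x3 == 0x2, so the poison value by itself looks "movable". */
static_assert((LIST_POISON2 & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_MOVABLE,
              "LIST_POISON2 aliases the movable mapping flag");

/* Hypothetical stand-in for struct folio: only the two fields being raced on. */
struct model_folio {
    _Atomic bool slab_flag;    /* models the slab page flag */
    _Atomic uint64_t mapping;  /* models folio->mapping (overlaps slab_list.prev) */
};

/* Models __folio_test_movable(): looks only at the low bits of mapping. */
static inline bool model_test_movable(struct model_folio *f)
{
    uint64_t m = atomic_load_explicit(&f->mapping, memory_order_relaxed);
    return (m & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_MOVABLE;
}

/* Models folio_test_slab(). */
static inline bool model_test_slab(struct model_folio *f)
{
    return atomic_load_explicit(&f->slab_flag, memory_order_relaxed);
}

/* Reader side: movable, barrier, slab, barrier, movable, as in the patch. */
static inline bool model_may_isolate(struct model_folio *f)
{
    if (!model_test_movable(f))
        return false;
    atomic_thread_fence(memory_order_acquire); /* stand-in for smp_rmb() */
    if (model_test_slab(f))
        return false;
    atomic_thread_fence(memory_order_acquire); /* stand-in for smp_rmb() */
    if (!model_test_movable(f))
        return false;
    return true;
}

/* Writer side, allocation: set the flag before touching mapping. */
static inline void model_slab_alloc(struct model_folio *f)
{
    atomic_store_explicit(&f->slab_flag, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_release); /* stand-in for smp_wmb() */
    atomic_store_explicit(&f->mapping, LIST_POISON2, memory_order_relaxed);
}

/* Writer side, freeing: reset mapping before clearing the flag. */
static inline void model_slab_free(struct model_folio *f)
{
    atomic_store_explicit(&f->mapping, 0, memory_order_relaxed);
    atomic_thread_fence(memory_order_release); /* stand-in for smp_wmb() */
    atomic_store_explicit(&f->slab_flag, false, memory_order_relaxed);
}
```

Run single-threaded, this only demonstrates the logic of the checks and proves nothing about the race itself. Arguing about the concurrent case would need something like a litmus test against the kernel memory model.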