Received: by 2002:a05:6358:9144:b0:117:f937:c515 with SMTP id r4csp1320770rwr; Fri, 5 May 2023 12:13:01 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ7FiHF+3HQ/MPB2O6GloRIZKkcelVJjJ+bdBbj5Ndk7i+//zgZCWVS2e4S6IDV/P9ArAsq6 X-Received: by 2002:a17:902:eccd:b0:1a6:4127:857 with SMTP id a13-20020a170902eccd00b001a641270857mr3390071plh.5.1683313981283; Fri, 05 May 2023 12:13:01 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1683313981; cv=none; d=google.com; s=arc-20160816; b=UJeT54AkqfjNhYOekZAmdIZeMQ8yzxOz7MgUv7SKP+Uix/2yr18VMiL0UHHagz2GWR iz2EBzO2BcP8XuSJrrD+l62r/SSO1aMt4mmh4IUAtb0LYC8sGZWxjTGUT1LApj2sBZZh LMfka1h3yLxiESj1TScGhzSGSC0uFCEyWAdntT1h+HDwZYunr6nFxXZ6GDuSFKmyjulG gs1iYFHTdbjTM9BO6og9pN29o8xvqBXHPe6W6TzWqzo/tRk0CRi0thNbtl4M6dESt0Uc wzmFM5EwHJKkY900MoGM5CtAQMwe+TWunW8DTa3vwKtOCkWN9v/eI7nXasPdRMVSbJLV l8mg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=O3Pwr7B80kSqA1ofTGTJdRspp4l8nHMqdJ66SIuLoGc=; b=zaDVzep9pfPc5LIedBMlBFJkpvUxEl7IJkeDpKK/6CyWP1kovPSNSId1CA26YGtUs9 O6C+p3Sn5CSA+ZQSz1pdvfTttPBgvPbaxeAMeloVKlqg3LJb89X4X1hry3S4NZP9Hpk8 lkwgzpr4zVt55IamftpABx7Th9JK+yKN5/Y7Ri8ne1UO+gaZ7A5FtEfjre7qc13Mz/mm S3auFrUtPTo3Yp4xv7vIPARhtN7TC8CSjuCWXdfzi5FhpPPPB7i4ro5VBZxM5ZZmrZyH 1DRnNez6Bphat0KkiYSYVwOIo8bo1r9GhpDg3HGx3c3kax4WLLomVegC2N+knmh1+CYn xNXQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20221208 header.b=DlaHdWVd; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id i14-20020a17090332ce00b001ab1279d418si2522723plr.474.2023.05.05.12.12.47; Fri, 05 May 2023 12:13:01 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20221208 header.b=DlaHdWVd; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233327AbjEESvG (ORCPT + 99 others); Fri, 5 May 2023 14:51:06 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58304 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233306AbjEESvF (ORCPT ); Fri, 5 May 2023 14:51:05 -0400 Received: from mail-pl1-x635.google.com (mail-pl1-x635.google.com [IPv6:2607:f8b0:4864:20::635]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 772681891B for ; Fri, 5 May 2023 11:50:56 -0700 (PDT) Received: by mail-pl1-x635.google.com with SMTP id d9443c01a7336-1ab14cb3aaeso15290485ad.2 for ; Fri, 05 May 2023 11:50:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1683312656; x=1685904656; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc:subject:date:message-id:reply-to; bh=O3Pwr7B80kSqA1ofTGTJdRspp4l8nHMqdJ66SIuLoGc=; b=DlaHdWVdiGOlAxw1awjEx6laGYPzXJmug8SJKO7DDHkWd9lRDfbQ8sb2k1sGrYQ54Y i36SixdRsHDPZk0PW58feDv1msCBEixpiwD01tZvjfJg2wwOCBh26DVZS/1VrvYMLlwa ZI9xMUbJObBc+YvM6PWDj+py7p2KfPtaiOCiV+qYAntA49bSo2hLF6zbYG9vc3P7wiEr 9nQBFMM5/UAUKgiTqCcJcKfF0zqbx3e1hGupkr5dqb6g/t459svxbwligH/gQRv0PIoJ JlDc2Yq/oRp9EXjUAG2ucuDxQSf9qk1/Fl1M0latVJtNZ/C5WvdkvJ9SGqCiI/q3z+0y Gf+w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1683312656; x=1685904656; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=O3Pwr7B80kSqA1ofTGTJdRspp4l8nHMqdJ66SIuLoGc=; b=j+gY+RDDTW0YoZUxxgEfd27ZnZy8UJQgxkCeYPMjXvF7p19DwXviE2HTfdiF1vDXBx hCR/VWf1ozttFegW85DEm0Y0apIBQCX1EpC2uk0SAJ45Jnwq8nDVM++nfGXWQ9jQI0m1 GhDWtnflGTDtYTKf7YYopu/tKZnwTokM81lcbgQNL5Pp1UKTgAbAwYY5NK0TrrZmbWpG w4PniZOLUYPRKeaTe8dVqzP2vQ/PtqCV6s+hdtCSFvkbUci0dyYfg+zbHP7oM9jRU+0/ JmKYA3qPKdEsDkNmNaC1tjVZOfv7ilbkqBkeakqfbmleURtV8foAhUCmZcsGH7E4kK/x +B5A== X-Gm-Message-State: AC+VfDyYk9qVV38S5q8Tf+m0fV/pmJY6TLj5SS1rEQPLuRpG2MigDmEW KEOEfOiLCRBRdJh5eun/HtY= X-Received: by 2002:a17:902:ef94:b0:1a9:5e33:72db with SMTP id iz20-20020a170902ef9400b001a95e3372dbmr2302214plb.28.1683312655537; Fri, 05 May 2023 11:50:55 -0700 (PDT) Received: from localhost (fwdproxy-prn-023.fbsv.net. [2a03:2880:ff:17::face:b00c]) by smtp.gmail.com with ESMTPSA id bc6-20020a170902930600b001a9884c02e3sm2128695plb.10.2023.05.05.11.50.54 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 05 May 2023 11:50:55 -0700 (PDT) From: Nhat Pham To: akpm@linux-foundation.org Cc: hannes@cmpxchg.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, minchan@kernel.org, ngupta@vflare.org, senozhatsky@chromium.org, sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com, kernel-team@meta.com Subject: [PATCH] zsmalloc: move LRU update from zs_map_object() to zs_malloc() Date: Fri, 5 May 2023 11:50:54 -0700 Message-Id: <20230505185054.2417128-1-nphamcs@gmail.com> X-Mailer: git-send-email 2.34.1 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM, RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Under memory pressure, we sometimes observe the following crash: [ 5694.832838] ------------[ cut here ]------------ [ 5694.842093] list_del corruption, ffff888014b6a448->next is LIST_POISON1 (dead000000000100) [ 5694.858677] WARNING: CPU: 33 PID: 418824 at lib/list_debug.c:47 __list_del_entry_valid+0x42/0x80 [ 5694.961820] CPU: 33 PID: 418824 Comm: fuse_counters.s Kdump: loaded Tainted: G S 5.19.0-0_fbk3_rc3_hoangnhatpzsdynshrv41_10870_g85a9558a25de #1 [ 5694.990194] Hardware name: Wiwynn Twin Lakes MP/Twin Lakes Passive MP, BIOS YMM16 05/24/2021 [ 5695.007072] RIP: 0010:__list_del_entry_valid+0x42/0x80 [ 5695.017351] Code: 08 48 83 c2 22 48 39 d0 74 24 48 8b 10 48 39 f2 75 2c 48 8b 51 08 b0 01 48 39 f2 75 34 c3 48 c7 c7 55 d7 78 82 e8 4e 45 3b 00 <0f> 0b eb 31 48 c7 c7 27 a8 70 82 e8 3e 45 3b 00 0f 0b eb 21 48 c7 [ 5695.054919] RSP: 0018:ffffc90027aef4f0 EFLAGS: 00010246 [ 5695.065366] RAX: 41fe484987275300 RBX: ffff888008988180 RCX: 0000000000000000 [ 5695.079636] RDX: ffff88886006c280 RSI: ffff888860060480 RDI: ffff888860060480 [ 5695.093904] RBP: 0000000000000002 R08: 0000000000000000 R09: ffffc90027aef370 [ 5695.108175] R10: 0000000000000000 R11: ffffffff82fdf1c0 R12: 0000000010000002 [ 5695.122447] R13: ffff888014b6a448 R14: ffff888014b6a420 R15: 00000000138dc240 [ 5695.136717] FS: 00007f23a7d3f740(0000) GS:ffff888860040000(0000) knlGS:0000000000000000 [ 5695.152899] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5695.164388] CR2: 0000560ceaab6ac0 CR3: 000000001c06c001 CR4: 00000000007706e0 [ 5695.178659] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 5695.192927] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 5695.207197] PKRU: 55555554 [ 5695.212602] Call Trace: [ 5695.217486] [ 5695.221674] zs_map_object+0x91/0x270 [ 5695.229000] zswap_frontswap_store+0x33d/0x870 [ 5695.237885] ? do_raw_spin_lock+0x5d/0xa0 [ 5695.245899] __frontswap_store+0x51/0xb0 [ 5695.253742] swap_writepage+0x3c/0x60 [ 5695.261063] shrink_page_list+0x738/0x1230 [ 5695.269255] shrink_lruvec+0x5ec/0xcd0 [ 5695.276749] ? shrink_slab+0x187/0x5f0 [ 5695.284240] ? mem_cgroup_iter+0x6e/0x120 [ 5695.292255] shrink_node+0x293/0x7b0 [ 5695.299402] do_try_to_free_pages+0xea/0x550 [ 5695.307940] try_to_free_pages+0x19a/0x490 [ 5695.316126] __folio_alloc+0x19ff/0x3e40 [ 5695.323971] ? __filemap_get_folio+0x8a/0x4e0 [ 5695.332681] ? walk_component+0x2a8/0xb50 [ 5695.340697] ? generic_permission+0xda/0x2a0 [ 5695.349231] ? __filemap_get_folio+0x8a/0x4e0 [ 5695.357940] ? walk_component+0x2a8/0xb50 [ 5695.365955] vma_alloc_folio+0x10e/0x570 [ 5695.373796] ? walk_component+0x52/0xb50 [ 5695.381634] wp_page_copy+0x38c/0xc10 [ 5695.388953] ? filename_lookup+0x378/0xbc0 [ 5695.397140] handle_mm_fault+0x87f/0x1800 [ 5695.405157] do_user_addr_fault+0x1bd/0x570 [ 5695.413520] exc_page_fault+0x5d/0x110 [ 5695.421017] asm_exc_page_fault+0x22/0x30 After some investigation, I have found the following issue: unlike other zswap backends, zsmalloc performs the LRU list update at the object mapping time, rather than when the slot for the object is allocated. This deviation was discussed and agreed upon during the review process of the zsmalloc writeback patch series: https://lore.kernel.org/lkml/Y3flcAXNxxrvy3ZH@cmpxchg.org/ Unfortunately, this introduces a subtle bug that occurs when there is a concurrent store and reclaim, which interleave as follows: zswap_frontswap_store() shrink_worker() zs_malloc() zs_zpool_shrink() spin_lock(&pool->lock) zs_reclaim_page() zspage = find_get_zspage() spin_unlock(&pool->lock) spin_lock(&pool->lock) zspage = list_first_entry(&pool->lru) list_del(&zspage->lru) zspage->lru.next = LIST_POISON1 zspage->lru.prev = LIST_POISON2 spin_unlock(&pool->lock) zs_map_object() spin_lock(&pool->lock) if (!list_empty(&zspage->lru)) list_del(&zspage->lru) CHECK_DATA_CORRUPTION(next == LIST_POISON1) /* BOOM */ With the current upstream code, this issue rarely happens. zswap only triggers writeback when the pool is already full, at which point all further store attempts are short-circuited. This creates an implicit pseudo-serialization between reclaim and store. I am working on a new zswap shrinking mechanism, which makes interleaving reclaim and store more likely, exposing this bug. zbud and z3fold do not have this problem, because they perform the LRU list update in the alloc function, while still holding the pool's lock. This patch fixes the aforementioned bug by moving the LRU update back to zs_malloc(), analogous to zbud and z3fold. Suggested-by: Johannes Weiner Acked-by: Johannes Weiner Signed-off-by: Nhat Pham --- mm/zsmalloc.c | 36 +++++++++--------------------------- 1 file changed, 9 insertions(+), 27 deletions(-) diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 44ddaf5d601e..02f7f414aade 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -1331,31 +1331,6 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle, obj_to_location(obj, &page, &obj_idx); zspage = get_zspage(page); -#ifdef CONFIG_ZPOOL - /* - * Move the zspage to front of pool's LRU. - * - * Note that this is swap-specific, so by definition there are no ongoing - * accesses to the memory while the page is swapped out that would make - * it "hot". A new entry is hot, then ages to the tail until it gets either - * written back or swaps back in. - * - * Furthermore, map is also called during writeback. We must not put an - * isolated page on the LRU mid-reclaim. - * - * As a result, only update the LRU when the page is mapped for write - * when it's first instantiated. - * - * This is a deviation from the other backends, which perform this update - * in the allocation function (zbud_alloc, z3fold_alloc). - */ - if (mm == ZS_MM_WO) { - if (!list_empty(&zspage->lru)) - list_del(&zspage->lru); - list_add(&zspage->lru, &pool->lru); - } -#endif - /* * migration cannot move any zpages in this zspage. Here, pool->lock * is too heavy since callers would take some time until they calls @@ -1525,9 +1500,8 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp) fix_fullness_group(class, zspage); record_obj(handle, obj); class_stat_inc(class, ZS_OBJS_INUSE, 1); - spin_unlock(&pool->lock); - return handle; + goto out; } spin_unlock(&pool->lock); @@ -1550,6 +1524,14 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp) /* We completely set up zspage so mark them as movable */ SetZsPageMovable(pool, zspage); +out: +#ifdef CONFIG_ZPOOL + /* Add/move zspage to beginning of LRU */ + if (!list_empty(&zspage->lru)) + list_del(&zspage->lru); + list_add(&zspage->lru, &pool->lru); +#endif + spin_unlock(&pool->lock); return handle; -- 2.34.1