X-Mailing-List: linux-kernel@vger.kernel.org
References: <951fb7edab535cf522def4f5f2613947ed7b7d28.1701853894.git.henry.hj@antgroup.com>
In-Reply-To:
 <951fb7edab535cf522def4f5f2613947ed7b7d28.1701853894.git.henry.hj@antgroup.com>
From: Yu Zhao
Date: Fri, 15 Dec 2023 00:23:52 -0700
Subject: Re: [RFC v2] mm: Multi-Gen LRU: fix use mm/page_idle/bitmap
To: Henry Huang
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, 谈鉴锋, 朱辉(茶水), akpm@linux-foundation.org

On Wed, Dec 6, 2023 at 5:51 AM Henry Huang wrote:
>
> Multi-Gen LRU page-table walker clears pte young flag, but it doesn't
> clear page idle flag. When we use /sys/kernel/mm/page_idle/bitmap to check
> whether one page is accessed, it would tell us this page is idle,
> but actually this page has been accessed.
>
> For those unmapped filecache pages, page idle flag would not be
> cleared in folio_mark_accessed if Multi-Gen LRU is enabled.
> So we couldn't use /sys/kernel/mm/page_idle/bitmap to check whether
> a filecache page is read or written.
>
> What's more, /sys/kernel/mm/page_idle/bitmap also clears pte young flag.
> If one page is accessed, it would set page young flag. Multi-Gen LRU
> page-table walker should check both page&pte young flags.
>
> how-to-reproduce-problem
>
> idle_page_track
>    a tool to track the memory a process accessed during a specific time
> usage
>    idle_page_track $pid $time
> how-it-works
>    1. scan process vma from /proc/$pid/maps
>    2. vfn --> pfn from /proc/$pid/pagemap
>    3. write /sys/kernel/mm/page_idle/bitmap to
>       mark phy page idle flag and clear pte young flag
>    4. sleep $time
>    5. read /sys/kernel/mm/page_idle/bitmap to
>       test_and_clear pte young flag and
>       return whether phy page is accessed
>
> test ---- test program
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
>
> int main(int argc, const char *argv[])
> {
>     char *buf = NULL;
>     char pipe_info[4096];
>     int n;
>     int fd = -1;
>
>     buf = malloc(1024*1024*1024UL);
>     memset(buf, 0, 1024*1024*1024UL);
>     fd = open("access.pipe", O_RDONLY);
>     if (fd < 0)
>         goto out;
>     while (1) {
>         n = read(fd, pipe_info, sizeof(pipe_info));
>         if (!n) {
>             sleep(1);
>             continue;
>         } else if (n < 0) {
>             break;
>         }
>         memset(buf, 0, 1024*1024*1024UL);
>         puts("finish access");
>     }
> out:
>     if (fd >= 0)
>         close(fd);
>     if (buf)
>         free(buf);
>
>     return 0;
> }
>
> prepare:
> mkfifo access.pipe
> ./test
> ps -ef | grep test
> root       4106   3148  8 06:47 pts/0    00:00:01 ./test
>
> We use /sys/kernel/debug/lru_gen to simulate mglru page-table scan.
>
> case 1: mglru walker breaks page_idle
> ./idle_page_track 4106 60 &
> sleep 5; echo 1 > access.pipe
> sleep 5; echo '+ 8 0 6 1 1' > /sys/kernel/debug/lru_gen
>
> the output of idle_page_track is:
> Est(s)     Ref(MB)
> 64.822        1.00
> only found 1MB were accessed during 64.822s, but actually 1024MB were
> accessed.
>
> case 2: page_idle breaks mglru walker
> echo 1 > access.pipe
> ./idle_page_track 4106 10
> echo '+ 8 0 7 1 1' > /sys/kernel/debug/lru_gen
> lru gen status:
> memcg     8 /user.slice
>  node     0
>           5     772458       1065        9735
>           6     737435     262244          72
>           7     538053       1184         632
>           8      59404       6422           0
> almost all pages should be in max_seq-1 queue, but actually not.
>
> Signed-off-by: Henry Huang

Regarding the change itself, it'd cause a slight regression to other
use cases (details below).
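
As an aside for anyone reproducing this: the bookkeeping behind steps 2, 3
and 5 of the quoted how-it-works list could look roughly like the sketch
below. It assumes the documented file layouts: a /proc/$pid/pagemap entry is
64 bits wide, with bit 63 meaning "page present" and bits 0-54 holding the
PFN, and /sys/kernel/mm/page_idle/bitmap is an array of 8-byte words in which
PFN i maps to bit i % 64 of word i / 64. The helper names are made up for
illustration; they are not taken from idle_page_track.

```c
#include <stdint.h>

/* Layout of a /proc/$pid/pagemap entry (one 64-bit entry per virtual page). */
#define PM_PRESENT  (1ULL << 63)          /* page is present in RAM */
#define PM_PFN_MASK ((1ULL << 55) - 1)    /* bits 0-54: page frame number */

/*
 * Hypothetical helper: extract the PFN from a pagemap entry.
 * Returns 1 and stores the PFN if the page is present, 0 otherwise.
 */
static int pagemap_entry_to_pfn(uint64_t entry, uint64_t *pfn)
{
    if (!(entry & PM_PRESENT))
        return 0;
    *pfn = entry & PM_PFN_MASK;
    return 1;
}

/*
 * Hypothetical helpers: locate a PFN inside the page_idle bitmap.
 * The bitmap must be read and written in 8-byte units, so the file
 * offset is always a multiple of 8.
 */
static uint64_t idle_word_offset(uint64_t pfn)
{
    return (pfn / 64) * 8;    /* byte offset of the 8-byte word */
}

static uint64_t idle_bit_mask(uint64_t pfn)
{
    return 1ULL << (pfn % 64);    /* bit within that word */
}
```

A tracker would pread() 8 bytes at offset (vaddr / page_size) * 8 from
pagemap, then pwrite() the mask at idle_word_offset(pfn) to mark the page
idle, and later pread() the same word to see whether the bit is still set.
Both files need root to be useful, since pagemap hides PFNs from
unprivileged readers.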

> @@ -3355,6 +3359,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
>                 unsigned long pfn;
>                 struct folio *folio;
>                 pte_t ptent = ptep_get(pte + i);
> +               bool is_pte_young;
>
>                 total++;
>                 walk->mm_stats[MM_LEAF_TOTAL]++;
> @@ -3363,16 +3368,20 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
>                 if (pfn == -1)
>                         continue;
>
> -               if (!pte_young(ptent)) {
> -                       walk->mm_stats[MM_LEAF_OLD]++;

Most overhead from page table scanning normally comes from
get_pfn_folio() because it almost always causes a cache miss. This is
like a pointer dereference, whereas scanning PTEs is like streaming an
array (bad vs good cache performance).

pte_young() is here to avoid an unnecessary cache miss from
get_pfn_folio(). Also see the first comment in get_pfn_folio(). It
should be easy to verify the regression -- FlameGraph from the
memcached benchmark in the original commit message should do it.

Would a tracepoint here work for you?

> +               is_pte_young = !!pte_young(ptent);
> +               folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, is_pte_young);
> +               if (!folio) {
> +                       if (!is_pte_young)
> +                               walk->mm_stats[MM_LEAF_OLD]++;
>                         continue;
>                 }
>
> -               folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
> -               if (!folio)
> +               if (!folio_test_clear_young(folio) && !is_pte_young) {
> +                       walk->mm_stats[MM_LEAF_OLD]++;
>                         continue;
> +               }
>
> -               if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> +               if (is_pte_young && !ptep_test_and_clear_young(args->vma, addr, pte + i))
>                         VM_WARN_ON_ONCE(true);
>
>                 young++;
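
To make the cache argument above concrete, here is a toy userspace sketch,
not kernel code. It assumes a flat array of PTE-like words and a parallel
array of pointers to per-page structures scattered elsewhere in memory; every
name in it (toy_folio, toy_walk, TOY_PTE_YOUNG) is hypothetical. The point is
only the ordering: the cheap bit test on the streamed array comes first, so
entries that are not young never pay for the pointer dereference.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PTE_YOUNG (1u << 0)    /* hypothetical "accessed" bit */

/* Stand-in for struct folio: reached through a pointer, likely a cache miss. */
struct toy_folio {
    int gen;
};

struct toy_walk_stats {
    size_t total;     /* entries scanned */
    size_t old;       /* entries skipped without touching the folio */
    size_t young;     /* entries whose folio was updated */
    size_t derefs;    /* folio dereferences actually performed */
};

/*
 * Stream over ptes[] and only dereference folios[i] when the young bit is
 * set, the same ordering as pte_young() before get_pfn_folio().
 */
static void toy_walk(uint32_t *ptes, struct toy_folio **folios, size_t n,
                     int new_gen, struct toy_walk_stats *st)
{
    for (size_t i = 0; i < n; i++) {
        st->total++;

        /* Cheap check on the sequentially scanned array. */
        if (!(ptes[i] & TOY_PTE_YOUNG)) {
            st->old++;
            continue;
        }

        /* Expensive part: follow the pointer to the per-page structure. */
        st->derefs++;
        folios[i]->gen = new_gen;

        ptes[i] &= ~TOY_PTE_YOUNG;    /* test-and-clear analogue */
        st->young++;
    }
}
```

With the patch as posted, the dereference would move above the bit test, so
derefs would equal total in this model rather than young.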