Date: Tue, 17 May 2022 19:46:33 -0600
In-Reply-To: <20220518014632.922072-1-yuzhao@google.com>
Message-Id: <20220518014632.922072-15-yuzhao@google.com>
References: <20220518014632.922072-1-yuzhao@google.com>
X-Mailer: git-send-email 2.36.0.550.gb090851708-goog
Subject: [PATCH v11 14/14] mm: multi-gen LRU: design doc
From: Yu Zhao
To: Andrew Morton, linux-mm@kvack.org
Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
    Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet,
    Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel,
    Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo,
    Vlastimil Babka, Will Deacon, linux-arm-kernel@lists.infradead.org,
    linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
    x86@kernel.org, page-reclaim@google.com, Yu Zhao, Brian Geffon,
    Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
    Suleiman Souhlal, Daniel Byrne, Donald Carr, Holger Hoffstätte,
    Konstantin Kharlamov, Shuang Zhai, Sofia Trinh, Vaibhav Jain
X-Mailing-List: linux-kernel@vger.kernel.org

Add a design doc.

Signed-off-by: Yu Zhao
Acked-by: Brian Geffon
Acked-by: Jan Alexander Steffens (heftig)
Acked-by: Oleksandr Natalenko
Acked-by: Steven Barrett
Acked-by: Suleiman Souhlal
Tested-by: Daniel Byrne
Tested-by: Donald Carr
Tested-by: Holger Hoffstätte
Tested-by: Konstantin Kharlamov
Tested-by: Shuang Zhai
Tested-by: Sofia Trinh
Tested-by: Vaibhav Jain
---
 Documentation/vm/index.rst        |   1 +
 Documentation/vm/multigen_lru.rst | 159 ++++++++++++++++++++++++++++++
 2 files changed, 160 insertions(+)
 create mode 100644 Documentation/vm/multigen_lru.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index 44365c4574a3..b48434300226 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -25,6 +25,7 @@ algorithms. If you are looking for advice on simply allocating memory, see the
    ksm
    memory-model
    mmu_notifier
+   multigen_lru
    numa
    overcommit-accounting
    page_migration
diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
new file mode 100644
index 000000000000..bc8eaf1b956c
--- /dev/null
+++ b/Documentation/vm/multigen_lru.rst
@@ -0,0 +1,159 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Multi-Gen LRU
+=============
+The multi-gen LRU is an alternative LRU implementation that optimizes
+page reclaim and improves performance under memory pressure. Page
+reclaim decides the kernel's caching policy and ability to overcommit
+memory. It directly impacts the kswapd CPU usage and RAM efficiency.
+
+Design overview
+===============
+Objectives
+----------
+The design objectives are:
+
+* Good representation of access recency
+* Try to profit from spatial locality
+* Fast paths to make obvious choices
+* Simple self-correcting heuristics
+
+The representation of access recency is at the core of all LRU
+implementations. In the multi-gen LRU, each generation represents a
+group of pages with similar access recency. Generations establish a
+(time-based) common frame of reference and therefore help make better
+choices, e.g., between different memcgs on a computer or different
+computers in a data center (for job scheduling).
+
+Exploiting spatial locality improves efficiency when gathering the
+accessed bit. A rmap walk targets a single page and does not try to
+profit from discovering a young PTE. A page table walk can sweep all
+the young PTEs in an address space, but the address space can be too
+sparse to make a profit. The key is to optimize both methods and use
+them in combination.
+
+Fast paths reduce code complexity and runtime overhead. Unmapped pages
+do not require TLB flushes; clean pages do not require writeback.
+These facts are only helpful when other conditions, e.g., access
+recency, are similar. With generations as a common frame of reference,
+additional factors stand out. But obvious choices might not be good
+choices; thus self-correction is necessary.
+
+The benefits of simple self-correcting heuristics are self-evident.
+Again, with generations as a common frame of reference, this becomes
+attainable. Specifically, pages in the same generation can be
+categorized based on additional factors, and a feedback loop can
+statistically compare the refault percentages across those categories
+and infer which of them are better choices.
+
+Assumptions
+-----------
+The protection of hot pages and the selection of cold pages are based
+on page access channels and patterns. There are two access channels:
+
+* Accesses through page tables
+* Accesses through file descriptors
+
+The protection of the former channel is by design stronger because:
+
+1. The uncertainty in determining the access patterns of the former
+   channel is higher due to the approximation of the accessed bit.
+2. The cost of evicting the former channel is higher due to the TLB
+   flushes required and the likelihood of encountering the dirty bit.
+3. The penalty of underprotecting the former channel is higher because
+   applications usually do not prepare themselves for major page
+   faults like they do for blocked I/O. E.g., GUI applications
+   commonly use dedicated I/O threads to avoid blocking rendering
+   threads.
+
+There are also two access patterns:
+
+* Accesses exhibiting temporal locality
+* Accesses not exhibiting temporal locality
+
+For the reasons listed above, the former channel is assumed to follow
+the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is
+present, and the latter channel is assumed to follow the latter
+pattern unless outlying refaults have been observed.
+
+Workflow overview
+=================
+Evictable pages are divided into multiple generations for each
+``lruvec``. The youngest generation number is stored in
+``lrugen->max_seq`` for both anon and file types as they are aged on
+an equal footing. The oldest generation numbers are stored in
+``lrugen->min_seq[]`` separately for anon and file types as clean file
+pages can be evicted regardless of swap constraints. These three
+variables are monotonically increasing.
+
+Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)``
+bits in order to fit into the gen counter in ``folio->flags``. Each
+truncated generation number is an index to ``lrugen->lists[]``. The
+sliding window technique is used to track at least ``MIN_NR_GENS`` and
+at most ``MAX_NR_GENS`` generations. The gen counter stores a value
+within ``[1, MAX_NR_GENS]`` while a page is on one of
+``lrugen->lists[]``; otherwise it stores zero.
+
+Each generation is divided into multiple tiers. Tiers represent
+different ranges of numbers of accesses through file descriptors. A
+page accessed ``N`` times through file descriptors is in tier
+``order_base_2(N)``. In contrast to moving across generations, which
+requires the LRU lock, moving across tiers only involves atomic
+operations on ``folio->flags`` and therefore has a negligible cost. A
+feedback loop modeled after the PID controller monitors refaults over
+all the tiers from anon and file types and decides which tiers from
+which types to evict or protect.
+
+There are two conceptually independent procedures: the aging and the
+eviction. They form a closed-loop system, i.e., the page reclaim.
+
+Aging
+-----
+The aging produces young generations. Given an ``lruvec``, it
+increments ``max_seq`` when ``max_seq-min_seq+1`` approaches
+``MIN_NR_GENS``. The aging promotes hot pages to the youngest
+generation when it finds them accessed through page tables; the
+demotion of cold pages happens consequently when it increments
+``max_seq``. The aging uses page table walks and rmap walks to find
+young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list``
+and calls ``walk_page_range()`` with each ``mm_struct`` on this list
+to scan PTEs, and after each iteration, it increments ``max_seq``. For
+the latter, when the eviction walks the rmap and finds a young PTE,
+the aging scans the adjacent PTEs. For both, on finding a young PTE,
+the aging clears the accessed bit and updates the gen counter of the
+page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``.
+
+Eviction
+--------
+The eviction consumes old generations. Given an ``lruvec``, it
+increments ``min_seq`` when ``lrugen->lists[]`` indexed by
+``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to
+evict from, it first compares ``min_seq[]`` to select the older type.
+If both types are equally old, it selects the one whose first tier has
+a lower refault percentage. The first tier contains single-use
+unmapped clean pages, which are the best bet. The eviction sorts a
+page according to its gen counter if the aging has found this page
+accessed through page tables and updated its gen counter. It also
+moves a page to the next generation, i.e., ``min_seq+1``, if this page
+was accessed multiple times through file descriptors and the feedback
+loop has detected outlying refaults from the tier this page is in. To
+do this, the feedback loop uses the first tier as the baseline, for
+the reason stated earlier.
+
+Summary
+-------
+The multi-gen LRU can be disassembled into the following parts:
+
+* Generations
+* Page table walks
+* Rmap walks
+* Bloom filters
+* PID controller
+
+The aging and the eviction form a producer-consumer model;
+specifically, the latter drives the former by the sliding window over
+generations. Within the aging, rmap walks drive page table walks by
+inserting hot densely populated page tables to the Bloom filters.
+Within the eviction, the PID controller uses refaults as the feedback
+to select types to evict and tiers to protect.
-- 
2.36.0.550.gb090851708-goog
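
Illustrative note, not part of the patch: the generation and tier arithmetic described under "Workflow overview" can be modeled in user space as a rough sketch. Everything here is an assumption made for illustration: the values chosen for ``MIN_NR_GENS`` and ``MAX_NR_GENS`` are not stated in the document, and the helper names (``gen_counter_bits``, ``list_index``, ``gen_counter``, ``tier``, ``nr_gens``) are hypothetical and do not correspond to kernel functions. ``order_base_2`` is reimplemented here as ceil(log2(n)) for n >= 1, which is how the document uses it.

```python
"""User-space model of the generation and tier arithmetic in the design doc.

Not kernel code: the constants and all helper names below are assumptions
chosen for illustration only.
"""

MIN_NR_GENS = 2  # assumed value; the document only names the constant
MAX_NR_GENS = 4  # assumed value; the document only names the constant


def order_base_2(n: int) -> int:
    """Return ceil(log2(n)) for n >= 1 (stand-in for the kernel macro)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return (n - 1).bit_length()


def gen_counter_bits() -> int:
    """Bits needed to hold 0 (not on a list) through MAX_NR_GENS."""
    return order_base_2(MAX_NR_GENS + 1)


def list_index(seq: int) -> int:
    """Truncate a generation number to an index into lrugen->lists[]."""
    return seq % MAX_NR_GENS


def gen_counter(seq: int) -> int:
    """Value kept in the gen counter while a page is on a list: [1, MAX_NR_GENS]."""
    return list_index(seq) + 1


def tier(accesses: int) -> int:
    """Tier of a page accessed `accesses` times through file descriptors."""
    return order_base_2(accesses)


def nr_gens(min_seq: int, max_seq: int) -> int:
    """Number of generations currently inside the sliding window."""
    return max_seq - min_seq + 1


if __name__ == "__main__":
    print("gen counter bits:", gen_counter_bits())
    for seq in range(6):
        print(f"seq={seq} list index={list_index(seq)} gen counter={gen_counter(seq)}")
    for n in (1, 2, 3, 4, 5, 8, 9):
        print(f"accesses={n} tier={tier(n)}")
    # The window is kept between MIN_NR_GENS and MAX_NR_GENS generations.
    print("window ok:", MIN_NR_GENS <= nr_gens(3, 6) <= MAX_NR_GENS)
```

Under these assumed constants, sequence numbers wrap around the list index while the stored gen counter never equals zero, so zero stays free to mean "not on any list". The tier mapping groups access counts by powers of two, which keeps the number of tiers small even for heavily accessed pages.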