Date: Tue, 13 Apr 2021 00:56:26 -0600
In-Reply-To: <20210413065633.2782273-1-yuzhao@google.com>
Message-Id: <20210413065633.2782273-10-yuzhao@google.com>
References: <20210413065633.2782273-1-yuzhao@google.com>
X-Mailer: git-send-email 2.31.1.295.g9ea45b61b8-goog
Subject: [PATCH v2 09/16] mm: multigenerational lru: activation
From: Yu Zhao
To: linux-mm@kvack.org
Cc: Alex Shi, Andi Kleen, Andrew Morton, Benjamin Manes, Dave Chinner,
 Dave Hansen, Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet,
 Joonsoo Kim, Matthew Wilcox, Mel Gorman, Miaohe Lin, Michael Larabel,
 Michal Hocko, Michel Lespinasse, Rik van Riel, Roman Gushchin, Rong Chen,
 SeongJae Park, Tim Chen, Vlastimil Babka, Yang Shi, Ying Huang, Zi Yan,
 linux-kernel@vger.kernel.org, lkp@lists.01.org, page-reclaim@google.com

For pages accessed multiple times via file descriptors, instead of
activating them upon the second access, we activate them based on the
refault rates of their tiers. Pages accessed N times via file
descriptors belong to tier order_base_2(N).
Pages from tier 0, i.e., those read ahead, accessed once via file
descriptors or accessed only via page tables, are evicted regardless of
the refault rate. Pages from other tiers are moved to the next
generation, i.e., activated, if the refault rates of their tiers are
higher than that of tier 0. Each generation contains at most
MAX_NR_TIERS tiers, and they require an additional MAX_NR_TIERS-2 bits
in page->flags.

This feedback model has a few advantages over the current feedforward
model:
1) It has a negligible overhead in the access path because activations
   are done in the reclaim path.
2) It takes mapped pages into account and avoids overprotecting pages
   accessed multiple times via file descriptors.
3) More tiers offer better protection to pages accessed more than twice
   when buffered-I/O-intensive workloads are under memory pressure.

For pages mapped upon page faults, the accessed bit is set and they must
be properly aged. We add them to the per-zone lists indexed by max_seq,
i.e., the youngest generation. For pages not in page cache or swap
cache, this can be done easily in the page fault path: we rename
lru_cache_add_inactive_or_unevictable() to lru_cache_add_page_vma() and
add a new parameter, which is set to true for pages mapped upon page
faults.

For pages in page cache or swap cache, we cannot differentiate the page
fault path from the read ahead path at the time we call lru_cache_add()
in add_to_page_cache_lru() and __read_swap_cache_async(). So we add a
new function lru_gen_activation(), which is essentially activate_page(),
to move pages to the per-zone lists indexed by max_seq at a later time.
Hopefully we would find those pages in lru_pvecs.lru_add and simply set
PageActive() on them without having to actually move them.

Finally, we need to be compatible with the existing notion of active and
inactive.
We cannot use PageActive() because it is not set on active pages unless
they are isolated, in order to spare the aging the trouble of clearing
it when an active generation becomes inactive. A new function
page_is_active() compares the generation number of a page with max_seq
and max_seq-1 (modulo MAX_NR_GENS), which are considered active and
protected from eviction. Other generations, which may or may not exist,
are considered inactive.

Signed-off-by: Yu Zhao
---
 fs/proc/task_mmu.c        |   3 +-
 include/linux/mm_inline.h | 101 +++++++++++++++++++++
 include/linux/swap.h      |   4 +-
 kernel/events/uprobes.c   |   2 +-
 mm/huge_memory.c          |   2 +-
 mm/khugepaged.c           |   2 +-
 mm/memory.c               |  14 +--
 mm/migrate.c              |   2 +-
 mm/swap.c                 |  26 +++---
 mm/swapfile.c             |   2 +-
 mm/userfaultfd.c          |   2 +-
 mm/vmscan.c               |  91 ++++++++++++++++++-
 mm/workingset.c           | 179 +++++++++++++++++++++++++++++++-------
 13 files changed, 371 insertions(+), 59 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e862cab69583..d292f20c4e3d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -1718,7 +1719,7 @@ static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty,
 	if (PageSwapCache(page))
 		md->swapcache += nr_pages;

-	if (PageActive(page) || PageUnevictable(page))
+	if (PageUnevictable(page) || page_is_active(compound_head(page), NULL))
 		md->active += nr_pages;

 	if (PageWriteback(page))
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 2bf910eb3dd7..5eb4b12972ec 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -95,6 +95,12 @@ static inline int lru_gen_from_seq(unsigned long seq)
 	return seq % MAX_NR_GENS;
 }

+/* Convert the level of usage to a tier. See the comment on MAX_NR_TIERS. */
+static inline int lru_tier_from_usage(int usage)
+{
+	return order_base_2(usage + 1);
+}
+
 /* Return a proper index regardless whether we keep a full history of stats.
*/ static inline int sid_from_seq_or_gen(int seq_or_gen) { @@ -238,12 +244,93 @@ static inline bool lru_gen_deletion(struct page *page, struct lruvec *lruvec) return true; } +/* Activate a page from page cache or swap cache after it's mapped. */ +static inline void lru_gen_activation(struct page *page, struct vm_area_struct *vma) +{ + if (!lru_gen_enabled()) + return; + + if (PageActive(page) || PageUnevictable(page) || vma_is_dax(vma) || + (vma->vm_flags & (VM_LOCKED | VM_SPECIAL))) + return; + /* + * TODO: pass vm_fault to add_to_page_cache_lru() and + * __read_swap_cache_async() so they can activate pages directly when in + * the page fault path. + */ + activate_page(page); +} + /* Return -1 when a page is not on a list of the multigenerational lru. */ static inline int page_lru_gen(struct page *page) { return ((READ_ONCE(page->flags) & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1; } +/* This function works regardless whether the multigenerational lru is enabled. */ +static inline bool page_is_active(struct page *page, struct lruvec *lruvec) +{ + struct mem_cgroup *memcg; + int gen = page_lru_gen(page); + bool active = false; + + VM_BUG_ON_PAGE(PageTail(page), page); + + if (gen < 0) + return PageActive(page); + + if (lruvec) { + VM_BUG_ON_PAGE(PageUnevictable(page), page); + VM_BUG_ON_PAGE(PageActive(page), page); + lockdep_assert_held(&lruvec->lru_lock); + + return lru_gen_is_active(lruvec, gen); + } + + rcu_read_lock(); + + memcg = page_memcg_rcu(page); + lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page)); + active = lru_gen_is_active(lruvec, gen); + + rcu_read_unlock(); + + return active; +} + +/* Return the level of usage of a page. See the comment on MAX_NR_TIERS. */ +static inline int page_tier_usage(struct page *page) +{ + unsigned long flags = READ_ONCE(page->flags); + + return flags & BIT(PG_workingset) ? + ((flags & LRU_USAGE_MASK) >> LRU_USAGE_PGOFF) + 1 : 0; +} + +/* Increment the usage counter after a page is accessed via file descriptors. 
*/ +static inline bool page_inc_usage(struct page *page) +{ + unsigned long old_flags, new_flags; + + if (!lru_gen_enabled()) + return PageActive(page); + + do { + old_flags = READ_ONCE(page->flags); + + if (!(old_flags & BIT(PG_workingset))) + new_flags = old_flags | BIT(PG_workingset); + else + new_flags = (old_flags & ~LRU_USAGE_MASK) | min(LRU_USAGE_MASK, + (old_flags & LRU_USAGE_MASK) + BIT(LRU_USAGE_PGOFF)); + + if (old_flags == new_flags) + break; + } while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags); + + return true; +} + #else /* CONFIG_LRU_GEN */ static inline bool lru_gen_enabled(void) @@ -261,6 +348,20 @@ static inline bool lru_gen_deletion(struct page *page, struct lruvec *lruvec) return false; } +static inline void lru_gen_activation(struct page *page, struct vm_area_struct *vma) +{ +} + +static inline bool page_is_active(struct page *page, struct lruvec *lruvec) +{ + return PageActive(page); +} + +static inline bool page_inc_usage(struct page *page) +{ + return PageActive(page); +} + #endif /* CONFIG_LRU_GEN */ static __always_inline void add_page_to_lru_list(struct page *page, diff --git a/include/linux/swap.h b/include/linux/swap.h index de2bbbf181ba..0e7532c7db22 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -350,8 +350,8 @@ extern void deactivate_page(struct page *page); extern void mark_page_lazyfree(struct page *page); extern void swap_setup(void); -extern void lru_cache_add_inactive_or_unevictable(struct page *page, - struct vm_area_struct *vma); +extern void lru_cache_add_page_vma(struct page *page, struct vm_area_struct *vma, + bool faulting); /* linux/mm/vmscan.c */ extern unsigned long zone_reclaimable_pages(struct zone *zone); diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 6addc9780319..4e93e5602723 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -184,7 +184,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, if (new_page) { 
get_page(new_page); page_add_new_anon_rmap(new_page, vma, addr, false); - lru_cache_add_inactive_or_unevictable(new_page, vma); + lru_cache_add_page_vma(new_page, vma, false); } else /* no new page, just dec_mm_counter for old_page */ dec_mm_counter(mm, MM_ANONPAGES); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 26d3cc4a7a0b..2cf46270c84b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -637,7 +637,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf, entry = mk_huge_pmd(page, vma->vm_page_prot); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); page_add_new_anon_rmap(page, vma, haddr, true); - lru_cache_add_inactive_or_unevictable(page, vma); + lru_cache_add_page_vma(page, vma, true); pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry); update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); diff --git a/mm/khugepaged.c b/mm/khugepaged.c index a7d6cb912b05..08a43910f232 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1199,7 +1199,7 @@ static void collapse_huge_page(struct mm_struct *mm, spin_lock(pmd_ptl); BUG_ON(!pmd_none(*pmd)); page_add_new_anon_rmap(new_page, vma, address, true); - lru_cache_add_inactive_or_unevictable(new_page, vma); + lru_cache_add_page_vma(new_page, vma, true); pgtable_trans_huge_deposit(mm, pmd, pgtable); set_pmd_at(mm, address, pmd, _pmd); update_mmu_cache_pmd(vma, address, pmd); diff --git a/mm/memory.c b/mm/memory.c index 550405fc3b5e..9a6cb6d31430 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -73,6 +73,7 @@ #include #include #include +#include #include @@ -839,7 +840,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma copy_user_highpage(new_page, page, addr, src_vma); __SetPageUptodate(new_page); page_add_new_anon_rmap(new_page, dst_vma, addr, false); - lru_cache_add_inactive_or_unevictable(new_page, dst_vma); + lru_cache_add_page_vma(new_page, dst_vma, false); rss[mm_counter(new_page)]++; /* All done, just 
insert the new page copy in the child */ @@ -2907,7 +2908,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) */ ptep_clear_flush_notify(vma, vmf->address, vmf->pte); page_add_new_anon_rmap(new_page, vma, vmf->address, false); - lru_cache_add_inactive_or_unevictable(new_page, vma); + lru_cache_add_page_vma(new_page, vma, true); /* * We call the notify macro here because, when using secondary * mmu page tables (such as kvm shadow page tables), we want the @@ -3438,9 +3439,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) /* ksm created a completely new copy */ if (unlikely(page != swapcache && swapcache)) { page_add_new_anon_rmap(page, vma, vmf->address, false); - lru_cache_add_inactive_or_unevictable(page, vma); + lru_cache_add_page_vma(page, vma, true); } else { do_page_add_anon_rmap(page, vma, vmf->address, exclusive); + lru_gen_activation(page, vma); } swap_free(entry); @@ -3584,7 +3586,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); page_add_new_anon_rmap(page, vma, vmf->address, false); - lru_cache_add_inactive_or_unevictable(page, vma); + lru_cache_add_page_vma(page, vma, true); setpte: set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); @@ -3709,6 +3711,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR); page_add_file_rmap(page, true); + lru_gen_activation(page, vma); /* * deposit and withdraw with pmd lock held */ @@ -3752,10 +3755,11 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr) if (write && !(vma->vm_flags & VM_SHARED)) { inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); page_add_new_anon_rmap(page, vma, addr, false); - lru_cache_add_inactive_or_unevictable(page, vma); + lru_cache_add_page_vma(page, vma, true); } else { inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page)); page_add_file_rmap(page, false); + lru_gen_activation(page, vma); } set_pte_at(vma->vm_mm, addr, 
vmf->pte, entry); } diff --git a/mm/migrate.c b/mm/migrate.c index 62b81d5257aa..1064b03cac33 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -3004,7 +3004,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, inc_mm_counter(mm, MM_ANONPAGES); page_add_new_anon_rmap(page, vma, addr, false); if (!is_zone_device_page(page)) - lru_cache_add_inactive_or_unevictable(page, vma); + lru_cache_add_page_vma(page, vma, false); get_page(page); if (flush) { diff --git a/mm/swap.c b/mm/swap.c index f20ed56ebbbf..d6458ee1e9f8 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -306,7 +306,7 @@ void lru_note_cost_page(struct page *page) static void __activate_page(struct page *page, struct lruvec *lruvec) { - if (!PageActive(page) && !PageUnevictable(page)) { + if (!PageUnevictable(page) && !page_is_active(page, lruvec)) { int nr_pages = thp_nr_pages(page); del_page_from_lru_list(page, lruvec); @@ -337,7 +337,7 @@ static bool need_activate_page_drain(int cpu) static void activate_page_on_lru(struct page *page) { page = compound_head(page); - if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { + if (PageLRU(page) && !PageUnevictable(page) && !page_is_active(page, NULL)) { struct pagevec *pvec; local_lock(&lru_pvecs.lock); @@ -431,7 +431,7 @@ void mark_page_accessed(struct page *page) * this list is never rotated or maintained, so marking an * evictable page accessed has no effect. */ - } else if (!PageActive(page)) { + } else if (!page_inc_usage(page)) { activate_page(page); ClearPageReferenced(page); workingset_activation(page); @@ -467,15 +467,14 @@ void lru_cache_add(struct page *page) EXPORT_SYMBOL(lru_cache_add); /** - * lru_cache_add_inactive_or_unevictable + * lru_cache_add_page_vma * @page: the page to be added to LRU * @vma: vma in which page is mapped for determining reclaimability * - * Place @page on the inactive or unevictable LRU list, depending on its - * evictability. + * Place @page on an LRU list, depending on its evictability. 
*/ -void lru_cache_add_inactive_or_unevictable(struct page *page, - struct vm_area_struct *vma) +void lru_cache_add_page_vma(struct page *page, struct vm_area_struct *vma, + bool faulting) { bool unevictable; @@ -492,6 +491,11 @@ void lru_cache_add_inactive_or_unevictable(struct page *page, __mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages); count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages); } + + /* tell the multigenerational lru that the page is being faulted in */ + if (lru_gen_enabled() && !unevictable && faulting) + SetPageActive(page); + lru_cache_add(page); } @@ -518,7 +522,7 @@ void lru_cache_add_inactive_or_unevictable(struct page *page, */ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec) { - bool active = PageActive(page); + bool active = page_is_active(page, lruvec); int nr_pages = thp_nr_pages(page); if (PageUnevictable(page)) @@ -558,7 +562,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec) static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec) { - if (PageActive(page) && !PageUnevictable(page)) { + if (!PageUnevictable(page) && page_is_active(page, lruvec)) { int nr_pages = thp_nr_pages(page); del_page_from_lru_list(page, lruvec); @@ -672,7 +676,7 @@ void deactivate_file_page(struct page *page) */ void deactivate_page(struct page *page) { - if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) { + if (PageLRU(page) && !PageUnevictable(page) && page_is_active(page, NULL)) { struct pagevec *pvec; local_lock(&lru_pvecs.lock); diff --git a/mm/swapfile.c b/mm/swapfile.c index c6041d10a73a..ab3b5ca404fd 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1936,7 +1936,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, page_add_anon_rmap(page, vma, addr, false); } else { /* ksm created a completely new copy */ page_add_new_anon_rmap(page, vma, addr, false); - lru_cache_add_inactive_or_unevictable(page, vma); + lru_cache_add_page_vma(page, vma, false); } 
swap_free(entry); out: diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 9a3d451402d7..e1d4cd3103b8 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -123,7 +123,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, inc_mm_counter(dst_mm, MM_ANONPAGES); page_add_new_anon_rmap(page, dst_vma, dst_addr, false); - lru_cache_add_inactive_or_unevictable(page, dst_vma); + lru_cache_add_page_vma(page, dst_vma, true); set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); diff --git a/mm/vmscan.c b/mm/vmscan.c index 8559bb94d452..c74ebe2039f7 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -898,9 +898,11 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, if (PageSwapCache(page)) { swp_entry_t swap = { .val = page_private(page) }; - mem_cgroup_swapout(page, swap); + + /* get a shadow entry before page_memcg() is cleared */ if (reclaimed && !mapping_exiting(mapping)) shadow = workingset_eviction(page, target_memcg); + mem_cgroup_swapout(page, swap); __delete_from_swap_cache(page, swap, shadow); xa_unlock_irqrestore(&mapping->i_pages, flags); put_swap_page(page, swap); @@ -4375,6 +4377,93 @@ static bool __maybe_unused seq_is_valid(struct lruvec *lruvec) get_nr_gens(lruvec, 1) <= MAX_NR_GENS; } +/****************************************************************************** + * refault feedback loop + ******************************************************************************/ + +/* + * A feedback loop modeled after the PID controller. Currently supports the + * proportional (P) and the integral (I) terms; the derivative (D) term can be + * added if necessary. The setpoint (SP) is the desired position; the process + * variable (PV) is the measured position. The error is the difference between + * the SP and the PV. A positive error results in a positive control output + * correction, which, in our case, is to allow eviction. + * + * The P term is the current refault rate refaulted/(evicted+activated), which + * has a weight of 1. 
The I term is the arithmetic mean of the last N refault
+ * rates, weighted by geometric series 1/2, 1/4, ..., 1/(1<<N), which
+ * has a weight of 1/2.
+ */
+struct controller_pos {
+	unsigned long refaulted;
+	unsigned long total;
+	int gain;
+};
+
+static void read_controller_pos(struct controller_pos *pos, struct lruvec *lruvec,
+				int file, int tier, int gain)
+{
+	struct lrugen *lrugen = &lruvec->evictable;
+	int sid = sid_from_seq_or_gen(lrugen->min_seq[file]);
+
+	pos->refaulted = lrugen->avg_refaulted[file][tier] +
+			 atomic_long_read(&lrugen->refaulted[sid][file][tier]);
+	pos->total = lrugen->avg_total[file][tier] +
+		     atomic_long_read(&lrugen->evicted[sid][file][tier]);
+	if (tier)
+		pos->total += lrugen->activated[sid][file][tier - 1];
+	pos->gain = gain;
+}
+
+static void reset_controller_pos(struct lruvec *lruvec, int gen, int file)
+{
+	int tier;
+	int sid = sid_from_seq_or_gen(gen);
+	struct lrugen *lrugen = &lruvec->evictable;
+	bool carryover = gen == lru_gen_from_seq(lrugen->min_seq[file]);
+
+	if (!carryover && NR_STAT_GENS == 1)
+		return;
+
+	for (tier = 0; tier < MAX_NR_TIERS; tier++) {
+		if (carryover) {
+			unsigned long sum;
+
+			sum = lrugen->avg_refaulted[file][tier] +
+			      atomic_long_read(&lrugen->refaulted[sid][file][tier]);
+			WRITE_ONCE(lrugen->avg_refaulted[file][tier], sum >> 1);
+
+			sum = lrugen->avg_total[file][tier] +
+			      atomic_long_read(&lrugen->evicted[sid][file][tier]);
+			if (tier)
+				sum += lrugen->activated[sid][file][tier - 1];
+			WRITE_ONCE(lrugen->avg_total[file][tier], sum >> 1);
+
+			if (NR_STAT_GENS > 1)
+				continue;
+		}
+
+		atomic_long_set(&lrugen->refaulted[sid][file][tier], 0);
+		atomic_long_set(&lrugen->evicted[sid][file][tier], 0);
+		if (tier)
+			WRITE_ONCE(lrugen->activated[sid][file][tier - 1], 0);
+	}
+}
+
+static bool positive_ctrl_err(struct controller_pos *sp, struct controller_pos *pv)
+{
+	/*
+	 * Allow eviction if the PV has a limited number of refaulted pages or a
+	 * lower refault rate than the SP.
+ */ + return pv->refaulted < SWAP_CLUSTER_MAX || + pv->refaulted * max(sp->total, 1UL) * sp->gain <= + sp->refaulted * max(pv->total, 1UL) * pv->gain; +} + /****************************************************************************** * state change ******************************************************************************/ diff --git a/mm/workingset.c b/mm/workingset.c index cd39902c1062..df363f9419fc 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -168,9 +168,9 @@ * refault distance will immediately activate the refaulting page. */ -#define EVICTION_SHIFT ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \ - 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT) -#define EVICTION_MASK (~0UL >> EVICTION_SHIFT) +#define EVICTION_SHIFT (BITS_PER_XA_VALUE - MEM_CGROUP_ID_SHIFT - NODES_SHIFT) +#define EVICTION_MASK (BIT(EVICTION_SHIFT) - 1) +#define WORKINGSET_WIDTH 1 /* * Eviction timestamps need to be able to cover the full range of @@ -182,38 +182,139 @@ */ static unsigned int bucket_order __read_mostly; -static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction, - bool workingset) +static void *pack_shadow(int memcg_id, struct pglist_data *pgdat, unsigned long val) { - eviction >>= bucket_order; - eviction &= EVICTION_MASK; - eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid; - eviction = (eviction << NODES_SHIFT) | pgdat->node_id; - eviction = (eviction << 1) | workingset; + val = (val << MEM_CGROUP_ID_SHIFT) | memcg_id; + val = (val << NODES_SHIFT) | pgdat->node_id; - return xa_mk_value(eviction); + return xa_mk_value(val); } -static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, - unsigned long *evictionp, bool *workingsetp) +static unsigned long unpack_shadow(void *shadow, int *memcg_id, struct pglist_data **pgdat) { - unsigned long entry = xa_to_value(shadow); - int memcgid, nid; - bool workingset; - - workingset = entry & 1; - entry >>= 1; - nid = entry & ((1UL << NODES_SHIFT) - 1); - entry >>= NODES_SHIFT; - memcgid = entry & 
((1UL << MEM_CGROUP_ID_SHIFT) - 1); - entry >>= MEM_CGROUP_ID_SHIFT; - - *memcgidp = memcgid; - *pgdat = NODE_DATA(nid); - *evictionp = entry << bucket_order; - *workingsetp = workingset; + unsigned long val = xa_to_value(shadow); + + *pgdat = NODE_DATA(val & (BIT(NODES_SHIFT) - 1)); + val >>= NODES_SHIFT; + *memcg_id = val & (BIT(MEM_CGROUP_ID_SHIFT) - 1); + + return val >> MEM_CGROUP_ID_SHIFT; +} + +#ifdef CONFIG_LRU_GEN + +#if LRU_GEN_SHIFT + LRU_USAGE_SHIFT >= EVICTION_SHIFT +#error "Please try smaller NODES_SHIFT, NR_LRU_GENS and TIERS_PER_GEN configurations" +#endif + +static void page_set_usage(struct page *page, int usage) +{ + unsigned long old_flags, new_flags; + + VM_BUG_ON(usage > BIT(LRU_USAGE_WIDTH)); + + if (!usage) + return; + + do { + old_flags = READ_ONCE(page->flags); + new_flags = (old_flags & ~LRU_USAGE_MASK) | LRU_TIER_FLAGS | + ((usage - 1UL) << LRU_USAGE_PGOFF); + if (old_flags == new_flags) + break; + } while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags); +} + +/* Return a token to be stored in the shadow entry of a page being evicted. */ +static void *lru_gen_eviction(struct page *page) +{ + int sid, tier; + unsigned long token; + unsigned long min_seq; + struct lruvec *lruvec; + struct lrugen *lrugen; + int file = page_is_file_lru(page); + int usage = page_tier_usage(page); + struct mem_cgroup *memcg = page_memcg(page); + struct pglist_data *pgdat = page_pgdat(page); + + if (!lru_gen_enabled()) + return NULL; + + lruvec = mem_cgroup_lruvec(memcg, pgdat); + lrugen = &lruvec->evictable; + min_seq = READ_ONCE(lrugen->min_seq[file]); + token = (min_seq << LRU_USAGE_SHIFT) | usage; + + sid = sid_from_seq_or_gen(min_seq); + tier = lru_tier_from_usage(usage); + atomic_long_add(thp_nr_pages(page), &lrugen->evicted[sid][file][tier]); + + return pack_shadow(mem_cgroup_id(memcg), pgdat, token); +} + +/* Account a refaulted page based on the token stored in its shadow entry. 
*/ +static bool lru_gen_refault(struct page *page, void *shadow) +{ + int sid, tier, usage; + int memcg_id; + unsigned long token; + unsigned long min_seq; + struct lruvec *lruvec; + struct lrugen *lrugen; + struct pglist_data *pgdat; + struct mem_cgroup *memcg; + int file = page_is_file_lru(page); + + if (!lru_gen_enabled()) + return false; + + token = unpack_shadow(shadow, &memcg_id, &pgdat); + if (page_pgdat(page) != pgdat) + return true; + + rcu_read_lock(); + memcg = page_memcg_rcu(page); + if (mem_cgroup_id(memcg) != memcg_id) + goto unlock; + + usage = token & (BIT(LRU_USAGE_SHIFT) - 1); + token >>= LRU_USAGE_SHIFT; + + lruvec = mem_cgroup_lruvec(memcg, pgdat); + lrugen = &lruvec->evictable; + min_seq = READ_ONCE(lrugen->min_seq[file]); + if (token != (min_seq & (EVICTION_MASK >> LRU_USAGE_SHIFT))) + goto unlock; + + page_set_usage(page, usage); + + sid = sid_from_seq_or_gen(min_seq); + tier = lru_tier_from_usage(usage); + atomic_long_add(thp_nr_pages(page), &lrugen->refaulted[sid][file][tier]); + inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file); + if (tier) + inc_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file); +unlock: + rcu_read_unlock(); + + return true; +} + +#else /* CONFIG_LRU_GEN */ + +static void *lru_gen_eviction(struct page *page) +{ + return NULL; } +static bool lru_gen_refault(struct page *page, void *shadow) +{ + return false; +} + +#endif /* CONFIG_LRU_GEN */ + /** * workingset_age_nonresident - age non-resident entries as LRU ages * @lruvec: the lruvec that was aged @@ -256,18 +357,25 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg) unsigned long eviction; struct lruvec *lruvec; int memcgid; + void *shadow; /* Page is fully exclusive and pins page's memory cgroup pointer */ VM_BUG_ON_PAGE(PageLRU(page), page); VM_BUG_ON_PAGE(page_count(page), page); VM_BUG_ON_PAGE(!PageLocked(page), page); + shadow = lru_gen_eviction(page); + if (shadow) + return shadow; + lruvec = mem_cgroup_lruvec(target_memcg, 
pgdat); /* XXX: target_memcg can be NULL, go through lruvec */ memcgid = mem_cgroup_id(lruvec_memcg(lruvec)); eviction = atomic_long_read(&lruvec->nonresident_age); + eviction >>= bucket_order; + eviction = (eviction << WORKINGSET_WIDTH) | PageWorkingset(page); workingset_age_nonresident(lruvec, thp_nr_pages(page)); - return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page)); + return pack_shadow(memcgid, pgdat, eviction); } /** @@ -294,7 +402,10 @@ void workingset_refault(struct page *page, void *shadow) bool workingset; int memcgid; - unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset); + if (lru_gen_refault(page, shadow)) + return; + + eviction = unpack_shadow(shadow, &memcgid, &pgdat); rcu_read_lock(); /* @@ -318,6 +429,8 @@ void workingset_refault(struct page *page, void *shadow) goto out; eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat); refault = atomic_long_read(&eviction_lruvec->nonresident_age); + workingset = eviction & (BIT(WORKINGSET_WIDTH) - 1); + eviction = (eviction >> WORKINGSET_WIDTH) << bucket_order; /* * Calculate the refault distance @@ -335,7 +448,7 @@ void workingset_refault(struct page *page, void *shadow) * longest time, so the occasional inappropriate activation * leading to pressure on the active list is not a problem. */ - refault_distance = (refault - eviction) & EVICTION_MASK; + refault_distance = (refault - eviction) & (EVICTION_MASK >> WORKINGSET_WIDTH); /* * The activation decision for this page is made at the level @@ -594,7 +707,7 @@ static int __init workingset_init(void) unsigned int max_order; int ret; - BUILD_BUG_ON(BITS_PER_LONG < EVICTION_SHIFT); + BUILD_BUG_ON(EVICTION_SHIFT < WORKINGSET_WIDTH); /* * Calculate the eviction bucket size to cover the longest * actionable refault distance, which is currently half of @@ -602,7 +715,7 @@ static int __init workingset_init(void) * some more pages at runtime, so keep working with up to * double the initial memory by using totalram_pages as-is. 
 */
-	timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
+	timestamp_bits = EVICTION_SHIFT - WORKINGSET_WIDTH;
 	max_order = fls_long(totalram_pages() - 1);
 	if (max_order > timestamp_bits)
 		bucket_order = max_order - timestamp_bits;
--
2.31.1.295.g9ea45b61b8-goog