Message-ID: <73adae65-4429-41d7-bbb6-4c58156060d3@arm.com>
Date: Mon, 8 Apr 2024 10:22:51 +0100
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v6 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
Content-Language: en-GB
To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Huang Ying, Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Barry Song <21cnbao@gmail.com>, Chris Li, Lance Yang
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240403114032.1162100-1-ryan.roberts@arm.com> <20240403114032.1162100-3-ryan.roberts@arm.com> <051052af-3b56-4290-98d3-fd5a1eb11ce1@redhat.com>
From: Ryan Roberts
In-Reply-To: <051052af-3b56-4290-98d3-fd5a1eb11ce1@redhat.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Andrew,

Looks like there are a few niggles for me to address below, so please don't
let v6 (currently in mm-unstable) progress to mm-stable. I'll aim to get a new
version out (or a patch for this, depending on how the discussion lands) in
the next couple of days.

On 05/04/2024 11:13, David Hildenbrand wrote:
> On 03.04.24 13:40, Ryan Roberts wrote:
>> Now that we no longer have a convenient flag in the cluster to determine
>> if a folio is large, free_swap_and_cache() will take a reference and
>> lock a large folio much more often, which could lead to contention and
>> (e.g.) failure to split large folios, etc.
>>
>> Let's solve that problem by batch freeing swap and cache with a new
>> function, free_swap_and_cache_nr(), to free a contiguous range of swap
>> entries together. This allows us to first drop a reference to each swap
>> slot before we try to release the cache folio. This means we only try to
>> release the folio once, only taking the reference and lock once - much
>> better than the previous 512 times for the 2M THP case.
>>
>> Contiguous swap entries are gathered in zap_pte_range() and
>> madvise_free_pte_range() in a similar way to how present ptes are
>> already gathered in zap_pte_range().
>>
>> While we are at it, let's simplify by converting the return type of both
>> functions to void. The return value was used only by zap_pte_range() to
>> print a bad pte, and was ignored by everyone else, so the extra
>> reporting wasn't exactly guaranteed. We will still get the warning with
>> most of the information from get_swap_device(). With the batch version,
>> we wouldn't know which pte was bad anyway, so could print the wrong one.
>>
>> Signed-off-by: Ryan Roberts
>> ---
>>   include/linux/pgtable.h | 28 ++++++++++++++
>>   include/linux/swap.h    | 12 ++++--
>>   mm/internal.h           | 48 +++++++++++++++++++++++
>>   mm/madvise.c            | 12 ++++--
>>   mm/memory.c             | 13 ++++---
>>   mm/swapfile.c           | 86 ++++++++++++++++++++++++++++++++---------
>>   6 files changed, 167 insertions(+), 32 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index a3fc8150b047..0278259f7078 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct
>> mm_struct *mm,
>>   }
>>   #endif
>>
>> +#ifndef clear_not_present_full_ptes
>> +/**
>> + * clear_not_present_full_ptes - Clear consecutive not present PTEs.
>
> Consecutive only in the page table or also in some other sense?
>
> I suspect: just unrelated non-present entries of any kind (swp, nonswp) and
> any offset/pfn.

Yes.

>
> Consider documenting that.

How about: "Clear multiple not present PTEs which are consecutive in the
pgtable"?

>
>> + * @mm: Address space the ptes represent.
>> + * @addr: Address of the first pte.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries to clear.
>> + * @full: Whether we are clearing a full mm.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over pte_clear_not_present_full().
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs are all not present.
>> + * The PTEs are all in the same PMD.
>> + */
>> +static inline void clear_not_present_full_ptes(struct mm_struct *mm,
>> +        unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>> +{
>> +    for (;;) {
>> +        pte_clear_not_present_full(mm, addr, ptep, full);
>> +        if (--nr == 0)
>> +            break;
>> +        ptep++;
>> +        addr += PAGE_SIZE;
>> +    }
>> +}
>> +#endif
>> +
>>   #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>>   extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
>>                     unsigned long address,
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index f6f78198f000..5737236dc3ce 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -471,7 +471,7 @@ extern int swap_duplicate(swp_entry_t);
>>   extern int swapcache_prepare(swp_entry_t);
>>   extern void swap_free(swp_entry_t);
>>   extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> -extern int free_swap_and_cache(swp_entry_t);
>> +extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>>   int swap_type_of(dev_t device, sector_t offset);
>>   int find_first_swap(dev_t *device);
>>   extern unsigned int count_swap_pages(int, int);
>> @@ -520,8 +520,9 @@ static inline void put_swap_device(struct swap_info_struct
>> *si)
>>   #define free_pages_and_swap_cache(pages, nr) \
>>       release_pages((pages), (nr));
>>
>
> [...]
>
>> +
>> +/**
>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>> + * @start_ptep: Page table pointer for the first entry.
>> + * @max_nr: The maximum number of table entries to consider.
>> + * @entry: Swap entry recovered from the first table entry.
>> + *
>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
>> + * containing swap entries all with consecutive offsets and targeting the same
>> + * swap type.
>> + *
>
> Likely you should document that any swp pte bits are ignored? ()

Sorry, I don't understand this comment. I thought any non-none, non-present
PTE was always considered to contain only a "swap entry", and a swap entry
consists of a "type" and an "offset" only (and it's a special "non-swap" swap
entry if type > SOME_CONSTANT). Are you saying there are additional fields in
the PTE that are not part of the swap entry?
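For reference, this is my (perhaps too simple) mental model of what a
non-none, non-present PTE carries - i.e. just the fields recovered via the
swp helpers:

    /* My assumption: the swap entry is the whole story for such a PTE. */
    swp_entry_t entry = pte_to_swp_entry(pte);
    unsigned int type = swp_type(entry);      /* which swap device */
    unsigned long offset = swp_offset(entry); /* slot on that device */
    bool special = non_swap_entry(entry);     /* migration, hwpoison, ... */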
>
>> + * max_nr must be at least one and must be limited by the caller so scanning
>> + * cannot exceed a single page table.
>> + *
>> + * Return: the number of table entries in the batch.
>> + */
>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
>> +                 swp_entry_t entry)
>> +{
>> +    const pte_t *end_ptep = start_ptep + max_nr;
>> +    unsigned long expected_offset = swp_offset(entry) + 1;
>> +    unsigned int expected_type = swp_type(entry);
>> +    pte_t *ptep = start_ptep + 1;
>> +
>> +    VM_WARN_ON(max_nr < 1);
>> +    VM_WARN_ON(non_swap_entry(entry));
>> +
>> +    while (ptep < end_ptep) {
>> +        pte_t pte = ptep_get(ptep);
>> +
>> +        if (pte_none(pte) || pte_present(pte))
>> +            break;
>> +
>> +        entry = pte_to_swp_entry(pte);
>> +
>> +        if (non_swap_entry(entry) ||
>> +            swp_type(entry) != expected_type ||
>> +            swp_offset(entry) != expected_offset)
>> +            break;
>> +
>> +        expected_offset++;
>> +        ptep++;
>> +    }
>> +
>> +    return ptep - start_ptep;
>> +}
>
> Looks very clean :)
>
> I was wondering whether we could similarly construct the expected swp PTE and
> only check pte_same.
>
> expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset));
>
> ... or have a variant to increase only the swp offset for an existing pte. But
> non-trivial due to the arch-dependent format.
>
> But then, we'd fail on mismatch of other swp pte bits.

Hmm, perhaps I have a misunderstanding regarding "swp pte bits"...
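If I've understood the suggestion though, the core loop would end up looking
something like this (very rough, untested sketch; as you say, the interesting
part is the arch-dependent format and whatever other pte bits might differ):

    pte_t expected_pte = __swp_entry_to_pte(__swp_entry(expected_type,
                                                        expected_offset));

    while (ptep < end_ptep) {
        pte_t pte = ptep_get(ptep);

        /* One comparison instead of separate none/present/type/offset checks. */
        if (!pte_same(pte, expected_pte))
            break;

        expected_offset++;
        expected_pte = __swp_entry_to_pte(__swp_entry(expected_type,
                                                      expected_offset));
        ptep++;
    }

    return ptep - start_ptep;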
>
>
> On swapin, when reusing this function (likely!), we'll need to make sure that
> the PTE bits match as well.
>
> See below regarding uffd-wp.
>
>
>>   #endif /* CONFIG_MMU */
>>
>>   void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 1f77a51baaac..070bedb4996e 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>       struct folio *folio;
>>       int nr_swap = 0;
>>       unsigned long next;
>> +    int nr, max_nr;
>>
>>       next = pmd_addr_end(addr, end);
>>       if (pmd_trans_huge(*pmd))
>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>           return 0;
>>       flush_tlb_batched_pending(mm);
>>       arch_enter_lazy_mmu_mode();
>> -    for (; addr != end; pte++, addr += PAGE_SIZE) {
>> +    for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
>> +        nr = 1;
>>           ptent = ptep_get(pte);
>>
>>           if (pte_none(ptent))
>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>
>>               entry = pte_to_swp_entry(ptent);
>>               if (!non_swap_entry(entry)) {
>> -                nr_swap--;
>> -                free_swap_and_cache(entry);
>> -                pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> +                max_nr = (end - addr) / PAGE_SIZE;
>> +                nr = swap_pte_batch(pte, max_nr, entry);
>> +                nr_swap -= nr;
>> +                free_swap_and_cache_nr(entry, nr);
>> +                clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>               } else if (is_hwpoison_entry(entry) ||
>>                      is_poisoned_swp_entry(entry)) {
>>                   pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 7dc6c3d9fa83..ef2968894718 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather
>> *tlb,
>>                   folio_remove_rmap_pte(folio, page, vma);
>>               folio_put(folio);
>>           } else if (!non_swap_entry(entry)) {
>> -            /* Genuine swap entry, hence a private anon page */
>> +            max_nr = (end - addr) / PAGE_SIZE;
>> +            nr = swap_pte_batch(pte, max_nr, entry);
>> +            /* Genuine swap entries, hence a private anon pages */
>>               if (!should_zap_cows(details))
>>                   continue;
>> -            rss[MM_SWAPENTS]--;
>> -            if (unlikely(!free_swap_and_cache(entry)))
>> -                print_bad_pte(vma, addr, ptent, NULL);
>> +            rss[MM_SWAPENTS] -= nr;
>> +            free_swap_and_cache_nr(entry, nr);
>>           } else if (is_migration_entry(entry)) {
>>               folio = pfn_swap_entry_folio(entry);
>>               if (!should_zap_folio(details, folio))
>> @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>               pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>>               WARN_ON_ONCE(1);
>>           }
>> -        pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> -        zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
>> +        clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>
> For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match.
>
> zap_install_uffd_wp_if_needed() will use the uffd-wp information in
> ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers.
>
> On mixture, you either lose some or place too many markers.

What path are you concerned about here? I don't get how what you describe can
happen. swap_pte_batch() will only give me a batch of actual swap entries, and
actual swap entries don't contain uffd-wp info, IIUC. If the function gets to
a "non-swap" swap entry, it bails. I thought the uffd-wp info was populated
based on the VMA state at swap-in? I think you are telling me that it's
persisted across the swap, per pte?

>
> A simple workaround would be to disable any such batching if the VMA does
> have uffd-wp enabled.
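If we end up needing that, I think the minimal version would be something like
this at the batching sites (untested sketch; userfaultfd_wp() just tests
VM_UFFD_WP on the vma, IIRC):

    max_nr = (end - addr) / PAGE_SIZE;
    /* Sketch: don't batch if uffd-wp may be armed; keep per-pte behaviour. */
    if (userfaultfd_wp(vma))
        max_nr = 1;
    nr = swap_pte_batch(pte, max_nr, entry);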
>
>> +        zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
>>       } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>>
>>       add_mm_rss_vec(mm, rss);
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 0d44ee2b4f9c..d059de6896c1 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -130,7 +130,11 @@ static inline unsigned char swap_count(unsigned char ent)
>>   /* Reclaim the swap entry if swap is getting full*/
>>   #define TTRS_FULL        0x4
>>
>> -/* returns 1 if swap entry is freed */
>> +/*
>> + * returns number of pages in the folio that backs the swap entry. If positive,
>> + * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
>> + * folio was associated with the swap entry.
>> + */
>>   static int __try_to_reclaim_swap(struct swap_info_struct *si,
>>                    unsigned long offset, unsigned long flags)
>>   {
>> @@ -155,6 +159,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>>               ret = folio_free_swap(folio);
>>           folio_unlock(folio);
>>       }
>> +    ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
>>       folio_put(folio);
>>       return ret;
>>   }
>> @@ -895,7 +900,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>           swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
>>           spin_lock(&si->lock);
>>           /* entry was freed successfully, try to use this again */
>> -        if (swap_was_freed)
>> +        if (swap_was_freed > 0)
>>               goto checks;
>>           goto scan; /* check next one */
>>       }
>> @@ -1572,32 +1577,75 @@ bool folio_free_swap(struct folio *folio)
>>       return true;
>>   }
>>
>> -/*
>> - * Free the swap entry like above, but also try to
>> - * free the page cache entry if it is the last user.
>> - */
>> -int free_swap_and_cache(swp_entry_t entry)
>
> Can we have some documentation on what this function expects? How does nr
> relate to entry?
>
> i.e., offset range defined by [entry.offset, entry.offset + nr).

Yep, will add something along these lines.
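Something like this, perhaps (rough sketch of the kerneldoc, wording to be
refined):

    /**
     * free_swap_and_cache_nr() - Release reference on range of swap entries and
     *                            reclaim their cache if no more references remain.
     * @entry: First entry of the range.
     * @nr: Number of entries in the range.
     *
     * For each swap entry in the contiguous range covering offsets
     * [entry.offset, entry.offset + nr), release a reference. If any entries
     * become free as a result, try to reclaim their folios from the swap cache.
     */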
>
>> +void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>>   {
>> -    struct swap_info_struct *p;
>
> It might be easier to get if you do
>
> const unsigned long start_offset = swp_offset(entry);
> const unsigned long end_offset = start_offset + nr;

OK, will do this.

>
>> +    unsigned long end = swp_offset(entry) + nr;
>> +    unsigned int type = swp_type(entry);
>> +    struct swap_info_struct *si;
>> +    bool any_only_cache = false;
>> +    unsigned long offset;
>>       unsigned char count;
>>
>>       if (non_swap_entry(entry))
>> -        return 1;
>> +        return;
>>
>> -    p = get_swap_device(entry);
>> -    if (p) {
>> -        if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
>> -            put_swap_device(p);
>> -            return 0;
>> +    si = get_swap_device(entry);
>> +    if (!si)
>> +        return;
>> +
>> +    if (WARN_ON(end > si->max))
>> +        goto out;
>> +
>> +    /*
>> +     * First free all entries in the range.
>> +     */
>> +    for (offset = swp_offset(entry); offset < end; offset++) {
>> +        if (!WARN_ON(data_race(!si->swap_map[offset]))) {
>
> Ouch, "!(!)". Confusing.
>
> I'm sure there is a better way to write that, maybe using more lines:
>
> if (data_race(si->swap_map[offset])) {
>     ...
> } else {
>     WARN_ON_ONCE(1);
> }

OK, I thought it was kinda neat myself. I'll change it.

>
>
>> +            count = __swap_entry_free(si, swp_entry(type, offset));
>> +            if (count == SWAP_HAS_CACHE)
>> +                any_only_cache = true;
>>           }
>> +    }
>> +
>> +    /*
>> +     * Short-circuit the below loop if none of the entries had their
>> +     * reference drop to zero.
>> +     */
>> +    if (!any_only_cache)
>> +        goto out;
>>
>> -        count = __swap_entry_free(p, entry);
>> -        if (count == SWAP_HAS_CACHE)
>> -            __try_to_reclaim_swap(p, swp_offset(entry),
>> +    /*
>> +     * Now go back over the range trying to reclaim the swap cache. This is
>> +     * more efficient for large folios because we will only try to reclaim
>> +     * the swap once per folio in the common case. If we do
>> +     * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
>> +     * latter will get a reference and lock the folio for every individual
>> +     * page but will only succeed once the swap slot for every subpage is
>> +     * zero.
>> +     */
>> +    for (offset = swp_offset(entry); offset < end; offset += nr) {
>> +        nr = 1;
>> +        if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
>
> Here we use READ_ONCE() only, above data_race(). Hmmm.

Yes, I think this is correct. READ_ONCE() is a "marked access" which KCSAN
understands, so it won't complain about it. So data_race() isn't required when
READ_ONCE() (or WRITE_ONCE()) is used.

I believe READ_ONCE() is required here because we don't hold a lock and we
want to make sure we read the value in a non-tearing manner.

We don't need the READ_ONCE() above, since we don't care about the exact
value - only that it's not 0 (because we should be holding a ref). So we do a
plain access to give the compiler a bit more freedom, but we need to mark it
with data_race() to stop KCSAN from complaining.

>
>> +            /*
>> +             * Folios are always naturally aligned in swap so
>> +             * advance forward to the next boundary. Zero means no
>> +             * folio was found for the swap entry, so advance by 1
>> +             * in this case. Negative value means folio was found
>> +             * but could not be reclaimed. Here we can still advance
>> +             * to the next boundary.
>> +             */
>> +            nr = __try_to_reclaim_swap(si, offset,
>>                             TTRS_UNMAPPED | TTRS_FULL);
>> -            put_swap_device(p);
>> +            if (nr == 0)
>> +                nr = 1;
>> +            else if (nr < 0)
>> +                nr = -nr;
>> +            nr = ALIGN(offset + 1, nr) - offset;
>> +        }
>
> Apart from that, nothing jumped at me.

Thanks for the review!