Subject: Re: [PATCH v6 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
From: Ryan Roberts
To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Huang Ying, Gao Xiang,
 Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Barry Song <21cnbao@gmail.com>,
 Chris Li, Lance Yang
Date: Mon, 8 Apr 2024 14:27:36 +0100
Message-ID: <2cfa542a-ae38-4867-a64b-621e7778fdf7@arm.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240403114032.1162100-1-ryan.roberts@arm.com>
 <20240403114032.1162100-3-ryan.roberts@arm.com>
 <051052af-3b56-4290-98d3-fd5a1eb11ce1@redhat.com>
 <4110bb1d-65e5-4cf0-91ad-62749975829d@arm.com>

On 08/04/2024 13:47, Ryan Roberts wrote:
> On 08/04/2024 13:07, Ryan Roberts wrote:
>> [...]
>>>
>>> [...]
>>>
>>>> +
>>>> +/**
>>>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>>>> + * @start_ptep: Page table pointer for the first entry.
>>>> + * @max_nr: The maximum number of table entries to consider.
>>>> + * @entry: Swap entry recovered from the first table entry.
>>>> + *
>>>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
>>>> + * containing swap entries all with consecutive offsets and targeting the same
>>>> + * swap type.
>>>> + *
>>>
>>> Likely you should document that any swp pte bits are ignored? ()
>>
>> Now that I understand what swp pte bits are, I think the simplest thing is to
>> just make this function always consider the pte bits by using pte_same() as you
>> suggest below? I don't think there is ever a case for ignoring the swp pte bits?
>> And then I don't need to do anything special for uffd-wp either (below you
>> suggested not doing batching when the VMA has uffd enabled).
>>
>> Any concerns?
>>
>>>
>>>> + * max_nr must be at least one and must be limited by the caller so scanning
>>>> + * cannot exceed a single page table.
>>>> + *
>>>> + * Return: the number of table entries in the batch.
>>>> + */
>>>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
>>>> +                 swp_entry_t entry)
>>>> +{
>>>> +    const pte_t *end_ptep = start_ptep + max_nr;
>>>> +    unsigned long expected_offset = swp_offset(entry) + 1;
>>>> +    unsigned int expected_type = swp_type(entry);
>>>> +    pte_t *ptep = start_ptep + 1;
>>>> +
>>>> +    VM_WARN_ON(max_nr < 1);
>>>> +    VM_WARN_ON(non_swap_entry(entry));
>>>> +
>>>> +    while (ptep < end_ptep) {
>>>> +        pte_t pte = ptep_get(ptep);
>>>> +
>>>> +        if (pte_none(pte) || pte_present(pte))
>>>> +            break;
>>>> +
>>>> +        entry = pte_to_swp_entry(pte);
>>>> +
>>>> +        if (non_swap_entry(entry) ||
>>>> +            swp_type(entry) != expected_type ||
>>>> +            swp_offset(entry) != expected_offset)
>>>> +            break;
>>>> +
>>>> +        expected_offset++;
>>>> +        ptep++;
>>>> +    }
>>>> +
>>>> +    return ptep - start_ptep;
>>>> +}
>>>
>>> Looks very clean :)
>>>
>>> I was wondering whether we could similarly construct the expected swp PTE and
>>> only check pte_same.
>>>
>>> expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset));
>>
>> So planning to do this.
>
> Of course this clears all the swp pte bits in expected_pte. So we need to do
> something a bit more complex.
>
> If we can safely assume all offset bits are contiguous in every per-arch
> representation then we can do:

Looks like at least csky and hexagon store the offset in discontiguous regions.
So it will have to be the second approach if we want to avoid anything
arch-specific. I'll assume that for now; we can always specialize
pte_next_swp_offset() per-arch in the future if needed.
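
To make the "contiguous offset bits" point concrete, here is a throwaway
userspace sketch. The split-field encoding below is made up purely for
illustration (it is not the real csky or hexagon layout): once the offset
straddles two discontiguous bit ranges, adding the encoding of offset 1 to the
raw value no longer produces the encoding of offset + 1, because the carry out
of the low field lands in unrelated bits instead of the high field.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout: offset bits [7:0] live at pte bits [9:2], the
 * remaining offset bits live at pte bits [12+]; bits 10-11 are unrelated.
 */
static uint64_t encode_offset(uint64_t off)
{
    return ((off & 0xff) << 2) | ((off >> 8) << 12);
}

static uint64_t decode_offset(uint64_t pte)
{
    return ((pte >> 2) & 0xff) | ((pte >> 12) << 8);
}

int main(void)
{
    uint64_t pte = encode_offset(255);    /* low field fully populated */
    uint64_t inc = encode_offset(1);      /* the "__swp_entry(0, 1)" increment */
    uint64_t bumped = pte + inc;

    /* The carry escapes into bits 10-11, so the decode goes wrong. */
    printf("want offset 256, decoded %llu\n",
           (unsigned long long)decode_offset(bumped));
    return 0;
}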
>
> static inline pte_t pte_next_swp_offset(pte_t pte)
> {
>     pte_t offset_inc = __swp_entry_to_pte(__swp_entry(0, 1));
>
>     return __pte(pte_val(pte) + pte_val(offset_inc));
> }
>
> Or if not:
>
> static inline pte_t pte_next_swp_offset(pte_t pte)
> {
>     swp_entry_t entry = pte_to_swp_entry(pte);
>     pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry), swp_offset(entry) + 1));
>
>     if (pte_swp_soft_dirty(pte))
>         new = pte_swp_mksoft_dirty(new);
>     if (pte_swp_exclusive(pte))
>         new = pte_swp_mkexclusive(new);
>     if (pte_swp_uffd_wp(pte))
>         new = pte_swp_mkuffd_wp(new);
>
>     return new;
> }
>
> Then swap_pte_batch() becomes:
>
> static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
> {
>     pte_t expected_pte = pte_next_swp_offset(pte);
>     const pte_t *end_ptep = start_ptep + max_nr;
>     pte_t *ptep = start_ptep + 1;
>
>     VM_WARN_ON(max_nr < 1);
>     VM_WARN_ON(!is_swap_pte(pte));
>     VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
>
>     while (ptep < end_ptep) {
>         pte = ptep_get(ptep);
>
>         if (!pte_same(pte, expected_pte))
>             break;
>
>         expected_pte = pte_next_swp_offset(expected_pte);
>         ptep++;
>     }
>
>     return ptep - start_ptep;
> }
>
> Would you be happy with either of these? I'll go look if we can assume the
> offset bits are always contiguous.
>
>
>>
>>>
>>> ... or have a variant to increase only the swp offset for an existing pte. But
>>> non-trivial due to the arch-dependent format.
>>
>> Not this - I agree this will be difficult due to per-arch changes. I'd rather
>> just do the generic version and leave the compiler to do the best it can to
>> simplify and optimize.
>>
>>>
>>> But then, we'd fail on mismatch of other swp pte bits.
>>>
>>>
>>> On swapin, when reusing this function (likely!), we might want to make sure
>>> that the PTE bits match as well.
>>>
>>> See below regarding uffd-wp.
>>>
>>>
>>>>   #endif /* CONFIG_MMU */
>>>>
>>>>   void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>> index 1f77a51baaac..070bedb4996e 100644
>>>> --- a/mm/madvise.c
>>>> +++ b/mm/madvise.c
>>>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>>>> long addr,
>>>>       struct folio *folio;
>>>>       int nr_swap = 0;
>>>>       unsigned long next;
>>>> +    int nr, max_nr;
>>>>
>>>>       next = pmd_addr_end(addr, end);
>>>>       if (pmd_trans_huge(*pmd))
>>>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>>>> long addr,
>>>>           return 0;
>>>>       flush_tlb_batched_pending(mm);
>>>>       arch_enter_lazy_mmu_mode();
>>>> -    for (; addr != end; pte++, addr += PAGE_SIZE) {
>>>> +    for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
>>>> +        nr = 1;
>>>>           ptent = ptep_get(pte);
>>>>
>>>>           if (pte_none(ptent))
>>>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>>>> long addr,
>>>>
>>>>               entry = pte_to_swp_entry(ptent);
>>>>               if (!non_swap_entry(entry)) {
>>>> -                nr_swap--;
>>>> -                free_swap_and_cache(entry);
>>>> -                pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>>> +                max_nr = (end - addr) / PAGE_SIZE;
>>>> +                nr = swap_pte_batch(pte, max_nr, entry);
>>>> +                nr_swap -= nr;
>>>> +                free_swap_and_cache_nr(entry, nr);
>>>> +                clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>>>               } else if (is_hwpoison_entry(entry) ||
>>>>                      is_poisoned_swp_entry(entry)) {
>>>>                   pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 7dc6c3d9fa83..ef2968894718 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather
>>>> *tlb,
>>>>                   folio_remove_rmap_pte(folio, page, vma);
>>>>               folio_put(folio);
>>>>           } else if (!non_swap_entry(entry)) {
>>>> -            /* Genuine swap entry, hence a private anon page */
>>>> +            max_nr = (end - addr) / PAGE_SIZE;
>>>> +            nr = swap_pte_batch(pte, max_nr, entry);
>>>> +            /* Genuine swap entries, hence private anon pages */
>>>>               if (!should_zap_cows(details))
>>>>                   continue;
>>>> -            rss[MM_SWAPENTS]--;
>>>> -            if (unlikely(!free_swap_and_cache(entry)))
>>>> -                print_bad_pte(vma, addr, ptent, NULL);
>>>> +            rss[MM_SWAPENTS] -= nr;
>>>> +            free_swap_and_cache_nr(entry, nr);
>>>>           } else if (is_migration_entry(entry)) {
>>>>               folio = pfn_swap_entry_folio(entry);
>>>>               if (!should_zap_folio(details, folio))
>>>> @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>>               pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>>>>               WARN_ON_ONCE(1);
>>>>           }
>>>> -        pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>>> -        zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
>>>> +        clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>>
>>> For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match.
>>>
>>> zap_install_uffd_wp_if_needed() will use the uffd-wp information in
>>> ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers.
>>>
>>> On mixture, you either lose some or place too many markers.
>>>
>>> A simple workaround would be to disable any such batching if the VMA does have
>>> uffd-wp enabled.
>>
>> Rather than this, I'll just consider all the swp pte bits when batching.
>>
>>>
>>>> +        zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
>>>>       } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>>
>> [...]
>>
>
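
P.S. For completeness, if we ever did want the simpler workaround instead
(skip batching on uffd-wp armed VMAs), I'd imagine it would look something
like the below untested fragment at the zap_pte_range() call site, using the
userfaultfd_wp() VMA check (I believe that's the right helper). Noting it here
only as a sketch, since I'm going the pte_same() route:

    /* Untested sketch of the alternative: only batch swap PTEs when the VMA
     * is not registered for uffd-wp, so zap_install_uffd_wp_if_needed()
     * continues to see exactly one PTE's worth of state per call.
     */
    max_nr = (end - addr) / PAGE_SIZE;
    if (userfaultfd_wp(vma))
        nr = 1;                                  /* no batching; keep per-PTE markers */
    else
        nr = swap_pte_batch(pte, max_nr, entry); /* batch as in this patch */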