Subject: Re: [PATCH v2 2/2] mm: swap: Swap-out small-sized THP without splitting
From: Ryan Roberts
To: "Huang, Ying"
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang, Yu Zhao,
 Yang Shi, Michal Hocko, Kefeng Wang, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org
Date: Wed, 18 Oct 2023 15:07:55 +0100
References: <20231017161302.2518826-1-ryan.roberts@arm.com>
 <20231017161302.2518826-3-ryan.roberts@arm.com>
 <87r0ls773p.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <87r0ls773p.fsf@yhuang6-desk2.ccr.corp.intel.com>

On 18/10/2023 07:55, Huang, Ying wrote:
> Ryan Roberts writes:
>
>> The upcoming anonymous small-sized THP feature enables performance
>> improvements by allocating large folios for anonymous memory. However
>> I've observed that on an arm64 system running a parallel workload (e.g.
>> kernel compilation) across many cores, under high memory pressure, the
>> speed regresses. This is due to bottlenecking on the increased number
>> of TLBIs added due to all the extra folio splitting.
>>
>> Therefore, solve this regression by adding support for swapping out
>> small-sized THP without needing to split the folio, just as is already
>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>> enabled, and when the swap backing store is a non-rotating block device.
>> These are the same constraints as for the existing PMD-sized THP
>> swap-out support.
>>
>> Note that no attempt is made to swap in THP here - this is still done
>> page-by-page, like for PMD-sized THP.
>>
>> The main change here is to improve the swap entry allocator so that it
>> can allocate any power-of-2 number of contiguous entries between [4, (1
>> << PMD_ORDER)] (THP cannot support order-1 folios). This is done by
>> allocating a cluster for each distinct order and allocating sequentially
>> from it until the cluster is full. This ensures that we don't need to
>> search the map and we get no fragmentation due to alignment padding for
>> different orders in the cluster. If there is no current cluster for a
>> given order, we attempt to allocate a free cluster from the list. If
>> there are no free clusters, we fail the allocation and the caller falls
>> back to splitting the folio and allocates individual entries (as per
>> the existing PMD-sized THP fallback).
>>
>> The per-order current clusters are maintained per-cpu using the existing
>> percpu_cluster infrastructure. This is done to avoid interleaving pages
>> from different tasks, which would prevent IO being batched. This is
>> already done for the order-0 allocations, so we follow the same pattern.
>>
>> As far as I can tell, this should not cause any extra fragmentation
>> concerns, given how similar it is to the existing PMD-sized THP
>> allocation mechanism. There could be up to (PMD_ORDER-2) * nr_cpus
>> clusters in concurrent use though, which in a pathological case (cluster
>> set aside for every order for every cpu and only one huge entry
>> allocated from it) would tie up ~12MiB of unused swap entries for these
>> high orders (assuming PMD_ORDER=9). In practice, the number of orders in
>> use will be small and the amount of swap space reserved is very small
>> compared to a typical swap file.
>>
>> Note that PMD_ORDER is not a compile-time constant on powerpc, so we
>> have to allocate the large_next[] array at runtime.
>>
>> I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
>> device as the swap device and from inside a memcg limited to 40G memory.
>> I've then run `usemem` from vm-scalability with 70 processes (each has
>> its own core), each allocating and writing 1G of memory. I've repeated
>> everything 5 times and taken the mean and stdev:
>>
>> Mean Performance Improvement vs 4K/baseline
>>
>> | alloc size |            baseline |       + this series |
>> |            |  v6.6-rc4+anonfolio |                     |
>> |:-----------|--------------------:|--------------------:|
>> | 4K Page    |                0.0% |                1.1% |
>> | 64K THP    |              -44.1% |                0.9% |
>> | 2M THP     |               56.0% |               56.4% |
>>
>> So with this change, the regression for 64K swap performance goes away.
>> Both 4K and 64K benchmarks are now bottlenecked on TLBI performance from
>> try_to_unmap_flush_dirty(), on arm64 at least. When using fewer cpus in
>> the test, I see up to 2x the performance of 64K THP swapping compared
>> to 4K.
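
To make the allocation scheme described above concrete, here is a small
userspace model of one per-cpu, per-order "current cluster" cursor carving
entries out of a 512-entry cluster. This is illustrative only, not kernel
code: the SWAPFILE_CLUSTER value and the UINT_MAX "no current cluster"
convention come from the patch below, everything else is simplified for
the example.

#include <limits.h>
#include <stdio.h>

#define SWAPFILE_CLUSTER 512	/* swap entries per cluster (PMD order, 4K pages) */

/*
 * Model of a single per-cpu, per-order "current cluster" cursor.
 * UINT_MAX means "no current cluster for this order".
 */
static unsigned int large_next = UINT_MAX;

/*
 * Hand out nr_pages (a power of 2, 4 <= nr_pages <= SWAPFILE_CLUSTER)
 * contiguous entries, consuming one cluster sequentially before taking
 * the next free one (free-list handling is reduced to a parameter here).
 */
static unsigned int alloc_large(unsigned int free_cluster_idx,
				unsigned int nr_pages)
{
	unsigned int offset;

	if (large_next == UINT_MAX)			/* take a fresh cluster */
		large_next = free_cluster_idx * SWAPFILE_CLUSTER;

	offset = large_next;
	large_next += nr_pages;				/* bump sequentially */
	if (large_next % SWAPFILE_CLUSTER == 0)		/* cluster fully consumed */
		large_next = UINT_MAX;

	return offset;
}

int main(void)
{
	/* e.g. 64K folios on 4K base pages: nr_pages = 16, 32 fit per cluster */
	for (int i = 0; i < 4; i++) {
		unsigned int off = alloc_large(3, 16);

		printf("swap entries [%u..%u]\n", off, off + 15);
	}
	return 0;
}

Because a cluster base is a multiple of SWAPFILE_CLUSTER and only one order
is ever served from a given cluster, every offset handed out is naturally
aligned to nr_pages, which is where the "no fragmentation due to alignment
padding" property comes from.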
>>
>> Signed-off-by: Ryan Roberts
>> ---
>>  include/linux/swap.h |  6 ++++
>>  mm/swapfile.c        | 74 +++++++++++++++++++++++++++++++++++---------
>>  mm/vmscan.c          | 10 +++---
>>  3 files changed, 71 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index a073366a227c..35cbbe6509a9 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -268,6 +268,12 @@ struct swap_cluster_info {
>>  struct percpu_cluster {
>>  	struct swap_cluster_info index; /* Current cluster index */
>>  	unsigned int next; /* Likely next allocation offset */
>> +	unsigned int large_next[];	/*
>> +					 * next free offset within current
>> +					 * allocation cluster for large folios,
>> +					 * or UINT_MAX if no current cluster.
>> +					 * Index is (order - 1).
>> +					 */
>>  };
>>
>>  struct swap_cluster_list {
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index b83ad77e04c0..625964e53c22 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -987,35 +987,70 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  	return n_ret;
>>  }
>>
>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
>> +			    unsigned int nr_pages)
>
> This looks hacky. IMO, we should put the allocation logic inside the
> percpu_cluster framework. If the percpu_cluster framework doesn't work
> for you, refactor it first.

I'm not sure I really understand what you are suggesting - could you
elaborate? What "framework"? I only see a per-cpu data structure and
scan_swap_map_try_ssd_cluster(), which is very much geared towards order-0
allocations.

Are you suggesting you want to allocate large entries (> order-0) from the
same cluster that is used for small (order-0) entries? The problem with
this approach is that there may not be enough space left in the current
cluster for the large entry that you are trying to allocate. Then you
would need to take a new cluster from the free list and presumably leave
the previous cluster with unused entries (which will now only be
accessible by scanning). That unused space could be considerable.

That's why I am currently reserving a "current cluster" per order; that
way, all allocations are the same order, they are all naturally aligned
and there is no waste.

Perhaps I could implement "stealing" between cpus, so that a cpu trying to
allocate a specific order, but which doesn't have a current cluster for
that order and finds the free list empty, could allocate from another
cpu's current cluster? I don't think it's a good idea to mix different
orders in the same cluster though.

I guess if free space is really low, I could remove a current cluster from
a cpu and allow it to be scanned for order-0 allocations at least?

Any opinions gratefully received!

>
>>  {
>> +	int order_idx;
>>  	unsigned long idx;
>>  	struct swap_cluster_info *ci;
>> +	struct percpu_cluster *cluster;
>>  	unsigned long offset;
>>
>>  	/*
>>  	 * Should not even be attempting cluster allocations when huge
>>  	 * page swap is disabled. Warn and fail the allocation.
>>  	 */
>> -	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
>> +	if (!IS_ENABLED(CONFIG_THP_SWAP) ||
>> +	    nr_pages < 4 || nr_pages > SWAPFILE_CLUSTER ||
>> +	    !is_power_of_2(nr_pages)) {
>>  		VM_WARN_ON_ONCE(1);
>>  		return 0;
>>  	}
>>
>> -	if (cluster_list_empty(&si->free_clusters))
>> +	/*
>> +	 * Not using clusters so unable to allocate large entries.
>> +	 */
>> +	if (!si->cluster_info)
>>  		return 0;
>>
>> -	idx = cluster_list_first(&si->free_clusters);
>> -	offset = idx * SWAPFILE_CLUSTER;
>> -	ci = lock_cluster(si, offset);
>> -	alloc_cluster(si, idx);
>> -	cluster_set_count(ci, SWAPFILE_CLUSTER);
>> +	order_idx = ilog2(nr_pages) - 2;
>> +	cluster = this_cpu_ptr(si->percpu_cluster);
>> +	offset = cluster->large_next[order_idx];
>> +
>> +	if (offset == UINT_MAX) {
>> +		if (cluster_list_empty(&si->free_clusters))
>> +			return 0;
>> +
>> +		idx = cluster_list_first(&si->free_clusters);
>> +		offset = idx * SWAPFILE_CLUSTER;
>>
>> -	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
>> +		ci = lock_cluster(si, offset);
>> +		alloc_cluster(si, idx);
>> +		cluster_set_count(ci, SWAPFILE_CLUSTER);
>> +
>> +		/*
>> +		 * If scan_swap_map_slots() can't find a free cluster, it will
>> +		 * check si->swap_map directly. To make sure this standby
>> +		 * cluster isn't taken by scan_swap_map_slots(), mark the swap
>> +		 * entries bad (occupied). (same approach as discard).
>> +		 */
>> +		memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD,
>> +			SWAPFILE_CLUSTER - nr_pages);
>
> There's an issue with this solution. If the free space of the swap
> device runs low, it's possible that
>
> - some clusters are put in the percpu_cluster of some CPUs,
>   and the swap entries there are marked as used
>
> - no free swap entries elsewhere
>
> - nr_swap_pages isn't 0
>
> So, we will still scan the LRU, but swap allocation fails, although
> there's still free swap space.

Ahh yes, good spot.

>
> I think that we should follow the method we used for the original
> percpu_cluster. That is, if all free swap entries are in
> percpu_cluster, we will start to allocate from percpu_cluster.

As I suggested above, I think I could do this by removing a cpu's current
cluster for a given order from the percpu_cluster and making it generally
available for scanning. Does that work for you?

>
>> +	} else {
>> +		idx = offset / SWAPFILE_CLUSTER;
>> +		ci = lock_cluster(si, offset);
>> +	}
>> +
>> +	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
>>  	unlock_cluster(ci);
>> -	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
>> +	swap_range_alloc(si, offset, nr_pages);
>>  	*slot = swp_entry(si->type, offset);
>>
>> +	offset += nr_pages;
>> +	if (idx != offset / SWAPFILE_CLUSTER)
>> +		offset = UINT_MAX;
>> +	cluster->large_next[order_idx] = offset;
>> +
>>  	return 1;
>>  }
>>
>
> [snip]
>
> --
> Best Regards,
> Huang, Ying
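
For concreteness, the "remove a cpu's current cluster for a given order and
make it generally available for scanning" idea discussed above might end up
looking roughly like the sketch below. This is illustrative only and not
part of the posted series: the number of large_next[] slots (assumed to be
PMD_ORDER - 1, indexed by order - 2), the unreserve_cluster_tail() helper,
and the trigger condition are all assumptions for the example, and the
swap_map accounting and locking it would need are omitted.

/*
 * Illustrative sketch only (not part of this series): when free clusters
 * run out, give back each cpu's per-order "current cluster" so that
 * scan_swap_map_slots() can find the remaining entries for order-0
 * allocations. unreserve_cluster_tail() is a hypothetical helper that
 * would clear the SWAP_MAP_BAD padding and fix up cluster accounting.
 */
static void release_percpu_large_clusters(struct swap_info_struct *si)
{
	int cpu, order_idx;

	for_each_possible_cpu(cpu) {
		struct percpu_cluster *cluster;

		cluster = per_cpu_ptr(si->percpu_cluster, cpu);

		for (order_idx = 0; order_idx < PMD_ORDER - 1; order_idx++) {
			unsigned int offset = cluster->large_next[order_idx];

			if (offset == UINT_MAX)
				continue;

			/* make the unused tail visible to the scanner again */
			unreserve_cluster_tail(si, offset);
			cluster->large_next[order_idx] = UINT_MAX;
		}
	}
}

A similar walk over the other cpus' percpu_cluster entries is probably also
what the "stealing" variant mentioned earlier would look like, except that
the found cluster would be handed to the allocating cpu rather than being
released for scanning.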