From: Ryan Roberts
To: "Huang, Ying"
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang, Yu Zhao,
    Yang Shi, Michal Hocko, Kefeng Wang, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, Tim Chen
Subject: Re: [PATCH v2 2/2] mm: swap: Swap-out small-sized THP without splitting
Date: Thu, 19 Oct 2023 15:25:39 +0100
Message-ID: <4b351eaa-c2db-408b-9ce2-4b12e3d6b30a@arm.com>
In-Reply-To: <87mswfuppr.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20231017161302.2518826-1-ryan.roberts@arm.com>
    <20231017161302.2518826-3-ryan.roberts@arm.com>
    <87r0ls773p.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <87mswfuppr.fsf@yhuang6-desk2.ccr.corp.intel.com>

On 19/10/2023 06:49, Huang, Ying wrote:
> Ryan Roberts writes:
>
>> On 18/10/2023 07:55, Huang, Ying wrote:
>>> Ryan Roberts writes:
>>>
>
> [snip]
>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index a073366a227c..35cbbe6509a9 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -268,6 +268,12 @@ struct swap_cluster_info {
>>>>  struct percpu_cluster {
>>>>  	struct swap_cluster_info index; /* Current cluster index */
>>>>  	unsigned int next; /* Likely next allocation offset */
>>>> +	unsigned int large_next[];	/*
>>>> +					 * next free offset within current
>>>> +					 * allocation cluster for large folios,
>>>> +					 * or UINT_MAX if no current cluster.
>>>> +					 * Index is (order - 1).
>>>> +					 */
>>>>  };
>>>>
>>>>  struct swap_cluster_list {
>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>> index b83ad77e04c0..625964e53c22 100644
>>>> --- a/mm/swapfile.c
>>>> +++ b/mm/swapfile.c
>>>> @@ -987,35 +987,70 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>>  	return n_ret;
>>>>  }
>>>>
>>>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>>>> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
>>>> +			    unsigned int nr_pages)
>>>
>>> This looks hacky. IMO, we should put the allocation logic inside
>>> percpu_cluster framework. If percpu_cluster framework doesn't work for
>>> you, just refactor it firstly.
>>
>> I'm not sure I really understand what you are suggesting - could you
>> elaborate? What "framework"? I only see a per-cpu data structure and
>> scan_swap_map_try_ssd_cluster(), which is very much geared towards
>> order-0 allocations.
>
> I suggest to share as much code as possible between order-0 and order >
> 0 swap entry allocation. I think that we can make
> scan_swap_map_try_ssd_cluster() works for order > 0 swap entry allocation.
>
[...]
>>>> +	/*
>>>> +	 * If scan_swap_map_slots() can't find a free cluster, it will
>>>> +	 * check si->swap_map directly. To make sure this standby
>>>> +	 * cluster isn't taken by scan_swap_map_slots(), mark the swap
>>>> +	 * entries bad (occupied). (same approach as discard).
>>>> +	 */
>>>> +	memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD,
>>>> +	       SWAPFILE_CLUSTER - nr_pages);
>>>
>>> There's an issue with this solution. If the free space of swap device
>>> runs low, it's possible that
>>>
>>> - some cluster are put in the percpu_cluster of some CPUs
>>>   the swap entries there are marked as used
>>>
>>> - no free swap entries elsewhere
>>>
>>> - nr_swap_pages isn't 0
>>>
>>> So, we will still scan LRU, but swap allocation fails, although there's
>>> still free swap space.

I'd like to decide how best to solve this problem before I can figure out
how much code sharing is possible between the order-0 and order > 0
allocators. I see a couple of potential options:

1) Manipulate nr_swap_pages so that it accounts for the entries that are
marked SWAP_MAP_BAD: when reserving a new per-order/per-cpu cluster,
subtract SWAPFILE_CLUSTER from nr_swap_pages, then add nr_pages back for
each allocation made from that cluster (rough sketch at the end of this
mail).

2) Don't mark the entries in the reserved cluster as SWAP_MAP_BAD, which
would allow the scanner to steal (order-0) entries from it. The scanner
could set a flag in the cluster info to mark the cluster as having been
allocated from by the scanner, so that the next attempt to allocate a high
order from it discards it as the cpu's current cluster and takes a fresh
cluster from the free list (also sketched below).

While option 2 is a bit more complex, I prefer it as a solution.

What do you think?
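
For concreteness, here is roughly what I have in mind for option 1. It's
only a sketch of the accounting; everything except nr_swap_pages,
SWAPFILE_CLUSTER, SWAP_MAP_BAD and si->swap_map is an invented name:

/*
 * Option 1 sketch: take the whole reserved cluster out of the free count
 * up front, then credit entries back as they are actually allocated.
 */
static void reserve_large_cluster(struct swap_info_struct *si,
				  unsigned long offset, unsigned int nr_pages)
{
	/*
	 * All SWAPFILE_CLUSTER entries stop counting as free, so the
	 * entries marked SWAP_MAP_BAD below can no longer cause the
	 * "nr_swap_pages isn't 0 but allocation fails" scenario.
	 */
	atomic_long_sub(SWAPFILE_CLUSTER, &nr_swap_pages);

	memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD,
	       SWAPFILE_CLUSTER - nr_pages);
}

static void alloc_from_large_cluster(unsigned int nr_pages)
{
	/*
	 * The generic accounting will subtract nr_pages again when the
	 * entries are handed out, so credit them back here to avoid
	 * double counting.
	 */
	atomic_long_add(nr_pages, &nr_swap_pages);
}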
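
And the rough shape of option 2, under the same caveats: CLUSTER_FLAG_STOLEN
is invented (alongside the existing CLUSTER_FLAG_* values), and only
lock_cluster()/unlock_cluster() and the ci->flags field are existing API:

#define CLUSTER_FLAG_STOLEN	8 /* scanner took entries from this cluster */

/*
 * Called from the scanner after it steals an order-0 entry at offset
 * from a cluster that is reserved as some cpu's current per-order
 * cluster.
 */
static void mark_cluster_stolen(struct swap_info_struct *si,
				unsigned long offset)
{
	struct swap_cluster_info *ci = lock_cluster(si, offset);

	ci->flags |= CLUSTER_FLAG_STOLEN;
	unlock_cluster(ci);
}

/*
 * Called from the large allocator before handing out the next nr_pages:
 * if the scanner has dipped into this cluster, retire it as the cpu's
 * current cluster so the caller grabs a fresh one from the free list.
 * (Clearing the flag when the cluster is freed is elided here.)
 */
static bool large_cluster_usable(struct swap_info_struct *si,
				 struct percpu_cluster *cluster,
				 unsigned long offset, int order)
{
	struct swap_cluster_info *ci = lock_cluster(si, offset);
	bool stolen = ci->flags & CLUSTER_FLAG_STOLEN;

	unlock_cluster(ci);
	if (stolen) {
		cluster->large_next[order - 1] = UINT_MAX;
		return false;
	}
	return true;
}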