Date: Fri, 9 Apr 2021 14:32:56 +0100
From: Mel Gorman <mgorman@techsingularity.net>
To: Peter Zijlstra
Cc: Linux-MM, Linux-RT-Users, LKML, Chuck Lever, Jesper Dangaard Brouer,
    Matthew Wilcox, Thomas Gleixner, Ingo Molnar, Michal Hocko, Oscar Salvador
Subject: Re: [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock
Message-ID: <20210409133256.GN3697@techsingularity.net>
References: <20210407202423.16022-1-mgorman@techsingularity.net>
 <20210407202423.16022-3-mgorman@techsingularity.net>
 <20210408174244.GG3697@techsingularity.net>
 <20210409075939.GJ3697@techsingularity.net>

On Fri, Apr 09, 2021 at 10:24:24AM +0200, Peter Zijlstra wrote:
> On Fri, Apr 09, 2021 at 08:59:39AM +0100, Mel Gorman wrote:
> > In the end I just gave up and kept it simple as there is no benefit to
> > !PREEMPT_RT, which just disables IRQs. Maybe it'll be worth considering
> > when PREEMPT_RT is upstream and can be enabled. The series was
> > functionally tested on the PREEMPT_RT tree by reverting the page_alloc.c
> > patch and applying this series and all of its prerequisites on top.
>
> Right, I see the problem. Fair enough; perhaps amend the changelog to
> include some of that so that we can 'remember' in a few months why the
> code is 'funneh'.
>

I updated the changelog and also added a comment above the declaration.

That said, there are some curious users already.
fs/squashfs/decompressor_multi_percpu.c looks like it always uses the
local_lock in CPU 0's per-cpu structure instead of stabilising a per-cpu
pointer. drivers/block/zram/zcomp.c appears to do the same, although for
at least one of the zcomp_stream_get() callers the CPU is pinned for
other reasons (a bit spin lock is held). I think it happens to work
anyway, but it's weird and I'm not a fan.
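For comparison, the conventional pattern for a local_lock embedded in a
per-cpu structure looks roughly like the minimal sketch below. The
foo_pcpu structure, its fields and foo_count_event() are illustrative
names only, not taken from squashfs, zram or this patch:

#include <linux/local_lock.h>
#include <linux/percpu.h>

struct foo_pcpu {
	local_lock_t lock;	/* protects the fields below on this CPU */
	unsigned int count;
};

static DEFINE_PER_CPU(struct foo_pcpu, foo_pcpu) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

static void foo_count_event(void)
{
	struct foo_pcpu *fp;

	/*
	 * On !PREEMPT_RT this disables preemption; on PREEMPT_RT it takes
	 * a per-CPU spinlock and disables migration. Either way the task
	 * is pinned afterwards, so this_cpu_ptr() is stable.
	 */
	local_lock(&foo_pcpu.lock);
	fp = this_cpu_ptr(&foo_pcpu);
	fp->count++;
	local_unlock(&foo_pcpu.lock);
}

The patch instead keeps the local_lock outside struct per_cpu_pages, for
the reasons spelled out in the changelog and in the comment above the
declaration.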
Anyway, the new version is below.

--
[PATCH] mm/page_alloc: Convert per-cpu list protection to local_lock

There is a lack of clarity about what exactly
local_irq_save/local_irq_restore protects in page_alloc.c. It conflates
the protection of per-cpu page allocation structures with per-cpu vmstat
deltas.

This patch protects the PCP structure using local_lock, which for most
configurations is identical to IRQ enabling/disabling. The scope of the
lock is still wider than it should be, but this is decreased later.

The local_lock is deliberately declared statically instead of being
placed within a structure. Placing it in the zone offers limited benefit
and confuses what the lock is protecting -- struct per_cpu_pages.
However, putting it in per_cpu_pages is problematic because the task is
not guaranteed to be pinned to the CPU yet, so looking up a per-cpu
structure is unsafe.

[lkp@intel.com: Make pagesets static]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |  2 ++
 mm/page_alloc.c        | 67 +++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 54 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a4393ac27336..106da8fbc72a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include <linux/local_lock.h>
 #include

 /* Free memory management - zoned buddy allocator. */
@@ -337,6 +338,7 @@ enum zone_watermarks {
 #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
 #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)

+/* Fields and list protected by pagesets local_lock in page_alloc.c */
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3bc4da4cbf9c..04644c3dd187 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -112,6 +112,30 @@ typedef int __bitwise fpi_t;
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_FRACTION	(8)

+/*
+ * Protects the per_cpu_pages structures.
+ *
+ * This lock is not placed in struct per_cpu_pages because the task acquiring
+ * the lock is not guaranteed to be pinned to the CPU yet due to
+ * preempt/migrate/IRQs disabled or holding a spinlock. The pattern to acquire
+ * the lock would become
+ *
+ * migrate_disable();
+ * pcp = this_cpu_ptr(zone->per_cpu_pageset);
+ * local_lock_irqsave(&pcp->lock, flags);
+ *
+ * While a helper would avoid code duplication, there is no inherent advantage
+ * and migrate_disable itself is undesirable (see include/linux/preempt.h).
+ * Similarly, putting the lock in the zone offers no particular benefit but
+ * confuses what the lock is protecting.
+ */
+struct pagesets {
+	local_lock_t lock;
+};
+static DEFINE_PER_CPU(struct pagesets, pagesets) = {
+	.lock = INIT_LOCAL_LOCK(lock),
+};
+
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DEFINE_PER_CPU(int, numa_node);
 EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -1421,6 +1445,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		} while (--count && --batch_free && !list_empty(list));
 	}

+	/*
+	 * local_lock_irq held so equivalent to spin_lock_irqsave for
+	 * both PREEMPT_RT and non-PREEMPT_RT configurations.
+	 */
 	spin_lock(&zone->lock);
 	isolated_pageblocks = has_isolate_pageblock(zone);

@@ -1541,6 +1569,11 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 		return;

 	migratetype = get_pfnblock_migratetype(page, pfn);
+
+	/*
+	 * TODO FIX: Disable IRQs before acquiring IRQ-safe zone->lock
+	 * and protect vmstat updates.
+	 */
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
 	free_one_page(page_zone(page), page, pfn, order, migratetype,
@@ -2910,6 +2943,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 {
 	int i, allocated = 0;

+	/*
+	 * local_lock_irq held so equivalent to spin_lock_irqsave for
+	 * both PREEMPT_RT and non-PREEMPT_RT configurations.
+	 */
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
@@ -2962,12 +2999,12 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 	unsigned long flags;
 	int to_drain, batch;

-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	batch = READ_ONCE(pcp->batch);
 	to_drain = min(pcp->count, batch);
 	if (to_drain > 0)
 		free_pcppages_bulk(zone, to_drain, pcp);
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 #endif

@@ -2983,13 +3020,13 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 	unsigned long flags;
 	struct per_cpu_pages *pcp;

-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
+
 	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
 	if (pcp->count)
 		free_pcppages_bulk(zone, pcp->count, pcp);
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 }

 /*
@@ -3252,9 +3289,9 @@ void free_unref_page(struct page *page)
 	if (!free_unref_page_prepare(page, pfn))
 		return;

-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	free_unref_page_commit(page, pfn);
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 }

 /*
@@ -3274,7 +3311,7 @@ void free_unref_page_list(struct list_head *list)
 		set_page_private(page, pfn);
 	}

-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	list_for_each_entry_safe(page, next, list, lru) {
 		unsigned long pfn = page_private(page);

@@ -3287,12 +3324,12 @@ void free_unref_page_list(struct list_head *list)
 		 * a large list of pages to free.
 		 */
 		if (++batch_count == SWAP_CLUSTER_MAX) {
-			local_irq_restore(flags);
+			local_unlock_irqrestore(&pagesets.lock, flags);
 			batch_count = 0;
-			local_irq_save(flags);
+			local_lock_irqsave(&pagesets.lock, flags);
 		}
 	}
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 }

 /*
@@ -3449,7 +3486,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	struct page *page;
 	unsigned long flags;

-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	list = &pcp->lists[migratetype];
 	page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
@@ -3457,7 +3494,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
 		zone_statistics(preferred_zone, zone);
 	}
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 	return page;
 }

@@ -5052,7 +5089,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		goto failed;

 	/* Attempt the batch allocation */
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	pcp_list = &pcp->lists[ac.migratetype];

@@ -5090,12 +5127,12 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		nr_populated++;
 	}

-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);

 	return nr_populated;

 failed_irq:
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);

 failed:
 	page = __alloc_pages(gfp, 0, preferred_nid, nodemask);