Date: Wed, 16 Jun 2021 12:07:43 +0100
From: Mel Gorman <mgorman@techsingularity.net>
To: Andrew Morton
Cc: Hillf Danton, Dave Hansen, Vlastimil Babka, Michal Hocko,
    LKML <linux-kernel@vger.kernel.org>, Linux-MM, "Tang, Feng"
Subject: [PATCH] mm/page_alloc: Split pcp->high across all online CPUs for cpuless nodes
Message-ID: <20210616110743.GK30378@techsingularity.net>

Dave Hansen reported the following about Feng Tang's tests on a machine
with persistent memory onlined as a DRAM-like device.
  Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
  ~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile
  use" mode and being managed via the buddy just like the normal RAM.

  The PMEM zones are big ones:

        present  65011712 = 248 G
        high       134595 = 525 M

  The PMEM nodes, of course, don't have any CPUs in them.

  With your series, the pcp->high value per-cpu is 69584 pages or about
  270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of
  worst-case memory in the pcps per zone, or roughly 10% of the size of
  the zone.

This should not cause a problem as such although it could trigger reclaim
due to pages being stored on per-cpu lists for CPUs remote to a node. It
is not possible to treat cpuless nodes exactly the same as normal nodes
but the worst-case scenario can be mitigated by splitting pcp->high across
all online CPUs for cpuless memory nodes.

Suggested-by: Dave Hansen
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka
Acked-by: Dave Hansen
---
 mm/page_alloc.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3ab6aac2f1a3..21c67a587e36 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6687,7 +6687,7 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
 {
 #ifdef CONFIG_MMU
         int high;
-        int nr_local_cpus;
+        int nr_split_cpus;
         unsigned long total_pages;

         if (!percpu_pagelist_high_fraction) {
@@ -6710,10 +6710,14 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
          * Split the high value across all online CPUs local to the zone. Note
          * that early in boot that CPUs may not be online yet and that during
          * CPU hotplug that the cpumask is not yet updated when a CPU is being
-         * onlined.
-         */
-        nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
-        high = total_pages / nr_local_cpus;
+         * onlined. For memory nodes that have no CPUs, split pcp->high across
+         * all online CPUs to mitigate the risk that reclaim is triggered
+         * prematurely due to pages stored on pcp lists.
+         */
+        nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online;
+        if (!nr_split_cpus)
+                nr_split_cpus = num_online_cpus();
+        high = total_pages / nr_split_cpus;

         /*
          * Ensure high is at least batch*4. The multiple is based on the
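
(Not part of the patch, just an illustrative aside: a minimal userspace
sketch of the arithmetic above, assuming the figures from Dave's report,
i.e. a 69584-page per-zone pcp->high budget, 96 online CPUs and 4K pages.
It only shows how falling back to num_online_cpus() shrinks the worst
case for a cpuless node; it is not kernel code and the constants come
from the report rather than from a live system.)

#include <stdio.h>

int main(void)
{
        const double page_mb = 4096.0 / (1024 * 1024);  /* one 4K page in MB */
        const unsigned long budget_pages = 69584;       /* per-zone pcp->high budget from the report */
        const unsigned int online_cpus = 96;            /* every online CPU keeps a pcp for the zone */

        /* Before the fix: the cpuless PMEM node has no local CPUs, so the
         * max(1U, ...) clamp handed the full budget to each online CPU. */
        unsigned long per_cpu_before = budget_pages / 1;

        /* After the fix: nr_split_cpus falls back to num_online_cpus(),
         * so the same budget is divided across all online CPUs. */
        unsigned long per_cpu_after = budget_pages / online_cpus;

        printf("before: %lu pages (%.0f MB) per CPU, worst case %.1f GB per zone\n",
               per_cpu_before, per_cpu_before * page_mb,
               per_cpu_before * page_mb * online_cpus / 1024);
        printf("after:  %lu pages (%.1f MB) per CPU, worst case %.1f GB per zone\n",
               per_cpu_after, per_cpu_after * page_mb,
               per_cpu_after * page_mb * online_cpus / 1024);
        return 0;
}

The before case prints roughly 272 MB per CPU and ~25.5 GB worst case per
zone, matching the "about 270MB per CPU" and "~26GB" figures in the report;
the after case drops to a few MB per CPU.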