From: Nathan Lynch
To: Laurent Dufour
Cc: cheloha@linux.ibm.com, linuxppc-dev@lists.ozlabs.org,
    linux-kernel@vger.kernel.org, mpe@ellerman.id.au,
    benh@kernel.crashing.org, paulus@samba.org, Srikar Dronamraju
Subject: Re: [PATCH v2] pseries: prevent free CPU ids to be reused on another node
In-Reply-To: <20210325093512.57856-1-ldufour@linux.ibm.com>
References: <20210325093512.57856-1-ldufour@linux.ibm.com>
Date: Fri, 02 Apr 2021 08:34:01 -0500
Message-ID: <87a6qgbyk6.fsf@linux.ibm.com>

Hi Laurent,

Laurent Dufour writes:
> When a CPU is hot added, the CPU ids are taken from the available mask,
> starting with the lowest possible set. If that set of values was
> previously used for CPUs attached to a different node, applications see
> these CPUs as if they had migrated from one node to another, which is
> not expected in real life.

This seems like a problem that could affect other architectures or
platforms? I guess as long as arch code is responsible for placing new
CPUs in cpu_present_mask, that code will have the responsibility of
ensuring the CPU ids' NUMA assignments remain stable.

[...]

> The effect of this patch can be seen by removing and adding CPUs using
> the Qemu monitor. In the following case, the first CPU of node 2 is
> removed, then the first one of node 1 is removed too. Later, the first
> CPU of node 2 is added back. Without this patch, the kernel numbers
> these CPUs with the lowest free CPU ids, which are the ones freed when
> removing the first CPU of node 1. This causes CPU ids 16-23 to move
> from node 1 to node 2. With the patch applied, CPU ids 32-39 are used,
> since they are the lowest free ids that have not been used on another
> node.
>
> At boot time:
> [root@vm40 ~]# numactl -H | grep cpus
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
> node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
>
> Vanilla kernel, after the CPU hot unplug/plug operations:
> [root@vm40 ~]# numactl -H | grep cpus
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 1 cpus: 24 25 26 27 28 29 30 31
> node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47
>
> Patched kernel, after the CPU hot unplug/plug operations:
> [root@vm40 ~]# numactl -H | grep cpus
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 1 cpus: 24 25 26 27 28 29 30 31
> node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Good demonstration of the problem: CPUs 16-23 "move" from node 1 to
node 2.

> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 12cbffd3c2e3..48c7943b25b0 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -39,6 +39,8 @@
>  /* This version can't take the spinlock, because it never returns */
>  static int rtas_stop_self_token = RTAS_UNKNOWN_SERVICE;
>
> +static cpumask_var_t node_recorded_ids_map[MAX_NUMNODES];

I guess this should have documentation saying that it must be
accessed/manipulated with cpu_add_remove_lock held?
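Something like this, perhaps (wording is only a suggestion):

	/*
	 * Record of the CPU ids ever assigned to each node. Readers
	 * and writers must hold cpu_add_remove_lock, i.e. run inside a
	 * cpu_maps_update_begin()/cpu_maps_update_done() section.
	 */
	static cpumask_var_t node_recorded_ids_map[MAX_NUMNODES];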
> +
>  static void rtas_stop_self(void)
>  {
>  	static struct rtas_args args;
> @@ -151,29 +153,61 @@ static void pseries_cpu_die(unsigned int cpu)
>   */
>  static int pseries_add_processor(struct device_node *np)
>  {
> -	unsigned int cpu;
> +	unsigned int cpu, node;
>  	cpumask_var_t candidate_mask, tmp;
> -	int err = -ENOSPC, len, nthreads, i;
> +	int err = -ENOSPC, len, nthreads, i, nid;

From eight local vars to ten, and the two new variables' names are
"node" and "nid". More distinctive names would help readers.

>  	const __be32 *intserv;
> +	bool force_reusing = false;
>
>  	intserv = of_get_property(np, "ibm,ppc-interrupt-server#s", &len);
>  	if (!intserv)
>  		return 0;
>
> -	zalloc_cpumask_var(&candidate_mask, GFP_KERNEL);
> -	zalloc_cpumask_var(&tmp, GFP_KERNEL);
> +	alloc_cpumask_var(&candidate_mask, GFP_KERNEL);
> +	alloc_cpumask_var(&tmp, GFP_KERNEL);
> +
> +	/*
> +	 * Fetch from the DT nodes read by dlpar_configure_connector() the
> +	 * NUMA node id the added CPU belongs to.
> +	 */
> +	nid = of_node_to_nid(np);
> +	if (nid < 0 || !node_possible(nid))
> +		nid = first_online_node;
>
>  	nthreads = len / sizeof(u32);
> -	for (i = 0; i < nthreads; i++)
> -		cpumask_set_cpu(i, tmp);
>
>  	cpu_maps_update_begin();
>
>  	BUG_ON(!cpumask_subset(cpu_present_mask, cpu_possible_mask));
>
> +again:
> +	cpumask_clear(candidate_mask);
> +	cpumask_clear(tmp);
> +	for (i = 0; i < nthreads; i++)
> +		cpumask_set_cpu(i, tmp);
> +
>  	/* Get a bitmap of unoccupied slots. */
>  	cpumask_xor(candidate_mask, cpu_possible_mask, cpu_present_mask);
> +
> +	/*
> +	 * Remove free ids previously assigned on other nodes. We can walk
> +	 * only the online nodes because once a node has become online, it
> +	 * is not turned offline again.
> +	 */
> +	if (!force_reusing)
> +		for_each_online_node(node) {
> +			if (node == nid) /* Keep our node's recorded ids */
> +				continue;
> +			cpumask_andnot(candidate_mask, candidate_mask,
> +				       node_recorded_ids_map[node]);
> +		}
> +
>  	if (cpumask_empty(candidate_mask)) {
> +		if (!force_reusing) {
> +			force_reusing = true;
> +			goto again;
> +		}
> +

Hmm, I'd encourage you to work toward a solution that doesn't involve
adding backwards jumps and a bool flag to this code. The function
already mixes concerns, and this change makes it a bit more difficult
to follow.

I'd suggest that you first factor out into a separate function the
parts that allocate a suitable range from cpu_possible_mask, and only
then introduce the behavior change constraining the results.
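Roughly something like the sketch below, just to show the shape I have
in mind (completely untested, and the function name is only a
placeholder). It mirrors the existing first-fit scan, with the ids to
avoid passed in by the caller:

	/*
	 * Find a range of nthreads free CPU ids, skipping the ids in
	 * *excluded. Must be called with cpu_add_remove_lock held.
	 * On success, *cpu_mask holds the range found.
	 */
	static int find_cpu_id_range(unsigned int nthreads,
				     const cpumask_t *excluded,
				     cpumask_var_t *cpu_mask)
	{
		cpumask_var_t candidate_mask;
		unsigned int i;
		int rc = -ENOSPC;

		if (!zalloc_cpumask_var(&candidate_mask, GFP_KERNEL))
			return -ENOMEM;

		cpumask_clear(*cpu_mask);
		for (i = 0; i < nthreads; i++)
			cpumask_set_cpu(i, *cpu_mask);

		/* Unoccupied ids, minus the ones the caller wants to avoid. */
		cpumask_xor(candidate_mask, cpu_possible_mask, cpu_present_mask);
		cpumask_andnot(candidate_mask, candidate_mask, excluded);

		/* First-fit scan for an aligned range of nthreads free ids. */
		while (!cpumask_empty(*cpu_mask)) {
			if (cpumask_subset(*cpu_mask, candidate_mask)) {
				rc = 0;
				break;
			}
			cpumask_shift_left(*cpu_mask, *cpu_mask, nthreads);
		}

		free_cpumask_var(candidate_mask);
		return rc;
	}

Then pseries_add_processor() could call it once with the other nodes'
recorded ids excluded and, only if that fails and we decide reuse is
acceptable, a second time with an empty exclusion mask, with no goto
needed.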
>  		/* If we get here, it most likely means that NR_CPUS is
>  		 * less than the partition's max processors setting.
>  		 */
> @@ -191,12 +225,36 @@ static int pseries_add_processor(struct device_node *np)
>  			cpumask_shift_left(tmp, tmp, nthreads);
>
>  	if (cpumask_empty(tmp)) {
> +		if (!force_reusing) {
> +			force_reusing = true;
> +			goto again;
> +		}
>  		printk(KERN_ERR "Unable to find space in cpu_present_mask for"
>  		       " processor %pOFn with %d thread(s)\n", np,
>  		       nthreads);
>  		goto out_unlock;
>  	}
>
> +	/* Record the newly used CPU ids for the associated node. */
> +	cpumask_or(node_recorded_ids_map[nid], node_recorded_ids_map[nid], tmp);
> +
> +	/*
> +	 * If we forced the reuse of ids, remove these ids from any node
> +	 * which was previously using them.
> +	 */
> +	if (force_reusing) {
> +		cpu = cpumask_first(tmp);
> +		pr_warn("Reusing free CPU ids %d-%d from another node\n",
> +			cpu, cpu + nthreads - 1);
> +
> +		for_each_online_node(node) {
> +			if (node == nid)
> +				continue;
> +			cpumask_andnot(node_recorded_ids_map[node],
> +				       node_recorded_ids_map[node], tmp);
> +		}
> +	}
> +

I don't know, should we not fail the request instead of doing the
ABI-breaking thing this change is trying to prevent? I don't think a
warning in the kernel log is going to help any application that would
be affected by this.

>  	for_each_cpu(cpu, tmp) {
>  		BUG_ON(cpu_present(cpu));
>  		set_cpu_present(cpu, true);
> @@ -889,6 +947,7 @@ static struct notifier_block pseries_smp_nb = {
>  static int __init pseries_cpu_hotplug_init(void)
>  {
>  	int qcss_tok;
> +	unsigned int node;
>
>  #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
>  	ppc_md.cpu_probe = dlpar_cpu_probe;
> @@ -910,8 +969,18 @@ static int __init pseries_cpu_hotplug_init(void)
>  	smp_ops->cpu_die = pseries_cpu_die;
>
>  	/* Processors can be added/removed only on LPAR */
> -	if (firmware_has_feature(FW_FEATURE_LPAR))
> +	if (firmware_has_feature(FW_FEATURE_LPAR)) {
> +		for_each_node(node) {
> +			alloc_bootmem_cpumask_var(&node_recorded_ids_map[node]);
> +
> +			/* Record ids of CPUs added at boot time */
> +			cpumask_or(node_recorded_ids_map[node],
> +				   node_recorded_ids_map[node],
> +				   node_to_cpumask_map[node]);
> +		}
> +
>  		of_reconfig_notifier_register(&pseries_smp_nb);
> +	}

This looks OK.
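To make concrete the alternative I mentioned above (failing the request
rather than reusing ids across nodes), I'm imagining something along
these lines in the empty-mask path (untested sketch, message wording is
only illustrative):

	/*
	 * All remaining free ids were used by other nodes. Refuse the
	 * add rather than let CPU ids silently change nodes, since
	 * that breaks the node assignment userspace has already seen.
	 */
	if (cpumask_empty(candidate_mask)) {
		pr_err("Cannot add CPU %pOFn: no free ids left on node %d\n",
		       np, nid);
		goto out_unlock;
	}

With err already initialized to -ENOSPC, the existing out_unlock path
would propagate the failure to the caller.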