Date: Fri, 8 Nov 2019 11:31:02 +0100
From: Michal Hocko
To: Shaokun Zhang
Cc: linux-kernel@vger.kernel.org, yuqi jin, Andrew Morton, Mike Rapoport,
    Paul Burton, Michael Ellerman, Anshuman Khandual, netdev@vger.kernel.org
Subject: Re: [PATCH v3] lib: optimize cpumask_local_spread()
Message-ID: <20191108103102.GF15658@dhcp22.suse.cz>
References: <1573091048-10595-1-git-send-email-zhangshaokun@hisilicon.com>
In-Reply-To: <1573091048-10595-1-git-send-email-zhangshaokun@hisilicon.com>

This changelog looks better, thanks! I still have some questions though.
Btw. cpumask_local_spread is used by the networking code but I do not see
the netdev people involved (Cc netdev).

On Thu 07-11-19 09:44:08, Shaokun Zhang wrote:
> From: yuqi jin
>
> In a multi-processor NUMA system, an I/O driver has to find the CPU
> cores that its IRQs will be bound to. When the CPU cores in the local
> NUMA node have all been used, it is better to find the node closest to
> the local NUMA node instead of choosing any online CPU immediately.
>
> On the Huawei Kunpeng 920 server there are 4 NUMA nodes (0-3) in the
> 2-CPU system (0-1).

Please send a topology of this server (numactl -H).

> We ran a PS (parameter server) business test; the behavior of the
> service is that the client initiates a request through the network card
> and the server responds to the request after calculation.

Is the benchmark publicly available?

> When two PS processes run on node 2 and node 3 separately and the
> network card is located on node 2, which belongs to CPU 1, the
> performance of node 2 (26W QPS) and node 3 (22W QPS) differs.
> It is better that the NIC queues are bound to the CPU 1 cores in turn,
> so that XPS is also properly initialized, while cpumask_local_spread
> only considers the local node. When the number of NIC queues exceeds
> the number of cores in the local node, it falls back to any online core
> directly. So when the PS process on node 3 sends a calculated request,
> the performance is not as good as on node 2. Since the NIC and other
> I/O devices initialize their interrupt binding this way, it is
> reasonable to return the closest node once the cores of the local node
> are used up.

Can you post cpu affinities before and after this patch?

> Let's optimize it and find the nearest node through the NUMA distance
> for the non-local NUMA nodes. The performance will be better if it
> returns the nearest node rather than a random one.
>
> After this patch, the performance of node 3 is the same as node 2, that
> is 26W QPS, when the network card is still on node 2. Since it returns
> the closest non-local NUMA node rather than a random node, it does no
> harm to others at least.

It would also be nice to explain why the current implementation hasn't
taken the path you have chosen. Was it simplicity, or is there a more
fundamental reason? Is there any risk that existing users would regress?
Preferring cpus from the local socket, which is what you aim for, sounds
like a logical thing to do, so I am wondering why this hasn't been
considered.

> Cc: Andrew Morton
> Cc: Mike Rapoport
> Cc: Paul Burton
> Cc: Michal Hocko
> Cc: Michael Ellerman
> Cc: Anshuman Khandual
> Signed-off-by: yuqi jin
> Signed-off-by: Shaokun Zhang
> ---
> ChangeLog from v2:
> 1. Change the variables to static and use a spinlock to protect them;
> 2. Give more explanation on the test and performance;
>
>  lib/cpumask.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 90 insertions(+), 12 deletions(-)
>
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index 0cb672eb107c..b98a2256bc5a 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -6,6 +6,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  /**
>   * cpumask_next - get the next cpu in a cpumask
> @@ -192,18 +193,39 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
>  }
>  #endif
>
> -/**
> - * cpumask_local_spread - select the i'th cpu with local numa cpu's first
> - * @i: index number
> - * @node: local numa_node
> - *
> - * This function selects an online CPU according to a numa aware policy;
> - * local cpus are returned first, followed by non-local ones, then it
> - * wraps around.
> - *
> - * It's not very efficient, but useful for setup.
> - */
> -unsigned int cpumask_local_spread(unsigned int i, int node)
> +static void calc_node_distance(int *node_dist, int node)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr_node_ids; i++)
> +		node_dist[i] = node_distance(node, i);
> +}
> +
> +static int find_nearest_node(int *node_dist, bool *used)
> +{
> +	int i, min_dist = node_dist[0], node_id = -1;
> +
> +	/* Choose the first unused node to compare */
> +	for (i = 0; i < nr_node_ids; i++) {
> +		if (used[i] == 0) {
> +			min_dist = node_dist[i];
> +			node_id = i;
> +			break;
> +		}
> +	}
> +
> +	/* Compare and return the nearest node */
> +	for (i = 0; i < nr_node_ids; i++) {
> +		if (node_dist[i] < min_dist && used[i] == 0) {
> +			min_dist = node_dist[i];
> +			node_id = i;
> +		}
> +	}
> +
> +	return node_id;
> +}
> +
> +static unsigned int __cpumask_local_spread(unsigned int i, int node)
>  {
>  	int cpu;
>
> @@ -231,4 +253,60 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>  	}
>  	BUG();
>  }
> +
> +static DEFINE_SPINLOCK(spread_lock);
> +/**
> + * cpumask_local_spread - select the i'th cpu with local numa cpu's first
> + * @i: index number
> + * @node: local numa_node
> + *
> + * This function selects an online CPU according to a numa aware policy;
> + * local cpus are returned first, followed by the nearest non-local ones,
> + * then it wraps around.
> + *
> + * It's not very efficient, but useful for setup.
> + */
> +unsigned int cpumask_local_spread(unsigned int i, int node)
> +{
> +	static int node_dist[MAX_NUMNODES];
> +	static bool used[MAX_NUMNODES];
> +	unsigned long flags;
> +	int cpu, j, id;
> +
> +	/* Wrap: we always want a cpu. */
> +	i %= num_online_cpus();
> +
> +	if (node == NUMA_NO_NODE) {
> +		for_each_cpu(cpu, cpu_online_mask)
> +			if (i-- == 0)
> +				return cpu;
> +	} else {
> +		if (nr_node_ids > MAX_NUMNODES)
> +			return __cpumask_local_spread(i, node);
> +
> +		spin_lock_irqsave(&spread_lock, flags);
> +		memset(used, 0, nr_node_ids * sizeof(bool));
> +		calc_node_distance(node_dist, node);
> +		for (j = 0; j < nr_node_ids; j++) {
> +			id = find_nearest_node(node_dist, used);
> +			if (id < 0)
> +				break;
> +
> +			for_each_cpu_and(cpu, cpumask_of_node(id),
> +					 cpu_online_mask)
> +				if (i-- == 0) {
> +					spin_unlock_irqrestore(&spread_lock,
> +							       flags);
> +					return cpu;
> +				}
> +			used[id] = 1;
> +		}
> +		spin_unlock_irqrestore(&spread_lock, flags);
> +
> +		for_each_cpu(cpu, cpu_online_mask)
> +			if (i-- == 0)
> +				return cpu;
> +	}
> +	BUG();
> +}
>  EXPORT_SYMBOL(cpumask_local_spread);
> --
> 2.7.4

-- 
Michal Hocko
SUSE Labs
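
As an aside, the nearest-node ordering the patch argues for can be sketched
in plain userspace C. The 4-node distance table and the choice of node 2
below are made-up example values (not the Kunpeng 920 topology), and the
find_nearest_node() helper here folds the patch's calc_node_distance() and
find_nearest_node() pair into a single function for brevity:

/*
 * Userspace sketch of ordering NUMA nodes by distance from a local node.
 * The distance table is hypothetical; a real system reads it from the SLIT.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_NODES 4

/* dist[i][j]: hypothetical NUMA distance from node i to node j */
static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 12, 20, 22 },
	{ 12, 10, 22, 24 },
	{ 20, 22, 10, 12 },
	{ 22, 24, 12, 10 },
};

/* Return the not-yet-used node closest to @node, or -1 if none is left. */
static int find_nearest_node(int node, const bool *used)
{
	int i, best = -1;

	for (i = 0; i < NR_NODES; i++) {
		if (used[i])
			continue;
		if (best < 0 || dist[node][i] < dist[node][best])
			best = i;
	}
	return best;
}

int main(void)
{
	bool used[NR_NODES] = { false };
	int node = 2;	/* pretend the NIC sits on node 2 */
	int i, id;

	printf("search order from node %d:", node);
	for (i = 0; i < NR_NODES; i++) {
		id = find_nearest_node(node, used);
		if (id < 0)
			break;
		printf(" %d", id);
		used[id] = true;
	}
	printf("\n");	/* prints "search order from node 2: 2 3 0 1" */
	return 0;
}

With these example distances a device on node 2 hands out node-2 cores
first, then node 3, and only then falls back to the remote socket (nodes 0
and 1), which is the behavior the changelog describes.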