From: Shaokun Zhang
To: Dave Hansen
CC: Yuqi Jin, Rusty Russell, Andrew Morton, Juergen Gross, Paul Burton,
 Michal Hocko, Michael Ellerman, Mike Rapoport, Anshuman Khandual
Subject: Re: [PATCH v7] lib: optimize cpumask_local_spread()
Date: Fri, 27 Nov 2020 10:29:16 +0800
References: <1605668072-44780-1-git-send-email-zhangshaokun@hisilicon.com>
 <6a6e6d37-a3dc-94ed-bc8c-62c50ea1dff5@intel.com>
In-Reply-To: <6a6e6d37-a3dc-94ed-bc8c-62c50ea1dff5@intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org
Hi Dave,

Apologies for the late reply.

On 2020/11/21 1:48, Dave Hansen wrote:
> On 11/17/20 6:54 PM, Shaokun Zhang wrote:
>> From: Yuqi Jin
>>
>> In multi-processor and NUMA system, I/O driver will find cpu cores that
>> which shall be bound IRQ. When cpu cores in the local numa have been
>> used up, it is better to find the node closest to the local numa node
>> for performance, instead of choosing any online cpu immediately.
>>
>> On arm64 or x86 platform that has 2-sockets and 4-NUMA nodes, if the
>> network card is located in node2 of socket1, while the number queues
>> of network card is greater than the number of cores of node2, when all
>> cores of node2 has been bound to the queues, the remaining queues will
>> be bound to the cores of node0 which is further than NUMA node3.
>
> That's quite the run-on sentence. :)
>
>> It is
>> not friendly for performance or Intel's DDIO (Data Direct I/O Technology)
>
> Could you explain *why* it is not friendly to DDIO specifically?  This
> patch affects where the interrupt handler runs.  But, DDIO is based on
> memory locations rather than the location of the interrupt handler.
>
> It would be ideal to make that connection: How does the location of the
> interrupt handler impact the memory allocation location?
>

When the interrupt handler runs on a remote chip, it still needs the BD
(buffer descriptor), the packet header, and sometimes even the payload
of the RX packet, but DDIO can only place that data in the cache of the
socket the NIC is attached to, not in the remote one.

>> when if the user enables SNC (sub-NUMA-clustering).
>
> Again, the role that SNC plays here isn't spelled out.  I *believe* it's
> because SNC ends up reducing the number of CPUs in each NUMA node.  That
> makes the existing code run out of CPUs to which to bind to the "local"
> node sooner.

Yes.
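For context, the calling pattern on the driver side is roughly the
following (a sketch only; 'dev', 'irqs[]' and 'nr_queues' stand in for
driver-specific names, they are not from any particular driver):

	int i, cpu;

	/* Bind one IRQ per RX queue, local NUMA node first. */
	for (i = 0; i < nr_queues; i++) {
		cpu = cpumask_local_spread(i, dev_to_node(dev));
		irq_set_affinity_hint(irqs[i], cpumask_of(cpu));
	}

Once 'i' exceeds the number of CPUs in the device's node, every further
queue is placed by the fallback path that this patch changes.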
>
>> +static int find_nearest_node(int node, bool *used)
>> +{
>> +	int i, min_dist, node_id = -1;
>> +
>> +	/* Choose the first unused node to compare */
>> +	for (i = 0; i < nr_node_ids; i++) {
>> +		if (used[i] == false) {
>> +			min_dist = node_distance(node, i);
>> +			node_id = i;
>> +			break;
>> +		}
>> +	}
>> +
>> +	/* Compare and return the nearest node */
>> +	for (i = 0; i < nr_node_ids; i++) {
>> +		if (node_distance(node, i) < min_dist && used[i] == false) {
>> +			min_dist = node_distance(node, i);
>> +			node_id = i;
>> +		}
>> +	}
>> +
>> +	return node_id;
>> +}
>> +
>>  /**
>>   * cpumask_local_spread - select the i'th cpu with local numa cpu's first
>>   * @i: index number
>>   * @node: local numa_node
>>   *
>>   * This function selects an online CPU according to a numa aware policy;
>> - * local cpus are returned first, followed by non-local ones, then it
>> - * wraps around.
>> + * local cpus are returned first, followed by the next one which is the
>> + * nearest unused NUMA node based on NUMA distance, then it wraps around.
>>   *
>>   * It's not very efficient, but useful for setup.
>>   */
>>  unsigned int cpumask_local_spread(unsigned int i, int node)
>
> FWIW, I think 'i' is criminally bad naming.  It should be called
> nr_cpus_to_skip or something similar.
>

Ok, I really didn't consider this parameter naming before.

> I also detest the comments that are there today.
>
>	Loop through all the online CPUs on the system.  Start with the
>	CPUs on 'node', then fall back to CPUs on NUMA nodes which are
>	increasingly far away.
>
>	Skip the first 'nr_cpus_to_skip' CPUs which are found.
>
>	This function is not very efficient, especially for large
>	'nr_cpus_to_skip' because it loops over the same CPUs on each
>	call and does not remember its state from previous calls.
>

Sorry for my poor comment; I will adopt yours.

>> {
>> -	int cpu, hk_flags;
>> +	static DEFINE_SPINLOCK(spread_lock);
>> +	static bool used[MAX_NUMNODES];
>
> I thought I mentioned this last time.  How large is this array?  How
> large would it be if it were a nodemask_t?  Would this be less code if

Apologies that I forgot to do it.

> you just dynamically allocated and freed the node mask instead of having
> a spinlock and a memset?
>

Ok, but I think the spinlock is still needed; am I missing something?
(A rough sketch of the nodemask_t version is at the end of this mail.)

>> +	unsigned long flags;
>> +	int cpu, hk_flags, j, id;
>>  	const struct cpumask *mask;
>>
>>  	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>> @@ -352,20 +379,27 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>>  			return cpu;
>>  		}
>>  	} else {
>> -		/* NUMA first. */
>> -		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>> -			if (i-- == 0)
>> -				return cpu;
>> +		spin_lock_irqsave(&spread_lock, flags);
>> +		memset(used, 0, nr_node_ids * sizeof(bool));
>> +		/* select node according to the distance from local node */
>> +		for (j = 0; j < nr_node_ids; j++) {
>> +			id = find_nearest_node(node, used);
>> +			if (id < 0)
>> +				break;
>
> There's presumably an outer loop in a driver which is trying to bind a
> bunch of interrupts to a bunch of CPUs.  We know there are on the order
> of dozens of these interrupts.
>
> for_each_interrupt()				// in the driver
>	for (j = 0; j < nr_node_ids; j++)	// cpumask_local_spread()
>		// find_nearest_node():
>		for (i = 0; i < nr_node_ids; i++) {
>		for (i = 0; i < nr_node_ids; i++) {
>
> Does this worry anybody else?  I thought our upper limit on the number
> of NUMA nodes was 1024.  Doesn't that make our loop O(N^3) where the
> worst case is hundreds of millions of loops?
>

Even if a real system has 1024 NUMA nodes, it is still worth finding the
nearest node rather than choosing a random one, and the function is only
called during I/O device initialization. Its comment also warns that it
is not very efficient.

> I don't want to prematurely optimize this, but that seems like something
> that might just fall over on bigger systems.
>
> This also seems really wasteful if we have a bunch of memory-only nodes.
> Each of those will be found via find_nearest_node(), but then this loop:
>

Got it: all of the effort here goes into choosing the nearest node for
performance. If we don't do it now, I think someone will have to debug
this in the future.

>> +			for_each_cpu_and(cpu, cpumask_of_node(id), mask)
>> +				if (i-- == 0) {
>> +					spin_unlock_irqrestore(&spread_lock,
>> +							       flags);
>> +					return cpu;
>> +				}
>> +			used[id] = true;
>>  		}
>
> Will just exit immediately because cpumask_of_node() is empty.

Yes, and this node's used[id] becomes true.

>
> 'used', for instance, should start by setting 'true' for all nodes which
> are not in N_CPUS.

No, because I used 'nr_node_ids', which covers the possible node ids,
for the check.

Thanks,
Shaokun

>
>> +		spin_unlock_irqrestore(&spread_lock, flags);
>>
>> -		for_each_cpu(cpu, mask) {
>> -			/* Skip NUMA nodes, done above. */
>> -			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
>> -				continue;
>> -
>> +		for_each_cpu(cpu, mask)
>>  			if (i-- == 0)
>>  				return cpu;
>> -		}
>>  	}
>>  	BUG();
>>  }
> .
>
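P.S. To make sure I understand the nodemask_t idea, below is a rough and
untested sketch of my reading of your suggestion, not a tested patch.
find_nearest_node() is reworked to take a nodemask_t * (and collapses
into a single pass), and NODEMASK_ALLOC()/NODEMASK_FREE() from
<linux/nodemask.h> fall back to a stack variable when MAX_NUMNODES is
small. Nodes without CPUs are pre-marked as used, as you suggested, so
memory-only nodes are never searched:

	static int find_nearest_node(int node, const nodemask_t *used)
	{
		int i, min_dist = INT_MAX, node_id = NUMA_NO_NODE;

		/* Single pass: nearest node not yet marked as used. */
		for (i = 0; i < nr_node_ids; i++) {
			if (node_isset(i, *used))
				continue;
			if (node_distance(node, i) < min_dist) {
				min_dist = node_distance(node, i);
				node_id = i;
			}
		}

		return node_id;
	}

	unsigned int cpumask_local_spread(unsigned int i, int node)
	{
		const struct cpumask *mask;
		int cpu, hk_flags, j, id;
		NODEMASK_ALLOC(nodemask_t, used, GFP_KERNEL);

		hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
		mask = housekeeping_cpumask(hk_flags);
		/* Wrap: we always want a cpu. */
		i %= cpumask_weight(mask);

		if (node == NUMA_NO_NODE || !used) {
			for_each_cpu(cpu, mask)
				if (i-- == 0)
					goto out;
		} else {
			/*
			 * Pre-mark nodes that have no CPUs so that
			 * find_nearest_node() never returns a
			 * memory-only node.
			 */
			nodes_andnot(*used, node_possible_map,
				     node_states[N_CPU]);

			for (j = 0; j < nr_node_ids; j++) {
				id = find_nearest_node(node, used);
				if (id < 0)
					break;
				for_each_cpu_and(cpu,
						 cpumask_of_node(id), mask)
					if (i-- == 0)
						goto out;
				node_set(id, *used);
			}
		}
		BUG();
	out:
		NODEMASK_FREE(used);
		return cpu;
	}

If the mask really is per-call like this, there is no shared state left,
so perhaps the spinlock can indeed go away; please correct me if I am
still missing something.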