Subject: Re: [PATCH v6] lib: optimize cpumask_local_spread()
From: Shaokun Zhang
To: Dave Hansen
CC: Yuqi Jin, Rusty Russell, Andrew Morton, Juergen Gross, Paul Burton, Michal Hocko, Michael Ellerman, Mike Rapoport, Anshuman Khandual
Date: Mon, 16 Nov 2020 15:59:38 +0800
Message-ID: <1ca0d77f-7cf3-57d8-af23-169975b63b32@hisilicon.com>
In-Reply-To: <5e8b0304-4de1-4bdc-41d2-79fa5464fbc7@intel.com>
References: <1604410767-55947-1-git-send-email-zhangshaokun@hisilicon.com>
 <3e2e760d-e4b9-8bd0-a279-b23bd7841ae7@intel.com>
 <5e8b0304-4de1-4bdc-41d2-79fa5464fbc7@intel.com>
List-ID: linux-kernel@vger.kernel.org
charset="utf-8" Content-Transfer-Encoding: 8bit X-Originating-IP: [10.67.76.251] X-CFilter-Loop: Reflected Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Dave, 在 2020/11/14 0:02, Dave Hansen 写道: > On 11/12/20 6:06 PM, Shaokun Zhang wrote: >>>> On Huawei Kunpeng 920 server, there are 4 NUMA node(0 - 3) in the 2-cpu >>>> system(0 - 1). The topology of this server is followed: >>> >>> This is with a feature enabled that Intel calls sub-NUMA-clustering >>> (SNC), right? Explaining *that* feature would also be great context for >> >> Correct, >> >>> why this gets triggered on your system and not normally on others and >>> why nobody noticed this until now. >> >> This is on intel 6248 platform: > > I have no idea what a "6248 platform" is. > My apologies that it's Cascade Lake, [1] >>>> +static void calc_node_distance(int *node_dist, int node) >>>> +{ >>>> + int i; >>>> + >>>> + for (i = 0; i < nr_node_ids; i++) >>>> + node_dist[i] = node_distance(node, i); >>>> +} >>> >>> This appears to be the only place node_dist[] is written. That means it >>> always contains a one-dimensional slice of the two-dimensional data >>> represented by node_distance(). >>> >>> Why is a copy of this data needed? >> >> It is used to store the distance with the @node for later, apologies that I >> can't follow your question correctly. > > Right, the data that you store is useful. *But*, it's also a verbatim > copy of the data from node_distance(). Why not just use node_distance() > directly in your code rather than creating a partial copy of it in the Ok, I will remove this redundant function in next version. > local node_dist[] array? > > >>>> unsigned int cpumask_local_spread(unsigned int i, int node) >>>> { >>>> - int cpu, hk_flags; >>>> + static DEFINE_SPINLOCK(spread_lock); >>>> + static int node_dist[MAX_NUMNODES]; >>>> + static bool used[MAX_NUMNODES]; >>> >>> Not to be *too* picky, but there is a reason we declare nodemask_t as a >>> bitmap and not an array of bools. Isn't this just wasteful? >>> >>>> + unsigned long flags; >>>> + int cpu, hk_flags, j, id; >>>> const struct cpumask *mask; >>>> >>>> hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ; >>>> @@ -220,20 +256,28 @@ unsigned int cpumask_local_spread(unsigned int i, int node) >>>> return cpu; >>>> } >>>> } else { >>>> - /* NUMA first. */ >>>> - for_each_cpu_and(cpu, cpumask_of_node(node), mask) { >>>> - if (i-- == 0) >>>> - return cpu; >>>> - } >>>> + spin_lock_irqsave(&spread_lock, flags); >>>> + memset(used, 0, nr_node_ids * sizeof(bool)); >>>> + calc_node_distance(node_dist, node); >>>> + /* Local node first then the nearest node is used */ >>> >>> Is this comment really correct? This makes it sound like there is only >> >> I think it is correct, that's what we want to choose the nearest node. >> >>> fallback to a single node. Doesn't the _code_ fall back basically >>> without limit? >> >> If I follow your question correctly, without this patch, if the local >> node is used up, one random node will be choosed, right? Now we firstly >> choose the nearest node by the distance, if all nodes has been choosen, >> it will return the initial solution. > > The comment makes it sound like the code does: > 1. Do the local node > 2. Do the next nearest node > 3. Stop > That's more clear, I will udpate the comments as the new patch. > In reality, I *think* it's more of a loop where it search > ever-increasing distances away from the local node. > > I just think the comment needs to be made more precise. Got it. 
>>>> +		for (j = 0; j < nr_node_ids; j++) {
>>>> +			id = find_nearest_node(node_dist, used);
>>>> +			if (id < 0)
>>>> +				break;
>>>>
>>>> -		for_each_cpu(cpu, mask) {
>>>> -			/* Skip NUMA nodes, done above. */
>>>> -			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
>>>> -				continue;
>>>> +			for_each_cpu_and(cpu, cpumask_of_node(id), mask)
>>>> +				if (i-- == 0) {
>>>> +					spin_unlock_irqrestore(&spread_lock,
>>>> +							       flags);
>>>> +					return cpu;
>>>> +				}
>>>> +			used[id] = 1;
>>>> +		}
>>>> +		spin_unlock_irqrestore(&spread_lock, flags);
>>>
>>> The existing code was pretty sparsely commented.  This looks to me to
>>> make it more complicated and *less* commented.  Not the best combo.
>>
>> Apologies for the bad comments; hopefully I have described it clearly in
>> the explanation above.
>
> Do you want to take another pass at submitting this patch?

'Another pass'?  Sorry for my poor understanding; I don't quite follow what you mean.

Thanks,
Shaokun

[1] https://en.wikichip.org/wiki/intel/xeon_gold/6248
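As a rough stand-alone illustration of the two cleanups suggested above for the
next version (query the distance table on demand instead of copying it into
node_dist[], and track visited nodes in a bitmap rather than a bool array),
again in plain user-space C: dist() here is a hypothetical stand-in for
node_distance(), so this only sketches the shape a rework might take, not the
actual kernel change:

#include <stdio.h>

#define NR_NODES 4

/* Hypothetical stand-in for the kernel's node_distance(a, b). */
static int dist(int a, int b)
{
	static const int table[NR_NODES][NR_NODES] = {
		{ 10, 12, 20, 22 },
		{ 12, 10, 22, 24 },
		{ 20, 22, 10, 12 },
		{ 22, 24, 12, 10 },
	};
	return table[a][b];
}

/*
 * Nearest node to @node whose bit is clear in @used, or -1 if none.
 * Queries dist() directly, so no per-call copy of the table is kept.
 */
static int nearest_unused(int node, unsigned long used)
{
	int best = -1, i;

	for (i = 0; i < NR_NODES; i++) {
		if (used & (1UL << i))
			continue;
		if (best < 0 || dist(node, i) < dist(node, best))
			best = i;
	}
	return best;
}

int main(void)
{
	unsigned long used = 0;	/* one bit per node, nodemask-style */
	int node = 1, id;

	while ((id = nearest_unused(node, used)) >= 0) {
		printf("node %d is next (distance %d)\n", id, dist(node, id));
		used |= 1UL << id;
	}
	return 0;
}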