Date: Tue, 1 Feb 2022 15:08:59 +0530
From: Srikar Dronamraju
To: Barry Song <21cnbao@gmail.com>
Cc: "Gautham R. Shenoy", Yicong Yang, Peter Zijlstra, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Tim Chen, LKML, LAK, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	prime.zeng@huawei.com, Jonathan Cameron, ego@linux.vnet.ibm.com,
	Linuxarm, Barry Song, Guodong Xu
Subject: Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
Message-ID: <20220201093859.GE618915@linux.vnet.ibm.com>
Reply-To: Srikar Dronamraju
References: <20220126080947.4529-1-yangyicong@hisilicon.com>
	<20220126080947.4529-3-yangyicong@hisilicon.com>
	<20220128071337.GC618915@linux.vnet.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

* Barry Song <21cnbao@gmail.com> [2022-01-28 07:40:15]:

> On Fri, Jan 28, 2022 at 8:13 PM Srikar Dronamraju wrote:
> >
> > * Barry Song <21cnbao@gmail.com> [2022-01-28 09:21:08]:
> >
> > > On Fri, Jan 28, 2022 at 4:41 AM Gautham R. Shenoy wrote:
> > > >
> > > > On Wed, Jan 26, 2022 at 04:09:47PM +0800, Yicong Yang wrote:
> > > > > From: Barry Song
> > > > >
> > > > > For platforms having clusters like Kunpeng920, CPUs within the same
> > > > > cluster have lower latency when synchronizing and accessing shared
> > > > > resources like cache.
> > > > > Thus, this patch tries to find an idle cpu
> > > > > within the cluster of the target CPU before scanning the whole LLC
> > > > > to gain lower latency.
> > > > >
> > > > > Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this
> > > > > patch doesn't consider SMT for this moment.
> > > > >
> > > > > Testing has been done on Kunpeng920 by pinning tasks to one numa
> > > > > and two numa. On Kunpeng920, Each numa has 8 clusters and each
> > > > > cluster has 4 CPUs.
> > > > >
> > > > > With this patch, We noticed enhancement on tbench within one
> > > > > numa or cross two numa.
> > > > >
> > > > > On numa 0:
> > > > >                            5.17-rc1                patched
> > > > > Hmean     1        324.73 (   0.00%)      378.01 *  16.41%*
> > > > > Hmean     2        645.36 (   0.00%)      754.63 *  16.93%*
> > > > > Hmean     4       1302.09 (   0.00%)     1507.54 *  15.78%*
> > > > > Hmean     8       2612.03 (   0.00%)     2982.57 *  14.19%*
> > > > > Hmean     16      5307.12 (   0.00%)     5886.66 *  10.92%*
> > > > > Hmean     32      9354.22 (   0.00%)     9908.13 *   5.92%*
> > > > > Hmean     64      7240.35 (   0.00%)     7278.78 *   0.53%*
> > > > > Hmean     128     6186.40 (   0.00%)     6187.85 (   0.02%)
> > > > >
> > > > > On numa 0-1:
> > > > >                            5.17-rc1                patched
> > > > > Hmean     1        320.01 (   0.00%)      378.44 *  18.26%*
> > > > > Hmean     2        643.85 (   0.00%)      752.52 *  16.88%*
> > > > > Hmean     4       1287.36 (   0.00%)     1505.62 *  16.95%*
> > > > > Hmean     8       2564.60 (   0.00%)     2955.29 *  15.23%*
> > > > > Hmean     16      5195.69 (   0.00%)     5814.74 *  11.91%*
> > > > > Hmean     32      9769.16 (   0.00%)    10872.63 *  11.30%*
> > > > > Hmean     64     15952.50 (   0.00%)    17281.98 *   8.33%*
> > > > > Hmean     128    13113.77 (   0.00%)    13895.20 *   5.96%*
> > > > > Hmean     256    10997.59 (   0.00%)    11244.69 *   2.25%*
> > > > > Hmean     512    14623.60 (   0.00%)    15526.25 *   6.17%*
> > > > >
> > > > > This will also help to improve the MySQL.
> > > > > With MySQL server running on numa 0 and client running on numa 1,
> > > > > both QPS and latency is improved on read-write case:
> > > > >                         5.17-rc1       patched
> > > > > QPS-16threads        143333.2633    145077.4033(+1.22%)
> > > > > QPS-24threads        195085.9367    202719.6133(+3.91%)
> > > > > QPS-32threads        241165.6867    249020.74(+3.26%)
> > > > > QPS-64threads        244586.8433    253387.7567(+3.60%)
> > > > > avg-lat-16threads    2.23           2.19(+1.19%)
> > > > > avg-lat-24threads    2.46           2.36(+3.79%)
> > > > > avg-lat-36threads    2.66           2.57(+3.26%)
> > > > > avg-lat-64threads    5.23           5.05(+3.44%)
> > > > >
> > > > > Tested-by: Yicong Yang
> > > > > Signed-off-by: Barry Song
> > > > > Signed-off-by: Yicong Yang
> > > > > ---
> > > > >  kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++----
> > > > >  1 file changed, 42 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index 5146163bfabb..2f84a933aedd 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -6262,12 +6262,46 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
> > > > >
> > > > >  #endif /* CONFIG_SCHED_SMT */
> > > > >
> > > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > > +/*
> > > > > + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning
> > > > > + */
> > > > > +static inline int scan_cluster(struct task_struct *p, int prev_cpu, int target)
> > > > > +{
> > > > > +	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> > > > > +	struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target));
> > > > > +	int cpu, idle_cpu;
> > > > > +
> > > > > +	/* TODO: Support SMT case while a machine with both cluster and SMT born */
> > > > > +	if (!sched_smt_active() && sd) {
> > > > > +		for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
> > > > > +			idle_cpu = __select_idle_cpu(cpu, p);
> > > > > +			if ((unsigned int)idle_cpu < nr_cpumask_bits)
> > > > > +				return idle_cpu;
> > > > > +		}
> > > > > +
> > > > > +		/* Don't ping-pong tasks in and out cluster frequently */
> > > > > +		if (cpus_share_resources(target, prev_cpu))
> > > > > +			return target;
> > > >
> > > > We reach here when there aren't any idle CPUs within the
> > > > cluster. However there might be idle CPUs in the MC domain. Is a busy
> > > > @target preferable to a potentially idle CPU within the larger domain
> > > > ?
> > >
> > > Hi Gautham,
> > >
> >
> > Hi Barry,
> >
> > > My benchmark showed some performance regression while load was medium
> > > or above if we grabbed idle cpu in and out the cluster. it turned out
> > > the regression disappeared if we blocked the ping-pong. so the logic
> > > here is that if we have scanned and found an idle cpu within the
> > > cluster before, we don't let the task jumping back and forth
> > > frequently as cache synchronization is higher cost. but the code still
> > > allows scanning out of the cluster if we haven't packed waker and
> > > wakee together yet.
> > >
> >
> > Like what Gautham said, should we choose the same cluster if we find that
> > there are no idle-cpus in the LLC? This way we avoid ping-pong if there are
> > no idle-cpus but we still pick an idle-cpu to a busy cpu?
>
> Hi Srikar,
> I am sorry I didn't get your question. Currently the code works as below:
> if task A wakes up task B, and task A is in LLC0 and task B is in LLC1.
> we will scan the cluster of A before scanning the whole LLC0, in this case,
> cluster of A is the closest sibling, so it is the better choice than other CPUs
> which are in LLC0 but not in the cluster of A.

Yes, this is right.

> But we do scan all cpus of LLC0
> afterwards if we fail to find an idle CPU in the cluster.

However, as I read the patch, before we can scan the other clusters within
the LLC (aka LLC0), we hit a check in scan_cluster() which says:

	/* Don't ping-pong tasks in and out cluster frequently */
	if (cpus_share_resources(target, prev_cpu))
		return target;

My reading of this is: ignore the other clusters if the previous CPU and the
target CPU happen to be from the same cluster. (At this point we know there
are no idle CPUs in this cluster; we don't know whether the other clusters
have idle CPUs or not.) This effectively means we are giving preference to
cache over an idle CPU.

Or am I still missing something?

>
> After a while, if the cluster of A gets an idle CPU and pulls B into the
> cluster, we prefer not pushing B out of the cluster of A again though
> there might be an idle CPU outside. as benchmark shows getting an
> idle CPU out of the cluster of A doesn't bring performance improvement
> but performance decreases as B might be getting in and getting out
> the cluster of A very frequently, then cache coherence ping-pong.
>

The counter-argument can be that task A and task B are related and were
running on the same cluster, but the load balancer moved task B to a
different cluster. Now this check may cause them to continue to run on two
different clusters, even though the underlying load balance issues may have
changed.

No?

-- 
Thanks and Regards
Srikar Dronamraju
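
The two behaviours debated in this thread can be put side by side in a toy
model. The sketch below is plain userspace C, not kernel code. It assumes one
LLC of 8 CPUs split into two clusters of 4, and every name in it (idle[],
cluster_of(), toy_scan_cluster(), toy_scan_llc(), toy_select_patched(),
toy_select_idle_first()) is hypothetical. toy_select_patched() follows the
quoted hunk as it is read above; toy_select_idle_first() is only one possible
interpretation of the "pick an idle CPU first, fall back to target last"
ordering that was asked about.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS      8	/* assumption: one LLC with 8 CPUs */
#define CLUSTER_SIZE 4	/* 4 CPUs per cluster, as described for Kunpeng920 */

/* Toy stand-in for cpus_share_resources(): CPUs share a cluster index. */
static int cluster_of(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu / CLUSTER_SIZE;
}

/* First idle CPU in target's cluster, or -1 if the cluster is fully busy. */
static int toy_scan_cluster(const bool idle[NR_CPUS], int target)
{
	int first = cluster_of(target) * CLUSTER_SIZE;

	for (int cpu = first; cpu < first + CLUSTER_SIZE; cpu++)
		if (idle[cpu])
			return cpu;
	return -1;
}

/* First idle CPU anywhere in the LLC, or -1 if every CPU is busy. */
static int toy_scan_llc(const bool idle[NR_CPUS])
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (idle[cpu])
			return cpu;
	return -1;
}

/*
 * Ordering as read from the quoted hunk: cluster scan, then the
 * anti-ping-pong check, and only then the wider LLC scan.
 */
static int toy_select_patched(const bool idle[NR_CPUS], int prev_cpu, int target)
{
	int cpu = toy_scan_cluster(idle, target);

	if (cpu >= 0)
		return cpu;
	if (cluster_of(target) == cluster_of(prev_cpu))
		return target;	/* busy target wins over an idle CPU elsewhere */
	cpu = toy_scan_llc(idle);
	return cpu >= 0 ? cpu : target;
}

/*
 * Hypothetical alternative: cluster scan, then LLC scan, and fall back to
 * the (busy) target only when nothing in the LLC is idle.
 */
static int toy_select_idle_first(const bool idle[NR_CPUS], int prev_cpu, int target)
{
	int cpu = toy_scan_cluster(idle, target);

	(void)prev_cpu;	/* kept only so both functions share a signature */
	if (cpu >= 0)
		return cpu;
	cpu = toy_scan_llc(idle);
	return cpu >= 0 ? cpu : target;
}
```

With only CPU 5 idle, prev_cpu = 1 and target = 2 (both in cluster 0),
toy_select_patched() returns the busy CPU 2 while toy_select_idle_first()
returns the idle CPU 5. The model says nothing about which choice performs
better; the tbench and MySQL numbers quoted above are the only data in this
thread on that.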