Date: Mon, 6 Nov 2023 12:18:03 -0500
Subject: Re: [RFC PATCH v2 0/2] sched/fair migration reduction features
From: Mathieu Desnoyers
To: K Prateek Nayak, Chen Yu
Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar,
 Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
 Daniel Bristot de Oliveira, Vincent Guittot, Juri Lelli,
 Swapnil Sapkal, Aaron Lu, Tim Chen, "Gautham R. Shenoy",
 x86@kernel.org
References: <20231019160523.1582101-1-mathieu.desnoyers@efficios.com>
 <85b710a9-5b26-b0df-8c21-c2768a21e182@amd.com>

On 2023-11-06 02:06, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 11/6/2023 11:22 AM, Chen Yu wrote:
>> On 2023-10-27 at 08:57:00 +0530, K Prateek Nayak wrote:
>>> Hello Mathieu,
>>>
>>> On 10/19/2023 9:35 PM, Mathieu Desnoyers wrote:
>>>> Hi,
>>>>
>>>> This series introduces two new scheduler features: UTIL_FITS_CAPACITY
>>>> and SELECT_BIAS_PREV.
>>>> When used together, they achieve a 41% speedup of
>>>> a hackbench workload which leaves some idle CPU time on a 192-core
>>>> AMD EPYC.
>>>>
>>>> The main metrics which are significantly improved are:
>>>>
>>>> - cpu-migrations are reduced by 80%,
>>>> - CPU utilization is increased by 17%.
>>>>
>>>> Feedback is welcome. I am especially interested to learn whether this
>>>> series has positive or detrimental effects on performance of other
>>>> workloads.
>>>
>>> I got a chance to test this series on a dual socket 3rd Generation EPYC
>>> system (2 x 64C/128T). Following is a quick summary:
>>>
>>> - stream and ycsb-mongodb don't see any changes.
>>>
>>> - hackbench and DeathStarBench see a major improvement. Both are high
>>>   utilization workloads with CPUs being overloaded most of the time.
>>>   DeathStarBench is known to benefit from a lower migration count. It
>>>   was discussed by Gautham at OSPM '23.
>>>
>>> - tbench, netperf, and schbench regress. The former two when the
>>>   system is near fully loaded, and the latter for most cases.
>>
>> Does it mean hackbench gets benefits when the system is overloaded, while
>> tbench/netperf do not get benefit when the system is underloaded?
>
> Yup! Seems like that from the results. From what I have seen so far,
> there seems to be a work conservation aspect to hackbench where if we
> reduce the time spent in the kernel (by reducing the time to decide on
> the target, which Mathieu's patch [this one] achieves,

I am confused by this comment.

Quoting Daniel Bristot, "work conserving" is defined as: "in a system
with M processors, the M highest-priority tasks must be running (in
real time)". This should apply to other scheduling classes as well.

This definition fits with this paper's definition [1]:

"The Linux scheduler is work-conserving, meaning that it should never
leave cores idle if there is work to do."

Do you mean something different by "work conservation"?
Just in case, I've made the following experiment to figure out whether
my patches benefit from having less time spent in select_task_rq_fair().

I have copied the original "select_idle_sibling()" into a separate
function "select_idle_sibling_orig()", which I call at the beginning of
the new "biased" select_idle_sibling(). I use its result in an empty
asm volatile, which ensures that the call is not optimized away. Then
the biased function selects the runqueue with the new biased approach.

The result with hackbench is that the speedup is still pretty much the
same with or without the added "select_idle_sibling_orig()" call.

Based on this, my understanding is that the speedup comes from
minimizing the number of migrations (and the side effects caused by
those migrations, such as runqueue lock contention and cache misses),
rather than from making select_idle_sibling() faster.

So based on this, I suspect that we could add some overhead to
select_task_rq_fair() if it means we make a better task placement
decision and minimize migrations, and that would still provide an
overall performance benefit.

> there is also a
> second order effect from another one of Mathieu's patches that uses the
> wakelist but indirectly curbs the SIS_UTIL limits based on Aaron's
> observation [1], thus reducing time spent in select_idle_cpu())
> hackbench results seem to improve.

It's possible that an indirect effect of the bias towards the prev
runqueue is to affect the metrics used by select_idle_cpu() as well and
make it return early.

I've tried adding a 1000-iteration barrier() loop within
select_idle_sibling_orig(), and indeed the hackbench time goes from 29s
to 31s. Therefore, slowing down the task rq selection does have some
impact.
>
> [1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/
>
> schbench, tbench, and netperf see that wakeups are faster when the
> client and server are on the same LLC, so consolidation, as long as
> there is one task per runqueue in the underloaded case, is better than
> just keeping them on separate LLCs.

What is faster for the 1:1 client/server ping-pong scenario: having the
client and server on the same LLC, but on different runqueues, or
having them share a single runqueue? If they wait for each other, then
I suspect it's better to place them on the same runqueue as long as
there is capacity left.

>
>>
>>> All these benchmarks are client-server / messenger-worker oriented and
>>> are known to perform better when client-server / messenger-worker are
>>> on the same CCX (LLC domain).
>>
>> I thought hackbench should also be of client-server mode, because
>> hackbench has socket/pipe modes and exchanges data between
>> sender/receiver.
>
> Yes, but its N:M nature makes it slightly complicated to understand where
> the cache benefits disappear and the work conservation benefits become
> more prominent.

The N:M nature of hackbench AFAIU causes the N server *and* M client
tasks to pull each other pretty much randomly, therefore thrashing
cache locality.

I'm still unclear about the definition of "work conservation" in this
discussion.

>
>>
>> This reminds me of your proposal to provide a user hint to the scheduler
>> as to whether to do task consolidation vs task spreading, and could this
>> also be applied to Mathieu's case? For a task or task group with the
>> "consolidate" flag set, tasks prefer to be woken up on the
>> target/previous CPU if the wakee fits into that CPU. In this way we
>> could bring the benefit and not introduce a regression.
>
> I think even a simple WF_SYNC check will help the tbench and netperf
> cases. Let me get back to you with some data on different variants of
> hackbench with the latest tip.
AFAIU (to be double-checked) the hackbench workload also uses WF_SYNC,
which prevents us from using this flag to distinguish between the 1:1
server/client and N:M scenarios. Or am I missing something?

Thanks,

Mathieu

[1] https://people.ece.ubc.ca/sasha/papers/eurosys16-final29.pdf

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com