Date: Wed, 20 Dec 2023 17:39:24 +0100 (CET)
From: Julia Lawall <julia.lawall@inria.fr>
To: Vincent Guittot
Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Mel Gorman,
    linux-kernel@vger.kernel.org
Subject: Re: EEVDF and NUMA balancing
Message-ID: <44df7caf-dbb0-70c3-fbad-7242c0f87b5f@inria.fr>
References: <20231003215159.GJ1539@noisy.programming.kicks-ass.net>
 <20231004120544.GA6307@noisy.programming.kicks-ass.net>
 <20231004174801.GE19999@noisy.programming.kicks-ass.net>
 <20231009102949.GC14330@noisy.programming.kicks-ass.net>

On Tue, 19 Dec 2023, Vincent Guittot wrote:

> On Mon, 18 Dec 2023 at 23:31, Julia Lawall wrote:
> > On Mon, 18 Dec 2023, Vincent Guittot wrote:
> >
> > > On Mon, 18 Dec 2023 at 14:58, Julia Lawall wrote:
> > > >
> > > > Hello,
> > > >
> > > > I have looked further into the NUMA balancing issue.
> > > >
> > > > The context is that there are 2N threads running on 2N cores, one thread
> > > > gets NUMA balanced to the other socket, leaving N+1 threads on one socket
> > > > and N-1 threads on the other socket.  This condition typically persists
> > > > for one or more seconds.
> > > >
> > > > Previously, I reported this on a 4-socket machine, but it can also occur
> > > > on a 2-socket machine, with other tests from the NAS benchmark suite
> > > > (sp.B, bt.B, etc.).
> > > >
> > > > Since there are N+1 threads on one of the sockets, it would seem that
> > > > load balancing would quickly kick in to bring some thread back to the
> > > > socket with only N-1 threads.  This doesn't happen, though, because most
> > > > of the threads have NUMA effects such that they have a preferred node.
> > > > So there is a high chance that an attempt to steal will fail, because
> > > > both threads have a preference for the socket.
> > > >
> > > > At this point, the only hope is active balancing.  However, triggering
> > > > active balancing requires the success of the following condition in
> > > > imbalanced_active_balance:
> > > >
> > > >     if ((env->migration_type == migrate_task) &&
> > > >         (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > > >
> > > > sd->nr_balance_failed does not increase because the core is idle.  When a
> > > > core is idle, it comes to the load_balance function from schedule()
> > > > through newidle_balance.  newidle_balance always sends in the flag
> > > > CPU_NEWLY_IDLE, even if the core has been idle for a long time.
> > >
> > > Do you mean that you never kick a normal idle load balance?
> >
> > OK, it seems that both happen, at different times.  But the calls to
> > trigger_load_balance seem to rarely do more than the SMT level.
>
> Yes, the min period is equal to "cpumask_weight of sched_domain" ms:
> 2 ms at the SMT level and 2N ms at the NUMA level.
>
> > I have attached part of a trace in which I print various things that
> > happen during the idle period.
> >
> > > > Changing newidle_balance to use CPU_IDLE rather than CPU_NEWLY_IDLE when
> > > > the core was already idle before the call to schedule() is not enough,
> > > > though, because there is also the constraint on the migration type.
> > > > That turns out to be (mostly?) migrate_util.  Removing the following
> > > > code from find_busiest_queue:
> > > >
> > > >     /*
> > > >      * Don't try to pull utilization from a CPU with one
> > > >      * running task. Whatever its utilization, we will fail
> > > >      * detach the task.
> > > >      */
> > > >     if (nr_running <= 1)
> > > >             continue;
> > >
> > > I'm surprised that load_balance wants to "migrate_util" instead of
> > > "migrate_task".
> >
> > In the attached trace, there are 147 occurrences of migrate_util, and 3
> > occurrences of migrate_task.  But even when migrate_task appears, the
> > counter has gotten knocked back down, due to the calls to newidle_balance.
> >
> > > You have N+1 threads on a group of 2N CPUs, so you should have at most
> > > 1 thread per CPU in your busiest group.
> >
> > One CPU has 2 threads, and the others have one.  The one with two threads
> > is returned as the busiest one.  But nothing happens, because both of them
> > prefer the socket that they are on.
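
For context, the check that actually makes the steal fail is the locality
test in can_migrate_task(); a rough paraphrase of the ~v6.6
kernel/sched/fair.c logic, not the exact upstream code:

    tsk_cache_hot = migrate_degrades_locality(p, env);
    if (tsk_cache_hot == -1)
            tsk_cache_hot = task_hot(p, env);

    /*
     * Moving a task away from its preferred node counts as "hot", and
     * that is only overridden once sd->nr_balance_failed exceeds
     * sd->cache_nice_tries -- which never happens here, because
     * CPU_NEWLY_IDLE balancing does not increment the counter.
     */
    if (tsk_cache_hot <= 0 ||
        env->sd->nr_balance_failed > env->sd->cache_nice_tries)
            return 1;       /* OK to migrate */

    return 0;               /* refuse: task is hot on this node */

So the same counter gates both the per-task migration decision and the
active-balance fallback discussed above.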
>
> This explains why load_balance uses migrate_util and not migrate_task.
> One CPU with 2 threads can be overloaded.

The node with N-1 tasks (and thus an empty core) is categorized as
group_has_spare, and the one with N+1 tasks (and thus one core with 2 tasks
and N-1 cores with 1 task) is categorized as group_overloaded.  This seems
reasonable, and based on these values the conditions hold for migrate_util
to be chosen.

I tried just extending the test in imbalanced_active_balance to also accept
migrate_util, but sd->nr_balance_failed still goes up too slowly, due to
the many calls from newidle_balance.

julia

> OK, so it seems that your first problem is that you have 2 threads on
> the same CPU whereas you should have an idle core in this NUMA node.
> All cores are sharing the same LLC, aren't they?
>
> You should not have more than 1 thread per CPU when there are N+1
> threads on a node with N cores / 2N CPUs.  This will enable
> load_balance to try to migrate a task instead of some util(ization),
> and you should reach the active load balance.
>
> > > In theory you should have the local "group_has_spare" and the busiest
> > > "group_fully_busy" (at most).  This means that no group should be
> > > overloaded, and load_balance should not try to migrate util but only
> > > tasks.
> >
> > I didn't collect information about the groups.  I will look into that.
> >
> > julia
> >
> > > > and changing the above test to:
> > > >
> > > >     if ((env->migration_type == migrate_task ||
> > > >          env->migration_type == migrate_util) &&
> > > >         (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > > >
> > > > seems to solve the problem.
> > > >
> > > > I will test this on more applications.  But let me know if the above
> > > > solution seems completely inappropriate.  Maybe it violates some other
> > > > constraints.
> > > >
> > > > I have no idea why this problem became more visible with EEVDF.  It
> > > > seems to have to do with the time slices all turning out to be the
> > > > same.  I got the same behavior in 6.5 by overwriting the timeslice
> > > > calculation to always return 1.  But I don't see the connection
> > > > between the timeslice and the behavior of the idle task.
> > > >
> > > > thanks,
> > > > julia
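
For concreteness, the two experiments described in this thread amount to
something like the following against a ~v6.6 kernel/sched/fair.c.  This is
a sketch, not a tested patch, and was_already_idle() is a hypothetical
helper for detecting that the CPU was idle before schedule() was called,
not an existing kernel function:

    /*
     * 1) In newidle_balance(): stop reporting CPU_NEWLY_IDLE once the
     *    CPU has already been idle for a while.  load_balance() only
     *    increments sd->nr_balance_failed when idle != CPU_NEWLY_IDLE,
     *    so this lets failed attempts accumulate toward the
     *    active-balance threshold.
     */
    enum cpu_idle_type idle_type = was_already_idle(this_rq) ?
                                   CPU_IDLE : CPU_NEWLY_IDLE;

    pulled_task = load_balance(this_cpu, this_rq, sd,
                               idle_type, &continue_balancing);

    /*
     * 2) In imbalanced_active_balance(): let repeated migrate_util
     *    failures trigger active balance too, not just migrate_task.
     */
    if ((env->migration_type == migrate_task ||
         env->migration_type == migrate_util) &&
        (sd->nr_balance_failed > sd->cache_nice_tries + 2))
            return 1;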