From: Vincent Guittot
Date: Tue, 19 Dec 2023 18:38:38 +0100
Subject: Re: EEVDF and NUMA balancing
To: Julia Lawall
Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Mel Gorman,
    linux-kernel@vger.kernel.org

On Mon, 18 Dec 2023 at 23:31, Julia Lawall wrote:
>
> On Mon, 18 Dec 2023, Vincent Guittot wrote:
>
> > On Mon, 18 Dec 2023 at 14:58, Julia Lawall wrote:
> > >
> > > Hello,
> > >
> > > I have looked further into the NUMA balancing issue.
> > >
> > > The context is that there are 2N threads running on 2N cores; one
> > > thread gets NUMA-balanced to the other socket, leaving N+1 threads on
> > > one socket and N-1 threads on the other. This condition typically
> > > persists for one or more seconds.
> > >
> > > Previously, I reported this on a 4-socket machine, but it can also
> > > occur on a 2-socket machine with other tests from the NAS benchmark
> > > suite (sp.B, bt.B, etc.).
> > >
> > > Since there are N+1 threads on one of the sockets, one would expect
> > > load balancing to quickly kick in and bring some thread back to the
> > > socket with only N-1 threads. This doesn't happen, though, because
> > > most of the threads have NUMA effects that give them a preferred node,
> > > so there is a high chance that an attempt to steal will fail: both
> > > threads on the overloaded CPU prefer the socket they are on.
> > >
> > > At this point, the only hope is active balancing. However, triggering
> > > active balancing requires the following condition in
> > > imbalanced_active_balance to succeed:
> > >
> > >     if ((env->migration_type == migrate_task) &&
> > >         (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > >
> > > sd->nr_balance_failed does not increase because the core is idle.
> > > When a core is idle, it reaches the load_balance function from
> > > schedule() through newidle_balance. newidle_balance always passes the
> > > flag CPU_NEWLY_IDLE, even if the core has been idle for a long time.
> >
> > Do you mean that you never kick a normal idle load balance?
>
> OK, it seems that both happen, at different times. But the calls to
> trigger_load_balance seem to rarely go beyond the SMT level.

Yes, the minimum period is equal to the cpumask_weight of the
sched_domain, in ms: 2 ms at the SMT level and 2N ms at the NUMA level.

> I have attached part of a trace in which I print various things that
> happen during the idle period.
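For reference, the reason a newidle pass never advances the failure
counter is a guard in load_balance() itself. A paraphrased sketch from a
v6.6-era kernel/sched/fair.c (simplified here; the exact code and comment
vary by version):

    if (!ld_moved) {
            schedstat_inc(sd->lb_failed[idle]);
            /*
             * Increment the failure counter only on periodic balance.
             * We do not want newidle balance, which can be very
             * frequent, to pollute the failure counter, causing
             * excessive cache_hot migrations and active balances.
             */
            if (idle != CPU_NEWLY_IDLE)
                    sd->nr_balance_failed++;
            ...
    }

So a core that mostly re-enters load_balance() via newidle_balance()
(and therefore with CPU_NEWLY_IDLE) hardly ever increments
nr_balance_failed, and any successful pull resets the counter, which is
why the imbalanced_active_balance threshold above is so hard to reach.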
> > > Changing newidle_balance to use CPU_IDLE rather than CPU_NEWLY_IDLE
> > > when the core was already idle before the call to schedule() is not
> > > enough, though, because there is also a constraint on the migration
> > > type. That turns out to be (mostly?) migrate_util. Removing the
> > > following code from find_busiest_queue:
> > >
> > >     /*
> > >      * Don't try to pull utilization from a CPU with one
> > >      * running task. Whatever its utilization, we will fail
> > >      * detach the task.
> > >      */
> > >     if (nr_running <= 1)
> > >         continue;
> >
> > I'm surprised that load_balance wants to "migrate_util" instead of
> > "migrate_task".
>
> In the attached trace, there are 147 occurrences of migrate_util, and 3
> occurrences of migrate_task. But even when migrate_task appears, the
> counter has gotten knocked back down, due to the calls to
> newidle_balance.
>
> > You have N+1 threads on a group of 2N CPUs, so you should have at most
> > 1 thread per CPU in your busiest group.
>
> One CPU has 2 threads, and the others have one. The one with two threads
> is returned as the busiest one. But nothing happens, because both of
> them prefer the socket that they are on.

This explains why load_balance uses migrate_util and not migrate_task:
one CPU with 2 threads can be overloaded.

OK, so it seems that your first problem is that you have 2 threads on the
same CPU whereas you should have an idle core in this NUMA node. All
cores are sharing the same LLC, aren't they?

You should not have more than 1 thread per CPU when there are N+1 threads
on a node with N cores / 2N CPUs. Fixing that would enable load_balance
to try to migrate a task instead of some utilization, and you should
reach the active load balance.

> > In theory you should have the local "group_has_spare" and the busiest
> > "group_fully_busy" (at most). This means that no group should be
> > overloaded, and load_balance should not try to migrate util but only a
> > task.
>
> I didn't collect information about the groups. I will look into that.
>
> julia
>
> > > and changing the above test to:
> > >
> > >     if ((env->migration_type == migrate_task ||
> > >          env->migration_type == migrate_util) &&
> > >         (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > >
> > > seems to solve the problem.
> > >
> > > I will test this on more applications. But let me know if the above
> > > solution seems completely inappropriate. Maybe it violates some other
> > > constraints.
> > >
> > > I have no idea why this problem became more visible with EEVDF. It
> > > seems to have to do with the time slices all turning out to be the
> > > same. I got the same behavior in 6.5 by overriding the timeslice
> > > calculation to always return 1. But I don't see the connection
> > > between the timeslice and the behavior of the idle task.
> > >
> > > thanks,
> > > julia
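To make the dynamic discussed in this thread concrete, here is a minimal
standalone toy model (plain userspace C, not kernel code; the names
sd_model, wants_active_balance, and failed_balance are invented for
illustration) of how newidle passes starve the failure counter that the
imbalanced_active_balance condition depends on:

    /* toy_active_balance.c -- standalone model, not kernel code */
    #include <stdbool.h>
    #include <stdio.h>

    enum cpu_idle_type { CPU_IDLE, CPU_NEWLY_IDLE };
    enum migration_type { migrate_util, migrate_task };

    struct sd_model {
            int nr_balance_failed;
            int cache_nice_tries;
    };

    /* mirrors the condition quoted from imbalanced_active_balance */
    static bool wants_active_balance(const struct sd_model *sd,
                                     enum migration_type mt)
    {
            return mt == migrate_task &&
                   sd->nr_balance_failed > sd->cache_nice_tries + 2;
    }

    /* one balance attempt that fails to move a task; as in
     * load_balance(), only non-newidle passes bump the counter */
    static void failed_balance(struct sd_model *sd, enum cpu_idle_type idle)
    {
            if (idle != CPU_NEWLY_IDLE)
                    sd->nr_balance_failed++;
    }

    int main(void)
    {
            struct sd_model sd = { .nr_balance_failed = 0,
                                   .cache_nice_tries = 1 };
            int i;

            /* an idle core hammering load_balance() via newidle_balance():
             * the counter never moves, so active balance never triggers */
            for (i = 0; i < 1000; i++)
                    failed_balance(&sd, CPU_NEWLY_IDLE);
            printf("after 1000 newidle failures: failed=%d active=%d\n",
                   sd.nr_balance_failed,
                   wants_active_balance(&sd, migrate_task));

            /* a few periodic (CPU_IDLE) failures, by contrast, cross
             * the threshold almost immediately */
            for (i = 0; i < 4; i++)
                    failed_balance(&sd, CPU_IDLE);
            printf("after 4 periodic failures:  failed=%d active=%d\n",
                   sd.nr_balance_failed,
                   wants_active_balance(&sd, migrate_task));
            return 0;
    }

Under the change proposed above, the threshold check would also accept
migrate_util, and a long-idle core would balance as CPU_IDLE rather than
CPU_NEWLY_IDLE, making the second scenario reachable on the idle socket.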