From: Vitalii Bursov <vitaly@bursov.com>
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Valentin Schneider,
    linux-kernel@vger.kernel.org, Vitalii Bursov
Subject: [PATCH v5 0/3] sched/fair: allow disabling sched_balance_newidle with sched_relax_domain_level
Date: Tue, 7 May 2024 15:15:30 +0300

Changes in v5:
- Collected tags in commit messages.
- Link to v4: https://lore.kernel.org/lkml/cover.1714488502.git.vitaly@bursov.com/

Changes in v4:
- Updated commit messages: set the Fixes tag to the proper commit, added
  a comment about the SDM macro in the debug commit.
- Link to v3: https://lore.kernel.org/lkml/cover.1712147341.git.vitaly@bursov.com/

Changes in v3:
- Removed the levels table change from the documentation patch.
- Link to v2: https://lore.kernel.org/lkml/cover.1711900396.git.vitaly@bursov.com/

Changes in v2:
- Split the debug.c change into a separate commit and moved the new
  "level" file after "groups_flags".
- Added a "Fixes" tag and updated the commit message.
- Updated the domain levels documentation in cgroup-v1/cpusets.rst.
- Link to v1: https://lore.kernel.org/all/cover.1711584739.git.vitaly@bursov.com/

During an upgrade from Linux 5.4 we found a small (around 3%)
performance regression, which was tracked down to commit
c5b0a7eefc70150caf23e37bc9d639c68c87a097:

    sched/fair: Remove sysctl_sched_migration_cost condition

    With a default value of 500us, sysctl_sched_migration_cost is
    significantly higher than the cost of load_balance. Remove the
    condition and rely on the sd->max_newidle_lb_cost to abort
    newidle_balance.

It looks like "newidle" balancing is beneficial for a lot of workloads,
just not for this specific one. The workload is video encoding: there
are hundreds to thousands of threads, some synchronized with mutexes
and condition variables. The process aims to keep a portion of the CPU
idle, so no CPU core is ever 100% busy. The performance impact we see
likely comes from the additional processing in the scheduler and
secondary costs such as extra cache misses, rather than from incorrect
balancing decisions; see the perf output below.

My understanding is that the "sched_relax_domain_level" cgroup
parameter should control whether sched_balance_newidle() is called and
what the scope of the balancing is, but it does not fully work for this
case. From the cpusets.rst documentation:

> The 'cpuset.sched_relax_domain_level' file allows you to request changing
> this searching range as you like. This file takes int value which
> indicates size of searching range in levels ideally as follows,
> otherwise initial value -1 that indicates the cpuset has no request.
>
> ====== ===========================================================
>   -1   no request. use system default or follow request of others.
>    0   no search.
>    1   search siblings (hyperthreads in a core).
>    2   search cores in a package.
>    3   search cpus in a node [= system wide on non-NUMA system]
>    4   search nodes in a chunk of node [on NUMA system]
>    5   search system wide [on NUMA system]
> ====== ===========================================================

In practice, setting cpuset.sched_relax_domain_level to 0 behaves the
same as setting it to 1. On a dual-CPU server, the domains and levels
are as follows:

    domain 0: level 0, SMT
    domain 1: level 2, MC
    domain 2: level 5, NUMA

So, to support "0 no search", the value in
cpuset.sched_relax_domain_level should disable SD_BALANCE_NEWIDLE for
the specified level and keep it enabled for lower levels. For example,
the SMT level is 0, so sched_relax_domain_level=0 should exclude levels
>= 0. Instead, cpuset.sched_relax_domain_level currently enables the
specified level as well, which effectively removes the "no search"
option. See below for the domain flags at every
cpuset.sched_relax_domain_level value.

The proposed patch allows clearing the SD_BALANCE_NEWIDLE flag when
cpuset.sched_relax_domain_level is set to 0 and extends the maximum
accepted value beyond sched_domain_level_max. The latter makes it
possible to set SD_BALANCE_NEWIDLE on all levels and override the
platform default when it does not include all levels.

Thanks
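For reference, the essence of the fix is two relaxed comparisons. The
sketch below reflects my reading of kernel/sched/topology.c and
kernel/cgroup/cpuset.c; the patches themselves carry the exact hunks:

/*
 * kernel/sched/topology.c -- simplified sketch of the patched
 * set_domain_attribute(); see patch 1 for the authoritative change.
 */
static void set_domain_attribute(struct sched_domain *sd,
				 struct sched_domain_attr *attr)
{
	int request;

	if (!attr || attr->relax_domain_level < 0) {
		if (default_relax_domain_level < 0)
			return;
		request = default_relax_domain_level;
	} else {
		request = attr->relax_domain_level;
	}

	/*
	 * Previously "sd->level > request": level 0 (SMT) could never
	 * be excluded, so relax_domain_level=0 behaved exactly like 1.
	 * With ">=", a request of 0 clears SD_BALANCE_NEWIDLE on every
	 * level, i.e. "no search" really means no search, while a
	 * request above the highest level keeps the flag everywhere.
	 */
	if (sd->level >= request) {
		/* Turn off idle balance on this domain: */
		sd->flags &= ~(SD_BALANCE_WAKE | SD_BALANCE_NEWIDLE);
	}
}

/*
 * kernel/cgroup/cpuset.c, update_relax_domain_level() -- the accepted
 * range grows accordingly, so that sched_domain_level_max + 1 can be
 * used to request SD_BALANCE_NEWIDLE on all levels:
 */
	if (val < -1 || val > sched_domain_level_max + 1)	/* was: >= max */
		return -EINVAL;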
=========================

Perf output for a similar workload/test case shows that newidle_balance
(since renamed to sched_balance_newidle) is called when handling futex
and nanosleep syscalls:

  8.74%  0.40%  a.out  [kernel.vmlinux]  [k] entry_SYSCALL_64
     8.34% entry_SYSCALL_64
        - do_syscall_64
           - 5.50% __x64_sys_futex
              - 5.42% do_futex
                 - 3.79% futex_wait
                    - 3.74% __futex_wait
                       - 3.53% futex_wait_queue
                          - 3.45% schedule
                             - 3.43% __schedule
                                - 2.06% pick_next_task
                                   - 1.93% pick_next_task_fair
                                      - 1.87% newidle_balance
                                         - 1.52% load_balance
                                            - 1.16% find_busiest_group
                                               - 1.13% update_sd_lb_stats.constprop.0
                                                    1.01% update_sg_lb_stats
                                - 0.83% dequeue_task_fair
                                     0.66% dequeue_entity
                 - 1.57% futex_wake
                    - 1.22% wake_up_q
                       - 1.20% try_to_wake_up
                            0.58% select_task_rq_fair
           - 2.44% __x64_sys_nanosleep
              - 2.36% hrtimer_nanosleep
                 - 2.33% do_nanosleep
                    - 2.05% schedule
                       - 2.03% __schedule
                          - 1.23% pick_next_task
                             - 1.15% pick_next_task_fair
                                - 1.12% newidle_balance
                                   - 0.90% load_balance
                                      - 0.68% find_busiest_group
                                         - 0.66% update_sd_lb_stats.constprop.0
                                              0.59% update_sg_lb_stats
                               0.52% dequeue_task_fair

When newidle_balance is disabled (or on older kernels), the perf output
is:

  6.37%  0.41%  a.out  [kernel.vmlinux]  [k] entry_SYSCALL_64
     5.96% entry_SYSCALL_64
        - do_syscall_64
           - 3.97% __x64_sys_futex
              - 3.89% do_futex
                 - 2.32% futex_wait
                    - 2.27% __futex_wait
                       - 2.05% futex_wait_queue
                          - 1.98% schedule
                             - 1.96% __schedule
                                - 0.81% dequeue_task_fair
                                     0.66% dequeue_entity
                                - 0.64% pick_next_task
                                     0.51% pick_next_task_fair
                 - 1.52% futex_wake
                    - 1.15% wake_up_q
                       - try_to_wake_up
                            0.59% select_task_rq_fair
           - 1.58% __x64_sys_nanosleep
              - 1.52% hrtimer_nanosleep
                 - 1.48% do_nanosleep
                    - 1.20% schedule
                       - 1.19% __schedule
                            0.53% dequeue_task_fair
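The difference is explained by the SD_BALANCE_NEWIDLE check in the
newly-idle path: the comparatively expensive load_balance() ->
find_busiest_group() chain seen above only runs for domains that carry
the flag. A condensed sketch of the relevant loop, simplified from my
reading of kernel/sched/fair.c as of v6.8 (locking, statistics and
several early-exit conditions omitted):

/*
 * Condensed sketch of newidle_balance() in kernel/sched/fair.c (v6.8;
 * later renamed sched_balance_newidle()).  Not the complete function.
 */
for_each_domain(this_cpu, sd) {
	u64 t0, domain_cost;

	/* Give up once balancing is expected to cost more than the
	 * CPU's average idle period. */
	if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
		break;

	if (sd->flags & SD_BALANCE_NEWIDLE) {
		/* The work that shows up as load_balance() ->
		 * find_busiest_group() -> update_sd_lb_stats() in the
		 * profile above.  Clearing SD_BALANCE_NEWIDLE via
		 * sched_relax_domain_level skips it entirely. */
		t0 = sched_clock_cpu(this_cpu);
		pulled_task = load_balance(this_cpu, this_rq, sd,
					   CPU_NEWLY_IDLE,
					   &continue_balancing);

		domain_cost = sched_clock_cpu(this_cpu) - t0;
		curr_cost += domain_cost;
	}

	/* Stop searching if there are now runnable tasks here. */
	if (pulled_task || this_rq->nr_running > 0)
		break;
}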
Without a patch:
=========================
CPUs: 2 Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

# uname -r
6.8.1

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 63962 MB
node 0 free: 59961 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 64446 MB
node 1 free: 63338 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

# head /proc/schedstat
version 15
timestamp 4295347219
cpu0 0 0 0 0 0 0 3035466036 858375615 67578
domain0 0000,01000001 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
domain1 000f,ff000fff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
domain2 ffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...

# cd /sys/kernel/debug/sched/domains
# echo -1 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{name,flags,groups_flags,max_newidle_lb_cost}
cpu0/domain0/name:SMT
cpu0/domain1/name:MC
cpu0/domain2/name:NUMA
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain0/max_newidle_lb_cost:2236
cpu0/domain1/max_newidle_lb_cost:3444
cpu0/domain2/max_newidle_lb_cost:4590

# echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags,max_newidle_lb_cost}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain1/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain2/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain2/groups_flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain0/max_newidle_lb_cost:0
cpu0/domain1/max_newidle_lb_cost:0
cpu0/domain2/max_newidle_lb_cost:0

# echo 1 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags,max_newidle_lb_cost}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain1/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain2/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain2/groups_flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain0/max_newidle_lb_cost:309
cpu0/domain1/max_newidle_lb_cost:0
cpu0/domain2/max_newidle_lb_cost:0
# echo 2 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags,max_newidle_lb_cost}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain2/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain0/max_newidle_lb_cost:276
cpu0/domain1/max_newidle_lb_cost:2776
cpu0/domain2/max_newidle_lb_cost:0

# echo 3 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags,max_newidle_lb_cost}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain2/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain0/max_newidle_lb_cost:289
cpu0/domain1/max_newidle_lb_cost:3192
cpu0/domain2/max_newidle_lb_cost:0

# echo 4 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags,max_newidle_lb_cost}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain2/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain0/max_newidle_lb_cost:1306
cpu0/domain1/max_newidle_lb_cost:1999
cpu0/domain2/max_newidle_lb_cost:0

# echo 5 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
bash: echo: write error: Invalid argument
=========================

The same system with the patch applied:
=========================
# cd /sys/kernel/debug/sched/domains
# echo -1 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{name,level,flags,groups_flags}
cpu0/domain0/name:SMT
cpu0/domain1/name:MC
cpu0/domain2/name:NUMA
cpu0/domain0/level:0
cpu0/domain1/level:2
cpu0/domain2/level:5
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
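(The per-domain "level" files read above are added by patch 2. Roughly,
it registers one more read-only debugfs attribute next to "flags" in
kernel/sched/debug.c; a sketch of my version of the change, with the
caveat that sd->level is a plain int and therefore does not go through
the SDM() helper used for the other sched_domain fields:)

	/* Expose sched_domain::level so the level<->name mapping can
	 * be read at runtime; see patch 2 for the exact hunk. */
	debugfs_create_u32("level", 0444, parent, (u32 *)&sd->level);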
# echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags}
cpu0/domain0/flags:SD_BALANCE_EXEC ...
cpu0/domain1/flags:SD_BALANCE_EXEC ...
cpu0/domain2/flags:SD_BALANCE_EXEC ...
cpu0/domain1/groups_flags:SD_BALANCE_EXEC ...
cpu0/domain2/groups_flags:SD_BALANCE_EXEC ...

# echo 1 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain1/flags:SD_BALANCE_EXEC ...
cpu0/domain2/flags:SD_BALANCE_EXEC ...
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/groups_flags:SD_BALANCE_EXEC ...

[skip 2, same as 1]

# echo 3 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/flags:SD_BALANCE_EXEC ...
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...

[skip 4 and 5, same as 3]

# echo 6 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...

# echo 7 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
bash: echo: write error: Invalid argument
=========================

Vitalii Bursov (3):
  sched/fair: allow disabling sched_balance_newidle with
    sched_relax_domain_level
  sched/debug: dump domains' level
  docs: cgroup-v1: clarify that domain levels are system-specific

 Documentation/admin-guide/cgroup-v1/cpusets.rst | 7 ++++++-
 kernel/cgroup/cpuset.c                          | 2 +-
 kernel/sched/debug.c                            | 1 +
 kernel/sched/topology.c                         | 2 +-
 4 files changed, 9 insertions(+), 3 deletions(-)

-- 
2.20.1