From: "Huang, Ying"
To: Peter Zijlstra
Cc: , Valentin Schneider, Ingo Molnar, Mel Gorman, Rik van Riel, Srikar Dronamraju
Subject: [PATCH -V3 2/2 UPDATE] NUMA balancing: avoid to migrate task to CPU-less node
References: <20220214121553.582248-1-ying.huang@intel.com>
 <20220214121553.582248-2-ying.huang@intel.com>
Date: Tue, 08 Mar 2022 10:05:16 +0800
In-Reply-To: <20220214121553.582248-2-ying.huang@intel.com> (Huang Ying's message of "Mon, 14 Feb 2022 20:15:53 +0800")
Message-ID: <87y21lkxlv.fsf_-_@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
X-Mailing-List: linux-kernel@vger.kernel.org

In a typical memory tiering system, there is no CPU in the slow (PMEM) NUMA nodes. But if the number of hint page faults on a PMEM node is the highest for a task, the current NUMA balancing policy may try to place the task on the PMEM node instead of a DRAM node.
This is unreasonable, because there is no CPU in PMEM NUMA nodes. To fix this, this patch ignores CPU-less nodes when searching for the migration target node for a task.

To test the patch, we run a workload that accesses more memory in the PMEM node than memory in the DRAM node. Without the patch, the PMEM node is chosen as the preferred node in task_numa_placement(); with the patch, the DRAM node is chosen instead.

Known issue: I don't have systems to test complex NUMA topology types, for example, NUMA_BACKPLANE or NUMA_GLUELESS_MESH.

v3:

- Fix a boot crash for some uncovered marginal condition. Thanks Qian Cai for reporting and testing the bug!
- Fix several missing places to use CPU-less nodes as migration target.

Signed-off-by: "Huang, Ying"
Reported-and-tested-by: Qian Cai # boot crash
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Ingo Molnar
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Srikar Dronamraju
---
 kernel/sched/fair.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04968f3f9b6d..1fe7a4510cca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1988,7 +1988,7 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	ng = deref_curr_numa_group(p);
 	if (env.best_cpu == -1 || (ng && ng->active_nodes > 1)) {
-		for_each_online_node(nid) {
+		for_each_node_state(nid, N_CPU) {
 			if (nid == env.src_nid || nid == p->numa_preferred_nid)
 				continue;
@@ -2086,13 +2086,13 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
 	unsigned long faults, max_faults = 0;
 	int nid, active_nodes = 0;
 
-	for_each_online_node(nid) {
+	for_each_node_state(nid, N_CPU) {
 		faults = group_faults_cpu(numa_group, nid);
 		if (faults > max_faults)
 			max_faults = faults;
 	}
 
-	for_each_online_node(nid) {
+	for_each_node_state(nid, N_CPU) {
 		faults = group_faults_cpu(numa_group, nid);
 		if (faults * ACTIVE_NODE_FRACTION > max_faults)
 			active_nodes++;
@@ -2246,7 +2246,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
 	dist = sched_max_numa_distance;
-	for_each_online_node(node) {
+	for_each_node_state(node, N_CPU) {
 		score = group_weight(p, node, dist);
 		if (score > max_score) {
 			max_score = score;
@@ -2265,7 +2265,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
 	 * inside the highest scoring group of nodes. The nodemask tricks
 	 * keep the complexity of the search down.
 	 */
-	nodes = node_online_map;
+	nodes = node_states[N_CPU];
 	for (dist = sched_max_numa_distance; dist > LOCAL_DISTANCE; dist--) {
 		unsigned long max_faults = 0;
 		nodemask_t max_group = NODE_MASK_NONE;
@@ -2404,6 +2404,21 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
+	/* Cannot migrate task to CPU-less node */
+	if (max_nid != NUMA_NO_NODE && !node_state(max_nid, N_CPU)) {
+		int near_nid = max_nid;
+		int distance, near_distance = INT_MAX;
+
+		for_each_node_state(nid, N_CPU) {
+			distance = node_distance(max_nid, nid);
+			if (distance < near_distance) {
+				near_nid = nid;
+				near_distance = distance;
+			}
+		}
+		max_nid = near_nid;
+	}
+
 	if (ng) {
 		numa_group_count_active_nodes(ng);
 		spin_unlock_irq(group_lock);
-- 
2.30.2
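
[Editor's note] For readers without a kernel tree at hand, the nearest-node fallback that the patch adds to task_numa_placement() can be sketched as a standalone userspace C program. The distance table, the CPU mask, and the function name nearest_cpu_node() below are all hypothetical stand-ins: the kernel uses node_distance() (backed by the firmware SLIT table) and node_state(nid, N_CPU), not these arrays. Nodes 0-1 play the role of DRAM nodes with CPUs; nodes 2-3 are CPU-less PMEM nodes.

```c
#include <assert.h>
#include <limits.h>

#define NR_NODES 4
#define NO_NODE  (-1)	/* stands in for NUMA_NO_NODE */

/* Mock SLIT-style distance table (10 = local, larger = farther). */
static const int mock_distance[NR_NODES][NR_NODES] = {
	{ 10, 21, 17, 28 },
	{ 21, 10, 28, 17 },
	{ 17, 28, 10, 28 },
	{ 28, 17, 28, 10 },
};

/* Mock N_CPU node state: nodes 0-1 have CPUs, 2-3 are CPU-less. */
static const int mock_has_cpu[NR_NODES] = { 1, 1, 0, 0 };

/*
 * Mirror of the patch's fallback: if the preferred node is CPU-less,
 * redirect the task to the nearest node that does have CPUs.
 */
int nearest_cpu_node(int max_nid)
{
	int nid, near_nid = max_nid;
	int distance, near_distance = INT_MAX;

	if (max_nid == NO_NODE || mock_has_cpu[max_nid])
		return max_nid;	/* nothing to fix up */

	for (nid = 0; nid < NR_NODES; nid++) {
		if (!mock_has_cpu[nid])
			continue;	/* skip CPU-less nodes */
		distance = mock_distance[max_nid][nid];
		if (distance < near_distance) {
			near_nid = nid;
			near_distance = distance;
		}
	}
	return near_nid;
}
```

With this table, a task whose faults peak on PMEM node 2 is redirected to DRAM node 0 (distance 17 beats 28), and node 3 is redirected to node 1, while preferred nodes that already have CPUs are returned unchanged.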