From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Daniel Jordan,
    "Peter Zijlstra (Intel)", Tejun Heo
Subject: [PATCH 5.10 577/717] cpuset: fix race between hotplug work and later CPU offline
Date: Mon, 28 Dec 2020 13:49:35 +0100
Message-Id: <20201228125048.564078778@linuxfoundation.org>
In-Reply-To: <20201228125020.963311703@linuxfoundation.org>
References: <20201228125020.963311703@linuxfoundation.org>

From: Daniel Jordan

commit 406100f3da08066c00105165db8520bbc7694a36 upstream.

One of our machines keeled over trying to rebuild the scheduler domains.
Mainline produces the same splat:

    BUG: unable to handle page fault for address: 0000607f820054db
    CPU: 2 PID: 149 Comm: kworker/1:1 Not tainted 5.10.0-rc1-master+ #6
    Workqueue: events cpuset_hotplug_workfn
    RIP: build_sched_domains
    Call Trace:
     partition_sched_domains_locked
     rebuild_sched_domains_locked
     cpuset_hotplug_workfn

It happens with cgroup2 and exclusive cpusets only.  This reproducer
triggers it on an 8-cpu vm and works most effectively with no
preexisting child cgroups:

    cd $UNIFIED_ROOT
    mkdir cg1
    echo 4-7 > cg1/cpuset.cpus
    echo root > cg1/cpuset.cpus.partition

    # with smt/control reading 'on',
    echo off > /sys/devices/system/cpu/smt/control

RIP maps to

    sd->shared = *per_cpu_ptr(sdd->sds, sd_id);

from sd_init().  sd_id is calculated earlier in the same function:

    cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
    sd_id = cpumask_first(sched_domain_span(sd));

tl->mask(cpu), which reads cpu_sibling_map on x86, returns an empty mask
and so cpumask_first() returns >= nr_cpu_ids, which leads to the bogus
value from per_cpu_ptr() above.
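To make the failure mode concrete, here is a minimal userspace sketch
of that sequence; the mask type and helpers are simplified stand-ins
for the kernel's cpumask API, not the real thing:

    /* Simplified stand-ins for cpumask_and()/cpumask_first(). */
    #include <stdio.h>

    #define NR_CPUS 8

    struct mask { unsigned long bits; };

    /* First set bit, or NR_CPUS if the mask is empty -- mirrors
     * cpumask_first() returning >= nr_cpu_ids on an empty mask. */
    static int mask_first(const struct mask *m)
    {
            for (int cpu = 0; cpu < NR_CPUS; cpu++)
                    if (m->bits & (1UL << cpu))
                            return cpu;
            return NR_CPUS;
    }

    int main(void)
    {
            struct mask cpu_map = { 0xf0 }; /* stale effective mask: cpus 4-7 */
            struct mask sibling = { 0x00 }; /* cpu_sibling_map cleared by offline */
            struct mask span    = { cpu_map.bits & sibling.bits }; /* cpumask_and() */

            int sd_id = mask_first(&span);  /* cpumask_first() on an empty span */
            printf("sd_id = %d (>= nr_cpu_ids: bogus per_cpu_ptr() index)\n", sd_id);
            return 0;
    }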
The problem is a race between cpuset_hotplug_workfn() and a later
offline of CPU N.  cpuset_hotplug_workfn() updates the effective masks
when N is still online, the offline clears N from cpu_sibling_map, and
then the worker uses the stale effective masks that still have N to
generate the scheduling domains, leading the worker to read N's empty
cpu_sibling_map in sd_init().

rebuild_sched_domains_locked() prevented the race during the cgroup2
cpuset series up until the Fixes commit changed its check.  Make the
check more robust so that it can detect an offline CPU in any exclusive
cpuset's effective mask, not just the top one.

Fixes: 0ccea8feb980 ("cpuset: Make generate_sched_domains() work with partition")
Signed-off-by: Daniel Jordan
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Tejun Heo
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20201112171711.639541-1-daniel.m.jordan@oracle.com
Signed-off-by: Greg Kroah-Hartman
---
 kernel/cgroup/cpuset.c |   33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -983,25 +983,48 @@ partition_and_rebuild_sched_domains(int
  */
 static void rebuild_sched_domains_locked(void)
 {
+	struct cgroup_subsys_state *pos_css;
 	struct sched_domain_attr *attr;
 	cpumask_var_t *doms;
+	struct cpuset *cs;
 	int ndoms;
 
 	lockdep_assert_cpus_held();
 	percpu_rwsem_assert_held(&cpuset_rwsem);
 
 	/*
-	 * We have raced with CPU hotplug. Don't do anything to avoid
+	 * If we have raced with CPU hotplug, return early to avoid
 	 * passing doms with offlined cpu to partition_sched_domains().
-	 * Anyways, hotplug work item will rebuild sched domains.
+	 * Anyways, cpuset_hotplug_workfn() will rebuild sched domains.
+	 *
+	 * With no CPUs in any subpartitions, top_cpuset's effective CPUs
+	 * should be the same as the active CPUs, so checking only top_cpuset
+	 * is enough to detect racing CPU offlines.
 	 */
 	if (!top_cpuset.nr_subparts_cpus &&
 	    !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
 		return;
 
-	if (top_cpuset.nr_subparts_cpus &&
-	    !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
-		return;
+	/*
+	 * With subpartition CPUs, however, the effective CPUs of a partition
+	 * root should be only a subset of the active CPUs.  Since a CPU in
+	 * any partition root could be offlined, all must be checked.
+	 */
+	if (top_cpuset.nr_subparts_cpus) {
+		rcu_read_lock();
+		cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+			if (!is_partition_root(cs)) {
+				pos_css = css_rightmost_descendant(pos_css);
+				continue;
+			}
+			if (!cpumask_subset(cs->effective_cpus,
+					    cpu_active_mask)) {
+				rcu_read_unlock();
+				return;
+			}
+		}
+		rcu_read_unlock();
+	}
 
 	/* Generate domain masks and attrs */
 	ndoms = generate_sched_domains(&doms, &attr);
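For illustration, a userspace analogue of the new check above: walk the
cpuset tree in pre-order, prune subtrees under non-partition roots, and
bail out if any partition root's effective mask contains a non-active
CPU.  The struct and traversal are invented stand-ins; the kernel uses
cpuset_for_each_descendant_pre() and css_rightmost_descendant():

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    struct cs {
            const char *name;
            bool is_partition_root;
            unsigned long effective_cpus;  /* one bit per cpu */
            struct cs *child, *sibling;
    };

    /* Like cpumask_subset(a, b): every bit set in a is also set in b. */
    static bool subset(unsigned long a, unsigned long b)
    {
            return (a & ~b) == 0;
    }

    /* Returns false if some partition root has an offline CPU.  Skipping
     * a non-partition root's children mirrors the patch's jump to
     * css_rightmost_descendant(). */
    static bool partitions_active(const struct cs *cs, unsigned long active)
    {
            for (; cs; cs = cs->sibling) {
                    if (!cs->is_partition_root)
                            continue;  /* prune the whole subtree */
                    if (!subset(cs->effective_cpus, active))
                            return false;
                    if (!partitions_active(cs->child, active))
                            return false;
            }
            return true;
    }

    int main(void)
    {
            /* cg1 is a partition root on cpus 4-7; nosmt took 4-7 offline. */
            struct cs cg1 = { "cg1", true, 0xf0, NULL, NULL };
            unsigned long cpu_active_mask = 0x0f;

            printf("safe to rebuild sched domains: %s\n",
                   partitions_active(&cg1, cpu_active_mask) ? "yes" : "no");
            return 0;
    }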