From: Joel Savitz <jsavitz@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: Joel Savitz, Li Zefan, Phil Auld, Waiman Long, Tejun Heo,
    Michal Koutný, Ingo Molnar, Peter Zijlstra, cgroups@vger.kernel.org
Subject: [RESEND PATCH v3] cpuset: restore sanity to cpuset_cpus_allowed_fallback()
Date: Wed, 12 Jun 2019 11:50:48 -0400
Message-Id: <1560354648-23632-1-git-send-email-jsavitz@redhat.com>
In the case that a process is constrained by taskset(1) (i.e.
sched_setaffinity(2)) to a subset of available cpus, and all of those
are subsequently offlined, the scheduler will set tsk->cpus_allowed to
the current value of task_cs(tsk)->effective_cpus.

This is done via a call to do_set_cpus_allowed() in the context of
cpuset_cpus_allowed_fallback() made by the scheduler when this case is
detected. This is the only call made to cpuset_cpus_allowed_fallback()
in the latest mainline kernel.

However, this is not sane behavior.

I will demonstrate this on a system running the latest upstream kernel
with the following initial configuration:

# grep -i cpu /proc/$$/status
Cpus_allowed:	ffffffff,ffffffff
Cpus_allowed_list:	0-63

(Where cpus 32-63 are provided via smt.)

If we limit our current shell process to cpu2 only and then offline it
and reonline it:

# taskset -p 4 $$
pid 2272's current affinity mask: ffffffffffffffff
pid 2272's new affinity mask: 4

# echo off > /sys/devices/system/cpu/cpu2/online
# dmesg | tail -3
[ 2195.866089] process 2272 (bash) no longer affine to cpu2
[ 2195.872700] IRQ 114: no longer affine to CPU2
[ 2195.879128] smpboot: CPU 2 is now offline

# echo on > /sys/devices/system/cpu/cpu2/online
# dmesg | tail -1
[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4

We see that our current process now has an affinity mask containing
every cpu available on the system _except_ the one we originally
constrained it to:

# grep -i cpu /proc/$$/status
Cpus_allowed:	ffffffff,fffffffb
Cpus_allowed_list:	0-1,3-63

(This happens because the fallback copied task_cs(tsk)->effective_cpus
at the moment cpu2 went offline, i.e. every remaining online cpu;
re-onlining cpu2 later updates the cpuset but never restores the
task's mask. A minimal C equivalent of this demonstration is sketched
after the '---' separator below.)

This is not sane behavior: not only can the scheduler now place the
process on previously forbidden cpus, it cannot even schedule it on
the one cpu the process was originally constrained to!

Other cases result in even more exotic affinity masks. Take for
instance a process with an affinity mask containing only cpus provided
by smt at the moment that smt is toggled, in a configuration such as
the following:

# taskset -p f000000000 $$
# grep -i cpu /proc/$$/status
Cpus_allowed:	000000f0,00000000
Cpus_allowed_list:	36-39

A double toggle of smt results in the following behavior:

# echo off > /sys/devices/system/cpu/smt/control
# echo on > /sys/devices/system/cpu/smt/control
# grep -i cpus /proc/$$/status
Cpus_allowed:	ffffff00,ffffffff
Cpus_allowed_list:	0-31,40-63

This is even less sane than the previous case: the new affinity mask
excludes not only the smt-provided cpus that were actually in the mask
(36-39), but also every smt-provided cpu with a lower id (32-35).

With this patch applied, both of these cases end in the following
state:

# grep -i cpu /proc/$$/status
Cpus_allowed:	ffffffff,ffffffff
Cpus_allowed_list:	0-63

The original policy is discarded. Though not ideal, this is the
simplest way to restore sanity to this fallback case without
reinventing the cpuset wheel that rolls down the kernel just fine in
cgroup v2. A user who wishes for the previous affinity mask to be
restored in this fallback case can use that mechanism instead.

This patch modifies scheduler behavior by instead resetting the mask
to task_cs(tsk)->cpus_allowed by default (cgroup v2 mode), and to the
cpu_possible_mask in legacy (cgroup v1) mode. I tested the cases above
in both modes.

Note that the scheduler uses this fallback mechanism if and only if
_every_ other valid avenue has been traveled, and it is the last
resort before calling BUG() (an abridged sketch of that call site is
also included after the '---' separator below).

Suggested-by: Waiman Long
Suggested-by: Phil Auld
Signed-off-by: Joel Savitz
---
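For reviewers who prefer a programmatic reproducer: the following is a
minimal userspace sketch of the taskset(1) demonstration above, using
sched_setaffinity(2)/sched_getaffinity(2) directly. It is illustrative
only and not part of the patch; the offline/reonline step is still
performed by hand via sysfs exactly as shown in the commit message.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	cpu_set_t set;
	int cpu;

	CPU_ZERO(&set);
	CPU_SET(2, &set);	/* same effect as: taskset -p 4 $$ */
	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}

	printf("pid %d pinned to cpu2; offline then reonline cpu2, press enter\n",
	       getpid());
	getchar();

	/* Read back the mask the fallback left us with. */
	if (sched_getaffinity(0, sizeof(set), &set)) {
		perror("sched_getaffinity");
		return 1;
	}
	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			printf("%d ", cpu);
	printf("\n");
	return 0;
}

On an unpatched kernel this prints every cpu except 2 after the
offline/reonline cycle; with this patch applied (in legacy mode) it
prints all cpus, matching the /proc/$$/status output above.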
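For context, here is an abridged sketch of the lone call site,
select_fallback_rq() in kernel/sched/core.c, paraphrased for this
posting (the node-local search that precedes the loop is elided; see
the actual source for the full version):

static int select_fallback_rq(int cpu, struct task_struct *p)
{
	enum { cpuset, possible, fail } state = cpuset;
	int dest_cpu;

	/* ... first, try allowed + active cpus on the same node ... */

	for (;;) {
		/* Any allowed, online CPU? */
		for_each_cpu(dest_cpu, &p->cpus_allowed)
			if (is_cpu_allowed(p, dest_cpu))
				goto out;

		/* No more Mr. Nice Guy. */
		switch (state) {
		case cpuset:
			if (IS_ENABLED(CONFIG_CPUSETS)) {
				/* the fallback this patch changes */
				cpuset_cpus_allowed_fallback(p);
				state = possible;
				break;
			}
			/* fall through */
		case possible:
			do_set_cpus_allowed(p, cpu_possible_mask);
			state = fail;
			break;
		case fail:
			BUG();
			break;
		}
	}
out:
	return dest_cpu;
}

If the cpuset fallback itself yields no usable cpu, the next pass
already widens to cpu_possible_mask before BUG() is reached; the
problem addressed here is that in legacy mode the cpuset step hands
back a mask that is both wider than the original policy and missing
the cpus that policy actually requested.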
 kernel/cgroup/cpuset.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6a1942ed781c..515525ff1cfd 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3254,10 +3254,23 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 	spin_unlock_irqrestore(&callback_lock, flags);
 }
 
+/**
+ * cpuset_cpus_allowed_fallback - final fallback before complete catastrophe.
+ * @tsk: pointer to task_struct with which the scheduler is struggling
+ *
+ * Description: In the case that the scheduler cannot find an allowed cpu in
+ * tsk->cpus_allowed, we fall back to task_cs(tsk)->cpus_allowed. In legacy
+ * mode however, this value is the same as task_cs(tsk)->effective_cpus,
+ * which will not contain a sane cpumask during cases such as cpu hotplugging.
+ * This is the absolute last resort for the scheduler and it is only used if
+ * _every_ other avenue has been traveled.
+ **/
+
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 {
 	rcu_read_lock();
-	do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
+	do_set_cpus_allowed(tsk, is_in_v2_mode() ?
+		task_cs(tsk)->cpus_allowed : cpu_possible_mask);
 	rcu_read_unlock();
 
 	/*
-- 
2.18.1