Received: by 2002:a05:6902:102b:0:0:0:0 with SMTP id x11csp78077ybt; Tue, 30 Jun 2020 15:12:25 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxD0b2B2yFhf6iu6wFUSG9K7q6nbg9SQQNx2jFjyMwzNL+xrCMvgmgnfYW2mwkC1omPftqu X-Received: by 2002:a17:906:1f0d:: with SMTP id w13mr20972103ejj.0.1593554689041; Tue, 30 Jun 2020 15:04:49 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1593554689; cv=none; d=google.com; s=arc-20160816; b=rJq88OwtkHONPP9sVjpXQyRaCxUxBX5QIXY1k170Mh2vD0uiy/uimFfAO5V+I2MDVt 368lThV45gkwLJCFK5esxEpPxi5Y4ltCw+ksd5r+FuFY4VQfhcvupAeayqJWHXrPp35B jbQ3FRhDubOgLrPU7NLeyMuUEY+INktwVk2bVanq7UNSzxVEUPPcDnOUXvpqh7pVr4E8 ZzhTn7JPqpXEewaeBpeS3FmtafDA6pZVfK8jiAvSn5VezQxu/T1dc/F/SoezrUdaGhti +S8i7vP8yxJXxIcNtgyl5dorKCWomvFkTN/ZFi+n97z7y+S54YibArS4znfAOk+o/HjO qfGQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:references:in-reply-to:references :in-reply-to:message-id:date:subject:cc:to:from:dkim-signature; bh=+w1t38b3w4AZT1Xc7eDxHzeTTLA5N812ElI9LsAJM04=; b=DfQSVoibOpaXyz53ub2XhfmP9AYn2EjBVvvEHIvc/m+wv3gVl6vv1af9EywQHYGP6U XjVCmUpQ6oZAl/zm6NJuwKCp29XkdUujWdMfKfrKxkBVfelY8vx8dYdVB94rwhy7BP53 Ln+Qp/KfM24bwyo5fPZ8MiaT1GiHG+gu1IF0xG43Urqt0hWV4guBnxUfL8Uez5KlQiMu sk9bzvRT0Z9nKUaOe59bLT4A4AqSUQdZXl4QBTFNCTbCp8mEaEldPQalsSVeeTtjY7JP Of1YcX/k+tzZDT9vwbz1j2lqhiIvPYLzUilFvjg08S6z6Tlfa8g6+XmPEXERqcnYpjwK 2QnA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@digitalocean.com header.s=google header.b=CWbPUwdT; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=digitalocean.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id w17si2299092ejk.421.2020.06.30.15.04.25; Tue, 30 Jun 2020 15:04:49 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@digitalocean.com header.s=google header.b=CWbPUwdT; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=digitalocean.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729377AbgF3VeH (ORCPT + 99 others); Tue, 30 Jun 2020 17:34:07 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34506 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729316AbgF3Vd6 (ORCPT ); Tue, 30 Jun 2020 17:33:58 -0400 Received: from mail-pf1-x442.google.com (mail-pf1-x442.google.com [IPv6:2607:f8b0:4864:20::442]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9EFECC03E97A for ; Tue, 30 Jun 2020 14:33:58 -0700 (PDT) Received: by mail-pf1-x442.google.com with SMTP id 207so9799062pfu.3 for ; Tue, 30 Jun 2020 14:33:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=digitalocean.com; s=google; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=+w1t38b3w4AZT1Xc7eDxHzeTTLA5N812ElI9LsAJM04=; b=CWbPUwdTJ5g8URhIJnHeNITo0nZAednz9PBOyw+NOwDt34Tl6Dv1qG37QkcSv9uHni 8jIYe7VVkhwY7cVmSwqUeQFMIb0QxWFtGn3RgWGBiMWw0HSxUc8+zbzK+JOF+787t1xS siMTK876xjOjOY5yePv1tse1Sd3CiY8Bo/k3M= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=+w1t38b3w4AZT1Xc7eDxHzeTTLA5N812ElI9LsAJM04=; b=dDr+ZXyF6GvskZ6kcCMBLD8x5lv0iX2eqeUF7RM9J6zP8g4rjL+2BeE2UZ3Exu/X+U xC2sKF2dX7jbfxMenOHZeTLFoDvnHsJTV9d8FrPj7QB517eoTADjpBjP8mAqNIzSCXWR /TiR6yTsTmyyJWXodA/VihHDaU7vl4i4wiqhw4+ah8xVbwAf88QIal+KvC9hWL6B+XYZ 2/jGPuSiMmbavQQ/0uVGz4jWIUCok4r8D00ARB50yi0O/OxvdK+81mNOHKejcITnQHEf Gaycf/PBVZsgwnAnNrOfXYhcZ5EZYbyL5Icd+wO/Q4RNGrFx2cW7Y2ZqytpXj0nHr/WZ RUnw== X-Gm-Message-State: AOAM533JxSTNSJ3tRl5cvSIP7TuY1P9rsZZg4tzcS5oy79JhmnFDz+i/ iP+NHwzhSFQsiplD/xQC7fWA5w== X-Received: by 2002:aa7:98c6:: with SMTP id e6mr21195759pfm.17.1593552837673; Tue, 30 Jun 2020 14:33:57 -0700 (PDT) Received: from vpillai-dev.sfo2.internal.digitalocean.com ([138.68.32.68]) by smtp.gmail.com with ESMTPSA id j19sm3518695pfn.109.2020.06.30.14.33.56 (version=TLS1_2 cipher=ECDHE-ECDSA-CHACHA20-POLY1305 bits=256/256); Tue, 30 Jun 2020 14:33:57 -0700 (PDT) From: Vineeth Remanan Pillai To: Nishanth Aravamudan , Julien Desfossez , Peter Zijlstra , Tim Chen , mingo@kernel.org, tglx@linutronix.de, pjt@google.com, torvalds@linux-foundation.org Cc: "Joel Fernandes (Google)" , linux-kernel@vger.kernel.org, subhra.mazumdar@oracle.com, fweisbec@gmail.com, keescook@chromium.org, kerrnel@google.com, Phil Auld , Aaron Lu , Aubrey Li , Valentin Schneider , Mel Gorman , Pawan Gupta , Paolo Bonzini , Joel Fernandes , vineethrp@gmail.com, Chen Yu , Christian Brauner , Vineeth Remanan Pillai Subject: [RFC PATCH 15/16] Documentation: Add documentation on core scheduling Date: Tue, 30 Jun 2020 21:32:36 +0000 Message-Id: X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: "Joel Fernandes (Google)" Signed-off-by: Joel Fernandes (Google) Signed-off-by: Vineeth Remanan Pillai --- .../admin-guide/hw-vuln/core-scheduling.rst | 241 ++++++++++++++++++ Documentation/admin-guide/hw-vuln/index.rst | 1 + 2 files changed, 242 insertions(+) create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst new file mode 100644 index 000000000000..275568162a74 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst @@ -0,0 +1,241 @@ +Core Scheduling +================ +MDS and L1TF mitigations do not protect from cross-HT attacks (attacker running +on one HT with victim running on another). For proper mitigation of this, +core scheduling support is available via the ``CONFIG_SCHED_CORE`` config option. +Using this feature, userspace defines groups of tasks that trust each other. +The core scheduler uses this information to make sure that tasks that do not +trust each other will never run simultaneously on a core while ensuring trying +to maintain and ensure scheduler properties and requirements. + +Usage +----- +The user-interface to this feature is not yet finalized. CUrrent implementation +uses CPU controller cgroup. Core scheduling adds a ``cpu.tag`` file to the CPU +controller CGroup. If the content of this file is 1, then all the tasks in this +CGroup trust each other and are allowed to run concurrently on the siblings of +a core. + +This interface has drawbacks. Trusted tasks has to be grouped into one CPU CGroup +and this is not always possible based on system's existing Cgroup configuration, +where trusted tasks could already be in different CPU Cgroups. Also, this feature +will have a hard dependency on CGroups and systems with CGroups disabled would not +be able to use core scheduling. See `Future work`_ for other API proposals. + +Design/Implementation +--------------------- +Tasks are grouped as mentioned in `Usage`_ and tasks that trust each other +share the same cookie value(in task_struct). + +The basic idea is that, every schedule event tries to select tasks for all the +siblings of a core such that all the selected tasks are trusted(same cookie). + +During a schedule event on any sibling of a core, the highest priority task for +that core is picked and assigned to the sibling which has it enqueued. For rest of +the siblings in the core, highest priority task with the same cookie is selected if +there is one runnable in the run queue. If a task with same cookie is not available, +idle task is selected. Idle task is globally trusted. + +Once a task has been selected for all the siblings in the core, an IPI is sent to +all the siblings who has a new task selected. Siblings on receiving the IPI, will +switch to the new task immediately. + +Force-idling of tasks +--------------------- +The scheduler tries its best to find tasks that trust each other such that all +tasks selected to be scheduled are of the highest priority in that runqueue. +However, this is not always possible. Favoring security over fairness, one or +more siblings could be forced to select a lower priority task if the highest +priority task is not trusted with respect to the core wide highest priority task. +If a sibling does not have a trusted task to run, it will be forced idle by the +scheduler(idle thread is scheduled to run). + +When the highest priorty task is selected to run, a reschedule-IPI is sent to +the sibling to force it into idle. This results in 4 cases which need to be +considered depending on whether a VM or a regular usermode process was running +on either HT: + +:: + HT1 (attack) HT2 (victim) + + A idle -> user space user space -> idle + + B idle -> user space guest -> idle + + C idle -> guest user space -> idle + + D idle -> guest guest -> idle + +Note that for better performance, we do not wait for the destination CPU +(victim) to enter idle mode. This is because the sending of the IPI would +bring the destination CPU immediately into kernel mode from user space, or +VMEXIT from guests. At best, this would only leak some scheduler metadata which +may not be worth protecting. + +Protection against interrupts using IRQ pausing +----------------------------------------------- +The scheduler on its own cannot protect interrupt data. This is because the +scheduler is unaware of interrupts at scheduling time. To mitigate this, we +send an IPI to siblings on IRQ entry. This IPI handler busy-waits until the IRQ +on the sending HT exits. For good performance, we send an IPI only if it is +detected that the core is running tasks that have been marked for +core scheduling. Both interrupts and softirqs are protected. + +This protection can be disabled by disabling ``CONFIG_SCHED_CORE_IRQ_PAUSE`` or +through the ``sched_core_irq_pause`` boot parameter. + +If it is desired to disable IRQ pausing, other mitigation methods can be considered: + +1. Changing interrupt affinities to a trusted core which does not execute untrusted tasks +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +By changing the interrupt affinities to a designated safe-CPU which runs +only trusted tasks, IRQ data can be protected. One issue is this involves +giving up a full CPU core of the system to run safe tasks. Another is that, +per-cpu interrupts such as the local timer interrupt cannot have their +affinity changed. also, sensitive timer callbacks such as the random entropy timer +can run in softirq on return from these interrupts and expose sensitive +data. In the future, that could be mitigated by forcing softirqs into threaded +mode by utilizing a mechanism similar to ``PREEMPT_RT``. + +Yet another issue with this is, for multiqueue devices with managed +interrupts, the IRQ affinities cannot be changed however it could be +possible to force a reduced number of queues which would in turn allow to +shield one or two CPUs from such interrupts and queue handling for the price +of indirection. + +2. Running IRQs as threaded-IRQs +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +This would result in forcing IRQs into the scheduler which would then provide +the process-context mitigation. However, not all interrupts can be threaded. + +Trust model +----------- +Core scheduling understands trust relationships by assignment of a cookie to +related tasks using the above mentioned interface. When a system with core +scheduling boots, all tasks are considered to trust each other. This is because +the scheduler does not have information about trust relationships. That is, all +tasks have a default cookie value of 0. This cookie value is also considered +the system-wide cookie value and the IRQ-pausing mitigation is avoided if +siblings are running these cookie-0 tasks. + +By default, all system processes on boot are considered trusted and userspace +has to explicitly use the interfaces mentioned above to group sets of tasks. +Tasks within the group trust each other, but not those outside. Tasks outside +the group don't trust the task inside. + +Limitations +----------- +Core scheduling tries to guarentee that only trusted tasks run concurrently on a +core. But there could be small window of time during which untrusted tasks run +concurrently or kernel could be running concurrently with a task not trusted by +kernel. + +1. IPI processing delays +^^^^^^^^^^^^^^^^^^^^^^^^ +Core scheduling selects only trusted tasks to run together. IPI is used to notify +the siblings to switch to the new task. But there could be hardware delays in +receiving of the IPI on some arch (on x86, this has not been observed). This may +cause an attacker task to start running on a cpu before its siblings receive the +IPI. Even though cache is flushed on entry to user mode, victim tasks on siblings +may populate data in the cache and micro acrhitectural buffers after the attacker +starts to run and this is a possibility for data leak. + +2. Asynchronous Kernel entries +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +A task can switch to kernel any time due to events like irqs, system calls etc. +Since core scheduling synchronizes only during a schedule event, kernel can run +along with a task that it doesn't trust. The IRQ pause mechanism mentioned above, +provides protection during nmi/irq/softirqs. But tasks could still enter kernel +mode via system calls and this is not currently protected. + +There are ideas about mitigating this: + - Kernel Address Space Isolation: System calls could run in a much restricted + address space which is guarenteed not to leak any sensitive data. There are + practical limitation in implementing this - the main concern being how to + decided on an address space that is guarenteed to not have any sensitive + data + - On a system call, change the cookie to the system trusted cookie and initiate + a schedule event. This would be better than pausing all the siblings during + the entire duration for the system call, but still would be a huge hit to the + performance. + +Open cross-HT issues that core scheduling does not solve +-------------------------------------------------------- +1. For MDS +^^^^^^^^^^ +Core scheduling cannot protect against MDS attacks between an HT running in +user mode and another running in kernel mode. Even though both HTs run tasks +which trust each other, kernel memory is still considered untrusted. Such +attacks are possible for any combination of sibling CPU modes (host or guest mode). + +2. For L1TF +^^^^^^^^^^^ +Core scheduling cannot protect against a L1TF guest attackers exploiting a +guest or host victim. This is because the guest attacker can craft invalid +PTEs which are not inverted due to a vulnerable guest kernel. The only +solution is to disable EPT. + +For both MDS and L1TF, if the guest vCPU is configured to not trust each +other (by tagging separately), then the guest to guest attacks would go away. +Or it could be a system admin policy which considers guest to guest attacks as +a guest problem. + +Another approach to resolve these would be to make every untrusted task on the +system to not trust every other untrusted task. While this could reduce +parallelism of the untrusted tasks, it would still solve the above issues while +allowing system processes (trusted tasks) to share a core. + +Use cases +--------- +The main use case for Core scheduling is mitigating the cross-HT vulnerabilities +with SMT enabled. There are other use cases where this feature could be used: + +- Isolating tasks that needs a whole core: Examples include realtime tasks, tasks + that uses SIMD instructions etc. +- Gang scheduling: Requirements for a group of tasks that needs to be scheduled + together could also be realized using core scheduling. One example is vcpus of + a VM. + +Future work +----------- +1. API Proposals +^^^^^^^^^^^^^^^^ + +As mentioned in `Usage`_ section, various API proposals are listed here: + +- ``prctl`` : We can pass in a tag and all tasks with same tag set by prctl forms + a trusted group. + +- ``sched_setattr`` : Similar to prctl, but has the advantage that tasks could be + tagged by other tasks with appropriate permissions. + +- ``Auto Tagging`` : Related tasks are tagged automatically. Relation could be, + threads of the same process, tasks by a user, group or session etc. + +- Dedicated cgroup or procfs/sysfs interface for grouping trusted tasks. This could + be combined with prctl/sched_setattr as well. + +2. Auto-tagging of KVM vCPU threads +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +To make configuration easier, it would be great if KVM auto-tags vCPU threads +such that a given VM only trusts other vCPUs of the same VM. Or something more +aggressive like assiging a vCPU thread a unique tag. + +3. Auto-tagging of processes by default +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Currently core scheduling does not prevent 'unconfigured' tasks from being +co-scheduled on the same core. In other words, everything trusts everything +else by default. If a user wants everything default untrusted, a CONFIG option +could be added to assign every task with a unique tag by default. + +4. Auto-tagging on fork +^^^^^^^^^^^^^^^^^^^^^^^ +Currently, on fork a thread is added to the same trust-domain as the parent. For +systems which want all tasks to have a unique tag, it could be desirable to assign +a unique tag to a task so that the parent does not trust the child by default. + +5. Skipping per-HT mitigations if task is trusted +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +If core scheduling is enabled, by default all tasks trust each other as +mentioned above. In such scenario, it may be desirable to skip the same-HT +mitigations on return to the trusted user-mode to improve performance. diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst index ca4dbdd9016d..f12cda55538b 100644 --- a/Documentation/admin-guide/hw-vuln/index.rst +++ b/Documentation/admin-guide/hw-vuln/index.rst @@ -15,3 +15,4 @@ are configurable at compile, boot or run time. tsx_async_abort multihit.rst special-register-buffer-data-sampling.rst + core-scheduling.rst -- 2.17.1