Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated
 CPUs in mind (for KVM to isolate its guests per CPU)
From: Liran Alon
In-Reply-To: <1534845423.10027.44.camel@infradead.org>
Date: Tue, 21 Aug 2018 17:01:57 +0300
Cc: Linus Torvalds, Konrad Rzeszutek Wilk, juerg.haefliger@hpe.com,
 deepa.srinivasan@oracle.com, Jim Mattson, Andrew Cooper,
 Linux Kernel Mailing List, Boris Ostrovsky, linux-mm, Thomas Gleixner,
 joao.m.martins@oracle.com, pradeep.vincent@oracle.com, Andi Kleen,
 Khalid Aziz, kanth.ghatraju@oracle.com, Kees Cook,
 jsteckli@os.inf.tu-dresden.de, Kernel Hardening, chris.hyser@oracle.com,
 Tyler Hicks, John Haxby, Jon Masters
References: <20180820212556.GC2230@char.us.oracle.com>
 <1534801939.10027.24.camel@amazon.co.uk>
 <1534845423.10027.44.camel@infradead.org>
To: David Woodhouse
Sender:
 linux-kernel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

> On 21 Aug 2018, at 12:57, David Woodhouse wrote:
>
> Another alternative... I'm told POWER8 does an interesting thing with
> hyperthreading and gang scheduling for KVM. The host kernel doesn't
> actually *see* the hyperthreads at all, and KVM just launches the full
> set of siblings when it enters a guest, and gathers them again when any
> of them exits. That's definitely worth investigating as an option for
> x86, too.

I actually think that such a scheduling mechanism, which prevents
leaking cache entries to sibling hyperthreads, should co-exist with the
KVM address space isolation to fully mitigate L1TF and other similar
vulnerabilities. The address space isolation should prevent code gadgets
in VMExit handlers from loading arbitrary host memory into the cache.
Once the VMExit code path switches to the full host address space, we
should also make sure that no sibling hyperthread is running in the
guest.

Focusing on the scheduling mechanism, we must make sure that when a
logical processor runs guest code, all sibling logical processors run
code which does not populate the L1D cache with information unrelated to
this VM. This includes forbidding one logical processor from running
guest code while its sibling is running a host task such as a NIC
interrupt handler.

Thus, when a vCPU thread exits the guest into the host and the VMExit
handler reaches a code flow which could populate the L1D cache with such
information, we should force the sibling logical processors to exit the
guest as well, such that they will be allowed to resume only on a core
for which we can promise that the L1D cache is free of information
unrelated to this VM.

At first, I created a patch series which attempts to implement such a
mechanism in KVM. However, it became clear to me that this may need to
be implemented in the scheduler itself. This is because:
1. It is difficult to handle all the new scheduling constraints only in
KVM.
2. This mechanism should be relevant for any Type-2 hypervisor which
runs inside Linux besides KVM (such as VMware Workstation or
VirtualBox).
3. This mechanism could also be used to prevent future
“core-cache-leaking” vulnerabilities from being exploited between
processes of different security domains which run as siblings on the
same core.

The main idea is a mechanism which is very similar to Microsoft's "core
scheduler", which they implemented to mitigate this vulnerability. The
mechanism should work as follows:
1. Each CPU core will now be tagged with a "security domain id".
2. The scheduler will provide a mechanism to tag a task with a security
domain id.
3. Tasks will inherit their security domain id from their parent task.
    3.1. The first task in the system will have a security domain id of
0. Thus, if nothing special is done, all tasks will be assigned a
security domain id of 0.
4. Tasks will be able to allocate a new security domain id from the
scheduler and assign it to another task dynamically.
5. The Linux scheduler will prevent scheduling tasks on a core with a
different security domain id:
    5.0. The CPU core's security domain id will be set to the security
domain id of the tasks which currently run on it.
    5.1. The scheduler will first attempt to schedule a task on a core
with the required security domain id, if such a core exists.
    5.2. Otherwise, it will need to decide whether to kick all tasks
running on some core in order to run the task with a different security
domain id on that core.

The above mechanism can be used to mitigate the L1TF HT variant by just
assigning vCPU tasks a security domain id which is unique per VM and
also different from the security domain id of the host, which is 0.

I would be glad to hear feedback on the above suggestion.
If this should better be discussed on a separate email thread, please
say so and I will open a new thread.

Thanks,
-Liran
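
For illustration only, the tagging rules 1 to 5 above can be modelled
as a toy user-space simulation. This is a rough sketch, not kernel code
and not taken from any patch series. Every name in it (Task, Core,
alloc_domain, pick_core, place) is hypothetical. It also assumes two
things the proposal does not state: each core has two hardware threads,
and an idle core has no domain and may take a task from any domain.
Rule 5.2 (whether to kick the tasks on some core) is left to the caller.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional

HOST_DOMAIN = 0            # rule 3.1: the first task is in domain 0
_next_domain = count(1)    # ids handed out by alloc_domain()


def alloc_domain() -> int:
    """Rule 4: allocate a fresh security domain id (never 0)."""
    return next(_next_domain)


@dataclass
class Task:
    name: str
    domain: int = HOST_DOMAIN   # rule 2: a task carries a domain tag

    def fork(self, name: str) -> "Task":
        # Rule 3: a child inherits its parent's security domain id.
        return Task(name, self.domain)


@dataclass
class Core:
    nr_threads: int = 2         # assumption: 2 sibling hyperthreads
    running: List[Task] = field(default_factory=list)

    @property
    def domain(self) -> Optional[int]:
        # Rule 5.0: the core takes the domain of the tasks running on
        # it. Assumption: an idle core has no domain (None).
        return self.running[0].domain if self.running else None

    def can_run(self, task: Task) -> bool:
        # Rule 5: never mix domains on the siblings of one core.
        free = len(self.running) < self.nr_threads
        return free and self.domain in (None, task.domain)


def pick_core(cores: List[Core], task: Task) -> Optional[int]:
    # Rule 5.1: prefer a busy core that is already in the task's domain.
    for i, core in enumerate(cores):
        if core.running and core.can_run(task):
            return i
    # Assumption: otherwise fall back to any idle core.
    for i, core in enumerate(cores):
        if not core.running:
            return i
    # Rule 5.2: no compatible core. Whether to kick the tasks of some
    # core is a policy decision this sketch leaves to the caller.
    return None


def place(cores: List[Core], task: Task) -> Optional[int]:
    """Run the task on the chosen core; return its index or None."""
    i = pick_core(cores, task)
    if i is not None:
        cores[i].running.append(task)
    return i


# The L1TF use case: vCPU tasks of one VM get a domain that is unique
# to that VM and different from the host's domain 0.
init = Task("init")
vm1 = alloc_domain()
cores = [Core(), Core()]
place(cores, init)                     # host task takes core 0
place(cores, Task("vm1-vcpu0", vm1))   # goes to the idle core 1
place(cores, Task("vm1-vcpu1", vm1))   # shares core 1 with its sibling
```

In this model a host task can never land on the sibling thread of a
core that is running vm1's vCPUs, which is the property the L1TF HT
mitigation needs. The model deliberately ignores everything else a
real scheduler must handle, such as interrupts, fairness and load
balancing.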