From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra, Phil Auld
Cc: Matthew Garrett, Vineeth Remanan Pillai, Nishanth Aravamudan,
    Julien Desfossez, mingo@kernel.org, tglx@linutronix.de, pjt@google.com,
    torvalds@linux-foundation.org, linux-kernel@vger.kernel.org,
    subhra.mazumdar@oracle.com, fweisbec@gmail.com, keescook@chromium.org,
    kerrnel@google.com, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
    Pawan Gupta, Paolo Bonzini
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3
Date: Wed, 28 Aug 2019 09:37:37 -0700
Message-ID: <1a80f754-ff5b-353e-cc92-a5a4823976db@linux.intel.com>
In-Reply-To: <20190828160114.GE17205@worktop.programming.kicks-ass.net>

On 8/28/19 9:01 AM, Peter Zijlstra wrote:
> On Wed, Aug 28, 2019 at 11:30:34AM -0400, Phil Auld wrote:
>> On Tue, Aug 27, 2019 at 11:50:35PM +0200 Peter Zijlstra wrote:
>
>>
>> The current core scheduler implementation, I believe, still has (theoretical?)
>> holes involving interrupts, once/if those are closed it may be even less
>> attractive.
>
> No; so MDS leaks anything the other sibling (currently) does, this makes
> _any_ privilidge boundary a synchronization context.
>
> Worse still, the exploit doesn't require a VM at all, any other task can
> get to it.
>
> That means you get to sync the siblings on lovely things like system
> call entry and exit, along with VMM and anything else that one would
> consider a privilidge boundary. Now, system calls are not rare, they
> are really quite common in fact. Trying to sync up siblings at the rate
> of system calls is utter madness.
>
> So under MDS, SMT is completely hosed. If you use VMs exclusively, then
> it _might_ work because a 'pure' host doesn't schedule that often
> (maybe, same assumption as for L1TF).
>
> Now, there have been proposals of moving the privilidge boundary further
> into the kernel. Just like PTI exposes the entry stack and code to
> Meltdown, the thinking is, lets expose more. By moving the priv boundary
> the hope is that we can do lots of common system calls without having to
> sync up -- lots of details are 'pending'.
>

If we are willing to consider the idea that we sync with the sibling
only when we touch potential user data, then a significant portion of
syscalls may not need to sync. Yeah, it still sucks because of the
complexity added to audit all the places in the kernel that may touch
privileged data and require synchronization.

I did a prototype (without core sched); a kernel build slowed down by
2.5%, so this use case still seems reasonable. A worst-case scenario is
concurrent SMT FIO writes to an encrypted file, which incur a lot of
synchronization due to extended access to privileged data by crypto;
there we slow down by 9%.

Tim
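
For illustration only, here is a toy model of the trade-off discussed
above. It is not the prototype and not kernel code. It just counts how
many sibling-sync episodes a syscall trace would need if every privilege
boundary forces a sync, versus syncing only when a call touches
potential user data. The class, the function names, the trace and the
per-call classification are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Syscall:
    name: str
    # Assumption: an audit has already decided this per call.
    touches_user_data: bool


def sync_episodes_every_boundary(calls):
    """Every syscall is a privilege boundary, so each one needs a sync."""
    return len(calls)


def sync_episodes_user_data_only(calls):
    """Sync only for calls that touch potential user data."""
    return sum(1 for c in calls if c.touches_user_data)


# Made-up trace; the True/False labels are illustrative, not audited.
trace = [
    Syscall("getpid", False),
    Syscall("read", True),
    Syscall("sched_yield", False),
    Syscall("write", True),
    Syscall("getuid", False),
]

print(sync_episodes_every_boundary(trace))   # 5
print(sync_episodes_user_data_only(trace))   # 2
```

The model says nothing about the cost of a single sync or about how
long the siblings stay synchronized, which is what dominates the crypto
case mentioned above.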