From: Andy Lutomirski
Date: Mon, 3 Aug 2020 10:16:50 -0700
Subject: Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID
To: Dave Hansen
Cc: Andy Lutomirski, Fenghua Yu, Thomas Gleixner, Joerg Roedel, Ingo Molnar, Borislav Petkov, Peter Zijlstra, H Peter Anvin, David Woodhouse, Lu Baolu, Felix Kuehling, Tony Luck, Jean-Philippe Brucker, Christoph Hellwig, Ashok Raj, Jacob Jun Pan, Dave Jiang, Sohil Mehta, Ravi V Shankar, linux-kernel, x86, iommu, amd-gfx

On Mon, Aug 3, 2020 at 9:37 AM Dave Hansen wrote:
>
> On 8/3/20 8:12 AM, Andy Lutomirski wrote:
> > I could easily be convinced that the PASID fixup is so trivial and so
> > obviously free of misfiring in a way that causes an infinite loop that
> > this code is fine.  But I think we first need to answer the bigger
> > question of why we're doing a lazy fixup in the first place.
>
> There was an (internal to Intel) implementation of this about a year ago
> that used smp_call_function_many() to force the MSR state into all
> threads of a process.  I took one look at it, decided there was a 0%
> chance of it actually functioning and recommended we find another way.
> While I'm sure it could be done more efficiently, the implementation I
> looked at took ~200 lines of code and comments.  It started to look too
> much like another instance of mm->cpumask for my comfort.

If I were implementing this, I would try making switch_mm_irqs_off()
do, roughly:

void load_mm_pasid(...)
{
	if (cpu_feature_enabled(X86_FEATURE_ENQCMD))
		tsk->xstate[offset] = READ_ONCE(next->context.pasid);
}

This costs one cache miss, although the cache line in question is
about to be read anyway.  It might be faster to, instead, do:

void load_mm_pasid(...)
{
	u32 pasid = READ_ONCE(next->context.pasid);

	if (tsk->xstate[offset] != pasid)
		tsk->xstate[offset] = pasid;
}

so we don't dirty the cache line in the common case.  The actual
generated code ought to be pretty good -- surely the offset of PASID
in XSTATE is an entry in an array somewhere that can be found with a
single read, right?

The READ_ONCE is because this could race against a write to
context.pasid, so this code needs to be at the end of the function,
where it's protected by mm_cpumask.

With all this done, the pasid update is just
on_each_cpu_mask(mm_cpumask(mm), load_mm_pasid, mm, true).  This
looks like maybe 20 lines of code.  As an added bonus, it lets us
free PASIDs early if we ever decide we want to.
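Concretely, the whole update path might look something like the
following.  This is a rough, uncompiled sketch of the approach just
described, not the patch under review; update_mm_pasid(),
mm->context.pasid, and task_pasid_slot() are names invented here for
illustration, not existing kernel interfaces:

static void load_mm_pasid(void *info)
{
	struct mm_struct *mm = info;
	u32 pasid = READ_ONCE(mm->context.pasid);
	u32 *slot = task_pasid_slot(current);	/* PASID's slot in xstate */

	/* Only dirty the cache line if the value actually changed. */
	if (*slot != pasid)
		*slot = pasid;
}

static void update_mm_pasid(struct mm_struct *mm, u32 pasid)
{
	WRITE_ONCE(mm->context.pasid, pasid);

	/*
	 * Push the new value into every thread of this mm that is
	 * currently running; threads that are switched out pick it up
	 * in switch_mm_irqs_off() as described above.
	 */
	on_each_cpu_mask(mm_cpumask(mm), load_mm_pasid, mm, true);
}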
May I take this opportunity to ask Intel to please put some real
thought into future pieces of CPU state?  Here's a summary of some
things we have:

 - Normal extended state (FPU, XMM, etc): genuinely per-thread and
   only ever used explicitly.  Actually makes sense with XSAVE(C/S).

 - PKRU: affects CPL0-originated memory accesses, so it needs to be
   eagerly loaded in the kernel.  Does not make sense with
   XRSTOR(C/S), but it's in there anyway.

 - CR3: per-mm state.  Makes some sense in CR3, but it's tangled up
   with CR4 in nasty ways.

 - LDTR: per-mm on Linux and mostly obsolete everywhere.  In its own
   register, so it's not a big deal.

 - PASID: per-mm state (surely Intel always intended it to be per-mm,
   since it's for shared _virtual memory_!).  But for some reason
   it's in an MSR (which is slow), and it's cleverly, but not that
   cleverly, accessible with XSAVES/XRSTORS.  Doesn't actually make
   sense.  Also, PASID is lazy-loadable, but the mechanism for
   telling the kernel that a lazy load is needed got flubbed.

 - TILE: genuinely per-thread, but it's expensive, so it's
   lazy-loadable.  But the lazy-load mechanism reuses #NM, and it's
   not fully disambiguated from the other use of #NM.  So it sort of
   works, but it's gross.

 - KERNEL_GS_BASE, i.e. the shadow GS base: logically per-user-thread
   state, but it's only accessible in MSRs.  For some reason it is
   *not* in XSAVES/XRSTORS state, nor is there any efficient way to
   access it at all.

 - Segment registers: can't be properly saved except by hypervisors,
   and can almost, but not quite, be properly loaded (assuming the
   state was sane to begin with) by normal kernels.  Just don't try
   to load 1, 2, or 3 into any of them.

Sometimes I think that this is all intended to be as confusing as
possible and that it's all a ploy to keep context switches slow and
complicated.  Maybe Intel doesn't actually want to compete with other
architectures that can context switch quickly?

It would be really nice if we had a clean way to load per-mm state
(see my private emails about this), a clean way to load CPL3 register
state, and a clean way to load per-user-thread *kernel* register
state (e.g. PKRU and probably PKRS).  And there should be an
exception that says "user code accessed a lazy-loaded resource that
isn't loaded, and this is the resource it tried to access".
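To make that last wish concrete, here is a sketch of what the handler
for such an exception could look like.  No such exception exists on
any shipping hardware; the exc_lazy_load() entry point, the
lazy_resource enum, and the error-code layout are all invented here
purely to show the shape of the idea:

/* Hypothetical lazy-load fault: the error code names the resource. */
enum lazy_resource {
	LAZY_PASID,
	LAZY_TILE,
};

void exc_lazy_load(struct pt_regs *regs, unsigned long error_code)
{
	switch (error_code) {
	case LAZY_PASID:
		/* Load current->mm's PASID into this task's xstate. */
		break;
	case LAZY_TILE:
		/* Allocate and initialize this task's TILE state. */
		break;
	default:
		/* Unknown resource: the task did something bogus. */
		force_sig(SIGSEGV);
	}
}

The point is that the hardware, not heuristics in a shared fault
handler, tells the kernel exactly which lazy resource to load --
unlike the overloaded #NM situation described above.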