Received: by 2002:a05:6902:102b:0:0:0:0 with SMTP id x11csp1862107ybt; Mon, 15 Jun 2020 11:17:53 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxURC/VdghXm36Nm1yHNMBHZoUEbVvSEv9ChosvPA+qjWkH1gBABkYSTa+DLMd2Md8n4xQz X-Received: by 2002:a17:906:5283:: with SMTP id c3mr25211125ejm.22.1592245073503; Mon, 15 Jun 2020 11:17:53 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1592245073; cv=none; d=google.com; s=arc-20160816; b=z2chvFd3isJUNZuOWMcSSxJ4Jy0/Ajk1Gbf55HTo5CSpP2qvl8Mj9nLGPWuSz5xgR5 zLAzcTd946bpWfJO8mo3rd/Pf5Bw2CVen6tanFi30m27xpQ00XzadApotiiGI0CJ5S8B jKOVGpPLu1CcbBH6HyArAIWvzTgLHJQBuq93jQ6TdX+38gNxskDd2yb3+1LocGH8njbE u3JMEXSGn/5QG6hnRguNMbsMYz1GIgHWFeSi9pwVeD7op4ElJDz1enia03nWd8MSQnOn BwLOyLsiqpxjUL4j/+B2bhKHY/MPI2fcN8oOA4v+vJ7cLEwl8hSkEuqb8PGlJdmhmftO zp6Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:ironport-sdr:ironport-sdr; bh=YsyWeOqvWiia8ZC0KZGs4gnJZe9+oJrOwsWtsYr04BA=; b=H0Lhx0iXP86M2n2nroCt6pX0DETdxqk3GYmRYCgRYNaX553dnQl6zgIgAmikShNHCb Sc/kdHiR9/bH87n7EK+CkB6WKgXSUu7bl71sHRvjZtylHvm4309FqyzPPpPzFU8x4HF0 Lqf44ehPJa/AeXGKUGutBBB+URHV+4YMs1NIWKA3+Xq764iy7dCxs6qAAj68UeIr5JoI Al4JOGGv4mMNYD7KScLr56Rdmc3vb7pUPfUpIbrsfkSBW0nWivJwlPK+aYWEIF5Kf6ga qiXddpjqacz6vik6aH+eITtjDnsnI4tTnnLutY4admrlE9yJgipvBhPmiVt5hIXC3T6e en3A== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id np7si9275392ejb.377.2020.06.15.11.17.30; Mon, 15 Jun 2020 11:17:53 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731249AbgFOSNB (ORCPT + 99 others); Mon, 15 Jun 2020 14:13:01 -0400 Received: from mga03.intel.com ([134.134.136.65]:59194 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728585AbgFOSNA (ORCPT ); Mon, 15 Jun 2020 14:13:00 -0400 IronPort-SDR: Jte9bOxmer9kUGuo0EmSy6RkxpyBG2LipYcHXKvpJxSzbXfVvyxRRtZL1eDnA3jf7HHmrcYdJb w5LCv2AeHnlQ== X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga001.fm.intel.com ([10.253.24.23]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Jun 2020 11:13:00 -0700 IronPort-SDR: aGFdvIif39uNn6knFGFspXFXk4ZAtlyeBPnBtDASHBXf5oqChnPTpdU4q54bgMAhUzqoFoCZGM ev3SwJLIgHJQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.73,515,1583222400"; d="scan'208";a="382619746" Received: from romley-ivt3.sc.intel.com ([172.25.110.60]) by fmsmga001.fm.intel.com with ESMTP; 15 Jun 2020 11:12:59 -0700 Date: Mon, 15 Jun 2020 11:12:59 -0700 From: Fenghua Yu To: Peter Zijlstra Cc: Thomas Gleixner , Ingo Molnar , Borislav Petkov , H Peter Anvin , David Woodhouse , Lu Baolu , Frederic Barrat , Andrew Donnellan , Felix Kuehling , Joerg Roedel , Dave Hansen , Tony Luck , Ashok Raj , Jacob Jun Pan , Dave Jiang , Yu-cheng Yu , Sohil Mehta , Ravi V Shankar , linux-kernel , x86 , iommu@lists.linux-foundation.org, amd-gfx , linuxppc-dev Subject: Re: [PATCH v2 12/12] x86/traps: Fix up invalid PASID Message-ID: <20200615181259.GC13792@romley-ivt3.sc.intel.com> References: <1592008893-9388-1-git-send-email-fenghua.yu@intel.com> <1592008893-9388-13-git-send-email-fenghua.yu@intel.com> <20200615075649.GK2497@hirez.programming.kicks-ass.net> <20200615154854.GB13792@romley-ivt3.sc.intel.com> <20200615160357.GA2531@hirez.programming.kicks-ass.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200615160357.GA2531@hirez.programming.kicks-ass.net> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jun 15, 2020 at 06:03:57PM +0200, Peter Zijlstra wrote: > On Mon, Jun 15, 2020 at 08:48:54AM -0700, Fenghua Yu wrote: > > Hi, Peter, > > On Mon, Jun 15, 2020 at 09:56:49AM +0200, Peter Zijlstra wrote: > > > On Fri, Jun 12, 2020 at 05:41:33PM -0700, Fenghua Yu wrote: > > > > +/* > > > > + * Apply some heuristics to see if the #GP fault was caused by a thread > > > > + * that hasn't had the IA32_PASID MSR initialized. If it looks like that > > > > + * is the problem, try initializing the IA32_PASID MSR. If the heuristic > > > > + * guesses incorrectly, take one more #GP fault. > > > > > > How is that going to help? Aren't we then going to run this same > > > heuristic again and again and again? > > > > The heuristic always initializes the MSR with the per mm PASID IIF the > > mm has a valid PASID but the MSR doesn't have one. This heuristic usually > > happens only once on the first #GP in a thread. > > But it doesn't guarantee the PASID is the right one. Suppose both the mm > has a PASID and the MSR has a VALID one, but the MSR isn't the mm one. > Then we NO-OP. So if the exception was due to us having the wrong PASID, > we stuck. The MSR for each thread was cleared during fork() and clone(). The PASID was cleared during mm_init(). The per-mm PASID was assigned when fist bind_mm() is called and remains the same value until process exit(). The MSR is only fixed up when the first ENQCMD is executed in a thread: bit 31 in the MSR is 0 and the PASID in the mm is non-zero. The MSR remains the same PASID value once it's fixed up until the thread exits. So the work flow ensures the PASID goes from mm's PASID to the MSR. The PASID could be unbund from the mm. In this case, iommu will generate #PF and kernel oops instead of #GP. > > > If the next #GP still comes in, the heuristic finds out the MSR already > > has a valid PASID and thus will not fixup the MSR any more. The fixup() > > returns "false" and lets others to handle the #GP. > > > > So the heuristic will be executed once (at most) and won't be executed > > again and again. > > So I get that you want to set the PASID on-demand, but I find this all > really weird code to make that happen. We could keep PASID same in all threads sychronously by propogating the MSRs when the PASID is bound to the mm via IPIs or taskworks to all threads in the process. But the code is complex and error-prone and overhead could be high: 1. The user can call driver to do binding and unbinding multiple times. The IPIs or taskworks will be sent multiple times to make sure only the last IPIs or taskworks take action. 2. Even if a thread never executes ENQCMD and thus never uses the MSR, the MSR still needs to be updated whenever bind_mm() and needs to be context switched each time. This could cause high overhead. Setting the PASID on-demand is simpler and cleaner and was recommended by Thomas. > > > > > +bool __fixup_pasid_exception(void) > > > > +{ > > > > + u64 pasid_msr; > > > > + unsigned int pasid; > > > > + > > > > + /* > > > > + * This function is called only when this #GP was triggered from user > > > > + * space. So the mm cannot be NULL. > > > > + */ > > > > + pasid = current->mm->pasid; > > > > + /* If the mm doesn't have a valid PASID, then can't help. */ > > > > + if (invalid_pasid(pasid)) > > > > + return false; > > > > + > > > > + /* > > > > + * Since IRQ is disabled now, the current task still owns the FPU on > > > > > > That's just weird and confusing. What you want to say is that you rely > > > on the exception disabling the interrupt. > > > > I checked SDM again. You are right. #GP can be interrupted by machine check > > or other interrupts. So I cannot assume the current task still owns the FPU. > > Instead of directly rdmsr() and wrmsr(), I will add helpers that can access > > either the MSR on the processor or the PASID state in the memory. > > That's not in fact what I meant, but yes, you can take exceptions while > !IF just fine. > > > > > + * this CPU and the PASID MSR can be directly accessed. > > > > + * > > > > + * If the MSR has a valid PASID, the #GP must be for some other reason. > > > > + * > > > > + * If rdmsr() is really a performance issue, a TIF_ flag may be > > > > + * added to check if the thread has a valid PASID instead of rdmsr(). > > > > > > I don't understand any of this. Nobody except us writes to this MSR, we > > > should bloody well know what's in it. What gives? > > > > Patch 4 describes how to manage the MSR and patch 7 describes the format > > of the MSR (20-bit PASID value and bit 31 is valid bit). > > > > Are they sufficient to help? Or do you mean something else? > > I don't get why you need a rdmsr here, or why not having one would > require a TIF flag. Is that because this MSR is XSAVE/XRSTOR managed? My concern is TIF flags are precious (only 3 slots available). Defining a new TIF flag may be not worth it while rdmsr() can check if PASID is valid in the MSR. And performance here might not be a big issue in #GP. But if you think using TIF flag is better, I can define a new TIF flag and maintain it per thread (init 0 when clone()/fork(), set 1 in fixup()). Then we can avoid using rdmsr() to check valid PASID in the MSR. > > > > > + */ > > > > + rdmsrl(MSR_IA32_PASID, pasid_msr); > > > > + if (pasid_msr & MSR_IA32_PASID_VALID) > > > > + return false; > > > > + > > > > + /* Fix up the MSR if the MSR doesn't have a valid PASID. */ > > > > + wrmsrl(MSR_IA32_PASID, pasid | MSR_IA32_PASID_VALID); > > How much more expensive is the wrmsr over the rdmsr? Can't we just > unconditionally write the current PASID and call it a day? rdmsr() and wrmsr() might have same cost. If using a TIF flag to check if the thread has a valid PASID MSR, then I can replace rdmsr() by checking the TIF flag and only do wrmsr() when the TIF flag is set. Please advice. Thanks. -Fenghua