Date: Mon, 17 Jun 2019 17:00:14 -0700
From: Fenghua Yu
To: Andy Lutomirski
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H Peter Anvin, Ashok Raj, Tony Luck, Ravi V Shankar, linux-kernel, x86
Subject: Re: [PATCH v4 3/5] x86/umwait: Add sysfs interface to control umwait C0.2 state
Message-ID: <20190618000014.GH217081@romley-ivt3.sc.intel.com>
References: <1559944837-149589-4-git-send-email-fenghua.yu@intel.com> <20190610035302.GA162238@romley-ivt3.sc.intel.com> <20190610060234.GD162238@romley-ivt3.sc.intel.com> <20190617202702.GB217081@romley-ivt3.sc.intel.com> <20190617231104.GF217081@romley-ivt3.sc.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Jun 17, 2019 at 04:41:38PM -0700, Andy Lutomirski wrote:
> On Mon, Jun 17, 2019 at 4:20 PM Fenghua Yu wrote:
> >
> > On Mon, Jun 17, 2019 at 04:02:50PM -0700, Andy Lutomirski wrote:
> > > On Mon, Jun 17, 2019 at 1:36 PM Fenghua Yu wrote:
> > > >
> > > > On Mon, Jun 10, 2019 at 06:41:31AM -0700, Andy Lutomirski wrote:
> > > > >
> > > > >
> > > > > > On Jun 9, 2019, at 11:02 PM, Fenghua Yu wrote:
> > > > > >
> > > > > >> On Sun, Jun 09, 2019 at 09:24:18PM -0700, Andy Lutomirski wrote:
> > > > > >>> On Sun, Jun 9, 2019 at 9:02 PM Fenghua Yu wrote:
> > > > > >>>
> > > > > >>>> On Sat, Jun 08, 2019 at 03:50:32PM -0700, Andy Lutomirski wrote:
> > > > > >>>>> On Fri, Jun 7, 2019 at 3:10 PM Fenghua Yu wrote:
> > > > > >>>>>
> > > > > >>>>> C0.2 state in umwait and tpause instructions can be enabled or disabled
> > > > > >>>>> on a processor through IA32_UMWAIT_CONTROL MSR register.
> > > > > >>>>>
> > > > > >>>>> By default, C0.2 is enabled and the user wait instructions result in
> > > > > >>>>> lower power consumption with slower wakeup time.
> > > > > >>>>>
> > > > > >>>>> But in real time systems which require faster wakeup time although power
> > > > > >>>>> savings could be smaller, the administrator needs to disable C0.2 and all
> > > > > >>>>> C0.2 requests from user applications revert to C0.1.
> > > > > >>>>>
> > > > > >>>>> A sysfs interface "/sys/devices/system/cpu/umwait_control/enable_c02" is
> > > > > >>>>> created to allow the administrator to control C0.2 state during run time.
> > > > > >>>>
> > > > > >>>> This looks better than the previous version.  I think the locking is
> > > > > >>>> still rather confused.  You have a mutex that you hold while changing
> > > > > >>>> the value, which is entirely reasonable.  But, of the code paths that
> > > > > >>>> write the MSR, only one takes the mutex.
> > > > > >>>>
> > > > > >>>> I think you should consider making a function that just does:
> > > > > >>>>
> > > > > >>>>     wrmsr(MSR_IA32_UMWAIT_CONTROL, READ_ONCE(umwait_control_cached), 0);
> > > > > >>>>
> > > > > >>>> and using it in all the places that update the MSR.  The only thing
> > > > > >>>> that should need the lock is the sysfs code to avoid accidentally
> > > > > >>>> corrupting the value, but that code should also use WRITE_ONCE to do
> > > > > >>>> its update.
> > > > > >>>
> > > > > >>> Based on the comment, the illustrative CPU online and enable_c02 store
> > > > > >>> functions would be:
> > > > > >>>
> > > > > >>> umwait_cpu_online()
> > > > > >>> {
> > > > > >>>     wrmsr(MSR_IA32_UMWAIT_CONTROL, READ_ONCE(umwait_control_cached), 0);
> > > > > >>>     return 0;
> > > > > >>> }
> > > > > >>>
> > > > > >>> enable_c02_store()
> > > > > >>> {
> > > > > >>>     mutex_lock(&umwait_lock);
> > > > > >>>     umwait_control_c02 = (u32)!c02_enabled;
> > > > > >>>     WRITE_ONCE(umwait_control_cached, 2 | get_umwait_control_max_time());
> > > > > >>>     on_each_cpu(umwait_control_msr_update, NULL, 1);
> > > > > >>>     mutex_unlock(&umwait_lock);
> > > > > >>> }
> > > > > >>>
> > > > > >>> Then suppose umwait_control_cached = 100000 initially and only CPU0 is
> > > > > >>> running. Admin change bit 0 in MSR from 0 to 1 to disable C0.2 and is
> > > > > >>> onlining CPU1 in the same time:
> > > > > >>>
> > > > > >>> 1. On CPU1, read umwait_control_cached to eax as 100000 in
> > > > > >>>    umwait_cpu_online()
> > > > > >>> 2. On CPU0, write 100001 to umwait_control_cached in enable_c02_store()
> > > > > >>> 3. On CPU1, wrmsr with eax=100000 in umwaint_cpu_online()
> > > > > >>> 4. On CPU0, wrmsr with 100001 in enabled_c02_store()
> > > > > >>>
> > > > > >>> The result is CPU0 and CPU1 have different MSR values.
> > > > > >>
> > > > > >> Yes, but only transiently, because you didn't finish your example.
> > > > > >>
> > > > > >> Step 5: enable_c02_store() does on_each_cpu(), and CPU 1 gets updated.
> > > > > >
> > > > > > There is no sync on wrmsr on CPU0 and CPU1.
> > > > >
> > > > > What do you mean by sync?
> > > > >
> > > > > > So a better sequence to
> > > > > > describe the problem is changing the order of wrmsr:
> > > > > >
> > > > > > 1. On CPU1, read umwait_control_cached to eax as 100000 in
> > > > > >    umwait_cpu_online()
> > > > > > 2. On CPU0, write 100001 to umwait_control_cached in enable_c02_store()
> > > > > > 3. On CPU0, wrmsr with 100001 in on_each_cpu() in enabled_c02_store()
> > > > > > 4. On CPU1, wrmsr with eax=100000 in umwaint_cpu_online()
> > > > > >
> > > > > > So CPU1 and CPU0 have different MSR values. This won't be transient.
> > > > >
> > > > > You are still ignoring the wrmsr on CPU1 due to on_each_cpu().
> > > > >
> > > >
> > > > Initially umwait_control_cached is 100000 and CPU0 is online while CPU1
> > > > is going to be online:
> > > >
> > > > 1. On CPU1, cpu_online_mask=0x3 in start_secondary()
> > > > 2. On CPU1, read umwait_control_cached to eax as 100000 in umwait_cpu_online()
> > > > 3. On CPU0, write 100001 to umwait_control_cached in enable_c02_store()
> > > > 4. On CPU0, execute one_each_cpu() in enabled_c02_store():
> > > >        wrmsr with 100001 on CPU0
> > > >        wrmsr with 100001 on CPU1
> > > > 5. On CPU1, wrmsr with eax=100000 in umwaint_cpu_online()
> > > >
> > > > So the MSR is 100000 on CPU1 and 100001 on CPU0. The MSRs are different on
> > > > the CPUs.
> > > >
> > > > Is this a right sequence to demonstrate locking issue without the mutex
> > > > locking?
> > > >
> > > >
> > > Fair enough.  I would fix it differently, though:
> > >
> > > static void update_this_cpu_umwait_msr(void)
> > > {
> > >     WARN_ON_ONCE(!irqs_disabled());  /* or local_irq_save() */
> > >
> > >     /* We need to prevent umwait_control from being changed *and*
> > >        completing its WRMSR between our read and our WRMSR.
> > >        By turning IRQs off here, we ensure that no sysfs write happens
> > >        on this CPU and we also make sure that any concurrent sysfs write
> > >        from a different CPU will not finish updating us via IPI until
> > >        we're done. */
> > >     wrmsrl(MSR_..., READ_ONCE(umwait_control), 0);
> > > }
> >
> > If no other objections, then I will keep the current mutex lock/unlock to
> > protect wrmsr and the umwait_control_cached variable.
> >
>
> I don't think that's sufficient.  In your current code, you hold the
> mutex in some places and not in others, and there's no explanation.

The mutex is used in sysfs writing and CPU online. But it's not used in
syscore resume because only the BP is running during syscore resume.

> And I think you're relying on the IRQs-off protection in at least one
> code path already, so you're not gaining any simplicity.

I don't rely on IRQs-off protection. I only use the mutex for protection.

> At the very
> least, you need to add some extensive comments everywhere if you want
> to keep the mutex,

I have a comment explaining why mutex protection is not needed in syscore
resume. But I can add more comments on the locking.

> but I think it's simpler and clearer if you just
> use the same logic everywhere, for example, as I proposed above.

But using irqs_disabled() before wrmsr() and READ_ONCE/WRITE_ONCE for
umwait_control_cached is not sufficient on its own. The mutex is still
needed to protect sysfs writing, is that right? Without the mutex,
on_each_cpu() can write different values on different CPUs, right?

If IRQ disabling, READ_ONCE/WRITE_ONCE, and the mutex are all used for
protection, isn't that more complex than just using the mutex?

Thanks.

-Fenghua
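
The five-step interleaving discussed in this thread can be replayed deterministically in user space. The sketch below is only a model, not kernel code: `cached`, `msr[]`, `reset_model()`, `replay_unprotected()` and `replay_irqs_off()` are hypothetical names introduced for illustration, and plain assignments stand in for WRMSR and for the IPI sent by on_each_cpu(). It assumes, as proposed above, that a CPU with IRQs off cannot run the IPI handler until after its own write.

```c
#include <stdint.h>

/* Hypothetical user-space model; none of these names are kernel API. */
static uint32_t cached;   /* stands in for umwait_control_cached */
static uint32_t msr[2];   /* stands in for the MSR on CPU0 and CPU1 */

static void reset_model(void)
{
	cached = 100000;
	msr[0] = 100000;  /* CPU0 is already online */
	msr[1] = 0;       /* CPU1 has not written its MSR yet */
}

/* Unprotected order from the thread; returns 1 if both MSRs end up equal. */
int replay_unprotected(void)
{
	uint32_t eax;

	reset_model();
	eax = cached;     /* step 2: CPU1 reads the cache in its online path */
	cached = 100001;  /* step 3: CPU0 updates the cache in the sysfs store */
	msr[0] = cached;  /* step 4: on_each_cpu() writes CPU0 ... */
	msr[1] = cached;  /*         ... and CPU1 */
	msr[1] = eax;     /* step 5: CPU1's stale write lands last */
	return msr[0] == msr[1];
}

/* Same events, but CPU1 keeps IRQs off across its read and its write, so
 * the IPI-driven write on CPU1 can only run after the stale write. */
int replay_irqs_off(void)
{
	uint32_t eax;

	reset_model();
	eax = cached;     /* CPU1 reads the cache with IRQs off */
	cached = 100001;  /* CPU0 updates the cache */
	msr[0] = cached;  /* CPU0 writes its own MSR */
	msr[1] = eax;     /* CPU1 finishes its write, then re-enables IRQs */
	msr[1] = cached;  /* the deferred IPI now runs on CPU1 */
	return msr[0] == msr[1];
}
```

In this model replay_unprotected() returns 0 (CPU0 holds 100001, CPU1 holds 100000) and replay_irqs_off() returns 1. The model does not cover two concurrent sysfs writers, which is the case the mutex is meant to serialize.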