Received: by 2002:a6b:500f:0:0:0:0:0 with SMTP id e15csp4634773iob; Sun, 8 May 2022 20:02:46 -0700 (PDT) X-Google-Smtp-Source: ABdhPJw+hFyvMxlHBeIy7e2amv9UMFYDJTbfKfWIGr9S2tZM/N8pD+zCUkc6mycIhI7zPrQ8ZjG8 X-Received: by 2002:a05:6a00:cc4:b0:50d:e9db:6145 with SMTP id b4-20020a056a000cc400b0050de9db6145mr14071378pfv.56.1652065365939; Sun, 08 May 2022 20:02:45 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1652065365; cv=none; d=google.com; s=arc-20160816; b=LfTXFod31eVz59xl4ogOuL8GMIsuVZhPDillyM5ywd5zoLi+JaJ2ixdK76CoLNg1dB BuMk10+LFy1PgHaUPyRm5aNgm72h8Cv+p9PUEeqTcCanxRP3DG9euukq1y5rDEycKcDi lkzKcB2s/rRNAa5i8zfQybRItdnsvsaZjcXfXVPXbvtuZ133IQ4s6V9CiBvx/JRxuSeC TBDyWvS2nh92cBDAYYugYCfL8pIEDfx1cbKQqVG145cVIW3bR0Yc3C9kD2oTX4IbmGlc ARwzf4JDp7ecXNJJH/mQ8Aqa/GL3PjqKz+BSFz1ERtdrktgqnzS8FEmfzYRiI6520u1U ocJg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:mime-version:message-id:date:references :in-reply-to:subject:cc:to:dkim-signature:dkim-signature:from; bh=B2VY1pBCHecUOk5t/5XaUHPp27OaBx7YYYmupaRwS+A=; b=QRbf5suvvVYhxBWcHU59195B1xc8Vbf051DRN0fZKE8IuA5Xu8v8HQL8OLXrvqGTTr vbgitiVxN+hTQpufnq/0EEW3p3EQ/OAzNUl2/zEttbL4Ium6aAzho+sOQZX5urwkvLEY tuDAWW4Mfy3462y6L6PTrhBCywdHVhuSGQSIe7LKQ+GUdRWgERvaHcAzvvZHSA6wqW/z fXJXnb90z8DuyrIAexOe+EVAic4m6UBEOf80gi3EvZvCtVsdz15kSto0Jpbmv7a5HHqD MdlDX88IY5+z6jww+qlMiQY7OymDHjXAUFyoxzFiFuzZ+vwggjL/MUeJz+iYiJyMae80 yIFg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linutronix.de header.s=2020 header.b=zdznyDCS; dkim=neutral (no key) header.i=@linutronix.de header.s=2020e header.b="uDUwSXu/"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de Return-Path: Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net. [2620:137:e000::1:18]) by mx.google.com with ESMTPS id d1-20020a636801000000b003c600e703edsi12982007pgc.257.2022.05.08.20.02.45 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sun, 08 May 2022 20:02:45 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) client-ip=2620:137:e000::1:18; Authentication-Results: mx.google.com; dkim=pass header.i=@linutronix.de header.s=2020 header.b=zdznyDCS; dkim=neutral (no key) header.i=@linutronix.de header.s=2020e header.b="uDUwSXu/"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id DB64A49CB8; Sun, 8 May 2022 20:01:14 -0700 (PDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1349478AbiEDMdO (ORCPT + 99 others); Wed, 4 May 2022 08:33:14 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43002 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1347936AbiEDMdM (ORCPT ); Wed, 4 May 2022 08:33:12 -0400 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 83F2F24BE4; Wed, 4 May 2022 05:29:35 -0700 (PDT) From: Thomas Gleixner DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1651667373; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=B2VY1pBCHecUOk5t/5XaUHPp27OaBx7YYYmupaRwS+A=; b=zdznyDCSDehnyX/b0QFmsj2JVN5/n99CsO5EFO+/CCG/+E1TjEQ2bgcdGG0deAKbf3xEJS uOZwf/IxxHbe7wV8Q1AFizSFsFLv925fZnu7sd2gJqn0HSI5uH9kA/QW4PbPOC3Z4ifO5S YbL2LSMILOw3J4wQDfKuUEND7wy3N6bHhSwZO8n03inujIDA8/ska9h4W90rFoahqgyEnR kHl+4YHzuHp3tGYN79MWlrffO53Wh1mDX07L9fvHunCFh+YOqRBRYph4ZB1SswZqMwXBTA 9MIK2CJ4FtpoYWxO9lYtnXeE82a3tTb9YVQomsk2mFpqPLZ0ysTEo3g3qkzvow== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1651667373; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=B2VY1pBCHecUOk5t/5XaUHPp27OaBx7YYYmupaRwS+A=; b=uDUwSXu/Lh+hMmMAi1uwqvrO9ZBzsl5U9wICOYYSbVL1NYk1IAG604Z/MquO+3gwwn2t8P UsmEFtu+Jq+yETDA== To: Tony Luck , hdegoede@redhat.com, markgross@kernel.org Cc: mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, corbet@lwn.net, gregkh@linuxfoundation.org, andriy.shevchenko@linux.intel.com, jithu.joseph@intel.com, ashok.raj@intel.com, tony.luck@intel.com, rostedt@goodmis.org, dan.j.williams@intel.com, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, platform-driver-x86@vger.kernel.org, patches@lists.linux.dev, ravi.v.shankar@intel.com Subject: Re: [PATCH v5 07/10] platform/x86/intel/ifs: Add scan test support In-Reply-To: <20220428153849.295779-8-tony.luck@intel.com> References: <20220422200219.2843823-1-tony.luck@intel.com> <20220428153849.295779-1-tony.luck@intel.com> <20220428153849.295779-8-tony.luck@intel.com> Date: Wed, 04 May 2022 14:29:33 +0200 Message-ID: <87r159jxaq.ffs@tglx> MIME-Version: 1.0 Content-Type: text/plain X-Spam-Status: No, score=-2.0 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,RDNS_NONE,SPF_HELO_NONE,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Apr 28 2022 at 08:38, Tony Luck wrote: > +static bool wait_for_siblings(struct device *dev, struct ifs_data *ifsd, atomic_t *t, long long timeout) > +{ > + atomic_inc(t); > + while (atomic_read(t) < cpu_sibl_ct) { > + if (timeout < SPINUNIT) { > + dev_err(dev, > + "Timeout while waiting for CPUs rendezvous, remaining: %d\n", > + cpu_sibl_ct - atomic_read(t)); > + return false; > + } > + > + ndelay(SPINUNIT); > + timeout -= SPINUNIT; > + > + touch_nmi_watchdog(); > + } > + > + return true; > +} > + > +/* > + * When a Scan test (for a particular core) is triggered by the user, worker threads > + * for each sibling cpus(belonging to that core) are queued to execute this function in > + * the Workqueue (ifs_wq) context. > + * Wait for the sibling thread to join before the execution. > + * Execute the scan test by running wrmsr(MSR_ACTIVATE_SCAN). > + */ > +static void ifs_work_func(struct work_struct *work) > +{ > + struct ifs_work *local_work = container_of(work, struct ifs_work, w); > + int cpu = smp_processor_id(); > + union ifs_scan activate; > + union ifs_status status; > + unsigned long timeout; > + struct ifs_data *ifsd; > + struct device *dev; > + int retries; > + u32 first; > + > + dev = local_work->dev; > + ifsd = ifs_get_data(dev); > + > + activate.rsvd = 0; > + activate.delay = msec_to_tsc(THREAD_WAIT); > + activate.sigmce = 0; > + > + /* > + * Need to get (and keep) the threads on this core executing close together > + * so that the writes to MSR_ACTIVATE_SCAN below will succeed in entering > + * IFS test mode on this core. Interrupts on each thread are expected to be > + * brief. But preemption would be a problem. > + */ > + preempt_disable(); > + > + /* wait for the sibling threads to join */ > + first = cpumask_first(topology_sibling_cpumask(cpu)); > + if (!wait_for_siblings(dev, ifsd, &siblings_in, NSEC_PER_SEC)) { Waiting for a second with preemption disabled? Seriously? > + preempt_enable(); > + dev_err(dev, "cpu %d sibling did not join rendezvous\n", cpu); > + goto out; > + } > + > + activate.start = 0; > + activate.stop = ifsd->valid_chunks - 1; > + timeout = jiffies + HZ / 2; Plus another half a second with preemption disabled. That's just insane. > + retries = MAX_IFS_RETRIES; > + > + while (activate.start <= activate.stop) { > + if (time_after(jiffies, timeout)) { > + status.error_code = IFS_SW_TIMEOUT; > + break; > + } > + > + local_irq_disable(); > + wrmsrl(MSR_ACTIVATE_SCAN, activate.data); > + local_irq_enable(); That local_irq_disable() solves what? > + /* > + * All logical CPUs on this core are now running IFS test. When it completes > + * execution or is interrupted, the following RDMSR gets the scan status. > + */ > + > + rdmsrl(MSR_SCAN_STATUS, status.data); Wait. Is that rdmsrl() blocking execution until the scan completes? If so, what's the stall time here? If not, how is the logic below supposed to work? > + /* Some cases can be retried, give up for others */ > + if (!can_restart(status)) > + break; > + > + if (status.chunk_num == activate.start) { > + /* Check for forward progress */ > + if (retries-- == 0) { > + if (status.error_code == IFS_NO_ERROR) > + status.error_code = IFS_SW_PARTIAL_COMPLETION; > + break; > + } > + } else { > + retries = MAX_IFS_RETRIES; > + activate.start = status.chunk_num; > + } > + } > + > + preempt_enable(); > + > + if (cpu == first) { > + /* Update status for this core */ > + ifsd->scan_details = status.data; > + > + if (status.control_error || status.signature_error) { > + ifsd->status = SCAN_TEST_FAIL; > + message_fail(dev, cpu, status); > + } else if (status.error_code) { > + ifsd->status = SCAN_NOT_TESTED; > + message_not_tested(dev, cpu, status); > + } else { > + ifsd->status = SCAN_TEST_PASS; > + } > + } > + > + if (!wait_for_siblings(dev, ifsd, &siblings_out, NSEC_PER_SEC)) > + dev_err(dev, "cpu %d sibling did not exit rendezvous\n", cpu); > + > +out: > + if (cpu == first) > + complete(&test_thread_done); > +} > + > +/* > + * Initiate per core test. It wakes up work queue threads on the target cpu and > + * its sibling cpu. Once all sibling threads wake up, the scan test gets executed and > + * wait for all sibling threads to finish the scan test. > + */ > +int do_core_test(int cpu, struct device *dev) > +{ > + struct ifs_work *local_work; > + int sibling; > + int ret = 0; > + int i = 0; > + > + if (!scan_enabled) > + return -ENXIO; > + > + cpu_hotplug_disable(); Why cpu_hotplug_disable()? Why is cpus_read_lock() not sufficient here? > + if (!cpu_online(cpu)) { > + dev_info(dev, "cannot test on the offline cpu %d\n", cpu); > + ret = -EINVAL; > + goto out; > + } > + > + reinit_completion(&test_thread_done); > + atomic_set(&siblings_in, 0); > + atomic_set(&siblings_out, 0); > + > + cpu_sibl_ct = cpumask_weight(topology_sibling_cpumask(cpu)); > + local_work = kcalloc(cpu_sibl_ct, sizeof(*local_work), GFP_NOWAIT); Why does this need GFP_NOWAIT? > +int ifs_setup_wq(void) > +{ > + /* Flags are to keep all the sibling cpu worker threads (of a core) in close sync */ I put that into the wishful thinking realm. Is there anywhere a proper specification of this mechanism? The public available MSR list in the SDM is uselss. Without proper documentation it's pretty much impossible to review this code and to think about the approach. Thanks, tglx