From: Dan Williams
Date: Tue, 15 Mar 2022 09:04:20 -0700
Subject: Re: [RFC 00/10] Introduce In Field Scan driver
To: Greg KH
Cc: "Luck, Tony", "Joseph, Jithu", "hdegoede@redhat.com",
 "markgross@kernel.org", "tglx@linutronix.de", "mingo@redhat.com",
 "bp@alien8.de", "dave.hansen@linux.intel.com", "x86@kernel.org",
 "hpa@zytor.com", "corbet@lwn.net", "andriy.shevchenko@linux.intel.com",
 "Raj, Ashok", "rostedt@goodmis.org", "linux-kernel@vger.kernel.org",
 "linux-doc@vger.kernel.org", "platform-driver-x86@vger.kernel.org",
 "patches@lists.linux.dev", "Shankar, Ravi V"
References: <20220301195457.21152-1-jithu.joseph@intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Mar 15, 2022 at 8:27 AM Greg KH wrote:
>
> On Tue, Mar
> 15, 2022 at 02:59:03PM +0000, Luck, Tony wrote:
> > >> This seems a novel use of uevent ... is it OK, or is it abuse?
> > >
> > > Don't create "novel" uses of uevents. They are there to express a
> > > change in state of a device so that userspace can then go and do
> > > something with that information. If that pattern fits here, wonderful.
> >
> > Maybe Dan will chime in here to better explain his idea. I think for
> > the case where the core test fails, there is a good match with uevent.
> > The device (one CPU core) has changed state from "working" to
> > "untrustworthy". Userspace can do things like: take the logical CPUs
> > on that core offline, initiate a service call, or in a VMM cluster environment
> > migrate work to a different node.
>
> Again, I have no idea what you are doing at all with this driver, nor
> what you want to do with it.
>
> Start over please.
>
> What is the hardware you have to support?
>
> What is the expectation from userspace with regards to using the
> hardware?

Here is what I have learned about this driver since engaging on this
patch set.

Cores go bad at run time. Datacenters can detect them at scale. When I
worked at Facebook there was an epic story of debugging random user
login failures that resulted in the discovery of a marginal lot-number
of CPUs in a certain cluster. In that case the crypto instructions on a
few cores of those CPUs gave wrong answers. Whether that was an
electromigration effect, or just a marginal bin of CPUs, the only
detection method was A-B testing different clusters of CPUs to isolate
the differences.

This driver takes advantage of a CPU feature to inject a diagnostic
test, similar to what can be done via JTAG, to validate the
functionality of a given core on a CPU at a low level. The diagnostic is
run periodically since some failures may be sensitive to thermals while
other failures may be related to the lifetime of the CPU.
The result of the diagnostic is "here are 1 or more cores that may
miscalculate, stop using them and replace the CPU".

At a base level the ABI need only be something that conveys "core X
failed its last diagnostic". All the other details are just extra, and
in my opinion can be dropped save for maybe "core X was unable to run
the diagnostic".

The thought process that got me from the proposal on the table "extend
/sys/devices/system/cpu with per-cpu result state and other details" to
"emit uevents on each test completion" was the following:

- The complexity and maintenance burden of dynamically extending
  /sys/devices/system/cpu: Given that you identified a reference
  counting issue, I wondered why this was trying to use
  /sys/devices/system/cpu in the first instance.

- The result of the test is an event that kicks off remediation
  actions: When this fails a tech is paged to replace the CPU and in
  the meantime the system can either be taken offline, or if some of
  the cores are still good the workloads can be moved off of the bad
  cores to keep some capacity online until the replacement can be made.

- KOBJ_CHANGE uevents are already deployed in NVMe for AEN
  (Asynchronous Event Notifications): If the results of the test were
  conveyed only in sysfs then there would be a program that would
  scrape sysfs and turn around and fire an event for the downstream
  remediation actions. Uevent cuts to the chase and lets udev rule
  policy log, notify, and/or take pre-emptive CPU offline action.

The CPU state has changed after a test run. It has either changed to a
failed CPU, or it has changed to one that has recently asserted its
health.

> > > I doubt you can report "test results" via a uevent in a way that the
> > > current uevent states and messages would properly convey, but hey, maybe
> > > I'm wrong.
> >
> > But here things get a bit sketchy. Reporting "pass", or "didn't complete the test"
> > isn't a state change.
> > But it seems like a poor interface if there is no feedback
> > that the test was run. Using different methods to report pass/fail/incomplete
> > also seems user hostile.
>
> We have an in-kernel "test" framework. Yes, it's for kernel code, but
> why not extend that to also include hardware tests?

This is where my head was at when starting out with this, but this is
more of an asynchronous error reporting mechanism, like machine check or
PCIe AER, than a test. The only difference is that the error in this
case is reported only after first requesting an error check. So it is
more similar to something like a background patrol scrub that seeks out
latent ECC errors in memory.
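
For illustration only, here is a minimal userspace sketch of how a
KOBJ_CHANGE uevent carrying a test result might be consumed. It is not
code from the patch set. It assumes the usual uevent payload layout (an
"action@devpath" header followed by NUL-terminated KEY=value strings),
and the IFS_RESULT and IFS_CORE keys, the value strings, and the enum
names are all hypothetical; the RFC does not define them. On the kernel
side the event would presumably be raised with kobject_uevent_env() and
KOBJ_CHANGE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result classes; these names are not defined by the RFC. */
enum ifs_result { IFS_UNKNOWN, IFS_PASS, IFS_FAIL, IFS_UNTESTED };

/*
 * Find "key=value" in a uevent payload: a run of NUL-terminated strings
 * occupying buf[0..len). Returns a pointer to the value, or NULL if the
 * key is absent. An unterminated trailing string is ignored.
 */
static const char *uevent_get(const char *buf, size_t len, const char *key)
{
	size_t klen, off = 0;

	assert(buf != NULL && key != NULL);
	klen = strlen(key);

	while (off < len) {
		const char *s = buf + off;
		const char *end = memchr(s, '\0', len - off);
		size_t slen;

		if (end == NULL)
			break;
		slen = (size_t)(end - s);
		/* Match only whole keys: "KEY" immediately followed by '='. */
		if (slen > klen && s[klen] == '=' && memcmp(s, key, klen) == 0)
			return s + klen + 1;
		off += slen + 1;
	}
	return NULL;
}

/*
 * Map a payload to a result class. Only "change" events that carry the
 * hypothetical IFS_RESULT key are considered; anything else is unknown.
 */
static enum ifs_result ifs_classify(const char *buf, size_t len)
{
	const char *action = uevent_get(buf, len, "ACTION");
	const char *result = uevent_get(buf, len, "IFS_RESULT");

	if (action == NULL || result == NULL || strcmp(action, "change") != 0)
		return IFS_UNKNOWN;
	if (strcmp(result, "pass") == 0)
		return IFS_PASS;
	if (strcmp(result, "fail") == 0)
		return IFS_FAIL;
	if (strcmp(result, "untested") == 0)
		return IFS_UNTESTED;
	return IFS_UNKNOWN;
}
```

A monitoring daemon could feed each received payload to ifs_classify()
and, on IFS_FAIL, read the hypothetical IFS_CORE key to decide which
logical CPUs to take offline. In practice a udev rule matching on the
same keys may be all that is needed, which is the point of using a
uevent rather than scraping sysfs.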