Date: Fri, 29 Oct 2021 13:08:57 -0700
From: "Darrick J. Wong"
To: Pavel Begunkov
Cc: Dave Chinner, Christoph Hellwig, Jane Chu,
 "dan.j.williams@intel.com", "vishal.l.verma@intel.com",
 "dave.jiang@intel.com", "agk@redhat.com", "snitzer@redhat.com",
 "dm-devel@redhat.com", "ira.weiny@intel.com", "willy@infradead.org",
 "vgoyal@redhat.com", "linux-fsdevel@vger.kernel.org",
 "nvdimm@lists.linux.dev", "linux-kernel@vger.kernel.org",
 "linux-xfs@vger.kernel.org"
Subject: Re: [dm-devel] [PATCH 0/6] dax poison recovery with RWF_RECOVERY_DATA flag
Message-ID: <20211029200857.GD2237511@magnolia>
References: <2102a2e6-c543-2557-28a2-8b0bdc470855@oracle.com>
 <20211028002451.GB2237511@magnolia>
 <20211028225955.GA449541@dread.disaster.area>
 <22255117-52de-4b2d-822e-b4bc50bbc52b@gmail.com>
 <20211029165747.GC2237511@magnolia>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Oct 29, 2021 at 08:23:53PM +0100, Pavel Begunkov wrote:
> On 10/29/21 17:57, Darrick J. Wong wrote:
> > On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote:
> > > On 10/28/21 23:59, Dave Chinner wrote:
> > > [...]
> > > > > > Well, my point is doing recovery from bit errors is by definition not
> > > > > > the fast path. Which is why I'd rather keep it away from the pmem
> > > > > > read/write fast path, which also happens to be the (much more important)
> > > > > > non-pmem read/write path.
> > > > >
> > > > > The trouble is, we really /do/ want to be able to (re)write the failed
> > > > > area, and we probably want to try to read whatever we can. Those are
> > > > > reads and writes, not {pre,f}allocation activities. This is where Dave
> > > > > and I arrived at a month ago.
> > > > >
> > > > > Unless you'd be ok with a second IO path for recovery where we're
> > > > > allowed to be slow? That would probably have the same user interface
> > > > > flag, just a different path into the pmem driver.
> > > >
> > > > I just don't see how 4 single line branches to propagate RWF_RECOVERY
> > > > down to the hardware is in any way an imposition on the fast path.
> > > > It's no different for passing RWF_HIPRI down to the hardware *in the
> > > > fast path* so that the IO runs the hardware in polling mode because
> > > > it's faster for some hardware.
> > >
> > > Not particularly about this flag, but it is expensive. Surely looks
> > > cheap when it's just one feature, but there are dozens of them with
> > > limited applicability, default config kernels are already sluggish
> > > when it comes to really fast devices and it's not getting better.
> > > Also, pretty often every of them will add a bunch of extra checks
> > > to fix something of whatever it would be.
> >
> > So we can't have data recovery because moving fast is the only goal?
>
> That's not what was said and you missed the point, which was in
> the rest of the message.

...whatever point you were trying to make was so vague that it was
totally uninformative and I completely missed it.

What does "callbacks or bit masks" mean, then, specifically? How
*exactly* would you solve the problem that Jane is seeking to solve by
using callbacks?

Actually, you know what? I'm so fed up with every single DAX
conversation turning into a ****storm of people saying NO NO NO NO NO NO
NO NO to everything proposed that I'm actually going to respond to
whatever I think your point is, and you can defend whatever I come up
with.

> >
> > That's so meta.
> >
> > --D
> >
> > > So let's add a bit of pragmatism to the picture, if there is just one
> > > user of a feature but it adds overhead for millions of machines that
> > > won't ever use it, it's expensive.

Errors are infrequent, and since everything is cloud-based and
disposable now, we can replace error handling with BUG_ON(). This will
reduce code complexity, which will reduce code size, and improve icache
usage. Win!

> > > This one doesn't spill yet into paths I care about,

...so you sail in and say 'no' even though you don't yet care...

> > > but in general
> > > it'd be great if we start thinking more about such stuff instead of
> > > throwing yet another if into the path, e.g. by shifting the overhead
> > > from linear to a constant for cases that don't use it, for instance
> > > with callbacks

Ok so after userspace calls into pread to access a DAX file, hits the
poisoned memory line and the machinecheck fires, what then? I guess we
just have to figure out how to get from the MCA handler (assuming the
machine doesn't just reboot instantly) all the way back into memcpy?
Ok, you're in charge of figuring that out because I don't know how to do
that.

Notably, RWF_RECOVERY_DATA is the flag that we're calling *from* a
callback that happens after the memory controller realizes it's lost
something, kicks a notification to the OS kernel through ACPI, and the
kernel signals userspace to do something about it. Yeah, that's dumb
since spinning rust already does all this for us, but that's pmem.

> > > or bit masks.

WTF does this even mean?

--D

> > >
> > > > IOWs, saying that we shouldn't implement RWF_RECOVERY because it
> > > > adds a handful of branches the fast path is like saying that we
> > > > shouldn't implement RWF_HIPRI because it slows down the fast path
> > > > for non-polled IO....
> > > >
> > > > Just factor the actual recovery operations out into a separate
> > > > function like:
> > >
> > > --
> > > Pavel Begunkov
>
> --
> Pavel Begunkov
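
For readers following the thread, here is a minimal userspace sketch of
the shape Dave describes in the quoted text: the fast path tests the
flags once, and the recovery work lives in a separate out-of-line
function. It is not the code elided after "function like:" and it is not
kernel code. Every name (demo_copy, demo_copy_recovery, DEMO_RWF_*,
struct demo_region) and the per-byte poison map are assumptions made
purely for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values for this demo; NOT the kernel's RWF_* values. */
#define DEMO_RWF_HIPRI     (1u << 0)
#define DEMO_RWF_RECOVERY  (1u << 1)

/* Every rarely used flag that needs out-of-line handling goes in this mask,
 * so the fast path pays for one test no matter how many such flags exist. */
#define DEMO_RWF_SLOW_MASK (DEMO_RWF_RECOVERY)

/* Simulated media: a nonzero poison[i] marks data[i] as unreadable. */
struct demo_region {
    const unsigned char *data;
    const unsigned char *poison;
    size_t len;
};

/* Slow path: copy byte by byte and stop at the first poisoned byte.
 * Returns the number of bytes copied before the poison was hit. */
static size_t demo_copy_recovery(unsigned char *dst,
                                 const struct demo_region *src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (src->poison[i])
            break;
        dst[i] = src->data[i];
    }
    return i;
}

/* Fast path: a single branch on the mask, then a plain memcpy.
 * Returns the number of bytes copied. */
static size_t demo_copy(unsigned char *dst, const struct demo_region *src,
                        size_t n, unsigned int flags)
{
    if (n > src->len)
        n = src->len;

    if (flags & DEMO_RWF_SLOW_MASK)
        return demo_copy_recovery(dst, src, n);

    memcpy(dst, src->data, n);
    return n;
}
```

With an 8-byte region whose fourth byte is marked poisoned, calling
demo_copy() with flags of 0 copies all 8 bytes through memcpy, while
passing DEMO_RWF_RECOVERY returns 3 and leaves the rest of the
destination untouched. In this toy model, callers that never set the
flag pay for one predictable branch and nothing else.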