From: Dan Williams
Date: Mon, 4 May 2020 14:30:54 -0700
Subject: Re: [PATCH v2 0/2] Replace and improve "mcsafe" with copy_safe()
To: Andy Lutomirski
Cc: "Luck, Tony", Linus Torvalds, Thomas Gleixner, Ingo Molnar,
    Peter Zijlstra, Borislav Petkov, stable, the arch/x86 maintainers,
    "H. Peter Anvin", Paul Mackerras, Benjamin Herrenschmidt,
    "Tsaur, Erwin", Michael Ellerman, Arnaldo Carvalho de Melo,
    linux-nvdimm, Linux Kernel Mailing List
List-ID: linux-kernel@vger.kernel.org

On Mon, May 4, 2020 at 1:26 PM Andy Lutomirski wrote:
>
> On Mon, May 4, 2020 at 1:05 PM Luck, Tony wrote:
> >
> > > When a copy function hits a bad page and the page is not yet known to
> > > be bad, what does it do? (I.e. the page was believed to be fine but
> > > the copy function gets #MC.) Does it unmap it right away? What does
> > > it return?
> >
> > I suspect that we will only ever find a handful of situations where the
> > kernel can recover from memory that has gone bad that are worth fixing
> > (got to be some code path that touches a meaningful fraction of memory,
> > otherwise we get code complexity without any meaningful payoff).
> >
> > I don't think we'd want different actions for the cases of "we just found out
> > now that this page is bad" and "we got a notification an hour ago that this
> > page had gone bad". Currently we treat those the same for application
> > errors ... SIGBUS either way[1].
>
> Oh, I agree that the end result should be the same. I'm thinking more
> about the mechanism and the internal API. As a somewhat silly example
> of why there's a difference: the first time we try to read from bad
> memory, we can expect #MC (I assume, on a sensibly functioning
> platform). But, once we get the #MC, I imagine that the #MC handler
> will want to unmap the page to prevent a storm of additional #MC
> events on the same page -- given the awful x86 #MC design, too many
> all at once is fatal. So the next time we copy_mc_to_user() or
> whatever from the memory, we'll get #PF instead. Or maybe that #MC
> will defer the unmap?

After the consumption the PMEM driver arranges for the page to never
be mapped again via its "badblocks" list.

>
> So the point of my questions is that the overall design should be at
> least somewhat settled before anyone tries to review just the copy
> functions.

I would say that DAX / PMEM stretches the Linux memory error handling
model beyond what it was originally designed for. The primary concepts
that bend the assumptions of mm/memory-failure.c are:

1/ DAX pages cannot be offlined via the page allocator.

2/ DAX pages (well, cachelines in those pages) can be asynchronously
   marked poisoned by a platform or device patrol scrub facility.

3/ DAX pages might be repaired by writes.
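(As an aside, to make the return convention Andy asked about concrete:
the copy reports the number of bytes *not* copied, 0 on full success, a
short count when poison is hit. Below is a rough userspace model of
that convention only -- copy_safe_model() and the poison_off parameter
are illustrative stand-ins, not the kernel's actual #MC-safe copy
implementation.)

```c
#include <stddef.h>
#include <string.h>

/*
 * Model of the proposed return convention: return the number of
 * bytes remaining (not copied), so 0 means success. poison_off
 * stands in for the offset of the first cacheline that would raise
 * #MC; (size_t)-1 means the source range is clean.
 */
static size_t copy_safe_model(void *dst, const void *src, size_t len,
			      size_t poison_off)
{
	size_t ok = len;

	if (poison_off < len)
		ok = poison_off;	/* the copy stops at the bad cacheline */
	memcpy(dst, src, ok);
	return len - ok;		/* bytes remaining, 0 on success */
}
```

The caller can then treat any non-zero return as "tail of the buffer
is suspect" regardless of whether the fault arrived as #MC or, after
an unmap, as #PF.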
Currently 1/ and 2/ are managed by a per-block-device "badblocks" list
that is populated by scrub results and also amended when #MC is raised
(see nfit_handle_mce()). When fs/dax.c services faults it will decline
to map the page if the physical file extent intersects a bad block.

There is also support for sending SIGBUS if userspace races the
scrubber to consume the badblock. However, that uses the standard
'struct page' error model and assumes that a file-backed page is 1:1
mapped to a file. This requirement prevents filesystems from enabling
reflink. That collision, and the desire to enable reflink, is why we
are now investigating supplanting the mm/memory-failure.c model: when
the page is "owned" by a filesystem, invoke the filesystem to handle
the memory error across all impacted files.

The presence of 3/ means that any action error handling takes to
disable access to the page needs to be capable of being undone, which
runs counter to the mm/memory-failure.c assumption that offlining is a
one-way trip.
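The "decline to map if the extent intersects a bad block" check can be
sketched like this. A simplified userspace model: it assumes a flat
array of sector ranges rather than the kernel's actual struct
badblocks, and the names are illustrative, not the fs/dax.c code.

```c
#include <stdbool.h>
#include <stddef.h>

/* One run of known-bad sectors from the per-device badblocks list. */
struct bad_range {
	unsigned long long start;	/* first bad sector */
	unsigned long long nsectors;	/* length of the bad run */
};

/*
 * Return true if the physical file extent [sector, sector + nsectors)
 * overlaps any known bad range; the fault handler would then decline
 * to map the page (or send SIGBUS if userspace raced the scrubber).
 */
static bool extent_has_badblock(const struct bad_range *bb, size_t nbb,
				unsigned long long sector,
				unsigned long long nsectors)
{
	for (size_t i = 0; i < nbb; i++) {
		unsigned long long bb_end = bb[i].start + bb[i].nsectors;

		if (sector < bb_end && bb[i].start < sector + nsectors)
			return true;	/* half-open ranges overlap */
	}
	return false;
}
```

Because of 3/, an entry in this list is not permanent: a write that
repairs the poison clears the range, so the "do not map" decision has
to be re-evaluated per fault rather than cached as a one-way offline.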