Date: Thu, 18 Mar 2021 15:57:45 +1100
From: Dave Chinner <david@fromorbit.com>
To: Dan Williams
Cc: linux-mm@kvack.org, linux-nvdimm@lists.01.org, Jason Gunthorpe,
    Christoph Hellwig, Shiyang Ruan, Vishal Verma, Dave Jiang,
    Ira Weiny, Matthew Wilcox, Jan Kara, Andrew Morton,
    Naoya Horiguchi,
Wong" , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure() Message-ID: <20210318045745.GC349301@dread.disaster.area> References: <161604048257.1463742.1374527716381197629.stgit@dwillia2-desk3.amr.corp.intel.com> <161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com> X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=F8MpiZpN c=1 sm=1 tr=0 cx=a_idp_d a=gO82wUwQTSpaJfP49aMSow==:117 a=gO82wUwQTSpaJfP49aMSow==:17 a=kj9zAlcOel0A:10 a=dESyimp9J3IA:10 a=7-415B0cAAAA:8 a=WmxcBHIv_b8-_gLMp1kA:9 a=CjuIK1q_8ugA:10 a=biEYGPWJfzWAr4FL6Ov7:22 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote: > Jason wondered why the get_user_pages_fast() path takes references on a > @pgmap object. The rationale was to protect against accessing a 'struct > page' that might be in the process of being removed by the driver, but > he rightly points out that should be solved the same way all gup-fast > synchronization is solved which is invalidate the mapping and let the > gup slow path do @pgmap synchronization [1]. > > To achieve that it means that new user mappings need to stop being > created and all existing user mappings need to be invalidated. > > For device-dax this is already the case as kill_dax() prevents future > faults from installing a pte, and the single device-dax inode > address_space can be trivially unmapped. > > The situation is different for filesystem-dax where device pages could > be mapped by any number of inode address_space instances. An initial > thought was to treat the device removal event like a drop_pagecache_sb() > event that walks superblocks and unmaps all inodes. However, Dave points > out that it is not just the filesystem user-mappings that need to react > to global DAX page-unmap events, it is also filesystem metadata > (proposed DAX metadata access), and other drivers (upstream > DM-writecache) that need to react to this event [2]. > > The only kernel facility that is meant to globally broadcast the loss of > a page (via corruption or surprise remove) is memory_failure(). The > downside of memory_failure() is that it is a pfn-at-a-time interface. > However, the events that would trigger the need to call memory_failure() > over a full PMEM device should be rare. This is a highly suboptimal design. Filesystems only need a single callout to trigger a shutdown that unmaps every active mapping in the filesystem - we do not need a page-by-page error notification which results in 250 million hwposion callouts per TB of pmem to do this. Indeed, the moment we get the first hwpoison from this patch, we'll map it to the primary XFS superblock and we'd almost certainly consider losing the storage behind that block to be a shut down trigger. During the shutdown, the filesystem should unmap all the active mappings (we already need to add this to shutdown on DAX regardless of this device remove issue) and so we really don't need a page-by-page notification of badness. AFAICT, it's going to take minutes, maybe hours for do the page-by-page iteration to hwposion every page. It's going to take a few seconds for the filesystem shutdown to run a device wide invalidation. 
So, yeah, I think this should simply be a single ranged call to the
filesystem like:

	->memory_failure(dev, 0, -1ULL)

to tell the filesystem that the entire backing device has gone away,
and leave the filesystem to handle the failure entirely at the
filesystem level.
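A minimal sketch of what that single callout could look like - the
structure and function names below are hypothetical, not an existing
kernel interface:

	/*
	 * Hypothetical shape of the single ranged failure callout.
	 * The driver makes one call covering the whole device and
	 * leaves the policy (shutdown, invalidation) to the
	 * filesystem.
	 */
	struct dax_failure_ops {
		/* offset/len in bytes; len == -1ULL means "whole device" */
		int (*memory_failure)(struct dax_device *dax_dev,
				      u64 offset, u64 len);
	};

	/* driver side, on surprise removal of the backing device */
	static void pmem_notify_failure(struct dax_device *dax_dev,
					const struct dax_failure_ops *ops)
	{
		ops->memory_failure(dax_dev, 0, -1ULL);
	}

-Dave.

-- 
Dave Chinner
david@fromorbit.com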