Received: by 2002:ac0:a679:0:0:0:0:0 with SMTP id p54csp987034imp; Thu, 21 Feb 2019 15:47:59 -0800 (PST) X-Google-Smtp-Source: AHgI3IYyI0DzHXbDilIyuLrgNb2UwjsZQ08Z965ZL5vYMjVtgrF3naacBfvmtU+nqWo6WCqOYYKc X-Received: by 2002:a63:f84d:: with SMTP id v13mr1087597pgj.384.1550792879830; Thu, 21 Feb 2019 15:47:59 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1550792879; cv=none; d=google.com; s=arc-20160816; b=chNDYlgUKroCNhtQR9xN+E1DIXalfb/L6PEYFs0/1Z9rCviQ8M+DuGUPkeBnfFdGSr sz5a8raMARwuSKOA1CBWyjos8FuBAU9V/7+/8ay4E5cCvHUn1BomyeVg15U++meSGxbK VGM3vjJcJsuEHj6MD+hRe//mLiL5MkiaG/g6ttzm/YcT5Ucv23O1UnXVRHlBOK4uAkj5 qXtMqLZ4VLFakgseiQM8xHJ0JS7j4KCZ7VLilzo8e+vis/42QDZ2U4Z6yTquuME+wKym i5WkBjUWGITh2kCEfGqz7J65IB9F/sHedOIlhQFGawPCHP5ordErx+yJWAu3WTjV6zdd N4dg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:mime-version:user-agent:message-id :in-reply-to:date:references:subject:cc:to:from; bh=yECcF4ryzn/D2WtDmOn2rYb9gJ3DGQpvK5inUI6EUl4=; b=SpGnmcaIZsjqhTMapi4Xgf2wkL4jPuUYRI7Rmfrak0TGGP3cU3UNXo4TfIXdnZnnAV ttEdonInrR+b4GC0YB1sYCIoXxCRTigTMZumRiRq4tlqn8xU3sns4w84q4CRxuke1a2q Skscm3CCl8gCtFcKWt2L8+WUyVvymL5NMCuIDzyCzKS5R86epf9aLp6wYz9Piy214qkL YT3hhSsS3V95WwGUAxWjMvMI7r1qrDhuw7FI6KH8FtimiYm/oYLOXNsddLYRKVS+0amO 3KJZQr4lLj2XmPXdGuP00vaAWdfL4jdqGCGE6su3c1yAEMasUvz5yqZ/31thvsybqmC9 OznQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id z17si202894pgf.267.2019.02.21.15.47.44; Thu, 21 Feb 2019 15:47:59 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726409AbfBUXrZ (ORCPT + 99 others); Thu, 21 Feb 2019 18:47:25 -0500 Received: from mx1.redhat.com ([209.132.183.28]:58654 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725961AbfBUXrY (ORCPT ); Thu, 21 Feb 2019 18:47:24 -0500 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id C323F10C6D; Thu, 21 Feb 2019 23:47:23 +0000 (UTC) Received: from segfault.boston.devel.redhat.com (segfault.boston.devel.redhat.com [10.19.60.26]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 0EF0760C80; Thu, 21 Feb 2019 23:47:22 +0000 (UTC) From: Jeff Moyer To: Dan Williams Cc: linux-nvdimm@lists.01.org, stable@vger.kernel.org, linux-kernel@vger.kernel.org, vishal.l.verma@intel.com, linux-fsdevel@vger.kernel.org Subject: Re: [PATCH 7/7] libnvdimm/pfn: Fix 'start_pad' implementation References: <155000668075.348031.9371497273408112600.stgit@dwillia2-desk3.amr.corp.intel.com> <155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com> X-PGP-KeyID: 1F78E1B4 X-PGP-CertKey: F6FE 280D 8293 F72C 65FD 5A58 1FF8 A7CA 1F78 E1B4 Date: Thu, 21 Feb 2019 18:47:22 -0500 In-Reply-To: <155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com> (Dan Williams's message of "Tue, 12 Feb 2019 13:25:17 -0800") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.38]); Thu, 21 Feb 2019 23:47:23 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, Dan, Thanks for the comprehensive write-up. Comments below. Dan Williams writes: > In the beginning the pmem driver simply passed the persistent memory > resource range to memremap and was done. With the introduction of > devm_memremap_pages() and vmem_altmap the implementation needed to > contend with metadata at the start of the resource to indicate whether > the vmemmap is located in System RAM or Persistent Memory, and reserve > vmemmap capacity in pmem for the latter case. > > The indication of metadata space was communicated in the > nd_pfn->data_offset property and it was defined to be identical to the > pmem_device->data_offset property, i.e. relative to the raw resource > base of the namespace. Up until this point in the driver's development > pmem_device->phys_addr == __pa(pmem_device->virt_addr). This > implementation was fine up until the discovery of platforms with > physical address layouts that mapped Persistent Memory and System RAM to > the same Linux memory hotplug section (128MB span). > > The nd_pfn->start_pad and nd_pfn->end_trunc properties were introduced > to pad and truncate the capacity to fit within an exclusive Linux > memory hotplug section span, and it was at this point where the > ->start_pad definition did not comprehend the pmem_device->phys_addr to > pmem_device->virt_addr relationship. Platforms in the wild typically > only collided 'System RAM' at the end of the Persistent Memory range so > ->start_pad was often zero. > > Lately Linux has encountered platforms that collide Persistent Memory > regions between each other, specifically cases where ->start_pad needed > to be non-zero. This lead to commit ae86cbfef381 "libnvdimm, pfn: Pad > pfn namespaces relative to other regions". That commit allowed > namespaces to be mapped with devm_memremap_pages(). However dax > operations on those configurations currently fail if attempted within the > ->start_pad range because pmem_device->data_offset was still relative to > raw resource base not relative to the section aligned resource range > mapped by devm_memremap_pages(). > > Luckily __bdev_dax_supported() caught these failures and simply disabled > dax. Let me make sure I understand the current state of things. Assume a machine with two persistent memory ranges overlapping the same hotplug memory section. Let's take the example from the ndctl github issue[1]: 187c000000-967bffffff : Persistent Memory /sys/bus/nd/devices/region0/resource: 0x187c000000 /sys/bus/nd/devices/region1/resource: 0x577c000000 Create a namespace in region1. That namespace will have a start_pad of 64MiB. The problem is that, while the correct offset was specified when laying out the struct pages (via arch_add_memory), the data_offset for the pmem block device itself does not take the start_pad into account (despite the comment in the nd_pfn_sb data structure!). As a result, the block device starts at the beginning of the address range, but struct pages only exist for the address space starting 64MiB into the range. __bdev_dax_supported fails, because it tries to perform a direct_access call on sector 0, and there's no pgmap for the address corresponding to that sector. So, we can't simply make the code correct (by adding the start_pad to pmem->data_offset) without bumping the superblock version, because that would change the size of the block device, and the location of data on that block device would all be off by 64MiB (and you'd lose the first 64MiB). Mass hysteria. > However, to fix this situation a non-backwards compatible change > needs to be made to the interpretation of the nd_pfn info-block. > ->start_pad needs to be accounted in ->map.map_offset (formerly > ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be > adjusted to the section aligned resource base used to establish > ->map.map formerly (formerly ->virt_addr). > > The guiding principles of the info-block compatibility fixup is to > maintain the interpretation of ->data_offset for implementations like > the EFI driver that only care about data_access not dax, but cause older > Linux implementations that care about the mode and dax to fail to parse > the new info-block. What if the core mm grew support for hotplug on sub-section boundaries? Would't that fix this problem (and others)? -Jeff [1] https://github.com/pmem/ndctl/issues/76