Date: Wed, 6 May 2015 16:47:14 -0700
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
From: Dan Williams
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Boaz Harrosh, Jan Kara, Mike Snitzer,
    Neil Brown, Benjamin Herrenschmidt, Dave Hansen, Heiko Carstens,
    Chris Mason, Paul Mackerras, "H. Peter Anvin", Christoph Hellwig,
    Alasdair Kergon, "linux-nvdimm@lists.01.org", Ingo Molnar,
    Mel Gorman, Matthew Wilcox, Ross Zwisler, Rik van Riel,
    Martin Schwidefsky, Jens Axboe, "Theodore Ts'o",
    "Martin K. Petersen", Julia Lawall, Tejun Heo, linux-fsdevel,
    Andrew Morton
References: <20150506200219.40425.74411.stgit@dwillia2-desk3.amr.corp.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, May 6, 2015 at 3:10 PM, Linus Torvalds wrote:
> On Wed, May 6, 2015 at 1:04 PM, Dan Williams wrote:
>>
>> The motivation for this change is persistent memory and the desire to
>> use it not only via the pmem driver, but also as a memory target for I/O
>> (DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel.
>
> I detest this approach.
>

Hmm, yes, I can't argue against "put the onus on odd behavior where it
belongs."...

> I'd much rather go exactly the other way around, and do the dynamic
> "struct page" instead.
>
> Add a flag to "struct page"

Ok, given I had already precluded 32-bit systems in this __pfn_t
approach we should have flag space for this on 64-bit.

> to mark it as a fake entry and teach
> "page_to_pfn()" to look up the actual pfn some way (that union that
> contains "index" looks like a good target to also contain 'pfn', for
> example).
>
> Especially if this is mainly for persistent storage, we'll never have
> issues with worrying about writing it back under memory pressure, so
> allocating a "struct page" for these things shouldn't be a problem.
> There's likely only a few paths that actually generate IO for those
> things.
>
> In other words, I'd really like our basic infrastructure to be for the
> *normal* case, and the "struct page" is about so much more than just
> "what's the target for IO". For normal IO, "struct page" is also what
> serializes the IO so that you have a consistent view of the end
> result, and there's obviously the reference count there too. So I
> really *really* think that "struct page" is the better entity for
> describing the actual IO, because it's the common and the generic
> thing, while a "pfn" is not actually *enough* for IO in general, and
> you now end up having to look up the "struct page" for the locking and
> refcounting etc.
>
> If you go the other way, and instead generate a "struct page" from the
> pfn for the few cases that need it, you put the onus on odd behavior
> where it belongs.
>
> Yes, it might not be any simpler in the end, but I think it would be
> conceptually much better.

Conceptually better, but certainly more difficult to audit if the fake
struct page is initialized in a subtle way that breaks when/if it
leaks to some unwitting context.  The one benefit I may need to
concede is a mechanism for the few paths that know what they are doing
to opt in to handling these fake pages.  That was easy with __pfn_t,
but a struct page can go silently almost anywhere.  Certainly nothing
is prepared for a given struct page pointer to change the pfn it
points to on the fly, which I think is what we would end up doing for
something like a RAID cache.
Keep a pool of struct pages around and point them at persistent
memory pfns while I/O is in flight.
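
For illustration only, the pooled fake-page idea discussed above could
be modeled in user space roughly as follows. This is a toy sketch, not
kernel code: struct toy_page, TOY_PG_FAKE, toy_mem_map, toy_pool_get(),
toy_pool_put() and toy_page_to_pfn() are all hypothetical names, and
the real struct page layout and page_to_pfn() are considerably more
involved.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct page; NOT the kernel's real layout. */
#define TOY_PG_FAKE   (1u << 0)  /* hypothetical "fake entry" flag */
#define TOY_POOL_SIZE 4
#define TOY_NR_PAGES  8

struct toy_page {
	unsigned int flags;
	int refcount;             /* 0 means "free" for pool pages */
	union {
		unsigned long index;  /* ordinary use: offset in a mapping */
		unsigned long pfn;    /* fake pages: pfn currently targeted */
	};
};

/* Ordinary pages: the pfn is simply the position in this array. */
static struct toy_page toy_mem_map[TOY_NR_PAGES];

/* Fake pages that get pointed at pmem pfns while I/O is in flight. */
static struct toy_page toy_pool[TOY_POOL_SIZE];

/*
 * Flag check first, array arithmetic otherwise. Non-fake pages are
 * assumed to live in toy_mem_map.
 */
static unsigned long toy_page_to_pfn(const struct toy_page *page)
{
	if (page->flags & TOY_PG_FAKE)
		return page->pfn;
	return (unsigned long)(page - toy_mem_map);
}

/* Grab the first free pool page and aim it at @pfn; NULL if exhausted. */
static struct toy_page *toy_pool_get(unsigned long pfn)
{
	for (size_t i = 0; i < TOY_POOL_SIZE; i++) {
		struct toy_page *page = &toy_pool[i];

		if (page->refcount == 0) {
			page->flags = TOY_PG_FAKE;
			page->pfn = pfn;
			page->refcount = 1;
			return page;
		}
	}
	return NULL;
}

/* Return a fake page to the pool once its I/O has completed. */
static void toy_pool_put(struct toy_page *page)
{
	assert(page->flags & TOY_PG_FAKE);
	assert(page->refcount == 1);
	page->refcount = 0;
	page->flags = 0;
}
```

In this model a caller would do toy_pool_get(pfn) before submitting
I/O and toy_pool_put(page) on completion. The next toy_pool_get() can
hand back the very same pointer aimed at a different pfn, which is
exactly the "changes the pfn it points to on the fly" hazard: any
context still holding the old pointer would silently resolve to the
new pfn.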