Date: Thu, 7 May 2015 08:00:05 -0700
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t
From: Linus Torvalds
To: Dan Williams
Cc: Linux Kernel Mailing List, Boaz Harrosh, Jan Kara, Mike Snitzer,
    Neil Brown, Benjamin Herrenschmidt, Dave Hansen, Heiko Carstens,
    Chris Mason, Paul Mackerras, "H. Peter Anvin", Christoph Hellwig,
    Alasdair Kergon, "linux-nvdimm@lists.01.org", Ingo Molnar, Mel Gorman,
    Matthew Wilcox, Ross Zwisler, Rik van Riel, Martin Schwidefsky,
    Jens Axboe, "Theodore Ts'o", "Martin K. Petersen", Julia Lawall,
    Tejun Heo, linux-fsdevel, Andrew Morton
References: <20150506200219.40425.74411.stgit@dwillia2-desk3.amr.corp.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, May 6, 2015 at 7:36 PM, Dan Williams wrote:
>
> My pet concrete example is covered by __pfn_t. Referencing persistent
> memory in an md/dm hierarchical storage configuration. Setting aside
> the thrash to get existing block users to do "bvec_set_page(page)"
> instead of "bvec->page = page" the onus is on that md/dm
> implementation and backing storage device driver to operate on
> __pfn_t. That use case is simple because there is no use of page
> locking or refcounting in that path, just dma_map_page() and
> kmap_atomic().

So clarify for me: are you trying to make the IO stack in general be
able to use the persistent memory as a source (or destination) for IO
to _other_ devices, or are you talking about just internally shuffling
things around for something like RAID on top of persistent memory?

Because I think those are two very different things.

For example, one of the things I worry about is for people doing IO
from persistent memory directly to some "slow stable storage" (aka
disk). That was what I thought you were aiming for: infrastructure so
that you can make a bio for a *disk* device contain a page list that is
the persistent memory.

And I think that is a very dangerous operation to do, because the
persistent memory itself is going to have some filesystem on it, so
anything that looks up the persistent memory pages is *not* going to
have a stable pfn: the pfn will point to a fixed part of the persistent
memory, but the file that was there may be deleted and the memory
reassigned to something else.

That's the kind of thing that "struct page" helps with for normal IO
devices. It's both a source of serialization and indirection, so that
when somebody does a "truncate()" on a file, we don't end up doing IO
to random stale locations on the disk that got reassigned to another
file.

So "struct page" is very fundamental. It's *not* just a "this is the
physical source/drain of the data you are doing IO on".

So if you are looking at some kind of "zero-copy IO", where you can do
IO from a filesystem on persistent storage to *another* filesystem on,
say, a big rotational disk used for long-term storage, by just doing a
bio that targets the disk but has the persistent memory as the source
memory, I really want to understand how you are going to serialize
this.

So *that* is what I meant by "What is the primary thing that is driving
this need? Do we have a very concrete example?"

I absolutely do *not* want to teach the bio subsystem to just randomly
be able to take the source/destination of the IO as being some random
pfn without knowing what the actual uses are and how these IOs are
generated in the first place.

I was assuming that you wanted to do something where you mmap() the
persistent memory, and then write it out to another device (possibly
using aio_write()). But that really does require some kind of
serialization at a higher level, because you can't just look up the
pfn's in the page table and assume they are stable: they are *not*
stable.

Linus
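
To make the stability concern in the message above concrete, here is a
minimal toy sketch in C. It is not kernel code and not anyone's proposed
design: the names toy_frame, toy_alloc(), toy_pin(), toy_unpin(),
toy_truncate() and toy_io_hits_intended_file() are all hypothetical, and
the pin count is only a rough stand-in for the serialization and
indirection that "struct page" provides. The assumption is that a frame
number by itself says nothing about which file owns the frame at the
moment the IO actually runs.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_FRAMES 4
#define NO_FILE   (-1)

/* Toy persistent-memory frame (hypothetical, for illustration only). */
struct toy_frame {
	int owner;	/* id of the file that owns this frame, or NO_FILE */
	int pins;	/* number of in-flight IOs holding the frame */
};

static struct toy_frame frames[NR_FRAMES];

/* Mark every frame free and unpinned. */
void toy_reset(void)
{
	for (int i = 0; i < NR_FRAMES; i++) {
		frames[i].owner = NO_FILE;
		frames[i].pins = 0;
	}
}

/* Give the lowest free, unpinned frame to 'file'; returns its pfn or -1. */
int toy_alloc(int file)
{
	for (int i = 0; i < NR_FRAMES; i++) {
		if (frames[i].owner == NO_FILE && frames[i].pins == 0) {
			frames[i].owner = file;
			return i;
		}
	}
	return -1;
}

/* Take a reference for an in-flight IO. */
void toy_pin(int pfn)
{
	assert(pfn >= 0 && pfn < NR_FRAMES);
	frames[pfn].pins++;
}

/* Drop the reference once the IO has completed. */
void toy_unpin(int pfn)
{
	assert(pfn >= 0 && pfn < NR_FRAMES && frames[pfn].pins > 0);
	frames[pfn].pins--;
}

/*
 * "Truncate": free the frame so it can be handed to another file.
 * In this toy model a pinned frame simply refuses to be freed.
 */
bool toy_truncate(int pfn)
{
	assert(pfn >= 0 && pfn < NR_FRAMES);
	if (frames[pfn].pins > 0)
		return false;
	frames[pfn].owner = NO_FILE;
	return true;
}

/* Would an IO on 'pfn' still touch data belonging to 'file'? */
bool toy_io_hits_intended_file(int pfn, int file)
{
	assert(pfn >= 0 && pfn < NR_FRAMES);
	return frames[pfn].owner == file;
}
```

With a bare pfn, toy_truncate() followed by toy_alloc() for a second
file hands out the same frame, so toy_io_hits_intended_file() reports
that the pending IO now targets someone else's data. With toy_pin()
taken first, the truncate is refused until toy_unpin(), and the frame
cannot be reassigned underneath the IO.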