Date: Tue, 12 May 2015 10:47:54 -0400
From: Jerome Glisse
To: Dave Chinner
Cc: Ingo Molnar, Rik van Riel, Linus Torvalds, John Stoffel, Dave Hansen,
	Dan Williams, Linux Kernel Mailing List, Boaz Harrosh, Jan Kara,
	Mike Snitzer, Neil Brown, Benjamin Herrenschmidt, Heiko Carstens,
	Chris Mason, Paul Mackerras, "H. Peter Anvin", Christoph Hellwig,
	Alasdair Kergon, "linux-nvdimm@lists.01.org", Mel Gorman,
	Matthew Wilcox, Ross Zwisler, Martin Schwidefsky, Jens Axboe,
	"Theodore Ts'o", "Martin K. Petersen", Julia Lawall, Tejun Heo,
	linux-fsdevel, Andrew Morton
Subject: Re: "Directly mapped persistent memory page cache"

On Tue, May 12, 2015 at 10:53:47AM +1000, Dave Chinner wrote:
> On Mon, May 11, 2015 at 11:18:36AM +0200, Ingo Molnar wrote:
>
> > > And, of course, different platforms have different page sizes, so
> > > designing page array structures to be optimal for x86-64 is just a
> > > wee bit premature.
> >
> > 4K is the smallest one on x86 and ARM, and it's also a IMHO pretty
> > sane default from a human workflow point of view.
> >
> > But oddball configs with larger page sizes could also be supported at
> > device creation time (via a simple superblock structure).
>
> Ok, so now I know it's volatile, why do we need a persistent
> superblock? Why is *anything* persistent required? And why would
> page size matter if the reserved area is volatile?
>
> And if it is volatile, then the kernel is effectively doing dynamic
> allocation and initialisation of the struct pages, so why wouldn't
> we just do dynamic allocation out of a slab cache in RAM and free
> them when the last reference to the page goes away? Applications
> aren't going to be able to reference every page in persistent
> memory at the same time...
>
> Keep in mind we need to design for tens of TB of PRAM at minimum
> (400GB NVDIMMS and tens of them in a single machine are not that far
> away), so static arrays of structures that index 4k blocks is not a
> design that scales to these sizes - it's like using 1980s filesystem
> algorithms for a new filesystem designed for tens of terabytes of
> storage - it can be made to work, but it's just not efficient or
> scalable in the long term.

On having an easy pfn<->struct page relation I would agree with Ingo; I
think it is important. For instance, in my case, when migrating system
memory to device memory I store a pfn in a special swap entry. While
right now I use my own ad hoc structure, I would rather directly use a
struct page that I can easily find back from the pfn.

In the scheme I proposed you only need to allocate the PUD & PMD
directories and map a huge zero page read-only over the whole array at
boot time. When you need a struct page for a given pfn you allocate two
pages: one for the PMD directory and one for the struct page array
covering that range of pfns. Once the struct page is no longer needed
you free both pages and point back to the zero huge page. So you get
dynamic allocation and keep the nice pfn<->struct page mapping working.
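To make the shape of this more concrete, here is a rough userspace-style
sketch of the idea. This is not kernel code: pmem_page, chunk, chunk_dir,
PAGES_PER_CHUNK and the pmem_* helpers are all invented names, used only
to show a directory whose slots start out aliasing a shared zero chunk,
with real backing allocated on first use and folded back when the last
user goes away.

/*
 * Illustrative userspace model of the lazy struct page scheme above.
 * None of these names exist in the kernel; they only mirror the idea.
 */
#include <stdlib.h>
#include <assert.h>

#define PAGES_PER_CHUNK 512		/* one PMD-sized slice of the array */

struct pmem_page {			/* stand-in for struct page */
	unsigned long flags;
	int refcount;
};

struct chunk {
	struct pmem_page pages[PAGES_PER_CHUNK];
	int users;			/* live struct pages in this slice */
};

/* shared placeholder modelling the read-only zero huge page */
static struct chunk zero_chunk;
/* stand-in for the boot-time PUD/PMD directory */
static struct chunk **chunk_dir;
static unsigned long nr_chunks;

void pmem_map_init(unsigned long nr_pfns)
{
	unsigned long i;

	nr_chunks = (nr_pfns + PAGES_PER_CHUNK - 1) / PAGES_PER_CHUNK;
	chunk_dir = calloc(nr_chunks, sizeof(*chunk_dir));
	assert(chunk_dir);
	/* boot cost is only the directory: every slot aliases zero_chunk */
	for (i = 0; i < nr_chunks; i++)
		chunk_dir[i] = &zero_chunk;
}

/* get a usable struct page for a pfn, populating its slice on first use */
struct pmem_page *pmem_pfn_to_page(unsigned long pfn)
{
	unsigned long idx = pfn / PAGES_PER_CHUNK;

	if (chunk_dir[idx] == &zero_chunk) {
		/* in the kernel this is where the page-table page and the
		 * page backing this slice of the array get allocated */
		struct chunk *c = calloc(1, sizeof(*c));

		assert(c);	/* error handling elided in this sketch */
		chunk_dir[idx] = c;
	}
	chunk_dir[idx]->users++;
	return &chunk_dir[idx]->pages[pfn % PAGES_PER_CHUNK];
}

/* drop a struct page; fold the slice back onto the zero chunk when empty */
void pmem_put_page(unsigned long pfn)
{
	unsigned long idx = pfn / PAGES_PER_CHUNK;
	struct chunk *c = chunk_dir[idx];

	assert(c != &zero_chunk);
	if (--c->users == 0) {
		chunk_dir[idx] = &zero_chunk;
		free(c);
	}
}

int main(void)
{
	struct pmem_page *p;

	pmem_map_init(1UL << 18);	/* 1GB worth of 4K pfns */
	p = pmem_pfn_to_page(12345);	/* populates one slice */
	p->refcount = 1;
	pmem_put_page(12345);		/* slice folds back onto zero_chunk */
	return 0;
}

In the kernel the directory would of course be the real PUD/PMD page
tables covering the virtual range of the struct page array, the zero
chunk would be the huge zero page mapped read-only, and populating a
slot would mean allocating the page-table page plus the page backing
that slice of the array, as described above.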
> As an example, look at the current problems with scaling the
> initialisation for struct pages for large memory machines - 16TB
> machines are taking 10 minutes just to initialise the struct page
> arrays on startup. That's the scale of overhead that static page
> arrays will have for PRAM, whether they are lazily initialised or
> not. IOWs, static page arrays are not scalable, and hence aren't a
> viable long term solution to the PRAM problem.

With the solution I describe above, all you need to initialize is the
PUD & PMD directories to point to a zero huge page. I would think this
should be fast enough even for 1TB: 2^(40 - 12 - 9 - 9) = 2^10, so you
need 1024 PUD and 512K PMD (4M of PUD and 256M of PMD). You can even
directly share the PMD and dynamically allocate 3 pages (1 for the PMD
level, 1 for the PTE level, 1 for the struct page array), effectively
reducing the static allocation to 4M for all the PUD. The rest is
dynamically allocated/freed upon usage.

> IMO, we need to be designing around the concept that the filesytem
> manages the pmem space, and the MM subsystem simply uses the block
> mapping information provided to it from the filesystem to decide how
> it references and maps the regions into the user's address space or
> for DMA. The mm subsystem does not manage the pmem space, it's
> alignment or how it is allocated to user files. Hence page mappings
> can only be - at best - reactive to what the filesystem does with
> it's free space. The mm subsystem already has to query the block
> layer to get mappings on page faults, so it's only a small stretch
> to enhance the DAX mapping request to ask for a large page mapping
> rather than a 4k mapping. If the fs can't do a large page mapping,
> you'll get a 4k aligned mapping back.
>
> What I'm trying to say is that the mapping behaviour needs to be
> designed with the way filesystems and the mm subsystem interact in
> mind, not from a pre-formed "direct Io is bad, we must use the page
> cache" point of view. The filesystem and the mm subsystem must
> co-operate to allow things like large page mappings to be made and
> hence looking at the problem purely from a mm<->pmem device
> perspective as you are ignores an important chunk of the system:
> the part that actually manages the pmem space...

I am all for letting the filesystem manage pmem, but I think having
struct page exposed to mm allows the mm side to stay ignorant of what is
really behind it. Also, if I could share more code with others I would
be happier :)

Cheers,
Jérôme