From: ebiederm@xmission.com (Eric W. Biederman)
To: Ingo Molnar
Cc: Rik van Riel, Linus Torvalds, John Stoffel, Dave Hansen, Dan Williams,
    Linux Kernel Mailing List, Boaz Harrosh, Jan Kara, Mike Snitzer,
    Neil Brown, Benjamin Herrenschmidt, Heiko Carstens, Chris Mason,
    Paul Mackerras, "H. Peter Anvin", Christoph Hellwig, Alasdair Kergon,
    linux-nvdimm@lists.01.org, Mel Gorman, Matthew Wilcox, Ross Zwisler,
    Martin Schwidefsky, Jens Axboe, "Theodore Ts'o", "Martin K. Petersen",
    Julia Lawall, Tejun Heo, linux-fsdevel, Andrew Morton
Subject: Re: "Directly mapped persistent memory page cache"
Date: Sat, 09 May 2015 10:51:49 -0500
References: <20150507173641.GA21781@gmail.com> <554BA748.9030804@linux.intel.com>
    <20150507191107.GB22952@gmail.com> <554CBE17.4070904@redhat.com>
    <20150508140556.GA2185@gmail.com> <21836.51957.715473.780762@quad.stoffel.home>
    <554CEB5D.90209@redhat.com> <20150509084510.GA10587@gmail.com>
In-Reply-To: <20150509084510.GA10587@gmail.com> (Ingo Molnar's message of "Sat, 9 May 2015 10:45:10 +0200")
Message-ID: <87r3qpyciy.fsf@x220.int.ebiederm.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Ingo Molnar writes:

> * Rik van Riel wrote:
>
>> On 05/08/2015 11:54 AM, Linus Torvalds wrote:
>> > On Fri, May 8, 2015 at 7:40 AM, John Stoffel wrote:
>> >>
>> >> Now go and look at your /home or /data/ or /work areas, where the
>> >> endusers are actually keeping their day to day work. Photos, mp3,
>> >> design files, source code, object code littered around, etc.
>> >
>> > However, the big files in that list are almost immaterial from a
>> > caching standpoint.
>>
>> > The big files in your home directory? Let me make an educated guess.
>> > Very few to *none* of them are actually in your page cache right now.
>> > And you'd never even care if they ever made it into your page cache
>> > *at*all*. Much less whether you could ever cache them using large
>> > pages using some very fancy cache.
>>
>> However, for persistent memory, all of the files will be "in
>> memory".
>>
>> Not instantiating the 4kB struct pages for 2MB areas that are not
>> currently being accessed with small files may make a difference.
>>
>> For dynamically allocated 4kB page structs, we need some way to
>> discover where they are. It may make sense, from a simplicity point
>> of view, to have one mechanism that works both for pmem and for
>> normal system memory.
>
> I don't think we need to or want to allocate page structs dynamically,
> which makes the model really simple and robust.
>
> If we 'think big', we can create something very exciting IMHO, that
> also gets rid of most of the complications with DIO, DAX, etc:
>
> "Directly mapped pmem integrated into the page cache":
> ------------------------------------------------------
>
>  - The pmem filesystem is mapped directly in all cases, it has device
>    side struct page arrays, and its struct pages are directly in the
>    page cache, write-through cached. (See further below about how we
>    can do this.)
>
>    Note that this is radically different from the current approach
>    that tries to use DIO and DAX to provide specialized "direct
>    access" APIs.
>
>    With the 'directly mapped' approach we have numerous advantages:
>
>      - no double buffering to main RAM: the device pages represent
>        file content.
>
>      - no bdflush, no VM pressure, no writeback pressure, no
>        swapping: this is a very simple VM model where the device is
>        RAM and we don't have much dirty state. The primary kernel
>        cache is the dcache and the directly mapped page cache, which
>        is not a writeback cache in this case but essentially a
>        logical->physical index cache of filesystem indexing
>        metadata.
>
>      - every binary mmap()ed would be XIP mapped in essence
>
>      - every read() would be equivalent to a DIO read, without the
>        complexity of DIO.
>
>      - every read() or write() done into a data mmap() area would
>        allow device-to-device zero copy DMA.
>
>      - main RAM caching would still be available and would work in
>        many cases by default: as most apps use file processing
>        buffers in anonymous memory into which they read() data.
>
> We can achieve this by statically allocating all page structs on the
> device, in the following way:
>
>  - For every 128MB of pmem data we allocate 2MB of struct-page
>    descriptors, 64 bytes each, that describes that 128MB data range
>    in a 4K granular way. We never have to allocate page structs as
>    they are always there.
>
>  - Filesystems don't directly see the preallocated page arrays, they
>    still get a 'logical block space' presented that to them looks
>    like a continuous block device (which is 1.5% smaller than the
>    true size of the device): this allows arbitrary filesystems to be
>    put into such pmem devices, fsck will just work, etc.
>
>    I.e. no special pmem filesystem: the full range of existing block
>    device based Linux filesystems can be used.
>
>  - These page structs are initialized in three layers:
>
>      - a single bit at 128MB data granularity: the first struct page
>        of the 2MB large array (32,768 struct page array members)
>        represents the initialization state of all of them.
>
>      - a single bit at 2MB data granularity: the first struct page
>        of every 32K array within the 2MB array represents the whole
>        2MB data area. There are 64 such bits per 2MB array.
>
>      - a single bit at 4K data granularity: the whole page array.
>
>    A page marked uninitialized at a higher layer means all lower
>    layer struct pages are in their initial state.
>
>    This is a variant of your suggestion: one that keeps everything
>    2MB aligned, so that a single kernel side 2MB TLB covers a
>    continuous chunk of the page array. This allows us to create a
>    linear VMAP physical memory model to simplify index mapping.
>
>  - Looking up such a struct page (from a pfn) involves two simple,
>    easily computable indirections. With locality of access
>    present, 'hot' struct pages will be in the CPU cache.
>    Them being 64 bytes each will help this. The on-device format is
>    so simple and so temporary that no fsck is needed for it.
>
>  - 2MB mappings, where desired, are 'natural' in such a layout:
>    everything's 2MB aligned both for kernel and user space use, while
>    4K granularity is still a first class citizen as well.
>
>  - For TB range storage we could make it 1GB granular: We'd allocate
>    a 1GB array for every 64 GB of data. This would also allow gbpage
>    TLBs to be taken advantage of: especially on the kernel side
>    (vmapping the 1GB page array) this might be useful, even if all
>    actual file usage is 4KB granular. The last block would be allowed
>    to be smaller than 64GB, but size would still be rounded to 1GB to
>    keep the mapping simple.
>
> What do you think?

The tricky bit is what happens when you reboot and run a different
version of the kernel, especially a kernel with debugging features
like kmemcheck that increase the size of struct page.

I think we could reserve space for struct page entries in the
persistent memory, and 64 bytes appears to be a reasonable size. But it
would have to be something that we initialize on mount or initialize
on demand.

I don't think we could have persistent struct page entries, as the
exact contents of the struct page entries are too volatile and too
different between architectures. That holds especially for the
architecture changes a pmem store is likely to see, such as switching
between a 32-bit and a 64-bit kernel.

Further, I think where in the persistent memory the struct page arrays
live is something we could leave up to the filesystem. We could have
some reasonable constraints to make it fast, but I think whoever
decides where things live on the persistent memory can make that
choice.

For small persistent memories it probably makes sense to allocate the
struct page array describing them out of ordinary ram. For small
memories I don't think we are talking enough memory to worry about.
For TB+ persistent memories, where you need 16GiB per TiB, it makes
sense to allocate one or several regions to store your struct page
arrays, as you can't count on ordinary ram having enough capacity, and
you may not even be talking about a system that actually has ordinary
ram at that point.

Eric
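
The sizing figures quoted in this thread (2MB of descriptors per 128MB
of data, a 1GB array per 64GB, 16GiB per TiB) all follow from one
64-byte struct page per 4K data page. A minimal Python sketch of that
arithmetic is below. The function names are hypothetical, and the
layout used by descriptor_offset(), with each 2MB descriptor array
sitting directly in front of the 128MB of data it describes, is only
an assumption made for illustration; nothing in the thread fixes where
the arrays live.

```python
KiB, MiB, GiB, TiB = 1024, 1024**2, 1024**3, 1024**4

PAGE_SIZE = 4 * KiB       # data page size used throughout the thread
STRUCT_PAGE_SIZE = 64     # bytes reserved per descriptor (assumed fixed)
CHUNK_DATA = 128 * MiB    # data range covered by one descriptor array


def page_array_bytes(data_bytes):
    """Bytes of 64-byte descriptors needed to cover data_bytes of pmem."""
    pages = -(-data_bytes // PAGE_SIZE)  # ceiling division
    return pages * STRUCT_PAGE_SIZE


CHUNK_ARRAY = page_array_bytes(CHUNK_DATA)   # descriptor bytes per chunk
PAGES_PER_CHUNK = CHUNK_DATA // PAGE_SIZE    # descriptors per chunk


def descriptor_offset(data_page_index):
    """Device byte offset of the descriptor for one data page.

    Assumed (hypothetical) layout: [descriptor array][data] repeated,
    so a lookup is one divide to find the chunk and one multiply to
    find the slot inside that chunk's array.
    """
    chunk, slot = divmod(data_page_index, PAGES_PER_CHUNK)
    return chunk * (CHUNK_ARRAY + CHUNK_DATA) + slot * STRUCT_PAGE_SIZE


# Fraction of the raw device consumed by descriptors in this layout.
overhead = CHUNK_ARRAY / (CHUNK_ARRAY + CHUNK_DATA)

print(page_array_bytes(128 * MiB) // MiB, "MiB of descriptors per 128 MiB")
print(page_array_bytes(64 * GiB) // GiB, "GiB of descriptors per 64 GiB")
print(page_array_bytes(1 * TiB) // GiB, "GiB of descriptors per TiB")
print(f"{overhead:.2%} of the device holds descriptors")
```

The same function also reproduces the middle initialization layer: 2MB
of data needs 32K of descriptors, so a 2MB array holds 64 such groups.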