Date: Thu, 21 Feb 2019 15:41:41 -0500
From: Jerome Glisse
To: Larry Bassel
Cc: linux-nvdimm@lists.01.org, linux-kernel@vger.kernel.org
Subject: Re: question about page tables in DAX/FS/PMEM case
Message-ID: <20190221204141.GB5201@redhat.com>
References: <20190220230622.GI19341@ubuette>
In-Reply-To: <20190220230622.GI19341@ubuette>
On Wed, Feb 20, 2019 at 03:06:22PM -0800, Larry Bassel wrote:
> I'm working on sharing page tables in the DAX/XFS/PMEM/PMD case.
>
> If multiple processes would use the identical page of PMDs corresponding
> to a 1 GiB address range of DAX/XFS/PMEM/PMDs, presumably one can instead
> of populating a new PUD, just atomically increment a refcount and point
> to the same PUD in the next level above.

I think page table sharing has been discussed several times in the past,
and the complexity involved versus the benefit was never clear. For 1GB of
virtual address space you need:

    #pte pages = 1G / (512 * 2^12)       = 512 pte pages
    #pmd pages = 1G / (512 * 512 * 2^12) =   1 pmd page

So if we were to share the pmd directory page we would save a total of 513
pages per page table, or ~2MB. This goes up with the number of processes
that map the same range, i.e. if 10 processes map the same range and share
the same pmd then you save 9 * 2MB = 18MB of memory. That seems like a
relatively modest saving.

AFAIK there is no hardware benefit from sharing a page table directory
across different page tables, so the only benefit is the amount of memory
we save. See below for comments on the complexity of achieving this.

> i.e.
>
> OLD:
> process 1:
> VA -> levels of page tables -> PUD1 -> page of PMDs1
> process 2:
> VA -> levels of page tables -> PUD2 -> page of PMDs2
>
> NEW:
> process 1:
> VA -> levels of page tables -> PUD1 -> page of PMDs1
> process 2:
> VA -> levels of page tables -> PUD1 -> page of PMDs1 (refcount 2)
>
> There are several cases to consider:
>
> 1. New mapping
> OLD:
> make a new PUD, populate the associated page of PMDs
> (at least partially) with PMD entries.
> NEW:
> same
>
> 2. Mapping by a process same (same VA->PA and size and protections, etc.)
> as one that already exists
> OLD:
> make a new PUD, populate the associated page of PMDs
> (at least partially) with PMD entries.
> NEW:
> use same PUD, increase refcount (potentially even if this mapping is private
> in which case there may eventually be a copy-on-write -- see #5 below)
>
> 3. Unmapping of a mapping which is the same as that from another process
> OLD:
> destroy the process's copy of mapping, free PUD, etc.
> NEW:
> decrease refcount, only if now 0 do we destroy mapping, etc.
>
> 4. Unmapping of a mapping which is unique (refcount 1)
> OLD:
> destroy the process's copy of mapping, free PUD, etc.
> NEW:
> same
>
> 5. Mapping was private (but same as another process), process writes
> OLD:
> break the PMD into PTEs, destroy PMD mapping, free PUD, etc.
> NEW:
> decrease refcount, only if now 0 do we destroy mapping, etc.
> we still break the PMD into PTEs.
>
> If I have a mmap of a DAX/FS/PMEM file and I take
> a page (either pte or PMD sized) fault on access to this file,
> the page table(s) are set up in dax_iomap_fault() in fs/dax.c (correct?).

Not exactly: the page tables are allocated long before dax_iomap_fault()
gets called. They are allocated by handle_mm_fault() and its child
functions.
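Roughly, the allocation path looks like the sketch below (a simplified,
from-memory rendering of __handle_mm_fault() in mm/memory.c, not the exact
code; error handling, THP and the non-DAX cases are omitted):

    #include <linux/mm.h>
    #include <linux/dax.h>

    /*
     * Simplified sketch (not the real function) of how __handle_mm_fault()
     * allocates the page table directory levels top-down before any
     * filesystem/DAX code runs.
     */
    static vm_fault_t sketch_handle_mm_fault(struct vm_area_struct *vma,
                                             unsigned long address,
                                             unsigned int flags)
    {
            struct mm_struct *mm = vma->vm_mm;
            struct vm_fault vmf = {
                    .vma = vma,
                    .address = address & PAGE_MASK,
                    .flags = flags,
            };
            pgd_t *pgd = pgd_offset(mm, address);
            p4d_t *p4d = p4d_alloc(mm, pgd, address); /* p4d level if missing */

            if (!p4d)
                    return VM_FAULT_OOM;
            vmf.pud = pud_alloc(mm, p4d, address);    /* pud level if missing */
            if (!vmf.pud)
                    return VM_FAULT_OOM;
            vmf.pmd = pmd_alloc(mm, vmf.pud, address); /* the pmd page -- the
                                                        * page you would want
                                                        * to share instead */
            if (!vmf.pmd)
                    return VM_FAULT_OOM;

            if (pmd_none(*vmf.pmd) && vma_is_dax(vma) && vma->vm_ops->huge_fault)
                    /*
                     * Only here does fs/dax.c get involved: the real code goes
                     * through create_huge_pmd() -> vm_ops->huge_fault()
                     * (e.g. xfs_filemap_huge_fault() -> dax_iomap_fault()),
                     * which fills in the pmd entry allocated above.
                     */
                    return vma->vm_ops->huge_fault(&vmf, PE_SIZE_PMD);

            return handle_pte_fault(&vmf);  /* pte-sized faults end up here */
    }

So by the time dax_iomap_fault() runs, the pud entry already points at a
freshly allocated pmd page; any sharing decision would have to happen in
(or before) pmd_alloc().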
>
> If the process later munmaps this file or exits but there are still
> other users of the shared page of PMDs, I would need to
> detect that this has happened and act accordingly (#3 above)
>
> Where will these page table entries be torn down?
> In the same code where any other page table is torn down?
> If this is the case, what would the cleanest way of telling that these
> page tables (PMDs, etc.) correspond to a DAX/FS/PMEM mapping
> (look at the physical address pointed to?) so that
> I could do the right thing here.
>
> I understand that I may have missed something obvious here.

There are many issues; here are the ones I can think of:
  - Finding a pmd/pud to share: you need to walk the reverse mapping of the
    range you are mapping to find whether any other process (or other virtual
    address) already has a pud or pmd you can reuse. This can take more time
    than just allocating the page directory pages.
  - If one process munmaps some portion of a shared pud you need to break the
    sharing, which means munmap (and mremap) would need to handle this page
    table directory sharing case first.
  - Many code paths in the kernel might need updates to understand shared
    page tables (mprotect, userfaultfd, ...).
  - The locking rules are bound to be painful.
  - This might not work on all architectures, as some architectures associate
    information with the page table directory that cannot always be shared
    (it would need to be enabled arch by arch).

The nice thing:
  - Unmapping for migration: when you unmap a shared pud/pmd you can decrement
    the mapcount by the shared pud/pmd count, which could speed up migration.

This is what I can think of off the top of my head, but there might be other
things. I believe the question really comes down to benefit versus cost, and
to me at least the complexity cost outweighs the benefit for now. Kirill
Shutemov has proposed reworking how we do page tables, and that kind of
rework might tip the balance the other way. So my suggestion would be to look
into how page table management could be changed in a beneficial way that
would also achieve page table sharing.

Cheers,
Jérôme
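P.S. To make the refcount bookkeeping in your cases 2 and 3 concrete, here is
a very rough, purely hypothetical sketch; none of these helpers exist, the
names are made up, and the locking, the rmap walk that finds a reusable pmd
page, and the private/COW case are all left out. It follows roughly the
pattern hugetlbfs already uses in huge_pmd_share()/huge_pmd_unshare():

    #include <linux/mm.h>

    /*
     * Hypothetical helpers, for illustration only.
     *
     * Cases 1/2: either reuse a pmd page that another mapping of the same
     * file range already populated (bumping its refcount), or allocate a
     * fresh one. Finding 'shared' is the hard part (the reverse mapping
     * walk mentioned above) and is not shown here.
     */
    static pmd_t *dax_pmd_share(struct mm_struct *mm, pud_t *pud,
                                unsigned long addr, struct page *shared)
    {
            if (shared) {
                    get_page(shared);       /* case 2: refcount 1 -> 2, 3, ... */
                    pud_populate(mm, pud, (pmd_t *)page_address(shared));
                    mm_inc_nr_pmds(mm);
                    return pmd_offset(pud, addr);
            }
            /* case 1: first mapping of this range, allocate as usual */
            return pmd_alloc(mm, pud, addr);
    }

    /*
     * Cases 3/4: on munmap/exit this mm stops pointing at the pmd page;
     * only the last user actually frees it.
     */
    static void dax_pmd_unshare(struct mm_struct *mm, pud_t *pud,
                                struct page *shared)
    {
            pud_clear(pud);
            mm_dec_nr_pmds(mm);
            put_page(shared);       /* last reference frees it; real code would
                                     * also need the pgtable dtor, TLB flush,
                                     * etc. */
    }

The munmap/mremap/mprotect complications above are exactly the places where a
get/put like this would have to be wired in, which is where the real cost lies.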