Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752357AbcDQBtj (ORCPT ); Sat, 16 Apr 2016 21:49:39 -0400 Received: from mail-pa0-f47.google.com ([209.85.220.47]:36426 "EHLO mail-pa0-f47.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752157AbcDQBti (ORCPT ); Sat, 16 Apr 2016 21:49:38 -0400 Date: Sat, 16 Apr 2016 18:49:28 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: "Kirill A. Shutemov" cc: Hugh Dickins , Andrew Morton , "Kirill A. Shutemov" , Andrea Arcangeli , Andres Lagar-Cavilla , Yang Shi , Ning Qu , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 09/31] huge tmpfs: avoid premature exposure of new pagetable In-Reply-To: <20160411115422.GF22996@node.shutemov.name> Message-ID: References: <20160411115422.GF22996@node.shutemov.name> User-Agent: Alpine 2.11 (LSU 23 2013-08-11) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 2767 Lines: 57 On Mon, 11 Apr 2016, Kirill A. Shutemov wrote: > On Tue, Apr 05, 2016 at 02:24:23PM -0700, Hugh Dickins wrote: > > > > That itself is not a problem on x86_64, but there's plenty more: > > how about those places which use pte_offset_map_lock() - if that > > spinlock is in the struct page of a pagetable, which has been > > deposited and might be withdrawn and freed at any moment (being > > on a list unattached to the allocating pmd in the case of x86), > > taking the spinlock might corrupt someone else's struct page. > > > > Because THP has departed from the earlier rules (when pagetable > > was only freed under exclusive mmap_sem, or at exit_mmap, after > > removing all affected vmas from the rmap list): zap_huge_pmd() > > does pte_free() even when serving MADV_DONTNEED under down_read > > of mmap_sem. > > Emm.. The pte table freed from zap_huge_pmd() is from deposit. It wasn't > linked into process' page table tree. So I don't see how THP has departed > from the rules. That's true at the time that it is freed: but my point was, that we don't know the past history of that pagetable, which might have been linked in and contained visible ptes very recently, without sufficient barriers in between. And in the x86 case (perhaps any non-powerpc case), there's no logical association between the pagetable freed and the place that it's freed from. Now, I've certainly not paged back in all the anxieties I had, and avenues I'd gone down, at the time that I first wrote that comment: it's quite possible that they were self-inflicted issues, and perhaps remnants of earlier ways in which I'd tried ordering it. I'm sure that I never reached any "hey, anon THP has got this wrong" conclusion, merely doubts, and surprise that it could free a pagetable there. But it looks as if all these ruminations here will vanish with the revert of the patch. Though one day I'll probably be worrying about it again, when I try to remove the need for mmap_sem protection around recovery's remap_team_by_pmd(). > > do_fault_around() presents one last problem: it wants pagetable to > > have been allocated, but was being called by do_read_fault() before > > __do_fault(). I see no disadvantage to moving it after, allowing huge > > pmd to be chosen first; but Kirill reports additional radix-tree lookup > > in hot pagecache case when he implemented faultaround: needs further > > investigation. > > In my implementation faultaround can establish PMD mappings. So there's no > disadvantage to call faultaround first. > > And if faultaround happened to solve the page fault we don't need to do > usual ->fault lookup. Sounds good, though not something I'll be looking to add in myself: feel free to add it, but maybe it fits easier with compound pages. Hugh