Date: Wed, 6 Apr 2016 19:05:20 -0700 (PDT)
From: Hugh Dickins
To: Mika Penttila
Cc: Hugh Dickins, Andrew Morton, "Kirill A. Shutemov", Andrea Arcangeli,
    Andres Lagar-Cavilla, Yang Shi, Ning Qu,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 23/31] huge tmpfs recovery: framework for reconstituting huge pages
In-Reply-To: <5704E4D2.5020808@nextfour.com>
References: <5704E4D2.5020808@nextfour.com>

On Wed, 6 Apr 2016, Mika Penttila wrote:
> On 04/06/2016 12:53 AM, Hugh Dickins wrote:
> > +static void shmem_recovery_work(struct work_struct *work)
...
> > +
> > +	if (head) {
> > +		/* We are resuming work from a previous partial recovery */
> > +		if (PageTeam(page))
> > +			shr_stats(resume_teamed);
> > +		else
> > +			shr_stats(resume_tagged);
> > +	} else {
> > +		gfp_t gfp = mapping_gfp_mask(mapping);
> > +		/*
> > +		 * XXX: Note that with swapin readahead, page_to_nid(page) will
> > +		 * often choose an unsuitable NUMA node: something to fix soon,
> > +		 * but not an immediate blocker.
> > +		 */
> > +		head = __alloc_pages_node(page_to_nid(page),
> > +			gfp | __GFP_NOWARN | __GFP_THISNODE, HPAGE_PMD_ORDER);
> > +		if (!head) {
> > +			shr_stats(huge_failed);
> > +			error = -ENOMEM;
> > +			goto out;
> > +		}
>
> Should this head be marked PageTeam?
> Because in patch 27/31, when it is given as a hint to shmem_getpage_gfp():
>
>  	hugehint = NULL;
> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> +	    sgp == SGP_TEAM && *pagep) {
> +		struct page *head;
> +
> +		if (!get_page_unless_zero(*pagep)) {
> +			error = -ENOENT;
> +			goto decused;
> +		}
> +		page = *pagep;
> +		lock_page(page);
> +		head = page - (index & (HPAGE_PMD_NR-1));
>
> we always fail, because:
>
> +		if (!PageTeam(head)) {
> +			error = -ENOENT;
> +			goto decused;
> +		}

Great observation, thank you Mika.

We don't always fail, because in most cases the page wanted for the head
will either be already in memory, or read in from swap, so that SGP_TEAM
block in shmem_getpage_gfp() (with the -ENOENT you show) does not come
into play on it: shmem_recovery_populate() then does its
!recovery->exposed_team SetPageTeam(head) and all is well from then on.

But I think what you point out means that the current recovery code is
incapable of assembling a hugepage if its first page was not already
instantiated earlier: not something I'd realized until you showed me.
Not a common failing, and it would never be the case for an extent which
had been mapped huge in the past, but it's certainly not what I'd intended.

As to whether the head should be marked PageTeam immediately after the
hugepage allocation: I think not, especially because of the swapin case
(26/31). Swapin may need to read data from disk into that head page, and
I've never had to think about the consequences of having a swap page
marked PageTeam. Perhaps it would work out okay, but I'd prefer not to
go there. At this moment I'm too tired to think what the right answer
will be, and I certainly won't commit to one without some testing.

So, not as incapacitating as perhaps you thought, and not any danger to
people trying out huge tmpfs, but definitely something to be fixed:
I'll mull it over in the background and let you know when I'm sure.

Thank you again,
Hugh