Received: by 2002:a25:d7c1:0:0:0:0:0 with SMTP id o184csp126884ybg; Tue, 22 Oct 2019 17:21:35 -0700 (PDT) X-Google-Smtp-Source: APXvYqx8Pscr0GgjlaD4JIOAVWPPRyzghoDCaYEGuuKStWZGuFrHfFt4fmDPeevhw8eGpuMElYSq X-Received: by 2002:a17:906:4448:: with SMTP id i8mr29234070ejp.298.1571790095262; Tue, 22 Oct 2019 17:21:35 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1571790095; cv=none; d=google.com; s=arc-20160816; b=qr/EqNgq8nt6f1mSOUCp8cJKr2U/dDNjQGx8GtFW5xtR3+ZmQW7UxJRAsnd0HAj/cU +ZXdm5DPwEmmwOzBTbDfwBjgQLDMu0Wr1he83036C2WtsogmTCjBPOojtliELVEEdSpG c2LoBnE5eMaK0HFVoc4N7wfhuIm6D8hs6fr7MReonwyV8obR/gh5Rh42gDgUNxFVro+A FUH5FRjgtZ9xOBQSfO1LFWPvKv7hU343Xi6H9Pozc5EP3dYSndBpzc6tNTabz7SWqOBX 4HNcyWWbXfZint9eemMDrG3Cn42+2OjqCQcpl4Z2O5h50kmT27kHVqObhe4+eHmBSCsp kqnw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-language :content-transfer-encoding:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject; bh=jCnnNs7/jrFS/aBDmiHpQd7/jCgCvlruTBM0gz/8e/M=; b=SJOvwW/MjSzRCd0ZYKB47DWxNe2uwRRuCHGupiptXUtgD0we5ptgJFiWNG1O4vOBkT cQKOt5ClJ/7I+BMRxVuteP3G7SqRsa5bFYHANUNF7YT6z6M2eOGA+8gxbHBsnUtKgFXU vWD157wgysJlvtSNw6w0oiZQDiJ8gRfY3Wl/u7l2u9K50uHFdhAEcN1tqu7GCvs5oPZ3 g4dARPh41O+vboNcX/9yFJwyVvHci6C+ytsU1UNLvtrPrZ8wi4frybUnxZlYraLyjfVC V0m2DBXqPFtmrwQu0GUbF9GvXIktK1Qk6RHYql/8f+Y3OoE6eOEP1GHm3Dv81zL1xDbC da/g== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id o12si13675947edc.279.2019.10.22.17.21.11; Tue, 22 Oct 2019 17:21:35 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389771AbfJVXis (ORCPT + 99 others); Tue, 22 Oct 2019 19:38:48 -0400 Received: from out30-44.freemail.mail.aliyun.com ([115.124.30.44]:48494 "EHLO out30-44.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1732854AbfJVXis (ORCPT ); Tue, 22 Oct 2019 19:38:48 -0400 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R151e4;CH=green;DM=||false|;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e07486;MF=yang.shi@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0TfwXB6r_1571787520; Received: from US-143344MP.local(mailfrom:yang.shi@linux.alibaba.com fp:SMTPD_---0TfwXB6r_1571787520) by smtp.aliyun-inc.com(127.0.0.1); Wed, 23 Oct 2019 07:38:43 +0800 Subject: Re: [PATCH] mm: thp: handle page cache THP correctly in PageTransCompoundMap To: Hugh Dickins Cc: aarcange@redhat.com, kirill.shutemov@linux.intel.com, gavin.dg@linux.alibaba.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org References: <1571769577-89735-1-git-send-email-yang.shi@linux.alibaba.com> From: Yang Shi Message-ID: <4ea5d015-19cb-d5d9-42f7-d1319d8de7c4@linux.alibaba.com> Date: Tue, 22 Oct 2019 16:38:32 -0700 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:52.0) Gecko/20100101 Thunderbird/52.7.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Content-Language: en-US Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 10/22/19 3:27 PM, Hugh Dickins wrote: > On Wed, 23 Oct 2019, Yang Shi wrote: > >> We have usecase to use tmpfs as QEMU memory backend and we would like to >> take the advantage of THP as well. But, our test shows the EPT is not >> PMD mapped even though the underlying THP are PMD mapped on host. >> The number showed by /sys/kernel/debug/kvm/largepage is much less than >> the number of PMD mapped shmem pages as the below: >> >> 7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted) >> Size: 4194304 kB >> [snip] >> AnonHugePages: 0 kB >> ShmemPmdMapped: 579584 kB >> [snip] >> Locked: 0 kB >> >> cat /sys/kernel/debug/kvm/largepages >> 12 >> >> And some benchmarks do worse than with anonymous THPs. >> >> By digging into the code we figured out that commit 127393fbe597 ("mm: >> thp: kvm: fix memory corruption in KVM with THP enabled") checks if >> there is a single PTE mapping on the page for anonymous THP when >> setting up EPT map. But, the _mapcount < 0 check doesn't fit to page >> cache THP since every subpage of page cache THP would get _mapcount >> inc'ed once it is PMD mapped, so PageTransCompoundMap() always returns >> false for page cache THP. This would prevent KVM from setting up PMD >> mapped EPT entry. >> >> So we need handle page cache THP correctly. However, when page cache >> THP's PMD gets split, kernel just remove the map instead of setting up >> PTE map like what anonymous THP does. Before KVM calls get_user_pages() >> the subpages may get PTE mapped even though it is still a THP since the >> page cache THP may be mapped by other processes at the mean time. >> >> Checking its _mapcount and whether the THP is double mapped or not since >> we can't tell if the single PTE mapping comes from the current process >> or not by _mapcount. Although this may report some false negative cases >> (PTE mapped by other processes), it looks not trivial to make this >> accurate. >> >> With this fix /sys/kernel/debug/kvm/largepage would show reasonable >> pages are PMD mapped by EPT as the below: >> >> 7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted) >> Size: 4194304 kB >> [snip] >> AnonHugePages: 0 kB >> ShmemPmdMapped: 557056 kB >> [snip] >> Locked: 0 kB >> >> cat /sys/kernel/debug/kvm/largepages >> 271 >> >> And the benchmarks are as same as anonymous THPs. >> >> Signed-off-by: Yang Shi >> Reported-by: Gang Deng >> Tested-by: Gang Deng >> Cc: Andrea Arcangeli >> Cc: Kirill A. Shutemov >> Cc: Hugh Dickins >> Cc: 4.8+ >> --- >> include/linux/page-flags.h | 54 ++++++++++++++++++++++++++++------------------ >> 1 file changed, 33 insertions(+), 21 deletions(-) >> >> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h >> index f91cb88..3b8e5c5 100644 >> --- a/include/linux/page-flags.h >> +++ b/include/linux/page-flags.h >> @@ -610,27 +610,6 @@ static inline int PageTransCompound(struct page *page) >> } >> >> /* >> - * PageTransCompoundMap is the same as PageTransCompound, but it also >> - * guarantees the primary MMU has the entire compound page mapped >> - * through pmd_trans_huge, which in turn guarantees the secondary MMUs >> - * can also map the entire compound page. This allows the secondary >> - * MMUs to call get_user_pages() only once for each compound page and >> - * to immediately map the entire compound page with a single secondary >> - * MMU fault. If there will be a pmd split later, the secondary MMUs >> - * will get an update through the MMU notifier invalidation through >> - * split_huge_pmd(). >> - * >> - * Unlike PageTransCompound, this is safe to be called only while >> - * split_huge_pmd() cannot run from under us, like if protected by the >> - * MMU notifier, otherwise it may result in page->_mapcount < 0 false >> - * positives. >> - */ >> -static inline int PageTransCompoundMap(struct page *page) >> -{ >> - return PageTransCompound(page) && atomic_read(&page->_mapcount) < 0; >> -} >> - >> -/* >> * PageTransTail returns true for both transparent huge pages >> * and hugetlbfs pages, so it should only be called when it's known >> * that hugetlbfs pages aren't involved. >> @@ -681,6 +660,39 @@ static inline int TestClearPageDoubleMap(struct page *page) >> return test_and_clear_bit(PG_double_map, &page[1].flags); >> } >> >> +/* >> + * PageTransCompoundMap is the same as PageTransCompound, but it also >> + * guarantees the primary MMU has the entire compound page mapped >> + * through pmd_trans_huge, which in turn guarantees the secondary MMUs >> + * can also map the entire compound page. This allows the secondary >> + * MMUs to call get_user_pages() only once for each compound page and >> + * to immediately map the entire compound page with a single secondary >> + * MMU fault. If there will be a pmd split later, the secondary MMUs >> + * will get an update through the MMU notifier invalidation through >> + * split_huge_pmd(). >> + * >> + * Unlike PageTransCompound, this is safe to be called only while >> + * split_huge_pmd() cannot run from under us, like if protected by the >> + * MMU notifier, otherwise it may result in page->_mapcount check false >> + * positives. >> + * >> + * We have to treat page cache THP differently since every subpage of it >> + * would get _mapcount inc'ed once it is PMD mapped. But, it may be PTE >> + * mapped in the current process so checking PageDoubleMap flag to rule >> + * this out. >> + */ >> +static inline int PageTransCompoundMap(struct page *page) >> +{ >> + bool pmd_mapped; >> + >> + if (PageAnon(page)) >> + pmd_mapped = atomic_read(&page->_mapcount) < 0; >> + else >> + pmd_mapped = atomic_read(&page->_mapcount) >= 0 && >> + !PageDoubleMap(compound_head(page)); >> + >> + return PageTransCompound(page) && pmd_mapped; >> +} >> #else >> TESTPAGEFLAG_FALSE(TransHuge) >> TESTPAGEFLAG_FALSE(TransCompound) >> -- >> 1.8.3.1 > I completely agree that the current PageTransCompoundMap() is wrong. > > A fix for that is one of many patches I've not yet got to upstreaming. > Comparing yours and mine, I'm worried by your use of PageDoubleMap(), > because really that's a flag for anon THP, and not properly supported > on shmem (or now I suppose file) THP - I forget the details, is it > that it sometimes gets set, but never cleared? Generally, we just > don't refer to PageDoubleMap() on shmem THPs (but there may be > exceptions: sorting out the THP mapcount maze, and eliminating > PageDoubleMap(), is one of my long-held ambitions, not yet reached). > > Here's the patch I've been carrying, but it's from earlier, so I > should warn that I've done no more than build-testing it on 5.4, > and I'm too far away from these issues at the moment to be able to > make a good judgement or argue for it - I hope you and others can > decide which patch is the better. I should also add that we're > barely using PageTransCompoundMap() at all: at best it can only > give a heuristic guess as to whether the page is pmd-mapped in > any particular case, and we preferred to take forward the KVM > patches we posted back in April 2016, plumbing hva down to where > it's needed - though of course those are somewhat different now. Thanks for catching this. I was definitely thinking about using compount_mapcount instead of DoubleMap flag when I was working the patch. I just simply thought it would change less file by using DoubleMap flag but I didn't notice it was kind of unbalanced for file THP. With the unbalanced DoubleMap flag, it sounds better to use compound_mapcount. Thanks for sharing your patch, I'm going to rework v2 by using compound_mapcount. Do you mind I might steal your patch? I'm supposed we'd better fix this bug regardless of whether you would like to move forward your KVM patches. > > [PATCH] mm: kvm: huge tmpfs: fix PageTransCompoundMap() > > 4.6 commit 127393fbe597 > ("mm: thp: kvm: fix memory corruption in KVM with THP enabled") > replaced the PageTransCompound() in KVM's transparent_hugepage_adjust() > by a new PageTransCompoundMap() which also checks page _mapcount; > while 4.8 commit dd78fedde4b9 ("rmap: support file thp") > was under development, making shmem file THP's use of page _mapcount > different from anon THP's (to avoid double-accounting in the tiresome > NR_FILE_MAPPED statistic, I believe). > > Nobody spotted the disconnect, whose consequence has been that KVM > could not recognize a shmem file THP, so could not optimize for it. > > Fix PageTransCompoundMap() to handle both cases. Unfortunately, its > definition has been placed (as you would suppose from the CamelCase) > in page-flags.h, but now it needs compound_mapcount_ptr() from mm.h. > My zest for header file refactoring has ebbed: instead, put identical > defines of it in both files - no build will get far if they disagree. > > Whether PageTransCompoundMap() is actually correct is another matter. > Good enough heuristic, erring on the safe side? Good enough heuristic > for anon, but not so good for shmem file (if a random other usage could > ever map some ptes)? > > Fixes: dd78fedde4b9 ("rmap: support file thp") > Signed-off-by: Hugh Dickins > --- > > include/linux/mm.h | 6 ++---- > include/linux/page-flags.h | 20 +++++++++++++++++--- > 2 files changed, 19 insertions(+), 7 deletions(-) > > --- 5.4-rc4/include/linux/mm.h 2019-10-06 17:20:18.473153411 -0700 > +++ linux/include/linux/mm.h 2019-10-22 13:00:30.697944228 -0700 > @@ -695,10 +695,8 @@ static inline void *kvcalloc(size_t n, s > > extern void kvfree(const void *addr); > > -static inline atomic_t *compound_mapcount_ptr(struct page *page) > -{ > - return &page[1].compound_mapcount; > -} > +/* #define in order to validate duplicate definition in page-flags.h */ > +#define compound_mapcount_ptr(head) (&(head)[1].compound_mapcount) > > static inline int compound_mapcount(struct page *page) > { > --- 5.4-rc4/include/linux/page-flags.h 2019-09-15 14:19:32.000000000 -0700 > +++ linux/include/linux/page-flags.h 2019-10-22 13:00:30.697944228 -0700 > @@ -609,6 +609,9 @@ static inline int PageTransCompound(stru > return PageCompound(page); > } > > +/* #define in order to validate duplicate definition in mm.h */ > +#define compound_mapcount_ptr(head) (&(head)[1].compound_mapcount) > + > /* > * PageTransCompoundMap is the same as PageTransCompound, but it also > * guarantees the primary MMU has the entire compound page mapped > @@ -622,12 +625,23 @@ static inline int PageTransCompound(stru > * > * Unlike PageTransCompound, this is safe to be called only while > * split_huge_pmd() cannot run from under us, like if protected by the > - * MMU notifier, otherwise it may result in page->_mapcount < 0 false > - * positives. > + * MMU notifier, otherwise it may result in false positives. > */ > static inline int PageTransCompoundMap(struct page *page) > { > - return PageTransCompound(page) && atomic_read(&page->_mapcount) < 0; > + struct page *head; > + > + if (!PageTransCompound(page)) > + return 0; > + if (PageAnon(page)) > + return atomic_read(&page->_mapcount) < 0; > + /* > + * pmd mappings are not included in anon THP's _mapcounts, > + * but they are included in shmem file THP's _mapcounts. > + */ > + head = compound_head(page); > + return atomic_read(&page->_mapcount) == > + atomic_read(compound_mapcount_ptr(head)); > } > > /*