Subject: Re: [PATCHv2 6/8] khugepaged: Allow to collapse PTE-mapped compound pages
To: "Kirill A. Shutemov"
Cc: akpm@linux-foundation.org, Andrea Arcangeli, Zi Yan, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, "Kirill A. Shutemov"
References: <20200403112928.19742-1-kirill.shutemov@linux.intel.com>
 <20200403112928.19742-7-kirill.shutemov@linux.intel.com>
 <10820b6f-f15a-1768-3aca-d0f464185b5e@linux.alibaba.com>
 <20200408132948.axaz3ytquoulaazp@box>
From: Yang Shi
Message-ID: <90b803eb-e3ba-bcf7-7987-6f0e33f76d4c@linux.alibaba.com>
Date: Wed, 8 Apr 2020 11:57:27 -0700
In-Reply-To: <20200408132948.axaz3ytquoulaazp@box>

On 4/8/20 6:29 AM, Kirill A. Shutemov wrote:
> On Mon, Apr 06, 2020 at 02:29:35PM -0700, Yang Shi wrote:
>>
>> On 4/3/20 4:29 AM, Kirill A. Shutemov wrote:
>>> We can collapse PTE-mapped compound pages. We only need to avoid
>>> handling them more than once: lock/unlock page only once if it's present
>>> in the PMD range multiple times as it handled on compound level. The
>>> same goes for LRU isolation and putback.
>>>
>>> Signed-off-by: Kirill A. Shutemov
>>> ---
>>>  mm/khugepaged.c | 103 ++++++++++++++++++++++++++++++++----------------
>>>  1 file changed, 68 insertions(+), 35 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 1e7e6543ebca..49e56e4e30d1 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -515,23 +515,37 @@ void __khugepaged_exit(struct mm_struct *mm)
>>>  static void release_pte_page(struct page *page)
>>>  {
>>> -	dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));
>>> +	mod_node_page_state(page_pgdat(page),
>>> +			NR_ISOLATED_ANON + page_is_file_cache(page),
>>> +			-compound_nr(page));
>>>  	unlock_page(page);
>>>  	putback_lru_page(page);
>>>  }
>>> -static void release_pte_pages(pte_t *pte, pte_t *_pte)
>>> +static void release_pte_pages(pte_t *pte, pte_t *_pte,
>>> +		struct list_head *compound_pagelist)
>>>  {
>>> +	struct page *page, *tmp;
>>> +
>>>  	while (--_pte >= pte) {
>>>  		pte_t pteval = *_pte;
>>> -		if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)))
>>> -			release_pte_page(pte_page(pteval));
>>> +
>>> +		page = pte_page(pteval);
>>> +		if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)) &&
>>> +				!PageCompound(page))
>>> +			release_pte_page(page);
>>> +	}
>>> +
>>> +	list_for_each_entry_safe(page, tmp, compound_pagelist, lru) {
>>> +		list_del(&page->lru);
>>> +		release_pte_page(page);
>>>  	}
>>>  }
>>>  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>  					unsigned long address,
>>> -					pte_t *pte)
>>> +					pte_t *pte,
>>> +					struct list_head *compound_pagelist)
>>>  {
>>>  	struct page *page = NULL;
>>>  	pte_t *_pte;
>>> @@ -561,13 +575,21 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>  			goto out;
>>>  		}
>>> -		/* TODO: teach khugepaged to collapse THP mapped with pte */
>>> +		VM_BUG_ON_PAGE(!PageAnon(page), page);
>>> +
>>>  		if (PageCompound(page)) {
>>> -			result = SCAN_PAGE_COMPOUND;
>>> -			goto out;
>>> -		}
>>> +			struct page *p;
>>> +			page = compound_head(page);
>>> -		VM_BUG_ON_PAGE(!PageAnon(page), page);
>>> +			/*
>>> +			 * Check if we have dealt with the compound page
>>> +			 * already
>>> +			 */
>>> +			list_for_each_entry(p, compound_pagelist, lru) {
>>> +				if (page == p)
>>> +					goto next;
>>> +			}
>>> +		}
>>>  		/*
>>>  		 * We can do it before isolate_lru_page because the
>>> @@ -597,19 +619,15 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>  			result = SCAN_PAGE_COUNT;
>>>  			goto out;
>>>  		}
>>> -		if (pte_write(pteval)) {
>>> -			writable = true;
>>> -		} else {
>>> -			if (PageSwapCache(page) &&
>>> -			    !reuse_swap_page(page, NULL)) {
>>> -				unlock_page(page);
>>> -				result = SCAN_SWAP_CACHE_PAGE;
>>> -				goto out;
>>> -			}
>>> +		if (!pte_write(pteval) && PageSwapCache(page) &&
>>> +				!reuse_swap_page(page, NULL)) {
>>>  			/*
>>> -			 * Page is not in the swap cache. It can be collapsed
>>> -			 * into a THP.
>>> +			 * Page is in the swap cache and cannot be re-used.
>>> +			 * It cannot be collapsed into a THP.
>>>  			 */
>>> +			unlock_page(page);
>>> +			result = SCAN_SWAP_CACHE_PAGE;
>>> +			goto out;
>>>  		}
>>>  		/*
>>> @@ -621,16 +639,23 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>  			result = SCAN_DEL_PAGE_LRU;
>>>  			goto out;
>>>  		}
>>> -		inc_node_page_state(page,
>>> -				NR_ISOLATED_ANON + page_is_file_cache(page));
>>> +		mod_node_page_state(page_pgdat(page),
>>> +				NR_ISOLATED_ANON + page_is_file_cache(page),
>>> +				compound_nr(page));
>>>  		VM_BUG_ON_PAGE(!PageLocked(page), page);
>>>  		VM_BUG_ON_PAGE(PageLRU(page), page);
>>> +		if (PageCompound(page))
>>> +			list_add_tail(&page->lru, compound_pagelist);
>>> +next:
>>>  		/* There should be enough young pte to collapse the page */
>>>  		if (pte_young(pteval) ||
>>>  		    page_is_young(page) || PageReferenced(page) ||
>>>  		    mmu_notifier_test_young(vma->vm_mm, address))
>>>  			referenced++;
>>> +
>>> +		if (pte_write(pteval))
>>> +			writable = true;
>>>  	}
>>>  	if (likely(writable)) {
>>>  		if (likely(referenced)) {
>>> @@ -644,7 +669,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>  	}
>>>  out:
>>> -	release_pte_pages(pte, _pte);
>>> +	release_pte_pages(pte, _pte, compound_pagelist);
>>>  	trace_mm_collapse_huge_page_isolate(page, none_or_zero,
>>>  					    referenced, writable, result);
>>>  	return 0;
>>> @@ -653,13 +678,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>  static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>>>  				      struct vm_area_struct *vma,
>>>  				      unsigned long address,
>>> -				      spinlock_t *ptl)
>>> +				      spinlock_t *ptl,
>>> +				      struct list_head *compound_pagelist)
>>>  {
>>> +	struct page *src_page, *tmp;
>>>  	pte_t *_pte;
>>>  	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>>>  				_pte++, page++, address += PAGE_SIZE) {
>>>  		pte_t pteval = *_pte;
>>> -		struct page *src_page;
>>>  		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>>  			clear_user_highpage(page, address);
>>> @@ -679,7 +705,6 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>>>  		} else {
>>>  			src_page = pte_page(pteval);
>>>  			copy_user_highpage(page, src_page, address, vma);
>>> -			release_pte_page(src_page);
>>>  			/*
>>>  			 * ptl mostly unnecessary, but preempt has to
>>>  			 * be disabled to update the per-cpu stats
>>> @@ -693,9 +718,18 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>>>  			pte_clear(vma->vm_mm, address, _pte);
>>>  			page_remove_rmap(src_page, false);
>>>  			spin_unlock(ptl);
>>> -			free_page_and_swap_cache(src_page);
>>> +			if (!PageCompound(src_page)) {
>>> +				release_pte_page(src_page);
>>> +				free_page_and_swap_cache(src_page);
>>> +			}
>>>  		}
>>>  	}
>>> +
>>> +	list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
>>> +		list_del(&src_page->lru);
>>> +		release_pte_page(src_page);
>>> +		free_page_and_swap_cache(src_page);
>>> +	}
>>
>> It looks this may mess up the PTE-mapped THP's refcount if it has multiple
>> PTE-mapped subpages since put_page() is not called for every PTE-mapped
>> subpages.
>
> Good catch!
>
> I *think* something like this should fix the issue (untested):

Is this an incremental fix?

>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index bfb6155f1d69..9a96f9bff798 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -570,6 +570,7 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>  	list_for_each_entry_safe(page, tmp, compound_pagelist, lru) {
>  		list_del(&page->lru);
>  		release_pte_page(page);
> +		put_page(page);
>  	}
>  }
>
> @@ -682,8 +683,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		VM_BUG_ON_PAGE(!PageLocked(page), page);
>  		VM_BUG_ON_PAGE(PageLRU(page), page);
>
> -		if (PageCompound(page))
> +		if (PageCompound(page)) {
>  			list_add_tail(&page->lru, compound_pagelist);
> +			get_page(page);

I don't get why we have to bump the refcount here. The THP won't go away
since it is pinned by the isolation. I suppose calling
free_page_and_swap_cache(src_page) in each loop iteration is good enough to
drop the refcount for each PTE-mapped subpage, then calling
release_pte_page() for each entry on compound_pagelist outside the loop
releases the pin from the isolation. The compound page would get freed via
pagevec finally (see the sketch after the quoted hunks below).

> +		}
> next:
>  		/* There should be enough young pte to collapse the page */
>  		if (pte_young(pteval) ||
> @@ -742,6 +745,8 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>  		} else {
>  			src_page = pte_page(pteval);
>  			copy_user_highpage(page, src_page, address, vma);
> +			if (!PageCompound(src_page))
> +				release_pte_page(src_page);
>  			/*
>  			 * ptl mostly unnecessary, but preempt has to
>  			 * be disabled to update the per-cpu stats
> @@ -755,10 +760,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>  			pte_clear(vma->vm_mm, address, _pte);
>  			page_remove_rmap(src_page, false);
>  			spin_unlock(ptl);
> -			if (!PageCompound(src_page)) {
> -				release_pte_page(src_page);
> -				free_page_and_swap_cache(src_page);
> -			}
> +			free_page_and_swap_cache(src_page);
>  		}
>  	}
>
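To illustrate what I mean, here is a rough, untested sketch of the tail of
__collapse_huge_page_copy() with that accounting (only the refcount-relevant
lines are shown; it reuses the helpers from the patch, so take it as an
illustration of the idea rather than a tested change):

		} else {
			src_page = pte_page(pteval);
			copy_user_highpage(page, src_page, address, vma);
			/* non-compound page: drop the isolation pin here */
			if (!PageCompound(src_page))
				release_pte_page(src_page);
			spin_lock(ptl);
			pte_clear(vma->vm_mm, address, _pte);
			page_remove_rmap(src_page, false);
			spin_unlock(ptl);
			/* one put_page() per PTE, compound or not */
			free_page_and_swap_cache(src_page);
		}
	}

	/* drop the isolation pin once per PTE-mapped THP, outside the loop */
	list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
		list_del(&src_page->lru);
		release_pte_page(src_page);
	}

That way every PTE that mapped the THP contributes exactly one put_page()
via free_page_and_swap_cache(), and the single pin taken at isolation time
is dropped once per compound page, so no extra get_page()/put_page() pair
should be needed.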