From: "Huang, Ying"
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying,
 "Kirill A. Shutemov", Andrea Arcangeli, Michal Hocko, Johannes Weiner,
 Shaohua Li, Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
 Naoya Horiguchi, Zi Yan, Daniel Jordan
Subject: [PATCH -mm -v4 09/21] mm, THP, swap: Swapin a THP as a whole
Date: Fri, 22 Jun 2018 11:51:39 +0800
Message-Id: <20180622035151.6676-10-ying.huang@intel.com>
X-Mailer: git-send-email 2.16.4
In-Reply-To: <20180622035151.6676-1-ying.huang@intel.com>
References: <20180622035151.6676-1-ying.huang@intel.com>

From: Huang Ying

With this patch, when the page fault handler finds a PMD swap mapping, it
swaps in a THP as a whole. This avoids the overhead of splitting/collapsing
the THP before/after swapping, and greatly improves swap performance thanks
to the reduced page fault count, among other benefits.

do_huge_pmd_swap_page() is added in this patch to implement this. It is
similar to do_swap_page(), which handles normal page swapin. If allocating
a THP fails, the huge swap cluster and the PMD swap mapping are split so
that we fall back to normal page swapin. If the huge swap cluster has
already been split, only the PMD swap mapping is split before falling back
to normal page swapin.

Signed-off-by: "Huang, Ying"
Cc: "Kirill A. Shutemov"
Cc: Andrea Arcangeli
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Shaohua Li
Cc: Hugh Dickins
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Dave Hansen
Cc: Naoya Horiguchi
Cc: Zi Yan
Cc: Daniel Jordan
---

[A standalone sketch of the swapin fallback decision flow is appended after
the patch, below.]

 include/linux/huge_mm.h |   9 +++
 include/linux/swap.h    |   9 +++
 mm/huge_memory.c        | 170 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/memory.c             |  16 +++--
 4 files changed, 198 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c5b8af173f67..42117b75de2d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -403,4 +403,13 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#ifdef CONFIG_THP_SWAP
+extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
+#else /* CONFIG_THP_SWAP */
+static inline int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
+{
+	return 0;
+}
+#endif /* CONFIG_THP_SWAP */
+
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d2e017dd7bbd..5832a750baed 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -560,6 +560,15 @@ static inline struct page *lookup_swap_cache(swp_entry_t swp,
 	return NULL;
 }
 
+static inline struct page *read_swap_cache_async(swp_entry_t swp,
+						 gfp_t gft_mask,
+						 struct vm_area_struct *vma,
+						 unsigned long addr,
+						 bool do_poll)
+{
+	return NULL;
+}
+
 static inline int add_to_swap(struct page *page)
 {
 	return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 275a4e616ec9..ac79ae2ab257 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -33,6 +33,8 @@
 #include
 #include
 #include
+#include
+#include
 
 #include
 #include
@@ -1609,6 +1611,174 @@ static void __split_huge_swap_pmd(struct vm_area_struct *vma,
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
 }
+
+static int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+			       unsigned long address, pmd_t orig_pmd)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	ptl = pmd_lock(mm, pmd);
+	if (pmd_same(*pmd, orig_pmd))
+		__split_huge_swap_pmd(vma, address & HPAGE_PMD_MASK, pmd);
+	else
+		ret = -ENOENT;
+	spin_unlock(ptl);
+
+	return ret;
+}
+
+int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
+{
+	struct page *page;
+	struct mem_cgroup *memcg;
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	swp_entry_t entry;
+	pmd_t pmd;
+	int i, locked, exclusive = 0, ret = 0;
+
+	entry = pmd_to_swp_entry(orig_pmd);
+	VM_BUG_ON(non_swap_entry(entry));
+	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
+retry:
+	page = lookup_swap_cache(entry, NULL, vmf->address);
+	if (!page) {
+		page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, vma,
+					     haddr, false);
+		if (!page) {
+			/*
+			 * Back out if somebody else faulted in this pmd
+			 * while we released the pmd lock.
+			 */
+			if (likely(pmd_same(*vmf->pmd, orig_pmd))) {
+				ret = split_swap_cluster(entry, false);
+				/*
+				 * Retry if somebody else swap in the swap
+				 * entry
+				 */
+				if (ret == -EEXIST) {
+					ret = 0;
+					goto retry;
+				/* swapoff occurs under us */
+				} else if (ret == -EINVAL)
+					ret = 0;
+				else
+					goto fallback;
+			}
+			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+			goto out;
+		}
+
+		/* Had to read the page from swap area: Major fault */
+		ret = VM_FAULT_MAJOR;
+		count_vm_event(PGMAJFAULT);
+		count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
+	} else if (!PageTransCompound(page))
+		goto fallback;
+
+	locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
+
+	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+	if (!locked) {
+		ret |= VM_FAULT_RETRY;
+		goto out_release;
+	}
+
+	/*
+	 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
+	 * release the swapcache from under us. The page pin, and pmd_same
+	 * test below, are not enough to exclude that. Even if it is still
+	 * swapcache, we need to check that the page's swap has not changed.
+	 */
+	if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
+		goto out_page;
+
+	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL,
+				  &memcg, true)) {
+		ret = VM_FAULT_OOM;
+		goto out_page;
+	}
+
+	/*
+	 * Back out if somebody else already faulted in this pmd.
+	 */
+	vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
+	spin_lock(vmf->ptl);
+	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
+		goto out_nomap;
+
+	if (unlikely(!PageUptodate(page))) {
+		ret = VM_FAULT_SIGBUS;
+		goto out_nomap;
+	}
+
+	/*
+	 * The page isn't present yet, go ahead with the fault.
+	 *
+	 * Be careful about the sequence of operations here.
+	 * To get its accounting right, reuse_swap_page() must be called
+	 * while the page is counted on swap but not yet in mapcount i.e.
+	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
+	 * must be called after the swap_free(), or it will never succeed.
+	 */
+
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+	pmd = mk_huge_pmd(page, vma->vm_page_prot);
+	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
+		pmd = maybe_pmd_mkwrite(pmd_mkdirty(pmd), vma);
+		vmf->flags &= ~FAULT_FLAG_WRITE;
+		ret |= VM_FAULT_WRITE;
+		exclusive = RMAP_EXCLUSIVE;
+	}
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		flush_icache_page(vma, page + i);
+	if (pmd_swp_soft_dirty(orig_pmd))
+		pmd = pmd_mksoft_dirty(pmd);
+	do_page_add_anon_rmap(page, vma, haddr,
+			      exclusive | RMAP_COMPOUND);
+	mem_cgroup_commit_charge(page, memcg, true, true);
+	activate_page(page);
+	set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
+
+	swap_free(entry, true);
+	if (mem_cgroup_swap_full(page) ||
+	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
+		try_to_free_swap(page);
+	unlock_page(page);
+
+	if (vmf->flags & FAULT_FLAG_WRITE) {
+		ret |= do_huge_pmd_wp_page(vmf, pmd);
+		if (ret & VM_FAULT_ERROR)
+			ret &= VM_FAULT_ERROR;
+		goto out;
+	}
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
+	spin_unlock(vmf->ptl);
+out:
+	return ret;
+out_nomap:
+	mem_cgroup_cancel_charge(page, memcg, true);
+	spin_unlock(vmf->ptl);
+out_page:
+	unlock_page(page);
+out_release:
+	put_page(page);
+	return ret;
+fallback:
+	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+	if (!split_huge_swap_pmd(vmf->vma, vmf->pmd, vmf->address, orig_pmd))
+		ret = VM_FAULT_FALLBACK;
+	else
+		ret = 0;
+	if (page)
+		put_page(page);
+	return ret;
+}
 #else
 static inline void __split_huge_swap_pmd(struct vm_area_struct *vma,
 					 unsigned long haddr,
diff --git a/mm/memory.c b/mm/memory.c
index 55e278bb59ee..2125035b6a70 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4072,13 +4072,17 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		barrier();
 		if (unlikely(is_swap_pmd(orig_pmd))) {
-			VM_BUG_ON(thp_migration_supported() &&
-				  !is_pmd_migration_entry(orig_pmd));
-			if (is_pmd_migration_entry(orig_pmd))
+			if (thp_migration_supported() &&
+			    is_pmd_migration_entry(orig_pmd)) {
 				pmd_migration_entry_wait(mm, vmf.pmd);
-			return 0;
-		}
-		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
+				return 0;
+			} else if (thp_swap_supported()) {
+				ret = do_huge_pmd_swap_page(&vmf, orig_pmd);
+				if (!(ret & VM_FAULT_FALLBACK))
+					return ret;
+			} else
+				VM_BUG_ON(1);
+		} else if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
 			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&vmf, orig_pmd);
-- 
2.16.4
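
For reviewers who want to reason about the error handling without walking the
whole function, here is a minimal, standalone user-space sketch of the decision
do_huge_pmd_swap_page() makes when no THP is in the swap cache and none can be
read in: the return value of split_swap_cluster() decides whether we retry the
swap cache lookup, treat the fault as handled (swapoff ran under us), or fall
back to splitting the PMD swap mapping and doing normal page swapin. This is
only an illustration, not kernel code; the enum names, handle_split_result()
and the main() driver are invented for the example.

#include <errno.h>
#include <stdio.h>

/* Possible outcomes, mirroring the labels used in do_huge_pmd_swap_page(). */
enum fault_action {
	ACTION_RETRY,		/* goto retry: look up the swap cache again */
	ACTION_DONE,		/* return 0: swapoff occurred under us */
	ACTION_FALLBACK,	/* goto fallback: split PMD, normal swapin */
};

/*
 * Model of the "if (!page)" branch above: split_swap_cluster() returning
 * -EEXIST means somebody else swapped the entry in, so retry the lookup;
 * -EINVAL means swapoff occurred under us, so the fault is done;
 * any other value (including success) falls back to normal page swapin.
 */
static enum fault_action handle_split_result(int ret)
{
	if (ret == -EEXIST)
		return ACTION_RETRY;
	if (ret == -EINVAL)
		return ACTION_DONE;
	return ACTION_FALLBACK;
}

int main(void)
{
	int results[] = { 0, -EEXIST, -EINVAL };
	unsigned int i;

	for (i = 0; i < sizeof(results) / sizeof(results[0]); i++)
		printf("split_swap_cluster() = %d -> action %d\n",
		       results[i], handle_split_result(results[i]));
	return 0;
}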