Received: by 2002:ac0:a594:0:0:0:0:0 with SMTP id m20-v6csp560728imm; Wed, 23 May 2018 01:32:33 -0700 (PDT) X-Google-Smtp-Source: AB8JxZrvCF9OKtMFG3ydCdTHpiAjkCkufDhqeElc6gc/u+s4ODfV49bqiIxksCQt2d+Sk1i05YeI X-Received: by 2002:a63:4143:: with SMTP id o64-v6mr1575253pga.280.1527064353262; Wed, 23 May 2018 01:32:33 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1527064353; cv=none; d=google.com; s=arc-20160816; b=pBX6x7L0Wp/1YUwUdmLs7lY+wD8sRXVepXdqEF4lQ3WtkPul3fH3wOv8OCkTU6GcPl /Cc+9cPuusuz8tobzqjdcAIoPj1H6F+kf/0VFD7NUUvV9YQLCy0v1V895IxUayzs9i5f y0tJZAIVi7ocKkmPzAyTzopU4GWa5bdR/qZ+lFKC9a8qXiNS6DSwUD+5vRYIV8PfWh0k prU+6X0oSoneFRBVWTs7eelzK8OovrSkUzd7i4i7Ch8mMMWEER+zh3pG51zc+2xziOr+ gpiSXuATG11dXXk+T+pcDaCkQIqkG/i6ykgyiVr0/+Eg+IbdSkiTuobxdkoLeFdrQ+E1 RvdQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:references:in-reply-to:message-id:date :subject:cc:to:from:arc-authentication-results; bh=CyLD0Z/8x0RrK2peMAul8p2SRPapy9FdhRZkd4uRZTw=; b=s7ggJbCYXFan8HpVU8T/HQXnwj1OshsXqVb8hsSMpxjgZyK6lCu2pBmKDJ4xky0IPY aqa0wp/1mMhlLHnZNn+tCOjaqT7ljy9dBTD7dmt3VZa1IhI/kzq1z8OyZgZAUsoDxdvQ 1QKpXGu08CSE0vlsvHHn9102l+4veAjg7s3t+TycZEN+huGhUcef5Ttf5H/fff0KmxxX VVCxgbAw99yGg52lPy0cesjRNgWIAkjyN8ovsXW5GXHhSedLGA5z4h2xrFPoBzQGYgD8 MahscZWJKASIz7wRn4mnc9KSxrpXRwW2Jp7GrwGpJPC17G6gPNbk5lzxGjn7i+FHnY3M N5pA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id i72-v6si14288432pgd.211.2018.05.23.01.32.18; Wed, 23 May 2018 01:32:33 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754577AbeEWIbh (ORCPT + 99 others); Wed, 23 May 2018 04:31:37 -0400 Received: from mga03.intel.com ([134.134.136.65]:7720 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754516AbeEWI04 (ORCPT ); Wed, 23 May 2018 04:26:56 -0400 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga001.fm.intel.com ([10.253.24.23]) by orsmga103.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 23 May 2018 01:26:55 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.49,432,1520924400"; d="scan'208";a="57726075" Received: from yhuang6-ux31a.sh.intel.com ([10.239.197.97]) by fmsmga001.fm.intel.com with ESMTP; 23 May 2018 01:26:53 -0700 From: "Huang, Ying" To: Andrew Morton Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , "Kirill A. Shutemov" , Andrea Arcangeli , Michal Hocko , Johannes Weiner , Shaohua Li , Hugh Dickins , Minchan Kim , Rik van Riel , Dave Hansen , Naoya Horiguchi , Zi Yan Subject: [PATCH -mm -V3 09/21] mm, THP, swap: Swapin a THP as a whole Date: Wed, 23 May 2018 16:26:13 +0800 Message-Id: <20180523082625.6897-10-ying.huang@intel.com> X-Mailer: git-send-email 2.16.1 In-Reply-To: <20180523082625.6897-1-ying.huang@intel.com> References: <20180523082625.6897-1-ying.huang@intel.com> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Huang Ying With this patch, when page fault handler find a PMD swap mapping, it will swap in a THP as a whole. This avoids the overhead of splitting/collapsing before/after the THP swapping. And improves the swap performance greatly for reduced page fault count etc. do_huge_pmd_swap_page() is added in the patch to implement this. It is similar to do_swap_page() for normal page swapin. If failing to allocate a THP, the huge swap cluster and the PMD swap mapping will be split to fallback to normal page swapin. If the huge swap cluster has been split already, the PMD swap mapping will be split to fallback to normal page swapin. Signed-off-by: "Huang, Ying" Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan --- include/linux/huge_mm.h | 9 +++ include/linux/swap.h | 9 +++ mm/huge_memory.c | 170 ++++++++++++++++++++++++++++++++++++++++++++++++ mm/memory.c | 16 +++-- 4 files changed, 198 insertions(+), 6 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 0dbfbe34b01a..f5348d072351 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -402,4 +402,13 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma) } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#ifdef CONFIG_THP_SWAP +extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd); +#else /* CONFIG_THP_SWAP */ +static inline int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd) +{ + return 0; +} +#endif /* CONFIG_THP_SWAP */ + #endif /* _LINUX_HUGE_MM_H */ diff --git a/include/linux/swap.h b/include/linux/swap.h index d2e017dd7bbd..5832a750baed 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -560,6 +560,15 @@ static inline struct page *lookup_swap_cache(swp_entry_t swp, return NULL; } +static inline struct page *read_swap_cache_async(swp_entry_t swp, + gfp_t gft_mask, + struct vm_area_struct *vma, + unsigned long addr, + bool do_poll) +{ + return NULL; +} + static inline int add_to_swap(struct page *page) { return 0; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 3975d824b4ed..8303fa021c42 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -33,6 +33,8 @@ #include #include #include +#include +#include #include #include @@ -1609,6 +1611,174 @@ static void __split_huge_swap_pmd(struct vm_area_struct *vma, smp_wmb(); /* make pte visible before pmd */ pmd_populate(mm, pmd, pgtable); } + +static int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd, + unsigned long address, pmd_t orig_pmd) +{ + struct mm_struct *mm = vma->vm_mm; + spinlock_t *ptl; + int ret = 0; + + ptl = pmd_lock(mm, pmd); + if (pmd_same(*pmd, orig_pmd)) + __split_huge_swap_pmd(vma, address & HPAGE_PMD_MASK, pmd); + else + ret = -ENOENT; + spin_unlock(ptl); + + return ret; +} + +int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd) +{ + struct page *page; + struct mem_cgroup *memcg; + struct vm_area_struct *vma = vmf->vma; + unsigned long haddr = vmf->address & HPAGE_PMD_MASK; + swp_entry_t entry; + pmd_t pmd; + int i, locked, exclusive = 0, ret = 0; + + entry = pmd_to_swp_entry(orig_pmd); + VM_BUG_ON(non_swap_entry(entry)); + delayacct_set_flag(DELAYACCT_PF_SWAPIN); +retry: + page = lookup_swap_cache(entry, NULL, vmf->address); + if (!page) { + page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, vma, + haddr, false); + if (!page) { + /* + * Back out if somebody else faulted in this pmd + * while we released the pmd lock. + */ + if (likely(pmd_same(*vmf->pmd, orig_pmd))) { + ret = split_swap_cluster(entry, false); + /* + * Retry if somebody else swap in the swap + * entry + */ + if (ret == -EEXIST) { + ret = 0; + goto retry; + /* swapoff occurs under us */ + } else if (ret == -EINVAL) + ret = 0; + else + goto fallback; + } + delayacct_clear_flag(DELAYACCT_PF_SWAPIN); + goto out; + } + + /* Had to read the page from swap area: Major fault */ + ret = VM_FAULT_MAJOR; + count_vm_event(PGMAJFAULT); + count_memcg_event_mm(vma->vm_mm, PGMAJFAULT); + } else if (!PageTransCompound(page)) + goto fallback; + + locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags); + + delayacct_clear_flag(DELAYACCT_PF_SWAPIN); + if (!locked) { + ret |= VM_FAULT_RETRY; + goto out_release; + } + + /* + * Make sure try_to_free_swap or reuse_swap_page or swapoff did not + * release the swapcache from under us. The page pin, and pmd_same + * test below, are not enough to exclude that. Even if it is still + * swapcache, we need to check that the page's swap has not changed. + */ + if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val)) + goto out_page; + + if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, + &memcg, true)) { + ret = VM_FAULT_OOM; + goto out_page; + } + + /* + * Back out if somebody else already faulted in this pmd. + */ + vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd); + spin_lock(vmf->ptl); + if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) + goto out_nomap; + + if (unlikely(!PageUptodate(page))) { + ret = VM_FAULT_SIGBUS; + goto out_nomap; + } + + /* + * The page isn't present yet, go ahead with the fault. + * + * Be careful about the sequence of operations here. + * To get its accounting right, reuse_swap_page() must be called + * while the page is counted on swap but not yet in mapcount i.e. + * before page_add_anon_rmap() and swap_free(); try_to_free_swap() + * must be called after the swap_free(), or it will never succeed. + */ + + add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR); + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -HPAGE_PMD_NR); + pmd = mk_huge_pmd(page, vma->vm_page_prot); + if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) { + pmd = maybe_pmd_mkwrite(pmd_mkdirty(pmd), vma); + vmf->flags &= ~FAULT_FLAG_WRITE; + ret |= VM_FAULT_WRITE; + exclusive = RMAP_EXCLUSIVE; + } + for (i = 0; i < HPAGE_PMD_NR; i++) + flush_icache_page(vma, page + i); + if (pmd_swp_soft_dirty(orig_pmd)) + pmd = pmd_mksoft_dirty(pmd); + do_page_add_anon_rmap(page, vma, haddr, + exclusive | RMAP_COMPOUND); + mem_cgroup_commit_charge(page, memcg, true, true); + activate_page(page); + set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd); + + swap_free(entry, true); + if (mem_cgroup_swap_full(page) || + (vma->vm_flags & VM_LOCKED) || PageMlocked(page)) + try_to_free_swap(page); + unlock_page(page); + + if (vmf->flags & FAULT_FLAG_WRITE) { + ret |= do_huge_pmd_wp_page(vmf, pmd); + if (ret & VM_FAULT_ERROR) + ret &= VM_FAULT_ERROR; + goto out; + } + + /* No need to invalidate - it was non-present before */ + update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); + spin_unlock(vmf->ptl); +out: + return ret; +out_nomap: + mem_cgroup_cancel_charge(page, memcg, true); + spin_unlock(vmf->ptl); +out_page: + unlock_page(page); +out_release: + put_page(page); + return ret; +fallback: + delayacct_clear_flag(DELAYACCT_PF_SWAPIN); + if (!split_huge_swap_pmd(vmf->vma, vmf->pmd, vmf->address, orig_pmd)) + ret = VM_FAULT_FALLBACK; + else + ret = 0; + if (page) + put_page(page); + return ret; +} #else static inline void __split_huge_swap_pmd(struct vm_area_struct *vma, unsigned long haddr, diff --git a/mm/memory.c b/mm/memory.c index f8a4336dd015..a8b85fb333c7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4073,13 +4073,17 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address, barrier(); if (unlikely(is_swap_pmd(orig_pmd))) { - VM_BUG_ON(thp_migration_supported() && - !is_pmd_migration_entry(orig_pmd)); - if (is_pmd_migration_entry(orig_pmd)) + if (thp_migration_supported() && + is_pmd_migration_entry(orig_pmd)) { pmd_migration_entry_wait(mm, vmf.pmd); - return 0; - } - if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) { + return 0; + } else if (thp_swap_supported()) { + ret = do_huge_pmd_swap_page(&vmf, orig_pmd); + if (!(ret & VM_FAULT_FALLBACK)) + return ret; + } else + VM_BUG_ON(1); + } else if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) { if (pmd_protnone(orig_pmd) && vma_is_accessible(vma)) return do_huge_pmd_numa_page(&vmf, orig_pmd); -- 2.16.1