From: "Huang, Ying"
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying,
Shutemov" , Andrea Arcangeli , Michal Hocko , Johannes Weiner , Shaohua Li , Hugh Dickins , Minchan Kim , Rik van Riel , Dave Hansen , Naoya Horiguchi , Zi Yan Subject: [PATCH -mm -V3 12/21] mm, THP, swap: Support PMD swap mapping in swapoff Date: Wed, 23 May 2018 16:26:16 +0800 Message-Id: <20180523082625.6897-13-ying.huang@intel.com> X-Mailer: git-send-email 2.16.1 In-Reply-To: <20180523082625.6897-1-ying.huang@intel.com> References: <20180523082625.6897-1-ying.huang@intel.com> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Huang Ying During swapoff, for a huge swap cluster, we need to allocate a THP, read its contents into the THP and unuse the PMD and PTE swap mappings to it. If failed to allocate a THP, the huge swap cluster will be split. During unuse, if it is found that the swap cluster mapped by a PMD swap mapping is split already, we will split the PMD swap mapping and unuse the PTEs. Signed-off-by: "Huang, Ying" Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan --- include/asm-generic/pgtable.h | 15 ++------ include/linux/huge_mm.h | 8 ++++ mm/huge_memory.c | 4 +- mm/swapfile.c | 86 ++++++++++++++++++++++++++++++++++++++++++- 4 files changed, 98 insertions(+), 15 deletions(-) diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index bb8354981a36..caa381962cd2 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -931,22 +931,13 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd) barrier(); #endif /* - * !pmd_present() checks for pmd migration entries - * - * The complete check uses is_pmd_migration_entry() in linux/swapops.h - * But using that requires moving current function and pmd_trans_unstable() - * to linux/swapops.h to resovle dependency, which is too much code move. - * - * !pmd_present() is equivalent to is_pmd_migration_entry() currently, - * because !pmd_present() pages can only be under migration not swapped - * out. - * - * pmd_none() is preseved for future condition checks on pmd migration + * pmd_none() is preseved for future condition checks on pmd swap * entries and not confusing with this function name, although it is * redundant with !pmd_present(). 
 include/asm-generic/pgtable.h | 15 ++------
 include/linux/huge_mm.h       |  8 ++++
 mm/huge_memory.c              |  4 +-
 mm/swapfile.c                 | 86 ++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 98 insertions(+), 15 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index bb8354981a36..caa381962cd2 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -931,22 +931,13 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
         barrier();
 #endif
         /*
-         * !pmd_present() checks for pmd migration entries
-         *
-         * The complete check uses is_pmd_migration_entry() in linux/swapops.h
-         * But using that requires moving current function and pmd_trans_unstable()
-         * to linux/swapops.h to resovle dependency, which is too much code move.
-         *
-         * !pmd_present() is equivalent to is_pmd_migration_entry() currently,
-         * because !pmd_present() pages can only be under migration not swapped
-         * out.
-         *
-         * pmd_none() is preseved for future condition checks on pmd migration
+         * pmd_none() is preseved for future condition checks on pmd swap
          * entries and not confusing with this function name, although it is
          * redundant with !pmd_present().
          */
         if (pmd_none(pmdval) || pmd_trans_huge(pmdval) ||
-                (IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION) && !pmd_present(pmdval)))
+            ((IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION) ||
+              IS_ENABLED(CONFIG_THP_SWAP)) && !pmd_present(pmdval)))
                 return 1;
         if (unlikely(pmd_bad(pmdval))) {
                 pmd_clear_bad(pmd);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1cfd43047f0d..4e299e327720 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -405,6 +405,8 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #ifdef CONFIG_THP_SWAP
+extern int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+                               unsigned long address, pmd_t orig_pmd);
 extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
 
 static inline bool transparent_hugepage_swapin_enabled(
@@ -430,6 +432,12 @@ static inline bool transparent_hugepage_swapin_enabled(
         return false;
 }
 #else /* CONFIG_THP_SWAP */
+static inline int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+                                      unsigned long address, pmd_t orig_pmd)
+{
+        return 0;
+}
+
 static inline int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
 {
         return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3fd129a21f2e..668d77cec14d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1663,8 +1663,8 @@ static void __split_huge_swap_pmd(struct vm_area_struct *vma,
         pmd_populate(mm, pmd, pgtable);
 }
 
-static int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-                               unsigned long address, pmd_t orig_pmd)
+int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+                        unsigned long address, pmd_t orig_pmd)
 {
         struct mm_struct *mm = vma->vm_mm;
         spinlock_t *ptl;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1a62fbc13381..77b2ddd37d9b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1937,6 +1937,11 @@ static inline int pte_same_as_swp(pte_t pte, pte_t swp_pte)
         return pte_same(pte_swp_clear_soft_dirty(pte), swp_pte);
 }
 
+static inline int pmd_same_as_swp(pmd_t pmd, pmd_t swp_pmd)
+{
+        return pmd_same(pmd_swp_clear_soft_dirty(pmd), swp_pmd);
+}
+
 /*
  * No need to decide whether this PTE shares the swap entry with others,
  * just let do_wp_page work it out if a write is requested later - to
@@ -1998,6 +2003,57 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
         return ret;
 }
 
+#ifdef CONFIG_THP_SWAP
+static int unuse_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+                     unsigned long addr, swp_entry_t entry, struct page *page)
+{
+        struct mem_cgroup *memcg;
+        struct swap_info_struct *si;
+        spinlock_t *ptl;
+        int ret = 1;
+
+        if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL,
+                                  &memcg, true)) {
+                ret = -ENOMEM;
+                goto out_nolock;
+        }
+
+        ptl = pmd_lock(vma->vm_mm, pmd);
+        if (unlikely(!pmd_same_as_swp(*pmd, swp_entry_to_pmd(entry)))) {
+                mem_cgroup_cancel_charge(page, memcg, true);
+                ret = 0;
+                goto out;
+        }
+
+        add_mm_counter(vma->vm_mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+        add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+        get_page(page);
+        set_pmd_at(vma->vm_mm, addr, pmd,
+                   pmd_mkold(mk_huge_pmd(page, vma->vm_page_prot)));
+        page_add_anon_rmap(page, vma, addr, true);
+        mem_cgroup_commit_charge(page, memcg, true, true);
+        si = _swap_info_get(entry);
+        if (si)
+                swap_free_cluster(si, entry);
+        /*
+         * Move the page to the active list so it is not
+         * immediately swapped out again after swapon.
+         */
+        activate_page(page);
+out:
+        spin_unlock(ptl);
+out_nolock:
+        return ret;
+}
+#else
+static inline int unuse_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+                            unsigned long addr, swp_entry_t entry,
+                            struct page *page)
+{
+        return 0;
+}
+#endif
+
 static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
                                 unsigned long addr, unsigned long end,
                                 swp_entry_t entry, struct page *page)
@@ -2038,7 +2094,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
                                 unsigned long addr, unsigned long end,
                                 swp_entry_t entry, struct page *page)
 {
-        pmd_t *pmd;
+        pmd_t swp_pmd = swp_entry_to_pmd(entry), *pmd, orig_pmd;
         unsigned long next;
         int ret;
 
@@ -2046,6 +2102,24 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
         do {
                 cond_resched();
                 next = pmd_addr_end(addr, end);
+                orig_pmd = *pmd;
+                if (thp_swap_supported() && is_swap_pmd(orig_pmd)) {
+                        if (likely(!pmd_same_as_swp(orig_pmd, swp_pmd)))
+                                continue;
+                        /* Huge cluster has been split already */
+                        if (!PageTransCompound(page)) {
+                                ret = split_huge_swap_pmd(vma, pmd,
+                                                          addr, orig_pmd);
+                                if (ret)
+                                        return ret;
+                                ret = unuse_pte_range(vma, pmd, addr,
+                                                      next, entry, page);
+                        } else
+                                ret = unuse_pmd(vma, pmd, addr, entry, page);
+                        if (ret)
+                                return ret;
+                        continue;
+                }
                 if (pmd_none_or_trans_huge_or_clear_bad(pmd))
                         continue;
                 ret = unuse_pte_range(vma, pmd, addr, next, entry, page);
@@ -2210,6 +2284,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
                                          * to prevent compiler doing
                                          * something odd.
                                          */
+        struct swap_cluster_info *ci = NULL;
         unsigned char swcount;
         struct page *page;
         swp_entry_t entry;
@@ -2239,6 +2314,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
          * there are races when an instance of an entry might be missed.
          */
         while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
+retry:
                 if (signal_pending(current)) {
                         retval = -EINTR;
                         break;
@@ -2250,6 +2326,8 @@ int try_to_unuse(unsigned int type, bool frontswap,
                  * page and read the swap into it.
                  */
                 swap_map = &si->swap_map[i];
+                if (si->cluster_info)
+                        ci = si->cluster_info + i / SWAPFILE_CLUSTER;
                 entry = swp_entry(type, i);
                 page = read_swap_cache_async(entry,
                                         GFP_HIGHUSER_MOVABLE, NULL, 0, false);
@@ -2270,6 +2348,12 @@ int try_to_unuse(unsigned int type, bool frontswap,
                          */
                         if (!swcount || swcount == SWAP_MAP_BAD)
                                 continue;
+                        /* Split huge cluster if failed to allocate huge page */
+                        if (thp_swap_supported() && cluster_is_huge(ci)) {
+                                retval = split_swap_cluster(entry, false);
+                                if (!retval || retval == -EEXIST)
+                                        goto retry;
+                        }
                         retval = -ENOMEM;
                         break;
                 }
-- 
2.16.1