2017-07-24 05:18:56

by Huang, Ying

Subject: [PATCH -mm -v3 00/12] mm, THP, swap: Delay splitting THP after swapped out

From: Huang Ying <[email protected]>

Hi, Andrew, could you help me to check whether the overall design is
reasonable?

Hi, Johannes and Minchan, thanks a lot for your review of the first
step of the THP swap optimization! Could you help me review the
second step in this patchset?

Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
swap part of the patchset? Especially [01/12], [02/12], [03/12],
[04/12], [11/12], and [12/12].

Hi, Andrea and Kirill, could you help me to review the THP part of the
patchset? Especially [01/12], [03/12], [07/12], [08/12], [09/12],
[11/12].

Hi, Johannes, Michal, could you help me to review the cgroup part of
the patchset? Especially [08/12], [09/12], and [10/12].

And for everyone: any comments are welcome!

Because the THP swap writing support patch [06/12] needs to be rebased
on the multipage bvec patchset, which hasn't been merged yet, the
[06/12] in this patchset is just a test patch and will be rewritten
later. The patchset depends on the multipage bvec patchset too.

This is the second step of the THP (Transparent Huge Page) swap
optimization. In the first step, splitting the huge page is delayed
from almost the beginning of swapping out to after the swap space has
been allocated for the THP and the THP has been added into the swap
cache. In this second step, the splitting is delayed further, to after
the swapping out has finished. The plan is to delay splitting the THP
step by step and finally avoid splitting the THP during swapping out
altogether, swapping the THP out and in as a whole.

In this patchset, more operations for anonymous THP reclaiming, such
as TLB flushing, writing the THP to the swap device, and removing the
THP from the swap cache, are batched, so that the performance of
anonymous THP swapping out is improved.

This patchset is based on the 7/14 head of mmotm/master.

During the development, the following scenarios/code paths have been
checked,

- swap out/in
- swap off
- write protect page fault
- madvise_free
- process exit
- split huge page

Please let me know if I missed something.

With the patchset, the swap out throughput improves by 42% (from about
5.81GB/s to about 8.25GB/s) in the vm-scalability swap-w-seq test case
with 16 processes. At the same time, the IPI count (which reflects TLB
flushing) is reduced by about 78.9%. The test is done on a Xeon E5 v3
system. The swap device used is a RAM simulated PMEM (persistent
memory) device. To test the sequential swapping out, the test case
creates 8 processes, which sequentially allocate and write to the
anonymous pages until the RAM and part of the swap device is used up.
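
For illustration only, below is a minimal user-space sketch of one such
sequential writer process. It only mimics the access pattern
(sequentially allocate and write anonymous memory until reclaim has to
swap it out); the 8 GiB region size and the 1 MiB write chunk are
arbitrary assumptions, not parameters of the actual vm-scalability
case.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t size = 8UL << 30;	/* assumed region size */
	char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Sequentially write every byte to force allocation and, once
	 * memory fills up, swap out of the earlier pages.
	 */
	for (size_t off = 0; off < size; off += 1UL << 20)
		memset(buf + off, 0x5a, 1UL << 20);

	return 0;
}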

Below is the part of the cover letter for the first step patchset of
THP swap optimization which applies to all steps.

----------------------------------------------------------------->

Recently, the performance of storage devices has improved so fast that
we cannot saturate the disk bandwidth with a single logical CPU when
doing page swap out, even on a high-end server machine, because the
performance of the storage devices has improved faster than that of a
single logical CPU. And it seems that the trend will not change in the
near future. On the other hand, THP becomes more and more popular
because of the increased memory size. So it becomes necessary to
optimize THP swap performance.

The advantages of the THP swap support include:

- Batch the swap operations for the THP to reduce TLB flushing and
lock acquiring/releasing, including allocating/freeing the swap
space, adding/deleting to/from the swap cache, and writing/reading
the swap space, etc. This will help improve the performance of the
THP swap.

- The THP swap space read/write will be 2M sequential IO. This is
particularly helpful for the swap read, which is usually 4k random
IO. This will improve the performance of the THP swap too.

- It will help reduce memory fragmentation, especially when the THP is
heavily used by the applications. The 2M contiguous pages will be
freed up after the THP is swapped out.

- It will improve the THP utilization on systems with swap turned on,
because the speed at which khugepaged collapses normal pages into a
THP is quite slow. After the THP is split during swapping out, it
will take quite a long time for the normal pages to collapse back
into a THP after being swapped in. High THP utilization also helps
the efficiency of page based memory management.

There are some concerns regarding THP swap in, mainly because the
possibly enlarged read/write IO size (for swap in/out) may put more
overhead on the storage device. To deal with that, THP swap in should
be turned on only when necessary. For example, it can be selected via
"always/never/madvise" logic, to be turned on globally, turned off
globally, or turned on only for VMAs with MADV_HUGEPAGE, etc.
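
As an aside, the per-VMA opt in mentioned above already exists for THP
allocation via madvise(). A minimal sketch of that existing interface
(not of any THP swap in knob, which this series does not add) could
look as below, assuming a PMD sized (2M) anonymous mapping:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL << 20;		/* one PMD sized region */
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (addr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Prefer transparent huge pages for this VMA ... */
	if (madvise(addr, len, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");

	/* ... or explicitly avoid them for this VMA. */
	if (madvise(addr, len, MADV_NOHUGEPAGE))
		perror("madvise(MADV_NOHUGEPAGE)");

	munmap(addr, len);
	return 0;
}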

Changelog:

v3:

- Rebased on latest -mm tree
- Some minor fixes

Best Regards,
Huang, Ying


2017-07-24 05:19:05

by Huang, Ying

Subject: [PATCH -mm -v3 01/12] mm, THP, swap: Support to clear swap cache flag for THP swapped out

From: Huang Ying <[email protected]>

Previously, swapcache_free_cluster() was used only in the error path
of shrink_page_list() to free the swap cluster just allocated if the
THP (Transparent Huge Page) failed to be split. In this patch, it is
enhanced to clear the swap cache flag (SWAP_HAS_CACHE) for the swap
cluster that holds the contents of a THP swapped out.

This will be used to support delaying splitting the THP after swapping
out. Because there is no support for swapping in a THP as a whole yet,
after clearing the swap cache flag, the swap cluster backing the THP
swapped out will be split, so that the swap slots in the swap cluster
can be swapped in as normal pages later.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Rik van Riel <[email protected]>
---
mm/swapfile.c | 32 +++++++++++++++++++++++++-------
1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6ba4aab2db0b..c32e9b23d642 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1168,22 +1168,40 @@ static void swapcache_free_cluster(swp_entry_t entry)
struct swap_cluster_info *ci;
struct swap_info_struct *si;
unsigned char *map;
- unsigned int i;
+ unsigned int i, free_entries = 0;
+ unsigned char val;

- si = swap_info_get(entry);
+ si = _swap_info_get(entry);
if (!si)
return;

ci = lock_cluster(si, offset);
map = si->swap_map + offset;
for (i = 0; i < SWAPFILE_CLUSTER; i++) {
- VM_BUG_ON(map[i] != SWAP_HAS_CACHE);
- map[i] = 0;
+ val = map[i];
+ VM_BUG_ON(!(val & SWAP_HAS_CACHE));
+ if (val == SWAP_HAS_CACHE)
+ free_entries++;
+ }
+ if (!free_entries) {
+ for (i = 0; i < SWAPFILE_CLUSTER; i++)
+ map[i] &= ~SWAP_HAS_CACHE;
}
unlock_cluster(ci);
- mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
- swap_free_cluster(si, idx);
- spin_unlock(&si->lock);
+ if (free_entries == SWAPFILE_CLUSTER) {
+ spin_lock(&si->lock);
+ ci = lock_cluster(si, offset);
+ memset(map, 0, SWAPFILE_CLUSTER);
+ unlock_cluster(ci);
+ mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
+ swap_free_cluster(si, idx);
+ spin_unlock(&si->lock);
+ } else if (free_entries) {
+ for (i = 0; i < SWAPFILE_CLUSTER; i++, entry.val++) {
+ if (!__swap_entry_free(si, entry, SWAP_HAS_CACHE))
+ free_swap_slot(entry);
+ }
+ }
}
#else
static inline void swapcache_free_cluster(swp_entry_t entry)
--
2.13.2

2017-07-24 05:19:12

by Huang, Ying

Subject: [PATCH -mm -v3 02/12] mm, THP, swap: Support to reclaim swap space for THP swapped out

From: Huang Ying <[email protected]>

Normal swap slot reclaiming can be done when the swap count reaches
SWAP_HAS_CACHE. But for a swap slot which is backing a THP, all the
swap slots backing that THP must be reclaimed together, because the
swap slots may be used again when the THP is swapped out again later.
So the swap slots backing one THP can only be reclaimed together, when
the swap count of all swap slots for the THP has reached
SWAP_HAS_CACHE. In this patch, the functions to check whether the swap
count of all swap slots backing one THP has reached SWAP_HAS_CACHE are
implemented and used when checking whether a swap slot can be
reclaimed.

To make it easier to determine whether a swap slot is backing a THP, a
new swap cluster flag named CLUSTER_FLAG_HUGE is added to mark a swap
cluster which is backing a THP (Transparent Huge Page). Because
swapping in a THP as a whole isn't supported yet, after the THP is
deleted from the swap cache (for example, when swapping out has
finished), the CLUSTER_FLAG_HUGE flag will be cleared, so that the
normal pages inside the THP can be swapped in individually.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Rik van Riel <[email protected]>
---
include/linux/swap.h | 1 +
mm/swapfile.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 72 insertions(+), 7 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d83d28e53e62..964b4f1fba4a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -188,6 +188,7 @@ struct swap_cluster_info {
};
#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
+#define CLUSTER_FLAG_HUGE 4 /* This cluster is backing a transparent huge page */

/*
* We assign a cluster to each CPU, so each CPU can allocate swap entry from
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c32e9b23d642..7db19846f8c7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -265,6 +265,16 @@ static inline void cluster_set_null(struct swap_cluster_info *info)
info->data = 0;
}

+static inline bool cluster_is_huge(struct swap_cluster_info *info)
+{
+ return info->flags & CLUSTER_FLAG_HUGE;
+}
+
+static inline void cluster_clear_huge(struct swap_cluster_info *info)
+{
+ info->flags &= ~CLUSTER_FLAG_HUGE;
+}
+
static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
unsigned long offset)
{
@@ -846,7 +856,7 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
offset = idx * SWAPFILE_CLUSTER;
ci = lock_cluster(si, offset);
alloc_cluster(si, idx);
- cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
+ cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE);

map = si->swap_map + offset;
for (i = 0; i < SWAPFILE_CLUSTER; i++)
@@ -1176,6 +1186,7 @@ static void swapcache_free_cluster(swp_entry_t entry)
return;

ci = lock_cluster(si, offset);
+ VM_BUG_ON(!cluster_is_huge(ci));
map = si->swap_map + offset;
for (i = 0; i < SWAPFILE_CLUSTER; i++) {
val = map[i];
@@ -1187,6 +1198,7 @@ static void swapcache_free_cluster(swp_entry_t entry)
for (i = 0; i < SWAPFILE_CLUSTER; i++)
map[i] &= ~SWAP_HAS_CACHE;
}
+ cluster_clear_huge(ci);
unlock_cluster(ci);
if (free_entries == SWAPFILE_CLUSTER) {
spin_lock(&si->lock);
@@ -1350,6 +1362,54 @@ int swp_swapcount(swp_entry_t entry)
return count;
}

+#ifdef CONFIG_THP_SWAP
+static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
+ swp_entry_t entry)
+{
+ struct swap_cluster_info *ci;
+ unsigned char *map = si->swap_map;
+ unsigned long roffset = swp_offset(entry);
+ unsigned long offset = round_down(roffset, SWAPFILE_CLUSTER);
+ int i;
+ bool ret = false;
+
+ ci = lock_cluster_or_swap_info(si, offset);
+ if (!cluster_is_huge(ci)) {
+ if (map[roffset] != SWAP_HAS_CACHE)
+ ret = true;
+ goto unlock_out;
+ }
+ for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+ if (map[offset + i] != SWAP_HAS_CACHE) {
+ ret = true;
+ break;
+ }
+ }
+unlock_out:
+ unlock_cluster_or_swap_info(si, ci);
+ return ret;
+}
+
+static bool page_swapped(struct page *page)
+{
+ swp_entry_t entry;
+ struct swap_info_struct *si;
+
+ if (likely(!PageTransCompound(page)))
+ return page_swapcount(page) != 0;
+
+ page = compound_head(page);
+ entry.val = page_private(page);
+ si = _swap_info_get(entry);
+ if (si)
+ return swap_page_trans_huge_swapped(si, entry);
+ return false;
+}
+#else
+#define swap_page_trans_huge_swapped(si, entry) swap_swapcount(si, entry)
+#define page_swapped(page) (page_swapcount(page) != 0)
+#endif
+
/*
* We can write to an anon page without COW if there are no other references
* to it. And as a side-effect, free up its swap: because the old content
@@ -1404,7 +1464,7 @@ int try_to_free_swap(struct page *page)
return 0;
if (PageWriteback(page))
return 0;
- if (page_swapcount(page))
+ if (page_swapped(page))
return 0;

/*
@@ -1425,6 +1485,7 @@ int try_to_free_swap(struct page *page)
if (pm_suspended_storage())
return 0;

+ page = compound_head(page);
delete_from_swap_cache(page);
SetPageDirty(page);
return 1;
@@ -1446,7 +1507,8 @@ int free_swap_and_cache(swp_entry_t entry)
p = _swap_info_get(entry);
if (p) {
count = __swap_entry_free(p, entry, 1);
- if (count == SWAP_HAS_CACHE) {
+ if (count == SWAP_HAS_CACHE &&
+ !swap_page_trans_huge_swapped(p, entry)) {
page = find_get_page(swap_address_space(entry),
swp_offset(entry));
if (page && !trylock_page(page)) {
@@ -1463,7 +1525,8 @@ int free_swap_and_cache(swp_entry_t entry)
*/
if (PageSwapCache(page) && !PageWriteback(page) &&
(!page_mapped(page) || mem_cgroup_swap_full(page)) &&
- !swap_swapcount(p, entry)) {
+ !swap_page_trans_huge_swapped(p, entry)) {
+ page = compound_head(page);
delete_from_swap_cache(page);
SetPageDirty(page);
}
@@ -2017,7 +2080,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
.sync_mode = WB_SYNC_NONE,
};

- swap_writepage(page, &wbc);
+ swap_writepage(compound_head(page), &wbc);
lock_page(page);
wait_on_page_writeback(page);
}
@@ -2030,8 +2093,9 @@ int try_to_unuse(unsigned int type, bool frontswap,
* delete, since it may not have been written out to swap yet.
*/
if (PageSwapCache(page) &&
- likely(page_private(page) == entry.val))
- delete_from_swap_cache(page);
+ likely(page_private(page) == entry.val) &&
+ !page_swapped(page))
+ delete_from_swap_cache(compound_head(page));

/*
* So we could skip searching mms once swap count went
--
2.13.2

2017-07-24 05:19:27

by Huang, Ying

Subject: [PATCH -mm -v3 03/12] mm, THP, swap: Make reuse_swap_page() works for THP swapped out

From: Huang Ying <[email protected]>

After adding support to delay splitting the THP (Transparent Huge
Page) after it has been swapped out, it is possible that some page
table mappings of the THP are turned into swap entries. So
reuse_swap_page() needs to check the swap count in addition to the map
count as before. This patch does that.

In the huge PMD write protect fault handler, in addition to the page
map count, the swap count needs to be checked too, so the page lock
needs to be acquired too when calling reuse_swap_page(), in addition
to the page table lock.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
---
include/linux/swap.h | 4 +-
mm/huge_memory.c | 16 +++++++-
mm/memory.c | 6 +--
mm/swapfile.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++-----
4 files changed, 113 insertions(+), 15 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 964b4f1fba4a..7176ba780e83 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -510,8 +510,8 @@ static inline int swp_swapcount(swp_entry_t entry)
return 0;
}

-#define reuse_swap_page(page, total_mapcount) \
- (page_trans_huge_mapcount(page, total_mapcount) == 1)
+#define reuse_swap_page(page, total_map_swapcount) \
+ (page_trans_huge_mapcount(page, total_map_swapcount) == 1)

static inline int try_to_free_swap(struct page *page)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86975dec0ba1..7392ba184126 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1226,15 +1226,29 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
* We can only reuse the page if nobody else maps the huge page or it's
* part.
*/
- if (page_trans_huge_mapcount(page, NULL) == 1) {
+ if (!trylock_page(page)) {
+ get_page(page);
+ spin_unlock(vmf->ptl);
+ lock_page(page);
+ spin_lock(vmf->ptl);
+ if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) {
+ unlock_page(page);
+ put_page(page);
+ goto out_unlock;
+ }
+ put_page(page);
+ }
+ if (reuse_swap_page(page, NULL)) {
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
ret |= VM_FAULT_WRITE;
+ unlock_page(page);
goto out_unlock;
}
+ unlock_page(page);
get_page(page);
spin_unlock(vmf->ptl);
alloc:
diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..8e7452c619bc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2541,7 +2541,7 @@ static int do_wp_page(struct vm_fault *vmf)
* not dirty accountable.
*/
if (PageAnon(vmf->page) && !PageKsm(vmf->page)) {
- int total_mapcount;
+ int total_map_swapcount;
if (!trylock_page(vmf->page)) {
get_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2556,8 +2556,8 @@ static int do_wp_page(struct vm_fault *vmf)
}
put_page(vmf->page);
}
- if (reuse_swap_page(vmf->page, &total_mapcount)) {
- if (total_mapcount == 1) {
+ if (reuse_swap_page(vmf->page, &total_map_swapcount)) {
+ if (total_map_swapcount == 1) {
/*
* The page is all ours. Move it to
* our anon_vma so the rmap code will
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7db19846f8c7..b110faf3c82a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1405,9 +1405,89 @@ static bool page_swapped(struct page *page)
return swap_page_trans_huge_swapped(si, entry);
return false;
}
+
+static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
+ int *total_swapcount)
+{
+ int i, map_swapcount, _total_mapcount, _total_swapcount;
+ unsigned long offset;
+ struct swap_info_struct *si;
+ struct swap_cluster_info *ci = NULL;
+ unsigned char *map = NULL;
+ int mapcount, swapcount = 0;
+
+ /* hugetlbfs shouldn't call it */
+ VM_BUG_ON_PAGE(PageHuge(page), page);
+
+ if (likely(!PageTransCompound(page))) {
+ mapcount = atomic_read(&page->_mapcount) + 1;
+ if (total_mapcount)
+ *total_mapcount = mapcount;
+ if (PageSwapCache(page))
+ swapcount = page_swapcount(page);
+ if (total_swapcount)
+ *total_swapcount = swapcount;
+ return mapcount + swapcount;
+ }
+
+ page = compound_head(page);
+
+ _total_mapcount = _total_swapcount = map_swapcount = 0;
+ if (PageSwapCache(page)) {
+ swp_entry_t entry;
+
+ entry.val = page_private(page);
+ si = _swap_info_get(entry);
+ if (si) {
+ map = si->swap_map;
+ offset = swp_offset(entry);
+ }
+ }
+ if (map)
+ ci = lock_cluster(si, offset);
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ mapcount = atomic_read(&page[i]._mapcount) + 1;
+ _total_mapcount += mapcount;
+ if (map) {
+ swapcount = swap_count(map[offset + i]);
+ _total_swapcount += swapcount;
+ }
+ map_swapcount = max(map_swapcount, mapcount + swapcount);
+ }
+ unlock_cluster(ci);
+ if (PageDoubleMap(page)) {
+ map_swapcount -= 1;
+ _total_mapcount -= HPAGE_PMD_NR;
+ }
+ mapcount = compound_mapcount(page);
+ map_swapcount += mapcount;
+ _total_mapcount += mapcount;
+ if (total_mapcount)
+ *total_mapcount = _total_mapcount;
+ if (total_swapcount)
+ *total_swapcount = _total_swapcount;
+
+ return map_swapcount;
+}
#else
#define swap_page_trans_huge_swapped(si, entry) swap_swapcount(si, entry)
#define page_swapped(page) (page_swapcount(page) != 0)
+
+static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
+ int *total_swapcount)
+{
+ int mapcount, swapcount = 0;
+
+ /* hugetlbfs shouldn't call it */
+ VM_BUG_ON_PAGE(PageHuge(page), page);
+
+ mapcount = page_trans_huge_mapcount(page, total_mapcount);
+ if (PageSwapCache(page))
+ swapcount = page_swapcount(page);
+ if (total_swapcount)
+ *total_swapcount = swapcount;
+ return mapcount + swapcount;
+}
#endif

/*
@@ -1416,23 +1496,27 @@ static bool page_swapped(struct page *page)
* on disk will never be read, and seeking back there to write new content
* later would only waste time away from clustering.
*
- * NOTE: total_mapcount should not be relied upon by the caller if
+ * NOTE: total_map_swapcount should not be relied upon by the caller if
* reuse_swap_page() returns false, but it may be always overwritten
* (see the other implementation for CONFIG_SWAP=n).
*/
-bool reuse_swap_page(struct page *page, int *total_mapcount)
+bool reuse_swap_page(struct page *page, int *total_map_swapcount)
{
- int count;
+ int count, total_mapcount, total_swapcount;

VM_BUG_ON_PAGE(!PageLocked(page), page);
if (unlikely(PageKsm(page)))
return false;
- count = page_trans_huge_mapcount(page, total_mapcount);
- if (count <= 1 && PageSwapCache(page)) {
- count += page_swapcount(page);
- if (count != 1)
- goto out;
+ count = page_trans_huge_map_swapcount(page, &total_mapcount,
+ &total_swapcount);
+ if (total_map_swapcount)
+ *total_map_swapcount = total_mapcount + total_swapcount;
+ if (count == 1 && PageSwapCache(page) &&
+ (likely(!PageTransCompound(page)) ||
+ /* The remaining swap count will be freed soon */
+ total_swapcount == page_swapcount(page))) {
if (!PageWriteback(page)) {
+ page = compound_head(page);
delete_from_swap_cache(page);
SetPageDirty(page);
} else {
@@ -1448,7 +1532,7 @@ bool reuse_swap_page(struct page *page, int *total_mapcount)
spin_unlock(&p->lock);
}
}
-out:
+
return count <= 1;
}

--
2.13.2

2017-07-24 05:19:42

by Huang, Ying

Subject: [PATCH -mm -v3 05/12] block, THP: Make block_device_operations.rw_page support THP

From: Huang Ying <[email protected]>

The .rw_page in struct block_device_operations is used by the swap
subsystem to read/write the page contents from/into the corresponding
swap slot in the swap device. To support the THP (Transparent Huge
Page) swap optimization, .rw_page is enhanced to read/write a THP if
possible.

Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]> [for brd.c, zram_drv.c, pmem.c]
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vishal L Verma <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: [email protected]
---
drivers/block/brd.c | 6 +++++-
drivers/block/zram/zram_drv.c | 2 ++
drivers/nvdimm/btt.c | 4 +++-
drivers/nvdimm/pmem.c | 41 ++++++++++++++++++++++++++++++-----------
4 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 104b71c0490d..5d9ed0616413 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -326,7 +326,11 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
struct page *page, bool is_write)
{
struct brd_device *brd = bdev->bd_disk->private_data;
- int err = brd_do_bvec(brd, page, PAGE_SIZE, 0, is_write, sector);
+ int err;
+
+ if (PageTransHuge(page))
+ return -ENOTSUPP;
+ err = brd_do_bvec(brd, page, PAGE_SIZE, 0, is_write, sector);
page_endio(page, is_write, err);
return err;
}
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 856d5dc02451..e2a305b41cd4 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -927,6 +927,8 @@ static int zram_rw_page(struct block_device *bdev, sector_t sector,
struct zram *zram;
struct bio_vec bv;

+ if (PageTransHuge(page))
+ return -ENOTSUPP;
zram = bdev->bd_disk->private_data;

if (!valid_io_request(zram, sector, PAGE_SIZE)) {
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 14323faf8bd9..60491641a8d6 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1241,8 +1241,10 @@ static int btt_rw_page(struct block_device *bdev, sector_t sector,
{
struct btt *btt = bdev->bd_disk->private_data;
int rc;
+ unsigned int len;

- rc = btt_do_bvec(btt, NULL, page, PAGE_SIZE, 0, is_write, sector);
+ len = hpage_nr_pages(page) * PAGE_SIZE;
+ rc = btt_do_bvec(btt, NULL, page, len, 0, is_write, sector);
if (rc == 0)
page_endio(page, is_write, 0);

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index f7099adaabc0..e9aa453da50c 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -80,22 +80,40 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
static void write_pmem(void *pmem_addr, struct page *page,
unsigned int off, unsigned int len)
{
- void *mem = kmap_atomic(page);
-
- memcpy_flushcache(pmem_addr, mem + off, len);
- kunmap_atomic(mem);
+ unsigned int chunk;
+ void *mem;
+
+ while (len) {
+ mem = kmap_atomic(page);
+ chunk = min_t(unsigned int, len, PAGE_SIZE);
+ memcpy_flushcache(pmem_addr, mem + off, chunk);
+ kunmap_atomic(mem);
+ len -= chunk;
+ off = 0;
+ page++;
+ pmem_addr += PAGE_SIZE;
+ }
}

static blk_status_t read_pmem(struct page *page, unsigned int off,
void *pmem_addr, unsigned int len)
{
+ unsigned int chunk;
int rc;
- void *mem = kmap_atomic(page);
-
- rc = memcpy_mcsafe(mem + off, pmem_addr, len);
- kunmap_atomic(mem);
- if (rc)
- return BLK_STS_IOERR;
+ void *mem;
+
+ while (len) {
+ mem = kmap_atomic(page);
+ chunk = min_t(unsigned int, len, PAGE_SIZE);
+ rc = memcpy_mcsafe(mem + off, pmem_addr, chunk);
+ kunmap_atomic(mem);
+ if (rc)
+ return BLK_STS_IOERR;
+ len -= chunk;
+ off = 0;
+ page++;
+ pmem_addr += PAGE_SIZE;
+ }
return BLK_STS_OK;
}

@@ -188,7 +206,8 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
struct pmem_device *pmem = bdev->bd_queue->queuedata;
blk_status_t rc;

- rc = pmem_do_bvec(pmem, page, PAGE_SIZE, 0, is_write, sector);
+ rc = pmem_do_bvec(pmem, page, hpage_nr_pages(page) * PAGE_SIZE,
+ 0, is_write, sector);

/*
* The ->rw_page interface is subtle and tricky. The core
--
2.13.2

2017-07-24 05:19:55

by Huang, Ying

Subject: [PATCH -mm -v3 06/12] Test code to write THP to swap device as a whole

From: Huang Ying <[email protected]>

To support delaying splitting the THP (Transparent Huge Page) until
after it has been swapped out, we need to enhance the swap writing
code to support writing a THP as a whole. This will improve swap
write IO performance. As Ming Lei <[email protected]> pointed out,
this should be based on multipage bvec support, which hasn't been
merged yet. So this patch is only for testing the functionality of
the other patches in the series, and will be reimplemented after
multipage bvec support is merged.

Signed-off-by: "Huang, Ying" <[email protected]>
---
include/linux/bio.h | 8 ++++++++
include/linux/page-flags.h | 4 ++--
include/linux/vm_event_item.h | 1 +
mm/page_io.c | 21 ++++++++++++++++-----
mm/vmstat.c | 1 +
5 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7b1cf4ba0902..1f0720de8990 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -38,7 +38,15 @@
#define BIO_BUG_ON
#endif

+#ifdef CONFIG_THP_SWAP
+#if HPAGE_PMD_NR > 256
+#define BIO_MAX_PAGES HPAGE_PMD_NR
+#else
#define BIO_MAX_PAGES 256
+#endif
+#else
+#define BIO_MAX_PAGES 256
+#endif

#define bio_prio(bio) (bio)->bi_ioprio
#define bio_set_prio(bio, prio) ((bio)->bi_ioprio = prio)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index d33e3280c8ad..ba2d470d2d0a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -303,8 +303,8 @@ PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
* Only test-and-set exist for PG_writeback. The unconditional operators are
* risky: they bypass page accounting.
*/
-TESTPAGEFLAG(Writeback, writeback, PF_NO_COMPOUND)
- TESTSCFLAG(Writeback, writeback, PF_NO_COMPOUND)
+TESTPAGEFLAG(Writeback, writeback, PF_NO_TAIL)
+ TESTSCFLAG(Writeback, writeback, PF_NO_TAIL)
PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)

/* PG_readahead is only used for reads; PG_reclaim is only for writes */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 37e8d31a4632..c75024e80eed 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -85,6 +85,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#endif
THP_ZERO_PAGE_ALLOC,
THP_ZERO_PAGE_ALLOC_FAILED,
+ THP_SWPOUT,
#endif
#ifdef CONFIG_MEMORY_BALLOON
BALLOON_INFLATE,
diff --git a/mm/page_io.c b/mm/page_io.c
index b6c4ac388209..d5d9871a14e5 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -27,16 +27,18 @@
static struct bio *get_swap_bio(gfp_t gfp_flags,
struct page *page, bio_end_io_t end_io)
{
+ int i, nr = hpage_nr_pages(page);
struct bio *bio;

- bio = bio_alloc(gfp_flags, 1);
+ bio = bio_alloc(gfp_flags, nr);
if (bio) {
bio->bi_iter.bi_sector = map_swap_page(page, &bio->bi_bdev);
bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
bio->bi_end_io = end_io;

- bio_add_page(bio, page, PAGE_SIZE, 0);
- BUG_ON(bio->bi_iter.bi_size != PAGE_SIZE);
+ for (i = 0; i < nr; i++)
+ bio_add_page(bio, page + i, PAGE_SIZE, 0);
+ VM_BUG_ON(bio->bi_iter.bi_size != PAGE_SIZE * nr);
}
return bio;
}
@@ -260,6 +262,15 @@ static sector_t swap_page_sector(struct page *page)
return (sector_t)__page_file_index(page) << (PAGE_SHIFT - 9);
}

+static inline void count_swpout_vm_event(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if (unlikely(PageTransHuge(page)))
+ count_vm_event(THP_SWPOUT);
+#endif
+ count_vm_events(PSWPOUT, hpage_nr_pages(page));
+}
+
int __swap_writepage(struct page *page, struct writeback_control *wbc,
bio_end_io_t end_write_func)
{
@@ -311,7 +322,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,

ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);
if (!ret) {
- count_vm_event(PSWPOUT);
+ count_swpout_vm_event(page);
return 0;
}

@@ -324,7 +335,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
goto out;
}
bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
- count_vm_event(PSWPOUT);
+ count_swpout_vm_event(page);
set_page_writeback(page);
unlock_page(page);
submit_bio(bio);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9a4441bbeef2..bccf426453cd 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1071,6 +1071,7 @@ const char * const vmstat_text[] = {
#endif
"thp_zero_page_alloc",
"thp_zero_page_alloc_failed",
+ "thp_swpout",
#endif
#ifdef CONFIG_MEMORY_BALLOON
"balloon_inflate",
--
2.13.2

2017-07-24 05:20:09

by Huang, Ying

Subject: [PATCH -mm -v3 04/12] mm, THP, swap: Don't allocate huge cluster for file backed swap device

From: Huang Ying <[email protected]>

It's hard to write a whole transparent huge page (THP) to a file
backed swap device during swapping out, and file backed swap devices
aren't very popular. So huge cluster allocation for file backed swap
devices is disabled.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Rik van Riel <[email protected]>
---
mm/swapfile.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index b110faf3c82a..21cbdecbc19a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -948,9 +948,10 @@ int get_swap_pages(int n_goal, bool cluster, swp_entry_t swp_entries[])
spin_unlock(&si->lock);
goto nextsi;
}
- if (cluster)
- n_ret = swap_alloc_cluster(si, swp_entries);
- else
+ if (cluster) {
+ if (!(si->flags & SWP_FILE))
+ n_ret = swap_alloc_cluster(si, swp_entries);
+ } else
n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
n_goal, swp_entries);
spin_unlock(&si->lock);
--
2.13.2

2017-07-24 05:20:16

by Huang, Ying

Subject: [PATCH -mm -v3 09/12] memcg, THP, swap: Avoid to duplicated charge THP in swap cache

From: Huang Ying <[email protected]>

For a THP (Transparent Huge Page), tail_page->mem_cgroup is NULL. So
to check whether the page is charged already, we need to check the
head page. This was not an issue before, because it was impossible
for a THP to be in the swap cache. But after we add support for
delaying splitting the THP after it has been swapped out, it is
possible now.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
---
mm/memcontrol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c2618bd8ebdd..a627b0fd67ea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5407,7 +5407,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
* in turn serializes uncharging.
*/
VM_BUG_ON_PAGE(!PageLocked(page), page);
- if (page->mem_cgroup)
+ if (compound_head(page)->mem_cgroup)
goto out;

if (do_swap_account) {
--
2.13.2

2017-07-24 05:20:28

by Huang, Ying

Subject: [PATCH -mm -v3 10/12] memcg, THP, swap: Make mem_cgroup_swapout() support THP

From: Huang Ying <[email protected]>

This patch makes mem_cgroup_swapout() work for a transparent huge page
(THP), that is, move the memory cgroup charge from memory to swap for
a THP.

This will be used for the THP swap support, where a THP may be swapped
out as a whole to a set of (HPAGE_PMD_NR) continuous swap slots on the
swap device.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
---
mm/memcontrol.c | 23 +++++++++++++++--------
1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a627b0fd67ea..b92f3327aca2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4631,8 +4631,8 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
- * We don't consider swapping or file mapped pages because THP does not
- * support them for now.
+ * We don't consider PMD mapped swapping or file mapped pages because THP does
+ * not support them for now.
* Caller should make sure that pmd_trans_huge(pmd) is true.
*/
static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
@@ -5890,6 +5890,7 @@ static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
{
struct mem_cgroup *memcg, *swap_memcg;
+ unsigned int nr_entries;
unsigned short oldid;

VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -5910,19 +5911,24 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
* ancestor for the swap instead and transfer the memory+swap charge.
*/
swap_memcg = mem_cgroup_id_get_online(memcg);
- oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg), 1);
+ nr_entries = hpage_nr_pages(page);
+ /* Get references for the tail pages, too */
+ if (nr_entries > 1)
+ mem_cgroup_id_get_many(swap_memcg, nr_entries - 1);
+ oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg),
+ nr_entries);
VM_BUG_ON_PAGE(oldid, page);
- mem_cgroup_swap_statistics(swap_memcg, 1);
+ mem_cgroup_swap_statistics(swap_memcg, nr_entries);

page->mem_cgroup = NULL;

if (!mem_cgroup_is_root(memcg))
- page_counter_uncharge(&memcg->memory, 1);
+ page_counter_uncharge(&memcg->memory, nr_entries);

if (memcg != swap_memcg) {
if (!mem_cgroup_is_root(swap_memcg))
- page_counter_charge(&swap_memcg->memsw, 1);
- page_counter_uncharge(&memcg->memsw, 1);
+ page_counter_charge(&swap_memcg->memsw, nr_entries);
+ page_counter_uncharge(&memcg->memsw, nr_entries);
}

/*
@@ -5932,7 +5938,8 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
* only synchronisation we have for udpating the per-CPU variables.
*/
VM_BUG_ON(!irqs_disabled());
- mem_cgroup_charge_statistics(memcg, page, false, -1);
+ mem_cgroup_charge_statistics(memcg, page, PageTransHuge(page),
+ -nr_entries);
memcg_check_events(memcg, page);

if (!mem_cgroup_is_root(memcg))
--
2.13.2

2017-07-24 05:20:40

by Huang, Ying

Subject: [PATCH -mm -v3 12/12] mm, THP, swap: Add THP swapping out fallback counting

From: Huang Ying <[email protected]>

When swapping out a THP (Transparent Huge Page), instead of swapping
out the THP as a whole, sometimes we have to fall back to splitting
the THP into normal pages before swapping, because no free swap
clusters are available, the cgroup limit is exceeded, etc. To count
the number of such fallbacks, a new VM event THP_SWPOUT_FALLBACK is
added, and counted when we fall back to splitting the THP.
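
For reference, a minimal user-space sketch to read the two THP swap
out counters added by this series (thp_swpout from [06/12] and
thp_swpout_fallback from this patch) out of /proc/vmstat. The counters
are only present on kernels with these patches applied:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	/* Print the "thp_swpout <n>" and "thp_swpout_fallback <n>" lines. */
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "thp_swpout", strlen("thp_swpout")))
			fputs(line, stdout);
	fclose(f);
	return 0;
}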

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Michal Hocko <[email protected]>
---
include/linux/vm_event_item.h | 1 +
mm/vmscan.c | 3 +++
mm/vmstat.c | 1 +
3 files changed, 5 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index c75024e80eed..e02820fc2861 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -86,6 +86,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_ZERO_PAGE_ALLOC,
THP_ZERO_PAGE_ALLOC_FAILED,
THP_SWPOUT,
+ THP_SWPOUT_FALLBACK,
#endif
#ifdef CONFIG_MEMORY_BALLOON
BALLOON_INFLATE,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7472ddafc14a..4f7212f8ca00 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1153,6 +1153,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
if (split_huge_page_to_list(page,
page_list))
goto activate_locked;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ count_vm_event(THP_SWPOUT_FALLBACK);
+#endif
if (!add_to_swap(page))
goto activate_locked;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index bccf426453cd..e131b51654c7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1072,6 +1072,7 @@ const char * const vmstat_text[] = {
"thp_zero_page_alloc",
"thp_zero_page_alloc_failed",
"thp_swpout",
+ "thp_swpout_fallback",
#endif
#ifdef CONFIG_MEMORY_BALLOON
"balloon_inflate",
--
2.13.2

2017-07-24 05:20:51

by Huang, Ying

Subject: [PATCH -mm -v3 07/12] mm, THP, swap: Support to split THP for THP swapped out

From: Huang Ying <[email protected]>

After adding swapping out support for THP (Transparent Huge Page), it
is possible that a THP in the swap cache (partly swapped out) needs to
be split. To split such a THP, the swap cluster backing the THP needs
to be split too, that is, the CLUSTER_FLAG_HUGE flag needs to be
cleared for the swap cluster. This patch implements that.

And because the THP swap writing needs the THP to stay a huge page
during writing, the PageWriteback flag is checked before splitting.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
---
include/linux/swap.h | 9 +++++++++
mm/huge_memory.c | 10 +++++++++-
mm/swapfile.c | 15 +++++++++++++++
3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7176ba780e83..461cf107ad52 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -527,6 +527,15 @@ static inline swp_entry_t get_swap_page(struct page *page)

#endif /* CONFIG_SWAP */

+#ifdef CONFIG_THP_SWAP
+extern int split_swap_cluster(swp_entry_t entry);
+#else
+static inline int split_swap_cluster(swp_entry_t entry)
+{
+ return 0;
+}
+#endif
+
#ifdef CONFIG_MEMCG
static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7392ba184126..409e9dd28e0c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2452,6 +2452,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageCompound(page), page);

+ if (PageWriteback(page))
+ return -EBUSY;
+
if (PageAnon(head)) {
/*
* The caller does not necessarily hold an mmap_sem that would
@@ -2529,7 +2532,12 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
__dec_node_page_state(page, NR_SHMEM_THPS);
spin_unlock(&pgdata->split_queue_lock);
__split_huge_page(page, list, flags);
- ret = 0;
+ if (PageSwapCache(head)) {
+ swp_entry_t entry = { .val = page_private(head) };
+
+ ret = split_swap_cluster(entry);
+ } else
+ ret = 0;
} else {
if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
pr_alert("total_mapcount: %u, page_count(): %u\n",
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 21cbdecbc19a..1af21311a672 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1216,6 +1216,21 @@ static void swapcache_free_cluster(swp_entry_t entry)
}
}
}
+
+int split_swap_cluster(swp_entry_t entry)
+{
+ struct swap_info_struct *si;
+ struct swap_cluster_info *ci;
+ unsigned long offset = swp_offset(entry);
+
+ si = _swap_info_get(entry);
+ if (!si)
+ return -EBUSY;
+ ci = lock_cluster(si, offset);
+ cluster_clear_huge(ci);
+ unlock_cluster(ci);
+ return 0;
+}
#else
static inline void swapcache_free_cluster(swp_entry_t entry)
{
--
2.13.2

2017-07-24 05:21:03

by Huang, Ying

Subject: [PATCH -mm -v3 08/12] memcg, THP, swap: Support move mem cgroup charge for THP swapped out

From: Huang Ying <[email protected]>

A PTE mapped THP (Transparent Huge Page) will be ignored when moving a
memory cgroup charge. But for a THP which is in the swap cache, the
memory cgroup charge for the swap entry of a tail page may be moved in
the current implementation. That isn't correct, because the swap
charge for all sub-pages of a THP should be moved together. Following
the handling of PTE mapped THPs, the memory cgroup charge moving for
the swap entry of a tail page of a THP is now ignored too.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
---
mm/memcontrol.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3df3c04d73ab..c2618bd8ebdd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4616,8 +4616,11 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
if (!ret || !target)
put_page(page);
}
- /* There is a swap entry and a page doesn't exist or isn't charged */
- if (ent.val && !ret &&
+ /*
+ * There is a swap entry and a page doesn't exist or isn't charged.
+ * But we cannot move a tail-page in a THP.
+ */
+ if (ent.val && !ret && (!page || !PageTransCompound(page)) &&
mem_cgroup_id(mc.from) == lookup_swap_cgroup_id(ent)) {
ret = MC_TARGET_SWAP;
if (target)
--
2.13.2

2017-07-24 05:21:15

by Huang, Ying

Subject: [PATCH -mm -v3 11/12] mm, THP, swap: Delay splitting THP after swapped out

From: Huang Ying <[email protected]>

In this patch, splitting the transparent huge page (THP) during
swapping out is delayed from after adding the THP into the swap cache
to after the swapping out has finished. After the patch, more
operations for anonymous THP reclaiming, such as writing the THP to
the swap device and removing the THP from the swap cache, can be
batched, so that the performance of anonymous THP swapping out can be
improved.

This is the second step of the THP swap support. The plan is to delay
splitting the THP step by step and finally avoid splitting the THP
altogether.

With the patchset, the swap out throughput improves by 42% (from about
5.81GB/s to about 8.25GB/s) in the vm-scalability swap-w-seq test case
with 16 processes. At the same time, the IPI count (which reflects TLB
flushing) is reduced by about 78.9%. The test is done on a Xeon E5 v3
system. The swap device used is a RAM simulated PMEM (persistent
memory) device. To test the sequential swapping out, the test case
creates 8 processes, which sequentially allocate and write to the
anonymous pages until the RAM and part of the swap device is used up.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Michal Hocko <[email protected]>
---
mm/vmscan.c | 95 +++++++++++++++++++++++++++++++++----------------------------
1 file changed, 52 insertions(+), 43 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index efc9da21c5e6..7472ddafc14a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -535,7 +535,9 @@ static inline int is_page_cache_freeable(struct page *page)
* that isolated the page, the page cache radix tree and
* optional buffer heads at page->private.
*/
- return page_count(page) - page_has_private(page) == 2;
+ int radix_pins = PageTransHuge(page) && PageSwapCache(page) ?
+ HPAGE_PMD_NR : 1;
+ return page_count(page) - page_has_private(page) == 1 + radix_pins;
}

static int may_write_to_inode(struct inode *inode, struct scan_control *sc)
@@ -665,6 +667,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
bool reclaimed)
{
unsigned long flags;
+ int refcount;

BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
@@ -695,11 +698,15 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
* Note that if SetPageDirty is always performed via set_page_dirty,
* and thus under tree_lock, then this ordering is not required.
*/
- if (!page_ref_freeze(page, 2))
+ if (unlikely(PageTransHuge(page)) && PageSwapCache(page))
+ refcount = 1 + HPAGE_PMD_NR;
+ else
+ refcount = 2;
+ if (!page_ref_freeze(page, refcount))
goto cannot_free;
/* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
if (unlikely(PageDirty(page))) {
- page_ref_unfreeze(page, 2);
+ page_ref_unfreeze(page, refcount);
goto cannot_free;
}

@@ -1121,58 +1128,56 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* Try to allocate it some swap space here.
* Lazyfree page could be freed directly
*/
- if (PageAnon(page) && PageSwapBacked(page) &&
- !PageSwapCache(page)) {
- if (!(sc->gfp_mask & __GFP_IO))
- goto keep_locked;
- if (PageTransHuge(page)) {
- /* cannot split THP, skip it */
- if (!can_split_huge_page(page, NULL))
- goto activate_locked;
- /*
- * Split pages without a PMD map right
- * away. Chances are some or all of the
- * tail pages can be freed without IO.
- */
- if (!compound_mapcount(page) &&
- split_huge_page_to_list(page, page_list))
- goto activate_locked;
- }
- if (!add_to_swap(page)) {
- if (!PageTransHuge(page))
- goto activate_locked;
- /* Split THP and swap individual base pages */
- if (split_huge_page_to_list(page, page_list))
- goto activate_locked;
- if (!add_to_swap(page))
- goto activate_locked;
- }
-
- /* XXX: We don't support THP writes */
- if (PageTransHuge(page) &&
- split_huge_page_to_list(page, page_list)) {
- delete_from_swap_cache(page);
- goto activate_locked;
- }
+ if (PageAnon(page) && PageSwapBacked(page)) {
+ if (!PageSwapCache(page)) {
+ if (!(sc->gfp_mask & __GFP_IO))
+ goto keep_locked;
+ if (PageTransHuge(page)) {
+ /* cannot split THP, skip it */
+ if (!can_split_huge_page(page, NULL))
+ goto activate_locked;
+ /*
+ * Split pages without a PMD map right
+ * away. Chances are some or all of the
+ * tail pages can be freed without IO.
+ */
+ if (!compound_mapcount(page) &&
+ split_huge_page_to_list(page,
+ page_list))
+ goto activate_locked;
+ }
+ if (!add_to_swap(page)) {
+ if (!PageTransHuge(page))
+ goto activate_locked;
+ /* Fallback to swap normal pages */
+ if (split_huge_page_to_list(page,
+ page_list))
+ goto activate_locked;
+ if (!add_to_swap(page))
+ goto activate_locked;
+ }

- may_enter_fs = 1;
+ may_enter_fs = 1;

- /* Adding to swap updated mapping */
- mapping = page_mapping(page);
+ /* Adding to swap updated mapping */
+ mapping = page_mapping(page);
+ }
} else if (unlikely(PageTransHuge(page))) {
/* Split file THP */
if (split_huge_page_to_list(page, page_list))
goto keep_locked;
}

- VM_BUG_ON_PAGE(PageTransHuge(page), page);
-
/*
* The page is mapped into the page tables of one or more
* processes. Try to unmap it here.
*/
if (page_mapped(page)) {
- if (!try_to_unmap(page, ttu_flags | TTU_BATCH_FLUSH)) {
+ enum ttu_flags flags = ttu_flags | TTU_BATCH_FLUSH;
+
+ if (unlikely(PageTransHuge(page)))
+ flags |= TTU_SPLIT_HUGE_PMD;
+ if (!try_to_unmap(page, flags)) {
nr_unmap_fail++;
goto activate_locked;
}
@@ -1312,7 +1317,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* Is there need to periodically free_page_list? It would
* appear not as the counts should be low
*/
- list_add(&page->lru, &free_pages);
+ if (unlikely(PageTransHuge(page))) {
+ mem_cgroup_uncharge(page);
+ (*get_compound_page_dtor(page))(page);
+ } else
+ list_add(&page->lru, &free_pages);
continue;

activate_locked:
--
2.13.2

2017-07-25 16:39:18

by Rik van Riel

Subject: Re: [PATCH -mm -v3 01/12] mm, THP, swap: Support to clear swap cache flag for THP swapped out

On Mon, 2017-07-24 at 13:18 +0800, Huang, Ying wrote:
> From: Huang Ying <[email protected]>
>
> Previously, swapcache_free_cluster() is used only in the error path
> of
> shrink_page_list() to free the swap cluster just allocated if the
> THP (Transparent Huge Page) is failed to be split.  In this patch, it
> is enhanced to clear the swap cache flag (SWAP_HAS_CACHE) for the
> swap
> cluster that holds the contents of THP swapped out.
>
> This will be used in delaying splitting THP after swapping out
> support.  Because there is no THP swapping in as a whole support yet,
> after clearing the swap cache flag, the swap cluster backing the THP
> swapped out will be split.  So that the swap slots in the swap
> cluster
> can be swapped in as normal pages later.
>
> Signed-off-by: "Huang, Ying" <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Shaohua Li <[email protected]>
> Cc: Rik van Riel <[email protected]>
>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed



2017-07-25 17:47:41

by Rik van Riel

Subject: Re: [PATCH -mm -v3 02/12] mm, THP, swap: Support to reclaim swap space for THP swapped out

On Mon, 2017-07-24 at 13:18 +0800, Huang, Ying wrote:
> From: Huang Ying <[email protected]>
>
> The normal swap slot reclaiming can be done when the swap count
> reaches SWAP_HAS_CACHE.  But for the swap slot which is backing a
> THP,
> all swap slots backing one THP must be reclaimed together, because
> the
> swap slot may be used again when the THP is swapped out again later.
> So the swap slots backing one THP can be reclaimed together when the
> swap count for all swap slots for the THP reached SWAP_HAS_CACHE.  In
> the patch, the functions to check whether the swap count for all swap
> slots backing one THP reached SWAP_HAS_CACHE are implemented and used
> when checking whether a swap slot can be reclaimed.
>
> To make it easier to determine whether a swap slot is backing a THP,
> a
> new swap cluster flag named CLUSTER_FLAG_HUGE is added to mark a swap
> cluster which is backing a THP (Transparent Huge Page).  Because THP
> swap in as a whole isn't supported now.  After deleting the THP from
> the swap cache (for example, swapping out finished), the
> CLUSTER_FLAG_HUGE flag will be cleared.  So that, the normal pages
> inside THP can be swapped in individually.
>
> Signed-off-by: "Huang, Ying" <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Shaohua Li <[email protected]>
> Cc: Rik van Riel <[email protected]>
>
Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

