2017-03-17 06:47:11

by Huang, Ying

Subject: [PATCH 1/5] mm, swap: Fix comment in __read_swap_cache_async

From: Huang Ying <[email protected]>

The commit cbab0e4eec29 ("swap: avoid read_swap_cache_async() race to
deadlock while waiting on discard I/O completion") fixed a deadlock in
read_swap_cache_async(). At that time, in the swap allocation path, a
swap entry could be marked SWAP_HAS_CACHE and the allocator would then
wait for discard I/O to complete before the page for the swap entry
was added to the swap cache. But since commit 815c2c543d3a ("swap:
make swap discard async"), discard for swap has become asynchronous,
and any waiting for discard completion is done before the swap entry
is marked SWAP_HAS_CACHE. So the comment in the code is incorrect
now. This patch fixes the comment.

The cond_resched() added in commit cbab0e4eec29 is no longer strictly
necessary either. But if some sleep were added to the swap allocation
path in the future, a hard to debug/reproduce deadlock could be
reintroduced, so the cond_resched() is kept.

Cc: Shaohua Li <[email protected]>
Cc: Rafael Aquini <[email protected]>
Signed-off-by: "Huang, Ying" <[email protected]>
---
mm/swap_state.c | 12 +-----------
1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 473b71e052a8..7bfb9bd1ca21 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -360,17 +360,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
/*
* We might race against get_swap_page() and stumble
* across a SWAP_HAS_CACHE swap_map entry whose page
- * has not been brought into the swapcache yet, while
- * the other end is scheduled away waiting on discard
- * I/O completion at scan_swap_map().
- *
- * In order to avoid turning this transitory state
- * into a permanent loop around this -EEXIST case
- * if !CONFIG_PREEMPT and the I/O completion happens
- * to be waiting on the CPU waitqueue where we are now
- * busy looping, we just conditionally invoke the
- * scheduler here, if there are some more important
- * tasks to run.
+ * has not been brought into the swapcache yet.
*/
cond_resched();
continue;
--
2.11.0


2017-03-17 06:49:44

by Huang, Ying

Subject: [PATCH 2/5] mm, swap: Improve readability via making spin_lock/unlock balanced

From: Huang Ying <[email protected]>

This is just a cleanup patch, with no functionality change. In
cluster_list_add_tail(), spin_lock_nested() is used to lock the
cluster, while unlock_cluster() is used to unlock it. To improve code
readability, use spin_unlock() directly to unlock the cluster, so the
lock and unlock calls are visibly balanced.
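
For reference, unlock_cluster() is roughly just a NULL-checking
wrapper (a sketch based on mm/swapfile.c, not a verbatim copy):

        static inline void unlock_cluster(struct swap_cluster_info *ci)
        {
                if (ci)
                        spin_unlock(&ci->lock);
        }

so pairing spin_lock_nested() with a bare spin_unlock() on the same
lock makes the locking pattern easy to follow.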

Signed-off-by: "Huang, Ying" <[email protected]>
Acked-by: Tim Chen <[email protected]>
---
mm/swapfile.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6b6bb1bb6209..42fd620dcf4c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -335,7 +335,7 @@ static void cluster_list_add_tail(struct swap_cluster_list *list,
ci_tail = ci + tail;
spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
cluster_set_next(ci_tail, idx);
- unlock_cluster(ci_tail);
+ spin_unlock(&ci_tail->lock);
cluster_set_next_flag(&list->tail, idx, 0);
}
}
--
2.11.0

2017-03-17 06:49:56

by Huang, Ying

Subject: [PATCH 3/5] mm, swap: Avoid acquiring swap_avail_lock while holding cluster lock

From: Huang Ying <[email protected]>

The cluster lock is used to protect the swap_cluster_info and the
corresponding elements in swap_info_struct->swap_map[]. But it turns
out that in scan_swap_map_slots(), swap_avail_lock may be acquired
while the cluster lock is held. This does no good except making the
locking more complex and increasing the potential lock contention,
because swap_info_struct->lock already protects the data structures
operated on there. Fix this by moving the corresponding operations in
scan_swap_map_slots() out of the cluster lock.
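
Below is a minimal sketch of the resulting order, simplified from the
diff that follows (in the real code the swap_avail_lock section is
conditional):

        /* swap_map and cluster info updated under the cluster lock */
        si->swap_map[offset] = usage;
        inc_cluster_info_page(si, si->cluster_info, offset);
        unlock_cluster(ci);

        /* cluster lock dropped before swap_avail_lock is taken */
        spin_lock(&swap_avail_lock);
        plist_del(&si->avail_list, &swap_avail_head);
        spin_unlock(&swap_avail_lock);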

Signed-off-by: "Huang, Ying" <[email protected]>
Acked-by: Tim Chen <[email protected]>
---
mm/swapfile.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 42fd620dcf4c..53b5881ee0d6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -672,6 +672,9 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
else
goto done;
}
+ si->swap_map[offset] = usage;
+ inc_cluster_info_page(si, si->cluster_info, offset);
+ unlock_cluster(ci);

if (offset == si->lowest_bit)
si->lowest_bit++;
@@ -685,9 +688,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
plist_del(&si->avail_list, &swap_avail_head);
spin_unlock(&swap_avail_lock);
}
- si->swap_map[offset] = usage;
- inc_cluster_info_page(si, si->cluster_info, offset);
- unlock_cluster(ci);
si->cluster_next = offset + 1;
slots[n_ret++] = swp_entry(si->type, offset);

--
2.11.0

2017-03-17 06:50:13

by Huang, Ying

Subject: [PATCH 4/5] mm, swap: Try kzalloc before vzalloc

From: Huang Ying <[email protected]>

Now vzalloc() is used in the swap code to allocate various data
structures, such as the swap cache, the swap slots cache, the cluster
info, etc., because the sizes may be too large on some systems for a
normal kzalloc() to succeed. But kzalloc() has some advantages, for
example, less memory fragmentation and less TLB pressure. So change
the data structure allocations in the swap code to try kzalloc()
first, and fall back to vzalloc() if kzalloc() fails.

The allocation of swap_map[] in struct swap_info_struct is not
changed, because it is usually quite large and vmalloc_to_page() is
used for it, which makes it a little harder to change.
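
A short usage sketch (a hypothetical caller, for illustration only):
because swap_kvzalloc() may return either kmalloc()ed or vmalloc()ed
memory, the result must be released with kvfree(), which handles both
cases:

        struct swap_cluster_info *ci;

        ci = swap_kvzalloc(nr_cluster * sizeof(*ci));
        if (!ci)
                return -ENOMEM;
        /* ... use ci ... */
        kvfree(ci);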

Signed-off-by: Huang Ying <[email protected]>
Acked-by: Tim Chen <[email protected]>
---
include/linux/swap.h | 2 ++
mm/swap_slots.c | 20 +++++++++++---------
mm/swap_state.c | 2 +-
mm/swapfile.c | 20 ++++++++++++++++----
4 files changed, 30 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index f59d6b077401..35d5b626c4bc 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -426,6 +426,8 @@ extern void exit_swap_address_space(unsigned int type);
extern int get_swap_slots(int n, swp_entry_t *slots);
extern void swapcache_free_batch(swp_entry_t *entries, int n);

+extern void *swap_kvzalloc(size_t size);
+
#else /* CONFIG_SWAP */

#define swap_address_space(entry) (NULL)
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 9b5bc86f96ad..7ae10e6f757d 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -31,6 +31,8 @@
#include <linux/cpumask.h>
#include <linux/vmalloc.h>
#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/mm.h>

#ifdef CONFIG_SWAP

@@ -118,17 +120,17 @@ static int alloc_swap_slot_cache(unsigned int cpu)
swp_entry_t *slots, *slots_ret;

/*
- * Do allocation outside swap_slots_cache_mutex
- * as vzalloc could trigger reclaim and get_swap_page,
+ * Do allocation outside swap_slots_cache_mutex as
+ * kzalloc/vzalloc could trigger reclaim and get_swap_page,
* which can lock swap_slots_cache_mutex.
*/
- slots = vzalloc(sizeof(swp_entry_t) * SWAP_SLOTS_CACHE_SIZE);
+ slots = swap_kvzalloc(sizeof(swp_entry_t) * SWAP_SLOTS_CACHE_SIZE);
if (!slots)
return -ENOMEM;

- slots_ret = vzalloc(sizeof(swp_entry_t) * SWAP_SLOTS_CACHE_SIZE);
+ slots_ret = swap_kvzalloc(sizeof(swp_entry_t) * SWAP_SLOTS_CACHE_SIZE);
if (!slots_ret) {
- vfree(slots);
+ kvfree(slots);
return -ENOMEM;
}

@@ -152,9 +154,9 @@ static int alloc_swap_slot_cache(unsigned int cpu)
out:
mutex_unlock(&swap_slots_cache_mutex);
if (slots)
- vfree(slots);
+ kvfree(slots);
if (slots_ret)
- vfree(slots_ret);
+ kvfree(slots_ret);
return 0;
}

@@ -171,7 +173,7 @@ static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type,
cache->cur = 0;
cache->nr = 0;
if (free_slots && cache->slots) {
- vfree(cache->slots);
+ kvfree(cache->slots);
cache->slots = NULL;
}
mutex_unlock(&cache->alloc_lock);
@@ -186,7 +188,7 @@ static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type,
}
spin_unlock_irq(&cache->free_lock);
if (slots)
- vfree(slots);
+ kvfree(slots);
}
}

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 7bfb9bd1ca21..d31017532ad5 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -523,7 +523,7 @@ int init_swap_address_space(unsigned int type, unsigned long nr_pages)
unsigned int i, nr;

nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
- spaces = vzalloc(sizeof(struct address_space) * nr);
+ spaces = swap_kvzalloc(sizeof(struct address_space) * nr);
if (!spaces)
return -ENOMEM;
for (i = 0; i < nr; i++) {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 53b5881ee0d6..1fb966cf2175 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2272,8 +2272,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
free_percpu(p->percpu_cluster);
p->percpu_cluster = NULL;
vfree(swap_map);
- vfree(cluster_info);
- vfree(frontswap_map);
+ kvfree(cluster_info);
+ kvfree(frontswap_map);
/* Destroy swap account information */
swap_cgroup_swapoff(p->type);
exit_swap_address_space(p->type);
@@ -2796,7 +2796,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
p->cluster_next = 1 + (prandom_u32() % p->highest_bit);
nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);

- cluster_info = vzalloc(nr_cluster * sizeof(*cluster_info));
+ cluster_info = swap_kvzalloc(nr_cluster * sizeof(*cluster_info));
if (!cluster_info) {
error = -ENOMEM;
goto bad_swap;
@@ -2829,7 +2829,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
}
/* frontswap enabled? set up bit-per-page map for frontswap */
if (IS_ENABLED(CONFIG_FRONTSWAP))
- frontswap_map = vzalloc(BITS_TO_LONGS(maxpages) * sizeof(long));
+ frontswap_map =
+ swap_kvzalloc(BITS_TO_LONGS(maxpages) * sizeof(long));

if (p->bdev &&(swap_flags & SWAP_FLAG_DISCARD) && swap_discardable(p)) {
/*
@@ -3308,3 +3309,14 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
}
}
}
+
+void *swap_kvzalloc(size_t size)
+{
+ void *p;
+
+ p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+ if (!p)
+ p = vzalloc(size);
+
+ return p;
+}
--
2.11.0

2017-03-17 06:50:27

by Huang, Ying

Subject: [PATCH 5/5] mm, swap: Sort swap entries before free

From: Huang Ying <[email protected]>

Reduce the lock contention on swap_info_struct->lock when freeing swap
entries. The freed swap entries are first collected in a per-CPU
buffer and really freed later in a batch. During the batch freeing,
if the consecutive swap entries in the per-CPU buffer belong to the
same swap device, swap_info_struct->lock needs to be acquired/released
only once, so the lock contention can be reduced greatly. But if
there are multiple swap devices, the lock may be unnecessarily
released/acquired because swap entries belonging to the same swap
device may be non-consecutive in the per-CPU buffer.

To solve the issue, the per-CPU buffer is sorted according to the swap
device before the swap entries are freed.

The patch was tested by measuring the run time of
swapcache_free_entries() during the exit phase of applications that
use much swap space. The results show that the average run time of
swapcache_free_entries() is reduced by about 20% after applying the
patch.
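
Below is a sketch of why the sort helps (the free helper name is an
assumption based on the surrounding code in mm/swapfile.c): after
sorting, entries for the same swap device are adjacent, so
swap_info_get_cont() only has to drop one device's lock and take the
next device's lock when the device actually changes:

        sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL);
        prev = NULL;
        p = NULL;
        for (i = 0; i < n; ++i) {
                /* keeps p->lock held while the device stays the same */
                p = swap_info_get_cont(entries[i], prev);
                if (p)
                        swap_entry_free(p, entries[i]);
                prev = p;
        }
        if (p)
                spin_unlock(&p->lock);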

Signed-off-by: Huang Ying <[email protected]>
Acked-by: Tim Chen <[email protected]>
---
mm/swapfile.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1fb966cf2175..dc1716c7997f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -37,6 +37,7 @@
#include <linux/swapfile.h>
#include <linux/export.h>
#include <linux/swap_slots.h>
+#include <linux/sort.h>

#include <asm/pgtable.h>
#include <asm/tlbflush.h>
@@ -1065,6 +1066,13 @@ void swapcache_free(swp_entry_t entry)
}
}

+static int swp_entry_cmp(const void *ent1, const void *ent2)
+{
+ const swp_entry_t *e1 = ent1, *e2 = ent2;
+
+ return (long)(swp_type(*e1) - swp_type(*e2));
+}
+
void swapcache_free_entries(swp_entry_t *entries, int n)
{
struct swap_info_struct *p, *prev;
@@ -1075,6 +1083,7 @@ void swapcache_free_entries(swp_entry_t *entries, int n)

prev = NULL;
p = NULL;
+ sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL);
for (i = 0; i < n; ++i) {
p = swap_info_get_cont(entries[i], prev);
if (p)
--
2.11.0

2017-03-17 08:52:58

by David Rientjes

Subject: Re: [PATCH 4/5] mm, swap: Try kzalloc before vzalloc

On Fri, 17 Mar 2017, Huang, Ying wrote:

> From: Huang Ying <[email protected]>
>
> Now vzalloc() is used in the swap code to allocate various data
> structures, such as the swap cache, the swap slots cache, the cluster
> info, etc., because the sizes may be too large on some systems for a
> normal kzalloc() to succeed. But kzalloc() has some advantages, for
> example, less memory fragmentation and less TLB pressure. So change
> the data structure allocations in the swap code to try kzalloc()
> first, and fall back to vzalloc() if kzalloc() fails.
>

I'm concerned about preferring kzalloc() with __GFP_RECLAIM since the page
allocator will try to do memory compaction for high-order allocations when
the vzalloc() would have succeeded immediately. Do we necessarily want to
spend time doing memory compaction and direct reclaim for contiguous
memory if it's not needed?
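
One possible way to keep the contiguous attempt cheap (a sketch only;
whether that is the right policy is exactly the question above) would
be to forbid heavy reclaim/compaction for the kzalloc() attempt, e.g.
inside the patch's swap_kvzalloc():

        p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
        if (!p)
                p = vzalloc(size);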

2017-03-17 11:47:41

by Michal Hocko

Subject: Re: [PATCH 4/5] mm, swap: Try kzalloc before vzalloc

On Fri 17-03-17 14:46:22, Huang, Ying wrote:
> +void *swap_kvzalloc(size_t size)
> +{
> + void *p;
> +
> + p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
> + if (!p)
> + p = vzalloc(size);
> +
> + return p;
> +}

please do not invent your own kvmalloc implementation when we already
have one in the mmotm tree.
--
Michal Hocko
SUSE Labs
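
(For illustration, assuming the kvzalloc()/kvfree() helpers from the
mmotm tree: the open-coded swap_kvzalloc() could be dropped and its
callers changed along the lines of

        cluster_info = kvzalloc(nr_cluster * sizeof(*cluster_info),
                                GFP_KERNEL);
        if (!cluster_info) {
                error = -ENOMEM;
                goto bad_swap;
        }

with the corresponding frees using kvfree().)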

2017-03-17 12:43:36

by Rafael Aquini

Subject: Re: [PATCH 1/5] mm, swap: Fix comment in __read_swap_cache_async

On Fri, Mar 17, 2017 at 02:46:19PM +0800, Huang, Ying wrote:
> From: Huang Ying <[email protected]>
>
> The commit cbab0e4eec29 ("swap: avoid read_swap_cache_async() race to
> deadlock while waiting on discard I/O completion") fixed a deadlock in
> read_swap_cache_async(). At that time, in the swap allocation path, a
> swap entry could be marked SWAP_HAS_CACHE and the allocator would then
> wait for discard I/O to complete before the page for the swap entry
> was added to the swap cache. But since commit 815c2c543d3a ("swap:
> make swap discard async"), discard for swap has become asynchronous,
> and any waiting for discard completion is done before the swap entry
> is marked SWAP_HAS_CACHE. So the comment in the code is incorrect
> now. This patch fixes the comment.
>
> The cond_resched() added in commit cbab0e4eec29 is no longer strictly
> necessary either. But if some sleep were added to the swap allocation
> path in the future, a hard to debug/reproduce deadlock could be
> reintroduced, so the cond_resched() is kept.
>

^ this is a rather disconcerting way to describe why you left that part
behind, and I recollect telling you about it in a private discussion.

The fact is that __read_swap_cache_async() still races against
get_swap_page(), with a way narrower window due to the async-fashioned
SSD wear leveling done for swap nowadays and other changes made within
__read_swap_cache_async()'s while loop, thus making that old deadlock
scenario very improbable to strike again.

All seems legit, apart from that last paragraph in the commit log
message.


Acked-by: Rafael Aquini <[email protected]>

> Cc: Shaohua Li <[email protected]>
> Cc: Rafael Aquini <[email protected]>
> Signed-off-by: "Huang, Ying" <[email protected]>
> ---
> mm/swap_state.c | 12 +-----------
> 1 file changed, 1 insertion(+), 11 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 473b71e052a8..7bfb9bd1ca21 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -360,17 +360,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> /*
> * We might race against get_swap_page() and stumble
> * across a SWAP_HAS_CACHE swap_map entry whose page
> - * has not been brought into the swapcache yet, while
> - * the other end is scheduled away waiting on discard
> - * I/O completion at scan_swap_map().
> - *
> - * In order to avoid turning this transitory state
> - * into a permanent loop around this -EEXIST case
> - * if !CONFIG_PREEMPT and the I/O completion happens
> - * to be waiting on the CPU waitqueue where we are now
> - * busy looping, we just conditionally invoke the
> - * scheduler here, if there are some more important
> - * tasks to run.
> + * has not been brought into the swapcache yet.
> */
> cond_resched();
> continue;
> --
> 2.11.0
>

2017-03-20 02:08:57

by Huang, Ying

Subject: Re: [PATCH 1/5] mm, swap: Fix comment in __read_swap_cache_async

Hi, Rafael,

Rafael Aquini <[email protected]> writes:

> On Fri, Mar 17, 2017 at 02:46:19PM +0800, Huang, Ying wrote:
>> From: Huang Ying <[email protected]>
>>
>> The commit cbab0e4eec29 ("swap: avoid read_swap_cache_async() race to
>> deadlock while waiting on discard I/O completion") fixed a deadlock in
>> read_swap_cache_async(). At that time, in the swap allocation path, a
>> swap entry could be marked SWAP_HAS_CACHE and the allocator would then
>> wait for discard I/O to complete before the page for the swap entry
>> was added to the swap cache. But since commit 815c2c543d3a ("swap:
>> make swap discard async"), discard for swap has become asynchronous,
>> and any waiting for discard completion is done before the swap entry
>> is marked SWAP_HAS_CACHE. So the comment in the code is incorrect
>> now. This patch fixes the comment.
>>
>> The cond_resched() added in commit cbab0e4eec29 is no longer strictly
>> necessary either. But if some sleep were added to the swap allocation
>> path in the future, a hard to debug/reproduce deadlock could be
>> reintroduced, so the cond_resched() is kept.
>>
>
> ^ this is a rather disconcerting way to describe why you left that part
> behind, and I recollect telling you about it in a private discussion.
>
> The fact is that __read_swap_cache_async() still races against
> get_swap_page(), with a way narrower window due to the async-fashioned
> SSD wear leveling done for swap nowadays and other changes made within
> __read_swap_cache_async()'s while loop, thus making that old deadlock
> scenario very improbable to strike again.

Thanks for your comments! Could you tell me which kind of race
remains?

> All seems legit, apart from that last paragraph in the commit log
> message.
>
>
> Acked-by: Rafael Aquini <[email protected]>

Thanks!

Best Regards,
Huang, Ying

>> Cc: Shaohua Li <[email protected]>
>> Cc: Rafael Aquini <[email protected]>
>> Signed-off-by: "Huang, Ying" <[email protected]>
>> ---
>> mm/swap_state.c | 12 +-----------
>> 1 file changed, 1 insertion(+), 11 deletions(-)
>>
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 473b71e052a8..7bfb9bd1ca21 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -360,17 +360,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>> /*
>> * We might race against get_swap_page() and stumble
>> * across a SWAP_HAS_CACHE swap_map entry whose page
>> - * has not been brought into the swapcache yet, while
>> - * the other end is scheduled away waiting on discard
>> - * I/O completion at scan_swap_map().
>> - *
>> - * In order to avoid turning this transitory state
>> - * into a permanent loop around this -EEXIST case
>> - * if !CONFIG_PREEMPT and the I/O completion happens
>> - * to be waiting on the CPU waitqueue where we are now
>> - * busy looping, we just conditionally invoke the
>> - * scheduler here, if there are some more important
>> - * tasks to run.
>> + * has not been brought into the swapcache yet.
>> */
>> cond_resched();
>> continue;
>> --
>> 2.11.0
>>