2017-12-18 07:34:41

by Huang, Ying

Subject: [PATCH -V3 -mm] mm, swap: Fix race between swapoff and some swap operations

From: Huang Ying <[email protected]>

When swapin is performed, after getting the swap entry information
from the page table, the system will swap in the swap entry without
any lock held to prevent the swap device from being swapped off. This
may cause a race like the one below,

CPU 1                           CPU 2
-----                           -----
do_swap_page
  swapin_readahead
    __read_swap_cache_async
                                swapoff
                                  p->swap_map = NULL
      swapcache_prepare
        __swap_duplicate
          p->swap_map[?]  /* !!! NULL pointer access */

Because swapoff is usually done only at system shutdown, the race may
not hit many people in practice. But it is still a race that needs to
be fixed.

To fix the race, get_swap_device() is added to check whether the
specified swap entry is valid in its swap device. If so, it will keep
the swap entry valid by preventing the swap device from being swapped
off, until put_swap_device() is called.

Because swapoff() is a very rare code path, to make the normal paths
run as fast as possible, RCU instead of a reference count is used to
implement get/put_swap_device(). From get_swap_device() to
put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
swapoff() will wait until put_swap_device() is called.

In addition to the swap_map, cluster_info, etc. data structures in
struct swap_info_struct, the swap cache radix tree will be freed after
swapoff, so this patch fixes the race between swap cache lookup and
swapoff too.

Cc: Hugh Dickins <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Aaron Lu <[email protected]>
Signed-off-by: "Huang, Ying" <[email protected]>

Changelog:

v3:

- Re-implemented with RCU to reduce the overhead of normal paths

v2:

- Re-implemented with SRCU to reduce the overhead of normal paths.

- Avoid checking whether the swap device has been swapped off in
get_swap_device(), because we can check the origin of the swap
entry to make sure the swap device hasn't been swapped off.
---
include/linux/swap.h | 11 +++++-
mm/memory.c | 2 +-
mm/swap_state.c | 16 ++++++--
mm/swapfile.c | 105 +++++++++++++++++++++++++++++++++++++++------------
4 files changed, 103 insertions(+), 31 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2417d288e016..f7e8f26cf07f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -172,8 +172,9 @@ enum {
SWP_PAGE_DISCARD = (1 << 9), /* freed swap page-cluster discards */
SWP_STABLE_WRITES = (1 << 10), /* no overwrite PG_writeback pages */
SWP_SYNCHRONOUS_IO = (1 << 11), /* synchronous IO is efficient */
+ SWP_VALID = (1 << 12), /* swap is valid to be operated on? */
/* add others here before... */
- SWP_SCANNING = (1 << 12), /* refcount in scan_swap_map */
+ SWP_SCANNING = (1 << 13), /* refcount in scan_swap_map */
};

#define SWAP_CLUSTER_MAX 32UL
@@ -460,7 +461,7 @@ extern unsigned int count_swap_pages(int, int);
extern sector_t map_swap_page(struct page *, struct block_device **);
extern sector_t swapdev_block(int, pgoff_t);
extern int page_swapcount(struct page *);
-extern int __swap_count(struct swap_info_struct *si, swp_entry_t entry);
+extern int __swap_count(swp_entry_t entry);
extern int __swp_swapcount(swp_entry_t entry);
extern int swp_swapcount(swp_entry_t entry);
extern struct swap_info_struct *page_swap_info(struct page *);
@@ -470,6 +471,12 @@ extern int try_to_free_swap(struct page *);
struct backing_dev_info;
extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
extern void exit_swap_address_space(unsigned int type);
+extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
+
+static inline void put_swap_device(struct swap_info_struct *si)
+{
+ rcu_read_unlock();
+}

#else /* CONFIG_SWAP */

diff --git a/mm/memory.c b/mm/memory.c
index 1a969992f76b..77a7d6191218 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2909,7 +2909,7 @@ int do_swap_page(struct vm_fault *vmf)
struct swap_info_struct *si = swp_swap_info(entry);

if (si->flags & SWP_SYNCHRONOUS_IO &&
- __swap_count(si, entry) == 1) {
+ __swap_count(entry) == 1) {
/* skip swapcache */
page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
vmf->address);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0b8ae361981f..8dde719e973c 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -337,8 +337,13 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
unsigned long addr)
{
struct page *page;
+ struct swap_info_struct *si;

+ si = get_swap_device(entry);
+ if (!si)
+ return NULL;
page = find_get_page(swap_address_space(entry), swp_offset(entry));
+ put_swap_device(si);

INC_CACHE_INFO(find_total);
if (page) {
@@ -376,8 +381,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr,
bool *new_page_allocated)
{
- struct page *found_page, *new_page = NULL;
- struct address_space *swapper_space = swap_address_space(entry);
+ struct page *found_page = NULL, *new_page = NULL;
+ struct swap_info_struct *si;
int err;
*new_page_allocated = false;

@@ -387,7 +392,12 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* called after lookup_swap_cache() failed, re-calling
* that would confuse statistics.
*/
- found_page = find_get_page(swapper_space, swp_offset(entry));
+ si = get_swap_device(entry);
+ if (!si)
+ break;
+ found_page = find_get_page(swap_address_space(entry),
+ swp_offset(entry));
+ put_swap_device(si);
if (found_page)
break;

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 42fe5653814a..ca7b4c5ebe34 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1107,6 +1107,46 @@ static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry,
return p;
}

+/*
+ * Check whether swap entry is valid in the swap device. If so,
+ * return pointer to swap_info_struct, and keep the swap entry valid
+ * via preventing the swap device from being swapoff, until
+ * put_swap_device() is called. Otherwise return NULL.
+ */
+struct swap_info_struct *get_swap_device(swp_entry_t entry)
+{
+ struct swap_info_struct *si;
+ unsigned long type, offset;
+
+ if (!entry.val)
+ goto out;
+ type = swp_type(entry);
+ if (type >= nr_swapfiles)
+ goto bad_nofile;
+ si = swap_info[type];
+
+ rcu_read_lock();
+ if (!(si->flags & SWP_VALID))
+ goto unlock_out;
+ /*
+ * Corresponds to smp_wmb() in _enable_swap_info() to make sure
+ * swap_map, cluster_info, etc. are read after flags is read
+ */
+ smp_rmb();
+ offset = swp_offset(entry);
+ if (offset >= si->max)
+ goto unlock_out;
+
+ return si;
+bad_nofile:
+ pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
+out:
+ return NULL;
+unlock_out:
+ rcu_read_unlock();
+ return NULL;
+}
+
static unsigned char __swap_entry_free(struct swap_info_struct *p,
swp_entry_t entry, unsigned char usage)
{
@@ -1328,11 +1368,18 @@ int page_swapcount(struct page *page)
return count;
}

-int __swap_count(struct swap_info_struct *si, swp_entry_t entry)
+int __swap_count(swp_entry_t entry)
{
+ struct swap_info_struct *si;
pgoff_t offset = swp_offset(entry);
+ int count = 0;

- return swap_count(si->swap_map[offset]);
+ si = get_swap_device(entry);
+ if (si) {
+ count = swap_count(si->swap_map[offset]);
+ put_swap_device(si);
+ }
+ return count;
}

static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
@@ -1357,9 +1404,11 @@ int __swp_swapcount(swp_entry_t entry)
int count = 0;
struct swap_info_struct *si;

- si = __swap_info_get(entry);
- if (si)
+ si = get_swap_device(entry);
+ if (si) {
count = swap_swapcount(si, entry);
+ put_swap_device(si);
+ }
return count;
}

@@ -2478,7 +2527,9 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
}
p->swap_map = swap_map;
p->cluster_info = cluster_info;
- p->flags |= SWP_WRITEOK;
+ /* Corresponds to smp_rmb() in get_swap_device(); see it for details */
+ smp_wmb();
+ p->flags |= SWP_WRITEOK | SWP_VALID;
atomic_long_add(p->pages, &nr_swap_pages);
total_swap_pages += p->pages;

@@ -2617,6 +2668,17 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)

reenable_swap_slots_cache_unlock();

+ spin_lock(&swap_lock);
+ spin_lock(&p->lock);
+ p->flags &= ~SWP_VALID; /* mark swap device as invalid */
+ spin_unlock(&p->lock);
+ spin_unlock(&swap_lock);
+ /*
+ * wait for swap operations protected by get/put_swap_device()
+ * to complete
+ */
+ synchronize_rcu();
+
flush_work(&p->discard_work);

destroy_swap_extents(p);
@@ -3356,22 +3418,16 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
{
struct swap_info_struct *p;
struct swap_cluster_info *ci;
- unsigned long offset, type;
+ unsigned long offset;
unsigned char count;
unsigned char has_cache;
int err = -EINVAL;

- if (non_swap_entry(entry))
+ p = get_swap_device(entry);
+ if (!p)
goto out;

- type = swp_type(entry);
- if (type >= nr_swapfiles)
- goto bad_file;
- p = swap_info[type];
offset = swp_offset(entry);
- if (unlikely(offset >= p->max))
- goto out;
-
ci = lock_cluster_or_swap_info(p, offset);

count = p->swap_map[offset];
@@ -3417,11 +3473,9 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
unlock_out:
unlock_cluster_or_swap_info(p, ci);
out:
+ if (p)
+ put_swap_device(p);
return err;
-
-bad_file:
- pr_err("swap_dup: %s%08lx\n", Bad_file, entry.val);
- goto out;
}

/*
@@ -3513,6 +3567,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
struct page *list_page;
pgoff_t offset;
unsigned char count;
+ int ret = 0;

/*
* When debugging, it's easier to use __GFP_ZERO here; but it's better
@@ -3520,15 +3575,15 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
*/
page = alloc_page(gfp_mask | __GFP_HIGHMEM);

- si = swap_info_get(entry);
+ si = get_swap_device(entry);
if (!si) {
/*
* An acceptable race has occurred since the failing
- * __swap_duplicate(): the swap entry has been freed,
- * perhaps even the whole swap_map cleared for swapoff.
+ * __swap_duplicate(): the swap device may be swapoff
*/
goto outer;
}
+ spin_lock(&si->lock);

offset = swp_offset(entry);

@@ -3546,9 +3601,8 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
}

if (!page) {
- unlock_cluster(ci);
- spin_unlock(&si->lock);
- return -ENOMEM;
+ ret = -ENOMEM;
+ goto out;
}

/*
@@ -3600,10 +3654,11 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
out:
unlock_cluster(ci);
spin_unlock(&si->lock);
+ put_swap_device(si);
outer:
if (page)
__free_page(page);
- return 0;
+ return ret;
}

/*
--
2.15.0


2017-12-18 07:41:46

by Huang, Ying

Subject: Re: [PATCH -V3 -mm] mm, swap: Fix race between swapoff and some swap operations

"Huang, Ying" <[email protected]> writes:

> From: Huang Ying <[email protected]>
>
> When the swapin is performed, after getting the swap entry information
> from the page table, system will swap in the swap entry, without any
> lock held to prevent the swap device from being swapoff. This may
> cause the race like below,
>
> CPU 1                           CPU 2
> -----                           -----
> do_swap_page
>   swapin_readahead
>     __read_swap_cache_async
>                                 swapoff
>                                   p->swap_map = NULL
>       swapcache_prepare
>         __swap_duplicate
>           p->swap_map[?]  /* !!! NULL pointer access */
>
> Because swapoff is usually done when system shutdown only, the race
> may not hit many people in practice. But it is still a race need to
> be fixed.
>
> To fix the race, get_swap_device() is added to check whether the
> specified swap entry is valid in its swap device. If so, it will keep
> the swap entry valid via preventing the swap device from being
> swapoff, until put_swap_device() is called.
>
> Because swapoff() is very race code path, to make the normal path runs
> as fast as possible, RCU instead of reference count is used to
> implement get/put_swap_device(). From get_swap_device() to
> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
> swapoff() will wait until put_swap_device() is called.
>
> In addition to swap_map, cluster_info, etc. data structure in the
> struct swap_info_struct, the swap cache radix tree will be freed after
> swapoff, so this patch fixes the race between swap cache looking up
> and swapoff too.
>
> Cc: Hugh Dickins <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Tim Chen <[email protected]>
> Cc: Shaohua Li <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: "Jérôme Glisse" <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Aaron Lu <[email protected]>
> Signed-off-by: "Huang, Ying" <[email protected]>
>
> Changelog:
>
> v3:
>
> - Re-implemented with RCU to reduce the overhead of normal paths
>
> v2:
>
> - Re-implemented with SRCU to reduce the overhead of normal paths.
>
> - Avoid to check whether the swap device has been swapoff in
> get_swap_device(). Because we can check the origin of the swap
> entry to make sure the swap device hasn't bee swapoff.

A version implemented via stop_machine() can be obtained with the small
patch below. If you still prefer stop_machine(), I can resend a
version implemented with it.

And it appears that if we replace the smp_wmb() in _enable_swap_info()
with stop_machine() in some way, we can avoid the smp_rmb() in
get_swap_device(). This could reduce the overhead of the normal path
further. Can we get the same effect with RCU? For example, by using
synchronize_rcu() instead of stop_machine()?

Hi, Paul, can you help me on this?

Best Regards,
Huang, Ying

----------------8<------------------------------
---
include/linux/swap.h | 2 +-
mm/swapfile.c | 12 +++++++++---
2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index f7e8f26cf07f..1027169d5a04 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -475,7 +475,7 @@ extern struct swap_info_struct *get_swap_device(swp_entry_t entry);

static inline void put_swap_device(struct swap_info_struct *si)
{
- rcu_read_unlock();
+ preempt_enable();
}

#else /* CONFIG_SWAP */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ca7b4c5ebe34..feb13ce01045 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -38,6 +38,7 @@
#include <linux/export.h>
#include <linux/swap_slots.h>
#include <linux/sort.h>
+#include <linux/stop_machine.h>

#include <asm/pgtable.h>
#include <asm/tlbflush.h>
@@ -1125,7 +1126,7 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
goto bad_nofile;
si = swap_info[type];

- rcu_read_lock();
+ preempt_disable();
if (!(si->flags & SWP_VALID))
goto unlock_out;
/*
@@ -1143,7 +1144,7 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
out:
return NULL;
unlock_out:
- rcu_read_unlock();
+ preempt_enable();
return NULL;
}

@@ -2581,6 +2582,11 @@ bool has_usable_swap(void)
return ret;
}

+static int swapoff_stop(void *arg)
+{
+ return 0;
+}
+
SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
{
struct swap_info_struct *p = NULL;
@@ -2677,7 +2683,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
* wait for swap operations protected by get/put_swap_device()
* to complete
*/
- synchronize_rcu();
+ stop_machine(swapoff_stop, NULL, cpu_online_mask);

flush_work(&p->discard_work);


2017-12-18 21:13:26

by Junaid Shahid

Subject: Re: [PATCH -V3 -mm] mm, swap: Fix race between swapoff and some swap operations

On Monday, December 18, 2017 3:41:41 PM PST Huang, Ying wrote:
>
> A version implemented via stop_machine() could be gotten via a small
> patch as below. If you still prefer stop_machine(), I can resend a
> version implemented with stop_machine().
>

For the stop_machine() version, would it work to just put preempt_disable/enable at the start and end of lock_cluster() rather than introducing get/put_swap_device()? Something like that might be simpler and would also disable preemption for a shorter duration.

Thanks,
Junaid

2017-12-18 23:09:46

by Paul E. McKenney

Subject: Re: [PATCH -V3 -mm] mm, swap: Fix race between swapoff and some swap operations

On Mon, Dec 18, 2017 at 03:41:41PM +0800, Huang, Ying wrote:
> "Huang, Ying" <[email protected]> writes:
>
> > From: Huang Ying <[email protected]>
> >
> > When the swapin is performed, after getting the swap entry information
> > from the page table, system will swap in the swap entry, without any
> > lock held to prevent the swap device from being swapoff. This may
> > cause the race like below,
> >
> > CPU 1                           CPU 2
> > -----                           -----
> > do_swap_page
> >   swapin_readahead
> >     __read_swap_cache_async
> >                                 swapoff
> >                                   p->swap_map = NULL
> >       swapcache_prepare
> >         __swap_duplicate
> >           p->swap_map[?]  /* !!! NULL pointer access */
> >
> > Because swapoff is usually done when system shutdown only, the race
> > may not hit many people in practice. But it is still a race need to
> > be fixed.
> >
> > To fix the race, get_swap_device() is added to check whether the
> > specified swap entry is valid in its swap device. If so, it will keep
> > the swap entry valid via preventing the swap device from being
> > swapoff, until put_swap_device() is called.
> >
> > Because swapoff() is very race code path, to make the normal path runs
> > as fast as possible, RCU instead of reference count is used to
> > implement get/put_swap_device(). From get_swap_device() to
> > put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
> > swapoff() will wait until put_swap_device() is called.
> >
> > In addition to swap_map, cluster_info, etc. data structure in the
> > struct swap_info_struct, the swap cache radix tree will be freed after
> > swapoff, so this patch fixes the race between swap cache looking up
> > and swapoff too.
> >
> > Cc: Hugh Dickins <[email protected]>
> > Cc: Paul E. McKenney <[email protected]>
> > Cc: Minchan Kim <[email protected]>
> > Cc: Johannes Weiner <[email protected]>
> > Cc: Tim Chen <[email protected]>
> > Cc: Shaohua Li <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: "Jérôme Glisse" <[email protected]>
> > Cc: Michal Hocko <[email protected]>
> > Cc: Andrea Arcangeli <[email protected]>
> > Cc: David Rientjes <[email protected]>
> > Cc: Rik van Riel <[email protected]>
> > Cc: Jan Kara <[email protected]>
> > Cc: Dave Jiang <[email protected]>
> > Cc: Aaron Lu <[email protected]>
> > Signed-off-by: "Huang, Ying" <[email protected]>
> >
> > Changelog:
> >
> > v3:
> >
> > - Re-implemented with RCU to reduce the overhead of normal paths
> >
> > v2:
> >
> > - Re-implemented with SRCU to reduce the overhead of normal paths.
> >
> > - Avoid to check whether the swap device has been swapoff in
> > get_swap_device(). Because we can check the origin of the swap
> > entry to make sure the swap device hasn't bee swapoff.
>
> A version implemented via stop_machine() could be gotten via a small
> patch as below. If you still prefer stop_machine(), I can resend a
> version implemented with stop_machine().
>
> And, it appears that if we replace smp_wmb() in _enable_swap_info() with
> stop_machine() in some way, we can avoid smp_rmb() in get_swap_device().
> This can reduce overhead in normal path further. Can we get same effect
> with RCU? For example, use synchronize_rcu() instead of stop_machine()?
>
> Hi, Paul, can you help me on this?

If the key loads before and after the smp_rmb() are within the same
RCU read-side critical section, -and- if one of the critical writes is
before the synchronize_rcu() and the other critical write is after the
synchronize_rcu(), then you normally don't need the smp_rmb().

Otherwise, you likely do still need the smp_rmb().

Thanx, Paul

> Best Regards,
> Huang, Ying
>
> ----------------8<------------------------------
> ---
> include/linux/swap.h | 2 +-
> mm/swapfile.c | 12 +++++++++---
> 2 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index f7e8f26cf07f..1027169d5a04 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -475,7 +475,7 @@ extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
>
> static inline void put_swap_device(struct swap_info_struct *si)
> {
> - rcu_read_unlock();
> + preempt_enable();
> }
>
> #else /* CONFIG_SWAP */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index ca7b4c5ebe34..feb13ce01045 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -38,6 +38,7 @@
> #include <linux/export.h>
> #include <linux/swap_slots.h>
> #include <linux/sort.h>
> +#include <linux/stop_machine.h>
>
> #include <asm/pgtable.h>
> #include <asm/tlbflush.h>
> @@ -1125,7 +1126,7 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
> goto bad_nofile;
> si = swap_info[type];
>
> - rcu_read_lock();
> + preempt_disable();
> if (!(si->flags & SWP_VALID))
> goto unlock_out;
> /*
> @@ -1143,7 +1144,7 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
> out:
> return NULL;
> unlock_out:
> - rcu_read_unlock();
> + preempt_enable();
> return NULL;
> }
>
> @@ -2581,6 +2582,11 @@ bool has_usable_swap(void)
> return ret;
> }
>
> +static int swapoff_stop(void *arg)
> +{
> + return 0;
> +}
> +
> SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> {
> struct swap_info_struct *p = NULL;
> @@ -2677,7 +2683,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> * wait for swap operations protected by get/put_swap_device()
> * to complete
> */
> - synchronize_rcu();
> + stop_machine(swapoff_stop, NULL, cpu_online_mask);
>
> flush_work(&p->discard_work);
>
>

2017-12-19 05:36:49

by Paul E. McKenney

Subject: Re: [PATCH -V3 -mm] mm, swap: Fix race between swapoff and some swap operations

On Tue, Dec 19, 2017 at 09:57:21AM +0800, Huang, Ying wrote:
> "Paul E. McKenney" <[email protected]> writes:
>
> > On Mon, Dec 18, 2017 at 03:41:41PM +0800, Huang, Ying wrote:
> >> "Huang, Ying" <[email protected]> writes:
> >> And, it appears that if we replace smp_wmb() in _enable_swap_info() with
> >> stop_machine() in some way, we can avoid smp_rmb() in get_swap_device().
> >> This can reduce overhead in normal path further. Can we get same effect
> >> with RCU? For example, use synchronize_rcu() instead of stop_machine()?
> >>
> >> Hi, Paul, can you help me on this?
> >
> > If the key loads before and after the smp_rmb() are within the same
> > RCU read-side critical section, -and- if one of the critical writes is
> > before the synchronize_rcu() and the other critical write is after the
> > synchronize_rcu(), then you normally don't need the smp_rmb().
> >
> > Otherwise, you likely do still need the smp_rmb().
>
> My question may be too general, let me make it more specific. For the
> following program,
>
> "
> int a;
> int b;
>
> void initialize(void)
> {
> a = 1;
> synchronize_rcu();
> b = 2;
> }
>
> void test(void)
> {
> int c;
>
> rcu_read_lock();
> c = b;
> /* ignored smp_rmb() */
> if (c)
> pr_info("a=%d\n", a);
> rcu_read_unlock();
> }
> "
>
> Is it possible for it to show
>
> "
> a=0
> "
>
> in kernel log?
>
>
> If it couldn't, this could be a useful usage model of RCU to accelerate
> hot path.

This is not possible, and it can be verified using the Linux kernel
memory model. An introduction to an older version of this model may
be found here (including an introduction to litmus tests and their
output):

https://lwn.net/Articles/718628/
https://lwn.net/Articles/720550/

The litmus test and its output are shown below.

The reason it is not possible is that the entirety of test()'s RCU
read-side critical section must do one of two things:

1. Come before the return from initialize()'s synchronize_rcu().
2. Come after the call to initialize()'s synchronize_rcu().

Suppose test()'s load from "b" sees initialize()'s assignment. Then
some part of test()'s RCU read-side critical section came after
initialize()'s call to synchronize_rcu(), which means that the entirety
of test()'s RCU read-side critical section must come after initialize()'s
call to synchronize_rcu(). Therefore, whenever "c" is non-zero, the
pr_info() must see "a" non-zero.

Thanx, Paul

------------------------------------------------------------------------

C MP-o-sync-o+rl-o-ctl-o-rul

{}

P0(int *a, int *b)
{
WRITE_ONCE(*a, 1);
synchronize_rcu();
WRITE_ONCE(*b, 2);
}

P1(int *a, int *b)
{
int r0;
int r1;

rcu_read_lock();
r0 = READ_ONCE(*b);
if (r0)
r1 = READ_ONCE(*a);
rcu_read_unlock();
}

exists (1:r0=2 /\ 1:r1=0)

------------------------------------------------------------------------

States 2
1:r0=0; 1:r1=0;
1:r0=2; 1:r1=1;
No
Witnesses
Positive: 0 Negative: 2
Condition exists (1:r0=2 /\ 1:r1=0)
Observation MP-o-sync-o+rl-o-ctl-o-rul Never 0 2
Time MP-o-sync-o+rl-o-ctl-o-rul 0.01
Hash=b20eca2da50fa84b15e489502420ff56

------------------------------------------------------------------------

The "Never 0 2" means that the condition cannot happen.

2017-12-19 08:08:19

by Huang, Ying

Subject: Re: [PATCH -V3 -mm] mm, swap: Fix race between swapoff and some swap operations

"Paul E. McKenney" <[email protected]> writes:

> On Tue, Dec 19, 2017 at 09:57:21AM +0800, Huang, Ying wrote:
>> "Paul E. McKenney" <[email protected]> writes:
>>
>> > On Mon, Dec 18, 2017 at 03:41:41PM +0800, Huang, Ying wrote:
>> >> "Huang, Ying" <[email protected]> writes:
>> >> And, it appears that if we replace smp_wmb() in _enable_swap_info() with
>> >> stop_machine() in some way, we can avoid smp_rmb() in get_swap_device().
>> >> This can reduce overhead in normal path further. Can we get same effect
>> >> with RCU? For example, use synchronize_rcu() instead of stop_machine()?
>> >>
>> >> Hi, Paul, can you help me on this?
>> >
>> > If the key loads before and after the smp_rmb() are within the same
>> > RCU read-side critical section, -and- if one of the critical writes is
>> > before the synchronize_rcu() and the other critical write is after the
>> > synchronize_rcu(), then you normally don't need the smp_rmb().
>> >
>> > Otherwise, you likely do still need the smp_rmb().
>>
>> My question may be too general, let me make it more specific. For the
>> following program,
>>
>> "
>> int a;
>> int b;
>>
>> void initialize(void)
>> {
>> a = 1;
>> synchronize_rcu();
>> b = 2;
>> }
>>
>> void test(void)
>> {
>> int c;
>>
>> rcu_read_lock();
>> c = b;
>> /* ignored smp_rmb() */
>> if (c)
>> pr_info("a=%d\n", a);
>> rcu_read_unlock();
>> }
>> "
>>
>> Is it possible for it to show
>>
>> "
>> a=0
>> "
>>
>> in kernel log?
>>
>>
>> If it couldn't, this could be a useful usage model of RCU to accelerate
>> hot path.
>
> This is not possible, and it can be verified using the Linux kernel
> memory model. An introduction to an older version of this model may
> be found here (including an introduction to litmus tests and their
> output):
>
> https://lwn.net/Articles/718628/
> https://lwn.net/Articles/720550/
>
> The litmus test and its output are shown below.
>
> The reason it is not possible is that the entirety of test()'s RCU
> read-side critical section must do one of two things:
>
> 1. Come before the return from initialize()'s synchronize_rcu().
> 2. Come after the call to initialize()'s synchronize_rcu().
>
> Suppose test()'s load from "b" sees initialize()'s assignment. Then
> some part of test()'s RCU read-side critical section came after
> initialize()'s call to synchronize_rcu(), which means that the entirety
> of test()'s RCU read-side critical section must come after initialize()'s
> call to synchronize_rcu(). Therefore, whenever "c" is non-zero, the
> pr_info() must see "a" non-zero.
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> C MP-o-sync-o+rl-o-ctl-o-rul
>
> {}
>
> P0(int *a, int *b)
> {
> WRITE_ONCE(*a, 1);
> synchronize_rcu();
> WRITE_ONCE(*b, 2);
> }
>
> P1(int *a, int *b)
> {
> int r0;
> int r1;
>
> rcu_read_lock();
> r0 = READ_ONCE(*b);
> if (r0)
> r1 = READ_ONCE(*a);
> rcu_read_unlock();
> }
>
> exists (1:r0=1 /\ 1:r1=0)
>
> ------------------------------------------------------------------------
>
> States 2
> 1:r0=0; 1:r1=0;
> 1:r0=2; 1:r1=1;
> No
> Witnesses
> Positive: 0 Negative: 2
> Condition exists (1:r0=1 /\ 1:r1=0)
> Observation MP-o-sync-o+rl-o-ctl-o-rul Never 0 2
> Time MP-o-sync-o+rl-o-ctl-o-rul 0.01
> Hash=b20eca2da50fa84b15e489502420ff56
>
> ------------------------------------------------------------------------
>
> The "Never 0 2" means that the condition cannot happen.

Thanks a lot for your detailed explanation! That helps a lot!

Best Regards,
Huang, Ying