2021-04-17 09:43:38

by Miaohe Lin

Subject: [PATCH v2 0/5] close various race windows for swap

Hi all,
When I was investigating the swap code, I found some possible race
windows. This series aims to fix all of these races. But using the
current get/put_swap_device() to guard against concurrent swapoff for
swap_readpage() looks terrible because swap_readpage() may take a
really long time. And to reduce the performance overhead on the hot
path as much as possible, it appears we can use percpu_ref to close
this race window (as suggested by Huang, Ying). Patch 1 adds
percpu_ref support for swap, and most of the remaining patches use it
to close various race windows. More details can be found in the
respective changelogs. Thanks!
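The percpu_ref guard described above can be sketched as a small userspace analogue. This is only an illustration: a single C11 atomic counter stands in for the kernel's per-CPU reference count, and all names here (swap_dev, dev_tryget_live, dev_put, dev_kill_and_drain) are invented for the sketch, not kernel APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Minimal userspace analogue of the percpu_ref guard used by
 * get/put_swap_device().  "live" models the live/dead state that
 * percpu_ref_kill() switches off; "users" counts outstanding
 * references.
 */
struct swap_dev {
	atomic_bool live;   /* true until "swapoff" kills the ref */
	atomic_long users;  /* outstanding references */
};

/* get_swap_device() analogue: succeed only while the device is live. */
static bool dev_tryget_live(struct swap_dev *d)
{
	atomic_fetch_add(&d->users, 1);
	if (!atomic_load(&d->live)) {
		/* raced with teardown: drop the reference and fail */
		atomic_fetch_sub(&d->users, 1);
		return false;
	}
	return true;
}

/* put_swap_device() analogue. */
static void dev_put(struct swap_dev *d)
{
	atomic_fetch_sub(&d->users, 1);
}

/*
 * swapoff-side analogue of percpu_ref_kill() plus waiting for the
 * drain: fail all future trygets, then wait until every existing
 * user has dropped its reference.
 */
static void dev_kill_and_drain(struct swap_dev *d)
{
	atomic_store(&d->live, false);
	while (atomic_load(&d->users) > 0)
		;  /* the kernel sleeps here instead of spinning */
}
```

Once dev_kill_and_drain() returns, no reader can still be inside a dev_tryget_live()/dev_put() section, so fields like swap_file can be torn down safely.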

v1->v2:
reorganize patch 2/5
various enhancements and fixups per Huang, Ying
Many thanks to Huang, Ying, Dennis Zhou and Tim Chen for the comments.

Miaohe Lin (5):
mm/swapfile: add percpu_ref support for swap
mm/swapfile: use percpu_ref to serialize against concurrent swapoff
swap: fix do_swap_page() race with swapoff
mm/swap: remove confusing checking for non_swap_entry() in
swap_ra_info()
mm/shmem: fix shmem_swapin() race with swapoff

include/linux/swap.h | 15 +++++++--
mm/memory.c | 9 ++++++
mm/shmem.c | 6 ++++
mm/swap_state.c | 6 ----
mm/swapfile.c | 74 +++++++++++++++++++++++++++-----------------
5 files changed, 73 insertions(+), 37 deletions(-)

--
2.19.1


2021-04-17 09:43:38

by Miaohe Lin

Subject: [PATCH v2 3/5] swap: fix do_swap_page() race with swapoff

When I was investigating the swap code, I found the below possible race
window:

CPU 1                                   CPU 2
-----                                   -----
do_swap_page
  swap_readpage(skip swap cache case)
    if (data_race(sis->flags & SWP_FS_OPS)) {
                                        swapoff
                                          p->flags &= ~SWP_VALID;
                                          ..
                                          synchronize_rcu();
                                          ..
                                          p->swap_file = NULL;
    struct file *swap_file = sis->swap_file;
    struct address_space *mapping = swap_file->f_mapping;[oops!]

Note that for the pages that are swapped in through swap cache, this isn't
an issue. Because the page is locked, and the swap entry will be marked
with SWAP_HAS_CACHE, so swapoff() can not proceed until the page has been
unlocked.

Using the current get/put_swap_device() to guard against concurrent
swapoff for swap_readpage() looks terrible because swap_readpage() may
take a really long time. And this race may not be really pernicious
because swapoff is usually done only at system shutdown. To reduce the
performance overhead on the hot path as much as possible, it appears we
can use percpu_ref to close this race window (as suggested by Huang,
Ying).

Fixes: 0bcac06f27d7 ("mm,swap: skip swapcache for swapin of synchronous device")
Reported-by: kernel test robot <[email protected]> (auto build test ERROR)
Signed-off-by: Miaohe Lin <[email protected]>
---
include/linux/swap.h | 9 +++++++++
mm/memory.c | 9 +++++++++
2 files changed, 18 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 993693b38109..523c2411a135 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -528,6 +528,15 @@ static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
return NULL;
}

+static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
+{
+ return NULL;
+}
+
+static inline void put_swap_device(struct swap_info_struct *si)
+{
+}
+
#define swap_address_space(entry) (NULL)
#define get_nr_swap_pages() 0L
#define total_swap_pages 0L
diff --git a/mm/memory.c b/mm/memory.c
index 27014c3bde9f..7a2fe12cf641 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
struct page *page = NULL, *swapcache;
+ struct swap_info_struct *si = NULL;
swp_entry_t entry;
pte_t pte;
int locked;
@@ -3338,6 +3339,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out;
}

+ /* Prevent swapoff from happening to us. */
+ si = get_swap_device(entry);
+ if (unlikely(!si))
+ goto out;

delayacct_set_flag(current, DELAYACCT_PF_SWAPIN);
page = lookup_swap_cache(entry, vma, vmf->address);
@@ -3514,6 +3519,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
+ if (si)
+ put_swap_device(si);
return ret;
out_nomap:
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -3525,6 +3532,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
unlock_page(swapcache);
put_page(swapcache);
}
+ if (si)
+ put_swap_device(si);
return ret;
}

--
2.19.1

2021-04-17 09:43:41

by Miaohe Lin

Subject: [PATCH v2 2/5] mm/swapfile: use percpu_ref to serialize against concurrent swapoff

Use percpu_ref to serialize against concurrent swapoff. Also remove the
SWP_VALID flag because it was only used together with the RCU solution.

Signed-off-by: Miaohe Lin <[email protected]>
---
include/linux/swap.h | 3 +--
mm/swapfile.c | 43 +++++++++++++++++--------------------------
2 files changed, 18 insertions(+), 28 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8be36eb58b7a..993693b38109 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -177,7 +177,6 @@ enum {
SWP_PAGE_DISCARD = (1 << 10), /* freed swap page-cluster discards */
SWP_STABLE_WRITES = (1 << 11), /* no overwrite PG_writeback pages */
SWP_SYNCHRONOUS_IO = (1 << 12), /* synchronous IO is efficient */
- SWP_VALID = (1 << 13), /* swap is valid to be operated on? */
/* add others here before... */
SWP_SCANNING = (1 << 14), /* refcount in scan_swap_map */
};
@@ -514,7 +513,7 @@ sector_t swap_page_sector(struct page *page);

static inline void put_swap_device(struct swap_info_struct *si)
{
- rcu_read_unlock();
+ percpu_ref_put(&si->users);
}

#else /* CONFIG_SWAP */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 66515a3a2824..90e197bc2eeb 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1279,18 +1279,12 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
* via preventing the swap device from being swapoff, until
* put_swap_device() is called. Otherwise return NULL.
*
- * The entirety of the RCU read critical section must come before the
- * return from or after the call to synchronize_rcu() in
- * enable_swap_info() or swapoff(). So if "si->flags & SWP_VALID" is
- * true, the si->map, si->cluster_info, etc. must be valid in the
- * critical section.
- *
* Notice that swapoff or swapoff+swapon can still happen before the
- * rcu_read_lock() in get_swap_device() or after the rcu_read_unlock()
- * in put_swap_device() if there isn't any other way to prevent
- * swapoff, such as page lock, page table lock, etc. The caller must
- * be prepared for that. For example, the following situation is
- * possible.
+ * percpu_ref_tryget_live() in get_swap_device() or after the
+ * percpu_ref_put() in put_swap_device() if there isn't any other way
+ * to prevent swapoff, such as page lock, page table lock, etc. The
+ * caller must be prepared for that. For example, the following
+ * situation is possible.
*
* CPU1 CPU2
* do_swap_page()
@@ -1318,21 +1312,24 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
si = swp_swap_info(entry);
if (!si)
goto bad_nofile;
-
- rcu_read_lock();
- if (data_race(!(si->flags & SWP_VALID)))
- goto unlock_out;
+ if (!percpu_ref_tryget_live(&si->users))
+ goto out;
+ /*
+ * Guarantee we will not reference uninitialized fields
+ * of swap_info_struct.
+ */
+ smp_rmb();
offset = swp_offset(entry);
if (offset >= si->max)
- goto unlock_out;
+ goto put_out;

return si;
bad_nofile:
pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
out:
return NULL;
-unlock_out:
- rcu_read_unlock();
+put_out:
+ percpu_ref_put(&si->users);
return NULL;
}

@@ -2475,7 +2472,7 @@ static void setup_swap_info(struct swap_info_struct *p, int prio,

static void _enable_swap_info(struct swap_info_struct *p)
{
- p->flags |= SWP_WRITEOK | SWP_VALID;
+ p->flags |= SWP_WRITEOK;
atomic_long_add(p->pages, &nr_swap_pages);
total_swap_pages += p->pages;

@@ -2507,7 +2504,7 @@ static void enable_swap_info(struct swap_info_struct *p, int prio,
spin_unlock(&swap_lock);
/*
* Guarantee swap_map, cluster_info, etc. fields are valid
- * between get/put_swap_device() if SWP_VALID bit is set
+ * between get/put_swap_device().
*/
percpu_ref_resurrect(&p->users);
spin_lock(&swap_lock);
@@ -2625,12 +2622,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)

reenable_swap_slots_cache_unlock();

- spin_lock(&swap_lock);
- spin_lock(&p->lock);
- p->flags &= ~SWP_VALID; /* mark swap device as invalid */
- spin_unlock(&p->lock);
- spin_unlock(&swap_lock);
-
percpu_ref_kill(&p->users);
/*
* We need synchronize_rcu() here to protect the accessing
--
2.19.1

2021-04-17 09:44:13

by Miaohe Lin

Subject: [PATCH v2 5/5] mm/shmem: fix shmem_swapin() race with swapoff

When I was investigating the swap code, I found the below possible race
window:

CPU 1                                   CPU 2
-----                                   -----
shmem_swapin
  swap_cluster_readahead
    if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
                                        swapoff
                                          si->flags &= ~SWP_VALID;
                                          ..
                                          synchronize_rcu();
                                          ..
                                          si->swap_file = NULL;
    struct inode *inode = si->swap_file->f_mapping->host;[oops!]

Close this race window by using get/put_swap_device() to guard against
concurrent swapoff.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Miaohe Lin <[email protected]>
---
mm/shmem.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index 26c76b13ad23..936ba5595297 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1492,15 +1492,21 @@ static void shmem_pseudo_vma_destroy(struct vm_area_struct *vma)
static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index)
{
+ struct swap_info_struct *si;
struct vm_area_struct pvma;
struct page *page;
struct vm_fault vmf = {
.vma = &pvma,
};

+ /* Prevent swapoff from happening to us. */
+ si = get_swap_device(swap);
+ if (unlikely(!si))
+ return NULL;
shmem_pseudo_vma_init(&pvma, info, index);
page = swap_cluster_readahead(swap, gfp, &vmf);
shmem_pseudo_vma_destroy(&pvma);
+ put_swap_device(si);

return page;
}
--
2.19.1

2021-04-19 02:18:24

by Huang, Ying

Subject: Re: [PATCH v2 5/5] mm/shmem: fix shmem_swapin() race with swapoff

Miaohe Lin <[email protected]> writes:

> When I was investigating the swap code, I found the below possible race
> window:
>
> CPU 1 CPU 2
> ----- -----
> shmem_swapin
> swap_cluster_readahead
> if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
> swapoff
> si->flags &= ~SWP_VALID;
> ..
> synchronize_rcu();
> ..

You have removed this code in the previous patches of the series, so
it is not relevant in this patch.

> si->swap_file = NULL;
> struct inode *inode = si->swap_file->f_mapping->host;[oops!]
>
> Close this race window by using get/put_swap_device() to guard against
> concurrent swapoff.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")

No. This isn't the commit that introduces the race condition. Please
recheck your git blame result.

Best Regards,
Huang, Ying

> Signed-off-by: Miaohe Lin <[email protected]>
> ---
> mm/shmem.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 26c76b13ad23..936ba5595297 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1492,15 +1492,21 @@ static void shmem_pseudo_vma_destroy(struct vm_area_struct *vma)
> static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
> struct shmem_inode_info *info, pgoff_t index)
> {
> + struct swap_info_struct *si;
> struct vm_area_struct pvma;
> struct page *page;
> struct vm_fault vmf = {
> .vma = &pvma,
> };
>
> + /* Prevent swapoff from happening to us. */
> + si = get_swap_device(swap);
> + if (unlikely(!si))
> + return NULL;
> shmem_pseudo_vma_init(&pvma, info, index);
> page = swap_cluster_readahead(swap, gfp, &vmf);
> shmem_pseudo_vma_destroy(&pvma);
> + put_swap_device(si);
>
> return page;
> }

2021-04-19 02:25:19

by Huang, Ying

Subject: Re: [PATCH v2 3/5] swap: fix do_swap_page() race with swapoff

Miaohe Lin <[email protected]> writes:

> When I was investigating the swap code, I found the below possible race
> window:
>
> CPU 1 CPU 2
> ----- -----
> do_swap_page

This is OK for swap cache cases. So

if (data_race(si->flags & SWP_SYNCHRONOUS_IO))

should be shown here.

> swap_readpage(skip swap cache case)
> if (data_race(sis->flags & SWP_FS_OPS)) {
> swapoff
> p->flags &= ~SWP_VALID;
> ..
> synchronize_rcu();
> ..
> p->swap_file = NULL;
> struct file *swap_file = sis->swap_file;
> struct address_space *mapping = swap_file->f_mapping;[oops!]
>
> Note that for the pages that are swapped in through swap cache, this isn't
> an issue. Because the page is locked, and the swap entry will be marked
> with SWAP_HAS_CACHE, so swapoff() can not proceed until the page has been
> unlocked.
>
> Using the current get/put_swap_device() to guard against concurrent swapoff
> for swap_readpage() looks terrible because swap_readpage() may take a really
> long time. And this race may not be really pernicious because swapoff is
> usually done only at system shutdown. To reduce the performance overhead on
> the hot path as much as possible, it appears we can use percpu_ref to close
> this race window (as suggested by Huang, Ying).

I still suggest squashing PATCH 1-3, at least PATCH 1-2. That will
change the relevant code together and make it easier to review.

Best Regards,
Huang, Ying

> Fixes: 0bcac06f27d7 ("mm,swap: skip swapcache for swapin of synchronous device")
> Reported-by: kernel test robot <[email protected]> (auto build test ERROR)
> Signed-off-by: Miaohe Lin <[email protected]>
> ---
> include/linux/swap.h | 9 +++++++++
> mm/memory.c | 9 +++++++++
> 2 files changed, 18 insertions(+)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 993693b38109..523c2411a135 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -528,6 +528,15 @@ static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
> return NULL;
> }
>
> +static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
> +{
> + return NULL;
> +}
> +
> +static inline void put_swap_device(struct swap_info_struct *si)
> +{
> +}
> +
> #define swap_address_space(entry) (NULL)
> #define get_nr_swap_pages() 0L
> #define total_swap_pages 0L
> diff --git a/mm/memory.c b/mm/memory.c
> index 27014c3bde9f..7a2fe12cf641 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> struct page *page = NULL, *swapcache;
> + struct swap_info_struct *si = NULL;
> swp_entry_t entry;
> pte_t pte;
> int locked;
> @@ -3338,6 +3339,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> goto out;
> }
>
> + /* Prevent swapoff from happening to us. */
> + si = get_swap_device(entry);
> + if (unlikely(!si))
> + goto out;
>
> delayacct_set_flag(current, DELAYACCT_PF_SWAPIN);
> page = lookup_swap_cache(entry, vma, vmf->address);
> @@ -3514,6 +3519,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> unlock:
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> out:
> + if (si)
> + put_swap_device(si);
> return ret;
> out_nomap:
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -3525,6 +3532,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> unlock_page(swapcache);
> put_page(swapcache);
> }
> + if (si)
> + put_swap_device(si);
> return ret;
> }

2021-04-19 02:56:01

by Huang, Ying

Subject: Re: [PATCH v2 2/5] mm/swapfile: use percpu_ref to serialize against concurrent swapoff

Miaohe Lin <[email protected]> writes:

> Use percpu_ref to serialize against concurrent swapoff. Also remove the
> SWP_VALID flag because it was only used together with the RCU solution.
>
> Signed-off-by: Miaohe Lin <[email protected]>
> ---
> include/linux/swap.h | 3 +--
> mm/swapfile.c | 43 +++++++++++++++++--------------------------
> 2 files changed, 18 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 8be36eb58b7a..993693b38109 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -177,7 +177,6 @@ enum {
> SWP_PAGE_DISCARD = (1 << 10), /* freed swap page-cluster discards */
> SWP_STABLE_WRITES = (1 << 11), /* no overwrite PG_writeback pages */
> SWP_SYNCHRONOUS_IO = (1 << 12), /* synchronous IO is efficient */
> - SWP_VALID = (1 << 13), /* swap is valid to be operated on? */
> /* add others here before... */
> SWP_SCANNING = (1 << 14), /* refcount in scan_swap_map */
> };
> @@ -514,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>
> static inline void put_swap_device(struct swap_info_struct *si)
> {
> - rcu_read_unlock();
> + percpu_ref_put(&si->users);
> }
>
> #else /* CONFIG_SWAP */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 66515a3a2824..90e197bc2eeb 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1279,18 +1279,12 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
> * via preventing the swap device from being swapoff, until
> * put_swap_device() is called. Otherwise return NULL.
> *
> - * The entirety of the RCU read critical section must come before the
> - * return from or after the call to synchronize_rcu() in
> - * enable_swap_info() or swapoff(). So if "si->flags & SWP_VALID" is
> - * true, the si->map, si->cluster_info, etc. must be valid in the
> - * critical section.
> - *
> * Notice that swapoff or swapoff+swapon can still happen before the
> - * rcu_read_lock() in get_swap_device() or after the rcu_read_unlock()
> - * in put_swap_device() if there isn't any other way to prevent
> - * swapoff, such as page lock, page table lock, etc. The caller must
> - * be prepared for that. For example, the following situation is
> - * possible.
> + * percpu_ref_tryget_live() in get_swap_device() or after the
> + * percpu_ref_put() in put_swap_device() if there isn't any other way
> + * to prevent swapoff, such as page lock, page table lock, etc. The
> + * caller must be prepared for that. For example, the following
> + * situation is possible.
> *
> * CPU1 CPU2
> * do_swap_page()
> @@ -1318,21 +1312,24 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
> si = swp_swap_info(entry);
> if (!si)
> goto bad_nofile;
> -
> - rcu_read_lock();
> - if (data_race(!(si->flags & SWP_VALID)))
> - goto unlock_out;
> + if (!percpu_ref_tryget_live(&si->users))
> + goto out;
> + /*
> + * Guarantee we will not reference uninitialized fields
> + * of swap_info_struct.
> + */

/*
* Guarantee the si->users are checked before accessing other fields of
* swap_info_struct.
*/

> + smp_rmb();

Usually, smp_rmb() needs to be paired with smp_wmb(), and some comments
are needed for that. Here smp_rmb() is paired with the spin_unlock()
after setup_swap_info() in enable_swap_info().
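The pairing described above can be sketched in userspace C11 with release/acquire semantics. In this hedged model, an atomic "live" flag stands in for si->users, the release store plays the role the spin_unlock() after setup_swap_info() plays for the kernel, and the acquire load plays the role of percpu_ref_tryget_live() followed by smp_rmb(). All names and values here are illustrative, not kernel APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Publisher/subscriber ordering sketch: the writer fully initializes
 * the plain fields, then publishes them with a release store; the
 * reader checks the flag with an acquire load before touching the
 * fields, so it can never observe them uninitialized.
 */
struct dev_model {
	long max;               /* plain fields, set up by "swapon" */
	const char *swap_file;
	atomic_bool live;       /* stands in for si->users */
};

/* enable_swap_info() analogue: release-publish the initialized fields. */
static void model_enable(struct dev_model *d)
{
	d->max = 1024;              /* illustrative values */
	d->swap_file = "swapfile";
	/* release: everything above is visible before live reads true */
	atomic_store_explicit(&d->live, true, memory_order_release);
}

/* get_swap_device() analogue: acquire pairs with the release above. */
static bool model_get(const struct dev_model *d, long *max_out)
{
	if (!atomic_load_explicit(&d->live, memory_order_acquire))
		return false;
	*max_out = d->max;  /* guaranteed initialized at this point */
	return true;
}
```

The acquire load forbids reordering the field reads before the flag check, just as smp_rmb() does after percpu_ref_tryget_live() in the patch.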

> offset = swp_offset(entry);
> if (offset >= si->max)
> - goto unlock_out;
> + goto put_out;
>
> return si;
> bad_nofile:
> pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
> out:
> return NULL;
> -unlock_out:
> - rcu_read_unlock();
> +put_out:
> + percpu_ref_put(&si->users);
> return NULL;
> }
>
> @@ -2475,7 +2472,7 @@ static void setup_swap_info(struct swap_info_struct *p, int prio,
>
> static void _enable_swap_info(struct swap_info_struct *p)
> {
> - p->flags |= SWP_WRITEOK | SWP_VALID;
> + p->flags |= SWP_WRITEOK;
> atomic_long_add(p->pages, &nr_swap_pages);
> total_swap_pages += p->pages;
>
> @@ -2507,7 +2504,7 @@ static void enable_swap_info(struct swap_info_struct *p, int prio,
> spin_unlock(&swap_lock);
> /*
> * Guarantee swap_map, cluster_info, etc. fields are valid
> - * between get/put_swap_device() if SWP_VALID bit is set
> + * between get/put_swap_device().
> */

The comments need to be revised. Something like below?

/* Finished initializing the swap device, now it's safe to reference it */

Best Regards,
Huang, Ying

> percpu_ref_resurrect(&p->users);
> spin_lock(&swap_lock);
> @@ -2625,12 +2622,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>
> reenable_swap_slots_cache_unlock();
>
> - spin_lock(&swap_lock);
> - spin_lock(&p->lock);
> - p->flags &= ~SWP_VALID; /* mark swap device as invalid */
> - spin_unlock(&p->lock);
> - spin_unlock(&swap_lock);
> -
> percpu_ref_kill(&p->users);
> /*
> * We need synchronize_rcu() here to protect the accessing

2021-04-19 06:51:58

by Miaohe Lin

Subject: Re: [PATCH v2 5/5] mm/shmem: fix shmem_swapin() race with swapoff

On 2021/4/19 10:15, Huang, Ying wrote:
> Miaohe Lin <[email protected]> writes:
>
>> When I was investigating the swap code, I found the below possible race
>> window:
>>
>> CPU 1 CPU 2
>> ----- -----
>> shmem_swapin
>> swap_cluster_readahead
>> if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
>> swapoff
>> si->flags &= ~SWP_VALID;
>> ..
>> synchronize_rcu();
>> ..
>
> You have removed this code in the previous patches of the series, so
> it is not relevant in this patch.

Yes, I should change these. Thanks.

>
>> si->swap_file = NULL;
>> struct inode *inode = si->swap_file->f_mapping->host;[oops!]
>>
>> Close this race window by using get/put_swap_device() to guard against
>> concurrent swapoff.
>>
>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>
> No. This isn't the commit that introduces the race condition. Please
> recheck your git blame result.
>

I think it is really hard to find the exact commit. I used git blame and
found that this race has existed since this code was introduced. Any
suggestions? Thanks.

> Best Regards,
> Huang, Ying
>
>> Signed-off-by: Miaohe Lin <[email protected]>
>> ---
>> mm/shmem.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 26c76b13ad23..936ba5595297 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1492,15 +1492,21 @@ static void shmem_pseudo_vma_destroy(struct vm_area_struct *vma)
>> static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
>> struct shmem_inode_info *info, pgoff_t index)
>> {
>> + struct swap_info_struct *si;
>> struct vm_area_struct pvma;
>> struct page *page;
>> struct vm_fault vmf = {
>> .vma = &pvma,
>> };
>>
>> + /* Prevent swapoff from happening to us. */
>> + si = get_swap_device(swap);
>> + if (unlikely(!si))
>> + return NULL;
>> shmem_pseudo_vma_init(&pvma, info, index);
>> page = swap_cluster_readahead(swap, gfp, &vmf);
>> shmem_pseudo_vma_destroy(&pvma);
>> + put_swap_device(si);
>>
>> return page;
>> }
> .
>

2021-04-19 06:57:43

by Miaohe Lin

Subject: Re: [PATCH v2 3/5] swap: fix do_swap_page() race with swapoff

On 2021/4/19 10:23, Huang, Ying wrote:
> Miaohe Lin <[email protected]> writes:
>
>> When I was investigating the swap code, I found the below possible race
>> window:
>>
>> CPU 1 CPU 2
>> ----- -----
>> do_swap_page
>
> This is OK for swap cache cases. So
>
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
>
> should be shown here.

Ok.

>
>> swap_readpage(skip swap cache case)
>> if (data_race(sis->flags & SWP_FS_OPS)) {
>> swapoff
>> p->flags &= ~SWP_VALID;
>> ..
>> synchronize_rcu();
>> ..
>> p->swap_file = NULL;
>> struct file *swap_file = sis->swap_file;
>> struct address_space *mapping = swap_file->f_mapping;[oops!]
>>
>> Note that for the pages that are swapped in through swap cache, this isn't
>> an issue. Because the page is locked, and the swap entry will be marked
>> with SWAP_HAS_CACHE, so swapoff() can not proceed until the page has been
>> unlocked.
>>
>> Using current get/put_swap_device() to guard against concurrent swapoff for
>> swap_readpage() looks terrible because swap_readpage() may take really long
>> time. And this race may not be really pernicious because swapoff is usually
>> done when system shutdown only. To reduce the performance overhead on the
>> hot-path as much as possible, it appears we can use the percpu_ref to close
>> this race window(as suggested by Huang, Ying).
>
> I still suggest to squash PATCH 1-3, at least PATCH 1-2. That will
> change the relevant code together and make it easier to review.
>

Will squash PATCH 1-2. Thanks.

> Best Regards,
> Huang, Ying
>
>> Fixes: 0bcac06f27d7 ("mm,swap: skip swapcache for swapin of synchronous device")
>> Reported-by: kernel test robot <[email protected]> (auto build test ERROR)
>> Signed-off-by: Miaohe Lin <[email protected]>
>> ---
>> include/linux/swap.h | 9 +++++++++
>> mm/memory.c | 9 +++++++++
>> 2 files changed, 18 insertions(+)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 993693b38109..523c2411a135 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -528,6 +528,15 @@ static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
>> return NULL;
>> }
>>
>> +static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
>> +{
>> + return NULL;
>> +}
>> +
>> +static inline void put_swap_device(struct swap_info_struct *si)
>> +{
>> +}
>> +
>> #define swap_address_space(entry) (NULL)
>> #define get_nr_swap_pages() 0L
>> #define total_swap_pages 0L
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 27014c3bde9f..7a2fe12cf641 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> {
>> struct vm_area_struct *vma = vmf->vma;
>> struct page *page = NULL, *swapcache;
>> + struct swap_info_struct *si = NULL;
>> swp_entry_t entry;
>> pte_t pte;
>> int locked;
>> @@ -3338,6 +3339,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> goto out;
>> }
>>
>> + /* Prevent swapoff from happening to us. */
>> + si = get_swap_device(entry);
>> + if (unlikely(!si))
>> + goto out;
>>
>> delayacct_set_flag(current, DELAYACCT_PF_SWAPIN);
>> page = lookup_swap_cache(entry, vma, vmf->address);
>> @@ -3514,6 +3519,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> unlock:
>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>> out:
>> + if (si)
>> + put_swap_device(si);
>> return ret;
>> out_nomap:
>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>> @@ -3525,6 +3532,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> unlock_page(swapcache);
>> put_page(swapcache);
>> }
>> + if (si)
>> + put_swap_device(si);
>> return ret;
>> }
> .
>

2021-04-19 06:58:57

by Miaohe Lin

Subject: Re: [PATCH v2 2/5] mm/swapfile: use percpu_ref to serialize against concurrent swapoff

On 2021/4/19 10:54, Huang, Ying wrote:
> Miaohe Lin <[email protected]> writes:
>
>> Use percpu_ref to serialize against concurrent swapoff. Also remove the
>> SWP_VALID flag because it's used together with RCU solution.
>>
>> Signed-off-by: Miaohe Lin <[email protected]>
>> ---
>> include/linux/swap.h | 3 +--
>> mm/swapfile.c | 43 +++++++++++++++++--------------------------
>> 2 files changed, 18 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 8be36eb58b7a..993693b38109 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -177,7 +177,6 @@ enum {
>> SWP_PAGE_DISCARD = (1 << 10), /* freed swap page-cluster discards */
>> SWP_STABLE_WRITES = (1 << 11), /* no overwrite PG_writeback pages */
>> SWP_SYNCHRONOUS_IO = (1 << 12), /* synchronous IO is efficient */
>> - SWP_VALID = (1 << 13), /* swap is valid to be operated on? */
>> /* add others here before... */
>> SWP_SCANNING = (1 << 14), /* refcount in scan_swap_map */
>> };
>> @@ -514,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>>
>> static inline void put_swap_device(struct swap_info_struct *si)
>> {
>> - rcu_read_unlock();
>> + percpu_ref_put(&si->users);
>> }
>>
>> #else /* CONFIG_SWAP */
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 66515a3a2824..90e197bc2eeb 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -1279,18 +1279,12 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
>> * via preventing the swap device from being swapoff, until
>> * put_swap_device() is called. Otherwise return NULL.
>> *
>> - * The entirety of the RCU read critical section must come before the
>> - * return from or after the call to synchronize_rcu() in
>> - * enable_swap_info() or swapoff(). So if "si->flags & SWP_VALID" is
>> - * true, the si->map, si->cluster_info, etc. must be valid in the
>> - * critical section.
>> - *
>> * Notice that swapoff or swapoff+swapon can still happen before the
>> - * rcu_read_lock() in get_swap_device() or after the rcu_read_unlock()
>> - * in put_swap_device() if there isn't any other way to prevent
>> - * swapoff, such as page lock, page table lock, etc. The caller must
>> - * be prepared for that. For example, the following situation is
>> - * possible.
>> + * percpu_ref_tryget_live() in get_swap_device() or after the
>> + * percpu_ref_put() in put_swap_device() if there isn't any other way
>> + * to prevent swapoff, such as page lock, page table lock, etc. The
>> + * caller must be prepared for that. For example, the following
>> + * situation is possible.
>> *
>> * CPU1 CPU2
>> * do_swap_page()
>> @@ -1318,21 +1312,24 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>> si = swp_swap_info(entry);
>> if (!si)
>> goto bad_nofile;
>> -
>> - rcu_read_lock();
>> - if (data_race(!(si->flags & SWP_VALID)))
>> - goto unlock_out;
>> + if (!percpu_ref_tryget_live(&si->users))
>> + goto out;
>> + /*
>> + * Guarantee we will not reference uninitialized fields
>> + * of swap_info_struct.
>> + */
>
> /*
> * Guarantee the si->users are checked before accessing other fields of
> * swap_info_struct.
> */
>
>> + smp_rmb();
>
> Usually, smp_rmb() needs to be paired with smp_wmb(), and some comments
> are needed for that. Here smp_rmb() is paired with the spin_unlock()
> after setup_swap_info() in enable_swap_info().
>
>> offset = swp_offset(entry);
>> if (offset >= si->max)
>> - goto unlock_out;
>> + goto put_out;
>>
>> return si;
>> bad_nofile:
>> pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
>> out:
>> return NULL;
>> -unlock_out:
>> - rcu_read_unlock();
>> +put_out:
>> + percpu_ref_put(&si->users);
>> return NULL;
>> }
>>
>> @@ -2475,7 +2472,7 @@ static void setup_swap_info(struct swap_info_struct *p, int prio,
>>
>> static void _enable_swap_info(struct swap_info_struct *p)
>> {
>> - p->flags |= SWP_WRITEOK | SWP_VALID;
>> + p->flags |= SWP_WRITEOK;
>> atomic_long_add(p->pages, &nr_swap_pages);
>> total_swap_pages += p->pages;
>>
>> @@ -2507,7 +2504,7 @@ static void enable_swap_info(struct swap_info_struct *p, int prio,
>> spin_unlock(&swap_lock);
>> /*
>> * Guarantee swap_map, cluster_info, etc. fields are valid
>> - * between get/put_swap_device() if SWP_VALID bit is set
>> + * between get/put_swap_device().
>> */
>
> The comments need to be revised. Something like below?
>
> /* Finished initializing the swap device, now it's safe to reference it */
>

All looks good to me. Will do. Many thanks!

> Best Regards,
> Huang, Ying
>
>> percpu_ref_resurrect(&p->users);
>> spin_lock(&swap_lock);
>> @@ -2625,12 +2622,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>>
>> reenable_swap_slots_cache_unlock();
>>
>> - spin_lock(&swap_lock);
>> - spin_lock(&p->lock);
>> - p->flags &= ~SWP_VALID; /* mark swap device as invalid */
>> - spin_unlock(&p->lock);
>> - spin_unlock(&swap_lock);
>> -
>> percpu_ref_kill(&p->users);
>> /*
>> * We need synchronize_rcu() here to protect the accessing
>
> .
>

2021-04-19 08:17:04

by Huang, Ying

Subject: Re: [PATCH v2 5/5] mm/shmem: fix shmem_swapin() race with swapoff

Miaohe Lin <[email protected]> writes:

> On 2021/4/19 10:15, Huang, Ying wrote:
>> Miaohe Lin <[email protected]> writes:
>>
>>> When I was investigating the swap code, I found the below possible race
>>> window:
>>>
>>> CPU 1 CPU 2
>>> ----- -----
>>> shmem_swapin
>>> swap_cluster_readahead
>>> if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
>>> swapoff
>>> si->flags &= ~SWP_VALID;
>>> ..
>>> synchronize_rcu();
>>> ..
>>
>> You have removed these code in the previous patches of the series. And
>> they are not relevant in this patch.
>
> Yes, I should change these. Thanks.
>
>>
>>> si->swap_file = NULL;
>>> struct inode *inode = si->swap_file->f_mapping->host;[oops!]
>>>
>>> Close this race window by using get/put_swap_device() to guard against
>>> concurrent swapoff.
>>>
>>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>>
>> No. This isn't the commit that introduces the race condition. Please
>> recheck your git blame result.
>>
>
> I think it is really hard to find the exact commit. I used git blame and found
> this race has existed since the code was introduced. Any suggestion?
> Thanks.

I think the commit that introduces the race condition is commit
8fd2e0b505d1 ("mm: swap: check if swap backing device is congested or
not")

Best Regards,
Huang, Ying

>> Best Regards,
>> Huang, Ying
>>
>>> Signed-off-by: Miaohe Lin <[email protected]>
>>> ---
>>> mm/shmem.c | 6 ++++++
>>> 1 file changed, 6 insertions(+)
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 26c76b13ad23..936ba5595297 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -1492,15 +1492,21 @@ static void shmem_pseudo_vma_destroy(struct vm_area_struct *vma)
>>> static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
>>> struct shmem_inode_info *info, pgoff_t index)
>>> {
>>> + struct swap_info_struct *si;
>>> struct vm_area_struct pvma;
>>> struct page *page;
>>> struct vm_fault vmf = {
>>> .vma = &pvma,
>>> };
>>>
>>> + /* Prevent swapoff from happening to us. */
>>> + si = get_swap_device(swap);
>>> + if (unlikely(!si))
>>> + return NULL;
>>> shmem_pseudo_vma_init(&pvma, info, index);
>>> page = swap_cluster_readahead(swap, gfp, &vmf);
>>> shmem_pseudo_vma_destroy(&pvma);
>>> + put_swap_device(si);
>>>
>>> return page;
>>> }
>> .
>>

2021-04-19 08:18:07

by Miaohe Lin

Subject: Re: [PATCH v2 5/5] mm/shmem: fix shmem_swapin() race with swapoff

On 2021/4/19 15:04, Huang, Ying wrote:
> Miaohe Lin <[email protected]> writes:
>
>> On 2021/4/19 10:15, Huang, Ying wrote:
>>> Miaohe Lin <[email protected]> writes:
>>>
>>>> When I was investigating the swap code, I found the below possible race
>>>> window:
>>>>
>>>> CPU 1 CPU 2
>>>> ----- -----
>>>> shmem_swapin
>>>> swap_cluster_readahead
>>>> if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
>>>> swapoff
>>>> si->flags &= ~SWP_VALID;
>>>> ..
>>>> synchronize_rcu();
>>>> ..
>>>
>>> You have removed this code in the previous patches of the series, and
>>> it is not relevant in this patch.
>>
>> Yes, I should change these. Thanks.
>>
>>>
>>>> si->swap_file = NULL;
>>>> struct inode *inode = si->swap_file->f_mapping->host;[oops!]
>>>>
>>>> Close this race window by using get/put_swap_device() to guard against
>>>> concurrent swapoff.
>>>>
>>>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>>>
>>> No. This isn't the commit that introduces the race condition. Please
>>> recheck your git blame result.
>>>
>>
>> I think it is really hard to find the exact commit. I used git blame and found
>> this race has existed since the code was introduced. Any suggestion?
>> Thanks.
>
> I think the commit that introduces the race condition is commit
> 8fd2e0b505d1 ("mm: swap: check if swap backing device is congested or
> not")
>

Thanks.
The commit log only describes one race condition, and for that one this should be
the correct Fixes tag. But there are still many other race conditions inside
swap_cluster_readahead(), such as the swap_readpage() called from it. This tag
could not cover all the race windows.

> Best Regards,
> Huang, Ying
>
>>> Best Regards,
>>> Huang, Ying
>>>
>>>> Signed-off-by: Miaohe Lin <[email protected]>
>>>> ---
>>>> mm/shmem.c | 6 ++++++
>>>> 1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>> index 26c76b13ad23..936ba5595297 100644
>>>> --- a/mm/shmem.c
>>>> +++ b/mm/shmem.c
>>>> @@ -1492,15 +1492,21 @@ static void shmem_pseudo_vma_destroy(struct vm_area_struct *vma)
>>>> static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
>>>> struct shmem_inode_info *info, pgoff_t index)
>>>> {
>>>> + struct swap_info_struct *si;
>>>> struct vm_area_struct pvma;
>>>> struct page *page;
>>>> struct vm_fault vmf = {
>>>> .vma = &pvma,
>>>> };
>>>>
>>>> + /* Prevent swapoff from happening to us. */
>>>> + si = get_swap_device(swap);
>>>> + if (unlikely(!si))
>>>> + return NULL;
>>>> shmem_pseudo_vma_init(&pvma, info, index);
>>>> page = swap_cluster_readahead(swap, gfp, &vmf);
>>>> shmem_pseudo_vma_destroy(&pvma);
>>>> + put_swap_device(si);
>>>>
>>>> return page;
>>>> }
>>> .
>>>
> .
>

2021-04-19 08:19:08

by Huang, Ying

Subject: Re: [PATCH v2 5/5] mm/shmem: fix shmem_swapin() race with swapoff

Miaohe Lin <[email protected]> writes:

> On 2021/4/19 15:04, Huang, Ying wrote:
>> Miaohe Lin <[email protected]> writes:
>>
>>> On 2021/4/19 10:15, Huang, Ying wrote:
>>>> Miaohe Lin <[email protected]> writes:
>>>>
>>>>> When I was investigating the swap code, I found the below possible race
>>>>> window:
>>>>>
>>>>> CPU 1 CPU 2
>>>>> ----- -----
>>>>> shmem_swapin
>>>>> swap_cluster_readahead
>>>>> if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
>>>>> swapoff
>>>>> si->flags &= ~SWP_VALID;
>>>>> ..
>>>>> synchronize_rcu();
>>>>> ..
>>>>
>>>> You have removed this code in the previous patches of the series, and
>>>> it is not relevant in this patch.
>>>
>>> Yes, I should change these. Thanks.
>>>
>>>>
>>>>> si->swap_file = NULL;
>>>>> struct inode *inode = si->swap_file->f_mapping->host;[oops!]
>>>>>
>>>>> Close this race window by using get/put_swap_device() to guard against
>>>>> concurrent swapoff.
>>>>>
>>>>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>>>>
>>>> No. This isn't the commit that introduces the race condition. Please
>>>> recheck your git blame result.
>>>>
>>>
>>> I think it is really hard to find the exact commit. I used git blame and found
>>> this race has existed since the code was introduced. Any suggestion?
>>> Thanks.
>>
>> I think the commit that introduces the race condition is commit
>> 8fd2e0b505d1 ("mm: swap: check if swap backing device is congested or
>> not")
>>
>
> Thanks.
> The commit log only describes one race condition, and for that one this should be
> the correct Fixes tag. But there are still many other race conditions inside
> swap_cluster_readahead(), such as the swap_readpage() called from it. This tag
> could not cover all the race windows.

No. swap_readpage() in swap_cluster_readahead() is OK. Because
__read_swap_cache_async() is called before it, the swap entry will
be marked with SWAP_HAS_CACHE, and the page will be locked.

Best Regards,
Huang, Ying

>> Best Regards,
>> Huang, Ying
>>
>>>> Best Regards,
>>>> Huang, Ying
>>>>
>>>>> Signed-off-by: Miaohe Lin <[email protected]>
>>>>> ---
>>>>> mm/shmem.c | 6 ++++++
>>>>> 1 file changed, 6 insertions(+)
>>>>>
>>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>>> index 26c76b13ad23..936ba5595297 100644
>>>>> --- a/mm/shmem.c
>>>>> +++ b/mm/shmem.c
>>>>> @@ -1492,15 +1492,21 @@ static void shmem_pseudo_vma_destroy(struct vm_area_struct *vma)
>>>>> static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
>>>>> struct shmem_inode_info *info, pgoff_t index)
>>>>> {
>>>>> + struct swap_info_struct *si;
>>>>> struct vm_area_struct pvma;
>>>>> struct page *page;
>>>>> struct vm_fault vmf = {
>>>>> .vma = &pvma,
>>>>> };
>>>>>
>>>>> + /* Prevent swapoff from happening to us. */
>>>>> + si = get_swap_device(swap);
>>>>> + if (unlikely(!si))
>>>>> + return NULL;
>>>>> shmem_pseudo_vma_init(&pvma, info, index);
>>>>> page = swap_cluster_readahead(swap, gfp, &vmf);
>>>>> shmem_pseudo_vma_destroy(&pvma);
>>>>> + put_swap_device(si);
>>>>>
>>>>> return page;
>>>>> }
>>>> .
>>>>
>> .
>>

2021-04-19 08:34:53

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] mm/shmem: fix shmem_swapin() race with swapoff

On 2021/4/19 15:41, Huang, Ying wrote:
> Miaohe Lin <[email protected]> writes:
>
>> On 2021/4/19 15:04, Huang, Ying wrote:
>>> Miaohe Lin <[email protected]> writes:
>>>
>>>> On 2021/4/19 10:15, Huang, Ying wrote:
>>>>> Miaohe Lin <[email protected]> writes:
>>>>>
>>>>>> When I was investigating the swap code, I found the below possible race
>>>>>> window:
>>>>>>
>>>>>> CPU 1 CPU 2
>>>>>> ----- -----
>>>>>> shmem_swapin
>>>>>> swap_cluster_readahead
>>>>>> if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
>>>>>> swapoff
>>>>>> si->flags &= ~SWP_VALID;
>>>>>> ..
>>>>>> synchronize_rcu();
>>>>>> ..
>>>>>
>>>>> You have removed this code in the previous patches of the series, and
>>>>> it is not relevant in this patch.
>>>>
>>>> Yes, I should change these. Thanks.
>>>>
>>>>>
>>>>>> si->swap_file = NULL;
>>>>>> struct inode *inode = si->swap_file->f_mapping->host;[oops!]
>>>>>>
>>>>>> Close this race window by using get/put_swap_device() to guard against
>>>>>> concurrent swapoff.
>>>>>>
>>>>>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>>>>>
>>>>> No. This isn't the commit that introduces the race condition. Please
>>>>> recheck your git blame result.
>>>>>
>>>>
>>>> I think it is really hard to find the exact commit. I used git blame and found
>>>> this race has existed since the code was introduced. Any suggestion?
>>>> Thanks.
>>>
>>> I think the commit that introduces the race condition is commit
>>> 8fd2e0b505d1 ("mm: swap: check if swap backing device is congested or
>>> not")
>>>
>>
>> Thanks.
>> The commit log only describes one race condition, and for that one this should be
>> the correct Fixes tag. But there are still many other race conditions inside
>> swap_cluster_readahead(), such as the swap_readpage() called from it. This tag
>> could not cover all the race windows.
>
> No. swap_readpage() in swap_cluster_readahead() is OK. Because
> __read_swap_cache_async() is called before it, the swap entry will
> be marked with SWAP_HAS_CACHE, and the page will be locked.
>

Oh... I missed this. Many thanks for the reminder.

> Best Regards,
> Huang, Ying
>
>>> Best Regards,
>>> Huang, Ying
>>>
>>>>> Best Regards,
>>>>> Huang, Ying
>>>>>
>>>>>> Signed-off-by: Miaohe Lin <[email protected]>
>>>>>> ---
>>>>>> mm/shmem.c | 6 ++++++
>>>>>> 1 file changed, 6 insertions(+)
>>>>>>
>>>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>>>> index 26c76b13ad23..936ba5595297 100644
>>>>>> --- a/mm/shmem.c
>>>>>> +++ b/mm/shmem.c
>>>>>> @@ -1492,15 +1492,21 @@ static void shmem_pseudo_vma_destroy(struct vm_area_struct *vma)
>>>>>> static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
>>>>>> struct shmem_inode_info *info, pgoff_t index)
>>>>>> {
>>>>>> + struct swap_info_struct *si;
>>>>>> struct vm_area_struct pvma;
>>>>>> struct page *page;
>>>>>> struct vm_fault vmf = {
>>>>>> .vma = &pvma,
>>>>>> };
>>>>>>
>>>>>> + /* Prevent swapoff from happening to us. */
>>>>>> + si = get_swap_device(swap);
>>>>>> + if (unlikely(!si))
>>>>>> + return NULL;
>>>>>> shmem_pseudo_vma_init(&pvma, info, index);
>>>>>> page = swap_cluster_readahead(swap, gfp, &vmf);
>>>>>> shmem_pseudo_vma_destroy(&pvma);
>>>>>> + put_swap_device(si);
>>>>>>
>>>>>> return page;
>>>>>> }
>>>>> .
>>>>>
>>> .
>>>
> .
>