2023-04-01 22:22:07

by Rongwei Wang

Subject: [PATCH] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()

Without this modification, a core will wait (mostly)
'swap_info_struct->lock' when completing
'del_from_avail_list(p)'. Immediately, other cores
soon calling 'add_to_avail_list()' to add the same
object again when acquiring the lock that released
by former. It's not the desired result but exists
indeed. This case can be described as below:

core 0 core 1
swapoff

del_from_avail_list(p) waiting

try lock p->lock acquire swap_avail_lock and
add p into swap_avail_head again

acquire p->lock but
missing p already be
added again, and continuing
to clear SWP_WRITEOK, etc.

It can be easily found a massive warning messages can
be triggered inside get_swap_pages() by some special
cases, for example, we call madvise(MADV_PAGEOUT) on
blocks of touched memory concurrently, meanwhile, run
much swapon-swapoff operations (e.g. stress-ng-swap).

But, a worse consequence, panic also can be caused by
the above scene. In swapoff(), p, refers to one
swap_info_struct variable, maybe reinsert swap_avail_head
by 'reinsert_swap_info', or as we wanted, turns off this
swap block successfully. the worse case is that swapoff()
run the last code of function but the p still linked in
swap_avail_head[]. It has very bad effects, such as the
memory used by p could be kept in swap_info[], this means
reuse it will destroy the data. A panic message caused:

(with CONFIG_PLIST_DEBUG enabled)

------------[ cut here ]------------
top: ffff001800875c00, n: ffff001800fdc6e0, p: ffff001800fdc6e0
prev: ffff001800875c00, n: ffff001800fdc6e0, p: ffff001800fdc6e0
next: ffff001800fdc6e0, n: ffff001800fdc6e0, p: ffff001800fdc6e0
WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70
Modules linked in: rfkill(E) crct10dif_ce(E)...
CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+
Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
pc : plist_check_prev_next_node+0x50/0x70
lr : plist_check_prev_next_node+0x50/0x70
sp : ffff0018009d3c30
x29: ffff0018009d3c40 x28: ffff800011b32a98
x27: 0000000000000000 x26: ffff001803908000
x25: ffff8000128ea088 x24: ffff800011b32a48
x23: 0000000000000028 x22: ffff001800875c00
x21: ffff800010f9e520 x20: ffff001800875c00
x19: ffff001800fdc6e0 x18: 0000000000000030
x17: 0000000000000000 x16: 0000000000000000
x15: 0736076307640766 x14: 0730073007380731
x13: 0736076307640766 x12: 0730073007380731
x11: 000000000004058d x10: 0000000085a85b76
x9 : ffff8000101436e4 x8 : ffff800011c8ce08
x7 : 0000000000000000 x6 : 0000000000000001
x5 : ffff0017df9ed338 x4 : 0000000000000001
x3 : ffff8017ce62a000 x2 : ffff0017df9ed340
x1 : 0000000000000000 x0 : 0000000000000000
Call trace:
plist_check_prev_next_node+0x50/0x70
plist_check_head+0x80/0xf0
plist_add+0x28/0x140
add_to_avail_list+0x9c/0xf0
_enable_swap_info+0x78/0xb4
__do_sys_swapon+0x918/0xa10
__arm64_sys_swapon+0x20/0x30
el0_svc_common+0x8c/0x220
do_el0_svc+0x2c/0x90
el0_svc+0x1c/0x30
el0_sync_handler+0xa8/0xb0
el0_sync+0x148/0x180
irq event stamp: 2082270

In this patch, we lock p->lock before calling
'del_from_avail_list()' to make sure other thread
see the swap_info_struct object had been deleted
and SWP_WRITEOK cleared together, will not reinsert
again.

We also find this problem exists in stable 5.10.

Signed-off-by: Rongwei Wang <[email protected]>
---
mm/swapfile.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5af6b0f770de..4df77fef50b5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2610,8 +2610,12 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
spin_unlock(&swap_lock);
goto out_dput;
}
- del_from_avail_list(p);
+ /*
+ * Here lock is used to protect deleting and SWP_WRITEOK clearing
+ * can be seen concurrently.
+ */
spin_lock(&p->lock);
+ del_from_avail_list(p);
if (p->prio < 0) {
struct swap_info_struct *si = p;
int nid;
--
2.27.0


2023-04-02 13:41:54

by Bagas Sanjaya

Subject: Re: [PATCH] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()

On Sun, Apr 02, 2023 at 06:19:20AM +0800, Rongwei Wang wrote:
> Without this modification, a core will wait (mostly)
"Currently, a core will ..."

> But, a worse consequence, panic also can be caused by
"However, in the worst case, ..."

> In this patch, we lock p->lock before calling
"Lock p->lock before calling ..."

> We also find this problem exists in stable 5.10.

So, you claim that 5.15.y and 6.1.y aren't affected, right?

Also, Cc: [email protected] on the SoB area (as pointed by kernel
test robot [1].

Thanks.

[1]: https://lore.kernel.org/stable/ZCiuGEkyk%2F1Afisk@ec83ac1404bb/

--
An old man doll... just what I always wanted! - Clara



2023-04-02 15:03:22

by Rongwei Wang

Subject: Re: [PATCH] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()


On 4/2/23 9:37 PM, Bagas Sanjaya wrote:
> On Sun, Apr 02, 2023 at 06:19:20AM +0800, Rongwei Wang wrote:
>> Without this modification, a core will wait (mostly)
> "Currently, a core will ..."
>
>> But, a worse consequence, panic also can be caused by
> "However, in the worst case, ..."
>
>> In this patch, we lock p->lock before calling
> "Lock p->lock before calling ..."
>
>> We also find this problem exists in stable 5.10.
> So, you claim that 5.15.y and 6.1.y aren't affected, right?

I think both of those versions are also affected. I can check them next.

>
> Also, Cc: [email protected] on the SoB area (as pointed by kernel
> test robot [1].
Thanks, I will do that in the next version.
>
> Thanks.
>
> [1]: https://lore.kernel.org/stable/ZCiuGEkyk%2F1Afisk@ec83ac1404bb/
>

2023-04-03 04:11:53

by Matthew Wilcox

Subject: Re: [PATCH] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()

On Sun, Apr 02, 2023 at 06:19:20AM +0800, Rongwei Wang wrote:
> Without this modification, a core will wait (mostly)
> 'swap_info_struct->lock' when completing
> 'del_from_avail_list(p)'. Immediately, other cores
> soon calling 'add_to_avail_list()' to add the same
> object again when acquiring the lock that released
> by former. It's not the desired result but exists
> indeed. This case can be described as below:

This feels like a very verbose way of saying

"The si->lock must be held when deleting the si from the
available list. Otherwise, another thread can re-add the
si to the available list, which can lead to memory corruption.
The only place we have found where this happens is in the
swapoff path."

> +++ b/mm/swapfile.c
> @@ -2610,8 +2610,12 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> spin_unlock(&swap_lock);
> goto out_dput;
> }
> - del_from_avail_list(p);
> + /*
> + * Here lock is used to protect deleting and SWP_WRITEOK clearing
> + * can be seen concurrently.
> + */

This comment isn't necessary. But I would add a lockdep assert inside
__del_from_avail_list() that p->lock is held.
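
(For illustration only, not part of the posted patch: such an assertion
could be added as sketched below, mirroring the __del_from_avail_list()
shown in the diff. The v2 patch later in this thread uses
assert_spin_locked() rather than lockdep_assert_held(); either one will
flag a caller that forgot to take p->lock when the corresponding debug
options are enabled.)

static void __del_from_avail_list(struct swap_info_struct *p)
{
        int nid;

        lockdep_assert_held(&p->lock);  /* or: assert_spin_locked(&p->lock) */
        for_each_node(nid)
                plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
}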

> spin_lock(&p->lock);
> + del_from_avail_list(p);
> if (p->prio < 0) {
> struct swap_info_struct *si = p;
> int nid;
> --
> 2.27.0
>
>

2023-04-03 08:09:40

by Rongwei Wang

Subject: Re: [PATCH] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()


On 4/3/23 12:10 PM, Matthew Wilcox wrote:
> On Sun, Apr 02, 2023 at 06:19:20AM +0800, Rongwei Wang wrote:
>> Without this modification, a core will wait (mostly)
>> 'swap_info_struct->lock' when completing
>> 'del_from_avail_list(p)'. Immediately, other cores
>> soon calling 'add_to_avail_list()' to add the same
>> object again when acquiring the lock that released
>> by former. It's not the desired result but exists
>> indeed. This case can be described as below:
> This feels like a very verbose way of saying
>
> "The si->lock must be held when deleting the si from the
> available list. Otherwise, another thread can re-add the
> si to the available list, which can lead to memory corruption.
> The only place we have found where this happens is in the
> swapoff path."
It reads better than mine. Sorry for my confusing description; it will
be fixed in the next version.
>
>> +++ b/mm/swapfile.c
>> @@ -2610,8 +2610,12 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>> spin_unlock(&swap_lock);
>> goto out_dput;
>> }
>> - del_from_avail_list(p);
>> + /*
>> + * Here lock is used to protect deleting and SWP_WRITEOK clearing
>> + * can be seen concurrently.
>> + */
> This comment isn't necessary. But I would add a lockdep assert inside
> __del_from_avail_list() that p->lock is held.

Thanks. Actually, I had this line in a previous test version, but dropped
it to save one line of code.

I will update it here as you suggested.


Thanks for your time.

>
>> spin_lock(&p->lock);
>> + del_from_avail_list(p);
>> if (p->prio < 0) {
>> struct swap_info_struct *si = p;
>> int nid;
>> --
>> 2.27.0
>>
>>

2023-04-04 15:55:07

by Rongwei Wang

Subject: [PATCH v2] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()

The si->lock must be held when deleting the si from
the available list. Otherwise, another thread can
re-add the si to the available list, which can lead
to memory corruption. The only place we have found
where this happens is in the swapoff path. This case
can be described as below:

core 0 core 1
swapoff

del_from_avail_list(si) waiting

try lock si->lock acquire swap_avail_lock
and re-add si into
swap_avail_head

acquire si->lock but
missing si already be
added again, and continuing
to clear SWP_WRITEOK, etc.

It can be easily found a massive warning messages can
be triggered inside get_swap_pages() by some special
cases, for example, we call madvise(MADV_PAGEOUT) on
blocks of touched memory concurrently, meanwhile, run
much swapon-swapoff operations (e.g. stress-ng-swap).

However, in the worst case, panic can be caused by the
above scene. In swapoff(), the memory used by si could
be kept in swap_info[] after turning off a swap. This
means memory corruption will not be caused immediately
until allocated and reset for a new swap in the swapon
path. A panic message caused:
(with CONFIG_PLIST_DEBUG enabled)

------------[ cut here ]------------
top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a
prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d
next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a
WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70
Modules linked in: rfkill(E) crct10dif_ce(E)...
CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+
Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
pc : plist_check_prev_next_node+0x50/0x70
lr : plist_check_prev_next_node+0x50/0x70
sp : ffff0018009d3c30
x29: ffff0018009d3c40 x28: ffff800011b32a98
x27: 0000000000000000 x26: ffff001803908000
x25: ffff8000128ea088 x24: ffff800011b32a48
x23: 0000000000000028 x22: ffff001800875c00
x21: ffff800010f9e520 x20: ffff001800875c00
x19: ffff001800fdc6e0 x18: 0000000000000030
x17: 0000000000000000 x16: 0000000000000000
x15: 0736076307640766 x14: 0730073007380731
x13: 0736076307640766 x12: 0730073007380731
x11: 000000000004058d x10: 0000000085a85b76
x9 : ffff8000101436e4 x8 : ffff800011c8ce08
x7 : 0000000000000000 x6 : 0000000000000001
x5 : ffff0017df9ed338 x4 : 0000000000000001
x3 : ffff8017ce62a000 x2 : ffff0017df9ed340
x1 : 0000000000000000 x0 : 0000000000000000
Call trace:
plist_check_prev_next_node+0x50/0x70
plist_check_head+0x80/0xf0
plist_add+0x28/0x140
add_to_avail_list+0x9c/0xf0
_enable_swap_info+0x78/0xb4
__do_sys_swapon+0x918/0xa10
__arm64_sys_swapon+0x20/0x30
el0_svc_common+0x8c/0x220
do_el0_svc+0x2c/0x90
el0_svc+0x1c/0x30
el0_sync_handler+0xa8/0xb0
el0_sync+0x148/0x180
irq event stamp: 2082270

Now, si->lock locked before calling 'del_from_avail_list()'
to make sure other thread see the si had been deleted
and SWP_WRITEOK cleared together, will not reinsert again.

This problem exists in versions after stable 5.10.y.

Cc: [email protected]
Tested-by: Yongchen Yin <[email protected]>
Signed-off-by: Rongwei Wang <[email protected]>
---
mm/swapfile.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 62ba2bf577d7..2c718f45745f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p)
{
int nid;

+ assert_spin_locked(&p->lock);
for_each_node(nid)
plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
}
@@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
spin_unlock(&swap_lock);
goto out_dput;
}
- del_from_avail_list(p);
spin_lock(&p->lock);
+ del_from_avail_list(p);
if (p->prio < 0) {
struct swap_info_struct *si = p;
int nid;
--
2.27.0

2023-04-04 16:10:44

by Rongwei Wang

Subject: Re: [PATCH v2] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()

Hello

I have fixed up some things based on patch v1. And in order to help all
readers and reviewers reproduce this bug, I am sharing a reproducer here:

swap_bomb.sh

#!/usr/bin/env bash

stress-ng -a 1 --class vm -t 12h --metrics --times -x \
bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap \
--verify -v &
stress-ng -a 1 --class vm -t 12h --metrics --times -x \
bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap \
--verify -v &
stress-ng -a 1 --class vm -t 12h --metrics --times -x \
bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap \
--verify -v &
stress-ng -a 1 --class vm -t 12h --metrics --times -x \
bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap \
--verify -v


madvise_shared.c

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define MSIZE (1024 * 1024 * 2)

int main()
{
        char *shm_addr;
        unsigned long i;

        while (1) {
                // Map shared memory segment
                shm_addr =
                    mmap(NULL, MSIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                if (shm_addr == MAP_FAILED) {
                        perror("Failed to map shared memory segment");
                        exit(EXIT_FAILURE);
                }

                for (i = 0; i < MSIZE; i++) {
                        shm_addr[i] = 1;
                }

                // Advise kernel on usage pattern of shared memory
                if (madvise(shm_addr, MSIZE, MADV_PAGEOUT) == -1) {
                        perror("Failed to advise kernel on shared memory usage");
                        exit(EXIT_FAILURE);
                }

                for (i = 0; i < MSIZE; i++) {
                        shm_addr[i] = 1;
                }

                // Advise kernel on usage pattern of shared memory
                if (madvise(shm_addr, MSIZE, MADV_PAGEOUT) == -1) {
                        perror("Failed to advise kernel on shared memory usage");
                        exit(EXIT_FAILURE);
                }
                // Use shared memory
                printf("Hello, shared memory: 0x%lx\n", shm_addr);

                // Unmap shared memory segment
                if (munmap(shm_addr, MSIZE) == -1) {
                        perror("Failed to unmap shared memory segment");
                        exit(EXIT_FAILURE);
                }
        }

        return 0;
}

The bug reproduces more quickly (in about 2~5 minutes) if more instances of
swap_bomb.sh and madvise_shared run concurrently.

Thanks.


change log:

v1 -> v2

* fix up the commit message and add assert_spin_locked(&p->lock) inside
__del_from_avail_list() (suggested by Matthew Wilcox and Bagas Sanjaya)


On 4/4/23 11:47 PM, Rongwei Wang wrote:
> The si->lock must be held when deleting the si from
> the available list. Otherwise, another thread can
> re-add the si to the available list, which can lead
> to memory corruption. The only place we have found
> where this happens is in the swapoff path. This case
> can be described as below:
>
> core 0 core 1
> swapoff
>
> del_from_avail_list(si) waiting
>
> try lock si->lock acquire swap_avail_lock
> and re-add si into
> swap_avail_head
>
> acquire si->lock but
> missing si already be
> added again, and continuing
> to clear SWP_WRITEOK, etc.
>
> It can be easily found a massive warning messages can
> be triggered inside get_swap_pages() by some special
> cases, for example, we call madvise(MADV_PAGEOUT) on
> blocks of touched memory concurrently, meanwhile, run
> much swapon-swapoff operations (e.g. stress-ng-swap).
>
> However, in the worst case, panic can be caused by the
> above scene. In swapoff(), the memory used by si could
> be kept in swap_info[] after turning off a swap. This
> means memory corruption will not be caused immediately
> until allocated and reset for a new swap in the swapon
> path. A panic message caused:
> (with CONFIG_PLIST_DEBUG enabled)
>
> ------------[ cut here ]------------
> top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a
> prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d
> next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a
> WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70
> Modules linked in: rfkill(E) crct10dif_ce(E)...
> CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+
> Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
> pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
> pc : plist_check_prev_next_node+0x50/0x70
> lr : plist_check_prev_next_node+0x50/0x70
> sp : ffff0018009d3c30
> x29: ffff0018009d3c40 x28: ffff800011b32a98
> x27: 0000000000000000 x26: ffff001803908000
> x25: ffff8000128ea088 x24: ffff800011b32a48
> x23: 0000000000000028 x22: ffff001800875c00
> x21: ffff800010f9e520 x20: ffff001800875c00
> x19: ffff001800fdc6e0 x18: 0000000000000030
> x17: 0000000000000000 x16: 0000000000000000
> x15: 0736076307640766 x14: 0730073007380731
> x13: 0736076307640766 x12: 0730073007380731
> x11: 000000000004058d x10: 0000000085a85b76
> x9 : ffff8000101436e4 x8 : ffff800011c8ce08
> x7 : 0000000000000000 x6 : 0000000000000001
> x5 : ffff0017df9ed338 x4 : 0000000000000001
> x3 : ffff8017ce62a000 x2 : ffff0017df9ed340
> x1 : 0000000000000000 x0 : 0000000000000000
> Call trace:
> plist_check_prev_next_node+0x50/0x70
> plist_check_head+0x80/0xf0
> plist_add+0x28/0x140
> add_to_avail_list+0x9c/0xf0
> _enable_swap_info+0x78/0xb4
> __do_sys_swapon+0x918/0xa10
> __arm64_sys_swapon+0x20/0x30
> el0_svc_common+0x8c/0x220
> do_el0_svc+0x2c/0x90
> el0_svc+0x1c/0x30
> el0_sync_handler+0xa8/0xb0
> el0_sync+0x148/0x180
> irq event stamp: 2082270
>
> Now, si->lock locked before calling 'del_from_avail_list()'
> to make sure other thread see the si had been deleted
> and SWP_WRITEOK cleared together, will not reinsert again.
>
> This problem exists in versions after stable 5.10.y.
>
> Cc: [email protected]
> Tested-by: Yongchen Yin <[email protected]>
> Signed-off-by: Rongwei Wang <[email protected]>
> ---
> mm/swapfile.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 62ba2bf577d7..2c718f45745f 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p)
> {
> int nid;
>
> + assert_spin_locked(&p->lock);
> for_each_node(nid)
> plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
> }
> @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> spin_unlock(&swap_lock);
> goto out_dput;
> }
> - del_from_avail_list(p);
> spin_lock(&p->lock);
> + del_from_avail_list(p);
> if (p->prio < 0) {
> struct swap_info_struct *si = p;
> int nid;

2023-04-04 19:28:08

by Andrew Morton

Subject: Re: [PATCH v2] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()

On Tue, 4 Apr 2023 23:47:16 +0800 Rongwei Wang <[email protected]> wrote:

> The si->lock must be held when deleting the si from
> the available list.
>
> ...
>
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p)
> {
> int nid;
>
> + assert_spin_locked(&p->lock);
> for_each_node(nid)
> plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
> }
> @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> spin_unlock(&swap_lock);
> goto out_dput;
> }
> - del_from_avail_list(p);
> spin_lock(&p->lock);
> + del_from_avail_list(p);
> if (p->prio < 0) {
> struct swap_info_struct *si = p;
> int nid;

So we have

swap_avail_lock
swap_info_struct.lock
swap_cluster_info.lock

Is the ranking of these three clearly documented somewhere?


Did you test this with lockdep fully enabled?


I'm thinking that Aaron's a2468cc9bfdff ("swap: choose swap device
according to numa node") is the appropriate Fixes: target - do you
agree?


These functions use identifier `p' for the swap_info_struct*, whereas
most other code uses the much more sensible `si'. That's just rude.
But we shouldn't change that within this fix.

2023-04-05 06:53:42

by Rongwei Wang

Subject: Re: [PATCH v2] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()

Hi Andrew

On 4/5/23 3:26 AM, Andrew Morton wrote:
> On Tue, 4 Apr 2023 23:47:16 +0800 Rongwei Wang <[email protected]> wrote:
>
>> The si->lock must be held when deleting the si from
>> the available list.
>>
>> ...
>>
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p)
>> {
>> int nid;
>>
>> + assert_spin_locked(&p->lock);
>> for_each_node(nid)
>> plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
>> }
>> @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>> spin_unlock(&swap_lock);
>> goto out_dput;
>> }
>> - del_from_avail_list(p);
>> spin_lock(&p->lock);
>> + del_from_avail_list(p);
>> if (p->prio < 0) {
>> struct swap_info_struct *si = p;
>> int nid;
> So we have
>
> swap_avail_lock
> swap_info_struct.lock
> swap_cluster_info.lock
>
> Is the ranking of these three clearly documented somewhere?

The ordering seems to be:

swap_lock

swap_info_struct.lock

swap_avail_lock

I just summarized the ranking of these three locks by reading the code; I
have not found any documentation of it (though some may exist).

>
>
> Did you test this with lockdep fully enabled?
>
>
> I'm thinking that Aaron's a2468cc9bfdff ("swap: choose swap device
> according to numa node") is the appropriate Fixes: target - do you
> agree?

Yes, I'm sure my latest test version includes Aaron's a2468cc9bfdff, and
my test .config has the following options enabled:

CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y

>
>
> These functions use identifier `p' for the swap_info_struct*, whereas
> most other code uses the much more sensible `si'. That's just rude.
> But we shouldn't change that within this fix.

Indeed, it's somewhat confusing to use both 'si' and 'p'. I can prepare
another patch to replace 'p' with 'si'.

Thanks.

2023-04-06 07:07:09

by Aaron Lu

Subject: Re: [PATCH v2] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()

Hi Andrew,

Sorry for replying a little late; it was a holiday here yesterday.

On Tue, Apr 04, 2023 at 12:26:00PM -0700, Andrew Morton wrote:
> On Tue, 4 Apr 2023 23:47:16 +0800 Rongwei Wang <[email protected]> wrote:
>
> > The si->lock must be held when deleting the si from
> > the available list.
> >
> > ...
> >
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p)
> > {
> > int nid;
> >
> > + assert_spin_locked(&p->lock);
> > for_each_node(nid)
> > plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
> > }
> > @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> > spin_unlock(&swap_lock);
> > goto out_dput;
> > }
> > - del_from_avail_list(p);
> > spin_lock(&p->lock);
> > + del_from_avail_list(p);
> > if (p->prio < 0) {
> > struct swap_info_struct *si = p;
> > int nid;
>
> So we have
>
> swap_avail_lock
> swap_info_struct.lock
> swap_cluster_info.lock
>
> Is the ranking of these three clearly documented somewhere?
>

I see some comments in swapfile.c mentioned something related, e.g.
above the definition of swap_avail_heads, the comment mentioned
swap_lock has to be taken before si->lock and swap_avail_lock can be
taken after si->lock is held, but I'm not aware of a place documenting
these things. Documenting this would be useful I think; let me see if I
can come up with something later.
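
(Illustration only, hypothetical wording: going by what has been said in
this thread, i.e. swap_lock is taken before si->lock and swap_avail_lock
can be taken while si->lock is held, a documentation comment could look
roughly like the sketch below; where exactly swap_cluster_info.lock nests
is not spelled out in this thread.)

/*
 * Lock ordering (outermost first):
 *
 *   swap_lock
 *     swap_info_struct->lock
 *       swap_avail_lock (protects swap_avail_heads)
 *
 * swap_lock must be taken before si->lock; swap_avail_lock may be taken
 * while si->lock is held.
 */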

>
> Did you test this with lockdep fully enabled?
>
>
> I'm thinking that Aaron's a2468cc9bfdff ("swap: choose swap device
> according to numa node") is the appropriate Fixes: target - do you
> agree?

It doesn't appear to be the case. For one thing, the problematic code
that removes the swap device from the avail list without acquiring
si->lock was there before my commit and my commit didn't change that
behaviour. For another, I wanted to see if the problem is still there
without my commit (just to make sure).

I followed Rongwei's description and used the stress-ng swap test together
with some test progs that do memory allocation and then madvise(MADV_PAGEOUT)
in a loop to reproduce this problem, and I can also see a warning like the
one below using Linus' master branch as of today. I believe this is the
problem Rongwei described:

[ 1914.518786] ------------[ cut here ]------------
[ 1914.519049] swap_info 9 in list but !SWP_WRITEOK
[ 1914.519274] WARNING: CPU: 14 PID: 14307 at mm/swapfile.c:1085 get_swap_pages+0x3b3/0x440
[ 1914.519660] Modules linked in:
[ 1914.519811] CPU: 14 PID: 14307 Comm: swap Tainted: G W 6.3.0-rc5-00032-g99ddf2254feb #5
[ 1914.520238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc36 04/01/2014
[ 1914.520641] RIP: 0010:get_swap_pages+0x3b3/0x440
[ 1914.520860] Code: 48 8b 4c 24 30 48 c1 e0 3a 4c 09 e0 48 89 01 e8 43 79 96 00 e9 b2 fd ff ff 41 0f be 77 48 48 c7 c78
[ 1914.521709] RSP: 0018:ffffc9000ba0f838 EFLAGS: 00010282
[ 1914.521950] RAX: 0000000000000000 RBX: ffff888154411400 RCX: 0000000000000000
[ 1914.522273] RDX: 0000000000000004 RSI: ffffffff824035cb RDI: 0000000000000001
[ 1914.522601] RBP: ffff888100d95f68 R08: 0000000000000001 R09: 0000000000000003
[ 1914.522926] R10: ffffffff82a7a420 R11: ffffffff82a7a420 R12: 0000000000000350
[ 1914.523249] R13: ffff888100d95da8 R14: ffff888100d95f50 R15: ffff888100d95c00
[ 1914.523576] FS: 00007f23abea2600(0000) GS:ffff88823b600000(0000) knlGS:0000000000000000
[ 1914.523942] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1914.524206] CR2: 00007f23abbff000 CR3: 0000000104b86004 CR4: 0000000000770ee0
[ 1914.524534] PKRU: 55555554
[ 1914.524661] Call Trace:
[ 1914.524782] <TASK>
[ 1914.524889] folio_alloc_swap+0xde/0x230
[ 1914.525076] add_to_swap+0x36/0xb0
[ 1914.525242] shrink_folio_list+0x9ab/0xef0
[ 1914.525445] reclaim_folio_list+0x70/0x130
[ 1914.525644] reclaim_pages+0x9c/0x1c0
[ 1914.525819] madvise_cold_or_pageout_pte_range+0x79f/0xc80
[ 1914.526073] walk_pgd_range+0x4d8/0x940
[ 1914.526255] ? mt_find+0x15b/0x490
[ 1914.526426] __walk_page_range+0x211/0x230
[ 1914.526619] walk_page_range+0x17a/0x1e0
[ 1914.526807] madvise_pageout+0xef/0x250

And when I reverted my commit on the same branch (which needed some manual
edits), the problem is still there.

Another thing is, I noticed Rongwei mentioned "This problem exists in
versions after stable 5.10.y." in the changelog while my commit entered
mainline in v4.14.

So either this problem has always been there, i.e. since earlier than my
commit; or this problem is indeed only there after v5.10, in which case it
should be something else that triggered it. My qemu refuses to boot a v4.14
kernel so I cannot verify the former yet.

2023-04-06 12:13:54

by Aaron Lu

Subject: Re: [PATCH v2] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()

On Wed, Apr 05, 2023 at 12:08:47AM +0800, Rongwei Wang wrote:
> Hello
>
> I have fix up some stuff base on Patch v1. And in order to help all readers
> and reviewers to
>
> reproduce this bug, share a reproducer here:

I reproduced this problem under a VM this way:

$ sudo ./stress-ng --swap 1
// on another terminal
$ for i in `seq 8`; do ./swap & done

Looks simpler than yours :-)
(Didn't realize you had posted your reproducer here since I'm not CCed
and only found it after inventing mine.)
Then the warning message normally appears within a few seconds.

Here is the code for the above swap prog:

#include <stdio.h>
#include <stddef.h>
#include <sys/mman.h>

#define SIZE 0x100000

int main(void)
{
        int i, ret;
        void *p;

        p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return -1;
        }

        ret = 0;
        while (1) {
                for (i = 0; i < SIZE; i += 0x1000)
                        ((char *)p)[i] = 1;
                ret = madvise(p, SIZE, MADV_PAGEOUT);
                if (ret != 0) {
                        perror("madvise");
                        break;
                }
        }

        return ret;
}

Unfortunately, this test prog does not work on kernels before v5.4 because
MADV_PAGEOUT was introduced in v5.4. I tested on v5.4 and the problem is
also there.

I haven't found a way to trigger swapping with a swap device coming and
going on kernels before v5.4; I tried putting the test prog in a memcg with
a memory limit, but then the prog is easily killed because there is nowhere
to swap out.

>
> swap_bomb.sh
>
> #!/usr/bin/env bash
>
> stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap
> --verify -v &
> stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap
> --verify -v &
> stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap
> --verify -v &
> stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap
> --verify -v
>
>
> madvise_shared.c
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <unistd.h>
>
> #define MSIZE (1024 * 1024 * 2)
>
> int main()
> {
>         char *shm_addr;
>         unsigned long i;
>
>         while (1) {
>                 // Map shared memory segment
>                 shm_addr =
>                     mmap(NULL, MSIZE, PROT_READ | PROT_WRITE,
>                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>                 if (shm_addr == MAP_FAILED) {
>                         perror("Failed to map shared memory segment");
>                         exit(EXIT_FAILURE);
>                 }
>
>                 for (i = 0; i < MSIZE; i++) {
>                         shm_addr[i] = 1;
>                 }
>
>                 // Advise kernel on usage pattern of shared memory
>                 if (madvise(shm_addr, MSIZE, MADV_PAGEOUT) == -1) {
>                         perror("Failed to advise kernel on shared memory usage");
>                         exit(EXIT_FAILURE);
>                 }
>
>                 for (i = 0; i < MSIZE; i++) {
>                         shm_addr[i] = 1;
>                 }
>
>                 // Advise kernel on usage pattern of shared memory
>                 if (madvise(shm_addr, MSIZE, MADV_PAGEOUT) == -1) {
>                         perror("Failed to advise kernel on shared memory usage");
>                         exit(EXIT_FAILURE);
>                 }
>                 // Use shared memory
>                 printf("Hello, shared memory: 0x%lx\n", (unsigned long)shm_addr);
>
>                 // Unmap shared memory segment
>                 if (munmap(shm_addr, MSIZE) == -1) {
>                         perror("Failed to unmap shared memory segment");
>                         exit(EXIT_FAILURE);
>                 }
>         }
>
>         return 0;
> }
>
> The bug will reproduce more quickly (about 2~5 minutes) if concurrent more
> swap_bomb.sh and madvise_shared.
>
> Thanks.

2023-04-06 12:31:26

by Rongwei Wang

Subject: Re: [PATCH v2] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()


> It doesn't appear to be the case. For one thing, the problematic code
> that removes the swap device from the avail list without acquiring
> si->lock was there before my commit and my commit didn't change that
> behaviour. For another, I wanted to see if the problem is still there
> without my commit(just to make sure).
>
> I followed Rongwei's description and used stress-ng/swap test together
> with some test progs that does memory allocation then MADVISE(pageout)
> in a loop to reproduce this problem and I can also see the warning like
> below using Linus' master branch as of today, I believe this is the
> problem Rongwei described:
Hi Aaron, I am sure this is that bug, and the panic will happen when
CONFIG_PLIST_DEBUG is enabled (I'm not sure whether you have it enabled).
>
> [ 1914.518786] ------------[ cut here ]------------
> [ 1914.519049] swap_info 9 in list but !SWP_WRITEOK
> [ 1914.519274] WARNING: CPU: 14 PID: 14307 at mm/swapfile.c:1085 get_swap_pages+0x3b3/0x440
> [ 1914.519660] Modules linked in:
> [ 1914.519811] CPU: 14 PID: 14307 Comm: swap Tainted: G W 6.3.0-rc5-00032-g99ddf2254feb #5
> [ 1914.520238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc36 04/01/2014
> [ 1914.520641] RIP: 0010:get_swap_pages+0x3b3/0x440
> [ 1914.520860] Code: 48 8b 4c 24 30 48 c1 e0 3a 4c 09 e0 48 89 01 e8 43 79 96 00 e9 b2 fd ff ff 41 0f be 77 48 48 c7 c78
> [ 1914.521709] RSP: 0018:ffffc9000ba0f838 EFLAGS: 00010282
> [ 1914.521950] RAX: 0000000000000000 RBX: ffff888154411400 RCX: 0000000000000000
> [ 1914.522273] RDX: 0000000000000004 RSI: ffffffff824035cb RDI: 0000000000000001
> [ 1914.522601] RBP: ffff888100d95f68 R08: 0000000000000001 R09: 0000000000000003
> [ 1914.522926] R10: ffffffff82a7a420 R11: ffffffff82a7a420 R12: 0000000000000350
> [ 1914.523249] R13: ffff888100d95da8 R14: ffff888100d95f50 R15: ffff888100d95c00
> [ 1914.523576] FS: 00007f23abea2600(0000) GS:ffff88823b600000(0000) knlGS:0000000000000000
> [ 1914.523942] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1914.524206] CR2: 00007f23abbff000 CR3: 0000000104b86004 CR4: 0000000000770ee0
> [ 1914.524534] PKRU: 55555554
> [ 1914.524661] Call Trace:
> [ 1914.524782] <TASK>
> [ 1914.524889] folio_alloc_swap+0xde/0x230
> [ 1914.525076] add_to_swap+0x36/0xb0
> [ 1914.525242] shrink_folio_list+0x9ab/0xef0
> [ 1914.525445] reclaim_folio_list+0x70/0x130
> [ 1914.525644] reclaim_pages+0x9c/0x1c0
> [ 1914.525819] madvise_cold_or_pageout_pte_range+0x79f/0xc80
> [ 1914.526073] walk_pgd_range+0x4d8/0x940
> [ 1914.526255] ? mt_find+0x15b/0x490
> [ 1914.526426] __walk_page_range+0x211/0x230
> [ 1914.526619] walk_page_range+0x17a/0x1e0
> [ 1914.526807] madvise_pageout+0xef/0x250
>
> And when I reverted my commit on the same branch(needs some manual edits),
> the problem is still there.
>
> Another thing is, I noticed Rongwei mentioned "This problem exists in
> versions after stable 5.10.y." in the changelog while my commit entered
> mainline in v4.14.
>
> So either this problem is always there, i.e. earlier than my commit; or
> this problem is indeed only there after v5.10, then it should be something
> else that triggered it. My qemu refuses to boot v4.14 kernel so I can
> not verify the former yet.

Me too. The oldest kernel that my qemu can run is 4.19.

BTW, I tried to replace 'p' with 'si' today, and found there are many areas
that need to be modified, especially inside swapoff() and swapon(). So many
modifications may affect future tracking of code changes and will take some
time to test. So I want to confirm whether I need to do this. If needed, I
can continue with it.


Thanks.

2023-04-06 12:58:01

by Rongwei Wang

Subject: Re: [PATCH v2] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()

Oh, sorry, I missed this email just now because I was also replying to
your previous email.

On 2023/4/6 20:12, Aaron Lu wrote:
> On Wed, Apr 05, 2023 at 12:08:47AM +0800, Rongwei Wang wrote:
>> Hello
>>
>> I have fix up some stuff base on Patch v1. And in order to help all readers
>> and reviewers to
>>
>> reproduce this bug, share a reproducer here:
> I reproduced this problem under a VM this way:
>
> $ sudo ./stress-ng --swap 1
> // on another terminal
> $ for i in `seq 8`; do ./swap & done
> Looks simpler than yours :-)
Cool, that is indeed simpler.
> (Didn't realize you have posted your reproducer here since I'm not CCed
> and just found it after invented mine)
> Then the warning message normally appear within a few seconds.
>
> Here is the code for the above swap prog:
>
> #include <stdio.h>
> #include <stddef.h>
> #include <sys/mman.h>
>
> #define SIZE 0x100000
>
> int main(void)
> {
> int i, ret;
> void *p;
>
> p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
> if (p == MAP_FAILED) {
> perror("mmap");
> return -1;
> }
>
> ret = 0;
> while (1) {
> for (i = 0; i < SIZE; i += 0x1000)
> ((char *)p)[i] = 1;
> ret = madvise(p, SIZE, MADV_PAGEOUT);
> if (ret != 0) {
> perror("madvise");
> break;
> }
> }
>
> return ret;
> }
>
> Unfortunately, this test prog did not work on kernels before v5.4 because
> MADV_PAGEOUT is introduced in v5.4. I tested on v5.4 and the problem is
> also there.

Maybe that is why this bug has not been found until now. We first found it
triggered by stress-ng swap and stress-ng madvise (MADV_PAGEOUT), so that
seems to be the reason.

It seems MADV_COLD was also introduced together with MADV_PAGEOUT. I have
no other idea here and have to depend on you. :-)

>
> Haven't found a way to trigger swap with swap device come and go on
> kernels before v5.4; tried putting the test prog in a memcg with memory
> limit but then the prog is easily killed due to nowhere to swap out.
>
Personally, I do not intend to continue searching for a way to reproduce
it before v5.4. Of course, if you have an idea, I can try it.


Thanks:-)

2023-04-06 14:09:38

by Aaron Lu

Subject: Re: [PATCH v2] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()

On Tue, Apr 04, 2023 at 11:47:16PM +0800, Rongwei Wang wrote:
> The si->lock must be held when deleting the si from
> the available list. Otherwise, another thread can
> re-add the si to the available list, which can lead
> to memory corruption. The only place we have found
> where this happens is in the swapoff path. This case
> can be described as below:
>
> core 0 core 1
> swapoff
>
> del_from_avail_list(si) waiting
>
> try lock si->lock acquire swap_avail_lock
> and re-add si into
> swap_avail_head

confused here.

If del_from_avail_list(si) finished in the swapoff path, then this si should
not exist in any of the per-node avail lists and core 1 should not be
able to re-add it.

I stared at the code for a while and couldn't figure out how this
happened, will continue to look at this tomorrow.

Thanks,
Aaron

>
> acquire si->lock but
> missing si already be
> added again, and continuing
> to clear SWP_WRITEOK, etc.
>
> It can be easily found a massive warning messages can
> be triggered inside get_swap_pages() by some special
> cases, for example, we call madvise(MADV_PAGEOUT) on
> blocks of touched memory concurrently, meanwhile, run
> much swapon-swapoff operations (e.g. stress-ng-swap).
>
> However, in the worst case, panic can be caused by the
> above scene. In swapoff(), the memory used by si could
> be kept in swap_info[] after turning off a swap. This
> means memory corruption will not be caused immediately
> until allocated and reset for a new swap in the swapon
> path. A panic message caused:
> (with CONFIG_PLIST_DEBUG enabled)
>
> ------------[ cut here ]------------
> top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a
> prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d
> next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a
> WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70
> Modules linked in: rfkill(E) crct10dif_ce(E)...
> CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+
> Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
> pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
> pc : plist_check_prev_next_node+0x50/0x70
> lr : plist_check_prev_next_node+0x50/0x70
> sp : ffff0018009d3c30
> x29: ffff0018009d3c40 x28: ffff800011b32a98
> x27: 0000000000000000 x26: ffff001803908000
> x25: ffff8000128ea088 x24: ffff800011b32a48
> x23: 0000000000000028 x22: ffff001800875c00
> x21: ffff800010f9e520 x20: ffff001800875c00
> x19: ffff001800fdc6e0 x18: 0000000000000030
> x17: 0000000000000000 x16: 0000000000000000
> x15: 0736076307640766 x14: 0730073007380731
> x13: 0736076307640766 x12: 0730073007380731
> x11: 000000000004058d x10: 0000000085a85b76
> x9 : ffff8000101436e4 x8 : ffff800011c8ce08
> x7 : 0000000000000000 x6 : 0000000000000001
> x5 : ffff0017df9ed338 x4 : 0000000000000001
> x3 : ffff8017ce62a000 x2 : ffff0017df9ed340
> x1 : 0000000000000000 x0 : 0000000000000000
> Call trace:
> plist_check_prev_next_node+0x50/0x70
> plist_check_head+0x80/0xf0
> plist_add+0x28/0x140
> add_to_avail_list+0x9c/0xf0
> _enable_swap_info+0x78/0xb4
> __do_sys_swapon+0x918/0xa10
> __arm64_sys_swapon+0x20/0x30
> el0_svc_common+0x8c/0x220
> do_el0_svc+0x2c/0x90
> el0_svc+0x1c/0x30
> el0_sync_handler+0xa8/0xb0
> el0_sync+0x148/0x180
> irq event stamp: 2082270
>
> Now, si->lock locked before calling 'del_from_avail_list()'
> to make sure other thread see the si had been deleted
> and SWP_WRITEOK cleared together, will not reinsert again.
>
> This problem exists in versions after stable 5.10.y.
>
> Cc: [email protected]
> Tested-by: Yongchen Yin <[email protected]>
> Signed-off-by: Rongwei Wang <[email protected]>
> ---
> mm/swapfile.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 62ba2bf577d7..2c718f45745f 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p)
> {
> int nid;
>
> + assert_spin_locked(&p->lock);
> for_each_node(nid)
> plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
> }
> @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> spin_unlock(&swap_lock);
> goto out_dput;
> }
> - del_from_avail_list(p);
> spin_lock(&p->lock);
> + del_from_avail_list(p);
> if (p->prio < 0) {
> struct swap_info_struct *si = p;
> int nid;
> --
> 2.27.0
>

2023-04-06 15:01:43

by Aaron Lu

Subject: Re: [PATCH v2] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()

On Thu, Apr 06, 2023 at 10:04:16PM +0800, Aaron Lu wrote:
> On Tue, Apr 04, 2023 at 11:47:16PM +0800, Rongwei Wang wrote:
> > The si->lock must be held when deleting the si from
> > the available list. Otherwise, another thread can
> > re-add the si to the available list, which can lead
> > to memory corruption. The only place we have found
> > where this happens is in the swapoff path. This case
> > can be described as below:
> >
> > core 0 core 1
> > swapoff
> >
> > del_from_avail_list(si) waiting
> >
> > try lock si->lock acquire swap_avail_lock
> > and re-add si into
> > swap_avail_head
>
> confused here.
>
> If del_from_avail_list(si) finished in swaoff path, then this si should
> not exist in any of the per-node avail list and core 1 should not be
> able to re-add it.

I think a possible sequence could be like this:

cpuX cpuY
swapoff put_swap_folio()

del_from_avail_list(si)
taken si->lock
spin_lock(&si->lock);

swap_range_free()
was_full && SWP_WRITEOK -> re-add!
drop si->lock

taken si->lock
proceed removing si

End result: si left on avail_list after being swapped off.
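
(Illustration only, not kernel code: below is a tiny userspace model of this
interleaving, where on_avail_list stands in for plist membership and writeok
for SWP_WRITEOK; both names are invented for the sketch. With the pre-patch
ordering, where the delete is done before taking si->lock, the "device" can
end up back on the list even though "swapoff" has completed.)

/* toy_race.c - illustration only; build: cc -pthread toy_race.c -o toy_race */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t avail_lock = PTHREAD_MUTEX_INITIALIZER; /* ~ swap_avail_lock */
static pthread_mutex_t si_lock = PTHREAD_MUTEX_INITIALIZER;    /* ~ si->lock */
static bool on_avail_list = true; /* stand-in for plist membership */
static bool writeok = true;       /* stand-in for SWP_WRITEOK */

/* pre-patch swapoff ordering: delete first, take si_lock afterwards */
static void *swapoff_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&avail_lock);
        on_avail_list = false;          /* del_from_avail_list(si) */
        pthread_mutex_unlock(&avail_lock);

        usleep(1000);                   /* window: si_lock not held yet */

        pthread_mutex_lock(&si_lock);
        writeok = false;                /* clear SWP_WRITEOK, etc. */
        pthread_mutex_unlock(&si_lock);
        return NULL;
}

/* ~ put_swap_folio()/swap_range_free(): re-add a device that went from
 * full to non-full, gated only on the writeok flag */
static void *free_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&si_lock);
        if (writeok) {                  /* swapoff has not cleared it yet */
                pthread_mutex_lock(&avail_lock);
                on_avail_list = true;   /* add_to_avail_list(si): re-added! */
                pthread_mutex_unlock(&avail_lock);
        }
        pthread_mutex_unlock(&si_lock);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, swapoff_thread, NULL);
        pthread_create(&b, NULL, free_thread, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        /* with this ordering the run usually ends with the "device" back on
         * the list even though "swapoff" has completed */
        printf("on_avail_list=%d writeok=%d\n", on_avail_list, writeok);
        return 0;
}

With the patched ordering (take si_lock, then do the delete and clear
writeok in the same critical section), free_thread either runs before that
section, in which case the subsequent delete still removes the entry, or
runs after it and sees writeok already cleared, so it never re-adds a dying
device.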

The problem is that add_to_avail_list() has no idea this si is being
swapped off, and taking si->lock before del_from_avail_list() avoids
this problem. So I think this patch did the right thing, but the
changelog about how this happened needs updating, and after that:

Reviewed-by: Aaron Lu <[email protected]>

Thanks,
Aaron

>
> I stared at the code for a while and couldn't figure out how this
> happened, will continue to look at this tomorrow.
> >
> > acquire si->lock but
> > missing si already be
> > added again, and continuing
> > to clear SWP_WRITEOK, etc.
> >
> > It can be easily found a massive warning messages can
> > be triggered inside get_swap_pages() by some special
> > cases, for example, we call madvise(MADV_PAGEOUT) on
> > blocks of touched memory concurrently, meanwhile, run
> > much swapon-swapoff operations (e.g. stress-ng-swap).
> >
> > However, in the worst case, panic can be caused by the
> > above scene. In swapoff(), the memory used by si could
> > be kept in swap_info[] after turning off a swap. This
> > means memory corruption will not be caused immediately
> > until allocated and reset for a new swap in the swapon
> > path. A panic message caused:
> > (with CONFIG_PLIST_DEBUG enabled)
> >
> > ------------[ cut here ]------------
> > top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a
> > prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d
> > next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a
> > WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70
> > Modules linked in: rfkill(E) crct10dif_ce(E)...
> > CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+
> > Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
> > pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
> > pc : plist_check_prev_next_node+0x50/0x70
> > lr : plist_check_prev_next_node+0x50/0x70
> > sp : ffff0018009d3c30
> > x29: ffff0018009d3c40 x28: ffff800011b32a98
> > x27: 0000000000000000 x26: ffff001803908000
> > x25: ffff8000128ea088 x24: ffff800011b32a48
> > x23: 0000000000000028 x22: ffff001800875c00
> > x21: ffff800010f9e520 x20: ffff001800875c00
> > x19: ffff001800fdc6e0 x18: 0000000000000030
> > x17: 0000000000000000 x16: 0000000000000000
> > x15: 0736076307640766 x14: 0730073007380731
> > x13: 0736076307640766 x12: 0730073007380731
> > x11: 000000000004058d x10: 0000000085a85b76
> > x9 : ffff8000101436e4 x8 : ffff800011c8ce08
> > x7 : 0000000000000000 x6 : 0000000000000001
> > x5 : ffff0017df9ed338 x4 : 0000000000000001
> > x3 : ffff8017ce62a000 x2 : ffff0017df9ed340
> > x1 : 0000000000000000 x0 : 0000000000000000
> > Call trace:
> > plist_check_prev_next_node+0x50/0x70
> > plist_check_head+0x80/0xf0
> > plist_add+0x28/0x140
> > add_to_avail_list+0x9c/0xf0
> > _enable_swap_info+0x78/0xb4
> > __do_sys_swapon+0x918/0xa10
> > __arm64_sys_swapon+0x20/0x30
> > el0_svc_common+0x8c/0x220
> > do_el0_svc+0x2c/0x90
> > el0_svc+0x1c/0x30
> > el0_sync_handler+0xa8/0xb0
> > el0_sync+0x148/0x180
> > irq event stamp: 2082270
> >
> > Now, si->lock locked before calling 'del_from_avail_list()'
> > to make sure other thread see the si had been deleted
> > and SWP_WRITEOK cleared together, will not reinsert again.
> >
> > This problem exists in versions after stable 5.10.y.
> >
> > Cc: [email protected]
> > Tested-by: Yongchen Yin <[email protected]>
> > Signed-off-by: Rongwei Wang <[email protected]>
> > ---
> > mm/swapfile.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 62ba2bf577d7..2c718f45745f 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p)
> > {
> > int nid;
> >
> > + assert_spin_locked(&p->lock);
> > for_each_node(nid)
> > plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
> > }
> > @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> > spin_unlock(&swap_lock);
> > goto out_dput;
> > }
> > - del_from_avail_list(p);
> > spin_lock(&p->lock);
> > + del_from_avail_list(p);
> > if (p->prio < 0) {
> > struct swap_info_struct *si = p;
> > int nid;
> > --
> > 2.27.0
> >

2023-04-07 02:38:24

by Rongwei Wang

Subject: Re: [PATCH v2] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()


On 2023/4/6 22:57, Aaron Lu wrote:
> On Thu, Apr 06, 2023 at 10:04:16PM +0800, Aaron Lu wrote:
>> On Tue, Apr 04, 2023 at 11:47:16PM +0800, Rongwei Wang wrote:
>>> The si->lock must be held when deleting the si from
>>> the available list. Otherwise, another thread can
>>> re-add the si to the available list, which can lead
>>> to memory corruption. The only place we have found
>>> where this happens is in the swapoff path. This case
>>> can be described as below:
>>>
>>> core 0 core 1
>>> swapoff
>>>
>>> del_from_avail_list(si) waiting
>>>
>>> try lock si->lock acquire swap_avail_lock
>>> and re-add si into
>>> swap_avail_head
>> confused here.
>>
>> If del_from_avail_list(si) finished in swaoff path, then this si should
>> not exist in any of the per-node avail list and core 1 should not be
>> able to re-add it.
> I think a possible sequence could be like this:
>
> cpuX cpuY
> swapoff put_swap_folio()
>
> del_from_avail_list(si)
> taken si->lock
> spin_lock(&si->lock);
>
> swap_range_free()
> was_full && SWP_WRITEOK -> re-add!
> drop si->lock
>
> taken si->lock
> proceed removing si
>
> End result: si left on avail_list after being swapped off.
>
> The problem is, in add_to_avail_list(), it has no idea this si is being
> swapped off and taking si->lock then del_from_avail_list() could avoid
> this problem, so I think this patch did the right thing but the
> changelog about how this happened needs updating and after that:

Hi Aaron

That's my fault. Actually, I didn't refer specifically to the
swap_range_free() path in the commit message, mainly because cpuY can stand
for any thread that is waiting on swap_avail_lock and then calls
add_to_avail_list(), not only swap_range_free(); it could also be swapon(),
for example.

>
> Reviewed-by: Aaron Lu <[email protected]>

Thanks for your time.

-wrw

>
> Thanks,
> Aaron
>