I've just experienced this bug with ksmd:
[ 55.837551] BUG: unable to handle kernel NULL pointer dereference at 00000000000000e8
[ 55.837598] IP: [<ffffffff810bb9b2>] __lock_acquire+0x62/0x1d70
[ 55.837630] PGD 0
[ 55.837643] Oops: 0000 [#1] SMP
[ 55.837663] CPU 2
[ 55.837674] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_conexant rtl8192ce rtl8192c_common rtlwifi mac80211 usbhid hid cfg80211 snd_hda_intel snd_hda_codec psmouse snd_pcm e1000e thinkpad_acpi snd_timer snd_page_alloc snd soundcore nvram
[ 55.837816]
[ 55.837824] Pid: 33, comm: ksmd Not tainted 3.0.0-rc1+ #289 LENOVO 4286CTO/4286CTO
[ 55.837850] RIP: 0010:[<ffffffff810bb9b2>] [<ffffffff810bb9b2>] __lock_acquire+0x62/0x1d70
[ 55.837878] RSP: 0018:ffff88023d3abc50 EFLAGS: 00010046
[ 55.837894] RAX: 0000000000000046 RBX: 00000000000000e8 RCX: 0000000000000001
[ 55.837915] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000e8
[ 55.837936] RBP: ffff88023d3abd40 R08: 0000000000000002 R09: 0000000000000000
[ 55.837957] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88023d3a3e00
[ 55.837978] R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000000
[ 55.837999] FS: 0000000000000000(0000) GS:ffff88023e280000(0000) knlGS:0000000000000000
[ 55.838022] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 55.838039] CR2: 00000000000000e8 CR3: 00000000016f5000 CR4: 00000000000406e0
[ 55.838060] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 55.838081] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 55.838102] Process ksmd (pid: 33, threadinfo ffff88023d3aa000, task ffff88023d3a3e00)
[ 55.838131] Stack:
[ 55.838140] ffff88023d3abce0 0000000000000000 ffffffff81d46810 00000000000012c7
[ 55.838168] 000000000000037c ffff88023d3a3e00 0000000000000001 0000000000000000
[ 55.838338] 0000000000000000 0000000000000000 00000000001ba37c ffffffff81a22000
[ 55.838365] Call Trace:
[ 55.838375] [<ffffffff810be55f>] ? mark_held_locks+0x6f/0xa0
[ 55.838394] [<ffffffff814e3360>] ? _raw_spin_unlock_irqrestore+0x40/0x70
[ 55.838416] [<ffffffff810bdc90>] lock_acquire+0x90/0x110
[ 55.838434] [<ffffffff8114c652>] ? ksm_scan_thread+0x132/0xe20
[ 55.838453] [<ffffffff8112df6c>] ? free_percpu+0x9c/0x130
[ 55.838470] [<ffffffff814e1cbc>] down_read+0x4c/0x70
[ 55.838486] [<ffffffff8114c652>] ? ksm_scan_thread+0x132/0xe20
[ 55.838505] [<ffffffff814e33bb>] ? _raw_spin_unlock+0x2b/0x40
[ 55.838523] [<ffffffff8114c652>] ksm_scan_thread+0x132/0xe20
[ 55.838541] [<ffffffff814df822>] ? schedule+0x3b2/0x960
[ 55.838559] [<ffffffff810a5690>] ? wake_up_bit+0x40/0x40
[ 55.838576] [<ffffffff8114c520>] ? run_store+0x310/0x310
[ 55.838593] [<ffffffff810a5186>] kthread+0x96/0xa0
[ 55.838609] [<ffffffff814e5014>] kernel_thread_helper+0x4/0x10
[ 55.838628] [<ffffffff814e3700>] ? retint_restore_args+0xe/0xe
[ 55.838647] [<ffffffff810a50f0>] ? __init_kthread_worker+0x70/0x70
[ 55.838666] [<ffffffff814e5010>] ? gs_change+0xb/0xb
[ 55.838681] Code: b7 00 00 48 89 fb 85 c0 41 89 f5 45 0f 45 f0 8b 05 84 de 68 00 85 c0 0f 84 7b 09 00 00 8b 05 7a 49 7a 00 85 c0 0f 84 c6 01 00 00
[ 55.838780] 8b 03 ba 01 00 00 00 48 3d e0 3c 8c 81 44 0f 44 f2 41 83 fd
[ 55.838830] RIP [<ffffffff810bb9b2>] __lock_acquire+0x62/0x1d70
[ 55.838850] RSP <ffff88023d3abc50>
[ 55.839567] CR2: 00000000000000e8
[ 55.895721] ---[ end trace eea0fa5dfa6846f1 ]---
The bug can be easily reproduced using the following testcase:
========================
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#define BUFSIZE getpagesize()
int main(int argc, char **argv)
{
void *ptr;
if (posix_memalign(&ptr, getpagesize(), BUFSIZE) < 0) {
perror("posix_memalign");
exit(1);
}
if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) < 0) {
perror("madvise");
exit(1);
}
*(char *)NULL = 0;
return 0;
}
========================
It seems that when a task segfaults mm_slot->mm becomes NULL, but it's
still wrongly considered by the ksm scan. Is there a race with
__ksm_exit()?
Probably the following is not the right way to fix it, but if I apply
this the problem disappears. Anyway, I'm posting this information, it
can help you to debug the problem better.
Signed-off-by: Andrea Righi <[email protected]>
---
mm/ksm.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index d708b3e..f457feb 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1308,6 +1308,8 @@ next_mm:
}
mm = slot->mm;
+ if (unlikely(!mm))
+ return NULL;
down_read(&mm->mmap_sem);
if (ksm_test_exit(mm))
vma = NULL;
* Andrea Righi ([email protected]) wrote:
> The bug can be easily reproduced using the following testcase:
Thanks for the testcase.
> ========================
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
>
> #define BUFSIZE getpagesize()
>
> int main(int argc, char **argv)
> {
> void *ptr;
>
> if (posix_memalign(&ptr, getpagesize(), BUFSIZE) < 0) {
> perror("posix_memalign");
> exit(1);
> }
> if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) < 0) {
> perror("madvise");
> exit(1);
> }
> *(char *)NULL = 0;
>
> return 0;
> }
> ========================
>
> It seems that when a task segfaults mm_slot->mm becomes NULL, but it's
> still wrongly considered by the ksm scan. Is there a race with
> __ksm_exit()?
Hmm, wonder if khugepaged has the same issue too. We should be holding
a reference to ->mm, but we seem to have inconsistent serialization w/
mmap_sem. Hugh mentioned some of these concerns when introducing
ksm_exit.
Hello,
----- Original Message -----
> I've just experienced this bug with ksmd:
>
> [ 55.837551] BUG: unable to handle kernel NULL pointer dereference at
> 00000000000000e8
> [ 55.837598] IP: [<ffffffff810bb9b2>] __lock_acquire+0x62/0x1d70
> [ 55.837630] PGD 0
> [ 55.837643] Oops: 0000 [#1] SMP
> [ 55.837663] CPU 2
> [ 55.837674] Modules linked in: snd_hda_codec_hdmi
> snd_hda_codec_conexant rtl8192ce rtl8192c_common rtlwifi mac80211
> usbhid hid cfg80211 snd_hda_intel snd_hda_codec psmouse snd_pcm e1000e
> thinkpad_acpi snd_timer snd_page_alloc snd soundcore nvram
> [ 55.837816]
> [ 55.837824] Pid: 33, comm: ksmd Not tainted 3.0.0-rc1+ #289 LENOVO
> 4286CTO/4286CTO
> [ 55.837850] RIP: 0010:[<ffffffff810bb9b2>] [<ffffffff810bb9b2>]
> __lock_acquire+0x62/0x1d70
> [ 55.837878] RSP: 0018:ffff88023d3abc50 EFLAGS: 00010046
> [ 55.837894] RAX: 0000000000000046 RBX: 00000000000000e8 RCX:
> 0000000000000001
> [ 55.837915] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> 00000000000000e8
> [ 55.837936] RBP: ffff88023d3abd40 R08: 0000000000000002 R09:
> 0000000000000000
> [ 55.837957] R10: 0000000000000001 R11: 0000000000000000 R12:
> ffff88023d3a3e00
> [ 55.837978] R13: 0000000000000000 R14: 0000000000000002 R15:
> 0000000000000000
> [ 55.837999] FS: 0000000000000000(0000) GS:ffff88023e280000(0000)
> knlGS:0000000000000000
> [ 55.838022] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 55.838039] CR2: 00000000000000e8 CR3: 00000000016f5000 CR4:
> 00000000000406e0
> [ 55.838060] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 55.838081] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [ 55.838102] Process ksmd (pid: 33, threadinfo ffff88023d3aa000, task
> ffff88023d3a3e00)
> [ 55.838131] Stack:
> [ 55.838140] ffff88023d3abce0 0000000000000000 ffffffff81d46810
> 00000000000012c7
> [ 55.838168] 000000000000037c ffff88023d3a3e00 0000000000000001
> 0000000000000000
> [ 55.838338] 0000000000000000 0000000000000000 00000000001ba37c
> ffffffff81a22000
> [ 55.838365] Call Trace:
> [ 55.838375] [<ffffffff810be55f>] ? mark_held_locks+0x6f/0xa0
> [ 55.838394] [<ffffffff814e3360>] ?
> _raw_spin_unlock_irqrestore+0x40/0x70
> [ 55.838416] [<ffffffff810bdc90>] lock_acquire+0x90/0x110
> [ 55.838434] [<ffffffff8114c652>] ? ksm_scan_thread+0x132/0xe20
> [ 55.838453] [<ffffffff8112df6c>] ? free_percpu+0x9c/0x130
> [ 55.838470] [<ffffffff814e1cbc>] down_read+0x4c/0x70
> [ 55.838486] [<ffffffff8114c652>] ? ksm_scan_thread+0x132/0xe20
> [ 55.838505] [<ffffffff814e33bb>] ? _raw_spin_unlock+0x2b/0x40
> [ 55.838523] [<ffffffff8114c652>] ksm_scan_thread+0x132/0xe20
> [ 55.838541] [<ffffffff814df822>] ? schedule+0x3b2/0x960
> [ 55.838559] [<ffffffff810a5690>] ? wake_up_bit+0x40/0x40
> [ 55.838576] [<ffffffff8114c520>] ? run_store+0x310/0x310
> [ 55.838593] [<ffffffff810a5186>] kthread+0x96/0xa0
> [ 55.838609] [<ffffffff814e5014>] kernel_thread_helper+0x4/0x10
> [ 55.838628] [<ffffffff814e3700>] ? retint_restore_args+0xe/0xe
> [ 55.838647] [<ffffffff810a50f0>] ? __init_kthread_worker+0x70/0x70
> [ 55.838666] [<ffffffff814e5010>] ? gs_change+0xb/0xb
> [ 55.838681] Code: b7 00 00 48 89 fb 85 c0 41 89 f5 45 0f 45 f0 8b 05
> 84 de 68 00 85 c0 0f 84 7b 09 00 00 8b 05 7a 49 7a 00 85 c0 0f 84 c6
> 01 00 00
> [ 55.838780] 8b 03 ba 01 00 00 00 48 3d e0 3c 8c 81 44 0f 44 f2 41 83
> fd
> [ 55.838830] RIP [<ffffffff810bb9b2>] __lock_acquire+0x62/0x1d70
> [ 55.838850] RSP <ffff88023d3abc50>
> [ 55.839567] CR2: 00000000000000e8
> [ 55.895721] ---[ end trace eea0fa5dfa6846f1 ]---
>
> The bug can be easily reproduced using the following testcase:
>
> ========================
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
>
> #define BUFSIZE getpagesize()
>
> int main(int argc, char **argv)
> {
> void *ptr;
>
> if (posix_memalign(&ptr, getpagesize(), BUFSIZE) < 0) {
> perror("posix_memalign");
> exit(1);
> }
> if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) < 0) {
> perror("madvise");
> exit(1);
> }
> *(char *)NULL = 0;
Hmm, the reproducer gave something else here but no panic.
$ strace ./test
execve("./test", ["./test"], [/* 26 vars */]) = 0
brk(0) = 0x220f000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd18ec0a000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=41227, ...}) = 0
mmap(NULL, 41227, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fd18ebff000
close(3) = 0
open("/lib64/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\355\341n<\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1912928, ...}) = 0
mmap(0x3c6ee00000, 3737768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c6ee00000
mprotect(0x3c6ef87000, 2097152, PROT_NONE) = 0
mmap(0x3c6f187000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x187000) = 0x3c6f187000
mmap(0x3c6f18c000, 18600, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3c6f18c000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd18ebfe000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd18ebfd000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd18ebfc000
arch_prctl(ARCH_SET_FS, 0x7fd18ebfd700) = 0
mprotect(0x3c6f187000, 16384, PROT_READ) = 0
mprotect(0x3c6e81f000, 4096, PROT_READ) = 0
munmap(0x7fd18ebff000, 41227) = 0
brk(0) = 0x220f000
brk(0x2232000) = 0x2232000
madvise(0x2210000, 4096, 0xc /* MADV_??? */) = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV (core dumped) +++
Segmentation fault (core dumped)
Did I miss anything?
Thanks,
CAI Qian
> return 0;
> }
> ========================
>
> It seems that when a task segfaults mm_slot->mm becomes NULL, but it's
> still wrongly considered by the ksm scan. Is there a race with
> __ksm_exit()?
>
> Probably the following is not the right way to fix it, but if I apply
> this the problem disappears. Anyway, I'm posting this information, it
> can help you to debug the problem better.
>
> Signed-off-by: Andrea Righi <[email protected]>
> ---
> mm/ksm.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index d708b3e..f457feb 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1308,6 +1308,8 @@ next_mm:
> }
>
> mm = slot->mm;
> + if (unlikely(!mm))
> + return NULL;
> down_read(&mm->mmap_sem);
> if (ksm_test_exit(mm))
> vma = NULL;
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign
> http://stopthemeter.ca/
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Thu, Jun 02, 2011 at 03:09:53AM -0400, CAI Qian wrote:
> Hello,
>
> ----- Original Message -----
> > I've just experienced this bug with ksmd:
> >
> > [ 55.837551] BUG: unable to handle kernel NULL pointer dereference at
> > 00000000000000e8
> > [ 55.837598] IP: [<ffffffff810bb9b2>] __lock_acquire+0x62/0x1d70
> > [ 55.837630] PGD 0
> > [ 55.837643] Oops: 0000 [#1] SMP
> > [ 55.837663] CPU 2
> > [ 55.837674] Modules linked in: snd_hda_codec_hdmi
> > snd_hda_codec_conexant rtl8192ce rtl8192c_common rtlwifi mac80211
> > usbhid hid cfg80211 snd_hda_intel snd_hda_codec psmouse snd_pcm e1000e
> > thinkpad_acpi snd_timer snd_page_alloc snd soundcore nvram
> > [ 55.837816]
> > [ 55.837824] Pid: 33, comm: ksmd Not tainted 3.0.0-rc1+ #289 LENOVO
> > 4286CTO/4286CTO
> > [ 55.837850] RIP: 0010:[<ffffffff810bb9b2>] [<ffffffff810bb9b2>]
> > __lock_acquire+0x62/0x1d70
> > [ 55.837878] RSP: 0018:ffff88023d3abc50 EFLAGS: 00010046
> > [ 55.837894] RAX: 0000000000000046 RBX: 00000000000000e8 RCX:
> > 0000000000000001
> > [ 55.837915] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> > 00000000000000e8
> > [ 55.837936] RBP: ffff88023d3abd40 R08: 0000000000000002 R09:
> > 0000000000000000
> > [ 55.837957] R10: 0000000000000001 R11: 0000000000000000 R12:
> > ffff88023d3a3e00
> > [ 55.837978] R13: 0000000000000000 R14: 0000000000000002 R15:
> > 0000000000000000
> > [ 55.837999] FS: 0000000000000000(0000) GS:ffff88023e280000(0000)
> > knlGS:0000000000000000
> > [ 55.838022] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [ 55.838039] CR2: 00000000000000e8 CR3: 00000000016f5000 CR4:
> > 00000000000406e0
> > [ 55.838060] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > [ 55.838081] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> > 0000000000000400
> > [ 55.838102] Process ksmd (pid: 33, threadinfo ffff88023d3aa000, task
> > ffff88023d3a3e00)
> > [ 55.838131] Stack:
> > [ 55.838140] ffff88023d3abce0 0000000000000000 ffffffff81d46810
> > 00000000000012c7
> > [ 55.838168] 000000000000037c ffff88023d3a3e00 0000000000000001
> > 0000000000000000
> > [ 55.838338] 0000000000000000 0000000000000000 00000000001ba37c
> > ffffffff81a22000
> > [ 55.838365] Call Trace:
> > [ 55.838375] [<ffffffff810be55f>] ? mark_held_locks+0x6f/0xa0
> > [ 55.838394] [<ffffffff814e3360>] ?
> > _raw_spin_unlock_irqrestore+0x40/0x70
> > [ 55.838416] [<ffffffff810bdc90>] lock_acquire+0x90/0x110
> > [ 55.838434] [<ffffffff8114c652>] ? ksm_scan_thread+0x132/0xe20
> > [ 55.838453] [<ffffffff8112df6c>] ? free_percpu+0x9c/0x130
> > [ 55.838470] [<ffffffff814e1cbc>] down_read+0x4c/0x70
> > [ 55.838486] [<ffffffff8114c652>] ? ksm_scan_thread+0x132/0xe20
> > [ 55.838505] [<ffffffff814e33bb>] ? _raw_spin_unlock+0x2b/0x40
> > [ 55.838523] [<ffffffff8114c652>] ksm_scan_thread+0x132/0xe20
> > [ 55.838541] [<ffffffff814df822>] ? schedule+0x3b2/0x960
> > [ 55.838559] [<ffffffff810a5690>] ? wake_up_bit+0x40/0x40
> > [ 55.838576] [<ffffffff8114c520>] ? run_store+0x310/0x310
> > [ 55.838593] [<ffffffff810a5186>] kthread+0x96/0xa0
> > [ 55.838609] [<ffffffff814e5014>] kernel_thread_helper+0x4/0x10
> > [ 55.838628] [<ffffffff814e3700>] ? retint_restore_args+0xe/0xe
> > [ 55.838647] [<ffffffff810a50f0>] ? __init_kthread_worker+0x70/0x70
> > [ 55.838666] [<ffffffff814e5010>] ? gs_change+0xb/0xb
> > [ 55.838681] Code: b7 00 00 48 89 fb 85 c0 41 89 f5 45 0f 45 f0 8b 05
> > 84 de 68 00 85 c0 0f 84 7b 09 00 00 8b 05 7a 49 7a 00 85 c0 0f 84 c6
> > 01 00 00
> > [ 55.838780] 8b 03 ba 01 00 00 00 48 3d e0 3c 8c 81 44 0f 44 f2 41 83
> > fd
> > [ 55.838830] RIP [<ffffffff810bb9b2>] __lock_acquire+0x62/0x1d70
> > [ 55.838850] RSP <ffff88023d3abc50>
> > [ 55.839567] CR2: 00000000000000e8
> > [ 55.895721] ---[ end trace eea0fa5dfa6846f1 ]---
> >
> > The bug can be easily reproduced using the following testcase:
> >
> > ========================
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > #include <sys/mman.h>
> >
> > #define BUFSIZE getpagesize()
> >
> > int main(int argc, char **argv)
> > {
> > void *ptr;
> >
> > if (posix_memalign(&ptr, getpagesize(), BUFSIZE) < 0) {
> > perror("posix_memalign");
> > exit(1);
> > }
> > if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) < 0) {
> > perror("madvise");
> > exit(1);
> > }
> > *(char *)NULL = 0;
> Hmm, the reproducer gave something else here but no panic.
> $ strace ./test
> execve("./test", ["./test"], [/* 26 vars */]) = 0
> brk(0) = 0x220f000
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd18ec0a000
> access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
> open("/etc/ld.so.cache", O_RDONLY) = 3
> fstat(3, {st_mode=S_IFREG|0644, st_size=41227, ...}) = 0
> mmap(NULL, 41227, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fd18ebff000
> close(3) = 0
> open("/lib64/libc.so.6", O_RDONLY) = 3
> read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\355\341n<\0\0\0"..., 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=1912928, ...}) = 0
> mmap(0x3c6ee00000, 3737768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c6ee00000
> mprotect(0x3c6ef87000, 2097152, PROT_NONE) = 0
> mmap(0x3c6f187000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x187000) = 0x3c6f187000
> mmap(0x3c6f18c000, 18600, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3c6f18c000
> close(3) = 0
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd18ebfe000
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd18ebfd000
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd18ebfc000
> arch_prctl(ARCH_SET_FS, 0x7fd18ebfd700) = 0
> mprotect(0x3c6f187000, 16384, PROT_READ) = 0
> mprotect(0x3c6e81f000, 4096, PROT_READ) = 0
> munmap(0x7fd18ebff000, 41227) = 0
> brk(0) = 0x220f000
> brk(0x2232000) = 0x2232000
> madvise(0x2210000, 4096, 0xc /* MADV_??? */) = 0
> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> +++ killed by SIGSEGV (core dumped) +++
> Segmentation fault (core dumped)
>
> Did I miss anything?
mmh.. I can reproduce the bug also with the standard ubuntu (11.04)
kernel. Could you post your .config?
Thanks,
-Andrea
* CAI Qian ([email protected]) wrote:
> madvise(0x2210000, 4096, 0xc /* MADV_??? */) = 0
> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
Right, that's just what the program is trying to do, segfault.
> +++ killed by SIGSEGV (core dumped) +++
> Segmentation fault (core dumped)
>
> Did I miss anything?
I found it works but not 100% of the time.
So I just run the bug in a loop.
On Thu, Jun 02, 2011 at 07:31:43AM -0700, Chris Wright wrote:
> * CAI Qian ([email protected]) wrote:
> > madvise(0x2210000, 4096, 0xc /* MADV_??? */) = 0
> > --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>
> Right, that's just what the program is trying to do, segfault.
>
> > +++ killed by SIGSEGV (core dumped) +++
> > Segmentation fault (core dumped)
> >
> > Did I miss anything?
>
> I found it works but not 100% of the time.
>
> So I just run the bug in a loop.
echo 0 >scan_millisecs helps.
* Andrea Arcangeli ([email protected]) wrote:
> On Thu, Jun 02, 2011 at 07:31:43AM -0700, Chris Wright wrote:
> > * CAI Qian ([email protected]) wrote:
> > > madvise(0x2210000, 4096, 0xc /* MADV_??? */) = 0
> > > --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> >
> > Right, that's just what the program is trying to do, segfault.
> >
> > > +++ killed by SIGSEGV (core dumped) +++
> > > Segmentation fault (core dumped)
> > >
> > > Did I miss anything?
> >
> > I found it works but not 100% of the time.
> >
> > So I just run the bug in a loop.
>
> echo 0 >scan_millisecs helps.
BTW, here's my stack trace (I dropped back to 2.6.39 just to see if it
happened to be recent regression). It looks like mm_slot is off the list:
R10: dead000000200200 R11: dead000000100100
w/ ->mm == NULL. Smells like use after free, but doesn't quite all add up.
BUG: unable to handle kernel
ksmd-bug[14824]: segfault at 0 ip 0000000000400677 sp 00007fff987cb8b0 error 6 in ksmd-bug[400000+1000]
NULL pointer dereference at 0000000000000060
IP: [<ffffffff815be345>] down_read+0x19/0x28
PGD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
CPU 6
Modules linked in: bridge stp vhost_net macvtap macvlan tun kvm_intel kvm ixgbe mdio igb [last unloaded: scsi_wait_scan]
Pid: 825, comm: ksmd Not tainted 2.6.39+ #23 Supermicro X8DTN/X8DTN
RIP: 0010:[<ffffffff815be345>] [<ffffffff815be345>] down_read+0x19/0x28
RSP: 0018:ffff8801b5325e10 EFLAGS: 00010246
RAX: 0000000000000060 RBX: 0000000000000060 RCX: 00000000ffffffff
RDX: 0000000000000000 RSI: 0000000000000286 RDI: 0000000000000060
RBP: ffff8801b5325e20 R08: ffffffffffffffff R09: ffff8801b5325db0
R10: dead000000200200 R11: dead000000100100 R12: 0000000000000000
R13: ffffffff81a20e60 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88033fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000060 CR3: 0000000001a03000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ksmd (pid: 825, threadinfo ffff8801b5324000, task ffff8801b55196b0)
Stack:
ffff8801b5325e20 0000000000000060 ffff8801b5325ee0 ffffffff810eec3a
ffff88033fd51cc0 ffff8801b5325e80 ffff8801b5325ee0 ffffffff00000000
ffff8801b55196b0 ffff8801b5325e98 ffff8801b5324000 00000064b5324010
Call Trace:
[<ffffffff810eec3a>] ksm_scan_thread+0x12d/0xc47
[<ffffffff8105a6e7>] ? wake_up_bit+0x2a/0x2a
[<ffffffff810eeb0d>] ? try_to_merge_with_ksm_page+0x498/0x498
[<ffffffff8105a25e>] kthread+0x82/0x8a
[<ffffffff815c6354>] kernel_thread_helper+0x4/0x10
[<ffffffff8105a1dc>] ? kthread_worker_fn+0x13f/0x13f
[<ffffffff815c6350>] ? gs_change+0xb/0xb
Code: 48 0f c1 10 48 85 d2 74 05 e8 98 84 c8 ff 58 5b c9 c3 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 89 fb e8 85 f4 ff ff 48 89 d8 <f0> 48 ff 00 79 05 e8 40 84 c8 ff 5a 5b c9 c3 55 48 89 e5 0f 1f
RIP [<ffffffff815be345>] down_read+0x19/0x28
RSP <ffff8801b5325e10>
CR2: 0000000000000060
On Thu, Jun 02, 2011 at 08:36:41AM -0700, Chris Wright wrote:
> * Andrea Arcangeli ([email protected]) wrote:
> > On Thu, Jun 02, 2011 at 07:31:43AM -0700, Chris Wright wrote:
> > > * CAI Qian ([email protected]) wrote:
> > > > madvise(0x2210000, 4096, 0xc /* MADV_??? */) = 0
> > > > --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> > >
> > > Right, that's just what the program is trying to do, segfault.
> > >
> > > > +++ killed by SIGSEGV (core dumped) +++
> > > > Segmentation fault (core dumped)
> > > >
> > > > Did I miss anything?
> > >
> > > I found it works but not 100% of the time.
> > >
> > > So I just run the bug in a loop.
> >
> > echo 0 >scan_millisecs helps.
>
> BTW, here's my stack trace (I dropped back to 2.6.39 just to see if it
> happened to be recent regression). It looks like mm_slot is off the list:
>
> R10: dead000000200200 R11: dead000000100100
Yes it had to be use after free.
I cooked this patch, still untested but it builds. Will test it soon.
===
Subject: ksm: fix __ksm_exit vs ksm scan SMP race
From: Andrea Arcangeli <[email protected]>
If the KSM scan releases the ksm_mmlist_lock after the mm_users already dropped
to zero but before __ksm_exit had a chance runs, both the KSM scan and
__ksm_exit will free the slot. This fixes the SMP race condition by using
test_and_bit_set in __ksm_exit to see if __ksm_exit arrived before the KSM
scan or not.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
diff --git a/mm/ksm.c b/mm/ksm.c
index d708b3e..47ef4c1 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -645,10 +645,16 @@ static int unmerge_and_remove_all_rmap_items(void)
if (ksm_test_exit(mm)) {
hlist_del(&mm_slot->link);
list_del(&mm_slot->mm_list);
+ /*
+ * After releasing ksm_mmlist_lock __ksm_exit
+ * can run and we already changed mm_slot, so
+ * notify it with MMF_VM_MERGEABLE not to free
+ * this again.
+ */
+ clear_bit(MMF_VM_MERGEABLE, &mm->flags);
spin_unlock(&ksm_mmlist_lock);
free_mm_slot(mm_slot);
- clear_bit(MMF_VM_MERGEABLE, &mm->flags);
up_read(&mm->mmap_sem);
mmdrop(mm);
} else {
@@ -1377,10 +1383,15 @@ next_mm:
*/
hlist_del(&slot->link);
list_del(&slot->mm_list);
+ /*
+ * After releasing ksm_mmlist_lock __ksm_exit can run
+ * and we already changed mm_slot, so notify it with
+ * MMF_VM_MERGEABLE not to free this again.
+ */
+ clear_bit(MMF_VM_MERGEABLE, &mm->flags);
spin_unlock(&ksm_mmlist_lock);
free_mm_slot(slot);
- clear_bit(MMF_VM_MERGEABLE, &mm->flags);
up_read(&mm->mmap_sem);
mmdrop(mm);
} else {
@@ -1463,6 +1474,11 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
VM_NONLINEAR | VM_MIXEDMAP | VM_SAO))
return 0; /* just ignore the advice */
+ /*
+ * It should be safe to test_bit instead of
+ * test_and_bit_set because the madvise generic caller
+ * holds the mmap_sem write mode.
+ */
if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
err = __ksm_enter(mm);
if (err)
@@ -1511,6 +1527,10 @@ int __ksm_enter(struct mm_struct *mm)
list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
spin_unlock(&ksm_mmlist_lock);
+ /*
+ * It should be safe to set it outside ksm_mmlist_lock because
+ * we hold a mm_user pin on the mm so __ksm_exit can't run.
+ */
set_bit(MMF_VM_MERGEABLE, &mm->flags);
atomic_inc(&mm->mm_count);
@@ -1538,9 +1558,28 @@ void __ksm_exit(struct mm_struct *mm)
mm_slot = get_mm_slot(mm);
if (mm_slot && ksm_scan.mm_slot != mm_slot) {
if (!mm_slot->rmap_list) {
- hlist_del(&mm_slot->link);
- list_del(&mm_slot->mm_list);
- easy_to_free = 1;
+ /*
+ * If MMF_VM_MERGEABLE isn't set it was freed
+ * by the scan immediately after mm_count
+ * reached zero (visible by the scan) but
+ * before __ksm_exit() run, so we don't need
+ * to do anything here. We don't even need to
+ * wait for the KSM scan to release the
+ * mmap_sem as it's not working on the mm
+ * anymore but it's just releasing it, and it
+ * probably already did and dropped its
+ * mm_count too (it would however be safe to
+ * take mmap_sem here even if MMF_VM_MERGEABLE
+ * is already clear, as the actual mm can't be
+ * freed until we return and we run mmdrop
+ * too, but it's unnecessary).
+ */
+ if (test_and_clear_bit(MMF_VM_MERGEABLE, &mm->flags)) {
+ hlist_del(&mm_slot->link);
+ list_del(&mm_slot->mm_list);
+ easy_to_free = 1;
+ } else
+ mm_slot = NULL;
} else {
list_move(&mm_slot->mm_list,
&ksm_scan.mm_slot->mm_list);
@@ -1550,7 +1589,6 @@ void __ksm_exit(struct mm_struct *mm)
if (easy_to_free) {
free_mm_slot(mm_slot);
- clear_bit(MMF_VM_MERGEABLE, &mm->flags);
mmdrop(mm);
} else if (mm_slot) {
down_write(&mm->mmap_sem);
* Andrea Righi ([email protected]) wrote:
> mmh.. I can reproduce the bug also with the standard ubuntu (11.04)
> kernel. Could you post your .config?
Andrea (Righi), can you tell me if this WARN fires? This looks
like a pure race between removing from list and checking list, i.e.
insufficient locking.
ksm_scan.mm_slot == the only registered mm
CPU 1 (bug program) CPU 2 (ksmd)
list_empty() is false
lock
ksm_scan.mm_slot
list_del
unlock
slot == &ksm_mm_head (but list is now empty_)
diff --git a/mm/ksm.c b/mm/ksm.c
index 942dfc7..ab79a92 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1301,6 +1301,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
ksm_scan.mm_slot = slot;
spin_unlock(&ksm_mmlist_lock);
+ WARN_ON(slot == &ksm_mm_head);
next_mm:
ksm_scan.address = 0;
ksm_scan.rmap_list = &slot->rmap_list;
On Thu, 2 Jun 2011, Chris Wright wrote:
> * Andrea Righi ([email protected]) wrote:
> > mmh.. I can reproduce the bug also with the standard ubuntu (11.04)
> > kernel. Could you post your .config?
>
> Andrea (Righi), can you tell me if this WARN fires? This looks
> like a pure race between removing from list and checking list, i.e.
> insufficient locking.
>
> ksm_scan.mm_slot == the only registered mm
>
> CPU 1 (bug program) CPU 2 (ksmd)
> list_empty() is false
> lock
> ksm_scan.mm_slot
> list_del
> unlock
> slot == &ksm_mm_head (but list is now empty_)
>
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 942dfc7..ab79a92 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1301,6 +1301,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
> slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
> ksm_scan.mm_slot = slot;
> spin_unlock(&ksm_mmlist_lock);
> + WARN_ON(slot == &ksm_mm_head);
> next_mm:
> ksm_scan.address = 0;
> ksm_scan.rmap_list = &slot->rmap_list;
AndreaR, good find, many thanks for discovering and reporting it.
I couldn't look at it until last night, and even then, it was not
obvious to me exactly where my assumptions were going wrong.
Even now it's unclear what role the SIGSEGV plays, as opposed to an
normal exit: I guess it just happens to change the timing enough to
make the race dangerous.
Your patch was not wrong, but I do prefer a patch that plugs the
exact hole; and I needed to understand what was going on - without
understanding it, there was a danger we might leak memory instead.
AndreaA, I didn't study the patch you posted half an hour ago,
since by that time I'd worked it out and was preparing patch below.
I think your patch would be for a different bug, hopefully one we
don't have, it looks more complicated than we should need for this.
ChrisW, yes, your WARN_ON is spot on, matches what I saw exactly.
I'll fill in the patch description later, must dash now, probably
offline until late tonight. Or if you're satisfied and don't want to
wait, you guys fill that in and send off to Linus & Andrew - thanks.
[PATCH] ksm: fix easily reproduced NULL pointer dereference
Reported-by: Andrea Righi <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected]
---
mm/ksm.c | 7 +++++++
1 file changed, 7 insertions(+)
--- 3.0-rc1/mm/ksm.c 2011-05-29 18:42:37.429882601 -0700
+++ linux/mm/ksm.c 2011-06-02 09:55:31.729702490 -0700
@@ -1302,6 +1302,13 @@ static struct rmap_item *scan_get_next_r
slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
ksm_scan.mm_slot = slot;
spin_unlock(&ksm_mmlist_lock);
+
+ /*
+ * Although we tested list_empty() above, a racing __ksm_exit
+ * of the last mm on the list may have removed it since then.
+ */
+ if (slot == &ksm_mm_head)
+ return NULL;
next_mm:
ksm_scan.address = 0;
ksm_scan.rmap_list = &slot->rmap_list;
Andrea Righi reported a case where an exiting task can race against
ksmd.
ksm_scan.mm_slot == the only registered mm
CPU 1 (bug program) CPU 2 (ksmd)
list_empty() is false
lock
ksm_scan.mm_slot
list_del
unlock
slot == &ksm_mm_head (but list is now empty_)
Close this race by revalidating that the new slot is not simply the list
head again.
Reported-by: Andrea Righi <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Chris Wright <[email protected]>
---
mm/ksm.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 942dfc7..0373ce4 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1301,6 +1301,9 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
ksm_scan.mm_slot = slot;
spin_unlock(&ksm_mmlist_lock);
+ /* We raced against exit of last slot on the list */
+ if (slot == &ksm_mm_head)
+ return NULL;
next_mm:
ksm_scan.address = 0;
ksm_scan.rmap_list = &slot->rmap_list;
On Thu, Jun 02, 2011 at 10:29:39AM -0700, Hugh Dickins wrote:
> AndreaA, I didn't study the patch you posted half an hour ago,
> since by that time I'd worked it out and was preparing patch below.
> I think your patch would be for a different bug, hopefully one we
> don't have, it looks more complicated than we should need for this.
I didn't expect two different bugs leading to double free.
If you've time please review my other patch too because mmput runs
with no mmap_sem hold and I think the ksm scan code runs under the
assumption that __ksm_exit is waiting in down_write() when
ksm_mmlist_lock is released (before freeing the mm_slot), and that
assumption is wrong. ksm_test_exit may very well be true despite
__ksm_exit didn't run yet, and ksm scan will proceed freeing after
changing the mm_slot and ksm_exit will be free to run and free again
immediately after the ksm scan releases the ksm_mmlist_lock and before
it clears the MMF_VM_MERGEABLE (because the mm_slot has been changed
before releasing the ksm_mmlist_lock).
The rmap_list being null will kind of hide it, the fact there's so
little time between the unlock of the ksm_mmlist_lock and the clearing
of MMF_VM_MERGEABLE (that will stop ksm_exit from calling __ksm_exit
at all) will also hide it. At least in
unmerge_and_remove_all_rmap_items remove_trailing_rmap_items will nuke
the rmap_list just before this race runs so making it more likely
possible.
On Thu, Jun 02, 2011 at 09:48:41AM -0700, Chris Wright wrote:
> * Andrea Righi ([email protected]) wrote:
> > mmh.. I can reproduce the bug also with the standard ubuntu (11.04)
> > kernel. Could you post your .config?
>
> Andrea (Righi), can you tell me if this WARN fires? This looks
> like a pure race between removing from list and checking list, i.e.
> insufficient locking.
Yes, it does. With this patch:
[ 50.968896] WARNING: at mm/ksm.c:1305 ksm_scan_thread+0x9e3/0xe50()
>
> ksm_scan.mm_slot == the only registered mm
>
> CPU 1 (bug program) CPU 2 (ksmd)
> list_empty() is false
> lock
> ksm_scan.mm_slot
> list_del
> unlock
> slot == &ksm_mm_head (but list is now empty_)
It seems to be the exact problem.
-Andrea
>
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 942dfc7..ab79a92 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1301,6 +1301,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
> slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
> ksm_scan.mm_slot = slot;
> spin_unlock(&ksm_mmlist_lock);
> + WARN_ON(slot == &ksm_mm_head);
> next_mm:
> ksm_scan.address = 0;
> ksm_scan.rmap_list = &slot->rmap_list;
On Thu, Jun 02, 2011 at 10:35:49AM -0700, Chris Wright wrote:
> Andrea Righi reported a case where an exiting task can race against
> ksmd.
>
> ksm_scan.mm_slot == the only registered mm
> CPU 1 (bug program) CPU 2 (ksmd)
> list_empty() is false
> lock
> ksm_scan.mm_slot
> list_del
> unlock
> slot == &ksm_mm_head (but list is now empty_)
>
> Close this race by revalidating that the new slot is not simply the list
> head again.
I confirm this fixes the problem on my side.
Tested-by: Andrea Righi <[email protected]>
>
> Reported-by: Andrea Righi <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Signed-off-by: Chris Wright <[email protected]>
> ---
> mm/ksm.c | 3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 942dfc7..0373ce4 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1301,6 +1301,9 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
> slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
> ksm_scan.mm_slot = slot;
> spin_unlock(&ksm_mmlist_lock);
> + /* We raced against exit of last slot on the list */
> + if (slot == &ksm_mm_head)
> + return NULL;
> next_mm:
> ksm_scan.address = 0;
> ksm_scan.rmap_list = &slot->rmap_list;
On Thu, Jun 02, 2011 at 06:44:58PM +0200, Andrea Arcangeli wrote:
> On Thu, Jun 02, 2011 at 08:36:41AM -0700, Chris Wright wrote:
> > * Andrea Arcangeli ([email protected]) wrote:
> > > On Thu, Jun 02, 2011 at 07:31:43AM -0700, Chris Wright wrote:
> > > > * CAI Qian ([email protected]) wrote:
> > > > > madvise(0x2210000, 4096, 0xc /* MADV_??? */) = 0
> > > > > --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> > > >
> > > > Right, that's just what the program is trying to do, segfault.
> > > >
> > > > > +++ killed by SIGSEGV (core dumped) +++
> > > > > Segmentation fault (core dumped)
> > > > >
> > > > > Did I miss anything?
> > > >
> > > > I found it works but not 100% of the time.
> > > >
> > > > So I just run the bug in a loop.
> > >
> > > echo 0 >scan_millisecs helps.
> >
> > BTW, here's my stack trace (I dropped back to 2.6.39 just to see if it
> > happened to be recent regression). It looks like mm_slot is off the list:
> >
> > R10: dead000000200200 R11: dead000000100100
>
> Yes it had to be use after free.
>
> I cooked this patch, still untested but it builds. Will test it soon.
Hi Andrea,
I just tested this patch, but it doesn't seem to fix the problem, at
least not the one I reported. The same bug happens again.
Thanks,
-Andrea
>
> ===
> Subject: ksm: fix __ksm_exit vs ksm scan SMP race
>
> From: Andrea Arcangeli <[email protected]>
>
> If the KSM scan releases the ksm_mmlist_lock after the mm_users already dropped
> to zero but before __ksm_exit had a chance runs, both the KSM scan and
> __ksm_exit will free the slot. This fixes the SMP race condition by using
> test_and_bit_set in __ksm_exit to see if __ksm_exit arrived before the KSM
> scan or not.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index d708b3e..47ef4c1 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -645,10 +645,16 @@ static int unmerge_and_remove_all_rmap_items(void)
> if (ksm_test_exit(mm)) {
> hlist_del(&mm_slot->link);
> list_del(&mm_slot->mm_list);
> + /*
> + * After releasing ksm_mmlist_lock __ksm_exit
> + * can run and we already changed mm_slot, so
> + * notify it with MMF_VM_MERGEABLE not to free
> + * this again.
> + */
> + clear_bit(MMF_VM_MERGEABLE, &mm->flags);
> spin_unlock(&ksm_mmlist_lock);
>
> free_mm_slot(mm_slot);
> - clear_bit(MMF_VM_MERGEABLE, &mm->flags);
> up_read(&mm->mmap_sem);
> mmdrop(mm);
> } else {
> @@ -1377,10 +1383,15 @@ next_mm:
> */
> hlist_del(&slot->link);
> list_del(&slot->mm_list);
> + /*
> + * After releasing ksm_mmlist_lock __ksm_exit can run
> + * and we already changed mm_slot, so notify it with
> + * MMF_VM_MERGEABLE not to free this again.
> + */
> + clear_bit(MMF_VM_MERGEABLE, &mm->flags);
> spin_unlock(&ksm_mmlist_lock);
>
> free_mm_slot(slot);
> - clear_bit(MMF_VM_MERGEABLE, &mm->flags);
> up_read(&mm->mmap_sem);
> mmdrop(mm);
> } else {
> @@ -1463,6 +1474,11 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
> VM_NONLINEAR | VM_MIXEDMAP | VM_SAO))
> return 0; /* just ignore the advice */
>
> + /*
> + * It should be safe to test_bit instead of
> + * test_and_bit_set because the madvise generic caller
> + * holds the mmap_sem write mode.
> + */
> if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
> err = __ksm_enter(mm);
> if (err)
> @@ -1511,6 +1527,10 @@ int __ksm_enter(struct mm_struct *mm)
> list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
> spin_unlock(&ksm_mmlist_lock);
>
> + /*
> + * It should be safe to set it outside ksm_mmlist_lock because
> + * we hold a mm_user pin on the mm so __ksm_exit can't run.
> + */
> set_bit(MMF_VM_MERGEABLE, &mm->flags);
> atomic_inc(&mm->mm_count);
>
> @@ -1538,9 +1558,28 @@ void __ksm_exit(struct mm_struct *mm)
> mm_slot = get_mm_slot(mm);
> if (mm_slot && ksm_scan.mm_slot != mm_slot) {
> if (!mm_slot->rmap_list) {
> - hlist_del(&mm_slot->link);
> - list_del(&mm_slot->mm_list);
> - easy_to_free = 1;
> + /*
> + * If MMF_VM_MERGEABLE isn't set it was freed
> + * by the scan immediately after mm_count
> + * reached zero (visible by the scan) but
> + * before __ksm_exit() run, so we don't need
> + * to do anything here. We don't even need to
> + * wait for the KSM scan to release the
> + * mmap_sem as it's not working on the mm
> + * anymore but it's just releasing it, and it
> + * probably already did and dropped its
> + * mm_count too (it would however be safe to
> + * take mmap_sem here even if MMF_VM_MERGEABLE
> + * is already clear, as the actual mm can't be
> + * freed until we return and we run mmdrop
> + * too, but it's unnecessary).
> + */
> + if (test_and_clear_bit(MMF_VM_MERGEABLE, &mm->flags)) {
> + hlist_del(&mm_slot->link);
> + list_del(&mm_slot->mm_list);
> + easy_to_free = 1;
> + } else
> + mm_slot = NULL;
> } else {
> list_move(&mm_slot->mm_list,
> &ksm_scan.mm_slot->mm_list);
> @@ -1550,7 +1589,6 @@ void __ksm_exit(struct mm_struct *mm)
>
> if (easy_to_free) {
> free_mm_slot(mm_slot);
> - clear_bit(MMF_VM_MERGEABLE, &mm->flags);
> mmdrop(mm);
> } else if (mm_slot) {
> down_write(&mm->mmap_sem);
* Andrea Righi ([email protected]) wrote:
> On Thu, Jun 02, 2011 at 10:35:49AM -0700, Chris Wright wrote:
> > Andrea Righi reported a case where an exiting task can race against
> > ksmd.
> >
> > ksm_scan.mm_slot == the only registered mm
> > CPU 1 (bug program) CPU 2 (ksmd)
> > list_empty() is false
> > lock
> > ksm_scan.mm_slot
> > list_del
> > unlock
> > slot == &ksm_mm_head (but list is now empty_)
> >
> > Close this race by revalidating that the new slot is not simply the list
> > head again.
>
> I confirm this fixes the problem on my side.
>
> Tested-by: Andrea Righi <[email protected]>
Great, thanks for verifying.
thanks,
-chris
On Thu, Jun 02, 2011 at 10:15:01PM +0200, Andrea Righi wrote:
> I just tested this patch, but it doesn't seem to fix the problem, at
> least not the one I reported. The same bug happens again.
Yes I probably found another (not reproducible) bug that would lead to
a similar error, I'm waiting some other opinion on it.
----- Original Message -----
> On Thu, Jun 02, 2011 at 03:09:53AM -0400, CAI Qian wrote:
> > Hello,
> >
> > ----- Original Message -----
> > > I've just experienced this bug with ksmd:
> > >
> > > [ 55.837551] BUG: unable to handle kernel NULL pointer dereference
> > > at
> > > 00000000000000e8
> > > [ 55.837598] IP: [<ffffffff810bb9b2>] __lock_acquire+0x62/0x1d70
> > > [ 55.837630] PGD 0
> > > [ 55.837643] Oops: 0000 [#1] SMP
> > > [ 55.837663] CPU 2
> > > [ 55.837674] Modules linked in: snd_hda_codec_hdmi
> > > snd_hda_codec_conexant rtl8192ce rtl8192c_common rtlwifi mac80211
> > > usbhid hid cfg80211 snd_hda_intel snd_hda_codec psmouse snd_pcm
> > > e1000e
> > > thinkpad_acpi snd_timer snd_page_alloc snd soundcore nvram
> > > [ 55.837816]
> > > [ 55.837824] Pid: 33, comm: ksmd Not tainted 3.0.0-rc1+ #289
> > > LENOVO
> > > 4286CTO/4286CTO
> > > [ 55.837850] RIP: 0010:[<ffffffff810bb9b2>] [<ffffffff810bb9b2>]
> > > __lock_acquire+0x62/0x1d70
> > > [ 55.837878] RSP: 0018:ffff88023d3abc50 EFLAGS: 00010046
> > > [ 55.837894] RAX: 0000000000000046 RBX: 00000000000000e8 RCX:
> > > 0000000000000001
> > > [ 55.837915] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> > > 00000000000000e8
> > > [ 55.837936] RBP: ffff88023d3abd40 R08: 0000000000000002 R09:
> > > 0000000000000000
> > > [ 55.837957] R10: 0000000000000001 R11: 0000000000000000 R12:
> > > ffff88023d3a3e00
> > > [ 55.837978] R13: 0000000000000000 R14: 0000000000000002 R15:
> > > 0000000000000000
> > > [ 55.837999] FS: 0000000000000000(0000) GS:ffff88023e280000(0000)
> > > knlGS:0000000000000000
> > > [ 55.838022] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > [ 55.838039] CR2: 00000000000000e8 CR3: 00000000016f5000 CR4:
> > > 00000000000406e0
> > > [ 55.838060] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > 0000000000000000
> > > [ 55.838081] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> > > 0000000000000400
> > > [ 55.838102] Process ksmd (pid: 33, threadinfo ffff88023d3aa000,
> > > task
> > > ffff88023d3a3e00)
> > > [ 55.838131] Stack:
> > > [ 55.838140] ffff88023d3abce0 0000000000000000 ffffffff81d46810
> > > 00000000000012c7
> > > [ 55.838168] 000000000000037c ffff88023d3a3e00 0000000000000001
> > > 0000000000000000
> > > [ 55.838338] 0000000000000000 0000000000000000 00000000001ba37c
> > > ffffffff81a22000
> > > [ 55.838365] Call Trace:
> > > [ 55.838375] [<ffffffff810be55f>] ? mark_held_locks+0x6f/0xa0
> > > [ 55.838394] [<ffffffff814e3360>] ?
> > > _raw_spin_unlock_irqrestore+0x40/0x70
> > > [ 55.838416] [<ffffffff810bdc90>] lock_acquire+0x90/0x110
> > > [ 55.838434] [<ffffffff8114c652>] ? ksm_scan_thread+0x132/0xe20
> > > [ 55.838453] [<ffffffff8112df6c>] ? free_percpu+0x9c/0x130
> > > [ 55.838470] [<ffffffff814e1cbc>] down_read+0x4c/0x70
> > > [ 55.838486] [<ffffffff8114c652>] ? ksm_scan_thread+0x132/0xe20
> > > [ 55.838505] [<ffffffff814e33bb>] ? _raw_spin_unlock+0x2b/0x40
> > > [ 55.838523] [<ffffffff8114c652>] ksm_scan_thread+0x132/0xe20
> > > [ 55.838541] [<ffffffff814df822>] ? schedule+0x3b2/0x960
> > > [ 55.838559] [<ffffffff810a5690>] ? wake_up_bit+0x40/0x40
> > > [ 55.838576] [<ffffffff8114c520>] ? run_store+0x310/0x310
> > > [ 55.838593] [<ffffffff810a5186>] kthread+0x96/0xa0
> > > [ 55.838609] [<ffffffff814e5014>] kernel_thread_helper+0x4/0x10
> > > [ 55.838628] [<ffffffff814e3700>] ? retint_restore_args+0xe/0xe
> > > [ 55.838647] [<ffffffff810a50f0>] ?
> > > __init_kthread_worker+0x70/0x70
> > > [ 55.838666] [<ffffffff814e5010>] ? gs_change+0xb/0xb
> > > [ 55.838681] Code: b7 00 00 48 89 fb 85 c0 41 89 f5 45 0f 45 f0 8b
> > > 05
> > > 84 de 68 00 85 c0 0f 84 7b 09 00 00 8b 05 7a 49 7a 00 85 c0 0f 84
> > > c6
> > > 01 00 00
> > > [ 55.838780] 8b 03 ba 01 00 00 00 48 3d e0 3c 8c 81 44 0f 44 f2 41
> > > 83
> > > fd
> > > [ 55.838830] RIP [<ffffffff810bb9b2>] __lock_acquire+0x62/0x1d70
> > > [ 55.838850] RSP <ffff88023d3abc50>
> > > [ 55.839567] CR2: 00000000000000e8
> > > [ 55.895721] ---[ end trace eea0fa5dfa6846f1 ]---
> > >
> > > The bug can be easily reproduced using the following testcase:
> > >
> > > ========================
> > > #include <stdio.h>
> > > #include <stdlib.h>
> > > #include <unistd.h>
> > > #include <sys/mman.h>
> > >
> > > #define BUFSIZE getpagesize()
> > >
> > > int main(int argc, char **argv)
> > > {
> > > void *ptr;
> > >
> > > if (posix_memalign(&ptr, getpagesize(), BUFSIZE) < 0) {
> > > perror("posix_memalign");
> > > exit(1);
> > > }
> > > if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) < 0) {
> > > perror("madvise");
> > > exit(1);
> > > }
> > > *(char *)NULL = 0;
> > Hmm, the reproducer gave something else here but no panic.
> > $ strace ./test
> > execve("./test", ["./test"], [/* 26 vars */]) = 0
> > brk(0) = 0x220f000
> > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
> > -1, 0) = 0x7fd18ec0a000
> > access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or
> > directory)
> > open("/etc/ld.so.cache", O_RDONLY) = 3
> > fstat(3, {st_mode=S_IFREG|0644, st_size=41227, ...}) = 0
> > mmap(NULL, 41227, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fd18ebff000
> > close(3) = 0
> > open("/lib64/libc.so.6", O_RDONLY) = 3
> > read(3,
> > "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\355\341n<\0\0\0"...,
> > 832) = 832
> > fstat(3, {st_mode=S_IFREG|0755, st_size=1912928, ...}) = 0
> > mmap(0x3c6ee00000, 3737768, PROT_READ|PROT_EXEC,
> > MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3c6ee00000
> > mprotect(0x3c6ef87000, 2097152, PROT_NONE) = 0
> > mmap(0x3c6f187000, 20480, PROT_READ|PROT_WRITE,
> > MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x187000) = 0x3c6f187000
> > mmap(0x3c6f18c000, 18600, PROT_READ|PROT_WRITE,
> > MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3c6f18c000
> > close(3) = 0
> > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
> > -1, 0) = 0x7fd18ebfe000
> > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
> > -1, 0) = 0x7fd18ebfd000
> > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
> > -1, 0) = 0x7fd18ebfc000
> > arch_prctl(ARCH_SET_FS, 0x7fd18ebfd700) = 0
> > mprotect(0x3c6f187000, 16384, PROT_READ) = 0
> > mprotect(0x3c6e81f000, 4096, PROT_READ) = 0
> > munmap(0x7fd18ebff000, 41227) = 0
> > brk(0) = 0x220f000
> > brk(0x2232000) = 0x2232000
> > madvise(0x2210000, 4096, 0xc /* MADV_??? */) = 0
> > --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> > +++ killed by SIGSEGV (core dumped) +++
> > Segmentation fault (core dumped)
> >
> > Did I miss anything?
>
> mmh.. I can reproduce the bug also with the standard ubuntu (11.04)
> kernel. Could you post your .config?
http://people.redhat.com/qcai/config-3rc1
> Thanks,
> -Andrea
----- Original Message -----
> * CAI Qian ([email protected]) wrote:
> > madvise(0x2210000, 4096, 0xc /* MADV_??? */) = 0
> > --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>
> Right, that's just what the program is trying to do, segfault.
>
> > +++ killed by SIGSEGV (core dumped) +++
> > Segmentation fault (core dumped)
> >
> > Did I miss anything?
>
> I found it works but not 100% of the time.
>
> So I just run the bug in a loop.
Still no luck here.
# while :; do ./test ; done
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
...
I can't really see what different with the hardware
here. It is only a NUMA server system.
----- Original Message -----
> On Thu, Jun 02, 2011 at 07:31:43AM -0700, Chris Wright wrote:
> > * CAI Qian ([email protected]) wrote:
> > > madvise(0x2210000, 4096, 0xc /* MADV_??? */) = 0
> > > --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> >
> > Right, that's just what the program is trying to do, segfault.
> >
> > > +++ killed by SIGSEGV (core dumped) +++
> > > Segmentation fault (core dumped)
> > >
> > > Did I miss anything?
> >
> > I found it works but not 100% of the time.
> >
> > So I just run the bug in a loop.
>
> echo 0 >scan_millisecs helps.
Thanks. Indeed.
NULL pointer dereference at 0000000000000060
IP: [<ffffffff814d2659>] down_read+0x19/0x30
PGD 0
Oops: 0002 [#1] SMP
CPU 0
Modules linked in: autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 dm_mirror dm_region_hash dm_log cdc_ether usbnet mii microcode serio_raw pcspkr i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support sg shpchp ioatdma dca i7core_edac edac_core bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix mptsas mptscsih mptbase scsi_transport_sas dm_mod [last unloaded: scsi_wait_scan]
Pid: 103, comm: ksmd Not tainted 3.0.0-rc1+ #6 IBM System x3550 M3 -[7944I21]-/69Y4438
RIP: 0010:[<ffffffff814d2659>] [<ffffffff814d2659>] down_read+0x19/0x30
RSP: 0018:ffff880271307e00 EFLAGS: 00010246
RAX: 0000000000000060 RBX: 0000000000000060 RCX: 00000000000000db
RDX: 0000000000000000 RSI: 0000000000000282 RDI: 0000000000000060
RBP: ffff880271307e10 R08: 00000000000000df R09: ffff88026fcff400
R10: 0001b690ffffff90 R11: 0001b690ffffff90 R12: ffff880271307ea8
R13: 0000000000000050 R14: 0000000000000000 R15: ffffffff81a54960
FS: 0000000000000000(0000) GS:ffff88027fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000060 CR3: 0000000001a03000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ksmd (pid: 103, threadinfo ffff880271306000, task ffff8802712fc0c0)
Stack:
ffff880271307e10 ffff8802712fc0c0 ffff880271307e60 ffffffff81146e09
ffff8802712fc0c0 0000000000000060 ffff880271307e80 ffff8802712fc0c0
ffff880271307e80 ffff8802712fc0c0 ffff8802712fc0c0 0000000000000063
Call Trace:
[<ffffffff81146e09>] scan_get_next_rmap_item+0x59/0x400
[<ffffffff811476ce>] ksm_scan_thread+0xfe/0x2c0
[<ffffffff81084d50>] ? wake_up_bit+0x40/0x40
[<ffffffff811475d0>] ? cmp_and_merge_page+0x420/0x420
[<ffffffff810846d6>] kthread+0x96/0xa0
[<ffffffff814dc544>] kernel_thread_helper+0x4/0x10
[<ffffffff81084640>] ? kthread_worker_fn+0x1a0/0x1a0
[<ffffffff814dc540>] ? gs_change+0x13/0x13
Code: 9e cd d6 ff 48 83 c4 08 5b c9 c3 0f 1f 80 00 00 00 00 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 89 fb e8 0a ee ff ff 48 89 d8 <f0> 48 ff 00 79 05 e8 3c cd d6 ff 48 83 c4 08 5b c9 c3 00 00 00
RIP [<ffffffff814d2659>] down_read+0x19/0x30
RSP <ffff880271307e00>
CR2: 0000000000000060
---[ end trace a6feafc139ba5f85 ]---
On Thu, 2 Jun 2011, Chris Wright wrote:
> Andrea Righi reported a case where an exiting task can race against
> ksmd.
>
> ksm_scan.mm_slot == the only registered mm
> CPU 1 (bug program) CPU 2 (ksmd)
> list_empty() is false
> lock
> ksm_scan.mm_slot
> list_del
> unlock
> slot == &ksm_mm_head (but list is now empty_)
>
> Close this race by revalidating that the new slot is not simply the list
> head again.
Remarkably similar to my patch: it must be good!
But yours appears to be more popular - thanks, Chris.
>
> Reported-by: Andrea Righi <[email protected]>
> Cc: Hugh Dickins <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Signed-off-by: Chris Wright <[email protected]>
Cc: [email protected]
> ---
> mm/ksm.c | 3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 942dfc7..0373ce4 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1301,6 +1301,9 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
> slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
> ksm_scan.mm_slot = slot;
> spin_unlock(&ksm_mmlist_lock);
> + /* We raced against exit of last slot on the list */
> + if (slot == &ksm_mm_head)
> + return NULL;
> next_mm:
> ksm_scan.address = 0;
> ksm_scan.rmap_list = &slot->rmap_list;
On Thu, 2 Jun 2011, Andrea Arcangeli wrote:
> On Thu, Jun 02, 2011 at 10:29:39AM -0700, Hugh Dickins wrote:
> > AndreaA, I didn't study the patch you posted half an hour ago,
> > since by that time I'd worked it out and was preparing patch below.
> > I think your patch would be for a different bug, hopefully one we
> > don't have, it looks more complicated than we should need for this.
>
> I didn't expect two different bugs leading to double free.
There wasn't a double free there, just failure to cope with race
emptying the list, so accessing head when expecting a full entry.
>
> If you've time please review my other patch too because mmput runs
> with no mmap_sem hold and I think the ksm scan code runs under the
> assumption that __ksm_exit is waiting in down_write() when
> ksm_mmlist_lock is released (before freeing the mm_slot), and that
> assumption is wrong. ksm_test_exit may very well be true despite
> __ksm_exit didn't run yet, and ksm scan will proceed freeing after
> changing the mm_slot and ksm_exit will be free to run and free again
> immediately after the ksm scan releases the ksm_mmlist_lock and before
> it clears the MMF_VM_MERGEABLE (because the mm_slot has been changed
> before releasing the ksm_mmlist_lock).
>
> The rmap_list being null will kind of hide it, the fact there's so
> little time between the unlock of the ksm_mmlist_lock and the clearing
> of MMF_VM_MERGEABLE (that will stop ksm_exit from calling __ksm_exit
> at all) will also hide it. At least in
> unmerge_and_remove_all_rmap_items remove_trailing_rmap_items will nuke
> the rmap_list just before this race runs so making it more likely
> possible.
You'll see from the "beware" comment in scan_get_next_rmap_item()
that this case is expected, that it sometimes reaches freeing the
slots before the exiting task reaches __ksm_exit().
That race should already be handled. I believe your patch is unnecessary,
because get_mm_slot() is a hashlist lookup, and will return NULL once
either end has done the hlist_del(&mm_slot->link).
Hugh
On Fri, Jun 03, 2011 at 10:06:14AM -0700, Hugh Dickins wrote:
> On Thu, 2 Jun 2011, Andrea Arcangeli wrote:
> > On Thu, Jun 02, 2011 at 10:29:39AM -0700, Hugh Dickins wrote:
> > > AndreaA, I didn't study the patch you posted half an hour ago,
> > > since by that time I'd worked it out and was preparing patch below.
> > > I think your patch would be for a different bug, hopefully one we
> > > don't have, it looks more complicated than we should need for this.
> >
> > I didn't expect two different bugs leading to double free.
>
> There wasn't a double free there, just failure to cope with race
> emptying the list, so accessing head when expecting a full entry.
Yes, we thought it was a double free initially because of two dead
pointers but we couldn't explain why mm was null so consistently.
> You'll see from the "beware" comment in scan_get_next_rmap_item()
> that this case is expected, that it sometimes reaches freeing the
> slots before the exiting task reaches __ksm_exit().
>
> That race should already be handled. I believe your patch is unnecessary,
> because get_mm_slot() is a hashlist lookup, and will return NULL once
> either end has done the hlist_del(&mm_slot->link).
Ok so that case is handled by get_mm_slot not succeeding. I see thanks
for the review.
* Hugh Dickins ([email protected]) wrote:
> On Thu, 2 Jun 2011, Chris Wright wrote:
> > Close this race by revalidating that the new slot is not simply the list
> > head again.
>
> Remarkably similar to my patch: it must be good!
Indeed ;)
> But yours appears to be more popular - thanks, Chris.
But I like the comment in yours better...
Andrew, here's a refresh w/ the acks/tested-bys/etc + AndreaR's test case
in the changelog.
thanks,
-chris
--
Subject: [PATCH] ksm: fix NULL pointer dereference in scan_get_next_rmap_item
From: Hugh Dickins <[email protected]>
Andrea Righi reported a case where an exiting task can race against
ksmd::scan_get_next_rmap_item (http://lkml.org/lkml/2011/6/1/742)
easily triggering a NULL pointer dereference in ksmd.
ksm_scan.mm_slot == &ksm_mm_head with only one registered mm
CPU 1 (__ksm_exit) CPU 2 (scan_get_next_rmap_item)
list_empty() is false
lock slot == &ksm_mm_head
list_del(slot->mm_list)
(list now empty)
unlock
lock
slot = list_entry(slot->mm_list.next)
(list is empty, so slot is still ksm_mm_head)
unlock
slot->mm == NULL ... Oops
Close this race by revalidating that the new slot is not simply the list
head again.
Andrea's test case:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#define BUFSIZE getpagesize()
int main(int argc, char **argv)
{
void *ptr;
if (posix_memalign(&ptr, getpagesize(), BUFSIZE) < 0) {
perror("posix_memalign");
exit(1);
}
if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) < 0) {
perror("madvise");
exit(1);
}
*(char *)NULL = 0;
return 0;
}
Reported-by: Andrea Righi <[email protected]>
Tested-by: Andrea Righi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: [email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Chris Wright <[email protected]>
---
mm/ksm.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index d708b3e..9a68b0c 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1302,6 +1302,12 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
ksm_scan.mm_slot = slot;
spin_unlock(&ksm_mmlist_lock);
+ /*
+ * Although we tested list_empty() above, a racing __ksm_exit
+ * of the last mm on the list may have removed it since then.
+ */
+ if (slot == &ksm_mm_head)
+ return NULL;
next_mm:
ksm_scan.address = 0;
ksm_scan.rmap_list = &slot->rmap_list;