2008-03-18 01:53:34

by Steven Rostedt

[permalink] [raw]
Subject: Re: deadlock on 2.6.24.3-rt3

[Added Peter Zijlsta and LKML]

Hiroshi, thanks for looking into this!

Peter, this looks like a legit fix, could you Ack it.


On Mon, 17 Mar 2008, Hiroshi Shimamoto wrote:

> Hiroshi Shimamoto wrote:
> > Hi,
> >
> > I got a soft lockup message on 2.6.24.3-rt3.
> > I attached the .config.
> >
> > I think there is a deadlock scenario, I explain later.
> >
> > Here is the console log;
> > BUG: soft lockup - CPU#2 stuck for 11s! [bash:2175]
> > CPU 2:
> > Modules linked in:
> > Pid: 2175, comm: bash Not tainted 2.6.24.3-rt3 #1
> > RIP: 0010:[<ffffffff8063f052>] [<ffffffff8063f052>] __spin_lock+0x57/0x67
> > RSP: 0000:ffff8100c52a1d48 EFLAGS: 00000202
> > RAX: 0000000000000000 RBX: 0000000000004bc5 RCX: 0000000000004bc5
> > RDX: 0000000000000002 RSI: 00000000006c3208 RDI: 0000000000000001
> > RBP: 000000000000000d R08: ffff8100cbc28018 R09: ffff810007c95458
> > R10: 00000000006c3208 R11: 0000000000000246 R12: ffffffff808246e8
> > R13: 000284d000000002 R14: ffffffff80387277 R15: 00000000ffffffff
> > FS: 00002b28926a2ef0(0000) GS:ffff8100cf8a3940(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 00000000006c3208 CR3: 00000000c3cac000 CR4: 00000000000006e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Call Trace:
> > [<ffffffff8063f024>] __spin_lock+0x29/0x67
> > [<ffffffff80296597>] swap_info_get+0x65/0xdd
> > [<ffffffff80296c0c>] can_share_swap_page+0x39/0x84
> > [<ffffffff8028ae6e>] do_wp_page+0x2f9/0x519
> > [<ffffffff8028c8d2>] handle_mm_fault+0x615/0x7cf
> > [<ffffffff802db823>] proc_flush_task+0x171/0x29c
> > [<ffffffff8024a64d>] recalc_sigpending+0xe/0x3c
> > [<ffffffff8022645e>] do_page_fault+0x162/0x754
> > [<ffffffff8026fe81>] audit_syscall_exit+0x31c/0x37a
> > [<ffffffff8063f449>] error_exit+0x0/0x51
> > ---------------------------
> > | preempt count: 00010002 ]
> > | 2-level deep critical section nesting:
> > ----------------------------------------
> > .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
> > .....[<00000000>] .. ( <= 0x0)
> > .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
> > .....[<00000000>] .. ( <= 0x0)
> > BUG: soft lockup - CPU#3 stuck for 11s! [stress:9460]
> > CPU 3:
> > Modules linked in:
> > Pid: 9460, comm: stress Not tainted 2.6.24.3-rt3 #1
> > RIP: 0010:[<ffffffff8027c974>] [<ffffffff8027c974>] find_get_page+0xad/0xbe
> > RSP: 0018:ffff8100cbf25b88 EFLAGS: 00000202
> > 0000000000002009 RBX: ffffffff80824bc8 RCX: 0000000000000002
> > RDX: 0000000000000002 RSI: ffff8100cbfcf298 RDI: ffff810005ad8910
> > RBP: ffffffff80383a57 R08: ffff810005ad8918 R09: 0000000000000003
> > R10: ffff810005ad88d8 R11: 0000000000000001 R12: ffffffff80822880
> > R13: ffff81000799ce48 R14: ffffffff8028921c R15: ffffffff80822880
> > FS: 00002acaa373bb00(0000) GS:ffff8100cf8a32c0(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 00002b9b2827c530 CR3: 000000006cc90000 CR4: 00000000000006e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Call Trace:
> > [<ffffffff8027c8ea>] find_get_page+0x23/0xbe
> > [<ffffffff80296f83>] free_swap_and_cache+0x46/0xdd
> > [<ffffffff8028b9b7>] unmap_vmas+0x626/0x8ce
> > [<ffffffff8028fa4c>] exit_mmap+0xac/0x147
> > [<ffffffff8023ced7>] mmput+0x32/0xae
> > [<ffffffff80242f00>] do_exit+0x199/0x914
> > [<ffffffff8024ab3b>] __dequeue_signal+0x19/0x1b7
> > [<ffffffff802436a7>] do_group_exit+0x2c/0x7e
> > [<ffffffff8024c47b>] get_signal_to_deliver+0x2ef/0x4aa
> > [<ffffffff8020b5dc>] do_notify_resume+0xa8/0x7cd
> > [<ffffffff80239320>] add_preempt_count+0x14/0x111
> > [<ffffffff8038b292>] __up_read+0x13/0x8d
> > [<ffffffff80226483>] do_page_fault+0x187/0x754
> > [<ffffffff802335ae>] __dequeue_entity+0x2d/0x34
> > [<ffffffff8020a6d5>] __switch_to+0x27/0x2c9
> > [<ffffffff802264f0>] do_page_fault+0x1f4/0x754
> > [<ffffffff8020c7be>] retint_signal+0x3d/0x7f
> > ---------------------------
> > | preempt count: 00010005 ]
> > | 5-level deep critical section nesting:
> > ----------------------------------------
> > .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
> > .....[<00000000>] .. ( <= 0x0)
> > .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
> > .....[<00000000>] .. ( <= 0x0)
> > .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
> > .....[<00000000>] .. ( <= 0x0)
> > .. [<ffffffff8027c8db>] .... find_get_page+0x14/0xbe
> > .....[<00000000>] .. ( <= 0x0)
> > .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
> > .....[<00000000>] .. ( <= 0x0)
> >
> >
> > I also got a kernel core.
> > (gdb) info thr
> > 4 process 9460 0xffffffff8027c974 in find_get_page (mapping=<value optimized out>,
> > offset=18446744071570598016) at include/asm/processor_64.h:385
> > 3 process 2175 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
> > 2 process 9132 __spin_lock (lock=0xffffffff80893f80) at include/asm/spinlock_64.h:22
> > * 1 process 9478 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
> >
> > CPU3(thread 4) is in find_get_page(), and the others in __spin_lock().
> > The thread 4 is waiting to turn PG_nonewrefs bit off in wait_on_page_ref() which is
> > called from page_cache_get_speculative(), and the thread 4 holds the swap_lock.
> > The other threads waiting the swap_lock.
> > On the other hand, the thread 1 turned PG_nonewrefs bit on by calling
> > lock_page_ref_irq() in remove_mapping(), and then waiting the swap_lock.
> > So if the target page of remove_mapping() is in the exiting process memory,
> > the kernel is deadlock.
> >
> > (gdb) bt
> > #0 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
> > #1 0xffffffff80296597 in swap_info_get (entry=<value optimized out>)
> > at mm/swapfile.c:253
> > #2 0xffffffff80296618 in swap_free (entry={val = 1}) at mm/swapfile.c:300
> > #3 0xffffffff80286acd in remove_mapping (mapping=<value optimized out>,
> > page=0xffff810005ad8910) at mm/vmscan.c:423
> > ...
> >
> > (gdb) thr 2
> > (gdb) bt
> > #0 __spin_lock (lock=0xffffffff80893f80) at include/asm/spinlock_64.h:22
> > #1 0xffffffff80296374 in valid_swaphandles (entry=<value optimized out>,
> > offset=0xffff81001e22bc78) at mm/swapfile.c:1783
> > #2 0xffffffff8028b0af in swapin_readahead (entry={val = 1}, addr=0, vma=0x1)
> > at mm/memory.c:2054
> > #3 0xffffffff8029a6af in shmem_getpage (inode=0xffff8100cdf4fd48, idx=0,
> > pagep=0xffff81001e22bd80, sgp=SGP_FAULT, type=0xffff81001e22bd34) at mm/shmem.c:1089
> > ...
> >
> > (gdb) thr 3
> > (gdb) bt
> > #0 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
> > #1 0xffffffff80296597 in swap_info_get (entry=<value optimized out>)
> > at mm/swapfile.c:253
> > #2 0xffffffff80296c0c in can_share_swap_page (page=<value optimized out>)
> > at mm/swapfile.c:317
> > #3 0xffffffff8028ae6e in do_wp_page (mm=0xffff8100ce772f40, vma=0xffff8100cd212f00,
> > address=7090696, page_table=0xffff8100cbcef618, pmd=0xffff8100cbc28018,
> > ptl=0xffff810007c95458, orig_pte=<value optimized out>) at mm/memory.c:1606
> > ...
> >
> > (gdb) thr 4
> > (gdb) bt
> > #0 0xffffffff8027c974 in find_get_page (mapping=<value optimized out>,
> > offset=18446744071570598016) at include/asm/processor_64.h:385
> > #1 0xffffffff80296f83 in free_swap_and_cache (entry={val = 4032}) at mm/swapfile.c:403
> > #2 0xffffffff8028b9b7 in unmap_vmas (tlbp=0xffff8100cbf25cd8, vma=0xffff8100cde5c678,
> > start_addr=0, end_addr=18446744073709551615, nr_accounted=0xffff8100cbf25cd0,
> > details=0x0) at mm/memory.c:728
> > #3 0xffffffff8028fa4c in exit_mmap (mm=0xffff8100cd093600) at mm/mmap.c:2048
> > #4 0xffffffff8023ced7 in mmput (mm=0xffff8100cd093600) at kernel/fork.c:443
> > #5 0xffffffff80242f00 in do_exit (code=14) at kernel/exit.c:997
> > ...
> >
> >
> > I think it came from the lockless speculative get page patch.
> > I found the newer version of this patch in linux-mm.
> > http://marc.info/?l=linux-mm&m=119477111927364&w=2
> >
> > I haven't tested it because it looks big change and hard to apply.
> > But it seems to fix this deadlock issue.
> > Any other patch to fix this issue is welcome.
> >
>
> Is this patch good?
>
> ---
> From: Hiroshi Shimamoto <[email protected]>
> Subject: [PATCH] avoid deadlock related with PG_nonewrefs and swap_lock
>
> There is a deadlock scenario; remove_mapping() vs free_swap_and_cache().
> remove_mapping() turns PG_nonewrefs bit on, then locks swap_lock.
> free_swap_and_cache() locks swap_lock, then wait to turn PG_nonewrefs bit
> off in find_get_page().
>
> swap_lock can be unlocked before calling find_get_page().
>
> Signed-off-by: Hiroshi Shimamoto <[email protected]>
> ---
> mm/swapfile.c | 5 +++--
> 1 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 5036b70..581afee 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -400,13 +400,14 @@ void free_swap_and_cache(swp_entry_t entry)
> p = swap_info_get(entry);
> if (p) {
> if (swap_entry_free(p, swp_offset(entry)) == 1) {
> + spin_unlock(&swap_lock);
> page = find_get_page(&swapper_space, entry.val);
> if (page && unlikely(TestSetPageLocked(page))) {
> page_cache_release(page);
> page = NULL;
> }
> - }
> - spin_unlock(&swap_lock);
> + } else
> + spin_unlock(&swap_lock);
> }
> if (page) {
> int one_user;
> --
> 1.5.4.1
>


2008-03-18 09:40:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: deadlock on 2.6.24.3-rt3

On Mon, 2008-03-17 at 21:53 -0400, Steven Rostedt wrote:
> [Added Peter Zijlsta and LKML]

Yeah, patches and such should really go to LKML as LKML is the -rt
development list.

> Hiroshi, thanks for looking into this!
>
> Peter, this looks like a legit fix, could you Ack it.
>
>
> On Mon, 17 Mar 2008, Hiroshi Shimamoto wrote:
>
> > Hiroshi Shimamoto wrote:
> > > Hi,
> > >
> > > I got a soft lockup message on 2.6.24.3-rt3.
> > > I attached the .config.
> > >
> > > I think there is a deadlock scenario, I explain later.
> > >
> > > Here is the console log;
> > > BUG: soft lockup - CPU#2 stuck for 11s! [bash:2175]
> > > CPU 2:
> > > Modules linked in:
> > > Pid: 2175, comm: bash Not tainted 2.6.24.3-rt3 #1
> > > RIP: 0010:[<ffffffff8063f052>] [<ffffffff8063f052>] __spin_lock+0x57/0x67
> > > RSP: 0000:ffff8100c52a1d48 EFLAGS: 00000202
> > > RAX: 0000000000000000 RBX: 0000000000004bc5 RCX: 0000000000004bc5
> > > RDX: 0000000000000002 RSI: 00000000006c3208 RDI: 0000000000000001
> > > RBP: 000000000000000d R08: ffff8100cbc28018 R09: ffff810007c95458
> > > R10: 00000000006c3208 R11: 0000000000000246 R12: ffffffff808246e8
> > > R13: 000284d000000002 R14: ffffffff80387277 R15: 00000000ffffffff
> > > FS: 00002b28926a2ef0(0000) GS:ffff8100cf8a3940(0000) knlGS:0000000000000000
> > > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > CR2: 00000000006c3208 CR3: 00000000c3cac000 CR4: 00000000000006e0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > Call Trace:
> > > [<ffffffff8063f024>] __spin_lock+0x29/0x67
> > > [<ffffffff80296597>] swap_info_get+0x65/0xdd
> > > [<ffffffff80296c0c>] can_share_swap_page+0x39/0x84
> > > [<ffffffff8028ae6e>] do_wp_page+0x2f9/0x519
> > > [<ffffffff8028c8d2>] handle_mm_fault+0x615/0x7cf
> > > [<ffffffff802db823>] proc_flush_task+0x171/0x29c
> > > [<ffffffff8024a64d>] recalc_sigpending+0xe/0x3c
> > > [<ffffffff8022645e>] do_page_fault+0x162/0x754
> > > [<ffffffff8026fe81>] audit_syscall_exit+0x31c/0x37a
> > > [<ffffffff8063f449>] error_exit+0x0/0x51
> > > ---------------------------
> > > | preempt count: 00010002 ]
> > > | 2-level deep critical section nesting:
> > > ----------------------------------------
> > > .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
> > > .....[<00000000>] .. ( <= 0x0)
> > > .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
> > > .....[<00000000>] .. ( <= 0x0)
> > > BUG: soft lockup - CPU#3 stuck for 11s! [stress:9460]
> > > CPU 3:
> > > Modules linked in:
> > > Pid: 9460, comm: stress Not tainted 2.6.24.3-rt3 #1
> > > RIP: 0010:[<ffffffff8027c974>] [<ffffffff8027c974>] find_get_page+0xad/0xbe
> > > RSP: 0018:ffff8100cbf25b88 EFLAGS: 00000202
> > > 0000000000002009 RBX: ffffffff80824bc8 RCX: 0000000000000002
> > > RDX: 0000000000000002 RSI: ffff8100cbfcf298 RDI: ffff810005ad8910
> > > RBP: ffffffff80383a57 R08: ffff810005ad8918 R09: 0000000000000003
> > > R10: ffff810005ad88d8 R11: 0000000000000001 R12: ffffffff80822880
> > > R13: ffff81000799ce48 R14: ffffffff8028921c R15: ffffffff80822880
> > > FS: 00002acaa373bb00(0000) GS:ffff8100cf8a32c0(0000) knlGS:0000000000000000
> > > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > CR2: 00002b9b2827c530 CR3: 000000006cc90000 CR4: 00000000000006e0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > Call Trace:
> > > [<ffffffff8027c8ea>] find_get_page+0x23/0xbe
> > > [<ffffffff80296f83>] free_swap_and_cache+0x46/0xdd
> > > [<ffffffff8028b9b7>] unmap_vmas+0x626/0x8ce
> > > [<ffffffff8028fa4c>] exit_mmap+0xac/0x147
> > > [<ffffffff8023ced7>] mmput+0x32/0xae
> > > [<ffffffff80242f00>] do_exit+0x199/0x914
> > > [<ffffffff8024ab3b>] __dequeue_signal+0x19/0x1b7
> > > [<ffffffff802436a7>] do_group_exit+0x2c/0x7e
> > > [<ffffffff8024c47b>] get_signal_to_deliver+0x2ef/0x4aa
> > > [<ffffffff8020b5dc>] do_notify_resume+0xa8/0x7cd
> > > [<ffffffff80239320>] add_preempt_count+0x14/0x111
> > > [<ffffffff8038b292>] __up_read+0x13/0x8d
> > > [<ffffffff80226483>] do_page_fault+0x187/0x754
> > > [<ffffffff802335ae>] __dequeue_entity+0x2d/0x34
> > > [<ffffffff8020a6d5>] __switch_to+0x27/0x2c9
> > > [<ffffffff802264f0>] do_page_fault+0x1f4/0x754
> > > [<ffffffff8020c7be>] retint_signal+0x3d/0x7f
> > > ---------------------------
> > > | preempt count: 00010005 ]
> > > | 5-level deep critical section nesting:
> > > ----------------------------------------
> > > .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
> > > .....[<00000000>] .. ( <= 0x0)
> > > .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
> > > .....[<00000000>] .. ( <= 0x0)
> > > .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
> > > .....[<00000000>] .. ( <= 0x0)
> > > .. [<ffffffff8027c8db>] .... find_get_page+0x14/0xbe
> > > .....[<00000000>] .. ( <= 0x0)
> > > .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
> > > .....[<00000000>] .. ( <= 0x0)
> > >
> > >
> > > I also got a kernel core.
> > > (gdb) info thr
> > > 4 process 9460 0xffffffff8027c974 in find_get_page (mapping=<value optimized out>,
> > > offset=18446744071570598016) at include/asm/processor_64.h:385
> > > 3 process 2175 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
> > > 2 process 9132 __spin_lock (lock=0xffffffff80893f80) at include/asm/spinlock_64.h:22
> > > * 1 process 9478 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
> > >
> > > CPU3(thread 4) is in find_get_page(), and the others in __spin_lock().
> > > The thread 4 is waiting to turn PG_nonewrefs bit off in wait_on_page_ref() which is
> > > called from page_cache_get_speculative(), and the thread 4 holds the swap_lock.
> > > The other threads waiting the swap_lock.
> > > On the other hand, the thread 1 turned PG_nonewrefs bit on by calling
> > > lock_page_ref_irq() in remove_mapping(), and then waiting the swap_lock.
> > > So if the target page of remove_mapping() is in the exiting process memory,
> > > the kernel is deadlock.
> > >
> > > (gdb) bt
> > > #0 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
> > > #1 0xffffffff80296597 in swap_info_get (entry=<value optimized out>)
> > > at mm/swapfile.c:253
> > > #2 0xffffffff80296618 in swap_free (entry={val = 1}) at mm/swapfile.c:300
> > > #3 0xffffffff80286acd in remove_mapping (mapping=<value optimized out>,
> > > page=0xffff810005ad8910) at mm/vmscan.c:423
> > > ...
> > >
> > > (gdb) thr 2
> > > (gdb) bt
> > > #0 __spin_lock (lock=0xffffffff80893f80) at include/asm/spinlock_64.h:22
> > > #1 0xffffffff80296374 in valid_swaphandles (entry=<value optimized out>,
> > > offset=0xffff81001e22bc78) at mm/swapfile.c:1783
> > > #2 0xffffffff8028b0af in swapin_readahead (entry={val = 1}, addr=0, vma=0x1)
> > > at mm/memory.c:2054
> > > #3 0xffffffff8029a6af in shmem_getpage (inode=0xffff8100cdf4fd48, idx=0,
> > > pagep=0xffff81001e22bd80, sgp=SGP_FAULT, type=0xffff81001e22bd34) at mm/shmem.c:1089
> > > ...
> > >
> > > (gdb) thr 3
> > > (gdb) bt
> > > #0 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
> > > #1 0xffffffff80296597 in swap_info_get (entry=<value optimized out>)
> > > at mm/swapfile.c:253
> > > #2 0xffffffff80296c0c in can_share_swap_page (page=<value optimized out>)
> > > at mm/swapfile.c:317
> > > #3 0xffffffff8028ae6e in do_wp_page (mm=0xffff8100ce772f40, vma=0xffff8100cd212f00,
> > > address=7090696, page_table=0xffff8100cbcef618, pmd=0xffff8100cbc28018,
> > > ptl=0xffff810007c95458, orig_pte=<value optimized out>) at mm/memory.c:1606
> > > ...
> > >
> > > (gdb) thr 4
> > > (gdb) bt
> > > #0 0xffffffff8027c974 in find_get_page (mapping=<value optimized out>,
> > > offset=18446744071570598016) at include/asm/processor_64.h:385
> > > #1 0xffffffff80296f83 in free_swap_and_cache (entry={val = 4032}) at mm/swapfile.c:403
> > > #2 0xffffffff8028b9b7 in unmap_vmas (tlbp=0xffff8100cbf25cd8, vma=0xffff8100cde5c678,
> > > start_addr=0, end_addr=18446744073709551615, nr_accounted=0xffff8100cbf25cd0,
> > > details=0x0) at mm/memory.c:728
> > > #3 0xffffffff8028fa4c in exit_mmap (mm=0xffff8100cd093600) at mm/mmap.c:2048
> > > #4 0xffffffff8023ced7 in mmput (mm=0xffff8100cd093600) at kernel/fork.c:443
> > > #5 0xffffffff80242f00 in do_exit (code=14) at kernel/exit.c:997
> > > ...
> > >
> > >
> > > I think it came from the lockless speculative get page patch.
> > > I found the newer version of this patch in linux-mm.
> > > http://marc.info/?l=linux-mm&m=119477111927364&w=2
> > >
> > > I haven't tested it because it looks big change and hard to apply.
> > > But it seems to fix this deadlock issue.
> > > Any other patch to fix this issue is welcome.

Yeah, in the latest lockless pagecache patches Nick got rid of
PG_nonewrefs as suggested by Hugh, however -rt also has my concurrent
pagecache patches and those need PG_nonewrefs on their own, so I hadn't
bothered to update to Nick's latest.

Perhaps I ought to, as you point out, page_cache_get_speculative() is
cleaner in his latest.. /me puts it on his overlong TODO list.

> > Is this patch good?

Yes, it does look good. Doesn't remove_exclusice_swap_page() also nest
PG_nonewrefs inside of swap_lock?

Looks to me that also needs fixing..

> > ---
> > From: Hiroshi Shimamoto <[email protected]>
> > Subject: [PATCH] avoid deadlock related with PG_nonewrefs and swap_lock
> >
> > There is a deadlock scenario; remove_mapping() vs free_swap_and_cache().
> > remove_mapping() turns PG_nonewrefs bit on, then locks swap_lock.
> > free_swap_and_cache() locks swap_lock, then wait to turn PG_nonewrefs bit
> > off in find_get_page().
> >
> > swap_lock can be unlocked before calling find_get_page().
> >
> > Signed-off-by: Hiroshi Shimamoto <[email protected]>
> > ---
> > mm/swapfile.c | 5 +++--
> > 1 files changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 5036b70..581afee 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -400,13 +400,14 @@ void free_swap_and_cache(swp_entry_t entry)
> > p = swap_info_get(entry);
> > if (p) {
> > if (swap_entry_free(p, swp_offset(entry)) == 1) {
> > + spin_unlock(&swap_lock);
> > page = find_get_page(&swapper_space, entry.val);
> > if (page && unlikely(TestSetPageLocked(page))) {
> > page_cache_release(page);
> > page = NULL;
> > }
> > - }
> > - spin_unlock(&swap_lock);
> > + } else
> > + spin_unlock(&swap_lock);
> > }
> > if (page) {
> > int one_user;
> > --
> > 1.5.4.1
> >

2008-03-19 20:39:57

by Hiroshi Shimamoto

[permalink] [raw]
Subject: Re: deadlock on 2.6.24.3-rt3

Peter Zijlstra wrote:
> On Mon, 2008-03-17 at 21:53 -0400, Steven Rostedt wrote:
>> [Added Peter Zijlsta and LKML]
>
> Yeah, patches and such should really go to LKML as LKML is the -rt
> development list.

Sorry, I missed to add LKML. It's written in RT wiki.
I'm not sure about BUG report too. Should I send it to LKML?

>
>> Hiroshi, thanks for looking into this!
>>
>> Peter, this looks like a legit fix, could you Ack it.
>>
>>
>> On Mon, 17 Mar 2008, Hiroshi Shimamoto wrote:
>>
>>> Hiroshi Shimamoto wrote:
>>>> Hi,
>>>>
>>>> I got a soft lockup message on 2.6.24.3-rt3.
>>>> I attached the .config.
>>>>
>>>> I think there is a deadlock scenario, I explain later.
>>>>
>>>> Here is the console log;
>>>> BUG: soft lockup - CPU#2 stuck for 11s! [bash:2175]
>>>> CPU 2:
>>>> Modules linked in:
>>>> Pid: 2175, comm: bash Not tainted 2.6.24.3-rt3 #1
>>>> RIP: 0010:[<ffffffff8063f052>] [<ffffffff8063f052>] __spin_lock+0x57/0x67
>>>> RSP: 0000:ffff8100c52a1d48 EFLAGS: 00000202
>>>> RAX: 0000000000000000 RBX: 0000000000004bc5 RCX: 0000000000004bc5
>>>> RDX: 0000000000000002 RSI: 00000000006c3208 RDI: 0000000000000001
>>>> RBP: 000000000000000d R08: ffff8100cbc28018 R09: ffff810007c95458
>>>> R10: 00000000006c3208 R11: 0000000000000246 R12: ffffffff808246e8
>>>> R13: 000284d000000002 R14: ffffffff80387277 R15: 00000000ffffffff
>>>> FS: 00002b28926a2ef0(0000) GS:ffff8100cf8a3940(0000) knlGS:0000000000000000
>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>> CR2: 00000000006c3208 CR3: 00000000c3cac000 CR4: 00000000000006e0
>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>>> Call Trace:
>>>> [<ffffffff8063f024>] __spin_lock+0x29/0x67
>>>> [<ffffffff80296597>] swap_info_get+0x65/0xdd
>>>> [<ffffffff80296c0c>] can_share_swap_page+0x39/0x84
>>>> [<ffffffff8028ae6e>] do_wp_page+0x2f9/0x519
>>>> [<ffffffff8028c8d2>] handle_mm_fault+0x615/0x7cf
>>>> [<ffffffff802db823>] proc_flush_task+0x171/0x29c
>>>> [<ffffffff8024a64d>] recalc_sigpending+0xe/0x3c
>>>> [<ffffffff8022645e>] do_page_fault+0x162/0x754
>>>> [<ffffffff8026fe81>] audit_syscall_exit+0x31c/0x37a
>>>> [<ffffffff8063f449>] error_exit+0x0/0x51
>>>> ---------------------------
>>>> | preempt count: 00010002 ]
>>>> | 2-level deep critical section nesting:
>>>> ----------------------------------------
>>>> .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
>>>> .....[<00000000>] .. ( <= 0x0)
>>>> .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
>>>> .....[<00000000>] .. ( <= 0x0)
>>>> BUG: soft lockup - CPU#3 stuck for 11s! [stress:9460]
>>>> CPU 3:
>>>> Modules linked in:
>>>> Pid: 9460, comm: stress Not tainted 2.6.24.3-rt3 #1
>>>> RIP: 0010:[<ffffffff8027c974>] [<ffffffff8027c974>] find_get_page+0xad/0xbe
>>>> RSP: 0018:ffff8100cbf25b88 EFLAGS: 00000202
>>>> 0000000000002009 RBX: ffffffff80824bc8 RCX: 0000000000000002
>>>> RDX: 0000000000000002 RSI: ffff8100cbfcf298 RDI: ffff810005ad8910
>>>> RBP: ffffffff80383a57 R08: ffff810005ad8918 R09: 0000000000000003
>>>> R10: ffff810005ad88d8 R11: 0000000000000001 R12: ffffffff80822880
>>>> R13: ffff81000799ce48 R14: ffffffff8028921c R15: ffffffff80822880
>>>> FS: 00002acaa373bb00(0000) GS:ffff8100cf8a32c0(0000) knlGS:0000000000000000
>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>> CR2: 00002b9b2827c530 CR3: 000000006cc90000 CR4: 00000000000006e0
>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>>> Call Trace:
>>>> [<ffffffff8027c8ea>] find_get_page+0x23/0xbe
>>>> [<ffffffff80296f83>] free_swap_and_cache+0x46/0xdd
>>>> [<ffffffff8028b9b7>] unmap_vmas+0x626/0x8ce
>>>> [<ffffffff8028fa4c>] exit_mmap+0xac/0x147
>>>> [<ffffffff8023ced7>] mmput+0x32/0xae
>>>> [<ffffffff80242f00>] do_exit+0x199/0x914
>>>> [<ffffffff8024ab3b>] __dequeue_signal+0x19/0x1b7
>>>> [<ffffffff802436a7>] do_group_exit+0x2c/0x7e
>>>> [<ffffffff8024c47b>] get_signal_to_deliver+0x2ef/0x4aa
>>>> [<ffffffff8020b5dc>] do_notify_resume+0xa8/0x7cd
>>>> [<ffffffff80239320>] add_preempt_count+0x14/0x111
>>>> [<ffffffff8038b292>] __up_read+0x13/0x8d
>>>> [<ffffffff80226483>] do_page_fault+0x187/0x754
>>>> [<ffffffff802335ae>] __dequeue_entity+0x2d/0x34
>>>> [<ffffffff8020a6d5>] __switch_to+0x27/0x2c9
>>>> [<ffffffff802264f0>] do_page_fault+0x1f4/0x754
>>>> [<ffffffff8020c7be>] retint_signal+0x3d/0x7f
>>>> ---------------------------
>>>> | preempt count: 00010005 ]
>>>> | 5-level deep critical section nesting:
>>>> ----------------------------------------
>>>> .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
>>>> .....[<00000000>] .. ( <= 0x0)
>>>> .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
>>>> .....[<00000000>] .. ( <= 0x0)
>>>> .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
>>>> .....[<00000000>] .. ( <= 0x0)
>>>> .. [<ffffffff8027c8db>] .... find_get_page+0x14/0xbe
>>>> .....[<00000000>] .. ( <= 0x0)
>>>> .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
>>>> .....[<00000000>] .. ( <= 0x0)
>>>>
>>>>
>>>> I also got a kernel core.
>>>> (gdb) info thr
>>>> 4 process 9460 0xffffffff8027c974 in find_get_page (mapping=<value optimized out>,
>>>> offset=18446744071570598016) at include/asm/processor_64.h:385
>>>> 3 process 2175 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
>>>> 2 process 9132 __spin_lock (lock=0xffffffff80893f80) at include/asm/spinlock_64.h:22
>>>> * 1 process 9478 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
>>>>
>>>> CPU3(thread 4) is in find_get_page(), and the others in __spin_lock().
>>>> The thread 4 is waiting to turn PG_nonewrefs bit off in wait_on_page_ref() which is
>>>> called from page_cache_get_speculative(), and the thread 4 holds the swap_lock.
>>>> The other threads waiting the swap_lock.
>>>> On the other hand, the thread 1 turned PG_nonewrefs bit on by calling
>>>> lock_page_ref_irq() in remove_mapping(), and then waiting the swap_lock.
>>>> So if the target page of remove_mapping() is in the exiting process memory,
>>>> the kernel is deadlock.
>>>>
>>>> (gdb) bt
>>>> #0 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
>>>> #1 0xffffffff80296597 in swap_info_get (entry=<value optimized out>)
>>>> at mm/swapfile.c:253
>>>> #2 0xffffffff80296618 in swap_free (entry={val = 1}) at mm/swapfile.c:300
>>>> #3 0xffffffff80286acd in remove_mapping (mapping=<value optimized out>,
>>>> page=0xffff810005ad8910) at mm/vmscan.c:423
>>>> ...
>>>>
>>>> (gdb) thr 2
>>>> (gdb) bt
>>>> #0 __spin_lock (lock=0xffffffff80893f80) at include/asm/spinlock_64.h:22
>>>> #1 0xffffffff80296374 in valid_swaphandles (entry=<value optimized out>,
>>>> offset=0xffff81001e22bc78) at mm/swapfile.c:1783
>>>> #2 0xffffffff8028b0af in swapin_readahead (entry={val = 1}, addr=0, vma=0x1)
>>>> at mm/memory.c:2054
>>>> #3 0xffffffff8029a6af in shmem_getpage (inode=0xffff8100cdf4fd48, idx=0,
>>>> pagep=0xffff81001e22bd80, sgp=SGP_FAULT, type=0xffff81001e22bd34) at mm/shmem.c:1089
>>>> ...
>>>>
>>>> (gdb) thr 3
>>>> (gdb) bt
>>>> #0 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
>>>> #1 0xffffffff80296597 in swap_info_get (entry=<value optimized out>)
>>>> at mm/swapfile.c:253
>>>> #2 0xffffffff80296c0c in can_share_swap_page (page=<value optimized out>)
>>>> at mm/swapfile.c:317
>>>> #3 0xffffffff8028ae6e in do_wp_page (mm=0xffff8100ce772f40, vma=0xffff8100cd212f00,
>>>> address=7090696, page_table=0xffff8100cbcef618, pmd=0xffff8100cbc28018,
>>>> ptl=0xffff810007c95458, orig_pte=<value optimized out>) at mm/memory.c:1606
>>>> ...
>>>>
>>>> (gdb) thr 4
>>>> (gdb) bt
>>>> #0 0xffffffff8027c974 in find_get_page (mapping=<value optimized out>,
>>>> offset=18446744071570598016) at include/asm/processor_64.h:385
>>>> #1 0xffffffff80296f83 in free_swap_and_cache (entry={val = 4032}) at mm/swapfile.c:403
>>>> #2 0xffffffff8028b9b7 in unmap_vmas (tlbp=0xffff8100cbf25cd8, vma=0xffff8100cde5c678,
>>>> start_addr=0, end_addr=18446744073709551615, nr_accounted=0xffff8100cbf25cd0,
>>>> details=0x0) at mm/memory.c:728
>>>> #3 0xffffffff8028fa4c in exit_mmap (mm=0xffff8100cd093600) at mm/mmap.c:2048
>>>> #4 0xffffffff8023ced7 in mmput (mm=0xffff8100cd093600) at kernel/fork.c:443
>>>> #5 0xffffffff80242f00 in do_exit (code=14) at kernel/exit.c:997
>>>> ...
>>>>
>>>>
>>>> I think it came from the lockless speculative get page patch.
>>>> I found the newer version of this patch in linux-mm.
>>>> http://marc.info/?l=linux-mm&m=119477111927364&w=2
>>>>
>>>> I haven't tested it because it looks big change and hard to apply.
>>>> But it seems to fix this deadlock issue.
>>>> Any other patch to fix this issue is welcome.
>
> Yeah, in the latest lockless pagecache patches Nick got rid of
> PG_nonewrefs as suggested by Hugh, however -rt also has my concurrent
> pagecache patches and those need PG_nonewrefs on their own, so I hadn't
> bothered to update to Nick's latest.
>
> Perhaps I ought to, as you point out, page_cache_get_speculative() is
> cleaner in his latest.. /me puts it on his overlong TODO list.
>
>>> Is this patch good?
>
> Yes, it does look good. Doesn't remove_exclusice_swap_page() also nest
> PG_nonewrefs inside of swap_lock?

I'm not sure about it. I haven't noticed it and I haven't checked all PG_nonewrefs.
Will look into it.

By the way, unfortunately, I got another BUG message w/ my patch... I'll report it.
The new issue is related with SLAB.
kernel BUG at mm/slab.c:3125!

thanks,
Hiroshi Shimamoto

2008-03-20 20:35:41

by Hiroshi Shimamoto

[permalink] [raw]
Subject: Re: deadlock on 2.6.24.3-rt3

Hiroshi Shimamoto wrote:
> Peter Zijlstra wrote:
>> On Mon, 2008-03-17 at 21:53 -0400, Steven Rostedt wrote:
>>> [Added Peter Zijlsta and LKML]
>> Yeah, patches and such should really go to LKML as LKML is the -rt
>> development list.
>
> Sorry, I missed to add LKML. It's written in RT wiki.
> I'm not sure about BUG report too. Should I send it to LKML?
>
>>> Hiroshi, thanks for looking into this!
>>>
>>> Peter, this looks like a legit fix, could you Ack it.
>>>
>>>
>>> On Mon, 17 Mar 2008, Hiroshi Shimamoto wrote:
>>>
>>>> Hiroshi Shimamoto wrote:
>>>>> Hi,
>>>>>
>>>>> I got a soft lockup message on 2.6.24.3-rt3.
>>>>> I attached the .config.
>>>>>
>>>>> I think there is a deadlock scenario, I explain later.
>>>>>
>>>>> Here is the console log;
>>>>> BUG: soft lockup - CPU#2 stuck for 11s! [bash:2175]
>>>>> CPU 2:
>>>>> Modules linked in:
>>>>> Pid: 2175, comm: bash Not tainted 2.6.24.3-rt3 #1
>>>>> RIP: 0010:[<ffffffff8063f052>] [<ffffffff8063f052>] __spin_lock+0x57/0x67
>>>>> RSP: 0000:ffff8100c52a1d48 EFLAGS: 00000202
>>>>> RAX: 0000000000000000 RBX: 0000000000004bc5 RCX: 0000000000004bc5
>>>>> RDX: 0000000000000002 RSI: 00000000006c3208 RDI: 0000000000000001
>>>>> RBP: 000000000000000d R08: ffff8100cbc28018 R09: ffff810007c95458
>>>>> R10: 00000000006c3208 R11: 0000000000000246 R12: ffffffff808246e8
>>>>> R13: 000284d000000002 R14: ffffffff80387277 R15: 00000000ffffffff
>>>>> FS: 00002b28926a2ef0(0000) GS:ffff8100cf8a3940(0000) knlGS:0000000000000000
>>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>> CR2: 00000000006c3208 CR3: 00000000c3cac000 CR4: 00000000000006e0
>>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>>>> Call Trace:
>>>>> [<ffffffff8063f024>] __spin_lock+0x29/0x67
>>>>> [<ffffffff80296597>] swap_info_get+0x65/0xdd
>>>>> [<ffffffff80296c0c>] can_share_swap_page+0x39/0x84
>>>>> [<ffffffff8028ae6e>] do_wp_page+0x2f9/0x519
>>>>> [<ffffffff8028c8d2>] handle_mm_fault+0x615/0x7cf
>>>>> [<ffffffff802db823>] proc_flush_task+0x171/0x29c
>>>>> [<ffffffff8024a64d>] recalc_sigpending+0xe/0x3c
>>>>> [<ffffffff8022645e>] do_page_fault+0x162/0x754
>>>>> [<ffffffff8026fe81>] audit_syscall_exit+0x31c/0x37a
>>>>> [<ffffffff8063f449>] error_exit+0x0/0x51
>>>>> ---------------------------
>>>>> | preempt count: 00010002 ]
>>>>> | 2-level deep critical section nesting:
>>>>> ----------------------------------------
>>>>> .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
>>>>> .....[<00000000>] .. ( <= 0x0)
>>>>> .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
>>>>> .....[<00000000>] .. ( <= 0x0)
>>>>> BUG: soft lockup - CPU#3 stuck for 11s! [stress:9460]
>>>>> CPU 3:
>>>>> Modules linked in:
>>>>> Pid: 9460, comm: stress Not tainted 2.6.24.3-rt3 #1
>>>>> RIP: 0010:[<ffffffff8027c974>] [<ffffffff8027c974>] find_get_page+0xad/0xbe
>>>>> RSP: 0018:ffff8100cbf25b88 EFLAGS: 00000202
>>>>> 0000000000002009 RBX: ffffffff80824bc8 RCX: 0000000000000002
>>>>> RDX: 0000000000000002 RSI: ffff8100cbfcf298 RDI: ffff810005ad8910
>>>>> RBP: ffffffff80383a57 R08: ffff810005ad8918 R09: 0000000000000003
>>>>> R10: ffff810005ad88d8 R11: 0000000000000001 R12: ffffffff80822880
>>>>> R13: ffff81000799ce48 R14: ffffffff8028921c R15: ffffffff80822880
>>>>> FS: 00002acaa373bb00(0000) GS:ffff8100cf8a32c0(0000) knlGS:0000000000000000
>>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>> CR2: 00002b9b2827c530 CR3: 000000006cc90000 CR4: 00000000000006e0
>>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>>>> Call Trace:
>>>>> [<ffffffff8027c8ea>] find_get_page+0x23/0xbe
>>>>> [<ffffffff80296f83>] free_swap_and_cache+0x46/0xdd
>>>>> [<ffffffff8028b9b7>] unmap_vmas+0x626/0x8ce
>>>>> [<ffffffff8028fa4c>] exit_mmap+0xac/0x147
>>>>> [<ffffffff8023ced7>] mmput+0x32/0xae
>>>>> [<ffffffff80242f00>] do_exit+0x199/0x914
>>>>> [<ffffffff8024ab3b>] __dequeue_signal+0x19/0x1b7
>>>>> [<ffffffff802436a7>] do_group_exit+0x2c/0x7e
>>>>> [<ffffffff8024c47b>] get_signal_to_deliver+0x2ef/0x4aa
>>>>> [<ffffffff8020b5dc>] do_notify_resume+0xa8/0x7cd
>>>>> [<ffffffff80239320>] add_preempt_count+0x14/0x111
>>>>> [<ffffffff8038b292>] __up_read+0x13/0x8d
>>>>> [<ffffffff80226483>] do_page_fault+0x187/0x754
>>>>> [<ffffffff802335ae>] __dequeue_entity+0x2d/0x34
>>>>> [<ffffffff8020a6d5>] __switch_to+0x27/0x2c9
>>>>> [<ffffffff802264f0>] do_page_fault+0x1f4/0x754
>>>>> [<ffffffff8020c7be>] retint_signal+0x3d/0x7f
>>>>> ---------------------------
>>>>> | preempt count: 00010005 ]
>>>>> | 5-level deep critical section nesting:
>>>>> ----------------------------------------
>>>>> .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
>>>>> .....[<00000000>] .. ( <= 0x0)
>>>>> .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
>>>>> .....[<00000000>] .. ( <= 0x0)
>>>>> .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
>>>>> .....[<00000000>] .. ( <= 0x0)
>>>>> .. [<ffffffff8027c8db>] .... find_get_page+0x14/0xbe
>>>>> .....[<00000000>] .. ( <= 0x0)
>>>>> .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67
>>>>> .....[<00000000>] .. ( <= 0x0)
>>>>>
>>>>>
>>>>> I also got a kernel core.
>>>>> (gdb) info thr
>>>>> 4 process 9460 0xffffffff8027c974 in find_get_page (mapping=<value optimized out>,
>>>>> offset=18446744071570598016) at include/asm/processor_64.h:385
>>>>> 3 process 2175 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
>>>>> 2 process 9132 __spin_lock (lock=0xffffffff80893f80) at include/asm/spinlock_64.h:22
>>>>> * 1 process 9478 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
>>>>>
>>>>> CPU3(thread 4) is in find_get_page(), and the others in __spin_lock().
>>>>> The thread 4 is waiting to turn PG_nonewrefs bit off in wait_on_page_ref() which is
>>>>> called from page_cache_get_speculative(), and the thread 4 holds the swap_lock.
>>>>> The other threads waiting the swap_lock.
>>>>> On the other hand, the thread 1 turned PG_nonewrefs bit on by calling
>>>>> lock_page_ref_irq() in remove_mapping(), and then waiting the swap_lock.
>>>>> So if the target page of remove_mapping() is in the exiting process memory,
>>>>> the kernel is deadlock.
>>>>>
>>>>> (gdb) bt
>>>>> #0 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
>>>>> #1 0xffffffff80296597 in swap_info_get (entry=<value optimized out>)
>>>>> at mm/swapfile.c:253
>>>>> #2 0xffffffff80296618 in swap_free (entry={val = 1}) at mm/swapfile.c:300
>>>>> #3 0xffffffff80286acd in remove_mapping (mapping=<value optimized out>,
>>>>> page=0xffff810005ad8910) at mm/vmscan.c:423
>>>>> ...
>>>>>
>>>>> (gdb) thr 2
>>>>> (gdb) bt
>>>>> #0 __spin_lock (lock=0xffffffff80893f80) at include/asm/spinlock_64.h:22
>>>>> #1 0xffffffff80296374 in valid_swaphandles (entry=<value optimized out>,
>>>>> offset=0xffff81001e22bc78) at mm/swapfile.c:1783
>>>>> #2 0xffffffff8028b0af in swapin_readahead (entry={val = 1}, addr=0, vma=0x1)
>>>>> at mm/memory.c:2054
>>>>> #3 0xffffffff8029a6af in shmem_getpage (inode=0xffff8100cdf4fd48, idx=0,
>>>>> pagep=0xffff81001e22bd80, sgp=SGP_FAULT, type=0xffff81001e22bd34) at mm/shmem.c:1089
>>>>> ...
>>>>>
>>>>> (gdb) thr 3
>>>>> (gdb) bt
>>>>> #0 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
>>>>> #1 0xffffffff80296597 in swap_info_get (entry=<value optimized out>)
>>>>> at mm/swapfile.c:253
>>>>> #2 0xffffffff80296c0c in can_share_swap_page (page=<value optimized out>)
>>>>> at mm/swapfile.c:317
>>>>> #3 0xffffffff8028ae6e in do_wp_page (mm=0xffff8100ce772f40, vma=0xffff8100cd212f00,
>>>>> address=7090696, page_table=0xffff8100cbcef618, pmd=0xffff8100cbc28018,
>>>>> ptl=0xffff810007c95458, orig_pte=<value optimized out>) at mm/memory.c:1606
>>>>> ...
>>>>>
>>>>> (gdb) thr 4
>>>>> (gdb) bt
>>>>> #0 0xffffffff8027c974 in find_get_page (mapping=<value optimized out>,
>>>>> offset=18446744071570598016) at include/asm/processor_64.h:385
>>>>> #1 0xffffffff80296f83 in free_swap_and_cache (entry={val = 4032}) at mm/swapfile.c:403
>>>>> #2 0xffffffff8028b9b7 in unmap_vmas (tlbp=0xffff8100cbf25cd8, vma=0xffff8100cde5c678,
>>>>> start_addr=0, end_addr=18446744073709551615, nr_accounted=0xffff8100cbf25cd0,
>>>>> details=0x0) at mm/memory.c:728
>>>>> #3 0xffffffff8028fa4c in exit_mmap (mm=0xffff8100cd093600) at mm/mmap.c:2048
>>>>> #4 0xffffffff8023ced7 in mmput (mm=0xffff8100cd093600) at kernel/fork.c:443
>>>>> #5 0xffffffff80242f00 in do_exit (code=14) at kernel/exit.c:997
>>>>> ...
>>>>>
>>>>>
>>>>> I think it came from the lockless speculative get page patch.
>>>>> I found the newer version of this patch in linux-mm.
>>>>> http://marc.info/?l=linux-mm&m=119477111927364&w=2
>>>>>
>>>>> I haven't tested it because it looks big change and hard to apply.
>>>>> But it seems to fix this deadlock issue.
>>>>> Any other patch to fix this issue is welcome.
>> Yeah, in the latest lockless pagecache patches Nick got rid of
>> PG_nonewrefs as suggested by Hugh, however -rt also has my concurrent
>> pagecache patches and those need PG_nonewrefs on their own, so I hadn't
>> bothered to update to Nick's latest.
>>
>> Perhaps I ought to, as you point out, page_cache_get_speculative() is
>> cleaner in his latest.. /me puts it on his overlong TODO list.
>>
>>>> Is this patch good?
>> Yes, it does look good. Doesn't remove_exclusice_swap_page() also nest
>> PG_nonewrefs inside of swap_lock?
>
> I'm not sure about it. I haven't noticed it and I haven't checked all PG_nonewrefs.
> Will look into it.
>

Is this valid fix?

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 581afee..6fbc77e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -366,6 +366,7 @@ int remove_exclusive_swap_page(struct page *page)
/* Is the only swap cache user the cache itself? */
retval = 0;
if (p->swap_map[swp_offset(entry)] == 1) {
+ spin_unlock(&swap_lock);
/* Recheck the page count with the swapcache lock held.. */
lock_page_ref_irq(page);
if ((page_count(page) == 2) && !PageWriteback(page)) {
@@ -374,8 +375,8 @@ int remove_exclusive_swap_page(struct page *page)
retval = 1;
}
unlock_page_ref_irq(page);
- }
- spin_unlock(&swap_lock);
+ } else
+ spin_unlock(&swap_lock);

if (retval) {
swap_free(entry);

2008-03-24 18:24:33

by Hiroshi Shimamoto

[permalink] [raw]
Subject: [PATCH -rt] avoid deadlock related with PG_nonewrefs and swap_lock

Hi Peter,

I've updated the patch. Could you please review it?

I'm also thinking that it can be in the mainline because it makes
the lock period shorter, correct?

---
From: Hiroshi Shimamoto <[email protected]>

There is a deadlock scenario; remove_mapping() vs free_swap_and_cache().
remove_mapping() turns PG_nonewrefs bit on, then locks swap_lock.
free_swap_and_cache() locks swap_lock, then wait to turn PG_nonewrefs bit
off in find_get_page().

swap_lock can be unlocked before calling find_get_page().

In remove_exclusive_swap_page(), there is similar lock sequence;
swap_lock, then PG_nonewrefs bit. swap_lock can be unlocked before
turning PG_nonewrefs bit on.

Signed-off-by: Hiroshi Shimamoto <[email protected]>
---
mm/swapfile.c | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5036b70..6fbc77e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -366,6 +366,7 @@ int remove_exclusive_swap_page(struct page *page)
/* Is the only swap cache user the cache itself? */
retval = 0;
if (p->swap_map[swp_offset(entry)] == 1) {
+ spin_unlock(&swap_lock);
/* Recheck the page count with the swapcache lock held.. */
lock_page_ref_irq(page);
if ((page_count(page) == 2) && !PageWriteback(page)) {
@@ -374,8 +375,8 @@ int remove_exclusive_swap_page(struct page *page)
retval = 1;
}
unlock_page_ref_irq(page);
- }
- spin_unlock(&swap_lock);
+ } else
+ spin_unlock(&swap_lock);

if (retval) {
swap_free(entry);
@@ -400,13 +401,14 @@ void free_swap_and_cache(swp_entry_t entry)
p = swap_info_get(entry);
if (p) {
if (swap_entry_free(p, swp_offset(entry)) == 1) {
+ spin_unlock(&swap_lock);
page = find_get_page(&swapper_space, entry.val);
if (page && unlikely(TestSetPageLocked(page))) {
page_cache_release(page);
page = NULL;
}
- }
- spin_unlock(&swap_lock);
+ } else
+ spin_unlock(&swap_lock);
}
if (page) {
int one_user;
--
1.5.4.1

2008-03-26 09:50:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -rt] avoid deadlock related with PG_nonewrefs and swap_lock

On Mon, 2008-03-24 at 11:24 -0700, Hiroshi Shimamoto wrote:
> Hi Peter,
>
> I've updated the patch. Could you please review it?
>
> I'm also thinking that it can be in the mainline because it makes
> the lock period shorter, correct?

Possibly yeah, Nick, Hugh?

> ---
> From: Hiroshi Shimamoto <[email protected]>
>
> There is a deadlock scenario; remove_mapping() vs free_swap_and_cache().
> remove_mapping() turns PG_nonewrefs bit on, then locks swap_lock.
> free_swap_and_cache() locks swap_lock, then wait to turn PG_nonewrefs bit
> off in find_get_page().
>
> swap_lock can be unlocked before calling find_get_page().
>
> In remove_exclusive_swap_page(), there is similar lock sequence;
> swap_lock, then PG_nonewrefs bit. swap_lock can be unlocked before
> turning PG_nonewrefs bit on.

I worry about this, Once we free the swap entry with swap_entry_free(),
and drop the swap_lock, another task is basically free to re-use that
swap location and try to insert another page in that same spot in
add_to_swap() - read_swap_cache_async() can't race because it would mean
it still has a swap entry pinned.

However, add_to_swap() can already handle the race, because it used to
race against read_swap_cache_async(). It also swap_free()s the entry so
as to not leak entries. So I think this is indeed correct.

[ I ought to find some time to port the concurrent page-cache patches on
top of Nick's latest lockless series, Hugh's suggestion makes the
speculative get much nicer. ]

> Signed-off-by: Hiroshi Shimamoto <[email protected]>

Acked-by: Peter Zijlstra <[email protected]>

> ---
> mm/swapfile.c | 10 ++++++----
> 1 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 5036b70..6fbc77e 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -366,6 +366,7 @@ int remove_exclusive_swap_page(struct page *page)
> /* Is the only swap cache user the cache itself? */
> retval = 0;
> if (p->swap_map[swp_offset(entry)] == 1) {
> + spin_unlock(&swap_lock);
> /* Recheck the page count with the swapcache lock held.. */
> lock_page_ref_irq(page);
> if ((page_count(page) == 2) && !PageWriteback(page)) {
> @@ -374,8 +375,8 @@ int remove_exclusive_swap_page(struct page *page)
> retval = 1;
> }
> unlock_page_ref_irq(page);
> - }
> - spin_unlock(&swap_lock);
> + } else
> + spin_unlock(&swap_lock);
>
> if (retval) {
> swap_free(entry);
> @@ -400,13 +401,14 @@ void free_swap_and_cache(swp_entry_t entry)
> p = swap_info_get(entry);
> if (p) {
> if (swap_entry_free(p, swp_offset(entry)) == 1) {
> + spin_unlock(&swap_lock);
> page = find_get_page(&swapper_space, entry.val);
> if (page && unlikely(TestSetPageLocked(page))) {
> page_cache_release(page);
> page = NULL;
> }
> - }
> - spin_unlock(&swap_lock);
> + } else
> + spin_unlock(&swap_lock);
> }
> if (page) {
> int one_user;

2008-03-26 10:10:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -rt] avoid deadlock related with PG_nonewrefs and swap_lock

On Wed, 2008-03-26 at 10:50 +0100, Peter Zijlstra wrote:
> On Mon, 2008-03-24 at 11:24 -0700, Hiroshi Shimamoto wrote:
> > Hi Peter,
> >
> > I've updated the patch. Could you please review it?
> >
> > I'm also thinking that it can be in the mainline because it makes
> > the lock period shorter, correct?
>
> Possibly yeah, Nick, Hugh?
>
> > ---
> > From: Hiroshi Shimamoto <[email protected]>
> >
> > There is a deadlock scenario; remove_mapping() vs free_swap_and_cache().
> > remove_mapping() turns PG_nonewrefs bit on, then locks swap_lock.
> > free_swap_and_cache() locks swap_lock, then wait to turn PG_nonewrefs bit
> > off in find_get_page().
> >
> > swap_lock can be unlocked before calling find_get_page().
> >
> > In remove_exclusive_swap_page(), there is similar lock sequence;
> > swap_lock, then PG_nonewrefs bit. swap_lock can be unlocked before
> > turning PG_nonewrefs bit on.
>
> I worry about this, Once we free the swap entry with swap_entry_free(),
> and drop the swap_lock, another task is basically free to re-use that
> swap location and try to insert another page in that same spot in
> add_to_swap() - read_swap_cache_async() can't race because it would mean
> it still has a swap entry pinned.

D'oh of course it can race, otherwise the add_to_swap() vs
read_swap_cache_async() race wouldn't exist.

Still, given that add_to_swap() handles the race I suspect the other end
does the right thing as well.

> However, add_to_swap() can already handle the race, because it used to
> race against read_swap_cache_async(). It also swap_free()s the entry so
> as to not leak entries. So I think this is indeed correct.
>
> [ I ought to find some time to port the concurrent page-cache patches on
> top of Nick's latest lockless series, Hugh's suggestion makes the
> speculative get much nicer. ]
>
> > Signed-off-by: Hiroshi Shimamoto <[email protected]>
>
> Acked-by: Peter Zijlstra <[email protected]>
>
> > ---
> > mm/swapfile.c | 10 ++++++----
> > 1 files changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 5036b70..6fbc77e 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -366,6 +366,7 @@ int remove_exclusive_swap_page(struct page *page)
> > /* Is the only swap cache user the cache itself? */
> > retval = 0;
> > if (p->swap_map[swp_offset(entry)] == 1) {
> > + spin_unlock(&swap_lock);
> > /* Recheck the page count with the swapcache lock held.. */
> > lock_page_ref_irq(page);
> > if ((page_count(page) == 2) && !PageWriteback(page)) {
> > @@ -374,8 +375,8 @@ int remove_exclusive_swap_page(struct page *page)
> > retval = 1;
> > }
> > unlock_page_ref_irq(page);
> > - }
> > - spin_unlock(&swap_lock);
> > + } else
> > + spin_unlock(&swap_lock);
> >
> > if (retval) {
> > swap_free(entry);
> > @@ -400,13 +401,14 @@ void free_swap_and_cache(swp_entry_t entry)
> > p = swap_info_get(entry);
> > if (p) {
> > if (swap_entry_free(p, swp_offset(entry)) == 1) {
> > + spin_unlock(&swap_lock);
> > page = find_get_page(&swapper_space, entry.val);
> > if (page && unlikely(TestSetPageLocked(page))) {
> > page_cache_release(page);
> > page = NULL;
> > }
> > - }
> > - spin_unlock(&swap_lock);
> > + } else
> > + spin_unlock(&swap_lock);
> > }
> > if (page) {
> > int one_user;