2014-04-06 19:18:32

by Sasha Levin

Subject: Re: [tip:sched/core] sched/numa: Move task_numa_free() to __put_task_struct()

On 03/11/2014 08:40 AM, tip-bot for Mike Galbraith wrote:
> Commit-ID: 156654f491dd8d52687a5fbe1637f472a52ce75b
> Gitweb: http://git.kernel.org/tip/156654f491dd8d52687a5fbe1637f472a52ce75b
> Author: Mike Galbraith <[email protected]>
> AuthorDate: Fri, 28 Feb 2014 07:23:11 +0100
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Tue, 11 Mar 2014 12:05:43 +0100
>
> sched/numa: Move task_numa_free() to __put_task_struct()
>
> Bad idea on -rt:
>
> [ 908.026136] [<ffffffff8150ad6a>] rt_spin_lock_slowlock+0xaa/0x2c0
> [ 908.026145] [<ffffffff8108f701>] task_numa_free+0x31/0x130
> [ 908.026151] [<ffffffff8108121e>] finish_task_switch+0xce/0x100
> [ 908.026156] [<ffffffff81509c0a>] thread_return+0x48/0x4ae
> [ 908.026160] [<ffffffff8150a095>] schedule+0x25/0xa0
> [ 908.026163] [<ffffffff8150ad95>] rt_spin_lock_slowlock+0xd5/0x2c0
> [ 908.026170] [<ffffffff810658cf>] get_signal_to_deliver+0xaf/0x680
> [ 908.026175] [<ffffffff8100242d>] do_signal+0x3d/0x5b0
> [ 908.026179] [<ffffffff81002a30>] do_notify_resume+0x90/0xe0
> [ 908.026186] [<ffffffff81513176>] int_signal+0x12/0x17
> [ 908.026193] [<00007ff2a388b1d0>] 0x7ff2a388b1cf
>
> and since upstream does not mind where we do this, be a bit nicer ...
>
> Signed-off-by: Mike Galbraith <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Ingo Molnar <[email protected]>

It seems upstream does mind:

[ 2590.260734] ======================================================
[ 2590.261695] [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
[ 2590.262748] 3.14.0-next-20140403-sasha-00022-g10224c0 #377 Tainted: G W
[ 2590.263846] ------------------------------------------------------
[ 2590.264730] trinity-c244/1210 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[ 2590.265783] (&(&grp->lock)->rlock){+.+...}, at: task_numa_free (kernel/sched/fair.c:1714)
[ 2590.267179]
[ 2590.267179] and this task is already holding:
[ 2590.267996] (&(&new_timer->it_lock)->rlock){-.....}, at: exit_itimers (kernel/posix-timers.c:971 kernel/posix-timers.c:998)
[ 2590.269381] which would create a new lock dependency:
[ 2590.270067] (&(&new_timer->it_lock)->rlock){-.....} -> (&(&grp->lock)->rlock){+.+...}
[ 2590.270067]
[ 2590.270067] but this new dependency connects a HARDIRQ-irq-safe lock:
[ 2590.270067] (&(&new_timer->it_lock)->rlock){-.....}
... which became HARDIRQ-irq-safe at:
[ 2590.270067] __lock_acquire (kernel/locking/lockdep.c:2783 kernel/locking/lockdep.c:3138)
[ 2590.270067] lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
[ 2590.270067] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:117 kernel/locking/spinlock.c:159)
[ 2590.270067] posix_timer_fn (kernel/posix-timers.c:437)
[ 2590.270067] __run_hrtimer (kernel/hrtimer.c:1245 (discriminator 2))
[ 2590.270067] hrtimer_interrupt (kernel/hrtimer.c:1892)
[ 2590.270067] local_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:921)
[ 2590.270067] smp_apic_timer_interrupt (arch/x86/include/asm/apic.h:696 arch/x86/kernel/apic/apic.c:945)
[ 2590.270067] apic_timer_interrupt (arch/x86/kernel/entry_64.S:1164)
[ 2590.270067] default_idle (arch/x86/include/asm/paravirt.h:111 arch/x86/kernel/process.c:310)
[ 2590.270067] arch_cpu_idle (arch/x86/kernel/process.c:302)
[ 2590.270067] cpu_idle_loop (kernel/sched/idle.c:179 kernel/sched/idle.c:226)
[ 2590.270067] cpu_startup_entry (??:?)
[ 2590.270067] start_secondary (arch/x86/kernel/smpboot.c:267)
[ 2590.270067]
[ 2590.270067] to a HARDIRQ-irq-unsafe lock:
[ 2590.270067] (&(&grp->lock)->rlock){+.+...}
... which became HARDIRQ-irq-unsafe at:
[ 2590.270067] ... __lock_acquire (kernel/locking/lockdep.c:2800 kernel/locking/lockdep.c:3138)
[ 2590.270067] lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
[ 2590.270067] _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
[ 2590.270067] task_numa_placement (include/linux/nodemask.h:248 kernel/sched/fair.c:1504)
[ 2590.270067] task_numa_fault (kernel/sched/fair.c:1794)
[ 2590.270067] __handle_mm_fault (mm/memory.c:3796 mm/memory.c:3796 mm/memory.c:3909)
[ 2590.270067] handle_mm_fault (include/linux/memcontrol.h:148 mm/memory.c:3935)
[ 2590.270067] __do_page_fault (arch/x86/mm/fault.c:1220)
[ 2590.270067] do_page_fault (arch/x86/mm/fault.c:1272 include/linux/jump_label.h:105 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1273)
[ 2590.270067] do_async_page_fault (arch/x86/kernel/kvm.c:263)
[ 2590.270067] async_page_fault (arch/x86/kernel/entry_64.S:1496)
[ 2590.270067]
[ 2590.270067] other info that might help us debug this:
[ 2590.270067]
[ 2590.270067] Possible interrupt unsafe locking scenario:
[ 2590.270067]
[ 2590.270067] CPU0 CPU1
[ 2590.270067] ---- ----
[ 2590.270067] lock(&(&grp->lock)->rlock);
[ 2590.270067] local_irq_disable();
[ 2590.270067] lock(&(&new_timer->it_lock)->rlock);
[ 2590.270067] lock(&(&grp->lock)->rlock);
[ 2590.270067] <Interrupt>
[ 2590.270067] lock(&(&new_timer->it_lock)->rlock);
[ 2590.270067]
[ 2590.270067] *** DEADLOCK ***
[ 2590.270067]
[ 2590.270067] 1 lock held by trinity-c244/1210:
[ 2590.270067] #0: (&(&new_timer->it_lock)->rlock){-.....}, at: exit_itimers (kernel/posix-timers.c:971 kernel/posix-timers.c:998)
[ 2590.270067]
the dependencies between HARDIRQ-irq-safe lock and the holding lock:
[ 2590.270067] -> (&(&new_timer->it_lock)->rlock){-.....} ops: 361 {
[ 2590.270067] IN-HARDIRQ-W at:
[ 2590.270067] __lock_acquire (kernel/locking/lockdep.c:2783 kernel/locking/lockdep.c:3138)
[ 2590.270067] lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
[ 2590.270067] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:117 kernel/locking/spinlock.c:159)
[ 2590.270067] posix_timer_fn (kernel/posix-timers.c:437)
[ 2590.270067] __run_hrtimer (kernel/hrtimer.c:1245 (discriminator 2))
[ 2590.270067] hrtimer_interrupt (kernel/hrtimer.c:1892)
[ 2590.270067] local_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:921)
[ 2590.270067] smp_apic_timer_interrupt (arch/x86/include/asm/apic.h:696 arch/x86/kernel/apic/apic.c:945)
[ 2590.270067] apic_timer_interrupt (arch/x86/kernel/entry_64.S:1164)
[ 2590.270067] default_idle (arch/x86/include/asm/paravirt.h:111 arch/x86/kernel/process.c:310)
[ 2590.270067] arch_cpu_idle (arch/x86/kernel/process.c:302)
[ 2590.270067] cpu_idle_loop (kernel/sched/idle.c:179 kernel/sched/idle.c:226)
[ 2590.270067] cpu_startup_entry (??:?)
[ 2590.270067] start_secondary (arch/x86/kernel/smpboot.c:267)
[ 2590.270067] INITIAL USE at:
[ 2590.270067] __lock_acquire (kernel/locking/lockdep.c:3142)
[ 2590.270067] lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
[ 2590.270067] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:117 kernel/locking/spinlock.c:159)
[ 2590.270067] exit_itimers (kernel/posix-timers.c:971 kernel/posix-timers.c:998)
[ 2590.270067] do_exit (kernel/exit.c:766)
[ 2590.270067] do_group_exit (kernel/exit.c:919)
[ 2590.270067] SyS_exit_group (kernel/exit.c:930)
[ 2590.270067] tracesys (arch/x86/kernel/entry_64.S:749)
[ 2590.270067] }
[ 2590.270067] ... key at: __key.33130 (??:?)
[ 2590.270067] ... acquired at:
[ 2590.270067] check_irq_usage (kernel/locking/lockdep.c:1638)
[ 2590.270067] __lock_acquire (kernel/locking/lockdep_states.h:7 kernel/locking/lockdep.c:1844 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
[ 2590.270067] lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
[ 2590.270067] _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
[ 2590.270067] task_numa_free (kernel/sched/fair.c:1714)
[ 2590.270067] __put_task_struct (kernel/fork.c:244 (discriminator 2))
[ 2590.270067] posix_cpu_timer_del (include/linux/sched.h:1807 kernel/posix-cpu-timers.c:409)
[ 2590.270067] exit_itimers (kernel/posix-timers.c:973 kernel/posix-timers.c:998)
[ 2590.270067] do_exit (kernel/exit.c:766)
[ 2590.270067] do_group_exit (kernel/exit.c:919)
[ 2590.270067] SyS_exit_group (kernel/exit.c:930)
[ 2590.270067] tracesys (arch/x86/kernel/entry_64.S:749)
[ 2590.270067]
[ 2590.270067]
the dependencies between the lock to be acquired and HARDIRQ-irq-unsafe lock:
[ 2590.270067] -> (&(&grp->lock)->rlock){+.+...} ops: 91 {
[ 2590.270067] HARDIRQ-ON-W at:
[ 2590.270067] __lock_acquire (kernel/locking/lockdep.c:2800 kernel/locking/lockdep.c:3138)
[ 2590.270067] lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
[ 2590.270067] _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
[ 2590.270067] task_numa_placement (include/linux/nodemask.h:248 kernel/sched/fair.c:1504)
[ 2590.270067] task_numa_fault (kernel/sched/fair.c:1794)
[ 2590.270067] __handle_mm_fault (mm/memory.c:3796 mm/memory.c:3796 mm/memory.c:3909)
[ 2590.270067] handle_mm_fault (include/linux/memcontrol.h:148 mm/memory.c:3935)
[ 2590.270067] __do_page_fault (arch/x86/mm/fault.c:1220)
[ 2590.270067] do_page_fault (arch/x86/mm/fault.c:1272 include/linux/jump_label.h:105 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1273)
[ 2590.270067] do_async_page_fault (arch/x86/kernel/kvm.c:263)
[ 2590.270067] async_page_fault (arch/x86/kernel/entry_64.S:1496)
[ 2590.270067] SOFTIRQ-ON-W at:
[ 2590.270067] __lock_acquire (kernel/locking/lockdep.c:2804 kernel/locking/lockdep.c:3138)
[ 2590.270067] lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
[ 2590.270067] _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
[ 2590.270067] task_numa_placement (include/linux/nodemask.h:248 kernel/sched/fair.c:1504)
[ 2590.270067] task_numa_fault (kernel/sched/fair.c:1794)
[ 2590.270067] __handle_mm_fault (mm/memory.c:3796 mm/memory.c:3796 mm/memory.c:3909)
[ 2590.270067] handle_mm_fault (include/linux/memcontrol.h:148 mm/memory.c:3935)
[ 2590.270067] __do_page_fault (arch/x86/mm/fault.c:1220)
[ 2590.270067] do_page_fault (arch/x86/mm/fault.c:1272 include/linux/jump_label.h:105 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1273)
[ 2590.270067] do_async_page_fault (arch/x86/kernel/kvm.c:263)
[ 2590.270067] async_page_fault (arch/x86/kernel/entry_64.S:1496)
[ 2590.270067] INITIAL USE at:
[ 2590.270067] __lock_acquire (kernel/locking/lockdep.c:3142)
[ 2590.270067] lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
[ 2590.270067] _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
[ 2590.270067] task_numa_placement (include/linux/nodemask.h:248 kernel/sched/fair.c:1504)
[ 2590.270067] task_numa_fault (kernel/sched/fair.c:1794)
[ 2590.270067] __handle_mm_fault (mm/memory.c:3796 mm/memory.c:3796 mm/memory.c:3909)
[ 2590.270067] handle_mm_fault (include/linux/memcontrol.h:148 mm/memory.c:3935)
[ 2590.270067] __do_page_fault (arch/x86/mm/fault.c:1220)
[ 2590.270067] do_page_fault (arch/x86/mm/fault.c:1272 include/linux/jump_label.h:105 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1273)
[ 2590.270067] do_async_page_fault (arch/x86/kernel/kvm.c:263)
[ 2590.270067] async_page_fault (arch/x86/kernel/entry_64.S:1496)
[ 2590.270067] }
[ 2590.270067] ... key at: __key.32449 (??:?)
[ 2590.270067] ... acquired at:
[ 2590.270067] check_irq_usage (kernel/locking/lockdep.c:1638)
[ 2590.270067] __lock_acquire (kernel/locking/lockdep_states.h:7 kernel/locking/lockdep.c:1844 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
[ 2590.270067] lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
[ 2590.270067] _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
[ 2590.270067] task_numa_free (kernel/sched/fair.c:1714)
[ 2590.270067] __put_task_struct (kernel/fork.c:244 (discriminator 2))
[ 2590.270067] posix_cpu_timer_del (include/linux/sched.h:1807 kernel/posix-cpu-timers.c:409)
[ 2590.270067] exit_itimers (kernel/posix-timers.c:973 kernel/posix-timers.c:998)
[ 2590.270067] do_exit (kernel/exit.c:766)
[ 2590.270067] do_group_exit (kernel/exit.c:919)
[ 2590.270067] SyS_exit_group (kernel/exit.c:930)
[ 2590.270067] tracesys (arch/x86/kernel/entry_64.S:749)
[ 2590.270067]
[ 2590.270067]
[ 2590.270067] stack backtrace:
[ 2590.270067] CPU: 3 PID: 1210 Comm: trinity-c244 Tainted: G W 3.14.0-next-20140403-sasha-00022-g10224c0 #377
[ 2590.270067] ffffffff87a83b60 ffff880081695ad8 ffffffff844bfb3f 0000000000000000
[ 2590.270067] ffff880081698cf0 ffff880081695be8 ffffffff811c0d05 0000000000000000
[ 2590.270067] ffffffff00000000 ffff880000000001 ffffffff8107aac5 ffff880081695b38
[ 2590.270067] Call Trace:
[ 2590.270067] dump_stack (lib/dump_stack.c:52)
[ 2590.270067] check_usage (kernel/locking/lockdep.c:1549 kernel/locking/lockdep.c:1580)
[ 2590.270067] ? sched_clock (arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:305)
[ 2590.270067] ? sched_clock_cpu (kernel/sched/clock.c:311)
[ 2590.270067] check_irq_usage (kernel/locking/lockdep.c:1638)
[ 2590.270067] __lock_acquire (kernel/locking/lockdep_states.h:7 kernel/locking/lockdep.c:1844 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
[ 2590.270067] ? __lock_acquire (kernel/locking/lockdep.c:3189)
[ 2590.270067] lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
[ 2590.270067] ? task_numa_free (kernel/sched/fair.c:1714)
[ 2590.270067] _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
[ 2590.270067] ? task_numa_free (kernel/sched/fair.c:1714)
[ 2590.270067] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2638 (discriminator 2))
[ 2590.270067] task_numa_free (kernel/sched/fair.c:1714)
[ 2590.270067] __put_task_struct (kernel/fork.c:244 (discriminator 2))
[ 2590.270067] posix_cpu_timer_del (include/linux/sched.h:1807 kernel/posix-cpu-timers.c:409)
[ 2590.270067] exit_itimers (kernel/posix-timers.c:973 kernel/posix-timers.c:998)
[ 2590.270067] do_exit (kernel/exit.c:766)
[ 2590.270067] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 2590.270067] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2557 kernel/locking/lockdep.c:2599)
[ 2590.270067] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
[ 2590.270067] do_group_exit (kernel/exit.c:919)
[ 2590.270067] SyS_exit_group (kernel/exit.c:930)
[ 2590.270067] tracesys (arch/x86/kernel/entry_64.S:749)


Thanks,
Sasha


2014-04-07 05:32:33

by Mike Galbraith

Subject: Re: [tip:sched/core] sched/numa: Move task_numa_free() to __put_task_struct()

On Sun, 2014-04-06 at 15:17 -0400, Sasha Levin wrote:
> On 03/11/2014 08:40 AM, tip-bot for Mike Galbraith wrote:
> > Commit-ID: 156654f491dd8d52687a5fbe1637f472a52ce75b
> > Gitweb: http://git.kernel.org/tip/156654f491dd8d52687a5fbe1637f472a52ce75b
> > Author: Mike Galbraith <[email protected]>
> > AuthorDate: Fri, 28 Feb 2014 07:23:11 +0100
> > Committer: Ingo Molnar <[email protected]>
> > CommitDate: Tue, 11 Mar 2014 12:05:43 +0100
> >
> > sched/numa: Move task_numa_free() to __put_task_struct()
> >
> > Bad idea on -rt:
> >
> > [ 908.026136] [<ffffffff8150ad6a>] rt_spin_lock_slowlock+0xaa/0x2c0
> > [ 908.026145] [<ffffffff8108f701>] task_numa_free+0x31/0x130
> > [ 908.026151] [<ffffffff8108121e>] finish_task_switch+0xce/0x100
> > [ 908.026156] [<ffffffff81509c0a>] thread_return+0x48/0x4ae
> > [ 908.026160] [<ffffffff8150a095>] schedule+0x25/0xa0
> > [ 908.026163] [<ffffffff8150ad95>] rt_spin_lock_slowlock+0xd5/0x2c0
> > [ 908.026170] [<ffffffff810658cf>] get_signal_to_deliver+0xaf/0x680
> > [ 908.026175] [<ffffffff8100242d>] do_signal+0x3d/0x5b0
> > [ 908.026179] [<ffffffff81002a30>] do_notify_resume+0x90/0xe0
> > [ 908.026186] [<ffffffff81513176>] int_signal+0x12/0x17
> > [ 908.026193] [<00007ff2a388b1d0>] 0x7ff2a388b1cf
> >
> > and since upstream does not mind where we do this, be a bit nicer ...
> >
> > Signed-off-by: Mike Galbraith <[email protected]>
> > Signed-off-by: Peter Zijlstra <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Linus Torvalds <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Link: http://lkml.kernel.org/r/[email protected]
> > Signed-off-by: Ingo Molnar <[email protected]>
>
> It seems upstream does mind:
>
> [ 2590.260734] ======================================================
> [ 2590.261695] [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
> [ 2590.262748] 3.14.0-next-20140403-sasha-00022-g10224c0 #377 Tainted: G W
> [ 2590.263846] ------------------------------------------------------
> [ 2590.264730] trinity-c244/1210 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> [ 2590.265783] (&(&grp->lock)->rlock){+.+...}, at: task_numa_free (kernel/sched/fair.c:1714)
> [ 2590.267179]
> [ 2590.267179] and this task is already holding:
> [ 2590.267996] (&(&new_timer->it_lock)->rlock){-.....}, at: exit_itimers (kernel/posix-timers.c:971 kernel/posix-timers.c:998)
> [ 2590.269381] which would create a new lock dependency:
> [ 2590.270067] (&(&new_timer->it_lock)->rlock){-.....} -> (&(&grp->lock)->rlock){+.+...}

I'm not getting it.

I moved task_numa_free() from one interrupts-enabled spot to another.
But, with numa=fake=4 and lockdep enabled, not only does lockdep gripe,
my little box locks up on the splat. Switching task_numa_free() to
spin_lock_irq()/spin_unlock_irq() did the expected: it just moved the
lockdep gripe to task_numa_fault().


> [ 2590.270067] Possible interrupt unsafe locking scenario:
> [ 2590.270067]
> [ 2590.270067] CPU0 CPU1
> [ 2590.270067] ---- ----
> [ 2590.270067] lock(&(&grp->lock)->rlock);
> [ 2590.270067] local_irq_disable();
> [ 2590.270067] lock(&(&new_timer->it_lock)->rlock);
> [ 2590.270067] lock(&(&grp->lock)->rlock);
> [ 2590.270067] <Interrupt>
> [ 2590.270067] lock(&(&new_timer->it_lock)->rlock);
> [ 2590.270067]
> [ 2590.270067] *** DEADLOCK ***

Ok, so how did I manage that HARDIRQ-safe -> HARDIRQ-unsafe?

-Mike

2014-04-07 07:30:46

by Mike Galbraith

Subject: Re: [tip:sched/core] sched/numa: Move task_numa_free() to __put_task_struct()

On Mon, 2014-04-07 at 07:29 +0200, Mike Galbraith wrote:

> I'm not getting it.
>
> I moved task_numa_free() from one interrupts-enabled spot to another.
> But, with numa=fake=4 and lockdep enabled, not only does lockdep gripe,
> my little box locks up on the splat. Switching task_numa_free() to
> spin_lock_irq()/spin_unlock_irq() did the expected: it just moved the
> lockdep gripe to task_numa_fault().
>
>
> > [ 2590.270067] Possible interrupt unsafe locking scenario:
> > [ 2590.270067]
> > [ 2590.270067] CPU0 CPU1
> > [ 2590.270067] ---- ----
> > [ 2590.270067] lock(&(&grp->lock)->rlock);
> > [ 2590.270067] local_irq_disable();
> > [ 2590.270067] lock(&(&new_timer->it_lock)->rlock);
> > [ 2590.270067] lock(&(&grp->lock)->rlock);
> > [ 2590.270067] <Interrupt>
> > [ 2590.270067] lock(&(&new_timer->it_lock)->rlock);
> > [ 2590.270067]
> > [ 2590.270067] *** DEADLOCK ***
>
> Ok, so how did I manage that HARDIRQ-safe -> HARDIRQ-unsafe?

Think I'll turn lockdep off, and make context switches take a good long
while after finish_lock_switch(), but meanwhile, this made it happy.

Sasha reports that lockdep claims 156654f491dd8d52687a5fbe1637f472a52ce75b made
numa_group.lock interrupt unsafe. While I don't see how that could be given the
commit in question moved task_numa_free() from one irq enabled region to another,
the below does make both gripes and lockup upon gripe with numa=fake=4 go away.

Reported-by: Sasha Levin <[email protected]>
Not-signed-off-by: Mike Galbraith <[email protected]>
---
kernel/sched/fair.c | 12 +++++++-----
kernel/sched/sched.h | 9 +++++++++
2 files changed, 16 insertions(+), 5 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1497,7 +1497,7 @@ static void task_numa_placement(struct t
/* If the task is part of a group prevent parallel updates to group stats */
if (p->numa_group) {
group_lock = &p->numa_group->lock;
- spin_lock(group_lock);
+ spin_lock_irq(group_lock);
}

/* Find the node with the highest number of faults */
@@ -1572,7 +1572,7 @@ static void task_numa_placement(struct t
}
}

- spin_unlock(group_lock);
+ spin_unlock_irq(group_lock);
}

/* Preferred node as the node with the most faults */
@@ -1677,7 +1677,8 @@ static void task_numa_group(struct task_
if (!join)
return;

- double_lock(&my_grp->lock, &grp->lock);
+ BUG_ON(irqs_disabled());
+ double_lock_irq(&my_grp->lock, &grp->lock);

for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) {
my_grp->faults[i] -= p->numa_faults_memory[i];
@@ -1692,6 +1693,7 @@ static void task_numa_group(struct task_

spin_unlock(&my_grp->lock);
spin_unlock(&grp->lock);
+ local_irq_enable();

rcu_assign_pointer(p->numa_group, grp);

@@ -1710,14 +1712,14 @@ void task_numa_free(struct task_struct *
void *numa_faults = p->numa_faults_memory;

if (grp) {
- spin_lock(&grp->lock);
+ spin_lock_irq(&grp->lock);
for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
grp->faults[i] -= p->numa_faults_memory[i];
grp->total_faults -= p->total_numa_faults;

list_del(&p->numa_entry);
grp->nr_tasks--;
- spin_unlock(&grp->lock);
+ spin_unlock_irq(&grp->lock);
rcu_assign_pointer(p->numa_group, NULL);
put_numa_group(grp);
}
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1388,6 +1388,15 @@ static inline void double_lock(spinlock_
spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
}

+static inline void double_lock_irq(spinlock_t *l1, spinlock_t *l2)
+{
+ if (l1 > l2)
+ swap(l1, l2);
+
+ spin_lock_irq(l1);
+ spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
static inline void double_raw_lock(raw_spinlock_t *l1, raw_spinlock_t *l2)
{
if (l1 > l2)

2014-04-07 08:16:57

by Peter Zijlstra

Subject: Re: [tip:sched/core] sched/numa: Move task_numa_free() to __put_task_struct()

On Mon, Apr 07, 2014 at 09:30:30AM +0200, Mike Galbraith wrote:
> - double_lock(&my_grp->lock, &grp->lock);
> + BUG_ON(irqs_disabled());
> + double_lock_irq(&my_grp->lock, &grp->lock);

So either make this:

local_irq_disable();
double_lock();

or

>
> for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) {
> my_grp->faults[i] -= p->numa_faults_memory[i];
> @@ -1692,6 +1693,7 @@ static void task_numa_group(struct task_
>
> spin_unlock(&my_grp->lock);
> spin_unlock(&grp->lock);
> + local_irq_enable();

use:
spin_unlock()
spin_unlock_irq()

or so, but this imbalance is making my itch :-)


>
> rcu_assign_pointer(p->numa_group, grp);
>

2014-04-07 08:43:03

by Mike Galbraith

Subject: Re: [tip:sched/core] sched/numa: Move task_numa_free() to __put_task_struct()

On Mon, 2014-04-07 at 10:16 +0200, Peter Zijlstra wrote:
> On Mon, Apr 07, 2014 at 09:30:30AM +0200, Mike Galbraith wrote:
> > - double_lock(&my_grp->lock, &grp->lock);
> > + BUG_ON(irqs_disabled());
> > + double_lock_irq(&my_grp->lock, &grp->lock);
>
> So either make this:
>
> local_irq_disable();
> double_lock();
>
> or
>
> >
> > for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) {
> > my_grp->faults[i] -= p->numa_faults_memory[i];
> > @@ -1692,6 +1693,7 @@ static void task_numa_group(struct task_
> >
> > spin_unlock(&my_grp->lock);
> > spin_unlock(&grp->lock);
> > + local_irq_enable();
>
> use:
> spin_unlock()
> spin_unlock_irq()

*thwap* Well duh.

> or so, but this imbalance is making my itch :-)

Yeah, much better.

Before I actually sign that off, mind cluing me in as to why I should
not be sitting here thinking lockdep smoked its breakfast?

-Mike

2014-04-07 08:55:33

by Mike Galbraith

Subject: Re: [tip:sched/core] sched/numa: Move task_numa_free() to __put_task_struct()

On Mon, 2014-04-07 at 10:16 +0200, Peter Zijlstra wrote:
> On Mon, Apr 07, 2014 at 09:30:30AM +0200, Mike Galbraith wrote:
> > - double_lock(&my_grp->lock, &grp->lock);
> > + BUG_ON(irqs_disabled());
> > + double_lock_irq(&my_grp->lock, &grp->lock);
>
> So either make this:
>
> local_irq_disable();
> double_lock();
>
> or
>
> >
> > for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) {
> > my_grp->faults[i] -= p->numa_faults_memory[i];
> > @@ -1692,6 +1693,7 @@ static void task_numa_group(struct task_
> >
> > spin_unlock(&my_grp->lock);
> > spin_unlock(&grp->lock);
> > + local_irq_enable();
>
> use:
> spin_unlock()
> spin_unlock_irq()
>
> or so, but this imbalance is making my itch :-)

sched, numa: fix task_numa_free() lockdep splat

Sasha reports that lockdep claims 156654f491dd8d52687a5fbe1637f472a52ce75b made
numa_group.lock interrupt unsafe. While I don't see how that could be given the
commit in question moved task_numa_free() from one irq enabled region to another,
the below does make both gripes and lockups upon gripe with numa=fake=4 go away.

Reported-by: Sasha Levin <[email protected]>
Signed-off-by: Mike Galbraith <[email protected]>
---
kernel/sched/fair.c | 13 +++++++------
kernel/sched/sched.h | 9 +++++++++
2 files changed, 16 insertions(+), 6 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1497,7 +1497,7 @@ static void task_numa_placement(struct t
/* If the task is part of a group prevent parallel updates to group stats */
if (p->numa_group) {
group_lock = &p->numa_group->lock;
- spin_lock(group_lock);
+ spin_lock_irq(group_lock);
}

/* Find the node with the highest number of faults */
@@ -1572,7 +1572,7 @@ static void task_numa_placement(struct t
}
}

- spin_unlock(group_lock);
+ spin_unlock_irq(group_lock);
}

/* Preferred node as the node with the most faults */
@@ -1677,7 +1677,8 @@ static void task_numa_group(struct task_
if (!join)
return;

- double_lock(&my_grp->lock, &grp->lock);
+ BUG_ON(irqs_disabled());
+ double_lock_irq(&my_grp->lock, &grp->lock);

for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) {
my_grp->faults[i] -= p->numa_faults_memory[i];
@@ -1691,7 +1692,7 @@ static void task_numa_group(struct task_
grp->nr_tasks++;

spin_unlock(&my_grp->lock);
- spin_unlock(&grp->lock);
+ spin_unlock_irq(&grp->lock);

rcu_assign_pointer(p->numa_group, grp);

@@ -1710,14 +1711,14 @@ void task_numa_free(struct task_struct *
void *numa_faults = p->numa_faults_memory;

if (grp) {
- spin_lock(&grp->lock);
+ spin_lock_irq(&grp->lock);
for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
grp->faults[i] -= p->numa_faults_memory[i];
grp->total_faults -= p->total_numa_faults;

list_del(&p->numa_entry);
grp->nr_tasks--;
- spin_unlock(&grp->lock);
+ spin_unlock_irq(&grp->lock);
rcu_assign_pointer(p->numa_group, NULL);
put_numa_group(grp);
}
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1388,6 +1388,15 @@ static inline void double_lock(spinlock_
spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
}

+static inline void double_lock_irq(spinlock_t *l1, spinlock_t *l2)
+{
+ if (l1 > l2)
+ swap(l1, l2);
+
+ spin_lock_irq(l1);
+ spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
static inline void double_raw_lock(raw_spinlock_t *l1, raw_spinlock_t *l2)
{
if (l1 > l2)

Subject: Re: [tip:sched/core] sched/numa: Move task_numa_free() to __put_task_struct()



On Mon, 7 Apr 2014, Mike Galbraith wrote:

> On Mon, 2014-04-07 at 10:16 +0200, Peter Zijlstra wrote:
>> On Mon, Apr 07, 2014 at 09:30:30AM +0200, Mike Galbraith wrote:
>>> - double_lock(&my_grp->lock, &grp->lock);
>>> + BUG_ON(irqs_disabled());
>>> + double_lock_irq(&my_grp->lock, &grp->lock);
>>
>> So either make this:
>>
>> local_irq_disable();
>> double_lock();
>>
>> or
>>
>>>
>>> for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) {
>>> my_grp->faults[i] -= p->numa_faults_memory[i];
>>> @@ -1692,6 +1693,7 @@ static void task_numa_group(struct task_
>>>
>>> spin_unlock(&my_grp->lock);
>>> spin_unlock(&grp->lock);
>>> + local_irq_enable();
>>
>> use:
>> spin_unlock()
>> spin_unlock_irq()
>>
>> or so, but this imbalance is making my itch :-)
>
> sched, numa: fix task_numa_free() lockdep splat
>
> Sasha reports that lockdep claims 156654f491dd8d52687a5fbe1637f472a52ce75b made
> numa_group.lock interrupt unsafe. While I don't see how that could be given the
> commit in question moved task_numa_free() from one irq enabled region to another,
> the below does make both gripes and lockups upon gripe with numa=fake=4 go away.
>

Hi

I am hitting this bug quite frequently. I do not see the problem after applying
this patch.

Thanks

Tested-by: Govindarajulu Varadarajan <[email protected]>

> Reported-by: Sasha Levin <[email protected]>
> Signed-off-by: Mike Galbraith <[email protected]>
> ---
> kernel/sched/fair.c | 13 +++++++------
> kernel/sched/sched.h | 9 +++++++++
> 2 files changed, 16 insertions(+), 6 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1497,7 +1497,7 @@ static void task_numa_placement(struct t
> /* If the task is part of a group prevent parallel updates to group stats */
> if (p->numa_group) {
> group_lock = &p->numa_group->lock;
> - spin_lock(group_lock);
> + spin_lock_irq(group_lock);
> }
>
> /* Find the node with the highest number of faults */
> @@ -1572,7 +1572,7 @@ static void task_numa_placement(struct t
> }
> }
>
> - spin_unlock(group_lock);
> + spin_unlock_irq(group_lock);
> }
>
> /* Preferred node as the node with the most faults */
> @@ -1677,7 +1677,8 @@ static void task_numa_group(struct task_
> if (!join)
> return;
>
> - double_lock(&my_grp->lock, &grp->lock);
> + BUG_ON(irqs_disabled());
> + double_lock_irq(&my_grp->lock, &grp->lock);
>
> for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) {
> my_grp->faults[i] -= p->numa_faults_memory[i];
> @@ -1691,7 +1692,7 @@ static void task_numa_group(struct task_
> grp->nr_tasks++;
>
> spin_unlock(&my_grp->lock);
> - spin_unlock(&grp->lock);
> + spin_unlock_irq(&grp->lock);
>
> rcu_assign_pointer(p->numa_group, grp);
>
> @@ -1710,14 +1711,14 @@ void task_numa_free(struct task_struct *
> void *numa_faults = p->numa_faults_memory;
>
> if (grp) {
> - spin_lock(&grp->lock);
> + spin_lock_irq(&grp->lock);
> for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
> grp->faults[i] -= p->numa_faults_memory[i];
> grp->total_faults -= p->total_numa_faults;
>
> list_del(&p->numa_entry);
> grp->nr_tasks--;
> - spin_unlock(&grp->lock);
> + spin_unlock_irq(&grp->lock);
> rcu_assign_pointer(p->numa_group, NULL);
> put_numa_group(grp);
> }
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1388,6 +1388,15 @@ static inline void double_lock(spinlock_
> spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
> }
>
> +static inline void double_lock_irq(spinlock_t *l1, spinlock_t *l2)
> +{
> + if (l1 > l2)
> + swap(l1, l2);
> +
> + spin_lock_irq(l1);
> + spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
> +}
> +
> static inline void double_raw_lock(raw_spinlock_t *l1, raw_spinlock_t *l2)
> {
> if (l1 > l2)
>
>

Subject: [tip:sched/urgent] sched/numa: Fix task_numa_free() lockdep splat

Commit-ID: 60e69eed85bb7b5198ef70643b5895c26ad76ef7
Gitweb: http://git.kernel.org/tip/60e69eed85bb7b5198ef70643b5895c26ad76ef7
Author: Mike Galbraith <[email protected]>
AuthorDate: Mon, 7 Apr 2014 10:55:15 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 11 Apr 2014 10:39:15 +0200

sched/numa: Fix task_numa_free() lockdep splat

Sasha reported that lockdep claims that the following commit made
numa_group.lock interrupt unsafe:

156654f491dd ("sched/numa: Move task_numa_free() to __put_task_struct()")

While I don't see how that could be, given the commit in question moved
task_numa_free() from one irq enabled region to another, the below does
make both gripes and lockups upon gripe with numa=fake=4 go away.

Reported-by: Sasha Levin <[email protected]>
Fixes: 156654f491dd ("sched/numa: Move task_numa_free() to __put_task_struct()")
Signed-off-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Dave Jones <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 13 +++++++------
kernel/sched/sched.h | 9 +++++++++
2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e9bd0b..4f14a65 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1497,7 +1497,7 @@ static void task_numa_placement(struct task_struct *p)
/* If the task is part of a group prevent parallel updates to group stats */
if (p->numa_group) {
group_lock = &p->numa_group->lock;
- spin_lock(group_lock);
+ spin_lock_irq(group_lock);
}

/* Find the node with the highest number of faults */
@@ -1572,7 +1572,7 @@ static void task_numa_placement(struct task_struct *p)
}
}

- spin_unlock(group_lock);
+ spin_unlock_irq(group_lock);
}

/* Preferred node as the node with the most faults */
@@ -1677,7 +1677,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
if (!join)
return;

- double_lock(&my_grp->lock, &grp->lock);
+ BUG_ON(irqs_disabled());
+ double_lock_irq(&my_grp->lock, &grp->lock);

for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) {
my_grp->faults[i] -= p->numa_faults_memory[i];
@@ -1691,7 +1692,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
grp->nr_tasks++;

spin_unlock(&my_grp->lock);
- spin_unlock(&grp->lock);
+ spin_unlock_irq(&grp->lock);

rcu_assign_pointer(p->numa_group, grp);

@@ -1710,14 +1711,14 @@ void task_numa_free(struct task_struct *p)
void *numa_faults = p->numa_faults_memory;

if (grp) {
- spin_lock(&grp->lock);
+ spin_lock_irq(&grp->lock);
for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
grp->faults[i] -= p->numa_faults_memory[i];
grp->total_faults -= p->total_numa_faults;

list_del(&p->numa_entry);
grp->nr_tasks--;
- spin_unlock(&grp->lock);
+ spin_unlock_irq(&grp->lock);
rcu_assign_pointer(p->numa_group, NULL);
put_numa_group(grp);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c9007f2..456e492 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1385,6 +1385,15 @@ static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
}

+static inline void double_lock_irq(spinlock_t *l1, spinlock_t *l2)
+{
+ if (l1 > l2)
+ swap(l1, l2);
+
+ spin_lock_irq(l1);
+ spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
static inline void double_raw_lock(raw_spinlock_t *l1, raw_spinlock_t *l2)
{
if (l1 > l2)