One of the cores might have just allocated irq_desc() and other core
might be doing irq migration in the hot plug path. In the hot plug path
during the IRQ migration, for_each_active_irq macro is trying to get
irqs whose bits are set in allocated_irqs bit map but there is no return
value check after irq_to_desc for desc validity.
[ 24.566381] msm_thermal:do_core_control Set Offline: CPU4 Temp: 73
[ 24.568821] Unable to handle kernel NULL pointer dereference at virtual address 000000a4
[ 24.568931] pgd = ffffffc002184000
[ 24.568995] [000000a4] *pgd=0000000178df5003, *pud=0000000178df5003, *pmd=0000000178df6003, *pte=0060000017a00707
[ 24.569153] ------------[ cut here ]------------
[ 24.569228] Kernel BUG at ffffffc0000f3060 [verbose debug info unavailable]
[ 24.569334] Internal error: Oops - BUG: 96000005 [#1] PREEMPT SMP
[ 24.569422] Modules linked in:
[ 24.569486] CPU: 4 PID: 28 Comm: migration/4 Tainted: G W 4.4.8-perf-9407222-eng #1
[ 24.569684] task: ffffffc0f28f0e80 ti: ffffffc0f2a84000 task.ti: ffffffc0f2a84000
[ 24.569785] PC is at do_raw_spin_lock+0x20/0x160
[ 24.569859] LR is at _raw_spin_lock+0x34/0x40
[ 24.569931] pc : [<ffffffc0000f3060>] lr : [<ffffffc001023dec>] pstate: 200001c5
[ 24.570029] sp : ffffffc0f2a87bc0
[ 24.570091] x29: ffffffc0f2a87bc0 x28: ffffffc001033988
[ 24.570174] x27: ffffffc001adebb0 x26: 0000000000000000
[ 24.570257] x25: 00000000000000a0 x24: 0000000000000020
[ 24.570339] x23: ffffffc001033978 x22: ffffffc001adeb80
[ 24.570421] x21: 000000000000027e x20: 0000000000000000
[ 24.570502] x19: 00000000000000a0 x18: 000000000000000d
[ 24.570584] x17: 0000000000005f00 x16: 0000000000000003
[ 24.570666] x15: 000000000000bd39 x14: 0ffffffffffffffe
[ 24.570748] x13: 0000000000000000 x12: 0000000000000018
[ 24.570829] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
[ 24.570911] x9 : fefefefeff332e6d x8 : 7f7f7f7f7f7f7f7f
[ 24.570993] x7 : ffffffc00344f6a0 x6 : 0000000000000000
[ 24.571075] x5 : 0000000000000001 x4 : ffffffc00344f488
[ 24.571157] x3 : 0000000000000000 x2 : 0000000000000000
[ 24.571238] x1 : ffffffc0f2a84000 x0 : 0000000000004ead
...
...
...
[ 24.581324] Call trace:
[ 24.581379] [<ffffffc0000f3060>] do_raw_spin_lock+0x20/0x160
[ 24.581463] [<ffffffc001023dec>] _raw_spin_lock+0x34/0x40
[ 24.581546] [<ffffffc000103f10>] irq_migrate_all_off_this_cpu+0x84/0x1c4
[ 24.581641] [<ffffffc00008ec84>] __cpu_disable+0x54/0x74
[ 24.581722] [<ffffffc0000a3368>] take_cpu_down+0x1c/0x58
[ 24.581803] [<ffffffc00013ac08>] multi_cpu_stop+0xb0/0x10c
[ 24.581885] [<ffffffc00013ad60>] cpu_stopper_thread+0xb8/0x184
[ 24.581972] [<ffffffc0000c4920>] smpboot_thread_fn+0x26c/0x290
[ 24.582057] [<ffffffc0000c0f84>] kthread+0x100/0x108
[ 24.582135] Code: aa0003f3 aa1e03e0 d503201f 5289d5a0 (b9400661)
[ 24.582224] ---[ end trace 609f38584306f5d9 ]---
Signed-off-by: Prasad Sodagudi <[email protected]>
Signed-off-by: Prakruthi Deepak Heragu <[email protected]>
---
kernel/irq/cpuhotplug.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
index 5b1072e..17c5e71 100644
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -160,6 +160,9 @@ void irq_migrate_all_off_this_cpu(void)
bool affinity_broken;
desc = irq_to_desc(irq);
+ if (!desc)
+ continue;
+
raw_spin_lock(&desc->lock);
affinity_broken = migrate_one_irq(desc);
raw_spin_unlock(&desc->lock);
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
On Wed, 5 Sep 2018, Prakruthi Deepak Heragu wrote:
> One of the cores might have just allocated irq_desc() and other core
> might be doing irq migration in the hot plug path. In the hot plug path
> during the IRQ migration, for_each_active_irq macro is trying to get
> irqs whose bits are set in allocated_irqs bit map but there is no return
> value check after irq_to_desc for desc validity.
Confused. All parts involved, irq allocation/deallocation and the CPU
hotplug code take sparse_irq_lock to prevent exavtly that.
Thanks,
tglx
On 2018-09-05 11:23, Thomas Gleixner wrote:
> On Wed, 5 Sep 2018, Prakruthi Deepak Heragu wrote:
>
>> One of the cores might have just allocated irq_desc() and other core
>> might be doing irq migration in the hot plug path. In the hot plug
>> path
>> during the IRQ migration, for_each_active_irq macro is trying to get
>> irqs whose bits are set in allocated_irqs bit map but there is no
>> return
>> value check after irq_to_desc for desc validity.
>
> Confused. All parts involved, irq allocation/deallocation and the CPU
> hotplug code take sparse_irq_lock to prevent exavtly that.
>
Removing the NULL pointer check and adding this sparse_irq_lock
that you suggested will solve this issue. The code looks like
this now. Is this okay?
diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
index 9409b55..f2ef76e 100644
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -185,14 +185,10 @@ void irq_migrate_all_off_this_cpu(void)
{
struct irq_desc *desc;
unsigned int irq;
-
+ irq_lock_sparse();
for_each_active_irq(irq) {
bool affinity_broken;
-
desc = irq_to_desc(irq);
- if (!desc)
- continue;
-
raw_spin_lock(&desc->lock);
affinity_broken = migrate_one_irq(desc);
raw_spin_unlock(&desc->lock);
@@ -202,6 +198,7 @@ void irq_migrate_all_off_this_cpu(void)
irq, smp_processor_id());
}
}
+ irq_unlock_sparse();
}
static void irq_restore_affinity_of_irq(struct irq_desc *desc, unsigned
int cpu)
> Thanks,
>
> tglx
Thanks,
Prakruthi Deepak Heragu
On Wed, 5 Sep 2018, [email protected] wrote:
> On 2018-09-05 11:23, Thomas Gleixner wrote:
> > On Wed, 5 Sep 2018, Prakruthi Deepak Heragu wrote:
> >
> > > One of the cores might have just allocated irq_desc() and other core
> > > might be doing irq migration in the hot plug path. In the hot plug path
> > > during the IRQ migration, for_each_active_irq macro is trying to get
> > > irqs whose bits are set in allocated_irqs bit map but there is no return
> > > value check after irq_to_desc for desc validity.
> >
> > Confused. All parts involved, irq allocation/deallocation and the CPU
> > hotplug code take sparse_irq_lock to prevent exavtly that.
> >
> Removing the NULL pointer check and adding this sparse_irq_lock
> that you suggested will solve this issue. The code looks like
> this now. Is this okay?
No, it's not okay at all. You _CANNOT_ take sparse_irq_lock() there.
I wrote that _ALL_ parts take that lock already. Care to look at the code?
takedown_cpu()
{
/*
* Prevent irq alloc/free while the dying cpu reorganizes the
* interrupt affinities.
*/
irq_lock_sparse();
/*
* So now all preempt/rcu users must observe !cpu_active().
*/
err = stop_machine_cpuslocked(take_cpu_down, NULL, cpumask_of(cpu));
and __cpu_disable() which in turn calls irq_migrate_all_off_this_cpu() is
called from take_cpu_down(). So this deadlocks in any case.
I have no idea what kind of modifications you have in your kernel which
make
1) The thing explode in the first place
2) Not immediately deadlock with that hack you just sent
and honestly I don't want to know at all.
Thanks,
tglx