Alexandra Kossovsky reported the following lockdep splat when testing an
out-of-tree version of sfc on 3.6-rt. The problem is specific to RT,
and we haven't tested anything later but I think it's still unfixed.
> ======================================================
> [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
> 3.6.11.2-rt33.39.el6rt.x86_64.debug #1 Tainted: G O
> ------------------------------------------------------
> insmod/3076 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> (&(&(&gcwq->lock)->lock)->wait_lock){+.+...}, at: [<ffffffff81589a78>] rt_spin_lock_slowlock+0x48/0x2f0
>
> and this task is already holding:
> (&irq_desc_lock_class){-.....}, at: [<ffffffff810ec226>] irq_set_affinity+0x46/0x80
irq_set_affinity() holds the irq_desc lock, and then schedules a work
item to call the IRQ affinity notifier.
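
Roughly, the sequence is the following; this paraphrases kernel/irq/manage.c
as of 3.6 (the same lines appear in the patch context further down):

int irq_set_affinity(unsigned int irq, const struct cpumask *mask)
{
        struct irq_desc *desc = irq_to_desc(irq);
        unsigned long flags;
        int ret;

        if (!desc)
                return -EINVAL;

        raw_spin_lock_irqsave(&desc->lock, flags);
        /*
         * __irq_set_affinity_locked() calls
         * schedule_work(&desc->affinity_notify->work) in here, i.e. the
         * workqueue lock is taken while desc->lock is still held.
         */
        ret = __irq_set_affinity_locked(irq_desc_get_irq_data(desc), mask);
        raw_spin_unlock_irqrestore(&desc->lock, flags);
        return ret;
}
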
> which would create a new lock dependency:
> (&irq_desc_lock_class){-.....} -> (&(&(&gcwq->lock)->lock)->wait_lock){+.+...}
>
> but this new dependency connects a HARDIRQ-irq-safe lock:
> (&irq_desc_lock_class){-.....}
> ... which became HARDIRQ-irq-safe at:
> [<ffffffff810adba2>] mark_irqflags+0x172/0x190
> [<ffffffff810af2c4>] __lock_acquire+0x344/0x4e0
> [<ffffffff810af4ea>] lock_acquire+0x8a/0x140
> [<ffffffff8158ac50>] _raw_spin_lock+0x40/0x80
> [<ffffffff810edfae>] handle_level_irq+0x1e/0x100
> [<ffffffff810044e1>] handle_irq+0x71/0x190
> [<ffffffff815943ad>] do_IRQ+0x5d/0xe0
> [<ffffffff8158b32c>] ret_from_intr+0x0/0x13
> [<ffffffff81d16d9d>] tsc_init+0x24/0x102
> [<ffffffff81d13d77>] x86_late_time_init+0xf/0x11
> [<ffffffff81d10cfe>] start_kernel+0x312/0x3c6
> [<ffffffff81d1032d>] x86_64_start_reservations+0x131/0x136
> [<ffffffff81d1041f>] x86_64_start_kernel+0xed/0xf4
Obviously, the irq_desc lock is taken in hard-IRQ context.
> to a HARDIRQ-irq-unsafe lock:
> (&(&(&gcwq->lock)->lock)->wait_lock){+.+...}
> ... which became HARDIRQ-irq-unsafe at:
> ... [<ffffffff810adb50>] mark_irqflags+0x120/0x190
> [<ffffffff810af2c4>] __lock_acquire+0x344/0x4e0
> [<ffffffff810af4ea>] lock_acquire+0x8a/0x140
> [<ffffffff8158ac50>] _raw_spin_lock+0x40/0x80
> [<ffffffff81589a78>] rt_spin_lock_slowlock+0x48/0x2f0
> [<ffffffff8158a1b6>] rt_spin_lock+0x16/0x40
> [<ffffffff8106cf99>] create_worker+0x69/0x220
> [<ffffffff81d2bae5>] init_workqueues+0x24b/0x3f8
> [<ffffffff810001c2>] do_one_initcall+0x42/0x170
> [<ffffffff81d10775>] kernel_init+0xe5/0x17a
> [<ffffffff81593c24>] kernel_thread_helper+0x4/0x10
[...]
Workqueue code uses spin_lock_irq() on the workqueue lock, which with
PREEMPT_RT enabled doesn't actually block IRQs.
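
To spell out why lockdep objects: on RT the spin_lock_irq() in the
workqueue code reduces to roughly the following (a hedged sketch of the
-rt lock substitution, not the literal implementation):

/* !PREEMPT_RT: interrupts really are disabled around the queueing. */
spin_lock_irq(&gcwq->lock);

/*
 * PREEMPT_RT: gcwq->lock is an rtmutex-based sleeping lock, so the same
 * spin_lock_irq() boils down to rt_spin_lock() and leaves hard interrupts
 * enabled.  Its slow path (rt_spin_lock_slowlock() in the trace above)
 * takes the rtmutex's internal wait_lock with a plain raw_spin_lock(),
 * which is why lockdep marks that wait_lock HARDIRQ-unsafe and objects to
 * taking it under the HARDIRQ-safe irq_desc lock.
 */
rt_spin_lock(&gcwq->lock);
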
In 3.6, the irq_cpu_rmap functions rely on a workqueue flush to finish
any outstanding notifications before freeing the cpu_rmap that they use.
This won't be reliable if the notification is scheduled after releasing
the irq_desc lock.
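
To make the lifetime assumption concrete, here is a hedged sketch of the
driver-side teardown that depends on it (my_drv and its field are made-up
names; free_irq_cpu_rmap() is the real 3.6 API):

#include <linux/cpu_rmap.h>

struct my_drv {                         /* hypothetical driver state */
        struct cpu_rmap *rx_cpu_rmap;
};

static void my_drv_remove(struct my_drv *drv)
{
        /*
         * In 3.6, free_irq_cpu_rmap() unregisters the affinity notifiers
         * and relies on a flush of the scheduled notification work before
         * the rmap is freed.  That is only safe if every notification was
         * queued while desc->lock was still held; a schedule_work() issued
         * after dropping desc->lock could slip in after the flush and
         * touch the freed rmap.
         */
        free_irq_cpu_rmap(drv->rx_cpu_rmap);
        drv->rx_cpu_rmap = NULL;
}
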
However, following commit 896f97ea95c1 ('lib: cpu_rmap: avoid flushing
all workqueues') in 3.8, I think that it is sufficient to do only
kref_get(&desc->affinity_notify->kref) in __irq_set_affinity_locked()
and then call schedule_work() in irq_set_affinity() after releasing the
lock. Something like this (untested):
diff --git a/arch/mips/cavium-octeon/octeon-irq.c b/arch/mips/cavium-octeon/octeon-irq.c
index 9d36774..0406481 100644
--- a/arch/mips/cavium-octeon/octeon-irq.c
+++ b/arch/mips/cavium-octeon/octeon-irq.c
@@ -635,7 +635,8 @@ static void octeon_irq_cpu_offline_ciu(struct irq_data *data)
cpumask_clear(&new_affinity);
cpumask_set_cpu(cpumask_first(cpu_online_mask), &new_affinity);
}
- __irq_set_affinity_locked(data, &new_affinity);
+ /* XXX No-one else calls this; why does this chip need it? */
+ __irq_set_affinity_locked(data, &new_affinity, NULL);
}
static int octeon_irq_ciu_set_affinity(struct irq_data *data,
diff --git a/include/linux/irq.h b/include/linux/irq.h
index f04d3ba..de992f4 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -380,7 +380,9 @@ extern void remove_percpu_irq(unsigned int irq, struct irqaction *act);
extern void irq_cpu_online(void);
extern void irq_cpu_offline(void);
-extern int __irq_set_affinity_locked(struct irq_data *data, const struct cpumask *cpumask);
+extern int __irq_set_affinity_locked(struct irq_data *data,
+ const struct cpumask *cpumask,
+ struct irq_affinity_notify **notify);
#ifdef CONFIG_GENERIC_HARDIRQS
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 514bcfd..157afa2 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -162,12 +162,16 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
return ret;
}
-int __irq_set_affinity_locked(struct irq_data *data, const struct cpumask *mask)
+int __irq_set_affinity_locked(struct irq_data *data, const struct cpumask *mask,
+ struct irq_affinity_notify **notify)
{
struct irq_chip *chip = irq_data_get_irq_chip(data);
struct irq_desc *desc = irq_data_to_desc(data);
int ret = 0;
+ if (notify)
+ *notify = NULL;
+
if (!chip || !chip->irq_set_affinity)
return -EINVAL;
@@ -178,9 +182,9 @@ int __irq_set_affinity_locked(struct irq_data *data, const struct cpumask *mask)
irq_copy_pending(desc, mask);
}
- if (desc->affinity_notify) {
+ if (notify && desc->affinity_notify) {
kref_get(&desc->affinity_notify->kref);
- schedule_work(&desc->affinity_notify->work);
+ *notify = desc->affinity_notify;
}
irqd_set(data, IRQD_AFFINITY_SET);
@@ -196,6 +200,7 @@ int __irq_set_affinity_locked(struct irq_data *data, const struct cpumask *mask)
int irq_set_affinity(unsigned int irq, const struct cpumask *mask)
{
struct irq_desc *desc = irq_to_desc(irq);
+ struct irq_affinity_notify *notify;
unsigned long flags;
int ret;
@@ -203,8 +208,13 @@ int irq_set_affinity(unsigned int irq, const struct cpumask *mask)
return -EINVAL;
raw_spin_lock_irqsave(&desc->lock, flags);
- ret = __irq_set_affinity_locked(irq_desc_get_irq_data(desc), mask);
+ ret = __irq_set_affinity_locked(irq_desc_get_irq_data(desc), mask,
+ &notify);
raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+ if (notify)
+ schedule_work(&notify->work);
+
return ret;
}
--- END ---
Ben.
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
Sebastian, I saw you came up with a fix for this but apparently without
seeing my earlier message:
On Thu, 2013-07-25 at 00:31 +0100, Ben Hutchings wrote:
> Alexandra Kossovsky reported the following lockdep splat when testing an
> out-of-tree version of sfc on 3.6-rt. The problem is specific to RT,
> and we haven't tested anything later but I think it's still unfixed.
>
> > ======================================================
> > [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
> > 3.6.11.2-rt33.39.el6rt.x86_64.debug #1 Tainted: G O
> > ------------------------------------------------------
> > insmod/3076 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> > (&(&(&gcwq->lock)->lock)->wait_lock){+.+...}, at: [<ffffffff81589a78>] rt_spin_lock_slowlock+0x48/0x2f0
> >
> > and this task is already holding:
> > (&irq_desc_lock_class){-.....}, at: [<ffffffff810ec226>] irq_set_affinity+0x46/0x80
>
> irq_set_affinity() holds the irq_desc lock, and then schedules a work
> item to call the IRQ affinity notifier.
>
> > which would create a new lock dependency:
> > (&irq_desc_lock_class){-.....} -> (&(&(&gcwq->lock)->lock)->wait_lock){+.+...}
> >
> > but this new dependency connects a HARDIRQ-irq-safe lock:
> > (&irq_desc_lock_class){-.....}
> > ... which became HARDIRQ-irq-safe at:
> > [<ffffffff810adba2>] mark_irqflags+0x172/0x190
> > [<ffffffff810af2c4>] __lock_acquire+0x344/0x4e0
> > [<ffffffff810af4ea>] lock_acquire+0x8a/0x140
> > [<ffffffff8158ac50>] _raw_spin_lock+0x40/0x80
> > [<ffffffff810edfae>] handle_level_irq+0x1e/0x100
> > [<ffffffff810044e1>] handle_irq+0x71/0x190
> > [<ffffffff815943ad>] do_IRQ+0x5d/0xe0
> > [<ffffffff8158b32c>] ret_from_intr+0x0/0x13
> > [<ffffffff81d16d9d>] tsc_init+0x24/0x102
> > [<ffffffff81d13d77>] x86_late_time_init+0xf/0x11
> > [<ffffffff81d10cfe>] start_kernel+0x312/0x3c6
> > [<ffffffff81d1032d>] x86_64_start_reservations+0x131/0x136
> > [<ffffffff81d1041f>] x86_64_start_kernel+0xed/0xf4
>
> Obviously, the irq_desc lock is taken in hard-IRQ context.
>
> > to a HARDIRQ-irq-unsafe lock:
> > (&(&(&gcwq->lock)->lock)->wait_lock){+.+...}
> > ... which became HARDIRQ-irq-unsafe at:
> > ... [<ffffffff810adb50>] mark_irqflags+0x120/0x190
> > [<ffffffff810af2c4>] __lock_acquire+0x344/0x4e0
> > [<ffffffff810af4ea>] lock_acquire+0x8a/0x140
> > [<ffffffff8158ac50>] _raw_spin_lock+0x40/0x80
> > [<ffffffff81589a78>] rt_spin_lock_slowlock+0x48/0x2f0
> > [<ffffffff8158a1b6>] rt_spin_lock+0x16/0x40
> > [<ffffffff8106cf99>] create_worker+0x69/0x220
> > [<ffffffff81d2bae5>] init_workqueues+0x24b/0x3f8
> > [<ffffffff810001c2>] do_one_initcall+0x42/0x170
> > [<ffffffff81d10775>] kernel_init+0xe5/0x17a
> > [<ffffffff81593c24>] kernel_thread_helper+0x4/0x10
> [...]
>
> Workqueue code uses spin_lock_irq() on the workqueue lock, which with
> PREEMPT_RT enabled doesn't actually block IRQs.
>
> In 3.6, the irq_cpu_rmap functions rely on a workqueue flush to finish
> any outstanding notifications before freeing the cpu_rmap that they use.
> This won't be reliable if the notification is scheduled after releasing
> the irq_desc lock.
>
> However, following commit 896f97ea95c1 ('lib: cpu_rmap: avoid flushing
> all workqueues') in 3.8, I think that it is sufficient to do only
> kref_get(&desc->affinity_notify->kref) in __irq_set_affinity_locked()
> and then call schedule_work() in irq_set_affinity() after releasing the
> lock. Something like this (untested):
Does the following make sense to you?
Ben.
> diff --git a/arch/mips/cavium-octeon/octeon-irq.c b/arch/mips/cavium-octeon/octeon-irq.c
> index 9d36774..0406481 100644
> --- a/arch/mips/cavium-octeon/octeon-irq.c
> +++ b/arch/mips/cavium-octeon/octeon-irq.c
> @@ -635,7 +635,8 @@ static void octeon_irq_cpu_offline_ciu(struct irq_data *data)
> cpumask_clear(&new_affinity);
> cpumask_set_cpu(cpumask_first(cpu_online_mask), &new_affinity);
> }
> - __irq_set_affinity_locked(data, &new_affinity);
> + /* XXX No-one else calls this; why does this chip need it? */
> + __irq_set_affinity_locked(data, &new_affinity, NULL);
> }
>
> static int octeon_irq_ciu_set_affinity(struct irq_data *data,
> diff --git a/include/linux/irq.h b/include/linux/irq.h
> index f04d3ba..de992f4 100644
> --- a/include/linux/irq.h
> +++ b/include/linux/irq.h
> @@ -380,7 +380,9 @@ extern void remove_percpu_irq(unsigned int irq, struct irqaction *act);
>
> extern void irq_cpu_online(void);
> extern void irq_cpu_offline(void);
> -extern int __irq_set_affinity_locked(struct irq_data *data, const struct cpumask *cpumask);
> +extern int __irq_set_affinity_locked(struct irq_data *data,
> + const struct cpumask *cpumask,
> + struct irq_affinity_notify **notify);
>
> #ifdef CONFIG_GENERIC_HARDIRQS
>
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 514bcfd..157afa2 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -162,12 +162,16 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
> return ret;
> }
>
> -int __irq_set_affinity_locked(struct irq_data *data, const struct cpumask *mask)
> +int __irq_set_affinity_locked(struct irq_data *data, const struct cpumask *mask,
> + struct irq_affinity_notify **notify)
> {
> struct irq_chip *chip = irq_data_get_irq_chip(data);
> struct irq_desc *desc = irq_data_to_desc(data);
> int ret = 0;
>
> + if (notify)
> + *notify = NULL;
> +
> if (!chip || !chip->irq_set_affinity)
> return -EINVAL;
>
> @@ -178,9 +182,9 @@ int __irq_set_affinity_locked(struct irq_data *data, const struct cpumask *mask)
> irq_copy_pending(desc, mask);
> }
>
> - if (desc->affinity_notify) {
> + if (notify && desc->affinity_notify) {
> kref_get(&desc->affinity_notify->kref);
> - schedule_work(&desc->affinity_notify->work);
> + *notify = desc->affinity_notify;
> }
> irqd_set(data, IRQD_AFFINITY_SET);
>
> @@ -196,6 +200,7 @@ int __irq_set_affinity_locked(struct irq_data *data, const struct cpumask *mask)
> int irq_set_affinity(unsigned int irq, const struct cpumask *mask)
> {
> struct irq_desc *desc = irq_to_desc(irq);
> + struct irq_affinity_notify *notify;
> unsigned long flags;
> int ret;
>
> @@ -203,8 +208,13 @@ int irq_set_affinity(unsigned int irq, const struct cpumask *mask)
> return -EINVAL;
>
> raw_spin_lock_irqsave(&desc->lock, flags);
> - ret = __irq_set_affinity_locked(irq_desc_get_irq_data(desc), mask);
> + ret = __irq_set_affinity_locked(irq_desc_get_irq_data(desc), mask,
> + &notify);
> raw_spin_unlock_irqrestore(&desc->lock, flags);
> +
> + if (notify)
> + schedule_work(&notify->work);
> +
> return ret;
> }
>
> --- END ---
>
> Ben.
>
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
On 08/30/2013 11:29 PM, Ben Hutchings wrote:
> Sebastian, I saw you came up with a fix for this but apparently without
> seeing my earlier message:
Yes Ben, I haven't seen it. If I was on Cc, I am very sorry for overlooking
it.
> On Thu, 2013-07-25 at 00:31 +0100, Ben Hutchings wrote:
>> Workqueue code uses spin_lock_irq() on the workqueue lock, which with
>> PREEMPT_RT enabled doesn't actually block IRQs.
>>
>> In 3.6, the irq_cpu_rmap functions rely on a workqueue flush to finish
>> any outstanding notifications before freeing the cpu_rmap that they use.
>> This won't be reliable if the notification is scheduled after releasing
>> the irq_desc lock.
>>
>> However, following commit 896f97ea95c1 ('lib: cpu_rmap: avoid flushing
>> all workqueues') in 3.8, I think that it is sufficient to do only
>> kref_get(&desc->affinity_notify->kref) in __irq_set_affinity_locked()
>> and then call schedule_work() in irq_set_affinity() after releasing the
>> lock. Something like this (untested):
>
> Does the following make sense to you?
This was suggested by the original submitter on [email protected] (Joe
Korty), which is where I was made aware of this for the first time. This is
okay except for the part where the work item is not scheduled when called
via the __ function (i.e. the MIPS case). If I read the code correctly, when
the CPU goes offline the affinity change should still be applied and the
users notified, and that is not the case with this patch.

It is a valid question why only one MIPS SoC needs the function. If you
look at commit 0c3263870f ("MIPS: Octeon: Rewrite interrupt handling
code.") you can see that tglx himself made this adjustment, so it might
be valid :) Therefore I assume that we may get more callers of this
function and that the work item should still be executed, so I made
something simple that works on RT.
>
> Ben.
>
Sebastian
On Mon, 2013-09-23 at 14:58 +0200, Sebastian Andrzej Siewior wrote:
> On 08/30/2013 11:29 PM, Ben Hutchings wrote:
> > Sebastian, I saw you came up with a fix for this but apparently without
> > seeing my earlier message:
>
> Yes Ben, I haven't seen it. If I was on Cc, I am very sorry for overlooking
> it.
You weren't, as I didn't realise you were maintaining the RT patch set
then.
> > On Thu, 2013-07-25 at 00:31 +0100, Ben Hutchings wrote:
>
> >> Workqueue code uses spin_lock_irq() on the workqueue lock, which with
> >> PREEMPT_RT enabled doesn't actually block IRQs.
> >>
> >> In 3.6, the irq_cpu_rmap functions rely on a workqueue flush to finish
> >> any outstanding notifications before freeing the cpu_rmap that they use.
> >> This won't be reliable if the notification is scheduled after releasing
> >> the irq_desc lock.
> >>
> >> However, following commit 896f97ea95c1 ('lib: cpu_rmap: avoid flushing
> >> all workqueues') in 3.8, I think that it is sufficient to do only
> >> kref_get(&desc->affinity_notify->kref) in __irq_set_affinity_locked()
> >> and then call schedule_work() in irq_set_affinity() after releasing the
> >> lock. Something like this (untested):
> >
> > Does the following make sense to you?
>
> This was suggested by the original submitter on [email protected] (Joe
> Korty), which is where I was made aware of this for the first time. This is
> okay except for the part where the work item is not scheduled when called
> via the __ function (i.e. the MIPS case). If I read the code correctly, when
> the CPU goes offline the affinity change should still be applied and the
> users notified, and that is not the case with this patch.
>
> It is a valid question why only one MIPS SoC needs the function. If you
> look at commit 0c3263870f ("MIPS: Octeon: Rewrite interrupt handling
> code.") you can see that tglx himself made this adjustment, so it might
> be valid :) Therefore I assume that we may get more callers of this
> function and that the work item should still be executed, so I made
> something simple that works on RT.
Right, but it looks quite strange to have this thread just for a
(probably) quite rare event. Maybe some day someone will work out how
to make Octeon's IRQ management less unusual and then you can use the
simpler approach for RT...
Ben.
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
On 10/01/2013 01:44 AM, Ben Hutchings wrote:
> Right, but it looks quite strange to have this thread just for a
> (probably) quite rare event. Maybe some day someone will work out how
> to make Octeon's IRQ management less unusual and then you can use the
> simpler approach for RT...
We have three threads for some rare events by now, if I count correctly.
That is:
- mce check on x86 (mce_notify_helper_thread)
- this one
- clock_was_set_delayed() (pending)
So right now I am thinking about a workqueue for rare events, because if
I keep doing this we will end up with a bunch of useless tasks for something
that might never occur.
> Ben.
Sebastian