From: Alexander Sverdlin <[email protected]>
Seems that the discovered race was here since adapting the PowerPC code
into genirq in 2012. I was surprized it went unnoticed all this time, but
it turns out, device registration (being it DT or ACPI) is mostly
sequential in kernel. In our case probe deferring was involved and another
independent probe(), so that the fatal sequence probably looked like
following:
CPU0 CPU1
// probe() of irq-user-device
// irq_domain not found =>
// -EPROBE_DEFER
// probe of irq_domain, // probe() of some unrelated device
// including typical in // leading to re-try of
// drivers/irqchip pattern: // irq-user-device probe(), calling
irq_domain_create() of_irq_get()
for (...) irq_create_mapping() => irq_create_mapping()
=> irq_find_mapping() => irq_find_mapping()
=> irq_domain_alloc_descs() => irq_domain_alloc_descs()
=> irq_domain_associate()
=> irq_domain_associate()
irq_domain_associate() uses irq_domain_mutex internally, this doesn't help
however, because parallel calls to irq_create_mapping() return different
virq numbers and either the irq_domain driver is not able to perform
necessary configuration or the mapping the "slave" driver has got is being
overwritten by the controller driver and the virq "slave" got is never
triggered. The test code [1] reproduces the simplified scenario, which
doesn't involve real deferred probe.
As irq_find_mapping() is used in interrupt handlers it cannot take locks.
I didn't want to change the semantics of the exported
irq_domain_associate() (to remove the mutex from it and wrap
irq_find_mapping()-irq_domain_associate() pair into this mutex). Therefore
irq_domain_associate() was modified to detect the collisions and the
callers were modified to deal with this.
The race in irq_create_fwspec_mapping() is distinct, but has the same
nature: making a decision based on the result of unprotected
irq_find_mapping() call.
Alexander Sverdlin (3):
genirq/irqdomain: Check for existing mapping in irq_domain_associate()
genirq/irqdomain: Re-check mapping after associate in
irq_create_mapping()
genirq/irqdomain: Detect type race in irq_create_fwspec_mapping()
kernel/irq/irqdomain.c | 90 ++++++++++++++++++++++++++++++--------------------
1 file changed, 54 insertions(+), 36 deletions(-)
[1] Test kernel module:
// SPDX-License-Identifier: GPL-2.0
/**********************************************************************
* Author: Alexander Sverdlin <[email protected]>
*
* Copyright (c) 2019 Nokia
*
* This module is intended to test the discovered race condition in
* irq_create_mapping() function. The typical use case is when
* irq_create_mapping() is being called from an IRQ controller driver
* and of_irq_get() is used in a device driver requesting an IRQ from
* the particular IRQ controller. But of_irq_get() is calling
* irq_create_mapping() internally and this is the genuine race.
**********************************************************************/
#include <linux/atomic.h>
#include <linux/irqdomain.h>
#include <linux/kernel.h>
#include <linux/kthread.h>
#include <linux/module.h>
static unsigned iterations = 1000;
module_param(iterations, uint, S_IRUGO);
static struct irq_domain *domain;
static atomic_t sync_begin = ATOMIC_INIT(0);
static atomic_t sync_end = ATOMIC_INIT(0);
static int icm_thread(void *data)
{
unsigned *res = data;
while (!kthread_should_stop()) {
/* Wait until main thread signals start */
if (!atomic_read(&sync_begin))
continue;
*res = irq_create_mapping(domain, 0);
smp_mb__before_atomic();
atomic_dec(&sync_begin);
/* Wait until all test threads attepted the mapping */
while (!kthread_should_stop())
if (!atomic_read(&sync_begin))
break;
atomic_dec(&sync_end);
}
return 0;
}
static int irq_create_mapping_race_init(void)
{
const static struct irq_domain_ops ops;
struct task_struct *thr[2];
unsigned irq[ARRAY_SIZE(thr)];
unsigned i;
int ret = 0;
domain = irq_domain_create_linear(NULL, 2, &ops, NULL);
if (!domain) {
pr_err("irq_domain_create_linear() failed\n");
return -EINVAL;
}
for (i = 0; i < ARRAY_SIZE(thr); i++)
thr[i] = kthread_run(icm_thread, &irq[i], "icm_thr%u", i);
for (i = 0; i < iterations; i++) {
unsigned j;
/* Signal test threads to attempt the mapping */
atomic_add(ARRAY_SIZE(thr), &sync_end);
atomic_add(ARRAY_SIZE(thr), &sync_begin);
/* Wait until all threads have made an iteration */
while (atomic_read(&sync_end))
cpu_relax();
smp_mb__after_atomic();
for (j = 0; j < ARRAY_SIZE(thr); j++)
if (!irq[j]) {
pr_err("%s: %s got no virq\n", domain->name,
thr[j]->comm);
ret = -ENOENT;
goto stop_threads;
}
for (j = 1; j < ARRAY_SIZE(thr); j++)
if (irq[0] != irq[j]) {
pr_err("%s: %s got virq %u and %s got virq %u\n",
domain->name, thr[0]->comm, irq[0],
thr[j]->comm, irq[j]);
ret = -EINVAL;
goto stop_threads;
}
irq_dispose_mapping(irq[0]);
}
stop_threads:
for (i = 0; i < ARRAY_SIZE(thr); i++)
kthread_stop(thr[i]);
irq_domain_remove(domain);
return ret;
}
module_init(irq_create_mapping_race_init);
static void __exit irq_create_mapping_race_exit(void)
{
}
module_exit(irq_create_mapping_race_exit);
MODULE_AUTHOR("Alexander Sverdlin <[email protected]>");
MODULE_DESCRIPTION("irq_create_mapping() race test module");
MODULE_LICENSE("GPL");
From: Alexander Sverdlin <[email protected]>
irq_domain_associate() is the only place where irq_find_mapping() can be
used reliably (under irq_domain_mutex) to make a decision if the mapping
shall be created or not. Other calls to irq_find_mapping() (not under
any lock) cannot be used for this purpose and lead to race conditions in
particular inside irq_create_mapping().
Give the callers of irq_domain_associate() an ability to detect existing
domain reliably by examining the return value.
Cc: [email protected]
Signed-off-by: Alexander Sverdlin <[email protected]>
---
kernel/irq/irqdomain.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 3078d0e..7bc07b6 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -532,6 +532,7 @@ int irq_domain_associate(struct irq_domain *domain, unsigned int virq,
irq_hw_number_t hwirq)
{
struct irq_data *irq_data = irq_get_irq_data(virq);
+ unsigned int eirq;
int ret;
if (WARN(hwirq >= domain->hwirq_max,
@@ -543,6 +544,16 @@ int irq_domain_associate(struct irq_domain *domain, unsigned int virq,
return -EINVAL;
mutex_lock(&irq_domain_mutex);
+
+ /* Check if mapping already exists */
+ eirq = irq_find_mapping(domain, hwirq);
+ if (eirq) {
+ mutex_unlock(&irq_domain_mutex);
+ pr_debug("%s: conflicting mapping for hwirq 0x%x\n",
+ domain->name, (int)hwirq);
+ return -EBUSY;
+ }
+
irq_data->hwirq = hwirq;
irq_data->domain = domain;
if (domain->ops->map) {
--
2.4.6
From: Alexander Sverdlin <[email protected]>
irq_create_fwspec_mapping() can race with itself during IRQ trigger type
configuration. Possible scenarios include:
- Mapping exists, two irq_create_fwspec_mapping() running in parallel do
not detect type mismatch, IRQ remains configured with one of the
different trigger types randomly
- Second call to irq_create_fwspec_mapping() sees existing mapping just
created by first call, but earlier irqd_set_trigger_type() call races
with later irqd_set_trigger_type() => totally undetected, IRQ type
is being set randomly to either one or another type
Introduce helper function to detect parallel changes to IRQ type.
Cc: [email protected]
Signed-off-by: Alexander Sverdlin <[email protected]>
---
kernel/irq/irqdomain.c | 66 +++++++++++++++++++++++++++++---------------------
1 file changed, 38 insertions(+), 28 deletions(-)
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 176f2cc..af4d30c 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -764,10 +764,45 @@ static void of_phandle_args_to_fwspec(struct device_node *np, const u32 *args,
fwspec->param[i] = args[i];
}
+/* Detect races during IRQ type setting */
+static int irq_set_trigger_type_locked(unsigned int virq, unsigned int type,
+ irq_hw_number_t hwirq,
+ const struct irq_fwspec *fwspec)
+{
+ struct irq_data *irq_data;
+ int ret = 0;
+
+ mutex_lock(&irq_domain_mutex);
+ /*
+ * If the trigger type is not specified or matches the current trigger
+ * type then we are done.
+ */
+ if (type == IRQ_TYPE_NONE || type == irq_get_trigger_type(virq))
+ goto unlock;
+
+ /* If the trigger type has not been set yet, then set it now */
+ if (irq_get_trigger_type(virq) != IRQ_TYPE_NONE) {
+ pr_warn("type mismatch, failed to map hwirq-%lu for %s!\n",
+ hwirq, of_node_full_name(to_of_node(fwspec->fwnode)));
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ irq_data = irq_get_irq_data(virq);
+ if (!irq_data) {
+ ret = -ENOENT;
+ goto unlock;
+ }
+ irqd_set_trigger_type(irq_data, type);
+
+unlock:
+ mutex_unlock(&irq_domain_mutex);
+ return ret;
+}
+
unsigned int irq_create_fwspec_mapping(struct irq_fwspec *fwspec)
{
struct irq_domain *domain;
- struct irq_data *irq_data;
irq_hw_number_t hwirq;
unsigned int type = IRQ_TYPE_NONE;
int virq;
@@ -802,29 +837,8 @@ unsigned int irq_create_fwspec_mapping(struct irq_fwspec *fwspec)
*/
virq = irq_find_mapping(domain, hwirq);
if (virq) {
- /*
- * If the trigger type is not specified or matches the
- * current trigger type then we are done so return the
- * interrupt number.
- */
- if (type == IRQ_TYPE_NONE || type == irq_get_trigger_type(virq))
- return virq;
-
- /*
- * If the trigger type has not been set yet, then set
- * it now and return the interrupt number.
- */
- if (irq_get_trigger_type(virq) == IRQ_TYPE_NONE) {
- irq_data = irq_get_irq_data(virq);
- if (!irq_data)
- return 0;
-
- irqd_set_trigger_type(irq_data, type);
+ if (!irq_set_trigger_type_locked(virq, type, hwirq, fwspec))
return virq;
- }
-
- pr_warn("type mismatch, failed to map hwirq-%lu for %s!\n",
- hwirq, of_node_full_name(to_of_node(fwspec->fwnode)));
return 0;
}
@@ -839,8 +853,7 @@ unsigned int irq_create_fwspec_mapping(struct irq_fwspec *fwspec)
return virq;
}
- irq_data = irq_get_irq_data(virq);
- if (!irq_data) {
+ if (irq_set_trigger_type_locked(virq, type, hwirq, fwspec)) {
if (irq_domain_is_hierarchy(domain))
irq_domain_free_irqs(virq, 1);
else
@@ -848,9 +861,6 @@ unsigned int irq_create_fwspec_mapping(struct irq_fwspec *fwspec)
return 0;
}
- /* Store trigger type */
- irqd_set_trigger_type(irq_data, type);
-
return virq;
}
EXPORT_SYMBOL_GPL(irq_create_fwspec_mapping);
--
2.4.6
From: Alexander Sverdlin <[email protected]>
If two irq_create_mapping() calls perform a mapping of the same hwirq on
two CPU cores in parallel they both will get 0 from irq_find_mapping(),
both will allocate unique virq using irq_domain_alloc_descs() and both
will finally irq_domain_associate() it. Giving different virq numbers
to their callers.
In practice the first caller is usually an interrupt controller driver and
the seconds is some device requesting the interrupt providede by the above
interrupt controller.
In this case either the interrupt controller driver configures virq which
is not the one being "associated" with hwirq, or the "slave" device
requests the virq which is never being triggered.
Cc: [email protected]
Signed-off-by: Alexander Sverdlin <[email protected]>
---
kernel/irq/irqdomain.c | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 7bc07b6..176f2cc 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -675,13 +675,6 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
of_node = irq_domain_get_of_node(domain);
- /* Check if mapping already exists */
- virq = irq_find_mapping(domain, hwirq);
- if (virq) {
- pr_debug("-> existing mapping on virq %d\n", virq);
- return virq;
- }
-
/* Allocate a virtual interrupt number */
virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), NULL);
if (virq <= 0) {
@@ -691,7 +684,11 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
if (irq_domain_associate(domain, virq, hwirq)) {
irq_free_desc(virq);
- return 0;
+
+ virq = irq_find_mapping(domain, hwirq);
+ if (virq)
+ pr_debug("-> existing mapping on virq %d\n", virq);
+ return virq;
}
pr_debug("irq %lu on domain %s mapped to virtual irq %u\n",
--
2.4.6
On 12/09/2019 10:44, Sverdlin, Alexander (Nokia - DE/Ulm) wrote:
> From: Alexander Sverdlin <[email protected]>
>
> irq_create_fwspec_mapping() can race with itself during IRQ trigger type
> configuration. Possible scenarios include:
>
> - Mapping exists, two irq_create_fwspec_mapping() running in parallel do
> not detect type mismatch, IRQ remains configured with one of the
> different trigger types randomly
> - Second call to irq_create_fwspec_mapping() sees existing mapping just
> created by first call, but earlier irqd_set_trigger_type() call races
> with later irqd_set_trigger_type() => totally undetected, IRQ type
> is being set randomly to either one or another type
Is that an actual thing? Frankly, the scenario you're describing here
seems to carry the hallmarks of a completely broken system. Can you
point at a system supported in mainline that would behave as such?
Thanks,
M.
--
Jazz is not dead, it just smells funny...
Hi Marc,
On 20/09/2019 17:52, Marc Zyngier wrote:
>> If two irq_create_mapping() calls perform a mapping of the same hwirq on
>> two CPU cores in parallel they both will get 0 from irq_find_mapping(),
>> both will allocate unique virq using irq_domain_alloc_descs() and both
>> will finally irq_domain_associate() it. Giving different virq numbers
>> to their callers.
>>
>> In practice the first caller is usually an interrupt controller driver and
>> the seconds is some device requesting the interrupt providede by the above
>> interrupt controller.
> I disagree with this "In practice". An irqchip controller should *very
> rarely* call irq_create_mapping on its own. It usually indicates some
> level of brokenness, unless the mapped interrupt is exposed by the
> irqchip itself (the GIC maintenance interrupt, for example).
I also didn't understand the reason the irqchip in question calls
irq_create_mapping(), but as 9 upstream irqchips do this as well I was
not really interested in the reasons for this.
>> In this case either the interrupt controller driver configures virq which
>> is not the one being "associated" with hwirq, or the "slave" device
>> requests the virq which is never being triggered.
> Why should the interrupt controller configure that interrupt? On any
> sane platform, the mapping should be created by the user of the
> interrupt, and not by the provider.
>
> This doesn't mean we shouldn't fix the irqdomain races, but I tend to
> disagree with the analysis here.
That's in fact what happens in our case and may happen with 9 upstream
irqchips as well. Same race would however happen with any IRQ client
driver calling of_irq_get(), if they share same HW IRQ line.
--
Best regards,
Alexander Sverdlin.
On 12/09/2019 10:44, Sverdlin, Alexander (Nokia - DE/Ulm) wrote:
> From: Alexander Sverdlin <[email protected]>
>
> irq_domain_associate() is the only place where irq_find_mapping() can be
> used reliably (under irq_domain_mutex) to make a decision if the mapping
> shall be created or not. Other calls to irq_find_mapping() (not under
> any lock) cannot be used for this purpose and lead to race conditions in
> particular inside irq_create_mapping().
>
> Give the callers of irq_domain_associate() an ability to detect existing
> domain reliably by examining the return value.
>
> Cc: [email protected]
> Signed-off-by: Alexander Sverdlin <[email protected]>
> ---
> kernel/irq/irqdomain.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
> index 3078d0e..7bc07b6 100644
> --- a/kernel/irq/irqdomain.c
> +++ b/kernel/irq/irqdomain.c
> @@ -532,6 +532,7 @@ int irq_domain_associate(struct irq_domain *domain, unsigned int virq,
> irq_hw_number_t hwirq)
> {
> struct irq_data *irq_data = irq_get_irq_data(virq);
> + unsigned int eirq;
> int ret;
>
> if (WARN(hwirq >= domain->hwirq_max,
> @@ -543,6 +544,16 @@ int irq_domain_associate(struct irq_domain *domain, unsigned int virq,
> return -EINVAL;
>
> mutex_lock(&irq_domain_mutex);
> +
> + /* Check if mapping already exists */
> + eirq = irq_find_mapping(domain, hwirq);
> + if (eirq) {
nit: Just have
if (irq_find_mapping(...)) {
and get rid the extra variable, given that it's not used for anything.
> + mutex_unlock(&irq_domain_mutex);
> + pr_debug("%s: conflicting mapping for hwirq 0x%x\n",
> + domain->name, (int)hwirq);
> + return -EBUSY;
I'm overall OK with this, although I'd rather we return -EEXIST rather
than -EBUSY.
> + }
> +
> irq_data->hwirq = hwirq;
> irq_data->domain = domain;
> if (domain->ops->map) {
>
Thanks,
M.
--
Jazz is not dead, it just smells funny...
On 12/09/2019 10:44, Sverdlin, Alexander (Nokia - DE/Ulm) wrote:
> From: Alexander Sverdlin <[email protected]>
>
> If two irq_create_mapping() calls perform a mapping of the same hwirq on
> two CPU cores in parallel they both will get 0 from irq_find_mapping(),
> both will allocate unique virq using irq_domain_alloc_descs() and both
> will finally irq_domain_associate() it. Giving different virq numbers
> to their callers.
>
> In practice the first caller is usually an interrupt controller driver and
> the seconds is some device requesting the interrupt providede by the above
> interrupt controller.
I disagree with this "In practice". An irqchip controller should *very
rarely* call irq_create_mapping on its own. It usually indicates some
level of brokenness, unless the mapped interrupt is exposed by the
irqchip itself (the GIC maintenance interrupt, for example).
> In this case either the interrupt controller driver configures virq which
> is not the one being "associated" with hwirq, or the "slave" device
> requests the virq which is never being triggered.
Why should the interrupt controller configure that interrupt? On any
sane platform, the mapping should be created by the user of the
interrupt, and not by the provider.
This doesn't mean we shouldn't fix the irqdomain races, but I tend to
disagree with the analysis here.
>
> Cc: [email protected]
> Signed-off-by: Alexander Sverdlin <[email protected]>
> ---
> kernel/irq/irqdomain.c | 13 +++++--------
> 1 file changed, 5 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
> index 7bc07b6..176f2cc 100644
> --- a/kernel/irq/irqdomain.c
> +++ b/kernel/irq/irqdomain.c
> @@ -675,13 +675,6 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
>
> of_node = irq_domain_get_of_node(domain);
>
> - /* Check if mapping already exists */
> - virq = irq_find_mapping(domain, hwirq);
> - if (virq) {
> - pr_debug("-> existing mapping on virq %d\n", virq);
> - return virq;
> - }
> -
> /* Allocate a virtual interrupt number */
> virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), NULL);
> if (virq <= 0) {
> @@ -691,7 +684,11 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
>
> if (irq_domain_associate(domain, virq, hwirq)) {
> irq_free_desc(virq);
> - return 0;
> +
> + virq = irq_find_mapping(domain, hwirq);
> + if (virq)
> + pr_debug("-> existing mapping on virq %d\n", virq);
I'd rather you limit this second irq_find_mapping() to cases where we're
sure we've found a duplicate:
ret = irq_domain_associate(domain, virq, hwirq);
if (ret) {
irq_free_desc(virq);
if (ret == -EEXIST)
return irq_find_mapping(domain, hwirq);
return 0;
}
> + return virq;
> }
>
> pr_debug("irq %lu on domain %s mapped to virtual irq %u\n",
>
Thanks,
M.
--
Jazz is not dead, it just smells funny...
Hi!
On 20/09/2019 18:07, Marc Zyngier wrote:
>> irq_create_fwspec_mapping() can race with itself during IRQ trigger type
>> configuration. Possible scenarios include:
>>
>> - Mapping exists, two irq_create_fwspec_mapping() running in parallel do
>> not detect type mismatch, IRQ remains configured with one of the
>> different trigger types randomly
>> - Second call to irq_create_fwspec_mapping() sees existing mapping just
>> created by first call, but earlier irqd_set_trigger_type() call races
>> with later irqd_set_trigger_type() => totally undetected, IRQ type
>> is being set randomly to either one or another type
> Is that an actual thing? Frankly, the scenario you're describing here
> seems to carry the hallmarks of a completely broken system. Can you
> point at a system supported in mainline that would behave as such?
Briefly speaking, this race is about not-complaining in case of a broken
device tree. This is not something purely theoretical. I don't know if
DTs under arch/arm/boot/dts are all correct, but I saw a lot DTs from
silicone vendors and very little of them were 100% correct.
In other words, this patch repairs error-handling. With 100% correct
DTs (or ACPI tables, have you seen one 100% correct BTW? :)) it's
not required.
--
Best regards,
Alexander Sverdlin.
Hello Marc,
On 20/09/2019 17:52, Marc Zyngier wrote:
> On 12/09/2019 10:44, Sverdlin, Alexander (Nokia - DE/Ulm) wrote:
>> From: Alexander Sverdlin <[email protected]>
>>
>> If two irq_create_mapping() calls perform a mapping of the same hwirq on
>> two CPU cores in parallel they both will get 0 from irq_find_mapping(),
>> both will allocate unique virq using irq_domain_alloc_descs() and both
>> will finally irq_domain_associate() it. Giving different virq numbers
>> to their callers.
>>
>> In practice the first caller is usually an interrupt controller driver and
>> the seconds is some device requesting the interrupt providede by the above
>> interrupt controller.
> I disagree with this "In practice". An irqchip controller should *very
> rarely* call irq_create_mapping on its own. It usually indicates some
> level of brokenness, unless the mapped interrupt is exposed by the
> irqchip itself (the GIC maintenance interrupt, for example).
>
>> In this case either the interrupt controller driver configures virq which
>> is not the one being "associated" with hwirq, or the "slave" device
>> requests the virq which is never being triggered.
> Why should the interrupt controller configure that interrupt? On any
> sane platform, the mapping should be created by the user of the
> interrupt, and not by the provider.
>
> This doesn't mean we shouldn't fix the irqdomain races, but I tend to
> disagree with the analysis here.
would you have time to review v2 of this series?
--
Best regards,
Alexander Sverdlin.