2023-07-20 12:37:42

by Jie Zhan

Subject: [PATCH] irqdomain: Fix driver re-inserting failures when IRQs not being freed completely

Since commit 4615fbc3788d ("genirq/irqdomain: Don't try to free an
interrupt that has no mapping"), we have found failures when
re-inserting some specific drivers:

[root@localhost ~]# rmmod hisi_sas_v3_hw
[root@localhost ~]# modprobe hisi_sas_v3_hw
[ 1295.622525] hisi_sas_v3_hw: probe of 0000:30:04.0 failed with error -2

This comes from the case where some IRQs allocated from a low-level domain,
e.g. GIC ITS, are not freed completely, leaving some leaked. Thus, the next
driver insertion fails to get the same number of IRQs because some IRQs are
still occupied.

Free a contiguous group of IRQs in one go to fix this issue.

A previous discussion can be found at:
https://lore.kernel.org/lkml/[email protected]/
This solution was originally written by Marc Zyngier in that discussion, but
no code from the thread ended up upstream. Hopefully, this patch can bring
the issue back to attention.

Fixes: 4615fbc3788d ("genirq/irqdomain: Don't try to free an interrupt that has no mapping")
Signed-off-by: Jie Zhan <[email protected]>
Reviewed-by: Liao Chang <[email protected]>
Signed-off-by: Zheng Zengkai <[email protected]>
---
kernel/irq/irqdomain.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index f34760a1e222..f059e00dc827 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -1445,13 +1445,24 @@ static void irq_domain_free_irqs_hierarchy(struct irq_domain *domain,
 					   unsigned int nr_irqs)
 {
 	unsigned int i;
+	int n;

 	if (!domain->ops->free)
 		return;

 	for (i = 0; i < nr_irqs; i++) {
-		if (irq_domain_get_irq_data(domain, irq_base + i))
-			domain->ops->free(domain, irq_base + i, 1);
+		/* Find the largest possible span of IRQs to free in one go */
+		for (n = 0;
+		     ((i + n) < nr_irqs) &&
+		      (irq_domain_get_irq_data(domain, irq_base + i + n));
+		     n++)
+			;
+
+		if (!n)
+			continue;
+
+		domain->ops->free(domain, irq_base + i, n);
+		i += n;
 	}
 }

--
2.30.0
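
The span-grouping loop in the patch above can be hard to follow in diff form. What follows is a minimal user-space sketch of the same idea, not kernel code: the mapped[] array stands in for a non-NULL irq_domain_get_irq_data() result, and toy_free(), toy_free_log and toy_free_spans() are hypothetical names introduced only for illustration. It records which (base, count) pairs a free callback would be invoked with.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CALLS 16

/* Hypothetical record of one free invocation: first IRQ and count. */
struct toy_free_call {
	unsigned int base;
	unsigned int count;
};

struct toy_free_log {
	struct toy_free_call calls[TOY_MAX_CALLS];
	size_t nr_calls;
};

/* Stand-in for domain->ops->free(): only records what it was asked to free. */
static void toy_free(struct toy_free_log *log, unsigned int base,
		     unsigned int count)
{
	assert(log->nr_calls < TOY_MAX_CALLS);
	log->calls[log->nr_calls].base = base;
	log->calls[log->nr_calls].count = count;
	log->nr_calls++;
}

/*
 * Walk [irq_base, irq_base + nr_irqs) and issue one toy_free() per maximal
 * run of mapped IRQs. mapped[i] plays the role of a non-NULL
 * irq_domain_get_irq_data(domain, irq_base + i) in the patch.
 */
static void toy_free_spans(struct toy_free_log *log, const bool *mapped,
			   unsigned int irq_base, unsigned int nr_irqs)
{
	unsigned int i = 0;

	while (i < nr_irqs) {
		unsigned int n = 0;

		/* Skip a hole: nothing to free for an unmapped IRQ. */
		if (!mapped[i]) {
			i++;
			continue;
		}

		/* Count how many consecutive IRQs starting at i are mapped. */
		while (i + n < nr_irqs && mapped[i + n])
			n++;

		toy_free(log, irq_base + i, n);
		i += n;
	}
}
```

With mapped = {1, 1, 0, 1, 1, 1, 0, 0} and irq_base = 100, this sketch issues two calls, (100, 2) and (103, 3), where the pre-patch loop would issue five calls with a count of 1 each.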



2023-08-29 13:35:48

by Jie Zhan

Subject: Re: [PATCH] irqdomain: Fix driver re-inserting failures when IRQs not being freed completely



On 26/08/2023 02:00, Thomas Gleixner wrote:
> On Thu, Jul 20 2023 at 20:24, Jie Zhan wrote:
>> Since commit 4615fbc3788d ("genirq/irqdomain: Don't try to free an
>> interrupt that has no mapping"), we have found failures when
>> re-inserting some specific drivers:
>>
>> [root@localhost ~]# rmmod hisi_sas_v3_hw
>> [root@localhost ~]# modprobe hisi_sas_v3_hw
>> [ 1295.622525] hisi_sas_v3_hw: probe of 0000:30:04.0 failed with error -2
>>
>> This comes from the case where some IRQs allocated from a low-level domain,
>> e.g. GIC ITS, are not freed completely, leaving some leaked. Thus, the next
>> driver insertion fails to get the same number of IRQs because some IRQs are
>> still occupied.
> Why?
>
>> Free a contiguous group of IRQs in one go to fix this issue.
> Again why?
>
>> @@ -1445,13 +1445,24 @@ static void irq_domain_free_irqs_hierarchy(struct irq_domain *domain,
>>  					   unsigned int nr_irqs)
>>  {
>>  	unsigned int i;
>> +	int n;
>>
>>  	if (!domain->ops->free)
>>  		return;
>>
>>  	for (i = 0; i < nr_irqs; i++) {
>> -		if (irq_domain_get_irq_data(domain, irq_base + i))
>> -			domain->ops->free(domain, irq_base + i, 1);
>> +		/* Find the largest possible span of IRQs to free in one go */
>> +		for (n = 0;
>> +		     ((i + n) < nr_irqs) &&
>> +		      (irq_domain_get_irq_data(domain, irq_base + i + n));
>> +		     n++)
>> +			;
> For one this is unreadable gunk. But what's worse it still does not
> explain what this is solving.
>
> It's completely sensible to expect that freeing interrupts in a range
> one by one just works.
>
> So why do we need to work around an obvious low level failure in the
> core code?
>
> Thanks,
>
> tglx

Hi Thomas,

Many thanks for taking a look.

I believe this patch should be completely reworked, as it has raised many
questions in the first place; it does not explain itself well. Please ignore
this one for now.

The story behind the problem is a bit long and complicated. The previous
discussion can be found at the link attached.

Jie
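
The expectation raised in the review, that freeing a range one IRQ at a time should behave the same as freeing it in one call, can be stated as a small user-space check. This is only a sketch under assumptions: struct toy_domain and its helpers are hypothetical, the free callback here is deliberately well behaved, and nothing in it models how the GIC ITS actually frees interrupts (the thread does not say).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_NR_IRQS 8

/* Hypothetical low-level domain: one "in use" flag per IRQ. */
struct toy_domain {
	bool in_use[TOY_NR_IRQS];
};

/* Mark every IRQ in the toy domain as allocated. */
static void toy_domain_alloc_all(struct toy_domain *d)
{
	unsigned int i;

	for (i = 0; i < TOY_NR_IRQS; i++)
		d->in_use[i] = true;
}

/* Well-behaved free callback: releases exactly the IRQs it is given. */
static void toy_domain_free(struct toy_domain *d, unsigned int base,
			    unsigned int count)
{
	unsigned int i;

	assert(base + count <= TOY_NR_IRQS);
	for (i = 0; i < count; i++)
		d->in_use[base + i] = false;
}

/* Count IRQs still marked in use, i.e. leaked after a full teardown. */
static unsigned int toy_domain_leaked(const struct toy_domain *d)
{
	unsigned int i, leaked = 0;

	for (i = 0; i < TOY_NR_IRQS; i++)
		if (d->in_use[i])
			leaked++;
	return leaked;
}

/* True if one-by-one and single-call freeing end in the same, empty state. */
static bool toy_free_orders_agree(void)
{
	struct toy_domain a, b;
	unsigned int i;

	toy_domain_alloc_all(&a);
	toy_domain_alloc_all(&b);

	for (i = 0; i < TOY_NR_IRQS; i++)
		toy_domain_free(&a, i, 1);	/* one IRQ per call */
	toy_domain_free(&b, 0, TOY_NR_IRQS);	/* whole range in one call */

	return toy_domain_leaked(&a) == 0 && toy_domain_leaked(&b) == 0 &&
	       memcmp(a.in_use, b.in_use, sizeof(a.in_use)) == 0;
}
```

A free callback for which this kind of check fails would be the "low level failure" the review refers to, which is why the review asks for the fix to be explained at that level rather than worked around in the core loop.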