2022-03-21 23:11:15

by Marc Zyngier

Subject: [PATCH v2 0/3] genirq: Managed affinity fixes

John (and later on David) reported[1] a while ago that, when booting
with maxcpus=1, managed affinity devices would fail to get the
interrupts that were associated with offlined CPUs.

Similarly, Xiongfeng reported[2] that the GICv3 ITS would sometimes use
non-housekeeping CPUs instead of the affinity that was passed down as
a parameter.

[1] can be fixed by not trying to activate these interrupts if no CPU
that can satisfy the affinity is present (a patch addressing this was
already posted[3]).

[2] is a consequence of affinities containing non-online CPUs being
passed down to the interrupt controller driver and the ITS driver
trying to paper over that by ignoring the affinity parameter and doing
its own (stupid) thing. It would be better to (a) get the core code to
remove the offline CPUs from the affinity mask at all times, and (b)
fix the drivers so that they can trust the core code not to trip them.
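
As a rough illustration of point (a), here is a minimal userspace sketch
of what "limiting the affinity to online CPUs" amounts to. It is not the
kernel implementation: cpu_mask_t and limit_to_online() are hypothetical
names, and a single 64-bit word stands in for a real cpumask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpu_mask_t;

/*
 * Restrict a requested affinity to the CPUs that are currently online.
 * Returns false when no online CPU is left in the mask, so that the
 * caller can refuse to hand an unusable mask to the irqchip driver.
 */
static bool limit_to_online(cpu_mask_t requested, cpu_mask_t online,
			    cpu_mask_t *effective)
{
	*effective = requested & online;
	return *effective != 0;
}
```

With only CPU0 online (mask 0x1), a request for CPUs 0-1 (0x3) is
narrowed to 0x1, while a request for CPUs 2-3 (0xc) leaves nothing and
is reported as unsatisfiable.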

This small series, based on 5.17, addresses the above.

Thanks,

M.

[1] https://lore.kernel.org/r/[email protected]
[2] https://lore.kernel.org/r/[email protected]
[3] https://lore.kernel.org/r/[email protected]

Marc Zyngier (3):
  genirq/msi: Shutdown managed interrupts with unsatifiable affinities
  genirq: Always limit the affinity to online CPUs
  irqchip/gic-v3: Always trust the managed affinity provided by the core
    code

 drivers/irqchip/irq-gic-v3-its.c |  2 +-
 kernel/irq/manage.c              | 25 +++++++++++++++++--------
 kernel/irq/msi.c                 | 15 +++++++++++++++
 3 files changed, 33 insertions(+), 9 deletions(-)

--
2.34.1


2022-03-21 23:32:04

by Marc Zyngier

Subject: [PATCH v2 1/3] genirq/msi: Shutdown managed interrupts with unsatifiable affinities

When booting with maxcpus=<small number>, interrupt controllers
such as the GICv3 ITS may not be able to satisfy the affinity of
some managed interrupts, as some of the HW resources are simply
not available.

The same thing happens when loading a driver using managed interrupts
while CPUs are offline.

In order to deal with this, do not try to activate such an interrupt
if there is no online CPU capable of handling it. Instead, place
it in the shutdown state. Once a capable CPU shows up, it will be
activated.

Reported-by: John Garry <[email protected]>
Tested-by: John Garry <[email protected]>
Reported-by: David Decotigny <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
---
kernel/irq/msi.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index 2bdfce5edafd..a9ee535293eb 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -818,6 +818,21 @@ static int msi_init_virq(struct irq_domain *domain, int virq, unsigned int vflag
 		irqd_clr_can_reserve(irqd);
 		if (vflags & VIRQ_NOMASK_QUIRK)
 			irqd_set_msi_nomask_quirk(irqd);
+
+		/*
+		 * If the interrupt is managed but no CPU is available to
+		 * service it, shut it down until better times. Note that
+		 * we only do this on the !RESERVE path as x86 (the only
+		 * architecture using this flag) deals with this in a
+		 * different way by using a catch-all vector.
+		 */
+		if ((vflags & VIRQ_ACTIVATE) &&
+		    irqd_affinity_is_managed(irqd) &&
+		    !cpumask_intersects(irq_data_get_affinity_mask(irqd),
+					cpu_online_mask)) {
+			irqd_set_managed_shutdown(irqd);
+			return 0;
+		}
 	}

 	if (!(vflags & VIRQ_ACTIVATE))
--
2.34.1
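
For readers who want to experiment outside the kernel, the decision made
by the hunk above can be modelled roughly as follows. This is only a
sketch under simplifying assumptions: the enum, cpu_mask_t and
decide_irq_init() are hypothetical names, a 64-bit word stands in for a
cpumask, and the reserve flag mentioned in the comment is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpu_mask_t;

enum irq_init_action {
	IRQ_INIT_SKIP,			/* activation was not requested */
	IRQ_INIT_ACTIVATE,		/* activate the interrupt now */
	IRQ_INIT_MANAGED_SHUTDOWN,	/* park it until a capable CPU is online */
};

/*
 * A managed interrupt whose affinity contains no online CPU is parked
 * in the managed-shutdown state instead of being activated. Anything
 * else follows the usual path.
 */
static enum irq_init_action decide_irq_init(bool activate, bool managed,
					    cpu_mask_t affinity,
					    cpu_mask_t online)
{
	if (!activate)
		return IRQ_INIT_SKIP;
	if (managed && (affinity & online) == 0)
		return IRQ_INIT_MANAGED_SHUTDOWN;
	return IRQ_INIT_ACTIVATE;
}
```

In this model, a managed interrupt targeting CPUs 2-3 (0xc) with only
CPU0 online (0x1) ends up in managed shutdown, whereas the same mask on
a non-managed interrupt is still activated.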

2022-03-23 09:53:20

by Marc Zyngier

Subject: Re: [PATCH v2 0/3] genirq: Managed affinity fixes

Hi Xiongfeng,

On Wed, 23 Mar 2022 03:52:46 +0000,
Xiongfeng Wang <[email protected]> wrote:
>
> Hi, Marc
>
> On 2022/3/22 3:36, Marc Zyngier wrote:
> > John (and later on David) reported[1] a while ago that booting with
> > maxcpus=1, managed affinity devices would fail to get the interrupts
> > that were associated with offlined CPUs.
> >
> > Similarly, Xiongfeng reported[2] that the GICv3 ITS would sometime use
> > non-housekeeping CPUs instead of the affinity that was passed down as
> > a parameter.
> >
> > [1] can be fixed by not trying to activate these interrupts if no CPU
> > that can satisfy the affinity is present (a patch addressing this was
> > already posted[3])
> >
> > [2] is a consequence of affinities containing non-online CPUs being
> > passed down to the interrupt controller driver and the ITS driver
> > trying to paper over that by ignoring the affinity parameter and doing
> > its own (stupid) thing. It would be better to (a) get the core code to
> > remove the offline CPUs from the affinity mask at all times, and (b)
> > fix the drivers so that they can trust the core code not to trip them.
> >
> > This small series, based on 5.17, addresses the above.
>
> I have tested this patchset on D06. It works well with kernel parameter
> 'maxcpus=1' or 'nohz_full=1-127 isolcpus=nohz,domain,managed_irq,1-127'.
> Also the 'effective_affinity' is correct. Thanks!

Thanks for having given it a go.

> By the way, I merged the second patch manually because of conflicts.
> Maybe I lack some patches on your local repo.

That's odd, as the patches are directly sitting on top of 5.17 in my
tree (see [1]). Do you have any out of tree patches around? Please
make sure you test this without any extra change.

Thanks,

M.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=irq/managed-affinity-fixes

--
Without deviation from the norm, progress is not possible.

2022-03-24 11:46:36

by Xiongfeng Wang

Subject: Re: [PATCH v2 0/3] genirq: Managed affinity fixes



On 2022/3/23 16:56, Marc Zyngier wrote:
> Hi Xiongfeng,
>
> On Wed, 23 Mar 2022 03:52:46 +0000,
> Xiongfeng Wang <[email protected]> wrote:
>>
>> Hi, Marc
>>
>> On 2022/3/22 3:36, Marc Zyngier wrote:
>>> John (and later on David) reported[1] a while ago that booting with
>>> maxcpus=1, managed affinity devices would fail to get the interrupts
>>> that were associated with offlined CPUs.
>>>
>>> Similarly, Xiongfeng reported[2] that the GICv3 ITS would sometime use
>>> non-housekeeping CPUs instead of the affinity that was passed down as
>>> a parameter.
>>>
>>> [1] can be fixed by not trying to activate these interrupts if no CPU
>>> that can satisfy the affinity is present (a patch addressing this was
>>> already posted[3])
>>>
>>> [2] is a consequence of affinities containing non-online CPUs being
>>> passed down to the interrupt controller driver and the ITS driver
>>> trying to paper over that by ignoring the affinity parameter and doing
>>> its own (stupid) thing. It would be better to (a) get the core code to
>>> remove the offline CPUs from the affinity mask at all times, and (b)
>>> fix the drivers so that they can trust the core code not to trip them.
>>>
>>> This small series, based on 5.17, addresses the above.
>>
>> I have tested this patchset on D06. It works well with kernel parameter
>> 'maxcpus=1' or 'nohz_full=1-127 isolcpus=nohz,domain,managed_irq,1-127'.
>> Also the 'effective_affinity' is correct. Thanks!
>
> Thanks for having given it a go.
>
>> By the way, I merged the second patch manually because of conflicts.
>> Maybe I lack some patches on your local repo.
>
> That's odd, as the patches are directly sitting on top of 5.17 in my
> tree (see [1]). Do you have any out of tree patches around? Please
> make sure you test this without any extra change.

I applied the patchset on top of the latest mainline kernel. The latest commit is
commit 3bf03b9a0839c9fb06927ae53ebd0f960b19d408
Merge branch 'akpm' (patches from Andrew)
I didn't change what the second patch modifies. I only resolved the
context conflicts, which are caused by the following commit.
commit 04d4e665a60902cf36e7ad39af1179cb5df542ad
sched/isolation: Use single feature type while referring to housekeeping cpumask
It changed 'HK_FLAG_MANAGED_IRQ' to 'HK_TYPE_MANAGED_IRQ'.

Thanks,
Xiongfeng

>
> Thanks,
>
> M.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=irq/managed-affinity-fixes
>

2022-03-25 01:30:43

by Xiongfeng Wang

Subject: Re: [PATCH v2 0/3] genirq: Managed affinity fixes

Hi, Marc

On 2022/3/22 3:36, Marc Zyngier wrote:
> John (and later on David) reported[1] a while ago that booting with
> maxcpus=1, managed affinity devices would fail to get the interrupts
> that were associated with offlined CPUs.
>
> Similarly, Xiongfeng reported[2] that the GICv3 ITS would sometime use
> non-housekeeping CPUs instead of the affinity that was passed down as
> a parameter.
>
> [1] can be fixed by not trying to activate these interrupts if no CPU
> that can satisfy the affinity is present (a patch addressing this was
> already posted[3])
>
> [2] is a consequence of affinities containing non-online CPUs being
> passed down to the interrupt controller driver and the ITS driver
> trying to paper over that by ignoring the affinity parameter and doing
> its own (stupid) thing. It would be better to (a) get the core code to
> remove the offline CPUs from the affinity mask at all times, and (b)
> fix the drivers so that they can trust the core code not to trip them.
>
> This small series, based on 5.17, addresses the above.

I have tested this patchset on D06. It works well with kernel parameter
'maxcpus=1' or 'nohz_full=1-127 isolcpus=nohz,domain,managed_irq,1-127'.
Also the 'effective_affinity' is correct. Thanks!
By the way, I merged the second patch manually because of conflicts.
Maybe I am missing some patches that are in your local repo.

Thanks,
Xiongfeng

>
> Thanks,
>
> M.
>
> [1] https://lore.kernel.org/r/[email protected]
> [2] https://lore.kernel.org/r/[email protected]
> [3] https://lore.kernel.org/r/[email protected]
>
> Marc Zyngier (3):
> genirq/msi: Shutdown managed interrupts with unsatifiable affinities
> genirq: Always limit the affinity to online CPUs
> irqchip/gic-v3: Always trust the managed affinity provided by the core
> code
>
> drivers/irqchip/irq-gic-v3-its.c | 2 +-
> kernel/irq/manage.c | 25 +++++++++++++++++--------
> kernel/irq/msi.c | 15 +++++++++++++++
> 3 files changed, 33 insertions(+), 9 deletions(-)
>

2022-03-25 12:29:16

by Marc Zyngier

Subject: Re: [PATCH v2 0/3] genirq: Managed affinity fixes

On Wed, 23 Mar 2022 10:58:33 +0000,
Xiongfeng Wang <[email protected]> wrote:
>
>
>
> On 2022/3/23 16:56, Marc Zyngier wrote:
> > Hi Xiongfeng,
> >
> > On Wed, 23 Mar 2022 03:52:46 +0000,
> > Xiongfeng Wang <[email protected]> wrote:
> >>
> >> Hi, Marc
> >>
> >> On 2022/3/22 3:36, Marc Zyngier wrote:
> >>> John (and later on David) reported[1] a while ago that booting with
> >>> maxcpus=1, managed affinity devices would fail to get the interrupts
> >>> that were associated with offlined CPUs.
> >>>
> >>> Similarly, Xiongfeng reported[2] that the GICv3 ITS would sometime use
> >>> non-housekeeping CPUs instead of the affinity that was passed down as
> >>> a parameter.
> >>>
> >>> [1] can be fixed by not trying to activate these interrupts if no CPU
> >>> that can satisfy the affinity is present (a patch addressing this was
> >>> already posted[3])
> >>>
> >>> [2] is a consequence of affinities containing non-online CPUs being
> >>> passed down to the interrupt controller driver and the ITS driver
> >>> trying to paper over that by ignoring the affinity parameter and doing
> >>> its own (stupid) thing. It would be better to (a) get the core code to
> >>> remove the offline CPUs from the affinity mask at all times, and (b)
> >>> fix the drivers so that they can trust the core code not to trip them.
> >>>
> >>> This small series, based on 5.17, addresses the above.
> >>
> >> I have tested this patchset on D06. It works well with kernel parameter
> >> 'maxcpus=1' or 'nohz_full=1-127 isolcpus=nohz,domain,managed_irq,1-127'.
> >> Also the 'effective_affinity' is correct. Thanks!
> >
> > Thanks for having given it a go.
> >
> >> By the way, I merged the second patch manually because of conflicts.
> >> Maybe I lack some patches on your local repo.
> >
> > That's odd, as the patches are directly sitting on top of 5.17 in my
> > tree (see [1]). Do you have any out of tree patches around? Please
> > make sure you test this without any extra change.
>
> I apply the patchset based on the latest mainline kernel. The latest commit is
> commit 3bf03b9a0839c9fb06927ae53ebd0f960b19d408
> Merge branch 'akpm' (patches from Andrew)
> I didn't change the modification of the second patch. Only resolve the
> context conflicts, which is cause by the following commit.
> commit 04d4e665a60902cf36e7ad39af1179cb5df542ad
> sched/isolation: Use single feature type while referring to housekeeping cpumask
> It changed 'HK_FLAG_MANAGED_IRQ' to 'HK_TYPE_MANAGED_IRQ'.

Ah, that's on top of linux/master then. Yeah, I expect some small
conflicts (this is a popular spot). I'll rebase things at some point
once (and if) we agree that patch #2 is the right thing to do.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.