2023-04-10 15:58:05

by Shanker Donthineni

[permalink] [raw]
Subject: [PATCH v3 0/3] Increase the number of IRQ descriptors for SPARSEIRQ

The ARM64 architecture uses SPARSEIRQ with a default value of NR_IRQS,
which is set to 64. This means that only 64+8192 IRQ descriptors are
allowed, which may not be sufficient for modern ARM64 servers that
have a large number of IO devices and GIC hardware that supports
direct vSGI and vLPI injection features.

This limitation has caused issues when attempting to launch multiple
virtual machines with GICv4.1 features, resulting in the error message
'kvm_err("VPE IRQ allocation failure\n")'. The root cause of this issue
is the ~8K IRQ descriptor limit.

To address this issue, an initial proposal was made to define NR_IRQS
to 2^19 for ARM64. However, Marc Zyngier suggested implementing a
generic solution instead of hard-coded values. Thomas Gleixner advised
to use the maple tree data structure and provided most of the necessary
functions.

For more information, refer to the discussion thread at
https://lore.kernel.org/linux-arm-kernel/[email protected]/.

This patch series converts the static memory allocation to dynamic using
the maple tree, and increases the maximum number of IRQ descriptors to
INT_MAX from NR_IRQS+8192. This change has been tested on an ARM64 server
with CONFIG_SPARSE_IRQ=y, where 256 virtual machines were launched,
creating a total of 128K+ IRQ descriptors, and IRQ injection was verified.

Changes in v3:
- Edited commit text
- Added a helper function irq_resend_init()
- Rebased to v6.3-rc6

Changes in v2:
- The patches have been updated to v6.3-rc5.
- The patches 2/5 and 4/5 have been removed as they are unnecessary.
- The review comments from Thomas have been addressed.

Shanker Donthineni (3):
genirq: Use hlist for managing resend handlers
genirq: Encapsulate sparse bitmap handling
genirq: Use the maple tree for IRQ descriptors management

include/linux/irqdesc.h | 3 ++
kernel/irq/chip.c | 1 +
kernel/irq/internals.h | 6 ++--
kernel/irq/irqdesc.c | 76 +++++++++++++++++++++++++----------------
kernel/irq/resend.c | 47 ++++++++++++++++---------
5 files changed, 85 insertions(+), 48 deletions(-)

--
2.25.1


2023-04-10 15:58:19

by Shanker Donthineni

[permalink] [raw]
Subject: [PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management

The current implementation uses a static bitmap and a radix tree
to manage IRQ allocation and irq_desc pointer store respectively.
However, the size of the bitmap is constrained by the build time
macro MAX_SPARSE_IRQS, which may not be sufficient to support the
high-end servers, particularly those with GICv4.1 hardware, which
require a large interrupt space to cover LPIs and vSGIs.

The maple tree is a highly efficient data structure for storing
non-overlapping ranges and can handle a large number of entries,
up to ULONG_MAX. It can be utilized for both storing interrupt
descriptors and identifying available free spaces.

The interrupt descriptors management can be simplified by switching
to a maple tree data structure, which offers greater flexibility
and scalability. To support modern servers, the maximum number of
IRQs has been increased to INT_MAX, which provides a more adequate
value than the previous limit of NR_IRQS+8192.

Signed-off-by: Shanker Donthineni <[email protected]>
---
kernel/irq/internals.h | 2 +-
kernel/irq/irqdesc.c | 54 +++++++++++++++++++++++-------------------
2 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index f3f2090dd2de..7bdb7507efb0 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -12,7 +12,7 @@
#include <linux/sched/clock.h>

#ifdef CONFIG_SPARSE_IRQ
-# define MAX_SPARSE_IRQS (NR_IRQS + 8196)
+# define MAX_SPARSE_IRQS INT_MAX
#else
# define MAX_SPARSE_IRQS NR_IRQS
#endif
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 9a71fc6f2c5f..d6d8120ffd56 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -12,8 +12,7 @@
#include <linux/export.h>
#include <linux/interrupt.h>
#include <linux/kernel_stat.h>
-#include <linux/radix-tree.h>
-#include <linux/bitmap.h>
+#include <linux/maple_tree.h>
#include <linux/irqdomain.h>
#include <linux/sysfs.h>

@@ -131,17 +130,38 @@ int nr_irqs = NR_IRQS;
EXPORT_SYMBOL_GPL(nr_irqs);

static DEFINE_MUTEX(sparse_irq_lock);
-static DECLARE_BITMAP(allocated_irqs, MAX_SPARSE_IRQS);
+static struct maple_tree sparse_irqs = MTREE_INIT_EXT(sparse_irqs,
+ MT_FLAGS_ALLOC_RANGE |
+ MT_FLAGS_LOCK_EXTERN |
+ MT_FLAGS_USE_RCU,
+ sparse_irq_lock);

static int irq_find_free_area(unsigned int from, unsigned int cnt)
{
- return bitmap_find_next_zero_area(allocated_irqs, MAX_SPARSE_IRQS,
- from, cnt, 0);
+ MA_STATE(mas, &sparse_irqs, 0, 0);
+
+ if (mas_empty_area(&mas, from, MAX_SPARSE_IRQS, cnt))
+ return -ENOSPC;
+ return mas.index;
}

static unsigned int irq_find_next_irq(unsigned int offset)
{
- return find_next_bit(allocated_irqs, nr_irqs, offset);
+ struct irq_desc *desc = mt_next(&sparse_irqs, offset, nr_irqs);
+
+ return desc ? irq_desc_get_irq(desc) : nr_irqs;
+}
+
+static void irq_insert_desc(unsigned int irq, struct irq_desc *desc)
+{
+ MA_STATE(mas, &sparse_irqs, irq, irq);
+ WARN_ON(mas_store_gfp(&mas, desc, GFP_KERNEL) != 0);
+}
+
+static void delete_irq_desc(unsigned int irq)
+{
+ MA_STATE(mas, &sparse_irqs, irq, irq);
+ mas_erase(&mas);
}

#ifdef CONFIG_SPARSE_IRQ
@@ -355,26 +375,14 @@ static void irq_sysfs_del(struct irq_desc *desc) {}

#endif /* CONFIG_SYSFS */

-static RADIX_TREE(irq_desc_tree, GFP_KERNEL);
-
-static void irq_insert_desc(unsigned int irq, struct irq_desc *desc)
-{
- radix_tree_insert(&irq_desc_tree, irq, desc);
-}
-
struct irq_desc *irq_to_desc(unsigned int irq)
{
- return radix_tree_lookup(&irq_desc_tree, irq);
+ return mtree_load(&sparse_irqs, irq);
}
#ifdef CONFIG_KVM_BOOK3S_64_HV_MODULE
EXPORT_SYMBOL_GPL(irq_to_desc);
#endif

-static void delete_irq_desc(unsigned int irq)
-{
- radix_tree_delete(&irq_desc_tree, irq);
-}
-
#ifdef CONFIG_SMP
static void free_masks(struct irq_desc *desc)
{
@@ -517,7 +525,6 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node,
irq_sysfs_add(start + i, desc);
irq_add_debugfs_entry(start + i, desc);
}
- bitmap_set(allocated_irqs, start, cnt);
return start;

err:
@@ -557,7 +564,6 @@ int __init early_irq_init(void)

for (i = 0; i < initcnt; i++) {
desc = alloc_desc(i, node, 0, NULL, NULL);
- set_bit(i, allocated_irqs);
irq_insert_desc(i, desc);
}
return arch_early_irq_init();
@@ -612,6 +618,7 @@ static void free_desc(unsigned int irq)
raw_spin_lock_irqsave(&desc->lock, flags);
desc_set_defaults(irq, desc, irq_desc_get_node(desc), NULL, NULL);
raw_spin_unlock_irqrestore(&desc->lock, flags);
+ delete_irq_desc(irq);
}

static inline int alloc_descs(unsigned int start, unsigned int cnt, int node,
@@ -624,8 +631,8 @@ static inline int alloc_descs(unsigned int start, unsigned int cnt, int node,
struct irq_desc *desc = irq_to_desc(start + i);

desc->owner = owner;
+ irq_insert_desc(start + i, desc);
}
- bitmap_set(allocated_irqs, start, cnt);
return start;
}

@@ -637,7 +644,7 @@ static int irq_expand_nr_irqs(unsigned int nr)
void irq_mark_irq(unsigned int irq)
{
mutex_lock(&sparse_irq_lock);
- bitmap_set(allocated_irqs, irq, 1);
+ irq_insert_desc(irq, irq_desc + irq);
mutex_unlock(&sparse_irq_lock);
}

@@ -781,7 +788,6 @@ void irq_free_descs(unsigned int from, unsigned int cnt)
for (i = 0; i < cnt; i++)
free_desc(from + i);

- bitmap_clear(allocated_irqs, from, cnt);
mutex_unlock(&sparse_irq_lock);
}
EXPORT_SYMBOL_GPL(irq_free_descs);
--
2.25.1

2023-04-10 15:59:00

by Shanker Donthineni

[permalink] [raw]
Subject: [PATCH v3 2/3] genirq: Encapsulate sparse bitmap handling

Move the open coded sparse bitmap handling into helper functions as
a preparatory step for converting the sparse interrupt management
to a maple tree.

No functional change.

Signed-off-by: Shanker Donthineni <[email protected]>
---
kernel/irq/internals.h | 4 ++--
kernel/irq/irqdesc.c | 28 +++++++++++++++++++---------
2 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 51fc8c497c22..f3f2090dd2de 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -12,9 +12,9 @@
#include <linux/sched/clock.h>

#ifdef CONFIG_SPARSE_IRQ
-# define IRQ_BITMAP_BITS (NR_IRQS + 8196)
+# define MAX_SPARSE_IRQS (NR_IRQS + 8196)
#else
-# define IRQ_BITMAP_BITS NR_IRQS
+# define MAX_SPARSE_IRQS NR_IRQS
#endif

#define istate core_internal_state__do_not_mess_with_it
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index b401b89b226a..9a71fc6f2c5f 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -131,7 +131,18 @@ int nr_irqs = NR_IRQS;
EXPORT_SYMBOL_GPL(nr_irqs);

static DEFINE_MUTEX(sparse_irq_lock);
-static DECLARE_BITMAP(allocated_irqs, IRQ_BITMAP_BITS);
+static DECLARE_BITMAP(allocated_irqs, MAX_SPARSE_IRQS);
+
+static int irq_find_free_area(unsigned int from, unsigned int cnt)
+{
+ return bitmap_find_next_zero_area(allocated_irqs, MAX_SPARSE_IRQS,
+ from, cnt, 0);
+}
+
+static unsigned int irq_find_next_irq(unsigned int offset)
+{
+ return find_next_bit(allocated_irqs, nr_irqs, offset);
+}

#ifdef CONFIG_SPARSE_IRQ

@@ -517,7 +528,7 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node,

static int irq_expand_nr_irqs(unsigned int nr)
{
- if (nr > IRQ_BITMAP_BITS)
+ if (nr > MAX_SPARSE_IRQS)
return -ENOMEM;
nr_irqs = nr;
return 0;
@@ -535,11 +546,11 @@ int __init early_irq_init(void)
printk(KERN_INFO "NR_IRQS: %d, nr_irqs: %d, preallocated irqs: %d\n",
NR_IRQS, nr_irqs, initcnt);

- if (WARN_ON(nr_irqs > IRQ_BITMAP_BITS))
- nr_irqs = IRQ_BITMAP_BITS;
+ if (WARN_ON(nr_irqs > MAX_SPARSE_IRQS))
+ nr_irqs = MAX_SPARSE_IRQS;

- if (WARN_ON(initcnt > IRQ_BITMAP_BITS))
- initcnt = IRQ_BITMAP_BITS;
+ if (WARN_ON(initcnt > MAX_SPARSE_IRQS))
+ initcnt = MAX_SPARSE_IRQS;

if (initcnt > nr_irqs)
nr_irqs = initcnt;
@@ -812,8 +823,7 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,

mutex_lock(&sparse_irq_lock);

- start = bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS,
- from, cnt, 0);
+ start = irq_find_free_area(from, cnt);
ret = -EEXIST;
if (irq >=0 && start != irq)
goto unlock;
@@ -838,7 +848,7 @@ EXPORT_SYMBOL_GPL(__irq_alloc_descs);
*/
unsigned int irq_get_next_irq(unsigned int offset)
{
- return find_next_bit(allocated_irqs, nr_irqs, offset);
+ return irq_find_next_irq(offset);
}

struct irq_desc *
--
2.25.1

2023-04-10 15:59:24

by Shanker Donthineni

[permalink] [raw]
Subject: [PATCH v3 1/3] genirq: Use hlist for managing resend handlers

The current implementation utilizes a bitmap for managing IRQ resend
handlers, which is allocated based on the SPARSE_IRQ/NR_IRQS macros.
However, this method may not efficiently utilize memory during runtime,
particularly when IRQ_BITMAP_BITS is large.

Address this issue by using the hlist to manage IRQ resend handlers
instead of relying on a static bitmap memory allocation. Additionally,
a new function, clear_irq_resend(), is introduced and called from
irq_shutdown to ensure a graceful teardown of the interrupt.

Signed-off-by: Shanker Donthineni <[email protected]>
---
include/linux/irqdesc.h | 3 +++
kernel/irq/chip.c | 1 +
kernel/irq/internals.h | 2 ++
kernel/irq/irqdesc.c | 2 ++
kernel/irq/resend.c | 47 ++++++++++++++++++++++++++---------------
5 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index 844a8e30e6de..d9451d456a73 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -102,6 +102,9 @@ struct irq_desc {
int parent_irq;
struct module *owner;
const char *name;
+#ifdef CONFIG_HARDIRQS_SW_RESEND
+ struct hlist_node resend_node;
+#endif
} ____cacheline_internodealigned_in_smp;

#ifdef CONFIG_SPARSE_IRQ
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 49e7bc871fec..2eac5532c3c8 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -306,6 +306,7 @@ static void __irq_disable(struct irq_desc *desc, bool mask);
void irq_shutdown(struct irq_desc *desc)
{
if (irqd_is_started(&desc->irq_data)) {
+ clear_irq_resend(desc);
desc->depth = 1;
if (desc->irq_data.chip->irq_shutdown) {
desc->irq_data.chip->irq_shutdown(&desc->irq_data);
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 5fdc0b557579..51fc8c497c22 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -113,6 +113,8 @@ irqreturn_t handle_irq_event(struct irq_desc *desc);

/* Resending of interrupts :*/
int check_irq_resend(struct irq_desc *desc, bool inject);
+void clear_irq_resend(struct irq_desc *desc);
+void irq_resend_init(struct irq_desc *desc);
bool irq_wait_for_poll(struct irq_desc *desc);
void __irq_wake_thread(struct irq_desc *desc, struct irqaction *action);

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 240e145e969f..b401b89b226a 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -415,6 +415,7 @@ static struct irq_desc *alloc_desc(int irq, int node, unsigned int flags,
desc_set_defaults(irq, desc, node, affinity, owner);
irqd_set(&desc->irq_data, flags);
kobject_init(&desc->kobj, &irq_kobj_type);
+ irq_resend_init(desc);

return desc;

@@ -581,6 +582,7 @@ int __init early_irq_init(void)
mutex_init(&desc[i].request_mutex);
init_waitqueue_head(&desc[i].wait_for_threads);
desc_set_defaults(i, &desc[i], node, NULL, NULL);
+ irq_resend_init(desc);
}
return arch_early_irq_init();
}
diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c
index 0c46e9fe3a89..edec335c0a7a 100644
--- a/kernel/irq/resend.c
+++ b/kernel/irq/resend.c
@@ -21,8 +21,9 @@

#ifdef CONFIG_HARDIRQS_SW_RESEND

-/* Bitmap to handle software resend of interrupts: */
-static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
+/* hlist_head to handle software resend of interrupts: */
+static HLIST_HEAD(irq_resend_list);
+static DEFINE_RAW_SPINLOCK(irq_resend_lock);

/*
* Run software resends of IRQ's
@@ -30,18 +31,17 @@ static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
static void resend_irqs(struct tasklet_struct *unused)
{
struct irq_desc *desc;
- int irq;
-
- while (!bitmap_empty(irqs_resend, nr_irqs)) {
- irq = find_first_bit(irqs_resend, nr_irqs);
- clear_bit(irq, irqs_resend);
- desc = irq_to_desc(irq);
- if (!desc)
- continue;
- local_irq_disable();
+
+ raw_spin_lock_irq(&irq_resend_lock);
+ while (!hlist_empty(&irq_resend_list)) {
+ desc = hlist_entry(irq_resend_list.first, struct irq_desc,
+ resend_node);
+ hlist_del_init(&desc->resend_node);
+ raw_spin_unlock(&irq_resend_lock);
desc->handle_irq(desc);
- local_irq_enable();
+ raw_spin_lock(&irq_resend_lock);
}
+ raw_spin_unlock_irq(&irq_resend_lock);
}

/* Tasklet to handle resend: */
@@ -49,8 +49,6 @@ static DECLARE_TASKLET(resend_tasklet, resend_irqs);

static int irq_sw_resend(struct irq_desc *desc)
{
- unsigned int irq = irq_desc_get_irq(desc);
-
/*
* Validate whether this interrupt can be safely injected from
* non interrupt context
@@ -70,16 +68,31 @@ static int irq_sw_resend(struct irq_desc *desc)
*/
if (!desc->parent_irq)
return -EINVAL;
- irq = desc->parent_irq;
}

- /* Set it pending and activate the softirq: */
- set_bit(irq, irqs_resend);
+ /* Add to resend_list and activate the softirq: */
+ raw_spin_lock(&irq_resend_lock);
+ hlist_add_head(&desc->resend_node, &irq_resend_list);
+ raw_spin_unlock(&irq_resend_lock);
tasklet_schedule(&resend_tasklet);
return 0;
}

+void clear_irq_resend(struct irq_desc *desc)
+{
+ raw_spin_lock(&irq_resend_lock);
+ hlist_del_init(&desc->resend_node);
+ raw_spin_unlock(&irq_resend_lock);
+}
+
+void irq_resend_init(struct irq_desc *desc)
+{
+ INIT_HLIST_NODE(&desc->resend_node);
+}
#else
+void clear_irq_resend(struct irq_desc *desc) {}
+void irq_resend_init(struct irq_desc *desc) {}
+
static int irq_sw_resend(struct irq_desc *desc)
{
return -EINVAL;
--
2.25.1

2023-04-15 15:57:10

by Shanker Donthineni

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Increase the number of IRQ descriptors for SPARSEIRQ

Hello Thomas,

I wanted to update you that all the review comments have been resolved and
the necessary fixes for the maple tree have been integrated into the mainline.
If there are any outstanding tasks required to consider it for the upcoming
v6.4 release, please let me know.

Thanks,
Shanker

On 4/10/23 10:57, Shanker Donthineni wrote:
> The ARM64 architecture uses SPARSEIRQ with a default value of NR_IRQS,
> which is set to 64. This means that only 64+8192 IRQ descriptors are
> allowed, which may not be sufficient for modern ARM64 servers that
> have a large number of IO devices and GIC hardware that supports
> direct vSGI and vLPI injection features.
>
> This limitation has caused issues when attempting to launch multiple
> virtual machines with GICv4.1 features, resulting in the error message
> 'kvm_err("VPE IRQ allocation failure\n")'. The root cause of this issue
> is the ~8K IRQ descriptor limit.
>
> To address this issue, an initial proposal was made to define NR_IRQS
> to 2^19 for ARM64. However, Marc Zyngier suggested implementing a
> generic solution instead of hard-coded values. Thomas Gleixner advised
> to use the maple tree data structure and provided most of the necessary
> functions.
>
> For more information, refer to the discussion thread at
> https://lore.kernel.org/linux-arm-kernel/[email protected]/.
>
> This patch series converts the static memory allocation to dynamic using
> the maple tree, and increases the maximum number of IRQ descriptors to
> INT_MAX from NR_IRQS+8192. This change has been tested on an ARM64 server
> with CONFIG_SPARSE_IRQ=y, where 256 virtual machines were launched,
> creating a total of 128K+ IRQ descriptors, and IRQ injection was verified.
>
> Changes in v3:
> - Edited commit text
> - Added a helper function irq_resend_init()
> - Rebased to v6.3-rc6
>
> Changes in v2:
> - The patches have been updated to v6.3-rc5.
> - The patches 2/5 and 4/5 have been removed as they are unnecessary.
> - The review comments from Thomas have been addressed.
>
> Shanker Donthineni (3):
> genirq: Use hlist for managing resend handlers
> genirq: Encapsulate sparse bitmap handling
> genirq: Use the maple tree for IRQ descriptors management
>
> include/linux/irqdesc.h | 3 ++
> kernel/irq/chip.c | 1 +
> kernel/irq/internals.h | 6 ++--
> kernel/irq/irqdesc.c | 76 +++++++++++++++++++++++++----------------
> kernel/irq/resend.c | 47 ++++++++++++++++---------
> 5 files changed, 85 insertions(+), 48 deletions(-)
>

2023-04-15 21:50:58

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Increase the number of IRQ descriptors for SPARSEIRQ

On Sat, Apr 15 2023 at 10:49, Shanker Donthineni wrote:
> Hello Thomas,
>
> I wanted to update you that all the review comments have been resolved and
> the necessary fixes for the maple tree have been integrated into the mainline.
> If there are any outstanding tasks required to consider it for the upcoming
> v6.4 release, please let me know.

It's on my todo list and as I'm traveling next week I might not come
around to review/merge this right now.

I'll get to it.

Thanks,

tglx

2023-04-25 03:24:06

by Yujie Liu

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management

Hello,

kernel test robot noticed "WARNING:at_arch/x86/kernel/apic/ipi.c:#default_send_IPI_mask_logical" on:

commit: 13eb5c4e7d2fb860d3dc5f63d910e3acf78dfd28 ("[PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management")
url: https://github.com/intel-lab-lkp/linux/commits/Shanker-Donthineni/genirq-Use-hlist-for-managing-resend-handlers/20230410-235853
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 6f3ee0e22b4c62f44b8fa3c8de6e369a4d112a75
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: [PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management

in testcase: blktests
version: blktests-i386-a4f80bf-1_20230418
with following parameters:

disk: 1SSD
test: block-group-00

compiler: gcc-11
kconfig: i386-debian-10.3-func (attached)
test machine: 4 threads Intel(R) Xeon(R) CPU E3-1225 v5 @ 3.30GHz (Skylake) with 16G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)


If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-lkp/[email protected]


[ 207.463256][ T30] ------------[ cut here ]------------
[ 207.468574][ T30] WARNING: CPU: 3 PID: 30 at arch/x86/kernel/apic/ipi.c:299 default_send_IPI_mask_logical (arch/x86/kernel/apic/ipi.c:299 (discriminator 1))
[ 207.479193][ T30] Modules linked in: loop(E) dm_multipath(E) dm_mod(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) intel_rapl_msr(E) intel_rapl_common(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) sd_mod(E) t10_pi(E) coretemp(E) i915(E) crc64_rocksoft_generic(E) crc64_rocksoft(E) kvm_intel(E) drm_buddy(E) crc64(E) kvm(E) irqbypass(E) mei_wdt(E) crc32_pclmul(E) crc32c_intel(E) xhci_pci(E) drm_display_helper(E) ipmi_devintf(E) aesni_intel(E) ipmi_msghandler(E) crypto_simd(E) cryptd(E) rapl(E) xhci_hcd(E) cec(E) tpm_infineon(E) intel_cstate(E) wmi_bmof(E) mei_me(E) ahci(E) evdev(E) libahci(E) i2c_i801(E) psmouse(E) drm_kms_helper(E) tpm_tis(E) tpm_tis_core(E) intel_uncore(E) serio_raw(E) usbcore(E) i2c_smbus(E) video(E) syscopyarea(E) libata(E) sysfillrect(E) sysimgblt(E) tpm(E) mei(E) intel_pch_thermal(E) usb_common(E) ie31200_edac(E) ttm(E) wmi(E) acpi_pad(E) rng_core(E) button(E) fuse(E) drm(E) configfs(E) autofs4(E)
[ 207.479246][ T30] [last unloaded: null_blk(E)]
[ 207.568426][ T30] CPU: 3 PID: 30 Comm: migration/3 Tainted: G S E 6.2.0-rc4-00051-g13eb5c4e7d2f #1
[ 207.578787][ T30] Hardware name: HP HP Z238 Microtower Workstation/8183, BIOS N51 Ver. 01.63 10/05/2017
[ 207.588372][ T30] Stopper: multi_cpu_stop+0x0/0xf0 <- stop_machine_cpuslocked (kernel/stop_machine.c:429 kernel/stop_machine.c:469 kernel/stop_machine.c:619)
[ 207.596649][ T30] EIP: default_send_IPI_mask_logical (arch/x86/kernel/apic/ipi.c:299 (discriminator 1))
[ 207.602666][ T30] Code: 2a c2 f7 d1 85 c1 75 22 b9 00 08 00 00 e8 2c fc ff ff 80 e7 02 74 01 fb 8b 5d fc c9 c3 8d 74 26 00 90 c3 8d b4 26 00 00 00 00 <0f> 0b eb da 3e 8d 74 26 00 a1 e4 1e 2a c2 f6 c4 02 75 0d 31 c0 c3
All code
========
0: 2a c2 sub %dl,%al
2: f7 d1 not %ecx
4: 85 c1 test %eax,%ecx
6: 75 22 jne 0x2a
8: b9 00 08 00 00 mov $0x800,%ecx
d: e8 2c fc ff ff call 0xfffffffffffffc3e
12: 80 e7 02 and $0x2,%bh
15: 74 01 je 0x18
17: fb sti
18: 8b 5d fc mov -0x4(%rbp),%ebx
1b: c9 leave
1c: c3 ret
1d: 8d 74 26 00 lea 0x0(%rsi,%riz,1),%esi
21: 90 nop
22: c3 ret
23: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi
2a:* 0f 0b ud2 <-- trapping instruction
2c: eb da jmp 0x8
2e: 3e 8d 74 26 00 ds lea 0x0(%rsi,%riz,1),%esi
33: a1 e4 1e 2a c2 f6 c4 movabs 0x7502c4f6c22a1ee4,%eax
3a: 02 75
3c: 0d .byte 0xd
3d: 31 c0 xor %eax,%eax
3f: c3 ret

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: eb da jmp 0xffffffffffffffde
4: 3e 8d 74 26 00 ds lea 0x0(%rsi,%riz,1),%esi
9: a1 e4 1e 2a c2 f6 c4 movabs 0x7502c4f6c22a1ee4,%eax
10: 02 75
12: 0d .byte 0xd
13: 31 c0 xor %eax,%eax
15: c3 ret
[ 207.622147][ T30] EAX: 00000008 EBX: 00000002 ECX: fffffffc EDX: 00000022
[ 207.629124][ T30] ESI: 00000022 EDI: c1040e30 EBP: c29fdeb8 ESP: c29fdeb4
[ 207.636097][ T30] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010002
[ 207.643599][ T30] CR0: 80050033 CR2: 0214ece0 CR3: 023c8000 CR4: 003506f0
[ 207.650568][ T30] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 207.657539][ T30] DR6: fffe0ff0 DR7: 00000400
[ 207.662078][ T30] Call Trace:
[ 207.665219][ T30] default_send_IPI_single (arch/x86/kernel/apic/ipi.c:228)
[ 207.670370][ T30] apic_retrigger_irq (arch/x86/kernel/apic/vector.c:880)
[ 207.675083][ T30] irq_chip_retrigger_hierarchy (kernel/irq/chip.c:1473)
[ 207.680667][ T30] fixup_irqs (arch/x86/kernel/irq.c:371)
[ 207.684780][ T30] cpu_disable_common (arch/x86/kernel/smpboot.c:1663)
[ 207.689491][ T30] native_cpu_disable (arch/x86/kernel/smpboot.c:1694)
[ 207.694205][ T30] take_cpu_down (kernel/cpu.c:1035)
[ 207.698484][ T30] multi_cpu_stop (kernel/stop_machine.c:240)
[ 207.702854][ T30] cpu_stopper_thread (kernel/stop_machine.c:512)
[ 207.707658][ T30] ? stop_machine_yield+0x4/0x4
[ 207.712377][ T30] smpboot_thread_fn (kernel/smpboot.c:164 (discriminator 4))
[ 207.717090][ T30] kthread (kernel/kthread.c:376)
[ 207.720854][ T30] ? find_next_bit (arch/x86/events/intel/core.c:4525)
[ 207.725135][ T30] ? kthread_complete_and_exit (kernel/kthread.c:331)
[ 207.730630][ T30] ret_from_fork (arch/x86/entry/entry_32.S:770)
[ 207.734912][ T30] ---[ end trace 0000000000000000 ]---


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests


Attachments:
(No filename) (6.19 kB)
config-6.2.0-rc4-00051-g13eb5c4e7d2f (153.70 kB)
job-script (5.76 kB)
dmesg.xz (42.78 kB)
blktests (6.04 kB)
job.yaml (5.17 kB)
reproduce (224.00 B)
Download all attachments

2023-04-26 12:27:41

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management

On Tue, Apr 25 2023 at 11:16, kernel test robot wrote:
> kernel test robot noticed "WARNING:at_arch/x86/kernel/apic/ipi.c:#default_send_IPI_mask_logical" on:
>
> commit: 13eb5c4e7d2fb860d3dc5f63d910e3acf78dfd28 ("[PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management")
> url: https://github.com/intel-lab-lkp/linux/commits/Shanker-Donthineni/genirq-Use-hlist-for-managing-resend-handlers/20230410-235853
> base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 6f3ee0e22b4c62f44b8fa3c8de6e369a4d112a75
> patch link: https://lore.kernel.org/all/[email protected]/
> patch subject: [PATCH v3 3/3] genirq: Use the maple tree for IRQ
> descriptors management

This happens during CPU hot-unplug.

[ 206.930774][ T228] block/008 => sdb2 (do IO while hotplugging CPUs)
[ 206.935757][ T2086] run blktests block/008 at 2023-04-22 16:27:25
[ 207.199359][ T2086] smpboot: CPU 2 is now offline

[ 207.468574][ T30] WARNING: CPU: 3 PID: 30 at arch/x86/kernel/apic/ipi.c:299 default_send_IPI_mask_logical+0x40/0x44
[ 207.568426][ T30] CPU: 3 PID: 30 Comm: migration/3 Tainted: G S E 6.2.0-rc4-00051-g13eb5c4e7d2f #1
[ 207.588372][ T30] Stopper: multi_cpu_stop+0x0/0xf0 <- stop_machine_cpuslocked+0xf5/0x138
[ 207.596649][ T30] EIP: default_send_IPI_mask_logical+0x40/0x44

This warns because fixup_irqs() sends an IPI to an offline CPU. In this
case to CPU3 which just cleared its online bit and is about to vanish:

[ 207.622147][ T30] EAX: 00000008 EBX: 00000002 ECX: fffffffc EDX: 00000022

EAX contains the target and ECX the inverted online mask. That's
probably the ata2 interrupt as that later detects a timeout:

[ 238.826212][ T174] ata2.00: exception Emask 0x0 SAct 0x3c00000 SErr 0x0 action 0x6 frozen
[ 238.834522][ T174] ata2.00: failed command: READ FPDMA QUEUED
[ 238.840378][ T174] ata2.00: cmd 60/08:b0:90:3e:90/00:00:25:00:00/40 tag 22 ncq dma 4096 in
[ 238.840378][ T174] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Which means that migrating the interrupt away from the outgoing CPU3
failed for yet to understand reasons.

The patch in question is changing the interrupt descriptor storage and
with that also the iterator function. But I can't spot anything wrong
right now.

But what I can spot is this:

[ 0.000000][ T0] Linux version 6.2.0-rc4-00051-g13eb5c4e7d2f

IOW, that test is based on some random upstream version, which lacks
about 30 commits to maple_tree, where 12 of them have 'fix' in the
commit subject.

Can you please retest this on v6.3 and report back when the problem
persists?

Thanks,

tglx


2023-04-28 01:38:01

by Yujie Liu

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management

Hi Thomas,

On Wed, Apr 26, 2023 at 02:08:54PM +0200, Thomas Gleixner wrote:
> On Tue, Apr 25 2023 at 11:16, kernel test robot wrote:
> > kernel test robot noticed "WARNING:at_arch/x86/kernel/apic/ipi.c:#default_send_IPI_mask_logical" on:
> >
> > commit: 13eb5c4e7d2fb860d3dc5f63d910e3acf78dfd28 ("[PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management")
> > url: https://github.com/intel-lab-lkp/linux/commits/Shanker-Donthineni/genirq-Use-hlist-for-managing-resend-handlers/20230410-235853
> > base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 6f3ee0e22b4c62f44b8fa3c8de6e369a4d112a75
> > patch link: https://lore.kernel.org/all/[email protected]/
> > patch subject: [PATCH v3 3/3] genirq: Use the maple tree for IRQ
> > descriptors management
>
> This happens during CPU hot-unplug.
>
> [ 206.930774][ T228] block/008 => sdb2 (do IO while hotplugging CPUs)
> [ 206.935757][ T2086] run blktests block/008 at 2023-04-22 16:27:25
> [ 207.199359][ T2086] smpboot: CPU 2 is now offline
>
> [ 207.468574][ T30] WARNING: CPU: 3 PID: 30 at arch/x86/kernel/apic/ipi.c:299 default_send_IPI_mask_logical+0x40/0x44
> [ 207.568426][ T30] CPU: 3 PID: 30 Comm: migration/3 Tainted: G S E 6.2.0-rc4-00051-g13eb5c4e7d2f #1
> [ 207.588372][ T30] Stopper: multi_cpu_stop+0x0/0xf0 <- stop_machine_cpuslocked+0xf5/0x138
> [ 207.596649][ T30] EIP: default_send_IPI_mask_logical+0x40/0x44
>
> This warns because fixup_irqs() sends an IPI to an offline CPU. In this
> case to CPU3 which just cleared its online bit and is about to vanish:
>
> [ 207.622147][ T30] EAX: 00000008 EBX: 00000002 ECX: fffffffc EDX: 00000022
>
> EAX contains the target and ECX the inverted online mask. That's
> probably the ata2 interrupt as that later detects a timeout:
>
> [ 238.826212][ T174] ata2.00: exception Emask 0x0 SAct 0x3c00000 SErr 0x0 action 0x6 frozen
> [ 238.834522][ T174] ata2.00: failed command: READ FPDMA QUEUED
> [ 238.840378][ T174] ata2.00: cmd 60/08:b0:90:3e:90/00:00:25:00:00/40 tag 22 ncq dma 4096 in
> [ 238.840378][ T174] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>
> Which means that migrating the interrupt away from the outgoing CPU3
> failed for yet to understand reasons.
>
> The patch in question is changing the interrupt descriptor storage and
> with that also the iterator function. But I can't spot anything wrong
> right now.
>
> But what I can spot is this:
>
> [ 0.000000][ T0] Linux version 6.2.0-rc4-00051-g13eb5c4e7d2f
>
> IOW, that test is based on some random upstream version, which lacks
> about 30 commits to maple_tree, where 12 of them have 'fix' in the
> commit subject.
>
> Can you please retest this on v6.3 and report back when the problem
> persists?

Thanks for your help looking into this problem.

The problem persists when tested on v6.3, but not 100% reproducible.
We ran the test on v6.3 and v6.3+patch each for 20 runs. There are 9
failed runs on v6.3+patch, while v6.3 is all clean. Full dmesg is
attached.

=========================================================================================
compiler/disk/kconfig/rootfs/tbox_group/test/testcase:
gcc-11/1SSD/i386-debian-10.3-func/debian-11.1-i386-20220923.cgz/lkp-skl-d06/block-group-00/blktests

commit:
v6.3
32c58fc685e5c ("genirq: Use the maple tree for IRQ descriptors management")

v6.3 32c58fc685e5cd6b5947a5f8e9a
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
:20 45% 9:20 dmesg.EIP:default_send_IPI_mask_logical
:20 45% 9:20 dmesg.WARNING:at_arch/x86/kernel/apic/ipi.c:#default_send_IPI_mask_logical
:20 45% 9:20 blktests.block/008.fail
:20 45% 9:20 blktests.block/012.fail


[ 214.484584][ T235] block/008 => sdb2 (do IO while hotplugging CPUs)
[ 214.484586][ T235]
[ 214.489534][ T2125] run blktests block/008 at 2023-04-27 10:06:38
[ 214.749748][ T2125] smpboot: CPU 3 is now offline
[ 215.009645][ T28] ------------[ cut here ]------------
[ 215.014948][ T28] WARNING: CPU: 2 PID: 28 at arch/x86/kernel/apic/ipi.c:299 default_send_IPI_mask_logical (arch/x86/kernel/apic/ipi.c:299 (discriminator 1))
[ 215.025527][ T28] Modules linked in: loop(E) dm_multipath(E) dm_mod(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) intel_rapl_msr(E) intel_rapl_common(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) sd_mod(E) t10_pi(E) coretemp(E) crc64_rocksoft_generic(E) kvm_intel(E) crc64_rocksoft(E) crc64(E) kvm(E) irqbypass(E) ipmi_devintf(E) ipmi_msghandler(E) crc32_pclmul(E) i915(E) drm_buddy(E) crc32c_intel(E) tpm_infineon(E) drm_display_helper(E) aesni_intel(E) crypto_simd(E) mei_wdt(E) cec(E) tpm_tis(E) cryptd(E) ahci(E) drm_kms_helper(E) psmouse(E) tpm_tis_core(E) xhci_pci(E) mei_me(E) i2c_i801(E) rapl(E) syscopyarea(E) wmi_bmof(E) xhci_hcd(E) usbcore(E) intel_cstate(E) libahci(E) evdev(E) serio_raw(E) i2c_smbus(E) sysfillrect(E) sysimgblt(E) intel_uncore(E) tpm(E) video(E) libata(E) mei(E) intel_pch_thermal(E) ttm(E) ie31200_edac(E) usb_common(E) rng_core(E) wmi(E) acpi_pad(E) button(E) fuse(E) drm(E) configfs(E) autofs4(E) [last unloaded: null_blk(E)]
[ 215.112194][ T28] CPU: 2 PID: 28 Comm: migration/2 Tainted: G S E 6.3.0-00003-g32c58fc685e5 #1
[ 215.122171][ T28] Hardware name: HP HP Z238 Microtower Workstation/8183, BIOS N51 Ver. 01.63 10/05/2017
[ 215.131716][ T28] Stopper: multi_cpu_stop+0x0/0xf4 <- stop_machine_cpuslocked (kernel/stop_machine.c:429 kernel/stop_machine.c:469 kernel/stop_machine.c:619)
[ 215.139966][ T28] EIP: default_send_IPI_mask_logical (arch/x86/kernel/apic/ipi.c:299 (discriminator 1))
[ 215.145961][ T28] Code: 2b c2 f7 d1 85 c1 75 22 b9 00 08 00 00 e8 cc fb ff ff 80 e7 02 74 01 fb 8b 5d fc c9 c3 8d 74 26 00 90 c3 8d b4 26 00 00 00 00 <0f> 0b eb da 3e 8d 74 26 00 a1 64 b8 2b c2 f6 c4 02 75 0d 31 c0 c3
All code
========
0: 2b c2 sub %edx,%eax
2: f7 d1 not %ecx
4: 85 c1 test %eax,%ecx
6: 75 22 jne 0x2a
8: b9 00 08 00 00 mov $0x800,%ecx
d: e8 cc fb ff ff call 0xfffffffffffffbde
12: 80 e7 02 and $0x2,%bh
15: 74 01 je 0x18
17: fb sti
18: 8b 5d fc mov -0x4(%rbp),%ebx
1b: c9 leave
1c: c3 ret
1d: 8d 74 26 00 lea 0x0(%rsi,%riz,1),%esi
21: 90 nop
22: c3 ret
23: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi
2a:* 0f 0b ud2 <-- trapping instruction
2c: eb da jmp 0x8
2e: 3e 8d 74 26 00 ds lea 0x0(%rsi,%riz,1),%esi
33: a1 64 b8 2b c2 f6 c4 movabs 0x7502c4f6c22bb864,%eax
3a: 02 75
3c: 0d .byte 0xd
3d: 31 c0 xor %eax,%eax
3f: c3 ret

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: eb da jmp 0xffffffffffffffde
4: 3e 8d 74 26 00 ds lea 0x0(%rsi,%riz,1),%esi
9: a1 64 b8 2b c2 f6 c4 movabs 0x7502c4f6c22bb864,%eax
10: 02 75
12: 0d .byte 0xd
13: 31 c0 xor %eax,%eax
15: c3 ret
[ 215.165363][ T28] EAX: 00000004 EBX: 00000002 ECX: fffffffc EDX: 00000023
[ 215.172308][ T28] ESI: 00000023 EDI: c1045a80 EBP: c30fdeb8 ESP: c30fdeb4
[ 215.179248][ T28] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010002
[ 215.186717][ T28] CR0: 80050033 CR2: a95b2f4c CR3: 02400000 CR4: 003506f0
[ 215.193662][ T28] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 215.200601][ T28] DR6: fffe0ff0 DR7: 00000400
[ 215.205126][ T28] Call Trace:
[ 215.208266][ T28] default_send_IPI_single (arch/x86/kernel/apic/ipi.c:228)
[ 215.213396][ T28] apic_retrigger_irq (arch/x86/kernel/apic/vector.c:880)
[ 215.218095][ T28] irq_chip_retrigger_hierarchy (kernel/irq/chip.c:1473)
[ 215.223658][ T28] fixup_irqs (arch/x86/kernel/irq.c:371)
[ 215.227753][ T28] cpu_disable_common (arch/x86/kernel/smpboot.c:1663)
[ 215.232451][ T28] native_cpu_disable (arch/x86/kernel/smpboot.c:1694)
[ 215.237150][ T28] take_cpu_down (kernel/cpu.c:1035)
[ 215.241415][ T28] multi_cpu_stop (kernel/stop_machine.c:240)
[ 215.245766][ T28] cpu_stopper_thread (kernel/stop_machine.c:512)
[ 215.250549][ T28] ? stop_machine_yield+0x4/0x4
[ 215.255246][ T28] smpboot_thread_fn (kernel/smpboot.c:164 (discriminator 4))
[ 215.259943][ T28] kthread (kernel/kthread.c:376)
[ 215.263688][ T28] ? sort_range (kernel/smpboot.c:107)
[ 215.267858][ T28] ? kthread_complete_and_exit (kernel/kthread.c:331)
[ 215.273334][ T28] ret_from_fork (arch/x86/entry/entry_32.S:770)
[ 215.277599][ T28] ---[ end trace 0000000000000000 ]---

--
Best Regards,
Yujie


Attachments:
(No filename) (9.19 kB)
dmesg.xz (39.33 kB)
Download all attachments

2023-04-28 10:37:38

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management

Yujie!

On Fri, Apr 28 2023 at 09:33, Yujie Liu wrote:
> On Wed, Apr 26, 2023 at 02:08:54PM +0200, Thomas Gleixner wrote:
>> Can you please retest this on v6.3 and report back when the problem
>> persists?
>
> Thanks for your help looking into this problem.
>
> The problem persists when tested on v6.3, but not 100% reproducible.
> We ran the test on v6.3 and v6.3+patch each for 20 runs. There are 9
> failed runs on v6.3+patch, while v6.3 is all clean. Full dmesg is
> attached.

Under the assumption that the code is correct, then the effect of this
patch is that it changes the timing. Sigh.

1) Does this happen with a 64-bit kernel too?

2) Can you enable the irq_vector:vector_*.* tracepoints and provide
the trace?

And please provide /proc/interrupts from that machine.

Thanks,

tglx

2023-05-07 08:20:21

by Yujie Liu

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management

Hi Thomas,

Sorry for late reply as we were on public holiday earlier this week.

On Fri, Apr 28, 2023 at 12:31:14PM +0200, Thomas Gleixner wrote:
> On Fri, Apr 28 2023 at 09:33, Yujie Liu wrote:
> > On Wed, Apr 26, 2023 at 02:08:54PM +0200, Thomas Gleixner wrote:
> >> Can you please retest this on v6.3 and report back when the problem
> >> persists?
> >
> > Thanks for your help looking into this problem.
> >
> > The problem persists when tested on v6.3, but not 100% reproducible.
> > We ran the test on v6.3 and v6.3+patch each for 20 runs. There are 9
> > failed runs on v6.3+patch, while v6.3 is all clean. Full dmesg is
> > attached.
>
> Under the assumption that the code is correct, then the effect of this
> patch is that it changes the timing. Sigh.
>
> 1) Does this happen with a 64-bit kernel too?

It doesn't happen on a 64-bit kernel:

=========================================================================================
commit/compiler/disk/kconfig/rootfs/tbox_group/test/testcase:
32c58fc685e5cd6b5947a5f8e9a3c318a63fcf0a/gcc-11/1SSD/i386-debian-10.3-func/debian-11.1-i386-20220923.cgz/lkp-skl-d06/block-group-00/blktests

commit/compiler/disk/kconfig/rootfs/tbox_group/test/testcase:
32c58fc685e5cd6b5947a5f8e9a3c318a63fcf0a/gcc-11/1SSD/x86_64-rhel-8.3-func/debian-11.1-x86_64-20220510.cgz/lkp-skl-d06/block-group-00/blktests

kconfig:
i386-debian-10.3-func
x86_64-rhel-8.3-func

i386-debian-10.3 x86_64-rhel-8.3-func
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
16:48 -33% :7 dmesg.EIP:default_send_IPI_mask_logical
16:48 -33% :7 dmesg.WARNING:at_arch/x86/kernel/apic/ipi.c:#default_send_IPI_mask_logical

> 2) Can you enable the irq_vector:vector_*.* tracepoints and provide
> the trace?

I'm a beginner of kernel and not sure if I'm doing this correctly. Here
are my test steps:

# enable irq_vector events
# cat /sys/kernel/debug/tracing/set_event
irq_vectors:vector_free_moved
irq_vectors:vector_setup
irq_vectors:vector_teardown
irq_vectors:vector_deactivate
irq_vectors:vector_activate
irq_vectors:vector_alloc_managed
irq_vectors:vector_alloc
irq_vectors:vector_reserve
irq_vectors:vector_reserve_managed
irq_vectors:vector_clear
irq_vectors:vector_update
irq_vectors:vector_config

# turn on the tracer
# cat /sys/kernel/debug/tracing/tracing_on
1

# run the testcase
# ./check block/008
block/008 => sdb2 (do IO while hotplugging CPUs)
read iops 0 ...
runtime 0.576s ...

block/008 => sdb2 (do IO while hotplugging CPUs) [failed]
read iops 0 ... 0d06 at May 7 07:33:46 ...
runtime 0.576s ... 0.580sommon_interrupt: 2.34 No irq handler for vector
--- tests/block/008.out 2023-04-18 17:36:53.000000000 +0000
+++ /lkp/benchmarks/blktests/results/sdb2/block/008.out.bad 2023-05-07 07:33:47.336596084 +0000
@@ -1,2 +1,8 @@
Running block/008
+fio: io_u error on file /dev/sdb2: Input/output error: read offset=273571840, buflen=4096
+fio: io_u error on file /dev/sdb2: Input/output error: read offset=64757760, buflen=4096
+fio: io_u error on file /dev/sdb2: Input/output error: read offset=700571648, buflen=4096
+fio: io_u error on file /dev/sdb2: Input/output error: read offset=105357312, buflen=4096
+fio exited with status 0
+4;fio-3.25;reads;0;5;0;0;0;0;0;0;0.000000;0.000000;0;0;0.000000;0.000000;1.000000%=0;5.000000%=0;10.000000%=0;20.000000%=0;30.000000%=0;40.000000%=0;50.000000%=0;60.000000%=0;70.000000%=0;80.000000%=0;90.000000%=0;95.000000%=0;99.000000%=0;99.500000%=0;99.900000%=0;99.950000%=0;99.990000%=0;0%=0;0%=0;0%=0;0;0;0.000000;0.000000;0;0;0.000000%;0.000000;0.000000;0;0;0;0;0;0;0.000000;0.000000;0;0;0.000000;0.000000;1.000000%=0;5.000000%=0;10.000000%=0;20.000000%=0;30.000000%=0;40.000000%=0;50.000000%=0;60.000000%=0;70.000000%=0;80.000000%=0;90.000000%=0;95.000000%=0;99.000000%=0;99.500000%=0;99.900000%=0;99.950000%=0;99.990000%=0;0%=0;0%=0;0%=0;0;0;0.000000;0.000000;0;0;0.000000%;0.000000;0.000000;0;0;0;0;0;0;0.000000;0.000000;0;0;0.000000;0.000000;1.000000%=0;5.000000%=0;10.000000%=0;20.000000%=0;30.000000%=0;40.000000%=0;50.000000%=0;60.000000%=0;70.000000%=0;80.000000%=0;90.000000%=0;95.000000%=0;99.000000%=0;99.500000%=0;99.900000%=0;99.950000%=0;99.990000%=0;0%=0;0%=0;0%=0;0;0;0.000000;0.000000;0;0;0.000000%;0.000000;0.000000;0.000000%;0.000000%;12;0;163;100.0%;0.0%;0.0%;0.0%;0.0%;0.0%;0.0%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;sdb;0;0;0;0;0;0;0;0.00%
...
(Run 'diff -u tests/block/008.out /lkp/benchmarks/blktests/results/sdb2/block/008.out.bad' to see the entire diff)

# check the trace
# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 0/0 #P:4
#
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |

Nothing was written to trace buffer, seems like no irq_vector events
were captured during this test.

> And please provide /proc/interrupts from that machine.

Before running the test:

# cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
0: 34 0 0 0 IO-APIC 2-edge timer
1: 4 0 0 0 IO-APIC 1-edge i8042
4: 0 0 635 0 IO-APIC 4-edge ttyS0
8: 0 0 0 0 IO-APIC 8-edge rtc0
9: 0 0 0 0 IO-APIC 9-fasteoi acpi
12: 0 0 0 6 IO-APIC 12-edge i8042
16: 0 0 0 4 IO-APIC 16-fasteoi i801_smbus
19: 12 0 0 0 IO-APIC 19-fasteoi
120: 3608 0 617 0 PCI-MSI-0000:00:1f.6 0-edge eth0
121: 50 0 0 0 PCI-MSI-0000:00:16.0 0-edge mei_me
122: 0 0 0 0 PCI-MSI-0000:00:14.0 0-edge xhci_hcd
123: 0 0 2109981 0 PCI-MSI-0000:00:17.0 0-edge ahci[0000:00:17.0]
124: 0 0 0 206 PCI-MSI-0000:00:02.0 0-edge i915
NMI: 0 0 0 0 Non-maskable interrupts
LOC: 221252 228764 471985 229286 Local timer interrupts
SPU: 0 0 0 0 Spurious interrupts
PMI: 0 0 0 0 Performance monitoring interrupts
IWI: 16 0 0 8 IRQ work interrupts
RTR: 108 25 15 11 APIC ICR read retries
RES: 5016 5486 697 10440 Rescheduling interrupts
CAL: 4618 4578 2775 5754 Function call interrupts
TLB: 12 15 14 18 TLB shootdowns
TRM: 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 Threshold APIC interrupts
MCE: 0 0 0 0 Machine check exceptions
MCP: 4 108 109 107 Machine check polls
ERR: 0
MIS: 0
PIN: 0 0 0 0 Posted-interrupt notification event
NPI: 0 0 0 0 Nested posted-interrupt event
PIW: 0 0 0 0 Posted-interrupt wakeup event

After running the test:

# cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
0: 34 0 0 0 IO-APIC 2-edge timer
1: 4 0 0 0 IO-APIC 1-edge i8042
4: 0 0 635 0 IO-APIC 4-edge ttyS0
8: 0 0 0 0 IO-APIC 8-edge rtc0
9: 0 0 0 0 IO-APIC 9-fasteoi acpi
12: 0 0 0 6 IO-APIC 12-edge i8042
16: 0 0 0 4 IO-APIC 16-fasteoi i801_smbus
19: 12 0 0 0 IO-APIC 19-fasteoi
120: 3807 0 617 0 PCI-MSI-0000:00:1f.6 0-edge eth0
121: 50 0 0 0 PCI-MSI-0000:00:16.0 0-edge mei_me
122: 0 0 0 0 PCI-MSI-0000:00:14.0 0-edge xhci_hcd
123: 0 0 2109981 0 PCI-MSI-0000:00:17.0 0-edge ahci[0000:00:17.0]
124: 0 0 0 206 PCI-MSI-0000:00:02.0 0-edge i915
NMI: 0 0 0 0 Non-maskable interrupts
LOC: 221568 228973 472239 229475 Local timer interrupts
SPU: 0 0 0 0 Spurious interrupts
PMI: 0 0 0 0 Performance monitoring interrupts
IWI: 16 0 0 8 IRQ work interrupts
RTR: 108 25 16 11 APIC ICR read retries
RES: 5017 5488 698 10440 Rescheduling interrupts
CAL: 4622 4584 2783 5757 Function call interrupts
TLB: 12 15 15 19 TLB shootdowns
TRM: 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 Threshold APIC interrupts
MCE: 0 0 0 0 Machine check exceptions
MCP: 4 108 109 109 Machine check polls
ERR: 0
MIS: 0
PIN: 0 0 0 0 Posted-interrupt notification event
NPI: 0 0 0 0 Nested posted-interrupt event
PIW: 0 0 0 0 Posted-interrupt wakeup event


Thanks again for taking the time to look into this.

--
Best Regards,
Yujie

2023-05-08 09:50:42

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management

Yujie!

On Sun, May 07 2023 at 16:05, Yujie Liu wrote:
> Sorry for late reply as we were on public holiday earlier this week.

Holidays are more important and the problems do not run away :)

> On Fri, Apr 28, 2023 at 12:31:14PM +0200, Thomas Gleixner wrote:
>> Under the assumption that the code is correct, then the effect of this
>> patch is that it changes the timing. Sigh.
>>
>> 1) Does this happen with a 64-bit kernel too?
>
> It doesn't happen on a 64-bit kernel:

Ok. So one difference might be that a 64 bit kernel enables interrupt
rempping. Can you add 'intremap=off' to the kernel command line please?

>> 2) Can you enable the irq_vector:vector_*.* tracepoints and provide
>> the trace?
>
> I'm a beginner of kernel and not sure if I'm doing this correctly. Here
> are my test steps:

They are perfectly fine.

> # check the trace
> # cat /sys/kernel/debug/tracing/trace
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 0/0 #P:4
> #
> # _-----=> irqs-off/BH-disabled
> # / _----=> need-resched
> # | / _---=> hardirq/softirq
> # || / _--=> preempt-depth
> # ||| / _-=> migrate-disable
> # |||| / delay
> # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
> # | | | ||||| | |
>
> Nothing was written to trace buffer, seems like no irq_vector events
> were captured during this test.

Stupid me. I completely forgot that this happens on the outgoing CPU at
a point where the tracer for that CPU is already shut down.

Can you please apply the patch below? No need to enable the irq_vector
events. It just dumps the information into dmesg.

Thanks,

tglx
---
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -57,7 +57,8 @@ static bool migrate_one_irq(struct irq_d
bool maskchip = !irq_can_move_pcntxt(d) && !irqd_irq_masked(d);
const struct cpumask *affinity;
bool brokeaff = false;
- int err;
+ int err, irq = d->irq;
+ bool move_pending;

/*
* IRQ chip might be already torn down, but the irq descriptor is
@@ -101,10 +102,16 @@ static bool migrate_one_irq(struct irq_d
* there is no move pending or the pending mask does not contain
* any online CPU, use the current affinity mask.
*/
- if (irq_fixup_move_pending(desc, true))
+ move_pending = irqd_is_setaffinity_pending(d);
+ if (irq_fixup_move_pending(desc, true)) {
affinity = irq_desc_get_pending_mask(desc);
- else
+ pr_info("IRQ %3d: move_pending=%d pending mask: %*pbl\n",
+ irq, move_pending, cpumask_pr_args(affinity));
+ } else {
affinity = irq_data_get_affinity_mask(d);
+ pr_info("IRQ %3d: move_pending=%d affinity mask: %*pbl\n",
+ irq, move_pending, cpumask_pr_args(affinity));
+ }

/* Mask the chip for interrupts which cannot move in process context */
if (maskchip && chip->irq_mask)
@@ -136,6 +143,9 @@ static bool migrate_one_irq(struct irq_d
brokeaff = false;
}

+ affinity = irq_data_get_effective_affinity_mask(d);
+ pr_info("IRQ %3d: Done: %*pbl\n", irq, cpumask_pr_args(affinity));
+
if (maskchip && chip->irq_unmask)
chip->irq_unmask(d);



2023-05-10 07:41:58

by Yujie Liu

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management

Hi Thomas,

On Mon, May 08, 2023 at 11:36:37AM +0200, Thomas Gleixner wrote:
> >> Under the assumption that the code is correct, then the effect of this
> >> patch is that it changes the timing. Sigh.
> >>
> >> 1) Does this happen with a 64-bit kernel too?
> >
> > It doesn't happen on a 64-bit kernel:
>
> Ok. So one difference might be that a 64 bit kernel enables interrupt
> rempping. Can you add 'intremap=off' to the kernel command line please?

Sorry, my previous info was incorrect.

The block/008 (do IO while hotplugging CPUs) failure also happens on a
64-bit kernel no matter having 'intremap=off' or not, and persists when
tested against v6.3, but the warning in default_send_IPI_mask_logical
function is not triggered on a 64-bit kernel. Not sure if that function
is 32-bit specific since it is set in arch/x86/kernel/apic/probe_32.c.

== x86_64 kernel ==

compiler/disk/kconfig/rootfs/tbox_group/test/testcase:
gcc-11/1SSD/x86_64-rhel-8.3-func/debian-11.1-x86_64-20220510.cgz/lkp-skl-d06/block-group-00/blktests

commit:
v6.3
32c58fc685e5c ("genirq: Use the maple tree for IRQ descriptors management")

v6.3 32c58fc685e5cd6b5947a5f8e9a
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
:10 70% 7:7 blktests.block/008.fail
:10 70% 7:7 blktests.block/012.fail

== i386 kernel ==

compiler/disk/kconfig/rootfs/tbox_group/test/testcase:
gcc-11/1SSD/i386-debian-10.3-func/debian-11.1-i386-20220923.cgz/lkp-skl-d06/block-group-00/blktests

commit:
v6.3
32c58fc685e5c ("genirq: Use the maple tree for IRQ descriptors management")

v6.3 32c58fc685e5cd6b5947a5f8e9a
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
:20 90% 18:49 blktests.block/008.fail
:20 90% 18:49 blktests.block/012.fail
:20 80% 16:49 dmesg.EIP:default_send_IPI_mask_logical
:20 80% 16:49 dmesg.WARNING:at_arch/x86/kernel/apic/ipi.c:#default_send_IPI_mask_logical

> >> 2) Can you enable the irq_vector:vector_*.* tracepoints and provide
> >> the trace?
> >
> > Nothing was written to trace buffer, seems like no irq_vector events
> > were captured during this test.
>
> Can you please apply the patch below? No need to enable the irq_vector
> events. It just dumps the information into dmesg.

The dmesgs of 64-bit and 32-bit kernels are attached.

--
Best Regards,
Yujie

> ---
> --- a/kernel/irq/cpuhotplug.c
> +++ b/kernel/irq/cpuhotplug.c
> @@ -57,7 +57,8 @@ static bool migrate_one_irq(struct irq_d
> bool maskchip = !irq_can_move_pcntxt(d) && !irqd_irq_masked(d);
> const struct cpumask *affinity;
> bool brokeaff = false;
> - int err;
> + int err, irq = d->irq;
> + bool move_pending;
>
> /*
> * IRQ chip might be already torn down, but the irq descriptor is
> @@ -101,10 +102,16 @@ static bool migrate_one_irq(struct irq_d
> * there is no move pending or the pending mask does not contain
> * any online CPU, use the current affinity mask.
> */
> - if (irq_fixup_move_pending(desc, true))
> + move_pending = irqd_is_setaffinity_pending(d);
> + if (irq_fixup_move_pending(desc, true)) {
> affinity = irq_desc_get_pending_mask(desc);
> - else
> + pr_info("IRQ %3d: move_pending=%d pending mask: %*pbl\n",
> + irq, move_pending, cpumask_pr_args(affinity));
> + } else {
> affinity = irq_data_get_affinity_mask(d);
> + pr_info("IRQ %3d: move_pending=%d affinity mask: %*pbl\n",
> + irq, move_pending, cpumask_pr_args(affinity));
> + }
>
> /* Mask the chip for interrupts which cannot move in process context */
> if (maskchip && chip->irq_mask)
> @@ -136,6 +143,9 @@ static bool migrate_one_irq(struct irq_d
> brokeaff = false;
> }
>
> + affinity = irq_data_get_effective_affinity_mask(d);
> + pr_info("IRQ %3d: Done: %*pbl\n", irq, cpumask_pr_args(affinity));
> +
> if (maskchip && chip->irq_unmask)
> chip->irq_unmask(d);
>


Attachments:
(No filename) (4.18 kB)
dmesg_i386.xz (41.56 kB)
dmesg_x86_64.xz (73.32 kB)
Download all attachments

2023-05-10 14:48:15

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management

Yujie!

On Wed, May 10 2023 at 15:24, Yujie Liu wrote:
> On Mon, May 08, 2023 at 11:36:37AM +0200, Thomas Gleixner wrote:
>> Ok. So one difference might be that a 64 bit kernel enables interrupt
>> rempping. Can you add 'intremap=off' to the kernel command line please?
>
> Sorry, my previous info was incorrect.
>
> The block/008 (do IO while hotplugging CPUs) failure also happens on a
> 64-bit kernel no matter having 'intremap=off' or not, and persists when
> tested against v6.3, but the warning in default_send_IPI_mask_logical
> function is not triggered on a 64-bit kernel. Not sure if that function
> is 32-bit specific since it is set in arch/x86/kernel/apic/probe_32.c.

Ok. That makes more sense as the issue is clearly independent of 32 or
64 bit.

I decoded it by now and that maple_tree conversion is the culprit. It
broke irq_get_next_irq() which is used during hotplug. It misses every
other interrupt, so affinities are not fixed up.

Please ignore that series.

Thanks a lot for your help!

tglx

2023-05-10 15:18:13

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management

Shanker!

On Wed, May 10 2023 at 16:41, Thomas Gleixner wrote:
> On Wed, May 10 2023 at 15:24, Yujie Liu wrote:
> I decoded it by now and that maple_tree conversion is the culprit. It
> broke irq_get_next_irq() which is used during hotplug. It misses every
> other interrupt, so affinities are not fixed up.

I'm seriously grumpy. You throw that untested stuff over the fence,
pester me about merging it and then ignore the fallout.

This breaks cpuhotplug, debugfs, /proc/stat, x86/IOAPIC and some more.

It's not asked too much that if you change an iterator implementation to
validate that the outcome is still the same on the usage sites.

That change has never seen CPU hotplug testing. It reproduces
instantaneously in a VM even without running blktest.

I grant you that the documentation of mt_next() is incorrect, but that's
absolutely no excuse for not testing such a fundamental change at
all. It's neither an excuse for ignoring the fallout and wasting other
peoples time.

I'm dropping this from my to-merge list.

Yours grumpy

Thomas

2023-05-10 15:44:56

by Shanker Donthineni

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management


Hi Thomas & Marc,


Apologies for my lack of familiarity with the maple tree data structure and
not testing all functions. I received advice from the review comments below
regarding the iterator. I am looking for your guidance to address the issue
with the iterator if not possible can we increase NR_IRQS for ARM64 arch.


https://lore.kernel.org/linux-arm-kernel/875ydej9ur.ffs@tglx/

static unsigned int irq_find_next_irq(unsigned int offset)
{
MA_STATE(mas, &sparse_irqs, offset, nr_irqs);
struct irq_desc *desc = mas_next(&mas, nr_irqs);

return desc ? irq_desc_get_irq(desc) : nr_irqs;
}

-Shanker

On 5/10/23 09:49, Thomas Gleixner wrote:
> External email: Use caution opening links or attachments
>
>
> Shanker!
>
> On Wed, May 10 2023 at 16:41, Thomas Gleixner wrote:
>> On Wed, May 10 2023 at 15:24, Yujie Liu wrote:
>> I decoded it by now and that maple_tree conversion is the culprit. It
>> broke irq_get_next_irq() which is used during hotplug. It misses every
>> other interrupt, so affinities are not fixed up.
>
> I'm seriously grumpy. You throw that untested stuff over the fence,
> pester me about merging it and then ignore the fallout.
>
> This breaks cpuhotplug, debugfs, /proc/stat, x86/IOAPIC and some more.
>
> It's not asked too much that if you change an iterator implementation to
> validate that the outcome is still the same on the usage sites.
>
> That change has never seen CPU hotplug testing. It reproduces
> instantaneously in a VM even without running blktest.
>
> I grant you that the documentation of mt_next() is incorrect, but that's
> absolutely no excuse for not testing such a fundamental change at
> all. It's neither an excuse for ignoring the fallout and wasting other
> peoples time.
>
> I'm dropping this from my to-merge list.
>
> Yours grumpy
>
> Thomas

2023-05-10 16:09:16

by Shanker Donthineni

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management

Hi Thomas,

On 4/10/23 10:57, Shanker Donthineni wrote:
> The current implementation uses a static bitmap and a radix tree
> to manage IRQ allocation and irq_desc pointer store respectively.
> However, the size of the bitmap is constrained by the build time
> macro MAX_SPARSE_IRQS, which may not be sufficient to support the
> high-end servers, particularly those with GICv4.1 hardware, which
> require a large interrupt space to cover LPIs and vSGIs.
>
> The maple tree is a highly efficient data structure for storing
> non-overlapping ranges and can handle a large number of entries,
> up to ULONG_MAX. It can be utilized for both storing interrupt
> descriptors and identifying available free spaces.
>
> The interrupt descriptors management can be simplified by switching
> to a maple tree data structure, which offers greater flexibility
> and scalability. To support modern servers, the maximum number of
> IRQs has been increased to INT_MAX, which provides a more adequate
> value than the previous limit of NR_IRQS+8192.
>
> Signed-off-by: Shanker Donthineni <[email protected]>
> ---
> kernel/irq/internals.h | 2 +-
> kernel/irq/irqdesc.c | 54 +++++++++++++++++++++++-------------------
> 2 files changed, 31 insertions(+), 25 deletions(-)
>
> diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
> index f3f2090dd2de..7bdb7507efb0 100644
> --- a/kernel/irq/internals.h
> +++ b/kernel/irq/internals.h
> @@ -12,7 +12,7 @@
> #include <linux/sched/clock.h>
>
> #ifdef CONFIG_SPARSE_IRQ
> -# define MAX_SPARSE_IRQS (NR_IRQS + 8196)
> +# define MAX_SPARSE_IRQS INT_MAX
> #else
> # define MAX_SPARSE_IRQS NR_IRQS
> #endif
> diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
> index 9a71fc6f2c5f..d6d8120ffd56 100644
> --- a/kernel/irq/irqdesc.c
> +++ b/kernel/irq/irqdesc.c
> @@ -12,8 +12,7 @@
> #include <linux/export.h>
> #include <linux/interrupt.h>
> #include <linux/kernel_stat.h>
> -#include <linux/radix-tree.h>
> -#include <linux/bitmap.h>
> +#include <linux/maple_tree.h>
> #include <linux/irqdomain.h>
> #include <linux/sysfs.h>
>
> @@ -131,17 +130,38 @@ int nr_irqs = NR_IRQS;
> EXPORT_SYMBOL_GPL(nr_irqs);
>
> static DEFINE_MUTEX(sparse_irq_lock);
> -static DECLARE_BITMAP(allocated_irqs, MAX_SPARSE_IRQS);
> +static struct maple_tree sparse_irqs = MTREE_INIT_EXT(sparse_irqs,
> + MT_FLAGS_ALLOC_RANGE |
> + MT_FLAGS_LOCK_EXTERN |
> + MT_FLAGS_USE_RCU,
> + sparse_irq_lock);
>
> static int irq_find_free_area(unsigned int from, unsigned int cnt)
> {
> - return bitmap_find_next_zero_area(allocated_irqs, MAX_SPARSE_IRQS,
> - from, cnt, 0);
> + MA_STATE(mas, &sparse_irqs, 0, 0);
> +
> + if (mas_empty_area(&mas, from, MAX_SPARSE_IRQS, cnt))
> + return -ENOSPC;
> + return mas.index;
> }
>
> static unsigned int irq_find_next_irq(unsigned int offset)
> {
> - return find_next_bit(allocated_irqs, nr_irqs, offset);
> + struct irq_desc *desc = mt_next(&sparse_irqs, offset, nr_irqs);
> +
> + return desc ? irq_desc_get_irq(desc) : nr_irqs;
> +}
> +

Based on the comments of mt_find(), seems it is the right function to
use for finding the next entry.

static unsigned int irq_find_next_irq(unsigned int offset)
{
struct irq_desc *desc = mt_find(&sparse_irqs, offset, nr_irqs);

return desc ? irq_desc_get_irq(desc) : nr_irqs;
}


-Shanker

2023-05-10 17:29:31

by Shanker Donthineni

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management

Hi Thomas,

On 5/10/23 10:19, Shanker Donthineni wrote:
>
> Apologies for my lack of familiarity with the maple tree data structure and
> not testing all functions. I received advice from the review comments below
> regarding the iterator. I am looking for your guidance to address the issue
> with the iterator if not possible can we increase NR_IRQS for ARM64 arch.
>
>
> https://lore.kernel.org/linux-arm-kernel/875ydej9ur.ffs@tglx/
>
> static unsigned int irq_find_next_irq(unsigned int offset)
> {
>         MA_STATE(mas, &sparse_irqs, offset, nr_irqs);
>         struct irq_desc *desc = mas_next(&mas, nr_irqs);
>
>         return desc ? irq_desc_get_irq(desc) : nr_irqs;
> }

The problem has been resolved, and I have confirmed the functionality of
CPU hot-unplug on ARM64 systems. I have posted v4 to fix the issue.

https://lore.kernel.org/lkml/[email protected]/T/#mfa8e2a5490f275cfecb2e93473c24f35db7adcaf

I have made the change of replacing mt_next() with mt_find() in irq_find_next_irq() function.

static unsigned int irq_find_next_irq(unsigned int offset)
{
unsigned long index = offset;
struct irq_desc *desc = mt_find(&sparse_irqs, &index, nr_irqs);

return desc ? irq_desc_get_irq(desc) : nr_irqs;
}

-Shanker

2023-05-10 19:32:00

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management

Shanker!

On Wed, May 10 2023 at 10:19, Shanker Donthineni wrote:
> Apologies for my lack of familiarity with the maple tree data structure and
> not testing all functions. I received advice from the review comments below
> regarding the iterator.
>
> https://lore.kernel.org/linux-arm-kernel/875ydej9ur.ffs@tglx/
>
> static unsigned int irq_find_next_irq(unsigned int offset)
> {
> MA_STATE(mas, &sparse_irqs, offset, nr_irqs);
> struct irq_desc *desc = mas_next(&mas, nr_irqs);
>
> return desc ? irq_desc_get_irq(desc) : nr_irqs;
> }

You gracefully omitted what I wrote below this:

"or something like that."

which means that it's a suggestion and not claiming that it is in any
way correct.

It does not mean that you should copy it verbatim and assume that it is
correct. It means that if you pick that suggestion up it's your
responsibility to thoroughly understand and test the code you submit,
no?

That has nothing to do with familiarity vs. the maple tree. I'm neither
familiar with the maple tree and all I did was reading through the
documentation, which is pretty well done, except for that tiny
documentation error vs. the return value of mt_next().

It's a simple question of engineering principles:

"correctness first" vs. "works for me".

> I am looking for your guidance to address the issue with the iterator

You must be kidding.

1) You got 90% of the solution on a silber tablet

2) You failed to understand and test the stuff you submitted

3) You ignored the bug report. Someone else debugged the fallout and
explained where and what the problem is.

4) Now you ask that someone else goes and reads the maple tree
documentation for you to find the proper function to use. So that
you can go and blindly apply the change and then go back to #2.

I'm really happy to help and guide people, but what you are asking for
is that I do your homework completely. Seriously?

> if not possible can we increase NR_IRQS for ARM64 arch.

Definitely not. The maple tree conversion is the right thing to do and
fixing the issue at hand is definitely not rocket science.

I'm pretty sure that it is about 3 lines of change to make it work. Plus
some polishing afterwards to make it palatable.

Thanks,

tglx