The ARM64 architecture enables CONFIG_SPARSE_IRQ with NR_IRQS
defaulting to 64, which caps the number of IRQ descriptors at
NR_IRQS+8192 (~8K). This may not be sufficient for modern ARM64
servers that have a large number of IO devices and GIC hardware
that supports direct vSGI and vLPI injection.
This limitation has caused issues when attempting to launch multiple
virtual machines with GICv4.1 features, failing with the error message
'VPE IRQ allocation failure' (reported via kvm_err()). The root cause
is the ~8K IRQ descriptor limit.
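The pre-series ceiling comes from a build-time macro in
kernel/irq/internals.h (adapted from the patches below; note the
code says 8196, apparently a long-standing typo for 8192):

#ifdef CONFIG_SPARSE_IRQ
# define IRQ_BITMAP_BITS	(NR_IRQS + 8196)	/* ARM64: 64 + 8196, ~8K */
#else
# define IRQ_BITMAP_BITS	NR_IRQS
#endif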
To address this, an initial proposal defined NR_IRQS as 2^19 for
ARM64. However, Marc Zyngier suggested implementing a generic
solution instead of hard-coded values, and Thomas Gleixner advised
using the maple tree data structure and provided most of the
necessary functions.
For more information, refer to the discussion thread at
https://lore.kernel.org/linux-arm-kernel/[email protected]/.
This patch series converts the static memory allocation to dynamic using
the maple tree, and increases the maximum number of IRQ descriptors to
INT_MAX from NR_IRQS+8192. This change has been tested on an ARM64 server
with CONFIG_SPARSE_IRQ=y, where 256 virtual machines were launched,
creating a total of 128K+ IRQ descriptors, and IRQ injection was verified.
Tested on an ARM64 system:
- Normal boot with ~200 cores
- CPU hot-unplug/hot-plug using sysfs
- Booted virtual machines with PCIe device pass-through
- Hot-unplug of CPUs to which vfio-pci interrupts are bound
Changes in v5:
- Change function name irq_find_next_irq() to irq_find_at_or_after()
- Update comment to reflect the return value of irq_get_next_irq()
Changes in v4:
- Fix the iterator function using mt_find() instead of mt_next()
- Tested CPU hot-unplug and hot-plug on an ARM64 system. This was
  completely broken before v4.
Changes in v3:
- Edited commit text
- Added a helper function irq_resend_init()
- Rebased to v6.3-rc6
Changes in v2:
- The patches have been updated to v6.3-rc5.
- The patches 2/5 and 4/5 have been removed as they are unnecessary.
- The review comments from Thomas have been addressed.
Shanker Donthineni (3):
genirq: Use hlist for managing resend handlers
genirq: Encapsulate sparse bitmap handling
genirq: Use the maple tree for IRQ descriptors management
include/linux/irqdesc.h | 3 ++
kernel/irq/chip.c | 1 +
kernel/irq/internals.h | 6 ++--
kernel/irq/irqdesc.c | 77 +++++++++++++++++++++++++----------------
kernel/irq/resend.c | 47 ++++++++++++++++---------
5 files changed, 86 insertions(+), 48 deletions(-)
--
2.25.1
The current implementation uses a static bitmap to manage IRQ
allocation and a radix tree to store irq_desc pointers for lookup.
However, the size of the bitmap is constrained by the build-time
macro MAX_SPARSE_IRQS, which may not be sufficient to support
high-end servers, particularly those with GICv4.1 hardware, which
require a large interrupt space to cover LPIs and vSGIs.
The maple tree is a highly efficient data structure for storing
non-overlapping ranges and can handle a large number of entries,
up to ULONG_MAX. It can be utilized for both storing interrupt
descriptors and identifying available free spaces.
Interrupt descriptor management can be simplified by switching to
a maple tree data structure, which offers greater flexibility and
scalability. To support modern servers, the maximum number of IRQs
is increased to INT_MAX, a more adequate value than the previous
limit of NR_IRQS+8192.
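For illustration, a minimal sketch of the idea using the simpler
locked mtree_*() wrappers (the patch itself uses the advanced
mas_*() API under the external sparse_irq_lock; the demo_*() names
are made up for this sketch):

#include <linux/limits.h>
#include <linux/maple_tree.h>

static struct maple_tree demo_irqs =
	MTREE_INIT(demo_irqs, MT_FLAGS_ALLOC_RANGE);

/* Replaces bitmap_find_next_zero_area(): find and claim a free index */
static int demo_alloc_desc(void *desc, unsigned int from)
{
	unsigned long index;
	int ret;

	ret = mtree_alloc_range(&demo_irqs, &index, desc, 1,
				from, INT_MAX, GFP_KERNEL);
	return ret ? ret : (int)index;
}

/* Replaces radix_tree_lookup(): RCU-safe pointer lookup */
static void *demo_to_desc(unsigned int irq)
{
	return mtree_load(&demo_irqs, irq);
}

/* Replaces bitmap_clear() plus radix_tree_delete() in one operation */
static void demo_free_desc(unsigned int irq)
{
	mtree_erase(&demo_irqs, irq);
}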
Signed-off-by: Shanker Donthineni <[email protected]>
---
kernel/irq/internals.h | 2 +-
kernel/irq/irqdesc.c | 57 ++++++++++++++++++++++++------------------
2 files changed, 33 insertions(+), 26 deletions(-)
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index f3f2090dd2de..7bdb7507efb0 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -12,7 +12,7 @@
#include <linux/sched/clock.h>
#ifdef CONFIG_SPARSE_IRQ
-# define MAX_SPARSE_IRQS (NR_IRQS + 8196)
+# define MAX_SPARSE_IRQS INT_MAX
#else
# define MAX_SPARSE_IRQS NR_IRQS
#endif
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index e0d9dd9b36f9..27ca1c866f29 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -12,8 +12,7 @@
#include <linux/export.h>
#include <linux/interrupt.h>
#include <linux/kernel_stat.h>
-#include <linux/radix-tree.h>
-#include <linux/bitmap.h>
+#include <linux/maple_tree.h>
#include <linux/irqdomain.h>
#include <linux/sysfs.h>
@@ -131,17 +130,39 @@ int nr_irqs = NR_IRQS;
EXPORT_SYMBOL_GPL(nr_irqs);
static DEFINE_MUTEX(sparse_irq_lock);
-static DECLARE_BITMAP(allocated_irqs, MAX_SPARSE_IRQS);
+static struct maple_tree sparse_irqs = MTREE_INIT_EXT(sparse_irqs,
+ MT_FLAGS_ALLOC_RANGE |
+ MT_FLAGS_LOCK_EXTERN |
+ MT_FLAGS_USE_RCU,
+ sparse_irq_lock);
static int irq_find_free_area(unsigned int from, unsigned int cnt)
{
- return bitmap_find_next_zero_area(allocated_irqs, MAX_SPARSE_IRQS,
- from, cnt, 0);
+ MA_STATE(mas, &sparse_irqs, 0, 0);
+
+ if (mas_empty_area(&mas, from, MAX_SPARSE_IRQS, cnt))
+ return -ENOSPC;
+ return mas.index;
}
static unsigned int irq_find_at_or_after(unsigned int offset)
{
- return find_next_bit(allocated_irqs, nr_irqs, offset);
+ unsigned long index = offset;
+ struct irq_desc *desc = mt_find(&sparse_irqs, &index, nr_irqs);
+
+ return desc ? irq_desc_get_irq(desc) : nr_irqs;
+}
+
+static void irq_insert_desc(unsigned int irq, struct irq_desc *desc)
+{
+ MA_STATE(mas, &sparse_irqs, irq, irq);
+ WARN_ON(mas_store_gfp(&mas, desc, GFP_KERNEL) != 0);
+}
+
+static void delete_irq_desc(unsigned int irq)
+{
+ MA_STATE(mas, &sparse_irqs, irq, irq);
+ mas_erase(&mas);
}
#ifdef CONFIG_SPARSE_IRQ
@@ -355,26 +376,14 @@ static void irq_sysfs_del(struct irq_desc *desc) {}
#endif /* CONFIG_SYSFS */
-static RADIX_TREE(irq_desc_tree, GFP_KERNEL);
-
-static void irq_insert_desc(unsigned int irq, struct irq_desc *desc)
-{
- radix_tree_insert(&irq_desc_tree, irq, desc);
-}
-
struct irq_desc *irq_to_desc(unsigned int irq)
{
- return radix_tree_lookup(&irq_desc_tree, irq);
+ return mtree_load(&sparse_irqs, irq);
}
#ifdef CONFIG_KVM_BOOK3S_64_HV_MODULE
EXPORT_SYMBOL_GPL(irq_to_desc);
#endif
-static void delete_irq_desc(unsigned int irq)
-{
- radix_tree_delete(&irq_desc_tree, irq);
-}
-
#ifdef CONFIG_SMP
static void free_masks(struct irq_desc *desc)
{
@@ -517,7 +526,6 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node,
irq_sysfs_add(start + i, desc);
irq_add_debugfs_entry(start + i, desc);
}
- bitmap_set(allocated_irqs, start, cnt);
return start;
err:
@@ -557,7 +565,6 @@ int __init early_irq_init(void)
for (i = 0; i < initcnt; i++) {
desc = alloc_desc(i, node, 0, NULL, NULL);
- set_bit(i, allocated_irqs);
irq_insert_desc(i, desc);
}
return arch_early_irq_init();
@@ -612,6 +619,7 @@ static void free_desc(unsigned int irq)
raw_spin_lock_irqsave(&desc->lock, flags);
desc_set_defaults(irq, desc, irq_desc_get_node(desc), NULL, NULL);
raw_spin_unlock_irqrestore(&desc->lock, flags);
+ delete_irq_desc(irq);
}
static inline int alloc_descs(unsigned int start, unsigned int cnt, int node,
@@ -624,8 +632,8 @@ static inline int alloc_descs(unsigned int start, unsigned int cnt, int node,
struct irq_desc *desc = irq_to_desc(start + i);
desc->owner = owner;
+ irq_insert_desc(start + i, desc);
}
- bitmap_set(allocated_irqs, start, cnt);
return start;
}
@@ -637,7 +645,7 @@ static int irq_expand_nr_irqs(unsigned int nr)
void irq_mark_irq(unsigned int irq)
{
mutex_lock(&sparse_irq_lock);
- bitmap_set(allocated_irqs, irq, 1);
+ irq_insert_desc(irq, irq_desc + irq);
mutex_unlock(&sparse_irq_lock);
}
@@ -781,7 +789,6 @@ void irq_free_descs(unsigned int from, unsigned int cnt)
for (i = 0; i < cnt; i++)
free_desc(from + i);
- bitmap_clear(allocated_irqs, from, cnt);
mutex_unlock(&sparse_irq_lock);
}
EXPORT_SYMBOL_GPL(irq_free_descs);
@@ -844,7 +851,7 @@ EXPORT_SYMBOL_GPL(__irq_alloc_descs);
* irq_get_next_irq - get next allocated irq number
* @offset: where to start the search
*
- * Returns next irq number at or after offset or nr_irqs if none is found.
+ * Returns next irq number after offset or nr_irqs if none is found.
*/
unsigned int irq_get_next_irq(unsigned int offset)
{
--
2.25.1
Move the open coded sparse bitmap handling into helper functions as
a preparatory step for converting the sparse interrupt management
to a maple tree.
No functional change.
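Condensed from the hunks below, call sites go from open coded bitmap
operations to named helpers, so a later patch only has to swap the
helper bodies:

	/* before: open coded at the call site */
	start = bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS,
					   from, cnt, 0);

	/* after: the implementation detail is hidden behind a helper */
	start = irq_find_free_area(from, cnt);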
Signed-off-by: Shanker Donthineni <[email protected]>
---
kernel/irq/internals.h | 4 ++--
kernel/irq/irqdesc.c | 30 ++++++++++++++++++++----------
2 files changed, 22 insertions(+), 12 deletions(-)
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 51fc8c497c22..f3f2090dd2de 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -12,9 +12,9 @@
#include <linux/sched/clock.h>
#ifdef CONFIG_SPARSE_IRQ
-# define IRQ_BITMAP_BITS (NR_IRQS + 8196)
+# define MAX_SPARSE_IRQS (NR_IRQS + 8196)
#else
-# define IRQ_BITMAP_BITS NR_IRQS
+# define MAX_SPARSE_IRQS NR_IRQS
#endif
#define istate core_internal_state__do_not_mess_with_it
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index b401b89b226a..e0d9dd9b36f9 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -131,7 +131,18 @@ int nr_irqs = NR_IRQS;
EXPORT_SYMBOL_GPL(nr_irqs);
static DEFINE_MUTEX(sparse_irq_lock);
-static DECLARE_BITMAP(allocated_irqs, IRQ_BITMAP_BITS);
+static DECLARE_BITMAP(allocated_irqs, MAX_SPARSE_IRQS);
+
+static int irq_find_free_area(unsigned int from, unsigned int cnt)
+{
+ return bitmap_find_next_zero_area(allocated_irqs, MAX_SPARSE_IRQS,
+ from, cnt, 0);
+}
+
+static unsigned int irq_find_at_or_after(unsigned int offset)
+{
+ return find_next_bit(allocated_irqs, nr_irqs, offset);
+}
#ifdef CONFIG_SPARSE_IRQ
@@ -517,7 +528,7 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node,
static int irq_expand_nr_irqs(unsigned int nr)
{
- if (nr > IRQ_BITMAP_BITS)
+ if (nr > MAX_SPARSE_IRQS)
return -ENOMEM;
nr_irqs = nr;
return 0;
@@ -535,11 +546,11 @@ int __init early_irq_init(void)
printk(KERN_INFO "NR_IRQS: %d, nr_irqs: %d, preallocated irqs: %d\n",
NR_IRQS, nr_irqs, initcnt);
- if (WARN_ON(nr_irqs > IRQ_BITMAP_BITS))
- nr_irqs = IRQ_BITMAP_BITS;
+ if (WARN_ON(nr_irqs > MAX_SPARSE_IRQS))
+ nr_irqs = MAX_SPARSE_IRQS;
- if (WARN_ON(initcnt > IRQ_BITMAP_BITS))
- initcnt = IRQ_BITMAP_BITS;
+ if (WARN_ON(initcnt > MAX_SPARSE_IRQS))
+ initcnt = MAX_SPARSE_IRQS;
if (initcnt > nr_irqs)
nr_irqs = initcnt;
@@ -812,8 +823,7 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
mutex_lock(&sparse_irq_lock);
- start = bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS,
- from, cnt, 0);
+ start = irq_find_free_area(from, cnt);
ret = -EEXIST;
if (irq >=0 && start != irq)
goto unlock;
@@ -834,11 +844,11 @@ EXPORT_SYMBOL_GPL(__irq_alloc_descs);
* irq_get_next_irq - get next allocated irq number
* @offset: where to start the search
*
- * Returns next irq number after offset or nr_irqs if none is found.
+ * Returns next irq number at or after offset or nr_irqs if none is found.
*/
unsigned int irq_get_next_irq(unsigned int offset)
{
- return find_next_bit(allocated_irqs, nr_irqs, offset);
+ return irq_find_at_or_after(offset);
}
struct irq_desc *
--
2.25.1
The current implementation utilizes a bitmap for managing IRQ resend
handlers, which is allocated based on the SPARSE_IRQ/NR_IRQS macros.
However, this method may not efficiently utilize memory during runtime,
particularly when IRQ_BITMAP_BITS is large.
Address this issue by using the hlist to manage IRQ resend handlers
instead of relying on a static bitmap memory allocation. Additionally,
a new function, clear_irq_resend(), is introduced and called from
irq_shutdown to ensure a graceful teardown of the interrupt.
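The memory trade-off, condensed (sizes assume a 64-bit kernel and
the default ~8K ceiling):

	/* before: one bit per possible IRQ, sized at build time (~1 KiB) */
	static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);

	/*
	 * after: a 16-byte hlist_node embedded in each allocated
	 * descriptor, plus one global list head; nothing is reserved
	 * for unused IRQ numbers.
	 */
	static HLIST_HEAD(irq_resend_list);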
Signed-off-by: Shanker Donthineni <[email protected]>
---
include/linux/irqdesc.h | 3 +++
kernel/irq/chip.c | 1 +
kernel/irq/internals.h | 2 ++
kernel/irq/irqdesc.c | 2 ++
kernel/irq/resend.c | 47 ++++++++++++++++++++++++++---------------
5 files changed, 38 insertions(+), 17 deletions(-)
diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index 844a8e30e6de..d9451d456a73 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -102,6 +102,9 @@ struct irq_desc {
int parent_irq;
struct module *owner;
const char *name;
+#ifdef CONFIG_HARDIRQS_SW_RESEND
+ struct hlist_node resend_node;
+#endif
} ____cacheline_internodealigned_in_smp;
#ifdef CONFIG_SPARSE_IRQ
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 49e7bc871fec..2eac5532c3c8 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -306,6 +306,7 @@ static void __irq_disable(struct irq_desc *desc, bool mask);
void irq_shutdown(struct irq_desc *desc)
{
if (irqd_is_started(&desc->irq_data)) {
+ clear_irq_resend(desc);
desc->depth = 1;
if (desc->irq_data.chip->irq_shutdown) {
desc->irq_data.chip->irq_shutdown(&desc->irq_data);
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 5fdc0b557579..51fc8c497c22 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -113,6 +113,8 @@ irqreturn_t handle_irq_event(struct irq_desc *desc);
/* Resending of interrupts :*/
int check_irq_resend(struct irq_desc *desc, bool inject);
+void clear_irq_resend(struct irq_desc *desc);
+void irq_resend_init(struct irq_desc *desc);
bool irq_wait_for_poll(struct irq_desc *desc);
void __irq_wake_thread(struct irq_desc *desc, struct irqaction *action);
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 240e145e969f..b401b89b226a 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -415,6 +415,7 @@ static struct irq_desc *alloc_desc(int irq, int node, unsigned int flags,
desc_set_defaults(irq, desc, node, affinity, owner);
irqd_set(&desc->irq_data, flags);
kobject_init(&desc->kobj, &irq_kobj_type);
+ irq_resend_init(desc);
return desc;
@@ -581,6 +582,7 @@ int __init early_irq_init(void)
mutex_init(&desc[i].request_mutex);
init_waitqueue_head(&desc[i].wait_for_threads);
desc_set_defaults(i, &desc[i], node, NULL, NULL);
+ irq_resend_init(desc);
}
return arch_early_irq_init();
}
diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c
index 0c46e9fe3a89..edec335c0a7a 100644
--- a/kernel/irq/resend.c
+++ b/kernel/irq/resend.c
@@ -21,8 +21,9 @@
#ifdef CONFIG_HARDIRQS_SW_RESEND
-/* Bitmap to handle software resend of interrupts: */
-static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
+/* hlist_head to handle software resend of interrupts: */
+static HLIST_HEAD(irq_resend_list);
+static DEFINE_RAW_SPINLOCK(irq_resend_lock);
/*
* Run software resends of IRQ's
@@ -30,18 +31,17 @@ static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
static void resend_irqs(struct tasklet_struct *unused)
{
struct irq_desc *desc;
- int irq;
-
- while (!bitmap_empty(irqs_resend, nr_irqs)) {
- irq = find_first_bit(irqs_resend, nr_irqs);
- clear_bit(irq, irqs_resend);
- desc = irq_to_desc(irq);
- if (!desc)
- continue;
- local_irq_disable();
+
+ raw_spin_lock_irq(&irq_resend_lock);
+ while (!hlist_empty(&irq_resend_list)) {
+ desc = hlist_entry(irq_resend_list.first, struct irq_desc,
+ resend_node);
+ hlist_del_init(&desc->resend_node);
+ raw_spin_unlock(&irq_resend_lock);
desc->handle_irq(desc);
- local_irq_enable();
+ raw_spin_lock(&irq_resend_lock);
}
+ raw_spin_unlock_irq(&irq_resend_lock);
}
/* Tasklet to handle resend: */
@@ -49,8 +49,6 @@ static DECLARE_TASKLET(resend_tasklet, resend_irqs);
static int irq_sw_resend(struct irq_desc *desc)
{
- unsigned int irq = irq_desc_get_irq(desc);
-
/*
* Validate whether this interrupt can be safely injected from
* non interrupt context
@@ -70,16 +68,31 @@ static int irq_sw_resend(struct irq_desc *desc)
*/
if (!desc->parent_irq)
return -EINVAL;
- irq = desc->parent_irq;
}
- /* Set it pending and activate the softirq: */
- set_bit(irq, irqs_resend);
+ /* Add to resend_list and activate the softirq: */
+ raw_spin_lock(&irq_resend_lock);
+ hlist_add_head(&desc->resend_node, &irq_resend_list);
+ raw_spin_unlock(&irq_resend_lock);
tasklet_schedule(&resend_tasklet);
return 0;
}
+void clear_irq_resend(struct irq_desc *desc)
+{
+ raw_spin_lock(&irq_resend_lock);
+ hlist_del_init(&desc->resend_node);
+ raw_spin_unlock(&irq_resend_lock);
+}
+
+void irq_resend_init(struct irq_desc *desc)
+{
+ INIT_HLIST_NODE(&desc->resend_node);
+}
#else
+void clear_irq_resend(struct irq_desc *desc) {}
+void irq_resend_init(struct irq_desc *desc) {}
+
static int irq_sw_resend(struct irq_desc *desc)
{
return -EINVAL;
--
2.25.1
The following commit has been merged into the irq/core branch of tip:
Commit-ID: 721255b9826bd11c7a38b585905fc2dd0fb94e52
Gitweb: https://git.kernel.org/tip/721255b9826bd11c7a38b585905fc2dd0fb94e52
Author: Shanker Donthineni <[email protected]>
AuthorDate: Fri, 19 May 2023 08:49:02 -05:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 24 May 2023 11:39:44 +02:00
genirq: Use a maple tree for interrupt descriptor management
The current implementation uses a static bitmap for interrupt descriptor
allocation and a radix tree to store the pointer for lookup.
However, the size of the bitmap is constrained by the build time macro
MAX_SPARSE_IRQS, which may not be sufficient to support high-end servers,
particularly those with GICv4.1 hardware, which require a large interrupt
space to cover LPIs and vSGIs.
Replace the bitmap and the radix tree with a maple tree, which not only
stores pointers for lookup, but also provides a mechanism to find free
ranges. That removes the build time hardcoded upper limit.
Signed-off-by: Shanker Donthineni <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/irq/internals.h | 2 +-
kernel/irq/irqdesc.c | 57 +++++++++++++++++++++++------------------
2 files changed, 33 insertions(+), 26 deletions(-)
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index f3f2090..7bdb750 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -12,7 +12,7 @@
#include <linux/sched/clock.h>
#ifdef CONFIG_SPARSE_IRQ
-# define MAX_SPARSE_IRQS (NR_IRQS + 8196)
+# define MAX_SPARSE_IRQS INT_MAX
#else
# define MAX_SPARSE_IRQS NR_IRQS
#endif
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index e0d9dd9..27ca1c8 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -12,8 +12,7 @@
#include <linux/export.h>
#include <linux/interrupt.h>
#include <linux/kernel_stat.h>
-#include <linux/radix-tree.h>
-#include <linux/bitmap.h>
+#include <linux/maple_tree.h>
#include <linux/irqdomain.h>
#include <linux/sysfs.h>
@@ -131,17 +130,39 @@ int nr_irqs = NR_IRQS;
EXPORT_SYMBOL_GPL(nr_irqs);
static DEFINE_MUTEX(sparse_irq_lock);
-static DECLARE_BITMAP(allocated_irqs, MAX_SPARSE_IRQS);
+static struct maple_tree sparse_irqs = MTREE_INIT_EXT(sparse_irqs,
+ MT_FLAGS_ALLOC_RANGE |
+ MT_FLAGS_LOCK_EXTERN |
+ MT_FLAGS_USE_RCU,
+ sparse_irq_lock);
static int irq_find_free_area(unsigned int from, unsigned int cnt)
{
- return bitmap_find_next_zero_area(allocated_irqs, MAX_SPARSE_IRQS,
- from, cnt, 0);
+ MA_STATE(mas, &sparse_irqs, 0, 0);
+
+ if (mas_empty_area(&mas, from, MAX_SPARSE_IRQS, cnt))
+ return -ENOSPC;
+ return mas.index;
}
static unsigned int irq_find_at_or_after(unsigned int offset)
{
- return find_next_bit(allocated_irqs, nr_irqs, offset);
+ unsigned long index = offset;
+ struct irq_desc *desc = mt_find(&sparse_irqs, &index, nr_irqs);
+
+ return desc ? irq_desc_get_irq(desc) : nr_irqs;
+}
+
+static void irq_insert_desc(unsigned int irq, struct irq_desc *desc)
+{
+ MA_STATE(mas, &sparse_irqs, irq, irq);
+ WARN_ON(mas_store_gfp(&mas, desc, GFP_KERNEL) != 0);
+}
+
+static void delete_irq_desc(unsigned int irq)
+{
+ MA_STATE(mas, &sparse_irqs, irq, irq);
+ mas_erase(&mas);
}
#ifdef CONFIG_SPARSE_IRQ
@@ -355,26 +376,14 @@ static void irq_sysfs_del(struct irq_desc *desc) {}
#endif /* CONFIG_SYSFS */
-static RADIX_TREE(irq_desc_tree, GFP_KERNEL);
-
-static void irq_insert_desc(unsigned int irq, struct irq_desc *desc)
-{
- radix_tree_insert(&irq_desc_tree, irq, desc);
-}
-
struct irq_desc *irq_to_desc(unsigned int irq)
{
- return radix_tree_lookup(&irq_desc_tree, irq);
+ return mtree_load(&sparse_irqs, irq);
}
#ifdef CONFIG_KVM_BOOK3S_64_HV_MODULE
EXPORT_SYMBOL_GPL(irq_to_desc);
#endif
-static void delete_irq_desc(unsigned int irq)
-{
- radix_tree_delete(&irq_desc_tree, irq);
-}
-
#ifdef CONFIG_SMP
static void free_masks(struct irq_desc *desc)
{
@@ -517,7 +526,6 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node,
irq_sysfs_add(start + i, desc);
irq_add_debugfs_entry(start + i, desc);
}
- bitmap_set(allocated_irqs, start, cnt);
return start;
err:
@@ -557,7 +565,6 @@ int __init early_irq_init(void)
for (i = 0; i < initcnt; i++) {
desc = alloc_desc(i, node, 0, NULL, NULL);
- set_bit(i, allocated_irqs);
irq_insert_desc(i, desc);
}
return arch_early_irq_init();
@@ -612,6 +619,7 @@ static void free_desc(unsigned int irq)
raw_spin_lock_irqsave(&desc->lock, flags);
desc_set_defaults(irq, desc, irq_desc_get_node(desc), NULL, NULL);
raw_spin_unlock_irqrestore(&desc->lock, flags);
+ delete_irq_desc(irq);
}
static inline int alloc_descs(unsigned int start, unsigned int cnt, int node,
@@ -624,8 +632,8 @@ static inline int alloc_descs(unsigned int start, unsigned int cnt, int node,
struct irq_desc *desc = irq_to_desc(start + i);
desc->owner = owner;
+ irq_insert_desc(start + i, desc);
}
- bitmap_set(allocated_irqs, start, cnt);
return start;
}
@@ -637,7 +645,7 @@ static int irq_expand_nr_irqs(unsigned int nr)
void irq_mark_irq(unsigned int irq)
{
mutex_lock(&sparse_irq_lock);
- bitmap_set(allocated_irqs, irq, 1);
+ irq_insert_desc(irq, irq_desc + irq);
mutex_unlock(&sparse_irq_lock);
}
@@ -781,7 +789,6 @@ void irq_free_descs(unsigned int from, unsigned int cnt)
for (i = 0; i < cnt; i++)
free_desc(from + i);
- bitmap_clear(allocated_irqs, from, cnt);
mutex_unlock(&sparse_irq_lock);
}
EXPORT_SYMBOL_GPL(irq_free_descs);
@@ -844,7 +851,7 @@ EXPORT_SYMBOL_GPL(__irq_alloc_descs);
* irq_get_next_irq - get next allocated irq number
* @offset: where to start the search
*
- * Returns next irq number at or after offset or nr_irqs if none is found.
+ * Returns next irq number after offset or nr_irqs if none is found.
*/
unsigned int irq_get_next_irq(unsigned int offset)
{
The following commit has been merged into the irq/core branch of tip:
Commit-ID: 5e630aa8d9fcd4c0cb6d5d09422009533aba979a
Gitweb: https://git.kernel.org/tip/5e630aa8d9fcd4c0cb6d5d09422009533aba979a
Author: Shanker Donthineni <[email protected]>
AuthorDate: Fri, 19 May 2023 08:49:01 -05:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 24 May 2023 11:39:44 +02:00
genirq: Encapsulate sparse bitmap handling
Move the open coded sparse bitmap handling into helper functions as
a preparatory step for converting the sparse interrupt management
to a maple tree.
No functional change.
Signed-off-by: Shanker Donthineni <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/irq/internals.h | 4 ++--
kernel/irq/irqdesc.c | 30 ++++++++++++++++++++----------
2 files changed, 22 insertions(+), 12 deletions(-)
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 51fc8c4..f3f2090 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -12,9 +12,9 @@
#include <linux/sched/clock.h>
#ifdef CONFIG_SPARSE_IRQ
-# define IRQ_BITMAP_BITS (NR_IRQS + 8196)
+# define MAX_SPARSE_IRQS (NR_IRQS + 8196)
#else
-# define IRQ_BITMAP_BITS NR_IRQS
+# define MAX_SPARSE_IRQS NR_IRQS
#endif
#define istate core_internal_state__do_not_mess_with_it
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index b401b89..e0d9dd9 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -131,7 +131,18 @@ int nr_irqs = NR_IRQS;
EXPORT_SYMBOL_GPL(nr_irqs);
static DEFINE_MUTEX(sparse_irq_lock);
-static DECLARE_BITMAP(allocated_irqs, IRQ_BITMAP_BITS);
+static DECLARE_BITMAP(allocated_irqs, MAX_SPARSE_IRQS);
+
+static int irq_find_free_area(unsigned int from, unsigned int cnt)
+{
+ return bitmap_find_next_zero_area(allocated_irqs, MAX_SPARSE_IRQS,
+ from, cnt, 0);
+}
+
+static unsigned int irq_find_at_or_after(unsigned int offset)
+{
+ return find_next_bit(allocated_irqs, nr_irqs, offset);
+}
#ifdef CONFIG_SPARSE_IRQ
@@ -517,7 +528,7 @@ err:
static int irq_expand_nr_irqs(unsigned int nr)
{
- if (nr > IRQ_BITMAP_BITS)
+ if (nr > MAX_SPARSE_IRQS)
return -ENOMEM;
nr_irqs = nr;
return 0;
@@ -535,11 +546,11 @@ int __init early_irq_init(void)
printk(KERN_INFO "NR_IRQS: %d, nr_irqs: %d, preallocated irqs: %d\n",
NR_IRQS, nr_irqs, initcnt);
- if (WARN_ON(nr_irqs > IRQ_BITMAP_BITS))
- nr_irqs = IRQ_BITMAP_BITS;
+ if (WARN_ON(nr_irqs > MAX_SPARSE_IRQS))
+ nr_irqs = MAX_SPARSE_IRQS;
- if (WARN_ON(initcnt > IRQ_BITMAP_BITS))
- initcnt = IRQ_BITMAP_BITS;
+ if (WARN_ON(initcnt > MAX_SPARSE_IRQS))
+ initcnt = MAX_SPARSE_IRQS;
if (initcnt > nr_irqs)
nr_irqs = initcnt;
@@ -812,8 +823,7 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
mutex_lock(&sparse_irq_lock);
- start = bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS,
- from, cnt, 0);
+ start = irq_find_free_area(from, cnt);
ret = -EEXIST;
if (irq >=0 && start != irq)
goto unlock;
@@ -834,11 +844,11 @@ EXPORT_SYMBOL_GPL(__irq_alloc_descs);
* irq_get_next_irq - get next allocated irq number
* @offset: where to start the search
*
- * Returns next irq number after offset or nr_irqs if none is found.
+ * Returns next irq number at or after offset or nr_irqs if none is found.
*/
unsigned int irq_get_next_irq(unsigned int offset)
{
- return find_next_bit(allocated_irqs, nr_irqs, offset);
+ return irq_find_at_or_after(offset);
}
struct irq_desc *
The following commit has been merged into the irq/core branch of tip:
Commit-ID: bc06a9e0874239cb6d4eebcb0ecd1a91ad9272db
Gitweb: https://git.kernel.org/tip/bc06a9e0874239cb6d4eebcb0ecd1a91ad9272db
Author: Shanker Donthineni <[email protected]>
AuthorDate: Fri, 19 May 2023 08:49:00 -05:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 24 May 2023 11:39:44 +02:00
genirq: Use hlist for managing resend handlers
The current implementation utilizes a bitmap for managing interrupt resend
handlers, which is allocated based on the SPARSE_IRQ/NR_IRQS macros.
However, this method may not efficiently utilize memory during runtime,
particularly when IRQ_BITMAP_BITS is large.
Address this issue by using an hlist to manage interrupt resend handlers
instead of relying on a static bitmap memory allocation. Additionally, a
new function, clear_irq_resend(), is introduced and called from
irq_shutdown to ensure a graceful teardown of the interrupt.
Signed-off-by: Shanker Donthineni <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/irqdesc.h | 3 +++-
kernel/irq/chip.c | 1 +-
kernel/irq/internals.h | 2 ++-
kernel/irq/irqdesc.c | 2 ++-
kernel/irq/resend.c | 47 +++++++++++++++++++++++++---------------
5 files changed, 38 insertions(+), 17 deletions(-)
diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index 844a8e3..d9451d4 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -102,6 +102,9 @@ struct irq_desc {
int parent_irq;
struct module *owner;
const char *name;
+#ifdef CONFIG_HARDIRQS_SW_RESEND
+ struct hlist_node resend_node;
+#endif
} ____cacheline_internodealigned_in_smp;
#ifdef CONFIG_SPARSE_IRQ
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 49e7bc8..2eac553 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -306,6 +306,7 @@ static void __irq_disable(struct irq_desc *desc, bool mask);
void irq_shutdown(struct irq_desc *desc)
{
if (irqd_is_started(&desc->irq_data)) {
+ clear_irq_resend(desc);
desc->depth = 1;
if (desc->irq_data.chip->irq_shutdown) {
desc->irq_data.chip->irq_shutdown(&desc->irq_data);
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 5fdc0b5..51fc8c4 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -113,6 +113,8 @@ irqreturn_t handle_irq_event(struct irq_desc *desc);
/* Resending of interrupts :*/
int check_irq_resend(struct irq_desc *desc, bool inject);
+void clear_irq_resend(struct irq_desc *desc);
+void irq_resend_init(struct irq_desc *desc);
bool irq_wait_for_poll(struct irq_desc *desc);
void __irq_wake_thread(struct irq_desc *desc, struct irqaction *action);
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 240e145..b401b89 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -415,6 +415,7 @@ static struct irq_desc *alloc_desc(int irq, int node, unsigned int flags,
desc_set_defaults(irq, desc, node, affinity, owner);
irqd_set(&desc->irq_data, flags);
kobject_init(&desc->kobj, &irq_kobj_type);
+ irq_resend_init(desc);
return desc;
@@ -581,6 +582,7 @@ int __init early_irq_init(void)
mutex_init(&desc[i].request_mutex);
init_waitqueue_head(&desc[i].wait_for_threads);
desc_set_defaults(i, &desc[i], node, NULL, NULL);
+ irq_resend_init(desc);
}
return arch_early_irq_init();
}
diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c
index 0c46e9f..edec335 100644
--- a/kernel/irq/resend.c
+++ b/kernel/irq/resend.c
@@ -21,8 +21,9 @@
#ifdef CONFIG_HARDIRQS_SW_RESEND
-/* Bitmap to handle software resend of interrupts: */
-static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
+/* hlist_head to handle software resend of interrupts: */
+static HLIST_HEAD(irq_resend_list);
+static DEFINE_RAW_SPINLOCK(irq_resend_lock);
/*
* Run software resends of IRQ's
@@ -30,18 +31,17 @@ static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
static void resend_irqs(struct tasklet_struct *unused)
{
struct irq_desc *desc;
- int irq;
-
- while (!bitmap_empty(irqs_resend, nr_irqs)) {
- irq = find_first_bit(irqs_resend, nr_irqs);
- clear_bit(irq, irqs_resend);
- desc = irq_to_desc(irq);
- if (!desc)
- continue;
- local_irq_disable();
+
+ raw_spin_lock_irq(&irq_resend_lock);
+ while (!hlist_empty(&irq_resend_list)) {
+ desc = hlist_entry(irq_resend_list.first, struct irq_desc,
+ resend_node);
+ hlist_del_init(&desc->resend_node);
+ raw_spin_unlock(&irq_resend_lock);
desc->handle_irq(desc);
- local_irq_enable();
+ raw_spin_lock(&irq_resend_lock);
}
+ raw_spin_unlock_irq(&irq_resend_lock);
}
/* Tasklet to handle resend: */
@@ -49,8 +49,6 @@ static DECLARE_TASKLET(resend_tasklet, resend_irqs);
static int irq_sw_resend(struct irq_desc *desc)
{
- unsigned int irq = irq_desc_get_irq(desc);
-
/*
* Validate whether this interrupt can be safely injected from
* non interrupt context
@@ -70,16 +68,31 @@ static int irq_sw_resend(struct irq_desc *desc)
*/
if (!desc->parent_irq)
return -EINVAL;
- irq = desc->parent_irq;
}
- /* Set it pending and activate the softirq: */
- set_bit(irq, irqs_resend);
+ /* Add to resend_list and activate the softirq: */
+ raw_spin_lock(&irq_resend_lock);
+ hlist_add_head(&desc->resend_node, &irq_resend_list);
+ raw_spin_unlock(&irq_resend_lock);
tasklet_schedule(&resend_tasklet);
return 0;
}
+void clear_irq_resend(struct irq_desc *desc)
+{
+ raw_spin_lock(&irq_resend_lock);
+ hlist_del_init(&desc->resend_node);
+ raw_spin_unlock(&irq_resend_lock);
+}
+
+void irq_resend_init(struct irq_desc *desc)
+{
+ INIT_HLIST_NODE(&desc->resend_node);
+}
#else
+void clear_irq_resend(struct irq_desc *desc) {}
+void irq_resend_init(struct irq_desc *desc) {}
+
static int irq_sw_resend(struct irq_desc *desc)
{
return -EINVAL;
Hi, Shanker
On 2023/5/19 21:49, Shanker Donthineni wrote:
> The current implementation utilizes a bitmap for managing IRQ resend
> handlers, which is allocated based on the SPARSE_IRQ/NR_IRQS macros.
> However, this method may not efficiently utilize memory during runtime,
> particularly when IRQ_BITMAP_BITS is large.
>
> Address this issue by using the hlist to manage IRQ resend handlers
> instead of relying on a static bitmap memory allocation. Additionally,
> a new function, clear_irq_resend(), is introduced and called from
> irq_shutdown to ensure a graceful teardown of the interrupt.
>
> Signed-off-by: Shanker Donthineni <[email protected]>
> ---
> include/linux/irqdesc.h | 3 +++
> kernel/irq/chip.c | 1 +
> kernel/irq/internals.h | 2 ++
> kernel/irq/irqdesc.c | 2 ++
> kernel/irq/resend.c | 47 ++++++++++++++++++++++++++---------------
> 5 files changed, 38 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
> index 844a8e30e6de..d9451d456a73 100644
> --- a/include/linux/irqdesc.h
> +++ b/include/linux/irqdesc.h
> @@ -102,6 +102,9 @@ struct irq_desc {
> int parent_irq;
> struct module *owner;
> const char *name;
> +#ifdef CONFIG_HARDIRQS_SW_RESEND
> + struct hlist_node resend_node;
> +#endif
> } ____cacheline_internodealigned_in_smp;
Although there is no documented rule that limits the change of the KABI
struct irq_desc, it is still better to keep the irq_desc definition stable.
>
> #ifdef CONFIG_SPARSE_IRQ
> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
> index 49e7bc871fec..2eac5532c3c8 100644
> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -306,6 +306,7 @@ static void __irq_disable(struct irq_desc *desc, bool mask);
> void irq_shutdown(struct irq_desc *desc)
> {
> if (irqd_is_started(&desc->irq_data)) {
> + clear_irq_resend(desc);
> desc->depth = 1;
> if (desc->irq_data.chip->irq_shutdown) {
> desc->irq_data.chip->irq_shutdown(&desc->irq_data);
> diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
> index 5fdc0b557579..51fc8c497c22 100644
> --- a/kernel/irq/internals.h
> +++ b/kernel/irq/internals.h
> @@ -113,6 +113,8 @@ irqreturn_t handle_irq_event(struct irq_desc *desc);
>
> /* Resending of interrupts :*/
> int check_irq_resend(struct irq_desc *desc, bool inject);
> +void clear_irq_resend(struct irq_desc *desc);
> +void irq_resend_init(struct irq_desc *desc);
> bool irq_wait_for_poll(struct irq_desc *desc);
> void __irq_wake_thread(struct irq_desc *desc, struct irqaction *action);
>
> diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
> index 240e145e969f..b401b89b226a 100644
> --- a/kernel/irq/irqdesc.c
> +++ b/kernel/irq/irqdesc.c
> @@ -415,6 +415,7 @@ static struct irq_desc *alloc_desc(int irq, int node, unsigned int flags,
> desc_set_defaults(irq, desc, node, affinity, owner);
> irqd_set(&desc->irq_data, flags);
> kobject_init(&desc->kobj, &irq_kobj_type);
> + irq_resend_init(desc);
>
> return desc;
>
> @@ -581,6 +582,7 @@ int __init early_irq_init(void)
> mutex_init(&desc[i].request_mutex);
> init_waitqueue_head(&desc[i].wait_for_threads);
> desc_set_defaults(i, &desc[i], node, NULL, NULL);
> + irq_resend_init(desc);
> }
> return arch_early_irq_init();
> }
> diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c
> index 0c46e9fe3a89..edec335c0a7a 100644
> --- a/kernel/irq/resend.c
> +++ b/kernel/irq/resend.c
> @@ -21,8 +21,9 @@
>
> #ifdef CONFIG_HARDIRQS_SW_RESEND
>
> -/* Bitmap to handle software resend of interrupts: */
> -static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
> +/* hlist_head to handle software resend of interrupts: */
> +static HLIST_HEAD(irq_resend_list);
> +static DEFINE_RAW_SPINLOCK(irq_resend_lock);
What is the benefit of using hlist here? If you want to enjoy the
low latency of querying elements by key, you must define a hlist table
with a reasonable number of buckets. Otherwise, I don't think the time
complexity of hlist is better than a regular double-linked list, right?
>
> /*
> * Run software resends of IRQ's
> @@ -30,18 +31,17 @@ static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
> static void resend_irqs(struct tasklet_struct *unused)
> {
> struct irq_desc *desc;
> - int irq;
> -
> - while (!bitmap_empty(irqs_resend, nr_irqs)) {
> - irq = find_first_bit(irqs_resend, nr_irqs);
> - clear_bit(irq, irqs_resend);
> - desc = irq_to_desc(irq);
> - if (!desc)
> - continue;
> - local_irq_disable();
> +
> + raw_spin_lock_irq(&irq_resend_lock);
> + while (!hlist_empty(&irq_resend_list)) {
> + desc = hlist_entry(irq_resend_list.first, struct irq_desc,
> + resend_node);
> + hlist_del_init(&desc->resend_node);
> + raw_spin_unlock(&irq_resend_lock);
> desc->handle_irq(desc);
> - local_irq_enable();
> + raw_spin_lock(&irq_resend_lock);
> }
> + raw_spin_unlock_irq(&irq_resend_lock);
> }
>
> /* Tasklet to handle resend: */
> @@ -49,8 +49,6 @@ static DECLARE_TASKLET(resend_tasklet, resend_irqs);
>
> static int irq_sw_resend(struct irq_desc *desc)
> {
> - unsigned int irq = irq_desc_get_irq(desc);
> -
> /*
> * Validate whether this interrupt can be safely injected from
> * non interrupt context
> @@ -70,16 +68,31 @@ static int irq_sw_resend(struct irq_desc *desc)
> */
> if (!desc->parent_irq)
> return -EINVAL;
> - irq = desc->parent_irq;
Why delete this code?
> }
>
> - /* Set it pending and activate the softirq: */
> - set_bit(irq, irqs_resend);
> + /* Add to resend_list and activate the softirq: */
> + raw_spin_lock(&irq_resend_lock);
> + hlist_add_head(&desc->resend_node, &irq_resend_list);
> + raw_spin_unlock(&irq_resend_lock);
Have you considered a situation where irq_sw_resend() is running on two CPUs concurrently?
If so, the same desc could be added into irq_resend_list twice by mistake.
> tasklet_schedule(&resend_tasklet);
> return 0;
> }
>
> +void clear_irq_resend(struct irq_desc *desc)
> +{
> + raw_spin_lock(&irq_resend_lock);
> + hlist_del_init(&desc->resend_node);
> + raw_spin_unlock(&irq_resend_lock);
> +}
> +
> +void irq_resend_init(struct irq_desc *desc)
> +{
> + INIT_HLIST_NODE(&desc->resend_node);
> +}
> #else
> +void clear_irq_resend(struct irq_desc *desc) {}
> +void irq_resend_init(struct irq_desc *desc) {}
> +
> static int irq_sw_resend(struct irq_desc *desc)
> {
> return -EINVAL;
--
BR
Liao, Chang
On Mon, 29 May 2023 08:57:07 +0100,
"Liao, Chang" <[email protected]> wrote:
>
> Hi, Shanker
>
> On 2023/5/19 21:49, Shanker Donthineni wrote:
> > The current implementation utilizes a bitmap for managing IRQ resend
> > handlers, which is allocated based on the SPARSE_IRQ/NR_IRQS macros.
> > However, this method may not efficiently utilize memory during runtime,
> > particularly when IRQ_BITMAP_BITS is large.
> >
> > Address this issue by using the hlist to manage IRQ resend handlers
> > instead of relying on a static bitmap memory allocation. Additionally,
> > a new function, clear_irq_resend(), is introduced and called from
> > irq_shutdown to ensure a graceful teardown of the interrupt.
> >
> > Signed-off-by: Shanker Donthineni <[email protected]>
> > ---
> > include/linux/irqdesc.h | 3 +++
> > kernel/irq/chip.c | 1 +
> > kernel/irq/internals.h | 2 ++
> > kernel/irq/irqdesc.c | 2 ++
> > kernel/irq/resend.c | 47 ++++++++++++++++++++++++++---------------
> > 5 files changed, 38 insertions(+), 17 deletions(-)
> >
> > diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
> > index 844a8e30e6de..d9451d456a73 100644
> > --- a/include/linux/irqdesc.h
> > +++ b/include/linux/irqdesc.h
> > @@ -102,6 +102,9 @@ struct irq_desc {
> > int parent_irq;
> > struct module *owner;
> > const char *name;
> > +#ifdef CONFIG_HARDIRQS_SW_RESEND
> > + struct hlist_node resend_node;
> > +#endif
> > } ____cacheline_internodealigned_in_smp;
>
> Although there is no documented rule that limits the change of the KABI
> struct irq_desc, it is still better to keep the irq_desc definition stable.
On what grounds? There is no such thing as a stable in-kernel ABI,
specially for things as internal as irq_desc.
> >
> > #ifdef CONFIG_SPARSE_IRQ
> > diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
> > index 49e7bc871fec..2eac5532c3c8 100644
> > --- a/kernel/irq/chip.c
> > +++ b/kernel/irq/chip.c
> > @@ -306,6 +306,7 @@ static void __irq_disable(struct irq_desc *desc, bool mask);
> > void irq_shutdown(struct irq_desc *desc)
> > {
> > if (irqd_is_started(&desc->irq_data)) {
> > + clear_irq_resend(desc);
> > desc->depth = 1;
> > if (desc->irq_data.chip->irq_shutdown) {
> > desc->irq_data.chip->irq_shutdown(&desc->irq_data);
> > diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
> > index 5fdc0b557579..51fc8c497c22 100644
> > --- a/kernel/irq/internals.h
> > +++ b/kernel/irq/internals.h
> > @@ -113,6 +113,8 @@ irqreturn_t handle_irq_event(struct irq_desc *desc);
> >
> > /* Resending of interrupts :*/
> > int check_irq_resend(struct irq_desc *desc, bool inject);
> > +void clear_irq_resend(struct irq_desc *desc);
> > +void irq_resend_init(struct irq_desc *desc);
> > bool irq_wait_for_poll(struct irq_desc *desc);
> > void __irq_wake_thread(struct irq_desc *desc, struct irqaction *action);
> >
> > diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
> > index 240e145e969f..b401b89b226a 100644
> > --- a/kernel/irq/irqdesc.c
> > +++ b/kernel/irq/irqdesc.c
> > @@ -415,6 +415,7 @@ static struct irq_desc *alloc_desc(int irq, int node, unsigned int flags,
> > desc_set_defaults(irq, desc, node, affinity, owner);
> > irqd_set(&desc->irq_data, flags);
> > kobject_init(&desc->kobj, &irq_kobj_type);
> > + irq_resend_init(desc);
> >
> > return desc;
> >
> > @@ -581,6 +582,7 @@ int __init early_irq_init(void)
> > mutex_init(&desc[i].request_mutex);
> > init_waitqueue_head(&desc[i].wait_for_threads);
> > desc_set_defaults(i, &desc[i], node, NULL, NULL);
> > + irq_resend_init(desc);
> > }
> > return arch_early_irq_init();
> > }
> > diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c
> > index 0c46e9fe3a89..edec335c0a7a 100644
> > --- a/kernel/irq/resend.c
> > +++ b/kernel/irq/resend.c
> > @@ -21,8 +21,9 @@
> >
> > #ifdef CONFIG_HARDIRQS_SW_RESEND
> >
> > -/* Bitmap to handle software resend of interrupts: */
> > -static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
> > +/* hlist_head to handle software resend of interrupts: */
> > +static HLIST_HEAD(irq_resend_list);
> > +static DEFINE_RAW_SPINLOCK(irq_resend_lock);
>
> What is the benefit of using hlist here? If you want to enjoy the
> low latency of querying elements by key, you must define a hlist table
> with a reasonable number of buckets. Otherwise, I don't think the time
> complexity of hlist is better than a regular double-linked list, right?
You do realise that the list is processed in order, one element after
the other, without ever querying any arbitrary element? Have you read
the code?
>
> >
> > /*
> > * Run software resends of IRQ's
> > @@ -30,18 +31,17 @@ static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
> > static void resend_irqs(struct tasklet_struct *unused)
> > {
> > struct irq_desc *desc;
> > - int irq;
> > -
> > - while (!bitmap_empty(irqs_resend, nr_irqs)) {
> > - irq = find_first_bit(irqs_resend, nr_irqs);
> > - clear_bit(irq, irqs_resend);
> > - desc = irq_to_desc(irq);
> > - if (!desc)
> > - continue;
> > - local_irq_disable();
> > +
> > + raw_spin_lock_irq(&irq_resend_lock);
> > + while (!hlist_empty(&irq_resend_list)) {
> > + desc = hlist_entry(irq_resend_list.first, struct irq_desc,
> > + resend_node);
> > + hlist_del_init(&desc->resend_node);
> > + raw_spin_unlock(&irq_resend_lock);
> > desc->handle_irq(desc);
> > - local_irq_enable();
> > + raw_spin_lock(&irq_resend_lock);
> > }
> > + raw_spin_unlock_irq(&irq_resend_lock);
> > }
> >
> > /* Tasklet to handle resend: */
> > @@ -49,8 +49,6 @@ static DECLARE_TASKLET(resend_tasklet, resend_irqs);
> >
> > static int irq_sw_resend(struct irq_desc *desc)
> > {
> > - unsigned int irq = irq_desc_get_irq(desc);
> > -
> > /*
> > * Validate whether this interrupt can be safely injected from
> > * non interrupt context
> > @@ -70,16 +68,31 @@ static int irq_sw_resend(struct irq_desc *desc)
> > */
> > if (!desc->parent_irq)
> > return -EINVAL;
> > - irq = desc->parent_irq;
>
> Why delete this code?
OK, now I know you haven't read this code at all :-(.
>
> > }
> >
> > - /* Set it pending and activate the softirq: */
> > - set_bit(irq, irqs_resend);
> > + /* Add to resend_list and activate the softirq: */
> > + raw_spin_lock(&irq_resend_lock);
> > + hlist_add_head(&desc->resend_node, &irq_resend_list);
> > + raw_spin_unlock(&irq_resend_lock);
>
> Do you conside a situation where irq_sw_resend() is running on two CPUs concurrently?
> If so, the same desc could be added into irq_resend_list twice by mistake.
Have you looked at the calling site (stress on singular), the locking
requirements, and the role IRQS_REPLAY plays when it comes to queuing
an interrupt for resend?
M.
--
Without deviation from the norm, progress is not possible.
On Mon, May 29 2023 at 15:57, Chang Liao wrote:
> On 2023/5/19 21:49, Shanker Donthineni wrote:
>> diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
>> index 844a8e30e6de..d9451d456a73 100644
>> --- a/include/linux/irqdesc.h
>> +++ b/include/linux/irqdesc.h
>> @@ -102,6 +102,9 @@ struct irq_desc {
>> int parent_irq;
>> struct module *owner;
>> const char *name;
>> +#ifdef CONFIG_HARDIRQS_SW_RESEND
>> + struct hlist_node resend_node;
>> +#endif
>> } ____cacheline_internodealigned_in_smp;
>
> Although there is no documented rule that limits the change of the KABI
> struct irq_desc, it is still better to keep the irq_desc definition
> stable.
Please read and understand:
Documentation/process/stable-api-nonsense.rst
If you want KABI, then that's _YOUR_ problem, period.
>> -/* Bitmap to handle software resend of interrupts: */
>> -static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
>> +/* hlist_head to handle software resend of interrupts: */
>> +static HLIST_HEAD(irq_resend_list);
>> +static DEFINE_RAW_SPINLOCK(irq_resend_lock);
>
> What is the benefit of using hlist here? If you want to enjoy the
> low latency of querying elements by key, you must define a hlist table
> with a reasonable number of buckets. Otherwise, I don't think the time
> complexity of hlist is better than a regular double-linked list,
> right?
What's complex about hlist in this case? Please explain.
Thanks,
tglx
Hi, Marc
On 2023/5/29 16:48, Marc Zyngier wrote:
> On Mon, 29 May 2023 08:57:07 +0100,
> "Liao, Chang" <[email protected]> wrote:
>>
>> Hi, Shanker
>>
>> On 2023/5/19 21:49, Shanker Donthineni wrote:
>>> The current implementation utilizes a bitmap for managing IRQ resend
>>> handlers, which is allocated based on the SPARSE_IRQ/NR_IRQS macros.
>>> However, this method may not efficiently utilize memory during runtime,
>>> particularly when IRQ_BITMAP_BITS is large.
>>>
>>> Address this issue by using the hlist to manage IRQ resend handlers
>>> instead of relying on a static bitmap memory allocation. Additionally,
>>> a new function, clear_irq_resend(), is introduced and called from
>>> irq_shutdown to ensure a graceful teardown of the interrupt.
>>>
>>> Signed-off-by: Shanker Donthineni <[email protected]>
>>> ---
>>> include/linux/irqdesc.h | 3 +++
>>> kernel/irq/chip.c | 1 +
>>> kernel/irq/internals.h | 2 ++
>>> kernel/irq/irqdesc.c | 2 ++
>>> kernel/irq/resend.c | 47 ++++++++++++++++++++++++++---------------
>>> 5 files changed, 38 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
>>> index 844a8e30e6de..d9451d456a73 100644
>>> --- a/include/linux/irqdesc.h
>>> +++ b/include/linux/irqdesc.h
>>> @@ -102,6 +102,9 @@ struct irq_desc {
>>> int parent_irq;
>>> struct module *owner;
>>> const char *name;
>>> +#ifdef CONFIG_HARDIRQS_SW_RESEND
>>> + struct hlist_node resend_node;
>>> +#endif
>>> } ____cacheline_internodealigned_in_smp;
>>
>> Although there is no documented rule that limits the change of the KABI
>> struct irq_desc, it is still better to keep the irq_desc definition stable.
>
> On what grounds? There is no such thing as a stable in-kernel ABI,
> specially for things as internal as irq_desc.
Okay, helpful.
>
>>>
>>> #ifdef CONFIG_SPARSE_IRQ
>>> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
>>> index 49e7bc871fec..2eac5532c3c8 100644
>>> --- a/kernel/irq/chip.c
>>> +++ b/kernel/irq/chip.c
>>> @@ -306,6 +306,7 @@ static void __irq_disable(struct irq_desc *desc, bool mask);
>>> void irq_shutdown(struct irq_desc *desc)
>>> {
>>> if (irqd_is_started(&desc->irq_data)) {
>>> + clear_irq_resend(desc);
>>> desc->depth = 1;
>>> if (desc->irq_data.chip->irq_shutdown) {
>>> desc->irq_data.chip->irq_shutdown(&desc->irq_data);
>>> diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
>>> index 5fdc0b557579..51fc8c497c22 100644
>>> --- a/kernel/irq/internals.h
>>> +++ b/kernel/irq/internals.h
>>> @@ -113,6 +113,8 @@ irqreturn_t handle_irq_event(struct irq_desc *desc);
>>>
>>> /* Resending of interrupts :*/
>>> int check_irq_resend(struct irq_desc *desc, bool inject);
>>> +void clear_irq_resend(struct irq_desc *desc);
>>> +void irq_resend_init(struct irq_desc *desc);
>>> bool irq_wait_for_poll(struct irq_desc *desc);
>>> void __irq_wake_thread(struct irq_desc *desc, struct irqaction *action);
>>>
>>> diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
>>> index 240e145e969f..b401b89b226a 100644
>>> --- a/kernel/irq/irqdesc.c
>>> +++ b/kernel/irq/irqdesc.c
>>> @@ -415,6 +415,7 @@ static struct irq_desc *alloc_desc(int irq, int node, unsigned int flags,
>>> desc_set_defaults(irq, desc, node, affinity, owner);
>>> irqd_set(&desc->irq_data, flags);
>>> kobject_init(&desc->kobj, &irq_kobj_type);
>>> + irq_resend_init(desc);
>>>
>>> return desc;
>>>
>>> @@ -581,6 +582,7 @@ int __init early_irq_init(void)
>>> mutex_init(&desc[i].request_mutex);
>>> init_waitqueue_head(&desc[i].wait_for_threads);
>>> desc_set_defaults(i, &desc[i], node, NULL, NULL);
>>> + irq_resend_init(desc);
>>> }
>>> return arch_early_irq_init();
>>> }
>>> diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c
>>> index 0c46e9fe3a89..edec335c0a7a 100644
>>> --- a/kernel/irq/resend.c
>>> +++ b/kernel/irq/resend.c
>>> @@ -21,8 +21,9 @@
>>>
>>> #ifdef CONFIG_HARDIRQS_SW_RESEND
>>>
>>> -/* Bitmap to handle software resend of interrupts: */
>>> -static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
>>> +/* hlist_head to handle software resend of interrupts: */
>>> +static HLIST_HEAD(irq_resend_list);
>>> +static DEFINE_RAW_SPINLOCK(irq_resend_lock);
>>
>> What is the benefit of using hlist here? If you want to enjoy the
>> low latency of querying elements by key, you must define a hlist table
>> with a reasonable number of buckets. Otherwise, I don't think the time
>> complexity of hlist is better than a regular double-linked list, right?
>
> You do realise that the list is processed in order, one element after
> the other, without ever querying any arbitrary element? Have you read
> the code?
Yes, so I *wonder* why not use a regular linked list here if there is no
need to do arbitrary querying. I have no doubt the idea of these changes
is sound, just curious about the data structure used to maintain resend IRQs.
>
>>
>>>
>>> /*
>>> * Run software resends of IRQ's
>>> @@ -30,18 +31,17 @@ static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
>>> static void resend_irqs(struct tasklet_struct *unused)
>>> {
>>> struct irq_desc *desc;
>>> - int irq;
>>> -
>>> - while (!bitmap_empty(irqs_resend, nr_irqs)) {
>>> - irq = find_first_bit(irqs_resend, nr_irqs);
>>> - clear_bit(irq, irqs_resend);
>>> - desc = irq_to_desc(irq);
>>> - if (!desc)
>>> - continue;
>>> - local_irq_disable();
>>> +
>>> + raw_spin_lock_irq(&irq_resend_lock);
>>> + while (!hlist_empty(&irq_resend_list)) {
>>> + desc = hlist_entry(irq_resend_list.first, struct irq_desc,
>>> + resend_node);
>>> + hlist_del_init(&desc->resend_node);
>>> + raw_spin_unlock(&irq_resend_lock);
>>> desc->handle_irq(desc);
>>> - local_irq_enable();
>>> + raw_spin_lock(&irq_resend_lock);
>>> }
>>> + raw_spin_unlock_irq(&irq_resend_lock);
>>> }
>>>
>>> /* Tasklet to handle resend: */
>>> @@ -49,8 +49,6 @@ static DECLARE_TASKLET(resend_tasklet, resend_irqs);
>>>
>>> static int irq_sw_resend(struct irq_desc *desc)
>>> {
>>> - unsigned int irq = irq_desc_get_irq(desc);
>>> -
>>> /*
>>> * Validate whether this interrupt can be safely injected from
>>> * non interrupt context
>>> @@ -70,16 +68,31 @@ static int irq_sw_resend(struct irq_desc *desc)
>>> */
>>> if (!desc->parent_irq)
>>> return -EINVAL;
>>> - irq = desc->parent_irq;
>>
>> Why delete this code?
>
> OK, now I know you haven't read this code at all :-(.
Sorry for the noise, I overlooked some core changes.
>
>>
>>> }
>>>
>>> - /* Set it pending and activate the softirq: */
>>> - set_bit(irq, irqs_resend);
>>> + /* Add to resend_list and activate the softirq: */
>>> + raw_spin_lock(&irq_resend_lock);
>>> + hlist_add_head(&desc->resend_node, &irq_resend_list);
>>> + raw_spin_unlock(&irq_resend_lock);
>>
> >> Have you considered a situation where irq_sw_resend() is running on two CPUs concurrently?
>> If so, the same desc could be added into irq_resend_list twice by mistake.
>
> Have you looked at the calling site (stress on singular), the locking
> requirements, and the role IRQS_REPLAY plays when it comes to queuing
> an interrupt for resend?
I see, the IRQS_REPLAY and locking of irq_desc ensure that irq_sw_resend() is not called
on the same interrupt concurrently, thanks for pointing that out.
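For reference, the guard lives in check_irq_resend() in
kernel/irq/resend.c, which runs with desc->lock held; roughly
(paraphrased, not a verbatim quote of the upstream code):

	if (desc->istate & IRQS_REPLAY)
		return -EBUSY;			/* a resend is already queued */
	...
	if (!try_retrigger(desc))		/* prefer a hardware retrigger */
		err = irq_sw_resend(desc);	/* else queue on irq_resend_list */
	if (!err)
		desc->istate |= IRQS_REPLAY;	/* cleared when the irq is handled */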
>
> M.
>
--
BR
Liao, Chang
Hi, Thomas
On 2023/5/30 5:51, Thomas Gleixner wrote:
> On Mon, May 29 2023 at 15:57, Chang Liao wrote:
>> On 2023/5/19 21:49, Shanker Donthineni wrote:
>>> diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
>>> index 844a8e30e6de..d9451d456a73 100644
>>> --- a/include/linux/irqdesc.h
>>> +++ b/include/linux/irqdesc.h
>>> @@ -102,6 +102,9 @@ struct irq_desc {
>>> int parent_irq;
>>> struct module *owner;
>>> const char *name;
>>> +#ifdef CONFIG_HARDIRQS_SW_RESEND
>>> + struct hlist_node resend_node;
>>> +#endif
>>> } ____cacheline_internodealigned_in_smp;
>>
>> Although there is no documented rule that limits the change of the KABI
>> struct irq_desc, it is still better to keep the irq_desc definition
>> stable.
>
> Please read and understand:
>
> Documentation/process/stable-api-nonsense.rst
>
> If you want KABI, then that's _YOUR_ problem, period.
Thanks for the guide.
>
>>> -/* Bitmap to handle software resend of interrupts: */
>>> -static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
>>> +/* hlist_head to handle software resend of interrupts: */
>>> +static HLIST_HEAD(irq_resend_list);
>>> +static DEFINE_RAW_SPINLOCK(irq_resend_lock);
>>
>> What is the benefit of using hlist here? If you want to enjoy the
>> low latency of querying elements by key, you must define a hlist table
>> with a reasonable number of buckets. Otherwise, I don't think the time
>> complexity of hlist is better than a regular double-linked list,
>> right?
>
> What's complex about hlist in this case? Please explain.
Honestly, it is not about the complexity. Perhaps I do not understand the
usage of hlist very deeply. I have searched some code in the kernel and
found that hlist is usually used to speed up arbitrary querying, such as
searching for a registered kprobe by address. Back to this patch, these
resend IRQs are actually kept in a sequential list and traversed one by
one. Further, comparing hlist_empty, hlist_add_head, and hlist_del_init
with their counterparts in list, it looks like a regular linked list is
also good enough.
Thanks.
>
> Thanks,
>
> tglx
--
BR
Liao, Chang
On Tue, 30 May 2023 02:44:05 +0100,
"Liao, Chang" <[email protected]> wrote:
>
> >> What is the benefit of using hlist here? If you want to enjoy the
> >> low latency of querying elements by key, you must define a hlist table
> >> with a reasonable number of buckets. Otherwise, I don't think the time
> >> complexity of hlist is better than a regular double-linked list, right?
> >
> > You do realise that the list is processed in order, one element after
> > the other, without ever querying any arbitrary element? Have you read
> > the code?
>
> Yes, so I *wonder* why not use a regular linked list here if there is no
> need to do arbitrary querying. I have no doubt the idea of these changes
> is sound, just curious about the data structure used to maintain resend IRQs.
What about it? For the use case at hand, they result in the same
complexity. Unless you have spotted a corner case that results in a
non O(1) complexity?
M.
--
Without deviation from the norm, progress is not possible.
On Tue, May 30 2023 at 09:59, Chang Liao wrote:
> On 2023/5/30 5:51, Thomas Gleixner wrote:
>>> What is the benefit of using hlist here? If you want to enjoy the
>>> low latency of querying elements by key, you must define a hlist table
>>> with a reasonable number of buckets. Otherwise, I don't think the time
>>> complexity of hlist is better than a regular double-linked list,
>>> right?
>>
>> What's complex about hlist in this case? Please explain.
>
> Honestly, it is not about the complexity. Perhaps I do not understand the
> usage of hlist very deeply. I have searched some code in the kernel and
> found that hlist is usually used to speed up arbitrary querying, such as
> searching for a registered kprobe by address. Back to this patch, these
> resend IRQs are actually kept in a sequential list and traversed one by
> one. Further, comparing hlist_empty, hlist_add_head, and hlist_del_init
> with their counterparts in list, it looks like a regular linked list is
> also good enough.
Sure that works too.
The main difference between regular linked lists and hlist is that the
list head of hlist is half the size of a regular double linked list.
The only downside of hlist is that there is no back link in the list
head to the tail. Searching for the tail is O(N) while on a double
linked list it's O(1).
Nothing in this use case needs to access the tail. So what's your
problem?
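A quick sketch of that size difference (64-bit kernel assumed):

	struct hlist_head { struct hlist_node *first; };	/*  8 bytes */
	struct list_head  { struct list_head *next, *prev; };	/* 16 bytes */

	/*
	 * No tail pointer in hlist_head, so appending at the tail is O(n);
	 * the resend path only pushes and pops at the head, which is O(1).
	 */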
Thanks,
tglx
On 2023/5/30 20:19, Thomas Gleixner wrote:
> On Tue, May 30 2023 at 09:59, Chang Liao wrote:
>> On 2023/5/30 5:51, Thomas Gleixner wrote:
>>>> What is the benefit of using hlist here? If you want to enjoy the
>>>> low latency of querying elements by key, you must define a hlist table
>>>> with a reasonable number of buckets. Otherwise, I don't think the time
>>>> complexity of hlist is better than a regular double-linked list,
>>>> right?
>>>
>>> What's complex about hlist in this case? Please explain.
>>
>> Honestly, it is not about the complexity. Perhaps I do not understand the
>> usage of hlist very deeply. I have searched some code in the kernel and
>> found that hlist is usually used to speed up arbitrary querying, such as
>> searching for a registered kprobe by address. Back to this patch, these
>> resend IRQs are actually kept in a sequential list and traversed one by
>> one. Further, comparing hlist_empty, hlist_add_head, and hlist_del_init
>> with their counterparts in list, it looks like a regular linked list is
>> also good enough.
>
> Sure that works too.
>
> The main difference between regular linked lists and hlist is that the
> list head of hlist is half the size of a regular double linked list.
>
> The only downside of hlist is that there is no back link in the list
> head to the tail. Searching for the tail is O(N) while on a double
> linked list it's O(1).
>
> Nothing in this use case needs to access the tail. So what's your
> problem?
Oh, that is the point. Your explanation made it all clear and my problem is solved,
Thanks!
>
> Thanks,
>
> tglx
--
BR
Liao, Chang