2009-12-25 08:43:54

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH] x86: increase NR_IRQS and nr_irqs



have a system with lots of igb and ixgbe, when iov/vf are enabled for
them, we hit the limit of 3064.

so try to increase NR_IRQS and nr_irqs to about 8000 and 6000.
for that system.

Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/include/asm/irq_vectors.h | 12 ++++++------
arch/x86/kernel/apic/io_apic.c | 2 +-
2 files changed, 7 insertions(+), 7 deletions(-)

Index: linux-2.6/arch/x86/include/asm/irq_vectors.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/irq_vectors.h
+++ linux-2.6/arch/x86/include/asm/irq_vectors.h
@@ -154,21 +154,21 @@ static inline int invalid_vm86_irq(int i

#define NR_IRQS_LEGACY 16

-#define CPU_VECTOR_LIMIT ( 8 * NR_CPUS )
#define IO_APIC_VECTOR_LIMIT ( 32 * MAX_IO_APICS )

#ifdef CONFIG_X86_IO_APIC
# ifdef CONFIG_SPARSE_IRQ
+# define CPU_VECTOR_LIMIT (64 * NR_CPUS)
# define NR_IRQS \
(CPU_VECTOR_LIMIT > IO_APIC_VECTOR_LIMIT ? \
(NR_VECTORS + CPU_VECTOR_LIMIT) : \
(NR_VECTORS + IO_APIC_VECTOR_LIMIT))
# else
-# if NR_CPUS < MAX_IO_APICS
-# define NR_IRQS (NR_VECTORS + 4*CPU_VECTOR_LIMIT)
-# else
-# define NR_IRQS (NR_VECTORS + IO_APIC_VECTOR_LIMIT)
-# endif
+# define CPU_VECTOR_LIMIT (32 * NR_CPUS)
+# define NR_IRQS \
+ (CPU_VECTOR_LIMIT < IO_APIC_VECTOR_LIMIT ? \
+ (NR_VECTORS + CPU_VECTOR_LIMIT) : \
+ (NR_VECTORS + IO_APIC_VECTOR_LIMIT))
# endif
#else /* !CONFIG_X86_IO_APIC: */
# define NR_IRQS NR_IRQS_LEGACY
Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
+++ linux-2.6/arch/x86/kernel/apic/io_apic.c
@@ -3840,7 +3840,7 @@ int __init arch_probe_nr_irqs(void)
/*
* for MSI and HT dyn irq
*/
- nr += nr_irqs_gsi * 16;
+ nr += nr_irqs_gsi * 48;
#endif
if (nr < nr_irqs)
nr_irqs = nr;


2009-12-28 09:47:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: increase NR_IRQS and nr_irqs


* Yinghai Lu <[email protected]> wrote:

> have a system with lots of igb and ixgbe, when iov/vf are enabled for
> them, we hit the limit of 3064.
>
> so try to increase NR_IRQS and nr_irqs to about 8000 and 6000.
> for that system.

could you please provide before/after .data size analysis, expected dynamic
allocation overhead difference (if any), etc. - so that we can see whether
there's any (and if yes, how much) cost due to the bumping of the limit.

Also, instead of just upping the limit to a new random limit that works on
that testbox, could we put some more thought into how much we want to increase
it, and stay at least 2-3 years ahead of the expected hardware curve?

Ingo

2009-12-29 05:10:05

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH -v2] x86: increase NR_IRQS and nr_irqs



have a system with lots of igb and ixgbe, when iov/vf are enabled for
them, we hit the limit of 3064.

when system have 20 pcie installed, and one card have 2 functions, and one
function need 64 msi-x,
may need 20 * 2 * 64 = 2560 for msi-x
but if iov and vf are enabled
may need 20 * 2 * 64 * 3 = 7680 for msi-x
assume system with 5 ioapic, nr_irqs_gsi will be 120.
NR_CPUS = 512, and nr_cpu_ids = 128
will have NR_IRQS = 256 + 512 * 64 = 33024
will have nr_irqs = 120 + 8 * 128 + 120 * 64 = 8824

when SPARSE_IRQ is not set, there is no increase with data
when NR_CPUS=128, and SPARSE_IRQ is set
text data bss dec hex filename
21837444 4216564 12480736 38534744 24bfe58 vmlinux.before
21837442 4216580 12480736 38534758 24bfe66 vmlinux.after
when NR_CPUS=4096, and SPARSE_IRQ is set
text data bss dec hex filename
21878619 5610244 13415392 40904255 270263f vmlinux.before
21878617 5610244 13415392 40904253 270263d vmlinux.after

-v2: update comments to address Ingo's concern

Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/include/asm/irq_vectors.h | 12 ++++++------
arch/x86/kernel/apic/io_apic.c | 2 +-
2 files changed, 7 insertions(+), 7 deletions(-)

Index: linux-2.6/arch/x86/include/asm/irq_vectors.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/irq_vectors.h
+++ linux-2.6/arch/x86/include/asm/irq_vectors.h
@@ -154,21 +154,21 @@ static inline int invalid_vm86_irq(int i

#define NR_IRQS_LEGACY 16

-#define CPU_VECTOR_LIMIT ( 8 * NR_CPUS )
#define IO_APIC_VECTOR_LIMIT ( 32 * MAX_IO_APICS )

#ifdef CONFIG_X86_IO_APIC
# ifdef CONFIG_SPARSE_IRQ
+# define CPU_VECTOR_LIMIT (64 * NR_CPUS)
# define NR_IRQS \
(CPU_VECTOR_LIMIT > IO_APIC_VECTOR_LIMIT ? \
(NR_VECTORS + CPU_VECTOR_LIMIT) : \
(NR_VECTORS + IO_APIC_VECTOR_LIMIT))
# else
-# if NR_CPUS < MAX_IO_APICS
-# define NR_IRQS (NR_VECTORS + 4*CPU_VECTOR_LIMIT)
-# else
-# define NR_IRQS (NR_VECTORS + IO_APIC_VECTOR_LIMIT)
-# endif
+# define CPU_VECTOR_LIMIT (32 * NR_CPUS)
+# define NR_IRQS \
+ (CPU_VECTOR_LIMIT < IO_APIC_VECTOR_LIMIT ? \
+ (NR_VECTORS + CPU_VECTOR_LIMIT) : \
+ (NR_VECTORS + IO_APIC_VECTOR_LIMIT))
# endif
#else /* !CONFIG_X86_IO_APIC: */
# define NR_IRQS NR_IRQS_LEGACY
Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
+++ linux-2.6/arch/x86/kernel/apic/io_apic.c
@@ -3840,7 +3840,7 @@ int __init arch_probe_nr_irqs(void)
/*
* for MSI and HT dyn irq
*/
- nr += nr_irqs_gsi * 16;
+ nr += nr_irqs_gsi * 64;
#endif
if (nr < nr_irqs)
nr_irqs = nr;

2009-12-30 12:22:58

by Yinghai Lu

[permalink] [raw]
Subject: [tip:x86/apic] x86: Increase NR_IRQS and nr_irqs

Commit-ID: 9959c888a38b0f25b0e81a480f537d6489348442
Gitweb: http://git.kernel.org/tip/9959c888a38b0f25b0e81a480f537d6489348442
Author: Yinghai Lu <[email protected]>
AuthorDate: Mon, 28 Dec 2009 21:08:29 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 30 Dec 2009 11:55:59 +0100

x86: Increase NR_IRQS and nr_irqs

I have a system with lots of igb and ixgbe, when iov/vf are
enabled for them, we hit the limit of 3064.

when system has 20 pcie installed, and one card has 2
functions, and one function needs 64 msi-x,
may need 20 * 2 * 64 = 2560 for msi-x

but if iov and vf are enabled
may need 20 * 2 * 64 * 3 = 7680 for msi-x
assume system with 5 ioapic, nr_irqs_gsi will be 120.

NR_CPUS = 512, and nr_cpu_ids = 128
will have NR_IRQS = 256 + 512 * 64 = 33024
will have nr_irqs = 120 + 8 * 128 + 120 * 64 = 8824

When SPARSE_IRQ is not set, there is no increase with kernel data
size.

when NR_CPUS=128, and SPARSE_IRQ is set:
text data bss dec hex filename
21837444 4216564 12480736 38534744 24bfe58 vmlinux.before
21837442 4216580 12480736 38534758 24bfe66 vmlinux.after
when NR_CPUS=4096, and SPARSE_IRQ is set
text data bss dec hex filename
21878619 5610244 13415392 40904255 270263f vmlinux.before
21878617 5610244 13415392 40904253 270263d vmlinux.after

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/irq_vectors.h | 12 ++++++------
arch/x86/kernel/apic/io_apic.c | 2 +-
2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 4611f08..3ab43df 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -154,21 +154,21 @@ static inline int invalid_vm86_irq(int irq)

#define NR_IRQS_LEGACY 16

-#define CPU_VECTOR_LIMIT ( 8 * NR_CPUS )
#define IO_APIC_VECTOR_LIMIT ( 32 * MAX_IO_APICS )

#ifdef CONFIG_X86_IO_APIC
# ifdef CONFIG_SPARSE_IRQ
+# define CPU_VECTOR_LIMIT (64 * NR_CPUS)
# define NR_IRQS \
(CPU_VECTOR_LIMIT > IO_APIC_VECTOR_LIMIT ? \
(NR_VECTORS + CPU_VECTOR_LIMIT) : \
(NR_VECTORS + IO_APIC_VECTOR_LIMIT))
# else
-# if NR_CPUS < MAX_IO_APICS
-# define NR_IRQS (NR_VECTORS + 4*CPU_VECTOR_LIMIT)
-# else
-# define NR_IRQS (NR_VECTORS + IO_APIC_VECTOR_LIMIT)
-# endif
+# define CPU_VECTOR_LIMIT (32 * NR_CPUS)
+# define NR_IRQS \
+ (CPU_VECTOR_LIMIT < IO_APIC_VECTOR_LIMIT ? \
+ (NR_VECTORS + CPU_VECTOR_LIMIT) : \
+ (NR_VECTORS + IO_APIC_VECTOR_LIMIT))
# endif
#else /* !CONFIG_X86_IO_APIC: */
# define NR_IRQS NR_IRQS_LEGACY
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index de00c46..d9cd1f1 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -3840,7 +3840,7 @@ int __init arch_probe_nr_irqs(void)
/*
* for MSI and HT dyn irq
*/
- nr += nr_irqs_gsi * 16;
+ nr += nr_irqs_gsi * 64;
#endif
if (nr < nr_irqs)
nr_irqs = nr;

2010-01-04 03:06:15

by Jesse Brandeburg

[permalink] [raw]
Subject: Re: [PATCH -v2] x86: increase NR_IRQS and nr_irqs

On Mon, Dec 28, 2009 at 9:08 PM, Yinghai Lu <[email protected]> wrote:
> have a system with lots of igb and ixgbe, when iov/vf are enabled for
> them, we hit the limit of 3064.
>
> when system have 20 pcie installed, and one card have 2 functions, and one
> function need 64 msi-x,
> ?may need 20 * 2 * 64 = 2560 for msi-x
> but if iov and vf are enabled
> ?may need 20 * 2 * 64 * 3 = 7680 for msi-x
> assume system with 5 ioapic, nr_irqs_gsi will be 120.
> NR_CPUS = 512, and nr_cpu_ids = 128
> will have NR_IRQS = 256 + 512 * 64 = 33024
> will have nr_irqs = 120 + 8 * 128 + 120 * 64 = 8824
>
> when SPARSE_IRQ is not set, there is no increase with data
> when NR_CPUS=128, and SPARSE_IRQ is set
> ? text ? ? ? ? ? ?data ? ? bss ? ? ? ? ? ?dec ? ? ? ? ? hex ? ?filename
> 21837444 ? ? ? ?4216564 12480736 ? ? ? ?38534744 ? ? ? ?24bfe58 vmlinux.before
> 21837442 ? ? ? ?4216580 12480736 ? ? ? ?38534758 ? ? ? ?24bfe66 vmlinux.after
> when NR_CPUS=4096, and SPARSE_IRQ is set
> ? text ? ? ? ? ? ?data ? ? bss ? ? ? ? ? ?dec ? ? ? ? ? hex ? ?filename
> 21878619 ? ? ? ?5610244 13415392 ? ? ? ?40904255 ? ? ? ?270263f vmlinux.before
> 21878617 ? ? ? ?5610244 13415392 ? ? ? ?40904253 ? ? ? ?270263d vmlinux.after
>
> -v2: update comments to address Ingo's concern
>
> Signed-off-by: Yinghai Lu <[email protected]>

I'm not sure this is the best plan, but may be okay for now. What
happens when all of your slots have 6 port 82599 ixgbe adapters in
them? They are being made[1], as well as quad port 82576 igb
adapters, however I'm not fully sure of the SRIOV support of the
bridges being used on those adapters.

Is it on the table to (re-)design this subsystem to be a little more
dynamic? There are probably examples in ppc64 or ia64 directories.
Every time you suggest a limit I can find a case where it won't be
enough.

[1] http://www.hotlavasystems.com/products_10gbe.html

2010-01-04 03:21:49

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH -v2] x86: increase NR_IRQS and nr_irqs

On 01/03/2010 07:06 PM, Jesse Brandeburg wrote:
> On Mon, Dec 28, 2009 at 9:08 PM, Yinghai Lu <[email protected]> wrote:
>> have a system with lots of igb and ixgbe, when iov/vf are enabled for
>> them, we hit the limit of 3064.
>>
>> when system have 20 pcie installed, and one card have 2 functions, and one
>> function need 64 msi-x,
>> may need 20 * 2 * 64 = 2560 for msi-x
>> but if iov and vf are enabled
>> may need 20 * 2 * 64 * 3 = 7680 for msi-x
>> assume system with 5 ioapic, nr_irqs_gsi will be 120.
>> NR_CPUS = 512, and nr_cpu_ids = 128
>> will have NR_IRQS = 256 + 512 * 64 = 33024
>> will have nr_irqs = 120 + 8 * 128 + 120 * 64 = 8824
>>
>> when SPARSE_IRQ is not set, there is no increase with data
>> when NR_CPUS=128, and SPARSE_IRQ is set
>> text data bss dec hex filename
>> 21837444 4216564 12480736 38534744 24bfe58 vmlinux.before
>> 21837442 4216580 12480736 38534758 24bfe66 vmlinux.after
>> when NR_CPUS=4096, and SPARSE_IRQ is set
>> text data bss dec hex filename
>> 21878619 5610244 13415392 40904255 270263f vmlinux.before
>> 21878617 5610244 13415392 40904253 270263d vmlinux.after
>>
>> -v2: update comments to address Ingo's concern
>>
>> Signed-off-by: Yinghai Lu <[email protected]>
>
> I'm not sure this is the best plan, but may be okay for now. What
> happens when all of your slots have 6 port 82599 ixgbe adapters in
> them? They are being made[1], as well as quad port 82576 igb
> adapters, however I'm not fully sure of the SRIOV support of the
> bridges being used on those adapters.
>
> Is it on the table to (re-)design this subsystem to be a little more
> dynamic? There are probably examples in ppc64 or ia64 directories.
> Every time you suggest a limit I can find a case where it won't be
> enough.
>
> [1] http://www.hotlavasystems.com/products_10gbe.html

you mean
20 * 6 * 64 * 3 ?

23040

maybe we can just keep nr_irqs = (256 - 32 - 10) * nr_cpu_ids?
32: for exception
10: for IPI etc.

YH

2010-01-04 06:57:57

by Yinghai Lu

[permalink] [raw]
Subject: Subject: [PATCH 1/2] x86: get back 15 vectors



between FIRST_EXTERNAL_VECTOR (0x20) and FIRST_DEVICE_VECTOR (0x41)

for 0x20 and 0x2f, we are safe be used_vectors will prevent it to use used one.

also try to reuse 0x30 to 0x3f after smp_affinity for irq[0,15] is changed to other cpu.

Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/kernel/apic/io_apic.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
+++ linux-2.6/arch/x86/kernel/apic/io_apic.c
@@ -1162,7 +1162,8 @@ __assign_irq_vector(int irq, struct irq_
* Also, we've got to be careful not to trash gate
* 0x80, because int 0x80 is hm, kind of importantish. ;)
*/
- static int current_vector = FIRST_DEVICE_VECTOR, current_offset = 0;
+ static int current_vector = FIRST_EXTERNAL_VECTOR + 1;
+ static int current_offset = 0;
unsigned int old_vector;
int cpu, err;
cpumask_var_t tmp_mask;
@@ -1198,7 +1199,7 @@ next:
if (vector >= first_system_vector) {
/* If out of vectors on large boxen, must share them. */
offset = (offset + 1) % 8;
- vector = FIRST_DEVICE_VECTOR + offset;
+ vector = FIRST_EXTERNAL_VECTOR + 1 + offset;
}
if (unlikely(current_vector == vector))
continue;

2010-01-04 06:59:46

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 2/2] x86: get more exact nr_irqs



first check with NR_VECTORS - FIRST_EXTERNAL_VECTOR - 0x20
aka minus exceptions and system vectors.

NR_CPUS = 512, and nr_cpu_ids = 128
will have NR_IRQS = 256 + 512 * 64 = 33024

assume we have 20 intel ixgbe 6 port cards (with sriov and ixgbevf)
20 * 6 * 64 * 3 = 23040

first will get:
128 * (256 - 64) = 24576
then with nr_irqs_gsi will get
(120 + 8 * 128 + 120 * 256) = 31864

so 24576 will be used for nr_irqs.

24576 * 8 = 196608 bytes will be used for irq_desc_ptrs[]

before this patch:
have nr_irqs = 120 + 8 * 128 + 120 * 64 = 8824
and irq_desc_ptrs[] is 70592

Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/kernel/apic/io_apic.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
+++ linux-2.6/arch/x86/kernel/apic/io_apic.c
@@ -3833,15 +3833,20 @@ int __init arch_probe_nr_irqs(void)
{
int nr;

- if (nr_irqs > (NR_VECTORS * nr_cpu_ids))
- nr_irqs = NR_VECTORS * nr_cpu_ids;
+ /* 0x20 for ipi etc system vectors */
+ nr = NR_VECTORS - FIRST_EXTERNAL_VECTOR - 0x20;
+
+ nr *= nr_cpu_ids;
+
+ if (nr < nr_irqs)
+ nr_irqs = nr;

nr = nr_irqs_gsi + 8 * nr_cpu_ids;
#if defined(CONFIG_PCI_MSI) || defined(CONFIG_HT_IRQ)
/*
* for MSI and HT dyn irq
*/
- nr += nr_irqs_gsi * 64;
+ nr += nr_irqs_gsi * 256;
#endif
if (nr < nr_irqs)
nr_irqs = nr;

2010-01-04 07:01:05

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 1/2] x86: get back 15 vectors



between FIRST_EXTERNAL_VECTOR (0x20) and FIRST_DEVICE_VECTOR (0x41)

for 0x20 and 0x2f, we are safe be used_vectors will prevent it to use used one.

also try to reuse 0x30 to 0x3f after smp_affinity for irq[0,15] is changed to other cpu.

Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/kernel/apic/io_apic.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
+++ linux-2.6/arch/x86/kernel/apic/io_apic.c
@@ -1162,7 +1162,8 @@ __assign_irq_vector(int irq, struct irq_
* Also, we've got to be careful not to trash gate
* 0x80, because int 0x80 is hm, kind of importantish. ;)
*/
- static int current_vector = FIRST_DEVICE_VECTOR, current_offset = 0;
+ static int current_vector = FIRST_EXTERNAL_VECTOR + 1;
+ static int current_offset = 0;
unsigned int old_vector;
int cpu, err;
cpumask_var_t tmp_mask;
@@ -1198,7 +1199,7 @@ next:
if (vector >= first_system_vector) {
/* If out of vectors on large boxen, must share them. */
offset = (offset + 1) % 8;
- vector = FIRST_DEVICE_VECTOR + offset;
+ vector = FIRST_EXTERNAL_VECTOR + 1 + offset;
}
if (unlikely(current_vector == vector))
continue;

2010-01-04 16:19:06

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

Yinghai Lu <[email protected]> writes:

This patch is wrong.

> between FIRST_EXTERNAL_VECTOR (0x20) and FIRST_DEVICE_VECTOR (0x41)
>
> for 0x20 and 0x2f, we are safe be used_vectors will prevent it to use used one.

We can not use any of 0x20 - 0x2f for ioapic irqs. We need the entire
priority level to ensure that the irq move cleanup ipi is of a lower
priority.

> also try to reuse 0x30 to 0x3f after smp_affinity for irq[0,15] is changed to other cpu.

There may be a point with 0x30 to 0x3f as I recall when those irqs come through a legacy
pic we need to reserve those vectors on all cpus.


Eric


> Signed-off-by: Yinghai Lu <[email protected]>


> ---
> arch/x86/kernel/apic/io_apic.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
> +++ linux-2.6/arch/x86/kernel/apic/io_apic.c
> @@ -1162,7 +1162,8 @@ __assign_irq_vector(int irq, struct irq_
> * Also, we've got to be careful not to trash gate
> * 0x80, because int 0x80 is hm, kind of importantish. ;)
> */
> - static int current_vector = FIRST_DEVICE_VECTOR, current_offset = 0;
> + static int current_vector = FIRST_EXTERNAL_VECTOR + 1;
> + static int current_offset = 0;
> unsigned int old_vector;
> int cpu, err;
> cpumask_var_t tmp_mask;
> @@ -1198,7 +1199,7 @@ next:
> if (vector >= first_system_vector) {
> /* If out of vectors on large boxen, must share them. */
> offset = (offset + 1) % 8;
> - vector = FIRST_DEVICE_VECTOR + offset;
> + vector = FIRST_EXTERNAL_VECTOR + 1 + offset;
> }
> if (unlikely(current_vector == vector))
> continue;

2010-01-04 16:56:09

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86: get more exact nr_irqs

Yinghai Lu <[email protected]> writes:

> first check with NR_VECTORS - FIRST_EXTERNAL_VECTOR - 0x20
> aka minus exceptions and system vectors.
>
> NR_CPUS = 512, and nr_cpu_ids = 128
> will have NR_IRQS = 256 + 512 * 64 = 33024
>
> assume we have 20 intel ixgbe 6 port cards (with sriov and ixgbevf)
> 20 * 6 * 64 * 3 = 23040
>
> first will get:
> 128 * (256 - 64) = 24576
> then with nr_irqs_gsi will get
> (120 + 8 * 128 + 120 * 256) = 31864
>
> so 24576 will be used for nr_irqs.
>
> 24576 * 8 = 196608 bytes will be used for irq_desc_ptrs[]
>
> before this patch:
> have nr_irqs = 120 + 8 * 128 + 120 * 64 = 8824
> and irq_desc_ptrs[] is 70592
>
> Signed-off-by: Yinghai Lu <[email protected]>

I am lost. arch_probe_nr_irqs appears to be total nonsense.

We have three concepts.
- The number of irq sources we can talk about. ( nr_irqs)
- The number of irqs we can possibly service. ((NR_VECTORS - 0x30) *nr_cpu_ids)
- The number of irqs we actually connected up to cards in the
system that we need to do something with.

Why do we need to allocate arrays at all?

arch_probe_nr_irqs looks like a pile of magic numbers (even more magic
with the addition of 0x20), that is always going to be a little bit
wrong.

We should be able to remove the arrays all together and allocate
irq_desc dynamically.


> ---
> arch/x86/kernel/apic/io_apic.c | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)
>
> Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
> +++ linux-2.6/arch/x86/kernel/apic/io_apic.c
> @@ -3833,15 +3833,20 @@ int __init arch_probe_nr_irqs(void)
> {
> int nr;
>
> - if (nr_irqs > (NR_VECTORS * nr_cpu_ids))
> - nr_irqs = NR_VECTORS * nr_cpu_ids;
> + /* 0x20 for ipi etc system vectors */
> + nr = NR_VECTORS - FIRST_EXTERNAL_VECTOR - 0x20;

If you are going to subtract of the number of ipis please
put appropriate defines in irq_vectors.h. A raw 0x20 is
wrong.

> +
> + nr *= nr_cpu_ids;
> +
> + if (nr < nr_irqs)
> + nr_irqs = nr;
> nr = nr_irqs_gsi + 8 * nr_cpu_ids;
> #if defined(CONFIG_PCI_MSI) || defined(CONFIG_HT_IRQ)
> /*
> * for MSI and HT dyn irq
> */
> - nr += nr_irqs_gsi * 64;
> + nr += nr_irqs_gsi * 256;

This part seems like magic voodoo. Why should their
be a correlation between the number of gsis and the number
of msis?

Eric

2010-01-04 18:42:33

by Yinghai Lu

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

On 01/04/2010 08:18 AM, Eric W. Biederman wrote:
> Yinghai Lu <[email protected]> writes:
>
> This patch is wrong.
>
>> between FIRST_EXTERNAL_VECTOR (0x20) and FIRST_DEVICE_VECTOR (0x41)
>>
>> for 0x20 and 0x2f, we are safe be used_vectors will prevent it to use used one.
>
> We can not use any of 0x20 - 0x2f for ioapic irqs. We need the entire
> priority level to ensure that the irq move cleanup ipi is of a lower
> priority.
>
>> also try to reuse 0x30 to 0x3f after smp_affinity for irq[0,15] is changed to other cpu.
>
> There may be a point with 0x30 to 0x3f as I recall when those irqs come through a legacy
> pic we need to reserve those vectors on all cpus.

ok, I see.

any reason that we can not use 0x40?

Thanks

Yinghai

2010-01-04 19:03:09

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

On 01/04/2010 08:18 AM, Eric W. Biederman wrote:
> Yinghai Lu <[email protected]> writes:
>
> This patch is wrong.
>
>> between FIRST_EXTERNAL_VECTOR (0x20) and FIRST_DEVICE_VECTOR (0x41)
>>
>> for 0x20 and 0x2f, we are safe be used_vectors will prevent it to use used one.
>
> We can not use any of 0x20 - 0x2f for ioapic irqs. We need the entire
> priority level to ensure that the irq move cleanup ipi is of a lower
> priority.
>

Almost makes one want to abuse 0x1f for that. Although 0x00..0x1f are
reserved for exceptions, the APICs range down to 0x10, and well, when
0x1f ends up actually getting used as an exception vector that we
support, then we can trivially change that. In the meantime it would
actually make use of an otherwise-unusable APIC priority level.

-hpa

2010-01-04 19:03:17

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86: get more exact nr_irqs

On Mon, Jan 4, 2010 at 8:55 AM, Eric W. Biederman <[email protected]> wrote:
> Yinghai Lu <[email protected]> writes:
>
>> first check with NR_VECTORS - FIRST_EXTERNAL_VECTOR - 0x20
>> aka minus exceptions and system vectors.
>>
>> NR_CPUS = 512, and nr_cpu_ids = 128
>> will have NR_IRQS = 256 + 512 * 64 = 33024
>>
>> assume we have 20 intel ixgbe 6 port cards (with sriov and ixgbevf)
>> ? ? ? 20 * 6 * 64 * 3 = 23040
>>
>> first will get:
>> ? ? ? 128 * (256 - 64) = 24576
>> then with nr_irqs_gsi will get
>> ? ? ? (120 + 8 * 128 + 120 * 256) = 31864
>>
>> so 24576 will be used for nr_irqs.
>>
>> 24576 * 8 = 196608 bytes will be used for irq_desc_ptrs[]
>>
>> before this patch:
>> ? ? have nr_irqs = 120 + 8 * 128 + 120 * 64 = 8824
>> ? ? ? and irq_desc_ptrs[] is 70592
>>
>> Signed-off-by: Yinghai Lu <[email protected]>
>
> I am lost. ? ?arch_probe_nr_irqs appears to be total nonsense.
>
> We have three concepts.
> - The number of irq sources we can talk about. ?( nr_irqs)
> - The number of irqs we can possibly service. ? ((NR_VECTORS - 0x30) *nr_cpu_ids)
> - The number of irqs we actually connected up to cards in the
> ?system that we need to do something with.
>
> Why do we need to allocate arrays at all?
>

irq_desc is allocated dynamically.

but irq_desc_ptrs is pointer array, it need to be allocated after
nr_irqs is probed.

YH

2010-01-04 19:04:58

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

Yinghai Lu <[email protected]> writes:

> On 01/04/2010 08:18 AM, Eric W. Biederman wrote:
>> Yinghai Lu <[email protected]> writes:
>>
>> This patch is wrong.
>>
>>> between FIRST_EXTERNAL_VECTOR (0x20) and FIRST_DEVICE_VECTOR (0x41)
>>>
>>> for 0x20 and 0x2f, we are safe be used_vectors will prevent it to use used one.
>>
>> We can not use any of 0x20 - 0x2f for ioapic irqs. We need the entire
>> priority level to ensure that the irq move cleanup ipi is of a lower
>> priority.
>>
>>> also try to reuse 0x30 to 0x3f after smp_affinity for irq[0,15] is changed to other cpu.
>>
>> There may be a point with 0x30 to 0x3f as I recall when those irqs come through a legacy
>> pic we need to reserve those vectors on all cpus.
>
> ok, I see.
>
> any reason that we can not use 0x40?

Not that I now of. Reading the comment it looks like it was only
skipped so that the initial assignment of vectors would be.

0x31, 0x41, 0x51, 0x61, 0x71, 0x81, 0x91, 0xa1, 0xb1, 0xc1, 0xd1, 0xe1
Instead of.
0x30, 0x40, 0x50, 0x60, 0x70, 0x90, 0xa0, 0xb0, 0xc0, 0xc0, 0xe0

Which doesn't seem to be the worst notion, but at the point we are looking
for every vector we can get it does seem to be problematic.

Eric

2010-01-04 19:09:15

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

"H. Peter Anvin" <[email protected]> writes:

> On 01/04/2010 08:18 AM, Eric W. Biederman wrote:
>> Yinghai Lu <[email protected]> writes:
>>
>> This patch is wrong.
>>
>>> between FIRST_EXTERNAL_VECTOR (0x20) and FIRST_DEVICE_VECTOR (0x41)
>>>
>>> for 0x20 and 0x2f, we are safe be used_vectors will prevent it to use used one.
>>
>> We can not use any of 0x20 - 0x2f for ioapic irqs. We need the entire
>> priority level to ensure that the irq move cleanup ipi is of a lower
>> priority.
>>
>
> Almost makes one want to abuse 0x1f for that. Although 0x00..0x1f are
> reserved for exceptions, the APICs range down to 0x10, and well, when
> 0x1f ends up actually getting used as an exception vector that we
> support, then we can trivially change that. In the meantime it would
> actually make use of an otherwise-unusable APIC priority level.

An optimization like that (with a big fat comment) seems reasonable
to me.

Eric

2010-01-04 19:15:53

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

On 01/04/2010 11:04 AM, Eric W. Biederman wrote:
>>
>> any reason that we can not use 0x40?
>
> Not that I now of. Reading the comment it looks like it was only
> skipped so that the initial assignment of vectors would be.
>
> 0x31, 0x41, 0x51, 0x61, 0x71, 0x81, 0x91, 0xa1, 0xb1, 0xc1, 0xd1, 0xe1
> Instead of.
> 0x30, 0x40, 0x50, 0x60, 0x70, 0x90, 0xa0, 0xb0, 0xc0, 0xc0, 0xe0
>
> Which doesn't seem to be the worst notion, but at the point we are looking
> for every vector we can get it does seem to be problematic.
>

This can presumably be worked around by tweaking the initial assignment
algorithm slightly, without losing a whole vector to that.

Also, if we abuse vector 0x1f as the IRQ reassignment vector, we free up
a full 16 vectors per CPU -- this seems worthwhile especially since it
is a decision that can be trivially undone in the future: this is all
kernel internal, we're not creating any kind of API.

-hpa

2010-01-04 19:16:32

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86: get more exact nr_irqs

Yinghai Lu <[email protected]> writes:

> On Mon, Jan 4, 2010 at 8:55 AM, Eric W. Biederman <[email protected]> wrote:
>> Yinghai Lu <[email protected]> writes:
>>
>>> first check with NR_VECTORS - FIRST_EXTERNAL_VECTOR - 0x20
>>> aka minus exceptions and system vectors.
>>>
>>> NR_CPUS = 512, and nr_cpu_ids = 128
>>> will have NR_IRQS = 256 + 512 * 64 = 33024
>>>
>>> assume we have 20 intel ixgbe 6 port cards (with sriov and ixgbevf)
>>>       20 * 6 * 64 * 3 = 23040
>>>
>>> first will get:
>>>       128 * (256 - 64) = 24576
>>> then with nr_irqs_gsi will get
>>>       (120 + 8 * 128 + 120 * 256) = 31864
>>>
>>> so 24576 will be used for nr_irqs.
>>>
>>> 24576 * 8 = 196608 bytes will be used for irq_desc_ptrs[]
>>>
>>> before this patch:
>>>     have nr_irqs = 120 + 8 * 128 + 120 * 64 = 8824
>>>       and irq_desc_ptrs[] is 70592
>>>
>>> Signed-off-by: Yinghai Lu <[email protected]>
>>
>> I am lost.    arch_probe_nr_irqs appears to be total nonsense.
>>
>> We have three concepts.
>> - The number of irq sources we can talk about.  ( nr_irqs)
>> - The number of irqs we can possibly service.   ((NR_VECTORS - 0x30) *nr_cpu_ids)
>> - The number of irqs we actually connected up to cards in the
>>  system that we need to do something with.
>>
>> Why do we need to allocate arrays at all?
>>
>
> irq_desc is allocated dynamically.
>
> but irq_desc_ptrs is pointer array, it need to be allocated after
> nr_irqs is probed.

If we care about memory use efficiency let's replace irq_desc_ptrs
with a rbtree or a radix_tree. Something that moves the memory use
penalty onto those machines that have a lot of irqs.

Eric

2010-01-04 19:33:43

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86: get more exact nr_irqs

On 01/04/2010 11:16 AM, Eric W. Biederman wrote:
>
> If we care about memory use efficiency let's replace irq_desc_ptrs
> with a rbtree or a radix_tree. Something that moves the memory use
> penalty onto those machines that have a lot of irqs.
>

rbtree doesn't make much sense for something that is addressed by index,
and doesn't need to answer questions of the form "give me the highest
member <= X". A hash table or radix tree makes sense, depending on the
expected sparseness of the index.

-hpa

2010-01-04 19:37:16

by Yinghai Lu

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

On 01/04/2010 11:09 AM, Eric W. Biederman wrote:
> "H. Peter Anvin" <[email protected]> writes:
>
>> On 01/04/2010 08:18 AM, Eric W. Biederman wrote:
>>> Yinghai Lu <[email protected]> writes:
>>>
>>> This patch is wrong.
>>>
>>>> between FIRST_EXTERNAL_VECTOR (0x20) and FIRST_DEVICE_VECTOR (0x41)
>>>>
>>>> for 0x20 and 0x2f, we are safe be used_vectors will prevent it to use used one.
>>>
>>> We can not use any of 0x20 - 0x2f for ioapic irqs. We need the entire
>>> priority level to ensure that the irq move cleanup ipi is of a lower
>>> priority.
>>>
>>
>> Almost makes one want to abuse 0x1f for that. Although 0x00..0x1f are
>> reserved for exceptions, the APICs range down to 0x10, and well, when
>> 0x1f ends up actually getting used as an exception vector that we
>> support, then we can trivially change that. In the meantime it would
>> actually make use of an otherwise-unusable APIC priority level.
>
> An optimization like that (with a big fat comment) seems reasonable
> to me.

so we can use [0x10, 0x1f]

sth like this?

Subject: [PATCH 1/2] x86: get back 16 vectors

-v2: according to hpa that we could start from 0x10
according to Eric, we should hold 16 vectors for IRQ MOVE

Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/include/asm/irq_vectors.h | 3 ++-
arch/x86/kernel/apic/io_apic.c | 5 +++--
arch/x86/kernel/irqinit.c | 11 +++++++++--
3 files changed, 14 insertions(+), 5 deletions(-)

Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
+++ linux-2.6/arch/x86/kernel/apic/io_apic.c
@@ -1162,7 +1162,8 @@ __assign_irq_vector(int irq, struct irq_
* Also, we've got to be careful not to trash gate
* 0x80, because int 0x80 is hm, kind of importantish. ;)
*/
- static int current_vector = FIRST_DEVICE_VECTOR, current_offset = 0;
+ static int current_vector = 0;
+ static int current_offset = 0;
unsigned int old_vector;
int cpu, err;
cpumask_var_t tmp_mask;
@@ -1198,7 +1199,7 @@ next:
if (vector >= first_system_vector) {
/* If out of vectors on large boxen, must share them. */
offset = (offset + 1) % 8;
- vector = FIRST_DEVICE_VECTOR + offset;
+ vector = 0 + offset;
}
if (unlikely(current_vector == vector))
continue;
Index: linux-2.6/arch/x86/include/asm/irq_vectors.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/irq_vectors.h
+++ linux-2.6/arch/x86/include/asm/irq_vectors.h
@@ -30,8 +30,9 @@
/*
* IDT vectors usable for external interrupt sources start
* at 0x20:
+ * hpa said we can start from 0x10
*/
-#define FIRST_EXTERNAL_VECTOR 0x20
+#define FIRST_EXTERNAL_VECTOR 0x10

#ifdef CONFIG_X86_32
# define SYSCALL_VECTOR 0x80
Index: linux-2.6/arch/x86/kernel/irqinit.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/irqinit.c
+++ linux-2.6/arch/x86/kernel/irqinit.c
@@ -149,6 +149,8 @@ static void __init smp_intr_init(void)
{
#ifdef CONFIG_SMP
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_LOCAL_APIC)
+ int i;
+
/*
* The reschedule interrupt is a CPU-to-CPU reschedule-helper
* IPI, driven by wakeup.
@@ -174,7 +176,9 @@ static void __init smp_intr_init(void)

/* Low priority IPI to cleanup after moving an irq */
set_intr_gate(IRQ_MOVE_CLEANUP_VECTOR, irq_move_cleanup_interrupt);
- set_bit(IRQ_MOVE_CLEANUP_VECTOR, used_vectors);
+ /* Eric said: Need to hold entire priority */
+ for (i = IRQ_MOVE_CLEANUP_VECTOR; i < IRQ_MOVE_CLEANUP_VECTOR+0x10; i++)
+ set_bit(i, used_vectors);

/* IPI used for rebooting/stopping */
alloc_intr_gate(REBOOT_VECTOR, reboot_interrupt);
@@ -222,6 +226,9 @@ void __init native_init_IRQ(void)
/* Execute any quirks before the call gates are initialised: */
x86_init.irqs.pre_vector_init();

+ for (i = 0; i < 0x10; i++)
+ set_bit(i, used_vectors);
+
apic_intr_init();

/*
@@ -229,7 +236,7 @@ void __init native_init_IRQ(void)
* us. (some of these will be overridden and become
* 'special' SMP interrupts)
*/
- for (i = FIRST_EXTERNAL_VECTOR; i < NR_VECTORS; i++) {
+ for (i = 0; i < NR_VECTORS; i++) {
/* IA32_SYSCALL_VECTOR could be used in trap_init already. */
if (!test_bit(i, used_vectors))
set_intr_gate(i, interrupt[i-FIRST_EXTERNAL_VECTOR]);

2010-01-04 19:47:14

by Suresh Siddha

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

On Mon, 2010-01-04 at 11:35 -0800, Yinghai Lu wrote:
> sth like this?
>
> Subject: [PATCH 1/2] x86: get back 16 vectors
>
> -v2: according to hpa that we could start from 0x10
> according to Eric, we should hold 16 vectors for IRQ MOVE
>
> Signed-off-by: Yinghai Lu <[email protected]>
>

Yinghai we have to change IRQ_MOVE_CLEANUP_VECTOR to 0x1f or so. From
the cpu perspective this vector is documented as illegal, so we need to
check if this change will work on the cpu's we have today to get some
confidence.

thanks,
suresh

2010-01-04 19:48:48

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86: get more exact nr_irqs

On 01/04/2010 11:30 AM, H. Peter Anvin wrote:
> On 01/04/2010 11:16 AM, Eric W. Biederman wrote:
>>
>> If we care about memory use efficiency let's replace irq_desc_ptrs
>> with a rbtree or a radix_tree. Something that moves the memory use
>> penalty onto those machines that have a lot of irqs.
>>
>
> rbtree doesn't make much sense for something that is addressed by index,
> and doesn't need to answer questions of the form "give me the highest
> member <= X". A hash table or radix tree makes sense, depending on the
> expected sparseness of the index.

will check if we can use radix with it like powerpc

YH

2010-01-04 19:49:32

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

[Adding Suresh to the Cc: list]

On 01/04/2010 11:35 AM, Yinghai Lu wrote:
>
> so we can use [0x10, 0x1f]
>
> sth like this?
>

No!!!

[0x10, 0x1f] is reserved for exceptions. We can probably get away with
stealing *one* vector... presumably at the end (0x1f). However, we can
absolutely not use the whole block: 0x10-0x13 is occupied by exceptions
we already have OS support for (#MF, #AC, #MC, and #XM), and it's pretty
much guaranteed we'll have more coming. However, growth is quite slow
and since this is a kernel-internal vector (not accessible to user
space) it is not creating an API.

In other words, we could change FIRST_EXTERNAL_VECTOR to 0x1f, and use
it for IRQ_MOVE_CLEANUP_VECTOR. Then use 0x20..0x2f for the legacy vectors.

-hpa

2010-01-04 19:51:26

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

On 01/04/2010 11:45 AM, Suresh Siddha wrote:
> On Mon, 2010-01-04 at 11:35 -0800, Yinghai Lu wrote:
>> sth like this?
>>
>> Subject: [PATCH 1/2] x86: get back 16 vectors
>>
>> -v2: according to hpa that we could start from 0x10
>> according to Eric, we should hold 16 vectors for IRQ MOVE
>>
>> Signed-off-by: Yinghai Lu <[email protected]>
>>
>
> Yinghai we have to change IRQ_MOVE_CLEANUP_VECTOR to 0x1f or so. From
> the cpu perspective this vector is documented as illegal, so we need to
> check if this change will work on the cpu's we have today to get some
> confidence.
>

It's documented as reserved, not illegal. The ability for the APIC to
generate vectors starting at 0x10 is documented, as is the ability for
the CPU to receive any vector number as an interrupt -- in fact, the
legacy BIOS relies on being able to receive interrupts starting at
vector 0x08. It causes problems galore, but only at the software level.

-hpa

2010-01-04 20:05:18

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86: get more exact nr_irqs

"H. Peter Anvin" <[email protected]> writes:

> On 01/04/2010 11:16 AM, Eric W. Biederman wrote:
>>
>> If we care about memory use efficiency let's replace irq_desc_ptrs
>> with a rbtree or a radix_tree. Something that moves the memory use
>> penalty onto those machines that have a lot of irqs.
>>
>
> rbtree doesn't make much sense for something that is addressed by index,
> and doesn't need to answer questions of the form "give me the highest
> member <= X". A hash table or radix tree makes sense, depending on the
> expected sparseness of the index.

Not counting irqs for msi's I think we are looking 36% to 25% fill. Maybe
a little lower. The sparseness is much higher if we count the number of
irqs that we might/use allocate as we do today.

Short of driver hotplug msis should be allocated densely, unless we start
reserving all possible 4K msi-x vectors.

For each ioapic we allocate 16 gsis, and only maybe four of them are
connected to actual pci slots.

This is essentially a slow path operation, so as long as we are not
too expensive we can use any data structure we want. In kernel hash
tables don't grow well so I don't think a hash table is a good choice,
and a hash table is essentially what we have now.

The truth is we don't know how many irqs we will have until msi
supporting drivers claim all of theirs.

I think a radix-tree would likely be the least intrusive choice as it
does not imply any changes to the data structure indexed.

Eric

2010-01-04 20:08:38

by Yinghai Lu

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

On 01/04/2010 11:48 AM, H. Peter Anvin wrote:
> [Adding Suresh to the Cc: list]
>
> On 01/04/2010 11:35 AM, Yinghai Lu wrote:
>>
>> so we can use [0x10, 0x1f]
>>
>> sth like this?
>>
>
> No!!!
>
> [0x10, 0x1f] is reserved for exceptions. We can probably get away with
> stealing *one* vector... presumably at the end (0x1f). However, we can
> absolutely not use the whole block: 0x10-0x13 is occupied by exceptions
> we already have OS support for (#MF, #AC, #MC, and #XM), and it's pretty
> much guaranteed we'll have more coming. However, growth is quite slow
> and since this is a kernel-internal vector (not accessible to user
> space) it is not creating an API.
>
> In other words, we could change FIRST_EXTERNAL_VECTOR to 0x1f, and use
> it for IRQ_MOVE_CLEANUP_VECTOR. Then use 0x20..0x2f for the legacy vectors.
>
Subject: [PATCH 1/2] x86: get back 16 vectors

-v2: according to hpa that we could start from 0x1f

Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/include/asm/irq_vectors.h | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

Index: linux-2.6/arch/x86/include/asm/irq_vectors.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/irq_vectors.h
+++ linux-2.6/arch/x86/include/asm/irq_vectors.h
@@ -30,8 +30,9 @@
/*
* IDT vectors usable for external interrupt sources start
* at 0x20:
+ * hpa said we can start from 0x1f
*/
-#define FIRST_EXTERNAL_VECTOR 0x20
+#define FIRST_EXTERNAL_VECTOR 0x1f

#ifdef CONFIG_X86_32
# define SYSCALL_VECTOR 0x80
@@ -41,15 +42,15 @@
#endif

/*
- * Reserve the lowest usable priority level 0x20 - 0x2f for triggering
+ * Reserve the lowest usable priority level 0x1f for triggering
* cleanup after irq migration.
*/
#define IRQ_MOVE_CLEANUP_VECTOR FIRST_EXTERNAL_VECTOR

/*
- * Vectors 0x30-0x3f are used for ISA interrupts.
+ * Vectors 0x20-0x2f are used for ISA interrupts.
*/
-#define IRQ0_VECTOR (FIRST_EXTERNAL_VECTOR + 0x10)
+#define IRQ0_VECTOR (FIRST_EXTERNAL_VECTOR + 1)

#define IRQ1_VECTOR (IRQ0_VECTOR + 1)
#define IRQ2_VECTOR (IRQ0_VECTOR + 2)

2010-01-04 20:08:43

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

Yinghai Lu <[email protected]> writes:

> On 01/04/2010 11:09 AM, Eric W. Biederman wrote:
>> "H. Peter Anvin" <[email protected]> writes:
>>
>>> On 01/04/2010 08:18 AM, Eric W. Biederman wrote:
>>>> Yinghai Lu <[email protected]> writes:
>>>>
>>>> This patch is wrong.
>>>>
>>>>> between FIRST_EXTERNAL_VECTOR (0x20) and FIRST_DEVICE_VECTOR (0x41)
>>>>>
>>>>> for 0x20 and 0x2f, we are safe be used_vectors will prevent it to use used one.
>>>>
>>>> We can not use any of 0x20 - 0x2f for ioapic irqs. We need the entire
>>>> priority level to ensure that the irq move cleanup ipi is of a lower
>>>> priority.
>>>>
>>>
>>> Almost makes one want to abuse 0x1f for that. Although 0x00..0x1f are
>>> reserved for exceptions, the APICs range down to 0x10, and well, when
>>> 0x1f ends up actually getting used as an exception vector that we
>>> support, then we can trivially change that. In the meantime it would
>>> actually make use of an otherwise-unusable APIC priority level.
>>
>> An optimization like that (with a big fat comment) seems reasonable
>> to me.
>
> so we can use [0x10, 0x1f]
>
> sth like this?

Something. We can not use all of 0x10 - 0x1f, it is simply
that hardware can address all of that. 0x10 is already defined
as something I forget what. 0x12 is already the MCE_VECTOR.


Since hardware has not yet defined 0x1f (and is not likely to for
a while. We can use that). So we wind up using hardware priority
a single ipi, and hardware exceptions.

Eric

2010-01-04 20:14:59

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

Yinghai Lu <[email protected]> writes:

> On 01/04/2010 11:48 AM, H. Peter Anvin wrote:
>> [Adding Suresh to the Cc: list]
>>
>> On 01/04/2010 11:35 AM, Yinghai Lu wrote:
>>>
>>> so we can use [0x10, 0x1f]
>>>
>>> sth like this?
>>>
>>
>> No!!!
>>
>> [0x10, 0x1f] is reserved for exceptions. We can probably get away with
>> stealing *one* vector... presumably at the end (0x1f). However, we can
>> absolutely not use the whole block: 0x10-0x13 is occupied by exceptions
>> we already have OS support for (#MF, #AC, #MC, and #XM), and it's pretty
>> much guaranteed we'll have more coming. However, growth is quite slow
>> and since this is a kernel-internal vector (not accessible to user
>> space) it is not creating an API.
>>
>> In other words, we could change FIRST_EXTERNAL_VECTOR to 0x1f, and use
>> it for IRQ_MOVE_CLEANUP_VECTOR. Then use 0x20..0x2f for the legacy vectors.
>>
> Subject: [PATCH 1/2] x86: get back 16 vectors
>
> -v2: according to hpa that we could start from 0x1f

The code in the patch is ok, but the comments are wrong.

>
> Signed-off-by: Yinghai Lu <[email protected]>
>
> ---
> arch/x86/include/asm/irq_vectors.h | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> Index: linux-2.6/arch/x86/include/asm/irq_vectors.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/irq_vectors.h
> +++ linux-2.6/arch/x86/include/asm/irq_vectors.h
> @@ -30,8 +30,9 @@
> /*
> * IDT vectors usable for external interrupt sources start
> * at 0x20:
> + * hpa said we can start from 0x1f

You need to document the reasons here.

> */
> -#define FIRST_EXTERNAL_VECTOR 0x20
> +#define FIRST_EXTERNAL_VECTOR 0x1f
>
> #ifdef CONFIG_X86_32
> # define SYSCALL_VECTOR 0x80
> @@ -41,15 +42,15 @@
> #endif
>
> /*
> - * Reserve the lowest usable priority level 0x20 - 0x2f for triggering
> + * Reserve the lowest usable priority level 0x1f for triggering

Should be:
+ * Reserve the lowest usable priority level 0x10 - 0x1f for triggering
> * cleanup after irq migration.

+ * this overlaps with the reserved range for cpu exceptions so this
+ * will need to be changed to 0x20 - 0x2f if the last cpu exception is
+ * ever allocated.
> */
> #define IRQ_MOVE_CLEANUP_VECTOR FIRST_EXTERNAL_VECTOR
>
> /*
> - * Vectors 0x30-0x3f are used for ISA interrupts.
> + * Vectors 0x20-0x2f are used for ISA interrupts.
> */
> -#define IRQ0_VECTOR (FIRST_EXTERNAL_VECTOR + 0x10)
> +#define IRQ0_VECTOR (FIRST_EXTERNAL_VECTOR + 1)
>
> #define IRQ1_VECTOR (IRQ0_VECTOR + 1)
> #define IRQ2_VECTOR (IRQ0_VECTOR + 2)

2010-01-04 20:34:48

by Yinghai Lu

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

Subject: [PATCH 1/2] x86: get back 16 vectors

-v2: according to hpa that we could start from 0x1f
-v3: update comments

Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/include/asm/irq_vectors.h | 30 +++++++++++++++++++-----------
1 file changed, 19 insertions(+), 11 deletions(-)

Index: linux-2.6/arch/x86/include/asm/irq_vectors.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/irq_vectors.h
+++ linux-2.6/arch/x86/include/asm/irq_vectors.h
@@ -30,8 +30,12 @@
/*
* IDT vectors usable for external interrupt sources start
* at 0x20:
+ * hpa said we can start from 0x1f.
+ * 0x1f is documented as reserved. The ability for the APIC to
+ * generate vectors starting at 0x10 is documented, as is the ability for
+ * the CPU to receive any vector number as an interrupt
*/
-#define FIRST_EXTERNAL_VECTOR 0x20
+#define FIRST_EXTERNAL_VECTOR 0x1f

#ifdef CONFIG_X86_32
# define SYSCALL_VECTOR 0x80
@@ -41,15 +45,19 @@
#endif

/*
- * Reserve the lowest usable priority level 0x20 - 0x2f for triggering
+ * Reserve the lowest usable priority level 0x10 - 0x1f for triggering
* cleanup after irq migration.
+ * this overlaps with the reserved range for cpu exceptions so this
+ * will need to be changed to 0x20 - 0x2f if the last cpu exception is
+ * ever allocated.
*/
+
#define IRQ_MOVE_CLEANUP_VECTOR FIRST_EXTERNAL_VECTOR

/*
- * Vectors 0x30-0x3f are used for ISA interrupts.
+ * Vectors 0x20-0x2f are used for ISA interrupts.
*/
-#define IRQ0_VECTOR (FIRST_EXTERNAL_VECTOR + 0x10)
+#define IRQ0_VECTOR (FIRST_EXTERNAL_VECTOR + 1)

#define IRQ1_VECTOR (IRQ0_VECTOR + 1)
#define IRQ2_VECTOR (IRQ0_VECTOR + 2)
@@ -68,6 +76,13 @@
#define IRQ15_VECTOR (IRQ0_VECTOR + 15)

/*
+ * First APIC vector available to drivers: (vectors 0x30-0xee) we
+ * start at 0x31 to spread out vectors evenly between priority
+ * levels. (0x80 is the syscall vector)
+ */
+#define FIRST_DEVICE_VECTOR (IRQ15_VECTOR + 2)
+
+/*
* Special IRQ vectors used by the SMP architecture, 0xf0-0xff
*
* some of the following vectors are 'rare', they are merged
@@ -120,13 +135,6 @@
*/
#define MCE_SELF_VECTOR 0xeb

-/*
- * First APIC vector available to drivers: (vectors 0x30-0xee) we
- * start at 0x31(0x41) to spread out vectors evenly between priority
- * levels. (0x80 is the syscall vector)
- */
-#define FIRST_DEVICE_VECTOR (IRQ15_VECTOR + 2)
-
#define NR_VECTORS 256

#define FPU_IRQ 13

2010-01-04 21:12:02

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

On 01/04/2010 12:33 PM, Yinghai Lu wrote:
> --- linux-2.6.orig/arch/x86/include/asm/irq_vectors.h
> +++ linux-2.6/arch/x86/include/asm/irq_vectors.h
> @@ -30,8 +30,12 @@
> /*
> * IDT vectors usable for external interrupt sources start
> * at 0x20:
> + * hpa said we can start from 0x1f.
> + * 0x1f is documented as reserved. The ability for the APIC to
> + * generate vectors starting at 0x10 is documented, as is the ability for
> + * the CPU to receive any vector number as an interrupt
> */
> -#define FIRST_EXTERNAL_VECTOR 0x20
> +#define FIRST_EXTERNAL_VECTOR 0x1f
>

This really isn't a sufficient explanation either. I know writing
English prose is very difficult for you, but I'm sorry, you really need
to start getting better about your comments and commit messages.

In this case, the text is missing one very important piece of the
justification: otherwise we have to waste a full 16 vectors in order for
the IRQ migration interrupt to get its own priority level. Thus,
something like this:

* 0x1f is documented as reserved. However, the ability for the APIC
* to generate vectors starting at 0x10 is documented, as is the
* ability for the CPU to receive any vector number as an interrupt.
* 0x1f is used for IRQ_MOVE_CLEANUP_VECTOR since that vector needs
* an entire privilege level (16 vectors) all by itself at a higher
* priority than any actual device vector. Thus, by placing it in the
* otherwise-unusable 0x10 privilege level, we avoid wasting a full
* 16-vector block.

-hpa

2010-01-04 21:22:14

by Yinghai Lu

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

On 01/04/2010 01:10 PM, H. Peter Anvin wrote:
> * 0x1f is documented as reserved. However, the ability for the APIC
> * to generate vectors starting at 0x10 is documented, as is the
> * ability for the CPU to receive any vector number as an interrupt.
> * 0x1f is used for IRQ_MOVE_CLEANUP_VECTOR since that vector needs
> * an entire privilege level (16 vectors) all by itself at a higher
> * priority than any actual device vector. Thus, by placing it in the
> * otherwise-unusable 0x10 privilege level, we avoid wasting a full
> * 16-vector block.

Subject: [PATCH 1/2] x86: get back 16 vectors

-v2: according to hpa that we could start from 0x1f
-v3: update comments from Eric
-v4: update comments from hpa

Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/include/asm/irq_vectors.h | 35 ++++++++++++++++++++++++-----------
1 file changed, 24 insertions(+), 11 deletions(-)

Index: linux-2.6/arch/x86/include/asm/irq_vectors.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/irq_vectors.h
+++ linux-2.6/arch/x86/include/asm/irq_vectors.h
@@ -30,8 +30,17 @@
/*
* IDT vectors usable for external interrupt sources start
* at 0x20:
+ * hpa said we can start from 0x1f.
+ * 0x1f is documented as reserved. However, the ability for the APIC
+ * to generate vectors starting at 0x10 is documented, as is the
+ * ability for the CPU to receive any vector number as an interrupt.
+ * 0x1f is used for IRQ_MOVE_CLEANUP_VECTOR since that vector needs
+ * an entire privilege level (16 vectors) all by itself at a higher
+ * priority than any actual device vector. Thus, by placing it in the
+ * otherwise-unusable 0x10 privilege level, we avoid wasting a full
+ * 16-vector block.
*/
-#define FIRST_EXTERNAL_VECTOR 0x20
+#define FIRST_EXTERNAL_VECTOR 0x1f

#ifdef CONFIG_X86_32
# define SYSCALL_VECTOR 0x80
@@ -41,15 +50,19 @@
#endif

/*
- * Reserve the lowest usable priority level 0x20 - 0x2f for triggering
+ * Reserve the lowest usable priority level 0x10 - 0x1f for triggering
* cleanup after irq migration.
+ * this overlaps with the reserved range for cpu exceptions so this
+ * will need to be changed to 0x20 - 0x2f if the last cpu exception is
+ * ever allocated.
*/
+
#define IRQ_MOVE_CLEANUP_VECTOR FIRST_EXTERNAL_VECTOR

/*
- * Vectors 0x30-0x3f are used for ISA interrupts.
+ * Vectors 0x20-0x2f are used for ISA interrupts.
*/
-#define IRQ0_VECTOR (FIRST_EXTERNAL_VECTOR + 0x10)
+#define IRQ0_VECTOR (FIRST_EXTERNAL_VECTOR + 1)

#define IRQ1_VECTOR (IRQ0_VECTOR + 1)
#define IRQ2_VECTOR (IRQ0_VECTOR + 2)
@@ -68,6 +81,13 @@
#define IRQ15_VECTOR (IRQ0_VECTOR + 15)

/*
+ * First APIC vector available to drivers: (vectors 0x30-0xee) we
+ * start at 0x31 to spread out vectors evenly between priority
+ * levels. (0x80 is the syscall vector)
+ */
+#define FIRST_DEVICE_VECTOR (IRQ15_VECTOR + 2)
+
+/*
* Special IRQ vectors used by the SMP architecture, 0xf0-0xff
*
* some of the following vectors are 'rare', they are merged
@@ -120,13 +140,6 @@
*/
#define MCE_SELF_VECTOR 0xeb

-/*
- * First APIC vector available to drivers: (vectors 0x30-0xee) we
- * start at 0x31(0x41) to spread out vectors evenly between priority
- * levels. (0x80 is the syscall vector)
- */
-#define FIRST_DEVICE_VECTOR (IRQ15_VECTOR + 2)
-
#define NR_VECTORS 256

#define FPU_IRQ 13

2010-01-04 21:34:50

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

On 01/04/2010 01:20 PM, Yinghai Lu wrote:
> --- linux-2.6.orig/arch/x86/include/asm/irq_vectors.h
> +++ linux-2.6/arch/x86/include/asm/irq_vectors.h
> @@ -30,8 +30,17 @@
> /*
> * IDT vectors usable for external interrupt sources start
> * at 0x20:
> + * hpa said we can start from 0x1f.
> + * 0x1f is documented as reserved. However, the ability for the APIC
> + * to generate vectors starting at 0x10 is documented, as is the
> + * ability for the CPU to receive any vector number as an interrupt.
> + * 0x1f is used for IRQ_MOVE_CLEANUP_VECTOR since that vector needs
> + * an entire privilege level (16 vectors) all by itself at a higher
> + * priority than any actual device vector. Thus, by placing it in the
> + * otherwise-unusable 0x10 privilege level, we avoid wasting a full
> + * 16-vector block.
> */
> -#define FIRST_EXTERNAL_VECTOR 0x20
> +#define FIRST_EXTERNAL_VECTOR 0x1f
>
> #ifdef CONFIG_X86_32
> # define SYSCALL_VECTOR 0x80
> @@ -41,15 +50,19 @@
> #endif
>
> /*
> - * Reserve the lowest usable priority level 0x20 - 0x2f for triggering
> + * Reserve the lowest usable priority level 0x10 - 0x1f for triggering
> * cleanup after irq migration.
> + * this overlaps with the reserved range for cpu exceptions so this
> + * will need to be changed to 0x20 - 0x2f if the last cpu exception is
> + * ever allocated.
> */
> +
> #define IRQ_MOVE_CLEANUP_VECTOR FIRST_EXTERNAL_VECTOR
>
> /*
> - * Vectors 0x30-0x3f are used for ISA interrupts.
> + * Vectors 0x20-0x2f are used for ISA interrupts.
> */
> -#define IRQ0_VECTOR (FIRST_EXTERNAL_VECTOR + 0x10)
> +#define IRQ0_VECTOR (FIRST_EXTERNAL_VECTOR + 1)
>
> #define IRQ1_VECTOR (IRQ0_VECTOR + 1)
> #define IRQ2_VECTOR (IRQ0_VECTOR + 2)
> @@ -68,6 +81,13 @@
> #define IRQ15_VECTOR (IRQ0_VECTOR + 15)
>

I'm not sure that making IRQ_MOVE_CLEANUP_VECTOR and IRQ0_VECTOR offsets
from FIRST_EXTERNAL_VECTOR makes sense from a readability perspective.
These are now magic numbers, and making them offsets is only confusing,
as it implies we could do it differently.

If nothing else, the actual logic for IRQ0_VECTOR should be:

#define IRQ0_VECTOR ((FIRST_EXTERNAL_VECTOR + 16) & ~15)

... since that is what we actually want -- we round up to the next
16-vector boundary. Both +16 and +1 misrepresent the logic.

> /*
> + * First APIC vector available to drivers: (vectors 0x30-0xee) we
> + * start at 0x31 to spread out vectors evenly between priority
> + * levels. (0x80 is the syscall vector)
> + */
> +#define FIRST_DEVICE_VECTOR (IRQ15_VECTOR + 2)
> +

We really should fix that so we can do +1 here instead of +2; that
presumably means fixing the logic so we do something smarter than just
jump over 0x80.

-hpa

2010-01-04 21:51:19

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86: get more exact nr_irqs

On 01/04/2010 12:05 PM, Eric W. Biederman wrote:
>>
>> rbtree doesn't make much sense for something that is addressed by index,
>> and doesn't need to answer questions of the form "give me the highest
>> member <= X". A hash table or radix tree makes sense, depending on the
>> expected sparseness of the index.
>
> Not counting irqs for msi's I think we are looking 36% to 25% fill. Maybe
> a little lower. The sparseness is much higher if we count the number of
> irqs that we might/use allocate as we do today.
>
> Short of driver hotplug msis should be allocated densely, unless we start
> reserving all possible 4K msi-x vectors.
>
> For each ioapic we allocate 16 gsis, and only maybe four of them are
> connected to actual pci slots.
>
> This is essentially a slow path operation, so as long as we are not
> too expensive we can use any data structure we want. In kernel hash
> tables don't grow well so I don't think a hash table is a good choice,
> and a hash table is essentially what we have now.
>
> The truth is we don't know how many irqs we will have until msi
> supporting drivers claim all of theirs.
>
> I think a radix-tree would likely be the least intrusive choice as it
> does not imply any changes to the data structure indexed.
>

Yes, for that kind of densities radix tree is a good choice.

-hpa

2010-01-04 22:03:56

by Yinghai Lu

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

On 01/04/2010 01:33 PM, H. Peter Anvin wrote:
> On 01/04/2010 01:20 PM, Yinghai Lu wrote:
>> --- linux-2.6.orig/arch/x86/include/asm/irq_vectors.h
>> +++ linux-2.6/arch/x86/include/asm/irq_vectors.h
>> @@ -30,8 +30,17 @@
>> /*
>> * IDT vectors usable for external interrupt sources start
>> * at 0x20:
>> + * hpa said we can start from 0x1f.
>> + * 0x1f is documented as reserved. However, the ability for the APIC
>> + * to generate vectors starting at 0x10 is documented, as is the
>> + * ability for the CPU to receive any vector number as an interrupt.
>> + * 0x1f is used for IRQ_MOVE_CLEANUP_VECTOR since that vector needs
>> + * an entire privilege level (16 vectors) all by itself at a higher
>> + * priority than any actual device vector. Thus, by placing it in the
>> + * otherwise-unusable 0x10 privilege level, we avoid wasting a full
>> + * 16-vector block.
>> */
>> -#define FIRST_EXTERNAL_VECTOR 0x20
>> +#define FIRST_EXTERNAL_VECTOR 0x1f
>>
>> #ifdef CONFIG_X86_32
>> # define SYSCALL_VECTOR 0x80
>> @@ -41,15 +50,19 @@
>> #endif
>>
>> /*
>> - * Reserve the lowest usable priority level 0x20 - 0x2f for triggering
>> + * Reserve the lowest usable priority level 0x10 - 0x1f for triggering
>> * cleanup after irq migration.
>> + * this overlaps with the reserved range for cpu exceptions so this
>> + * will need to be changed to 0x20 - 0x2f if the last cpu exception is
>> + * ever allocated.
>> */
>> +
>> #define IRQ_MOVE_CLEANUP_VECTOR FIRST_EXTERNAL_VECTOR
>>
>> /*
>> - * Vectors 0x30-0x3f are used for ISA interrupts.
>> + * Vectors 0x20-0x2f are used for ISA interrupts.
>> */
>> -#define IRQ0_VECTOR (FIRST_EXTERNAL_VECTOR + 0x10)
>> +#define IRQ0_VECTOR (FIRST_EXTERNAL_VECTOR + 1)
>>
>> #define IRQ1_VECTOR (IRQ0_VECTOR + 1)
>> #define IRQ2_VECTOR (IRQ0_VECTOR + 2)
>> @@ -68,6 +81,13 @@
>> #define IRQ15_VECTOR (IRQ0_VECTOR + 15)
>>
>
> I'm not sure that making IRQ_MOVE_CLEANUP_VECTOR and IRQ0_VECTOR offsets
> from FIRST_EXTERNAL_VECTOR makes sense from a readability perspective.
> These are now magic numbers, and making them offsets is only confusing,
> as it implies we could do it differently.
>
> If nothing else, the actual logic for IRQ0_VECTOR should be:
>
> #define IRQ0_VECTOR ((FIRST_EXTERNAL_VECTOR + 16) & ~15)
>
> ... since that is what we actually want -- we round up to the next
> 16-vector boundary. Both +16 and +1 misrepresent the logic.

that will be good, if later update FIRST_EXTERNAL_VECTOR...

>
>> /*
>> + * First APIC vector available to drivers: (vectors 0x30-0xee) we
>> + * start at 0x31 to spread out vectors evenly between priority
>> + * levels. (0x80 is the syscall vector)
>> + */
>> +#define FIRST_DEVICE_VECTOR (IRQ15_VECTOR + 2)
>> +
>
> We really should fix that so we can do +1 here instead of +2; that
> presumably means fixing the logic so we do something smarter than just
> jump over 0x80.

we already use used_vectors to skip 0x80. so we could change that to +1?

YH

2010-01-04 23:04:54

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

On 01/04/2010 02:01 PM, Yinghai Lu wrote:
>>
>>> /*
>>> + * First APIC vector available to drivers: (vectors 0x30-0xee) we
>>> + * start at 0x31 to spread out vectors evenly between priority
>>> + * levels. (0x80 is the syscall vector)
>>> + */
>>> +#define FIRST_DEVICE_VECTOR (IRQ15_VECTOR + 2)
>>> +
>>
>> We really should fix that so we can do +1 here instead of +2; that
>> presumably means fixing the logic so we do something smarter than just
>> jump over 0x80.
>
> we already use used_vectors to skip 0x80. so we could change that to +1?
>

Yes, but the problem is that we *skip* 0x80, which leads to suboptimal
allocation on systems with only a handful of vectors.

The easy solution to accomplishing what we want without wasting vector
0x30 is obviously to start allocation at 0x31, but not by artificially
limiting the vector space; see the attached patch.

For what it's worth, this code(__assign_irq_vector() in
arch/x86/kernel/apic/io_apic.c) has me somewhat confused about the use
of the constant 8:

vector += 8;

The only justification that I can immediately think of is to try to
assign exactly two sources to each priority level (since early APICs
started losing interrupts with more than two sources per priority level.)

This is ancient code -- predates not just the git but the bk history --
and as such I would assume that that is the motivation.

-hpa


Attachments:
0001-x86-irq-Don-t-waste-a-vector-to-improve-vector-sprea.patch (2.32 kB)

2010-01-04 23:34:06

by Yinghai Lu

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

On 01/04/2010 03:03 PM, H. Peter Anvin wrote:
> On 01/04/2010 02:01 PM, Yinghai Lu wrote:
>>>
>>>> /*
>>>> + * First APIC vector available to drivers: (vectors 0x30-0xee) we
>>>> + * start at 0x31 to spread out vectors evenly between priority
>>>> + * levels. (0x80 is the syscall vector)
>>>> + */
>>>> +#define FIRST_DEVICE_VECTOR (IRQ15_VECTOR + 2)
>>>> +
>>>
>>> We really should fix that so we can do +1 here instead of +2; that
>>> presumably means fixing the logic so we do something smarter than just
>>> jump over 0x80.
>>
>> we already use used_vectors to skip 0x80. so we could change that to +1?
>>
>
> Yes, but the problem is that we *skip* 0x80, which leads to suboptimal
> allocation on systems with only a handful of vectors.
>
> The easy solution to accomplishing what we want without wasting vector
> 0x30 is obviously to start allocation at 0x31, but not by artificially
> limiting the vector space; see the attached patch.
>
> For what it's worth, this code(__assign_irq_vector() in
> arch/x86/kernel/apic/io_apic.c) has me somewhat confused about the use
> of the constant 8:
>
> vector += 8;
>
> The only justification that I can immediately think of is to try to
> assign exactly two sources to each priority level (since early APICs
> started losing interrupts with more than two sources per priority level.)
>
> This is ancient code -- predates not just the git but the bk history --
> and as such I would assume that that is the motivation.

yes the patch get back 0x30, 0x38, 0x40, 0x48 etc back.

YH

2010-01-04 23:39:19

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

On 01/04/2010 03:32 PM, Yinghai Lu wrote:
>
> yes the patch get back 0x30, 0x38, 0x40, 0x48 etc back.
>

I would have expected the old code to still have had 0x38, 0x40, 0x48,
... available, just missing the first one?

-hpa

2010-01-04 23:44:22

by Yinghai Lu

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

On 01/04/2010 03:38 PM, H. Peter Anvin wrote:
> On 01/04/2010 03:32 PM, Yinghai Lu wrote:
>>
>> yes the patch get back 0x30, 0x38, 0x40, 0x48 etc back.
>>
>
> I would have expected the old code to still have had 0x38, 0x40, 0x48,
> ... available, just missing the first one?
>

you are right. other via offset=7 to be used.

YH

2010-01-04 23:51:47

by Yinghai Lu

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

why not start from 0x30, 0x38, 0x40, 0x48 ...
instead of 0x31, 0x39, 0x41, 0x47...?

YH


2010-01-05 00:00:44

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

On 01/04/2010 03:49 PM, Yinghai Lu wrote:
> why not start from 0x30, 0x38, 0x40, 0x48 ...
> instead of 0x31, 0x39, 0x41, 0x47...?
>
> YH

The whole point is to defer the skip over 0x80.

-hpa

2010-01-05 00:06:58

by Suresh Siddha

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

On Mon, 2010-01-04 at 11:50 -0800, H. Peter Anvin wrote:
> On 01/04/2010 11:45 AM, Suresh Siddha wrote:
> > On Mon, 2010-01-04 at 11:35 -0800, Yinghai Lu wrote:
> >> sth like this?
> >>
> >> Subject: [PATCH 1/2] x86: get back 16 vectors
> >>
> >> -v2: according to hpa that we could start from 0x10
> >> according to Eric, we should hold 16 vectors for IRQ MOVE
> >>
> >> Signed-off-by: Yinghai Lu <[email protected]>
> >>
> >
> > Yinghai we have to change IRQ_MOVE_CLEANUP_VECTOR to 0x1f or so. From
> > the cpu perspective this vector is documented as illegal, so we need to
> > check if this change will work on the cpu's we have today to get some
> > confidence.
> >
>
> It's documented as reserved, not illegal. The ability for the APIC to
> generate vectors starting at 0x10 is documented, as is the ability for
> the CPU to receive any vector number as an interrupt -- in fact, the
> legacy BIOS relies on being able to receive interrupts starting at
> vector 0x08. It causes problems galore, but only at the software level.

I have checked out couple of platforms (including 32-bit atom) and 0x1f
vector logic seems to be working.

Hopefully we won't have other hardware or software issues (vmm
restrictions etc) with this logic.

thanks,
suresh

2010-01-05 00:17:25

by Yinghai Lu

[permalink] [raw]
Subject: Re: Subject: [PATCH 1/2] x86: get back 15 vectors

On 01/04/2010 04:05 PM, Suresh Siddha wrote:
> On Mon, 2010-01-04 at 11:50 -0800, H. Peter Anvin wrote:
>> On 01/04/2010 11:45 AM, Suresh Siddha wrote:
>>> On Mon, 2010-01-04 at 11:35 -0800, Yinghai Lu wrote:
>>>> sth like this?
>>>>
>>>> Subject: [PATCH 1/2] x86: get back 16 vectors
>>>>
>>>> -v2: according to hpa that we could start from 0x10
>>>> according to Eric, we should hold 16 vectors for IRQ MOVE
>>>>
>>>> Signed-off-by: Yinghai Lu <[email protected]>
>>>>
>>>
>>> Yinghai we have to change IRQ_MOVE_CLEANUP_VECTOR to 0x1f or so. From
>>> the cpu perspective this vector is documented as illegal, so we need to
>>> check if this change will work on the cpu's we have today to get some
>>> confidence.
>>>
>>
>> It's documented as reserved, not illegal. The ability for the APIC to
>> generate vectors starting at 0x10 is documented, as is the ability for
>> the CPU to receive any vector number as an interrupt -- in fact, the
>> legacy BIOS relies on being able to receive interrupts starting at
>> vector 0x08. It causes problems galore, but only at the software level.
>
> I have checked out couple of platforms (including 32-bit atom) and 0x1f
> vector logic seems to be working.
>
> Hopefully we won't have other hardware or software issues (vmm
> restrictions etc) with this logic.
>

good. hope it is final version. let's have hpa own it.

From: "H. Peter Anvin" <[email protected]>
Subject: [PATCH] x86: get back 16 vectors

-v2: according to hpa that we could start from 0x1f
-v3: update comments from Eric
-v4: update comments from hpa
-v5: use round up for IRQ0_VECTOR according to hpa

Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/include/asm/irq_vectors.h | 40 ++++++++++++++++++++++++-------------
1 file changed, 26 insertions(+), 14 deletions(-)

Index: linux-2.6/arch/x86/include/asm/irq_vectors.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/irq_vectors.h
+++ linux-2.6/arch/x86/include/asm/irq_vectors.h
@@ -30,26 +30,38 @@
/*
* IDT vectors usable for external interrupt sources start
* at 0x20:
+ * hpa said we can start from 0x1f.
+ * 0x1f is documented as reserved. However, the ability for the APIC
+ * to generate vectors starting at 0x10 is documented, as is the
+ * ability for the CPU to receive any vector number as an interrupt.
+ * 0x1f is used for IRQ_MOVE_CLEANUP_VECTOR since that vector needs
+ * an entire privilege level (16 vectors) all by itself at a higher
+ * priority than any actual device vector. Thus, by placing it in the
+ * otherwise-unusable 0x10 privilege level, we avoid wasting a full
+ * 16-vector block.
*/
-#define FIRST_EXTERNAL_VECTOR 0x20
+#define FIRST_EXTERNAL_VECTOR 0x1f

+#define IA32_SYSCALL_VECTOR 0x80
#ifdef CONFIG_X86_32
# define SYSCALL_VECTOR 0x80
-# define IA32_SYSCALL_VECTOR 0x80
-#else
-# define IA32_SYSCALL_VECTOR 0x80
#endif

/*
- * Reserve the lowest usable priority level 0x20 - 0x2f for triggering
+ * Reserve the lowest usable priority level 0x10 - 0x1f for triggering
* cleanup after irq migration.
+ * this overlaps with the reserved range for cpu exceptions so this
+ * will need to be changed to 0x20 - 0x2f if the last cpu exception is
+ * ever allocated.
*/
+
#define IRQ_MOVE_CLEANUP_VECTOR FIRST_EXTERNAL_VECTOR

/*
- * Vectors 0x30-0x3f are used for ISA interrupts.
+ * Vectors 0x20-0x2f are used for ISA interrupts.
+ * round up to the next 16-vector boundary
*/
-#define IRQ0_VECTOR (FIRST_EXTERNAL_VECTOR + 0x10)
+#define IRQ0_VECTOR ((FIRST_EXTERNAL_VECTOR + 16) & ~15)

#define IRQ1_VECTOR (IRQ0_VECTOR + 1)
#define IRQ2_VECTOR (IRQ0_VECTOR + 2)

2010-01-05 05:31:15

by H. Peter Anvin

[permalink] [raw]
Subject: [tip:x86/apic] x86, apic: Reclaim IDT vectors 0x20-0x2f

Commit-ID: 99d113b17e8ca5a8b68a9d3f7691e2f552dd6a06
Gitweb: http://git.kernel.org/tip/99d113b17e8ca5a8b68a9d3f7691e2f552dd6a06
Author: H. Peter Anvin <[email protected]>
AuthorDate: Mon, 4 Jan 2010 16:16:06 -0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Mon, 4 Jan 2010 21:12:52 -0800

x86, apic: Reclaim IDT vectors 0x20-0x2f

Reclaim 16 IDT vectors and make them available for general allocation.

Reclaim vectors 0x20-0x2f by reallocating the IRQ_MOVE_CLEANUP_VECTOR
to vector 0x1f. This is in the range of vector numbers that is
officially reserved for the CPU (for exceptions), however, the use of
the APIC to generate any vector 0x10 or above is documented, and the
CPU internally can receive any vector number (the legacy BIOS uses INT
0x08-0x0f for interrupts, as messed up as that is.)

Since IRQ_MOVE_CLEANUP_VECTOR has to be alone in the lowest-numbered
priority level (block of 16), this effectively enables us to reclaim
an otherwise-unusable APIC priority level and put it to use.

Since this is a transient kernel-only allocation we can change it at
any time, and if/when there is an exception at vector 0x1f this
assignment needs to be changed as part of OS enabling that new feature.

Signed-off-by: Yinghai Lu <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/irq_vectors.h | 28 ++++++++++++++++++++--------
1 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 3ab43df..dbc81ac 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -30,26 +30,38 @@
/*
* IDT vectors usable for external interrupt sources start
* at 0x20:
+ * hpa said we can start from 0x1f.
+ * 0x1f is documented as reserved. However, the ability for the APIC
+ * to generate vectors starting at 0x10 is documented, as is the
+ * ability for the CPU to receive any vector number as an interrupt.
+ * 0x1f is used for IRQ_MOVE_CLEANUP_VECTOR since that vector needs
+ * an entire privilege level (16 vectors) all by itself at a higher
+ * priority than any actual device vector. Thus, by placing it in the
+ * otherwise-unusable 0x10 privilege level, we avoid wasting a full
+ * 16-vector block.
*/
-#define FIRST_EXTERNAL_VECTOR 0x20
+#define FIRST_EXTERNAL_VECTOR 0x1f

+#define IA32_SYSCALL_VECTOR 0x80
#ifdef CONFIG_X86_32
# define SYSCALL_VECTOR 0x80
-# define IA32_SYSCALL_VECTOR 0x80
-#else
-# define IA32_SYSCALL_VECTOR 0x80
#endif

/*
- * Reserve the lowest usable priority level 0x20 - 0x2f for triggering
+ * Reserve the lowest usable priority level 0x10 - 0x1f for triggering
* cleanup after irq migration.
+ * this overlaps with the reserved range for cpu exceptions so this
+ * will need to be changed to 0x20 - 0x2f if the last cpu exception is
+ * ever allocated.
*/
+
#define IRQ_MOVE_CLEANUP_VECTOR FIRST_EXTERNAL_VECTOR

/*
- * Vectors 0x30-0x3f are used for ISA interrupts.
+ * Vectors 0x20-0x2f are used for ISA interrupts.
+ * round up to the next 16-vector boundary
*/
-#define IRQ0_VECTOR (FIRST_EXTERNAL_VECTOR + 0x10)
+#define IRQ0_VECTOR ((FIRST_EXTERNAL_VECTOR + 16) & ~15)

#define IRQ1_VECTOR (IRQ0_VECTOR + 1)
#define IRQ2_VECTOR (IRQ0_VECTOR + 2)
@@ -122,7 +134,7 @@

/*
* First APIC vector available to drivers: (vectors 0x30-0xee) we
- * start at 0x31(0x41) to spread out vectors evenly between priority
+ * start at 0x31 to spread out vectors evenly between priority
* levels. (0x80 is the syscall vector)
*/
#define FIRST_DEVICE_VECTOR (IRQ15_VECTOR + 2)

2010-01-05 05:31:25

by H. Peter Anvin

[permalink] [raw]
Subject: [tip:x86/apic] x86, apic: Don't waste a vector to improve vector spread

Commit-ID: ea94396629a3e0cb9a3a9c75335b1de255b30426
Gitweb: http://git.kernel.org/tip/ea94396629a3e0cb9a3a9c75335b1de255b30426
Author: H. Peter Anvin <[email protected]>
AuthorDate: Mon, 4 Jan 2010 21:14:41 -0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Mon, 4 Jan 2010 21:28:24 -0800

x86, apic: Don't waste a vector to improve vector spread

We want to use a vector-assignment sequence that avoids stumbling onto
0x80 earlier in the sequence, in order to improve the spread of
vectors across priority levels on machines with a small number of
interrupt sources. Right now, this is done by simply making the first
vector (0x31 or 0x41) completely unusable. This is unnecessary; all
we need is to start assignment at a +1 offset, we don't actually need
to prohibit the usage of this vector once we have wrapped around.

Signed-off-by: H. Peter Anvin <[email protected]>
LKML-Reference: <[email protected]>
---
arch/x86/include/asm/irq_vectors.h | 9 +++++----
arch/x86/kernel/apic/io_apic.c | 3 ++-
2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index dbc81ac..585a428 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -133,11 +133,12 @@
#define MCE_SELF_VECTOR 0xeb

/*
- * First APIC vector available to drivers: (vectors 0x30-0xee) we
- * start at 0x31 to spread out vectors evenly between priority
- * levels. (0x80 is the syscall vector)
+ * First APIC vector available to drivers: (vectors 0x30-0xee). We
+ * start allocating at 0x31 to spread out vectors evenly between
+ * priority levels. (0x80 is the syscall vector)
*/
-#define FIRST_DEVICE_VECTOR (IRQ15_VECTOR + 2)
+#define FIRST_DEVICE_VECTOR (IRQ15_VECTOR + 1)
+#define VECTOR_OFFSET_START 1

#define NR_VECTORS 256

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index d9cd1f1..e9ba090 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1162,7 +1162,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
* Also, we've got to be careful not to trash gate
* 0x80, because int 0x80 is hm, kind of importantish. ;)
*/
- static int current_vector = FIRST_DEVICE_VECTOR, current_offset = 0;
+ static int current_vector = FIRST_DEVICE_VECTOR + VECTOR_OFFSET_START;
+ static int current_offset = VECTOR_OFFSET_START % 8;
unsigned int old_vector;
int cpu, err;
cpumask_var_t tmp_mask;