Currently on an SMP system we can theoretically support
NR_CPUS*224 irqs. Unfortunately our data structures
don't cope well with that many irqs, nor does hardware
typically provide that many irq sources.
With the number of cores starting to follow Moore's
Law, and the apicid limits being raised beyond an 8-bit
number, trying to track our current maximum with our
current data structures would be fatal and wasteful.
So this patch decouples the number of irqs we support
from the number of cpus. We can revisit this decision
once someone reworks the current data structures.
Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/Kconfig | 13 +++++++++++++
include/asm-x86_64/irq.h | 3 ++-
2 files changed, 15 insertions(+), 1 deletions(-)
diff --git a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig
index 7598d99..d744e5b 100644
--- a/arch/x86_64/Kconfig
+++ b/arch/x86_64/Kconfig
@@ -384,6 +384,19 @@ config NR_CPUS
This is purely to save memory - each supported CPU requires
memory in the static kernel configuration.
+config NR_IRQS
+ int "Maximum number of IRQs (224-4096)"
+ range 256 4096
+ depends on SMP
+ default "4096"
+ help
+ This allows you to specify the maximum number of IRQs which this
+ kernel will support. Current maximum is 4096 IRQs as that
+ is slightly larger than has been observed in the field.
+
+ This is purely to save memory - each supported IRQ requires
+ memory in the static kernel configuration.
+
config HOTPLUG_CPU
bool "Support for hot-pluggable CPUs (EXPERIMENTAL)"
depends on SMP && HOTPLUG && EXPERIMENTAL
diff --git a/include/asm-x86_64/irq.h b/include/asm-x86_64/irq.h
index 5006c6e..34b264a 100644
--- a/include/asm-x86_64/irq.h
+++ b/include/asm-x86_64/irq.h
@@ -31,7 +31,8 @@ #define NR_VECTORS 256
#define FIRST_SYSTEM_VECTOR 0xef /* duplicated in hw_irq.h */
-#define NR_IRQS (NR_VECTORS + (32 *NR_CPUS))
+/* We can use at most NR_CPUS*224 irqs at one time */
+#define NR_IRQS (CONFIG_NR_IRQS)
#define NR_IRQ_VECTORS NR_IRQS
static __inline__ int irq_canonicalize(int irq)
--
1.4.2.rc3.g7e18e
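For comparison, the formula being removed above ties NR_IRQS to NR_CPUS; with NR_CPUS=255 (the value Eric mentions later in the thread) it works out to roughly the "8K IRQS" figure that comes up below:

old NR_IRQS = NR_VECTORS + 32 * NR_CPUS
            = 256 + 32 * 255
            = 8416 irqs   (~8K)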
On Monday 07 August 2006 17:26, Eric W. Biederman wrote:
>
> Currrently on a SMP system we can theoretically support
> NR_CPUS*224 irqs. Unfortunately our data structures
> don't cope will with that many irqs, nor does hardware
> typically provide that many irq sources.
>
> With the number of cores starting to follow Moores
> law, and the apicid limits being raised beyond an 8bit
> number trying to track our current maximum with our
> current data structures would be fatal and wasteful.
>
> So this patch decouples the number of irqs we support
> from the number of cpus. We can revisit this decision
> once someone reworks the current data structures.
Ok. I was about to apply it, but it seems to require
-mm patches right now, so I didn't.
-Andi
On Mon, 07 Aug 2006 09:26:21 -0600 Eric W. Biederman wrote:
> Currrently on a SMP system we can theoretically support
> NR_CPUS*224 irqs. Unfortunately our data structures
> don't cope will with that many irqs, nor does hardware
> typically provide that many irq sources.
>
> With the number of cores starting to follow Moores
> law, and the apicid limits being raised beyond an 8bit
> number trying to track our current maximum with our
> current data structures would be fatal and wasteful.
>
> So this patch decouples the number of irqs we support
> from the number of cpus. We can revisit this decision
> once someone reworks the current data structures.
>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> arch/x86_64/Kconfig | 13 +++++++++++++
> include/asm-x86_64/irq.h | 3 ++-
> 2 files changed, 15 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig
> index 7598d99..d744e5b 100644
> --- a/arch/x86_64/Kconfig
> +++ b/arch/x86_64/Kconfig
> @@ -384,6 +384,19 @@ config NR_CPUS
> This is purely to save memory - each supported CPU requires
> memory in the static kernel configuration.
>
> +config NR_IRQS
> + int "Maximum number of IRQs (224-4096)"
> + range 256 4096
> + depends on SMP
> + default "4096"
> + help
> + This allows you to specify the maximum number of IRQs which this
> + kernel will support. Current maximum is 4096 IRQs as that
> + is slightly larger than has observed in the field.
> +
> + This is purely to save memory - each supported IRQ requires
> + memory in the static kernel configuration.
If (a) "nor does hardware typically provide that many irq sources"
and (b) "This is purely to save memory", why is the default
4096 instead of something smaller?
---
~Randy
Andi Kleen <[email protected]> writes:
> On Monday 07 August 2006 17:26, Eric W. Biederman wrote:
>>
>> Currrently on a SMP system we can theoretically support
>> NR_CPUS*224 irqs. Unfortunately our data structures
>> don't cope will with that many irqs, nor does hardware
>> typically provide that many irq sources.
>>
>> With the number of cores starting to follow Moores
>> law, and the apicid limits being raised beyond an 8bit
>> number trying to track our current maximum with our
>> current data structures would be fatal and wasteful.
>>
>> So this patch decouples the number of irqs we support
>> from the number of cpus. We can revisit this decision
>> once someone reworks the current data structures.
>
> Ok. I was about to apply it, but it seems to require
> mm patches right now, so i didn't
Right. This is post-2.6.18 material that is getting
the final bug fixes now. So it will be ready when 2.6.19 opens
up. Andi, I just wanted to make certain you saw it. :)
Eric
> > Currrently on a SMP system we can theoretically support
> > NR_CPUS*224 irqs. Unfortunately our data structures don't cope will
> > with that many irqs, nor does hardware typically provide that many irq
> > sources.
> >
> > With the number of cores starting to follow Moores law, and the apicid
> > limits being raised beyond an 8bit number trying to track our current
> > maximum with our current data structures would be fatal and wasteful.
> >
> > So this patch decouples the number of irqs we support from the number
> > of cpus. We can revisit this decision once someone reworks the
> > current data structures.
> >
> > Signed-off-by: Eric W. Biederman <[email protected]>
> > ---
> > arch/x86_64/Kconfig | 13 +++++++++++++
> > include/asm-x86_64/irq.h | 3 ++-
> > 2 files changed, 15 insertions(+), 1 deletions(-)
> >
> > diff --git a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig
> > index 7598d99..d744e5b 100644
> > --- a/arch/x86_64/Kconfig
> > +++ b/arch/x86_64/Kconfig
> > @@ -384,6 +384,19 @@ config NR_CPUS
> > This is purely to save memory - each supported CPU requires
> > memory in the static kernel configuration.
> >
> > +config NR_IRQS
> > + int "Maximum number of IRQs (224-4096)"
> > + range 256 4096
> > + depends on SMP
> > + default "4096"
> > + help
> > + This allows you to specify the maximum number of IRQs which this
> > + kernel will support. Current maximum is 4096 IRQs as that
> > + is slightly larger than has observed in the field.
> > +
> > + This is purely to save memory - each supported IRQ requires
> > + memory in the static kernel configuration.
>
> If (a) "nor does hardware typically provide that many irq sources"
> and (b) "This is purely to save memory", why is the default
> 4096 instead of something smaller?
>
4k being a humble maximum is definitely a relative term here, but on the
system with "only" 64 or 128 processors the cpu*224 would be much higher
:) However, maybe CONFIG_TINY that Andi suggested would leverage this
number also. What do you think, Eric?
--Natalie
> 4k being a humble maximum is definitely a relative term here, but on the
> system with "only" 64 or 128 processors the cpu*224 would be much higher
> :) However, maybe CONFIG_TINY that Andi suggested would leverage this
> number also. What do you think, Eric?
Best would be something dynamic - kernels should be self tuning, not
require that much CONFIG magic.
Just PCI hotplug gives me headaches with this.
Maybe we just need growable per CPU data.
-Andi
> > 4k being a humble maximum is definitely a relative term here, but on
> > the system with "only" 64 or 128 processors the cpu*224 would be much
> > higher :) However, maybe CONFIG_TINY that Andi suggested would
> > leverage this number also. What do you think, Eric?
>
> Best would be something dynamic - kernels should be self
> tuning, not require that much CONFIG magic.
>
> Just PCI hotplug gives me headaches with this.
>
> Maybe we just need growable per CPU data.
>
Yes, evaluating dynamically would be best... It should be ACPI's job, I
suppose, including accounting for all possible hot-plug controllers.
Unisys boxes have plenty of them; I can look into possible scenarios.
--Natalie
"Randy.Dunlap" <[email protected]> writes:
>> diff --git a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig
>> index 7598d99..d744e5b 100644
>> --- a/arch/x86_64/Kconfig
>> +++ b/arch/x86_64/Kconfig
>> @@ -384,6 +384,19 @@ config NR_CPUS
>> This is purely to save memory - each supported CPU requires
>> memory in the static kernel configuration.
>>
>> +config NR_IRQS
>> + int "Maximum number of IRQs (224-4096)"
>> + range 256 4096
>> + depends on SMP
>> + default "4096"
>> + help
>> + This allows you to specify the maximum number of IRQs which this
>> + kernel will support. Current maximum is 4096 IRQs as that
>> + is slightly larger than has observed in the field.
>> +
>> + This is purely to save memory - each supported IRQ requires
>> + memory in the static kernel configuration.
>
> If (a) "nor does hardware typically provide that many irq sources"
> and (b) "This is purely to save memory", why is the default
> 4096 instead of something smaller?
a) Because I would like to flush out bugs.
b) Because I want a default that works for everyone.
c) Because with MSI we have a potential for large irq counts on most systems.
d) Because anyone who disagrees with me can send a patch and fix
the default.
e) Because with the default number of cpus we come very close to needing
this many irqs in the worst case.
f) This is much better than before my patch, where setting NR_CPUS=255
got you 8K IRQs.
g) Because I probably should have been more inventive than copying the
NR_IRQS text, but when I did the wording sounded ok to me.
Eric
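For context, the NR_IRQS-sized objects behind the memory concern look roughly like this in a 2.6.18-era tree (paraphrased, not verbatim, from kernel/irq/handle.c and include/linux/kernel_stat.h):

struct irq_desc irq_desc[NR_IRQS];              /* one static descriptor per irq */

struct kernel_stat {
        struct cpu_usage_stat cpustat;
        unsigned int irqs[NR_IRQS];             /* per-irq counters */
};
DECLARE_PER_CPU(struct kernel_stat, kstat);     /* one copy per possible cpu */

So NR_IRQS sizes both a static array and a per cpu array, which is why the counter cost discussed below also grows with NR_CPUS.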
Andi Kleen <[email protected]> writes:
>> 4k being a humble maximum is definitely a relative term here, but on the
>> system with "only" 64 or 128 processors the cpu*224 would be much higher
>> :) However, maybe CONFIG_TINY that Andi suggested would leverage this
>> number also. What do you think, Eric?
>
> Best would be something dynamic - kernels should be self tuning, not
> require that much CONFIG magic.
I agree. That is the way things should be.
> Just PCI hotplug gives me headaches with this.
>
> Maybe we just need growable per CPU data.
This would require a growable NR_IRQS to fix, something
that we don't have a good handle on at all. But at least
much less code cares.
If we killed the counters for each pair of cpu and irq this would
not involve the per cpu area at all. But we still have the one
static array of irqs. That will be more fun to get rid of.
Eric
Eric W. Biederman wrote:
>
> a) Because I would like to flush out bugs.
> b) Because I want a default that works for everyone.
> c) Because with MSI we have a potential for large irq counts on most systems.
> d) Because anyone who disagrees with me can send a patch and fix
> the default.
> e) Because with the default number of cpus we can very close to needing
> this many irqs in the worst case.
> f) This is much better than previous to my patch and setting NR_CPUS=255
> and getting 8K IRQS.
> g) Because I probably should have been more inventive than copying the
> NR_IRQS text, but when I did the wording sounded ok to me.
>
Why not simply reserve 224*NR_CPUS IRQs? If you have 256 CPUs allocating
64K IRQs should hardly matter :)
-hpa
Currently on an SMP system we can theoretically support
NR_CPUS*224 irqs. Unfortunately our data structures
don't cope well with that many irqs, nor does hardware
typically provide that many irq sources.
With the number of cores starting to follow Moore's Law,
and the apicid limits being raised beyond an 8-bit
number, trying to track our current maximum with our
current data structures would be fatal and wasteful.
So this patch decouples the number of irqs we support
from the number of cpus. We can revisit this decision
once someone reworks the current data structures.
This version has my stupid typos fixed and the true maximum
exposed to make it clear that I have a low default. The
worst that I can see happening is there won't be any
per_cpu space left for modules if someone sets this
too high, but the system should still boot.
Signed-off-by: Eric W. Biederman <[email protected]>
---
This of course applies to the -mm tree because the rest
of the irq work is not yet in the mainline kernel.
arch/x86_64/Kconfig | 14 ++++++++++++++
include/asm-x86_64/irq.h | 3 ++-
2 files changed, 16 insertions(+), 1 deletions(-)
diff --git a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig
index 7598d99..cea78d7 100644
--- a/arch/x86_64/Kconfig
+++ b/arch/x86_64/Kconfig
@@ -384,6 +384,20 @@ config NR_CPUS
This is purely to save memory - each supported CPU requires
memory in the static kernel configuration.
+config NR_IRQS
+ int "Maximum number of IRQs (224-57344)"
+ range 224 57344
+ depends on SMP
+ default "4096"
+ help
+ This allows you to specify the maximum number of IRQs which this
+ kernel will support. Current default is 4096 IRQs as that
+ is slightly larger than has been observed in the field. Setting
+ a noticeably larger value will exhaust your per cpu memory,
+ and waste memory in the per irq arrays.
+
+ If unsure leave this at 4096.
+
config HOTPLUG_CPU
bool "Support for hot-pluggable CPUs (EXPERIMENTAL)"
depends on SMP && HOTPLUG && EXPERIMENTAL
diff --git a/include/asm-x86_64/irq.h b/include/asm-x86_64/irq.h
index 5006c6e..34b264a 100644
--- a/include/asm-x86_64/irq.h
+++ b/include/asm-x86_64/irq.h
@@ -31,7 +31,8 @@ #define NR_VECTORS 256
#define FIRST_SYSTEM_VECTOR 0xef /* duplicated in hw_irq.h */
-#define NR_IRQS (NR_VECTORS + (32 *NR_CPUS))
+/* We can use at most NR_CPUS*224 irqs at one time */
+#define NR_IRQS (CONFIG_NR_IRQS)
#define NR_IRQ_VECTORS NR_IRQS
static __inline__ int irq_canonicalize(int irq)
--
1.4.2.rc3.g7e18e
"H. Peter Anvin" <[email protected]> writes:
> Eric W. Biederman wrote:
>> a) Because I would like to flush out bugs.
>> b) Because I want a default that works for everyone.
>> c) Because with MSI we have a potential for large irq counts on most systems.
>> d) Because anyone who disagrees with me can send a patch and fix
>> the default.
>> e) Because with the default number of cpus we can very close to needing
>> this many irqs in the worst case.
>> f) This is much better than previous to my patch and setting NR_CPUS=255
>> and getting 8K IRQS.
>> g) Because I probably should have been more inventive than copying the
>> NR_IRQS text, but when I did the wording sounded ok to me.
>>
>
> Why not simply reserve 224*NR_CPUS IRQs? If you have 256 CPUs allocating 64K
> IRQs should hardly matter :)
Well, there is this little matter of 224*NR_CPUS*NR_CPUS counters at that point,
which I think would be prohibitive for most sane people: 224K of per cpu
memory in each of 256 different per cpu areas.
Still, what is 56MB when you have a terabyte of RAM? :)
Eric
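For reference, the arithmetic behind those figures (assuming 4-byte counters and NR_CPUS=256):

NR_IRQS              = 224 * 256          = 57344 irqs
counters on each cpu = 57344 * 4 bytes    = 224 KB of per cpu memory
total                = 224 KB * 256 cpus  = 56 MB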
Eric W. Biederman wrote:
> "H. Peter Anvin" <[email protected]> writes:
>
>> Eric W. Biederman wrote:
>>> a) Because I would like to flush out bugs.
>>> b) Because I want a default that works for everyone.
>>> c) Because with MSI we have a potential for large irq counts on most systems.
>>> d) Because anyone who disagrees with me can send a patch and fix
>>> the default.
>>> e) Because with the default number of cpus we can very close to needing
>>> this many irqs in the worst case.
>>> f) This is much better than previous to my patch and setting NR_CPUS=255
>>> and getting 8K IRQS.
>>> g) Because I probably should have been more inventive than copying the
>>> NR_IRQS text, but when I did the wording sounded ok to me.
>>>
>> Why not simply reserve 224*NR_CPUS IRQs? If you have 256 CPUs allocating 64K
>> IRQs should hardly matter :)
>
> Well there is this little matter of 224*NR_CPUS*NR_CPUS counters at that point
> that I think would be prohibitive for most sane people. Taking 224K of per cpu
> memory in 256 different per cpu areas.
>
> Still what is 56MB when you have a terrabyte of RAM. :)
>
However, 99.99% of all systems have 16 or fewer CPU cores. Your solution
with its proposed default eats more memory for any system with fewer
than 19 CPUs.
Furthermore, you don't need 224*NR_CPUS*NR_CPUS counters. If an IRQ is
only mapped into one CPU's space it can only be taken on that CPU, thus
you only need 224*NR_CPUS counters.
-hpa
On Mon, 07 Aug 2006 11:30:24 -0600 Eric W. Biederman wrote:
> Currently on a SMP system we can theoretically support
> NR_CPUS*224 irqs. Unfortunately our data structures
> don't cope will with that many irqs, nor does hardware
> typically provide that many irq sources.
>
> With the number of cores starting to follow Moore's Law,
> and the apicid limits being raised beyond an 8bit
> number trying to track our current maximum with our
> current data structures would be fatal and wasteful.
>
> So this patch decouples the number of irqs we support
> from the number of cpus. We can revisit this decision
> once someone reworks the current data structures.
>
> This version has my stupid typos fix and the true maximum
> exposed to make it clear that I have a low default. The
> worst that I can see happening is there won't be any
> per_cpu space left for modules if someone sets this
> too high, but the system should still boot.
>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
>
> This of course applies to the -mm tree because the rest
> of the irq work is not yet in the mainline kernel.
>
> arch/x86_64/Kconfig | 14 ++++++++++++++
> include/asm-x86_64/irq.h | 3 ++-
> 2 files changed, 16 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig
> index 7598d99..cea78d7 100644
> --- a/arch/x86_64/Kconfig
> +++ b/arch/x86_64/Kconfig
> @@ -384,6 +384,20 @@ config NR_CPUS
> This is purely to save memory - each supported CPU requires
> memory in the static kernel configuration.
Thanks for the language fixes.
I'm confused about one thing. What is NR_IRQS for non-SMP?
Does it default to 4096 or something else?
Does this build on non-SMP? Is CONFIG_NR_IRQS defined for non-SMP?
> +config NR_IRQS
> + int "Maximum number of IRQs (224-57344)"
> + range 224 57344
> + depends on SMP
> + default "4096"
> + help
> + This allows you to specify the maximum number of IRQs which this
> + kernel will support. Current default is 4096 IRQs as that
> + is slightly larger than has observed in the field. Setting
> + a noticeably larger value will exhaust your per cpu memory,
> + and waste memory in the per irq arrays.
> +
> + If unsure leave this at 4096.
> +
> config HOTPLUG_CPU
> bool "Support for hot-pluggable CPUs (EXPERIMENTAL)"
> depends on SMP && HOTPLUG && EXPERIMENTAL
> diff --git a/include/asm-x86_64/irq.h b/include/asm-x86_64/irq.h
> index 5006c6e..34b264a 100644
> --- a/include/asm-x86_64/irq.h
> +++ b/include/asm-x86_64/irq.h
> @@ -31,7 +31,8 @@ #define NR_VECTORS 256
>
> #define FIRST_SYSTEM_VECTOR 0xef /* duplicated in hw_irq.h */
>
> -#define NR_IRQS (NR_VECTORS + (32 *NR_CPUS))
> +/* We can use at most NR_CPUS*224 irqs at one time */
> +#define NR_IRQS (CONFIG_NR_IRQS)
> #define NR_IRQ_VECTORS NR_IRQS
>
> static __inline__ int irq_canonicalize(int irq)
> --
---
~Randy
"Randy.Dunlap" <[email protected]> writes:
>>
>> This of course applies to the -mm tree because the rest
>> of the irq work is not yet in the mainline kernel.
>>
>> arch/x86_64/Kconfig | 14 ++++++++++++++
>> include/asm-x86_64/irq.h | 3 ++-
>> 2 files changed, 16 insertions(+), 1 deletions(-)
>>
>> diff --git a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig
>> index 7598d99..cea78d7 100644
>> --- a/arch/x86_64/Kconfig
>> +++ b/arch/x86_64/Kconfig
>> @@ -384,6 +384,20 @@ config NR_CPUS
>> This is purely to save memory - each supported CPU requires
>> memory in the static kernel configuration.
>
> Thanks for the language fixes.
> I'm confused about one thing. What is NR_IRQS for non-SMP?
> Does it default to 4096 or something else?
Right, the default is still 4096, which is fairly silly.
> Does this build on non-SMP? Is CONFIG_NR_IRQS defined for non-SMP?
Ugh. I have a "depends on SMP" line in there. That shouldn't be there.
Ok, I need to fix the non-SMP case: at least take out the dependency,
and if I'm clever, move the default down to 224.
Eric
Currently on an SMP system we can theoretically support
NR_CPUS*224 irqs. Unfortunately our data structures
don't cope well with that many irqs, nor does hardware
typically provide that many irq sources.
With the number of cores starting to follow Moore's Law,
and the apicid limits being raised beyond an 8-bit
number, trying to track our current maximum with our
current data structures would be fatal and wasteful.
So this patch decouples the number of irqs we support
from the number of cpus. We can revisit this decision
once someone reworks the current data structures.
This version has my stupid typos fixed and the true maximum
exposed to make it clear that I have a low default. The
worst that I can see happening is there won't be any
per_cpu space left for modules if someone sets this
too high, but the system should still boot.
For non-SMP systems the default is set to 224 IRQs.
Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/Kconfig | 14 ++++++++++++++
include/asm-x86_64/irq.h | 3 ++-
2 files changed, 16 insertions(+), 1 deletions(-)
diff --git a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig
index 7598d99..c87b0bc 100644
--- a/arch/x86_64/Kconfig
+++ b/arch/x86_64/Kconfig
@@ -384,6 +384,20 @@ config NR_CPUS
This is purely to save memory - each supported CPU requires
memory in the static kernel configuration.
+config NR_IRQS
+ int "Maximum number of IRQs (224-57344)"
+ range 224 57344
+ default "4096" if SMP
+ default "224" if !SMP
+ help
+ This allows you to specify the maximum number of IRQs which this
+ kernel will support. Current default is 4096 IRQs as that
+ is slightly larger than has been observed in the field. Setting
+ a noticeably larger value will exhaust your per cpu memory,
+ and waste memory in the per irq arrays.
+
+ If unsure leave this at the default.
+
config HOTPLUG_CPU
bool "Support for hot-pluggable CPUs (EXPERIMENTAL)"
depends on SMP && HOTPLUG && EXPERIMENTAL
diff --git a/include/asm-x86_64/irq.h b/include/asm-x86_64/irq.h
index 5006c6e..34b264a 100644
--- a/include/asm-x86_64/irq.h
+++ b/include/asm-x86_64/irq.h
@@ -31,7 +31,8 @@ #define NR_VECTORS 256
#define FIRST_SYSTEM_VECTOR 0xef /* duplicated in hw_irq.h */
-#define NR_IRQS (NR_VECTORS + (32 *NR_CPUS))
+/* We can use at most NR_CPUS*224 irqs at one time */
+#define NR_IRQS (CONFIG_NR_IRQS)
#define NR_IRQ_VECTORS NR_IRQS
static __inline__ int irq_canonicalize(int irq)
--
1.4.2.rc3.g7e18e
On Mon, 07 Aug 2006 12:53:35 -0600 Eric W. Biederman wrote:
>
> Currently on a SMP system we can theoretically support
> NR_CPUS*224 irqs. Unfortunately our data structures
> don't cope will with that many irqs, nor does hardware
> typically provide that many irq sources.
>
> With the number of cores starting to follow Moore's Law,
> and the apicid limits being raised beyond an 8bit
> number trying to track our current maximum with our
> current data structures would be fatal and wasteful.
>
> So this patch decouples the number of irqs we support
> from the number of cpus. We can revisit this decision
> once someone reworks the current data structures.
>
> This version has my stupid typos fix and the true maximum
> exposed to make it clear that I have a low default. The
> worst that I can see happening is there won't be any
> per_cpu space left for modules if someone sets this
> too high, but the system should still boot.
>
> For non-SMP systems the default is set to 224 IRQs.
>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> arch/x86_64/Kconfig | 14 ++++++++++++++
> include/asm-x86_64/irq.h | 3 ++-
> 2 files changed, 16 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig
> index 7598d99..c87b0bc 100644
> --- a/arch/x86_64/Kconfig
> +++ b/arch/x86_64/Kconfig
> @@ -384,6 +384,20 @@ config NR_CPUS
> This is purely to save memory - each supported CPU requires
> memory in the static kernel configuration.
>
> +config NR_IRQS
> + int "Maximum number of IRQs (224-57344)"
> + range 224 57344
> + default "4096" if SMP
> + default "224" if !SMP
> + help
> + This allows you to specify the maximum number of IRQs which this
> + kernel will support. Current default is 4096 IRQs as that
> + is slightly larger than has observed in the field. Setting
> + a noticeably larger value will exhaust your per cpu memory,
> + and waste memory in the per irq arrays.
If you'll fix this text for the non-SMP case too, I think
you'll be done. :)
> + If unsure leave this at the default.
> +
> config HOTPLUG_CPU
> bool "Support for hot-pluggable CPUs (EXPERIMENTAL)"
> depends on SMP && HOTPLUG && EXPERIMENTAL
> diff --git a/include/asm-x86_64/irq.h b/include/asm-x86_64/irq.h
> index 5006c6e..34b264a 100644
> --- a/include/asm-x86_64/irq.h
> +++ b/include/asm-x86_64/irq.h
> @@ -31,7 +31,8 @@ #define NR_VECTORS 256
>
> #define FIRST_SYSTEM_VECTOR 0xef /* duplicated in hw_irq.h */
>
> -#define NR_IRQS (NR_VECTORS + (32 *NR_CPUS))
> +/* We can use at most NR_CPUS*224 irqs at one time */
> +#define NR_IRQS (CONFIG_NR_IRQS)
> #define NR_IRQ_VECTORS NR_IRQS
>
> static __inline__ int irq_canonicalize(int irq)
> --
---
~Randy
On Mon, Aug 07, 2006 at 12:53:35PM -0600, Eric W. Biederman wrote:
>...
> --- a/arch/x86_64/Kconfig
> +++ b/arch/x86_64/Kconfig
> @@ -384,6 +384,20 @@ config NR_CPUS
> This is purely to save memory - each supported CPU requires
> memory in the static kernel configuration.
>
> +config NR_IRQS
> + int "Maximum number of IRQs (224-57344)"
int "Maximum number of IRQs (224-57344)" depends on SMP
This way, people with SMP=n will not see this question.
> + range 224 57344
> + default "4096" if SMP
> + default "224" if !SMP
Why not always
default "224"
?
> + help
>...
cu
Adrian
--
Gentoo kernels are 42 times more popular than SUSE kernels among
KLive users (a service by SUSE contractor Andrea Arcangeli that
gathers data about kernels from many users worldwide).
There are three kinds of lies: Lies, Damn Lies, and Statistics.
Benjamin Disraeli
Currently on an SMP system we can theoretically support
NR_CPUS*224 irqs. Unfortunately our data structures
don't cope well with that many irqs, nor does hardware
typically provide that many irq sources.
With the number of cores starting to follow Moore's Law,
and the apicid limits being raised beyond an 8-bit
number, trying to track our current maximum with our
current data structures would be fatal and wasteful.
So this patch decouples the number of irqs we support
from the number of cpus. We can revisit this decision
once someone reworks the current data structures.
This version has my stupid typos fixed and the true maximum
exposed to make it clear that I have a low default. The
worst that I can see happening is there won't be any
per_cpu space left for modules if someone sets this
too high, but the system should still boot.
For non-SMP systems the default is now set to 224 IRQs.
The description has been reworded in an attempt
to make it clear what this option controls.
Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/Kconfig | 25 +++++++++++++++++++++++++
include/asm-x86_64/irq.h | 3 ++-
2 files changed, 27 insertions(+), 1 deletions(-)
diff --git a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig
index 7598d99..adcbb21 100644
--- a/arch/x86_64/Kconfig
+++ b/arch/x86_64/Kconfig
@@ -384,6 +384,31 @@ config NR_CPUS
This is purely to save memory - each supported CPU requires
memory in the static kernel configuration.
+config NR_IRQS
+ int "Maximum number of IRQs (224-57344)"
+ range 224 57344
+ default "4096" if SMP
+ default "224" if !SMP
+ help
+ This option allows you to specify the maximum number of interrupt
+ sources your kernel will support. Architecturally there are
+ 224 interrupt destinations per cpu, so setting to a higher value
+ can be wasteful.
+
+ Many machines have irq controllers with unconnected interrupt
+ pins, leading to unused irq numbers in the kernel. Since a
+ destination is not assigned to an unused interrupt source
+ it can be reasonable to support more interrupt sources than
+ there are destinations to receive them.
+
+ The current recommended value is 4096 as it is slightly more irqs
+ than any known machine and still small enough to have a
+ reasonable memory consumption. Setting a noticeably larger value
+ will exhaust your per cpu memory, and waste memory in the per irq
+ arrays.
+
+ If unsure leave this at the default.
+
config HOTPLUG_CPU
bool "Support for hot-pluggable CPUs (EXPERIMENTAL)"
depends on SMP && HOTPLUG && EXPERIMENTAL
diff --git a/include/asm-x86_64/irq.h b/include/asm-x86_64/irq.h
index 5006c6e..b0f6460 100644
--- a/include/asm-x86_64/irq.h
+++ b/include/asm-x86_64/irq.h
@@ -31,7 +31,8 @@ #define NR_VECTORS 256
#define FIRST_SYSTEM_VECTOR 0xef /* duplicated in hw_irq.h */
-#define NR_IRQS (NR_VECTORS + (32 *NR_CPUS))
+/* We can be setup to receive at most NR_CPUS*224 irqs simultaneously */
+#define NR_IRQS (CONFIG_NR_IRQS)
#define NR_IRQ_VECTORS NR_IRQS
static __inline__ int irq_canonicalize(int irq)
--
1.4.2.rc3.g7e18e
Adrian Bunk <[email protected]> writes:
> On Mon, Aug 07, 2006 at 12:53:35PM -0600, Eric W. Biederman wrote:
>>...
>> --- a/arch/x86_64/Kconfig
>> +++ b/arch/x86_64/Kconfig
>> @@ -384,6 +384,20 @@ config NR_CPUS
>> This is purely to save memory - each supported CPU requires
>> memory in the static kernel configuration.
>>
>> +config NR_IRQS
>> + int "Maximum number of IRQs (224-57344)"
>
> int "Maximum number of IRQs (224-57344)" depends on SMP
>
> This way, people with SMP=n will not see this question.
I doubt it will be interesting, but it might be; it is certainly
well defined what happens when you have more irqs than a cpu
has irq destinations.
>> + range 224 57344
>> + default "4096" if SMP
>> + default "224" if !SMP
>
> Why not always
> default "224"
> ?
A couple of reasons.
- Things still need shaking out at the > 256 irq level and since
this is going into -mm it is reasonable to have a large default.
- It is silly to have a default that won't work on some hardware,
that we can support without unreasonable overhead.
- There are major simplicity gains to be had from a slightly sparse
irq space.
- I haven't a clue what the irq numbers look like in the real world
that we should be supporting since there was code in x86_64 and
i386 to hack them up terribly. All I have a clue about are
the really big machines. So I wouldn't be surprised if there
were some small but I/O heavy machines that found 224 too limiting.
I know of at least one uniprocessor machine that would have used
almost all 224 irqs.
- I want people to realize that we can easily have more than 256 irqs.
With pure software interrupt sources and networking drivers allocating
one irq per cpu, the chances of us using our maximum allotment of irqs
are much higher in the next couple of years.
- 4096 is the number I expect distribution vendors will ship. Why set
a different default than what you expect most people will use?
Eric
On Mon, Aug 07, 2006 at 04:26:07PM -0600, Eric W. Biederman wrote:
> Adrian Bunk <[email protected]> writes:
>
> > On Mon, Aug 07, 2006 at 12:53:35PM -0600, Eric W. Biederman wrote:
> >>...
> >> --- a/arch/x86_64/Kconfig
> >> +++ b/arch/x86_64/Kconfig
> >> @@ -384,6 +384,20 @@ config NR_CPUS
> >> This is purely to save memory - each supported CPU requires
> >> memory in the static kernel configuration.
> >>
> >> +config NR_IRQS
> >> + int "Maximum number of IRQs (224-57344)"
> >
> > int "Maximum number of IRQs (224-57344)" depends on SMP
> >
> > This way, people with SMP=n will not see this question.
>
> I doubt it will be interesting but it might be, it is certainly
> well defined what happens when you have more irqs that a cpu
> has irq destinations.
The only effect of the user visibility of this option with SMP=n is that
the user might choose a higher value, resulting in wasted space.
We already have a large number of options; there's no reason for
showing more than required.
> >> + range 224 57344
> >> + default "4096" if SMP
> >> + default "224" if !SMP
> >
> > Why not always
> > default "224"
> > ?
>
> A couple of reasons.
> - Things still need shaking out at the > 256 irq level and since
> this is going into -mm it is reasonable to have a large default.
For -mm, it might make even more sense to default to 57344, to get
better testing coverage.
This would give good feedback on both space usage problems and the
timing behavior of all the
for (i = 0; i < NR_IRQS; i++)
loops in the kernel.
> - It is silly to have a default that won't work on some hardware,
> that we can support without unreasonable overhead.
So let's default NR_CPUS to 255 and NR_IRQS to 57344?
It's also silly if the defaults waste space (and time) for the majority
of users.
> - There are major simplicity gains to be had from a slight sparse
> irq space.
>
> - I haven't a clue what the irq numbers look like in the real world
> that we should be supporting since there was code in x86_64 and
> i386 to hack them up terribly. All I have a clue about are
> the really big machines. So I wouldn't be surprised if there
> were some small but I/O heavy machines that found 224 too limiting.
> I know of at least one uniprocessor machine that would have used
> almost all 224 irqs.
>
> - I want people to realize that we can easily have more than 256 irqs.
> With pure software interrupt sources and networking drivers allocating
> one irq per cpu the chances of us using our maximum allotment of irqs
> is much more likely in the next couple of years.
The common x86_64 SMP machine for sale today is a dual-core desktop.
More than 224 IRQs is an exceptional case, and we can expect people
building such systems to know what they are doing.
> - 4096 is the number I expect distribution vendors will ship. Why set
> a different default than what you expect most people will use?
Defaults are not for distributions.
Distribution maintainers are expected to set such options not based
on the defaults but based on the needs of their users.
Depending on whether the distribution targets only desktop users, or
whether the primary target is big servers, 4096 might be wrong in
either direction.
> Eric
cu
Adrian
On Mon, 07 Aug 2006 16:10:14 -0600
[email protected] (Eric W. Biederman) wrote:
> +/* We can be setup to receive at most NR_CPUS*224 irqs simultaneously */
> +#define NR_IRQS (CONFIG_NR_IRQS)
We know that setting this high can cause machines to run out of per-cpu
memory, so we're handing people a foot blowing-off tool here.
And it's a pretty nasty one because it can get people into the situation
where the kernel worked fine for those who released it, but users who
happen to load more modules (or the right combination of them) will
experience per-cpu memory exhaustion.
So shouldn't we be scaling the per-cpu memory as well?
If so, I'd suggest that we special-case that huge kstat structure. We can
calculate its size exactly, so how about we do:
#define SIZE_OF_KSTAT_THING <complicated expression>
#define PERCPU_ENOUGH_ROOM 32768
#define PERCPU_ENOUGH_ROOM_WHICH_WE_REALLY_USE \
PERCPU_ENOUGH_ROOM + SIZE_OF_KSTAT_THING
?
(And as it's a critical managed resource, I'm thinking that we should be
adding some /proc reporting of the per-cpu memory consumption..)
>
> And it's a pretty nasty one because it can get people into the situation
> where the kernel worked fine for those who released it, but users who
> happen to load more modules (or the right combination of them) will
> experience per-cpu memory exhaustion.
Yes, and a high value will waste a lot of memory for normal users.
> So shouldn't we being scaling the per-cpu memory as well?
If we move it into vmalloc space it would be easy to extend at runtime - just the
virtual address space would need to be prereserved, but then more pages
could be mapped. Maybe we should just do that instead of continuing to kludge around?
Drawback would be some more TLB misses.
-Andi
On Tue, 8 Aug 2006 04:17:59 +0200
Andi Kleen <[email protected]> wrote:
>
> >
> > And it's a pretty nasty one because it can get people into the situation
> > where the kernel worked fine for those who released it, but users who
> > happen to load more modules (or the right combination of them) will
> > experience per-cpu memory exhaustion.
>
> Yes, and a high value will waste a lot of memory for normal users.
>
> > So shouldn't we being scaling the per-cpu memory as well?
>
> If we move it into vmalloc space it would be easy to extend at runtime - just the
> virtual address space would need to be prereserved, but then more pages
> could be mapped. Maybe we should just do that instead of continuing to kludge around?
Sounds sane.
otoh, we need something for 2.6.19.
> Drawback would be some more TLB misses.
yup. On some (important) architectures - I'm not sure which architectures
do the bigpage-for-kernel trick.
On Mon, 2006-08-07 at 19:41 -0700, Andrew Morton wrote:
> On Tue, 8 Aug 2006 04:17:59 +0200
> Andi Kleen <[email protected]> wrote:
>
> >
> > >
> > > And it's a pretty nasty one because it can get people into the situation
> > > where the kernel worked fine for those who released it, but users who
> > > happen to load more modules (or the right combination of them) will
> > > experience per-cpu memory exhaustion.
> >
> > Yes, and a high value will waste a lot of memory for normal users.
> >
> > > So shouldn't we being scaling the per-cpu memory as well?
> >
> > If we move it into vmalloc space it would be easy to extend at runtime - just the
> > virtual address space would need to be prereserved, but then more pages
> > could be mapped. Maybe we should just do that instead of continuing to kludge around?
>
> Sounds sane.
>
> otoh, we need something for 2.6.19.
>
> > Drawback would be some more TLB misses.
>
> yup. On some (important) architectures - I'm not sure which architectures
> do the bigpage-for-kernel trick.
Also, most of the architectures that do bigpage-for-kernel stuff only
have a very limited number of TLB entries for bigpages (usually 2 to 4)
while they have many more entries for normal pages, so it's not
automatically worse. (Now if these data structures are close to the
kernel text or so, then yes, there's sharing, but with the sorting of
kernel text that's a lot less true already.)
Andrew Morton writes:
> > Drawback would be some more TLB misses.
>
> yup. On some (important) architectures - I'm not sure which architectures
> do the bigpage-for-kernel trick.
I looked at optimizing the per-cpu data accessors on PowerPC and only
ever saw fractions of a percent change in overall performance, which
says to me that we don't actually use per-cpu data all that much. So
unless you make per-cpu data really really slow, I doubt that we'll
see any significant performance difference.
Paul.
On Tuesday 08 August 2006 07:09, Paul Mackerras wrote:
> Andrew Morton writes:
[adding linux-arch; talking about doing extensible per cpu areas
by prereserving virtual space and then later fill it up as needed]
> > > Drawback would be some more TLB misses.
> >
> > yup. On some (important) architectures - I'm not sure which architectures
> > do the bigpage-for-kernel trick.
>
> I looked at optimizing the per-cpu data accessors on PowerPC and only
> ever saw fractions of a percent change in overall performance, which
> says to me that we don't actually use per-cpu data all that much. So
> unless you make per-cpu data really really slow, I doubt that we'll
> see any significant performance difference.
The main problem is that we would need a "vmalloc reserve first; allocate pages
later" interface. On x86 it would be easy by just splitting up vmalloc/vmap a bit
again. Does anybody else see problems with implementing that on any
other architecture?
This wouldn't be truly demand paged, just pages initialized on allocation.
-Andi
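The kernel-side interface does not exist yet, but the idea is the same one userspace can already express with mmap: reserve a large virtual range up front, then commit pages only when they are needed. A runnable userspace analogy (illustration only, not kernel code):

#include <sys/mman.h>

#define RESERVE_SIZE (64UL << 20)       /* reserve 64MB of address space */

int main(void)
{
        /* reserve virtual space only; no backing pages are committed yet */
        char *base = mmap(NULL, RESERVE_SIZE, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED)
                return 1;
        /* later: make the first two pages usable, leave the rest reserved */
        if (mprotect(base, 2 * 4096, PROT_READ | PROT_WRITE))
                return 1;
        base[0] = 1;                    /* pages are faulted in on first touch */
        return 0;
}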
Now for a completely different but trivial approach.
I just boot tested it with 255 CPUs and everything worked.
Currently we know at compile time about everything (except
module data) that we place in the per cpu area. So instead
of allocating a fixed size for the per_cpu area, allocate the
number of bytes we need plus a fixed constant to be used for
modules.
It isn't perfect but it is much less of a pain to
work with than what we are doing now.
Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/kernel/setup64.c | 7 ++-----
include/asm-x86_64/percpu.h | 10 ++++++++++
2 files changed, 12 insertions(+), 5 deletions(-)
diff --git a/arch/x86_64/kernel/setup64.c b/arch/x86_64/kernel/setup64.c
index 0cd3694..94336cf 100644
--- a/arch/x86_64/kernel/setup64.c
+++ b/arch/x86_64/kernel/setup64.c
@@ -95,12 +95,9 @@ #ifdef CONFIG_HOTPLUG_CPU
#endif
/* Copy section for each CPU (we discard the original) */
- size = ALIGN(__per_cpu_end - __per_cpu_start, SMP_CACHE_BYTES);
-#ifdef CONFIG_MODULES
- if (size < PERCPU_ENOUGH_ROOM)
- size = PERCPU_ENOUGH_ROOM;
-#endif
+ size = PERCPU_ENOUGH_ROOM;
+ printk(KERN_INFO "PERCPU: Allocating %d bytes of per cpu data\n", size);
for_each_cpu_mask (i, cpu_possible_map) {
char *ptr;
diff --git a/include/asm-x86_64/percpu.h b/include/asm-x86_64/percpu.h
index 08dd9f9..39d2bab 100644
--- a/include/asm-x86_64/percpu.h
+++ b/include/asm-x86_64/percpu.h
@@ -11,6 +11,16 @@ #ifdef CONFIG_SMP
#include <asm/pda.h>
+#ifdef CONFIG_MODULES
+# define PERCPU_MODULE_RESERVE 8192
+#else
+# define PERCPU_MODULE_RESERVE 0
+#endif
+
+#define PERCPU_ENOUGH_ROOM \
+ (ALIGN(__per_cpu_end - __per_cpu_start, SMP_CACHE_BYTES) + \
+ PERCPU_MODULE_RESERVE)
+
#define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
#define __my_cpu_offset() read_pda(data_offset)
--
1.4.2.rc3.g7e18e
On Mon, 07 Aug 2006 23:47:23 -0600
[email protected] (Eric W. Biederman) wrote:
> +#ifdef CONFIG_MODULES
> +# define PERCPU_MODULE_RESERVE 8192
> +#else
> +# define PERCPU_MODULE_RESERVE 0
> +#endif
> +
> +#define PERCPU_ENOUGH_ROOM \
> + (ALIGN(__per_cpu_end - __per_cpu_start, SMP_CACHE_BYTES) + \
> + PERCPU_MODULE_RESERVE)
> +
Seems sane, but isn't 8192 a bit small?
On Tuesday 08 August 2006 07:47, Eric W. Biederman wrote:
>
> Now for a completely different but trivial approach.
> I just boot tested it with 255 CPUS and everything worked.
>
> Currently everything (except module data) we place in
> the per cpu area we know about at compile time. So
> instead of allocating a fixed size for the per_cpu area
> allocate the number of bytes we need plus a fixed constant
> for to be used for modules.
>
> It isn't perfect but it is much less of a pain to
> work with than what we are doing now.
Yes makes sense.
However, not that particular patch - I already changed that
code in my tree because I needed really early per cpu data for something,
and I had switched to using a static array for cpu0's cpudata.
I will modify it to work like your proposal.
-Andi
Andrew Morton <[email protected]> writes:
> On Mon, 07 Aug 2006 23:47:23 -0600
> [email protected] (Eric W. Biederman) wrote:
>
>> +#ifdef CONFIG_MODULES
>> +# define PERCPU_MODULE_RESERVE 8192
>> +#else
>> +# define PERCPU_MODULE_RESERVE 0
>> +#endif
>> +
>> +#define PERCPU_ENOUGH_ROOM \
>> + (ALIGN(__per_cpu_end - __per_cpu_start, SMP_CACHE_BYTES) + \
>> + PERCPU_MODULE_RESERVE)
>> +
>
> Seems sane, but isn't 8192 a bit small?
By my measure 8K is about 1/2KB less than what we have free in
2.6.18-rc3. So it looks like a good initial guess to me.
Eric
Andi Kleen <[email protected]> writes:
> On Tuesday 08 August 2006 07:47, Eric W. Biederman wrote:
>>
>> Now for a completely different but trivial approach.
>> I just boot tested it with 255 CPUS and everything worked.
>>
>> Currently everything (except module data) we place in
>> the per cpu area we know about at compile time. So
>> instead of allocating a fixed size for the per_cpu area
>> allocate the number of bytes we need plus a fixed constant
>> for to be used for modules.
>>
>> It isn't perfect but it is much less of a pain to
>> work with than what we are doing now.
>
> Yes makes sense.
>
> However not that particular patch - i already changed that
> code in my tree because I needed really early per cpu for something and
> i had switched to using a static array for cpu0's cpudata.
>
> I will modify it to work like your proposal.
Sounds good to me.
Eric
> >
> > However not that particular patch - i already changed that
> > code in my tree because I needed really early per cpu for something and
> > i had switched to using a static array for cpu0's cpudata.
> >
> > I will modify it to work like your proposal.
>
> Sounds good to me.
Actually I ended up going with your patch and dropping mine
because of some other issues, and I solved the problem
that caused me to do the other one in a different way.
-Andi
Andi Kleen <[email protected]> writes:
>> >
>> > However not that particular patch - i already changed that
>> > code in my tree because I needed really early per cpu for something and
>> > i had switched to using a static array for cpu0's cpudata.
>> >
>> > I will modify it to work like your proposal.
>>
>> Sounds good to me.
>
> Actually i ended up going with your patch and dropping mine
> because of some other issues and I solved the problem
> that caused me to do the other in a different way.
Ok.
Since this is the agreed-upon path, Andrew, can you please pick
this patch up for the next -mm release?
Then the final practical question: does it still make sense to decouple
NR_IRQS from NR_CPUS, as my other patch was doing?
Eric
On Tue, 2006-08-08 at 07:14 +0200, Andi Kleen wrote:
> > > > Drawback would be some more TLB misses.
> > >
> > > yup. On some (important) architectures - I'm not sure which architectures
> > > do the bigpage-for-kernel trick.
> >
> > I looked at optimizing the per-cpu data accessors on PowerPC and only
> > ever saw fractions of a percent change in overall performance, which
> > says to me that we don't actually use per-cpu data all that much. So
> > unless you make per-cpu data really really slow, I doubt that we'll
> > see any significant performance difference.
>
> The main problem is that we would need a "vmalloc reserve first; allocate pages
> later" interface. On x86 it would be easy by just splitting up vmalloc/vmap a bit
> again. Does anybody else see problems with implementing that on any
> other architecture?
"vmalloc reserve first; allocate pages later" would be a really nice
feature. We could use this on s390 to implement the virtual mem_map
array spanning the whole 64 bit address range (with holes in it). To
make it perfect, a "deallocate pages; keep vmalloc reserve" should be
added; then we could free parts of the mem_map array again on hot memory
remove.
I don't see a problem for s390.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
On Tue, Aug 08, 2006 at 10:17:53AM +0200, Martin Schwidefsky wrote:
> "vmalloc reserve first; allocate pages later" would be a really nice
> feature. We could use this on s390 to implement the virtual mem_map
> array spanning the whole 64 bit address range (with holes in it). To
> make it perfect a "deallocate pages; keep vmalloc reserve" should be
> added, then we could free parts of the mem_map array again on hot memory
> remove.
IA-64 already has some arch. specific code to allocate a sparse
virtual memory map ... having generic code to do so would be
nice, but I foresee some chicken&egg problems in getting enough
of the vmalloc/vmap framework up & running before mem_map[] has
been allocated.
That and the hotplug memory folks don't like the virtual mem_map
code and have spurned it in favour of SPARSE.
-Tony
On Wed, 2006-08-09 at 10:58 -0700, Luck, Tony wrote:
> On Tue, Aug 08, 2006 at 10:17:53AM +0200, Martin Schwidefsky wrote:
> > "vmalloc reserve first; allocate pages later" would be a really nice
> > feature. We could use this on s390 to implement the virtual mem_map
> > array spanning the whole 64 bit address range (with holes in it). To
> > make it perfect a "deallocate pages; keep vmalloc reserve" should be
> > added, then we could free parts of the mem_map array again on hot memory
> > remove.
Martin,
We can already do this partial freeing today with sparsemem and memory
hot-remove. It would be a shame to go have to do another implementation
for each an every architecture that wants to do it.
For the very sparse 64-bit address spaces, I would be really interested
to see an alternate pfn_to_section_nr() that relies on something other
than a direct correlation between physical address and section number.
Instead of:
#define pfn_to_section_nr(pfn) ((pfn) >> PFN_SECTION_SHIFT)
We could do:
static inline unsigned long pfn_to_section_nr(unsigned long pfn)
{
return some_hash(pfn) % NR_OF_SECTION_SLOTS;
}
This would, of course, still have limits on how _many_ sections can be
populated. But, it would remove the relationship on what the actual
physical address ranges can be from the number of populated sections.
Of course, it isn't quite that simple. You need to make sure that the
sparse code is clean from all connections between section number and
physical address, as well as handling things like hash collisions. We'd
probably also need to store the _actual_ physical address somewhere
because we can't get it from the section number any more.
But, Andy and I have talked about this kind of thing from the beginning
of sparsemem, so I hope the code is amenable to change like this.
-- Dave
P.S. With sparsemem extreme, I think you can cover an entire 64-bits of
address space with a 4GB top-level table. If one more level of tables
was added, we'd be down to (I think) an 8MB table. So, that might be an
option, too.
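To make the collision problem concrete, here is one illustrative way the hashed lookup sketched above could record which section actually owns a slot. This is purely a sketch: mem_section_slot, SECTION_HASH_BITS, NR_SECTION_SLOTS and pfn_to_section_slot are made-up names, not existing sparsemem code.

#include <linux/hash.h>

#define SECTION_HASH_BITS 12
#define NR_SECTION_SLOTS (1UL << SECTION_HASH_BITS)

struct mem_section_slot {
        unsigned long base_nr;          /* section number that owns this slot */
        struct mem_section section;
};
static struct mem_section_slot section_slots[NR_SECTION_SLOTS];

static inline unsigned long pfn_to_section_slot(unsigned long pfn)
{
        unsigned long nr = pfn >> PFN_SECTION_SHIFT;
        unsigned long slot = hash_long(nr, SECTION_HASH_BITS);

        /* a real implementation must check section_slots[slot].base_nr == nr
         * and probe further (or chain) on a collision */
        return slot;
}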
On Wed, 2006-08-09 at 11:25 -0700, Dave Hansen wrote:
> Instead of:
>
> #define pfn_to_section_nr(pfn) ((pfn) >> PFN_SECTION_SHIFT)
>
> We could do:
>
> static inline unsigned long pfn_to_section_nr(unsigned long pfn)
> {
> return some_hash(pfn) % NR_OF_SECTION_SLOTS;
> }
>
> This would, of course, still have limits on how _many_ sections can be
> populated. But, it would remove the relationship on what the actual
> physical address ranges can be from the number of populated sections.
>
> Of course, it isn't quite that simple. You need to make sure that the
> sparse code is clean from all connections between section number and
> physical address, as well as handling things like hash collisions. We'd
> probably also need to store the _actual_ physical address somewhere
> because we can't get it from the section number any more.
You have to deal with the hash collisions somehow, for example with a
list of pages that have the same hash. And you have to calculate the
hash value. Both hurt performance.
> P.S. With sparsemem extreme, I think you can cover an entire 64-bits of
> address space with a 4GB top-level table. If one more level of tables
> was added, we'd be down to (I think) an 8MB table. So, that might be an
> option, too.
On s390 we have to prepare for the situation of an address space that
has a chunk of memory at the low end and another chunk with bit 2^63
set. So the mem_map array needs to cover the whole 64 bit address range.
For sparsemem, we can choose the size of the mem_map sections and
how many indirections the lookup table should have. Some examples:
1) flat mem_map array: 2^52 entries, 56 bytes each.
2) mem_map sections with 256 entries / 14KB for each section,
1 indirection level, 2^44 indirection pointers, 128TB overhead
3) mem_map sections with 256 entries / 14KB for each section,
2 indirection levels, 2^22 indirection pointers for each level,
32MB for each indirection array, minimum 64MB overhead
4) mem_map sections with 256 entries / 14KB for each section,
3 indirection levels, 2^15/2^15/2^14 indirection pointers,
256K/256K/128K indirection arrays, minimum 640K overhead
5) mem_map sections with 1024 entries / 56KB for each section,
3 indirection levels, 2^14/2^14/2^14 indirection pointers,
128K/128K/128K indirection arrays, minimum 384KB overhead
2 levels of indirection result in a large memory overhead.
For 3 levels of indirection the memory overhead is ok, but each lookup
has to walk 3 indirections. This adds cpu cycles to access the mem_map
array.
The alternative of a flat mem_map array in vmalloc space is much more
attractive. The size of the array is 2^52*56 bytes, about 1.3% of the
virtual address space. The access doesn't change: an array gets accessed,
and the access gets automatically cached by the hardware.
Simple, straightforward, no additional overhead. Only the setup of the
kernel page tables for the mem_map vmalloc area needs some thought.
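For reference, the 1.3% figure follows from 4KB pages and 56-byte struct page entries (both assumed here, matching the numbers above):

pages covered   = 2^64 / 2^12          = 2^52 mem_map entries
mem_map size    = 2^52 * 56 bytes
fraction of VA  = (2^52 * 56) / 2^64   = 56 / 4096  ~= 1.37%, i.e. the ~1.3% quoted above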
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
Martin Schwidefsky wrote:
> On Wed, 2006-08-09 at 11:25 -0700, Dave Hansen wrote:
>> Instead of:
>>
>> #define pfn_to_section_nr(pfn) ((pfn) >> PFN_SECTION_SHIFT)
>>
>> We could do:
>>
>> static inline unsigned long pfn_to_section_nr(unsigned long pfn)
>> {
>> return some_hash(pfn) % NR_OF_SECTION_SLOTS;
>> }
>>
>> This would, of course, still have limits on how _many_ sections can be
>> populated. But, it would remove the relationship on what the actual
>> physical address ranges can be from the number of populated sections.
>>
>> Of course, it isn't quite that simple. You need to make sure that the
>> sparse code is clean from all connections between section number and
>> physical address, as well as handling things like hash collisions. We'd
>> probably also need to store the _actual_ physical address somewhere
>> because we can't get it from the section number any more.
>
> You have to deal with the hash collisions somehow, for example with a
> list of pages that have the same hash. And you have to calculate the
> hash value. Both hurts performance.
>
>> P.S. With sparsemem extreme, I think you can cover an entire 64-bits of
>> address space with a 4GB top-level table. If one more level of tables
>> was added, we'd be down to (I think) an 8MB table. So, that might be an
>> option, too.
>
> On s390 we have to prepare for the situation of an address space that
> has a chunk of memory at the low end and another chunk with bit 2^63
> set. So the mem_map array needs to cover the whole 64 bit address range.
> For sparsemem, we can choose on the size of the mem_map sections and on
> how many indirections the lookup table should have. Some examples:
>
> 1) flat mem_map array: 2^52 entries, 56 bytes each.
> 2) mem_map sections with 256 entries / 14KB for each section,
> 1 indirection level, 2^44 indirection pointers, 128TB overhead
> 3) mem_map sections with 256 entries / 14KB for each section,
> 2 indirection levels, 2^22 indirection pointers for each level,
> 32MB for each indirection array, minimum 64MB overhead
> 4) mem_map sections with 256 entries / 14KB for each section,
> 3 indirection levels, 2^15/2^15/2^14 indirection pointers,
> 256K/256K/128K indirection arrays, minimum 640K overhead
> 5) mem_map sections with 1024 entries / 56KB for each section,
> 3 indirection levels, 2^14/2^14/2^14 indirection pointers,
> 128K/128K/128K indirection arrays, minimum 384KB overhead
>
> 2 levels of indirection results in large overhead in regard to memory.
> For 3 levels of indirection the memory overhead is ok, but each lookup
> has to walk 3 indirections. This adds cpu cycles to access the mem_map
> array.
>
> The alternative of a flat mem_map array in vmalloc space is much more
> attractive. The size of the array is 2^52*56 Byte. 1,3% of the virtual
> address space. The access doesn't change, an array gets accessed. The
> access gets automatically cached by the hardware.
> Simple, straightforward, no additional overhead. Only the setup of the
> kernel page tables for the mem_map vmalloc area needs some thought.
>
Well, you could do something more fun with the top of the address. You
don't need to keep the bytes in the same order, for instance. If this is
really a fair-sized chunk at the bottom and one at the top, then taking
the address and swapping the bytes like:
ABCDEFGH => BCDAEFGH
would be a pretty trivial bit of register wibbling (i.e. very quick), but
would probably mean a single flat, smaller sparsemem table would cover
all likely areas.
-apw
On Thu, 2006-08-10 at 15:40 +0100, Andy Whitcroft wrote:
> Well you could do something more fun with the top of the address. You
> don't need to keep the bytes in the same order for instance. If this
> is really a fair size chunk at the bottom and one at the top then
> taking the address and swapping the bytes like:
>
> ABCDEFGH => BCDAEFGH
>
> Would be a pretty trivial bit of register wibbling (ie very quick),
> but would probabally mean a single flat, smaller sparsemem table would
> cover all likely areas.
Not if you don't know where the objects will be mapped..
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.