We've run into a bit of a problem with a forthcoming system. The BIOS
reserves so many PCI bus numbers for hotplug when maxed out PCI expansion
box(es) are present that some arrays (mp_bus_id_to_node[],
mp_bus_id_to_pci_bus[], etc) overflow, splattering important variables.
They've been predicting the time when 256 PCI busses are not enough for some
time now. We've nearly hit that point. Meanwhile, here's a patch to boost
MAX_MP_BUSSES from 32 to the max, plus one more for ISA.
Some folks might think that this is excessive, but the BIOS is adding 50 bus
records to the tables just for a quad computer and one fully loaded PCI
expansion box. (You can buy extra PCI slots in units of 6.) The thing is
supposed to support 3 expansion boxes soon. The NUMA version of this, coming
out next year, has the potential to be even more extreme: 4 quads with 3
expansion boxes each == a whole lot of PCI slots, each of which needs its own
PCI bus number in case someone hotplugs in a quad ethernet card or some other
combo board with its own PCI bridge chip.
Marcello and Linus, please apply.
--- linux-2.4.16/include/asm-i386/mpspec.h Thu Nov 22 11:46:18 2001
+++ jamesc-2.4.16/include/asm-i386/mpspec.h Tue Dec 18 13:45:49 2001
@@ -184,13 +185,10 @@
* 7 2 CPU MCA+PCI
*/
-#ifdef CONFIG_MULTIQUAD
-#define MAX_IRQ_SOURCES 512
-#else /* !CONFIG_MULTIQUAD */
-#define MAX_IRQ_SOURCES 256
-#endif /* CONFIG_MULTIQUAD */
+#define MAX_MP_BUSSES 257 /* Need max PCI busses for hotplug + 1 for ISA. */
+ /* Four intrs per PCI slot. */
+#define MAX_IRQ_SOURCES (MAX_MP_BUSSES * 4)
-#define MAX_MP_BUSSES 32
enum mp_bustype {
MP_BUS_ISA = 1,
MP_BUS_EISA,
--
James Cleverdon, IBM xSeries Platform (NUMA), Beaverton
[email protected] (kmail) | [email protected] (Lotus Notes)
On Wed, 19 Dec 2001, James Cleverdon wrote:
>
> Marcello and Linus, please apply.
Can you give a rough description of what kinds of arrays this will impact,
just out of curiosity. Ie do we talk about "5kB more memory in order to
avoid problems", or are we talking about something noticeable..
Linus "too lazy to grep" Torvalds
On Wed, 19 Dec 2001, Linus Torvalds wrote:
> > Marcello and Linus, please apply.
>
> Can you give a rough description of what kinds of arrays this will
> impact, just out of curiosity. Ie do we talk about "5kB more memory in
> order to avoid problems", or are we talking about something
> noticeable..
>
> Linus "too lazy to grep" Torvalds
the change is OK, it's about +3K RAM used, on SMP kernels.
Ingo
[email protected] (Ingo Molnar) writes:
> On Wed, 19 Dec 2001, Linus Torvalds wrote:
>
> > > Marcello and Linus, please apply.
> >
> > Can you give a rough description of what kinds of arrays this will
> > impact, just out of curiosity. Ie do we talk about "5kB more memory in
> > order to avoid problems", or are we talking about something
> > noticeable..
> >
> > Linus "too lazy to grep" Torvalds
>
> the change is OK, it's about +3K RAM used, on SMP kernels.
hmm, I come more to 11K (most of it in mp_irqs)
... which could be easily allocated with the bootmem allocator at parse time.
-Andi
On Thursday 20 December 2001 09:03 am, you wrote:
> [email protected] (Ingo Molnar) writes:
> > On Wed, 19 Dec 2001, Linus Torvalds wrote:
> > > > Marcello and Linus, please apply.
> > >
> > > Can you give a rough description of what kinds of arrays this will
> > > impact, just out of curiosity. Ie do we talk about "5kB more memory in
> > > order to avoid problems", or are we talking about something
> > > noticeable..
> > >
> > > Linus "too lazy to grep" Torvalds
> >
> > the change is OK, it's about +3K RAM used, on SMP kernels.
>
> hmm, I come more to 11K (most of it in mp_irqs)
> ... which could be easily allocated with the bootmem allocator at parse
> time.
>
> -Andi
Thanks to all who replied. My rationale for simply increasing the size of
static arrays was to have a minimum impact on 2.4, as well as to make
something that Cannot Fail(TM). If you like, I could make one for 2.5 that
would do an initial scan of the MPS table, allocate the arrays using the
bootmem allocator, then go about its business as usual. (Special offer for a
limited time only! mpc_* array overflow checking added at NO EXTRA CHARGE!!
;^)
The catch with bootmem allocation is that it only allocates in pages (unless
wli's new bootmem allocator is adopted). So, expect some extra memory to be
lost to internal fragmentation anyway.
Another suggestion through private mail was to make MAX_MP_BUSSES a tunable
config parameter. I didn't know about that. Early boot stuff should work
without fuss, not rely on config tweaks. At the very least, I'd have to add
array overflow checking, because this crashes before the console is opened or
kdb is initialized. Silent crashes like that are bad news.
--
James Cleverdon, IBM xSeries Platform (NUMA), Beaverton
[email protected] | [email protected]
On Thu, Dec 20, 2001 at 11:29:32PM +0100, James Cleverdon wrote:
> Thanks to all who replied. My rationale for simply increasing the size of
> static arrays was to have a minimum impact on 2.4, as well as to make
> something that Cannot Fail(TM). If you like, I could make one for 2.5 that
> would do an initial scan of the MPS table, allocate the arrays using the
> bootmem allocator, then go about its business as usual. (Special offer for a
> limited time only! mpc_* array overflow checking added at NO EXTRA CHARGE!!
> ;^)
Shouldn't be that hard. In the worst case I can do it, but I don't have
hardware to test it properly.
>
> The catch with bootmem allocation is that it only allocates in pages (unless
> wli's new bootmem allocator is adopted). So, expect some extra memory to be
> lost to internal fragmentation anyway.
I don't think that's true, unless I'm misreading the code in
__alloc_bootmem_core badly. It may not be the most fragmentation avoiding
allocator in the world, but for linear allocations with no frees it shouldn't
waste space.
> Another suggestion through private mail was to make MAX_MP_BUSSES a tunable
> config parameter. I didn't know about that. Early boot stuff should work
> without fuss, not rely on config tweaks. At the very least, I'd have to add
> array overflow checking, because this crashes before the console is opened or
> kdb is initialized. Silent crashes like that are bad news.
I agree that it shouldn't be tunable.
-Andi
--
Life would be so much easier if we could just look at the source code.
Before trying to divorce the issues, I'd like to thank James for the
moral support. =)
On Fri, Dec 21, 2001 at 11:31:10AM +0100, Ingo Molnar wrote:
> yes, the current bootmem is a pretty effective linear allocator - you can
> even allocate 1 byte chunks without any space loss - and it's also a
> pretty simple allocator. During boot-time we allocate linearly in 99% of
> the cases. The only chance for some very minor loss is when the linear
> 'point of allocation' reaches a physical discontinuity and a bigger
> allocation has to skip over that point. For every such skipover (which
> skipover never happens with typical PCs, and it happens once with PCs that
> have lots of RAM) the 'loss' of RAM due to granularity is 2 KB. Plus
> deallocation can introduce fragmentation, the average cost for one
> fragment is 2KB. Almost no code frees bootmem (apart from the final
> freeing of bootmem), so this effect just does not happen on my box.
The page coalescing algorithm in the stock bootmem is designed only to
handle allocations in linear sequences. Which is probably the important
case. The extra generality in my code is something I'd like to call a bonus.
On Fri, Dec 21, 2001 at 11:31:10AM +0100, Ingo Molnar wrote:
> a tree structure will not avoid this 'skipping' loss either, in the
> majority of cases. (the basic reason for skipping is the need for a bigger
> chunk of RAM, and tree structure wont solve this.)
Essentially memory savings during boot-time reservations are minor if
present at all. They are seen, though. And your point here holds.
On Fri, Dec 21, 2001 at 11:31:10AM +0100, Ingo Molnar wrote:
> So i'm very doubtful about some of wli's claims about the RAM savings the
> tree changes bring - the 200 KB cited is i'd say physically impossible.
> The best-case would be 4 KB saved. And a tree data structure increases
> complexity significantly. (we had enough bugs in the current form of
> bootmem already.)
> I'd very much like if wli clarified this point - he knows his code, he
> should know the exact effect. In fact he should know the exact RAM
> allocation difference on his own box, with and without the tree-structure
> patch. It seems that this 'tree structure saves 200 KB' misinformation has
> stuck in people's mind to a certain degree. William?
200KB sounds extreme to me, and I don't recall mentioning that specific
number. The highest I've heard on i386 is 2MB, which was reported, but
almost implausible. ACPI on IA64 makes very heavy use of bootmem
allocations, which is where much larger space savings are seen, although
once again, compared to the total amount of system memory, it's very minor.
And I recall 3MB on IA64 Lion systems, though as I've been waiting for
feedback (even of this kind), I've not done tests there in a few weeks.
More than 4KB is seen. On i386 I have personally seen up to 12KB, and
see 8KB on my i386 machine at home without variation.
Essentially you have pointed out an error on my part, which is reporting
back a fairly implausible result that was reported to me without filtering
it for sanity.
On the other hand, I'm unclear on what tree-based bootmem has to do with
MAX_MP_BUSSES. Shall we take it to another thread? I'm interested in
discussing more about the tradeoffs involved with extent-based
representations of physical memory and where I can actively work to
simplify the interface to boot-time memory reservation, by whatever means.
And this has nothing to do with PCI busses.
Cheers,
Bill
On Fri, Dec 21, 2001 at 11:42:40AM +0100, Ingo Molnar wrote:
> William, could you please give us the amount of RAM saved on your box,
> with and without your tree-structure patch applied?
Please refer to my other messages regarding the true purposes of the
tree-based bootmem and so on.
Given that it's one architecture, a single very small machine, and one
kernel version, this is perhaps not as scientific as I'd like it to be,
but my personal computer (i386) at home shows the following results for
2.4.17-rc2 and my local patch set (plus or minus bootmem).
2.4.17-rc2-wli-bootmem:
Memory: 769332k/786368k available
(1389k kernel code, 16648k reserved, 1019k data, 224k init, 0k highmem)
2.4.17-rc2-wli:
Memory: 769340k/786368k available
(1389k kernel code, 16640k reserved, 1019k data, 216k init, 0k highmem)
After freeing init, the total memory is identical, not my expected 8KB:
--- bootmem.meminfo Fri Dec 21 03:33:44 2001
+++ nobootmem.meminfo Fri Dec 21 03:30:53 2001
@@ -1,18 +1,18 @@
total: used: free: shared: buffers: cached:
-Mem: 788025344 78618624 709406720 0 5652480 33247232
+Mem: 788025344 87261184 700764160 0 12656640 34500608
Swap: 518152192 0 518152192
MemTotal: 769556 kB
-MemFree: 692780 kB
+MemFree: 684340 kB
MemShared: 0 kB
-Buffers: 5520 kB
-Cached: 32468 kB
+Buffers: 12360 kB
+Cached: 33692 kB
SwapCached: 0 kB
-Active: 53784 kB
-Inact_dirty: 4748 kB
+Active: 61728 kB
+Inact_dirty: 4876 kB
Inact_clean: 0 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 769556 kB
-LowFree: 692780 kB
+LowFree: 684340 kB
SwapTotal: 506008 kB
SwapFree: 506008 kB
Cheers,
Bill
James,
Don't remove the "#ifdef CONFIG_MULTIQUAD" on your patch: Let the old max
(32) for non-multiquad machines.
Please resend me the patch this way.
On Wed, 19 Dec 2001, James Cleverdon wrote:
> We've run into a bit of a problem with a forthcoming system. The BIOS
> reserves so many PCI bus numbers for hotplug when maxed out PCI expansion
> box(es) are present that some arrays (mp_bus_id_to_node[],
> mp_bus_id_to_pci_bus[], etc) overflow, splattering important variables.
Back from holiday.
On Wednesday 26 December 2001 07:43 am, Marcelo Tosatti wrote:
> James,
>
> Don't remove the "#ifdef CONFIG_MULTIQUAD" on your patch: Let the old max
> (32) for non-multiquad machines.
There's a problem with that -- despite its name, CONFIG_MULTIQUAD is used for
the old NUMA-Q hardware. It turns on some memory mapped port I/O code that
doesn't have any purpose for other machines. The PCI bus overflow happens on
our new Foster-based boxes that may or may not contain multiple quad CPU
boards.
Still, CONFIG_MULTIQUAD is better than nothing. It just may take a little
bit of redefinition, so long as we can coax the various distros to build
their installation and working kernels with CONFIG_MULTIQUAD turned on....
> Please resend me the patch this way.
OK, what do you think about this:
diff -ru 2.4.17/include/asm-i386/mpspec.h
jamesc-2.4.17/include/asm-i386/mpspec.h
--- 2.4.17/include/asm-i386/mpspec.h Thu Nov 22 11:46:18 2001
+++ jamesc-2.4.17/include/asm-i386/mpspec.h Tue Jan 8 01:00:12 2002
@@ -185,12 +186,13 @@
*/
#ifdef CONFIG_MULTIQUAD
-#define MAX_IRQ_SOURCES 512
+#define MAX_MP_BUSSES 257 /* Need max PCI busses for hotplug + 1 for ISA. */
+#define MAX_IRQ_SOURCES (MAX_MP_BUSSES * 4) /* Four intrs per PCI slot. */
#else /* !CONFIG_MULTIQUAD */
+#define MAX_MP_BUSSES 32
#define MAX_IRQ_SOURCES 256
#endif /* CONFIG_MULTIQUAD */
-#define MAX_MP_BUSSES 32
enum mp_bustype {
MP_BUS_ISA = 1,
MP_BUS_EISA,
> On Wed, 19 Dec 2001, James Cleverdon wrote:
> > We've run into a bit of a problem with a forthcoming system. The BIOS
> > reserves so many PCI bus numbers for hotplug when maxed out PCI expansion
> > box(es) are present that some arrays (mp_bus_id_to_node[],
> > mp_bus_id_to_pci_bus[], etc) overflow, splattering important variables.
--
James Cleverdon, IBM xSeries Platform (NUMA), Beaverton
[email protected] | [email protected]
> There's a problem with that -- despite its name, CONFIG_MULTIQUAD is used for
> the old NUMA-Q hardware. It turns on some memory mapped port I/O code that
> doesn't have any purpose for other machines. The PCI bus overflow happens on
> our new Foster-based boxes that may or may not contain multiple quad CPU
> boards.
>
> Still, CONFIG_MULTIQUAD is better than nothing. It just may take a little
> bit of redefinition, so long as we can coax the various distros to build
> their installation and working kernels with CONFIG_MULTIQUAD turned on....
That's not a good idea. You're going to introduce extra switches to every port
IO path, and every IPI (for everybody, not just yourself). CONFIG_MULTIQUAD
is also used to alter the way that PCI config space writes are done (in later
patches). I suggest you use a different config option, or construct it dynamically.
Martin.
On Tuesday 08 January 2002 01:41 am, Martin J. Bligh wrote:
> > There's a problem with that -- despite its name, CONFIG_MULTIQUAD is used
> > for the old NUMA-Q hardware. It turns on some memory mapped port I/O
> > code that doesn't have any purpose for other machines. The PCI bus
> > overflow happens on our new Foster-based boxes that may or may not
> > contain multiple quad CPU boards.
> >
> > Still, CONFIG_MULTIQUAD is better than nothing. It just may take a
> > little bit of redefinition, so long as we can coax the various distros to
> > build their installation and working kernels with CONFIG_MULTIQUAD turned
> > on....
>
> That's not a good idea. You're going to introduce extra switches to every
> port IO path, and every IPI (for everybody, not just yourself).
> CONFIG_MULTIQUAD is also used to alter the way that PCI config space writes
> are done (in later patches). I suggest you use a different config option,
> or construct it dynamically.
>
> Martin.
OK, I've thrown together an attempt at dynamic table allocation that works on
the boxes with all the busses. For safety's sake I keep the static tables
and only reallocate if they overflow:
diff -ru 2.4.17/arch/i386/kernel/mpparse.c test/arch/i386/kernel/mpparse.c
--- 2.4.17/arch/i386/kernel/mpparse.c Fri Nov 9 17:58:18 2001
+++ test/arch/i386/kernel/mpparse.c Fri Jan 18 13:12:46 2002
@@ -35,16 +35,22 @@
* MP-table.
*/
int apic_version [MAX_APICS];
-int mp_bus_id_to_type [MAX_MP_BUSSES];
-int mp_bus_id_to_node [MAX_MP_BUSSES];
-int mp_bus_id_to_pci_bus [MAX_MP_BUSSES] = { [0 ... MAX_MP_BUSSES-1] = -1 };
+int static_mp_bus_id_to_type[MAX_MP_BUSSES];
+int static_mp_bus_id_to_node[MAX_MP_BUSSES];
+int static_mp_bus_id_to_pci_bus[MAX_MP_BUSSES] = { [0 ... MAX_MP_BUSSES-1] =
-1 };
+int *mp_bus_id_to_type = static_mp_bus_id_to_type;
+int *mp_bus_id_to_node = static_mp_bus_id_to_node;
+int *mp_bus_id_to_pci_bus = static_mp_bus_id_to_pci_bus;
+int max_mp_busses = MAX_MP_BUSSES;
+int max_irq_sources = MAX_IRQ_SOURCES;
int mp_current_pci_id;
/* I/O APIC entries */
struct mpc_config_ioapic mp_ioapics[MAX_IO_APICS];
/* # of MP IRQ source entries */
-struct mpc_config_intsrc mp_irqs[MAX_IRQ_SOURCES];
+struct mpc_config_intsrc static_mp_irqs[MAX_IRQ_SOURCES];
+struct mpc_config_intsrc *mp_irqs = static_mp_irqs;
/* MP IRQ source entries */
int mp_irq_entries;
@@ -219,6 +225,7 @@
if (m->mpc_apicid > MAX_APICS) {
printk("Processor #%d INVALID. (Max ID: %d).\n",
m->mpc_apicid, MAX_APICS);
+ --num_processors;
return;
}
ver = m->mpc_apicver;
@@ -296,7 +303,7 @@
m->mpc_irqtype, m->mpc_irqflag & 3,
(m->mpc_irqflag >> 2) & 3, m->mpc_srcbus,
m->mpc_srcbusirq, m->mpc_dstapic, m->mpc_dstirq);
- if (++mp_irq_entries == MAX_IRQ_SOURCES)
+ if (++mp_irq_entries == max_irq_sources)
panic("Max # of irq sources exceeded!!\n");
}
@@ -385,6 +392,7 @@
static int __init smp_read_mpc(struct mp_config_table *mpc)
{
+ int num_bus, num_irq;
char str[16];
int count=sizeof(*mpc);
unsigned char *mpt=((unsigned char *)mpc)+count;
@@ -434,6 +442,76 @@
}
/*
+ * Pre-scan to see if the bus and interrupt records are too big to fit
+ * in the static tables. If so, then alloc bigger ones.
+ */
+ {
+ int cnt = count;
+ unsigned char *mpp = mpt;
+ size_t size;
+
+ num_bus = 0;
+ num_irq = 0;
+ while (cnt < mpc->mpc_length) {
+ switch (*mpp) {
+ case MP_PROCESSOR:
+ size = sizeof(struct mpc_config_processor);
+ break;
+ case MP_BUS:
+ ++num_bus;
+ size = sizeof(struct mpc_config_bus);
+ break;
+ case MP_INTSRC:
+ ++num_irq;
+ size = sizeof(struct mpc_config_intsrc);
+ break;
+ case MP_IOAPIC:
+ size = sizeof(struct mpc_config_ioapic);
+ break;
+ case MP_LINTSRC:
+ size = sizeof(struct mpc_config_lintsrc);
+ break;
+ default:
+ cnt = mpc->mpc_length;
+ size = 0;
+ break;
+ }
+ mpp += size;
+ cnt += size;
+ }
+ if (num_bus > MAX_MP_BUSSES || num_irq >= MAX_IRQ_SOURCES) {
+ unsigned char *up;
+
+ /* Paranoia: Alloc one extra of each. */
+ max_mp_busses = ++num_bus;
+ if (num_irq < 4 * num_bus)
+ num_irq = 4 * num_bus; /* 4 intr/PCI slot */
+ max_irq_sources = ++num_irq;
+ num_bus *= sizeof(int);
+ cnt = 3 * num_bus;
+ num_irq *= sizeof(struct mpc_config_intsrc);
+ cnt += num_irq;
+ up = alloc_bootmem(cnt);
+ if (!up) {
+ printk("smp_read_mpc: Bus/Intr table realloc (%d "
+ "bytes) failed! MPS records will be lost!\n",
+ cnt);
+ } else {
+ mp_bus_id_to_type = (int *)up;
+ up += num_bus;
+ mp_bus_id_to_node = (int *)up;
+ up += num_bus;
+ mp_bus_id_to_pci_bus = (int *)up;
+ up += num_bus;
+ mp_irqs = (struct mpc_config_intsrc *)up;
+ memset(mp_bus_id_to_pci_bus, -1, num_bus);
+ }
+ }
+ num_bus = 0;
+ num_irq = 0;
+ }
+
+ /*
* Now process the configuration blocks.
*/
while (count < mpc->mpc_length) {
@@ -454,7 +532,11 @@
{
struct mpc_config_bus *m=
(struct mpc_config_bus *)mpt;
- MP_bus_info(m);
+ if (++num_bus < max_mp_busses)
+ MP_bus_info(m);
+ else
+ printk("Dropped bus #%d (%.6s)!\n",
+ m->mpc_busid, m->mpc_bustype);
mpt += sizeof(*m);
count += sizeof(*m);
break;
@@ -473,7 +555,12 @@
struct mpc_config_intsrc *m=
(struct mpc_config_intsrc *)mpt;
- MP_intsrc_info(m);
+ if (++num_irq < max_irq_sources)
+ MP_intsrc_info(m);
+ else
+ printk("Dropped intsrc! (%d, %d, %02x, %d)\n",
+ m->mpc_srcbus, m->mpc_srcbusirq,
+ m->mpc_dstapic, m->mpc_dstirq);
mpt+=sizeof(*m);
count+=sizeof(*m);
break;
diff -ru 2.4.17/include/asm-i386/io_apic.h test/include/asm-i386/io_apic.h
--- 2.4.17/include/asm-i386/io_apic.h Wed Jan 9 15:58:47 2002
+++ test/include/asm-i386/io_apic.h Fri Jan 18 13:10:31 2002
@@ -96,7 +96,7 @@
extern int mp_irq_entries;
/* MP IRQ source entries */
-extern struct mpc_config_intsrc mp_irqs[MAX_IRQ_SOURCES];
+extern struct mpc_config_intsrc *mp_irqs;
/* non-0 if default (table-less) MP configuration */
extern int mpc_default_type;
diff -ru 2.4.17/include/asm-i386/mpspec.h test/include/asm-i386/mpspec.h
--- 2.4.17/include/asm-i386/mpspec.h Wed Jan 9 15:58:47 2002
+++ test/include/asm-i386/mpspec.h Fri Jan 18 13:10:31 2002
@@ -197,8 +197,8 @@
MP_BUS_PCI,
MP_BUS_MCA
};
-extern int mp_bus_id_to_type [MAX_MP_BUSSES];
-extern int mp_bus_id_to_pci_bus [MAX_MP_BUSSES];
+extern int *mp_bus_id_to_type;
+extern int *mp_bus_id_to_pci_bus;
extern unsigned int boot_cpu_physical_apicid;
extern unsigned long phys_cpu_present_map;
@@ -207,11 +207,11 @@
extern void get_smp_config (void);
extern int nr_ioapics;
extern int apic_version [MAX_APICS];
-extern int mp_bus_id_to_type [MAX_MP_BUSSES];
+extern int *mp_bus_id_to_type;
extern int mp_irq_entries;
-extern struct mpc_config_intsrc mp_irqs [MAX_IRQ_SOURCES];
+extern struct mpc_config_intsrc *mp_irqs;
extern int mpc_default_type;
-extern int mp_bus_id_to_pci_bus [MAX_MP_BUSSES];
+extern int *mp_bus_id_to_pci_bus;
extern int mp_current_pci_id;
extern unsigned long mp_lapic_addr;
extern int pic_mode;
--
James Cleverdon, IBM xSeries Platform (NUMA), Beaverton
[email protected] | [email protected]