Hi Tomonori-san:
I have a big box (64 threads, 256GB memory) that is crashing early in
boot as below. I bisected it down to f4780ca0 ("x86: Move swiotlb
initialization before dma32_free_bootmem"); reverting just this commit
from the latest git (3ea6b3d0 is what I tested) fixes things.
I haven't tried to debug this yet, but I guess on such a huge box there
is not enough memory below 4GB for swiotlb if we don't free the DMA32
stuff allocated earlier? I don't know why that would be, since the
bootmem is grabbing 512MB and I have pretty close to 4GB below 4GB.
Anyway, I'm going to go to bed soon, but if you need more information or
have anything you want me to try, I will do it tomorrow morning.
Thanks,
Roland
Zone PFN ranges:
DMA 0x00000001 -> 0x00001000
DMA32 0x00001000 -> 0x00100000
Normal 0x00100000 -> 0x04080000
Movable zone start PFN for each node
early_node_map[3] active PFN ranges
0: 0x00000001 -> 0x0000009b
0: 0x00000100 -> 0x00078c74
0: 0x00100000 -> 0x04080000
... snip ...
PERCPU: Embedded 29 pages/cpu @ffff880172400000 s90008 r8192 d20584 u131072
pcpu-alloc: s90008 r8192 d20584 u131072 alloc=1*2097152
pcpu-alloc: [0] 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15
pcpu-alloc: [0] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
pcpu-alloc: [0] 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
pcpu-alloc: [0] 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 66154133
Policy zone: Normal
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-2.6.30-2-amd64 root=UUID=c32babd2-b320-48e5-bdd4-350d659b07a5 ro console=ttyS0,115200n8 earlyprintk=ttyS0,115200
PID hash table entries: 4096 (order: 3, 32768 bytes)
bootmem alloc of 67108864 bytes failed!
Kernel panic - not syncing: Out of memory
Pid: 0, comm: swapper Not tainted 2.6.32 #19
Call Trace:
[<ffffffff81342c82>] ? panic+0x86/0x141
[<ffffffff816f0e00>] ? ___alloc_bootmem_nopanic+0x89/0xc7
[<ffffffff813477b6>] ? _etext+0x0/0x38d84a
[<ffffffff816f0e6d>] ? ___alloc_bootmem_node+0x0/0x59
[<ffffffff816f98a7>] ? swiotlb_init_with_default_size+0x3f/0x126
[<ffffffff816e6669>] ? pci_swiotlb_init+0x50/0x63
[<ffffffff816d5140>] ? early_idt_handler+0x0/0x71
[<ffffffff816dbd2a>] ? pci_iommu_alloc+0xb/0x73
[<ffffffff816d5140>] ? early_idt_handler+0x0/0x71
[<ffffffff816e9e70>] ? mem_init+0x15/0xe5
[<ffffffff816d5ad2>] ? start_kernel+0x1bd/0x39e
[<ffffffff816d53b2>] ? x86_64_start_kernel+0xf9/0x106
On Mon, 14 Dec 2009 23:47:07 -0800
Roland Dreier <[email protected]> wrote:
> I have a big box (64 threads, 256GB memory) that is crashing early in
> boot as below. I bisected it down to f4780ca0 ("x86: Move swiotlb
> initialization before dma32_free_bootmem"); reverting just this commit
> from the latest git (3ea6b3d0 is what I tested) fixes things.
Ah, really sorry about that.
> I haven't tried to debug this yet, but I guess on such a huge box there
> is not enough memory below 4GB for swiotlb if we don't free the
Yeah, Yinghai also hit this (his box has more memory than yours).
> stuff allocated earlier? I don't know why that would be, since the
> bootmem is grabbing 512MB and I have pretty close to 4GB below 4GB.
> Anyway, I'm going to go to bed soon, but if you need more information or
> have anything you want me to try, I will do it tomorrow morning.
http://www.kernel.org/pub/linux/kernel/people/tomo/misc/0001-x86-two-stage-swiotlb-initialization.patch
It splits the swiotlb initialization into two stages. I don't like it
much, since I would rather avoid complicating the initialization.
dma32_reserve_bootmem() allocates 128MB for broken GART IOMMU but I
think 64MB should be enough since broken GART IOMMU allocates
64MB. The following simple patch might work too because swiotlb uses
64MB.
With huge-memory boxes coming, we might need to work on the ZONE_DMA32
shortage issue anyway (sparse-vmemmap, anything else?).
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 75e14e2..fbe7154 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -67,7 +67,7 @@ EXPORT_SYMBOL(dma_set_mask);
#ifdef CONFIG_X86_64
static __initdata void *dma32_bootmem_ptr;
-static unsigned long dma32_bootmem_size __initdata = (128ULL<<20);
+static unsigned long dma32_bootmem_size __initdata = (64ULL<<20);
static int __init parse_dma32_size_opt(char *p)
{
FUJITA Tomonori wrote:
> On Mon, 14 Dec 2009 23:47:07 -0800
> Roland Dreier <[email protected]> wrote:
>
>> I have a big box (64 threads, 256GB memory) that is crashing early in
>> boot as below. I bisected it down to f4780ca0 ("x86: Move swiotlb
>> initialization before dma32_free_bootmem"); reverting just this commit
>> from the latest git (3ea6b3d0 is what I tested) fixes things.
>
> Ah, really sorry about that.
>
>
>> I haven't tried to debug this yet, but I guess on such a huge box there
>> is not enough memory below 4GB for swiotlb if we don't free the
>
> Yeah, Yinghai also hit this (his box has more memory than yours).
>
>
>> stuff allocated earlier? I don't know why that would be, since the
>> bootmem is grabbing 512MB and I have pretty close to 4GB below 4GB.
>> Anyway, I'm going to go to bed soon, but if you need more information or
>> have anything you want me to try, I will do it tomorrow morning.
>
> http://www.kernel.org/pub/linux/kernel/people/tomo/misc/0001-x86-two-stage-swiotlb-initialization.patch
>
> It makes the swiotlb initialization into two stages. I don't like it
> much since I like to avoid complicating the initialization.
>
> dma32_reserve_bootmem() allocates 128MB for broken GART IOMMU but I
> think 64MB should be enough since broken GART IOMMU allocates
> 64MB. The following simple patch might work too because swiotlb uses
> 64MB.
>
> With coming huge memory boxes, we might need to work on ZONE_DMA32
> shortage issue anyway (sparse-vmemmap, anything else)?
>
>
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index 75e14e2..fbe7154 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -67,7 +67,7 @@ EXPORT_SYMBOL(dma_set_mask);
>
> #ifdef CONFIG_X86_64
> static __initdata void *dma32_bootmem_ptr;
> -static unsigned long dma32_bootmem_size __initdata = (128ULL<<20);
> +static unsigned long dma32_bootmem_size __initdata = (64ULL<<20);
>
> static int __init parse_dma32_size_opt(char *p)
> {
static __initdata void *dma32_bootmem_ptr;
static unsigned long dma32_bootmem_size __initdata = (128ULL<<20);
static int __init parse_dma32_size_opt(char *p)
{
if (!p)
return -EINVAL;
dma32_bootmem_size = memparse(p, &p);
return 0;
}
early_param("dma32_size", parse_dma32_size_opt);
dma32_size is a kernel command line option, so the user can adjust that.
YH
FUJITA Tomonori wrote:
> On Mon, 14 Dec 2009 23:47:07 -0800
> Roland Dreier <[email protected]> wrote:
>
>> I have a big box (64 threads, 256GB memory) that is crashing early in
>> boot as below. I bisected it down to f4780ca0 ("x86: Move swiotlb
>> initialization before dma32_free_bootmem"); reverting just this commit
>> from the latest git (3ea6b3d0 is what I tested) fixes things.
>
> Ah, really sorry about that.
>
>
>> I haven't tried to debug this yet, but I guess on such a huge box there
>> is not enough memory below 4GB for swiotlb if we don't free the
>
> Yeah, Yinghai also hit this (his box has more memory than yours).
>
>
>> stuff allocated earlier? I don't know why that would be, since the
>> bootmem is grabbing 512MB and I have pretty close to 4GB below 4GB.
>> Anyway, I'm going to go to bed soon, but if you need more information or
>> have anything you want me to try, I will do it tomorrow morning.
>
> http://www.kernel.org/pub/linux/kernel/people/tomo/misc/0001-x86-two-stage-swiotlb-initialization.patch
>
> It makes the swiotlb initialization into two stages. I don't like it
> much since I like to avoid complicating the initialization.
>
> dma32_reserve_bootmem() allocates 128MB for broken GART IOMMU but I
> think 64MB should be enough since broken GART IOMMU allocates
> 64MB. The following simple patch might work too because swiotlb uses
> 64MB.
>
> With coming huge memory boxes, we might need to work on ZONE_DMA32
> shortage issue anyway (sparse-vmemmap, anything else)?
Maybe just revert f4780ca0 for now?
Actually, dma32_free_bootmem() also makes sure that pci_swiotlb_init() gets some buffer.
YH
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index fcc2f2b..afcc58b 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -120,14 +120,11 @@ static void __init dma32_free_bootmem(void)
void __init pci_iommu_alloc(void)
{
- int use_swiotlb;
-
- use_swiotlb = pci_swiotlb_init();
#ifdef CONFIG_X86_64
/* free the range so iommu could get some range less than 4G */
dma32_free_bootmem();
#endif
- if (use_swiotlb)
+ if (pci_swiotlb_init())
return;
gart_iommu_hole_init();
On Tue, 15 Dec 2009 01:04:10 -0800
Yinghai Lu <[email protected]> wrote:
> FUJITA Tomonori wrote:
> > On Mon, 14 Dec 2009 23:47:07 -0800
> > Roland Dreier <[email protected]> wrote:
> >
> >> I have a big box (64 threads, 256GB memory) that is crashing early in
> >> boot as below. I bisected it down to f4780ca0 ("x86: Move swiotlb
> >> initialization before dma32_free_bootmem"); reverting just this commit
> >> from the latest git (3ea6b3d0 is what I tested) fixes things.
> >
> > Ah, really sorry about that.
> >
> >
> >> I haven't tried to debug this yet, but I guess on such a huge box there
> >> is not enough memory below 4GB for swiotlb if we don't free the
> >
> > Yeah, Yinghai also hit this (his box has more memory than yours).
> >
> >
> >> stuff allocated earlier? I don't know why that would be, since the
> >> bootmem is grabbing 512MB and I have pretty close to 4GB below 4GB.
> >> Anyway, I'm going to go to bed soon, but if you need more information or
> >> have anything you want me to try, I will do it tomorrow morning.
> >
> > http://www.kernel.org/pub/linux/kernel/people/tomo/misc/0001-x86-two-stage-swiotlb-initialization.patch
> >
> > It makes the swiotlb initialization into two stages. I don't like it
> > much since I like to avoid complicating the initialization.
> >
> > dma32_reserve_bootmem() allocates 128MB for broken GART IOMMU but I
> > think 64MB should be enough since broken GART IOMMU allocates
> > 64MB. The following simple patch might work too because swiotlb uses
> > 64MB.
> >
> > With coming huge memory boxes, we might need to work on ZONE_DMA32
> > shortage issue anyway (sparse-vmemmap, anything else)?
> >
> >
> > diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> > index 75e14e2..fbe7154 100644
> > --- a/arch/x86/kernel/pci-dma.c
> > +++ b/arch/x86/kernel/pci-dma.c
> > @@ -67,7 +67,7 @@ EXPORT_SYMBOL(dma_set_mask);
> >
> > #ifdef CONFIG_X86_64
> > static __initdata void *dma32_bootmem_ptr;
> > -static unsigned long dma32_bootmem_size __initdata = (128ULL<<20);
> > +static unsigned long dma32_bootmem_size __initdata = (64ULL<<20);
> >
> > static int __init parse_dma32_size_opt(char *p)
> > {
>
> static __initdata void *dma32_bootmem_ptr;
> static unsigned long dma32_bootmem_size __initdata = (128ULL<<20);
>
> static int __init parse_dma32_size_opt(char *p)
> {
> if (!p)
> return -EINVAL;
> dma32_bootmem_size = memparse(p, &p);
> return 0;
> }
> early_param("dma32_size", parse_dma32_size_opt);
>
> dma32_size is the command line..., user could adjust that.
Yeah, I know, but I'm not sure what you mean.
Do you mean that a user could increase dma32_size and hit the problem even if we
decrease the default allocation to 64MB? If so, I'm not sure that users of huge
boxes need this option for a broken GART BIOS.
On Tue, 15 Dec 2009 01:11:05 -0800
Yinghai Lu <[email protected]> wrote:
> FUJITA Tomonori wrote:
> > On Mon, 14 Dec 2009 23:47:07 -0800
> > Roland Dreier <[email protected]> wrote:
> >
> >> I have a big box (64 threads, 256GB memory) that is crashing early in
> >> boot as below. I bisected it down to f4780ca0 ("x86: Move swiotlb
> >> initialization before dma32_free_bootmem"); reverting just this commit
> >> from the latest git (3ea6b3d0 is what I tested) fixes things.
> >
> > Ah, really sorry about that.
> >
> >
> >> I haven't tried to debug this yet, but I guess on such a huge box there
> >> is not enough memory below 4GB for swiotlb if we don't free the
> >
> > Yeah, Yinghai also hit this (his box has more memory than yours).
> >
> >
> >> stuff allocated earlier? I don't know why that would be, since the
> >> bootmem is grabbing 512MB and I have pretty close to 4GB below 4GB.
> >> Anyway, I'm going to go to bed soon, but if you need more information or
> >> have anything you want me to try, I will do it tomorrow morning.
> >
> > http://www.kernel.org/pub/linux/kernel/people/tomo/misc/0001-x86-two-stage-swiotlb-initialization.patch
> >
> > It makes the swiotlb initialization into two stages. I don't like it
> > much since I like to avoid complicating the initialization.
> >
> > dma32_reserve_bootmem() allocates 128MB for broken GART IOMMU but I
> > think 64MB should be enough since broken GART IOMMU allocates
> > 64MB. The following simple patch might work too because swiotlb uses
> > 64MB.
> >
> > With coming huge memory boxes, we might need to work on ZONE_DMA32
> > shortage issue anyway (sparse-vmemmap, anything else)?
>
> maybe just revert f4780ca0... for now
As I wrote, I think that the following patch works.
http://www.kernel.org/pub/linux/kernel/people/tomo/misc/0001-x86-two-stage-swiotlb-initialization.patch
There are only a few people who hit this. How many people use a box with
over 256GB memory?
And you can work around this with the "dma32_size" kernel boot option.
> actually dma32_free_bootmem will also make sure it will give some buffer to pci_swiotlb_init...
But unlike the broken GART BIOS case, swiotlb doesn't set the goal. swiotlb had
been fine without dma32_reserve_bootmem(). We use more of ZONE_DMA32 than
we did, though.
* FUJITA Tomonori <[email protected]> wrote:
> There are a few people who hit this. How many people use a box with over
> 256GB memory?
>
> And you can work around this with "dma32_size" kernel boot option.
Well, since the kernel has not crashed before this change there's really just
two options as per upstream kernel regression policy: either we fix it or we
revert it.
Ingo
On Tue, 15 Dec 2009 11:56:50 +0100
Ingo Molnar <[email protected]> wrote:
>
> * FUJITA Tomonori <[email protected]> wrote:
>
> > There are a few people who hit this. How many people use a box with over
> > 256GB memory?
> >
> > And you can work around this with "dma32_size" kernel boot option.
>
> Well, since the kernel has not crashed before this change there's really just
> two options as per upstream kernel regression policy: either we fix it or we
> revert it.
As I wrote, here is a patch that can be applied cleanly to the git
head:
http://www.kernel.org/pub/linux/kernel/people/tomo/misc/0001-x86-two-stage-swiotlb-initialization.patch
It fixes the problem. Yinghai, can you test it? It should work but
it's good to confirm it.
I simply wanted to say that it's not a bug that breaks lots of boxes
or leads to something serious like data corruption (no need to say
something like "revert it now!"). It's also worth investigating why it
breaks, I think.
FUJITA Tomonori wrote:
> On Tue, 15 Dec 2009 11:56:50 +0100
> Ingo Molnar <[email protected]> wrote:
>
>> * FUJITA Tomonori <[email protected]> wrote:
>>
>>> There are a few people who hit this. How many people use a box with over
>>> 256GB memory?
>>>
>>> And you can work around this with "dma32_size" kernel boot option.
>> Well, since the kernel has not crashed before this change there's really just
>> two options as per upstream kernel regression policy: either we fix it or we
>> revert it.
>
> As I wrote, here is a patch that can be applied cleanly to the git
> head:
>
> http://www.kernel.org/pub/linux/kernel/people/tomo/misc/0001-x86-two-stage-swiotlb-initialization.patch
>
> It fixes the problem. Yinghai, can you test it? It should work but
> it's good to confirm it.
I tested it already; it works.
>
> I simply wanted to say that it's not a bug that breaks lots of boxes
> or leads to something serious like data corruption (no need to say
> something like "revert it now!"). It's also worth investigating why it
> breaks, I think.
I will look at it later.
YH
* Yinghai Lu <[email protected]> wrote:
> FUJITA Tomonori wrote:
> > On Tue, 15 Dec 2009 11:56:50 +0100
> > Ingo Molnar <[email protected]> wrote:
> >
> >> * FUJITA Tomonori <[email protected]> wrote:
> >>
> >>> There are a few people who hit this. How many people use a box with over
> >>> 256GB memory?
> >>>
> >>> And you can work around this with "dma32_size" kernel boot option.
> >> Well, since the kernel has not crashed before this change there's really just
> >> two options as per upstream kernel regression policy: either we fix it or we
> >> revert it.
> >
> > As I wrote, here is a patch that can be applied cleanly to the git
> > head:
> >
> > http://www.kernel.org/pub/linux/kernel/people/tomo/misc/0001-x86-two-stage-swiotlb-initialization.patch
> >
> > It fixes the problem. Yinghai, can you test it? It should work but
> > it's good to confirm it.
>
> i tested already, it works.
OK, would someone please resend the agreed-upon patch with a Tested-by/Acked-by
line so that I can apply it?
Thanks,
Ingo
On Tue, 15 Dec 2009 03:25:41 -0800
Yinghai Lu <[email protected]> wrote:
> FUJITA Tomonori wrote:
> > On Tue, 15 Dec 2009 11:56:50 +0100
> > Ingo Molnar <[email protected]> wrote:
> >
> >> * FUJITA Tomonori <[email protected]> wrote:
> >>
> >>> There are a few people who hit this. How many people use a box with over
> >>> 256GB memory?
> >>>
> >>> And you can work around this with "dma32_size" kernel boot option.
> >> Well, since the kernel has not crashed before this change there's really just
> >> two options as per upstream kernel regression policy: either we fix it or we
> >> revert it.
> >
> > As I wrote, here is a patch that can be applied cleanly to the git
> > head:
> >
> > http://www.kernel.org/pub/linux/kernel/people/tomo/misc/0001-x86-two-stage-swiotlb-initialization.patch
> >
> > It fixes the problem. Yinghai, can you test it? It should work but
> > it's good to confirm it.
>
> i tested already, it works.
Great, thanks a lot!
I've just sent the patch.
> > I simply wanted to say that it's not a bug that breaks lots of boxes
> > or leads to something serious like data corruption (no need to say
> > something like "revert it now!"). It's also worth investigating why it
> > breaks, I think.
>
> will look at it later
Thanks. If it's due to huge memory, I'll try to work on it.
Yinghai Lu wrote:
> FUJITA Tomonori wrote:
>> I simply wanted to say that it's not a bug that breaks lots of boxes
>> or leads to something serious like data corruption (no need to say
>> something like "revert it now!"). It's also worth investigating why it
>> breaks, I think.
>
> will look at it later
OK, I have a solution for that.
Ingo,
this patch depends on the 4 early_res-related patches I sent before.
YH
[PATCH] x86: make early_node_mem get mem > 4g if possible
so that we can put the pgdat for the node high, and later sparse
vmemmap will get the section nr that it needs.
With this patch, RAM below 4g will not be used for sparse vmemmap.
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa_64.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
Index: linux-2.6/arch/x86/mm/numa_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/numa_64.c
+++ linux-2.6/arch/x86/mm/numa_64.c
@@ -163,14 +163,27 @@ static void * __init early_node_mem(int
unsigned long end, unsigned long size,
unsigned long align)
{
- unsigned long mem = find_e820_area(start, end, size, align);
+ unsigned long mem;
+ /*
+ * put it on high as possible
+ * something will go with NODE_DATA
+ */
+ if (start < (MAX_DMA_PFN<<PAGE_SHIFT))
+ start = MAX_DMA_PFN<<PAGE_SHIFT;
+ if (start < (MAX_DMA32_PFN<<PAGE_SHIFT) &&
+ end > (MAX_DMA32_PFN<<PAGE_SHIFT))
+ start = MAX_DMA32_PFN<<PAGE_SHIFT;
+ mem = find_e820_area(start, end, size, align);
if (mem != -1L)
return __va(mem);
- start = __pa(MAX_DMA_ADDRESS);
- end = max_low_pfn_mapped << PAGE_SHIFT;
+ end = max_pfn_mapped << PAGE_SHIFT;
+ if (end > (MAX_DMA32_PFN<<PAGE_SHIFT))
+ start = MAX_DMA32_PFN<<PAGE_SHIFT;
+ else
+ start = MAX_DMA_PFN<<PAGE_SHIFT;
mem = find_e820_area(start, end, size, align);
if (mem != -1L)
return __va(mem);