2005-10-17 09:37:11

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: x86_64: 2.6.14-rc4 swiotlb broken

On x86_64 NUMA boxes, the revert
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6e3254c4e2927c117044a02acf5f5b56e1373053
means that swiotlb gets the IOTLB memory from pages over 4G (if mem > 4G),
which basically renders swiotlb useless and breaks devices not capable of
DMA beyond 4G. 2.6.13 was (kinda) not broken, even though the patch titled
"Reverse order of bootmem lists" was not in 2.6.13 either; the reason is that commit
http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6142891a0c0209c91aa4a98f725de0d6e2ed4918
was not in 2.6.13, so PCI_DMA_BUS_IS_PHYS was 1 when no IOMMU was present and
the block layer did the bouncing, never using swiotlb. I guess the right fix
is to make sure swiotlb gets the right memory. Here is a patch doing that.
Tested on an IBM x460. I hope the patch is ok for ia64 too; I do not have
access to ia64 boxen.

Thanks,
Kiran

Signed-off-by: Shai Fultheim <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>

Index: linux-2.6.14-rc4/arch/ia64/lib/swiotlb.c
===================================================================
--- linux-2.6.14-rc4.orig/arch/ia64/lib/swiotlb.c 2005-10-14 00:06:21.000000000 -0700
+++ linux-2.6.14-rc4/arch/ia64/lib/swiotlb.c 2005-10-17 00:05:22.000000000 -0700
@@ -123,7 +123,7 @@
/*
* Get IO TLB memory from the low pages
*/
- io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs *
+ io_tlb_start = alloc_bootmem_node(NODE_DATA(0), io_tlb_nslabs *
(1 << IO_TLB_SHIFT));
if (!io_tlb_start)
panic("Cannot allocate SWIOTLB buffer");


2005-10-17 09:50:50

by Andrew Morton

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

Ravikiran G Thirumalai <[email protected]> wrote:
>
> On x86_64 NUMA boxes, the revert
> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6e3254c4e2927c117044a02acf5f5b56e1373053
> means that swiotlb gets the IOTLB memory from pages over 4G (if mem > 4G),
> which basically renders swiotlb useless and breaks devices not capable of
> DMA beyond 4G. 2.6.13 was (kinda) not broken, even though the patch titled
> "Reverse order of bootmem lists" was not in 2.6.13 either; the reason is that commit
> http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6142891a0c0209c91aa4a98f725de0d6e2ed4918
> was not in 2.6.13, so PCI_DMA_BUS_IS_PHYS was 1 when no IOMMU was present and
> the block layer did the bouncing, never using swiotlb. I guess the right fix
> is to make sure swiotlb gets the right memory. Here is a patch doing that.
> Tested on an IBM x460. I hope the patch is ok for ia64 too; I do not have
> access to ia64 boxen.
>

This is an ia64 patch - what point was there in testing it on an x460?

Is something missing here?

>
> Index: linux-2.6.14-rc4/arch/ia64/lib/swiotlb.c
> ===================================================================
> --- linux-2.6.14-rc4.orig/arch/ia64/lib/swiotlb.c 2005-10-14 00:06:21.000000000 -0700
> +++ linux-2.6.14-rc4/arch/ia64/lib/swiotlb.c 2005-10-17 00:05:22.000000000 -0700
> @@ -123,7 +123,7 @@
> /*
> * Get IO TLB memory from the low pages
> */
> - io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs *
> + io_tlb_start = alloc_bootmem_node(NODE_DATA(0), io_tlb_nslabs *
> (1 << IO_TLB_SHIFT));
> if (!io_tlb_start)
> panic("Cannot allocate SWIOTLB buffer");

2005-10-17 09:53:47

by Andi Kleen

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

On Monday 17 October 2005 11:50, Andrew Morton wrote:

>
> This is an ia64 patch - what point was there in testing it on an x460?
>
> Is something missing here?

x86-64 shares that code with ia64.

The patch is actually not quite correct - in theory node 0 could be too small
to contain the full swiotlb bounce buffers.

The real fix would be to get rid of the pgdata lists and just walk the
node_online_map on bootmem.c. The memory hotplug guys have
a patch pending for this.
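
Roughly, __alloc_bootmem() would then look something like the untested
sketch below (illustration only, not the pending hotplug patch):

	void * __init __alloc_bootmem(unsigned long size, unsigned long align,
				      unsigned long goal)
	{
		void *ptr;
		int nid;

		/*
		 * Walk the nodes in numerical order via node_online_map
		 * instead of following the (possibly reversed) pgdat_list.
		 */
		for_each_online_node(nid) {
			ptr = __alloc_bootmem_core(NODE_DATA(nid)->bdata,
						   size, align, goal);
			if (ptr)
				return ptr;
		}

		printk(KERN_ALERT "bootmem alloc of %lu bytes failed!\n", size);
		panic("Out of memory");
		return NULL;
	}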

-Andi

2005-10-17 10:03:58

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

On Mon, Oct 17, 2005 at 02:50:07AM -0700, Andrew Morton wrote:
> Ravikiran G Thirumalai <[email protected]> wrote:
> >
> > On x86_64 NUMA boxes, the revert
> > http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6e3254c4e2927c117044a02acf5f5b56e1373053
> > means that swiotlb gets the IOTLB memory from pages over 4G (if mem > 4G),
> > which basically renders swiotlb useless and breaks devices not capable of
> > DMA beyond 4G. 2.6.13 was (kinda) not broken, even though the patch titled
> > "Reverse order of bootmem lists" was not in 2.6.13 either; the reason is that commit
> > http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6142891a0c0209c91aa4a98f725de0d6e2ed4918
> > was not in 2.6.13, so PCI_DMA_BUS_IS_PHYS was 1 when no IOMMU was present and
> > the block layer did the bouncing, never using swiotlb. I guess the right fix
> > is to make sure swiotlb gets the right memory. Here is a patch doing that.
> > Tested on an IBM x460. I hope the patch is ok for ia64 too; I do not have
> > access to ia64 boxen.
> >
>
> This is an ia64 patch - what point was there in testing it on an x460?
>
> Is something missing here?

x86-64 uses swiotlb as well, via arch/ia64 directly. John Linville has
a patch to move the swiotlb to lib/swiotlb.c that is waiting in an
IA64 tree for inclusion (post 2.6.14, I guess?)

Cheers,
Muli
--
Muli Ben-Yehuda
http://www.mulix.org | http://mulix.livejournal.com/

2005-10-17 10:55:40

by Yasunori Goto

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

Hello. Ravikiran-san.

> > This is an ia64 patch - what point was there in testing it on an x460?
> >
> > Is something missing here?
>
> x86-64 shares that code with ia64.
>
> The patch is actually not quite correct - in theory node 0 could be too small
> to contain the full swiotlb bounce buffers.
>
> The real fix would be to get rid of the pgdata lists and just walk the
> node_online_map on bootmem.c. The memory hotplug guys have
> a patch pending for this.

Yeah!
I posted a patch for this problem to linux-mm ML. Could you try it?
http://marc.theaimsgroup.com/?l=linux-mm&m=112791558527522&w=2

2.6.14-rc4-mm1 has already merged it. ;-)

Thanks.

--
Yasunori Goto

2005-10-17 15:27:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken



On Mon, 17 Oct 2005, Andi Kleen wrote:
>
> The patch is actually not quite correct - in theory node 0 could be too small
> to contain the full swiotlb bounce buffers.

Is node 0 guaranteed to be all low-memory? What if it allocates stuff at
the end of memory on NODE(0)?

Anyway, it sounds like "alloc_bootmem_low_pages()" is seriously buggered
if it allocates non-low pages, if only because of its name...

> The real fix would be to get rid of the pgdata lists and just walk the
> node_online_map on bootmem.c. The memory hotplug guys have
> a patch pending for this.

Argh. Which one should I pick? The NODE(0) one looks simpler, but is it
sufficient for now in practice (with the real one going into 2.6.14+)?

Linus

2005-10-17 15:30:26

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

On Mon, Oct 17, 2005 at 11:53:54AM +0200, Andi Kleen wrote:
> On Monday 17 October 2005 11:50, Andrew Morton wrote:
>
> >
> > This is an ia64 patch - what point was there in testing it on an x460?
> >
> > Is something missing here?
>
> x86-64 shares that code with ia64.
>
> The patch is actually not quite correct - in theory node 0 could be too small
> to contain the full swiotlb bounce buffers.

Good point. I missed that possibility.

>
> The real fix would be to get rid of the pgdata lists and just walk the
> node_online_map on bootmem.c. The memory hotplug guys have
> a patch pending for this.

Yes, I just saw Yasunori-san's patch. Would that be merged for 2.6.14?
'Cause 2.6.14 is broken as of now for x86_64 boxes with more than 4G ram.

Thanks,
Kiran

2005-10-17 15:37:35

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

On Mon, Oct 17, 2005 at 08:27:40AM -0700, Linus Torvalds wrote:
>
>
> On Mon, 17 Oct 2005, Andi Kleen wrote:
> >
> > The patch is actually not quite correct - in theory node 0 could be too small
> > to contain the full swiotlb bounce buffers.
>
> Is node 0 guaranteed to be all low-memory? What if it allocates stuff at
> the end of memory on NODE(0)?
>
> Anyway, it sounds like "alloc_bootmem_low_pages()" is seriously buggered
> if it allocates non-low pages, if only because of its name...
>
> > The real fix would be to get rid of the pgdata lists and just walk the
> > node_online_map on bootmem.c. The memory hotplug guys have
> > a patch pending for this.
>
> Argh. Which one should I pick? The NODE(0) one looks simpler, but is it
> sufficient for now in practice (with the real one going into 2.6.14+)?

That's the reason I made a small patch. It does work on the boxes we use.
But I like Yasunori-san's patch. Mine is just a chewing gum fix.

Thanks,
Kiran

2005-10-17 15:40:36

by Andi Kleen

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

On Monday 17 October 2005 17:27, Linus Torvalds wrote:
> On Mon, 17 Oct 2005, Andi Kleen wrote:
> > The patch is actually not quite correct - in theory node 0 could be too
> > small to contain the full swiotlb bounce buffers.
>
> Is node 0 guaranteed to be all low-memory? What if it allocates stuff at
> the end of memory on NODE(0)?

This is 64bit ... only low memory.

>
> Anyway, it sounds like "alloc_bootmem_low_pages()" is seriously buggered
> if it allocates non-low pages, if only because of its name...

The pages are all low, just at the wrong end of the memory.

> > The real fix would be to get rid of the pgdata lists and just walk the
> > node_online_map on bootmem.c. The memory hotplug guys have
> > a patch pending for this.
>
> Argh. Which one should I pick? The NODE(0) one looks simpler, but is it
> sufficient for now in practice (with the real one going into 2.6.14+)?
>
> Linus

First this problem is definitely not critical. AFAIK it only happens on
scalex's unreleased machines. Intel NUMA x86 machines are really rare
and on AMD it doesn't happen because the swiotlb is not used there.
So just ignoring it is a quite reasonable option.

Both NODE(0) and node_online_map are risky. NODE(0) may break IA64
(they share this code) and node_online_map may break one of the weirder
ARM platforms again (for which the original revert was done)

For 2.6.14 I think it's best to ignore it and use node_online_map for 2.6.15.

-Andi

2005-10-17 15:43:25

by Andi Kleen

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

On Monday 17 October 2005 17:30, Ravikiran G Thirumalai wrote:

> Yes, I just saw Yasunori-san's patch. Would that be merged for 2.6.14?

I think both are too risky at this point.

> 'Cause 2.6.14 is broken as of now for x86_64 boxes with more than 4G ram.

... that need swiotlb, which makes that set much, much smaller.
Sorry, but your scalex platform is definitely not anywhere near release
critical right now.

-Andi

2005-10-17 15:56:36

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

On Mon, Oct 17, 2005 at 05:40:56PM +0200, Andi Kleen wrote:

> First this problem is definitely not critical. AFAIK it only happens on
> scalex's unreleased machines. Intel NUMA x86 machines are really rare
> and on AMD it doesn't happen because the swiotlb is not used there.

It's not used by default, but there are cases where it's used and it
would be a shame to release a major kernel and knowingly break
them. For example, any setup that used iommu_force or any non-AMD
x86-64 machine with more than 4GB of memory and only 32-bit capable
DMA devices.

> Both NODE(0) and node_online_map are risky. NODE(0) may break IA64
> (they share this code) and node_online_map may break one of the weirder
> ARM platforms again (for which the original revert was done)

I don't have an IA64 machine, but if the NODE(0) fix is safe there, I
vote for it for 2.6.14-rc4. Another alternative is to temporarily
provide a different version of swiotlb_init() for x86-64 and IA64 -
I can whip up a patch if that's acceptable.

Cheers,
Muli
--
Muli Ben-Yehuda
http://www.mulix.org | http://mulix.livejournal.com/

2005-10-17 16:01:46

by Andi Kleen

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

On Monday 17 October 2005 17:56, Muli Ben-Yehuda wrote:
> On Mon, Oct 17, 2005 at 05:40:56PM +0200, Andi Kleen wrote:
> > First this problem is definitely not critical. AFAIK it only happens on
> > scalex's unreleased machines. Intel NUMA x86 machines are really rare
> > and on AMD it doesn't happen because the swiotlb is not used there.
>
> It's not used by default, but there are cases where it's used and it
> would be a shame to release a major kernel and knowingly break
> them. For example, any setup that used iommu_force or any non-AMD
> x86-64 machine with more than 4GB of memory and only 32-bit capable
> DMA devices.

... but risk breaking other stuff. Unless you can get the ARM and/or IA64
people to do some retesting with the proposed fixes it's quite risky.
Sometimes you have to make compromises before releases.

> Another alternative is to temporarily
> provide a different version of swiotlb_init() for x86-64 and IA64 -
> I can whip up a patch if that's acceptable.

I don't want that.

-Andi

2005-10-17 16:03:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken



On Mon, 17 Oct 2005, Andi Kleen wrote:

> On Monday 17 October 2005 17:27, Linus Torvalds wrote:
> > On Mon, 17 Oct 2005, Andi Kleen wrote:
> > > The patch is actually not quite correct - in theory node 0 could be too
> > > small to contain the full swiotlb bounce buffers.
> >
> > Is node 0 guaranteed to be all low-memory? What if it allocates stuff at
> > the end of memory on NODE(0)?
>
> This is 64bit ... only low memory.

Ehh.. No there isn't.

PCI DMA isn't magically 64-bit, even on your Opteron.

So low memory in this case is anything < 32 bits. How many bits the CPU
has is immaterial.

That's the whole _point_ of swiotlb, after all, so I don't see why you
argue.

Linus

2005-10-17 16:26:17

by Andi Kleen

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

On Monday 17 October 2005 18:02, Linus Torvalds wrote:
> On Mon, 17 Oct 2005, Andi Kleen wrote:
> > On Monday 17 October 2005 17:27, Linus Torvalds wrote:
> > > On Mon, 17 Oct 2005, Andi Kleen wrote:
> > > > The patch is actually not quite correct - in theory node 0 could be
> > > > too small to contain the full swiotlb bounce buffers.
> > >
> > > Is node 0 guaranteed to be all low-memory? What if it allocates stuff
> > > at the end of memory on NODE(0)?
> >
> > This is 64bit ... only low memory.
>
> Ehh.. No there isn't.
>
> PCI DMA isn't magically 64-bit, even on your Opteron.
>
> So low memory in this case is anything < 32 bits. How many bits the CPU
> has is immaterial.

That's completely new terminology. We always called all of ZONE_NORMAL low
memory.

The 32bit DMA zone had no special name before the ZONE_DMA32 patches.
With those patches, it is called the dma32 zone.

-Andi

2005-10-17 16:42:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken



On Mon, 17 Oct 2005, Andi Kleen wrote:
>
> That's completely new terminology. We always called all of ZONE_NORMAL low
> memory.

We call it "low" memory because it happens to have "low" addresses. In
other words, it's not "terminology", it's "English".

None of the allocators that allocate stuff in ZONE_NORMAL is called "low"
normally. It's _normal_ memory. It's not ZONE_LOW.

We don't say "kmalloc_low()". We say "kmalloc()".

A function that is called "xyz_low()" means something other than normal to
me. If it was normal memory, we'd call it just "xyz()". And if it did high
memory, we'd call it "xyz_highmem()" (or, preferably, we'd just have a
generic function that accepted GFP_HIGHMEM as a parameter).

Linus

2005-10-17 17:09:42

by Andi Kleen

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

On Monday 17 October 2005 18:42, Linus Torvalds wrote:
> We call it "low" memory because it happens to have "low" addresses.

Well, in NUMA bootmem it never was, unless you registered the nodes reversed.
It always starts with the highest node
(which I can't easily change: ARM relies on that ordering, so fixing it properly breaks them).

-Andi

2005-10-17 17:52:41

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

On Mon, Oct 17, 2005 at 05:40:56PM +0200, Andi Kleen wrote:
> On Monday 17 October 2005 17:27, Linus Torvalds wrote:
> > On Mon, 17 Oct 2005, Andi Kleen wrote:
> > ...
> > Argh. Which one should I pick? The NODE(0) one looks simpler, but is it
> > sufficient for now in practice (with the real one going into 2.6.14+)?
> >
> > Linus
>
> First this problem is definitely not critical. AFAIK it only happens on
> scalex's unreleased machines. Intel NUMA x86 machines are really rare
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
No they are not. IBM X460s are generally available machines and the bug
affects those boxes. How can there be a major kernel release which is known
to have breakage??

Maybe someone with access to ia64 NUMA boxen can check if the NODE(0)
solution works (and does not break anything) on ia64? Christoph, can you help?

Thanks,
Kiran

2005-10-17 18:08:04

by Andi Kleen

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Monday 17 October 2005 19:52, Ravikiran G Thirumalai wrote:

> No they are not. IBM X460s are generally available machines and the bug
> affects those boxes.

No reports from that front so far.

> How can there be a major kernel release which is known
> to have breakage??

Welcome to the painful real world of software engineering.

Every software has bugs and if you want to ever get a release out you
have to make such decisions sometimes.

As an alternative I can just backout the patch that enables the Intel
SRAT code. That is probably better for a short term fix and will
not regress anybody.

-Andi

2005-10-17 18:21:35

by Christoph Lameter

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

On Mon, 17 Oct 2005, Ravikiran G Thirumalai wrote:

> Maybe someone with access to ia64 NUMA boxen can check if the NODE(0)
> solution works (and does not break anything) on ia64? Christoph, can you help?

Umm... SGI does not use the swiotlb and we do not have these issues. HP
does use the swiotlb on IA64. CCing John and Alex.

For the newcomers: Thread is at
http://marc.theaimsgroup.com/?t=112954203900001&r=1&w=2

Proposed patch by Kiran:

Index: linux-2.6.14-rc4/arch/ia64/lib/swiotlb.c
===================================================================
--- linux-2.6.14-rc4.orig/arch/ia64/lib/swiotlb.c 2005-10-14 00:06:21.000000000 -0700
+++ linux-2.6.14-rc4/arch/ia64/lib/swiotlb.c 2005-10-17 00:05:22.000000000 -0700
@@ -123,7 +123,7 @@
/*
* Get IO TLB memory from the low pages
*/
- io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs *
+ io_tlb_start = alloc_bootmem_node(NODE_DATA(0), io_tlb_nslabs *
(1 << IO_TLB_SHIFT));
if (!io_tlb_start)
panic("Cannot allocate SWIOTLB buffer");



2005-10-17 18:28:36

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Mon, Oct 17, 2005 at 08:08:24PM +0200, Andi Kleen wrote:
> On Monday 17 October 2005 19:52, Ravikiran G Thirumalai wrote:
>
> > No they are not. IBM X460s are generally available machines and the bug
> > affects those boxes.
>
> No reports from that front so far.

We have such machines with >4GB memory and 32 bit DMA capable SCSI
controllers and would like to be able to run 2.6.14 on them when it
comes out...

> > How can there be a major kernel release which is known
> > to have breakage??
>
> Welcome to the painful real world of software engineering.
>
> Every software has bugs and if you want to ever get a release out you
> have to make such decisions sometimes.

Fair enough, but this is a regression for something that used to
work. If a painful choice is required, how about reverting the patch
that broke it and breaking something that used to be broken?

> As an alternative I can just backout the patch that enables the Intel
> SRAT code. That is probably better for a short term fix and will
> not regress anybody.

Sounds great!

Cheers,
Muli
--
Muli Ben-Yehuda
http://www.mulix.org | http://mulix.livejournal.com/

2005-10-17 18:32:23

by Andi Kleen

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Monday 17 October 2005 20:27, Muli Ben-Yehuda wrote:
> On Mon, Oct 17, 2005 at 08:08:24PM +0200, Andi Kleen wrote:
> > On Monday 17 October 2005 19:52, Ravikiran G Thirumalai wrote:
> > > No they are not. IBM X460s are generally available machines and the
> > > bug affects those boxes.
> >
> > No reports from that front so far.
>
> We have such machines with >4GB memory and 32 bit DMA capable SCSI
> controllers

32bit DMA SCSI controllers??? Where did you find such a beast?

> and would like to be able to run 2.6.14 on them when it
> comes out...

So you're saying you tested it and it doesn't work?

-Andi

2005-10-17 18:38:53

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Mon, Oct 17, 2005 at 08:08:24PM +0200, Andi Kleen wrote:
> On Monday 17 October 2005 19:52, Ravikiran G Thirumalai wrote:
>
> > No they are not. IBM X460s are generally available machines and the bug
> > affects those boxes.
>
> No reports from that front so far.

Attached is the dmesg on x460 which shows where swiotlb is allocated.

>
> > How can there be a major kernel release which is known
> > to have breakage??
>
> Welcome to the painful real world of software engineering.
>
> Every software has bugs and if you want to ever get a release out you
> have to make such decisions sometimes.
>
> As an alternative I can just backout the patch that enables the Intel
> SRAT code. That is probably better for a short term fix and will
> not regress anybody.

Intel SRAT code? Which patch are you referring to? How is that relevant here?

Thanks,
Kiran


Attachments:
dmesg (22.97 kB)

2005-10-17 18:46:01

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Mon, Oct 17, 2005 at 08:32:45PM +0200, Andi Kleen wrote:

> > and would like to be able to run 2.6.14 on them when it
> > comes out...
>
> So you're saying you tested it and it doesn't work?

Not quite; I'm saying that from the description up-thread it sounds
like there's a good chance it won't. Jon Mason (CC'd) has access to
such a machine. Jon, can you please try the latest hg tree with and
without the patch and see how it fares?

Thanks,
Muli
--
Muli Ben-Yehuda
http://www.mulix.org | http://mulix.livejournal.com/

2005-10-17 18:53:32

by Russell King

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Mon, Oct 17, 2005 at 04:02:09PM +0200, Andi Kleen wrote:
> ... but risk breaking other stuff. Unless you can get the ARM and/or IA64
> people to do some retesting with the proposed fixes it's quite risky.
> Sometimes you have to make compromises before releases.

Since this issue came up, I have reworked ARM's memory initialisation
as below, which makes us independent of the order in which memory zones
are initialised. However, this patch is more or less untested, and
does _not_ yet support XIP (which is pretty important for some people
on ARM.) As such it isn't ready for mainline yet.

Please note that ARM has numerous differing memory setups, so it's
impossible for any one person to sufficiently test it - hence it's
definitely very early release cycle material only.

diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -86,15 +86,22 @@ void show_mem(void)
printk("%d pages swap cached\n", cached);
}

-struct node_info {
- unsigned int start;
- unsigned int end;
- int bootmap_pages;
-};
+static inline pmd_t *pmd_off(pgd_t *pgd, unsigned long virt)
+{
+ return pmd_offset(pgd, virt);
+}
+
+static inline pmd_t *pmd_off_k(unsigned long virt)
+{
+ return pmd_off(pgd_offset_k(virt), virt);
+}

-#define O_PFN_DOWN(x) ((x) >> PAGE_SHIFT)
#define O_PFN_UP(x) (PAGE_ALIGN(x) >> PAGE_SHIFT)

+#define for_each_nodebank(iter,mi,no) \
+ for (iter = 0; iter < mi->nr_banks; iter++) \
+ if (mi->bank[iter].node == no)
+
/*
* FIXME: We really want to avoid allocating the bootmap bitmap
* over the top of the initrd. Hopefully, this is located towards
@@ -109,12 +116,9 @@ find_bootmap_pfn(int node, struct meminf
start_pfn = O_PFN_UP(__pa(&_end));
bootmap_pfn = 0;

- for (bank = 0; bank < mi->nr_banks; bank ++) {
+ for_each_nodebank(bank, mi, node) {
unsigned int start, end;

- if (mi->bank[bank].node != node)
- continue;
-
start = mi->bank[bank].start >> PAGE_SHIFT;
end = (mi->bank[bank].size +
mi->bank[bank].start) >> PAGE_SHIFT;
@@ -140,92 +144,6 @@ find_bootmap_pfn(int node, struct meminf
return bootmap_pfn;
}

-/*
- * Scan the memory info structure and pull out:
- * - the end of memory
- * - the number of nodes
- * - the pfn range of each node
- * - the number of bootmem bitmap pages
- */
-static unsigned int __init
-find_memend_and_nodes(struct meminfo *mi, struct node_info *np)
-{
- unsigned int i, bootmem_pages = 0, memend_pfn = 0;
-
- for (i = 0; i < MAX_NUMNODES; i++) {
- np[i].start = -1U;
- np[i].end = 0;
- np[i].bootmap_pages = 0;
- }
-
- for (i = 0; i < mi->nr_banks; i++) {
- unsigned long start, end;
- int node;
-
- if (mi->bank[i].size == 0) {
- /*
- * Mark this bank with an invalid node number
- */
- mi->bank[i].node = -1;
- continue;
- }
-
- node = mi->bank[i].node;
-
- /*
- * Make sure we haven't exceeded the maximum number of nodes
- * that we have in this configuration. If we have, we're in
- * trouble. (maybe we ought to limit, instead of bugging?)
- */
- if (node >= MAX_NUMNODES)
- BUG();
- node_set_online(node);
-
- /*
- * Get the start and end pfns for this bank
- */
- start = mi->bank[i].start >> PAGE_SHIFT;
- end = (mi->bank[i].start + mi->bank[i].size) >> PAGE_SHIFT;
-
- if (np[node].start > start)
- np[node].start = start;
-
- if (np[node].end < end)
- np[node].end = end;
-
- if (memend_pfn < end)
- memend_pfn = end;
- }
-
- /*
- * Calculate the number of pages we require to
- * store the bootmem bitmaps.
- */
- for_each_online_node(i) {
- if (np[i].end == 0)
- continue;
-
- np[i].bootmap_pages = bootmem_bootmap_pages(np[i].end -
- np[i].start);
- bootmem_pages += np[i].bootmap_pages;
- }
-
- high_memory = __va(memend_pfn << PAGE_SHIFT);
-
- /*
- * This doesn't seem to be used by the Linux memory
- * manager any more. If we can get rid of it, we
- * also get rid of some of the stuff above as well.
- *
- * Note: max_low_pfn and max_pfn reflect the number
- * of _pages_ in the system, not the maximum PFN.
- */
- max_low_pfn = memend_pfn - O_PFN_DOWN(PHYS_OFFSET);
- max_pfn = memend_pfn - O_PFN_DOWN(PHYS_OFFSET);
-
- return bootmem_pages;
-}
-
static int __init check_initrd(struct meminfo *mi)
{
int initrd_node = -2;
@@ -266,9 +184,8 @@ static int __init check_initrd(struct me
/*
* Reserve the various regions of node 0
*/
-static __init void reserve_node_zero(unsigned int bootmap_pfn, unsigned int bootmap_pages)
+static __init void reserve_node_zero(pg_data_t *pgdat)
{
- pg_data_t *pgdat = NODE_DATA(0);
unsigned long res_size = 0;

/*
@@ -289,13 +206,6 @@ static __init void reserve_node_zero(uns
PTRS_PER_PGD * sizeof(pgd_t));

/*
- * And don't forget to reserve the allocator bitmap,
- * which will be freed later.
- */
- reserve_bootmem_node(pgdat, bootmap_pfn << PAGE_SHIFT,
- bootmap_pages << PAGE_SHIFT);
-
- /*
* Hmm... This should go elsewhere, but we really really need to
* stop things allocating the low memory; ideally we need a better
* implementation of GFP_DMA which does not assume that DMA-able
@@ -324,183 +234,276 @@ static __init void reserve_node_zero(uns
reserve_bootmem_node(pgdat, PHYS_OFFSET, res_size);
}

-/*
- * Register all available RAM in this node with the bootmem allocator.
- */
-static inline void free_bootmem_node_bank(int node, struct meminfo *mi)
+void __init build_mem_type_table(void);
+void __init create_mapping(struct map_desc *md);
+
+static unsigned long __init
+bootmem_init_node(int node, int initrd_node, struct meminfo *mi)
{
- pg_data_t *pgdat = NODE_DATA(node);
- int bank;
+ unsigned long zone_size[MAX_NR_ZONES], zhole_size[MAX_NR_ZONES];
+ unsigned long start_pfn, end_pfn, boot_pfn;
+ unsigned int boot_pages;
+ pg_data_t *pgdat;
+ int i;

- for (bank = 0; bank < mi->nr_banks; bank++)
- if (mi->bank[bank].node == node)
- free_bootmem_node(pgdat, mi->bank[bank].start,
- mi->bank[bank].size);
-}
+ start_pfn = -1UL;
+ end_pfn = 0;

-/*
- * Initialise the bootmem allocator for all nodes. This is called
- * early during the architecture specific initialisation.
- */
-static void __init bootmem_init(struct meminfo *mi)
-{
- struct node_info node_info[MAX_NUMNODES], *np = node_info;
- unsigned int bootmap_pages, bootmap_pfn, map_pg;
- int node, initrd_node;
+ /*
+ * Calculate the pfn range, and map the memory banks for this node.
+ */
+ for_each_nodebank(i, mi, node) {
+ unsigned long start, end;
+ struct map_desc map;

- bootmap_pages = find_memend_and_nodes(mi, np);
- bootmap_pfn = find_bootmap_pfn(0, mi, bootmap_pages);
- initrd_node = check_initrd(mi);
+ start = mi->bank[i].start >> PAGE_SHIFT;
+ end = (mi->bank[i].start + mi->bank[i].size) >> PAGE_SHIFT;

- map_pg = bootmap_pfn;
+ if (start_pfn > start)
+ start_pfn = start;
+ if (end_pfn < end)
+ end_pfn = end;
+
+ map.physical = mi->bank[i].start;
+ map.virtual = __phys_to_virt(map.physical);
+ map.length = mi->bank[i].size;
+ map.type = MT_MEMORY;
+
+ create_mapping(&map);
+ }

/*
- * Initialise the bootmem nodes.
- *
- * What we really want to do is:
- *
- * unmap_all_regions_except_kernel();
- * for_each_node_in_reverse_order(node) {
- * map_node(node);
- * allocate_bootmem_map(node);
- * init_bootmem_node(node);
- * free_bootmem_node(node);
- * }
- *
- * but this is a 2.5-type change. For now, we just set
- * the nodes up in reverse order.
- *
- * (we could also do with rolling bootmem_init and paging_init
- * into one generic "memory_init" type function).
+ * If there is no memory in this node, ignore it.
*/
- np += num_online_nodes() - 1;
- for (node = num_online_nodes() - 1; node >= 0; node--, np--) {
- /*
- * If there are no pages in this node, ignore it.
- * Note that node 0 must always have some pages.
- */
- if (np->end == 0 || !node_online(node)) {
- if (node == 0)
- BUG();
- continue;
- }
+ if (end_pfn == 0)
+ return end_pfn;

- /*
- * Initialise the bootmem allocator.
- */
- init_bootmem_node(NODE_DATA(node), map_pg, np->start, np->end);
- free_bootmem_node_bank(node, mi);
- map_pg += np->bootmap_pages;
+ /*
+ * Allocate the bootmem bitmap page.
+ */
+ boot_pages = bootmem_bootmap_pages(end_pfn - start_pfn);
+ boot_pfn = find_bootmap_pfn(node, mi, boot_pages);

- /*
- * If this is node 0, we need to reserve some areas ASAP -
- * we may use bootmem on node 0 to setup the other nodes.
- */
- if (node == 0)
- reserve_node_zero(bootmap_pfn, bootmap_pages);
- }
+ /*
+ * Initialise the bootmem allocator for this node, handing the
+ * memory banks over to bootmem.
+ */
+ node_set_online(node);
+ pgdat = NODE_DATA(node);
+ init_bootmem_node(pgdat, boot_pfn, start_pfn, end_pfn);

+ for_each_nodebank(i, mi, node)
+ free_bootmem_node(pgdat, mi->bank[i].start, mi->bank[i].size);
+
+ /*
+ * Reserve the bootmem bitmap for this node.
+ */
+ reserve_bootmem_node(pgdat, boot_pfn << PAGE_SHIFT,
+ boot_pages << PAGE_SHIFT);

#ifdef CONFIG_BLK_DEV_INITRD
- if (phys_initrd_size && initrd_node >= 0) {
- reserve_bootmem_node(NODE_DATA(initrd_node), phys_initrd_start,
+ /*
+ * If the initrd is in this node, reserve its memory.
+ */
+ if (node == initrd_node) {
+ reserve_bootmem_node(pgdat, phys_initrd_start,
phys_initrd_size);
initrd_start = __phys_to_virt(phys_initrd_start);
initrd_end = initrd_start + phys_initrd_size;
}
#endif

- BUG_ON(map_pg != bootmap_pfn + bootmap_pages);
+ /*
+ * Finally, reserve any node zero regions.
+ */
+ if (node == 0)
+ reserve_node_zero(pgdat);
+
+ /*
+ * initialise the zones within this node.
+ */
+ memset(zone_size, 0, sizeof(zone_size));
+ memset(zhole_size, 0, sizeof(zhole_size));
+
+ /*
+ * The size of this node has already been determined. If we need
+ * to do anything fancy with the allocation of this memory to the
+ * zones, now is the time to do it.
+ */
+ zone_size[0] = end_pfn - start_pfn;
+
+ /*
+ * For each bank in this node, calculate the size of the holes.
+ * holes = node_size - sum(bank_sizes_in_node)
+ */
+ zhole_size[0] = zone_size[0];
+ for_each_nodebank(i, mi, node)
+ zhole_size[0] -= mi->bank[i].size >> PAGE_SHIFT;
+
+ /*
+ * Adjust the sizes according to any special requirements for
+ * this machine type.
+ */
+ arch_adjust_zones(node, zone_size, zhole_size);
+
+ free_area_init_node(node, pgdat, zone_size, start_pfn, zhole_size);
+
+ return end_pfn;
}

-/*
- * paging_init() sets up the page tables, initialises the zone memory
- * maps, and sets up the zero page, bad page and bad page tables.
- */
-void __init paging_init(struct meminfo *mi, struct machine_desc *mdesc)
+static void __init bootmem_init(struct meminfo *mi)
{
- void *zero_page;
- int node;
+ unsigned long addr, memend_pfn = 0;
+ int node, initrd_node, i;

- bootmem_init(mi);
+ /*
+ * Invalidate the node number for empty or invalid memory banks
+ */
+ for (i = 0; i < mi->nr_banks; i++)
+ if (mi->bank[i].size == 0 || mi->bank[i].node >= MAX_NUMNODES)
+ mi->bank[i].node = -1;

memcpy(&meminfo, mi, sizeof(meminfo));

+#ifdef CONFIG_XIP_KERNEL
+#error needs fixing
+ p->physical = CONFIG_XIP_PHYS_ADDR & PMD_MASK;
+ p->virtual = (unsigned long)&_stext & PMD_MASK;
+ p->length = ((unsigned long)&_etext - p->virtual + ~PMD_MASK) & PMD_MASK;
+ p->type = MT_ROM;
+ p ++;
+#endif
+
/*
- * allocate the zero page. Note that we count on this going ok.
+ * Clear out all the mappings below the kernel image.
+ * FIXME: what about XIP?
*/
- zero_page = alloc_bootmem_low_pages(PAGE_SIZE);
+ for (addr = 0; addr < PAGE_OFFSET; addr += PGDIR_SIZE)
+ pmd_clear(pmd_off_k(addr));

/*
- * initialise the page tables.
+ * Clear out all the kernel space mappings, except for the first
+ * memory bank, up to the end of the vmalloc region.
*/
- memtable_init(mi);
- if (mdesc->map_io)
- mdesc->map_io();
- local_flush_tlb_all();
+ for (addr = __phys_to_virt(mi->bank[0].start + mi->bank[0].size);
+ addr < VMALLOC_END; addr += PGDIR_SIZE)
+ pmd_clear(pmd_off_k(addr));

/*
- * initialise the zones within each node
+ * Locate which node contains the ramdisk image, if any.
*/
- for_each_online_node(node) {
- unsigned long zone_size[MAX_NR_ZONES];
- unsigned long zhole_size[MAX_NR_ZONES];
- struct bootmem_data *bdata;
- pg_data_t *pgdat;
- int i;
+ initrd_node = check_initrd(mi);

- /*
- * Initialise the zone size information.
- */
- for (i = 0; i < MAX_NR_ZONES; i++) {
- zone_size[i] = 0;
- zhole_size[i] = 0;
- }
+ /*
+ * Run through each node initialising the bootmem allocator.
+ */
+ for_each_node(node) {
+ unsigned long end_pfn;

- pgdat = NODE_DATA(node);
- bdata = pgdat->bdata;
+ end_pfn = bootmem_init_node(node, initrd_node, mi);

/*
- * The size of this node has already been determined.
- * If we need to do anything fancy with the allocation
- * of this memory to the zones, now is the time to do
- * it.
+ * Remember the highest memory PFN.
*/
- zone_size[0] = bdata->node_low_pfn -
- (bdata->node_boot_start >> PAGE_SHIFT);
+ if (end_pfn > memend_pfn)
+ memend_pfn = end_pfn;
+ }

- /*
- * If this zone has zero size, skip it.
- */
- if (!zone_size[0])
- continue;
+ high_memory = __va(memend_pfn << PAGE_SHIFT);

- /*
- * For each bank in this node, calculate the size of the
- * holes. holes = node_size - sum(bank_sizes_in_node)
- */
- zhole_size[0] = zone_size[0];
- for (i = 0; i < mi->nr_banks; i++) {
- if (mi->bank[i].node != node)
- continue;
+ /*
+ * This doesn't seem to be used by the Linux memory manager any
+ * more, but is used by ll_rw_block. If we can get rid of it, we
+ * also get rid of some of the stuff above as well.
+ *
+ * Note: max_low_pfn and max_pfn reflect the number of _pages_ in
+ * the system, not the maximum PFN.
+ */
+ max_pfn = max_low_pfn = memend_pfn - PHYS_PFN_OFFSET;
+}

- zhole_size[0] -= mi->bank[i].size >> PAGE_SHIFT;
- }
+/*
+ * Set up device the mappings. Since we clear out the page tables for all
+ * mappings above VMALLOC_END, we will remove any debug device mappings.
+ * This means you have to be careful how you debug this function, or any
+ * called function. (Do it by code inspection!)
+ */
+static void __init devicemaps_init(struct machine_desc *mdesc)
+{
+ struct map_desc map;
+ unsigned long addr;
+ void *vectors;

- /*
- * Adjust the sizes according to any special
- * requirements for this machine type.
- */
- arch_adjust_zones(node, zone_size, zhole_size);
+ for (addr = VMALLOC_END; addr; addr += PGDIR_SIZE)
+ pmd_clear(pmd_off_k(addr));
+
+ /*
+ * Map the cache flushing regions.
+ */
+#ifdef FLUSH_BASE
+ map.physical = FLUSH_BASE_PHYS;
+ map.virtual = FLUSH_BASE;
+ map.length = PGDIR_SIZE;
+ map.type = MT_CACHECLEAN;
+ create_mapping(&map);
+#endif
+#ifdef FLUSH_BASE_MINICACHE
+ map.physical = FLUSH_BASE_PHYS + PGDIR_SIZE;
+ map.virtual = FLUSH_BASE_MINICACHE;
+ map.length = PGDIR_SIZE;
+ map.type = MT_MINICLEAN;
+ create_mapping(&map);
+#endif

- free_area_init_node(node, pgdat, zone_size,
- bdata->node_boot_start >> PAGE_SHIFT, zhole_size);
+ flush_cache_all();
+ local_flush_tlb_all();
+
+ vectors = alloc_bootmem_low_pages(PAGE_SIZE);
+ BUG_ON(!vectors);
+
+ /*
+ * Create a mapping for the machine vectors at the high-vectors
+ * location (0xffff0000). If we aren't using high-vectors, also
+ * create a mapping at the low-vectors virtual address.
+ */
+ map.physical = virt_to_phys(vectors);
+ map.virtual = 0xffff0000;
+ map.length = PAGE_SIZE;
+ map.type = MT_HIGH_VECTORS;
+ create_mapping(&map);
+
+ if (!vectors_high()) {
+ map.virtual = 0;
+ map.type = MT_LOW_VECTORS;
+ create_mapping(&map);
}

/*
- * finish off the bad pages once
- * the mem_map is initialised
+ * Ask the machine support to map in the statically mapped devices.
+ * After this point, we can start to touch devices again.
+ */
+ if (mdesc->map_io)
+ mdesc->map_io();
+}
+
+/*
+ * paging_init() sets up the page tables, initialises the zone memory
+ * maps, and sets up the zero page, bad page and bad page tables.
+ */
+void __init paging_init(struct meminfo *mi, struct machine_desc *mdesc)
+{
+ void *zero_page;
+
+ build_mem_type_table();
+ bootmem_init(mi);
+ devicemaps_init(mdesc);
+
+ top_pmd = pmd_off_k(0xffff0000);
+
+ /*
+ * allocate the zero page. Note that we count on this going ok.
*/
+ zero_page = alloc_bootmem_low_pages(PAGE_SIZE);
memzero(zero_page, PAGE_SIZE);
empty_zero_page = virt_to_page(zero_page);
flush_dcache_page(empty_zero_page);
@@ -562,10 +565,7 @@ static void __init free_unused_memmap_no
* may not be the case, especially if the user has provided the
* information on the command line.
*/
- for (i = 0; i < mi->nr_banks; i++) {
- if (mi->bank[i].size == 0 || mi->bank[i].node != node)
- continue;
-
+ for_each_nodebank(i, mi, node) {
bank_start = mi->bank[i].start >> PAGE_SHIFT;
if (bank_start < prev_bank_end) {
printk(KERN_ERR "MEM: unordered memory banks. "
diff --git a/arch/arm/mm/mm-armv.c b/arch/arm/mm/mm-armv.c
--- a/arch/arm/mm/mm-armv.c
+++ b/arch/arm/mm/mm-armv.c
@@ -305,16 +305,6 @@ alloc_init_page(unsigned long virt, unsi
set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, prot));
}

-/*
- * Clear any PGD mapping. On a two-level page table system,
- * the clearance is done by the middle-level functions (pmd)
- * rather than the top-level (pgd) functions.
- */
-static inline void clear_mapping(unsigned long virt)
-{
- pmd_clear(pmd_off_k(virt));
-}
-
struct mem_types {
unsigned int prot_pte;
unsigned int prot_l1;
@@ -373,7 +363,7 @@ static struct mem_types mem_types[] __in
/*
* Adjust the PMD section entries according to the CPU in use.
*/
-static void __init build_mem_type_table(void)
+void __init build_mem_type_table(void)
{
struct cachepolicy *cp;
unsigned int cr = get_cr();
@@ -483,7 +473,7 @@ static void __init build_mem_type_table(
* offsets, and we take full advantage of sections and
* supersections.
*/
-static void __init create_mapping(struct map_desc *md)
+void __init create_mapping(struct map_desc *md)
{
unsigned long virt, length;
int prot_sect, prot_l1, domain;
@@ -601,100 +591,6 @@ void setup_mm_for_reboot(char mode)
}
}

-extern void _stext, _etext;
-
-/*
- * Setup initial mappings. We use the page we allocated for zero page to hold
- * the mappings, which will get overwritten by the vectors in traps_init().
- * The mappings must be in virtual address order.
- */
-void __init memtable_init(struct meminfo *mi)
-{
- struct map_desc *init_maps, *p, *q;
- unsigned long address = 0;
- int i;
-
- build_mem_type_table();
-
- init_maps = p = alloc_bootmem_low_pages(PAGE_SIZE);
-
-#ifdef CONFIG_XIP_KERNEL
- p->physical = CONFIG_XIP_PHYS_ADDR & PMD_MASK;
- p->virtual = (unsigned long)&_stext & PMD_MASK;
- p->length = ((unsigned long)&_etext - p->virtual + ~PMD_MASK) & PMD_MASK;
- p->type = MT_ROM;
- p ++;
-#endif
-
- for (i = 0; i < mi->nr_banks; i++) {
- if (mi->bank[i].size == 0)
- continue;
-
- p->physical = mi->bank[i].start;
- p->virtual = __phys_to_virt(p->physical);
- p->length = mi->bank[i].size;
- p->type = MT_MEMORY;
- p ++;
- }
-
-#ifdef FLUSH_BASE
- p->physical = FLUSH_BASE_PHYS;
- p->virtual = FLUSH_BASE;
- p->length = PGDIR_SIZE;
- p->type = MT_CACHECLEAN;
- p ++;
-#endif
-
-#ifdef FLUSH_BASE_MINICACHE
- p->physical = FLUSH_BASE_PHYS + PGDIR_SIZE;
- p->virtual = FLUSH_BASE_MINICACHE;
- p->length = PGDIR_SIZE;
- p->type = MT_MINICLEAN;
- p ++;
-#endif
-
- /*
- * Go through the initial mappings, but clear out any
- * pgdir entries that are not in the description.
- */
- q = init_maps;
- do {
- if (address < q->virtual || q == p) {
- clear_mapping(address);
- address += PGDIR_SIZE;
- } else {
- create_mapping(q);
-
- address = q->virtual + q->length;
- address = (address + PGDIR_SIZE - 1) & PGDIR_MASK;
-
- q ++;
- }
- } while (address != 0);
-
- /*
- * Create a mapping for the machine vectors at the high-vectors
- * location (0xffff0000). If we aren't using high-vectors, also
- * create a mapping at the low-vectors virtual address.
- */
- init_maps->physical = virt_to_phys(init_maps);
- init_maps->virtual = 0xffff0000;
- init_maps->length = PAGE_SIZE;
- init_maps->type = MT_HIGH_VECTORS;
- create_mapping(init_maps);
-
- if (!vectors_high()) {
- init_maps->virtual = 0;
- init_maps->type = MT_LOW_VECTORS;
- create_mapping(init_maps);
- }
-
- flush_cache_all();
- local_flush_tlb_all();
-
- top_pmd = pmd_off_k(0xffff0000);
-}
-
/*
* Create the architecture specific mappings
*/

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-10-17 19:04:32

by Alex Williamson

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

On Mon, 2005-10-17 at 11:20 -0700, Christoph Lameter wrote:
> On Mon, 17 Oct 2005, Ravikiran G Thirumalai wrote:
>
> > Maybe someone with access to ia64 NUMA boxen can check if the NODE(0)
> > solution works (and does not break anything) on ia64? Christoph, can you help?
>
> Umm... SGI does not use the swiotlb and we do not have these issues. HP
> does use the swiotlb on IA64. CCing John and Alex.
...
> @@ -123,7 +123,7 @@
> /*
> * Get IO TLB memory from the low pages
> */
> - io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs *
> + io_tlb_start = alloc_bootmem_node(NODE_DATA(0), io_tlb_nslabs *

HP ia64 boxes typically use a hardware I/O TLB, so this is not the
normal case. However, the sx1000 boxes are exactly an example that will
break because of this assumption about memory layout. These boxes can
be configured to have various ratios of node local memory and
interleaved memory. Node local memory starts well above 4GB.
Interleaved memory is zero-based, and described in its own proximity
domain. It therefore looks like a memory-only node. I believe the
above code change would cause us to allocate memory from the node local
range, way too high in the address space for bounce buffers.

BTW, I've got a patch in Tony's testing branch that allows a late
initialization of the swiotlb. This is currently only useful on ia64
since we have 4GB of ZONE_DMA, but may be useful on x86-ish archs when
the 4GB zone is introduced.

Alex

--

2005-10-17 19:05:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken



On Mon, 17 Oct 2005, Muli Ben-Yehuda wrote:
>
> On Mon, Oct 17, 2005 at 08:32:45PM +0200, Andi Kleen wrote:
>
> > > and would like to be able to run 2.6.14 on them when it
> > > comes out...
> >
> > So you're saying you tested it and it doesn't work?
>
> Not quite; I'm saying that from the description up-thread it sounds
> like there's a good chance it won't. Jon Mason (CC'd) has access to
> such a machine. Jon, can you please try the latest hg tree with and
> without the patch and see how it fares?

NOTE! Even if the machine has 4GB or more of memory, it's entirely likely
that the quick "use NODE(0)" hack will work fine.

Why? Because the bootmem memory should still be allocated low-to-high by
default, which means that as long as NODE(0) has _enough_ memory in the
DMA range, we should be ok.
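
(For scale: assuming the usual swiotlb.c defaults of io_tlb_nslabs = 32768
and IO_TLB_SHIFT = 11, i.e. 2KB slabs, "enough" here means roughly

	io_tlb_nslabs * (1 << IO_TLB_SHIFT) = 32768 * 2048 bytes = 64MB

of memory below 4GB on node 0.)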

So I _think_ the simple one-liner NODE(0) patch is sufficient, and should
work (and is a lot more acceptable for 2.6.14 than switching the node
ordering around yet again, or doing bigger surgery on the bootmem code).

So the only thing that worried me (and made me ask whether there might be
machines where it doesn't work) is if some machines might have their high
memory (or no memory at all) on NODE(0). It does sound unlikely, but I
simply don't know what kind of strange NUMA configs there are out there.

And I'm definitely only interested in machines that are out there, not
some theoretical issues.

Linus

2005-10-17 19:07:14

by Luck, Tony

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

> x86-64 uses swiotlb as well, via arch/ia64 directly. John Linville has
> a patch to move the swiotlb to lib/swiotlb.c that is waiting in an
> IA64 tree for inclusion (post 2.6.14, I guess?)

Yes. John's patch is sitting in a "swiotlb" branch of my GIT tree. I'd like to
hear some positive confirmation from Linus and/or Andrew that moving this
up to lib/ is ok ... I don't want to be accused of shovelling tasteless code up
into the base.

This is definitely a post 2.6.14 move.

-Tony

2005-10-17 19:09:36

by Andi Kleen

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Monday 17 October 2005 21:04, Linus Torvalds wrote:

> So the only thing that worried me (and made me ask whether there might be
> machines where it doesn't work) is if some machines might have their high
> memory (or no memory at all) on NODE(0). It does sound unlikely, but I
> simply don't know what kind of strange NUMA configs there are out there.

It could happen in VirtualIron (they seem to interleave node 0 over many nodes
to get equal use of lowmem in 32bit NUMA), but should not in x86-64..

> And I'm definitely only interested in machines that are out there, not
> some theoretical issues.

According to Alex W. it will break their sx1000 IA64 boxen.

-Andi (who still thinks it's best to just ignore it or disable Intel NUMA)

2005-10-17 19:26:46

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

On Mon, Oct 17, 2005 at 01:04:01PM -0600, Alex Williamson wrote:
> On Mon, 2005-10-17 at 11:20 -0700, Christoph Lameter wrote:
> > On Mon, 17 Oct 2005, Ravikiran G Thirumalai wrote:
> >
> > > Maybe someone with access to ia64 NUMA boxen can check if the NODE(0)
> > > solution works (and does not break anything) on ia64? Christoph, can you help?
> >
> > Umm... SGI does not use the swiotlb and we do not have these issues. HP
> > does use the swiotlb on IA64. CCing John and Alex.
> ...
> > @@ -123,7 +123,7 @@
> > /*
> > * Get IO TLB memory from the low pages
> > */
> > - io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs *
> > + io_tlb_start = alloc_bootmem_node(NODE_DATA(0), io_tlb_nslabs *
>
> HP ia64 boxes typically use a hardware I/O TLB, so this is not the
> normal case. However, the sx1000 boxes are exactly an example that will
> break because of this assumption about memory layout. These boxes can
> be configured to have various ratios of node local memory and
> interleaved memory. Node local memory starts well above 4GB.
> > Interleaved memory is zero-based, and described in its own proximity
> domain. It therefore looks like a memory-only node. I believe the
> above code change would cause us to allocate memory from the node local
> range, way too high in the address space for bounce buffers.

This memory-only node has a node id? Then how about a patch which iterates over
nodes in swiotlb.c, trying to allocate DMA'ble memory from node 0 and above
until it gets proper memory for swiotlb?

Would that be acceptable? If so, I can quickly make a patch for it.
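
Roughly along these lines, say (untested sketch only - the bootmem calls and
the below-4GB check are just to illustrate the idea, not a tested patch):

	unsigned long bytes = io_tlb_nslabs * (1 << IO_TLB_SHIFT);
	int nid;

	io_tlb_start = NULL;
	for_each_online_node(nid) {
		io_tlb_start = alloc_bootmem_node(NODE_DATA(nid), bytes);
		if (!io_tlb_start)
			continue;
		/* Keep this allocation only if it really landed below 4GB. */
		if (virt_to_phys(io_tlb_start) + bytes <= 0x100000000UL)
			break;
		free_bootmem_node(NODE_DATA(nid), virt_to_phys(io_tlb_start),
				  bytes);
		io_tlb_start = NULL;
	}
	if (!io_tlb_start)
		panic("Cannot allocate SWIOTLB buffer");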

Thanks,
Kiran

2005-10-17 19:28:37

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Mon, 2005-10-17 at 21:09 +0200, Andi Kleen wrote:
> On Monday 17 October 2005 21:04, Linus Torvalds wrote:
>
> > So the only thing that worried me (and made me ask whether there might be
> > machines where it doesn't work) is if some machines might have their high
> > memory (or no memory at all) on NODE(0). It does sound unlikely, but I
> > simply don't know what kind of strange NUMA configs there are out there.
>
> It could happen in VirtualIron (they seem to interleave node 0 over many nodes
> to get equal use of lowmem in 32bit NUMA), but should not in x86-64..
>
does VirtualIron work with kernel.org kernels at all?


2005-10-17 19:47:55

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Mon, Oct 17, 2005 at 09:09:52PM +0200, Andi Kleen wrote:
> On Monday 17 October 2005 21:04, Linus Torvalds wrote:
>
> > So the only thing that worried me (and made me ask whether there might be
> > machines where it doesn't work) is if some machines might have their high
> > memory (or no memory at all) on NODE(0). It does sound unlikely, but I
> simply don't know what kind of strange NUMA configs there are out there.
>
> It could happen in VirtualIron (they seem to interleave node 0 over many nodes
> to get equal use of lowmem in 32bit NUMA), but should not in x86-64..
>
> > And I'm definitely only interested in machines that are out there, not
> > some theoretical issues.
>
> According to Alex W. it will break their sx1000 IA64 boxen.

sx1000 is probably already broken, unless the last pgdat happens to be the
memory-only node with 0-4G? How about the fix I suggested, which would iterate
across all nodes until it found the right node for swiotlb?

Thanks,
Kiran

2005-10-17 19:52:37

by Alex Williamson

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

On Mon, 2005-10-17 at 12:26 -0700, Ravikiran G Thirumalai wrote:

> This memory only node has a node id? Then how about a patch which iterates over
> nodes in swiotlb.c, trying to allocate DMA'ble memory from node 0 and above
> until it gets proper memory for swiotlb?
>
> Would that be accepatble? I can quickly make a patch for that if it is
> acceptable..

Yes, the memory-only node is just another node. Iterating over all
nodes sounds a little brute force, but I guess it should work. FWIW,
here are the results of the previous one-liner on an HP Superdome (booting
w/ machvec=dig to use the swiotlb instead of hardware iotlb):

2.6.14-rc4-mm1, before patch:
Placing software IO TLB between 0x4cdc000 - 0x8cdc000

after patch:
Placing software IO TLB between 0x74104e6b200 - 0x74108e6b200

Thanks,

Alex

--

2005-10-17 20:44:44

by Andrew Morton

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

Andi Kleen <[email protected]> wrote:
>
> On Monday 17 October 2005 17:30, Ravikiran G Thirumalai wrote:
>
> > Yes, I just saw Yasunori-san's patch. Would that be merged for 2.6.14?
>
> I think both are too risky at this point.
>

Maybe.

There seem to be a lot of proposed solutions floating about and I fear that
different people will try to fix this in different ways. Do we all agree
that this patch is the correct solution to this problem, or is something
more needed?



From: Yasunori Goto <[email protected]>

This is a patch to guarantee that alloc_bootmem_low() allocates from the DMA area.

The current alloc_bootmem_low() just specifies "goal=0", which
__alloc_bootmem_core() uses to decide which address is preferable. However,
there is no guarantee that __alloc_bootmem_core() allocates from the DMA area
when goal=0 is specified. Even if the node being searched has no DMA'ble
area, it allocates an address higher than MAX_DMA_ADDRESS.

__alloc_bootmem_core() is called in for_each_pgdat() order from
__alloc_bootmem(). So, if the first node (node_id = 0) has a DMA'ble area, no
trouble will occur. However, our new Itanium 2 server can change which node
has the lower addresses, and a panic really occurred on it. The message was
"bounce buffer is not DMA'ble" from swiotlb_map_single().

To avoid this panic, the following patch checks the allocated area and retries
if it is not in the DMA range. I tested this patch on my Tiger 4 and our new
server.

Signed-off-by: Yasunori Goto <[email protected]>

Signed-off-by: Andrew Morton <[email protected]>
---

include/linux/bootmem.h | 16 ++++++++++++----
mm/bootmem.c | 43 +++++++++++++++++++++++++++++++++++++++----
2 files changed, 51 insertions(+), 8 deletions(-)

diff -puN include/linux/bootmem.h~guarantee-dma-area-for-alloc_bootmem_low include/linux/bootmem.h
--- devel/include/linux/bootmem.h~guarantee-dma-area-for-alloc_bootmem_low 2005-10-11 00:34:32.000000000 -0700
+++ devel-akpm/include/linux/bootmem.h 2005-10-11 00:34:32.000000000 -0700
@@ -40,6 +40,14 @@ typedef struct bootmem_data {
* up searching */
} bootmem_data_t;

+static inline unsigned long max_dma_physaddr(void)
+{
+
+ if (MAX_DMA_ADDRESS == ~0UL)
+ return MAX_DMA_ADDRESS;
+ return __pa(MAX_DMA_ADDRESS);
+}
+
extern unsigned long __init bootmem_bootmap_pages (unsigned long);
extern unsigned long __init init_bootmem (unsigned long addr, unsigned long memend);
extern void __init free_bootmem (unsigned long addr, unsigned long size);
@@ -47,11 +55,11 @@ extern void * __init __alloc_bootmem (un
#ifndef CONFIG_HAVE_ARCH_BOOTMEM_NODE
extern void __init reserve_bootmem (unsigned long addr, unsigned long size);
#define alloc_bootmem(x) \
- __alloc_bootmem((x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))
+ __alloc_bootmem((x), SMP_CACHE_BYTES, max_dma_physaddr())
#define alloc_bootmem_low(x) \
__alloc_bootmem((x), SMP_CACHE_BYTES, 0)
#define alloc_bootmem_pages(x) \
- __alloc_bootmem((x), PAGE_SIZE, __pa(MAX_DMA_ADDRESS))
+ __alloc_bootmem((x), PAGE_SIZE, max_dma_physaddr())
#define alloc_bootmem_low_pages(x) \
__alloc_bootmem((x), PAGE_SIZE, 0)
#endif /* !CONFIG_HAVE_ARCH_BOOTMEM_NODE */
@@ -64,9 +72,9 @@ extern unsigned long __init free_all_boo
extern void * __init __alloc_bootmem_node (pg_data_t *pgdat, unsigned long size, unsigned long align, unsigned long goal);
#ifndef CONFIG_HAVE_ARCH_BOOTMEM_NODE
#define alloc_bootmem_node(pgdat, x) \
- __alloc_bootmem_node((pgdat), (x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))
+ __alloc_bootmem_node((pgdat), (x), SMP_CACHE_BYTES, max_dma_physaddr())
#define alloc_bootmem_pages_node(pgdat, x) \
- __alloc_bootmem_node((pgdat), (x), PAGE_SIZE, __pa(MAX_DMA_ADDRESS))
+ __alloc_bootmem_node((pgdat), (x), PAGE_SIZE, max_dma_physaddr())
#define alloc_bootmem_low_pages_node(pgdat, x) \
__alloc_bootmem_node((pgdat), (x), PAGE_SIZE, 0)
#endif /* !CONFIG_HAVE_ARCH_BOOTMEM_NODE */
diff -puN mm/bootmem.c~guarantee-dma-area-for-alloc_bootmem_low mm/bootmem.c
--- devel/mm/bootmem.c~guarantee-dma-area-for-alloc_bootmem_low 2005-10-11 00:34:32.000000000 -0700
+++ devel-akpm/mm/bootmem.c 2005-10-11 00:34:32.000000000 -0700
@@ -382,19 +382,54 @@ unsigned long __init free_all_bootmem (v
return(free_all_bootmem_core(NODE_DATA(0)));
}

+static int __init is_dma_required(unsigned long goal)
+{
+ return goal < max_dma_physaddr() ? 1 : 0;
+}
+
+static int __init unmatch_dma_required(void *ptr, unsigned long goal)
+{
+
+ if(is_dma_required(goal) && (unsigned long)ptr >= MAX_DMA_ADDRESS)
+ return 1;
+
+ return 0;
+}
+
void * __init __alloc_bootmem (unsigned long size, unsigned long align, unsigned long goal)
{
pg_data_t *pgdat = pgdat_list;
void *ptr;
+ int retried = 0;
+
+retry:
+ for_each_pgdat(pgdat){

- for_each_pgdat(pgdat)
- if ((ptr = __alloc_bootmem_core(pgdat->bdata, size,
- align, goal)))
- return(ptr);
+ ptr = __alloc_bootmem_core(pgdat->bdata, size,
+ align, goal);
+ if (!ptr)
+ continue;
+
+ if (unmatch_dma_required(ptr, goal) && !retried){
+ /* DMA is required, but normal area is allocated.
+ Other node might have DMA, should try it. */
+ free_bootmem_core(pgdat->bdata, virt_to_phys(ptr), size);
+ continue;
+ }
+
+ return ptr;
+ }

/*
* Whoops, we cannot satisfy the allocation request.
*/
+ if (is_dma_required(goal) && !retried){
+ printk(KERN_WARNING "bootmem alloc DMA of %lu bytes failed, retry normal area!\n", size);
+ dump_stack();
+ retried++;
+ goto retry;
+ }
+
printk(KERN_ALERT "bootmem alloc of %lu bytes failed!\n", size);
panic("Out of memory");
return NULL;
_

2005-10-17 21:11:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken



On Mon, 17 Oct 2005, Andrew Morton wrote:
>
> There seem to be a lot of proposed solutions floating about and I fear that
> different people will try to fix this in different ways. Do we all agree
> that this patch is the correct solution to this problem, or is something
> more needed?

I think this will fix it.

The naming is horrible, though, and that whole "goal" parameter is
senseless and ugly.

It should just be a flag on whether we want DMA'able memory or not. That's
what it _is_, it's just strangely implemented, making the code less
readable.

Since the patch changes all the users of that third parameter _anyway_, it
should probably be fixed to just make the parameter sane instead.

Linus
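
For concreteness, the flag-style interface Linus describes might look something
like the sketch below. This is purely illustrative (no such patch was posted in
this thread, and the name __alloc_bootmem_flagged is made up): the caller says
whether it needs DMA'able memory, and the allocator decides where to place it.

/* hypothetical sketch only -- replaces the "goal" address with a yes/no flag */
void * __init __alloc_bootmem_flagged(unsigned long size, unsigned long align,
                                      int need_dma);

#define alloc_bootmem(x)           __alloc_bootmem_flagged((x), SMP_CACHE_BYTES, 0)
#define alloc_bootmem_low(x)       __alloc_bootmem_flagged((x), SMP_CACHE_BYTES, 1)
#define alloc_bootmem_pages(x)     __alloc_bootmem_flagged((x), PAGE_SIZE, 0)
#define alloc_bootmem_low_pages(x) __alloc_bootmem_flagged((x), PAGE_SIZE, 1)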

2005-10-17 23:51:38

by David Lang

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Mon, 17 Oct 2005, Linus Torvalds wrote:
>
> Why? Because the bootmem memory should still be allocated low-to-high by
> default, which means that as long as NODE(0) has _enough_ memory in the
> DMA range, we should be ok.
>
> So I _think_ the simple one-liner NODE(0) patch is sufficient, and should
> work (and is a lot more acceptable for 2.6.14 than switching the node
> ordering around yet again, or doing bigger surgery on the bootmem code).
>
> So the only thing that worried me (and made me ask whether there might be
> machines where it doesn't work) is if some machines might have their high
> memory (or no memory at all) on NODE(0). It does sound unlikely, but I
> simple don't know what kind of strange NUMA configs there are out there.
>
> And I'm definitely only interested in machines that are out there, not
> some theoretical issues.

What about x86_64 machines with 8G of RAM connected to each CPU (4x2G
DIMMs)? Not exactly common, but also not that rare.

Also, if CPUs are registered from high to low, would a 2-CPU box with 8G of
RAM connected to one CPU and no RAM connected to the second CPU match your
problem case? (I know that's a sub-optimal layout, but it's legal.)

David Lang


> Linus

--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

2005-10-18 00:16:30

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

On Mon, Oct 17, 2005 at 02:11:20PM -0700, Linus Torvalds wrote:
>
>
> On Mon, 17 Oct 2005, Andrew Morton wrote:
> >
> > There seem to be a lot of proposed solutions floating about and I fear that
> > different people will try to fix this in different ways. Do we all agree
> > that this patch is the correct solution to this problem, or is something
> > more needed?
>
> I think this will fix it.
>

I just tried Yasunori-san's patch on our x460. It doesn't fix the problem;
the dmesg capture is attached. However, the patch to iterate over nodes and
allocate bootmem works, and should also work for ia64, for boxes with
insufficient memory on node 0, for nodes with just cpus, etc. I have asked
Alex to try this on the Superdome (although the use of swiotlb on Superdomes
seems to be optional). If it works on Superdomes, then please apply. This has
been tested on the x460.

Thanks,
Kiran

Signed-off-by: Ravikiran Thirumalai <[email protected]>

Index: linux-2.6.14-rc4/arch/ia64/lib/swiotlb.c
===================================================================
--- linux-2.6.14-rc4.orig/arch/ia64/lib/swiotlb.c 2005-10-17 13:27:35.000000000 -0700
+++ linux-2.6.14-rc4/arch/ia64/lib/swiotlb.c 2005-10-17 16:00:44.000000000 -0700
@@ -106,6 +106,8 @@
__setup("swiotlb=", setup_io_tlb_npages);
/* make io_tlb_overflow tunable too? */

+#define IS_LOWPAGES(paddr, size) ((paddr < 0xffffffff) && ((paddr+size) < 0xffffffff))
+
/*
* Statically reserve bounce buffer space and initialize bounce buffer data
* structures for the software IO TLB used to implement the PCI DMA API.
@@ -114,17 +116,32 @@
swiotlb_init_with_default_size (size_t default_size)
{
unsigned long i;
+ unsigned long iotlbsz;
+ int node;

if (!io_tlb_nslabs) {
io_tlb_nslabs = (default_size >> IO_TLB_SHIFT);
io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
}

+ iotlbsz = io_tlb_nslabs * (1 << IO_TLB_SHIFT);
+
/*
- * Get IO TLB memory from the low pages
+ * Get IO TLB memory from the 0-4G range
*/
- io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs *
- (1 << IO_TLB_SHIFT));
+
+ for_each_node(node) {
+ io_tlb_start = alloc_bootmem_node(NODE_DATA(node), iotlbsz);
+ if (io_tlb_start) {
+ if (IS_LOWPAGES(virt_to_phys(io_tlb_start), iotlbsz))
+ break;
+ free_bootmem_node(NODE_DATA(node),
+ virt_to_phys(io_tlb_start), iotlbsz);
+ io_tlb_start = NULL;
+ }
+ }
+
+
if (!io_tlb_start)
panic("Cannot allocate SWIOTLB buffer");
io_tlb_end = io_tlb_start + io_tlb_nslabs * (1 << IO_TLB_SHIFT);


Attachments:
(No filename) (2.49 kB)
dmesg-x460-fail (23.32 kB)

2005-10-18 02:30:38

by Yasunori Goto

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

Hello. Linus-san.

> NOTE! Even if the machine has 4GB or more of memory, it's entirely likely
> that the quick "use NODE(0)" hack will work fine.
>
> Why? Because the bootmem memory should still be allocated low-to-high by
> default, which means that as long as NODE(0) has _enough_ memory in the
> DMA range, we should be ok.
>
> So I _think_ the simple one-liner NODE(0) patch is sufficient, and should
> work (and is a lot more acceptable for 2.6.14 than switching the node
> ordering around yet again, or doing bigger surgery on the bootmem code).
>
> So the only thing that worried me (and made me ask whether there might be
> machines where it doesn't work) is if some machines might have their high
> memory (or no memory at all) on NODE(0). It does sound unlikely, but I
> simple don't know what kind of strange NUMA configs there are out there.
>
> And I'm definitely only interested in machines that are out there, not
> some theoretical issues.

On the IA64 machine we are building, node 0 might not have any low memory;
another node can have the low memory instead.

This is caused by hotplugging a whole node.
For example, please imagine the following case.

1) In this case, the firmware remembers that pxm 1's node has the low memory.

     node 0         node 1
+--------------+ +-----------+
| pxm = 1      | | pxm = 2   |
| low memory   | |           |
+--------------+ +-----------+


2) If a node is hot-added at pxm = 0 (pxm is decided from the physical
location by the firmware), the new node will be node 2.

   node 2          node 0          node 1
+-----------+ +--------------+ +-----------+
| pxm = 0   | | pxm = 1      | | pxm = 2   |
|           | | low memory   | |           |
+-----------+ +--------------+ +-----------+

3) If the user reboots the machine, Linux assigns node ids in pxm order,
but the firmware still remembers which node has the low memory.
So node 0 will not have any low memory.

   node 0          node 1          node 2
+-----------+ +--------------+ +-----------+
| pxm = 0   | | pxm = 1      | | pxm = 2   |
|           | | low memory   | |           |
+-----------+ +--------------+ +-----------+

So, just "use NODE(0)" is not enough hack for our machine.
If "use NODE(0)" is selected, kernel must sort pgdat link and
node id by memory address. I think that hot add code will be a
bit messy instead.

Thanks.

--
Yasunori Goto

2005-10-18 03:20:49

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Tue, Oct 18, 2005 at 11:29:18AM +0900, Yasunori Goto wrote:
> Hello. Linus-san.
> ...
> In our making IA64 machine node 0 might not have any low-memory, and
> another node can have low-memory instead.
>
> This cause comes from hotplug whole of one node.
> For example, please imagine following case.
>
> 1) In this case, firmware remembers pxm 1's node has low memory.
>
>      node 0         node 1
> +--------------+ +-----------+
> | pxm = 1      | | pxm = 2   |
> | low memory   | |           |
> +--------------+ +-----------+
>
>
> 2) If one node is hot-added at pxm = 0 (pxm is decided from physical
> locate by firmware.), new node will be node 2.
>
>    node 2          node 0          node 1
> +-----------+ +--------------+ +-----------+
> | pxm = 0   | | pxm = 1      | | pxm = 2   |
> |           | | low memory   | |           |
> +-----------+ +--------------+ +-----------+
>
> 3) If user reboots the machine, Linux decides node id from pxm's order.
> But firmware still remembers which node has low memory.
> So, node 0 will not have any low memory.
>
>    node 0          node 1          node 2
> +-----------+ +--------------+ +-----------+
> | pxm = 0   | | pxm = 1      | | pxm = 2   |
> |           | | low memory   | |           |
> +-----------+ +--------------+ +-----------+
>
> So, just "use NODE(0)" is not enough hack for our machine.
> If "use NODE(0)" is selected, kernel must sort pgdat link and
> node id by memory address. I think that hot add code will be a
> bit messy instead.

Yasunori-san,
Does this patch work on your boxes instead? (For 2.6.14)
http://marc.theaimsgroup.com/?l=linux-kernel&m=112959469914681&w=2

Thanks,
Kiran

2005-10-18 04:30:11

by Yasunori Goto

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

> > So, just "use NODE(0)" is not enough hack for our machine.
> > If "use NODE(0)" is selected, kernel must sort pgdat link and
> > node id by memory address. I think that hot add code will be a
> > bit messy instead.
>
> Yasunori-san,
> Does this patch work on your boxes instead? (For 2.6.14)
> http://marc.theaimsgroup.com/?l=linux-kernel&m=112959469914681&w=2

Not yet. But could you change this line at least?

+
+ for_each_node(node) {
+ io_tlb_start = alloc_bootmem_node(NODE_DATA(node), iotlbsz);

The for_each_node() loop walks node_possible_map, which includes
"offlined" nodes.
Please use for_each_online_node() instead. Then I'll check it. :-)
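
For readers not staring at the headers, the difference boils down to which
nodemask the loop walks. From my reading of the 2.6.14-era
include/linux/nodemask.h (quoted from memory, so double-check the tree):

#define for_each_node(node)        for_each_node_mask((node), node_possible_map)
#define for_each_online_node(node) for_each_node_mask((node), node_online_map)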



And from your patch I understand why my patch doesn't work on your box.
(Thanks!)
x86-64's DMA zone is the area below 16MB,
but swiotlb only requires memory below 4GB,
and there is no interface to express that difference.

I think your patch looks reasonable for getting 2.6.14 produced ASAP.
It has less impact than mine.

However, the kernel should have a new interface like alloc_bootmem_low32().
This problem will occur for more than just swiotlb.

I'll rewrite my patch to add it. I'll be glad if it can be
applied for 2.6.15.

Thanks.

--
Yasunori Goto

2005-10-18 06:13:36

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Tue, Oct 18, 2005 at 01:28:20PM +0900, Yasunori Goto wrote:
> > > So, just "use NODE(0)" is not enough hack for our machine.
> > > If "use NODE(0)" is selected, kernel must sort pgdat link and
> > > node id by memory address. I think that hot add code will be a
> > > bit messy instead.
> >
> > Yasunori-san,
> > Does this patch work on your boxes instead? (For 2.6.14)
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=112959469914681&w=2
>
> Not yet. But could you change this line at least?
>
> +
> + for_each_node(node) {
> + io_tlb_start = alloc_bootmem_node(NODE_DATA(node), iotlbsz);
>
> for_each_node() loop walks around node_possible_map which includes
> "offlined" node.
> Please use for_each_online_node() instead. Then, I'll check it. :-)

Since swiotlb is allocated even before the APs are brought up, I thought
that if the node containing lowmem32 was not the boot node, it would not be
online. On closer look, my assumption was wrong. Here is the patch,
which iterates over the online nodes to allocate lowmem32 bootmem for
swiotlb.

--

Patch to ensure low32 mem allocation for x86_64 swiotlb

Signed-off-by: Ravikiran Thirumalai <[email protected]>

Index: linux-2.6.14-rc4/arch/ia64/lib/swiotlb.c
===================================================================
--- linux-2.6.14-rc4.orig/arch/ia64/lib/swiotlb.c 2005-10-17 13:27:35.000000000 -0700
+++ linux-2.6.14-rc4/arch/ia64/lib/swiotlb.c 2005-10-17 16:00:44.000000000 -0700
@@ -106,6 +106,8 @@
__setup("swiotlb=", setup_io_tlb_npages);
/* make io_tlb_overflow tunable too? */

+#define IS_LOWPAGES(paddr, size) ((paddr < 0xffffffff) && ((paddr+size) < 0xffffffff))
+
/*
* Statically reserve bounce buffer space and initialize bounce buffer data
* structures for the software IO TLB used to implement the PCI DMA API.
@@ -114,17 +116,32 @@
swiotlb_init_with_default_size (size_t default_size)
{
unsigned long i;
+ unsigned long iotlbsz;
+ int node;

if (!io_tlb_nslabs) {
io_tlb_nslabs = (default_size >> IO_TLB_SHIFT);
io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
}

+ iotlbsz = io_tlb_nslabs * (1 << IO_TLB_SHIFT);
+
/*
- * Get IO TLB memory from the low pages
+ * Get IO TLB memory from the 0-4G range
*/
- io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs *
- (1 << IO_TLB_SHIFT));
+
+ for_each_online_node(node) {
+ io_tlb_start = alloc_bootmem_node(NODE_DATA(node), iotlbsz);
+ if (io_tlb_start) {
+ if (IS_LOWPAGES(virt_to_phys(io_tlb_start), iotlbsz))
+ break;
+ free_bootmem_node(NODE_DATA(node),
+ virt_to_phys(io_tlb_start), iotlbsz);
+ io_tlb_start = NULL;
+ }
+ }
+
+
if (!io_tlb_start)
panic("Cannot allocate SWIOTLB buffer");
io_tlb_end = io_tlb_start + io_tlb_nslabs * (1 << IO_TLB_SHIFT);

2005-10-18 08:23:30

by Andi Kleen

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken

On Tuesday 18 October 2005 02:16, Ravikiran G Thirumalai wrote:
> On Mon, Oct 17, 2005 at 02:11:20PM -0700, Linus Torvalds wrote:
> > On Mon, 17 Oct 2005, Andrew Morton wrote:
> > > There seem to be a lot of proposed solutions floating about and I fear
> > > that different people will try to fix this in different ways. Do we
> > > all agree that this patch is the correct solution to this problem, or
> > > is something more needed?
> >
> > I think this will fix it.
>
> I just tried Yasunori-sans patch on our x460. It doesn't fix the problem.

That's surprising. Can you post the full boot log? The nodes should really be
in order. Maybe we need to sort the SRAT first.

-Andi

2005-10-18 10:11:42

by Yasunori Goto

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken


I tested your patch, but unfortunately it doesn't work on IA64.
alloc_bootmem_node() asks for an area above MAX_DMA_ADDRESS,
which is defined as 4GB for ia64 (arch/ia64/mm/init.c),
but this patch requires an area below 4GB.
So they exclude each other.

I'm convinced that a new interface like alloc_bootmem_low32() is
necessary after all. ;-)

Thanks.


> On Tue, Oct 18, 2005 at 01:28:20PM +0900, Yasunori Goto wrote:
> > > > So, just "use NODE(0)" is not enough hack for our machine.
> > > > If "use NODE(0)" is selected, kernel must sort pgdat link and
> > > > node id by memory address. I think that hot add code will be a
> > > > bit messy instead.
> > >
> > > Yasunori-san,
> > > Does this patch work on your boxes instead? (For 2.6.14)
> > > http://marc.theaimsgroup.com/?l=linux-kernel&m=112959469914681&w=2
> >
> > Not yet. But could you change this line at least?
> >
> > +
> > + for_each_node(node) {
> > + io_tlb_start = alloc_bootmem_node(NODE_DATA(node), iotlbsz);
> >
> > for_each_node() loop walks around node_possible_map which includes
> > "offlined" node.
> > Please use for_each_online_node() instead. Then, I'll check it. :-)
>
> Since swiotlb is is allocated even before APs are brought up, I thought,
> if the node containing lowmem32 was not the boot node, it would not be
> online. But on closer look, my assumption was wrong. Here is the patch
> which iterates through online nodes to allocate lowmem32 bootmem for
> swiotlb.
>
> --
>
> Patch to ensure low32 mem allocation for x86_64 swiotlb
>
> Signed-off-by: Ravikiran Thirumalai <[email protected]>
>
> Index: linux-2.6.14-rc4/arch/ia64/lib/swiotlb.c
> ===================================================================
> --- linux-2.6.14-rc4.orig/arch/ia64/lib/swiotlb.c 2005-10-17 13:27:35.000000000 -0700
> +++ linux-2.6.14-rc4/arch/ia64/lib/swiotlb.c 2005-10-17 16:00:44.000000000 -0700
> @@ -106,6 +106,8 @@
> __setup("swiotlb=", setup_io_tlb_npages);
> /* make io_tlb_overflow tunable too? */
>
> +#define IS_LOWPAGES(paddr, size) ((paddr < 0xffffffff) && ((paddr+size) < 0xffffffff))
> +
> /*
> * Statically reserve bounce buffer space and initialize bounce buffer data
> * structures for the software IO TLB used to implement the PCI DMA API.
> @@ -114,17 +116,32 @@
> swiotlb_init_with_default_size (size_t default_size)
> {
> unsigned long i;
> + unsigned long iotlbsz;
> + int node;
>
> if (!io_tlb_nslabs) {
> io_tlb_nslabs = (default_size >> IO_TLB_SHIFT);
> io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
> }
>
> + iotlbsz = io_tlb_nslabs * (1 << IO_TLB_SHIFT);
> +
> /*
> - * Get IO TLB memory from the low pages
> + * Get IO TLB memory from the 0-4G range
> */
> - io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs *
> - (1 << IO_TLB_SHIFT));
> +
> + for_each_online_node(node) {
> + io_tlb_start = alloc_bootmem_node(NODE_DATA(node), iotlbsz);
> + if (io_tlb_start) {
> + if (IS_LOWPAGES(virt_to_phys(io_tlb_start), iotlbsz))
> + break;
> + free_bootmem_node(NODE_DATA(node),
> + virt_to_phys(io_tlb_start), iotlbsz);
> + io_tlb_start = NULL;
> + }
> + }
> +
> +
> if (!io_tlb_start)
> panic("Cannot allocate SWIOTLB buffer");
> io_tlb_end = io_tlb_start + io_tlb_nslabs * (1 << IO_TLB_SHIFT);

--
Yasunori Goto

2005-10-18 15:48:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken



On Mon, 17 Oct 2005, Ravikiran G Thirumalai wrote:
>
> I just tried Yasunori-sans patch on our x460. It doesn't fix the problem.
> Attaching the dmesg capture. However, the patch to iterate over nodes and
> allocate bootmem works, and should work for ia64s, boxes with insufficient
> memory on node 0, nodes with just cpus etc. I have requested Alex to try this
> on the superdome (although the the use of swiotlb on superdomes seem to be
> optional). If this works on superdomes, then please apply. This has been
> tested on x460.

This version looks cleaner and smaller, and leaves the rest of the bootmem
code alone.

I vote for this one, assuming everybody who can test is happy.

Linus

2005-10-18 15:50:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: x86_64: 2.6.14-rc4 swiotlb broken



On Tue, 18 Oct 2005, Linus Torvalds wrote:
>
> I vote for this one, assuming everybody who can test is happy.

Of course, just after sending the patch I noticed that there was a new
version, even simpler. Can people test that one?

Linus

2005-10-18 18:51:53

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Tue, Oct 18, 2005 at 07:09:03PM +0900, Yasunori Goto wrote:
>
> I tested your patch, but unfortunately, it doesn't work IA64.
> alloc_bootmem_node() requires bigger area than MAX_DMA_ADDRESS.
> It is defined as 4GB for ia64. (arch/ia64/mm/init.c)
> But this patch require smaller area than 4GB.
> So they are exclusive each other.

No, alloc_bootmem_node should be able to hand out memory below 4G too; so my
approach should work unless some other bootmem request was already served out
of that node earlier. Btw, the default is to allocate 64MB for swiotlb.
Do you modify that for your boxes? IMHO, we should stick to fixing the
stock kernel now for 2.6.14.

However, the IS_LOWPAGES macro in my patch requires that the iotlb start and
end locations both be within 4G. I can change that to within and up to 4G, and
then the patch should work for you too.

Thanks,
Kiran
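
As a quick sanity check on that 64MB default: in swiotlb_init_with_default_size()
the buffer size is io_tlb_nslabs * (1 << IO_TLB_SHIFT), i.e. the (aligned)
default_size, and the placement Alex quoted above spans exactly that much:

  0x8cdc000 - 0x4cdc000 = 0x4000000 bytes = 64 MB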

2005-10-18 19:07:52

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Tue, Oct 18, 2005 at 10:23:51AM +0200, Andi Kleen wrote:
> On Tuesday 18 October 2005 02:16, Ravikiran G Thirumalai wrote:
> > On Mon, Oct 17, 2005 at 02:11:20PM -0700, Linus Torvalds wrote:
> > > On Mon, 17 Oct 2005, Andrew Morton wrote:
> > > > There seem to be a lot of proposed solutions floating about and I fear
> > > > that different people will try to fix this in different ways. Do we
> > > > all agree that this patch is the correct solution to this problem, or
> > > > is something more needed?
> > >
> > > I think this will fix it.
> >
> > I just tried Yasunori-sans patch on our x460. It doesn't fix the problem.
>
> That's surprising. Can you post the full boot log?

I'd already posted that... Here is the link
http://marc.theaimsgroup.com/?l=linux-kernel&m=112959469914681&w=2

> The nodes should be really in order. Maybe we need to sort the SRAT first.

I don't understand this comment. I thought Yasunori-san's patch did not
require the pgdats to be in any order. Anyway, the x460 nodes _are_ in proper
order.

Thanks,
Kiran

2005-10-18 19:54:32

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Tue, Oct 18, 2005 at 08:50:18AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 18 Oct 2005, Linus Torvalds wrote:
> >
> > I vote for this one, assuming everybody who can test is happy.
>
> Of course, just after sending the patch I noticed that there was a new
> version, even simpler. Can people test that one?
>

This version should work for everyone. It falls back to the old 2.6.13
behaviour when it does not find suitable memory from any of the nodes.

Yasunori-san, Alex, can you confirm? (Please use stock 2.6.14.)

Thanks,
Kiran

--

Patch to ensure low32 mem allocation for x86_64 swiotlb

Signed-off-by: Ravikiran Thirumalai <[email protected]>

Index: linux-2.6.14-rc4/arch/ia64/lib/swiotlb.c
===================================================================
--- linux-2.6.14-rc4.orig/arch/ia64/lib/swiotlb.c 2005-10-17 22:48:25.000000000 -0700
+++ linux-2.6.14-rc4/arch/ia64/lib/swiotlb.c 2005-10-18 12:44:17.000000000 -0700
@@ -106,6 +106,8 @@
__setup("swiotlb=", setup_io_tlb_npages);
/* make io_tlb_overflow tunable too? */

+#define IS_LOWPAGES(paddr, size) ((paddr < 0xffffffff) && ((paddr+size) < 0xffffffff))
+
/*
* Statically reserve bounce buffer space and initialize bounce buffer data
* structures for the software IO TLB used to implement the PCI DMA API.
@@ -114,17 +116,43 @@
swiotlb_init_with_default_size (size_t default_size)
{
unsigned long i;
+ unsigned long iotlbsz;
+ int node;

if (!io_tlb_nslabs) {
io_tlb_nslabs = (default_size >> IO_TLB_SHIFT);
io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
}

+ iotlbsz = io_tlb_nslabs * (1 << IO_TLB_SHIFT);
+
/*
- * Get IO TLB memory from the low pages
+ * Get IO TLB memory from the 0-4G range
*/
- io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs *
- (1 << IO_TLB_SHIFT));
+
+ for_each_online_node(node) {
+ io_tlb_start = alloc_bootmem_node(NODE_DATA(node), iotlbsz);
+ if (io_tlb_start) {
+ if (IS_LOWPAGES(virt_to_phys(io_tlb_start), iotlbsz))
+ break;
+ free_bootmem_node(NODE_DATA(node),
+ virt_to_phys(io_tlb_start), iotlbsz);
+ io_tlb_start = NULL;
+ }
+ }
+
+ /*
+ * FIXME: This should go away when the bootmem allocator is fixed to
+ * guarantee lowmem32 allocations somehow, and the swiotlb mess is
+ * cleaned. The alloc_bootmem_low_pages fall back is to ensure
+ * boxes like amd64 which donot use swiotlb but still have
+ * swiotlb compiled in, falls back to the 2.6.13 behaviour instead
+ * of panicking, when proper low32 pages are not available
+ */
+ if (!io_tlb_start)
+ io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs *
+ (1 << IO_TLB_SHIFT));
+
if (!io_tlb_start)
panic("Cannot allocate SWIOTLB buffer");
io_tlb_end = io_tlb_start + io_tlb_nslabs * (1 << IO_TLB_SHIFT);

2005-10-18 21:28:47

by Alex Williamson

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Tue, 2005-10-18 at 12:54 -0700, Ravikiran G Thirumalai wrote:
> On Tue, Oct 18, 2005 at 08:50:18AM -0700, Linus Torvalds wrote:
> >
> >
> > On Tue, 18 Oct 2005, Linus Torvalds wrote:
> > >
> > > I vote for this one, assuming everybody who can test is happy.
> >
> > Of course, just after sending the patch I noticed that there was a new
> > version, even simpler. Can people test that one?
> >
>
> This version should work for everyone. It falls back to the old 2.6.13
> behaviour when it does not find suitable memory from any of the nodes.
>
> Yasunori-san, Alex, can you confirm. (Please use stock 2.6.14)

Oops, I used 2.6.14-rc4-mm1; I'll retest. However, this does work on
the Superdome, not because of the code iterating over the nodes, but
because of the alloc_bootmem_low_pages() fallback. Adding
a printk(), I get this:

Node 0: 0xe000074104e6b200
Node 1: 0xe000082080723000
Node 2: 0xe000000101532000 *Note this is the interleaved memory node
Placing software IO TLB between 0x4cdc000 - 0x8cdc000

Looking at the memory map of the system, I see these major ranges:

Node 2:
  0x00000000000 - 0x0007ffdefff (~2GB)
  0x00100000000 - 0x0017fffffff (2GB)
  0x04080000000 - 0x040f0000000 (2GB)
Node 0:
  0x74100000000 - 0x741fbbfffff (~4GB)
Node 1:
  0x82080000000 - 0x820fb453fff (~2GB)

So, it looks like we're iterating over the nodes, but
alloc_bootmem_node() isn't even guaranteed to try to get memory from the
low memory on that node. Thanks,

Alex



2005-10-18 21:53:56

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Tue, Oct 18, 2005 at 03:28:27PM -0600, Alex Williamson wrote:
> ...
> Oops, I used 2.6.14-rc4-mm1, I'll retest. However, this does work on
> the Superdome. Not because of the iterating over the nodes code, but
> because of the call to alloc_bootmem_low_pages() fallback case. Adding
> a printk(), I get this:
>
> Node 0: 0xe000074104e6b200
> Node 1: 0xe000082080723000
> Node 2: 0xe000000101532000 *Note this is the interleaved memory node
> Placing software IO TLB between 0x4cdc000 - 0x8cdc000
>
> Looking at the memory map of the system, I see these major ranges:
>
> Node 2:
> 0x00000000000 - 0x0007ffdefff (~2GB)
> 0x00100000000 - 0x0017fffffff (2GB)
> 0x04080000000 - 0x040f0000000 (2GB)
> Node 0:
> 0x74100000000 - 0x741fbbfffff (~4GB)
> Node 1:
> 0x82080000000 - 0x820fb453fff (~2GB)
>
> So, it looks like we're iterating over the nodes, but
> alloc_bootmem_node() isn't even guaranteed to try to get memory from the
> low memory on that node.

Thanks Alex. 2.6.14-rc4-mm1 already has the
guarantee-dma-area-for-alloc_bootmem_low.patch by Yasunori-san. So it is
safer to confirm results on latest 2.6.14 stock.

Could it also be that Node 2 is offline when swiotlb is allocated?

Thanks,
Kiran

2005-10-18 22:04:38

by Alex Williamson

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Tue, 2005-10-18 at 14:53 -0700, Ravikiran G Thirumalai wrote:
> On Tue, Oct 18, 2005 at 03:28:27PM -0600, Alex Williamson wrote:
> > So, it looks like we're iterating over the nodes, but
> > alloc_bootmem_node() isn't even guaranteed to try to get memory from the
> > low memory on that node.
>
> Thanks Alex. 2.6.14-rc4-mm1 already has the
> guarantee-dma-area-for-alloc_bootmem_low.patch by Yasunori-san. So it is
> safer to confirm results on latest 2.6.14 stock.

Ok. I'll need to build a stock tree then.

> Could it also be that Node 2 is offline when swiotlb is allocated?

Nope. Note that Node 2 is iterated in the for_each_online_node loop; my
printk is within the body of the loop. Also, the allocation it did get
is still from Node 2. My understanding is that the goal for
alloc_bootmem_node is MAX_DMA_ADDRESS, which on ia64 defaults to 4GB.
So it makes sense that we're getting the second region in the
discontiguous space of that node. This solution therefore relies on one
of the nodes having less than 4GB of zero-based memory. That's a pretty
weak assumption. Thanks,

Alex

--
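
Putting the node map from Alex's earlier mail together with the goal semantics
described above, the sequence on the Superdome works out roughly like this
(commentary only, not code from the thread):

  goal = __pa(MAX_DMA_ADDRESS) = 4GB on ia64
  node 2 spans 0x0-0x7ffdefff, 0x100000000-0x17fffffff, 0x4080000000-0x40f0000000
  the search starts at the goal, so the first fit is in 0x100000000-0x17fffffff
    -> 0xe000000101532000 is returned (phys ~0x101532000, just above 4GB)
    -> IS_LOWPAGES() rejects it, the buffer is freed, and only the
       alloc_bootmem_low_pages() fallback saves the -mm run (0x4cdc000 - 0x8cdc000)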

2005-10-18 22:37:54

by Alex Williamson

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Tue, 2005-10-18 at 16:04 -0600, Alex Williamson wrote:
> On Tue, 2005-10-18 at 14:53 -0700, Ravikiran G Thirumalai wrote:
> > On Tue, Oct 18, 2005 at 03:28:27PM -0600, Alex Williamson wrote:
> > > So, it looks like we're iterating over the nodes, but
> > > alloc_bootmem_node() isn't even guaranteed to try to get memory from the
> > > low memory on that node.
> >
> > Thanks Alex. 2.6.14-rc4-mm1 already has the
> > guarantee-dma-area-for-alloc_bootmem_low.patch by Yasunori-san. So it is
> > safer to confirm results on latest 2.6.14 stock.
>
> Ok. I'll need to build a stock tree then.
>

Nope, it breaks with a current git-2.6.14. Here's what my extra
printk says:

Node 0: 0xe000074104e67200
Node 1: 0xe000082080722000
Node 2: 0xe000000101532000
Placing software IO TLB between 0x74108e68000 - 0x7410ce68000

So it's the same scenario as on -mm with the iterating across nodes, but now
the alloc_bootmem_low_pages() fallback doesn't catch us in the end. This will
fail in any case where MAX_DMA_ADDRESS is 4GB and every node has enough
memory above 4GB for the swiotlb. Thanks,

Alex

2005-10-18 22:47:33

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Tue, Oct 18, 2005 at 04:04:00PM -0600, Alex Williamson wrote:
> On Tue, 2005-10-18 at 14:53 -0700, Ravikiran G Thirumalai wrote:
> > On Tue, Oct 18, 2005 at 03:28:27PM -0600, Alex Williamson wrote:
> > > So, it looks like we're iterating over the nodes, but
> > > alloc_bootmem_node() isn't even guaranteed to try to get memory from the
> > > low memory on that node.
> >
> > Thanks Alex. 2.6.14-rc4-mm1 already has the
> > guarantee-dma-area-for-alloc_bootmem_low.patch by Yasunori-san. So it is
> > safer to confirm results on latest 2.6.14 stock.
>
> Ok. I'll need to build a stock tree then.
>
> > Could it also be that Node 2 is offline when swiotlb is allocated?
>
> Nope. Note that Node2 is iterated in the for_each_online_node, my
> printk is within the body of the loop. Also, the allocation it did get
> is still from Node2. My understanding is that goal for
> alloc_bootmem_node is MAX_DMA_ADDRESS. On ia64, that defaults to 4GB.

Ahhh... and it is 16MB on x86_64. alloc_bootmem_node will never work here
for ia64 then. alloc_bootmem_low_pages_node would work, but then it would
dig into the 16MB DMA area on x86_64.... arrrrgh...

The first cleanup post 2.6.14 should be to separate the swiotlb code for ia64
and x86_64, IMHO.
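
For reference, the asymmetry boils down to the goal that the stock macro passes
down (the macro is from the bootmem.h hunks quoted elsewhere in this thread;
the per-arch values are as stated in these mails):

#define alloc_bootmem_node(pgdat, x) \
        __alloc_bootmem_node((pgdat), (x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))

/* ia64:   __pa(MAX_DMA_ADDRESS) = 4GB  -> the allocator prefers memory at or
 *         above 4GB, which is exactly what swiotlb must avoid
 * x86_64: __pa(MAX_DMA_ADDRESS) = 16MB -> harmless as a goal, but it is also
 *         the ZONE_DMA boundary that alloc_bootmem_low_pages_node() (goal 0)
 *         would dig into
 */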

2005-10-18 23:22:09

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Tue, Oct 18, 2005 at 04:37:03PM -0600, Alex Williamson wrote:
> On Tue, 2005-10-18 at 16:04 -0600, Alex Williamson wrote:
>
> Nope, it breaks with a current git-2.6.14. Here's what my extra
> printk says:
>
> Node 0: 0xe000074104e67200
> Node 1: 0xe000082080722000
> Node 2: 0xe000000101532000
> Placing software IO TLB between 0x74108e68000 - 0x7410ce68000
>

Hope the following works. It uses __alloc_bootmem_node now with a hard-coded
goal to avoid the 16MB DMA zone. It is ugly :( and I hope it works this time
<fingers crossed>.

--

Patch to ensure low32 mem allocation for x86_64 swiotlb

Signed-off-by: Ravikiran Thirumalai <[email protected]>

Index: linux-2.6.14-rc4/arch/ia64/lib/swiotlb.c
===================================================================
--- linux-2.6.14-rc4.orig/arch/ia64/lib/swiotlb.c 2005-10-18 14:14:12.000000000 -0700
+++ linux-2.6.14-rc4/arch/ia64/lib/swiotlb.c 2005-10-18 16:09:51.000000000 -0700
@@ -106,6 +106,8 @@
__setup("swiotlb=", setup_io_tlb_npages);
/* make io_tlb_overflow tunable too? */

+#define IS_LOWPAGES(paddr, size) ((paddr < 0xffffffff) && ((paddr+size) < 0xffffffff))
+
/*
* Statically reserve bounce buffer space and initialize bounce buffer data
* structures for the software IO TLB used to implement the PCI DMA API.
@@ -114,17 +116,46 @@
swiotlb_init_with_default_size (size_t default_size)
{
unsigned long i;
+ unsigned long iotlbsz;
+ int node;

if (!io_tlb_nslabs) {
io_tlb_nslabs = (default_size >> IO_TLB_SHIFT);
io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
}

+ iotlbsz = io_tlb_nslabs * (1 << IO_TLB_SHIFT);
+
/*
- * Get IO TLB memory from the low pages
+ * Get IO TLB memory from the 0-4G range
*/
- io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs *
- (1 << IO_TLB_SHIFT));
+
+ for_each_online_node(node) {
+ /* Ugly, hate it. To be gone post 2.6.14 */
+ io_tlb_start = __alloc_bootmem_node(NODE_DATA(node),
+ iotlbsz, PAGE_SIZE,
+ 0x1000000);
+ if (io_tlb_start) {
+ if (IS_LOWPAGES(virt_to_phys(io_tlb_start), iotlbsz))
+ break;
+ free_bootmem_node(NODE_DATA(node),
+ virt_to_phys(io_tlb_start), iotlbsz);
+ io_tlb_start = NULL;
+ }
+ }
+
+ /*
+ * FIXME: This should go away when the bootmem allocator is fixed to
+ * guarantee lowmem32 allocations somehow, and the swiotlb mess is
+ * cleaned. The alloc_bootmem_low_pages fall back is to ensure
+ * boxes like amd64 which donot use swiotlb but still have
+ * swiotlb compiled in, falls back to the 2.6.13 behaviour instead
+ * of panicking, when proper low32 pages are not available
+ */
+ if (!io_tlb_start)
+ io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs *
+ (1 << IO_TLB_SHIFT));
+
if (!io_tlb_start)
panic("Cannot allocate SWIOTLB buffer");
io_tlb_end = io_tlb_start + io_tlb_nslabs * (1 << IO_TLB_SHIFT);

2005-10-19 01:22:56

by Alex Williamson

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Tue, 2005-10-18 at 16:22 -0700, Ravikiran G Thirumalai wrote:

> Hope the following works. Using __alloc_bootmem_node now with a hard coded
> goal to avoid 16MB DMA zone. It is ugly :( and hope it works this time
> <fingers crossed>.

Nope, it still gives me memory above 4GB. If I change the goal to 0x0 it
works. One nit, shouldn't IS_LOWPAGES() be more like this:

#define IS_LOWPAGES(paddr, size) ((paddr+size-1) < 0x100000000UL)

It's a minor optimization not to check the start, and a 64MB swiotlb starting
at 4GB-64MB would then be accepted as ok. Thanks,

Alex


2005-10-19 02:02:37

by Alex Williamson

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Tue, 2005-10-18 at 19:22 -0600, Alex Williamson wrote:
> On Tue, 2005-10-18 at 16:22 -0700, Ravikiran G Thirumalai wrote:
>
> > Hope the following works. Using __alloc_bootmem_node now with a hard coded
> > goal to avoid 16MB DMA zone. It is ugly :( and hope it works this time
> > <fingers crossed>.
>
> Nope, it still gives me memory above 4GB. If I change goal to 0x0 it
> works.

BTW, the reason 16MB fails is this test in __alloc_bootmem_core():

        if (bdata->last_success >= preferred)
                preferred = bdata->last_success;

That pretty much negates the usefulness of passing in any goal beyond the
0 vs. non-0 distinction. If I comment out this test, the 16MB goal works as
expected, but I get an uninitialized timer in the sym53c8xx driver. Not
sure what's happening there, but apparently the test has some purpose
other than optimization. Thanks,

Alex
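
For context, here is that test in its surroundings in mm/bootmem.c,
reconstructed from the context lines of the patch in the next mail (so this is
the 2.6.14-era code, lightly re-indented):

        if (goal && (goal >= bdata->node_boot_start) &&
            ((goal >> PAGE_SHIFT) < bdata->node_low_pfn)) {
                preferred = goal - bdata->node_boot_start;

                if (bdata->last_success >= preferred)
                        /* the line Alex comments out: last_success can sit far
                           above the caller's 16MB goal and silently override it */
                        preferred = bdata->last_success;
        } else
                preferred = 0;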



2005-10-19 12:47:58

by Yasunori Goto

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

> On Tue, 2005-10-18 at 16:22 -0700, Ravikiran G Thirumalai wrote:
>
> > Hope the following works. Using __alloc_bootmem_node now with a hard coded
> > goal to avoid 16MB DMA zone. It is ugly :( and hope it works this time
> > <fingers crossed>.
>
> Nope, it still gives me memory above 4GB. If I change goal to 0x0 it
> works.
> One nit, shouldn't IS_LOW_AGES() be more like this:
>
> #define IS_LOWPAGES(paddr, size) ((paddr+size-1) < 0x100000000UL)
>
> Minor optimization not checking the start, but a 64MB swiotlb starting
> at 4GB-64MB should be found as ok. Thanks,
>
> Alex
>


Hmm.....
How about this patch? It is another way.

I think the true issue is that there is no way for the requester to
specify a maximum address to __alloc_bootmem_core().

"goal" is just there to keep the lower address space free as much as possible,
and __alloc_bootmem_core() does not care about a maximum address for the
requester. I suppose that is a bit strange; the swiotlb case is a good
example of it.

So I made a patch so that __alloc_bootmem_core() cares about it.

Thanks.


Signed-off-by: Yasunori Goto <[email protected]>
-------

Index: 3rdbootmem/arch/ia64/lib/swiotlb.c
===================================================================
--- 3rdbootmem.orig/arch/ia64/lib/swiotlb.c 2005-10-19 21:05:15.000000000 +0900
+++ 3rdbootmem/arch/ia64/lib/swiotlb.c 2005-10-19 21:05:18.000000000 +0900
@@ -123,8 +123,8 @@ swiotlb_init_with_default_size (size_t d
/*
* Get IO TLB memory from the low pages
*/
- io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs *
- (1 << IO_TLB_SHIFT));
+ io_tlb_start = alloc_bootmem_low_pages_limit(io_tlb_nslabs *
+ (1 << IO_TLB_SHIFT), 0x100000000);
if (!io_tlb_start)
panic("Cannot allocate SWIOTLB buffer");
io_tlb_end = io_tlb_start + io_tlb_nslabs * (1 << IO_TLB_SHIFT);
Index: 3rdbootmem/include/linux/bootmem.h
===================================================================
--- 3rdbootmem.orig/include/linux/bootmem.h 2005-10-19 21:05:15.000000000 +0900
+++ 3rdbootmem/include/linux/bootmem.h 2005-10-19 21:09:21.000000000 +0900
@@ -43,7 +43,7 @@ typedef struct bootmem_data {
extern unsigned long __init bootmem_bootmap_pages (unsigned long);
extern unsigned long __init init_bootmem (unsigned long addr, unsigned long memend);
extern void __init free_bootmem (unsigned long addr, unsigned long size);
-extern void * __init __alloc_bootmem (unsigned long size, unsigned long align, unsigned long goal);
+extern void * __init __alloc_bootmem_limit (unsigned long size, unsigned long align, unsigned long goal, unsigned long limit);
#ifndef CONFIG_HAVE_ARCH_BOOTMEM_NODE
extern void __init reserve_bootmem (unsigned long addr, unsigned long size);
#define alloc_bootmem(x) \
@@ -54,6 +54,16 @@ extern void __init reserve_bootmem (unsi
__alloc_bootmem((x), PAGE_SIZE, __pa(MAX_DMA_ADDRESS))
#define alloc_bootmem_low_pages(x) \
__alloc_bootmem((x), PAGE_SIZE, 0)
+
+#define alloc_bootmem_limit(x, limit) \
+ __alloc_bootmem_limit((x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS), (limit))
+#define alloc_bootmem_low_limit(x, limit) \
+ __alloc_bootmem_limit((x), SMP_CACHE_BYTES, 0, (limit))
+#define alloc_bootmem_pages_limit(x, limit) \
+ __alloc_bootmem_limit((x), PAGE_SIZE, __pa(MAX_DMA_ADDRESS), (limit))
+#define alloc_bootmem_low_pages_limit(x, limit) \
+ __alloc_bootmem_limit((x), PAGE_SIZE, 0, (limit))
+
#endif /* !CONFIG_HAVE_ARCH_BOOTMEM_NODE */
extern unsigned long __init free_all_bootmem (void);

@@ -61,7 +71,7 @@ extern unsigned long __init init_bootmem
extern void __init reserve_bootmem_node (pg_data_t *pgdat, unsigned long physaddr, unsigned long size);
extern void __init free_bootmem_node (pg_data_t *pgdat, unsigned long addr, unsigned long size);
extern unsigned long __init free_all_bootmem_node (pg_data_t *pgdat);
-extern void * __init __alloc_bootmem_node (pg_data_t *pgdat, unsigned long size, unsigned long align, unsigned long goal);
+extern void * __init __alloc_bootmem_node_limit (pg_data_t *pgdat, unsigned long size, unsigned long align, unsigned long goal, unsigned long limit);
#ifndef CONFIG_HAVE_ARCH_BOOTMEM_NODE
#define alloc_bootmem_node(pgdat, x) \
__alloc_bootmem_node((pgdat), (x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))
@@ -69,6 +79,14 @@ extern void * __init __alloc_bootmem_nod
__alloc_bootmem_node((pgdat), (x), PAGE_SIZE, __pa(MAX_DMA_ADDRESS))
#define alloc_bootmem_low_pages_node(pgdat, x) \
__alloc_bootmem_node((pgdat), (x), PAGE_SIZE, 0)
+
+#define alloc_bootmem_node_limit(pgdat, x, limit) \
+ __alloc_bootmem_node_limit((pgdat), (x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS), (limit))
+#define alloc_bootmem_pages_node_limit(pgdat, x, limit) \
+ __alloc_bootmem_node_limit((pgdat), (x), PAGE_SIZE, __pa(MAX_DMA_ADDRESS), (limit))
+#define alloc_bootmem_low_pages_node_limit(pgdat, x, limit) \
+ __alloc_bootmem_node_limit((pgdat), (x), PAGE_SIZE, 0, (limit))
+
#endif /* !CONFIG_HAVE_ARCH_BOOTMEM_NODE */

#ifdef CONFIG_HAVE_ARCH_ALLOC_REMAP
@@ -105,5 +123,15 @@ extern void *__init alloc_large_system_h
#endif
extern int __initdata hashdist; /* Distribute hashes across NUMA nodes? */

+static inline void *__alloc_bootmem (unsigned long size, unsigned long align, unsigned long goal)
+{
+ return __alloc_bootmem_limit(size, align, goal, 0);
+}
+
+static inline void *__alloc_bootmem_node (pg_data_t *pgdat, unsigned long size, unsigned long align,
+ unsigned long goal)
+{
+ return __alloc_bootmem_node_limit(pgdat, size, align, goal, 0);
+}

#endif /* _LINUX_BOOTMEM_H */
Index: 3rdbootmem/mm/bootmem.c
===================================================================
--- 3rdbootmem.orig/mm/bootmem.c 2005-10-19 21:05:15.000000000 +0900
+++ 3rdbootmem/mm/bootmem.c 2005-10-19 21:05:18.000000000 +0900
@@ -154,10 +154,10 @@ static void __init free_bootmem_core(boo
*/
static void * __init
__alloc_bootmem_core(struct bootmem_data *bdata, unsigned long size,
- unsigned long align, unsigned long goal)
+ unsigned long align, unsigned long goal, unsigned long limit)
{
unsigned long offset, remaining_size, areasize, preferred;
- unsigned long i, start = 0, incr, eidx;
+ unsigned long i, start = 0, incr, eidx, end_pfn = bdata->node_low_pfn;
void *ret;

if(!size) {
@@ -166,7 +166,14 @@ __alloc_bootmem_core(struct bootmem_data
}
BUG_ON(align & (align-1));

- eidx = bdata->node_low_pfn - (bdata->node_boot_start >> PAGE_SHIFT);
+ if (limit && bdata->node_boot_start >= limit)
+ return NULL;
+
+ limit >>=PAGE_SHIFT;
+ if (limit && end_pfn > limit)
+ end_pfn = limit;
+
+ eidx = end_pfn - (bdata->node_boot_start >> PAGE_SHIFT);
offset = 0;
if (align &&
(bdata->node_boot_start & (align - 1UL)) != 0)
@@ -178,11 +185,12 @@ __alloc_bootmem_core(struct bootmem_data
* first, then we try to allocate lower pages.
*/
if (goal && (goal >= bdata->node_boot_start) &&
- ((goal >> PAGE_SHIFT) < bdata->node_low_pfn)) {
+ ((goal >> PAGE_SHIFT) < end_pfn)) {
preferred = goal - bdata->node_boot_start;

if (bdata->last_success >= preferred)
- preferred = bdata->last_success;
+ if (!limit || (limit && limit > bdata->last_success))
+ preferred = bdata->last_success;
} else
preferred = 0;

@@ -382,14 +390,15 @@ unsigned long __init free_all_bootmem (v
return(free_all_bootmem_core(NODE_DATA(0)));
}

-void * __init __alloc_bootmem (unsigned long size, unsigned long align, unsigned long goal)
+void * __init __alloc_bootmem_limit (unsigned long size, unsigned long align, unsigned long goal,
+ unsigned long limit)
{
pg_data_t *pgdat = pgdat_list;
void *ptr;

for_each_pgdat(pgdat)
if ((ptr = __alloc_bootmem_core(pgdat->bdata, size,
- align, goal)))
+ align, goal, limit)))
return(ptr);

/*
@@ -400,14 +409,16 @@ void * __init __alloc_bootmem (unsigned
return NULL;
}

-void * __init __alloc_bootmem_node (pg_data_t *pgdat, unsigned long size, unsigned long align, unsigned long goal)
+
+void * __init __alloc_bootmem_node_limit (pg_data_t *pgdat, unsigned long size, unsigned long align,
+ unsigned long goal, unsigned long limit)
{
void *ptr;

- ptr = __alloc_bootmem_core(pgdat->bdata, size, align, goal);
+ ptr = __alloc_bootmem_core(pgdat->bdata, size, align, goal, limit);
if (ptr)
return (ptr);

- return __alloc_bootmem(size, align, goal);
+ return __alloc_bootmem_limit(size, align, goal, limit);
}



--
Yasunori Goto
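
To see what the new limit changes, trace it against the Superdome layout Alex
posted earlier (commentary on the patch above, not code from it):

  limit = 0x100000000 (4GB), passed in via alloc_bootmem_low_pages_limit()

  node 0: node_boot_start = 0x74100000000 >= limit -> __alloc_bootmem_core()
          returns NULL right away, node skipped
  node 1: node_boot_start = 0x82080000000 >= limit -> skipped as well
  node 2: node_boot_start = 0x0 < limit -> end_pfn is clamped to
          0x100000000 >> PAGE_SHIFT, so only the memory below 4GB is searched

which is consistent with the placement Alex reports in the next mail.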

2005-10-19 14:19:55

by Alex Williamson

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Wed, 2005-10-19 at 21:47 +0900, Yasunori Goto wrote:
> Hmm.....
> How is this patch? This is another way.
>
> I think that true issue is there is no way for requester to
> specify maxmum address at __alloc_bootmem_core().
>
> "goal" is just to keep space lower address as much as possible.
> and __alloc_bootmem_core() doesn't care about max address for requester.
> I suppose it is a bit strange. The swiotlb's case is good example
> by it.
>
> So, I made a patch that __alloc_bootmem_core() cares it.

This works on the Superdome. The swiotlb is shown as:

Placing software IO TLB between 0x4d48000 - 0x8d48000

Thanks,

Alex

2005-10-19 17:18:31

by Jon Mason

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

I have run a few tests on the original code and the patches posted to the list,
and have some interesting results. First, my system setup: Dual Opteron, 8GB
RAM, SIL SATA controller (which is apparently 32bit), pcnet32 NIC (compiled as
a module) connected to the network. The latter 2 should show any bounce
buffer problems.

I get the following behavior:
- With the AMD HW IOMMU (AGP_GART), everything works perfectly.
- With the IOMMU disabled (iommu=off), my kernel panics in SATA.
- With the SW IOMMU (iommu=soft), my kernel is unable to find any partitions on
my SATA disk (and panics).
- With the SW IOMMU and Ravikiran's original patch, the kernel boots with no
apparent problems.
- With the SW IOMMU and Ravikiran's second patch, the kernel boots with no
apparent problems.
- With the SW IOMMU and Yasunori's patch, the kernel boots with no apparent
problems.

The edited dmesg output can be found below.

From the above, we can tell that my system requires either a hardware IOMMU or
bounce buffers for the SATA disk. We can also tell that the current
implementation has a bug (where the actual cause lies is unknown). All of the
patches in this e-mail thread seem to fix my problem, and I saw no other
noticeable issues.

If anyone wants me to try anything on my system, I'll be happy to help.

Thanks,
Jon

----------------------------------------------------

with Yasunori's last patch and with iommu=soft
[...]
Placing software IO TLB between 0x4821000 - 0x8821000
[...]
PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
PCI: Bridge: 0000:00:06.0
IO window: 2000-2fff
MEM window: f9700000-f97fffff
PREFETCH window: f9000000-f90fffff
PCI: Bridge: 0000:00:0a.0
IO window: disabled.
MEM window: disabled.
PREFETCH window: disabled.
PCI: Bridge: 0000:00:0b.0
IO window: disabled.
MEM window: f9800000-f98fffff
PREFETCH window: disabled.
PCI: Failed to allocate mem resource #6:20000@f8000000 for 0000:09:00.0
PCI: Bridge: 0000:08:01.0
IO window: disabled.
MEM window: f8000000-f8ffffff
PREFETCH window: f0000000-f7ffffff
PCI: Bridge: 0000:08:03.0
IO window: 3000-3fff
MEM window: f9200000-f92fffff
PREFETCH window: f9100000-f91fffff
PCI: Bridge: 0000:08:04.0
IO window: 4000-4fff
MEM window: f9300000-f93fffff
PREFETCH window: f9500000-f95fffff
[...]
scsi2 : sata_sil
ata2: no device found (phy stat 00000000)
scsi3 : sata_sil
Vendor: ATA Model: Maxtor 6Y080M0 Rev: YAR5
Type: Direct-Access ANSI SCSI revision: 05
SCSI device sda: 156312576 512-byte hdwr sectors (80032 MB)
SCSI device sda: drive cache: write back
SCSI device sda: 156312576 512-byte hdwr sectors (80032 MB)
SCSI device sda: drive cache: write back
sda: sda1 sda2 sda3
[...]


with Ravikiran's second patch and with iommu=soft
[...]
Placing software IO TLB between 0x4821000 - 0x8821000
[...]
PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
PCI: Bridge: 0000:00:06.0
IO window: 2000-2fff
MEM window: f9700000-f97fffff
PREFETCH window: f9000000-f90fffff
PCI: Bridge: 0000:00:0a.0
IO window: disabled.
MEM window: disabled.
PREFETCH window: disabled.
PCI: Bridge: 0000:00:0b.0
IO window: disabled.
MEM window: f9800000-f98fffff
PREFETCH window: disabled.
PCI: Failed to allocate mem resource #6:20000@f8000000 for 0000:09:00.0
PCI: Bridge: 0000:08:01.0
IO window: disabled.
MEM window: f8000000-f8ffffff
PREFETCH window: f0000000-f7ffffff
PCI: Bridge: 0000:08:03.0
IO window: 3000-3fff
MEM window: f9200000-f92fffff
PREFETCH window: f9100000-f91fffff
PCI: Bridge: 0000:08:04.0
IO window: 4000-4fff
MEM window: f9300000-f93fffff
PREFETCH window: f9500000-f95fffff
[...]
scsi2 : sata_sil
ata2: no device found (phy stat 00000000)
scsi3 : sata_sil
Vendor: ATA Model: Maxtor 6Y080M0 Rev: YAR5
Type: Direct-Access ANSI SCSI revision: 05
SCSI device sda: 156312576 512-byte hdwr sectors (80032 MB)
SCSI device sda: drive cache: write back
SCSI device sda: 156312576 512-byte hdwr sectors (80032 MB)
SCSI device sda: drive cache: write back
sda: sda1 sda2 sda3
[...]


with Ravikiran's first patch and with iommu=soft
[...]
Placing software IO TLB between 0x4820180 - 0x8820180
[...]
PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
PCI: Bridge: 0000:00:06.0
IO window: 2000-2fff
MEM window: f9700000-f97fffff
PREFETCH window: f9000000-f90fffff
PCI: Bridge: 0000:00:0a.0
IO window: disabled.
MEM window: disabled.
PREFETCH window: disabled.
PCI: Bridge: 0000:00:0b.0
IO window: disabled.
MEM window: f9800000-f98fffff
PREFETCH window: disabled.
PCI: Failed to allocate mem resource #6:20000@f8000000 for 0000:09:00.0
PCI: Bridge: 0000:08:01.0
IO window: disabled.
MEM window: f8000000-f8ffffff
PREFETCH window: f0000000-f7ffffff
PCI: Bridge: 0000:08:03.0
IO window: 3000-3fff
MEM window: f9200000-f92fffff
PREFETCH window: f9100000-f91fffff
PCI: Bridge: 0000:08:04.0
IO window: 4000-4fff
MEM window: f9300000-f93fffff
PREFETCH window: f9500000-f95fffff
[...]
scsi2 : sata_sil
ata2: no device found (phy stat 00000000)
scsi3 : sata_sil
Vendor: ATA Model: Maxtor 6Y080M0 Rev: YAR5
Type: Direct-Access ANSI SCSI revision: 05
SCSI device sda: 156312576 512-byte hdwr sectors (80032 MB)
SCSI device sda: drive cache: write back
SCSI device sda: 156312576 512-byte hdwr sectors (80032 MB)
SCSI device sda: drive cache: write back
sda: sda1 sda2 sda3
[...]



without patch and with iommu=soft
[...]
Placing software IO TLB between 0x104466000 - 0x108466000
[...]
PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
PCI: Bridge: 0000:00:06.0
IO window: 2000-2fff
MEM window: f9700000-f97fffff
PREFETCH window: f9000000-f90fffff
PCI: Bridge: 0000:00:0a.0
IO window: disabled.
MEM window: disabled.
PREFETCH window: disabled.
PCI: Bridge: 0000:00:0b.0
IO window: disabled.
MEM window: f9800000-f98fffff
PREFETCH window: disabled.
PCI: Failed to allocate mem resource #6:20000@f8000000 for 0000:09:00.0
PCI: Bridge: 0000:08:01.0
IO window: disabled.
MEM window: f8000000-f8ffffff
PREFETCH window: f0000000-f7ffffff
PCI: Bridge: 0000:08:03.0
IO window: 3000-3fff
MEM window: f9200000-f92fffff
PREFETCH window: f9100000-f91fffff
PCI: Bridge: 0000:08:04.0
IO window: 4000-4fff
MEM window: f9300000-f93fffff
PREFETCH window: f9500000-f95fffff
[...]
scsi2 : sata_sil
ata2: no device found (phy stat 00000000)
scsi3 : sata_sil
Vendor: ATA Model: Maxtor 6Y080M0 Rev: YAR5
Type: Direct-Access ANSI SCSI revision: 05
SCSI device sda: 156312576 512-byte hdwr sectors (80032 MB)
SCSI device sda: drive cache: write back
SCSI device sda: 156312576 512-byte hdwr sectors (80032 MB)
SCSI device sda: drive cache: write back
sda: unknown partition table
[...]
VFS: Cannot open root device "sda2" or unknown-block(8,2)
Please append a correct "root=" boot option
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(8,2)


without patch and with iommu=off
[...]
PCI-DMA: Disabling IOMMU.
PCI: Bridge: 0000:00:06.0
IO window: 2000-2fff
MEM window: f9700000-f97fffff
PREFETCH window: f9000000-f90fffff
PCI: Bridge: 0000:00:0a.0
IO window: disabled.
MEM window: disabled.
PREFETCH window: disabled.
PCI: Bridge: 0000:00:0b.0
IO window: disabled.
MEM window: f9800000-f98fffff
PREFETCH window: disabled.
PCI: Failed to allocate mem resource #6:20000@f8000000 for 0000:09:00.0
PCI: Bridge: 0000:08:01.0
IO window: disabled.
MEM window: f8000000-f8ffffff
PREFETCH window: f0000000-f7ffffff
PCI: Bridge: 0000:08:03.0
IO window: 3000-3fff
MEM window: f9200000-f92fffff
PREFETCH window: f9100000-f91fffff
PCI: Bridge: 0000:08:04.0
IO window: 4000-4fff
MEM window: f9300000-f93fffff
PREFETCH window: f9500000-f95fffff
[...]
scsi2 : sata_sil
ata2: no device found (phy stat 00000000)
scsi3 : sata_sil
Vendor: ATA Model: Maxtor 6Y080M0 Rev: YAR5
Type: Direct-Access ANSI SCSI revision: 05
SCSI device sda: 156312576 512-byte hdwr sectors (80032 MB)
SCSI device sda: drive cache: write back
SCSI device sda: 156312576 512-byte hdwr sectors (80032 MB)
SCSI device sda: drive cache: write back
sda:<0>Kernel panic - not syncing: PCI-DMA: high address but no IOMMU.

2005-10-19 18:07:12

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Wed, Oct 19, 2005 at 09:47:02PM +0900, Yasunori Goto wrote:
> > On Tue, 2005-10-18 at 16:22 -0700, Ravikiran G Thirumalai wrote:
>
>
> Hmm.....
> How is this patch? This is another way.
>
> I think that true issue is there is no way for requester to
> specify maxmum address at __alloc_bootmem_core().
>
> "goal" is just to keep space lower address as much as possible.
> and __alloc_bootmem_core() doesn't care about max address for requester.
> I suppose it is a bit strange. The swiotlb's case is good example
> by it.

This works for me too. After this patch I see
[ 400.495902] Placing software IO TLB between 0x722a000 - 0xb22a000
which means the new patch is not digging into the 16MB x86_64 DMA area either.

Linus, would you apply this for 2.6.14? This is the patch that works for
both x86_64 and ia64 boxes. I limited my approaches to not touching the core
bootmem allocator for 2.6.14, but that doesn't seem to work for ia64 boxes.
Fixing bootmem is the correct approach IMHO. But in case you feel
this is too intrusive for 2.6.14, we can whip up an ugly #ifdef CONFIG_X86_64
patch that touches swiotlb.c only.

Thanks Yasunori-san for the patch, and Alex for testing out all the patches.

Thanks.
Kiran

2005-10-19 20:45:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken



On Wed, 19 Oct 2005, Ravikiran G Thirumalai wrote:
>
> Linus would you apply this for 2.6.14? This is the patch which works for
> both x86_64 and ia64 boxes.

Ok, I'd love to have something that people finally agree on.

Can you re-post the final version as such, with explanations for the
commit messages and the sign-off, and people who have issues with it
_please_ speak up asap?

Linus

2005-10-19 22:52:30

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Wed, Oct 19, 2005 at 01:45:21PM -0700, Linus Torvalds wrote:
> ...
> Can you re-post the final version as such, with explanations for the
> commit messages and the sign-off, and people who have issues with it
> _please_ speak up asap?

The final version, which works for everyone, is from Yasunori Goto.

His patch introduces a limit parameter to the core bootmem allocator; this
new parameter indicates that physical memory allocated by the bootmem allocator
should be within the requested limit. The patch also introduces the
alloc_bootmem_low_pages_limit(), alloc_bootmem_node_limit() and
alloc_bootmem_low_pages_node_limit() apis, but alloc_bootmem_low_pages_limit()
is the only one used for swiotlb. IMO, instead of introducing xxx_limit apis,
the existing alloc_bootmem_low_pages() api could instead be changed and made
to pass the right limit to the core allocator. But that would make the
patch more intrusive for 2.6.14, as other arches use alloc_bootmem_low_pages().
(So maybe that can be done post 2.6.14, if Yasunori-san is OK with it.)

With the patch, swiotlb gets memory within 4G on both the x86_64 and ia64 arches.
Here's his post FYR.

Thanks,
Kiran
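
For quick reference, the api actually used by swiotlb expands as follows; both
the macro and the call site are taken from the hunks in the forwarded patch below:

#define alloc_bootmem_low_pages_limit(x, limit) \
        __alloc_bootmem_limit((x), PAGE_SIZE, 0, (limit))

        /* call site in swiotlb_init_with_default_size(): goal 0, hard limit 4GB */
        io_tlb_start = alloc_bootmem_low_pages_limit(io_tlb_nslabs *
                                        (1 << IO_TLB_SHIFT), 0x100000000);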

--

From: Yasunori Goto <[email protected]>
To: Alex Williamson <[email protected]>,
Ravikiran G Thirumalai <[email protected]>
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken
Cc: Linus Torvalds <[email protected]>, Andrew Morton <[email protected]>,
Andi Kleen <[email protected]>, [email protected],
[email protected], [email protected], [email protected]

> On Tue, 2005-10-18 at 16:22 -0700, Ravikiran G Thirumalai wrote:
>
> > Hope the following works. Using __alloc_bootmem_node now with a hard coded
> > goal to avoid 16MB DMA zone. It is ugly :( and hope it works this time
> > <fingers crossed>.
>
> Nope, it still gives me memory above 4GB. If I change goal to 0x0 it
> works.
> One nit, shouldn't IS_LOW_AGES() be more like this:
>
> #define IS_LOWPAGES(paddr, size) ((paddr+size-1) < 0x100000000UL)
>
> It's a minor optimization not to check the start address, but a 64MB swiotlb
> starting at 4GB-64MB should still be accepted as ok. Thanks,
>
> Alex
>
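
To spell out the boundary case Alex is describing (user-space illustration
only; the constant is widened to ULL so the check also behaves on 32-bit
hosts):

#include <stdio.h>

#define IS_LOWPAGES(paddr, size)  (((paddr) + (size) - 1) < 0x100000000ULL)

int main(void)
{
        unsigned long long size = 64ULL << 20;             /* 64MB swiotlb */

        /* starts at 4GB - 64MB and ends exactly at 4GB: accepted */
        printf("starts at 4GB-64MB: %s\n",
               IS_LOWPAGES(0x100000000ULL - size, size) ? "ok" : "rejected");
        /* straddles the 4GB boundary: rejected */
        printf("straddles 4GB:      %s\n",
               IS_LOWPAGES(0x100000000ULL - size / 2, size) ? "ok" : "rejected");
        return 0;
}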


Hmm.....
How about this patch? It takes another approach.

I think the real issue is that there is no way for a requester to
specify a maximum address at __alloc_bootmem_core().

"goal" is just there to preserve lower address space as much as possible,
and __alloc_bootmem_core() doesn't enforce a maximum address for the requester.
I suppose that is a bit strange. The swiotlb case is a good example
of this.

So, I made a patch so that __alloc_bootmem_core() takes care of it.
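
Condensed, the semantics the patch below gives the core allocator are roughly
these (illustrative pseudo-check only, not kernel code):

/* "goal" is only a preference for where the search should start;
 * a non-zero "limit" is a hard upper bound on the whole allocation;
 * limit == 0 keeps the old behaviour, i.e. "no limit". */
int satisfies_request(unsigned long addr, unsigned long size,
                      unsigned long goal, unsigned long limit)
{
        (void)goal;                     /* preferred, never enforced */
        if (limit && addr + size > limit)
                return 0;               /* would cross the requester's limit */
        return 1;
}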

Thanks.


Signed-off-by: Yasunori Goto <[email protected]>
-------

Index: 3rdbootmem/arch/ia64/lib/swiotlb.c
===================================================================
--- 3rdbootmem.orig/arch/ia64/lib/swiotlb.c 2005-10-19 21:05:15.000000000 +0900
+++ 3rdbootmem/arch/ia64/lib/swiotlb.c 2005-10-19 21:05:18.000000000 +0900
@@ -123,8 +123,8 @@ swiotlb_init_with_default_size (size_t d
/*
* Get IO TLB memory from the low pages
*/
- io_tlb_start = alloc_bootmem_low_pages(io_tlb_nslabs *
- (1 << IO_TLB_SHIFT));
+ io_tlb_start = alloc_bootmem_low_pages_limit(io_tlb_nslabs *
+ (1 << IO_TLB_SHIFT), 0x100000000);
if (!io_tlb_start)
panic("Cannot allocate SWIOTLB buffer");
io_tlb_end = io_tlb_start + io_tlb_nslabs * (1 << IO_TLB_SHIFT);
Index: 3rdbootmem/include/linux/bootmem.h
===================================================================
--- 3rdbootmem.orig/include/linux/bootmem.h 2005-10-19 21:05:15.000000000 +0900
+++ 3rdbootmem/include/linux/bootmem.h 2005-10-19 21:09:21.000000000 +0900
@@ -43,7 +43,7 @@ typedef struct bootmem_data {
extern unsigned long __init bootmem_bootmap_pages (unsigned long);
extern unsigned long __init init_bootmem (unsigned long addr, unsigned long memend);
extern void __init free_bootmem (unsigned long addr, unsigned long size);
-extern void * __init __alloc_bootmem (unsigned long size, unsigned long align, unsigned long goal);
+extern void * __init __alloc_bootmem_limit (unsigned long size, unsigned long align, unsigned long goal, unsigned long limit);
#ifndef CONFIG_HAVE_ARCH_BOOTMEM_NODE
extern void __init reserve_bootmem (unsigned long addr, unsigned long size);
#define alloc_bootmem(x) \
@@ -54,6 +54,16 @@ extern void __init reserve_bootmem (unsi
__alloc_bootmem((x), PAGE_SIZE, __pa(MAX_DMA_ADDRESS))
#define alloc_bootmem_low_pages(x) \
__alloc_bootmem((x), PAGE_SIZE, 0)
+
+#define alloc_bootmem_limit(x, limit) \
+ __alloc_bootmem_limit((x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS), (limit))
+#define alloc_bootmem_low_limit(x, limit) \
+ __alloc_bootmem_limit((x), SMP_CACHE_BYTES, 0, (limit))
+#define alloc_bootmem_pages_limit(x, limit) \
+ __alloc_bootmem_limit((x), PAGE_SIZE, __pa(MAX_DMA_ADDRESS), (limit))
+#define alloc_bootmem_low_pages_limit(x, limit) \
+ __alloc_bootmem_limit((x), PAGE_SIZE, 0, (limit))
+
#endif /* !CONFIG_HAVE_ARCH_BOOTMEM_NODE */
extern unsigned long __init free_all_bootmem (void);

@@ -61,7 +71,7 @@ extern unsigned long __init init_bootmem
extern void __init reserve_bootmem_node (pg_data_t *pgdat, unsigned long physaddr, unsigned long size);
extern void __init free_bootmem_node (pg_data_t *pgdat, unsigned long addr, unsigned long size);
extern unsigned long __init free_all_bootmem_node (pg_data_t *pgdat);
-extern void * __init __alloc_bootmem_node (pg_data_t *pgdat, unsigned long size, unsigned long align, unsigned long goal);
+extern void * __init __alloc_bootmem_node_limit (pg_data_t *pgdat, unsigned long size, unsigned long align, unsigned long goal, unsigned long limit);
#ifndef CONFIG_HAVE_ARCH_BOOTMEM_NODE
#define alloc_bootmem_node(pgdat, x) \
__alloc_bootmem_node((pgdat), (x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))
@@ -69,6 +79,14 @@ extern void * __init __alloc_bootmem_nod
__alloc_bootmem_node((pgdat), (x), PAGE_SIZE, __pa(MAX_DMA_ADDRESS))
#define alloc_bootmem_low_pages_node(pgdat, x) \
__alloc_bootmem_node((pgdat), (x), PAGE_SIZE, 0)
+
+#define alloc_bootmem_node_limit(pgdat, x, limit) \
+ __alloc_bootmem_node_limit((pgdat), (x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS), (limit))
+#define alloc_bootmem_pages_node_limit(pgdat, x, limit) \
+ __alloc_bootmem_node_limit((pgdat), (x), PAGE_SIZE, __pa(MAX_DMA_ADDRESS), (limit))
+#define alloc_bootmem_low_pages_node_limit(pgdat, x, limit) \
+ __alloc_bootmem_node_limit((pgdat), (x), PAGE_SIZE, 0, (limit))
+
#endif /* !CONFIG_HAVE_ARCH_BOOTMEM_NODE */

#ifdef CONFIG_HAVE_ARCH_ALLOC_REMAP
@@ -105,5 +123,15 @@ extern void *__init alloc_large_system_h
#endif
extern int __initdata hashdist; /* Distribute hashes across NUMA nodes? */

+static inline void *__alloc_bootmem (unsigned long size, unsigned long align, unsigned long goal)
+{
+ return __alloc_bootmem_limit(size, align, goal, 0);
+}
+
+static inline void *__alloc_bootmem_node (pg_data_t *pgdat, unsigned long size, unsigned long align,
+ unsigned long goal)
+{
+ return __alloc_bootmem_node_limit(pgdat, size, align, goal, 0);
+}

#endif /* _LINUX_BOOTMEM_H */
Index: 3rdbootmem/mm/bootmem.c
===================================================================
--- 3rdbootmem.orig/mm/bootmem.c 2005-10-19 21:05:15.000000000 +0900
+++ 3rdbootmem/mm/bootmem.c 2005-10-19 21:05:18.000000000 +0900
@@ -154,10 +154,10 @@ static void __init free_bootmem_core(boo
*/
static void * __init
__alloc_bootmem_core(struct bootmem_data *bdata, unsigned long size,
- unsigned long align, unsigned long goal)
+ unsigned long align, unsigned long goal, unsigned long limit)
{
unsigned long offset, remaining_size, areasize, preferred;
- unsigned long i, start = 0, incr, eidx;
+ unsigned long i, start = 0, incr, eidx, end_pfn = bdata->node_low_pfn;
void *ret;

if(!size) {
@@ -166,7 +166,14 @@ __alloc_bootmem_core(struct bootmem_data
}
BUG_ON(align & (align-1));

- eidx = bdata->node_low_pfn - (bdata->node_boot_start >> PAGE_SHIFT);
+ if (limit && bdata->node_boot_start >= limit)
+ return NULL;
+
+ limit >>= PAGE_SHIFT;
+ if (limit && end_pfn > limit)
+ end_pfn = limit;
+
+ eidx = end_pfn - (bdata->node_boot_start >> PAGE_SHIFT);
offset = 0;
if (align &&
(bdata->node_boot_start & (align - 1UL)) != 0)
@@ -178,11 +185,12 @@ __alloc_bootmem_core(struct bootmem_data
* first, then we try to allocate lower pages.
*/
if (goal && (goal >= bdata->node_boot_start) &&
- ((goal >> PAGE_SHIFT) < bdata->node_low_pfn)) {
+ ((goal >> PAGE_SHIFT) < end_pfn)) {
preferred = goal - bdata->node_boot_start;

if (bdata->last_success >= preferred)
- preferred = bdata->last_success;
+ if (!limit || (limit && limit > bdata->last_success))
+ preferred = bdata->last_success;
} else
preferred = 0;

@@ -382,14 +390,15 @@ unsigned long __init free_all_bootmem (v
return(free_all_bootmem_core(NODE_DATA(0)));
}

-void * __init __alloc_bootmem (unsigned long size, unsigned long align, unsigned long goal)
+void * __init __alloc_bootmem_limit (unsigned long size, unsigned long align, unsigned long goal,
+ unsigned long limit)
{
pg_data_t *pgdat = pgdat_list;
void *ptr;

for_each_pgdat(pgdat)
if ((ptr = __alloc_bootmem_core(pgdat->bdata, size,
- align, goal)))
+ align, goal, limit)))
return(ptr);

/*
@@ -400,14 +409,16 @@ void * __init __alloc_bootmem (unsigned
return NULL;
}

-void * __init __alloc_bootmem_node (pg_data_t *pgdat, unsigned long size, unsigned long align, unsigned long goal)
+
+void * __init __alloc_bootmem_node_limit (pg_data_t *pgdat, unsigned long size, unsigned long align,
+ unsigned long goal, unsigned long limit)
{
void *ptr;

- ptr = __alloc_bootmem_core(pgdat->bdata, size, align, goal);
+ ptr = __alloc_bootmem_core(pgdat->bdata, size, align, goal, limit);
if (ptr)
return (ptr);

- return __alloc_bootmem(size, align, goal);
+ return __alloc_bootmem_limit(size, align, goal, limit);
}
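
Summarizing how the pieces above fit together (sketch only;
example_bootmem_usage() is a made-up name, everything it calls comes from the
patch):

void * __init example_bootmem_usage(unsigned long sz)
{
        /* existing callers are unchanged: the inline wrappers above pass
         * limit == 0, which the core allocator treats as "no limit" */
        void *anywhere = alloc_bootmem_pages(sz);

        /* the new swiotlb call: the whole buffer must end below 4GB so that
         * devices with a 32-bit DMA mask can reach it */
        void *below_4g = alloc_bootmem_low_pages_limit(sz, 0x100000000);

        return below_4g ? below_4g : anywhere;
}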

2005-10-20 00:52:23

by Yasunori Goto

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken


Thanks a lot for your explanation!

> On Wed, Oct 19, 2005 at 01:45:21PM -0700, Linus Torvalds wrote:
> > ...
> > Can you re-post the final version as such, with explanations for the
> > commit messages and the sign-off, and people who have issues with it
> > _please_ speak up asap?
>
> The final version which works for everyone is from Yasunori Goto.
>
> His patch introduces a limit parameter to the core bootmem allocator; this
> new parameter indicates that physical memory allocated by the bootmem allocator
> must lie within the requested limit. The patch also introduces the
> alloc_bootmem_low_pages_limit(), alloc_bootmem_node_limit() and
> alloc_bootmem_low_pages_node_limit() APIs, but alloc_bootmem_low_pages_limit()
> is the only one used for swiotlb. IMO, instead of introducing xxx_limit APIs,
> the existing alloc_bootmem_low_pages() API could instead be changed to pass
> the right limit to the core allocator. But that would make the patch more
> intrusive for 2.6.14, as other arches use alloc_bootmem_low_pages().

Yes. I wondered a bit whether I should replace all of the functions or not.
But that impact is too risky for 2.6.14, so I wrote the patch to keep the
impact as small as possible.

> (So maybe that can be done post 2.6.14 if Yasunori-san is OK with that)

Sure!

--
Yasunori Goto

2005-10-20 07:27:25

by Andi Kleen

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Wednesday 19 October 2005 19:18, Jon Mason wrote:
> I have run a few tests on the original code and the patches posted to the
> list, and have some interesting results. First, my system setup: Dual
> Opteron, 8GB RAM, SIL SATA controller (which is apparently 32-bit), pcnet32
> NIC (compiled as a module) connected to the network. The latter two should
> expose any bounce buffer problems.

We don't care about AMD systems for this problem because in normal operation
they don't use swiotlb but the AGP remapping hardware (except on one
particular chipset, and people don't build servers from that).

If anything, test results from Summit3-based Intel x86 NUMA systems
would be interesting.

-Andi

2005-10-20 10:35:29

by Andi Kleen

[permalink] [raw]
Subject: Re: [discuss] Re: x86_64: 2.6.14-rc4 swiotlb broken

On Wednesday 19 October 2005 14:47, Yasunori Goto wrote:

>
> Hmm.....
> How about this patch? It takes another approach.

That one looks good to me.

Thanks,
-Andi