2006-03-01 23:23:54

by Michael Monnerie

Subject: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

Hello, I use SUSE 10.0 with all updates and the current kernel 2.6.13-15.8 as
provided by SUSE (just self-compiled to optimize for Athlon64, SMP,
and HZ=100), with an Asus A8N-E motherboard and an Athlon64x2 CPU.
This host is used with VMware GSX server running 6 Linux clients and one
Windows client. There's a SW-RAID1 using 2 SATA HDs.

When we wanted to install a new client and inserted a DVD into the PC,
it behaved as if it were on drugs, showing the following output
in /var/log/warn:

Feb 28 15:19:30 baum kernel: rtc: lost some interrupts at 256Hz.
Feb 28 15:19:47 baum kernel: rtc: lost some interrupts at 256Hz.
Feb 28 15:20:26 baum kernel: rtc: lost some interrupts at 256Hz.
Feb 28 15:20:45 baum kernel: PCI-DMA: Out of IOMMU space for 225280 bytes at device 0000:00:08.0
Feb 28 15:20:45 baum kernel: end_request: I/O error, dev sda, sector 12248446
Feb 28 15:20:45 baum kernel: Operation continuing on 1 devices
Feb 28 15:20:45 baum kernel: PCI-DMA: Out of IOMMU space for 221184 bytes at device 0000:00:08.0
Feb 28 15:20:45 baum kernel: end_request: I/O error, dev sdb, sector 12248454
Feb 28 15:20:45 baum kernel: printk: 118 messages suppressed.
Feb 28 15:20:45 baum kernel: Buffer I/O error on device md0, logical block 1004928
Feb 28 15:20:45 baum kernel: lost page write due to I/O error on md0
Feb 28 15:20:45 baum kernel: Buffer I/O error on device md0, logical block 1004929
Feb 28 15:20:45 baum kernel: lost page write due to I/O error on md0

After that, the SW RAID stopped working, some VMware clients crashed,
and so on. Not a good situation.

I found this in dmesg:
CPU 0: aperture @ b6c6000000 size 32 MB
Aperture from northbridge cpu 0 too small (32 MB)
No AGP bridge found
Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
This costs you 64 MB of RAM
Mapping aperture over 65536 KB of RAM @ 8000000
PCI-DMA: Disabling AGP.
PCI-DMA: aperture base @ 8000000 size 65536 KB
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture

The problem is that the ASUS A8N-E has PCI-Express instead of AGP,
and therefore no BIOS setting for the AGP aperture size any more.

The kernel set that IOMMU to 64MB, but it still seems too small. There's
a kernel boot option "iommu=...", but using "iommu=off" led to a
frozen system during boot, and "iommu=128M" didn't boot at all.

I found a message from Andi Kleen of SUSE suggesting using
"iommu=memaper=3", and that helped - at least until now. I just wanted
to say thank you for that, and document this fact here for others
possibly running into the same problem.
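To make the effect of "iommu=memaper=3" concrete, here is a small arithmetic sketch in C. It assumes the rule given in the kernel's x86-64 boot option documentation, that memaper=<order> allocates an aperture of 32MB << order (the default order 1 matches the 64MB seen in the dmesg output above). The helper names are hypothetical, and the capacity estimate ignores page granularity and fragmentation.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of the RAM-backed aperture for a given memaper order.
 * Assumption: size = 32 MB << order, as described in the kernel's
 * x86-64 boot option documentation. */
static uint64_t memaper_bytes(unsigned order)
{
    assert(order < 32); /* keep the shift well inside 64 bits */
    return (UINT64_C(32) << 20) << order;
}

/* Hypothetical helper: rough upper bound on how many mappings of
 * req_bytes fit into the aperture (ignores fragmentation). */
static uint64_t max_inflight_requests(unsigned order, uint64_t req_bytes)
{
    assert(req_bytes > 0);
    return memaper_bytes(order) / req_bytes;
}
```

With the 225280-byte request from the log, the default order 1 leaves room for roughly 297 such mappings, while order 3 (256MB) leaves room for roughly 1191.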

Output from the kernel source script "sh scripts/ver_linux":

Linux baum 2.6.13-15.8-ZMI #1 SMP Tue Feb 28 16:07:49 CET 2006 x86_64 x86_64 x86_64 GNU/Linux

Gnu C 4.0.2
Gnu make 3.80
binutils 2.16.91.0.2
util-linux 2.12q
mount 2.12q
module-init-tools 3.2-pre8
e2fsprogs 1.38
jfsutils 1.1.8
reiserfsprogs 3.6.18
reiser4progs line
xfsprogs 2.6.36
Linux C Library 2.3.5
Dynamic linker (ldd) 2.3.5
Linux C++ Library 6.0.6
Procps 3.2.5
Net-tools 1.60
Kbd 1.12
Sh-utils 5.3.0
udev 068
Modules Loaded vmnet vmmon joydev af_packet iptable_filter
ip_tables button battery ac ipv6 ide_cd cdrom sundance mii shpchp
pci_hotplug generic ehci_hcd i2c_nforce2 ohci_hcd usbcore i2c_core
dm_mod reiserfs raid1 fan thermal processor sg sata_nv libata amd74xx
sd_mod scsi_mod ide_disk ide_core
--
// Michael Monnerie, Ing.BSc --- it-management Michael Monnerie
// http://zmi.at Tel: 0660/4156531 Linux 2.6.11
// PGP Key: "lynx -source http://zmi.at/zmi2.asc | gpg --import"
// Fingerprint: EB93 ED8A 1DCD BB6C F952 F7F4 3911 B933 7054 5879
// Keyserver: http://www.keyserver.net Key-ID: 0x70545879



2006-03-02 01:01:57

by Andi Kleen

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

On Thursday 02 March 2006 00:23, Michael Monnerie wrote:
> Hello, I use SUSE 10.0 with all updates and the current kernel 2.6.13-15.8 as
> provided by SUSE (just self-compiled to optimize for Athlon64, SMP,
> and HZ=100), with an Asus A8N-E motherboard and an Athlon64x2 CPU.
> This host is used with VMware GSX server running 6 Linux clients and one
> Windows client. There's a SW-RAID1 using 2 SATA HDs.

Nvidia hardware SATA cannot directly DMA to > 4GB, so it
has to go through the IOMMU. And in that kernel the Nforce
ethernet driver also didn't do >4GB access, although the ethernet HW
is theoretically capable.

Maybe VMware pins an unusually large amount of IO memory in flight (e.g. by
using a lot of O_DIRECT). That could potentially cause the IOMMU to fill up.
The RAID-1 probably also makes it worse because it will double the IO
mapping requirements.

Or you have a leak in some driver, but if the problem goes away
after enlarging the IOMMU that's unlikely.

What would probably help is to get a new SATA controller that can
access >4GB natively and at some point update to a newer kernel
with newer forcedeth driver. Or just run with the enlarged IOMMU.

-Andi

2006-03-02 10:00:24

by Jens Axboe

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

On Thu, Mar 02 2006, Andi Kleen wrote:
> On Thursday 02 March 2006 00:23, Michael Monnerie wrote:
> > Hello, I use SUSE 10.0 with all updates and the current kernel 2.6.13-15.8 as
> > provided by SUSE (just self-compiled to optimize for Athlon64, SMP,
> > and HZ=100), with an Asus A8N-E motherboard and an Athlon64x2 CPU.
> > This host is used with VMware GSX server running 6 Linux clients and one
> > Windows client. There's a SW-RAID1 using 2 SATA HDs.
>
> Nvidia hardware SATA cannot directly DMA to > 4GB, so it
> has to go through the IOMMU. And in that kernel the Nforce
> ethernet driver also didn't do >4GB access, although the ethernet HW
> is theoretically capable.
>
> Maybe VMware pins an unusually large amount of IO memory in flight (e.g. by
> using a lot of O_DIRECT). That could potentially cause the IOMMU to fill up.
> The RAID-1 probably also makes it worse because it will double the IO
> mapping requirements.
>
> Or you have a leak in some driver, but if the problem goes away
> after enlarging the IOMMU that's unlikely.
>
> What would probably help is to get a new SATA controller that can
> access >4GB natively and at some point update to a newer kernel
> with newer forcedeth driver. Or just run with the enlarged IOMMU.

libata should also handle this case better. Usually we just need to
defer command handling if the dma_map_sg() fails. Changing
ata_qc_issue() to return nsegments for success, 0 for defer failure, and
-1 for permanent failure should be enough. The SCSI path is easy at
least, as we can just ask for a defer there. The internal qc_issue is a
little more tricky.
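The three-way return convention described above could look roughly like the following userspace sketch. It is not libata code: every `model_*` name and the page-counting "IOMMU" are hypothetical stand-ins, used only to show how a mapping failure turns into "defer" while a genuinely broken request stays a permanent failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an IOMMU with a fixed pool of free pages. */
struct model_iommu {
    size_t free_pages;
};

enum {
    MODEL_ISSUE_PERMANENT_FAILURE = -1, /* request can never succeed */
    MODEL_ISSUE_DEFER = 0               /* no mapping space right now, retry later */
};

/* Mimics a scatter-gather mapping call that returns the number of
 * mapped segments, or 0 when there is not enough mapping space. */
static int model_map_sg(struct model_iommu *io, int nsegs, size_t pages)
{
    if (pages > io->free_pages)
        return 0;
    io->free_pages -= pages;
    return nsegs;
}

/* Issue path: nsegments on success, 0 for defer, -1 for permanent failure. */
static int model_qc_issue(struct model_iommu *io, int nsegs, size_t pages,
                          int device_ok)
{
    assert(io != NULL);
    if (!device_ok || nsegs <= 0)
        return MODEL_ISSUE_PERMANENT_FAILURE;

    int mapped = model_map_sg(io, nsegs, pages);
    if (mapped == 0)
        return MODEL_ISSUE_DEFER; /* caller requeues instead of failing the I/O */
    return mapped;
}
```

The point of the distinction is that the caller can requeue on 0 and only report an I/O error on -1, instead of treating a temporarily full IOMMU as a failed disk.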

The NCQ patches have logic to handle this, although for other reasons
(to avoid overlap between NCQ and non-NCQ commands). It could easily be
reused for this as well.

--
Jens Axboe

2006-03-02 12:24:35

by Andi Kleen

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

On Thursday 02 March 2006 13:16, Jeff Garzik wrote:

> > Yes I've been thinking about adding a new sleeping interface to the IOMMU
> > that would block for new space to handle this. If I did that - would
> > libata be able to use it?
>
> No :( We map inside a spin_lock_irqsave.

Would it be easily possible to change that or is it difficult?

Also with the blocking interface there might be possible deadlock issues
because it will be essentially similar to allocating memory during IO.
But I think it's probably safe.

-Andi

2006-03-02 12:31:46

by Jens Axboe

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

On Thu, Mar 02 2006, Andi Kleen wrote:
> On Thursday 02 March 2006 13:16, Jeff Garzik wrote:
>
> > > Yes I've been thinking about adding a new sleeping interface to the IOMMU
> > > that would block for new space to handle this. If I did that - would
> > > libata be able to use it?
> >
> > No :( We map inside a spin_lock_irqsave.
>
> Would it be easily possible to change that or is it difficult?
>
> Also with the blocking interface there might be possible deadlock issues
> because it will be essentially similar to allocating memory during IO.
> But I think it's probably safe.

For most cases, perhaps. But it's a nasty interface. It works for eg
mempools because of the way they are designed, but you simply have to
allow the caller the option of doing something in case we cannot map.

--
Jens Axboe

2006-03-02 12:33:41

by Jeff Garzik

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

Andi Kleen wrote:
> On Thursday 02 March 2006 13:16, Jeff Garzik wrote:
>
>
>>>Yes I've been thinking about adding a new sleeping interface to the IOMMU
>>>that would block for new space to handle this. If I did that - would
>>>libata be able to use it?
>>
>>No :( We map inside a spin_lock_irqsave.
>
>
> Would it be easily possible to change that or is it difficult?
>
> Also with the blocking interface there might be possible deadlock issues
> because it will be essentially similar to allocating memory during IO.
> But I think it's probably safe.

The SCSI layer submits stuff to libata inside spin_lock_irqsave(), and
from there we DMA-map and send straight to hardware.

So, changing the hot path to permit sleeping would be difficult and add
needless complexity, IMO.

I would rather pay the penalty of resubmitting if the
map-inside-spinlock fails, than to slow down the hot path.

Jeff



2006-03-02 13:07:29

by Andi Kleen

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

On Thursday 02 March 2006 13:30, Jens Axboe wrote:

> I'd much rather prefer punting and letting the upper layer decide how to
> handle it. Who knows, it may have to do something active like kicking
> pending io in action at the controller level.

But how would you wait for new space to be available then?

You need at least a wait queue from the IOMMU code to hook into I suspect.

-Andi

2006-03-02 13:11:07

by Jens Axboe

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

On Thu, Mar 02 2006, Andi Kleen wrote:
> On Thursday 02 March 2006 13:30, Jens Axboe wrote:
>
> > I'd much rather prefer punting and letting the upper layer decide how to
> > handle it. Who knows, it may have to do something active like kicking
> > pending io in action at the controller level.
>
> But how would you wait for new space to be available then?
>
> You need at least a wait queue from the IOMMU code to hook into I suspect.

There are two cases as far as I can see:

- We have in-driver pending stuff, so we can just retry the operation
later when some of that completes.
- We are unlucky enough that someone else holds all the resources, we
have nothing to wait for.

The first case is easy, just punt and retry when some of your io
completes. The last case requires a way to wait on the iommu as you
describe, which the driver needs to do somewhere safe.
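The two cases above boil down to one question a driver can answer locally: does it still have I/O of its own in flight? A minimal sketch of that decision, with hypothetical names that are not taken from any kernel interface:

```c
#include <assert.h>

/* What a driver might do after a failed mapping (hypothetical names). */
enum model_map_fail_action {
    MODEL_RETRY_ON_OWN_COMPLETION, /* case 1: our own pending I/O will free space */
    MODEL_WAIT_ON_IOMMU            /* case 2: someone else holds all the space */
};

/* own_inflight: number of this driver's requests that currently hold
 * IOMMU mappings and will release them when they complete. */
static enum model_map_fail_action model_on_map_failure(unsigned own_inflight)
{
    if (own_inflight > 0)
        return MODEL_RETRY_ON_OWN_COMPLETION;
    return MODEL_WAIT_ON_IOMMU;
}
```

Only the second outcome needs help from the IOMMU code itself, since the driver has no completion of its own to hang the retry on.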

--
Jens Axboe

2006-03-02 13:31:20

by Andi Kleen

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution


> - We have in-driver pending stuff, so we can just retry the operation
> later when some of that completes.
> - We are unlucky enough that someone else holds all the resources, we
> have nothing to wait for.

I suspect the second is more common - typically the problem seems to happen
when people have multiple devices active that need the IOMMU in parallel.

> The first case is easy, just punt and retry when some of your io
> completes. The last case requires a way to wait on the iommu as you
> describe, which the driver needs to do somewhere safe.

Also where to put the wait queue? The IOMMU code only
sees the bus devices not the queues and I'm not sure the low level
devices would be the right place to put it because it wouldn't handle
the case of a queue having multiple devices well and in general
would probably violate the layers.

Maybe just using a global one? The situation should be rare anyways.
Would just need a way to detect this case to avoid bouncing the cache lines
of the wait queue in the normal case. Perhaps a simple global counter
would be good enough for that.

e.g. you increase the counter and then the IOMMU code just does a wakeup
on a global waitqueue every time it frees space.

Hrm one problem I guess is that you need to make sure there are no
races between detection of the low space condition and the increasing
of the counter, but some lazy locking and rechecking might be able
to cure that.
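A userspace model of the counter idea, using C11 atomics, might look like this. It is only a sketch under stated assumptions: all `model_*` names are hypothetical, the "wakeup" is just a tally rather than a real wait queue, and the tally itself is not thread-safe. The recheck in `model_prepare_wait()` is one way to close the race between noticing low space and raising the counter.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_uint model_waiters;    /* how many callers announced they will wait */
static atomic_uint model_free_pages; /* free mapping space, in pages */
static unsigned model_wakeups;       /* illustration only: wakeups the free path issued */

/* Try to take `pages` from the pool; returns 1 on success, 0 if too little is free. */
static int model_try_alloc(unsigned pages)
{
    unsigned cur = atomic_load(&model_free_pages);
    while (cur >= pages) {
        /* On failure `cur` is reloaded with the current value and we re-test. */
        if (atomic_compare_exchange_weak(&model_free_pages, &cur, cur - pages))
            return 1;
    }
    return 0;
}

/* Free path: only touch the (modelled) wait queue if somebody is waiting. */
static void model_free(unsigned pages)
{
    atomic_fetch_add(&model_free_pages, pages);
    if (atomic_load(&model_waiters) != 0)
        model_wakeups++; /* stands in for a wakeup on the global queue */
}

/* Announce ourselves first, then recheck, so a free that happened in
 * between is not missed. Returns 1 if the space was obtained after all,
 * 0 if the caller really has to sleep (and later call model_finish_wait). */
static int model_prepare_wait(unsigned pages)
{
    atomic_fetch_add(&model_waiters, 1);
    if (model_try_alloc(pages)) {
        atomic_fetch_sub(&model_waiters, 1);
        return 1;
    }
    return 0;
}

static void model_finish_wait(void)
{
    assert(atomic_load(&model_waiters) > 0);
    atomic_fetch_sub(&model_waiters, 1);
}
```

In the common case nobody waits, so the free path reads one counter and never touches the wait queue.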

-Andi

2006-03-02 13:33:51

by Jens Axboe

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

On Thu, Mar 02 2006, Andi Kleen wrote:
>
> > - We have in-driver pending stuff, so we can just retry the operation
> > later when some of that completes.
> > - We are unlucky enough that someone else holds all the resources, we
> > have nothing to wait for.
>
> I suspect the second is more common - typically the problem seems to happen
> when people have multiple devices active that need the IOMMU in parallel.

Hmm I would have guessed the first is way more common, the device/driver
consuming lots of iommu space would be the most likely to run into
IOMMU-OOM.

> > The first case is easy, just punt and retry when some of your io
> > completes. The last case requires a way to wait on the iommu as you
> > describe, which the driver needs to do somewhere safe.
>
> Also where to put the wait queue? The IOMMU code only
> sees the bus devices not the queues and I'm not sure the low level
> devices would be the right place to put it because it wouldn't handle
> the case of a queue having multiple devices well and in general
> would probably violate the layers.
>
> Maybe just using a global one? The situation should be rare anyways.
> Would just need a way to detect this case to avoid bouncing the cache lines
> of the wait queue in the normal case. Perhaps a simple global counter
> would be good enough for that.

I was thinking just a global one, we are in soft error handling anyways
so should be ok. I don't think you would need to dirty any global cache
line unless you actually need to wake waiters.

> e.g. you increase the counter and then the IOMMU code just does a wakeup
> on a global waitqueue every time it frees space.
>
> Hrm one problem I guess is that you need to make sure there are no
> races between detection of the low space condition and the increasing
> of the counter, but some lazy locking and rechecking might be able
> to cure that.

I think so, yes.

--
Jens Axboe

2006-03-02 13:44:49

by Andi Kleen

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

On Thursday 02 March 2006 14:33, Jens Axboe wrote:

> Hmm I would have guessed the first is way more common, the device/driver
> consuming lots of iommu space would be the most likely to run into
> IOMMU-OOM.

e.g. consider a simple RAID-1. It will always map the requests twice so the
normal case is 2 times as much IOMMU space needed. Or even more with bigger
raids.
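As a back-of-the-envelope illustration of that doubling (a hypothetical helper, assuming every in-flight request is mapped once per mirror leg as described above):

```c
#include <assert.h>
#include <stdint.h>

/* Rough IOMMU demand in bytes when each in-flight request is mapped
 * once per mirror leg. Hypothetical estimate; ignores page rounding. */
static uint64_t model_mirrored_bytes(uint64_t request_bytes, unsigned inflight,
                                     unsigned mirrors)
{
    assert(mirrors >= 1);
    return request_bytes * inflight * mirrors;
}
```

For example, 150 in-flight requests of 225280 bytes (the size from the original log) need about 33.8 MB on a single disk and fit into a 64MB aperture, but need about 67.6 MB on a two-disk RAID-1 and no longer fit.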

But you're right of course that only waiting for one user would be likely
sufficient. e.g. even if it misses some freeing events the "current" device
should eventually free some space too.

On the other hand it would seem cleaner to me to solve it globally
instead of trying to hack around it in the higher layers.

>
> I was thinking just a global one, we are in soft error handling anyways
> so should be ok. I don't think you would need to dirty any global cache
> line unless you actually need to wake waiters.

__wake_up takes the spinlock even when nobody waits.

-Andi

2006-03-02 13:52:49

by Jens Axboe

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

On Thu, Mar 02 2006, Andi Kleen wrote:
> On Thursday 02 March 2006 14:33, Jens Axboe wrote:
>
> > Hmm I would have guessed the first is way more common, the device/driver
> > consuming lots of iommu space would be the most likely to run into
> > IOMMU-OOM.
>
> e.g. consider a simple RAID-1. It will always map the requests twice so the
> normal case is 2 times as much IOMMU space needed. Or even more with bigger
> raids.
>
> But you're right of course that only waiting for one user would be likely
> sufficient. e.g. even if it misses some freeing events the "current" device
> should eventually free some space too.
>
> On the other hand it would seem cleaner to me to solve it globally
> instead of trying to hack around it in the higher layers.

But I don't think that's really possible. As Jeff points out, SCSI can't
do this right now because of the way we map requests. And it would be a
shame to change the hot path because of the error case. And then you
have things like networking and other block drivers - it would be a big
audit/fixup to make that work.

It's much easier to extend the dma mapping api to have an error
fallback.

> > I was thinking just a global one, we are in soft error handling anyways
> > so should be ok. I don't think you would need to dirty any global cache
> > line unless you actually need to wake waiters.
>
> __wake_up takes the spinlock even when nobody waits.

I would not want to call wake_up() unless I have to. Would a

smp_mb();
if (waitqueue_active(&iommu_wq))
...

not be sufficient?

--
Jens Axboe

2006-03-02 14:05:23

by Andi Kleen

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

On Thursday 02 March 2006 14:49, Jens Axboe wrote:
> On Thu, Mar 02 2006, Andi Kleen wrote:
> > On Thursday 02 March 2006 14:33, Jens Axboe wrote:
> >
> > > Hmm I would have guessed the first is way more common, the device/driver
> > > consuming lots of iommu space would be the most likely to run into
> > > IOMMU-OOM.
> >
> > e.g. consider a simple RAID-1. It will always map the requests twice so the
> > normal case is 2 times as much IOMMU space needed. Or even more with bigger
> > raids.
> >
> > But you're right of course that only waiting for one user would be likely
> > sufficient. e.g. even if it misses some freeing events the "current" device
> > should eventually free some space too.
> >
> > On the other hand it would seem cleaner to me to solve it globally
> > instead of trying to hack around it in the higher layers.
>
> But I don't think that's really possible.

Wasn't this whole thread about making it possible?

> As Jeff points out, SCSI can't
> do this right now because of the way we map requests.

Sure you have to punt out outside this spinlock and then find
a "safe place" as you put it to wait. The low level IOMMU code
would supply the wakeup.

> And it would be a
> shame to change the hot path because of the error case. And then you
> have things like networking and other block drivers - it would be a big
> audit/fixup to make that work.
>
> It's much easier to extend the dma mapping api to have an error
> fallback.

It already has one (pci_map_sg returning 0 or pci_mapping_error()
for pci_map_single())

The problem is just that when you get it you can only error out
because there is no way to wait for a free space event. With
your help I've been trying to figure out how to add it. Of course
after that's done you still have to do the work to handle
it in the block layer somewhere.

> > > I was thinking just a global one, we are in soft error handling anyways
> > > so should be ok. I don't think you would need to dirty any global cache
> > > line unless you actually need to wake waiters.
> >
> > __wake_up takes the spinlock even when nobody waits.
>
> I would not want to call wake_up() unless I have to. Would a
>
> smp_mb();
> if (waitqueue_active(&iommu_wq))
> ...
>
> not be sufficient?

Probably, but one would need to be careful to not miss events this way.

-Andi

2006-03-02 14:14:42

by Jens Axboe

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

On Thu, Mar 02 2006, Andi Kleen wrote:
> On Thursday 02 March 2006 14:49, Jens Axboe wrote:
> > On Thu, Mar 02 2006, Andi Kleen wrote:
> > > On Thursday 02 March 2006 14:33, Jens Axboe wrote:
> > >
> > > > Hmm I would have guessed the first is way more common, the device/driver
> > > > consuming lots of iommu space would be the most likely to run into
> > > > IOMMU-OOM.
> > >
> > > e.g. consider a simple RAID-1. It will always map the requests twice so the
> > > normal case is 2 times as much IOMMU space needed. Or even more with bigger
> > > raids.
> > >
> > > But you're right of course that only waiting for one user would be likely
> > > sufficient. e.g. even if it misses some freeing events the "current" device
> > > should eventually free some space too.
> > >
> > > On the other hand it would seem cleaner to me to solve it globally
> > > instead of trying to hack around it in the higher layers.
> >
> > But I don't think that's really possible.
>
> Wasn't this whole thread about making it possible?

Sorry, what I mean is that I don't think it's solvable in the normal
dma_map_sg() path. You have to punt and allow the upper layer to wait.

> > As Jeff points out, SCSI can't
> > do this right now because of the way we map requests.
>
> Sure you have to punt out outside this spinlock and then find
> a "safe place" as you put it to wait. The low level IOMMU code
> would supply the wakeup.

Precisely.

> > And it would be a
> > shame to change the hot path because of the error case. And then you
> > have things like networking and other block drivers - it would be a big
> > audit/fixup to make that work.
> >
> > It's much easier to extend the dma mapping api to have an error
> > fallback.
>
> It already has one (pci_map_sg returning 0 or pci_mapping_error()
> for pci_map_single())

Yeah we can signal the error in map_sg() with 0, that's not what I
meant. I meant adding a way to handle that error, not signal it. Which
is the wait stuff we are discussing.

> The problem is just that when you get it you can only error out
> because there is no way to wait for a free space event. With
> your help I've been trying to figure out how to add it. Of course
> after that's done you still have to do the work to handle
> it in the block layer somewhere.

Yes that's the issue. We can have a defer helper in the block layer that
could reinvoke the request handling when we _hope_ it'll work. That's
already in place, the driver does a BLKPREP_DEFER for that case. For
drivers that don't use the prep handler, we can do something very
similar.

> > > > I was thinking just a global one, we are in soft error handling anyways
> > > > so should be ok. I don't think you would need to dirty any global cache
> > > > line unless you actually need to wake waiters.
> > >
> > > __wake_up takes the spinlock even when nobody waits.
> >
> > I would not want to call wake_up() unless I have to. Would a
> >
> > smp_mb();
> > if (waitqueue_active(&iommu_wq))
> > ...
> >
> > not be sufficient?
>
> Probably, but one would need to be careful to not miss events this way.

Definitely, as far as I can see the above should be enough...

--
Jens Axboe

2006-03-02 14:35:49

by Andi Kleen

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

On Thursday 02 March 2006 15:14, Jens Axboe wrote:

[...]

Ok great we agree on everything then.

> > >
> > > I would not want to call wake_up() unless I have to. Would a
> > >
> > > smp_mb();
> > > if (waitqueue_active(&iommu_wq))
> > > ...
> > >
> > > not be sufficient?
> >
> > Probably, but one would need to be careful to not miss events this way.
>
> Definitely, as far as I can see the above should be enough...

Ok - you just need to give me a wait queue then and I would be happy
to add the wakeups to the low level code

(or you can just do it yourself if you prefer, shouldn't be very difficult ... - just
needs to be done for both swiotlb and GART iommu. The other architectures
can follow then. At the beginning using an ARCH_HAS_* ifdef might be a good
idea for easier transition for everybody)

-Andi

2006-03-02 14:39:04

by Jens Axboe

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

On Thu, Mar 02 2006, Andi Kleen wrote:
> On Thursday 02 March 2006 15:14, Jens Axboe wrote:
>
> [...]
>
> Ok great we agree on everything then.

Seems so :)

> > > >
> > > > I would not want to call wake_up() unless I have to. Would a
> > > >
> > > > smp_mb();
> > > > if (waitqueue_active(&iommu_wq))
> > > > ...
> > > >
> > > > not be sufficient?
> > >
> > > Probably, but one would need to be careful to not miss events this way.
> >
> > Definitely, as far as I can see the above should be enough...
>
> Ok - you just need to give me a wait queue then and I would be happy
> to add the wakeups to the low level code
>
> (or you can just do it yourself if you prefer, shouldn't be very
> difficult ... - just needs to be done for both swiotlb and GART iommu.
> The other architectures can follow then. At the beginning using an
> ARCH_HAS_* ifdef might be a good idea for easier transition for
> everybody)

I'd prefer adding that wait queue in the iommu code, it's where it
belongs. Didn't we agree on just a global waitqueue? Then the interface
for block/net/whatever consumers would just be something like:

iommu_wait();

which would return when we have enough space, hopefully. A subsystem
specific waitqueue would have the handy side of supplying a callback on
the wake up, which would be a nicer design. But isn't a simple
wait_for_resources() call sufficient for this?

--
Jens Axboe

2006-03-03 08:17:01

by Chris Wedgwood

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

On Thu, Mar 02, 2006 at 02:03:48AM +0100, Andi Kleen wrote:

> Nvidia hardware SATA cannot directly DMA to > 4GB, so it has to go
> through the IOMMU.

do you know if that is an actual hardware limitation or simply
something we don't know how to do for lack of docs?

> And in that kernel the Nforce ethernet driver also didn't do >4GB
> access, although the ethernet HW is theoretically capable.

hrm, again, with a lack of docs is that likely to occur anytime soon?

2006-03-03 11:00:59

by Andi Kleen

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

On Friday 03 March 2006 09:16, Chris Wedgwood wrote:
> On Thu, Mar 02, 2006 at 02:03:48AM +0100, Andi Kleen wrote:
>
> > Nvidia hardware SATA cannot directly DMA to > 4GB, so it has to go
> > through the IOMMU.
>
> do you know if that is an actual hardware limitation or simply
> something we don't know how to do for lack of docs?

I assume that's a hardware limitation. I guess they'll move to AHCI
at some point though - that should fix that.

>
> > And in that kernel the Nforce ethernet driver also didn't do >4GB
> > access, although the ethernet HW is theoretically capable.
>
> hrm, again, with a lack of docs is that likely to occur anytime soon?

That has been already fixed, just not in the kernel version Michael
is using.

-Andi

2006-03-03 21:27:55

by Allen Martin

Subject: RE: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

> > On Thu, Mar 02, 2006 at 02:03:48AM +0100, Andi Kleen wrote:
> >
> > > Nvidia hardware SATA cannot directly DMA to > 4GB, so it has to go
> > > through the IOMMU.
> >
> > do you know if that is an actual hardware limitation or simply
> > something we don't know how to do for lack of docs?
>
> I assume that's a hardware limitation. I guess they'll move to AHCI
> at some point though - that should fix that.

nForce4 has 64 bit (40 bit AMD64) DMA in the SATA controller. We gave
the docs to Jeff Garzik under NDA. He posted some non functional driver
code to linux-ide earlier this week that has the 64 bit registers and
structures although it doesn't make use of them. Someone could pick
this up if they wanted to work on it though.

-Allen

2006-03-03 22:12:34

by Andi Kleen

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

On Friday 03 March 2006 22:27, Allen Martin wrote:

> nForce4 has 64 bit (40 bit AMD64) DMA in the SATA controller. We gave
> the docs to Jeff Garzik under NDA. He posted some non functional driver
> code to linux-ide earlier this week that has the 64 bit registers and
> structures although it doesn't make use of them. Someone could pick
> this up if they wanted to work on it though.

Thanks for the correction. Sounds nice - hopefully we'll get a driver soon.
I guess it's in good hands with Jeff for now.

-Andi

2006-03-03 22:23:47

by Jeff Garzik

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

Andi Kleen wrote:
> On Friday 03 March 2006 22:27, Allen Martin wrote:
>
>
>>nForce4 has 64 bit (40 bit AMD64) DMA in the SATA controller. We gave
>>the docs to Jeff Garzik under NDA. He posted some non functional driver
>>code to linux-ide earlier this week that has the 64 bit registers and
>>structures although it doesn't make use of them. Someone could pick
>>this up if they wanted to work on it though.
>
>
> Thanks for the correction. Sounds nice - hopefully we'll get a driver soon.
> I guess it's in good hands with Jeff for now.

It'll happen but not soon. Motivation is low at NV and here as well,
since newer NV is AHCI. The code in question, "NV ADMA", is essentially
legacy at this point -- though I certainly acknowledge the large current
installed base. Just being honest about the current state of things...

Jeff



2006-03-03 22:33:21

by Andi Kleen

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

On Friday 03 March 2006 23:23, Jeff Garzik wrote:
> Andi Kleen wrote:
> > On Friday 03 March 2006 22:27, Allen Martin wrote:
> >>nForce4 has 64 bit (40 bit AMD64) DMA in the SATA controller. We gave
> >>the docs to Jeff Garzik under NDA. He posted some non functional driver
> >>code to linux-ide earlier this week that has the 64 bit registers and
> >>structures although it doesn't make use of them. Someone could pick
> >>this up if they wanted to work on it though.
> >
> > Thanks for the correction. Sounds nice - hopefully we'll get a driver
> > soon. I guess it's in good hands with Jeff for now.
>
> It'll happen but not soon. Motivation is low at NV and here as well,
> since newer NV is AHCI. The code in question, "NV ADMA", is essentially
> legacy at this point

NForce4s are used widely in new shipping systems so I wouldn't
exactly call them legacy.

> -- though I certainly acknowledge the large current
> installed base. Just being honest about the current state of things...

How much work would it be to finish the prototype driver you have?

-Andi

2006-03-04 06:35:26

by Michael Monnerie

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

On Friday, 3 March 2006 23:23 Jeff Garzik wrote:
> It'll happen but not soon. Motivation is low at NV and here as well,
> since newer NV is AHCI. The code in question, "NV ADMA", is
> essentially legacy at this point -- though I certainly acknowledge
> the large current installed base. Just being honest about the
> current state of things...

I'd like to raise motivation a lot because most MBs sold here (central
Europe) are Nforce4 with Athlon64x2 at the moment. It would be nice
if vendors supported OSS developers more, as it's in their
interest to have good drivers.

For the moment, I'd recommend *against* using Nforce4 because of the
problems we had (which caused us a lot of unpaid repair work).
Hopefully NV does something quickly to resolve the remaining issues,
especially as the 4GB "border" is hit more and more often.

Regards, zmi
--
// Michael Monnerie, Ing.BSc --- it-management Michael Monnerie
// http://zmi.at Tel: 0660/4156531 Linux 2.6.11
// PGP Key: "lynx -source http://zmi.at/zmi2.asc | gpg --import"
// Fingerprint: EB93 ED8A 1DCD BB6C F952 F7F4 3911 B933 7054 5879
// Keyserver: http://www.keyserver.net Key-ID: 0x70545879



2006-03-07 00:13:36

by Robert Hancock

Subject: Re: PCI-DMA: Out of IOMMU space on x86-64 (Athlon64x2), with solution

Michael Monnerie wrote:
> On Friday, 3 March 2006 23:23 Jeff Garzik wrote:
>> It'll happen but not soon. Motivation is low at NV and here as well,
>> since newer NV is AHCI. The code in question, "NV ADMA", is
>> essentially legacy at this point -- though I certainly acknowledge
>> the large current installed base. Just being honest about the
>> current state of things...
>
> I'd like to raise motivation a lot because most MBs sold here (central
> Europe) are Nforce4 with Athlon64x2 at the moment. It would be nice
> if vendors supported OSS developers more, as it's in their
> interest to have good drivers.

I second that. It appears that nForce4 will continue to be a popular
chipset even after the Socket AM2 chips are released, so the demand for
this (and likely for NCQ support as well) will only increase.

--
Robert Hancock Saskatoon, SK, Canada
Home Page: http://www.roberthancock.com/