I'm writing a PCI driver for the first time and I'm trying to wrap my
head around the DMA mappings in that world. I've done an ISA driver which
uses DMA, but this is a bit more complex and the documentation doesn't
explain everything.
What I'm particularly confused about is how the IOMMU should be handled
with regard to scatterlist limits. My hardware cannot handle
scatterlists, only a single DMA address. But from what I understand the
IOMMU can be very similar to a normal "CPU" MMU. So it should be able to
aggregate pages that are non-contiguous in physical memory into one
single block in bus memory. Now the question is what do I set
nr_phys_segments and nr_hw_segments to? Of course the code also needs to
handle systems without an IOMMU.
Rgds
Pierre
On Thu, Nov 17 2005, Pierre Ossman wrote:
> I'm writing a PCI driver for the first time and I'm trying to wrap my
> head around the DMA mappings in that world. I've done an ISA driver which
> uses DMA, but this is a bit more complex and the documentation doesn't
> explain everything.
>
> What I'm particularly confused about is how the IOMMU should be handled
> with regard to scatterlist limits. My hardware cannot handle
> scatterlists, only a single DMA address. But from what I understand the
What kind of hardware can't handle scatter gather?
> IOMMU can be very similar to a normal "CPU" MMU. So it should be able to
> aggregate pages that are non-contiguous in physical memory into one
> single block in bus memory. Now the question is what do I set
> nr_phys_segments and nr_hw_segments to? Of course the code also needs to
> handle systems without an IOMMU.
nr_hw_segments is how many segments your driver will see once dma
mapping is complete (and the IOMMU has done its tricks), so you want to
set that to 1 if the hardware can't handle an sg list.
That'll work regardless of whether there's an IOMMU there or not. Note
that the mere existence of an IOMMU will _not_ save your performance on
this hardware, you need one with good virtual merging support to get
larger transfers.
--
Jens Axboe
Jens Axboe wrote:
> On Thu, Nov 17 2005, Pierre Ossman wrote:
>
>> I'm writing a PCI driver for the first time and I'm trying to wrap my
>> head around the DMA mappings in that world. I've done an ISA driver which
>> uses DMA, but this is a bit more complex and the documentation doesn't
>> explain everything.
>>
>> What I'm particularly confused about is how the IOMMU should be handled
>> with regard to scatterlist limits. My hardware cannot handle
>> scatterlists, only a single DMA address. But from what I understand the
>>
>
> What kind of hardware can't handle scatter gather?
>
>
I'd figure most hardware? DMA is handled by writing the start address
into one register and a size into another. Being able to set several
addr/len pairs seems highly advanced to me. :)
>> IOMMU can be very similar to a normal "CPU" MMU. So it should be able to
>> aggregate pages that are non-contiguous in physical memory into one
>> single block in bus memory. Now the question is what do I set
>> nr_phys_segments and nr_hw_segments to? Of course the code also needs to
>> handle systems without an IOMMU.
>>
>
> nr_hw_segments is how many segments your driver will see once dma
> mapping is complete (and the IOMMU has done its tricks), so you want to
> set that to 1 if the hardware can't handle an sg list.
>
>
And nr_phys_segments? I haven't really grasped the relation between the
two. Is this the number of segments handed to the IOMMU? If so, then I
would need to know how many it can handle (and set it to one if there is
no IOMMU).
> That'll work regardless of whether there's an IOMMU there or not. Note
> that the mere existence of an IOMMU will _not_ save your performance on
> this hardware, you need one with good virtual merging support to get
> larger transfers.
>
>
I thought the IOMMU could do the merging through its mapping tables? The
way I understood it, sg support in the device was just to avoid wasting
resources on the IOMMU by using fewer mappings (which would assume the
IOMMU is segment based, not page based).
Rgds
Pierre
On Thu, Nov 17 2005, Pierre Ossman wrote:
> Jens Axboe wrote:
> > On Thu, Nov 17 2005, Pierre Ossman wrote:
> >
> >> I'm writing a PCI driver for the first time and I'm trying to wrap my
> >> head around the DMA mappings in that world. I've done an ISA driver which
> >> uses DMA, but this is a bit more complex and the documentation doesn't
> >> explain everything.
> >>
> >> What I'm particularly confused about is how the IOMMU should be handled
> >> with regard to scatterlist limits. My hardware cannot handle
> >> scatterlists, only a single DMA address. But from what I understand the
> >>
> >
> > What kind of hardware can't handle scatter gather?
> >
> >
>
> I'd figure most hardware? DMA is handled by writing the start address
> into one register and a size into another. Being able to set several
> addr/len pairs seems highly advanced to me. :)
Must be a pretty nice rock you are living behind, since it's apparently
kept you there for a long time :-)
Sane hardware will accept an sg list directly. Are you sure you are
reading the specifications for that hardware correctly?
> >> IOMMU can be very similar to a normal "CPU" MMU. So it should be able to
> >> aggregate pages that are non-contiguous in physical memory into one
> >> single block in bus memory. Now the question is what do I set
> >> nr_phys_segments and nr_hw_segments to? Of course the code also needs to
> >> handle systems without an IOMMU.
> >>
> >
> > nr_hw_segments is how many segments your driver will see once dma
> > mapping is complete (and the IOMMU has done its tricks), so you want to
> > set that to 1 if the hardware can't handle an sg list.
> >
> >
>
> And nr_phys_segments? I haven't really grasped the relation between the
> two. Is this the number of segments handed to the IOMMU? If so, then I
> would need to know how many it can handle (and set it to one if there is
> no IOMMU).
nr_phys_segments is basically just to cap the segments somewhere, since
the driver needs to store it before getting it dma mapped to a (perhaps)
smaller number of segments. So yes, it's the number of 'real' segments
before dma mapping.
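As a sketch of that step (illustrative only; rq and q come from the block
layer, pdev is your PCI device, and setup_segment() is a stand-in for
however you end up programming the hardware):

	struct scatterlist sg[16];
	int i, count;

	count = blk_rq_map_sg(q, rq, sg);	/* at most nr_phys_segments entries */
	count = pci_map_sg(pdev, sg, count, PCI_DMA_FROMDEVICE);
	/* count may now be smaller, if the IOMMU merged adjacent pages */
	for (i = 0; i < count; i++)
		setup_segment(sg_dma_address(&sg[i]), sg_dma_len(&sg[i]));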
> > That'll work regardless of whether there's an IOMMU there or not. Note
> > that the mere existence of an IOMMU will _not_ save your performance on
> > this hardware, you need one with good virtual merging support to get
> > larger transfers.
> >
> >
>
> I thought the IOMMU could do the merging through its mapping tables? The
> way I understood it, sg support in the device was just to avoid wasting
> resources on the IOMMU by using fewer mappings (which would assume the
> IOMMU is segment based, not page based).
Depends on the IOMMU. Some IOMMUs just help you with address remapping
for high addresses. The way I see it, with just 1 segment you need to be
pretty damn picky with your hardware about what platform you use it on
or risk losing 50% performance or so.
--
Jens Axboe
Jens Axboe wrote:
> On Thu, Nov 17 2005, Pierre Ossman wrote:
>
>> Jens Axboe wrote:
>>
>>>
>>>
>>> What kind of hardware can't handle scatter gather?
>>>
>>>
>>>
>> I'd figure most hardware? DMA is handled by writing the start address
>> into one register and a size into another. Being able to set several
>> addr/len pairs seems highly advanced to me. :)
>>
>
> Must be a pretty nice rock you are living behind, since it's apparently
> kept you there for a long time :-)
>
>
The driver support is simply too good in Linux so I haven't had the need
for writing a PCI driver until now. ;)
> Sane hardware will accept an sg list directly. Are you sure you are
> reading the specifications for that hardware correctly?
>
>
Specifications? Such luxury. This driver is based on googling and
reverse engineering. Any requests for specifications have so far been
put in the round filing cabinet.
What I know is that I have the registers:
* System address (32 bit)
* Block size (16 bit)
* Block count (16 bit)
From what I've seen these are written to once. So I'm having a hard time
believing these support more than one segment.
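So my guess is that starting a transfer amounts to something like this
(the register names and offsets are pure guesses on my part, only the
three-register layout above is what I've observed, and I'm assuming a
32-bit DMA mask since the address register is 32 bit):

	/* hypothetical offsets, just to illustrate the single-segment setup */
	writel(dma_addr, ioaddr + REG_SYSTEM_ADDRESS);	/* 32-bit bus address */
	writew(blksz, ioaddr + REG_BLOCK_SIZE);		/* 16-bit block size */
	writew(blocks, ioaddr + REG_BLOCK_COUNT);	/* 16-bit block count */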
>>>
>>>
>> And nr_phys_segments? I haven't really grasped the relation between the
>> two. Is this the number of segments handed to the IOMMU? If so, then I
>> would need to know how many it can handle (and set it to one if there is
>> no IOMMU).
>>
>
> nr_phys_segments is basically just to cap the segments somewhere, since
> the driver needs to store it before getting it dma mapped to a (perhaps)
> smaller number of segments. So yes, it's the number of 'real' segments
> before dma mapping.
>
>
So from a driver point of view, this is just a matter of memory usage?
In that case, what is a good value? =)
Since there is no guarantee this will be mapped down to one segment
(that the hardware can accept), is it expected that the driver iterates
over the entire list or can I mark only the first segment as completed
and wait for the request to be reissued? (this is an MMC driver, which
behaves like the block layer)
>>> That'll work regardless of whether there's an IOMMU there or not. Note
>>> that the mere existence of an IOMMU will _not_ save your performance on
>>> this hardware, you need one with good virtual merging support to get
>>> larger transfers.
>>>
>>>
>>>
>> I thought the IOMMU could do the merging through its mapping tables? The
>> way I understood it, sg support in the device was just to avoid wasting
>> resources on the IOMMU by using fewer mappings (which would assume the
>> IOMMU is segment based, not page based).
>>
>
> Depends on the IOMMU. Some IOMMUs just help you with address remapping
> for high addresses. The way I see it, with just 1 segment you need to be
> pretty damn picky with your hardware about what platform you use it on
> or risk losing 50% performance or so.
>
>
Ok. Being a block device, the segments are usually rather large so the
overhead of setting up many DMA transfers shouldn't be that terrible.
Rgds
Pierre
On Thu, Nov 17 2005, Pierre Ossman wrote:
> Jens Axboe wrote:
> > On Thu, Nov 17 2005, Pierre Ossman wrote:
> >
> >> Jens Axboe wrote:
> >>
> >>>
> >>>
> >>> What kind of hardware can't handle scatter gather?
> >>>
> >>>
> >>>
> >> I'd figure most hardware? DMA is handled by writing the start address
> >> into one register and a size into another. Being able to set several
> >> addr/len pairs seems highly advanced to me. :)
> >>
> >
> > Must be a pretty nice rock you are living behind, since it's apparently
> > kept you there for a long time :-)
> >
> >
>
> The driver support is simply too good in Linux so I haven't had the need
> for writing a PCI driver until now. ;)
;-)
> > Sane hardware will accept an sg list directly. Are you sure you are
> > reading the specifications for that hardware correctly?
> >
> >
>
> Specifications? Such luxury. This driver is based on googling and
> reverse engineering. Any requests for specifications have so far been
> put in the round filing cabinet.
>
> What I know is that I have the registers:
>
> * System address (32 bit)
> * Block size (16 bit)
> * Block count (16 bit)
Sounds like a pretty simple device, then. Any device engineered for any
kind of at least half serious performance would accept more than just an
address/length tuple.
> From what I've seen these are written to once. So I'm having a hard time
> believing these support more than one segment.
>
> >>>
> >>>
> >> And nr_phys_segments? I haven't really grasped the relation between the
> >> two. Is this the number of segments handed to the IOMMU? If so, then I
> >> would need to know how many it can handle (and set it to one if there is
> >> no IOMMU).
> >>
> >
> > nr_phys_segments is basically just to cap the segments somewhere, since
> > the driver needs to store it before getting it dma mapped to a (perhaps)
> > smaller number of segments. So yes, it's the number of 'real' segments
> > before dma mapping.
> >
> >
>
> So from a driver point of view, this is just a matter of memory usage?
> In that case, what is a good value? =)
Yep. A good value depends on how big a transfer you can support anyway
and how fast the device is. And how much you potentially gain by doing
larger transfers as compared to small ones. The block layer default is 128
segments, but that's probably too big for you. Something like 16 should
still give you at least 64kb transfers (16 page-sized segments).
> Since there is no guarantee this will be mapped down to one segment
> (that the hardware can accept), is it expected that the driver iterates
> over the entire list or can I mark only the first segment as completed
> and wait for the request to be reissued? (this is an MMC driver, which
> behaves like the block layer)
Ah MMC, that explains a few things :-)
It's quite legal (and possible) to partially handle a given request, you
are not obliged to handle a request as a single unit. See how other
block drivers have an end request handling function ala:
void my_end_request(struct hw_struct *hw, struct request *rq,
		    int nbytes, int uptodate)
{
	...
	/* complete nbytes of rq; returns 0 once the whole request is done */
	if (!end_that_request_chunk(rq, uptodate, nbytes)) {
		blkdev_dequeue_request(rq);
		end_that_request_last(rq);
	}
	...
}
elv_next_request() will keep giving you the same request until you have
dequeued and ended it, so you don't have to keep track of the 'current'
request. end_that_request_*() will make sure the request state is sane
after each call as well, so you can treat the request as a new one every
time. Doing partial requests is not harder than doing full requests.
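A per-segment completion path could then be as simple as this sketch
(my_host, seg_bytes and start_next_segment() are made-up driver glue; the
end_that_request_*() calls are the ones above, called with the queue lock
held as usual):

	/* the hardware finished the current segment */
	static void my_segment_done(struct my_host *host)
	{
		struct request *rq = host->rq;

		if (!end_that_request_chunk(rq, 1, host->seg_bytes)) {
			/* whole request done, retire it */
			blkdev_dequeue_request(rq);
			end_that_request_last(rq);
			host->rq = NULL;
		} else {
			/* more left, point the hardware at the next segment */
			start_next_segment(host);
		}
	}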
> >>> That'll work regardless of whether there's an IOMMU there or not. Note
> >>> that the mere existence of an IOMMU will _not_ save your performance on
> >>> this hardware, you need one with good virtual merging support to get
> >>> larger transfers.
> >>>
> >>>
> >>>
> >> I thought the IOMMU could do the merging through its mapping tables? The
> >> way I understood it, sg support in the device was just to avoid wasting
> >> resources on the IOMMU by using fewer mappings (which would assume the
> >> IOMMU is segment based, not page based).
> >>
> >
> > Depends on the IOMMU. Some IOMMUs just help you with address remapping
> > for high addresses. The way I see it, with just 1 segment you need to be
> > pretty damn picky with your hardware about what platform you use it on
> > or risk losing 50% performance or so.
> >
> >
>
> Ok. Being a block device, the segments are usually rather large so the
> overhead of setting up many DMA transfers shouldn't be that terrible.
The segments will typically be page sized, so it could be worse. It all
depends on what your command overhead is like whether it hurts
performance a lot or not.
--
Jens Axboe
Jens Axboe wrote:
> On Thu, Nov 17 2005, Pierre Ossman wrote:
>
>> Ok. Being a block device, the segments are usually rather large so the
>> overhead of setting up many DMA transfers shouldn't be that terrible.
>>
>
> The segments will typically be page sized, so it could be worse. It all
> depends on what your command overhead is like whether it hurts
> performance a lot or not.
>
>
MMC overhead is a lot larger than sending new addr/len tuples to the
hardware. So I suppose there is performance to be gained by iterating
over the segments inside the driver.
Thanks for clearing things up. Maybe someone could update
DMA-mapping.txt with the things you've explained to me here *hint* ;)
Rgds
Pierre
On Thu, Nov 17 2005, Pierre Ossman wrote:
> Thanks for clearing things up. Maybe someone could update
> DMA-mapping.txt with the things you've explained to me here *hint* ;)
Most of it is block driver specific, I doubt I added much in the way of
actual DMA-mapping.txt material :-)
But yeah, it's not the first time I've been asked these questions. At
least this time it was with lkml cc'ed, so I can point others at the
thread!
--
Jens Axboe
Revisiting a dear old thread. :)
After some initial tests, some more questions popped up. See below.
Jens Axboe wrote:
> On Thu, Nov 17 2005, Pierre Ossman wrote:
>
>> Since there is no guarantee this will be mapped down to one segment
>> (that the hardware can accept), is it expected that the driver iterates
>> over the entire list or can I mark only the first segment as completed
>> and wait for the request to be reissued? (this is an MMC driver, which
>> behaves like the block layer)
>>
>
> Ah MMC, that explains a few things :-)
>
> It's quite legal (and possible) to partially handle a given request, you
> are not obliged to handle a request as a single unit. See how other
> block drivers have an end request handling function ala:
>
>
After testing this it seems the block layer never gives me more than
max_hw_segs segments. Is it being clever because I'm compiling for a
system without an IOMMU?
The hardware should (haven't properly tested this) be able to get new
DMA addresses during a transfer. In essence scatter gather with some CPU
support. Since I avoid MMC overhead this should give a nice performance
boost. But this relies on the block layer giving me more than one
segment. Do I need to lie in max_hw_segs to achieve this?
Rgds
Pierre
Pierre Ossman wrote:
> Revisiting a dear old thread. :)
>
> After some initial tests, some more questions popped up. See below.
>
> Jens Axboe wrote:
>
>>On Thu, Nov 17 2005, Pierre Ossman wrote:
>>
>>
>>>Since there is no guarantee this will be mapped down to one segment
>>>(that the hardware can accept), is it expected that the driver iterates
>>>over the entire list or can I mark only the first segment as completed
>>>and wait for the request to be reissued? (this is an MMC driver, which
>>>behaves like the block layer)
>>>
>>
>>Ah MMC, that explains a few things :-)
>>
>>It's quite legal (and possible) to partially handle a given request, you
>>are not obliged to handle a request as a single unit. See how other
>>block drivers have an end request handling function ala:
>>
>>
>
>
> After testing this it seems the block layer never gives me more than
> max_hw_segs segments. Is it being clever because I'm compiling for a
> system without an IOMMU?
>
> The hardware should (haven't properly tested this) be able to get new
> DMA addresses during a transfer. In essence scatter gather with some CPU
> support. Since I avoid MMC overhead this should give a nice performance
> boost. But this relies on the block layer giving me more than one
> segment. Do I need to lie in max_hw_segs to achieve this?
>
Hi, Pierre.
max_phys_segments: the maximum number of segments in a request
                   *before* DMA mapping

max_hw_segments:   the maximum number of segments in a request
                   *after* DMA mapping (ie. after IOMMU merging)
Those maximum numbers are for block layer. Block layer must not exceed
above limits when it passes a request downward. As long as all entries
in sg are processed, block layer doesn't care whether sg iteration is
performed by the driver or hardware.
So, if you're gonna perform sg by iterating in the driver, what numbers
to report for max_phys_segments and max_hw_segments is entirely up to how
many entries the driver can handle.
Just report some nice number (64 or 128?) for both. Don't forget that
the number of sg entries can be decreased after DMA-mapping on machines
with IOMMU.
IOW, the part which performs sg iteration gets to determine above
limits. In your case, the driver is responsible for both iterations (pre
and post DMA mapping), so all the limits are up to the driver.
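For an MMC host driver that iterates the sg list itself, that boils down
to something like this (field names follow the max_*_segs naming used
earlier in the thread; max_seg_size and max_sectors here are my
assumption, and the values are arbitrary):

	mmc->max_phys_segs = 128;	/* sg entries the driver will walk itself */
	mmc->max_hw_segs   = 128;	/* same, the driver does the iteration anyway */
	mmc->max_seg_size  = 65536;	/* largest single segment the hw can take */
	mmc->max_sectors   = 128;	/* cap one request at 64kb */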
Hope it helped.
--
tejun
Tejun Heo wrote:
> Pierre Ossman wrote:
>> Revisiting a dear old thread. :)
>>
>> After some initial tests, some more questions popped up. See below.
>>
>> Jens Axboe wrote:
>>
>>> On Thu, Nov 17 2005, Pierre Ossman wrote:
>>>
>>>
>>>> Since there is no guarantee this will be mapped down to one segment
>>>> (that the hardware can accept), is it expected that the driver
>>>> iterates
>>>> over the entire list or can I mark only the first segment as completed
>>>> and wait for the request to be reissued? (this is an MMC driver, which
>>>> behaves like the block layer)
>>>>
>>>
>>> Ah MMC, that explains a few things :-)
>>>
>>> It's quite legal (and possible) to partially handle a given request,
>>> you
>>> are not obliged to handle a request as a single unit. See how other
>>> block drivers have an end request handling function ala:
>>>
>>>
>>
>>
>> After testing this it seems the block layer never gives me more than
>> max_hw_segs segments. Is it being clever because I'm compiling for a
>> system without an IOMMU?
>>
>> The hardware should (haven't properly tested this) be able to get new
>> DMA addresses during a transfer. In essence scatter gather with some CPU
>> support. Since I avoid MMC overhead this should give a nice performance
>> boost. But this relies on the block layer giving me more than one
>> segment. Do I need to lie in max_hw_segs to achieve this?
>>
>
> Hi, Pierre.
>
> max_phys_segments: the maximum number of segments in a request
> *before* DMA mapping
>
> max_hw_segments: the maximum number of segments in a request
> *after* DMA mapping (ie. after IOMMU merging)
>
> Those maximum numbers are for block layer. Block layer must not
> exceed above limits when it passes a request downward. As long as all
> entries in sg are processed, block layer doesn't care whether sg
> iteration is performed by the driver or hardware.
>
> So, if you're gonna perform sg by iterating in the driver, what
> numbers to report for max_phys_segments and max_hw_segments is
> entirely up to how many entries the driver can handle.
>
> Just report some nice number (64 or 128?) for both. Don't forget that
> the number of sg entries can be decreased after DMA-mapping on
> machines with IOMMU.
>
> IOW, the part which performs sg iteration gets to determine above
> limits. In your case, the driver is responsible for both iterations
> (pre and post DMA mapping), so all the limits are up to the driver.
>
>
I'm still a bit confused why the block layer needs to know the maximum
number of hw segments. Different hardware might be connected to
different IOMMUs, so only the driver will know how much the number can
be reduced. So the block layer should only care about not going above
max_phys_segments, since that's what the driver has room for.
What is the scenario that requires both?
Rgds
Pierre
Pierre Ossman wrote:
> Tejun Heo wrote:
>
>>Pierre Ossman wrote:
>>>
>>>After testing this it seems the block layer never gives me more than
>>>max_hw_segs segments. Is it being clever because I'm compiling for a
>>>system without an IOMMU?
>>>
>>>The hardware should (haven't properly tested this) be able to get new
>>>DMA addresses during a transfer. In essence scatter gather with some CPU
>>>support. Since I avoid MMC overhead this should give a nice performance
>>>boost. But this relies on the block layer giving me more than one
>>>segment. Do I need to lie in max_hw_segs to achieve this?
>>>
>>
>>Hi, Pierre.
>>
>>max_phys_segments: the maximum number of segments in a request
>> *before* DMA mapping
>>
>>max_hw_segments: the maximum number of segments in a request
>> *after* DMA mapping (ie. after IOMMU merging)
>>
>>Those maximum numbers are for block layer. Block layer must not
>>exceed above limits when it passes a request downward. As long as all
>>entries in sg are processed, block layer doesn't care whether sg
>>iteration is performed by the driver or hardware.
>>
>>So, if you're gonna perform sg by iterating in the driver, what
>>numbers to report for max_phys_segments and max_hw_segments is
>>entirely up to how many entries the driver can handle.
>>
>>Just report some nice number (64 or 128?) for both. Don't forget that
>>the number of sg entries can be decreased after DMA-mapping on
>>machines with IOMMU.
>>
>>IOW, the part which performs sg iteration gets to determine above
>>limits. In your case, the driver is responsible for both iterations
>>(pre and post DMA mapping), so all the limits are up to the driver.
>>
>>
>
>
> I'm still a bit confused why the block layer needs to know the maximum
> number of hw segments. Different hardware might be connected to
> different IOMMUs, so only the driver will know how much the number can
> be reduced. So the block layer should only care about not going above
> max_phys_segments, since that's what the driver has room for.
>
> What is the scenario that requires both?
>
Let's say there is a (crap) controller which can handle only 4
segments, but the system has a powerful IOMMU which can merge pretty
well. The driver wants to handle large requests for performance but it
doesn't want to break up requests itself (pretty pointless, block layer
merges, driver breaks down). A request should be large but not larger
than what the hardware can take at once.
So, it uses max_phys_segments to tell block layer how many sg entries
the driver is willing to handle (some arbitrary large number) and
reports 4 for max_hw_segments letting block layer know that requests
should not be more than 4 segments after DMA-mapping.
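In queue-limit terms the example would be something like this (a sketch
only, using the generic block layer helpers with illustrative numbers):

	blk_queue_max_phys_segments(q, 128);	/* driver happily stores a big sg list */
	blk_queue_max_hw_segments(q, 4);	/* at most 4 segments after DMA-mapping */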
To sum up, the block layer performs request sizing on behalf of block
drivers, so it needs to know the size limits.
Is this explanation any better than my previous one? :-P
Also, theoretically there can be more than one IOMMU on a system (are
there already?). The block layer isn't yet ready to handle such cases, but
when it becomes necessary, all that's needed is to make the currently global
IOMMU merging parameters request queue specific and modify drivers such
that they tell the block layer their IOMMU parameters.
--
tejun
Tejun Heo wrote:
> Pierre Ossman wrote:
>>
>> I'm still a bit confused why the block layer needs to know the maximum
>> number of hw segments. Different hardware might be connected to
>> different IOMMUs, so only the driver will know how much the number can
>> be reduced. So the block layer should only care about not going above
>> max_phys_segments, since that's what the driver has room for.
>>
>> What is the scenario that requires both?
>>
>
> Let's say there is a (crap) controller which can handle only 4
> segments, but the system has a powerful IOMMU which can merge pretty
> well. The driver wants to handle large requests for performance but
> it doesn't want to break up requests itself (pretty pointless, block
> layer merges, driver breaks down). A request should be large but not
> larger than what the hardware can take at once.
>
> So, it uses max_phys_segments to tell block layer how many sg entries
> the driver is willing to handle (some arbitrary large number) and
> reports 4 for max_hw_segments letting block layer know that requests
> should not be more than 4 segments after DMA-mapping.
>
> To sum up, the block layer performs request sizing on behalf of block
> drivers, so it needs to know the size limits.
>
> Is this explanation any better than my previous one? :-P
>
> Also, theoretically there can be more than one IOMMU on a system (are
> there already?). The block layer isn't yet ready to handle such cases, but
> when it becomes necessary, all that's needed is to make the currently global
> IOMMU merging parameters request queue specific and modify drivers
> such that they tell the block layer their IOMMU parameters.
>
Ahh. I thought the block layer wasn't aware of any IOMMU. Since I saw
those as bus specific I figured only the DMA APIs, which have access to
the device object, could know which IOMMU was to be used and how it
would merge segments.
So in my case I'll have to lie to the block layer. Iterating in the
driver will be much faster than having to do an entirely new transfer.
Thanks for clearing things up. :)
Rgds
Pierre