Hi
I'm getting crashes with the InitIO A100u2w controller on Sparc64 (I had to
fix the endianness issues in the driver, but that's unrelated).
When I examined the crashes, it turned out that the SCSI layer passed requests
with too many segments. The controller has at most 32 SG entries per
request. It sets shost->sg_tablesize to 32, but despite this, larger
requests were submitted to it --- this resulted in overwriting random
memory and crashes.
A typical scenario, at the beginning of inia100_queue, looked like this:
The number of segments returned by scsi_sg_count(cmd) == 33
cmd->request->nr_hw_segments == 32
cmd->request->nr_phys_segments == 37
cmd->sdb.table.nents == 33
cmd->sdb.table.orig_nents == 37
The SCSI layer submitted a request with 33 scatter-gather segments.
When I was searching for a reason for this, I found the virtual merging
mechanism. The idea behind virtual merging is this: on architectures with an
IOMMU, the kernel can program the IOMMU in such a way that several
discontiguous pages in memory appear as contiguous space to the PCI
device. Thus, entries in the device's SG table are saved. Virtual merging
alone is a good and harmless idea --- it just improves performance a bit by
reducing the number of regions visible to the device.
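To illustrate what the IOMMU's map_sg output does --- a minimal sketch, not
actual kernel code; the struct and function names here are made up for the
example:

        struct bus_seg { u64 bus_addr; u32 len; };

        /* Collapse adjacent entries whose bus addresses came out contiguous
         * after IOMMU mapping. Merging can only shrink the list: the
         * returned count is <= nents. */
        static int vmerge_segments(struct bus_seg *seg, int nents)
        {
                int out = 0, i;

                for (i = 0; i < nents; i++) {
                        if (out && seg[out - 1].bus_addr + seg[out - 1].len
                                   == seg[i].bus_addr)
                                seg[out - 1].len += seg[i].len; /* merged */
                        else
                                seg[out++] = seg[i];
                }
                return out;
        }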
But now, look at the block layer:
The block layer, when merging requests, must make sure that it won't
overflow the device's SG table. So it accounts for the number of physical
segments, and it also tries to account for the number of segments that will
remain after the IOMMU performs virtual merging (see the function
blk_recalc_rq_segments and similar). The block layer calls BIOVEC_VIRT_MERGEABLE
and BIOVEC_VIRT_OVERSIZE (these macros use the architecture-specific constants
BIO_VMERGE_BOUNDARY and BIO_VMERGE_MAX_SIZE) to check whether two segments can
be merged by the IOMMU, and sets rq->nr_hw_segments appropriately.
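For reference, the shape of that test --- a simplified sketch of the
BIOVEC_VIRT_MERGEABLE/BIOVEC_VIRT_OVERSIZE logic, not the exact macros:

        /* Two adjacent segments are considered virtually mergeable when the
         * end of the first and the start of the second are both aligned to
         * BIO_VMERGE_BOUNDARY, and the merged segment would not exceed
         * BIO_VMERGE_MAX_SIZE. If this returns 1, the block layer assumes
         * the IOMMU WILL merge the pair. */
        static int virt_mergeable(u64 prev_end_phys, u64 next_start_phys,
                                  unsigned int merged_len)
        {
                if ((prev_end_phys | next_start_phys) & (BIO_VMERGE_BOUNDARY - 1))
                        return 0;       /* not aligned for the IOMMU */
                if (merged_len > BIO_VMERGE_MAX_SIZE)
                        return 0;       /* merged segment would be too big */
                return 1;
        }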
What was causing the bug in my case: the I/O layer expected that two segments
could be virtually merged, and so it merged the I/O requests. However, the
IOMMU function dma_4u_map_sg didn't merge these segments (the specific
check that was failing was "outs->dma_length + s->length > max_seg_size"
--- but any of the three checks in that function could trigger the bug).
So it produced one more SG entry than the I/O layer accounted for, and crashed
the SCSI host driver.
--
When I thought about it more, I realized that this accounting of virtual
segments in the I/O layer can't work correctly at all. If an architecture
defines the symbols BIOVEC_VIRT_MERGEABLE and BIOVEC_VIRT_OVERSIZE, it
declares that its IOMMU must merge any two regions satisfying these
conditions. But an IOMMU cannot guarantee this, because:
* the bus address is allocated basically randomly, so we can hit
dev->dma_parms->segment_boundary_mask at any time. This will prevent virtual
merging from happening. The I/O layer doesn't know the bus address at the
time it merges requests, so it can't predict when this happens (see the
sketch after this list).
* the IOMMU isn't guaranteed to find a contiguous space in its bus
address space. If it skips over already mapped regions, it can't perform
virtual merging.
* when creating the mapping, we can hit the per-device limit
"dev->dma_parms->max_segment_size" --- but the I/O layer checks only
against the global limit BIOVEC_VIRT_OVERSIZE. (This last issue is fixable;
the previous two are not.)
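The sketch mentioned above, showing why the first point is fatal for the
accounting (a simplified stand-in for the kernel's check, with illustrative
names):

        /* A mapping of len bytes at bus address addr crosses a segment
         * boundary iff its first and last bytes fall into different
         * boundary-aligned windows. Whether that happens depends entirely
         * on addr, which the IOMMU allocator picks at map time --- long
         * after the block layer has finished merging. */
        static int crosses_seg_boundary(u64 addr, u64 len, u64 boundary_mask)
        {
                return (addr & ~boundary_mask) !=
                       ((addr + len - 1) & ~boundary_mask);
        }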
Basically, the IOMMU treats virtual merging as an optimization that it may
perform to improve performance, but that it doesn't have to perform in some
boundary conditions. The I/O layer, however, treats virtual merging as a
mandatory feature and expects that any two regions satisfying the alignment
and size conditions can be virtually merged.
I would suggest dropping the virtual segment accounting from the I/O layer
entirely. (I am not suggesting dropping the virtual merging itself.) For SCSI
drivers with a large SG list, dropping the virtual segment accounting will
make no difference (the merge will happen anyway, the I/O layer just won't
know about it in advance). For SCSI drivers with a small SG list, like
A100u2w, dropping the virtual merge accounting may slightly decrease
performance (the I/O layer won't use the maximum request size that virtual
merging would allow) --- but it will make these drivers work. They don't work
now, because these off-by-one overshoots in the number of segments are
inevitable.
Mikulas
On Fri, Jul 11, 2008 at 07:56, Mikulas Patocka <[email protected]> wrote:
> Hi
>
> I'm getting crashes with the InitIO A100u2w controller on Sparc64 (I had to fix
> the endianness issues in the driver, but that's unrelated).
>
> When I examined the crashes, it turned out that the SCSI layer passed requests
> with too many segments. The controller has at most 32 SG entries per
> request. It sets shost->sg_tablesize to 32, but despite this, larger
> requests were submitted to it --- this resulted in overwriting random memory
> and crashes.
[snip]
Should this have been CC'd to linux-scsi too? This sounds like it
could be a bug at their end.
Thanks,
--
Julian Calaby
Email: [email protected]
On Fri, 11 Jul 2008, Julian Calaby wrote:
> On Fri, Jul 11, 2008 at 07:56, Mikulas Patocka <[email protected]> wrote:
>> Hi
>>
>> I'm getting crashes with the InitIO A100u2w controller on Sparc64 (I had to fix
>> the endianness issues in the driver, but that's unrelated).
>>
>> When I examined the crashes, it turned out that the SCSI layer passed requests
>> with too many segments. The controller has at most 32 SG entries per
>> request. It sets shost->sg_tablesize to 32, but despite this, larger
>> requests were submitted to it --- this resulted in overwriting random memory
>> and crashes.
>
> [snip]
>
> Should this have been CC'd to linux-scsi too? This sounds like it
> could be a bug at their end.
>
> Thanks,
>
> --
> Julian Calaby
This is not a bug in the SCSI layer; those "nr_hw_segments" miscalculations
are happening in block/blk-merge.c, and they can show up with any block
device driver that has a small number of SG entries.
But I can CC it there, just for their information.
Mikulas
On Thu, 10 Jul 2008 17:56:08 -0400 (EDT)
Mikulas Patocka <[email protected]> wrote:
> When I thought about it more, I realized that this accounting of virtual
> segments in the I/O layer can't work correctly at all. If an architecture
> defines the symbols BIOVEC_VIRT_MERGEABLE and BIOVEC_VIRT_OVERSIZE, it
> declares that its IOMMU must merge any two regions satisfying these
> conditions. But an IOMMU cannot guarantee this, because:
Yeah, IOMMUs can't guarantee that. The majority of architectures set
BIO_VMERGE_BOUNDARY to 0 so they don't hit this, I think.
> * the bus address is allocated basically randomly, so we can hit
> dev->dma_parms->segment_boundary_mask at any time. This will prevent virtual
> merging from happening. The I/O layer doesn't know the bus address at the
> time it merges requests, so it can't predict when this happens.
>
> * the IOMMU isn't guaranteed to find a contiguous space in its bus
> address space. If it skips over already mapped regions, it can't perform
> virtual merging.
>
> * when creating the mapping, we can hit the per-device limit
> "dev->dma_parms->max_segment_size" --- but the I/O layer checks only
> against the global limit BIOVEC_VIRT_OVERSIZE. (This last issue is fixable;
> the previous two are not.)
I think that the block layer can handle this properly via
q->max_segment_size. We have the same value at two different
places. Yeah, it's not good...
BTW, inia100_template sets sg_tablesize to SG_ALL. If the controller
has at most 32 SG entries per request, we need to fix that.
On Fri, 11 Jul 2008, FUJITA Tomonori wrote:
> On Thu, 10 Jul 2008 17:56:08 -0400 (EDT)
> Mikulas Patocka <[email protected]> wrote:
>
>> When I thought about it more, I realized that this accounting of virtual
>> segments in the I/O layer can't work correctly at all. If an architecture
>> defines the symbols BIOVEC_VIRT_MERGEABLE and BIOVEC_VIRT_OVERSIZE, it
>> declares that its IOMMU must merge any two regions satisfying these
>> conditions. But an IOMMU cannot guarantee this, because:
>
> Yeah, IOMMUs can't guarantee that. The majority of architectures set
> BIO_VMERGE_BOUNDARY to 0 so they don't hit this, I think.
Yes, the architectures without IOMMU don't hit this problem.
>> * the bus address is allocated basically randomly, so we can hit
>> dev->dma_parms->segment_boundary_mask at any time. This will prevent virtual
>> merging from happening. The I/O layer doesn't know the bus address at the
>> time it merges requests, so it can't predict when this happens.
>>
>> * the IOMMU isn't guaranteed to find a contiguous space in its bus
>> address space. If it skips over already mapped regions, it can't perform
>> virtual merging.
>>
>> * when creating the mapping, we can hit the per-device limit
>> "dev->dma_parms->max_segment_size" --- but the I/O layer checks only
>> against the global limit BIOVEC_VIRT_OVERSIZE. (This last issue is fixable;
>> the previous two are not.)
>
> I think that the block layer can handle this properly via
> q->max_segment_size. We have the same value at two different
> places. Yeah, it's not good...
>
>
> BTW, inia100_template sets sg_tablesize to SG_ALL. If the controller
> has at most 32 SG entries per request, we need to fix that.
Later, the driver sets shost->sg_tablesize = TOTAL_SG_ENTRY; I don't know
why there is SG_ALL in inia100_template.
Mikulas
On Fri, 11 Jul 2008 06:52:09 -0400 (EDT)
Mikulas Patocka <[email protected]> wrote:
>
>
> On Fri, 11 Jul 2008, FUJITA Tomonori wrote:
>
> > On Thu, 10 Jul 2008 17:56:08 -0400 (EDT)
> > Mikulas Patocka <[email protected]> wrote:
> >
> >> When I thought about it more, I realized that this accounting of virtual
> >> segments in the I/O layer can't work correctly at all. If an architecture
> >> defines the symbols BIOVEC_VIRT_MERGEABLE and BIOVEC_VIRT_OVERSIZE, it
> >> declares that its IOMMU must merge any two regions satisfying these
> >> conditions. But an IOMMU cannot guarantee this, because:
> >
> > Yeah, IOMMUs can't guarantee that. The majority of architectures set
> > BIO_VMERGE_BOUNDARY to 0 so they don't hit this, I think.
>
> Yes, the architectures without IOMMU don't hit this problem.
I meant that even if some architectures support IOMMUs, they set
BIO_VMERGE_BOUNDARY to 0.
> >> * the bus address is allocated basically randomly, so we can hit
> >> dev->dma_parms->segment_boundary_mask at any time. This will prevent virtual
> >> merging from happening. The I/O layer doesn't know the bus address at the
> >> time it merges requests, so it can't predict when this happens.
> >>
> >> * the IOMMU isn't guaranteed to find a contiguous space in its bus
> >> address space. If it skips over already mapped regions, it can't perform
> >> virtual merging.
> >>
> >> * when creating the mapping, we can hit the per-device limit
> >> "dev->dma_parms->max_segment_size" --- but the I/O layer checks only
> >> against the global limit BIOVEC_VIRT_OVERSIZE. (This last issue is fixable;
> >> the previous two are not.)
> >
> > I think that the block layer can handle this properly via
> > q->max_segment_size. We have the same value at two different
> > places. Yeah, it's not good...
> >
> >
> > BTW, inia100_template sets sg_tablesize to SG_ALL. If the controller
> > has at most 32 SG entries per request, we need to fix that.
>
> Later, the driver sets shost->sg_tablesize = TOTAL_SG_ENTRY; I don't know
> why there is SG_ALL in inia100_template.
Thanks, I overlooked it.
Seems that the driver sets shost->sg_tablesize to TOTAL_SG_ENTRY
unconditionally. Setting it in inia100_template is better.
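Something like this should do it (an untested sketch against
drivers/scsi/a100u2w.c; the surrounding context lines may differ):

--- a/drivers/scsi/a100u2w.c
+++ b/drivers/scsi/a100u2w.c
@@ static struct scsi_host_template inia100_template = {
-	.sg_tablesize		= SG_ALL,
+	.sg_tablesize		= TOTAL_SG_ENTRY,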
From: FUJITA Tomonori <[email protected]>
Date: Fri, 11 Jul 2008 20:15:52 +0900
> On Fri, 11 Jul 2008 06:52:09 -0400 (EDT)
> Mikulas Patocka <[email protected]> wrote:
>
> > On Fri, 11 Jul 2008, FUJITA Tomonori wrote:
> >
> > > Yeah, IOMMUs can't guarantee that. The majority of architectures set
> > > BIO_VMERGE_BOUNDARY to 0 so they don't hit this, I think.
> >
> > Yes, the architectures without IOMMU don't hit this problem.
>
> I meant that even if some architectures support IOMMUs, they set
> BIO_VMERGE_BOUNDARY to 0.
Keep in mind that these settings were added long before
we supported segment boundary restrictions.
Someone added code to handle segment boundaries, but didn't
fix any of the block I/O layer infrastructure :-)
Several platforms that have an IOMMU but set these values to zero
actually did so for another reason. They considered being
required to always merge page-adjacent mappings virtually too
strong a requirement to meet 100% of the time.
On Fri, 11 Jul 2008, David Miller wrote:
> From: FUJITA Tomonori <[email protected]>
> Date: Fri, 11 Jul 2008 20:15:52 +0900
>
>> On Fri, 11 Jul 2008 06:52:09 -0400 (EDT)
>> Mikulas Patocka <[email protected]> wrote:
>>
>>> On Fri, 11 Jul 2008, FUJITA Tomonori wrote:
>>>
>>>> Yeah, IOMMUs can't guarantee that. The majority of architectures set
>>>> BIO_VMERGE_BOUNDARY to 0 so they don't hit this, I think.
>>>
>>> Yes, the architectures without IOMMU don't hit this problem.
>>
>> I meant that even if some architectures support IOMMUs, they set
>> BIO_VMERGE_BOUNDARY to 0.
>
> Keep in mind that these settings were added long before
> we supported segment boundary restrictions.
>
> Someone added code to handle segment boundaries, but didn't
> fix any of the block I/O layer infrastructure :-)
>
> Several platforms that have an IOMMU but set these values to zero
> actually did so for another reason. They considered being
> required to always merge page-adjacent mappings virtually too
> strong a requirement to meet 100% of the time.
It is broken on Sparc64 even without boundary restrictions --- if you skip
over an already allocated entry in the IOMMU table, you don't merge either.
I'd just drop it, because these requirements seem too brittle to
maintain. It is too easy to introduce a bug here and too hard to check for it.
Basically, there are a few independent code parts (the I/O layer and the
arch-specific IOMMUs) that are attempting to do the same calculation, and
if they differ, the driver will crash. Even if we managed to fix it,
someone would likely break it again after a year or two :-(
Would it mean that the nr_hw_segments entry in bio and request could be
dropped too? Or is it used for some other purpose?
BTW: what's the reason that by default (without any driver intervention)
device DMA is restricted from crossing a 64k boundary?
Mikulas
Mikulas Patocka <[email protected]> writes:
> I'm getting crashes with the InitIO A100u2w controller on Sparc64 (I had
> to fix the endianness issues in the driver, but that's unrelated).
x86-64 (and powerpc) solved this a long time ago by only doing
opportunistic merging: as in don't announce to the block layer
that you can merge, but try to merge anyways. This way SG lists
are not necessarily filled completely, but it's still better
than overflowing them in some rare cases.
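A rough sketch of the resulting contract (count_phys_segments() is a made-up
stand-in for the block layer's physical segment accounting):

        static int map_request(struct device *dev, struct Scsi_Host *shost,
                               struct scatterlist *sgl, struct request *req)
        {
                int nents = count_phys_segments(req);
                int mapped;

                /* guaranteed: the block layer assumed no merging at all */
                BUG_ON(nents > shost->sg_tablesize);

                /* the IOMMU may still merge here, which can only shrink the
                 * list, so the HBA's SG table can never overflow */
                mapped = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);
                return mapped;
        }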
-Andi
> Mikulas Patocka <[email protected]> writes:
>
>> I'm getting crashes with the InitIO A100u2w controller on Sparc64 (I had
>> to fix the endianness issues in the driver, but that's unrelated).
>
> x86-64 (and powerpc) solved this a long time ago by only doing
> opportunistic merging: as in don't announce to the block layer
> that you can merge, but try to merge anyways. This way SG lists
> are not necessarily filled completely, but it's still better
> than overflowing them in some rare cases.
>
> -Andi
There's an option "biomerge" that enables that feature. I'm wondering if
there's some situation where it should be used.
Mikulas
Mikulas Patocka wrote:
>> Mikulas Patocka <[email protected]> writes:
>>
>>> I'm getting crashes with the InitIO A100u2w controller on Sparc64 (I had
>>> to fix the endianness issues in the driver, but that's unrelated).
>>
>> x86-64 (and powerpc) solved this a long time ago by only doing
>> opportunistic merging: as in don't announce to the block layer
>> that you can merge, but try to merge anyways. This way SG lists
>> are not necessarily filled completely, but it's still better
>> than overflowing them in some rare cases.
>>
>> -Andi
>
> There's an option "biomerge" that enables that feature. I'm wondering if
> there's some situation where it should be used.
It would only help if your SG hardware is slower at processing merges
than the IOMMU.
We found one long ago (the old MPT Fusion) that gained
a few percent on some benchmarks, but that might be quite different
depending on the particular IO controller and chipset.
Also, biomerge was used for the AMD K8 GART "IOMMU", which is
relatively slow; the GART IOMMU is also small, so fragmentation is more
likely. The situation on more modern real x86 IOMMUs (like those in newer
Intel or IBM chipsets) might well be different.
Still I would expect that modern IO controllers are typically fast
enough at processing SG lists that it shouldn't matter much.
-Andi
From: Andi Kleen <[email protected]>
Date: Sun, 13 Jul 2008 15:50:55 +0200
> Still I would expect that modern IO controllers are typically fast
> enough at processing SG lists that it shouldn't matter much.
I know it matters a lot on sparc64 ESP scsi controllers.
You can only have one address/len pair DMA'ing at a time and you have
to service an interrupt to load in the next DMA sg elements into
the chip's registers.
Merging is essentially a must for performance on those cards.
David Miller wrote:
> From: Andi Kleen <[email protected]>
> Date: Sun, 13 Jul 2008 15:50:55 +0200
>
>> Still I would expect that modern IO controllers are typically fast
>> enough at processing SG lists that it shouldn't matter much.
>
> I know it matters a lot on sparc64 ESP scsi controllers.
>
> You can only have one address/len pair DMA'ing at a time and you have
> to service an interrupt to load in the next DMA sg elements into
> the chip's registers.
>
> Merging is essentially a must for performance on those cards.
Well right now your setup breaks all controllers with "weird requirements"
like 64k DMA or similar. You'll need to find some way to turn off BIO
merge for those at least.
Perhaps this needs to be really a block queue attribute instead of a global?
-Andi
On Sun, 13 Jul 2008, Andi Kleen wrote:
> David Miller wrote:
>> From: Andi Kleen <[email protected]>
>> Date: Sun, 13 Jul 2008 15:50:55 +0200
>>
>>> Still I would expect that modern IO controllers are typically fast
>>> enough at processing SG lists that it shouldn't matter much.
>>
>> I know it matters a lot on sparc64 ESP scsi controllers.
>>
>> You can only have one address/len pair DMA'ing at a time and you have
>> to service an interrupt to load in the next DMA sg elements into
>> the chip's registers.
>>
>> Merging is essentially a must for performance on those cards.
>
> Well right now your setup breaks all controllers with "weird requirements"
> like 64k DMA or similar. You'll need to find some way to turn off BIO
> merge for those at least.
>
> Perhaps this needs to be really a block queue attribute instead of a global?
>
> -Andi
There is no need to turn BIO merging off --- the problem is caused by the
accounting of BIO merging in the block layer, not by BIO merging itself.
Just do #define BIO_VMERGE_BOUNDARY 0, and that disables the accounting
but leaves merging as it is.
Mikulas
From: Andi Kleen <[email protected]>
Date: Sun, 13 Jul 2008 22:13:17 +0200
> David Miller wrote:
> > From: Andi Kleen <[email protected]>
> > Date: Sun, 13 Jul 2008 15:50:55 +0200
> >
> >> Still I would expect that modern IO controllers are typically fast
> >> enough at processing SG lists that it shouldn't matter much.
> >
> > I know it matters a lot on sparc64 ESP scsi controllers.
> >
> > You can only have one address/len pair DMA'ing at a time and you have
> > to service an interrupt to load in the next DMA sg elements into
> > the chip's registers.
> >
> > Merging is essentially a must for performance on those cards.
>
> Well right now your setup breaks all controllers with "weird requirements"
> like 64k DMA or similar. You'll need to find some way to turn off BIO
> merge for those at least.
>
> Perhaps this needs to be really a block queue attribute instead of a global?
Like I said, that code was written at a time when none of the block
segment check stuff existed, and therefore worked perfectly fine in
the environment in which it was created.
Someone added the segmenting code, but didn't bother to add proper
checking to the merging bits.
Usually we revert code that breaks things like that, right?
So I find it unusual for people to talk about turning off the
code that was working perfectly fine previously in situations
like this.
From: Mikulas Patocka <[email protected]>
Date: Sun, 13 Jul 2008 19:53:01 -0400 (EDT)
> There is no need to turn BIO merging off --- the problem is caused by the
> accounting of BIO merging in the block layer, not by BIO merging itself.
>
> Just do #define BIO_VMERGE_BOUNDARY 0, and that disables the accounting
> but leaves merging as it is.
For the thousandth time, the BIO_VMERGE_BOUNDARY code is useful
and worked perfectly fine before segment boundary handling was
added to the block layer.
It's a regression, and as such should be fixed or the guilty
code reverted.
Since when do we say "sorry that got broken, turn it off, thanks"?
On Sun, 13 Jul 2008 17:41:19 -0700 (PDT)
David Miller <[email protected]> wrote:
> From: Andi Kleen <[email protected]>
> Date: Sun, 13 Jul 2008 22:13:17 +0200
>
> > David Miller wrote:
> > > From: Andi Kleen <[email protected]>
> > > Date: Sun, 13 Jul 2008 15:50:55 +0200
> > >
> > >> Still I would expect that modern IO controllers are typically fast
> > >> enough at processing SG lists that it shouldn't matter much.
> > >
> > > I know it matters a lot on sparc64 ESP scsi controllers.
> > >
> > > You can only have one address/len pair DMA'ing at a time and you have
> > > to service an interrupt to load in the next DMA sg elements into
> > > the chip's registers.
> > >
> > > Merging is essentially a must for performance on those cards.
> >
> > Well right now your setup breaks all controllers with "weird requirements"
> > like 64k DMA or similar. You'll need to find some way to turn off BIO
> > merge for those at least.
> >
> > Perhaps this needs to be really a block queue attribute instead of a global?
>
> Like I said, that code was written at a time when none of the block
> segment check stuff existed, and therefore worked perfectly fine in
> the environment in which it was created.
>
> Someone added the segmenting code, but didn't bother to add proper
> checking to the merging bits.
>
> Usually we revert code that breaks things like that, right?
>
> So I find it unusual for people to talk about turning off the
> code that was working perfectly fine previously in situations
> like this.
Seems that there is some confusion.
It's not likely that the DMA boundary restriction causes this issue,
because we set it to 4G by default.
As Mikulas pointed out, before my IOMMU work this problem existed on
SPARC64 (and other architectures that set BIO_VMERGE_BOUNDARY to non
zero), because IOMMUs can't guarantee that they merge sg segments.
I think that we now hit this problem due to the max_segment_size.
In the past, IOMMUs aggressively merged sg entries: IOMMUs ignored the
block layer's max_segment_size (64K by default), merged sg segments,
and created a large segment. But most LLDs can handle a segment size
larger than 64K. So everything was fine.
Now IOMMUs don't ignore the max_segment_size. We hit this problem.
It's the right thing that IOMMUs don't ignore the max_segment_size. I
guess that if the A100u2w driver sets max_segment_size to a larger
value, the problem will be fixed (a sketch follows below). However, as we
discussed, IOMMUs can't guarantee that they merge sg segments. It's
possible that we still hit this problem.
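A sketch of that part of the fix, in the driver's PCI probe path (the value
is an arbitrary example, and dev->dma_parms must already be allocated for
dma_set_max_seg_size() to succeed; error handling simplified):

        /* Tell the DMA layer this device accepts segments larger than the
         * 64K default, so the IOMMU's max_seg_size check stops splitting
         * merged segments. */
        if (dma_set_max_seg_size(&pdev->dev, 0x40000))
                return -EIO;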
We tell SCSI driver developers that their drivers won't get a larger
number of segments than they tell the SCSI subsystem they can handle. If we
keep the virtual merge concept, we need to fix this first.
Sorry about this problem. As I said, this problem existed before my
IOMMU work, but I should have taken care of this.
From: FUJITA Tomonori <[email protected]>
Date: Mon, 14 Jul 2008 11:19:02 +0900
> Now IOMMUs don't ignore the max_segment_size. We hit this problem.
It seems simply that max_segment_size needs to propagate
down into the block layer VMERGE code so that it can
act accordingly.
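Roughly --- a sketch of the direction, not a tested patch: wherever
blk_recalc_rq_segments and friends currently test only the global constant
when growing a hw segment, also test the queue's per-device limit:

        /* was: if (BIOVEC_VIRT_OVERSIZE(hw_seg_size + bv->bv_len))
         *              goto new_hw_segment; */
        if (BIOVEC_VIRT_OVERSIZE(hw_seg_size + bv->bv_len) ||
            hw_seg_size + bv->bv_len > q->max_segment_size)
                goto new_hw_segment;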
On Sun, 13 Jul 2008, David Miller wrote:
> From: Mikulas Patocka <[email protected]>
> Date: Sun, 13 Jul 2008 19:53:01 -0400 (EDT)
>
>> There is no need to turn BIO merging off --- the problem is caused by the
>> accounting of BIO merging in the block layer, not by BIO merging itself.
>>
>> Just do #define BIO_VMERGE_BOUNDARY 0, and that disables the accounting
>> but leaves merging as it is.
>
> For the thousandth time, the BIO_VMERGE_BOUNDARY code is useful
> and worked perfectly fine before segment boundary handling was
> added to the block layer.
>
> It's a regression, and as such should be fixed or the guilty
> code reverted.
>
> Since when do we say "sorry that got broken, turn it off, thanks"?
And which was the supposed "working" kernel version?
In the current kernel there are three conditions that can cause a merge
failure (and a possible crash as a result):
1. dma_addr != dma_next --- skipping over an allocated entry
2. outs->dma_length + s->length > max_seg_size --- the max segment size; this
is what I was hitting
3. is_span_boundary(out_entry, base_shift, seg_boundary_size, outs, s) ---
devices that have a special dma_get_seg_boundary(dev)
So, show me a sparc64 kernel where none of these conditions existed and
where the merge accounting was bug-free.
I suppose that before condition (2) was added, conditions (1) and (3)
still existed, making crashes still possible, although not as common as
with condition (2). But if you want to show me otherwise --- a bug-free
implementation of merge accounting --- just do it.
> For the thousandth time, the BIO_VMERGE_BOUNDARY code is useful
Where? BIO_VMERGE_BOUNDARY has nothing to do with the actual merging (the
merging happens with or without BIO_VMERGE_BOUNDARY).
As you mentioned the ESP driver, it declares .sg_tablesize = SG_ALL, so
BIO_VMERGE_BOUNDARY has no effect on the operation of this driver. Any
other driver where BIO_VMERGE_BOUNDARY does matter?
Mikulas
From: Mikulas Patocka <[email protected]>
Date: Mon, 14 Jul 2008 08:16:13 -0400 (EDT)
> As you mentioned the ESP driver, it declares .sg_tablesize = SG_ALL, so
> BIO_VMERGE_BOUNDARY has no effect on the operation of this driver. Any
> other driver where BIO_VMERGE_BOUNDARY does matter?
When BIO_VMERGE_BOUNDARY is turned on, requests that would not
otherwise fit into the device's limits, can.
It's the only way to know ahead of time that the IOMMU is going
to merge things for us.
On Mon, 14 Jul 2008, David Miller wrote:
> From: Mikulas Patocka <[email protected]>
> Date: Mon, 14 Jul 2008 08:16:13 -0400 (EDT)
>
>> As you mentioned the ESP driver, it declares .sg_tablesize = SG_ALL, so
>> BIO_VMERGE_BOUNDARY has no effect on the operation of this driver. Any
>> other driver where BIO_VMERGE_BOUNDARY does matter?
>
> When BIO_VMERGE_BOUNDARY is turned on, requests that would not
> otherwise fit into the device's limits, can.
Why would someone want to overshoot SG_ALL? ... and, shouldn't the
constant be increased then --- instead of making buggy BIO_VMERGE_BOUNDARY
expectations?
Mikulas
> It's the only way to know ahead of time that the IOMMU is going
> to merge things for us.
On Sun, 13 Jul 2008 20:20:35 -0700 (PDT)
David Miller <[email protected]> wrote:
> From: FUJITA Tomonori <[email protected]>
> Date: Mon, 14 Jul 2008 11:19:02 +0900
>
> > Now IOMMUs don't ignore the max_segment_size. We hit this problem.
>
> It seems simply that max_segment_size needs to propagate
> down into the block layer VMERGE code so that it can
> act accordingly.
Yeah, I think so. I'll take care of that.
But I think that even with the fix, SPARC64 IOMMU still hits the
problem.
With the VMERGE accounting enabled,
- an IOMMU must merge segments that the block layer expects the IOMMU
to merge.
- an IOMMU must return an error if it can't merge such segments.
But the current SPARC64 IOMMU code doesn't return an error if it can't
merge such segments (as POWER's IOMMU does, as you know).
dma_4u_map_sg() has:
        if (segstart != s) {
                /* We cannot merge if:
                 * - allocated dma_addr isn't contiguous to previous allocation
                 */
                if ((dma_addr != dma_next) ||
                    (outs->dma_length + s->length > max_seg_size) ||
                    (is_span_boundary(out_entry, base_shift,
                                      seg_boundary_size, outs, s))) {
                        /* Can't merge: create a new segment */
                        segstart = s;
                        outcount++;
                        outs = sg_next(outs);
So if the IOMMU allocated dma_addr isn't contiguous to previous
allocation, it might not merge segments that the block layer expected
the IOMMU to merge.
We need kinda two-phase merging code, such as the old SPARC64 IOMMU
code and the PARISC IOMMUs have, though I like the new simple SPARC64 IOMMU
code (a rough sketch follows below).
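A very rough sketch of what "two phase" means here; all the helper names
are invented:

        /* Phase 1: group the sg entries by exactly the rules the block
         * layer used (vmerge boundary, max_seg_size, segment boundary),
         * before any bus address exists. */
        for (i = 0, ngroups = 0; i < nents; i++)
                group[i] = (i && can_vmerge(&sg[i - 1], &sg[i]))
                           ? group[i - 1] : ngroups++;

        /* Phase 2: allocate one contiguous IOMMU range per group and map
         * all of the group's pages into it, so "dma_addr != dma_next" can
         * never break a merge that the block layer was promised; if an
         * allocation fails, return an error instead of silently splitting. */
        for (g = 0; g < ngroups; g++)
                if (map_group(iommu, sg, group, g) < 0)
                        return 0;       /* mapping error */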
On Tue, 15 Jul 2008, FUJITA Tomonori wrote:
> On Sun, 13 Jul 2008 20:20:35 -0700 (PDT)
> David Miller <[email protected]> wrote:
>
>> From: FUJITA Tomonori <[email protected]>
>> Date: Mon, 14 Jul 2008 11:19:02 +0900
>>
>>> Now IOMMUs don't ignore the max_segment_size. We hit this problem.
>>
>> It seems simply that max_segment_size needs to propagate
>> down into the block layer VMERGE code so that it can
>> act accordingly.
>
> Yeah, I think so. I'll take care of that.
>
> But I think that even with the fix, SPARC64 IOMMU still hits the
> problem.
>
> With the VMERGE accounting enabled,
>
> - an IOMMU must merge segments that the block layer expects the IOMMU
> to merge.
>
> - an IOMMU must return an error if it can't merge such segments.
>
> But the current SPARC64 IOMMU code doesn't return an error if it can't
> merge such segments (as POWER's IOMMU does, as you know).
BTW, what should the block device driver do when it receives a mapping
error? (If it aborts the request and it was a write request, there will be
data corruption.) Is it even legal to return an IOMMU error in a case where no
real hardware error or overflow happened?
Mikulas
From: Mikulas Patocka <[email protected]>
Date: Mon, 14 Jul 2008 10:03:29 -0400 (EDT)
> On Mon, 14 Jul 2008, David Miller wrote:
>
> > From: Mikulas Patocka <[email protected]>
> > Date: Mon, 14 Jul 2008 08:16:13 -0400 (EDT)
> >
> >> As you mentioned the ESP driver, it declares .sg_tablesize = SG_ALL, so
> >> BIO_VMERGE_BOUNDARY has no effect on the operation of this driver. Any
> >> other driver where BIO_VMERGE_BOUNDARY does matter?
> >
> > When BIO_VMERGE_BOUNDARY is turned on, requests that would not
> > otherwise fit into the device's limits, can.
>
> Why would someone want to overshoot SG_ALL? ... and, shouldn't the
> constant be increased then --- instead of making buggy BIO_VMERGE_BOUNDARY
> expectations?
I'm talking about other devices, not about the ESP, here.
>>>> As you mentioned the ESP driver, it declares .sg_tablesize = SG_ALL, so
>>>> BIO_VMERGE_BOUNDARY has no effect on the operation of this driver. Any
>>>> other driver where BIO_VMERGE_BOUNDARY does matter?
>>>
>>> When BIO_VMERGE_BOUNDARY is turned on, requests that would not
>>> otherwise fit into the device's limits, can.
>>
>> Why would someone want to overshoot SG_ALL? ... and, shouldn't the
>> constant be increased then --- instead of making buggy BIO_VMERGE_BOUNDARY
>> expectations?
>
> I'm talking about other devices, not about the ESP, here.
And which ones are those important drivers that need merge accounting?
A100U2W is an old card; I got it for $8.50 at a bazaar, and it does 38MB/s. This
virtual merge accounting helps to stuff on average 4 more segments into
the 32-entry table.
So the question is: to reduce the number of requests by 12% on an outdated
SCSI card, is it sensible to maintain complicated merge accounting logic
in the core block layer? To me, it doesn't seem sensible.
Mikulas
From: Mikulas Patocka <[email protected]>
Date: Mon, 14 Jul 2008 19:16:17 -0400 (EDT)
> So the question is: to reduce the number of requests by 12% on an outdated
> SCSI card, is it sensible to maintain complicated merge accounting logic
> in the core block layer? To me, it doesn't seem sensible.
Rip out the code if you like, then. I really don't have time to
work on this myself. So if you do, by all means do whatever
you think is appropriate.
On Mon, 14 Jul 2008 10:03:29 -0400 (EDT)
Mikulas Patocka <[email protected]> wrote:
> On Mon, 14 Jul 2008, David Miller wrote:
>
> > From: Mikulas Patocka <[email protected]>
> > Date: Mon, 14 Jul 2008 08:16:13 -0400 (EDT)
> >
> >> As you mentioned the ESP driver, it declares .sg_tablesize = SG_ALL, so
> >> BIO_VMERGE_BOUNDARY has no effect on the operation of this driver. Any
> >> other driver where BIO_VMERGE_BOUNDARY does matter?
> >
> > When BIO_VMERGE_BOUNDARY is turned on, requests that would not
> > otherwise fit into the device's limits, can.
>
> Why would someone want to overshoot SG_ALL? ... and, shouldn't the
> constant be increased then --- instead of making buggy BIO_VMERGE_BOUNDARY
> expectations?
'buggy' is the wrong word. VMERGE works for some IOMMUs (though I think
we need to fix it with respect to max_segment_size).
On Mon, 14 Jul 2008 17:26:38 -0400 (EDT)
Mikulas Patocka <[email protected]> wrote:
> On Tue, 15 Jul 2008, FUJITA Tomonori wrote:
>
> > On Sun, 13 Jul 2008 20:20:35 -0700 (PDT)
> > David Miller <[email protected]> wrote:
> >
> >> From: FUJITA Tomonori <[email protected]>
> >> Date: Mon, 14 Jul 2008 11:19:02 +0900
> >>
> >>> Now IOMMUs don't ignore the max_segment_size. We hit this problem.
> >>
> >> It seems simply that max_segment_size needs to propagate
> >> down into the block layer VMERGE code so that it can
> >> act accordingly.
> >
> > Yeah, I think so. I'll take care of that.
> >
> > But I think that even with the fix, SPARC64 IOMMU still hits the
> > problem.
> >
> > With the VMERGE accounting enabled,
> >
> > - an IOMMU must merge segments that the block layer expects the IOMMU
> > to merge.
> >
> > - an IOMMU must return an error if it can't merge such segments.
> >
> > But the current SPARC64 IOMMU code doesn't return an error if it can't
> > merge such segments (as POWER's IOMMU does, as you know).
>
> BTW, what should the block device driver do when it receives a mapping
> error? (If it aborts the request and it was a write request, there will be
> data corruption.)
I'm not sure how an aborted request can corrupt data on disk.
> Is it even legal to return an IOMMU error in a case where no
> real hardware error or overflow happened?
Of course, it's legal. It's a common event, kind of like OOM. It's very
possible with old IOMMUs that have a small I/O space. A SCSI driver
retries a failed request later. But note that some drivers are still
not able to handle that.
http://marc.info/?l=linux-kernel&m=121300637114672&w=2
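For the record, the pattern a driver needs, sketched with the scsi_dma_map()
helper (error paths simplified):

        static int example_queuecommand(struct scsi_cmnd *cmd,
                                        void (*done)(struct scsi_cmnd *))
        {
                int nents = scsi_dma_map(cmd);  /* negative on mapping failure */

                if (nents < 0)
                        /* back off: the midlayer requeues the command and
                         * retries it later, so nothing is lost */
                        return SCSI_MLQUEUE_HOST_BUSY;

                /* ... program the HBA with nents SG entries and complete
                 * via done() from the interrupt handler ... */
                return 0;
        }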
>> BTW, what should the block device driver do when it receives a mapping
>> error? (If it aborts the request and it was a write request, there will be
>> data corruption.)
>
> I'm not sure how an aborted request can corrupt data on disk.
Writes are done by an async daemon and no one checks for their completion
status. If there are three writes to the directory, inode table and inode
bitmap, and one of these writes fails, there's no code to undo the other
two. So the filesystem will be corrupted on write failure.
>> Is it even legal to return an IOMMU error in a case where
>> no real hardware error or overflow happened?
>
> Of course, it's legal. It's a common event, kind of like OOM. It's very
> possible with old IOMMUs that have a small I/O space. A SCSI driver
> retries a failed request later. But note that some drivers are still
> not able to handle that.
>
> http://marc.info/?l=linux-kernel&m=121300637114672&w=2
So should it return SCSI_MLQUEUE_HOST_BUSY? Or something else?
Unfortunately, many of the drivers contain BUG_ON or no check at all.
Mikulas
Mikulas Patocka wrote:
>>> BTW, what should the block device driver do when it receives a mapping
>>> error? (If it aborts the request and it was a write request, there will be
>>> data corruption.)
>>
>> I'm not sure how an aborted request can corrupt data on disk.
>
> Writes are done by an async daemon and no one checks for their
> completion status. If there are three writes to the directory, inode table
> and inode bitmap, and one of these writes fails, there's no code to undo
> the other two. So the filesystem will be corrupted on write failure.
Normally journaling in ordered mode takes care of that. The transaction
is not committed until all earlier data has been successfully written.
And even the other filesystems typically turn the file system read-only
on IO error to prevent further corruption.
A bigger issue is when the IOMMU mapping fails but the driver doesn't have
error handling code. In the early days of the x86-64 GART IOMMU I managed
to corrupt a few superblocks this way, because random data would be written then.
The block drivers have this pretty much all fixed these days, but in some other
areas (like network drivers) it looks much worse. So if your file system is
actually a network file system this might be a problem.
-Andi
On Tue, 15 Jul 2008 08:09:53 -0400 (EDT)
Mikulas Patocka <[email protected]> wrote:
> >> BTW, what should the block device driver do when it receives a mapping
> >> error? (If it aborts the request and it was a write request, there will be
> >> data corruption.)
> >
> > I'm not sure how an aborted request can corrupt data on disk.
>
> Writes are done by an async daemon and no one checks for their completion
> status. If there are three writes to the directory, inode table and inode
> bitmap, and one of these writes fails, there's no code to undo the other
> two. So the filesystem will be corrupted on write failure.
Are you talking about file system metadata corruption?
If your logic is correct, we hit the corruption every time an FC cable
(switch, or whatever) has a problem. Fortunately, we don't; modern
file systems can handle that.
> >> Is it even legal to return an IOMMU error in a case where
> >> no real hardware error or overflow happened?
> >
> > Of course, it's legal. It's a common event, kind of like OOM. It's very
> > possible with old IOMMUs that have a small I/O space. A SCSI driver
> > retries a failed request later. But note that some drivers are still
> > not able to handle that.
> >
> > http://marc.info/?l=linux-kernel&m=121300637114672&w=2
>
> So should it return SCSI_MLQUEUE_HOST_BUSY? Or something else?
>
> Unfortunately, many of the drivers contain BUG_ON or no check at all.
Yeah, many of the old scsi drivers contain BUG_ON. They are not
likely to be used with IOMMUs. We don't care about them much. The
recent scsi drivers can handle this.
On Tue, 15 Jul 2008, Andi Kleen wrote:
> Mikulas Patocka wrote:
>>>> BTW, what should the block device driver do when it receives a mapping
>>>> error? (If it aborts the request and it was a write request, there will be
>>>> data corruption.)
>>>
>>> I'm not sure how an aborted request can corrupt data on disk.
>>
>> Writes are done by an async daemon and no one checks for their
>> completion status. If there are three writes to the directory, inode table
>> and inode bitmap, and one of these writes fails, there's no code to undo
>> the other two. So the filesystem will be corrupted on write failure.
>
> Normally journaling in ordered mode takes care of that. The transaction
> is not committed until all earlier data has been successfully written.
And if there was a write error, then what happens? Retry? Blocking of any
further updates?
> And even the other filesystems typically turn the file system read-only
> on IO error to prevent further corruption.
There is no interface by which a filesystem could learn that a buffer marked
with mark_buffer_dirty was not written. Or is there?
Mikulas
Mikulas Patocka wrote:
> On Tue, 15 Jul 2008, Andi Kleen wrote:
>
>> Mikulas Patocka wrote:
>>>>> BTW, what should the block device driver do when it receives a mapping
>>>>> error? (If it aborts the request and it was a write request, there
>>>>> will be data corruption.)
>>>>
>>>> I'm not sure how an aborted request can corrupt data on disk.
>>>
>>> Writes are done by an async daemon and no one checks for their
>>> completion status. If there are three writes to the directory, inode table
>>> and inode bitmap, and one of these writes fails, there's no code to undo
>>> the other two. So the filesystem will be corrupted on write failure.
>>
>> Normally journaling in ordered mode takes care of that. The transaction
>> is not committed until all earlier data has been successfully written.
>
> And if there was a write error, then what happens? Retry? Blocking of any
> further updates?
The file system is mounted r/o and the transaction is not committed.
Then on mount it is replayed. Similar for a journal write error.
>
>> And even the other filesystems typically turn the file system read-only
>> on IO error to prevent further corruption.
>
> There is no interface by which a filesystem could learn that a buffer marked
> with mark_buffer_dirty was not written. Or is there?
For journaled metadata at least, the file system usually checks synchronously
(e.g. by using sync_dirty_buffer() and handling the commit only when all
IO completed successfully). For normal data it is just handled by the normal
VFS functions.
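Concretely, the usual buffer-head pattern (simplified):

        mark_buffer_dirty(bh);
        sync_dirty_buffer(bh);          /* submits the write and waits */
        if (!buffer_uptodate(bh)) {
                /* write failed: the fs can refuse to commit the transaction
                 * and/or remount itself read-only */
                return -EIO;
        }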
-Andi
On Mon, 14 Jul 2008, David Miller wrote:
> From: Mikulas Patocka <[email protected]>
> Date: Mon, 14 Jul 2008 19:16:17 -0400 (EDT)
>
> > So the question is: to reduce the number of requests by 12% on an outdated
> > SCSI card, is it sensible to maintain complicated merge accounting logic
> > in the core block layer? To me, it doesn't seem sensible.
>
> Rip out the code if you like, then. I really don't have time to
> work on this myself. So if you do, by all means do whatever
> you think is appropriate.
So add a signed-off line and forward it to Linus.
Signed-off-by: Mikulas Patocka <[email protected]>
---
include/asm-sparc64/io.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
Index: linux-2.6.26-fast/include/asm-sparc64/io.h
===================================================================
--- linux-2.6.26-fast.orig/include/asm-sparc64/io.h 2008-07-15 23:29:39.000000000 +0200
+++ linux-2.6.26-fast/include/asm-sparc64/io.h 2008-07-15 23:33:41.000000000 +0200
@@ -16,7 +16,10 @@
/* BIO layer definitions. */
extern unsigned long kern_base, kern_size;
#define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT)
-#define BIO_VMERGE_BOUNDARY 8192
+
+/* Don't declare that we are virtual-merge capable because Sparc64 IOMMU code
+ doesn't guarantee that the merge is always possible */
+#define BIO_VMERGE_BOUNDARY 0
static inline u8 _inb(unsigned long addr)
{
From: Mikulas Patocka <[email protected]>
Date: Tue, 15 Jul 2008 18:32:37 -0400 (EDT)
> On Mon, 14 Jul 2008, David Miller wrote:
>
> > From: Mikulas Patocka <[email protected]>
> > Date: Mon, 14 Jul 2008 19:16:17 -0400 (EDT)
> >
> > > So the question is: to reduce the number of requests by 12% on an outdated
> > > SCSI card, is it sensible to maintain complicated merge accounting logic
> > > in the core block layer? To me, it doesn't seem sensible.
> >
> > Rip out the code if you like, then. I really don't have time to
> > work on this myself. So if you do, by all means do whatever
> > you think is appropriate.
>
> So add a signed-off line and forward it to Linus.
>
> Signed-off-by: Mikulas Patocka <[email protected]>
I said remove the code, not turn it off. I guess you didn't like
that option even though you seem heavily convinced that it buys
us essentially nothing, and I'm even starting to agree with you.
If the VMERGE code is going to stay, and it's a bug or a limitation in
the sparc64 IOMMU code, I'd rather that get fixed.
I have FUJITA's excellent analysis of the sparc64 specific IOMMU issue
in my inbox and I intend to have a look at it when I get a chance.
On Tue, 15 Jul 2008, David Miller wrote:
> From: Mikulas Patocka <[email protected]>
> Date: Tue, 15 Jul 2008 18:32:37 -0400 (EDT)
>
> > On Mon, 14 Jul 2008, David Miller wrote:
> >
> > > From: Mikulas Patocka <[email protected]>
> > > Date: Mon, 14 Jul 2008 19:16:17 -0400 (EDT)
> > >
> > > > So the question is: to reduce the number of requests by 12% on an outdated
> > > > SCSI card, is it sensible to maintain complicated merge accounting logic
> > > > in the core block layer? To me, it doesn't seem sensible.
> > >
> > > Rip out the code if you like, then. I really don't have time to
> > > work on this myself. So if you do, by all means do whatever
> > > you think is appropriate.
> >
> > So add a signed-off line and forward it to Linus.
> >
> > Signed-off-by: Mikulas Patocka <[email protected]>
>
> I said remove the code, not turn it off.
Removing the code from the block layer needs to be ACKed by the block
maintainers --- and it can be done if it is found that on the
architectures that use it (alpha, pa-risc) it has no benefit. I believe
it will eventually turn out that there is little/no benefit from the
hw_segments accounting, but it will take some time.
This patch is a simple fix that can be done now to stop sparc64 from
crashing. (Most other architectures also define it as 0 --- for the same
reason: they can't guarantee that the merge will happen.)
> I guess you didn't like that option even though you seem heavily
> convinced that it buys us essentially nothing, and I'm even starting to
> agree with you.
>
> If the VMERGE code is going to stay, and it's a bug or a limitation in
> the sparc64 IOMMU code, I'd rather that get fixed.
So if it stays and if you find time to fix it, you can revert it back to
8192.
But until one of the two possible solutions happens (removing the virtual
merge accounting, or declaring that it must stay and fixing sparc64), the
simplest way to avoid crashes is to define BIO_VMERGE_BOUNDARY 0.
> I have FUJITA's excellent analysis of the sparc64 specific IOMMU issue
> in my inbox and I intend to have a look at it when I get a chance.
Mikulas
From: FUJITA Tomonori <[email protected]>
Date: Tue, 15 Jul 2008 02:45:03 +0900
> dma_4u_map_sg() has:
>
>         if (segstart != s) {
>                 /* We cannot merge if:
>                  * - allocated dma_addr isn't contiguous to previous allocation
>                  */
>                 if ((dma_addr != dma_next) ||
>                     (outs->dma_length + s->length > max_seg_size) ||
>                     (is_span_boundary(out_entry, base_shift,
>                                       seg_boundary_size, outs, s))) {
>                         /* Can't merge: create a new segment */
>                         segstart = s;
>                         outcount++;
>                         outs = sg_next(outs);
>
> So if the IOMMU allocated dma_addr isn't contiguous to previous
> allocation, it might not merge segments that the block layer expected
> the IOMMU to merge.
>
> We need kinda two-phase merging code, such as the old SPARC64 IOMMU
> code and the PARISC IOMMUs have, though I like the new simple SPARC64 IOMMU
> code.
I see.
I wonder if all that complexity is really worth it. Also, all of this
IOMMU allocation and mapping code runs under a spinlock with hw IRQs
disabled.
More and more I'm seeing that it's likely better to remove the VMERGE
code. I can't see what it really buys us anymore, and to make it work
requires quite a large amount of complexity in the IOMMU layer.
On Tue, 15 Jul 2008 20:10:42 -0700 (PDT)
David Miller <[email protected]> wrote:
> From: FUJITA Tomonori <[email protected]>
> Date: Tue, 15 Jul 2008 02:45:03 +0900
>
> > dma_4u_map_sg() has:
> >
> >         if (segstart != s) {
> >                 /* We cannot merge if:
> >                  * - allocated dma_addr isn't contiguous to previous allocation
> >                  */
> >                 if ((dma_addr != dma_next) ||
> >                     (outs->dma_length + s->length > max_seg_size) ||
> >                     (is_span_boundary(out_entry, base_shift,
> >                                       seg_boundary_size, outs, s))) {
> >                         /* Can't merge: create a new segment */
> >                         segstart = s;
> >                         outcount++;
> >                         outs = sg_next(outs);
> >
> > So if the IOMMU allocated dma_addr isn't contiguous to previous
> > allocation, it might not merge segments that the block layer expected
> > the IOMMU to merge.
> >
> > We need kinda two-phase merging code, such as the old SPARC64 IOMMU
> > code and the PARISC IOMMUs have, though I like the new simple SPARC64 IOMMU
> > code.
>
> I see.
>
> I wonder if all that complexity is really worth it. Also, all of this
> IOMMU allocation and mapping code runs under a spinlock with hw IRQs
> disabled.
>
> More and more I'm seeing that it's likely better to remove the VMERGE
> code. I can't see what it really buys us anymore, and to make it work
> requires quite a large amount of complexity in the IOMMU layer.
Agreed. Especially with modern HBAs, the VMERGE accounting isn't
useful, I think. The recent IOMMU implementations, Intel VT-d and the AMD
virtualization one, don't even do virtual merging.
I'm fine with removing the VMERGE accounting in the block layer if
Jens and the users are happy with it.