2004-01-05 17:52:13

by Justin T. Gibbs

[permalink] [raw]
Subject: [BUG] x86_64 pci_map_sg modifies sg list - fails multiple map/unmaps

Berkley Shands recently tripped over this problem. The 2.6.X pci_map_sg
code for x86_64 modifies the passed in S/G list to compact it for mapping
by the GART. This modification is not reversed when pci_unmap_sg is
called. In the case of a retried SCSI command, this causes any attempt
to map the command a second time to fail with a BUG assertion since the
nseg parameter passed into the second map call is state. nseg comes from
the "use_sg" field in the SCSI command structure which is never touched
by the HBA drivers invoking pci_map_sg.

DMA-API.txt doesn't seem to cover this issue. Should the low-level DMA
code restore the S/G list to its original state on unmap or should the
SCSI HBA drivers be changed to update "use_sg" with the segment count
reported by the pci_map_sg() API? If the latter, this seems to contradict
the mandate in DMA-API that the nseg parameter passed into the unmap call
be the same as that passed into the map call. Most of the kernel assumes
that an S/G list can be mapped an unmapped multiple times using the same
arguments. This doesn't seem to me to be an unreasonable expectation.

--
Justin


2004-01-05 19:22:13

by David Miller

[permalink] [raw]
Subject: Re: [BUG] x86_64 pci_map_sg modifies sg list - fails multiple map/unmaps

On Mon, 05 Jan 2004 10:57:35 -0700
"Justin T. Gibbs" <[email protected]> wrote:

> DMA-API.txt doesn't seem to cover this issue. Should the low-level DMA
> code restore the S/G list to its original state on unmap or should the
> SCSI HBA drivers be changed to update "use_sg" with the segment count
> reported by the pci_map_sg() API? If the latter, this seems to contradict
> the mandate in DMA-API that the nseg parameter passed into the unmap call
> be the same as that passed into the map call. Most of the kernel assumes
> that an S/G list can be mapped an unmapped multiple times using the same
> arguments. This doesn't seem to me to be an unreasonable expectation.

No, the PCI layer is not required at all to restore the SG to it's original state
if it does DMA page coalescing as sparc64 and x86_64 do.

But I don't see what the real problem is, all the PCI layer is doing is setting up
the DMA address/length pairs in the entries that are needed, then returning
the number of such slots that exist. You, in your driver, can remember the original
total number of physical entries and use that value to pass things back in.

2004-01-05 19:30:01

by Berkley Shands

[permalink] [raw]
Subject: Re: [BUG] x86_64 pci_map_sg modifies sg list - fails multiple map/unmaps

The pci layer is modifying the sg list, and then placing a zero
in the length field. pci-gart.c at line 453 (2.6.0 sources) checks this length field
after a retry, sees that it is zero, and bughalts.
As Justin points out, SOMEBODY needs to reset the sg list, or kick the error
back up far enough to recreate it completely. Or just don't bother to
combine the entries. A machine with 8GB or more can stand to lose a few bytes.
How many hardware devices can handle a 1MB+ I/O buffer being passed to them anyway?

berkley

2004-01-05 19:33:53

by David Miller

[permalink] [raw]
Subject: Re: [BUG] x86_64 pci_map_sg modifies sg list - fails multiple map/unmaps

On Mon, 5 Jan 2004 13:29:54 -0600 (CST)
Berkley Shands <[email protected]> wrote:

> The pci layer is modifying the sg list, and then placing a zero
> in the length field. pci-gart.c at line 453 (2.6.0 sources) checks this length field
> after a retry, sees that it is zero, and bughalts.

Oh that's a bug. It is allowed to modify the dma_length field but not
the physical length field.

I imagine x86_64 is doing this so that there need not be a seperate dma_length
field in the scatter_gather struct defined for that platform, and that's too bad it will
definitely need such a seperate field if it wants to implement coalescing.

2004-01-05 19:48:16

by Andi Kleen

[permalink] [raw]
Subject: Re: [BUG] x86_64 pci_map_sg modifies sg list - fails multiple map/unmaps

"Justin T. Gibbs" <[email protected]> writes:

> Berkley Shands recently tripped over this problem. The 2.6.X pci_map_sg
> code for x86_64 modifies the passed in S/G list to compact it for mapping
> by the GART. This modification is not reversed when pci_unmap_sg is
> called. In the case of a retried SCSI command, this causes any attempt

Qlogic has the same bug.

Actually I disabled merging by default in the latest x86-64 code,
but it can be still enabled by the user using options (it makes some
adapters run several percent faster). I would appreciate if you could
fix the problem anyways.

I was actually planning to add a BUG() for this. Should do that.
There is already one that triggers often when the problem occurs.

> to map the command a second time to fail with a BUG assertion since the
> nseg parameter passed into the second map call is state. nseg comes from
> the "use_sg" field in the SCSI command structure which is never touched
> by the HBA drivers invoking pci_map_sg.
>
> DMA-API.txt doesn't seem to cover this issue. Should the low-level DMA

It does actually, but not very clearly. I've suggested changes of wording
recently to make this more clear, but it hasn't gotten in.

> code restore the S/G list to its original state on unmap or should the
> SCSI HBA drivers be changed to update "use_sg" with the segment count
> reported by the pci_map_sg() API? If the latter, this seems to contradict

You should never remap an already mapped sg list. Either reuse the already mapped
list or keep a copy of the original list around. First is better because the later
may have problems with the page reference counts.

-Andi

2004-01-05 19:58:26

by Justin T. Gibbs

[permalink] [raw]
Subject: Re: [BUG] x86_64 pci_map_sg modifies sg list - fails multiple map/unmaps

> "Justin T. Gibbs" <[email protected]> writes:
>
>> Berkley Shands recently tripped over this problem. The 2.6.X pci_map_sg
>> code for x86_64 modifies the passed in S/G list to compact it for mapping
>> by the GART. This modification is not reversed when pci_unmap_sg is
>> called. In the case of a retried SCSI command, this causes any attempt
>
> Qlogic has the same bug.

Which bug is this? Not updating use_sg, or mapping the command the
second time when it is retried? If the latter, I don't think this is an
HBA bug. The HBA driver doesn't know that the command has been mapped in
the past, so it will blindly map/unmap it again.

...

>> DMA-API.txt doesn't seem to cover this issue.
>
> It does actually, but not very clearly. I've suggested changes of wording
> recently to make this more clear, but it hasn't gotten in.

Can you point to the section of the document you believe applies?

> You should never remap an already mapped sg list.

But the sg list is no longer mapped. The HBA driver did call pci_unmap_sg
on it. Did you mean to say, "Never map an sg list again that has been
mapped in the past"?

> Either reuse the already mapped list or keep a copy of the original list
> around. First is better because the later may have problems with the page
> reference counts.

The mid-layer doesn't map the list. The HBA drivers do. So you're saying
that either the mid-layer or the HBA drivers need to copy the list so it
can be restored just in case the command will be retried?

--
Justin

2004-01-05 20:00:26

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [BUG] x86_64 pci_map_sg modifies sg list - fails multiple map/unmaps

On Monday 05 January 2004 09:57 am, Justin T. Gibbs wrote:
> Berkley Shands recently tripped over this problem. The 2.6.X pci_map_sg
> code for x86_64 modifies the passed in S/G list to compact it for mapping
> by the GART. This modification is not reversed when pci_unmap_sg is
> called. In the case of a retried SCSI command, this causes any attempt
> to map the command a second time to fail with a BUG assertion since the
> nseg parameter passed into the second map call is state. nseg comes from
> the "use_sg" field in the SCSI command structure which is never touched
> by the HBA drivers invoking pci_map_sg.
>
> DMA-API.txt doesn't seem to cover this issue. Should the low-level DMA
> code restore the S/G list to its original state on unmap or should the
> SCSI HBA drivers be changed to update "use_sg" with the segment count
> reported by the pci_map_sg() API? If the latter, this seems to contradict
> the mandate in DMA-API that the nseg parameter passed into the unmap call
> be the same as that passed into the map call. Most of the kernel assumes
> that an S/G list can be mapped an unmapped multiple times using the same
> arguments. This doesn't seem to me to be an unreasonable expectation.

Hi,

I ran into the same thing a month ago. I talked to Andi Kleen about this.
He seems to think that, it is okay to modify the sg-list in pci_map_sg().
pci_unmap_sg() can't restore it back, since we lost lot of information.
One option is to recreate entire sg-list in case of a retry. I used a patch
similar to following to do this. Is this acceptable ?

Thanks,
Badari

--- scsi_lib.c 2004-01-04 19:53:56.000000000 -0800
+++ scsi_lib.c.new 2004-01-04 19:57:35.000000000 -0800
@@ -308,6 +308,13 @@ void scsi_setup_cmd_retry(struct scsi_cm
cmd->cmd_len = cmd->old_cmd_len;
cmd->sc_data_direction = cmd->sc_old_data_direction;
cmd->underflow = cmd->old_underflow;
+ if (cmd->use_sg) {
+ struct request *req = cmd->request;
+ int count;
+
+ count = blk_rq_map_sg(req->q, req, cmd->request_buffer);
+ BUG_ON(count != cmd->use_sg);
+ }
}


2004-01-05 20:00:38

by Berkley Shands

[permalink] [raw]
Subject: Re: [BUG] x86_64 pci_map_sg modifies sg list - fails multiple map/unmaps

BUG_ON(s->length == 0);

size += s->length;

/* Handle the previous not yet processed entries */

if (i > start) {
struct scatterlist *ps = &sg[i-1];
/* Can only merge when the last chunk ends on a page
boundary. */
if (1 || !force_iommu || !need || (i-1 > start && ps->offset) ||
(ps->offset + ps->length) % PAGE_SIZE) {
if (pci_map_cont(sg, start, i, sg+out, pages,
need) < 0)
goto error;

If I totally disable coalescing, in arch/x86_64/pci-gart.c by adding the if (1 || ...
then the bughalt goes away. So might some performance, but a system that runs is
MUCH better than one that hard faults on a regular basis.

The shipped scsi driver still needs to be updated to NOT "offline" the drives.
the 2.0.5 driver does this quite well.

Thank you.

berkley

2004-01-05 20:41:13

by David Miller

[permalink] [raw]
Subject: Re: [BUG] x86_64 pci_map_sg modifies sg list - fails multiple map/unmaps

On Mon, 05 Jan 2004 20:47:19 +0100
Andi Kleen <[email protected]> wrote:

> Actually I disabled merging by default in the latest x86-64 code,
> but it can be still enabled by the user using options (it makes some
> adapters run several percent faster). I would appreciate if you could
> fix the problem anyways.
>
> I was actually planning to add a BUG() for this. Should do that.
> There is already one that triggers often when the problem occurs.

Andi, you must not modify sg->length in any way shape or form.

The following is legal:

pci_map_sg(..&sg);
pci_unmap_sg(...&sg);
pci_map_sg(..&sg);

If you must modify the length field for DMA, you must have a seperate
dma_length member of the scatterlist structure on your platform, see what
sparc64 does here.

If the documentation states this wrongly, it's a doc bug.

2004-01-05 20:58:31

by Andi Kleen

[permalink] [raw]
Subject: Re: [BUG] x86_64 pci_map_sg modifies sg list - fails multiple map/unmaps

"Justin T. Gibbs" <[email protected]> writes:

> > "Justin T. Gibbs" <[email protected]> writes:
> >
> >> Berkley Shands recently tripped over this problem. The 2.6.X pci_map_sg
> >> code for x86_64 modifies the passed in S/G list to compact it for mapping
> >> by the GART. This modification is not reversed when pci_unmap_sg is
> >> called. In the case of a retried SCSI command, this causes any attempt
> >
> > Qlogic has the same bug.
>
> Which bug is this? Not updating use_sg, or mapping the command the
> second time when it is retried? If the latter, I don't think this is an
> HBA bug. The HBA driver doesn't know that the command has been mapped in
> the past, so it will blindly map/unmap it again.

In some cases x86-64 pci_map_sg causes a BUG() when you pass it an already
mapped list. I've got reports of that happening with Qlogic.

I haven't analyzed it in detail, whether the mid layer or the driver
or someone else is to blame.

> ...
>
> >> DMA-API.txt doesn't seem to cover this issue.
> >
> > It does actually, but not very clearly. I've suggested changes of wording
> > recently to make this more clear, but it hasn't gotten in.
>
> Can you point to the section of the document you believe applies?

It documents that pci_map_sg rewrites the sg list and never guarantees
that the rewriting is undone on pci_unmap_sg.

> > You should never remap an already mapped sg list.
>
> But the sg list is no longer mapped. The HBA driver did call pci_unmap_sg
> on it. Did you mean to say, "Never map an sg list again that has been
> mapped in the past"?

Yep. While it would be in theory possible to reconstruct the list at
unmap time it would be extremly costly (require lots of uncached
memory access on x86-64) and is not done.

>
> > Either reuse the already mapped list or keep a copy of the original list
> > around. First is better because the later may have problems with the page
> > reference counts.
>
> The mid-layer doesn't map the list. The HBA drivers do. So you're saying
> that either the mid-layer or the HBA drivers need to copy the list so it
> can be restored just in case the command will be retried?

I'm just pointing out the requirements of pci_map_sg.

Either you can keep a flag around somewhere that says if the list is
mapped or not and skip the remapping and don't do a unmap in
between. Or copy and make sure the page reference counts are correctly
maintained. I'm not very intimate with the SCSI/block code so you
guys have to decide what is more convenient.

-Andi

2004-01-05 21:09:35

by Andi Kleen

[permalink] [raw]
Subject: Re: [BUG] x86_64 pci_map_sg modifies sg list - fails multiple map/unmaps

On Mon, Jan 05, 2004 at 12:35:09PM -0800, David S. Miller wrote:
> On Mon, 05 Jan 2004 20:47:19 +0100
> Andi Kleen <[email protected]> wrote:
>
> > Actually I disabled merging by default in the latest x86-64 code,
> > but it can be still enabled by the user using options (it makes some
> > adapters run several percent faster). I would appreciate if you could
> > fix the problem anyways.
> >
> > I was actually planning to add a BUG() for this. Should do that.
> > There is already one that triggers often when the problem occurs.
>
> Andi, you must not modify sg->length in any way shape or form.
>
> The following is legal:
>
> pci_map_sg(..&sg);
> pci_unmap_sg(...&sg);
> pci_map_sg(..&sg);

Well, on x86-64 it is not legal.

> If you must modify the length field for DMA, you must have a seperate
> dma_length member of the scatterlist structure on your platform, see what
> sparc64 does here.
>
> If the documentation states this wrongly, it's a doc bug.

The documentation doesn't allow it so I didn't implement it.

-Andi

2004-01-07 15:35:35

by Berkley Shands

[permalink] [raw]
Subject: Re: [BUG] x86_64 pci_map_sg modifies sg list - fails multiple map/unmaps

Running with the force segment merge OFF panics the processor after
about 1000 scsi retries. the error given, also in pci-gart.c, is
pci_map_area overflow 4096 bytes
So a brain dead repair kills the kernel. Someone clearly needs to figure
out where to correct the merge of the sg lists. A bit of doc on the iommu
and the 4096 byte limit would be nice too :-)

I see that is is the aborting of an SCB that causes the sg list halt.

Jan 7 09:18:32 typhoon kernel: DevQ(0:6:0): 0 waiting
Jan 7 09:18:32 typhoon kernel: (scsi0:A:2:0): SCB 0x46 - timed out
Jan 7 09:18:32 typhoon kernel: Recovery SCB completes
Jan 7 09:18:32 typhoon kernel: scsi0: Issued Channel A Bus Reset. 3 SCBs aborted
Jan 7 09:18:46 typhoon kernel: Did it again, boss 0000:01:03.0

Since the sg list merges into one i/o list, simply adding s->length = 4096
back into the list seems to keep the kernel up. a better if slightly less
stupid fix is to add up the remaining sg list lengths, and ajust
the sg[0] entry to sum to the correct value.

/* BUG_ON(s->length == 0); */
if (! s->length)
{
unsigned long zero = sg[0].length;
unsigned long remain = 0;
int t = 0;

BUG_ON(i != 1); /* some other error here */

for (t = i + 1; t < nents; t++)
remain += sg[t].length; /* collect remaining sizes */
zero -= remain; /* deduct what is left on the list */
sg[0].length = zero / 2;
sg[1].length = zero / 2; /* allocate uniformly */
size = zero / 2; /* reduce oversize first entry */
printk(KERN_WARNING "Did it again, boss %s\n", dev->slot_name);
}

The better solution is to have the upper layer fix the sg list, or
have some marker that the list was diddled, and save the old entries
to put it back.

berkley

2004-01-07 16:33:28

by Berkley Shands

[permalink] [raw]
Subject: Re: [BUG] x86_64 pci_map_sg modifies sg list - fails multiple map/unmaps

Here is a somewhat more robust (and trivial) fix for the sg iommu
problem on the x86_64. Still does not correct the sg list modification logic though :-)

--- /raid0/kernels/linux/arch/x86_64/kernel/pci-gart.c 2004-01-05 10:16:11.000000000 -0600
+++ pci-gart.c 2004-01-07 10:14:47.000000000 -0600
@@ -446,6 +446,16 @@
return 0;
out = 0;
start = 0;
+
+ /* see if we are recovering from an error */
+
+ if (sg[0].oldlength) {
+ printk(KERN_WARNING "Recovering sg list for %s\n", dev->slot_name);
+ for (i = 0; i < nents; i++) {
+ sg[i].length = sg[i].oldlength;
+ }
+ }
+
for (i = 0; i < nents; i++) {
struct scatterlist *s = &sg[i];
dma_addr_t addr = page_to_phys(s->page) + s->offset;
@@ -453,6 +463,7 @@
BUG_ON(s->length == 0);

size += s->length;
+ s->oldlength = s->length;

/* Handle the previous not yet processed entries */
if (i > start) {

--- /raid0/kernels/linux/include/asm-x86_64/scatterlist.h 2003-12-17 20:59:39.000000000 -0600
+++ scatterlist.h 2004-01-07 10:01:06.000000000 -0600
@@ -5,6 +5,7 @@
struct page *page;
unsigned int offset;
unsigned int length;
+ unsigned int oldlength;
dma_addr_t dma_address;
};


berkley

2004-01-07 19:22:30

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [BUG] x86_64 pci_map_sg modifies sg list - fails multiple map/unmaps

Hi,

Andi Kleen reworked pci-gart code.
Would you try Andi's x86-64 patch and see if the problems still exist ?

ftp://ftp.x86-64.org/pub/linux/v2.6/x86_64-2.6.1rc2-1.bz2

And also, Can you try with "iommu=noforce,nomerge" ?

Fixing the sg-list in the upperlayer (by re-creating) it in case of retry
worked for me. I am still not sure why you are running into "len==0"
panics.

Thanks,
Badari

Berkley Shands wrote:

> Running with the force segment merge OFF panics the processor after
> about 1000 scsi retries. the error given, also in pci-gart.c, is
> pci_map_area overflow 4096 bytes
> So a brain dead repair kills the kernel. Someone clearly needs to figure
> out where to correct the merge of the sg lists. A bit of doc on the iommu
> and the 4096 byte limit would be nice too :-)
>
> I see that is is the aborting of an SCB that causes the sg list halt.
>
> Jan 7 09:18:32 typhoon kernel: DevQ(0:6:0): 0 waiting
> Jan 7 09:18:32 typhoon kernel: (scsi0:A:2:0): SCB 0x46 - timed out
> Jan 7 09:18:32 typhoon kernel: Recovery SCB completes
> Jan 7 09:18:32 typhoon kernel: scsi0: Issued Channel A Bus Reset. 3 SCBs aborted
> Jan 7 09:18:46 typhoon kernel: Did it again, boss 0000:01:03.0
>
> Since the sg list merges into one i/o list, simply adding s->length = 4096
> back into the list seems to keep the kernel up. a better if slightly less
> stupid fix is to add up the remaining sg list lengths, and ajust
> the sg[0] entry to sum to the correct value.
>
> /* BUG_ON(s->length == 0); */
> if (! s->length)
> {
> unsigned long zero = sg[0].length;
> unsigned long remain = 0;
> int t = 0;
>
> BUG_ON(i != 1); /* some other error here */
>
> for (t = i + 1; t < nents; t++)
> remain += sg[t].length; /* collect remaining sizes */
> zero -= remain; /* deduct what is left on the list */
> sg[0].length = zero / 2;
> sg[1].length = zero / 2; /* allocate uniformly */
> size = zero / 2; /* reduce oversize first entry */
> printk(KERN_WARNING "Did it again, boss %s\n", dev->slot_name);
> }
>
> The better solution is to have the upper layer fix the sg list, or
> have some marker that the list was diddled, and save the old entries
> to put it back.
>
> berkley