2001-10-07 10:39:06

by Jim Crilly

Subject: AIC7xxx panic

I got a reproducible panic while running dbench simulating 25+ clients:
the new aic7xxx driver panics with "Too few segs for dma mapping.
Increase AHC_NSEG". The partition in question is FAT32 and on a
different disk than /, and I'm not using HIGHMEM. I am using XFS and the
preempt patches, but I don't think they're related to the panic.

The odd thing is that if I run dbench in the same manner on my / partition,
which is on a different disk on the same controller, it goes fine. It
seems, to my untrained eye anyway, to be a bad interaction between the
vfat driver and the aic7xxx driver.

I'm using the old aic7xxx driver right now and it's fine. Has anyone
else seen anything like this?

Jim
--
Help protect your rights on-line.
Join the Electronic Frontiers Foundation today: http://www.eff.org/join
-----------------------------------------------------------------------
Security: Antonyms: See Microsoft
-----------------------------------------------------------------------
"We are coming after you. God may have mercy on you, but we won't,"
declared Sen. John McCain, R-Arizona.


2001-10-07 10:51:29

by Rob Turk

Subject: Re: AIC7xxx panic

"Jim Crilly" <[email protected]> wrote in message
news:cistron.1002451051.3718.20.camel@warblade...
> I got a reproducible panic while running dbench simulating 25+ clients,
> the new aic7xxx driver panics with "Too few segs for dma mapping.
> Increase AHC_NSEG". The partition in question is FAT32 and on a
> different disk than /, I'm not using HIGHMEM. I am using XFS and the
> preempt patches, but I don't think they're related to the panic.
>
> The odd thing, is if I run dbench in the same manner on my / partition,
> which is on a different disk on the same controller, it goes fine. It
> seems, to my untrained eye anyway, to be a bad interaction between the
> vfat driver and the aic7xxx driver.
>
> I'm using the old aic7xxx driver right now and it's fine, has anyone
> else seen anything like this?
>
> Jim

Since this seems to fail on just one disk, it might have to do with one of the
disk characteristics, like command queue depth. Did you enable Tagged Command
Queueing, and if so, can you try playing around with the maximum depth?

Rob




2001-10-07 11:30:41

by Jim Crilly

Subject: Re: AIC7xxx panic

Both disks on the controller are Seagate Cheetahs, the one being worked
during the panic is a ST39204LW, the other disk is a ST318451LW.

I did have TCQ enabled and left it at the default of 255. I'll try a
lower value tomorrow, since it's so late.

Jim


2001-10-07 12:20:52

by David M. Grimes

Subject: Re: AIC7xxx panic

On Sun, Oct 07, 2001 at 07:28:57AM -0400, Jim Crilly wrote:
> Both disks on the controller are Seagate Cheetahs, the one being worked
> during the panic is a ST39204LW, the other disk is a ST318451LW.

I've seen this on a 2-disk system (both Seagate ST150176LW) on a
VA-Systems onboard AIC 7xxx. I enabled TCQ, and noticed the default
depth increased sometime around 2.4.10, not exactly sure when (it used
to be 8, now much higher). I've seen it on both disks.

In drivers/scsi/aic7xxx/aic7xxx_osm.h is the #define for NSEG, and I
changed it from 128 to 512, and it stopped the problem. Question is,
why was the TCQ depth increased, and should NSEG have been upped with
it?
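
The change described above is a one-line edit to the define. A minimal
sketch follows, under loud assumptions: check_segment_count() is a
hypothetical stand-in for the driver's panic path, not the actual aic7xxx
code, and the exact comparison the driver makes is not quoted in this
thread.

```c
#include <assert.h>
#include <stdio.h>

/* Raised from 128 as described above. The real define lives in
 * drivers/scsi/aic7xxx/aic7xxx_osm.h. */
#define AHC_NSEG 512

/* Hypothetical stand-in for the driver's check (an assumption, not the
 * real code): returns 0 if the request fits in the segment table, and
 * -1 where the real driver would panic. */
static int check_segment_count(int nseg)
{
    assert(nseg >= 0);
    if (nseg > AHC_NSEG) {
        fprintf(stderr,
                "Too few segs for dma mapping. Increase AHC_NSEG\n");
        return -1;
    }
    return 0;
}
```

With the limit at 512, a request that would have tripped the check at 128
segments now passes, which is consistent with the panic disappearing but
says nothing about why such large requests are being built.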

>
> I did have TCQ enabled and I left it at the default of 255, I'll try a
> lower value tomorrow, since it's so late.

Lowering the depth also fixed my problem: I left NSEG at 128 and lowered
the TCQ depth back to 8, and that worked fine as well.

I'll be interested to see what the eventual outcome of this is, so I can
apply the "right" fix!

Anyhow, thought you might want another datapoint.

Thanks,

Dave

2001-10-07 14:53:47

by Gérard Roudier

Subject: Re: AIC7xxx panic



On Sun, 7 Oct 2001, David M. Grimes wrote:

> On Sun, Oct 07, 2001 at 07:28:57AM -0400, Jim Crilly wrote:
> > Both disks on the controller are Seagate Cheetahs, the one being worked
> > during the panic is a ST39204LW, the other disk is a ST318451LW.
>
> I've seen this on a 2-disk system (both Seagate ST150176LW) on a
> VA-Systems onboard AIC 7xxx. I enabled TCQ, and noticed the default
> depth increased sometime around 2.4.10, not exactly sure when (it used
> to be 8, now much higher). I've seen it on both disks.
>
> In drivers/scsi/aic7xxx/aic7xxx_osm.h is the #define for NSEG, and I
> changed it from 128 to 512, and it stopped the problem. Question is,
> why was the TCQ depth increased, and should NSEG have been upped with
> it?

The default TCQ depth was 8 in Doug Ledford's aic7xxx driver but 253
in Justin Gibbs' aic7xxx driver. From the driver developers' point of
view, the TCQ depth hasn't been changed. :-)

The max number of DMA segments and the TCQ depth are totally unrelated
items. That your work-around helps may just indicate that their
interaction triggers some software bug. Using larger TCQ depths puts more
pressure on memory and disk I/O, so more memory is locked for pending I/O
and memory fragmentation becomes more likely.
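
To illustrate why fragmentation matters for the segment count, here is a
minimal sketch, not kernel code. It assumes 4 KiB pages and a simplified
merging rule (only physically adjacent pages share a segment), and
count_dma_segments() is a hypothetical helper. The real block layer
applies further constraints when merging.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u  /* assumed page size */

/* Count the DMA segments needed for a buffer, given the physical
 * address of each of its pages in order. A page starts a new segment
 * unless it directly follows the previous page in physical memory. */
static size_t count_dma_segments(const uint64_t *page_addr, size_t npages)
{
    size_t nseg = 0;

    assert(page_addr != NULL || npages == 0);
    for (size_t i = 0; i < npages; i++) {
        if (i == 0 || page_addr[i] != page_addr[i - 1] + PAGE_SIZE_BYTES)
            nseg++;
    }
    return nseg;
}
```

Under these assumptions, four physically contiguous pages need one
segment, while the same four pages scattered through memory need four. A
fully scattered 512 KiB request would need 512 / 4 = 128 segments, so
anything larger could not fit in a 128-entry table.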

> > I did have TCQ enabled and I left it at the default of 255, I'll try a
> > lower value tomorrow, since it's so late.
>
> This also fixed my problem, I left NSEG at 128 and lowered the TCQ depth
> back to 8. This worked fine as well.
>
> I'll be interested to see what the eventual outcome of this is, so I can
> apply the "right" fix!

The right fix might well not apply to the driver code. Btw, I do not plan
to look into the problem, as I would guess Justin is already studying it.
I just wanted to suggest also looking into the upper layers rather than
focusing only on the low-level driver.

Gérard.

2001-10-08 02:33:15

by Jim Crilly

Subject: Re: AIC7xxx panic

I changed AHC_NSEG from 128 to 512 and as expected the panic went away,
but does this mean the default should be higher in the kernel or is
there a real bug here? The main reason I wonder is because it ran fine
on disk 0 but panic'd on disk 1.

2001-10-09 00:50:50

by David M. Grimes

Subject: Re: AIC7xxx panic

On Sun, Oct 07, 2001 at 10:31:48PM -0400, Jim Crilly wrote:
> I changed AHC_NSEG from 128 to 512 and as expected the panic went away,
> but does this mean the default should be higher in the kernel or is
> there a real bug here? The main reason I wonder is because it ran fine
> on disk 0 but panic'd on disk 1.

Perhaps this is related (from 2.4.10-acX thread later on l-k):

--------
From [email protected] Mon Oct 8 18:34:32 2001
Subject: Re: linux-2.4.10-acX
To: [email protected] (Mike Fedyk)
From: Alan Cox <[email protected]>

> > - Elevator flow control
>
> Where can I find more information on this?

Read the ll_rw_blk diff. Basically it tries to avoid too many locked
buffers clogging up memory and killing the box. I'm not totally sure it's
the right approach.
--------

Were there recent changes in ll_rw_blk which are being addressed by
"Elevator flow control"? As suggested earlier in this thread, the cause
might be a few layers up, and this seemed relevant.

Can anyone confirm or shed any additional light on this?

Dave

2001-10-09 02:22:05

by Justin T. Gibbs

Subject: Re: AIC7xxx panic

>The right fix might well not apply to the driver code. Btw, I do not plan
>to look into the problem, as I would guess Justin is already studying it.
>I just wanted to suggest also looking into the upper layers rather than
>focusing only on the low-level driver.

I can't really speak to what is an acceptable number of segments
for Linux (I just copied what the old driver did), but the aic7xxx
driver does export its current limit to upper layers and that limit
should be honored.
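
As a rough sketch of what honoring an exported limit means, assume a
hypothetical struct host_limits and helper commands_needed(). Neither is
the actual SCSI mid-layer interface. An upper layer that respects the
limit splits a large request instead of handing the driver more segments
than it advertised:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the limit a low-level driver advertises to the
 * layers above it (an illustration, not the real kernel structure). */
struct host_limits {
    size_t max_segments;  /* e.g. the driver's AHC_NSEG */
};

/* Number of separate commands needed so that no single command carries
 * more segments than the driver advertised (ceiling division). */
static size_t commands_needed(const struct host_limits *host, size_t nseg)
{
    assert(host != NULL && host->max_segments > 0);
    return (nseg + host->max_segments - 1) / host->max_segments;
}
```

If the driver panics on the segment count, either a request was built
without consulting the limit or the limit was not the value the driver
intended to advertise. This sketch cannot tell which.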

--
Justin

2001-10-09 11:41:49

by Alan

Subject: Re: AIC7xxx panic

> Were there recent changes in ll_rw_blk which are being addressed by
> "Elevator flow control"? As suggested earlier in this thread, the cause
> might be a few layers up, and this seemed relevant.

Unrelated, I suspect. All it means is that in some cases -ac will have fewer
segments queued before blocking. The max sectors per I/O and max segments
per I/O are controlled by the drivers.