2004-03-06 19:47:12

by Henrik Persson

Subject: Strange DMA-errors and system hang with Promise 20268

Hi.

For the last month or so I've experienced some strangeness with one of my
boxes. It is up and running without any problems and then suddenly I
get this in the syslog:

Mar 6 20:29:42 eurisco kernel: hdf: dma_timer_expiry: dma status == 0x61

(sometimes dma status has been 0x41)

And a few seconds later the system has frozen and I have to reset the
box to get it back up and running.

It isn't always the same hdX. If I remove the device that produces the
error, another device fails, but it's always a device on the Promise
controller.

I've seen this behaviour with 2.4.25, 2.4.24 and 2.4.23 (I think).

Any ideas?
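
For reference, the status values quoted above (0x61 and 0x41) can be decoded bit by bit. The sketch below assumes the number printed by dma_timer_expiry is the standard bus-master IDE status register as laid out in the SFF-8038i specification. The short flag names and the decode_dma_status helper are made up for this illustration and are not kernel identifiers.

```python
# Bit layout of the bus-master IDE status register (SFF-8038i).
# The flag names are illustrative shorthand, not kernel symbols.
BM_STATUS_BITS = [
    (0x01, "ACTIVE"),     # bus-master transfer still in progress
    (0x02, "ERROR"),      # bus-master error
    (0x04, "INTERRUPT"),  # device has raised its interrupt
    (0x20, "DRV0_DMA"),   # drive 0 marked DMA capable
    (0x40, "DRV1_DMA"),   # drive 1 marked DMA capable
    (0x80, "SIMPLEX"),    # only one channel may do DMA at a time
]


def decode_dma_status(value):
    """Return the names of all flags set in a DMA status byte."""
    return [name for mask, name in BM_STATUS_BITS if value & mask]


if __name__ == "__main__":
    for status in (0x61, 0x41):
        print(f"0x{status:02x}: {', '.join(decode_dma_status(status))}")
```

Under that assumption, 0x61 decodes to ACTIVE plus both drive-capable bits, and 0x41 to ACTIVE plus DRV1_DMA. In both cases the ERROR and INTERRUPT bits are clear, which would mean the transfer was still marked as in progress when the timer fired.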

--
Henrik Persson <[email protected]>



2004-03-06 19:55:53

by Henrik Persson

Subject: Re: Strange DMA-errors and system hang with Promise 20268

On Sat, 2004-03-06 at 20:47, Henrik Persson wrote:
> Hi.
>
> For the last month or so I've experienced some strangeness with one of my
> boxes. It is up and running without any problems and then suddenly I
> get this in the syslog:
>
> Mar 6 20:29:42 eurisco kernel: hdf: dma_timer_expiry: dma status == 0x61
>
> (sometimes dma status has been 0x41)
>
> And a few seconds later the system has frozen and I have to reset the
> box to get it back up and running.
>
> It isn't always the same hdX. If I remove the device that produces the
> error, another device fails, but it's always a device on the Promise
> controller.
>
> I've seen this behaviour with 2.4.25, 2.4.24 and 2.4.23 (I think).
>
> Any ideas?

Oh, and .config, lspci and cpuinfo are attached.

--
Henrik Persson <[email protected]>


Attachments:
eurisco.config (16.12 kB)
eurisco.cpuinfo (404.00 B)
eurisco.lspci (5.44 kB)

2004-03-07 01:05:52

by Mario 'BitKoenig' Holbe

Subject: Re: Strange DMA-errors and system hang with Promise 20268

Henrik Persson <[email protected]> wrote:
> boxes. It is up and running without any problems and then suddenly I
> get this in the syslog:
> Mar 6 20:29:42 eurisco kernel: hdf: dma_timer_expiry: dma status == 0x61
> And a few seconds later the system has frozen and I have to reset the

Same here:

Mar 4 01:01:06 darkside kernel: hde: dma_timer_expiry: dma status == 0x21
Mar 5 01:02:00 darkside kernel: hde: dma_timer_expiry: dma status == 0x21
Mar 6 01:10:22 darkside kernel: hde: dma_timer_expiry: dma status == 0x21

Can you somehow correlate this with the start of S.M.A.R.T. self-tests?

I suspect it has something to do with 2.4.25's new "One last
read after the timeout" in ide-iops.c and with accessing the drive
while a self-test is running (possibly the short one in particular).
Here, smartmontools runs SMART short self-tests daily at 01:00,
and a bit later the machine hangs.
Today I disabled that job, and the machine has stayed stable.
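
One way to check for such a correlation is to bucket the dma_timer_expiry lines by hour of day and compare the result with the cron schedule. This is only a minimal sketch, assuming each message sits on a single syslog line in the format shown above; parse_expiry and expiries_by_hour are hypothetical helper names, not part of any tool mentioned here.

```python
import re
from collections import Counter

# Matches lines such as:
#   Mar 4 01:01:06 darkside kernel: hde: dma_timer_expiry: dma status == 0x21
LINE_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2})\s+"
    r"(?P<host>\S+)\s+kernel:\s+(?P<dev>hd[a-z]):\s+"
    r"dma_timer_expiry:\s+dma status\s+==\s+0x(?P<status>[0-9a-fA-F]{2})"
)


def parse_expiry(line):
    """Return a dict describing one dma_timer_expiry line, or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        "date": f"{m['month']} {int(m['day'])}",
        "hour": int(m["hour"]),
        "minute": int(m["minute"]),
        "host": m["host"],
        "dev": m["dev"],
        "status": int(m["status"], 16),
    }


def expiries_by_hour(lines):
    """Count dma_timer_expiry events per hour of day (0-23)."""
    hits = (parse_expiry(line) for line in lines)
    return Counter(hit["hour"] for hit in hits if hit is not None)


if __name__ == "__main__":
    log = [
        "Mar 4 01:01:06 darkside kernel: hde: dma_timer_expiry: dma status == 0x21",
        "Mar 5 01:02:00 darkside kernel: hde: dma_timer_expiry: dma status == 0x21",
        "Mar 6 01:10:22 darkside kernel: hde: dma_timer_expiry: dma status == 0x21",
    ]
    print(expiries_by_hour(log))
```

For the three lines above, every event falls in the 01:00 hour, which is what makes the nightly self-test job a plausible suspect. A spread over many different hours would point away from a scheduled job.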

> error, another device fails, but it's always a device on the Promise
> controller.

Ditto... PDC20269 U133TX2
CONFIG_BLK_DEV_PDC202XX_NEW=y

And until now it has always been hde, connected to the Promise
controller.

> I've seen this behaviour with 2.4.25, 2.4.24 and 2.4.23 (I think).

My machine has been running since at least:
Jan 18 09:41:21 darkside kernel: Linux version 2.4.24
...
Feb 28 01:43:48 darkside kernel: Linux version 2.4.24
Feb 28 04:58:47 darkside kernel: Linux version 2.4.25

The first time the problem occurred was Mar 4 01:01:06.

The last smartmontools update was on Feb 14 03:04.


regards,
Mario
--
I heard, if you play a NT-CD backwards, you get satanic messages...
That's nothing. If you play it forwards, it installs NT.

2004-03-08 13:30:51

by Henrik Persson

Subject: Re: Strange DMA-errors and system hang with Promise 20268

On Sun, 2004-03-07 at 02:05, Mario 'BitKoenig' Holbe wrote:
> Same here:
>
> Mar 4 01:01:06 darkside kernel: hde: dma_timer_expiry: dma status == 0x21
> Mar 5 01:02:00 darkside kernel: hde: dma_timer_expiry: dma status == 0x21
> Mar 6 01:10:22 darkside kernel: hde: dma_timer_expiry: dma status == 0x21
>
> Can you somehow correlate this with the start of S.M.A.R.T. self-tests?

Nope. Until now I wasn't running anything of the sort. I ran a few
self-tests just now, though, and nothing happened.

> I suspect it has something to do with 2.4.25's new "One last
> read after the timeout" in ide-iops.c and with accessing the drive
> while a self-test is running (possibly the short one in particular).
> Here, smartmontools runs SMART short self-tests daily at 01:00,
> and a bit later the machine hangs.
> Today I disabled that job, and the machine has stayed stable.

This happens every now and then. Sometimes once a week or once a month,
sometimes once per hour. I can't correlate this behaviour with any
activity on the box in question (mysql, nfsd).

> > error, another device fails, but it's always a device on the Promise
> > controller.
>
> Ditto... PDC20269 U133TX2
> CONFIG_BLK_DEV_PDC202XX_NEW=y
>
> And until now it has always been hde, connected to the Promise
> controller.
>
> > I've seen this behaviour with 2.4.25, 2.4.24 and 2.4.23 (I think).
>
> My machine has been running since at least:
> Jan 18 09:41:21 darkside kernel: Linux version 2.4.24
> ...
> Feb 28 01:43:48 darkside kernel: Linux version 2.4.24
> Feb 28 04:58:47 darkside kernel: Linux version 2.4.25
>
> The first time the problem occurred was Mar 4 01:01:06.

I've had those problems for at least a month. ;/

I just have no clue what's wrong with the damn thing.

--
Henrik Persson <[email protected]>

2004-03-10 11:50:20

by Bruce Allen

Subject: Re: Strange DMA-errors and system hang with Promise 20268

> > Mar 4 01:01:06 darkside kernel: hde: dma_timer_expiry: dma status == 0x21
> > Mar 5 01:02:00 darkside kernel: hde: dma_timer_expiry: dma status == 0x21
> > Mar 6 01:10:22 darkside kernel: hde: dma_timer_expiry: dma status == 0x21
> >
> > Can you somehow correlate this with the start of S.M.A.R.T. self-tests?
>
> Nope. Until now I wasn't running anything of the sort. I ran a few
> self-tests just now, though, and nothing happened.
>
> > I suspect it has something to do with 2.4.25's new "One last
> > read after the timeout" in ide-iops.c and with accessing the drive
> > while a self-test is running (possibly the short one in particular).
> > Here, smartmontools runs SMART short self-tests daily at 01:00,
> > and a bit later the machine hangs.
> > Today I disabled that job, and the machine has stayed stable.

Does the disk's SMART error log (smartctl -l error) show any entries
related to this problem? If so, please print them with the latest version
of smartmontools (5.30), which makes them more 'human readable' than
previous versions.

Bruce

2004-03-10 12:38:31

by Mario 'BitKoenig' Holbe

Subject: Re: Strange DMA-errors and system hang with Promise 20268

On Wed, Mar 10, 2004 at 05:50:12AM -0600, Bruce Allen wrote:
> Does the disk's SMART error log (smartctl -l error) show any entries
> related to this problem? If so, please print them with the latest version

No, none at all. This was the first thing I looked at, because
I just thought it was some disk problem.


regards,
Mario
--
"Why are we hiding from the police, daddy?" | J. E. Guenther
"Because we use SuSE son, they use SYSVR4." | de.alt.sysadmin.recovery

2004-03-10 15:01:46

by Henrik Persson

Subject: Re: Strange DMA-errors and system hang with Promise 20268

On Wed, 2004-03-10 at 13:36, Mario 'BitKoenig' Holbe wrote:
> On Wed, Mar 10, 2004 at 05:50:12AM -0600, Bruce Allen wrote:
> > Does the disk's SMART error log (smartctl -l error) show any entries
> > related to this problem? If so, please print them with the latest version
>
> No, none at all. This was the first thing I looked at, because
> I just thought it was some disk problem.

Same here. Only one of the disks that have stopped during the last month
has any entries in the log at all. Those errors are attached.

The funny thing is that the machine stops responding after the
dma_timer_expiry. Why doesn't the kernel (or the controller, for
that matter) just disable DMA? Then the problem would be solved, if it
is related to DMA, right? Sure, the speed (or lack of it) would
be painful, but I wouldn't need to sit 60 km from home wondering why
my box just stopped responding. ;/

--
Henrik Persson <[email protected]>


Attachments:
smarterrors (3.67 kB)

2004-03-10 15:42:24

by Mario 'BitKoenig' Holbe

Subject: Re: Strange DMA-errors and system hang with Promise 20268

On Wed, Mar 10, 2004 at 05:50:12AM -0600, Bruce Allen wrote:
> > > I suspect it has something to do with 2.4.25's new "One last
> > > read after the timeout" in ide-iops.c and with accessing the drive
> > > while a self-test is running (possibly the short one in particular).
> Does the disk's SMART error log (smartctl -l error) show any entries

Just in addition, to make this clearer:
I personally don't suspect that smartmontools itself has a
problem.
I have been running Debian's smartmontools package for a long
time, and it has been doing the self-tests for a long time as well.
I never had problems with it, and it wasn't updated or otherwise
changed close to the first occurrence of the problem.
I have 4 disks, 2 on the onboard VIA controller and 2 on
the Promise. The problem has always occurred on a disk on the
Promise (as Henrik pointed out too).
I rather suspect some kernel IDE <-> Promise driver timing
problem. Maybe smartmontools makes it more likely that
this timing problem occurs, maybe not (given Henrik's
answer to my question I favor the 'maybe not');
maybe it's even just some load issue making the problem
occur.


Mario
--
<jv> Oh well, config
<jv> one actually wonders what force in the universe is holding it
<jv> and makes it working
<Beeth> chances and accidents :)

2004-03-11 09:25:56

by Bruce Allen

Subject: Re: Strange DMA-errors and system hang with Promise 20268

> On Wed, Mar 10, 2004 at 05:50:12AM -0600, Bruce Allen wrote:
> > > > I suspect it has something to do with 2.4.25's new "One last
> > > > read after the timeout" in ide-iops.c and with accessing the drive
> > > > while a self-test is running (possibly the short one in particular).
> > Does the disk's SMART error log (smartctl -l error) show any entries
>
> Just in addition, to make this clearer:
> I personally don't suspect that smartmontools itself has a
> problem.
> I have been running Debian's smartmontools package for a long
> time, and it has been doing the self-tests for a long time as well.
> I never had problems with it, and it wasn't updated or otherwise
> changed close to the first occurrence of the problem.
> I have 4 disks, 2 on the onboard VIA controller and 2 on
> the Promise. The problem has always occurred on a disk on the
> Promise (as Henrik pointed out too).
> I rather suspect some kernel IDE <-> Promise driver timing
> problem. Maybe smartmontools makes it more likely that
> this timing problem occurs, maybe not (given Henrik's
> answer to my question I favor the 'maybe not');
> maybe it's even just some load issue making the problem
> occur.

OK, thanks for the reassurance. There have been some warnings about
Promise 20262 and 20265 controllers interacting badly with smartmontools
(locking up systems). See
http://cvs.sourceforge.net/viewcvs.py/smartmontools/sm5/WARNINGS?view=markup
Perhaps this is in some way related.

Cheers,
Bruce

2004-03-11 09:36:17

by Bruce Allen

Subject: Re: Strange DMA-errors and system hang with Promise 20268

> On Wed, 2004-03-10 at 13:36, Mario 'BitKoenig' Holbe wrote:
> > On Wed, Mar 10, 2004 at 05:50:12AM -0600, Bruce Allen wrote:
> > > Does the disk's SMART error log (smartctl -l error) show any entries
> > > related to this problem? If so, please print them with the latest version
> >
> > No, none at all. This was the first thing I looked at, because
> > I just thought it was some disk problem.
>
> Same here. Only one of the disks that have stopped during the last
> month has any entries in the log at all. Those errors are attached.

FWIW, these four errors:

Error 4 occurred at disk power-on lifetime: 6619 hours
When the command that caused the error occurred, the device was in an
unknown state.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 00 00 00 e0 Error: ICRC, ABRT

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Timestamp  Command/Feature_Name
-- -- -- -- -- -- -- --  ---------  --------------------
c8 ff 01 00 00 00 e0 08    546.992  READ DMA
ef 03 45 20 77 a5 e0 08    546.992  SET FEATURES [Set transfer mode]
c6 ff 10 20 77 a5 e0 08    546.992  SET MULTIPLE MODE
10 ff 50 20 77 a5 e0 08    546.992  RECALIBRATE [OBS-4]
91 03 3f 20 77 a5 ef 08    546.992  INITIALIZE DEVICE PARAMETERS [OBS-6]

are all 'conventional' DMA errors, in which there was a CRC error in the
hardware interface between the disk and the controller. Either the cable
connections to this disk were briefly problematic or there was electrical
noise on the lines. Probably not anything to worry about.
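
The 'ICRC, ABRT' annotation follows directly from the register values in the log. The sketch below decodes the ER and ST bytes using the bit names from the ATA/ATAPI specifications of that era, and also the transfer-mode byte of the SET FEATURES command (SC = 0x45). The function names are illustrative and the tables are a simplified reading of the standard, not smartmontools code.

```python
# Error register bits, highest bit first (ATA/ATAPI naming).
ERROR_BITS = [
    (0x80, "ICRC"),  # interface CRC error during an Ultra DMA transfer
    (0x40, "UNC"),   # uncorrectable data error
    (0x20, "MC"),    # media changed
    (0x10, "IDNF"),  # requested sector ID not found
    (0x08, "MCR"),   # media change requested
    (0x04, "ABRT"),  # command aborted
    (0x02, "NM"),    # no media
]

# Status register bits, highest bit first.
STATUS_BITS = [
    (0x80, "BSY"),   # device busy
    (0x40, "DRDY"),  # device ready
    (0x20, "DF"),    # device fault
    (0x10, "DSC"),   # device seek complete
    (0x08, "DRQ"),   # data request
    (0x01, "ERR"),   # error, details are in the error register
]


def decode_bits(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


def decode_transfer_mode(sector_count):
    """Decode the sector count byte of SET FEATURES subcommand 0x03."""
    kind, mode = sector_count & 0xF8, sector_count & 0x07
    if kind == 0x40:
        return f"Ultra DMA mode {mode}"
    if kind == 0x20:
        return f"Multiword DMA mode {mode}"
    if kind == 0x08:
        return f"PIO flow control mode {mode}"
    if sector_count in (0x00, 0x01):
        return "PIO default mode"
    return "unknown"


if __name__ == "__main__":
    print("ER 0x84:", decode_bits(0x84, ERROR_BITS))
    print("ST 0x51:", decode_bits(0x51, STATUS_BITS))
    print("SC 0x45:", decode_transfer_mode(0x45))
```

Read this way, ER = 0x84 is ICRC plus ABRT, ST = 0x51 is DRDY, DSC and ERR, and the SET FEATURES entry had selected Ultra DMA mode 5. CRC checking applies to Ultra DMA transfers, which fits the cable or noise explanation.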

> The funny thing is that the machine stops responding after the
> dma_timer_expiry. Why doesn't the kernel (or the controller, for
> that matter) just disable DMA? Then the problem would be solved, if it
> is related to DMA, right? Sure, the speed (or lack of it) would
> be painful, but I wouldn't need to sit 60 km from home wondering why
> my box just stopped responding. ;/

FWIW, there have been reports of problems (system lockup) with
smartmontools on systems with Promise 20262 and 20265 controllers. See:
http://cvs.sourceforge.net/viewcvs.py/smartmontools/sm5/WARNINGS?sortby=date&view=markup
So I guess I will need to add the 20268 controller to this list, although
as Mario says, smartmontools may play only an indirect role, in exposing
an existing problem.

Cheers,
Bruce

2004-03-11 14:32:17

by Henrik Persson

Subject: Re: Strange DMA-errors and system hang with Promise 20268

On Thu, 2004-03-11 at 10:36, Bruce Allen wrote:
*snip*

> FWIW, there have been reports of problems (system lockup) with
> smartmontools on systems with Promise 20262 and 20265 controllers. See:
> http://cvs.sourceforge.net/viewcvs.py/smartmontools/sm5/WARNINGS?sortby=date&view=markup
> So I guess I will need to add the 20268 controller to this list, although
> as Mario says, smartmontools may play only an indirect role, in exposing
> an existing problem.

Well. I guess it's an existing problem, cause I didn't even have
smartmontools installed until Mario brought it up. ;)

--
Henrik Persson <[email protected]>

2004-06-04 20:09:23

by Bruce Allen

Subject: Re: Strange DMA-errors and system hang with Promise 20268

> hmmm, it seems I solved the issue in my case.
>
> I just connected the two disks on the Promise to a second, separate
> power supply, and everything works rock-stable and survives all things
>
> <SNIP>
>
> Bruce: this is probably something worth a hint on the smartmontools
> warning page.

Done.