2007-10-01 07:43:19

by Alexander Sabourenkov

Subject: Promise SATA300 TX4: errors, oops in ext3 code

Hardware: Athlon64, Asus A8V, Promise SATA300 TX4, 2xSeagate 7200.10
320G, jumper-limited to SATA150.
Kernel : 2.6.22.9 amd64

Problem:
Heavy load causes errors and triggers oops.

History:
Problems were first encountered on kernel 2.6.19, on both i686 (the "old"
system) and amd64 (Gentoo installation CD).
I can't say anything about older kernels. Most probably they have the same
issues (or worse).

Problems were blamed on (fix tried in parentheses):
- SATA300 being too 'hot' (jumpered the drives)
- cables (work perfectly on onboard controller)
- interrupt sharing (found the only slot which does not share
interrupt line)
- cooling (3 fans installed, smartctl-reported temperature at max
load dropped to 35C)
- weak PSU (installed 600W FSP)
- kernel bugs (upgraded to 2.6.22.9)

Together these measures reduced the error rate significantly (from about
20 errors per mirror rebuild to 2-4) but did not eliminate the problem.

Errors are easily reproduced by performing resync on a md RAID-1.
Raising overall system load (compilation, copy operations on other HDDs)
makes errors happen sooner.
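
To compare error counts between rebuilds, something like the following
Python sketch could be used. It is only an illustration: it assumes the
kernel log contains lines of the form shown under "Typical error" below,
and the name count_exceptions is made up for this example.

```python
import re
from collections import Counter

# Matches the first line of a libata error report, e.g.
# "Sep 30 17:26:51 host ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0"
EXCEPTION_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: exception Emask")


def count_exceptions(lines):
    """Return a Counter mapping port name (e.g. 'ata3') to exception count."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Small demo on lines copied from the log excerpt in this report.
sample = [
    "Sep 30 17:26:51 host ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0",
    "Sep 30 17:26:52 host ata3: soft resetting port",
    "Sep 30 17:26:52 host ata3: EH complete",
]
for port, n in sorted(count_exceptions(sample).items()):
    print(f"{port}: {n} exception(s)")
```

Passing an open log file instead of the sample list (for example the
syslog file covering one resync) would give a per-port count for that run.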

The errors lead to data loss: if they occur on the master disk while a RAID
rebuild is in progress and there is some write activity (such as package
installation), some files disappear, and the package management check
reports them as missing.

The oops was encountered while rsyncing data from disks on a separate (VIA
onboard) controller and simultaneously rebuilding GCC.



Typical error:

Sep 30 17:26:51 host ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0
action 0x2
Sep 30 17:26:51 host ata3.00: (port_status 0x20080000)
Sep 30 17:26:51 host ata3.00: cmd c8/00:08:31:47:36/00:00:00:00:00/e6
tag 0 cdb 0x0 data 4096 in
Sep 30 17:26:51 host res 50/00:00:38:47:36/00:00:00:00:00/e6 Emask 0x2
(HSM violation)
Sep 30 17:26:52 host ata3: soft resetting port
Sep 30 17:26:52 host ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Sep 30 17:26:52 host ata3.00: configured for UDMA/133
Sep 30 17:26:52 host ata3: EH complete
Sep 30 17:26:52 host sd 2:0:0:0: [sda] 625142448 512-byte hardware
sectors (320073 MB)
Sep 30 17:26:52 host sd 2:0:0:0: [sda] Write Protect is off
Sep 30 17:26:52 host sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
Sep 30 17:26:52 host sd 2:0:0:0: [sda] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
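
For readers unfamiliar with the "cmd" line above, here is a hedged Python
sketch that decodes it. It assumes the libata field order
command/feature:nsect:lbal:lbam:lbah/(five hob fields)/device and a
28-bit LBA command; decode_taskfile is a name invented for this example.

```python
def decode_taskfile(cmd):
    """Decode a libata 'cmd' string for a 28-bit LBA command.

    Assumed field order:
    command/feature:nsect:lbal:lbam:lbah/hob_* (5 fields)/device
    """
    command, regs, _hob_regs, device = cmd.split("/")
    _feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in regs.split(":"))
    device = int(device, 16)
    return {
        "command": int(command, 16),
        # A sector count of 0 means 256 sectors for 28-bit commands.
        "sectors": nsect or 256,
        # LBA28: the low nibble of the device register holds LBA bits 27:24.
        "lba": ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal,
    }


tf = decode_taskfile("c8/00:08:31:47:36/00:00:00:00:00/e6")
print(hex(tf["command"]), tf["sectors"] * 512, hex(tf["lba"]))
```

Under these assumptions the failing command is 0xc8 (READ DMA) for 8
sectors, i.e. 8 x 512 = 4096 bytes, which agrees with "data 4096 in" in
the log.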

Oops:

Sep 30 17:27:46 host Unable to handle kernel NULL pointer dereference at
000000000000000e RIP:
Sep 30 17:27:46 host [<ffffffff8024b03a>] __pagevec_lru_add+0x12/0x97
Sep 30 17:27:46 host PGD fdff067 PUD 84c1067 PMD 0
Sep 30 17:27:46 host Oops: 0000 [1]
Sep 30 17:27:46 host CPU 0
Sep 30 17:27:46 host Modules linked in: snd_via82xx snd_mpu401_uart
snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_pcm snd_timer
snd_page_alloc snd_util_mem snd_hwdep snd soundcore
Sep 30 17:27:46 host Pid: 21657, comm: rsync Not tainted 2.6.22-gentoo-r8 #1
Sep 30 17:27:46 host RIP: 0010:[<ffffffff8024b03a>]
[<ffffffff8024b03a>] __pagevec_lru_add+0x12/0x97
Sep 30 17:27:46 host RSP: 0000:ffff81000b081998 EFLAGS: 00010217
Sep 30 17:27:46 host RAX: 0000000000000000 RBX: ffff81000b0819a8 RCX:
0000000000000000
Sep 30 17:27:46 host RDX: 0000000000000000 RSI: 6db6db6db6db6db7 RDI:
000000000000000e
Sep 30 17:27:46 host RBP: 0000000000000013 R08: 0000000000000000 R09:
00000000000000ff
Sep 30 17:27:46 host R10: ffff810021c72bc0 R11: ffff810033a3e068 R12:
ffff81000b081b88
Sep 30 17:27:46 host R13: ffff810021c72bc0 R14: ffffffff802a122c R15:
ffff810011f3fe00
Sep 30 17:27:46 host FS: 00002ae93d4d66d0(0000)
GS:ffffffff80595000(0000) knlGS:00000000f7dc2a00
Sep 30 17:27:46 host CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Sep 30 17:27:46 host CR2: 000000000000000e CR3: 0000000008fcd000 CR4:
00000000000006e0
Sep 30 17:27:46 host Process rsync (pid: 21657, threadinfo
ffff81000b080000, task ffff81003f783100)
Sep 30 17:27:46 host Stack: ffff810001152bf8 ffffffff802830f1
ffffffff802a122c ffffffff802218bf
Sep 30 17:27:46 host 000000000000000e 0000000000000000 ffff81000195cdf0
ffff8100014e6998
Sep 30 17:27:46 host ffff810001ac6540 ffff810001af2488 ffff81000154d900
ffff8100018aced8
Sep 30 17:27:46 host Call Trace:
Sep 30 17:27:46 host [<ffffffff802830f1>] mpage_readpages+0xe5/0x138
Sep 30 17:27:46 host [<ffffffff802a122c>] ext3_get_block+0x0/0xe4
Sep 30 17:27:46 host [<ffffffff802218bf>] __activate_task+0x23/0x34
Sep 30 17:27:46 host [<ffffffff80248f2a>] __alloc_pages+0x61/0x2a8
Sep 30 17:27:46 host [<ffffffff8024a79e>]
__do_page_cache_readahead+0x124/0x217
Sep 30 17:27:46 host [<ffffffff80465a08>] thread_return+0x0/0x94
Sep 30 17:27:46 host [<ffffffff80245233>] sync_page+0x0/0x40
Sep 30 17:27:46 host [<ffffffff8024a8e4>]
blockable_page_cache_readahead+0x53/0xb2
Sep 30 17:27:46 host [<ffffffff8024a9ca>] make_ahead_window+0x87/0xa3
Sep 30 17:27:46 host [<ffffffff8024ab6e>] page_cache_readahead+0x188/0x1bf
Sep 30 17:27:46 host [<ffffffff80245a1a>]
do_generic_mapping_read+0x136/0x43a
Sep 30 17:27:46 host [<ffffffff8024500a>] file_read_actor+0x0/0x11d
Sep 30 17:27:46 host [<ffffffff8024754f>] generic_file_aio_read+0x11d/0x15a
Sep 30 17:27:46 host [<ffffffff802617b0>] do_sync_read+0xc9/0x10c
Sep 30 17:27:46 host [<ffffffff80233fb1>] autoremove_wake_function+0x0/0x2e
Sep 30 17:27:46 host [<ffffffff80261f21>] vfs_read+0xaa/0x132
Sep 30 17:27:46 host [<ffffffff80262236>] sys_read+0x45/0x6e
Sep 30 17:27:46 host [<ffffffff8020936e>] system_call+0x7e/0x83
Sep 30 17:27:46 host
Sep 30 17:27:46 host
Sep 30 17:27:46 host Code: 48 8b 07 48 c1 e8 3e 48 69 c0 70 02 00 00 48
8d b0 20 2e 56
Sep 30 17:27:46 host RIP [<ffffffff8024b03a>] __pagevec_lru_add+0x12/0x97
Sep 30 17:27:46 host RSP <ffff81000b081998>
Sep 30 17:27:46 host CR2: 000000000000000e
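
A quick way to see which register held the bad pointer is to compare the
general-purpose registers in the dump with CR2 (the faulting address).
The Python sketch below does that on an excerpt of the oops above; the
helper name is hypothetical and the regex only targets this dump format.

```python
import re

# Register name followed by a full 16-digit hex value; \s+ also matches the
# line breaks introduced by mail wrapping.
REG_RE = re.compile(r"\b(R[A-Z0-9]{2}|CR2):\s+([0-9a-f]{16})\b")


def registers_matching_fault(oops_text):
    """Return names of general-purpose registers whose value equals CR2."""
    regs = {name: int(value, 16) for name, value in REG_RE.findall(oops_text)}
    fault = regs.pop("CR2", None)
    if fault is None:
        return []
    return sorted(name for name, value in regs.items() if value == fault)


# Excerpt copied from the oops above (wrapped exactly as posted).
OOPS_EXCERPT = """\
Sep 30 17:27:46 host RDX: 0000000000000000 RSI: 6db6db6db6db6db7 RDI:
000000000000000e
Sep 30 17:27:46 host CR2: 000000000000000e CR3: 0000000008fcd000 CR4:
00000000000006e0
"""
print(registers_matching_fault(OOPS_EXCERPT))
```

On this excerpt only RDI matches CR2. If the Code bytes start at the
faulting instruction (an assumption), 48 8b 07 decodes as
mov (%rdi),%rax, which would be consistent with a load through RDI = 0xe.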


Dmesg output: http://lxnt.info/linux/dmesg.text
Errors and oops: http://lxnt.info/linux/oops.text


--

./lxnt


2007-10-01 09:11:10

by Clemens Koller

Subject: Re: Promise SATA300 TX4: errors, oops in ext3 code

Alexander Sabourenkov wrote:
> Hardware: Athlon64, Asus A8V, Promise SATA300 TX4, 2xSeagate 7200.10
> 320G, jumper-limited to SATA150.
> Kernel : 2.6.22.9 amd64
>
> Problem:
> Heavy load causes errors and triggers oops.

Have you checked your memory already (memtest86)?

We run several applications with Promise controllers on unusual
hardware and have never had integrity problems, even with e.g.
non-standard SATA connections over custom vacuum-tight connectors.

> Problems were blamed:
> - SATA300 being too 'hot' (jumpered the drives)

Is this a commonly known problem with your hard drives or controller?
(Ask Google.) Otherwise, it sounds like a problem with broken hardware.

> - cables (work perfectly on onboard controller)
> - interrupt sharing (found the only slot which does not share
> interrupt line)
> - cooling (3 fans installed, smartctl-reported temperature at max load
> dropped to 35C)

Try heating up your memory a little (e.g. with a hair dryer).
If it fails more often, your memory is most likely broken.

> - weak PSU (installed 600W FSP)
> - kernel bugs (upgraded to 2.6.22.9)
>
> All those measures significantly dropped error rate (from about 20 to
> 2-4 per mirror rebuild) but did not eliminate the problem.

Again... sounds like bad memory to me.

Just my $0.05.
Regards,

Clemens Koller
__________________________________
R&D Imaging Devices
Anagramm GmbH
Rupert-Mayer-Straße 45/1
Linhof Werksgelände
D-81379 München
Tel.089-741518-50
Fax 089-741518-19
http://www.anagramm-technology.com


2007-10-01 10:24:39

by Alexander Sabourenkov

Subject: Re: Promise SATA300 TX4: errors, oops in ext3 code

Clemens Koller wrote:
> Alexander Sabourenkov wrote:
> > Hardware: Athlon64, Asus A8V, Promise SATA300 TX4, 2xSeagate 7200.10
> > 320G, jumper-limited to SATA150.
> > Kernel : 2.6.22.9 amd64
> >
> > Problem:
> > Heavy load causes errors and triggers oops.
>
> Have you checked your memory already (memtest86)?

The last run was about a year ago.

This box gets updated regularly (a rebuild of all installed software),
so I'm reasonably certain that the memory is OK, gcc being almost as
sensitive to bad memory as memtest.

I will recheck anyway.

>
> We have several applications with Promise controllers on strange
> hardware and we never had integrity problems with e.g. not so standard
> SATA connections over custom vacuum-tight connectors.

Judging from the Linux and FreeBSD mailing lists, the TX4 is now quite
well known for intermittent problems that are hard to reproduce on
different hardware.

I have two machines with these controllers: one running FreeBSD 6.2 on
an MSI K8Neo2 motherboard (ATI chipset), and this one. The FreeBSD box
does not exhibit the problem under the little load it gets, but the
6-STABLE and 7-CURRENT branches have had similar symptoms since around
19 April 2007, with rare (but not unheard of) occurrences before that.

So I am unable to keep the machines up to date, and before I dump
$140 worth of hardware I'd like to try to help fix this problem, or at
least be certain that these controllers are indeed unusable.


> > Problems were blamed:
> > - SATA300 being too 'hot' (jumpered the drives)
>
> Is this a common known problem with your harddrives or controller?
> (ask google) Otherwise, it sounds like a problem with broken hardware.

This is a common problem with at least VIA onboard controllers and
Seagate disks, and I think with SATA150 controllers and speed
negotiation in general.

This step was suggested on some mailing list as a general precaution, but
it actually made no difference to the error rate.

I have not unjumpered the drives back to SATA300, so that I can easily
connect them to the onboard controller.

--

./lxnt


2007-10-02 04:33:55

by Alexander Sabourenkov

Subject: Re: Promise SATA300 TX4: errors, oops in ext3 code


> Have you checked your memory already (memtest86)?

[...]

> Again... sounds like bad memory to me.

Overnight memtest86 run: 11 hours, 23 passes, 0 errors.


--

./lxnt


2007-10-02 09:08:57

by Clemens Koller

Subject: Re: Promise SATA300 TX4: errors, oops in ext3 code

Alexander Sabourenkov wrote:
>> Have you checked your memory already (memtest86)?
> [...]
>> Again... sounds like bad memory to me.
> Nightly memtest86 run : 11 hours, 23 passes, 0 errors.

Okay, I have no idea about any bugs there.
You have several options. Find a 100% working vanilla kernel for your
problem (minimal configuration; skip e.g. the sound stuff), and then
git bisect against a known bad kernel.

Same thing in hardware: move components (controller + HDD) to/from a
working machine and verify...

Regards,

Clemens Koller

2007-10-02 09:25:19

by Clemens Koller

Subject: Re: Promise SATA300 TX4: errors, oops in ext3 code

Alexander Sabourenkov wrote:
> Hardware: Athlon64, Asus A8V, Promise SATA300 TX4, 2xSeagate 7200.10
> 320G, jumper-limited to SATA150.
> Kernel : 2.6.22.9 amd64
>
> Problem:
> Heavy load causes errors and triggers oops.
>
> History:
> Problems were first encountered on kernel 2.6.19, both i686 ("old"
> system) and amd64 (gentoo installation CD).
> Can't say anything about older kernels. Most probably they have same
> issues (or worse).
>
> Problems were blamed:
> - SATA300 being too 'hot' (jumpered the drives)

Did you turn it back to SATA300, and does it basically still work?
Then cool it actively and see whether the error rate depends on temperature.

In one of my Promise HDD controller (PDC20275) designs I forgot to connect
the thermal pad (they call it the E-PAD) properly to a GND plane, so it
worked only with lots of errors, which were also temperature sensitive
(so, a typical hardware design flaw :-).
On a PCI add-on card with the same chip, the E-PAD also didn't look
soldered over its whole area, but it was working.

As you might have noticed, I am more into hardware debugging.
Probably some kernel gurus have other ideas (related to software).

Regards,

Clemens Koller


2007-10-02 09:47:19

by Alexander Sabourenkov

Subject: Re: Promise SATA300 TX4: errors, oops in ext3 code

Clemens Koller wrote:
> Okay, I have no idea about any bugs there.
> You have several options: Find a 100% working vanilla kernel for your
> problem (minimal configuration, skip i.e. the sound stuff, ...).
> And then git bisect with a known bad kernel.

I'm afraid there is no 100% working kernel. Problems were reported as
far back as 2.6.11, and I have never found a single mailing-list thread
ending with "problem solved" (not counting PSU and thermal issues).

> Same thing in hardware: move components (Controllers + HDD) to/from a
> working machine and verify...

Unfortunately I have no untested machine left right now: both of the
machines I have show the same problems.


Time permitting, I'll test the 2.6.23 kernel, the libata-dev branch,
SATA300/SATA150 modes, and aggressive card cooling as you suggested in
your other email, and document all this on a separate page or maybe a wiki.


--

./lxnt