processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 33
model name : Dual Core AMD Opteron(tm) Processor 275
stepping : 2
cpu MHz : 2194.616
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips : 4390.69
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 33
model name : Dual Core AMD Opteron(tm) Processor 275
stepping : 2
cpu MHz : 2194.616
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips : 4390.11
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
processor : 2
vendor_id : AuthenticAMD
cpu family : 15
model : 33
model name : Dual Core AMD Opteron(tm) Processor 275
stepping : 2
cpu MHz : 2194.616
cache size : 1024 KB
physical id : 1
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips : 4393.11
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
processor : 3
vendor_id : AuthenticAMD
cpu family : 15
model : 33
model name : Dual Core AMD Opteron(tm) Processor 275
stepping : 2
cpu MHz : 2194.616
cache size : 1024 KB
physical id : 1
siblings : 2
core id : 1
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips : 4393.51
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
Hi,
Emmeran Seehuber wrote:
> we've got a database server machine running a 2.6.18.2 vanilla kernel on
> Debian Etch. The database is MySQL 5. Everything works fine, but sometimes
> the server "lags", i.e. it doesn't respond for 30 seconds. We've now
> investigated the problem and found these messages in syslog (and dmesg):
>
> 15:55:44 omega11 kernel: ata1: port is slow to respond, please be patient
> 15:55:44 omega11 kernel: ata1: soft resetting port
> 15:55:44 omega11 kernel: ata1: port is slow to respond, please be patient
> 15:55:44 omega11 kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> 15:55:44 omega11 kernel: ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
> 15:55:44 omega11 last message repeated 5 times
> 15:55:44 omega11 kernel: ata1.00: qc timeout (cmd 0xec)
> 15:55:44 omega11 kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
> 15:55:44 omega11 kernel: ata1: failed to recover some devices, retrying in 5 secs
> 15:55:44 omega11 kernel: ata1: hard resetting port
> 15:55:44 omega11 kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> 15:55:44 omega11 kernel: ata1.00: configured for UDMA/133
> 15:55:44 omega11 kernel: ata1: EH complete
> 15:55:44 omega11 kernel: SCSI device sda: 293046768 512-byte hdwr sectors (150040 MB)
> 15:55:44 omega11 kernel: sda: Write Protect is off
> 15:55:44 omega11 kernel: SCSI device sda: drive cache: write back
This is just the recovery part. I need more of the log. If possible,
please give 2.6.20 a shot. It might fix your problem or at least
allow better diagnosis.
> We've been getting these messages up to 5 times a day for as far back as our syslogs reach.
>
> It seems no kind of queuing is used:
> # cat /sys/block/sda/device/queue_type
> none
> # cat /sys/block/sda/device/queue_depth
> 1
>
> The server has been up for 91 days now and has low to medium load (depending
> on the time of day). Since it's a production server located in a datacenter,
> we can't just test some random kernel on it :(
I see.
> Does anybody have a clue what's going on here? Could it be a hardware failure?
It might be. Quite a few SATA bug reports turn out to be hardware
problems, most commonly PSU issues.
> We have an identical machine using the same kernel. It's used as a webserver.
> These messages show up there too, but not as often (10 times in 91 days of
> uptime). If it is a hardware failure, then both machines would have to be
> affected by the same hardware problem.
Hmmm...
> What can we do to fix this problem? Is it known?
>
> I've found many posts related to SATA problems, but none seemed to be about
> this problem.
>
> Do you need additional information?
Yeah, please post the content of /var/log/boot.msg if available and the
result of dmesg and lspci -nn.
--
tejun
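A minimal sketch of how the queue_type and queue_depth check quoted above could be repeated for every sd* disk at once. The sysfs paths are the ones shown in the report; the function name `read_queue_settings` and its `sys_block` parameter are made up for illustration and are not part of any tool mentioned in this thread.

```python
from pathlib import Path


def read_queue_settings(sys_block="/sys/block"):
    """Return {disk: {"queue_type": ..., "queue_depth": ...}} for sd* disks.

    Values are the raw strings from sysfs; an attribute that does not exist
    is reported as None. `sys_block` is a parameter only so that the
    function can be pointed at a test directory.
    """
    settings = {}
    for dev in sorted(Path(sys_block).glob("sd*")):
        entry = {}
        for name in ("queue_type", "queue_depth"):
            attr = dev / "device" / name
            entry[name] = attr.read_text().strip() if attr.is_file() else None
        settings[dev.name] = entry
    return settings
```

On the machine described in the report this would be expected to show queue_type "none" and queue_depth "1" for sda, matching the two `cat` commands.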
On Friday 09 February 2007, Tejun Heo wrote:
> Hi,
>
> This is just the recovery part. I need more of the log. If possible,
> please give 2.6.20 a shot. It might fix your problem or at least
> allow better diagnosis.
>
I'll look into getting 2.6.20 onto the machine, but it might take some time
until we can do this.
> > Does anybody have a clue what's going on here? Could it be a hardware
> > failure?
>
> It might be. Quite a few SATA bug reports turn out to be hardware
> problems, most commonly PSU issues.
The power supply unit (that's what you meant by PSU, isn't it?) is rated at
800 watts, so it should be powerful enough for one hard disk and no graphics
board.
> > Do you need additional information?
>
> Yeah, please post the content of /var/log/boot.msg if available and the
> result of dmesg and lspci -nn.
We don't have a /var/log/boot.msg, but it seems the boot messages were saved
in /var/log/dmesg, so I attached it.
Thanks for your effort.
cu,
Emmy
Emmeran Seehuber wrote:
>>> Does anybody have a clue what's going on here? Could it be a hardware
>>> failure?
>> It might be. Quite a few SATA bug reports turn out to be hardware
>> problems, most commonly PSU issues.
>
> The power supply unit (that's what you meant by PSU, isn't it?) is rated at
> 800 watts, so it should be powerful enough for one hard disk and no graphics
> board.
I see.
>>> Do you need additional information?
>> Yeah, please post the content of /var/log/boot.msg if available and the
>> result of dmesg and lspci -nn.
>
> We don't have a /var/log/boot.msg, but it seems the boot messages were saved
> in /var/log/dmesg, so I attached it.
Yep, that's exactly what I wanted. So, the driver is sata_svw and the
errors are timeouts for both reads and writes with the BMDMA engine still
running. It looks like transmission errors to me. Can you post the
result of 'smartctl -a /dev/sdX'?
--
tejun
On Friday 09 February 2007, Tejun Heo wrote:
> Yep, that's exactly what I wanted. So, the driver is sata_svw and the
> errors are timeouts for both reads and writes with the BMDMA engine still
> running. It looks like transmission errors to me. Can you post the
> result of 'smartctl -a /dev/sdX'?
here it is:
-->
# smartctl -a /dev/sda
smartctl version 5.36 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
Device: ATA WDC WD1500ADFD-0 Version: 20.0
Serial number: WD-WMAP41246348
Device type: disk
Local Time is: Fri Feb 9 18:06:23 2007 CET
Device does not support SMART
Error Counter logging not supported
[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
Device does not support Self Test logging
<--
cu,
Emmy
I believe you need to add the flag '-d ata' to the smartctl command in
order to see the smart status of a SATA device.
-Jesse
Emmeran Seehuber wrote:
> # smartctl -a /dev/sda
> smartctl version 5.36 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> Device: ATA WDC WD1500ADFD-0 Version: 20.0
> Serial number: WD-WMAP41246348
> Device type: disk
> Local Time is: Fri Feb 9 18:06:23 2007 CET
> Device does not support SMART
Hmmm... Raptor not supporting SMART. That's weird. Please try
'smartctl -d ata -a /dev/sda'.
Thanks.
--
tejun
On Saturday 10 February 2007, Tejun Heo wrote:
> Hmmm... Raptor not supporting SMART. That's weird. Please try
> 'smartctl -d ata -a /dev/sda'.
The output is attached.
cu,
Emmy
Hello, Emmeran.
There is no logged error on the drive's side, only timeouts on the host's
side with the BMDMA engine running. I don't know the specifics of the
ServerWorks controller, but many controllers with a BMDMA interface time out
the command if a transmission failure occurs, so my primary suspect is still
a hardware transmission problem, which seems quite common in the SATA world.
Can you try the following?
1. Use a different cable and connect the hdd to a different port.
2. If possible, connect the hard disk to a different power supply. (I know
you have a juicy PSU, but just in case.)
Applying this to only one machine and leaving the other alone as a
control is probably a good idea.
Thanks.
--
tejun
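
To compare the changed machine against the control, the recovery events could be tallied from syslog on both hosts. A minimal sketch, assuming log lines shaped like the excerpt quoted earlier in the thread ("... kernel: ataN: message"); the function name, the labels, and the regular expression are illustrative assumptions, not libata tooling.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt quoted in this thread:
#   "15:55:44 omega11 kernel: ata1: soft resetting port"
# The optional ".00" suffix covers per-device lines such as "ata1.00: ...".
LINE_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Message fragments copied from the quoted log; the labels are made up here.
EVENTS = {
    "soft resetting port": "soft_reset",
    "hard resetting port": "hard_reset",
    "qc timeout": "qc_timeout",
    "failed to IDENTIFY": "identify_failed",
}


def tally_ata_events(lines):
    """Count the known recovery events per ATA port in an iterable of lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a per-port libata message
        port, message = match.groups()
        for fragment, label in EVENTS.items():
            if fragment in message:
                counts[(port, label)] += 1
    return counts
```

Feeding it the same time window of syslog from both machines (for example by passing an open file object) gives per-port counts that can be compared directly after the cable or power supply swap.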