2010-12-20 14:15:56

by Rogier Wolff

Subject: Slow disks.


Hi,

A friend of mine has a server in a datacenter somewhere. His machine
is not working properly: most of his disks take 10-100 times longer
to process each IO request than normal.

iostat -kx 10 output:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdd 0.30 0.00 0.40 1.20 2.80 1.10 4.88 0.43 271.50 271.44 43.43

shows that in this 10 second period, the disk was busy for 4.3 seconds
and serviced 15-16 requests during that time.
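For reference, those numbers follow directly from the columns above:

   %util    : 43.43% of 10 s               ~= 4.3 s busy
   requests : (0.40 r/s + 1.20 w/s) x 10 s  = 16 requests
   svctm    : ~4343 ms busy / 16 requests  ~= 271 ms per request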

Normal disks show "svctm" of around 10-20ms.

Now you might say: It's his disk that's broken.
Well no: I don't believe that all four of his disks are broken.
(I just showed you output about one disk, but there are 4 disks in there
all behaving similarly, though some are worse than others.)

Or you might say: It's his controller that's broken. So we thought
too. We replaced the onboard sata controller with a 4-port sata
card. Now they are running off the external sata card... Slightly
better, but not by much.

Or you might say: it's hardware. But suppose the disk failed to
properly transfer the data 9 times out of 10; wouldn't the driver tell
us SOMETHING in the syslog that things are not fine and dandy?
Moreover, in the case above, 12 kB were transferred in 4.3 seconds. If
CRC errors were happening, the interface would've been able to transfer
over 400 MB during that time, so every transfer would need to be
retried 30,000 times on average... Not realistic. If that were the
case, we'd surely hit a maximum retry limit every now and then?


These symptoms started when the system was running 2.6.33, but are
still present now that the system has been upgraded to 2.6.36.

Is there anything you can suggest to get to the root of this problem?
Could this be a software issue with the driver? Can we enable some
driver debugging to find out what is wrong?

Any help will be appreciated.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ


2010-12-20 18:06:59

by Bruno Prémont

Subject: Re: Slow disks.

Hi,

[ccing linux-ide]

Please provide the part of kernel log showing initialization of your
disk controller(s) as well as detection of all the discs.
Verbose lspci output for the disc controller and $(smartctl -i -A $disk)
output might be useful as well.
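For example, something along these lines should capture all of that
(the device names are only the usual guesses):

   dmesg | grep -iE 'ata[0-9]|scsi|sd[a-d]'
   lspci -nnk | grep -iEA3 'mass storage|sata|ide'
   for d in /dev/sd[a-d]; do smartctl -i -A "$d"; done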

Did you try the individual discs on a completely different system (e.g.
plain desktop system) and what revision of SATA are both components
supporting?

Bruno


On Mon, 20 December 2010 Rogier Wolff <[email protected]> wrote:
> Hi,
>
> A friend of mine has a server in a datacenter somewhere. His machine
> is not working properly: most of his disks take 10-100 times longer
> to process each IO request than normal.
>
> iostat -kx 10 output:
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> sdd 0.30 0.00 0.40 1.20 2.80 1.10 4.88 0.43 271.50 271.44 43.43
>
> shows that in this 10 second period, the disk was busy for 4.3 seconds
> and serviced 15-16 requests during that time.
>
> Normal disks show "svctm" of around 10-20ms.
>
> Now you might say: It's his disk that's broken.
> Well no: I don't believe that all four of his disks are broken.
> (I just showed you output about one disk, but there are 4 disks in there
> all behaving similarly, though some are worse than others.)
>
> Or you might say: It's his controller that's broken. So we thought
> too. We replaced the onboard sata controller with a 4-port sata
> card. Now they are running off the external sata card... Slightly
> better, but not by much.
>
> Or you might say: it's hardware. But suppose the disk doesn't properly
> transfer the data 9 times out of 10, wouldn't the driver tell us
> SOMETHING in the syslog that things are not fine and dandy? Moreover,
> In the case above, 12kb were transferred in 4.3 seconds. If CRC errors
> were happening, the interface would've been able to transfer over
> 400Mb during that time. So every transfer would need to be retried on
> average 30000 times... Not realistic. If that were the case, we'd
> surely hit a maximum retry limit every now and then?
>
>
> These symptoms started when the system was running 2.6.33, but are
> still present now the system has been upgraded to 2.6.36.
>
> Is there anything you can suggest to get to the root of this problem?
> Could this be a software issue with the driver? Can we enable some
> driver debugging to find out what is wrong?
>
> Any help will be appreciated.
>
> Roger.
>

2010-12-20 18:33:07

by Greg Freemyer

Subject: Re: Slow disks.

On Mon, Dec 20, 2010 at 1:06 PM, Bruno Prémont
<[email protected]> wrote:
> Hi,
>
> [ccing linux-ide]
>
> Please provide the part of kernel log showing initialization of your
> disk controller(s) as well as detection of all the discs.
> Verbose lspci output for the disc controller and $(smartctl -i -A $disk)
> output might be useful as well.
>
> Did you try the individual discs on a completely different system (e.g.
> plain desktop system) and what revision of SATA are both components
> supporting?
>
> Bruno
>
>
> On Mon, 20 December 2010 Rogier Wolff <[email protected]> wrote:
>> Hi,
>>
>> A friend of mine has a server in a datacenter somewhere. His machine
>> is not working properly: most of his disks take 10-100 times longer
>> to process each IO request than normal.
>>
>> iostat -kx 10 output:
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
>> sdd 0.30 0.00 0.40 1.20 2.80 1.10 4.88 0.43 271.50 271.44 43.43
>>
>> shows that in this 10 second period, the disk was busy for 4.3 seconds
>> and serviced 15-16 requests during that time.
>>
>> Normal disks show "svctm" of around 10-20ms.
>>
>> Now you might say: It's his disk that's broken.
>> Well no: I don't believe that all four of his disks are broken.
>> (I just showed you output about one disk, but there are 4 disks in there
>> all behaving similarly, though some are worse than others.)
>>
>> Or you might say: It's his controller that's broken. So we thought
>> too. We replaced the onboard sata controller with a 4-port sata
>> card. Now they are running off the external sata card... Slightly
>> better, but not by much.
>>
>> Or you might say: it's hardware. But suppose the disk doesn't properly
>> transfer the data 9 times out of 10, wouldn't the driver tell us
>> SOMETHING in the syslog that things are not fine and dandy? Moreover,
>> In the case above, 12kb were transferred in 4.3 seconds. If CRC errors
>> were happening, the interface would've been able to transfer over
>> 400Mb during that time. So every transfer would need to be retried on
>> average 30000 times... Not realistic. If that were the case, we'd
>> surely hit a maximum retry limit every now and then?
>>
>>
>> These symptoms started when the system was running 2.6.33, but are
>> still present now the system has been upgraded to 2.6.36.
>>
>> Is there anything you can suggest to get to the root of this problem?
>> Could this be a software issue with the driver? Can we enable some
>> driver debugging to find out what is wrong?
>>
>> Any help will be appreciated.
>>
>> Roger.

My personal guess would definitely be hardware. The only common
component I can think of is power. SATA is much more sensitive to
power quality than IDE.

Greg

2010-12-20 19:14:08

by Jeff Moyer

Subject: Re: Slow disks.

Rogier Wolff <[email protected]> writes:

> Hi,
>
> A friend of mine has a server in a datacenter somewhere. His machine
> is not working properly: most of his disks take 10-100 times longer
> to process each IO request than normal.
>
> iostat -kx 10 output:
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> sdd 0.30 0.00 0.40 1.20 2.80 1.10 4.88 0.43 271.50 271.44 43.43

You are doing ~4KB I/Os and driving a queue depth of <1. If you are
seeking all over the disk and mixing reads and writes, you may very well
trigger bad behaviour like this for SATA disks.

To further diagnose the issue, I'd recommend running blktrace on one of
the devices. If you report results here, could you also include more
information about the hardware, the storage layout (including any dm/md
drivers and file systems involved) and the workload? Also, please let
us know which I/O scheduler you're using.
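In case it helps, a minimal way to capture such a trace might look
like this (assuming sdd is one of the affected devices and debugfs is
available):

   mount -t debugfs none /sys/kernel/debug   # blktrace needs debugfs
   blktrace -d /dev/sdd -w 60 -o sdd         # trace /dev/sdd for 60 seconds
   blkparse -i sdd > sdd.txt                 # human-readable dump
   cat /sys/block/sdd/queue/scheduler        # shows the active I/O scheduler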

> These symptoms started when the system was running 2.6.33, but are
> still present now the system has been upgraded to 2.6.36.

Are you sure it was a kernel change that caused the issue? In other
words, can you run with the 2.6.32 and confirm that the issue is not
present?

Thanks,
Jeff

2010-12-21 12:40:23

by Arto Jantunen

Subject: Re: Slow disks.

Rogier Wolff <[email protected]> writes:

> Hi,
>
> A friend of mine has a server in a datacenter somewhere. His machine
> is not working properly: most of his disks take 10-100 times longer
> to process each IO request than normal.
>
> iostat -kx 10 output:
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> sdd 0.30 0.00 0.40 1.20 2.80 1.10 4.88 0.43 271.50 271.44 43.43
>
> shows that in this 10 second period, the disk was busy for 4.3 seconds
> and serviced 15-16 requests during that time.
>
> Normal disks show "svctm" of around 10-20ms.
>
> Now you might say: It's his disk that's broken.
> Well no: I don't believe that all four of his disks are broken.
> (I just showed you output about one disk, but there are 4 disks in there
> all behaving similarly, though some are worse than others.)
>
> Or you might say: It's his controller that's broken. So we thought
> too. We replaced the onboard sata controller with a 4-port sata
> card. Now they are running off the external sata card... Slightly
> better, but not by much.
>
> Or you might say: it's hardware. But suppose the disk doesn't properly
> transfer the data 9 times out of 10, wouldn't the driver tell us
> SOMETHING in the syslog that things are not fine and dandy? Moreover,
> In the case above, 12kb were transferred in 4.3 seconds. If CRC errors
> were happening, the interface would've been able to transfer over
> 400Mb during that time. So every transfer would need to be retried on
> average 30000 times... Not realistic. If that were the case, we'd
> surely hit a maximum retry limit every now and then?

I had something somewhat similar happen on an Areca RAID card with four disks
in RAID5. The first symptom was that the machine was extremely slow, that
tracked down to IO being slow. By looking at the IO pattern it became apparent
that it was very bursty, it did a few requests and then froze for about 30
seconds and then did a few requests again.

It was tracked down to one of the disks being faulty in a way that did not get
it dropped out of the array. In this case when the machine was frozen and not
doing any IO the activity led on the faulty disk was constantly on, when it
came off a burst of IO happened.

I'm not sure what kind of a disk failure this was caused by, but you could
test for it either by simply monitoring the activity leds (may not show
anything in all cases, I don't know) or removing the disks one by one and
testing if the problem disappears. I didn't get much log output in this case,
I think the Areca driver was occasionally complaining about timeouts while
communicating with the controller.

--
Arto Jantunen

2010-12-22 10:43:12

by Rogier Wolff

Subject: Re: Slow disks.


Unquoted text below is from either me or from my friend.


Someone suggested we try an older kernel as if kernel 2.6.32 would not
have this problem. We do NOT think it suddenly started with a certain
kernel version. I was just hoping to have you kernel-guys help with
prodding the kernel into revealing which component was screwing things
up....


On Mon, Dec 20, 2010 at 01:32:44PM -0500, Greg Freemyer wrote:
> On Mon, Dec 20, 2010 at 1:06 PM, Bruno Prémont
> <[email protected]> wrote:
> > Hi,
> >
> > [ccing linux-ide]
> >
> > Please provide the part of kernel log showing initialization of your
> > disk controller(s) as well as detection of all the discs.


sata_sil 0000:03:01.0: version 2.4
sata_sil 0000:03:01.0: PCI INT A -> GSI 24 (level, low) -> IRQ 24
sata_sil 0000:03:01.0: Applying R_ERR on DMA activate FIS errata fix
scsi2 : sata_sil
scsi3 : sata_sil
scsi4 : sata_sil
scsi5 : sata_sil
ata3: SATA max UDMA/100 mmio m1024@0xed200000 tf 0xed200080 irq 24
ata4: SATA max UDMA/100 mmio m1024@0xed200000 tf 0xed2000c0 irq 24
ata5: SATA max UDMA/100 mmio m1024@0xed200000 tf 0xed200280 irq 24
ata6: SATA max UDMA/100 mmio m1024@0xed200000 tf 0xed2002c0 irq 24
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata3.00: ATA-8: WDC WD10EARS-00Y5B1, 80.00A80, max UDMA/133
ata3.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata3.00: configured for UDMA/100
scsi 2:0:0:0: Direct-Access ATA WDC WD10EARS-00Y 80.0 PQ: 0 ANSI: 5
usb 2-2: new low speed USB device using uhci_hcd and address 2
ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata4.00: ATA-7: SAMSUNG HD103SI, 1AG01118, max UDMA7
ata4.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata4.00: configured for UDMA/100
scsi 3:0:0:0: Direct-Access ATA SAMSUNG HD103SI 1AG0 PQ: 0 ANSI: 5
ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata5.00: ATA-8: WDC WD10EARS-00Y5B1, 80.00A80, max UDMA/133
ata5.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata5.00: configured for UDMA/100
scsi 4:0:0:0: Direct-Access ATA WDC WD10EARS-00Y 80.0 PQ: 0 ANSI: 5
ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata6.00: ATA-8: WDC WD10EARS-00Y5B1, 80.00A80, max UDMA/133
ata6.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata6.00: configured for UDMA/100
scsi 5:0:0:0: Direct-Access ATA WDC WD10EARS-00Y 80.0 PQ: 0 ANSI: 5
sd 2:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
sd 2:0:0:0: [sda] Write Protect is off
sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't
support DPO or FUA
sd 3:0:0:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
sd 3:0:0:0: [sdb] Write Protect is off
sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't
support DPO or FUA
sd 4:0:0:0: [sdc] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
sd 4:0:0:0: [sdc] Write Protect is off
sd 4:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 4:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't
support DPO or FUA
sd 5:0:0:0: [sdd] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
sd 5:0:0:0: [sdd] Write Protect is off
sd 5:0:0:0: [sdd] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't
support DPO or FUA
sd 5:0:0:0: [sdd] Write Protect is off
sd 5:0:0:0: [sdd] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't
support DPO or FUA
sdb: sdb1 sdb2 sdb3 sdb4
sd 3:0:0:0: [sdb] Attached SCSI disk
sda: sda1 sda2 sda3 sda4
sd 2:0:0:0: [sda] Attached SCSI disk
sdc: sdc1 sdc2 sdc3 sdc4
sd 4:0:0:0: [sdc] Attached SCSI disk
sdd: sdd1 sdd2 sdd3 sdd4
sd 5:0:0:0: [sdd] Attached SCSI disk



> > Verbose lspci output for the disc controller and $(smartctl -i -A $disk)
> > output might be useful as well.


03:01.0 Mass storage controller: Silicon Image, Inc. SiI 3114
[SATALink/SATARaid] Serial ATA Controller (rev 02)
Subsystem: Silicon Image, Inc. SiI 3114 SATALink Controller
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 32, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 24
Region 0: I/O ports at 4020 [size=8]
Region 1: I/O ports at 4014 [size=4]
Region 2: I/O ports at 4018 [size=8]
Region 3: I/O ports at 4010 [size=4]
Region 4: I/O ports at 4000 [size=16]
Region 5: Memory at ed200000 (32-bit, non-prefetchable) [size=1K]
[virtual] Expansion ROM at e8000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 2
Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=2 PME-
Kernel driver in use: sata_sil
Kernel modules: sata_sil


But also tried onboard card:

00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE
Controller (rev 01) (prog-if 8a [Master SecP PriP])
Subsystem: Super Micro Computer Inc Device 7980
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 18
Region 0: I/O ports at 01f0 [size=8]
Region 1: I/O ports at 03f4 [size=1]
Region 2: I/O ports at 0170 [size=8]
Region 3: I/O ports at 0374 [size=1]
Region 4: I/O ports at 30a0 [size=16]
Kernel driver in use: ata_piix
Kernel modules: ata_generic, pata_acpi, ata_piix, ide-pci-generic,
piix

smartctl output:

smartctl 5.40 2010-10-16 r3189 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green (Adv. Format) family
Device Model: WDC WD10EARS-00Y5B1
Serial Number: WD-WCAV55759454
Firmware Version: 80.00A80
User Capacity: 1,000,204,886,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Tue Dec 21 20:06:00 2010 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE
UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail
Always - 0
3 Spin_Up_Time 0x0027 132 119 021 Pre-fail
Always - 6391
4 Start_Stop_Count 0x0032 100 100 000 Old_age
Always - 56
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail
Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age
Always - 0
9 Power_On_Hours 0x0032 091 091 000 Old_age
Always - 7189
10 Spin_Retry_Count 0x0032 100 253 000 Old_age
Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age
Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age
Always - 54
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always
- 39
193 Load_Cycle_Count 0x0032 164 164 000 Old_age Always
- 109955
194 Temperature_Celsius 0x0022 109 107 000 Old_age Always
- 38
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always
- 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always
- 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age
Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always
- 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age
Offline - 0


The others are very similar....


> >
> > Did you try the individual discs on a completely different system (e.g.
> > plain desktop system) and what revision of SATA are both components
> > supporting?

Yes I did. The disks were installed in an MSI / Core 2 Duo based
desktop system. No problems at all. Transfer rates up to 200 MB/s.


The SiI 3114 chip is 1.5 Gbps SATA.


Searching for information on the WD drives I stumbled across:

http://community.wdc.com/t5/Other-Internal-Drives/1-TB-WD10EARS-desynch-issues-in-RAID/m-p/11559

Where it seems that WD simply says not to use these drives in a RAID.
I have experience with "RAID Edition" drives: they fail at a MUCH too
high rate. If we can't use the non-RAID drives for a RAID application,
then there is just ONE possible option: STAY AWAY FROM WESTERN DIGITAL.

Western Digital claims it has the right to mess things up if you put a
non-RAID drive in a RAID configuration. Well, fine. Then they can also
mess things up in normal situations, because when Linux does software
RAID the accesses look no different from ordinary accesses.

(If you click through and read their knowledge-base entry, you'd
notice that it should be more or less the other way around: the
RAID-enabled drive reports an error on a bad sector within seven
seconds, so Linux can drop it from the RAID quickly, whereas the
desktop drive keeps retrying until Linux times out (30 seconds?).)
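For what it's worth, on drives whose firmware supports it, that
error-recovery timeout can be queried, and sometimes set, with
smartctl (sdX is a placeholder; many desktop Greens simply reject the
set command):

   smartctl -l scterc /dev/sdX          # query SCT Error Recovery Control
   smartctl -l scterc,70,70 /dev/sdX    # try to set 7.0 s read/write limits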



More hardware info:

System: Supermicro PDSMi, 4xDDR2 1GB, disks and controllers as above.
Current kernel version: 2.6.36.2
Problem was also present in kernel 2.6.33 (sorry, we cannot downgrade
again; this is a production system...)

uname -a:
Linux jcz.nl 2.6.36-ARCH #1 SMP PREEMPT Fri Dec 10 20:32:37 CET 2010
x86_64 Intel(R) Pentium(R) D CPU 3.20GHz GenuineIntel GNU/Linux

Disklayout:

major minor #blocks name

8 0 976762584 sda
8 1 240943 sda1
8 2 19535040 sda2
8 3 1951897 sda3
8 4 955032120 sda4
8 16 976762584 sdb
8 17 240943 sdb1
8 18 19535040 sdb2
8 19 1951897 sdb3
8 20 955032120 sdb4
8 32 976762584 sdc
8 33 240943 sdc1
8 34 19535040 sdc2
8 35 1951897 sdc3
8 36 955032120 sdc4
8 48 976762584 sdd
8 49 240943 sdd1
8 50 19535040 sdd2
8 51 1951897 sdd3
8 52 955032120 sdd4
9 127 240832 md127
9 1 39067648 md1
9 126 1910063104 md126
9 125 3903488 md125

MDstat:

Personalities : [raid1] [raid6] [raid5] [raid4]
md125 : active raid5 sdd3[5](S) sdb3[4] sda3[0] sdc3[3]
3903488 blocks super 1.1 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

md126 : active raid5 sda4[0] sdd4[3] sdc4[5](S) sdb4[4]
1910063104 blocks super 1.1 level 5, 512k chunk, algorithm 2
[3/3] [UUU]

md1 : active raid5 sda2[0] sdd2[3](S) sdb2[1] sdc2[4]
      39067648 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

md127 : active raid1 sdd1[3](S) sda1[0] sdb1[1] sdc1[2]
240832 blocks [3/3] [UUU]

unused devices: <none>
rootfs / rootfs rw 0 0
proc /proc proc rw,relatime 0 0
sys /sys sysfs rw,relatime 0 0
udev /dev devtmpfs
rw,nosuid,relatime,size=10240k,nr_inodes=506317,mode=755 0 0
/dev/disk/by-label/rootfs / ext4
rw,relatime,barrier=1,stripe=256,data=ordered 0 0
devpts /dev/pts devpts rw,relatime,mode=600,ptmxmode=000 0 0
shm /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
/dev/md127 /boot ext3
rw,relatime,errors=continue,barrier=0,data=writeback 0 0
/dev/md126 /data ext4 rw,relatime,barrier=1,data=ordered 0 0


Because of the severity of the problems (which remain after trying
another sata card), I have already bought a new Supermicro server. Let's
hope that helps.




--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2010-12-22 16:32:35

by Jeff Moyer

Subject: Re: Slow disks.

Rogier Wolff <[email protected]> writes:

> Unquoted text below is from either me or from my friend.
>
>
> Someone suggested we try an older kernel as if kernel 2.6.32 would not
> have this problem. We do NOT think it suddenly started with a certain
> kernel version. I was just hoping to have you kernel-guys help with
> prodding the kernel into revealing which component was screwing things
> up....
[...]
> ata3.00: ATA-8: WDC WD10EARS-00Y5B1, 80.00A80, max UDMA/133

This is an "Advanced Format" drive, which, in this case, means it
internally has a 4 KB physical sector size and exports a 512-byte
logical sector size. If your partitions are misaligned, this can cause
performance problems.
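A quick way to check that (using sda as an example) is to look at the
partition start sectors, which on these drives should be multiples of 8:

   cat /sys/block/sda/sda[1-4]/start   # each value should be divisible by 8
   # or: fdisk -lu /dev/sda            # start sectors in 512-byte units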

> MDstat:
>
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md125 : active raid5 sdd3[5](S) sdb3[4] sda3[0] sdc3[3]
> 3903488 blocks super 1.1 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
>
> md126 : active raid5 sda4[0] sdd4[3] sdc4[5](S) sdb4[4]
> 1910063104 blocks super 1.1 level 5, 512k chunk, algorithm 2
> [3/3] [UUU]
>
> md1 : active raid5 sda2[0] sdd2[3](S) sdb2[1] sdc2[4]
> 39067648 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3]
> [3/3] [UUU]
>
> md1 : active raid5 sda2[0] sdd2[3](S) sdb2[1] sdc2[4]
> 39067648 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3]
> [UUU]

A 512KB raid5 chunk with 4KB I/Os? That is a recipe for inefficiency.
Again, blktrace data would be helpful.

Cheers,
Jeff

2010-12-22 15:59:47

by Greg Freemyer

Subject: Re: Slow disks.

On Wed, Dec 22, 2010 at 5:43 AM, Rogier Wolff <[email protected]> wrote:
>
> Unquoted text below is from either me or from my friend.
>
>
> Someone suggested we try an older kernel as if kernel 2.6.32 would not
> have this problem. We do NOT think it suddenly started with a certain
> kernel version. I was just hoping to have you kernel-guys help with
> prodding the kernel into revealing which component was screwing things
> up....
>
>
> On Mon, Dec 20, 2010 at 01:32:44PM -0500, Greg Freemyer wrote:
>> On Mon, Dec 20, 2010 at 1:06 PM, Bruno Prémont
>> <[email protected]> wrote:
>> > Hi,
>> >
>> > [ccing linux-ide]
>> >
>> > Please provide the part of kernel log showing initialization of your
>> > disk controller(s) as well as detection of all the discs.
>
>
> sata_sil 0000:03:01.0: version 2.4
> sata_sil 0000:03:01.0: PCI INT A -> GSI 24 (level, low) -> IRQ 24
> sata_sil 0000:03:01.0: Applying R_ERR on DMA activate FIS errata fix
> scsi2 : sata_sil
> scsi3 : sata_sil
> scsi4 : sata_sil
> scsi5 : sata_sil
> ata3: SATA max UDMA/100 mmio m1024@0xed200000 tf 0xed200080 irq 24
> ata4: SATA max UDMA/100 mmio m1024@0xed200000 tf 0xed2000c0 irq 24
> ata5: SATA max UDMA/100 mmio m1024@0xed200000 tf 0xed200280 irq 24
> ata6: SATA max UDMA/100 mmio m1024@0xed200000 tf 0xed2002c0 irq 24
> ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> ata3.00: ATA-8: WDC WD10EARS-00Y5B1, 80.00A80, max UDMA/133
> ata3.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 0/32)
> ata3.00: configured for UDMA/100
> scsi 2:0:0:0: Direct-Access ? ? ATA ? ? ?WDC WD10EARS-00Y 80.0 PQ: 0 ANSI: 5
> usb 2-2: new low speed USB device using uhci_hcd and address 2
> ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> ata4.00: ATA-7: SAMSUNG HD103SI, 1AG01118, max UDMA7
> ata4.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 0/32)
> ata4.00: configured for UDMA/100
> scsi 3:0:0:0: Direct-Access ? ? ATA ? ? ?SAMSUNG HD103SI ?1AG0 PQ: 0 ANSI: 5
> ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> ata5.00: ATA-8: WDC WD10EARS-00Y5B1, 80.00A80, max UDMA/133
> ata5.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 0/32)
> ata5.00: configured for UDMA/100
> scsi 4:0:0:0: Direct-Access ? ? ATA ? ? ?WDC WD10EARS-00Y 80.0 PQ: 0 ANSI: 5
> ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> ata6.00: ATA-8: WDC WD10EARS-00Y5B1, 80.00A80, max UDMA/133
> ata6.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 0/32)
> ata6.00: configured for UDMA/100
> scsi 5:0:0:0: Direct-Access ? ? ATA ? ? ?WDC WD10EARS-00Y 80.0 PQ: 0 ANSI: 5
> sd 2:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
> sd 2:0:0:0: [sda] Write Protect is off
> sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
> sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't
> support DPO or FUA
> sd 3:0:0:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
> sd 3:0:0:0: [sdb] Write Protect is off
> sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
> sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't
> support DPO or FUA
> sd 4:0:0:0: [sdc] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
> sd 4:0:0:0: [sdc] Write Protect is off
> sd 4:0:0:0: [sdc] Mode Sense: 00 3a 00 00
> sd 4:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't
> support DPO or FUA
> sd 5:0:0:0: [sdd] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
> sd 5:0:0:0: [sdd] Write Protect is off
> sd 5:0:0:0: [sdd] Mode Sense: 00 3a 00 00
> sd 5:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't
> support DPO or FUA
> sd 5:0:0:0: [sdd] Write Protect is off
> sd 5:0:0:0: [sdd] Mode Sense: 00 3a 00 00
> sd 5:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't
> support DPO or FUA
> ?sdb: sdb1 sdb2 sdb3 sdb4
> sd 3:0:0:0: [sdb] Attached SCSI disk
> ?sda: sda1 sda2 sda3 sda4
> sd 2:0:0:0: [sda] Attached SCSI disk
> ?sdc: sdc1 sdc2 sdc3 sdc4
> sd 4:0:0:0: [sdc] Attached SCSI disk
> ?sdd: sdd1 sdd2 sdd3 sdd4
> sd 5:0:0:0: [sdd] Attached SCSI disk
>
>
>
>> > Verbose lspci output for the disc controller and $(smartctl -i -A $disk)
>> > output might be useful as well.
>
>
> 03:01.0 Mass storage controller: Silicon Image, Inc. SiI 3114
> [SATALink/SATARaid] Serial ATA Controller (rev 02)
> ? ? ? ?Subsystem: Silicon Image, Inc. SiI 3114 SATALink Controller
> ? ? ? ?Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR+ FastB2B- DisINTx-
> ? ? ? ?Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
> ? ? ? ?Latency: 32, Cache Line Size: 32 bytes
> ? ? ? ?Interrupt: pin A routed to IRQ 24
> ? ? ? ?Region 0: I/O ports at 4020 [size=8]
> ? ? ? ?Region 1: I/O ports at 4014 [size=4]
> ? ? ? ?Region 2: I/O ports at 4018 [size=8]
> ? ? ? ?Region 3: I/O ports at 4010 [size=4]
> ? ? ? ?Region 4: I/O ports at 4000 [size=16]
> ? ? ? ?Region 5: Memory at ed200000 (32-bit, non-prefetchable) [size=1K]
> ? ? ? ?[virtual] Expansion ROM at e8000000 [disabled] [size=512K]
> ? ? ? ?Capabilities: [60] Power Management version 2
> ? ? ? ? ? ? ? ?Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA
> ? ? ? ? ? ? ? ?PME(D0-,D1-,D2-,D3hot-,D3cold-)
> ? ? ? ? ? ? ? ?Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=2 PME-
> ? ? ? ?Kernel driver in use: sata_sil
> ? ? ? ?Kernel modules: sata_sil
>
>
> But also tried onboard card:
>
> 00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE
> Controller (rev 01) (prog-if 8a [Master SecP PriP])
> ? ? ? ?Subsystem: Super Micro Computer Inc Device 7980
> ? ? ? ?Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx-
> ? ? ? ?Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
> ? ? ? ?Latency: 0
> ? ? ? ?Interrupt: pin A routed to IRQ 18
> ? ? ? ?Region 0: I/O ports at 01f0 [size=8]
> ? ? ? ?Region 1: I/O ports at 03f4 [size=1]
> ? ? ? ?Region 2: I/O ports at 0170 [size=8]
> ? ? ? ?Region 3: I/O ports at 0374 [size=1]
> ? ? ? ?Region 4: I/O ports at 30a0 [size=16]
> ? ? ? ?Kernel driver in use: ata_piix
> ? ? ? ?Kernel modules: ata_generic, pata_acpi, ata_piix, ide-pci-generic,
> ? ? ? ?piix
>
> smartctl output:
> ? ? ? ?Kernel modules: ata_generic, pata_acpi, ata_piix, ide-pci-generic,
> ? ? ? ?piix
>
> smartctl output:
>
> smartctl 5.40 2010-10-16 r3189 [x86_64-unknown-linux-gnu] (local build)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>
> === START OF INFORMATION SECTION ===
> Model Family: ? ? Western Digital Caviar Green (Adv. Format) family
> Device Model: ? ? WDC WD10EARS-00Y5B1
> Serial Number: ? ?WD-WCAV55759454
> Firmware Version: 80.00A80
> User Capacity: ? ?1,000,204,886,016 bytes
> Device is: ? ? ? ?In smartctl database [for details use: -P show]
> ATA Version is: ? 8
> ATA Standard is: ?Exact ATA specification draft version not indicated
> Local Time is: ? ?Tue Dec 21 20:06:00 2010 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME ? ? ? ? ?FLAG ? ? VALUE WORST THRESH TYPE
> UPDATED ?WHEN_FAILED RAW_VALUE
> ?1 Raw_Read_Error_Rate ? ? 0x002f ? 200 ? 200 ? 051 ? ?Pre-fail
> Always ? ? ? - ? ? ? 0
> ?3 Spin_Up_Time ? ? ? ? ? ?0x0027 ? 132 ? 119 ? 021 ? ?Pre-fail
> Always ? ? ? - ? ? ? 6391
> ?4 Start_Stop_Count ? ? ? ?0x0032 ? 100 ? 100 ? 000 ? ?Old_age
> Always ? ? ? - ? ? ? 56
> ?5 Reallocated_Sector_Ct ? 0x0033 ? 200 ? 200 ? 140 ? ?Pre-fail
> Always ? ? ? - ? ? ? 0
> ?7 Seek_Error_Rate ? ? ? ? 0x002e ? 200 ? 200 ? 000 ? ?Old_age
> Always ? ? ? - ? ? ? 0
> ?9 Power_On_Hours ? ? ? ? ?0x0032 ? 091 ? 091 ? 000 ? ?Old_age
> Always ? ? ? - ? ? ? 7189
> ?10 Spin_Retry_Count ? ? ? ?0x0032 ? 100 ? 253 ? 000 ? ?Old_age
> Always ? ? ? - ? ? ? 0
> ?11 Calibration_Retry_Count 0x0032 ? 100 ? 253 ? 000 ? ?Old_age
> Always ? ? ? - ? ? ? 0
> ?12 Power_Cycle_Count ? ? ? 0x0032 ? 100 ? 100 ? 000 ? ?Old_age
> Always ? ? ? - ? ? ? 54
> 192 Power-Off_Retract_Count 0x0032 ? 200 ? 200 ? 000 ? ?Old_age ? Always
> ? ? ?- ? ? ? 39
> 193 Load_Cycle_Count ? ? ? ?0x0032 ? 164 ? 164 ? 000 ? ?Old_age ? Always
> ? ? ?- ? ? ? 109955
> 194 Temperature_Celsius ? ? 0x0022 ? 109 ? 107 ? 000 ? ?Old_age ? Always
> ? ? ?- ? ? ? 38
> 196 Reallocated_Event_Count 0x0032 ? 200 ? 200 ? 000 ? ?Old_age ? Always
> ? ? ?- ? ? ? 0
> 197 Current_Pending_Sector ?0x0032 ? 200 ? 200 ? 000 ? ?Old_age ? Always
> ? ? ?- ? ? ? 0
> 198 Offline_Uncorrectable ? 0x0030 ? 200 ? 200 ? 000 ? ?Old_age
> Offline ? ? ?- ? ? ? 0
> 199 UDMA_CRC_Error_Count ? ?0x0032 ? 200 ? 200 ? 000 ? ?Old_age ? Always
> ? ? ?- ? ? ? 0
> 200 Multi_Zone_Error_Rate ? 0x0008 ? 200 ? 200 ? 000 ? ?Old_age
> Offline ? ? ?- ? ? ? 0
> ? ? ?- ? ? ? 0
> 200 Multi_Zone_Error_Rate ? 0x0008 ? 200 ? 200 ? 000 ? ?Old_age
> Offline ? ? ?- ? ? ? 0
>
> smartctl 5.40 2010-10-16 r3189 [x86_64-unknown-linux-gnu] (local build)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>
> === START OF INFORMATION SECTION ===
> Model Family: ? ? Western Digital Caviar Green (Adv. Format) family
> Device Model: ? ? WDC WD10EARS-00Y5B1
> Serial Number: ? ?WD-WCAV55759454
> Firmware Version: 80.00A80
> User Capacity: ? ?1,000,204,886,016 bytes
> Device is: ? ? ? ?In smartctl database [for details use: -P show]
> ATA Version is: ? 8
> ATA Standard is: ?Exact ATA specification draft version not indicated
> Local Time is: ? ?Tue Dec 21 20:06:00 2010 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME ? ? ? ? ?FLAG ? ? VALUE WORST THRESH TYPE
> UPDATED ?WHEN_FAILED RAW_VALUE
> ?1 Raw_Read_Error_Rate ? ? 0x002f ? 200 ? 200 ? 051 ? ?Pre-fail
> Always ? ? ? - ? ? ? 0
> ?3 Spin_Up_Time ? ? ? ? ? ?0x0027 ? 132 ? 119 ? 021 ? ?Pre-fail
> Always ? ? ? - ? ? ? 6391
> ?4 Start_Stop_Count ? ? ? ?0x0032 ? 100 ? 100 ? 000 ? ?Old_age
> Always ? ? ? - ? ? ? 56
> ?5 Reallocated_Sector_Ct ? 0x0033 ? 200 ? 200 ? 140 ? ?Pre-fail
> Always ? ? ? - ? ? ? 0
> ?7 Seek_Error_Rate ? ? ? ? 0x002e ? 200 ? 200 ? 000 ? ?Old_age
> Always ? ? ? - ? ? ? 0
> ?9 Power_On_Hours ? ? ? ? ?0x0032 ? 091 ? 091 ? 000 ? ?Old_age
> Always ? ? ? - ? ? ? 7189
> ?10 Spin_Retry_Count ? ? ? ?0x0032 ? 100 ? 253 ? 000 ? ?Old_age
> Always ? ? ? - ? ? ? 0
> ?11 Calibration_Retry_Count 0x0032 ? 100 ? 253 ? 000 ? ?Old_age
> Always ? ? ? - ? ? ? 0
> ?12 Power_Cycle_Count ? ? ? 0x0032 ? 100 ? 100 ? 000 ? ?Old_age
> Always ? ? ? - ? ? ? 54
> 192 Power-Off_Retract_Count 0x0032 ? 200 ? 200 ? 000 ? ?Old_age ? Always
> ? ? ?- ? ? ? 39
> 193 Load_Cycle_Count ? ? ? ?0x0032 ? 164 ? 164 ? 000 ? ?Old_age ? Always
> ? ? ?- ? ? ? 109955
> 194 Temperature_Celsius ? ? 0x0022 ? 109 ? 107 ? 000 ? ?Old_age ? Always
> ? ? ?- ? ? ? 38
> 196 Reallocated_Event_Count 0x0032 ? 200 ? 200 ? 000 ? ?Old_age ? Always
> ? ? ?- ? ? ? 0
> 197 Current_Pending_Sector ?0x0032 ? 200 ? 200 ? 000 ? ?Old_age ? Always
> ? ? ?- ? ? ? 0
> 198 Offline_Uncorrectable ? 0x0030 ? 200 ? 200 ? 000 ? ?Old_age
> Offline ? ? ?- ? ? ? 0
> 199 UDMA_CRC_Error_Count ? ?0x0032 ? 200 ? 200 ? 000 ? ?Old_age ? Always
> ? ? ?- ? ? ? 0
> 200 Multi_Zone_Error_Rate ? 0x0008 ? 200 ? 200 ? 000 ? ?Old_age
> Offline ? ? ?- ? ? ? 0
>
> smartctl 5.40 2010-10-16 r3189 [x86_64-unknown-linux-gnu] (local build)
> 200 Multi_Zone_Error_Rate ? 0x0008 ? 200 ? 200 ? 000 ? ?Old_age
> Offline ? ? ?- ? ? ? 0
>
>
> The others are very similar....
>
>
>> >
>> > Did you try the individual discs on a completely different system (e.g.
>> > plain desktop system) and what revision of SATA are both components
>> > supporting?
>
> Yes I did. The disks were installed in a MSI/Core2DUO based desktop
> system. No problems at all. Transfer rates up to 200MB/s.
>
>
> The SIL 3114 chip is 1.5Gbps SATA. .
>
>
> Searching for information on the WD drives I stumbled across:
>
> http://community.wdc.com/t5/Other-Internal-Drives/1-TB-WD10EARS-desynch-issues-in-RAID/m-p/11559
>
> Where it seems that WD simply says not to use these drives in a RAID.
> I have experience with "Raid Edition" drives: They go bad at a MUCH
> too high rate. If we can't use the non-raid for a RAID application, then
> there is just ONE possible option: STAY AWAY FROM WESTERN DIGITAL:
>
> Western digital claims it has the right to mess things up if you put a
> non-raid drive in a raid configuration. Well fine. Then they can also
> mess things up in normal situations because when Linux does software
> raid there isn't any difference from RAID accesses.
>
> (if you click through and read their entry in the knowledge base,
> you'd notice that it should be more or less the other way
> around. Linux will drop the RAID-enabled drive from the RAID within
> seven seconds and reporting error on a sector, whereas the desktop
> drive would remain operational until Linux times out (30 seconds?))
>
>
>
> More hardware info:
>
> System: Supermicro PDSMi, 4xDDR2 1GB, disks and controllers as above.
> Current kernel version: 2.6.36.2
> Problem was also present in kernel 2.6.33 (sorry cannot downgrade again.
> This is a production system...)
>
> uname -a:
> Linux jcz.nl 2.6.36-ARCH #1 SMP PREEMPT Fri Dec 10 20:32:37 CET 2010
> x86_64 Intel(R) Pentium(R) D CPU 3.20GHz GenuineIntel GNU/Linux
>
> Disklayout:
>
> major minor ?#blocks ?name
>
> ? 8 ? ? ? ?0 ?976762584 sda
> ? 8 ? ? ? ?1 ? ? 240943 sda1
> ? 8 ? ? ? ?2 ? 19535040 sda2
> ? 8 ? ? ? ?3 ? ?1951897 sda3
> ? 8 ? ? ? ?4 ?955032120 sda4
> ? 8 ? ? ? 16 ?976762584 sdb
> ? 8 ? ? ? 17 ? ? 240943 sdb1
> ? 8 ? ? ? 18 ? 19535040 sdb2
> ? 8 ? ? ? 19 ? ?1951897 sdb3
> ? 8 ? ? ? 20 ?955032120 sdb4
> ? 8 ? ? ? 32 ?976762584 sdc
> ? 8 ? ? ? 33 ? ? 240943 sdc1
> ? 8 ? ? ? 34 ? 19535040 sdc2
> ? 8 ? ? ? 35 ? ?1951897 sdc3
> ? 8 ? ? ? 36 ?955032120 sdc4
> ? 8 ? ? ? 48 ?976762584 sdd
> ? 8 ? ? ? 49 ? ? 240943 sdd1
> ? 8 ? ? ? 50 ? 19535040 sdd2
> ? 8 ? ? ? 51 ? ?1951897 sdd3
> ? 8 ? ? ? 52 ?955032120 sdd4
> ? 9 ? ? ?127 ? ? 240832 md127
> ? 9 ? ? ? ?1 ? 39067648 md1
> ? 9 ? ? ?126 1910063104 md126
> ? 9 ? ? ?125 ? ?3903488 md125
>
> MDstat:
>
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md125 : active raid5 sdd3[5](S) sdb3[4] sda3[0] sdc3[3]
> ? ? ?3903488 blocks super 1.1 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
>
> md126 : active raid5 sda4[0] sdd4[3] sdc4[5](S) sdb4[4]
> ? ? ?1910063104 blocks super 1.1 level 5, 512k chunk, algorithm 2
> [3/3] [UUU]
>
> md1 : active raid5 sda2[0] sdd2[3](S) sdb2[1] sdc2[4]
> ? ? ?39067648 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3]
> [3/3] [UUU]
>
> md1 : active raid5 sda2[0] sdd2[3](S) sdb2[1] sdc2[4]
> ? ? ?39067648 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3]
> [UUU]
>
> md127 : active raid1 sdd1[3](S) sda1[0] sdb1[1] sdc1[2]
> ? ? ?240832 blocks [3/3] [UUU]
>
> unused devices: <none>
> rootfs / rootfs rw 0 0
> proc /proc proc rw,relatime 0 0
> sys /sys sysfs rw,relatime 0 0
> udev /dev devtmpfs
> rw,nosuid,relatime,size=10240k,nr_inodes=506317,mode=755 0 0
> /dev/disk/by-label/rootfs / ext4
> rw,relatime,barrier=1,stripe=256,data=ordered 0 0
> devpts /dev/pts devpts rw,relatime,mode=600,ptmxmode=000 0 0
> shm /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
> /dev/md127 /boot ext3
> rw,relatime,errors=continue,barrier=0,data=writeback 0 0
> /dev/md126 /data ext4 rw,relatime,barrier=1,data=ordered 0 0
>
>
> Because of the severity of the problems (which remain after trying
> another sata card), I have already bought a new Supermicro server. Let's
> hope that helps.


The Load_Cycle_Count values are very high, which means your drive
heads are parking all the time. Possibly multiple times a minute.

I don't know if it's your problem, but I'd say something is wrong, and
I've seen excessive head parking cause disk write failures in Windows.
In Linux I think it just wears out your drive prematurely. And of
course any I/Os are delayed if the heads are parked when the commands
hit the drive.

There is a Linux package, storage-fixup, specifically targeting drives
that have this issue. Hopefully it can at least keep your heads from
parking continuously.

1) Be sure you have the userspace package storage-fixup installed.

2) Look in /etc/storage-fixup.conf and see if your drives are in the list.

If not, try to work with the storage-fixup maintainer (Tejun Heo?) to
get your drives added.

And while testing, watch Load_Cycle_Count and ensure it is not
increasing too fast, i.e. several times an hour is fine; several
times per minute is too much.
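If the drives honour APM, something like the following may tame the
parking; note that many WD Greens are reported to ignore the APM
setting, in which case the vendor's idle-timer tool is needed instead
(sdX is a placeholder):

   hdparm -B 254 /dev/sdX                         # least aggressive power management
   smartctl -A /dev/sdX | grep Load_Cycle_Count   # re-check the counter later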

Greg

2010-12-22 20:52:32

by David Rees

Subject: Re: Slow disks.

On Mon, Dec 20, 2010 at 6:15 AM, Rogier Wolff <[email protected]> wrote:
> A friend of mine has a server in a datacenter somewhere. His machine
> is not working properly: most of his disks take 10-100 times longer
> to process each IO request than normal.

I've seen noticeable slowdowns before when the disks were running long
SMART tests. It doesn't seem to affect all drives the same. The
smartctl data you posted didn't say whether or not the disks were
currently running a SMART test, but it's worth double checking.
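For example, the self-test execution status shows up in the
capabilities section:

   smartctl -c /dev/sdX | grep -A1 'Self-test execution status'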

-Dave

2010-12-22 22:44:19

by Rogier Wolff

Subject: Re: Slow disks.

On Wed, Dec 22, 2010 at 11:27:20AM -0500, Jeff Moyer wrote:
> Rogier Wolff <[email protected]> writes:
>
> > Unquoted text below is from either me or from my friend.
> >
> >
> > Someone suggested we try an older kernel as if kernel 2.6.32 would not
> > have this problem. We do NOT think it suddenly started with a certain
> > kernel version. I was just hoping to have you kernel-guys help with
> > prodding the kernel into revealing which component was screwing things
> > up....
> [...]
> > ata3.00: ATA-8: WDC WD10EARS-00Y5B1, 80.00A80, max UDMA/133
>
> This is an "Advanced format" drive, which, in this case, means it
> internally has a 4KB sector size and exports a 512byte logical sector
> size. If your partitions are misaligned, this can cause performance
> problems.

This would mean that for a misaligned write, the drive would have to
read-modify-write every super-sector.

In my performance calculations, 10ms average seek (should be around
7), 4ms average rotational latency for a total of 14ms. This would
degrade for read-modify-write to 10+4+8 = 22ms. Still 10 times better
than what we observe: service times on the order of 200-300ms.

> md1 : active raid5 sda2[0] sdd2[3](S) sdb2[1] sdc2[4]
> > 39067648 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3]
> > [UUU]
>
> A 512KB raid5 chunk with 4KB I/Os? That is a recipe for inefficiency.
> Again, blktrace data would be helpful.

Where did you get the 4 kB IOs from? You mean from the iostat -x
output? The system/filesystem decided to do those small IOs. With the
throughput we're getting on the filesystem, it had better not try to
write larger chunks...

I have benchmarked my own "high bandwidth" raid arrays. I benchmarked
them with 128k, 256, 512 and 1024k blocksize. I got the best
throughput (for my benchmark: dd if=/dev/md0 of=/dev/null bs=1024k)
with 512k blocksize. (and yes that IS a valid benchmark for my
usage of the array.)

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2010-12-22 22:46:04

by Rogier Wolff

Subject: Re: Slow disks.

On Wed, Dec 22, 2010 at 12:52:08PM -0800, David Rees wrote:
> On Mon, Dec 20, 2010 at 6:15 AM, Rogier Wolff <[email protected]> wrote:
> > A friend of mine has a server in a datacenter somewhere. His machine
> > is not working properly: most of his disks take 10-100 times longer
> > to process each IO request than normal.
>
> I've seen noticeable slowdowns before when the disks were running long
> SMART tests. It doesn't seem to affect all drives the same. The
> smartctl data you posted didn't say whether or not the disks were
> currently running a SMART test, but it's worth double checking.

No, the machine has trouble 24/7; I seriously doubt that it would be
running SMART tests 24/7.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2010-12-22 23:13:32

by David Rees

Subject: Re: Slow disks.

On Wed, Dec 22, 2010 at 2:46 PM, Rogier Wolff <[email protected]> wrote:
> On Wed, Dec 22, 2010 at 12:52:08PM -0800, David Rees wrote:
>> On Mon, Dec 20, 2010 at 6:15 AM, Rogier Wolff <[email protected]> wrote:
>> > A friend of mine has a server in a datacenter somewhere. His machine
>> > is not working properly: most of his disks take 10-100 times longer
>> > to process each IO request than normal.
>>
>> I've seen noticeable slowdowns before when the disks were running long
>> SMART tests. ?It doesn't seem to affect all drives the same. ?The
>> smartctl data you posted didn't say whether or not the disks were
>> currently running a SMART test, but it's worth double checking.
>
> No, the machine has trouble 24/7, I seriously doubt that it would be
> running smart tests 24/7.

It can't hurt to check the status of the disk to be sure...

-Dave

2010-12-23 14:41:15

by Jeff Moyer

Subject: Re: Slow disks.

Rogier Wolff <[email protected]> writes:

> On Wed, Dec 22, 2010 at 11:27:20AM -0500, Jeff Moyer wrote:
>> Rogier Wolff <[email protected]> writes:
>>
>> > Unquoted text below is from either me or from my friend.
>> >
>> >
>> > Someone suggested we try an older kernel as if kernel 2.6.32 would not
>> > have this problem. We do NOT think it suddenly started with a certain
>> > kernel version. I was just hoping to have you kernel-guys help with
>> > prodding the kernel into revealing which component was screwing things
>> > up....
>> [...]
>> > ata3.00: ATA-8: WDC WD10EARS-00Y5B1, 80.00A80, max UDMA/133
>>
>> This is an "Advanced format" drive, which, in this case, means it
>> internally has a 4KB sector size and exports a 512byte logical sector
>> size. If your partitions are misaligned, this can cause performance
>> problems.
>
> This would mean that for a misalgned write, the drive would have to
> read-modify-write every super-sector.
>
> In my performance calculations, 10ms average seek (should be around
> 7), 4ms average rotational latency for a total of 14ms. This would
> degrade for read-modify-write to 10+4+8 = 22ms. Still 10 times better
> than what we observe: service times on the order of 200-300ms.

I didn't say it would account for all of your degradation, just that it
could affect performance. I'm sorry if I wasn't clear on that.

> > md1 : active raid5 sda2[0] sdd2[3](S) sdb2[1] sdc2[4]
>> > 39067648 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3]
>> > [UUU]
>>
>> A 512KB raid5 chunk with 4KB I/Os? That is a recipe for inefficiency.
>> Again, blktrace data would be helpful.
>
> Where did you get the 4kb IOs from? You mean from the iostat -x
> output?

Yes, since that's all I have to go on at the moment.

> The system/filesystem decided to do those small IOs. With the
> throughput we're getting on the filesystem, it better not try to write
> larger chuncks...

Your logic is a bit flawed, for so many reasons I'm not even going to
try to enumerate them here. Anyway, I'll continue to sound like a
broken record and ask for blktrace data.

> I have benchmarked my own "high bandwidth" raid arrays. I benchmarked
> them with 128k, 256, 512 and 1024k blocksize. I got the best
> throughput (for my benchmark: dd if=/dev/md0 of=/dev/null bs=1024k)
> with 512k blocksize. (and yes that IS a valid benchmark for my
> usage of the array.)

Sorry, I'm not sure I understand how this is relevant. I thought we
were troubleshooting a problem on someone else's system. Further, the
window into the workload we saw via iostat definitely shows that smaller
I/Os are issued.

Anyway, it will be much easier to debate the issue once the blktrace
data is gathered.

Happy holidays.

-Jeff

2010-12-23 17:01:12

by Rogier Wolff

Subject: Re: Slow disks.

On Thu, Dec 23, 2010 at 09:40:54AM -0500, Jeff Moyer wrote:
> > In my performance calculations, 10ms average seek (should be around
> > 7), 4ms average rotational latency for a total of 14ms. This would
> > degrade for read-modify-write to 10+4+8 = 22ms. Still 10 times better
> > than what we observe: service times on the order of 200-300ms.
>
> I didn't say it would account for all of your degradation, just that it
> could affect performance. I'm sorry if I wasn't clear on that.

We can live with a "2x performance degradation" due to stupid
configuration. But not with the 10x -30x that we're seeing now.

> > > md1 : active raid5 sda2[0] sdd2[3](S) sdb2[1] sdc2[4]
> >> > 39067648 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3]
> >> > [UUU]
> >>
> >> A 512KB raid5 chunk with 4KB I/Os? That is a recipe for inefficiency.
> >> Again, blktrace data would be helpful.
> >
> > Where did you get the 4kb IOs from? You mean from the iostat -x
> > output?
>
> Yes, since that's all I have to go on at the moment.
>
> > The system/filesystem decided to do those small IOs. With the
> > throughput we're getting on the filesystem, it better not try to write
> > larger chuncks...
>
> Your logic is a bit flawed, for so many reasons I'm not even going to
> try to enumerate them here. Anyway, I'll continue to sound like a
> broken record and ask for blktrace data.

Here it is.

http://prive.bitwizard.nl/blktrace.log

I can't read those yet... Manual is unclear.

I'd guess that "D" means "submitted to driver" and "C" means
"completed". I very often see a D followed VERY shortly by a C. Also I
see more C's than D's.

Another way of looking at it was to sort on the "ID" field. I would
expect each "transaction" to follow similar steps, but many IDs only
occur twice, and not the same steps for each.

> > I have benchmarked my own "high bandwidth" raid arrays. I benchmarked
> > them with 128k, 256, 512 and 1024k blocksize. I got the best
> > throughput (for my benchmark: dd if=/dev/md0 of=/dev/null bs=1024k)
> > with 512k blocksize. (and yes that IS a valid benchmark for my
> > usage of the array.)
>
> Sorry, I'm not sure I understand how this is relevant. I thought we
> were troubleshooting a problem on someone else's system. Further, the
> window into the workload we saw via iostat definitely shows that smaller
> I/Os are issued.

My friend confessed to me today that he determined the "optimal" RAID
block size with the exact same test as I had done, and reached the
same conclusion. So that explains his raid blocksize of 512k.

The system is a mailserver running on a RAID on three of the disks.
Most of the IOs are generated by the mail server software through the
FS driver and the RAID system. It's not that we're running a database
that inherently requires 4 kB IOs. Apparently what the system needs
are those small IOs.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2010-12-23 17:12:35

by Jaap Crezee

Subject: Re: Slow disks.

On 12/23/10 15:40, Jeff Moyer wrote:
> Your logic is a bit flawed, for so many reasons I'm not even going to
> try to enumerate them here. Anyway, I'll continue to sound like a
> broken record and ask for blktrace data.
> Anyway, it will be much easier to debate the issue once the blktrace
> data is gathered.

Here you are:

http://jcz.nl:8080/blktrace.log

Regards,

Jaap Crezee


PS Today one drive failed with IO errors and is currently "dead". The performance increased a little...

2010-12-23 17:47:46

by Jeff Moyer

Subject: Re: Slow disks.

Rogier Wolff <[email protected]> writes:

> On Thu, Dec 23, 2010 at 09:40:54AM -0500, Jeff Moyer wrote:
>> > In my performance calculations, 10ms average seek (should be around
>> > 7), 4ms average rotational latency for a total of 14ms. This would
>> > degrade for read-modify-write to 10+4+8 = 22ms. Still 10 times better
>> > than what we observe: service times on the order of 200-300ms.
>>
>> I didn't say it would account for all of your degradation, just that it
>> could affect performance. I'm sorry if I wasn't clear on that.
>
> We can live with a "2x performance degradation" due to stupid
> configuration. But not with the 10x -30x that we're seeing now.

Wow. I'm not willing to give up any performance due to
misconfiguration!

>> > > md1 : active raid5 sda2[0] sdd2[3](S) sdb2[1] sdc2[4]
>> >> > 39067648 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3]
>> >> > [UUU]
>> >>
>> >> A 512KB raid5 chunk with 4KB I/Os? That is a recipe for inefficiency.
>> >> Again, blktrace data would be helpful.
>> >
>> > Where did you get the 4kb IOs from? You mean from the iostat -x
>> > output?
>>
>> Yes, since that's all I have to go on at the moment.
>>
>> > The system/filesystem decided to do those small IOs. With the
>> > throughput we're getting on the filesystem, it better not try to write
>> > larger chuncks...
>>
>> Your logic is a bit flawed, for so many reasons I'm not even going to
>> try to enumerate them here. Anyway, I'll continue to sound like a
>> broken record and ask for blktrace data.
>
> Here it is.
>
> http://prive.bitwizard.nl/blktrace.log
>
> I can't read those yet... Manual is unclear.

OK, I should have made it clear that I wanted the binary logs. No
matter, we'll work with what you've sent.

> My friend confessed to me today that he determined the "optimal" RAID
> block size with the exact same test as I had done, and reached the
> same conclusion. So that explains his raid blocksize of 512k.
>
> The system is a mailserver running on a raid on three of the disks.
> most of the IOs are generated by the mail server software through the
> FS driver, and the raid system. It's not that we're running a database
> that inherently requires 4k IOs. Apparently what the
> system needs are those small IOs.

The log shows a lot of write barriers:

8,32 0 1183 169.033279975 778 A WBS 481958 + 2 <- (8,34) 8
^^^

On pre-2.6.37 kernels, that will fully flush the device queue, which is
why you're seeing such a small queue depth. There was also a CFQ patch
that sped up fsync performance for small files that landed in .37. I
can't remember if you ran with a 2.6.37-rc or not. Have you? It may be
in your best interest to give the latest -rc a try and report back.
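
If it helps, here is a rough way to see how many requests in that text
log carry the barrier flag (a sketch; field positions again assumed
from the default blkparse output):

#!/usr/bin/env python3
# Sketch: count completions and barrier-flagged requests in the
# plain-text blktrace/blkparse log, to get a feel for how often the
# queue is being flushed. Field positions assume default blkparse output.
import sys
from collections import Counter

counts = Counter()
for line in open(sys.argv[1]):
    fields = line.split()
    if len(fields) < 7:
        continue
    action, rwbs = fields[5], fields[6]
    if action == 'C':
        counts['completed'] += 1
    if action in ('Q', 'D') and 'B' in rwbs:
        counts['barrier (%s)' % action] += 1

for name, n in sorted(counts.items()):
    print("%-14s %d" % (name, n))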

Cheers,
Jeff

2010-12-23 18:51:36

by Greg Freemyer

[permalink] [raw]
Subject: Re: Slow disks.

On Thu, Dec 23, 2010 at 12:47 PM, Jeff Moyer <[email protected]> wrote:
> Rogier Wolff <[email protected]> writes:
>
>> On Thu, Dec 23, 2010 at 09:40:54AM -0500, Jeff Moyer wrote:
>>> > In my performance calculations, 10ms average seek (should be around
>>> > 7), 4ms average rotational latency for a total of 14ms. This would
>>> > degrade for read-modify-write to 10+4+8 = 22ms. Still 10 times better
>>> > than what we observe: service times on the order of 200-300ms.
>>>
>>> I didn't say it would account for all of your degradation, just that it
>>> could affect performance. I'm sorry if I wasn't clear on that.
>>
>> We can live with a "2x performance degradation" due to stupid
>> configuration. But not with the 10x -30x that we're seeing now.
>
> Wow. I'm not willing to give up any performance due to
> misconfiguration!

I suspect a mailserver on a raid 5 with large chunksize could be a lot
worse than 2x slower. But most of the blame is just raid 5.

ie.
write 4K from userspace

Kernel
Read old primary data, wait for data to actually arrive
Read old parity data, wait again
modify both for new data
write primary data to drive queue
write parity data to drive queue

userspace: fsync
kernel: force data from queues to drive (requires wait)


I'm guessing raid1 or raid10 would be several times faster. And is at
least as robust as raid 5.

ie.
write 4K from userspace

Kernel
write 4K to first mirror's queue
write 4K to second mirror's queue
done

userspace: fsync
kernel: force data from queues to drive (requires wait)
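
Putting rough numbers on those two sequences (using the 10 ms seek /
4 ms rotational figures quoted earlier in the thread; a
back-of-the-envelope sketch, not a measurement):

#!/usr/bin/env python3
# Back-of-the-envelope sketch: one 4K random write on raid5
# (read-modify-write) vs raid1 (two parallel mirror writes).
# The seek/rotation figures follow the 10+4+8 ms estimate upthread;
# they are assumptions, not measurements from this box.
SEEK_MS = 10.0   # average seek
ROT_MS = 4.0     # average rotational latency
REV_MS = 8.0     # roughly one revolution to write the just-read sector back

raid5_rmw = SEEK_MS + ROT_MS + REV_MS  # read old data/parity, then write back
raid1 = SEEK_MS + ROT_MS               # both mirrors written in parallel

print("estimated 4K write: raid5 ~%.0f ms, raid1 ~%.0f ms" % (raid5_rmw, raid1))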

Good Luck
Greg

2010-12-23 19:10:42

by Jaap Crezee

[permalink] [raw]
Subject: Re: Slow disks.

On 12/23/10 19:51, Greg Freemyer wrote:
> On Thu, Dec 23, 2010 at 12:47 PM, Jeff Moyer<[email protected]> wrote:
> I suspect a mailserver on a raid 5 with large chunksize could be a lot
> worse than 2x slower. But most of the blame is just raid 5.

Hmmm, well if this really is so.. I use raid 5 so as not to "spoil" the storage space
of one disk. I am using some other servers with raid 5 md's which seem to be
running just fine, even under higher load than the machine we are talking about.

Looking at the vmstat block IO, the typical load (both write and read) seems to
be less than 20 blocks per second. Will this drop the performance of the array
(measured by dd if=/dev/md<x> of=/dev/null bs=1M) below 3 MB/s?

> ie.
> write 4K from userspace
>
> Kernel
> Read old primary data, wait for data to actually arrive
> Read old parity data, wait again
> modify both for new data
> write primary data to drive queue
> write parity data to drive queue

What if I (theoretically) change the chunksize to 4kb? (I can try that in the
new server...).

Jaap

2010-12-23 22:10:09

by Greg Freemyer

[permalink] [raw]
Subject: Re: Slow disks.

On Thu, Dec 23, 2010 at 2:10 PM, Jaap Crezee <[email protected]> wrote:
> On 12/23/10 19:51, Greg Freemyer wrote:
> >> On Thu, Dec 23, 2010 at 12:47 PM, Jeff Moyer<[email protected]> wrote:
>> I suspect a mailserver on a raid 5 with large chunksize could be a lot
>> worse than 2x slower. But most of the blame is just raid 5.
>
> Hmmm, well if this really is so.. I use raid 5 to not "spoil" the storage
> space of one disk. I am using some other servers with raid 5 md's which
> seems to be running just fine; even under higher load than the machine we
> are talking about.
>
> Looking at the vmstat block io the typical load (both write and read) seems
> to be less than 20 blocks per second. Will this drop the performance of the
> array (measured by dd if=/dev/md<x> of=/dev/null bs=1M) below 3MB/secs?
>

You clearly have problems more significant than your raid choice, but
hopefully you will find the below informative anyway.

====

The above is a meaningless performance tuning test for an email server,
but assuming it was a useful test for you:

With bs=1MB you should have optimum performance with a 3-disk raid5
and 512KB chunks.

The reason is that a full raid stripe for that is 1MB of data (512K data +
512K data + 512K parity, i.e. 1024K of data per stripe).

So the raid software should see that as a full stripe update and not
have to read in any of the old data.

Thus at the kernel level it is just:

write data1 chunk
write data2 chunk
write parity chunk

All those should happen in parallel, so a raid 5 setup for 1MB writes
is actually just about optimal!

Anything smaller than a 1 stripe write is where the issues occur,
because then you have the read-modify-write cycles.

(And yes, the linux mdraid layer recognizes full stripe writes and
thus skips the read-modify portion of the process.)
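
To make the chunk/stripe arithmetic concrete, a small sketch with the
numbers from this array (the helper name is just for illustration):

#!/usr/bin/env python3
# Sketch: raid5 stripe geometry for the array in this thread (3 disks,
# 512K chunks) and a check for whether a write can take the full-stripe
# path instead of read-modify-write.
CHUNK_KB = 512
DISKS = 3
STRIPE_KB = CHUNK_KB * (DISKS - 1)   # data per stripe: 1024K here

def full_stripe_write(offset_kb, length_kb):
    # Only stripe-aligned, whole-stripe writes avoid reading old data.
    return offset_kb % STRIPE_KB == 0 and length_kb % STRIPE_KB == 0

print("data per stripe: %dK" % STRIPE_KB)
print("1M write at offset 0:", full_stripe_write(0, 1024))   # True: no RMW
print("4K write at offset 4K:", full_stripe_write(4, 4))     # False: RMW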

>> ie.
>> write 4K from userspace
>>
>> Kernel
>> Read old primary data, wait for data to actually arrive
>> Read old parity data, wait again
>> modify both for new data
>> write primary data to drive queue
>> write parity data to drive queue
>
> What if I (theoratically) change the chunksize to 4kb? (I can try that in
> the new server...).

4KB random writes are really just too small for an efficient raid 5
setup. Since that's your real workload, I'd get away from raid 5.

If you really want to optimize a 3-disk raid-5 for random 4K writes,
you need to drop down to 2K chunks which gives you a 4K stripe. I've
never seen chunks that small used, so I have no idea how it would
work.

===> fyi: If reliability is one of the things pushing you away from raid-1

A 2 disk raid-1 is more reliable than a 3-disk raid-5.

The math is: assume each of your drives has a one in 1000 chance of
dying on a specific day.

So a raid-1 has a 1 in a million chance of a dual failure on that same
specific day.

And a raid-5 would have 3 in a million chances of a dual failure on
that same specific day. ie. drive 1 and 2 can fail that day, or 1 and
3, or 2 and 3.

So a 2 drive raid-1 is 3 times as reliable as a 3-drive raid-5.
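
The same counting argument as a throwaway sketch (the per-day failure
probability is an assumed illustrative number, not real drive data):

#!/usr/bin/env python3
# Sketch of the same-day dual-failure comparison above. The per-drive,
# per-day failure probability is an assumed illustrative number.
from itertools import combinations

p = 1.0 / 1000

raid1_2disk = p * p                                         # both mirrors die
raid5_3disk = len(list(combinations(range(3), 2))) * p * p  # any 2 of 3 die

print("2-disk raid1 dual failure: %.0e per day" % raid1_2disk)
print("3-disk raid5 dual failure: %.0e per day" % raid5_3disk)
print("raid5 is %.0fx more likely to lose data that day"
      % (raid5_3disk / raid1_2disk))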

If raid-1 still makes you uncomfortable, then go with a 3-disk mirror
(raid 1 or raid 10 depending on what you need.)

You can get 2TB sata drives now for about $100 on sale, so you could
do a 2 TB 3-disk raid-1 for $300. Not a bad price at all in my
opinion.

fyi: I don't know if "enterprise" drives cost more or not. But it is
important you use those in a raid setup. The reason being normal
desktop drives have retry logic built into the drive that can take
from 30 to 120 seconds. Enterprise drives have fast fail logic that
allows a media error to rapidly be reported back to the kernel so that
it can read that data from the alternate drives available in a raid.

> Jaap

Greg


--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
CNN/TruTV Aired Forensic Imaging Demo -
   http://insession.blogs.cnn.com/2010/03/23/how-computer-evidence-gets-retrieved/

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

2010-12-24 10:45:06

by Rogier Wolff

[permalink] [raw]
Subject: Re: Slow disks.

On Thu, Dec 23, 2010 at 12:47:34PM -0500, Jeff Moyer wrote:
> Rogier Wolff <[email protected]> writes:
>
> > On Thu, Dec 23, 2010 at 09:40:54AM -0500, Jeff Moyer wrote:
> >> > In my performance calculations, 10ms average seek (should be around
> >> > 7), 4ms average rotational latency for a total of 14ms. This would
> >> > degrade for read-modify-write to 10+4+8 = 22ms. Still 10 times better
> >> > than what we observe: service times on the order of 200-300ms.
> >>
> >> I didn't say it would account for all of your degradation, just that it
> >> could affect performance. I'm sorry if I wasn't clear on that.
> >
> > We can live with a "2x performance degradation" due to stupid
> > configuration. But not with the 10x -30x that we're seeing now.
>
> Wow. I'm not willing to give up any performance due to
> misconfiguration!

Suppose you have a hard-to-reach server somewhere. Suppose that you
find out that the <whatever> card could perform 15% better if you put
it in a different slot. Would you go and dig the server out to fix
this if you know the performance now will be adequate for the next few
years? Isn't it acceptable to keep things like this until a next
scheduled (or unscheduled) maintenance?

In reality I have two servers with 8T of RAID storage each. Shuffling
all the important data around on these while trying to get exactly
optimal performance out of the storage systems is very time-consuming.
Also, each "move the data out of the way, reconfigure the RAID, move
the data back" cycle incurs a risk of losing or corrupting the data.

I prefer concentrating on the most important part. In this case we
have a 30-fold performance problem. If there is a 15-fold one and a
2-fold one, then I'll settle for looking into and hopefully fixing the
15-fold one, and I'll discard the 2-fold one for the time being: not
important enough to look into. The machine happens to have a 30-fold
performance margin; it can keep up with what it has to do even with the
30-fold slower disks. However, work comes in batches, so the queue grows
significantly during a higher-workload period.

> >> > > md1 : active raid5 sda2[0] sdd2[3](S) sdb2[1] sdc2[4]
> >> >> > 39067648 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3]
> >> >> > [UUU]
> >> >>
> >> >> A 512KB raid5 chunk with 4KB I/Os? That is a recipe for inefficiency.
> >> >> Again, blktrace data would be helpful.
> >> >
> >> > Where did you get the 4kb IOs from? You mean from the iostat -x
> >> > output?
> >>
> >> Yes, since that's all I have to go on at the moment.
> >>
> >> > The system/filesystem decided to do those small IOs. With the
> >> > throughput we're getting on the filesystem, it better not try to write
> >> > larger chuncks...
> >>
> >> Your logic is a bit flawed, for so many reasons I'm not even going to
> >> try to enumerate them here. Anyway, I'll continue to sound like a
> >> broken record and ask for blktrace data.
> >
> > Here it is.
> >
> > http://prive.bitwizard.nl/blktrace.log
> >
> > I can't read those yet... Manual is unclear.
>
> OK, I should have made it clear that I wanted the binary logs. No
> matter, we'll work with what you've sent.
>
> > My friend confessed to me today that he determined the "optimal" RAID
> > block size with the exact same test as I had done, and reached the
> > same conclusion. So that explains his raid blocksize of 512k.
> >
> > The system is a mailserver running on a raid on three of the disks.
> > most of the IOs are generated by the mail server software through the
> > FS driver, and the raid system. It's not that we're running a database
> > that inherently requires 4k IOs. Apparently what the
> > system needs are those small IOs.
>
> The log shows a lot of write barriers:
>
> 8,32 0 1183 169.033279975 778 A WBS 481958 + 2 <- (8,34) 8
^^^
>
> On pre-2.6.37 kernels, that will fully flush the device queue, which is
> why you're seeing such a small queue depth. There was also a CFQ patch
> that sped up fsync performance for small files that landed in .37. I
> can't remember if you ran with a 2.6.37-rc or not. Have you? It may be
> in your best interest to give the latest -rc a try and report back.

It is a production system. Whether my friend is willing to run a
prerelease kernel there remains to be seen.

On the other hand, if this were a MAJOR performance bottleneck it
wouldn't be on the "list of things to fix in December 2010", but
would've been fixed years ago.

Jeff, can you tell me where in that blktrace output I can see the
system noticing "we need to read block XXX from the disk", then that
request getting queued, next it being submitted to the hardware, and
eventually the hardware reporting back: I got block XXX from the media,
here it is. Can you point these events out in the logfile for me? (For
any single transaction that belongs together?)

It would be useful to see the XXX numbers (for things like block
device optimizers) and the timestamps (for us to debug this problem
today.) I strongly suspect that both are logged, right?

Roger.


--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2010-12-24 11:40:12

by Rogier Wolff

[permalink] [raw]
Subject: Re: Slow disks.

On Thu, Dec 23, 2010 at 05:09:43PM -0500, Greg Freemyer wrote:
> On Thu, Dec 23, 2010 at 2:10 PM, Jaap Crezee <[email protected]> wrote:
> > On 12/23/10 19:51, Greg Freemyer wrote:
> >> On Thu, Dec 23, 2010 at 12:47 PM, Jeff Moyer<[email protected]> wrote:
> >> I suspect a mailserver on a raid 5 with large chunksize could be a lot
> >> worse than 2x slower. But most of the blame is just raid 5.
> >
> > Hmmm, well if this really is so.. I use raid 5 to not "spoil" the storage
> > space of one disk. I am using some other servers with raid 5 md's which
> > seems to be running just fine; even under higher load than the machine we
> > are talking about.
> >
> > Looking at the vmstat block io the typical load (both write and read) seems
> > to be less than 20 blocks per second. Will this drop the performance of the
> > array (measured by dd if=/dev/md<x> of=/dev/null bs=1M) below 3MB/secs?
> >
>
> You clearly have problems more significant than your raid choice, but
> hopefully you will find the below informative anyway.
>
> ====
>
> The above is a meaningless performance tuning test for a email server,
> but assuming it was a useful test for you:
>
> With bs=1MB you should have optimum performance with a 3-disk raid5
> and 512KB chunks.
>
> The reason is that a full raid stripe for that is 1MB (512K data +
> 512K data + 512K parity = 1024K data)
>
> So the raid software should see that as a full stripe update and not
> have to read in any of the old data.
>
> Thus at the kernel level it is just:
>
> write data1 chunk
> write data2 chunk
> write parity chunk
>
> All those should happen in parallel, so a raid 5 setup for 1MB writes
> is actually just about optimal!

You are assuming that the kernel is blind and doesn't do any
readahead. I've done some tests, and even when I run dd with a
blocksize of 32k, the average request sizes hitting the disk
are about 1000k (or 1000 sectors; I don't know what units that column
is in when I run iostat with the -k option).

So your argument that "it fits exactly when your blocksize is 1M, so
it is obvious that 512k blocksizes are optimal" doesn't hold water.

When the blocksize is too large, the system will be busy reading and
waiting for one disk, while leaving the second (and third and ...)
disk idle, simply because readahead is finite. You indeed want the
readahead to hit many disks at the same time so that when you get
around to reading the data from the drives they can run at close to
bus speed.

When the block size is too small, you'll spend too much time splitting,
say, a 1M readahead on the MD device into 16 64k chunks for the individual
drives, and then (if that works) merging them back together again for
those drives to prevent the overhead of too many commands to each
drive (for a 4-drive raid5, the first and fourth block are likely to
be consecutive on the same drive...). Hmmm, but those would have to go
into different spots in a buffer, so it might simply have to incur
that extra overhead....
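
A small sketch for reading back the readahead and request-size limits
the kernel is actually using (the sysfs files are the usual ones on
recent kernels; whether they all exist on this machine is an assumption):

#!/usr/bin/env python3
# Sketch: print the readahead and request-size limits the kernel is
# using for a device, to compare against the raid chunk size.
# The sysfs files below exist on recent kernels; their presence on any
# particular system is an assumption.
import sys

def show(dev):
    base = "/sys/block/%s/queue/" % dev
    for name in ("read_ahead_kb", "max_sectors_kb", "max_hw_sectors_kb"):
        try:
            value = open(base + name).read().strip()
        except IOError:
            value = "n/a"
        print("%s %s: %s" % (dev, name, value))

for dev in sys.argv[1:] or ["md1", "sda", "sdb", "sdc"]:
    show(dev)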

> Anything smaller than a 1 stripe write is where the issues occur,
> because then you have the read-modify-write cycles.

Yes. But still they shouldn't be as heavy as we are seeing. Besides
doing the "big searches" on my 8T array, I also sometimes write "lots
of small files". I'll see how many I can manage on that server....

Roger.

>
> (And yes, the linux mdraid layer recognizes full stripe writes and
> thus skips the read-modify portion of the process.)
>
> >> ie.
> >> write 4K from userspace
> >>
> >> Kernel
> >> Read old primary data, wait for data to actually arrive
> >> Read old parity data, wait again
> >> modify both for new data
> >> write primary data to drive queue
> >> write parity data to drive queue
> >
> > What if I (theoratically) change the chunksize to 4kb? (I can try that in
> > the new server...).
>
> 4KB random writes is really just too small for an efficient raid 5
> setup. Since that's your real workload, I'd get away from raid 5.
>
> If you really want to optimize a 3-disk raid-5 for random 4K writes,
> you need to drop down to 2K chunks which gives you a 4K stripe. I've
> never seen chunks that small used, so I have no idea how it would
> work.
>
> ===> fyi: If reliability is one of the things pushing you away from raid-1
>
> A 2 disk raid-1 is more reliable than a 3-disk raid-5.
>
> The math is, assume each of your drives has a one in 1000 chance of
> dieing on a specific day.
>
> So a raid-1 has a 1 in a million chance of a dual failure on that same
> specific day.
>
> And a raid-5 would have 3 in a million chances of a dual failure on
> that same specific day. ie. drive 1 and 2 can fail that day, or 1 and
> 3, or 2 and 3.
>
> So a 2 drive raid-1 is 3 times as reliable as a 3-drive raid-5.
>
> If raid-1 still makes you uncomfortable, then go with a 3-disk mirror
> (raid 1 or raid 10 depending on what you need.)
>
> You can get 2TB sata drives now for about $100 on sale, so you could
> do a 2 TB 3-disk raid-1 for $300. Not a bad price at all in my
> opinion.
>
> fyi: I don't know if "enterprise" drives cost more or not. But it is

They do. They cost about twice as much.

> important you use those in a raid setup. The reason being normal
> desktop drives have retry logic built into the drive that can take
> from 30 to 120 seconds. Enterprise drives have fast fail logic that
> allows a media error to rapidly be reported back to the kernel so that
> it can read that data from the alternate drives available in a raid.

You're repeating what WD says about their enterprise drives versus
desktop drives. I'm pretty sure that they believe what they are saying
to be true. And they probably have done tests that support their
theory. But for Linux it simply isn't true.

WD apparently tested their drives with a certain unnamed operating
system. That operating system may wait for up to two minutes for a
drive to report "bad block" or "successfully remapped this block, and
here is your data".

From my experience, it is unlikely that a desktop user will sit behind
his/her workstation for two minutes waiting for the screen to unfreeze
while the drive goes into deep recovery. The reset button will have been
pressed by that time. Both on Linux /and/ that other OS.


Moreover, Linux uses a 30-second timeout. If a drive doesn't respond
within 30 seconds, it will be reset and the request tried again. I don't
think the drive will resume the "deep recovery" procedure where it left
off after a reset-identify-reread cycle; it will start all
over.

The SCSI disks have it all figured out. There you can use standard
commands to set the maximum recovery time. If you set it to "20ms" the
drive can calculate that it has ONE retry option on the next
revolution (or two if it runs at more than xxx RPM) and nothing else.
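
For what it's worth, a sketch that shows both sides on a running
system - the kernel's per-device command timeout and, where the drive
supports it, the SCT error recovery control setting queried via
smartctl (support for SCT ERC on any particular drive is an assumption):

#!/usr/bin/env python3
# Sketch: show the kernel's SCSI command timeout for each disk and ask
# smartctl for the drive's SCT error recovery control (TLER) setting.
# Not every drive supports SCT ERC; the query may simply report that.
import glob
import subprocess

for path in sorted(glob.glob("/sys/block/sd?/device/timeout")):
    dev = path.split("/")[3]
    print("%s kernel command timeout: %s seconds"
          % (dev, open(path).read().strip()))
    subprocess.call(["smartctl", "-l", "scterc", "/dev/" + dev])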

WD claims a RAID array might quickly switch to a different drive if it
knows the block cannot be read from one drive. This is true. But at
least for Linux software raid, the drive will immediately be bumped
from the array, and never be used/read/written again until it is
replaced.

Now they might have a point there. For a drive with a limited number
of bad blocks, it might be MUCH better to mark the drive as "in desperate
need of replacement" instead of "failed". One thing you can do to help
the drive is to rewrite the bad sectors with the recalculated
data. The drive can then remap the sectors.

Much too often we see RAID arrays that lose a drive, evict it from the
RAID, and everything keeps on working, so nobody wakes up. Only after a
second drive fails do things stop working and the data recovery company
gets called into action. Often we then have a drive with a few bad blocks
and months-old data, and a totally failed drive which is necessary for
a full recovery. It's much better to keep the failed/failing drive in
the array and up-to-date during the time that you're pushing the
operator to get it replaced.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2010-12-24 13:01:43

by Krzysztof Halasa

[permalink] [raw]
Subject: Re: Slow disks.

Rogier Wolff <[email protected]> writes:

> ata6.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 0/32)
> scsi 5:0:0:0: Direct-Access ATA WDC WD10EARS-00Y 80.0 PQ: 0 ANSI: 5
> sd 2:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/931
> GiB)

WD10EARS are "green" drives, 5400 rpm. They aren't exactly designed for
speed. Never used them, though.

No NCQ? Sil 3114 doesn't support it, of course. ICH7 (without letters)
doesn't do AHCI either IIRC.

> 9 Power_On_Hours 0x0032 091 091 000 Old_age
> Always - 7189
> 193 Load_Cycle_Count 0x0032 164 164 000 Old_age Always
> - 109955

Hmm, some aggressive power saving? May reduce performance significantly.
I'd disable all this "green" crap first.

> Where it seems that WD simply says not to use these drives in a RAID.

That smells like "don't use them in any serious application".

I'd start with a RAID-1 or RAID-10 if possible.
--
Krzysztof Halasa

2010-12-24 15:24:43

by Michael Tokarev

[permalink] [raw]
Subject: Re: Slow disks.

24.12.2010 16:01, Krzysztof Halasa wrote:
> Rogier Wolff <[email protected]> writes:
>
>> ata6.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 0/32)
>> scsi 5:0:0:0: Direct-Access ATA WDC WD10EARS-00Y 80.0 PQ: 0 ANSI: 5
>> sd 2:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/931
>> GiB)
>
> WD10EARS are "green" drives, 5400 rpm. They aren't designed exactly for
> speed. Never used them, though.

Oh.. The famous EARS drives. I missed this info in the
start of the thread.

Now, after this info, the whole thread is quite moot.

The thing is, for these WD*EARS drives, it is _vital_ to get
proper alignment of all partitions and operations. They have
4KB sectors physically but report 512-byte sectors to the OS.
It is _essential_ to ensure all partitions are aligned to the
4KB sectors. Be it LVM, raid-something, etc - each filesystem
must start at a 4KB boundary at least, or else you'll see
_dramatic_ write speed problems.

So.. check the whole storage stack and ensure proper alignment
everywhere. In particular, check that your partitions do not start
at sector 63, or at any other start sector that is not a multiple
of 8 - the most problematic mode for these drives.
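
A minimal sketch for that check, assuming the usual sysfs layout where
each partition's start sector is exposed under
/sys/block/<disk>/<partition>/start:

#!/usr/bin/env python3
# Sketch: check whether each partition starts on a 4KB boundary, i.e.
# whether its start sector is a multiple of 8 (512-byte sectors).
# Assumes the usual sysfs layout for partition start sectors.
import glob

for path in sorted(glob.glob("/sys/block/sd?/sd?*/start")):
    part = path.split("/")[4]
    start = int(open(path).read())
    state = "aligned" if start % 8 == 0 else "MISALIGNED"
    print("%s starts at sector %d: %s" % (part, start, state))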

And before anyone asks, no, these drives are actually very
good. With proper alignment they are very fast for both
reads and writes, despite being 5400 RPM. I have a 2TB
drive from this series (WD20EARS) - despite numerous claims
that it does not work or is very slow with small files,
it's quite fast, faster than many previous-generation 7200 RPM drives.

> No NCQ? Sil 3114 doesn't support of course. ICH7 (without letters)
> doesn't do AHCI either IIRC.

Yes, Sil 3114 does not support NCQ. But these drives do not
have a good NCQ implementation either - apparently the read-modify-write
logic has eaten the NCQ performance which was traditionally good in recent
WD drives.

>> 9 Power_On_Hours 0x0032 091 091 000 Old_age
>> Always - 7189
>> 193 Load_Cycle_Count 0x0032 164 164 000 Old_age Always
>> - 109955
>
> Hmm, some agressive power savings? May reduce performance significantly.
> I'd disable all this "green" crap first.

There's a utility (MS-DOS based) on WD's website to disable this
feature for the EARS drives.

>> Where it seems that WD simply says not to use these drives in a RAID.
>
> That smells like "don't use them in any serious application".

No, this is about TLER. "Desktop" drives like this will keep trying to
re-read data in case of error, and if that does not work the
raid code will most likely declare the drive dead and kick
it off the array. Drives which are supposed to work in a RAID
config have configurable timeouts/retries, so that the RAID
code will be able to take care of read errors.

/mjt

2010-12-24 20:58:52

by Krzysztof Halasa

[permalink] [raw]
Subject: Re: Slow disks.

Michael Tokarev <[email protected]> writes:

> No, this is about TLER. The "desktop" drives like this will try
> re-read data in case of error, and if that does not work the
> raid code will most likely declare the drive's dead and kick
> it off the array. Drives which are supposed to work in RAID
> config has configurable timeouts/retries, so that the RAID
> code will be able to take care of read errors.

Assuming the controller won't kick the drive off anyway, and it most
probably will.

In this case, no drives fail, so it doesn't matter.
Guess we should then see the partition table(s).
--
Krzysztof Halasa

2010-12-25 12:14:13

by Rogier Wolff

[permalink] [raw]
Subject: Re: Slow disks.

On Fri, Dec 24, 2010 at 06:24:38PM +0300, Michael Tokarev wrote:
> 24.12.2010 16:01, Krzysztof Halasa wrote:
> > Rogier Wolff <[email protected]> writes:
> >
> >> ata6.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 0/32)
> >> scsi 5:0:0:0: Direct-Access ATA WDC WD10EARS-00Y 80.0 PQ: 0 ANSI: 5
> >> sd 2:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/931
> >> GiB)
> >
> > WD10EARS are "green" drives, 5400 rpm. They aren't designed exactly for
> > speed. Never used them, though.
>
> Oh.. The famous EARS drives. I missed this info in the
> start of the thread.
>
> Now, after this info, the whole thread is quite moot.

No, it is not.

We're seeing a performance issue roughly 10-fold beyond what is to be
expected from unaligned accesses.

> The thing is, for these WD*EARS drivers, it is _vital_ to get
> proper alignment of all partitions and operations. They've
> 4Kb sectors physically but report 512bytes sectors to the OS.
> It is _essential_ to ensure all partitions are aligned to the
> 4Kb sectors. Be it LVM, raid-something, etc - each filesystem
> must start at a 4kb boundary at least, or else you'll see
> _dramatic_ write speed problems.

Out of four partitions on the drive, two are aligned, two are not.

The two that are not are the boot partition (/boot), which never gets
written to, and the swap partition. (The machine has
enough RAM to function perfectly without swap; we only noticed a
missing "mkswap" on the swap partition after it had been running
without swap for weeks, while we were diagnosing the performance
problems.) So, apart from the fact that this was "lucky" to come out
this way, it /is/ correctly configured.

> So.. check the whole storage stack and ensure proper alignment
> everywhere. In particular, check that your partitions are not
> aligned to 63 sectors (512b), or starts at N+1 sector - the
> most problematic mode for these drives.

The important ones (the used ones) ARE correctly aligned. (I surely hope
that RAID5 doesn't grab the first 1k and align everything else to (4n +
1)k ...)

> No, this is about TLER. The "desktop" drives like this will try
> re-read data in case of error, and if that does not work the
> raid code will most likely declare the drive's dead and kick
> it off the array. Drives which are supposed to work in RAID
> config has configurable timeouts/retries, so that the RAID
> code will be able to take care of read errors.

That's WD's story. Under Linux the driver will time out after some
thirty seconds, reset the drive and retry a few times (again
triggering the long retry sequence, but not letting it finish), and
finally report an error. When an error is reported, Linux raid will
kick the drive out of the array and work in degraded mode from then
on.

If you use a RAID drive that DOES report errors quickly, you'll be on
your way maybe 113 seconds quicker, but still with a degraded array.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2010-12-25 12:33:47

by Mikael Abrahamsson

[permalink] [raw]
Subject: Re: Slow disks.

On Sat, 25 Dec 2010, Rogier Wolff wrote:

> If you'd use a RAID drive that DOES report error quickly, you'll be on
> your way maybe 113 seconds quicker, but still with a degraded array.

Read errors will use parity drive and be re-written by the md layer.

Write errors will cause the drive to be kicked.

--
Mikael Abrahamsson email: [email protected]

2010-12-25 18:12:08

by Jaap Crezee

[permalink] [raw]
Subject: Re: Slow disks.

On 12/25/10 13:19, Mikael Abrahamsson wrote:
> On Sat, 25 Dec 2010, Rogier Wolff wrote:
> Read errors will use parity drive and be re-written by the md layer.

Can you show us *that* code? I don't recall seeing this behaviour.... Just one
read error and "bang". Or does the kernel not report the first few read errors
and just continue without telling us?
I don't like that. With a normal disk without raid I would have had some log
telling me the disk had gone bad?

Jaap

2010-12-25 21:28:41

by Michael Tokarev

[permalink] [raw]
Subject: Re: Slow disks.

25.12.2010 21:12, Jaap Crezee wrote:
> On 12/25/10 13:19, Mikael Abrahamsson wrote:
>> On Sat, 25 Dec 2010, Rogier Wolff wrote:
>> Read errors will use parity drive and be re-written by the md layer.
>
> Can you show us *that* code? I don't recall seeing this behaviour....
> Just one read error and "bang". Or does the kernel not report the first
> few read errors and just continues without telling us?
> I don't like that. With a normal disk without raid I would have had some
> log telling me the disk got bad?

The code is in the md driver.

This behaviour has been there for several years. If the drive responds in
time, of course - if it tries to re-read the problematic sector forever and
ignores other commands during this time, it will be kicked off any
array - be it software or hardware.

The logs will be in whatever logfile your kernel messages are logged to.

And yes, the application which reads the "problematic" area is _not_
notified, since the data gets restored from the other drives - if that was
an md array with sufficient redundancy (any non-degraded raid other
than raid0).

/mjt

2010-12-26 21:40:41

by Rogier Wolff

[permalink] [raw]
Subject: Re: Slow disks.

On Sat, Dec 25, 2010 at 01:14:07PM +0100, Rogier Wolff wrote:
> Out of four partitions on the drive, two are aligned, two are not.

Ooops. I looked wrong. Sorry. All four partitions are misaligned.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2010-12-26 22:08:02

by Niels

[permalink] [raw]
Subject: Re: Slow disks.

On Friday 24 December 2010 16:24, Michael Tokarev wrote:

[snip]
> The thing is, for these WD*EARS drivers, it is _vital_ to get
> proper alignment of all partitions and operations. They've
> 4Kb sectors physically but report 512bytes sectors to the OS.
> It is _essential_ to ensure all partitions are aligned to the
> 4Kb sectors. Be it LVM, raid-something, etc - each filesystem
> must start at a 4kb boundary at least, or else you'll see
> _dramatic_ write speed problems.
>
> So.. check the whole storage stack and ensure proper alignment
> everywhere. In particular, check that your partitions are not
> aligned to 63 sectors (512b), or starts at N+1 sector - the
> most problematic mode for these drives.
[snip]

I have several of these drives, in various sizes. They seem to be aligned to
sector 63 as you mention:

Device Boot Start End Blocks Id System
/dev/sda1 63 208844 104391 83 Linux
/dev/sda2 208845 19759949 9775552+ 82 Linux swap / Solaris
/dev/sda3 19759950 215094284 97667167+ 83 Linux
/dev/sda4 215094285 781417664 283161690 83 Linux

Device Boot Start End Blocks Id System
/dev/sdb1 63 1953520064 976760001 83 Linux


Is that bad?

What can I do to repair it?

What can I do to prevent it from happening again?


Thanks,
Niels



2010-12-26 23:05:28

by Greg Freemyer

[permalink] [raw]
Subject: Re: Slow disks.

On Fri, Dec 24, 2010 at 6:40 AM, Rogier Wolff <[email protected]> wrote:
> On Thu, Dec 23, 2010 at 05:09:43PM -0500, Greg Freemyer wrote:
>> On Thu, Dec 23, 2010 at 2:10 PM, Jaap Crezee <[email protected]> wrote:
>> > On 12/23/10 19:51, Greg Freemyer wrote:
>> >> On Thu, Dec 23, 2010 at 12:47 PM, Jeff Moyer<[email protected]> wrote:
>> >> I suspect a mailserver on a raid 5 with large chunksize could be a lot
>> >> worse than 2x slower. But most of the blame is just raid 5.
>> >
>> > Hmmm, well if this really is so.. I use raid 5 to not "spoil" the storage
>> > space of one disk. I am using some other servers with raid 5 md's which
>> > seems to be running just fine; even under higher load than the machine we
>> > are talking about.
>> >
>> > Looking at the vmstat block io the typical load (both write and read) seems
>> > to be less than 20 blocks per second. Will this drop the performance of the
>> > array (measured by dd if=/dev/md<x> of=/dev/null bs=1M) below 3MB/secs?
>> >
>>
>> You clearly have problems more significant than your raid choice, but
>> hopefully you will find the below informative anyway.
>>
>> ====
>>
>> The above is a meaningless performance tuning test for a email server,
>> but assuming it was a useful test for you:
>>
>> With bs=1MB you should have optimum performance with a 3-disk raid5
>> and 512KB chunks.
>>
>> The reason is that a full raid stripe for that is 1MB (512K data +
>> 512K data + 512K parity = 1024K data)
>>
>> So the raid software should see that as a full stripe update and not
>> have to read in any of the old data.
>>
>> Thus at the kernel level it is just:
>>
>> write data1 chunk
>> write data2 chunk
>> write parity chunk
>>
>> All those should happen in parallel, so a raid 5 setup for 1MB writes
>> is actually just about optimal!
>
> You are assuming that the kernel is blind and doesn't do any
> readaheads. I've done some tests and even when I run dd with a
> blocksize of 32k, the average request sizes that are hitting the disk
> are about 1000k (or 1000 sectors I don't know what units that column
> are in when I run with -k option).

dd is not a benchmark tool.

You are building an email server that does 4KB random writes.
Performance testing / tuning with dd is of very limited use.

For your load, read ahead is pretty much useless!


> So your argument that "it fits exactly when your blocksize is 1M, so
> it is obvious that 512k blocksizes are optimal" doesn't hold water.

If you were doing a real I/O benchmark, then 1MB random writes
perfectly aligned to the RAID stripes would be perfect. RAID really
needs to be designed around the I/O pattern, not just around optimizing dd.

<snip>

>> Anything smaller than a 1 stripe write is where the issues occur,
>> because then you have the read-modify-write cycles.
>
> Yes. But still they shouldn't be as heavy as we are seeing. Besides
> doing the "big searches" on my 8T array, I also sometimes write "lots
> of small files". I'll see how many I can mange on that server....

<snip>
>
> You're repeating what WD says about their enterprise drives versus
> desktop drives. I'm pretty sure that they believe what they are saying
> to be true. And they probably have done tests to see support for their
> theory. But for Linux it simply isn't true.

What kernel are you talking about? mdraid has seen major improvements
in this area in the last 2 or 3 years or so. Are you using an old kernel
by chance? Or reading old reviews?

> We see MUCH too often raid arrays that lose a drive evict it from the
> RAID and everything keeps on working, so nobody wakes up. Only after a
> second drive fails, things stop working and the datarecovery company
> gets called into action. Often we have a drive with a few bad blocks
> and months-old data, and a totally failed drive which is neccesary for
> a full recovery. It's much better to keep the failed/failing drive in
> the array and up-to-date during the time that you're pushing the
> operator to get it replaced.
>
>        Roger.

The linux-raid mailing list is very helpful. If you're seeing
problems, ask for help there.

What you're describing simply sounds wrong. (At least for mdraid, which
is what I assume you are using.)

Greg

2010-12-26 23:18:04

by Greg Freemyer

[permalink] [raw]
Subject: Re: Slow disks.

On Sun, Dec 26, 2010 at 4:40 PM, Rogier Wolff <[email protected]> wrote:
> On Sat, Dec 25, 2010 at 01:14:07PM +0100, Rogier Wolff wrote:
>> Out of four partitions on the drive, two are aligned, two are not.
>
> Ooops. I looked wrong. Sorry. All four partitions are mis aligned.
>
>        Roger.

It sounds like you have 3 or 4 separate issues that are each costing
you a factor of 3 or so of pain.

3 * 3 * 3 * 3 = 81 (I don't recall how bad your overall problem was).

I can remember:

Raid 5 - poor choice for 4KB random writes - a factor of 2-4
Misaligned 4KB physical sectors - a factor of 2+
Excessive Head Parking (You show 100K plus head parks in your smart
data) - (who knows. It takes a while to un-park the heads and your
heads are parking way too often.)

I think you said it was 8TB. I know it's a pain, but I'd rebuild. I
don't know how reliable various large drives are currently. In
2008/2009 they seemed to be horrible, and the advice was to use drives
from different vendors, etc., so you would be less likely to have a
batch of drives go bad all at once.

Good Luck
Greg

2010-12-26 23:38:56

by Mark Knecht

[permalink] [raw]
Subject: Re: Slow disks.

On Wed, Dec 22, 2010 at 8:27 AM, Jeff Moyer <[email protected]> wrote:
> Rogier Wolff <[email protected]> writes:
>
>> Unquoted text below is from either me or from my friend.
>>
>>
>> Someone suggested we try an older kernel as if kernel 2.6.32 would not
>> have this problem. We do NOT think it suddenly started with a certain
>> kernel version. I was just hoping to have you kernel-guys help with
>> prodding the kernel into revealing which component was screwing things
>> up....
> [...]
>> ata3.00: ATA-8: WDC WD10EARS-00Y5B1, 80.00A80, max UDMA/133
>
> This is an "Advanced format" drive, which, in this case, means it
> internally has a 4KB sector size and exports a 512byte logical sector
> size.  If your partitions are misaligned, this can cause performance
> problems.
>
>> MDstat:
>>
>> Personalities : [raid1] [raid6] [raid5] [raid4]
>> md125 : active raid5 sdd3[5](S) sdb3[4] sda3[0] sdc3[3]
>>       3903488 blocks super 1.1 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
>>
>> md126 : active raid5 sda4[0] sdd4[3] sdc4[5](S) sdb4[4]
>>       1910063104 blocks super 1.1 level 5, 512k chunk, algorithm 2
>> [3/3] [UUU]
>>
>> md1 : active raid5 sda2[0] sdd2[3](S) sdb2[1] sdc2[4]
>>       39067648 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3]
>> [UUU]
>
> A 512KB raid5 chunk with 4KB I/Os?  That is a recipe for inefficiency.
> Again, blktrace data would be helpful.
>
> Cheers,
> Jeff

I am late to this thread but I'd like to give a +1 to everything Jeff said.

I tried to build many different mdadm RAID configurations using the
WD10EARS. They simply didn't work. There were huge delays, and mdadm
would take drives offline. It was a mess.

First, the drives are 4K-sector, so it's really important to get them
on the right partition boundary. Installing Gentoo requires that we
un-tar some large-ish files. IIRC on a 512-byte boundary these drives
took 20-30 minutes to do this. On a 4K boundary they took about 1
minute. I know that doesn't make sense, but those were my numbers, and
it's a very high-end Intel i7-980x motherboard, so there wasn't any other
problem I could find.

Even when I got that part worked out I found that the drives just
didn't work well in a RAID. The mdadm list led me to believe that the
root cause was the lack of TLER in the firmware. I don't know how to
show that's true or not...

When all that was determined I dropped RAID, and I still ran into the Load
Cycle Count issue that's discussed elsewhere in this thread.

I then bought 5 WD RAID Edition drives and everything works perfectly:
>100MB/s on RAID1 and about 180MB/s on RAID0.

Overall the WD10EARS drives were very disappointing, at least as
shipped. I have 5 of them sitting here unused. I would like to find
some time to try them again one of these days but I don't have high
hopes.

Cheers,
Mark

2010-12-26 23:49:21

by Rogier Wolff

[permalink] [raw]
Subject: Re: Slow disks.

On Sun, Dec 26, 2010 at 06:17:40PM -0500, Greg Freemyer wrote:
> On Sun, Dec 26, 2010 at 4:40 PM, Rogier Wolff <[email protected]> wrote:
> > On Sat, Dec 25, 2010 at 01:14:07PM +0100, Rogier Wolff wrote:
> >> Out of four partitions on the drive, two are aligned, two are not.
> >
> > Ooops. I looked wrong. Sorry. All four partitions are mis aligned.
> >
> >        Roger.
>
> It sounds like you have 3 or 4 separate issues that are each costing
> you a factor of 3 or so of pain.
>
> 3 * 3 * 3 *3 = 81 ( I don't recall how bad your overall problem was).
>
> I can remember:
>
> Raid 5 - poor choice for 4KB random writes - a factor of 2-4
> Misaligned 4KB physical sectors - a factor of 2+
> Excessive Head Parking (You show 100K plus head parks in your smart
> data) - (who knows. It takes a while to un-park the heads and your
> heads are parking way too often.)

Still I think the iostat -x output shows a different story:

On a different server, I have many, many, many files. I started a delete
of some of them two days ago. That server is performing just fine. It
shows that disk IOs are serviced on average in around 5-15ms. Those
are normal numbers.

The "remove many files" operation is also likely to cause a
small-IO workload similar to the mailserver's. On the other hand, there
are Samsung drives in that machine.

You might be right that we have a few issues together that might
compound things. But the "30x" performance difference estimate came
from the service time stats from iostat.
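
Those per-request service times can also be derived without iostat,
straight from /proc/diskstats; a rough sketch (the 10 second sample
window is arbitrary):

#!/usr/bin/env python3
# Sketch: derive an iostat-like average service time per request from
# /proc/diskstats - the delta of the "time spent doing I/O" counter
# divided by the delta of completed reads + writes over the window.
import time

def snapshot():
    stats = {}
    for line in open("/proc/diskstats"):
        f = line.split()
        if len(f) < 14:
            continue
        dev = f[2]
        ios = int(f[3]) + int(f[7])   # reads + writes completed
        io_ms = int(f[12])            # milliseconds spent doing I/O
        stats[dev] = (ios, io_ms)
    return stats

before = snapshot()
time.sleep(10)
after = snapshot()
for dev in sorted(after):
    d_ios = after[dev][0] - before.get(dev, (0, 0))[0]
    d_ms = after[dev][1] - before.get(dev, (0, 0))[1]
    if d_ios:
        print("%s: %d IOs, ~%.1f ms per IO" % (dev, d_ios, float(d_ms) / d_ios))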

Raid5 might cause us an additional performance hit in real life, but
it does not affect the measured service times of the drives.

The head parking (and unparking) can indeed influence service times if,
say, the drive is parking after each IO. On the other hand, when the
kernel shows "100% usage" we might hope the drive doesn't have enough
idle time to start thinking about parking the heads. And the kernel showed
100% usage through periods as long as ten seconds at a time.

> I think you said it was a 8TB, I know its a pain, but I'd rebuild. I
> don't know how reliable various large drives are currently. in
> 2008/2009 they seemed to be horrible and the advice was to use drives
> from different vendors, etc so you would be less likely to have a
> batch of drives go bad all at once.

The Samsungs are holding out really well. I have 8*1T and 4*2T (and a
few 4*1T) running just great. Seagate goofed with some 500Gb and 1T
drive sizes IIRC. You're right that you wouldn't want to have two of them
stop/crash at the same time due to the same bug (chances were around
1% for a 4-drive setup).

Anyway. Others say that mixing drives is bad because the performance
wouldn't match. At the least, the performance bottleneck for the array is
the slowest disk, so you'll almost always end up with worse
performance than when you have all disks the same.



Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2010-12-27 00:27:54

by Rogier Wolff

[permalink] [raw]
Subject: Re: Slow disks.

On Sun, Dec 26, 2010 at 06:05:05PM -0500, Greg Freemyer wrote:
> > You are assuming that the kernel is blind and doesn't do any
> > readaheads. I've done some tests and even when I run dd with a
> > blocksize of 32k, the average request sizes that are hitting the disk
> > are about 1000k (or 1000 sectors I don't know what units that column
> > are in when I run with -k option).
>
> dd is not a benchmark tool.
>
> You are building a email server that does 4KB random writes.
> Performance testing / tuning with dd is of very limited use.
>
> For your load, read ahead is pretty much useless!

Greg, maybe it's wrong for me to tell you things about other systems
while we're discussing one system. But I do want to be able to tell you
that things are definitely different on that other server.

That other server DOES have loads similar to the access pattern that
dd generates. So that's why I benchmarked it that way, and based my
decisions on that benchmark.

It turns out that, barring an easy way to "simulate the workload of a
mail server", my friend benchmarked his raid setup the same way.

This will at least provide the optimal setup for the benchmarked
workload. We all agree that this does not guarantee optimal performance
for the actual workload.

> > So your argument that "it fits exactly when your blocksize is 1M, so
> > it is obvious that 512k blocksizes are optimal" doesn't hold water.
>
> If you were doing a real i/o benchmark, then 1MB random writes
> perfectly aligned to the Raid stripes would be perfect. Raid really
> needs to be designed around the i/o pattern, not just optimizing dd.

Except when "dd" actually models the workload. Which in some cases it
does. Note that "some" doesn't refer to the badly performing
mailserver as you should know.

> >> Anything smaller than a 1 stripe write is where the issues occur,
> >> because then you have the read-modify-write cycles.
> >
> > Yes. But still they shouldn't be as heavy as we are seeing. ?Besides
> > doing the "big searches" on my 8T array, I also sometimes write "lots
> > of small files". I'll see how many I can mange on that server....
>
> <snip>
> >
> > You're repeating what WD says about their enterprise drives versus
> > desktop drives. I'm pretty sure that they believe what they are saying
> > to be true. And they probably have done tests to see support for their
> > theory. But for Linux it simply isn't true.
>
> What kernel are you talking about. mdraid has seen major improvements
> in this area in the last 2 o3 years or so. Are you using a old kernel
> by chance? Or reading old reviews?

OK. You might be right. I haven't had a RAID fail on me in the last few
months. I don't tend to upgrade servers that are performing well. And
the things I can test and notice on file servers are things like
"serving files", not how they behave when a disk dies.

In my friend's case, the server was in production doing its thing. He
doesn't like doing kernel upgrades unless he's near the machine. So
yes, the server could be running something several years old.

However, the issue is NOT that the raid system was badly configured or
could perform a few percent better, but that the disks (on which said
RAID array was running) were performing really badly: according to
"iostat -x", IO requests to the drives in the raid were taking on the
order of 200-300 ms, whereas normal drives service requests on the
order of 5-20ms. Now I wouldn't mind being told that, for example, the
stats from iostat -x are not accurate in such-and-such a case. Fine. We
can then do the measurements in a different way. But in my opinion the
observed slowness of the machine can be explained by the measurements
we see from iostat -x.


If you say that Linux raid has been improved, I'm not sure I prefer
the new behaviour. Whatever a RAID subsystem does, things could be bad
in one situation or another.....

I don't like my system silently rewriting bad sectors on a failing
drive without making noise about the drive getting worse and
worse. I'd like to be informed that I have to swap out the drive. I
have zero tolerance for drives that manage to lose as little as 4096
bits (one sector) of my data..... But maybe it WILL start making noise.
Then things would be good.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2010-12-27 00:34:59

by Rogier Wolff

[permalink] [raw]
Subject: Re: Slow disks.

On Sun, Dec 26, 2010 at 03:38:51PM -0800, Mark Knecht wrote:
> First, the drives are 4K sector so it's really important to get them
> on the right partition boundary. Installing Gentoo requires that we
> un-tar some large-ish files. IIRC on a 512 byte boundary these drives
> took 20-30 minutes to do this. On a 4K boundary they took about 1
> minute. I know that doesn't make sense but those were my numbers, and
> it's a very high end Intel i7-980x MB so there wasn't any other
> problem I could find.

Mark,

Thanks for your contribution. You're seeing a 20-30 fold performance
issue, where only about 2x would be expected. So to test this theory
I'll see if we can evacuate a partition, untar a large-ish file as a
benchmark, and then see if we can improve that by changing the
alignment of the partition. (Hmm. Maybe we should do things the other
way around: first find a tar workload that takes about one minute on a
properly working drive, and then we can wait up to 30 minutes for it
to finish on the improperly aligned setup.)
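
Something along these lines would do as the timing harness (the tarball
and mount point paths are placeholders):

#!/usr/bin/env python3
# Sketch of the untar benchmark described above: time extracting the
# same tarball onto two mount points, one on an aligned partition and
# one on a misaligned one. The paths below are placeholders.
import subprocess
import time

TARBALL = "/tmp/stage3.tar.bz2"                  # placeholder
TARGETS = ["/mnt/aligned", "/mnt/misaligned"]    # placeholders

for target in TARGETS:
    start = time.time()
    subprocess.check_call(["tar", "xjf", TARBALL, "-C", target])
    subprocess.check_call(["sync"])   # include writeback in the timing
    print("%s: %.1f seconds" % (target, time.time() - start))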

Being able to reproduce a problem at will is great progress toward being
able to solve it!

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2010-12-27 03:12:48

by Mark Knecht

[permalink] [raw]
Subject: Re: Slow disks.

On Sun, Dec 26, 2010 at 4:34 PM, Rogier Wolff <[email protected]> wrote:
> On Sun, Dec 26, 2010 at 03:38:51PM -0800, Mark Knecht wrote:
>> First, the drives are 4K sector so it's really important to get them
>> on the right partition boundary. Installing Gentoo requires that we
>> un-tar some large-ish files. IIRC on a 512 byte boundary these drives
>> took 20-30 minutes to do this. On a 4K boundary they took about 1
>> minute. I know that doesn't make sense but those were my numbers, and
>> it's a very high end Intel i7-980x MB so there wasn't any other
>> problem I could find.
>
> Mark,
>
> Thanks for your contribution. You're seeing a 20-30 fold performance
> issue, where only about 2x would be expected. So to test this theory
> I'll see if we can evacuate a partition, untar a large-ish file as a
> benchmark, and then see if we can improve that by changing the
> alignment of the partition. (hmm. Maybe we should do things the other
> way around. First find a tar-workload that takes about 1 minute on a
> properly working drive, and then we can wait up to 30 minutes for it
> to finish on the improper setup.).
>
> Being able to reproduce a problem at will is a great progress to being
> able to solve it!
>
>        Roger.

I hope my memory isn't way off on those numbers. I can certainly
recall that the difference was remarkable and was very easy to
demonstrate.

1) Partition the disk with fdisk placing a partition at sector 63, the
typical first address we're given. Untar a Gentoo Stage 3 release
snapshot such as

http://gentoo.osuosl.org/releases/amd64/current-stage3/stage3-amd64-20101223.tar.bz2

2) Create a second partition starting at sector 1000 (which happens to
be a multiple of 8, hence 4K-aligned) and do it again.

It was day and night.

If it was only the performance difference, that would be one thing, but
the problem I had with these WD10EARS drives was that sometimes they
would just hang for long periods. In a RAID they just got kicked.

The kernel was modern for the time I was building the machines, maybe
7-9 months ago.

Hope this helps,
Mark

2010-12-27 10:56:09

by Tejun Heo

[permalink] [raw]
Subject: Re: Slow disks.

On Sun, Dec 26, 2010 at 11:07:45PM +0100, Niels wrote:
> I have several of these drives, in various sizes. They seem to be aligned to
> 63 as you mention:
>
> Device Boot Start End Blocks Id System
> /dev/sda1 63 208844 104391 83 Linux
> /dev/sda2 208845 19759949 9775552+ 82 Linux swap / Solaris
> /dev/sda3 19759950 215094284 97667167+ 83 Linux
> /dev/sda4 215094285 781417664 283161690 83 Linux
>
> Device Boot Start End Blocks Id System
> /dev/sdb1 63 1953520064 976760001 83 Linux
>
> Is that bad?

Yes.

> What can I do to repair it?

Repartition it.

> What can I do to prevent it from happening again?

Use more recent distros. Most distros released in the past six months
have partition utilities which honor partition alignment requirements.
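
For the table above, a throwaway sketch of what rounding each start
sector up to the next 4KB boundary would look like (illustrative only -
fixing it for real means recreating the partitions and restoring the
data):

#!/usr/bin/env python3
# Sketch: round the start sectors from the fdisk listing above up to
# the next 4KB boundary (a multiple of 8 x 512-byte sectors). This only
# illustrates what "aligned" means; repartitioning still requires
# recreating the partitions and restoring the data.
starts = {"sda1": 63, "sda2": 208845, "sda3": 19759950, "sda4": 215094285}

for part in sorted(starts):
    start = starts[part]
    aligned = (start + 7) // 8 * 8
    status = "already aligned" if start == aligned else "misaligned"
    print("%s: %d -> %d (%s)" % (part, start, aligned, status))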

Good luck.

--
tejun

2010-12-27 18:20:44

by Krzysztof Halasa

[permalink] [raw]
Subject: Re: Slow disks.

Mark Knecht <[email protected]> writes:

> Even when I got that part worked out I found that the drives just
> didn't work well in a RAID. The mdadm list led me to believe that the
> root cause was the lack of TLER in the firmware. I don't know how to
> show that's true or not...

But TLER only matters when the drive can't read a sector. For a normal
drive which can easily read all its sectors (at least without retrying
for several seconds) TLER doesn't matter.

Alignment, sure. Personally I'd use whole disks (not partitions) for
RAID-5 and partition the resulting /dev/md* instead, taking the even
larger effective "sector" size into account.

Or better, use RAID-1 (or RAID-10) with 4 KB block fs, partition
/dev/md* with 4 KB alignment, and avoid all these issues. Disks aren't
that expensive now.
--
Krzysztof Halasa