Hi,
We have two HP ProLiant DL380 G5 servers running with different kernels.
I was comparing basic kernel-compile times. On the one with the 2.6.25.20
kernel, the compilation took ~1.5 minutes. On the one with the 2.6.30.9
kernel, it took ~6 minutes. Both systems use ccache as a build helper.
Then I ran hdparm on both systems; the results are below.
I'd like to help debug this issue through bisection or another method,
but since the two servers differ in more than just the kernel version,
I'm a bit stuck.
Thanks,
Ozan
### 2.6.30.9 (Slow one, compiled with PAE support, FS is ext4) ###
# sync; sleep 2; echo 3 > /proc/sys/vm/drop_caches; hdparm -tT -vvvv
/dev/cciss/c0d0p5
/dev/cciss/c0d0p5:
HDIO_DRIVE_CMD(identify) failed: Invalid exchange
readonly = 0 (off)
readahead = 256 (on)
geometry = 245410/255/32, sectors = 2002550382, start = 4225158
Timing cached reads: 12038 MB in 2.00 seconds = 6027.00 MB/sec
Timing buffered disk reads: 184 MB in 3.00 seconds = 61.31 MB/sec <------ Note the drop here!
# dmesg | grep cciss
[ 0.000000] Kernel command line: root=LABEL=PARDUS_ROOT vga=791
splash=silent quiet resume=/dev/cciss/c0d0p1
[ 6.023542] cciss 0000:18:08.0: PCI INT A -> GSI 19 (level, low) ->
IRQ 19
[ 6.023566] cciss: MSI init failed
[ 6.053008] IRQ 19/cciss0: IRQF_DISABLED is not guaranteed on shared IRQs
[ 6.053015] cciss0: <0x3238> at PCI 0000:18:08.0 IRQ 19 using DAC
[ 6.053918] cciss/c0d0: p1 p2 < p5 >
[ 6.320852] kjournald2 starting: pid 190, dev cciss!c0d0p5:8, commit
interval 5 seconds
[ 6.322344] EXT4-fs: mounted filesystem cciss!c0d0p5 with ordered
data mode
[ 10.994505] EXT4 FS on cciss!c0d0p5, internal journal on cciss!c0d0p5:8
[ 11.783302] Adding 2112508k swap on /dev/cciss/c0d0p1. Priority:-1
extents:1 across:2112508k
[ 16.696090] JBD: barrier-based sync failed on cciss!c0d0p5:8 -
disabling barriers
### 2.6.25.20 (Fast one, no PAE support, FS is ext3) ###
# sync;sleep 2; echo 3 > /proc/sys/vm/drop_caches; hdparm -tT -vvv
/dev/cciss/c0d0p5
/dev/cciss/c0d0p5:
readonly = 0 (off)
readahead = 256 (on)
geometry = 245426/255/32, sectors = 2002678902, start = 4096638
Timing cached reads: 10650 MB in 2.00 seconds = 5334.38 MB/sec
Timing buffered disk reads: 420 MB in 3.01 seconds = 139.72 MB/sec
# dmesg | grep cciss
Kernel command line: root=LABEL=PARDUS_ROOT vga=791 splash=silent quiet
resume=/dev/cciss/c0d0p1
cciss0: <0x3238> at PCI 0000:18:08.0 IRQ 212 using DAC
cciss/c0d0: p1 p2 < p5 >
EXT3 FS on cciss/c0d0p5, internal journal
Adding 2048248k swap on /dev/cciss/c0d0p1. Priority:-1 extents:1
across:2048248k
> -----Original Message-----
> From: Ozan Çağlayan [mailto:[email protected]]
> Sent: Monday, December 07, 2009 4:46 AM
> To: linux-kernel
> Cc: [email protected]; Miller, Mike (OS Dev);
> [email protected]
> Subject: CCISS performance drop in buffered disk reads in
> newer kernels
>
> Hi,
>
> We have two HP ProLiant DL380 G5 servers running with different kernels.
>
> I was comparing basic kernel-compile times. On the one with the
> 2.6.25.20 kernel, the compilation took ~1.5 minutes. On the one with
> the 2.6.30.9 kernel, it took ~6 minutes. Both systems use ccache as a
> build helper.
>
> Then I ran hdparm on both systems; the results are below.
>
> I'd like to help debug this issue through bisection or another method,
> but since the two servers differ in more than just the kernel version,
> I'm a bit stuck.
>
> Thanks,
> Ozan
>
Ozan,
I'm aware of the performance drop. Please see: http://bugzilla.kernel.org/show_bug.cgi?id=13127. I removed the huge read-ahead value of 1024 that we used because users were complaining about small writes being starved. That was back around the 2.6.25 timeframe. Since that timeframe there have been no changes in the main I/O path. I'll get back on this as time allows.
Meanwhile, you can tweak some of the block layer tunables, like so.
echo 64 > /sys/block/cciss\!c0d1/queue/read_ahead_kb
OR
blockdev --setra 128 /dev/cciss/c0d1
These are just example values. There are also max_hw_sectors_kb and max_sectors_kb that can be adjusted.
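For example, to check the current values before changing anything (just a sketch, with c0d1 standing in for whatever logical drive you're tuning):

cat /sys/block/cciss\!c0d1/queue/read_ahead_kb      # current read-ahead, in kB
cat /sys/block/cciss\!c0d1/queue/max_sectors_kb     # current max request size, in kB
cat /sys/block/cciss\!c0d1/queue/max_hw_sectors_kb
echo 256 > /sys/block/cciss\!c0d1/queue/max_sectors_kb   # again, just an example value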
-- mikem
>
> ### 2.6.30.9 (Slow one, compiled with PAE support, FS is ext4) ###
>
> # sync; sleep 2; echo 3 > /proc/sys/vm/drop_caches; hdparm -tT -vvvv
> /dev/cciss/c0d0p5
>
> /dev/cciss/c0d0p5:
> HDIO_DRIVE_CMD(identify) failed: Invalid exchange
> readonly = 0 (off)
> readahead = 256 (on)
> geometry = 245410/255/32, sectors = 2002550382, start = 4225158
> Timing cached reads: 12038 MB in 2.00 seconds = 6027.00 MB/sec
> Timing buffered disk reads: 184 MB in 3.00 seconds = 61.31 MB/sec <------ Note the drop here!
>
> # dmesg | grep cciss
> [ 0.000000] Kernel command line: root=LABEL=PARDUS_ROOT vga=791
> splash=silent quiet resume=/dev/cciss/c0d0p1
> [ 6.023542] cciss 0000:18:08.0: PCI INT A -> GSI 19 (level, low) ->
> IRQ 19
> [ 6.023566] cciss: MSI init failed
> [ 6.053008] IRQ 19/cciss0: IRQF_DISABLED is not guaranteed
> on shared IRQs
> [ 6.053015] cciss0: <0x3238> at PCI 0000:18:08.0 IRQ 19 using DAC
> [ 6.053918] cciss/c0d0: p1 p2 < p5 >
> [ 6.320852] kjournald2 starting: pid 190, dev
> cciss!c0d0p5:8, commit
> interval 5 seconds
> [ 6.322344] EXT4-fs: mounted filesystem cciss!c0d0p5 with ordered
> data mode
> [ 10.994505] EXT4 FS on cciss!c0d0p5, internal journal on
> cciss!c0d0p5:8
> [ 11.783302] Adding 2112508k swap on /dev/cciss/c0d0p1. Priority:-1
> extents:1 across:2112508k
> [ 16.696090] JBD: barrier-based sync failed on cciss!c0d0p5:8 -
> disabling barriers
>
>
> ### 2.6.25.20 (Fast one, no PAE support, FS is ext3) ###
>
> # sync;sleep 2; echo 3 > /proc/sys/vm/drop_caches; hdparm -tT -vvv
> /dev/cciss/c0d0p5
>
> /dev/cciss/c0d0p5:
> readonly = 0 (off)
> readahead = 256 (on)
> geometry = 245426/255/32, sectors = 2002678902, start = 4096638
> Timing cached reads: 10650 MB in 2.00 seconds = 5334.38 MB/sec
> Timing buffered disk reads: 420 MB in 3.01 seconds = 139.72 MB/sec
>
> # dmesg | grep cciss
> Kernel command line: root=LABEL=PARDUS_ROOT vga=791
> splash=silent quiet
> resume=/dev/cciss/c0d0p1
> cciss0: <0x3238> at PCI 0000:18:08.0 IRQ 212 using DAC
> cciss/c0d0: p1 p2 < p5 >
> EXT3 FS on cciss/c0d0p5, internal journal
> Adding 2048248k swap on /dev/cciss/c0d0p1. Priority:-1 extents:1
> across:2048248k
>
>
Miller, Mike (OS Dev) wrote:
> Ozan,
> I'm aware of the performance drop. Please see: http://bugzilla.kernel.org/show_bug.cgi?id=13127. I removed the huge read-ahead value of 1024 that we used because users were complaining about small writes being starved. That was back around the 2.6.25 timeframe. Since that timeframe there have been no changes in the main I/O path. I'll get back on this as time allows.
>
> Meanwhile, you can tweak some of the block layer tunables, like so.
>
> echo 64 > /sys/block/cciss\!c0d1/queue/read_ahead_kb
> OR
> blockdev --setra 128 /dev/cciss/c0d1
>
> These are just example values. There are also max_hw_sectors_kb and max_sectors_kb that can be adjusted.
>
Hi,
Actually the "#define READ_AHEAD 1024" was removed on March 2008 which
was included in the 2.6.25.y tree so 2.6.25.20 has 128kB read_ahead
value too.
*But* setting read_ahead to 2048 increases buffered disk read average
from 60~MB/s to 190~MB/s hence the kernel compile time drops to 2 minutes.
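(Concretely, this is roughly what I ran; note that read_ahead_kb is in
kB while blockdev --setra counts 512-byte sectors, so the two commands
are equivalent:)

echo 2048 > /sys/block/cciss\!c0d0/queue/read_ahead_kb
# or equivalently, since 2048 kB = 4096 sectors:
blockdev --setra 4096 /dev/cciss/c0d0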
So maybe the regression/change is somewhere else?
The server is just a compile farm: it's triggered by hand, compiles the
distribution's packages, and stays idle until the next compilation queue.
Is it safe/OK to use that 2048 kB read_ahead value for such a workload?
(max_hw_sectors_kb is 512 on my 2.6.25.20 setup and 1024 on 2.6.30.9, but
it seems that it's read-only.)
Thanks!
On Mon, Dec 07 2009, Ozan Çağlayan wrote:
> Miller, Mike (OS Dev) wrote:
> > Ozan,
> > I'm aware of the performance drop. Please see: http://bugzilla.kernel.org/show_bug.cgi?id=13127. I removed the huge read-ahead value of 1024 that we used because users were complaining about small writes being starved. That was back around the 2.6.25 timeframe. Since that timeframe there have been no changes in the main I/O path. I'll get back on this as time allows.
> >
> > Meanwhile, you can tweak some of the block layer tunables, like so.
> >
> > echo 64 > /sys/block/cciss\!c0d1/queue/read_ahead_kb
> > OR
> > blockdev --setra 128 /dev/cciss/c0d1
> >
> > These are just example values. There are also max_hw_sectors_kb and max_sectors_kb that can be adjusted.
> >
>
> Hi,
>
> Actually, the "#define READ_AHEAD 1024" was removed in March 2008; that
> change was included in the 2.6.25.y tree, so 2.6.25.20 also has the
> 128 kB read_ahead value.
>
> *But* setting read_ahead to 2048 increases the buffered disk read average
> from ~60 MB/s to ~190 MB/s, and hence the kernel compile time drops to
> ~2 minutes.
>
> So maybe the regression/change is somewhere else?
>
> The server is just a compile farm: it's triggered by hand, compiles the
> distribution's packages, and stays idle until the next compilation queue.
> Is it safe/OK to use that 2048 kB read_ahead value for such a workload?
Yes, it's definitely safe.
> (max_hw_sectors_kb is 512 on my 2.6.25.20 setup and 1024 on 2.6.30.9, but
> it seems that it's read-only.)
The *_hw_* values are the driver-exported hardware limits, so they are
always read-only.
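E.g. (with c0d0 standing in for the logical drive):

cat /sys/block/cciss\!c0d0/queue/max_hw_sectors_kb       # hardware limit, read-only
echo 512 > /sys/block/cciss\!c0d0/queue/max_sectors_kb   # soft limit, writable up to the hw value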
--
Jens Axboe
> -----Original Message-----
> From: Jens Axboe [mailto:[email protected]]
> Sent: Monday, December 07, 2009 12:40 PM
> To: Ozan Çağlayan
> Cc: Miller, Mike (OS Dev); linux-kernel; [email protected]
> Subject: Re: CCISS performance drop in buffered disk reads in
> newer kernels
>
> On Mon, Dec 07 2009, Ozan Çağlayan wrote:
> > Miller, Mike (OS Dev) wrote:
> > > Ozan,
> > > I'm aware of the performance drop. Please see:
> > > http://bugzilla.kernel.org/show_bug.cgi?id=13127. I removed
> > > the huge read-ahead value of 1024 that we used because users
> > > were complaining about small writes being starved. That was
> > > back around the 2.6.25 timeframe. Since that timeframe there
> > > have been no changes in the main I/O path. I'll get back on this
> > > as time allows.
> > >
> > > Meanwhile, you can tweak some of the block layer tunables, like so.
> > >
> > > echo 64 > /sys/block/cciss\!c0d1/queue/read_ahead_kb
> > > OR
> > > blockdev --setra 128 /dev/cciss/c0d1
> > >
> > > These are just example values. There are also
> > > max_hw_sectors_kb and max_sectors_kb that can be adjusted.
> > >
> >
> > Hi,
> >
> > Actually, the "#define READ_AHEAD 1024" was removed in March 2008;
> > that change was included in the 2.6.25.y tree, so 2.6.25.20 also has
> > the 128 kB read_ahead value.
> >
> > *But* setting read_ahead to 2048 increases the buffered disk read
> > average from ~60 MB/s to ~190 MB/s, and hence the kernel compile time
> > drops to ~2 minutes.
> >
> > So maybe the regression/change is somewhere else?
> >
> > The server is just a compile farm: it's triggered by hand, compiles
> > the distribution's packages, and stays idle until the next
> > compilation queue.
> > Is it safe/OK to use that 2048 kB read_ahead value for such a workload?
>
> Yes, it's definitely safe.
I agree.
>
> > (max_hw_sectors_kb is 512 on my 2.6.25.20 setup and 1024 on 2.6.30.9,
> > but it seems that it's read-only.)
>
> The *_hw_* values are the driver-exported hardware limits, so
> they are always read-only.
Ahhh, I didn't know that. There is also an nr_requests attribute which to me implies limiting requests somewhere. The value of nr_requests is 128, but the maximum number of commands the cciss controllers can accept exceeds that value. What is nr_requests supposed to do?
-- mikem
>
> --
> Jens Axboe
>
On Mon, Dec 07 2009, Miller, Mike (OS Dev) wrote:
> > > (max_hw_sectors_kb is 512 on my 2.6.25.20 setup and 1024 on 2.6.30.9,
> > > but it seems that it's read-only.)
> >
> > The *_hw_* values are the driver-exported hardware limits, so
> > they are always read-only.
>
> Ahhh, I didn't know that. There is also an nr_requests attribute which
> to me implies limiting requests somewhere. The value of nr_requests is
> 128, but the maximum number of commands the cciss controllers can accept
> exceeds that value. What is nr_requests supposed to do?
It controls what the block layer queue depth may be. As a rule of thumb,
it should be twice the hardware queue depth. A value of 128 means you
can have at most 128 reads and 128 writes queued in the IO scheduler. In
practice it's a bit more due to request allocation batching.
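E.g. (again with c0d0 standing in for the logical drive):

cat /sys/block/cciss\!c0d0/queue/nr_requests          # current block layer queue depth, 128 by default
echo 256 > /sys/block/cciss\!c0d0/queue/nr_requests   # twice a hardware queue depth of 128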
--
Jens Axboe