As I have already said, I have noticed a strange IDE performance change when
upgrading from 2.6.0 to 2.6.1-rc1.
Now I have more data (and a graph) to show: the test is done using
"hdparm -t /dev/hda" at various readahead settings (from 0 to 512).
o SCRIPT
#!/bin/bash
# Measure sequential read speed with "hdparm -t" at every readahead
# setting from $MIN to $MAX (step 16), averaging two runs per setting.
MIN=0
MAX=512
ra=$MIN
while test $ra -le $MAX; do
	hdparm -a $ra /dev/hda > /dev/null;
	echo -n $ra$'\t';
	s1=`hdparm -t /dev/hda | grep 'Timing' | cut -d'=' -f2 | cut -d' ' -f2`;
	s2=`hdparm -t /dev/hda | grep 'Timing' | cut -d'=' -f2 | cut -d' ' -f2`;
	s=`echo "scale=2; ($s1+$s2)/2" | bc`;
	echo $s;
	ra=$(($ra+16));
done
o RESULTS for 2.6.0 (readahead / speed)
0 13.30
16 13.52
32 13.76
48 31.81
64 31.83
80 31.90
96 31.86
112 31.82
128 31.89
144 31.93
160 31.89
176 31.86
192 31.93
208 31.91
224 31.87
240 31.18
256 26.41
272 27.52
288 31.74
304 27.29
320 27.23
336 25.44
352 27.59
368 27.32
384 31.84
400 28.03
416 28.07
432 20.46
448 28.59
464 28.63
480 23.95
496 27.21
512 22.38
o RESULTS for 2.6.1-rc1 (readahead / speed)
0 13.34
16 25.86
32 26.27
48 24.81
64 26.26
80 24.88
96 27.09
112 24.88
128 26.31
144 24.79
160 26.31
176 24.51
192 25.86
208 24.35
224 26.48
240 24.82
256 26.38
272 24.60
288 31.15
304 24.61
320 26.69
336 24.54
352 26.23
368 24.87
384 25.91
400 25.74
416 26.45
432 23.61
448 26.44
464 24.36
480 26.80
496 24.60
512 26.49
The graph is attached. (x = readahead && y = MB/s)
The kernel config for 2.6.0 is attached (for 2.6.1-rc1 I have just used
"make oldconfig").
INFO on my HD:
/dev/hda:
Model=WDC WD200BB-53AUA1, FwRev=18.20D18, SerialNo=WD-WMA6Y1501425
Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40
BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=39102336
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 *udma4 udma5
AdvancedPM=no WriteCache=enabled
Drive conforms to: device does not report version: 1 2 3 4 5
INFO on my IDE controller:
00:04.1 IDE interface: VIA Technologies, Inc. VT82C586/B/686A/B PIPC Bus
Master IDE (rev 10)
Comments are welcomed.
Bye,
--
Paolo Ornati
Linux v2.4.23
I do not see this behavior, and I'm using the same IDE chipset driver
(though not the same IDE chipset). BTW, readahead for all my other
drives is set to 8192 during these tests, but changing them showed no
effect on my numbers.
/dev/hda:
Model=Maxtor 6Y120P0, FwRev=YAR41VW0, SerialNo=Y40D924E
Config={ Fixed }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=57
BuffType=DualPortCache, BuffSize=7936kB, MaxMultSect=16, MultSect=16
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=240121728
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 udma6
AdvancedPM=yes: disabled (255) WriteCache=enabled
Drive conforms to: (null):
00:07.1 IDE interface: VIA Technologies, Inc.
VT82C586A/B/VT82C686/A/B/VT8233/A/C/VT8235 PIPC Bus Master IDE (rev 06)
It's a VT82C686A.
readahead = 128:
/dev/hda:
Timing buffered disk reads: 130 MB in 3.05 seconds = 42.69 MB/sec

readahead = 256:
/dev/hda:
Timing buffered disk reads: 134 MB in 3.03 seconds = 44.27 MB/sec

readahead = 512:
/dev/hda:
Timing buffered disk reads: 136 MB in 3.00 seconds = 45.33 MB/sec

readahead = 8192:
/dev/hda:
Timing buffered disk reads: 140 MB in 3.03 seconds = 46.24 MB/sec
Note: sometimes, when moving back to a lower readahead, my speed does
not decrease to the values you see here. Throughput on my system
always goes up (on average) with higher readahead values, maxing out at
8192, no matter the buffer size, speed, or position the IDE drive is in.
hdparm -t is difficult to get really accurate, which is why they suggest
running it multiple times. I see differences of 4 MB/s on subsequent
runs without changing anything. Run hdparm -t at least 3-4 times for
each readahead value.
I suggest trying 128, 256, 512 and 8192 as values for readahead and
skipping all those crap numbers in between.
If you still see lower numbers on average at the top end, try nicing
hdparm to -20. Also, update to a newer hdparm (v5.4); you seem to be
using an older one.
On Friday 02 January 2004 19:08, Ed Sweetman wrote:
> [...]
> If you still see lower numbers on average at the top end, try nicing
> hdparm to -20. Also, update to a newer hdparm (v5.4); you seem to be
> using an older one.
>
OK, hdparm updated to v5.4, and this is the new script:
_____________________________________________________________________
#!/bin/bash
# This script assumes hdparm v5.4
NR_TESTS=3
RA_VALUES="64 128 256 8192"
killall5
sync
hdparm -a 0 /dev/hda > /dev/null
hdparm -t /dev/hda > /dev/null
for ra in $RA_VALUES; do
hdparm -a $ra /dev/hda > /dev/null;
echo -n $ra$'\t';
tot=0;
for i in `seq $NR_TESTS`; do
tmp=`nice -n '-20' hdparm -t /dev/hda|grep 'Timing'|tr -d ' '|cut -d'=' -f2|cut -d'M' -f1`;
tot=`echo "scale=2; $tot+$tmp" | bc`;
done;
s=`echo "scale=2; $tot/$NR_TESTS" | bc`;
echo $s;
done
_____________________________________________________________________
The results are like the previous ones.
2.6.0:
64 31.91
128 31.89
256 26.22 # during the transfer HD LED blinks
8192 26.26 # during the transfer HD LED blinks
2.6.1-rc1:
64 25.84 # during the transfer HD LED blinks
128 25.85 # during the transfer HD LED blinks
256 25.90 # during the transfer HD LED blinks
8192 26.42 # during the transfer HD LED blinks
I have tried with and without "nice -n '-20'", but without any visible change.
Performance with 2.4:
with kernel 2.4.23 and readahead = 8 I get 31.89 MB/s...
changing readahead doesn't seem to affect the speed much.
Bye
--
Paolo Ornati
Linux v2.4.23
On Fri, 02 Jan 2004 22:04:27 +0100, Paolo Ornati said:
> The results are like the previous.
>
> 2.6.0:
> 64 31.91
> 128 31.89
> 256 26.22 # during the transfer HD LED blinks
> 8192 26.26 # during the transfer HD LED blinks
>
> 2.6.1-rc1:
> 64 25.84 # during the transfer HD LED blinks
> 128 25.85 # during the transfer HD LED blinks
> 256 25.90 # during the transfer HD LED blinks
> 8192 26.42 # during the transfer HD LED blinks
>
> I have tried with and without "nice -n '-20'" but without any visible changes
Do you get different numbers if you boot with:
elevator=as
elevator=deadline
elevator=cfq (for -mm kernels)
elevator=noop
(You may need to build a kernel with these configured - the symbols are:
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y (-mm kernels only)
and can be selected in the 'General Setup' menu - they should all be
built by default unless you've selected EMBEDDED).
On Fri, Jan 02, 2004 at 10:04:27PM +0100, Paolo Ornati wrote:
> NR_TESTS=3
> RA_VALUES="64 128 256 8192"
Can you add more samples between 128 and 256, maybe at intervals of 32?
Have there been any ide updates in 2.6.1-rc1?
On Fri, 2004-01-02 at 22:32, Mike Fedyk wrote:
> Have there been any ide updates in 2.6.1-rc1?
I see that a readahead patch was applied just before -rc1 was released.
found it in bk-commits-head
Subject: [PATCH] readahead: multiple performance fixes
Message-Id: <[email protected]>
Maybe Paolo can try backing it out.
--
/Martin
Here are some numbers I got on 2.4 and 2.6 with hdparm.
2.4.23-acl-preempt-lowlatency:
0: 47.18 47.18
8: 47.18 47.18
16: 47.18 47.18
32: 47.18 47.18
64: 47.18 47.02
128: 47.18 47.18
2.6.0:
0: 28.68 28.73
8: 28.87 28.76
16: 28.82 28.83
256: 43.77 44.13
512: 24.86 24.86
1024: 26.49
Note: the last number is missing because I used hdparm -a${x}t, which
apparently sets the readahead _after_ the measurement; I had to
compensate for that once I noticed it in the following measurement.
2.6.0 with preempt enabled, now 3 repeats and
hdparm -a${x}t /dev/hda
instead of
hdparm -a${x}tT /dev/hda
0: 28.09 28.11 28.17
128: 41.52 41.44 40.94
256: 41.07 41.39 41.32
512: 24.59 25.04 24.84
1024: 26.49 26.30
2.6.1-rc1 without preempt and corrected script to do the readahead
setting first, anticipatory scheduler:
0: 28.92 28.91 28.49
128: 33.78 33.60 33.62
256: 33.62 33.55 33.60
512: 33.54 33.54 33.41
1024: 33.60 33.60 33.43
2.6.1-rc1, noop scheduler:
0: 28.36 28.86 28.82
128: 33.45 33.50 33.52
256: 33.45 33.51 33.52
512: 33.23 33.51 33.51
1024: 33.52 33.54 33.54
Very interesting tidbit:
with 2.6.1-rc1 and "dd if=/dev/hda of=/dev/null" I get stable 28 MB/s,
but with "cat < /dev/hda > /dev/null" I get 48 MB/s according to "vmstat
5".
oprofile report for 2.6.0, the second run IIRC:
CPU: Athlon, speed 1477.56 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 738778
vma samples % app name symbol name
c01f0b3e 16325 38.9944 vmlinux __copy_to_user_ll
c022fdd7 3600 8.5991 vmlinux ide_outb
c010cb41 3374 8.0592 vmlinux mask_and_ack_8259A
c01117c4 2670 6.3776 vmlinux mark_offset_tsc
00000000 1761 4.2064 hdparm (no symbols)
c022fd9b 1448 3.4587 vmlinux ide_inb
c010ca69 1335 3.1888 vmlinux enable_8259A_irq
c01088b4 862 2.0590 vmlinux irq_entries_start
c0111a45 502 1.1991 vmlinux delay_tsc
c0106a8c 484 1.1561 vmlinux default_idle
c010ca16 414 0.9889 vmlinux disable_8259A_irq
c0109154 257 0.6139 vmlinux apic_timer_interrupt
c022fde2 217 0.5183 vmlinux ide_outbsync
c012fa13 206 0.4921 vmlinux mempool_alloc
c012da03 198 0.4729 vmlinux do_generic_mapping_read
c01471a4 195 0.4658 vmlinux drop_buffers
c013045a 191 0.4562 vmlinux __rmqueue
c0144515 180 0.4300 vmlinux unlock_buffer
c0218999 179 0.4276 vmlinux blk_rq_map_sg
c022fe0a 179 0.4276 vmlinux ide_outl
c013335f 177 0.4228 vmlinux kmem_cache_free
c01307bd 163 0.3893 vmlinux buffered_rmqueue
c0219b09 159 0.3798 vmlinux __make_request
c014613c 157 0.3750 vmlinux block_read_full_page
c0116b39 150 0.3583 vmlinux schedule
c01332ce 149 0.3559 vmlinux kmem_cache_alloc
c0144b2f 136 0.3249 vmlinux end_buffer_async_read
c012fb42 111 0.2651 vmlinux mempool_free
c014741f 110 0.2627 vmlinux init_buffer_head
c01ee88b 109 0.2604 vmlinux radix_tree_insert
c0130108 105 0.2508 vmlinux bad_range
c0133152 98 0.2341 vmlinux free_block
c01071ca 96 0.2293 vmlinux __switch_to
c012d71f 96 0.2293 vmlinux find_get_page
c014597e 93 0.2221 vmlinux create_empty_buffers
c022dd3a 90 0.2150 vmlinux ide_do_request
c01301ba 84 0.2006 vmlinux free_pages_bulk
c01eeab3 84 0.2006 vmlinux radix_tree_delete
c012d421 81 0.1935 vmlinux add_to_page_cache
c013236e 80 0.1911 vmlinux page_cache_readahead
c012d5e8 79 0.1887 vmlinux unlock_page
c011813c 76 0.1815 vmlinux prepare_to_wait
c01308e1 74 0.1768 vmlinux __alloc_pages
c0147512 71 0.1696 vmlinux bio_alloc
c0146f5a 68 0.1624 vmlinux submit_bh
c01ee922 67 0.1600 vmlinux radix_tree_lookup
c01f0729 66 0.1576 vmlinux fast_clear_page
c01306d4 66 0.1576 vmlinux free_hot_cold_page
c022d12c 66 0.1576 vmlinux ide_end_request
c022e349 65 0.1553 vmlinux ide_intr
c0116247 65 0.1553 vmlinux recalc_task_prio
c021e20d 61 0.1457 vmlinux as_merge
c012dd4d 60 0.1433 vmlinux file_read_actor
c012d514 60 0.1433 vmlinux page_waitqueue
c022db45 58 0.1385 vmlinux start_request
c0218800 57 0.1362 vmlinux blk_recount_segments
c01445f6 56 0.1338 vmlinux __set_page_buffers
c0148fc0 56 0.1338 vmlinux max_block
c010a36c 54 0.1290 vmlinux handle_IRQ_event
c021db03 53 0.1266 vmlinux as_move_to_dispatch
c0236409 53 0.1266 vmlinux lba_48_rw_disk
c012d8ee 52 0.1242 vmlinux find_get_pages
c021d4a9 51 0.1218 vmlinux as_update_iohist
c0146f2d 51 0.1218 vmlinux end_bio_bh_io_sync
c021a10c 51 0.1218 vmlinux submit_bio
c021a313 50 0.1194 vmlinux __end_that_request_first
c0116391 49 0.1170 vmlinux try_to_wake_up
c021e1b4 48 0.1147 vmlinux as_queue_empty
c01201fc 48 0.1147 vmlinux del_timer
c01473d5 48 0.1147 vmlinux free_buffer_head
c0219fdf 47 0.1123 vmlinux generic_make_request
c0147259 47 0.1123 vmlinux try_to_free_buffers
c021dc81 46 0.1099 vmlinux as_dispatch_request
c0219439 45 0.1075 vmlinux get_request
c0134795 45 0.1075 vmlinux invalidate_complete_page
c0218ac9 45 0.1075 vmlinux ll_back_merge_fn
c01161f8 43 0.1027 vmlinux effective_prio
c0120006 42 0.1003 vmlinux __mod_timer
c0132fa8 41 0.0979 vmlinux cache_alloc_refill
c0147393 40 0.0955 vmlinux alloc_buffer_head
c01451f5 39 0.0932 vmlinux create_buffers
c0134344 39 0.0932 vmlinux release_pages
c0236118 37 0.0884 vmlinux __ide_do_rw_disk
c021e35d 37 0.0884 vmlinux as_merged_request
c010a5b9 36 0.0860 vmlinux do_IRQ
c0134a52 36 0.0860 vmlinux invalidate_mapping_pages
c0111763 36 0.0860 vmlinux sched_clock
c012d540 36 0.0860 vmlinux wait_on_page_bit
c021909d 35 0.0836 vmlinux blk_run_queues
c021d88c 32 0.0764 vmlinux as_remove_queued_request
c0147c8c 32 0.0764 vmlinux bio_endio
c0217ddc 32 0.0764 vmlinux elv_try_last_merge
c023c716 32 0.0764 vmlinux ide_build_dmatable
00000000 31 0.0740 libc-2.3.2.so (no symbols)
c011d0a0 31 0.0740 vmlinux do_softirq
c014913b 30 0.0717 vmlinux blkdev_get_block
00000000 29 0.0693 ld-2.3.2.so (no symbols)
c0134517 29 0.0693 vmlinux __pagevec_lru_add
c01474cc 29 0.0693 vmlinux bio_destructor
c02311bb 29 0.0693 vmlinux do_rw_taskfile
c0231f7d 29 0.0693 vmlinux ide_cmd_type_parser
c014583c 29 0.0693 vmlinux set_bh_page
c0132227 27 0.0645 vmlinux do_page_cache_readahead
c0219805 27 0.0645 vmlinux drive_stat_acct
c0230a11 27 0.0645 vmlinux ide_execute_command
c01201b2 27 0.0645 vmlinux mod_timer
c021a77f 26 0.0621 vmlinux get_io_context
c011672e 26 0.0621 vmlinux scheduler_tick
00000000 25 0.0597 bash (no symbols)
c0217bec 25 0.0597 vmlinux elv_queue_empty
c012059e 24 0.0573 vmlinux update_one_process
c0231000 23 0.0549 vmlinux SELECT_DRIVE
c023cb24 23 0.0549 vmlinux __ide_dma_read
c021d3a2 23 0.0549 vmlinux as_can_break_anticipation
c0109134 23 0.0549 vmlinux common_interrupt
c0218fb4 23 0.0549 vmlinux generic_unplug_device
c01ee966 22 0.0525 vmlinux __lookup
c012d194 22 0.0525 vmlinux __remove_from_page_cache
c02001ed 22 0.0525 vmlinux add_timer_randomness
c021deef 22 0.0525 vmlinux as_add_request
c021a4dd 22 0.0525 vmlinux end_that_request_last
c012fbc8 22 0.0525 vmlinux mempool_free_slab
c0131f78 22 0.0525 vmlinux read_pages
c02361cd 21 0.0502 vmlinux get_command
c01eef2b 21 0.0502 vmlinux rb_next
c021cf02 20 0.0478 vmlinux as_add_arq_hash
c021e80b 20 0.0478 vmlinux as_set_request
c023c547 20 0.0478 vmlinux ide_build_sglist
c012fbb8 20 0.0478 vmlinux mempool_alloc_slab
c010da6b 20 0.0478 vmlinux timer_interrupt
00000000 19 0.0454 oprofiled26 (no symbols)
c021d740 19 0.0454 vmlinux as_completed_request
c0230231 19 0.0454 vmlinux drive_is_ready
c02302e1 19 0.0454 vmlinux ide_wait_stat
c01341c1 19 0.0454 vmlinux mark_page_accessed
c013055b 19 0.0454 vmlinux rmqueue_bulk
c012069a 19 0.0454 vmlinux run_timer_softirq
c021da9c 18 0.0430 vmlinux as_fifo_expired
c021d95f 18 0.0430 vmlinux as_remove_dispatched_request
c01181d7 17 0.0406 vmlinux finish_wait
c013246e 17 0.0406 vmlinux handle_ra_miss
c021e0f2 16 0.0382 vmlinux as_insert_request
c010ad15 16 0.0382 vmlinux disable_irq_nosync
c01f0785 16 0.0382 vmlinux fast_copy_page
c010a441 16 0.0382 vmlinux note_interrupt
c01ee7aa 16 0.0382 vmlinux radix_tree_preload
c0120817 15 0.0358 vmlinux do_timer
c01087ee 15 0.0358 vmlinux restore_all
c01f04bc 14 0.0334 vmlinux __delay
c021d47e 14 0.0334 vmlinux as_can_anticipate
c021d684 14 0.0334 vmlinux as_update_arq
c0147eec 14 0.0334 vmlinux bio_hw_segments
c011d18d 14 0.0334 vmlinux raise_softirq
c01086c5 14 0.0334 vmlinux ret_from_intr
c01204c6 14 0.0334 vmlinux update_wall_time_one_tick
c011702f 13 0.0311 vmlinux __wake_up_common
c021da16 13 0.0311 vmlinux as_remove_request
c0147ecf 13 0.0311 vmlinux bio_phys_segments
c0218e97 13 0.0311 vmlinux blk_plug_device
c021a7e0 13 0.0311 vmlinux copy_io_context
c0136cdd 13 0.0311 vmlinux copy_page_range
c0106ae1 13 0.0311 vmlinux cpu_idle
c0235577 13 0.0311 vmlinux default_end_request
c010938c 13 0.0311 vmlinux page_fault
c01eef63 13 0.0311 vmlinux rb_prev
c023cc6c 12 0.0287 vmlinux __ide_dma_begin
c021cfd8 12 0.0287 vmlinux as_add_arq_rb
c021d315 12 0.0287 vmlinux as_close_req
c013685b 12 0.0287 vmlinux blk_queue_bounce
c013847d 12 0.0287 vmlinux do_no_page
c0217a19 12 0.0287 vmlinux elv_merge
c0217b01 12 0.0287 vmlinux elv_next_request
c023c4ca 12 0.0287 vmlinux ide_dma_intr
c013041e 12 0.0287 vmlinux prep_new_page
c012d678 11 0.0263 vmlinux __lock_page
c021d178 11 0.0263 vmlinux as_find_next_arq
c01444e5 11 0.0263 vmlinux bh_waitq_head
c021a4af 11 0.0263 vmlinux end_that_request_first
c0108702 11 0.0263 vmlinux need_resched
c01eee3f 11 0.0263 vmlinux rb_erase
c0145878 11 0.0263 vmlinux try_to_release_page
c01444fa 11 0.0263 vmlinux wake_up_buffer
c0144623 10 0.0239 vmlinux __clear_page_buffers
c023cca4 10 0.0239 vmlinux __ide_dma_end
c021d07f 10 0.0239 vmlinux as_choose_req
c021e7c2 10 0.0239 vmlinux as_put_request
c0147155 10 0.0239 vmlinux check_ttfb_buffer
c026493f 10 0.0239 vmlinux i8042_interrupt
c01341ef 10 0.0239 vmlinux lru_cache_add
c014735a 10 0.0239 vmlinux recalc_bh_state
c02198b0 9 0.0215 vmlinux __blk_put_request
c021e1f4 9 0.0215 vmlinux as_latter_request
c010a503 9 0.0215 vmlinux enable_irq
c0231efa 9 0.0215 vmlinux ide_handler_parser
c0124348 9 0.0215 vmlinux notifier_call_chain
c013bd66 9 0.0215 vmlinux page_remove_rmap
c0116fdc 9 0.0215 vmlinux preempt_schedule
c011707a 8 0.0191 vmlinux __wake_up
c012d4e7 8 0.0191 vmlinux add_to_page_cache_lru
c014921e 8 0.0191 vmlinux blkdev_readpage
c0217c9d 8 0.0191 vmlinux elv_completed_request
c0113745 8 0.0191 vmlinux smp_apic_timer_interrupt
c028f61d 8 0.0191 vmlinux sync_buffer
c012056b 8 0.0191 vmlinux update_wall_time
c023cdac 7 0.0167 vmlinux __ide_dma_count
c0132cef 7 0.0167 vmlinux cache_init_objs
c0217c05 7 0.0167 vmlinux elv_latter_request
c0231e74 7 0.0167 vmlinux ide_pre_handler_parser
c023caac 7 0.0167 vmlinux ide_start_dma
c021a6d7 7 0.0167 vmlinux put_io_context
c01eec00 7 0.0167 vmlinux rb_insert_color
c01f0517 6 0.0143 vmlinux __const_udelay
c0130c1c 6 0.0143 vmlinux __pagevec_free
c021d259 6 0.0143 vmlinux as_antic_stop
c021ce97 6 0.0143 vmlinux as_get_io_context
c021dec8 6 0.0143 vmlinux as_next_request
c021cec5 6 0.0143 vmlinux as_remove_merge_hints
c0217bc7 6 0.0143 vmlinux elv_remove_request
c012d7a4 6 0.0143 vmlinux find_lock_page
c0236508 6 0.0143 vmlinux ide_do_rw_disk
c013bcba 6 0.0143 vmlinux page_add_rmap
c013cf5a 6 0.0143 vmlinux shmem_getpage
c013c39c 6 0.0143 vmlinux shmem_swp_alloc
c0131ba4 6 0.0143 vmlinux test_clear_page_dirty
00000000 5 0.0119 ISO8859-1.so (no symbols)
c01eecce 5 0.0119 vmlinux __rb_erase_color
c01f05dc 5 0.0119 vmlinux _mmx_memcpy
c013413c 5 0.0119 vmlinux activate_page
c028f76c 5 0.0119 vmlinux add_event_entry
c011701d 5 0.0119 vmlinux default_wake_function
c021986e 5 0.0119 vmlinux disk_round_stats
c0217a35 5 0.0119 vmlinux elv_merged_request
c010c9e8 5 0.0119 vmlinux end_8259A_irq
c02193b3 5 0.0119 vmlinux freed_request
c011ff72 5 0.0119 vmlinux internal_add_timer
c01eeb7d 5 0.0119 vmlinux radix_tree_node_ctor
c012080d 5 0.0119 vmlinux run_local_timers
00000000 4 0.0096 nmbd (no symbols)
c023cd21 4 0.0096 vmlinux __ide_dma_test_irq
c013320a 4 0.0096 vmlinux cache_flusharray
c022e022 4 0.0096 vmlinux do_ide_request
c0217c4c 4 0.0096 vmlinux elv_set_request
c0139e36 4 0.0096 vmlinux find_vma
c013885e 4 0.0096 vmlinux handle_mm_fault
c0117c4e 4 0.0096 vmlinux io_schedule
c0256be8 4 0.0096 vmlinux uhci_hub_status_data
c0136ff3 4 0.0096 vmlinux zap_pte_range
c0217a99 3 0.0072 vmlinux __elv_add_request
c028f4ed 3 0.0072 vmlinux add_sample_entry
c021cfb8 3 0.0072 vmlinux as_find_first_arq
c0118233 3 0.0072 vmlinux autoremove_wake_function
c0147672 3 0.0072 vmlinux bio_put
c0132d68 3 0.0072 vmlinux cache_grow
c010920c 3 0.0072 vmlinux device_not_available
c0217c71 3 0.0072 vmlinux elv_put_request
c017f280 3 0.0072 vmlinux journal_switch_revoke_table
c0144ce6 3 0.0072 vmlinux mark_buffer_async_read
c0229b32 3 0.0072 vmlinux mdio_read
c0133701 3 0.0072 vmlinux reap_timer_fnc
c01180f8 3 0.0072 vmlinux remove_wait_queue
c010e4eb 3 0.0072 vmlinux restore_fpu
c0259c77 3 0.0072 vmlinux stall_callback
c01087ac 3 0.0072 vmlinux system_call
00000000 2 0.0048 apache (no symbols)
00000000 2 0.0048 cupsd (no symbols)
c015cadc 2 0.0048 vmlinux __mark_inode_dirty
c0131abc 2 0.0048 vmlinux __set_page_dirty_nobuffers
c0200345 2 0.0048 vmlinux add_disk_randomness
c0217e50 2 0.0048 vmlinux clear_queue_congested
c017b4da 2 0.0048 vmlinux do_get_write_access
c01154e7 2 0.0048 vmlinux do_page_fault
c0137b16 2 0.0048 vmlinux do_wp_page
c01555d9 2 0.0048 vmlinux dput
c01442a9 2 0.0048 vmlinux fget_light
c012e044 2 0.0048 vmlinux generic_file_read
c025a39b 2 0.0048 vmlinux hc_state_transitions
c0231f7a 2 0.0048 vmlinux ide_post_handler_parser
c0133391 2 0.0048 vmlinux kfree
c016271f 2 0.0048 vmlinux load_elf_binary
c025a326 2 0.0048 vmlinux ports_active
c013c174 2 0.0048 vmlinux pte_chain_alloc
c013c292 2 0.0048 vmlinux shmem_swp_entry
c0123389 2 0.0048 vmlinux sys_rt_sigprocmask
c0134735 2 0.0048 vmlinux truncate_complete_page
c0202375 2 0.0048 vmlinux tty_write
c015803a 2 0.0048 vmlinux update_atime
c0143527 2 0.0048 vmlinux vfs_read
c0242712 2 0.0048 vmlinux vgacon_cursor
c02436fc 2 0.0048 vmlinux vgacon_scroll
c012645e 2 0.0048 vmlinux worker_thread
c0206a54 2 0.0048 vmlinux write_chan
00000000 1 0.0024 gawk (no symbols)
00000000 1 0.0024 libc-2.3.2.so (no symbols)
00000000 1 0.0024 ls (no symbols)
00000000 1 0.0024 tee (no symbols)
c0145d1c 1 0.0024 vmlinux __block_prepare_write
c01f0b94 1 0.0024 vmlinux __copy_from_user_ll
c01456a2 1 0.0024 vmlinux __find_get_block
c012de1f 1 0.0024 vmlinux __generic_file_aio_read
c0130b8e 1 0.0024 vmlinux __get_free_pages
c017c9b4 1 0.0024 vmlinux __journal_file_buffer
c0134483 1 0.0024 vmlinux __pagevec_release
c0153801 1 0.0024 vmlinux __posix_lock_file
c013c128 1 0.0024 vmlinux __pte_chain_free
c0144ec2 1 0.0024 vmlinux __set_page_dirty_buffers
c015cbde 1 0.0024 vmlinux __sync_single_inode
c028f45a 1 0.0024 vmlinux add_kernel_ctx_switch
c021cf4d 1 0.0024 vmlinux as_find_arq_hash
c0218f2f 1 0.0024 vmlinux blk_remove_plug
c01472f5 1 0.0024 vmlinux block_sync_page
c0146e4d 1 0.0024 vmlinux block_write_full_page
c0228b23 1 0.0024 vmlinux boomerang_interrupt
c014de24 1 0.0024 vmlinux cached_lookup
c01d0de4 1 0.0024 vmlinux cap_bprm_set_security
c01d10d7 1 0.0024 vmlinux cap_vm_enough_memory
c0136b10 1 0.0024 vmlinux clear_page_tables
c01f097f 1 0.0024 vmlinux clear_user
c01189c6 1 0.0024 vmlinux copy_files
c0118527 1 0.0024 vmlinux copy_mm
c014b52b 1 0.0024 vmlinux copy_strings
c0106e05 1 0.0024 vmlinux copy_thread
c014b4ed 1 0.0024 vmlinux count
c0158f04 1 0.0024 vmlinux dnotify_parent
c01382ac 1 0.0024 vmlinux do_anonymous_page
c020ef74 1 0.0024 vmlinux do_con_trol
c011c4f4 1 0.0024 vmlinux do_getitimer
c0152745 1 0.0024 vmlinux do_poll
c0151fbf 1 0.0024 vmlinux do_select
c014347f 1 0.0024 vmlinux do_sync_read
c0133682 1 0.0024 vmlinux drain_array_locked
c0118269 1 0.0024 vmlinux dup_task_struct
c0172d1c 1 0.0024 vmlinux ext3_get_inode_loc
c0171437 1 0.0024 vmlinux ext3_getblk
c016e989 1 0.0024 vmlinux ext3_new_block
c01514cf 1 0.0024 vmlinux fasync_helper
c014426d 1 0.0024 vmlinux fget
c0144342 1 0.0024 vmlinux file_move
c012e30a 1 0.0024 vmlinux filemap_nopage
c013fbab 1 0.0024 vmlinux free_page_and_swap_cache
c0130c70 1 0.0024 vmlinux free_pages
c0157d5e 1 0.0024 vmlinux generic_delete_inode
c014aa08 1 0.0024 vmlinux generic_fillattr
c028f5d5 1 0.0024 vmlinux get_slots
c014868c 1 0.0024 vmlinux get_super
c014dbb0 1 0.0024 vmlinux getname
c01cccfd 1 0.0024 vmlinux grow_ary
c0108471 1 0.0024 vmlinux handle_signal
c0264b57 1 0.0024 vmlinux i8042_timer_func
c0157ff6 1 0.0024 vmlinux inode_times_differ
c01cd0e4 1 0.0024 vmlinux ipc_lock
c017b9b0 1 0.0024 vmlinux journal_get_write_access
c017f8af 1 0.0024 vmlinux journal_write_metadata_buffer
c017f2d8 1 0.0024 vmlinux journal_write_revoke_records
c014e113 1 0.0024 vmlinux link_path_walk
c0162400 1 0.0024 vmlinux load_elf_interp
c0134273 1 0.0024 vmlinux lru_add_drain
c010a13c 1 0.0024 vmlinux math_state_restore
c014edfd 1 0.0024 vmlinux may_open
c01f0887 1 0.0024 vmlinux mmx_clear_page
c014ea19 1 0.0024 vmlinux path_lookup
c014d48b 1 0.0024 vmlinux pipe_poll
c0151e25 1 0.0024 vmlinux poll_freewait
c011aa13 1 0.0024 vmlinux profile_hook
c0136ba4 1 0.0024 vmlinux pte_alloc_map
c01262f4 1 0.0024 vmlinux queue_work
c01ee764 1 0.0024 vmlinux radix_tree_node_alloc
c0120f94 1 0.0024 vmlinux recalc_sigpending
c011a4d8 1 0.0024 vmlinux release_console_sem
c028f589 1 0.0024 vmlinux release_mm
c010be88 1 0.0024 vmlinux release_x86_irqs
c0139160 1 0.0024 vmlinux remove_shared_vm_struct
c01086f8 1 0.0024 vmlinux resume_kernel
c024a6e0 1 0.0024 vmlinux rh_report_status
c01268f1 1 0.0024 vmlinux schedule_work
c0155e2d 1 0.0024 vmlinux select_parent
c0107f3e 1 0.0024 vmlinux setup_sigcontext
c01325f3 1 0.0024 vmlinux slab_destroy
c017ad84 1 0.0024 vmlinux start_this_handle
c01f0938 1 0.0024 vmlinux strncpy_from_user
c01f09d2 1 0.0024 vmlinux strnlen_user
c015d0c2 1 0.0024 vmlinux sync_inodes_sb
c015ce4f 1 0.0024 vmlinux sync_sb_inodes
c01484e9 1 0.0024 vmlinux sync_supers
c0151775 1 0.0024 vmlinux sys_ioctl
c015226a 1 0.0024 vmlinux sys_select
c01371b5 1 0.0024 vmlinux unmap_page_range
c013a061 1 0.0024 vmlinux unmap_vma
c0120655 1 0.0024 vmlinux update_process_times
c015d031 1 0.0024 vmlinux writeback_inodes
HTH,
--
Tobias PGP: http://9ac7e0bc.2ya.com
np: CF-Theme
On Sat, 03 Jan 2004 04:33:28 +0100, Tobias Diedrich <[email protected]> said:
> Very interesting tidbit:
>
> with 2.6.1-rc1 and "dd if=/dev/hda of=/dev/null" I get stable 28 MB/s,
> but with "cat < /dev/hda > /dev/null" I get 48 MB/s according to "vmstat
> 5".
'cat' is probably doing a stat() on stdout and seeing it's connected to /dev/null
and not even bothering to do the write() call. I've seen similar behavior in other
GNU utilities.
On Friday 02 January 2004 22:27, you wrote:
>
> Do you get different numbers if you boot with:
>
> elevator=as
> elevator=deadline
> elevator=cfq (for -mm kernels)
> elevator=noop
>
Changing the I/O scheduler doesn't seem to affect performance much...
AS (the one already used)
> > 2.6.0:
> > 64 31.91
> > 128 31.89
> > 256 26.22 # during the transfer HD LED blinks
> > 8192 26.26 # during the transfer HD LED blinks
> >
> > 2.6.1-rc1:
> > 64 25.84 # during the transfer HD LED blinks
> > 128 25.85 # during the transfer HD LED blinks
> > 256 25.90 # during the transfer HD LED blinks
> > 8192 26.42 # during the transfer HD LED blinks
DEADLINE
2.6.0:
64 31.89
128 31.90
256 26.18
8192 26.22
2.6.1-rc1:
64 25.90
128 26.14
256 26.06
8192 26.45
NOOP
2.6.0:
64 31.90
128 31.76
256 26.05
8192 26.20
2.6.1-rc1:
64 25.91
128 26.23
256 26.16
8192 26.40
Bye
--
Paolo Ornati
Linux v2.4.23
On Friday 02 January 2004 22:32, Mike Fedyk wrote:
> On Fri, Jan 02, 2004 at 10:04:27PM +0100, Paolo Ornati wrote:
> > NR_TESTS=3
> > RA_VALUES="64 128 256 8192"
>
> Can you add more samples between 128 and 256, maybe at intervals of 32?
YES
2.6.0:
128 31.66
160 31.88
192 30.93
224 31.18
256 26.16 # HD LED blinking
2.6.1-rc1:
128 25.91 # HD LED blinking
160 26.00 # HD LED blinking
192 26.06 # HD LED blinking
224 25.94 # HD LED blinking
256 25.96 # HD LED blinking
bye
--
Paolo Ornati
Linux v2.4.23
On Friday 02 January 2004 23:34, you wrote:
> On Fri, 2004-01-02 at 22:32, Mike Fedyk wrote:
> > Have there been any ide updates in 2.6.1-rc1?
>
> I see that a readahead patch was applied just before -rc1 was released.
>
> found it in bk-commits-head
>
> Subject: [PATCH] readahead: multiple performance fixes
> Message-Id: <[email protected]>
>
> Maybe Paolo can try backing it out.
YES, YES, YES...
Reverting the "readahead: multiple performance fixes" patch, performance came
back to the 2.6.0 level.
2.6.0:
64 31.91
128 31.89
256 26.22
8192 26.26
2.6.1-rc1 (readahead patch reverted):
64 31.84
128 31.86
256 25.93
8192 26.16
I know these are only performance in sequential data reads... and real life
is another thing... but I think the author of the patch should be informed
(Ram Pai).
for Ram Pai:
_____________________________________________________________________
My first message:
http://www.ussg.iu.edu/hypermail/linux/kernel/0401.0/0004.html
This thread:
Strange IDE performance change in 2.6.1-rc1 (again)
http://www.ussg.iu.edu/hypermail/linux/kernel/0401.0/0289.html
(look at the graph)
Any comments?
_____________________________________________________________________
Bye
--
Paolo Ornati
Linux v2.4.23
[email protected] wrote:
> 'cat' is probably doing a stat() on stdout and seeing it's connected
> to /dev/null and not even bothering to do the write() call. I've seen
> similar behavior in other GNU utilities.
I can't see any special casing for /dev/null in cat's source, but I
forgot to check dd with bigger block size. It's ok with bs=4096...
--
Tobias PGP: http://9ac7e0bc.2ya.com
This mail is made of 100% recycled bits.
I wrote:
> [email protected] wrote:
>
> > 'cat' is probably doing a stat() on stdout and seeing it's connected
> > to /dev/null and not even bothering to do the write() call. I've seen
> > similar behavior in other GNU utilities.
>
> I can't see any special casing for /dev/null in cat's source, but I
> forgot to check dd with bigger block size. It's ok with bs=4096...
However with 2.4 dd performs fine even with bs=512.
--
Tobias PGP: http://9ac7e0bc.2ya.com
Be vigilant!
np: PHILFUL3
Paolo Ornati <[email protected]> wrote:
>
> I know these are only performance in sequential data reads... and real life
> is another thing... but I think the author of the patch should be informed
> (Ram Pai).
There does seem to be something whacky going on with readahead against
blockdevices. Perhaps it is related to the soft blocksize. I've never
been able to reproduce any of this.
Be aware that buffered reads for blockdevs are treated fairly differently
from buffered reads for regular files: they only use lowmem and we always
attach buffer_heads and perform I/O against them.
No effort was made to optimise buffered blockdev reads because it is not
very important and my main interest was in data coherency and filesystem
metadata consistency.
If you observe the same things reading from regular files then that is more
important.
On Fri, Jan 02, 2004 at 11:15:18PM -0500, [email protected] wrote:
> On Sat, 03 Jan 2004 04:33:28 +0100, Tobias Diedrich <[email protected]> said:
>
> > Very interesting tidbit:
> >
> > with 2.6.1-rc1 and "dd if=/dev/hda of=/dev/null" I get stable 28 MB/s,
> > but with "cat < /dev/hda > /dev/null" I get 48 MB/s according to "vmstat
> > 5".
>
> 'cat' is probably doing a stat() on stdout and seeing it's connected to /dev/null
> and not even bothering to do the write() call. I've seen similar behavior in other
> GNU utilities.
That is unlikely.
However, I have seen some versions of cat check the input
file and, if it is mappable, mmap it instead of using read. Given
that a write to /dev/null returns the count without doing
copy_from_user, the mapped pages never fault, so there is no
disk I/O.
--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]
Remember Cernan and Schmitt
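As an illustration of the mmap path being described, here is a minimal C
sketch; this is hypothetical, not the actual cat source, and it only handles
regular files, whose size fstat() can report:
_____________________________________________________________________
/*
 * mmap_cat.c -- hypothetical sketch of the behaviour described above:
 * mmap the input and hand the mapping straight to write().  /dev/null's
 * write() returns the byte count without ever touching the user buffer,
 * so the mapped pages are never faulted in and no disk I/O happens.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct stat st;
	if (fstat(fd, &st) < 0) {
		perror("fstat");
		return 1;
	}

	/* Map the whole file; pages are pulled from disk only when touched. */
	char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* If stdout is /dev/null, the mapped pages are never read, so they
	 * never fault and the disk stays idle.  (Partial writes are ignored
	 * here for brevity.) */
	if (write(STDOUT_FILENO, p, (size_t)st.st_size) < 0)
		perror("write");

	munmap(p, (size_t)st.st_size);
	close(fd);
	return 0;
}
_____________________________________________________________________
Whether any given cat build actually does this is a separate question; the
sketch only illustrates the mechanism being hypothesized.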
On Saturday 03 January 2004 23:40, Andrew Morton wrote:
> Paolo Ornati <[email protected]> wrote:
> > I know these are only performance in sequential data reads... and real
> > life is another thing... but I think the author of the patch should be
> > informed (Ram Pai).
>
> There does seem to be something whacky going on with readahead against
> blockdevices. Perhaps it is related to the soft blocksize. I've never
> been able to reproduce any of this.
>
> Be aware that buffered reads for blockdevs are treated fairly differently
> from buffered reads for regular files: they only use lowmem and we always
> attach buffer_heads and perform I/O against them.
>
> No effort was made to optimise buffered blockdev reads because it is not
> very important and my main interest was in data coherency and filesystem
> metadata consistency.
>
> If you observe the same things reading from regular files then that is
> more important.
I have done some tests with this stupid script and it seems that you are
right:
_____________________________________________________________________
#!/bin/sh
DEV=/dev/hda7
MOUNT_DIR=mnt
BIG_FILE=$MOUNT_DIR/big_file
mount $DEV $MOUNT_DIR
if [ ! -f $BIG_FILE ]; then
echo "[DD] $BIG_FILE"
dd if=/dev/zero of=$BIG_FILE bs=1M count=1024
umount $MOUNT_DIR
mount $DEV $MOUNT_DIR
fi
killall5
sleep 2
sync
sleep 2
time cat $BIG_FILE > /dev/null
umount $MOUNT_DIR
_____________________________________________________________________
Results for plain 2.6.1-rc1 (A) and 2.6.1-rc1 without Ram Pai's patch (B):
o readahead = 256 (default setting)
(A)
real 0m43.596s
user 0m0.153s
sys 0m5.602s
real 0m42.971s
user 0m0.136s
sys 0m5.571s
real 0m42.888s
user 0m0.137s
sys 0m5.648s
(B)
real 0m43.520s
user 0m0.130s
sys 0m5.615s
real 0m42.930s
user 0m0.154s
sys 0m5.745s
real 0m42.937s
user 0m0.120s
sys 0m5.751s
o readahead = 128
(A)
real 0m35.932s
user 0m0.133s
sys 0m5.926s
real 0m35.925s
user 0m0.146s
sys 0m5.930s
real 0m35.892s
user 0m0.145s
sys 0m5.946s
(B)
real 0m35.957s
user 0m0.136s
sys 0m6.041s
real 0m35.958s
user 0m0.136s
sys 0m5.957s
real 0m35.924s
user 0m0.146s
sys 0m6.069s
o readahead = 64
(A)
real 0m35.284s
user 0m0.137s
sys 0m6.182s
real 0m35.267s
user 0m0.134s
sys 0m6.110s
real 0m35.260s
user 0m0.149s
sys 0m6.003s
(B)
real 0m35.210s
user 0m0.149s
sys 0m6.009s
real 0m35.341s
user 0m0.151s
sys 0m6.119s
real 0m35.151s
user 0m0.144s
sys 0m6.195s
I don't notice any big difference between kernel A and kernel B...
From these tests the best readahead value for my HD seems to be 64... and
the default setting (256) seems just wrong.
With 2.4.23 kernel and readahead = 8 I get results like these:
real 0m40.085s
user 0m0.130s
sys 0m4.560s
real 0m40.058s
user 0m0.090s
sys 0m4.630s
Bye.
--
Paolo Ornati
Linux v2.4.23
On Sat, Jan 03, 2004 at 02:40:03PM -0800, Andrew Morton wrote:
> No effort was made to optimise buffered blockdev reads because it is not
> very important and my main interest was in data coherency and filesystem
> metadata consistency.
Does that mean that blockdev reads will populate the pagecache in 2.6?
Mike Fedyk <[email protected]> wrote:
>
> On Sat, Jan 03, 2004 at 02:40:03PM -0800, Andrew Morton wrote:
> > No effort was made to optimise buffered blockdev reads because it is not
> > very important and my main interest was in data coherency and filesystem
> > metadata consistency.
>
> Does that mean that blockdev reads will populate the pagecache in 2.6?
They have since 2.4.10. The pagecache is the only cacheing entity for file
(and blockdev) data.
On Sun, Jan 04, 2004 at 02:10:30PM -0800, Andrew Morton wrote:
> Mike Fedyk <[email protected]> wrote:
> >
> > On Sat, Jan 03, 2004 at 02:40:03PM -0800, Andrew Morton wrote:
> > > No effort was made to optimise buffered blockdev reads because it is not
> > > very important and my main interest was in data coherency and filesystem
> > > metadata consistency.
> >
> > Does that mean that blockdev reads will populate the pagecache in 2.6?
>
> They have since 2.4.10. The pagecache is the only cacheing entity for file
> (and blockdev) data.
There was a large thread after 2.4.10 was released about speeding up the
boot process by reading the underlying blockdev of the root partition in
block order.
Unfortunately at the time reading the files through the pagecache would
cause a second read of the data even if it was already buffered. I don't
remember the exact details.
Are you saying this is now resolved? And the above optimization will work?
Mike Fedyk <[email protected]> wrote:
>
> On Sun, Jan 04, 2004 at 02:10:30PM -0800, Andrew Morton wrote:
> > Mike Fedyk <[email protected]> wrote:
> > >
> > > On Sat, Jan 03, 2004 at 02:40:03PM -0800, Andrew Morton wrote:
> > > > No effort was made to optimise buffered blockdev reads because it is not
> > > > very important and my main interest was in data coherency and filesystem
> > > > metadata consistency.
> > >
> > > Does that mean that blockdev reads will populate the pagecache in 2.6?
> >
> > They have since 2.4.10. The pagecache is the only cacheing entity for file
> > (and blockdev) data.
>
> There was a large thread after 2.4.10 was released about speeding up the
> boot proces by reading the underlying blockdev of the root partition in
> block order.
>
> Unfortunately at the time reading the files through the pagecache would
> cause a second read of the data even if it was already buffered. I don't
> remember the exact details.
The pagecache is a cache-per-inode. So the cache for a regular file is not
coherent with the cache for /dev/hda1 is not coherent with the cache for
/dev/hda.
> Are you saying this is now resolved? And the above optimization will work?
It will not. And I doubt if it will make much difference anyway. I once
wrote a gizmo which a) generated tables describing pagecache contents
immediately after bootup and b) used that info to prepopulate pagecache
with an optimised seek pattern after boot. It was only worth 10-15%. One
would need an intermediate step which relaid-out the relevant files to get
useful speedups.
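A rough sketch of the "prepopulate the pagecache from a table" idea follows;
the table format, the tool name and the use of posix_fadvise(POSIX_FADV_WILLNEED)
are assumptions made for the sketch, not a description of the gizmo mentioned
above:
_____________________________________________________________________
/*
 * prepopulate.c -- hypothetical sketch: read "path offset length" triples
 * from stdin and ask the kernel to read each range in ahead of time.
 */
#define _XOPEN_SOURCE 600
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char path[4096];
	long long off, len;

	/* one "path offset length" triple per line */
	while (scanf("%4095s %lld %lld", path, &off, &len) == 3) {
		int fd = open(path, O_RDONLY);
		if (fd < 0) {
			perror(path);
			continue;
		}
		/* Hint the kernel to pull this range into the pagecache now. */
		int err = posix_fadvise(fd, (off_t)off, (off_t)len,
					POSIX_FADV_WILLNEED);
		if (err)
			fprintf(stderr, "%s: %s\n", path, strerror(err));
		close(fd);
	}
	return 0;
}
_____________________________________________________________________
Feeding it a list sorted by on-disk location would approximate the "optimised
seek pattern" part of the idea.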
On Sun, Jan 04, 2004 at 03:32:58PM -0800, Andrew Morton wrote:
> Mike Fedyk <[email protected]> wrote:
> >
> > On Sun, Jan 04, 2004 at 02:10:30PM -0800, Andrew Morton wrote:
> > > Mike Fedyk <[email protected]> wrote:
> > > >
> > > > On Sat, Jan 03, 2004 at 02:40:03PM -0800, Andrew Morton wrote:
> > > > > No effort was made to optimise buffered blockdev reads because it is not
> > > > > very important and my main interest was in data coherency and filesystem
> > > > > metadata consistency.
> > > >
> > > > Does that mean that blockdev reads will populate the pagecache in 2.6?
> > >
> > > They have since 2.4.10. The pagecache is the only cacheing entity for file
> > > (and blockdev) data.
> >
> > There was a large thread after 2.4.10 was released about speeding up the
> > boot proces by reading the underlying blockdev of the root partition in
> > block order.
> >
> > Unfortunately at the time reading the files through the pagecache would
> > cause a second read of the data even if it was already buffered. I don't
> > remember the exact details.
>
> The pagecache is a cache-per-inode. So the cache for a regular file is not
> coherent with the cache for /dev/hda1 is not coherent with the cache for
> /dev/hda.
That's what I remember from the old thread. Thanks.
Buffers are attached to a page, and blockdev reads will not save
pagecache reads.
So in what way is the buffer cache coherent with the pagecache?
> > Are you saying this is now resolved? And the above optimization will work?
>
> It will not. And I doubt if it will make much difference anyway. I once
> wrote a gizmo which a) generated tables describing pagecache contents
> immediately after bootup and b) used that info to prepopulate pagecache
> with an optimised seek pattern after boot. It was only worth 10-15%. One
> would need an intermediate step which relaid-out the relevant files to get
> useful speedups.
Any progress on that pagecache coherent block relocation patch you had for
ext3? :)
Mike Fedyk <[email protected]> wrote:
>
> So in what way is the buffer cache coherent with the pagecache?
>
There is no "buffer cache" in Linux. There is a pagecache for /etc/passwd
and there is a pagecache for /dev/hda1. They are treated pretty much
identically. The kernel attaches buffer_heads to those pagecache pages
when needed - generally when it wants to deal with individual disk blocks.
> Any progress on that pagecache coherent block relocation patch you had for
> ext3? :)
No.
Sorry, I was on vacation and could not get back earlier.
I do not know exactly why sequential reads on blockdevices
have regressed. One probable reason is that the same lazy-read
optimization which helps large random reads is regressing the sequential
read performance.
Note: the patch waits until the last page in the current window is being
read before triggering a new readahead. By the time the readahead
request is satisfied, the next sequential read may already have been
requested, so there is some loss of parallelism here. However, given
that large-size random reads are the most common case, this patch attacks
that case.
If you revert just the lazy-read optimization, you might see no
regression for sequential reads.
Let me see if I can verify this,
Ram Pai
On Tuesday 06 January 2004 00:19, you wrote:
> Sorry I was on vacation and could not get back earlier.
>
> I do not exactly know the reason why sequential reads on blockdevices
> has regressed. One probable reason is that the same lazy-read
> optimization which helps large random reads is regressing the sequential
> read performance.
>
> Note: the patch, waits till the last page in the current window is being
> read, before triggering a new readahead. By the time the readahead
> request is satisfied, the next sequential read may already have been
> requested. Hence there is some loss of parallelism here. However given
> that largesize random reads is the most common case; this patch attacks
> that case.
>
> If you revert back just the lazy-read optimization, you might see no
> regression for sequential reads,
I have tried to revert it out:
--- mm/readahead.c.orig 2004-01-07 15:17:00.000000000 +0100
+++ mm/readahead.c.my 2004-01-07 15:33:13.000000000 +0100
@@ -480,7 +480,8 @@
* If we read in earlier we run the risk of wasting
* the ahead window.
*/
- if (ra->ahead_start == 0 && offset == (ra->start + ra->size -1)) {
+ if (ra->ahead_start == 0) {
ra->ahead_start = ra->start + ra->size;
ra->ahead_size = ra->next_size;
but the sequential read performance is still the same!
Reverting the other part of the patch (the part that touches mm/filemap.c)
brings the sequential read performance back to the 2.6.0 level.
I don't know why... but it does.
>
> Let me see if I can verify this,
> Ram Pai
>
Bye
--
Paolo Ornati
Linux v2.4.23
On Wed, 2004-01-07 at 06:59, Paolo Ornati wrote:
> On Tuesday 06 January 2004 00:19, you wrote:
> > Sorry I was on vacation and could not get back earlier.
> >
> > I do not exactly know the reason why sequential reads on blockdevices
> > has regressed. One probable reason is that the same lazy-read
> > optimization which helps large random reads is regressing the sequential
> > read performance.
> >
> > Note: the patch, waits till the last page in the current window is being
> > read, before triggering a new readahead. By the time the readahead
> > request is satisfied, the next sequential read may already have been
> > requested. Hence there is some loss of parallelism here. However given
> > that largesize random reads is the most common case; this patch attacks
> > that case.
> >
> > If you revert back just the lazy-read optimization, you might see no
> > regression for sequential reads,
>
> I have tried to revert it out:
>
> --- mm/readahead.c.orig 2004-01-07 15:17:00.000000000 +0100
> +++ mm/readahead.c.my 2004-01-07 15:33:13.000000000 +0100
> @@ -480,7 +480,8 @@
> * If we read in earlier we run the risk of wasting
> * the ahead window.
> */
> - if (ra->ahead_start == 0 && offset == (ra->start + ra->size -1)) {
> + if (ra->ahead_start == 0) {
> ra->ahead_start = ra->start + ra->size;
> ra->ahead_size = ra->next_size;
>
> but the sequential read performance is still the same !
>
> Reverting out the other part of the patch (that touches mm/filemap.c) the
> sequential read performance comes back like in 2.6.0.
I tried on my lab machine with SCSI disks (I don't have access currently
to a spare machine with IDE disks).
I find that reverting the changes in mm/filemap.c and then reverting the
lazy-read optimization gives much better sequential read performance on
blockdevices. Is this your observation on IDE disks too?
>
> I don't know why... but it does.
Let's see. I think my theory is partly the reason, but the changes in
filemap.c seem to be the bigger influence.
>
> >
> > Let me see if I can verify this,
> > Ram Pai
> >
>
> Bye
On Wednesday 07 January 2004 20:23, Ram Pai wrote:
>
> I tried on my lab machine with scsi disks. (I dont have access currently
> to a spare machine with ide disks.)
>
> I find that reverting the changes in mm/filemap.c and then reverting the
> lazy-read optimization gives much better sequential read performance on
> blockdevices. Is this your observation on IDE disks too?
Yes and no.
I had only tried to revert the lazy-read optimization (without any visible
change), so I reapplied it AND THEN reverted the changes in
mm/filemap.c... and performance came back.
>
> > I don't know why... but it does.
>
> Lets see. I think my theory is partly the reason. But the changes in
> filemap.c seems to be influencing more.
YES, I agree.
I haven't done a lot of tests but it seems to me that the changes in
mm/filemap.c are the only things that influence the sequential read
performance on my disk.
--
Paolo Ornati
Linux v2.4.23
Paolo Ornati <[email protected]> wrote:
>
> I haven't done a lot of tests but it seems to me that the changes in
> mm/filemap.c are the only things that influence the sequential read
> performance on my disk.
The fact that this only happens when reading a blockdev (true?) is a big
hint. Maybe it is because regular files implement ->readpages.
If the below patch makes read throughput worse on regular files too then
that would confirm the idea.
diff -puN mm/readahead.c~a mm/readahead.c
--- 25/mm/readahead.c~a Wed Jan 7 15:56:32 2004
+++ 25-akpm/mm/readahead.c Wed Jan 7 15:56:36 2004
@@ -103,11 +103,6 @@ static int read_pages(struct address_spa
struct pagevec lru_pvec;
int ret = 0;
- if (mapping->a_ops->readpages) {
- ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
- goto out;
- }
-
pagevec_init(&lru_pvec, 0);
for (page_idx = 0; page_idx < nr_pages; page_idx++) {
struct page *page = list_to_page(pages);
_
On Wed, 2004-01-07 at 15:57, Andrew Morton wrote:
> Paolo Ornati <[email protected]> wrote:
> >
> > I haven't done a lot of tests but it seems to me that the changes in
> > mm/filemap.c are the only things that influence the sequential read
> > performance on my disk.
>
> The fact that this only happens when reading a blockdev (true?) is a big
> hint. Maybe it is because regular files implement ->readpages.
>
> If the below patch makes read throughput worse on regular files too then
> that would confirm the idea.
No, the throughput did not worsen with the patch for regular files (on a
SCSI disk). Let's see what Paolo Ornati finds.
It's something to do with the changes in filemap.c.
RP
OK, I did some analysis and found that 'hdparm -t <device>'
generates reads of size 1 MB. This means 256 page requests are
generated by a single read.
do_generic_mapping_read() gets the request to read 256 pages. But with
the latest change, this function calls do_pagecache_readahead() to keep
256 pages ready in the cache, and only after having done that does
do_generic_mapping_read() try to access those 256 pages.
But by then some of the pages may have been replaced under low-pagecache
conditions. Hence we end up spending extra time reading those pages
into the page cache again.
I think the same problem must exist while reading files too. Paolo
Ornati used the cat command to read the file; cat generates just one page
request per read, and hence the problem did not show up. The problem should
show up if 'dd if=big_file of=/dev/null bs=1M count=256' is used.
To conclude, I think the bug is in the changes to filemap.c.
If those changes are reverted, the regression seen with blockdevices should
go away.
Well, this is my theory; somebody should validate it.
RP
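To exercise the read sizes being discussed from userspace, here is a small
sketch (an illustration, not part of any existing tool) of a sequential reader
with a selectable request size: with a 4 KiB request it behaves like cat,
while a 1 MiB request issues the same 256-pages-per-read pattern as hdparm -t
or dd bs=1M.
_____________________________________________________________________
/*
 * bigread.c -- illustrative sketch of a sequential reader with a
 * selectable request size.  Per the analysis above, each read() of N
 * bytes becomes N/4096 page requests inside do_generic_mapping_read(),
 * so a 1 MiB request corresponds to the 256-page case.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = (argc > 1) ? argv[1] : "/dev/hda";
	size_t bs = (argc > 2) ? (size_t)atol(argv[2]) : 1024 * 1024;
	long long limit = 256LL * 1024 * 1024;	/* read 256 MB and stop */
	long long total = 0;
	ssize_t n;

	int fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	char *buf = malloc(bs);
	if (!buf) {
		perror("malloc");
		return 1;
	}

	while (total < limit && (n = read(fd, buf, bs)) > 0)
		total += n;

	printf("read %lld bytes in %zu-byte requests\n", total, bs);
	free(buf);
	close(fd);
	return 0;
}
_____________________________________________________________________
Timing "./bigread big_file 4096" against "./bigread big_file 1048576"
(dropping the cache in between) would exercise the two cases being contrasted,
if the theory is right.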
Ram Pai <[email protected]> wrote:
>
> OK, I did some analysis and found that 'hdparm -t <device>'
> generates reads of size 1 MB. This means 256 page requests are
> generated by a single read.
> [...]
> Well, this is my theory; somebody should validate it.
One megabyte seems like far too little memory to be triggering the effect
which you describe. But yes, the risk is certainly there.
You could verify this with:
--- 25/mm/filemap.c~a Thu Jan 8 17:15:57 2004
+++ 25-akpm/mm/filemap.c Thu Jan 8 17:16:06 2004
@@ -629,8 +629,10 @@ find_page:
handle_ra_miss(mapping, ra, index);
goto no_cached_page;
}
- if (!PageUptodate(page))
+ if (!PageUptodate(page)) {
+ printk("eek!\n");
goto page_not_up_to_date;
+ }
page_ok:
/* If users can be writing to this page using arbitrary
* virtual addresses, take care about potential aliasing
But still, that up-front readahead loop is undesirable and yes, it would be
better if we could go back to the original design in there.
On Thu, 2004-01-08 at 17:17, Andrew Morton wrote:
> Ram Pai <[email protected]> wrote:
> >
> > Well this is my theory, somebody should validate it,
>
> One megabyte seems like far too little memory to be triggering the
> effect which you describe. But yes, the risk is certainly there.
>
> You could verify this with:
>
I cannot exactly reproduce what Paolo Ornati is seeing.
Paolo: please validate the following:
1) see whether you see a regression with regular files, replacing the
cat command in your script with
dd if=big_file of=/dev/null bs=1M count=256
2) and if you do, check whether you see a bunch of 'eek' messages with
Andrew's patch below. (NOTE: without reverting the changes
in filemap.c)
--------------------------------------------------------------------------
--- 25/mm/filemap.c~a Thu Jan 8 17:15:57 2004
+++ 25-akpm/mm/filemap.c Thu Jan 8 17:16:06 2004
@@ -629,8 +629,10 @@ find_page:
handle_ra_miss(mapping, ra, index);
goto no_cached_page;
}
- if (!PageUptodate(page))
+ if (!PageUptodate(page)) {
+ printk("eek!\n");
goto page_not_up_to_date;
+ }
page_ok:
/* If users can be writing to this page using arbitrary
* virtual addresses, take care about potential aliasing
---------------------------------------------------------------------------
Thanks,
RP
Ram Pai <[email protected]> wrote:
>
> 1) see whether you see a regression with files replacing the
> cat command in your script with
> dd if=big_file of=/dev/null bs=1M count=256
You'll need to unmount and remount the fs in between to remove the file
from pagecache. Or use fadvise() to remove the pagecache. There's a
little tool which does that in
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz
On Friday 09 January 2004 20:15, Ram Pai wrote:
> On Thu, 2004-01-08 at 17:17, Andrew Morton wrote:
> > Ram Pai <[email protected]> wrote:
> > > Well this is my theory, somebody should validate it,
> >
> > One megabyte seems like far too little memory to be triggering the
> > effect which you describe. But yes, the risk is certainly there.
> >
> > You could verify this with:
>
> I cannot exactly reproduce what Paolo Ornati is seeing.
>
> Paolo: please validate the following:
>
> 1) see whether you see a regression with regular files, replacing the
> cat command in your script with
> dd if=big_file of=/dev/null bs=1M count=256
>
> 2) and if you do, check whether you see a bunch of 'eek' messages with
> Andrew's patch below. (NOTE: without reverting the changes
> in filemap.c)
>
> --------------------------------------------------------------------------
>
> --- 25/mm/filemap.c~a Thu Jan 8 17:15:57 2004
> +++ 25-akpm/mm/filemap.c Thu Jan 8 17:16:06 2004
> @@ -629,8 +629,10 @@ find_page:
> handle_ra_miss(mapping, ra, index);
> goto no_cached_page;
> }
> - if (!PageUptodate(page))
> + if (!PageUptodate(page)) {
> + printk("eek!\n");
> goto page_not_up_to_date;
> + }
> page_ok:
> /* If users can be writing to this page using arbitrary
> * virtual addresses, take care about potential aliasing
>
> -------------------------------------------------------------------------
Ok, this patch seems to be for the -mm tree... I have applied it by hand (on
a vanilla 2.6.1-rc1).
For my tests I've used this script:
#!/bin/sh
RA_VALS="256 128 64"
FILE="/big_file"
SIZE=`stat -c '%s' $FILE`
NR_TESTS="3"
LINUX=`uname -r`
echo "HD test for Penguin $LINUX"
killall5
sync
sleep 3
for ra in $RA_VALS; do
hdparm -a $ra /dev/hda
for i in `seq $NR_TESTS`; do
echo "_ _ _ _ _ _ _ _ _"
./fadvise $FILE 0 $SIZE dontneed
time dd if=$FILE of=/dev/null bs=1M count=256
done
echo "________________________________"
done
RESULTS (2.6.0 / 2.6.1-rc1)
HD test for Penguin 2.6.0
/dev/hda:
setting fs readahead to 256
readahead = 256 (on)
_ _ _ _ _ _ _ _ _
256+0 records in
256+0 records out
real 0m11.427s
user 0m0.002s
sys 0m1.722s
_ _ _ _ _ _ _ _ _
256+0 records in
256+0 records out
real 0m11.963s
user 0m0.000s
sys 0m1.760s
_ _ _ _ _ _ _ _ _
256+0 records in
256+0 records out
real 0m11.291s
user 0m0.001s
sys 0m1.713s
________________________________
/dev/hda:
setting fs readahead to 128
readahead = 128 (on)
_ _ _ _ _ _ _ _ _
256+0 records in
256+0 records out
real 0m9.910s
user 0m0.003s
sys 0m1.882s
_ _ _ _ _ _ _ _ _
256+0 records in
256+0 records out
real 0m9.693s
user 0m0.003s
sys 0m1.860s
_ _ _ _ _ _ _ _ _
256+0 records in
256+0 records out
real 0m9.733s
user 0m0.004s
sys 0m1.922s
________________________________
/dev/hda:
setting fs readahead to 64
readahead = 64 (on)
_ _ _ _ _ _ _ _ _
256+0 records in
256+0 records out
real 0m9.107s
user 0m0.000s
sys 0m2.026s
_ _ _ _ _ _ _ _ _
256+0 records in
256+0 records out
real 0m9.227s
user 0m0.004s
sys 0m1.984s
_ _ _ _ _ _ _ _ _
256+0 records in
256+0 records out
real 0m9.152s
user 0m0.002s
sys 0m2.013s
________________________________
HD test for Penguin 2.6.1-rc1
/dev/hda:
setting fs readahead to 256
readahead = 256 (on)
_ _ _ _ _ _ _ _ _
256+0 records in
256+0 records out
real 0m11.984s
user 0m0.002s
sys 0m1.751s
_ _ _ _ _ _ _ _ _
256+0 records in
256+0 records out
real 0m11.704s
user 0m0.002s
sys 0m1.766s
_ _ _ _ _ _ _ _ _
256+0 records in
256+0 records out
real 0m11.886s
user 0m0.002s
sys 0m1.731s
________________________________
/dev/hda:
setting fs readahead to 128
readahead = 128 (on)
_ _ _ _ _ _ _ _ _
256+0 records in
256+0 records out
real 0m11.120s
user 0m0.001s
sys 0m1.830s
_ _ _ _ _ _ _ _ _
256+0 records in
256+0 records out
real 0m11.596s
user 0m0.005s
sys 0m1.764s
_ _ _ _ _ _ _ _ _
256+0 records in
256+0 records out
real 0m11.481s
user 0m0.002s
sys 0m1.727s
________________________________
/dev/hda:
setting fs readahead to 64
readahead = 64 (on)
_ _ _ _ _ _ _ _ _
256+0 records in
256+0 records out
real 0m11.361s
user 0m0.006s
sys 0m1.782s
_ _ _ _ _ _ _ _ _
256+0 records in
256+0 records out
real 0m11.655s
user 0m0.002s
sys 0m1.778s
_ _ _ _ _ _ _ _ _
256+0 records in
256+0 records out
real 0m11.369s
user 0m0.004s
sys 0m1.798s
________________________________
As you can see, 2.6.0 performance increases when lowering readahead from 256
to 64 (64 seems to be the best value), while 2.6.1-rc1 performance doesn't
change much.
I noticed that on 2.6.0, with readahead set to 256, the HD LED blinks
during the data transfer, while with lower values (128 / 64) it stays on.
On 2.6.1-rc1, instead, the HD LED blinks with almost any value (I must set it
to 8 to see it stay solidly on).
ANSWERS:
1) YES... I see a regression with files ;-(
2) YES, I also see a bunch of "eek!" (a mountain of "eek!")
Bye
--
Paolo Ornati
Linux v2.4.24
HD test for Penguin 2.6.0-mm1-extents
/dev/hda:
setting fs readahead to 8192
readahead = 8192 (on)
_ _ _ _ _ _ _ _ _
/tester: line 18: ./fadvise: No such file or directory
256+0 records in
256+0 records out
268435456 bytes transferred in 1.098793 seconds (244300323 bytes/sec)
real 0m1.100s
user 0m0.005s
sys 0m1.096s
_ _ _ _ _ _ _ _ _
/tester: line 18: ./fadvise: No such file or directory
256+0 records in
256+0 records out
268435456 bytes transferred in 1.102250 seconds (243534086 bytes/sec)
real 0m1.104s
user 0m0.000s
sys 0m1.104s
_ _ _ _ _ _ _ _ _
/tester: line 18: ./fadvise: No such file or directory
256+0 records in
256+0 records out
268435456 bytes transferred in 1.096914 seconds (244718759 bytes/sec)
real 0m1.098s
user 0m0.001s
sys 0m1.097s
________________________________
/dev/hda:
setting fs readahead to 256
readahead = 256 (on)
_ _ _ _ _ _ _ _ _
/tester: line 18: ./fadvise: No such file or directory
256+0 records in
256+0 records out
268435456 bytes transferred in 1.104646 seconds (243005877 bytes/sec)
real 0m1.106s
user 0m0.001s
sys 0m1.105s
_ _ _ _ _ _ _ _ _
/tester: line 18: ./fadvise: No such file or directory
256+0 records in
256+0 records out
268435456 bytes transferred in 1.100904 seconds (243831834 bytes/sec)
real 0m1.102s
user 0m0.000s
sys 0m1.103s
_ _ _ _ _ _ _ _ _
/tester: line 18: ./fadvise: No such file or directory
256+0 records in
256+0 records out
268435456 bytes transferred in 1.102060 seconds (243576076 bytes/sec)
real 0m1.104s
user 0m0.002s
sys 0m1.101s
________________________________
/dev/hda:
setting fs readahead to 128
readahead = 128 (on)
_ _ _ _ _ _ _ _ _
/tester: line 18: ./fadvise: No such file or directory
256+0 records in
256+0 records out
268435456 bytes transferred in 1.100799 seconds (243855121 bytes/sec)
real 0m1.102s
user 0m0.000s
sys 0m1.102s
_ _ _ _ _ _ _ _ _
/tester: line 18: ./fadvise: No such file or directory
256+0 records in
256+0 records out
268435456 bytes transferred in 1.101516 seconds (243696385 bytes/sec)
real 0m1.103s
user 0m0.002s
sys 0m1.101s
_ _ _ _ _ _ _ _ _
/tester: line 18: ./fadvise: No such file or directory
256+0 records in
256+0 records out
268435456 bytes transferred in 1.100963 seconds (243818758 bytes/sec)
real 0m1.102s
user 0m0.000s
sys 0m1.103s
________________________________
/dev/hda:
setting fs readahead to 64
readahead = 64 (on)
_ _ _ _ _ _ _ _ _
/tester: line 18: ./fadvise: No such file or directory
256+0 records in
256+0 records out
268435456 bytes transferred in 1.104634 seconds (243008498 bytes/sec)
real 0m1.106s
user 0m0.002s
sys 0m1.105s
_ _ _ _ _ _ _ _ _
/tester: line 18: ./fadvise: No such file or directory
256+0 records in
256+0 records out
268435456 bytes transferred in 1.102107 seconds (243565703 bytes/sec)
real 0m1.104s
user 0m0.003s
sys 0m1.100s
_ _ _ _ _ _ _ _ _
/tester: line 18: ./fadvise: No such file or directory
256+0 records in
256+0 records out
268435456 bytes transferred in 1.104429 seconds (243053595 bytes/sec)
real 0m1.106s
user 0m0.000s
sys 0m1.106s
________________________________
Ed Sweetman wrote:
> Paolo Ornati wrote:
>
>> On Friday 09 January 2004 20:15, Ram Pai wrote:
>>
>>> On Thu, 2004-01-08 at 17:17, Andrew Morton wrote:
>>>
>>>> Ram Pai <[email protected]> wrote:
>>>>
>>>>> Well this is my theory, somebody should validate it,
>>>>
>>>>
>>>> One megabyte seems like far too little memory to be triggering the
>>>> effect which you describe. But yes, the risk is certainly there.
>>>>
>>>> You could verify this with:
>>>
>>>
>>> I cannot exactly reproduce what Paolo Ornati is seeing.
>>>
>>> Paolo: Request you to validate the following,
>>>
>>> 1) see whether you see a regression with files replacing the
>>> cat command in your script with
>>> dd if=big_file of=/dev/null bs=1M count=256
>>>
>>> 2) and if you do, check if you see a bunch of 'eek' with Andrew's
>>> following patch. (NOTE: without reverting the changes
>>> in filemap.c)
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --- 25/mm/filemap.c~a Thu Jan 8 17:15:57 2004
>>> +++ 25-akpm/mm/filemap.c Thu Jan 8 17:16:06 2004
>>> @@ -629,8 +629,10 @@ find_page:
>>> handle_ra_miss(mapping, ra, index);
>>> goto no_cached_page;
>>> }
>>> - if (!PageUptodate(page))
>>> + if (!PageUptodate(page)) {
>>> + printk("eek!\n");
>>> goto page_not_up_to_date;
>>> + }
>>> page_ok:
>>> /* If users can be writing to this page using arbitrary
>>> * virtual addresses, take care about potential aliasing
>>>
>>> -------------------------------------------------------------------------
>>>
>>
>>
>>
>> Ok, this patch seems for -mm tree... I have applied it by hand (on a
>> vanilla 2.6.1-rc1).
>>
>> For my tests I've used this script:
>>
>> #!/bin/sh
>>
>> RA_VALS="256 128 64"
>> FILE="/big_file"
>> SIZE=`stat -c '%s' $FILE`
>> NR_TESTS="3"
>> LINUX=`uname -r`
>>
>> echo "HD test for Penguin $LINUX"
>>
>> killall5
>> sync
>> sleep 3
>>
>> for ra in $RA_VALS; do
>> hdparm -a $ra /dev/hda
>> for i in `seq $NR_TESTS`; do
>> echo "_ _ _ _ _ _ _ _ _"
>> ./fadvise $FILE 0 $SIZE dontneed
>> time dd if=$FILE of=/dev/null bs=1M count=256
>> done
>> echo "________________________________"
>> done
>>
>>
>> RESULTS (2.6.0 / 2.6.1-rc1)
>>
>> HD test for Penguin 2.6.0
>>
>> /dev/hda:
>> setting fs readahead to 256
>> readahead = 256 (on)
>> _ _ _ _ _ _ _ _ _
>> 256+0 records in
>> 256+0 records out
>>
>> real 0m11.427s
>> user 0m0.002s
>> sys 0m1.722s
>> _ _ _ _ _ _ _ _ _
>> 256+0 records in
>> 256+0 records out
>>
>> real 0m11.963s
>> user 0m0.000s
>> sys 0m1.760s
>> _ _ _ _ _ _ _ _ _
>> 256+0 records in
>> 256+0 records out
>>
>> real 0m11.291s
>> user 0m0.001s
>> sys 0m1.713s
>> ________________________________
>>
>> /dev/hda:
>> setting fs readahead to 128
>> readahead = 128 (on)
>> _ _ _ _ _ _ _ _ _
>> 256+0 records in
>> 256+0 records out
>>
>> real 0m9.910s
>> user 0m0.003s
>> sys 0m1.882s
>> _ _ _ _ _ _ _ _ _
>> 256+0 records in
>> 256+0 records out
>>
>> real 0m9.693s
>> user 0m0.003s
>> sys 0m1.860s
>> _ _ _ _ _ _ _ _ _
>> 256+0 records in
>> 256+0 records out
>>
>> real 0m9.733s
>> user 0m0.004s
>> sys 0m1.922s
>> ________________________________
>>
>> /dev/hda:
>> setting fs readahead to 64
>> readahead = 64 (on)
>> _ _ _ _ _ _ _ _ _
>> 256+0 records in
>> 256+0 records out
>>
>> real 0m9.107s
>> user 0m0.000s
>> sys 0m2.026s
>> _ _ _ _ _ _ _ _ _
>> 256+0 records in
>> 256+0 records out
>>
>> real 0m9.227s
>> user 0m0.004s
>> sys 0m1.984s
>> _ _ _ _ _ _ _ _ _
>> 256+0 records in
>> 256+0 records out
>>
>> real 0m9.152s
>> user 0m0.002s
>> sys 0m2.013s
>> ________________________________
>>
>>
>> HD test for Penguin 2.6.1-rc1
>>
>> /dev/hda:
>> setting fs readahead to 256
>> readahead = 256 (on)
>> _ _ _ _ _ _ _ _ _
>> 256+0 records in
>> 256+0 records out
>>
>> real 0m11.984s
>> user 0m0.002s
>> sys 0m1.751s
>> _ _ _ _ _ _ _ _ _
>> 256+0 records in
>> 256+0 records out
>>
>> real 0m11.704s
>> user 0m0.002s
>> sys 0m1.766s
>> _ _ _ _ _ _ _ _ _
>> 256+0 records in
>> 256+0 records out
>>
>> real 0m11.886s
>> user 0m0.002s
>> sys 0m1.731s
>> ________________________________
>>
>> /dev/hda:
>> setting fs readahead to 128
>> readahead = 128 (on)
>> _ _ _ _ _ _ _ _ _
>> 256+0 records in
>> 256+0 records out
>>
>> real 0m11.120s
>> user 0m0.001s
>> sys 0m1.830s
>> _ _ _ _ _ _ _ _ _
>> 256+0 records in
>> 256+0 records out
>>
>> real 0m11.596s
>> user 0m0.005s
>> sys 0m1.764s
>> _ _ _ _ _ _ _ _ _
>> 256+0 records in
>> 256+0 records out
>>
>> real 0m11.481s
>> user 0m0.002s
>> sys 0m1.727s
>> ________________________________
>>
>> /dev/hda:
>> setting fs readahead to 64
>> readahead = 64 (on)
>> _ _ _ _ _ _ _ _ _
>> 256+0 records in
>> 256+0 records out
>>
>> real 0m11.361s
>> user 0m0.006s
>> sys 0m1.782s
>> _ _ _ _ _ _ _ _ _
>> 256+0 records in
>> 256+0 records out
>>
>> real 0m11.655s
>> user 0m0.002s
>> sys 0m1.778s
>> _ _ _ _ _ _ _ _ _
>> 256+0 records in
>> 256+0 records out
>>
>> real 0m11.369s
>> user 0m0.004s
>> sys 0m1.798s
>> ________________________________
>>
>>
>> As you can see 2.6.0 performances increase setting readahead from 256
>> to 64 (64 seems to be the best value) while 2.6.1-rc1 performances
>> don't change too much.
>>
>> I noticed that on 2.6.0 with readahead set at 256 the HD LED blinks
>> during the data transfer while with lower values (128 / 64) it stays on.
>> Instead on 2.6.1-rc1 HD LED blinks with almost any values (I must set
>> it at 8 to see it stable on).
>>
>> ANSWERS:
>>
>> 1) YES... I see a regression with files ;-(
>>
>> 2) YES, I see also a bunch of "eek!" (a mountain of "eek!")
>>
>> Bye
>>
>
>
> I'm using 2.6.0-mm1 and i see no difference from setting readahead to
> anything on my extent enabled partitions. So it appears that filesystem
> plays a big part in your numbers here, not just hdd attributes or settings.
>
> The partition that FILE is on is an ext3 + extents-enabled partition. Despite
> not having fadvise (what is this anyway?) the numbers are all real and
> no error occurred. Extents totally rock for this type of data access, as
> you can see below.
>
> Stick to non-fs tests if you want to benchmark fs independent code. Not
> everyone is going to be able to come up with the same results as you and
> as such a possible fix could actually be detrimental, and we'd be stuck
> in a loop of "ide regression" mails.
>
Debian unstable's dd may also be detecting that it's writing to /dev/null
and just not doing anything. I know extents are fast and make certain
manipulations much faster than plain ext3, but 256 MB/sec is really far too
fast. So in either case it looks like this test is not usable for me.
I don't know why you don't also try 8192 for readahead; measuring
performance by the duration or intensity of the HDD LED is not very
sound. I actually copy large files to and from parts of the same ext3
partition at over 20 MB/sec sustained, and hdparm shows its highest numbers
under that load. For me it doesn't get any faster than that. So what does
this all say? Maybe all these performance numbers are just as much based on
your readahead value as they are on the position of the moon and the
rest of the system and its hardware. BTW, what is the value of your HZ
environment variable? Debian still sets it to 100; I set it to 1024, though
I'm not really sure it made any difference.
I'm using the VIA IDE driver, and so are you, yet I'm not seeing the type of
regression that you are; my dd doesn't do what your dd does. Our HDDs
are different. The regression in the kernels could just as easily be
due to a regression in the scheduler and have nothing to do with the IDE
drivers. Have you tried taking 2.6.0 (whichever version shows changes with
your readahead values) and then the same kernel with only the new IDE
code from the kernel where you don't see any changes, so that you're running
everything else the same but only IDE has been "upgraded", and checking
whether you see the same regression? I don't think you will. The readahead
setting affects how often you have to ask the HDD to read from the platter,
and waiting on I/O can possibly affect how your kernel schedules it. Faster
drives would thus not be affected the same way, which could explain why none
of the conclusions and results you've found match my system.
Or I could be completely wrong and something could be going bad with the
IDE drivers. I just don't see how that could be the case while I don't see
the same performance regression you have when we both use the same IDE
drivers (just slightly different chipsets).
>
>
>
>
>
>
> ------------------------------------------------------------------------
>
> HD test for Penguin 2.6.0-mm1-extents
>
> /dev/hda:
> setting fs readahead to 8192
> readahead = 8192 (on)
> _ _ _ _ _ _ _ _ _
> /tester: line 18: ./fadvise: No such file or directory
> 256+0 records in
> 256+0 records out
> 268435456 bytes transferred in 1.098793 seconds (244300323 bytes/sec)
>
> real 0m1.100s
> user 0m0.005s
> sys 0m1.096s
> _ _ _ _ _ _ _ _ _
> /tester: line 18: ./fadvise: No such file or directory
> 256+0 records in
> 256+0 records out
> 268435456 bytes transferred in 1.102250 seconds (243534086 bytes/sec)
>
> real 0m1.104s
> user 0m0.000s
> sys 0m1.104s
> _ _ _ _ _ _ _ _ _
> /tester: line 18: ./fadvise: No such file or directory
> 256+0 records in
> 256+0 records out
> 268435456 bytes transferred in 1.096914 seconds (244718759 bytes/sec)
>
> real 0m1.098s
> user 0m0.001s
> sys 0m1.097s
> ________________________________
>
> /dev/hda:
> setting fs readahead to 256
> readahead = 256 (on)
> _ _ _ _ _ _ _ _ _
> /tester: line 18: ./fadvise: No such file or directory
> 256+0 records in
> 256+0 records out
> 268435456 bytes transferred in 1.104646 seconds (243005877 bytes/sec)
>
> real 0m1.106s
> user 0m0.001s
> sys 0m1.105s
> _ _ _ _ _ _ _ _ _
> /tester: line 18: ./fadvise: No such file or directory
> 256+0 records in
> 256+0 records out
> 268435456 bytes transferred in 1.100904 seconds (243831834 bytes/sec)
>
> real 0m1.102s
> user 0m0.000s
> sys 0m1.103s
> _ _ _ _ _ _ _ _ _
> /tester: line 18: ./fadvise: No such file or directory
> 256+0 records in
> 256+0 records out
> 268435456 bytes transferred in 1.102060 seconds (243576076 bytes/sec)
>
> real 0m1.104s
> user 0m0.002s
> sys 0m1.101s
> ________________________________
>
> /dev/hda:
> setting fs readahead to 128
> readahead = 128 (on)
> _ _ _ _ _ _ _ _ _
> /tester: line 18: ./fadvise: No such file or directory
> 256+0 records in
> 256+0 records out
> 268435456 bytes transferred in 1.100799 seconds (243855121 bytes/sec)
>
> real 0m1.102s
> user 0m0.000s
> sys 0m1.102s
> _ _ _ _ _ _ _ _ _
> /tester: line 18: ./fadvise: No such file or directory
> 256+0 records in
> 256+0 records out
> 268435456 bytes transferred in 1.101516 seconds (243696385 bytes/sec)
>
> real 0m1.103s
> user 0m0.002s
> sys 0m1.101s
> _ _ _ _ _ _ _ _ _
> /tester: line 18: ./fadvise: No such file or directory
> 256+0 records in
> 256+0 records out
> 268435456 bytes transferred in 1.100963 seconds (243818758 bytes/sec)
>
> real 0m1.102s
> user 0m0.000s
> sys 0m1.103s
> ________________________________
>
> /dev/hda:
> setting fs readahead to 64
> readahead = 64 (on)
> _ _ _ _ _ _ _ _ _
> /tester: line 18: ./fadvise: No such file or directory
> 256+0 records in
> 256+0 records out
> 268435456 bytes transferred in 1.104634 seconds (243008498 bytes/sec)
>
> real 0m1.106s
> user 0m0.002s
> sys 0m1.105s
> _ _ _ _ _ _ _ _ _
> /tester: line 18: ./fadvise: No such file or directory
> 256+0 records in
> 256+0 records out
> 268435456 bytes transferred in 1.102107 seconds (243565703 bytes/sec)
>
> real 0m1.104s
> user 0m0.003s
> sys 0m1.100s
> _ _ _ _ _ _ _ _ _
> /tester: line 18: ./fadvise: No such file or directory
> 256+0 records in
> 256+0 records out
> 268435456 bytes transferred in 1.104429 seconds (243053595 bytes/sec)
>
> real 0m1.106s
> user 0m0.000s
> sys 0m1.106s
> ________________________________
On Saturday 10 January 2004 17:00, Ed Sweetman wrote:
>
> I'm using 2.6.0-mm1 and i see no difference from setting readahead to
> anything on my extent enabled partitions. So it appears that filesystem
> plays a big part in your numbers here, not just hdd attributes or
> settings.
>
> The partition that FILE is on is an ext3 + extents-enabled partition. Despite
> not having fadvise (what is this anyway?) the numbers are all real and
> no error occurred. Extents totally rock for this type of data access, as
> you can see below.
>
> Stick to non-fs tests if you want to benchmark fs independent code. Not
> everyone is going to be able to come up with the same results as you and
> as such a possible fix could actually be detrimental, and we'd be stuck
> in a loop of "ide regression" mails.
To run my script correctly you _MUST_ have the "fadvise" tool (my script
assumes it is installed in the current directory).
This is what Andrew said:
_____________________________________________________________________
You'll need to unmount and remount the fs in between to remove the file
from pagecache. Or use fadvise() to remove the pagecache. There's a
little tool which does that in
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz
_____________________________________________________________________
so "fadvise" is a simple tool that calls "fadvise64" system call.
This system call lets you do some useful things: for example you can discard
all the cached pages for a file, that is what my command does.
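For reference, a minimal sketch of such a tool, written against the glibc
posix_fadvise() wrapper rather than the raw fadvise64 syscall (an
illustration only, not the actual ext3-tools program; error handling is
kept minimal):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        struct stat st;
        int fd, err;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0) {
                perror(argv[1]);
                return 1;
        }
        /* ask the kernel to drop the cached pages for [0, st.st_size) */
        err = posix_fadvise(fd, 0, st.st_size, POSIX_FADV_DONTNEED);
        if (err)
                fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
        close(fd);
        return err ? 1 : 0;
}

Remember to sync first, as the test scripts above already do, since
POSIX_FADV_DONTNEED does not drop dirty pages.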
--
Paolo Ornati
Linux v2.4.24
On Saturday 10 January 2004 17:19, Ed Sweetman wrote:
>
> debian unstable's dd may also be seeing that it's writing to /dev/null
> and just not doing anything. I know extents are fast and make certain
> manipulations on them extremely faster than plain ext3 but 256MB/sec is
> really really too fast. So in either case it looks like this test is not
> usable to me.
Yes... 256 MB/s is a bit too high!
Can you try with "fadvise" installed?
Anyway, I think your theory is right... and so, after installing "fadvise",
you will NOT see any big difference.
>
>
> I dont know why you dont also try 8192 for readahead, measuring
because readahead set to 8192 gives me BAD performance!
> performance by the duration or intensity of the hdd is led is not very
> sound. i actually copy large files to and from parts of the same ext3
> partition at over 20MB/sec sustained hdparm shows it's highest numbers
> under it. For me it doesn't get any faster than that. So what's this
> all say, maybe all these performance numbers are just as much based on
> your readahead value as they are on the position of the moon and the
> rest of the system and it's hardware. btw, what is the value of your HZ
> environment variable, debian still sets it to 100, i set it to 1024, not
> really sure if it made any difference.
>
> i'm using the via ide driver, so are you, i'm not seeing the type of
> regression that you are, my dd doesn't do what your dd does. our hdds
> are different. The regression in the kernels could just as easily be
> due to a regression in the schedular and nothing to do with the ide
> drivers. Have you tried just using 2.6.0 (whatever version you see
> changes with your readahead values) then the same kernel with the new
> ide code from the kernel you dont see any changes so you're running
> everything else the same but only ide has been "upgraded" and see if you
> see the same regression. I dont think you will. the readahead effects
Yes, the correct way to proceed is as you say...
BUT read the whole story:
1) using "hdparm -t /dev/hda" I found an IDE performance regression (in
sequential reads) upgrading from 2.6.0 to 2.6.1-rc1
2) someone told me to try reverting this patch:
"readahead: multiple performance fixes"
Reverting it in the 2.6.1-rc1 kernel gives me the same IDE performance that
2.6.0 has.
3) since the 2.6.0 and 2.6.1-rc1 (with "readahead: multiple performance
fixes" reverted) kernels give me the same results for any IDE performance
test I do --> I treat them as if they were the same thing ;-)
The part of the patch that gives me all these problems has already been
identified, and it is quite small:
diff -Nru a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c Sat Jan 3 02:29:08 2004
+++ b/mm/filemap.c Sat Jan 3 02:29:08 2004
@@ -587,13 +587,22 @@
read_actor_t actor)
{
struct inode *inode = mapping->host;
- unsigned long index, offset;
+ unsigned long index, offset, last;
struct page *cached_page;
int error;
cached_page = NULL;
index = *ppos >> PAGE_CACHE_SHIFT;
offset = *ppos & ~PAGE_CACHE_MASK;
+ last = (*ppos + desc->count) >> PAGE_CACHE_SHIFT;
+
+ /*
+ * Let the readahead logic know upfront about all
+ * the pages we'll need to satisfy this request
+ */
+ for (; index < last; index++)
+ page_cache_readahead(mapping, ra, filp, index);
+ index = *ppos >> PAGE_CACHE_SHIFT;
for (;;) {
struct page *page;
@@ -612,7 +621,6 @@
}
cond_resched();
- page_cache_readahead(mapping, ra, filp, index);
nr = nr - offset;
find_page:
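For what it's worth, Ram's eviction scenario can be pictured with a toy model
like the following (illustration only, not kernel code; the 128-page cache
budget is an arbitrary assumption chosen to make the effect visible):

#include <stdio.h>

#define REQUEST_PAGES 256       /* one 1 MB read of 4 KiB pages */
#define CACHE_PAGES   128       /* assumed page-cache budget for this file */

int main(void)
{
        int resident[REQUEST_PAGES] = { 0 };
        int misses = 0, page;

        /* phase 1: readahead for the whole request is issued up front; with
         * a small cache each new page evicts the oldest one still resident */
        for (page = 0; page < REQUEST_PAGES; page++) {
                resident[page] = 1;
                if (page >= CACHE_PAGES)
                        resident[page - CACHE_PAGES] = 0;       /* evicted */
        }

        /* phase 2: the copy loop then walks the pages in order */
        for (page = 0; page < REQUEST_PAGES; page++)
                if (!resident[page])
                        misses++;       /* must be read from disk again */

        printf("%d of %d pages had to be re-read\n", misses, REQUEST_PAGES);
        return 0;
}

With the old per-page page_cache_readahead() call (the line removed above)
the readahead stays just ahead of the copy loop, so this kind of thrashing
is much less likely.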
--
Paolo Ornati
Linux v2.4.24
Sorry, I was on vacation and could not get back to this earlier.
I do not know exactly why sequential reads on blockdevices have regressed.
One probable reason is that the same lazy-read optimization which helps
large random reads is hurting sequential read performance.
Note: the patch waits until the last page in the current window is being
read before triggering a new readahead. By the time that readahead request
is satisfied, the next sequential read may already have been issued, so
there is some loss of parallelism here. However, given that large random
reads are the most common case, this patch attacks that case.
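A simplified way to picture the two trigger points being compared (toy code
for illustration only; the names and the 64-page window are made up, and
this is not the actual mm/readahead.c logic):

#include <stdio.h>

struct toy_ra_state {
        long win_start;         /* first page of the current readahead window */
        long win_size;          /* pages in the window */
};

/* old behaviour, roughly: the next window's I/O is submitted early, so it
 * overlaps with copying the current window to userspace */
static int trigger_early(const struct toy_ra_state *ra, long index)
{
        return index == ra->win_start;
}

/* lazy read: wait for the last page of the window; good for large random
 * reads (the window is often not fully consumed), but a sequential reader
 * then waits for the new window's I/O at every window boundary */
static int trigger_lazy(const struct toy_ra_state *ra, long index)
{
        return index == ra->win_start + ra->win_size - 1;
}

int main(void)
{
        struct toy_ra_state ra = { 0, 64 };     /* arbitrary 64-page window */
        long index;

        for (index = 0; index < ra.win_size; index++) {
                if (trigger_early(&ra, index))
                        printf("early: next window submitted at page %ld\n", index);
                if (trigger_lazy(&ra, index))
                        printf("lazy:  next window submitted at page %ld\n", index);
        }
        return 0;
}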
If you revert just the lazy-read optimization, you might see no
regression for sequential reads.
Let me see if I can verify this,
Ram Pai
On Sun, 2004-01-04 at 06:30, Paolo Ornati wrote:
> On Saturday 03 January 2004 23:40, Andrew Morton wrote:
> > Paolo Ornati <[email protected]> wrote:
> > > I know these are only performance in sequential data reads... and real
> > > life is another thing... but I think the author of the patch should be
> > > informed (Ram Pai).
> >
> > There does seem to be something whacky going on with readahead against
> > blockdevices. Perhaps it is related to the soft blocksize. I've never
> > been able to reproduce any of this.
> >
> > Be aware that buffered reads for blockdevs are treated fairly differently
> > from buffered reads for regular files: they only use lowmem and we always
> > attach buffer_heads and perform I/O against them.
> >
> > No effort was made to optimise buffered blockdev reads because it is not
> > very important and my main interest was in data coherency and filesystem
> > metadata consistency.
> >
> > If you observe the same things reading from regular files then that is
> > more important.
>
> I have done some tests with this stupid script and it seems that you are
> right:
> _____________________________________________________________________
> #!/bin/sh
>
> DEV=/dev/hda7
> MOUNT_DIR=mnt
> BIG_FILE=$MOUNT_DIR/big_file
>
> mount $DEV $MOUNT_DIR
> if [ ! -f $BIG_FILE ]; then
> echo "[DD] $BIG_FILE"
> dd if=/dev/zero of=$BIG_FILE bs=1M count=1024
> umount $MOUNT_DIR
> mount $DEV $MOUNT_DIR
> fi
>
> killall5
> sleep 2
> sync
> sleep 2
>
> time cat $BIG_FILE > /dev/null
> umount $MOUNT_DIR
> _____________________________________________________________________
>
>
> Results for plain 2.6.1-rc1 (A) and 2.6.1-rc1 without Ram Pai's patch (B):
>
> o readahead = 256 (default setting)
>
> (A)
> real 0m43.596s
> user 0m0.153s
> sys 0m5.602s
>
> real 0m42.971s
> user 0m0.136s
> sys 0m5.571s
>
> real 0m42.888s
> user 0m0.137s
> sys 0m5.648s
>
> (B)
> real 0m43.520s
> user 0m0.130s
> sys 0m5.615s
>
> real 0m42.930s
> user 0m0.154s
> sys 0m5.745s
>
> real 0m42.937s
> user 0m0.120s
> sys 0m5.751s
>
>
> o readahead = 128
>
> (A)
> real 0m35.932s
> user 0m0.133s
> sys 0m5.926s
>
> real 0m35.925s
> user 0m0.146s
> sys 0m5.930s
>
> real 0m35.892s
> user 0m0.145s
> sys 0m5.946s
>
> (B)
> real 0m35.957s
> user 0m0.136s
> sys 0m6.041s
>
> real 0m35.958s
> user 0m0.136s
> sys 0m5.957s
>
> real 0m35.924s
> user 0m0.146s
> sys 0m6.069s
>
>
> o readahead = 64
> (A)
> real 0m35.284s
> user 0m0.137s
> sys 0m6.182s
>
> real 0m35.267s
> user 0m0.134s
> sys 0m6.110s
>
> real 0m35.260s
> user 0m0.149s
> sys 0m6.003s
>
>
> (B)
> real 0m35.210s
> user 0m0.149s
> sys 0m6.009s
>
> real 0m35.341s
> user 0m0.151s
> sys 0m6.119s
>
> real 0m35.151s
> user 0m0.144s
> sys 0m6.195s
>
>
> I don't notice any big difference between kernel A and kernel B....
>
> From these tests the best readahead value for my HD seems to be 64... and
> the default setting (256) just wrong.
>
> With 2.4.23 kernel and readahead = 8 I get results like these:
>
> real 0m40.085s
> user 0m0.130s
> sys 0m4.560s
>
> real 0m40.058s
> user 0m0.090s
> sys 0m4.630s
>
> Bye.