2008-11-28 00:49:19

by Wu Fengguang

Subject: Re: Slow file transfer speeds with CFQ IO scheduler in some cases

On Thu, Nov 27, 2008 at 08:46:35PM +0300, Vladislav Bolkhovitin wrote:
>
> Wu Fengguang wrote:
>> On Tue, Nov 25, 2008 at 03:09:12PM +0300, Vladislav Bolkhovitin wrote:
>>> Vladislav Bolkhovitin wrote:
>>>> Wu Fengguang wrote:
>>>>> On Tue, Nov 25, 2008 at 02:41:47PM +0300, Vladislav Bolkhovitin wrote:
>>>>>> Wu Fengguang wrote:
>>>>>>> On Tue, Nov 25, 2008 at 01:59:53PM +0300, Vladislav Bolkhovitin wrote:
>>>>>>>> Wu Fengguang wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> //Sorry for being late.
>>>>>>>>>
>>>>>>>>> On Wed, Nov 12, 2008 at 08:02:28PM +0100, Jens Axboe wrote:
>>>>>>>>> [...]
>>>>>>>>>> I already talked about this with Jeff on irc, but I guess should post it
>>>>>>>>>> here as well.
>>>>>>>>>>
>>>>>>>>>> nfsd aside (which does seem to have some different behaviour skewing the
>>>>>>>>>> results), the original patch came about because dump(8) has a really
>>>>>>>>>> stupid design that offloads IO to a number of processes. This basically
>>>>>>>>>> makes fairly sequential IO more random with CFQ, since each process gets
>>>>>>>>>> its own io context. My feeling is that we should fix dump instead of
>>>>>>>>>> introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
>>>>>>>>>> aware of any other good programs out there that would do something
>>>>>>>>>> similar, so I don't think there's a lot of merit to spending cycles on
>>>>>>>>>> detecting cooperating processes.
>>>>>>>>>>
>>>>>>>>>> Jeff will take a look at fixing dump instead, and I may have promised
>>>>>>>>>> him that santa will bring him something nice this year if he does (since
>>>>>>>>>> I'm sure it'll be painful on the eyes).
>>>>>>>>> This could also be fixed at the VFS readahead level.
>>>>>>>>>
>>>>>>>>> In fact I've seen many kinds of interleaved accesses:
>>>>>>>>> - concurrently reading 40 files that are in fact hard links of one single file
>>>>>>>>> - a backup tool that splits a big file into 8k chunks, and serve the
>>>>>>>>> {1, 3, 5, 7, ...} chunks in one process and the {0, 2, 4, 6, ...}
>>>>>>>>> chunks in another one
>>>>>>>>> - a pool of NFSDs randomly serving some originally
>>>>>>>>> sequential read requests - now dump(8) seems to have
>>>>>>>>> some similar problem.
>>>>>>>>>
>>>>>>>>> In summary there have been all kinds of efforts on trying to
>>>>>>>>> parallelize I/O tasks, but unfortunately they can easily screw up the
>>>>>>>>> sequential pattern. It may not be easily fixable for many of them.
>>>>>>>>>
>>>>>>>>> It is however possible to detect most of these patterns at the
>>>>>>>>> readahead layer and restore sequential I/Os, before they propagate
>>>>>>>>> into the block layer and hurt performance.
>>>>>>>> I believe this would be the most effective way to go,
>>>>>>>> especially in the case where the data delivery path to the
>>>>>>>> original client has its own latency that depends on the amount
>>>>>>>> of transferred data, as with a remote NFS mount, which does
>>>>>>>> synchronous sequential reads. In this case it is essential for
>>>>>>>> performance to keep both links (local to the storage and the
>>>>>>>> network to the client) busy and transferring data
>>>>>>>> simultaneously. Since the reads are synchronous, the only way
>>>>>>>> to achieve that is to perform enough read ahead on the server
>>>>>>>> to cover the network link latency. Otherwise you end up with
>>>>>>>> only half of the possible throughput.
>>>>>>>>
>>>>>>>> However, on one hand, the server has to have a pool of
>>>>>>>> threads/processes to perform well, but, on the other hand, the
>>>>>>>> current read ahead code doesn't detect well that those
>>>>>>>> threads/processes are doing a joint sequential read, so the
>>>>>>>> read ahead window gets smaller, and hence the overall read
>>>>>>>> performance gets considerably smaller too.
>>>>>>>>
>>>>>>>>> Vitaly, if that's what you need, I can try to prepare a patch for testing out.
>>>>>>>> I can test it with the SCST SCSI target subsystem
>>>>>>>> (http://scst.sf.net). SCST needs such a feature very much,
>>>>>>>> otherwise it can't get the full backstorage read speed. The
>>>>>>>> maximum I can see is about ~80MB/s from a ~130MB/s 15K RPM
>>>>>>>> disk over a 1Gbps iSCSI link (maximum possible is ~110MB/s).
>>>>>>> Thank you very much!
>>>>>>>
>>>>>>> BTW, do you mean that the SCSI system (or its applications) has
>>>>>>> similar behaviors that the current readahead code cannot handle well?
>>>>>> No. The SCSI target subsystem is not the same as the SCSI initiator
>>>>>> subsystem, which is usually called simply the SCSI (sub)system. A
>>>>>> SCSI target is a SCSI server. It has about as much in common with
>>>>>> a SCSI initiator as, e.g., Apache (an HTTP server) has with
>>>>>> Firefox (an HTTP client).
>>>>> Got it. So the SCSI server will split & spread the sequential IO of one
>>>>> single file across cooperating threads?
>>>> Yes. It has to, because Linux doesn't have asynchronous cached IO
>>>> and a client can queue several tens of commands at a time. Then, on
>>>> sequential IO with 1 command at a time, the CPU scheduler comes into
>>>> play and spreads those commands over those threads, so read ahead
>>>> gets too small to cover the external link latency and fill both
>>>> links with data, so the uncovered latency kills throughput.
>>> Additionally, if the uncovered external link latency is too large,
>>> one more factor becomes noticeable: storage rotational latency. If
>>> the next unread sector is not read in time, the server has to wait
>>> a full rotation before it starts receiving data for the next block,
>>> which decreases the resulting throughput even more.
>
>> Thank you for the details. I've been working slowly on the idea, and
>> should be able to send you a patch in the next one or two days.
>
> Actually, there's one more thing which should have been mentioned. It
> is possible that remote clients have several sequential read streams at
> a time together with some "noise" of random requests. A good read-ahead
> subsystem should handle such a case by keeping big read-ahead windows
> for the sequential streams and doing no read ahead for the random
> requests. And all at the same time.
>
> Currently, on such workloads read ahead will be completely disabled for
> all the requests. Hence, there is a possibility here to improve
> performance by 3-5 times or even more by making the workload more linear.

Are you sure? I'd expect such mixed-sequential-random pattern to be
handled by the current readahead code pretty well: sequential ones
will get large readahead and random ones won't get readahead at all.

Attached is the context readahead patch plus a kernel module for
readahead tracing and accounting, which will hopefully help clarify the
read patterns and readahead behaviors on production workloads. It is
based on 2.6.27 for your convenience, but also applies to 2.6.28.

The patch is not targeted for code review, but if anyone is interested,
you can take a look at try_context_readahead(). This is the only newly
introduced readahead policy; the rest is mostly code refactoring
and tracing facilities.
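
To illustrate the idea behind it: when a read request is not a
continuation of this file handle's own readahead window, the context
policy probes the page cache for pages immediately before the current
offset; if some of them are cached, a cooperative reader has recently
been there, so the access is treated as sequential and a readahead
window is opened anyway. A simplified sketch follows (this is not the
code from the attached patch; count_cached_pages_before() is a
hypothetical helper standing in for the real page cache probing):

static int context_readahead_sketch(struct address_space *mapping,
				    struct file_ra_state *ra,
				    pgoff_t offset, unsigned long req_size,
				    unsigned long max)
{
	unsigned long history;

	/* How many contiguous pages just before 'offset' are already
	 * in the page cache? (hypothetical helper) */
	history = count_cached_pages_before(mapping, offset, max);
	if (!history)
		return 0;		/* no context: likely a random read */

	/* A cooperative sequential stream exists: open a window at
	 * 'offset', sized from the observed history plus this request. */
	ra->start = offset;
	ra->size = min(history + req_size, max);
	ra->async_size = ra->size;	/* keep the next window in flight */
	return 1;
}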

The newly introduced context readahead policy is disabled by default.
To enable it:
echo 1 > /sys/block/sda/queue/context_readahead
I'm not sure for now whether this parameter will be a long term one, or
whether the context readahead policy should be enabled unconditionally.

The readahead accounting stats can be viewed by
mount -t debugfs none /sys/kernel/debug
cat /sys/kernel/debug/readahead/stats
The numbers can be reset by
echo > /sys/kernel/debug/readahead/stats

Here is a sample output from my desktop:

% cat /sys/kernel/debug/readahead/stats
pattern       count  sync_count  eof_count  size  async_size  actual
none              0           0          0     0           0       0
initial0       3009        3009       2033     5           4       2
initial          35          35          0     5           4       3
subsequent     1294         240        827    52          49      26
marker          220           0        109    54          53      29
trail             0           0          0     0           0       0
oversize          0           0          0     0           0       0
reverse           0           0          0     0           0       0
stride            0           0          0     0           0       0
thrash            0           0          0     0           0       0
mmap           2833        2833       1379   142           0      47
fadvise           7           7          7     0           0      40
random         7621        7621         69     1           0       1
all           15019       13745       4424    33           5      12

The readahead/read tracing messages are disabled by default.
To enable them:
echo 1 > /sys/kernel/debug/readahead/trace_enable
echo 1 > /sys/kernel/debug/readahead/read_jprobes
They (especially the latter one) will generate a lot of printk messages like:

[ 828.151013] readahead-initial0(pid=4644(zsh), dev=00:10(0:10), ino=351452(whoami), req=0+1, ra=0+4-3, async=0) = 4
[ 828.167853] readahead-mmap(pid=4644(whoami), dev=00:10(0:10), ino=351452(whoami), req=0+0, ra=0+60-0, async=0) = 3
[ 828.195652] readahead-initial0(pid=4629(zsh), dev=00:10(0:10), ino=115569(zsh_prompt), req=0+128, ra=0+120-60, async=0) = 3
[ 828.225081] readahead-initial0(pid=4629(zsh), dev=00:10(0:10), ino=342086(.zsh_history), req=0+128, ra=0+120-60, async=0) = 4

[ 964.471450] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=351445(wc), pos=0, count=128)
[ 964.471544] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=351445(wc), pos=64, count=448)
[ 964.471575] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=351445(wc), pos=512, count=28)
[ 964.472659] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=383002(ld-2.7.so), pos=0, count=128)
[ 964.473431] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=383002(ld-2.7.so), pos=64, count=336)
[ 964.475639] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=383010(libc-2.7.so), pos=0, count=832)
[ 964.479037] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=196085(locale.alias), pos=0, count=524288)
[ 964.479166] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=196085(locale.alias), pos=2586, count=524288)

So please enable them only when necessary.

My recommendation for the double readahead in NFS clients and NFS servers
is to keep the client side readahead size small and the server side one large.
For example, 128K-512K on the client and 1M-2M on the server (more for RAID).
The NFS client side readahead size is not directly tunable, but setting rsize
to a small value (via the rsize mount option) does the trick.
Currently the NFS magic is readahead_size=N*rsize. The default numbers in my
2.6.28 kernel are rsize=512k, N=15, readahead_size=7680k. The latter is
obviously way too large.

Thanks,
Fengguang

> I guess such functionality could be done by keeping several read-ahead
> contexts not in the file handle as now, but in the inode, or even in
> task_struct similarly to io_context, or even as a part of struct
> io_context. Then those contexts would be rotated in, e.g., a round-robin
> manner. I have some basic thoughts in this area and would be glad to
> share them, if you are interested.
>
> Going further, ultimately it would be great to provide somehow a
> capability to assign each read-ahead stream its own IO context,
> because on the remote side in most cases such streams would correspond
> to different programs reading data in parallel. This should allow
> performance to increase even more.
>
>> Thanks,
>> Fengguang
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>


Attachments:
readahead-context-plus-tracing-module-2.6.27.patch (25.01 kB)

2009-02-12 18:35:05

by Vladislav Bolkhovitin

Subject: Re: Slow file transfer speeds with CFQ IO scheduler in some cases

kernel: 2.6.27.12-all_patches - all scst patches


#cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]
#cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]

#free
             total       used       free     shared    buffers     cached
Mem:        508180     502448       5732          0     430704      27740
-/+ buffers/cache:      44004     464176
Swap:            0          0          0


scst: Using security group "Default" for initiator "iqn.1996-04.de.suse:01:aadab8bc4be5"
iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 131072,
iscsi-scst: MaxBurstLength 1048576, FirstBurstLength 262144, DefaultTime2Wait 2, DefaultTime2Retain 0,
iscsi-scst: MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst: HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 102 MB/s
b) 102 MB/s
c) 102 MB/s
#dd if=/dev/sdc of=/dev/null bs=64K count=80000
a) 74,3 MB/s
b) 74,5 MB/s
c) 74,4 MB/s

Run at the same time:
#while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 21,6 MB/s
b) 22,8 MB/s
c) 24,1 MB/s
d) 23,1 MB/s

------------------------------------------------------------------------
kernel: 2.6.27.12-except_export_alloc - all scst patches, except export_alloc_io_context.patch

# cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
# cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]


scst: Using security group "Default" for initiator "iqn.1996-04.de.suse:01:aadab8bc4be5"
iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 131072,
iscsi-scst: MaxBurstLength 1048576, FirstBurstLength 262144, DefaultTime2Wait 2, DefaultTime2Retain 0,
iscsi-scst: MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst: HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 48,6 MB/s
b) 49,2 MB/s
c) 48,9 MB/s
#dd if=/dev/sdc of=/dev/null bs=64K count=80000
a) 48,3 MB/s
b) 48,5 MB/s
c) 47,9 MB/s

Run at the same time:
#while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 4,2 MB/s
b) 3,9 MB/s
c) 4,1 MB/s



---------------------------------------------------------------
kernel: uname -r 2.6.27.12-readahead - all scst patches + readahead-context.patch
default /dev/md0 readahead = 512
linux-4dtq:~ #
linux-4dtq:~ #
linux-4dtq:~ # cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # blockdev --setra 1024 /dev/sda
linux-4dtq:~ # blockdev --setra 1024 /dev/sdb
linux-4dtq:~ # echo 1 > /sys/block/sda/queue/context_readahead
linux-4dtq:~ # echo 1 > /sys/block/sdb/queue/context_readahead
linux-4dtq:~ # cat /sys/block/sdb/queue/context_readahead
1
linux-4dtq:~ # cat /sys/block/sda/queue/context_readahead
1
linux-4dtq:~ # blockdev --getra /dev/md0
256
linux-4dtq:~ # mdadm --assemble /dev/md0 /dev/sd[ab]
mdadm: /dev/md/0 has been started with 2 drives.
linux-4dtq:~ # vgchange -a y
3 logical volume(s) in volume group "raid" now active
linux-4dtq:~ # blockdev --getra /dev/md0
512
linux-4dtq:~ # free
             total       used       free     shared    buffers     cached
Mem:        508168     104480     403688          0       4516      64632
-/+ buffers/cache:      35332     472836
Swap:            0          0          0



scst: Using security group "Default" for initiator "iqn.1996-04.de.suse:01:aadab8bc4be5"
iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 131072,
iscsi-scst: MaxBurstLength 1048576, FirstBurstLength 262144, DefaultTime2Wait 2, DefaultTime2Retain 0,
iscsi-scst: MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst: HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048


#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 96,9 MB/s
b) 101 MB/s
c) 101 MB/s
#dd if=/dev/sdc of=/dev/null bs=64K count=80000
a) 73,9 MB/s
b) 74,1 MB/s
c) 73,3 MB/s

Run at the same time:
#while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 22,1 MB/s
b) 21,6 MB/s
c) 21,1 MB/s




-----------------------------------------------------------------------------------
kernel: uname -r 2.6.27.12-readahead - all scst patches + readahead-context.patch
set: /dev/md0 readahead = 1024

linux-4dtq:~ # blockdev --setra 1024 /dev/sdb
linux-4dtq:~ # blockdev --setra 1024 /dev/sda
linux-4dtq:~ # blockdev --setra 1024 /dev/md0
linux-4dtq:~ # echo 1 > /sys/block/sda/queue/context_readahead
linux-4dtq:~ # echo 1 > /sys/block/sdb/queue/context_readahead
linux-4dtq:~ # cat /sys/block/sdb/queue/context_readahead
1
linux-4dtq:~ # cat /sys/block/sda/queue/context_readahead
1
linux-4dtq:~ # mdadm --assemble /dev/md0 /dev/sd[ab]
mdadm: /dev/md/0 has been started with 2 drives.
linux-4dtq:~ # vgchange -a y
3 logical volume(s) in volume group "raid" now active
linux-4dtq:~ # blockdev --getra /dev/md0
1024
linux-4dtq:~ # cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]

scst: Using security group "Default" for initiator "iqn.1996-04.de.suse:01:aadab8bc4be5"
iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 131072,
iscsi-scst: MaxBurstLength 1048576, FirstBurstLength 262144, DefaultTime2Wait 2, DefaultTime2Retain 0,
iscsi-scst: MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst: HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 102 MB/s
b) 100 MB/s
c) 104 MB/s
#dd if=/dev/sdc of=/dev/null bs=64K count=80000
a) 74,1 MB/s
b) 73,7 MB/s
c) 74,0 MB/s

Run at the same time:
#while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 22,0 MB/s
b) 21,5 MB/s
c) 22,9 MB/s

------------------------------------------------------------------------
kernel: uname -r 2.6.27.12-readahead - all scst patches + readahead-context.patch
set: /dev/md0 readahead = 2048

linux-4dtq:~ # cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # echo 1 > /sys/block/sdb/queue/context_readahead
linux-4dtq:~ # echo 1 > /sys/block/sda/queue/context_readahead
linux-4dtq:~ # blockdev --setra 1024 /dev/sda
linux-4dtq:~ # blockdev --setra 1024 /dev/sdb
linux-4dtq:~ # blockdev --setra 2048 /dev/md0
linux-4dtq:~ # mdadm --assemble /dev/md0 /dev/sd[ab]
mdadm: /dev/md/0 has been started with 2 drives.
linux-4dtq:~ # vgchange -a y
3 logical volume(s) in volume group "raid" now active
linux-4dtq:~ # cat /sys/block/sdb/queue/context_readahead
1
linux-4dtq:~ # cat /sys/block/sda/queue/context_readahead
1
linux-4dtq:~ # blockdev --getra /dev/md0
2048


scst: Using security group "Default" for initiator "iqn.1996-04.de.suse:01:aadab8bc4be5"
iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 131072,
iscsi-scst: MaxBurstLength 1048576, FirstBurstLength 262144, DefaultTime2Wait 2, DefaultTime2Retain 0,
iscsi-scst: MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst: HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 103 MB/s
b) 102 MB/s
c) 102 MB/s
#dd if=/dev/sdc of=/dev/null bs=64K count=80000
a) 73,7 MB/s
b) 74,1 MB/s
c) 74,3 MB/s

Run at the same time:
#while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 20,1 MB/s
b) 21,5 MB/s
c) 21,8 MB/s

---------------------------------------------------
Kernel uname -r: 2.6.27.12-readahead
readahead set to: 4M

linux-4dtq:~ # cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # blockdev --setra 4096 /dev/sda
linux-4dtq:~ # blockdev --setra 4096 /dev/sdb
linux-4dtq:~ # blockdev --getra /dev/sdb
4096
linux-4dtq:~ # blockdev --getra /dev/sda
4096
linux-4dtq:~ # echo 1 > /sys/block/sdb/queue/context_readahead
linux-4dtq:~ # echo 1 > /sys/block/sda/queue/context_readahead
linux-4dtq:~ # cat /sys/block/sdb/queue/context_readahead
1
linux-4dtq:~ # cat /sys/block/sda/queue/context_readahead
1
linux-4dtq:~ # mdadm --assemble /dev/md0 /dev/sd[ab]
mdadm: /dev/md/0 has been started with 2 drives.
linux-4dtq:~ # vgchange -a y
3 logical volume(s) in volume group "raid" now active
linux-4dtq:~ # cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]

scst: Using security group "Default" for initiator "iqn.1996-04.de.suse:01:aadab8bc4be5"
iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 131072,
iscsi-scst: MaxBurstLength 1048576, FirstBurstLength 262144, DefaultTime2Wait 2, DefaultTime2Retain 0,
iscsi-scst: MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst: HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 99,4 MB/s
b) 96,4 MB/s
c) 101 MB/s
#dd if=/dev/sdc of=/dev/null bs=64K count=80000
a) 73,5 MB/s
b) 74,3 MB/s
c) 73,9 MB/s

Run at the same time:
#while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 21,0 MB/s
b) 19,8 MB/s
c) 22,4 MB/s

------------------------------------------------------------------------
kernel: uname -r 2.6.27.12-except_export+readahead - all scst patches except export_alloc_io_context.patch + readahead-context.patch
default /dev/md0 readahead = 512

linux-4dtq:/.kernels/scst-kernel-4/scstadmin # cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:/.kernels/scst-kernel-4/scstadmin # cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:/.kernels/scst-kernel-4/scstadmin # blockdev --getra /dev/sda
1024
linux-4dtq:/.kernels/scst-kernel-4/scstadmin # blockdev --getra /dev/sdb
1024
linux-4dtq:/.kernels/scst-kernel-4/scstadmin # blockdev --getra /dev/md0
512
linux-4dtq:/.kernels/scst-kernel-4/scstadmin # cat /sys/block/sdb/queue/context_readahead
1
linux-4dtq:/.kernels/scst-kernel-4/scstadmin # cat /sys/block/sda/queue/context_readahead
1


scst: Using security group "Default" for initiator "iqn.1996-04.de.suse:01:aadab8bc4be5"
iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 131072,
iscsi-scst: MaxBurstLength 1048576, FirstBurstLength 262144, DefaultTime2Wait 2, DefaultTime2Retain 0,
iscsi-scst: MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst: HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 47,7 MB/s
b) 47,6 MB/s
c) 47,6 MB/s
#dd if=/dev/sdc of=/dev/null bs=64K count=80000
a) 47,6 MB/s
b) 47,2 MB/s
c) 46,8 MB/s

Run at the same time:
#while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 3,5 MB/s
b) 3,1 MB/s
c) 3,5 MB/s

-----------------------------------------
kernel: uname -r: 2.6.27.12-except_export+readahead
readahead set to: 4M

linux-4dtq:~ # cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # echo 1 > /sys/block/sda/queue/context_readahead
linux-4dtq:~ # echo 1 > /sys/block/sdb/queue/context_readahead
linux-4dtq:~ # blockdev --setra 4096 /dev/sdb
linux-4dtq:~ # blockdev --setra 4096 /dev/sda
linux-4dtq:~ # cat /sys/block/sdb/queue/context_readahead
1
linux-4dtq:~ # cat /sys/block/sda/queue/context_readahead
1
linux-4dtq:~ # blockdev --getra /dev/sda
4096
linux-4dtq:~ # blockdev --getra /dev/sdb
4096
linux-4dtq:~ # mdadm --assemble /dev/md0 /dev/sd[ab]
mdadm: /dev/md/0 has been started with 2 drives.
linux-4dtq:~ # vgchange -a y
3 logical volume(s) in volume group "raid" now active
linux-4dtq:~ #

scst: Using security group "Default" for initiator "iqn.1996-04.de.suse:01:aadab8bc4be5"
iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 131072,
iscsi-scst: MaxBurstLength 1048576, FirstBurstLength 262144, DefaultTime2Wait 2, DefaultTime2Retain 0,
iscsi-scst: MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst: HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 49,1 MB/s
b) 49,0 MB/s
c) 48,7 MB/s
#dd if=/dev/sdc of=/dev/null bs=64K count=80000
a) 47,9 MB/s
b) 46,6 MB/s
c) 47,2 MB/s
Run at the same time:
#while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 3,6 MB/s
b) 3,4 MB/s
c) 3,6 MB/s


Attachments:
cfq-scheduler.txt (13.83 kB)

2009-02-19 02:05:45

by Wu Fengguang

Subject: Re: Slow file transfer speeds with CFQ IO scheduler in some cases

On Tue, Feb 17, 2009 at 10:01:40PM +0300, Vladislav Bolkhovitin wrote:
> Wu Fengguang, on 02/13/2009 04:57 AM wrote:
>> On Thu, Feb 12, 2009 at 09:35:18PM +0300, Vladislav Bolkhovitin wrote:
>>> Sorry for such a huge delay. There were many other activities I had
>>> to do before + I had to be sure I didn't miss anything.
>>>
>>> We didn't use NFS, we used SCST (http://scst.sourceforge.net) with
>>> iSCSI-SCST target driver. It has similar to NFS architecture, where N
>>> threads (N=5 in this case) handle IO from remote initiators
>>> (clients) coming from wire using iSCSI protocol. In addition, SCST
>>> has patch called export_alloc_io_context (see
>>> http://lkml.org/lkml/2008/12/10/282), which allows for the IO threads
>>> queue IO using single IO context, so we can see if context RA can
>>> replace grouping IO threads in single IO context.
>>>
>>> Unfortunately, the results are negative. We find neither any
>>> advantages of context RA over current RA implementation, nor
>>> possibility for context RA to replace grouping IO threads in single
>>> IO context.
>>>
>>> Setup on the target (server) was the following. 2 SATA drives grouped
>>> in md RAID-0 with average local read throughput ~120MB/s ("dd
>>> if=/dev/zero of=/dev/md0 bs=1M count=20000" outputs "20971520000
>>> bytes (21 GB) copied, 177,742 s, 118 MB/s"). The md device was
>>> partitioned on 3 partitions. The first partition was 10% of space in
>>> the beginning of the device, the last partition was 10% of space in
>>> the end of the device, the middle one was the rest in the middle of
>>> the space between them. Then the first and the last partitions were exported
>>> to the initiator (client). They were /dev/sdb and /dev/sdc on it
>>> correspondingly.
>>
>> Vladislav, Thank you for the benchmarks! I'm very interested in
>> optimizing your workload and figuring out what happens underneath.
>>
>> Are the client and server two standalone boxes connected by GBE?
>>
>> When you set readahead sizes in the benchmarks, you are setting them
>> in the server side? I.e. "linux-4dtq" is the SCST server? What's the
>> client side readahead size?
>>
>> It would help a lot to debug readahead if you can provide the
>> server side readahead stats and trace log for the worst case.
>> This will automatically answer the above questions as well as disclose
>> the micro-behavior of readahead:
>>
>> mount -t debugfs none /sys/kernel/debug
>>
>> echo > /sys/kernel/debug/readahead/stats # reset counters
>> # do benchmark
>> cat /sys/kernel/debug/readahead/stats
>>
>> echo 1 > /sys/kernel/debug/readahead/trace_enable
>> # do micro-benchmark, i.e. run the same benchmark for a short time
>> echo 0 > /sys/kernel/debug/readahead/trace_enable
>> dmesg
>>
>> The above readahead trace should help find out how the client side
>> sequential reads convert into server side random reads, and how we can
>> prevent that.
>
> See attached. Could you comment on the logs, please, so I will also be able
> to read them in the future?

Vladislav, thank you for the logs!

The printk format for the following lines is:

+ printk(KERN_DEBUG "readahead-%s(pid=%d(%s), dev=%02x:%02x(%s), "
+ "ino=%lu(%s), req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d\n",
+ ra_pattern_names[pattern],
+ current->pid, current->comm,
+ MAJOR(mapping->host->i_sb->s_dev),
+ MINOR(mapping->host->i_sb->s_dev),
+ mapping->host->i_sb->s_id,
+ mapping->host->i_ino,
+ filp->f_path.dentry->d_name.name,
+ offset, req_size,
+ ra->start, ra->size, ra->async_size,
+ async,
+ actual);

readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=10596+1, ra=10628+32-32, async=1) = 32
readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10628+1, ra=10660+32-32, async=1) = 32
readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=10660+1, ra=10692+32-32, async=1) = 32
readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=10692+1, ra=10724+32-32, async=1) = 32
readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10724+1, ra=10756+32-32, async=1) = 32
readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=10756+1, ra=10788+32-32, async=1) = 32
readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=10788+1, ra=10820+32-32, async=1) = 32
readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=10820+1, ra=10852+32-32, async=1) = 32
readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=10852+1, ra=10884+32-32, async=1) = 32
readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10884+1, ra=10916+32-32, async=1) = 32
readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=10916+1, ra=10948+32-32, async=1) = 32
readahead-marker(pid=3836(vdiskd3_1), dev=00:02(bdev), ino=0(raid-1st), req=10948+1, ra=10980+32-32, async=1) = 32
readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10980+1, ra=11012+32-32, async=1) = 32
readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=11012+1, ra=11044+32-32, async=1) = 32
readahead-marker(pid=3836(vdiskd3_1), dev=00:02(bdev), ino=0(raid-1st), req=11044+1, ra=11076+32-32, async=1) = 32
readahead-subsequent(pid=3836(vdiskd3_1), dev=00:02(bdev), ino=0(raid-1st), req=11076+1, ra=11108+32-32, async=1) = 32
readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11108+1, ra=11140+32-32, async=1) = 32
readahead-subsequent(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11140+1, ra=11172+32-32, async=1) = 32
readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11172+1, ra=11204+32-32, async=1) = 32
readahead-subsequent(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11204+1, ra=11236+32-32, async=1) = 32
readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=11236+1, ra=11268+32-32, async=1) = 32
readahead-subsequent(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=11268+1, ra=11300+32-32, async=1) = 32
readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11300+1, ra=11332+32-32, async=1) = 32
readahead-subsequent(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11332+1, ra=11364+32-32, async=1) = 32
readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11364+1, ra=11396+32-32, async=1) = 32
readahead-subsequent(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11396+1, ra=11428+32-32, async=1) = 32

The above trace shows that the readahead logic is working pretty well,
however the max readahead size (32 pages) is way too small. This can
also be confirmed in the following table, where the average readahead
request size/async_size and actual readahead I/O size are all 30.

linux-4dtq:/ # cat /sys/kernel/debug/readahead/stats
pattern       count  sync_count  eof_count  size  async_size  actual
none              0           0          0     0           0       0
initial0         71          71         41     4           3       2
initial          23          23          0     4           3       4
subsequent     3845           4         21    31          31      31
marker         4222           0          1    31          31      31
trail             0           0          0     0           0       0
oversize          0           0          0     0           0       0
reverse           0           0          0     0           0       0
stride            0           0          0     0           0       0
thrash            0           0          0     0           0       0
mmap            135         135         15    32           0      17
fadvise         180         180        180     0           0       1
random           23          23          2     1           0       1
all            8499         436        260    30          30      30
                                              ^^^^^^^^^^^^^^^^^^^^^^

I suspect that your main performance problem comes from the small read/readahead size.
If larger values are used, even the vanilla 2.6.27 kernel will perform well.

Thanks,
Fengguang

2009-02-13 01:57:25

by Wu Fengguang

Subject: Re: Slow file transfer speeds with CFQ IO scheduler in some cases

On Thu, Feb 12, 2009 at 09:35:18PM +0300, Vladislav Bolkhovitin wrote:
> Sorry for such a huge delay. There were many other activities I had to
> do before + I had to be sure I didn't miss anything.
>
> We didn't use NFS, we used SCST (http://scst.sourceforge.net) with
> iSCSI-SCST target driver. It has similar to NFS architecture, where N
> threads (N=5 in this case) handle IO from remote initiators (clients)
> coming from wire using iSCSI protocol. In addition, SCST has patch
> called export_alloc_io_context (see
> http://lkml.org/lkml/2008/12/10/282), which allows for the IO threads
> queue IO using single IO context, so we can see if context RA can
> replace grouping IO threads in single IO context.
>
> Unfortunately, the results are negative. We find neither any advantages
> of context RA over current RA implementation, nor possibility for
> context RA to replace grouping IO threads in single IO context.
>
> Setup on the target (server) was the following. 2 SATA drives grouped in
> md RAID-0 with average local read throughput ~120MB/s ("dd if=/dev/zero
> of=/dev/md0 bs=1M count=20000" outputs "20971520000 bytes (21 GB)
> copied, 177,742 s, 118 MB/s"). The md device was partitioned on 3
> partitions. The first partition was 10% of space in the beginning of the
> device, the last partition was 10% of space in the end of the device,
> the middle one was the rest in the middle of the space between them. Then the
> first and the last partitions were exported to the initiator (client).
> They were /dev/sdb and /dev/sdc on it correspondingly.

Vladislav, Thank you for the benchmarks! I'm very interested in
optimizing your workload and figuring out what happens underneath.

Are the client and server two standalone boxes connected by GBE?

When you set readahead sizes in the benchmarks, you are setting them
in the server side? I.e. "linux-4dtq" is the SCST server? What's the
client side readahead size?

It would help a lot to debug readahead if you can provide the
server side readahead stats and trace log for the worst case.
This will automatically answer the above questions as well as disclose
the micro-behavior of readahead:

mount -t debugfs none /sys/kernel/debug

echo > /sys/kernel/debug/readahead/stats # reset counters
# do benchmark
cat /sys/kernel/debug/readahead/stats

echo 1 > /sys/kernel/debug/readahead/trace_enable
# do micro-benchmark, i.e. run the same benchmark for a short time
echo 0 > /sys/kernel/debug/readahead/trace_enable
dmesg

The above readahead trace should help find out how the client side
sequential reads convert into server side random reads, and how we can
prevent that.

Thanks,
Fengguang

2009-02-13 20:08:11

by Vladislav Bolkhovitin

Subject: Re: Slow file transfer speeds with CFQ IO scheduler in some cases

Wu Fengguang, on 02/13/2009 04:57 AM wrote:
> On Thu, Feb 12, 2009 at 09:35:18PM +0300, Vladislav Bolkhovitin wrote:
>> Sorry for such a huge delay. There were many other activities I had to
>> do before + I had to be sure I didn't miss anything.
>>
>> We didn't use NFS, we used SCST (http://scst.sourceforge.net) with
>> iSCSI-SCST target driver. It has similar to NFS architecture, where N
>> threads (N=5 in this case) handle IO from remote initiators (clients)
>> coming from wire using iSCSI protocol. In addition, SCST has patch
>> called export_alloc_io_context (see
>> http://lkml.org/lkml/2008/12/10/282), which allows for the IO threads
>> queue IO using single IO context, so we can see if context RA can
>> replace grouping IO threads in single IO context.
>>
>> Unfortunately, the results are negative. We find neither any advantages
>> of context RA over current RA implementation, nor possibility for
>> context RA to replace grouping IO threads in single IO context.
>>
>> Setup on the target (server) was the following. 2 SATA drives grouped in
>> md RAID-0 with average local read throughput ~120MB/s ("dd if=/dev/zero
>> of=/dev/md0 bs=1M count=20000" outputs "20971520000 bytes (21 GB)
>> copied, 177,742 s, 118 MB/s"). The md device was partitioned on 3
>> partitions. The first partition was 10% of space in the beginning of the
>> device, the last partition was 10% of space in the end of the device,
>> the middle one was the rest in the middle of the space between them. Then the
>> first and the last partitions were exported to the initiator (client).
>> They were /dev/sdb and /dev/sdc on it correspondingly.
>
> Vladislav, Thank you for the benchmarks! I'm very interested in
> optimizing your workload and figuring out what happens underneath.
>
> Are the client and server two standalone boxes connected by GBE?

Yes, they are directly connected using GbE.

> When you set readahead sizes in the benchmarks, you are setting them
> in the server side? I.e. "linux-4dtq" is the SCST server?

Yes, it's the server. On the client all the parameters were left at their defaults.

> What's the
> client side readahead size?

Default, i.e. 128K

> It would help a lot to debug readahead if you can provide the
> server side readahead stats and trace log for the worst case.
> This will automatically answer the above questions as well as disclose
> the micro-behavior of readahead:
>
> mount -t debugfs none /sys/kernel/debug
>
> echo > /sys/kernel/debug/readahead/stats # reset counters
> # do benchmark
> cat /sys/kernel/debug/readahead/stats
>
> echo 1 > /sys/kernel/debug/readahead/trace_enable
> # do micro-benchmark, i.e. run the same benchmark for a short time
> echo 0 > /sys/kernel/debug/readahead/trace_enable
> dmesg
>
> The above readahead trace should help find out how the client side
> sequential reads convert into server side random reads, and how we can
> prevent that.

We will do it as soon as we have a free window on that system.

Thanks,
Vlad

2009-02-17 19:01:42

by Vladislav Bolkhovitin

Subject: Re: Slow file transfer speeds with CFQ IO scheduler in some cases

Wu Fengguang, on 02/13/2009 04:57 AM wrote:
> On Thu, Feb 12, 2009 at 09:35:18PM +0300, Vladislav Bolkhovitin wrote:
>> Sorry for such a huge delay. There were many other activities I had to
>> do before + I had to be sure I didn't miss anything.
>>
>> We didn't use NFS, we used SCST (http://scst.sourceforge.net) with
>> iSCSI-SCST target driver. It has similar to NFS architecture, where N
>> threads (N=5 in this case) handle IO from remote initiators (clients)
>> coming from wire using iSCSI protocol. In addition, SCST has patch
>> called export_alloc_io_context (see
>> http://lkml.org/lkml/2008/12/10/282), which allows for the IO threads
>> queue IO using single IO context, so we can see if context RA can
>> replace grouping IO threads in single IO context.
>>
>> Unfortunately, the results are negative. We find neither any advantages
>> of context RA over current RA implementation, nor possibility for
>> context RA to replace grouping IO threads in single IO context.
>>
>> Setup on the target (server) was the following. 2 SATA drives grouped in
>> md RAID-0 with average local read throughput ~120MB/s ("dd if=/dev/zero
>> of=/dev/md0 bs=1M count=20000" outputs "20971520000 bytes (21 GB)
>> copied, 177,742 s, 118 MB/s"). The md device was partitioned on 3
>> partitions. The first partition was 10% of space in the beginning of the
>> device, the last partition was 10% of space in the end of the device,
>> the middle one was the rest in the middle of the space between them. Then the
>> first and the last partitions were exported to the initiator (client).
>> They were /dev/sdb and /dev/sdc on it correspondingly.
>
> Vladislav, Thank you for the benchmarks! I'm very interested in
> optimizing your workload and figuring out what happens underneath.
>
> Are the client and server two standalone boxes connected by GBE?
>
> When you set readahead sizes in the benchmarks, you are setting them
> in the server side? I.e. "linux-4dtq" is the SCST server? What's the
> client side readahead size?
>
> It would help a lot to debug readahead if you can provide the
> server side readahead stats and trace log for the worst case.
> This will automatically answer the above questions as well as disclose
> the micro-behavior of readahead:
>
> mount -t debugfs none /sys/kernel/debug
>
> echo > /sys/kernel/debug/readahead/stats # reset counters
> # do benchmark
> cat /sys/kernel/debug/readahead/stats
>
> echo 1 > /sys/kernel/debug/readahead/trace_enable
> # do micro-benchmark, i.e. run the same benchmark for a short time
> echo 0 > /sys/kernel/debug/readahead/trace_enable
> dmesg
>
> The above readahead trace should help find out how the client side
> sequential reads convert into server side random reads, and how we can
> prevent that.

See attached. Could you comment on the logs, please, so I will also be able
to read them in the future?

Thank you,
Vlad


Attachments:
RA-debug.zip (18.16 kB)

2009-03-23 01:43:35

by Wu Fengguang

Subject: Re: Slow file transfer speeds with CFQ IO scheduler in some cases

On Thu, Mar 19, 2009 at 08:44:13PM +0300, Vladislav Bolkhovitin wrote:
> Wu Fengguang, on 02/19/2009 05:05 AM wrote:
>> On Tue, Feb 17, 2009 at 10:01:40PM +0300, Vladislav Bolkhovitin wrote:
>>> Wu Fengguang, on 02/13/2009 04:57 AM wrote:
>>>> On Thu, Feb 12, 2009 at 09:35:18PM +0300, Vladislav Bolkhovitin wrote:
>>>>> Sorry for such a huge delay. There were many other activities I
>>>>> had to do before + I had to be sure I didn't miss anything.
>>>>>
>>>>> We didn't use NFS, we used SCST (http://scst.sourceforge.net)
>>>>> with iSCSI-SCST target driver. It has similar to NFS
>>>>> architecture, where N threads (N=5 in this case) handle IO from
>>>>> remote initiators (clients) coming from wire using iSCSI
>>>>> protocol. In addition, SCST has patch called
>>>>> export_alloc_io_context (see
>>>>> http://lkml.org/lkml/2008/12/10/282), which allows for the IO
>>>>> threads queue IO using single IO context, so we can see if
>>>>> context RA can replace grouping IO threads in single IO
>>>>> context.
>>>>>
>>>>> Unfortunately, the results are negative. We find neither any
>>>>> advantages of context RA over current RA implementation, nor
>>>>> possibility for context RA to replace grouping IO threads in
>>>>> single IO context.
>>>>>
>>>>> Setup on the target (server) was the following. 2 SATA drives
>>>>> grouped in md RAID-0 with average local read throughput ~120MB/s
>>>>> ("dd if=/dev/zero of=/dev/md0 bs=1M count=20000" outputs
>>>>> "20971520000 bytes (21 GB) copied, 177,742 s, 118 MB/s"). The md
>>>>> device was partitioned on 3 partitions. The first partition was
>>>>> 10% of space in the beginning of the device, the last partition
>>>>> was 10% of space in the end of the device, the middle one was
>>>>> the rest in the middle of the space between them. Then the first and the
>>>>> last partitions were exported to the initiator (client). They
>>>>> were /dev/sdb and /dev/sdc on it correspondingly.
>>>> Vladislav, Thank you for the benchmarks! I'm very interested in
>>>> optimizing your workload and figuring out what happens underneath.
>>>>
>>>> Are the client and server two standalone boxes connected by GBE?
>>>>
>>>> When you set readahead sizes in the benchmarks, you are setting them
>>>> in the server side? I.e. "linux-4dtq" is the SCST server? What's the
>>>> client side readahead size?
>>>>
>>>> It would help a lot to debug readahead if you can provide the
>>>> server side readahead stats and trace log for the worst case.
>>>> This will automatically answer the above questions as well as disclose
>>>> the micro-behavior of readahead:
>>>>
>>>> mount -t debugfs none /sys/kernel/debug
>>>>
>>>> echo > /sys/kernel/debug/readahead/stats # reset counters
>>>> # do benchmark
>>>> cat /sys/kernel/debug/readahead/stats
>>>>
>>>> echo 1 > /sys/kernel/debug/readahead/trace_enable
>>>> # do micro-benchmark, i.e. run the same benchmark for a short time
>>>> echo 0 > /sys/kernel/debug/readahead/trace_enable
>>>> dmesg
>>>>
>>>> The above readahead trace should help find out how the client side
>>>> sequential reads convert into server side random reads, and how we can
>>>> prevent that.
>>> See attached. Could you comment on the logs, please, so I will also be
>>> able to read them in the future?
>>
>> Vladislav, thank you for the logs!
>>
>> The printk format for the following lines is:
>>
>> + printk(KERN_DEBUG "readahead-%s(pid=%d(%s), dev=%02x:%02x(%s), "
>> + "ino=%lu(%s), req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d\n",
>> + ra_pattern_names[pattern],
>> + current->pid, current->comm,
>> + MAJOR(mapping->host->i_sb->s_dev),
>> + MINOR(mapping->host->i_sb->s_dev),
>> + mapping->host->i_sb->s_id,
>> + mapping->host->i_ino,
>> + filp->f_path.dentry->d_name.name,
>> + offset, req_size,
>> + ra->start, ra->size, ra->async_size,
>> + async,
>> + actual);
>>
>> readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=10596+1, ra=10628+32-32, async=1) = 32
>> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10628+1, ra=10660+32-32, async=1) = 32
>> readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=10660+1, ra=10692+32-32, async=1) = 32
>> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=10692+1, ra=10724+32-32, async=1) = 32
>> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10724+1, ra=10756+32-32, async=1) = 32
>> readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=10756+1, ra=10788+32-32, async=1) = 32
>> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=10788+1, ra=10820+32-32, async=1) = 32
>> readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=10820+1, ra=10852+32-32, async=1) = 32
>> readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=10852+1, ra=10884+32-32, async=1) = 32
>> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10884+1, ra=10916+32-32, async=1) = 32
>> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=10916+1, ra=10948+32-32, async=1) = 32
>> readahead-marker(pid=3836(vdiskd3_1), dev=00:02(bdev), ino=0(raid-1st), req=10948+1, ra=10980+32-32, async=1) = 32
>> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10980+1, ra=11012+32-32, async=1) = 32
>> readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=11012+1, ra=11044+32-32, async=1) = 32
>> readahead-marker(pid=3836(vdiskd3_1), dev=00:02(bdev), ino=0(raid-1st), req=11044+1, ra=11076+32-32, async=1) = 32
>> readahead-subsequent(pid=3836(vdiskd3_1), dev=00:02(bdev), ino=0(raid-1st), req=11076+1, ra=11108+32-32, async=1) = 32
>> readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11108+1, ra=11140+32-32, async=1) = 32
>> readahead-subsequent(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11140+1, ra=11172+32-32, async=1) = 32
>> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11172+1, ra=11204+32-32, async=1) = 32
>> readahead-subsequent(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11204+1, ra=11236+32-32, async=1) = 32
>> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=11236+1, ra=11268+32-32, async=1) = 32
>> readahead-subsequent(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=11268+1, ra=11300+32-32, async=1) = 32
>> readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11300+1, ra=11332+32-32, async=1) = 32
>> readahead-subsequent(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11332+1, ra=11364+32-32, async=1) = 32
>> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11364+1, ra=11396+32-32, async=1) = 32
>> readahead-subsequent(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11396+1, ra=11428+32-32, async=1) = 32
>>
>> The above trace shows that the readahead logic is working pretty well,
>> however the max readahead size (32 pages) is way too small. This can
>> also be confirmed in the following table, where the average readahead
>> request size/async_size and actual readahead I/O size are all 30.
>>
>> linux-4dtq:/ # cat /sys/kernel/debug/readahead/stats
>> pattern       count  sync_count  eof_count  size  async_size  actual
>> none              0           0          0     0           0       0
>> initial0         71          71         41     4           3       2
>> initial          23          23          0     4           3       4
>> subsequent     3845           4         21    31          31      31
>> marker         4222           0          1    31          31      31
>> trail             0           0          0     0           0       0
>> oversize          0           0          0     0           0       0
>> reverse           0           0          0     0           0       0
>> stride            0           0          0     0           0       0
>> thrash            0           0          0     0           0       0
>> mmap            135         135         15    32           0      17
>> fadvise         180         180        180     0           0       1
>> random           23          23          2     1           0       1
>> all            8499         436        260    30          30      30
>>                                               ^^^^^^^^^^^^^^^^^^^^^^
>>
>> I suspect that your main performance problem comes from the small read/readahead size.
>> If larger values are used, even the vanilla 2.6.27 kernel will perform well.
>
> Yes, it was a misconfiguration on our side: the readahead size was not set
> correctly on all devices. In the correct configuration context-based RA
> shows a constant advantage over the current vanilla algorithm, but not as
> much as I would expect. It still performs considerably worse than the case
> when all the IO threads work in the same IO context. As a reminder, our
> setup and tests are described in http://lkml.org/lkml/2009/2/12/277.

Vladislav, thank you very much for the great efforts and details!

> Here are the conclusions from tests:
>
> 1. Making all IO threads work in the same IO context with CFQ (vanilla
> RA and default RA size) brings near 100% link utilization on single
> stream reads (100MB/s) and with deadline about 50% (50MB/s). I.e. there
> is 100% improvement of CFQ over deadline. With 2 read streams CFQ has
> even more advantage: >400% (23MB/s vs 5MB/s).

The ideal 2-stream throughput should be >60MB/s, so I guess there is
still room for improvement over CFQ's 23MB/s?

> 2. All IO threads work in different IO contexts. With vanilla RA and
> default RA size CFQ performs 50% worse (48MB/s), even worse than
> deadline.
>
> 3. All IO threads work in different IO contexts. With default RA size
> context RA brings on single stream 40% improvement with deadline (71MB/s
> vs 51MB/s), no improvement with cfq (48MB/s).
>
> 4. All IO threads work in different IO contexts. With higher RA sizes
> there is stable 6% improvement with context RA over vanilla RA with CFQ
> starting from 20%. Deadline performs similarly. In parallel reads
> improvement is bigger: 30% on 4M RA size with deadline (39MB/s vs 27MB/s)

Your attached readahead trace shows that context RA is submitting
perfect 256-page async readahead IOs. (However the readahead-subsequent
cache hits are puzzling.)

The vanilla RA detects concurrent streams in a passive/opportunistic way.
The context RA works in an active/guaranteed way. It's also better at
serving NFS-style cooperative IO processes. And your SCST workload
looks very much like NFS.

The one fact I cannot understand is that SCST seems to be breaking up the
client side 64K reads into server side 4K reads (above the readahead layer).
But I remember you told me that SCST doesn't do NFS rsize style split-ups.
Is this a bug? The 4K read size is too small to be CPU/network friendly...
Where are the split-up and re-assembly done? On the client side or
internally in the server?

> 5. All IO threads work in different IO contexts. The best performance was
> achieved with a maximum RA size of 4M on both RA algorithms, but starting
> from size 1M it grows very slowly.

True.

> 6. Unexpected result. In the case when all IO threads work in the same IO
> context with CFQ, increasing RA size *decreases* throughput. I think this
> is because RA requests are performed as single big READ requests, while
> requests coming from remote clients are much smaller in size (up to
> 256K), so, while the data read by RA is transferred to the remote client at
> 100MB/s, the backstorage media gets rotated a bit, so the next
> read request must wait the rotation latency (~0.1ms on 7200RPM). This
> conforms well with (3) above, where context RA has a 40% advantage over
> vanilla RA with the default RA size, but a much smaller one with higher RA.

Maybe. But the readahead IOs (as shown by the trace) are _async_ ones...

> Bottom line IMHO conclusions:
>
> 1. Context RA should be considered after additional examination to
> replace current RA algorithm in the kernel

That's my plan to push context RA to mainline. And thank you very much
for providing and testing out a real world application for it!

> 2. It would be better to increase default RA size to 1024K

That's a long-standing wish, to increase the default RA size. However I have a
vague feeling that it would be better to first make the lower layers
smarter about max_sectors_kb granularity request splitting and batching.

> *AND* one of the following:
>
> 3.1. All RA requests should be split in smaller requests with size up to
> 256K, which should not be merged with any other request

Are you referring to max_sectors_kb?

What's your max_sectors_kb and nr_requests? Something like

grep -r . /sys/block/sda/queue/

> OR
>
> 3.2. New RA requests should be sent before the previous one completes, so
> as not to let the storage device rotate too far and need a full rotation
> to serve the next request.

Linus has an mmap readahead cleanup patch that can do this. It
basically replaces a {find_lock_page(); readahead();} sequence with
{find_get_page(); readahead(); lock_page();}.

I'll try to push that patch into mainline.
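
For illustration, here is a minimal sketch of that reordering (simplified,
not Linus's actual patch; do_readahead() stands in for whatever readahead
call the fault path makes). The point is that readahead is issued before
we block on the page lock, so the next window is already in flight while
the reader sleeps:

	/* old sequence: block on the page lock first, readahead late */
	page = find_lock_page(mapping, index);
	if (!page)
		do_readahead(mapping, file, index);

	/* new sequence: take a reference without locking, kick off
	 * readahead immediately, then wait for the lock with I/O
	 * already in flight */
	page = find_get_page(mapping, index);
	do_readahead(mapping, file, index);
	if (page)
		lock_page(page);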

> I like suggestion 3.1 a lot more, since it should be simple to implement
> and has the following 2 positive side effects:
>
> 1. It would allow the negative effect of a higher RA size on I/O
> latency to be minimized by letting CFQ switch to requests that have been
> waiting too long, when necessary.
>
> 2. It would allow better request pipelining, which is very important to
> minimize uplink latency for synchronous requests (i.e. with only one IO
> request at a time, the next request is issued when the previous one has
> completed). You can see in http://www.3ware.com/kb/article.aspx?id=11050
> that 3ware recommends, for maximum performance, setting max_sectors_kb as
> low as *64K* with 16MB RA, which maximizes command pipelining. And this
> suggestion really works, improving throughput by 50-100%!
>
> Here are the raw numbers. I also attached context RA debug output for
> 2MB RA size case for your viewing pleasure.

Thank you, it helped a lot!

(Can I wish for CONFIG_PRINTK_TIME=y next time? :-)

Thanks,
Fengguang

> --------------------------------------------------------------------
>
> Performance baseline: all IO threads work in the same IO context,
> current vanilla RA, default RA size:
>
> CFQ scheduler:
>
> #dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 102 MB/s
> b) 102 MB/s
> c) 102 MB/s
>
> Run at the same time:
> #while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> #dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 21,6 MB/s
> b) 22,8 MB/s
> c) 24,1 MB/s
> d) 23,1 MB/s
>
> Deadline scheduler:
>
> #dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 51,1 MB/s
> b) 51,4 MB/s
> c) 51,1 MB/s
>
> Run at the same time:
> #while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> #dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 4,7 MB/s
> b) 4,6 MB/s
> c) 4,8 MB/s
>
> --------------------------------------------------------------------
>
> RA performance baseline: all IO threads work in different IO contexts,
> current vanilla RA, default RA size:
>
> CFQ:
>
> #dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 48,6 MB/s
> b) 49,2 MB/s
> c) 48,9 MB/s
>
> Run at the same time:
> #while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> #dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 4,2 MB/s
> b) 3,9 MB/s
> c) 4,1 MB/s
>
> Deadline:
>
> 1) dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 53,2 MB/s
> b) 51,8 MB/s
> c) 51,6 MB/s
>
> Run at the same time:
> #while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> #dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 5,1 MB/s
> b) 4,6 MB/s
> c) 4,8 MB/s
>
> --------------------------------------------------------------------
>
> Context RA, all IO threads work in different IO contexts, default RA size:
>
> CFQ:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 47,9 MB/s
> b) 48,2 MB/s
> c) 48,1 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 3,5 MB/s
> b) 3,6 MB/s
> c) 3,8 MB/s
>
> Deadline:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 72,4 MB/s
> b) 68,3 MB/s
> c) 71,3 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 4,3 MB/s
> b) 5,0 MB/s
> c) 4,8 MB/s
>
> --------------------------------------------------------------------
>
> Vanilla RA, all IO threads work in different IO contexts, various RA sizes:
>
> CFQ:
>
> RA 512K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 60,5 MB/s
> b) 59,3 MB/s
> c) 59,7 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 9,4 MB/s
> b) 9,4 MB/s
> c) 9,1 MB/s
>
> --- RA 1024K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 74,7 MB/s
> b) 73,2 MB/s
> c) 74,1 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 13,7 MB/s
> b) 13,6 MB/s
> c) 13,1 MB/s
>
> --- RA 2048K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 76,7 MB/s
> b) 76,8 MB/s
> c) 76,6 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 21,8 MB/s
> b) 22,1 MB/s
> c) 20,3 MB/s
>
> --- RA 4096K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 80,8 MB/s
> b) 80.8 MB/s
> c) 80,3 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 29,6 MB/s
> b) 29,4 MB/s
> c) 27,2 MB/s
>
> === Deadline:
>
> RA 512K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 68,4 MB/s
> b) 67,0 MB/s
> c) 67,6 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 8,8 MB/s
> b) 8,9 MB/s
> c) 8,7 MB/s
>
> --- RA 1024K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 81,0 MB/s
> b) 82,4 MB/s
> c) 81,7 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 13,5 MB/s
> b) 13,1 MB/s
> c) 12,9 MB/s
>
> --- RA 2048K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 81,1 MB/s
> b) 80,1 MB/s
> c) 81,8 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 21,9 MB/s
> b) 20,7 MB/s
> c) 21,3 MB/s
>
> --- RA 4096K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 83,1 MB/s
> b) 82,7 MB/s
> c) 82,9 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 27,9 MB/s
> b) 23,5 MB/s
> c) 27,6 MB/s
>
> --------------------------------------------------------------------
>
> Context RA, all IO threads work in different IO contexts, various RA sizes:
>
> CFQ:
>
> RA 512K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 63,7 MB/s
> b) 63,5 MB/s
> c) 62,8 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 7,1 MB/s
> b) 6,7 MB/s
> c) 7,0 MB/s
> d) 6,9 MB/s
>
> --- RA 1024K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 81,1 MB/s
> b) 81,8 MB/s
> c) MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 14,1 MB/s
> b) 14,0 MB/s
> c) 14,1 MB/s
>
> --- RA 2048K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 81,6 MB/s
> b) 81,4 MB/s
> c) 86,0 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 22,3 MB/s
> b) 21,5 MB/s
> c) 21,7 MB/s
>
> --- RA 4096K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 83,1 MB/s
> b) 83,5 MB/s
> c) 82,9 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 32,8 MB/s
> b) 32,7 MB/s
> c) 30,2 MB/s
>
> === Deadline:
>
> RA 512K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 68,8 MB/s
> b) 68,9 MB/s
> c) 69,0 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 8,7 MB/s
> b) 9,0 MB/s
> c) 8,9 MB/s
>
> --- RA 1024K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 83,5 MB/s
> b) 83,1 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 14,0 MB/s
> b) 13.9 MB/s
> c) 13,8 MB/s
>
> --- RA 2048K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 82,6 MB/s
> b) 82,4 MB/s
> c) 81,9 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 21,9 MB/s
> b) 23,1 MB/s
> c) 17,8 MB/s
> d) 21,1 MB/s
>
> --- RA 4096K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 84,5 MB/s
> b) 83,7 MB/s
> c) 83,8 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 39,9 MB/s
> b) 39,5 MB/s
> c) 38,4 MB/s
>
> --------------------------------------------------------------------
>
> all IO threads work in the same IO context, context RA, various RA sizes:
>
> === CFQ:
>
> --- RA 512K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 86,4 MB/s
> b) 87,9 MB/s
> c) 86,7 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 17,8 MB/s
> b) 18,3 MB/s
> c) 17,7 MB/s
>
> --- RA 1024K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 83,3 MB/s
> b) 81,6 MB/s
> c) 81,9 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 22,1 MB/s
> b) 21,5 MB/s
> c) 21,2 MB/s
>
> --- RA 2048K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 81,1 MB/s
> b) 81,0 MB/s
> c) 81,6 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 22,2 MB/s
> b) 20,2 MB/s
> c) 20,9 MB/s
>
> --- RA 4096K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 83,4 MB/s
> b) 82,8 MB/s
> c) 83,3 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 22,6 MB/s
> b) 23,4 MB/s
> c) 21,8 MB/s
>
> === Deadline:
>
> --- RA 512K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 70,0 MB/s
> b) 70,7 MB/s
> c) 69,7 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 9,1 MB/s
> b) 8,3 MB/s
> c) 8,4 MB/s
>
> --- RA 1024K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 84,3 MB/s
> b) 83,2 MB/s
> c) 83,3 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 13,9 MB/s
> b) 13,1 MB/s
> c) 13,4 MB/s
>
> --- RA 2048K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 82,6 MB/s
> b) 82,1 MB/s
> c) 82,3 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 21,6 MB/s
> b) 22,4 MB/s
> c) 21,3 MB/s
>
> --- RA 4096K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 83,8 MB/s
> b) 83,8 MB/s
> c) 83,1 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 39,5 MB/s
> b) 39,6 MB/s
> c) 37,0 MB/s
>
> Thanks,
> Vlad



2009-03-19 17:43:30

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: Slow file transfer speeds with CFQ IO scheduler in some cases

Wu Fengguang, on 02/19/2009 05:05 AM wrote:
> On Tue, Feb 17, 2009 at 10:01:40PM +0300, Vladislav Bolkhovitin wrote:
>> Wu Fengguang, on 02/13/2009 04:57 AM wrote:
>>> On Thu, Feb 12, 2009 at 09:35:18PM +0300, Vladislav Bolkhovitin wrote:
>>>> Sorry for such a huge delay. There were many other activities I had
>>>> to do before + I had to be sure I didn't miss anything.
>>>>
>>>> We didn't use NFS, we used SCST (http://scst.sourceforge.net) with
>>>> iSCSI-SCST target driver. It has similar to NFS architecture, where N
>>>> threads (N=5 in this case) handle IO from remote initiators
>>>> (clients) coming from wire using iSCSI protocol. In addition, SCST
>>>> has patch called export_alloc_io_context (see
>>>> http://lkml.org/lkml/2008/12/10/282), which allows for the IO threads
>>>> queue IO using single IO context, so we can see if context RA can
>>>> replace grouping IO threads in single IO context.
>>>>
>>>> Unfortunately, the results are negative. We find neither any
>>>> advantages of context RA over current RA implementation, nor
>>>> possibility for context RA to replace grouping IO threads in single
>>>> IO context.
>>>>
>>>> Setup on the target (server) was the following. 2 SATA drives grouped
>>>> in md RAID-0 with average local read throughput ~120MB/s ("dd
>>>> if=/dev/zero of=/dev/md0 bs=1M count=20000" outputs "20971520000
>>>> bytes (21 GB) copied, 177,742 s, 118 MB/s"). The md device was
>>>> partitioned on 3 partitions. The first partition was 10% of space in
>>>> the beginning of the device, the last partition was 10% of space in
>>>> the end of the device, the middle one was the rest in the middle of
>>>> the space them. Then the first and the last partitions were exported
>>>> to the initiator (client). They were /dev/sdb and /dev/sdc on it
>>>> correspondingly.
>>> Vladislav, Thank you for the benchmarks! I'm very interested in
>>> optimizing your workload and figuring out what happens underneath.
>>>
>>> Are the client and server two standalone boxes connected by GBE?
>>>
>>> When you set readahead sizes in the benchmarks, you are setting them
>>> in the server side? I.e. "linux-4dtq" is the SCST server? What's the
>>> client side readahead size?
>>>
>>> It would help a lot to debug readahead if you can provide the
>>> server side readahead stats and trace log for the worst case.
>>> This will automatically answer the above questions as well as disclose
>>> the micro-behavior of readahead:
>>>
>>> mount -t debugfs none /sys/kernel/debug
>>>
>>> echo > /sys/kernel/debug/readahead/stats # reset counters
>>> # do benchmark
>>> cat /sys/kernel/debug/readahead/stats
>>>
>>> echo 1 > /sys/kernel/debug/readahead/trace_enable
>>> # do micro-benchmark, i.e. run the same benchmark for a short time
>>> echo 0 > /sys/kernel/debug/readahead/trace_enable
>>> dmesg
>>>
>>> The above readahead trace should help find out how the client side
>>> sequential reads convert into server side random reads, and how we can
>>> prevent that.
>> See attached. Could you comment the logs, please, so I will also be able
>> to read them in the future?
>
> Vladislav, thank you for the logs!
>
> The printk format for the following lines is:
>
> + printk(KERN_DEBUG "readahead-%s(pid=%d(%s), dev=%02x:%02x(%s), "
> + "ino=%lu(%s), req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d\n",
> + ra_pattern_names[pattern],
> + current->pid, current->comm,
> + MAJOR(mapping->host->i_sb->s_dev),
> + MINOR(mapping->host->i_sb->s_dev),
> + mapping->host->i_sb->s_id,
> + mapping->host->i_ino,
> + filp->f_path.dentry->d_name.name,
> + offset, req_size,
> + ra->start, ra->size, ra->async_size,
> + async,
> + actual);
>
> readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=10596+1, ra=10628+32-32, async=1) = 32
> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10628+1, ra=10660+32-32, async=1) = 32
> readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=10660+1, ra=10692+32-32, async=1) = 32
> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=10692+1, ra=10724+32-32, async=1) = 32
> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10724+1, ra=10756+32-32, async=1) = 32
> readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=10756+1, ra=10788+32-32, async=1) = 32
> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=10788+1, ra=10820+32-32, async=1) = 32
> readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=10820+1, ra=10852+32-32, async=1) = 32
> readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=10852+1, ra=10884+32-32, async=1) = 32
> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10884+1, ra=10916+32-32, async=1) = 32
> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=10916+1, ra=10948+32-32, async=1) = 32
> readahead-marker(pid=3836(vdiskd3_1), dev=00:02(bdev), ino=0(raid-1st), req=10948+1, ra=10980+32-32, async=1) = 32
> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10980+1, ra=11012+32-32, async=1) = 32
> readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=11012+1, ra=11044+32-32, async=1) = 32
> readahead-marker(pid=3836(vdiskd3_1), dev=00:02(bdev), ino=0(raid-1st), req=11044+1, ra=11076+32-32, async=1) = 32
> readahead-subsequent(pid=3836(vdiskd3_1), dev=00:02(bdev), ino=0(raid-1st), req=11076+1, ra=11108+32-32, async=1) = 32
> readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11108+1, ra=11140+32-32, async=1) = 32
> readahead-subsequent(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11140+1, ra=11172+32-32, async=1) = 32
> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11172+1, ra=11204+32-32, async=1) = 32
> readahead-subsequent(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11204+1, ra=11236+32-32, async=1) = 32
> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=11236+1, ra=11268+32-32, async=1) = 32
> readahead-subsequent(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=11268+1, ra=11300+32-32, async=1) = 32
> readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11300+1, ra=11332+32-32, async=1) = 32
> readahead-subsequent(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11332+1, ra=11364+32-32, async=1) = 32
> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11364+1, ra=11396+32-32, async=1) = 32
> readahead-subsequent(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11396+1, ra=11428+32-32, async=1) = 32
>
> The above trace shows that the readahead logic is working pretty well,
> however the max readahead size(32 pages) is way too small. This can
> also be confirmed in the following table, where the average readahead
> request size/async_size and actual readahead I/O size are all 30.
>
> linux-4dtq:/ # cat /sys/kernel/debug/readahead/stats
> pattern count sync_count eof_count size async_size actual
> none 0 0 0 0 0 0
> initial0 71 71 41 4 3 2
> initial 23 23 0 4 3 4
> subsequent 3845 4 21 31 31 31
> marker 4222 0 1 31 31 31
> trail 0 0 0 0 0 0
> oversize 0 0 0 0 0 0
> reverse 0 0 0 0 0 0
> stride 0 0 0 0 0 0
> thrash 0 0 0 0 0 0
> mmap 135 135 15 32 0 17
> fadvise 180 180 180 0 0 1
> random 23 23 2 1 0 1
> all 8499 436 260 30 30 30
> ^^^^^^^^^^^^^^^^^^^^^^^^
>
> I suspect that your main performance problem comes from the small read/readahead size.
> If larger values are used, even the vanilla 2.6.27 kernel will perform well.

Yes, it was a misconfiguration on our side: the readahead size was not set
correctly on all devices. In the correct configuration context based RA
shows a consistent advantage over the current vanilla algorithm, but not as
much as I would expect. It still performs considerably worse than the case
when all the IO threads work in the same IO context. As a reminder, our
setup and tests are described in http://lkml.org/lkml/2009/2/12/277.
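
For reference, the RA size in the runs below was adjusted per device with
something like the following (shown for the 2M case; the value has to be set
on every device involved, and the device names here are only examples):

blockdev --setra 4096 /dev/md0                    # 4096 x 512-byte sectors = 2M
echo 2048 > /sys/block/sdb/queue/read_ahead_kb    # the same knob via sysfs, in KB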

Here are the conclusions from tests:

1. Making all IO threads work in the same IO context with CFQ (vanilla
RA and default RA size) brings near 100% link utilization on single
stream reads (100MB/s), versus about 50% (50MB/s) with deadline, i.e.
a 100% improvement of CFQ over deadline. With 2 read streams CFQ has an
even bigger advantage: >400% (23MB/s vs 5MB/s).

2. All IO threads work in different IO contexts. With vanilla RA and
default RA size CFQ performs 50% worse (48MB/s), even worse than deadline.

3. All IO threads work in different IO contexts. With the default RA size
context RA brings a 40% single-stream improvement with deadline (71MB/s
vs 51MB/s) and no improvement with CFQ (48MB/s).

4. All IO threads work in different IO contexts. With higher RA sizes
there is a stable 6% improvement of context RA over vanilla RA with CFQ,
starting from 20%. Deadline performs similarly. For parallel reads the
improvement is bigger: 30% at a 4M RA size with deadline (39MB/s vs 27MB/s).

5. All IO threads work in different IO contexts. The best performance is
achieved with a maximum RA size of 4M for both RA algorithms, but starting
from 1M the gain grows very slowly.

6. Unexpected result. In the case when all IO threads work in the same IO
context with CFQ, increasing the RA size *decreases* throughput. I think this
is because RA requests are performed as single big READ requests, while
requests coming from remote clients are much smaller in size (up to
256K), so while the data read by RA is transferred to the remote client at
100MB/s, the backstorage media rotates a bit, and the next read request
must wait out the rotation latency (~0.1ms on 7200RPM). This agrees well
with (3) above, where context RA has a 40% advantage over vanilla RA at the
default RA size, but a much smaller one at higher RA sizes.

Bottom line IMHO conclusions:

1. Context RA should be considered, after additional examination, as a
replacement for the current RA algorithm in the kernel

2. It would be better to increase default RA size to 1024K

*AND* one of the following:

3.1. All RA requests should be split into smaller requests of up to
256K, which should not be merged with any other request

OR

3.2. New RA requests should be sent before the previous one completes, so
the storage device does not rotate so far that a full rotation is needed
to serve the next request.

I like suggestion 3.1 a lot more, since it should be simple to implement
and has the following 2 positive side effects:

1. It would minimize the negative effect of a higher RA size on
I/O delay latency by allowing CFQ to switch to requests that have been
waiting too long, when necessary.

2. It would allow better request pipelining, which is very important to
minimize uplink latency for synchronous requests (i.e. with only one IO
request at a time, the next request is issued only when the previous one
has completed). You can see in http://www.3ware.com/kb/article.aspx?id=11050
that 3ware recommends, for maximum performance, setting max_sectors_kb as
low as *64K* with 16MB RA. That maximizes command pipelining. And this
suggestion really works, improving throughput by 50-100%!
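
As a rough approximation of suggestion 3.1 with the knobs that already exist
(note this only caps the request size, it does not literally forbid merging
below that cap; the device name is only an example):

echo 256 > /sys/block/sdX/queue/max_sectors_kb     # cap each request at 256K
echo 1024 > /sys/block/sdX/queue/read_ahead_kb     # 1M readahead window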

Here are the raw numbers. I also attached the context RA debug output for
the 2MB RA size case for your viewing pleasure.

--------------------------------------------------------------------

Performance baseline: all IO threads work in the same IO context,
current vanilla RA, default RA size:

CFQ scheduler:

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 102 MB/s
b) 102 MB/s
c) 102 MB/s

Run at the same time:
#while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 21,6 MB/s
b) 22,8 MB/s
c) 24,1 MB/s
d) 23,1 MB/s

Deadline scheduler:

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 51,1 MB/s
b) 51,4 MB/s
c) 51,1 MB/s

Run at the same time:
#while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 4,7 MB/s
b) 4,6 MB/s
c) 4,8 MB/s

--------------------------------------------------------------------

RA performance baseline: all IO threads work in different IO contexts,
current vanilla RA, default RA size:

CFQ:

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 48,6 MB/s
b) 49,2 MB/s
c) 48,9 MB/s

Run at the same time:
#while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 4,2 MB/s
b) 3,9 MB/s
c) 4,1 MB/s

Deadline:

1) dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 53,2 MB/s
b) 51,8 MB/s
c) 51,6 MB/s

Run at the same time:
#while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 5,1 MB/s
b) 4,6 MB/s
c) 4,8 MB/s

--------------------------------------------------------------------

Context RA, all IO threads work in different IO contexts, default RA size:

CFQ:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 47,9 MB/s
b) 48,2 MB/s
c) 48,1 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 3,5 MB/s
b) 3,6 MB/s
c) 3,8 MB/s

Deadline:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 72,4 MB/s
b) 68,3 MB/s
c) 71,3 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 4,3 MB/s
b) 5,0 MB/s
c) 4,8 MB/s

--------------------------------------------------------------------

Vanilla RA, all IO threads work in different IO contexts, various RA sizes:

CFQ:

RA 512K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 60,5 MB/s
b) 59,3 MB/s
c) 59,7 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 9,4 MB/s
b) 9,4 MB/s
c) 9,1 MB/s

--- RA 1024K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 74,7 MB/s
b) 73,2 MB/s
c) 74,1 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 13,7 MB/s
b) 13,6 MB/s
c) 13,1 MB/s

--- RA 2048K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 76,7 MB/s
b) 76,8 MB/s
c) 76,6 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 21,8 MB/s
b) 22,1 MB/s
c) 20,3 MB/s

--- RA 4096K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 80,8 MB/s
b) 80.8 MB/s
c) 80,3 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 29,6 MB/s
b) 29,4 MB/s
c) 27,2 MB/s

=== Deadline:

RA 512K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 68,4 MB/s
b) 67,0 MB/s
c) 67,6 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 8,8 MB/s
b) 8,9 MB/s
c) 8,7 MB/s

--- RA 1024K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 81,0 MB/s
b) 82,4 MB/s
c) 81,7 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 13,5 MB/s
b) 13,1 MB/s
c) 12,9 MB/s

--- RA 2048K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 81,1 MB/s
b) 80,1 MB/s
c) 81,8 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 21,9 MB/s
b) 20,7 MB/s
c) 21,3 MB/s

--- RA 4096K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 83,1 MB/s
b) 82,7 MB/s
c) 82,9 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 27,9 MB/s
b) 23,5 MB/s
c) 27,6 MB/s

--------------------------------------------------------------------

Context RA, all IO threads work in different IO contexts, various RA sizes:

CFQ:

RA 512K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 63,7 MB/s
b) 63,5 MB/s
c) 62,8 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 7,1 MB/s
b) 6,7 MB/s
c) 7,0 MB/s
d) 6,9 MB/s

--- RA 1024K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 81,1 MB/s
b) 81,8 MB/s
c) MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 14,1 MB/s
b) 14,0 MB/s
c) 14,1 MB/s

--- RA 2048K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 81,6 MB/s
b) 81,4 MB/s
c) 86,0 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 22,3 MB/s
b) 21,5 MB/s
c) 21,7 MB/s

--- RA 4096K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 83,1 MB/s
b) 83,5 MB/s
c) 82,9 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 32,8 MB/s
b) 32,7 MB/s
c) 30,2 MB/s

=== Deadline:

RA 512K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 68,8 MB/s
b) 68,9 MB/s
c) 69,0 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 8,7 MB/s
b) 9,0 MB/s
c) 8,9 MB/s

--- RA 1024K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 83,5 MB/s
b) 83,1 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 14,0 MB/s
b) 13.9 MB/s
c) 13,8 MB/s

--- RA 2048K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 82,6 MB/s
b) 82,4 MB/s
c) 81,9 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 21,9 MB/s
b) 23,1 MB/s
c) 17,8 MB/s
d) 21,1 MB/s

--- RA 4096K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 84,5 MB/s
b) 83,7 MB/s
c) 83,8 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 39,9 MB/s
b) 39,5 MB/s
c) 38,4 MB/s

--------------------------------------------------------------------

all IO threads work in the same IO context, context RA, various RA sizes:

=== CFQ:

--- RA 512K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 86,4 MB/s
b) 87,9 MB/s
c) 86,7 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 17,8 MB/s
b) 18,3 MB/s
c) 17,7 MB/s

--- RA 1024K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 83,3 MB/s
b) 81,6 MB/s
c) 81,9 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 22,1 MB/s
b) 21,5 MB/s
c) 21,2 MB/s

--- RA 2048K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 81,1 MB/s
b) 81,0 MB/s
c) 81,6 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 22,2 MB/s
b) 20,2 MB/s
c) 20,9 MB/s

--- RA 4096K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 83,4 MB/s
b) 82,8 MB/s
c) 83,3 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 22,6 MB/s
b) 23,4 MB/s
c) 21,8 MB/s

=== Deadline:

--- RA 512K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 70,0 MB/s
b) 70,7 MB/s
c) 69,7 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 9,1 MB/s
b) 8,3 MB/s
c) 8,4 MB/s

--- RA 1024K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 84,3 MB/s
b) 83,2 MB/s
c) 83,3 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 13,9 MB/s
b) 13,1 MB/s
c) 13,4 MB/s

--- RA 2048K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 82,6 MB/s
b) 82,1 MB/s
c) 82,3 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 21,6 MB/s
b) 22,4 MB/s
c) 21,3 MB/s

--- RA 4096K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 83,8 MB/s
b) 83,8 MB/s
c) 83,1 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 39,5 MB/s
b) 39,6 MB/s
c) 37,0 MB/s

Thanks,
Vlad


Attachments:
ra.tar.bz2 (25.70 kB)

2009-03-20 08:52:57

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: Slow file transfer speeds with CFQ IO scheduler in some cases

Vladislav Bolkhovitin, on 03/19/2009 08:44 PM wrote:
> 6. Unexpected result. In case, when all IO threads work in the same IO
> context with CFQ increasing RA size *decreases* throughput. I think this
> is, because RA requests performed as single big READ requests, while
> requests coming from remote clients are much smaller in size (up to
> 256K), so, when the read by RA data transferred to the remote client on
> 100MB/s speed, the backstorage media gets rotated a bit, so the next
> read request must wait the rotation latency (~0.1ms on 7200RPM).
                                               ^^^^^^^^^^^^^^^^^
Oops, it's RP*M*, hence the full rotation latency is 60 times more,
i.e. 8.3 ms.

Sorry,
Vlad


2009-04-24 08:44:08

by Wu Fengguang

[permalink] [raw]
Subject: Re: Slow file transfer speeds with CFQ IO scheduler in some cases

On Tue, Apr 21, 2009 at 10:18:25PM +0400, Vladislav Bolkhovitin wrote:
> Wu Fengguang, on 03/23/2009 04:42 AM wrote:
>>> Here are the conclusions from tests:
>>>
>>> 1. Making all IO threads work in the same IO context with CFQ
>>> (vanilla RA and default RA size) brings near 100% link utilization
>>> on single stream reads (100MB/s) and with deadline about 50%
>>> (50MB/s). I.e. there is 100% improvement of CFQ over deadline. With
>>> 2 read streams CFQ has ever more advantage: >400% (23MB/s vs 5MB/s).
>>
>> The ideal 2-stream throughput should be >60MB/s, so I guess there are
>> still room of improvements for the CFQ's 23MB/s?
>
> Yes, plenty. But, I think, not in CFQ, but in readahead. With RA 4096K
> we were able to get ~40MB/s, see the previous e-mail and below.

Why do you think it's in readahead? The readahead accounting/tracing data
you provided in a previous email shows context readahead to be working just fine:

On initiator run benchmark:
dd if=/dev/sdb of=/dev/null bs=64K count=8000
dd if=/dev/sdc of=/dev/null bs=64K count=8000

linux-4dtq:/.kernels/scst-kernel-4 # cat /sys/kernel/debug/readahead/stats
pattern    count  sync_count  eof_count  size  async_size  actual
initial0       4           4          1     4           3       3
subsequent   288           0          0   470         251     251
marker       722           0          0   255         255     255
mmap           2           2          2    32           0      12
fadvise        2           2          2     0           0      69
all         1018           8          5   314         252     252

The last line says the server side did 1018 readaheads with an average
actual size of 252 pages, i.e. 1008K, which is pretty close to the max
readahead size of 1024K.

Raw performance numbers are not enough here. The readahead size and
the actual IO size, and possibly the readahead traces and IO traces,
are the most vivid and ultimate way to judge who is behaving wrongly
and to provide hints toward a solution.

>> The one fact I cannot understand is that SCST seems to breaking up the
>> client side 64K reads into server side 4K reads(above readahead layer).
>> But I remember you told me that SCST don't do NFS rsize style split-ups.
>> Is this a bug? The 4K read size is too small to be CPU/network friendly...
>> Where are the split-up and re-assemble done? On the client side or
>> internal to the server?
>
> This is on the client's side. See the target's log in the attachment.
>
> Here is the summary of commands data sizes, came to the server, for "dd
> if=/dev/sdb of=/dev/null bs=64K count=200" ran on the client:
>
> 4K 11
> 8K 0
> 16K 0
> 32K 0
> 64K 0
> 128K 81
> 256K 8
> 512K 0
> 1024K 0
> 2048K 0
> 4096K 0
>
> There's a way too many 4K requests. Apparently, the requests submission
> path isn't optimal.

So it's the client side that splits the 128K (default sized) readahead IO
into 4K-256K SCST requests that are sent over the network?

It looks good enough since it's mostly 128K requests. However this is
still not in line with previous data:

req=0+1, ra=0+4-3, async=0) = 4
req=1+1, ra=4+16-16, async=1) = 16
req=4+1, ra=20+32-32, async=1) = 32
req=20+1, ra=52+64-64, async=1) = 64
req=52+1, ra=116+128-128, async=1) = 128
req=116+1, ra=244+256-256, async=1) = 256
req=244+1, ra=500+256-256, async=1) = 256
req=500+1, ra=756+256-256, async=1) = 256
req=756+1, ra=1012+256-256, async=1) = 256
req=1012+1, ra=1268+256-256, async=1) = 256
req=1268+1, ra=1268+512-256, async=1) = 256
req=1524+1, ra=1780+256-256, async=1) = 256
req=1780+1, ra=2036+256-256, async=1) = 256
req=2036+1, ra=2292+256-256, async=1) = 256
req=2292+1, ra=2548+256-256, async=1) = 256
req=2548+1, ra=2804+256-256, async=1) = 256
req=2804+1, ra=2804+512-256, async=1) = 256
req=3060+1, ra=3060+512-256, async=1) = 256
req=3316+1, ra=3572+256-256, async=1) = 256
req=3572+1, ra=3828+256-256, async=1) = 256
req=3828+1, ra=4084+256-256, async=1) = 256
req=4084+1, ra=4340+256-256, async=1) = 256
req=4340+1, ra=4596+256-256, async=1) = 256
req=4596+1, ra=4596+512-256, async=1) = 256
req=4852+1, ra=5108+256-256, async=1) = 256
req=5108+1, ra=5108+512-256, async=1) = 256
[and 480 more lines of pattern "req=*+1,...= 256"]

Basically, the server side ondemand_readahead()
- perceived *all* of the read requests as roughly 1-page ones
- *always* submitted 256-page readahead IO
(except during the initial size ramp-up process)

Maybe the runtime condition is somehow different for the above and the
below traces.

[ 457.003661] [4208]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
[ 457.008686] [4210]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
[ 457.010915] [4210]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
[ 457.015238] [4209]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
[ 457.015419] [4211]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
[ 457.021696] [4208]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
[ 457.024205] [4207]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
[ 457.028455] [4210]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
[ 457.028695] [4210]: vdisk_exec_read:2198:reading(iv_count 31, full_len 126976)
[ 457.033565] [4211]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
[ 457.035750] [4209]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
[ 457.037323] [4208]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
[ 457.041132] [4207]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
[ 457.043251] [4207]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
[ 457.045455] [4210]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
[ 457.047949] [4211]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
[ 457.051463] [4209]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
[ 457.052136] [4208]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
[ 457.057734] [4207]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
[ 457.058007] [4207]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
[ 457.063185] [4210]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
[ 457.063404] [4210]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
[ 457.068749] [4211]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
[ 457.069007] [4211]: vdisk_exec_read:2198:reading(iv_count 31, full_len 126976)
[ 457.071339] [4208]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
[ 457.075250] [4207]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
[ 457.077416] [4207]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
[ 457.079892] [4209]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
[ 457.080492] [4210]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)

There is an interesting pattern in the above trace: it tends to be
32-page aligned. An unaligned 1-page read will always be followed by a
31-page or 63-page read, which again makes it aligned to 32-page
boundaries :-)

> Actually, this is another question I wanted to rise from the very
> beginning.
>
>>> 6. Unexpected result. In case, when ll IO threads work in the same
>>> IO context with CFQ increasing RA size *decreases* throughput. I
>>> think this is, because RA requests performed as single big READ
>>> requests, while requests coming from remote clients are much smaller
>>> in size (up to 256K), so, when the read by RA data transferred to
>>> the remote client on 100MB/s speed, the backstorage media gets
>>> rotated a bit, so the next read request must wait the rotation
>>> latency (~0.1ms on 7200RPM). This is well conforms with (3) above,
>>> when context RA has 40% advantage over vanilla RA with default RA,
>>> but much smaller with higher RA.
>>
>> Maybe. But the readahead IOs (as showed by the trace) are _async_ ones...
>
> That doesn't matter, because new request from the client won't come
> until all data for the previous one transferred to it. And that transfer
> is done on a very *finite* speed.

Async readahead does matter. The readahead pipeline is all you need to
*fully* fill the storage/network channels (with good readahead size).

The client side request time doesn't matter (i.e. it is not too late to
impact performance).

Let's look at the default case, where
- client readahead size = 128K
- server readahead size = 128K
- no split of request size by SCST or max_sectors_kb
and assume
- application read size = huge
- application read bw = infinite
- disk bw = network bw = limited

Imagine three 128K chunks in the file:

            chunk A                         chunk B                         chunk C
================================--------------------------------________________________________
 application read blocked here      client side readahead IO        server side readahead IO
      (ongoing network IO)            (ongoing network IO)              (ongoing disk IO)

Normally the application can read and consume data very fast.
So most of the time it will be blocked somewhere on network IO.
Let's assume it is blocking at the first page of chunk A.

Just before the application blocks on chunk A, it will trigger a
(client side) readahead request for chunk B, which in turn will
trigger a (server side) readahead request for chunk C.

When disk bw = network bw, it _will_ quickly enter a steady state, in
which the network transfer of B and disk read of C proceed concurrently.

The below tables demonstrate the steps into this pipelined steady state.
(we ignore disk seek time for simplicity)

1) no request merging in SCST and block layer:
time 0 3 5 6 7
client ab ab ABcd ABCde ABCDef
server abc ABC ABCde ABCDef ABCDEfg
net transfers ... AB C D
disk transfers ABC .. D E

2) disk merge, no SCST merge:
time 0 3 5 6 7 8 9
client ab ab ABcd ABCde ABCde ABCDef ABCDEfg
server abc ABC ABCde ABCdef ABCDEf ABCDEFg ABCDEFGh
net transfers ... AB C . D E
disk transfers ABC .. DE F G

3) application (limited) read speed = 2 * disk bw:
time 0 3 5 5.5 6 6.5 7
client ab ab AB ABc ABcd ABCd ABCde
server abc ABC ABC ABCd ABCDe ABCDe ABCDEf
net transfers ... AB . C D
disk transfers ABC .. . D E

Legends:
- lower case 'a': request for chunk A submitted, but IO has not completed
- upper case 'A': request for chunk A submitted, and IO has completed
- dot '.': net/disk idled for one time slot

Annotations for case (2):
T0: single network IO request for chunk A&B, or AB in short;
single disk IO request for chunk ABC.
T3: disk IO for ABC completes; network idle
T5: disk idle; network IO for AB completes;
application proceeds quickly to B and then C,
which triggers two network IO requests, one for C and another for D;
which triggers one disk IO request for DE (requests for them come
close and hence get merged into one).
T6: disk busy (halfway into DE); network completes request C;
which quickly triggers network request for E and disk request for F.
T7: disk completes DE; network idle before that.
T8: disk completes F; network completes D;
which in turn lead to network request for F and disk request for G.
==> pipeline established!

Cases (2)/(3) are more realistic: two requests that come close in
time will be merged in the block layer, but not at the SCST client side.

For all three cases, we start with a single large request for several
chunks and quickly converge to a steady state: a pipeline of
single-chunk disk/network requests.

>>> Bottom line IMHO conclusions:
>>>
>>> 1. Context RA should be considered after additional examination to
>>> replace current RA algorithm in the kernel
>>
>> That's my plan to push context RA to mainline. And thank you very much
>> for providing and testing out a real world application for it!
>
> You're welcome!

Good news: context readahead is now queued in the -mm tree :-)

>>> 2. It would be better to increase default RA size to 1024K
>>
>> That's a long wish to increase the default RA size. However I have a
>> vague feeling that it would be better to first make the lower layers
>> more smart on max_sectors_kb granularity request splitting and batching.
>
> Can you elaborate more on that, please?

Not exactly the above words. But the basic idea is to reduce latency
on sync IO:
- the readahead algorithm classifies its IO requests into sync/async ones;
- use the SATA/SCSI priority bit for sync/async requests
Each of them could introduce some problems though...

>>> *AND* one of the following:
>>>
>>> 3.1. All RA requests should be split in smaller requests with size up
>>> to 256K, which should not be merged with any other request
>>
>> Are you referring to max_sectors_kb?
>
> Yes
>
>> What's your max_sectors_kb and nr_requests? Something like
>>
>> grep -r . /sys/block/sda/queue/
>
> Default: 512 and 128 correspondingly.

OK.

>>> OR
>>>
>>> 3.2. New RA requests should be sent before the previous one completed
>>> to don't let the storage device rotate too far to need full rotation
>>> to serve the next request.
>>
>> Linus has a mmap readahead cleanup patch that can do this. It
>> basically replaces a {find_lock_page(); readahead();} sequence into
>> {find_get_page(); readahead(); lock_page();}.
>>
>> I'll try to push that patch into mainline.
>
> Good!

Good news 2: mmap readahead is also sitting in the -mm tree :-)

>>> I like suggestion 3.1 a lot more, since it should be simple to
>>> implement and has the following 2 positive side effects:
>>>
>>> 1. It would allow to minimize negative effect of higher RA size on
>>> the I/O delay latency by allowing CFQ to switch to too long waiting
>>> requests, when necessary.
>>>
>>> 2. It would allow better requests pipelining, which is very important
>>> to minimize uplink latency for synchronous requests (i.e. with only
>>> one IO request at time, next request issued, when the previous one
>>> completed). You can see in
>>> http://www.3ware.com/kb/article.aspx?id=11050 that 3ware recommends
>>> for maximum performance set max_sectors_kb as low as *64K* with 16MB
>>> RA. It allows to maximize serving commands pipelining. And this
>>> suggestion really works allowing to improve throughput in 50-100%!
>
> Seems I should elaborate more on this. Case, when client is remote has a
> fundamental difference from the case, when client is local, for which
> Linux currently optimized. When client is local data delivered to it
> from the page cache with a virtually infinite speed. But when client is
> remote data delivered to it from the server's cache on a *finite* speed.
> In our case this speed is about the same as speed of reading data to the
> cache from the storage. It has the following consequences:
>
> 1. Data for any READ request at first transferred from the storage to
> the cache, then from the cache to the client. If those transfers are
> done purely sequentially without overlapping, i.e. without any
> readahead, resulting throughput T can be found from equation: 1/T =
> 1/Tlocal + 1/Tremote, where Tlocal and Tremote are throughputs of the
> local (i.e. from the storage) and remote links. In case, when Tlocal ~=
> Tremote, T ~= Tremote/2. Quite unexpected result, right? ;)

Fortunately I can quickly absorb your idea ;-) Because I can recall
vividly that when downloading files on a pretty old kernel, the network IO
and writeback worked by turns instead of in a pipelined way. Hmm, it's
an interesting behavior that deserves more research :-)

> 2. If data transfers on the local and remote links aren't coordinated,
> it is possible that only one link transfers data at some time. From the
> (1) above you can calculate that % of this "idle" time is % of the lost
> throughput. I.e. to get the maximum throughput both links should
> transfer data as simultaneous as possible. For our case, when Tlocal ~=
> Tremote, both links should be all the time busy. Moreover, it is
> possible that the local transfer finished, but during the remote
> transfer the storage media rotated too far, so for the next request it
> will be needed to wait the full rotation to finish (i.e. several ms of
> lost bandwidth).

This is merely a possible state. Can you prove that it is in fact a
*stable* one?

> Thus, to get the maximum possible throughput, we need to maximize
> simultaneous load of both local and remote links. It can be done by
> using well known pipelining technique. For that client should read the
> same amount of data at once, but those read should be split on smaller
> chunks, like 64K at time. This approach looks being against the
> "conventional wisdom", saying that bigger request means bigger
> throughput, but, in fact, it doesn't, because the same (big) amount of
> data are read at time. Bigger count of smaller requests will make more
> simultaneous load on both participating in the data transfers links. In
> fact, even if client is local, in most cases there is a second data
> transfer link. It's in the storage. This is especially true for RAID
> controllers. Guess, why 3ware recommends to set max_sectors_kb to 64K
> and increase RA in the above link? ;)

Sure 64K is good for 3ware - hw raid stripe sizes are typically small.
I wonder if there is a 'stripe size' concept for multi-channel SSDs,
and their typical values :-)

> Of course, max_sectors_kb should be decreased only for smart devices,
> which allow >1 outstanding requests at time, i.e. for all modern
> SCSI/SAS/SATA/iSCSI/FC/etc. drives.

But I guess the gain of tiny max_sectors_kb is more marginal for
single spindle case. Just a guess.

> There is an objection against having too many outstanding requests at
> time. This is latency. But, since overall size of all requests remains
> unchanged, this objection isn't relevant in this proposal. There is the
> same latency-related objection against increasing RA. But many small
> individual RA requests it isn't relevant as well.
>
> We did some measurements to support the this proposal. They were done
> only with deadline scheduler to make the picture clearer. They were done
> with context RA. The tests were the same as before.
>
> --- Baseline, all default:
>
> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 51,1 MB/s
> b) 51,4 MB/s
> c) 51,1 MB/s
>
> Run at the same time:
> # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 4,7 MB/s
> b) 4,6 MB/s
> c) 4,8 MB/s
>
> --- Client - all default, on the server max_sectors_kb set to 64K:
>
> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> - 100 MB/s
> - 100 MB/s
> - 102 MB/s
>
> # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> - 5,2 MB/s
> - 5,3 MB/s
> - 4,2 MB/s
>
> 100% and 8% improvement comparing to the baseline.

They are definitely encouraging numbers. The 64K max_sectors_kb is
obviously doing good here. However... how do you know it's essentially
a pipelining issue? What exactly happened behind the scenes?

> From the previous e-mail you can see that with 4096K RA
>
> # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> a) 39,9 MB/s
> b) 39,5 MB/s
> c) 38,4 MB/s
>
> I.e. there is 760% improvement over the baseline.

Which baseline? You are comparing 128K/4MB RA sizes, under the default
max_sectors_kb?

> Thus, I believe, that for all devices, supporting queue depths >1,
> max_sectors_kb should be set by default to 64K (or to 128K, maybe, but
> not more), and default RA increased to at least 1M, better 2-4M.
>
>> (Can I wish a CONFIG_PRINTK_TIME=y next time? :-)
>
> Sure

Thanks,
Fengguang

2009-04-21 18:19:42

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: Slow file transfer speeds with CFQ IO scheduler in some cases

Wu Fengguang, on 03/23/2009 04:42 AM wrote:
>> Here are the conclusions from tests:
>>
>> 1. Making all IO threads work in the same IO context with CFQ (vanilla
>> RA and default RA size) brings near 100% link utilization on single
>> stream reads (100MB/s) and with deadline about 50% (50MB/s). I.e. there
>> is 100% improvement of CFQ over deadline. With 2 read streams CFQ has
>> ever more advantage: >400% (23MB/s vs 5MB/s).
>
> The ideal 2-stream throughput should be >60MB/s, so I guess there are
> still room of improvements for the CFQ's 23MB/s?

Yes, plenty. But, I think, not in CFQ, but in readahead. With RA 4096K
we were able to get ~40MB/s, see the previous e-mail and below.

> The one fact I cannot understand is that SCST seems to breaking up the
> client side 64K reads into server side 4K reads(above readahead layer).
> But I remember you told me that SCST don't do NFS rsize style split-ups.
> Is this a bug? The 4K read size is too small to be CPU/network friendly...
> Where are the split-up and re-assemble done? On the client side or
> internal to the server?

This is on the client's side. See the target's log in the attachment.
Here is the summary of command data sizes that came to the server, for "dd
if=/dev/sdb of=/dev/null bs=64K count=200" run on the client:

4K 11
8K 0
16K 0
32K 0
64K 0
128K 81
256K 8
512K 0
1024K 0
2048K 0
4096K 0

There are way too many 4K requests. Apparently, the request submission
path isn't optimal.

Actually, this is another question I wanted to raise from the very
beginning.

>> 6. Unexpected result. In case, when ll IO threads work in the same IO
>> context with CFQ increasing RA size *decreases* throughput. I think this
>> is, because RA requests performed as single big READ requests, while
>> requests coming from remote clients are much smaller in size (up to
>> 256K), so, when the read by RA data transferred to the remote client on
>> 100MB/s speed, the backstorage media gets rotated a bit, so the next
>> read request must wait the rotation latency (~0.1ms on 7200RPM). This is
>> well conforms with (3) above, when context RA has 40% advantage over
>> vanilla RA with default RA, but much smaller with higher RA.
>
> Maybe. But the readahead IOs (as showed by the trace) are _async_ ones...

That doesn't matter, because a new request from the client won't come
until all data for the previous one has been transferred to it. And that
transfer happens at a very *finite* speed.

>> Bottom line IMHO conclusions:
>>
>> 1. Context RA should be considered after additional examination to
>> replace current RA algorithm in the kernel
>
> That's my plan to push context RA to mainline. And thank you very much
> for providing and testing out a real world application for it!

You're welcome!

>> 2. It would be better to increase default RA size to 1024K
>
> That's a long wish to increase the default RA size. However I have a
> vague feeling that it would be better to first make the lower layers
> more smart on max_sectors_kb granularity request splitting and batching.

Can you elaborate more on that, please?

>> *AND* one of the following:
>>
>> 3.1. All RA requests should be split in smaller requests with size up to
>> 256K, which should not be merged with any other request
>
> Are you referring to max_sectors_kb?

Yes

> What's your max_sectors_kb and nr_requests? Something like
>
> grep -r . /sys/block/sda/queue/

Default: 512 and 128 correspondingly.

>> OR
>>
>> 3.2. New RA requests should be sent before the previous one completed to
>> don't let the storage device rotate too far to need full rotation to
>> serve the next request.
>
> Linus has a mmap readahead cleanup patch that can do this. It
> basically replaces a {find_lock_page(); readahead();} sequence into
> {find_get_page(); readahead(); lock_page();}.
>
> I'll try to push that patch into mainline.

Good!

>> I like suggestion 3.1 a lot more, since it should be simple to implement
>> and has the following 2 positive side effects:
>>
>> 1. It would allow to minimize negative effect of higher RA size on the
>> I/O delay latency by allowing CFQ to switch to too long waiting
>> requests, when necessary.
>>
>> 2. It would allow better requests pipelining, which is very important to
>> minimize uplink latency for synchronous requests (i.e. with only one IO
>> request at time, next request issued, when the previous one completed).
>> You can see in http://www.3ware.com/kb/article.aspx?id=11050 that 3ware
>> recommends for maximum performance set max_sectors_kb as low as *64K*
>> with 16MB RA. It allows to maximize serving commands pipelining. And
>> this suggestion really works allowing to improve throughput in 50-100%!

Seems I should elaborate more on this. The case when the client is remote is
fundamentally different from the case when the client is local, for which
Linux is currently optimized. When the client is local, data is delivered to
it from the page cache at a virtually infinite speed. But when the client is
remote, data is delivered to it from the server's cache at a *finite* speed.
In our case this speed is about the same as the speed of reading data into
the cache from the storage. This has the following consequences:

1. Data for any READ request is first transferred from the storage to
the cache, then from the cache to the client. If those transfers are
done purely sequentially without overlapping, i.e. without any
readahead, the resulting throughput T can be found from the equation
1/T = 1/Tlocal + 1/Tremote, where Tlocal and Tremote are the throughputs
of the local (i.e. from the storage) and remote links. When Tlocal ~=
Tremote, T ~= Tremote/2. Quite an unexpected result, right? ;) (A quick
numeric check follows after (2) below.)

2. If data transfers on the local and remote links aren't coordinated,
it is possible that only one link transfers data at any given time. From
(1) above you can see that the percentage of this "idle" time is the
percentage of lost throughput, i.e. to get the maximum throughput both
links should transfer data as simultaneously as possible. In our case,
when Tlocal ~= Tremote, both links should be busy all the time. Moreover,
it is possible that the local transfer has finished, but during the remote
transfer the storage media rotated too far, so the next request will have
to wait for a full rotation to finish (i.e. several ms of lost bandwidth).
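
A quick numeric check of the formula from (1), for the Tlocal ~= Tremote
case, can be done with a one-liner (100 MB/s is just an illustrative value):

awk 'BEGIN { Tl = 100; Tr = 100; printf "T = %.1f MB/s\n", 1/(1/Tl + 1/Tr) }'

It prints "T = 50.0 MB/s", i.e. about Tremote/2 when the storage->cache and
cache->client transfers do not overlap at all.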

Thus, to get the maximum possible throughput, we need to maximize the
simultaneous load on both the local and remote links. This can be done by
using the well known pipelining technique. For that the client should read
the same amount of data at once, but those reads should be split into
smaller chunks, like 64K at a time. This approach looks like it goes against
the "conventional wisdom" that a bigger request means bigger throughput,
but in fact it doesn't, because the same (big) amount of data is read at a
time. A larger number of smaller requests will put a more simultaneous load
on both links participating in the data transfers. In fact, even if the
client is local, in most cases there is a second data transfer link: it's
in the storage. This is especially true for RAID controllers. Guess why
3ware recommends setting max_sectors_kb to 64K and increasing RA in the
above link? ;)

Of course, max_sectors_kb should be decreased only for smart devices,
which allow >1 outstanding request at a time, i.e. for all modern
SCSI/SAS/SATA/iSCSI/FC/etc. drives.
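
As a minimal sketch of the proposed tuning (assuming a server export device
/dev/sdb as in the tests below, and taking 4M as the upper end of the RA
range discussed here):

# echo 64 > /sys/block/sdb/queue/max_sectors_kb
# echo 4096 > /sys/block/sdb/queue/read_ahead_kb

The second line is equivalent to "blockdev --setra 8192 /dev/sdb", since
blockdev counts in 512-byte sectors.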

There is an objection against having too many outstanding requests at a
time: latency. But since the overall size of all requests remains
unchanged, this objection isn't relevant to this proposal. There is the
same latency-related objection against increasing RA, but with many small
individual RA requests it isn't relevant either.

We did some measurements to support this proposal. They were done only with
the deadline scheduler, to make the picture clearer, and with context RA.
The tests were the same as before.

--- Baseline, all default:

# dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 51,1 MB/s
b) 51,4 MB/s
c) 51,1 MB/s

Run at the same time:
# while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 4,7 MB/s
b) 4,6 MB/s
c) 4,8 MB/s

--- Client - all default, on the server max_sectors_kb set to 64K:

# dd if=/dev/sdb of=/dev/null bs=64K count=80000
- 100 MB/s
- 100 MB/s
- 102 MB/s

# while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
- 5,2 MB/s
- 5,3 MB/s
- 4,2 MB/s

100% and 8% improvement compared to the baseline.

From the previous e-mail you can see that with 4096K RA

# while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 39,9 MB/s
b) 39,5 MB/s
c) 38,4 MB/s

I.e. there is 760% improvement over the baseline.

Thus, I believe that for all devices supporting queue depths >1,
max_sectors_kb should be set by default to 64K (or maybe 128K, but not
more), and the default RA increased to at least 1M, better 2-4M.

> (Can I wish a CONFIG_PRINTK_TIME=y next time? :-)

Sure

Thanks,
Vlad


Attachments:
req_split.log.bz2 (5.55 kB)

2009-05-12 18:14:24

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: Slow file transfer speeds with CFQ IO scheduler in some cases

Wu Fengguang, on 04/24/2009 12:43 PM wrote:
> On Tue, Apr 21, 2009 at 10:18:25PM +0400, Vladislav Bolkhovitin wrote:
>> Wu Fengguang, on 03/23/2009 04:42 AM wrote:
>>>> Here are the conclusions from tests:
>>>>
>>>> 1. Making all IO threads work in the same IO context with CFQ
>>>> (vanilla RA and default RA size) brings near 100% link utilization
>>>> on single stream reads (100MB/s) and with deadline about 50%
>>>> (50MB/s). I.e. there is 100% improvement of CFQ over deadline. With
>>>> 2 read streams CFQ has ever more advantage: >400% (23MB/s vs 5MB/s).
>>> The ideal 2-stream throughput should be >60MB/s, so I guess there are
>>> still room of improvements for the CFQ's 23MB/s?
>> Yes, plenty. But, I think, not in CFQ, but in readahead. With RA 4096K
>> we were able to get ~40MB/s, see the previous e-mail and below.
>
> Why do you think it's in readahead?

See the previous results with the deadline scheduler. The deadline scheduler
was chosen because, unlike CFQ, it doesn't influence the stream of commands
with additional delays.
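
(For completeness, the scheduler is selected per device via sysfs; sdb is
again just the example device, and the list of available schedulers depends
on the kernel build:)

# cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]
# echo deadline > /sys/block/sdb/queue/scheduler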

Default RA size:

Run at the same time:
#while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 4,7 MB/s
b) 4,6 MB/s
c) 4,8 MB/s

RA 4096K:

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 39,5 MB/s
b) 39,6 MB/s
c) 37,0 MB/s

>700% improvement is quite impressive, isn't it?

> The readahead account/tracing data
> you provided in a previous email shows context readahead to work just fine:
>
> On initiator run benchmark:
> dd if=/dev/sdb of=/dev/null bs=64K count=8000
> dd if=/dev/sdc of=/dev/null bs=64K count=8000
>
> linux-4dtq:/.kernels/scst-kernel-4 # cat /sys/kernel/debug/readahead/stats
> pattern count sync_count eof_count size async_size actual
> initial0 4 4 1 4 3 3
> subsequent 288 0 0 470 251 251
> marker 722 0 0 255 255 255
> mmap 2 2 2 32 0 12
> fadvise 2 2 2 0 0 69
> all 1018 8 5 314 252 252
>
> The last line says the server side does 1018 readaheads with average
> size 1008K, which is pretty close to the max readahead size 1024K.
>
> Raw performance numbers are not enough here. The readahead size and
> the actual IO size, and possibly the readahead traces and IO traces
> are the most vivid and ultimate way to judge who's behaving wrong
> and provide the hint to a solution.

Well, RA can work well, but its _size_ may still not be _good enough_. The
current default of 128K is simply much too small. This is why I propose to
increase it to at least 2M (better 4M) and to decrease max_sectors_kb for
tagged queuing devices to at most 128K (better 64K) to minimize the possible
latency impact.

BTW, Linus recently also found out that "Doing a "echo 64 >
max_sectors_kb" does seem to make my experience nicer. At no really
noticeable downside in throughput that I can see"
(http://lwn.net/Articles/328381/)

>>> The one fact I cannot understand is that SCST seems to breaking up the
>>> client side 64K reads into server side 4K reads(above readahead layer).
>>> But I remember you told me that SCST don't do NFS rsize style split-ups.
>>> Is this a bug? The 4K read size is too small to be CPU/network friendly...
>>> Where are the split-up and re-assemble done? On the client side or
>>> internal to the server?
>> This is on the client's side. See the target's log in the attachment.
>>
>> Here is the summary of commands data sizes, came to the server, for "dd
>> if=/dev/sdb of=/dev/null bs=64K count=200" ran on the client:
>>
>> 4K 11
>> 8K 0
>> 16K 0
>> 32K 0
>> 64K 0
>> 128K 81
>> 256K 8
>> 512K 0
>> 1024K 0
>> 2048K 0
>> 4096K 0
>>
>> There's a way too many 4K requests. Apparently, the requests submission
>> path isn't optimal.
>
> So it's the client side that splits 128K (default sized) readahead IO
> into 4-256K SCST requests that sent over the network?
>
> It looks good enough since it's mostly 128K requests. However this is
> still not in line with previous data:
>
> req=0+1, ra=0+4-3, async=0) = 4
> req=1+1, ra=4+16-16, async=1) = 16
> req=4+1, ra=20+32-32, async=1) = 32
> req=20+1, ra=52+64-64, async=1) = 64
> req=52+1, ra=116+128-128, async=1) = 128
> req=116+1, ra=244+256-256, async=1) = 256
> req=244+1, ra=500+256-256, async=1) = 256
> req=500+1, ra=756+256-256, async=1) = 256
> req=756+1, ra=1012+256-256, async=1) = 256
> req=1012+1, ra=1268+256-256, async=1) = 256
> req=1268+1, ra=1268+512-256, async=1) = 256
> req=1524+1, ra=1780+256-256, async=1) = 256
> req=1780+1, ra=2036+256-256, async=1) = 256
> req=2036+1, ra=2292+256-256, async=1) = 256
> req=2292+1, ra=2548+256-256, async=1) = 256
> req=2548+1, ra=2804+256-256, async=1) = 256
> req=2804+1, ra=2804+512-256, async=1) = 256
> req=3060+1, ra=3060+512-256, async=1) = 256
> req=3316+1, ra=3572+256-256, async=1) = 256
> req=3572+1, ra=3828+256-256, async=1) = 256
> req=3828+1, ra=4084+256-256, async=1) = 256
> req=4084+1, ra=4340+256-256, async=1) = 256
> req=4340+1, ra=4596+256-256, async=1) = 256
> req=4596+1, ra=4596+512-256, async=1) = 256
> req=4852+1, ra=5108+256-256, async=1) = 256
> req=5108+1, ra=5108+512-256, async=1) = 256
> [and 480 more lines of pattern "req=*+1,...= 256")
>
> Basically, the server side ondemand_readahead()
> - perceived *all* about 1-page read requests
> - submitted *always* 256-page readahead IO
> (except during the initial size ramp up process)
>
> Maybe the runtime condition is somehow different for the above and the
> below traces.

I took the above numbers on my local system with a 15K RPM parallel SCSI
drive. Also, I forgot to note that, for instance, "128K" means "requests
with sizes between 64K and 128K".

> [ 457.003661] [4208]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
> [ 457.008686] [4210]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
> [ 457.010915] [4210]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
> [ 457.015238] [4209]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
> [ 457.015419] [4211]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
> [ 457.021696] [4208]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
> [ 457.024205] [4207]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
> [ 457.028455] [4210]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
> [ 457.028695] [4210]: vdisk_exec_read:2198:reading(iv_count 31, full_len 126976)
> [ 457.033565] [4211]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
> [ 457.035750] [4209]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
> [ 457.037323] [4208]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
> [ 457.041132] [4207]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
> [ 457.043251] [4207]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
> [ 457.045455] [4210]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
> [ 457.047949] [4211]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
> [ 457.051463] [4209]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
> [ 457.052136] [4208]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
> [ 457.057734] [4207]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
> [ 457.058007] [4207]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
> [ 457.063185] [4210]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
> [ 457.063404] [4210]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
> [ 457.068749] [4211]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
> [ 457.069007] [4211]: vdisk_exec_read:2198:reading(iv_count 31, full_len 126976)
> [ 457.071339] [4208]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
> [ 457.075250] [4207]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
> [ 457.077416] [4207]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
> [ 457.079892] [4209]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
> [ 457.080492] [4210]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
>
> There is an interesting pattern in the above trace: it tend to be
> 32-page aligned. An unligned 1-page read will always be followed by an
> 31-page or 63-page read, which then again make it aligned to 32-page
> boundaries :-)

Frankly, I don't fully understand why it is so; I only see that Linux
generates requests with widely spread sizes, from 4K up to double the RA
size. I guess this is due to the interaction of how requests are submitted
and merged with queue plugging/unplugging.

If you need me to provide any data/logs to investigate this, just let me
know. I can give you a full list of the SCSI commands coming from the
client, for instance.

Actually, you can download SCST and iSCSI-SCST from
http://scst.sourceforge.net/downloads.html and try them yourself. The setup
is quite simple; see http://scst.sourceforge.net/iscsi-scst-howto.txt.

To get a list of incoming commands in the kernel log, run the following after SCST is loaded:

# echo "add scsi" >/proc/scsi_tgt/trace_level

The distribution of request sizes can be found in /proc/scsi_tgt/sgv, and
the count of commands outstanding at any given time in
/proc/scsi_tgt/sessions.
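
So a typical inspection session on the server looks roughly like this (all
paths are from SCST's procfs interface described above; the logged commands
end up in the kernel log):

# echo "add scsi" >/proc/scsi_tgt/trace_level
# cat /proc/scsi_tgt/sgv
# cat /proc/scsi_tgt/sessions
# dmesg | tail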

>> Actually, this is another question I wanted to rise from the very
>> beginning.
>>
>>>> 6. Unexpected result. In case, when ll IO threads work in the same
>>>> IO context with CFQ increasing RA size *decreases* throughput. I
>>>> think this is, because RA requests performed as single big READ
>>>> requests, while requests coming from remote clients are much smaller
>>>> in size (up to 256K), so, when the read by RA data transferred to
>>>> the remote client on 100MB/s speed, the backstorage media gets
>>>> rotated a bit, so the next read request must wait the rotation
>>>> latency (~0.1ms on 7200RPM). This is well conforms with (3) above,
>>>> when context RA has 40% advantage over vanilla RA with default RA,
>>>> but much smaller with higher RA.
>>> Maybe. But the readahead IOs (as showed by the trace) are _async_ ones...
>> That doesn't matter, because new request from the client won't come
>> until all data for the previous one transferred to it. And that transfer
>> is done on a very *finite* speed.
>
> Async readahead does matter. The readahead pipeline is all you need to
> *fully* fill the storage/network channels (with good readahead size).
>
> The client side request time doesn't matter (ie. not too late to
> impact performance). The
>
> Let's look at the default case, where
> - client readahead size = 128K
> - server readahead size = 128K
> - no split of request size by SCST or max_sectors_kb
> and assume
> - application read size = huge
> - application read bw = infinite
> - disk bw = network bw = limited
>
> Imagine three 128K chunks in the file:
>
> chunk A chunk B chunk C
> ================================--------------------------------________________________________
> application read blocked here client side readahead IO server side readahead IO
> (ongoing network IO) (ongoing network IO) (ongoing disk IO)
>
> Normally the application can read and consume data very fast.
> So in most time, it will be blocked somewhere for network IO.
> Let's assume it is blocking at the first page of chunk A.
>
> Just before the application blocks on chunk A, it will trigger a
> (client side) readahead request for chunk B, which in turn will
> trigger a (server side) readahead request for chunk C.
>
> When disk bw = network bw, it _will_ quickly enter a steady state, in
> which the network transfer of B and disk read of C proceed concurrently.
>
> The below tables demonstrate the steps into this pipelined steady state.
> (we ignore disk seek time for simplicity)
>
> 1) no request merging in SCST and block layer:
> time 0 3 5 6 7
> client ab ab ABcd ABCde ABCDef
> server abc ABC ABCde ABCDef ABCDEfg
> net transfers ... AB C D
> disk transfers ABC .. D E
>
> 2) disk merge, no SCST merge:
> time 0 3 5 6 7 8 9
> client ab ab ABcd ABCde ABCde ABCDef ABCDEfg
> server abc ABC ABCde ABCdef ABCDEf ABCDEFg ABCDEFGh
> net transfers ... AB C . D E
> disk transfers ABC .. DE F G
>
> 3) application (limited) read speed = 2 * disk bw:
> time 0 3 5 5.5 6 6.5 7
> client ab ab AB ABc ABcd ABCd ABCde
> server abc ABC ABC ABCd ABCDe ABCDe ABCDEf
> net transfers ... AB . C D
> disk transfers ABC .. . D E
>
> Legends:
> - lower case 'a': request for chunk A submitted, but IO has not completed
> - upper case 'A': request for chunk A submitted, and IO has completed
> - dot '.': net/disk idled for one time slot
>
> Annotations for case (2):
> T0: single network IO request for chunk A&B, or AB in short;
> single disk IO request for chunk ABC.
> T3: disk IO for ABC completes; network idle
> T5: disk idle; network IO for AB completes;
> application proceeds quickly to B and then C,
> which triggers two network IO requests, one for C and another for D;
> which triggers one disk IO request for DE(requests for them come
> close and hence get merged into one).
> T6: disk busy(half way in DE); network completes request C;
> which quickly triggers network request for E and disk request for F.
> T7: disk completes DE; network idle before that.
> T8: disk completes F; network completes D;
> which in turn lead to network request for F and disk request for G.
> ==> pipeline established!
>
> Case (2)/(3) are more realistic cases, where two requests come close in
> time will be merged in the block layer, but not at SCST client side.
>
> For all three cases, we start by single large request for several
> chunks, and quickly converges to a steady state: a pipeline of
> single-chunk disk/network requests.

I have two things to say here:

1. Why do you think B and C can't be merged by the block layer on the
server? The default max_sectors_kb of 512K perfectly allows that. In
practice I see it happen all the time:

1.1. Distribution of request sizes after a number of "dd if=/dev/sdX
of=/dev/null bs=512K count=2000" runs with the deadline IO scheduler on both
sides:

4K 216
8K 60
16K 77
32K 138
64K 200
128K 11276
256K 18474
>=512K 0

Look how many 256K requests there are: the majority.

1.2. I can see how many requests are sent by the client at a time. In most
cases this number is 1, and only sometimes 2.

1.3. I made some measurements on my local single-drive system, capable of
130MB/s reads, with the deadline IO scheduler on both sides and a *single IO
thread* on the server:

# dd if=/dev/sdc of=/dev/null bs=512K count=2000
2000+0 records in
2000+0 records out
1048576000 bytes (1.0 GB) copied, 16.6582 seconds, 62.9 MB/s

This is about 60% of the maximum possible bandwidth. So, clearly, pipelining
isn't really working here.

But with max_sectors_kb set to 32K on the client, the situation gets much better:

# dd if=/dev/sdc of=/dev/null bs=512K count=2000
2000+0 records in
2000+0 records out
1048576000 bytes (1.0 GB) copied, 10.602 seconds, 98.9 MB/s

Request size distribution:

4K 7570
8K 755
16K 1334
32K 31478
>=64K 0

At any time there are up to 8 outstanding requests, 5 on average.

Here pipelining definitely does work. 57% improvement.

2. In practice, there are many cases where there are >2 "pipes" with
limited bandwidth. The simplest example is when data digests are used with
iSCSI. Here there is a data digest calculation (CRC32C) "pipe" on both the
client and the server. So, for maximum throughput, the pipelining depth
should be at least 4. See the measurements (the setup is the same as above):

Default:

# dd if=/dev/sde of=/dev/null bs=512K count=2000
2000+0 records in
2000+0 records out
1048576000 bytes (1.0 GB) copied, 21.9438 seconds, 47.8 MB/s

max_sectors_kb 32K:

# dd if=/dev/sde of=/dev/null bs=512K count=2000
2000+0 records in
2000+0 records out
1048576000 bytes (1.0 GB) copied, 15.4442 seconds, 67.9 MB/s

42% improvement.

NULLIO (i.e. without disk access):

Default:

# dd if=/dev/sdh of=/dev/null bs=512K count=2000
2000+0 records in
2000+0 records out
1048576000 bytes (1.0 GB) copied, 16.0533 seconds, 65.3 MB/s

max_sectors_kb 32K:

# dd if=/dev/sdh of=/dev/null bs=512K count=2000
2000+0 records in
2000+0 records out
1048576000 bytes (1.0 GB) copied, 12.6053 seconds, 83.2 MB/s

28% improvement.

Impressive difference, isn't it?

Another example where deep pipelining is needed is hardware RAID cards,
like 3ware (see below).

>>>> Bottom line IMHO conclusions:
>>>>
>>>> 1. Context RA should be considered after additional examination to
>>>> replace current RA algorithm in the kernel
>>> That's my plan to push context RA to mainline. And thank you very much
>>> for providing and testing out a real world application for it!
>> You're welcome!
>
> Good news: context readahead is now queued in the -mm tree :-)

Good!

>>>> 2. It would be better to increase default RA size to 1024K
>>> That's a long wish to increase the default RA size. However I have a
>>> vague feeling that it would be better to first make the lower layers
>>> more smart on max_sectors_kb granularity request splitting and batching.
>> Can you elaborate more on that, please?
>
> Not exactly the above words. But the basic idea is to reduce latency
> on sync IO:
> - readahead algorithm classify its IO requests into sync/async ones;
> - use the SATA/SCSI priority bit for sync/async requests

Hmm, I can't recall SCSI having any priority bit. And if it had one, what
would it be used for?

>> 1. Data for any READ request at first transferred from the storage to
>> the cache, then from the cache to the client. If those transfers are
>> done purely sequentially without overlapping, i.e. without any
>> readahead, resulting throughput T can be found from equation: 1/T =
>> 1/Tlocal + 1/Tremote, where Tlocal and Tremote are throughputs of the
>> local (i.e. from the storage) and remote links. In case, when Tlocal ~=
>> Tremote, T ~= Tremote/2. Quite unexpected result, right? ;)
>
> Fortunately I can quickly absorb your idea ;-) Because I can recall
> vividly when downloading files in a pretty old kernel, the network IO
> and writeback works by turns instead of in a pipelined way. Hmm, it's
> an interesting behavior that deserves more researches :-)

Yes, definitely. I feel that, if I wanted, I could earn a PhD or two in this
area :-)

>> 2. If data transfers on the local and remote links aren't coordinated,
>> it is possible that only one link transfers data at some time. From the
>> (1) above you can calculate that % of this "idle" time is % of the lost
>> throughput. I.e. to get the maximum throughput both links should
>> transfer data as simultaneous as possible. For our case, when Tlocal ~=
>> Tremote, both links should be all the time busy. Moreover, it is
>> possible that the local transfer finished, but during the remote
>> transfer the storage media rotated too far, so for the next request it
>> will be needed to wait the full rotation to finish (i.e. several ms of
>> lost bandwidth).
>
> This is merely a possible state. Can you prove that it is in fact a
> *stable* one?

Yes, this is a rarely possible state, because all modern HDDs have at least
a 2MB RA buffer. But it is still possible, especially with old/cheap HDDs
with a small built-in buffer.

>> Thus, to get the maximum possible throughput, we need to maximize
>> simultaneous load of both local and remote links. It can be done by
>> using well known pipelining technique. For that client should read the
>> same amount of data at once, but those read should be split on smaller
>> chunks, like 64K at time. This approach looks being against the
>> "conventional wisdom", saying that bigger request means bigger
>> throughput, but, in fact, it doesn't, because the same (big) amount of
>> data are read at time. Bigger count of smaller requests will make more
>> simultaneous load on both participating in the data transfers links. In
>> fact, even if client is local, in most cases there is a second data
>> transfer link. It's in the storage. This is especially true for RAID
>> controllers. Guess, why 3ware recommends to set max_sectors_kb to 64K
>> and increase RA in the above link? ;)
>
> Sure 64K is good for 3ware - hw raid stripe sizes are typically small.
> I wonder if there is a 'stripe size' concept for multi-channel SSDs,
> and their typical values :-)

No, I don't think it's related to the stripe sizes at all. I believe it's
rather related to the fact that those cards are hardware RAIDs, i.e. they
have a built-in CPU and a big internal memory (hundreds of MBs), hence
additional internal data transfer links, hence a need for deeper pipelining
to work at full performance.

>> There is an objection against having too many outstanding requests at
>> time. This is latency. But, since overall size of all requests remains
>> unchanged, this objection isn't relevant in this proposal. There is the
>> same latency-related objection against increasing RA. But many small
>> individual RA requests it isn't relevant as well.
>>
>> We did some measurements to support the this proposal. They were done
>> only with deadline scheduler to make the picture clearer. They were done
>> with context RA. The tests were the same as before.
>>
>> --- Baseline, all default:
>>
>> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
>> a) 51,1 MB/s
>> b) 51,4 MB/s
>> c) 51,1 MB/s
>>
>> Run at the same time:
>> # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
>> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
>> a) 4,7 MB/s
>> b) 4,6 MB/s
>> c) 4,8 MB/s
>>
>> --- Client - all default, on the server max_sectors_kb set to 64K:
>>
>> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
>> - 100 MB/s
>> - 100 MB/s
>> - 102 MB/s
>>
>> # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
>> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
>> - 5,2 MB/s
>> - 5,3 MB/s
>> - 4,2 MB/s
>>
>> 100% and 8% improvement comparing to the baseline.
>
> They are definitely encouraging numbers. The 64K max_sectors_kb is
> obviously doing good here. However.. How do you know it's essentially
> a pipeline issue?

Yes, see the measurements above.

>> From the previous e-mail you can see that with 4096K RA
>>
>> # while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
>> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
>> a) 39,9 MB/s
>> b) 39,5 MB/s
>> c) 38,4 MB/s
>>
>> I.e. there is 760% improvement over the baseline.
>
> Which baseline?

Search "baseline" word in one of my previous e-mails (one with the
measurement results). For your convenience it is:

Default RA size with deadline:

Run at the same time:
#while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 4,7 MB/s
b) 4,6 MB/s
c) 4,8 MB/s


> You are comparing 128K/4MB RA sizes, under the default
> max_sectors_kb?

Yes.

Vlad