On Fri, Feb 13, 2009 at 11:08:25PM +0300, Vladislav Bolkhovitin wrote:
> Wu Fengguang, on 02/13/2009 04:57 AM wrote:
>> On Thu, Feb 12, 2009 at 09:35:18PM +0300, Vladislav Bolkhovitin wrote:
>>> Sorry for such a huge delay. There were many other activities I had
>>> to do before + I had to be sure I didn't miss anything.
>>>
>>> We didn't use NFS, we used SCST (http://scst.sourceforge.net) with
>>> the iSCSI-SCST target driver. Its architecture is similar to NFS: N
>>> threads (N=5 in this case) handle IO from remote initiators
>>> (clients) arriving over the wire via the iSCSI protocol. In addition,
>>> SCST has a patch called export_alloc_io_context (see
>>> http://lkml.org/lkml/2008/12/10/282), which allows the IO threads to
>>> queue IO using a single IO context, so we can see whether context RA
>>> can replace grouping IO threads in a single IO context.
>>>
>>> Unfortunately, the results are negative. We found neither any
>>> advantage of context RA over the current RA implementation, nor any
>>> possibility for context RA to replace grouping IO threads in a single
>>> IO context.
>>>
>>> The setup on the target (server) was the following: 2 SATA drives
>>> grouped in md RAID-0 with average local read throughput ~120MB/s ("dd
>>> if=/dev/zero of=/dev/md0 bs=1M count=20000" outputs "20971520000
>>> bytes (21 GB) copied, 177,742 s, 118 MB/s"). The md device was
>>> divided into 3 partitions. The first partition was 10% of the space
>>> at the beginning of the device, the last partition was 10% of the
>>> space at the end of the device, and the middle one was the rest of
>>> the space between them. Then the first and the last partitions were
>>> exported to the initiator (client). They were /dev/sdb and /dev/sdc
>>> on it, respectively.
>>
>> Vladislav, thank you for the benchmarks! I'm very interested in
>> optimizing your workload and figuring out what happens underneath.
>>
>> Are the client and server two standalone boxes connected by GbE?
>
> Yes, they are directly connected using GbE.
>
>> When you set readahead sizes in the benchmarks, are you setting them
>> on the server side? I.e. "linux-4dtq" is the SCST server?
>
> Yes, it's the server. On the client all the parameters were left default.
>
>> What's the client side readahead size?
>
> Default, i.e. 128K
>
>> It would help a lot to debug readahead if you can provide the
>> server side readahead stats and trace log for the worst case.
>> This will automatically answer the above questions as well as disclose
>> the micro-behavior of readahead:
>>
>> mount -t debugfs none /sys/kernel/debug
>>
>> echo > /sys/kernel/debug/readahead/stats # reset counters
>> # do benchmark
>> cat /sys/kernel/debug/readahead/stats
>>
>> echo 1 > /sys/kernel/debug/readahead/trace_enable
>> # do micro-benchmark, i.e. run the same benchmark for a short time
>> echo 0 > /sys/kernel/debug/readahead/trace_enable
>> dmesg
>>
>> The above readahead trace should help find out how the client side
>> sequential reads convert into server side random reads, and how we can
>> prevent that.
>
> We will do it as soon as we have a free window on that system.

Thank you. For NFS, the client side read/readahead requests will be
split into units of rsize which will be served by a pool of nfsd
concurrently and possibly out of order. Does SCST have the same
process? If so, what's the rsize value for your SCST benchmarks?
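
As a rough illustration of the splitting described above, here is a minimal sketch. The function name and the numbers (a 128K request, rsize = 32K) are made up for the example and are not taken from any benchmark in this thread; the real splitting happens inside the NFS client.

```python
def split_into_rsize_units(offset, length, rsize):
    """Split one client read into consecutive (offset, size) chunks of at most rsize bytes."""
    chunks = []
    end = offset + length
    while offset < end:
        size = min(rsize, end - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


# Hypothetical numbers: one 128 KiB readahead request, rsize = 32 KiB.
chunks = split_into_rsize_units(0, 128 * 1024, 32 * 1024)
for chunk_offset, chunk_size in chunks:
    print(chunk_offset, chunk_size)
```

Each chunk becomes a separate request that a different server thread may pick up, which is why the server can see them out of order.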
Thanks,
Fengguang
linux-4dtq:~ # uname -r
2.6.27.12-except_export+readahead
- scheduler = deadline
- RA = 4M
linux-4dtq:~ # free
             total       used       free     shared    buffers     cached
Mem:        508168     111288     396880          0       4476      62648
-/+ buffers/cache:      44164     464004
Swap:            0          0          0
linux-4dtq:~ # echo deadline > /sys/block/sdb/queue/scheduler
linux-4dtq:~ # echo deadline > /sys/block/sda/queue/scheduler
linux-4dtq:~ # cat /sys/block/sdb/queue/scheduler
noop anticipatory [deadline] cfq
linux-4dtq:~ # cat /sys/block/sda/queue/scheduler
noop anticipatory [deadline] cfq
linux-4dtq:~ # echo 1 > /sys/block/sda/queue/context_readahead
linux-4dtq:~ # echo 1 > /sys/block/sdb/queue/context_readahead
linux-4dtq:~ # cat /sys/block/sdb/queue/context_readahead
1
linux-4dtq:~ # cat /sys/block/sda/queue/context_readahead
1
linux-4dtq:~ # blockdev --setra 4096 /dev/sda
linux-4dtq:~ # blockdev --setra 4096 /dev/sdb
linux-4dtq:~ # blockdev --getra /dev/sdb
4096
linux-4dtq:~ # blockdev --getra /dev/sda
4096
linux-4dtq:~ # mdadm --assemble /dev/md0 /dev/sd[ab]
mdadm: /dev/md/0 has been started with 2 drives.
linux-4dtq:~ # vgchange -a y
3 logical volume(s) in volume group "raid" now active
linux-4dtq:~ # lvs
  LV   VG   Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  1st  raid -wi-a-  46.00G
  2nd  raid -wi-a- 374.00G
  3rd  raid -wi-a-  46.00G
scst: Using security group "Default" for initiator "iqn.1996-04.de.suse:01:aadab8bc4be5"
iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 262144, MaxXmitDataSegmentLength 131072,
iscsi-scst: MaxBurstLength 1048576, FirstBurstLength 262144, DefaultTime2Wait 2, DefaultTime2Retain 0,
iscsi-scst: MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst: HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048
1) dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 54,1 MB/s
b) 55,6 MB/s
c) 54,3 MB/s
2) dd if=/dev/sdc of=/dev/null bs=64K count=80000
a) 71,3 MB/s
b) 73,8 MB/s
c) 72,7 MB/s
3) Run at the same time:
while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 4,3 MB/s
b) 5.0 MB/s
c) 5.2 MB/s
On Tue, Feb 17, 2009 at 10:03:23PM +0300, Vladislav Bolkhovitin wrote:
> Wu Fengguang, on 02/16/2009 05:34 AM wrote:
>> Thank you. For NFS, the client side read/readahead requests will be
>> split into units of rsize which will be served by a pool of nfsd
>> concurrently and possibly out of order. Does SCST have the same
>> process? If so, what's the rsize value for your SCST benchmarks?
>
> No, there is no such splitting in SCST. The client sees raw SCSI disks
> from the server, and whatever the client sends is passed by the server
> directly and in full size to its backing storage using a regular
> buffered read() (fd->f_op->aio_read() followed by
> wait_on_retry_sync_kiocb()/wait_on_sync_kiocb(), to be precise).

Then it's weird that the server is seeing 1-page sized read requests:
readahead-marker(pid=3844(vdiskd4_4), dev=00:02(bdev), ino=0(raid-3rd), req=9160+1, ra=9192+32-32, async=1) = 32
readahead-marker(pid=3842(vdiskd4_2), dev=00:02(bdev), ino=0(raid-3rd), req=9192+1, ra=9224+32-32, async=1) = 32
readahead-marker(pid=3841(vdiskd4_1), dev=00:02(bdev), ino=0(raid-3rd), req=9224+1, ra=9256+32-32, async=1) = 32
readahead-marker(pid=3844(vdiskd4_4), dev=00:02(bdev), ino=0(raid-3rd), req=9256+1, ra=9288+32-32, async=1) = 32
Here the first line means a 32-page readahead I/O was submitted for a
1-page read request.
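
To make the fields in these trace lines concrete, here is a small sketch that parses one of them. The field reading (req=offset+pages, ra=start+size-async_size, trailing number = pages submitted) is inferred from the explanation above; the regex and the parse_trace name are illustrative only and are not part of the readahead patch.

```python
import re

# Pattern written against the trace lines quoted in this mail; other
# readahead-* line types may need adjustments.
TRACE_RE = re.compile(
    r"readahead-(?P<kind>[\w-]+)\(pid=(?P<pid>\d+)\((?P<comm>[^)]*)\), "
    r"dev=(?P<dev>[^,]+), ino=(?P<ino>\d+)\((?P<name>[^)]*)\), "
    r"req=(?P<req_start>\d+)\+(?P<req_pages>\d+), "
    r"ra=(?P<ra_start>\d+)\+(?P<ra_pages>\d+)-(?P<ra_async>\d+), "
    r"async=(?P<async>\d+)\) = (?P<submitted>\d+)"
)

INT_FIELDS = ("pid", "ino", "req_start", "req_pages",
              "ra_start", "ra_pages", "ra_async", "async", "submitted")


def parse_trace(line):
    """Return the fields of one trace line as a dict, or None if it does not match."""
    match = TRACE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in INT_FIELDS:
        fields[key] = int(fields[key])
    return fields


line = ("readahead-marker(pid=3844(vdiskd4_4), dev=00:02(bdev), "
        "ino=0(raid-3rd), req=9160+1, ra=9192+32-32, async=1) = 32")
rec = parse_trace(line)
print(rec["comm"], "requested", rec["req_pages"], "page(s);",
      rec["submitted"], "pages of readahead were submitted")
```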
The 1-page read size only adds overhead to the CPU/NIC, not to disk I/O.

The trace shows that readahead is doing a good job; however, the
readahead size is the default 128K, not 2M. That's a big problem.

The command "blockdev --setra 4096 /dev/sda" has no effect at all.
Maybe you should run that command after mdadm? I.e.:
linux-4dtq:~ # mdadm --assemble /dev/md0 /dev/sd[ab]
linux-4dtq:~ # blockdev --setra 4096 /dev/sda
linux-4dtq:~ # blockdev --setra 4096 /dev/sdb
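
For reference, the arithmetic behind the 2M and 128K figures can be sketched as follows. blockdev --setra counts 512-byte sectors, while the trace counts pages; the 4 KiB page size is an assumption about this machine, and the helper names are made up for the example.

```python
SECTOR_BYTES = 512   # unit used by blockdev --setra / --getra
PAGE_BYTES = 4096    # assumption: 4 KiB pages on the server


def setra_to_kib(sectors):
    """Readahead size in KiB for a given blockdev --setra value."""
    return sectors * SECTOR_BYTES // 1024


def pages_to_setra(pages):
    """blockdev --setra value equivalent to a readahead window given in pages."""
    return pages * PAGE_BYTES // SECTOR_BYTES


print("--setra 4096 asks for", setra_to_kib(4096), "KiB")
print("32 pages in the trace equal --setra", pages_to_setra(32),
      "or", setra_to_kib(pages_to_setra(32)), "KiB")
```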
Thanks,
Fengguang
Wu Fengguang, on 02/16/2009 05:34 AM wrote:
> Thank you. For NFS, the client side read/readahead requests will be
> split into units of rsize which will be served by a pool of nfsd
> concurrently and possibly out of order. Does SCST have the same
> process? If so, what's the rsize value for your SCST benchmarks?

No, there is no such splitting in SCST. The client sees raw SCSI disks
from the server, and whatever the client sends is passed by the server
directly and in full size to its backing storage using a regular
buffered read() (fd->f_op->aio_read() followed by
wait_on_retry_sync_kiocb()/wait_on_sync_kiocb(), to be precise).
Thanks,
Vlad