Hi All,
As part of the 1.0.1 release preparations I ran some performance tests to
make sure there are no performance regressions in SCST overall and in
iSCSI-SCST particularly. The results were quite interesting, so I decided to
publish them together with the corresponding numbers for the IET and STGT
iSCSI targets. This isn't a real performance comparison, since it includes
only a few selected tests; I don't have time for a complete
comparison. But I hope somebody will take up what I did and make it
complete.
Setup:
Target: HT 2.4GHz Xeon, x86_32, 2GB of memory limited to 256MB by the kernel
command line to reduce the test data footprint, 75GB 15K RPM SCSI disk as
backstorage, dual port 1Gbps Intel E1000 network card, 2.6.29 kernel.
Initiator: 1.7GHz Xeon, x86_32, 1GB of memory limited to 256MB by the kernel
command line to reduce the test data footprint, dual port 1Gbps Intel E1000
network card, 2.6.27 kernel, open-iscsi 2.0-870-rc3.
The target exported a 5GB file on XFS for FILEIO and a 5GB partition for
BLOCKIO.
All the tests were run 3 times and the average is reported. All the values are
in MB/s. The tests were run with both the CFQ and deadline I/O schedulers on the
target. All other parameters on both the target and the initiator were left at
their defaults.
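For reference, the I/O scheduler can be switched per device at runtime; a
minimal sketch, assuming /dev/sda is the backstorage disk:
# echo deadline > /sys/block/sda/queue/scheduler
# cat /sys/block/sda/queue/scheduler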
==================================================================
I. SEQUENTIAL ACCESS OVER SINGLE LINE
1. # dd if=/dev/sdX of=/dev/null bs=512K count=2000
                        ISCSI-SCST      IET     STGT
NULLIO:                 106             105     103
FILEIO/CFQ:             82              57      55
FILEIO/deadline         69              69      67
BLOCKIO/CFQ             81              28      -
BLOCKIO/deadline        80              66      -
------------------------------------------------------------------
2. # dd if=/dev/zero of=/dev/sdX bs=512K count=2000
I didn't do other write tests, because I have data on those devices.
                        ISCSI-SCST      IET     STGT
NULLIO:                 114             114     114
------------------------------------------------------------------
3. /dev/sdX formatted in ext3 and mounted on /mnt on the initiator. Then
# dd if=/mnt/q of=/dev/null bs=512K count=2000
was run (/mnt/q had been created earlier by test 4 below).
                        ISCSI-SCST      IET     STGT
FILEIO/CFQ:             94              66      46
FILEIO/deadline         74              74      72
BLOCKIO/CFQ             95              35      -
BLOCKIO/deadline        94              95      -
------------------------------------------------------------------
4. /dev/sdX formatted in ext3 and mounted on /mnt on the initiator. Then
# dd if=/dev/zero of=/mnt/q bs=512K count=2000
was run (this test was run before test 3 and created /mnt/q).
                        ISCSI-SCST      IET     STGT
FILEIO/CFQ:             97              91      88
FILEIO/deadline         98              96      90
BLOCKIO/CFQ             112             110     -
BLOCKIO/deadline        112             110     -
------------------------------------------------------------------
Conclusions:
1. ISCSI-SCST FILEIO on buffered READs is 27% faster than IET (94 vs
74). With CFQ the difference is 42% (94 vs 66).
2. ISCSI-SCST FILEIO on buffered READs is 30% faster than STGT (94 vs
72). With CFQ the difference is 104% (94 vs 46).
3. ISCSI-SCST BLOCKIO on buffered READs has about the same performance
as IET, but with CFQ it is about 170% faster (95 vs 35).
4. Buffered WRITEs are not so interesting, because they are asynchronous
with many outstanding commands at a time, hence latency insensitive, but even
here ISCSI-SCST is always a bit faster than IET.
5. STGT is always the slowest, sometimes considerably so.
6. BLOCKIO on buffered WRITEs is consistently faster than FILEIO, so there
is definitely room for future improvement here.
7. For some reason access through a file system is considerably faster than
access to the same device directly.
==================================================================
II. MOSTLY RANDOM "REALISTIC" ACCESS
For this test I used the io_trash utility. For more details see
http://lkml.org/lkml/2008/11/17/444. To show the value of target-side
caching, in this test the target was run with its full 2GB of memory. I ran
io_trash with the following parameters: "2 2 ./ 500000000 50000000 10
4096 4096 300000 10 90 0 10". Total execution time was measured.
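For reference, the invocation presumably looked like this (the binary path
is an assumption, the parameters are the ones listed above):
# ./io_trash 2 2 ./ 500000000 50000000 10 4096 4096 300000 10 90 0 10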
                        ISCSI-SCST      IET     STGT
FILEIO/CFQ:             4m45s           5m      5m17s
FILEIO/deadline         5m20s           5m22s   5m35s
BLOCKIO/CFQ             23m3s           23m5s   -
BLOCKIO/deadline        23m15s          23m25s  -
Conclusions:
1. FILEIO is almost five times faster than BLOCKIO.
2. STGT is, as usual, always the slowest.
3. Deadline is always a bit slower than CFQ.
==================================================================
III. SEQUENTIAL ACCESS OVER MPIO
Unfortunately, my dual port network card isn't capable of simultaneous
data transfers, so I had to do some "modeling" and put my network
devices into 100Mbps mode (see the sketch below). To make this model more
realistic I also used my old 5200RPM IDE hard drive, capable of about 35MB/s
of local throughput. So I modeled the case of dual 1Gbps links with 350MB/s
backstorage, provided that all the following conditions are satisfied:
- Both links are capable of simultaneous data transfers
- There is a sufficient amount of CPU power on both the initiator and the
target to cover the requirements of the data transfers.
All the tests were done with iSCSI-SCST only.
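A minimal sketch of forcing a NIC into 100Mbps mode with ethtool (eth0 is an
assumed interface name; the posting does not say exactly how the links were
slowed down):
# ethtool -s eth0 speed 100 duplex full autoneg off
# ethtool eth0 | grep -i speed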
1. # dd if=/dev/sdX of=/dev/null bs=512K count=2000
NULLIO:                 23
FILEIO/CFQ:             20
FILEIO/deadline         20
BLOCKIO/CFQ             20
BLOCKIO/deadline        17
Single line NULLIO is 12.
So there is a 67% improvement from using 2 lines. At 1Gbps that would be the
equivalent of 200MB/s. Not too bad.
==================================================================
Connections to the target were made with the following iSCSI parameters:
# iscsi-scst-adm --op show --tid=1 --sid=0x10000013d0200
InitialR2T=No
ImmediateData=Yes
MaxConnections=1
MaxRecvDataSegmentLength=2097152
MaxXmitDataSegmentLength=131072
MaxBurstLength=2097152
FirstBurstLength=262144
DefaultTime2Wait=2
DefaultTime2Retain=0
MaxOutstandingR2T=1
DataPDUInOrder=Yes
DataSequenceInOrder=Yes
ErrorRecoveryLevel=0
HeaderDigest=None
DataDigest=None
OFMarker=No
IFMarker=No
OFMarkInt=Reject
IFMarkInt=Reject
# ietadm --op show --tid=1 --sid=0x10000013d0200
InitialR2T=No
ImmediateData=Yes
MaxConnections=1
MaxRecvDataSegmentLength=262144
MaxXmitDataSegmentLength=131072
MaxBurstLength=2097152
FirstBurstLength=262144
DefaultTime2Wait=2
DefaultTime2Retain=20
MaxOutstandingR2T=1
DataPDUInOrder=Yes
DataSequenceInOrder=Yes
ErrorRecoveryLevel=0
HeaderDigest=None
DataDigest=None
OFMarker=No
IFMarker=No
OFMarkInt=Reject
IFMarkInt=Reject
# tgtadm --op show --mode session --tid 1 --sid 1
MaxRecvDataSegmentLength=2097152
MaxXmitDataSegmentLength=131072
HeaderDigest=None
DataDigest=None
InitialR2T=No
MaxOutstandingR2T=1
ImmediateData=Yes
FirstBurstLength=262144
MaxBurstLength=2097152
DataPDUInOrder=Yes
DataSequenceInOrder=Yes
ErrorRecoveryLevel=0
IFMarker=No
OFMarker=No
DefaultTime2Wait=2
DefaultTime2Retain=0
OFMarkInt=Reject
IFMarkInt=Reject
MaxConnections=1
RDMAExtensions=No
TargetRecvDataSegmentLength=262144
InitiatorRecvDataSegmentLength=262144
MaxOutstandingUnexpectedPDUs=0
Vlad
On Mon, Mar 30, 2009 at 7:33 PM, Vladislav Bolkhovitin <[email protected]> wrote:
==================================================================
>
> I. SEQUENTIAL ACCESS OVER SINGLE LINE
>
> 1. # dd if=/dev/sdX of=/dev/null bs=512K count=2000
>
>                        ISCSI-SCST      IET     STGT
> NULLIO:                 106             105     103
> FILEIO/CFQ:             82              57      55
> FILEIO/deadline         69              69      67
> BLOCKIO/CFQ             81              28      -
> BLOCKIO/deadline        80              66      -
I have repeated some of these performance tests for iSCSI over IPoIB
(two DDR PCIe 1.0 ConnectX HCA's connected back to back). The results
for the buffered I/O test with a block size of 512K (initiator)
against a file of 1GB residing on a tmpfs filesystem on the target are
as follows:
write-test: iSCSI-SCST 243 MB/s; IET 192 MB/s.
read-test: iSCSI-SCST 291 MB/s; IET 223 MB/s.
And for a block size of 4 KB:
write-test: iSCSI-SCST 43 MB/s; IET 42 MB/s.
read-test: iSCSI-SCST 288 MB/s; IET 221 MB/s.
Or: depending on the test scenario, SCST transfers data between 2% and
30% faster via the iSCSI protocol over this network.
Something that is not relevant for this comparison, but interesting to
know: with the SRP implementation in SCST the maximal read throughput
is 1290 MB/s on the same setup.
Bart.
Bart Van Assche, on 04/02/2009 12:14 AM wrote:
> On Mon, Mar 30, 2009 at 7:33 PM, Vladislav Bolkhovitin <[email protected]> wrote:
> ==================================================================
>> I. SEQUENTIAL ACCESS OVER SINGLE LINE
>>
>> 1. # dd if=/dev/sdX of=/dev/null bs=512K count=2000
>>
>>                        ISCSI-SCST      IET     STGT
>> NULLIO:                 106             105     103
>> FILEIO/CFQ:             82              57      55
>> FILEIO/deadline         69              69      67
>> BLOCKIO/CFQ             81              28      -
>> BLOCKIO/deadline        80              66      -
>
> I have repeated some of these performance tests for iSCSI over IPoIB
> (two DDR PCIe 1.0 ConnectX HCA's connected back to back). The results
> for the buffered I/O test with a block size of 512K (initiator)
> against a file of 1GB residing on a tmpfs filesystem on the target are
> as follows:
>
> write-test: iSCSI-SCST 243 MB/s; IET 192 MB/s.
> read-test: iSCSI-SCST 291 MB/s; IET 223 MB/s.
>
> And for a block size of 4 KB:
>
> write-test: iSCSI-SCST 43 MB/s; IET 42 MB/s.
> read-test: iSCSI-SCST 288 MB/s; IET 221 MB/s.
Do you have any thoughts on why writes are so bad? It shouldn't be so.
> Or: depending on the test scenario, SCST transfers data between 2% and
> 30% faster via the iSCSI protocol over this network.
>
> Something that is not relevant for this comparison, but interesting to
> know: with the SRP implementation in SCST the maximal read throughput
> is 1290 MB/s on the same setup.
This can be well explained. The limiting factor for iSCSI is that
iSCSI/TCP processing overloads a single CPU core. You can verify that from
vmstat output during the test: the sum of the user and sys time should be about
100/(number of CPUs) or higher. SRP is a lot more CPU efficient, hence the
better throughput.
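For example, on the target you can check this while the test is running; a
minimal sketch (the sampling interval is just a convenient choice):
# vmstat 1 5
On a 2-core target, a user+sys sum that stays around 50% (i.e. 100/2) or
higher means one core is fully busy with iSCSI/TCP processing.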
If you test with 2 or more parallel I/O streams, you should see a
correspondingly increased aggregate throughput, up to the point where you
hit your memory copy bandwidth.
Thanks,
Vlad
> Bart.
>
On Thu, Apr 2, 2009 at 7:16 PM, Vladislav Bolkhovitin <[email protected]> wrote:
> Bart Van Assche, on 04/02/2009 12:14 AM wrote:
>> I have repeated some of these performance tests for iSCSI over IPoIB
>> (two DDR PCIe 1.0 ConnectX HCA's connected back to back). The results
>> for the buffered I/O test with a block size of 512K (initiator)
>> against a file of 1GB residing on a tmpfs filesystem on the target are
>> as follows:
>>
>> write-test: iSCSI-SCST 243 MB/s; IET 192 MB/s.
>> read-test: iSCSI-SCST 291 MB/s; IET 223 MB/s.
>>
>> And for a block size of 4 KB:
>>
>> write-test: iSCSI-SCST 43 MB/s; IET 42 MB/s.
>> read-test: iSCSI-SCST 288 MB/s; IET 221 MB/s.
>
> Do you have any thoughts why writes are so bad? It shouldn't be so..
It's not impossible that with the 4 KB write test I hit the limits of
the initiator system (Intel E6750 CPU, 2.66 GHz, two cores). Some
statistics I gathered during the 4 KB write test:
Target: CPU load 0.5, 16500 mlx4-comp-0 interrupts per second, same
number of interrupts processed by each core (8250/s).
Initiator: CPU load 1.0, 32850 mlx4-comp-0 interrupts per second, all
interrupts occurred on the same core.
Bart.
> -----Original Message-----
> From: Bart Van Assche [mailto:[email protected]]
> Sent: Friday, April 03, 2009 10:09 AM
> To: Vladislav Bolkhovitin
> Cc: scst-devel; [email protected];
> [email protected]
> Subject: Re: [Scst-devel] ISCSI-SCST performance (with also
> IET and STGT data)
>
>
> On Thu, Apr 2, 2009 at 7:16 PM, Vladislav Bolkhovitin
> <[email protected]> wrote:
> > Bart Van Assche, on 04/02/2009 12:14 AM wrote:
> >> I have repeated some of these performance tests for iSCSI
> over IPoIB
> >> (two DDR PCIe 1.0 ConnectX HCA's connected back to back).
> The results
> >> for the buffered I/O test with a block size of 512K (initiator)
> >> against a file of 1GB residing on a tmpfs filesystem on the target
> >> are as follows:
> >>
> >> write-test: iSCSI-SCST 243 MB/s; IET 192 MB/s.
> >> read-test: iSCSI-SCST 291 MB/s; IET 223 MB/s.
> >>
> >> And for a block size of 4 KB:
> >>
> >> write-test: iSCSI-SCST 43 MB/s; IET 42 MB/s.
> >> read-test: iSCSI-SCST 288 MB/s; IET 221 MB/s.
> >
> > Do you have any thoughts why writes are so bad? It shouldn't be so..
>
> It's not impossible that with the 4 KB write test I hit the
> limits of the initiator system (Intel E6750 CPU, 2.66 GHz,
> two cores). Some statistics I gathered during the 4 KB write test:
> Target: CPU load 0.5, 16500 mlx4-comp-0 interrupts per
> second, same number of interrupts processed by each core (8250/s).
> Initiator: CPU load 1.0, 32850 mlx4-comp-0 interrupts per
> second, all interrupts occurred on the same core.
Are you using connected mode IPoIB and setting the MTU to 4KB? Would
fragmentation of IPoIB drive up the interrupt rates?
>
> Bart.
>
On Fri, Apr 3, 2009 at 7:13 PM, Sufficool, Stanley
<[email protected]> wrote:
>> On Thu, Apr 2, 2009 at 7:16 PM, Vladislav Bolkhovitin
>> <[email protected]> wrote:
>> > Bart Van Assche, on 04/02/2009 12:14 AM wrote:
>> >> I have repeated some of these performance tests for iSCSI
>> >> over IPoIB
>> >> (two DDR PCIe 1.0 ConnectX HCA's connected back to back).
>> >> The results
>> >> for the buffered I/O test with a block size of 512K (initiator)
>> >> against a file of 1GB residing on a tmpfs filesystem on the target
>> >> are as follows:
>> >>
>> >> write-test: iSCSI-SCST 243 MB/s; IET 192 MB/s.
>> >> read-test: iSCSI-SCST 291 MB/s; IET 223 MB/s.
>> >>
>> >> And for a block size of 4 KB:
>> >>
>> >> write-test: iSCSI-SCST 43 MB/s; IET 42 MB/s.
>> >> read-test: iSCSI-SCST 288 MB/s; IET 221 MB/s.
>> >
>> > Do you have any thoughts why writes are so bad? It shouldn't be so..
>>
>> It's not impossible that with the 4 KB write test I hit the
>> limits of the initiator system (Intel E6750 CPU, 2.66 GHz,
>> two cores). Some statistics I gathered during the 4 KB write test:
>> Target: CPU load 0.5, 16500 mlx4-comp-0 interrupts per
>> second, same number of interrupts processed by each core (8250/s).
>> Initiator: CPU load 1.0, 32850 mlx4-comp-0 interrupts per
>> second, all interrupts occurred on the same core.
>
> Are you using connected mode IPoIB and setting the MTU to 4KB? Would
> fragmentation of IPoIB drive up the interrupt rates?
All tests have been run with default IPoIB settings: an MTU of 2044
bytes and datagram mode. The following data has been obtained from the
target system after several 4 KB write tests:
$ cat /sys/class/net/ib0/mode
datagram
$ /sbin/ifconfig ib0
ib0 Link encap:UNSPEC HWaddr
80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:192.168.2.1 Bcast:192.168.2.255 Mask:255.255.255.0
inet6 addr: fe80::202:c903:2:d217/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:88482013 errors:0 dropped:0 overruns:0 frame:0
TX packets:38444824 errors:0 dropped:11 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:135770573672 (129480.9 Mb) TX bytes:5647702210 (5386.0 Mb)
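For completeness (not used in these tests): connected mode and a larger MTU
would typically be enabled as follows; the interface name and the 65520-byte
MTU are assumptions, not values from this setup:
$ echo connected > /sys/class/net/ib0/mode
$ /sbin/ifconfig ib0 mtu 65520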
Bart.
On Thu, Apr 2, 2009 at 7:16 PM, Vladislav Bolkhovitin <[email protected]> wrote:
> Bart Van Assche, on 04/02/2009 12:14 AM wrote:
>> I have repeated some of these performance tests for iSCSI over IPoIB
>> (two DDR PCIe 1.0 ConnectX HCA's connected back to back). The results
>> for the buffered I/O test with a block size of 512K (initiator)
>> against a file of 1GB residing on a tmpfs filesystem on the target are
>> as follows:
>>
>> write-test: iSCSI-SCST 243 MB/s; IET 192 MB/s.
>> read-test: iSCSI-SCST 291 MB/s; IET 223 MB/s.
>>
>> And for a block size of 4 KB:
>>
>> write-test: iSCSI-SCST 43 MB/s; IET 42 MB/s.
>> read-test: iSCSI-SCST 288 MB/s; IET 221 MB/s.
>
> Do you have any thoughts why writes are so bad? It shouldn't be so..
By this time I have run the following variation of the 4 KB write test:
* Target: iSCSI-SCST was exporting a 1 GB file residing on a tmpfs filesystem.
* Initiator: two processes were writing 4 KB blocks as follows:
dd if=/dev/zero of=/dev/sdb bs=4K seek=0 count=131072 oflag=sync &
dd if=/dev/zero of=/dev/sdb bs=4K seek=131072 count=131072 oflag=sync &
Results:
* Each dd process on the initiator was writing at a speed of 37.8
MB/s, or a combined writing speed of 75.6 MB/s.
* CPU load on the initiator system during the test: 2.0.
* According to /proc/interrupts, about 38000 mlx4-comp-0 interrupts
were triggered per second.
These results confirm that the initiator system was the bottleneck
during the 4 KB write test, not the target system.
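For reference, the interrupt rate above was estimated from /proc/interrupts;
a minimal sketch of such a measurement (mlx4-comp-0 is the relevant line on
this setup):
$ grep mlx4-comp-0 /proc/interrupts; sleep 10; grep mlx4-comp-0 /proc/interrupts
The difference between the per-CPU counters of the two samples, divided by
the interval, gives the interrupts per second on each core.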
Bart.
Bart Van Assche, on 04/04/2009 12:04 PM wrote:
> On Thu, Apr 2, 2009 at 7:16 PM, Vladislav Bolkhovitin <[email protected]> wrote:
>> Bart Van Assche, on 04/02/2009 12:14 AM wrote:
>>> I have repeated some of these performance tests for iSCSI over IPoIB
>>> (two DDR PCIe 1.0 ConnectX HCA's connected back to back). The results
>>> for the buffered I/O test with a block size of 512K (initiator)
>>> against a file of 1GB residing on a tmpfs filesystem on the target are
>>> as follows:
>>>
>>> write-test: iSCSI-SCST 243 MB/s; IET 192 MB/s.
>>> read-test: iSCSI-SCST 291 MB/s; IET 223 MB/s.
>>>
>>> And for a block size of 4 KB:
>>>
>>> write-test: iSCSI-SCST 43 MB/s; IET 42 MB/s.
>>> read-test: iSCSI-SCST 288 MB/s; IET 221 MB/s.
>> Do you have any thoughts why writes are so bad? It shouldn't be so..
>
> By this time I have run the following variation of the 4 KB write test:
> * Target: iSCSI-SCST was exporting a 1 GB file residing on a tmpfs filesystem.
> * Initiator: two processes were writing 4 KB blocks as follows:
> dd if=/dev/zero of=/dev/sdb bs=4K seek=0 count=131072 oflag=sync &
> dd if=/dev/zero of=/dev/sdb bs=4K seek=131072 count=131072 oflag=sync &
>
> Results:
> * Each dd process on the initiator was writing at a speed of 37.8
> MB/s, or a combined writing speed of 75.6 MB/s.
> * CPU load on the initiator system during the test: 2.0.
> * According to /proc/interrupts, about 38000 mlx4-comp-0 interrupts
> were triggered per second.
>
> These results confirm that the initiator system was the bottleneck
> during the 4 KB write test, not the target system.
If so, then with oflag=direct you should see a performance gain, because you
will eliminate a data copy.
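For example, the same two-stream test with oflag=sync replaced by
oflag=direct (a sketch reusing the parameters above):
dd if=/dev/zero of=/dev/sdb bs=4K seek=0 count=131072 oflag=direct &
dd if=/dev/zero of=/dev/sdb bs=4K seek=131072 count=131072 oflag=direct &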
> Bart.
>