2014-02-08 09:38:14

by Thomas Glanzmann

[permalink] [raw]
Subject: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

Hello Eric,

[RESEND: the times it took to create the VMFS were switched between
on/off: with on it took over 2 minutes, with off it took less than 4
seconds]

[RESEND 2: The throughput graphs were switched as well ;-(]

> * Thomas Glanzmann <[email protected]> [2014-02-07 08:55]:
> > Creating a 4 TB VMFS filesystem over iSCSI takes 24 seconds on 3.12
> > and 15 minutes on 3.14.0-rc2+.

* Nicholas A. Bellinger <[email protected]> [2014-02-07 20:30]:
> Would it be possible to try a couple of different stable kernel
> versions to help track this down?

I bisected[1] it and found the offending commit f54b311 tcp auto corking
[2] 'if we have a small send and a previous packet is already in the
qdisc or device queue, defer until TX completion or we get more data.'
- Description by David S. Miller

I gathered a pcap with tcp_autocorking on and off.

On: - took 2 minutes 24 seconds to create a 500 GB VMFS file system
https://thomas.glanzmann.de/tmp/tcp_auto_corking_on.pcap.bz2
https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:46:34.png
https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:52:28.png

Off: - took 4 seconds to create a 500 GB VMFS file system
sysctl net.ipv4.tcp_autocorking=0
https://thomas.glanzmann.de/tmp/tcp_auto_corking_off.pcap.bz2
https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:45:43.png
https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:53:17.png

The first graph can be generated by bunzipping the file, opening it
in Wireshark, selecting Statistics > IO Graph, and changing the unit to
Bytes/Tick. The second graph can be generated by selecting Statistics >
TCP Stream Graph > Round Trip Time.

You can also see that the round trip time increases by a factor of at
least 25.

I once saw a similar problem with delayed ACK packets in the
paravirtualized network driver in Xen: it caused the TCP window to fill
up and slowed throughput from 30 MB/s to less than 100 KB/s. The symptom
was that logging in to a Windows desktop took more than 10 minutes,
where it used to take less than 30 seconds, because the user's profile
was loaded slowly from a CIFS server. In that case the culprit was also
delayed small packets: ACK packets, for CIFS. So far I have only proven
the regression for iSCSI, but I assume we will see it in many other
protocols if we leave tcp auto corking enabled.

I found the problem by doing the following:
- I compiled the kernel by executing the following commands:
yes '' | make oldconfig
time make -j 24
/ make modules_install
/ mkinitramfs -o /boot/initrd.img-bisect <version>

- I cleaned the iSCSI configuration after each test by issuing:
/etc/init.d/target stop
rm /iscsi?/* /etc/target/*

- I configured iSCSI after each reboot
cat > lio-v101.conf <<EOF
set global auto_cd_after_create=false
/backstores/fileio create shared-01.v101.campusvl.de /iscsi1/shared-01.v101.campusvl.de size=500G buffered=true

/iscsi create iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de
/iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.101.99.4
/iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.101.99.5
/iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.102.99.4
/iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.102.99.5
/iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/luns create /backstores/fileio/shared-01.v101.campusvl.de lun=10
/iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/ set attribute authentication=0 demo_mode_write_protect=0 generate_node_acls=1 cache_dynamic_acls=1

saveconfig
yes
EOF
targetcli < lio-v101.conf
And configured a freshly booted ESXi 5.5.0 1331820 via autodeploy
to the iSCSI target, configured the portal, rescanned and
created a 500 GB VMFS 5 filesystem, noting the time: if it
took longer than 2 minutes the kernel was bad; if it took less
than 10 seconds it was good.
git bisect good/bad

My network config is:

auto bond0
iface bond0 inet static
address 10.100.4.62
netmask 255.255.0.0
gateway 10.100.0.1
slaves eth0 eth1
bond-mode 802.3ad
bond-miimon 100

auto bond0.101
iface bond0.101 inet static
address 10.101.99.4
netmask 255.255.0.0

auto bond1
iface bond1 inet static
address 10.100.5.62
netmask 255.255.0.0
slaves eth2 eth3
bond-mode 802.3ad
bond-miimon 100

auto bond1.101
iface bond1.101 inet static
address 10.101.99.5
netmask 255.255.0.0

I propose disabling tcp_autocorking by default because it obviously
degrades iSCSI performance, and probably that of many other protocols.
The commit also mentions that applications can explicitly disable auto
corking; we should probably do that for the iSCSI target, but I don't
know how. Anyone?

[1] http://pbot.rmdir.de/a65q6MjgV36tZnn5jS-DUQ
[2] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f54b311142a92ea2e42598e347b84e1655caf8e3

Cheers,
Thomas


2014-02-08 13:14:38

by Eric Dumazet

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

On Sat, 2014-02-08 at 10:38 +0100, Thomas Glanzmann wrote:
> Hello Eric,
>
> [RESEND: the times it took to create the VMFS were switched between
> on/off: with on it took over 2 minutes, with off it took less than 4
> seconds]
>
> [RESEND 2: The throughput graphs were switched as well ;-(]
>
> > * Thomas Glanzmann <[email protected]> [2014-02-07 08:55]:
> > > Creating a 4 TB VMFS filesystem over iSCSI takes 24 seconds on 3.12
> > > and 15 minutes on 3.14.0-rc2+.
>
> * Nicholas A. Bellinger <[email protected]> [2014-02-07 20:30]:
> > Would it be possible to try a couple of different stable kernel
> > versions to help track this down?
>
> I bisected[1] it and found the offending commit f54b311 tcp auto corking
> [2] 'if we have a small send and a previous packet is already in the
> qdisc or device queue, defer until TX completion or we get more data.'
> - Description by David S. Miller
>
> I gathered a pcap with tcp_autocorking on and off.
>
> On: - took 2 minutes 24 seconds to create a 500 GB VMFS file system
> https://thomas.glanzmann.de/tmp/tcp_auto_corking_on.pcap.bz2
> https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:46:34.png
> https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:52:28.png
>
> Off: - took 4 seconds to create a 500 GB VMFS file system
> sysctl net.ipv4.tcp_autocorking=0
> https://thomas.glanzmann.de/tmp/tcp_auto_corking_off.pcap.bz2
> https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:45:43.png
> https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:53:17.png
>
> The first graph can be generated by bunzipping the file, opening it
> in Wireshark, selecting Statistics > IO Graph, and changing the unit to
> Bytes/Tick. The second graph can be generated by selecting Statistics >
> TCP Stream Graph > Round Trip Time.
>
> You can also see that the round trip time increases by a factor of at
> least 25.
>
> I once saw a similar problem with delayed ACK packets in the
> paravirtualized network driver in Xen: it caused the TCP window to fill
> up and slowed throughput from 30 MB/s to less than 100 KB/s. The symptom
> was that logging in to a Windows desktop took more than 10 minutes,
> where it used to take less than 30 seconds, because the user's profile
> was loaded slowly from a CIFS server. In that case the culprit was also
> delayed small packets: ACK packets, for CIFS. So far I have only proven
> the regression for iSCSI, but I assume we will see it in many other
> protocols if we leave tcp auto corking enabled.
>
> I found the problem by doing the following:
> - I compiled the kernel by executing the following commands:
> yes '' | make oldconfig
> time make -j 24
> / make modules_install
> / mkinitramfs -o /boot/initrd.img-bisect <version>
>
> - I cleaned the iSCSI configuration after each test by issuing:
> /etc/init.d/target stop
> rm /iscsi?/* /etc/target/*
>
> - I configured iSCSI after each reboot
> cat > lio-v101.conf <<EOF
> set global auto_cd_after_create=false
> /backstores/fileio create shared-01.v101.campusvl.de /iscsi1/shared-01.v101.campusvl.de size=500G buffered=true
>
> /iscsi create iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de
> /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.101.99.4
> /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.101.99.5
> /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.102.99.4
> /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.102.99.5
> /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/luns create /backstores/fileio/shared-01.v101.campusvl.de lun=10
> /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/ set attribute authentication=0 demo_mode_write_protect=0 generate_node_acls=1 cache_dynamic_acls=1
>
> saveconfig
> yes
> EOF
> targetcli < lio-v101.conf
> And configured a freshly booted ESXi 5.5.0 1331820 via autodeploy
> to the iSCSI target, configured the portal, rescanned and
> created a 500 GB VMFS 5 filesystem, noting the time: if it
> took longer than 2 minutes the kernel was bad; if it took less
> than 10 seconds it was good.
> git bisect good/bad
>
> My network config is:
>
> auto bond0
> iface bond0 inet static
> address 10.100.4.62
> netmask 255.255.0.0
> gateway 10.100.0.1
> slaves eth0 eth1
> bond-mode 802.3ad
> bond-miimon 100
>
> auto bond0.101
> iface bond0.101 inet static
> address 10.101.99.4
> netmask 255.255.0.0
>
> auto bond1
> iface bond1 inet static
> address 10.100.5.62
> netmask 255.255.0.0
> slaves eth2 eth3
> bond-mode 802.3ad
> bond-miimon 100
>
> auto bond1.101
> iface bond1.101 inet static
> address 10.101.99.5
> netmask 255.255.0.0
>
> I propose disabling tcp_autocorking by default because it obviously
> degrades iSCSI performance, and probably that of many other protocols.
> The commit also mentions that applications can explicitly disable auto
> corking; we should probably do that for the iSCSI target, but I don't
> know how. Anyone?
>
> [1] http://pbot.rmdir.de/a65q6MjgV36tZnn5jS-DUQ
> [2] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f54b311142a92ea2e42598e347b84e1655caf8e3
>
> Cheers,
> Thomas

Hi Thomas, thanks a lot for this very detailed bug report.

I think you are being hit by other bug(s); let's try to fix them instead
of disabling this feature.

John Ogness started a thread yesterday about TCP_NODELAY being hit by
the TCP Small Queue mechanism, which is the base of TCP auto corking.

Two RFC patches were discussed.

One dealing with the TCP_NODELAY flag that John posted, and I'll adapt
it to the current kernel.

One dealing with a possible race, which I suggested (I doubt this could
trigger on every write, but let's fix it anyway).

Here is the combined patch, could you test it ?

Thanks !

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 10435b3b9d0f..3be16727f058 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -698,7 +698,8 @@ static void tcp_tsq_handler(struct sock *sk)
if ((1 << sk->sk_state) &
(TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING |
TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
- tcp_write_xmit(sk, tcp_current_mss(sk), 0, 0, GFP_ATOMIC);
+ tcp_write_xmit(sk, tcp_current_mss(sk), tcp_sk(sk)->nonagle,
+ 0, GFP_ATOMIC);
}
/*
* One tasklet per cpu tries to send more skbs.
@@ -1904,7 +1905,15 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,

if (atomic_read(&sk->sk_wmem_alloc) > limit) {
set_bit(TSQ_THROTTLED, &tp->tsq_flags);
- break;
+ /* It is possible TX completion already happened
+ * before we set TSQ_THROTTLED, so we must
+ * test again the condition.
+ * We abuse smp_mb__after_clear_bit() because
+ * there is no smp_mb__after_set_bit() yet
+ */
+ smp_mb__after_clear_bit();
+ if (atomic_read(&sk->sk_wmem_alloc) > limit)
+ break;
}

limit = mss_now;

2014-02-08 13:33:13

by Eric Dumazet

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

On Sat, 2014-02-08 at 05:14 -0800, Eric Dumazet wrote:
> Here is the combined patch, could you test it ?

Also make sure you have commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d
("tcp: autocork should not hold first packet in write queue")
in your tree.


2014-02-08 13:37:48

by Thomas Glanzmann

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

Hello Eric,

> > tcp corking kills iSCSI performance

> Here is the combined patch, could you test it?

the patch did not apply cleanly, so I applied it by hand. Here is the
resulting patch:

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 03d26b8..40d1958 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -698,7 +698,8 @@ static void tcp_tsq_handler(struct sock *sk)
if ((1 << sk->sk_state) &
(TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING |
TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
- tcp_write_xmit(sk, tcp_current_mss(sk), 0, 0, GFP_ATOMIC);
+ tcp_write_xmit(sk, tcp_current_mss(sk), tcp_sk(sk)->nonagle,
+ 0, GFP_ATOMIC);
}
/*
* One tasklet per cpu tries to send more skbs.
@@ -1904,7 +1905,16 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,

if (atomic_read(&sk->sk_wmem_alloc) > limit) {
set_bit(TSQ_THROTTLED, &tp->tsq_flags);
- break;
+ /* It is possible TX completion already happened
+ * before we set TSQ_THROTTLED, so we must
+ * test again the condition.
+ * We abuse smp_mb__after_clear_bit() because
+ * there is no smp_mb__after_set_bit() yet
+ */
+ smp_mb__after_clear_bit();
+ if (atomic_read(&sk->sk_wmem_alloc) > limit)
+ break;
+
}

limit = mss_now;

-- cut here --

It fixes my case, but if you look at the round trip time it is not even
close to what it used to be. So while this fixes my problem, I'm still
for disabling it by default.

https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched.pcap.bz2
https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-14:36:25.png

Cheers,
Thomas

2014-02-08 13:38:52

by Thomas Glanzmann

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

Hello Eric,

> Also make sure you have commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d
> ("tcp: autocork should not hold first packet in write queue")
> in your tree.

confirmed:

(node-62) [~/work/linux-2.6] git show a181ceb501b31b4bf8812a5c84c716cc31d82c2d | head
commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d
Author: Eric Dumazet <[email protected]>
Date: Tue Dec 17 09:58:30 2013 -0800

tcp: autocork should not hold first packet in write queue

Willem noticed a TCP_RR regression caused by TCP autocorking
on a Mellanox test bed. MLX4_EN_TX_COAL_TIME is 16 us, which can be
right above RTT between hosts.

Cheers,
Thomas

2014-02-08 13:50:09

by Eric Dumazet

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

On Sat, 2014-02-08 at 05:33 -0800, Eric Dumazet wrote:
> On Sat, 2014-02-08 at 05:14 -0800, Eric Dumazet wrote:
> > Here is the combined patch, could you test it ?
>
> Also make sure you have commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d
> ("tcp: autocork should not hold first packet in write queue")
> in your tree.
>
>

BTW this problem demonstrates there is room for improvement in iSCSI:
using MSG_MORE to avoid sending two small segments in separate frames.

[1] 00:32:35.726568 IP 10.101.99.5.3260 > 10.101.0.13.27778: Flags [P.], seq 145:193, ack 144, win 235, options [nop,nop,TS val 4294960733 ecr 385385], length 48
[2] 00:32:35.838074 IP 10.101.0.13.27778 > 10.101.99.5.3260: Flags [.], ack 193, win 514, options [nop,nop,TS val 385396 ecr 4294960733], length 0
[3] 00:32:35.838099 IP 10.101.99.5.3260 > 10.101.0.13.27778: Flags [P.], seq 193:705, ack 144, win 235, options [nop,nop,TS val 4294960761 ecr 385396], length 512

[1] & [3] could be coalesced, and [2] would be avoided.

2014-02-08 13:53:37

by Eric Dumazet

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

On Sat, 2014-02-08 at 14:37 +0100, Thomas Glanzmann wrote:
> Hello Eric,

>
> It fixes my case, but if you look at the round trip time it is not even
> close to what it used to be. So while this fixes my problem, I'm still
> for disabling it by default.
>
> https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched.pcap.bz2
> https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-14:36:25.png

Very nice.

Now we have to check your NIC and how TX completion is performed.

What is your NIC model and driver ?

2014-02-08 13:58:41

by Thomas Glanzmann

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

Hello Eric,

> What is your NIC model and driver?

I have four Intel Corporation I350 Gigabit Network Connection (rev 01).

(node-62) [~/work/linux-2.6] lspci -v | pbot
http://pbot.rmdir.de/rgu6yHMBDVQpflMmbcJACg
(node-62) [~/work/linux-2.6] ip a s | pbot
http://pbot.rmdir.de/xJjRT8u-ekC6mrWgl09ZtQ
(node-62) [~/work/linux-2.6] dmesg | pbot
http://pbot.rmdir.de/MigrSPtxGmp0fI1CRgXsHw

I do 802.3ad link aggregation (layer 2 hash) with two network cards to
one switch.

I'm running:
Linux node-62 3.14.0-rc1+ #23 SMP Sat Feb 8 14:27:47 CET 2014 x86_64 GNU/Linux

Driver: igb

If you need remote access to the machine let me know.

Cheers,
Thomas

2014-02-08 14:09:28

by Eric Dumazet

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

On Sat, 2014-02-08 at 14:37 +0100, Thomas Glanzmann wrote:

>
> It fixes my case, but if you look at the round trip time it is not even
> close to what it used to be. So while this fixes my problem, I'm still
> for disabling it by default.
>
> https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched.pcap.bz2

This pcap was taken on which host ?

10.101.99.5 or 10.101.0.13 ?



2014-02-08 14:12:53

by Thomas Glanzmann

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

Hello Eric,

[RESEND: dropped CC accidentally]

> 10.101.99.5 or 10.101.0.13?

10.101.99.5 (iSCSI Target)

tcpdump -i bond0.101 -s 0 -w /tmp/tcp_auto_corking_on_patched.pcap host esx-03.v101.campusvl.de

Cheers,
Thomas

2014-02-08 14:13:40

by Eric Dumazet

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

On Sat, 2014-02-08 at 05:50 -0800, Eric Dumazet wrote:
> On Sat, 2014-02-08 at 05:33 -0800, Eric Dumazet wrote:
> > On Sat, 2014-02-08 at 05:14 -0800, Eric Dumazet wrote:
> > > Here is the combined patch, could you test it ?
> >
> > Also make sure you have commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d
> > ("tcp: autocork should not hold first packet in write queue")
> > in your tree.
> >
> >
>
> BTW this problem demonstrates there is room for improvement in iSCSI:
> using MSG_MORE to avoid sending two small segments in separate frames.
>
> [1] 00:32:35.726568 IP 10.101.99.5.3260 > 10.101.0.13.27778: Flags [P.], seq 145:193, ack 144, win 235, options [nop,nop,TS val 4294960733 ecr 385385], length 48
> [2] 00:32:35.838074 IP 10.101.0.13.27778 > 10.101.99.5.3260: Flags [.], ack 193, win 514, options [nop,nop,TS val 385396 ecr 4294960733], length 0
> [3] 00:32:35.838099 IP 10.101.99.5.3260 > 10.101.0.13.27778: Flags [P.], seq 193:705, ack 144, win 235, options [nop,nop,TS val 4294960761 ecr 385396], length 512
>
> [1] & [3] could be coalesced, and [2] would be avoided.
>

With the fix, the new pcap is more explicit about this suboptimal behavior:

05:34:16.280900 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0
05:34:16.280949 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5328:5376, ack 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48

05:34:16.280982 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54353:54401, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48
05:34:16.281000 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54401:54913, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512

05:34:16.281107 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0
05:34:16.281157 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5376:5424, ack 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48

05:34:16.281190 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54913:54961, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48
05:34:16.281208 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54961:55473, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512

05:34:16.281337 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0
05:34:16.281390 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5424:5472, ack 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48

05:34:16.281423 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 55473:55521, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48
05:34:16.281440 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 55521:56033, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512


2014-02-08 14:19:10

by Thomas Glanzmann

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

Hello Eric,

> > BTW this problem demonstrates there is room for improvement in iSCSI:
> > using MSG_MORE to avoid sending two small segments in separate frames.

> With the fix, new pcap is more explicit about this suboptimal behavior :

> 05:34:16.280900 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0
> 05:34:16.280949 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5328:5376, ack 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48

> 05:34:16.280982 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54353:54401, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48
> 05:34:16.281000 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54401:54913, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512

> 05:34:16.281107 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0
> 05:34:16.281157 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5376:5424, ack 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48

> 05:34:16.281190 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54913:54961, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48
> 05:34:16.281208 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54961:55473, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512

> 05:34:16.281337 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0
> 05:34:16.281390 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5424:5472, ack 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48

> 05:34:16.281423 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 55473:55521, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48
> 05:34:16.281440 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 55521:56033, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512

I get the idea. However, I'm a little bit confused: when I do a 'git grep
MSG_MORE' I don't see many references in the Linux kernel that use it at
all. So do you have an example for me of where this flag needs to be
applied?

Cheers,
Thomas

2014-02-08 14:30:09

by Eric Dumazet

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

On Sat, 2014-02-08 at 15:19 +0100, Thomas Glanzmann wrote:
> Hello Eric,

> I get the idea. However, I'm a little bit confused: when I do a 'git grep
> MSG_MORE' I don't see many references in the Linux kernel that use it at
> all. So do you have an example for me of where this flag needs to be
> applied?

The idea would be to set this flag when calling sendmsg() for the 48
bytes of the header, and not set it on the sendmsg() for the 512 bytes
of the payload.

iscsi_sw_tcp_xmit_segment() already adds MSG_MORE, but
it would be nice to add a new _initial_ flags parameter to
iscsi_sw_tcp_xmit_segment()


2014-02-08 15:00:07

by Thomas Glanzmann

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

Hello Eric,

> The idea would be to set this flag when calling sendmsg() for the 48
> bytes of the header, and not set it on the sendmsg() for the 512 bytes
> of the payload.

I see.

> iscsi_sw_tcp_xmit_segment() already adds MSG_MORE, but
> it would be nice to add a new _initial_ flags parameter to
> iscsi_sw_tcp_xmit_segment()

That is in the iSCSI initiator implementation; I'm interested in the
iSCSI target code. I already found it and experimented a little, but I
need to dig deeper before I can prepare a patch.

Cheers,
Thomas

2014-02-08 15:06:44

by Eric Dumazet

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

On Sat, 2014-02-08 at 16:00 +0100, Thomas Glanzmann wrote:
> Hello Eric,
>
> > The idea would be to set this flag when calling sendmsg() for the 48
> > bytes of the header, and not set it on the sendmsg() for the 512 bytes
> > of the payload.
>
> I see.
>
> > iscsi_sw_tcp_xmit_segment() already adds MSG_MORE, but
> > it would be nice to add a new _initial_ flags parameter to
> > iscsi_sw_tcp_xmit_segment()
>
> That is in the iSCSI initiator implementation; I'm interested in the
> iSCSI target code. I already found it and experimented a little, but I
> need to dig deeper before I can prepare a patch.

Fantastic !

Let me know if you want some help.

Note: We did some patches in the MSG_MORE logic for sendpage(), but in
your case I do not think it's related.

(git grep -n MSG_SENDPAGE_NOTLAST ) if you are curious

2014-02-08 16:57:36

by Thomas Glanzmann

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

Hello Eric,

> Note: We did some patches in the MSG_MORE logic for sendpage(), but
> in your case I do not think it's related.
> (git grep -n MSG_SENDPAGE_NOTLAST ) if you are curious

thank you for the pointer. The iSCSI target code actually uses sendpage
whenever it can.

Cheers,
Thomas

2014-02-08 17:08:42

by Eric Dumazet

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

On Sat, 2014-02-08 at 17:57 +0100, Thomas Glanzmann wrote:
> Hello Eric,
>
> > Note: We did some patches in the MSG_MORE logic for sendpage(), but
> > in your case I do not think it's related.
> > (git grep -n MSG_SENDPAGE_NOTLAST ) if you are curious
>
> thank you for the pointer. The iSCSI target code actually uses sendpage
> whenever it can.

Yep, but the problem (at least in your pcap) is that the 48-byte header
is sent in a TCP segment of its own, then the 512-byte payload in a
separate segment.

I suspect the sendpage() is only used for the payload. No need for
MSG_MORE here.

MSG_MORE would need to be set on the first part (the 48-byte header),
so that the TCP stack defers the push of that segment until the 512-byte
payload is added.



2014-02-08 17:15:35

by Thomas Glanzmann

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

Hello Eric,

> Yep, but the problem (at least in your pcap) is that the 48-byte header
> is sent in a TCP segment of its own, then the 512-byte payload in a
> separate segment.

I agree.

> I suspect the sendpage() is only used for the payload. No need for
> MSG_MORE here.

I see.

> MSG_MORE would need to be set on the first part (the 48-byte header),
> so that the TCP stack defers the push of that segment until the 512-byte
> payload is added.

The iSCSI target uses one function to send all outbound data, so in
order to do it right, every function that sends data in multiple chunks
needs to mark it correctly. Of course one could also do some wild
guessing and say that everything below 512 bytes gets pushed out. I
wonder what Nab has to say about this?

Cheers,
Thomas

2014-02-08 19:12:47

by Eric Dumazet

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

On Sat, 2014-02-08 at 18:15 +0100, Thomas Glanzmann wrote:

> The iSCSI target uses one function to send all outbound data, so in
> order to do it right, every function that sends data in multiple chunks
> needs to mark it correctly. Of course one could also do some wild
> guessing and say that everything below 512 bytes gets pushed out. I
> wonder what Nab has to say about this?

I was simply thinking about something like :
(might need further changes, but I guess this should solve your case)

diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 0819e688a398..44f0d62a88d6 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -1165,7 +1165,7 @@ send_data:
iov_count = cmd->iov_misc_count;
}

- tx_sent = tx_data(conn, &iov[0], iov_count, tx_size);
+ tx_sent = tx_data(conn, &iov[0], iov_count, tx_size, 0);
if (tx_size != tx_sent) {
if (tx_sent == -EAGAIN) {
pr_err("tx_data() returned -EAGAIN\n");
@@ -1196,7 +1196,8 @@ send_hdr:
iov.iov_base = cmd->pdu;
iov.iov_len = tx_hdr_size;

- tx_sent = tx_data(conn, &iov, 1, tx_hdr_size);
+ data_len = cmd->tx_size - tx_hdr_size - cmd->padding;
+ tx_sent = tx_data(conn, &iov, 1, tx_hdr_size, data_len ? MSG_MORE : 0);
if (tx_hdr_size != tx_sent) {
if (tx_sent == -EAGAIN) {
pr_err("tx_data() returned -EAGAIN\n");
@@ -1205,7 +1206,6 @@ send_hdr:
return -1;
}

- data_len = cmd->tx_size - tx_hdr_size - cmd->padding;
/*
* Set iov_off used by padding and data digest tx_data() calls below
* in order to determine proper offset into cmd->iov_data[]
@@ -1249,7 +1249,8 @@ send_padding:
if (cmd->padding) {
struct kvec *iov_p = &cmd->iov_data[iov_off++];

- tx_sent = tx_data(conn, iov_p, 1, cmd->padding);
+ tx_sent = tx_data(conn, iov_p, 1, cmd->padding,
+ conn->conn_ops->DataDigest ? MSG_MORE : 0);
if (cmd->padding != tx_sent) {
if (tx_sent == -EAGAIN) {
pr_err("tx_data() returned -EAGAIN\n");
@@ -1263,7 +1264,7 @@ send_datacrc:
if (conn->conn_ops->DataDigest) {
struct kvec *iov_d = &cmd->iov_data[iov_off];

- tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN);
+ tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN, 0);
if (ISCSI_CRC_LEN != tx_sent) {
if (tx_sent == -EAGAIN) {
pr_err("tx_data() returned -EAGAIN\n");
@@ -1349,11 +1350,12 @@ static int iscsit_do_rx_data(

static int iscsit_do_tx_data(
struct iscsi_conn *conn,
- struct iscsi_data_count *count)
+ struct iscsi_data_count *count,
+ int flags)
{
int data = count->data_length, total_tx = 0, tx_loop = 0, iov_len;
struct kvec *iov_p;
- struct msghdr msg;
+ struct msghdr msg = { .msg_flags = flags };

if (!conn || !conn->sock || !conn->conn_ops)
return -1;
@@ -1363,8 +1365,6 @@ static int iscsit_do_tx_data(
return -1;
}

- memset(&msg, 0, sizeof(struct msghdr));
-
iov_p = count->iov;
iov_len = count->iov_count;

@@ -1408,7 +1408,8 @@ int tx_data(
struct iscsi_conn *conn,
struct kvec *iov,
int iov_count,
- int data)
+ int data,
+ int flags)
{
struct iscsi_data_count c;

@@ -1421,7 +1422,7 @@ int tx_data(
c.data_length = data;
c.type = ISCSI_TX_DATA;

- return iscsit_do_tx_data(conn, &c);
+ return iscsit_do_tx_data(conn, &c, flags);
}

void iscsit_collect_login_stats(
diff --git a/drivers/target/iscsi/iscsi_target_util.h b/drivers/target/iscsi/iscsi_target_util.h
index e4fc34a02f57..1b4f06801adc 100644
--- a/drivers/target/iscsi/iscsi_target_util.h
+++ b/drivers/target/iscsi/iscsi_target_util.h
@@ -54,7 +54,7 @@ extern int iscsit_print_dev_to_proc(char *, char **, off_t, int);
extern int iscsit_print_sessions_to_proc(char *, char **, off_t, int);
extern int iscsit_print_tpg_to_proc(char *, char **, off_t, int);
extern int rx_data(struct iscsi_conn *, struct kvec *, int, int);
-extern int tx_data(struct iscsi_conn *, struct kvec *, int, int);
+extern int tx_data(struct iscsi_conn *, struct kvec *, int, int, int);
extern void iscsit_collect_login_stats(struct iscsi_conn *, u8, u8);
extern struct iscsi_tiqn *iscsit_snmp_get_tiqn(struct iscsi_conn *);


2014-02-08 21:36:14

by Thomas Glanzmann

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

Hello Eric,

> I was simply thinking about something like :
> (might need further changes, but I guess this should solve your case)

thank you for your patch. It did not apply on top of Linux tip, so I put
in the changes manually and fixed up another call to tx_data that you
forgot in your initial patch, to make it apply.

I gave it another run, can you confirm that it now behaves better?

https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched_tcp_more.pcap.bz2

And look at that round trip graph, it is perfect. Also the filesystem is
now created in 3 seconds instead of 4.

https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-22:34:57.png

Nab, would you consider this patch for upstream? Would you take it if I
clean it up?

Cheers,
Thomas

PS: I'm asleep for the next 8 hours.

diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index e655b04..0eb9681 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -1168,7 +1168,7 @@ send_data:
iov_count = cmd->iov_misc_count;
}

- tx_sent = tx_data(conn, &iov[0], iov_count, tx_size);
+ tx_sent = tx_data(conn, &iov[0], iov_count, tx_size, 0);
if (tx_size != tx_sent) {
if (tx_sent == -EAGAIN) {
pr_err("tx_data() returned -EAGAIN\n");
@@ -1199,7 +1199,8 @@ send_hdr:
iov.iov_base = cmd->pdu;
iov.iov_len = tx_hdr_size;

- tx_sent = tx_data(conn, &iov, 1, tx_hdr_size);
+ data_len = cmd->tx_size - tx_hdr_size - cmd->padding;
+ tx_sent = tx_data(conn, &iov, 1, tx_hdr_size, data_len ? MSG_MORE : 0);
if (tx_hdr_size != tx_sent) {
if (tx_sent == -EAGAIN) {
pr_err("tx_data() returned -EAGAIN\n");
@@ -1208,7 +1209,6 @@ send_hdr:
return -1;
}

- data_len = cmd->tx_size - tx_hdr_size - cmd->padding;
/*
* Set iov_off used by padding and data digest tx_data() calls below
* in order to determine proper offset into cmd->iov_data[]
@@ -1252,7 +1252,8 @@ send_padding:
if (cmd->padding) {
struct kvec *iov_p = &cmd->iov_data[iov_off++];

- tx_sent = tx_data(conn, iov_p, 1, cmd->padding);
+ tx_sent = tx_data(conn, iov_p, 1, cmd->padding,
+ conn->conn_ops->DataDigest ? MSG_MORE : 0);
if (cmd->padding != tx_sent) {
if (tx_sent == -EAGAIN) {
pr_err("tx_data() returned -EAGAIN\n");
@@ -1266,7 +1267,7 @@ send_datacrc:
if (conn->conn_ops->DataDigest) {
struct kvec *iov_d = &cmd->iov_data[iov_off];

- tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN);
+ tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN, 0);
if (ISCSI_CRC_LEN != tx_sent) {
if (tx_sent == -EAGAIN) {
pr_err("tx_data() returned -EAGAIN\n");
@@ -1352,11 +1353,12 @@ static int iscsit_do_rx_data(

static int iscsit_do_tx_data(
struct iscsi_conn *conn,
- struct iscsi_data_count *count)
+ struct iscsi_data_count *count,
+ int flags)
{
int data = count->data_length, total_tx = 0, tx_loop = 0, iov_len;
struct kvec *iov_p;
- struct msghdr msg;
+ struct msghdr msg = { .msg_flags = flags };

if (!conn || !conn->sock || !conn->conn_ops)
return -1;
@@ -1366,8 +1369,6 @@ static int iscsit_do_tx_data(
return -1;
}

- memset(&msg, 0, sizeof(struct msghdr));
-
iov_p = count->iov;
iov_len = count->iov_count;

@@ -1411,7 +1412,8 @@ int tx_data(
struct iscsi_conn *conn,
struct kvec *iov,
int iov_count,
- int data)
+ int data,
+ int flags)
{
struct iscsi_data_count c;

@@ -1424,7 +1426,7 @@ int tx_data(
c.data_length = data;
c.type = ISCSI_TX_DATA;

- return iscsit_do_tx_data(conn, &c);
+ return iscsit_do_tx_data(conn, &c, flags);
}

2014-02-09 00:15:34

by Eric Dumazet

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

On Sat, 2014-02-08 at 22:36 +0100, Thomas Glanzmann wrote:
> Hello Eric,
>
> > I was simply thinking about something like :
> > (might need further changes, but I guess this should solve your case)
>
> thank you for your patch. It did not apply on top of Linux tip, so I put
> in the changes manually and fixed up another call to tx_data that you
> forgot in your initial patch, to make it apply.
>
> I gave it another run, can you confirm that it now behaves better?
>
> https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched_tcp_more.pcap.bz2
>
> And look at that round trip graph, it is perfect. Also the filesystem is
> now created in 3 seconds instead of 4.

Yes, this is much better : 2 frames per request/response, instead of 4.

13:32:04.665367 IP 10.101.0.12.43418 > 10.101.99.5.3260: Flags [P.], seq 384:432, ack 2529, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48
13:32:04.665483 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 2529:3089, ack 432, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560
13:32:04.665642 IP 10.101.0.12.43418 > 10.101.99.5.3260: Flags [P.], seq 432:480, ack 3089, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48
13:32:04.665756 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 3089:3649, ack 480, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560
13:32:04.665933 IP 10.101.0.12.43418 > 10.101.99.5.3260: Flags [P.], seq 480:528, ack 3649, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48
13:32:04.666046 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 3649:4209, ack 528, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560
13:32:04.666214 IP 10.101.0.12.43418 > 10.101.99.5.3260: Flags [P.], seq 528:576, ack 4209, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48
13:32:04.666333 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 4209:4769, ack 576, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560
13:32:04.666678 IP 10.101.0.12.43418 > 10.101.99.5.3260: Flags [P.], seq 576:624, ack 4769, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48
13:32:04.666790 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 4769:5329, ack 624, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560
13:32:04.666983 IP 10.101.0.12.43418 > 10.101.99.5.3260: Flags [P.], seq 624:672, ack 5329, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48
13:32:04.667097 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 5329:5889, ack 672, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560
13:32:04.667280 IP 10.101.0.12.43418 > 10.101.99.5.3260: Flags [P.], seq 672:720, ack 5889, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48
13:32:04.667324 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 5889:6449, ack 720, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560
13:32:04.667500 IP 10.101.0.12.43418 > 10.101.99.5.3260: Flags [P.], seq 720:768, ack 6449, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48
13:32:04.667540 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 6449:7009, ack 768, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560

2014-02-09 07:40:33

by Thomas Glanzmann

[permalink] [raw]
Subject: RFC: Set MSG_MORE in iscsit_fe_sendpage_sg to avoid sending multiple TCP packets instead of one

Hello Eric,
I took the liberty of testing your patch and making it compile by adding
the following changeset:

--- a/drivers/target/iscsi/iscsi_target_parameters.c
+++ b/drivers/target/iscsi/iscsi_target_parameters.c
@@ -79,7 +79,7 @@ int iscsi_login_tx_data(
*/
conn->if_marker += length;

- tx_sent = tx_data(conn, &iov[0], iov_cnt, length);
+ tx_sent = tx_data(conn, &iov[0], iov_cnt, length, 0);
if (tx_sent != length) {
pr_err("tx_data returned %d, expecting %d.\n",
tx_sent, length);

Hello Nab,
please consider this for upstream. If preferred, I can clean it up into two patches:

- One patch to change the interface iscsit_do_tx_data and tx_data

- Another patch who uses the newly introduced interface

Let me know; this is how I would normally have done it. I set Eric as the
author but signed it off and tested it. My testbed was ESXi 5.5.0 GA
creating a 500 GB VMFS filesystem via iSCSI.

Cheers,
Thomas

2014-02-09 07:42:31

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH] This extends tx_data and iscsit_do_tx_data with the additional parameter flags and avoids sending multiple TCP packets in iscsit_fe_sendpage_sg

The new infrastructure is used in iscsit_fe_sendpage_sg to avoid sending three
TCP packets instead of one by setting MSG_MORE when calling kernel_sendmsg
via the wrapper functions tx_data and iscsit_do_tx_data. This reduces the TCP
overhead by sending the same data in fewer TCP packets and minimizes the TCP RTT
when TCP auto corking is enabled. When creating a 500 GB VMFS filesystem the
filesystem is created in 3 seconds instead of 4 seconds.

Signed-off-by: Thomas Glanzmann <[email protected]>
X-tested-by: Thomas Glanzmann <[email protected]>
---
drivers/target/iscsi/iscsi_target_parameters.c | 2 +-
drivers/target/iscsi/iscsi_target_util.c | 23 ++++++++++++-----------
drivers/target/iscsi/iscsi_target_util.h | 2 +-
3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target_parameters.c b/drivers/target/iscsi/iscsi_target_parameters.c
index 4d2e23f..b802392 100644
--- a/drivers/target/iscsi/iscsi_target_parameters.c
+++ b/drivers/target/iscsi/iscsi_target_parameters.c
@@ -79,7 +79,7 @@ int iscsi_login_tx_data(
*/
conn->if_marker += length;

- tx_sent = tx_data(conn, &iov[0], iov_cnt, length);
+ tx_sent = tx_data(conn, &iov[0], iov_cnt, length, 0);
if (tx_sent != length) {
pr_err("tx_data returned %d, expecting %d.\n",
tx_sent, length);
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index e655b04..887fabb 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -1168,7 +1168,7 @@ send_data:
iov_count = cmd->iov_misc_count;
}

- tx_sent = tx_data(conn, &iov[0], iov_count, tx_size);
+ tx_sent = tx_data(conn, &iov[0], iov_count, tx_size, 0);
if (tx_size != tx_sent) {
if (tx_sent == -EAGAIN) {
pr_err("tx_data() returned -EAGAIN\n");
@@ -1199,7 +1199,8 @@ send_hdr:
iov.iov_base = cmd->pdu;
iov.iov_len = tx_hdr_size;

- tx_sent = tx_data(conn, &iov, 1, tx_hdr_size);
+ data_len = cmd->tx_size - tx_hdr_size - cmd->padding;
+ tx_sent = tx_data(conn, &iov, 1, tx_hdr_size, data_len ? MSG_MORE : 0);
if (tx_hdr_size != tx_sent) {
if (tx_sent == -EAGAIN) {
pr_err("tx_data() returned -EAGAIN\n");
@@ -1208,7 +1209,6 @@ send_hdr:
return -1;
}

- data_len = cmd->tx_size - tx_hdr_size - cmd->padding;
/*
* Set iov_off used by padding and data digest tx_data() calls below
* in order to determine proper offset into cmd->iov_data[]
@@ -1252,7 +1252,8 @@ send_padding:
if (cmd->padding) {
struct kvec *iov_p = &cmd->iov_data[iov_off++];

- tx_sent = tx_data(conn, iov_p, 1, cmd->padding);
+ tx_sent = tx_data(conn, iov_p, 1, cmd->padding,
+ conn->conn_ops->DataDigest ? MSG_MORE : 0);
if (cmd->padding != tx_sent) {
if (tx_sent == -EAGAIN) {
pr_err("tx_data() returned -EAGAIN\n");
@@ -1266,7 +1267,7 @@ send_datacrc:
if (conn->conn_ops->DataDigest) {
struct kvec *iov_d = &cmd->iov_data[iov_off];

- tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN);
+ tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN, 0);
if (ISCSI_CRC_LEN != tx_sent) {
if (tx_sent == -EAGAIN) {
pr_err("tx_data() returned -EAGAIN\n");
@@ -1352,11 +1353,12 @@ static int iscsit_do_rx_data(

static int iscsit_do_tx_data(
struct iscsi_conn *conn,
- struct iscsi_data_count *count)
+ struct iscsi_data_count *count,
+ int flags)
{
int data = count->data_length, total_tx = 0, tx_loop = 0, iov_len;
struct kvec *iov_p;
- struct msghdr msg;
+ struct msghdr msg = { .msg_flags = flags };

if (!conn || !conn->sock || !conn->conn_ops)
return -1;
@@ -1366,8 +1368,6 @@ static int iscsit_do_tx_data(
return -1;
}

- memset(&msg, 0, sizeof(struct msghdr));
-
iov_p = count->iov;
iov_len = count->iov_count;

@@ -1411,7 +1411,8 @@ int tx_data(
struct iscsi_conn *conn,
struct kvec *iov,
int iov_count,
- int data)
+ int data,
+ int flags)
{
struct iscsi_data_count c;

@@ -1424,7 +1425,7 @@ int tx_data(
c.data_length = data;
c.type = ISCSI_TX_DATA;

- return iscsit_do_tx_data(conn, &c);
+ return iscsit_do_tx_data(conn, &c, flags);
}

void iscsit_collect_login_stats(
diff --git a/drivers/target/iscsi/iscsi_target_util.h b/drivers/target/iscsi/iscsi_target_util.h
index 561a424..1b6b336 100644
--- a/drivers/target/iscsi/iscsi_target_util.h
+++ b/drivers/target/iscsi/iscsi_target_util.h
@@ -54,7 +54,7 @@ extern int iscsit_print_dev_to_proc(char *, char **, off_t, int);
extern int iscsit_print_sessions_to_proc(char *, char **, off_t, int);
extern int iscsit_print_tpg_to_proc(char *, char **, off_t, int);
extern int rx_data(struct iscsi_conn *, struct kvec *, int, int);
-extern int tx_data(struct iscsi_conn *, struct kvec *, int, int);
+extern int tx_data(struct iscsi_conn *, struct kvec *, int, int, int);
extern void iscsit_collect_login_stats(struct iscsi_conn *, u8, u8);
extern struct iscsi_tiqn *iscsit_snmp_get_tiqn(struct iscsi_conn *);

--
1.7.10.4

2014-02-09 07:45:10

by Thomas Glanzmann

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

Hello Eric,

> Yes, this is much better : 2 frames per request/response, instead of 4.

perfect. I sent out the patch to the iscsi target list in your name since
you did the work, and I added myself as signed-off. I hope that is how it
is handled, or should I have added my name to the From line and mentioned
in the description of the patch that you did the heavy lifting?

Cheers,
Thomas

2014-02-09 12:30:44

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] This extends tx_data and iscsit_do_tx_data with the additional parameter flags and avoids sending multiple TCP packets in iscsit_fe_sendpage_sg

On Sun, 2014-02-09 at 08:42 +0100, Eric Dumazet wrote:
> The new infrastructure is used in iscsit_fe_sendpage_sg to avoid sending three
> TCP packets instead of one by setting MSG_MORE when calling kernel_sendmsg
> via the wrapper functions tx_data and iscsit_do_tx_data. This reduces the TCP
> overhead by sending the same data in fewer TCP packets and minimizes the TCP RTT
> when TCP auto corking is enabled. When creating a 500 GB VMFS filesystem the
> filesystem is created in 3 seconds instead of 4 seconds.
>
> Signed-off-by: Thomas Glanzmann <[email protected]>
> X-tested-by: Thomas Glanzmann <[email protected]>
> ---

Hmm, thanks but this is not how to do this.

When you submit a patch written by someone else, you should :

1) Use your own identity as the sender, not impersonate me.
( that's standard convention )

2) Put following line as first line of the mail
( Documentation/SubmittingPatches lines ~565)

From: Eric Dumazet <[email protected]>

Then I'll add my :
Signed-off-by: Eric Dumazet <[email protected]>

Anyway, patch is not yet complete : We also want to set
MSG_MORE/MSG_SENDPAGE_NOTLAST for all pages but last one in a sg list.


This will fix suboptimal traffic :

13:32:04.976923 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 289953:292849, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
13:32:04.976936 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 292849:295745, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
13:32:04.976944 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 295745:298193, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2448
13:32:04.976952 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 298193:301089, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
13:32:04.976960 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 301089:303985, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
13:32:04.976998 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 303985:306385, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2400

Please try following updated patch, thanks !

Once tested, we'll submit it formally.

drivers/target/iscsi/iscsi_target_parameters.c | 2
drivers/target/iscsi/iscsi_target_util.c | 38 +++++++++------
drivers/target/iscsi/iscsi_target_util.h | 2
3 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target_parameters.c b/drivers/target/iscsi/iscsi_target_parameters.c
index 4d2e23fc76fd..b80239250a1c 100644
--- a/drivers/target/iscsi/iscsi_target_parameters.c
+++ b/drivers/target/iscsi/iscsi_target_parameters.c
@@ -79,7 +79,7 @@ int iscsi_login_tx_data(
*/
conn->if_marker += length;

- tx_sent = tx_data(conn, &iov[0], iov_cnt, length);
+ tx_sent = tx_data(conn, &iov[0], iov_cnt, length, 0);
if (tx_sent != length) {
pr_err("tx_data returned %d, expecting %d.\n",
tx_sent, length);
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 0819e688a398..3c529f7c61ce 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -1165,7 +1165,7 @@ send_data:
iov_count = cmd->iov_misc_count;
}

- tx_sent = tx_data(conn, &iov[0], iov_count, tx_size);
+ tx_sent = tx_data(conn, &iov[0], iov_count, tx_size, 0);
if (tx_size != tx_sent) {
if (tx_sent == -EAGAIN) {
pr_err("tx_data() returned -EAGAIN\n");
@@ -1196,7 +1196,8 @@ send_hdr:
iov.iov_base = cmd->pdu;
iov.iov_len = tx_hdr_size;

- tx_sent = tx_data(conn, &iov, 1, tx_hdr_size);
+ tx_sent = tx_data(conn, &iov, 1, tx_hdr_size,
+ cmd->tx_size != tx_hdr_size ? MSG_MORE : 0);
if (tx_hdr_size != tx_sent) {
if (tx_sent == -EAGAIN) {
pr_err("tx_data() returned -EAGAIN\n");
@@ -1225,18 +1226,24 @@ send_hdr:
while (data_len) {
u32 space = (sg->length - offset);
u32 sub_len = min_t(u32, data_len, space);
+ int flags = 0;
+
+ if ((data_len != sub_len) || cmd->padding ||
+ conn->conn_ops->DataDigest)
+ flags = MSG_SENDPAGE_NOTLAST | MSG_MORE;
+
send_pg:
tx_sent = conn->sock->ops->sendpage(conn->sock,
- sg_page(sg), sg->offset + offset, sub_len, 0);
+ sg_page(sg),
+ sg->offset + offset,
+ sub_len, flags);
if (tx_sent != sub_len) {
if (tx_sent == -EAGAIN) {
- pr_err("tcp_sendpage() returned"
- " -EAGAIN\n");
+ pr_err("tcp_sendpage() returned -EAGAIN\n");
goto send_pg;
}

- pr_err("tcp_sendpage() failure: %d\n",
- tx_sent);
+ pr_err("tcp_sendpage() failure: %d\n", tx_sent);
return -1;
}

@@ -1249,7 +1256,8 @@ send_padding:
if (cmd->padding) {
struct kvec *iov_p = &cmd->iov_data[iov_off++];

- tx_sent = tx_data(conn, iov_p, 1, cmd->padding);
+ tx_sent = tx_data(conn, iov_p, 1, cmd->padding,
+ conn->conn_ops->DataDigest ? MSG_MORE : 0);
if (cmd->padding != tx_sent) {
if (tx_sent == -EAGAIN) {
pr_err("tx_data() returned -EAGAIN\n");
@@ -1263,7 +1271,7 @@ send_datacrc:
if (conn->conn_ops->DataDigest) {
struct kvec *iov_d = &cmd->iov_data[iov_off];

- tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN);
+ tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN, 0);
if (ISCSI_CRC_LEN != tx_sent) {
if (tx_sent == -EAGAIN) {
pr_err("tx_data() returned -EAGAIN\n");
@@ -1349,11 +1357,12 @@ static int iscsit_do_rx_data(

static int iscsit_do_tx_data(
struct iscsi_conn *conn,
- struct iscsi_data_count *count)
+ struct iscsi_data_count *count,
+ int flags)
{
int data = count->data_length, total_tx = 0, tx_loop = 0, iov_len;
struct kvec *iov_p;
- struct msghdr msg;
+ struct msghdr msg = { .msg_flags = flags };

if (!conn || !conn->sock || !conn->conn_ops)
return -1;
@@ -1363,8 +1372,6 @@ static int iscsit_do_tx_data(
return -1;
}

- memset(&msg, 0, sizeof(struct msghdr));
-
iov_p = count->iov;
iov_len = count->iov_count;

@@ -1408,7 +1415,8 @@ int tx_data(
struct iscsi_conn *conn,
struct kvec *iov,
int iov_count,
- int data)
+ int data,
+ int flags)
{
struct iscsi_data_count c;

@@ -1421,7 +1429,7 @@ int tx_data(
c.data_length = data;
c.type = ISCSI_TX_DATA;

- return iscsit_do_tx_data(conn, &c);
+ return iscsit_do_tx_data(conn, &c, flags);
}

void iscsit_collect_login_stats(
diff --git a/drivers/target/iscsi/iscsi_target_util.h b/drivers/target/iscsi/iscsi_target_util.h
index e4fc34a02f57..1b4f06801adc 100644
--- a/drivers/target/iscsi/iscsi_target_util.h
+++ b/drivers/target/iscsi/iscsi_target_util.h
@@ -54,7 +54,7 @@ extern int iscsit_print_dev_to_proc(char *, char **, off_t, int);
extern int iscsit_print_sessions_to_proc(char *, char **, off_t, int);
extern int iscsit_print_tpg_to_proc(char *, char **, off_t, int);
extern int rx_data(struct iscsi_conn *, struct kvec *, int, int);
-extern int tx_data(struct iscsi_conn *, struct kvec *, int, int);
+extern int tx_data(struct iscsi_conn *, struct kvec *, int, int, int);
extern void iscsit_collect_login_stats(struct iscsi_conn *, u8, u8);
extern struct iscsi_tiqn *iscsit_snmp_get_tiqn(struct iscsi_conn *);


2014-02-09 15:07:56

by Thomas Glanzmann

[permalink] [raw]
Subject: Re: [PATCH] This extends tx_data and iscsit_do_tx_data with the additional parameter flags and avoids sending multiple TCP packets in iscsit_fe_sendpage_sg

Hello Eric,

> 1) Use your own identity as the sender, not impersonate me.
> ( that's standard convention )

sorry about that, it will not happen again.

> 2) Put following line as first line of the mail
> ( Documentation/SubmittingPatches lines ~565)

> From: Eric Dumazet <[email protected]>

> Then I'll add my :
> Signed-off-by: Eric Dumazet <[email protected]>

I see. Thank you for the awareness training. I read SubmittingPatches
completely.

> Anyway, patch is not yet complete : We also want to set
> MSG_MORE/MSG_SENDPAGE_NOTLAST for all pages but last one in a sg list.

I see.

> This will fix suboptimal traffic :

> 13:32:04.976923 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 289953:292849, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
> 13:32:04.976936 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 292849:295745, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
> 13:32:04.976944 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 295745:298193, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2448
> 13:32:04.976952 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 298193:301089, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
> 13:32:04.976960 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 301089:303985, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
> 13:32:04.976998 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 303985:306385, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2400

What is suboptimal about the traffic; could it all go in one packet?
Since my MTU is 1500, I assume that the network card will split this
into MTU-sized packets, is that correct? Should I repeat the test with
MTU 9000 as well?

> Please try following updated patch, thanks!

This time it took 2 seconds instead of the 4 seconds on 3.12 to create
the filesystem. Find the pcap here:

https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched_tcp_more_notlast.pcap.bz2

> Once tested, we'll submit it formally.

Let me know if you want to submit it or if I should. If I should do it, I
would split it into two patches: one for the interface change and one for
the packet submission logic. Btw. your last patch did not apply for me
because I cut & pasted it from the e-mail instead of saving it in an
editor. So your patch was fine, but the way I tried to apply it was
flawed.

Cheers,
Thomas

2014-02-10 18:55:54

by Nicholas A. Bellinger

[permalink] [raw]
Subject: Re: [PATCH] This extends tx_data and iscsit_do_tx_data with the additional parameter flags and avoids sending multiple TCP packets in iscsit_fe_sendpage_sg

Hi Eric & Thomas,

On Sun, 2014-02-09 at 04:30 -0800, Eric Dumazet wrote:
> On Sun, 2014-02-09 at 08:42 +0100, Eric Dumazet wrote:
> > The new infrastructure is used in iscsit_fe_sendpage_sg to avoid sending three
> > TCP packets instead of one by setting MSG_MORE when calling kernel_sendmsg
> > via the wrapper functions tx_data and iscsit_do_tx_data. This reduces the TCP
> > overhead by sending the same data in fewer TCP packets and minimizes the TCP RTT
> > when TCP auto corking is enabled. When creating a 500 GB VMFS filesystem the
> > filesystem is created in 3 seconds instead of 4 seconds.
> >
> > Signed-off-by: Thomas Glanzmann <[email protected]>
> > X-tested-by: Thomas Glanzmann <[email protected]>
> > ---
>
> Hmm, thanks but this is not how to do this.
>
> When you submit a patch written by someone else, you should :
>
> 1) Use your own identity as the sender, not impersonate me.
> ( that's standard convention )
>
> 2) Put following line as first line of the mail
> ( Documentation/SubmittingPatches lines ~565)
>
> From: Eric Dumazet <[email protected]>
>
> Then I'll add my :
> Signed-off-by: Eric Dumazet <[email protected]>
>
> Anyway, patch is not yet complete : We also want to set
> MSG_MORE/MSG_SENDPAGE_NOTLAST for all pages but last one in a sg list.
>
>
> This will fix suboptimal traffic :
>
> 13:32:04.976923 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 289953:292849, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
> 13:32:04.976936 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 292849:295745, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
> 13:32:04.976944 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 295745:298193, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2448
> 13:32:04.976952 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 298193:301089, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
> 13:32:04.976960 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [.], seq 301089:303985, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2896
> 13:32:04.976998 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 303985:306385, ack 45792, win 795, options [nop,nop,TS val 4294914045 ecr 1577012], length 2400
>
> Please try following updated patch, thanks !
>
> Once tested, we'll submit it formally.
>
> drivers/target/iscsi/iscsi_target_parameters.c | 2
> drivers/target/iscsi/iscsi_target_util.c | 38 +++++++++------
> drivers/target/iscsi/iscsi_target_util.h | 2
> 3 files changed, 25 insertions(+), 17 deletions(-)
>

This looks correct to me..

Thomas, once you're able to confirm, please include your 'Tested-by' and
I'll include it in the next -rc3 PULL request.

Thanks!

--nab

> diff --git a/drivers/target/iscsi/iscsi_target_parameters.c b/drivers/target/iscsi/iscsi_target_parameters.c
> index 4d2e23fc76fd..b80239250a1c 100644
> --- a/drivers/target/iscsi/iscsi_target_parameters.c
> +++ b/drivers/target/iscsi/iscsi_target_parameters.c
> @@ -79,7 +79,7 @@ int iscsi_login_tx_data(
> */
> conn->if_marker += length;
>
> - tx_sent = tx_data(conn, &iov[0], iov_cnt, length);
> + tx_sent = tx_data(conn, &iov[0], iov_cnt, length, 0);
> if (tx_sent != length) {
> pr_err("tx_data returned %d, expecting %d.\n",
> tx_sent, length);
> diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
> index 0819e688a398..3c529f7c61ce 100644
> --- a/drivers/target/iscsi/iscsi_target_util.c
> +++ b/drivers/target/iscsi/iscsi_target_util.c
> @@ -1165,7 +1165,7 @@ send_data:
> iov_count = cmd->iov_misc_count;
> }
>
> - tx_sent = tx_data(conn, &iov[0], iov_count, tx_size);
> + tx_sent = tx_data(conn, &iov[0], iov_count, tx_size, 0);
> if (tx_size != tx_sent) {
> if (tx_sent == -EAGAIN) {
> pr_err("tx_data() returned -EAGAIN\n");
> @@ -1196,7 +1196,8 @@ send_hdr:
> iov.iov_base = cmd->pdu;
> iov.iov_len = tx_hdr_size;
>
> - tx_sent = tx_data(conn, &iov, 1, tx_hdr_size);
> + tx_sent = tx_data(conn, &iov, 1, tx_hdr_size,
> + cmd->tx_size != tx_hdr_size ? MSG_MORE : 0);
> if (tx_hdr_size != tx_sent) {
> if (tx_sent == -EAGAIN) {
> pr_err("tx_data() returned -EAGAIN\n");
> @@ -1225,18 +1226,24 @@ send_hdr:
> while (data_len) {
> u32 space = (sg->length - offset);
> u32 sub_len = min_t(u32, data_len, space);
> + int flags = 0;
> +
> + if ((data_len != sub_len) || cmd->padding ||
> + conn->conn_ops->DataDigest)
> + flags = MSG_SENDPAGE_NOTLAST | MSG_MORE;
> +
> send_pg:
> tx_sent = conn->sock->ops->sendpage(conn->sock,
> - sg_page(sg), sg->offset + offset, sub_len, 0);
> + sg_page(sg),
> + sg->offset + offset,
> + sub_len, flags);
> if (tx_sent != sub_len) {
> if (tx_sent == -EAGAIN) {
> - pr_err("tcp_sendpage() returned"
> - " -EAGAIN\n");
> + pr_err("tcp_sendpage() returned -EAGAIN\n");
> goto send_pg;
> }
>
> - pr_err("tcp_sendpage() failure: %d\n",
> - tx_sent);
> + pr_err("tcp_sendpage() failure: %d\n", tx_sent);
> return -1;
> }
>
> @@ -1249,7 +1256,8 @@ send_padding:
> if (cmd->padding) {
> struct kvec *iov_p = &cmd->iov_data[iov_off++];
>
> - tx_sent = tx_data(conn, iov_p, 1, cmd->padding);
> + tx_sent = tx_data(conn, iov_p, 1, cmd->padding,
> + conn->conn_ops->DataDigest ? MSG_MORE : 0);
> if (cmd->padding != tx_sent) {
> if (tx_sent == -EAGAIN) {
> pr_err("tx_data() returned -EAGAIN\n");
> @@ -1263,7 +1271,7 @@ send_datacrc:
> if (conn->conn_ops->DataDigest) {
> struct kvec *iov_d = &cmd->iov_data[iov_off];
>
> - tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN);
> + tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN, 0);
> if (ISCSI_CRC_LEN != tx_sent) {
> if (tx_sent == -EAGAIN) {
> pr_err("tx_data() returned -EAGAIN\n");
> @@ -1349,11 +1357,12 @@ static int iscsit_do_rx_data(
>
> static int iscsit_do_tx_data(
> struct iscsi_conn *conn,
> - struct iscsi_data_count *count)
> + struct iscsi_data_count *count,
> + int flags)
> {
> int data = count->data_length, total_tx = 0, tx_loop = 0, iov_len;
> struct kvec *iov_p;
> - struct msghdr msg;
> + struct msghdr msg = { .msg_flags = flags };
>
> if (!conn || !conn->sock || !conn->conn_ops)
> return -1;
> @@ -1363,8 +1372,6 @@ static int iscsit_do_tx_data(
> return -1;
> }
>
> - memset(&msg, 0, sizeof(struct msghdr));
> -
> iov_p = count->iov;
> iov_len = count->iov_count;
>
> @@ -1408,7 +1415,8 @@ int tx_data(
> struct iscsi_conn *conn,
> struct kvec *iov,
> int iov_count,
> - int data)
> + int data,
> + int flags)
> {
> struct iscsi_data_count c;
>
> @@ -1421,7 +1429,7 @@ int tx_data(
> c.data_length = data;
> c.type = ISCSI_TX_DATA;
>
> - return iscsit_do_tx_data(conn, &c);
> + return iscsit_do_tx_data(conn, &c, flags);
> }
>
> void iscsit_collect_login_stats(
> diff --git a/drivers/target/iscsi/iscsi_target_util.h b/drivers/target/iscsi/iscsi_target_util.h
> index e4fc34a02f57..1b4f06801adc 100644
> --- a/drivers/target/iscsi/iscsi_target_util.h
> +++ b/drivers/target/iscsi/iscsi_target_util.h
> @@ -54,7 +54,7 @@ extern int iscsit_print_dev_to_proc(char *, char **, off_t, int);
> extern int iscsit_print_sessions_to_proc(char *, char **, off_t, int);
> extern int iscsit_print_tpg_to_proc(char *, char **, off_t, int);
> extern int rx_data(struct iscsi_conn *, struct kvec *, int, int);
> -extern int tx_data(struct iscsi_conn *, struct kvec *, int, int);
> +extern int tx_data(struct iscsi_conn *, struct kvec *, int, int, int);
> extern void iscsit_collect_login_stats(struct iscsi_conn *, u8, u8);
> extern struct iscsi_tiqn *iscsit_snmp_get_tiqn(struct iscsi_conn *);
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe target-devel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2014-02-10 20:56:39

by Thomas Glanzmann

[permalink] [raw]
Subject: Re: [PATCH] This extends tx_data and and iscsit_do_tx_data with the additional parameter flags and avoids sending multiple TCP packets in iscsit_fe_sendpage_sg

Hello Nab,

> This looks correct to me. Thomas, once your able to confirm please
> include your 'Tested-by' and I'll include for the next -rc3 PULL
> request.

Eric is currently reviewing our latest iteration, which uses MSG_MORE for
kernel_sendmsg and MSG_MORE | MSG_SENDPAGE_NOTLAST for sendpage. However,
with the last iteration we again saw a high RTT for some packets. But
then Eric had me tune net.ipv4.tcp_min_tso_segs to 8, and the RTT went
back down to roughly what it was before auto corking was enabled.

I'm on a steep learning curve, but Eric hopefully knows how to get this
back in check. Nevertheless, the regression I saw is history: Eric has
submitted the patch to David S. Miller that fixes the two bugs that
killed iSCSI performance when tcp auto corking was on. So currently
we're just optimizing to get the last 20% or so out of it. Quite
interesting, especially how much bandwidth can be saved by coalescing
packets.

Cheers,
Thomas

2014-02-10 21:01:46

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] This extends tx_data and and iscsit_do_tx_data with the additional parameter flags and avoids sending multiple TCP packets in iscsit_fe_sendpage_sg

On Mon, 2014-02-10 at 21:56 +0100, Thomas Glanzmann wrote:
> Hello Nab,
>
> > This looks correct to me. Thomas, once your able to confirm please
> > include your 'Tested-by' and I'll include for the next -rc3 PULL
> > request.
>
> Eric is currently reviewing our latest iteration with MSG_MORE for
> kernel_sendmsg and MSG_MORE | MSG_SENDPAGE_NOTLAST for sendpage. However
> with the last iteration we had again a high RTT for some packets. But
> than Eric let me tune net.ipv4.tcp_min_tso_segs to 8 and the RTT went
> down to what it used before auto corking was enabled. At least almost.
>

Hmm.. I was not aware of high RTT for some packets.

Can you spot this in the pcap you provided?


> I'm having a steep learning curve but Eric hopefully knows how to get
> this back in check. Nevertheless the regression I saw are history
> because I saw that Eric has submitted the patch to David S. Miller which
> fixes the two bugs that killed the iSCSI performance when tcp auto
> corking was on. So currently we're just optimizing to get the last 20%
> or so out of it. Quite interesting. Especially how much bandwidth can be
> saved by coalescing packets.

It depends on the payload/header ratio.

The beginning of your pcap shows a lot of 512-byte requests, and for that
kind of request the gain is huge (maybe 50%), but for 32K or 64K requests
the gain would be marginal.


2014-02-10 21:09:37

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] This extends tx_data and and iscsit_do_tx_data with the additional parameter flags and avoids sending multiple TCP packets in iscsit_fe_sendpage_sg

On Mon, 2014-02-10 at 13:01 -0800, Eric Dumazet wrote:
> On Mon, 2014-02-10 at 21:56 +0100, Thomas Glanzmann wrote:
> > Hello Nab,
> >
> > > This looks correct to me. Thomas, once your able to confirm please
> > > include your 'Tested-by' and I'll include for the next -rc3 PULL
> > > request.
> >
> > Eric is currently reviewing our latest iteration with MSG_MORE for
> > kernel_sendmsg and MSG_MORE | MSG_SENDPAGE_NOTLAST for sendpage. However
> > with the last iteration we had again a high RTT for some packets. But
> > than Eric let me tune net.ipv4.tcp_min_tso_segs to 8 and the RTT went
> > down to what it used before auto corking was enabled. At least almost.
> >
>
> Hmm.. I was not aware of high RTT for some packets.
>
> Can you spot this on the pcap you provided ?

offlist

Also please make sure you have this patch:

http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=4a5ab4e224288403b0b4b6b8c4d339323150c312

2014-02-10 21:13:14

by Thomas Glanzmann

[permalink] [raw]
Subject: Re: [PATCH] This extends tx_data and and iscsit_do_tx_data with the additional parameter flags and avoids sending multiple TCP packets in iscsit_fe_sendpage_sg

Hello Eric,

> Also please make sure you have this patch :
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=4a5ab4e224288403b0b4b6b8c4d339323150c312

I did not have this patch. I'll apply it, rerun, and send you the pcap.

(node-62) [~/work/linux-2.6] wcat
http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/patch/?id=4a5ab4e224288403b0b4b6b8c4d339323150c312 | patch -p1
patching file net/ipv4/tcp_input.c
patching file net/ipv4/tcp_output.c
Hunk #1 succeeded at 1987 (offset 10 lines).

Cheers,
Thomas

2014-02-10 21:14:49

by Thomas Glanzmann

[permalink] [raw]
Subject: Re: [PATCH] This extends tx_data and and iscsit_do_tx_data with the additional parameter flags and avoids sending multiple TCP packets in iscsit_fe_sendpage_sg

Hello Eric,

> Hmm.. I was not aware of high RTT for some packets.
> Can you spot this on the pcap you provided ?

with the latest patch as in:

(node-62) [~/work/linux-2.6] git diff | pbot
http://pbot.rmdir.de/CQwqI6b7wJProw_xaukmEg

with net.ipv4.tcp_min_tso_segs=2 we had this pcap:
https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched_tcp_more_notlast.pcap.bz2
And here is the RTT TCP graph:
Wireshark > Statistics > TCP Stream Graph > Round Trip Time Graph
https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-10-22_04_30.png

with net.ipv4.tcp_min_tso_segs=8 we have this pcap:
https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched_tcp_more_notlast_min_tso_segs_8.pcap.bz2
And here is the RTT TCP graph:
Wireshark > Statistics > TCP Stream Graph > Round Trip Time Graph

This gives us an RTT of 0.0015 seconds (1.5 ms).

Without TCP autocorking we had 0.0005 seconds (0.5 ms).

https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:53:17.png

Cheers,
Thomas

2014-02-17 14:08:25

by Thomas Glanzmann

[permalink] [raw]

2014-02-17 15:26:51

by Eric Dumazet

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

On Mon, 2014-02-17 at 15:08 +0100, Thomas Glanzmann wrote:
> Hello Eric,
> may submit your latest patch for upstream? Or do you plan on doing that
> yourself?

Unfortunately you did not have good results with MSG_MORE applied to
the page fragments.

I think I'll submit only the part dealing with the metadata.

Then later we might take care of the pages themselves.

2014-02-17 15:32:20

by Thomas Glanzmann

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

Hello Eric,

> Unfortunately you did not had good results with the MSG_MORE applied
> to the page fragments.

I agree. We should submit only the patch from this message:

Message-ID: <1391886759.10160.114.camel@edumazet-glaptop2.roam.corp.google.com>
http://mid.gmane.org/1391886759.10160.114.camel@edumazet-glaptop2.roam.corp.google.com

> I think I'll submit the part only dealing with the metadata.

May I submit the patch or do you want to do it yourself?

Cheers,
Thomas

2014-02-17 15:46:17

by Eric Dumazet

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

On Mon, 2014-02-17 at 16:32 +0100, Thomas Glanzmann wrote:
> Hello Eric,
>
> > Unfortunately you did not had good results with the MSG_MORE applied
> > to the page fragments.
>
> I agree. We should submit only the submit the patch from this message:
>
> Message-ID: <1391886759.10160.114.camel@edumazet-glaptop2.roam.corp.google.com>
> http://mid.gmane.org/1391886759.10160.114.camel@edumazet-glaptop2.roam.corp.google.com
>
> > I think I'll submit the part only dealing with the metadata.
>
> May I submit the patch or do you want to do it yourself?

I'll do it tomorrow: today is President's Day in the US, and I am
spending the day with my family.

Thanks !

2014-02-17 15:47:03

by Thomas Glanzmann

[permalink] [raw]
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]

Hello Eric,

> I'll do it tomorrow : Today is President's Day in the US, and I am
> spending the day with my family.

thank you. Enjoy your day.

Cheers,
Thomas