2021-04-22 01:48:19

by Matt Mathis

Subject: Fwd: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing

(Resending in plain text mode)

Surely there is a way to adapt tcp_tso_should_defer(); it is trying to
solve a similar problem.

If I were to implement PLPMTUD today, I would entwine it more deeply
into TCP's support for TSO, e.g. successfully deferring segments
sometimes enables TSO and sometimes enables PLPMTUD.
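
A rough sketch of what that combined decision could look like
(illustrative only; tcp_defer_for_probe() is a hypothetical helper, and
the size_needed computation is borrowed from tcp_mtu_probe()):

static bool tcp_defer_for_probe(struct sock *sk)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
	u32 probe_size, size_needed;

	/* No probing configured, or a probe is already in flight. */
	if (!icsk->icsk_mtup.enabled || icsk->icsk_mtup.probe_size)
		return false;

	/* Midpoint of the binary-search window, as tcp_mtu_probe()
	 * computes it.
	 */
	probe_size = tcp_mtu_to_mss(sk, (icsk->icsk_mtup.search_high +
					 icsk->icsk_mtup.search_low) >> 1);
	size_needed = probe_size + tp->mss_cache;

	/* Defer while the send queue is still short of a probe, the
	 * same way tcp_tso_should_defer() waits to grow a TSO burst.
	 */
	return tp->write_seq - tp->snd_nxt < size_needed;
}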

But there is a deeper question: John Heffner and I invested a huge
amount of energy in trying to make PLPMTUD work for opportunistic
Jumbo discovery, only to discover that we had moved the problem down
to the device driver/NIC, where it isn't so readily solvable.

The driver needs to carve NIC buffer memory before it can communicate
with a switch (to either ask or measure the MTU), and once it has done
that it needs to either re-carve the memory or run with suboptimal
carving. Both of these are problematic.

There is also a problem that many link technologies will
non-deterministically deliver jumbo frames at greatly increased error
rates. This issue requires a long conversation on its own.

Thanks,
--MM--
The best way to predict the future is to create it. - Alan Kay

We must not tolerate intolerance;
however our response must be carefully measured:
too strong would be hypocritical and risks spiraling out of control;
too weak risks being mistaken for tacit approval.


On Wed, Apr 21, 2021 at 5:48 AM Neal Cardwell <[email protected]> wrote:
>
> On Wed, Apr 21, 2021 at 6:21 AM Leonard Crestez <[email protected]> wrote:
> >
> > According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
> > in order to accumulate enough data", but Linux almost never does that.
> >
> > Linux waits for probe_size + (1 + retries) * mss_cache to be available
> > in the send buffer, and if that condition is not met it will send anyway
> > using the current MSS. The feature can be made to work by sending very
> > large chunks of data from userspace (for example 128k), but for small
> > writes on fast links probes almost never happen.
> >
> > This patch tries to implement the "MAY" by adding an extra flag
> > "wait_data" to icsk_mtup which is set to 1 if a probe is possible but
> > insufficient data is available. Then data is held back in
> > tcp_write_xmit until a probe is sent, probing conditions are no longer
> > met, or 500ms pass.
> >
> > Signed-off-by: Leonard Crestez <[email protected]>
> >
> > ---
> > Documentation/networking/ip-sysctl.rst |  4 ++
> > include/net/inet_connection_sock.h     |  7 +++-
> > include/net/netns/ipv4.h               |  1 +
> > include/net/tcp.h                      |  2 +
> > net/ipv4/sysctl_net_ipv4.c             |  7 ++++
> > net/ipv4/tcp_ipv4.c                    |  1 +
> > net/ipv4/tcp_output.c                  | 54 ++++++++++++++++++++++++--
> > 7 files changed, 71 insertions(+), 5 deletions(-)
> >
> > My tests are here: https://github.com/cdleonard/test-tcp-mtu-probing
> >
> > This patch makes the test pass quite reliably with
> > ICMP_BLACKHOLE=1 TCP_MTU_PROBING=1 IPERF_WINDOW=256k IPERF_LEN=8k, while
> > before it only worked with a much higher IPERF_LEN=256k.
> >
> > In my loopback tests I also observed another issue: tp->reordering
> > increases because of SACKReorder events. This makes the original
> > problem worse (since the reordering amount factors into the buffer
> > requirement) and seems to be an unrelated issue. Maybe when loss
> > happens due to MTU shrinkage the sender's SACK logic is confused
> > somehow?
> >
> > I know it's towards the end of the cycle but this is mostly just intended for
> > discussion.
>
> Thanks for raising the question of how to trigger PMTU probes more often!
>
> AFAICT this approach would cause unacceptable performance impacts by
> often injecting 500ms delays when there is no need to do so.
>
> If the goal is to increase the frequency of PMTU probes, which seems
> like a valid goal, I would suggest that we rethink the Linux heuristic
> for triggering PMTU probes in the light of the fact that the loss
> detection mechanism is now RACK-TLP, which provides quick recovery in
> a much wider variety of scenarios.
>
> After all, https://tools.ietf.org/html/rfc4821#section-7.4 says:
>
> In addition, the timely loss detection algorithms in most protocols
> have pre-conditions that SHOULD be satisfied before sending a probe.
>
> And we know that the "timely loss detection algorithms" have advanced
> since this RFC was written in 2007.
>
> You mention:
> > Linux waits for probe_size + (1 + retries) * mss_cache to be available
>
> The code in question seems to be:
>
> size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>
> How about just changing this to:
>
> size_needed = probe_size + tp->mss_cache;
>
> The rationale would be that if that amount of data is available, then
> the sender can send one probe and one following current-mss-size
> packet. If the path MTU has not increased to allow the probe of size
> probe_size to pass through the network, then the following
> current-mss-size packet will likely pass through the network, generate
> a SACK, and trigger a RACK fast recovery 1/4*min_rtt later, when the
> RACK reorder timer fires.
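>
> (For example, with a min_rtt of 40 ms, the RACK reorder timer would
> fire after roughly 10 ms, versus the 500 ms delay discussed above.)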
>
> A secondary rationale for this heuristic would be: if the flow never
> accumulates roughly two packets worth of data, then does the flow
> really need a bigger packet size?
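>
> For reference, the check this heuristic gates is in tcp_mtu_probe() in
> net/ipv4/tcp_output.c; paraphrasing the relevant lines:
>
>   size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>   ...
>   /* Have enough data in the send queue to probe? */
>   if (tp->write_seq - tp->snd_nxt < size_needed)
>       return -1;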
>
> IMHO, just reducing the size_needed seems far preferable to needlessly
> injecting 500ms delays.
>
> best,
> neal


2021-04-26 02:36:30

by Leonard Crestez

Subject: Re: Fwd: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing

On 4/21/21 7:45 PM, Matt Mathis wrote:
> (Resending in plain text mode)
>
> Surely there is a way to adapt tcp_tso_should_defer(); it is trying to
> solve a similar problem.
>
> If I were to implement PLPMTUD today, I would entwine it more deeply
> into TCP's support for TSO, e.g. successfully deferring segments
> sometimes enables TSO and sometimes enables PLPMTUD.

The mechanisms for delaying sending are difficult to understand; this
RFC just added a brand-new unrelated timer. Intertwining it with
existing mechanisms would indeed be better. On closer look it seems
that those mechanisms are not actually based on a timer but on other
heuristics.

It seems that tcp_sendmsg will "tcp_push_one" once the skb at the head
of the queue reaches tcp_xmit_size_goal, and tcp_xmit_size_goal does
not take mtu probing into account. In practice this means that
application-limited streams won't perform mtu probing unless a single
write is 5*mss + probe_size (1*mss over size_needed).

I sent a different RFC which tries to modify tcp_xmit_size_goal.
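
The gist of it, heavily simplified (this is not the actual patch;
tcp_mtu_probe_size_needed() is a made-up name here for the computation
that tcp_mtu_probe() already does internally):

static u32 tcp_mtu_probe_size_needed(struct sock *sk)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
	int probe_size;

	probe_size = tcp_mtu_to_mss(sk, (icsk->icsk_mtup.search_high +
					 icsk->icsk_mtup.search_low) >> 1);
	return probe_size + (tp->reordering + 1) * tp->mss_cache;
}

with tcp_xmit_size_goal() returning at least this value while probing
is enabled and no probe is in flight, so that small writes keep
accumulating into one skb instead of being pushed out at the current
mss.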

> But there is a deeper question: John Heffner and I invested a huge
> amount of energy in trying to make PLPMTUD work for opportunistic
> Jumbo discovery, only to discover that we had moved the problem down
> to the device driver/NIC, where it isn't so readily solvable.
>
> The driver needs to carve NIC buffer memory before it can communicate
> with a switch (to either ask or measure the MTU), and once it has done
> that it needs to either re-carve the memory or run with suboptimal
> carving. Both of these are problematic.
>
> There is also a problem that many link technologies will
> non-deterministically deliver jumbo frames at greatly increased error
> rates. This issue requires a long conversation on its own.

I'm looking to improve this for tunnels that don't correctly send ICMP
packet-too-big messages; the hardware is assumed to be fine.

--
Regards,
Leonard