Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18;
MIME-Version: 1.0
References: <d7fbf3d3a2490d0a9e99945593ada243da58e0f8.1619000255.git.cdleonard@gmail.com>
 <CADVnQynLSDQHxgMN6=mU2m58t_JKUyugmw0j6g1UDG+jLxTfAw@mail.gmail.com> <CAH56bmDBGsHOSjJpo=TseUATOh0cZqTMFyFO1sqtQmMrTPHtrA@mail.gmail.com>
In-Reply-To: <CAH56bmDBGsHOSjJpo=TseUATOh0cZqTMFyFO1sqtQmMrTPHtrA@mail.gmail.com>
From:   Matt Mathis <mattmathis@google.com>
Date:   Wed, 21 Apr 2021 09:45:42 -0700
Message-ID: <CAH56bmCp8eRqsdoMTmAmCaEnubwEy317OJKQ9UjqMvDwrkcMdQ@mail.gmail.com>
Subject: Fwd: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing
To:     Leonard Crestez <cdleonard@gmail.com>
Cc:     "Cc: Willem de Bruijn" <willemb@google.com>,
        Neal Cardwell <ncardwell@google.com>,
        Ilya Lesokhin <ilyal@mellanox.com>,
        "David S. Miller" <davem@davemloft.net>,
        Eric Dumazet <edumazet@google.com>,
        Jakub Kicinski <kuba@kernel.org>,
        Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>,
        David Ahern <dsahern@kernel.org>, Wei Wang <weiwan@google.com>,
        Soheil Hassas Yeganeh <soheil@google.com>,
        Roopa Prabhu <roopa@cumulusnetworks.com>,
        netdev <netdev@vger.kernel.org>, linux-kernel@vger.kernel.org,
        Yuchung Cheng <ycheng@google.com>
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

(Resending in plain text mode)

Surely there is a way to adapt tcp_tso_should_defer(); it is trying to
solve a similar problem.

If I were to implement PLPMTUD today, I would more deeply entwine it
into TCP's support for TSO.  For example, successfully deferring
segments sometimes enables TSO and sometimes enables PLPMTUD.

But there is a deeper question:  John Heffner and I invested a huge
amount of energy in trying to make PLPMTUD work for opportunistic
Jumbo discovery, only to discover that we had moved the problem down
to the device driver/NIC, where it isn't so readily solvable.

The driver needs to carve NIC buffer memory before it can communicate
with a switch (to either ask or measure the MTU), and once it has done
that it needs to either re-carve the memory or run with suboptimal
carving.  Both of these are problematic.

There is also a problem that many link technologies will
non-deterministically deliver jumbo frames at greatly increased error
rates.  This issue requires a long conversation on its own.

Thanks,
--MM--
The best way to predict the future is to create it.  - Alan Kay

We must not tolerate intolerance;
       however our response must be carefully measured:
            too strong would be hypocritical and risks spiraling out of control;
            too weak risks being mistaken for tacit approval.


On Wed, Apr 21, 2021 at 5:48 AM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Wed, Apr 21, 2021 at 6:21 AM Leonard Crestez <cdleonard@gmail.com> wrote:
> >
> > According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
> > in order to accumulate enough data" but linux almost never does that.
> >
> > Linux waits for probe_size + (1 + retries) * mss_cache to be available
> > in the send buffer and if that condition is not met it will send anyway
> > using the current MSS. The feature can be made to work by sending very
> > large chunks of data from userspace (for example 128k) but for small writes
> > on fast links probes almost never happen.
> >
> > This patch tries to implement the "MAY" by adding an extra flag
> > "wait_data" to icsk_mtup which is set to 1 if a probe is possible but
> > insufficient data is available. Then data is held back in
> > tcp_write_xmit until a probe is sent, probing conditions are no longer
> > met, or 500ms pass.
> >
> > Signed-off-by: Leonard Crestez <cdleonard@gmail.com>
> >
> > ---
> >  Documentation/networking/ip-sysctl.rst |  4 ++
> >  include/net/inet_connection_sock.h     |  7 +++-
> >  include/net/netns/ipv4.h               |  1 +
> >  include/net/tcp.h                      |  2 +
> >  net/ipv4/sysctl_net_ipv4.c             |  7 ++++
> >  net/ipv4/tcp_ipv4.c                    |  1 +
> >  net/ipv4/tcp_output.c                  | 54 ++++++++++++++++++++++++--
> >  7 files changed, 71 insertions(+), 5 deletions(-)
> >
> > My tests are here: https://github.com/cdleonard/test-tcp-mtu-probing
> >
> > This patch makes the test pass quite reliably with
> > ICMP_BLACKHOLE=1 TCP_MTU_PROBING=1 IPERF_WINDOW=256k IPERF_LEN=8k while
> > before it only worked with much higher IPERF_LEN=256k
> >
> > In my loopback tests I also observed another issue when tcp_retries
> > increases because of SACKReorder. This makes the original problem worse
> > (since the retries amount factors into the buffer requirement) and seems
> > to be an unrelated issue. Maybe when loss happens due to MTU shrinkage
> > the sender sack logic is confused somehow?
> >
> > I know it's towards the end of the cycle but this is mostly just intended for
> > discussion.
>
> Thanks for raising the question of how to trigger PMTU probes more often!
>
> AFAICT this approach would cause unacceptable performance impacts by
> often injecting unnecessary 500ms delays when there is no need to do
> so.
>
> If the goal is to increase the frequency of PMTU probes, which seems
> like a valid goal, I would suggest that we rethink the Linux heuristic
> for triggering PMTU probes in the light of the fact that the loss
> detection mechanism is now RACK-TLP, which provides quick recovery in
> a much wider variety of scenarios.
>
> After all, https://tools.ietf.org/html/rfc4821#section-7.4 says:
>
>    In addition, the timely loss detection algorithms in most protocols
>    have pre-conditions that SHOULD be satisfied before sending a probe.
>
> And we know that the "timely loss detection algorithms" have advanced
> since this RFC was written in 2007.
>
> You mention:
> > Linux waits for probe_size + (1 + retries) * mss_cache to be available
>
> The code in question seems to be:
>
>   size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>
> How about just changing this to:
>
>   size_needed = probe_size + tp->mss_cache;
>
> The rationale would be that if that amount of data is available, then
> the sender can send one probe and one following current-mss-size
> packet. If the path MTU has not increased to allow the probe of size
> probe_size to pass through the network, then the following
> current-mss-size packet will likely pass through the network, generate
> a SACK, and trigger a RACK fast recovery 1/4*min_rtt later, when the
> RACK reorder timer fires.
>
> A secondary rationale for this heuristic would be: if the flow never
> accumulates roughly two packets worth of data, then does the flow
> really need a bigger packet size?
>
> IMHO, just reducing the size_needed seems far preferable to needlessly
> injecting 500ms delays.
>
> best,
> neal