2019-06-21 21:38:02

by Pierre-Loup A. Griffais

[permalink] [raw]
Subject: Steam is broken on new kernels

This seems to have broken us:

https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.1.11

Here's some affected users:

https://github.com/ValveSoftware/steam-for-linux/issues/6326

https://www.reddit.com/r/linux_gaming/comments/c37lmh/psa_steam_does_not_connect_on_kernels_newer_than/

https://www.phoronix.com/scan.php?page=news_item&px=Steam-Networking-Kernel-Woes

I don't really understand that distributions that claim to be desktop
products would have fast-tracked a server-oriented fix to all their
users without testing one of the primary desktop usecases, but that's
another thing to figure out later.


2019-06-21 21:42:10

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: Steam is broken on new kernels

On Fri, Jun 21, 2019 at 02:27:41PM -0700, Pierre-Loup A. Griffais wrote:
> This seems to have broken us:
>
> https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.1.11
>
> Here's some affected users:
>
> https://github.com/ValveSoftware/steam-for-linux/issues/6326
>
> https://www.reddit.com/r/linux_gaming/comments/c37lmh/psa_steam_does_not_connect_on_kernels_newer_than/
>
> https://www.phoronix.com/scan.php?page=news_item&px=Steam-Networking-Kernel-Woes
>
> I don't really understand that distributions that claim to be desktop
> products would have fast-tracked a server-oriented fix to all their users
> without testing one of the primary desktop usecases, but that's another
> thing to figure out later.
>

What specific commit caused the breakage?

2019-06-21 21:46:02

by Pierre-Loup A. Griffais

[permalink] [raw]
Subject: Re: Steam is broken on new kernels



On 6/21/19 2:41 PM, Greg Kroah-Hartman wrote:
> On Fri, Jun 21, 2019 at 02:27:41PM -0700, Pierre-Loup A. Griffais wrote:
>> This seems to have broken us:
>>
>> https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.1.11
>>
>> Here's some affected users:
>>
>> https://github.com/ValveSoftware/steam-for-linux/issues/6326
>>
>> https://www.reddit.com/r/linux_gaming/comments/c37lmh/psa_steam_does_not_connect_on_kernels_newer_than/
>>
>> https://www.phoronix.com/scan.php?page=news_item&px=Steam-Networking-Kernel-Woes
>>
>> I don't really understand that distributions that claim to be desktop
>> products would have fast-tracked a server-oriented fix to all their users
>> without testing one of the primary desktop usecases, but that's another
>> thing to figure out later.
>>
>
> What specific commit caused the breakage?

Unclear as of yet, we're starting that process on our end just now. Not
sure if a user has already performed a bisect and shared his findings in
one of the links below, but we'll be scouring these and doing our own
testing as next steps.

>

2019-06-21 22:20:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: Steam is broken on new kernels

On Fri, Jun 21, 2019 at 2:41 PM Greg Kroah-Hartman
<[email protected]> wrote:
>
> What specific commit caused the breakage?

Both on reddit and on github there seems to be confusion about whether
it's a problem or not. Some people have it working with the exact same
kernel that breaks for others.

And then some people seem to say it works intermittently for them,
which seems to indicate a timing issue.

Looking at the SACK patches (assuming it's one of them), I'd suspect
the "tcp: tcp_fragment() should apply sane memory limits".

Eric, that one does

if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
return -ENOMEM;
}

but I think it's *normal* for "sk_wmem_queued >> 1" to be around the
same size as sk_sndbuf. So if there is some fragmentation, and we add
more skb's to it, that would seem to trigger fairly easily.
Particularly since this is all in "truesize" units, which can be a lot
bigger than the packets themselves.

I don't know the code, so I may be out to lunch and barking up
completely the wrong tree, but that particular check does seem like it
might trigger much more easily than I think the code _intended_ it to
trigger?

Pierre-Loup - do you guys have a test-case inside of valve? Or is this
purely "we see some people with problems"?

Linus

2019-06-21 23:07:00

by Pierre-Loup A. Griffais

[permalink] [raw]
Subject: Re: Steam is broken on new kernels



On 6/21/19 3:38 PM, Eric Dumazet wrote:
> Please look at my recent patch.
>  Sorry I am travelling....
>
> On Fri, Jun 21, 2019, 6:19 PM Linus Torvalds
> <[email protected] <mailto:[email protected]>>
> wrote:
>
> On Fri, Jun 21, 2019 at 2:41 PM Greg Kroah-Hartman
> <[email protected] <mailto:[email protected]>> wrote:
> >
> > What specific commit caused the breakage?
>
> Both on reddit and on github there seems to be confusion about whether
> it's a problem or not. Some people have it working with the exact same
> kernel that breaks for others.
>
> And then some people seem to say it works intermittently for them,
> which seems to indicate a timing issue.
>
> Looking at the SACK patches (assuming it's one of them), I'd suspect
> the "tcp: tcp_fragment() should apply sane memory limits".
>
> Eric, that one does
>
>        if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
>                NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
>                return -ENOMEM;
>        }
>
> but I think it's *normal* for "sk_wmem_queued >> 1" to be around the
> same size as sk_sndbuf. So if there is some fragmentation, and we add
> more skb's to it, that would seem to trigger fairly easily.
> Particularly since this is all in "truesize" units, which can be a lot
> bigger than the packets themselves.
>
> I don't know the code, so I may be out to lunch and barking up
> completely the wrong tree, but that particular check does seem like it
> might trigger much more easily than I think the code _intended_ it to
> trigger?
>
> Pierre-Loup - do you guys have a test-case inside of valve? Or is this
> purely "we see some people with problems"?

Definitely the latter, although the volume of complaints clearly points
to a real problem from our experience. Reproducing locally, bisecting
and testing possible fixes is just now starting on our end.

I agree not all users seem affected; most affected people report success
by using -tcp to launch Steam, which makes it use direct TCP instead of
WebSockets, our current default connection method for Linux.

Thanks,
- Pierre-Loup

>
>                Linus
>

2019-06-21 23:44:00

by Josh Hunt

[permalink] [raw]
Subject: Re: Steam is broken on new kernels

On Fri, Jun 21, 2019 at 4:07 PM Pierre-Loup A. Griffais
<[email protected]> wrote:
>
>
>
> On 6/21/19 3:38 PM, Eric Dumazet wrote:
> > Please look at my recent patch.
> > Sorry I am travelling....
> >
> > On Fri, Jun 21, 2019, 6:19 PM Linus Torvalds
> > <[email protected] <mailto:[email protected]>>
> > wrote:
> >
> > On Fri, Jun 21, 2019 at 2:41 PM Greg Kroah-Hartman
> > <[email protected] <mailto:[email protected]>> wrote:
> > >
> > > What specific commit caused the breakage?
> >
> > Both on reddit and on github there seems to be confusion about whether
> > it's a problem or not. Some people have it working with the exact same
> > kernel that breaks for others.
> >
> > And then some people seem to say it works intermittently for them,
> > which seems to indicate a timing issue.
> >
> > Looking at the SACK patches (assuming it's one of them), I'd suspect
> > the "tcp: tcp_fragment() should apply sane memory limits".
> >
> > Eric, that one does
> >
> > if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
> > NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
> > return -ENOMEM;
> > }
> >
> > but I think it's *normal* for "sk_wmem_queued >> 1" to be around the
> > same size as sk_sndbuf. So if there is some fragmentation, and we add
> > more skb's to it, that would seem to trigger fairly easily.
> > Particularly since this is all in "truesize" units, which can be a lot
> > bigger than the packets themselves.
> >
> > I don't know the code, so I may be out to lunch and barking up
> > completely the wrong tree, but that particular check does seem like it
> > might trigger much more easily than I think the code _intended_ it to
> > trigger?
> >
> > Pierre-Loup - do you guys have a test-case inside of valve? Or is this
> > purely "we see some people with problems"?
>
> Definitely the latter, although the volume of complaints clearly points
> to a real problem from our experience. Reproducing locally, bisecting
> and testing possible fixes is just now starting on our end.
>
> I agree not all users seem affected; most affected people report success
> by using -tcp to launch Steam, which makes it use direct TCP instead of
> WebSockets, our current default connection method for Linux.
>
> Thanks,
> - Pierre-Loup
>
> >
> > Linus
> >
>
I asked on the github thread if users seeing the problem could check
the new wqueue too big counter:
https://github.com/ValveSoftware/steam-for-linux/issues/6326

So far one person is seeing the counter increase when they see the
problem, and another that doesn't see the problem has the counter at
0. Obviously not a great sample size, but hopefully more will report.
If nothing else, someone is seeing the counter increase while trying
to connect to steam.
--
Josh

2019-06-21 23:56:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: Steam is broken on new kernels

Eric is talking about this patch, I think:

https://patchwork.ozlabs.org/patch/1120222/

I guess I'll ask people on the github thread to test that too.

Linus

On Fri, Jun 21, 2019 at 3:38 PM Eric Dumazet <[email protected]> wrote:
>
> Please look at my recent patch.
> Sorry I am travelling....
>
> On Fri, Jun 21, 2019, 6:19 PM Linus Torvalds <[email protected]> wrote:
>>
>> On Fri, Jun 21, 2019 at 2:41 PM Greg Kroah-Hartman
>> <[email protected]> wrote:
>> >
>> > What specific commit caused the breakage?
>>
>> Both on reddit and on github there seems to be confusion about whether
>> it's a problem or not. Some people have it working with the exact same
>> kernel that breaks for others.
>>
>> And then some people seem to say it works intermittently for them,
>> which seems to indicate a timing issue.
>>
>> Looking at the SACK patches (assuming it's one of them), I'd suspect
>> the "tcp: tcp_fragment() should apply sane memory limits".
>>
>> Eric, that one does
>>
>> if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
>> NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
>> return -ENOMEM;
>> }
>>
>> but I think it's *normal* for "sk_wmem_queued >> 1" to be around the
>> same size as sk_sndbuf. So if there is some fragmentation, and we add
>> more skb's to it, that would seem to trigger fairly easily.
>> Particularly since this is all in "truesize" units, which can be a lot
>> bigger than the packets themselves.
>>
>> I don't know the code, so I may be out to lunch and barking up
>> completely the wrong tree, but that particular check does seem like it
>> might trigger much more easily than I think the code _intended_ it to
>> trigger?
>>
>> Pierre-Loup - do you guys have a test-case inside of valve? Or is this
>> purely "we see some people with problems"?
>>
>> Linus

2019-06-22 00:19:50

by Eric Dumazet

[permalink] [raw]
Subject: Re: Steam is broken on new kernels

On Fri, Jun 21, 2019 at 7:54 PM Linus Torvalds
<[email protected]> wrote:
>
> Eric is talking about this patch, I think:
>
> https://patchwork.ozlabs.org/patch/1120222/
>

That is correct.

I am about to take a flight from Boston to Paris, so I can not really
follow discussions/tests for the following hours.

Thanks.

> I guess I'll ask people on the github thread to test that too.
>
> Linus
>
> On Fri, Jun 21, 2019 at 3:38 PM Eric Dumazet <[email protected]> wrote:
> >
> > Please look at my recent patch.
> > Sorry I am travelling....
> >
> > On Fri, Jun 21, 2019, 6:19 PM Linus Torvalds <[email protected]> wrote:
> >>
> >> On Fri, Jun 21, 2019 at 2:41 PM Greg Kroah-Hartman
> >> <[email protected]> wrote:
> >> >
> >> > What specific commit caused the breakage?
> >>
> >> Both on reddit and on github there seems to be confusion about whether
> >> it's a problem or not. Some people have it working with the exact same
> >> kernel that breaks for others.
> >>
> >> And then some people seem to say it works intermittently for them,
> >> which seems to indicate a timing issue.
> >>
> >> Looking at the SACK patches (assuming it's one of them), I'd suspect
> >> the "tcp: tcp_fragment() should apply sane memory limits".
> >>
> >> Eric, that one does
> >>
> >> if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
> >> NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
> >> return -ENOMEM;
> >> }
> >>
> >> but I think it's *normal* for "sk_wmem_queued >> 1" to be around the
> >> same size as sk_sndbuf. So if there is some fragmentation, and we add
> >> more skb's to it, that would seem to trigger fairly easily.
> >> Particularly since this is all in "truesize" units, which can be a lot
> >> bigger than the packets themselves.
> >>
> >> I don't know the code, so I may be out to lunch and barking up
> >> completely the wrong tree, but that particular check does seem like it
> >> might trigger much more easily than I think the code _intended_ it to
> >> trigger?
> >>
> >> Pierre-Loup - do you guys have a test-case inside of valve? Or is this
> >> purely "we see some people with problems"?
> >>
> >> Linus

2019-06-22 01:05:27

by Pierre-Loup A. Griffais

[permalink] [raw]
Subject: Re: Steam is broken on new kernels



On 6/21/19 5:19 PM, Eric Dumazet wrote:
> On Fri, Jun 21, 2019 at 7:54 PM Linus Torvalds
> <[email protected]> wrote:
>>
>> Eric is talking about this patch, I think:
>>
>> https://patchwork.ozlabs.org/patch/1120222/
>>
>
> That is correct.
>
> I am about to take a flight from Boston to Paris, so I can not really
> follow discussions/tests for the following hours.

I built the tip of linux-5.1.y and reproduced the issue while trying to
log out and back into Steam; it exhibited this symptom as well:

pgriffais@pgriffais:~$ nstat -az | grep -i wqueue
TcpExtTCPWqueueTooBig 31 0.0

I applied Eric's path to the tip of the branch and ran that kernel and
the bug didn't occur through several logout / login cycles, so things
look good at first glance. I'll keep running that kernel and report back
if anything crops up in the future, but I believe we're good, beyond
getting distros to ship this additional fix.

Thanks,
- Pierre-Loup

>
> Thanks.
>
>> I guess I'll ask people on the github thread to test that too.
>>
>> Linus
>>
>> On Fri, Jun 21, 2019 at 3:38 PM Eric Dumazet <[email protected]> wrote:
>>>
>>> Please look at my recent patch.
>>> Sorry I am travelling....
>>>
>>> On Fri, Jun 21, 2019, 6:19 PM Linus Torvalds <[email protected]> wrote:
>>>>
>>>> On Fri, Jun 21, 2019 at 2:41 PM Greg Kroah-Hartman
>>>> <[email protected]> wrote:
>>>>>
>>>>> What specific commit caused the breakage?
>>>>
>>>> Both on reddit and on github there seems to be confusion about whether
>>>> it's a problem or not. Some people have it working with the exact same
>>>> kernel that breaks for others.
>>>>
>>>> And then some people seem to say it works intermittently for them,
>>>> which seems to indicate a timing issue.
>>>>
>>>> Looking at the SACK patches (assuming it's one of them), I'd suspect
>>>> the "tcp: tcp_fragment() should apply sane memory limits".
>>>>
>>>> Eric, that one does
>>>>
>>>> if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
>>>> NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
>>>> return -ENOMEM;
>>>> }
>>>>
>>>> but I think it's *normal* for "sk_wmem_queued >> 1" to be around the
>>>> same size as sk_sndbuf. So if there is some fragmentation, and we add
>>>> more skb's to it, that would seem to trigger fairly easily.
>>>> Particularly since this is all in "truesize" units, which can be a lot
>>>> bigger than the packets themselves.
>>>>
>>>> I don't know the code, so I may be out to lunch and barking up
>>>> completely the wrong tree, but that particular check does seem like it
>>>> might trigger much more easily than I think the code _intended_ it to
>>>> trigger?
>>>>
>>>> Pierre-Loup - do you guys have a test-case inside of valve? Or is this
>>>> purely "we see some people with problems"?
>>>>
>>>> Linus
>

2019-06-22 05:30:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: Steam is broken on new kernels

On Fri, Jun 21, 2019 at 6:03 PM Pierre-Loup A. Griffais
<[email protected]> wrote:
>
> I applied Eric's path to the tip of the branch and ran that kernel and
> the bug didn't occur through several logout / login cycles, so things
> look good at first glance. I'll keep running that kernel and report back
> if anything crops up in the future, but I believe we're good, beyond
> getting distros to ship this additional fix.

Good. It's now in my tree, so we can get it quickly into stable and
then quickly to distributions.

Greg, it's commit b6653b3629e5 ("tcp: refine memory limit test in
tcp_fragment()"), and I'm building it right now and I'll push it out
in a couple of minutes assuming nothing odd is going on.

Linus

2019-06-22 06:36:20

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: Steam is broken on new kernels

On Fri, Jun 21, 2019 at 10:28:21PM -0700, Linus Torvalds wrote:
> On Fri, Jun 21, 2019 at 6:03 PM Pierre-Loup A. Griffais
> <[email protected]> wrote:
> >
> > I applied Eric's path to the tip of the branch and ran that kernel and
> > the bug didn't occur through several logout / login cycles, so things
> > look good at first glance. I'll keep running that kernel and report back
> > if anything crops up in the future, but I believe we're good, beyond
> > getting distros to ship this additional fix.
>
> Good. It's now in my tree, so we can get it quickly into stable and
> then quickly to distributions.
>
> Greg, it's commit b6653b3629e5 ("tcp: refine memory limit test in
> tcp_fragment()"), and I'm building it right now and I'll push it out
> in a couple of minutes assuming nothing odd is going on.

Thanks, I will pick it up now.

greg k-h

2019-06-22 07:38:19

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: Steam is broken on new kernels

On Fri, Jun 21, 2019 at 10:28:21PM -0700, Linus Torvalds wrote:
> On Fri, Jun 21, 2019 at 6:03 PM Pierre-Loup A. Griffais
> <[email protected]> wrote:
> >
> > I applied Eric's path to the tip of the branch and ran that kernel and
> > the bug didn't occur through several logout / login cycles, so things
> > look good at first glance. I'll keep running that kernel and report back
> > if anything crops up in the future, but I believe we're good, beyond
> > getting distros to ship this additional fix.
>
> Good. It's now in my tree, so we can get it quickly into stable and
> then quickly to distributions.
>
> Greg, it's commit b6653b3629e5 ("tcp: refine memory limit test in
> tcp_fragment()"), and I'm building it right now and I'll push it out
> in a couple of minutes assuming nothing odd is going on.

This looks good for 4.19 and 5.1, so I'll push out new stable kernels in
a bit for them.

But for 4.14 and older, we don't have the "hint" to know this is an
outbound going packet and not to apply these checks at that point in
time, so this patch doesn't work.

I'll see if I can figure anything else later this afternoon for those
kernels...

thanks,

greg k-h

2019-06-26 02:04:17

by Guenter Roeck

[permalink] [raw]
Subject: Re: Steam is broken on new kernels

Hi Greg,

On Sat, Jun 22, 2019 at 09:37:53AM +0200, Greg Kroah-Hartman wrote:
> On Fri, Jun 21, 2019 at 10:28:21PM -0700, Linus Torvalds wrote:
> > On Fri, Jun 21, 2019 at 6:03 PM Pierre-Loup A. Griffais
> > <[email protected]> wrote:
> > >
> > > I applied Eric's path to the tip of the branch and ran that kernel and
> > > the bug didn't occur through several logout / login cycles, so things
> > > look good at first glance. I'll keep running that kernel and report back
> > > if anything crops up in the future, but I believe we're good, beyond
> > > getting distros to ship this additional fix.
> >
> > Good. It's now in my tree, so we can get it quickly into stable and
> > then quickly to distributions.
> >
> > Greg, it's commit b6653b3629e5 ("tcp: refine memory limit test in
> > tcp_fragment()"), and I'm building it right now and I'll push it out
> > in a couple of minutes assuming nothing odd is going on.
>
> This looks good for 4.19 and 5.1, so I'll push out new stable kernels in
> a bit for them.
>
> But for 4.14 and older, we don't have the "hint" to know this is an
> outbound going packet and not to apply these checks at that point in
> time, so this patch doesn't work.
>
> I'll see if I can figure anything else later this afternoon for those
> kernels...
>

I may have missed it, but I don't see a fix for the problem in
older stable branches. Any news ?

One possibility might be be to apply the part of 75c119afe14f7 which
introduces TCP_FRAG_IN_WRITE_QUEUE and TCP_FRAG_IN_RTX_QUEUE, if that
is acceptable.

Thanks,
Guenter

2019-06-26 02:32:02

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: Steam is broken on new kernels

On Tue, Jun 25, 2019 at 07:02:20PM -0700, Guenter Roeck wrote:
> Hi Greg,
>
> On Sat, Jun 22, 2019 at 09:37:53AM +0200, Greg Kroah-Hartman wrote:
> > On Fri, Jun 21, 2019 at 10:28:21PM -0700, Linus Torvalds wrote:
> > > On Fri, Jun 21, 2019 at 6:03 PM Pierre-Loup A. Griffais
> > > <[email protected]> wrote:
> > > >
> > > > I applied Eric's path to the tip of the branch and ran that kernel and
> > > > the bug didn't occur through several logout / login cycles, so things
> > > > look good at first glance. I'll keep running that kernel and report back
> > > > if anything crops up in the future, but I believe we're good, beyond
> > > > getting distros to ship this additional fix.
> > >
> > > Good. It's now in my tree, so we can get it quickly into stable and
> > > then quickly to distributions.
> > >
> > > Greg, it's commit b6653b3629e5 ("tcp: refine memory limit test in
> > > tcp_fragment()"), and I'm building it right now and I'll push it out
> > > in a couple of minutes assuming nothing odd is going on.
> >
> > This looks good for 4.19 and 5.1, so I'll push out new stable kernels in
> > a bit for them.
> >
> > But for 4.14 and older, we don't have the "hint" to know this is an
> > outbound going packet and not to apply these checks at that point in
> > time, so this patch doesn't work.
> >
> > I'll see if I can figure anything else later this afternoon for those
> > kernels...
> >
>
> I may have missed it, but I don't see a fix for the problem in
> older stable branches. Any news ?
>
> One possibility might be be to apply the part of 75c119afe14f7 which
> introduces TCP_FRAG_IN_WRITE_QUEUE and TCP_FRAG_IN_RTX_QUEUE, if that
> is acceptable.

That's what people have already discussed on the stable mailing list a
few hours ago, hopefully a patch shows up soon as I'm traveling at the
moment and can't do it myself...

thanks,

greg k-h

2019-06-26 03:44:53

by Guenter Roeck

[permalink] [raw]
Subject: Re: Steam is broken on new kernels

On 6/25/19 7:29 PM, Greg Kroah-Hartman wrote:
> On Tue, Jun 25, 2019 at 07:02:20PM -0700, Guenter Roeck wrote:
>> Hi Greg,
>>
>> On Sat, Jun 22, 2019 at 09:37:53AM +0200, Greg Kroah-Hartman wrote:
>>> On Fri, Jun 21, 2019 at 10:28:21PM -0700, Linus Torvalds wrote:
>>>> On Fri, Jun 21, 2019 at 6:03 PM Pierre-Loup A. Griffais
>>>> <[email protected]> wrote:
>>>>>
>>>>> I applied Eric's path to the tip of the branch and ran that kernel and
>>>>> the bug didn't occur through several logout / login cycles, so things
>>>>> look good at first glance. I'll keep running that kernel and report back
>>>>> if anything crops up in the future, but I believe we're good, beyond
>>>>> getting distros to ship this additional fix.
>>>>
>>>> Good. It's now in my tree, so we can get it quickly into stable and
>>>> then quickly to distributions.
>>>>
>>>> Greg, it's commit b6653b3629e5 ("tcp: refine memory limit test in
>>>> tcp_fragment()"), and I'm building it right now and I'll push it out
>>>> in a couple of minutes assuming nothing odd is going on.
>>>
>>> This looks good for 4.19 and 5.1, so I'll push out new stable kernels in
>>> a bit for them.
>>>
>>> But for 4.14 and older, we don't have the "hint" to know this is an
>>> outbound going packet and not to apply these checks at that point in
>>> time, so this patch doesn't work.
>>>
>>> I'll see if I can figure anything else later this afternoon for those
>>> kernels...
>>>
>>
>> I may have missed it, but I don't see a fix for the problem in
>> older stable branches. Any news ?
>>
>> One possibility might be be to apply the part of 75c119afe14f7 which
>> introduces TCP_FRAG_IN_WRITE_QUEUE and TCP_FRAG_IN_RTX_QUEUE, if that
>> is acceptable.
>
> That's what people have already discussed on the stable mailing list a
> few hours ago, hopefully a patch shows up soon as I'm traveling at the
> moment and can't do it myself...
>

Sounds good. Let me know if nothing shows up; I'll be happy to do it
if needed.

Guenter

2019-06-26 04:20:55

by Eric Dumazet

[permalink] [raw]
Subject: Re: Steam is broken on new kernels

On Wed, Jun 26, 2019 at 5:43 AM Guenter Roeck <[email protected]> wrote:
>
> On 6/25/19 7:29 PM, Greg Kroah-Hartman wrote:
> > On Tue, Jun 25, 2019 at 07:02:20PM -0700, Guenter Roeck wrote:
> >> Hi Greg,
> >>
> >> On Sat, Jun 22, 2019 at 09:37:53AM +0200, Greg Kroah-Hartman wrote:
> >>> On Fri, Jun 21, 2019 at 10:28:21PM -0700, Linus Torvalds wrote:
> >>>> On Fri, Jun 21, 2019 at 6:03 PM Pierre-Loup A. Griffais
> >>>> <[email protected]> wrote:
> >>>>>
> >>>>> I applied Eric's path to the tip of the branch and ran that kernel and
> >>>>> the bug didn't occur through several logout / login cycles, so things
> >>>>> look good at first glance. I'll keep running that kernel and report back
> >>>>> if anything crops up in the future, but I believe we're good, beyond
> >>>>> getting distros to ship this additional fix.
> >>>>
> >>>> Good. It's now in my tree, so we can get it quickly into stable and
> >>>> then quickly to distributions.
> >>>>
> >>>> Greg, it's commit b6653b3629e5 ("tcp: refine memory limit test in
> >>>> tcp_fragment()"), and I'm building it right now and I'll push it out
> >>>> in a couple of minutes assuming nothing odd is going on.
> >>>
> >>> This looks good for 4.19 and 5.1, so I'll push out new stable kernels in
> >>> a bit for them.
> >>>
> >>> But for 4.14 and older, we don't have the "hint" to know this is an
> >>> outbound going packet and not to apply these checks at that point in
> >>> time, so this patch doesn't work.
> >>>
> >>> I'll see if I can figure anything else later this afternoon for those
> >>> kernels...
> >>>
> >>
> >> I may have missed it, but I don't see a fix for the problem in
> >> older stable branches. Any news ?
> >>
> >> One possibility might be be to apply the part of 75c119afe14f7 which
> >> introduces TCP_FRAG_IN_WRITE_QUEUE and TCP_FRAG_IN_RTX_QUEUE, if that
> >> is acceptable.
> >
> > That's what people have already discussed on the stable mailing list a
> > few hours ago, hopefully a patch shows up soon as I'm traveling at the
> > moment and can't do it myself...
> >
>
> Sounds good. Let me know if nothing shows up; I'll be happy to do it
> if needed.


Without the rb-tree for rtx queues, old kernels are vulnerable to SACK
attacks if sk_sndbuf is too big,
so I would simply add a cushion in the test, instead of trying to
backport an illusion of the rb-tree fixes.



diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a8772e11dc1cb42d4319b6fc072c625d284c7ad5..a554213afa4ac41120d781fe64b7cd18ff9b56e8
100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1274,7 +1274,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff
*skb, u32 len,
if (nsize < 0)
nsize = 0;

- if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
+ if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf + 131072)) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
return -ENOMEM;
}

2019-06-26 06:22:44

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: Steam is broken on new kernels

On Wed, Jun 26, 2019 at 06:20:17AM +0200, Eric Dumazet wrote:
> On Wed, Jun 26, 2019 at 5:43 AM Guenter Roeck <[email protected]> wrote:
> >
> > On 6/25/19 7:29 PM, Greg Kroah-Hartman wrote:
> > > On Tue, Jun 25, 2019 at 07:02:20PM -0700, Guenter Roeck wrote:
> > >> Hi Greg,
> > >>
> > >> On Sat, Jun 22, 2019 at 09:37:53AM +0200, Greg Kroah-Hartman wrote:
> > >>> On Fri, Jun 21, 2019 at 10:28:21PM -0700, Linus Torvalds wrote:
> > >>>> On Fri, Jun 21, 2019 at 6:03 PM Pierre-Loup A. Griffais
> > >>>> <[email protected]> wrote:
> > >>>>>
> > >>>>> I applied Eric's path to the tip of the branch and ran that kernel and
> > >>>>> the bug didn't occur through several logout / login cycles, so things
> > >>>>> look good at first glance. I'll keep running that kernel and report back
> > >>>>> if anything crops up in the future, but I believe we're good, beyond
> > >>>>> getting distros to ship this additional fix.
> > >>>>
> > >>>> Good. It's now in my tree, so we can get it quickly into stable and
> > >>>> then quickly to distributions.
> > >>>>
> > >>>> Greg, it's commit b6653b3629e5 ("tcp: refine memory limit test in
> > >>>> tcp_fragment()"), and I'm building it right now and I'll push it out
> > >>>> in a couple of minutes assuming nothing odd is going on.
> > >>>
> > >>> This looks good for 4.19 and 5.1, so I'll push out new stable kernels in
> > >>> a bit for them.
> > >>>
> > >>> But for 4.14 and older, we don't have the "hint" to know this is an
> > >>> outbound going packet and not to apply these checks at that point in
> > >>> time, so this patch doesn't work.
> > >>>
> > >>> I'll see if I can figure anything else later this afternoon for those
> > >>> kernels...
> > >>>
> > >>
> > >> I may have missed it, but I don't see a fix for the problem in
> > >> older stable branches. Any news ?
> > >>
> > >> One possibility might be be to apply the part of 75c119afe14f7 which
> > >> introduces TCP_FRAG_IN_WRITE_QUEUE and TCP_FRAG_IN_RTX_QUEUE, if that
> > >> is acceptable.
> > >
> > > That's what people have already discussed on the stable mailing list a
> > > few hours ago, hopefully a patch shows up soon as I'm traveling at the
> > > moment and can't do it myself...
> > >
> >
> > Sounds good. Let me know if nothing shows up; I'll be happy to do it
> > if needed.
>
>
> Without the rb-tree for rtx queues, old kernels are vulnerable to SACK
> attacks if sk_sndbuf is too big,
> so I would simply add a cushion in the test, instead of trying to
> backport an illusion of the rb-tree fixes.
>
>
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index a8772e11dc1cb42d4319b6fc072c625d284c7ad5..a554213afa4ac41120d781fe64b7cd18ff9b56e8
> 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1274,7 +1274,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff
> *skb, u32 len,
> if (nsize < 0)
> nsize = 0;
>
> - if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
> + if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf + 131072)) {
> NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
> return -ENOMEM;
> }

That's a funny magic number, can we document what it means?

And yes, it's a much simpler patch, I'd rather take this than the fake
backport.

thanks,

greg k-h

2019-06-26 06:40:10

by Eric Dumazet

[permalink] [raw]
Subject: Re: Steam is broken on new kernels

On Wed, Jun 26, 2019 at 8:22 AM Greg Kroah-Hartman
<[email protected]> wrote:
>
> On Wed, Jun 26, 2019 at 06:20:17AM +0200, Eric Dumazet wrote:
> > On Wed, Jun 26, 2019 at 5:43 AM Guenter Roeck <[email protected]> wrote:
> > >
> > > On 6/25/19 7:29 PM, Greg Kroah-Hartman wrote:
> > > > On Tue, Jun 25, 2019 at 07:02:20PM -0700, Guenter Roeck wrote:
> > > >> Hi Greg,
> > > >>
> > > >> On Sat, Jun 22, 2019 at 09:37:53AM +0200, Greg Kroah-Hartman wrote:
> > > >>> On Fri, Jun 21, 2019 at 10:28:21PM -0700, Linus Torvalds wrote:
> > > >>>> On Fri, Jun 21, 2019 at 6:03 PM Pierre-Loup A. Griffais
> > > >>>> <[email protected]> wrote:
> > > >>>>>
> > > >>>>> I applied Eric's path to the tip of the branch and ran that kernel and
> > > >>>>> the bug didn't occur through several logout / login cycles, so things
> > > >>>>> look good at first glance. I'll keep running that kernel and report back
> > > >>>>> if anything crops up in the future, but I believe we're good, beyond
> > > >>>>> getting distros to ship this additional fix.
> > > >>>>
> > > >>>> Good. It's now in my tree, so we can get it quickly into stable and
> > > >>>> then quickly to distributions.
> > > >>>>
> > > >>>> Greg, it's commit b6653b3629e5 ("tcp: refine memory limit test in
> > > >>>> tcp_fragment()"), and I'm building it right now and I'll push it out
> > > >>>> in a couple of minutes assuming nothing odd is going on.
> > > >>>
> > > >>> This looks good for 4.19 and 5.1, so I'll push out new stable kernels in
> > > >>> a bit for them.
> > > >>>
> > > >>> But for 4.14 and older, we don't have the "hint" to know this is an
> > > >>> outbound going packet and not to apply these checks at that point in
> > > >>> time, so this patch doesn't work.
> > > >>>
> > > >>> I'll see if I can figure anything else later this afternoon for those
> > > >>> kernels...
> > > >>>
> > > >>
> > > >> I may have missed it, but I don't see a fix for the problem in
> > > >> older stable branches. Any news ?
> > > >>
> > > >> One possibility might be be to apply the part of 75c119afe14f7 which
> > > >> introduces TCP_FRAG_IN_WRITE_QUEUE and TCP_FRAG_IN_RTX_QUEUE, if that
> > > >> is acceptable.
> > > >
> > > > That's what people have already discussed on the stable mailing list a
> > > > few hours ago, hopefully a patch shows up soon as I'm traveling at the
> > > > moment and can't do it myself...
> > > >
> > >
> > > Sounds good. Let me know if nothing shows up; I'll be happy to do it
> > > if needed.
> >
> >
> > Without the rb-tree for rtx queues, old kernels are vulnerable to SACK
> > attacks if sk_sndbuf is too big,
> > so I would simply add a cushion in the test, instead of trying to
> > backport an illusion of the rb-tree fixes.
> >
> >
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index a8772e11dc1cb42d4319b6fc072c625d284c7ad5..a554213afa4ac41120d781fe64b7cd18ff9b56e8
> > 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -1274,7 +1274,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff
> > *skb, u32 len,
> > if (nsize < 0)
> > nsize = 0;
> >
> > - if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
> > + if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf + 131072)) {
> > NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
> > return -ENOMEM;
> > }
>
> That's a funny magic number, can we document what it means?

This is because TCP can cook skb with about 64KB of payload in
tcp_sendmsg() before
checking if memory limits are exceeded. (This is mentioned in commit
b6653b3629e5b88202be3c9abc44713973f5c4b4
" tcp: refine memory limit test in tcp_fragment()" changelog)

Then, if this giant TSO skb needs to be split in ~45 smaller skbs of
one segment each,
the resulting truesize might be twice bigger.

You could use 2 * 65536 if that looks better, and possibly a macro,
but I feel that adding a macro for this one particular spot and
stable kernels might be overkill ?

>
> And yes, it's a much simpler patch, I'd rather take this than the fake
> backport.

2019-06-26 08:25:54

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: Steam is broken on new kernels

On Wed, Jun 26, 2019 at 08:38:01AM +0200, Eric Dumazet wrote:
> On Wed, Jun 26, 2019 at 8:22 AM Greg Kroah-Hartman
> <[email protected]> wrote:
> >
> > On Wed, Jun 26, 2019 at 06:20:17AM +0200, Eric Dumazet wrote:
> > > On Wed, Jun 26, 2019 at 5:43 AM Guenter Roeck <[email protected]> wrote:
> > > >
> > > > On 6/25/19 7:29 PM, Greg Kroah-Hartman wrote:
> > > > > On Tue, Jun 25, 2019 at 07:02:20PM -0700, Guenter Roeck wrote:
> > > > >> Hi Greg,
> > > > >>
> > > > >> On Sat, Jun 22, 2019 at 09:37:53AM +0200, Greg Kroah-Hartman wrote:
> > > > >>> On Fri, Jun 21, 2019 at 10:28:21PM -0700, Linus Torvalds wrote:
> > > > >>>> On Fri, Jun 21, 2019 at 6:03 PM Pierre-Loup A. Griffais
> > > > >>>> <[email protected]> wrote:
> > > > >>>>>
> > > > >>>>> I applied Eric's path to the tip of the branch and ran that kernel and
> > > > >>>>> the bug didn't occur through several logout / login cycles, so things
> > > > >>>>> look good at first glance. I'll keep running that kernel and report back
> > > > >>>>> if anything crops up in the future, but I believe we're good, beyond
> > > > >>>>> getting distros to ship this additional fix.
> > > > >>>>
> > > > >>>> Good. It's now in my tree, so we can get it quickly into stable and
> > > > >>>> then quickly to distributions.
> > > > >>>>
> > > > >>>> Greg, it's commit b6653b3629e5 ("tcp: refine memory limit test in
> > > > >>>> tcp_fragment()"), and I'm building it right now and I'll push it out
> > > > >>>> in a couple of minutes assuming nothing odd is going on.
> > > > >>>
> > > > >>> This looks good for 4.19 and 5.1, so I'll push out new stable kernels in
> > > > >>> a bit for them.
> > > > >>>
> > > > >>> But for 4.14 and older, we don't have the "hint" to know this is an
> > > > >>> outbound going packet and not to apply these checks at that point in
> > > > >>> time, so this patch doesn't work.
> > > > >>>
> > > > >>> I'll see if I can figure anything else later this afternoon for those
> > > > >>> kernels...
> > > > >>>
> > > > >>
> > > > >> I may have missed it, but I don't see a fix for the problem in
> > > > >> older stable branches. Any news ?
> > > > >>
> > > > >> One possibility might be be to apply the part of 75c119afe14f7 which
> > > > >> introduces TCP_FRAG_IN_WRITE_QUEUE and TCP_FRAG_IN_RTX_QUEUE, if that
> > > > >> is acceptable.
> > > > >
> > > > > That's what people have already discussed on the stable mailing list a
> > > > > few hours ago, hopefully a patch shows up soon as I'm traveling at the
> > > > > moment and can't do it myself...
> > > > >
> > > >
> > > > Sounds good. Let me know if nothing shows up; I'll be happy to do it
> > > > if needed.
> > >
> > >
> > > Without the rb-tree for rtx queues, old kernels are vulnerable to SACK
> > > attacks if sk_sndbuf is too big,
> > > so I would simply add a cushion in the test, instead of trying to
> > > backport an illusion of the rb-tree fixes.
> > >
> > >
> > >
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index a8772e11dc1cb42d4319b6fc072c625d284c7ad5..a554213afa4ac41120d781fe64b7cd18ff9b56e8
> > > 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -1274,7 +1274,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff
> > > *skb, u32 len,
> > > if (nsize < 0)
> > > nsize = 0;
> > >
> > > - if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
> > > + if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf + 131072)) {
> > > NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
> > > return -ENOMEM;
> > > }
> >
> > That's a funny magic number, can we document what it means?
>
> This is because TCP can cook skb with about 64KB of payload in
> tcp_sendmsg() before
> checking if memory limits are exceeded. (This is mentioned in commit
> b6653b3629e5b88202be3c9abc44713973f5c4b4
> " tcp: refine memory limit test in tcp_fragment()" changelog)
>
> Then, if this giant TSO skb needs to be split in ~45 smaller skbs of
> one segment each,
> the resulting truesize might be twice bigger.
>
> You could use 2 * 65536 if that looks better, and possibly a macro,
> but I feel that adding a macro for this one particular spot and
> stable kernels might be overkill ?

Ah, yeah, 2*65536 makes more sense to me, seeing 131072 didn't trigger
the same "power of 2" thing in my brain :)

I'll fix this up and queue it up now, thanks!

greg k-h