2019-05-30 19:42:39

by Mihai Moldovan

[permalink] [raw]
Subject: Userspace woes with 5.1.5 due to TIPC

Hi


I've had a few issues lately (mainly bad RAM only, hopefully, which should be
fixed now) and generally upgraded everything.

With 5.1.5, though, some programs exhibited very weird behavior: Chromium
crashed while starting up due to not being able to launch a new zygote process,
albeit started when using --no-sandbox (likely because that didn't try to create
other processes); Opera (based upon Chromium) failed to start with SIGILL, but
that was only a red herring triggered by the same problem I guess; Firefox
started up, but was unable to render any content because its multi-process IPC
didn't work (i.e., it couldn't start new rendering processes). Interestingly,
most other programs I use daily still worked, even though they used networking
and IPC (command-line browsers, MATE Terminal, electron-based programs), so this
bug didn't make the machine completely unusable.

Since I've been using 5.1.3 without problems before and the issue was
straight-forward to test for, I did a bisection run and came to that conclusion:

================================ bisect log ================================
Bisecting: 124 revisions left to test after this (roughly 7 steps)
[ee4c3e283f8f3286bea60e9038adc70436d87d02] s390/mm: convert to the generic
get_user_pages_fast code
Bisecting: 62 revisions left to test after this (roughly 6 steps)
[f7346dc0634cbad7fca5d951b91ad2e13f497b0b] clk: mediatek: Disable tuner_en
before change PLL rate
Bisecting: 30 revisions left to test after this (roughly 5 steps)
[5ac8e698528149bb1618111d64e22bd8bb784256] parisc: Allow live-patching of
__meminit functions
Bisecting: 15 revisions left to test after this (roughly 4 steps)
[c89c9af998fef2af4e5b2b35fb723693f17e05ef] mlxsw: core: Prevent QSFP module
initialization for old hardware
Bisecting: 7 revisions left to test after this (roughly 3 steps)
[912d8c4cf9f19c93dfdf06b822eeadec9d71494d] net: test nouarg before dereferencing
zerocopy pointers
Bisecting: 3 revisions left to test after this (roughly 2 steps)
[92166190b8282d9925e90a66961879782c50d037] rtnetlink: always put IFLA_LINK for
links with a link-netnsid
Bisecting: 1 revision left to test after this (roughly 1 step)
[7d29c9ad0ed525c1b10e29cfca4fb1eece1e93fb] vsock/virtio: free packets during the
socket release
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[2d08f204328acaf85ac2c6fe5d5d9d4760f12e13] tipc: fix modprobe tipc failed after
switch order of device registration
2d08f204328acaf85ac2c6fe5d5d9d4760f12e13 is the first bad commit
commit 2d08f204328acaf85ac2c6fe5d5d9d4760f12e13
Author: Junwei Hu <[email protected]>
Date: Fri May 17 19:27:34 2019 +0800

tipc: fix modprobe tipc failed after switch order of device registration

[ Upstream commit 532b0f7ece4cb2ffd24dc723ddf55242d1188e5e ]

Error message printed:
modprobe: ERROR: could not insert 'tipc': Address family not
supported by protocol.
when modprobe tipc after the following patch: switch order of
device registration, commit 7e27e8d6130c
("tipc: switch order of device registration to fix a crash")

Because sock_create_kern(net, AF_TIPC, ...) is called by
tipc_topsrv_create_listener() in the initialization process
of tipc_net_ops, tipc_socket_init() must be execute before that.

I move tipc_socket_init() into function tipc_init_net().

Fixes: 7e27e8d6130c
("tipc: switch order of device registration to fix a crash")
Signed-off-by: Junwei Hu <[email protected]>
Reported-by: Wang Wang <[email protected]>
Reviewed-by: Kang Zhou <[email protected]>
Reviewed-by: Suanming Mou <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

:040000 040000 13d9b014338ccf6ae0c32bdb2be779870bbf97da
df8a9c2a9f1f8df212999c2904632a77adb03782 M net
=============================== / bisect log ===============================

My kernel config is tailored to my machine, so probably not very useful to
others, but I'm including it anyway. The most obvious point being CONFIG_TIPC=y,
i.e., TIPC being built statically into the kernel. Not sure why I've done that
in the first place because TIPC is not something that would be useful to me, but
I often err on the "might be useful later" side. I might rethink that decision
and just disable TIPC for good in the future.


With this patch applied, the kernel generally spews out a few wonky messages
that I've never seen before. For completeness sake, I've attached a ring buffer
log from running the last working and first bad version.

================================ TIPC messages ================================
NET: Registered protocol family 30
Failed to register TIPC socket type
=============================== / TIPC messages ===============================


Now, blindly reverting the patch would obviously a bad idea, since that would
mean trading one regression for the (initial) other one. I'm thus CCing the
maintainers to help.



Mihai


Attachments:
config-5.1.3.xz (30.01 kB)
dmesg-5.1.4-00013-g2d08f204328a.log.xz (21.05 kB)
dmesg-5.1.4-00012-g7d29c9ad0ed5.log.xz (26.22 kB)
signature.asc (916.00 B)
OpenPGP digital signature
Download all attachments

2019-05-30 19:54:45

by Jon Maloy

[permalink] [raw]
Subject: RE: Userspace woes with 5.1.5 due to TIPC

Hi Mihai,
Make sure the following three commits are present in TIPC *after* the offending commit:

commit 532b0f7ece4c "tipc: fix modprobe tipc failed after switch order of device registration"

Since that patch one was flawed it had to be reverted:
commit 5593530e5694 ""Revert tipc: fix modprobe tipc failed after switch order of device registration"

It was then replaced with this one:
commit 526f5b851a96 "tipc: fix modprobe tipc failed after switch order of device registration"

BR
Jon Maloy


> -----Original Message-----
> From: Mihai Moldovan <[email protected]>
> Sent: 30-May-19 15:34
> To: [email protected]
> Cc: Ying Xue <[email protected]>; Jon Maloy
> <[email protected]>
> Subject: Userspace woes with 5.1.5 due to TIPC
>
> Hi
>
>
> I've had a few issues lately (mainly bad RAM only, hopefully, which should be
> fixed now) and generally upgraded everything.
>
> With 5.1.5, though, some programs exhibited very weird behavior: Chromium
> crashed while starting up due to not being able to launch a new zygote
> process, albeit started when using --no-sandbox (likely because that didn't try
> to create other processes); Opera (based upon Chromium) failed to start with
> SIGILL, but that was only a red herring triggered by the same problem I guess;
> Firefox started up, but was unable to render any content because its multi-
> process IPC didn't work (i.e., it couldn't start new rendering processes).
> Interestingly, most other programs I use daily still worked, even though they
> used networking and IPC (command-line browsers, MATE Terminal, electron-
> based programs), so this bug didn't make the machine completely unusable.
>
> Since I've been using 5.1.3 without problems before and the issue was
> straight-forward to test for, I did a bisection run and came to that conclusion:
>
> ================================ bisect log
> ================================
> Bisecting: 124 revisions left to test after this (roughly 7 steps)
> [ee4c3e283f8f3286bea60e9038adc70436d87d02] s390/mm: convert to the
> generic get_user_pages_fast code
> Bisecting: 62 revisions left to test after this (roughly 6 steps)
> [f7346dc0634cbad7fca5d951b91ad2e13f497b0b] clk: mediatek: Disable
> tuner_en before change PLL rate
> Bisecting: 30 revisions left to test after this (roughly 5 steps)
> [5ac8e698528149bb1618111d64e22bd8bb784256] parisc: Allow live-
> patching of __meminit functions
> Bisecting: 15 revisions left to test after this (roughly 4 steps)
> [c89c9af998fef2af4e5b2b35fb723693f17e05ef] mlxsw: core: Prevent QSFP
> module initialization for old hardware
> Bisecting: 7 revisions left to test after this (roughly 3 steps)
> [912d8c4cf9f19c93dfdf06b822eeadec9d71494d] net: test nouarg before
> dereferencing zerocopy pointers
> Bisecting: 3 revisions left to test after this (roughly 2 steps)
> [92166190b8282d9925e90a66961879782c50d037] rtnetlink: always put
> IFLA_LINK for links with a link-netnsid
> Bisecting: 1 revision left to test after this (roughly 1 step)
> [7d29c9ad0ed525c1b10e29cfca4fb1eece1e93fb] vsock/virtio: free packets
> during the socket release
> Bisecting: 0 revisions left to test after this (roughly 0 steps)
> [2d08f204328acaf85ac2c6fe5d5d9d4760f12e13] tipc: fix modprobe tipc
> failed after switch order of device registration
> 2d08f204328acaf85ac2c6fe5d5d9d4760f12e13 is the first bad commit
> commit 2d08f204328acaf85ac2c6fe5d5d9d4760f12e13
> Author: Junwei Hu <[email protected]>
> Date: Fri May 17 19:27:34 2019 +0800
>
> tipc: fix modprobe tipc failed after switch order of device registration
>
> [ Upstream commit 532b0f7ece4cb2ffd24dc723ddf55242d1188e5e ]
>
> Error message printed:
> modprobe: ERROR: could not insert 'tipc': Address family not
> supported by protocol.
> when modprobe tipc after the following patch: switch order of
> device registration, commit 7e27e8d6130c
> ("tipc: switch order of device registration to fix a crash")
>
> Because sock_create_kern(net, AF_TIPC, ...) is called by
> tipc_topsrv_create_listener() in the initialization process
> of tipc_net_ops, tipc_socket_init() must be execute before that.
>
> I move tipc_socket_init() into function tipc_init_net().
>
> Fixes: 7e27e8d6130c
> ("tipc: switch order of device registration to fix a crash")
> Signed-off-by: Junwei Hu <[email protected]>
> Reported-by: Wang Wang <[email protected]>
> Reviewed-by: Kang Zhou <[email protected]>
> Reviewed-by: Suanming Mou <[email protected]>
> Signed-off-by: David S. Miller <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>
> :040000 040000 13d9b014338ccf6ae0c32bdb2be779870bbf97da
> df8a9c2a9f1f8df212999c2904632a77adb03782 M net
> =============================== / bisect log
> ===============================
>
> My kernel config is tailored to my machine, so probably not very useful to
> others, but I'm including it anyway. The most obvious point being
> CONFIG_TIPC=y, i.e., TIPC being built statically into the kernel. Not sure why
> I've done that in the first place because TIPC is not something that would be
> useful to me, but I often err on the "might be useful later" side. I might rethink
> that decision and just disable TIPC for good in the future.
>
>
> With this patch applied, the kernel generally spews out a few wonky messages
> that I've never seen before. For completeness sake, I've attached a ring buffer
> log from running the last working and first bad version.
>
> ================================ TIPC messages
> ================================
> NET: Registered protocol family 30
> Failed to register TIPC socket type
> =============================== / TIPC messages
> ===============================
>
>
> Now, blindly reverting the patch would obviously a bad idea, since that would
> mean trading one regression for the (initial) other one. I'm thus CCing the
> maintainers to help.
>
>
>
> Mihai

2019-05-30 20:49:05

by Mihai Moldovan

[permalink] [raw]
Subject: Re: Userspace woes with 5.1.5 due to TIPC

* On 5/30/19 9:51 PM, Jon Maloy wrote:
> Make sure the following three commits are present in TIPC *after* the offending commit:
>
> commit 532b0f7ece4c "tipc: fix modprobe tipc failed after switch order of device registration"

This *is* the offending commit, as far as I understand. Merely rebased in
linux-stable, and hence having a different SHA, but mentioning the original SHA
(i.e., 532b0f7ece4c) in its commit message.


> Since that patch one was flawed it had to be reverted:
> commit 5593530e5694 ""Revert tipc: fix modprobe tipc failed after switch order of device registration"
>
> It was then replaced with this one:
> commit 526f5b851a96 "tipc: fix modprobe tipc failed after switch order of device registration"

Okay, these two are not part of 5.1.5. I've backported them (and only these two)
to 5.1.5 and the issue(s) seem to be gone. Definitely something that should be
backported to/included in 5.1.6.


Thanks for pointing all that out! Unfortunately I didn't add anything useful but
noise, since you obviously already knew, that this commit was broken. I'd urge
Greg to release a new stable version including the fixes soon, if possible,
though, for not being able to start/use userspace browsers sounds like a pretty
bad regression to me.



Mihai


Attachments:
signature.asc (916.00 B)
OpenPGP digital signature

2019-05-31 00:53:35

by Ying Xue

[permalink] [raw]
Subject: Re: Userspace woes with 5.1.5 due to TIPC

On 5/31/19 4:47 AM, Mihai Moldovan wrote:
> * On 5/30/19 9:51 PM, Jon Maloy wrote:
>> Make sure the following three commits are present in TIPC *after* the offending commit:
>>
>> commit 532b0f7ece4c "tipc: fix modprobe tipc failed after switch order of device registration"
>
> This *is* the offending commit, as far as I understand. Merely rebased in
> linux-stable, and hence having a different SHA, but mentioning the original SHA
> (i.e., 532b0f7ece4c) in its commit message.
>
>
>> Since that patch one was flawed it had to be reverted:
>> commit 5593530e5694 ""Revert tipc: fix modprobe tipc failed after switch order of device registration"
>>
>> It was then replaced with this one:
>> commit 526f5b851a96 "tipc: fix modprobe tipc failed after switch order of device registration"
>
> Okay, these two are not part of 5.1.5. I've backported them (and only these two)
> to 5.1.5 and the issue(s) seem to be gone. Definitely something that should be
> backported to/included in 5.1.6.
>
>
> Thanks for pointing all that out! Unfortunately I didn't add anything useful but
> noise, since you obviously already knew, that this commit was broken. I'd urge
> Greg to release a new stable version including the fixes soon, if possible,
> though, for not being able to start/use userspace browsers sounds like a pretty
> bad regression to me.
>

Not only commit 526f5b851a96 )("tipc: fix modprobe tipc failed after
switch order of device registration") has to be reverted, but also I
found commit 7e27e8d6130c ("tipc: switch order of device registration to
fix a crash") introduced a serious regression which makes tipc internal
topology service server failed to be created.

Today I will fix it with the following approaches:
1. Revert commit 7e27e8d6130c ("tipc: switch order of device
registration to fix a crash")
2. Use another method to solve the problem that commit 7e27e8d6130c
tries to fix.

Thanks,
Ying

>
>
> Mihai
>

2019-05-31 11:03:53

by Jon Maloy

[permalink] [raw]
Subject: RE: Userspace woes with 5.1.5 due to TIPC



> -----Original Message-----
> From: Ying Xue <[email protected]>
> Sent: 30-May-19 20:41
> To: Mihai Moldovan <[email protected]>; Jon Maloy
> <[email protected]>; [email protected]
> Subject: Re: Userspace woes with 5.1.5 due to TIPC
>
> On 5/31/19 4:47 AM, Mihai Moldovan wrote:
> > * On 5/30/19 9:51 PM, Jon Maloy wrote:
> >> Make sure the following three commits are present in TIPC *after* the
> offending commit:
> >>
> >> commit 532b0f7ece4c "tipc: fix modprobe tipc failed after switch order of
> device registration"
> >
> > This *is* the offending commit, as far as I understand. Merely rebased
> > in linux-stable, and hence having a different SHA, but mentioning the
> > original SHA (i.e., 532b0f7ece4c) in its commit message.
> >
> >
> >> Since that patch one was flawed it had to be reverted:
> >> commit 5593530e5694 ""Revert tipc: fix modprobe tipc failed after switch
> order of device registration"
> >>
> >> It was then replaced with this one:
> >> commit 526f5b851a96 "tipc: fix modprobe tipc failed after switch order of
> device registration"
> >
> > Okay, these two are not part of 5.1.5. I've backported them (and only
> > these two) to 5.1.5 and the issue(s) seem to be gone. Definitely
> > something that should be backported to/included in 5.1.6.
> >
> >
> > Thanks for pointing all that out! Unfortunately I didn't add anything
> > useful but noise, since you obviously already knew, that this commit
> > was broken. I'd urge Greg to release a new stable version including
> > the fixes soon, if possible, though, for not being able to start/use
> > userspace browsers sounds like a pretty bad regression to me.
> >
>
> Not only commit 526f5b851a96 )("tipc: fix modprobe tipc failed after switch
> order of device registration") has to be reverted, but also I found commit
> 7e27e8d6130c ("tipc: switch order of device registration to fix a crash")
> introduced a serious regression which makes tipc internal topology service
> server failed to be created.

This was the very reason the broken patch was introduced. AFAIK there is no problem after the corrected version of that patch was applied.

///jon

>
> Today I will fix it with the following approaches:
> 1. Revert commit 7e27e8d6130c ("tipc: switch order of device registration to
> fix a crash") 2. Use another method to solve the problem that commit
> 7e27e8d6130c tries to fix.
>
> Thanks,
> Ying
>
> >
> >
> > Mihai
> >

2019-05-31 11:25:31

by Ying Xue

[permalink] [raw]
Subject: Re: Userspace woes with 5.1.5 due to TIPC

On 5/31/19 7:02 PM, Jon Maloy wrote:
> This was the very reason the broken patch was introduced. AFAIK there is no problem after the corrected version of that patch was applied.

I have prepared for our patches on net-next tree. But when I checked my
patches on net tree, it's found that the issue has been fixed with
commit 526f5b851a96566803ee4bee60d0a34df56c77f8 ("tipc: fix modprobe
tipc failed after switch order of device registration"). There are four
commits which were introduced recently, and their names are quite
similar, and the fix is not merged into net-next tree, which makes me
misunderstood the issue status.

Anyway, it looks like failing to insert TIPC module has been resolved.
But I have not validated the fix by create multiple TIPC namespace, so I
am not sure whether it works normally.

Thanks,
Ying