2019-08-22 08:36:31

by Chris Clayton

Subject: Regression in 5.3-rc1 and later

Hi everyone,

Firstly, apologies to anyone on the long cc list that turns out not to be particularly interested in the following, but
you were all marked as cc'd in the commit message below.

I've found a problem that isn't present in 5.2 series or 4.19 series kernels, and seems to have arrived in 5.3-rc1. The
problem is that if I suspend (to ram) my laptop, on resume 14 minutes or more after suspending, I have no networking
functionality. If I resume the laptop after 13 minutes or less, networking works fine. I haven't tried to get finer
grained timings between 13 and 14 minutes, but can do if it would help.
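
One way to narrow the window between 13 and 14 minutes would be to let the RTC wake the machine after a fixed time. The sketch below assumes rtcwake from util-linux is available and is run as root; the 10 second settle delay, the default ping target and the function names are illustrative choices, not something tested on the affected laptop.

```shell
# Print suspend durations in seconds from $1 to $2 (inclusive) in steps of $3.
durations() {
    s=$1
    while [ "$s" -le "$2" ]; do
        echo "$s"
        s=$((s + $3))
    done
}

# Suspend to RAM for $1 seconds, then report whether a ping gets through.
# $2 is the ping target (assumed default: kernel.org).
suspend_and_check() {
    secs=$1
    target=${2:-kernel.org}
    # rtcwake programs the RTC alarm and suspends; it returns after resume.
    rtcwake -m mem -s "$secs" || return 2
    # Assumed settle time so the wireless link can re-associate.
    sleep 10
    if ping -c 3 "$target" >/dev/null 2>&1; then
        echo "$secs s: network OK"
    else
        echo "$secs s: network FAILED"
    fi
}
```

For example, 'for s in $(durations 780 840 15); do suspend_and_check "$s"; done' would step from 13 to 14 minutes in 15 second increments. Since the network stays unusable once a resume has failed, a reboot between runs would probably be needed.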

ifconfig shows that wlan0 is still up and still has its assigned IP address but, for instance, a ping of any other
device on my network fails, as does pinging, say, kernel.org. I've tried "downing" the network with (/sbin/ifdown) and
unloading the iwlmvm module and then reloading the module and "upping" (/sbin/ifup) the network, but my network is still
unusable. I should add that the problem also manifests if I hibernate the laptop, although my testing of this has been
minimal. I can do more if required.

As I say, the problem first appears in 5.3-rc1, so I've bisected between 5.2.0 and 5.3-rc1 and that concluded with:

[chris:~/kernel/linux]$ git bisect good
7ac8707479886c75f353bfb6a8273f423cfccb23 is the first bad commit
commit 7ac8707479886c75f353bfb6a8273f423cfccb23
Author: Vincenzo Frascino <[email protected]>
Date: Fri Jun 21 10:52:49 2019 +0100

x86/vdso: Switch to generic vDSO implementation

The x86 vDSO library requires some adaptations to take advantage of the
newly introduced generic vDSO library.

Introduce the following changes:
- Modification of vdso.c to be compliant with the common vdso datapage
- Use of lib/vdso for gettimeofday

[ tglx: Massaged changelog and cleaned up the function signature formatting ]

Signed-off-by: Vincenzo Frascino <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Russell King <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Mark Salyzyn <[email protected]>
Cc: Peter Collingbourne <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Huw Davies <[email protected]>
Cc: Shijith Thotton <[email protected]>
Cc: Andre Przywara <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

arch/x86/Kconfig | 3 +
arch/x86/entry/vdso/Makefile | 9 ++
arch/x86/entry/vdso/vclock_gettime.c | 245 ++++---------------------------
arch/x86/entry/vdso/vdsox32.lds.S | 1 +
arch/x86/entry/vsyscall/Makefile | 2 -
arch/x86/entry/vsyscall/vsyscall_gtod.c | 83 -----------
arch/x86/include/asm/pvclock.h | 2 +-
arch/x86/include/asm/vdso/gettimeofday.h | 191 ++++++++++++++++++++++++
arch/x86/include/asm/vdso/vsyscall.h | 44 ++++++
arch/x86/include/asm/vgtod.h | 75 +---------
arch/x86/include/asm/vvar.h | 7 +-
arch/x86/kernel/pvclock.c | 1 +
12 files changed, 284 insertions(+), 379 deletions(-)
delete mode 100644 arch/x86/entry/vsyscall/vsyscall_gtod.c
create mode 100644 arch/x86/include/asm/vdso/gettimeofday.h
create mode 100644 arch/x86/include/asm/vdso/vsyscall.h

To confirm my bisection was correct, I did a git checkout of 7ac8707479886c75f353bfb6a8273f423cfccb2. As expected, the
kernel exhibited the problem I've described. However, a kernel built at the immediately preceding (parent?) commit
(bfe801ebe84f42b4666d3f0adde90f504d56e35b) has a working network after a (>= 14 minute) suspend/resume cycle.
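
The bisection and the parent-commit check described above could be reproduced roughly as follows. This is a sketch, not a record of the commands actually used: it assumes the tags v5.2 and v5.3-rc1 mark the good and bad endpoints, and the function names start_bisect and parent_of are made up for illustration.

```shell
# Sketch only: meant to be used from the top of a Linux kernel git tree.
BAD_COMMIT=7ac8707479886c75f353bfb6a8273f423cfccb23

# Start a bisection between the last known-good and first known-bad tags
# (assumed here to be v5.2 and v5.3-rc1).
start_bisect() {
    git bisect start
    git bisect bad v5.3-rc1
    git bisect good v5.2
}

# After building, booting and testing each candidate kernel by hand,
# record the result with one of:
#   git bisect good    # network works after a >= 14 minute suspend
#   git bisect bad     # network is unusable after resume

# Print the hash of the first parent of commit $1, i.e. the commit
# that immediately precedes it in history.
parent_of() {
    git rev-parse "$1^"
}
```

Run inside a kernel tree, start_bisect begins the search, and parent_of "$BAD_COMMIT" prints the hash of the immediately preceding commit, which can then be checked out, built and tested.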

As the module name implies, I'm using wireless networking. The hardware is detected as "Intel(R) Wireless-AC 9260
160MHz, REV=0x324" by iwlwifi.

I'm more than happy to provide additional diagnostics (but may need a little hand-holding) and to apply diagnostic or
fix patches, but please cc me on any reply as I'm not subscribed to any of the kernel-related mailing lists.

Chris


2019-08-22 11:33:58

by Thomas Gleixner

Subject: Re: Regression in 5.3-rc1 and later

Chris,

On Thu, 22 Aug 2019, Thomas Gleixner wrote:
>
> Can you please provide the output of:
>
> dmesg | grep -i TSC

Full dmesg for both scenarios (12min and >14min) would be appreciated as well.
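
Collecting the requested logs for both scenarios might look like the sketch below. The file naming scheme and the function names are assumptions; dmesg may need root on systems that restrict access to the kernel log.

```shell
# Keep only the lines that mention the TSC (same filter as "grep -i TSC").
tsc_lines() {
    grep -i tsc
}

# Save the full kernel log and a TSC-only excerpt under a label,
# e.g. "12min" or "14min" (hypothetical labels).
capture_dmesg() {
    label=$1
    dmesg > "dmesg-$label.txt" &&
        tsc_lines < "dmesg-$label.txt" > "dmesg-$label-tsc.txt"
}
```

For instance, running capture_dmesg 12min after a short suspend and capture_dmesg 14min after a long one would leave four files that can be attached to a reply.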

Thanks,

tglx

2019-08-22 11:49:00

by Thomas Gleixner

Subject: Re: Regression in 5.3-rc1 and later

Chris,

On Thu, 22 Aug 2019, Chris Clayton wrote:

Trimmed cc list

> I've found a problem that isn't present in 5.2 series or 4.19 series
> kernels, and seems to have arrived in 5.3-rc1. The problem is that if I
> suspend (to ram) my laptop, on resume 14 minutes or more after
> suspending, I have no networking functionality. If I resume the laptop
> after 13 minutes or less, networking works fine. I haven't tried to get
> finer grained timings between 13 and 14 minutes, but can do if it would
> help.
>
> ifconfig shows that wlan0 is still up and still has its assigned ip
> address but, for instance, a ping of any other device on my network,
> fails as does pinging, say, kernel.org. I've tried "downing" the network
> with (/sbin/ifdown) and unloading the iwlmvm module and then reloading
> the module and "upping" (/sbin/ifup) the network, but my network is still
> unusable. I should add that the problem also manifests if I hibernate the
> laptop, although my testing of this has been minimal. I can do more if
> required.

What happens if you restart the network manager and/or wpa_supplicant or
whatever your distro uses for that?

> As I say, the problem first appears in 5.3-rc1, so I've bisected between
> 5.2.0 and 5.3-rc1 and that concluded with:

Just for confirmation, it's still broken as of 5.3-rc5, right? We had fixes
post rc1.

> x86/vdso: Switch to generic vDSO implementation

> To confirm my bisection was correct, I did a git checkout of
> 7ac8707479886c75f353bfb6a8273f423cfccb2. As expected, the kernel
> exhibited the problem I've described. However, a kernel built at the
> immediately preceding (parent?) commit
> (bfe801ebe84f42b4666d3f0adde90f504d56e35b) has a working network after a
> (>= 14minute) suspend/resume cycle.

~14 minutes is odd. I can't come up with anything which rolls over, wraps
or overflows at that point.

Can you please provide the output of:

dmesg | grep -i TSC

Thanks,

tglx

2019-08-22 13:26:51

by Thomas Gleixner

Subject: Re: Regression in 5.3-rc1 and later

Chris,

On Thu, 22 Aug 2019, Thomas Gleixner wrote:
> On Thu, 22 Aug 2019, Thomas Gleixner wrote:
> >
> > Can you please provide the output of:
> >
> > dmesg | grep -i TSC
>
> Full dmesg for both scenarios (12min and >14min) would be appreciated as well.

Hold off with that. I think I found the issue.

Thanks,

tglx

2019-08-23 23:02:47

by Will Deacon

Subject: Re: Regression in 5.3-rc1 and later

On Fri, Aug 23, 2019 at 11:36:54AM +0100, Russell King - ARM Linux admin wrote:
> To everyone on the long Cc list...
>
> What's happening with this? I was about to merge the patches for 32-bit
> ARM, which I don't want to do if doing so will cause this regression on
> 32-bit ARM as well.

tglx fixed it:

https://lkml.kernel.org/r/[email protected]

which I assume is getting routed as a fix via -tip.

Will

2019-08-23 23:02:51

by Vincenzo Frascino

Subject: Re: Regression in 5.3-rc1 and later

Hi Russell,

On 8/23/19 11:36 AM, Russell King - ARM Linux admin wrote:
> Hi,
>
> To everyone on the long Cc list...
>
> What's happening with this? I was about to merge the patches for 32-bit
> ARM, which I don't want to do if doing so will cause this regression on
> 32-bit ARM as well.
>

The regression was sorted out yesterday; a new patch is going through
tip:timers/urgent and will be part of the next -rc.

If you want to merge them there should be nothing blocking.

> Thanks.
>
> On Thu, Aug 22, 2019 at 07:57:59AM +0100, Chris Clayton wrote:
>> Hi everyone,
>>
>> Firstly, apologies to anyone on the long cc list that turns out not to be particularly interested in the following, but
>> you were all marked as cc'd in the commit message below.
>>
>> I've found a problem that isn't present in 5.2 series or 4.19 series kernels, and seems to have arrived in 5.3-rc1. The
>> problem is that if I suspend (to ram) my laptop, on resume 14 minutes or more after suspending, I have no networking
>> functionality. If I resume the laptop after 13 minutes or less, networking works fine. I haven't tried to get finer
>> grained timings between 13 and 14 minutes, but can do if it would help.
>>
>> ifconfig shows that wlan0 is still up and still has its assigned ip address but, for instance, a ping of any other
>> device on my network, fails as does pinging, say, kernel.org. I've tried "downing" the network with (/sbin/ifdown) and
>> unloading the iwlmvm module and then reloading the module and "upping" (/sbin/ifup) the network, but my network is still
>> unusable. I should add that the problem also manifests if I hibernate the laptop, although my testing of this has been
>> minimal. I can do more if required.
>>
>> As I say, the problem first appears in 5.3-rc1, so I've bisected between 5.2.0 and 5.3-rc1 and that concluded with:
>>
>> [chris:~/kernel/linux]$ git bisect good
>> 7ac8707479886c75f353bfb6a8273f423cfccb23 is the first bad commit
>> commit 7ac8707479886c75f353bfb6a8273f423cfccb23
>> Author: Vincenzo Frascino <[email protected]>
>> Date: Fri Jun 21 10:52:49 2019 +0100
>>
>> x86/vdso: Switch to generic vDSO implementation
>>
>> The x86 vDSO library requires some adaptations to take advantage of the
>> newly introduced generic vDSO library.
>>
>> Introduce the following changes:
>> - Modification of vdso.c to be compliant with the common vdso datapage
>> - Use of lib/vdso for gettimeofday
>>
>> [ tglx: Massaged changelog and cleaned up the function signature formatting ]
>>
>> Signed-off-by: Vincenzo Frascino <[email protected]>
>> Signed-off-by: Thomas Gleixner <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: Catalin Marinas <[email protected]>
>> Cc: Will Deacon <[email protected]>
>> Cc: Arnd Bergmann <[email protected]>
>> Cc: Russell King <[email protected]>
>> Cc: Ralf Baechle <[email protected]>
>> Cc: Paul Burton <[email protected]>
>> Cc: Daniel Lezcano <[email protected]>
>> Cc: Mark Salyzyn <[email protected]>
>> Cc: Peter Collingbourne <[email protected]>
>> Cc: Shuah Khan <[email protected]>
>> Cc: Dmitry Safonov <[email protected]>
>> Cc: Rasmus Villemoes <[email protected]>
>> Cc: Huw Davies <[email protected]>
>> Cc: Shijith Thotton <[email protected]>
>> Cc: Andre Przywara <[email protected]>
>> Link: https://lkml.kernel.org/r/[email protected]
>>
>> arch/x86/Kconfig | 3 +
>> arch/x86/entry/vdso/Makefile | 9 ++
>> arch/x86/entry/vdso/vclock_gettime.c | 245 ++++---------------------------
>> arch/x86/entry/vdso/vdsox32.lds.S | 1 +
>> arch/x86/entry/vsyscall/Makefile | 2 -
>> arch/x86/entry/vsyscall/vsyscall_gtod.c | 83 -----------
>> arch/x86/include/asm/pvclock.h | 2 +-
>> arch/x86/include/asm/vdso/gettimeofday.h | 191 ++++++++++++++++++++++++
>> arch/x86/include/asm/vdso/vsyscall.h | 44 ++++++
>> arch/x86/include/asm/vgtod.h | 75 +---------
>> arch/x86/include/asm/vvar.h | 7 +-
>> arch/x86/kernel/pvclock.c | 1 +
>> 12 files changed, 284 insertions(+), 379 deletions(-)
>> delete mode 100644 arch/x86/entry/vsyscall/vsyscall_gtod.c
>> create mode 100644 arch/x86/include/asm/vdso/gettimeofday.h
>> create mode 100644 arch/x86/include/asm/vdso/vsyscall.h
>>
>> To confirm my bisection was correct, I did a git checkout of 7ac8707479886c75f353bfb6a8273f423cfccb2. As expected, the
>> kernel exhibited the problem I've described. However, a kernel built at the immediately preceding (parent?) commit
>> (bfe801ebe84f42b4666d3f0adde90f504d56e35b) has a working network after a (>= 14minute) suspend/resume cycle.
>>
>> As the module name implies, I'm using wireless networking. The hardware is detected as "Intel(R) Wireless-AC 9260
>> 160MHz, REV=0x324" by iwlwifi.
>>
>> I'm more than happy to provide additional diagnostics (but may need a little hand-holding) and to apply diagnostic or
>> fix patches, but please cc me on any reply as I'm not subscribed to any of the kernel-related mailing lists.
>>
>> Chris
>>
>

--
Regards,
Vincenzo

2019-08-23 23:03:32

by Russell King (Oracle)

Subject: Re: Regression in 5.3-rc1 and later

Hi,

To everyone on the long Cc list...

What's happening with this? I was about to merge the patches for 32-bit
ARM, which I don't want to do if doing so will cause this regression on
32-bit ARM as well.

Thanks.

On Thu, Aug 22, 2019 at 07:57:59AM +0100, Chris Clayton wrote:
> Hi everyone,
>
> Firstly, apologies to anyone on the long cc list that turns out not to be particularly interested in the following, but
> you were all marked as cc'd in the commit message below.
>
> I've found a problem that isn't present in 5.2 series or 4.19 series kernels, and seems to have arrived in 5.3-rc1. The
> problem is that if I suspend (to ram) my laptop, on resume 14 minutes or more after suspending, I have no networking
> functionality. If I resume the laptop after 13 minutes or less, networking works fine. I haven't tried to get finer
> grained timings between 13 and 14 minutes, but can do if it would help.
>
> ifconfig shows that wlan0 is still up and still has its assigned ip address but, for instance, a ping of any other
> device on my network, fails as does pinging, say, kernel.org. I've tried "downing" the network with (/sbin/ifdown) and
> unloading the iwlmvm module and then reloading the module and "upping" (/sbin/ifup) the network, but my network is still
> unusable. I should add that the problem also manifests if I hibernate the laptop, although my testing of this has been
> minimal. I can do more if required.
>
> As I say, the problem first appears in 5.3-rc1, so I've bisected between 5.2.0 and 5.3-rc1 and that concluded with:
>
> [chris:~/kernel/linux]$ git bisect good
> 7ac8707479886c75f353bfb6a8273f423cfccb23 is the first bad commit
> commit 7ac8707479886c75f353bfb6a8273f423cfccb23
> Author: Vincenzo Frascino <[email protected]>
> Date: Fri Jun 21 10:52:49 2019 +0100
>
> x86/vdso: Switch to generic vDSO implementation
>
> The x86 vDSO library requires some adaptations to take advantage of the
> newly introduced generic vDSO library.
>
> Introduce the following changes:
> - Modification of vdso.c to be compliant with the common vdso datapage
> - Use of lib/vdso for gettimeofday
>
> [ tglx: Massaged changelog and cleaned up the function signature formatting ]
>
> Signed-off-by: Vincenzo Frascino <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Arnd Bergmann <[email protected]>
> Cc: Russell King <[email protected]>
> Cc: Ralf Baechle <[email protected]>
> Cc: Paul Burton <[email protected]>
> Cc: Daniel Lezcano <[email protected]>
> Cc: Mark Salyzyn <[email protected]>
> Cc: Peter Collingbourne <[email protected]>
> Cc: Shuah Khan <[email protected]>
> Cc: Dmitry Safonov <[email protected]>
> Cc: Rasmus Villemoes <[email protected]>
> Cc: Huw Davies <[email protected]>
> Cc: Shijith Thotton <[email protected]>
> Cc: Andre Przywara <[email protected]>
> Link: https://lkml.kernel.org/r/[email protected]
>
> arch/x86/Kconfig | 3 +
> arch/x86/entry/vdso/Makefile | 9 ++
> arch/x86/entry/vdso/vclock_gettime.c | 245 ++++---------------------------
> arch/x86/entry/vdso/vdsox32.lds.S | 1 +
> arch/x86/entry/vsyscall/Makefile | 2 -
> arch/x86/entry/vsyscall/vsyscall_gtod.c | 83 -----------
> arch/x86/include/asm/pvclock.h | 2 +-
> arch/x86/include/asm/vdso/gettimeofday.h | 191 ++++++++++++++++++++++++
> arch/x86/include/asm/vdso/vsyscall.h | 44 ++++++
> arch/x86/include/asm/vgtod.h | 75 +---------
> arch/x86/include/asm/vvar.h | 7 +-
> arch/x86/kernel/pvclock.c | 1 +
> 12 files changed, 284 insertions(+), 379 deletions(-)
> delete mode 100644 arch/x86/entry/vsyscall/vsyscall_gtod.c
> create mode 100644 arch/x86/include/asm/vdso/gettimeofday.h
> create mode 100644 arch/x86/include/asm/vdso/vsyscall.h
>
> To confirm my bisection was correct, I did a git checkout of 7ac8707479886c75f353bfb6a8273f423cfccb2. As expected, the
> kernel exhibited the problem I've described. However, a kernel built at the immediately preceding (parent?) commit
> (bfe801ebe84f42b4666d3f0adde90f504d56e35b) has a working network after a (>= 14minute) suspend/resume cycle.
>
> As the module name implies, I'm using wireless networking. The hardware is detected as "Intel(R) Wireless-AC 9260
> 160MHz, REV=0x324" by iwlwifi.
>
> I'm more than happy to provide additional diagnostics (but may need a little hand-holding) and to apply diagnostic or
> fix patches, but please cc me on any reply as I'm not subscribed to any of the kernel-related mailing lists.
>
> Chris
>

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

2019-08-23 23:04:15

by Russell King (Oracle)

Subject: Re: Regression in 5.3-rc1 and later

On Fri, Aug 23, 2019 at 11:43:32AM +0100, Vincenzo Frascino wrote:
> Hi Russell,
>
> On 8/23/19 11:36 AM, Russell King - ARM Linux admin wrote:
> > Hi,
> >
> > To everyone on the long Cc list...
> >
> > What's happening with this? I was about to merge the patches for 32-bit
> > ARM, which I don't want to do if doing so will cause this regression on
> > 32-bit ARM as well.
> >
>
> The regression is sorted as of yesterday, a new patch is going through tip:
> timers/urgent and will be part of the next -rc.
>
> If you want to merge them there should be nothing blocking.

I don't have access to the tip tree.

I'll wait a kernel release cycle instead.

Thanks.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

2019-08-23 23:06:10

by Russell King (Oracle)

Subject: Re: Regression in 5.3-rc1 and later

On Fri, Aug 23, 2019 at 11:40:50AM +0100, Will Deacon wrote:
> On Fri, Aug 23, 2019 at 11:36:54AM +0100, Russell King - ARM Linux admin wrote:
> > To everyone on the long Cc list...
> >
> > What's happening with this? I was about to merge the patches for 32-bit
> > ARM, which I don't want to do if doing so will cause this regression on
> > 32-bit ARM as well.
>
> tglx fixed it:
>
> https://lkml.kernel.org/r/[email protected]
>
> which I assume is getting routed as a fix via -tip.

Right, so Chris reported the issue to everyone involved. Tglx's
reply severely trimmed the Cc list so folk like me had no idea what
was going on, removing even the mailing lists. On the face of it,
it looks like an intentional attempt to cut people out of the loop
who really should've been kept in the loop. Yea, that's just great.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

2019-08-23 23:15:22

by Thomas Gleixner

Subject: Re: Regression in 5.3-rc1 and later

On Fri, 23 Aug 2019, Russell King - ARM Linux admin wrote:

> On Fri, Aug 23, 2019 at 11:40:50AM +0100, Will Deacon wrote:
> > On Fri, Aug 23, 2019 at 11:36:54AM +0100, Russell King - ARM Linux admin wrote:
> > > To everyone on the long Cc list...
> > >
> > > What's happening with this? I was about to merge the patches for 32-bit
> > > ARM, which I don't want to do if doing so will cause this regression on
> > > 32-bit ARM as well.
> >
> > tglx fixed it:
> >
> > https://lkml.kernel.org/r/[email protected]
> >
> > which I assume is getting routed as a fix via -tip.
>
> Right, so Chris reported the issue to everyone involved. Tglx's
> reply severely trimmed the Cc list so folk like me had no idea what
> was going on, removing even the mailing lists. On the face of it,
> it looks like an intentional attempt to cut people out of the loop
> who really should've been kept in the loop. Yea, that's just great.

Sorry, that was no intentional attempt to cut anyone out of the
loop. Trimmed it too aggressively without applying much brain.

Thanks,

tglx

2019-08-23 23:16:53

by Thomas Gleixner

Subject: Re: Regression in 5.3-rc1 and later

On Fri, 23 Aug 2019, Russell King - ARM Linux admin wrote:

> On Fri, Aug 23, 2019 at 11:43:32AM +0100, Vincenzo Frascino wrote:
> > Hi Russell,
> >
> > On 8/23/19 11:36 AM, Russell King - ARM Linux admin wrote:
> > > Hi,
> > >
> > > To everyone on the long Cc list...
> > >
> > > What's happening with this? I was about to merge the patches for 32-bit
> > > ARM, which I don't want to do if doing so will cause this regression on
> > > 32-bit ARM as well.
> > >
> >
> > The regression is sorted as of yesterday, a new patch is going through tip:
> > timers/urgent and will be part of the next -rc.
> >
> > If you want to merge them there should be nothing blocking.
>
> I don't have access to the tip tree.

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/urgent

> I'll wait a kernel release cycle instead.

It's going to be part of -rc6. I'll send the pull request to Linus tomorrow.

Thanks,

tglx