Hi,
> Hi,
>
> I upgraded my kernel to 4.18.10 recently and have since been experiencing network problems after resuming from a
> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>
> The pattern of the problem is that when I first boot, the network is fine. But, after resume from suspend I find that
> the time taken for a ping of one of my ISP's nameservers increases from 14-15ms to more than 1000ms. Moreover, when I
> open a browser (chromium or firefox), it fails to retrieve my home page (https://www.google.co.uk) and pings of the
> nameserver fail with the message "Destination Host Unreachable". Often, I can revive the network by stopping it with
> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 module and load it again.
Please have a look at the following thread:
https://lkml.org/lkml/2018/9/25/1118
Maciej
Thanks Maciej.
On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
> Hi,
>
>> Hi,
>>
>> I upgraded my kernel to 4.18.10 recently and have since been experiencing network problems after resuming from a
>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>
>> The pattern of the problem is that when I first boot, the network is fine. But, after resume from suspend I find that
>> the time taken for a ping of one of my ISP's nameservers increases from 14-15ms to more than 1000ms. Moreover, when I
>> open a browser (chromium or firefox), it fails to retrieve my home page (https://www.google.co.uk) and pings of the
>> nameserver fail with the message "Destination Host Unreachable". Often, I can revive the network by stopping it with
>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 module and load it again.
>
> Please have a look at the following thread:
> https://lkml.org/lkml/2018/9/25/1118
>
I applied your patch for the 4.18 stable kernels to 4.18.10, but it does not solve the problem. Similarly, I applied
Heiner's patch to 4.19, but again the problem is not solved.
> Maciej
>
Chris
On 29.09.2018 00:00, Chris Clayton wrote:
> Thanks Maciej.
>
> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>> Hi,
>>
>>> Hi,
>>>
>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing network problems after resuming from a
>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>>
>>> The pattern of the problem is that when I first boot, the network is fine. But, after resume from suspend I find that
>>> the time taken for a ping of one of my ISP's nameservers increases from 14-15ms to more than 1000ms. Moreover, when I
>>> open a browser (chromium or firefox), it fails to retrieve my home page (https://www.google.co.uk) and pings of the
>>> nameserver fail with the message "Destination Host Unreachable". Often, I can revive the network by stopping it with
>>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 module and load it again.
>>
>> Please have a look at the following thread:
>> https://lkml.org/lkml/2018/9/25/1118
>>
>
> I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem is not solved by it. Similarly, I applied
> Heiner's patch to the 4.19, but again the problem is not solved.
>
I think we are talking about two different issues here. The one the fix addresses is unrelated to suspend/resume.
Chris, the lspci output doesn't provide enough detail to determine the exact chip version.
Can you provide the dmesg part with the XID?
According to your lspci output neither MSI nor MSI-X is active.
Do you have to use nomsi for whatever reason?
Heiner
>> Maciej
>>
> Chris
>
On 28/09/2018 23:13, Heiner Kallweit wrote:
> On 29.09.2018 00:00, Chris Clayton wrote:
>> Thanks Maciej.
>>
>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>>> Hi,
>>>
>>>> Hi,
>>>>
>>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing network problems after resuming from a
>>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>>>
>>>> The pattern of the problem is that when I first boot, the network is fine. But, after resume from suspend I find that
>>>> the time taken for a ping of one of my ISP's nameservers increases from 14-15ms to more than 1000ms. Moreover, when I
>>>> open a browser (chromium or firefox), it fails to retrieve my home page (https://www.google.co.uk) and pings of the
>>>> nameserver fail with the message "Destination Host Unreachable". Often, I can revive the network by stopping it with
>>>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 module and load it again.
>>>
>>> Please have a look at the following thread:
>>> https://lkml.org/lkml/2018/9/25/1118
>>>
>>
>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem is not solved by it. Similarly, I applied
>> Heiner's patch to the 4.19, but again the problem is not solved.
>>
> I think we talk about two different issues here. The one the fix is for has no link to suspend/resume.
>
> Chris, the lspci output doesn't provide enough detail to determine the exact chip version.
> Can you provide the dmesg part with the XID?
$ dmesg | grep -i r8169
[ 5.320679] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[ 5.321432] r8169 0000:05:00.2: can't disable ASPM; OS doesn't have ASPM control
[ 5.322892] r8169 0000:05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 19
[ 5.323786] r8169 0000:05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[ 10.232077] r8169 0000:05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
[ 10.235218] r8169 0000:05:00.2 eth0: link down
[ 11.717460] r8169 0000:05:00.2 eth0: link up
$ dmesg | grep -i r8169
[ 5.208040] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[ 5.208677] r8169 0000:05:00.2: can't disable ASPM; OS doesn't have ASPM control
[ 5.210066] r8169 0000:05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
[ 5.210676] r8169 0000:05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[ 10.456081] r8169 0000:05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
[ 10.459217] r8169 0000:05:00.2 eth0: link down
[ 10.459880] r8169 0000:05:00.2 eth0: link down
[ 12.015158] r8169 0000:05:00.2 eth0: link up
> According to your lspci output neither MSI nor MSI-X is active.
> Do you have to use nomsi for whatever reason?
No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% sure that it used to be - I've no idea how
it got dropped. If I'm not sure about an option, I start by taking the recommendation in the kconfig help. Help on MSI
has a very clear "say Y".
>
> Heiner
>
>>> Maciej
>>>
>> Chris
>>
>
Sorry, sent by accident. Note to self - don't attempt email until after second cup of coffee.
On 29/09/2018 08:25, Chris Clayton wrote:
>
>
> On 28/09/2018 23:13, Heiner Kallweit wrote:
>> On 29.09.2018 00:00, Chris Clayton wrote:
>>> Thanks Maciej.
>>>
>>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>>>> Hi,
>>>>
>>>>> Hi,
>>>>>
>>>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing network problems after resuming from a
>>>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>>>>
>>>>> The pattern of the problem is that when I first boot, the network is fine. But, after resume from suspend I find that
>>>>> the time taken for a ping of one of my ISP's nameservers increases from 14-15ms to more than 1000ms. Moreover, when I
>>>>> open a browser (chromium or firefox), it fails to retrieve my home page (https://www.google.co.uk) and pings of the
>>>>> nameserver fail with the message "Destination Host Unreachable". Often, I can revive the network by stopping it with
>>>>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 module and load it again.
>>>>
>>>> Please have a look at the following thread:
>>>> https://lkml.org/lkml/2018/9/25/1118
>>>>
>>>
>>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem is not solved by it. Similarly, I applied
>>> Heiner's patch to the 4.19, but again the problem is not solved.
>>>
>> I think we talk about two different issues here. The one the fix is for has no link to suspend/resume.
>>
>> Chris, the lspci output doesn't provide enough detail to determine the exact chip version.
>> Can you provide the dmesg part with the XID?
I meant to say that I have now re-enabled MSI in 4.18.7 - the latest stable series kernel in which eth0 continues to
function reliably after a suspend/resume cycle. The second dmesg output below is taken from that kernel. The first one
was from an up-to-date 4.19 kernel.
>
> $ dmesg | grep -i r8169
> [ 5.320679] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> [ 5.321432] r8169 0000:05:00.2: can't disable ASPM; OS doesn't have ASPM control
> [ 5.322892] r8169 0000:05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 19
> [ 5.323786] r8169 0000:05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
> [ 10.232077] r8169 0000:05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
> [ 10.235218] r8169 0000:05:00.2 eth0: link down
> [ 11.717460] r8169 0000:05:00.2 eth0: link up
>
> $ dmesg | grep -i r8169
> [ 5.208040] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> [ 5.208677] r8169 0000:05:00.2: can't disable ASPM; OS doesn't have ASPM control
> [ 5.210066] r8169 0000:05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
> [ 5.210676] r8169 0000:05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
> [ 10.456081] r8169 0000:05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
> [ 10.459217] r8169 0000:05:00.2 eth0: link down
> [ 10.459880] r8169 0000:05:00.2 eth0: link down
> [ 12.015158] r8169 0000:05:00.2 eth0: link up
>
>
>> According to your lspci output neither MSI nor MSI-X is active.
>> Do you have to use nomsi for whatever reason?
>
> No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% sure that it used to be - I've no idea how
> it got dropped. If I'm not sure about an option, I start by taking the recommendation in the kconfig help. Help on MSI
> has a very clear "say Y".
As I said above I have re-enabled MSI.
>
>>
>> Heiner
>>
>>>> Maciej
>>>>
>>> Chris
>>>
>>
Hi Heiner,
Here's the reply to your questions. Sorry for the delay.
On 28/09/2018 23:13, Heiner Kallweit wrote:
> On 29.09.2018 00:00, Chris Clayton wrote:
>> Thanks Maciej.
>>
>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>>> Hi,
>>>
>>>> Hi,
>>>>
>>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing network problems after resuming from a
>>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>>>
>>>> The pattern of the problem is that when I first boot, the network is fine. But, after resume from suspend I find that
>>>> the time taken for a ping of one of my ISP's nameservers increases from 14-15ms to more than 1000ms. Moreover, when I
>>>> open a browser (chromium or firefox), it fails to retrieve my home page (https://www.google.co.uk) and pings of the
>>>> nameserver fail with the message "Destination Host Unreachable". Often, I can revive the network by stopping it with
>>>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 module and load it again.
>>>
>>> Please have a look at the following thread:
>>> https://lkml.org/lkml/2018/9/25/1118
>>>
>>
>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem is not solved by it. Similarly, I applied
>> Heiner's patch to the 4.19, but again the problem is not solved.
>>
> I think we talk about two different issues here. The one the fix is for has no link to suspend/resume.
>
> Chris, the lspci output doesn't provide enough detail to determine the exact chip version.
> Can you provide the dmesg part with the XID?
$ dmesg | grep r8169
[ 5.274938] libphy: r8169: probed
[ 5.276563] r8169 0000:05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
[ 5.278158] r8169 0000:05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[ 9.275275] RTL8211E Gigabit Ethernet r8169-502:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-502:00, irq=IGNORE)
[ 9.460876] r8169 0000:05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
[ 11.005336] r8169 0000:05:00.2 eth0: Link is Up - 100Mbps/Full - flow control rx/tx
> According to your lspci output neither MSI nor MSI-X is active.
> Do you have to use nomsi for whatever reason?
>
No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% sure that it used to be - I've no idea how
it got dropped. If I'm not sure about an option, I start by taking the recommendation in the kconfig help. Help on MSI
has a very clear "say Y". I've re-enabled it now.
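
For anyone else checking the same thing, here is a minimal sketch of how the MSI setting in a kernel config could be verified. The function name msi_config_state is made up for illustration, and the config file location is an assumption that varies between systems.

```shell
#!/bin/sh
# Sketch only: msi_config_state is a hypothetical helper, not an existing tool.

# msi_config_state <config-file>: report whether CONFIG_PCI_MSI=y is present
# in the given kernel config file.
msi_config_state() {
    if grep -q '^CONFIG_PCI_MSI=y' "$1"; then
        echo "enabled"
    else
        # Covers both "# CONFIG_PCI_MSI is not set" and a missing entry.
        echo "not enabled"
    fi
}
```

Typical use would be something like `msi_config_state /boot/config-$(uname -r)`, or against the .config in the kernel build tree. Whether the device itself is using MSI can then be seen in the "MSI: Enable+" or "MSI: Enable-" capability line of `lspci -vv -s 05:00.2` (run as root to see the capabilities).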
Chris
> Heiner
>
>>> Maciej
>>>
>> Chris
>>
>
>
Hi again,
I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
regression is still present and my network still fails when, after a resume from suspend (to RAM or disk), I open my
browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
in the browser. Pinging one of my ISP's name servers doesn't fail quite so quickly but the reported time increases from
14-15ms to more than 1000ms.
Chris
On 04/10/2018 09:41, Chris Clayton wrote:
> Hi Heiner,
>
> Here's the reply to your questions. Sorry for the delay.
>
> On 28/09/2018 23:13, Heiner Kallweit wrote:
>> On 29.09.2018 00:00, Chris Clayton wrote:
>>> Thanks Maciej.
>>>
>>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>>>> Hi,
>>>>
>>>>> Hi,
>>>>>
>>>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing network problems after resuming from a
>>>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>>>>
>>>>> The pattern of the problem is that when I first boot, the network is fine. But, after resume from suspend I find that
>>>>> the time taken for a ping of one of my ISP's nameservers increases from 14-15ms to more than 1000ms. Moreover, when I
>>>>> open a browser (chromium or firefox), it fails to retrieve my home page (https://www.google.co.uk) and pings of the
>>>>> nameserver fail with the message "Destination Host Unreachable". Often, I can revive the network by stopping it with
>>>>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 module and load it again.
>>>>
>>>> Please have a look at the following thread:
>>>> https://lkml.org/lkml/2018/9/25/1118
>>>>
>>>
>>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem is not solved by it. Similarly, I applied
>>> Heiner's patch to the 4.19, but again the problem is not solved.
>>>
>> I think we talk about two different issues here. The one the fix is for has no link to suspend/resume.
>>
>> Chris, the lspci output doesn't provide enough detail to determine the exact chip version.
>> Can you provide the dmesg part with the XID?
>
> $ dmesg | grep r8169
> [ 5.274938] libphy: r8169: probed
> [ 5.276563] r8169 0000:05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
> [ 5.278158] r8169 0000:05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
> [ 9.275275] RTL8211E Gigabit Ethernet r8169-502:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-502:00, irq=IGNORE)
> [ 9.460876] r8169 0000:05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
> [ 11.005336] r8169 0000:05:00.2 eth0: Link is Up - 100Mbps/Full - flow control rx/tx
>
>> According to your lspci output neither MSI nor MSI-X is active.
>> Do you have to use nomsi for whatever reason?
>>
>
> No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% sure that it used to be - I've no idea how
> it got dropped. If I'm not sure about an option, I start by taking the recommendation in the kconfig help. Help on MSI
> has a very clear "say Y". I've re-enabled it now.
>
> Chris
>
>> Heiner
>>
>>>> Maciej
>>>>
>>> Chris
>>>
>>
>>
On 07.10.2018 21:36, Chris Clayton wrote:
> Hi again,
>
> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
> in the browser. Pinging one of my ISPs name servers doesn't fail quite so quickly but the reported time increases from
> 14-15ms to more than 1000ms.
You can try comparing chip registers (ethtool -d eth0) in the working
state (before a suspend) and in the broken state (after a resume).
Maybe something obvious will show up in the difference.
The same goes for the PCI configuration (lspci -d :8168 -vv).
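
A rough sketch of how the two states could be captured and compared follows. The helper names snapshot and compare, the default interface eth0 and the dump directory are assumptions for illustration; both ethtool -d and lspci -vv generally need root to give full output.

```shell
#!/bin/sh
# Sketch only: snapshot/compare are hypothetical helpers; the interface name
# and the dump directory are assumptions, adjust them for your system.

# snapshot <label> [iface] [dir]: save the register dump and the PCI
# configuration under <dir>, tagged with <label> (e.g. "before", "after").
snapshot() {
    label=$1
    iface=${2:-eth0}
    dir=${3:-/tmp/r8169-dumps}
    mkdir -p "$dir" || return 1
    ethtool -d "$iface" > "$dir/ethtool-$label.txt"
    lspci -d :8168 -vv > "$dir/lspci-$label.txt"
}

# compare <file-a> <file-b>: print a unified diff, or state that the two
# dumps are identical.
compare() {
    if diff -u "$1" "$2"; then
        echo "identical: $1 $2"
    fi
}
```

The idea would be to run `snapshot before`, suspend and resume, run `snapshot after`, and then call compare on each pair of files in the dump directory.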
> Chris
Maciej
Thanks to Maciej and Heiner for their replies.
On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
> On 07.10.2018 21:36, Chris Clayton wrote:
>> Hi again,
>>
>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so quickly but the reported time increases from
>> 14-15ms to more than 1000ms.
>
> You can try comparing chip registers (ethtool -d eth0) in the working
> state (before a suspend) and in the broken state (after a resume).
> Maybe there will be some obvious in the difference.
>
> The same goes for the PCI configuration (lspci -d :8168 -vv).
>
Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
I've attached files I redirected the outputs to.
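
For reference, a small sketch that decodes the receive-filter part of an RxConfig value. The bit names are the rx_mode_bits values as I remember them from r8169.c (low six bits only), so treat them as an assumption and check them against your tree; the helper name decode_rx_accept is made up, and the upper bits are deliberately not decoded.

```shell
#!/bin/sh
# Sketch only: decode_rx_accept is a hypothetical helper. The mask/name pairs
# are assumed to match rx_mode_bits in r8169.c; verify before relying on them.

# decode_rx_accept <value>: print the names of the accept bits set in the
# low six bits of an RxConfig value such as 0x0002870e.
decode_rx_accept() {
    val=$(( $1 ))
    out=""
    for pair in 0x20:AcceptErr 0x10:AcceptRunt 0x08:AcceptBroadcast \
                0x04:AcceptMulticast 0x02:AcceptMyPhys 0x01:AcceptAllPhys; do
        mask=${pair%%:*}
        name=${pair#*:}
        if [ $(( $val & $mask )) -ne 0 ]; then
            out="$out $name"
        fi
    done
    # Strip the leading space left by the first append.
    echo "${out# }"
}
```

With those assumed bit definitions, `decode_rx_accept 0x0002870e` lists AcceptBroadcast, AcceptMulticast and AcceptMyPhys, which is what one would expect for a normal (non-promiscuous) interface.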
Please don't hesitate to ask for any other information needed to solve this problem. In the meantime, I've now got
scripts that stop the network during suspend and restart it during resume. (Those scripts were removed whilst I gathered
the diagnostics shown in the attachments.)
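
In case it is useful to anyone hitting the same problem, a workaround of that kind could look roughly like the sketch below. It is not a copy of my scripts: the function names, the pre/post arguments and the DRY_RUN switch are made up for illustration, and how the function gets called on suspend and resume depends on whatever hook mechanism your system provides.

```shell
#!/bin/sh
# Sketch only: run and net_sleep_hook are hypothetical helpers.

# run <cmd...>: execute the command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# net_sleep_hook pre|post [iface]: take the interface down and unload r8169
# before suspend, then reload the module and bring the interface up on resume.
net_sleep_hook() {
    iface=${2:-eth0}
    case $1 in
        pre)
            run /sbin/ifdown "$iface"
            run modprobe -r r8169
            ;;
        post)
            run modprobe r8169
            run /sbin/ifup "$iface"
            ;;
        *)
            echo "usage: net_sleep_hook pre|post [iface]" >&2
            return 2
            ;;
    esac
}
```

Unloading the module is only needed on the occasions where ifdown/ifup alone does not revive the network, so those two lines could be dropped if they prove unnecessary.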
Chris
>> Chris
>
> Maciej
>
On 09.10.2018 16:40, Chris Clayton wrote:
> Thanks to Maciej and Heiner for their replies.
>
> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>> On 07.10.2018 21:36, Chris Clayton wrote:
>>> Hi again,
>>>
>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so quickly but the reported time increases from
>>> 14-15ms to more than 1000ms.
>>
>> You can try comparing chip registers (ethtool -d eth0) in the working
>> state (before a suspend) and in the broken state (after a resume).
>> Maybe there will be some obvious in the difference.
>>
>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>
> Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
>
> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
> Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
>
> I've attached files I redirected the outputs to.
>
> Please don't hesitate to ask for any other information needed to solve this problem. In the meantime, I've now got
> scripts that stop the network during suspend and restart it during resume. (Those scripts were removed whilst I gathered
> the diagnostics shown in the attachments.)
>
I'd like to check whether it may be a timing issue. The following experimental patch
adds a PCI commit after writing register ChipCmd. Could you please check whether
it changes anything?
diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 7d3f671e1..f3c359492 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -4641,6 +4641,7 @@ static void rtl_hw_start(struct rtl8169_private *tp)
 	/* Initially a 10 us delay. Turned it into a PCI commit. - FR */
 	RTL_R8(tp, IntrMask);
 	RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
+	RTL_R8(tp, ChipCmd);
 	rtl_init_rxcfg(tp);
 	rtl_set_tx_config_registers(tp);
--
2.19.1
On 09.10.2018 16:40, Chris Clayton wrote:
> Thanks to Maciej and Heiner for their replies.
>
> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>> On 07.10.2018 21:36, Chris Clayton wrote:
>>> Hi again,
>>>
>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so quickly but the reported time increases from
>>> 14-15ms to more than 1000ms.
>>
>> You can try comparing chip registers (ethtool -d eth0) in the working
>> state (before a suspend) and in the broken state (after a resume).
>> Maybe there will be some obvious in the difference.
>>
>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>
> Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
>
> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
> Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
>
Hmm, this is very weird, especially taking into account that in your original
report you state that removing the call to rtl_init_rxcfg() from rtl_hw_start()
fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
register values seem to be the same before and after resume. So how can the
chip behave differently?
So far my best guess is that some chip quirk causes it to accept writes to
register RxConfig, but to misinterpret or ignore the written value.
So far your report is the only one (affecting RTL8411), but we don't know
whether other chip versions are affected too.
One option could be to call rtl_init_rxcfg() for chip versions <= 06 only
because for them we know that they need this call.
> I've attached files I redirected the outputs to.
>
> Please don't hesitate to ask for any other information needed to solve this problem. In the meantime, I've now got
> scripts that stop the network during suspend and restart it during resume. (Those scripts were removed whilst I gathered
> the diagnostics shown in the attachments.)
>
> Chris
>
>>> Chris
>>
>> Maciej
>>
On 09/10/2018 22:39, Heiner Kallweit wrote:
> On 09.10.2018 16:40, Chris Clayton wrote:
>> Thanks to Maciej and Heiner for their replies.
>>
>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>> Hi again,
>>>>
>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>>>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>>>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so quickly but the reported time increases from
>>>> 14-15ms to more than 1000ms.
>>>
>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>> state (before a suspend) and in the broken state (after a resume).
>>> Maybe there will be some obvious in the difference.
>>>
>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>
>> Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
>>
>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
>> Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
>>
>> I've attached files I redirected the outputs to.
>>
>> Please don't hesitate to ask for any other information needed to solve this problem. In the meantime, I've now got
>> scripts that stop the network during suspend and restart it during resume. (Those scripts were removed whilst I gathered
>> the diagnostics shown in the attachments.)
>>
> I'd like to check whether it may be a timing issue. The following experimental patch
> adds a PCI commit after writing register ChipCmd. Could you please check whether
> it changes anything?
>
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 7d3f671e1..f3c359492 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -4641,6 +4641,7 @@ static void rtl_hw_start(struct rtl8169_private *tp)
> /* Initially a 10 us delay. Turned it into a PCI commit. - FR */
> RTL_R8(tp, IntrMask);
> RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
> + RTL_R8(tp, ChipCmd);
> rtl_init_rxcfg(tp);
> rtl_set_tx_config_registers(tp);
>
>
Sorry, this patch doesn't make any difference - my network still fails. After a suspend/resume my browsers (chromium
and firefox) both fail to open my home page (https://www.google.co.uk). The ping time for one of my ISP's name servers
increases from 14-15ms to more than 1000ms, although after a few pings it does reduce. As the screen grab below
shows, the network does eventually fail:
$ ping NS1
PING ns1 (90.207.238.97): 56 data bytes
64 bytes from 90.207.238.97: icmp_seq=0 ttl=251 time=1017.289 ms
64 bytes from 90.207.238.97: icmp_seq=1 ttl=251 time=1018.051 ms
64 bytes from 90.207.238.97: icmp_seq=2 ttl=251 time=1015.271 ms
64 bytes from 90.207.238.97: icmp_seq=3 ttl=251 time=1015.495 ms
64 bytes from 90.207.238.97: icmp_seq=6 ttl=251 time=1015.646 ms
64 bytes from 90.207.238.97: icmp_seq=7 ttl=251 time=1022.609 ms
64 bytes from 90.207.238.97: icmp_seq=8 ttl=251 time=1015.612 ms
64 bytes from 90.207.238.97: icmp_seq=10 ttl=251 time=1015.551 ms
64 bytes from 90.207.238.97: icmp_seq=12 ttl=251 time=1015.446 ms
64 bytes from 90.207.238.97: icmp_seq=13 ttl=251 time=1015.657 ms
64 bytes from 90.207.238.97: icmp_seq=14 ttl=251 time=1015.614 ms
64 bytes from 90.207.238.97: icmp_seq=15 ttl=251 time=1015.651 ms
64 bytes from 90.207.238.97: icmp_seq=17 ttl=251 time=1015.459 ms
64 bytes from 90.207.238.97: icmp_seq=18 ttl=251 time=1015.443 ms
64 bytes from 90.207.238.97: icmp_seq=19 ttl=251 time=1015.936 ms
64 bytes from 90.207.238.97: icmp_seq=20 ttl=251 time=1015.681 ms
64 bytes from 90.207.238.97: icmp_seq=22 ttl=251 time=1015.410 ms
64 bytes from 90.207.238.97: icmp_seq=23 ttl=251 time=1015.487 ms
64 bytes from 90.207.238.97: icmp_seq=24 ttl=251 time=1016.169 ms
64 bytes from 90.207.238.97: icmp_seq=25 ttl=251 time=1015.659 ms
64 bytes from 90.207.238.97: icmp_seq=26 ttl=251 time=14.606 ms
64 bytes from 90.207.238.97: icmp_seq=30 ttl=251 time=32.765 ms
64 bytes from 90.207.238.97: icmp_seq=31 ttl=251 time=115.052 ms
64 bytes from 90.207.238.97: icmp_seq=33 ttl=251 time=757.115 ms
64 bytes from 90.207.238.97: icmp_seq=34 ttl=251 time=176.696 ms
64 bytes from 90.207.238.97: icmp_seq=35 ttl=251 time=1017.462 ms
64 bytes from 90.207.238.97: icmp_seq=36 ttl=251 time=16.394 ms
64 bytes from 90.207.238.97: icmp_seq=37 ttl=251 time=20.402 ms
64 bytes from 90.207.238.97: icmp_seq=38 ttl=251 time=37.795 ms
64 bytes from 90.207.238.97: icmp_seq=39 ttl=251 time=141.997 ms
92 bytes from laptop.local.lan (192.168.0.20): Destination Host Unreachable
92 bytes from laptop.local.lan (192.168.0.20): Destination Host Unreachable
...
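
A long ping log like this is easier to compare between test runs when it is reduced to a few numbers. The sketch below is one way to do that, assuming the output format shown above; the helper name ping_summary and the 1000 ms "slow" threshold are my own choices.

```shell
#!/bin/sh
# Sketch only: ping_summary is a hypothetical helper that assumes ping output
# lines of the form "... icmp_seq=N ttl=T time=X ms".

# ping_summary [file...]: read ping output from the files (or stdin) and
# print the number of replies, replies taking 1000 ms or more, sequence
# numbers missing between the first and last reply, and "unreachable" errors.
ping_summary() {
    awk '
        /icmp_seq=/ {
            seq = $0; sub(/.*icmp_seq=/, "", seq); sub(/ .*/, "", seq)
            t = $0;   sub(/.*time=/, "", t);       sub(/ .*/, "", t)
            if (n == 0) first = seq
            last = seq
            n++
            if (t + 0 >= 1000) slow++
        }
        /Destination Host Unreachable/ { unreachable++ }
        END {
            lost = (n > 0) ? (last - first + 1 - n) : 0
            printf "replies=%d slow=%d lost=%d unreachable=%d\n", n, slow, lost, unreachable
        }
    ' "$@"
}
```

It could be used as `ping ns1 | tee ping.log` followed by `ping_summary ping.log` once the ping has been interrupted.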
Chris
On 09.10.2018 22:36, Heiner Kallweit wrote:
> On 09.10.2018 16:40, Chris Clayton wrote:
>> Thanks to Maciej and Heiner for their replies.
>>
>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>> Hi again,
>>>>
>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>>>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>>>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so quickly but the reported time increases from
>>>> 14-15ms to more than 1000ms.
>>>
>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>> state (before a suspend) and in the broken state (after a resume).
>>> Maybe there will be some obvious in the difference.
>>>
>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>
>> Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
>>
>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
>> Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
>>
> Hmm, this is very weird, especially taking into account that in your original
> report you state that removing the call to rtl_init_rxcfg() from rtl_hw_start()
> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
> register values seem to be the same before and after resume. So how can the
> chip behave differently?
> So far my best guess is that some chip quirk causes it to accept writes to
> register RxConfig, but to misinterpret or ignore the written value.
> So far your report is the only one (affecting RTL8411), but we don't know
> whether other chip versions are affected too.
Also, it is interesting that even if one removes the call to
rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
written to moments later by rtl_set_rx_mode().
The only chip accesses in the meantime seem to be a write to TxConfig by
rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
to MAR0 earlier in rtl_set_rx_mode().
My proposals are:
1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
in rtl_hw_start().
Maybe the chip sometimes does not like RxConfig being written before
TxConfig.
2) Check the original value of RxConfig (after a resume) before
rtl_init_rxcfg() overwrites it (compile tested only):
--- r8169.c.ori
+++ r8169.c
@@ -5155,6 +5155,9 @@
 	/* Initially a 10 us delay. Turned it into a PCI commit. - FR */
 	RTL_R8(tp, IntrMask);
 	RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
+
+	pr_notice("RxConfig before init was %.8x\n",
+		  (unsigned int)RTL_R32(tp, RxConfig));
 	rtl_init_rxcfg(tp);
 	rtl_set_tx_config_registers(tp);
This should be the value that you got when you removed the call to
rtl_init_rxcfg() for testing.
Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
writes (under the "default:" label for your NIC model).
Hope this helps,
Maciej
On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
> On 09.10.2018 22:36, Heiner Kallweit wrote:
>> On 09.10.2018 16:40, Chris Clayton wrote:
>>> Thanks to Maciej and Heiner for their replies.
>>>
>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>> Hi again,
>>>>>
>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>>>>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>>>>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so quickly but the reported time increases from
>>>>> 14-15ms to more than 1000ms.
>>>>
>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>> state (before a suspend) and in the broken state (after a resume).
>>>> Maybe there will be something obvious in the difference.
>>>>
>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>
>>> Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
>>>
>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
>>> Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
>>>
>> Hmm, this is very weird, especially taking into account that in your original
>> report you state that removing the call to rtl_init_rxcfg() from rtl_hw_start()
>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
>> register values seem to be the same before and after resume. So how can the
>> chip behave differently?
>> So far my best guess is that some chip quirk causes it to accept writes to
>> register RxConfig, but to misinterpret or ignore the written value.
>> So far your report is the only one (affecting RTL8411), but we don't know
>> whether other chip versions are affected too.
>
> Also, it is interesting that even if one removes a call to
> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
> written to moments later by rtl_set_rx_mode().
>
> The only chip accesses in the meantime seem to be a write to TxConfig by
> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
> to MAR0 earlier in rtl_set_rx_mode().
>
> My proposals are:
> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
> in rtl_hw_start().
> Maybe the chip does not like sometimes that RxConfig is written before
> TxConfig.
>
After testing your first proposal, which made no difference, I found the following in the output from dmesg:
[ 761.999468] ------------[ cut here ]------------
[ 761.999471] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
[ 761.999483] WARNING: CPU: 0 PID: 8938 at net/sched/sch_generic.c:461 dev_watchdog+0x1e9/0x1f0
[ 761.999484] Modules linked in: btusb btintel r8169 rfcomm bnep iptable_filter xt_conntrack iptable_nat ipt_MASQUERADE
nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv4 uvcvideo videobuf2_vmalloc videobuf2_memops snd_hda_codec_via
videobuf2_v4l2 snd_hda_codec_hdmi snd_hda_codec_generic videobuf2_common usbhid realtek coretemp snd_hda_intel hwmon
snd_hda_codec x86_pkg_temp_thermal snd_hwdep libphy snd_hda_core [last unloaded: btintel]
[ 761.999503] CPU: 0 PID: 8938 Comm: kworker/0:0 Not tainted 4.19.0-rc7 #328
[ 761.999504] Hardware name: Notebook W65_67SZ/W65_67SZ, BIOS 1.03.05 02/26/2014
[ 761.999508] Workqueue: events rtl_task [r8169]
[ 761.999510] RIP: 0010:dev_watchdog+0x1e9/0x1f0
[ 761.999512] Code: 00 48 63 4d e8 eb 99 4c 89 ef c6 05 b6 13 a6 00 01 e8 1b c7 fd ff 89 d9 4c 89 ee 48 c7 c7 40 53 e1
81 48 89 c2 e8 ae f4 a3 ff <0f> 0b eb c0 0f 1f 00 48 c7 47 08 00 00 00 00 48 c7 07 00 00 00 00
[ 761.999513] RSP: 0018:ffff88040f803e98 EFLAGS: 00010282
[ 761.999514] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
[ 761.999516] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff88040f8153d0
[ 761.999517] RBP: ffff88040ca9a3b8 R08: ffffffff813565f0 R09: 000000000000034e
[ 761.999517] R10: 0000000000000007 R11: 0000000000000000 R12: ffff88040ca9a39c
[ 761.999518] R13: ffff88040ca9a000 R14: 0000000000000001 R15: ffff8803ea17cc80
[ 761.999520] FS: 0000000000000000(0000) GS:ffff88040f800000(0000) knlGS:0000000000000000
[ 761.999521] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 761.999522] CR2: 00007f67280206b8 CR3: 000000000200a002 CR4: 00000000001606f0
[ 761.999523] Call Trace:
[ 761.999525] <IRQ>
[ 761.999527] ? qdisc_reset+0xe0/0xe0
[ 761.999529] ? qdisc_reset+0xe0/0xe0
[ 761.999532] call_timer_fn+0x11/0x70
[ 761.999534] expire_timers+0x8e/0xa0
[ 761.999535] run_timer_softirq+0x7e/0x150
[ 761.999538] ? __hrtimer_run_queues+0x12b/0x1a0
[ 761.999541] ? recalibrate_cpu_khz+0x10/0x10
[ 761.999543] ? ktime_get+0x32/0x90
[ 761.999546] ? lapic_next_event+0x20/0x20
[ 761.999549] __do_softirq+0xcc/0x1fc
[ 761.999552] irq_exit+0x82/0xb0
[ 761.999554] smp_apic_timer_interrupt+0x61/0x90
[ 761.999556] apic_timer_interrupt+0xf/0x20
[ 761.999557] </IRQ>
[ 761.999560] RIP: 0010:rtl_slow_event_work+0x2a/0x1f0 [r8169]
[ 761.999562] Code: 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 10 4c 8b 67 10 65 48 8b 04 25 28 00 00 00 48 89 44 24 08
31 c0 48 8b 07 66 8b 68 3e <66> 23 af da 0d 00 00 48 8b 07 66 89 68 3e 40 f6 c5 40 0f 85 3b 01
[ 761.999563] RSP: 0018:ffffc900014d7e40 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[ 761.999564] RAX: ffffc900000b9000 RBX: ffff88040ca9a7c0 RCX: ffff88040f81f160
[ 761.999565] RDX: ffff8803ea21b300 RSI: 0000000000000000 RDI: ffff88040ca9a7c0
[ 761.999566] RBP: ffff88040ca90050 R08: 0000000000000000 R09: 000073746e657665
[ 761.999567] R10: 8080808080808080 R11: ffff88040f81ea68 R12: ffff88040ca9a000
[ 761.999568] R13: ffff88040ca9a000 R14: ffff88040f81f140 R15: 0000000000000000
[ 761.999571] ? __switch_to_asm+0x34/0x70
[ 761.999573] rtl_task+0x4f/0x70 [r8169]
[ 761.999576] process_one_work+0x1bc/0x2f0
[ 761.999577] worker_thread+0x28/0x3c0
[ 761.999579] ? process_one_work+0x2f0/0x2f0
[ 761.999581] kthread+0x109/0x120
[ 761.999583] ? kthread_park+0x80/0x80
[ 761.999585] ret_from_fork+0x35/0x40
[ 761.999586] ---[ end trace fd5800440feffc06 ]---
I haven't seen this before, but maybe it's a consequence of swapping the order of the two function calls.
I'll work on the second proposal later today.
Chris
> 2) Check the original value of RxConfig (after a resume) before
> rtl_init_rxcfg() overwrites it (compile tested only):
> --- r8169.c.ori
> +++ r8169.c
> @@ -5155,6 +5155,9 @@
> /* Initially a 10 us delay. Turned it into a PCI commit. - FR */
> RTL_R8(tp, IntrMask);
> RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
> +
> + pr_notice("RxConfig before init was %.8x\n",
> + (unsigned int)RTL_R32(tp, RxConfig));
> rtl_init_rxcfg(tp);
> rtl_set_tx_config_registers(tp);
>
>
> This should be the value that you got when you removed the call to
> rtl_init_rxcfg() for testing.
> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
> writes (under the "default:" label for your NIC model).
>
> Hope this helps,
> Maciej
>
Sorry, I forgot that editing r8169.c and rebuilding would result in rc7+, so I tested the wrong kernel/module to get the
results I provided below. That, however, may make the results more interesting because they happened with a virgin rc7
kernel/module.
I'll test your proposals properly later.
Chris
On 10/10/2018 09:09, Chris Clayton wrote:
>
>
> On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
>> On 09.10.2018 22:36, Heiner Kallweit wrote:
>>> On 09.10.2018 16:40, Chris Clayton wrote:
>>>> Thanks to Maciej and Heiner for their replies.
>>>>
>>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>>> Hi again,
>>>>>>
>>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>>>>>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>>>>>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>>>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so quickly but the reported time increases from
>>>>>> 14-15ms to more than 1000ms.
>>>>>
>>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>>> state (before a suspend) and in the broken state (after a resume).
>>>>> Maybe there will be something obvious in the difference.
>>>>>
>>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>>
>>>> Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
>>>>
>>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
>>>> Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
>>>>
>>> Hmm, this is very weird, especially taking into account that in your original
>>> report you state that removing the call to rtl_init_rxcfg() from rtl_hw_start()
>>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
>>> register values seem to be the same before and after resume. So how can the
>>> chip behave differently?
>>> So far my best guess is that some chip quirk causes it to accept writes to
>>> register RxConfig, but to misinterpret or ignore the written value.
>>> So far your report is the only one (affecting RTL8411), but we don't know
>>> whether other chip versions are affected too.
>>
>> Also, it is interesting that even if one removes a call to
>> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
>> written to moments later by rtl_set_rx_mode().
>>
>> The only chip accesses in the meantime seem to be a write to TxConfig by
>> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
>> to MAR0 earlier in rtl_set_rx_mode().
>>
>> My proposals are:
>> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
>> in rtl_hw_start().
>> Maybe the chip does not like sometimes that RxConfig is written before
>> TxConfig.
>>
> After testing your first proposal, which made no difference, I found the following in the output from dmesg:
>
> [ 761.999468] ------------[ cut here ]------------
> [ 761.999471] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
> [ 761.999483] WARNING: CPU: 0 PID: 8938 at net/sched/sch_generic.c:461 dev_watchdog+0x1e9/0x1f0
> [ 761.999484] Modules linked in: btusb btintel r8169 rfcomm bnep iptable_filter xt_conntrack iptable_nat ipt_MASQUERADE
> nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv4 uvcvideo videobuf2_vmalloc videobuf2_memops snd_hda_codec_via
> videobuf2_v4l2 snd_hda_codec_hdmi snd_hda_codec_generic videobuf2_common usbhid realtek coretemp snd_hda_intel hwmon
> snd_hda_codec x86_pkg_temp_thermal snd_hwdep libphy snd_hda_core [last unloaded: btintel]
> [ 761.999503] CPU: 0 PID: 8938 Comm: kworker/0:0 Not tainted 4.19.0-rc7 #328
> [ 761.999504] Hardware name: Notebook W65_67SZ /W65_67SZ
> , BIOS 1.03.05 02/26/2014
> [ 761.999508] Workqueue: events rtl_task [r8169]
> [ 761.999510] RIP: 0010:dev_watchdog+0x1e9/0x1f0
> [ 761.999512] Code: 00 48 63 4d e8 eb 99 4c 89 ef c6 05 b6 13 a6 00 01 e8 1b c7 fd ff 89 d9 4c 89 ee 48 c7 c7 40 53 e1
> 81 48 89 c2 e8 ae f4 a3 ff <0f> 0b eb c0 0f 1f 00 48 c7 47 08 00 00 00 00 48 c7 07 00 00 00 00
> [ 761.999513] RSP: 0018:ffff88040f803e98 EFLAGS: 00010282
> [ 761.999514] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
> [ 761.999516] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff88040f8153d0
> [ 761.999517] RBP: ffff88040ca9a3b8 R08: ffffffff813565f0 R09: 000000000000034e
> [ 761.999517] R10: 0000000000000007 R11: 0000000000000000 R12: ffff88040ca9a39c
> [ 761.999518] R13: ffff88040ca9a000 R14: 0000000000000001 R15: ffff8803ea17cc80
> [ 761.999520] FS: 0000000000000000(0000) GS:ffff88040f800000(0000) knlGS:0000000000000000
> [ 761.999521] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 761.999522] CR2: 00007f67280206b8 CR3: 000000000200a002 CR4: 00000000001606f0
> [ 761.999523] Call Trace:
> [ 761.999525] <IRQ>
> [ 761.999527] ? qdisc_reset+0xe0/0xe0
> [ 761.999529] ? qdisc_reset+0xe0/0xe0
> [ 761.999532] call_timer_fn+0x11/0x70
> [ 761.999534] expire_timers+0x8e/0xa0
> [ 761.999535] run_timer_softirq+0x7e/0x150
> [ 761.999538] ? __hrtimer_run_queues+0x12b/0x1a0
> [ 761.999541] ? recalibrate_cpu_khz+0x10/0x10
> [ 761.999543] ? ktime_get+0x32/0x90
> [ 761.999546] ? lapic_next_event+0x20/0x20
> [ 761.999549] __do_softirq+0xcc/0x1fc
> [ 761.999552] irq_exit+0x82/0xb0
> [ 761.999554] smp_apic_timer_interrupt+0x61/0x90
> [ 761.999556] apic_timer_interrupt+0xf/0x20
> [ 761.999557] </IRQ>
> [ 761.999560] RIP: 0010:rtl_slow_event_work+0x2a/0x1f0 [r8169]
> [ 761.999562] Code: 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 10 4c 8b 67 10 65 48 8b 04 25 28 00 00 00 48 89 44 24 08
> 31 c0 48 8b 07 66 8b 68 3e <66> 23 af da 0d 00 00 48 8b 07 66 89 68 3e 40 f6 c5 40 0f 85 3b 01
> [ 761.999563] RSP: 0018:ffffc900014d7e40 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
> [ 761.999564] RAX: ffffc900000b9000 RBX: ffff88040ca9a7c0 RCX: ffff88040f81f160
> [ 761.999565] RDX: ffff8803ea21b300 RSI: 0000000000000000 RDI: ffff88040ca9a7c0
> [ 761.999566] RBP: ffff88040ca90050 R08: 0000000000000000 R09: 000073746e657665
> [ 761.999567] R10: 8080808080808080 R11: ffff88040f81ea68 R12: ffff88040ca9a000
> [ 761.999568] R13: ffff88040ca9a000 R14: ffff88040f81f140 R15: 0000000000000000
> [ 761.999571] ? __switch_to_asm+0x34/0x70
> [ 761.999573] rtl_task+0x4f/0x70 [r8169]
> [ 761.999576] process_one_work+0x1bc/0x2f0
> [ 761.999577] worker_thread+0x28/0x3c0
> [ 761.999579] ? process_one_work+0x2f0/0x2f0
> [ 761.999581] kthread+0x109/0x120
> [ 761.999583] ? kthread_park+0x80/0x80
> [ 761.999585] ret_from_fork+0x35/0x40
> [ 761.999586] ---[ end trace fd5800440feffc06 ]---
>
> I haven't seen this before, but maybe it's a consequence of swapping the order of the two function calls.
>
> I'll work on the second proposal later today.
>
> Chris
>> 2) Check the original value of RxConfig (after a resume) before
>> rtl_init_rxcfg() overwrites it (compile tested only):
>> --- r8169.c.ori
>> +++ r8169.c
>> @@ -5155,6 +5155,9 @@
>> /* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>> RTL_R8(tp, IntrMask);
>> RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
>> +
>> + pr_notice("RxConfig before init was %.8x\n",
>> + (unsigned int)RTL_R32(tp, RxConfig));
>> rtl_init_rxcfg(tp);
>> rtl_set_tx_config_registers(tp);
>>
>>
>> This should be the value that you got when you removed the call to
>> rtl_init_rxcfg() for testing.
>> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
>> writes (under the "default:" label for your NIC model).
>>
>> Hope this helps,
>> Maciej
>>
OK, right kernel/module used this time. Please see findings below.
On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
> On 09.10.2018 22:36, Heiner Kallweit wrote:
>> On 09.10.2018 16:40, Chris Clayton wrote:
>>> Thanks to Maciej and Heiner for their replies.
>>>
>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>> Hi again,
>>>>>
>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>>>>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>>>>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so quickly but the reported time increases from
>>>>> 14-15ms to more than 1000ms.
>>>>
>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>> state (before a suspend) and in the broken state (after a resume).
>>>> Maybe there will be something obvious in the difference.
>>>>
>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>
>>> Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
>>>
>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
>>> Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
>>>
>> Hmm, this is very weird, especially taking into account that in your original
>> report you state that removing the call to rtl_init_rxcfg() from rtl_hw_start()
>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
>> register values seem to be the same before and after resume. So how can the
>> chip behave differently?
>> So far my best guess is that some chip quirk causes it to accept writes to
>> register RxConfig, but to misinterpret or ignore the written value.
>> So far your report is the only one (affecting RTL8411), but we don't know
>> whether other chip versions are affected too.
>
> Also, it is interesting that even if one removes a call to
> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
> written to moments later by rtl_set_rx_mode().
>
> The only chip accesses in the meantime seem to be a write to TxConfig by
> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
> to MAR0 earlier in rtl_set_rx_mode().
>
> My proposals are:
> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
> in rtl_hw_start().
> Maybe the chip does not like sometimes that RxConfig is written before
> TxConfig.
>
This change made no difference. Networking still dies if I open a browser or leave ping running long enough.
> 2) Check the original value of RxConfig (after a resume) before
> rtl_init_rxcfg() overwrites it (compile tested only):
> --- r8169.c.ori
> +++ r8169.c
> @@ -5155,6 +5155,9 @@
> /* Initially a 10 us delay. Turned it into a PCI commit. - FR */
> RTL_R8(tp, IntrMask);
> RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
> +
> + pr_notice("RxConfig before init was %.8x\n",
> + (unsigned int)RTL_R32(tp, RxConfig));
> rtl_init_rxcfg(tp);
> rtl_set_tx_config_registers(tp);
>
>
> This should be the value that you got when you removed the call to
> rtl_init_rxcfg() for testing.
> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
> writes (under the "default:" label for your NIC model).
This might be more interesting. Through a combination of viewing the output from pr_notice() and the output from "ethtool
-d", I can see RxConfig with the following values
During boot: 0x00028700
Before suspend: 0x0002870e
During resume: 0x00024000
Post resume: 0x0002870e
I then removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt, installed and rebooted. Now I see the
following values:
During boot: 0x00028700
Before suspend: 0x0002870e
During resume: 0x00024000
Post resume: 0x0002870e
>
> Hope this helps,
> Maciej
>
Too late at night to be doing this stuff. Clicked send instead of saving a draft. Sorry, please ignore.
On 10/10/2018 23:30, Chris Clayton wrote:
> OK, right kernel/module used this time. Please see findings below.
>
> On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
>> On 09.10.2018 22:36, Heiner Kallweit wrote:
>>> On 09.10.2018 16:40, Chris Clayton wrote:
>>>> Thanks to Maciej and Heiner for their replies.
>>>>
>>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>>> Hi again,
>>>>>>
>>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>>>>>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>>>>>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>>>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so quickly but the reported time increases from
>>>>>> 14-15ms to more than 1000ms.
>>>>>
>>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>>> state (before a suspend) and in the broken state (after a resume).
>>>>> Maybe there will be something obvious in the difference.
>>>>>
>>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>>
>>>> Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
>>>>
>>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
>>>> Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
>>>>
>>> Hmm, this is very weird, especially taking into account that in your original
>>> report you state that removing the call to rtl_init_rxcfg() from rtl_hw_start()
>>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
>>> register values seem to be the same before and after resume. So how can the
>>> chip behave differently?
>>> So far my best guess is that some chip quirk causes it to accept writes to
>>> register RxConfig, but to misinterpret or ignore the written value.
>>> So far your report is the only one (affecting RTL8411), but we don't know
>>> whether other chip versions are affected too.
>>
>> Also, it is interesting that even if one removes a call to
>> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
>> written to moments later by rtl_set_rx_mode().
>>
>> The only chip accesses in the meantime seem to be a write to TxConfig by
>> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
>> to MAR0 earlier in rtl_set_rx_mode().
>>
>> My proposals are:
>> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
>> in rtl_hw_start().
>> Maybe the chip does not like sometimes that RxConfig is written before
>> TxConfig.
>>
>
> This change made no difference. Networking still dies if I open a browser or leave ping running long enough.
>
>> 2) Check the original value of RxConfig (after a resume) before
>> rtl_init_rxcfg() overwrites it (compile tested only):
>> --- r8169.c.ori
>> +++ r8169.c
>> @@ -5155,6 +5155,9 @@
>> /* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>> RTL_R8(tp, IntrMask);
>> RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
>> +
>> + pr_notice("RxConfig before init was %.8x\n",
>> + (unsigned int)RTL_R32(tp, RxConfig));
>> rtl_init_rxcfg(tp);
>> rtl_set_tx_config_registers(tp);
>>
>>
>> This should be the value that you got when you removed the call to
>> rtl_init_rxcfg() for testing.
>> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
>> writes (under the "default:" label for your NIC model).
>
> This might be more interesting. Through a combination of viewing the output from pr_notice() and the output from "ethtool
> -d", I can see RxConfig with the following values
>
> During boot: 0x00028700
> Before suspend: 0x0002870e
> During resume: 0x00024000
> Post resume: 0x0002870e
>
> I then removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt, installed and rebooted. Now I see the
> following values:
>
> During boot: 0x00028700
> Before suspend: 0x0002870e
> During resume: 0x00024000
> Post resume: 0x0002870e
>
>>
>> Hope this helps,
>> Maciej
>>
OK, right kernel/module used this time. Please see findings below.
On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
> On 09.10.2018 22:36, Heiner Kallweit wrote:
>> On 09.10.2018 16:40, Chris Clayton wrote:
>>> Thanks to Maciej and Heiner for their replies.
>>>
>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>> Hi again,
>>>>>
>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>>>>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>>>>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so quickly but the reported time increases from
>>>>> 14-15ms to more than 1000ms.
>>>>
>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>> state (before a suspend) and in the broken state (after a resume).
>>>> Maybe there will be something obvious in the difference.
>>>>
>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>
>>> Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
>>>
>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
>>> Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
>>>
>> Hmm, this is very weird, especially taking into account that in your original
>> report you state that removing the call to rtl_init_rxcfg() from rtl_hw_start()
>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
>> register values seem to be the same before and after resume. So how can the
>> chip behave differently?
>> So far my best guess is that some chip quirk causes it to accept writes to
>> register RxConfig, but to misinterpret or ignore the written value.
>> So far your report is the only one (affecting RTL8411), but we don't know
>> whether other chip versions are affected too.
>
> Also, it is interesting that even if one removes a call to
> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
> written to moments later by rtl_set_rx_mode().
>
> The only chip accesses in the meantime seem to be a write to TxConfig by
> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
> to MAR0 earlier in rtl_set_rx_mode().
>
> My proposals are:
> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
> in rtl_hw_start().
> Maybe the chip does not like sometimes that RxConfig is written before
> TxConfig.
>
This change made no difference. Networking still dies if I open a browser or leave ping running long enough.
> 2) Check the original value of RxConfig (after a resume) before
> rtl_init_rxcfg() overwrites it (compile tested only):
> --- r8169.c.ori
> +++ r8169.c
> @@ -5155,6 +5155,9 @@
> /* Initially a 10 us delay. Turned it into a PCI commit. - FR */
> RTL_R8(tp, IntrMask);
> RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
> +
> + pr_notice("RxConfig before init was %.8x\n",
> + (unsigned int)RTL_R32(tp, RxConfig));
> rtl_init_rxcfg(tp);
> rtl_set_tx_config_registers(tp);
>
>
> This should be the value that you got when you removed the call to
> rtl_init_rxcfg() for testing.
> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
> writes (under the "default:" label for your NIC model).
>
This might be more interesting. Through a combination of viewing the output from pr_notice() and the output from
"ethtool -d", I can see RxConfig with the following values
During boot: 0x00028700
Before suspend: 0x0002870e
During resume: 0x00024000
Post resume: 0x0002870e
As I did with 4.18.10 early on in the process, I removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt,
installed and rebooted. Now I see the following values:
During boot: 0x00028700
Before suspend: 0x0002870e
During resume: 0x00024000
Post resume: 0x0002400e
As with 4.18.10, networking now appears to be stable after the resume. Starting a browser results in my homepage being
displayed and I've spent a few minutes surfing with no interruptions. Similarly, ping runs without stopping. I simply
don't know enough to know what might now be enabled or disabled by this change in value, but hopefully it will provide a
clue to someone as to what is going on.
Chris
> Hope this helps,
> Maciej
>
On 11.10.2018 00:49, Chris Clayton wrote:
>> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
>> writes (under the "default:" label for your NIC model).
>>
>
> This might be more interesting. Through a combination of viewing the output from pr_notice() and the output from
> "ethtool -d", I can see RxConfig with the following values
>
> During boot: 0x00028700
> Before suspend: 0x0002870e
> During resume: 0x00024000
> Post resume: 0x0002870e
>
> As I did with 4.18.10 early on in the process, I removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt,
> installed and rebooted. Now I see the following values:
>
> During boot: 0x00028700
> Before suspend: 0x0002870e
> During resume: 0x00024000
> Post resume: 0x0002400e
>
Now we can finally see some difference...
Besides missing RX128_INT_EN (bit 15 or 0x8000) and RX_DMA_BURST
(bits 8-10 or 0x700) - that rtl_init_rxcfg() would normally set so this
is kind of expected - one can see that the working configuration
post-resume has bit 14 (or 0x4000) set, too.
This bit is described in the driver as RX_MULTI_EN ("8111c only") and is
set by rtl_init_rxcfg() for example for RTL_GIGA_MAC_VER_35.
RTL_GIGA_MAC_VER_35 is described in the driver as being in the same
family as your RTL_GIGA_MAC_VER_38, so can you please try the following
change:
--- r8169.c
+++ r8169.c
@@ -4271,6 +4271,7 @@ static void rtl_init_rxcfg(struct rtl816
case RTL_GIGA_MAC_VER_18 ... RTL_GIGA_MAC_VER_24:
case RTL_GIGA_MAC_VER_34:
case RTL_GIGA_MAC_VER_35:
+ case RTL_GIGA_MAC_VER_38:
RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
break;
case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
This will add RX_MULTI_EN also for your chip model (you need to add back
the call to rtl_init_rxcfg() to rtl_hw_start(), naturally).
If this does not help then I would try other values in the above write:
1) RTL_W32(tp, RxConfig, 0x00024000);
2) RTL_W32(tp, RxConfig, 0x00004000);
3) RTL_W32(tp, RxConfig, RX_DMA_BURST);
4) RTL_W32(tp, RxConfig, RX128_INT_EN);
> Chris
Maciej
On 11/10/2018 01:12, Maciej S. Szmigiero wrote:
> On 11.10.2018 00:49, Chris Clayton wrote:
>>> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
>>> writes (under the "default:" label for your NIC model).
>>>
>>
>> This might be more interesting. Through a combination of viewing the output from pr_notice() and the output from
>> "ethtool -d", I can see RxConfig with the following values
>>
>> During boot: 0x00028700
>> Before suspend: 0x0002870e
>> During resume: 0x00024000
>> Post resume: 0x0002870e
>>
>> As I did with 4.18.10 early on in the process, I removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt,
>> installed and rebooted. Now I see the following values:
>>
>> During boot: 0x00028700
>> Before suspend: 0x0002870e
>> During resume: 0x00024000
>> Post resume: 0x0002400e
>>
>
> Now we can finally see some difference...
> Besides missing RX128_INT_EN (bit 15 or 0x8000) and RX_DMA_BURST
> (bits 8-10 or 0x700) - that rtl_init_rxcfg() would normally set so this
> is kind of expected - one can see that the working configuration
> post-resume has bit 14 (or 0x4000) set, too.
>
> This bit is described in the driver as RX_MULTI_EN ("8111c only") and is
> set by rtl_init_rxcfg() for example for RTL_GIGA_MAC_VER_35.
>
> RTL_GIGA_MAC_VER_35 is described in the driver as being in the same
> family as your RTL_GIGA_MAC_VER_38, so can you please try the following
> change:
> --- r8169.c
> +++ r8169.c
> @@ -4271,6 +4271,7 @@ static void rtl_init_rxcfg(struct rtl816
> case RTL_GIGA_MAC_VER_18 ... RTL_GIGA_MAC_VER_24:
> case RTL_GIGA_MAC_VER_34:
> case RTL_GIGA_MAC_VER_35:
> + case RTL_GIGA_MAC_VER_38:
> RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
> break;
> case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
>
> This will add RX_MULTI_EN also for your chip model (you need to add back
> the call to rtl_init_rxcfg() to rtl_hw_start(), naturally).
>
That's done the trick. With the above change applied, my network runs fine after a suspend/resume cycle and the
ping times are back in the 14-15ms range.
Chris
> If this does not help then I would try other values in the above write:
> 1) RTL_W32(tp, RxConfig, 0x00024000);
> 2) RTL_W32(tp, RxConfig, 0x00004000);
> 3) RTL_W32(tp, RxConfig, RX_DMA_BURST);
> 4) RTL_W32(tp, RxConfig, RX128_INT_EN);
>
>> Chris
>
> Maciej
>
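
For readers following the register values, here is a minimal sketch that decodes the two post-resume RxConfig readings reported above, using the bit positions stated in the thread (bit 15 for RX128_INT_EN, bit 14 for RX_MULTI_EN, bits 8-10 for RX_DMA_BURST). The helper names rxcfg_describe(), rxcfg_patched_write() and rxcfg_compare() are hypothetical and are not part of the r8169 driver; the sketch runs in user space and only does the arithmetic, it does not touch hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bit positions as quoted in the thread; names follow the r8169 driver. */
#define RX128_INT_EN  (1u << 15)  /* bit 15, 0x8000 */
#define RX_MULTI_EN   (1u << 14)  /* bit 14, 0x4000 */
#define RX_DMA_BURST  (7u << 8)   /* bits 8-10, 0x700 */

/* Hypothetical helper: print the three fields discussed in the thread. */
static void rxcfg_describe(const char *label, uint32_t v)
{
    printf("%-34s 0x%08x  RX128_INT_EN=%d RX_MULTI_EN=%d RX_DMA_BURST=%u\n",
           label, (unsigned)v,
           !!(v & RX128_INT_EN), !!(v & RX_MULTI_EN),
           (unsigned)((v & RX_DMA_BURST) >> 8));
}

/* Value that the proposed "case RTL_GIGA_MAC_VER_38:" branch would write. */
static uint32_t rxcfg_patched_write(void)
{
    return RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST;
}

/* Hypothetical helper: compare the two post-resume readings from the thread. */
static void rxcfg_compare(void)
{
    const uint32_t failing = 0x0002870e; /* post resume, rtl_init_rxcfg() called */
    const uint32_t working = 0x0002400e; /* post resume, call removed */

    rxcfg_describe("post resume (failing):", failing);
    rxcfg_describe("post resume (working):", working);
    rxcfg_describe("written by proposed case:", rxcfg_patched_write());

    /* The two readings differ only in the three fields named in the thread. */
    assert((failing ^ working) ==
           (RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST));
}
```

Calling rxcfg_compare() from a small main() prints the three decoded values side by side. The point of the comparison is that the failing reading lacks RX_MULTI_EN while the working one has it, and the proposed case sets it together with the two fields that rtl_init_rxcfg() already writes.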
On 11.10.2018 10:24, Chris Clayton wrote:
> On 11/10/2018 01:12, Maciej S. Szmigiero wrote:
>> On 11.10.2018 00:49, Chris Clayton wrote:
>>>> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
>>>> writes (under the "default:" label for your NIC model).
>>>>
>>>
>>> This might be more interesting. Through a combination of viewing the output from pr_notice() and the output from
>>> "ethtool -d", I can see RxConfig with the following values
>>>
>>> During boot: 0x00028700
>>> Before suspend: 0x0002870e
>>> During resume: 0x00024000
>>> Post resume: 0x0002870e
>>>
>>> As I did with 4.18.10 early on in the process, I removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt,
>>> installed and rebooted. Now I see the following values:
>>>
>>> During boot: 0x00028700
>>> Before suspend: 0x0002870e
>>> During resume: 0x00024000
>>> Post resume: 0x0002400e
>>>
>>
>> Now we can finally see some difference...
>> Besides missing RX128_INT_EN (bit 15 or 0x8000) and RX_DMA_BURST
>> (bits 8-10 or 0x700) - that rtl_init_rxcfg() would normally set so this
>> is kind of expected - one can see that the working configuration
>> post-resume has bit 14 (or 0x4000) set, too.
>>
>> This bit is described in the driver as RX_MULTI_EN ("8111c only") and is
>> set by rtl_init_rxcfg() for example for RTL_GIGA_MAC_VER_35.
>>
>> RTL_GIGA_MAC_VER_35 is described in the driver as being in the same
>> family as your RTL_GIGA_MAC_VER_38, so can you please try the following
>> change:
>> --- r8169.c
>> +++ r8169.c
>> @@ -4271,6 +4271,7 @@ static void rtl_init_rxcfg(struct rtl816
>>          case RTL_GIGA_MAC_VER_18 ... RTL_GIGA_MAC_VER_24:
>>          case RTL_GIGA_MAC_VER_34:
>>          case RTL_GIGA_MAC_VER_35:
>> +        case RTL_GIGA_MAC_VER_38:
>>                  RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
>>                  break;
>>          case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
>>
>> This will add RX_MULTI_EN also for your chip model (you need to add back
>> the call to rtl_init_rxcfg() to rtl_hw_start(), naturally).
>>
>
> That's done the trick. With the above change applied, my network is running fine after a suspend/resume cycle and the
> ping times are back in the 14-15ms range.
Nice!
I will submit a patch; it would be great if you could test it and then
add a "Tested-by:" tag.
> Chris
Maciej
On 11/10/2018 13:23, Maciej S. Szmigiero wrote:
> On 11.10.2018 10:24, Chris Clayton wrote:
>> On 11/10/2018 01:12, Maciej S. Szmigiero wrote:
>>> On 11.10.2018 00:49, Chris Clayton wrote:
>>>>> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
>>>>> writes (under the "default:" label for your NIC model).
>>>>>
>>>>
>>>> This might be more interesting. Through a combination of viewing the output from pr_notice() and the output from
>>>> "ethtool -d", I can see RxConfig with the following values
>>>>
>>>> During boot: 0x00028700
>>>> Before suspend: 0x0002870e
>>>> During resume: 0x00024000
>>>> Post resume: 0x0002870e
>>>>
>>>> As I did with 4.18.10 early on in the process, I removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt,
>>>> installed and rebooted. Now I see the following values:
>>>>
>>>> During boot: 0x00028700
>>>> Before suspend: 0x0002870e
>>>> During resume: 0x00024000
>>>> Post resume: 0x0002400e
>>>>
>>>
>>> Now we can finally see some difference...
>>> Besides missing RX128_INT_EN (bit 15 or 0x8000) and RX_DMA_BURST
>>> (bits 8-10 or 0x700) - that rtl_init_rxcfg() would normally set so this
>>> is kind of expected - one can see that the working configuration
>>> post-resume has bit 14 (or 0x4000) set, too.
>>>
>>> This bit is described in the driver as RX_MULTI_EN ("8111c only") and is
>>> set by rtl_init_rxcfg() for example for RTL_GIGA_MAC_VER_35.
>>>
>>> RTL_GIGA_MAC_VER_35 is described in the driver as being in the same
>>> family as your RTL_GIGA_MAC_VER_38, so can you please try the following
>>> change:
>>> --- r8169.c
>>> +++ r8169.c
>>> @@ -4271,6 +4271,7 @@ static void rtl_init_rxcfg(struct rtl816
>>>          case RTL_GIGA_MAC_VER_18 ... RTL_GIGA_MAC_VER_24:
>>>          case RTL_GIGA_MAC_VER_34:
>>>          case RTL_GIGA_MAC_VER_35:
>>> +        case RTL_GIGA_MAC_VER_38:
>>>                  RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
>>>                  break;
>>>          case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
>>>
>>> This will add RX_MULTI_EN also for your chip model (you need to add back
>>> the call to rtl_init_rxcfg() to rtl_hw_start(), naturally).
>>>
>>
>> That's done the trick. With the above change applied, my network is running fine after a suspend/resume cycle and the
>> ping times are back in the 14-15ms range.
>
> Nice!
>
> I will submit a patch; it would be great if you could test it and then
> add a "Tested-by:" tag.
>
Will do, Maciej.
Thanks for solving this.
>> Chris
>
> Maciej
>