On Tuesday, November 17, 2020 7:31:29 PM CET Rafael J. Wysocki wrote:
> On 11/16/2020 8:11 AM, Andrei Popa wrote:
> > Hello,
> >
> > After an update from vmlinuz-4.15.0-106-generic to vmlinuz-5.4.0-37-generic we experience, on a number of servers, a very high number of rx_missed_errors and dropped packets only on the uplink 10G interface. We have another 10G downlink interface with no problems.
> >
> > The affected servers have the following mainboards:
> > S5520HC ver E26045-455
> > S5520UR ver E22554-751
> > S5520UR ver E22554-753
> > S5000VSA
> >
> > On 30 other servers with similar mainboards and/or configs there are no dropped packets with vmlinuz-5.4.0-37-generic.
> >
> > We’ve installed vanilla 4.16 and there were no dropped packets.
> > Vanilla 4.17 had a very high number of dropped packets like the following:
> >
> > root@shaper:~# cat test
> > #!/bin/bash
> > while true
> > do
> > ethtool -S ens6f1|grep "missed_errors"
> > ifconfig ens6f1|grep RX|grep dropped
> > sleep 1
> > done
> >
> > root@shaper:~# ./test
> > rx_missed_errors: 2418845
> > RX errors 0 dropped 2418888 overruns 0 frame 0
> > rx_missed_errors: 2426175
> > RX errors 0 dropped 2426218 overruns 0 frame 0
> > rx_missed_errors: 2431910
> > RX errors 0 dropped 2431953 overruns 0 frame 0
> > rx_missed_errors: 2437266
> > RX errors 0 dropped 2437309 overruns 0 frame 0
> > rx_missed_errors: 2443305
> > RX errors 0 dropped 2443348 overruns 0 frame 0
> > rx_missed_errors: 2448357
> > RX errors 0 dropped 2448400 overruns 0 frame 0
> > rx_missed_errors: 2452539
> > RX errors 0 dropped 2452582 overruns 0 frame 0
> >
> > We did a git bisect and found that the following commit causes the high number of dropped packets:
> >
> > Author: Rafael J. Wysocki <[email protected]>
> > Date: Thu Apr 5 19:12:43 2018 +0200
> > cpuidle: menu: Avoid selecting shallow states with stopped tick
> > If the scheduler tick has been stopped already and the governor
> > selects a shallow idle state, the CPU can spend a long time in that
> > state if the selection is based on an inaccurate prediction of idle
> > time. That effect turns out to be relevant, so it needs to be
> > mitigated.
> > To that end, modify the menu governor to discard the result of the
> > idle time prediction if the tick is stopped and the predicted idle
> > time is less than the tick period length, unless the tick timer is
> > going to expire soon.
> > Signed-off-by: Rafael J. Wysocki <[email protected]>
> > Acked-by: Peter Zijlstra (Intel) <[email protected]>
> > diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> > index 267982e471e0..1bfe03ceb236 100644
> > --- a/drivers/cpuidle/governors/menu.c
> > +++ b/drivers/cpuidle/governors/menu.c
> > @@ -352,13 +352,28 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> >  	 */
> >  	data->predicted_us = min(data->predicted_us, expected_interval);
> >
> > -	/*
> > -	 * Use the performance multiplier and the user-configurable
> > -	 * latency_req to determine the maximum exit latency.
> > -	 */
> > -	interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
> > -	if (latency_req > interactivity_req)
> > -		latency_req = interactivity_req;
>
> The tick_nohz_tick_stopped() check may be done after the above and it
> may be reworked a bit.
>
> I'll send a test patch to you shortly.
The patch is appended, but please note that it has been rebased by hand and
not tested.

Please let me know if it makes any difference.

And in the future please avoid pasting the entire kernel config to your
reports; that's problematic.
---
 drivers/cpuidle/governors/menu.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -308,18 +308,18 @@ static int menu_select(struct cpuidle_dr
 				get_typical_interval(data, predicted_us)) *
 				NSEC_PER_USEC;
 
-	if (tick_nohz_tick_stopped()) {
-		/*
-		 * If the tick is already stopped, the cost of possible short
-		 * idle duration misprediction is much higher, because the CPU
-		 * may be stuck in a shallow idle state for a long time as a
-		 * result of it. In that case say we might mispredict and use
-		 * the known time till the closest timer event for the idle
-		 * state selection.
-		 */
-		if (data->predicted_us < TICK_USEC)
-			data->predicted_us = min_t(unsigned int, TICK_USEC,
-						   ktime_to_us(delta_next));
+	/*
+	 * If the tick is already stopped, the cost of possible short idle
+	 * duration misprediction is much higher, because the CPU may be stuck
+	 * in a shallow idle state for a long time as a result of it. In that
+	 * case, say we might mispredict and use the known time till the closest
+	 * timer event for the idle state selection, unless that event is going
+	 * to occur within the tick time frame (in which case the CPU will be
+	 * woken up from whatever idle state it gets into soon enough anyway).
+	 */
+	if (tick_nohz_tick_stopped() && data->predicted_us < TICK_USEC &&
+	    delta_next >= TICK_NSEC) {
+		data->predicted_us = ktime_to_us(delta_next);
 	} else {
 		/*
 		 * Use the performance multiplier and the user-configurable
Hi,

On which kernel version should I try the patch? I tried it on 5.9 and it doesn't build.
> On 18 Nov 2020, at 20:47, Rafael J. Wysocki <[email protected]> wrote:
Hi,

I’ve applied your patch on kernel 4.17.0 and dropped packets and rx_missed_errors are still present, though they are increasing at a lower rate.
root@shaper:~# ./test
rx_missed_errors: 2135
RX errors 0 dropped 2155 overruns 0 frame 0
sleeping 60 seconds
rx_missed_errors: 2433
RX errors 0 dropped 2459 overruns 0 frame 0
sleeping 60 seconds
rx_missed_errors: 2433
RX errors 0 dropped 2465 overruns 0 frame 0
sleeping 60 seconds
rx_missed_errors: 2526
RX errors 0 dropped 2564 overruns 0 frame 0
sleeping 60 seconds
> On 3 Dec 2020, at 21:43, Andrei Popa <[email protected]> wrote:
>
> Hi,
>
> On which kernel version should I try the patch? I tried it on 5.9 and it doesn't build.