2023-09-12 08:04:48

by Hao Jia

Subject: Re: [External] Re: [PATCH 0/2] Fix nohz_full vs rt bandwidth



On 2023/9/11 Phil Auld wrote:
>
> Hi Hao,
>
> On Mon, Sep 11, 2023 at 11:39:02AM +0800 Hao Jia wrote:
>> On 2023/9/8 Phil Auld wrote:
>>> On Fri, Sep 08, 2023 at 10:57:26AM +0800 Hao Jia wrote:
>>>> On 2023/9/7 Phil Auld wrote:
>>>>> Hi Hao,
>
> ...
>
>>>>>
>>>>> Are you actually hitting this in the real world?
>>>>>
>>>>> We, for example, no longer enable RT_GROUP_SCHED so this is a non-issue
>>>>> for our use cases. I'd recommend considering that. (Does it even
>>>>> work with cgroup2?)
>>>>>
>>>>
>>>> Yes, it has always been there. Regardless of whether RT_GROUP_SCHED is
>>>> enabled or not, rt bandwidth is always enabled. If RT_GROUP_SCHED is not
>>>> enabled, all rt tasks in the system form a single group with rt_runtime
>>>> 950000 and rt_period 1000000. So rt bandwidth is always enabled by default.
>>>
>>> Sure, there is that. But I think Daniel is actively trying to remove it.
>>>
>>
>> Thank you for your reply. Maybe I'm missing something. Can you give me some
>> links to discussions about it?
>>
>
> Sure, try this one:
> https://lore.kernel.org/lkml/[email protected]/
>

Thanks for the information you shared.

>
>>> Also I'm not sure you answered my question. Are you actually hitting this
>>> in the real world? I'd be tempted to think this is a mis-configuration or
>>> mis-use of RT. Plus you can disable that throttling and use stalld to catch
>>> cases where the rt task goes out of control.
>>>
>>
>>> Are you actually hitting this in the real world?
>>
>> I tested on my machine using default settings (rt_runtime is 950000, and
>> rt_period is 1000000.). The rt task is supposed to be throttled after
>> running for 0.95 seconds, but due to the influence of NO_HZ_FULL, it may be
>> throttled after running for about 1.4 seconds. This will only cause the
>> rt_bandwidth throttle to be delayed, but no warning will be triggered.
>
> Yes, you can hit this in testing. I'm asking if it's causing your real-world
> application issues or is this just a theoretical problem you can contrive a
> test for? Are you actually hitting this when running your workload?
> From what you are showing (a test setup) I'm guessing no.
>

Right, I don't see this issue in our production environment. The number of
rt tasks is very small in our production environment, and their running
time is very short, so the rt_bandwidth throttle will not be triggered
unless the rt task goes out of control.
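
For reference, the numbers discussed in this thread can be checked with a
small Python sketch. It assumes the standard sysctl files
/proc/sys/kernel/sched_rt_runtime_us and /proc/sys/kernel/sched_rt_period_us;
the helper names (rt_budget, throttle_delay, read_sysctl) are hypothetical
and only for illustration, and the 1.4 s figure is the approximate value
from my test, not a guaranteed bound.

```python
from pathlib import Path

# Directory holding the global rt bandwidth sysctls on Linux.
SYSCTL_DIR = Path("/proc/sys/kernel")


def rt_budget(runtime_us, period_us):
    """Return the fraction of each period that rt tasks may run.

    A runtime of -1 means rt throttling is disabled, so the share is 1.0.
    """
    if period_us <= 0:
        raise ValueError("period must be positive")
    if runtime_us < 0:
        return 1.0
    return runtime_us / period_us


def throttle_delay(observed_s, runtime_us):
    """Seconds by which the observed throttle point exceeds the budget."""
    return observed_s - runtime_us / 1_000_000


def read_sysctl(name, base=SYSCTL_DIR):
    """Read one integer sysctl value (hypothetical helper, Linux only)."""
    return int((Path(base) / name).read_text().strip())


# Example on a Linux machine:
#   runtime = read_sysctl("sched_rt_runtime_us")
#   period = read_sysctl("sched_rt_period_us")
#   print(rt_budget(runtime, period))
```

With the defaults, rt_budget(950000, 1000000) gives a 95% share, and
throttle_delay(1.4, 950000) gives roughly 0.45 seconds of extra rt run time
before the throttle kicks in under NO_HZ_FULL in my test.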

Thanks,
Hao

>>
>>
>>> Plus you can disable that throttling and use stalld to catch cases where
>>> the rt task goes out of control.
>>
>> IIRC, if we disable rt_bandwidth, the rt task is always running, which may
>> cause cfs task starvation and hung_task warnings. This may be the reason why
>> rt_bandwidth is enabled by default (rt_runtime is 950000, and rt_period is
>> 1000000).
>
> That's what stalld is for. Some rt applications don't like giving up 5% of
> the cpu time when they don't really need to.
>
>
> Cheers,
> Phil
>
>