2023-02-03 05:16:32

by Chen Yu

Subject: [PATCH v5 0/2] sched/fair: Wake short task on current CPU

The main purpose is to avoid unnecessary cross-CPU wakeups. Frequent
cross-CPU wakeups hurt some workloads significantly, especially on
high core count systems.

Inhibit the cross-CPU wakeup by placing the wakee on the waking CPU,
if both the waker and the wakee are short-duration tasks. A
short-duration task could become a troublemaker on a high-load system,
because it can cause frequent context switches. So this strategy only
takes effect when the system is busy. Besides, it is unreasonable to
inhibit the idle CPU scan when there are still idle CPUs.

The first patch introduces the definition of a short-duration task.
The second patch leverages it to choose a local CPU for the wakee.

Overall there is a significant performance improvement on an Intel
2 x 56C/112T platform, such as will-it-scale (1200+%) and
netperf (600+%) in some cases, and no noticeable impact on
schbench, hackbench, tbench and an OLTP workload with a commercial RDBMS.

Test results on other platforms, such as Zen3 and Kunpeng Arm64, are
welcome. Prateek and Yicong, it would be appreciated if you could give
this version a try.

Changes since v4:
1. Dietmar has commented on the task duration calculation. So refined
the commit log to reduce confusion.
2. Change [PATCH 1/2] to only record the average duration of a task.
So this change could benefit UTIL_EST_FASTER[1].
3. As v4 reported regression on Zen3 and Kunpeng Arm64, add back
the system average utilization restriction that, if the system
is not busy, do not enable the short wake up. Above logic has
shown improvement on Zen3[2].
4. Restrict the wakeup target to be current CPU, rather than both
current CPU and task's previous CPU. This could also benefit
wakeup optimization from interrupt in the future, which is
suggested by Yicong.

Changes since v3:
1. Honglei and Josh have concern that the threshold of short
task duration could be too long. Decreased the threshold from
sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8),
and the '8' comes from get_update_sysctl_factor().
2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion.
3. Move the calculation of average duration from put_prev_task_fair()
to dequeue_task_fair(). Because there is an issue in v3 that,
put_prev_task_fair() will not be invoked by pick_next_task_fair()
in fast path, thus the dur_avg could not be updated timely.
4. Fix the comment in PATCH 2/2, that "WRITE_ONCE(CPU1->ttwu_pending, 1);"
on CPU0 is earlier than CPU1 getting "ttwu_list->p0", per Tianchen.
5. Move the scan for CPU with short duration task from select_idle_cpu()
to select_idle_sibling(), because there is no CPU scan involved, per
Yicong.

Changes since v2:

1. Peter suggested comparing the duration of waker and the cost to
scan for an idle CPU: If the cost is higher than the task duration,
do not waste time finding an idle CPU, choose the local or previous
CPU directly. A prototype was created based on this suggestion.
However, according to the test result, this prototype did not inhibit
the cross-CPU wakeup and did not bring improvement, because the cost
to find an idle CPU is small in the problematic scenario. The root
cause of the problem is a race condition between scanning for an idle
CPU and task enqueue(please refer to the commit log in PATCH 2/2).
So v3 does not change the core logic of v2, with some refinement based
on Peter's suggestion.

2. Simplify the logic to record the task duration per Peter and Abel's suggestion.


[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/

v4: https://lore.kernel.org/lkml/[email protected]/
v3: https://lore.kernel.org/lkml/[email protected]/
v2: https://lore.kernel.org/all/[email protected]/
v1: https://lore.kernel.org/lkml/[email protected]/

Chen Yu (2):
sched/fair: Record the average duration of a task
sched/fair: Introduce SIS_SHORT to wake up short task on current CPU

include/linux/sched.h | 3 +++
kernel/sched/core.c | 2 ++
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
5 files changed, 46 insertions(+)

--
2.25.1



2023-02-03 05:16:47

by Chen Yu

Subject: [PATCH v5 1/2] sched/fair: Record the average duration of a task

Record the average duration of a task, as there is a requirement
to leverage this information for better task placement.

At first glance, (p->se.sum_exec_runtime / p->nvcsw) could be used
to measure the task duration. However, long-past history is weighted
too heavily in such a formula. Ideally, old activity should decay
and not affect the current status too much.
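
A minimal sketch of such a decaying average, assuming the 1/8 weighting
that the scheduler's update_avg() helper uses today (the helper itself
is reused by this patch, see the diff below):

	static void ewma_update(u64 *avg, u64 sample)
	{
		s64 diff = sample - *avg;

		/* avg <- 7/8 * avg + 1/8 * sample: old samples decay geometrically */
		*avg += diff / 8;
	}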

Although something based on PELT could be used, se.util_avg might
not be appropriate to describe the task duration: if task p1 and
task p2 are doing frequent ping-pong scheduling on one CPU, both p1
and p2 have a short duration, but the util_avg of each can be up to
50%, which is inconsistent with the task duration.

It was found that there was once a similar feature to track the
duration of a task:
commit ad4b78bbcbab ("sched: Add new wakeup preemption mode: WAKEUP_RUNNING")
Unfortunately, it was reverted because it was an experiment. Pick the
patch up again, by recording the average duration when a task voluntarily
switches out.

For example, suppose task p1 and task p2 run alternately on CPU1:

                                 --------------------> time

 | p1 runs 1ms | p2 preempt p1 | p1 switch in, runs 0.5ms and blocks |
 ^             ^               ^
 |_____________|               |_____________________________________|
                                                                      ^
                                                                      |
                                                                 p1 dequeued

p1's duration in one section is (1 + 0.5) ms, because if p2 had not
preempted p1, p1 could have run for 1.5 ms. This reflects the nature
of a task: how long it wishes to run at most.
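
With the accounting below, this is exactly what gets recorded:
p->se.sum_exec_runtime only advances while p1 runs, so at the dequeue
point dur = sum_exec_runtime - prev_sleep_sum_runtime = 1 ms + 0.5 ms
= 1.5 ms (the preemption in the middle does not reset the measurement),
and prev_sleep_sum_runtime is then advanced for the next section.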

Suggested-by: Tim Chen <[email protected]>
Suggested-by: Vincent Guittot <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/sched/core.c | 2 ++
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 13 +++++++++++++
4 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4df2b3e76b30..e21709402a31 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -557,6 +557,9 @@ struct sched_entity {
 	u64				prev_sum_exec_runtime;
 
 	u64				nr_migrations;
+	u64				prev_sleep_sum_runtime;
+	/* average duration of a task */
+	u64				dur_avg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int				depth;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 03b8529db73f..b805c5bdc7ff 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4379,6 +4379,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
+	p->se.dur_avg			= 0;
+	p->se.prev_sleep_sum_runtime	= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..8d64fba16cfe 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1024,6 +1024,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 	__PS("nr_involuntary_switches", p->nivcsw);
 
 	P(se.load.weight);
+	P(se.dur_avg);
 #ifdef CONFIG_SMP
 	P(se.avg.load_sum);
 	P(se.avg.runnable_sum);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d4db72f8f84e..aa16611c7263 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6271,6 +6271,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 static void set_next_buddy(struct sched_entity *se);
 
+static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
+{
+	u64 dur;
+
+	if (!task_sleep)
+		return;
+
+	dur = p->se.sum_exec_runtime - p->se.prev_sleep_sum_runtime;
+	p->se.prev_sleep_sum_runtime = p->se.sum_exec_runtime;
+	update_avg(&p->se.dur_avg, dur);
+}
+
 /*
  * The dequeue_task method is called before nr_running is
  * decreased. We remove the task from the rbtree and
@@ -6343,6 +6355,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
+	dur_avg_update(p, task_sleep);
 	hrtick_update(rq);
 }

--
2.25.1


2023-02-03 05:17:11

by Chen Yu

Subject: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU

[Problem Statement]
For a workload that is doing frequent context switches, the throughput
scales well until the number of instances reaches a peak point. After
that peak point, the throughput drops significantly if the number of
instances continues to increase.

The will-it-scale context_switch1 test case exposes the issue. The
test platform has 112 CPUs per LLC domain. The will-it-scale test
launches 1, 8, 16 ... 112 instances respectively. Each instance is
composed of 2 tasks, and each pair of tasks does ping-pong scheduling
via pipe_read() and pipe_write(). No task is bound to any CPU. It is
found that, once the number of instances is higher than 56 (112 tasks
in total, so every CPU has 1 task), the throughput drops accordingly
if the instance number continues to increase:

           ^
 throughput|
           |                    X
           |                  X   X  X
           |                X          X  X
           |              X                  X
           |            X                        X
           |          X
           |        X
           |      X
           |    X
           |
           +--------------------.---------------------------->
                                56
                                           number of instances

[Symptom analysis]

The performance degradation was caused by a high system idle
percentage (around 20% ~ 30%): the CPUs waste a lot of time being
idle and doing nothing. As a comparison, if CPU affinity is set for
these workloads so that they stop migrating among CPUs, the idle
percentage drops to nearly 0% and the throughput increases a lot.
This indicates that there is room for optimization.

The cause is the race condition between select_task_rq() and
the task enqueue.

Suppose there are nr_cpus pairs of ping-pong scheduling
tasks. For example, p0' and p0 are a ping-pong pair, and so are
p1' <=> p1 and p2' <=> p2. None of these tasks are bound to any CPUs.
The problem can be summarized as: more than one waker is stacked on
one CPU, which slows down the wakeup of their wakees:

CPU0                                    CPU1                                    CPU2

p0'                                     p1' => idle                             p2'

try_to_wake_up(p0)                                                              try_to_wake_up(p2);
CPU1 = select_task_rq(p0);                                                      CPU1 = select_task_rq(p2);
ttwu_queue(p0, CPU1);                                                           ttwu_queue(p2, CPU1);
  __ttwu_queue_wakelist(p0, CPU1);
    WRITE_ONCE(CPU1->ttwu_pending, 1);
    __smp_call_single_queue(CPU1, p0);  => ttwu_list->p0
                                        quitting cpuidle_idle_call()
                                                                                __ttwu_queue_wakelist(p2, CPU1);
                                                                                  WRITE_ONCE(CPU1->ttwu_pending, 1);
                                        ttwu_list->p2->p0 <=                      __smp_call_single_queue(CPU1, p2);

p0' => idle
                                        sched_ttwu_pending()
                                          enqueue_task(p2 and p0)

                                        idle => p2

                                        ...
                                        p2 time slice expires
                                        ...
                                                                                !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                                                                       <===     !!! p2 delays the wake up of p0' !!!
                                                                                !!! causes long idle on CPU0     !!!
                                        p2 => p0                                !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                                        p0 wakes up p0'

idle => p0'

Since there are many waker/wakee pairs in the system, the chain reaction
causes many CPUs to be victims. These idle CPUs wait for their waker to
be scheduled.

Tianchen has mentioned the above issue here[1].

[Proposal]
The root cause is that there is no strict synchronization between
select_task_rq() and the setting of the ttwu_pending flag among
several CPUs. And this might be by design, because the scheduler
prefers parallel wakeups.

Avoid this problem indirectly. If the system is busy, and if the waker
and the wakee are both short-duration tasks, wake up the wakee on the
current CPU.

The reason is that, if the waker is a short-duration task, it might
relinquish the CPU soon, so the wakee gets a chance to be scheduled.
On the other hand, if the wakee is a short-duration task, putting it on
a non-idle CPU brings minimal impact to the running task. The mechanism
also requires that there is no idle core in the system, so that it does
not inhibit spreading the tasks when the system is not busy.

This wakeup strategy can be viewed as a dynamic WF_SYNC, except that
WF_SYNC does not treat a non-idle CPU as a candidate CPU.
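
Condensing the main condition added to select_idle_cpu() in this
patch, the placement decision is roughly the following (a sketch with
made-up function and parameter names, not the literal kernel code):

	static bool place_wakee_on_waking_cpu(struct task_struct *waker,
					      struct task_struct *wakee,
					      bool has_idle_core,
					      int nr_scan, int llc_weight,
					      unsigned int waker_nr_running)
	{
		if (has_idle_core)			/* idle cores exist: keep spreading */
			return false;
		if (5 * nr_scan >= 3 * llc_weight)	/* LLC is less than ~50% utilized */
			return false;
		if (waker_nr_running > 1)		/* waking CPU is already stacked */
			return false;
		/* both the waker and the wakee must be short-duration tasks */
		return is_short_task(waker) && is_short_task(wakee);
	}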

[Benchmark results]
The baseline is v6.2-rc1 tip:sched/core, on top of
Commit be06c7d02443a ("cpuidle: Fix poll_idle() noinstr annotation").
The test platform has 2 x 56C/112T and 224 CPUs in total. C-states
deeper than C1E are disabled. Turbo is disabled. CPU frequency governor
is performance.

will-it-scale
=============
case load baseline compare%
context_switch1 224 groups 1.00 +1262.06%

There is a huge improvement in the fast context switch test case,
especially when the number of groups equals the number of CPUs.

netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR 56-threads 1.00 ( 0.86) -0.16 ( 0.98)
TCP_RR 112-threads 1.00 ( 0.65) +0.24 ( 0.50)
TCP_RR 168-threads 1.00 ( 5.87) +4.99 ( 4.81)
TCP_RR 224-threads 1.00 ( 4.63) +687.45 ( 3.77)
TCP_RR 280-threads 1.00 ( 9.91) +0.39 ( 13.05)
TCP_RR 336-threads 1.00 ( 21.27) +0.08 ( 15.32)
TCP_RR 392-threads 1.00 ( 40.60) +20.30 ( 30.95)
TCP_RR 448-threads 1.00 ( 29.85) +0.05 ( 33.18)
UDP_RR 56-threads 1.00 ( 2.46) -0.19 ( 2.50)
UDP_RR 112-threads 1.00 ( 12.59) +0.00 ( 12.31)
UDP_RR 168-threads 1.00 ( 18.55) +6.33 ( 62.39)
UDP_RR 224-threads 1.00 ( 13.31) +131.74 ( 22.13)
UDP_RR 280-threads 1.00 ( 32.69) -0.54 ( 23.84)
UDP_RR 336-threads 1.00 ( 28.52) +0.14 ( 26.45)
UDP_RR 392-threads 1.00 ( 26.80) +0.23 ( 32.52)
UDP_RR 448-threads 1.00 ( 41.65) -0.20 ( 42.97)

There is a significant 600+% improvement for TCP_RR and 100+% for
UDP_RR when the number of threads equals the number of CPUs.

tbench
======
case load baseline(std%) compare%( std%)
loopback 56-threads 1.00 ( 1.05) +0.56 ( 0.48)
loopback 112-threads 1.00 ( 1.04) +0.83 ( 0.65)
loopback 168-threads 1.00 ( 29.60) +37.37 ( 33.77)
loopback 224-threads 1.00 ( 0.22) +0.11 ( 0.02)
loopback 280-threads 1.00 ( 0.04) -0.11 ( 0.06)
loopback 336-threads 1.00 ( 0.10) -0.11 ( 0.11)
loopback 392-threads 1.00 ( 0.42) -0.06 ( 0.05)
loopback 448-threads 1.00 ( 0.08) +0.19 ( 0.07)

There is no noticeable impact on tbench. There is run-to-run variance
in the 168-threads case, with or without this patch applied; it might
be another issue and needs to be investigated later.

hackbench
=========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 11.63) +10.24 ( 17.85)
process-sockets 1-groups 1.00 ( 15.36) -17.58 ( 20.96)
threads-pipe 1-groups 1.00 ( 2.78) +2.86 ( 4.14)
threads-sockets 1-groups 1.00 ( 1.44) -0.57 ( 1.09)
process-pipe 2-groups 1.00 ( 4.93) -3.48 ( 8.04)
process-sockets 2-groups 1.00 ( 2.44) -3.76 ( 2.34)
threads-pipe 2-groups 1.00 ( 2.26) +4.36 ( 1.77)
threads-sockets 2-groups 1.00 ( 2.50) +1.86 ( 4.46)
process-pipe 4-groups 1.00 ( 1.97) +13.06 ( 7.60)
process-sockets 4-groups 1.00 ( 0.11) -0.57 ( 0.66)
threads-pipe 4-groups 1.00 ( 2.48) +1.90 ( 3.81)
threads-sockets 4-groups 1.00 ( 2.41) +0.28 ( 1.78)
process-pipe 8-groups 1.00 ( 1.45) +1.62 ( 0.13)
process-sockets 8-groups 1.00 ( 0.21) +0.05 ( 0.33)
threads-pipe 8-groups 1.00 ( 0.36) -1.19 ( 0.85)
threads-sockets 8-groups 1.00 ( 0.39) +1.03 ( 0.33)

Overall there is no noticeable impact on hackbench. There is a large
run-to-run variance when the load is low, with or without this patch
applied. Similar to tbench, this issue needs to be investigated too.

schbench
========
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 0.53) -0.22 ( 1.33)
normal 2-mthreads 1.00 ( 3.04) -3.53 ( 4.88)
normal 4-mthreads 1.00 ( 2.28) +1.73 ( 1.66)
normal 8-mthreads 1.00 ( 2.18) -1.75 ( 1.42)

There should be no impact on schbench in theory, because the default task
duration of schbench is 30 ms, which is much longer than the short task
threshold.

[Limitations/Miscellaneous]

[a]
Peter has suggested[2] comparing the task duration with the cost of
searching for an idle CPU: if the latter is higher, then give up the
scan, to achieve better task affinity. However, this method does not
fit the case encountered by this patch. Because there are plenty of
(fast-)idle CPUs in the system, it does not take too long to find an
idle CPU. The bottleneck is caused by the race condition mentioned
above.

[b]
The short task threshold is sysctl_sched_min_granularity / 8.
According to get_update_sysctl_factor(), the sysctl_sched_min_granularity
could be 0.75 msec * 4 for SCHED_TUNABLESCALING_LOG,
or 0.75 msec * ncpus for SCHED_TUNABLESCALING_LINEAR.
Choosing 8 as the divisor is a trade-off. Thanks Honglei for pointing
this out.
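
For example, with the SCHED_TUNABLESCALING_LOG factor of 4 mentioned
above, sysctl_sched_min_granularity is 0.75 msec * 4 = 3 msec, so a
task is treated as short when its dur_avg is below 3 msec / 8 =
0.375 msec. With SCHED_TUNABLESCALING_LINEAR on a 224-CPU system the
threshold would instead be 0.75 msec * 224 / 8 = 21 msec.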

[c]
SIS_SHORT leverages SIS_UTIL to do better task placement. If the scan
number suggested by SIS_UTIL is smaller than 60% of llc_weight, it
indicates that the util_avg% of the LLC domain is higher than 50%.
A 50% util_avg indicates a half-busy LLC domain, which serves as a
double confirmation together with !has_idle_core, so as not to stack
tasks while the system still has idle CPUs. A system busier than this
could lower its bar to choose a compromised "idle" CPU.
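
(The "smaller than 60% of llc_weight" check is written with integer
arithmetic in the hunk below: 5 * nr < 3 * sd->span_weight is the same
as nr < 0.6 * llc_weight.)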

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/Y2O8a%[email protected]/

Suggested-by: Tim Chen <[email protected]>
Suggested-by: K Prateek Nayak <[email protected]>
Tested-by: kernel test robot <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
kernel/sched/fair.c | 26 ++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
2 files changed, 27 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa16611c7263..d50097e5fcc1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6489,6 +6489,20 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }
 
+/*
+ * If a task switches in and then voluntarily relinquishes the
+ * CPU quickly, it is regarded as a short duration task.
+ *
+ * SIS_SHORT tries to wake up the short wakee on current CPU. This
+ * aims to avoid race condition among CPUs due to frequent context
+ * switch.
+ */
+static inline int is_short_task(struct task_struct *p)
+{
+	return sched_feat(SIS_SHORT) && p->se.dur_avg &&
+	       ((p->se.dur_avg * 8) < sysctl_sched_min_granularity);
+}
+
 /*
  * The purpose of wake_affine() is to quickly determine on which CPU we can run
  * soonest. For the purpose of speed we only consider the waking and previous
@@ -6525,6 +6539,11 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	if (available_idle_cpu(prev_cpu))
 		return prev_cpu;
 
+	/* The only running task is a short duration one. */
+	if (cpu_rq(this_cpu)->nr_running == 1 &&
+	    is_short_task(rcu_dereference(cpu_curr(this_cpu))))
+		return this_cpu;
+
 	return nr_cpumask_bits;
 }
 
@@ -6899,6 +6918,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 			/* overloaded LLC is unlikely to have idle cpu/core */
 			if (nr == 1)
 				return -1;
+
+			if (!has_idle_core && this == target &&
+			    (5 * nr < 3 * sd->span_weight) &&
+			    cpu_rq(target)->nr_running <= 1 &&
+			    is_short_task(p) &&
+			    is_short_task(rcu_dereference(cpu_curr(target))))
+				return target;
 		}
 	}

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..efdc29c42161 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_SHORT, true)

/*
* Issue a WARN when we do multiple update_rq_clock() calls
--
2.25.1


2023-02-16 12:56:09

by Abel Wu

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU

Hi Chen,

I've tested this patchset (with modification) on our Redis proxy
servers, and the results seem promising.

On 2/3/23 1:18 PM, Chen Yu wrote:
> ...
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index aa16611c7263..d50097e5fcc1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6489,6 +6489,20 @@ static int wake_wide(struct task_struct *p)
> return 1;
> }
>
> +/*
> + * If a task switches in and then voluntarily relinquishes the
> + * CPU quickly, it is regarded as a short duration task.
> + *
> + * SIS_SHORT tries to wake up the short wakee on current CPU. This
> + * aims to avoid race condition among CPUs due to frequent context
> + * switch.
> + */
> +static inline int is_short_task(struct task_struct *p)
> +{
> + return sched_feat(SIS_SHORT) && p->se.dur_avg &&
> + ((p->se.dur_avg * 8) < sysctl_sched_min_granularity);
> +}

I changed the factor to fit into the shape of tasks in question.

    static inline int is_short_task(struct task_struct *p)
    {
        u64 dur = sysctl_sched_min_granularity / 8;

        if (!sched_feat(SIS_SHORT) || !p->se.dur_avg)
            return false;

        /*
         * Bare tracepoint to allow dynamically changing
         * the threshold.
         */
        trace_sched_short_task_tp(p, &dur);

        return p->se.dur_avg < dur;
    }

I'm not sure it is the right way to provide such flexibility, but
definition of 'short' can be workload specific.

> +
> /*
> * The purpose of wake_affine() is to quickly determine on which CPU we can run
> * soonest. For the purpose of speed we only consider the waking and previous
> @@ -6525,6 +6539,11 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> if (available_idle_cpu(prev_cpu))
> return prev_cpu;
>
> + /* The only running task is a short duration one. */
> + if (cpu_rq(this_cpu)->nr_running == 1 &&
> + is_short_task(rcu_dereference(cpu_curr(this_cpu))))
> + return this_cpu;

Since the proxy server handles simple data delivery, the tasks are
generally short running ones and hate task stacking, which may
introduce scheduling latency (even when there are only 2 short tasks
competing with each other). So this part brings a slight regression on
the proxy case. But I still think this is good for most cases.

Speaking of task stacking, I found wake_affine_weight() can be
much more dangerous. It chooses the less loaded one between the
prev & this cpu as a candidate, so 'small' tasks can be easily
stacked on this cpu when waking up several tasks at one time if
this cpu is unloaded. This really hurts if the 'small' tasks are
latency-sensitive, although wake_affine_weight() does the right
thing from the point of view of 'load'.

The following change greatly reduced the p99lat of Redis service
from 150ms to 0.9ms, at exactly the same throughput (QPS).

@@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	s64 this_eff_load, prev_eff_load;
 	unsigned long task_load;
 
+	if (is_short_task(p))
+		return nr_cpumask_bits;
+
 	this_eff_load = cpu_load(cpu_rq(this_cpu));
 
 	if (sync) {

I know that 'short' tasks are not necessarily 'small' tasks, e.g.
their sleeping duration is small or they have large weights, but this
works really well for this case. This is partly because delivering
data is memory bandwidth intensive and hence prefers cache hot cpus.
And I think this is also applicable for general purposes: do NOT let
the short running tasks suffer from cache misses caused by migration.

Best regards,
Abel

2023-02-16 15:27:57

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU

Hi Abel,
On 2023-02-16 at 20:55:29 +0800, Abel Wu wrote:
> Hi Chen,
>
> I've tested this patchset (with modification) on our Redis proxy
> servers, and the results seems promising.
>
> On 2/3/23 1:18 PM, Chen Yu wrote:
> > ...
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index aa16611c7263..d50097e5fcc1 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6489,6 +6489,20 @@ static int wake_wide(struct task_struct *p)
> > return 1;
> > }
> > +/*
> > + * If a task switches in and then voluntarily relinquishes the
> > + * CPU quickly, it is regarded as a short duration task.
> > + *
> > + * SIS_SHORT tries to wake up the short wakee on current CPU. This
> > + * aims to avoid race condition among CPUs due to frequent context
> > + * switch.
> > + */
> > +static inline int is_short_task(struct task_struct *p)
> > +{
> > + return sched_feat(SIS_SHORT) && p->se.dur_avg &&
> > + ((p->se.dur_avg * 8) < sysctl_sched_min_granularity);
> > +}
>
> I changed the factor to fit into the shape of tasks in question.
>
> static inline int is_short_task(struct task_struct *p)
> {
> u64 dur = sysctl_sched_min_granularity / 8;
>
> if (!sched_feat(SIS_SHORT) || !p->se.dur_avg)
> return false;
>
> /*
> * Bare tracepoint to allow dynamically changing
> * the threshold.
> */
> trace_sched_short_task_tp(p, &dur);
>
> return p->se.dur_avg < dur;
> }
>
> I'm not sure it is the right way to provide such flexibility, but
> definition of 'short' can be workload specific.
>
I'm not sure either.
> > +
> > /*
> > * The purpose of wake_affine() is to quickly determine on which CPU we can run
> > * soonest. For the purpose of speed we only consider the waking and previous
> > @@ -6525,6 +6539,11 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> > if (available_idle_cpu(prev_cpu))
> > return prev_cpu;
> > + /* The only running task is a short duration one. */
> > + if (cpu_rq(this_cpu)->nr_running == 1 &&
> > + is_short_task(rcu_dereference(cpu_curr(this_cpu))))
> > + return this_cpu;
>
> Since proxy server handles simple data delivery, the tasks are
> generally short running ones and hate task stacking which may
> introduce scheduling latency (even there are only 2 short tasks
> competing each other). So this part brings slight regression on
> the proxy case. But I still think this is good for most cases.
>
> Speaking of task stacking, I found wake_affine_weight() can be
> much more dangerous. It chooses the less loaded one between the
> prev & this cpu as a candidate, so 'small' tasks can be easily
> stacked on this cpu when wake up several tasks at one time if
> this cpu is unloaded. This really hurts if the 'small' tasks are
> latency-sensitive, although wake_affine_weight() does the right
> thing from the point of view of 'load'.
>
> The following change greatly reduced the p99lat of Redis service
> from 150ms to 0.9ms, at exactly the same throughput (QPS).
>
> @@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd, struct
> task_struct *p,
> s64 this_eff_load, prev_eff_load;
> unsigned long task_load;
>
> + if (is_short_task(p))
> + return nr_cpumask_bits;
> +
So above change wants to wake up the short task on its previous
CPU if I understand correctly.
> this_eff_load = cpu_load(cpu_rq(this_cpu));
>
> if (sync) {
>
> I know that 'short' tasks are not necessarily 'small' tasks, e.g.
> sleeping duration is small or have large weights, but this works
> really well for this case. This is partly because delivering data
> is memory bandwidth intensive hence prefer cache hot cpus. And I
> think this is also applicable to the general purposes: do NOT let
> the short running tasks suffering from cache misses caused by
> migration.
>
I see. My original thought was to mitigate short task migration
as much as possible. Waking up the task on either the current CPU or
the previous CPU should achieve that goal in theory. Could you please
describe a little more about how the Redis proxy server was tested?
Was it tested locally or using multiple machines? I ask this because,
for network benchmarks, it might be better to wake the task close to
the waker (maybe the NIC interrupt) due to the hot network buffer.
Anyway I will test a slightly adjusted version of your change to see
the impact, and also Redis. But it would be even better if you could
provide some simple test steps I can try locally : )

thanks,
Chenyu
> Best regards,
> Abel

2023-02-17 02:45:08

by Abel Wu

Subject: Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU

On 2/16/23 11:24 PM, Chen Yu wrote:
>> The following change greatly reduced the p99lat of Redis service
>> from 150ms to 0.9ms, at exactly the same throughput (QPS).
>>
>> @@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd, struct
>> task_struct *p,
>> s64 this_eff_load, prev_eff_load;
>> unsigned long task_load;
>>
>> + if (is_short_task(p))
>> + return nr_cpumask_bits;
>> +
> So above change wants to wake up the short task on its previous
> CPU if I understand correctly.

Yes.

>> this_eff_load = cpu_load(cpu_rq(this_cpu));
>>
>> if (sync) {
>>
>> I know that 'short' tasks are not necessarily 'small' tasks, e.g.
>> sleeping duration is small or have large weights, but this works
>> really well for this case. This is partly because delivering data
>> is memory bandwidth intensive hence prefer cache hot cpus. And I
>> think this is also applicable to the general purposes: do NOT let
>> the short running tasks suffering from cache misses caused by
>> migration.
>>
> I see. My original thought was to mitigate short task migration
> as much as possible. Either waking up the task on current CPU or previous
> CPU should both achieve the goal in theory. Could you please describe
> a little more about how Redis proxy server was tested? Was it tested
> locally or using multiple machines? I asked this because for network
> benchmarks, it might be better to wake the task close to the waker(maybe
> the NIC interrupt) due to hot network buffer. Anyway I will test
> your change slightly changed to see the impact, and also Redis. But it
> would be even better if you could provide some simple test steps I can
> try locally : )

Sorry for missing the info. The test was done in a production
environment, and what I have done is only update the kernel on several
machines which are highly loaded, that is, over 85% cpu util observed
by mpstat. Please let me know if you want any specific info.

Best,
Abel

2023-02-17 08:37:50

by Honglei Wang

Subject: Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU



On 2023/2/16 20:55, Abel Wu wrote:
> Hi Chen,
>
> I've tested this patchset (with modification) on our Redis proxy
> servers, and the results seems promising.
>
> On 2/3/23 1:18 PM, Chen Yu wrote:
>> ...
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index aa16611c7263..d50097e5fcc1 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6489,6 +6489,20 @@ static int wake_wide(struct task_struct *p)
>>       return 1;
>>   }
>> +/*
>> + * If a task switches in and then voluntarily relinquishes the
>> + * CPU quickly, it is regarded as a short duration task.
>> + *
>> + * SIS_SHORT tries to wake up the short wakee on current CPU. This
>> + * aims to avoid race condition among CPUs due to frequent context
>> + * switch.
>> + */
>> +static inline int is_short_task(struct task_struct *p)
>> +{
>> +    return sched_feat(SIS_SHORT) && p->se.dur_avg &&
>> +           ((p->se.dur_avg * 8) < sysctl_sched_min_granularity);
>> +}
>
> I changed the factor to fit into the shape of tasks in question.
>
>     static inline int is_short_task(struct task_struct *p)
>     {
>         u64 dur = sysctl_sched_min_granularity / 8;
>
>         if (!sched_feat(SIS_SHORT) || !p->se.dur_avg)
>             return false;
>
>         /*
>          * Bare tracepoint to allow dynamically changing
>          * the threshold.
>          */
>         trace_sched_short_task_tp(p, &dur);
>
>         return p->se.dur_avg < dur;
>     }
>
> I'm not sure it is the right way to provide such flexibility, but
> definition of 'short' can be workload specific.
>
>> +
>>   /*
>>    * The purpose of wake_affine() is to quickly determine on which CPU
>> we can run
>>    * soonest. For the purpose of speed we only consider the waking and
>> previous
>> @@ -6525,6 +6539,11 @@ wake_affine_idle(int this_cpu, int prev_cpu,
>> int sync)
>>       if (available_idle_cpu(prev_cpu))
>>           return prev_cpu;
>> +    /* The only running task is a short duration one. */
>> +    if (cpu_rq(this_cpu)->nr_running == 1 &&
>> +        is_short_task(rcu_dereference(cpu_curr(this_cpu))))
>> +        return this_cpu;
>
> Since proxy server handles simple data delivery, the tasks are
> generally short running ones and hate task stacking which may
> introduce scheduling latency (even there are only 2 short tasks
> competing each other). So this part brings slight regression on
> the proxy case. But I still think this is good for most cases.
>
> Speaking of task stacking, I found wake_affine_weight() can be
> much more dangerous. It chooses the less loaded one between the
> prev & this cpu as a candidate, so 'small' tasks can be easily
> stacked on this cpu when wake up several tasks at one time if
> this cpu is unloaded. This really hurts if the 'small' tasks are
> latency-sensitive, although wake_affine_weight() does the right
> thing from the point of view of 'load'.
>
> The following change greatly reduced the p99lat of Redis service
> from 150ms to 0.9ms, at exactly the same throughput (QPS).
>
> @@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd, struct
> task_struct *p,
>     s64 this_eff_load, prev_eff_load;
>     unsigned long task_load;
>
> +    if (is_short_task(p))
> +        return nr_cpumask_bits;
> +
>     this_eff_load = cpu_load(cpu_rq(this_cpu));
>
>     if (sync) {
>
> I know that 'short' tasks are not necessarily 'small' tasks, e.g.
> sleeping duration is small or have large weights, but this works
> really well for this case. This is partly because delivering data
> is memory bandwidth intensive hence prefer cache hot cpus. And I
> think this is also applicable to the general purposes: do NOT let
> the short running tasks suffering from cache misses caused by
> migration.
>

Redis is a bit special. It runs quickly and is really sensitive to
scheduling latency. The purpose of this 'short task' feature from Yu
is to mitigate the migration and tend to place the woken task on the
local cpu, which is somewhat the opposite of what workloads such as
Redis want. The changes you did remind me of the latency-prio stuff.
Maybe we can do something based on both the 'short task' and
'latency-prio' to make your changes more general. Thoughts?

Thanks,
Honglei

> Best regards,
>     Abel

2023-02-17 10:40:44

by Abel Wu

Subject: Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU

Hi Honglei,

On 2/17/23 4:35 PM, Honglei Wang wrote:
>> The following change greatly reduced the p99lat of Redis service
>> from 150ms to 0.9ms, at exactly the same throughput (QPS).
>>
>> @@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd,
>> struct task_struct *p,
>>      s64 this_eff_load, prev_eff_load;
>>      unsigned long task_load;
>>
>> +    if (is_short_task(p))
>> +        return nr_cpumask_bits;
>> +
>>      this_eff_load = cpu_load(cpu_rq(this_cpu));
>>
>>      if (sync) {
>>
>> I know that 'short' tasks are not necessarily 'small' tasks, e.g.
>> sleeping duration is small or have large weights, but this works
>> really well for this case. This is partly because delivering data
>> is memory bandwidth intensive hence prefer cache hot cpus. And I
>> think this is also applicable to the general purposes: do NOT let
>> the short running tasks suffering from cache misses caused by
>> migration.
>>
>
> Redis is a bit special. It runs quick and really sensitive on schedule
> latency. The purpose of this 'short task' feature from Yu is to mitigate
> the migration and tend to place the waking task on local cpu, this is
> somehow on the opposite side of workload such as Redis. The changes you
> did remind me of the latency-prio stuff. Maybe we can do something base
> on both the 'short task' and 'latency-prio' to make your changes more
> general. thoughts?

I think it is more of an enhancement than a conflict. Chen Yu's patch
treats the cpus with only one short task as idle, to make the idle cpu
scan more efficient. So if this cpu is such an 'idle' cpu, just choose
it. What I suggested is to ignore this cpu if it is not idle.

But as you pointed out, Redis is a bit special in being sensitive to
scheduling latency, so the change in wake_affine_weight() may be
inappropriate, as 'weight' implies more about throughput than latency.

Best Regards,
Abel

2023-02-17 19:36:07

by K Prateek Nayak

Subject: Re: [PATCH v5 0/2] sched/fair: Wake short task on current CPU

Hello Chenyu and Abel,

I'll leave the detailed results from testing on a dual socket Zen3 system
(2 x 64C/128T) below.

tl;dr

o Most benchmark results see small wins or are comparable to tip.
o SpecJBB Max-jOPS sees a small hit but Critical-jOPS improves.
o ycsb-mongodb sees a small uplift in NPS1 mode.
o Numbers for netperf runs are pending, which I'll share in the
coming week.
o Abel's suggestion on top of v5 seems promising, but there are a
few regressions I noticed on larger workloads.

Detailed Results:

NPS modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.

Node 0: 0-63, 128-191
Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over 2 socket.

Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over 2 socket.

Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 224-239
Node 7: 112-127, 240-255

Benchmark Results:

Kernel versions:
- tip: 6.2.0-rc6 tip sched/core
- sis_short: 6.2.0-rc6 tip sched/core + this series

When the testing started, the tip was at:
commit 4d627628d758 "cpuidle: Fix poll_idle() noinstr annotation"

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test: tip sis_short
1-groups: 4.38 (0.00 pct) 4.49 (-2.51 pct)
2-groups: 5.12 (0.00 pct) 5.20 (-1.56 pct)
4-groups: 4.21 (0.00 pct) 4.24 (-0.71 pct)
8-groups: 4.68 (0.00 pct) 4.73 (-1.06 pct)
16-groups: 6.13 (0.00 pct) 6.35 (-3.58 pct)

o NPS2

Test: tip sis_short
1-groups: 4.51 (0.00 pct) 4.36 (3.32 pct)
2-groups: 4.31 (0.00 pct) 4.35 (0.92 pct)
4-groups: 4.17 (0.00 pct) 4.08 (2.15 pct)
8-groups: 4.58 (0.00 pct) 4.49 (1.96 pct)
16-groups: 5.74 (0.00 pct) 5.93 (-3.31 pct)

o NPS4

Test: tip sis_short
1-groups: 4.47 (0.00 pct) 4.51 (-0.89 pct)
2-groups: 4.97 (0.00 pct) 5.04 (-1.40 pct)
4-groups: 4.26 (0.00 pct) 4.28 (-0.46 pct)
8-groups: 5.46 (0.00 pct) 5.56 (-1.83 pct)
16-groups: 6.38 (0.00 pct) 6.10 (4.38 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

o NPS1

#workers: tip sis_short
1: 36.00 (0.00 pct) 27.00 (25.00 pct)
2: 37.00 (0.00 pct) 32.00 (13.51 pct)
4: 41.00 (0.00 pct) 34.00 (17.07 pct)
8: 46.00 (0.00 pct) 43.00 (6.52 pct)
16: 66.00 (0.00 pct) 66.00 (0.00 pct)
32: 111.00 (0.00 pct) 108.00 (2.70 pct)
64: 207.00 (0.00 pct) 206.00 (0.48 pct)
128: 483.00 (0.00 pct) 481.00 (0.41 pct)
256: 46272.00 (0.00 pct) 45120.00 (2.48 pct)
512: 76160.00 (0.00 pct) 77696.00 (-2.01 pct)

o NPS2

#workers: tip sis_short
1: 33.00 (0.00 pct) 31.00 (6.06 pct)
2: 35.00 (0.00 pct) 31.00 (11.42 pct)
4: 38.00 (0.00 pct) 38.00 (0.00 pct)
8: 51.00 (0.00 pct) 47.00 (7.84 pct)
16: 64.00 (0.00 pct) 67.00 (-4.68 pct)
32: 118.00 (0.00 pct) 116.00 (1.69 pct)
64: 214.00 (0.00 pct) 217.00 (-1.40 pct)
128: 497.00 (0.00 pct) 504.00 (-1.40 pct)
256: 45632.00 (0.00 pct) 44352.00 (2.80 pct)
512: 81024.00 (0.00 pct) 78464.00 (3.15 pct)

o NPS4

#workers: tip sis_short
1: 33.00 (0.00 pct) 32.00 (3.03 pct)
2: 40.00 (0.00 pct) 32.00 (20.00 pct)
4: 42.00 (0.00 pct) 38.00 (9.52 pct)
8: 64.00 (0.00 pct) 65.00 (-1.56 pct)
16: 73.00 (0.00 pct) 69.00 (5.47 pct)
32: 112.00 (0.00 pct) 112.00 (0.00 pct)
64: 215.00 (0.00 pct) 207.00 (3.72 pct)
128: 615.00 (0.00 pct) 593.00 (3.73 pct)
256: 46144.00 (0.00 pct) 45376.00 (1.66 pct)
512: 78208.00 (0.00 pct) 77696.00 (0.65 pct)


~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients: tip sis_short
1 536.78 (0.00 pct) 537.38 (0.11 pct)
2 1050.74 (0.00 pct) 1058.74 (0.76 pct)
4 1993.47 (0.00 pct) 1976.79 (-0.83 pct)
8 3498.02 (0.00 pct) 3657.16 (4.54 pct)
16 6202.01 (0.00 pct) 6014.62 (-3.02 pct)
32 11544.55 (0.00 pct) 11847.47 (2.62 pct)
64 21828.75 (0.00 pct) 21754.85 (-0.33 pct)
128 31095.92 (0.00 pct) 31643.35 (1.76 pct)
256 54828.12 (0.00 pct) 55432.29 (1.10 pct)
512 54888.10 (0.00 pct) 55917.91 (1.87 pct)
1024 54916.75 (0.00 pct) 53468.79 (-2.63 pct)

o NPS2

Clients: tip sis_short
1 543.08 (0.00 pct) 544.49 (0.25 pct)
2 1074.55 (0.00 pct) 1060.33 (-1.32 pct)
4 1980.75 (0.00 pct) 1992.86 (0.61 pct)
8 3628.36 (0.00 pct) 3507.73 (-3.32 pct)
16 5806.00 (0.00 pct) 5790.82 (-0.26 pct)
32 11351.94 (0.00 pct) 10937.21 (-3.26 pct)
64 19987.40 (0.00 pct) 20739.38 (3.76 pct)
128 29554.40 (0.00 pct) 30011.99 (1.54 pct)
256 53594.11 (0.00 pct) 51473.78 (-3.95 pct)
512 54304.03 (0.00 pct) 52998.31 (-2.40 pct)
1024 54338.25 (0.00 pct) 53265.51 (-1.97 pct)

o NPS4

Clients: tip sis_short
1 541.29 (0.00 pct) 536.21 (-0.93 pct)
2 1045.15 (0.00 pct) 1054.94 (0.93 pct)
4 1973.01 (0.00 pct) 1988.63 (0.79 pct)
8 3490.55 (0.00 pct) 3535.27 (1.28 pct)
16 5920.12 (0.00 pct) 5846.04 (-1.25 pct)
32 10933.38 (0.00 pct) 10944.33 (0.10pct)
64 19628.34 (0.00 pct) 19328.66 (1.01 pct)
128 29785.23 (0.00 pct) 28749.48 (-4.55 pct)
256 51999.72 (0.00 pct) 51336.20 (-1.27 pct)
512 53619.42 (0.00 pct) 53269.04 (-0.65 pct)
1024 53956.57 (0.00 pct) 53666.14 (-0.53 pct)


~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

10 Runs:

Test: tip sis_short
Copy: 320576.16 (0.00 pct) 328194.56 (2.37 pct)
Scale: 212869.80 (0.00 pct) 216713.96 (1.80 pct)
Add: 241556.74 (0.00 pct) 247467.26 (2.44 pct)
Triad: 250637.58 (0.00 pct) 245538.49 (-2.03 pct)

100 Runs:

Test: tip sis_short
Copy: 330058.38 (0.00 pct) 329339.60 (-0.21 pct)
Scale: 216475.85 (0.00 pct) 219334.10 (1.32 pct)
Add: 243028.82 (0.00 pct) 244037.77 (0.41 pct)
Triad: 252907.98 (0.00 pct) 257210.37 (1.70 pct)

o NPS2

10 Runs:

Test: tip sis_short
Copy: 339946.34 (0.00 pct) 327261.79 (-3.73 pct)
Scale: 217453.46 (0.00 pct) 221366.66 (1.79 pct)
Add: 258099.63 (0.00 pct) 258472.44 (0.14 pct)
Triad: 264974.76 (0.00 pct) 262618.99 (-0.88 pct)

100 Runs:

Test: tip sis_short
Copy: 335725.30 (0.00 pct) 320797.67 (-4.44 pct)
Scale: 229985.45 (0.00 pct) 221706.62 (-3.59 pct)
Add: 260546.33 (0.00 pct) 250668.80 (-3.79 pct)
Triad: 267925.27 (0.00 pct) 262959.86 (-1.85 pct)

o NPS4

10 Runs:

Test: tip sis_short
Copy: 369037.34 (0.00 pct) 371514.46 (0.67 pct)
Scale: 238235.39 (0.00 pct) 237661.29 (-0.24 pct)
Add: 263626.48 (0.00 pct) 263436.20 (-0.07 pct)
Triad: 280881.43 (0.00 pct) 288059.52 (2.55 pct)

100 Runs:

Test: tip sis_short
Copy: 339036.66 (0.00 pct) 346904.09 (2.32 pct)
Scale: 246638.02 (0.00 pct) 230195.65 (-6.66 pct)
Add: 259898.86 (0.00 pct) 244631.77 (-5.87 pct)
Triad: 265719.02 (0.00 pct) 264620.50 (-0.41 pct)

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1:

tip : 133514.00 (var: 2.07%)
sis-short : 137664.67 (var: 1.45%) (3.11%)

o NPS2:

tip : 132193.33 (var: 1.46%)
sis-short : 131189.33 (var: 1.69%) (-0.75%)

o NPS4:

tip : 133285.67 (var: 1.77%)
sis-short : 133891.33 (var: 1.58%) (0.45%)

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

Test Metric Parallelism tip sis_short
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48665321.00 ( 0.00%) 48553432.30 ( -0.23%)
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6281376826.80 ( 0.00%) 6277335150.50 ( -0.06%)
unixbench-syscall Amean unixbench-syscall-1 2689026.67 ( 0.00%) 2682044.73 * 0.26%*
unixbench-syscall Amean unixbench-syscall-512 7352453.23 ( 0.00%) 7290524.47 * -0.84%*
unixbench-pipe Hmean unixbench-pipe-1 2467955.46 ( 0.00%) 2426076.17 * -1.70%*
unixbench-pipe Hmean unixbench-pipe-512 295937232.39 ( 0.00%) 293462420.03 * -0.84%*
unixbench-spawn Hmean unixbench-spawn-1 4164.75 ( 0.00%) 4229.59 ( 1.56%)
unixbench-spawn Hmean unixbench-spawn-512 79950.80 ( 0.00%) 76439.30 ( -4.39%)
unixbench-execl Hmean unixbench-execl-1 4112.25 ( 0.00%) 4151.37 ( 0.95%)
unixbench-execl Hmean unixbench-execl-512 11785.88 ( 0.00%) 11756.46 ( -0.25%)

o NPS2

Test Metric Parallelism tip sis_short
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 49671827.09 ( 0.00%) 49077076.00 ( -1.20%)
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6282239821.90 ( 0.00%) 6283671307.30 ( 0.02%)
unixbench-syscall Amean unixbench-syscall-1 2688504.20 ( 0.00%) 2676278.60 * 0.45%*
unixbench-syscall Amean unixbench-syscall-512 7321621.07 ( 0.00%) 7784926.60 * 6.33%*
unixbench-pipe Hmean unixbench-pipe-1 2469941.97 ( 0.00%) 2419584.09 * -2.04%*
unixbench-pipe Hmean unixbench-pipe-512 296146392.10 ( 0.00%) 293156913.86 * -1.01%*
unixbench-spawn Hmean unixbench-spawn-1 5029.05 ( 0.00%) 5015.18 ( -0.28%)
unixbench-spawn Hmean unixbench-spawn-512 77198.79 ( 0.00%) 80409.23 * 4.16%*
unixbench-execl Hmean unixbench-execl-1 4092.59 ( 0.00%) 4158.36 * 1.61%*
unixbench-execl Hmean unixbench-execl-512 12293.67 ( 0.00%) 12169.31 ( -1.01%)

o NPS4

Test Metric Parallelism tip sis_short
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48944542.05 ( 0.00%) 49490899.03 * 1.12%*
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6291259625.50 ( 0.00%) 6299305899.90 ( 0.13%)
unixbench-syscall Amean unixbench-syscall-1 2686991.73 ( 0.00%) 2682940.53 * 0.15%*
unixbench-syscall Amean unixbench-syscall-512 7902201.47 ( 0.00%) 7931906.47 ( -0.38%)
unixbench-pipe Hmean unixbench-pipe-1 2468813.43 ( 0.00%) 2422272.88 * -1.89%*
unixbench-pipe Hmean unixbench-pipe-512 297109244.52 ( 0.00%) 294589928.27 * -0.85%*
unixbench-spawn Hmean unixbench-spawn-1 5161.67 ( 0.00%) 5012.58 ( -2.89%)
unixbench-spawn Hmean unixbench-spawn-512 78657.60 ( 0.00%) 78572.80 ( -0.11%)
unixbench-execl Hmean unixbench-execl-1 4112.02 ( 0.00%) 4122.16 ( 0.25%)
unixbench-execl Hmean unixbench-execl-512 13700.99 ( 0.00%) 14173.20 * 3.44%*

~~~~~~~~~~~
~ SpecJBB ~
~~~~~~~~~~~

o NPS1 - Normalized to baseline (tip)

Kernel tip sis_short
Max-jOPS 100% 98.53%
Critical-jOPS 100% 105.61%

~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~

o NPS1 - Normalized to baseline (tip)

Kernel : tip sis_short
8C/16T : 100.00% 100.54%
16C/32T : 100.00% 100.19%
32C/64T : 100.00% 98.08%
64C/128T : 100.00% 98.34%


--------------- With Abel's suggestion added to v5 ---------------

I've added the hunk suggested by Abel in the thread to the v5 and
following are results for the same set of benchmarks but only for
machine running in NPS1 mode.

sis_short_v5.1: 6.2.0-rc6 tip sched/core + this series + Abel's suggestion

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test: tip sis_short_v5.1
1-groups: 4.38 (0.00 pct) 4.08 (6.84 pct)
2-groups: 5.12 (0.00 pct) 5.10 (0.39 pct)
4-groups: 4.21 (0.00 pct) 4.23 (-0.47 pct)
8-groups: 4.68 (0.00 pct) 4.69 (-0.21 pct)
16-groups: 6.13 (0.00 pct) 5.94 (3.09 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

o NPS1

#workers: tip sis_short_v5.1
1: 36.00 (0.00 pct) 36.00 (0.00 pct)
2: 37.00 (0.00 pct) 39.00 (-5.40 pct)
4: 41.00 (0.00 pct) 40.00 (2.43 pct)
8: 46.00 (0.00 pct) 46.00 (0.00 pct)
16: 66.00 (0.00 pct) 68.00 (-3.03 pct)
32: 111.00 (0.00 pct) 112.00 (-0.90 pct)
64: 207.00 (0.00 pct) 238.00 (-14.97 pct)
64: 227.00 (0.00 pct) 219.00 (3.52 pct)
128: 483.00 (0.00 pct) 494.00 (-2.27 pct)
256: 46272.00 (0.00 pct) 41280.00 (10.78 pct)
512: 78293.00 (0.00 pct) 79325.00 (-1.31 pct)

~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients: tip sis_short_v5.1
1 536.78 (0.00 pct) 535.90 (-0.16 pct)
2 1050.74 (0.00 pct) 1067.32 (1.57 pct)
4 1993.47 (0.00 pct) 1971.63 (-1.09 pct)
8 3601.77 (0.00 pct) 3599.17 (-0.07 pct)
16 6202.01 (0.00 pct) 6115.08 (-1.40 pct)
32 11544.55 (0.00 pct) 11423.52 (-1.04 pct)
64 21828.75 (0.00 pct) 21403.94 (-1.94 pct)
128 31095.92 (0.00 pct) 30783.55 (-1.00 pct)
256 54828.12 (0.00 pct) 55328.94 (0.91 pct)
512 54888.10 (0.00 pct) 53483.33 (-2.55 pct)
1024 48407.14 (0.00 pct) 48998.95 (1.22 pct)

~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

10 Runs:

Test: tip sis_short_v5.1
Copy: 320576.16 (0.00 pct) 331810.14 (3.50 pct)
Scale: 212869.80 (0.00 pct) 214725.82 (0.87 pct)
Add: 241556.74 (0.00 pct) 242340.92 (0.32 pct)
Triad: 250637.58 (0.00 pct) 251271.53 (0.25 pct)

100 Runs:

Test: tip sis_short_v5.1
Copy: 330058.38 (0.00 pct) 331966.60 (0.57 pct)
Scale: 216475.85 (0.00 pct) 222777.84 (2.91 pct)
Add: 243028.82 (0.00 pct) 250873.78 (3.22 pct)
Triad: 252907.98 (0.00 pct) 253791.20 (0.34 pct)

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1:

tip : 133514.00 (var: 2.07%)
sis-short_v5.1 : 129172.67 (var: 2.32%) (-3.25%) **

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

Test Metric Parallelism tip sis_short_v5.1
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 49266026.90 ( 0.00%) 49054799.90 ( -0.43%)
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6285063007.68 ( 0.00%) 6280424934.15 ( -0.07%)
unixbench-syscall Amean unixbench-syscall-1 2689026.67 ( 0.00%) 2677968.03 * 0.41%*
unixbench-syscall Amean unixbench-syscall-512 7352453.23 ( 0.00%) 7354325.40 ( -0.03%)
unixbench-pipe Hmean unixbench-pipe-1 2467955.46 ( 0.00%) 2351117.60 * -4.73%*
unixbench-pipe Hmean unixbench-pipe-512 295937232.39 ( 0.00%) 295769918.99 ( -0.06%)
unixbench-spawn Hmean unixbench-spawn-1 4164.75 ( 0.00%) 4331.89 * 4.01%*
unixbench-spawn Hmean unixbench-spawn-512 79626.61 ( 0.00%) 77865.32 * -2.21%*
unixbench-execl Hmean unixbench-execl-1 4112.25 ( 0.00%) 4145.85 ( 0.82%)
unixbench-execl Hmean unixbench-execl-512 11785.88 ( 0.00%) 11935.41 ( 1.27%)

~~~~~~~~~~~
~ SpecJBB ~
~~~~~~~~~~~

o NPS1 - Normalized to baseline (tip)

Kernel tip sis_short_V5.1
Max-jOPS 100% 91.99% ** (-8.01%)
Critical-jOPS 100% 99.29%

~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~

o NPS1 - Throughput normalized to baseline (tip)

Kernel : tip sis_short_V5.1
8C/16T : 100.00% 93.75% ** (-6.25%)
16C/32T : 100.00% 100.43%
32C/64T : 100.00% 101.12%
64C/128T : 100.00% 100.21%

o Follow wake_affine_bias() if waker's cpu and prev_cpu are on same LLC?

There are cases with Abel's suggestion where some of the larger
benchmarks regress. I wonder if wake_affine_bias() can still be
considered for short running tasks if the waker's CPU and the
prev_cpu share caches. In the DeathStarBench 8C/16T case, the
services are all pinned to the CPUs of the same MC domain. The
regression observed seems to arise from the missed opportunity
to distribute load among the CPUs sharing the same L3. I do not
have data for this currently, but I'll update the thread with any
findings.
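
One way to express that idea, as a hypothetical variant of Abel's hunk
above (only a sketch to illustrate the question, not something that
has been tested here; cpus_share_cache() is the existing helper):

	/* Skip the load-based bias only when prev_cpu is on another LLC. */
	if (is_short_task(p) && !cpus_share_cache(this_cpu, prev_cpu))
		return nr_cpumask_bits;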

I'll also queue up a Redis run from mmtests to see if I can reproduce
Abel's observations on my system; however, I'm not sure if the
utilization will be high enough to emulate the same scenario as
Abel's prod environment. If the migrations within the same MC

On 2/3/2023 10:47 AM, Chen Yu wrote:
> The main purpose is to avoid unnecessary cross-CPU wakeups. Frequent
> cross-CPU wakeups hurt some workloads significantly, especially on
> high core count systems.
>
> Inhibit the cross-CPU wakeup by placing the wakee on the waking CPU,
> if both the waker and the wakee are short-duration tasks. A
> short-duration task could become a troublemaker on a high-load system,
> because it can cause frequent context switches. So this strategy only
> takes effect when the system is busy. Besides, it is unreasonable to
> inhibit the idle CPU scan when there are still idle CPUs.
>
> The first patch introduces the definition of a short-duration task.
> The second patch leverages it to choose a local CPU for the wakee.
>
> Overall there is a significant performance improvement on an Intel
> 2 x 56C/112T platform, such as will-it-scale (1200+%) and
> netperf (600+%) in some cases, and no noticeable impact on
> schbench, hackbench, tbench and an OLTP workload with a commercial RDBMS.
>
> Test results on other platforms, such as Zen3 and Kunpeng Arm64, are
> welcome. Prateek and Yicong, it would be appreciated if you could give
> this version a try.
>
> Changes since v4:
> 1. Dietmar has commented on the task duration calculation. So refined
> the commit log to reduce confusion.
> 2. Change [PATCH 1/2] to only record the average duration of a task.
> So this change could benefit UTIL_EST_FASTER[1].
> 3. As v4 reported regression on Zen3 and Kunpeng Arm64, add back
> the system average utilization restriction that, if the system
> is not busy, do not enable the short wake up. Above logic has
> shown improvement on Zen3[2].
> 4. Restrict the wakeup target to be current CPU, rather than both
> current CPU and task's previous CPU. This could also benefit
> wakeup optimization from interrupt in the future, which is
> suggested by Yicong.
>
> Changes since v3:
> 1. Honglei and Josh have concern that the threshold of short
> task duration could be too long. Decreased the threshold from
> sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8),
> and the '8' comes from get_update_sysctl_factor().
> 2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion.
> 3. Move the calculation of average duration from put_prev_task_fair()
> to dequeue_task_fair(). Because there is an issue in v3 that,
> put_prev_task_fair() will not be invoked by pick_next_task_fair()
> in fast path, thus the dur_avg could not be updated timely.
> 4. Fix the comment in PATCH 2/2, that "WRITE_ONCE(CPU1->ttwu_pending, 1);"
> on CPU0 is earlier than CPU1 getting "ttwu_list->p0", per Tianchen.
> 5. Move the scan for CPU with short duration task from select_idle_cpu()
> to select_idle_sibling(), because there is no CPU scan involved, per
> Yicong.
>
> Changes since v2:
>
> 1. Peter suggested comparing the duration of waker and the cost to
> scan for an idle CPU: If the cost is higher than the task duration,
> do not waste time finding an idle CPU, choose the local or previous
> CPU directly. A prototype was created based on this suggestion.
> However, according to the test result, this prototype did not inhibit
> the cross-CPU wakeup and did not bring improvement, because the cost
> to find an idle CPU is small in the problematic scenario. The root
> cause of the problem is a race condition between scanning for an idle
> CPU and task enqueue(please refer to the commit log in PATCH 2/2).
> So v3 does not change the core logic of v2, with some refinement based
> on Peter's suggestion.
>
> 2. Simplify the logic to record the task duration per Peter and Abel's suggestion.
>
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/all/[email protected]/
>
> v4: https://lore.kernel.org/lkml/[email protected]/
> v3: https://lore.kernel.org/lkml/[email protected]/
> v2: https://lore.kernel.org/all/[email protected]/
> v1: https://lore.kernel.org/lkml/[email protected]/
>
> Chen Yu (2):
> sched/fair: Record the average duration of a task
> sched/fair: Introduce SIS_SHORT to wake up short task on current CPU
>
> include/linux/sched.h | 3 +++
> kernel/sched/core.c | 2 ++
> kernel/sched/debug.c | 1 +
> kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
> kernel/sched/features.h | 1 +
> 5 files changed, 46 insertions(+)
>

The netperf results are still pending and I'll update the thread
with them in the coming week. If you would like me to test
or gather some data for a specific workload on the test system,
please do let me know.
--
Thanks and Regards,
Prateek

2023-02-20 04:55:45

by Chen Yu

Subject: Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU

On 2023-02-17 at 10:44:51 +0800, Abel Wu wrote:
> On 2/16/23 11:24 PM, Chen Yu wrote:
> > > The following change greatly reduced the p99lat of Redis service
> > > from 150ms to 0.9ms, at exactly the same throughput (QPS).
> > >
> > > @@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd, struct
> > > task_struct *p,
> > > s64 this_eff_load, prev_eff_load;
> > > unsigned long task_load;
> > >
> > > + if (is_short_task(p))
> > > + return nr_cpumask_bits;
> > > +
> > So above change wants to wake up the short task on its previous
> > CPU if I understand correctly.
>
> Yes.
>
> > > this_eff_load = cpu_load(cpu_rq(this_cpu));
> > >
> > > if (sync) {
> > >
> > > I know that 'short' tasks are not necessarily 'small' tasks, e.g.
> > > sleeping duration is small or have large weights, but this works
> > > really well for this case. This is partly because delivering data
> > > is memory bandwidth intensive hence prefer cache hot cpus. And I
> > > think this is also applicable to the general purposes: do NOT let
> > > the short running tasks suffering from cache misses caused by
> > > migration.
> > >
> > I see. My original thought was to mitigate short task migration
> > as much as possible. Either waking up the task on current CPU or previous
> > CPU should both achieve the goal in theory. Could you please describe
> > a little more about how Redis proxy server was tested? Was it tested
> > locally or using multiple machines? I asked this because for network
> > benchmarks, it might be better to wake the task close to the waker(maybe
> > the NIC interrupt) due to hot network buffer. Anyway I will test
> > your change slightly changed to see the impact, and also Redis. But it
> > would be even better if you could provide some simple test steps I can
> > try locally : )
>
> Sorry for missing the info. The test was done in production environment,
> and what I have done is only updating the kernel in several machines
> which are highly loaded, that is over 85% cpu util observed by mpstat.
> Please let me know if you want any specific info.
>
I've modified the code a little bit, inspired by your statement
"'small' tasks can be easily stacked on this cpu when waking up several tasks
at one time if this cpu is unloaded". I added an extra check on wakee_flips, so
that tasks waking up too many different tasks will not be treated as 'small'
ones. That is to say, only tasks that wake up each other exclusively will be
put together. So far this change has been verified to keep the improvement for
netperf/will-it-scale, and I just launched the Redis test to see what will
happen.
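
A minimal sketch of the idea, to make it concrete: the threshold value and
the exact placement of the check below are illustrative assumptions rather
than the code of the next version; only p->se.dur_avg and the p->wakee_flips
statistic (already maintained for wake_wide()) come from the existing code.

    /*
     * Sketch only: a task that keeps flipping between many different
     * wakees is not treated as a short-duration one, so it will not be
     * aggregated on the waking CPU. The threshold '8' is an example.
     */
    static inline int is_short_task(struct task_struct *p)
    {
        if (!sched_feat(SIS_SHORT) || !p->se.dur_avg)
            return 0;

        /* Waking too many different tasks: skip the short-task path. */
        if (p->wakee_flips > 8)
            return 0;

        return (p->se.dur_avg * 8) < sysctl_sched_min_granularity;
    }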

thanks,
Chenyu

2023-02-20 04:58:52

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU

On 2023-02-17 at 16:35:24 +0800, Honglei Wang wrote:
>
>
> On 2023/2/16 20:55, Abel Wu wrote:
> > Hi Chen,
> >
> > I've tested this patchset (with modification) on our Redis proxy
> > servers, and the results seems promising.
> >
> > On 2/3/23 1:18 PM, Chen Yu wrote:
> > > ...
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index aa16611c7263..d50097e5fcc1 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6489,6 +6489,20 @@ static int wake_wide(struct task_struct *p)
> > >       return 1;
> > >   }
> > > +/*
> > > + * If a task switches in and then voluntarily relinquishes the
> > > + * CPU quickly, it is regarded as a short duration task.
> > > + *
> > > + * SIS_SHORT tries to wake up the short wakee on current CPU. This
> > > + * aims to avoid race condition among CPUs due to frequent context
> > > + * switch.
> > > + */
> > > +static inline int is_short_task(struct task_struct *p)
> > > +{
> > > +    return sched_feat(SIS_SHORT) && p->se.dur_avg &&
> > > +           ((p->se.dur_avg * 8) < sysctl_sched_min_granularity);
> > > +}
> >
> > I changed the factor to fit into the shape of tasks in question.
> >
> >     static inline int is_short_task(struct task_struct *p)
> >     {
> >         u64 dur = sysctl_sched_min_granularity / 8;
> >
> >         if (!sched_feat(SIS_SHORT) || !p->se.dur_avg)
> >             return false;
> >
> >         /*
> >          * Bare tracepoint to allow dynamically changing
> >          * the threshold.
> >          */
> >         trace_sched_short_task_tp(p, &dur);
> >
> >         return p->se.dur_avg < dur;
> >     }
> >
> > I'm not sure it is the right way to provide such flexibility, but
> > definition of 'short' can be workload specific.
> >
> > > +
> > >   /*
> > >    * The purpose of wake_affine() is to quickly determine on which
> > > CPU we can run
> > >    * soonest. For the purpose of speed we only consider the waking
> > > and previous
> > > @@ -6525,6 +6539,11 @@ wake_affine_idle(int this_cpu, int prev_cpu,
> > > int sync)
> > >       if (available_idle_cpu(prev_cpu))
> > >           return prev_cpu;
> > > +    /* The only running task is a short duration one. */
> > > +    if (cpu_rq(this_cpu)->nr_running == 1 &&
> > > +        is_short_task(rcu_dereference(cpu_curr(this_cpu))))
> > > +        return this_cpu;
> >
> > Since the proxy server handles simple data delivery, the tasks are
> > generally short running ones and hate task stacking, which may
> > introduce scheduling latency (even if there are only 2 short tasks
> > competing with each other). So this part brings a slight regression on
> > the proxy case. But I still think this is good for most cases.
> >
> > Speaking of task stacking, I found wake_affine_weight() can be
> > much more dangerous. It chooses the less loaded one between the
> > prev & this cpu as a candidate, so 'small' tasks can be easily
> > stacked on this cpu when waking up several tasks at one time if
> > this cpu is unloaded. This really hurts if the 'small' tasks are
> > latency-sensitive, although wake_affine_weight() does the right
> > thing from the point of view of 'load'.
> >
> > The following change greatly reduced the p99lat of Redis service
> > from 150ms to 0.9ms, at exactly the same throughput (QPS).
> >
> > @@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd, struct
> > task_struct *p,
> >     s64 this_eff_load, prev_eff_load;
> >     unsigned long task_load;
> >
> > +    if (is_short_task(p))
> > +        return nr_cpumask_bits;
> > +
> >     this_eff_load = cpu_load(cpu_rq(this_cpu));
> >
> >     if (sync) {
> >
> > I know that 'short' tasks are not necessarily 'small' tasks, e.g.
> > their sleeping duration is small or they have large weights, but this works
> > really well for this case. This is partly because delivering data
> > is memory bandwidth intensive and hence prefers cache-hot cpus. And I
> > think this is also applicable for general purposes: do NOT let
> > the short running tasks suffer from cache misses caused by
> > migration.
> >
>
> Redis is a bit special. It runs quickly and is really sensitive to scheduling
> latency. The purpose of this 'short task' feature from Yu is to mitigate the
> migration and tend to place the waking task on the local cpu, which is somehow
> the opposite of what workloads such as Redis want. The changes you did remind me
> of the latency-prio stuff. Maybe we can do something based on both the 'short
> task' and 'latency-prio' to make your changes more general. Thoughts?
>
Looks reasonable, I suppose you were referring to 'latency nice' proposed by
Vincent. For now I'd like to keep this patch simple enough; later we can
extend it.

thanks,
Chenyu
> Thanks,
> Honglei
>
> > Best regards,
> > ????Abel

2023-02-20 05:59:15

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH v5 0/2] sched/fair: Wake short task on current CPU

Hi Prateek,
On 2023-02-18 at 01:05:32 +0530, K Prateek Nayak wrote:
> Hello Chenyu and Abel,
>
> I'll leave the detailed results from testing on a dual socket Zen3 system
> (2 x 64C/128T) below.
>
Thanks for the test!
> tl;dr
>
> o Most benchmark results see small wins or are comparable to tip.
> o SpecJBB Max-jOPS sees a small hit but Critical-jOPS improves.
I assume that this change should be within the acceptable variance range, because
in the previous version we did not restrict the local wakeup as strictly as in the
current version, and we did not see a hit on Max-jOPS in v4. Anyway, I've
enhanced the restriction per Abel's feedback and launched some tests,
so as to make Redis and SpecJBB feel better.
> o ycsb-mongodb sees a small uplift in NPS1 mode.
> o Numbers for Netperf runs are pending, which I'll share in the
> coming week.
> o Abel's suggestion on top of v5 seems promising, but there are a
> few regressions I noticed on larger workloads.
>
> Detailed Results:
>
> NPS modes are used to logically divide a single socket into
> multiple NUMA regions.
> Following is the NUMA configuration for each NPS mode on the system:
>
> NPS1: Each socket is a NUMA node.
> Total 2 NUMA nodes in the dual socket machine.
>
> Node 0: 0-63, 128-191
> Node 1: 64-127, 192-255
>
> NPS2: Each socket is further logically divided into 2 NUMA regions.
> Total 4 NUMA nodes exist over 2 sockets.
>
> Node 0: 0-31, 128-159
> Node 1: 32-63, 160-191
> Node 2: 64-95, 192-223
> Node 3: 96-127, 224-255
>
> NPS4: Each socket is logically divided into 4 NUMA regions.
> Total 8 NUMA nodes exist over 2 sockets.
>
> Node 0: 0-15, 128-143
> Node 1: 16-31, 144-159
> Node 2: 32-47, 160-175
> Node 3: 48-63, 176-191
> Node 4: 64-79, 192-207
> Node 5: 80-95, 208-223
> Node 6: 96-111, 224-239
> Node 7: 112-127, 240-255
>
> Benchmark Results:
>
> Kernel versions:
> - tip: 6.2.0-rc6 tip sched/core
> - sis_short: 6.2.0-rc6 tip sched/core + this series
>
> When the testing started, the tip was at:
> commit 4d627628d758 "cpuidle: Fix poll_idle() noinstr annotation"
>
> ~~~~~~~~~~~~~
> ~ hackbench ~
> ~~~~~~~~~~~~~
>
> o NPS1
>
> Test: tip sis_short
> 1-groups: 4.38 (0.00 pct) 4.49 (-2.51 pct)
> 2-groups: 5.12 (0.00 pct) 5.20 (-1.56 pct)
> 4-groups: 4.21 (0.00 pct) 4.24 (-0.71 pct)
> 8-groups: 4.68 (0.00 pct) 4.73 (-1.06 pct)
> 16-groups: 6.13 (0.00 pct) 6.35 (-3.58 pct)
>
> o NPS2
>
> Test: tip sis_short
> 1-groups: 4.51 (0.00 pct) 4.36 (3.32 pct)
> 2-groups: 4.31 (0.00 pct) 4.35 (0.92 pct)
> 4-groups: 4.17 (0.00 pct) 4.08 (2.15 pct)
> 8-groups: 4.58 (0.00 pct) 4.49 (1.96 pct)
> 16-groups: 5.74 (0.00 pct) 5.93 (-3.31 pct)
>
> o NPS4
>
> Test: tip sis_short
> 1-groups: 4.47 (0.00 pct) 4.51 (-0.89 pct)
> 2-groups: 4.97 (0.00 pct) 5.04 (-1.40 pct)
> 4-groups: 4.26 (0.00 pct) 4.28 (-0.46 pct)
> 8-groups: 5.46 (0.00 pct) 5.56 (-1.83 pct)
> 16-groups: 6.38 (0.00 pct) 6.10 (4.38 pct)
>
> ~~~~~~~~~~~~
> ~ schbench ~
> ~~~~~~~~~~~~
>
> o NPS1
>
> #workers: tip sis_short
> 1: 36.00 (0.00 pct) 27.00 (25.00 pct)
> 2: 37.00 (0.00 pct) 32.00 (13.51 pct)
> 4: 41.00 (0.00 pct) 34.00 (17.07 pct)
> 8: 46.00 (0.00 pct) 43.00 (6.52 pct)
> 16: 66.00 (0.00 pct) 66.00 (0.00 pct)
> 32: 111.00 (0.00 pct) 108.00 (2.70 pct)
> 64: 207.00 (0.00 pct) 206.00 (0.48 pct)
> 128: 483.00 (0.00 pct) 481.00 (0.41 pct)
> 256: 46272.00 (0.00 pct) 45120.00 (2.48 pct)
> 512: 76160.00 (0.00 pct) 77696.00 (-2.01 pct)
>
> o NPS2
>
> #workers: tip sis_short
> 1: 33.00 (0.00 pct) 31.00 (6.06 pct)
> 2: 35.00 (0.00 pct) 31.00 (11.42 pct)
> 4: 38.00 (0.00 pct) 38.00 (0.00 pct)
> 8: 51.00 (0.00 pct) 47.00 (7.84 pct)
> 16: 64.00 (0.00 pct) 67.00 (-4.68 pct)
> 32: 118.00 (0.00 pct) 116.00 (1.69 pct)
> 64: 214.00 (0.00 pct) 217.00 (-1.40 pct)
> 128: 497.00 (0.00 pct) 504.00 (-1.40 pct)
> 256: 45632.00 (0.00 pct) 44352.00 (2.80 pct)
> 512: 81024.00 (0.00 pct) 78464.00 (3.15 pct)
>
> o NPS4
>
> #workers: tip sis_short
> 1: 33.00 (0.00 pct) 32.00 (3.03 pct)
> 2: 40.00 (0.00 pct) 32.00 (20.00 pct)
> 4: 42.00 (0.00 pct) 38.00 (9.52 pct)
> 8: 64.00 (0.00 pct) 65.00 (-1.56 pct)
> 16: 73.00 (0.00 pct) 69.00 (5.47 pct)
> 32: 112.00 (0.00 pct) 112.00 (0.00 pct)
> 64: 215.00 (0.00 pct) 207.00 (3.72 pct)
> 128: 615.00 (0.00 pct) 593.00 (3.73 pct)
> 256: 46144.00 (0.00 pct) 45376.00 (1.66 pct)
> 512: 78208.00 (0.00 pct) 77696.00 (0.65 pct)
>
>
> ~~~~~~~~~~
> ~ tbench ~
> ~~~~~~~~~~
>
> o NPS1
>
> Clients: tip sis_short
> 1 536.78 (0.00 pct) 537.38 (0.11 pct)
> 2 1050.74 (0.00 pct) 1058.74 (0.76 pct)
> 4 1993.47 (0.00 pct) 1976.79 (-0.83 pct)
> 8 3498.02 (0.00 pct) 3657.16 (4.54 pct)
> 16 6202.01 (0.00 pct) 6014.62 (-3.02 pct)
> 32 11544.55 (0.00 pct) 11847.47 (2.62 pct)
> 64 21828.75 (0.00 pct) 21754.85 (-0.33 pct)
> 128 31095.92 (0.00 pct) 31643.35 (1.76 pct)
> 256 54828.12 (0.00 pct) 55432.29 (1.10 pct)
> 512 54888.10 (0.00 pct) 55917.91 (1.87 pct)
> 1024 54916.75 (0.00 pct) 53468.79 (-2.63 pct)
>
> o NPS2
>
> Clients: tip sis_short
> 1 543.08 (0.00 pct) 544.49 (0.25 pct)
> 2 1074.55 (0.00 pct) 1060.33 (-1.32 pct)
> 4 1980.75 (0.00 pct) 1992.86 (0.61 pct)
> 8 3628.36 (0.00 pct) 3507.73 (-3.32 pct)
> 16 5806.00 (0.00 pct) 5790.82 (-0.26 pct)
> 32 11351.94 (0.00 pct) 10937.21 (-3.26 pct)
> 64 19987.40 (0.00 pct) 20739.38 (3.76 pct)
> 128 29554.40 (0.00 pct) 30011.99 (1.54 pct)
> 256 53594.11 (0.00 pct) 51473.78 (-3.95 pct)
> 512 54304.03 (0.00 pct) 52998.31 (-2.40 pct)
> 1024 54338.25 (0.00 pct) 53265.51 (-1.97 pct)
>
> o NPS4
>
> Clients: tip sis_short
> 1 541.29 (0.00 pct) 536.21 (-0.93 pct)
> 2 1045.15 (0.00 pct) 1054.94 (0.93 pct)
> 4 1973.01 (0.00 pct) 1988.63 (0.79 pct)
> 8 3490.55 (0.00 pct) 3535.27 (1.28 pct)
> 16 5920.12 (0.00 pct) 5846.04 (-1.25 pct)
> 32 10933.38 (0.00 pct) 10944.33 (0.10 pct)
> 64 19628.34 (0.00 pct) 19328.66 (1.01 pct)
> 128 29785.23 (0.00 pct) 28749.48 (-4.55 pct)
> 256 51999.72 (0.00 pct) 51336.20 (-1.27 pct)
> 512 53619.42 (0.00 pct) 53269.04 (-0.65 pct)
> 1024 53956.57 (0.00 pct) 53666.14 (-0.53 pct)
>
>
> ~~~~~~~~~~
> ~ stream ~
> ~~~~~~~~~~
>
> o NPS1
>
> 10 Runs:
>
> Test: tip sis_short
> Copy: 320576.16 (0.00 pct) 328194.56 (2.37 pct)
> Scale: 212869.80 (0.00 pct) 216713.96 (1.80 pct)
> Add: 241556.74 (0.00 pct) 247467.26 (2.44 pct)
> Triad: 250637.58 (0.00 pct) 245538.49 (-2.03 pct)
>
> 100 Runs:
>
> Test: tip sis_short
> Copy: 330058.38 (0.00 pct) 329339.60 (-0.21 pct)
> Scale: 216475.85 (0.00 pct) 219334.10 (1.32 pct)
> Add: 243028.82 (0.00 pct) 244037.77 (0.41 pct)
> Triad: 252907.98 (0.00 pct) 257210.37 (1.70 pct)
>
> o NPS2
>
> 10 Runs:
>
> Test: tip sis_short
> Copy: 339946.34 (0.00 pct) 327261.79 (-3.73 pct)
> Scale: 217453.46 (0.00 pct) 221366.66 (1.79 pct)
> Add: 258099.63 (0.00 pct) 258472.44 (0.14 pct)
> Triad: 264974.76 (0.00 pct) 262618.99 (-0.88 pct)
>
> 100 Runs:
>
> Test: tip sis_short
> Copy: 335725.30 (0.00 pct) 320797.67 (-4.44 pct)
> Scale: 229985.45 (0.00 pct) 221706.62 (-3.59 pct)
> Add: 260546.33 (0.00 pct) 250668.80 (-3.79 pct)
> Triad: 267925.27 (0.00 pct) 262959.86 (-1.85 pct)
>
> o NPS4
>
> 10 Runs:
>
> Test: tip sis_short
> Copy: 369037.34 (0.00 pct) 371514.46 (0.67 pct)
> Scale: 238235.39 (0.00 pct) 237661.29 (-0.24 pct)
> Add: 263626.48 (0.00 pct) 263436.20 (-0.07 pct)
> Triad: 280881.43 (0.00 pct) 288059.52 (2.55 pct)
>
> 100 Runs:
>
> Test: tip sis_short
> Copy: 339036.66 (0.00 pct) 346904.09 (2.32 pct)
> Scale: 246638.02 (0.00 pct) 230195.65 (-6.66 pct)
> Add: 259898.86 (0.00 pct) 244631.77 (-5.87 pct)
> Triad: 265719.02 (0.00 pct) 264620.50 (-0.41 pct)
>
> ~~~~~~~~~~~~~~~~
> ~ ycsb-mongodb ~
> ~~~~~~~~~~~~~~~~
>
> o NPS1:
>
> tip : 133514.00 (var: 2.07%)
> sis-short : 137664.67 (var: 1.45%) (3.11%)
>
> o NPS2:
>
> tip : 132193.33 (var: 1.46%)
> sis-short : 131189.33 (var: 1.69%) (-0.75%)
>
> o NPS4:
>
> tip : 133285.67 (var: 1.77%)
> sis-short : 133891.33 (var: 1.58%) (0.45%)
>
> ~~~~~~~~~~~~~
> ~ unixbench ~
> ~~~~~~~~~~~~~
>
> o NPS1
>
> Test Metric Parallelism tip sis_short
> unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48665321.00 ( 0.00%) 48553432.30 ( -0.23%)
> unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6281376826.80 ( 0.00%) 6277335150.50 ( -0.06%)
> unixbench-syscall Amean unixbench-syscall-1 2689026.67 ( 0.00%) 2682044.73 * 0.26%*
> unixbench-syscall Amean unixbench-syscall-512 7352453.23 ( 0.00%) 7290524.47 * -0.84%*
> unixbench-pipe Hmean unixbench-pipe-1 2467955.46 ( 0.00%) 2426076.17 * -1.70%*
> unixbench-pipe Hmean unixbench-pipe-512 295937232.39 ( 0.00%) 293462420.03 * -0.84%*
> unixbench-spawn Hmean unixbench-spawn-1 4164.75 ( 0.00%) 4229.59 ( 1.56%)
> unixbench-spawn Hmean unixbench-spawn-512 79950.80 ( 0.00%) 76439.30 ( -4.39%)
> unixbench-execl Hmean unixbench-execl-1 4112.25 ( 0.00%) 4151.37 ( 0.95%)
> unixbench-execl Hmean unixbench-execl-512 11785.88 ( 0.00%) 11756.46 ( -0.25%)
>
> o NPS2
>
> Test Metric Parallelism tip sis_short
> unixbench-dhry2reg Hmean unixbench-dhry2reg-1 49671827.09 ( 0.00%) 49077076.00 ( -1.20%)
> unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6282239821.90 ( 0.00%) 6283671307.30 ( 0.02%)
> unixbench-syscall Amean unixbench-syscall-1 2688504.20 ( 0.00%) 2676278.60 * 0.45%*
> unixbench-syscall Amean unixbench-syscall-512 7321621.07 ( 0.00%) 7784926.60 * 6.33%*
> unixbench-pipe Hmean unixbench-pipe-1 2469941.97 ( 0.00%) 2419584.09 * -2.04%*
> unixbench-pipe Hmean unixbench-pipe-512 296146392.10 ( 0.00%) 293156913.86 * -1.01%*
> unixbench-spawn Hmean unixbench-spawn-1 5029.05 ( 0.00%) 5015.18 ( -0.28%)
> unixbench-spawn Hmean unixbench-spawn-512 77198.79 ( 0.00%) 80409.23 * 4.16%*
> unixbench-execl Hmean unixbench-execl-1 4092.59 ( 0.00%) 4158.36 * 1.61%*
> unixbench-execl Hmean unixbench-execl-512 12293.67 ( 0.00%) 12169.31 ( -1.01%)
>
> o NPS4
>
> Test Metric Parallelism tip sis_short
> unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48944542.05 ( 0.00%) 49490899.03 * 1.12%*
> unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6291259625.50 ( 0.00%) 6299305899.90 ( 0.13%)
> unixbench-syscall Amean unixbench-syscall-1 2686991.73 ( 0.00%) 2682940.53 * 0.15%*
> unixbench-syscall Amean unixbench-syscall-512 7902201.47 ( 0.00%) 7931906.47 ( -0.38%)
> unixbench-pipe Hmean unixbench-pipe-1 2468813.43 ( 0.00%) 2422272.88 * -1.89%*
> unixbench-pipe Hmean unixbench-pipe-512 297109244.52 ( 0.00%) 294589928.27 * -0.85%*
> unixbench-spawn Hmean unixbench-spawn-1 5161.67 ( 0.00%) 5012.58 ( -2.89%)
> unixbench-spawn Hmean unixbench-spawn-512 78657.60 ( 0.00%) 78572.80 ( -0.11%)
> unixbench-execl Hmean unixbench-execl-1 4112.02 ( 0.00%) 4122.16 ( 0.25%)
> unixbench-execl Hmean unixbench-execl-512 13700.99 ( 0.00%) 14173.20 * 3.44%*
>
> ~~~~~~~~~~~
> ~ SpecJBB ~
> ~~~~~~~~~~~
>
> o NPS1 - Normalized to baseline (tip)
>
> Kernel tip sis_short
> Max-jOPS 100% 98.53%
> Critical-jOPS 100% 105.61%
>
> ~~~~~~~~~~~~~~~~~~
> ~ DeathStarBench ~
> ~~~~~~~~~~~~~~~~~~
>
> o NPS1 - Normalized to baseline (tip)
>
> Kernel : tip sis_short
> 8C/16T : 100.00% 100.54%
> 16C/32T : 100.00% 100.19%
> 32C/64T : 100.00% 98.08%
> 64C/128T : 100.00% 98.34%
>
>
> --------------- With Abel's suggestion added to v5 ---------------
>
> I've added the hunk suggested by Abel in the thread on top of v5, and
> the following are results for the same set of benchmarks, but only for
> the machine running in NPS1 mode.
>
> sis_short_v5.1: 6.2.0-rc6 tip sched/core + this series + Abel's suggestion
>
> ~~~~~~~~~~~~~
> ~ hackbench ~
> ~~~~~~~~~~~~~
>
> o NPS1
>
> Test: tip sis_short_v5.1
> 1-groups: 4.38 (0.00 pct) 4.08 (6.84 pct)
> 2-groups: 5.12 (0.00 pct) 5.10 (0.39 pct)
> 4-groups: 4.21 (0.00 pct) 4.23 (-0.47 pct)
> 8-groups: 4.68 (0.00 pct) 4.69 (-0.21 pct)
> 16-groups: 6.13 (0.00 pct) 5.94 (3.09 pct)
>
> ~~~~~~~~~~~~
> ~ schbench ~
> ~~~~~~~~~~~~
>
> o NPS1
>
> #workers: tip sis_short_v5.1
> 1: 36.00 (0.00 pct) 36.00 (0.00 pct)
> 2: 37.00 (0.00 pct) 39.00 (-5.40 pct)
> 4: 41.00 (0.00 pct) 40.00 (2.43 pct)
> 8: 46.00 (0.00 pct) 46.00 (0.00 pct)
> 16: 66.00 (0.00 pct) 68.00 (-3.03 pct)
> 32: 111.00 (0.00 pct) 112.00 (-0.90 pct)
> 64: 207.00 (0.00 pct) 238.00 (-14.97 pct)
> 64: 227.00 (0.00 pct) 219.00 (3.52 pct)
> 128: 483.00 (0.00 pct) 494.00 (-2.27 pct)
> 256: 46272.00 (0.00 pct) 41280.00 (10.78 pct)
> 512: 78293.00 (0.00 pct) 79325.00 (-1.31 pct)
>
> ~~~~~~~~~~
> ~ tbench ~
> ~~~~~~~~~~
>
> o NPS1
>
> Clients: tip sis_short_v5.1
> 1 536.78 (0.00 pct) 535.90 (-0.16 pct)
> 2 1050.74 (0.00 pct) 1067.32 (1.57 pct)
> 4 1993.47 (0.00 pct) 1971.63 (-1.09 pct)
> 8 3601.77 (0.00 pct) 3599.17 (-0.07 pct)
> 16 6202.01 (0.00 pct) 6115.08 (-1.40 pct)
> 32 11544.55 (0.00 pct) 11423.52 (-1.04 pct)
> 64 21828.75 (0.00 pct) 21403.94 (-1.94 pct)
> 128 31095.92 (0.00 pct) 30783.55 (-1.00 pct)
> 256 54828.12 (0.00 pct) 55328.94 (0.91 pct)
> 512 54888.10 (0.00 pct) 53483.33 (-2.55 pct)
> 1024 48407.14 (0.00 pct) 48998.95 (1.22 pct)
>
> ~~~~~~~~~~
> ~ stream ~
> ~~~~~~~~~~
>
> o NPS1
>
> 10 Runs:
>
> Test: tip sis_short_v5.1
> Copy: 320576.16 (0.00 pct) 331810.14 (3.50 pct)
> Scale: 212869.80 (0.00 pct) 214725.82 (0.87 pct)
> Add: 241556.74 (0.00 pct) 242340.92 (0.32 pct)
> Triad: 250637.58 (0.00 pct) 251271.53 (0.25 pct)
>
> 100 Runs:
>
> Test: tip sis_short_v5.1
> Copy: 330058.38 (0.00 pct) 331966.60 (0.57 pct)
> Scale: 216475.85 (0.00 pct) 222777.84 (2.91 pct)
> Add: 243028.82 (0.00 pct) 250873.78 (3.22 pct)
> Triad: 252907.98 (0.00 pct) 253791.20 (0.34 pct)
>
> ~~~~~~~~~~~~~~~~
> ~ ycsb-mongodb ~
> ~~~~~~~~~~~~~~~~
>
> o NPS1:
>
> tip : 133514.00 (var: 2.07%)
> sis-short_v5.1 : 129172.67 (var: 2.32%) (-3.25%) **
>
> ~~~~~~~~~~~~~
> ~ unixbench ~
> ~~~~~~~~~~~~~
>
> o NPS1
>
> Test Metric Parallelism tip sis_short_v5.1
> unixbench-dhry2reg Hmean unixbench-dhry2reg-1 49266026.90 ( 0.00%) 49054799.90 ( -0.43%)
> unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6285063007.68 ( 0.00%) 6280424934.15 ( -0.07%)
> unixbench-syscall Amean unixbench-syscall-1 2689026.67 ( 0.00%) 2677968.03 * 0.41%*
> unixbench-syscall Amean unixbench-syscall-512 7352453.23 ( 0.00%) 7354325.40 ( -0.03%)
> unixbench-pipe Hmean unixbench-pipe-1 2467955.46 ( 0.00%) 2351117.60 * -4.73%*
> unixbench-pipe Hmean unixbench-pipe-512 295937232.39 ( 0.00%) 295769918.99 ( -0.06%)
> unixbench-spawn Hmean unixbench-spawn-1 4164.75 ( 0.00%) 4331.89 * 4.01%*
> unixbench-spawn Hmean unixbench-spawn-512 79626.61 ( 0.00%) 77865.32 * -2.21%*
> unixbench-execl Hmean unixbench-execl-1 4112.25 ( 0.00%) 4145.85 ( 0.82%)
> unixbench-execl Hmean unixbench-execl-512 11785.88 ( 0.00%) 11935.41 ( 1.27%)
>
> ~~~~~~~~~~~
> ~ SpecJBB ~
> ~~~~~~~~~~~
>
> o NPS1 - Normalized to baseline (tip)
>
> Kernel tip sis_short_V5.1
> Max-jOPS 100% 91.99% ** (-8.01%)
> Critical-jOPS 100% 99.29%
>
> ~~~~~~~~~~~~~~~~~~
> ~ DeathStarBench ~
> ~~~~~~~~~~~~~~~~~~
>
> o NPS1 - Throughput normalized to baseline (tip)
>
> Kernel : tip sis_short_V5.1
> 8C/16T : 100.00% 93.75% ** (-6.25%)
> 16C/32T : 100.00% 100.43%
> 32C/64T : 100.00% 101.12%
> 64C/128T : 100.00% 100.21%
>
> o Follow wake_affine_bias() if waker's cpu and prev_cpu are on same LLC?
>
> There are cases with Abel's suggestion where some of the larger
> benchmarks regress. I wonder if wake_affine_bias() can still be
> considered for short running tasks if the waker's CPU and the
> prev_cpu share caches. In the DeathStarBench 8C/16T case, the
> services are all pinned to the CPUs of the same MC domain. The
> regression observed seems to arise from the missed opportunity
> to distribute load among the CPUs sharing the same L3. I do not
> have data for this currently, but I'll update the thread with any
> findings.
Good observation. Just like select_idle_sibling(), the prev cpu should only
be chosen if the target cpu and the prev cpu share the LLC. My next
version still prefers the current cpu over the prev cpu, and adds the wakee_flips
check to aggregate tasks on the current CPU only when the waker and wakee wake
up each other frequently. This could somehow mitigate the problem Abel
mentioned, that too many short tasks get stacked on the current CPU.
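
To illustrate the constraint, a rough sketch assuming the existing
cpus_share_cache() and available_idle_cpu() helpers; the helper name below is
made up for the example and this is not the actual v6 code:

    /*
     * Sketch only: if a short task were ever to fall back to its
     * previous CPU instead of the waking CPU, only allow that when
     * prev shares the LLC with the target and is idle, mirroring
     * what select_idle_sibling() requires.
     */
    static inline bool short_task_prev_ok(int target, int prev)
    {
        return prev != target &&
               cpus_share_cache(prev, target) &&
               available_idle_cpu(prev);
    }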

thanks,
Chenyu
>
> I'll also queue up a Redis run from mmtests to see if I can reproduce
> Abel's observations on my system; however, I'm not sure if the
> utilization will be high enough to emulate the same scenario as
> Abel's prod environment. If the migrations within the same MC
>
I'll launch more tests and send out the results later.

thanks,
Chenyu

2023-02-20 07:27:13

by Honglei Wang

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU



On 2023/2/20 12:58, Chen Yu wrote:
> On 2023-02-17 at 16:35:24 +0800, Honglei Wang wrote:
>>
>>
>> On 2023/2/16 20:55, Abel Wu wrote:
>>> Hi Chen,
>>>
>>> I've tested this patchset (with modification) on our Redis proxy
>>> servers, and the results seems promising.
>>>
>>> On 2/3/23 1:18 PM, Chen Yu wrote:
>>>> ...
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index aa16611c7263..d50097e5fcc1 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6489,6 +6489,20 @@ static int wake_wide(struct task_struct *p)
>>>>       return 1;
>>>>   }
>>>> +/*
>>>> + * If a task switches in and then voluntarily relinquishes the
>>>> + * CPU quickly, it is regarded as a short duration task.
>>>> + *
>>>> + * SIS_SHORT tries to wake up the short wakee on current CPU. This
>>>> + * aims to avoid race condition among CPUs due to frequent context
>>>> + * switch.
>>>> + */
>>>> +static inline int is_short_task(struct task_struct *p)
>>>> +{
>>>> +    return sched_feat(SIS_SHORT) && p->se.dur_avg &&
>>>> +           ((p->se.dur_avg * 8) < sysctl_sched_min_granularity);
>>>> +}
>>>
>>> I changed the factor to fit into the shape of tasks in question.
>>>
>>>     static inline int is_short_task(struct task_struct *p)
>>>     {
>>>         u64 dur = sysctl_sched_min_granularity / 8;
>>>
>>>         if (!sched_feat(SIS_SHORT) || !p->se.dur_avg)
>>>             return false;
>>>
>>>         /*
>>>          * Bare tracepoint to allow dynamically changing
>>>          * the threshold.
>>>          */
>>>         trace_sched_short_task_tp(p, &dur);
>>>
>>>         return p->se.dur_avg < dur;
>>>     }
>>>
>>> I'm not sure it is the right way to provide such flexibility, but
>>> definition of 'short' can be workload specific.
>>>
>>>> +
>>>>   /*
>>>>    * The purpose of wake_affine() is to quickly determine on which
>>>> CPU we can run
>>>>    * soonest. For the purpose of speed we only consider the waking
>>>> and previous
>>>> @@ -6525,6 +6539,11 @@ wake_affine_idle(int this_cpu, int prev_cpu,
>>>> int sync)
>>>>       if (available_idle_cpu(prev_cpu))
>>>>           return prev_cpu;
>>>> +    /* The only running task is a short duration one. */
>>>> +    if (cpu_rq(this_cpu)->nr_running == 1 &&
>>>> +        is_short_task(rcu_dereference(cpu_curr(this_cpu))))
>>>> +        return this_cpu;
>>>
>>> Since the proxy server handles simple data delivery, the tasks are
>>> generally short running ones and hate task stacking, which may
>>> introduce scheduling latency (even if there are only 2 short tasks
>>> competing with each other). So this part brings a slight regression on
>>> the proxy case. But I still think this is good for most cases.
>>>
>>> Speaking of task stacking, I found wake_affine_weight() can be
>>> much more dangerous. It chooses the less loaded one between the
>>> prev & this cpu as a candidate, so 'small' tasks can be easily
>>> stacked on this cpu when waking up several tasks at one time if
>>> this cpu is unloaded. This really hurts if the 'small' tasks are
>>> latency-sensitive, although wake_affine_weight() does the right
>>> thing from the point of view of 'load'.
>>>
>>> The following change greatly reduced the p99lat of Redis service
>>> from 150ms to 0.9ms, at exactly the same throughput (QPS).
>>>
>>> @@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd, struct
>>> task_struct *p,
>>>     s64 this_eff_load, prev_eff_load;
>>>     unsigned long task_load;
>>>
>>> +    if (is_short_task(p))
>>> +        return nr_cpumask_bits;
>>> +
>>>     this_eff_load = cpu_load(cpu_rq(this_cpu));
>>>
>>>     if (sync) {
>>>
>>> I know that 'short' tasks are not necessarily 'small' tasks, e.g.
>>> their sleeping duration is small or they have large weights, but this works
>>> really well for this case. This is partly because delivering data
>>> is memory bandwidth intensive and hence prefers cache-hot cpus. And I
>>> think this is also applicable for general purposes: do NOT let
>>> the short running tasks suffer from cache misses caused by
>>> migration.
>>>
>>
>> Redis is a bit special. It runs quickly and is really sensitive to scheduling
>> latency. The purpose of this 'short task' feature from Yu is to mitigate the
>> migration and tend to place the waking task on the local cpu, which is somehow
>> the opposite of what workloads such as Redis want. The changes you did remind me
>> of the latency-prio stuff. Maybe we can do something based on both the 'short
>> task' and 'latency-prio' to make your changes more general. Thoughts?
>>
> Looks reasonable, I suppose you were referring to 'latency nice' proposed by
> Vincent. For now I'd like to keep this patch simple enough; later we can
> extend it.
>

Yep, agree to keep this patch as is for now.

Thanks,
Honglei

> thanks,
> Chenyu