This patchset gathers some cleanups but also improvements and fixes for
scheduling use cases that involve a lot of sched_idle tasks. In such cases,
the sched_idle tasks significantly impact the behavior of other tasks
whereas they should not. These patches don't impact other "normal" use cases.
I found these changes while studying Zhang's problem [1] but AFAICT,
these patches don't fix his problem.
[1] https://lore.kernel.org/lkml/[email protected]/
Vincent Guittot (4):
sched/fair: make sure to try to detach at least one movable task
sched/fair: cleanup loop_max and loop_break
sched/fair: move call to list_last_entry() in detach_tasks
sched/fair: limit sched slice duration
kernel/sched/core.c | 6 +-----
kernel/sched/fair.c | 29 +++++++++++++++++------------
kernel/sched/sched.h | 6 ++++++
3 files changed, 24 insertions(+), 17 deletions(-)
--
2.17.1
Move the call to list_last_entry() in detach_tasks() after testing
loop_max and loop_break.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6972a1a29a48..260a55ac462f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8047,8 +8047,6 @@ static int detach_tasks(struct lb_env *env)
if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
break;
- p = list_last_entry(tasks, struct task_struct, se.group_node);
-
env->loop++;
/*
* We've more or less seen every task there is, call it quits
@@ -8065,6 +8063,8 @@ static int detach_tasks(struct lb_env *env)
break;
}
+ p = list_last_entry(tasks, struct task_struct, se.group_node);
+
if (!can_migrate_task(p, env))
goto next;
--
2.17.1
sched_nr_migrate_break is set to a fixed value and never changes, so we can
replace it with a define, SCHED_NR_MIGRATE_BREAK.
Also, adjust SCHED_NR_MIGRATE_BREAK to be aligned with the init value
of sysctl_sched_nr_migrate, which can be initialized to different values.
Then, use SCHED_NR_MIGRATE_BREAK to init sysctl_sched_nr_migrate.
The behavior stays unchanged unless you modify sysctl_sched_nr_migrate
through debugfs.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/core.c | 6 +-----
kernel/sched/fair.c | 11 ++++-------
kernel/sched/sched.h | 6 ++++++
3 files changed, 11 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 64c08993221b..a21e817bdd1c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -142,11 +142,7 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
* Number of tasks to iterate in a single balance run.
* Limited because this is done with IRQs disabled.
*/
-#ifdef CONFIG_PREEMPT_RT
-const_debug unsigned int sysctl_sched_nr_migrate = 8;
-#else
-const_debug unsigned int sysctl_sched_nr_migrate = 32;
-#endif
+const_debug unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
__read_mostly int scheduler_running;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02b7b808e186..6972a1a29a48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8012,8 +8012,6 @@ static struct task_struct *detach_one_task(struct lb_env *env)
return NULL;
}
-static const unsigned int sched_nr_migrate_break = 32;
-
/*
* detach_tasks() -- tries to detach up to imbalance load/util/tasks from
* busiest_rq, as part of a balancing operation within domain "sd".
@@ -8062,7 +8060,7 @@ static int detach_tasks(struct lb_env *env)
/* take a breather every nr_migrate tasks */
if (env->loop > env->loop_break) {
- env->loop_break += sched_nr_migrate_break;
+ env->loop_break += SCHED_NR_MIGRATE_BREAK;
env->flags |= LBF_NEED_BREAK;
break;
}
@@ -10103,14 +10101,13 @@ static int load_balance(int this_cpu, struct rq *this_rq,
struct rq *busiest;
struct rq_flags rf;
struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
-
struct lb_env env = {
.sd = sd,
.dst_cpu = this_cpu,
.dst_rq = this_rq,
.dst_grpmask = sched_group_span(sd->groups),
.idle = idle,
- .loop_break = sched_nr_migrate_break,
+ .loop_break = SCHED_NR_MIGRATE_BREAK,
.cpus = cpus,
.fbq_type = all,
.tasks = LIST_HEAD_INIT(env.tasks),
@@ -10219,7 +10216,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
env.dst_cpu = env.new_dst_cpu;
env.flags &= ~LBF_DST_PINNED;
env.loop = 0;
- env.loop_break = sched_nr_migrate_break;
+ env.loop_break = SCHED_NR_MIGRATE_BREAK;
/*
* Go back to "more_balance" rather than "redo" since we
@@ -10251,7 +10248,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
*/
if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
- env.loop_break = sched_nr_migrate_break;
+ env.loop_break = SCHED_NR_MIGRATE_BREAK;
goto redo;
}
goto out_all_pinned;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3ccd35c22f0f..d5cfd1b5bfe9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2423,6 +2423,12 @@ extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);
+#ifdef CONFIG_PREEMPT_RT
+#define SCHED_NR_MIGRATE_BREAK 8
+#else
+#define SCHED_NR_MIGRATE_BREAK 32
+#endif
+
extern const_debug unsigned int sysctl_sched_nr_migrate;
extern const_debug unsigned int sysctl_sched_migration_cost;
--
2.17.1
In the presence of a lot of small-weight tasks like sched_idle tasks, normal
or high-weight tasks can see their ideal runtime (sched_slice) increase to
hundreds of ms whereas it normally stays below sysctl_sched_latency.
2 normal tasks running on a CPU will have a max sched_slice of 12ms
(half of the sched_period). This means that they will make progress
every sysctl_sched_latency period.
If we now add 1000 idle tasks on the CPU, the sched_period becomes
3006 ms and the ideal runtime of the normal tasks becomes 609 ms.
It will even become 1500ms if the idle tasks belong to an idle cgroup.
This means that the scheduler will only look at picking another waiting
task after 609ms (respectively 1500ms) of running time. The idle tasks
significantly change the way the 2 normal tasks interleave their running
time slots whereas they should have only a small impact.
Such a long sched_slice can significantly delay the release of resources,
as tasks can wait hundreds of ms before their next running slot just
because of idle tasks queued on the rq.
Cap the ideal_runtime to sysctl_sched_latency when comparing to the next
waiting task to make sure that tasks will regularly make progress and will
not be significantly impacted by idle/background tasks queued on the rq.
Signed-off-by: Vincent Guittot <[email protected]>
---
While studying the problem, I have also considered subtracting
cfs.idle_h_nr_running before computing the sched_slice, but we can have
a quite similar problem with low-weight normal tasks/cgroups, so I have
decided to keep this solution.
Also, this solution doesn't completely remove the impact of idle tasks
on the scheduling pattern, but it caps the running slice of a task to a max
value of 2*sysctl_sched_latency.
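For reference, the 12ms/609ms/1500ms figures above can be reproduced with
the small standalone C snippet below. It is plain userspace arithmetic, not
kernel code, and it assumes the scaled defaults of a multi-CPU system
(sysctl_sched_latency = 24ms, sysctl_sched_min_granularity = 3ms) and the
usual se weights (nice 0 = 1024, SCHED_IDLE = 3):

#include <stdio.h>

int main(void)
{
	/* assumed scaled defaults: latency 24ms, min granularity 3ms */
	const unsigned long long latency = 24, min_gran = 3;
	const unsigned long long nice0_w = 1024, idle_w = 3;
	unsigned long long period, rq_weight;

	/* 2 normal tasks: period = sched_latency, each gets half of it */
	printf("2 normal tasks       : %llu ms\n",
	       latency * nice0_w / (2 * nice0_w));

	/* 1002 runnable tasks: period = nr_running * min_granularity */
	period = 1002 * min_gran;			/* 3006 ms */

	/* 1000 idle tasks queued directly on the rq */
	rq_weight = 2 * nice0_w + 1000 * idle_w;	/* 5048 */
	printf("+1000 idle tasks     : %llu ms\n",
	       period * nice0_w / rq_weight);

	/* same idle tasks grouped in one idle cgroup (group weight 3) */
	rq_weight = 2 * nice0_w + idle_w;		/* 2051 */
	printf("idle tasks in cgroup : %llu ms\n",
	       period * nice0_w / rq_weight);

	return 0;
}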
kernel/sched/fair.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 260a55ac462f..96fedd0ab5fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4599,6 +4599,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
if (delta < 0)
return;
+ ideal_runtime = min_t(u64, ideal_runtime, sysctl_sched_latency);
+
if (delta > ideal_runtime)
resched_curr(rq_of(cfs_rq));
}
--
2.17.1
During load balance, we try at most env->loop_max times to move a task.
But it can happen that the loop_max LRU tasks (i.e. the tail of
the cfs_tasks list) can't be moved to dst_cpu because of affinity.
In this case, loop in the list until we find at least one.
The maximum number of detached tasks remains the same as before.
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da388657d5ac..02b7b808e186 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8052,8 +8052,12 @@ static int detach_tasks(struct lb_env *env)
p = list_last_entry(tasks, struct task_struct, se.group_node);
env->loop++;
- /* We've more or less seen every task there is, call it quits */
- if (env->loop > env->loop_max)
+ /*
+ * We've more or less seen every task there is, call it quits
+ * unless we haven't found any movable task yet.
+ */
+ if (env->loop > env->loop_max &&
+ !(env->flags & LBF_ALL_PINNED))
break;
/* take a breather every nr_migrate tasks */
@@ -10182,7 +10186,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
if (env.flags & LBF_NEED_BREAK) {
env.flags &= ~LBF_NEED_BREAK;
- goto more_balance;
+ /* Stop if we tried all running tasks */
+ if (env.loop < busiest->nr_running)
+ goto more_balance;
}
/*
--
2.17.1
Picked up the first three.
On Thu, Aug 25, 2022 at 02:27:26PM +0200, Vincent Guittot wrote:
> In presence of a lot of small weight tasks like sched_idle tasks, normal
> or high weight tasks can see their ideal runtime (sched_slice) to increase
> to hundreds ms whereas it normally stays below sysctl_sched_latency.
>
> 2 normal tasks running on a CPU will have a max sched_slice of 12ms
> (half of the sched_period). This means that they will make progress
> every sysctl_sched_latency period.
>
> If we now add 1000 idle tasks on the CPU, the sched_period becomes
Surely people aren't actually having that many runnable tasks and this
is a device for the argument?
> 3006 ms and the ideal runtime of the normal tasks becomes 609 ms.
> It will even become 1500ms if the idle tasks belongs to an idle cgroup.
> This means that the scheduler will look for picking another waiting task
> after 609ms running time (1500ms respectively). The idle tasks change
> significantly the way the 2 normal tasks interleave their running time
> slot whereas they should have a small impact.
>
> Such long sched_slice can delay significantly the release of resources
> as the tasks can wait hundreds of ms before the next running slot just
> because of idle tasks queued on the rq.
>
> Cap the ideal_runtime to sysctl_sched_latency when comparing to the next
> waiting task to make sure that tasks will regularly make progress and will
> not be significantly impacted by idle/background tasks queued on the rq.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
>
> While studying the problem, I have also considered to substract
> cfs.idle_h_nr_running before computing the sched_slice but we can have
> quite similar problem with low weight bormal task/cgroup so I have decided
> to keep this solution.
That ^... my proposal below has the same problem.
This:
> Also, this solution doesn't completly remove the impact of idle tasks
> in the scheduling pattern but cap the running slice of a task to a max
> value of 2*sysctl_sched_latency.
I'm failing to see how.
> kernel/sched/fair.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 260a55ac462f..96fedd0ab5fa 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4599,6 +4599,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> if (delta < 0)
> return;
(I'm thinking that early return is a bit pointless, a negative value
won't be larger than ideal_time anyway)
> + ideal_runtime = min_t(u64, ideal_runtime, sysctl_sched_latency);
> +
(superfluous whitespace -- twice, once after the = once this whole extra
line)
> if (delta > ideal_runtime)
> resched_curr(rq_of(cfs_rq));
> }
Urgghhhh..
so delta is in vtime here, while sched_latency is not, so the heavier
the queue, the larger this value becomes.
Given those 1000 idle tasks, rq-weight would be around 2048; however due
to nr_running being insane, sched_slice() ends up being something like:
1000 * min_gran * 2 / 2048
which is around ~min_gran and so won't come near to latency.
since we already have idle_min_gran; how about something like this?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index efceb670e755..8dd18fc0affa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -706,12 +706,14 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
*
* p = (nr <= nl) ? l : l*nr/nl
*/
-static u64 __sched_period(unsigned long nr_running)
+static u64 __sched_period(unsigned long nr_running, unsigned long nr_idle)
{
- if (unlikely(nr_running > sched_nr_latency))
- return nr_running * sysctl_sched_min_granularity;
- else
- return sysctl_sched_latency;
+ u64 period = 0;
+
+ period += nr_running * sysctl_sched_min_granularity;
+ period += nr_idle * sysctl_sched_idle_min_granularity;
+
+ return max_t(u64, period, sysctl_sched_latency);
}
static bool sched_idle_cfs_rq(struct cfs_rq *cfs_rq);
@@ -724,15 +726,25 @@ static bool sched_idle_cfs_rq(struct cfs_rq *cfs_rq);
*/
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- unsigned int nr_running = cfs_rq->nr_running;
+ unsigned int nr_idle = cfs_rq->idle_nr_running;
+ unsigned int nr_running = cfs_rq->nr_running - nr_idle;
struct sched_entity *init_se = se;
unsigned int min_gran;
u64 slice;
- if (sched_feat(ALT_PERIOD))
- nr_running = rq_of(cfs_rq)->cfs.h_nr_running;
+ if (sched_feat(ALT_PERIOD)) {
+ nr_idle = rq_of(cfs_rq)->cfs.idle_h_nr_running;
+ nr_running = rq_of(cfs_rq)->cfs.h_nr_running - nr_idle;
+ }
+
+ if (!se->on_rq) {
+ if (se_is_idle(se))
+ nr_idle++;
+ else
+ nr_running++;
+ }
- slice = __sched_period(nr_running + !se->on_rq);
+ slice = __sched_period(nr_running, nr_idle);
for_each_sched_entity(se) {
struct load_weight *load;
This changes how we compute the period depending on the composition. It
suffers the exact same problem you had earlier though in that it doesn't
work for the other low-weight cases. But perhaps we can come up with a
better means of computing the period that *does* consider them?
As said before;... urgh! bit of a sticky problem this.
On Fri, 9 Sept 2022 at 13:14, Peter Zijlstra <[email protected]> wrote:
>
>
> Picked up the first three.
>
> On Thu, Aug 25, 2022 at 02:27:26PM +0200, Vincent Guittot wrote:
> > In presence of a lot of small weight tasks like sched_idle tasks, normal
> > or high weight tasks can see their ideal runtime (sched_slice) to increase
> > to hundreds ms whereas it normally stays below sysctl_sched_latency.
> >
> > 2 normal tasks running on a CPU will have a max sched_slice of 12ms
> > (half of the sched_period). This means that they will make progress
> > every sysctl_sched_latency period.
> >
> > If we now add 1000 idle tasks on the CPU, the sched_period becomes
>
> Surely people aren't actually having that many runnable tasks and this
> is a device for the argument?
>
> > 3006 ms and the ideal runtime of the normal tasks becomes 609 ms.
> > It will even become 1500ms if the idle tasks belongs to an idle cgroup.
> > This means that the scheduler will look for picking another waiting task
> > after 609ms running time (1500ms respectively). The idle tasks change
> > significantly the way the 2 normal tasks interleave their running time
> > slot whereas they should have a small impact.
> >
> > Such long sched_slice can delay significantly the release of resources
> > as the tasks can wait hundreds of ms before the next running slot just
> > because of idle tasks queued on the rq.
> >
> > Cap the ideal_runtime to sysctl_sched_latency when comparing to the next
> > waiting task to make sure that tasks will regularly make progress and will
> > not be significantly impacted by idle/background tasks queued on the rq.
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> >
> > While studying the problem, I have also considered to substract
> > cfs.idle_h_nr_running before computing the sched_slice but we can have
> > quite similar problem with low weight bormal task/cgroup so I have decided
> > to keep this solution.
>
> That ^... my proposal below has the same problem.
>
> This:
>
> > Also, this solution doesn't completly remove the impact of idle tasks
> > in the scheduling pattern but cap the running slice of a task to a max
> > value of 2*sysctl_sched_latency.
>
> I'm failing to see how.
The 1st part of check_preempt_tick ensures that we wait at least
sysctl_sched_min_granularity but not more than ideal_runtime before
possibly picking another entity.
Once both conditions above have been tested, we check that the vruntime diff
with the 1st pending entity is not larger than sysctl_sched_latency.
Normally sched_slice should return an ideal_runtime value less than
sysctl_sched_latency. But we also want to provide a minimum runtime to
all tasks so we increase the sched_period when the number of tasks
increases too much.
The case described above is a corner case because of the large
difference between the tasks' prio.
Now, let's assume that we have only 1 normal task and 1000 idle tasks. I
don't see any problem with providing a large ideal runtime for this
normal task. The problem comes when you have at least 2 normal tasks
as we don't expect the other normal task to wait for several hundreds
of ms before running.
That's why the comparison is done against the diff of vruntime; an idle
task running for a 4ms tick will increase its vruntime by +1366ms,
which is comparable with the slice duration of the normal task. On the
other side, a 4ms tick will increase the vruntime of a nice 0 task by
4ms only. So the vruntime diff will quickly move above
sysctl_sched_latency.
That being said, it doesn't completely fix the case of 2 nice -20 tasks running
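As a back-of-the-envelope check (assuming the standard nice-to-weight
table: SCHED_IDLE weight 3, nice 0 weight 1024, nice -20 weight 88761),
a 4ms tick adds roughly delta_exec * 1024 / weight to the vruntime:
- idle task:     4ms * 1024 / 3     ~ 1365ms
- nice 0 task:   4ms * 1024 / 1024  =    4ms
- nice -20 task: 4ms * 1024 / 88761 ~ 0.05ms
So the vruntime gap against idle tasks grows very quickly past
sysctl_sched_latency, but between two nice -20 tasks it grows much slower
than real time, which is why the cap doesn't help in that last case.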
>
> > kernel/sched/fair.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 260a55ac462f..96fedd0ab5fa 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4599,6 +4599,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > if (delta < 0)
> > return;
>
> (I'm thinking that early return is a bit pointless, a negative value
> won't be larger than ideal_time anyway)
yes
>
> > + ideal_runtime = min_t(u64, ideal_runtime, sysctl_sched_latency);
> > +
>
> (superfluous whitespace -- twice, once after the = once this whole extra
> line)
sorry for that...
>
> > if (delta > ideal_runtime)
> > resched_curr(rq_of(cfs_rq));
> > }
>
> Urgghhhh..
>
> so delta is in vtime here, while sched_latency is not, so the heavier
> the queue, the larger this value becomes.
>
> Given those 1000 idle tasks, rq-weight would be around 2048; however due
> to nr_running being insane, sched_slice() ends up being something like:
rq weight will be 1000*3 + 2*1024 = 5048
sched_period becomes 1002 * min_gran = 3006ms
an idle task gets a slice of weight(3) * (1002 * min_gran) / 5048 =
3 * 1002 / 5048 * min_gran ~ 0.6 min_gran
a normal task gets a slice of weight(1024) * (1002 * min_gran) / 5048 =
1024 * 1002 / 5048 * min_gran ~ 203 min_gran
if the 1000 tasks are in an idle sched group, that's even worse because
the rq weight decreases to 3 + 2*1024 = 2051 and the slice increases to
~500 min_gran
note that if we use 2 tasks at nice -20 and 1000 tasks at nice 19 we
get a similar slice duration (around 500 min_gran) so we can't really
rely on idle_nr_running
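(With the scaled default min_gran of 3ms, that is roughly 1.8ms per idle
task and ~609ms for each normal task, ~1500ms in the idle-cgroup case,
i.e. the same numbers as in the changelog above.)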
>
> 1000 * min_gran * 2 / 2048
>
> which is around ~min_gran and so won't come near to latency.
>
>
> since we already have idle_min_gran; how about something like this?
the idle_min_gran will divide the sched_slice by 4, which can still
remain quite high
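(For concreteness, assuming sysctl_sched_idle_min_granularity keeps its
0.75ms default: the period becomes 2 * 3ms + 1000 * 0.75ms = 756ms and
the normal task's slice 756ms * 1024 / 5048 ~ 153ms, i.e. roughly a
quarter of the 609ms above but still far beyond sysctl_sched_latency.)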
The main problem with my proposal is that tasks with a negative nice prio
can still get a larger sched_slice because their vruntime increases slower
than real time
>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index efceb670e755..8dd18fc0affa 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -706,12 +706,14 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
> *
> * p = (nr <= nl) ? l : l*nr/nl
> */
> -static u64 __sched_period(unsigned long nr_running)
> +static u64 __sched_period(unsigned long nr_running, unsigned long nr_idle)
> {
> - if (unlikely(nr_running > sched_nr_latency))
> - return nr_running * sysctl_sched_min_granularity;
> - else
> - return sysctl_sched_latency;
> + u64 period = 0;
> +
> + period += nr_running * sysctl_sched_min_granularity;
> + period += nr_idle * sysctl_sched_idle_min_granularity;
> +
> + return max_t(u64, period, sysctl_sched_latency);
> }
>
> static bool sched_idle_cfs_rq(struct cfs_rq *cfs_rq);
> @@ -724,15 +726,25 @@ static bool sched_idle_cfs_rq(struct cfs_rq *cfs_rq);
> */
> static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> - unsigned int nr_running = cfs_rq->nr_running;
> + unsigned int nr_idle = cfs_rq->idle_nr_running;
> + unsigned int nr_running = cfs_rq->nr_running - nr_idle;
> struct sched_entity *init_se = se;
> unsigned int min_gran;
> u64 slice;
>
> - if (sched_feat(ALT_PERIOD))
> - nr_running = rq_of(cfs_rq)->cfs.h_nr_running;
> + if (sched_feat(ALT_PERIOD)) {
> + nr_idle = rq_of(cfs_rq)->cfs.idle_h_nr_running;
> + nr_running = rq_of(cfs_rq)->cfs.h_nr_running - nr_idle;
> + }
> +
> + if (!se->on_rq) {
> + if (se_is_idle(se))
> + nr_idle++;
> + else
> + nr_running++;
> + }
>
> - slice = __sched_period(nr_running + !se->on_rq);
> + slice = __sched_period(nr_running, nr_idle);
>
> for_each_sched_entity(se) {
> struct load_weight *load;
>
>
> This changes how the compute the period depending on the composition. It
> suffers the exact same problem you had earlier though in that it doesn't
> work for the other low-weight cases. But perhaps we can come up with a
> better means of computing the period that *does* consider them?
>
> As said before;... urgh! bit of a sticky problem this.
On Friday 09 Sept 2022 at 16:03:31 (+0200), Vincent Guittot wrote:
> On Fri, 9 Sept 2022 at 13:14, Peter Zijlstra <[email protected]> wrote:
> >
> >
> > Picked up the first three.
> >
> > On Thu, Aug 25, 2022 at 02:27:26PM +0200, Vincent Guittot wrote:
> > > In presence of a lot of small weight tasks like sched_idle tasks, normal
> > > or high weight tasks can see their ideal runtime (sched_slice) to increase
> > > to hundreds ms whereas it normally stays below sysctl_sched_latency.
> > >
> > > 2 normal tasks running on a CPU will have a max sched_slice of 12ms
> > > (half of the sched_period). This means that they will make progress
> > > every sysctl_sched_latency period.
> > >
> > > If we now add 1000 idle tasks on the CPU, the sched_period becomes
> >
> > Surely people aren't actually having that many runnable tasks and this
> > is a device for the argument?
> >
> > > 3006 ms and the ideal runtime of the normal tasks becomes 609 ms.
> > > It will even become 1500ms if the idle tasks belongs to an idle cgroup.
> > > This means that the scheduler will look for picking another waiting task
> > > after 609ms running time (1500ms respectively). The idle tasks change
> > > significantly the way the 2 normal tasks interleave their running time
> > > slot whereas they should have a small impact.
> > >
> > > Such long sched_slice can delay significantly the release of resources
> > > as the tasks can wait hundreds of ms before the next running slot just
> > > because of idle tasks queued on the rq.
> > >
> > > Cap the ideal_runtime to sysctl_sched_latency when comparing to the next
> > > waiting task to make sure that tasks will regularly make progress and will
> > > not be significantly impacted by idle/background tasks queued on the rq.
> > >
> > > Signed-off-by: Vincent Guittot <[email protected]>
> > > ---
> > >
> > > While studying the problem, I have also considered to substract
> > > cfs.idle_h_nr_running before computing the sched_slice but we can have
> > > quite similar problem with low weight bormal task/cgroup so I have decided
> > > to keep this solution.
> >
> > That ^... my proposal below has the same problem.
> >
> > This:
> >
> > > Also, this solution doesn't completly remove the impact of idle tasks
> > > in the scheduling pattern but cap the running slice of a task to a max
> > > value of 2*sysctl_sched_latency.
> >
> > I'm failing to see how.
>
> The 1st part of check_preempt_tick ensures that we wait at least
> sysctl_sched_min_granularity but not more than ideal_runtime before
> possibly picking another entity.
>
> Once both conditions above tested, we check that the vruntime diff
> with the 1st pending entity is not larger than a sysctl_sched_latency.
>
> Normally sched_slice should return an ideal_runtime value less than
> sysctl_sched_latency. But we also want to provide a minimum runtime to
> all tasks so we increase the sched_period when the number of tasks
> increases too much.
>
> The case described above is a corner case because of the large
> difference between the tasks' prio.
>
> Now, let assume that we have only 1 normal task and 1000 idle tasks, I
> don't see any problem with providing a large ideal runtime for this
> normal task. The problem comes when you have at least 2 normal tasks
> as we don't expect the other normal task to wait for several hundreds
> of ms before running.
>
> That's why the comparison is done against the diff of vruntime; idle
> task running for a 4ms tick will increase its vruntime with + 1366ms
> which is comparable with the slice duration of the normal task. On the
> other side, a 4ms tick will increase the vruntime of a nice 0 task to
> 4ms only. So the vruntime diff will quickly move above the
> sysctl_sched_latency.
>
> That being said, it doesn't completely fix the case of 2 nice -20 task runnings
>
> >
> > > kernel/sched/fair.c | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 260a55ac462f..96fedd0ab5fa 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -4599,6 +4599,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > > if (delta < 0)
> > > return;
> >
> > (I'm thinking that early return is a bit pointless, a negative value
> > won't be larger than ideal_time anyway)
>
> yes
>
> >
> > > + ideal_runtime = min_t(u64, ideal_runtime, sysctl_sched_latency);
> > > +
scaling sysctl_sched_latency with curr's weight should fix the problem for high-prio tasks
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4599,7 +4599,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
if (delta < 0)
return;
- ideal_runtime = min_t(u64, ideal_runtime, sysctl_sched_latency);
+ ideal_runtime = min_t(u64, ideal_runtime,
+ calc_delta_fair(sysctl_sched_latency, curr));
if (delta > ideal_runtime)
resched_curr(rq_of(cfs_rq));
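(Rough numbers, assuming a scaled sysctl_sched_latency of 24ms: for a
nice 0 task calc_delta_fair() leaves the cap at 24ms, while for a
nice -20 task (weight 88761) it becomes 24ms * 1024 / 88761 ~ 0.28ms of
vruntime, which is about what such a task accrues in 24ms of real run
time. The cap on the vruntime diff thus corresponds to roughly
sysctl_sched_latency of real time whatever the priority of curr.)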
> >
> > (superfluous whitespace -- twice, once after the = once this whole extra
> > line)
>
> sorry for that...
>
> >
> > > if (delta > ideal_runtime)
> > > resched_curr(rq_of(cfs_rq));
> > > }
> >
> > Urgghhhh..
> >
> > so delta is in vtime here, while sched_latency is not, so the heavier
> > the queue, the larger this value becomes.
> >
> > Given those 1000 idle tasks, rq-weight would be around 2048; however due
> > to nr_running being insane, sched_slice() ends up being something like:
>
> rq weight will be 1000*3+2*1024=5048
> sched_period becomes 1002 * min_gran = 3006ms
>
> idle task got a slice of weight(3) * (1002 min_gran) / 5048 =
> 3002/5048 * min_gran
>
> normal task got a slice of weight(1024) * (1002 min_gran) / 5048 =
> 1024*1002*5048 * min_gran ~ 200 min_gran
>
> if the 1000 task are in a idle sched group, that even worth because
> the rq weight decrease to 3+2*1024 = 2051 and the slice increase to
> 500 min_gran
>
> note that if we use 2 tasks nice -20 and 1000 tasks with nice 19 we
> have similar slice duration (around 500 min_gran) so we can't really
> rely on idle_nr_running
>
> >
> > 1000 * min_gran * 2 / 2048
> >
> > which is around ~min_gran and so won't come near to latency.
> >
> >
> > since we already have idle_min_gran; how about something like this?
>
> the idl_min gran will divide by 4 the sched_slice which can still
> remain quite high
>
> The main problem with my proposal is that task with negative nice prio
> can still get larger sched_slice because vruntime increases slower
> than real time
>
> >
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index efceb670e755..8dd18fc0affa 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -706,12 +706,14 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
> > *
> > * p = (nr <= nl) ? l : l*nr/nl
> > */
> > -static u64 __sched_period(unsigned long nr_running)
> > +static u64 __sched_period(unsigned long nr_running, unsigned long nr_idle)
> > {
> > - if (unlikely(nr_running > sched_nr_latency))
> > - return nr_running * sysctl_sched_min_granularity;
> > - else
> > - return sysctl_sched_latency;
> > + u64 period = 0;
> > +
> > + period += nr_running * sysctl_sched_min_granularity;
> > + period += nr_idle * sysctl_sched_idle_min_granularity;
> > +
> > + return max_t(u64, period, sysctl_sched_latency);
> > }
> >
> > static bool sched_idle_cfs_rq(struct cfs_rq *cfs_rq);
> > @@ -724,15 +726,25 @@ static bool sched_idle_cfs_rq(struct cfs_rq *cfs_rq);
> > */
> > static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > {
> > - unsigned int nr_running = cfs_rq->nr_running;
> > + unsigned int nr_idle = cfs_rq->idle_nr_running;
> > + unsigned int nr_running = cfs_rq->nr_running - nr_idle;
> > struct sched_entity *init_se = se;
> > unsigned int min_gran;
> > u64 slice;
> >
> > - if (sched_feat(ALT_PERIOD))
> > - nr_running = rq_of(cfs_rq)->cfs.h_nr_running;
> > + if (sched_feat(ALT_PERIOD)) {
> > + nr_idle = rq_of(cfs_rq)->cfs.idle_h_nr_running;
> > + nr_running = rq_of(cfs_rq)->cfs.h_nr_running - nr_idle;
> > + }
> > +
> > + if (!se->on_rq) {
> > + if (se_is_idle(se))
> > + nr_idle++;
> > + else
> > + nr_running++;
> > + }
> >
> > - slice = __sched_period(nr_running + !se->on_rq);
> > + slice = __sched_period(nr_running, nr_idle);
> >
> > for_each_sched_entity(se) {
> > struct load_weight *load;
> >
> >
> > This changes how the compute the period depending on the composition. It
> > suffers the exact same problem you had earlier though in that it doesn't
> > work for the other low-weight cases. But perhaps we can come up with a
> > better means of computing the period that *does* consider them?
> >
> > As said before;... urgh! bit of a sticky problem this.
On 25/08/2022 14:27, Vincent Guittot wrote:
s/sched/fair: make/sched/fair: Make
> During load balance, we try at most env->loop_max time to move a task.
> But it can happen that the loop_max LRU tasks (ie tail of
> the cfs_tasks list) can't be moved to dst_cpu because of affinity.
> In this case, loop in the list until we found at least one.
>
> The maximum of detached tasks remained the same as before.
Not sure how this relates to the patch? Isn't this given by the
`env->imbalance <= 0` check at the end of detach_tasks()?
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da388657d5ac..02b7b808e186 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8052,8 +8052,12 @@ static int detach_tasks(struct lb_env *env)
> p = list_last_entry(tasks, struct task_struct, se.group_node);
>
> env->loop++;
> - /* We've more or less seen every task there is, call it quits */
> - if (env->loop > env->loop_max)
> + /*
> + * We've more or less seen every task there is, call it quits
I never understood this `more or less`. Either we have seen all tasks or
not?
> + * unless we haven't found any movable task yet.
> + */
> + if (env->loop > env->loop_max &&
> + !(env->flags & LBF_ALL_PINNED))
> break;
>
> /* take a breather every nr_migrate tasks */
> @@ -10182,7 +10186,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>
> if (env.flags & LBF_NEED_BREAK) {
> env.flags &= ~LBF_NEED_BREAK;
> - goto more_balance;
> + /* Stop if we tried all running tasks */
Would say s/running/runnable but I see that we do use running/runnable
interchangeably.
> + if (env.loop < busiest->nr_running)
> + goto more_balance;
> }
>
> /*
IMHO, there will be some interaction with the `All tasks on this
runqueue were pinned by CPU affinity` check at the end of load_balance()?
On 25/08/2022 14:27, Vincent Guittot wrote:
s/sched/fair: move/sched/fair: Move
s/detach_tasks/detach_tasks()
> Move the call to list_last_entry() in detach_tasks() after testing
> loop_max and loop_break.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6972a1a29a48..260a55ac462f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8047,8 +8047,6 @@ static int detach_tasks(struct lb_env *env)
> if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
> break;
>
> - p = list_last_entry(tasks, struct task_struct, se.group_node);
> -
> env->loop++;
> /*
> * We've more or less seen every task there is, call it quits
> @@ -8065,6 +8063,8 @@ static int detach_tasks(struct lb_env *env)
> break;
> }
>
> + p = list_last_entry(tasks, struct task_struct, se.group_node);
> +
> if (!can_migrate_task(p, env))
> goto next;
>
Reviewed-by: Dietmar Eggemann <[email protected]>
On 25/08/2022 14:27, Vincent Guittot wrote:
> sched_nr_migrate_break is set to a fix value and never changes so we can
> replace it by a define SCHED_NR_MIGRATE_BREAK.
>
> Also, we adjust SCHED_NR_MIGRATE_BREAK to be aligned with the init value
> of sysctl_sched_nr_migrate which can be init to different values.
>
> Then, use SCHED_NR_MIGRATE_BREAK to init sysctl_sched_nr_migrate.
>
> The behavior stays unchanged unless you modify sysctl_sched_nr_migrate
> trough debugfs.
I don't quite get this sentence. The behavior would potentially change
if you change sysctl_sched_nr_migrate before this patch too?
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/core.c | 6 +-----
> kernel/sched/fair.c | 11 ++++-------
> kernel/sched/sched.h | 6 ++++++
> 3 files changed, 11 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 64c08993221b..a21e817bdd1c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -142,11 +142,7 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
> * Number of tasks to iterate in a single balance run.
> * Limited because this is done with IRQs disabled.
> */
^^^
Shouldn't this comment be removed as well?
> -#ifdef CONFIG_PREEMPT_RT
> -const_debug unsigned int sysctl_sched_nr_migrate = 8;
> -#else
> -const_debug unsigned int sysctl_sched_nr_migrate = 32;
> -#endif
> +const_debug unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
>
> __read_mostly int scheduler_running;
[...]
On Mon, 12 Sept 2022 at 10:44, Dietmar Eggemann
<[email protected]> wrote:
>
> On 25/08/2022 14:27, Vincent Guittot wrote:
>
> s/sched/fair: make/sched/fair: Make
>
> > During load balance, we try at most env->loop_max time to move a task.
> > But it can happen that the loop_max LRU tasks (ie tail of
> > the cfs_tasks list) can't be moved to dst_cpu because of affinity.
> > In this case, loop in the list until we found at least one.
> >
> > The maximum of detached tasks remained the same as before.
>
> Not sure how this relates to the patch? Isn't this given by the
> `env->imbalance <= 0` check at the end of detach_tasks()?
The number of detached tasks can't be higher than loop_max in
detach_tasks() and it remains the same with this patch, as we will
continue to loop only if we didn't find a task that can move to the cpu
>
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > kernel/sched/fair.c | 12 +++++++++---
> > 1 file changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index da388657d5ac..02b7b808e186 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8052,8 +8052,12 @@ static int detach_tasks(struct lb_env *env)
> > p = list_last_entry(tasks, struct task_struct, se.group_node);
> >
> > env->loop++;
> > - /* We've more or less seen every task there is, call it quits */
> > - if (env->loop > env->loop_max)
> > + /*
> > + * We've more or less seen every task there is, call it quits
>
> I never understood this `more or less`. Either we have seen all tasks or
> not?
>
> > + * unless we haven't found any movable task yet.
> > + */
> > + if (env->loop > env->loop_max &&
> > + !(env->flags & LBF_ALL_PINNED))
> > break;
> >
> > /* take a breather every nr_migrate tasks */
> > @@ -10182,7 +10186,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> >
> > if (env.flags & LBF_NEED_BREAK) {
> > env.flags &= ~LBF_NEED_BREAK;
> > - goto more_balance;
> > + /* Stop if we tried all running tasks */
>
> Would say s/running/runnable but I see that we do use running/runnable
> interchangeably.
>
> > + if (env.loop < busiest->nr_running)
> > + goto more_balance;
> > }
> >
> > /*
>
> IMHO, there will be some interaction with the `All tasks on this
> runqueue were pinned by CPU affinity` check at the end of load_balance()?
On Mon, 12 Sept 2022 at 10:45, Dietmar Eggemann
<[email protected]> wrote:
>
> On 25/08/2022 14:27, Vincent Guittot wrote:
> > sched_nr_migrate_break is set to a fix value and never changes so we can
> > replace it by a define SCHED_NR_MIGRATE_BREAK.
> >
> > Also, we adjust SCHED_NR_MIGRATE_BREAK to be aligned with the init value
> > of sysctl_sched_nr_migrate which can be init to different values.
> >
> > Then, use SCHED_NR_MIGRATE_BREAK to init sysctl_sched_nr_migrate.
> >
> > The behavior stays unchanged unless you modify sysctl_sched_nr_migrate
> > trough debugfs.
>
> I don't quite get this sentence. The behavior would potentially change
> if you change sysctl_sched_nr_migrate before this patch too?
the behavior is different if you change sysctl_sched_nr_migrate.
With this patch, loop_break is now aligned with the
sysctl_sched_nr_migrate value, which was not the case for
CONFIG_PREEMPT_RT. For the latter, the behavior can change if you
increase sysctl_sched_nr_migrate at runtime, because there is now at
least one break whereas there was none before as long as
sysctl_sched_nr_migrate stayed at or below 32
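(A concrete example: before this patch on PREEMPT_RT, loop_break always
started at 32 while sysctl_sched_nr_migrate defaulted to 8, so even after
raising the sysctl to, say, 16 at runtime, loop_max could never exceed
loop_break and LBF_NEED_BREAK was never set. With loop_break now starting
at SCHED_NR_MIGRATE_BREAK = 8, the same run takes a break after 8 tasks.)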
>
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > kernel/sched/core.c | 6 +-----
> > kernel/sched/fair.c | 11 ++++-------
> > kernel/sched/sched.h | 6 ++++++
> > 3 files changed, 11 insertions(+), 12 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 64c08993221b..a21e817bdd1c 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -142,11 +142,7 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
> > * Number of tasks to iterate in a single balance run.
> > * Limited because this is done with IRQs disabled.
> > */
>
> ^^^
> Shouldn't this comment be removed as well?
>
> > -#ifdef CONFIG_PREEMPT_RT
> > -const_debug unsigned int sysctl_sched_nr_migrate = 8;
> > -#else
> > -const_debug unsigned int sysctl_sched_nr_migrate = 32;
> > -#endif
> > +const_debug unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
> >
> > __read_mostly int scheduler_running;
>
> [...]
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 7e9518baed4cef76dbfa07cbffbae1e6dbc87be6
Gitweb: https://git.kernel.org/tip/7e9518baed4cef76dbfa07cbffbae1e6dbc87be6
Author: Vincent Guittot <[email protected]>
AuthorDate: Thu, 25 Aug 2022 14:27:25 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 15 Sep 2022 16:13:52 +02:00
sched/fair: Move call to list_last_entry() in detach_tasks
Move the call to list_last_entry() in detach_tasks() after testing
loop_max and loop_break.
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7b3a58f..5ffec43 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8044,8 +8044,6 @@ static int detach_tasks(struct lb_env *env)
if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
break;
- p = list_last_entry(tasks, struct task_struct, se.group_node);
-
env->loop++;
/*
* We've more or less seen every task there is, call it quits
@@ -8062,6 +8060,8 @@ static int detach_tasks(struct lb_env *env)
break;
}
+ p = list_last_entry(tasks, struct task_struct, se.group_node);
+
if (!can_migrate_task(p, env))
goto next;
The following commit has been merged into the sched/core branch of tip:
Commit-ID: c59862f8265f8060b6650ee1dc12159fe5c89779
Gitweb: https://git.kernel.org/tip/c59862f8265f8060b6650ee1dc12159fe5c89779
Author: Vincent Guittot <[email protected]>
AuthorDate: Thu, 25 Aug 2022 14:27:24 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 15 Sep 2022 16:13:51 +02:00
sched/fair: Cleanup loop_max and loop_break
sched_nr_migrate_break is set to a fix value and never changes so we can
replace it by a define SCHED_NR_MIGRATE_BREAK.
Also, we adjust SCHED_NR_MIGRATE_BREAK to be aligned with the init value
of sysctl_sched_nr_migrate which can be init to different values.
Then, use SCHED_NR_MIGRATE_BREAK to init sysctl_sched_nr_migrate.
The behavior stays unchanged unless you modify sysctl_sched_nr_migrate
trough debugfs.
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/core.c | 6 +-----
kernel/sched/fair.c | 11 ++++-------
kernel/sched/sched.h | 6 ++++++
3 files changed, 11 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b85d1b..4fa4a3d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -142,11 +142,7 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
* Number of tasks to iterate in a single balance run.
* Limited because this is done with IRQs disabled.
*/
-#ifdef CONFIG_PREEMPT_RT
-const_debug unsigned int sysctl_sched_nr_migrate = 8;
-#else
-const_debug unsigned int sysctl_sched_nr_migrate = 32;
-#endif
+const_debug unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
__read_mostly int scheduler_running;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dae3bfa..7b3a58f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8009,8 +8009,6 @@ static struct task_struct *detach_one_task(struct lb_env *env)
return NULL;
}
-static const unsigned int sched_nr_migrate_break = 32;
-
/*
* detach_tasks() -- tries to detach up to imbalance load/util/tasks from
* busiest_rq, as part of a balancing operation within domain "sd".
@@ -8059,7 +8057,7 @@ static int detach_tasks(struct lb_env *env)
/* take a breather every nr_migrate tasks */
if (env->loop > env->loop_break) {
- env->loop_break += sched_nr_migrate_break;
+ env->loop_break += SCHED_NR_MIGRATE_BREAK;
env->flags |= LBF_NEED_BREAK;
break;
}
@@ -10100,14 +10098,13 @@ static int load_balance(int this_cpu, struct rq *this_rq,
struct rq *busiest;
struct rq_flags rf;
struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
-
struct lb_env env = {
.sd = sd,
.dst_cpu = this_cpu,
.dst_rq = this_rq,
.dst_grpmask = sched_group_span(sd->groups),
.idle = idle,
- .loop_break = sched_nr_migrate_break,
+ .loop_break = SCHED_NR_MIGRATE_BREAK,
.cpus = cpus,
.fbq_type = all,
.tasks = LIST_HEAD_INIT(env.tasks),
@@ -10216,7 +10213,7 @@ more_balance:
env.dst_cpu = env.new_dst_cpu;
env.flags &= ~LBF_DST_PINNED;
env.loop = 0;
- env.loop_break = sched_nr_migrate_break;
+ env.loop_break = SCHED_NR_MIGRATE_BREAK;
/*
* Go back to "more_balance" rather than "redo" since we
@@ -10248,7 +10245,7 @@ more_balance:
*/
if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
- env.loop_break = sched_nr_migrate_break;
+ env.loop_break = SCHED_NR_MIGRATE_BREAK;
goto redo;
}
goto out_all_pinned;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 91b2c7e..1fc198b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2423,6 +2423,12 @@ extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);
+#ifdef CONFIG_PREEMPT_RT
+#define SCHED_NR_MIGRATE_BREAK 8
+#else
+#define SCHED_NR_MIGRATE_BREAK 32
+#endif
+
extern const_debug unsigned int sysctl_sched_nr_migrate;
extern const_debug unsigned int sysctl_sched_migration_cost;
The following commit has been merged into the sched/core branch of tip:
Commit-ID: b0defa7ae03ecf91b8bfd10ede430cff12fcbd06
Gitweb: https://git.kernel.org/tip/b0defa7ae03ecf91b8bfd10ede430cff12fcbd06
Author: Vincent Guittot <[email protected]>
AuthorDate: Thu, 25 Aug 2022 14:27:23 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 15 Sep 2022 16:13:51 +02:00
sched/fair: Make sure to try to detach at least one movable task
During load balance, we try at most env->loop_max time to move a task.
But it can happen that the loop_max LRU tasks (ie tail of
the cfs_tasks list) can't be moved to dst_cpu because of affinity.
In this case, loop in the list until we found at least one.
The maximum of detached tasks remained the same as before.
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4e5b171..dae3bfa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8049,8 +8049,12 @@ static int detach_tasks(struct lb_env *env)
p = list_last_entry(tasks, struct task_struct, se.group_node);
env->loop++;
- /* We've more or less seen every task there is, call it quits */
- if (env->loop > env->loop_max)
+ /*
+ * We've more or less seen every task there is, call it quits
+ * unless we haven't found any movable task yet.
+ */
+ if (env->loop > env->loop_max &&
+ !(env->flags & LBF_ALL_PINNED))
break;
/* take a breather every nr_migrate tasks */
@@ -10179,7 +10183,9 @@ more_balance:
if (env.flags & LBF_NEED_BREAK) {
env.flags &= ~LBF_NEED_BREAK;
- goto more_balance;
+ /* Stop if we tried all running tasks */
+ if (env.loop < busiest->nr_running)
+ goto more_balance;
}
/*
Hi Vincent,
On Thu, Aug 25, 2022 at 5:27 AM Vincent Guittot
<[email protected]> wrote:
>
> During load balance, we try at most env->loop_max time to move a task.
> But it can happen that the loop_max LRU tasks (ie tail of
> the cfs_tasks list) can't be moved to dst_cpu because of affinity.
> In this case, loop in the list until we found at least one.
We had a user recently trigger a hard lockup which we believe is due
to this patch. The user in question had O(10k) threads affinitized to
a cpu; seems like the process had an out of control thread spawning
issue, and was in the middle of getting killed. However, that was
being slowed down due to the fact that load balance was iterating all
these threads and bouncing the rq lock (and making no progress due to
ALL_PINNED). Before this patch, load balance would quit after hitting
loop_max.
Even ignoring that specific instance, it seems pretty easy for this
patch to cause a softlockup due to a buggy or malicious process.
For the tradeoff you were trying to make in this patch (spend more
time searching in the hopes that there's something migratable further
in the list), perhaps it would be better to adjust
sysctl.sched_nr_migrate instead of baking this into the kernel?
Best,
Josh
>
> The maximum of detached tasks remained the same as before.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da388657d5ac..02b7b808e186 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8052,8 +8052,12 @@ static int detach_tasks(struct lb_env *env)
> p = list_last_entry(tasks, struct task_struct, se.group_node);
>
> env->loop++;
> - /* We've more or less seen every task there is, call it quits */
> - if (env->loop > env->loop_max)
> + /*
> + * We've more or less seen every task there is, call it quits
> + * unless we haven't found any movable task yet.
> + */
> + if (env->loop > env->loop_max &&
> + !(env->flags & LBF_ALL_PINNED))
> break;
>
> /* take a breather every nr_migrate tasks */
> @@ -10182,7 +10186,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>
> if (env.flags & LBF_NEED_BREAK) {
> env.flags &= ~LBF_NEED_BREAK;
> - goto more_balance;
> + /* Stop if we tried all running tasks */
> + if (env.loop < busiest->nr_running)
> + goto more_balance;
> }
>
> /*
> --
> 2.17.1
>
Hi Josh,
Sorry for the late reply.
On Mon, 12 Feb 2024 at 21:29, Josh Don <[email protected]> wrote:
>
> Hi Vincent,
>
> On Thu, Aug 25, 2022 at 5:27 AM Vincent Guittot
> <[email protected]> wrote:
> >
> > During load balance, we try at most env->loop_max time to move a task.
> > But it can happen that the loop_max LRU tasks (ie tail of
> > the cfs_tasks list) can't be moved to dst_cpu because of affinity.
> > In this case, loop in the list until we found at least one.
>
> We had a user recently trigger a hard lockup which we believe is due
> to this patch. The user in question had O(10k) threads affinitized to
> a cpu; seems like the process had an out of control thread spawning
> issue, and was in the middle of getting killed. However, that was
> being slowed down due to the fact that load balance was iterating all
Does it mean that it was progressing but not as fast as you would like?
> these threads and bouncing the rq lock (and making no progress due to
> ALL_PINNED). Before this patch, load balance would quit after hitting
> loop_max.
>
> Even ignoring that specific instance, it seems pretty easy for this
> patch to cause a softlockup due to a buggy or malicious process.
The fact that the rq is released regularly should prevent a
softlockup. And we could even speed up can_migrate_task(), which does a
lot of useless work for tasks affined to 1 cpu.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8270e2e15cb..15bc1067c69d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8920,6 +8920,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
lockdep_assert_rq_held(env->src_rq);
+ if (p->nr_cpus_allowed == 1)
+ return 0;
/*
* We do not migrate tasks that are:
* 1) throttled_lb_pair, or
>
> For the tradeoff you were trying to make in this patch (spend more
> time searching in the hopes that there's something migratable further
> in the list), perhaps it would be better to adjust
> sysctl.sched_nr_migrate instead of baking this into the kernel?
That could be a solution, but this increases the iterations for all
cases, including those which are more time-consuming to sort out, and
the number of tasks that you will migrate in one lb. The latter is the
one which consumes time
Vincent
>
> Best,
> Josh
>
> >
> > The maximum of detached tasks remained the same as before.
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > kernel/sched/fair.c | 12 +++++++++---
> > 1 file changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index da388657d5ac..02b7b808e186 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8052,8 +8052,12 @@ static int detach_tasks(struct lb_env *env)
> > p = list_last_entry(tasks, struct task_struct, se.group_node);
> >
> > env->loop++;
> > - /* We've more or less seen every task there is, call it quits */
> > - if (env->loop > env->loop_max)
> > + /*
> > + * We've more or less seen every task there is, call it quits
> > + * unless we haven't found any movable task yet.
> > + */
> > + if (env->loop > env->loop_max &&
> > + !(env->flags & LBF_ALL_PINNED))
> > break;
> >
> > /* take a breather every nr_migrate tasks */
> > @@ -10182,7 +10186,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> >
> > if (env.flags & LBF_NEED_BREAK) {
> > env.flags &= ~LBF_NEED_BREAK;
> > - goto more_balance;
> > + /* Stop if we tried all running tasks */
> > + if (env.loop < busiest->nr_running)
> > + goto more_balance;
> > }
> >
> > /*
> > --
> > 2.17.1
> >
On Wed, Mar 20, 2024 at 9:58 AM Vincent Guittot
<[email protected]> wrote:
>
> Hi Josh,
>
> Sorry for the late reply.
No worries, responses to your comments inline below.
> > We had a user recently trigger a hard lockup which we believe is due
> > to this patch. The user in question had O(10k) threads affinitized to
> > a cpu; seems like the process had an out of control thread spawning
> > issue, and was in the middle of getting killed. However, that was
> > being slowed down due to the fact that load balance was iterating all
>
> Does it mean that it was progressing but not as fast as you would like
It was a hard lockup, so it's more than just "not fast enough".
Indeed it was progressing, but not at a rate sufficient to avoid
lockup.
> > these threads and bouncing the rq lock (and making no progress due to
> > ALL_PINNED). Before this patch, load balance would quit after hitting
> > loop_max.
> >
> > Even ignoring that specific instance, it seems pretty easy for this
> > patch to cause a softlockup due to a buggy or malicious process.
>
> The fact that the rq is released regularly should prevent a
> softlockup.
That doesn't prevent a softlockup; the kernel is stuck looping over a long
list of tasks for too long, regardless of whether it is releasing and
re-acquiring the rq locks.
Note also that load balance can come from softirq in a context where
we have IRQ disabled, which can lead to hard lockup as well.
> And we could even fasten can_migrate() which does a lot of
> useless stuff for task affined to 1 cpu.
That seems like a useful optimization, but not really relevant? It
doesn't matter how small we make the constant factor, we still have an
O(n) operation in kernel mode here.
> > For the tradeoff you were trying to make in this patch (spend more
> > time searching in the hopes that there's something migratable further
> > in the list), perhaps it would be better to adjust
> > sysctl.sched_nr_migrate instead of baking this into the kernel?
>
> That could be a solution but this increases the iterations for all
> cases including those which are more time consuming to sort out and
> the number of tasks that you will migrate in one lb. The latter is the
> one which consumes time
Is it really that bad? loop_max will be unchanged for most cases since
it gets min'd with nr_running anyway. And, even if loop_max ends up
larger in some other instances, we still terminate the iteration after
fixing up env->imbalance (granted, we'll migrate more tasks to achieve
a better balance with a larger loop_max, which I think is your point).
Another idea then: what about separating the number of tasks we can
move from the number of tasks we can search? You effectively want to
keep the number of tasks that can be migrated small (nr_migrate), but
be able to search deeper in the list for things to pull (a larger
search_depth).
- Josh
On Thu, 21 Mar 2024 at 21:25, Josh Don <[email protected]> wrote:
>
> On Wed, Mar 20, 2024 at 9:58 AM Vincent Guittot
> <[email protected]> wrote:
> >
> > Hi Josh,
> >
> > Sorry for the late reply.
>
> No worries, responses to your comments inline below.
>
> > > We had a user recently trigger a hard lockup which we believe is due
> > > to this patch. The user in question had O(10k) threads affinitized to
> > > a cpu; seems like the process had an out of control thread spawning
> > > issue, and was in the middle of getting killed. However, that was
> > > being slowed down due to the fact that load balance was iterating all
> >
> > Does it mean that it was progressing but not as fast as you would like
>
> It was a hard lockup, so it's more than just "not fast enough".
> Indeed it was progressing, but not at a rate sufficient to avoid
> lockup.
>
> > > these threads and bouncing the rq lock (and making no progress due to
> > > ALL_PINNED). Before this patch, load balance would quit after hitting
> > > loop_max.
> > >
> > > Even ignoring that specific instance, it seems pretty easy for this
> > > patch to cause a softlockup due to a buggy or malicious process.
> >
> > The fact that the rq is released regularly should prevent a
> > softlockup.
>
> That doesn't prevent a softlockup; kernel is stuck looping over a long
> list of tasks for too long, regardless of whether it is releasing and
> re-acquiring the rq locks.
>
> Note also that load balance can come from softirq in a context where
> we have IRQ disabled, which can lead to hard lockup as well.
fair enough
>
> > And we could even fasten can_migrate() which does a lot of
> > useless stuff for task affined to 1 cpu.
>
> That seems like a useful optimization, but not really relevant? It
> doesn't matter how small we make the constant factor, we still have an
> O(n) operation in kernel mode here.
>
> > > For the tradeoff you were trying to make in this patch (spend more
> > > time searching in the hopes that there's something migratable further
> > > in the list), perhaps it would be better to adjust
> > > sysctl.sched_nr_migrate instead of baking this into the kernel?
> >
> > That could be a solution but this increases the iterations for all
> > cases including those which are more time consuming to sort out and
> > the number of tasks that you will migrate in one lb. The latter is the
> > one which consumes time
>
> Is is really that bad? loop_max will be unchanged for most cases since
> it gets min'd with nr_running anyway. And, even if loop_max ends up
> larger in some other instances, we still terminate the iteration after
> fixing up env->imbalance (granted, we'll migrate more tasks to achieve
> a better balance with a larger loop_max, which I think is your point).
Yes, my point is that the load of a task can be quite small, especially
with cgroups, so we can end up detaching/attaching a very large number
of tasks, which is far more time-consuming than checking whether we can
migrate it or not
>
>
> Another idea then: what about separating the number of tasks we can
> move from the number of tasks we can search? You effectively want to
> keep the number of tasks that can be migrated small (nr_migrate), but
> be able to search deeper in the list for things to pull (a larger
> search_depth).
That could be a solution indeed
>
> - Josh
> > Another idea then: what about separating the number of tasks we can
> > move from the number of tasks we can search? You effectively want to
> > keep the number of tasks that can be migrated small (nr_migrate), but
> > be able to search deeper in the list for things to pull (a larger
> > search_depth).
>
> That could be a solution indeed
Cool, I'll try spinning up a patch for that then.
Best,
Josh