2023-02-06 22:14:58

by Qais Yousef

Subject: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

Commit f9a25f776d78 ("cpusets: Rebuild root domain deadline accounting information")
enabled rebuilding root domain on cpuset and hotplug operations to
correct deadline accounting.

Rebuilding the root domains is a slow operation and we see 10+ ms delays
on suspend-resume because of it (the worst case captured is 20ms, which
happens often).

Since nothing is expected to change on a suspend-resume operation, skip
rebuilding the root domains to regain some of the time lost.

Achieve this by refactoring the code to pass down to rebuild_sched_domains()
whether dl accounting needs an update. And while at it, rename
rebuild_root_domains() to update_dl_rd_accounting(), which I believe is
a more representative name since we are not really rebuilding the root
domains, but rather updating dl accounting at the root domain.

Some users of rebuild_sched_domains() will now skip the dl accounting
update:

* Updating sched domains when relaxing the domain level in cpuset,
which only impacts the search level used in load balance
* Updating sched domains when the cpufreq governor changes and we need
to create the perf domains

Users in arch/x86 and arch/s390 are left with the old behavior.

Debugged-by: Rick Yiu <[email protected]>
Signed-off-by: Qais Yousef (Google) <[email protected]>
---

Changes in v3:

* Change the logic to avoid using cpuhp_tasks_frozen, which can be racy
and has a dependency on context
* Refactor the code to pass whether dl accounting needs to be updated
down the call chain
* Teach cpuset_force_rebuild() to take a reason argument and convert
the variable into int
* Rename rebuild_root_domains() to update_dl_rd_accounting(), as
that's what I believe it actually does

v2 discussion: https://lore.kernel.org/lkml/[email protected]/
v1 discussion: https://lore.kernel.org/lkml/20221216233501.gh6m75e7s66dmjgo@airbuntu/

arch/s390/kernel/topology.c | 2 +-
arch/x86/kernel/itmt.c | 6 ++---
drivers/base/arch_topology.c | 2 +-
include/linux/cpuset.h | 12 ++++++----
kernel/cgroup/cpuset.c | 43 ++++++++++++++++++++----------------
kernel/sched/core.c | 2 +-
kernel/sched/topology.c | 2 +-
7 files changed, 39 insertions(+), 30 deletions(-)

diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c
index c6eecd4a5302..29d57154a3f1 100644
--- a/arch/s390/kernel/topology.c
+++ b/arch/s390/kernel/topology.c
@@ -333,7 +333,7 @@ int arch_update_cpu_topology(void)

static void topology_work_fn(struct work_struct *work)
{
- rebuild_sched_domains();
+ rebuild_sched_domains(true);
}

void topology_schedule_update(void)
diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index 9ff480e94511..6f1446223697 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -56,7 +56,7 @@ static int sched_itmt_update_handler(struct ctl_table *table, int write,

if (!ret && write && old_sysctl != sysctl_sched_itmt_enabled) {
x86_topology_update = true;
- rebuild_sched_domains();
+ rebuild_sched_domains(true);
}

mutex_unlock(&itmt_update_mutex);
@@ -125,7 +125,7 @@ int sched_set_itmt_support(void)
sysctl_sched_itmt_enabled = 1;

x86_topology_update = true;
- rebuild_sched_domains();
+ rebuild_sched_domains(true);

mutex_unlock(&itmt_update_mutex);

@@ -161,7 +161,7 @@ void sched_clear_itmt_support(void)
/* disable sched_itmt if we are no longer ITMT capable */
sysctl_sched_itmt_enabled = 0;
x86_topology_update = true;
- rebuild_sched_domains();
+ rebuild_sched_domains(true);
}

mutex_unlock(&itmt_update_mutex);
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index e7d6e6657ffa..90d8a42335e6 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -253,7 +253,7 @@ int topology_update_cpu_topology(void)
static void update_topology_flags_workfn(struct work_struct *work)
{
update_topology = 1;
- rebuild_sched_domains();
+ rebuild_sched_domains(true);
pr_debug("sched_domain hierarchy rebuilt, flags updated\n");
update_topology = 0;
}
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index d58e0476ee8e..e30d4cd35ef7 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -18,6 +18,10 @@
#include <linux/mmu_context.h>
#include <linux/jump_label.h>

+#define CPUSET_FORCE_REBUILD_RESET 0
+#define CPUSET_FORCE_REBUILD_SUSPEND_RESUME 1
+#define CPUSET_FORCE_REBUILD_PRS_ERROR 2
+
#ifdef CONFIG_CPUSETS

/*
@@ -68,7 +72,7 @@ static inline bool cpusets_insane_config(void)

extern int cpuset_init(void);
extern void cpuset_init_smp(void);
-extern void cpuset_force_rebuild(void);
+extern void cpuset_force_rebuild(int reason);
extern void cpuset_update_active_cpus(void);
extern void cpuset_wait_for_hotplug(void);
extern void cpuset_read_lock(void);
@@ -132,7 +136,7 @@ static inline int cpuset_do_slab_mem_spread(void)

extern bool current_cpuset_is_being_rebound(void);

-extern void rebuild_sched_domains(void);
+extern void rebuild_sched_domains(bool update_dl_accounting);

extern void cpuset_print_current_mems_allowed(void);

@@ -187,7 +191,7 @@ static inline bool cpusets_insane_config(void) { return false; }
static inline int cpuset_init(void) { return 0; }
static inline void cpuset_init_smp(void) {}

-static inline void cpuset_force_rebuild(void) { }
+static inline void cpuset_force_rebuild(int reason) { }

static inline void cpuset_update_active_cpus(void)
{
@@ -276,7 +280,7 @@ static inline bool current_cpuset_is_being_rebound(void)
return false;
}

-static inline void rebuild_sched_domains(void)
+static inline void rebuild_sched_domains(bool update_dl_accounting)
{
partition_sched_domains(1, NULL, NULL);
}
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a29c0b13706b..e5ddc8e11e5d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1079,7 +1079,7 @@ static void update_tasks_root_domain(struct cpuset *cs)
css_task_iter_end(&it);
}

-static void rebuild_root_domains(void)
+static void update_dl_rd_accounting(void)
{
struct cpuset *cs = NULL;
struct cgroup_subsys_state *pos_css;
@@ -1117,11 +1117,13 @@ static void rebuild_root_domains(void)

static void
partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
- struct sched_domain_attr *dattr_new)
+ struct sched_domain_attr *dattr_new,
+ bool update_dl_accounting)
{
mutex_lock(&sched_domains_mutex);
partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
- rebuild_root_domains();
+ if (update_dl_accounting)
+ update_dl_rd_accounting();
mutex_unlock(&sched_domains_mutex);
}

@@ -1136,7 +1138,7 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
*
* Call with cpuset_rwsem held. Takes cpus_read_lock().
*/
-static void rebuild_sched_domains_locked(void)
+static void rebuild_sched_domains_locked(bool update_dl_accounting)
{
struct cgroup_subsys_state *pos_css;
struct sched_domain_attr *attr;
@@ -1185,19 +1187,19 @@ static void rebuild_sched_domains_locked(void)
ndoms = generate_sched_domains(&doms, &attr);

/* Have scheduler rebuild the domains */
- partition_and_rebuild_sched_domains(ndoms, doms, attr);
+ partition_and_rebuild_sched_domains(ndoms, doms, attr, update_dl_accounting);
}
#else /* !CONFIG_SMP */
-static void rebuild_sched_domains_locked(void)
+static void rebuild_sched_domains_locked(bool update_dl_accounting)
{
}
#endif /* CONFIG_SMP */

-void rebuild_sched_domains(void)
+void rebuild_sched_domains(bool update_dl_accounting)
{
cpus_read_lock();
percpu_down_write(&cpuset_rwsem);
- rebuild_sched_domains_locked();
+ rebuild_sched_domains_locked(update_dl_accounting);
percpu_up_write(&cpuset_rwsem);
cpus_read_unlock();
}
@@ -1681,7 +1683,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
rcu_read_unlock();

if (need_rebuild_sched_domains)
- rebuild_sched_domains_locked();
+ rebuild_sched_domains_locked(true);
}

/**
@@ -2136,7 +2138,7 @@ static int update_relax_domain_level(struct cpuset *cs, s64 val)
cs->relax_domain_level = val;
if (!cpumask_empty(cs->cpus_allowed) &&
is_sched_load_balance(cs))
- rebuild_sched_domains_locked();
+ rebuild_sched_domains_locked(false);
}

return 0;
@@ -2202,7 +2204,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
spin_unlock_irq(&callback_lock);

if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
- rebuild_sched_domains_locked();
+ rebuild_sched_domains_locked(true);

if (spread_flag_changed)
update_tasks_flags(cs);
@@ -2315,7 +2317,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
update_sibling_cpumasks(parent, cs, &tmpmask);

if (!sched_domain_rebuilt)
- rebuild_sched_domains_locked();
+ rebuild_sched_domains_locked(true);
out:
/*
* Make partition invalid if an error happen
@@ -3389,11 +3391,11 @@ hotplug_update_tasks(struct cpuset *cs,
update_tasks_nodemask(cs);
}

-static bool force_rebuild;
+static int force_rebuild;

-void cpuset_force_rebuild(void)
+void cpuset_force_rebuild(int reason)
{
- force_rebuild = true;
+ force_rebuild = reason;
}

/**
@@ -3489,7 +3491,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
WRITE_ONCE(cs->prs_err, PERR_HOTPLUG);
notify_partition_change(cs, old_prs);
}
- cpuset_force_rebuild();
+ cpuset_force_rebuild(CPUSET_FORCE_REBUILD_PRS_ERROR);
}

/*
@@ -3499,7 +3501,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
else if (is_partition_valid(parent) && is_partition_invalid(cs)) {
update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp);
if (is_partition_valid(cs))
- cpuset_force_rebuild();
+ cpuset_force_rebuild(CPUSET_FORCE_REBUILD_PRS_ERROR);
}

update_tasks:
@@ -3626,8 +3628,11 @@ static void cpuset_hotplug_workfn(struct work_struct *work)

/* rebuild sched domains if cpus_allowed has changed */
if (cpus_updated || force_rebuild) {
- force_rebuild = false;
- rebuild_sched_domains();
+ bool update_dl_accounting = cpus_updated ||
+ force_rebuild == CPUSET_FORCE_REBUILD_PRS_ERROR;
+
+ force_rebuild = CPUSET_FORCE_REBUILD_RESET;
+ rebuild_sched_domains(update_dl_accounting);
}

free_cpumasks(NULL, ptmp);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4580fe3e1d0c..d68eac04c851 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9485,7 +9485,7 @@ static void cpuset_cpu_active(void)
* restore the original sched domains by considering the
* cpuset configurations.
*/
- cpuset_force_rebuild();
+ cpuset_force_rebuild(CPUSET_FORCE_REBUILD_SUSPEND_RESUME);
}
cpuset_update_active_cpus();
}
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d93c3379e901..bf33b84c511a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -214,7 +214,7 @@ void rebuild_sched_domains_energy(void)
{
mutex_lock(&sched_energy_mutex);
sched_energy_update = true;
- rebuild_sched_domains();
+ rebuild_sched_domains(false);
sched_energy_update = false;
mutex_unlock(&sched_energy_mutex);
}
--
2.25.1



2023-02-23 15:39:39

by Qais Yousef

Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On 02/06/23 22:14, Qais Yousef wrote:
> Commit f9a25f776d78 ("cpusets: Rebuild root domain deadline accounting information")
> enabled rebuilding root domain on cpuset and hotplug operations to
> correct deadline accounting.
>
> Rebuilding root domain is a slow operation and we see 10+ of ms delays
> on suspend-resume because of that (worst case captures 20ms which
> happens often).
>
> Since nothing is expected to change on suspend-resume operation; skip
> rebuilding the root domains to regain the some of the time lost.
>
> Achieve this by refactoring the code to pass whether dl accoutning needs
> an update to rebuild_sched_domains(). And while at it, rename
> rebuild_root_domains() to update_dl_rd_accounting() which I believe is
> a more representative name since we are not really rebuilding the root
> domains, but rather updating dl accounting at the root domain.
>
> Some users of rebuild_sched_domains() will skip dl accounting update
> now:
>
> * Update sched domains when relaxing the domain level in cpuset
> which only impacts searching level in load balance
> * update sched domains when cpufreq governor changes and we need
> to create the perf domains
>
> Users in arch/x86 and arch/s390 are left with the old behavior.
>
> Debugged-by: Rick Yiu <[email protected]>
> Signed-off-by: Qais Yousef (Google) <[email protected]>
> ---
>
> Changes in v3:
>
> * Change the logic to avoid using cpuhp_tasks_frozen which can be racy
> and have dependency on context
> * Refactor the code to pass whether dl accounting needs to be updated
> down the call chain
> * Teach cpuset_force_rebuild() to take a reason argument and convert
> the variable into int
> * Rename rebuild_root_domains() into update_dl_rd_accounting() as
> that's what I believe it's only doing
>
> v2 discussion: https://lore.kernel.org/lkml/[email protected]/
> v1 discussion: https://lore.kernel.org/lkml/20221216233501.gh6m75e7s66dmjgo@airbuntu/

Is this version any good?


Thanks!

--
Qais Yousef

>
> arch/s390/kernel/topology.c | 2 +-
> arch/x86/kernel/itmt.c | 6 ++---
> drivers/base/arch_topology.c | 2 +-
> include/linux/cpuset.h | 12 ++++++----
> kernel/cgroup/cpuset.c | 43 ++++++++++++++++++++----------------
> kernel/sched/core.c | 2 +-
> kernel/sched/topology.c | 2 +-
> 7 files changed, 39 insertions(+), 30 deletions(-)
>
> diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c
> index c6eecd4a5302..29d57154a3f1 100644
> --- a/arch/s390/kernel/topology.c
> +++ b/arch/s390/kernel/topology.c
> @@ -333,7 +333,7 @@ int arch_update_cpu_topology(void)
>
> static void topology_work_fn(struct work_struct *work)
> {
> - rebuild_sched_domains();
> + rebuild_sched_domains(true);
> }
>
> void topology_schedule_update(void)
> diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
> index 9ff480e94511..6f1446223697 100644
> --- a/arch/x86/kernel/itmt.c
> +++ b/arch/x86/kernel/itmt.c
> @@ -56,7 +56,7 @@ static int sched_itmt_update_handler(struct ctl_table *table, int write,
>
> if (!ret && write && old_sysctl != sysctl_sched_itmt_enabled) {
> x86_topology_update = true;
> - rebuild_sched_domains();
> + rebuild_sched_domains(true);
> }
>
> mutex_unlock(&itmt_update_mutex);
> @@ -125,7 +125,7 @@ int sched_set_itmt_support(void)
> sysctl_sched_itmt_enabled = 1;
>
> x86_topology_update = true;
> - rebuild_sched_domains();
> + rebuild_sched_domains(true);
>
> mutex_unlock(&itmt_update_mutex);
>
> @@ -161,7 +161,7 @@ void sched_clear_itmt_support(void)
> /* disable sched_itmt if we are no longer ITMT capable */
> sysctl_sched_itmt_enabled = 0;
> x86_topology_update = true;
> - rebuild_sched_domains();
> + rebuild_sched_domains(true);
> }
>
> mutex_unlock(&itmt_update_mutex);
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index e7d6e6657ffa..90d8a42335e6 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -253,7 +253,7 @@ int topology_update_cpu_topology(void)
> static void update_topology_flags_workfn(struct work_struct *work)
> {
> update_topology = 1;
> - rebuild_sched_domains();
> + rebuild_sched_domains(true);
> pr_debug("sched_domain hierarchy rebuilt, flags updated\n");
> update_topology = 0;
> }
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index d58e0476ee8e..e30d4cd35ef7 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -18,6 +18,10 @@
> #include <linux/mmu_context.h>
> #include <linux/jump_label.h>
>
> +#define CPUSET_FORCE_REBUILD_RESET 0
> +#define CPUSET_FORCE_REBUILD_SUSPEND_RESUME 1
> +#define CPUSET_FORCE_REBUILD_PRS_ERROR 2
> +
> #ifdef CONFIG_CPUSETS
>
> /*
> @@ -68,7 +72,7 @@ static inline bool cpusets_insane_config(void)
>
> extern int cpuset_init(void);
> extern void cpuset_init_smp(void);
> -extern void cpuset_force_rebuild(void);
> +extern void cpuset_force_rebuild(int reason);
> extern void cpuset_update_active_cpus(void);
> extern void cpuset_wait_for_hotplug(void);
> extern void cpuset_read_lock(void);
> @@ -132,7 +136,7 @@ static inline int cpuset_do_slab_mem_spread(void)
>
> extern bool current_cpuset_is_being_rebound(void);
>
> -extern void rebuild_sched_domains(void);
> +extern void rebuild_sched_domains(bool update_dl_accounting);
>
> extern void cpuset_print_current_mems_allowed(void);
>
> @@ -187,7 +191,7 @@ static inline bool cpusets_insane_config(void) { return false; }
> static inline int cpuset_init(void) { return 0; }
> static inline void cpuset_init_smp(void) {}
>
> -static inline void cpuset_force_rebuild(void) { }
> +static inline void cpuset_force_rebuild(int reason) { }
>
> static inline void cpuset_update_active_cpus(void)
> {
> @@ -276,7 +280,7 @@ static inline bool current_cpuset_is_being_rebound(void)
> return false;
> }
>
> -static inline void rebuild_sched_domains(void)
> +static inline void rebuild_sched_domains(bool update_dl_accounting)
> {
> partition_sched_domains(1, NULL, NULL);
> }
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index a29c0b13706b..e5ddc8e11e5d 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1079,7 +1079,7 @@ static void update_tasks_root_domain(struct cpuset *cs)
> css_task_iter_end(&it);
> }
>
> -static void rebuild_root_domains(void)
> +static void update_dl_rd_accounting(void)
> {
> struct cpuset *cs = NULL;
> struct cgroup_subsys_state *pos_css;
> @@ -1117,11 +1117,13 @@ static void rebuild_root_domains(void)
>
> static void
> partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> - struct sched_domain_attr *dattr_new)
> + struct sched_domain_attr *dattr_new,
> + bool update_dl_accounting)
> {
> mutex_lock(&sched_domains_mutex);
> partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
> - rebuild_root_domains();
> + if (update_dl_accounting)
> + update_dl_rd_accounting();
> mutex_unlock(&sched_domains_mutex);
> }
>
> @@ -1136,7 +1138,7 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> *
> * Call with cpuset_rwsem held. Takes cpus_read_lock().
> */
> -static void rebuild_sched_domains_locked(void)
> +static void rebuild_sched_domains_locked(bool update_dl_accounting)
> {
> struct cgroup_subsys_state *pos_css;
> struct sched_domain_attr *attr;
> @@ -1185,19 +1187,19 @@ static void rebuild_sched_domains_locked(void)
> ndoms = generate_sched_domains(&doms, &attr);
>
> /* Have scheduler rebuild the domains */
> - partition_and_rebuild_sched_domains(ndoms, doms, attr);
> + partition_and_rebuild_sched_domains(ndoms, doms, attr, update_dl_accounting);
> }
> #else /* !CONFIG_SMP */
> -static void rebuild_sched_domains_locked(void)
> +static void rebuild_sched_domains_locked(bool update_dl_accounting)
> {
> }
> #endif /* CONFIG_SMP */
>
> -void rebuild_sched_domains(void)
> +void rebuild_sched_domains(bool update_dl_accounting)
> {
> cpus_read_lock();
> percpu_down_write(&cpuset_rwsem);
> - rebuild_sched_domains_locked();
> + rebuild_sched_domains_locked(update_dl_accounting);
> percpu_up_write(&cpuset_rwsem);
> cpus_read_unlock();
> }
> @@ -1681,7 +1683,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
> rcu_read_unlock();
>
> if (need_rebuild_sched_domains)
> - rebuild_sched_domains_locked();
> + rebuild_sched_domains_locked(true);
> }
>
> /**
> @@ -2136,7 +2138,7 @@ static int update_relax_domain_level(struct cpuset *cs, s64 val)
> cs->relax_domain_level = val;
> if (!cpumask_empty(cs->cpus_allowed) &&
> is_sched_load_balance(cs))
> - rebuild_sched_domains_locked();
> + rebuild_sched_domains_locked(false);
> }
>
> return 0;
> @@ -2202,7 +2204,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
> spin_unlock_irq(&callback_lock);
>
> if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
> - rebuild_sched_domains_locked();
> + rebuild_sched_domains_locked(true);
>
> if (spread_flag_changed)
> update_tasks_flags(cs);
> @@ -2315,7 +2317,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
> update_sibling_cpumasks(parent, cs, &tmpmask);
>
> if (!sched_domain_rebuilt)
> - rebuild_sched_domains_locked();
> + rebuild_sched_domains_locked(true);
> out:
> /*
> * Make partition invalid if an error happen
> @@ -3389,11 +3391,11 @@ hotplug_update_tasks(struct cpuset *cs,
> update_tasks_nodemask(cs);
> }
>
> -static bool force_rebuild;
> +static int force_rebuild;
>
> -void cpuset_force_rebuild(void)
> +void cpuset_force_rebuild(int reason)
> {
> - force_rebuild = true;
> + force_rebuild = reason;
> }
>
> /**
> @@ -3489,7 +3491,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
> WRITE_ONCE(cs->prs_err, PERR_HOTPLUG);
> notify_partition_change(cs, old_prs);
> }
> - cpuset_force_rebuild();
> + cpuset_force_rebuild(CPUSET_FORCE_REBUILD_PRS_ERROR);
> }
>
> /*
> @@ -3499,7 +3501,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
> else if (is_partition_valid(parent) && is_partition_invalid(cs)) {
> update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp);
> if (is_partition_valid(cs))
> - cpuset_force_rebuild();
> + cpuset_force_rebuild(CPUSET_FORCE_REBUILD_PRS_ERROR);
> }
>
> update_tasks:
> @@ -3626,8 +3628,11 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
>
> /* rebuild sched domains if cpus_allowed has changed */
> if (cpus_updated || force_rebuild) {
> - force_rebuild = false;
> - rebuild_sched_domains();
> + bool update_dl_accounting = cpus_updated ||
> + force_rebuild == CPUSET_FORCE_REBUILD_PRS_ERROR;
> +
> + force_rebuild = CPUSET_FORCE_REBUILD_RESET;
> + rebuild_sched_domains(update_dl_accounting);
> }
>
> free_cpumasks(NULL, ptmp);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4580fe3e1d0c..d68eac04c851 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9485,7 +9485,7 @@ static void cpuset_cpu_active(void)
> * restore the original sched domains by considering the
> * cpuset configurations.
> */
> - cpuset_force_rebuild();
> + cpuset_force_rebuild(CPUSET_FORCE_REBUILD_SUSPEND_RESUME);
> }
> cpuset_update_active_cpus();
> }
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d93c3379e901..bf33b84c511a 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -214,7 +214,7 @@ void rebuild_sched_domains_energy(void)
> {
> mutex_lock(&sched_energy_mutex);
> sched_energy_update = true;
> - rebuild_sched_domains();
> + rebuild_sched_domains(false);
> sched_energy_update = false;
> mutex_unlock(&sched_energy_mutex);
> }
> --
> 2.25.1
>

2023-02-24 15:15:48

by Dietmar Eggemann

Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On 23/02/2023 16:38, Qais Yousef wrote:

IMHO the patch title is misleading, since what you want to avoid in
certain cases is updating the RD DL accounting.

> On 02/06/23 22:14, Qais Yousef wrote:
>> Commit f9a25f776d78 ("cpusets: Rebuild root domain deadline accounting information")
>> enabled rebuilding root domain on cpuset and hotplug operations to
>> correct deadline accounting.
>>
>> Rebuilding root domain is a slow operation and we see 10+ of ms delays
>> on suspend-resume because of that (worst case captures 20ms which
>> happens often).
>>
>> Since nothing is expected to change on suspend-resume operation; skip
>> rebuilding the root domains to regain the some of the time lost.
>>
>> Achieve this by refactoring the code to pass whether dl accoutning needs
>> an update to rebuild_sched_domains(). And while at it, rename
>> rebuild_root_domains() to update_dl_rd_accounting() which I believe is
>> a more representative name since we are not really rebuilding the root
>> domains, but rather updating dl accounting at the root domain.
>>
>> Some users of rebuild_sched_domains() will skip dl accounting update
>> now:
>>
>> * Update sched domains when relaxing the domain level in cpuset
>> which only impacts searching level in load balance

This one is cpuset related. (1)

>> * update sched domains when cpufreq governor changes and we need
>> to create the perf domains

This one is drivers/base/arch_topology.c [arm/arm64/...] related. (2)

There are several levels of passing this `update_dl_accounting`
information through. I guess it looks like this:

update_dl_accounting

arm/arm64/riscv/parisc specific:
update_topology_flags_workfn() true
rebuild_sched_domains_energy() false (2)

cpuset_hotplug_workfn() cpus_updated ||
force_rebuild == CPUSET_FORCE_REBUILD_PRS_ERROR

->rebuild_sched_domains(update_dl_accounting)

update_cpumasks_hier() true
update_relax_domain_level() false (1)
update_flag() true
update_prstate() true

->rebuild_sched_domains_locked(update_dl_accounting)

->partition_and_rebuild_sched_domains(..., update_dl_accounting)

if (update_dl_accounting)
update_dl_rd_accounting()


There is already a somewhat hidden interface for `sd/rd rebuild`

int __weak arch_update_cpu_topology(void)

which lets partition_sched_domains_locked() figure out whether sched
domains have to be rebuilt.

But in your case it is more on the interface `cpuset/hotplug -> sd/rd
rebuild` and not only `arch -> sd/rd rebuild`.

IMHO, it would be still nice to have only one way to tell `sd/rd
rebuild` what to do and what not to do during sd/rd/(pd) rebuild.

[...]


2023-02-27 20:57:34

by Qais Yousef

Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On 02/24/23 16:14, Dietmar Eggemann wrote:
> On 23/02/2023 16:38, Qais Yousef wrote:
>
> IMHO the patch title is misleading since what you want to avoid in
> certain cases is that the RD DL accounting is updated.

The code calls it rebuild_root_domains() ..

>
> > On 02/06/23 22:14, Qais Yousef wrote:
> >> Commit f9a25f776d78 ("cpusets: Rebuild root domain deadline accounting information")

.. and so is the original patch title.

I think I have enough explanation in the commit message, and I renamed the
function to be more descriptive too.

> >> enabled rebuilding root domain on cpuset and hotplug operations to
> >> correct deadline accounting.
> >>
> >> Rebuilding root domain is a slow operation and we see 10+ of ms delays
> >> on suspend-resume because of that (worst case captures 20ms which
> >> happens often).
> >>
> >> Since nothing is expected to change on suspend-resume operation; skip
> >> rebuilding the root domains to regain the some of the time lost.
> >>
> >> Achieve this by refactoring the code to pass whether dl accoutning needs
> >> an update to rebuild_sched_domains(). And while at it, rename
> >> rebuild_root_domains() to update_dl_rd_accounting() which I believe is
> >> a more representative name since we are not really rebuilding the root
> >> domains, but rather updating dl accounting at the root domain.
> >>
> >> Some users of rebuild_sched_domains() will skip dl accounting update
> >> now:
> >>
> >> * Update sched domains when relaxing the domain level in cpuset
> >> which only impacts searching level in load balance
>
> This one is cpuset related. (1)
>
> >> * update sched domains when cpufreq governor changes and we need
> >> to create the perf domains
>
> This one is drivers/base/arch_topology.c [arm/arm64/...] related. (2)
>
> There are several levels of passing this `update_dl_accounting`
> information through. I guess it looks like this:
>
> update_dl_accounting
>
> arm/arm64/riscv/parisc specific:
> update_topology_flags_workfn() true
> rebuild_sched_domains_energy() false (2)
>
> cpuset_hotplug_workfn() cpus_updated ||
> force_rebuild == CPUSET_FORCE_REBUILD_PRS_ERROR
>
> ->rebuild_sched_domains(update_dl_accounting)
>
> update_cpumasks_hier() true
> update_relax_domain_level() false (1)
> update_flag() true
> update_prstate() true
>
> ->rebuild_sched_domains_locked(update_dl_accounting)
>
> ->partition_and_rebuild_sched_domains(..., update_dl_accounting)
>
> if (update_dl_accounting)
> update_dl_rd_accounting()
>
>
> There is already a somehow hidden interface for `sd/rd rebuild`
>
> int __weak arch_update_cpu_topology(void)
>
> which lets partition_sched_domains_locked() figure out whether sched
> domains have to be rebuild..
>
> But in your case it is more on the interface `cpuset/hotplug -> sd/rd
> rebuild` and not only `arch -> `sd/rd rebuild``.
>
> IMHO, it would be still nice to have only one way to tell `sd/rd
> rebuild` what to do and what not to do during sd/rd/(pd) rebuild.

IIUC you're suggesting to introduce some new mechanism to detect whether hotplug
has led to a CPU disappearing or not, and use that instead? Are you saying I can
use arch_update_cpu_topology() for that? Something like this?

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e5ddc8e11e5d..60c3dcf06f0d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1122,7 +1122,7 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
{
mutex_lock(&sched_domains_mutex);
partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
- if (update_dl_accounting)
+ if (arch_update_cpu_topology())
update_dl_rd_accounting();
mutex_unlock(&sched_domains_mutex);
}

I am not keen on this. arm64 seems to just read a value without a side effect.
But x86 does reset this value, so we can't read it twice in the same call tree
and I'd have to extract it.
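
(For reference, x86's implementation is roughly the following -- I'm quoting
from memory here, so take the exact details with a grain of salt:

	int arch_update_cpu_topology(void)
	{
		int retval = x86_topology_update;

		/* Reading the flag consumes the pending update */
		x86_topology_update = false;
		return retval;
	}

i.e. reading it has a side effect, unlike the arm64 flavour.)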

The better solution that was discussed before is to not iterate through every
task in the system, but to let cpuset track when dl tasks are added to it and do
a smarter iteration. ATM, even if there are no dl tasks in the system, we'll
blindly go through every task in the hierarchy and update nothing.
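
Roughly what I have in mind is something like the below -- an untested sketch
based on the existing update_dl_rd_accounting() loop, where nr_deadline_tasks
is a hypothetical per-cpuset counter that cpuset would have to maintain on
attach/detach and sched_setattr():

	static void update_dl_rd_accounting(void)
	{
		struct cpuset *cs = NULL;
		struct cgroup_subsys_state *pos_css;

		rcu_read_lock();
		cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
			/* Skip cpusets that have no DL tasks attached */
			if (!cs->nr_deadline_tasks)
				continue;

			css_get(&cs->css);
			rcu_read_unlock();

			update_tasks_root_domain(cs);

			rcu_read_lock();
			css_put(&cs->css);
		}
		rcu_read_unlock();
	}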

But I'll leave that to Juri to address if he wants. The original change has
introduced a regression and people have noticed it when phones cycle through
suspend-resume (screen unlock). Juri - could you please chip in on how you want
to address this regression? In theory I should just be a reporter, but I'm
trying my best to help ;-)


Cheers

--
Qais Yousef

2023-02-28 14:09:21

by Dietmar Eggemann

Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On 27/02/2023 21:57, Qais Yousef wrote:
> On 02/24/23 16:14, Dietmar Eggemann wrote:
>> On 23/02/2023 16:38, Qais Yousef wrote:
>>
>> IMHO the patch title is misleading since what you want to avoid in
>> certain cases is that the RD DL accounting is updated.
>
> The code calls it rebuild_root_domain() ..
>
>>
>>> On 02/06/23 22:14, Qais Yousef wrote:
>>>> Commit f9a25f776d78 ("cpusets: Rebuild root domain deadline accounting information")
>
> .. and so is the original patch title.
>
> I think I have enough explanation in the commit message and renamed the
> function name to be more descriptive too.

True but the title doesn't really mention the actual issue here ...
which is DL accounting. Once I read your email it became clear.

[...]

>> There is already a somehow hidden interface for `sd/rd rebuild`
>>
>> int __weak arch_update_cpu_topology(void)
>>
>> which lets partition_sched_domains_locked() figure out whether sched
>> domains have to be rebuild..
>>
>> But in your case it is more on the interface `cpuset/hotplug -> sd/rd
>> rebuild` and not only `arch -> `sd/rd rebuild``.
>>
>> IMHO, it would be still nice to have only one way to tell `sd/rd
>> rebuild` what to do and what not to do during sd/rd/(pd) rebuild.
>
> IIUC you're suggesting to introduce some new mechanism to detect if hotplug has
> lead to a cpu to disappear or not and use that instead? Are you saying I can
> use arch_update_cpu_topology() for that? Something like this?
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index e5ddc8e11e5d..60c3dcf06f0d 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1122,7 +1122,7 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> {
> mutex_lock(&sched_domains_mutex);
> partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
> - if (update_dl_accounting)
> + if (arch_update_cpu_topology())
> update_dl_rd_accounting();
> mutex_unlock(&sched_domains_mutex);
> }

No, this is not what I meant. I'm just saying the:

partition_sched_domains_locked()
new_topology = arch_update_cpu_topology();

has to be considered here as well since we do a
`dl_clear_root_domain(rd)` (1) in partition_sched_domains_locked() for
!new_topology.

And (1) requires the `update_tasks_root_domain()` to happen later.

So there are cases now, e.g. `rebuild_sched_domains_energy()`, in which
`new_topology=0` and `update_dl_accounting=false`, which now clear the rd
but don't redo the DL accounting anymore.
rebuild_root_domains() itself cleans the `default root domain`, not the
other root domains which could exist as well.
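
For reference, the bit of partition_sched_domains_locked() I mean is roughly
the following (paraphrased from kernel/sched/topology.c, so details may differ
slightly):

	/* n is the number of new domains, i.e. ndoms_new */
	for (i = 0; i < ndoms_cur; i++) {
		for (j = 0; j < n && !new_topology; j++) {
			if (cpumask_equal(doms_cur[i], doms_new[j]) &&
			    dattrs_equal(dattr_cur, i, dattr_new, j)) {
				struct root_domain *rd;

				/*
				 * This domain won't be destroyed, so its
				 * dl_bw->total_bw is cleared here and is
				 * expected to be filled in again by a later
				 * update_tasks_root_domain() pass.
				 */
				rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
				dl_clear_root_domain(rd);
				goto match1;
			}
		}
		/* No match - this current sched domain gets destroyed */
		detach_destroy_domains(doms_cur[i]);
	match1:
		;
	}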

Example: Switching CPUfreq policy [0,3-5] performance to schedutil (slow
switching, i.e. we have sugov:X DL task(s)):

[ 862.479906] CPU4 partition_sched_domains_locked() new_topology=0
[ 862.499073] Workqueue: events rebuild_sd_workfn
[ 862.503646] Call trace:
...
[ 862.520789] partition_sched_domains_locked+0x6c/0x670
[ 862.525962] rebuild_sched_domains_locked+0x204/0x8a0
[ 862.531050] rebuild_sched_domains+0x2c/0x50
[ 862.535351] rebuild_sd_workfn+0x38/0x54 <-- !
...
[ 862.554047] CPU4 dl_clear_root_domain() rd->span=0-5 total_bw=0
def_root_domain=0 <-- !
[ 862.561597] CPU4 dl_clear_root_domain() rd->span= total_bw=0
def_root_domain=1
[ 862.568960] CPU4 dl_add_task_root_domain() [sugov:0 1801]
total_bw=104857 def_root_domain=0 rd=0xffff0008015f0000 <-- !

The dl_clear_root_domain() of the def_root_domain and the
dl_add_task_root_domain() to the rd in use won't happen.

[sugov:0 1801] is only a simple example here. I could have spawned a
couple of DL tasks before this to make the issue more obvious.

---

The same seems to happen during suspend/resume (system with 2 frequency
domains, both with slow switching schedutil CPUfreq gov):

[ 27.735821] CPU5 partition_sched_domains_locked() new_topology=0
...
[ 27.735864] Workqueue: events cpuset_hotplug_workfn
[ 27.735894] Call trace:
...
[ 27.735984] partition_sched_domains_locked+0x6c/0x670
[ 27.736004] rebuild_sched_domains_locked+0x204/0x8a0
[ 27.736026] cpuset_hotplug_workfn+0x254/0x52c <-- !
...
[ 27.736155] CPU5 dl_clear_root_domain() rd->span=0-5 total_bw=0
def_root_domain=0 <-- !
[ 27.736178] CPU5 dl_clear_root_domain() rd->span= total_bw=0
def_root_domain=1
[ 27.736296] CPU5 dl_add_task_root_domain() [sugov:0 80] <-- !
total_bw=104857 def_root_domain=0 rd=0xffff000801728000
[ 27.736318] CPU5 dl_add_task_root_domain() [sugov:1 81]
total_bw=209714 def_root_domain=0 rd=0xffff000801728000 <-- !
...

> I am not keen on this. arm64 seems to just read a value without a side effect.

Arm64 (among others) sets `update_topology=1` before
`rebuild_sched_domains()` and `update_topology=0` after it in
update_topology_flags_workfn(). This then makes `new_topology=1` in
partition_sched_domains_locked().

> But x86 does reset this value so we can't read it twice in the same call tree
> and I'll have to extract it.
>
> The better solution that was discussed before is to not iterate through every
> task in the system and let cpuset track when dl tasks are added to it and do
> smarter iteration. ATM even if there are no dl tasks in the system we'll
> blindly go through every task in the hierarchy to update nothing.

Yes, I can see the problem. And IMHO this solution approach seems to be
better than passing `update_dl_accounting` through the stack of involved
functions.

[...]





2023-02-28 17:46:36

by Qais Yousef

Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On 02/28/23 15:09, Dietmar Eggemann wrote:

> > IIUC you're suggesting to introduce some new mechanism to detect if hotplug has
> > lead to a cpu to disappear or not and use that instead? Are you saying I can
> > use arch_update_cpu_topology() for that? Something like this?
> >
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index e5ddc8e11e5d..60c3dcf06f0d 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -1122,7 +1122,7 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> > {
> > mutex_lock(&sched_domains_mutex);
> > partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
> > - if (update_dl_accounting)
> > + if (arch_update_cpu_topology())
> > update_dl_rd_accounting();
> > mutex_unlock(&sched_domains_mutex);
> > }
>
> No, this is not what I meant. I'm just saying the:
>
> partition_sched_domains_locked()
> new_topology = arch_update_cpu_topology();
>
> has to be considered here as well since we do a
> `dl_clear_root_domain(rd)` (1) in partition_sched_domains_locked() for
> !new_topology.

Ah, you're referring to the dl_clear_root_domain() call there. I thought this
didn't trigger.

>
> And (1) requires the `update_tasks_root_domain()` to happen later.
>
> So there are cases now, e.g. `rebuild_sched_domains_energy()` in which
> `new_topology=0` and `update_dl_accounting=false` which now clean the rd
> but don't do a new DL accounting anymore.
> rebuild_root_domains() itself cleans the `default root domain`, not the
> other root domains which could exists as well.
>
> Example: Switching CPUfreq policy [0,3-5] performance to schedutil (slow
> switching, i.e. we have sugov:X DL task(s)):
>
> [ 862.479906] CPU4 partition_sched_domains_locked() new_topology=0
> [ 862.499073] Workqueue: events rebuild_sd_workfn
> [ 862.503646] Call trace:
> ...
> [ 862.520789] partition_sched_domains_locked+0x6c/0x670
> [ 862.525962] rebuild_sched_domains_locked+0x204/0x8a0
> [ 862.531050] rebuild_sched_domains+0x2c/0x50
> [ 862.535351] rebuild_sd_workfn+0x38/0x54 <-- !
> ...
> [ 862.554047] CPU4 dl_clear_root_domain() rd->span=0-5 total_bw=0
> def_root_domain=0 <-- !
> [ 862.561597] CPU4 dl_clear_root_domain() rd->span= total_bw=0
> def_root_domain=1
> [ 862.568960] CPU4 dl_add_task_root_domain() [sugov:0 1801]
> total_bw=104857 def_root_domain=0 rd=0xffff0008015f0000 <-- !
>
> The dl_clear_root_domain() of the def_root_domain and the
> dl_add_task_root_domain() to the rd in use won't happen.
>
> [sugov:0 1801] is only a simple example here. I could have spawned a
> couple of DL tasks before this to illustrate the issue more obvious.
>
> ---
>
> The same seems to happen during suspend/resume (system with 2 frequency
> domains, both with slow switching schedutil CPUfreq gov):
>
> [ 27.735821] CPU5 partition_sched_domains_locked() new_topology=0
> ...
> [ 27.735864] Workqueue: events cpuset_hotplug_workfn
> [ 27.735894] Call trace:
> ...
> [ 27.735984] partition_sched_domains_locked+0x6c/0x670
> [ 27.736004] rebuild_sched_domains_locked+0x204/0x8a0
> [ 27.736026] cpuset_hotplug_workfn+0x254/0x52c <-- !
> ...
> [ 27.736155] CPU5 dl_clear_root_domain() rd->span=0-5 total_bw=0
> def_root_domain=0 <-- !
> [ 27.736178] CPU5 dl_clear_root_domain() rd->span= total_bw=0
> def_root_domain=1
> [ 27.736296] CPU5 dl_add_task_root_domain() [sugov:0 80] <-- !
> total_bw=104857 def_root_domain=0 rd=0xffff000801728000
> [ 27.736318] CPU5 dl_add_task_root_domain() [sugov:1 81]
> total_bw=209714 def_root_domain=0 rd=0xffff000801728000 <-- !
> ...
>
> > I am not keen on this. arm64 seems to just read a value without a side effect.
>
> Arm64 (among others) sets `update_topology=1` before
> `rebuild_sched_domains()` and `update_topology=0` after it in
> update_topology_flags_workfn(). This then makes `new_topology=1` in
> partition_sched_domains_locked().
>
> > But x86 does reset this value so we can't read it twice in the same call tree
> > and I'll have to extract it.
> >
> > The better solution that was discussed before is to not iterate through every
> > task in the system and let cpuset track when dl tasks are added to it and do
> > smarter iteration. ATM even if there are no dl tasks in the system we'll
> > blindly go through every task in the hierarchy to update nothing.
>
> Yes, I can see the problem. And IMHO this solution approach seems to be
> better than parsing update_dl_accounting` through the stack of involved
> functions.

The best I can do is protect this dl_clear_root_domain() too. I really don't
have my heart in this, but I'm trying my best to help. It has taken a lot of my
time already and I'd prefer to hand over to Juri to address this regression
if what I am proposing is not good enough.
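
Something along these lines, I guess -- an untested sketch; the extra parameter
is hypothetical and would have to be threaded through partition_sched_domains()
and its other callers too:

-void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
-				     struct sched_domain_attr *dattr_new)
+void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+				     struct sched_domain_attr *dattr_new,
+				     bool update_dl_accounting)
 {
 ...
-				dl_clear_root_domain(rd);
+				if (update_dl_accounting)
+					dl_clear_root_domain(rd);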

FWIW, there are 0 dl tasks in the system where this was noticed. And this delay
is unbounded because it depends on how many tasks there are in the hierarchy.


Thanks!

--
Qais Yousef

2023-03-01 07:32:14

by Juri Lelli

Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

Hi,

On 28/02/23 17:46, Qais Yousef wrote:
> On 02/28/23 15:09, Dietmar Eggemann wrote:
>
> > > IIUC you're suggesting to introduce some new mechanism to detect if hotplug has
> > > lead to a cpu to disappear or not and use that instead? Are you saying I can
> > > use arch_update_cpu_topology() for that? Something like this?
> > >
> > > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > > index e5ddc8e11e5d..60c3dcf06f0d 100644
> > > --- a/kernel/cgroup/cpuset.c
> > > +++ b/kernel/cgroup/cpuset.c
> > > @@ -1122,7 +1122,7 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> > > {
> > > mutex_lock(&sched_domains_mutex);
> > > partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
> > > - if (update_dl_accounting)
> > > + if (arch_update_cpu_topology())
> > > update_dl_rd_accounting();
> > > mutex_unlock(&sched_domains_mutex);
> > > }
> >
> > No, this is not what I meant. I'm just saying the:
> >
> > partition_sched_domains_locked()
> > new_topology = arch_update_cpu_topology();
> >
> > has to be considered here as well since we do a
> > `dl_clear_root_domain(rd)` (1) in partition_sched_domains_locked() for
> > !new_topology.
>
> Ah you're referring to the dl_clear_root_domain() call there. I thought this
> doesn't trigger.
>
> >
> > And (1) requires the `update_tasks_root_domain()` to happen later.
> >
> > So there are cases now, e.g. `rebuild_sched_domains_energy()` in which
> > `new_topology=0` and `update_dl_accounting=false` which now clean the rd
> > but don't do a new DL accounting anymore.
> > rebuild_root_domains() itself cleans the `default root domain`, not the
> > other root domains which could exists as well.
> >
> > Example: Switching CPUfreq policy [0,3-5] performance to schedutil (slow
> > switching, i.e. we have sugov:X DL task(s)):
> >
> > [ 862.479906] CPU4 partition_sched_domains_locked() new_topology=0
> > [ 862.499073] Workqueue: events rebuild_sd_workfn
> > [ 862.503646] Call trace:
> > ...
> > [ 862.520789] partition_sched_domains_locked+0x6c/0x670
> > [ 862.525962] rebuild_sched_domains_locked+0x204/0x8a0
> > [ 862.531050] rebuild_sched_domains+0x2c/0x50
> > [ 862.535351] rebuild_sd_workfn+0x38/0x54 <-- !
> > ...
> > [ 862.554047] CPU4 dl_clear_root_domain() rd->span=0-5 total_bw=0
> > def_root_domain=0 <-- !
> > [ 862.561597] CPU4 dl_clear_root_domain() rd->span= total_bw=0
> > def_root_domain=1
> > [ 862.568960] CPU4 dl_add_task_root_domain() [sugov:0 1801]
> > total_bw=104857 def_root_domain=0 rd=0xffff0008015f0000 <-- !
> >
> > The dl_clear_root_domain() of the def_root_domain and the
> > dl_add_task_root_domain() to the rd in use won't happen.
> >
> > [sugov:0 1801] is only a simple example here. I could have spawned a
> > couple of DL tasks before this to illustrate the issue more obvious.
> >
> > ---
> >
> > The same seems to happen during suspend/resume (system with 2 frequency
> > domains, both with slow switching schedutil CPUfreq gov):
> >
> > [ 27.735821] CPU5 partition_sched_domains_locked() new_topology=0
> > ...
> > [ 27.735864] Workqueue: events cpuset_hotplug_workfn
> > [ 27.735894] Call trace:
> > ...
> > [ 27.735984] partition_sched_domains_locked+0x6c/0x670
> > [ 27.736004] rebuild_sched_domains_locked+0x204/0x8a0
> > [ 27.736026] cpuset_hotplug_workfn+0x254/0x52c <-- !
> > ...
> > [ 27.736155] CPU5 dl_clear_root_domain() rd->span=0-5 total_bw=0
> > def_root_domain=0 <-- !
> > [ 27.736178] CPU5 dl_clear_root_domain() rd->span= total_bw=0
> > def_root_domain=1
> > [ 27.736296] CPU5 dl_add_task_root_domain() [sugov:0 80] <-- !
> > total_bw=104857 def_root_domain=0 rd=0xffff000801728000
> > [ 27.736318] CPU5 dl_add_task_root_domain() [sugov:1 81]
> > total_bw=209714 def_root_domain=0 rd=0xffff000801728000 <-- !
> > ...
> >
> > > I am not keen on this. arm64 seems to just read a value without a side effect.
> >
> > Arm64 (among others) sets `update_topology=1` before
> > `rebuild_sched_domains()` and `update_topology=0` after it in
> > update_topology_flags_workfn(). This then makes `new_topology=1` in
> > partition_sched_domains_locked().
> >
> > > But x86 does reset this value so we can't read it twice in the same call tree
> > > and I'll have to extract it.
> > >
> > > The better solution that was discussed before is to not iterate through every
> > > task in the system and let cpuset track when dl tasks are added to it and do
> > > smarter iteration. ATM even if there are no dl tasks in the system we'll
> > > blindly go through every task in the hierarchy to update nothing.
> >
> > Yes, I can see the problem. And IMHO this solution approach seems to be
> > better than parsing update_dl_accounting` through the stack of involved
> > functions.
>
> The best I can do is protect this dl_clear_root_domain() too. I really don't
> have my heart in this but trying my best to help, but it has taken a lot of my
> time already and would prefer to hand over to Juri to address this regression
> if what I am proposing is not good enough.
>
> FWIW, there are 0 dl tasks in the system where this was noticed. And this delay
> is unbounded because it'll depend on how many tasks there are in the hierarchy.

Not ignoring you guys here, but it turns out I'm quite bogged down with
other stuff at the moment. :/ So, apologies and I'll try to get to this
asap. Thanks a lot for all your efforts and time reviewing so far!

Best,
Juri


2023-03-01 12:29:02

by Qais Yousef

Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On 03/01/23 08:31, Juri Lelli wrote:
> Hi,
>
> On 28/02/23 17:46, Qais Yousef wrote:
> > On 02/28/23 15:09, Dietmar Eggemann wrote:
> >
> > > > IIUC you're suggesting to introduce some new mechanism to detect if hotplug has
> > > > lead to a cpu to disappear or not and use that instead? Are you saying I can
> > > > use arch_update_cpu_topology() for that? Something like this?
> > > >
> > > > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > > > index e5ddc8e11e5d..60c3dcf06f0d 100644
> > > > --- a/kernel/cgroup/cpuset.c
> > > > +++ b/kernel/cgroup/cpuset.c
> > > > @@ -1122,7 +1122,7 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> > > > {
> > > > mutex_lock(&sched_domains_mutex);
> > > > partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
> > > > - if (update_dl_accounting)
> > > > + if (arch_update_cpu_topology())
> > > > update_dl_rd_accounting();
> > > > mutex_unlock(&sched_domains_mutex);
> > > > }
> > >
> > > No, this is not what I meant. I'm just saying the:
> > >
> > > partition_sched_domains_locked()
> > > new_topology = arch_update_cpu_topology();
> > >
> > > has to be considered here as well since we do a
> > > `dl_clear_root_domain(rd)` (1) in partition_sched_domains_locked() for
> > > !new_topology.
> >
> > Ah you're referring to the dl_clear_root_domain() call there. I thought this
> > doesn't trigger.
> >
> > >
> > > And (1) requires the `update_tasks_root_domain()` to happen later.
> > >
> > > So there are cases now, e.g. `rebuild_sched_domains_energy()` in which
> > > `new_topology=0` and `update_dl_accounting=false` which now clean the rd
> > > but don't do a new DL accounting anymore.
> > > rebuild_root_domains() itself cleans the `default root domain`, not the
> > > other root domains which could exists as well.
> > >
> > > Example: Switching CPUfreq policy [0,3-5] performance to schedutil (slow
> > > switching, i.e. we have sugov:X DL task(s)):
> > >
> > > [ 862.479906] CPU4 partition_sched_domains_locked() new_topology=0
> > > [ 862.499073] Workqueue: events rebuild_sd_workfn
> > > [ 862.503646] Call trace:
> > > ...
> > > [ 862.520789] partition_sched_domains_locked+0x6c/0x670
> > > [ 862.525962] rebuild_sched_domains_locked+0x204/0x8a0
> > > [ 862.531050] rebuild_sched_domains+0x2c/0x50
> > > [ 862.535351] rebuild_sd_workfn+0x38/0x54 <-- !
> > > ...
> > > [ 862.554047] CPU4 dl_clear_root_domain() rd->span=0-5 total_bw=0
> > > def_root_domain=0 <-- !
> > > [ 862.561597] CPU4 dl_clear_root_domain() rd->span= total_bw=0
> > > def_root_domain=1
> > > [ 862.568960] CPU4 dl_add_task_root_domain() [sugov:0 1801]
> > > total_bw=104857 def_root_domain=0 rd=0xffff0008015f0000 <-- !
> > >
> > > The dl_clear_root_domain() of the def_root_domain and the
> > > dl_add_task_root_domain() to the rd in use won't happen.
> > >
> > > [sugov:0 1801] is only a simple example here. I could have spawned a
> > > couple of DL tasks before this to illustrate the issue more obvious.
> > >
> > > ---
> > >
> > > The same seems to happen during suspend/resume (system with 2 frequency
> > > domains, both with slow switching schedutil CPUfreq gov):
> > >
> > > [ 27.735821] CPU5 partition_sched_domains_locked() new_topology=0
> > > ...
> > > [ 27.735864] Workqueue: events cpuset_hotplug_workfn
> > > [ 27.735894] Call trace:
> > > ...
> > > [ 27.735984] partition_sched_domains_locked+0x6c/0x670
> > > [ 27.736004] rebuild_sched_domains_locked+0x204/0x8a0
> > > [ 27.736026] cpuset_hotplug_workfn+0x254/0x52c <-- !
> > > ...
> > > [ 27.736155] CPU5 dl_clear_root_domain() rd->span=0-5 total_bw=0
> > > def_root_domain=0 <-- !
> > > [ 27.736178] CPU5 dl_clear_root_domain() rd->span= total_bw=0
> > > def_root_domain=1
> > > [ 27.736296] CPU5 dl_add_task_root_domain() [sugov:0 80] <-- !
> > > total_bw=104857 def_root_domain=0 rd=0xffff000801728000
> > > [ 27.736318] CPU5 dl_add_task_root_domain() [sugov:1 81]
> > > total_bw=209714 def_root_domain=0 rd=0xffff000801728000 <-- !
> > > ...
> > >
> > > > I am not keen on this. arm64 seems to just read a value without a side effect.
> > >
> > > Arm64 (among others) sets `update_topology=1` before
> > > `rebuild_sched_domains()` and `update_topology=0` after it in
> > > update_topology_flags_workfn(). This then makes `new_topology=1` in
> > > partition_sched_domains_locked().
> > >
> > > > But x86 does reset this value so we can't read it twice in the same call tree
> > > > and I'll have to extract it.
> > > >
> > > > The better solution that was discussed before is to not iterate through every
> > > > task in the system and let cpuset track when dl tasks are added to it and do
> > > > smarter iteration. ATM even if there are no dl tasks in the system we'll
> > > > blindly go through every task in the hierarchy to update nothing.
> > >
> > > Yes, I can see the problem. And IMHO this solution approach seems to be
> > > better than parsing update_dl_accounting` through the stack of involved
> > > functions.
> >
> > The best I can do is protect this dl_clear_root_domain() too. I really don't
> > have my heart in this but trying my best to help, but it has taken a lot of my
> > time already and would prefer to hand over to Juri to address this regression
> > if what I am proposing is not good enough.
> >
> > FWIW, there are 0 dl tasks in the system where this was noticed. And this delay
> > is unbounded because it'll depend on how many tasks there are in the hierarchy.
>
> Not ignoring you guys here, but it turns out I'm quite bogged down with
> other stuff at the moment. :/ So, apologies and I'll try to get to this
> asap. Thanks a lot for all your efforts and time reviewing so far!

Np, I can feel you :-)


Thanks

--
Qais Yousef

2023-03-01 14:27:42

by Juri Lelli

Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On 01/03/23 12:28, Qais Yousef wrote:
> On 03/01/23 08:31, Juri Lelli wrote:

...

> > Not ignoring you guys here, but it turns out I'm quite bogged down with
> > other stuff at the moment. :/ So, apologies and I'll try to get to this
> > asap. Thanks a lot for all your efforts and time reviewing so far!
>
> Np, I can feel you :-)

Eh. :/

BTW, do you have a repro script of some sort handy I might play with?


2023-03-01 17:04:06

by Qais Yousef

Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On 03/01/23 15:26, Juri Lelli wrote:
> On 01/03/23 12:28, Qais Yousef wrote:
> > On 03/01/23 08:31, Juri Lelli wrote:
>
> ...
>
> > > Not ignoring you guys here, but it turns out I'm quite bogged down with
> > > other stuff at the moment. :/ So, apologies and I'll try to get to this
> > > asap. Thanks a lot for all your efforts and time reviewing so far!
> >
> > Np, I can feel you :-)
>
> Eh. :/

I hope I did not offend. That was meant as "no pressure, I understand".

>
> BTW, do you have a repro script of some sort handy I might play with?

Sorry no. You'll just need to suspend to RAM. I had a simple patch to measure
the time around the call and trace_printk'ed the result.
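
It was just something along these lines (not the actual patch; the call site
shown is only an example of what was being timed):

	u64 start, delta;

	start = ktime_get_ns();
	rebuild_root_domains();
	delta = ktime_get_ns() - start;
	trace_printk("rebuild_root_domains() took %llu us\n",
		     delta / NSEC_PER_USEC);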

I was working on an Android phone, which just suspends to RAM if you turn the
screen off and disconnect the USB.

I think you can use rtcwake in a normal desktop environment. Something like

sudo rtcwake -u -s 5 -m mem


Thanks!

--
Qais Yousef

2023-03-07 20:00:35

by Hao Luo

Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On Mon, Feb 6, 2023 at 2:15 PM Qais Yousef <[email protected]> wrote:
>
> Commit f9a25f776d78 ("cpusets: Rebuild root domain deadline accounting information")
> enabled rebuilding root domain on cpuset and hotplug operations to
> correct deadline accounting.
>
> Rebuilding root domain is a slow operation and we see 10+ of ms delays
> on suspend-resume because of that (worst case captures 20ms which
> happens often).
>
> Since nothing is expected to change on suspend-resume operation; skip
> rebuilding the root domains to regain the some of the time lost.
>
> Achieve this by refactoring the code to pass whether dl accoutning needs
> an update to rebuild_sched_domains(). And while at it, rename
> rebuild_root_domains() to update_dl_rd_accounting() which I believe is
> a more representative name since we are not really rebuilding the root
> domains, but rather updating dl accounting at the root domain.
>
> Some users of rebuild_sched_domains() will skip dl accounting update
> now:
>
> * Update sched domains when relaxing the domain level in cpuset
> which only impacts searching level in load balance
> * update sched domains when cpufreq governor changes and we need
> to create the perf domains
>
> Users in arch/x86 and arch/s390 are left with the old behavior.
>
> Debugged-by: Rick Yiu <[email protected]>
> Signed-off-by: Qais Yousef (Google) <[email protected]>
> ---

Hi Qais,

Thank you for reporting this. We observed the same issue in our
production environment. rebuild_root_domains() is also called under
cpuset_write_resmask(), which handles writing to cpuset.cpus. Under
production workloads, on a 4.15 kernel, we observed the median latency
of writing cpuset.cpus at 3ms, with p99 at 7ms. Now the median becomes
60ms, with p99 at >100ms. Writing cpuset.cpus is a fairly frequent and
critical path in production, but blindly traversing every task in the
system is not scalable. And its cost is really unnecessary for users
who don't use deadline tasks at all.

>
> The better solution that was discussed before is to not iterate through every
> task in the system and let cpuset track when dl tasks are added to it and do
> smarter iteration. ATM even if there are no dl tasks in the system we'll
> blindly go through every task in the hierarchy to update nothing.
>

Great idea. This works. I tried this solution. With that, 98% of
cpuset.cpus writes are now within 7ms. Though there are still long
tails, they are caused by another issue, unrelated to
rebuild_root_domains().

Your suggestion of update_dl_rd_accounting() also makes sense to me.

Hao

2023-03-07 20:10:00

by Waiman Long

Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On 3/7/23 14:56, Hao Luo wrote:
> On Mon, Feb 6, 2023 at 2:15 PM Qais Yousef <[email protected]> wrote:
>> <...>
> Hi Qais,
>
> Thank you for reporting this. We observed the same issue in our
> production environment. Rebuild_root_domains() is also called under
> cpuset_write_resmask, which handles writing to cpuset.cpus. Under
> production workloads, on a 4.15 kernel, we observed the median latency
> of writing cpuset.cpus at 3ms, p99 at 7ms. Now the median becomes
> 60ms, p99 at >100ms. Writing cpuset.cpus is a fairly frequent and
> critical path in production, but blindly traversing every task in the
> system is not scalable. And its cost is really unnecessary for users
> who don't use deadline tasks at all.

The rebuild_root_domains() function shouldn't be called when updating
cpuset.cpus unless it is a partition root. Is it?

Cheers,
Longman


2023-03-07 21:07:16

by Hao Luo

[permalink] [raw]
Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On Tue, Mar 7, 2023 at 12:09 PM Waiman Long <[email protected]> wrote:
>
> On 3/7/23 14:56, Hao Luo wrote:
> > On Mon, Feb 6, 2023 at 2:15 PM Qais Yousef <[email protected]> wrote:
> >> <...>
> > Hi Qais,
> >
> > Thank you for reporting this. We observed the same issue in our
> > production environment. Rebuild_root_domains() is also called under
> > cpuset_write_resmask, which handles writing to cpuset.cpus. Under
> > production workloads, on a 4.15 kernel, we observed the median latency
> > of writing cpuset.cpus at 3ms, p99 at 7ms. Now the median becomes
> > 60ms, p99 at >100ms. Writing cpuset.cpus is a fairly frequent and
> > critical path in production, but blindly traversing every task in the
> > system is not scalable. And its cost is really unnecessary for users
> > who don't use deadline tasks at all.
>
> The rebuild_root_domains() function shouldn't be called when updating
> cpuset.cpus unless it is a partition root. Is it?
>

I think it's because we were using the legacy hierarchy. I'm not
familiar with cpuset partition though.

Hao

2023-03-07 21:14:56

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On 3/7/23 16:06, Hao Luo wrote:
> On Tue, Mar 7, 2023 at 12:09 PM Waiman Long <[email protected]> wrote:
>> On 3/7/23 14:56, Hao Luo wrote:
>>> On Mon, Feb 6, 2023 at 2:15 PM Qais Yousef <[email protected]> wrote:
>>>> <...>
>>> Hi Qais,
>>>
>>> Thank you for reporting this. We observed the same issue in our
>>> production environment. Rebuild_root_domains() is also called under
>>> cpuset_write_resmask, which handles writing to cpuset.cpus. Under
>>> production workloads, on a 4.15 kernel, we observed the median latency
>>> of writing cpuset.cpus at 3ms, p99 at 7ms. Now the median becomes
>>> 60ms, p99 at >100ms. Writing cpuset.cpus is a fairly frequent and
>>> critical path in production, but blindly traversing every task in the
>>> system is not scalable. And its cost is really unnecessary for users
>>> who don't use deadline tasks at all.
>> The rebuild_root_domains() function shouldn't be called when updating
>> cpuset.cpus unless it is a partition root. Is it?
>>
> I think it's because we were using the legacy hierarchy. I'm not
> familiar with cpuset partition though.

In the legacy hierarchy, changing cpuset.cpus shouldn't lead to
rebuild_root_domains() being called unless you play with the
cpuset.sched_load_balance file by changing it to 0 in the right cpusets.
If you are touching cpuset.sched_load_balance, you shouldn't be changing
cpuset.cpus that often.

Cheers,
Longman


2023-03-07 22:17:43

by Hao Luo

[permalink] [raw]
Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On Tue, Mar 7, 2023 at 1:13 PM Waiman Long <[email protected]> wrote:
>
> On 3/7/23 16:06, Hao Luo wrote:
> > On Tue, Mar 7, 2023 at 12:09 PM Waiman Long <[email protected]> wrote:
> >> On 3/7/23 14:56, Hao Luo wrote:
> >>> On Mon, Feb 6, 2023 at 2:15 PM Qais Yousef <[email protected]> wrote:
> >>>> <...>
> >>> Hi Qais,
> >>>
> >>> Thank you for reporting this. We observed the same issue in our
> >>> production environment. Rebuild_root_domains() is also called under
> >>> cpuset_write_resmask, which handles writing to cpuset.cpus. Under
> >>> production workloads, on a 4.15 kernel, we observed the median latency
> >>> of writing cpuset.cpus at 3ms, p99 at 7ms. Now the median becomes
> >>> 60ms, p99 at >100ms. Writing cpuset.cpus is a fairly frequent and
> >>> critical path in production, but blindly traversing every task in the
> >>> system is not scalable. And its cost is really unnecessary for users
> >>> who don't use deadline tasks at all.
> >> The rebuild_root_domains() function shouldn't be called when updating
> >> cpuset.cpus unless it is a partition root. Is it?
> >>
> > I think it's because we were using the legacy hierarchy. I'm not
> > familiar with cpuset partition though.
>
> In legacy hierarchy, changing cpuset.cpus shouldn't lead to the calling
> of rebuild_root_domains() unless you play with cpuset.sched_load_balance
> file by changing it to 0 in the right cpusets. If you are touching
> cpuset.sched_load_balance, you shouldn't change cpuset.cpus that often.
>

Actually, I think it's the opposite. If I understand the code
correctly[1], it looks like rebuild_root_domains is called when
LOAD_BALANCE _is_ set and sched_load_balance is 1 by default. Is that
condition a bug?

I don't think we updated cpuset.sched_load_balance.

[1] https://github.com/torvalds/linux/blob/master/kernel/cgroup/cpuset.c#L1677
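
For reference, the check behind that link looks roughly like this (quoted
from memory of update_cpumasks_hier(), so treat the exact form as an
approximation):

	/*
	 * On the legacy (v1) hierarchy, !cgroup_subsys_on_dfl() is always
	 * true, so any cpuset with a non-empty cpumask and load balancing
	 * enabled (the default) marks the sched domains for a rebuild when
	 * its cpus change.
	 */
	if (!cpumask_empty(cp->cpus_allowed) &&
	    is_sched_load_balance(cp) &&
	    (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) ||
	     is_partition_root(cp)))
		need_rebuild_sched_domains = true;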

2023-03-08 02:30:54

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On 3/7/23 17:17, Hao Luo wrote:
> On Tue, Mar 7, 2023 at 1:13 PM Waiman Long <[email protected]> wrote:
>> On 3/7/23 16:06, Hao Luo wrote:
>>> On Tue, Mar 7, 2023 at 12:09 PM Waiman Long <[email protected]> wrote:
>>>> On 3/7/23 14:56, Hao Luo wrote:
>>>>> On Mon, Feb 6, 2023 at 2:15 PM Qais Yousef <[email protected]> wrote:
>>>>>> <...>
>>>>> Hi Qais,
>>>>>
>>>>> Thank you for reporting this. We observed the same issue in our
>>>>> production environment. Rebuild_root_domains() is also called under
>>>>> cpuset_write_resmask, which handles writing to cpuset.cpus. Under
>>>>> production workloads, on a 4.15 kernel, we observed the median latency
>>>>> of writing cpuset.cpus at 3ms, p99 at 7ms. Now the median becomes
>>>>> 60ms, p99 at >100ms. Writing cpuset.cpus is a fairly frequent and
>>>>> critical path in production, but blindly traversing every task in the
>>>>> system is not scalable. And its cost is really unnecessary for users
>>>>> who don't use deadline tasks at all.
>>>> The rebuild_root_domains() function shouldn't be called when updating
>>>> cpuset.cpus unless it is a partition root. Is it?
>>>>
>>> I think it's because we were using the legacy hierarchy. I'm not
>>> familiar with cpuset partition though.
>> In legacy hierarchy, changing cpuset.cpus shouldn't lead to the calling
>> of rebuild_root_domains() unless you play with cpuset.sched_load_balance
>> file by changing it to 0 in the right cpusets. If you are touching
>> cpuset.sched_load_balance, you shouldn't change cpuset.cpus that often.
>>
> Actually, I think it's the opposite. If I understand the code
> correctly[1], it looks like rebuild_root_domains is called when
> LOAD_BALANCE _is_ set and sched_load_balance is 1 by default. Is that
> condition a bug?
>
> I don't think we updated cpuset.sched_load_balance.
>
> [1] https://github.com/torvalds/linux/blob/master/kernel/cgroup/cpuset.c#L1677

The only reason rebuild_root_domains() is called is because the
scheduling domains were changed. The cpuset.sched_load_balance control
file is 1 by default. If no one touches it, there is just one global
scheduling domain that covers all the active CPUs. However, by setting
cpuset.sched_load_balance to 0 in the right cpusets, you can create
multiple scheduling domains or disable load balancing on some CPUs.
With that setup, changing cpuset.cpus in the right place can cause
rebuild_root_domains() to be called because the set of scheduling
domains is changed.

Cheers,
Longman


2023-03-08 10:20:11

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On 01/03/23 17:03, Qais Yousef wrote:
> On 03/01/23 15:26, Juri Lelli wrote:
> > On 01/03/23 12:28, Qais Yousef wrote:
> > > On 03/01/23 08:31, Juri Lelli wrote:
> >
> > ...
> >
> > > > Not ignoring you guys here, but it turns out I'm quite bogged down with
> > > > other stuff at the moment. :/ So, apologies and I'll try to get to this
> > > > asap. Thanks a lot for all your efforts and time reviewing so far!
> > >
> > > Np, I can feel you :-)
> >
> > Eh. :/
>
> I hope I did not offend. That was meant as no pressure, I understand.

No offence at all! I meant "we are all on the same boat it seems". :)

> > BTW, do you have a repro script of some sort handy I might play with?
>
> Sorry no. You'll just need to suspend to ram. I had a simple patch to measure
> the time around the call and trace_printk'ed the result.
>
> I was working on a android phone which just suspends to ram if you turn the
> screen off and disconnect the usb.

Looks like I could come up with the following

https://github.com/jlelli/linux.git deadline/rework-cpusets
https://github.com/jlelli/linux/tree/deadline/rework-cpusets

which I don't think is at a point where I feel comfortable proposing it
as an RFC (not even sure it actually makes sense), but it survived my
very light testing.

Could you please take a look and, if it makes some sense in theory, give
it a try on your end?

Thanks!
Juri


2023-03-08 18:02:39

by Hao Luo

[permalink] [raw]
Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On Wed, Mar 8, 2023 at 2:20 AM Juri Lelli <[email protected]> wrote:
>
> On 01/03/23 17:03, Qais Yousef wrote:
> > On 03/01/23 15:26, Juri Lelli wrote:
<...>
> > > BTW, do you have a repro script of some sort handy I might play with?
> >
> > Sorry no. You'll just need to suspend to ram. I had a simple patch to measure
> > the time around the call and trace_printk'ed the result.
> >
> > I was working on a android phone which just suspends to ram if you turn the
> > screen off and disconnect the usb.
>
> Looks like I could come up with the following
>
> https://github.com/jlelli/linux.git deadline/rework-cpusets
> https://github.com/jlelli/linux/tree/deadline/rework-cpusets
>
> which I don't think it's at a point that I feel comfortable to propose
> as an RFC (not even sure if it actually makes sense), but it survived my
> very light testing.
>
> Could you please take a look and, if it makes some sense in theory, give
> it a try on your end?
>

Hi Juri,

Thanks for coming up with the RFC. I can test your changes in the
server environment. I observed the same issue on my side and I can
reproduce it.

I sync'ed up with Qais offline earlier yesterday, and was preparing a
couple of patches that optimize the cpuset.cpus writes. Tracking dl
tasks in cpusets is one of them. But I am happy to take your patches
and do the testing. Note that I won't be able to test the dl part of
the patch, only the latency impact on rebuild_root_domains(), as we
don't have dl tasks in our system.

The other patch fixes cpuset_rwsem. I see you switched it back to a
mutex. I did observe performance issues with cpuset_rwsem. Basically,
using a percpu_rwsem generates very, very long latency tails for writers,
while a mutex doesn't. After some debugging, I found it was because a
percpu_rwsem requires every writer to call synchronize_rcu() to
acquire the lock. So in my patches, I disabled the reader fastpath for
cpuset_rwsem. This has been done before[1]. But a mutex also worked.
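
For reference, a minimal sketch of what disabling the reader fastpath can
look like, mirroring what cgroup_init() does for cgroup_threadgroup_rwsem
(the exact cpuset call site below is an assumption, not the actual patch):

/* kernel/cgroup/cpuset.c -- hypothetical sketch */
DEFINE_STATIC_PERCPU_RWSEM(cpuset_rwsem);

int __init cpuset_init(void)
{
	/*
	 * Pre-enter the rcu_sync state so readers always take the slow
	 * (non-percpu) path; writers then no longer pay a synchronize_rcu()
	 * on every percpu_down_write().
	 */
	rcu_sync_enter_start(&cpuset_rwsem.rss);

	return 0;
}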

Anyway, I'm happy to test your patches and ack once they are sent out.

[1] https://lore.kernel.org/all/[email protected]/T/#m13ef9a728c89cccc9c02af3c96924096c76162a5

Hao

2023-03-08 18:11:31

by Hao Luo

[permalink] [raw]
Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On Tue, Mar 7, 2023 at 6:30 PM Waiman Long <[email protected]> wrote:
>
> On 3/7/23 17:17, Hao Luo wrote:
> > On Tue, Mar 7, 2023 at 1:13 PM Waiman Long <[email protected]> wrote:
> >> On 3/7/23 16:06, Hao Luo wrote:
> >>> On Tue, Mar 7, 2023 at 12:09 PM Waiman Long <[email protected]> wrote:
> >>>> On 3/7/23 14:56, Hao Luo wrote:
<...>
> >>>>> Hi Qais,
> >>>>>
> >>>>> Thank you for reporting this. We observed the same issue in our
> >>>>> production environment. Rebuild_root_domains() is also called under
> >>>>> cpuset_write_resmask, which handles writing to cpuset.cpus. Under
> >>>>> production workloads, on a 4.15 kernel, we observed the median latency
> >>>>> of writing cpuset.cpus at 3ms, p99 at 7ms. Now the median becomes
> >>>>> 60ms, p99 at >100ms. Writing cpuset.cpus is a fairly frequent and
> >>>>> critical path in production, but blindly traversing every task in the
> >>>>> system is not scalable. And its cost is really unnecessary for users
> >>>>> who don't use deadline tasks at all.
> >>>> The rebuild_root_domains() function shouldn't be called when updating
> >>>> cpuset.cpus unless it is a partition root. Is it?
> >>>>
> >>> I think it's because we were using the legacy hierarchy. I'm not
> >>> familiar with cpuset partition though.
> >> In legacy hierarchy, changing cpuset.cpus shouldn't lead to the calling
> >> of rebuild_root_domains() unless you play with cpuset.sched_load_balance
> >> file by changing it to 0 in the right cpusets. If you are touching
> >> cpuset.sched_load_balance, you shouldn't change cpuset.cpus that often.
> >>
> > Actually, I think it's the opposite. If I understand the code
> > correctly[1], it looks like rebuild_root_domains is called when
> > LOAD_BALANCE _is_ set and sched_load_balance is 1 by default. Is that
> > condition a bug?
> >
> > I don't think we updated cpuset.sched_load_balance.
> >
> > [1] https://github.com/torvalds/linux/blob/master/kernel/cgroup/cpuset.c#L1677
>
> The only reason rebuild_root_domains() is called is because the the
> scheduling domains were changed. The cpuset.sched_load_balance control
> file is 1 by default. If no one touch it, there is just one global
> scheduling domain that covers all the active CPUs. However, by setting
> cpuset.sched_load_balance to 0 in the right cpusets, you can create
> multiple scheduling domains or disabling load balancing on some CPUs.
> With that setup, changing cpuset.cpus in the right place can cause
> rebuild_root_domains() to be called because the set of scheduling
> domains are changed.
>

Thanks Longman for the explanation.

I believe we don't touch cpuset.sched_load_balance, so I don't know
what's wrong for now. But I've taken note and will go back to debug
further and see if there is any setup that needs to be fixed in our
system.

Hao

2023-03-08 19:23:04

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On 3/8/23 05:19, Juri Lelli wrote:
> On 01/03/23 17:03, Qais Yousef wrote:
>> On 03/01/23 15:26, Juri Lelli wrote:
>>> On 01/03/23 12:28, Qais Yousef wrote:
>>>> On 03/01/23 08:31, Juri Lelli wrote:
>>> ...
>>>
>>>>> Not ignoring you guys here, but it turns out I'm quite bogged down with
>>>>> other stuff at the moment. :/ So, apologies and I'll try to get to this
>>>>> asap. Thanks a lot for all your efforts and time reviewing so far!
>>>> Np, I can feel you :-)
>>> Eh. :/
>> I hope I did not offend. That was meant as no pressure, I understand.
> No offence at all! I meant "we are all on the same boat it seems". :)
>
>>> BTW, do you have a repro script of some sort handy I might play with?
>> Sorry no. You'll just need to suspend to ram. I had a simple patch to measure
>> the time around the call and trace_printk'ed the result.
>>
>> I was working on a android phone which just suspends to ram if you turn the
>> screen off and disconnect the usb.
> Looks like I could come up with the following
>
> https://github.com/jlelli/linux.git deadline/rework-cpusets
> https://github.com/jlelli/linux/tree/deadline/rework-cpusets
>
> which I don't think it's at a point that I feel comfortable to propose
> as an RFC (not even sure if it actually makes sense), but it survived my
> very light testing.

I like your patch to revert the cpuset_rwsem change. In fact, I was
planning to ask you about that. It is causing a lot more latency in
workloads that need to change cpuset configuration rather frequently.

Cheers,
Longman


2023-03-09 06:56:04

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

Hi Hao,

On 08/03/23 10:01, Hao Luo wrote:
> On Wed, Mar 8, 2023 at 2:20 AM Juri Lelli <[email protected]> wrote:
> >
> > On 01/03/23 17:03, Qais Yousef wrote:
> > > On 03/01/23 15:26, Juri Lelli wrote:
> <...>
> > > > BTW, do you have a repro script of some sort handy I might play with?
> > >
> > > Sorry no. You'll just need to suspend to ram. I had a simple patch to measure
> > > the time around the call and trace_printk'ed the result.
> > >
> > > I was working on a android phone which just suspends to ram if you turn the
> > > screen off and disconnect the usb.
> >
> > Looks like I could come up with the following
> >
> > https://github.com/jlelli/linux.git deadline/rework-cpusets
> > https://github.com/jlelli/linux/tree/deadline/rework-cpusets
> >
> > which I don't think it's at a point that I feel comfortable to propose
> > as an RFC (not even sure if it actually makes sense), but it survived my
> > very light testing.
> >
> > Could you please take a look and, if it makes some sense in theory, give
> > it a try on your end?
> >
>
> Hi Juri,
>
> Thanks for coming up with the RFC. I can test your changes in the
> server environment. I observed the same issue on my side and I can
> reproduce.
>
> I sync'ed up with Qais offline earlier yesterday, and was preparing a
> couple of patches that optimize the cpuset.cpus writes. Tracking dl
> tasks in cpusets is one of them. But I am happy to take your patches
> and do the testing. Note that I won't be able to test the dl part of
> the patch, only the latency impact on rebuild_root_domains(), as we
> don't have dl tasks in our system.
>
> The other patch is fixing cpuset_rwsem. I see you switched it back to
> mutex. I did observe performance issues with cpuset_rwsem. Basically,
> using percpu_rwsem generates very very long latency tails for writers,
> but mutex doesn't. After some debugging, I found it was because
> percpu_rwsem requires every writer to call a synchronize_rcu() for
> acquiring the lock. So in my patches, I disabled the fastpath of
> readers for cpuset_rwsem. This has been done before[1]. But mutex also
> worked.
>
> Anyway, I'm happy to test your patches and ack once they are sent out.

Do you strictly need a proper RFC or could you please test the above for
now? If you could please do the latter, and if tests look ok, I could
then put together proper changelogs etc. and propose an RFC (it would
save me some time not to do that if the above doesn't work, apologies
for not going the proper route from the start). Guess this question
applies to Qais as well. Hummm, or maybe you are actually saying that
you are indeed going to test them already, just wanted to make sure
then. :)

Thanks!
Juri


2023-03-09 22:23:35

by Hao Luo

[permalink] [raw]
Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On Wed, Mar 8, 2023 at 10:55 PM Juri Lelli <[email protected]> wrote:
>
> On 08/03/23 10:01, Hao Luo wrote:
> > On Wed, Mar 8, 2023 at 2:20 AM Juri Lelli <[email protected]> wrote:
> > >
> > > On 01/03/23 17:03, Qais Yousef wrote:
> > > > On 03/01/23 15:26, Juri Lelli wrote:
> > <...>
> > > > > BTW, do you have a repro script of some sort handy I might play with?
> > > >
> > > > Sorry no. You'll just need to suspend to ram. I had a simple patch to measure
> > > > the time around the call and trace_printk'ed the result.
> > > >
> > > > I was working on a android phone which just suspends to ram if you turn the
> > > > screen off and disconnect the usb.
> > >
> > > Looks like I could come up with the following
> > >
> > > https://github.com/jlelli/linux.git deadline/rework-cpusets
> > > https://github.com/jlelli/linux/tree/deadline/rework-cpusets
> > >
> > > which I don't think it's at a point that I feel comfortable to propose
> > > as an RFC (not even sure if it actually makes sense), but it survived my
> > > very light testing.
> > >
> > > Could you please take a look and, if it makes some sense in theory, give
> > > it a try on your end?
> > >
> >
> > Hi Juri,
> >
> > Thanks for coming up with the RFC. I can test your changes in the
> > server environment. I observed the same issue on my side and I can
> > reproduce.
> >
> > I sync'ed up with Qais offline earlier yesterday, and was preparing a
> > couple of patches that optimize the cpuset.cpus writes. Tracking dl
> > tasks in cpusets is one of them. But I am happy to take your patches
> > and do the testing. Note that I won't be able to test the dl part of
> > the patch, only the latency impact on rebuild_root_domains(), as we
> > don't have dl tasks in our system.
> >
> > The other patch is fixing cpuset_rwsem. I see you switched it back to
> > mutex. I did observe performance issues with cpuset_rwsem. Basically,
> > using percpu_rwsem generates very very long latency tails for writers,
> > but mutex doesn't. After some debugging, I found it was because
> > percpu_rwsem requires every writer to call a synchronize_rcu() for
> > acquiring the lock. So in my patches, I disabled the fastpath of
> > readers for cpuset_rwsem. This has been done before[1]. But mutex also
> > worked.
> >
> > Anyway, I'm happy to test your patches and ack once they are sent out.
>
> Do you strictly need a proper RFC or could you please test the above for
> now? If you could please do the latter, and if tests look ok, I could
> then put together proper changelogs etc. and propose an RFC (it would
> save me some time not to do that if the above doesn't work, apologies
> for not going the proper route from the start). Guess this question
> applies to Qais as well. Hummm, or maybe you are actually saying that
> you are indeed going to test them already, just wanted to make sure
> then. :)

Juri, I ported your patches to a 5.10 kernel, because my workload can
only run on 5.10. Unfortunately, the kernel crashed at
cpuset_can_attach(). I'll put a few comments in your GitHub branch.

Hao

2023-03-11 18:52:02

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On 03/09/23 14:23, Hao Luo wrote:
> On Wed, Mar 8, 2023 at 10:55 PM Juri Lelli <[email protected]> wrote:
> >
> > On 08/03/23 10:01, Hao Luo wrote:
> > > On Wed, Mar 8, 2023 at 2:20 AM Juri Lelli <[email protected]> wrote:
> > > >
> > > > On 01/03/23 17:03, Qais Yousef wrote:
> > > > > On 03/01/23 15:26, Juri Lelli wrote:
> > > <...>
> > > > > > BTW, do you have a repro script of some sort handy I might play with?
> > > > >
> > > > > Sorry no. You'll just need to suspend to ram. I had a simple patch to measure
> > > > > the time around the call and trace_printk'ed the result.
> > > > >
> > > > > I was working on a android phone which just suspends to ram if you turn the
> > > > > screen off and disconnect the usb.
> > > >
> > > > Looks like I could come up with the following
> > > >
> > > > https://github.com/jlelli/linux.git deadline/rework-cpusets
> > > > https://github.com/jlelli/linux/tree/deadline/rework-cpusets
> > > >
> > > > which I don't think it's at a point that I feel comfortable to propose
> > > > as an RFC (not even sure if it actually makes sense), but it survived my
> > > > very light testing.
> > > >
> > > > Could you please take a look and, if it makes some sense in theory, give
> > > > it a try on your end?
> > > >
> > >
> > > Hi Juri,
> > >
> > > Thanks for coming up with the RFC. I can test your changes in the
> > > server environment. I observed the same issue on my side and I can
> > > reproduce.
> > >
> > > I sync'ed up with Qais offline earlier yesterday, and was preparing a
> > > couple of patches that optimize the cpuset.cpus writes. Tracking dl
> > > tasks in cpusets is one of them. But I am happy to take your patches
> > > and do the testing. Note that I won't be able to test the dl part of
> > > the patch, only the latency impact on rebuild_root_domains(), as we
> > > don't have dl tasks in our system.
> > >
> > > The other patch is fixing cpuset_rwsem. I see you switched it back to
> > > mutex. I did observe performance issues with cpuset_rwsem. Basically,
> > > using percpu_rwsem generates very very long latency tails for writers,
> > > but mutex doesn't. After some debugging, I found it was because
> > > percpu_rwsem requires every writer to call a synchronize_rcu() for
> > > acquiring the lock. So in my patches, I disabled the fastpath of
> > > readers for cpuset_rwsem. This has been done before[1]. But mutex also
> > > worked.
> > >
> > > Anyway, I'm happy to test your patches and ack once they are sent out.
> >
> > Do you strictly need a proper RFC or could you please test the above for
> > now? If you could please do the latter, and if tests look ok, I could
> > then put together proper changelogs etc. and propose an RFC (it would
> > save me some time not to do that if the above doesn't work, apologies
> > for not going the proper route from the start). Guess this question
> > applies to Qais as well. Hummm, or maybe you are actually saying that
> > you are indeed going to test them already, just wanted to make sure
> > then. :)
>
> Juri, I ported your patches to a 5.10 kernel, because my workload can
> only run on 5.10. But unfortunately the kernel crashed at
> cpuset_can_attach(). I'll put a few comments in your github branch.

Yeah, I am working on 5.10 too (this will need to be backported to 5.10 and
5.15 ultimately) and had the same crash because task is NULL.

Fixed it this way, which I think is what you intended to do, Juri? It moves
the check for dl_task(task) inside the cgroup_taskset_for_each() loop.

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 83a8943467fb..06d6bb68d86b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2495,11 +2495,11 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 		ret = security_task_setscheduler(task);
 		if (ret)
 			goto out_unlock;
-	}
 
-	if (dl_task(task)) {
-		cs->deadline_tasks++;
-		cpuset_attach_old_cs->deadline_tasks--;
+		if (dl_task(task)) {
+			cs->deadline_tasks++;
+			cpuset_attach_old_cs->deadline_tasks--;
+		}
 	}
 
 	/*

Like Hao I don't have any deadline tasks in the system. With the fix above
I don't notice the delay on suspend resume using your patches.

If you want any debug; please feel free to add them into your branch so I can
run with that and give you the log.


Thanks!

--
Qais Yousef

2023-03-13 12:37:59

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On 08/03/2023 11:19, Juri Lelli wrote:
> On 01/03/23 17:03, Qais Yousef wrote:
>> On 03/01/23 15:26, Juri Lelli wrote:
>>> On 01/03/23 12:28, Qais Yousef wrote:
>>>> On 03/01/23 08:31, Juri Lelli wrote:
>>>
>>> ...
>>>
>>>>> Not ignoring you guys here, but it turns out I'm quite bogged down with
>>>>> other stuff at the moment. :/ So, apologies and I'll try to get to this
>>>>> asap. Thanks a lot for all your efforts and time reviewing so far!
>>>>
>>>> Np, I can feel you :-)
>>>
>>> Eh. :/
>>
>> I hope I did not offend. That was meant as no pressure, I understand.
>
> No offence at all! I meant "we are all on the same boat it seems". :)
>
>>> BTW, do you have a repro script of some sort handy I might play with?
>>
>> Sorry no. You'll just need to suspend to ram. I had a simple patch to measure
>> the time around the call and trace_printk'ed the result.
>>
>> I was working on a android phone which just suspends to ram if you turn the
>> screen off and disconnect the usb.
>
> Looks like I could come up with the following
>
> https://github.com/jlelli/linux.git deadline/rework-cpusets
> https://github.com/jlelli/linux/tree/deadline/rework-cpusets
>
> which I don't think it's at a point that I feel comfortable to propose
> as an RFC (not even sure if it actually makes sense), but it survived my
> very light testing.
>
> Could you please take a look and, if it makes some sense in theory, give
> it a try on your end?

sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets

@@ -2474,6 +2494,11 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 			goto out_unlock;
 	}
 
+	if (dl_task(task)) {
+		cs->deadline_tasks++;
+		cpuset_attach_old_cs->deadline_tasks--;
+	}
+

This one looks odd. task is NULL here?

If you move a DL task from one cpuset to another this happens now:

root@juno:~# ps2 | grep DLN
82 82 140 0 - DLN sugov:0
83 83 140 0 - DLN sugov:1
1615 1616 140 0 - DLN thread0-0
1615 1617 140 0 - DLN thread0-1
1615 1618 140 0 - DLN thread0-2
1615 1619 140 0 - DLN thread0-3
1615 1620 140 0 - DLN thread0-4
1615 1621 140 0 - DLN thread0-5
1615 1622 140 0 - DLN thread0-6
1615 1623 140 0 - DLN thread0-7
1615 1624 140 0 - DLN thread0-8
1615 1625 140 0 - DLN thread0-9
1615 1626 140 0 - DLN thread0-10
1615 1627 140 0 - DLN thread0-11 <--

root@juno:~# cd /sys/fs/cgroup/cpuset
root@juno:~# mkdir cs1
root@juno:~# echo 0 > cs1/cpuset.mems
root@juno:~# echo 0,3-5 > cs1/cpuset.cpus

root@juno:~# echo 1627 > cs1/tasks

root@juno:~# [ 154.968900] *** task=0000000000000000

[...]

2023-03-13 16:39:33

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On 11/03/23 18:51, Qais Yousef wrote:
> On 03/09/23 14:23, Hao Luo wrote:
> > On Wed, Mar 8, 2023 at 10:55 PM Juri Lelli <[email protected]> wrote:
> > >
> > > On 08/03/23 10:01, Hao Luo wrote:
> > > > On Wed, Mar 8, 2023 at 2:20 AM Juri Lelli <[email protected]> wrote:
> > > > >
> > > > > On 01/03/23 17:03, Qais Yousef wrote:
> > > > > > On 03/01/23 15:26, Juri Lelli wrote:
> > > > <...>
> > > > > > > BTW, do you have a repro script of some sort handy I might play with?
> > > > > >
> > > > > > Sorry no. You'll just need to suspend to ram. I had a simple patch to measure
> > > > > > the time around the call and trace_printk'ed the result.
> > > > > >
> > > > > > I was working on a android phone which just suspends to ram if you turn the
> > > > > > screen off and disconnect the usb.
> > > > >
> > > > > Looks like I could come up with the following
> > > > >
> > > > > https://github.com/jlelli/linux.git deadline/rework-cpusets
> > > > > https://github.com/jlelli/linux/tree/deadline/rework-cpusets
> > > > >
> > > > > which I don't think it's at a point that I feel comfortable to propose
> > > > > as an RFC (not even sure if it actually makes sense), but it survived my
> > > > > very light testing.
> > > > >
> > > > > Could you please take a look and, if it makes some sense in theory, give
> > > > > it a try on your end?
> > > > >
> > > >
> > > > Hi Juri,
> > > >
> > > > Thanks for coming up with the RFC. I can test your changes in the
> > > > server environment. I observed the same issue on my side and I can
> > > > reproduce.
> > > >
> > > > I sync'ed up with Qais offline earlier yesterday, and was preparing a
> > > > couple of patches that optimize the cpuset.cpus writes. Tracking dl
> > > > tasks in cpusets is one of them. But I am happy to take your patches
> > > > and do the testing. Note that I won't be able to test the dl part of
> > > > the patch, only the latency impact on rebuild_root_domains(), as we
> > > > don't have dl tasks in our system.
> > > >
> > > > The other patch is fixing cpuset_rwsem. I see you switched it back to
> > > > mutex. I did observe performance issues with cpuset_rwsem. Basically,
> > > > using percpu_rwsem generates very very long latency tails for writers,
> > > > but mutex doesn't. After some debugging, I found it was because
> > > > percpu_rwsem requires every writer to call a synchronize_rcu() for
> > > > acquiring the lock. So in my patches, I disabled the fastpath of
> > > > readers for cpuset_rwsem. This has been done before[1]. But mutex also
> > > > worked.
> > > >
> > > > Anyway, I'm happy to test your patches and ack once they are sent out.
> > >
> > > Do you strictly need a proper RFC or could you please test the above for
> > > now? If you could please do the latter, and if tests look ok, I could
> > > then put together proper changelogs etc. and propose an RFC (it would
> > > save me some time not to do that if the above doesn't work, apologies
> > > for not going the proper route from the start). Guess this question
> > > applies to Qais as well. Hummm, or maybe you are actually saying that
> > > you are indeed going to test them already, just wanted to make sure
> > > then. :)
> >
> > Juri, I ported your patches to a 5.10 kernel, because my workload can
> > only run on 5.10. But unfortunately the kernel crashed at
> > cpuset_can_attach(). I'll put a few comments in your github branch.
>
> Yeah I am working on 5.10 too (this will need to be backported to 5.10 and 5.15
> ultimately) and had the same crash because task is NULL.
>
> Fixed it this way which I think what you intended to do Juri? It moves the
> check for dl_task(task) inside cgroup_taskset_for_each() loop.
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 83a8943467fb..06d6bb68d86b 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2495,11 +2495,11 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> ret = security_task_setscheduler(task);
> if (ret)
> goto out_unlock;
> - }
>
> - if (dl_task(task)) {
> - cs->deadline_tasks++;
> - cpuset_attach_old_cs->deadline_tasks--;
> + if (dl_task(task)) {
> + cs->deadline_tasks++;
> + cpuset_attach_old_cs->deadline_tasks--;
> + }
> }
>
> /*

Duh, indeed.

> Like Hao I don't have any deadline tasks in the system. With the fix above
> I don't notice the delay on suspend resume using your patches.

OK, cool.

> If you want any debug; please feel free to add them into your branch so I can
> run with that and give you the log.

Will need to find time to run some tests with DEADLINE tasks, yeah.
Maybe Dietmar, since you reported as well the issue above with your
testing, you could help with testing DEADLINE?

Thanks,
Juri


2023-03-13 17:12:46

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On 13/03/2023 17:37, Juri Lelli wrote:
> On 11/03/23 18:51, Qais Yousef wrote:
>> On 03/09/23 14:23, Hao Luo wrote:
>>> On Wed, Mar 8, 2023 at 10:55 PM Juri Lelli <[email protected]> wrote:
>>>>
>>>> On 08/03/23 10:01, Hao Luo wrote:
>>>>> On Wed, Mar 8, 2023 at 2:20 AM Juri Lelli <[email protected]> wrote:
>>>>>>
>>>>>> On 01/03/23 17:03, Qais Yousef wrote:
>>>>>>> On 03/01/23 15:26, Juri Lelli wrote:

[...]

>> Yeah I am working on 5.10 too (this will need to be backported to 5.10 and 5.15
>> ultimately) and had the same crash because task is NULL.
>>
>> Fixed it this way which I think what you intended to do Juri? It moves the
>> check for dl_task(task) inside cgroup_taskset_for_each() loop.
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 83a8943467fb..06d6bb68d86b 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -2495,11 +2495,11 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>> ret = security_task_setscheduler(task);
>> if (ret)
>> goto out_unlock;
>> - }
>>
>> - if (dl_task(task)) {
>> - cs->deadline_tasks++;
>> - cpuset_attach_old_cs->deadline_tasks--;
>> + if (dl_task(task)) {
>> + cs->deadline_tasks++;
>> + cpuset_attach_old_cs->deadline_tasks--;
>> + }
>> }
>>
>> /*
>
> Duh, indeed.
>
>> Like Hao I don't have any deadline tasks in the system. With the fix above
>> I don't notice the delay on suspend resume using your patches.
>
> OK, cool.
>
>> If you want any debug; please feel free to add them into your branch so I can
>> run with that and give you the log.
>
> Will need to find time to run some tests with DEADLINE tasks, yeah.
> Maybe Dietmar, since you reported as well the issue above with your
> testing, you could help with testing DEADLINE?

Ah, now I see! It's the same issue I saw. And it's not specifically
related to DL tasks. Any task you move into a cpuset will trigger
this.
Yeah, can do some DL tests later on this fix.

2023-03-14 11:41:58

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v3] sched: cpuset: Don't rebuild root domains on suspend-resume

On 13/03/2023 18:10, Dietmar Eggemann wrote:
> On 13/03/2023 17:37, Juri Lelli wrote:
>> On 11/03/23 18:51, Qais Yousef wrote:
>>> On 03/09/23 14:23, Hao Luo wrote:
>>>> On Wed, Mar 8, 2023 at 10:55 PM Juri Lelli <[email protected]> wrote:
>>>>>
>>>>> On 08/03/23 10:01, Hao Luo wrote:
>>>>>> On Wed, Mar 8, 2023 at 2:20 AM Juri Lelli <[email protected]> wrote:
>>>>>>>
>>>>>>> On 01/03/23 17:03, Qais Yousef wrote:
>>>>>>>> On 03/01/23 15:26, Juri Lelli wrote:
>
> [...]
>
>>> Yeah I am working on 5.10 too (this will need to be backported to 5.10 and 5.15
>>> ultimately) and had the same crash because task is NULL.
>>>
>>> Fixed it this way which I think what you intended to do Juri? It moves the
>>> check for dl_task(task) inside cgroup_taskset_for_each() loop.
>>>
>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>> index 83a8943467fb..06d6bb68d86b 100644
>>> --- a/kernel/cgroup/cpuset.c
>>> +++ b/kernel/cgroup/cpuset.c
>>> @@ -2495,11 +2495,11 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>>> ret = security_task_setscheduler(task);
>>> if (ret)
>>> goto out_unlock;
>>> - }
>>>
>>> - if (dl_task(task)) {
>>> - cs->deadline_tasks++;
>>> - cpuset_attach_old_cs->deadline_tasks--;
>>> + if (dl_task(task)) {
>>> + cs->deadline_tasks++;
>>> + cpuset_attach_old_cs->deadline_tasks--;
>>> + }
>>> }
>>>
>>> /*
>>
>> Duh, indeed.
>>
>>> Like Hao I don't have any deadline tasks in the system. With the fix above
>>> I don't notice the delay on suspend resume using your patches.
>>
>> OK, cool.
>>
>>> If you want any debug; please feel free to add them into your branch so I can
>>> run with that and give you the log.
>>
>> Will need to find time to run some tests with DEADLINE tasks, yeah.
>> Maybe Dietmar, since you reported as well the issue above with your
>> testing, you could help with testing DEADLINE?
>
> Ah, now I see! It's the same issue I saw. And it's not specifically
> related to DL tasks. Any tasks which you move into a cpuset will trigger
> this.
> Yeah, can do some DL tests later on this fix.

This fix also works for my DL test.

root@juno:~# ps2 | grep DLN
83 83 140 0 - DLN sugov:0
84 84 140 0 - DLN sugov:1
1601 1602 140 0 - DLN thread0-0
1601 1603 140 0 - DLN thread0-1
1601 1604 140 0 - DLN thread0-2
1601 1605 140 0 - DLN thread0-3
1601 1606 140 0 - DLN thread0-4
1601 1607 140 0 - DLN thread0-5
1601 1608 140 0 - DLN thread0-6
1601 1609 140 0 - DLN thread0-7
1601 1610 140 0 - DLN thread0-8
1601 1611 140 0 - DLN thread0-9
1601 1612 140 0 - DLN thread0-10
1601 1613 140 0 - DLN thread0-11

cgroupv1

root@juno:# cd /sys/fs/cgroup/cpuset
root@juno:# mkdir cs1
root@juno:# echo 0 > cs1/cpuset.mems
root@juno:# echo 0,3-5 > cs1/cpuset.cpus

root@juno:# cgclassify -g cpuset:cs1 1602 1603 1604 $$

One remaining doubt: `cgclassify` still moves one task at a time, so
cpuset_can_attach() only has to deal with one task per call. But
cgroup_taskset_for_each() says that a tset can contain multiple tasks. In
that case we would have to think about updating cs->deadline_tasks only
if all tasks can be moved. I don't know which test would trigger a tset
with multiple tasks in cpuset_can_attach().
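
A hypothetical sketch of that two-pass idea (simplified: the locking and
the other checks of the real cpuset_can_attach() are omitted, and the
5.10-era task_can_attach() signature is assumed):

static int cpuset_can_attach(struct cgroup_taskset *tset)
{
	struct cgroup_subsys_state *css;
	struct task_struct *task;
	struct cpuset *cs;
	int ret;

	/* Remember the source cpuset; used later by cpuset_attach(). */
	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
	cs = css_cs(css);

	/* First pass: reject the whole taskset if any task cannot move. */
	cgroup_taskset_for_each(task, css, tset) {
		ret = task_can_attach(task, cs->cpus_allowed);
		if (ret)
			return ret;
		ret = security_task_setscheduler(task);
		if (ret)
			return ret;
	}

	/* Second pass: only account DL tasks once success is certain. */
	cgroup_taskset_for_each(task, css, tset) {
		if (dl_task(task)) {
			cs->deadline_tasks++;
			cpuset_attach_old_cs->deadline_tasks--;
		}
	}

	return 0;
}

Note that if another controller's ->can_attach() fails afterwards, the
counters would still need to be undone in ->cancel_attach().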