2021-05-19 17:42:26

by Will Deacon

Subject: [PATCH v6 00/21] Add support for 32-bit tasks on asymmetric AArch32 systems

Hi folks,

This is the long-awaited v6 of these patches which I last posted at the
end of last year:

v1: https://lore.kernel.org/r/[email protected]
v2: https://lore.kernel.org/r/[email protected]
v3: https://lore.kernel.org/r/[email protected]
v4: https://lore.kernel.org/r/[email protected]
v5: https://lore.kernel.org/r/[email protected]

There was also a nice LWN writeup in case you've forgotten what this is
about:

https://lwn.net/Articles/838339/

It's taken me a while to get a v6 of this together, partly due to
addressing the review feedback on v5, but also because this has now seen
testing on real hardware which threw up some surprises in suspend/resume,
SCHED_DEADLINE and compat hwcap reporting. Thanks to Quentin for helping
me to debug those issues.

The aim of this series is to allow 32-bit ARM applications to run on
arm64 SoCs where not all of the CPUs support the 32-bit instruction set.
Unfortunately, such SoCs are real and will continue to be productised
over the next few years at least. I can assure you that I'm not just
doing this for fun.

Changes in v6 include:

* Save/restore the affinity mask across execve() to 32-bit and back to
64-bit again.

* Allow 32-bit deadline tasks, but skip straight to fallback path when
determining new affinity mask on execve().

* Fixed resume-from-suspend path when the resuming CPU is 64-bit-only
by deferring wake-ups for 32-bit tasks until the secondary CPUs are
back online.

* Bug fixes (compat hwcaps, memory leak, cpuset fallback path).

* Documentation for arm64. It's in the divisive .rst format, but please
take a look anyway!

I'm pretty happy with this now and it seems to do the right thing,
although the new patches in this revision would certainly benefit from
review. Series based on v5.13-rc1.

Cheers,

Will

Cc: Catalin Marinas <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: [email protected]

--->8

Will Deacon (21):
arm64: cpuinfo: Split AArch32 registers out into a separate struct
arm64: Allow mismatched 32-bit EL0 support
KVM: arm64: Kill 32-bit vCPUs on systems with mismatched EL0 support
arm64: Kill 32-bit applications scheduled on 64-bit-only CPUs
arm64: Advertise CPUs capable of running 32-bit applications in sysfs
sched: Introduce task_cpu_possible_mask() to limit fallback rq
selection
cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1
cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()
sched: Reject CPU affinity changes based on task_cpu_possible_mask()
sched: Introduce task_struct::user_cpus_ptr to track requested
affinity
sched: Split the guts of sched_setaffinity() into a helper function
sched: Allow task CPU affinity to be restricted on asymmetric systems
sched: Admit forcefully-affined tasks into SCHED_DEADLINE
freezer: Add frozen_or_skipped() helper function
sched: Defer wakeup in ttwu() for unschedulable frozen tasks
arm64: Implement task_cpu_possible_mask()
arm64: exec: Adjust affinity for compat tasks with mismatched 32-bit
EL0
arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched
system
arm64: Hook up cmdline parameter to allow mismatched 32-bit EL0
arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores
Documentation: arm64: describe asymmetric 32-bit support

.../ABI/testing/sysfs-devices-system-cpu | 9 +
.../admin-guide/kernel-parameters.txt | 11 +
Documentation/arm64/asymmetric-32bit.rst | 149 ++++++++
Documentation/arm64/index.rst | 1 +
arch/arm64/include/asm/cpu.h | 44 +--
arch/arm64/include/asm/cpucaps.h | 3 +-
arch/arm64/include/asm/cpufeature.h | 8 +-
arch/arm64/include/asm/mmu_context.h | 13 +
arch/arm64/kernel/cpufeature.c | 227 +++++++++---
arch/arm64/kernel/cpuinfo.c | 53 +--
arch/arm64/kernel/process.c | 21 +-
arch/arm64/kvm/arm.c | 11 +-
include/linux/cpuset.h | 3 +-
include/linux/freezer.h | 6 +
include/linux/mmu_context.h | 8 +
include/linux/sched.h | 15 +
init/init_task.c | 1 +
kernel/cgroup/cpuset.c | 45 ++-
kernel/fork.c | 2 +
kernel/freezer.c | 10 +-
kernel/hung_task.c | 4 +-
kernel/sched/core.c | 323 ++++++++++++++----
kernel/sched/sched.h | 1 +
23 files changed, 781 insertions(+), 187 deletions(-)
create mode 100644 Documentation/arm64/asymmetric-32bit.rst

--
2.31.1.751.gd2f1c929bd-goog



2021-05-19 17:42:32

by Will Deacon

Subject: [PATCH v6 07/21] cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1

If the scheduler cannot find an allowed CPU for a task,
cpuset_cpus_allowed_fallback() will widen the affinity to cpu_possible_mask
if cgroup v1 is in use.

In preparation for allowing architectures to provide their own fallback
mask, just return early if we're either using cgroup v1 or we're using
cgroup v2 with a mask that contains invalid CPUs. This will allow
select_fallback_rq() to figure out the mask by itself.

Cc: Li Zefan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Reviewed-by: Quentin Perret <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
include/linux/cpuset.h | 1 +
kernel/cgroup/cpuset.c | 12 ++++++++++--
2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 04c20de66afc..ed6ec677dd6b 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -15,6 +15,7 @@
#include <linux/cpumask.h>
#include <linux/nodemask.h>
#include <linux/mm.h>
+#include <linux/mmu_context.h>
#include <linux/jump_label.h>

#ifdef CONFIG_CPUSETS
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a945504c0ae7..8c799260a4a2 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3322,9 +3322,17 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)

void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
{
+ const struct cpumask *cs_mask;
+ const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
+
rcu_read_lock();
- do_set_cpus_allowed(tsk, is_in_v2_mode() ?
- task_cs(tsk)->cpus_allowed : cpu_possible_mask);
+ cs_mask = task_cs(tsk)->cpus_allowed;
+
+ if (!is_in_v2_mode() || !cpumask_subset(cs_mask, possible_mask))
+ goto unlock; /* select_fallback_rq will try harder */
+
+ do_set_cpus_allowed(tsk, cs_mask);
+unlock:
rcu_read_unlock();

/*
--
2.31.1.751.gd2f1c929bd-goog


2021-05-19 17:42:37

by Will Deacon

Subject: [PATCH v6 10/21] sched: Introduce task_struct::user_cpus_ptr to track requested affinity

In preparation for saving and restoring the user-requested CPU affinity
mask of a task, add a new cpumask_t pointer to 'struct task_struct'.

If the pointer is non-NULL, then the mask is copied across fork() and
freed on task exit.

Signed-off-by: Will Deacon <[email protected]>
---
include/linux/sched.h | 13 +++++++++++++
init/init_task.c | 1 +
kernel/fork.c | 2 ++
kernel/sched/core.c | 20 ++++++++++++++++++++
4 files changed, 36 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2c881384517..db32d4f7e5b3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -730,6 +730,7 @@ struct task_struct {
unsigned int policy;
int nr_cpus_allowed;
const cpumask_t *cpus_ptr;
+ cpumask_t *user_cpus_ptr;
cpumask_t cpus_mask;
void *migration_pending;
#ifdef CONFIG_SMP
@@ -1688,6 +1689,8 @@ extern int task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_
#ifdef CONFIG_SMP
extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask);
extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask);
+extern int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src, int node);
+extern void release_user_cpus_ptr(struct task_struct *p);
#else
static inline void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
{
@@ -1698,6 +1701,16 @@ static inline int set_cpus_allowed_ptr(struct task_struct *p, const struct cpuma
return -EINVAL;
return 0;
}
+static inline int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src, int node)
+{
+ if (src->user_cpus_ptr)
+ return -EINVAL;
+ return 0;
+}
+static inline void release_user_cpus_ptr(struct task_struct *p)
+{
+ WARN_ON(p->user_cpus_ptr);
+}
#endif

extern int yield_to(struct task_struct *p, bool preempt);
diff --git a/init/init_task.c b/init/init_task.c
index 8b08c2e19cbb..158c2b1689e1 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -80,6 +80,7 @@ struct task_struct init_task
.normal_prio = MAX_PRIO - 20,
.policy = SCHED_NORMAL,
.cpus_ptr = &init_task.cpus_mask,
+ .user_cpus_ptr = NULL,
.cpus_mask = CPU_MASK_ALL,
.nr_cpus_allowed= NR_CPUS,
.mm = NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index dc06afd725cb..d3710e7f1686 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -446,6 +446,7 @@ void put_task_stack(struct task_struct *tsk)

void free_task(struct task_struct *tsk)
{
+ release_user_cpus_ptr(tsk);
scs_release(tsk);

#ifndef CONFIG_THREAD_INFO_IN_TASK
@@ -918,6 +919,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
#endif
if (orig->cpus_ptr == &orig->cpus_mask)
tsk->cpus_ptr = &tsk->cpus_mask;
+ dup_user_cpus_ptr(tsk, orig, node);

/*
* One for the user space visible state that goes away when reaped.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 21da3d7f9e47..9512623d5a60 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2129,6 +2129,26 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
__do_set_cpus_allowed(p, new_mask, 0);
}

+int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
+ int node)
+{
+ if (!src->user_cpus_ptr)
+ return 0;
+
+ dst->user_cpus_ptr = kmalloc_node(cpumask_size(), GFP_KERNEL, node);
+ if (!dst->user_cpus_ptr)
+ return -ENOMEM;
+
+ cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
+ return 0;
+}
+
+void release_user_cpus_ptr(struct task_struct *p)
+{
+ kfree(p->user_cpus_ptr);
+ p->user_cpus_ptr = NULL;
+}
+
/*
* This function is wildly self concurrent; here be dragons.
*
--
2.31.1.751.gd2f1c929bd-goog


2021-05-19 17:42:37

by Will Deacon

Subject: [PATCH v6 09/21] sched: Reject CPU affinity changes based on task_cpu_possible_mask()

Reject explicit requests to change the affinity mask of a task via
set_cpus_allowed_ptr() if the requested mask is not a subset of the
mask returned by task_cpu_possible_mask(). This ensures that the
'cpus_mask' for a given task cannot contain CPUs which are incapable of
executing it, except in cases where the affinity is forced.
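
As a purely illustrative example (not part of this patch), assume an
asymmetric system where CPUs 0-3 can run 32-bit code and CPUs 4-7 are
64-bit-only, with the arm64 task_cpu_possible_mask() hook from later in
the series wired up. A 32-bit binary containing the following (the CPU
numbers are made up) would then see its request rejected:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  int main(void)
  {
          cpu_set_t set;

          CPU_ZERO(&set);
          CPU_SET(4, &set);       /* assume CPU 4 is 64-bit-only */

          /*
           * The requested mask is not a subset of
           * task_cpu_possible_mask(), so the kernel now returns
           * -EINVAL rather than corrupting 'cpus_mask'.
           */
          if (sched_setaffinity(0, sizeof(set), &set))
                  perror("sched_setaffinity");    /* expect EINVAL */

          return 0;
  }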

Reviewed-by: Quentin Perret <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
kernel/sched/core.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 482f7fdca0e8..21da3d7f9e47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2350,6 +2350,7 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
u32 flags)
{
const struct cpumask *cpu_valid_mask = cpu_active_mask;
+ const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
unsigned int dest_cpu;
struct rq_flags rf;
struct rq *rq;
@@ -2370,6 +2371,9 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
* set_cpus_allowed_common() and actually reset p->cpus_ptr.
*/
cpu_valid_mask = cpu_online_mask;
+ } else if (!cpumask_subset(new_mask, cpu_allowed_mask)) {
+ ret = -EINVAL;
+ goto out;
}

/*
--
2.31.1.751.gd2f1c929bd-goog


2021-05-19 17:42:40

by Will Deacon

Subject: [PATCH v6 02/21] arm64: Allow mismatched 32-bit EL0 support

When confronted with a mixture of CPUs, some of which support 32-bit
applications and others which don't, we quite sensibly treat the system
as 64-bit only for userspace and prevent execve() of 32-bit binaries.

Unfortunately, some crazy folks have decided to build systems like this
with the intention of running 32-bit applications, so relax our
sanitisation logic to continue to advertise 32-bit support to userspace
on these systems and track the real 32-bit capable cores in a cpumask
instead. For now, the default behaviour remains but will be tied to
a command-line option in a later patch.
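
As a rough sketch of the consequence for callers (illustrative only, and
assuming the usual cpufeature/cpumask headers), per-CPU 32-bit support
should now be queried via the helpers rather than the renamed capability:

  /*
   * Illustrative helper: can this particular CPU run 32-bit EL0?
   */
  static bool cpu_can_run_aarch32(unsigned int cpu)
  {
          if (!system_supports_32bit_el0())
                  return false;

          /*
           * On mismatched systems this is the mask of 32-bit-capable
           * CPUs; otherwise it is cpu_possible_mask.
           */
          return cpumask_test_cpu(cpu, system_32bit_el0_cpumask());
  }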

Signed-off-by: Will Deacon <[email protected]>
---
arch/arm64/include/asm/cpucaps.h | 3 +-
arch/arm64/include/asm/cpufeature.h | 8 +-
arch/arm64/kernel/cpufeature.c | 114 ++++++++++++++++++++++++----
3 files changed, 110 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/cpucaps.h b/arch/arm64/include/asm/cpucaps.h
index b0c5eda0498f..b87461490977 100644
--- a/arch/arm64/include/asm/cpucaps.h
+++ b/arch/arm64/include/asm/cpucaps.h
@@ -18,7 +18,8 @@
#define ARM64_HAS_NO_HW_PREFETCH 8
#define ARM64_HAS_VIRT_HOST_EXTN 11
#define ARM64_WORKAROUND_CAVIUM_27456 12
-#define ARM64_HAS_32BIT_EL0 13
+/* Unreliable: use system_supports_32bit_el0() instead. */
+#define ARM64_HAS_32BIT_EL0_DO_NOT_USE 13
#define ARM64_SPECTRE_V3A 14
#define ARM64_HAS_CNP 15
#define ARM64_HAS_NO_FPSIMD 16
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 338840c00e8e..603bf4160cd6 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -630,9 +630,15 @@ static inline bool cpu_supports_mixed_endian_el0(void)
return id_aa64mmfr0_mixed_endian_el0(read_cpuid(ID_AA64MMFR0_EL1));
}

+const struct cpumask *system_32bit_el0_cpumask(void);
+DECLARE_STATIC_KEY_FALSE(arm64_mismatched_32bit_el0);
+
static inline bool system_supports_32bit_el0(void)
{
- return cpus_have_const_cap(ARM64_HAS_32BIT_EL0);
+ u64 pfr0 = read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1);
+
+ return static_branch_unlikely(&arm64_mismatched_32bit_el0) ||
+ id_aa64pfr0_32bit_el0(pfr0);
}

static inline bool system_supports_4kb_granule(void)
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index a4db25cd7122..4194a47de62d 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -107,6 +107,24 @@ DECLARE_BITMAP(boot_capabilities, ARM64_NPATCHABLE);
bool arm64_use_ng_mappings = false;
EXPORT_SYMBOL(arm64_use_ng_mappings);

+/*
+ * Permit PER_LINUX32 and execve() of 32-bit binaries even if not all CPUs
+ * support it?
+ */
+static bool __read_mostly allow_mismatched_32bit_el0;
+
+/*
+ * Static branch enabled only if allow_mismatched_32bit_el0 is set and we have
+ * seen at least one CPU capable of 32-bit EL0.
+ */
+DEFINE_STATIC_KEY_FALSE(arm64_mismatched_32bit_el0);
+
+/*
+ * Mask of CPUs supporting 32-bit EL0.
+ * Only valid if arm64_mismatched_32bit_el0 is enabled.
+ */
+static cpumask_var_t cpu_32bit_el0_mask __cpumask_var_read_mostly;
+
/*
* Flag to indicate if we have computed the system wide
* capabilities based on the boot time active CPUs. This
@@ -767,7 +785,7 @@ static void __init sort_ftr_regs(void)
* Any bits that are not covered by an arm64_ftr_bits entry are considered
* RES0 for the system-wide value, and must strictly match.
*/
-static void __init init_cpu_ftr_reg(u32 sys_reg, u64 new)
+static void init_cpu_ftr_reg(u32 sys_reg, u64 new)
{
u64 val = 0;
u64 strict_mask = ~0x0ULL;
@@ -863,7 +881,7 @@ static void __init init_cpu_hwcaps_indirect_list(void)

static void __init setup_boot_cpu_capabilities(void);

-static void __init init_32bit_cpu_features(struct cpuinfo_32bit *info)
+static void init_32bit_cpu_features(struct cpuinfo_32bit *info)
{
init_cpu_ftr_reg(SYS_ID_DFR0_EL1, info->reg_id_dfr0);
init_cpu_ftr_reg(SYS_ID_DFR1_EL1, info->reg_id_dfr1);
@@ -979,6 +997,22 @@ static void relax_cpu_ftr_reg(u32 sys_id, int field)
WARN_ON(!ftrp->width);
}

+static void update_mismatched_32bit_el0_cpu_features(struct cpuinfo_arm64 *info,
+ struct cpuinfo_arm64 *boot)
+{
+ static bool boot_cpu_32bit_regs_overridden = false;
+
+ if (!allow_mismatched_32bit_el0 || boot_cpu_32bit_regs_overridden)
+ return;
+
+ if (id_aa64pfr0_32bit_el0(boot->reg_id_aa64pfr0))
+ return;
+
+ boot->aarch32 = info->aarch32;
+ init_32bit_cpu_features(&boot->aarch32);
+ boot_cpu_32bit_regs_overridden = true;
+}
+
static int update_32bit_cpu_features(int cpu, struct cpuinfo_32bit *info,
struct cpuinfo_32bit *boot)
{
@@ -1139,6 +1173,7 @@ void update_cpu_features(int cpu,
* (e.g. SYS_ID_AA64PFR0_EL1), so we call it last.
*/
if (id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0)) {
+ update_mismatched_32bit_el0_cpu_features(info, boot);
taint |= update_32bit_cpu_features(cpu, &info->aarch32,
&boot->aarch32);
}
@@ -1251,6 +1286,28 @@ has_cpuid_feature(const struct arm64_cpu_capabilities *entry, int scope)
return feature_matches(val, entry);
}

+const struct cpumask *system_32bit_el0_cpumask(void)
+{
+ if (!system_supports_32bit_el0())
+ return cpu_none_mask;
+
+ if (static_branch_unlikely(&arm64_mismatched_32bit_el0))
+ return cpu_32bit_el0_mask;
+
+ return cpu_possible_mask;
+}
+
+static bool has_32bit_el0(const struct arm64_cpu_capabilities *entry, int scope)
+{
+ if (!has_cpuid_feature(entry, scope))
+ return allow_mismatched_32bit_el0;
+
+ if (scope == SCOPE_SYSTEM)
+ pr_info("detected: 32-bit EL0 Support\n");
+
+ return true;
+}
+
static bool has_useable_gicv3_cpuif(const struct arm64_cpu_capabilities *entry, int scope)
{
bool has_sre;
@@ -1869,10 +1926,9 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
.cpu_enable = cpu_copy_el2regs,
},
{
- .desc = "32-bit EL0 Support",
- .capability = ARM64_HAS_32BIT_EL0,
+ .capability = ARM64_HAS_32BIT_EL0_DO_NOT_USE,
.type = ARM64_CPUCAP_SYSTEM_FEATURE,
- .matches = has_cpuid_feature,
+ .matches = has_32bit_el0,
.sys_reg = SYS_ID_AA64PFR0_EL1,
.sign = FTR_UNSIGNED,
.field_pos = ID_AA64PFR0_EL0_SHIFT,
@@ -2381,7 +2437,7 @@ static const struct arm64_cpu_capabilities compat_elf_hwcaps[] = {
{},
};

-static void __init cap_set_elf_hwcap(const struct arm64_cpu_capabilities *cap)
+static void cap_set_elf_hwcap(const struct arm64_cpu_capabilities *cap)
{
switch (cap->hwcap_type) {
case CAP_HWCAP:
@@ -2426,7 +2482,7 @@ static bool cpus_have_elf_hwcap(const struct arm64_cpu_capabilities *cap)
return rc;
}

-static void __init setup_elf_hwcaps(const struct arm64_cpu_capabilities *hwcaps)
+static void setup_elf_hwcaps(const struct arm64_cpu_capabilities *hwcaps)
{
/* We support emulation of accesses to CPU ID feature registers */
cpu_set_named_feature(CPUID);
@@ -2601,7 +2657,7 @@ static void check_early_cpu_features(void)
}

static void
-verify_local_elf_hwcaps(const struct arm64_cpu_capabilities *caps)
+__verify_local_elf_hwcaps(const struct arm64_cpu_capabilities *caps)
{

for (; caps->matches; caps++)
@@ -2612,6 +2668,14 @@ verify_local_elf_hwcaps(const struct arm64_cpu_capabilities *caps)
}
}

+static void verify_local_elf_hwcaps(void)
+{
+ __verify_local_elf_hwcaps(arm64_elf_hwcaps);
+
+ if (id_aa64pfr0_32bit_el0(read_cpuid(ID_AA64PFR0_EL1)))
+ __verify_local_elf_hwcaps(compat_elf_hwcaps);
+}
+
static void verify_sve_features(void)
{
u64 safe_zcr = read_sanitised_ftr_reg(SYS_ZCR_EL1);
@@ -2676,11 +2740,7 @@ static void verify_local_cpu_capabilities(void)
* on all secondary CPUs.
*/
verify_local_cpu_caps(SCOPE_ALL & ~SCOPE_BOOT_CPU);
-
- verify_local_elf_hwcaps(arm64_elf_hwcaps);
-
- if (system_supports_32bit_el0())
- verify_local_elf_hwcaps(compat_elf_hwcaps);
+ verify_local_elf_hwcaps();

if (system_supports_sve())
verify_sve_features();
@@ -2815,6 +2875,34 @@ void __init setup_cpu_features(void)
ARCH_DMA_MINALIGN);
}

+static int enable_mismatched_32bit_el0(unsigned int cpu)
+{
+ struct cpuinfo_arm64 *info = &per_cpu(cpu_data, cpu);
+ bool cpu_32bit = id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0);
+
+ if (cpu_32bit) {
+ cpumask_set_cpu(cpu, cpu_32bit_el0_mask);
+ static_branch_enable_cpuslocked(&arm64_mismatched_32bit_el0);
+ setup_elf_hwcaps(compat_elf_hwcaps);
+ }
+
+ return 0;
+}
+
+static int __init init_32bit_el0_mask(void)
+{
+ if (!allow_mismatched_32bit_el0)
+ return 0;
+
+ if (!zalloc_cpumask_var(&cpu_32bit_el0_mask, GFP_KERNEL))
+ return -ENOMEM;
+
+ return cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+ "arm64/mismatched_32bit_el0:online",
+ enable_mismatched_32bit_el0, NULL);
+}
+subsys_initcall_sync(init_32bit_el0_mask);
+
static void __maybe_unused cpu_enable_cnp(struct arm64_cpu_capabilities const *cap)
{
cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
--
2.31.1.751.gd2f1c929bd-goog


2021-05-19 17:42:46

by Will Deacon

Subject: [PATCH v6 06/21] sched: Introduce task_cpu_possible_mask() to limit fallback rq selection

Asymmetric systems may not offer the same level of userspace ISA support
across all CPUs, meaning that some applications cannot be executed by
some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
not feature support for 32-bit applications on both clusters.

On such a system, we must take care not to migrate a task to an
unsupported CPU when forcefully moving tasks in select_fallback_rq()
in response to a CPU hot-unplug operation.

Introduce a task_cpu_possible_mask() hook which, given a task argument,
allows an architecture to return a cpumask of CPUs that are capable of
executing that task. The default implementation returns the
cpu_possible_mask, since sane machines do not suffer from per-cpu ISA
limitations that affect scheduling. The new mask is used when selecting
the fallback runqueue as a last resort before forcing a migration to the
first active CPU.
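
For illustration, an architecture opts in by defining the hook in its
<asm/mmu_context.h> so that the generic fallback below is not used; a
minimal sketch (the predicate and mask names here are made up, and
arm64's real implementation appears later in the series) might look like:

  static inline const struct cpumask *
  task_cpu_possible_mask(struct task_struct *p)
  {
          /* Restrict tasks with a narrower ISA to the capable CPUs. */
          if (arch_task_is_restricted(p))             /* hypothetical */
                  return arch_restricted_cpu_mask();  /* hypothetical */

          return cpu_possible_mask;
  }
  #define task_cpu_possible_mask task_cpu_possible_mask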

Reviewed-by: Quentin Perret <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
include/linux/mmu_context.h | 8 ++++++++
kernel/sched/core.c | 10 ++++++----
2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
index 03dee12d2b61..bc4ac3c525e6 100644
--- a/include/linux/mmu_context.h
+++ b/include/linux/mmu_context.h
@@ -14,4 +14,12 @@
static inline void leave_mm(int cpu) { }
#endif

+/*
+ * CPUs that are capable of running task @p. By default, we assume a sane,
+ * homogeneous system. Must contain at least one active CPU.
+ */
+#ifndef task_cpu_possible_mask
+# define task_cpu_possible_mask(p) cpu_possible_mask
+#endif
+
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5226cc26a095..482f7fdca0e8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1813,8 +1813,11 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
return cpu_online(cpu);

/* Non kernel threads are not allowed during either online or offline. */
- if (!(p->flags & PF_KTHREAD))
- return cpu_active(cpu);
+ if (!(p->flags & PF_KTHREAD)) {
+ if (cpu_active(cpu))
+ return cpumask_test_cpu(cpu, task_cpu_possible_mask(p));
+ return false;
+ }

/* KTHREAD_IS_PER_CPU is always allowed. */
if (kthread_is_per_cpu(p))
@@ -2792,10 +2795,9 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
*
* More yuck to audit.
*/
- do_set_cpus_allowed(p, cpu_possible_mask);
+ do_set_cpus_allowed(p, task_cpu_possible_mask(p));
state = fail;
break;
-
case fail:
BUG();
break;
--
2.31.1.751.gd2f1c929bd-goog


2021-05-19 17:42:54

by Will Deacon

Subject: [PATCH v6 08/21] cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()

Asymmetric systems may not offer the same level of userspace ISA support
across all CPUs, meaning that some applications cannot be executed by
some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
not feature support for 32-bit applications on both clusters.

Modify guarantee_online_cpus() to take task_cpu_possible_mask() into
account when trying to find a suitable set of online CPUs for a given
task. This will avoid passing an invalid mask to set_cpus_allowed_ptr()
during ->attach() and will subsequently allow the cpuset hierarchy to be
taken into account when forcefully overriding the affinity mask for a
task which requires migration to a compatible CPU.

Cc: Li Zefan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
include/linux/cpuset.h | 2 +-
kernel/cgroup/cpuset.c | 33 +++++++++++++++++++--------------
2 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index ed6ec677dd6b..414a8e694413 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -185,7 +185,7 @@ static inline void cpuset_read_unlock(void) { }
static inline void cpuset_cpus_allowed(struct task_struct *p,
struct cpumask *mask)
{
- cpumask_copy(mask, cpu_possible_mask);
+ cpumask_copy(mask, task_cpu_possible_mask(p));
}

static inline void cpuset_cpus_allowed_fallback(struct task_struct *p)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 8c799260a4a2..b532a5333ff9 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -372,18 +372,26 @@ static inline bool is_in_v2_mode(void)
}

/*
- * Return in pmask the portion of a cpusets's cpus_allowed that
- * are online. If none are online, walk up the cpuset hierarchy
- * until we find one that does have some online cpus.
+ * Return in pmask the portion of a task's cpusets's cpus_allowed that
+ * are online and are capable of running the task. If none are found,
+ * walk up the cpuset hierarchy until we find one that does have some
+ * appropriate cpus.
*
* One way or another, we guarantee to return some non-empty subset
* of cpu_online_mask.
*
* Call with callback_lock or cpuset_mutex held.
*/
-static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask)
+static void guarantee_online_cpus(struct task_struct *tsk,
+ struct cpumask *pmask)
{
- while (!cpumask_intersects(cs->effective_cpus, cpu_online_mask)) {
+ struct cpuset *cs = task_cs(tsk);
+ const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
+
+ if (WARN_ON(!cpumask_and(pmask, possible_mask, cpu_online_mask)))
+ cpumask_copy(pmask, cpu_online_mask);
+
+ while (!cpumask_intersects(cs->effective_cpus, pmask)) {
cs = parent_cs(cs);
if (unlikely(!cs)) {
/*
@@ -393,11 +401,10 @@ static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask)
* cpuset's effective_cpus is on its way to be
* identical to cpu_online_mask.
*/
- cpumask_copy(pmask, cpu_online_mask);
return;
}
}
- cpumask_and(pmask, cs->effective_cpus, cpu_online_mask);
+ cpumask_and(pmask, pmask, cs->effective_cpus);
}

/*
@@ -2199,15 +2206,13 @@ static void cpuset_attach(struct cgroup_taskset *tset)

percpu_down_write(&cpuset_rwsem);

- /* prepare for attach */
- if (cs == &top_cpuset)
- cpumask_copy(cpus_attach, cpu_possible_mask);
- else
- guarantee_online_cpus(cs, cpus_attach);
-
guarantee_online_mems(cs, &cpuset_attach_nodemask_to);

cgroup_taskset_for_each(task, css, tset) {
+ if (cs != &top_cpuset)
+ guarantee_online_cpus(task, cpus_attach);
+ else
+ cpumask_copy(cpus_attach, task_cpu_possible_mask(task));
/*
* can_attach beforehand should guarantee that this doesn't
* fail. TODO: have a better way to handle failure here
@@ -3303,7 +3308,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)

spin_lock_irqsave(&callback_lock, flags);
rcu_read_lock();
- guarantee_online_cpus(task_cs(tsk), pmask);
+ guarantee_online_cpus(tsk, pmask);
rcu_read_unlock();
spin_unlock_irqrestore(&callback_lock, flags);
}
--
2.31.1.751.gd2f1c929bd-goog


2021-05-19 17:43:20

by Will Deacon

Subject: [PATCH v6 18/21] arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system

If we want to support 32-bit applications, then when we identify a CPU
with mismatched 32-bit EL0 support we must ensure that we will always
have an active 32-bit CPU available to us from then on. This is important
for the scheduler, because is_cpu_allowed() will be constrained to 32-bit
CPUs for compat tasks and forced migration due to a hotplug event will
hang if no 32-bit CPUs are available.

On detecting a mismatch, prevent the mismatching CPU from being offlined
if it is 32-bit capable; otherwise, find the first active 32-bit-capable
CPU and prevent that one from being offlined instead.

Signed-off-by: Will Deacon <[email protected]>
---
arch/arm64/kernel/cpufeature.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 959442f76ed7..72efdc611b14 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2896,15 +2896,33 @@ void __init setup_cpu_features(void)

static int enable_mismatched_32bit_el0(unsigned int cpu)
{
+ static int lucky_winner = -1;
+
struct cpuinfo_arm64 *info = &per_cpu(cpu_data, cpu);
bool cpu_32bit = id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0);

if (cpu_32bit) {
cpumask_set_cpu(cpu, cpu_32bit_el0_mask);
static_branch_enable_cpuslocked(&arm64_mismatched_32bit_el0);
- setup_elf_hwcaps(compat_elf_hwcaps);
}

+ if (cpumask_test_cpu(0, cpu_32bit_el0_mask) == cpu_32bit)
+ return 0;
+
+ if (lucky_winner >= 0)
+ return 0;
+
+ /*
+ * We've detected a mismatch. We need to keep one of our CPUs with
+ * 32-bit EL0 online so that is_cpu_allowed() doesn't end up rejecting
+ * every CPU in the system for a 32-bit task.
+ */
+ lucky_winner = cpu_32bit ? cpu : cpumask_any_and(cpu_32bit_el0_mask,
+ cpu_active_mask);
+ get_cpu_device(lucky_winner)->offline_disabled = true;
+ setup_elf_hwcaps(compat_elf_hwcaps);
+ pr_info("Asymmetric 32-bit EL0 support detected on CPU %u; CPU hot-unplug disabled on CPU %u\n",
+ cpu, lucky_winner);
return 0;
}

--
2.31.1.751.gd2f1c929bd-goog


2021-05-19 17:43:20

by Will Deacon

Subject: [PATCH v6 12/21] sched: Allow task CPU affinity to be restricted on asymmetric systems

Asymmetric systems may not offer the same level of userspace ISA support
across all CPUs, meaning that some applications cannot be executed by
some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
not feature support for 32-bit applications on both clusters.

Although userspace can carefully manage the affinity masks for such
tasks, one place where it is particularly problematic is execve()
because the CPU on which the execve() is occurring may be incompatible
with the new application image. In such a situation, it is desirable to
restrict the affinity mask of the task and ensure that the new image is
entered on a compatible CPU. From userspace's point of view, this looks
the same as if the incompatible CPUs have been hotplugged off in the
task's affinity mask. Similarly, if a subsequent execve() reverts to
a compatible image, then the old affinity is restored if it is still
valid.

In preparation for restricting the affinity mask for compat tasks on
arm64 systems without uniform support for 32-bit applications, introduce
{force,relax}_compatible_cpus_allowed_ptr(), which respectively restrict
and restore the affinity mask for a task based on the compatible CPUs.
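
As a rough guide to the intended usage (illustrative only; the arm64
wiring appears later in the series, and the gating on the mismatched
case is omitted here), an architecture's execve() hook is expected to
call the new helpers along these lines:

  void arch_setup_new_exec(void)
  {
          /* new_image_is_32bit() is a stand-in for the arch's check */
          if (new_image_is_32bit())
                  force_compatible_cpus_allowed_ptr(current);
          else
                  relax_compatible_cpus_allowed_ptr(current);
  }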

Reviewed-by: Quentin Perret <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
include/linux/sched.h | 2 +
kernel/sched/core.c | 165 ++++++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 1 +
3 files changed, 152 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index db32d4f7e5b3..91a6cfeae242 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1691,6 +1691,8 @@ extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new
extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask);
extern int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src, int node);
extern void release_user_cpus_ptr(struct task_struct *p);
+extern void force_compatible_cpus_allowed_ptr(struct task_struct *p);
+extern void relax_compatible_cpus_allowed_ptr(struct task_struct *p);
#else
static inline void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
{
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 808bbe669a6d..ba66bcf8e812 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2357,26 +2357,21 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
}

/*
- * Change a given task's CPU affinity. Migrate the thread to a
- * proper CPU and schedule it away if the CPU it's executing on
- * is removed from the allowed bitmask.
- *
- * NOTE: the caller must have a valid reference to the task, the
- * task must not exit() & deallocate itself prematurely. The
- * call is not atomic; no spinlocks may be held.
+ * Called with both p->pi_lock and rq->lock held; drops both before returning.
*/
-static int __set_cpus_allowed_ptr(struct task_struct *p,
- const struct cpumask *new_mask,
- u32 flags)
+static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
+ const struct cpumask *new_mask,
+ u32 flags,
+ struct rq *rq,
+ struct rq_flags *rf)
+ __releases(rq->lock)
+ __releases(p->pi_lock)
{
const struct cpumask *cpu_valid_mask = cpu_active_mask;
const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
unsigned int dest_cpu;
- struct rq_flags rf;
- struct rq *rq;
int ret = 0;

- rq = task_rq_lock(p, &rf);
update_rq_clock(rq);

if (p->flags & PF_KTHREAD || is_migration_disabled(p)) {
@@ -2430,20 +2425,158 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,

__do_set_cpus_allowed(p, new_mask, flags);

- return affine_move_task(rq, p, &rf, dest_cpu, flags);
+ if (flags & SCA_USER)
+ release_user_cpus_ptr(p);
+
+ return affine_move_task(rq, p, rf, dest_cpu, flags);

out:
- task_rq_unlock(rq, p, &rf);
+ task_rq_unlock(rq, p, rf);

return ret;
}

+/*
+ * Change a given task's CPU affinity. Migrate the thread to a
+ * proper CPU and schedule it away if the CPU it's executing on
+ * is removed from the allowed bitmask.
+ *
+ * NOTE: the caller must have a valid reference to the task, the
+ * task must not exit() & deallocate itself prematurely. The
+ * call is not atomic; no spinlocks may be held.
+ */
+static int __set_cpus_allowed_ptr(struct task_struct *p,
+ const struct cpumask *new_mask, u32 flags)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &rf);
+ return __set_cpus_allowed_ptr_locked(p, new_mask, flags, rq, &rf);
+}
+
int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
{
return __set_cpus_allowed_ptr(p, new_mask, 0);
}
EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);

+/*
+ * Change a given task's CPU affinity to the intersection of its current
+ * affinity mask and @subset_mask, writing the resulting mask to @new_mask
+ * and pointing @p->user_cpus_ptr to a copy of the old mask.
+ * If the resulting mask is empty, leave the affinity unchanged and return
+ * -EINVAL.
+ */
+static int restrict_cpus_allowed_ptr(struct task_struct *p,
+ struct cpumask *new_mask,
+ const struct cpumask *subset_mask)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+ int err;
+ struct cpumask *user_mask = NULL;
+
+ if (!p->user_cpus_ptr)
+ user_mask = kmalloc(cpumask_size(), GFP_KERNEL);
+
+ rq = task_rq_lock(p, &rf);
+
+ /*
+ * We're about to butcher the task affinity, so keep track of what
+ * the user asked for in case we're able to restore it later on.
+ */
+ if (user_mask) {
+ cpumask_copy(user_mask, p->cpus_ptr);
+ p->user_cpus_ptr = user_mask;
+ }
+
+ /*
+ * Forcefully restricting the affinity of a deadline task is
+ * likely to cause problems, so fail and noisily override the
+ * mask entirely.
+ */
+ if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
+ err = -EPERM;
+ goto err_unlock;
+ }
+
+ if (!cpumask_and(new_mask, &p->cpus_mask, subset_mask)) {
+ err = -EINVAL;
+ goto err_unlock;
+ }
+
+ return __set_cpus_allowed_ptr_locked(p, new_mask, false, rq, &rf);
+
+err_unlock:
+ task_rq_unlock(rq, p, &rf);
+ return err;
+}
+
+/*
+ * Restrict the CPU affinity of task @p so that it is a subset of
+ * task_cpu_possible_mask() and point @p->user_cpus_ptr to a copy of the
+ * old affinity mask. If the resulting mask is empty, we warn and walk
+ * up the cpuset hierarchy until we find a suitable mask.
+ */
+void force_compatible_cpus_allowed_ptr(struct task_struct *p)
+{
+ cpumask_var_t new_mask;
+ const struct cpumask *override_mask = task_cpu_possible_mask(p);
+
+ if (!alloc_cpumask_var(&new_mask, GFP_KERNEL))
+ goto out_set_mask;
+
+ if (!restrict_cpus_allowed_ptr(p, new_mask, override_mask))
+ goto out_free_mask;
+
+ /*
+ * We failed to find a valid subset of the affinity mask for the
+ * task, so override it based on its cpuset hierarchy.
+ */
+ cpuset_cpus_allowed(p, new_mask);
+ override_mask = new_mask;
+
+out_set_mask:
+ if (printk_ratelimit()) {
+ printk_deferred("Overriding affinity for process %d (%s) to CPUs %*pbl\n",
+ task_pid_nr(p), p->comm,
+ cpumask_pr_args(override_mask));
+ }
+
+ set_cpus_allowed_ptr(p, override_mask);
+out_free_mask:
+ free_cpumask_var(new_mask);
+}
+
+static int
+__sched_setaffinity(struct task_struct *p, const struct cpumask *mask);
+
+/*
+ * Restore the affinity of a task @p which was previously restricted by a
+ * call to force_compatible_cpus_allowed_ptr(). This will clear (and free)
+ * @p->user_cpus_ptr.
+ */
+void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
+{
+ unsigned long flags;
+ struct cpumask *mask = p->user_cpus_ptr;
+
+ if (!mask)
+ return;
+
+ /*
+ * Try to restore the old affinity mask. If this fails, then
+ * we free the mask explicitly to avoid it being inherited across
+ * a subsequent fork().
+ */
+ if (__sched_setaffinity(p, mask)) {
+ raw_spin_lock_irqsave(&p->pi_lock, flags);
+ release_user_cpus_ptr(p);
+ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+ }
+}
+
void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
#ifdef CONFIG_SCHED_DEBUG
@@ -6821,7 +6954,7 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
}
#endif
again:
- retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
+ retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK | SCA_USER);
if (retval)
goto out_free_masks;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a189bec13729..29c35b51411b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1956,6 +1956,7 @@ extern struct task_struct *pick_next_task_idle(struct rq *rq);
#define SCA_CHECK 0x01
#define SCA_MIGRATE_DISABLE 0x02
#define SCA_MIGRATE_ENABLE 0x04
+#define SCA_USER 0x08

#ifdef CONFIG_SMP

--
2.31.1.751.gd2f1c929bd-goog


2021-05-19 17:43:27

by Will Deacon

Subject: [PATCH v6 14/21] freezer: Add frozen_or_skipped() helper function

Occasionally it is necessary to see if a task is either frozen or
sleeping in the PF_FREEZER_SKIP state. In preparation for adding
additional users of this check, introduce a frozen_or_skipped() helper
function and convert the hung task detector over to using it.

Signed-off-by: Will Deacon <[email protected]>
---
include/linux/freezer.h | 6 ++++++
kernel/hung_task.c | 4 ++--
2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 0621c5f86c39..b9e1e4200101 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -27,6 +27,11 @@ static inline bool frozen(struct task_struct *p)
return p->flags & PF_FROZEN;
}

+static inline bool frozen_or_skipped(struct task_struct *p)
+{
+ return p->flags & (PF_FROZEN | PF_FREEZER_SKIP);
+}
+
extern bool freezing_slow_path(struct task_struct *p);

/*
@@ -270,6 +275,7 @@ static inline int freezable_schedule_hrtimeout_range(ktime_t *expires,

#else /* !CONFIG_FREEZER */
static inline bool frozen(struct task_struct *p) { return false; }
+static inline bool frozen_or_skipped(struct task_struct *p) { return false; }
static inline bool freezing(struct task_struct *p) { return false; }
static inline void __thaw_task(struct task_struct *t) {}

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 396ebaebea3f..d2d4c4159b23 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -92,8 +92,8 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
* Ensure the task is not frozen.
* Also, skip vfork and any other user process that freezer should skip.
*/
- if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))
- return;
+ if (unlikely(frozen_or_skipped(t)))
+ return;

/*
* When a freshly created task is scheduled once, changes its state to
--
2.31.1.751.gd2f1c929bd-goog


2021-05-19 17:43:28

by Will Deacon

Subject: [PATCH v6 20/21] arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores

The scheduler now knows enough about these braindead systems to place
32-bit tasks accordingly, so throw out the safety checks and allow the
ret-to-user path to avoid do_notify_resume() if there is nothing to do.

Signed-off-by: Will Deacon <[email protected]>
---
arch/arm64/kernel/process.c | 14 +-------------
arch/arm64/kernel/signal.c | 26 --------------------------
2 files changed, 1 insertion(+), 39 deletions(-)

diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 8e0da06c4e77..de9833f08742 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -527,15 +527,6 @@ static void erratum_1418040_thread_switch(struct task_struct *prev,
write_sysreg(val, cntkctl_el1);
}

-static void compat_thread_switch(struct task_struct *next)
-{
- if (!is_compat_thread(task_thread_info(next)))
- return;
-
- if (static_branch_unlikely(&arm64_mismatched_32bit_el0))
- set_tsk_thread_flag(next, TIF_NOTIFY_RESUME);
-}
-
static void update_sctlr_el1(u64 sctlr)
{
/*
@@ -577,7 +568,6 @@ __notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
ssbs_thread_switch(next);
erratum_1418040_thread_switch(prev, next);
ptrauth_thread_switch_user(next);
- compat_thread_switch(next);

/*
* Complete any pending TLB or cache maintenance on this CPU in case
@@ -657,10 +647,8 @@ void arch_setup_new_exec(void)
* at the point of execve(), although we try a bit harder to
* honour the cpuset hierarchy.
*/
- if (static_branch_unlikely(&arm64_mismatched_32bit_el0)) {
+ if (static_branch_unlikely(&arm64_mismatched_32bit_el0))
force_compatible_cpus_allowed_ptr(current);
- set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
- }
} else if (static_branch_unlikely(&arm64_mismatched_32bit_el0)) {
relax_compatible_cpus_allowed_ptr(current);
}
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index f8192f4ae0b8..6237486ff6bb 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -911,19 +911,6 @@ static void do_signal(struct pt_regs *regs)
restore_saved_sigmask();
}

-static bool cpu_affinity_invalid(struct pt_regs *regs)
-{
- if (!compat_user_mode(regs))
- return false;
-
- /*
- * We're preemptible, but a reschedule will cause us to check the
- * affinity again.
- */
- return !cpumask_test_cpu(raw_smp_processor_id(),
- system_32bit_el0_cpumask());
-}
-
asmlinkage void do_notify_resume(struct pt_regs *regs,
unsigned long thread_flags)
{
@@ -951,19 +938,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
if (thread_flags & _TIF_NOTIFY_RESUME) {
tracehook_notify_resume(regs);
rseq_handle_notify_resume(NULL, regs);
-
- /*
- * If we reschedule after checking the affinity
- * then we must ensure that TIF_NOTIFY_RESUME
- * is set so that we check the affinity again.
- * Since tracehook_notify_resume() clears the
- * flag, ensure that the compiler doesn't move
- * it after the affinity check.
- */
- barrier();
-
- if (cpu_affinity_invalid(regs))
- force_sig(SIGKILL);
}

if (thread_flags & _TIF_FOREIGN_FPSTATE)
--
2.31.1.751.gd2f1c929bd-goog


2021-05-19 17:43:37

by Will Deacon

Subject: [PATCH v6 16/21] arm64: Implement task_cpu_possible_mask()

Provide an implementation of task_cpu_possible_mask() so that we can
prevent 64-bit-only cores from being added to the 'cpus_mask' for
compat tasks on systems with mismatched 32-bit support at EL0.

Signed-off-by: Will Deacon <[email protected]>
---
arch/arm64/include/asm/mmu_context.h | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index d3cef9133539..bb9b7510f334 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -231,6 +231,19 @@ switch_mm(struct mm_struct *prev, struct mm_struct *next,
update_saved_ttbr0(tsk, next);
}

+static inline const struct cpumask *
+task_cpu_possible_mask(struct task_struct *p)
+{
+ if (!static_branch_unlikely(&arm64_mismatched_32bit_el0))
+ return cpu_possible_mask;
+
+ if (!is_compat_thread(task_thread_info(p)))
+ return cpu_possible_mask;
+
+ return system_32bit_el0_cpumask();
+}
+#define task_cpu_possible_mask task_cpu_possible_mask
+
void verify_cpu_asid_bits(void);
void post_ttbr_update_workaround(void);

--
2.31.1.751.gd2f1c929bd-goog


2021-05-19 17:43:38

by Will Deacon

Subject: [PATCH v6 15/21] sched: Defer wakeup in ttwu() for unschedulable frozen tasks

Asymmetric systems may not offer the same level of userspace ISA support
across all CPUs, meaning that some applications cannot be executed by
some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
not feature support for 32-bit applications on both clusters.

Although we take care to prevent explicit hot-unplug of all 32-bit
capable CPUs on such a system, offlining them all is still required when
suspending on some SoCs where the firmware mandates that the
suspend/resume operation is handled by CPU 0, which may not be capable
of running 32-bit tasks.

Consequently, there is a window on the resume path where no 32-bit
capable CPUs are available for scheduling and waking up a 32-bit task
will result in a scheduler BUG() due to failure of select_fallback_rq():

| kernel BUG at kernel/sched/core.c:2858!
| Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
| ...
| Call trace:
| select_fallback_rq+0x4b0/0x4e4
| try_to_wake_up.llvm.4388853297126348405+0x460/0x5b0
| default_wake_function+0x1c/0x30
| autoremove_wake_function+0x1c/0x60
| __wake_up_common.llvm.11763074518265335900+0x100/0x1b8
| __wake_up+0x78/0xc4
| ep_poll_callback+0x20c/0x3fc

Prevent wakeups of unschedulable frozen tasks in ttwu() and instead
defer the wakeup to __thaw_task(), which runs only once all the
secondary CPUs are back online.

Signed-off-by: Will Deacon <[email protected]>
---
kernel/freezer.c | 10 +++++++++-
kernel/sched/core.c | 13 +++++++++++++
2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/kernel/freezer.c b/kernel/freezer.c
index dc520f01f99d..8f3d950c2a87 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -11,6 +11,7 @@
#include <linux/syscalls.h>
#include <linux/freezer.h>
#include <linux/kthread.h>
+#include <linux/mmu_context.h>

/* total number of freezing conditions in effect */
atomic_t system_freezing_cnt = ATOMIC_INIT(0);
@@ -146,9 +147,16 @@ bool freeze_task(struct task_struct *p)
void __thaw_task(struct task_struct *p)
{
unsigned long flags;
+ const struct cpumask *mask = task_cpu_possible_mask(p);

spin_lock_irqsave(&freezer_lock, flags);
- if (frozen(p))
+ /*
+ * Wake up frozen tasks. On asymmetric systems where tasks cannot
+ * run on all CPUs, ttwu() may have deferred a wakeup generated
+ * before thaw_secondary_cpus() had completed so we generate
+ * additional wakeups here for tasks in the PF_FREEZER_SKIP state.
+ */
+ if (frozen(p) || (frozen_or_skipped(p) && mask != cpu_possible_mask))
wake_up_process(p);
spin_unlock_irqrestore(&freezer_lock, flags);
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d7d058fc012e..f5ff55786344 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3525,6 +3525,19 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
if (!(p->state & state))
goto unlock;

+#ifdef CONFIG_FREEZER
+ /*
+ * If we're going to wake up a thread which may be frozen, then
+ * we can only do so if we have an active CPU which is capable of
+ * running it. This may not be the case when resuming from suspend,
+ * as the secondary CPUs may not yet be back online. See __thaw_task()
+ * for the actual wakeup.
+ */
+ if (unlikely(frozen_or_skipped(p)) &&
+ !cpumask_intersects(cpu_active_mask, task_cpu_possible_mask(p)))
+ goto unlock;
+#endif
+
trace_sched_waking(p);

/* We're going to change ->state: */
--
2.31.1.751.gd2f1c929bd-goog


2021-05-19 17:43:51

by Will Deacon

Subject: [PATCH v6 21/21] Documentation: arm64: describe asymmetric 32-bit support

Document support for running 32-bit tasks on asymmetric 32-bit systems
and its impact on the user ABI when enabled.
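
As a quick illustration of the documented execve() behaviour (not part
of the document itself), a 32-bit binary running on an asymmetric system
booted with allow_mismatched_32bit_el0 can observe that its affinity
only contains 32-bit-capable CPUs:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  int main(void)
  {
          cpu_set_t set;
          int cpu;

          if (sched_getaffinity(0, sizeof(set), &set))
                  return 1;

          /* After execve(), only 32-bit-capable CPUs remain set. */
          for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
                  if (CPU_ISSET(cpu, &set))
                          printf("may run on CPU %d\n", cpu);

          return 0;
  }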

Signed-off-by: Will Deacon <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 3 +
Documentation/arm64/asymmetric-32bit.rst | 149 ++++++++++++++++++
Documentation/arm64/index.rst | 1 +
3 files changed, 153 insertions(+)
create mode 100644 Documentation/arm64/asymmetric-32bit.rst

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a2e453919bb6..5a1dc7e628a5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -295,6 +295,9 @@
EL0 is indicated by /sys/devices/system/cpu/aarch32_el0
and hot-unplug operations may be restricted.

+ See Documentation/arm64/asymmetric-32bit.rst for more
+ information.
+
amd_iommu= [HW,X86-64]
Pass parameters to the AMD IOMMU driver in the system.
Possible values are:
diff --git a/Documentation/arm64/asymmetric-32bit.rst b/Documentation/arm64/asymmetric-32bit.rst
new file mode 100644
index 000000000000..baf02c143363
--- /dev/null
+++ b/Documentation/arm64/asymmetric-32bit.rst
@@ -0,0 +1,149 @@
+======================
+Asymmetric 32-bit SoCs
+======================
+
+Author: Will Deacon <[email protected]>
+
+Date: 2021-05-17
+
+This document describes the impact of asymmetric 32-bit SoCs on the
+execution of 32-bit (``AArch32``) applications.
+
+Introduction
+============
+
+Some Armv9 SoCs suffer from a big.LITTLE misfeature where only a subset
+of the CPUs are capable of executing 32-bit user applications. On such
+a system, Linux by default treats the asymmetry as a "mismatch" and
+disables support for both the ``PER_LINUX32`` personality and
+``execve(2)`` of 32-bit ELF binaries, with the latter returning
+``-ENOEXEC``. If the mismatch is detected during late onlining of a
+64-bit-only CPU, then the onlining operation fails and the new CPU is
+unavailable for scheduling.
+
+Surprisingly, these SoCs have been produced with the intention of
+running legacy 32-bit binaries. Unsurprisingly, that doesn't work very
+well with the default behaviour of Linux.
+
+It seems inevitable that future SoCs will drop 32-bit support
+altogether, so if you're stuck in the unenviable position of needing to
+run 32-bit code on one of these transitionary platforms then you would
+be wise to consider alternatives such as recompilation, emulation or
+retirement. If none of those options is practical, then read on.
+
+Enabling kernel support
+=======================
+
+Since the kernel support is not completely transparent to userspace,
+allowing 32-bit tasks to run on an asymmetric 32-bit system requires an
+explicit "opt-in" and can be enabled by passing the
+``allow_mismatched_32bit_el0`` parameter on the kernel command-line.
+
+For the remainder of this document we will refer to an *asymmetric
+system* to mean an SoC running Linux with this kernel command-line
+option enabled.
+
+Userspace impact
+================
+
+32-bit tasks running on an asymmetric system behave in mostly the same
+way as on a homogeneous system, with a few key differences relating to
+CPU affinity.
+
+sysfs
+-----
+
+The subset of CPUs capable of running 32-bit tasks is described in
+``/sys/devices/system/cpu/aarch32_el0`` and is documented further in
+``Documentation/ABI/testing/sysfs-devices-system-cpu``.
+
+**Note:** CPUs are advertised by this file as they are detected and so
+late-onlining of 32-bit-capable CPUs can result in the file contents
+being modified by the kernel at runtime. Once advertised, CPUs are never
+removed from the file.
+
+``execve(2)``
+-------------
+
+On a homogeneous system, the CPU affinity of a task is preserved across
+``execve(2)``. This is not always possible on an asymmetric system,
+specifically when the new program being executed is 32-bit yet the
+affinity mask contains 64-bit-only CPUs. In this situation, the kernel
+determines the new affinity mask as follows:
+
+ 1. If the 32-bit-capable subset of the affinity mask is not empty,
+ then the affinity is restricted to that subset and the old affinity
+ mask is saved. This saved mask is inherited over ``fork(2)`` and
+ preserved across ``execve(2)`` of 32-bit programs.
+
+ **Note:** This step does not apply to ``SCHED_DEADLINE`` tasks.
+ See `SCHED_DEADLINE`_.
+
+ 2. Otherwise, the cpuset hierarchy of the task is walked until an
+ ancestor is found containing at least one 32-bit-capable CPU. The
+ affinity of the task is then changed to match the 32-bit-capable
+ subset of the cpuset determined by the walk.
+
+ 3. On failure (i.e. out of memory), the affinity is changed to the set
+ of all 32-bit-capable CPUs of which the kernel is aware.
+
+A subsequent ``execve(2)`` of a 64-bit program by the 32-bit task will
+invalidate the affinity mask saved in (1) and attempt to restore the CPU
+affinity of the task using the saved mask if it was previously valid.
+This restoration may fail due to intervening changes to the deadline
+policy or cpuset hierarchy, in which case the ``execve(2)`` continues
+with the affinity unchanged.
+
+Calls to ``sched_setaffinity(2)`` for a 32-bit task will consider only
+the 32-bit-capable CPUs of the requested affinity mask. On success, the
+affinity for the task is updated and any saved mask from a prior
+``execve(2)`` is invalidated.
+
+``SCHED_DEADLINE``
+------------------
+
+Admitting a 32-bit task to the deadline scheduler (e.g. by calling
+``sched_setattr(2)``) will, if valid, consider the affinity mask saved
+by a previous call to ``execve(2)`` for the purposes of input validation
+in preference to the running affinity of the task. 64-bit deadline tasks
+will skip step (1) of the process described in `execve(2)`_ when
+executing a 32-bit program.
+
+**Note:** It is recommended that the 32-bit-capable CPUs are placed into
+a separate root domain if ``SCHED_DEADLINE`` is to be used with 32-bit
+tasks on an asymmetric system. Failure to do so is likely to result in
+missed deadlines.
+
+Cpusets
+-------
+
+The affinity of a 32-bit task may include CPUs that are not explicitly
+allowed by the cpuset to which it is attached. This can occur as a
+result of the following two situations:
+
+ - A 64-bit task attached to a cpuset which allows only 64-bit CPUs
+ executes a 32-bit program.
+
+ - All of the 32-bit-capable CPUs allowed by a cpuset containing a
+ 32-bit task are offlined.
+
+In both of these cases, the new affinity is calculated according to step
+(2) of the process described in `execve(2)`_ and the cpuset hierarchy is
+unchanged irrespective of the cgroup version.
+
+CPU hotplug
+-----------
+
+When the kernel detects asymmetric 32-bit hardware, the first detected
+32-bit-capable CPU is prevented from being offlined by userspace and any
+such attempt will return ``-EPERM``. Note that suspend is still
+permitted even if the primary CPU (i.e. CPU 0) is 64-bit-only.
+
+KVM
+---
+
+Although KVM will not advertise 32-bit EL0 support to any vCPUs on an
+asymmetric system, a broken guest at EL1 could still attempt to execute
+32-bit code at EL0. In this case, an exit from a vCPU thread in 32-bit
+mode will return to host userspace with an ``exit_reason`` of
+``KVM_EXIT_FAIL_ENTRY``.
diff --git a/Documentation/arm64/index.rst b/Documentation/arm64/index.rst
index 97d65ba12a35..4f840bac083e 100644
--- a/Documentation/arm64/index.rst
+++ b/Documentation/arm64/index.rst
@@ -10,6 +10,7 @@ ARM64 Architecture
acpi_object_usage
amu
arm-acpi
+ asymmetric-32bit
booting
cpu-feature-registers
elf_hwcaps
--
2.31.1.751.gd2f1c929bd-goog


2021-05-19 17:43:57

by Will Deacon

Subject: [PATCH v6 05/21] arm64: Advertise CPUs capable of running 32-bit applications in sysfs

Since 32-bit applications will be killed if they are caught trying to
execute on a 64-bit-only CPU in a mismatched system, advertise the set
of 32-bit capable CPUs to userspace in sysfs.
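
For example (illustrative only), userspace can read the new file like
any other CPU list, e.g. expecting output such as "0-3,6" on a
hypothetical system:

  #include <stdio.h>

  int main(void)
  {
          char buf[256];
          FILE *f = fopen("/sys/devices/system/cpu/aarch32_el0", "r");

          /* Absent file: all or none of the CPUs are 32-bit capable. */
          if (!f)
                  return 1;

          if (fgets(buf, sizeof(buf), f))
                  printf("32-bit-capable CPUs: %s", buf);

          fclose(f);
          return 0;
  }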

Reviewed-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
.../ABI/testing/sysfs-devices-system-cpu | 9 +++++++++
arch/arm64/kernel/cpufeature.c | 19 +++++++++++++++++++
2 files changed, 28 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index fe13baa53c59..899377b2715a 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -494,6 +494,15 @@ Description: AArch64 CPU registers
'identification' directory exposes the CPU ID registers for
identifying model and revision of the CPU.

+What: /sys/devices/system/cpu/aarch32_el0
+Date: May 2021
+Contact: Linux ARM Kernel Mailing list <[email protected]>
+Description: Identifies the subset of CPUs in the system that can execute
+ AArch32 (32-bit ARM) applications. If present, the same format as
+ /sys/devices/system/cpu/{offline,online,possible,present} is used.
+ If absent, then all or none of the CPUs can execute AArch32
+ applications and execve() will behave accordingly.
+
What: /sys/devices/system/cpu/cpu#/cpu_capacity
Date: December 2016
Contact: Linux kernel mailing list <[email protected]>
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 4194a47de62d..959442f76ed7 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -67,6 +67,7 @@
#include <linux/crash_dump.h>
#include <linux/sort.h>
#include <linux/stop_machine.h>
+#include <linux/sysfs.h>
#include <linux/types.h>
#include <linux/minmax.h>
#include <linux/mm.h>
@@ -1297,6 +1298,24 @@ const struct cpumask *system_32bit_el0_cpumask(void)
return cpu_possible_mask;
}

+static ssize_t aarch32_el0_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ const struct cpumask *mask = system_32bit_el0_cpumask();
+
+ return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(mask));
+}
+static const DEVICE_ATTR_RO(aarch32_el0);
+
+static int __init aarch32_el0_sysfs_init(void)
+{
+ if (!allow_mismatched_32bit_el0)
+ return 0;
+
+ return device_create_file(cpu_subsys.dev_root, &dev_attr_aarch32_el0);
+}
+device_initcall(aarch32_el0_sysfs_init);
+
static bool has_32bit_el0(const struct arm64_cpu_capabilities *entry, int scope)
{
if (!has_cpuid_feature(entry, scope))
--
2.31.1.751.gd2f1c929bd-goog
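
For reference, a small userspace sketch of how the new file might be
consumed (hypothetical; the "0-3" value in the comment is purely
illustrative):

#include <stdio.h>

int main(void)
{
        char buf[256];
        FILE *f = fopen("/sys/devices/system/cpu/aarch32_el0", "r");

        if (!f) {
                /* File absent: all or none of the CPUs can run AArch32. */
                perror("aarch32_el0");
                return 1;
        }

        /* Same cpulist format as .../cpu/{online,possible}, e.g. "0-3". */
        if (fgets(buf, sizeof(buf), f))
                printf("32-bit-capable CPUs: %s", buf);

        fclose(f);
        return 0;
}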


2021-05-19 17:44:00

by Will Deacon

[permalink] [raw]
Subject: [PATCH v6 17/21] arm64: exec: Adjust affinity for compat tasks with mismatched 32-bit EL0

When exec'ing a 32-bit task on a system with mismatched support for
32-bit EL0, try to ensure that it starts life on a CPU that can actually
run it.

Similarly, when exec'ing a 64-bit task on such a system, try to restore
the old affinity mask if it was previously restricted.

Reviewed-by: Quentin Perret <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
arch/arm64/kernel/process.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index f4a91bf1ce0c..8e0da06c4e77 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -647,8 +647,22 @@ void arch_setup_new_exec(void)

if (is_compat_task()) {
mmflags = MMCF_AARCH32;
- if (static_branch_unlikely(&arm64_mismatched_32bit_el0))
+
+ /*
+ * Restrict the CPU affinity mask for a 32-bit task so that
+ * it contains only 32-bit-capable CPUs.
+ *
+ * From the perspective of the task, this looks similar to
+ * what would happen if the 64-bit-only CPUs were hot-unplugged
+ * at the point of execve(), although we try a bit harder to
+ * honour the cpuset hierarchy.
+ */
+ if (static_branch_unlikely(&arm64_mismatched_32bit_el0)) {
+ force_compatible_cpus_allowed_ptr(current);
set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+ }
+ } else if (static_branch_unlikely(&arm64_mismatched_32bit_el0)) {
+ relax_compatible_cpus_allowed_ptr(current);
}

current->mm->context.flags = mmflags;
--
2.31.1.751.gd2f1c929bd-goog


2021-05-19 17:44:28

by Will Deacon

[permalink] [raw]
Subject: [PATCH v6 19/21] arm64: Hook up cmdline parameter to allow mismatched 32-bit EL0

Allow systems with mismatched 32-bit support at EL0 to run 32-bit
applications based on a new kernel parameter.

Signed-off-by: Will Deacon <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 8 ++++++++
arch/arm64/kernel/cpufeature.c | 7 +++++++
2 files changed, 15 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index cb89dbdedc46..a2e453919bb6 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -287,6 +287,14 @@
do not want to use tracing_snapshot_alloc() as it needs
to be done where GFP_KERNEL allocations are allowed.

+ allow_mismatched_32bit_el0 [ARM64]
+ Allow execve() of 32-bit applications and setting of the
+ PER_LINUX32 personality on systems where only a strict
+ subset of the CPUs support 32-bit EL0. When this
+ parameter is present, the set of CPUs supporting 32-bit
+ EL0 is indicated by /sys/devices/system/cpu/aarch32_el0
+ and hot-unplug operations may be restricted.
+
amd_iommu= [HW,X86-64]
Pass parameters to the AMD IOMMU driver in the system.
Possible values are:
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 72efdc611b14..f2c97baa050f 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1298,6 +1298,13 @@ const struct cpumask *system_32bit_el0_cpumask(void)
return cpu_possible_mask;
}

+static int __init parse_32bit_el0_param(char *str)
+{
+ allow_mismatched_32bit_el0 = true;
+ return 0;
+}
+early_param("allow_mismatched_32bit_el0", parse_32bit_el0_param);
+
static ssize_t aarch32_el0_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
--
2.31.1.751.gd2f1c929bd-goog


2021-05-19 17:44:30

by Will Deacon

[permalink] [raw]
Subject: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On asymmetric systems where the affinity of a task is restricted to
contain only the CPUs capable of running it, admission to the deadline
scheduler is likely to fail because the span of the sched domain
contains incompatible CPUs. Although this is arguably the right thing to
do, it is inconsistent with the case where the affinity of a task is
restricted after already having been admitted to the deadline scheduler.

For example, on an arm64 system where not all CPUs support 32-bit
applications, a 64-bit deadline task can exec() a 32-bit image and have
its affinity forcefully restricted.

Rather than reject these tasks altogether, favour the requested user
affinity saved in 'task_struct::user_cpus_ptr' over the actual affinity
of the task which has been restricted by the kernel.

Signed-off-by: Will Deacon <[email protected]>
---
kernel/sched/core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ba66bcf8e812..d7d058fc012e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6403,13 +6403,14 @@ static int __sched_setscheduler(struct task_struct *p,
if (dl_bandwidth_enabled() && dl_policy(policy) &&
!(attr->sched_flags & SCHED_FLAG_SUGOV)) {
cpumask_t *span = rq->rd->span;
+ const cpumask_t *aff = p->user_cpus_ptr ?: p->cpus_ptr;

/*
* Don't allow tasks with an affinity mask smaller than
* the entire root_domain to become SCHED_DEADLINE. We
* will also fail if there's no bandwidth available.
*/
- if (!cpumask_subset(span, p->cpus_ptr) ||
+ if (!cpumask_subset(span, aff) ||
rq->rd->dl_bw.bw == 0) {
retval = -EPERM;
goto unlock;
--
2.31.1.751.gd2f1c929bd-goog
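
The execve()-from-deadline scenario described in the changelog can be
reproduced with a userspace sketch along these lines (illustrative only;
"./32bit_app" stands in for any AArch32 binary and the runtime/deadline
values are arbitrary):

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
};

int main(void)
{
        struct sched_attr attr = {
                .size           = sizeof(attr),
                .sched_policy   = SCHED_DEADLINE,
                .sched_runtime  = 5000000,      /* 5ms */
                .sched_deadline = 16666666,     /* ~16.6ms */
                .sched_period   = 16666666,
        };

        /* Admit the (64-bit) caller to SCHED_DEADLINE... */
        if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
                perror("sched_setattr");
                return 1;
        }

        /* ...then exec a 32-bit image, forcing the affinity restriction. */
        execl("./32bit_app", "32bit_app", (char *)NULL);
        perror("execl");
        return 1;
}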


2021-05-19 17:44:48

by Will Deacon

[permalink] [raw]
Subject: [PATCH v6 11/21] sched: Split the guts of sched_setaffinity() into a helper function

In preparation for replaying user affinity requests using a saved mask,
split sched_setaffinity() up so that the initial task lookup and
security checks are only performed when the request is coming directly
from userspace.

Signed-off-by: Will Deacon <[email protected]>
---
kernel/sched/core.c | 110 +++++++++++++++++++++++---------------------
1 file changed, 58 insertions(+), 52 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9512623d5a60..808bbe669a6d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6788,9 +6788,61 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
return retval;
}

-long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
+static int
+__sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
{
+ int retval;
cpumask_var_t cpus_allowed, new_mask;
+
+ if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL))
+ return -ENOMEM;
+
+ if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) {
+ free_cpumask_var(cpus_allowed);
+ return -ENOMEM;
+ }
+
+ cpuset_cpus_allowed(p, cpus_allowed);
+ cpumask_and(new_mask, mask, cpus_allowed);
+
+ /*
+ * Since bandwidth control happens on root_domain basis,
+ * if admission test is enabled, we only admit -deadline
+ * tasks allowed to run on all the CPUs in the task's
+ * root_domain.
+ */
+#ifdef CONFIG_SMP
+ if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
+ rcu_read_lock();
+ if (!cpumask_subset(task_rq(p)->rd->span, new_mask)) {
+ retval = -EBUSY;
+ rcu_read_unlock();
+ goto out_free_masks;
+ }
+ rcu_read_unlock();
+ }
+#endif
+again:
+ retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
+ if (retval)
+ goto out_free_masks;
+
+ cpuset_cpus_allowed(p, cpus_allowed);
+ if (!cpumask_subset(new_mask, cpus_allowed)) {
+ /*
+ * We must have raced with a concurrent cpuset update.
+ * Just reset the cpumask to the cpuset's cpus_allowed.
+ */
+ cpumask_copy(new_mask, cpus_allowed);
+ goto again;
+ }
+
+out_free_masks:
+ free_cpumask_var(new_mask);
+ free_cpumask_var(cpus_allowed);
+ return retval;
+}
+
+long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
+{
struct task_struct *p;
int retval;

@@ -6810,68 +6862,22 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
retval = -EINVAL;
goto out_put_task;
}
- if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) {
- retval = -ENOMEM;
- goto out_put_task;
- }
- if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) {
- retval = -ENOMEM;
- goto out_free_cpus_allowed;
- }
- retval = -EPERM;
+
if (!check_same_owner(p)) {
rcu_read_lock();
if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
rcu_read_unlock();
- goto out_free_new_mask;
+ retval = -EPERM;
+ goto out_put_task;
}
rcu_read_unlock();
}

retval = security_task_setscheduler(p);
if (retval)
- goto out_free_new_mask;
-
-
- cpuset_cpus_allowed(p, cpus_allowed);
- cpumask_and(new_mask, in_mask, cpus_allowed);
-
- /*
- * Since bandwidth control happens on root_domain basis,
- * if admission test is enabled, we only admit -deadline
- * tasks allowed to run on all the CPUs in the task's
- * root_domain.
- */
-#ifdef CONFIG_SMP
- if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
- rcu_read_lock();
- if (!cpumask_subset(task_rq(p)->rd->span, new_mask)) {
- retval = -EBUSY;
- rcu_read_unlock();
- goto out_free_new_mask;
- }
- rcu_read_unlock();
- }
-#endif
-again:
- retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
+ goto out_put_task;

- if (!retval) {
- cpuset_cpus_allowed(p, cpus_allowed);
- if (!cpumask_subset(new_mask, cpus_allowed)) {
- /*
- * We must have raced with a concurrent cpuset
- * update. Just reset the cpus_allowed to the
- * cpuset's cpus_allowed
- */
- cpumask_copy(new_mask, cpus_allowed);
- goto again;
- }
- }
-out_free_new_mask:
- free_cpumask_var(new_mask);
-out_free_cpus_allowed:
- free_cpumask_var(cpus_allowed);
+ retval = __sched_setaffinity(p, in_mask);
out_put_task:
put_task_struct(p);
return retval;
--
2.31.1.751.gd2f1c929bd-goog


2021-05-19 17:49:13

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On Tuesday 18 May 2021 at 10:47:17 (+0100), Will Deacon wrote:
> On asymmetric systems where the affinity of a task is restricted to
> contain only the CPUs capable of running it, admission to the deadline
> scheduler is likely to fail because the span of the sched domain
> contains incompatible CPUs. Although this is arguably the right thing to
> do, it is inconsistent with the case where the affinity of a task is
> restricted after already having been admitted to the deadline scheduler.
>
> For example, on an arm64 system where not all CPUs support 32-bit
> applications, a 64-bit deadline task can exec() a 32-bit image and have
> its affinity forcefully restricted.

So I guess the alternative would be to fail exec-ing into 32bit from a
64bit DL task, and then drop this patch?

The nice thing about your approach is that existing applications won't
really notice a difference (execve would still 'work'), but on the cons
side it breaks admission control, which is sad.

I don't expect this weird execve-to-32bit pattern from DL to be that
common in practice (at the very least not in Android), so maybe we could
start with the stricter version (fail the execve), and wait to see if
folks complain? Making things stricter later will be harder.

Thoughts?

Thanks,
Quentin

2021-05-19 17:49:57

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

[Dropping Li Zefan as his mail is bouncing]

On Tue, May 18, 2021 at 10:20:38AM +0000, Quentin Perret wrote:
> On Tuesday 18 May 2021 at 10:47:17 (+0100), Will Deacon wrote:
> > On asymmetric systems where the affinity of a task is restricted to
> > contain only the CPUs capable of running it, admission to the deadline
> > scheduler is likely to fail because the span of the sched domain
> > contains incompatible CPUs. Although this is arguably the right thing to
> > do, it is inconsistent with the case where the affinity of a task is
> > restricted after already having been admitted to the deadline scheduler.
> >
> > For example, on an arm64 system where not all CPUs support 32-bit
> > applications, a 64-bit deadline task can exec() a 32-bit image and have
> > its affinity forcefully restricted.
>
> So I guess the alternative would be to fail exec-ing into 32bit from a
> 64bit DL task, and then drop this patch?
>
> The nice thing about your approach is that existing applications won't
> really notice a difference (execve would still 'work'), but on the cons
> side it breaks admission control, which is sad.

Right, with your suggestion here we would forbid any 32-bit deadline tasks
on an asymmetric system, even if you'd gone to the extraordinary effort
to cater for that (e.g. by having a separate root domain).

> I don't expect this weird execve-to-32bit pattern from DL to be that
> common in practice (at the very least not in Android), so maybe we could
> start with the stricter version (fail the execve), and wait to see if
> folks complain? Making things stricter later will be harder.
>
> Thoughts?

I don't have strong opinions on this, but I _do_ want the admission via
sched_setattr() to be consistent with execve(). What you're suggesting
ticks that box, but how many applications are prepared to handle a failed
execve()? I suspect it will be fatal.

Probably also worth pointing out that the approach here will at least
warn in the execve() case when the affinity is overridden for a deadline
task.

Will

2021-05-19 17:54:04

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On Tue, May 18, 2021 at 10:48:07AM +0000, Quentin Perret wrote:
> On Tuesday 18 May 2021 at 11:28:34 (+0100), Will Deacon wrote:
> > I don't have strong opinions on this, but I _do_ want the admission via
> > sched_setattr() to be consistent with execve(). What you're suggesting
> > ticks that box, but how many applications are prepared to handle a failed
> > execve()? I suspect it will be fatal.
>
> Yep, probably.
>
> > Probably also worth pointing out that the approach here will at least
> > warn in the execve() case when the affinity is overridden for a deadline
> > task.
>
> Right so I think either way will be imperfect, so I agree with the
> above.
>
> Maybe one thing though is that, IIRC, userspace _can_ disable admission
> control if it wants to. In this case I'd have no problem with allowing
> this weird behaviour when admission control is off -- the kernel won't
> provide any guarantees. But if it's left on, then it's a different
> story.
>
> So what about we say, if admission control is off, we allow execve() and
> sched_setattr() with appropriate warnings as you suggest, but if
> admission control is on then we fail both?

That's an interesting idea. The part that I'm not super keen about is
that it means admission control _also_ has an effect on the behaviour of
execve(), so practically you'd have to have it disabled as long as you
have the possibility of 32-bit deadline tasks anywhere in the system,
which impacts 64-bit tasks which may well want admission control enabled.

So perhaps my initial position of trying to keep sched_setattr() and
execve() consistent with each other is flawed and actually we can say:

* Disable admission control if you want to admit a 32-bit task explicitly
via sched_setattr()

* If a 64-bit deadline task execve()s a 32-bit program then we warn
and override the affinity (i.e. you should avoid doing this if you
care about the deadlines).

That amounts to dropping this patch and tweaking the documentation.

Dunno, what do you think?

Will

2021-05-19 17:54:17

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On Tuesday 18 May 2021 at 11:28:34 (+0100), Will Deacon wrote:
> I don't have strong opinions on this, but I _do_ want the admission via
> sched_setattr() to be consistent with execve(). What you're suggesting
> ticks that box, but how many applications are prepared to handle a failed
> execve()? I suspect it will be fatal.

Yep, probably.

> Probably also worth pointing out that the approach here will at least
> warn in the execve() case when the affinity is overridden for a deadline
> task.

Right so I think either way will be imperfect, so I agree with the
above.

Maybe one thing though is that, IIRC, userspace _can_ disable admission
control if it wants to. In this case I'd have no problem with allowing
this weird behaviour when admission control is off -- the kernel won't
provide any guarantees. But if it's left on, then it's a different
story.

So what about we say, if admission control is off, we allow execve() and
sched_setattr() with appropriate warnings as you suggest, but if
admission control is on then we fail both?

We might still see random failures in the wild if admission control is
left enabled on those devices but then I think these could qualify as
a device misconfiguration, not as a kernel bug.

Thoughts?

Thanks,
Quentin

2021-05-19 18:10:26

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On Tuesday 18 May 2021 at 11:59:51 (+0100), Will Deacon wrote:
> On Tue, May 18, 2021 at 10:48:07AM +0000, Quentin Perret wrote:
> > On Tuesday 18 May 2021 at 11:28:34 (+0100), Will Deacon wrote:
> > > I don't have strong opinions on this, but I _do_ want the admission via
> > > sched_setattr() to be consistent with execve(). What you're suggesting
> > > ticks that box, but how many applications are prepared to handle a failed
> > > execve()? I suspect it will be fatal.
> >
> > Yep, probably.
> >
> > > Probably also worth pointing out that the approach here will at least
> > > warn in the execve() case when the affinity is overridden for a deadline
> > > task.
> >
> > Right so I think either way will be imperfect, so I agree with the
> > above.
> >
> > Maybe one thing though is that, IIRC, userspace _can_ disable admission
> > control if it wants to. In this case I'd have no problem with allowing
> > this weird behaviour when admission control is off -- the kernel won't
> > provide any guarantees. But if it's left on, then it's a different
> > story.
> >
> > So what about we say, if admission control is off, we allow execve() and
> > sched_setattr() with appropriate warnings as you suggest, but if
> > admission control is on then we fail both?
>
> That's an interesting idea. The part that I'm not super keen about is
> that it means admission control _also_ has an effect on the behaviour of
> execve()

Right, that's a good point. And it looks like fork() behaves the same
regardless of admission control being enabled or not -- it is forbidden
from DL either way. So I can't say there is a precedent :/

> so practically you'd have to have it disabled as long as you
> have the possibility of 32-bit deadline tasks anywhere in the system,
> which impacts 64-bit tasks which may well want admission control enabled.

Indeed, this is a bit sad, but I don't know if the kernel should pretend
it can guarantee to meet your deadlines and at the same time allow to do
something that wrecks the underlying theory.

I'd personally be happy with saying that admission control should be
disabled on these dumb systems (and have that documented), at least
until DL gets proper support for affinities. ISTR there was work going
in that direction, but some folks in the CC list will know better.

@Juri, maybe you would know if that's still planned?

Thanks,
Quentin

2021-05-19 21:01:17

by Will Deacon

[permalink] [raw]
Subject: [PATCH v6 04/21] arm64: Kill 32-bit applications scheduled on 64-bit-only CPUs

Scheduling a 32-bit application on a 64-bit-only CPU is a bad idea.

Ensure that 32-bit applications always take the slow-path when returning
to userspace on a system with mismatched support at EL0, so that we can
avoid trying to run on a 64-bit-only CPU and force a SIGKILL instead.

Signed-off-by: Will Deacon <[email protected]>
---
arch/arm64/kernel/process.c | 19 ++++++++++++++++++-
arch/arm64/kernel/signal.c | 26 ++++++++++++++++++++++++++
2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index b4bb67f17a2c..f4a91bf1ce0c 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -527,6 +527,15 @@ static void erratum_1418040_thread_switch(struct task_struct *prev,
write_sysreg(val, cntkctl_el1);
}

+static void compat_thread_switch(struct task_struct *next)
+{
+ if (!is_compat_thread(task_thread_info(next)))
+ return;
+
+ if (static_branch_unlikely(&arm64_mismatched_32bit_el0))
+ set_tsk_thread_flag(next, TIF_NOTIFY_RESUME);
+}
+
static void update_sctlr_el1(u64 sctlr)
{
/*
@@ -568,6 +577,7 @@ __notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
ssbs_thread_switch(next);
erratum_1418040_thread_switch(prev, next);
ptrauth_thread_switch_user(next);
+ compat_thread_switch(next);

/*
* Complete any pending TLB or cache maintenance on this CPU in case
@@ -633,8 +643,15 @@ unsigned long arch_align_stack(unsigned long sp)
*/
void arch_setup_new_exec(void)
{
- current->mm->context.flags = is_compat_task() ? MMCF_AARCH32 : 0;
+ unsigned long mmflags = 0;
+
+ if (is_compat_task()) {
+ mmflags = MMCF_AARCH32;
+ if (static_branch_unlikely(&arm64_mismatched_32bit_el0))
+ set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+ }

+ current->mm->context.flags = mmflags;
ptrauth_thread_init_user();
mte_thread_init_user();

diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 6237486ff6bb..f8192f4ae0b8 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -911,6 +911,19 @@ static void do_signal(struct pt_regs *regs)
restore_saved_sigmask();
}

+static bool cpu_affinity_invalid(struct pt_regs *regs)
+{
+ if (!compat_user_mode(regs))
+ return false;
+
+ /*
+ * We're preemptible, but a reschedule will cause us to check the
+ * affinity again.
+ */
+ return !cpumask_test_cpu(raw_smp_processor_id(),
+ system_32bit_el0_cpumask());
+}
+
asmlinkage void do_notify_resume(struct pt_regs *regs,
unsigned long thread_flags)
{
@@ -938,6 +951,19 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
if (thread_flags & _TIF_NOTIFY_RESUME) {
tracehook_notify_resume(regs);
rseq_handle_notify_resume(NULL, regs);
+
+ /*
+ * If we reschedule after checking the affinity
+ * then we must ensure that TIF_NOTIFY_RESUME
+ * is set so that we check the affinity again.
+ * Since tracehook_notify_resume() clears the
+ * flag, ensure that the compiler doesn't move
+ * it after the affinity check.
+ */
+ barrier();
+
+ if (cpu_affinity_invalid(regs))
+ force_sig(SIGKILL);
}

if (thread_flags & _TIF_FOREIGN_FPSTATE)
--
2.31.1.751.gd2f1c929bd-goog


2021-05-20 11:51:59

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

Hi Juri,

On Thu, May 20, 2021 at 11:13:39AM +0200, Juri Lelli wrote:
> Apologies for the delay in replying.

Not at all!

> On 18/05/21 13:19, Quentin Perret wrote:
> > On Tuesday 18 May 2021 at 11:59:51 (+0100), Will Deacon wrote:
> > > On Tue, May 18, 2021 at 10:48:07AM +0000, Quentin Perret wrote:
> > > > On Tuesday 18 May 2021 at 11:28:34 (+0100), Will Deacon wrote:
> > > > > I don't have strong opinions on this, but I _do_ want the admission via
> > > > > sched_setattr() to be consistent with execve(). What you're suggesting
> > > > > ticks that box, but how many applications are prepared to handle a failed
> > > > > execve()? I suspect it will be fatal.
> > > >
> > > > Yep, probably.
> > > >
> > > > > Probably also worth pointing out that the approach here will at least
> > > > > warn in the execve() case when the affinity is overridden for a deadline
> > > > > task.
> > > >
> > > > Right so I think either way will be imperfect, so I agree with the
> > > > above.
> > > >
> > > > Maybe one thing though is that, IIRC, userspace _can_ disable admission
> > > > control if it wants to. In this case I'd have no problem with allowing
> > > > this weird behaviour when admission control is off -- the kernel won't
> > > > provide any guarantees. But if it's left on, then it's a different
> > > > story.
> > > >
> > > > So what about we say, if admission control is off, we allow execve() and
> > > > sched_setattr() with appropriate warnings as you suggest, but if
> > > > admission control is on then we fail both?
> > >
> > > That's an interesting idea. The part that I'm not super keen about is
> > > that it means admission control _also_ has an effect on the behaviour of
> > > execve()
> >
> > Right, that's a good point. And it looks like fork() behaves the same
> > regardless of admission control being enabled or not -- it is forbidden
> > from DL either way. So I can't say there is a precedent :/
> >
> > > so practically you'd have to have it disabled as long as you
> > > have the possibility of 32-bit deadline tasks anywhere in the system,
> > > which impacts 64-bit tasks which may well want admission control enabled.
> >
> > Indeed, this is a bit sad, but I don't know if the kernel should pretend
> > it can guarantee to meet your deadlines and at the same time allow to do
> > something that wrecks the underlying theory.
> >
> > I'd personally be happy with saying that admission control should be
> > disabled on these dumb systems (and have that documented), at least
> > until DL gets proper support for affinities. ISTR there was work going
> > in that direction, but some folks in the CC list will know better.
> >
> > @Juri, maybe you would know if that's still planned?
>
> I won't go as far as saying planned, but that is still under "our" radar
> for sure. Daniel was working on it, but I don't think he had any time to
> resume that bit of work lately.
>
> So, until we have that, I think we have been as conservative as we could
> for this type of decision. I'm a little afraid that allowing
> configuration to break admission control (even if a non-fatal warning
> is emitted) is still risky. I'd go with fail hard if AC is on, let it
> pass if AC is off (supposedly the user knows what to do). But I'm not
> familiar with the mixed 32/64 apps usecase you describe, so I might be
> missing details.

Ok, thanks for the insight. In which case, I'll go with what we discussed:
require admission control to be disabled for sched_setattr() but allow
execve() to a 32-bit task from a 64-bit deadline task with a warning (this
is probably similar to CPU hotplug?).

I'll update that for v8, and this patch will disappear.

Will

2021-05-20 14:41:44

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

Hi Quentin and Will,

Apologies for the delay in replying.

On 18/05/21 13:19, Quentin Perret wrote:
> On Tuesday 18 May 2021 at 11:59:51 (+0100), Will Deacon wrote:
> > On Tue, May 18, 2021 at 10:48:07AM +0000, Quentin Perret wrote:
> > > On Tuesday 18 May 2021 at 11:28:34 (+0100), Will Deacon wrote:
> > > > I don't have strong opinions on this, but I _do_ want the admission via
> > > > sched_setattr() to be consistent with execve(). What you're suggesting
> > > > ticks that box, but how many applications are prepared to handle a failed
> > > > execve()? I suspect it will be fatal.
> > >
> > > Yep, probably.
> > >
> > > > Probably also worth pointing out that the approach here will at least
> > > > warn in the execve() case when the affinity is overridden for a deadline
> > > > task.
> > >
> > > Right so I think either way will be imperfect, so I agree with the
> > > above.
> > >
> > > Maybe one thing though is that, IIRC, userspace _can_ disable admission
> > > control if it wants to. In this case I'd have no problem with allowing
> > > this weird behaviour when admission control is off -- the kernel won't
> > > provide any guarantees. But if it's left on, then it's a different
> > > story.
> > >
> > > So what about we say, if admission control is off, we allow execve() and
> > > sched_setattr() with appropriate warnings as you suggest, but if
> > > admission control is on then we fail both?
> >
> > That's an interesting idea. The part that I'm not super keen about is
> > that it means admission control _also_ has an effect on the behaviour of
> > execve()
>
> Right, that's a good point. And it looks like fork() behaves the same
> regardless of admission control being enabled or not -- it is forbidden
> from DL either way. So I can't say there is a precedent :/
>
> > so practically you'd have to have it disabled as long as you
> > have the possibility of 32-bit deadline tasks anywhere in the system,
> > which impacts 64-bit tasks which may well want admission control enabled.
>
> Indeed, this is a bit sad, but I don't know if the kernel should pretend
> it can guarantee to meet your deadlines and at the same time allow to do
> something that wrecks the underlying theory.
>
> I'd personally be happy with saying that admission control should be
> disabled on these dumb systems (and have that documented), at least
> until DL gets proper support for affinities. ISTR there was work going
> in that direction, but some folks in the CC list will know better.
>
> @Juri, maybe you would know if that's still planned?

I won't go as far as saying planned, but that is still under "our" radar
for sure. Daniel was working on it, but I don't think he had any time to
resume that bit of work lately.

So, until we have that, I think we have been as conservative as we could
for this type of decision. I'm a little afraid that allowing
configuration to break admission control (even if a non-fatal warning
is emitted) is still risky. I'd go with fail hard if AC is on, let it
pass if AC is off (supposedly the user knows what to do). But I'm not
familiar with the mixed 32/64 apps usecase you describe, so I might be
missing details.

Best,
Juri

2021-05-21 01:59:55

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
> Ok, thanks for the insight. In which case, I'll go with what we discussed:
> require admission control to be disabled for sched_setattr() but allow
> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
> is probably similar to CPU hotplug?).

Still not sure that we can let execve go through ... It will break AC
all the same, so it should probably fail as well if AC is on IMO

2021-05-21 02:58:07

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On 20/05/21 10:33, Quentin Perret wrote:
> On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
> > Ok, thanks for the insight. In which case, I'll go with what we discussed:
> > require admission control to be disabled for sched_setattr() but allow
> > execve() to a 32-bit task from a 64-bit deadline task with a warning (this
> > is probably similar to CPU hotplug?).
>
> Still not sure that we can let execve go through ... It will break AC
> all the same, so it should probably fail as well if AC is on IMO
>

Yeah, this was my thinking as well.

by Daniel Bristot de Oliveira

Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On 5/20/21 12:33 PM, Quentin Perret wrote:
> On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
>> Ok, thanks for the insight. In which case, I'll go with what we discussed:
>> require admission control to be disabled for sched_setattr() but allow
>> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
>> is probably similar to CPU hotplug?).
>
> Still not sure that we can let execve go through ... It will break AC
> all the same, so it should probably fail as well if AC is on IMO
>

If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
the admission control needs to be re-executed, and it could fail. So I see this
operation equivalent to sched_setaffinity(). This will likely be true for future
schedulers that will allow arbitrary affinities (AC should run on affinity
change, and could fail).

I would vote with Juri: "I'd go with fail hard if AC is on, let it
pass if AC is off (supposedly the user knows what to do)," (also hope nobody
complains until we add better support for affinity, and use this as a motivation
to get back on this front).

-- Daniel

2021-05-21 06:04:19

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On 20/05/2021 14:38, Daniel Bristot de Oliveira wrote:
> On 5/20/21 12:33 PM, Quentin Perret wrote:
>> On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
>>> Ok, thanks for the insight. In which case, I'll go with what we discussed:
>>> require admission control to be disabled for sched_setattr() but allow
>>> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
>>> is probably similar to CPU hotplug?).
>>
>> Still not sure that we can let execve go through ... It will break AC
>> all the same, so it should probably fail as well if AC is on IMO
>>
>
> If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
> the admission control needs to be re-executed, and it could fail. So I see this
> operation equivalent to sched_setaffinity(). This will likely be true for future
> schedulers that will allow arbitrary affinities (AC should run on affinity
> change, and could fail).
>
> I would vote with Juri: "I'd go with fail hard if AC is on, let it
> pass if AC is off (supposedly the user knows what to do)," (also hope nobody
> complains until we add better support for affinity, and use this as a motivation
> to get back on this front).
>
> -- Daniel

(1) # chrt -d -T 5000000 -P 16666666 0 ./32bit_app

(2) # ./32bit_app &

# chrt -d -T 5000000 -P 16666666 -p 0 pid_of(32bit_app)


Wouldn't the behaviour of (1) and (2) be different w/o this patch?

In (1) __sched_setscheduler() happens before execve so it operates on
p->cpus_ptr equal span.

In (2) span != p->cpus_ptr so DL AC will fail.
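
For reference, the check Dietmar is reasoning about is the one this patch
touches in __sched_setscheduler(); a simplified excerpt, annotated to tie
it back to cases (1) and (2) above:

        cpumask_t *span = rq->rd->span;
        /* Patch 13: prefer the affinity the user asked for, if any. */
        const cpumask_t *aff = p->user_cpus_ptr ?: p->cpus_ptr;

        /*
         * Case (1): __sched_setscheduler() runs before the execve(), so
         * p->cpus_ptr still spans the root domain and admission passes.
         * Case (2): the execve() has already shrunk p->cpus_ptr to the
         * 32-bit-capable CPUs, so without the saved user mask the subset
         * test below fails and admission returns -EPERM.
         */
        if (!cpumask_subset(span, aff) || rq->rd->dl_bw.bw == 0) {
                retval = -EPERM;
                goto unlock;
        }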

by Daniel Bristot de Oliveira

Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On 5/20/21 5:06 PM, Dietmar Eggemann wrote:
> On 20/05/2021 14:38, Daniel Bristot de Oliveira wrote:
>> On 5/20/21 12:33 PM, Quentin Perret wrote:
>>> On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
>>>> Ok, thanks for the insight. In which case, I'll go with what we discussed:
>>>> require admission control to be disabled for sched_setattr() but allow
>>>> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
>>>> is probably similar to CPU hotplug?).
>>>
>>> Still not sure that we can let execve go through ... It will break AC
>>> all the same, so it should probably fail as well if AC is on IMO
>>>
>>
>> If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
>> the admission control needs to be re-executed, and it could fail. So I see this
>> operation equivalent to sched_setaffinity(). This will likely be true for future
>> schedulers that will allow arbitrary affinities (AC should run on affinity
>> change, and could fail).
>>
>> I would vote with Juri: "I'd go with fail hard if AC is on, let it
>> pass if AC is off (supposedly the user knows what to do)," (also hope nobody
>> complains until we add better support for affinity, and use this as a motivation
>> to get back on this front).
>>
>> -- Daniel
>
> (1) # chrt -d -T 5000000 -P 16666666 0 ./32bit_app
>
> (2) # ./32bit_app &
>
> # chrt -d -T 5000000 -P 16666666 -p 0 pid_of(32bit_app)
>
>
> Wouldn't the behaviour of (1) and (2) be different w/o this patch?
>
> In (1) __sched_setscheduler() happens before execve so it operates on
> p->cpus_ptr equal span.
>
> In (2) span != p->cpus_ptr so DL AC will fail.
>

As far as I got, case (1) would be split into two steps:

- __sched_setscheduler() will work, then
- execv() would fail because (span != p->cpus_ptr)

So... at the end, both (1) and (2) would result in a failure...

am I missing something?

-- Daniel

2021-05-21 06:44:43

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On Thu, May 20, 2021 at 02:38:55PM +0200, Daniel Bristot de Oliveira wrote:
> On 5/20/21 12:33 PM, Quentin Perret wrote:
> > On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
> >> Ok, thanks for the insight. In which case, I'll go with what we discussed:
> >> require admission control to be disabled for sched_setattr() but allow
> >> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
> >> is probably similar to CPU hotplug?).
> >
> > Still not sure that we can let execve go through ... It will break AC
> > all the same, so it should probably fail as well if AC is on IMO
> >
>
> If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
> the admission control needs to be re-executed, and it could fail. So I see this
> operation equivalent to sched_setaffinity(). This will likely be true for future
> schedulers that will allow arbitrary affinities (AC should run on affinity
> change, and could fail).
>
> I would vote with Juri: "I'd go with fail hard if AC is on, let it
> pass if AC is off (supposedly the user knows what to do)," (also hope nobody
> complains until we add better support for affinity, and use this as a motivation
> to get back on this front).

I can have a go at implementing it, but I don't think it's a great solution
and here's why:

Failing an execve() is _very_ likely to be fatal to the application. It's
also very likely that the task calling execve() doesn't know whether the
program it's trying to execute is 32-bit or not. Consequently, if we go
with failing execve() then all that will happen is that people will disable
admission control altogether. That has a negative impact on "pure" 64-bit
applications and so I think we end up with the tail wagging the dog because
admission control will be disabled for everybody just because there is a
handful of 32-bit programs which may get executed. I understand that it
also means that RT throttling would be disabled.

Allowing the execve() to continue with a warning is very similar to the
case in which all the 64-bit CPUs are hot-unplugged at the point of
execve(), and this is much closer to the illusion that this patch series
intends to provide.

So, personally speaking, I would prefer the behaviour where we refuse to
admit 32-bit tasks via sched_setattr() if the root domain contains
64-bit CPUs, but we _don't_ fail execve() of a 32-bit program from a
64-bit deadline task.

However, you're the deadline experts so ultimately I'll implement what
you prefer. I just wanted to explain why I think it's a poor interface.

Have I changed anybody's mind?

Will
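
For illustration, a rough sketch of the sched_setattr()-side rejection
described above. This is not a posted patch: the placement in
__sched_setscheduler() and the use of task_cpu_possible_mask() (introduced
earlier in this series) are assumptions.

        /*
         * Hypothetical: refuse SCHED_DEADLINE for a task that cannot run
         * on every CPU in its root domain, e.g. a 32-bit task on an
         * asymmetric AArch32 system whose root domain also contains
         * 64-bit-only CPUs.
         */
        if (dl_policy(policy) &&
            !cpumask_subset(rq->rd->span, task_cpu_possible_mask(p))) {
                retval = -EPERM;
                goto unlock;
        }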

2021-05-21 09:44:50

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On 20/05/21 19:01, Will Deacon wrote:
> On Thu, May 20, 2021 at 02:38:55PM +0200, Daniel Bristot de Oliveira wrote:
> > On 5/20/21 12:33 PM, Quentin Perret wrote:
> > > On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
> > >> Ok, thanks for the insight. In which case, I'll go with what we discussed:
> > >> require admission control to be disabled for sched_setattr() but allow
> > >> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
> > >> is probably similar to CPU hotplug?).
> > >
> > > Still not sure that we can let execve go through ... It will break AC
> > > all the same, so it should probably fail as well if AC is on IMO
> > >
> >
> > If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
> > the admission control needs to be re-executed, and it could fail. So I see this
> > operation equivalent to sched_setaffinity(). This will likely be true for future
> > schedulers that will allow arbitrary affinities (AC should run on affinity
> > change, and could fail).
> >
> > I would vote with Juri: "I'd go with fail hard if AC is on, let it
> > pass if AC is off (supposedly the user knows what to do)," (also hope nobody
> > complains until we add better support for affinity, and use this as a motivation
> > to get back on this front).
>
> I can have a go at implementing it, but I don't think it's a great solution
> and here's why:
>
> Failing an execve() is _very_ likely to be fatal to the application. It's
> also very likely that the task calling execve() doesn't know whether the
> program it's trying to execute is 32-bit or not. Consequently, if we go
> with failing execve() then all that will happen is that people will disable
> admission control altogether. That has a negative impact on "pure" 64-bit
> applications and so I think we end up with the tail wagging the dog because
> admission control will be disabled for everybody just because there is a
> handful of 32-bit programs which may get executed. I understand that it
> also means that RT throttling would be disabled.

Completely understand your perplexity. But how can the kernel still give
guarantees to "pure" 64-bit applications if there are 32-bit
applications around that essentially broke admission control when they
were restricted to a subset of cores?

> Allowing the execve() to continue with a warning is very similar to the
> case in which all the 64-bit CPUs are hot-unplugged at the point of
> execve(), and this is much closer to the illusion that this patch series
> intends to provide.

So, for hotplug we currently have a check that would make hotplug
operations fail if removing a CPU would mean not enough bandwidth to run
the currently admitted set of DEADLINE tasks.

> So, personally speaking, I would prefer the behaviour where we refuse to
> admit 32-bit tasks via sched_setattr() if the root domain contains
> 64-bit CPUs, but we _don't_ fail execve() of a 32-bit program from a
> 64-bit deadline task.

OK, this is interesting and I guess a very valid alternative. That would
force users to create exclusive domains for 32-bit tasks, right?

> However, you're the deadline experts so ultimately I'll implement what
> you prefer. I just wanted to explain why I think it's a poor interface.
>
> Have I changed anybody's mind?

Partly! :)

Thanks a lot for the discussion so far.

Juri

2021-05-21 11:53:39

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v6 02/21] arm64: Allow mismatched 32-bit EL0 support

On Tue, May 18, 2021 at 10:47:06AM +0100, Will Deacon wrote:
> +static bool has_32bit_el0(const struct arm64_cpu_capabilities *entry, int scope)
> +{
> + if (!has_cpuid_feature(entry, scope))
> + return allow_mismatched_32bit_el0;
> +
> + if (scope == SCOPE_SYSTEM)
> + pr_info("detected: 32-bit EL0 Support\n");
> +
> + return true;
> +}

We may have discussed this before: AFAICT this will print 32-bit EL0
detected even if there's no 32-bit EL0 on any CPU. Should we instead
print 32-bit EL0 detected on CPU X when allow_mismatched_32bit_el0 is
passed? It would also give us an indication of the system configuration
when people start reporting bugs.

--
Catalin
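
One possible shape of the diagnostic Catalin suggests, purely as a sketch
and not the posted patch; whether this capability is (re-)checked with
SCOPE_LOCAL_CPU at a point where smp_processor_id() identifies the CPU of
interest is an assumption:

static bool has_32bit_el0(const struct arm64_cpu_capabilities *entry, int scope)
{
        if (!has_cpuid_feature(entry, scope))
                return allow_mismatched_32bit_el0;

        if (allow_mismatched_32bit_el0 && scope == SCOPE_LOCAL_CPU)
                pr_info("detected: 32-bit EL0 Support on CPU%u\n",
                        smp_processor_id());
        else if (scope == SCOPE_SYSTEM)
                pr_info("detected: 32-bit EL0 Support\n");

        return true;
}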

2021-05-21 12:08:30

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On Fri, May 21, 2021 at 10:39:32AM +0200, Juri Lelli wrote:
> On 21/05/21 08:15, Quentin Perret wrote:
> > On Friday 21 May 2021 at 07:25:51 (+0200), Juri Lelli wrote:
> > > On 20/05/21 19:01, Will Deacon wrote:
> > > > On Thu, May 20, 2021 at 02:38:55PM +0200, Daniel Bristot de Oliveira wrote:
> > > > > On 5/20/21 12:33 PM, Quentin Perret wrote:
> > > > > > On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
> > > > > >> Ok, thanks for the insight. In which case, I'll go with what we discussed:
> > > > > >> require admission control to be disabled for sched_setattr() but allow
> > > > > >> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
> > > > > >> is probably similar to CPU hotplug?).
> > > > > >
> > > > > > Still not sure that we can let execve go through ... It will break AC
> > > > > > all the same, so it should probably fail as well if AC is on IMO
> > > > > >
> > > > >
> > > > > If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
> > > > > the admission control needs to be re-executed, and it could fail. So I see this
> > > > > operation equivalent to sched_setaffinity(). This will likely be true for future
> > > > > schedulers that will allow arbitrary affinities (AC should run on affinity
> > > > > change, and could fail).
> > > > >
> > > > > I would vote with Juri: "I'd go with fail hard if AC is on, let it
> > > > > pass if AC is off (supposedly the user knows what to do)," (also hope nobody
> > > > > complains until we add better support for affinity, and use this as a motivation
> > > > > to get back on this front).
> > > >
> > > > I can have a go at implementing it, but I don't think it's a great solution
> > > > and here's why:
> > > >
> > > > Failing an execve() is _very_ likely to be fatal to the application. It's
> > > > also very likely that the task calling execve() doesn't know whether the
> > > > program it's trying to execute is 32-bit or not. Consequently, if we go
> > > > with failing execve() then all that will happen is that people will disable
> > > > admission control altogether.
> >
> > Right, but only on these dumb 32bit asymmetric systems, and only if we
> > care about running 32bits deadline tasks -- which I seriously doubt for
> > the Android use-case.
> >
> > Note that running deadline tasks is also a privileged operation, it
> > can't be done by random apps.
> >
> > > > That has a negative impact on "pure" 64-bit
> > > > applications and so I think we end up with the tail wagging the dog because
> > > > admission control will be disabled for everybody just because there is a
> > > > handful of 32-bit programs which may get executed. I understand that it
> > > > also means that RT throttling would be disabled.
> > >
> > > Completely understand your perplexity. But how can the kernel still give
> > > guarantees to "pure" 64-bit applications if there are 32-bit
> > > applications around that essentially broke admission control when they
> > > were restricted to a subset of cores?
> > >
> > > > Allowing the execve() to continue with a warning is very similar to the
> > > > case in which all the 64-bit CPUs are hot-unplugged at the point of
> > > > execve(), and this is much closer to the illusion that this patch series
> > > > intends to provide.
> > >
> > > So, for hotplug we currently have a check that would make hotplug
> > > operations fail if removing a CPU would mean not enough bandwidth to run
> > > the currently admitted set of DEADLINE tasks.
> >
> > Aha, wasn't aware. Any pointers to that check for my education?
>
> Hotplug ends up calling dl_cpu_busy() (after the cpu being hotplugged out
> got removed), IIRC. So, if that fails the operation is undone.

Interesting, thanks. Thinking about this some more, it strikes me that with
these silly asymmetric systems there could be an interesting additional
problem with hotplug and deadline tasks. Imagine the following sequence of
events:

1. All online CPUs are 32-bit-capable
2. sched_setattr() admits a 32-bit deadline task
3. A 64-bit-only CPU is onlined
4. Some of the 32-bit-capable CPUs are offlined

I wonder if we can get into a situation where we think we have enough
bandwidth available, but in reality the 32-bit task is in trouble because
it can't make use of the 64-bit-only CPU.

If so, then it seems to me that admission control is really just
"best-effort" for 32-bit deadline tasks on these systems because it's based
on a snapshot in time of the available resources.

Will
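
To put rough numbers on the scenario above (assuming the default 95%
deadline bandwidth limit): with four 32-bit-capable CPUs online, admission
control accepts 32-bit deadline tasks totalling up to ~3.8 CPUs of
bandwidth, say 2.5. Onlining a 64-bit-only CPU raises the apparent
capacity to ~4.75, so subsequently offlining two 32-bit-capable CPUs still
passes the dl_cpu_busy() check (~2.85 remaining versus 2.5 allocated), yet
the 32-bit tasks can only ever use the two remaining 32-bit-capable CPUs,
i.e. ~1.9 CPUs of usable bandwidth.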

2021-05-21 12:17:00

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v6 05/21] arm64: Advertise CPUs capable of running 32-bit applications in sysfs

On Tue, May 18, 2021 at 10:47:09AM +0100, Will Deacon wrote:
> Since 32-bit applications will be killed if they are caught trying to
> execute on a 64-bit-only CPU in a mismatched system, advertise the set
> of 32-bit capable CPUs to userspace in sysfs.
>
> Reviewed-by: Greg Kroah-Hartman <[email protected]>
> Signed-off-by: Will Deacon <[email protected]>

Reviewed-by: Catalin Marinas <[email protected]>

2021-05-21 13:16:28

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On Friday 21 May 2021 at 15:00:42 (+0200), Daniel Bristot de Oliveira wrote:
> On 5/21/21 12:37 PM, Will Deacon wrote:
> > Interesting, thanks. Thinking about this some more, it strikes me that with
> > these silly asymmetric systems there could be an interesting additional
> > problem with hotplug and deadline tasks. Imagine the following sequence of
> > events:
> >
> > 1. All online CPUs are 32-bit-capable
> > 2. sched_setattr() admits a 32-bit deadline task
> > 3. A 64-bit-only CPU is onlined
>
> At point 3, the global scheduler assumption is broken. For instance, in a
> system with four CPUs and five ready 32-bit tasks, when the fifth CPU is
> added, the work-conserving rule is violated because the five highest-priority
> threads are not all running (only four are) :-(.
>
> So, at this point, for us to keep to the current behavior, the addition should
> be.. blocked? :-((
>
> > 4. Some of the 32-bit-capable CPUs are offlined
>
> Assuming that point 3 does not exist (i.e., all CPUs are 32-bit-capable). At
> this point, we will have an increase in the pressure on the 32-bit-capable CPUs.
>
> This can also create bad effects for 64-bit tasks, as the "contended" 32-bit
> tasks will still be "queued" in a future time where they were supposed to be
> done (leaving time for the 64-bit tasks).
>
> > I wonder if we can get into a situation where we think we have enough
> > bandwidth available, but in reality the 32-bit task is in trouble because
> > it can't make use of the 64-bit-only CPU.
>
> I would have to think more, but there might be a case where this contended
> 32-bit tasks could cause deadline misses for the 64-bit too.
>
> > If so, then it seems to me that admission control is really just
> > "best-effort" for 32-bit deadline tasks on these systems because it's based
> > on a snapshot in time of the available resources.
>
> The admission test as it is now is "best-effort" in the sense that it allows a
> workload higher than it could handle (it is a necessary, but not sufficient, AC).
> But it should not be considered "best-effort" because of violations of the
> work-conserving property as a result of arbitrary affinities among tasks.
> Overall, we have been trying to close any "exception left" to this latter case.
>
> I know, it is a complex situation, I am just trying to illustrate our concerns,
> because, in the near future we might have a scheduler that handles arbitrary
> affinity correctly. But that might require us to stick to an AC. The AC is
> something precious for us.

FWIW, I agree with the above. As pointed out in another reply, this
looks like an existing bug and there is nothing specific to 32bits tasks
here.

But I don't think the existence of this bug should be a reason for
breaking AC even more than it is. So it still feels cleaner to block
execve() into 32bit if AC is on until we have proper support for
affinities in DL. And if folks want to use 32bit DL tasks on these
systems they'll have to disable AC, but at least they know what they are
getting ...

Thanks,
Quentin

2021-05-21 16:17:15

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On 20/05/2021 18:00, Daniel Bristot de Oliveira wrote:
> On 5/20/21 5:06 PM, Dietmar Eggemann wrote:
>> On 20/05/2021 14:38, Daniel Bristot de Oliveira wrote:
>>> On 5/20/21 12:33 PM, Quentin Perret wrote:
>>>> On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
>>>>> Ok, thanks for the insight. In which case, I'll go with what we discussed:
>>>>> require admission control to be disabled for sched_setattr() but allow
>>>>> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
>>>>> is probably similar to CPU hotplug?).
>>>>
>>>> Still not sure that we can let execve go through ... It will break AC
>>>> all the same, so it should probably fail as well if AC is on IMO
>>>>
>>>
>>> If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
>>> the admission control needs to be re-executed, and it could fail. So I see this
>>> operation equivalent to sched_setaffinity(). This will likely be true for future
>>> schedulers that will allow arbitrary affinities (AC should run on affinity
>>> change, and could fail).
>>>
>>> I would vote with Juri: "I'd go with fail hard if AC is on, let it
>>> pass if AC is off (supposedly the user knows what to do)," (also hope nobody
>>> complains until we add better support for affinity, and use this as a motivation
>>> to get back on this front).
>>>
>>> -- Daniel
>>
>> (1) # chrt -d -T 5000000 -P 16666666 0 ./32bit_app
>>
>> (2) # ./32bit_app &
>>
>> # chrt -d -T 5000000 -P 16666666 -p 0 pid_of(32bit_app)
>>
>>
>> Wouldn't the behaviour of (1) and (2) be different w/o this patch?
>>
>> In (1) __sched_setscheduler() happens before execve so it operates on
>> p->cpus_ptr equal span.
>>
>> In (2) span != p->cpus_ptr so DL AC will fail.
>>
>
> As far as I got, case (1) would be split into two steps:
>
> - __sched_setscheduler() will work, then
> - execv() would fail because (span != p->cpus_ptr)
>
> So... at the end, both (1) and (2) would result in a failure...
>
> am I missing something?

Not sure. Reading this thread I was under the assumption that the only
change would be the drop of this patch. But I assume there is also this
'if DL AC is on then let sched_setattr() fail for this 32bit task'.

IMHO, the current patch-stack w/o this patch should let (1) succeed with
DL AC.

2021-05-21 16:31:27

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

Hi Dietmar,

On Thu, May 20, 2021 at 07:55:27PM +0200, Dietmar Eggemann wrote:
> On 20/05/2021 18:00, Daniel Bristot de Oliveira wrote:
> > On 5/20/21 5:06 PM, Dietmar Eggemann wrote:
> >> (1) # chrt -d -T 5000000 -P 16666666 0 ./32bit_app
> >>
> >> (2) # ./32bit_app &
> >>
> >> # chrt -d -T 5000000 -P 16666666 -p 0 pid_of(32bit_app)
> >>
> >>
> >> Wouldn't the behaviour of (1) and (2) be different w/o this patch?
> >>
> >> In (1) __sched_setscheduler() happens before execve so it operates on
> >> p->cpus_ptr equal span.
> >>
> >> In (2) span != p->cpus_ptr so DL AC will fail.
> >>
> >
> > As far as I got, case (1) would be split into two steps:
> >
> > - __sched_setscheduler() will work, then
> > - execv() would fail because (span != p->cpus_ptr)
> >
> > So... at the end, both (1) and (2) would result in a failure...
> >
> > am I missing something?
>
> Not sure. Reading this thread I was under the assumption that the only
> change would be the drop of this patch. But I assume there is also this
> 'if DL AC is on then let sched_setattr() fail for this 32bit task'.
>
> IMHO, the current patch-stack w/o this patch should let (1) succeed with
> DL AC.

That's what I'm proposing, yes, but others (including Daniel) prefer to
fail the execve(). See my other reply just now for a summary [1].

Thanks!

Will

[1] https://lore.kernel.org/lkml/20210520180138.GA10523@willie-the-truck/T/#u

2021-05-21 20:09:07

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On Friday 21 May 2021 at 07:25:51 (+0200), Juri Lelli wrote:
> On 20/05/21 19:01, Will Deacon wrote:
> > On Thu, May 20, 2021 at 02:38:55PM +0200, Daniel Bristot de Oliveira wrote:
> > > On 5/20/21 12:33 PM, Quentin Perret wrote:
> > > > On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
> > > >> Ok, thanks for the insight. In which case, I'll go with what we discussed:
> > > >> require admission control to be disabled for sched_setattr() but allow
> > > >> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
> > > >> is probably similar to CPU hotplug?).
> > > >
> > > > Still not sure that we can let execve go through ... It will break AC
> > > > all the same, so it should probably fail as well if AC is on IMO
> > > >
> > >
> > > If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
> > > the admission control needs to be re-executed, and it could fail. So I see this
> > > operation equivalent to sched_setaffinity(). This will likely be true for future
> > > schedulers that will allow arbitrary affinities (AC should run on affinity
> > > change, and could fail).
> > >
> > > I would vote with Juri: "I'd go with fail hard if AC is on, let it
> > > pass if AC is off (supposedly the user knows what to do)," (also hope nobody
> > > complains until we add better support for affinity, and use this as a motivation
> > > to get back on this front).
> >
> > I can have a go at implementing it, but I don't think it's a great solution
> > and here's why:
> >
> > Failing an execve() is _very_ likely to be fatal to the application. It's
> > also very likely that the task calling execve() doesn't know whether the
> > program it's trying to execute is 32-bit or not. Consequently, if we go
> > with failing execve() then all that will happen is that people will disable
> > admission control altogether.

Right, but only on these dumb 32bit asymmetric systems, and only if we
care about running 32bits deadline tasks -- which I seriously doubt for
the Android use-case.

Note that running deadline tasks is also a privileged operation, it
can't be done by random apps.

> > That has a negative impact on "pure" 64-bit
> > applications and so I think we end up with the tail wagging the dog because
> > admission control will be disabled for everybody just because there is a
> > handful of 32-bit programs which may get executed. I understand that it
> > also means that RT throttling would be disabled.
>
> Completely understand your perplexity. But how can the kernel still give
> guarantees to "pure" 64-bit applications if there are 32-bit
> applications around that essentially broke admission control when they
> were restricted to a subset of cores?
>
> > Allowing the execve() to continue with a warning is very similar to the
> > case in which all the 64-bit CPUs are hot-unplugged at the point of
> > execve(), and this is much closer to the illusion that this patch series
> > intends to provide.
>
> So, for hotplug we currently have a check that would make hotplug
> operations fail if removing a CPU would mean not enough bandwidth to run
> the currently admitted set of DEADLINE tasks.

Aha, wasn't aware. Any pointers to that check for my education?

> > So, personally speaking, I would prefer the behaviour where we refuse to
> > admit 32-bit tasks via sched_setattr() if the root domain contains
> > 64-bit CPUs, but we _don't_ fail execve() of a 32-bit program from a
> > 64-bit deadline task.
>
> OK, this is interesting and I guess a very valid alternative. That would
> force users to create exclusive domains for 32-bit tasks, right?

FWIW this is not practical at all for our use cases: the implications of
splitting the system into independent root domains are far too significant
for us to be able to recommend that. Disabling AC, OTOH, sounds simple
enough. The RT throttling part is the only 'worrying' part, but even
that may not be the end of the world.

Thanks!
Quentin

2021-05-21 20:10:12

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On 21/05/21 08:15, Quentin Perret wrote:
> On Friday 21 May 2021 at 07:25:51 (+0200), Juri Lelli wrote:
> > On 20/05/21 19:01, Will Deacon wrote:
> > > On Thu, May 20, 2021 at 02:38:55PM +0200, Daniel Bristot de Oliveira wrote:
> > > > On 5/20/21 12:33 PM, Quentin Perret wrote:
> > > > > On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
> > > > >> Ok, thanks for the insight. In which case, I'll go with what we discussed:
> > > > >> require admission control to be disabled for sched_setattr() but allow
> > > > >> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
> > > > >> is probably similar to CPU hotplug?).
> > > > >
> > > > > Still not sure that we can let execve go through ... It will break AC
> > > > > all the same, so it should probably fail as well if AC is on IMO
> > > > >
> > > >
> > > > If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
> > > > the admission control needs to be re-executed, and it could fail. So I see this
> > > > operation equivalent to sched_setaffinity(). This will likely be true for future
> > > > schedulers that will allow arbitrary affinities (AC should run on affinity
> > > > change, and could fail).
> > > >
> > > > I would vote with Juri: "I'd go with fail hard if AC is on, let it
> > > > pass if AC is off (supposedly the user knows what to do)," (also hope nobody
> > > > complains until we add better support for affinity, and use this as a motivation
> > > > to get back on this front).
> > >
> > > I can have a go at implementing it, but I don't think it's a great solution
> > > and here's why:
> > >
> > > Failing an execve() is _very_ likely to be fatal to the application. It's
> > > also very likely that the task calling execve() doesn't know whether the
> > > program it's trying to execute is 32-bit or not. Consequently, if we go
> > > with failing execve() then all that will happen is that people will disable
> > > admission control altogether.
>
> Right, but only on these dumb 32bit asymmetric systems, and only if we
> care about running 32bits deadline tasks -- which I seriously doubt for
> the Android use-case.
>
> Note that running deadline tasks is also a privileged operation, it
> can't be done by random apps.
>
> > > That has a negative impact on "pure" 64-bit
> > > applications and so I think we end up with the tail wagging the dog because
> > > admission control will be disabled for everybody just because there is a
> > > handful of 32-bit programs which may get executed. I understand that it
> > > also means that RT throttling would be disabled.
> >
> > Completely understand your perplexity. But how can the kernel still give
> > guarantees to "pure" 64-bit applications if there are 32-bit
> > applications around that essentially broke admission control when they
> > were restricted to a subset of cores?
> >
> > > Allowing the execve() to continue with a warning is very similar to the
> > > case in which all the 64-bit CPUs are hot-unplugged at the point of
> > > execve(), and this is much closer to the illusion that this patch series
> > > intends to provide.
> >
> > So, for hotplug we currently have a check that would make hotplug
> > operations fail if removing a CPU would mean not enough bandwidth to run
> > the currently admitted set of DEADLINE tasks.
>
> Aha, wasn't aware. Any pointers to that check for my education?

Hotplug ends up calling dl_cpu_busy() (after the cpu being hotplugged out
got removed), IIRC. So, if that fails, the operation is undone.
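
For readers not familiar with that path, the check has roughly this shape (a
paraphrased sketch of the ~v5.13 hotplug path, not the exact kernel source):

/*
 * Rough sketch, not verbatim kernel code: when a CPU is taken down, the
 * cpuset/hotplug path asks the deadline code whether the remaining CPUs
 * still provide enough bandwidth for the already-admitted DL tasks.
 * (Abridged; the real function also handles the suspend/resume case.)
 */
static int cpuset_cpu_inactive(unsigned int cpu)
{
	int ret = dl_cpu_busy(cpu);	/* would the admitted DL bandwidth overflow? */

	if (ret)
		return ret;		/* fail the hotplug operation, which is then undone */

	cpuset_update_active_cpus();
	return 0;
}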

> > > So, personally speaking, I would prefer the behaviour where we refuse to
> > > admit 32-bit tasks via sched_setattr() if the root domain contains
> > > 64-bit CPUs, but we _don't_ fail execve() of a 32-bit program from a
> > > 64-bit deadline task.
> >
> > OK, this is interesting and I guess a very valid alternative. That would
> > force users to create exclusive domains for 32-bit tasks, right?
>
> FWIW this is not practical at all for our use-cases, the implications of
> splitting the system in independent root-domains are way too important
> for us to be able to recommend that. Disabling AC, OTOH, sounds simple
> enough. The RT throttling part is the only 'worrying' part, but even
> that may not be the end of the world.

Note that RT throttling (SCHED_{FIFO,RR}) is not handled by DEADLINE
servers yet.

Best,
Juri

2021-05-21 20:14:19

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On 20/05/2021 20:03, Will Deacon wrote:
> Hi Dietmar,
>
> On Thu, May 20, 2021 at 07:55:27PM +0200, Dietmar Eggemann wrote:
>> On 20/05/2021 18:00, Daniel Bristot de Oliveira wrote:
>>> On 5/20/21 5:06 PM, Dietmar Eggemann wrote:
>>>> (1) # chrt -d -T 5000000 -P 16666666 0 ./32bit_app
>>>>
>>>> (2) # ./32bit_app &
>>>>
>>>> # chrt -d -T 5000000 -P 16666666 -p 0 pid_of(32bit_app)
>>>>
>>>>
>>>> Wouldn't the behaviour of (1) and (2) be different w/o this patch?
>>>>
>>>> In (1) __sched_setscheduler() happens before execve so it operates on
>>>> p->cpus_ptr equal span.
>>>>
>>>> In (2) span != p->cpus_ptr so DL AC will fail.
>>>>
>>>
>>> As far as I got, the case (1) would be split in two steps:
>>>
>>> - __sched_setscheduler() will work, then
>>> - execv() would fail because (span != p->cpus_ptr)
>>>
>>> So... at the end, both (1) and (2) would result in a failure...
>>>
>>> am I missing something?
>>
>> Not sure. Reading this thread I was under the assumption that the only
>> change would be the drop of this patch. But I assume there is also this
>> 'if DL AC is on then let sched_setattr() fail for this 32bit task'.
>>
>> IMHO, the current patch-stack w/o this patch should let (1) succeed with
>> DL AC.
>
> That's what I'm proposing, yes, but others (including Daniel) prefer to
> fail the execve(). See my other reply just now for a summary [1].

[...]

Thanks, Will ... Now I understand. Or at least I think I do ;-)

> [1] https://lore.kernel.org/lkml/20210520180138.GA10523@willie-the-truck/T/#u

2021-05-21 20:14:29

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v6 02/21] arm64: Allow mismatched 32-bit EL0 support

On Tue, May 18, 2021 at 10:47:06AM +0100, Will Deacon wrote:
> +static int enable_mismatched_32bit_el0(unsigned int cpu)
> +{
> + struct cpuinfo_arm64 *info = &per_cpu(cpu_data, cpu);
> + bool cpu_32bit = id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0);
> +
> + if (cpu_32bit) {
> + cpumask_set_cpu(cpu, cpu_32bit_el0_mask);
> + static_branch_enable_cpuslocked(&arm64_mismatched_32bit_el0);

It may be worth only calling static_branch_enable_cpuslocked() if not
already set, in case you try this on a system with lots of CPUs.

--
Catalin

2021-05-21 20:14:50

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On 21/05/2021 12:37, Will Deacon wrote:
> On Fri, May 21, 2021 at 10:39:32AM +0200, Juri Lelli wrote:
>> On 21/05/21 08:15, Quentin Perret wrote:
>>> On Friday 21 May 2021 at 07:25:51 (+0200), Juri Lelli wrote:
>>>> On 20/05/21 19:01, Will Deacon wrote:
>>>>> On Thu, May 20, 2021 at 02:38:55PM +0200, Daniel Bristot de Oliveira wrote:
>>>>>> On 5/20/21 12:33 PM, Quentin Perret wrote:
>>>>>>> On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
>>>>>>>> Ok, thanks for the insight. In which case, I'll go with what we discussed:
>>>>>>>> require admission control to be disabled for sched_setattr() but allow
>>>>>>>> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
>>>>>>>> is probably similar to CPU hotplug?).
>>>>>>>
>>>>>>> Still not sure that we can let execve go through ... It will break AC
>>>>>>> all the same, so it should probably fail as well if AC is on IMO
>>>>>>>
>>>>>>
>>>>>> If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
>>>>>> the admission control needs to be re-executed, and it could fail. So I see this
>>>>>> operation equivalent to sched_setaffinity(). This will likely be true for future
>>>>>> schedulers that will allow arbitrary affinities (AC should run on affinity
>>>>>> change, and could fail).
>>>>>>
>>>>>> I would vote with Juri: "I'd go with fail hard if AC is on, let it
>>>>>> pass if AC is off (supposedly the user knows what to do)," (also hope nobody
>>>>>> complains until we add better support for affinity, and use this as a motivation
>>>>>> to get back on this front).
>>>>>
>>>>> I can have a go at implementing it, but I don't think it's a great solution
>>>>> and here's why:
>>>>>
>>>>> Failing an execve() is _very_ likely to be fatal to the application. It's
>>>>> also very likely that the task calling execve() doesn't know whether the
>>>>> program it's trying to execute is 32-bit or not. Consequently, if we go
>>>>> with failing execve() then all that will happen is that people will disable
>>>>> admission control altogether.
>>>
>>> Right, but only on these dumb 32bit asymmetric systems, and only if we
>>> care about running 32bits deadline tasks -- which I seriously doubt for
>>> the Android use-case.
>>>
>>> Note that running deadline tasks is also a privileged operation, it
>>> can't be done by random apps.
>>>
>>>>> That has a negative impact on "pure" 64-bit
>>>>> applications and so I think we end up with the tail wagging the dog because
>>>>> admission control will be disabled for everybody just because there is a
>>>>> handful of 32-bit programs which may get executed. I understand that it
>>>>> also means that RT throttling would be disabled.
>>>>
>>>> Completely understand your perplexity. But how can the kernel still give
>>>> guarantees to "pure" 64-bit applications if there are 32-bit
>>>> applications around that essentially broke admission control when they
>>>> were restricted to a subset of cores?
>>>>
>>>>> Allowing the execve() to continue with a warning is very similar to the
>>>>> case in which all the 64-bit CPUs are hot-unplugged at the point of
>>>>> execve(), and this is much closer to the illusion that this patch series
>>>>> intends to provide.
>>>>
>>>> So, for hotplug we currently have a check that would make hotplug
>>>> operations fail if removing a CPU would mean not enough bandwidth to run
>>>> the currently admitted set of DEADLINE tasks.
>>>
>>> Aha, wasn't aware. Any pointers to that check for my education?
>>
>> Hotplug ends up calling dl_cpu_busy() (after the cpu being hotplugged out
>> got removed), IIRC. So, if that fails, the operation is undone.
>
> Interesting, thanks. Thinking about this some more, it strikes me that with
> these silly asymmetric systems there could be an interesting additional
> problem with hotplug and deadline tasks. Imagine the following sequence of
> events:
>
> 1. All online CPUs are 32-bit-capable
> 2. sched_setattr() admits a 32-bit deadline task
> 3. A 64-bit-only CPU is onlined
> 4. Some of the 32-bit-capable CPUs are offlined
>
> I wonder if we can get into a situation where we think we have enough
> bandwidth available, but in reality the 32-bit task is in trouble because
> it can't make use of the 64-bit-only CPU.
>
> If so, then it seems to me that admission control is really just
> "best-effort" for 32-bit deadline tasks on these systems because it's based
> on a snapshot in time of the available resources.

IMHO DL AC is per root domain (rd). So if we have e.g. an 8-CPU system
with aarch32_el0 == [0-3], then we would need 2 exclusive cpusets ([0-3]
and [4-7]) to admit 32-bit DL tasks into [0-3] (i.e. to pass the `if
(!cpumask_subset(span, p->cpus_ptr)) ...` test in __sched_setscheduler()).

Trying to admit too many 32-bit DL tasks, or trying to hot-unplug one of
CPUs [0-3], would lead to `Device or resource busy` in case the rd
bandwidth wouldn't be sufficient anymore for the set of admitted tasks.
But the [0-3] DL AC wouldn't care about hotplug on CPUs [4-7].
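
For reference, the test mentioned above has roughly this shape (a paraphrase
of the DL admission check in __sched_setscheduler(); the wrapper function
below is made up purely for illustration):

/*
 * Paraphrased sketch of the admission-control affinity test performed by
 * __sched_setscheduler(); the helper name is invented for illustration.
 */
static bool dl_affinity_allows_admission(struct task_struct *p, struct rq *rq)
{
	const struct cpumask *span = rq->rd->span;

	/*
	 * A deadline task is only admitted if it can run on every CPU of
	 * its root domain (and the rd actually has some DL bandwidth),
	 * otherwise the per-rd bandwidth accounting is meaningless.
	 */
	return cpumask_subset(span, p->cpus_ptr) && rq->rd->dl_bw.bw != 0;
}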

2021-05-21 20:15:40

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v6 04/21] arm64: Kill 32-bit applications scheduled on 64-bit-only CPUs

On Tue, May 18, 2021 at 10:47:08AM +0100, Will Deacon wrote:
> Scheduling a 32-bit application on a 64-bit-only CPU is a bad idea.
>
> Ensure that 32-bit applications always take the slow-path when returning
> to userspace on a system with mismatched support at EL0, so that we can
> avoid trying to run on a 64-bit-only CPU and force a SIGKILL instead.
>
> Signed-off-by: Will Deacon <[email protected]>

Reviewed-by: Catalin Marinas <[email protected]>

by Daniel Bristot de Oliveira

Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On 5/21/21 12:37 PM, Will Deacon wrote:
> Interesting, thanks. Thinking about this some more, it strikes me that with
> these silly asymmetric systems there could be an interesting additional
> problem with hotplug and deadline tasks. Imagine the following sequence of
> events:
>
> 1. All online CPUs are 32-bit-capable
> 2. sched_setattr() admits a 32-bit deadline task
> 3. A 64-bit-only CPU is onlined

At point 3, the global scheduler assumption is broken. For instance, in a
system with four CPUs and five ready 32-bit tasks, when the fifth CPU is
added, the work-conserving rule is violated because the five highest-priority
threads are not all running (only four are) :-(.

So, at this point, for us to keep the current behavior, the addition should
be... blocked? :-((

> 4. Some of the 32-bit-capable CPUs are offlined

Assuming that point 3 does not happen (i.e., all CPUs are 32-bit-capable), at
this point we will see increased pressure on the 32-bit-capable CPUs.

This can also create bad effects for 64-bit tasks, as the "contended" 32-bit
tasks will still be "queued" at a point in time where they were supposed to be
done already (leaving that time for the 64-bit tasks).

> I wonder if we can get into a situation where we think we have enough
> bandwidth available, but in reality the 32-bit task is in trouble because
> it can't make use of the 64-bit-only CPU.

I would have to think more, but there might be a case where these contended
32-bit tasks could cause deadline misses for the 64-bit tasks too.

> If so, then it seems to me that admission control is really just
> "best-effort" for 32-bit deadline tasks on these systems because it's based
> on a snapshot in time of the available resources.

The admission test as it is now is "best-effort" in the sense that it allows a
workload higher than it could handle (it is a necessary, but not sufficient, AC).
But it should not be considered "best-effort" because of violations of the
work-conserving property as a result of arbitrary affinities among tasks.
Overall, we have been trying to close any "exception left" to this latter case.

I know it is a complex situation; I am just trying to illustrate our concerns
because, in the near future, we might have a scheduler that handles arbitrary
affinities correctly. But that might require us to stick to an AC. The AC is
something precious for us.

(yeah, SP would not face this problem... now that I made progress on RV I can
get back to it).

-- Daniel

2021-05-21 20:16:24

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On Friday 21 May 2021 at 13:23:55 (+0200), Dietmar Eggemann wrote:
> On 21/05/2021 12:37, Will Deacon wrote:
> > On Fri, May 21, 2021 at 10:39:32AM +0200, Juri Lelli wrote:
> >> On 21/05/21 08:15, Quentin Perret wrote:
> >>> On Friday 21 May 2021 at 07:25:51 (+0200), Juri Lelli wrote:
> >>>> On 20/05/21 19:01, Will Deacon wrote:
> >>>>> On Thu, May 20, 2021 at 02:38:55PM +0200, Daniel Bristot de Oliveira wrote:
> >>>>>> On 5/20/21 12:33 PM, Quentin Perret wrote:
> >>>>>>> On Thursday 20 May 2021 at 11:16:41 (+0100), Will Deacon wrote:
> >>>>>>>> Ok, thanks for the insight. In which case, I'll go with what we discussed:
> >>>>>>>> require admission control to be disabled for sched_setattr() but allow
> >>>>>>>> execve() to a 32-bit task from a 64-bit deadline task with a warning (this
> >>>>>>>> is probably similar to CPU hotplug?).
> >>>>>>>
> >>>>>>> Still not sure that we can let execve go through ... It will break AC
> >>>>>>> all the same, so it should probably fail as well if AC is on IMO
> >>>>>>>
> >>>>>>
> >>>>>> If the cpumask of the 32-bit task is != of the 64-bit task that is executing it,
> >>>>>> the admission control needs to be re-executed, and it could fail. So I see this
> >>>>>> operation equivalent to sched_setaffinity(). This will likely be true for future
> >>>>>> schedulers that will allow arbitrary affinities (AC should run on affinity
> >>>>>> change, and could fail).
> >>>>>>
> >>>>>> I would vote with Juri: "I'd go with fail hard if AC is on, let it
> >>>>>> pass if AC is off (supposedly the user knows what to do)," (also hope nobody
> >>>>>> complains until we add better support for affinity, and use this as a motivation
> >>>>>> to get back on this front).
> >>>>>
> >>>>> I can have a go at implementing it, but I don't think it's a great solution
> >>>>> and here's why:
> >>>>>
> >>>>> Failing an execve() is _very_ likely to be fatal to the application. It's
> >>>>> also very likely that the task calling execve() doesn't know whether the
> >>>>> program it's trying to execute is 32-bit or not. Consequently, if we go
> >>>>> with failing execve() then all that will happen is that people will disable
> >>>>> admission control altogether.
> >>>
> >>> Right, but only on these dumb 32bit asymmetric systems, and only if we
> >>> care about running 32bits deadline tasks -- which I seriously doubt for
> >>> the Android use-case.
> >>>
> >>> Note that running deadline tasks is also a privileged operation, it
> >>> can't be done by random apps.
> >>>
> >>>>> That has a negative impact on "pure" 64-bit
> >>>>> applications and so I think we end up with the tail wagging the dog because
> >>>>> admission control will be disabled for everybody just because there is a
> >>>>> handful of 32-bit programs which may get executed. I understand that it
> >>>>> also means that RT throttling would be disabled.
> >>>>
> >>>> Completely understand your perplexity. But how can the kernel still give
> >>>> guarantees to "pure" 64-bit applications if there are 32-bit
> >>>> applications around that essentially broke admission control when they
> >>>> were restricted to a subset of cores?
> >>>>
> >>>>> Allowing the execve() to continue with a warning is very similar to the
> >>>>> case in which all the 64-bit CPUs are hot-unplugged at the point of
> >>>>> execve(), and this is much closer to the illusion that this patch series
> >>>>> intends to provide.
> >>>>
> >>>> So, for hotplug we currently have a check that would make hotplug
> >>>> operations fail if removing a CPU would mean not enough bandwidth to run
> >>>> the currently admitted set of DEADLINE tasks.
> >>>
> >>> Aha, wasn't aware. Any pointers to that check for my education?
> >>
> >> Hotplug ends up calling dl_cpu_busy() (after the cpu being hotplugged out
> >> got removed), IIRC. So, if that fails, the operation is undone.
> >
> > Interesting, thanks. Thinking about this some more, it strikes me that with
> > these silly asymmetric systems there could be an interesting additional
> > problem with hotplug and deadline tasks. Imagine the following sequence of
> > events:
> >
> > 1. All online CPUs are 32-bit-capable
> > 2. sched_setattr() admits a 32-bit deadline task
> > 3. A 64-bit-only CPU is onlined
> > 4. Some of the 32-bit-capable CPUs are offlined
> >
> > I wonder if we can get into a situation where we think we have enough
> > bandwidth available, but in reality the 32-bit task is in trouble because
> > it can't make use of the 64-bit-only CPU.
> >
> > If so, then it seems to me that admission control is really just
> > "best-effort" for 32-bit deadline tasks on these systems because it's based
> > on a snapshot in time of the available resources.
>
> IMHO DL AC is per root domain (rd). So if we have e.g. an 8 CPU system
> with aarch32_el0 eq. [0-3] then we would need 2 exclusive cpusets ([0-3]
> and [4-7]) to admit 32-bit DL tasks into [0-3] (i.e. to pass the `if
> (!cpumask_subset(span, p->cpus_ptr) ...` test in __sched_setscheduler().
>
> Trying to admit too many 32-bit DL tasks or trying to hp out a CPU[0-3]
> would lead to `Device or resource busy` in case the rd bw wouldn't be
> sufficient anymore for the set of admitted tasks. But the [0-3] DL AC
> wouldn't care about hp on CPU[4-7].

So I think Will has a point since, IIRC, the root domains get rebuilt
during hotplug. So you can imagine a case with a single root domain, but
CPUs 4-7 are offline. In this case, sched_setattr() will happily promote
a task to DL as long as its affinity mask is a superset of the rd span,
but things may get ugly when CPUs are plugged back in later on.

This looks like an existing bug though. I just tried the following on a
system with 4 CPUs:

// Create a task affined to CPU [0-2]
> while true; do echo "Hi" > /dev/null; done &
[1] 560
> mypid=$!
> taskset -p 7 $mypid
pid 560's current affinity mask: f
pid 560's new affinity mask: 7

// Try to move it DL, this should fail because of the affinity
> chrt -d -T 5000000 -P 16666666 -p 0 $mypid
chrt: failed to set pid 560's policy: Operation not permitted

// Offline CPU 3, so the rd now covers CPUs 0-2 only
> echo 0 > /sys/devices/system/cpu/cpu3/online
[ 400.843830] CPU3: shutdown
[ 400.844100] psci: CPU3 killed (polled 0 ms)

// Try to admit the task again, which now succeeds
> chrt -d -T 5000000 -P 16666666 -p 0 $mypid

// Plug CPU3 back online
> echo 1 > /sys/devices/system/cpu/cpu3/online
[ 408.819337] Detected PIPT I-cache on CPU3
[ 408.819642] GICv3: CPU3: found redistributor 3 region 0:0x0000000008100000
[ 408.820165] CPU3: Booted secondary processor 0x0000000003 [0x410fd083]

I don't see any easy way to fix this w/o iterating over all deadline
tasks in the rd when hotplugging a CPU back on, and blocking the hotplug
operation if it'll cause affinity issues. Urgh.

2021-05-21 20:19:22

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On 21/05/21 13:02, Quentin Perret wrote:

...

> So I think Will has a point since, IIRC, the root domains get rebuilt
> during hotplug. So you can imagine a case with a single root domain, but
> CPUs 4-7 are offline. In this case, sched_setattr() will happily promote
> a task to DL as long as its affinity mask is a superset of the rd span,
> but things may get ugly when CPUs are plugged back in later on.
>
> This looks like an existing bug though. I just tried the following on a
> system with 4 CPUs:
>
> // Create a task affined to CPU [0-2]
> > while true; do echo "Hi" > /dev/null; done &
> [1] 560
> > mypid=$!
> > taskset -p 7 $mypid
> pid 560's current affinity mask: f
> pid 560's new affinity mask: 7
>
> // Try to move it DL, this should fail because of the affinity
> > chrt -d -T 5000000 -P 16666666 -p 0 $mypid
> chrt: failed to set pid 560's policy: Operation not permitted
>
> // Offline CPU 3, so the rd now covers CPUs 0-2 only
> > echo 0 > /sys/devices/system/cpu/cpu3/online
> [ 400.843830] CPU3: shutdown
> [ 400.844100] psci: CPU3 killed (polled 0 ms)
>
> // Try to admit the task again, which now succeeds
> > chrt -d -T 5000000 -P 16666666 -p 0 $mypid
>
> // Plug CPU3 back online
> > echo 1 > /sys/devices/system/cpu/cpu3/online
> [ 408.819337] Detected PIPT I-cache on CPU3
> [ 408.819642] GICv3: CPU3: found redistributor 3 region 0:0x0000000008100000
> [ 408.820165] CPU3: Booted secondary processor 0x0000000003 [0x410fd083]
>
> I don't see any easy way to fix this w/o iterating over all deadline
> tasks in the rd when hotplugging a CPU back on, and blocking the hotplug
> operation if it'll cause affinity issues. Urgh.
>

Yeah this looks like a plain existing bug, joy. :)

We fixed a few around AC lately, but I guess work wasn't complete.

Thanks,
Juri

2021-05-21 20:20:34

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v6 02/21] arm64: Allow mismatched 32-bit EL0 support

On 05/18/21 10:47, Will Deacon wrote:
> When confronted with a mixture of CPUs, some of which support 32-bit
> applications and others which don't, we quite sensibly treat the system
> as 64-bit only for userspace and prevent execve() of 32-bit binaries.
>
> Unfortunately, some crazy folks have decided to build systems like this
> with the intention of running 32-bit applications, so relax our
> sanitisation logic to continue to advertise 32-bit support to userspace
> on these systems and track the real 32-bit capable cores in a cpumask
> instead. For now, the default behaviour remains but will be tied to
> a command-line option in a later patch.
>
> Signed-off-by: Will Deacon <[email protected]>
> ---
> arch/arm64/include/asm/cpucaps.h | 3 +-

Heads up. I just tried to apply this on 5.13-rc2 and it failed because cpucaps.h
was removed; it's autogenerated now.

See commit 0c6c2d3615ef ("arm64: Generate cpucaps.h").

Cheers

--
Qais Yousef

2021-05-21 20:21:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 06/21] sched: Introduce task_cpu_possible_mask() to limit fallback rq selection

On Tue, May 18, 2021 at 10:47:10AM +0100, Will Deacon wrote:
> diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
> index 03dee12d2b61..bc4ac3c525e6 100644
> --- a/include/linux/mmu_context.h
> +++ b/include/linux/mmu_context.h
> @@ -14,4 +14,12 @@
> static inline void leave_mm(int cpu) { }
> #endif
>
> +/*
> + * CPUs that are capable of running task @p. By default, we assume a sane,
> + * homogeneous system. Must contain at least one active CPU.
> + */
> +#ifndef task_cpu_possible_mask
> +# define task_cpu_possible_mask(p) cpu_possible_mask
> +#endif

#ifndef task_cpu_possible_mask
# define task_cpu_possible_mask(p) cpu_possible_mask
# define task_cpu_possible(cpu, p) true
#else
# define task_cpu_possible(cpu, p) cpumask_test_cpu((cpu), task_cpu_possible_mask(p))
#endif

> +
> #endif
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5226cc26a095..482f7fdca0e8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1813,8 +1813,11 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
> return cpu_online(cpu);
>
> /* Non kernel threads are not allowed during either online or offline. */
> if (!(p->flags & PF_KTHREAD))
> - return cpu_active(cpu);
+ return cpu_active(cpu) && task_cpu_possible(cpu, p);

> /* KTHREAD_IS_PER_CPU is always allowed. */
> if (kthread_is_per_cpu(p))

Would something like that make sense?

2021-05-21 20:23:21

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v6 11/21] sched: Split the guts of sched_setaffinity() into a helper function

On 05/18/21 10:47, Will Deacon wrote:
> In preparation for replaying user affinity requests using a saved mask,
> split sched_setaffinity() up so that the initial task lookup and
> security checks are only performed when the request is coming directly
> from userspace.
>
> Signed-off-by: Will Deacon <[email protected]>
> ---
> kernel/sched/core.c | 110 +++++++++++++++++++++++---------------------
> 1 file changed, 58 insertions(+), 52 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9512623d5a60..808bbe669a6d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6788,9 +6788,61 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
> return retval;
> }
>
> -long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
> +static int
> +__sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
> {
> + int retval;
> cpumask_var_t cpus_allowed, new_mask;
> +
> + if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL))
> + return -ENOMEM;
> +
> + if (!alloc_cpumask_var(&new_mask, GFP_KERNEL))
> + return -ENOMEM;

Shouldn't we free cpus_allowed first?
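
(For illustration only, one way the second allocation failure could avoid
leaking the first mask; a sketch, not the actual follow-up change:)

	/* Sketch of the top of __sched_setaffinity() with the leak plugged: */
	if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL))
		return -ENOMEM;

	if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) {
		free_cpumask_var(cpus_allowed);	/* don't leak the first allocation */
		return -ENOMEM;
	}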

Cheers

--
Qais Yousef

> +
> + cpuset_cpus_allowed(p, cpus_allowed);
> + cpumask_and(new_mask, mask, cpus_allowed);
> +
> + /*
> + * Since bandwidth control happens on root_domain basis,
> + * if admission test is enabled, we only admit -deadline
> + * tasks allowed to run on all the CPUs in the task's
> + * root_domain.
> + */
> +#ifdef CONFIG_SMP
> + if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
> + rcu_read_lock();
> + if (!cpumask_subset(task_rq(p)->rd->span, new_mask)) {
> + retval = -EBUSY;
> + rcu_read_unlock();
> + goto out_free_masks;
> + }
> + rcu_read_unlock();
> + }
> +#endif
> +again:
> + retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
> + if (retval)
> + goto out_free_masks;
> +
> + cpuset_cpus_allowed(p, cpus_allowed);
> + if (!cpumask_subset(new_mask, cpus_allowed)) {
> + /*
> + * We must have raced with a concurrent cpuset update.
> + * Just reset the cpumask to the cpuset's cpus_allowed.
> + */
> + cpumask_copy(new_mask, cpus_allowed);
> + goto again;
> + }
> +
> +out_free_masks:
> + free_cpumask_var(new_mask);
> + free_cpumask_var(cpus_allowed);
> + return retval;
> +}
> +
> +long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
> +{
> struct task_struct *p;
> int retval;
>
> @@ -6810,68 +6862,22 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
> retval = -EINVAL;
> goto out_put_task;
> }
> - if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) {
> - retval = -ENOMEM;
> - goto out_put_task;
> - }
> - if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) {
> - retval = -ENOMEM;
> - goto out_free_cpus_allowed;
> - }
> - retval = -EPERM;
> +
> if (!check_same_owner(p)) {
> rcu_read_lock();
> if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
> rcu_read_unlock();
> - goto out_free_new_mask;
> + retval = -EPERM;
> + goto out_put_task;
> }
> rcu_read_unlock();
> }
>
> retval = security_task_setscheduler(p);
> if (retval)
> - goto out_free_new_mask;
> -
> -
> - cpuset_cpus_allowed(p, cpus_allowed);
> - cpumask_and(new_mask, in_mask, cpus_allowed);
> -
> - /*
> - * Since bandwidth control happens on root_domain basis,
> - * if admission test is enabled, we only admit -deadline
> - * tasks allowed to run on all the CPUs in the task's
> - * root_domain.
> - */
> -#ifdef CONFIG_SMP
> - if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
> - rcu_read_lock();
> - if (!cpumask_subset(task_rq(p)->rd->span, new_mask)) {
> - retval = -EBUSY;
> - rcu_read_unlock();
> - goto out_free_new_mask;
> - }
> - rcu_read_unlock();
> - }
> -#endif
> -again:
> - retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
> + goto out_put_task;
>
> - if (!retval) {
> - cpuset_cpus_allowed(p, cpus_allowed);
> - if (!cpumask_subset(new_mask, cpus_allowed)) {
> - /*
> - * We must have raced with a concurrent cpuset
> - * update. Just reset the cpus_allowed to the
> - * cpuset's cpus_allowed
> - */
> - cpumask_copy(new_mask, cpus_allowed);
> - goto again;
> - }
> - }
> -out_free_new_mask:
> - free_cpumask_var(new_mask);
> -out_free_cpus_allowed:
> - free_cpumask_var(cpus_allowed);
> + retval = __sched_setaffinity(p, in_mask);
> out_put_task:
> put_task_struct(p);
> return retval;
> --
> 2.31.1.751.gd2f1c929bd-goog
>

2021-05-21 20:23:33

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v6 12/21] sched: Allow task CPU affinity to be restricted on asymmetric systems

On 05/18/21 10:47, Will Deacon wrote:
> Asymmetric systems may not offer the same level of userspace ISA support
> across all CPUs, meaning that some applications cannot be executed by
> some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
> not feature support for 32-bit applications on both clusters.
>
> Although userspace can carefully manage the affinity masks for such
> tasks, one place where it is particularly problematic is execve()
> because the CPU on which the execve() is occurring may be incompatible
> with the new application image. In such a situation, it is desirable to
> restrict the affinity mask of the task and ensure that the new image is
> entered on a compatible CPU. From userspace's point of view, this looks
> the same as if the incompatible CPUs have been hotplugged off in the
> task's affinity mask. Similarly, if a subsequent execve() reverts to
> a compatible image, then the old affinity is restored if it is still
> valid.
>
> In preparation for restricting the affinity mask for compat tasks on
> arm64 systems without uniform support for 32-bit applications, introduce
> {force,relax}_compatible_cpus_allowed_ptr(), which respectively restrict
> and restore the affinity mask for a task based on the compatible CPUs.
>
> Reviewed-by: Quentin Perret <[email protected]>
> Signed-off-by: Will Deacon <[email protected]>
> ---
> include/linux/sched.h | 2 +
> kernel/sched/core.c | 165 ++++++++++++++++++++++++++++++++++++++----
> kernel/sched/sched.h | 1 +
> 3 files changed, 152 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index db32d4f7e5b3..91a6cfeae242 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1691,6 +1691,8 @@ extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new
> extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask);
> extern int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src, int node);
> extern void release_user_cpus_ptr(struct task_struct *p);
> +extern void force_compatible_cpus_allowed_ptr(struct task_struct *p);
> +extern void relax_compatible_cpus_allowed_ptr(struct task_struct *p);
> #else
> static inline void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
> {
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 808bbe669a6d..ba66bcf8e812 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2357,26 +2357,21 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
> }
>
> /*
> - * Change a given task's CPU affinity. Migrate the thread to a
> - * proper CPU and schedule it away if the CPU it's executing on
> - * is removed from the allowed bitmask.
> - *
> - * NOTE: the caller must have a valid reference to the task, the
> - * task must not exit() & deallocate itself prematurely. The
> - * call is not atomic; no spinlocks may be held.
> + * Called with both p->pi_lock and rq->lock held; drops both before returning.
> */
> -static int __set_cpus_allowed_ptr(struct task_struct *p,
> - const struct cpumask *new_mask,
> - u32 flags)
> +static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
> + const struct cpumask *new_mask,
> + u32 flags,
> + struct rq *rq,
> + struct rq_flags *rf)
> + __releases(rq->lock)
> + __releases(p->pi_lock)
> {
> const struct cpumask *cpu_valid_mask = cpu_active_mask;
> const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
> unsigned int dest_cpu;
> - struct rq_flags rf;
> - struct rq *rq;
> int ret = 0;
>
> - rq = task_rq_lock(p, &rf);
> update_rq_clock(rq);
>
> if (p->flags & PF_KTHREAD || is_migration_disabled(p)) {
> @@ -2430,20 +2425,158 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
>
> __do_set_cpus_allowed(p, new_mask, flags);
>
> - return affine_move_task(rq, p, &rf, dest_cpu, flags);
> + if (flags & SCA_USER)
> + release_user_cpus_ptr(p);

Why do we need to release the pointer here?

Doesn't this mean if a 32bit task requests to change its affinity, then we'll
lose this info and a subsequent execve() to a 64bit application means we won't
be able to restore the original mask?

ie:

p0-64bit
execve(32bit_app)
// p1-32bit created
p1-32bit.change_affinity()
release_user_cpus_ptr()
execve(64bit_app) // lost info about p0 affinity?

Hmm I think this helped me to get the answer. p1 changed its affinity, then
there's nothing to be inherited by a new execve(), so yes we no longer need
this info.

> +
> + return affine_move_task(rq, p, rf, dest_cpu, flags);
>
> out:
> - task_rq_unlock(rq, p, &rf);
> + task_rq_unlock(rq, p, rf);
>
> return ret;
> }

[...]

> +/*
> + * Change a given task's CPU affinity to the intersection of its current
> + * affinity mask and @subset_mask, writing the resulting mask to @new_mask
> + * and pointing @p->user_cpus_ptr to a copy of the old mask.
> + * If the resulting mask is empty, leave the affinity unchanged and return
> + * -EINVAL.
> + */
> +static int restrict_cpus_allowed_ptr(struct task_struct *p,
> + struct cpumask *new_mask,
> + const struct cpumask *subset_mask)
> +{
> + struct rq_flags rf;
> + struct rq *rq;
> + int err;
> + struct cpumask *user_mask = NULL;
> +
> + if (!p->user_cpus_ptr)
> + user_mask = kmalloc(cpumask_size(), GFP_KERNEL);
> +
> + rq = task_rq_lock(p, &rf);
> +
> + /*
> + * We're about to butcher the task affinity, so keep track of what
> + * the user asked for in case we're able to restore it later on.
> + */
> + if (user_mask) {
> + cpumask_copy(user_mask, p->cpus_ptr);
> + p->user_cpus_ptr = user_mask;
> + }
> +
> + /*
> + * Forcefully restricting the affinity of a deadline task is
> + * likely to cause problems, so fail and noisily override the
> + * mask entirely.
> + */
> + if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
> + err = -EPERM;
> + goto err_unlock;

kfree(user_mask) first?

> + }
> +
> + if (!cpumask_and(new_mask, &p->cpus_mask, subset_mask)) {
> + err = -EINVAL;
> + goto err_unlock;

ditto

> + }
> +
> + return __set_cpus_allowed_ptr_locked(p, new_mask, false, rq, &rf);
> +
> +err_unlock:
> + task_rq_unlock(rq, p, &rf);
> + return err;
> +}
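
(Purely illustrative, not the actual fix: a sketch in which user_mask is only
published via p->user_cpus_ptr once the checks that can fail have passed, so
the error paths can free it without leaving a dangling pointer:)

/*
 * Illustrative sketch only: perform the checks that can fail *before*
 * installing user_mask into p->user_cpus_ptr, so the error paths can
 * simply kfree() it.
 */
static int restrict_cpus_allowed_ptr(struct task_struct *p,
				     struct cpumask *new_mask,
				     const struct cpumask *subset_mask)
{
	struct cpumask *user_mask = NULL;
	struct rq_flags rf;
	struct rq *rq;
	int err;

	if (!p->user_cpus_ptr)
		user_mask = kmalloc(cpumask_size(), GFP_KERNEL);

	rq = task_rq_lock(p, &rf);

	/* Deadline tasks: refuse, as forcefully restricting them breaks AC. */
	if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
		err = -EPERM;
		goto err_unlock;
	}

	if (!cpumask_and(new_mask, &p->cpus_mask, subset_mask)) {
		err = -EINVAL;
		goto err_unlock;
	}

	/* Only now keep track of what the user originally asked for. */
	if (user_mask) {
		cpumask_copy(user_mask, p->cpus_ptr);
		p->user_cpus_ptr = user_mask;
	}

	return __set_cpus_allowed_ptr_locked(p, new_mask, 0, rq, &rf);

err_unlock:
	task_rq_unlock(rq, p, &rf);
	kfree(user_mask);	/* never published, so safe to free (kfree(NULL) is a no-op) */
	return err;
}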

Thanks

--
Qais Yousef

2021-05-21 20:24:15

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v6 08/21] cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()

On 05/18/21 10:47, Will Deacon wrote:
> Asymmetric systems may not offer the same level of userspace ISA support
> across all CPUs, meaning that some applications cannot be executed by
> some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
> not feature support for 32-bit applications on both clusters.
>
> Modify guarantee_online_cpus() to take task_cpu_possible_mask() into
> account when trying to find a suitable set of online CPUs for a given
> task. This will avoid passing an invalid mask to set_cpus_allowed_ptr()
> during ->attach() and will subsequently allow the cpuset hierarchy to be
> taken into account when forcefully overriding the affinity mask for a
> task which requires migration to a compatible CPU.
>
> Cc: Li Zefan <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Signed-off-by: Will Deacon <[email protected]>
> ---
> include/linux/cpuset.h | 2 +-
> kernel/cgroup/cpuset.c | 33 +++++++++++++++++++--------------
> 2 files changed, 20 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index ed6ec677dd6b..414a8e694413 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -185,7 +185,7 @@ static inline void cpuset_read_unlock(void) { }
> static inline void cpuset_cpus_allowed(struct task_struct *p,
> struct cpumask *mask)
> {
> - cpumask_copy(mask, cpu_possible_mask);
> + cpumask_copy(mask, task_cpu_possible_mask(p));
> }
>
> static inline void cpuset_cpus_allowed_fallback(struct task_struct *p)
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 8c799260a4a2..b532a5333ff9 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -372,18 +372,26 @@ static inline bool is_in_v2_mode(void)
> }
>
> /*
> - * Return in pmask the portion of a cpusets's cpus_allowed that
> - * are online. If none are online, walk up the cpuset hierarchy
> - * until we find one that does have some online cpus.
> + * Return in pmask the portion of a task's cpusets's cpus_allowed that
> + * are online and are capable of running the task. If none are found,
> + * walk up the cpuset hierarchy until we find one that does have some
> + * appropriate cpus.
> *
> * One way or another, we guarantee to return some non-empty subset
> * of cpu_online_mask.
> *
> * Call with callback_lock or cpuset_mutex held.
> */
> -static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask)
> +static void guarantee_online_cpus(struct task_struct *tsk,
> + struct cpumask *pmask)
> {
> - while (!cpumask_intersects(cs->effective_cpus, cpu_online_mask)) {
> + struct cpuset *cs = task_cs(tsk);

task_cs() requires rcu_read_lock(), but I can't see how the lock is obtained
on the cpuset_attach() path; did I miss it? Running with lockdep should emit
a suspicious RCU usage warning.

Maybe it makes more sense to move the rcu_read_lock() inside the function now
that it calls task_cs()?
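
(A minimal sketch of that second option, purely illustrative and based on the
hunk quoted above:)

/*
 * Illustrative sketch: take the RCU read lock inside the helper so the
 * task_cs() dereference is always covered, regardless of the caller.
 */
static void guarantee_online_cpus(struct task_struct *tsk,
				  struct cpumask *pmask)
{
	const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
	struct cpuset *cs;

	if (WARN_ON(!cpumask_and(pmask, possible_mask, cpu_online_mask)))
		cpumask_copy(pmask, cpu_online_mask);

	rcu_read_lock();
	cs = task_cs(tsk);

	while (!cpumask_intersects(cs->effective_cpus, pmask)) {
		cs = parent_cs(cs);
		if (unlikely(!cs))
			goto out;	/* keep the online subset computed above */
	}
	cpumask_and(pmask, pmask, cs->effective_cpus);
out:
	rcu_read_unlock();
}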

Thanks

--
Qais Yousef

> + const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
> +
> + if (WARN_ON(!cpumask_and(pmask, possible_mask, cpu_online_mask)))
> + cpumask_copy(pmask, cpu_online_mask);
> +
> + while (!cpumask_intersects(cs->effective_cpus, pmask)) {
> cs = parent_cs(cs);
> if (unlikely(!cs)) {
> /*
> @@ -393,11 +401,10 @@ static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask)
> * cpuset's effective_cpus is on its way to be
> * identical to cpu_online_mask.
> */
> - cpumask_copy(pmask, cpu_online_mask);
> return;
> }
> }
> - cpumask_and(pmask, cs->effective_cpus, cpu_online_mask);
> + cpumask_and(pmask, pmask, cs->effective_cpus);
> }
>
> /*
> @@ -2199,15 +2206,13 @@ static void cpuset_attach(struct cgroup_taskset *tset)
>
> percpu_down_write(&cpuset_rwsem);
>
> - /* prepare for attach */
> - if (cs == &top_cpuset)
> - cpumask_copy(cpus_attach, cpu_possible_mask);
> - else
> - guarantee_online_cpus(cs, cpus_attach);
> -
> guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
>
> cgroup_taskset_for_each(task, css, tset) {
> + if (cs != &top_cpuset)
> + guarantee_online_cpus(task, cpus_attach);
> + else
> + cpumask_copy(cpus_attach, task_cpu_possible_mask(task));
> /*
> * can_attach beforehand should guarantee that this doesn't
> * fail. TODO: have a better way to handle failure here
> @@ -3303,7 +3308,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
>
> spin_lock_irqsave(&callback_lock, flags);
> rcu_read_lock();
> - guarantee_online_cpus(task_cs(tsk), pmask);
> + guarantee_online_cpus(tsk, pmask);
> rcu_read_unlock();
> spin_unlock_irqrestore(&callback_lock, flags);
> }
> --
> 2.31.1.751.gd2f1c929bd-goog
>

2021-05-21 20:24:21

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v6 07/21] cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1

On 05/18/21 10:47, Will Deacon wrote:
> If the scheduler cannot find an allowed CPU for a task,
> cpuset_cpus_allowed_fallback() will widen the affinity to cpu_possible_mask
> if cgroup v1 is in use.
>
> In preparation for allowing architectures to provide their own fallback
> mask, just return early if we're either using cgroup v1 or we're using
> cgroup v2 with a mask that contains invalid CPUs. This will allow
> select_fallback_rq() to figure out the mask by itself.
>
> Cc: Li Zefan <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Reviewed-by: Quentin Perret <[email protected]>
> Signed-off-by: Will Deacon <[email protected]>
> ---
> include/linux/cpuset.h | 1 +
> kernel/cgroup/cpuset.c | 12 ++++++++++--
> 2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 04c20de66afc..ed6ec677dd6b 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -15,6 +15,7 @@
> #include <linux/cpumask.h>
> #include <linux/nodemask.h>
> #include <linux/mm.h>
> +#include <linux/mmu_context.h>
> #include <linux/jump_label.h>
>
> #ifdef CONFIG_CPUSETS
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index a945504c0ae7..8c799260a4a2 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3322,9 +3322,17 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
>
> void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> {
> + const struct cpumask *cs_mask;
> + const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
> +
> rcu_read_lock();
> - do_set_cpus_allowed(tsk, is_in_v2_mode() ?
> - task_cs(tsk)->cpus_allowed : cpu_possible_mask);
> + cs_mask = task_cs(tsk)->cpus_allowed;
> +
> + if (!is_in_v2_mode() || !cpumask_subset(cs_mask, possible_mask))
> + goto unlock; /* select_fallback_rq will try harder */
> +
> + do_set_cpus_allowed(tsk, cs_mask);

Shouldn't we take the intersection with possible_mask like we discussed before?

https://lore.kernel.org/lkml/20201217145954.GA17881@willie-the-truck/

Thanks

--
Qais Yousef

> +unlock:
> rcu_read_unlock();
>
> /*
> --
> 2.31.1.751.gd2f1c929bd-goog
>

2021-05-21 20:25:01

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v6 00/21] Add support for 32-bit tasks on asymmetric AArch32 systems

Hi Will

On 05/18/21 10:47, Will Deacon wrote:
> Hi folks,
>
> This is the long-awaited v6 of these patches which I last posted at the
> end of last year:
>
> v1: https://lore.kernel.org/r/[email protected]
> v2: https://lore.kernel.org/r/[email protected]
> v3: https://lore.kernel.org/r/[email protected]
> v4: https://lore.kernel.org/r/[email protected]
> v5: https://lore.kernel.org/r/[email protected]
>
> There was also a nice LWN writeup in case you've forgotten what this is
> about:
>
> https://lwn.net/Articles/838339/
>
> It's taken me a while to get a v6 of this together, partly due to
> addressing the review feedback on v5, but also because this has now seen
> testing on real hardware which threw up some surprises in suspend/resume,
> SCHED_DEADLINE and compat hwcap reporting. Thanks to Quentin for helping
> me to debug those issues.
>
> The aim of this series is to allow 32-bit ARM applications to run on
> arm64 SoCs where not all of the CPUs support the 32-bit instruction set.
> Unfortunately, such SoCs are real and will continue to be productised
> over the next few years at least. I can assure you that I'm not just
> doing this for fun.
>
> Changes in v6 include:
>
> * Save/restore the affinity mask across execve() to 32-bit and back to
> 64-bit again.
>
> * Allow 32-bit deadline tasks, but skip straight to fallback path when
> determining new affinity mask on execve().
>
> * Fixed resume-from-suspend path when the resuming CPU is 64-bit-only
> by deferring wake-ups for 32-bit tasks until the secondary CPUs are
> back online.
>
> * Bug fixes (compat hwcaps, memory leak, cpuset fallback path).
>
> * Documentation for arm64. It's in the divisive .rst format, but please
> take a look anyway!
>
> I'm pretty happy with this now and it seems to do the right thing,
> although the new patches in this revision would certainly benefit from
> review. Series based on v5.13-rc1.

It's late Fri and I'm off next week (I'm starting to sense an omen here, it's
the 2nd or 3rd time the post syncs with my holiday), so a bit of a rushed
review but the series looks good to me. Feel free to stick my Reviewed-by for
the series, except patch 13 where I skipped it, given the few comments I had
are addressed.

I did test v5, but not this version. I think it has found better victims to
test it now :-P

Thanks!

--
Qais Yousef

2021-05-21 20:25:23

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v6 21/21] Documentation: arm64: describe asymmetric 32-bit support

On 05/18/21 10:47, Will Deacon wrote:
> Document support for running 32-bit tasks on asymmetric 32-bit systems
> and its impact on the user ABI when enabled.
>
> Signed-off-by: Will Deacon <[email protected]>
> ---
> .../admin-guide/kernel-parameters.txt | 3 +
> Documentation/arm64/asymmetric-32bit.rst | 149 ++++++++++++++++++
> Documentation/arm64/index.rst | 1 +
> 3 files changed, 153 insertions(+)
> create mode 100644 Documentation/arm64/asymmetric-32bit.rst
>

[...]

> +Cpusets
> +-------
> +
> +The affinity of a 32-bit task may include CPUs that are not explicitly
> +allowed by the cpuset to which it is attached. This can occur as a
> +result of the following two situations:
> +
> + - A 64-bit task attached to a cpuset which allows only 64-bit CPUs
> + executes a 32-bit program.
> +
> + - All of the 32-bit-capable CPUs allowed by a cpuset containing a
> + 32-bit task are offlined.
> +
> +In both of these cases, the new affinity is calculated according to step
> +(2) of the process described in `execve(2)`_ and the cpuset hierarchy is
> +unchanged irrespective of the cgroup version.

nit: Should we call out that we're breaking cpuset-v1 behavior? Don't feel
strongly about it.

Cheers

--
Qais Yousef

2021-05-21 20:40:45

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On 21/05/2021 16:04, Juri Lelli wrote:
> On 21/05/21 13:02, Quentin Perret wrote:
>
> ...
>
>> So I think Will has a point since, IIRC, the root domains get rebuilt
>> during hotplug. So you can imagine a case with a single root domain, but
>> CPUs 4-7 are offline. In this case, sched_setattr() will happily promote
>> a task to DL as long as its affinity mask is a superset of the rd span,
>> but things may get ugly when CPUs are plugged back in later on.

Yeah, that's true. I understand the condition that the task's affinity
mask has to be a superset of the rd span as meaning that DL AC (i.e. DL BW
management) can only work correctly if all admitted tasks can run on
every CPU in the rd.

Like you said, you can already today let tasks with a reduced affinity
mask pass the DL AC by hot-unplugging the other CPUs, and then trick the
DL AC by hotplugging those CPUs back in and admitting more DL tasks. But
these steps require a lot of effort to create this false setup.

The dedicated rd for 32-bit tasks matching `aarch32_el0` in an exclusive
cpuset environment seems to be a feasible approach to me. But I also don't
see an imminent use case for this.

>> This looks like an existing bug though. I just tried the following on a
>> system with 4 CPUs:
>>
>> // Create a task affined to CPU [0-2]
>> > while true; do echo "Hi" > /dev/null; done &
>> [1] 560
>> > mypid=$!
>> > taskset -p 7 $mypid
>> pid 560's current affinity mask: f
>> pid 560's new affinity mask: 7
>>
>> // Try to move it DL, this should fail because of the affinity
>> > chrt -d -T 5000000 -P 16666666 -p 0 $mypid
>> chrt: failed to set pid 560's policy: Operation not permitted
>>
>> // Offline CPU 3, so the rd now covers CPUs 0-2 only
>> > echo 0 > /sys/devices/system/cpu/cpu3/online
>> [ 400.843830] CPU3: shutdown
>> [ 400.844100] psci: CPU3 killed (polled 0 ms)
>>
>> // Try to admit the task again, which now succeeds
>> > chrt -d -T 5000000 -P 16666666 -p 0 $mypid
>>
>> // Plug CPU3 back online
>> > echo 1 > /sys/devices/system/cpu/cpu3/online
>> [ 408.819337] Detected PIPT I-cache on CPU3
>> [ 408.819642] GICv3: CPU3: found redistributor 3 region 0:0x0000000008100000
>> [ 408.820165] CPU3: Booted secondary processor 0x0000000003 [0x410fd083]
>>
>> I don't see any easy way to fix this w/o iterating over all deadline
>> tasks in the rd when hotplugging a CPU back on, and blocking the hotplug
>> operation if it'll cause affinity issues. Urgh.

Something like dl_cpu_busy() in cpuset_cpu_inactive(), but the other way
around in cpuset_cpu_active().

We iterate over all DL tasks in partition_and_rebuild_sched_domains() ->
rebuild_root_domains() -> update_tasks_root_domain() ->
dl_add_task_root_domain(struct task_struct *p) to recreate DL BW
information after CPU hotplug, but this happens asynchronously to
cpuset_cpu_active().
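
A purely hypothetical sketch of that idea (none of the helpers below exist in
the kernel; it only illustrates walking the admitted DL tasks of the rd when a
CPU comes back online, and real code would need proper locking around the
task_rq()/cpus_ptr accesses):

/*
 * Hypothetical sketch: before activating @cpu, verify that every admitted
 * deadline task in @rd would actually be able to run on it, otherwise the
 * per-rd bandwidth accounting no longer reflects reality.
 */
static int dl_check_affinity_on_cpu_online(struct root_domain *rd, int cpu)
{
	struct task_struct *g, *p;
	int ret = 0;

	rcu_read_lock();
	for_each_process_thread(g, p) {
		if (!task_has_dl_policy(p))
			continue;
		if (task_rq(p)->rd != rd)
			continue;
		/* A DL task that cannot use the new CPU breaks per-rd AC. */
		if (!cpumask_test_cpu(cpu, p->cpus_ptr)) {
			ret = -EBUSY;
			break;
		}
	}
	rcu_read_unlock();

	return ret;
}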

>
> Yeah this looks like a plain existing bug, joy. :)
>
> We fixed a few around AC lately, but I guess work wasn't complete.
>
> Thanks,
> Juri

2021-05-24 12:07:17

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 02/21] arm64: Allow mismatched 32-bit EL0 support

On Fri, May 21, 2021 at 11:25:23AM +0100, Catalin Marinas wrote:
> On Tue, May 18, 2021 at 10:47:06AM +0100, Will Deacon wrote:
> > +static bool has_32bit_el0(const struct arm64_cpu_capabilities *entry, int scope)
> > +{
> > + if (!has_cpuid_feature(entry, scope))
> > + return allow_mismatched_32bit_el0;
> > +
> > + if (scope == SCOPE_SYSTEM)
> > + pr_info("detected: 32-bit EL0 Support\n");
> > +
> > + return true;
> > +}
>
> We may have discussed this before: AFAICT this will print 32-bit EL0
> detected even if there's no 32-bit EL0 on any CPU. Should we instead
> print 32-bit EL0 detected on CPU X when allow_mismatched_32bit_el0 is
> passed? It would also give us an indication of the system configuration
> when people start reporting bugs.

The function above only runs if we've detected 32-bit support via
aa64pfr0_el1, so I think we're ok. We also have a print when we detect the
mismatch (see enable_mismatched_32bit_el0()).

Will

2021-05-24 12:11:07

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 02/21] arm64: Allow mismatched 32-bit EL0 support

On Fri, May 21, 2021 at 11:41:56AM +0100, Catalin Marinas wrote:
> On Tue, May 18, 2021 at 10:47:06AM +0100, Will Deacon wrote:
> > +static int enable_mismatched_32bit_el0(unsigned int cpu)
> > +{
> > + struct cpuinfo_arm64 *info = &per_cpu(cpu_data, cpu);
> > + bool cpu_32bit = id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0);
> > +
> > + if (cpu_32bit) {
> > + cpumask_set_cpu(cpu, cpu_32bit_el0_mask);
> > + static_branch_enable_cpuslocked(&arm64_mismatched_32bit_el0);
>
> It may be worth only calling static_branch_enable_cpuslocked() if not
> already set, in case you try this on a system with lots of CPUs.

static_key_enable_cpuslocked() already checks this early on, so I don't
think we need another check here (note that we're not calling stop_machine()
here _anyway_; the '_cpuslocked' suffix just says that we're already holding
cpu_hotplug_lock via the notifier).
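
For reference, the start of static_key_enable_cpuslocked() looks roughly
like this (paraphrased from kernel/jump_label.c; treat the details as
approximate):

void static_key_enable_cpuslocked(struct static_key *key)
{
	STATIC_KEY_CHECK_USE(key);
	lockdep_assert_cpus_held();	/* caller already holds cpu_hotplug_lock */

	/* Already enabled? Then this is a cheap no-op. */
	if (atomic_read(&key->enabled) > 0) {
		WARN_ON_ONCE(atomic_read(&key->enabled) != 1);
		return;
	}

	/* ... otherwise take jump_label_lock() and patch the branch sites ... */
}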

Will

2021-05-24 12:19:54

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 06/21] sched: Introduce task_cpu_possible_mask() to limit fallback rq selection

On Fri, May 21, 2021 at 06:03:44PM +0200, Peter Zijlstra wrote:
> On Tue, May 18, 2021 at 10:47:10AM +0100, Will Deacon wrote:
> > diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
> > index 03dee12d2b61..bc4ac3c525e6 100644
> > --- a/include/linux/mmu_context.h
> > +++ b/include/linux/mmu_context.h
> > @@ -14,4 +14,12 @@
> > static inline void leave_mm(int cpu) { }
> > #endif
> >
> > +/*
> > + * CPUs that are capable of running task @p. By default, we assume a sane,
> > + * homogeneous system. Must contain at least one active CPU.
> > + */
> > +#ifndef task_cpu_possible_mask
> > +# define task_cpu_possible_mask(p) cpu_possible_mask
> > +#endif
>
> #ifndef task_cpu_possible_mask
> # define task_cpu_possible_mask(p) cpu_possible_mask
> # define task_cpu_possible(cpu, p) true
> #else
> # define task_cpu_possible(cpu, p) cpumask_test_cpu((cpu), task_cpu_possible_mask(p))
> #endif
>
> > +
> > #endif
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 5226cc26a095..482f7fdca0e8 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1813,8 +1813,11 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
> > return cpu_online(cpu);
> >
> > /* Non kernel threads are not allowed during either online or offline. */
> > if (!(p->flags & PF_KTHREAD))
> > - return cpu_active(cpu);
> + return cpu_active(cpu) && task_cpu_possible(cpu, p);
>
> > /* KTHREAD_IS_PER_CPU is always allowed. */
> > if (kthread_is_per_cpu(p))
>
> Would something like that make sense?

I think this is probably the only place where we could use the helper, but
it's also one of the places where architectures that don't have to worry
about asymmetry end up with the check, so yes, I'll do that for v7.

Will

2021-05-24 13:47:16

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v6 02/21] arm64: Allow mismatched 32-bit EL0 support

On Mon, May 24, 2021 at 01:09:59PM +0100, Will Deacon wrote:
> On Fri, May 21, 2021 at 11:41:56AM +0100, Catalin Marinas wrote:
> > On Tue, May 18, 2021 at 10:47:06AM +0100, Will Deacon wrote:
> > > +static int enable_mismatched_32bit_el0(unsigned int cpu)
> > > +{
> > > + struct cpuinfo_arm64 *info = &per_cpu(cpu_data, cpu);
> > > + bool cpu_32bit = id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0);
> > > +
> > > + if (cpu_32bit) {
> > > + cpumask_set_cpu(cpu, cpu_32bit_el0_mask);
> > > + static_branch_enable_cpuslocked(&arm64_mismatched_32bit_el0);
> >
> > It may be worth only calling static_branch_enable_cpuslocked() if not
> > already set, in case you try this on a system with lots of CPUs.
>
> static_key_enable_cpuslocked() already checks this early on, so I don't
> think we need another check here (note that we're not calling stop_machine()
> here _anyway_; the '_cpuslocked' suffix just says that we're already holding
> cpu_hotplug_lock via the notifier).

Ah, you are right, no need for an additional check.

--
Catalin

2021-05-24 13:50:28

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v6 02/21] arm64: Allow mismatched 32-bit EL0 support

On Mon, May 24, 2021 at 01:05:50PM +0100, Will Deacon wrote:
> On Fri, May 21, 2021 at 11:25:23AM +0100, Catalin Marinas wrote:
> > On Tue, May 18, 2021 at 10:47:06AM +0100, Will Deacon wrote:
> > > +static bool has_32bit_el0(const struct arm64_cpu_capabilities *entry, int scope)
> > > +{
> > > + if (!has_cpuid_feature(entry, scope))
> > > + return allow_mismatched_32bit_el0;
> > > +
> > > + if (scope == SCOPE_SYSTEM)
> > > + pr_info("detected: 32-bit EL0 Support\n");
> > > +
> > > + return true;
> > > +}
> >
> > We may have discussed this before: AFAICT this will print 32-bit EL0
> > detected even if there's no 32-bit EL0 on any CPU. Should we instead
> > print 32-bit EL0 detected on CPU X when allow_mismatched_32bit_el0 is
> > passed? It would also give us an indication of the system configuration
> > when people start reporting bugs.
>
> The function above only runs if we've detected 32-bit support via
> aa64pfr0_el1, so I think we're ok. We also have a print when we detect the
> mismatch (see enable_mismatched_32bit_el0()).

That makes sense; you removed the .desc from the arm64_features entry as
well.

Reviewed-by: Catalin Marinas <[email protected]>

2021-05-24 15:24:51

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v6 16/21] arm64: Implement task_cpu_possible_mask()

On Tue, May 18, 2021 at 10:47:20AM +0100, Will Deacon wrote:
> Provide an implementation of task_cpu_possible_mask() so that we can
> prevent 64-bit-only cores being added to the 'cpus_mask' for compat
> tasks on systems with mismatched 32-bit support at EL0.
>
> Signed-off-by: Will Deacon <[email protected]>

Reviewed-by: Catalin Marinas <[email protected]>

2021-05-24 15:24:51

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v6 17/21] arm64: exec: Adjust affinity for compat tasks with mismatched 32-bit EL0

On Tue, May 18, 2021 at 10:47:21AM +0100, Will Deacon wrote:
> When exec'ing a 32-bit task on a system with mismatched support for
> 32-bit EL0, try to ensure that it starts life on a CPU that can actually
> run it.
>
> Similarly, when exec'ing a 64-bit task on such a system, try to restore
> the old affinity mask if it was previously restricted.
>
> Reviewed-by: Quentin Perret <[email protected]>
> Signed-off-by: Will Deacon <[email protected]>

Reviewed-by: Catalin Marinas <[email protected]>

2021-05-24 16:10:52

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v6 19/21] arm64: Hook up cmdline parameter to allow mismatched 32-bit EL0

On Tue, May 18, 2021 at 10:47:23AM +0100, Will Deacon wrote:
> Allow systems with mismatched 32-bit support at EL0 to run 32-bit
> applications based on a new kernel parameter.
>
> Signed-off-by: Will Deacon <[email protected]>

Acked-by: Catalin Marinas <[email protected]>

2021-05-24 16:11:30

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v6 18/21] arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system

On Tue, May 18, 2021 at 10:47:22AM +0100, Will Deacon wrote:
> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index 959442f76ed7..72efdc611b14 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -2896,15 +2896,33 @@ void __init setup_cpu_features(void)
>
> static int enable_mismatched_32bit_el0(unsigned int cpu)
> {
> + static int lucky_winner = -1;
> +
> struct cpuinfo_arm64 *info = &per_cpu(cpu_data, cpu);
> bool cpu_32bit = id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0);
>
> if (cpu_32bit) {
> cpumask_set_cpu(cpu, cpu_32bit_el0_mask);
> static_branch_enable_cpuslocked(&arm64_mismatched_32bit_el0);
> - setup_elf_hwcaps(compat_elf_hwcaps);
> }
>
> + if (cpumask_test_cpu(0, cpu_32bit_el0_mask) == cpu_32bit)
> + return 0;

I don't fully understand this early return. AFAICT, we still call
setup_elf_hwcaps() via setup_cpu_features() if the system supports
32-bit EL0 (mismatched or not) at boot. For CPU hotplug, we can add the
compat hwcaps later if we didn't set them up at boot. So this part is
fine.

However, if CPU0 is 32-bit-capable, it looks like we'd never disable the
offlining on any of the 32-bit-capable CPUs and there's nothing that
prevents offlining CPU0.

--
Catalin

2021-05-24 16:12:55

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v6 20/21] arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores

On Tue, May 18, 2021 at 10:47:24AM +0100, Will Deacon wrote:
> The scheduler now knows enough about these braindead systems to place
> 32-bit tasks accordingly, so throw out the safety checks and allow the
> ret-to-user path to avoid do_notify_resume() if there is nothing to do.
>
> Signed-off-by: Will Deacon <[email protected]>

Acked-by: Catalin Marinas <[email protected]>

2021-05-24 16:23:02

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v6 21/21] Documentation: arm64: describe asymmetric 32-bit support

On Tue, May 18, 2021 at 10:47:25AM +0100, Will Deacon wrote:
> Document support for running 32-bit tasks on asymmetric 32-bit systems
> and its impact on the user ABI when enabled.
>
> Signed-off-by: Will Deacon <[email protected]>

The logic looks alright to me but I suspect most arguments are around
the scheduler behaviour.

Reviewed-by: Catalin Marinas <[email protected]>

2021-05-24 20:23:04

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 07/21] cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1

Hi Qais,

On Fri, May 21, 2021 at 06:39:34PM +0100, Qais Yousef wrote:
> On 05/18/21 10:47, Will Deacon wrote:
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index a945504c0ae7..8c799260a4a2 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -3322,9 +3322,17 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
> >
> > void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> > {
> > + const struct cpumask *cs_mask;
> > + const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
> > +
> > rcu_read_lock();
> > - do_set_cpus_allowed(tsk, is_in_v2_mode() ?
> > - task_cs(tsk)->cpus_allowed : cpu_possible_mask);
> > + cs_mask = task_cs(tsk)->cpus_allowed;
> > +
> > + if (!is_in_v2_mode() || !cpumask_subset(cs_mask, possible_mask))
> > + goto unlock; /* select_fallback_rq will try harder */
> > +
> > + do_set_cpus_allowed(tsk, cs_mask);
>
> Shouldn't we take the intersection with possible_mask like we discussed before?
>
> https://lore.kernel.org/lkml/20201217145954.GA17881@willie-the-truck/

Yes, and that's what the '!cpumask_subset()' check is doing above. Either
we use the valid subset of the cpuset mask (which is the intersection with
the possible mask) or we bail if that set is empty.

Will

2021-05-24 20:23:09

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 02/21] arm64: Allow mismatched 32-bit EL0 support

On Fri, May 21, 2021 at 04:22:55PM +0100, Qais Yousef wrote:
> On 05/18/21 10:47, Will Deacon wrote:
> > When confronted with a mixture of CPUs, some of which support 32-bit
> > applications and others which don't, we quite sensibly treat the system
> > as 64-bit only for userspace and prevent execve() of 32-bit binaries.
> >
> > Unfortunately, some crazy folks have decided to build systems like this
> > with the intention of running 32-bit applications, so relax our
> > sanitisation logic to continue to advertise 32-bit support to userspace
> > on these systems and track the real 32-bit capable cores in a cpumask
> > instead. For now, the default behaviour remains but will be tied to
> > a command-line option in a later patch.
> >
> > Signed-off-by: Will Deacon <[email protected]>
> > ---
> > arch/arm64/include/asm/cpucaps.h | 3 +-
>
> Heads up. I just tried to apply this on 5.13-rc2 and it failed because cpucaps.h
> was removed; it's autogenerated now.
>
> See commit 0c6c2d3615ef ("arm64: Generate cpucaps.h")

Yup, cheers. I'll sort that out once we're at the stage where we're merging
patches.

Will

2021-05-24 20:34:11

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 18/21] arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system

Hi Catalin,

On Mon, May 24, 2021 at 04:46:58PM +0100, Catalin Marinas wrote:
> On Tue, May 18, 2021 at 10:47:22AM +0100, Will Deacon wrote:
> > diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> > index 959442f76ed7..72efdc611b14 100644
> > --- a/arch/arm64/kernel/cpufeature.c
> > +++ b/arch/arm64/kernel/cpufeature.c
> > @@ -2896,15 +2896,33 @@ void __init setup_cpu_features(void)
> >
> > static int enable_mismatched_32bit_el0(unsigned int cpu)
> > {
> > + static int lucky_winner = -1;
> > +
> > struct cpuinfo_arm64 *info = &per_cpu(cpu_data, cpu);
> > bool cpu_32bit = id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0);
> >
> > if (cpu_32bit) {
> > cpumask_set_cpu(cpu, cpu_32bit_el0_mask);
> > static_branch_enable_cpuslocked(&arm64_mismatched_32bit_el0);
> > - setup_elf_hwcaps(compat_elf_hwcaps);
> > }
> >
> > + if (cpumask_test_cpu(0, cpu_32bit_el0_mask) == cpu_32bit)
> > + return 0;
>
> I don't fully understand this early return. AFAICT, we still call
> setup_elf_hwcaps() via setup_cpu_features() if the system supports
> 32-bit EL0 (mismatched or not) at boot. For CPU hotplug, we can add the
> compat hwcaps later if we didn't set them up at boot. So this part is
> fine.
>
> However, if CPU0 is 32-bit-capable, it looks like we'd never disable the
> offlining on any of the 32-bit-capable CPUs and there's nothing that
> prevents offlining CPU0.

That is also deferred until we actually detect the mismatch. For example, if
CPU0 is 32-bit capable but none of the others are, then when we online CPU1
we will print:

| CPU features: Asymmetric 32-bit EL0 support detected on CPU 1; CPU hot-unplug disabled on CPU 0

so the check above is really asking "Is the CPU being onlined mismatched wrt
the boot CPU?". If yes, then we need to make sure that we're keeping a
32-bit-capable CPU around.

Will
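
(For context, the remainder of the hook roughly picks a "lucky winner"
32-bit-capable CPU and disables hot-unplug for it; the sketch below is
inferred from the quoted diff and the pr_info above, so the details may
differ from the actual patch:)

	if (lucky_winner >= 0)
		return 0;

	/*
	 * We've detected a mismatch with the boot CPU, so keep at least one
	 * 32-bit-capable CPU online to run compat tasks.
	 */
	lucky_winner = cpu_32bit ? cpu : cpumask_any_and(cpu_32bit_el0_mask,
							 cpu_active_mask);
	get_cpu_device(lucky_winner)->offline_disabled = true;
	setup_elf_hwcaps(compat_elf_hwcaps);
	pr_info("Asymmetric 32-bit EL0 support detected on CPU %u; CPU hot-unplug disabled on CPU %u\n",
		cpu, lucky_winner);
	return 0;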

2021-05-24 21:39:12

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 11/21] sched: Split the guts of sched_setaffinity() into a helper function

On Fri, May 21, 2021 at 05:41:01PM +0100, Qais Yousef wrote:
> On 05/18/21 10:47, Will Deacon wrote:
> > In preparation for replaying user affinity requests using a saved mask,
> > split sched_setaffinity() up so that the initial task lookup and
> > security checks are only performed when the request is coming directly
> > from userspace.
> >
> > Signed-off-by: Will Deacon <[email protected]>
> > ---
> > kernel/sched/core.c | 110 +++++++++++++++++++++++---------------------
> > 1 file changed, 58 insertions(+), 52 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 9512623d5a60..808bbe669a6d 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6788,9 +6788,61 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
> > return retval;
> > }
> >
> > -long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
> > +static int
> > +__sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
> > {
> > + int retval;
> > cpumask_var_t cpus_allowed, new_mask;
> > +
> > + if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL))
> > + return -ENOMEM;
> > +
> > + if (!alloc_cpumask_var(&new_mask, GFP_KERNEL))
> > + return -ENOMEM;
>
> Shouldn't we free cpus_allowed first?

Oops, yes. Now fixed.

Thanks,

Will

2021-05-24 21:49:41

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 12/21] sched: Allow task CPU affinity to be restricted on asymmetric systems

On Fri, May 21, 2021 at 06:11:32PM +0100, Qais Yousef wrote:
> On 05/18/21 10:47, Will Deacon wrote:
> > +static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
> > + const struct cpumask *new_mask,
> > + u32 flags,
> > + struct rq *rq,
> > + struct rq_flags *rf)
> > + __releases(rq->lock)
> > + __releases(p->pi_lock)
> > {
> > const struct cpumask *cpu_valid_mask = cpu_active_mask;
> > const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
> > unsigned int dest_cpu;
> > - struct rq_flags rf;
> > - struct rq *rq;
> > int ret = 0;
> >
> > - rq = task_rq_lock(p, &rf);
> > update_rq_clock(rq);
> >
> > if (p->flags & PF_KTHREAD || is_migration_disabled(p)) {
> > @@ -2430,20 +2425,158 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
> >
> > __do_set_cpus_allowed(p, new_mask, flags);
> >
> > - return affine_move_task(rq, p, &rf, dest_cpu, flags);
> > + if (flags & SCA_USER)
> > + release_user_cpus_ptr(p);
>
> Why do we need to release the pointer here?
>
> Doesn't this mean if a 32bit task requests to change its affinity, then we'll
> lose this info and a subsequent execve() to a 64bit application means we won't
> be able to restore the original mask?
>
> ie:
>
> p0-64bit
> execve(32bit_app)
> // p1-32bit created
> p1-32bit.change_affinity()
> relase_user_cpus_ptr()
> execve(64bit_app) // lost info about p0 affinity?
>
> Hmm I think this helped me to get the answer. p1 changed its affinity, then
> there's nothing to be inherited by a new execve(), so yes we no longer need
> this info.

Yup, you got it.

> > +static int restrict_cpus_allowed_ptr(struct task_struct *p,
> > + struct cpumask *new_mask,
> > + const struct cpumask *subset_mask)
> > +{
> > + struct rq_flags rf;
> > + struct rq *rq;
> > + int err;
> > + struct cpumask *user_mask = NULL;
> > +
> > + if (!p->user_cpus_ptr)
> > + user_mask = kmalloc(cpumask_size(), GFP_KERNEL);
> > +
> > + rq = task_rq_lock(p, &rf);
> > +
> > + /*
> > + * We're about to butcher the task affinity, so keep track of what
> > + * the user asked for in case we're able to restore it later on.
> > + */
> > + if (user_mask) {
> > + cpumask_copy(user_mask, p->cpus_ptr);
> > + p->user_cpus_ptr = user_mask;
> > + }
> > +
> > + /*
> > + * Forcefully restricting the affinity of a deadline task is
> > + * likely to cause problems, so fail and noisily override the
> > + * mask entirely.
> > + */
> > + if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
> > + err = -EPERM;
> > + goto err_unlock;
>
> free(user_mark) first?
>
> > + }
> > +
> > + if (!cpumask_and(new_mask, &p->cpus_mask, subset_mask)) {
> > + err = -EINVAL;
> > + goto err_unlock;
>
> ditto

We free the mask when the task exits so we don't actually need to clean up
here. I left it like this on the assumption that failing here means that
it's very likely that either the task will exit or retry very soon.

However I agree that it would be clearer to free the thing anyway, so I'll
rejig the code to do that.
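
One way to rejig it is to defer publishing the saved mask until after the
checks, so the error path can free the allocation unconditionally (a
sketch of the shape, not necessarily the exact v7 code):

	/* ... after task_rq_lock() and the -EPERM / -EINVAL checks ... */

	/*
	 * Only publish the saved mask once we know the restriction will go
	 * ahead; the error paths below can then free it unconditionally.
	 */
	if (user_mask) {
		cpumask_copy(user_mask, p->cpus_ptr);
		p->user_cpus_ptr = user_mask;
	}

	return __set_cpus_allowed_ptr_locked(p, new_mask, 0, rq, &rf);

err_unlock:
	task_rq_unlock(rq, p, &rf);
	kfree(user_mask);
	return err;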

Will

2021-05-24 21:51:10

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 21/21] Documentation: arm64: describe asymmetric 32-bit support

On Fri, May 21, 2021 at 06:37:21PM +0100, Qais Yousef wrote:
> On 05/18/21 10:47, Will Deacon wrote:
> > Document support for running 32-bit tasks on asymmetric 32-bit systems
> > and its impact on the user ABI when enabled.
> >
> > Signed-off-by: Will Deacon <[email protected]>
> > ---
> > .../admin-guide/kernel-parameters.txt | 3 +
> > Documentation/arm64/asymmetric-32bit.rst | 149 ++++++++++++++++++
> > Documentation/arm64/index.rst | 1 +
> > 3 files changed, 153 insertions(+)
> > create mode 100644 Documentation/arm64/asymmetric-32bit.rst
> >
>
> [...]
>
> > +Cpusets
> > +-------
> > +
> > +The affinity of a 32-bit task may include CPUs that are not explicitly
> > +allowed by the cpuset to which it is attached. This can occur as a
> > +result of the following two situations:
> > +
> > + - A 64-bit task attached to a cpuset which allows only 64-bit CPUs
> > + executes a 32-bit program.
> > +
> > + - All of the 32-bit-capable CPUs allowed by a cpuset containing a
> > + 32-bit task are offlined.
> > +
> > +In both of these cases, the new affinity is calculated according to step
> > +(2) of the process described in `execve(2)`_ and the cpuset hierarchy is
> > +unchanged irrespective of the cgroup version.
>
> nit: Should we call out that we're breaking cpuset-v1 behavior? Don't feel
> strongly about it.

I think the text is pretty clear that the new behaviour documented here
applies to cpuset-v1 and I wouldn't say we're breaking anything as we're not
changing any of the existing behaviours.

Will

2021-05-24 22:10:13

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 00/21] Add support for 32-bit tasks on asymmetric AArch32 systems

On Fri, May 21, 2021 at 06:45:42PM +0100, Qais Yousef wrote:
> Hi Will
>
> On 05/18/21 10:47, Will Deacon wrote:
> > Hi folks,
> >
> > This is the long-awaited v6 of these patches which I last posted at the
> > end of last year:
> >
> > v1: https://lore.kernel.org/r/[email protected]
> > v2: https://lore.kernel.org/r/[email protected]
> > v3: https://lore.kernel.org/r/[email protected]
> > v4: https://lore.kernel.org/r/[email protected]
> > v5: https://lore.kernel.org/r/[email protected]
> >
> > There was also a nice LWN writeup in case you've forgotten what this is
> > about:
> >
> > https://lwn.net/Articles/838339/
> >
> > It's taken me a while to get a v6 of this together, partly due to
> > addressing the review feedback on v5, but also because this has now seen
> > testing on real hardware which threw up some surprises in suspend/resume,
> > SCHED_DEADLINE and compat hwcap reporting. Thanks to Quentin for helping
> > me to debug those issues.
> >
> > The aim of this series is to allow 32-bit ARM applications to run on
> > arm64 SoCs where not all of the CPUs support the 32-bit instruction set.
> > Unfortunately, such SoCs are real and will continue to be productised
> > over the next few years at least. I can assure you that I'm not just
> > doing this for fun.
> >
> > Changes in v6 include:
> >
> > * Save/restore the affinity mask across execve() to 32-bit and back to
> > 64-bit again.
> >
> > * Allow 32-bit deadline tasks, but skip straight to fallback path when
> > determining new affinity mask on execve().
> >
> > * Fixed resume-from-suspend path when the resuming CPU is 64-bit-only
> > by deferring wake-ups for 32-bit tasks until the secondary CPUs are
> > back online.
> >
> > * Bug fixes (compat hwcaps, memory leak, cpuset fallback path).
> >
> > * Documentation for arm64. It's in the divisive .rst format, but please
> > take a look anyway!
> >
> > I'm pretty happy with this now and it seems to do the right thing,
> > although the new patches in this revision would certainly benefit from
> > review. Series based on v5.13-rc1.
>
> It's late Fri and I'm off next week (I'm starting to sense an omen here, it's
> the 2nd or 3rd time the post syncs with my holiday), so a bit of a rushed
> review but the series looks good to me. Feel free to stick my Reviewed-by for
> the series, except patch 13 where I skipped it, given the few comments I had
> are addressed.

Thanks, Qais. I'm planning a v7 with quite a few changes, so it's probably
best if you offer your Reviewed-by on individual patches when you're happy
with them rather than me adding it to code that I'm still tweaking. But
thanks for the offer!

Have a good week off,

Will

2021-05-24 23:57:12

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 13/21] sched: Admit forcefully-affined tasks into SCHED_DEADLINE

On Fri, May 21, 2021 at 03:00:42PM +0200, Daniel Bristot de Oliveira wrote:
> On 5/21/21 12:37 PM, Will Deacon wrote:
> > Interesting, thanks. Thinking about this some more, it strikes me that with
> > these silly asymmetric systems there could be an interesting additional
> > problem with hotplug and deadline tasks. Imagine the following sequence of
> > events:
> >
> > 1. All online CPUs are 32-bit-capable
> > 2. sched_setattr() admits a 32-bit deadline task
> > 3. A 64-bit-only CPU is onlined
>
> At point 3, the global scheduler assumption is broken. For instance, in a
> system with four CPUs and five ready 32-bit tasks, when the fifth CPU is
> added, the work-conserving rule is violated because the five highest-priority
> threads are not running (only four are) :-(.
>
> So, at this point, for us to keep to the current behavior, the addition should
> be.. blocked? :-((
>
> > 4. Some of the 32-bit-capable CPUs are offlined
>
> Assuming that point 3 does not exist (i.e., all CPUs are 32-bit-capable). At
> this point, we will have an increase in the pressure on the 32-bit-capable CPUs.
>
> This can also create bad effects for 64-bit tasks, as the "contended" 32-bit
> tasks will still be "queued" in a future time where they were supposed to be
> done (leaving time for the 64-bit tasks).

That's a really interesting point that I hadn't previously considered. It
means that the effects of 32-bit tasks with forced affinity are far
reaching when it comes to deadline tasks.

> > I wonder if we can get into a situation where we think we have enough
> > bandwidth available, but in reality the 32-bit task is in trouble because
> > it can't make use of the 64-bit-only CPU.
>
> I would have to think more, but there might be a case where this contended
> 32-bit tasks could cause deadline misses for the 64-bit too.
>
> > If so, then it seems to me that admission control is really just
> > "best-effort" for 32-bit deadline tasks on these systems because it's based
> > on a snapshot in time of the available resources.
>
> The admission test as it is now is "best-effort" in the sense that it allows a
> workload higher than it could handle (it is a necessary, but not sufficient, AC).
> But it should not be considered "best-effort" because of violations of the
> work-conserving property as a result of arbitrary affinities among tasks.
> Overall, we have been trying to close any remaining exceptions of this latter kind.
>
> I know it is a complex situation; I am just trying to illustrate our concerns
> because, in the near future, we might have a scheduler that handles arbitrary
> affinities correctly. But that might require us to stick to an AC. The AC is
> something precious for us.

I've implemented AC on execve() of a 32-bit program so we'll fail that system
call with -ENOEXEC if the root domain contains 64-bit-only CPUs. As expected,
the failure mode is awful because it seems as though the ELF binary is then
treated like a shell script by userspace and passed to /bin/sh:

$ sudo chrt -d -T 5000000 -P 16666666 0 ./hello32
./hello32: 1: Syntax error: word unexpected (expecting ")")
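
Roughly the shape of that exec-time check, as a purely hypothetical sketch
(the helper name, the locking and the exact placement in the compat exec
path are all invented for illustration):

/*
 * Hypothetical sketch: refuse to exec a 32-bit image from a SCHED_DEADLINE
 * task if its root domain spans CPUs the new image cannot use, since that
 * would invalidate the DL admission control. @new_possible stands for the
 * possible mask of the incoming 32-bit image; locking is simplified.
 */
static int dl_check_compat_exec(struct task_struct *p,
				const struct cpumask *new_possible)
{
	int ret = 0;

	if (!task_has_dl_policy(p) || !dl_bandwidth_enabled())
		return 0;

	rcu_read_lock();
	if (!cpumask_subset(task_rq(p)->rd->span, new_possible))
		ret = -ENOEXEC;
	rcu_read_unlock();

	return ret;
}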

Anyway, I'll include this in v7.

Cheers,

Will

2021-05-24 23:58:13

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v6 08/21] cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()

On Fri, May 21, 2021 at 05:25:24PM +0100, Qais Yousef wrote:
> On 05/18/21 10:47, Will Deacon wrote:
> > Asymmetric systems may not offer the same level of userspace ISA support
> > across all CPUs, meaning that some applications cannot be executed by
> > some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
> > not feature support for 32-bit applications on both clusters.
> >
> > Modify guarantee_online_cpus() to take task_cpu_possible_mask() into
> > account when trying to find a suitable set of online CPUs for a given
> > task. This will avoid passing an invalid mask to set_cpus_allowed_ptr()
> > during ->attach() and will subsequently allow the cpuset hierarchy to be
> > taken into account when forcefully overriding the affinity mask for a
> > task which requires migration to a compatible CPU.
> >
> > Cc: Li Zefan <[email protected]>
> > Cc: Tejun Heo <[email protected]>
> > Cc: Johannes Weiner <[email protected]>
> > Signed-off-by: Will Deacon <[email protected]>
> > ---
> > include/linux/cpuset.h | 2 +-
> > kernel/cgroup/cpuset.c | 33 +++++++++++++++++++--------------
> > 2 files changed, 20 insertions(+), 15 deletions(-)
> >
> > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> > index ed6ec677dd6b..414a8e694413 100644
> > --- a/include/linux/cpuset.h
> > +++ b/include/linux/cpuset.h
> > @@ -185,7 +185,7 @@ static inline void cpuset_read_unlock(void) { }
> > static inline void cpuset_cpus_allowed(struct task_struct *p,
> > struct cpumask *mask)
> > {
> > - cpumask_copy(mask, cpu_possible_mask);
> > + cpumask_copy(mask, task_cpu_possible_mask(p));
> > }
> >
> > static inline void cpuset_cpus_allowed_fallback(struct task_struct *p)
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index 8c799260a4a2..b532a5333ff9 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -372,18 +372,26 @@ static inline bool is_in_v2_mode(void)
> > }
> >
> > /*
> > - * Return in pmask the portion of a cpusets's cpus_allowed that
> > - * are online. If none are online, walk up the cpuset hierarchy
> > - * until we find one that does have some online cpus.
> > + * Return in pmask the portion of a task's cpusets's cpus_allowed that
> > + * are online and are capable of running the task. If none are found,
> > + * walk up the cpuset hierarchy until we find one that does have some
> > + * appropriate cpus.
> > *
> > * One way or another, we guarantee to return some non-empty subset
> > * of cpu_online_mask.
> > *
> > * Call with callback_lock or cpuset_mutex held.
> > */
> > -static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask)
> > +static void guarantee_online_cpus(struct task_struct *tsk,
> > + struct cpumask *pmask)
> > {
> > - while (!cpumask_intersects(cs->effective_cpus, cpu_online_mask)) {
> > + struct cpuset *cs = task_cs(tsk);
>
> task_cs() requires rcu_read_lock(), but I can't see how the lock is obtained
> on the cpuset_attach() path; did I miss it? Running with lockdep should spit
> out a suspicious RCU usage warning.
>
> Maybe it makes more sense to move the rcu_read_lock() inside the function now
> with task_cs()?

Well spotted! I'll add the rcu_read_[un]lock() calls to
guarantee_online_cpus().
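
i.e. roughly this (a sketch of the reworked helper, eliding the unchanged
hierarchy walk):

static void guarantee_online_cpus(struct task_struct *tsk,
				  struct cpumask *pmask)
{
	struct cpuset *cs;

	rcu_read_lock();		/* task_cs() must be called under RCU */
	cs = task_cs(tsk);

	/* ... walk up the hierarchy and compute *pmask as before ... */

	rcu_read_unlock();
}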

Will

2021-05-25 10:32:27

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v6 18/21] arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system

On Mon, May 24, 2021 at 09:32:50PM +0100, Will Deacon wrote:
> On Mon, May 24, 2021 at 04:46:58PM +0100, Catalin Marinas wrote:
> > On Tue, May 18, 2021 at 10:47:22AM +0100, Will Deacon wrote:
> > > diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> > > index 959442f76ed7..72efdc611b14 100644
> > > --- a/arch/arm64/kernel/cpufeature.c
> > > +++ b/arch/arm64/kernel/cpufeature.c
> > > @@ -2896,15 +2896,33 @@ void __init setup_cpu_features(void)
> > >
> > > static int enable_mismatched_32bit_el0(unsigned int cpu)
> > > {
> > > + static int lucky_winner = -1;
> > > +
> > > struct cpuinfo_arm64 *info = &per_cpu(cpu_data, cpu);
> > > bool cpu_32bit = id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0);
> > >
> > > if (cpu_32bit) {
> > > cpumask_set_cpu(cpu, cpu_32bit_el0_mask);
> > > static_branch_enable_cpuslocked(&arm64_mismatched_32bit_el0);
> > > - setup_elf_hwcaps(compat_elf_hwcaps);
> > > }
> > >
> > > + if (cpumask_test_cpu(0, cpu_32bit_el0_mask) == cpu_32bit)
> > > + return 0;
> >
> > I don't fully understand this early return. AFAICT, we still call
> > setup_elf_hwcaps() via setup_cpu_features() if the system supports
> > 32-bit EL0 (mismatched or not) at boot. For CPU hotplug, we can add the
> > compat hwcaps later if we didn't set them up at boot. So this part is
> > fine.
> >
> > However, if CPU0 is 32-bit-capable, it looks like we'd never disable the
> > offlining on any of the 32-bit-capable CPUs and there's nothing that
> > prevents offlining CPU0.
>
> That is also deferred until we actually detect the mismatch. For example, if
> CPU0 is 32-bit capable but none of the others are, then when we online CPU1
> we will print:
>
> | CPU features: Asymmetric 32-bit EL0 support detected on CPU 1; CPU hot-unplug disabled on CPU 0
>
> so the check above is really asking "Is the CPU being onlined mismatched wrt
> the boot CPU?". If yes, then we need to make sure that we're keeping a
> 32-bit-capable CPU around.

Got it now, the offlining will only be disabled if we detected a
mismatch. For this patch:

Reviewed-by: Catalin Marinas <[email protected]>