Add a KVM_CAP to control WFx (WFI or WFE) trapping based on scheduler
runqueue depth. This allows WFx to be passed through when the runqueue
is shallow or the CPU has support for direct interrupt injection. WFx
may always be trapped by setting this value to 0. Technically this
means traps will be cleared when the runqueue depth is 0, but that
implies nothing is running anyway, so there is no reason to care. The
default value is 1 to preserve the behavior before this option was
added.
Think of this option as a threshold: the instruction will be trapped
if the runqueue depth is higher than the threshold.
Signed-off-by: Colton Lewis <[email protected]>
---
v2:
The last version was exclusively a flag to enable unconditional WFx
passthrough, but there was feedback to make passthrough/trapping depend
on runqueue depth. I asked in the last thread whether there were any
preferences for the interface to accomplish this, but I figured it's
easier to show code than to wait for people to tell me what to do.
v1:
https://lore.kernel.org/kvmarm/[email protected]/
arch/arm64/include/asm/kvm_host.h | 1 +
arch/arm64/kvm/arm.c | 7 ++++++-
include/linux/sched/stat.h | 1 +
include/uapi/linux/kvm.h | 2 +-
kernel/sched/core.c | 15 +++++++++++++--
5 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 21c57b812569..79f461efaa6c 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -317,6 +317,7 @@ struct kvm_arch {
* the associated pKVM instance in the hypervisor.
*/
struct kvm_protected_vm pkvm;
+ u64 wfx_trap_runqueue_depth;
};
struct kvm_vcpu_fault_info {
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a25265aca432..419eed6e1814 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -116,6 +116,9 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
}
mutex_unlock(&kvm->slots_lock);
break;
+ case KVM_CAP_ARM_WFX_TRAP_RUNQUEUE_DEPTH:
+ kvm->arch.wfx_trap_runqueue_depth = cap->args[0];
+ break;
default:
r = -EINVAL;
break;
@@ -176,6 +179,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
bitmap_zero(kvm->arch.vcpu_features, KVM_VCPU_MAX_FEATURES);
+ kvm->arch.wfx_trap_runqueue_depth = 1;
return 0;
err_free_cpumask:
@@ -240,6 +244,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_ARM_SYSTEM_SUSPEND:
case KVM_CAP_IRQFD_RESAMPLE:
case KVM_CAP_COUNTER_OFFSET:
+ case KVM_CAP_ARM_WFX_TRAP_RUNQUEUE_DEPTH:
r = 1;
break;
case KVM_CAP_SET_GUEST_DEBUG2:
@@ -456,7 +461,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (kvm_arm_is_pvtime_enabled(&vcpu->arch))
kvm_make_request(KVM_REQ_RECORD_STEAL, vcpu);
- if (single_task_running())
+ if (nr_running_this_cpu() <= vcpu->kvm->arch.wfx_trap_runqueue_depth)
vcpu_clear_wfx_traps(vcpu);
else
vcpu_set_wfx_traps(vcpu);
diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
index 0108a38bb64d..dc1541fcec56 100644
--- a/include/linux/sched/stat.h
+++ b/include/linux/sched/stat.h
@@ -18,6 +18,7 @@ extern int nr_threads;
DECLARE_PER_CPU(unsigned long, process_counts);
extern int nr_processes(void);
extern unsigned int nr_running(void);
+extern unsigned int nr_running_this_cpu(void);
extern bool single_task_running(void);
extern unsigned int nr_iowait(void);
extern unsigned int nr_iowait_cpu(int cpu);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index c3308536482b..4c0ebf514c03 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1155,6 +1155,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_MEMORY_ATTRIBUTES 233
#define KVM_CAP_GUEST_MEMFD 234
#define KVM_CAP_VM_TYPES 235
+#define KVM_CAP_ARM_WFX_TRAP_RUNQUEUE_DEPTH 236
#ifdef KVM_CAP_IRQ_ROUTING
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9116bcc90346..b18f29964648 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5420,7 +5420,7 @@ unsigned int nr_running(void)
}
/*
- * Check if only the current task is running on the CPU.
+ * Return number of tasks running on this CPU.
*
* Caution: this function does not check that the caller has disabled
* preemption, thus the result might have a time-of-check-to-time-of-use
@@ -5432,9 +5432,20 @@ unsigned int nr_running(void)
*
* - in a loop with very short iterations (e.g. a polling loop)
*/
+unsigned int nr_running_this_cpu(void)
+{
+ return raw_rq()->nr_running;
+}
+EXPORT_SYMBOL(nr_running_this_cpu);
+
+/*
+ * Check if only the current task is running on the CPU.
+ *
+ * Caution: see warning for nr_running_this_cpu
+ */
bool single_task_running(void)
{
- return raw_rq()->nr_running == 1;
+ return nr_running_this_cpu() == 1;
}
EXPORT_SYMBOL(single_task_running);
--
2.44.0.291.gc1ea87d7ee-goog
On Tuesday 19 Mar 2024 at 16:43:41 (+0000), Colton Lewis wrote:
> Add a KVM_CAP to control WFx (WFI or WFE) trapping based on scheduler
> runqueue depth. This is so they can be passed through if the runqueue
> is shallow or the CPU has support for direct interrupt injection. They
> may be always trapped by setting this value to 0. Technically this
> means traps will be cleared when the runqueue depth is 0, but that
> implies nothing is running anyway so there is no reason to care. The
> default value is 1 to preserve previous behavior before adding this
> option.
I recently discovered that this was enabled by default, but it's not
obvious to me that everyone will want it enabled, so I'm in favour of
figuring out a way to turn it off (in fact we might want to make this
feature opt-in, as the status quo used to be to always trap).
There are a few potential issues I see with having this enabled:
- a lone vcpu thread on a CPU will completely screw up the host
scheduler's load tracking metrics if the vCPU actually spends a
significant amount of time in WFI (the PELT signal will no longer
be a good proxy for "how much CPU time does this task need");
- the scheduler's decisions will massively impact the behaviour of the
vcpu task itself. Co-scheduling a task with a vcpu task (or not) will
massively change the perceived behaviour of the vcpu task in a way
that is entirely unpredictable to the scheduler;
- while the above problems might be OK for some users, I don't think
this will always be true, e.g. when running on big.LITTLE systems the
above sounds nightmare-ish;
- the guest spending long periods of time in WFI prevents the host from
being able to enter deeper idle states, which will impact power very
negatively;
And probably a whole bunch of other things.
> Think about his option as a threshold. The instruction will be trapped
> if the runqueue depth is higher than the threshold.
So talking about the exact interface, I'm not sure exposing this to
userspace is really appropriate. The current rq depth is next to
impossible for userspace to control well.
My gut feeling tells me we might want to gate all of this on
PREEMPT_FULL instead, since PREEMPT_FULL is pretty much a way to say
"I'm willing to give up scheduler tracking accuracy to gain throughput
when I've got a task running alone on a CPU". Thoughts?
Thanks,
Quentin
On Friday 22 Mar 2024 at 14:24:35 (+0000), Quentin Perret wrote:
> On Tuesday 19 Mar 2024 at 16:43:41 (+0000), Colton Lewis wrote:
> > Add a KVM_CAP to control WFx (WFI or WFE) trapping based on scheduler
> > runqueue depth. This is so they can be passed through if the runqueue
> > is shallow or the CPU has support for direct interrupt injection. They
> > may be always trapped by setting this value to 0. Technically this
> > means traps will be cleared when the runqueue depth is 0, but that
> > implies nothing is running anyway so there is no reason to care. The
> > default value is 1 to preserve previous behavior before adding this
> > option.
>
> I recently discovered that this was enabled by default, but it's not
> obvious to me everyone will want this enabled, so I'm in favour of
> figuring out a way to turn it off (in fact we might want to make this
> feature opt in as the status quo used to be to always trap).
>
> There are a few potential issues I see with having this enabled:
>
> - a lone vcpu thread on a CPU will completely screw up the host
> scheduler's load tracking metrics if the vCPU actually spends a
> significant amount of time in WFI (the PELT signal will no longer
> be a good proxy for "how much CPU time does this task need");
>
> - the scheduler's decision will impact massively the behaviour of the
> vcpu task itself. Co-scheduling a task with a vcpu task (or not) will
> impact massively the perceived behaviour of the vcpu task in a way
> that is entirely unpredictable to the scheduler;
>
> - while the above problems might be OK for some users, I don't think
> this will always be true, e.g. when running on big.LITTLE systems the
> above sounds nightmare-ish;
>
> - the guest spending long periods of time in WFI prevents the host from
> being able to enter deeper idle states, which will impact power very
> negatively;
>
> And probably a whole bunch of other things.
>
> > Think about his option as a threshold. The instruction will be trapped
> > if the runqueue depth is higher than the threshold.
>
> So talking about the exact interface, I'm not sure exposing this to
> userspace is really appropriate. The current rq depth is next to
> impossible for userspace to control well.
>
> My gut feeling tells me we might want to gate all of this on
> PREEMPT_FULL instead, since PREEMPT_FULL is pretty much a way to say
> "I'm willing to give up scheduler tracking accuracy to gain throughput
> when I've got a task running alone on a CPU". Thoughts?
And obviously I meant s/PREEMPT_FULL/NOHZ_FULL, but hopefully that was
clear :-)
Thanks for the feedback.
Quentin Perret <[email protected]> writes:
> On Friday 22 Mar 2024 at 14:24:35 (+0000), Quentin Perret wrote:
>> On Tuesday 19 Mar 2024 at 16:43:41 (+0000), Colton Lewis wrote:
>> > Add a KVM_CAP to control WFx (WFI or WFE) trapping based on scheduler
>> > runqueue depth. This is so they can be passed through if the runqueue
>> > is shallow or the CPU has support for direct interrupt injection. They
>> > may be always trapped by setting this value to 0. Technically this
>> > means traps will be cleared when the runqueue depth is 0, but that
>> > implies nothing is running anyway so there is no reason to care. The
>> > default value is 1 to preserve previous behavior before adding this
>> > option.
>> I recently discovered that this was enabled by default, but it's not
>> obvious to me everyone will want this enabled, so I'm in favour of
>> figuring out a way to turn it off (in fact we might want to make this
>> feature opt in as the status quo used to be to always trap).
Setting the introduced threshold to zero will cause it to trap whenever
something is running. Is there a problem with doing it that way?
I'd also be interested to get more input before changing the current
default behavior.
>> There are a few potential issues I see with having this enabled:
>> - a lone vcpu thread on a CPU will completely screw up the host
>> scheduler's load tracking metrics if the vCPU actually spends a
>> significant amount of time in WFI (the PELT signal will no longer
>> be a good proxy for "how much CPU time does this task need");
>> - the scheduler's decision will impact massively the behaviour of the
>> vcpu task itself. Co-scheduling a task with a vcpu task (or not) will
>> impact massively the perceived behaviour of the vcpu task in a way
>> that is entirely unpredictable to the scheduler;
>> - while the above problems might be OK for some users, I don't think
>> this will always be true, e.g. when running on big.LITTLE systems the
>> above sounds nightmare-ish;
>> - the guest spending long periods of time in WFI prevents the host from
>> being able to enter deeper idle states, which will impact power very
>> negatively;
>> And probably a whole bunch of other things.
>> > Think about his option as a threshold. The instruction will be trapped
>> > if the runqueue depth is higher than the threshold.
>> So talking about the exact interface, I'm not sure exposing this to
>> userspace is really appropriate. The current rq depth is next to
>> impossible for userspace to control well.
Using runqueue depth follows a suggestion from Oliver [1], whom I've
also talked to internally at Google a few times about this.
But hearing your comment makes me lean more towards having some
enumeration of behaviors like TRAP_ALWAYS, TRAP_NEVER,
TRAP_IF_MULTIPLE_TASKS.
>> My gut feeling tells me we might want to gate all of this on
>> PREEMPT_FULL instead, since PREEMPT_FULL is pretty much a way to say
>> "I'm willing to give up scheduler tracking accuracy to gain throughput
>> when I've got a task running alone on a CPU". Thoughts?
> And obviously I meant s/PREEMPT_FULL/NOHZ_FULL, but hopefully that was
> clear :-)
Sounds good to me but I've not touched anything scheduling related before.
[1] https://lore.kernel.org/kvmarm/[email protected]/
Hi Colton,
On Monday 25 Mar 2024 at 20:12:04 (+0000), Colton Lewis wrote:
> Thanks for the feedback.
>
> Quentin Perret <[email protected]> writes:
>
> > On Friday 22 Mar 2024 at 14:24:35 (+0000), Quentin Perret wrote:
> > > On Tuesday 19 Mar 2024 at 16:43:41 (+0000), Colton Lewis wrote:
> > > > Add a KVM_CAP to control WFx (WFI or WFE) trapping based on scheduler
> > > > runqueue depth. This is so they can be passed through if the runqueue
> > > > is shallow or the CPU has support for direct interrupt injection. They
> > > > may be always trapped by setting this value to 0. Technically this
> > > > means traps will be cleared when the runqueue depth is 0, but that
> > > > implies nothing is running anyway so there is no reason to care. The
> > > > default value is 1 to preserve previous behavior before adding this
> > > > option.
>
> > > I recently discovered that this was enabled by default, but it's not
> > > obvious to me everyone will want this enabled, so I'm in favour of
> > > figuring out a way to turn it off (in fact we might want to make this
> > > feature opt in as the status quo used to be to always trap).
>
> Setting the introduced threshold to zero will cause it to trap whenever
> something is running. Is there a problem with doing it that way?
No problem per se, I was simply hoping we could set the default to zero
to revert to the old behaviour. I don't think removing WFx traps was a
universally desirable behaviour, so it probably should have been opt-in
from the start.
> I'd also be interested to get more input before changing the current
> default behavior.
Ack, that is my personal opinion.
> > > There are a few potential issues I see with having this enabled:
>
> > > - a lone vcpu thread on a CPU will completely screw up the host
> > > scheduler's load tracking metrics if the vCPU actually spends a
> > > significant amount of time in WFI (the PELT signal will no longer
> > > be a good proxy for "how much CPU time does this task need");
>
> > > - the scheduler's decision will impact massively the behaviour of the
> > > vcpu task itself. Co-scheduling a task with a vcpu task (or not) will
> > > impact massively the perceived behaviour of the vcpu task in a way
> > > that is entirely unpredictable to the scheduler;
>
> > > - while the above problems might be OK for some users, I don't think
> > > this will always be true, e.g. when running on big.LITTLE systems the
> > > above sounds nightmare-ish;
>
> > > - the guest spending long periods of time in WFI prevents the host from
> > > being able to enter deeper idle states, which will impact power very
> > > negatively;
>
> > > And probably a whole bunch of other things.
>
> > > > Think about his option as a threshold. The instruction will be trapped
> > > > if the runqueue depth is higher than the threshold.
>
> > > So talking about the exact interface, I'm not sure exposing this to
> > > userspace is really appropriate. The current rq depth is next to
> > > impossible for userspace to control well.
>
> Using runqueue depth is going off of a suggestion from Oliver [1], who I've
> also talked to internally at Google a few times about this.
>
> But hearing your comment makes me lean more towards having some
> enumeration of behaviors like TRAP_ALWAYS, TRAP_NEVER,
> TRAP_IF_MULTIPLE_TASKS.
Do you guys really expect to set TRAP_IF_MULTIPLE_TASKS? Again, the
rq depth is quite hard to reason about from userspace, so I'm not sure
anybody will really want that. A simple on/off switch might be easier.
> > > My gut feeling tells me we might want to gate all of this on
> > > PREEMPT_FULL instead, since PREEMPT_FULL is pretty much a way to say
> > > "I'm willing to give up scheduler tracking accuracy to gain throughput
> > > when I've got a task running alone on a CPU". Thoughts?
>
> > And obviously I meant s/PREEMPT_FULL/NOHZ_FULL, but hopefully that was
> > clear :-)
>
> Sounds good to me but I've not touched anything scheduling related before.
Do you guys use NOHZ_FULL in prod? If not, that idea might very well be
a non-starter, because switching to NOHZ_FULL would be a big ask. So,
yeah, I'm curious :)
Thanks,
Quentin