2020-03-04 17:03:56

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 00/13] Core scheduling v5


Fifth iteration of the Core-Scheduling feature.

Core scheduling is a feature that only allows trusted tasks to run
concurrently on cpus sharing compute resources (eg: hyperthreads on a
core). The goal is to mitigate core-level side-channel attacks
without requiring SMT to be disabled (which has a significant impact on
performance in some situations). So far, the feature mitigates user-space
to user-space attacks but not user-space to kernel attacks, when one of
the hardware threads enters the kernel (syscall, interrupt etc).

By default, the feature doesn't change any of the current scheduler
behavior. The user decides which tasks can run simultaneously on the
same core (for now by having them in the same tagged cgroup). When
a tag is enabled in a cgroup and a task from that cgroup is running
on a hardware thread, the scheduler ensures that only idle or trusted
tasks run on the other sibling(s). Besides security concerns, this
feature can also be beneficial for RT and performance applications
where we want to control how tasks make use of SMT dynamically.
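
For illustration, a minimal sketch of the tagging flow just described,
assuming a v1-style cpu controller mounted at /sys/fs/cgroup/cpu and the
cpu.tag file added by this series; the group name and paths are only
examples and error handling is omitted:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd >= 0) {
		write(fd, val, strlen(val));
		close(fd);
	}
}

int main(void)
{
	char pid[32];

	/* Create a cgroup for a set of mutually trusting tasks and tag it. */
	mkdir("/sys/fs/cgroup/cpu/trusted_grp", 0755);
	write_str("/sys/fs/cgroup/cpu/trusted_grp/cpu.tag", "1");

	/*
	 * Move the current task into it.  Only tasks sharing this tag (or
	 * idle) may now run concurrently with it on the same core.
	 */
	snprintf(pid, sizeof(pid), "%d", getpid());
	write_str("/sys/fs/cgroup/cpu/trusted_grp/tasks", pid);

	return 0;
}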

This version focused on performance and stability. A couple of
crashes related to task tagging and the cpu hotplug path were fixed.
This version also improves performance considerably by making
task migration and load balancing coresched aware.

In terms of performance, the major difference since the last iteration
is that now even IO-heavy and mixed-resources workloads are less
impacted by core-scheduling than by disabling SMT. Both host-level and
VM-level benchmarks were performed. Details in:
https://lkml.org/lkml/2020/2/12/1194
https://lkml.org/lkml/2019/11/1/269

v5 is rebased on top of 5.5.5(449718782a46)
https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y

Changes in v5
-------------
- Fixes for cgroup/process tagging during corner cases like cgroup
  destroy, task moving across cgroups etc
  - Tim Chen
- Coresched aware task migrations
  - Aubrey Li
- Other minor stability fixes.

Changes in v4
-------------
- Implement a core wide min_vruntime for vruntime comparison of tasks
  across cpus in a core (see the sketch below this changelog).
  - Aaron Lu
- Fixes a typo bug in setting the forced_idle cpu.
  - Aaron Lu
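
As an aside on the core wide vruntime comparison above: raw vruntimes are
only comparable within a single cfs_rq, so the series keeps the siblings'
min_vruntime in sync core-wide. A rough, assumption-laden sketch of the
comparison (not the actual patch) looks like:

/*
 * Sketch only: decide which of two CFS tasks, possibly queued on different
 * siblings of one core, should run first.  Each task's vruntime is taken
 * relative to its own cfs_rq's min_vruntime so the two values become
 * comparable across runqueues.
 */
static bool sketch_cfs_prio_less(struct task_struct *a, struct task_struct *b)
{
	s64 delta_a = (s64)(a->se.vruntime - task_cfs_rq(a)->min_vruntime);
	s64 delta_b = (s64)(b->se.vruntime - task_cfs_rq(b)->min_vruntime);

	/* The task that has received less virtual runtime wins. */
	return delta_a < delta_b;
}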

Changes in v3
-------------
- Fixes the issue of sibling picking up an incompatible task
  - Aaron Lu
  - Vineeth Pillai
  - Julien Desfossez
- Fixes the issue of starving threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
-------------
- Fixes for a couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for processes on different cpus
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32bit build
  - Aubrey Li

ISSUES
------
- Aaron(Intel) found an issue with load balancing when the tasks have
  different weights (nice or cgroup shares). Task weight is not considered
  in coresched aware load balancing and causes those higher-weight tasks
  to starve.
- Joel(ChromeOS) found an issue where an RT task may be preempted by a
  lower-class task.
- Joel(ChromeOS) found a deadlock and crash on a PREEMPT kernel in the
  coresched idle balance logic

TODO
----
- Work on merging patches that are ready to be merged
- Decide on the API for exposing the feature to userland
- Experiment with adding synchronization points in VMEXIT to mitigate
  the VM-to-host-kernel leaking
- Investigate the source of the overhead even when no tasks are tagged:
  https://lkml.org/lkml/2019/10/29/242

---

Aaron Lu (2):
sched/fair: wrapper for cfs_rq->min_vruntime
sched/fair: core wide vruntime comparison

Aubrey Li (1):
sched: migration changes for core scheduling

Peter Zijlstra (9):
sched: Wrap rq::lock access
sched: Introduce sched_class::pick_task()
sched: Core-wide rq->lock
sched/fair: Add a few assertions
sched: Basic tracking of matching tasks
sched: Add core wide task selection and scheduling.
sched: Trivial forced-newidle balancer
sched: cgroup tagging interface for core scheduling
sched: Debug bits...

Tim Chen (1):
sched: Update core scheduler queue when taking cpu online/offline

include/linux/sched.h | 9 +-
kernel/Kconfig.preempt | 6 +
kernel/sched/core.c | 1037 +++++++++++++++++++++++++++++++++++++-
kernel/sched/cpuacct.c | 12 +-
kernel/sched/deadline.c | 69 ++-
kernel/sched/debug.c | 4 +-
kernel/sched/fair.c | 387 +++++++++++---
kernel/sched/idle.c | 11 +-
kernel/sched/pelt.h | 2 +-
kernel/sched/rt.c | 65 ++-
kernel/sched/sched.h | 248 +++++++--
kernel/sched/stop_task.c | 13 +-
kernel/sched/topology.c | 4 +-
13 files changed, 1672 insertions(+), 195 deletions(-)

--
2.17.1


2020-03-04 17:36:29

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

On 3/4/20 8:59 AM, vpillai wrote:

>
> ISSUES
> ------
> - Aaron(Intel) found an issue with load balancing when the tasks have

Just to set the record straight, Aaron works at Alibaba.

> different weights(nice or cgroup shares). Task weight is not considered
> in coresched aware load balancing and causes those higher weights task
> to starve.

Thanks.

Tim

2020-03-04 17:43:46

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

>
> Just to set the record straight, Aaron works at Alibaba.
>
Sorry about this. Thanks for the correction.

~Vineeth

2020-04-15 21:37:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
> TODO
> ----
> - Work on merging patches that are ready to be merged
> - Decide on the API for exposing the feature to userland
> - Experiment with adding synchronization points in VMEXIT to mitigate
> the VM-to-host-kernel leaking

VMEXIT is too late, you need to hook irq_enter(), which is what makes
the whole thing so horrible.

> - Investigate the source of the overhead even when no tasks are tagged:
> https://lkml.org/lkml/2019/10/29/242

- explain why we're all still doing this ....

Seriously, what actual problems does it solve? The patch-set still isn't
L1TF complete and afaict it does exactly nothing for MDS.

Like I've written many times now, back when the world was simpler and
all we had to worry about was L1TF, core-scheduling made some sense, but
how does it make sense today?

It's cute that this series sucks less than it did before, but what are
we trading that performance for?

2020-04-16 01:49:26

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

On Tue, Apr 14, 2020 at 04:21:52PM +0200, Peter Zijlstra wrote:
> On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
> > TODO
> > ----
> > - Work on merging patches that are ready to be merged
> > - Decide on the API for exposing the feature to userland
> > - Experiment with adding synchronization points in VMEXIT to mitigate
> > the VM-to-host-kernel leaking
>
> VMEXIT is too late, you need to hook irq_enter(), which is what makes
> the whole thing so horrible.

We came up with a patch to do this as well. Currently testing it more and it
looks clean, will share it soon.

> > - Investigate the source of the overhead even when no tasks are tagged:
> > https://lkml.org/lkml/2019/10/29/242
>
> - explain why we're all still doing this ....
>
> Seriously, what actual problems does it solve? The patch-set still isn't
> L1TF complete and afaict it does exactly nothing for MDS.

The L1TF incompleteness is because of a cross-HT attack from a Guest vCPU
attacker to an interrupt/softirq executing on the other sibling, correct? The
IRQ enter pausing the other sibling should fix that (which we will share in
a future series revision after adequate testing).

> Like I've written many times now, back when the world was simpler and
> all we had to worry about was L1TF, core-scheduling made some sense, but
> how does it make sense today?

For ChromeOS we're planning to tag each and every task separately except for
trusted processes, so we are isolating untrusted tasks even from each other.

Sorry if this sounds like pushing my usecase, but we do get a parallelism
advantage for the trusted tasks while still solving all security issues (for
ChromeOS). I agree that cross-HT user <-> kernel MDS is still an issue if
untrusted (tagged) tasks execute together on the same core, but we are not
planning to do that on our setup at least.

> It's cute that this series sucks less than it did before, but what are
> we trading that performance for?

AIUI, the performance now improves over noht (SMT disabled); I am told that
is the case in recent postings of the series.

thanks,

- Joel

2020-04-17 11:15:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

On Wed, Apr 15, 2020 at 12:32:20PM -0400, Joel Fernandes wrote:
> On Tue, Apr 14, 2020 at 04:21:52PM +0200, Peter Zijlstra wrote:
> > On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
> > > TODO
> > > ----
> > > - Work on merging patches that are ready to be merged
> > > - Decide on the API for exposing the feature to userland
> > > - Experiment with adding synchronization points in VMEXIT to mitigate
> > > the VM-to-host-kernel leaking
> >
> > VMEXIT is too late, you need to hook irq_enter(), which is what makes
> > the whole thing so horrible.
>
> We came up with a patch to do this as well. Currently testing it more and it
> looks clean, will share it soon.

Thomas said we actually first do VMEXIT, and then enable interrupts. So
the VMEXIT thing should actually work, and that is indeed much saner
than sticking it in irq_enter().

It does however put yet more nails in the out-of-tree hypervisors.

> > > - Investigate the source of the overhead even when no tasks are tagged:
> > > https://lkml.org/lkml/2019/10/29/242
> >
> > - explain why we're all still doing this ....
> >
> > Seriously, what actual problems does it solve? The patch-set still isn't
> > L1TF complete and afaict it does exactly nothing for MDS.
>
> The L1TF incompleteness is because of cross-HT attack from Guest vCPU
> attacker to an interrupt/softirq executing on the other sibling correct? The
> IRQ enter pausing the other sibling should fix that (which we will share in
> a future series revision after adequate testing).

Correct, the vCPU still running can glean host (kernel) state from the
sibling handling the interrupt in the host kernel.

> > Like I've written many times now, back when the world was simpler and
> > all we had to worry about was L1TF, core-scheduling made some sense, but
> > how does it make sense today?
>
> For ChromeOS we're planning to tag each and every task seperately except for
> trusted processes, so we are isolating untrusted tasks even from each other.
>
> Sorry if this sounds like pushing my usecase, but we do get parallelism
> advantage for the trusted tasks while still solving all security issues (for
> ChromeOS). I agree that cross-HT user <-> kernel MDS is still an issue if
> untrusted (tagged) tasks execute together on same core, but we are not
> planning to do that on our setup at least.

That doesn't completely solve things I think. Even if you run all
untrusted tasks as core exclusive, you still have a problem of them vs
interrupts on the other sibling.

You need to somehow arrange all interrupts to the core happen on the
same sibling that runs your untrusted task, such that the VERW on
return-to-userspace works as intended.

I suppose you can try and play funny games with interrupt routing tied
to the force-idle state, but I'm dreading what that'll look like. Or
were you going to handle this from your irq_enter() thing too?
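
(For reference, the VERW flush mentioned above is the MDS buffer clear done
on the return-to-userspace path; roughly the following x86 helper. It only
protects the thread that executes it at that boundary, which is why a
concurrently running sibling remains a problem.)

static inline void mds_clear_cpu_buffers(void)
{
	static const u16 ds = __KERNEL_DS;

	/*
	 * The memory-operand form of VERW is the one documented to flush
	 * the CPU buffers; the register form does not.
	 */
	asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
}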


Can someone go write up a document that very clearly states all the
problems and clearly explains how to use this feature?


2020-04-17 12:40:00

by Alexander Graf

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

On 17.04.20 13:12, Peter Zijlstra wrote:
> On Wed, Apr 15, 2020 at 12:32:20PM -0400, Joel Fernandes wrote:
>> On Tue, Apr 14, 2020 at 04:21:52PM +0200, Peter Zijlstra wrote:
>>> On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
>>>> TODO
>>>> ----
>>>> - Work on merging patches that are ready to be merged
>>>> - Decide on the API for exposing the feature to userland
>>>> - Experiment with adding synchronization points in VMEXIT to mitigate
>>>> the VM-to-host-kernel leaking
>>>
>>> VMEXIT is too late, you need to hook irq_enter(), which is what makes
>>> the whole thing so horrible.
>>
>> We came up with a patch to do this as well. Currently testing it more and it
>> looks clean, will share it soon.
>
> Thomas said we actually first do VMEXIT, and then enable interrupts. So
> the VMEXIT thing should actually work, and that is indeed much saner
> than sticking it in irq_enter().

If we first kick out the sibling HT for every #VMEXIT, performance will
be abysmal, no?

I know of a few options to make this work without the big hammer:


1) Leave interrupts disabled on "fast-path" exits. This can become
   very hard to grasp very quickly.

2) Patch the IRQ handlers (or build something more generic that
   installs a trampoline on all IRQ handler installations)

3) Ignore IRQ data exposure (what could possibly go wrong, it's not
   like your IRQ handler reads secret data from the network, right)

4) Create a "safe" page table which runs with HT enabled. Any access
   outside of the "safe" zone disables the sibling and switches to the
   "full" kernel page table. This should prevent any secret data from
   being fetched into caches/core buffers.

5) Create a KVM specific "safe zone": Keep improving the ASI patches
   and make only the ASI environment safe for HT, everything else not.

Has there been any progress on 4? It sounded like the most generic
option ...

>
> It does however put yet more nails in the out-of-tree hypervisors.
>
>>>> - Investigate the source of the overhead even when no tasks are tagged:
>>>> https://lkml.org/lkml/2019/10/29/242
>>>
>>> - explain why we're all still doing this ....
>>>
>>> Seriously, what actual problems does it solve? The patch-set still isn't
>>> L1TF complete and afaict it does exactly nothing for MDS.
>>
>> The L1TF incompleteness is because of cross-HT attack from Guest vCPU
>> attacker to an interrupt/softirq executing on the other sibling correct? The
>> IRQ enter pausing the other sibling should fix that (which we will share in
>> a future series revision after adequate testing).
>
> Correct, the vCPU still running can glean host (kernel) state from the
> sibling handling the interrupt in the host kernel.
>
>>> Like I've written many times now, back when the world was simpler and
>>> all we had to worry about was L1TF, core-scheduling made some sense, but
>>> how does it make sense today?
>>
>> For ChromeOS we're planning to tag each and every task seperately except for
>> trusted processes, so we are isolating untrusted tasks even from each other.
>>
>> Sorry if this sounds like pushing my usecase, but we do get parallelism
>> advantage for the trusted tasks while still solving all security issues (for
>> ChromeOS). I agree that cross-HT user <-> kernel MDS is still an issue if
>> untrusted (tagged) tasks execute together on same core, but we are not
>> planning to do that on our setup at least.
>
> That doesn't completely solve things I think. Even if you run all
> untrusted tasks as core exclusive, you still have a problem of them vs
> interrupts on the other sibling.
>
> You need to somehow arrange all interrupts to the core happen on the
> same sibling that runs your untrusted task, such that the VERW on
> return-to-userspace works as intended.
>
> I suppose you can try and play funny games with interrupt routing tied
> to the force-idle state, but I'm dreading what that'll look like. Or
> were you going to handle this from your irq_enter() thing too?

I'm not sure I follow. We have thread local interrupts (timers, IPIs)
and device interrupts (network, block, etc).

Thread local ones shouldn't transfer too much knowledge, so I'd be
inclined to say we can just ignore that attack vector.

Device interrupts we can easily route to HT0. If we now make "core
exclusive" a synonym for "always run on HT0", we can guarantee that they
always land on the same CPU, no?

Then you don't need to hook into any idle state tracking, because you
always know which CPU the "safe" one to both schedule tasks and route
interrupts to is.
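
For concreteness, a hedged userspace sketch of that routing step (the IRQ
number, target CPU, and helper name below are illustrative only; a real
setup would walk /proc/irq/ and pick the "safe" sibling of each core):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Pin one device IRQ to a chosen CPU via the standard procfs knob. */
static int route_irq_to_cpu(int irq, int cpu)
{
	char path[64], buf[16];
	int fd, ok;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity_list", irq);
	snprintf(buf, sizeof(buf), "%d", cpu);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ok = write(fd, buf, strlen(buf)) == (ssize_t)strlen(buf);
	close(fd);
	return ok ? 0 : -1;
}

Writing to /proc/irq/<n>/smp_affinity_list is the existing affinity
interface; the open question raised above is how to keep such routing
consistent with which sibling the scheduler treats as the "safe" one.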


Alex






2020-04-17 13:13:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

On Fri, Apr 17, 2020 at 02:35:38PM +0200, Alexander Graf wrote:
> On 17.04.20 13:12, Peter Zijlstra wrote:

> If we first kick out the sibling HT for every #VMEXIT, performance will be
> abysmal, no?

I've been given to understand that people serious about virt try really
hard to avoid VMEXIT.


> > That doesn't completely solve things I think. Even if you run all
> > untrusted tasks as core exclusive, you still have a problem of them vs
> > interrupts on the other sibling.
> >
> > You need to somehow arrange all interrupts to the core happen on the
> > same sibling that runs your untrusted task, such that the VERW on
> > return-to-userspace works as intended.
> >
> > I suppose you can try and play funny games with interrupt routing tied
> > to the force-idle state, but I'm dreading what that'll look like. Or
> > were you going to handle this from your irq_enter() thing too?
>
> I'm not sure I follow. We have thread local interrupts (timers, IPIs) and
> device interrupts (network, block, etc).
>
> Thread local ones shouldn't transfer too much knowledge, so I'd be inclined
> to say we can just ignore that attack vector.
>
> Device interrupts we can easily route to HT0. If we now make "core
> exclusive" a synonym for "always run on HT0", we can guarantee that they
> always land on the same CPU, no?
>
> Then you don't need to hook into any idle state tracking, because you always
> know which CPU the "safe" one to both schedule tasks and route interrupts to
> is.

That would come apart most mighty when someone does an explicit
sched_setaffinity() for !HT0.

While that might work for some relatively contained systems like
chromeos, it will not work in general I think.

2020-04-18 02:29:47

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

Hi Peter,

On Fri, Apr 17, 2020 at 01:12:55PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 15, 2020 at 12:32:20PM -0400, Joel Fernandes wrote:
> > On Tue, Apr 14, 2020 at 04:21:52PM +0200, Peter Zijlstra wrote:
> > > On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
> > > > TODO
> > > > ----
> > > > - Work on merging patches that are ready to be merged
> > > > - Decide on the API for exposing the feature to userland
> > > > - Experiment with adding synchronization points in VMEXIT to mitigate
> > > > the VM-to-host-kernel leaking
> > >
> > > VMEXIT is too late, you need to hook irq_enter(), which is what makes
> > > the whole thing so horrible.
> >
> > We came up with a patch to do this as well. Currently testing it more and it
> > looks clean, will share it soon.
>
> Thomas said we actually first do VMEXIT, and then enable interrupts. So
> the VMEXIT thing should actually work, and that is indeed much saner
> than sticking it in irq_enter().
>
> It does however put yet more nails in the out-of-tree hypervisors.

Just to clarify what we're talking about here. The condition we are trying to
protect against:

1. VM is malicious.
2. Sibling of VM is entering an interrupt on the host.
3. When we enter the interrupt, we send an IPI to force the VM into waiting.
4. The VM on the sibling enters VMEXIT.

In step 4, we have to synchronize. Is this the scenario we are discussing?

> > > > - Investigate the source of the overhead even when no tasks are tagged:
> > > > https://lkml.org/lkml/2019/10/29/242
> > >
> > > - explain why we're all still doing this ....
> > >
> > > Seriously, what actual problems does it solve? The patch-set still isn't
> > > L1TF complete and afaict it does exactly nothing for MDS.
> >
> > The L1TF incompleteness is because of cross-HT attack from Guest vCPU
> > attacker to an interrupt/softirq executing on the other sibling correct? The
> > IRQ enter pausing the other sibling should fix that (which we will share in
> > a future series revision after adequate testing).
>
> Correct, the vCPU still running can glean host (kernel) state from the
> sibling handling the interrupt in the host kernel.

Right. This is what we're handling.

> > > Like I've written many times now, back when the world was simpler and
> > > all we had to worry about was L1TF, core-scheduling made some sense, but
> > > how does it make sense today?
> >
> > For ChromeOS we're planning to tag each and every task seperately except for
> > trusted processes, so we are isolating untrusted tasks even from each other.
> >
> > Sorry if this sounds like pushing my usecase, but we do get parallelism
> > advantage for the trusted tasks while still solving all security issues (for
> > ChromeOS). I agree that cross-HT user <-> kernel MDS is still an issue if
> > untrusted (tagged) tasks execute together on same core, but we are not
> > planning to do that on our setup at least.
>
> That doesn't completely solve things I think. Even if you run all
> untrusted tasks as core exclusive, you still have a problem of them vs
> interrupts on the other sibling.
>
> You need to somehow arrange all interrupts to the core happen on the
> same sibling that runs your untrusted task, such that the VERW on
> return-to-userspace works as intended.
>
> I suppose you can try and play funny games with interrupt routing tied
> to the force-idle state, but I'm dreading what that'll look like. Or
> were you going to handle this from your irq_enter() thing too?

Yes, even when a host interrupt is on one sibling and an untrusted host
process is on the other sibling, we would be handling it the same way we
handle host interrupts vs untrusted guests. Perhaps we could optimize
pausing of the guest, but Vineeth tested and found that the same code that
pauses hosts also works for guests.

> Can someone go write up a document that very clearly states all the
> problems and clearly explains how to use this feature?

A document on how to use the feature would make sense. Maybe we can add it as
a documentation patch to the series?

Basically, from my notes the following are the problems:

Core-scheduling will help with cross-HT MDS and L1TF attacks. The following
are the scenarios (borrowed from an email from Thomas -- thanks!):

       HT1 (attack)           HT2 (victim)

 A     idle -> user space     user space -> idle

 B     idle -> user space     guest -> idle

 C     idle -> guest          user space -> idle

 D     idle -> guest          guest -> idle

All of them suffer from MDS. #C and #D suffer from L1TF.

All of these scenarios result in the victim getting idled to prevent any
leakage. However, this does not address the case where the victim is an
interrupt handler or softirq. So we need to either route interrupts to run on
the attacker CPU, or have irq_enter() send IPIs to pause the sibling,
which is what we prototyped. Another approach to solve this is to force
interrupts into threaded mode, as Thomas suggested.

The problematic usecase then left (if we ignore IRQ troubles) is MDS issues
between simultaneous user <-> kernel execution on both siblings. This does
not become an issue on ChromeOS, where everything untrusted has its own tag.
From my discussions with Vineeth, this isn't a problem for his VM workloads
either, as he mentioned the hyperthreads are both running guests and 2 vCPU
threads that belong to different VMs will not execute on the same core
(though I'm not sure whether hypercalls from a vCPU, when the sibling is
running another vCPU of the same VM, are a concern here).

Any other case that needs to be considered?

thanks,

- Joel

2020-05-09 03:42:52

by Ning, Hongyu

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5


- Test environment:
Intel Xeon Server platform
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 4

- Kernel under test:
Core scheduling v5 base
https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y

- Test set based on sysbench 1.1.0-bd4b418:
A: sysbench cpu in cgroup cpu 1 + sysbench mysql in cgroup mysql 1 (192 workload tasks for each cgroup)
B: sysbench cpu in cgroup cpu 1 + sysbench cpu in cgroup cpu 2 + sysbench mysql in cgroup mysql 1 + sysbench mysql in cgroup mysql 2 (192 workload tasks for each cgroup)

- Test results briefing:
1 Good results:
1.1 For test set A, coresched could achieve same or better performance compared to smt_off, for both cpu workload and mysql workload
1.2 For test set B, cpu workload, coresched could achieve better performance compared to smt_off

2 Bad results:
2.1 For test set B, mysql workload, coresched performance is lower than smt_off, potential fairness issue between cpu workloads and mysql workloads
2.2 For test set B, cpu workload, potential fairness issue between 2 cgroups cpu workloads

- Test results:
Note: test results in following tables are Tput normalized to default baseline

-- Test set A Tput normalized results:
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+
| | **** | default | coresched | smt_off | *** | default | coresched | smt_off |
+====================+========+===========+=============+===========+=======+=============+===============+=============+
| cgroups | **** | cg cpu 1 | cg cpu 1 | cg cpu 1 | *** | cg mysql 1 | cg mysql 1 | cg mysql 1 |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+
| sysbench workload | **** | cpu | cpu | cpu | *** | mysql | mysql | mysql |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+
| 192 tasks / cgroup | **** | 1 | 0.95 | 0.54 | *** | 1 | 0.92 | 0.97 |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+

-- Test set B Tput normalized results:
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+
| | **** | default | coresched | smt_off | *** | default | coresched | smt_off | ** | default | coresched | smt_off | * | default | coresched | smt_off |
+====================+========+===========+=============+===========+=======+=============+===============+=============+======+=============+===============+=============+=====+=============+===============+=============+
| cgroups | **** | cg cpu 1 | cg cpu 1 | cg cpu 1 | *** | cg cpu 2 | cg cpu 2 | cg cpu 2 | ** | cg mysql 1 | cg mysql 1 | cg mysql 1 | * | cg mysql 2 | cg mysql 2 | cg mysql 2 |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+
| sysbench workload | **** | cpu | cpu | cpu | *** | cpu | cpu | cpu | ** | mysql | mysql | mysql | * | mysql | mysql | mysql |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+
| 192 tasks / cgroup | **** | 1 | 0.9 | 0.47 | *** | 1 | 1.32 | 0.66 | ** | 1 | 0.42 | 0.89 | * | 1 | 0.42 | 0.89 |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+


> On Date: Wed, 4 Mar 2020 16:59:50 +0000, vpillai <[email protected]> wrote:
> To: Nishanth Aravamudan <[email protected]>, Julien Desfossez <[email protected]>, Peter Zijlstra <[email protected]>, Tim Chen <[email protected]>, [email protected], [email protected], [email protected], [email protected]
> CC: vpillai <[email protected]>, [email protected], [email protected], [email protected], [email protected], Phil Auld <[email protected]>, Aaron Lu <[email protected]>, Aubrey Li <[email protected]>, [email protected], Valentin Schneider <[email protected]>, Mel Gorman <[email protected]>, Pawan Gupta <[email protected]>, Paolo Bonzini <[email protected]>, Joel Fernandes <[email protected]>, [email protected]
>
>
> Fifth iteration of the Core-Scheduling feature.
>
> Core scheduling is a feature that only allows trusted tasks to run
> concurrently on cpus sharing compute resources(eg: hyperthreads on a
> core). The goal is to mitigate the core-level side-channel attacks
> without requiring to disable SMT (which has a significant impact on
> performance in some situations). So far, the feature mitigates user-space
> to user-space attacks but not user-space to kernel attack, when one of
> the hardware thread enters the kernel (syscall, interrupt etc).
>
> By default, the feature doesn't change any of the current scheduler
> behavior. The user decides which tasks can run simultaneously on the
> same core (for now by having them in the same tagged cgroup). When
> a tag is enabled in a cgroup and a task from that cgroup is running
> on a hardware thread, the scheduler ensures that only idle or trusted
> tasks run on the other sibling(s). Besides security concerns, this
> feature can also be beneficial for RT and performance applications
> where we want to control how tasks make use of SMT dynamically.
>
> This version was focusing on performance and stability. Couple of
> crashes related to task tagging and cpu hotplug path were fixed.
> This version also improves the performance considerably by making
> task migration and load balancing coresched aware.
>
> In terms of performance, the major difference since the last iteration
> is that now even IO-heavy and mixed-resources workloads are less
> impacted by core-scheduling than by disabling SMT. Both host-level and
> VM-level benchmarks were performed. Details in:
> https://lkml.org/lkml/2020/2/12/1194
> https://lkml.org/lkml/2019/11/1/269
>
> v5 is rebased on top of 5.5.5(449718782a46)
> https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y
>

2020-05-09 14:37:48

by Dario Faggioli

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

On Tue, 2020-04-14 at 16:21 +0200, Peter Zijlstra wrote:
> On Wed, Mar 04, 2020 at 04:59:50PM +0000, vpillai wrote:
> >
> > - Investigate the source of the overhead even when no tasks are
> > tagged:
> > https://lkml.org/lkml/2019/10/29/242
>
> - explain why we're all still doing this ....
>
> Seriously, what actual problems does it solve? The patch-set still
> isn't
> L1TF complete and afaict it does exactly nothing for MDS.
>
Hey Peter! Late to the party, I know...

But I'm replying anyway. At least, you'll have the chance to yell at me
for this during OSPM. ;-P

> Like I've written many times now, back when the world was simpler and
> all we had to worry about was L1TF, core-scheduling made some sense,
> but
> how does it make sense today?
>
Indeed core-scheduling alone doesn't even completely solve L1TF. There
are the interrupts and the VMEXITs issues. Both are being discussed in
this thread and, FWIW, my personal opinion is that the way to go is
what Alex says here:

<[email protected]>

(E.g., when he mentions solution 4 "Create a "safe" page table which
runs with HT enabled", etc).

But let's stick to your point: if it were only for L1TF, then fine, but
it's all pointless because of MDS. My answer to this is very much
focused on my usecase, which is virtualization. I know you hate us, and
you surely have your good reasons, but you know... :-)

Correct me if I'm wrong, but I think that the "nice" thing of L1TF is
that it allows a VM to spy on another VM or on the host, but it does
not allow a regular task to spy on another task or on the kernel (well,
it would, but it's easily mitigated).

The bad thing about MDS is that it instead allows *all* of that.

Now, one thing that we absolutely want to avoid in virt is that a VM is
able to spy on other VMs or on the host. Sure, we also care about tasks
running in our VMs being safe, but, really, inter-VM and VM-to-host
isolation is the primary concern of a hypervisor.

And how can a VM (or stuff running inside a VM) spy on another VM or on
the host, via L1TF or MDS? Well, if the attacker VM and the victim VM
--or if the attacker VM and the host-- are running on the same core. If
they're not, it can't... which is basically an L1TF-only looking
scenario.

So, in virt, core-scheduling:
1) is the *only* way (aside from no-EPT) to prevent an attacker VM from
   spying on a victim VM, if they're running concurrently, both in guest
   mode, on the same core (and that's, of course, because with
   core-scheduling they just won't be doing that :-) )
2) interrupts and VMEXITs need to be taken care of --which was already
   the case when, as you said, "we had only L1TF". Once that is done
   we will effectively prevent all VM-to-VM and VM-to-host attack
   scenarios.

Sure, it will still be possible, for instance, for task_A in VM1 to spy
on task_B, also in VM1. This seems to be, AFAIUI, Joel's usecase, so
I'm happy to leave it to him to defend that, as he's doing already (but
indeed I'm very happy to see that it is also getting attention).

Now, of course saying anything like "works for my own usecase so let's
go for it" does not fly. But since you were asking whether and how this
feature could make sense today, suppose that:
1) we get core-scheduling,
2) we find a solution for irqs and VMEXITs, as we would have to if
there was only L1TF,
3) we manage to make the overhead of core-scheduling close to zero
when it's there (I mean, enabled at compile time) but not used (I
mean, no tagging of tasks, or whatever).

That would mean that virt people can enable core-scheduling, and
achieve good inter-VM and VM-to-host isolation, without imposing
overhead to other use cases, that would leave core-scheduling disabled.

And this is something that I would think it makes sense.

Of course, we're not there... because even when this series will give
us point 1, we will also need 2 and we need to make sure we also
satisfy 3 (and we weren't, last time I checked ;-P).

But I think it's worth keeping trying.

I'd also add a couple more ideas, still about core-scheduling in
virt, but from a different standpoint than security:
- if I tag vcpu0 and vcpu1 together[*], then vcpu2 and vcpu3 together,
  then vcpu4 and vcpu5 together, then I'm sure that each pair will
  always be scheduled on the same core. At which point I can define
  an SMT virtual topology, for the VM, that will make sense, even
  without pinning the vcpus;
- if I run VMs from different customers, when vcpu2 of VM1 and vcpu1
  of VM2 run on the same core, they influence each others' performance.
  If, e.g., I bill based on time spent on CPUs, it means customer
  A's workload, running in VM1, may influence the billing of customer
  B, who owns VM2. With core scheduling, if I tag all the vcpus of each
  VM together, I won't have this any longer.

[*] with "tag together" I mean let them have the same tag which, ATM
    would be "put them in the same cgroup and enable cpu.tag".

Whether or not these make sense, e.g., performance wise, it's a bit
hard to tell, with the feature not yet finalized... But I've started
doing some preliminary measurements already. Hopefully, they'll be
ready by Monday.

So that's it. I hope this gives you enough material to complain about
during OSPM. At least, given the event is virtual, I won't get any
microphone box (or, worse, frozen sharks!) thrown at me in anger! :-D

Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



2020-05-11 00:02:34

by Joel Fernandes

[permalink] [raw]
Subject: [PATCH RFC] Add support for core-wide protection of IRQ and softirq

With the current core scheduling patchset, non-threaded IRQ and softirq
victims can leak data from their hyperthread to a sibling hyperthread
running an attacker.

For MDS, it is possible for the IRQ and softirq handlers to leak data to
either host or guest attackers. For L1TF, it is possible to leak to
guest attackers. There is no possible mitigation involving flushing of
buffers to avoid this since the execution of attacker and victims happens
concurrently on 2 or more HTs.

The solution in this patch is to monitor the outer-most core-wide
irq_enter() and irq_exit() executed by any sibling. In between these
two, we mark the core to be in a special core-wide IRQ state.

In the IRQ entry, if we detect that the sibling is running untrusted
code, we send a reschedule IPI so that the sibling transitions through
the sibling's irq_exit() to do any waiting there, till the IRQ being
protected finishes.

We also monitor the per-CPU outer-most irq_exit(). If during the per-cpu
outer-most irq_exit(), the core is still in the special core-wide IRQ
state, we perform a busy-wait till the core exits this state. This
combination of per-cpu and core-wide IRQ states helps to handle any
combination of irq_entry()s and irq_exit()s happening on all of the
siblings of the core in any order.

Lastly, we also check in the schedule loop if we are about to schedule
an untrusted process while the core is in such a state. This is possible
if a trusted thread enters the scheduler by way of yielding the CPU. This
would involve no transitions through the irq_exit() point to do any
waiting, so we have to explicitly do the waiting there.

Every attempt is made to prevent a busy-wait unnecessarily, and in
testing on real-world ChromeOS usecases, it has not shown a performance
drop. In ChromeOS, with this and the rest of the core scheduling
patchset, we see around a 300% improvement in key press latencies into
Google docs when Camera streaming is running simultaneously (90th
percentile latency of ~150ms drops to ~50ms).

Cc: Paul E. McKenney <[email protected]>
Co-developed-by: Vineeth Pillai <[email protected]>
Signed-off-by: Vineeth Pillai <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>


---
If you would like some pictures of the cases handled by this patch, please
see the OSPM slide deck (the below link jumps straight to the relevant
slides - about 6-7 of them in total): https://bit.ly/2zvzxWk

TODO:
1. Any optimizations for VM usecases (can we do something better than
   the scheduler IPI?)
2. Waiting in schedule() can likely be optimized, e.g. no need to
   wait if the previous task was idle, as there would have been an IRQ
   involved with the wake up of the next task.

include/linux/sched.h | 8 +++
kernel/sched/core.c | 159 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 3 +
kernel/softirq.c | 2 +
4 files changed, 172 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 710e9a8956007..fe6ae59fcadbe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2018,4 +2018,12 @@ int sched_trace_rq_cpu(struct rq *rq);

const struct cpumask *sched_trace_rd_span(struct root_domain *rd);

+#ifdef CONFIG_SCHED_CORE
+void sched_core_irq_enter(void);
+void sched_core_irq_exit(void);
+#else
+static inline void sched_core_irq_enter(void) { }
+static inline void sched_core_irq_exit(void) { }
+#endif
+
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 21c640170323b..e06195dcca7a0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4391,6 +4391,153 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
return a->core_cookie == b->core_cookie;
}

+/*
+ * Helper function to pause the caller's hyperthread until the core exits the
+ * core-wide IRQ state. Obviously the CPU calling this function should not be
+ * responsible for the core being in the core-wide IRQ state otherwise it will
+ * deadlock. This function should be called from irq_exit() and from schedule().
+ * It is up to the callers to decide if calling here is necessary.
+ */
+static inline void sched_core_sibling_irq_pause(struct rq *rq)
+{
+	/*
+	 * Wait till the core of this HT is not in a core-wide IRQ state.
+	 *
+	 * Pair with smp_store_release() in sched_core_irq_exit().
+	 */
+	while (smp_load_acquire(&rq->core->core_irq_nest) > 0)
+		cpu_relax();
+}
+
+/*
+ * Enter the core-wide IRQ state. Sibling will be paused if it is running
+ * 'untrusted' code, until sched_core_irq_exit() is called. Every attempt to
+ * avoid sending useless IPIs is made. Must be called only from hard IRQ
+ * context.
+ */
+void sched_core_irq_enter(void)
+{
+	int i, cpu = smp_processor_id();
+	struct rq *rq = cpu_rq(cpu);
+	const struct cpumask *smt_mask;
+
+	if (!sched_core_enabled(rq))
+		return;
+
+	/* Count irq_enter() calls received without irq_exit() on this CPU. */
+	rq->core_this_irq_nest++;
+
+	/* If not outermost irq_enter(), do nothing. */
+	if (rq->core_this_irq_nest != 1 ||
+	    WARN_ON_ONCE(rq->core->core_this_irq_nest == UINT_MAX))
+		return;
+
+	raw_spin_lock(rq_lockp(rq));
+	smt_mask = cpu_smt_mask(cpu);
+
+	/* Contribute this CPU's irq_enter() to core-wide irq_enter() count. */
+	WRITE_ONCE(rq->core->core_irq_nest, rq->core->core_irq_nest + 1);
+	if (WARN_ON_ONCE(rq->core->core_irq_nest == UINT_MAX))
+		goto unlock;
+
+	if (rq->core_pause_pending) {
+		/*
+		 * Do nothing more since we are in a 'reschedule IPI' sent from
+		 * another sibling. That sibling would have sent IPIs to all of
+		 * the HTs.
+		 */
+		goto unlock;
+	}
+
+	/*
+	 * If we are not the first ones on the core to enter core-wide IRQ
+	 * state, do nothing.
+	 */
+	if (rq->core->core_irq_nest > 1)
+		goto unlock;
+
+	/* Do nothing more if the core is not tagged. */
+	if (!rq->core->core_cookie)
+		goto unlock;
+
+	for_each_cpu(i, smt_mask) {
+		struct rq *srq = cpu_rq(i);
+
+		if (i == cpu || cpu_is_offline(i))
+			continue;
+
+		if (!srq->curr->mm || is_idle_task(srq->curr))
+			continue;
+
+		/* Skip if HT is not running a tagged task. */
+		if (!srq->curr->core_cookie && !srq->core_pick)
+			continue;
+
+		/* IPI only if previous IPI was not pending. */
+		if (!srq->core_pause_pending) {
+			srq->core_pause_pending = 1;
+			smp_send_reschedule(i);
+		}
+	}
+unlock:
+	raw_spin_unlock(rq_lockp(rq));
+}
+
+/*
+ * Process any work needed for either exiting the core-wide IRQ state, or for
+ * waiting on this hyperthread if the core is still in this state.
+ */
+void sched_core_irq_exit(void)
+{
+	int cpu = smp_processor_id();
+	struct rq *rq = cpu_rq(cpu);
+	bool wait_here = false;
+	unsigned int nest;
+
+	/* Do nothing if core-sched disabled. */
+	if (!sched_core_enabled(rq))
+		return;
+
+	rq->core_this_irq_nest--;
+
+	/* If not outermost on this CPU, do nothing. */
+	if (rq->core_this_irq_nest > 0 ||
+	    WARN_ON_ONCE(rq->core_this_irq_nest == UINT_MAX))
+		return;
+
+	raw_spin_lock(rq_lockp(rq));
+	/*
+	 * Core-wide nesting counter can never be 0 because we are
+	 * still in it on this CPU.
+	 */
+	nest = rq->core->core_irq_nest;
+	WARN_ON_ONCE(!nest);
+
+	/*
+	 * If we still have other CPUs in IRQs, we have to wait for them.
+	 * Either here, or in the scheduler.
+	 */
+	if (rq->core->core_cookie && nest > 1) {
+		/*
+		 * If we are entering the scheduler anyway, we can just wait
+		 * there for ->core_irq_nest to reach 0. If not, just wait here.
+		 */
+		if (!tif_need_resched()) {
+			wait_here = true;
+		}
+	}
+
+	if (rq->core_pause_pending)
+		rq->core_pause_pending = 0;
+
+	/* Pair with smp_load_acquire() in sched_core_sibling_irq_pause(). */
+	smp_store_release(&rq->core->core_irq_nest, nest - 1);
+	raw_spin_unlock(rq_lockp(rq));
+
+	if (wait_here)
+		sched_core_sibling_irq_pause(rq);
+}
+
// XXX fairness/fwd progress conditions
/*
* Returns
@@ -4910,6 +5057,18 @@ static void __sched notrace __schedule(bool preempt)
 		rq_unlock_irq(rq, &rf);
 	}

+#ifdef CONFIG_SCHED_CORE
+	/*
+	 * If a CPU that was running a trusted task entered the scheduler, and
+	 * the next task is untrusted, then check if waiting for core-wide IRQ
+	 * state to cease is needed since we would not have been able to get
+	 * the services of irq_exit() to do that waiting.
+	 */
+	if (sched_core_enabled(rq) &&
+	    !is_idle_task(next) && next->mm && next->core_cookie)
+		sched_core_sibling_irq_pause(rq);
+#endif
+
 	balance_callback(rq);
 }

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a7d9f156242e2..3a065d133ef51 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1018,11 +1018,14 @@ struct rq {
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
 	unsigned char		core_forceidle;
+	unsigned char		core_pause_pending;
+	unsigned int		core_this_irq_nest;

 	/* shared state */
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
+	unsigned int		core_irq_nest;
 #endif
 };

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 0427a86743a46..b953386c8f62f 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -345,6 +345,7 @@ asmlinkage __visible void do_softirq(void)
 void irq_enter(void)
 {
 	rcu_irq_enter();
+	sched_core_irq_enter();
 	if (is_idle_task(current) && !in_interrupt()) {
 		/*
 		 * Prevent raise_softirq from needlessly waking up ksoftirqd
@@ -413,6 +414,7 @@ void irq_exit(void)
 	invoke_softirq();

 	tick_irq_exit();
+	sched_core_irq_exit();
 	rcu_irq_exit();
 	trace_hardirq_exit(); /* must be last! */
 }
--
2.26.2.645.ge9eca65c58-goog

2020-05-11 13:54:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC] Add support for core-wide protection of IRQ and softirq

On Sun, May 10, 2020 at 07:46:52PM -0400, Joel Fernandes (Google) wrote:
> With current core scheduling patchset, non-threaded IRQ and softirq
> victims can leak data from its hyperthread to a sibling hyperthread
> running an attacker.
>
> For MDS, it is possible for the IRQ and softirq handlers to leak data to
> either host or guest attackers. For L1TF, it is possible to leak to
> guest attackers. There is no possible mitigation involving flushing of
> buffers to avoid this since the execution of attacker and victims happen
> concurrently on 2 or more HTs.
>
> The solution in this patch is to monitor the outer-most core-wide
> irq_enter() and irq_exit() executed by any sibling. In between these
> two, we mark the core to be in a special core-wide IRQ state.

Another possible option is force_irqthreads :-) That would cure it
nicely.

Anyway, I'll go read this.
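
(For context, forced interrupt threading is the existing mechanism behind
the threadirqs boot parameter: it flips a global so that request_irq() runs
handlers in kernel threads rather than hard IRQ context. Roughly, and only
as a sketch of the existing switch, not exact upstream code:)

static bool force_irqthreads;

static int __init setup_forced_irqthreads(char *arg)
{
	force_irqthreads = true;
	return 0;
}
early_param("threadirqs", setup_forced_irqthreads);

With handlers running as threads they become ordinary, taggable tasks from
core scheduling's point of view, which is why this would cure the IRQ leak.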

2020-05-11 14:57:31

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC] Add support for core-wide protection of IRQ and softirq

On Mon, May 11, 2020 at 9:49 AM Peter Zijlstra <[email protected]> wrote:
>
> On Sun, May 10, 2020 at 07:46:52PM -0400, Joel Fernandes (Google) wrote:
> > With current core scheduling patchset, non-threaded IRQ and softirq
> > victims can leak data from its hyperthread to a sibling hyperthread
> > running an attacker.
> >
> > For MDS, it is possible for the IRQ and softirq handlers to leak data to
> > either host or guest attackers. For L1TF, it is possible to leak to
> > guest attackers. There is no possible mitigation involving flushing of
> > buffers to avoid this since the execution of attacker and victims happen
> > concurrently on 2 or more HTs.
> >
> > The solution in this patch is to monitor the outer-most core-wide
> > irq_enter() and irq_exit() executed by any sibling. In between these
> > two, we mark the core to be in a special core-wide IRQ state.
>
> Another possible option is force_irqthreads :-) That would cure it
> nicely.

Yes true, it was definitely my "plan B" at one point if this patch
showed any regression. Also, people not doing force_irqthreads would
still leave a hole open, and it'd be nice to solve it by default rather
than depending on user/sysadmin configuration (the same argument applies
to interrupt affinities: it is another knob for the sysadmin/designer to
configure correctly, and not all interrupts can be threaded / affinitized
anyway).

Thanks in advance for reviewing the patch,

- Joel

2020-05-15 00:59:44

by Gruza, Agata

[permalink] [raw]
Subject: FW: [RFC PATCH 00/13] Core scheduling v5



-----Original Message-----
From: [email protected] <[email protected]> On Behalf Of Ning, Hongyu
Sent: Friday, May 8, 2020 8:40 PM
To: [email protected]; [email protected]; [email protected]; [email protected]; Tim Chen <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]
Cc: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; Li, Aubrey <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5


- Test environment:
Intel Xeon Server platform
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 4

- Kernel under test:
Core scheduling v5 base
https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y

- Test set based on sysbench 1.1.0-bd4b418:
A: sysbench cpu in cgroup cpu 1 + sysbench mysql in cgroup mysql 1 (192 workload tasks for each cgroup)
B: sysbench cpu in cgroup cpu 1 + sysbench cpu in cgroup cpu 2 + sysbench mysql in cgroup mysql 1 + sysbench mysql in cgroup mysql 2 (192 workload tasks for each cgroup)

- Test results briefing:
1 Good results:
1.1 For test set A, coresched could achieve same or better performance compared to smt_off, for both cpu workload and sysbench workload
1.2 For test set B, cpu workload, coresched could achieve better performance compared to smt_off

2 Bad results:
2.1 For test set B, mysql workload, coresched performance is lower than smt_off, potential fairness issue between cpu workloads and mysql workloads
2.2 For test set B, cpu workload, potential fairness issue between 2 cgroups cpu workloads

- Test results:
Note: test results in following tables are Tput normalized to default baseline

-- Test set A Tput normalized results:
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+
| | **** | default | coresched | smt_off | *** | default | coresched | smt_off |
+====================+========+===========+=============+===========+=======+=============+===============+=============+
| cgroups | **** | cg cpu 1 | cg cpu 1 | cg cpu 1 | *** | cg mysql 1 | cg mysql 1 | cg mysql 1 |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+
| sysbench workload | **** | cpu | cpu | cpu | *** | mysql | mysql | mysql |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+
| 192 tasks / cgroup | **** | 1 | 0.95 | 0.54 | *** | 1 | 0.92 | 0.97 |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+

-- Test set B Tput normalized results:
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+
| | **** | default | coresched | smt_off | *** | default | coresched | smt_off | ** | default | coresched | smt_off | * | default | coresched | smt_off |
+====================+========+===========+=============+===========+=======+=============+===============+=============+======+=============+===============+=============+=====+=============+===============+=============+
| cgroups | **** | cg cpu 1 | cg cpu 1 | cg cpu 1 | *** | cg cpu 2 | cg cpu 2 | cg cpu 2 | ** | cg mysql 1 | cg mysql 1 | cg mysql 1 | * | cg mysql 2 | cg mysql 2 | cg mysql 2 |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+
| sysbench workload | **** | cpu | cpu | cpu | *** | cpu | cpu | cpu | ** | mysql | mysql | mysql | * | mysql | mysql | mysql |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+
| 192 tasks / cgroup | **** | 1 | 0.9 | 0.47 | *** | 1 | 1.32 | 0.66 | ** | 1 | 0.42 | 0.89 | * | 1 | 0.42 | 0.89 |
+--------------------+--------+-----------+-------------+-----------+-------+-------------+---------------+-------------+------+-------------+---------------+-------------+-----+-------------+---------------+-------------+


> On Date: Wed, 4 Mar 2020 16:59:50 +0000, vpillai <[email protected]> wrote:
> To: Nishanth Aravamudan <[email protected]>, Julien
> Desfossez <[email protected]>, Peter Zijlstra
> <[email protected]>, Tim Chen <[email protected]>,
> [email protected], [email protected], [email protected],
> [email protected]
> CC: vpillai <[email protected]>, [email protected],
> [email protected], [email protected], [email protected], Phil
> Auld <[email protected]>, Aaron Lu <[email protected]>, Aubrey Li
> <[email protected]>, [email protected], Valentin
> Schneider <[email protected]>, Mel Gorman
> <[email protected]>, Pawan Gupta
> <[email protected]>, Paolo Bonzini
> <[email protected]>, Joel Fernandes <[email protected]>,
> [email protected]
>
>
> Fifth iteration of the Core-Scheduling feature.
>
> Core scheduling is a feature that only allows trusted tasks to run
> concurrently on cpus sharing compute resources(eg: hyperthreads on a
> core). The goal is to mitigate the core-level side-channel attacks
> without requiring to disable SMT (which has a significant impact on
> performance in some situations). So far, the feature mitigates
> user-space to user-space attacks but not user-space to kernel attack,
> when one of the hardware thread enters the kernel (syscall, interrupt etc).
>
> By default, the feature doesn't change any of the current scheduler
> behavior. The user decides which tasks can run simultaneously on the
> same core (for now by having them in the same tagged cgroup). When a
> tag is enabled in a cgroup and a task from that cgroup is running on a
> hardware thread, the scheduler ensures that only idle or trusted tasks
> run on the other sibling(s). Besides security concerns, this feature
> can also be beneficial for RT and performance applications where we
> want to control how tasks make use of SMT dynamically.
>
> This version was focusing on performance and stability. Couple of
> crashes related to task tagging and cpu hotplug path were fixed.
> This version also improves the performance considerably by making task
> migration and load balancing coresched aware.
>
> In terms of performance, the major difference since the last iteration
> is that now even IO-heavy and mixed-resources workloads are less
> impacted by core-scheduling than by disabling SMT. Both host-level and
> VM-level benchmarks were performed. Details in:
> https://lkml.org/lkml/2020/2/12/1194
> https://lkml.org/lkml/2019/11/1/269
>
> v5 is rebased on top of 5.5.5(449718782a46)
> https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5
> .y
>


----------------------------------------------------------------------
ABOUT:
----------------------------------------------------------------------
Hello,

Core scheduling is required to protect against leakage of sensitive
data allocated on a sibling thread. Our goal is to measure performance
impact of core scheduling across different workloads and show how it
evolved over time. Below you will find data based on core-sched (v5).
The attached PDF contains the system configuration setup as well as
further explanation of the findings.

----------------------------------------------------------------------
BENCHMARKS:
----------------------------------------------------------------------
- hammerdb      : database benchmarking application
- sysbench-cpu  : multi-threaded cpu benchmark
- sysbench-mysql: multi-threaded benchmark against an open-source DBMS (MySQL)
- kernel-build  : Linux kernel compilation benchmark


----------------------------------------------------------------------
PERFORMANCE IMPACT:
----------------------------------------------------------------------

+--------------------+--------+--------------+-------------+-------------------+--------------------+----------------------+
| benchmark          | ****   | # of cgroups | overcommit  | baseline + smt_on | coresched + smt_on | baseline + smt_off   |
+====================+========+==============+=============+===================+====================+======================+
| hammerdb           | ****   | 2cgroups     | 2x          | 1                 | 0.96               | 0.87                 |
+--------------------+--------+--------------+-------------+-------------------+--------------------+----------------------+
| sysbench-cpu       | ****   | 2cgroups     | 2x          | 1                 | 0.95               | 0.54                 |
| sysbench-mysql     | ****   |              |             | 1                 | 0.90               | 0.47                 |
+--------------------+--------+--------------+-------------+-------------------+--------------------+----------------------+
| sysbench-cpu       | ****   | 4cgroups     | 4x          | 1                 | 0.90               | 0.47                 |
| sysbench-cpu       | ****   |              |             | 1                 | 1.32               | 0.66                 |
| sysbench-mysql     | ****   |              |             | 1                 | 0.42               | 0.89                 |
| sysbench-mysql     | ****   |              |             | 1                 | 0.42               | 0.89                 |
+--------------------+--------+--------------+-------------+-------------------+--------------------+----------------------+
| kernel-build       | ****   | 2cgroups     | 0.5x        | 1                 | 1                  | 0.93                 |
|                    | ****   |              | 1x          | 1                 | 0.99               | 0.92                 |
|                    | ****   |              | 2x          | 1                 | 0.98               | 0.91                 |
+--------------------+--------+--------------+-------------+-------------------+--------------------+----------------------+


----------------------------------------------------------------------
TAKE AWAYS:
----------------------------------------------------------------------
1. Core scheduling performs better than turning off HT for these workloads.
2. Impact of core scheduling depends on the workload and thread
scheduling intensity.
3. Core scheduling, as tested here, requires cgroups for tagging. Tasks
from the same tagged cgroup are allowed to run concurrently on a core.
4. In certain situations, core scheduling will introduce an uneven load
distribution between multiple workload types. In such cases, a bias
towards the cpu-intensive workload is expected.
5. Load balancing is not perfect. It needs more work.

Many thanks,

--Agata



Attachments:
LKML_core_sched_v5.5.y.pdf (351.81 kB)

2020-05-20 22:29:17

by Joel Fernandes

[permalink] [raw]
Subject: [PATCH RFC] sched: Add a per-thread core scheduling interface

Add a per-thread core scheduling interface which allows a thread to tag
itself and enable core scheduling. Based on discussion at OSPM with
maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
1 - enable core scheduling for the task.
0 - disable core scheduling for the task.
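
For illustration, a minimal user-space sketch of the proposed interface
(not part of this patch; the fallback define below only mirrors the value
added here, for headers that predate it):

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_CORE_SCHED
#define PR_SET_CORE_SCHED 59
#endif

int main(void)
{
        /*
         * Tag the calling thread; it will then only share a core with
         * tasks carrying the same tag (e.g. children forked after this).
         */
        if (prctl(PR_SET_CORE_SCHED, 1, 0, 0, 0))
                perror("prctl(PR_SET_CORE_SCHED)");
        return 0;
}

On failure (for example -EBUSY when the task is already CGroup-tagged),
prctl() returns -1 with errno set accordingly.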

Special cases:
(1)
The core-scheduling patchset contains a CGroup interface as well. In
order for us to respect users of that interface, we avoid overriding the
tag if a task was CGroup-tagged because the task becomes inconsistent
with the CGroup tag. Instead return -EBUSY.

(2)
If a task is prctl-tagged, allow the CGroup interface to override
the task's tag.

ChromeOS will use core-scheduling to securely enable hyperthreading.
This cuts down the keypress latency in Google docs from 150ms to 50ms
while improving the camera streaming frame rate by ~3%.

Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/linux/sched.h | 6 ++++
include/uapi/linux/prctl.h | 3 ++
kernel/sched/core.c | 57 ++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 3 ++
4 files changed, 69 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fe6ae59fcadbe..8a40a093aa2ca 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1986,6 +1986,12 @@ static inline void rseq_execve(struct task_struct *t)

#endif

+#ifdef CONFIG_SCHED_CORE
+int task_set_core_sched(int set, struct task_struct *tsk);
+#else
+static inline int task_set_core_sched(int set, struct task_struct *tsk) { return -ENOTSUPP; }
+#endif
+
void __exit_umh(struct task_struct *tsk);

static inline void exit_umh(struct task_struct *tsk)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f8131e362..dba0c70f9cce6 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -238,4 +238,7 @@ struct prctl_mm_map {
#define PR_SET_IO_FLUSHER 57
#define PR_GET_IO_FLUSHER 58

+/* Core scheduling per-task interface */
+#define PR_SET_CORE_SCHED 59
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 684359ff357e7..780514d03da47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3320,6 +3320,13 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
#endif
#ifdef CONFIG_SCHED_CORE
RB_CLEAR_NODE(&p->core_node);
+
+ /*
+ * If task is using prctl(2) for tagging, do the prctl(2)-style tagging
+ * for the child as well.
+ */
+ if (current->core_cookie && ((unsigned long)current == current->core_cookie))
+ task_set_core_sched(1, p);
#endif
return 0;
}
@@ -7857,6 +7864,56 @@ void __cant_sleep(const char *file, int line, int preempt_offset)
EXPORT_SYMBOL_GPL(__cant_sleep);
#endif

+#ifdef CONFIG_SCHED_CORE
+
+/* Ensure that all siblings have rescheduled once */
+static int task_set_core_sched_stopper(void *data)
+{
+ return 0;
+}
+
+int task_set_core_sched(int set, struct task_struct *tsk)
+{
+ if (!tsk)
+ tsk = current;
+
+ if (set > 1)
+ return -ERANGE;
+
+ if (!static_branch_likely(&sched_smt_present))
+ return -EINVAL;
+
+ /*
+ * If cookie was set previously, return -EBUSY if either of the
+ * following are true:
+ * 1. Task was previously tagged by CGroup method.
+ * 2. Task or its parent were tagged by prctl().
+ *
+ * Note that, if CGroup tagging is done after prctl(), then that would
+ * override the cookie. However, if prctl() is done after task was
+ * added to tagged CGroup, then the prctl() returns -EBUSY.
+ */
+ if (!!tsk->core_cookie == set) {
+ if ((tsk->core_cookie == (unsigned long)tsk) ||
+ (tsk->core_cookie == (unsigned long)tsk->sched_task_group)) {
+ return -EBUSY;
+ }
+ }
+
+ if (set)
+ sched_core_get();
+
+ tsk->core_cookie = set ? (unsigned long)tsk : 0;
+
+ stop_machine(task_set_core_sched_stopper, NULL, NULL);
+
+ if (!set)
+ sched_core_put();
+
+ return 0;
+}
+#endif
+
#ifdef CONFIG_MAGIC_SYSRQ
void normalize_rt_tasks(void)
{
diff --git a/kernel/sys.c b/kernel/sys.c
index d325f3ab624a9..5c3bcf40dcb34 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2514,6 +2514,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,

error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
break;
+ case PR_SET_CORE_SCHED:
+ error = task_set_core_sched(arg2, NULL);
+ break;
default:
error = -EINVAL;
break;
--
2.26.2.761.g0e0b3e54be-goog

2020-05-20 22:39:21

by Joel Fernandes

[permalink] [raw]
Subject: [PATCH RFC v2] Add support for core-wide protection of IRQ and softirq

With the current core scheduling patchset, non-threaded IRQ and softirq
victims can leak data from their hyperthread to a sibling hyperthread
running an attacker.

For MDS, it is possible for the IRQ and softirq handlers to leak data to
either host or guest attackers. For L1TF, it is possible to leak to
guest attackers. There is no possible mitigation involving flushing of
buffers to avoid this since the execution of attacker and victims happen
concurrently on 2 or more HTs.

The solution in this patch is to monitor the outer-most core-wide
irq_enter() and irq_exit() executed by any sibling. In between these
two, we mark the core to be in a special core-wide IRQ state.

In the IRQ entry, if we detect that the sibling is running untrusted
code, we send a reschedule IPI so that the sibling transitions through
the sibling's irq_exit() to do any waiting there, till the IRQ being
protected finishes.

We also monitor the per-CPU outer-most irq_exit(). If during the per-cpu
outer-most irq_exit(), the core is still in the special core-wide IRQ
state, we perform a busy-wait till the core exits this state. This
combination of per-cpu and core-wide IRQ states helps to handle any
combination of irq_entry()s and irq_exit()s happening on all of the
siblings of the core in any order.

Lastly, we also check in the schedule loop if we are about to schedule
an untrusted process while the core is in such a state. This is possible
if a trusted thread enters the scheduler by way of yielding CPU. This
would involve no transitions through the irq_exit() point to do any
waiting, so we have to explicitly do the waiting there.

Every attempt is made to prevent a busy-wait unnecessarily, and in
testing on real-world ChromeOS usecases, it has not shown a performance
drop. In ChromeOS, with this and the rest of the core scheduling
patchset, we see around a 300% improvement in key press latencies into
Google docs when Camera streaming is running simultaneously (90th
percentile latency of ~150ms drops to ~50ms).

Cc: Julien Desfossez <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Aaron Lu <[email protected]>
Cc: Aubrey Li <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Co-developed-by: Vineeth Pillai <[email protected]>
Signed-off-by: Vineeth Pillai <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>

---
If you like some pictures of the cases handled by this patch, please
see the OSPM slide deck (the below link jumps straight to relevant
slides - about 6-7 of them in total): https://bit.ly/2zvzxWk

v1->v2:
Fixed a bug where softirq was causing deadlock (thanks Vineeth/Julien)

The issue was because of the following flow:

On CPU0:
local_bh_enable()
-> Enter softirq
-> Softirq takes a lock.
-> <new Interrupt received during softirq>
-> New interrupt's irq_exit() : Wait since it is not outermost
core-wide irq_exit().

On CPU1:
<interrupt received>
irq_enter() -> Enter the core wide IRQ state.
<ISR raises a softirq which will run from irq_exit().
irq_exit() ->
-> enters softirq
-> softirq tries to take a lock and blocks.

So it is an A->B and B->A deadlock.
A = Enter the core-wide IRQ state or wait for it to end.
B = Acquire a lock during softirq or wait for it to be released.

The fix is to enter the core-wide IRQ state even when entering through
the local_bh_enable -> softirq path (when there is no hardirq
context), which basically becomes:

On CPU0:
local_bh_enable()
(Fix: Call sched_core_irq_enter() --> similar to irq_enter()).
-> Enter softirq
-> Softirq takes a lock.
-> <new Interrupt received during softirq> -> irq_enter()
-> New interrupt's irq_exit() (Will not wait since we are inner
irq_exit()).

include/linux/sched.h | 8 +++
kernel/sched/core.c | 159 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 3 +
kernel/softirq.c | 12 ++++
4 files changed, 182 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 710e9a8956007..fe6ae59fcadbe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2018,4 +2018,12 @@ int sched_trace_rq_cpu(struct rq *rq);

const struct cpumask *sched_trace_rd_span(struct root_domain *rd);

+#ifdef CONFIG_SCHED_CORE
+void sched_core_irq_enter(void);
+void sched_core_irq_exit(void);
+#else
+static inline void sched_core_irq_enter(void) { }
+static inline void sched_core_irq_exit(void) { }
+#endif
+
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 21c640170323b..684359ff357e7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4391,6 +4391,153 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
return a->core_cookie == b->core_cookie;
}

+/*
+ * Helper function to pause the caller's hyperthread until the core exits the
+ * core-wide IRQ state. Obviously the CPU calling this function should not be
+ * responsible for the core being in the core-wide IRQ state otherwise it will
+ * deadlock. This function should be called from irq_exit() and from schedule().
+ * It is up to the callers to decide if calling here is necessary.
+ */
+static inline void sched_core_sibling_irq_pause(struct rq *rq)
+{
+ /*
+ * Wait till the core of this HT is not in a core-wide IRQ state.
+ *
+ * Pair with smp_store_release() in sched_core_irq_exit().
+ */
+ while (smp_load_acquire(&rq->core->core_irq_nest) > 0)
+ cpu_relax();
+}
+
+/*
+ * Enter the core-wide IRQ state. Sibling will be paused if it is running
+ * 'untrusted' code, until sched_core_irq_exit() is called. Every attempt to
+ * avoid sending useless IPIs is made. Must be called only from hard IRQ
+ * context.
+ */
+void sched_core_irq_enter(void)
+{
+ int i, cpu = smp_processor_id();
+ struct rq *rq = cpu_rq(cpu);
+ const struct cpumask *smt_mask;
+
+ if (!sched_core_enabled(rq))
+ return;
+
+ /* Count irq_enter() calls received without irq_exit() on this CPU. */
+ rq->core_this_irq_nest++;
+
+ /* If not outermost irq_enter(), do nothing. */
+ if (WARN_ON_ONCE(rq->core_this_irq_nest == UINT_MAX) ||
+ rq->core_this_irq_nest != 1)
+ return;
+
+ raw_spin_lock(rq_lockp(rq));
+ smt_mask = cpu_smt_mask(cpu);
+
+ /* Contribute this CPU's irq_enter() to core-wide irq_enter() count. */
+ WRITE_ONCE(rq->core->core_irq_nest, rq->core->core_irq_nest + 1);
+ if (WARN_ON_ONCE(rq->core->core_irq_nest == UINT_MAX))
+ goto unlock;
+
+ if (rq->core_pause_pending) {
+ /*
+ * Do nothing more since we are in a 'reschedule IPI' sent from
+ * another sibling. That sibling would have sent IPIs to all of
+ * the HTs.
+ */
+ goto unlock;
+ }
+
+ /*
+ * If we are not the first ones on the core to enter core-wide IRQ
+ * state, do nothing.
+ */
+ if (rq->core->core_irq_nest > 1)
+ goto unlock;
+
+ /* Do nothing more if the core is not tagged. */
+ if (!rq->core->core_cookie)
+ goto unlock;
+
+ for_each_cpu(i, smt_mask) {
+ struct rq *srq = cpu_rq(i);
+
+ if (i == cpu || cpu_is_offline(i))
+ continue;
+
+ if (!srq->curr->mm || is_idle_task(srq->curr))
+ continue;
+
+ /* Skip if HT is not running a tagged task. */
+ if (!srq->curr->core_cookie && !srq->core_pick)
+ continue;
+
+ /* IPI only if previous IPI was not pending. */
+ if (!srq->core_pause_pending) {
+ srq->core_pause_pending = 1;
+ smp_send_reschedule(i);
+ }
+ }
+unlock:
+ raw_spin_unlock(rq_lockp(rq));
+}
+
+/*
+ * Process any work needed for either exiting the core-wide IRQ state, or for
+ * waiting on this hyperthread if the core is still in this state.
+ */
+void sched_core_irq_exit(void)
+{
+ int cpu = smp_processor_id();
+ struct rq *rq = cpu_rq(cpu);
+ bool wait_here = false;
+ unsigned int nest;
+
+ /* Do nothing if core-sched disabled. */
+ if (!sched_core_enabled(rq))
+ return;
+
+ rq->core_this_irq_nest--;
+
+ /* If not outermost on this CPU, do nothing. */
+ if (WARN_ON_ONCE(rq->core_this_irq_nest == UINT_MAX) ||
+ rq->core_this_irq_nest > 0)
+ return;
+
+ raw_spin_lock(rq_lockp(rq));
+ /*
+ * Core-wide nesting counter can never be 0 because we are
+ * still in it on this CPU.
+ */
+ nest = rq->core->core_irq_nest;
+ WARN_ON_ONCE(!nest);
+
+ /*
+ * If we still have other CPUs in IRQs, we have to wait for them.
+ * Either here, or in the scheduler.
+ */
+ if (rq->core->core_cookie && nest > 1) {
+ /*
+ * If we are entering the scheduler anyway, we can just wait
+ * there for ->core_irq_nest to reach 0. If not, just wait here.
+ */
+ if (!tif_need_resched()) {
+ wait_here = true;
+ }
+ }
+
+ if (rq->core_pause_pending)
+ rq->core_pause_pending = 0;
+
+ /* Pair with smp_load_acquire() in sched_core_sibling_irq_pause(). */
+ smp_store_release(&rq->core->core_irq_nest, nest - 1);
+ raw_spin_unlock(rq_lockp(rq));
+
+ if (wait_here)
+ sched_core_sibling_irq_pause(rq);
+}
+
// XXX fairness/fwd progress conditions
/*
* Returns
@@ -4910,6 +5057,18 @@ static void __sched notrace __schedule(bool preempt)
rq_unlock_irq(rq, &rf);
}

+#ifdef CONFIG_SCHED_CORE
+ /*
+ * If a CPU that was running a trusted task entered the scheduler, and
+ * the next task is untrusted, then check if waiting for core-wide IRQ
+ * state to cease is needed since we would not have been able to get
+ * the services of irq_exit() to do that waiting.
+ */
+ if (sched_core_enabled(rq) &&
+ !is_idle_task(next) && next->mm && next->core_cookie)
+ sched_core_sibling_irq_pause(rq);
+#endif
+
balance_callback(rq);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a7d9f156242e2..3a065d133ef51 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1018,11 +1018,14 @@ struct rq {
unsigned int core_sched_seq;
struct rb_root core_tree;
unsigned char core_forceidle;
+ unsigned char core_pause_pending;
+ unsigned int core_this_irq_nest;

/* shared state */
unsigned int core_task_seq;
unsigned int core_pick_seq;
unsigned long core_cookie;
+ unsigned int core_irq_nest;
#endif
};

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 0427a86743a46..147abd6d82599 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -273,6 +273,13 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
/* Reset the pending bitmask before enabling irqs */
set_softirq_pending(0);

+ /*
+ * Core scheduling mitigations require entry into softirq to send stall
+ * IPIs to sibling hyperthreads if needed (ex, sibling is running
+ * untrusted task). If we are here from irq_exit(), no IPIs are sent.
+ */
+ sched_core_irq_enter();
+
local_irq_enable();

h = softirq_vec;
@@ -305,6 +312,9 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
rcu_softirq_qs();
local_irq_disable();

+ /* Inform the scheduler about exit from softirq. */
+ sched_core_irq_exit();
+
pending = local_softirq_pending();
if (pending) {
if (time_before(jiffies, end) && !need_resched() &&
@@ -345,6 +355,7 @@ asmlinkage __visible void do_softirq(void)
void irq_enter(void)
{
rcu_irq_enter();
+ sched_core_irq_enter();
if (is_idle_task(current) && !in_interrupt()) {
/*
* Prevent raise_softirq from needlessly waking up ksoftirqd
@@ -413,6 +424,7 @@ void irq_exit(void)
invoke_softirq();

tick_irq_exit();
+ sched_core_irq_exit();
rcu_irq_exit();
trace_hardirq_exit(); /* must be last! */
}
--
2.26.2.761.g0e0b3e54be-goog

2020-05-20 22:51:05

by Joel Fernandes

[permalink] [raw]
Subject: [PATCH RFC] sched: Use sched-RCU in core-scheduling balancing logic

rcu_read_unlock() can incur an infrequent deadlock in
sched_core_balance(). Fix this by using sched-RCU instead.

This fixes the following spinlock recursion observed when testing the
core scheduling patches on PREEMPT=y kernel on ChromeOS:

[ 3.240891] BUG: spinlock recursion on CPU#2, swapper/2/0
[ 3.240900] lock: 0xffff9cd1eeb28e40, .magic: dead4ead, .owner: swapper/2/0, .owner_cpu: 2
[ 3.240905] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.4.22htcore #4
[ 3.240908] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.174.0 05/29/2018
[ 3.240910] Call Trace:
[ 3.240919] dump_stack+0x97/0xdb
[ 3.240924] ? spin_bug+0xa4/0xb1
[ 3.240927] do_raw_spin_lock+0x79/0x98
[ 3.240931] try_to_wake_up+0x367/0x61b
[ 3.240935] rcu_read_unlock_special+0xde/0x169
[ 3.240938] ? sched_core_balance+0xd9/0x11e
[ 3.240941] __rcu_read_unlock+0x48/0x4a
[ 3.240945] __balance_callback+0x50/0xa1
[ 3.240949] __schedule+0x55a/0x61e
[ 3.240952] schedule_idle+0x21/0x2d
[ 3.240956] do_idle+0x1d5/0x1f8
[ 3.240960] cpu_startup_entry+0x1d/0x1f
[ 3.240964] start_secondary+0x159/0x174
[ 3.240967] secondary_startup_64+0xa4/0xb0
[ 14.998590] watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [kworker/0:10:965]

Cc: vpillai <[email protected]>
Cc: Aaron Lu <[email protected]>
Cc: Aubrey Li <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Change-Id: I1a4bf0cd1426b3c21ad5de44719813ad4ee5805e
---
kernel/sched/core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 780514d03da47..b8ca6fcaaaf06 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4897,7 +4897,7 @@ static void sched_core_balance(struct rq *rq)
struct sched_domain *sd;
int cpu = cpu_of(rq);

- rcu_read_lock();
+ rcu_read_lock_sched();
raw_spin_unlock_irq(rq_lockp(rq));
for_each_domain(cpu, sd) {
if (!(sd->flags & SD_LOAD_BALANCE))
@@ -4910,7 +4910,7 @@ static void sched_core_balance(struct rq *rq)
break;
}
raw_spin_lock_irq(rq_lockp(rq));
- rcu_read_unlock();
+ rcu_read_unlock_sched();
}

static DEFINE_PER_CPU(struct callback_head, core_balance_head);
--
2.26.2.761.g0e0b3e54be-goog

2020-05-21 04:21:03

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Add a per-thread core scheduling interface(Internet mail)



> On May 21, 2020, at 6:26 AM, Joel Fernandes (Google) <[email protected]> wrote:
>
> Add a per-thread core scheduling interface which allows a thread to tag
> itself and enable core scheduling. Based on discussion at OSPM with
> maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
> 1 - enable core scheduling for the task.
> 0 - disable core scheduling for the task.
>
> Special cases:
> (1)
> The core-scheduling patchset contains a CGroup interface as well. In
> order for us to respect users of that interface, we avoid overriding the
> tag if a task was CGroup-tagged because the task becomes inconsistent
> with the CGroup tag. Instead return -EBUSY.
>
> (2)
> If a task is prctl-tagged, allow the CGroup interface to override
> the task's tag.
>
> ChromeOS will use core-scheduling to securely enable hyperthreading.
> This cuts down the keypress latency in Google docs from 150ms to 50ms
> while improving the camera streaming frame rate by ~3%.
Hi,
Are the performance improvements compared to the hyperthreading disabled scenario or not?
Could you help to explain how the keypress latency improvement comes with core-scheduling?

Thanks a lot.

Regards,
Jiang

2020-05-21 08:55:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

On Wed, May 20, 2020 at 06:26:42PM -0400, Joel Fernandes (Google) wrote:
> Add a per-thread core scheduling interface which allows a thread to tag
> itself and enable core scheduling. Based on discussion at OSPM with
> maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
> 1 - enable core scheduling for the task.
> 0 - disable core scheduling for the task.

Yeah, so this is a terrible interface :-)

It doens't allow tasks for form their own groups (by for example setting
the key to that of another task).

It is also horribly ill defined what it means to 'enable', with whoem
is it allows to share a core.

> Special cases:

> (1)
> The core-scheduling patchset contains a CGroup interface as well. In
> order for us to respect users of that interface, we avoid overriding the
> tag if a task was CGroup-tagged because the task becomes inconsistent
> with the CGroup tag. Instead return -EBUSY.
>
> (2)
> If a task is prctl-tagged, allow the CGroup interface to override
> the task's tag.

OK, so cgroup always wins; is why is that a good thing?

> ChromeOS will use core-scheduling to securely enable hyperthreading.
> This cuts down the keypress latency in Google docs from 150ms to 50ms
> while improving the camera streaming frame rate by ~3%.

It doesn't consider permissions.

Basically, with the way you guys use it, it should be a CAP_SYS_ADMIN
only to enable core-sched.

That also means we should very much default to disable.
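
A minimal sketch of the kind of gate being asked for here (illustrative
only, not from the posted patch; it reuses the task_set_core_sched()
helper introduced earlier in this thread):

/*
 * Restrict core-scheduling tagging to privileged tasks, as suggested
 * above. Only the permission check is shown; the rest of the tagging
 * logic stays as in the RFC patch.
 */
int task_set_core_sched(int set, struct task_struct *tsk)
{
        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;

        /* ... existing tagging logic from the RFC patch ... */
        return 0;
}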

2020-05-21 13:50:10

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

Hi Peter,
Thanks for the comments.

On Thu, May 21, 2020 at 10:51:22AM +0200, Peter Zijlstra wrote:
> On Wed, May 20, 2020 at 06:26:42PM -0400, Joel Fernandes (Google) wrote:
> > Add a per-thread core scheduling interface which allows a thread to tag
> > itself and enable core scheduling. Based on discussion at OSPM with
> > maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
> > 1 - enable core scheduling for the task.
> > 0 - disable core scheduling for the task.
>
> Yeah, so this is a terrible interface :-)

I tried to keep it simple. You are right, lets make it better.

> It doens't allow tasks for form their own groups (by for example setting
> the key to that of another task).

So for this, I was thinking of making the prctl pass in an integer. And 0
would mean untagged. Does that sound good to you?

> It is also horribly ill defined what it means to 'enable', with whoem
> is it allows to share a core.

I couldn't parse this. Do you mean "enabling coresched does not make sense if
we don't specify whom to share the core with?"

> > Special cases:
>
> > (1)
> > The core-scheduling patchset contains a CGroup interface as well. In
> > order for us to respect users of that interface, we avoid overriding the
> > tag if a task was CGroup-tagged because the task becomes inconsistent
> > with the CGroup tag. Instead return -EBUSY.
> >
> > (2)
> > If a task is prctl-tagged, allow the CGroup interface to override
> > the task's tag.
>
> OK, so cgroup always wins; is why is that a good thing?

I was just trying to respect the functionality of the CGroup patch in the
coresched series, after all a gentleman named Peter Zijlstra wrote that
patch ;-) ;-).

More seriously, the reason I did it this way is the prctl-tagging is a bit
incompatible with CGroup tagging:

1. What happens if 2 tasks are in a tagged CGroup and one of them changes
their cookie through prctl? Do they still remain in the tagged CGroup but are
now going to not trust each other? Do they get removed from the CGroup? This
is why I made the prctl fail with -EBUSY in such cases.

2. What happens if 2 tagged tasks with different cookies are added to a
tagged CGroup? Do we fail the addition of the tasks to the group, or do we
override their cookie (like I'm doing)?

> > ChromeOS will use core-scheduling to securely enable hyperthreading.
> > This cuts down the keypress latency in Google docs from 150ms to 50ms
> > while improving the camera streaming frame rate by ~3%.
>
> It doesn't consider permissions.
>
> Basically, with the way you guys use it, it should be a CAP_SYS_ADMIN
> only to enable core-sched.

True, we were relying on the seccomp sandboxing in ChromeOS to protect the
prctl but you're right and I fixed it for next revision.

> That also means we should very much default to disable.

This is how it is already.

thanks,

- Joel

2020-05-21 13:51:06

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Add a per-thread core scheduling interface(Internet mail)

On Thu, May 21, 2020 at 04:09:50AM +0000, benbjiang(蒋彪) wrote:
>
>
> > On May 21, 2020, at 6:26 AM, Joel Fernandes (Google) <[email protected]> wrote:
> >
> > Add a per-thread core scheduling interface which allows a thread to tag
> > itself and enable core scheduling. Based on discussion at OSPM with
> > maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
> > 1 - enable core scheduling for the task.
> > 0 - disable core scheduling for the task.
> >
> > Special cases:
> > (1)
> > The core-scheduling patchset contains a CGroup interface as well. In
> > order for us to respect users of that interface, we avoid overriding the
> > tag if a task was CGroup-tagged because the task becomes inconsistent
> > with the CGroup tag. Instead return -EBUSY.
> >
> > (2)
> > If a task is prctl-tagged, allow the CGroup interface to override
> > the task's tag.
> >
> > ChromeOS will use core-scheduling to securely enable hyperthreading.
> > This cuts down the keypress latency in Google docs from 150ms to 50ms
> > while improving the camera streaming frame rate by ~3%.
> Hi,
> Are the performance improvements compared to the hyperthreading disabled scenario or not?
> Could you help to explain how the keypress latency improvement comes with core-scheduling?

Hi Jiang,

The keypress end-to-end latency metric we have is calculated from when the
keypress is registered in hardware to when a character is drawn on the
screen. This involves several parties including the GPU and browser
processes which are all running in the same trust domain and benefit from
parallelism through hyperthreading.

thanks,

- Joel

2020-05-21 18:41:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

On Wed, May 20, 2020 at 3:26 PM Joel Fernandes (Google)
<[email protected]> wrote:
>
> ChromeOS will use core-scheduling to securely enable hyperthreading.
> This cuts down the keypress latency in Google docs from 150ms to 50ms
> while improving the camera streaming frame rate by ~3%.

I'm assuming this is "compared to SMT disabled"?

What is the cost compared to "SMT enabled but no core scheduling"?

But the real reason I'm piping up is that your latency benchmark
sounds very cool.

Generally throughput benchmarks are much easier to do, how do you do
this latency benchmark, and is it perhaps something that could be run
more widely (ie I'm thinking that if it's generic enough and stable
enough to be run by some of the performance regression checking
robots, it would be a much more interesting test-case than some of the
ones they run right now...)

I'm looking at that "threaded phoronix gzip performance regression"
thread due to a totally unrelated scheduling change ("sched/fair:
Rework load_balance()"), and then I see this thread and my reaction is
"the keypress latency thing sounds like a much more interesting
performance test than threaded gzip from clear linux".

But the threaded gzip test is presumably trivial to script, while your
latency test is perhaps very specific to one particular platform and
setup?

Linus

2020-05-21 20:22:43

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

On Thu, May 21, 2020 at 9:47 AM Joel Fernandes <[email protected]> wrote:
>
> > It doens't allow tasks for form their own groups (by for example setting
> > the key to that of another task).
>
> So for this, I was thinking of making the prctl pass in an integer. And 0
> would mean untagged. Does that sound good to you?
>
On a similar note, Joel and I were discussing prctl and it came up
that there is no mechanism to set a cookie from outside a process using
prctl(2). So, another option we could consider is to use sched_setattr(2)
and expand sched_attr to accommodate a u64 cookie. The user could pass in a
cookie to explicitly set it and also use the same cookie for grouping.

Haven't prototyped it yet. Will need to dig deeper and see how it would
really look.
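
A rough sketch of what that could look like from user space; the
sched_core_cookie field at the end (and its placement) is purely
hypothetical, nothing has been prototyped yet:

struct sched_attr {
        __u32 size;
        __u32 sched_policy;
        __u64 sched_flags;

        /* SCHED_NORMAL, SCHED_BATCH */
        __s32 sched_nice;
        /* SCHED_FIFO, SCHED_RR */
        __u32 sched_priority;
        /* SCHED_DEADLINE */
        __u64 sched_runtime;
        __u64 sched_deadline;
        __u64 sched_period;
        /* Utilization hints */
        __u32 sched_util_min;
        __u32 sched_util_max;

        /* Hypothetical: core-scheduling cookie, 0 means untagged. */
        __u64 sched_core_cookie;
};

An external process could then group a target task by calling
sched_setattr(pid, &attr, 0) with a shared cookie value, which is
something the prctl-only interface cannot do.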

Thanks,
Vineeth

2020-05-21 20:44:30

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

Hi Linus,

On Thu, May 21, 2020 at 11:31:38AM -0700, Linus Torvalds wrote:
> On Wed, May 20, 2020 at 3:26 PM Joel Fernandes (Google)
> <[email protected]> wrote:
> >
> > ChromeOS will use core-scheduling to securely enable hyperthreading.
> > This cuts down the keypress latency in Google docs from 150ms to 50ms
> > while improving the camera streaming frame rate by ~3%.
>
> I'm assuming this is "compared to SMT disabled"?

Yes this is compared to SMT disabled, I'll improve the commit message.

> What is the cost compared to "SMT enabled but no core scheduling"?

With SMT enabled and no core scheduling, it is around 40ms in the higher
percentiles. Also one more thing I wanted to mention, this is the 90th
percentile.

> But the real reason I'm piping up is that your latency benchmark
> sounds very cool.
>
> Generally throughput benchmarks are much easier to do, how do you do
> this latency benchmark, and is it perhaps something that could be run
> more widely (ie I'm thinking that if it's generic enough and stable
> enough to be run by some of the performance regression checking
> robots, it would be a much more interesting test-case than some of the
> ones they run right now...)

Glad you like it! The metric is calculated with a timestamp of when the
driver says the key was pressed, up until when the GPU says we've drawn
pixels in response.

The test mostly only requires the Chrome browser. It opens some
pre-existing test URLs (a google doc, a window that opens a camera stream and
another window that decodes video). This metric is already calculated in
Chrome, we just scrape it from
chrome://histograms/Event.Latency.EndToEnd.KeyPress. If you install Chrome,
you can goto this link and see the histogram. We open a Google docs window
and synthetically input keys into it with a camera stream and video decoding
running in other windows which gives the CPUs a good beating. Then we collect
roughly the 90th percentile keypress latency from the above histogram and the
camera and decoded video's FPS, among other things. There is a test in the
works that my colleagues are writing to run the full Google hangout video
chatting stack to stress the system more (versus just the camera stream). I
guess if the robots can somehow input keys into the Google docs and open the
right windows, then it is just a matter of scraping the histogram.

> I'm looking at that "threaded phoronix gzip performance regression"
> thread due to a totally unrelated scheduling change ("sched/fair:
> Rework load_balance()"), and then I see this thread and my reaction is
> "the keypress latency thing sounds like a much more interesting
> performance test than threaded gzip from clear linux".
>
> But the threaded gzip test is presumably trivial to script, while your
> latency test is perhaps very specific to one particular platform and
> setup?

Yes it is specifically a ChromeOS running on a pixel book running a 7th Gen
Intel Core i7 with 4 hardware threads.
https://store.google.com/us/product/google_pixelbook

I could try to make it a synthetic test but it might be difficult for a robot
to run it if it does not have graphics support and a camera connected to it.
It would then need a fake/emulated camera connected to it. These robots run
Linux in a non-GUI environment in qemu instances right?

thanks,

- Joel

2020-05-21 22:00:01

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

On Thu, May 21, 2020 at 1:45 PM Joel Fernandes <[email protected]> wrote:
>
> Hi Linus,
>
> On Thu, May 21, 2020 at 11:31:38AM -0700, Linus Torvalds wrote:
> > On Wed, May 20, 2020 at 3:26 PM Joel Fernandes (Google)
> > <[email protected]> wrote:
> > Generally throughput benchmarks are much easier to do, how do you do
> > this latency benchmark, and is it perhaps something that could be run
> > more widely (ie I'm thinking that if it's generic enough and stable
> > enough to be run by some of the performance regression checking
> > robots, it would be a much more interesting test-case than some of the
> > ones they run right now...)
>
> Glad you like it! The metric is calculated with a timestamp of when the
> driver says the key was pressed, up until when the GPU says we've drawn
> pixels in response.
>
> The test mostly only requires the Chrome browser. It opens some
> pre-existing test URLs (a google doc, a window that opens a camera stream and
> another window that decodes video). This metric is already calculated in
> Chrome, we just scrape it from
> chrome://histograms/Event.Latency.EndToEnd.KeyPress. If you install Chrome,
> you can goto this link and see the histogram. We open a Google docs window
> and synthetically input keys into it with a camera stream and video decoding
> running in other windows which gives the CPUs a good beating. Then we collect
> roughly the 90th percentile keypress latency from the above histogram and the
> camera and decoded video's FPS, among other things. There is a test in the
> works that my colleagues are writing to run the full Google hangout video
> chatting stack to stress the system more (versus just the camera stream). I
> guess if the robots can somehow input keys into the Google docs and open the
> right windows, then it is just a matter of scraping the histogram.

Expanding on this a little, we're working on a couple of projects that
should provide results like these for upstream. One is continuously
rebasing our upstream backlog onto new kernels for testing purposes
(the idea here is to make it easier for us to update kernels on
Chromebooks), and the second is to drive more stuff into the
kernelci.org infrastructure. Given the test environments we have in
place now, we can probably get results from our continuous rebase
project first and provide those against -rc releases if that's
something you'd be interested in. Going forward, I hope we can
extract several of our tests and put them into kernelci as well, so we
get more general coverage without the potential impact of our (still
somewhat large) upstream backlog of patches.

To Joel's point, there are a few changes we'll have to make to get
similar results outside of our environment, but I think that's doable
without a ton of work. And if anyone is curious, I think most of this
stuff is already public in the tast and autotest repos of the
chromiumos tree. Just let us know if you want to make changes or port
to another environment so we can try to stay in sync wrt new features,
etc.

Thanks,
Jesse

2020-05-21 22:54:19

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Use sched-RCU in core-scheduling balancing logic

On Wed, May 20, 2020 at 06:48:18PM -0400, Joel Fernandes (Google) wrote:
> rcu_read_unlock() can incur an infrequent deadlock in
> sched_core_balance(). Fix this by using sched-RCU instead.
>
> This fixes the following spinlock recursion observed when testing the
> core scheduling patches on PREEMPT=y kernel on ChromeOS:
>
> [ 3.240891] BUG: spinlock recursion on CPU#2, swapper/2/0
> [ 3.240900] lock: 0xffff9cd1eeb28e40, .magic: dead4ead, .owner: swapper/2/0, .owner_cpu: 2
> [ 3.240905] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.4.22htcore #4
> [ 3.240908] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.174.0 05/29/2018
> [ 3.240910] Call Trace:
> [ 3.240919] dump_stack+0x97/0xdb
> [ 3.240924] ? spin_bug+0xa4/0xb1
> [ 3.240927] do_raw_spin_lock+0x79/0x98
> [ 3.240931] try_to_wake_up+0x367/0x61b
> [ 3.240935] rcu_read_unlock_special+0xde/0x169
> [ 3.240938] ? sched_core_balance+0xd9/0x11e
> [ 3.240941] __rcu_read_unlock+0x48/0x4a
> [ 3.240945] __balance_callback+0x50/0xa1
> [ 3.240949] __schedule+0x55a/0x61e
> [ 3.240952] schedule_idle+0x21/0x2d
> [ 3.240956] do_idle+0x1d5/0x1f8
> [ 3.240960] cpu_startup_entry+0x1d/0x1f
> [ 3.240964] start_secondary+0x159/0x174
> [ 3.240967] secondary_startup_64+0xa4/0xb0
> [ 14.998590] watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [kworker/0:10:965]
>
> Cc: vpillai <[email protected]>
> Cc: Aaron Lu <[email protected]>
> Cc: Aubrey Li <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
> Change-Id: I1a4bf0cd1426b3c21ad5de44719813ad4ee5805e

With some luck, the commit removing the need for this will hit
mainline during the next merge window. Fingers firmly crossed...

Thanx, Paul

> ---
> kernel/sched/core.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 780514d03da47..b8ca6fcaaaf06 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4897,7 +4897,7 @@ static void sched_core_balance(struct rq *rq)
> struct sched_domain *sd;
> int cpu = cpu_of(rq);
>
> - rcu_read_lock();
> + rcu_read_lock_sched();
> raw_spin_unlock_irq(rq_lockp(rq));
> for_each_domain(cpu, sd) {
> if (!(sd->flags & SD_LOAD_BALANCE))
> @@ -4910,7 +4910,7 @@ static void sched_core_balance(struct rq *rq)
> break;
> }
> raw_spin_lock_irq(rq_lockp(rq));
> - rcu_read_unlock();
> + rcu_read_unlock_sched();
> }
>
> static DEFINE_PER_CPU(struct callback_head, core_balance_head);
> --
> 2.26.2.761.g0e0b3e54be-goog
>

2020-05-22 01:28:10

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Use sched-RCU in core-scheduling balancing logic

On Thu, May 21, 2020 at 6:52 PM Paul E. McKenney <[email protected]> wrote:
>
> On Wed, May 20, 2020 at 06:48:18PM -0400, Joel Fernandes (Google) wrote:
> > rcu_read_unlock() can incur an infrequent deadlock in
> > sched_core_balance(). Fix this by using sched-RCU instead.
> >
> > This fixes the following spinlock recursion observed when testing the
> > core scheduling patches on PREEMPT=y kernel on ChromeOS:
> >
> > [ 3.240891] BUG: spinlock recursion on CPU#2, swapper/2/0
> > [ 3.240900] lock: 0xffff9cd1eeb28e40, .magic: dead4ead, .owner: swapper/2/0, .owner_cpu: 2
> > [ 3.240905] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.4.22htcore #4
> > [ 3.240908] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.174.0 05/29/2018
> > [ 3.240910] Call Trace:
> > [ 3.240919] dump_stack+0x97/0xdb
> > [ 3.240924] ? spin_bug+0xa4/0xb1
> > [ 3.240927] do_raw_spin_lock+0x79/0x98
> > [ 3.240931] try_to_wake_up+0x367/0x61b
> > [ 3.240935] rcu_read_unlock_special+0xde/0x169
> > [ 3.240938] ? sched_core_balance+0xd9/0x11e
> > [ 3.240941] __rcu_read_unlock+0x48/0x4a
> > [ 3.240945] __balance_callback+0x50/0xa1
> > [ 3.240949] __schedule+0x55a/0x61e
> > [ 3.240952] schedule_idle+0x21/0x2d
> > [ 3.240956] do_idle+0x1d5/0x1f8
> > [ 3.240960] cpu_startup_entry+0x1d/0x1f
> > [ 3.240964] start_secondary+0x159/0x174
> > [ 3.240967] secondary_startup_64+0xa4/0xb0
> > [ 14.998590] watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [kworker/0:10:965]
> >
> > Cc: vpillai <[email protected]>
> > Cc: Aaron Lu <[email protected]>
> > Cc: Aubrey Li <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > Change-Id: I1a4bf0cd1426b3c21ad5de44719813ad4ee5805e
>
> With some luck, the commit removing the need for this will hit
> mainline during the next merge window. Fingers firmly crossed...

Sounds good, thank you Paul :-)

- Joel


>
> Thanx, Paul
>
> > ---
> > kernel/sched/core.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 780514d03da47..b8ca6fcaaaf06 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4897,7 +4897,7 @@ static void sched_core_balance(struct rq *rq)
> > struct sched_domain *sd;
> > int cpu = cpu_of(rq);
> >
> > - rcu_read_lock();
> > + rcu_read_lock_sched();
> > raw_spin_unlock_irq(rq_lockp(rq));
> > for_each_domain(cpu, sd) {
> > if (!(sd->flags & SD_LOAD_BALANCE))
> > @@ -4910,7 +4910,7 @@ static void sched_core_balance(struct rq *rq)
> > break;
> > }
> > raw_spin_lock_irq(rq_lockp(rq));
> > - rcu_read_unlock();
> > + rcu_read_unlock_sched();
> > }
> >
> > static DEFINE_PER_CPU(struct callback_head, core_balance_head);
> > --
> > 2.26.2.761.g0e0b3e54be-goog
> >

2020-05-22 13:02:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

On Thu, May 21, 2020 at 09:47:05AM -0400, Joel Fernandes wrote:
> Hi Peter,
> Thanks for the comments.
>
> On Thu, May 21, 2020 at 10:51:22AM +0200, Peter Zijlstra wrote:
> > On Wed, May 20, 2020 at 06:26:42PM -0400, Joel Fernandes (Google) wrote:
> > > Add a per-thread core scheduling interface which allows a thread to tag
> > > itself and enable core scheduling. Based on discussion at OSPM with
> > > maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
> > > 1 - enable core scheduling for the task.
> > > 0 - disable core scheduling for the task.
> >
> > Yeah, so this is a terrible interface :-)
>
> I tried to keep it simple. You are right, lets make it better.
>
> > It doens't allow tasks for form their own groups (by for example setting
> > the key to that of another task).
>
> So for this, I was thinking of making the prctl pass in an integer. And 0
> would mean untagged. Does that sound good to you?

A TID, I think. If you pass your own TID, you tag yourself as
not-sharing. If you tag yourself with another tasks's TID, you can do
ptrace tests to see if you're allowed to observe their junk.

> > It is also horribly ill defined what it means to 'enable', with whoem
> > is it allows to share a core.
>
> I couldn't parse this. Do you mean "enabling coresched does not make sense if
> we don't specify whom to share the core with?"

As a corrolary yes. I mostly meant that a blanket 'enable' doesn't
specify a 'who' you're sharing your bits with.

> > OK, so cgroup always wins; is why is that a good thing?
>
> I was just trying to respect the functionality of the CGroup patch in the
> coresched series, after all a gentleman named Peter Zijlstra wrote that
> patch ;-) ;-).

Yeah, but I think that same guy said that that was a shit interface and
only hacked up because it was easy :-)

> More seriously, the reason I did it this way is the prctl-tagging is a bit
> incompatible with CGroup tagging:
>
> 1. What happens if 2 tasks are in a tagged CGroup and one of them changes
> their cookie through prctl? Do they still remain in the tagged CGroup but are
> now going to not trust each other? Do they get removed from the CGroup? This
> is why I made the prctl fail with -EBUSY in such cases.
>
> 2. What happens if 2 tagged tasks with different cookies are added to a
> tagged CGroup? Do we fail the addition of the tasks to the group, or do we
> override their cookie (like I'm doing)?

For #2 I think I prefer failure.

But having the rationale spelled out in documentation (man-pages for
example) is important.

> > > ChromeOS will use core-scheduling to securely enable hyperthreading.
> > > This cuts down the keypress latency in Google docs from 150ms to 50ms
> > > while improving the camera streaming frame rate by ~3%.
> >
> > It doesn't consider permissions.
> >
> > Basically, with the way you guys use it, it should be a CAP_SYS_ADMIN
> > only to enable core-sched.
>
> True, we were relying on the seccomp sandboxing in ChromeOS to protect the
> prctl but you're right and I fixed it for next revision.

With the TID idea above you get the ptrace tests.

2020-05-22 16:38:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

On Thu, May 21, 2020 at 2:58 PM Jesse Barnes <[email protected]> wrote:
>
> Expanding on this a little, we're working on a couple of projects that
> should provide results like these for upstream. One is continuously
> rebasing our upstream backlog onto new kernels for testing purposes
> (the idea here is to make it easier for us to update kernels on
> Chromebooks),

Lovely. Not just for any performance work that comes out of this, but
hopefully this means that we'll also have quick problem reports if
something happens that affects chrome.

There's certainly been issues on the server side of google where we
made changes (*cough*cgroup*cough*) which didn't make anybody really
blink until years after the fact.. Which ends up being very
inconvenient when other parts of the community have been using the new
features for years.

> and the second is to drive more stuff into the
> kernelci.org infrastructure. Given the test environments we have in
> place now, we can probably get results from our continuous rebase
> project first and provide those against -rc releases if that's
> something you'd be interested in.

I think the more automated (or regular, or close-to-upstream)
real-world testing that we get, the better off we are. We have a
number of regular distributions that track the upstream kernel fairly
closely, so we get a fair amount of coverage for the normal desktop
loads.

And the bots are doing great, but they tend to test very specific
things (in the case of "syzbot" the "specific" thing is obviously
pretty far-ranging, but it's still very small details). And latency
has always been harder to really test (outside of the truly trivial
microbenchmarks), so the fact that it sounds like you're going to test
not only a different environment than the usual distros but have a few
macro-level latency tests just sounds lovely in general.

Let's see how lovely I think it is once you start sending regression reports..

Linus

2020-05-22 21:37:10

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
[..]
> > > It doens't allow tasks for form their own groups (by for example setting
> > > the key to that of another task).
> >
> > So for this, I was thinking of making the prctl pass in an integer. And 0
> > would mean untagged. Does that sound good to you?
>
> A TID, I think. If you pass your own TID, you tag yourself as
> not-sharing. If you tag yourself with another tasks's TID, you can do
> ptrace tests to see if you're allowed to observe their junk.

But that would require a bunch of tasks agreeing on which TID to tag with.
For example, if 2 tasks tag with each other's TID, then they would have
different tags and not share.

What's wrong with passing in an integer instead? In any case, we would do the
CAP_SYS_ADMIN check to limit who can do it.

Also, one thing CGroup interface allows is an external process to set the
cookie, so I am wondering if we should use sched_setattr(2) instead of, or in
addition to, the prctl(2). That way, we can drop the CGroup interface
completely. How do you feel about that?

> > > It is also horribly ill defined what it means to 'enable', with whoem
> > > is it allows to share a core.
> >
> > I couldn't parse this. Do you mean "enabling coresched does not make sense if
> > we don't specify whom to share the core with?"
>
> As a corrolary yes. I mostly meant that a blanket 'enable' doesn't
> specify a 'who' you're sharing your bits with.

Yes, ok. I can reword the commit log a bit to make it more clear that we are
specifying who we can share a core with.

> > I was just trying to respect the functionality of the CGroup patch in the
> > coresched series, after all a gentleman named Peter Zijlstra wrote that
> > patch ;-) ;-).
>
> Yeah, but I think that same guy said that that was a shit interface and
> only hacked up because it was easy :-)

Fair enough :-)

> > More seriously, the reason I did it this way is the prctl-tagging is a bit
> > incompatible with CGroup tagging:
> >
> > 1. What happens if 2 tasks are in a tagged CGroup and one of them changes
> > their cookie through prctl? Do they still remain in the tagged CGroup but are
> > now going to not trust each other? Do they get removed from the CGroup? This
> > is why I made the prctl fail with -EBUSY in such cases.
> >
> > 2. What happens if 2 tagged tasks with different cookies are added to a
> > tagged CGroup? Do we fail the addition of the tasks to the group, or do we
> > override their cookie (like I'm doing)?
>
> For #2 I think I prefer failure.
>
> But having the rationale spelled out in documentation (man-pages for
> example) is important.

If we drop the CGroup interface, this would avoid both #1 and #2.

thanks,

- Joel

2020-05-24 14:03:14

by Phil Auld

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> [..]
> > > > It doens't allow tasks for form their own groups (by for example setting
> > > > the key to that of another task).
> > >
> > > So for this, I was thinking of making the prctl pass in an integer. And 0
> > > would mean untagged. Does that sound good to you?
> >
> > A TID, I think. If you pass your own TID, you tag yourself as
> > not-sharing. If you tag yourself with another tasks's TID, you can do
> > ptrace tests to see if you're allowed to observe their junk.
>
> But that would require a bunch of tasks agreeing on which TID to tag with.
> For example, if 2 tasks tag with each other's TID, then they would have
> different tags and not share.
>
> What's wrong with passing in an integer instead? In any case, we would do the
> CAP_SYS_ADMIN check to limit who can do it.
>
> Also, one thing CGroup interface allows is an external process to set the
> cookie, so I am wondering if we should use sched_setattr(2) instead of, or in
> addition to, the prctl(2). That way, we can drop the CGroup interface
> completely. How do you feel about that?
>

I think it should be an arbitrary 64bit value, in both interfaces to avoid
any potential reuse security issues.

I think the cgroup interface could be extended not to be a boolean but take
the value. With 0 being untagged as now.

And sched_setattr could be used to set it on a per task basis.


> > > > It is also horribly ill defined what it means to 'enable', with whoem
> > > > is it allows to share a core.
> > >
> > > I couldn't parse this. Do you mean "enabling coresched does not make sense if
> > > we don't specify whom to share the core with?"
> >
> > As a corrolary yes. I mostly meant that a blanket 'enable' doesn't
> > specify a 'who' you're sharing your bits with.
>
> Yes, ok. I can reword the commit log a bit to make it more clear that we are
> specifying who we can share a core with.
>
> > > I was just trying to respect the functionality of the CGroup patch in the
> > > coresched series, after all a gentleman named Peter Zijlstra wrote that
> > > patch ;-) ;-).
> >
> > Yeah, but I think that same guy said that that was a shit interface and
> > only hacked up because it was easy :-)
>
> Fair enough :-)
>
> > > More seriously, the reason I did it this way is the prctl-tagging is a bit
> > > incompatible with CGroup tagging:
> > >
> > > 1. What happens if 2 tasks are in a tagged CGroup and one of them changes
> > > their cookie through prctl? Do they still remain in the tagged CGroup but are
> > > now going to not trust each other? Do they get removed from the CGroup? This
> > > is why I made the prctl fail with -EBUSY in such cases.
> > >
> > > 2. What happens if 2 tagged tasks with different cookies are added to a
> > > tagged CGroup? Do we fail the addition of the tasks to the group, or do we
> > > override their cookie (like I'm doing)?
> >
> > For #2 I think I prefer failure.
> >
> > But having the rationale spelled out in documentation (man-pages for
> > example) is important.
>
> If we drop the CGroup interface, this would avoid both #1 and #2.
>

I believe both are useful. Personally, I think the per-task setting should
win over the cgroup tagging. In that case #1 just falls out. And #2 pretty
much as well. Nothing would happen to the tagged task as they were added
to the cgroup. They'd keep their explicitly assigned tags and everything
should "just work". There are other reasons to be in a cpu cgroup together
than just the core scheduling tag.

There are a few other edge cases, like if you are in a cgroup, but have
been tagged explicitly with sched_setattr and then get untagged (presumably
by setting 0) do you get the cgroup tag or just stay untagged? I think based
on per-task winning you'd stay untagged. I suppose you could move out and
back in the cgroup to get the tag reapplied (Or maybe the cgroup interface
could just be reused with the same value to re-tag everyone who's untagged).



Cheers,
Phil


> thanks,
>
> - Joel
>

--

2020-05-28 14:54:46

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote:
> On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> > [..]
> > > > > It doens't allow tasks for form their own groups (by for example setting
> > > > > the key to that of another task).
> > > >
> > > > So for this, I was thinking of making the prctl pass in an integer. And 0
> > > > would mean untagged. Does that sound good to you?
> > >
> > > A TID, I think. If you pass your own TID, you tag yourself as
> > > not-sharing. If you tag yourself with another tasks's TID, you can do
> > > ptrace tests to see if you're allowed to observe their junk.
> >
> > But that would require a bunch of tasks agreeing on which TID to tag with.
> > For example, if 2 tasks tag with each other's TID, then they would have
> > different tags and not share.
> >
> > What's wrong with passing in an integer instead? In any case, we would do the
> > CAP_SYS_ADMIN check to limit who can do it.
> >
> > Also, one thing CGroup interface allows is an external process to set the
> > cookie, so I am wondering if we should use sched_setattr(2) instead of, or in
> > addition to, the prctl(2). That way, we can drop the CGroup interface
> > completely. How do you feel about that?
> >
>
> I think it should be an arbitrary 64bit value, in both interfaces to avoid
> any potential reuse security issues.
>
> I think the cgroup interface could be extended not to be a boolean but take
> the value. With 0 being untagged as now.
>
> And sched_setattr could be used to set it on a per task basis.

Yeah, something like this will be needed.
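
To make that concrete, here is a rough userspace-side sketch of what such a
sched_setattr(2) extension could look like. None of this exists in the posted
patches: the core_cookie field, the SCHED_FLAG_CORE_COOKIE flag and the
struct name are all made up for illustration.

/*
 * Sketch only: an extended sched_attr carrying an arbitrary 64-bit cookie.
 * The layout mirrors the current UAPI struct sched_attr, plus one new field.
 */
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>

struct sched_attr_sketch {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
	uint64_t core_cookie;		/* hypothetical new field */
};

#define SCHED_FLAG_CORE_COOKIE	0x100	/* hypothetical new flag */

static int set_core_cookie(pid_t pid, uint64_t cookie)
{
	struct sched_attr_sketch attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_flags = SCHED_FLAG_CORE_COOKIE;
	attr.core_cookie = cookie;	/* 0 would mean untagged, as above */

	return syscall(SYS_sched_setattr, pid, &attr, 0);
}

An external manager could call something like this on any pid, which is the
property the CGroup interface currently gives us.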

> > > > More seriously, the reason I did it this way is the prctl-tagging is a bit
> > > > incompatible with CGroup tagging:
> > > >
> > > > 1. What happens if 2 tasks are in a tagged CGroup and one of them changes
> > > > their cookie through prctl? Do they still remain in the tagged CGroup but are
> > > > now going to not trust each other? Do they get removed from the CGroup? This
> > > > is why I made the prctl fail with -EBUSY in such cases.

In util-clamp's design (which has a task-specific attribute and a task-group
attribute), it seems that the priority is the task-specific value first, then
the group one, then the system-wide one.

Perhaps a similar design can be adopted for this interface. So probably we
should let the per-task interface not fail if the task was already in a
CGroup, and rather prioritize its value before looking at the group one?

Uclamp's comments:

* The effective clamp bucket index of a task depends on, by increasing
* priority:
* - the task specific clamp value, when explicitly requested from userspace
* - the task group effective clamp value, for tasks not either in the root
* group or in an autogroup
* - the system default clamp value, defined by the sysadmin
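
To make the analogy concrete, a minimal sketch of the same ordering applied
to core-scheduling cookies could look like the below. p->core_cookie is the
per-task field from the series; the task group's "tagged" field and the
helper name are assumptions for illustration only.

/*
 * Sketch only: pick the cookie that takes effect for a task, by increasing
 * priority, mirroring the uclamp comment above:
 *   - the task-specific cookie, when explicitly requested from userspace
 *   - the task group's tag, for tasks in a tagged cgroup
 *   - the system default: 0, i.e. untagged
 */
static unsigned long sched_core_effective_cookie(struct task_struct *p)
{
	if (p->core_cookie)			/* per-task setting wins */
		return p->core_cookie;

	if (task_group(p)->tagged)		/* else fall back to the cgroup tag */
		return (unsigned long)task_group(p);

	return 0;				/* untagged: may share with anyone */
}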

> > > >
> > > > 2. What happens if 2 tagged tasks with different cookies are added to a
> > > > tagged CGroup? Do we fail the addition of the tasks to the group, or do we
> > > > override their cookie (like I'm doing)?
> > >
> > > For #2 I think I prefer failure.
> > >
> > > But having the rationale spelled out in documentation (man-pages for
> > > example) is important.
> >
> > If we drop the CGroup interface, this would avoid both #1 and #2.
> >
>
> I believe both are useful. Personally, I think the per-task setting should
> win over the cgroup tagging. In that case #1 just falls out.

Cool, this is similar to what I mentioned above.

> And #2 pretty
> much as well. Nothing would happen to the tagged tasks as they are added
> to the cgroup. They'd keep their explicitly assigned tags and everything
> should "just work". There are other reasons to be in a cpu cgroup together
> than just the core scheduling tag.

Well ok, so there's no reason to fail the addition of a prctl-tagged task
to a CGroup then; we can let it succeed but prioritize the task-specific
attribute over the group-specific one.

> There are a few other edge cases, like if you are in a cgroup, but have
> been tagged explicitly with sched_setattr and then get untagged (presumably
> by setting 0) do you get the cgroup tag or just stay untagged? I think based
> on per-task winning you'd stay untagged. I suppose you could move out and
> back in the cgroup to get the tag reapplied (Or maybe the cgroup interface
> could just be reused with the same value to re-tag everyone who's untagged).

If we maintain a task-specific tag and a group-specific tag, then I think
both tags can coexist and the final tag is decided on the priority basis
mentioned above.

So before getting into CGroups, I think we should first develop the
task-specific tagging mechanism like Peter was suggesting. So let us talk
about that. I will reply to the other thread Vineeth started while CC'ing
you. In particular, I like Peter's idea of userland passing a TID to share
a core with.

thanks,

- Joel


>
>
>
> Cheers,
> Phil
>
>
> > thanks,
> >
> > - Joel
> >
>
> --
>

2020-05-28 17:09:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote:
> On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> > [..]
> > > > > It doesn't allow tasks to form their own groups (by for example setting
> > > > > the key to that of another task).
> > > >
> > > > So for this, I was thinking of making the prctl pass in an integer. And 0
> > > > would mean untagged. Does that sound good to you?
> > >
> > > A TID, I think. If you pass your own TID, you tag yourself as
> > > not-sharing. If you tag yourself with another task's TID, you can do
> > > ptrace tests to see if you're allowed to observe their junk.
> >
> > But that would require a bunch of tasks agreeing on which TID to tag with.
> > For example, if 2 tasks tag with each other's TID, then they would have
> > different tags and not share.

Well, don't do that then ;-)

> > What's wrong with passing in an integer instead? In any case, we would do the
> > CAP_SYS_ADMIN check to limit who can do it.

So the actual permission model can be different depending on how broken
the hardware is.

> > Also, one thing CGroup interface allows is an external process to set the
> > cookie, so I am wondering if we should use sched_setattr(2) instead of, or in
> > addition to, the prctl(2). That way, we can drop the CGroup interface
> > completely. How do you feel about that?
> >
>
> I think it should be an arbitrary 64bit value, in both interfaces to avoid
> any potential reuse security issues.
>
> I think the cgroup interface could be extended not to be a boolean but take
> the value. With 0 being untagged as now.

How do you avoid reuse in such a huge space? That just creates yet
another problem for the kernel to keep track of who is who.

With random u64 numbers, it even becomes hard to determine if you're
sharing at all or not.

Now, with the current SMT+MDS trainwreck, any sharing is bad because it
allows leaking kernel privates. But under a less severe threat scenario,
say where only user data would be at risk, the ptrace() tests make
sense, but those become really hard with random u64 numbers too.

What would the purpose of random u64 values be for cgroups? That only
replicates the problem of determining uniqueness there. Then you can get
two cgroups unintentionally sharing because you got lucky.

Also, fundamentally, we cannot have more threads than TID space, it's a
natural identifier.

2020-05-28 18:22:06

by Phil Auld

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

On Thu, May 28, 2020 at 07:01:28PM +0200 Peter Zijlstra wrote:
> On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote:
> > On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> > > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> > > [..]
> > > > > > It doesn't allow tasks to form their own groups (by for example setting
> > > > > > the key to that of another task).
> > > > >
> > > > > So for this, I was thinking of making the prctl pass in an integer. And 0
> > > > > would mean untagged. Does that sound good to you?
> > > >
> > > > A TID, I think. If you pass your own TID, you tag yourself as
> > > > not-sharing. If you tag yourself with another task's TID, you can do
> > > > ptrace tests to see if you're allowed to observe their junk.
> > >
> > > But that would require a bunch of tasks agreeing on which TID to tag with.
> > > For example, if 2 tasks tag with each other's TID, then they would have
> > > different tags and not share.
>
> Well, don't do that then ;-)
>

That was a poorly worded example :)

The point I was trying to make was more that one TID of a group (not cgroup!)
of tasks is just an arbitrary value.

At a single process (or pair rather) level, sure, you can see it as an
identifier of whom you want to share with, but even then you have to tag
both processes with this. And it has less meaning when whom you want to
share with is multiple tasks.

> > > What's wrong with passing in an integer instead? In any case, we would do the
> > > CAP_SYS_ADMIN check to limit who can do it.
>
> So the actual permission model can be different depending on how broken
> the hardware is.
>
> > > Also, one thing CGroup interface allows is an external process to set the
> > > cookie, so I am wondering if we should use sched_setattr(2) instead of, or in
> > > addition to, the prctl(2). That way, we can drop the CGroup interface
> > > completely. How do you feel about that?
> > >
> >
> > I think it should be an arbitrary 64bit value, in both interfaces to avoid
> > any potential reuse security issues.
> >
> > I think the cgroup interface could be extended not to be a boolean but take
> > the value. With 0 being untagged as now.
>
> How do you avoid reuse in such a huge space? That just creates yet
> another problem for the kernel to keep track of who is who.
>

The kernel doesn't care or have to track anything. The admin does.
At the kernel level it's just matching cookies.

Tasks A, B and C can all share a core, so you give them each A's TID as a
cookie. Task A then exits. Now B and C are using what is essentially a
random value. Task D comes along and wants to share with B and C. You have
to tag it with A's old TID, which has no meaning at this point.

And if A's TID ever gets reused, the new A' gets to share too. At some
level, aren't those still 32 bits?

> With random u64 numbers, it even becomes hard to determine if you're
> sharing at all or not.
>
> Now, with the current SMT+MDS trainwreck, any sharing is bad because it
> allows leaking kernel privates. But under a less severe threat scenario,
> say where only user data would be at risk, the ptrace() tests make
> sense, but those become really hard with random u64 numbers too.
>
> What would the purpose of random u64 values be for cgroups? That only
> replicates the problem of determining uniqueness there. Then you can get
> two cgroups unintentionally sharing because you got lucky.
>

Seems that would be more flexible for the admin.

What if you had two cgroups you wanted to allow to run together? Or a
cgroup and a few processes from a different one (say with different
quotas or something).

I don't have such use cases so I don't feel that strongly but it seemed
more flexible and followed the mechanism-in-kernel/policy-in-userspace
dictum rather than basing the functionality on the implementation details.


Cheers,
Phil


> Also, fundamentally, we cannot have more threads than TID space, it's a
> natural identifier.
>

--

2020-05-28 18:27:50

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

Hi Peter,

On Thu, May 28, 2020 at 07:01:28PM +0200, Peter Zijlstra wrote:
> On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote:
> > On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> > > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> > > [..]
> > > > > > It doesn't allow tasks to form their own groups (by for example setting
> > > > > > the key to that of another task).
> > > > >
> > > > > So for this, I was thinking of making the prctl pass in an integer. And 0
> > > > > would mean untagged. Does that sound good to you?
> > > >
> > > > A TID, I think. If you pass your own TID, you tag yourself as
> > > > not-sharing. If you tag yourself with another task's TID, you can do
> > > > ptrace tests to see if you're allowed to observe their junk.
> > >
> > > But that would require a bunch of tasks agreeing on which TID to tag with.
> > > For example, if 2 tasks tag with each other's TID, then they would have
> > > different tags and not share.
>
> Well, don't do that then ;-)

We could also guard it with a mutex. First task to set the TID wins, the
other thread just reuses the cookie of the TID that won.

But I think we cannot just use the TID value as the cookie, due to TID
wrap-around and reuse; otherwise we could accidentally group 2 tasks.
Instead, I suggest we keep the TID as the interface per your suggestion and
do the needed ptrace checks, but convert the TID to the task_struct pointer
value and use that as the cookie for the group of tasks sharing a core.
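
A minimal kernel-side sketch of that, assuming the interface handler is
handed the TID; core_cookie is the field from the series, while the helper
name and the (omitted) cookie lifetime handling are assumptions:

static int sched_core_share_with(pid_t tid)
{
	struct task_struct *target;
	int err = 0;

	rcu_read_lock();
	target = find_task_by_vpid(tid);
	if (!target) {
		rcu_read_unlock();
		return -ESRCH;
	}
	get_task_struct(target);
	rcu_read_unlock();

	/* Only allow grouping with tasks we are entitled to observe. */
	if (!ptrace_may_access(target, PTRACE_MODE_READ_REALCREDS)) {
		err = -EPERM;
		goto out;
	}

	/*
	 * Use the target's task_struct address as the cookie rather than the
	 * TID itself, so a later TID reuse cannot silently join the group.
	 * A real implementation would still need to keep whatever object
	 * backs the cookie alive (refcounting), which this sketch omits.
	 */
	WRITE_ONCE(current->core_cookie, (unsigned long)target);
out:
	put_task_struct(target);
	return err;
}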

Thoughts?

thanks,

- Joel

> > > What's wrong with passing in an integer instead? In any case, we would do the
> > > CAP_SYS_ADMIN check to limit who can do it.
>
> So the actual permission model can be different depending on how broken
> the hardware is.
>
> > > Also, one thing CGroup interface allows is an external process to set the
> > > cookie, so I am wondering if we should use sched_setattr(2) instead of, or in
> > > addition to, the prctl(2). That way, we can drop the CGroup interface
> > > completely. How do you feel about that?
> > >
> >
> > I think it should be an arbitrary 64bit value, in both interfaces to avoid
> > any potential reuse security issues.
> >
> > I think the cgroup interface could be extended not to be a boolean but take
> > the value. With 0 being untagged as now.
>
> How do you avoid reuse in such a huge space? That just creates yet
> another problem for the kernel to keep track of who is who.
>
> With random u64 numbers, it even becomes hard to determine if you're
> sharing at all or not.
>
> Now, with the current SMT+MDS trainwreck, any sharing is bad because it
> allows leaking kernel privates. But under a less severe threat scenario,
> say where only user data would be at risk, the ptrace() tests make
> sense, but those become really hard with random u64 numbers too.
>
> What would the purpose of random u64 values be for cgroups? That only
> replicates the problem of determining uniqueness there. Then you can get
> two cgroups unintentionally sharing because you got lucky.
>
> Also, fundamentally, we cannot have more threads than TID space, it's a
> natural identifier.

2020-05-28 18:36:57

by Phil Auld

[permalink] [raw]
Subject: Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

On Thu, May 28, 2020 at 02:17:19PM -0400 Phil Auld wrote:
> On Thu, May 28, 2020 at 07:01:28PM +0200 Peter Zijlstra wrote:
> > On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote:
> > > On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> > > > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> > > > [..]
> > > > > > > It doesn't allow tasks to form their own groups (by for example setting
> > > > > > > the key to that of another task).
> > > > > >
> > > > > > So for this, I was thinking of making the prctl pass in an integer. And 0
> > > > > > would mean untagged. Does that sound good to you?
> > > > >
> > > > > A TID, I think. If you pass your own TID, you tag yourself as
> > > > > not-sharing. If you tag yourself with another task's TID, you can do
> > > > > ptrace tests to see if you're allowed to observe their junk.
> > > >
> > > > But that would require a bunch of tasks agreeing on which TID to tag with.
> > > > For example, if 2 tasks tag with each other's TID, then they would have
> > > > different tags and not share.
> >
> > Well, don't do that then ;-)
> >
>
> That was a poorly worded example :)
>

Heh, sorry, I thought that was my statement. I do not mean to belittle Joel's
example... That's a fine example of a totally different problem than I
was thinking of :)


Cheers,
Phil

> The point I was trying to make was more that one TID of a group (not cgroup!)
> of tasks is just an arbitrary value.
>
> At a single process (or pair rather) level, sure, you can see it as an
> identifier of whom you want to share with, but even then you have to tag
> both processes with this. And it has less meaning when the whom you want to
> share with is multiple tasks.
>
> > > > What's wrong with passing in an integer instead? In any case, we would do the
> > > > CAP_SYS_ADMIN check to limit who can do it.
> >
> > So the actual permission model can be different depending on how broken
> > the hardware is.
> >
> > > > Also, one thing CGroup interface allows is an external process to set the
> > > > cookie, so I am wondering if we should use sched_setattr(2) instead of, or in
> > > > addition to, the prctl(2). That way, we can drop the CGroup interface
> > > > completely. How do you feel about that?
> > > >
> > >
> > > I think it should be an arbitrary 64bit value, in both interfaces to avoid
> > > any potential reuse security issues.
> > >
> > > I think the cgroup interface could be extended not to be a boolean but take
> > > the value. With 0 being untagged as now.
> >
> > How do you avoid reuse in such a huge space? That just creates yet
> > another problem for the kernel to keep track of who is who.
> >
>
> The kernel doesn't care or have to track anything. The admin does.
> At the kernel level it's just matching cookies.
>
> Tasks A,B,C all can share core so you give them each A's TID as a cookie.
> Task A then exits. Now B and C are using essentially a random value.
> Task D comes along and wants to share with B and C. You have to tag it
> with A's old TID, which has no meaning at this point.
>
> And if A's TID ever gets reused, the new A' gets to share too. At some
> level, aren't those still 32 bits?
>
> > With random u64 numbers, it even becomes hard to determine if you're
> > sharing at all or not.
> >
> > Now, with the current SMT+MDS trainwreck, any sharing is bad because it
> > allows leaking kernel privates. But under a less severe threat scenario,
> > say where only user data would be at risk, the ptrace() tests make
> > sense, but those become really hard with random u64 numbers too.
> >
> > What would the purpose of random u64 values be for cgroups? That only
> > replicates the problem of determining uniqueness there. Then you can get
> > two cgroups unintentionally sharing because you got lucky.
> >
>
> Seems that would be more flexible for the admin.
>
> What if you had two cgroups you wanted to allow to run together? Or a
> cgroup and a few processes from a different one (say with different
> quotas or something).
>
> I don't have such use cases so I don't feel that strongly but it seemed
> more flexible and followed the mechanism-in-kernel/policy-in-userspace
> dictum rather than basing the functionality on the implementation details.
>
>
> Cheers,
> Phil
>
>
> > Also, fundamentally, we cannot have more threads than TID space, it's a
> > natural identifier.
> >
>
> --

--

2020-06-25 23:27:47

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

On Wed, Mar 4, 2020 at 12:00 PM vpillai <[email protected]> wrote:
>
>
> Fifth iteration of the Core-Scheduling feature.
>
It's probably time for an iteration and we are planning to post v6 based
on this branch:
https://github.com/digitalocean/linux-coresched/tree/coresched/pre-v6-v5.7.y

Just wanted to share the details about v6 here before posting the patch
series. If there is no objection to the following, we shall be posting
the v6 early next week.

The main changes in v6 are the following:
1. Address Peter's comments in v5
- Code cleanup
- Remove fixes related to hotplugging.
- Split the patch out for force idling starvation
3. Fix for RCU deadlock
4. core wide priority comparison minor re-work.
5. IRQ Pause patch
6. Documentation
- https://github.com/digitalocean/linux-coresched/blob/coresched/pre-v6-v5.7.y/Documentation/admin-guide/hw-vuln/core-scheduling.rst

This version is much leaner compared to v5 due to the removal of hotplug
support. As a result, dynamic coresched enable/disable on cpus due to
smt on/off on the core does not function anymore. I tried to reproduce the
crashes during hotplug, but could not reproduce them reliably. The plan is to
try to reproduce the crashes with v6, and document each crash corner case
as we fix it. Previously, we fixed the issues ad hoc without clear
documentation and the fixes became complex over time.

TODO lists:

- Interface discussions could not come to a conclusion in v5 and hence would
like to restart the discussion and reach a consensus on it.
- https://lwn.net/ml/linux-kernel/[email protected]

- Core wide vruntime calculation needs rework:
- https://lwn.net/ml/linux-kernel/[email protected]

- Load balancing/migration changes ignore group weights:
- https://lwn.net/ml/linux-kernel/[email protected]


Please have a look and let me know comments/suggestions or anything missed.

Thanks,
Vineeth

2020-06-26 04:37:29

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
<[email protected]> wrote:
[...]
> TODO lists:
>
> - Interface discussions could not come to a conclusion in v5 and hence would
> like to restart the discussion and reach a consensus on it.
> - https://lwn.net/ml/linux-kernel/[email protected]

Thanks Vineeth, just want to add: I have a revised implementation of
prctl(2) where you only pass a TID of a task you'd like to share a
core with (credit to Peter for the idea [1]) so we can make use of
ptrace_may_access() checks. I am currently finishing writing
kselftests for this and will post it all once it is ready.
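
For reference, the userspace side of that would be just a prctl(2) call
passing the TID; the command name and number below are placeholders, since
nothing has been merged at this point:

#include <errno.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/types.h>

/* Placeholder value, made up for this example. */
#ifndef PR_SCHED_CORE_SHARE
#define PR_SCHED_CORE_SHARE	1000
#endif

/*
 * Ask the kernel to let the calling task share a core with 'tid'.  The
 * kernel side would run the ptrace_may_access() check against 'tid'.
 */
static int share_core_with(pid_t tid)
{
	if (prctl(PR_SCHED_CORE_SHARE, tid, 0, 0, 0) == -1) {
		perror("prctl(PR_SCHED_CORE_SHARE)");
		return -errno;
	}
	return 0;
}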

However a question: If using the prctl(2) on a CGroup tagged task, we
discussed in previous threads [2] to override the CGroup cookie such
that the task may not share a core with any of the tasks in its CGroup
anymore, and I think Peter and Phil are Ok with that. My question though is
- would that not be confusing for anyone looking at the CGroup
filesystem's "tag" and "tasks" files?

To resolve this, I am proposing to add a new CGroup file
'tasks.coresched' to the CGroup, and this will only contain tasks that
were assigned cookies due to their CGroup residency. As soon as one
prctl(2)'s the task, it will stop showing up in the CGroup's
"tasks.coresched" file (unless of course it was requesting to
prctl-share a core with someone in its CGroup itself). Are folks Ok
with this solution?

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/

2020-06-26 16:17:27

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes <[email protected]> wrote:
>
> On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
> <[email protected]> wrote:
> [...]
> > TODO lists:
> >
> > - Interface discussions could not come to a conclusion in v5 and hence would
> > like to restart the discussion and reach a consensus on it.
> > - https://lwn.net/ml/linux-kernel/[email protected]
>
> Thanks Vineeth, just want to add: I have a revised implementation of
> prctl(2) where you only pass a TID of a task you'd like to share a
> core with (credit to Peter for the idea [1]) so we can make use of
> ptrace_may_access() checks. I am currently finishing writing of
> kselftests for this and post it all once it is ready.
>
Thinking more about it, using TID/PID for prctl(2) and internally
using a task identifier to identify a coresched group may have
limitations. A coresched group can exist longer than the lifetime
of a task, and then there is a chance for that identifier to be
reused by a newer task which may or may not be a part of the same
coresched group.

A way to overcome this is to have a coresched group with a separate
identifier implemented internally and have a mapping from task to
group. And the cgroup framework provides exactly that.

I feel we could use prctl for isolating individual tasks/processes
and use grouping frameworks like cgroups for core scheduling groups.
The cpu cgroup might not be a good idea as it has its own purpose. Users
might not always want a group of trusted tasks in the same cpu cgroup,
or all the processes in an existing cpu cgroup might not be mutually
trusted either.

What do you think about having a separate cgroup for coresched?
Both the coresched cgroup and prctl() could co-exist, where prctl could
be used to isolate an individual process or task and the coresched cgroup
to group trusted processes.

> However a question: If using the prctl(2) on a CGroup tagged task, we
> discussed in previous threads [2] to override the CGroup cookie such
> that the task may not share a core with any of the tasks in its CGroup
> anymore and I think Peter and Phil are Ok with. My question though is
> - would that not be confusing for anyone looking at the CGroup
> filesystem's "tag" and "tasks" files?
>
Having a dedicated cgroup for coresched could solve this problem
as well. "coresched.tasks" inside the cgroup hierarchy would list all
the tasks in the group, and prctl can override this and take a task out
of the group.

> To resolve this, I am proposing to add a new CGroup file
> 'tasks.coresched' to the CGroup, and this will only contain tasks that
> were assigned cookies due to their CGroup residency. As soon as one
> prctl(2)'s the task, it will stop showing up in the CGroup's
> "tasks.coresched" file (unless of course it was requesting to
> prctl-share a core with someone in its CGroup itself). Are folks Ok
> with this solution?
>
As I mentioned above, IMHO cpu cgroups should not be used to account
for core scheduling as well. Cpu cgroups serve a different purpose,
and overloading them with core scheduling would not be flexible or
scalable. But if there is a consensus to move forward with cpu cgroups,
adding this new file seems okay to me.

Thoughts/suggestions/concerns?

Thanks,
Vineeth

2020-06-26 16:41:54

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

On Fri, Jun 26, 2020 at 10:36:01AM -0400, Vineeth Remanan Pillai wrote:
> On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes <[email protected]> wrote:
> >
> > On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
> > <[email protected]> wrote:
> > [...]
> > > TODO lists:
> > >
> > > - Interface discussions could not come to a conclusion in v5 and hence would
> > > like to restart the discussion and reach a consensus on it.
> > > - https://lwn.net/ml/linux-kernel/[email protected]
> >
> > Thanks Vineeth, just want to add: I have a revised implementation of
> > prctl(2) where you only pass a TID of a task you'd like to share a
> > core with (credit to Peter for the idea [1]) so we can make use of
> > ptrace_may_access() checks. I am currently finishing writing of
> > kselftests for this and post it all once it is ready.
> >
> Thinking more about it, using TID/PID for prctl(2) and internally
> using a task identifier to identify coresched group may have
> limitations. A coresched group can exist longer than the lifetime
> of a task and then there is a chance for that identifier to be
> > reused by a newer task which may or may not be a part of the same
> coresched group.

True, for the prctl(2) tagging (a task wanting to share a core with
another) we will need some way of internally identifying groups which does
not depend on any value that can be reused for another purpose.

[..]
> What do you think about having a separate cgroup for coresched?
> Both coresched cgroup and prctl() could co-exist where prctl could
> be used to isolate individual process or task and coresched cgroup
> to group trusted processes.

This sounds like a fine idea to me. I wonder how Tejun and Peter feel about
having a new attribute-less CGroup controller for core-scheduling and just
using that for tagging. (No need to even have a tag file; just adding a task
to/removing it from the CGroup will tag it.)

> > However a question: If using the prctl(2) on a CGroup tagged task, we
> > discussed in previous threads [2] to override the CGroup cookie such
> > that the task may not share a core with any of the tasks in its CGroup
> > anymore and I think Peter and Phil are Ok with. My question though is
> > - would that not be confusing for anyone looking at the CGroup
> > filesystem's "tag" and "tasks" files?
> >
> Having a dedicated cgroup for coresched could solve this problem
> as well. "coresched.tasks" inside the cgroup hierarchy would list all
> > the tasks in the group and prctl can override this and take it out
> of the group.

We don't even need coresched.tasks; the existing 'tasks' file of CGroups can
be used.

> > To resolve this, I am proposing to add a new CGroup file
> > 'tasks.coresched' to the CGroup, and this will only contain tasks that
> > were assigned cookies due to their CGroup residency. As soon as one
> > prctl(2)'s the task, it will stop showing up in the CGroup's
> > "tasks.coresched" file (unless of course it was requesting to
> > prctl-share a core with someone in its CGroup itself). Are folks Ok
> > with this solution?
> >
> As I mentioned above, IMHO cpu cgroups should not be used to account
> for core scheduling as well. Cpu cgroups serve a different purpose
> and overloading it with core scheduling would not be flexible and
> scalable. But if there is a consensus to move forward with cpu cgroups,
> adding this new file seems to be okay with me.

Yes, this is the problem. Many people use CPU controller CGroups already for
other purposes. In that case, tagging a CGroup would make all the entities in
the group be able to share a core, which may not always make sense. Maybe a
new CGroup controller is the answer (?).

thanks,

- Joel

2020-06-26 16:47:19

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

On Fri, Jun 26, 2020 at 11:10:28AM -0400, Joel Fernandes wrote:
> On Fri, Jun 26, 2020 at 10:36:01AM -0400, Vineeth Remanan Pillai wrote:
> > On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes <[email protected]> wrote:
> > >
> > > On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
> > > <[email protected]> wrote:
> > > [...]
> > > > TODO lists:
> > > >
> > > > - Interface discussions could not come to a conclusion in v5 and hence would
> > > > like to restart the discussion and reach a consensus on it.
> > > > - https://lwn.net/ml/linux-kernel/[email protected]
> > >
> > > Thanks Vineeth, just want to add: I have a revised implementation of
> > > prctl(2) where you only pass a TID of a task you'd like to share a
> > > core with (credit to Peter for the idea [1]) so we can make use of
> > > ptrace_may_access() checks. I am currently finishing writing of
> > > kselftests for this and post it all once it is ready.
> > >
> > Thinking more about it, using TID/PID for prctl(2) and internally
> > using a task identifier to identify coresched group may have
> > limitations. A coresched group can exist longer than the lifetime
> > of a task and then there is a chance for that identifier to be
> > > reused by a newer task which may or may not be a part of the same
> > coresched group.
>
> True, for the prctl(2) tagging (a task wanting to share core with
> another) we will need some way of internally identifying groups which does
> not depend on any value that can be reused for another purpose.
>
> [..]
> > What do you think about having a separate cgroup for coresched?
> > Both coresched cgroup and prctl() could co-exist where prctl could
> > be used to isolate individual process or task and coresched cgroup
> > to group trusted processes.
>
> This sounds like a fine idea to me. I wonder how Tejun and Peter feel about
> having a new attribute-less CGroup controller for core-scheduling and just
> use that for tagging. (No need to even have a tag file, just adding/removing
> to/from CGroup will tag).

+Tejun

thanks,

- Joel


> > > However a question: If using the prctl(2) on a CGroup tagged task, we
> > > discussed in previous threads [2] to override the CGroup cookie such
> > > that the task may not share a core with any of the tasks in its CGroup
> > > anymore and I think Peter and Phil are Ok with. My question though is
> > > - would that not be confusing for anyone looking at the CGroup
> > > filesystem's "tag" and "tasks" files?
> > >
> > Having a dedicated cgroup for coresched could solve this problem
> > as well. "coresched.tasks" inside the cgroup hierarchy would list all
> > > the tasks in the group and prctl can override this and take it out
> > of the group.
>
> We don't even need coresched.tasks, just the existing 'tasks' of CGroups can
> be used.
>
> > > To resolve this, I am proposing to add a new CGroup file
> > > 'tasks.coresched' to the CGroup, and this will only contain tasks that
> > > were assigned cookies due to their CGroup residency. As soon as one
> > > prctl(2)'s the task, it will stop showing up in the CGroup's
> > > "tasks.coresched" file (unless of course it was requesting to
> > > prctl-share a core with someone in its CGroup itself). Are folks Ok
> > > with this solution?
> > >
> > As I mentioned above, IMHO cpu cgroups should not be used to account
> > for core scheduling as well. Cpu cgroups serve a different purpose
> > and overloading it with core scheduling would not be flexible and
> > scalable. But if there is a consensus to move forward with cpu cgroups,
> > adding this new file seems to be okay with me.
>
> Yes, this is the problem. Many people use CPU controller CGroups already for
> other purposes. In that case, tagging a CGroup would make all the entities in
> the group be able to share a core, which may not always make sense. Maybe a
> new CGroup controller is the answer (?).
>
> thanks,
>
> - Joel
>

2020-06-27 16:23:23

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

On Fri, Jun 26, 2020 at 11:10 AM Joel Fernandes <[email protected]> wrote:
> [..]
> > What do you think about having a separate cgroup for coresched?
> > Both coresched cgroup and prctl() could co-exist where prctl could
> > be used to isolate individual process or task and coresched cgroup
> > to group trusted processes.
>
> This sounds like a fine idea to me. I wonder how Tejun and Peter feel about
> having a new attribute-less CGroup controller for core-scheduling and just
> use that for tagging. (No need to even have a tag file, just adding/removing
> to/from CGroup will tag).

Unless there are any major objections to this idea, or better ideas
for CGroup users, we will consider proposing a new CGroup controller
for this. The issue with CPU controller CGroups is that they may be
configured in a way that is incompatible with tagging.

And I was also thinking of a new clone flag CLONE_CORE (which allows a
child to share a parent's core). This is because the fork semantics
are not clear, and IMHO it may be better to leave the behavior of fork to
userspace than to hard-code policy in the kernel.
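
Purely as an illustration of that last idea (CLONE_CORE is not a real flag
and the bit value below is made up), a parent opting its child in to core
sharing might look like:

#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <linux/sched.h>	/* struct clone_args */

/* Hypothetical flag, chosen outside the currently used clone3 flag bits. */
#ifndef CLONE_CORE
#define CLONE_CORE	0x400000000ULL
#endif

static pid_t spawn_core_sharing_child(void)
{
	struct clone_args args;

	memset(&args, 0, sizeof(args));
	/* Hypothetical: child inherits the parent's core-scheduling cookie. */
	args.flags = CLONE_CORE;
	args.exit_signal = SIGCHLD;

	return syscall(SYS_clone3, &args, sizeof(args));
}

Leaving this to an explicit flag keeps the fork policy in userspace instead
of hard-coding it in the kernel, which is the point above.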

Perhaps we can also discuss this at the scheduler MC at Plumbers.

Any other thoughts?

- Joel

2020-06-29 19:41:55

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

Hi Aubrey,

On Mon, Jun 29, 2020 at 8:34 AM Li, Aubrey <[email protected]> wrote:
>
> > - Load balancing/migration changes ignores group weights:
> > - https://lwn.net/ml/linux-kernel/[email protected]
>
> According to Aaron's response below:
> https://lwn.net/ml/linux-kernel/[email protected]/
>
> The following logic seems to be helpful for Aaron's case.
>
> +	/*
> +	 * Ignore cookie match if there is a big imbalance between the src rq
> +	 * and dst rq.
> +	 */
> +	if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1)
> +		return true;
>
> I didn't see any other comments on the patch here:
> https://lwn.net/ml/linux-kernel/[email protected]/
>
> Do we have another way to address this issue?
>
We do not have a clear fix for this yet, and did not get much time to
work on this.

I feel that the above change would not fix the real issue.
The issue is about not considering the weight of the group when we
try to load balance, but the above change checks only
nr_running, which might not always work. I feel that we should fix
the real issue in v6 and probably hold off on adding the workaround
fix in the interim. I have added a TODO specifically for this bug
in v6.

What do you think?

Thanks,
Vineeth

2020-06-29 21:27:58

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

Hi Vineeth,

On 2020/6/26 4:12, Vineeth Remanan Pillai wrote:
> On Wed, Mar 4, 2020 at 12:00 PM vpillai <[email protected]> wrote:
>>
>>
>> Fifth iteration of the Core-Scheduling feature.
>>
> It's probably time for an iteration and we are planning to post v6 based
> on this branch:
> https://github.com/digitalocean/linux-coresched/tree/coresched/pre-v6-v5.7.y
>
> Just wanted to share the details about v6 here before posting the patch
> series. If there is no objection to the following, we shall be posting
> the v6 early next week.
>
> The main changes in v6 are the following:
> 1. Address Peter's comments in v5
> - Code cleanup
> - Remove fixes related to hotplugging.
> - Split the patch out for force idling starvation
> 3. Fix for RCU deadlock
> 4. core wide priority comparison minor re-work.
> 5. IRQ Pause patch
> 6. Documentation
> - https://github.com/digitalocean/linux-coresched/blob/coresched/pre-v6-v5.7.y/Documentation/admin-guide/hw-vuln/core-scheduling.rst
>
> This version is much leaner compared to v5 due to the removal of hotplug
> support. As a result, dynamic coresched enable/disable on cpus due to
> smt on/off on the core do not function anymore. I tried to reproduce the
> crashes during hotplug, but could not reproduce reliably. The plan is to
> try to reproduce the crashes with v6, and document each corner case for crashes
> as we fix those. Previously, we randomly fixed the issues without a clear
> documentation and the fixes became complex over time.
>
> TODO lists:
>
> - Interface discussions could not come to a conclusion in v5 and hence would
> like to restart the discussion and reach a consensus on it.
> - https://lwn.net/ml/linux-kernel/[email protected]
>
> - Core wide vruntime calculation needs rework:
> - https://lwn.net/ml/linux-kernel/[email protected]
>
> - Load balancing/migration changes ignores group weights:
> - https://lwn.net/ml/linux-kernel/[email protected]

According to Aaron's response below:
https://lwn.net/ml/linux-kernel/[email protected]/

The following logic seems to be helpful for Aaron's case.

+	/*
+	 * Ignore cookie match if there is a big imbalance between the src rq
+	 * and dst rq.
+	 */
+	if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1)
+		return true;
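
For context, a rough sketch of where such an early-out would sit in the
coresched-aware migration check. The helper and field names are taken from
the v5 series as best understood; the exact surrounding code may differ.

/*
 * Sketch of a cookie-match test used when picking a migration target.
 */
static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
{
	struct rq *src_rq = task_rq(p);

	if (!sched_core_enabled(rq))
		return true;

	/*
	 * Proposed workaround from the hunk above: ignore the cookie match
	 * when the source rq is much busier than the destination, so a
	 * high-weight task is not left starving on an overloaded rq just
	 * because no cookie-compatible CPU is free.
	 */
	if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1)
		return true;

	return rq->core->core_cookie == p->core_cookie;
}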

I didn't see any other comments on the patch here:
https://lwn.net/ml/linux-kernel/[email protected]/

Do we have another way to address this issue?

Thanks,
-Aubrey

2020-06-30 14:17:10

by Phil Auld

[permalink] [raw]
Subject: Re: [RFC PATCH 00/13] Core scheduling v5

On Fri, Jun 26, 2020 at 11:10:28AM -0400 Joel Fernandes wrote:
> On Fri, Jun 26, 2020 at 10:36:01AM -0400, Vineeth Remanan Pillai wrote:
> > On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes <[email protected]> wrote:
> > >
> > > On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
> > > <[email protected]> wrote:
> > > [...]
> > > > TODO lists:
> > > >
> > > > - Interface discussions could not come to a conclusion in v5 and hence would
> > > > like to restart the discussion and reach a consensus on it.
> > > > - https://lwn.net/ml/linux-kernel/[email protected]
> > >
> > > Thanks Vineeth, just want to add: I have a revised implementation of
> > > prctl(2) where you only pass a TID of a task you'd like to share a
> > > core with (credit to Peter for the idea [1]) so we can make use of
> > > ptrace_may_access() checks. I am currently finishing writing of
> > > kselftests for this and post it all once it is ready.
> > >
> > Thinking more about it, using TID/PID for prctl(2) and internally
> > using a task identifier to identify coresched group may have
> > limitations. A coresched group can exist longer than the lifetime
> > of a task and then there is a chance for that identifier to be
> > reused by a newer task which may or may not be a part of the same
> > coresched group.
>
> True, for the prctl(2) tagging (a task wanting to share core with
> another) we will need some way of internally identifying groups which does
> not depend on any value that can be reused for another purpose.
>

That was my concern as well. That's why I was thinking it should be
an arbitrary, user/admin/orchestrator defined value and not be the
responsibility of the kernel at all. However...


> [..]
> > What do you think about having a separate cgroup for coresched?
> > Both coresched cgroup and prctl() could co-exist where prctl could
> > be used to isolate individual process or task and coresched cgroup
> > to group trusted processes.
>
> This sounds like a fine idea to me. I wonder how Tejun and Peter feel about
> having a new attribute-less CGroup controller for core-scheduling and just
> use that for tagging. (No need to even have a tag file, just adding/removing
> to/from CGroup will tag).
>

... this could be an interesting approach. Then the cookie could still
be the cgroup address as is and there would be no need for the prctl. At
least so it seems.



Cheers,
Phil

> > > However a question: If using the prctl(2) on a CGroup tagged task, we
> > > discussed in previous threads [2] to override the CGroup cookie such
> > > that the task may not share a core with any of the tasks in its CGroup
> > > anymore and I think Peter and Phil are Ok with. My question though is
> > > - would that not be confusing for anyone looking at the CGroup
> > > filesystem's "tag" and "tasks" files?
> > >
> > Having a dedicated cgroup for coresched could solve this problem
> > as well. "coresched.tasks" inside the cgroup hierarchy would list all
> > the tasks in the group and prctl can override this and take it out
> > of the group.
>
> We don't even need coresched.tasks, just the existing 'tasks' of CGroups can
> be used.
>
> > > To resolve this, I am proposing to add a new CGroup file
> > > 'tasks.coresched' to the CGroup, and this will only contain tasks that
> > > were assigned cookies due to their CGroup residency. As soon as one
> > > prctl(2)'s the task, it will stop showing up in the CGroup's
> > > "tasks.coresched" file (unless of course it was requesting to
> > > prctl-share a core with someone in its CGroup itself). Are folks Ok
> > > with this solution?
> > >
> > As I mentioned above, IMHO cpu cgroups should not be used to account
> > for core scheduling as well. Cpu cgroups serve a different purpose
> > and overloading it with core scheduling would not be flexible and
> > scalable. But if there is a consensus to move forward with cpu cgroups,
> > adding this new file seems to be okay with me.
>
> Yes, this is the problem. Many people use CPU controller CGroups already for
> other purposes. In that case, tagging a CGroup would make all the entities in
> the group be able to share a core, which may not always make sense. Maybe a
> new CGroup controller is the answer (?).
>
> thanks,
>
> - Joel
>

--