2019-02-18 19:05:44

by Peter Zijlstra

[permalink] [raw]
Subject: [RFC][PATCH 00/16] sched: Core scheduling


A much 'demanded' feature: core-scheduling :-(

I still hate it with a passion, and that is part of why it took a little
longer than 'promised'.

While this one doesn't have all the 'features' of the previous (never
published) version and isn't L1TF 'complete', I tend to like the structure
better (relatively speaking: I hate it slightly less).

This one is sched class agnostic and therefore, in principle, doesn't horribly
wreck RT (in fact, RT could 'ab'use this by setting 'task->core_cookie = task'
to force-idle siblings).

Now, as hinted by that, there are semi sane reasons for actually having this.
Various hardware features like Intel RDT - Memory Bandwidth Allocation, work
per core (due to SMT fundamentally sharing caches) and therefore grouping
related tasks on a core makes it more reliable.

However; whichever way around you turn this cookie; it is expensive and nasty.

It doesn't help that there are truly bonghit crazy proposals for using this out
there, and I really hope to never see them in code.

These patches are lightly tested and didn't insta explode, but no promises,
they might just set your pets on fire.

'enjoy'

@pjt; I know this isn't quite what we talked about, but this is where I ended
up after I started typing. There's plenty of design decisions to question and my
changelogs don't even get close to beginning to cover them all. Feel free to ask.

---
include/linux/sched.h | 9 +-
kernel/Kconfig.preempt | 8 +-
kernel/sched/core.c | 762 ++++++++++++++++++++++++++++++++++++++++++++---
kernel/sched/deadline.c | 99 +++---
kernel/sched/debug.c | 4 +-
kernel/sched/fair.c | 129 +++++---
kernel/sched/idle.c | 42 ++-
kernel/sched/pelt.h | 2 +-
kernel/sched/rt.c | 96 +++---
kernel/sched/sched.h | 183 ++++++++----
kernel/sched/stop_task.c | 35 ++-
kernel/sched/topology.c | 4 +-
kernel/stop_machine.c | 2 +
13 files changed, 1096 insertions(+), 279 deletions(-)




2019-02-18 19:12:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <[email protected]> wrote:
>
> However; whichever way around you turn this cookie; it is expensive and nasty.

Do you (or anybody else) have numbers for real loads?

Because performance is all that matters. If performance is bad, then
it's pointless, since just turning off SMT is the answer.

Linus

2019-02-18 20:42:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <[email protected]> wrote:
> >
> > However; whichever way around you turn this cookie; it is expensive and nasty.
>
> Do you (or anybody else) have numbers for real loads?
>
> Because performance is all that matters. If performance is bad, then
> it's pointless, since just turning off SMT is the answer.

Not for these patches; they stopped crashing only yesterday and I
cleaned them up and send them out.

The previous version; which was more horrible; but L1TF complete, was
between OK-ish and horrible depending on the number of VMEXITs a
workload had.

If there were close to no VMEXITs, it beat smt=off, if there were lots
of VMEXITs it was far far worse. Supposedly hosting people try their
very bestest to have no VMEXITs so it mostly works for them (with the
obvious exception of single VCPU guests).

It's just that people have been bugging me for this crap; and I figure
I'd post it now that it's not exploding anymore and let others have at.



2019-02-19 02:46:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Mon, Feb 18, 2019 at 12:40 PM Peter Zijlstra <[email protected]> wrote:
>
> If there were close to no VMEXITs, it beat smt=off, if there were lots
> of VMEXITs it was far far worse. Supposedly hosting people try their
> very bestest to have no VMEXITs so it mostly works for them (with the
> obvious exception of single VCPU guests).
>
> It's just that people have been bugging me for this crap; and I figure
> I'd post it now that it's not exploding anymore and let others have at.

The patches didn't look disgusting to me, but I admittedly just
scanned through them quickly.

Are there downsides (maintenance and/or performance) when core
scheduling _isn't_ enabled? I guess if it's not a maintenance or
performance nightmare when off, it's ok to just give people the
option.

That all assumes that it works at all for the people who are clamoring
for this feature, but I guess they can run some loads on it
eventually. It's a holiday in the US right now ("Presidents' Day"),
but maybe we can get some numbers this week?

Linus

2019-02-19 15:16:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling


* Linus Torvalds <[email protected]> wrote:

> On Mon, Feb 18, 2019 at 12:40 PM Peter Zijlstra <[email protected]> wrote:
> >
> > If there were close to no VMEXITs, it beat smt=off, if there were lots
> > of VMEXITs it was far far worse. Supposedly hosting people try their
> > very bestest to have no VMEXITs so it mostly works for them (with the
> > obvious exception of single VCPU guests).
> >
> > It's just that people have been bugging me for this crap; and I figure
> > I'd post it now that it's not exploding anymore and let others have at.
>
> The patches didn't look disgusting to me, but I admittedly just
> scanned through them quickly.
>
> Are there downsides (maintenance and/or performance) when core
> scheduling _isn't_ enabled? I guess if it's not a maintenance or
> performance nightmare when off, it's ok to just give people the
> option.

So this bit is the main straight-line performance impact when the
CONFIG_SCHED_CORE Kconfig feature is present (which I expect distros to
enable broadly):

+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
+}

 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
+	if (sched_core_enabled(rq))
+		return &rq->core->__lock;
+
 	return &rq->__lock;
 }

This should at least in principle keep the runtime overhead down to a few
NOPs and a bit bigger instruction cache footprint - modulo compiler
shenanigans.
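
For reference, flipping that static key at runtime uses the regular
static-branch API; a minimal sketch (the enable/disable call sites are my
assumption here, not quoted from the series):

DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);

/* e.g. when the first tagged cgroup appears */
static void sched_core_get_sketch(void)
{
	static_branch_enable(&__sched_core_enabled);
}

/* e.g. when the last tagged cgroup disappears */
static void sched_core_put_sketch(void)
{
	static_branch_disable(&__sched_core_enabled);
}

While the key is disabled, every sched_core_enabled() call site reduces to
a NOP plus the plain &rq->__lock fall-through.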

Here's the code generation impact on x86-64 defconfig:

text data bss dec hex filename
228 48 0 276 114 sched.core.n/cpufreq.o (ex sched.core.n/built-in.a)
228 48 0 276 114 sched.core.y/cpufreq.o (ex sched.core.y/built-in.a)

4438 96 0 4534 11b6 sched.core.n/completion.o (ex sched.core.n/built-in.a)
4438 96 0 4534 11b6 sched.core.y/completion.o (ex sched.core.y/built-in.a)

2167 2428 0 4595 11f3 sched.core.n/cpuacct.o (ex sched.core.n/built-in.a)
2167 2428 0 4595 11f3 sched.core.y/cpuacct.o (ex sched.core.y/built-in.a)

61099 22114 488 83701 146f5 sched.core.n/core.o (ex sched.core.n/built-in.a)
70541 25370 508 96419 178a3 sched.core.y/core.o (ex sched.core.y/built-in.a)

3262 6272 0 9534 253e sched.core.n/wait_bit.o (ex sched.core.n/built-in.a)
3262 6272 0 9534 253e sched.core.y/wait_bit.o (ex sched.core.y/built-in.a)

12235 341 96 12672 3180 sched.core.n/rt.o (ex sched.core.n/built-in.a)
13073 917 96 14086 3706 sched.core.y/rt.o (ex sched.core.y/built-in.a)

10293 477 1928 12698 319a sched.core.n/topology.o (ex sched.core.n/built-in.a)
10363 509 1928 12800 3200 sched.core.y/topology.o (ex sched.core.y/built-in.a)

886 24 0 910 38e sched.core.n/cpupri.o (ex sched.core.n/built-in.a)
886 24 0 910 38e sched.core.y/cpupri.o (ex sched.core.y/built-in.a)

1061 64 0 1125 465 sched.core.n/stop_task.o (ex sched.core.n/built-in.a)
1077 128 0 1205 4b5 sched.core.y/stop_task.o (ex sched.core.y/built-in.a)

18443 365 24 18832 4990 sched.core.n/deadline.o (ex sched.core.n/built-in.a)
20019 2189 24 22232 56d8 sched.core.y/deadline.o (ex sched.core.y/built-in.a)

1123 8 64 1195 4ab sched.core.n/loadavg.o (ex sched.core.n/built-in.a)
1123 8 64 1195 4ab sched.core.y/loadavg.o (ex sched.core.y/built-in.a)

1323 8 0 1331 533 sched.core.n/stats.o (ex sched.core.n/built-in.a)
1323 8 0 1331 533 sched.core.y/stats.o (ex sched.core.y/built-in.a)

1282 164 32 1478 5c6 sched.core.n/isolation.o (ex sched.core.n/built-in.a)
1282 164 32 1478 5c6 sched.core.y/isolation.o (ex sched.core.y/built-in.a)

1564 36 0 1600 640 sched.core.n/cpudeadline.o (ex sched.core.n/built-in.a)
1564 36 0 1600 640 sched.core.y/cpudeadline.o (ex sched.core.y/built-in.a)

1640 56 0 1696 6a0 sched.core.n/swait.o (ex sched.core.n/built-in.a)
1640 56 0 1696 6a0 sched.core.y/swait.o (ex sched.core.y/built-in.a)

1859 244 32 2135 857 sched.core.n/clock.o (ex sched.core.n/built-in.a)
1859 244 32 2135 857 sched.core.y/clock.o (ex sched.core.y/built-in.a)

2339 8 0 2347 92b sched.core.n/cputime.o (ex sched.core.n/built-in.a)
2339 8 0 2347 92b sched.core.y/cputime.o (ex sched.core.y/built-in.a)

3014 32 0 3046 be6 sched.core.n/membarrier.o (ex sched.core.n/built-in.a)
3014 32 0 3046 be6 sched.core.y/membarrier.o (ex sched.core.y/built-in.a)

50027 964 96 51087 c78f sched.core.n/fair.o (ex sched.core.n/built-in.a)
51537 2484 96 54117 d365 sched.core.y/fair.o (ex sched.core.y/built-in.a)

3192 220 0 3412 d54 sched.core.n/idle.o (ex sched.core.n/built-in.a)
3276 252 0 3528 dc8 sched.core.y/idle.o (ex sched.core.y/built-in.a)

3633 0 0 3633 e31 sched.core.n/pelt.o (ex sched.core.n/built-in.a)
3633 0 0 3633 e31 sched.core.y/pelt.o (ex sched.core.y/built-in.a)

3794 160 0 3954 f72 sched.core.n/wait.o (ex sched.core.n/built-in.a)
3794 160 0 3954 f72 sched.core.y/wait.o (ex sched.core.y/built-in.a)

I'd say this one is representative:

text data bss dec hex filename
12235 341 96 12672 3180 sched.core.n/rt.o (ex sched.core.n/built-in.a)
13073 917 96 14086 3706 sched.core.y/rt.o (ex sched.core.y/built-in.a)

The ~6% text bloat is primarily due to the higher rq-lock inlining overhead,
I believe.

This is roughly what you'd expect from a change wrapping all 350+ inlined
instantiations of rq->lock uses. I.e. it might make sense to uninline it.
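
As a rough sketch of that uninlining (not something in the posted series),
the wrapper would simply move out of line, leaving only a declaration in
sched.h, so each of those call sites pays for a call instead of the inlined
branch:

/* kernel/sched/core.c - sketch only */
raw_spinlock_t *rq_lockp(struct rq *rq)
{
	if (sched_core_enabled(rq))
		return &rq->core->__lock;

	return &rq->__lock;
}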

In terms of long term maintenance overhead, ignoring the overhead of the
core-scheduling feature itself, the rq-lock wrappery is the biggest
ugliness, the rest is mostly isolated.

So if this actually *works* and improves the performance of some real
VMEXIT-poor SMT workloads and allows the enabling of HyperThreading with
untrusted VMs without inviting thousands of guest roots then I'm
cautiously in support of it.

> That all assumes that it works at all for the people who are clamoring
> for this feature, but I guess they can run some loads on it eventually.
> It's a holiday in the US right now ("Presidents' Day"), but maybe we
> can get some numbers this week?

Such numbers would be *very* helpful indeed.

Thanks,

Ingo

2019-02-19 22:08:16

by Greg Kerr

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

Thanks for posting this patchset Peter. Based on the patch titled, "sched: A
quick and dirty cgroup tagging interface," I believe cgroups are used to
define co-scheduling groups in this implementation.

Chrome OS engineers ([email protected], [email protected], and
[email protected]) are considering an interface that is usable by unprivileged
userspace apps. cgroups are a global resource that require privileged access.
Have you considered an interface that is akin to namespaces? Consider the
following strawperson API proposal (I understand prctl() is generally
used for process
specific actions, so we aren't married to using prctl()):

# API Properties

The kernel introduces coscheduling groups, which specify which processes may
be executed together. An unprivileged process may use prctl() to create a
coscheduling group. The process may then join the coscheduling group, and
place any of its child processes into the coscheduling group. To provide
flexibility for unrelated processes to join pre-existing groups, an IPC
mechanism could send a coscheduling group handle between processes.

# Strawperson API Proposal
To create a new coscheduling group:
int coscheduling_group = prctl(PR_CREATE_COSCHEDULING_GROUP);

The return value is >= 0 on success and -1 on failure, with the following
possible values for errno:

ENOTSUP: This kernel doesn’t support the PR_CREATE_COSCHEDULING_GROUP operation.
EMFILE: The process’ kernel-side coscheduling group table is full.

To join a given process to the group:
pid_t process = getpid();	/* self, or a child's pid */
int status = prctl(PR_JOIN_COSCHEDULING_GROUP, coscheduling_group, process);
if (status) {
	err(errno, NULL);
}

The kernel will check and enforce that the given process ID really is the
caller’s own PID or a PID of one of the caller’s children, and that the given
group ID really exists. The return value is 0 on success and -1 on failure,
with the following possible values for errno:

EPERM: The caller could not join the given process to the coscheduling
group because it was not the creator of the given coscheduling group.
EPERM: The caller could not join the given process to the coscheduling
group because the given process was not the caller or one of the
caller’s children.
EINVAL: The given group ID did not exist in the kernel-side coscheduling
group table associated with the caller.
ESRCH: The given process did not exist.
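
Putting the two calls together, here is a minimal usage sketch of the
strawperson API above (the PR_* constants are part of this proposal only
and exist in no kernel):

#include <err.h>
#include <sys/prctl.h>
#include <unistd.h>

int main(void)
{
	/* Create a group, then co-schedule ourselves and one child in it. */
	int group = prctl(PR_CREATE_COSCHEDULING_GROUP);
	if (group < 0)
		err(1, "create coscheduling group");

	if (prctl(PR_JOIN_COSCHEDULING_GROUP, group, getpid()))
		err(1, "join self");

	pid_t child = fork();
	if (child == 0)
		_exit(0);	/* child's workload would go here */
	if (child < 0 || prctl(PR_JOIN_COSCHEDULING_GROUP, group, child))
		err(1, "join child");

	return 0;
}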

Regards,

Greg Kerr ([email protected])

On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <[email protected]> wrote:
>
>
> A much 'demanded' feature: core-scheduling :-(
>
> I still hate it with a passion, and that is part of why it took a little
> longer than 'promised'.
>
> While this one doesn't have all the 'features' of the previous (never
> published) version and isn't L1TF 'complete', I tend to like the structure
> better (relatively speaking: I hate it slightly less).
>
> This one is sched class agnostic and therefore, in principle, doesn't horribly
> wreck RT (in fact, RT could 'ab'use this by setting 'task->core_cookie = task'
> to force-idle siblings).
>
> Now, as hinted by that, there are semi sane reasons for actually having this.
> Various hardware features like Intel RDT - Memory Bandwidth Allocation, work
> per core (due to SMT fundamentally sharing caches) and therefore grouping
> related tasks on a core makes it more reliable.
>
> However; whichever way around you turn this cookie; it is expensive and nasty.
>
> It doesn't help that there are truly bonghit crazy proposals for using this out
> there, and I really hope to never see them in code.
>
> These patches are lightly tested and didn't insta explode, but no promises,
> they might just set your pets on fire.
>
> 'enjoy'
>
> @pjt; I know this isn't quite what we talked about, but this is where I ended
> up after I started typing. There's plenty design decisions to question and my
> changelogs don't even get close to beginning to cover them all. Feel free to ask.
>
> ---
> include/linux/sched.h | 9 +-
> kernel/Kconfig.preempt | 8 +-
> kernel/sched/core.c | 762 ++++++++++++++++++++++++++++++++++++++++++++---
> kernel/sched/deadline.c | 99 +++---
> kernel/sched/debug.c | 4 +-
> kernel/sched/fair.c | 129 +++++---
> kernel/sched/idle.c | 42 ++-
> kernel/sched/pelt.h | 2 +-
> kernel/sched/rt.c | 96 +++---
> kernel/sched/sched.h | 183 ++++++++----
> kernel/sched/stop_task.c | 35 ++-
> kernel/sched/topology.c | 4 +-
> kernel/stop_machine.c | 2 +
> 13 files changed, 1096 insertions(+), 279 deletions(-)
>
>

2019-02-20 09:44:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling


A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

On Tue, Feb 19, 2019 at 02:07:01PM -0800, Greg Kerr wrote:
> Thanks for posting this patchset Peter. Based on the patch titled, "sched: A
> quick and dirty cgroup tagging interface," I believe cgroups are used to
> define co-scheduling groups in this implementation.
>
> Chrome OS engineers ([email protected], [email protected], and
> [email protected]) are considering an interface that is usable by unprivileged
> userspace apps. cgroups are a global resource that require privileged access.
> Have you considered an interface that is akin to namespaces? Consider the
> following strawperson API proposal (I understand prctl() is generally
> used for process
> specific actions, so we aren't married to using prctl()):

I don't think we're anywhere near the point where I care about
interfaces with this stuff.

Interfaces are a trivial but tedious matter once the rest works to
satisfaction.

As it happens; there is actually a bug in that very cgroup patch that
can cause undesired scheduling. Try spotting and fixing that.

Another question is if we want to be L1TF complete (and how strict) or
not, and if so, build the missing pieces (for instance we currently
don't kick siblings on IRQ/trap/exception entry -- and yes that's nasty
and horrible code and missing for that reason).

So first; does this provide what we need? If that's sorted we can
bike-shed on uapi/abi.

2019-02-20 18:46:32

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling


On 2/20/19 1:42 AM, Peter Zijlstra wrote:
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing in e-mail?
>
> On Tue, Feb 19, 2019 at 02:07:01PM -0800, Greg Kerr wrote:
>> Thanks for posting this patchset Peter. Based on the patch titled, "sched: A
>> quick and dirty cgroup tagging interface," I believe cgroups are used to
>> define co-scheduling groups in this implementation.
>>
>> Chrome OS engineers ([email protected], [email protected], and
>> [email protected]) are considering an interface that is usable by unprivileged
>> userspace apps. cgroups are a global resource that require privileged access.
>> Have you considered an interface that is akin to namespaces? Consider the
>> following strawperson API proposal (I understand prctl() is generally
>> used for process
>> specific actions, so we aren't married to using prctl()):
> I don't think we're anywhere near the point where I care about
> interfaces with this stuff.
>
> Interfaces are a trivial but tedious matter once the rest works to
> satisfaction.
>
> As it happens; there is actually a bug in that very cgroup patch that
> can cause undesired scheduling. Try spotting and fixing that.
>
> Another question is if we want to be L1TF complete (and how strict) or
> not, and if so, build the missing pieces (for instance we currently
> don't kick siblings on IRQ/trap/exception entry -- and yes that's nasty
> and horrible code and missing for that reason).
I remember asking Paul about this and he mentioned he has an Address Space
Isolation proposal to cover this. So it seems this is out of scope of
core scheduling?
>
> So first; does this provide what we need? If that's sorted we can
> bike-shed on uapi/abi.

2019-02-20 19:11:56

by Greg Kerr

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Wed, Feb 20, 2019 at 10:42:55AM +0100, Peter Zijlstra wrote:
>
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing in e-mail?
>
I am relieved to know that when my mail client embeds HTML tags into raw
text, it will only be the second most annoying thing I've done on
e-mail.

Speaking of annoying things to do, sorry for switching e-mail addresses
but this is easier to do from my personal e-mail.

> On Tue, Feb 19, 2019 at 02:07:01PM -0800, Greg Kerr wrote:
> > Thanks for posting this patchset Peter. Based on the patch titled, "sched: A
> > quick and dirty cgroup tagging interface," I believe cgroups are used to
> > define co-scheduling groups in this implementation.
> >
> > Chrome OS engineers ([email protected], [email protected], and
> > [email protected]) are considering an interface that is usable by unprivileged
> > userspace apps. cgroups are a global resource that require privileged access.
> > Have you considered an interface that is akin to namespaces? Consider the
> > following strawperson API proposal (I understand prctl() is generally
> > used for process
> > specific actions, so we aren't married to using prctl()):
>
> I don't think we're anywhere near the point where I care about
> interfaces with this stuff.
>
> Interfaces are a trivial but tedious matter once the rest works to
> satisfaction.
>
I agree that the API itself is a bit of bike shedding, and that's why I
provided a strawperson proposal to highlight the desired properties. I
do think the high-level semantics are important to agree upon.

Using cgroups could imply that a privileged user is meant to create and
track all the core scheduling groups. It sounds like you picked cgroups
out of ease of prototyping and not the specific behavior?

> As it happens; there is actually a bug in that very cgroup patch that
> can cause undesired scheduling. Try spotting and fixing that.
>
This is where I think the high level properties of core scheduling are
relevant. I'm not sure what bug is in the existing patch, but it's hard
for me to tell if the existing code behaves correctly without answering
questions, such as, "Should processes from two separate parents be
allowed to co-execute?"

> Another question is if we want to be L1TF complete (and how strict) or
> not, and if so, build the missing pieces (for instance we currently
> don't kick siblings on IRQ/trap/exception entry -- and yes that's nasty
> and horrible code and missing for that reason).
>
I assumed from the beginning that this should be safe across exceptions.
Is there a mitigating reason that it shouldn't?

>
> So first; does this provide what we need? If that's sorted we can
> bike-shed on uapi/abi.
I agree on not bike shedding about the API, but can we agree on some of
the high level properties? For example, who generates the core
scheduling ids, what properties about them are enforced, etc.?

Regards,

Greg Kerr

2019-02-21 02:56:20

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling


On 2/18/19 9:49 AM, Linus Torvalds wrote:
> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <[email protected]> wrote:
>> However; whichever way around you turn this cookie; it is expensive and nasty.
> Do you (or anybody else) have numbers for real loads?
>
> Because performance is all that matters. If performance is bad, then
> it's pointless, since just turning off SMT is the answer.
>
> Linus
I tested 2 Oracle DB instances running OLTP on a 2 socket 44 cores system.
This is on baremetal, no virtualization.  In all cases I put each DB
instance in separate cpu cgroup. Following are the avg throughput numbers
of the 2 instances. %stdev is the standard deviation between the 2
instances.

Baseline = build w/o CONFIG_SCHED_CORE
core_sched = build w/ CONFIG_SCHED_CORE
HT_disable = offlined sibling HT with baseline

Users  Baseline  %stdev  core_sched      %stdev  HT_disable       %stdev
16     997768    3.28    808193(-19%)    34      1053888(+5.6%)   2.9
24     1157314   9.4     974555(-15.8%)  40.5    1197904(+3.5%)   4.6
32     1693644   6.4     1237195(-27%)   42.8    1308180(-22.8%)  5.3

The regressions are substantial. Also noticed one of the DB instances was
having much less throughput than the other with core scheduling which
brought down the avg and also reflected in the very high %stdev. Disabling
HT has effect at 32 users but still better than core scheduling both in
terms of avg and %stdev. There are some issue with the DB setup for which
I couldn't go beyond 32 users.

2019-02-21 14:04:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Wed, Feb 20, 2019 at 06:53:08PM -0800, Subhra Mazumdar wrote:
>
> On 2/18/19 9:49 AM, Linus Torvalds wrote:
> > On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <[email protected]> wrote:
> > > However; whichever way around you turn this cookie; it is expensive and nasty.
> > Do you (or anybody else) have numbers for real loads?
> >
> > Because performance is all that matters. If performance is bad, then
> > it's pointless, since just turning off SMT is the answer.
> >
> > Linus
> I tested 2 Oracle DB instances running OLTP on a 2 socket 44 cores system.
> This is on baremetal, no virtualization.

I'm thinking oracle schedules quite a bit, right? Then you get massive
overhead (as shown).

The thing with virt workloads is that if they don't VMEXIT lots, they
also don't schedule lots (the vCPU stays running, nested scheduler
etc..).

Also; like I wrote, it is quite possible there is some sibling rivalry
here, which can cause excessive rescheduling. Someone would have to
trace a workload and check.

My older patches had a condition that would not preempt a task for a
little while, such that it might make _some_ progress, these patches
don't have that (yet).


2019-02-21 18:48:08

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling


On 2/21/19 6:03 AM, Peter Zijlstra wrote:
> On Wed, Feb 20, 2019 at 06:53:08PM -0800, Subhra Mazumdar wrote:
>> On 2/18/19 9:49 AM, Linus Torvalds wrote:
>>> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <[email protected]> wrote:
>>>> However; whichever way around you turn this cookie; it is expensive and nasty.
>>> Do you (or anybody else) have numbers for real loads?
>>>
>>> Because performance is all that matters. If performance is bad, then
>>> it's pointless, since just turning off SMT is the answer.
>>>
>>> Linus
>> I tested 2 Oracle DB instances running OLTP on a 2 socket 44 cores system.
>> This is on baremetal, no virtualization.
> I'm thinking oracle schedules quite a bit, right? Then you get massive
> overhead (as shown).
Yes. In terms of idleness we have:

Users  baseline  core_sched
16     67%       70%
24     53%       59%
32     41%       49%

So there is more idleness with core sched, which is understandable as there
can be forced idleness. The other part contributing to the regression is
most likely overhead.
>
> The thing with virt workloads is that if they don't VMEXIT lots, they
> also don't schedule lots (the vCPU stays running, nested scheduler
> etc..).
I plan to run some VM workloads.
>
> Also; like I wrote, it is quite possible there is some sibling rivalry
> here, which can cause excessive rescheduling. Someone would have to
> trace a workload and check.
>
> My older patches had a condition that would not preempt a task for a
> little while, such that it might make _some_ progress, these patches
> don't have that (yet).
>

2019-02-22 00:37:43

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling


On 2/21/19 6:03 AM, Peter Zijlstra wrote:
> On Wed, Feb 20, 2019 at 06:53:08PM -0800, Subhra Mazumdar wrote:
>> On 2/18/19 9:49 AM, Linus Torvalds wrote:
>>> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <[email protected]> wrote:
>>>> However; whichever way around you turn this cookie; it is expensive and nasty.
>>> Do you (or anybody else) have numbers for real loads?
>>>
>>> Because performance is all that matters. If performance is bad, then
>>> it's pointless, since just turning off SMT is the answer.
>>>
>>> Linus
>> I tested 2 Oracle DB instances running OLTP on a 2 socket 44 cores system.
>> This is on baremetal, no virtualization.
> I'm thinking oracle schedules quite a bit, right? Then you get massive
> overhead (as shown).
>
Out of curiosity I ran the patchset from Amazon with the same setup to see
if performance-wise it was any better. But it looks equally bad. At 32
users it performed even worse and the idle time increased much more. The
only good thing about it was that it was fair to both instances, as seen
in the low %stdev.

Users  Baseline  %stdev  %idle  cosched      %stdev  %idle
16     1         2.9     66     0.93(-7%)    1.1     69
24     1         11.3    53     0.87(-13%)   11.2    61
32     1         7       41     0.66(-34%)   5.3     54

2019-02-22 12:19:14

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On 18/02/19 21:40, Peter Zijlstra wrote:
> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
>> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <[email protected]> wrote:
>>>
>>> However; whichever way around you turn this cookie; it is expensive and nasty.
>>
>> Do you (or anybody else) have numbers for real loads?
>>
>> Because performance is all that matters. If performance is bad, then
>> it's pointless, since just turning off SMT is the answer.
>
> Not for these patches; they stopped crashing only yesterday and I
> cleaned them up and send them out.
>
> The previous version; which was more horrible; but L1TF complete, was
> between OK-ish and horrible depending on the number of VMEXITs a
> workload had.
>
> If there were close to no VMEXITs, it beat smt=off, if there were lots
> of VMEXITs it was far far worse. Supposedly hosting people try their
> very bestest to have no VMEXITs so it mostly works for them (with the
> obvious exception of single VCPU guests).

If you are giving access to dedicated cores to guests, you also let them
do PAUSE/HLT/MWAIT without vmexits and the host just thinks it's a CPU
bound workload.

In any case, IIUC what you are looking for is:

1) take a benchmark that *is* helped by SMT, this will be something CPU
bound.

2) compare two runs, one without SMT and without core scheduler, and one
with SMT+core scheduler.

3) find out whether performance is helped by SMT despite the increased
overhead of the core scheduler

Do you want some other load in the host, so that the scheduler actually
does do something? Or is the point just that you show that the
performance isn't affected when the scheduler does not have anything to
do (which should be obvious, but having numbers is always better)?

Paolo

2019-02-22 12:47:17

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <[email protected]> wrote:
> >
> > However; whichever way around you turn this cookie; it is expensive and nasty.
>
> Do you (or anybody else) have numbers for real loads?
>
> Because performance is all that matters. If performance is bad, then
> it's pointless, since just turning off SMT is the answer.
>

I tried to do a comparison between tip/master, ht disabled and this series
putting test workloads into a tagged cgroup but unfortunately it failed

[ 156.978682] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
[ 156.986597] #PF error: [normal kernel read fault]
[ 156.991343] PGD 0 P4D 0
[ 156.993905] Oops: 0000 [#1] SMP PTI
[ 156.997438] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 5.0.0-rc7-schedcore-v1r1 #1
[ 157.005161] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
[ 157.012896] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
[ 157.018613] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00
53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
[ 157.037544] RSP: 0018:ffffc9000c5bbde8 EFLAGS: 00010086
[ 157.042819] RAX: ffff88810f5f6a00 RBX: 00000001547f175c RCX: 0000000000000001
[ 157.050015] RDX: ffff88bf3bdb0a40 RSI: 0000000000000000 RDI: 00000001547f175c
[ 157.057215] RBP: ffff88bf7fae32c0 R08: 000000000001e358 R09: ffff88810fb9f000
[ 157.064410] R10: ffffc9000c5bbe08 R11: ffff88810fb9f5c4 R12: 0000000000000000
[ 157.071611] R13: ffff88bf4e3ea0c0 R14: 0000000000000000 R15: ffff88bf4e3ea7a8
[ 157.078814] FS: 0000000000000000(0000) GS:ffff88bf7f5c0000(0000) knlGS:0000000000000000
[ 157.086977] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 157.092779] CR2: 0000000000000058 CR3: 000000000220e005 CR4: 00000000003606e0
[ 157.099979] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 157.109529] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 157.119058] Call Trace:
[ 157.123865] pick_next_entity+0x61/0x110
[ 157.130137] pick_task_fair+0x4b/0x90
[ 157.136124] __schedule+0x365/0x12c0
[ 157.141985] schedule_idle+0x1e/0x40
[ 157.147822] do_idle+0x166/0x280
[ 157.153275] cpu_startup_entry+0x19/0x20
[ 157.159420] start_secondary+0x17a/0x1d0
[ 157.165568] secondary_startup_64+0xa4/0xb0
[ 157.171985] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs msr intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crc32_pclmul ghash_clmulni_intel ixgbe aesni_intel xfrm_algo iTCO_wdt joydev iTCO_vendor_support libphy igb aes_x86_64 crypto_simd ptp cryptd mei_me mdio pps_core ioatdma glue_helper pcspkr ipmi_si lpc_ich i2c_i801 mei dca ipmi_devintf ipmi_msghandler acpi_pad pcc_cpufreq button btrfs libcrc32c xor zstd_decompress zstd_compress raid6_pq hid_generic usbhid ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci crc32c_intel ehci_pci ttm xhci_hcd ehci_hcd drm ahci usbcore mpt3sas libahci raid_class scsi_transport_sas wmi sg nbd dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[ 157.258990] CR2: 0000000000000058
[ 157.264961] ---[ end trace a301ac5e3ee86fde ]---
[ 157.283719] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
[ 157.291967] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00 53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
[ 157.316121] RSP: 0018:ffffc9000c5bbde8 EFLAGS: 00010086
[ 157.324060] RAX: ffff88810f5f6a00 RBX: 00000001547f175c RCX: 0000000000000001
[ 157.333932] RDX: ffff88bf3bdb0a40 RSI: 0000000000000000 RDI: 00000001547f175c
[ 157.343795] RBP: ffff88bf7fae32c0 R08: 000000000001e358 R09: ffff88810fb9f000
[ 157.353634] R10: ffffc9000c5bbe08 R11: ffff88810fb9f5c4 R12: 0000000000000000
[ 157.363506] R13: ffff88bf4e3ea0c0 R14: 0000000000000000 R15: ffff88bf4e3ea7a8
[ 157.373395] FS: 0000000000000000(0000) GS:ffff88bf7f5c0000(0000) knlGS:0000000000000000
[ 157.384238] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 157.392709] CR2: 0000000000000058 CR3: 000000000220e005 CR4: 00000000003606e0
[ 157.402601] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 157.412488] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 157.422334] Kernel panic - not syncing: Attempted to kill the idle task!
[ 158.529804] Shutting down cpus with NMI
[ 158.573249] Kernel Offset: disabled
[ 158.586198] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

RIP translates to kernel/sched/fair.c:6819

static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime; /* LINE 6819 */

	if (vdiff <= 0)
		return -1;

	gran = wakeup_gran(se);
	if (vdiff > gran)
		return 1;

	return 0;
}

I haven't tried debugging it yet.

--
Mel Gorman
SUSE Labs

2019-02-22 14:11:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Wed, Feb 20, 2019 at 10:33:55AM -0800, Greg Kerr wrote:
> > On Tue, Feb 19, 2019 at 02:07:01PM -0800, Greg Kerr wrote:

> Using cgroups could imply that a privileged user is meant to create and
> track all the core scheduling groups. It sounds like you picked cgroups
> out of ease of prototyping and not the specific behavior?

Yep. Where a prctl() patch would've been similarly simple, the userspace
part would've been more annoying. The cgroup thing I can just echo into.

> > As it happens; there is actually a bug in that very cgroup patch that
> > can cause undesired scheduling. Try spotting and fixing that.
> >
> This is where I think the high level properties of core scheduling are
> relevant. I'm not sure what bug is in the existing patch, but it's hard
> for me to tell if the existing code behaves correctly without answering
> questions, such as, "Should processes from two separate parents be
> allowed to co-execute?"

Sure, why not.

The bug is that we set the cookie and don't force a reschedule. This
then allows the existing task selection to continue; which might not
adhere to the (new) cookie constraints.

It is a transient state though; as soon as we reschedule this gets
corrected automagically.
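
The shape of the fix is roughly the sketch below; the helper name and the
locking details are illustrative only, not the actual patch:

static void sched_core_set_cookie(struct task_struct *p, unsigned long cookie)
{
	struct rq_flags rf;
	struct rq *rq = task_rq_lock(p, &rf);

	p->core_cookie = cookie;

	/* Don't let a stale, now-invalid task selection linger. */
	if (task_running(rq, p))
		resched_curr(rq);

	task_rq_unlock(rq, p, &rf);
}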

A second bug is that we leak the cgroup tag state on destroy.

A third bug would be that it is not hierarchical -- but at this point,
meh.

> > Another question is if we want to be L1TF complete (and how strict) or
> > not, and if so, build the missing pieces (for instance we currently
> > don't kick siblings on IRQ/trap/exception entry -- and yes that's nasty
> > and horrible code and missing for that reason).
> >
> I assumed from the beginning that this should be safe across exceptions.
> Is there a mitigating reason that it shouldn't?

I'm not entirely sure what you mean; so let me expound -- L1TF is public
now after all.

So the basic problem is that a malicious guest can read the entire L1,
right? L1 is shared between SMT. So if one sibling takes a host
interrupt and populates L1 with host data, that other thread can read
it from the guest.

This is why my old patches (which Tim has on github _somewhere_) also
have hooks in irq_enter/irq_exit.
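
A sketch of what such a hook amounts to (the names here are invented; the
posted series deliberately leaves this part out):

void sched_core_irq_enter(void)
{
	int cpu = smp_processor_id(), sibling;

	if (!sched_core_enabled(this_rq()))
		return;

	/* Kick the SMT siblings out of guest mode while we handle the IRQ. */
	for_each_cpu(sibling, cpu_smt_mask(cpu)) {
		if (sibling != cpu)
			smp_send_reschedule(sibling);
	}
}

On top of that the sibling has to be held out of guest mode until the
interrupt is done, which is where the nasty and horrible part comes in.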

The big question is of course; if any data touched by interrupts is
worth the pain.

> > So first; does this provide what we need? If that's sorted we can
> > bike-shed on uapi/abi.

> I agree on not bike shedding about the API, but can we agree on some of
> the high level properties? For example, who generates the core
> scheduling ids, what properties about them are enforced, etc.?

It's an opaque cookie; the scheduler really doesn't care. All it does is
ensure that tasks match or force idle within a core.
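
In code the invariant is no more than a cookie comparison; a simplified
sketch (the real selection loop in the patchset is more involved):

static inline bool cookies_compatible(struct task_struct *a, struct task_struct *b)
{
	/* idle (NULL) is compatible with anything; otherwise cookies must match */
	return !a || !b || a->core_cookie == b->core_cookie;
}

Everything beyond that, i.e. who hands out cookies and how, is policy the
scheduler doesn't care about.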

My previous patches got the cookie from a modified
preempt_notifier_register/unregister() which passed the vcpu->kvm
pointer into it from vcpu_load/put.

This auto-grouped VMs. It was also found to be somewhat annoying because
apparently KVM does a lot of userspace assist for all sorts of nonsense
and it would leave/re-join the cookie group for every single assist,
causing tons of rescheduling.

I'm fine with having all these interfaces, kvm, prctl and cgroup, and I
don't care about conflict resolution -- that's the tedious part of the
bike-shed :-)

The far more important question is whether there are enough workloads where
this can be made useful. If not, none of that interface crud matters one
whit; we can file these here patches in the bit-bucket and happily go spend
our time elsewhere.

2019-02-22 14:21:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Fri, Feb 22, 2019 at 01:17:01PM +0100, Paolo Bonzini wrote:
> On 18/02/19 21:40, Peter Zijlstra wrote:
> > On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> >> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <[email protected]> wrote:
> >>>
> >>> However; whichever way around you turn this cookie; it is expensive and nasty.
> >>
> >> Do you (or anybody else) have numbers for real loads?
> >>
> >> Because performance is all that matters. If performance is bad, then
> >> it's pointless, since just turning off SMT is the answer.
> >
> > Not for these patches; they stopped crashing only yesterday and I
> > cleaned them up and send them out.
> >
> > The previous version; which was more horrible; but L1TF complete, was
> > between OK-ish and horrible depending on the number of VMEXITs a
> > workload had.
> >
> > If there were close to no VMEXITs, it beat smt=off, if there were lots
> > of VMEXITs it was far far worse. Supposedly hosting people try their
> > very bestest to have no VMEXITs so it mostly works for them (with the
> > obvious exception of single VCPU guests).
>
> If you are giving access to dedicated cores to guests, you also let them
> do PAUSE/HLT/MWAIT without vmexits and the host just thinks it's a CPU
> bound workload.
>
> In any case, IIUC what you are looking for is:
>
> 1) take a benchmark that *is* helped by SMT, this will be something CPU
> bound.
>
> 2) compare two runs, one without SMT and without core scheduler, and one
> with SMT+core scheduler.
>
> 3) find out whether performance is helped by SMT despite the increased
> overhead of the core scheduler
>
> Do you want some other load in the host, so that the scheduler actually
> does do something? Or is the point just that you show that the
> performance isn't affected when the scheduler does not have anything to
> do (which should be obvious, but having numbers is always better)?

Well, what _I_ want is for all this to just go away :-)

Tim did much of testing last time around; and I don't think he did
core-pinning of VMs much (although I'm sure he did some of that). I'm
still a complete virt noob; I can barely boot a VM to save my life.

(you should be glad to not have heard my cursing at qemu cmdline when
trying to reproduce some of Tim's results -- lets just say that I can
deal with gpg)

I'm sure he tried some oversubscribed scenarios without pinning. But
even there, when all the vCPU threads are runnable, they don't schedule
that much. Sure we take the preemption tick and thus schedule 100-1000
times a second, but that's manageable.

We spend quite some time tracing workloads and fixing funny behaviour --
none of that has been done for these patches yet.

The moment KVM needed user space assist for things (and thus VMEXITs
happened) things came apart real quick.


Anyway, Tim, can you tell these fine folks what you did and for what
scenarios the last incarnation did show promise?

2019-02-22 16:11:04

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Fri, Feb 22, 2019 at 12:45:44PM +0000, Mel Gorman wrote:
> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> > On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <[email protected]> wrote:
> > >
> > > However; whichever way around you turn this cookie; it is expensive and nasty.
> >
> > Do you (or anybody else) have numbers for real loads?
> >
> > Because performance is all that matters. If performance is bad, then
> > it's pointless, since just turning off SMT is the answer.
> >
>
> I tried to do a comparison between tip/master, ht disabled and this series
> putting test workloads into a tagged cgroup but unfortunately it failed
>
> [ 156.978682] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
> [ 156.986597] #PF error: [normal kernel read fault]
> [ 156.991343] PGD 0 P4D 0

When bodged around, one test survived (performance was crucified but the
benchmark is very synthetic). pgbench (test 2) panicked with a hard
lockup. Most of the console log was corrupted (unrelated to the patch)
but the relevant part is

[ 4587.419674] Call Trace:
[ 4587.419674] _raw_spin_lock+0x1b/0x20
[ 4587.419675] sched_core_balance+0x155/0x520
[ 4587.419675] ? __switch_to_asm+0x34/0x70
[ 4587.419675] __balance_callback+0x49/0xa0
[ 4587.419676] __schedule+0xf15/0x12c0
[ 4587.419676] schedule_idle+0x1e/0x40
[ 4587.419677] do_idle+0x166/0x280
[ 4587.419677] cpu_startup_entry+0x19/0x20
[ 4587.419678] start_secondary+0x17a/0x1d0
[ 4587.419678] secondary_startup_64+0xa4/0xb0
[ 4587.419679] Kernel panic - not syncing: Hard LOCKUP

--
Mel Gorman
SUSE Labs

2019-02-22 19:27:43

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On 2/22/19 6:20 AM, Peter Zijlstra wrote:
> On Fri, Feb 22, 2019 at 01:17:01PM +0100, Paolo Bonzini wrote:
>> On 18/02/19 21:40, Peter Zijlstra wrote:
>>> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
>>>> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <[email protected]> wrote:
>>>>>
>>>>> However; whichever way around you turn this cookie; it is expensive and nasty.
>>>>
>>>> Do you (or anybody else) have numbers for real loads?
>>>>
>>>> Because performance is all that matters. If performance is bad, then
>>>> it's pointless, since just turning off SMT is the answer.
>>>
>>> Not for these patches; they stopped crashing only yesterday and I
>>> cleaned them up and send them out.
>>>
>>> The previous version; which was more horrible; but L1TF complete, was
>>> between OK-ish and horrible depending on the number of VMEXITs a
>>> workload had.
>>>
>>> If there were close to no VMEXITs, it beat smt=off, if there were lots
>>> of VMEXITs it was far far worse. Supposedly hosting people try their
>>> very bestest to have no VMEXITs so it mostly works for them (with the
>>> obvious exception of single VCPU guests).
>>
>> If you are giving access to dedicated cores to guests, you also let them
>> do PAUSE/HLT/MWAIT without vmexits and the host just thinks it's a CPU
>> bound workload.
>>
>> In any case, IIUC what you are looking for is:
>>
>> 1) take a benchmark that *is* helped by SMT, this will be something CPU
>> bound.
>>
>> 2) compare two runs, one without SMT and without core scheduler, and one
>> with SMT+core scheduler.
>>
>> 3) find out whether performance is helped by SMT despite the increased
>> overhead of the core scheduler
>>
>> Do you want some other load in the host, so that the scheduler actually
>> does do something? Or is the point just that you show that the
>> performance isn't affected when the scheduler does not have anything to
>> do (which should be obvious, but having numbers is always better)?
>
> Well, what _I_ want is for all this to just go away :-)
>
> Tim did much of testing last time around; and I don't think he did
> core-pinning of VMs much (although I'm sure he did some of that). I'm

Yes. The last time around I tested basic scenarios like:
1. single VM pinned on a core
2. 2 VMs pinned on a core
3. system oversubscription (no pinning)

In general, CPU bound benchmarks, and even workloads without too much I/O
causing lots of VMexits, performed better with HT than without on Peter's
last patchset.

> still a complete virt noob; I can barely boot a VM to save my life.
>
> (you should be glad to not have heard my cursing at qemu cmdline when
> trying to reproduce some of Tim's results -- lets just say that I can
> deal with gpg)
>
> I'm sure he tried some oversubscribed scenarios without pinning.

We did try some oversubscribed scenarios like SPECVirt, that tried to
squeeze tons of VMs on a single system in over subscription mode.

There're two main problems in the last go around:

1. Workloads with a high rate of VMexits (SPECvirt is one)
were a major source of pain when we tried Peter's previous patchset.
The switch from vcpus to qemu and back in the previous version of Peter's patch
requires some coordination between the hyperthread siblings via IPI. And for
workloads that do this a lot, the overhead quickly added up.

For Peter's new patch, this overhead hopefully would be reduced and give
better performance.

2. Load balancing is quite tricky. Peter's last patchset did not have
load balancing for consolidating compatible running threads.
I did some unsophisticated load balancing
to pair vcpus up. But the constant vcpu migration overhead probably ate up
any improvements from better load pairing. So I didn't get much
improvement in the over-subscription case when turning on load balancing
to consolidate the VCPUs of the same VM. We'll probably have to try
out this incarnation of Peter's patch and see how well the load balancing
works.

I'll try to line up some benchmarking folks to do some tests.

Tim


2019-02-26 08:28:13

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Sat, Feb 23, 2019 at 3:27 AM Tim Chen <[email protected]> wrote:
>
> On 2/22/19 6:20 AM, Peter Zijlstra wrote:
> > On Fri, Feb 22, 2019 at 01:17:01PM +0100, Paolo Bonzini wrote:
> >> On 18/02/19 21:40, Peter Zijlstra wrote:
> >>> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> >>>> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <[email protected]> wrote:
> >>>>>
> >>>>> However; whichever way around you turn this cookie; it is expensive and nasty.
> >>>>
> >>>> Do you (or anybody else) have numbers for real loads?
> >>>>
> >>>> Because performance is all that matters. If performance is bad, then
> >>>> it's pointless, since just turning off SMT is the answer.
> >>>
> >>> Not for these patches; they stopped crashing only yesterday and I
> >>> cleaned them up and send them out.
> >>>
> >>> The previous version; which was more horrible; but L1TF complete, was
> >>> between OK-ish and horrible depending on the number of VMEXITs a
> >>> workload had.
> >>>
> >>> If there were close to no VMEXITs, it beat smt=off, if there were lots
> >>> of VMEXITs it was far far worse. Supposedly hosting people try their
> >>> very bestest to have no VMEXITs so it mostly works for them (with the
> >>> obvious exception of single VCPU guests).
> >>
> >> If you are giving access to dedicated cores to guests, you also let them
> >> do PAUSE/HLT/MWAIT without vmexits and the host just thinks it's a CPU
> >> bound workload.
> >>
> >> In any case, IIUC what you are looking for is:
> >>
> >> 1) take a benchmark that *is* helped by SMT, this will be something CPU
> >> bound.
> >>
> >> 2) compare two runs, one without SMT and without core scheduler, and one
> >> with SMT+core scheduler.
> >>
> >> 3) find out whether performance is helped by SMT despite the increased
> >> overhead of the core scheduler
> >>
> >> Do you want some other load in the host, so that the scheduler actually
> >> does do something? Or is the point just that you show that the
> >> performance isn't affected when the scheduler does not have anything to
> >> do (which should be obvious, but having numbers is always better)?
> >
> > Well, what _I_ want is for all this to just go away :-)
> >
> > Tim did much of testing last time around; and I don't think he did
> > core-pinning of VMs much (although I'm sure he did some of that). I'm
>
> Yes. The last time around I tested basic scenarios like:
> 1. single VM pinned on a core
> 2. 2 VMs pinned on a core
> 3. system oversubscription (no pinning)
>
> In general, CPU bound benchmarks and even things without too much I/O
> causing lots of VMexits perform better with HT than without for Peter's
> last patchset.
>
> > still a complete virt noob; I can barely boot a VM to save my life.
> >
> > (you should be glad to not have heard my cursing at qemu cmdline when
> > trying to reproduce some of Tim's results -- lets just say that I can
> > deal with gpg)
> >
> > I'm sure he tried some oversubscribed scenarios without pinning.
>
> We did try some oversubscribed scenarios like SPECVirt, that tried to
> squeeze tons of VMs on a single system in over subscription mode.
>
> There're two main problems in the last go around:
>
> 1. Workload with high rate of Vmexits (SpecVirt is one)
> were a major source of pain when we tried Peter's previous patchset.
> The switch from vcpus to qemu and back in previous version of Peter's patch
> requires some coordination between the hyperthread siblings via IPI. And for
> workload that does this a lot, the overhead quickly added up.
>
> For Peter's new patch, this overhead hopefully would be reduced and give
> better performance.
>
> 2. Load balancing is quite tricky. Peter's last patchset did not have
> load balancing for consolidating compatible running threads.
> I did some non-sophisticated load balancing
> to pair vcpus up. But the constant vcpu migrations overhead probably ate up
> any improvements from better load pairing. So I didn't get much
> improvement in the over-subscription case when turning on load balancing
> to consolidate the VCPUs of the same VM. We'll probably have to try
> out this incarnation of Peter's patch and see how well the load balancing
> works.
>
> I'll try to line up some benchmarking folks to do some tests.

I can help to do some basic tests.

Cgroup bias looks weird to me. If I have hundreds of cgroups, should I turn
core scheduling (cpu.tag) on one by one? Or is there a global knob I missed?

Thanks,
-Aubrey

2019-02-27 07:55:40

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Tue, Feb 26, 2019 at 4:26 PM Aubrey Li <[email protected]> wrote:
>
> On Sat, Feb 23, 2019 at 3:27 AM Tim Chen <[email protected]> wrote:
> >
> > On 2/22/19 6:20 AM, Peter Zijlstra wrote:
> > > On Fri, Feb 22, 2019 at 01:17:01PM +0100, Paolo Bonzini wrote:
> > >> On 18/02/19 21:40, Peter Zijlstra wrote:
> > >>> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> > >>>> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <[email protected]> wrote:
> > >>>>>
> > >>>>> However; whichever way around you turn this cookie; it is expensive and nasty.
> > >>>>
> > >>>> Do you (or anybody else) have numbers for real loads?
> > >>>>
> > >>>> Because performance is all that matters. If performance is bad, then
> > >>>> it's pointless, since just turning off SMT is the answer.
> > >>>
> > >>> Not for these patches; they stopped crashing only yesterday and I
> > >>> cleaned them up and send them out.
> > >>>
> > >>> The previous version; which was more horrible; but L1TF complete, was
> > >>> between OK-ish and horrible depending on the number of VMEXITs a
> > >>> workload had.
> > >>>
> > >>> If there were close to no VMEXITs, it beat smt=off, if there were lots
> > >>> of VMEXITs it was far far worse. Supposedly hosting people try their
> > >>> very bestest to have no VMEXITs so it mostly works for them (with the
> > >>> obvious exception of single VCPU guests).
> > >>
> > >> If you are giving access to dedicated cores to guests, you also let them
> > >> do PAUSE/HLT/MWAIT without vmexits and the host just thinks it's a CPU
> > >> bound workload.
> > >>
> > >> In any case, IIUC what you are looking for is:
> > >>
> > >> 1) take a benchmark that *is* helped by SMT, this will be something CPU
> > >> bound.
> > >>
> > >> 2) compare two runs, one without SMT and without core scheduler, and one
> > >> with SMT+core scheduler.
> > >>
> > >> 3) find out whether performance is helped by SMT despite the increased
> > >> overhead of the core scheduler
> > >>
> > >> Do you want some other load in the host, so that the scheduler actually
> > >> does do something? Or is the point just that you show that the
> > >> performance isn't affected when the scheduler does not have anything to
> > >> do (which should be obvious, but having numbers is always better)?
> > >
> > > Well, what _I_ want is for all this to just go away :-)
> > >
> > > Tim did much of testing last time around; and I don't think he did
> > > core-pinning of VMs much (although I'm sure he did some of that). I'm
> >
> > Yes. The last time around I tested basic scenarios like:
> > 1. single VM pinned on a core
> > 2. 2 VMs pinned on a core
> > 3. system oversubscription (no pinning)
> >
> > In general, CPU bound benchmarks and even things without too much I/O
> > causing lots of VMexits perform better with HT than without for Peter's
> > last patchset.
> >
> > > still a complete virt noob; I can barely boot a VM to save my life.
> > >
> > > (you should be glad to not have heard my cursing at qemu cmdline when
> > > trying to reproduce some of Tim's results -- lets just say that I can
> > > deal with gpg)
> > >
> > > I'm sure he tried some oversubscribed scenarios without pinning.
> >
> > We did try some oversubscribed scenarios like SPECVirt, that tried to
> > squeeze tons of VMs on a single system in over subscription mode.
> >
> > There're two main problems in the last go around:
> >
> > 1. Workload with high rate of Vmexits (SpecVirt is one)
> > were a major source of pain when we tried Peter's previous patchset.
> > The switch from vcpus to qemu and back in previous version of Peter's patch
> > requires some coordination between the hyperthread siblings via IPI. And for
> > workload that does this a lot, the overhead quickly added up.
> >
> > For Peter's new patch, this overhead hopefully would be reduced and give
> > better performance.
> >
> > 2. Load balancing is quite tricky. Peter's last patchset did not have
> > load balancing for consolidating compatible running threads.
> > I did some non-sophisticated load balancing
> > to pair vcpus up. But the constant vcpu migrations overhead probably ate up
> > any improvements from better load pairing. So I didn't get much
> > improvement in the over-subscription case when turning on load balancing
> > to consolidate the VCPUs of the same VM. We'll probably have to try
> > out this incarnation of Peter's patch and see how well the load balancing
> > works.
> >
> > I'll try to line up some benchmarking folks to do some tests.
>
> I can help to do some basic tests.
>
> Cgroup bias looks weird to me. If I have hundreds of cgroups, should I turn
> core scheduling (cpu.tag) on one by one? Or is there a global knob I missed?
>

I encountered the following panic when I turned core sched on in a cgroup
while that cgroup was running a best-effort workload with high CPU utilization.

Feb 27 01:51:53 aubrey-ivb kernel: [ 508.981348] core sched enabled
[ 508.990627] BUG: unable to handle kernel NULL pointer dereference
at 000000000000008
[ 508.999445] #PF error: [normal kernel read fault]
[ 509.004772] PGD 8000001807b7d067 P4D 8000001807b7d067 PUD
18071c9067 PMD 0
[ 509.012616] Oops: 0000 [#1] SMP PTI
[ 509.016568] CPU: 24 PID: 3503 Comm: schbench Tainted: G I
5.0.0-rc8-4
[ 509.027918] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.92
[ 509.039475] RIP: 0010:rb_insert_color+0x17/0x190
[ 509.044707] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 174
[ 509.065765] RSP: 0000:ffffc90009203c08 EFLAGS: 00010046
[ 509.071671] RAX: 0000000000000000 RBX: ffff889806f91e00 RCX:
ffff889806f91e00
[ 509.079715] RDX: ffff889806f83f48 RSI: ffff88980f2238c8 RDI:
ffff889806f92148
[ 509.087752] RBP: ffff88980f222cc0 R08: 000000000000026e R09:
ffff88980a099000
[ 509.095789] R10: 0000000000000078 R11: ffff88980a099b58 R12:
0000000000000004
[ 509.103833] R13: ffffc90009203c68 R14: 0000000000000046 R15:
0000000000022cc0
[ 509.111860] FS: 00007f854e7fc700(0000) GS:ffff88980f200000(0000)
knlGS:000000000000
[ 509.120957] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 509.127443] CR2: 0000000000000008 CR3: 0000001807b64005 CR4:
00000000000606e0
[ 509.135478] Call Trace:
[ 509.138285] enqueue_task+0x6f/0xe0
[ 509.142278] ttwu_do_activate+0x49/0x80
[ 509.146654] try_to_wake_up+0x1dc/0x4c0
[ 509.151038] ? __probe_kernel_read+0x3a/0x70
[ 509.155909] signal_wake_up_state+0x15/0x30
[ 509.160683] zap_process+0x90/0xd0
[ 509.164573] do_coredump+0xdba/0xef0
[ 509.168679] ? _raw_spin_lock+0x1b/0x20
[ 509.173045] ? try_to_wake_up+0x120/0x4c0
[ 509.177632] ? pointer+0x1f9/0x2b0
[ 509.181532] ? sched_clock+0x5/0x10
[ 509.185526] ? sched_clock_cpu+0xc/0xa0
[ 509.189911] ? log_store+0x1b5/0x280
[ 509.194002] get_signal+0x12d/0x6d0
[ 509.197998] ? page_fault+0x8/0x30
[ 509.201895] do_signal+0x30/0x6c0
[ 509.205686] ? signal_wake_up_state+0x15/0x30
[ 509.210643] ? __send_signal+0x306/0x4a0
[ 509.215114] ? show_opcodes+0x93/0xa0
[ 509.219286] ? force_sig_info+0xc7/0xe0
[ 509.223653] ? page_fault+0x8/0x30
[ 509.227544] exit_to_usermode_loop+0x77/0xe0
[ 509.232415] prepare_exit_to_usermode+0x70/0x80
[ 509.237569] retint_user+0x8/0x8
[ 509.241273] RIP: 0033:0x7f854e7fbe80
[ 509.245357] Code: 00 00 36 2a 0e 00 00 00 00 00 90 be 7f 4e 85 7f
00 00 4c e8 bf a10
[ 509.266508] RSP: 002b:00007f854e7fbe50 EFLAGS: 00010246
[ 509.272429] RAX: 0000000000000000 RBX: 00000000002dc6c0 RCX:
0000000000000000
[ 509.280500] RDX: 00000000000e2a36 RSI: 00007f854e7fbe50 RDI:
0000000000000000
[ 509.288563] RBP: 00007f855020a170 R08: 000000005c764199 R09:
00007ffea1bfb0a0
[ 509.296624] R10: 00007f854e7fbe30 R11: 000000000002457c R12:
00007f854e7fbed0
[ 509.304685] R13: 00007f855e555e6f R14: 0000000000000000 R15:
00007f855020a150
[ 509.312738] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nai
[ 509.398325] CR2: 0000000000000008
[ 509.402116] ---[ end trace f1214a54c044bdb6 ]---
[ 509.402118] BUG: unable to handle kernel NULL pointer dereference
at 000000000000008
[ 509.402122] #PF error: [normal kernel read fault]
[ 509.412727] RIP: 0010:rb_insert_color+0x17/0x190
[ 509.416649] PGD 8000001807b7d067 P4D 8000001807b7d067 PUD
18071c9067 PMD 0
[ 509.421990] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 174
[ 509.427230] Oops: 0000 [#2] SMP PTI
[ 509.435096] RSP: 0000:ffffc90009203c08 EFLAGS: 00010046
[ 509.456243] CPU: 2 PID: 3498 Comm: schbench Tainted: G D I
5.0.0-rc8-04
[ 509.460222] RAX: 0000000000000000 RBX: ffff889806f91e00 RCX:
ffff889806f91e00
[ 509.460224] RDX: ffff889806f83f48 RSI: ffff88980f2238c8 RDI:
ffff889806f92148
[ 509.466152] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.92
[ 509.466159] RIP: 0010:task_tick_fair+0xb3/0x290
[ 509.477458] RBP: ffff88980f222cc0 R08: 000000000000026e R09:
ffff88980a099000
[ 509.477461] R10: 0000000000000078 R11: ffff88980a099b58 R12:
0000000000000004
[ 509.485521] Code: 2b 53 60 48 39 d0 0f 82 a0 00 00 00 8b 0d 29 ab
19 01 48 39 ca 728
[ 509.485523] RSP: 0000:ffff888c0f083e60 EFLAGS: 00010046
[ 509.493583] R13: ffffc90009203c68 R14: 0000000000000046 R15:
0000000000022cc0
[ 509.493586] FS: 00007f854e7fc700(0000) GS:ffff88980f200000(0000)
knlGS:000000000000
[ 509.505170] RAX: 0000000000b71aff RBX: ffff888be4df3800 RCX:
0000000000000000
[ 509.505173] RDX: 00000525112fc50e RSI: 0000000000000000 RDI:
0000000000000000
[ 509.510318] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 509.510320] CR2: 0000000000000008 CR3: 0000001807b64005 CR4:
00000000000606e0
[ 509.518381] RBP: ffff888c0f0a2d40 R08: 0000000001ffffff R09:
0000000000000040
[ 509.518383] R10: ffff888c0f083e20 R11: 0000000000405f09 R12:
0000000000000000
[ 509.617516] R13: ffff889806f81e00 R14: ffff888c0f0a2cc0 R15:
0000000000000000
[ 509.625586] FS: 00007f854ffff700(0000) GS:ffff888c0f080000(0000)
knlGS:000000000000
[ 509.634742] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 509.641245] CR2: 0000000000000058 CR3: 0000001807b64002 CR4:
00000000000606e0
[ 509.649313] Call Trace:
[ 509.652131] <IRQ>
[ 509.654462] ? tick_sched_do_timer+0x60/0x60
[ 509.659315] scheduler_tick+0x84/0x120
[ 509.663584] update_process_times+0x40/0x50
[ 509.668345] tick_sched_handle+0x21/0x70
[ 509.672814] tick_sched_timer+0x37/0x70
[ 509.677204] __hrtimer_run_queues+0x108/0x290
[ 509.682163] hrtimer_interrupt+0xe5/0x240
[ 509.686732] smp_apic_timer_interrupt+0x6a/0x130
[ 509.691989] apic_timer_interrupt+0xf/0x20
[ 509.696659] </IRQ>
[ 509.699079] RIP: 0033:0x7ffea1bfe6ac
[ 509.703160] Code: 2d 81 e9 ff ff 4c 8b 05 82 e9 ff ff 0f 01 f9 66
90 41 8b 0c 24 39f
[ 509.724301] RSP: 002b:00007f854fffedf0 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff13
[ 509.732872] RAX: 0000000077e4a044 RBX: 00007f854fffee50 RCX:
0000000000000002
[ 509.740941] RDX: 0000000000000166 RSI: 00007f854fffee50 RDI:
0000000000000000
[ 509.749001] RBP: 00007f854fffee10 R08: 0000000000000000 R09:
00007ffea1bfb0a0
[ 509.757061] R10: 00007ffea1bfb080 R11: 000000000002457e R12:
0000000000000000
[ 509.765121] R13: 00007f855e555e6f R14: 0000000000000000 R15:
00007f85500008c0
[ 509.773182] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nai
[ 509.858758] CR2: 0000000000000058
[ 509.862581] ---[ end trace f1214a54c044bdb7 ]---
[ 509.862583] BUG: unable to handle kernel NULL pointer dereference
at 000000000000008
[ 509.862585] #PF error: [normal kernel read fault]
[ 509.873332] RIP: 0010:rb_insert_color+0x17/0x190
[ 509.877246] PGD 8000001807b7d067 P4D 8000001807b7d067 PUD
18071c9067 PMD 0
[ 509.882592] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 174
[ 509.887828] Oops: 0000 [#3] SMP PTI
[ 509.895684] RSP: 0000:ffffc90009203c08 EFLAGS: 00010046
[ 509.916828] CPU: 26 PID: 3506 Comm: schbench Tainted: G D I
5.0.0-rc8-4
[ 509.920802] RAX: 0000000000000000 RBX: ffff889806f91e00 RCX:
ffff889806f91e00
[ 509.920804] RDX: ffff889806f83f48 RSI: ffff88980f2238c8 RDI:
ffff889806f92148
[ 509.926726] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.92
[ 509.926731] RIP: 0010:task_tick_fair+0xb3/0x290
[ 509.938120] RBP: ffff88980f222cc0 R08: 000000000000026e R09:
ffff88980a099000
[ 509.938122] R10: 0000000000000078 R11: ffff88980a099b58 R12:
0000000000000004
[ 509.946183] Code: 2b 53 60 48 39 d0 0f 82 a0 00 00 00 8b 0d 29 ab
19 01 48 39 ca 728
[ 509.946186] RSP: 0000:ffff88980f283e60 EFLAGS: 00010046
[ 509.954245] R13: ffffc90009203c68 R14: 0000000000000046 R15:
0000000000022cc0
[ 509.954248] FS: 00007f854ffff700(0000) GS:ffff888c0f080000(0000)
knlGS:000000000000
[ 509.965836] RAX: 0000000000b71aff RBX: ffff888be4df3800 RCX:
0000000000000000
[ 509.965839] RDX: 00000525112fc50e RSI: 0000000000000000 RDI:
0000000000000000
[ 509.970981] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 509.970983] CR2: 0000000000000058 CR3: 0000001807b64002 CR4:
00000000000606e0
[ 509.979043] RBP: ffff888c0f0a2d40 R08: 0000000001ffffff R09:
0000000000000040
[ 509.979045] R10: ffff88980f283e68 R11: 0000000000000000 R12:
0000000000000000
[ 509.987095] Kernel panic - not syncing: Fatal exception in
interrupt
[ 510.008237] R13: ffff889807f91e00 R14: ffff88980f2a2cc0 R15:
0000000000000000
[ 510.008240] FS: 00007f8547fff700(0000) GS:ffff88980f280000(0000)
knlGS:000000000000
[ 510.102589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 510.109103] CR2: 0000000000000058 CR3: 0000001807b64003 CR4:
00000000000606e0
[ 510.117164] Call Trace:
[ 510.119977] <IRQ>
[ 510.122316] ? tick_sched_do_timer+0x60/0x60
[ 510.127168] scheduler_tick+0x84/0x120
[ 510.131445] update_process_times+0x40/0x50
[ 510.136203] tick_sched_handle+0x21/0x70
[ 510.140672] tick_sched_timer+0x37/0x70
[ 510.145040] __hrtimer_run_queues+0x108/0x290
[ 510.149990] hrtimer_interrupt+0xe5/0x240
[ 510.154554] smp_apic_timer_interrupt+0x6a/0x130
[ 510.159796] apic_timer_interrupt+0xf/0x20
[ 510.164454] </IRQ>
[ 510.166882] RIP: 0033:0x7ffea1bfe6ac
[ 510.170958] Code: 2d 81 e9 ff ff 4c 8b 05 82 e9 ff ff 0f 01 f9 66
90 41 8b 0c 24 39f
[ 510.192101] RSP: 002b:00007f8547ffedf0 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff13
[ 510.200675] RAX: 0000000078890657 RBX: 00007f8547ffee50 RCX:
000000000000101a
[ 510.208736] RDX: 0000000000000166 RSI: 00007f8547ffee50 RDI:
0000000000000000
[ 510.216799] RBP: 00007f8547ffee10 R08: 0000000000000000 R09:
00007ffea1bfb0a0
[ 510.224861] R10: 00007ffea1bfb080 R11: 000000000002457e R12:
0000000000000000
[ 510.234319] R13: 00007f855ed56e6f R14: 0000000000000000 R15:
00007f855830ed98
[ 510.242371] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nai
[ 510.327929] CR2: 0000000000000058
[ 510.331720] ---[ end trace f1214a54c044bdb8 ]---
[ 510.342658] RIP: 0010:rb_insert_color+0x17/0x190
[ 510.347900] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 174
[ 510.369044] RSP: 0000:ffffc90009203c08 EFLAGS: 00010046
[ 510.374968] RAX: 0000000000000000 RBX: ffff889806f91e00 RCX:
ffff889806f91e00
[ 510.383031] RDX: ffff889806f83f48 RSI: ffff88980f2238c8 RDI:
ffff889806f92148
[ 510.391093] RBP: ffff88980f222cc0 R08: 000000000000026e R09:
ffff88980a099000
[ 510.399154] R10: 0000000000000078 R11: ffff88980a099b58 R12:
0000000000000004
[ 510.407214] R13: ffffc90009203c68 R14: 0000000000000046 R15:
0000000000022cc0
[ 510.415278] FS: 00007f8547fff700(0000) GS:ffff88980f280000(0000)
knlGS:000000000000
[ 510.424434] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 510.430939] CR2: 0000000000000058 CR3: 0000001807b64003 CR4:
00000000000606e0
[ 511.068880] Shutting down cpus with NMI
[ 511.075437] Kernel Offset: disabled
[ 511.083621] ---[ end Kernel panic - not syncing: Fatal exception in
interrupt ]---

2019-03-01 03:14:19

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling


On 2/18/19 8:56 AM, Peter Zijlstra wrote:
> A much 'demanded' feature: core-scheduling :-(
>
> I still hate it with a passion, and that is part of why it took a little
> longer than 'promised'.
>
> While this one doesn't have all the 'features' of the previous (never
> published) version and isn't L1TF 'complete', I tend to like the structure
> better (relatively speaking: I hate it slightly less).
>
> This one is sched class agnostic and therefore, in principle, doesn't horribly
> wreck RT (in fact, RT could 'ab'use this by setting 'task->core_cookie = task'
> to force-idle siblings).
>
> Now, as hinted by that, there are semi sane reasons for actually having this.
> Various hardware features like Intel RDT - Memory Bandwidth Allocation, work
> per core (due to SMT fundamentally sharing caches) and therefore grouping
> related tasks on a core makes it more reliable.
>
> However; whichever way around you turn this cookie; it is expensive and nasty.
>
I am seeing the following hard lockup frequently now. Following is full
kernel output:

[ 5846.412296] drop_caches (8657): drop_caches: 3
[ 5846.624823] drop_caches (8658): drop_caches: 3
[ 5850.604641] hugetlbfs: oracle (8671): Using mlock ulimits for SHM_HUGETL
B is deprecated
[ 5962.930812] NMI watchdog: Watchdog detected hard LOCKUP on cpu 32
[ 5962.930814] Modules linked in: drbd lru_cache autofs4 cpufreq_powersave
ipv6 crc_ccitt mxm_wmi iTCO_wdt iTCO_vendor_support btrfs raid6_pq
zstd_compress zstd_decompress xor pcspkr i2c_i801 lpc_ich mfd_core ioatdma
ixgbe dca mdio sg ipmi_ssif i2c_core ipmi_si ipmi_msghandler wmi
pcc_cpufreq acpi_pad ext4 fscrypto jbd2 mbcache sd_mod ahci libahci nvme
nvme_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[ 5962.930828] CPU: 32 PID: 10333 Comm: oracle_10333_tp Not tainted
5.0.0-rc7core_sched #1
[ 5962.930828] Hardware name: Oracle Corporation ORACLE SERVER
X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[ 5962.930829] RIP: 0010:try_to_wake_up+0x98/0x470
[ 5962.930830] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 44 00 00 8b 43
3c 8b 73 60 85 f6 0f 85 a6 01 00 00 8b 43 38 85 c0 74 09 f3 90 8b 43 38
<85> c0 75 f7 48 8b 43 10 a8 02 b8 00 00 00 00 0f 85 d5 01 00 00 0f
[ 5962.930831] RSP: 0018:ffffc9000f4dbcb8 EFLAGS: 00000002
[ 5962.930832] RAX: 0000000000000001 RBX: ffff88dfb4af1680 RCX:
0000000000000041
[ 5962.930832] RDX: 0000000000000001 RSI: 0000000000000000 RDI:
ffff88dfb4af214c
[ 5962.930833] RBP: 0000000000000000 R08: 0000000000000001 R09:
ffffc9000f4dbd80
[ 5962.930833] R10: ffff888000000000 R11: ffffea00f0003d80 R12:
ffff88dfb4af214c
[ 5962.930834] R13: 0000000000000001 R14: 0000000000000046 R15:
0000000000000001
[ 5962.930834] FS:  00007ff4fabd9ae0(0000) GS:ffff88dfbe280000(0000)
knlGS:0000000000000000
[ 5962.930834] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5962.930835] CR2: 0000000f4cc84000 CR3: 0000003b93d36002 CR4:
00000000003606e0
[ 5962.930835] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 5962.930836] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 5962.930836] Call Trace:
[ 5962.930837]  ? __switch_to_asm+0x34/0x70
[ 5962.930837]  ? __switch_to_asm+0x40/0x70
[ 5962.930838]  ? __switch_to_asm+0x34/0x70
[ 5962.930838]  autoremove_wake_function+0x11/0x50
[ 5962.930838]  __wake_up_common+0x8f/0x160
[ 5962.930839]  ? __switch_to_asm+0x40/0x70
[ 5962.930839]  __wake_up_common_lock+0x7c/0xc0
[ 5962.930840]  pipe_write+0x24e/0x3f0
[ 5962.930840]  __vfs_write+0x127/0x1b0
[ 5962.930840]  vfs_write+0xb3/0x1b0
[ 5962.930841]  ksys_write+0x52/0xc0
[ 5962.930841]  do_syscall_64+0x5b/0x170
[ 5962.930842]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 5962.930842] RIP: 0033:0x3b5900e7b0
[ 5962.930843] Code: 97 20 00 31 d2 48 29 c2 64 89 11 48 83 c8 ff eb ea 90
90 90 90 90 90 90 90 90 83 3d f1 db 20 00 00 75 10 b8 01 00 00 00 0f 05
<48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e fa ff ff 48 89 04 24
[ 5962.930843] RSP: 002b:00007ffedbcd93a8 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[ 5962.930844] RAX: ffffffffffffffda RBX: 00007ff4faa86e24 RCX:
0000003b5900e7b0
[ 5962.930845] RDX: 000000000000028f RSI: 00007ff4faa9688e RDI:
000000000000000a
[ 5962.930845] RBP: 00007ffedbcd93c0 R08: 00007ffedbcd9458 R09:
0000000000000020
[ 5962.930846] R10: 0000000000000000 R11: 0000000000000246 R12:
00007ffedbcd9458
[ 5962.930847] R13: 00007ff4faa9688e R14: 00007ff4faa89cc8 R15:
00007ff4faa86bd0
[ 5962.930847] Kernel panic - not syncing: Hard LOCKUP
[ 5962.930848] CPU: 32 PID: 10333 Comm: oracle_10333_tp Not tainted
5.0.0-rc7core_sched #1
[ 5962.930848] Hardware name: Oracle Corporation ORACLE
SERVER X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[ 5962.930849] Call Trace:
[ 5962.930849]  <NMI>
[ 5962.930849]  dump_stack+0x5c/0x7b
[ 5962.930850]  panic+0xfe/0x2b2
[ 5962.930850]  nmi_panic+0x35/0x40
[ 5962.930851]  watchdog_overflow_callback+0xef/0x100
[ 5962.930851]  __perf_event_overflow+0x5a/0xe0
[ 5962.930852]  handle_pmi_common+0x1d1/0x280
[ 5962.930852]  ? __set_pte_vaddr+0x32/0x50
[ 5962.930852]  ? __set_pte_vaddr+0x32/0x50
[ 5962.930853]  ? set_pte_vaddr+0x3c/0x60
[ 5962.930853]  ? intel_pmu_handle_irq+0xad/0x170
[ 5962.930854]  intel_pmu_handle_irq+0xad/0x170
[ 5962.930854]  perf_event_nmi_handler+0x2e/0x50
[ 5962.930854]  nmi_handle+0x6f/0x120
[ 5962.930855]  default_do_nmi+0xee/0x110
[ 5962.930855]  do_nmi+0xe5/0x130
[ 5962.930856]  end_repeat_nmi+0x16/0x50
[ 5962.930856] RIP: 0010:try_to_wake_up+0x98/0x470
[ 5962.930857] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 44 00 00 8b 43
3c 8b 73 60 85 f6 0f 85 a6 01 00 00 8b 43 38 85 c0 74 09 f3 90 8b 43 38
<85> c0 75 f7 48 8b 43 10 a8 02 b8 00 00 00 00 0f 85 d5 01 00 00 0f
[ 5962.930857] RSP: 0018:ffffc9000f4dbcb8 EFLAGS: 00000002
[ 5962.930858] RAX: 0000000000000001 RBX: ffff88dfb4af1680 RCX:
0000000000000041
[ 5962.930859] RDX: 0000000000000001 RSI: 0000000000000000 RDI:
ffff88dfb4af214c
[ 5962.930859] RBP: 0000000000000000 R08: 0000000000000001 R09:
ffffc9000f4dbd80
[ 5962.930859] R10: ffff888000000000 R11: ffffea00f0003d80 R12:
ffff88dfb4af214c
[ 5962.930860] R13: 0000000000000001 R14: 0000000000000046 R15:
0000000000000001
[ 5962.930860]  ? try_to_wake_up+0x98/0x470
[ 5962.930861]  ? try_to_wake_up+0x98/0x470
[ 5962.930861]  </NMI>
[ 5962.930862]  ? __switch_to_asm+0x34/0x70
[ 5962.930862]  ? __switch_to_asm+0x40/0x70
[ 5962.930862]  ? __switch_to_asm+0x34/0x70
[ 5962.930863]  autoremove_wake_function+0x11/0x50
[ 5962.930863]  __wake_up_common+0x8f/0x160
[ 5962.930864]  ? __switch_to_asm+0x40/0x70
[ 5962.930864]  __wake_up_common_lock+0x7c/0xc0
[ 5962.930864]  pipe_write+0x24e/0x3f0
[ 5962.930865]  __vfs_write+0x127/0x1b0
[ 5962.930865]  vfs_write+0xb3/0x1b0
[ 5962.930866]  ksys_write+0x52/0xc0
[ 5962.930866]  do_syscall_64+0x5b/0x170
[ 5962.930866]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 5962.930867] RIP: 0033:0x3b5900e7b0
[ 5962.930868] Code: 97 20 00 31 d2 48 29 c2 64 89 11 48 83 c8 ff eb ea 90
90 90 90 90 90 90 90 90 83 3d f1 db 20 00 00 75 10 b8 01 00 00 00 0f 05
<48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e fa ff ff 48 89 04 24
[ 5962.930868] RSP: 002b:00007ffedbcd93a8 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[ 5962.930869] RAX: ffffffffffffffda RBX: 00007ff4faa86e24 RCX:
0000003b5900e7b0
[ 5962.930869] RDX: 000000000000028f RSI: 00007ff4faa9688e RDI:
000000000000000a
[ 5962.930870] RBP: 00007ffedbcd93c0 R08: 00007ffedbcd9458 R09:
0000000000000020
[ 5962.930870] R10: 0000000000000000 R11: 0000000000000246 R12:
00007ffedbcd9458
[ 5962.930871] R13: 00007ff4faa9688e R14: 00007ff4faa89cc8 R15:
00007ff4faa86bd0
[ 5963.987766] NMI watchdog: Watchdog detected hard LOCKUP on cpu 11
[ 5963.987767] Modules linked in: drbd lru_cache autofs4 cpufreq_powersave
ipv6 crc_ccitt mxm_wmi iTCO_wdt iTCO_vendor_support btrfs raid6_pq
zstd_compress zstd_decompress xor pcspkr i2c_i801 lpc_ich mfd_core ioatdma
ixgbe dca mdio sg ipmi_ssif i2c_core ipmi_si ipmi_msghandler wmi
pcc_cpufreq acpi_pad ext4 fscrypto jbd2 mbcache sd_mod ahci libahci nvme
nvme_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[ 5963.987775] CPU: 11 PID: 8805 Comm: ora_lg02_tpcc1 Not tainted
5.0.0-rc7core_sched #1
[ 5963.987775] Hardware name: Oracle Corporation ORACLE SERVER
X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[ 5963.987776] RIP: 0010:native_queued_spin_lock_slowpath+0x183/0x1e0
[ 5963.987777] Code: 48 c1 ee 0b 83 e8 01 83 e6 60 48 98 48 81 c6 00 3a 02
00 48 03 34 c5 20 98 13 82 48 89 16 8b 42 08 85 c0 75 09 f3 90 8b 42 08
<85> c0 74 f7 48 8b 32 48 85 f6 74 07 0f 0d 0e eb 02 f3 90 8b 07 66
[ 5963.987777] RSP: 0018:ffffc90023003760 EFLAGS: 00000046
[ 5963.987778] RAX: 0000000000000000 RBX: 0000000000000002 RCX:
0000000000300000
[ 5963.987778] RDX: ffff88afbf2e3a00 RSI: ffff88dfbeae3a00 RDI:
ffff88dfbe1a2d40
[ 5963.987779] RBP: ffff88dfbe1a2d40 R08: 0000000000300000 R09:
00000fffffc00000
[ 5963.987779] R10: ffffc90023003778 R11: ffff88afb77b3340 R12:
000000000000001c
[ 5963.987779] R13: 0000000000022d40 R14: 0000000000000000 R15:
000000000000001c
[ 5963.987780] FS:  00007f4e14e73ae0(0000) GS:ffff88afbf2c0000(0000)
knlGS:0000000000000000
[ 5963.987780] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5963.987780] CR2: 00007fe503647850 CR3: 0000000d1b1ae002 CR4:
00000000003606e0
[ 5963.987781] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 5963.987781] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 5963.987781] Call Trace:
[ 5963.987781]  _raw_spin_lock_irqsave+0x39/0x40
[ 5963.987782]  update_blocked_averages+0x32/0x610
[ 5963.987782]  update_nohz_stats+0x4d/0x60
[ 5963.987782]  update_sd_lb_stats+0x2e5/0x7d0
[ 5963.987783]  find_busiest_group+0x3e/0x5b0
[ 5963.987783]  load_balance+0x18c/0xc00
[ 5963.987783]  newidle_balance+0x278/0x490
[ 5963.987783]  __schedule+0xd16/0x1060
[ 5963.987784]  ? lock_timer_base+0x66/0x80
[ 5963.987784]  schedule+0x32/0x70
[ 5963.987784]  schedule_timeout+0x16d/0x360
[ 5963.987785]  ? __next_timer_interrupt+0xc0/0xc0
[ 5963.987785]  do_semtimedop+0x966/0x1180
[ 5963.987785]  ? xas_load+0x9/0x80
[ 5963.987786]  ? find_get_entry+0x5d/0x1e0
[ 5963.987786]  ? pagecache_get_page+0x1b4/0x2d0
[ 5963.987786]  ? __vfs_getxattr+0x2a/0x70
[ 5963.987786]  ? enqueue_task_rt+0x98/0xb0
[ 5963.987787]  ? check_preempt_curr+0x50/0x90
[ 5963.987787]  ? push_rt_tasks+0x20/0x20
[ 5963.987787]  ? ttwu_do_wakeup+0x5e/0x160
[ 5963.987788]  ? try_to_wake_up+0x54/0x470
[ 5963.987788]  ? wake_up_q+0x2d/0x70
[ 5963.987788]  ? semctl_setval+0x26d/0x400
[ 5963.987788]  ? ksys_semtimedop+0x52/0x80
[ 5963.987789]  ksys_semtimedop+0x52/0x80
[ 5963.987789]  do_syscall_64+0x5b/0x170
[ 5963.987789]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 5963.987789] RIP: 0033:0x3b58ceb28a
[ 5963.987790] Code: 73 01 c3 48 8b 0d 3e 2d 2a 00 31 d2 48 29 c2 64 89 11
48 83 c8 ff eb ea 90 90 90 90 90 90 90 90 49 89 ca b8 dc 00 00 00 0f 05
<48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0e 2d 2a 00 31 d2 48 29 c2 64
[ 5963.987790] RSP: 002b:00007fff79282ed8 EFLAGS: 00000206 ORIG_RAX:
00000000000000dc
[ 5963.987791] RAX: ffffffffffffffda RBX: ffffffffffd23940 RCX:
0000003b58ceb28a
[ 5963.987791] RDX: 0000000000000001 RSI: 00007fff792830b8 RDI:
0000000000058002
[ 5963.987791] RBP: 00007fff792830e0 R08: 0000000000000000 R09:
0000000171327788
[ 5963.987792] R10: 00007fff79283068 R11: 0000000000000206 R12:
00007fff792833c8
[ 5963.987792] R13: 0000000000058002 R14: 0000000168dc0770 R15:
0000000000000000
[ 5963.987796] Shutting down cpus with NMI
[ 5963.987796] Kernel Offset: disabled
[ 5963.987797] NMI watchdog: Watchdog detected hard LOCKUP on cpu 33
[ 5963.987797] Modules linked in: drbd lru_cache autofs4 cpufreq_powersave
ipv6 crc_ccitt mxm_wmi iTCO_wdt iTCO_vendor_support btrfs raid6_pq
zstd_compress zstd_decompress xor pcspkr i2c_i801 lpc_ich mfd_core ioatdma
ixgbe dca mdio sg ipmi_ssif i2c_core ipmi_si ipmi_msghandler wmi
pcc_cpufreq acpi_pad ext4 fscrypto jbd2 mbcache sd_mod ahci libahci nvme
nvme_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[ 5963.987805] CPU: 33 PID: 10303 Comm: oracle_10303_tp Not tainted
5.0.0-rc7core_sched #1
[ 5963.987806] Hardware name: Oracle Corporation ORACLE
SERVER X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[ 5963.987806] RIP: 0010:native_queued_spin_lock_slowpath+0x180/0x1e0
[ 5963.987807] Code: c1 e8 12 48 c1 ee 0b 83 e8 01 83 e6 60 48 98 48 81
c6 00 3a 02 00 48 03 34 c5 20 98 13 82 48 89 16 8b 42 08 85 c0 75 09 f3 90
<8b> 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 07 0f 0d 0e eb 02 f3 90
[ 5963.987807] RSP: 0018:ffffc90024833980 EFLAGS: 00000046
[ 5963.987808] RAX: 0000000000000000 RBX: 0000000000000002 RCX:
0000000000880000
[ 5963.987808] RDX: ffff88dfbe2e3a00 RSI: ffff88dfbe763a00 RDI:
ffff88dfbe3e2d40
[ 5963.987809] RBP: ffff88dfbe3e2d40 R08: 0000000000880000 R09:
0000002000000000
[ 5963.987809] R10: 0000000000000004 R11: ffff88dfb6ffd2c0 R12:
0000000000000025
[ 5963.987809] R13: 0000000000022d40 R14: 0000000000000000 R15:
0000000000000025
[ 5963.987810] FS:  00007f0b7e5feae0(0000) GS:ffff88dfbe2c0000(0000)
knlGS:0000000000000000
[ 5963.987810] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5963.987810] CR2: 000000007564d0e7 CR3: 0000003debf6a001 CR4:
00000000003606e0
[ 5963.987811] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 5963.987811] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 5963.987811] Call Trace:
[ 5963.987812]  _raw_spin_lock_irqsave+0x39/0x40
[ 5963.987812]  update_blocked_averages+0x32/0x610
[ 5963.987812]  update_nohz_stats+0x4d/0x60
[ 5963.987812]  update_sd_lb_stats+0x2e5/0x7d0
[ 5963.987813]  find_busiest_group+0x3e/0x5b0
[ 5963.987813]  load_balance+0x18c/0xc00
[ 5963.987813]  ? __switch_to_asm+0x40/0x70
[ 5963.987813]  ? __switch_to_asm+0x34/0x70
[ 5963.987814]  newidle_balance+0x278/0x490
[ 5963.987814]  __schedule+0xd16/0x1060
[ 5963.987814]  ? enqueue_hrtimer+0x3a/0x90
[ 5963.987814]  schedule+0x32/0x70
[ 5963.987815]  do_nanosleep+0x81/0x180
[ 5963.987815]  hrtimer_nanosleep+0xce/0x1f0
[ 5963.987815]  ? __hrtimer_init+0xb0/0xb0
[ 5963.987816]  __x64_sys_nanosleep+0x8d/0xa0
[ 5963.987816]  do_syscall_64+0x5b/0x170
[ 5963.987816]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 5963.987816] RIP: 0033:0x3b5900eff0
[ 5963.987817] Code: 73 01 c3 48 8b 0d b8 8f 20 00 31 d2 48 29 c2 64 89 11
48 83 c8 ff eb ea 90 90 83 3d b1 d3 20 00 00 75 10 b8 23 00 00 00 0f 05
<48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 1e f2 ff ff 48 89 04 24
[ 5963.987817] RSP: 002b:00007ffe359b5158 EFLAGS: 00000246 ORIG_RAX:
0000000000000023
[ 5963.987818] RAX: ffffffffffffffda RBX: 0000000000169d10 RCX:
0000003b5900eff0
[ 5963.987818] RDX: 0000000000000000 RSI: 00007ffe359b5170 RDI:
00007ffe359b5160
[ 5963.987818] RBP: 00007ffe359b51c0 R08: 0000000000000000 R09:
0000000000000000
[ 5963.987819] R10: 00000001512241f8 R11: 0000000000000246 R12:
0000000000000000
[ 5963.987819] R13: 00007ffe359b5280 R14: 00000000000005ca R15:
0000000000000000
[ 5963.987827] NMI watchdog: Watchdog detected hard LOCKUP on cpu 75
[ 5963.987828] Modules linked in: drbd lru_cache autofs4 cpufreq_powersave
ipv6 crc_ccitt mxm_wmi iTCO_wdt iTCO_vendor_support btrfs raid6_pq
zstd_compress zstd_decompress xor pcspkr i2c_i801 lpc_ich mfd_core ioatdma
ixgbe dca mdio sg ipmi_ssif i2c_core ipmi_si ipmi_msghandler wmi
pcc_cpufreq acpi_pad ext4 fscrypto jbd2 mbcache sd_mod ahci libahci nvme
nvme_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[ 5963.987835] CPU: 75 PID: 0 Comm: swapper/75 Not tainted
5.0.0-rc7core_sched #1
[ 5963.987836] Hardware name: Oracle Corporation
ORACLE SERVER X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[ 5963.987836] RIP: 0010:native_queued_spin_lock_slowpath+0x5e/0x1e0
[ 5963.987837] Code: ff 75 40 f0 0f ba 2f 08 0f 82 e7 00 00 00 8b 07 30 e4
09 c6 f7 c6 00 ff ff ff 75 1b 85 f6 74 0e 8b 07 84 c0 74 08 f3 90 8b 07
<84> c0 75 f8 b8 01 00 00 00 66 89 07 c3 81 e6 00 ff 00 00 75 04 c6
[ 5963.987837] RSP: 0000:ffffc9000c77bd78 EFLAGS: 00000002
[ 5963.987838] RAX: 0000000001240101 RBX: 000000000000004b RCX:
0000000000000001
[ 5963.987838] RDX: ffff88dfbe522d40 RSI: 0000000000000001 RDI:
ffff88dfbe262d40
[ 5963.987838] RBP: 0000000000022d40 R08: 0000000000000000 R09:
0000000000000001
[ 5963.987839] R10: 0000000000000001 R11: 0000000000000001 R12:
ffff88dfb4e40000
[ 5963.987839] R13: ffff88dfbe7e2d40 R14: 000000000000002a R15:
ffff88dfbe522d40
[ 5963.987840] FS:  0000000000000000(0000) GS:ffff88dfbe7c0000(0000)
knlGS:0000000000000000
[ 5963.987840] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5963.987840] CR2: 0000000f4b348000 CR3: 0000003c45ed6001 CR4:
00000000003606e0
[ 5963.987841] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 5963.987841] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 5963.987841] Call Trace:
[ 5963.987842]  _raw_spin_lock+0x24/0x30
[ 5963.987842]  sched_core_balance+0x15c/0x4f0
[ 5963.987842]  __balance_callback+0x49/0xa0
[ 5963.987843]  __schedule+0xdc0/0x1060
[ 5963.987843]  schedule_idle+0x28/0x40
[ 5963.987843]  do_idle+0x164/0x260
[ 5963.987843]  cpu_startup_entry+0x19/0x20
[ 5963.987844]  start_secondary+0x17d/0x1d0
[ 5963.987844]  secondary_startup_64+0xa4/0xb0
[ 5963.987845] NMI watchdog: Watchdog detected hard LOCKUP on cpu 81
[ 5963.987845] Modules linked in: drbd lru_cache autofs4 cpufreq_powersave
ipv6 crc_ccitt mxm_wmi iTCO_wdt iTCO_vendor_support btrfs raid6_pq
zstd_compress zstd_decompress xor pcspkr i2c_i801 lpc_ich mfd_core ioatdma
ixgbe dca mdio sg ipmi_ssif i2c_core ipmi_si ipmi_msghandler wmi
pcc_cpufreq acpi_pad ext4 fscrypto jbd2 mbcache sd_mod ahci libahci nvme
nvme_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[ 5963.987853] CPU: 81 PID: 0 Comm: swapper/81 Not tainted
5.0.0-rc7core_sched #1
[ 5963.987854] Hardware name: Oracle Corporation ORACLE SERVER
X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[ 5963.987854] RIP: 0010:native_queued_spin_lock_slowpath+0x183/0x1e0
[ 5963.987854] Code: 48 c1 ee 0b 83 e8 01 83 e6 60 48 98 48 81 c6 00 3a 02
00 48 03 34 c5 20 98 13 82 48 89 16 8b 42 08 85 c0 75 09 f3 90 8b 42 08
<85> c0 74 f7 48 8b 32 48 85 f6 74 07 0f 0d 0e eb 02 f3 90 8b 07 66
[ 5963.987855] RSP: 0000:ffffc9000c7abd78 EFLAGS: 00000046
[ 5963.987855] RAX: 0000000000000000 RBX: 0000000000000051 RCX:
0000000001480000
[ 5963.987856] RDX: ffff88dfbe963a00 RSI: ffff88dfbe523a00 RDI:
ffff88dfbe522d40
[ 5963.987856] RBP: 0000000000022d40 R08: 0000000001480000 R09:
0000000000000001
[ 5963.987856] R10: ffff88dfb7a41680 R11: 0000000000000001 R12:
0000000000000001
[ 5963.987857] R13: ffff88dfbe962d40 R14: 0000000000000056 R15:
ffff88dfbeaa2d40
[ 5963.987857] FS:  0000000000000000(0000) GS:ffff88dfbe940000(0000)
knlGS:0000000000000000
[ 5963.987857] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5963.987857] CR2: 0000000f4a7a8000 CR3: 0000003b67e2a002 CR4:
00000000003606e0
[ 5963.987858] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 5963.987858] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 5963.987858] Call Trace:
[ 5963.987859]  _raw_spin_lock+0x24/0x30
[ 5963.987859]  sched_core_balance+0x15c/0x4f0
[ 5963.987859]  __balance_callback+0x49/0xa0
[ 5963.987859]  __schedule+0xdc0/0x1060
[ 5963.987860]  schedule_idle+0x28/0x40
[ 5963.987860]  do_idle+0x164/0x260
[ 5963.987860]  cpu_startup_entry+0x19/0x20
[ 5963.987860]  start_secondary+0x17d/0x1d0
[ 5963.987861]  secondary_startup_64+0xa4/0xb0
[ 5983.129164] ---[ end Kernel panic - not syncing: Hard LOCKUP ]---





2019-03-07 22:07:35

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On 22/02/19 15:10, Peter Zijlstra wrote:
>> I agree on not bike shedding about the API, but can we agree on some of
>> the high level properties? For example, who generates the core
>> scheduling ids, what properties about them are enforced, etc.?
> It's an opaque cookie; the scheduler really doesn't care. All it does is
> ensure that tasks match or force idle within a core.
>
> My previous patches got the cookie from a modified
> preempt_notifier_register/unregister() which passed the vcpu->kvm
> pointer into it from vcpu_load/put.
>
> This auto-grouped VMs. It was also found to be somewhat annoying because
> apparently KVM does a lot of userspace assist for all sorts of nonsense
> and it would leave/re-join the cookie group for every single assist.
> Causing tons of rescheduling.

KVM doesn't do _that much_ userspace exiting in practice when VMs are
properly configured (if they're not, you probably don't care about core
scheduling).

However, note that KVM needs core scheduling groups to be defined at the
thread level; one group per process is not enough. A VM has a bunch of
I/O threads and vCPU threads, and we want to set up core scheduling like
this:

+---------------------------------------+
| VM 1    iothread1    iothread2        |
| +-----------------+-----------------+ |
| | vCPU0     vCPU1 | vCPU2     vCPU3 | |
| +-----------------+-----------------+ |
+---------------------------------------+

+---------------------------------------+
| VM 2    iothread1    iothread2        |
| +-----------------+-----------------+ |
| | vCPU0     vCPU1 | vCPU2     vCPU3 | |
| +-----------------+-----------------+ |
| | vCPU4     vCPU5 | vCPU6     vCPU7 | |
| +-----------------+-----------------+ |
+---------------------------------------+

where the iothreads need not be subject to core scheduling but the vCPUs
do. If you don't place guest-sibling vCPUs in the same core scheduling
group, bad things happen.

The reason is that the guest might also be running a core scheduler, so
you could have:

- guest process 1 registering two threads A and B in the same group

- guest process 2 registering two threads C and D in the same group

- guest core scheduler placing thread A on vCPU0, thread B on vCPU1,
thread C on vCPU2, thread D on vCPU3

- host core scheduler deciding the four threads can be in physical cores
0-1, but physical core 0 gets A+C and physical core 1 gets B+D

- now process 2 shares cache with process 1. :(
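
To make that failure mode concrete, here is a minimal standalone sketch (not
kernel or KVM code; the vCPU/group layout and names are made up to match the
example above) of the constraint the host scheduler has to preserve:

/*
 * Illustrative sketch only.  guest_group[] encodes the guest's own
 * core-scheduling groups (threads A,B -> group 1 on vCPU0/1, threads
 * C,D -> group 2 on vCPU2/3); host_core[] is the host's choice of
 * physical core for each vCPU thread.  The host leaks cache between
 * guest groups if two vCPUs from different guest groups end up as
 * SMT siblings on the same physical core.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_VCPUS 4

static const int guest_group[NR_VCPUS] = { 1, 1, 2, 2 };

static bool pairing_leaks(const int host_core[NR_VCPUS])
{
	for (int i = 0; i < NR_VCPUS; i++)
		for (int j = i + 1; j < NR_VCPUS; j++)
			if (host_core[i] == host_core[j] &&
			    guest_group[i] != guest_group[j])
				return true;	/* two guest groups share a core */
	return false;
}

int main(void)
{
	/* vCPU0+vCPU1 on core 0, vCPU2+vCPU3 on core 1: matches the guest */
	const int good[NR_VCPUS] = { 0, 0, 1, 1 };
	/* vCPU0+vCPU2 on core 0, vCPU1+vCPU3 on core 1: mixes guest groups */
	const int bad[NR_VCPUS]  = { 0, 1, 0, 1 };

	printf("good pairing leaks: %s\n", pairing_leaks(good) ? "yes" : "no");
	printf("bad pairing leaks:  %s\n", pairing_leaks(bad) ? "yes" : "no");
	return 0;
}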

Paolo

2019-03-08 19:49:23

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling


On 2/22/19 4:45 AM, Mel Gorman wrote:
> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
>> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <[email protected]> wrote:
>>> However; whichever way around you turn this cookie; it is expensive and nasty.
>> Do you (or anybody else) have numbers for real loads?
>>
>> Because performance is all that matters. If performance is bad, then
>> it's pointless, since just turning off SMT is the answer.
>>
> I tried to do a comparison between tip/master, ht disabled and this series
> putting test workloads into a tagged cgroup but unfortunately it failed
>
> [ 156.978682] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
> [ 156.986597] #PF error: [normal kernel read fault]
> [ 156.991343] PGD 0 P4D 0
> [ 156.993905] Oops: 0000 [#1] SMP PTI
> [ 156.997438] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 5.0.0-rc7-schedcore-v1r1 #1
> [ 157.005161] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
> [ 157.012896] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
> [ 157.018613] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00
> 53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
> [ 157.037544] RSP: 0018:ffffc9000c5bbde8 EFLAGS: 00010086
> [ 157.042819] RAX: ffff88810f5f6a00 RBX: 00000001547f175c RCX: 0000000000000001
> [ 157.050015] RDX: ffff88bf3bdb0a40 RSI: 0000000000000000 RDI: 00000001547f175c
> [ 157.057215] RBP: ffff88bf7fae32c0 R08: 000000000001e358 R09: ffff88810fb9f000
> [ 157.064410] R10: ffffc9000c5bbe08 R11: ffff88810fb9f5c4 R12: 0000000000000000
> [ 157.071611] R13: ffff88bf4e3ea0c0 R14: 0000000000000000 R15: ffff88bf4e3ea7a8
> [ 157.078814] FS: 0000000000000000(0000) GS:ffff88bf7f5c0000(0000) knlGS:0000000000000000
> [ 157.086977] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 157.092779] CR2: 0000000000000058 CR3: 000000000220e005 CR4: 00000000003606e0
> [ 157.099979] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 157.109529] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 157.119058] Call Trace:
> [ 157.123865] pick_next_entity+0x61/0x110
> [ 157.130137] pick_task_fair+0x4b/0x90
> [ 157.136124] __schedule+0x365/0x12c0
> [ 157.141985] schedule_idle+0x1e/0x40
> [ 157.147822] do_idle+0x166/0x280
> [ 157.153275] cpu_startup_entry+0x19/0x20
> [ 157.159420] start_secondary+0x17a/0x1d0
> [ 157.165568] secondary_startup_64+0xa4/0xb0
> [ 157.171985] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs msr intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crc32_pclmul ghash_clmulni_intel ixgbe aesni_intel xfrm_algo iTCO_wdt joydev iTCO_vendor_support libphy igb aes_x86_64 crypto_simd ptp cryptd mei_me mdio pps_core ioatdma glue_helper pcspkr ipmi_si lpc_ich i2c_i801 mei dca ipmi_devintf ipmi_msghandler acpi_pad pcc_cpufreq button btrfs libcrc32c xor zstd_decompress zstd_compress raid6_pq hid_generic usbhid ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci crc32c_intel ehci_pci ttm xhci_hcd ehci_hcd drm ahci usbcore mpt3sas libahci raid_class scsi_transport_sas wmi sg nbd dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
> [ 157.258990] CR2: 0000000000000058
> [ 157.264961] ---[ end trace a301ac5e3ee86fde ]---
> [ 157.283719] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
> [ 157.291967] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00 53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
> [ 157.316121] RSP: 0018:ffffc9000c5bbde8 EFLAGS: 00010086
> [ 157.324060] RAX: ffff88810f5f6a00 RBX: 00000001547f175c RCX: 0000000000000001
> [ 157.333932] RDX: ffff88bf3bdb0a40 RSI: 0000000000000000 RDI: 00000001547f175c
> [ 157.343795] RBP: ffff88bf7fae32c0 R08: 000000000001e358 R09: ffff88810fb9f000
> [ 157.353634] R10: ffffc9000c5bbe08 R11: ffff88810fb9f5c4 R12: 0000000000000000
> [ 157.363506] R13: ffff88bf4e3ea0c0 R14: 0000000000000000 R15: ffff88bf4e3ea7a8
> [ 157.373395] FS: 0000000000000000(0000) GS:ffff88bf7f5c0000(0000) knlGS:0000000000000000
> [ 157.384238] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 157.392709] CR2: 0000000000000058 CR3: 000000000220e005 CR4: 00000000003606e0
> [ 157.402601] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 157.412488] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 157.422334] Kernel panic - not syncing: Attempted to kill the idle task!
> [ 158.529804] Shutting down cpus with NMI
> [ 158.573249] Kernel Offset: disabled
> [ 158.586198] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
>
> RIP translates to kernel/sched/fair.c:6819
>
> static int
> wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> {
>         s64 gran, vdiff = curr->vruntime - se->vruntime; /* LINE 6819 */
> 
>         if (vdiff <= 0)
>                 return -1;
> 
>         gran = wakeup_gran(se);
>         if (vdiff > gran)
>                 return 1;
> 
>         return 0;
> }
>
> I haven't tried debugging it yet.
>
I think the following fix, while trivial, is the right fix for the NULL
dereference in this case. This bug is reproducible with patch 14. I also did
some performance bisecting; with patch 14 performance is decimated, which is
expected. Most of the performance recovery happens in patch 15 which,
unfortunately, is also the one that introduces the hard lockup.

-------8<-----------

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d0dac4..ecadf36 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4131,7 +4131,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
         * Avoid running the skip buddy, if running something else can
         * be done without getting too unfair.
         */
-       if (cfs_rq->skip == se) {
+       if (cfs_rq->skip && cfs_rq->skip == se) {
                struct sched_entity *second;

                if (se == curr) {
@@ -4149,13 +4149,15 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
        /*
         * Prefer last buddy, try to return the CPU to a preempted task.
         */
-       if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
+       if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left)
+           < 1)
                se = cfs_rq->last;

        /*
         * Someone really wants this to run. If it's not unfair, run it.
         */
-       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+       if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left)
+           < 1)
                se = cfs_rq->next;

        clear_buddies(cfs_rq, se);


2019-03-11 05:11:19

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
<[email protected]> wrote:
>
> expected. Most of the performance recovery happens in patch 15 which,
> unfortunately, is also the one that introduces the hard lockup.
>

After applying Subhra's patch, the following is triggered by enabling
core sched when a cgroup is under heavy load.

Mar 10 22:46:57 aubrey-ivb kernel: [ 2662.973792] core sched enabled
[ 2663.348371] WARNING: CPU: 5 PID: 3087 at kernel/sched/pelt.h:119
update_load_avg+00
[ 2663.357960] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_ni
[ 2663.443269] CPU: 5 PID: 3087 Comm: schbench Tainted: G I
5.0.0-rc8-7
[ 2663.454520] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.2
[ 2663.466063] RIP: 0010:update_load_avg+0x52/0x5e0
[ 2663.471286] Code: 8b af 70 01 00 00 8b 3d 14 a6 6e 01 85 ff 74 1c
e9 4c 04 00 00 40
[ 2663.492350] RSP: 0000:ffffc9000a6a3dd8 EFLAGS: 00010046
[ 2663.498276] RAX: 0000000000000000 RBX: ffff888be7937600 RCX: 0000000000000001
[ 2663.506337] RDX: 0000000000000000 RSI: ffff888c09fe4418 RDI: 0000000000000046
[ 2663.514398] RBP: ffff888bdfb8aac0 R08: 0000000000000000 R09: ffff888bdfb9aad8
[ 2663.522459] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 2663.530520] R13: ffff888c09fe4400 R14: 0000000000000001 R15: ffff888bdfb8aa40
[ 2663.538582] FS: 00007f006a7cc700(0000) GS:ffff888c0a600000(0000)
knlGS:00000000000
[ 2663.547739] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2663.554241] CR2: 0000000000604048 CR3: 0000000bfdd64006 CR4: 00000000000606e0
[ 2663.562310] Call Trace:
[ 2663.565128] ? update_load_avg+0xa6/0x5e0
[ 2663.569690] ? update_load_avg+0xa6/0x5e0
[ 2663.574252] set_next_entity+0xd9/0x240
[ 2663.578619] set_next_task_fair+0x6e/0xa0
[ 2663.583182] __schedule+0x12af/0x1570
[ 2663.587350] schedule+0x28/0x70
[ 2663.590937] exit_to_usermode_loop+0x61/0xf0
[ 2663.595791] prepare_exit_to_usermode+0xbf/0xd0
[ 2663.600936] retint_user+0x8/0x18
[ 2663.604719] RIP: 0033:0x402057
[ 2663.608209] Code: 24 10 64 48 8b 04 25 28 00 00 00 48 89 44 24 38
31 c0 e8 2c eb ff
[ 2663.629351] RSP: 002b:00007f006a7cbe50 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff02
[ 2663.637924] RAX: 000000000029778f RBX: 00000000002dc6c0 RCX: 0000000000000002
[ 2663.645985] RDX: 00007f006a7cbe60 RSI: 0000000000000000 RDI: 00007f006a7cbe50
[ 2663.654046] RBP: 0000000000000006 R08: 0000000000000001 R09: 00007ffe965450a0
[ 2663.662108] R10: 00007f006a7cbe30 R11: 000000000003b368 R12: 00007f006a7cbed0
[ 2663.670160] R13: 00007f0098c1ce6f R14: 0000000000000000 R15: 00007f0084a30390
[ 2663.678226] irq event stamp: 27182
[ 2663.682114] hardirqs last enabled at (27181): [<ffffffff81003f70>]
exit_to_usermo0
[ 2663.692348] hardirqs last disabled at (27182): [<ffffffff81a0affc>]
__schedule+0xd0
[ 2663.701716] softirqs last enabled at (27004): [<ffffffff81e00359>]
__do_softirq+0a
[ 2663.711268] softirqs last disabled at (26999): [<ffffffff81095be1>]
irq_exit+0xc1/0
[ 2663.720247] ---[ end trace d46e59b84bcde977 ]---
[ 2663.725503] BUG: unable to handle kernel paging request at 00000000005df5f0
[ 2663.733377] #PF error: [WRITE]
[ 2663.736875] PGD 8000000bff037067 P4D 8000000bff037067 PUD bff0b1067
PMD bfbf02067 0
[ 2663.745954] Oops: 0002 [#1] SMP PTI
[ 2663.749931] CPU: 5 PID: 3078 Comm: schbench Tainted: G W I
5.0.0-rc8-7
[ 2663.761233] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.2
[ 2663.772836] RIP: 0010:native_queued_spin_lock_slowpath+0x183/0x1c0
[ 2663.779827] Code: f3 90 48 8b 32 48 85 f6 74 f6 eb e8 c1 ee 12 83
e0 03 83 ee 01 42
[ 2663.800970] RSP: 0000:ffffc9000a633e18 EFLAGS: 00010006
[ 2663.806892] RAX: 00000000005df5f0 RBX: ffff888bdfbf2a40 RCX: 0000000000180000
[ 2663.814954] RDX: ffff888c0a7e5180 RSI: 0000000000001fff RDI: ffff888bdfbf2a40
[ 2663.823015] RBP: ffff888bdfbf2a40 R08: 0000000000180000 R09: 0000000000000001
[ 2663.831068] R10: ffffc9000a633dc0 R11: ffff888bdfbf2a58 R12: 0000000000000046
[ 2663.839129] R13: ffff888bdfb8aa40 R14: ffff888be5b90d80 R15: ffff888be5b90d80
[ 2663.847182] FS: 00007f00797ea700(0000) GS:ffff888c0a600000(0000)
knlGS:00000000000
[ 2663.856330] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2663.862834] CR2: 00000000005df5f0 CR3: 0000000bfdd64006 CR4: 00000000000606e0
[ 2663.870895] Call Trace:
[ 2663.873715] do_raw_spin_lock+0xab/0xb0
[ 2663.878095] _raw_spin_lock_irqsave+0x63/0x80
[ 2663.883066] __balance_callback+0x19/0xa0
[ 2663.887626] __schedule+0x1113/0x1570
[ 2663.891803] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 2663.897142] ? apic_timer_interrupt+0xa/0x20
[ 2663.901996] ? interrupt_entry+0x9a/0xe0
[ 2663.906450] ? apic_timer_interrupt+0xa/0x20
[ 2663.911307] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_ni
[ 2663.996886] CR2: 00000000005df5f0
[ 2664.000686] ---[ end trace d46e59b84bcde978 ]---
[ 2664.011393] RIP: 0010:native_queued_spin_lock_slowpath+0x183/0x1c0
[ 2664.018386] Code: f3 90 48 8b 32 48 85 f6 74 f6 eb e8 c1 ee 12 83
e0 03 83 ee 01 42
[ 2664.039529] RSP: 0000:ffffc9000a633e18 EFLAGS: 00010006
[ 2664.045452] RAX: 00000000005df5f0 RBX: ffff888bdfbf2a40 RCX: 0000000000180000
[ 2664.053513] RDX: ffff888c0a7e5180 RSI: 0000000000001fff RDI: ffff888bdfbf2a40
[ 2664.061574] RBP: ffff888bdfbf2a40 R08: 0000000000180000 R09: 0000000000000001
[ 2664.069635] R10: ffffc9000a633dc0 R11: ffff888bdfbf2a58 R12: 0000000000000046
[ 2664.077688] R13: ffff888bdfb8aa40 R14: ffff888be5b90d80 R15: ffff888be5b90d80
[ 2664.085749] FS: 00007f00797ea700(0000) GS:ffff888c0a600000(0000)
knlGS:00000000000
[ 2664.094897] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2664.101402] CR2: 00000000005df5f0 CR3: 0000000bfdd64006 CR4: 00000000000606e0
[ 2664.109481]
[ 2664.109482] ======================================================
[ 2664.109483] WARNING: possible circular locking dependency detected
[ 2664.109483] 5.0.0-rc8-00542-gd697415be692-dirty #7 Tainted: G I
[ 2664.109484] ------------------------------------------------------
[ 2664.109485] schbench/3087 is trying to acquire lock:
[ 2664.109485] 000000007a0032d4 ((console_sem).lock){-.-.}, at:
down_trylock+0xf/0x30
[ 2664.109488]
[ 2664.109497] but task is already holding lock:
[ 2664.109497] 00000000efdef567 (&rq->__lock){-.-.}, at: __schedule+0xfa/0x1570
[ 2664.109507]
[ 2664.109508] which lock already depends on the new lock.
[ 2664.109509]
[ 2664.109509]
[ 2664.109510] the existing dependency chain (in reverse order) is:
[ 2664.109510]
[ 2664.109511] -> #2 (&rq->__lock){-.-.}:
[ 2664.109513] task_fork_fair+0x35/0x1c0
[ 2664.109513] sched_fork+0xf4/0x1f0
[ 2664.109514] copy_process.part.39+0x7ac/0x21f0
[ 2664.109515] _do_fork+0xf9/0x6a0
[ 2664.109515] kernel_thread+0x25/0x30
[ 2664.109516] rest_init+0x22/0x240
[ 2664.109517] start_kernel+0x49f/0x4bf
[ 2664.109517] secondary_startup_64+0xa4/0xb0
[ 2664.109518]
[ 2664.109518] -> #1 (&p->pi_lock){-.-.}:
[ 2664.109520] try_to_wake_up+0x3d/0x510
[ 2664.109521] up+0x40/0x60
[ 2664.109521] __up_console_sem+0x41/0x70
[ 2664.109522] console_unlock+0x32a/0x610
[ 2664.109522] vprintk_emit+0x14a/0x350
[ 2664.109523] dev_vprintk_emit+0x11d/0x230
[ 2664.109524] dev_printk_emit+0x4a/0x70
[ 2664.109524] _dev_info+0x64/0x80
[ 2664.109525] usb_new_device+0x105/0x490
[ 2664.109525] hub_event+0x81f/0x1730
[ 2664.109526] process_one_work+0x2a4/0x600
[ 2664.109527] worker_thread+0x2d/0x3d0
[ 2664.109527] kthread+0x116/0x130
[ 2664.109528] ret_from_fork+0x3a/0x50
[ 2664.109528]
[ 2664.109529] -> #0 ((console_sem).lock){-.-.}:
[ 2664.109531] _raw_spin_lock_irqsave+0x41/0x80
[ 2664.109531] down_trylock+0xf/0x30
[ 2664.109532] __down_trylock_console_sem+0x33/0xa0
[ 2664.109533] console_trylock+0x13/0x60
[ 2664.109533] vprintk_emit+0x13d/0x350
[ 2664.109534] printk+0x52/0x6e
[ 2664.109534] __warn+0x5f/0x110
[ 2664.109535] report_bug+0xa5/0x110
[ 2664.109536] fixup_bug.part.15+0x18/0x30
[ 2664.109536] do_error_trap+0xbb/0x100
[ 2664.109537] do_invalid_op+0x28/0x30
[ 2664.109537] invalid_op+0x14/0x20
[ 2664.109538] update_load_avg+0x52/0x5e0
[ 2664.109538] set_next_entity+0xd9/0x240
[ 2664.109539] set_next_task_fair+0x6e/0xa0
[ 2664.109540] __schedule+0x12af/0x1570
[ 2664.109540] schedule+0x28/0x70
[ 2664.109541] exit_to_usermode_loop+0x61/0xf0
[ 2664.109542] prepare_exit_to_usermode+0xbf/0xd0
[ 2664.109542] retint_user+0x8/0x18
[ 2664.109542]
[ 2664.109543] other info that might help us debug this:
[ 2664.109544]
[ 2664.109544] Chain exists of:
[ 2664.109544] (console_sem).lock --> &p->pi_lock --> &rq->__lock
[ 2664.109547]
[ 2664.109548] Possible unsafe locking scenario:
[ 2664.109548]
[ 2664.109549]        CPU0                    CPU1
[ 2664.109549]        ----                    ----
[ 2664.109550]   lock(&rq->__lock);
[ 2664.109551]                                lock(&p->pi_lock);
[ 2664.109553]                                lock(&rq->__lock);
[ 2664.109554]   lock((console_sem).lock);
[ 2664.109555]
[ 2664.109556] *** DEADLOCK ***
[ 2664.109556]
[ 2664.109557] 1 lock held by schbench/3087:
[ 2664.109557] #0: 00000000efdef567 (&rq->__lock){-.-.}, at:
__schedule+0xfa/0x1570
[ 2664.109560]
[ 2664.109560] stack backtrace:
[ 2664.109561] CPU: 5 PID: 3087 Comm: schbench Tainted: G I
5.0.0-rc8-7
[ 2664.109562] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.2
[ 2664.109563] Call Trace:
[ 2664.109563] dump_stack+0x85/0xcb
[ 2664.109564] print_circular_bug.isra.37+0x1d7/0x1e4
[ 2664.109565] __lock_acquire+0x139c/0x1430
[ 2664.109565] ? lock_acquire+0x9e/0x180
[ 2664.109566] lock_acquire+0x9e/0x180
[ 2664.109566] ? down_trylock+0xf/0x30
[ 2664.109567] _raw_spin_lock_irqsave+0x41/0x80
[ 2664.109567] ? down_trylock+0xf/0x30
[ 2664.109568] ? vprintk_emit+0x13d/0x350
[ 2664.109569] down_trylock+0xf/0x30
[ 2664.109569] __down_trylock_console_sem+0x33/0xa0
[ 2664.109570] console_trylock+0x13/0x60
[ 2664.109571] vprintk_emit+0x13d/0x350
[ 2664.109571] ? update_load_avg+0x52/0x5e0
[ 2664.109572] printk+0x52/0x6e
[ 2664.109573] ? update_load_avg+0x52/0x5e0
[ 2664.109573] __warn+0x5f/0x110
[ 2664.109574] ? update_load_avg+0x52/0x5e0
[ 2664.109575] ? update_load_avg+0x52/0x5e0
[ 2664.109575] report_bug+0xa5/0x110
[ 2664.109576] fixup_bug.part.15+0x18/0x30
[ 2664.109576] do_error_trap+0xbb/0x100
[ 2664.109577] do_invalid_op+0x28/0x30
[ 2664.109578] ? update_load_avg+0x52/0x5e0
[ 2664.109578] invalid_op+0x14/0x20
[ 2664.109579] RIP: 0010:update_load_avg+0x52/0x5e0
[ 2664.109580] Code: 8b af 70 01 00 00 8b 3d 14 a6 6e 01 85 ff 74 1c
e9 4c 04 00 00 40
[ 2664.109581] RSP: 0000:ffffc9000a6a3dd8 EFLAGS: 00010046
[ 2664.109582] RAX: 0000000000000000 RBX: ffff888be7937600 RCX: 0000000000000001
[ 2664.109582] RDX: 0000000000000000 RSI: ffff888c09fe4418 RDI: 0000000000000046
[ 2664.109583] RBP: ffff888bdfb8aac0 R08: 0000000000000000 R09: ffff888bdfb9aad8
[ 2664.109584] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 2664.109585] R13: ffff888c09fe4400 R14: 0000000000000001 R15: ffff888bdfb8aa40
[ 2664.109585] ? update_load_avg+0x4e/0x5e0
[ 2664.109586] ? update_load_avg+0xa6/0x5e0
[ 2664.109586] ? update_load_avg+0xa6/0x5e0
[ 2664.109587] set_next_entity+0xd9/0x240
[ 2664.109588] set_next_task_fair+0x6e/0xa0
[ 2664.109588] __schedule+0x12af/0x1570
[ 2664.109589] schedule+0x28/0x70
[ 2664.109589] exit_to_usermode_loop+0x61/0xf0
[ 2664.109590] prepare_exit_to_usermode+0xbf/0xd0
[ 2664.109590] retint_user+0x8/0x18
[ 2664.109591] RIP: 0033:0x402057
[ 2664.109592] Code: 24 10 64 48 8b 04 25 28 00 00 00 48 89 44 24 38
31 c0 e8 2c eb ff
[ 2664.109593] RSP: 002b:00007f006a7cbe50 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff02
[ 2664.109594] RAX: 000000000029778f RBX: 00000000002dc6c0 RCX: 0000000000000002
[ 2664.109595] RDX: 00007f006a7cbe60 RSI: 0000000000000000 RDI: 00007f006a7cbe50
[ 2664.109596] RBP: 0000000000000006 R08: 0000000000000001 R09: 00007ffe965450a0
[ 2664.109596] R10: 00007f006a7cbe30 R11: 000000000003b368 R12: 00007f006a7cbed0
[ 2664.109597] R13: 00007f0098c1ce6f R14: 0000000000000000 R15: 00007f0084a30390

2019-03-11 18:38:03

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling


On 3/10/19 9:23 PM, Aubrey Li wrote:
> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> <[email protected]> wrote:
>> expected. Most of the performance recovery happens in patch 15 which,
>> unfortunately, is also the one that introduces the hard lockup.
>>
> After applied Subhra's patch, the following is triggered by enabling
> core sched when a cgroup is
> under heavy load.
>
It seems you are facing some other deadlock where printk is involved. Can you
drop the last patch (patch 16 sched: Debug bits...) and try?

Thanks,
Subhra


2019-03-11 23:37:45

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling


On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
>
> On 3/10/19 9:23 PM, Aubrey Li wrote:
>> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
>> <[email protected]> wrote:
>>> expected. Most of the performance recovery happens in patch 15 which,
>>> unfortunately, is also the one that introduces the hard lockup.
>>>
>> After applied Subhra's patch, the following is triggered by enabling
>> core sched when a cgroup is
>> under heavy load.
>>
> It seems you are facing some other deadlock where printk is involved.
> Can you
> drop the last patch (patch 16 sched: Debug bits...) and try?
>
> Thanks,
> Subhra
>
Never mind, I am seeing the same lockdep deadlock output even w/o patch 16.
Btw the NULL fix had something missing; the following works.

--------->8------------

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d0dac4..27cbc64 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4131,7 +4131,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
         * Avoid running the skip buddy, if running something else can
         * be done without getting too unfair.
         */
-       if (cfs_rq->skip == se) {
+       if (cfs_rq->skip && cfs_rq->skip == se) {
                struct sched_entity *second;

                if (se == curr) {
@@ -4149,13 +4149,15 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
        /*
         * Prefer last buddy, try to return the CPU to a preempted task.
         */
-       if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
+       if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left)
+           < 1)
                se = cfs_rq->last;

        /*
         * Someone really wants this to run. If it's not unfair, run it.
         */
-       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+       if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left)
+           < 1)
                se = cfs_rq->next;

        clear_buddies(cfs_rq, se);
@@ -6958,6 +6960,9 @@ pick_task_fair(struct rq *rq)

                se = pick_next_entity(cfs_rq, NULL);

+               if (!(se || curr))
+                       return NULL;
+
                if (curr) {
                        if (se && curr->on_rq)
                                update_curr(cfs_rq);

2019-03-12 00:21:17

by Greg Kerr

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Mon, Mar 11, 2019 at 4:36 PM Subhra Mazumdar
<[email protected]> wrote:
>
>
> On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> >
> > On 3/10/19 9:23 PM, Aubrey Li wrote:
> >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> >> <[email protected]> wrote:
> >>> expected. Most of the performance recovery happens in patch 15 which,
> >>> unfortunately, is also the one that introduces the hard lockup.
> >>>
> >> After applied Subhra's patch, the following is triggered by enabling
> >> core sched when a cgroup is
> >> under heavy load.
> >>
> > It seems you are facing some other deadlock where printk is involved.
> > Can you
> > drop the last patch (patch 16 sched: Debug bits...) and try?
> >
> > Thanks,
> > Subhra
> >
> Never Mind, I am seeing the same lockdep deadlock output even w/o patch
> 16. Btw
> the NULL fix had something missing, following works.

Is this panic below, which occurs when I tag the first process,
related or known? If not, I will debug it tomorrow.

[ 46.831828] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000000
[ 46.831829] core sched enabled
[ 46.834261] #PF error: [WRITE]
[ 46.834899] PGD 0 P4D 0
[ 46.835438] Oops: 0002 [#1] SMP PTI
[ 46.836158] CPU: 0 PID: 11 Comm: migration/0 Not tainted
5.0.0everyday-glory-03949-g2d8fdbb66245-dirty #7
[ 46.838206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1 04/01/2014
[ 46.839844] RIP: 0010:_raw_spin_lock+0x7/0x20
[ 46.840448] Code: 00 00 00 65 81 05 25 ca 5c 51 00 02 00 00 31 c0
ba ff 00 00 00 f0 0f b1 17 74 05 e9 93 80 46 ff f3 c3 90 31 c0 ba 01
00 00 00 <f0> 0f b1 17 74 07 89 c6 e9 1c 6e 46 ff f3 c3 66 2e 0f 1f 84
00 00
[ 46.843000] RSP: 0018:ffffb9d300cabe38 EFLAGS: 00010046
[ 46.843744] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
[ 46.844709] RDX: 0000000000000001 RSI: ffffffffaea435ae RDI: 0000000000000000
[ 46.845689] RBP: ffffb9d300cabed8 R08: 0000000000000000 R09: 0000000000020800
[ 46.846651] R10: ffffffffaf603ea0 R11: 0000000000000001 R12: ffffffffaf6576c0
[ 46.847619] R13: ffff9a57366c8000 R14: ffff9a5737401300 R15: ffffffffade868f0
[ 46.848584] FS: 0000000000000000(0000) GS:ffff9a5737a00000(0000)
knlGS:0000000000000000
[ 46.849680] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 46.850455] CR2: 0000000000000000 CR3: 00000001d36fa000 CR4: 00000000000006f0
[ 46.851415] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 46.852371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 46.853326] Call Trace:
[ 46.853678] __schedule+0x139/0x11f0
[ 46.854167] ? cpumask_next+0x16/0x20
[ 46.854668] ? cpu_stop_queue_work+0xc0/0xc0
[ 46.855252] ? sort_range+0x20/0x20
[ 46.855742] schedule+0x4e/0x60
[ 46.856171] smpboot_thread_fn+0x12a/0x160
[ 46.856725] kthread+0x112/0x120
[ 46.857164] ? kthread_stop+0xf0/0xf0
[ 46.857661] ret_from_fork+0x35/0x40
[ 46.858146] Modules linked in:
[ 46.858562] CR2: 0000000000000000
[ 46.859022] ---[ end trace e9fff08f17bfd2be ]---

- Greg

>
> --------->8------------
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d0dac4..27cbc64 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4131,7 +4131,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  	 * Avoid running the skip buddy, if running something else can
>  	 * be done without getting too unfair.
>  	 */
> -	if (cfs_rq->skip == se) {
> +	if (cfs_rq->skip && cfs_rq->skip == se) {
>  		struct sched_entity *second;
> 
>  		if (se == curr) {
> @@ -4149,13 +4149,15 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  	/*
>  	 * Prefer last buddy, try to return the CPU to a preempted task.
>  	 */
> -	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
> +	if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left)
> +	    < 1)
>  		se = cfs_rq->last;
> 
>  	/*
>  	 * Someone really wants this to run. If it's not unfair, run it.
>  	 */
> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
> +	if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left)
> +	    < 1)
>  		se = cfs_rq->next;
> 
>  	clear_buddies(cfs_rq, se);
> @@ -6958,6 +6960,9 @@ pick_task_fair(struct rq *rq)
> 
>  		se = pick_next_entity(cfs_rq, NULL);
> 
> +		if (!(se || curr))
> +			return NULL;
> +
>  		if (curr) {
>  			if (se && curr->on_rq)
>  				update_curr(cfs_rq);
> 

2019-03-12 00:53:11

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling


On 3/11/19 5:20 PM, Greg Kerr wrote:
> On Mon, Mar 11, 2019 at 4:36 PM Subhra Mazumdar
> <[email protected]> wrote:
>>
>> On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
>>> On 3/10/19 9:23 PM, Aubrey Li wrote:
>>>> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
>>>> <[email protected]> wrote:
>>>>> expected. Most of the performance recovery happens in patch 15 which,
>>>>> unfortunately, is also the one that introduces the hard lockup.
>>>>>
>>>> After applied Subhra's patch, the following is triggered by enabling
>>>> core sched when a cgroup is
>>>> under heavy load.
>>>>
>>> It seems you are facing some other deadlock where printk is involved.
>>> Can you
>>> drop the last patch (patch 16 sched: Debug bits...) and try?
>>>
>>> Thanks,
>>> Subhra
>>>
>> Never Mind, I am seeing the same lockdep deadlock output even w/o patch
>> 16. Btw
>> the NULL fix had something missing, following works.
> Is this panic below, which occurs when I tag the first process,
> related or known? If not, I will debug it tomorrow.
>
> [ 46.831828] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000000
> [ 46.831829] core sched enabled
> [ 46.834261] #PF error: [WRITE]
> [ 46.834899] PGD 0 P4D 0
> [ 46.835438] Oops: 0002 [#1] SMP PTI
> [ 46.836158] CPU: 0 PID: 11 Comm: migration/0 Not tainted
> 5.0.0everyday-glory-03949-g2d8fdbb66245-dirty #7
> [ 46.838206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.10.2-1 04/01/2014
> [ 46.839844] RIP: 0010:_raw_spin_lock+0x7/0x20
> [ 46.840448] Code: 00 00 00 65 81 05 25 ca 5c 51 00 02 00 00 31 c0
> ba ff 00 00 00 f0 0f b1 17 74 05 e9 93 80 46 ff f3 c3 90 31 c0 ba 01
> 00 00 00 <f0> 0f b1 17 74 07 89 c6 e9 1c 6e 46 ff f3 c3 66 2e 0f 1f 84
> 00 00
> [ 46.843000] RSP: 0018:ffffb9d300cabe38 EFLAGS: 00010046
> [ 46.843744] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
> [ 46.844709] RDX: 0000000000000001 RSI: ffffffffaea435ae RDI: 0000000000000000
> [ 46.845689] RBP: ffffb9d300cabed8 R08: 0000000000000000 R09: 0000000000020800
> [ 46.846651] R10: ffffffffaf603ea0 R11: 0000000000000001 R12: ffffffffaf6576c0
> [ 46.847619] R13: ffff9a57366c8000 R14: ffff9a5737401300 R15: ffffffffade868f0
> [ 46.848584] FS: 0000000000000000(0000) GS:ffff9a5737a00000(0000)
> knlGS:0000000000000000
> [ 46.849680] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 46.850455] CR2: 0000000000000000 CR3: 00000001d36fa000 CR4: 00000000000006f0
> [ 46.851415] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 46.852371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 46.853326] Call Trace:
> [ 46.853678] __schedule+0x139/0x11f0
> [ 46.854167] ? cpumask_next+0x16/0x20
> [ 46.854668] ? cpu_stop_queue_work+0xc0/0xc0
> [ 46.855252] ? sort_range+0x20/0x20
> [ 46.855742] schedule+0x4e/0x60
> [ 46.856171] smpboot_thread_fn+0x12a/0x160
> [ 46.856725] kthread+0x112/0x120
> [ 46.857164] ? kthread_stop+0xf0/0xf0
> [ 46.857661] ret_from_fork+0x35/0x40
> [ 46.858146] Modules linked in:
> [ 46.858562] CR2: 0000000000000000
> [ 46.859022] ---[ end trace e9fff08f17bfd2be ]---
>
> - Greg
>
This seems to be different.

2019-03-12 07:35:02

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Mon, Mar 11, 2019 at 05:20:19PM -0700, Greg Kerr wrote:
> On Mon, Mar 11, 2019 at 4:36 PM Subhra Mazumdar
> <[email protected]> wrote:
> >
> >
> > On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> > >
> > > On 3/10/19 9:23 PM, Aubrey Li wrote:
> > >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> > >> <[email protected]> wrote:
> > >>> expected. Most of the performance recovery happens in patch 15 which,
> > >>> unfortunately, is also the one that introduces the hard lockup.
> > >>>
> > >> After applied Subhra's patch, the following is triggered by enabling
> > >> core sched when a cgroup is
> > >> under heavy load.
> > >>
> > > It seems you are facing some other deadlock where printk is involved.
> > > Can you
> > > drop the last patch (patch 16 sched: Debug bits...) and try?
> > >
> > > Thanks,
> > > Subhra
> > >
> > Never Mind, I am seeing the same lockdep deadlock output even w/o patch
> > 16. Btw
> > the NULL fix had something missing, following works.
>
> Is this panic below, which occurs when I tag the first process,
> related or known? If not, I will debug it tomorrow.
>
> [ 46.831828] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000000
> [ 46.831829] core sched enabled
> [ 46.834261] #PF error: [WRITE]
> [ 46.834899] PGD 0 P4D 0
> [ 46.835438] Oops: 0002 [#1] SMP PTI
> [ 46.836158] CPU: 0 PID: 11 Comm: migration/0 Not tainted
> 5.0.0everyday-glory-03949-g2d8fdbb66245-dirty #7
> [ 46.838206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.10.2-1 04/01/2014

Probably due to SMT not being enabled in this qemu setup.

rq->core can be NULL for CPU0: sched_cpu_starting() won't be called for
CPU0 and, since it doesn't have any siblings, its rq->core remains
uninitialized (NULL).
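
One way to make that concrete is to give every runqueue a self-referencing
core pointer before any siblings are wired up. This is only a sketch, not
code from the series: rq->core is the field added by the patch set, the rest
are plain kernel helpers.

	/*
	 * Sketch: default every rq to being its own core rq, so a CPU that
	 * never goes through sched_cpu_starting() (e.g. the boot CPU in an
	 * SMT-less qemu guest) still has a valid rq->core.
	 */
	static void __init sched_core_set_defaults(void)
	{
		int cpu;

		for_each_possible_cpu(cpu) {
			struct rq *rq = cpu_rq(cpu);

			if (!rq->core)
				rq->core = rq;	/* no sibling registered: core == self */
		}
	}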

> [ 46.839844] RIP: 0010:_raw_spin_lock+0x7/0x20
> [ 46.840448] Code: 00 00 00 65 81 05 25 ca 5c 51 00 02 00 00 31 c0
> ba ff 00 00 00 f0 0f b1 17 74 05 e9 93 80 46 ff f3 c3 90 31 c0 ba 01
> 00 00 00 <f0> 0f b1 17 74 07 89 c6 e9 1c 6e 46 ff f3 c3 66 2e 0f 1f 84
> 00 00
> [ 46.843000] RSP: 0018:ffffb9d300cabe38 EFLAGS: 00010046
> [ 46.843744] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
> [ 46.844709] RDX: 0000000000000001 RSI: ffffffffaea435ae RDI: 0000000000000000
> [ 46.845689] RBP: ffffb9d300cabed8 R08: 0000000000000000 R09: 0000000000020800
> [ 46.846651] R10: ffffffffaf603ea0 R11: 0000000000000001 R12: ffffffffaf6576c0
> [ 46.847619] R13: ffff9a57366c8000 R14: ffff9a5737401300 R15: ffffffffade868f0
> [ 46.848584] FS: 0000000000000000(0000) GS:ffff9a5737a00000(0000)
> knlGS:0000000000000000
> [ 46.849680] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 46.850455] CR2: 0000000000000000 CR3: 00000001d36fa000 CR4: 00000000000006f0
> [ 46.851415] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 46.852371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 46.853326] Call Trace:
> [ 46.853678] __schedule+0x139/0x11f0
> [ 46.854167] ? cpumask_next+0x16/0x20
> [ 46.854668] ? cpu_stop_queue_work+0xc0/0xc0
> [ 46.855252] ? sort_range+0x20/0x20
> [ 46.855742] schedule+0x4e/0x60
> [ 46.856171] smpboot_thread_fn+0x12a/0x160
> [ 46.856725] kthread+0x112/0x120
> [ 46.857164] ? kthread_stop+0xf0/0xf0
> [ 46.857661] ret_from_fork+0x35/0x40
> [ 46.858146] Modules linked in:
> [ 46.858562] CR2: 0000000000000000
> [ 46.859022] ---[ end trace e9fff08f17bfd2be ]---

2019-03-12 07:46:38

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Tue, Mar 12, 2019 at 7:36 AM Subhra Mazumdar
<[email protected]> wrote:
>
>
> On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> >
> > On 3/10/19 9:23 PM, Aubrey Li wrote:
> >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> >> <[email protected]> wrote:
> >>> expected. Most of the performance recovery happens in patch 15 which,
> >>> unfortunately, is also the one that introduces the hard lockup.
> >>>
> >> After applied Subhra's patch, the following is triggered by enabling
> >> core sched when a cgroup is
> >> under heavy load.
> >>
> > It seems you are facing some other deadlock where printk is involved.
> > Can you
> > drop the last patch (patch 16 sched: Debug bits...) and try?
> >
> > Thanks,
> > Subhra
> >
> Never Mind, I am seeing the same lockdep deadlock output even w/o patch
> 16. Btw
> the NULL fix had something missing,

One more NULL pointer dereference:

Mar 12 02:24:46 aubrey-ivb kernel: [ 201.916741] core sched enabled
[ 201.950203] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000008
[ 201.950254] ------------[ cut here ]------------
[ 201.959045] #PF error: [normal kernel read fault]
[ 201.964272] !se->on_rq
[ 201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849
set_next_buddy+0x52/0x70
[ 201.969596] PGD 8000000be9ed7067 P4D 8000000be9ed7067 PUD c00911067 PMD 0
[ 201.972300] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[ 201.981712] Oops: 0000 [#1] SMP PTI
[ 201.989463] CPU: 22 PID: 2965 Comm: schbench Tainted: G I
5.0.0-rc8-00542-gd697415be692-dirty #13
[ 202.074710] CPU: 27 PID: 2947 Comm: schbench Tainted: G I
5.0.0-rc8-00542-gd697415be692-dirty #13
[ 202.078662] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[ 202.078674] RIP: 0010:set_next_buddy+0x52/0x70
[ 202.090135] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[ 202.090144] RIP: 0010:rb_insert_color+0x17/0x190
[ 202.101623] Code: 48 85 ff 74 10 8b 47 40 85 c0 75 e2 80 3d 9e e5
6a 01 00 74 02 f3 c3 48 c7 c7 5c 05 2c 82 c6 05 8c e5 6a 01 01 e8 2e
bb fb ff <0f> 0b c3 83 bf 04 03 0e
[ 202.113216] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 17 48 85 d2 0f 84 4d 01 00 00 48 8b 02 a8 01 0f 85 6d
01 00 00 <48> 8b 48 08 49 89 c0 44
[ 202.118263] RSP: 0018:ffffc9000a5cbbb0 EFLAGS: 00010086
[ 202.129858] RSP: 0018:ffffc9000a463cc0 EFLAGS: 00010046
[ 202.135102] RAX: 0000000000000000 RBX: ffff88980047e800 RCX: 0000000000000000
[ 202.135105] RDX: ffff888be28caa40 RSI: 0000000000000001 RDI: ffffffff8110c3fa
[ 202.156251] RAX: 0000000000000000 RBX: ffff888bfeb80000 RCX: ffff888bfeb80000
[ 202.156255] RDX: ffff888be28c8348 RSI: ffff88980b5e50c8 RDI: ffff888bfeb80348
[ 202.177390] RBP: ffff88980047ea00 R08: 0000000000000000 R09: 00000000001e3a80
[ 202.177393] R10: ffffc9000a5cbb28 R11: 0000000000000000 R12: ffff888c0b9e4400
[ 202.183317] RBP: ffff88980b5e4400 R08: 000000000000014f R09: ffff8898049cf000
[ 202.183320] R10: 0000000000000078 R11: ffff8898049cfc5c R12: 0000000000000004
[ 202.189241] R13: ffff888be28caa40 R14: 0000000000000009 R15: 0000000000000009
[ 202.189245] FS: 00007f05f87f8700(0000) GS:ffff888c0b800000(0000)
knlGS:0000000000000000
[ 202.197310] R13: ffffc9000a463d20 R14: 0000000000000246 R15: 000000000000001c
[ 202.197314] FS: 00007f0611cca700(0000) GS:ffff88980b200000(0000)
knlGS:0000000000000000
[ 202.205373] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 202.205377] CR2: 00007f05e9fdb728 CR3: 0000000be4d0e006 CR4: 00000000000606e0
[ 202.213441] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 202.213444] CR2: 0000000000000008 CR3: 0000000be4d0e005 CR4: 00000000000606e0
[ 202.221509] Call Trace:
[ 202.229574] Call Trace:
[ 202.237640] dequeue_task_fair+0x7e/0x1b0
[ 202.245700] enqueue_task+0x6f/0xb0
[ 202.253761] __schedule+0xcc8/0x1570
[ 202.261823] ttwu_do_activate+0x6a/0xc0
[ 202.270985] schedule+0x28/0x70
[ 202.279042] try_to_wake_up+0x20b/0x510
[ 202.288206] futex_wait_queue_me+0xbf/0x130
[ 202.294714] wake_up_q+0x3f/0x80
[ 202.302773] futex_wait+0xeb/0x240
[ 202.309282] futex_wake+0x157/0x180
[ 202.317353] ? __switch_to_asm+0x40/0x70
[ 202.320158] do_futex+0x451/0xad0
[ 202.322970] ? __switch_to_asm+0x34/0x70
[ 202.322980] ? __switch_to_asm+0x40/0x70
[ 202.327541] ? do_nanosleep+0xcc/0x1a0
[ 202.331521] do_futex+0x479/0xad0
[ 202.335599] ? hrtimer_nanosleep+0xe7/0x230
[ 202.339954] ? lockdep_hardirqs_on+0xf0/0x180
[ 202.343548] __x64_sys_futex+0x134/0x180
[ 202.347906] ? _raw_spin_unlock_irq+0x29/0x40
[ 202.352660] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 202.356343] ? finish_task_switch+0x9a/0x2c0
[ 202.360228] do_syscall_64+0x60/0x1b0
[ 202.364197] ? __schedule+0xbcd/0x1570
[ 202.368663] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 202.372448] __x64_sys_futex+0x134/0x180
[ 202.376913] RIP: 0033:0x7f06129e14d9
[ 202.381380] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 202.385650] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40
00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 08
[ 202.389436] do_syscall_64+0x60/0x1b0
[ 202.394190] RSP: 002b:00007f0611cc9e88 EFLAGS: 00000246 ORIG_RAX:
00000000000000ca
[ 202.399147] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 202.403612] RAX: ffffffffffffffda RBX: 00007f0604a30390 RCX: 00007f06129e14d9
[ 202.403614] RDX: 0000000000000001 RSI: 0000000000000081 RDI: 00007f060461d2a0
[ 202.408565] RIP: 0033:0x7f06129e14d9
[ 202.413905] RBP: 00007f0604a30390 R08: 0000000000000000 R09: 0000000000000000
[ 202.413908] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000f4240
[ 202.418760] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40
00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 08
[ 202.418763] RSP: 002b:00007f05f87f7e68 EFLAGS: 00000246 ORIG_RAX:
00000000000000ca
[ 202.422937] R13: 00007f06125d0c88 R14: 0000000000000010 R15: 00007f06125d0c58
[ 202.422945] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[ 202.427209] RAX: ffffffffffffffda RBX: 00007f060820a180 RCX: 00007f06129e14d9
[ 202.427212] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f060820a180
[ 202.432944] CR2: 0000000000000008
[ 202.437416] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[ 202.437419] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f05f87f7ed0
[ 202.441506] ---[ end trace 1b953fe9220b3d88 ]---
[ 202.441508] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000008
[ 202.441510] #PF error: [normal kernel read fault]
[ 202.441511] PGD 0 P4D 0
[ 202.441514] Oops: 0000 [#2] SMP PTI
[ 202.441516] CPU: 24 PID: 0 Comm: swapper/24 Tainted: G D I
5.0.0-rc8-00542-gd697415be692-dirty #13
[ 202.441517] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[ 202.441521] RIP: 0010:rb_insert_color+0x17/0x190
[ 202.441522] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 17 48 85 d2 0f 84 4d 01 00 00 48 8b 02 a8 01 0f 85 6d
01 00 00 <48> 8b 48 08 49 89 c0 44
[ 202.441523] RSP: 0018:ffff88980ac03e68 EFLAGS: 00010046
[ 202.441525] RAX: 0000000000000000 RBX: ffff888bfddf5480 RCX: ffff888bfddf5480
[ 202.441526] RDX: ffff888bfeb857c8 RSI: ffff88980ade50c8 RDI: ffff888bfddf57c8
[ 202.441527] RBP: ffff88980ade4400 R08: 0000000000000077 R09: 0000000eb31aac68
[ 202.441528] R10: 0000000000000078 R11: ffff889809de4418 R12: 0000000000000000
[ 202.441529] R13: ffff88980ac03ec8 R14: 0000000000000046 R15: 0000000000000018
[ 202.441531] FS: 0000000000000000(0000) GS:ffff88980ac00000(0000)
knlGS:0000000000000000
[ 202.441532] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 202.441533] CR2: 0000000000000008 CR3: 0000000002616004 CR4: 00000000000606e0
[ 202.441534] Call Trace:
[ 202.441536] <IRQ>
[ 202.441538] enqueue_task+0x6f/0xb0
[ 202.441541] ttwu_do_activate+0x6a/0xc0
[ 202.441544] try_to_wake_up+0x20b/0x510
[ 202.441549] hrtimer_wakeup+0x1e/0x30
[ 202.441551] __hrtimer_run_queues+0x117/0x3d0
[ 202.441553] ? __hrtimer_init+0xb0/0xb0
[ 202.441557] hrtimer_interrupt+0xe5/0x240
[ 202.441563] smp_apic_timer_interrupt+0x81/0x1f0
[ 202.441565] apic_timer_interrupt+0xf/0x20
[ 202.441567] </IRQ>
[ 202.441573] RIP: 0010:cpuidle_enter_state+0xbb/0x440
[ 202.441574] Code: 0f 8b ff 80 7c 24 07 00 74 17 9c 58 66 66 90 66
90 f6 c4 02 0f 85 59 03 00 00 31 ff e8 5e 2b 92 ff e8 c9 0a 99 ff fb
66 66 90 <66> 66 90 45 85 ed 0f 85
[ 202.441575] RSP: 0018:ffffc90006403ea0 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff13
[ 202.441577] RAX: 0000000000000000 RBX: ffffffff82700480 RCX: 000000000000001f
[ 202.441578] RDX: 0000002f08578463 RSI: 000000002f858b0c RDI: ffffffff8181bca7
[ 202.441579] RBP: ffffe8fffae03000 R08: 0000000000000002 R09: 00000000001e3a80
[ 202.441580] R10: ffffc90006403e80 R11: 0000000000000000 R12: 0000002f08578463
[ 202.441581] R13: 0000000000000005 R14: 0000000000000005 R15: 0000002f05150b84
[ 202.441585] ? cpuidle_enter_state+0xb7/0x440
[ 202.441590] do_idle+0x20f/0x2a0
[ 202.441594] cpu_startup_entry+0x19/0x20
[ 202.441599] start_secondary+0x17f/0x1d0
[ 202.441602] secondary_startup_64+0xa4/0xb0
[ 202.441608] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[ 202.441637] CR2: 0000000000000008
[ 202.441684] ---[ end trace 1b953fe9220b3d89 ]---
[ 202.441686] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000008
[ 202.441689] #PF error: [normal kernel read fault]
[ 202.441690] PGD 8000000be9ed7067 P4D 8000000be9ed7067 PUD c00911067 PMD 0
[ 202.443007] Oops: 0000 [#3] SMP PTI
[ 202.443010] CPU: 0 PID: 3006 Comm: schbench Tainted: G D I
5.0.0-rc8-00542-gd697415be692-dirty #13
[ 202.443012] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[ 202.443016] RIP: 0010:rb_insert_color+0x17/0x190
[ 202.443018] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 17 48 85 d2 0f 84 4d 01 00 00 48 8b 02 a8 01 0f 85 6d
01 00 00 <48> 8b 48 08 49 89 c0 44
[ 202.443020] RSP: 0000:ffff888c09c03e68 EFLAGS: 00010046
[ 202.443022] RAX: 0000000000000000 RBX: ffff888bfddf2a40 RCX: ffff888bfddf2a40
[ 202.443024] RDX: ffff888be28cd7c8 RSI: ffff888c0b5e50c8 RDI: ffff888bfddf2d88
[ 202.443026] RBP: ffff888c0b5e4400 R08: 0000000000000077 R09: 0000000e7267ab1a
[ 202.443027] R10: 0000000000000078 R11: ffff888c0a5e4418 R12: 0000000000000004
[ 202.443029] R13: ffff888c09c03ec8 R14: 0000000000000046 R15: 0000000000000014
[ 202.443031] FS: 00007f05e37ce700(0000) GS:ffff888c09c00000(0000)
knlGS:0000000000000000
[ 202.443033] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 202.443034] CR2: 0000000000000008 CR3: 0000000be4d0e001 CR4: 00000000000606f0
[ 202.443035] Call Trace:
[ 202.443038] <IRQ>
[ 202.443041] enqueue_task+0x6f/0xb0
[ 202.443046] ttwu_do_activate+0x6a/0xc0
[ 202.443050] try_to_wake_up+0x20b/0x510
[ 202.443057] hrtimer_wakeup+0x1e/0x30
[ 202.443059] __hrtimer_run_queues+0x117/0x3d0
[ 202.443062] ? __hrtimer_init+0xb0/0xb0
[ 202.443067] hrtimer_interrupt+0xe5/0x240
[ 202.443074] smp_apic_timer_interrupt+0x81/0x1f0
[ 202.443077] apic_timer_interrupt+0xf/0x20
[ 202.443079] </IRQ>
[ 202.443081] RIP: 0033:0x7ffcf1fac6ac
[ 202.443084] Code: 2d 81 e9 ff ff 4c 8b 05 82 e9 ff ff 0f 01 f9 66
90 41 8b 0c 24 39 cb 0f 84 07 01 00 00 41 8b 1c 24 85 db 75 d9 eb ba
0f 01 f9 <66> 90 48 c1 e2 20 48 0f
[ 202.443085] RSP: 002b:00007f05e37cddf0 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff13
[ 202.443088] RAX: 0000000067252571 RBX: 00007f05e37cde50 RCX: 0000000000000000
[ 202.443089] RDX: 000000000000bddd RSI: 00007f05e37cde50 RDI: 0000000000000000
[ 202.443091] RBP: 00007f05e37cde10 R08: 0000000000000000 R09: 00007ffcf1fa90a0
[ 202.443092] R10: 00007ffcf1fa9080 R11: 000000000000fd2a R12: 0000000000000000
[ 202.443094] R13: 00007f0610cc7e6f R14: 0000000000000000 R15: 00007f05fca30390
[ 202.443101] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[ 202.443146] CR2: 0000000000000008
[ 202.443202] ---[ end trace 1b953fe9220b3d8a ]---
[ 202.443206] BUG: scheduling while atomic: schbench/3006/0x00010000
[ 202.443207] INFO: lockdep is turned off.
[ 202.443208] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[ 202.443253] CPU: 0 PID: 3006 Comm: schbench Tainted: G D I
5.0.0-rc8-00542-gd697415be692-dirty #13
[ 202.443254] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[ 202.443255] Call Trace:
[ 202.443257] <IRQ>
[ 202.443260] dump_stack+0x85/0xcb
[ 202.443263] __schedule_bug+0x62/0x90
[ 202.443266] __schedule+0x118f/0x1570
[ 202.443273] ? down_trylock+0xf/0x30
[ 202.443278] ? is_bpf_text_address+0x5/0xe0
[ 202.443282] schedule+0x28/0x70
[ 202.443285] schedule_timeout+0x221/0x4b0
[ 202.443290] ? vprintk_emit+0x1f9/0x350
[ 202.443298] __down_interruptible+0x86/0x100
[ 202.443304] ? down_interruptible+0x42/0x50
[ 202.443307] down_interruptible+0x42/0x50
[ 202.443312] pstore_dump+0x9e/0x340
[ 202.443316] ? lock_acquire+0x9e/0x180
[ 202.443319] ? kmsg_dump+0xe1/0x1d0
[ 202.443325] kmsg_dump+0x99/0x1d0
[ 202.443331] oops_end+0x6e/0xd0
[ 202.443336] no_context+0x1bd/0x540
[ 202.443341] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 202.443347] page_fault+0x1e/0x30
[ 202.443350] RIP: 0010:rb_insert_color+0x17/0x190
[ 202.443352] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 17 48 85 d2 0f 84 4d 01 00 00 48 8b 02 a8 01 0f 85 6d
01 00 00 <48> 8b 48 08 49 89 c0 44
[ 202.443353] RSP: 0000:ffff888c09c03e68 EFLAGS: 00010046
[ 202.443355] RAX: 0000000000000000 RBX: ffff888bfddf2a40 RCX: ffff888bfddf2a40
[ 202.443357] RDX: ffff888be28cd7c8 RSI: ffff888c0b5e50c8 RDI: ffff888bfddf2d88
[ 202.443358] RBP: ffff888c0b5e4400 R08: 0000000000000077 R09: 0000000e7267ab1a
[ 202.443360] R10: 0000000000000078 R11: ffff888c0a5e4418 R12: 0000000000000004
[ 202.443361] R13: ffff888c09c03ec8 R14: 0000000000000046 R15: 0000000000000014
[ 202.443370] enqueue_task+0x6f/0xb0
[ 202.443374] ttwu_do_activate+0x6a/0xc0
[ 202.443377] try_to_wake_up+0x20b/0x510
[ 202.443383] hrtimer_wakeup+0x1e/0x30
[ 202.443385] __hrtimer_run_queues+0x117/0x3d0
[ 202.443387] ? __hrtimer_init+0xb0/0xb0
[ 202.443393] hrtimer_interrupt+0xe5/0x240
[ 202.443398] smp_apic_timer_interrupt+0x81/0x1f0
[ 202.443401] apic_timer_interrupt+0xf/0x20
[ 202.443402] </IRQ>
[ 202.443404] RIP: 0033:0x7ffcf1fac6ac
[ 202.443406] Code: 2d 81 e9 ff ff 4c 8b 05 82 e9 ff ff 0f 01 f9 66
90 41 8b 0c 24 39 cb 0f 84 07 01 00 00 41 8b 1c 24 85 db 75 d9 eb ba
0f 01 f9 <66> 90 48 c1 e2 20 48 0f
[ 202.443407] RSP: 002b:00007f05e37cddf0 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff13
[ 202.443409] RAX: 0000000067252571 RBX: 00007f05e37cde50 RCX: 0000000000000000
[ 202.443411] RDX: 000000000000bddd RSI: 00007f05e37cde50 RDI: 0000000000000000
[ 202.443412] RBP: 00007f05e37cde10 R08: 0000000000000000 R09: 00007ffcf1fa90a0
[ 202.443414] R10: 00007ffcf1fa9080 R11: 000000000000fd2a R12: 0000000000000000
[ 202.443415] R13: 00007f0610cc7e6f R14: 0000000000000000 R15: 00007f05fca30390
[ 202.447384] R13: 00007f06114c8e6f R14: 0000000000000000 R15: 00007f060820a150
[ 202.447392] irq event stamp: 6766
[ 203.839390] hardirqs last enabled at (6765): [<ffffffff810044f2>]
do_syscall_64+0x12/0x1b0
[ 203.848842] hardirqs last disabled at (6766): [<ffffffff81a0affc>]
__schedule+0xdc/0x1570
[ 203.858097] softirqs last enabled at (6760): [<ffffffff81e00359>]
__do_softirq+0x359/0x40a
[ 203.867558] softirqs last disabled at (6751): [<ffffffff81095be1>]
irq_exit+0xc1/0xd0
[ 203.876423] ---[ end trace 1b953fe9220b3d8b ]---

2019-03-12 19:09:42

by Pawan Gupta

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

Hi,

With core scheduling, LTP reports 2 new failures related to cgroups
(memcg_stat_rss and memcg_move_charge_at_immigrate). I will try to debug them.

Also, "perf sched map" indicates there might be a small window when 2 processes in different cgroups run together on one core.
In the case below, B0 and D0 (stress-ng-cpu and sysbench) belong to 2 different cgroups with cpu.tag enabled.

$ perf sched map

*A0 382.266600 secs A0 => kworker/0:1-eve:51
*B0 382.266612 secs B0 => stress-ng-cpu:7956
*A0 382.394597 secs
*B0 382.394609 secs
B0 *C0 382.494459 secs C0 => i915/signal:0:450
B0 *D0 382.494468 secs D0 => sysbench:8088
*. D0 382.494472 secs . => swapper:0
. *C0 383.095787 secs
*B0 C0 383.095792 secs
B0 *D0 383.095820 secs
*A0 D0 383.096587 secs

In some cases I don't see an IPI getting sent to the sibling CPU when 2 incompatible processes are picked. For example, in the logs below at timestamp 382.146250,
"stress-ng-cpu" is picked while "sysbench" is running on the sibling CPU.

kworker/0:1-51 [000] d... 382.146246: __schedule: cpu(0): selected: stress-ng-cpu/7956 ffff9945bad29200
kworker/0:1-51 [000] d... 382.146246: __schedule: max: stress-ng-cpu/7956 ffff9945bad29200
kworker/0:1-51 [000] d... 382.146247: __prio_less: (swapper/4/0;140,0,0) ?< (sysbench/8088;140,34783671987,0)
kworker/0:1-51 [000] d... 382.146248: __prio_less: (stress-ng-cpu/7956;119,34817170203,0) ?< (sysbench/8088;119,34783671987,0)
kworker/0:1-51 [000] d... 382.146249: __schedule: cpu(4): selected: sysbench/8088 ffff9945a7405200
kworker/0:1-51 [000] d... 382.146249: __prio_less: (stress-ng-cpu/7956;119,34817170203,0) ?< (sysbench/8088;119,34783671987,0)
kworker/0:1-51 [000] d... 382.146250: __schedule: picked: stress-ng-cpu/7956 ffff9945bad29200
kworker/0:1-51 [000] d... 382.146251: __switch_to: Pawan: cpu(0) switching to stress-ng-cpu
kworker/0:1-51 [000] d... 382.146251: __switch_to: Pawan: cpu(4) running sysbench
stress-ng-cpu-7956 [000] dN.. 382.274234: __schedule: cpu(0): selected: kworker/0:1/51 0
stress-ng-cpu-7956 [000] dN.. 382.274235: __schedule: max: kworker/0:1/51 0
stress-ng-cpu-7956 [000] dN.. 382.274235: __schedule: cpu(4): selected: sysbench/8088 ffff9945a7405200
stress-ng-cpu-7956 [000] dN.. 382.274237: __prio_less: (kworker/0:1/51;119,50744489595,0) ?< (sysbench/8088;119,34911643157,0)
stress-ng-cpu-7956 [000] dN.. 382.274237: __schedule: picked: kworker/0:1/51 0
stress-ng-cpu-7956 [000] d... 382.274239: __switch_to: Pawan: cpu(0) switching to kworker/0:1
stress-ng-cpu-7956 [000] d... 382.274239: __switch_to: Pawan: cpu(4) running sysbench

-Pawan

2019-03-13 05:56:32

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Tue, Mar 12, 2019 at 3:45 PM Aubrey Li <[email protected]> wrote:
>
> On Tue, Mar 12, 2019 at 7:36 AM Subhra Mazumdar
> <[email protected]> wrote:
> >
> >
> > On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> > >
> > > On 3/10/19 9:23 PM, Aubrey Li wrote:
> > >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> > >> <[email protected]> wrote:
> > >>> expected. Most of the performance recovery happens in patch 15 which,
> > >>> unfortunately, is also the one that introduces the hard lockup.
> > >>>
> > >> After applied Subhra's patch, the following is triggered by enabling
> > >> core sched when a cgroup is
> > >> under heavy load.
> > >>
> > > It seems you are facing some other deadlock where printk is involved.
> > > Can you
> > > drop the last patch (patch 16 sched: Debug bits...) and try?
> > >
> > > Thanks,
> > > Subhra
> > >
> > Never Mind, I am seeing the same lockdep deadlock output even w/o patch
> > 16. Btw
> > the NULL fix had something missing,
>
> One more NULL pointer dereference:
>
> Mar 12 02:24:46 aubrey-ivb kernel: [ 201.916741] core sched enabled
> [ 201.950203] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000008
> [ 201.950254] ------------[ cut here ]------------
> [ 201.959045] #PF error: [normal kernel read fault]
> [ 201.964272] !se->on_rq
> [ 201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849
> set_next_buddy+0x52/0x70

A quick workaround below:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d0dac4fd94f..ef6acfe2cf7d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6834,7 +6834,7 @@ static void set_last_buddy(struct sched_entity *se)
 		return;
 
 	for_each_sched_entity(se) {
-		if (SCHED_WARN_ON(!se->on_rq))
+		if (SCHED_WARN_ON(!(se && se->on_rq)))
 			return;
 		cfs_rq_of(se)->last = se;
 	}
@@ -6846,7 +6846,7 @@ static void set_next_buddy(struct sched_entity *se)
 		return;
 
 	for_each_sched_entity(se) {
-		if (SCHED_WARN_ON(!se->on_rq))
+		if (SCHED_WARN_ON(!(se && se->on_rq)))
 			return;
 		cfs_rq_of(se)->next = se;
 	}

And now I'm running into a hard LOCKUP:

[ 326.336279] NMI watchdog: Watchdog detected hard LOCKUP on cpu 31
[ 326.336280] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[ 326.336311] irq event stamp: 164460
[ 326.336312] hardirqs last enabled at (164459):
[<ffffffff810c7a97>] sched_core_balance+0x247/0x470
[ 326.336312] hardirqs last disabled at (164460):
[<ffffffff810c7963>] sched_core_balance+0x113/0x470
[ 326.336313] softirqs last enabled at (164250):
[<ffffffff81e00359>] __do_softirq+0x359/0x40a
[ 326.336314] softirqs last disabled at (164213):
[<ffffffff81095be1>] irq_exit+0xc1/0xd0
[ 326.336315] CPU: 31 PID: 0 Comm: swapper/31 Tainted: G I
5.0.0-rc8-00542-gd697415be692-dirty #15
[ 326.336316] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[ 326.336317] RIP: 0010:native_queued_spin_lock_slowpath+0x18f/0x1c0
[ 326.336318] Code: c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6
48 05 80 51 1e 00 48 03 04 f5 40 58 39 82 48 89 10 8b 42 08 85 c0 75
09 f3 90 <8b> 42 08 85 c0 74 f7 4b
[ 326.336318] RSP: 0000:ffffc9000643bd58 EFLAGS: 00000046
[ 326.336319] RAX: 0000000000000000 RBX: ffff888c0ade4400 RCX: 0000000000800000
[ 326.336320] RDX: ffff88980bbe5180 RSI: 0000000000000019 RDI: ffff888c0ade4400
[ 326.336321] RBP: ffff888c0ade4400 R08: 0000000000800000 R09: 00000000001e3a80
[ 326.336321] R10: ffffc9000643bd08 R11: 0000000000000000 R12: 0000000000000000
[ 326.336322] R13: 0000000000000000 R14: ffff88980bbe4400 R15: 000000000000001f
[ 326.336323] FS: 0000000000000000(0000) GS:ffff88980ba00000(0000)
knlGS:0000000000000000
[ 326.336323] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 326.336324] CR2: 00007fdcd7fd7728 CR3: 00000017e821a001 CR4: 00000000000606e0
[ 326.336325] Call Trace:
[ 326.336325] do_raw_spin_lock+0xab/0xb0
[ 326.336326] _raw_spin_lock+0x4b/0x60
[ 326.336326] double_rq_lock+0x99/0x140
[ 326.336327] sched_core_balance+0x11e/0x470
[ 326.336327] __balance_callback+0x49/0xa0
[ 326.336328] __schedule+0x1113/0x1570
[ 326.336328] schedule_idle+0x1e/0x40
[ 326.336329] do_idle+0x16b/0x2a0
[ 326.336329] cpu_startup_entry+0x19/0x20
[ 326.336330] start_secondary+0x17f/0x1d0
[ 326.336331] secondary_startup_64+0xa4/0xb0
[ 330.959367] ---[ end Kernel panic - not syncing: Hard LOCKUP ]---

2019-03-14 00:35:52

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling


>>
>> One more NULL pointer dereference:
>>
>> Mar 12 02:24:46 aubrey-ivb kernel: [ 201.916741] core sched enabled
>> [ 201.950203] BUG: unable to handle kernel NULL pointer dereference
>> at 0000000000000008
>> [ 201.950254] ------------[ cut here ]------------
>> [ 201.959045] #PF error: [normal kernel read fault]
>> [ 201.964272] !se->on_rq
>> [ 201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849
>> set_next_buddy+0x52/0x70
>
> A quick workaround below:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d0dac4fd94f..ef6acfe2cf7d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6834,7 +6834,7 @@ static void set_last_buddy(struct sched_entity *se)
>  		return;
> 
>  	for_each_sched_entity(se) {
> -		if (SCHED_WARN_ON(!se->on_rq))
> +		if (SCHED_WARN_ON(!(se && se->on_rq)))
>  			return;
>  		cfs_rq_of(se)->last = se;
>  	}
> @@ -6846,7 +6846,7 @@ static void set_next_buddy(struct sched_entity *se)
>  		return;
> 
>  	for_each_sched_entity(se) {
> -		if (SCHED_WARN_ON(!se->on_rq))
> +		if (SCHED_WARN_ON(!(se && se->on_rq)))


Shouldn't for_each_sched_entity(se) skipping the code block for the !se case
have avoided the NULL pointer access of se?

Since

	#define for_each_sched_entity(se) \
		for (; se; se = se->parent)

Scratching my head a bit here on how your changes would have made
a difference.
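
To spell that out, this is roughly what the workaround's loop expands to with
the FAIR_GROUP_SCHED definition above (a sketch only, reusing the names from
your patch):

	/* Expansion of for_each_sched_entity(se) around the added check: */
	for (; se; se = se->parent) {
		/* se is non-NULL here by the loop condition ... */
		if (SCHED_WARN_ON(!se->on_rq))	/* ... so this line alone cannot NULL-deref se */
			return;
		cfs_rq_of(se)->next = se;
	}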

In your original log, I wonder if the !se->on_rq warning on CPU 22 is mixed in with the actual Oops?
I also saw rb_insert_color in your original log; I wonder if that
was actually the source of the Oops.


[ 202.078674] RIP: 0010:set_next_buddy+0x52/0x70
[ 202.090135] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[ 202.090144] RIP: 0010:rb_insert_color+0x17/0x190
[ 202.101623] Code: 48 85 ff 74 10 8b 47 40 85 c0 75 e2 80 3d 9e e5
6a 01 00 74 02 f3 c3 48 c7 c7 5c 05 2c 82 c6 05 8c e5 6a 01 01 e8 2e
bb fb ff <0f> 0b c3 83 bf 04 03 0e
[ 202.113216] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 17 48 85 d2 0f 84 4d 01 00 00 48 8b 02 a8 01 0f 85 6d
01 00 00 <48> 8b 48 08 49 89 c0 44
[ 202.118263] RSP: 0018:ffffc9000a5cbbb0 EFLAGS: 00010086
[ 202.129858] RSP: 0018:ffffc9000a463cc0 EFLAGS: 00010046
[ 202.135102] RAX: 0000000000000000 RBX: ffff88980047e800 RCX: 0000000000000000
[ 202.135105] RDX: ffff888be28caa40 RSI: 0000000000000001 RDI: ffffffff8110c3fa
[ 202.156251] RAX: 0000000000000000 RBX: ffff888bfeb80000 RCX: ffff888bfeb80

Thanks.

Tim

2019-03-14 05:30:51

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Thu, Mar 14, 2019 at 8:35 AM Tim Chen <[email protected]> wrote:
> >>
> >> One more NULL pointer dereference:
> >>
> >> Mar 12 02:24:46 aubrey-ivb kernel: [ 201.916741] core sched enabled
> >> [ 201.950203] BUG: unable to handle kernel NULL pointer dereference
> >> at 0000000000000008
> >> [ 201.950254] ------------[ cut here ]------------
> >> [ 201.959045] #PF error: [normal kernel read fault]
> >> [ 201.964272] !se->on_rq
> >> [ 201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849
> >> set_next_buddy+0x52/0x70
> >
> Shouldn't the for_each_sched_entity(se) skip the code block for !se case
> have avoided null pointer access of se?
>
> Since
> #define for_each_sched_entity(se) \
> for (; se; se = se->parent)
>
> Scratching my head a bit here on how your changes would have made
> a difference.

This NULL pointer dereference is not reproducible, which made me think the
change works...

>
> In your original log, I wonder if the !se->on_rq warning on CPU 22 is mixed with the actual OOPs?
> Saw also in your original log rb_insert_color. Wonder if that
> was actually the source of the Oops?

No chance to figure this out; I only saw it once, while the lockup occurs more
frequently.

Thanks,
-Aubrey

2019-03-14 06:08:12

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

The original patch seems to be missing the following change for 32-bit.

Thanks,
-Aubrey

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 9fbb10383434..78de28ebc45d 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -111,7 +111,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	/*
 	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
 	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	if (index == CPUACCT_STAT_NSTATS) {
@@ -125,7 +125,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	}
 
 #ifndef CONFIG_64BIT
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	return data;
@@ -140,14 +140,14 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
 	/*
 	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
 	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
 		cpuusage->usages[i] = val;
 
 #ifndef CONFIG_64BIT
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 }
 
@@ -252,13 +252,13 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 		 * Take rq->lock to make 64-bit read safe on 32-bit
 		 * platforms.
 		 */
-		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+		raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 		seq_printf(m, " %llu", cpuusage->usages[index]);
 
 #ifndef CONFIG_64BIT
-		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+		raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 	}
 	seq_puts(m, "\n");
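
For context, the reason these call sites have to go through rq_lockp()
instead of taking rq->lock directly: with core scheduling enabled, the SMT
siblings of a core are meant to serialize on a single core-wide lock. A rough
sketch of that idea (the helper and field names below are assumptions for
illustration, not lifted from the posted series):

	static inline raw_spinlock_t *rq_lockp(struct rq *rq)
	{
		if (sched_core_enabled(rq))	/* core scheduling active on this core */
			return &rq->core->lock;	/* all siblings share the core rq's lock */

		return &rq->lock;		/* otherwise the usual per-rq lock */
	}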

2019-03-14 15:29:59

by Julien Desfossez

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On 2/18/19 8:56 AM, Peter Zijlstra wrote:
> A much 'demanded' feature: core-scheduling :-(
>
> I still hate it with a passion, and that is part of why it took a little
> longer than 'promised'.
>
> While this one doesn't have all the 'features' of the previous (never
> published) version and isn't L1TF 'complete', I tend to like the structure
> better (relatively speaking: I hate it slightly less).
>
> This one is sched class agnostic and therefore, in principle, doesn't horribly
> wreck RT (in fact, RT could 'ab'use this by setting 'task->core_cookie = task'
> to force-idle siblings).
>
> Now, as hinted by that, there are semi sane reasons for actually having this.
> Various hardware features like Intel RDT - Memory Bandwidth Allocation, work
> per core (due to SMT fundamentally sharing caches) and therefore grouping
> related tasks on a core makes it more reliable.
>
> However; whichever way around you turn this cookie; it is expensive and nasty.

We are seeing this hard lockup within 1 hour of testing the patchset with 2
VMs using the core scheduler feature. Here is the full dmesg. We have the
kdump as well if more information is necessary.

[ 1989.647539] core sched enabled
[ 3353.211527] NMI: IOCK error (debug interrupt?) for reason 75 on CPU 0.
[ 3353.211528] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted
5.0-0.coresched-generic #1
[ 3353.211530] RIP: 0010:native_queued_spin_lock_slowpath+0x199/0x1e0
[ 3353.211532] Code: eb e8 c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6
48 05 00 3a 02 00 48 03 04 f5 20 48 bb a6 48 89 10 8b 42 08 85 c0 75 09 <f3>
90 8b 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 8e 0f 18 0e eb 8f
[ 3353.211533] RSP: 0018:ffff97ba3f603e18 EFLAGS: 00000046
[ 3353.211535] RAX: 0000000000000000 RBX: 0000000000000202 RCX:
0000000000040000
[ 3353.211535] RDX: ffff97ba3f623a00 RSI: 0000000000000007 RDI:
ffff97dabf822d40
[ 3353.211536] RBP: ffff97ba3f603e18 R08: 0000000000040000 R09:
0000000000018499
[ 3353.211537] R10: 0000000000000001 R11: 0000000000000000 R12:
0000000000000001
[ 3353.211538] R13: ffffffffa7340740 R14: 000000000000000c R15:
000000000000000c
[ 3353.211539] FS: 0000000000000000(0000) GS:ffff97ba3f600000(0000)
knlGS:0000000000000000
[ 3353.211544] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3353.211545] CR2: 00007efeac310004 CR3: 0000001bf4c0e002 CR4:
00000000001626f0
[ 3353.211546] Call Trace:
[ 3353.211546] <IRQ>
[ 3353.211547] _raw_spin_lock_irqsave+0x35/0x40
[ 3353.211548] update_blocked_averages+0x35/0x5d0
[ 3353.211549] ? rebalance_domains+0x180/0x2c0
[ 3353.211549] update_nohz_stats+0x48/0x60
[ 3353.211550] _nohz_idle_balance+0xdf/0x290
[ 3353.211551] run_rebalance_domains+0x97/0xa0
[ 3353.211551] __do_softirq+0xe4/0x2f3
[ 3353.211552] irq_exit+0xb6/0xc0
[ 3353.211553] scheduler_ipi+0xe4/0x130
[ 3353.211553] smp_reschedule_interrupt+0x39/0xe0
[ 3353.211554] reschedule_interrupt+0xf/0x20
[ 3353.211555] </IRQ>
[ 3353.211556] RIP: 0010:cpuidle_enter_state+0xbc/0x440
[ 3353.211557] Code: ff e8 d8 dd 86 ff 80 7d d3 00 74 17 9c 58 0f 1f 44 00
00 f6 c4 02 0f 85 54 03 00 00 31 ff e8 eb 1d 8d ff fb 66 0f 1f 44 00 00 <45>
85 f6 0f 88 1a 03 00 00 4c 2b 6d c8 48 ba cf f7 53 e3 a5 9b c4
[ 3353.211558] RSP: 0018:ffffffffa6e03df8 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff02
[ 3353.211560] RAX: ffff97ba3f622d40 RBX: ffffffffa6f545e0 RCX:
000000000000001f
[ 3353.211561] RDX: 0000024c9b7d936c RSI: 0000000047318912 RDI:
0000000000000000
[ 3353.211562] RBP: ffffffffa6e03e38 R08: 0000000000000002 R09:
0000000000022600
[ 3353.211562] R10: ffffffffa6e03dc8 R11: 00000000000002dc R12:
ffffd6c67f602968
[ 3353.211563] R13: 0000024c9b7d936c R14: 0000000000000004 R15:
ffffffffa6f54760
[ 3353.211564] ? cpuidle_enter_state+0x98/0x440
[ 3353.211565] cpuidle_enter+0x17/0x20
[ 3353.211565] call_cpuidle+0x23/0x40
[ 3353.211566] do_idle+0x204/0x280
[ 3353.211567] cpu_startup_entry+0x1d/0x20
[ 3353.211567] rest_init+0xae/0xb0
[ 3353.211568] arch_call_rest_init+0xe/0x1b
[ 3353.211569] start_kernel+0x4f5/0x516
[ 3353.211569] x86_64_start_reservations+0x24/0x26
[ 3353.211570] x86_64_start_kernel+0x74/0x77
[ 3353.211571] secondary_startup_64+0xa4/0xb0
[ 3353.211571] Kernel panic - not syncing: NMI IOCK error: Not continuing
[ 3353.211572] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted
5.0-0.coresched-generic #1
[ 3353.211574] Call Trace:
[ 3353.211575] <NMI>
[ 3353.211575] dump_stack+0x63/0x85
[ 3353.211576] panic+0xfe/0x2a4
[ 3353.211576] nmi_panic+0x39/0x40
[ 3353.211577] io_check_error+0x92/0xa0
[ 3353.211578] default_do_nmi+0x9e/0x110
[ 3353.211578] do_nmi+0x119/0x180
[ 3353.211579] end_repeat_nmi+0x16/0x50
[ 3353.211580] RIP: 0010:native_queued_spin_lock_slowpath+0x199/0x1e0
[ 3353.211581] Code: eb e8 c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6
48 05 00 3a 02 00 48 03 04 f5 20 48 bb a6 48 89 10 8b 42 08 85 c0 75 09 <f3>
90 8b 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 8e 0f 18 0e eb 8f
[ 3353.211582] RSP: 0018:ffff97ba3f603e18 EFLAGS: 00000046
[ 3353.211583] RAX: 0000000000000000 RBX: 0000000000000202 RCX:
0000000000040000
[ 3353.211584] RDX: ffff97ba3f623a00 RSI: 0000000000000007 RDI:
ffff97dabf822d40
[ 3353.211585] RBP: ffff97ba3f603e18 R08: 0000000000040000 R09:
0000000000018499
[ 3353.211586] R10: 0000000000000001 R11: 0000000000000000 R12:
0000000000000001
[ 3353.211587] R13: ffffffffa7340740 R14: 000000000000000c R15:
000000000000000c
[ 3353.211587] ? native_queued_spin_lock_slowpath+0x199/0x1e0
[ 3353.211588] ? native_queued_spin_lock_slowpath+0x199/0x1e0
[ 3353.211589] </NMI>
[ 3353.211589] <IRQ>
[ 3353.211590] _raw_spin_lock_irqsave+0x35/0x40
[ 3353.211591] update_blocked_averages+0x35/0x5d0
[ 3353.211591] ? rebalance_domains+0x180/0x2c0
[ 3353.211592] update_nohz_stats+0x48/0x60
[ 3353.211593] _nohz_idle_balance+0xdf/0x290
[ 3353.211593] run_rebalance_domains+0x97/0xa0
[ 3353.211594] __do_softirq+0xe4/0x2f3
[ 3353.211595] irq_exit+0xb6/0xc0
[ 3353.211595] scheduler_ipi+0xe4/0x130
[ 3353.211596] smp_reschedule_interrupt+0x39/0xe0
[ 3353.211597] reschedule_interrupt+0xf/0x20
[ 3353.211597] </IRQ>
[ 3353.211598] RIP: 0010:cpuidle_enter_state+0xbc/0x440
[ 3353.211599] Code: ff e8 d8 dd 86 ff 80 7d d3 00 74 17 9c 58 0f 1f 44 00
00 f6 c4 02 0f 85 54 03 00 00 31 ff e8 eb 1d 8d ff fb 66 0f 1f 44 00 00 <45>
85 f6 0f 88 1a 03 00 00 4c 2b 6d c8 48 ba cf f7 53 e3 a5 9b c4
[ 3353.211600] RSP: 0018:ffffffffa6e03df8 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff02
[ 3353.211602] RAX: ffff97ba3f622d40 RBX: ffffffffa6f545e0 RCX:
000000000000001f
[ 3353.211603] RDX: 0000024c9b7d936c RSI: 0000000047318912 RDI:
0000000000000000
[ 3353.211603] RBP: ffffffffa6e03e38 R08: 0000000000000002 R09:
0000000000022600
[ 3353.211604] R10: ffffffffa6e03dc8 R11: 00000000000002dc R12:
ffffd6c67f602968
[ 3353.211605] R13: 0000024c9b7d936c R14: 0000000000000004 R15:
ffffffffa6f54760
[ 3353.211606] ? cpuidle_enter_state+0x98/0x440
[ 3353.211607] cpuidle_enter+0x17/0x20
[ 3353.211607] call_cpuidle+0x23/0x40
[ 3353.211608] do_idle+0x204/0x280
[ 3353.211609] cpu_startup_entry+0x1d/0x20
[ 3353.211609] rest_init+0xae/0xb0
[ 3353.211610] arch_call_rest_init+0xe/0x1b
[ 3353.211611] start_kernel+0x4f5/0x516
[ 3353.211611] x86_64_start_reservations+0x24/0x26
[ 3353.211612] x86_64_start_kernel+0x74/0x77
[ 3353.211613] secondary_startup_64+0xa4/0xb0


2019-03-18 06:59:34

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Tue, Mar 12, 2019 at 7:36 AM Subhra Mazumdar
<[email protected]> wrote:
>
>
> On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> >
> > On 3/10/19 9:23 PM, Aubrey Li wrote:
> >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> >> <[email protected]> wrote:
> >>> expected. Most of the performance recovery happens in patch 15 which,
> >>> unfortunately, is also the one that introduces the hard lockup.
> >>>
> >> After applied Subhra's patch, the following is triggered by enabling
> >> core sched when a cgroup is
> >> under heavy load.
> >>
> > It seems you are facing some other deadlock where printk is involved.
> > Can you
> > drop the last patch (patch 16 sched: Debug bits...) and try?
> >
> > Thanks,
> > Subhra
> >
> Never Mind, I am seeing the same lockdep deadlock output even w/o patch
> 16. Btw
> the NULL fix had something missing, following works.
>

Okay, here is another one: on my system, the boot-up CPUs don't match the
possible CPU map, so rq->core of the CPUs that are never onlined is left
uninitialized, which causes a NULL pointer dereference panic in online_fair_sched_group():

And here is a quick fix.
-----------------------------------------------------------------------------------------------------
@@ -10488,7 +10493,8 @@ void online_fair_sched_group(struct task_group *tg)
 	for_each_possible_cpu(i) {
 		rq = cpu_rq(i);
 		se = tg->se[i];
-
+		if (!rq->core)
+			continue;
 		raw_spin_lock_irq(rq_lockp(rq));
 		update_rq_clock(rq);
 		attach_entity_cfs_rq(se);

Thanks,
-Aubrey

2019-03-26 07:33:20

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Fri, Mar 08, 2019 at 11:44:01AM -0800, Subhra Mazumdar wrote:
>
> On 2/22/19 4:45 AM, Mel Gorman wrote:
> >On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> >>On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <[email protected]> wrote:
> >>>However; whichever way around you turn this cookie; it is expensive and nasty.
> >>Do you (or anybody else) have numbers for real loads?
> >>
> >>Because performance is all that matters. If performance is bad, then
> >>it's pointless, since just turning off SMT is the answer.
> >>
> >I tried to do a comparison between tip/master, ht disabled and this series
> >putting test workloads into a tagged cgroup but unfortunately it failed
> >
> >[ 156.978682] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
> >[ 156.986597] #PF error: [normal kernel read fault]
> >[ 156.991343] PGD 0 P4D 0
> >[ 156.993905] Oops: 0000 [#1] SMP PTI
> >[ 156.997438] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 5.0.0-rc7-schedcore-v1r1 #1
> >[ 157.005161] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
> >[ 157.012896] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
> >[ 157.018613] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00
> > 53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
> >[ 157.037544] RSP: 0018:ffffc9000c5bbde8 EFLAGS: 00010086
> >[ 157.042819] RAX: ffff88810f5f6a00 RBX: 00000001547f175c RCX: 0000000000000001
> >[ 157.050015] RDX: ffff88bf3bdb0a40 RSI: 0000000000000000 RDI: 00000001547f175c
> >[ 157.057215] RBP: ffff88bf7fae32c0 R08: 000000000001e358 R09: ffff88810fb9f000
> >[ 157.064410] R10: ffffc9000c5bbe08 R11: ffff88810fb9f5c4 R12: 0000000000000000
> >[ 157.071611] R13: ffff88bf4e3ea0c0 R14: 0000000000000000 R15: ffff88bf4e3ea7a8
> >[ 157.078814] FS: 0000000000000000(0000) GS:ffff88bf7f5c0000(0000) knlGS:0000000000000000
> >[ 157.086977] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >[ 157.092779] CR2: 0000000000000058 CR3: 000000000220e005 CR4: 00000000003606e0
> >[ 157.099979] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >[ 157.109529] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >[ 157.119058] Call Trace:
> >[ 157.123865] pick_next_entity+0x61/0x110
> >[ 157.130137] pick_task_fair+0x4b/0x90
> >[ 157.136124] __schedule+0x365/0x12c0
> >[ 157.141985] schedule_idle+0x1e/0x40
> >[ 157.147822] do_idle+0x166/0x280
> >[ 157.153275] cpu_startup_entry+0x19/0x20
> >[ 157.159420] start_secondary+0x17a/0x1d0
> >[ 157.165568] secondary_startup_64+0xa4/0xb0
> >[ 157.171985] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs msr intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crc32_pclmul ghash_clmulni_intel ixgbe aesni_intel xfrm_algo iTCO_wdt joydev iTCO_vendor_support libphy igb aes_x86_64 crypto_simd ptp cryptd mei_me mdio pps_core ioatdma glue_helper pcspkr ipmi_si lpc_ich i2c_i801 mei dca ipmi_devintf ipmi_msghandler acpi_pad pcc_cpufreq button btrfs libcrc32c xor zstd_decompress zstd_compress raid6_pq hid_generic usbhid ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci crc32c_intel ehci_pci ttm xhci_hcd ehci_hcd drm ahci usbcore mpt3sas libahci raid_class scsi_transport_sas wmi sg nbd dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
> >[ 157.258990] CR2: 0000000000000058
> >[ 157.264961] ---[ end trace a301ac5e3ee86fde ]---
> >[ 157.283719] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
> >[ 157.291967] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00 53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
> >[ 157.316121] RSP: 0018:ffffc9000c5bbde8 EFLAGS: 00010086
> >[ 157.324060] RAX: ffff88810f5f6a00 RBX: 00000001547f175c RCX: 0000000000000001
> >[ 157.333932] RDX: ffff88bf3bdb0a40 RSI: 0000000000000000 RDI: 00000001547f175c
> >[ 157.343795] RBP: ffff88bf7fae32c0 R08: 000000000001e358 R09: ffff88810fb9f000
> >[ 157.353634] R10: ffffc9000c5bbe08 R11: ffff88810fb9f5c4 R12: 0000000000000000
> >[ 157.363506] R13: ffff88bf4e3ea0c0 R14: 0000000000000000 R15: ffff88bf4e3ea7a8
> >[ 157.373395] FS: 0000000000000000(0000) GS:ffff88bf7f5c0000(0000) knlGS:0000000000000000
> >[ 157.384238] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >[ 157.392709] CR2: 0000000000000058 CR3: 000000000220e005 CR4: 00000000003606e0
> >[ 157.402601] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >[ 157.412488] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >[ 157.422334] Kernel panic - not syncing: Attempted to kill the idle task!
> >[ 158.529804] Shutting down cpus with NMI
> >[ 158.573249] Kernel Offset: disabled
> >[ 158.586198] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
> >
> >RIP translates to kernel/sched/fair.c:6819
> >
> >static int
> >wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> >{
> > s64 gran, vdiff = curr->vruntime - se->vruntime; /* LINE 6819 */
> >
> > if (vdiff <= 0)
> > return -1;
> >
> > gran = wakeup_gran(se);
> > if (vdiff > gran)
> > return 1;
> >}
> >
> >I haven't tried debugging it yet.
> >
> I think the following fix, while trivial, is the right fix for the NULL
> dereference in this case. This bug is reproducible with patch 14. I

I assume you meant patch 4?

My understanding is, this is due to 'left' being NULL in
pick_next_entity().

With patch 4, in pick_task_fair(), pick_next_entity() can be called with
an empty rbtree of cfs_rq and with a NULL 'curr'. This resulted in a
NULL 'left'. Before patch 4, this can't happen.
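
To make the failure mode concrete, here is a rough sketch of the picking
logic (not the series' exact code; __pick_first_entity() and entity_before()
are the existing fair.c helpers), including the guard that would avoid the
NULL 'left':

	static struct sched_entity *
	pick_next_entity_sketch(struct cfs_rq *cfs_rq, struct sched_entity *curr)
	{
		/* Empty rbtree: __pick_first_entity() returns NULL. */
		struct sched_entity *left = __pick_first_entity(cfs_rq);

		/* With curr == NULL as well, there is simply nothing to pick. */
		if (!left && !curr)
			return NULL;

		/* Otherwise fall back to curr when the tree is empty or curr runs first. */
		if (!left || (curr && entity_before(curr, left)))
			left = curr;

		/* Non-NULL from here on, safe to pass to wakeup_preempt_entity(). */
		return left;
	}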

It's not clear to me why NULL is used instead of 'curr' for
pick_next_entity() in pick_task_fair(). My first thought was that 'curr'
should not be considered as the next entity, but 'curr' is checked after
pick_next_entity() returns, so that shouldn't be the reason. I guess I
missed something.

Thanks,
Aaron

> also did
> some performance bisecting and with patch 14 performance is
> decimated, that's
> expected. Most of the performance recovery happens in patch 15 which,
> unfortunately, is also the one that introduces the hard lockup.
>
> -------8<-----------
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d0dac4..ecadf36 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4131,7 +4131,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  	 * Avoid running the skip buddy, if running something else can
>  	 * be done without getting too unfair.
>  	 */
> -	if (cfs_rq->skip == se) {
> +	if (cfs_rq->skip && cfs_rq->skip == se) {
>  		struct sched_entity *second;
> 
>  		if (se == curr) {
> @@ -4149,13 +4149,15 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  	/*
>  	 * Prefer last buddy, try to return the CPU to a preempted task.
>  	 */
> -	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
> +	if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left)
> +	    < 1)
>  		se = cfs_rq->last;
> 
>  	/*
>  	 * Someone really wants this to run. If it's not unfair, run it.
>  	 */
> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
> +	if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left)
> +	    < 1)
>  		se = cfs_rq->next;
> 
>  	clear_buddies(cfs_rq, se);
>

2019-03-26 07:58:56

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

On Tue, Mar 26, 2019 at 03:32:12PM +0800, Aaron Lu wrote:
> On Fri, Mar 08, 2019 at 11:44:01AM -0800, Subhra Mazumdar wrote:
> >
> > On 2/22/19 4:45 AM, Mel Gorman wrote:
> > >On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> > >>On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <[email protected]> wrote:
> > >>>However; whichever way around you turn this cookie; it is expensive and nasty.
> > >>Do you (or anybody else) have numbers for real loads?
> > >>
> > >>Because performance is all that matters. If performance is bad, then
> > >>it's pointless, since just turning off SMT is the answer.
> > >>
> > >I tried to do a comparison between tip/master, ht disabled and this series
> > >putting test workloads into a tagged cgroup but unfortunately it failed
> > >
> > >[ 156.978682] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
> > >[ 156.986597] #PF error: [normal kernel read fault]
> > >[ 156.991343] PGD 0 P4D 0
> > >[ 156.993905] Oops: 0000 [#1] SMP PTI
> > >[ 156.997438] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 5.0.0-rc7-schedcore-v1r1 #1
> > >[ 157.005161] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
> > >[ 157.012896] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
> > >[ 157.018613] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00
> > > 53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
> > >[ 157.037544] RSP: 0018:ffffc9000c5bbde8 EFLAGS: 00010086
> > >[ 157.042819] RAX: ffff88810f5f6a00 RBX: 00000001547f175c RCX: 0000000000000001
> > >[ 157.050015] RDX: ffff88bf3bdb0a40 RSI: 0000000000000000 RDI: 00000001547f175c
> > >[ 157.057215] RBP: ffff88bf7fae32c0 R08: 000000000001e358 R09: ffff88810fb9f000
> > >[ 157.064410] R10: ffffc9000c5bbe08 R11: ffff88810fb9f5c4 R12: 0000000000000000
> > >[ 157.071611] R13: ffff88bf4e3ea0c0 R14: 0000000000000000 R15: ffff88bf4e3ea7a8
> > >[ 157.078814] FS: 0000000000000000(0000) GS:ffff88bf7f5c0000(0000) knlGS:0000000000000000
> > >[ 157.086977] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >[ 157.092779] CR2: 0000000000000058 CR3: 000000000220e005 CR4: 00000000003606e0
> > >[ 157.099979] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >[ 157.109529] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > >[ 157.119058] Call Trace:
> > >[ 157.123865] pick_next_entity+0x61/0x110
> > >[ 157.130137] pick_task_fair+0x4b/0x90
> > >[ 157.136124] __schedule+0x365/0x12c0
> > >[ 157.141985] schedule_idle+0x1e/0x40
> > >[ 157.147822] do_idle+0x166/0x280
> > >[ 157.153275] cpu_startup_entry+0x19/0x20
> > >[ 157.159420] start_secondary+0x17a/0x1d0
> > >[ 157.165568] secondary_startup_64+0xa4/0xb0
> > >[ 157.171985] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs msr intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crc32_pclmul ghash_clmulni_intel ixgbe aesni_intel xfrm_algo iTCO_wdt joydev iTCO_vendor_support libphy igb aes_x86_64 crypto_simd ptp cryptd mei_me mdio pps_core ioatdma glue_helper pcspkr ipmi_si lpc_ich i2c_i801 mei dca ipmi_devintf ipmi_msghandler acpi_pad pcc_cpufreq button btrfs libcrc32c xor zstd_decompress zstd_compress raid6_pq hid_generic usbhid ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci crc32c_intel ehci_pci ttm xhci_hcd ehci_hcd drm ahci usbcore mpt3sas libahci raid_class scsi_transport_sas wmi sg nbd dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
> > >[ 157.258990] CR2: 0000000000000058
> > >[ 157.264961] ---[ end trace a301ac5e3ee86fde ]---
> > >[ 157.283719] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
> > >[ 157.291967] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00 53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
> > >[ 157.316121] RSP: 0018:ffffc9000c5bbde8 EFLAGS: 00010086
> > >[ 157.324060] RAX: ffff88810f5f6a00 RBX: 00000001547f175c RCX: 0000000000000001
> > >[ 157.333932] RDX: ffff88bf3bdb0a40 RSI: 0000000000000000 RDI: 00000001547f175c
> > >[ 157.343795] RBP: ffff88bf7fae32c0 R08: 000000000001e358 R09: ffff88810fb9f000
> > >[ 157.353634] R10: ffffc9000c5bbe08 R11: ffff88810fb9f5c4 R12: 0000000000000000
> > >[ 157.363506] R13: ffff88bf4e3ea0c0 R14: 0000000000000000 R15: ffff88bf4e3ea7a8
> > >[ 157.373395] FS: 0000000000000000(0000) GS:ffff88bf7f5c0000(0000) knlGS:0000000000000000
> > >[ 157.384238] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >[ 157.392709] CR2: 0000000000000058 CR3: 000000000220e005 CR4: 00000000003606e0
> > >[ 157.402601] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >[ 157.412488] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > >[ 157.422334] Kernel panic - not syncing: Attempted to kill the idle task!
> > >[ 158.529804] Shutting down cpus with NMI
> > >[ 158.573249] Kernel Offset: disabled
> > >[ 158.586198] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
> > >
> > >RIP translates to kernel/sched/fair.c:6819
> > >
> > >static int
> > >wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> > >{
> > >        s64 gran, vdiff = curr->vruntime - se->vruntime; /* LINE 6819 */
> > >
> > >        if (vdiff <= 0)
> > >                return -1;
> > >
> > >        gran = wakeup_gran(se);
> > >        if (vdiff > gran)
> > >                return 1;
> > >
> > >        return 0;
> > >}
> > >
> > >I haven't tried debugging it yet.
> > >
> > I think the following fix, while trivial, is the right fix for the NULL
> > dereference in this case. This bug is reproducible with patch 14. I
>
> I assume you meant patch 4?

Correction, should be patch 9 where pick_task_fair() is introduced.

Thanks,
Aaron
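
For reference, a rough sketch of the call site being discussed. The shape is
assumed from this thread (pick_task_fair() passing NULL rather than 'curr' to
pick_next_entity(), then considering 'curr' afterwards) and is not copied
from patch 9 itself:

static struct task_struct *pick_task_fair(struct rq *rq)
{
        struct cfs_rq *cfs_rq = &rq->cfs;
        struct sched_entity *se;

        do {
                struct sched_entity *curr = cfs_rq->curr;

                /* NULL is passed instead of 'curr'... */
                se = pick_next_entity(cfs_rq, NULL);

                /* ...and 'curr' is only considered after the pick. */
                if (curr && (!se || entity_before(curr, se)))
                        se = curr;

                cfs_rq = group_cfs_rq(se);
        } while (cfs_rq);

        return task_of(se);
}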

>
> My understanding is, this is due to 'left' being NULL in
> pick_next_entity().
>
> With patch 4, in pick_task_fair(), pick_next_entity() can be called with
> an empty cfs_rq rbtree and a NULL 'curr'. This results in a NULL 'left'.
> Before patch 4, this couldn't happen.
>
> It's not clear to me why NULL is passed instead of 'curr' to
> pick_next_entity() in pick_task_fair(). My first thought was that 'curr'
> should not be considered as the next entity, but 'curr' is checked after
> pick_next_entity() returns, so that can't be the reason. I guess I
> missed something.
>
> Thanks,
> Aaron
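
To make the failure path concrete, a condensed sketch of the relevant part
of pick_next_entity() (simplified from kernel/sched/fair.c, skip-buddy
handling and clear_buddies() omitted, and without the guards from Subhra's
fix below):

static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
        struct sched_entity *left = __pick_first_entity(cfs_rq);
        struct sched_entity *se;

        /*
         * Empty rbtree: fall back to 'curr'.  If 'curr' is NULL as well
         * (the pick_task_fair() case), 'left' stays NULL.
         */
        if (!left || (curr && entity_before(curr, left)))
                left = curr;

        se = left;

        /*
         * With left == NULL this passes NULL as the second argument and
         * wakeup_preempt_entity() reads se->vruntime -- the NULL + 0x58
         * access in the oops above.
         */
        if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
                se = cfs_rq->last;

        if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
                se = cfs_rq->next;

        return se;
}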
>
> > also did some performance bisecting and with patch 14 performance is
> > decimated, that's expected. Most of the performance recovery happens in
> > patch 15 which, unfortunately, is also the one that introduces the hard
> > lockup.
> >
> > -------8<-----------
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 1d0dac4..ecadf36 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4131,7 +4131,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >           * Avoid running the skip buddy, if running something else can
> >           * be done without getting too unfair.
> >           */
> > -        if (cfs_rq->skip == se) {
> > +        if (cfs_rq->skip && cfs_rq->skip == se) {
> >                  struct sched_entity *second;
> >
> >                  if (se == curr) {
> > @@ -4149,13 +4149,15 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >          /*
> >           * Prefer last buddy, try to return the CPU to a preempted task.
> >           */
> > -        if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
> > +        if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left)
> > +            < 1)
> >                  se = cfs_rq->last;
> >
> >          /*
> >           * Someone really wants this to run. If it's not unfair, run it.
> >           */
> > -        if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
> > +        if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left)
> > +            < 1)
> >                  se = cfs_rq->next;
> >
> >          clear_buddies(cfs_rq, se);
> >