2020-06-30 22:07:36

by Vineeth Remanan Pillai

Subject: [RFC PATCH 00/16] Core scheduling v6

Sixth iteration of the Core-Scheduling feature.

Core scheduling is a feature that allows only trusted tasks to run
concurrently on cpus sharing compute resources (eg: hyperthreads on a
core). The goal is to mitigate core-level side-channel attacks
without requiring SMT to be disabled (which has a significant impact on
performance in some situations). Core scheduling (as of v6) mitigates
user-space-to-user-space attacks, and user-to-kernel attacks when one of
the siblings enters the kernel via an interrupt. It is still possible for
a task to attack a sibling thread that enters the kernel via a syscall.

By default, the feature doesn't change any of the current scheduler
behavior. The user decides which tasks can run simultaneously on the
same core (for now by having them in the same tagged cgroup). When a
tag is enabled in a cgroup and a task from that cgroup is running on a
hardware thread, the scheduler ensures that only idle or trusted tasks
run on the other sibling(s). Besides addressing security concerns, this
feature can also benefit RT and performance-sensitive applications that
want to control how tasks make use of SMT dynamically.
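
As a rough illustration, assuming the cpu controller 'tag' file added by
the cgroup tagging patch in this series (the group name and pids below
are hypothetical):

  # Let two mutually trusting VMs share a core; other tasks only ever
  # see an idle or trusted sibling while these VMs are running.
  mkdir /sys/fs/cgroup/cpu/vmgroup
  echo $VM1_PID > /sys/fs/cgroup/cpu/vmgroup/cgroup.procs
  echo $VM2_PID > /sys/fs/cgroup/cpu/vmgroup/cgroup.procs
  echo 1 > /sys/fs/cgroup/cpu/vmgroup/cpu.tag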

This iteration is mostly a cleanup of v5, except for one major new
feature: pausing siblings when a cpu enters the kernel via nmi/irq/softirq.
It also introduces documentation and includes minor crash fixes.

One major cleanup was removing the hotplug support and related code.
The hotplug-related crashes were not documented and the fixes piled up
over time, leading to complex code. We were not able to reproduce the
crashes in the limited testing done, but if they are reproducible, we
don't want to hide them; we should document them and design better
fixes if needed.

In terms of performance, the results in this release are similar to
v5. On an x86 system with N hardware threads:
- if only N/2 hardware threads are busy, the performance is similar
between baseline, corescheduling and nosmt
- if N hardware threads are busy with N different corescheduling
groups, the impact of corescheduling is similar to nosmt
- if N hardware threads are busy and multiple active threads share the
same corescheduling cookie, they gain a performance improvement over
nosmt.
The specific performance impact depends on the workload, but for a
very busy 12-vcpu database VM (1 coresched tag) running on a NUMA node
with 36 hardware threads alongside 96 mostly idle neighbor VMs (each in
its own coresched tag), performance drops by 54% with corescheduling
and by 90% with nosmt.

v6 is rebased on 5.7.6 (a06eb423367e)
https://github.com/digitalocean/linux-coresched/tree/coresched/v6-v5.7.y

Changes in v6
-------------
- Documentation
- Joel
- Pause siblings on entering nmi/irq/softirq
- Joel, Vineeth
- Fix for RCU crash
- Joel
- Fix for a crash in pick_next_task
- Yu Chen, Vineeth
- Minor re-write of core-wide vruntime comparison
- Aaron Lu
- Cleanup: Address Review comments
- Cleanup: Remove hotplug support (for now)
- Build fixes: 32 bit, SMT=n, AUTOGROUP=n etc
- Joel, Vineeth

Changes in v5
-------------
- Fixes for cgroup/process tagging during corner cases like cgroup
destroy, task moving across cgroups etc
- Tim Chen
- Coresched aware task migrations
- Aubrey Li
- Other minor stability fixes.

Changes in v4
-------------
- Implement a core wide min_vruntime for vruntime comparison of tasks
across cpus in a core.
- Aaron Lu
- Fixes a typo bug in setting the forced_idle cpu.
- Aaron Lu

Changes in v3
-------------
- Fixes the issue of sibling picking up an incompatible task
- Aaron Lu
- Vineeth Pillai
- Julien Desfossez
- Fixes the issue of starving threads due to forced idle
- Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
- Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
- Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
- Aaron Lu

Changes in v2
-------------
- Fixes for a couple of NULL pointer dereference crashes
- Subhra Mazumdar
- Tim Chen
- Improves priority comparison logic for processes on different cpus
- Peter Zijlstra
- Aaron Lu
- Fixes a hard lockup in rq locking
- Vineeth Pillai
- Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
- Vineeth Pillai
- Julien Desfossez
- Fix for 32bit build
- Aubrey Li

ISSUES
------
- Aaron (Intel) found an issue with load balancing when tasks have
different weights (nice or cgroup shares). Task weight is not considered
in coresched-aware load balancing, causing higher-weight tasks to starve.
This issue was in v5 as well and is carried over.
- Joel (ChromeOS) found an issue where an RT task may be preempted by a
lower-class task. This issue was in v5 as well and is carried over.
- The coresched RB-tree doesn't get updated when a task's priority changes.
- Potential starvation of untagged tasks (0 cookie) - a side effect of
0-cookie tasks not being in the coresched RB-tree.

TODO
----
- MAJOR: Core wide vruntime comparison re-work
https://lwn.net/ml/linux-kernel/[email protected]/
https://lwn.net/ml/linux-kernel/[email protected]/
- MAJOR: Decide on the interfaces/API for exposing the feature to userland.
- prctl/set_schedattr
- A new coresched cgroup?
- fork/clone behaviour: whether the child should inherit the cookie,
process vs thread behaviour, etc.
- Auto tagging based on user/session/process etc.
- procfs/sysfs interface to core scheduling.
- MAJOR: Load balancing/migration fixes for core scheduling.
In v6, load balancing is partially coresched aware, but has some
issues w.r.t. process/taskgroup weights:
https://lwn.net/ml/linux-kernel/[email protected]/
- System-wide trusted cookie:
As of now, a cookie value of 0 is considered special - the default for
tasks that are not explicitly tagged. Kernel threads are not tagged by
default, so tasks with a 0 cookie are currently treated as a system-wide
trusted group. The IRQ pause feature assumes this. Discussion on this
needs to continue and converge on a proper system-wide trusted cookie.
- Investigate the source of the overhead even when no tasks are tagged:
https://lkml.org/lkml/2019/10/29/242
- Core scheduling test framework: kselftests, torture tests etc

---

Aaron Lu (2):
sched/fair: wrapper for cfs_rq->min_vruntime
sched/fair: core wide cfs task priority comparison

Aubrey Li (1):
sched: migration changes for core scheduling

Chen Yu (1):
sched: Fix pick_next_task() race condition in core scheduling

Joel Fernandes (Google) (2):
irq: Add support for core-wide protection of IRQ and softirq
Documentation: Add documentation on core scheduling

Peter Zijlstra (9):
sched: Wrap rq::lock access
sched: Introduce sched_class::pick_task()
sched: Core-wide rq->lock
sched/fair: Add a few assertions
sched: Basic tracking of matching tasks
sched: Add core wide task selection and scheduling.
sched: Trivial forced-newidle balancer
sched: cgroup tagging interface for core scheduling
sched: Debug bits...

vpillai (1):
sched/fair: Fix forced idle sibling starvation corner case

Documentation/admin-guide/hw-vuln/core-scheduling.rst | 241 ++++
Documentation/admin-guide/hw-vuln/index.rst | 1 +
Documentation/admin-guide/kernel-parameters.txt | 9 +
include/linux/sched.h | 14 +-
kernel/Kconfig.preempt | 19 +
kernel/sched/core.c | 1076 ++++++++++++++++-
kernel/sched/cpuacct.c | 12 +-
kernel/sched/deadline.c | 34 +-
kernel/sched/debug.c | 4 +-
kernel/sched/fair.c | 409 +++++--
kernel/sched/idle.c | 13 +-
kernel/sched/pelt.h | 2 +-
kernel/sched/rt.c | 22 +-
kernel/sched/sched.h | 256 +++-
kernel/sched/stop_task.c | 13 +-
kernel/sched/topology.c | 4 +-
kernel/softirq.c | 46 +
17 files changed, 1953 insertions(+), 222 deletions(-)
create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

--
2.17.1


2020-06-30 22:07:44

by Vineeth Remanan Pillai

Subject: [RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq

From: "Joel Fernandes (Google)" <[email protected]>

With the current core scheduling patchset, non-threaded IRQ and softirq
victims can leak data from their hyperthread to a sibling hyperthread
running an attacker.

For MDS, it is possible for the IRQ and softirq handlers to leak data to
either host or guest attackers. For L1TF, it is possible to leak to
guest attackers. There is no possible mitigation involving flushing of
buffers to avoid this, since the attacker and the victim execute
concurrently on 2 or more HTs.

The solution in this patch is to monitor the outer-most core-wide
irq_enter() and irq_exit() executed by any sibling. In between these
two, we mark the core to be in a special core-wide IRQ state.

On IRQ entry, if we detect that the sibling is running untrusted
code, we send a reschedule IPI so that the sibling transitions through
its irq_exit() and does any waiting there, until the IRQ being
protected finishes.

We also monitor the per-CPU outermost irq_exit(). If, during the per-CPU
outermost irq_exit(), the core is still in the special core-wide IRQ
state, we busy-wait until the core exits this state. This combination
of per-CPU and core-wide IRQ states helps handle any combination of
irq_enter()s and irq_exit()s happening on all of the siblings of the
core in any order.

Lastly, we also check in the schedule loop whether we are about to schedule
an untrusted task while the core is in such a state. This is possible
if a trusted task enters the scheduler by yielding the CPU; that path
involves no transition through irq_exit(), so we have to do the waiting
explicitly there.

Every attempt is made to avoid unnecessary busy-waiting, and in
testing on real-world ChromeOS usecases, it has not shown a performance
drop. In ChromeOS, with this and the rest of the core scheduling
patchset, key-press latencies into Google Docs improve by roughly 3x
when camera streaming runs simultaneously (the 90th-percentile latency
drops from ~150ms to ~50ms).

This feature is controlled by the build-time config option
CONFIG_SCHED_CORE_IRQ_PAUSE and is enabled by default. There is also a
kernel boot parameter 'sched_core_irq_pause' to enable/disable the
feature at boot time; it defaults to enabled.
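
For example (an illustrative command line fragment, not taken from this
patch), booting with:

  sched_core_irq_pause=0

keeps core scheduling itself active but turns off the pausing of
siblings on nmi/irq/softirq entry.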

Cc: Julien Desfossez <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Aaron Lu <[email protected]>
Cc: Aubrey Li <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Co-developed-by: Vineeth Pillai <[email protected]>
Signed-off-by: Vineeth Pillai <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 9 +
include/linux/sched.h | 5 +
kernel/Kconfig.preempt | 13 ++
kernel/sched/core.c | 161 ++++++++++++++++++
kernel/sched/sched.h | 7 +
kernel/softirq.c | 46 +++++
6 files changed, 241 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 5e2ce88d6eda..d44d7a997610 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4445,6 +4445,15 @@

sbni= [NET] Granch SBNI12 leased line adapter

+ sched_core_irq_pause=
+ [SCHED_CORE, SCHED_CORE_IRQ_PAUSE] Pause SMT siblings
+ of a core if at least one of the siblings of the core
+ is running nmi/irq/softirq. This is to guarantee that
+ kernel data is not leaked to tasks which are not trusted
+ by the kernel.
+ This feature is valid only when core scheduling is
+ enabled (SCHED_CORE).
+
sched_debug [KNL] Enables verbose scheduler debug messages.

schedstats= [KNL,X86] Enable or disable scheduled statistics.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4f9edf013df3..097746a9f260 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2025,4 +2025,9 @@ int sched_trace_rq_cpu(struct rq *rq);

const struct cpumask *sched_trace_rd_span(struct root_domain *rd);

+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+void sched_core_irq_enter(void);
+void sched_core_irq_exit(void);
+#endif
+
#endif
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 4488fbf4d3a8..59094a66a987 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -86,3 +86,16 @@ config SCHED_CORE
default y
depends on SCHED_SMT

+config SCHED_CORE_IRQ_PAUSE
+ bool "Pause siblings on entering irq/softirq during core-scheduling"
+ default y
+ depends on SCHED_CORE
+ help
+ This option enables pausing all SMT siblings of a core when at least
+ one of the siblings in the core is in nmi/irq/softirq. This is to
+ enforce security such that information from the kernel is not leaked to
+ non-trusted tasks running on siblings. This option is valid only if
+ Core Scheduling (CONFIG_SCHED_CORE) is enabled.
+
+ If in doubt, select 'Y' when CONFIG_SCHED_CORE=y
+
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ede86fb37b4e..2ec56970d6bb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4252,6 +4252,155 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
return a->core_cookie == b->core_cookie;
}

+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+/*
+ * Helper function to pause the caller's hyperthread until the core exits the
+ * core-wide IRQ state. Obviously the CPU calling this function should not be
+ * responsible for the core being in the core-wide IRQ state otherwise it will
+ * deadlock. This function should be called from irq_exit() and from schedule().
+ * It is up to the callers to decide if calling here is necessary.
+ */
+static inline void sched_core_sibling_irq_pause(struct rq *rq)
+{
+ /*
+ * Wait till the core of this HT is not in a core-wide IRQ state.
+ *
+ * Pair with smp_store_release() in sched_core_irq_exit().
+ */
+ while (smp_load_acquire(&rq->core->core_irq_nest) > 0)
+ cpu_relax();
+}
+
+/*
+ * Enter the core-wide IRQ state. Sibling will be paused if it is running
+ * 'untrusted' code, until sched_core_irq_exit() is called. Every attempt to
+ * avoid sending useless IPIs is made. Must be called only from hard IRQ
+ * context.
+ */
+void sched_core_irq_enter(void)
+{
+ int i, cpu = smp_processor_id();
+ struct rq *rq = cpu_rq(cpu);
+ const struct cpumask *smt_mask;
+
+ if (!sched_core_enabled(rq))
+ return;
+
+ /* Count irq_enter() calls received without irq_exit() on this CPU. */
+ rq->core_this_irq_nest++;
+
+ /* If not outermost irq_enter(), do nothing. */
+ if (WARN_ON_ONCE(rq->core->core_this_irq_nest == UINT_MAX) ||
+ rq->core_this_irq_nest != 1)
+ return;
+
+ raw_spin_lock(rq_lockp(rq));
+ smt_mask = cpu_smt_mask(cpu);
+
+ /* Contribute this CPU's irq_enter() to core-wide irq_enter() count. */
+ WRITE_ONCE(rq->core->core_irq_nest, rq->core->core_irq_nest + 1);
+ if (WARN_ON_ONCE(rq->core->core_irq_nest == UINT_MAX))
+ goto unlock;
+
+ if (rq->core_pause_pending) {
+ /*
+ * Do nothing more since we are in a 'reschedule IPI' sent from
+ * another sibling. That sibling would have sent IPIs to all of
+ * the HTs.
+ */
+ goto unlock;
+ }
+
+ /*
+ * If we are not the first ones on the core to enter core-wide IRQ
+ * state, do nothing.
+ */
+ if (rq->core->core_irq_nest > 1)
+ goto unlock;
+
+ /* Do nothing more if the core is not tagged. */
+ if (!rq->core->core_cookie)
+ goto unlock;
+
+ for_each_cpu(i, smt_mask) {
+ struct rq *srq = cpu_rq(i);
+
+ if (i == cpu || cpu_is_offline(i))
+ continue;
+
+ if (!srq->curr->mm || is_idle_task(srq->curr))
+ continue;
+
+ /* Skip if HT is not running a tagged task. */
+ if (!srq->curr->core_cookie && !srq->core_pick)
+ continue;
+
+ /* IPI only if previous IPI was not pending. */
+ if (!srq->core_pause_pending) {
+ srq->core_pause_pending = 1;
+ smp_send_reschedule(i);
+ }
+ }
+unlock:
+ raw_spin_unlock(rq_lockp(rq));
+}
+
+/*
+ * Process any work needed for either exiting the core-wide IRQ state, or for
+ * waiting on this hyperthread if the core is still in this state.
+ */
+void sched_core_irq_exit(void)
+{
+ int cpu = smp_processor_id();
+ struct rq *rq = cpu_rq(cpu);
+ bool wait_here = false;
+ unsigned int nest;
+
+ /* Do nothing if core-sched disabled. */
+ if (!sched_core_enabled(rq))
+ return;
+
+ rq->core_this_irq_nest--;
+
+ /* If not outermost on this CPU, do nothing. */
+ if (WARN_ON_ONCE(rq->core_this_irq_nest == UINT_MAX) ||
+ rq->core_this_irq_nest > 0)
+ return;
+
+ raw_spin_lock(rq_lockp(rq));
+ /*
+ * Core-wide nesting counter can never be 0 because we are
+ * still in it on this CPU.
+ */
+ nest = rq->core->core_irq_nest;
+ WARN_ON_ONCE(!nest);
+
+ /*
+ * If we still have other CPUs in IRQs, we have to wait for them.
+ * Either here, or in the scheduler.
+ */
+ if (rq->core->core_cookie && nest > 1) {
+ /*
+ * If we are entering the scheduler anyway, we can just wait
+ * there for ->core_irq_nest to reach 0. If not, just wait here.
+ */
+ if (!tif_need_resched()) {
+ wait_here = true;
+ }
+ }
+
+ if (rq->core_pause_pending)
+ rq->core_pause_pending = 0;
+
+ /* Pair with smp_load_acquire() in sched_core_sibling_irq_pause(). */
+ smp_store_release(&rq->core->core_irq_nest, nest - 1);
+ raw_spin_unlock(rq_lockp(rq));
+
+ if (wait_here)
+ sched_core_sibling_irq_pause(rq);
+}
+#endif /* CONFIG_SCHED_CORE_IRQ_PAUSE */
+
// XXX fairness/fwd progress conditions
/*
* Returns
@@ -4732,6 +4881,18 @@ static void __sched notrace __schedule(bool preempt)
rq_unlock_irq(rq, &rf);
}

+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+ /*
+ * If a CPU that was running a trusted task entered the scheduler, and
+ * the next task is untrusted, then check if waiting for core-wide IRQ
+ * state to cease is needed since we would not have been able to get
+ * the services of irq_exit() to do that waiting.
+ */
+ if (sched_core_enabled(rq) &&
+ !is_idle_task(next) && next->mm && next->core_cookie)
+ sched_core_sibling_irq_pause(rq);
+#endif
+
balance_callback(rq);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a5450886c4e4..6445943d3215 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1041,11 +1041,18 @@ struct rq {
unsigned int core_sched_seq;
struct rb_root core_tree;
unsigned char core_forceidle;
+ unsigned char core_pause_pending;
+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+ unsigned int core_this_irq_nest;
+#endif

/* shared state */
unsigned int core_task_seq;
unsigned int core_pick_seq;
unsigned long core_cookie;
+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+ unsigned int core_irq_nest;
+#endif
#endif
};

diff --git a/kernel/softirq.c b/kernel/softirq.c
index a47c6dd57452..0745f1c6b352 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -246,6 +246,24 @@ static inline bool lockdep_softirq_start(void) { return false; }
static inline void lockdep_softirq_end(bool in_hardirq) { }
#endif

+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+DEFINE_STATIC_KEY_TRUE(sched_core_irq_pause);
+static int __init set_sched_core_irq_pause(char *str)
+{
+ unsigned long val = 0;
+ if (!str)
+ return 0;
+
+ val = simple_strtoul(str, &str, 0);
+
+ if (val == 0)
+ static_branch_disable(&sched_core_irq_pause);
+
+ return 1;
+}
+__setup("sched_core_irq_pause=", set_sched_core_irq_pause);
+#endif
+
asmlinkage __visible void __softirq_entry __do_softirq(void)
{
unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
@@ -273,6 +291,16 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
/* Reset the pending bitmask before enabling irqs */
set_softirq_pending(0);

+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+ /*
+ * Core scheduling mitigations require entry into softirq to send stall
+ * IPIs to sibling hyperthreads if needed (e.g., the sibling is running
+ * an untrusted task). If we are here from irq_exit(), no IPIs are sent.
+ */
+ if (static_branch_likely(&sched_core_irq_pause))
+ sched_core_irq_enter();
+#endif
+
local_irq_enable();

h = softirq_vec;
@@ -305,6 +333,12 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
rcu_softirq_qs();
local_irq_disable();

+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+ /* Inform the scheduler about exit from softirq. */
+ if (static_branch_likely(&sched_core_irq_pause))
+ sched_core_irq_exit();
+#endif
+
pending = local_softirq_pending();
if (pending) {
if (time_before(jiffies, end) && !need_resched() &&
@@ -345,6 +379,12 @@ asmlinkage __visible void do_softirq(void)
void irq_enter(void)
{
rcu_irq_enter();
+
+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+ if (static_branch_likely(&sched_core_irq_pause))
+ sched_core_irq_enter();
+#endif
+
if (is_idle_task(current) && !in_interrupt()) {
/*
* Prevent raise_softirq from needlessly waking up ksoftirqd
@@ -413,6 +453,12 @@ void irq_exit(void)
invoke_softirq();

tick_irq_exit();
+
+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+ if (static_branch_likely(&sched_core_irq_pause))
+ sched_core_irq_exit();
+#endif
+
rcu_irq_exit();
/* must be last! */
lockdep_hardirq_exit();
--
2.17.1

2020-06-30 22:07:47

by Vineeth Remanan Pillai

Subject: [RFC PATCH 10/16] sched: Trivial forced-newidle balancer

From: Peter Zijlstra <[email protected]>

When a sibling is forced-idle to match the core-cookie, search for
matching tasks to fill the core.

rcu_read_unlock() can incur an infrequent deadlock in
sched_core_balance(). Fix this by using the RCU-sched flavor instead.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Acked-by: Paul E. McKenney <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 131 +++++++++++++++++++++++++++++++++++++++++-
kernel/sched/idle.c | 1 +
kernel/sched/sched.h | 6 ++
4 files changed, 138 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3c8dcc5ff039..4f9edf013df3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,7 @@ struct task_struct {
#ifdef CONFIG_SCHED_CORE
struct rb_node core_node;
unsigned long core_cookie;
+ unsigned int core_occupation;
#endif

#ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d6d6a678013..fb9edb09ead7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -201,6 +201,21 @@ static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
return match;
}

+static struct task_struct *sched_core_next(struct task_struct *p, unsigned long cookie)
+{
+ struct rb_node *node = &p->core_node;
+
+ node = rb_next(node);
+ if (!node)
+ return NULL;
+
+ p = container_of(node, struct task_struct, core_node);
+ if (p->core_cookie != cookie)
+ return NULL;
+
+ return p;
+}
+
/*
* The static-key + stop-machine variable are needed such that:
*
@@ -4233,7 +4248,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
struct task_struct *next, *max = NULL;
const struct sched_class *class;
const struct cpumask *smt_mask;
- int i, j, cpu;
+ int i, j, cpu, occ = 0;
bool need_sync;

if (!sched_core_enabled(rq))
@@ -4332,6 +4347,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
goto done;
}

+ if (!is_idle_task(p))
+ occ++;
+
rq_i->core_pick = p;

/*
@@ -4357,6 +4375,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

cpu_rq(j)->core_pick = NULL;
}
+ occ = 1;
goto again;
} else {
/*
@@ -4393,6 +4412,8 @@ next_class:;
if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
rq_i->core_forceidle = true;

+ rq_i->core_pick->core_occupation = occ;
+
if (i == cpu)
continue;

@@ -4408,6 +4429,114 @@ next_class:;
return next;
}

+static bool try_steal_cookie(int this, int that)
+{
+ struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
+ struct task_struct *p;
+ unsigned long cookie;
+ bool success = false;
+
+ local_irq_disable();
+ double_rq_lock(dst, src);
+
+ cookie = dst->core->core_cookie;
+ if (!cookie)
+ goto unlock;
+
+ if (dst->curr != dst->idle)
+ goto unlock;
+
+ p = sched_core_find(src, cookie);
+ if (p == src->idle)
+ goto unlock;
+
+ do {
+ if (p == src->core_pick || p == src->curr)
+ goto next;
+
+ if (!cpumask_test_cpu(this, &p->cpus_mask))
+ goto next;
+
+ if (p->core_occupation > dst->idle->core_occupation)
+ goto next;
+
+ p->on_rq = TASK_ON_RQ_MIGRATING;
+ deactivate_task(src, p, 0);
+ set_task_cpu(p, this);
+ activate_task(dst, p, 0);
+ p->on_rq = TASK_ON_RQ_QUEUED;
+
+ resched_curr(dst);
+
+ success = true;
+ break;
+
+next:
+ p = sched_core_next(p, cookie);
+ } while (p);
+
+unlock:
+ double_rq_unlock(dst, src);
+ local_irq_enable();
+
+ return success;
+}
+
+static bool steal_cookie_task(int cpu, struct sched_domain *sd)
+{
+ int i;
+
+ for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
+ if (i == cpu)
+ continue;
+
+ if (need_resched())
+ break;
+
+ if (try_steal_cookie(cpu, i))
+ return true;
+ }
+
+ return false;
+}
+
+static void sched_core_balance(struct rq *rq)
+{
+ struct sched_domain *sd;
+ int cpu = cpu_of(rq);
+
+ rcu_read_lock_sched();
+ raw_spin_unlock_irq(rq_lockp(rq));
+ for_each_domain(cpu, sd) {
+ if (!(sd->flags & SD_LOAD_BALANCE))
+ break;
+
+ if (need_resched())
+ break;
+
+ if (steal_cookie_task(cpu, sd))
+ break;
+ }
+ raw_spin_lock_irq(rq_lockp(rq));
+ rcu_read_unlock_sched();
+}
+
+static DEFINE_PER_CPU(struct callback_head, core_balance_head);
+
+void queue_core_balance(struct rq *rq)
+{
+ if (!sched_core_enabled(rq))
+ return;
+
+ if (!rq->core->core_cookie)
+ return;
+
+ if (!rq->nr_running) /* not forced idle */
+ return;
+
+ queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu), sched_core_balance);
+}
+
#else /* !CONFIG_SCHED_CORE */

static struct task_struct *
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index a8d40ffab097..dff6ba220ed7 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -395,6 +395,7 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
{
update_idle_core(rq);
schedstat_inc(rq->sched_goidle);
+ queue_core_balance(rq);
}

#ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 293aa1ae0308..464559676fd2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1089,6 +1089,8 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
void sched_core_adjust_sibling_vruntime(int cpu, bool coresched_enabled);

+extern void queue_core_balance(struct rq *rq);
+
#else /* !CONFIG_SCHED_CORE */

static inline bool sched_core_enabled(struct rq *rq)
@@ -1101,6 +1103,10 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
return &rq->__lock;
}

+static inline void queue_core_balance(struct rq *rq)
+{
+}
+
#endif /* CONFIG_SCHED_CORE */

#ifdef CONFIG_SCHED_SMT
--
2.17.1

2020-06-30 22:08:02

by Vineeth Remanan Pillai

Subject: [RFC PATCH 01/16] sched: Wrap rq::lock access

From: Peter Zijlstra <[email protected]>

In preparation for playing games with rq->lock, abstract the thing
using an accessor.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
---
kernel/sched/core.c | 46 +++++++++---------
kernel/sched/cpuacct.c | 12 ++---
kernel/sched/deadline.c | 18 +++----
kernel/sched/debug.c | 4 +-
kernel/sched/fair.c | 38 +++++++--------
kernel/sched/idle.c | 4 +-
kernel/sched/pelt.h | 2 +-
kernel/sched/rt.c | 8 +--
kernel/sched/sched.h | 105 +++++++++++++++++++++-------------------
kernel/sched/topology.c | 4 +-
10 files changed, 122 insertions(+), 119 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5eccfb816d23..ef594ace6ffd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -85,12 +85,12 @@ struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)

for (;;) {
rq = task_rq(p);
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
rq_pin_lock(rq, rf);
return rq;
}
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));

while (unlikely(task_on_rq_migrating(p)))
cpu_relax();
@@ -109,7 +109,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
for (;;) {
raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
rq = task_rq(p);
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
/*
* move_queued_task() task_rq_lock()
*
@@ -131,7 +131,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
rq_pin_lock(rq, rf);
return rq;
}
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);

while (unlikely(task_on_rq_migrating(p)))
@@ -201,7 +201,7 @@ void update_rq_clock(struct rq *rq)
{
s64 delta;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

if (rq->clock_update_flags & RQCF_ACT_SKIP)
return;
@@ -505,7 +505,7 @@ void resched_curr(struct rq *rq)
struct task_struct *curr = rq->curr;
int cpu;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

if (test_tsk_need_resched(curr))
return;
@@ -529,10 +529,10 @@ void resched_cpu(int cpu)
struct rq *rq = cpu_rq(cpu);
unsigned long flags;

- raw_spin_lock_irqsave(&rq->lock, flags);
+ raw_spin_lock_irqsave(rq_lockp(rq), flags);
if (cpu_online(cpu) || cpu == smp_processor_id())
resched_curr(rq);
- raw_spin_unlock_irqrestore(&rq->lock, flags);
+ raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
}

#ifdef CONFIG_SMP
@@ -947,7 +947,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
struct uclamp_se *uc_se = &p->uclamp[clamp_id];
struct uclamp_bucket *bucket;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

/* Update task effective clamp */
p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
@@ -987,7 +987,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
unsigned int bkt_clamp;
unsigned int rq_clamp;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

bucket = &uc_rq->bucket[uc_se->bucket_id];
SCHED_WARN_ON(!bucket->tasks);
@@ -1472,7 +1472,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
struct task_struct *p, int new_cpu)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
dequeue_task(rq, p, DEQUEUE_NOCLOCK);
@@ -1586,7 +1586,7 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
* Because __kthread_bind() calls this on blocked tasks without
* holding rq->lock.
*/
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));
dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
}
if (running)
@@ -1723,7 +1723,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
* task_rq_lock().
*/
WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
- lockdep_is_held(&task_rq(p)->lock)));
+ lockdep_is_held(rq_lockp(task_rq(p)))));
#endif
/*
* Clearly, migrating tasks to offline CPUs is a fairly daft thing.
@@ -2234,7 +2234,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
{
int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

#ifdef CONFIG_SMP
if (p->sched_contributes_to_load)
@@ -3088,10 +3088,10 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
* do an early lockdep release here:
*/
rq_unpin_lock(rq, rf);
- spin_release(&rq->lock.dep_map, _THIS_IP_);
+ spin_release(&rq_lockp(rq)->dep_map, _THIS_IP_);
#ifdef CONFIG_DEBUG_SPINLOCK
/* this is a valid case when another task releases the spinlock */
- rq->lock.owner = next;
+ rq_lockp(rq)->owner = next;
#endif
}

@@ -3102,8 +3102,8 @@ static inline void finish_lock_switch(struct rq *rq)
* fix up the runqueue lock - which gets 'carried over' from
* prev into current:
*/
- spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
- raw_spin_unlock_irq(&rq->lock);
+ spin_acquire(&rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
+ raw_spin_unlock_irq(rq_lockp(rq));
}

/*
@@ -3253,7 +3253,7 @@ static void __balance_callback(struct rq *rq)
void (*func)(struct rq *rq);
unsigned long flags;

- raw_spin_lock_irqsave(&rq->lock, flags);
+ raw_spin_lock_irqsave(rq_lockp(rq), flags);
head = rq->balance_callback;
rq->balance_callback = NULL;
while (head) {
@@ -3264,7 +3264,7 @@ static void __balance_callback(struct rq *rq)

func(rq);
}
- raw_spin_unlock_irqrestore(&rq->lock, flags);
+ raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
}

static inline void balance_callback(struct rq *rq)
@@ -6034,7 +6034,7 @@ void init_idle(struct task_struct *idle, int cpu)
__sched_fork(0, idle);

raw_spin_lock_irqsave(&idle->pi_lock, flags);
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));

idle->state = TASK_RUNNING;
idle->se.exec_start = sched_clock();
@@ -6071,7 +6071,7 @@ void init_idle(struct task_struct *idle, int cpu)
#ifdef CONFIG_SMP
idle->on_cpu = 1;
#endif
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
raw_spin_unlock_irqrestore(&idle->pi_lock, flags);

/* Set the preempt count _outside_ the spinlocks! */
@@ -6634,7 +6634,7 @@ void __init sched_init(void)
struct rq *rq;

rq = cpu_rq(i);
- raw_spin_lock_init(&rq->lock);
+ raw_spin_lock_init(&rq->__lock);
rq->nr_running = 0;
rq->calc_load_active = 0;
rq->calc_load_update = jiffies + LOAD_FREQ;
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 9fbb10383434..78de28ebc45d 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -111,7 +111,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
/*
* Take rq->lock to make 64-bit read safe on 32-bit platforms.
*/
- raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+ raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
#endif

if (index == CPUACCT_STAT_NSTATS) {
@@ -125,7 +125,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
}

#ifndef CONFIG_64BIT
- raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+ raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
#endif

return data;
@@ -140,14 +140,14 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
/*
* Take rq->lock to make 64-bit write safe on 32-bit platforms.
*/
- raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+ raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
#endif

for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
cpuusage->usages[i] = val;

#ifndef CONFIG_64BIT
- raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+ raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
#endif
}

@@ -252,13 +252,13 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
* Take rq->lock to make 64-bit read safe on 32-bit
* platforms.
*/
- raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+ raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
#endif

seq_printf(m, " %llu", cpuusage->usages[index]);

#ifndef CONFIG_64BIT
- raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+ raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
#endif
}
seq_puts(m, "\n");
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 504d2f51b0d6..34f95462b838 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -80,7 +80,7 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->running_bw;

- lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+ lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
dl_rq->running_bw += dl_bw;
SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
@@ -93,7 +93,7 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->running_bw;

- lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+ lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
dl_rq->running_bw -= dl_bw;
SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
if (dl_rq->running_bw > old)
@@ -107,7 +107,7 @@ void __add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->this_bw;

- lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+ lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
dl_rq->this_bw += dl_bw;
SCHED_WARN_ON(dl_rq->this_bw < old); /* overflow */
}
@@ -117,7 +117,7 @@ void __sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->this_bw;

- lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+ lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
dl_rq->this_bw -= dl_bw;
SCHED_WARN_ON(dl_rq->this_bw > old); /* underflow */
if (dl_rq->this_bw > old)
@@ -927,7 +927,7 @@ static int start_dl_timer(struct task_struct *p)
ktime_t now, act;
s64 delta;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

/*
* We want the timer to fire at the deadline, but considering
@@ -1037,9 +1037,9 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
* If the runqueue is no longer available, migrate the
* task elsewhere. This necessarily changes rq.
*/
- lockdep_unpin_lock(&rq->lock, rf.cookie);
+ lockdep_unpin_lock(rq_lockp(rq), rf.cookie);
rq = dl_task_offline_migration(rq, p);
- rf.cookie = lockdep_pin_lock(&rq->lock);
+ rf.cookie = lockdep_pin_lock(rq_lockp(rq));
update_rq_clock(rq);

/*
@@ -1654,7 +1654,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
* from try_to_wake_up(). Hence, p->pi_lock is locked, but
* rq->lock is not... So, lock it
*/
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
if (p->dl.dl_non_contending) {
sub_running_bw(&p->dl, &rq->dl);
p->dl.dl_non_contending = 0;
@@ -1669,7 +1669,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
put_task_struct(p);
}
sub_rq_bw(&p->dl, &rq->dl);
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
}

static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 239970b991c0..339b729b8828 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -497,7 +497,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "exec_clock",
SPLIT_NS(cfs_rq->exec_clock));

- raw_spin_lock_irqsave(&rq->lock, flags);
+ raw_spin_lock_irqsave(rq_lockp(rq), flags);
if (rb_first_cached(&cfs_rq->tasks_timeline))
MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
last = __pick_last_entity(cfs_rq);
@@ -505,7 +505,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
max_vruntime = last->vruntime;
min_vruntime = cfs_rq->min_vruntime;
rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
- raw_spin_unlock_irqrestore(&rq->lock, flags);
+ raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "MIN_vruntime",
SPLIT_NS(MIN_vruntime));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "min_vruntime",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2ae7e30ccb33..43ff0e5cf387 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1104,7 +1104,7 @@ struct numa_group {
static struct numa_group *deref_task_numa_group(struct task_struct *p)
{
return rcu_dereference_check(p->numa_group, p == current ||
- (lockdep_is_held(&task_rq(p)->lock) && !READ_ONCE(p->on_cpu)));
+ (lockdep_is_held(rq_lockp(task_rq(p))) && !READ_ONCE(p->on_cpu)));
}

static struct numa_group *deref_curr_numa_group(struct task_struct *p)
@@ -5265,7 +5265,7 @@ static void __maybe_unused update_runtime_enabled(struct rq *rq)
{
struct task_group *tg;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

rcu_read_lock();
list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -5284,7 +5284,7 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
{
struct task_group *tg;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

rcu_read_lock();
list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -6749,7 +6749,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
* In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
* rq->lock and can modify state directly.
*/
- lockdep_assert_held(&task_rq(p)->lock);
+ lockdep_assert_held(rq_lockp(task_rq(p)));
detach_entity_cfs_rq(&p->se);

} else {
@@ -7377,7 +7377,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
{
s64 delta;

- lockdep_assert_held(&env->src_rq->lock);
+ lockdep_assert_held(rq_lockp(env->src_rq));

if (p->sched_class != &fair_sched_class)
return 0;
@@ -7471,7 +7471,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
{
int tsk_cache_hot;

- lockdep_assert_held(&env->src_rq->lock);
+ lockdep_assert_held(rq_lockp(env->src_rq));

/*
* We do not migrate tasks that are:
@@ -7549,7 +7549,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
*/
static void detach_task(struct task_struct *p, struct lb_env *env)
{
- lockdep_assert_held(&env->src_rq->lock);
+ lockdep_assert_held(rq_lockp(env->src_rq));

deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
set_task_cpu(p, env->dst_cpu);
@@ -7565,7 +7565,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
{
struct task_struct *p;

- lockdep_assert_held(&env->src_rq->lock);
+ lockdep_assert_held(rq_lockp(env->src_rq));

list_for_each_entry_reverse(p,
&env->src_rq->cfs_tasks, se.group_node) {
@@ -7601,7 +7601,7 @@ static int detach_tasks(struct lb_env *env)
struct task_struct *p;
int detached = 0;

- lockdep_assert_held(&env->src_rq->lock);
+ lockdep_assert_held(rq_lockp(env->src_rq));

if (env->imbalance <= 0)
return 0;
@@ -7716,7 +7716,7 @@ static int detach_tasks(struct lb_env *env)
*/
static void attach_task(struct rq *rq, struct task_struct *p)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

BUG_ON(task_rq(p) != rq);
activate_task(rq, p, ENQUEUE_NOCLOCK);
@@ -9649,7 +9649,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
if (need_active_balance(&env)) {
unsigned long flags;

- raw_spin_lock_irqsave(&busiest->lock, flags);
+ raw_spin_lock_irqsave(rq_lockp(busiest), flags);

/*
* Don't kick the active_load_balance_cpu_stop,
@@ -9657,7 +9657,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
* moved to this_cpu:
*/
if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
- raw_spin_unlock_irqrestore(&busiest->lock,
+ raw_spin_unlock_irqrestore(rq_lockp(busiest),
flags);
env.flags |= LBF_ALL_PINNED;
goto out_one_pinned;
@@ -9673,7 +9673,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
busiest->push_cpu = this_cpu;
active_balance = 1;
}
- raw_spin_unlock_irqrestore(&busiest->lock, flags);
+ raw_spin_unlock_irqrestore(rq_lockp(busiest), flags);

if (active_balance) {
stop_one_cpu_nowait(cpu_of(busiest),
@@ -10418,7 +10418,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
time_before(jiffies, READ_ONCE(nohz.next_blocked)))
return;

- raw_spin_unlock(&this_rq->lock);
+ raw_spin_unlock(rq_lockp(this_rq));
/*
* This CPU is going to be idle and blocked load of idle CPUs
* need to be updated. Run the ilb locally as it is a good
@@ -10427,7 +10427,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
*/
if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
kick_ilb(NOHZ_STATS_KICK);
- raw_spin_lock(&this_rq->lock);
+ raw_spin_lock(rq_lockp(this_rq));
}

#else /* !CONFIG_NO_HZ_COMMON */
@@ -10493,7 +10493,7 @@ int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
goto out;
}

- raw_spin_unlock(&this_rq->lock);
+ raw_spin_unlock(rq_lockp(this_rq));

update_blocked_averages(this_cpu);
rcu_read_lock();
@@ -10534,7 +10534,7 @@ int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
}
rcu_read_unlock();

- raw_spin_lock(&this_rq->lock);
+ raw_spin_lock(rq_lockp(this_rq));

if (curr_cost > this_rq->max_idle_balance_cost)
this_rq->max_idle_balance_cost = curr_cost;
@@ -11010,9 +11010,9 @@ void unregister_fair_sched_group(struct task_group *tg)

rq = cpu_rq(cpu);

- raw_spin_lock_irqsave(&rq->lock, flags);
+ raw_spin_lock_irqsave(rq_lockp(rq), flags);
list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
- raw_spin_unlock_irqrestore(&rq->lock, flags);
+ raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
}
}

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index b743bf38f08f..0106d34f1d8c 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -413,10 +413,10 @@ struct task_struct *pick_next_task_idle(struct rq *rq)
static void
dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
{
- raw_spin_unlock_irq(&rq->lock);
+ raw_spin_unlock_irq(rq_lockp(rq));
printk(KERN_ERR "bad: scheduling from the idle thread!\n");
dump_stack();
- raw_spin_lock_irq(&rq->lock);
+ raw_spin_lock_irq(rq_lockp(rq));
}

/*
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index eb034d9f024d..10a9ef0dbfe6 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -136,7 +136,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)

static inline u64 rq_clock_pelt(struct rq *rq)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));
assert_clock_updated(rq);

return rq->clock_pelt - rq->lost_idle_time;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 6d60ba21ed29..c054d9c07629 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -887,7 +887,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
if (skip)
continue;

- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
update_rq_clock(rq);

if (rt_rq->rt_time) {
@@ -925,7 +925,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)

if (enqueue)
sched_rt_rq_enqueue(rt_rq);
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
}

if (!throttled && (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF))
@@ -2094,9 +2094,9 @@ void rto_push_irq_work_func(struct irq_work *work)
* When it gets updated, a check is made if a push is possible.
*/
if (has_pushable_tasks(rq)) {
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
push_rt_tasks(rq);
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
}

raw_spin_lock(&rd->rto_lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1f58677a8f23..b15feed95027 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -874,7 +874,7 @@ struct uclamp_rq {
*/
struct rq {
/* runqueue lock: */
- raw_spinlock_t lock;
+ raw_spinlock_t __lock;

/*
* nr_running and cpu_load should be in the same cacheline because
@@ -1055,6 +1055,10 @@ static inline int cpu_of(struct rq *rq)
#endif
}

+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+ return &rq->__lock;
+}

#ifdef CONFIG_SCHED_SMT
extern void __update_idle_core(struct rq *rq);
@@ -1122,7 +1126,7 @@ static inline void assert_clock_updated(struct rq *rq)

static inline u64 rq_clock(struct rq *rq)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));
assert_clock_updated(rq);

return rq->clock;
@@ -1130,7 +1134,7 @@ static inline u64 rq_clock(struct rq *rq)

static inline u64 rq_clock_task(struct rq *rq)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));
assert_clock_updated(rq);

return rq->clock_task;
@@ -1156,7 +1160,7 @@ static inline u64 rq_clock_thermal(struct rq *rq)

static inline void rq_clock_skip_update(struct rq *rq)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));
rq->clock_update_flags |= RQCF_REQ_SKIP;
}

@@ -1166,7 +1170,7 @@ static inline void rq_clock_skip_update(struct rq *rq)
*/
static inline void rq_clock_cancel_skipupdate(struct rq *rq)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));
rq->clock_update_flags &= ~RQCF_REQ_SKIP;
}

@@ -1185,7 +1189,7 @@ struct rq_flags {

static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
{
- rf->cookie = lockdep_pin_lock(&rq->lock);
+ rf->cookie = lockdep_pin_lock(rq_lockp(rq));

#ifdef CONFIG_SCHED_DEBUG
rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
@@ -1200,12 +1204,12 @@ static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
rf->clock_update_flags = RQCF_UPDATED;
#endif

- lockdep_unpin_lock(&rq->lock, rf->cookie);
+ lockdep_unpin_lock(rq_lockp(rq), rf->cookie);
}

static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
{
- lockdep_repin_lock(&rq->lock, rf->cookie);
+ lockdep_repin_lock(rq_lockp(rq), rf->cookie);

#ifdef CONFIG_SCHED_DEBUG
/*
@@ -1226,7 +1230,7 @@ static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
__releases(rq->lock)
{
rq_unpin_lock(rq, rf);
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
}

static inline void
@@ -1235,7 +1239,7 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
__releases(p->pi_lock)
{
rq_unpin_lock(rq, rf);
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
}

@@ -1243,7 +1247,7 @@ static inline void
rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
__acquires(rq->lock)
{
- raw_spin_lock_irqsave(&rq->lock, rf->flags);
+ raw_spin_lock_irqsave(rq_lockp(rq), rf->flags);
rq_pin_lock(rq, rf);
}

@@ -1251,7 +1255,7 @@ static inline void
rq_lock_irq(struct rq *rq, struct rq_flags *rf)
__acquires(rq->lock)
{
- raw_spin_lock_irq(&rq->lock);
+ raw_spin_lock_irq(rq_lockp(rq));
rq_pin_lock(rq, rf);
}

@@ -1259,7 +1263,7 @@ static inline void
rq_lock(struct rq *rq, struct rq_flags *rf)
__acquires(rq->lock)
{
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
rq_pin_lock(rq, rf);
}

@@ -1267,7 +1271,7 @@ static inline void
rq_relock(struct rq *rq, struct rq_flags *rf)
__acquires(rq->lock)
{
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
rq_repin_lock(rq, rf);
}

@@ -1276,7 +1280,7 @@ rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
__releases(rq->lock)
{
rq_unpin_lock(rq, rf);
- raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
+ raw_spin_unlock_irqrestore(rq_lockp(rq), rf->flags);
}

static inline void
@@ -1284,7 +1288,7 @@ rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
__releases(rq->lock)
{
rq_unpin_lock(rq, rf);
- raw_spin_unlock_irq(&rq->lock);
+ raw_spin_unlock_irq(rq_lockp(rq));
}

static inline void
@@ -1292,7 +1296,7 @@ rq_unlock(struct rq *rq, struct rq_flags *rf)
__releases(rq->lock)
{
rq_unpin_lock(rq, rf);
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
}

static inline struct rq *
@@ -1357,7 +1361,7 @@ queue_balance_callback(struct rq *rq,
struct callback_head *head,
void (*func)(struct rq *rq))
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

if (unlikely(head->next))
return;
@@ -2047,7 +2051,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
__acquires(busiest->lock)
__acquires(this_rq->lock)
{
- raw_spin_unlock(&this_rq->lock);
+ raw_spin_unlock(rq_lockp(this_rq));
double_rq_lock(this_rq, busiest);

return 1;
@@ -2066,20 +2070,22 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
__acquires(busiest->lock)
__acquires(this_rq->lock)
{
- int ret = 0;
-
- if (unlikely(!raw_spin_trylock(&busiest->lock))) {
- if (busiest < this_rq) {
- raw_spin_unlock(&this_rq->lock);
- raw_spin_lock(&busiest->lock);
- raw_spin_lock_nested(&this_rq->lock,
- SINGLE_DEPTH_NESTING);
- ret = 1;
- } else
- raw_spin_lock_nested(&busiest->lock,
- SINGLE_DEPTH_NESTING);
+ if (rq_lockp(this_rq) == rq_lockp(busiest))
+ return 0;
+
+ if (likely(raw_spin_trylock(rq_lockp(busiest))))
+ return 0;
+
+ if (rq_lockp(busiest) >= rq_lockp(this_rq)) {
+ raw_spin_lock_nested(rq_lockp(busiest), SINGLE_DEPTH_NESTING);
+ return 0;
}
- return ret;
+
+ raw_spin_unlock(rq_lockp(this_rq));
+ raw_spin_lock(rq_lockp(busiest));
+ raw_spin_lock_nested(rq_lockp(this_rq), SINGLE_DEPTH_NESTING);
+
+ return 1;
}

#endif /* CONFIG_PREEMPTION */
@@ -2089,11 +2095,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
*/
static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
{
- if (unlikely(!irqs_disabled())) {
- /* printk() doesn't work well under rq->lock */
- raw_spin_unlock(&this_rq->lock);
- BUG_ON(1);
- }
+ lockdep_assert_irqs_disabled();

return _double_lock_balance(this_rq, busiest);
}
@@ -2101,8 +2103,9 @@ static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
__releases(busiest->lock)
{
- raw_spin_unlock(&busiest->lock);
- lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);
+ if (rq_lockp(this_rq) != rq_lockp(busiest))
+ raw_spin_unlock(rq_lockp(busiest));
+ lock_set_subclass(&rq_lockp(this_rq)->dep_map, 0, _RET_IP_);
}

static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
@@ -2143,16 +2146,16 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
__acquires(rq2->lock)
{
BUG_ON(!irqs_disabled());
- if (rq1 == rq2) {
- raw_spin_lock(&rq1->lock);
+ if (rq_lockp(rq1) == rq_lockp(rq2)) {
+ raw_spin_lock(rq_lockp(rq1));
__acquire(rq2->lock); /* Fake it out ;) */
} else {
- if (rq1 < rq2) {
- raw_spin_lock(&rq1->lock);
- raw_spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
+ if (rq_lockp(rq1) < rq_lockp(rq2)) {
+ raw_spin_lock(rq_lockp(rq1));
+ raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
} else {
- raw_spin_lock(&rq2->lock);
- raw_spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
+ raw_spin_lock(rq_lockp(rq2));
+ raw_spin_lock_nested(rq_lockp(rq1), SINGLE_DEPTH_NESTING);
}
}
}
@@ -2167,9 +2170,9 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
__releases(rq1->lock)
__releases(rq2->lock)
{
- raw_spin_unlock(&rq1->lock);
- if (rq1 != rq2)
- raw_spin_unlock(&rq2->lock);
+ raw_spin_unlock(rq_lockp(rq1));
+ if (rq_lockp(rq1) != rq_lockp(rq2))
+ raw_spin_unlock(rq_lockp(rq2));
else
__release(rq2->lock);
}
@@ -2192,7 +2195,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
{
BUG_ON(!irqs_disabled());
BUG_ON(rq1 != rq2);
- raw_spin_lock(&rq1->lock);
+ raw_spin_lock(rq_lockp(rq1));
__acquire(rq2->lock); /* Fake it out ;) */
}

@@ -2207,7 +2210,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
__releases(rq2->lock)
{
BUG_ON(rq1 != rq2);
- raw_spin_unlock(&rq1->lock);
+ raw_spin_unlock(rq_lockp(rq1));
__release(rq2->lock);
}

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8344757bba6e..502db3357504 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -450,7 +450,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
struct root_domain *old_rd = NULL;
unsigned long flags;

- raw_spin_lock_irqsave(&rq->lock, flags);
+ raw_spin_lock_irqsave(rq_lockp(rq), flags);

if (rq->rd) {
old_rd = rq->rd;
@@ -476,7 +476,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
set_rq_online(rq);

- raw_spin_unlock_irqrestore(&rq->lock, flags);
+ raw_spin_unlock_irqrestore(rq_lockp(rq), flags);

if (old_rd)
call_rcu(&old_rd->rcu, free_rootdomain);
--
2.17.1

2020-06-30 22:08:21

by Vineeth Remanan Pillai

Subject: [RFC PATCH 16/16] sched: Debug bits...

From: Peter Zijlstra <[email protected]>

Not-Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/core.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2ec56970d6bb..0362102fa3d2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -105,6 +105,10 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)

int pa = __task_prio(a), pb = __task_prio(b);

+ trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+ a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+ b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
if (-pa < -pb)
return true;

@@ -302,12 +306,16 @@ static void __sched_core_enable(void)

static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+ printk("core sched enabled\n");
}

static void __sched_core_disable(void)
{
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
+
+ printk("core sched disabled\n");
}

void sched_core_get(void)
@@ -4477,6 +4485,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
put_prev_task(rq, prev);
set_next_task(rq, next);
}
+
+ trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
+ rq->core->core_task_seq,
+ rq->core->core_pick_seq,
+ rq->core_sched_seq,
+ next->comm, next->pid,
+ next->core_cookie);
+
return next;
}

@@ -4551,6 +4567,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
*/
if (i == cpu && !need_sync && !p->core_cookie) {
next = p;
+ trace_printk("unconstrained pick: %s/%d %lx\n",
+ next->comm, next->pid, next->core_cookie);
+
goto done;
}

@@ -4559,6 +4578,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

rq_i->core_pick = p;

+ trace_printk("cpu(%d): selected: %s/%d %lx\n",
+ i, p->comm, p->pid, p->core_cookie);
+
/*
* If this new candidate is of higher priority than the
* previous; and they're incompatible; we need to wipe
@@ -4575,6 +4597,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
rq->core->core_cookie = p->core_cookie;
max = p;

+ trace_printk("max: %s/%d %lx\n", max->comm, max->pid, max->core_cookie);
+
if (old_max) {
for_each_cpu(j, smt_mask) {
if (j == i)
@@ -4602,6 +4626,7 @@ next_class:;
rq->core->core_pick_seq = rq->core->core_task_seq;
next = rq->core_pick;
rq->core_sched_seq = rq->core->core_pick_seq;
+ trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);

/*
* Reschedule siblings
@@ -4624,11 +4649,20 @@ next_class:;
if (i == cpu)
continue;

- if (rq_i->curr != rq_i->core_pick)
+ if (rq_i->curr != rq_i->core_pick) {
+ trace_printk("IPI(%d)\n", i);
resched_curr(rq_i);
+ }

/* Did we break L1TF mitigation requirements? */
- WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+ if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+ trace_printk("[%d]: cookie mismatch. %s/%d/0x%lx/0x%lx\n",
+ rq_i->cpu, rq_i->core_pick->comm,
+ rq_i->core_pick->pid,
+ rq_i->core_pick->core_cookie,
+ rq_i->core->core_cookie);
+ WARN_ON_ONCE(1);
+ }
}

done:
@@ -4667,6 +4701,10 @@ static bool try_steal_cookie(int this, int that)
if (p->core_occupation > dst->idle->core_occupation)
goto next;

+ trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n",
+ p->comm, p->pid, that, this,
+ p->core_occupation, dst->idle->core_occupation, cookie);
+
p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(src, p, 0);
set_task_cpu(p, this);
@@ -7305,6 +7343,8 @@ int sched_cpu_starting(unsigned int cpu)
WARN_ON_ONCE(rq->core && rq->core != core_rq);
rq->core = core_rq;
}
+
+ printk("core: %d -> %d\n", cpu, cpu_of(core_rq));
#endif /* CONFIG_SCHED_CORE */

sched_rq_cpu_starting(cpu);
--
2.17.1

2020-06-30 22:09:03

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 02/16] sched: Introduce sched_class::pick_task()

From: Peter Zijlstra <[email protected]>

Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance()), it is not state-invariant. This makes it unsuitable
for remote task selection.
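
As a rough sketch (not part of this patch; the helper and its use are
hypothetical), a core-wide selection path could query a sibling runqueue
through the new callback without committing to the pick, since pick_task()
has no set_next_task()/put_prev_task()/newidle_balance() side effects:

    static struct task_struct *pick_candidate(struct rq *rq_i)
    {
            const struct sched_class *class;
            struct task_struct *p;

            /* Highest-class runnable candidate on rq_i; no rq state is modified. */
            for_each_class(class) {
                    if (!class->pick_task)          /* ->pick_task is SMP-only */
                            continue;
                    p = class->pick_task(rq_i);
                    if (p)
                            return p;
            }

            return NULL;
    }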

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
---
kernel/sched/deadline.c | 16 ++++++++++++++--
kernel/sched/fair.c | 36 +++++++++++++++++++++++++++++++++---
kernel/sched/idle.c | 8 ++++++++
kernel/sched/rt.c | 14 ++++++++++++--
kernel/sched/sched.h | 3 +++
kernel/sched/stop_task.c | 13 +++++++++++--
6 files changed, 81 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 34f95462b838..b56ef74d2d74 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1775,7 +1775,7 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
return rb_entry(left, struct sched_dl_entity, rb_node);
}

-static struct task_struct *pick_next_task_dl(struct rq *rq)
+static struct task_struct *pick_task_dl(struct rq *rq)
{
struct sched_dl_entity *dl_se;
struct dl_rq *dl_rq = &rq->dl;
@@ -1787,7 +1787,18 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
dl_se = pick_next_dl_entity(rq, dl_rq);
BUG_ON(!dl_se);
p = dl_task_of(dl_se);
- set_next_task_dl(rq, p, true);
+
+ return p;
+}
+
+static struct task_struct *pick_next_task_dl(struct rq *rq)
+{
+ struct task_struct *p;
+
+ p = pick_task_dl(rq);
+ if (p)
+ set_next_task_dl(rq, p, true);
+
return p;
}

@@ -2444,6 +2455,7 @@ const struct sched_class dl_sched_class = {

#ifdef CONFIG_SMP
.balance = balance_dl,
+ .pick_task = pick_task_dl,
.select_task_rq = select_task_rq_dl,
.migrate_task_rq = migrate_task_rq_dl,
.set_cpus_allowed = set_cpus_allowed_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43ff0e5cf387..5e9f11c8256f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4428,7 +4428,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
* Avoid running the skip buddy, if running something else can
* be done without getting too unfair.
*/
- if (cfs_rq->skip == se) {
+ if (cfs_rq->skip && cfs_rq->skip == se) {
struct sched_entity *second;

if (se == curr) {
@@ -4446,13 +4446,13 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
/*
* Prefer last buddy, try to return the CPU to a preempted task.
*/
- if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
+ if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
se = cfs_rq->last;

/*
* Someone really wants this to run. If it's not unfair, run it.
*/
- if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+ if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
se = cfs_rq->next;

clear_buddies(cfs_rq, se);
@@ -6953,6 +6953,35 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
set_last_buddy(se);
}

+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_fair(struct rq *rq)
+{
+ struct cfs_rq *cfs_rq = &rq->cfs;
+ struct sched_entity *se;
+
+ if (!cfs_rq->nr_running)
+ return NULL;
+
+ do {
+ struct sched_entity *curr = cfs_rq->curr;
+
+ se = pick_next_entity(cfs_rq, NULL);
+
+ if (curr) {
+ if (se && curr->on_rq)
+ update_curr(cfs_rq);
+
+ if (!se || entity_before(curr, se))
+ se = curr;
+ }
+
+ cfs_rq = group_cfs_rq(se);
+ } while (cfs_rq);
+
+ return task_of(se);
+}
+#endif
+
struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
@@ -11134,6 +11163,7 @@ const struct sched_class fair_sched_class = {

#ifdef CONFIG_SMP
.balance = balance_fair,
+ .pick_task = pick_task_fair,
.select_task_rq = select_task_rq_fair,
.migrate_task_rq = migrate_task_rq_fair,

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 0106d34f1d8c..a8d40ffab097 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -397,6 +397,13 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
schedstat_inc(rq->sched_goidle);
}

+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_idle(struct rq *rq)
+{
+ return rq->idle;
+}
+#endif
+
struct task_struct *pick_next_task_idle(struct rq *rq)
{
struct task_struct *next = rq->idle;
@@ -469,6 +476,7 @@ const struct sched_class idle_sched_class = {

#ifdef CONFIG_SMP
.balance = balance_idle,
+ .pick_task = pick_task_idle,
.select_task_rq = select_task_rq_idle,
.set_cpus_allowed = set_cpus_allowed_common,
#endif
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index c054d9c07629..69c77370767d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1624,7 +1624,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
return rt_task_of(rt_se);
}

-static struct task_struct *pick_next_task_rt(struct rq *rq)
+static struct task_struct *pick_task_rt(struct rq *rq)
{
struct task_struct *p;

@@ -1632,7 +1632,16 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)
return NULL;

p = _pick_next_task_rt(rq);
- set_next_task_rt(rq, p, true);
+
+ return p;
+}
+
+static struct task_struct *pick_next_task_rt(struct rq *rq)
+{
+ struct task_struct *p = pick_task_rt(rq);
+ if (p)
+ set_next_task_rt(rq, p, true);
+
return p;
}

@@ -2443,6 +2452,7 @@ const struct sched_class rt_sched_class = {

#ifdef CONFIG_SMP
.balance = balance_rt,
+ .pick_task = pick_task_rt,
.select_task_rq = select_task_rq_rt,
.set_cpus_allowed = set_cpus_allowed_common,
.rq_online = rq_online_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b15feed95027..a63c3115d212 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1769,6 +1769,9 @@ struct sched_class {

#ifdef CONFIG_SMP
int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+
+ struct task_struct * (*pick_task)(struct rq *rq);
+
int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
void (*migrate_task_rq)(struct task_struct *p, int new_cpu);

diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 4c9e9975684f..0611348edb28 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -34,15 +34,23 @@ static void set_next_task_stop(struct rq *rq, struct task_struct *stop, bool fir
stop->se.exec_start = rq_clock_task(rq);
}

-static struct task_struct *pick_next_task_stop(struct rq *rq)
+static struct task_struct *pick_task_stop(struct rq *rq)
{
if (!sched_stop_runnable(rq))
return NULL;

- set_next_task_stop(rq, rq->stop, true);
return rq->stop;
}

+static struct task_struct *pick_next_task_stop(struct rq *rq)
+{
+ struct task_struct *p = pick_task_stop(rq);
+ if (p)
+ set_next_task_stop(rq, p, true);
+
+ return p;
+}
+
static void
enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
{
@@ -130,6 +138,7 @@ const struct sched_class stop_sched_class = {

#ifdef CONFIG_SMP
.balance = balance_stop,
+ .pick_task = pick_task_stop,
.select_task_rq = select_task_rq_stop,
.set_cpus_allowed = set_cpus_allowed_common,
#endif
--
2.17.1

2020-06-30 22:09:14

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 05/16] sched: Basic tracking of matching tasks

From: Peter Zijlstra <[email protected]>

Introduce task_struct::core_cookie as an opaque identifier for core
scheduling. When enabled, core scheduling will only allow matching
tasks to be on the core, where idle matches everything.

When task_struct::core_cookie is set (and core scheduling is enabled),
these tasks are indexed in a second RB-tree, first on cookie value and
then on scheduling function, such that matching task selection always
finds the most eligible match.

NOTE: *shudder* at the overhead...

NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
is per class tracking of cookies and that just duplicates a lot of
stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
---
include/linux/sched.h | 8 ++-
kernel/sched/core.c | 146 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 46 -------------
kernel/sched/sched.h | 55 ++++++++++++++++
4 files changed, 208 insertions(+), 47 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5cb8324..3c8dcc5ff039 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -683,10 +683,16 @@ struct task_struct {
const struct sched_class *sched_class;
struct sched_entity se;
struct sched_rt_entity rt;
+ struct sched_dl_entity dl;
+
+#ifdef CONFIG_SCHED_CORE
+ struct rb_node core_node;
+ unsigned long core_cookie;
+#endif
+
#ifdef CONFIG_CGROUP_SCHED
struct task_group *sched_task_group;
#endif
- struct sched_dl_entity dl;

#ifdef CONFIG_UCLAMP_TASK
/* Clamp values requested for a scheduling entity */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4b81301e3f21..b21bcab20da6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -77,6 +77,141 @@ int sysctl_sched_rt_runtime = 950000;

DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);

+/* kernel prio, less is more */
+static inline int __task_prio(struct task_struct *p)
+{
+ if (p->sched_class == &stop_sched_class) /* trumps deadline */
+ return -2;
+
+ if (rt_prio(p->prio)) /* includes deadline */
+ return p->prio; /* [-1, 99] */
+
+ if (p->sched_class == &idle_sched_class)
+ return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
+
+ return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
+}
+
+/*
+ * l(a,b)
+ * le(a,b) := !l(b,a)
+ * g(a,b) := l(b,a)
+ * ge(a,b) := !l(a,b)
+ */
+
+/* real prio, less is less */
+static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+{
+
+ int pa = __task_prio(a), pb = __task_prio(b);
+
+ if (-pa < -pb)
+ return true;
+
+ if (-pb < -pa)
+ return false;
+
+ if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
+ return !dl_time_before(a->dl.deadline, b->dl.deadline);
+
+ if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
+ u64 vruntime = b->se.vruntime;
+
+ /*
+ * Normalize the vruntime if tasks are in different cpus.
+ */
+ if (task_cpu(a) != task_cpu(b)) {
+ vruntime -= task_cfs_rq(b)->min_vruntime;
+ vruntime += task_cfs_rq(a)->min_vruntime;
+ }
+
+ return !((s64)(a->se.vruntime - vruntime) <= 0);
+ }
+
+ return false;
+}
+
+static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
+{
+ if (a->core_cookie < b->core_cookie)
+ return true;
+
+ if (a->core_cookie > b->core_cookie)
+ return false;
+
+ /* flip prio, so high prio is leftmost */
+ if (prio_less(b, a))
+ return true;
+
+ return false;
+}
+
+static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+{
+ struct rb_node *parent, **node;
+ struct task_struct *node_task;
+
+ rq->core->core_task_seq++;
+
+ if (!p->core_cookie)
+ return;
+
+ node = &rq->core_tree.rb_node;
+ parent = *node;
+
+ while (*node) {
+ node_task = container_of(*node, struct task_struct, core_node);
+ parent = *node;
+
+ if (__sched_core_less(p, node_task))
+ node = &parent->rb_left;
+ else
+ node = &parent->rb_right;
+ }
+
+ rb_link_node(&p->core_node, parent, node);
+ rb_insert_color(&p->core_node, &rq->core_tree);
+}
+
+static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+{
+ rq->core->core_task_seq++;
+
+ if (!p->core_cookie)
+ return;
+
+ rb_erase(&p->core_node, &rq->core_tree);
+}
+
+/*
+ * Find left-most (aka, highest priority) task matching @cookie.
+ */
+static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
+{
+ struct rb_node *node = rq->core_tree.rb_node;
+ struct task_struct *node_task, *match;
+
+ /*
+ * The idle task always matches any cookie!
+ */
+ match = idle_sched_class.pick_task(rq);
+
+ while (node) {
+ node_task = container_of(node, struct task_struct, core_node);
+
+ if (cookie < node_task->core_cookie) {
+ node = node->rb_left;
+ } else if (cookie > node_task->core_cookie) {
+ node = node->rb_right;
+ } else {
+ match = node_task;
+ node = node->rb_left;
+ }
+ }
+
+ return match;
+}
+
/*
* The static-key + stop-machine variable are needed such that:
*
@@ -135,6 +270,11 @@ void sched_core_put(void)
mutex_unlock(&sched_core_mutex);
}

+#else /* !CONFIG_SCHED_CORE */
+
+static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
+static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+
#endif /* CONFIG_SCHED_CORE */

/*
@@ -1347,6 +1487,9 @@ static inline void init_uclamp(void) { }

static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
+ if (sched_core_enabled(rq))
+ sched_core_enqueue(rq, p);
+
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);

@@ -1361,6 +1504,9 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)

static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
{
+ if (sched_core_enabled(rq))
+ sched_core_dequeue(rq, p);
+
if (!(flags & DEQUEUE_NOCLOCK))
update_rq_clock(rq);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e44a43b87975..ae17507533a0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -260,33 +260,11 @@ const struct sched_class fair_sched_class;
*/

#ifdef CONFIG_FAIR_GROUP_SCHED
-static inline struct task_struct *task_of(struct sched_entity *se)
-{
- SCHED_WARN_ON(!entity_is_task(se));
- return container_of(se, struct task_struct, se);
-}

/* Walk up scheduling entities hierarchy */
#define for_each_sched_entity(se) \
for (; se; se = se->parent)

-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
- return p->se.cfs_rq;
-}
-
-/* runqueue on which this entity is (to be) queued */
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
- return se->cfs_rq;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
- return grp->my_q;
-}
-
static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
{
if (!path)
@@ -447,33 +425,9 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)

#else /* !CONFIG_FAIR_GROUP_SCHED */

-static inline struct task_struct *task_of(struct sched_entity *se)
-{
- return container_of(se, struct task_struct, se);
-}
-
#define for_each_sched_entity(se) \
for (; se; se = NULL)

-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
- return &task_rq(p)->cfs;
-}
-
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
- struct task_struct *p = task_of(se);
- struct rq *rq = task_rq(p);
-
- return &rq->cfs;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
- return NULL;
-}
-
static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
{
if (path)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 66e586adee18..c85c5a4bc21f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1033,6 +1033,10 @@ struct rq {
/* per rq */
struct rq *core;
unsigned int core_enabled;
+ struct rb_root core_tree;
+
+ /* shared state */
+ unsigned int core_task_seq;
#endif
};

@@ -1112,6 +1116,57 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() raw_cpu_ptr(&runqueues)

+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+ SCHED_WARN_ON(!entity_is_task(se));
+ return container_of(se, struct task_struct, se);
+}
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+ return p->se.cfs_rq;
+}
+
+/* runqueue on which this entity is (to be) queued */
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+ return se->cfs_rq;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+ return grp->my_q;
+}
+
+#else
+
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+ return container_of(se, struct task_struct, se);
+}
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+ return &task_rq(p)->cfs;
+}
+
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+ struct task_struct *p = task_of(se);
+ struct rq *rq = task_rq(p);
+
+ return &rq->cfs;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+ return NULL;
+}
+#endif
+
extern void update_rq_clock(struct rq *rq);

static inline u64 __rq_clock_broken(struct rq *rq)
--
2.17.1

2020-06-30 22:09:24

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 12/16] sched: cgroup tagging interface for core scheduling

From: Peter Zijlstra <[email protected]>

Marks all tasks in a cgroup as matching for core-scheduling.

A task will need to be moved into the core scheduler queue when the cgroup
it belongs to is tagged to run with core scheduling. Similarly, the task
will need to be moved out of the core scheduler queue when the cgroup
is untagged.

Also, after a task is forked, its presence in the core scheduler queue
will need to be updated according to its new cgroup's status.

Use the stop machine mechanism to update all tasks in a cgroup, to prevent a
new task from sneaking into the cgroup and missing the update while we
iterate through all the tasks in the cgroup. A more complicated scheme
could probably avoid stop machine. Such a scheme would also need to
resolve inconsistencies between a task's cgroup core scheduling tag and
its residency in the core scheduler queue.

We are opting for the simple stop machine mechanism for now, as it avoids
such complications.

The core scheduler has extra overhead. Enable it only for cores with
more than one SMT hardware thread.
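
For illustration only (the interface is not final, and the mount point and
group name below are assumptions rather than anything mandated by this patch),
tagging a group from userspace with the cgroup v1 cpu controller looks roughly
like:

    # example mount point and group name
    mkdir /sys/fs/cgroup/cpu/trusted_group
    echo $PID > /sys/fs/cgroup/cpu/trusted_group/tasks
    echo 1 > /sys/fs/cgroup/cpu/trusted_group/cpu.tag   # tasks in the group trust each other
    echo 0 > /sys/fs/cgroup/cpu/trusted_group/cpu.tag   # untag the group again

Writing a value greater than 1 fails with -ERANGE, and tagging is rejected
with -EINVAL on systems without SMT.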

Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
---
kernel/sched/core.c | 183 +++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 4 +
2 files changed, 180 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fb9edb09ead7..c84f209b8591 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -135,6 +135,37 @@ static inline bool __sched_core_less(struct task_struct *a, struct task_struct *
return false;
}

+static bool sched_core_empty(struct rq *rq)
+{
+ return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static bool sched_core_enqueued(struct task_struct *task)
+{
+ return !RB_EMPTY_NODE(&task->core_node);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+ struct task_struct *task;
+
+ task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
+ return task;
+}
+
+static void sched_core_flush(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ struct task_struct *task;
+
+ while (!sched_core_empty(rq)) {
+ task = sched_core_first(rq);
+ rb_erase(&task->core_node, &rq->core_tree);
+ RB_CLEAR_NODE(&task->core_node);
+ }
+ rq->core->core_task_seq++;
+}
+
static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
{
struct rb_node *parent, **node;
@@ -166,10 +197,11 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
{
rq->core->core_task_seq++;

- if (!p->core_cookie)
+ if (!sched_core_enqueued(p))
return;

rb_erase(&p->core_node, &rq->core_tree);
+ RB_CLEAR_NODE(&p->core_node);
}

/*
@@ -235,9 +267,23 @@ static int __sched_core_stopper(void *data)

for_each_possible_cpu(cpu) {
struct rq *rq = cpu_rq(cpu);
- rq->core_enabled = enabled;
- if (cpu_online(cpu) && rq->core != rq)
- sched_core_adjust_sibling_vruntime(cpu, enabled);
+
+ WARN_ON_ONCE(enabled == rq->core_enabled);
+
+ if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
+ /*
+ * All active and migrating tasks will have already
+ * been removed from core queue when we clear the
+ * cgroup tags. However, dying tasks could still be
+ * left in core queue. Flush them here.
+ */
+ if (!enabled)
+ sched_core_flush(cpu);
+
+ rq->core_enabled = enabled;
+ if (cpu_online(cpu) && rq->core != rq)
+ sched_core_adjust_sibling_vruntime(cpu, enabled);
+ }
}

return 0;
@@ -248,7 +294,11 @@ static int sched_core_count;

static void __sched_core_enable(void)
{
- // XXX verify there are no cookie tasks (yet)
+ int cpu;
+
+ /* verify there are no cookie tasks (yet) */
+ for_each_online_cpu(cpu)
+ BUG_ON(!sched_core_empty(cpu_rq(cpu)));

static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
@@ -256,8 +306,6 @@ static void __sched_core_enable(void)

static void __sched_core_disable(void)
{
- // XXX verify there are no cookie tasks (left)
-
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
}
@@ -282,6 +330,7 @@ void sched_core_put(void)

static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static bool sched_core_enqueued(struct task_struct *task) { return false; }

#endif /* CONFIG_SCHED_CORE */

@@ -3114,6 +3163,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
#ifdef CONFIG_SMP
plist_node_init(&p->pushable_tasks, MAX_PRIO);
RB_CLEAR_NODE(&p->pushable_dl_tasks);
+#endif
+#ifdef CONFIG_SCHED_CORE
+ RB_CLEAR_NODE(&p->core_node);
#endif
return 0;
}
@@ -6674,6 +6726,9 @@ void init_idle(struct task_struct *idle, int cpu)
#ifdef CONFIG_SMP
sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
#endif
+#ifdef CONFIG_SCHED_CORE
+ RB_CLEAR_NODE(&idle->core_node);
+#endif
}

#ifdef CONFIG_SMP
@@ -7646,6 +7701,15 @@ static void sched_change_group(struct task_struct *tsk, int type)
tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
struct task_group, css);
tg = autogroup_task_group(tsk, tg);
+
+#ifdef CONFIG_SCHED_CORE
+ if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
+ tsk->core_cookie = 0UL;
+
+ if (tg->tagged /* && !tsk->core_cookie ? */)
+ tsk->core_cookie = (unsigned long)tg;
+#endif
+
tsk->sched_task_group = tg;

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -7738,6 +7802,18 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
return 0;
}

+static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+#ifdef CONFIG_SCHED_CORE
+ struct task_group *tg = css_tg(css);
+
+ if (tg->tagged) {
+ sched_core_put();
+ tg->tagged = 0;
+ }
+#endif
+}
+
static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
{
struct task_group *tg = css_tg(css);
@@ -8301,6 +8377,82 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
}
#endif /* CONFIG_RT_GROUP_SCHED */

+#ifdef CONFIG_SCHED_CORE
+static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+ struct task_group *tg = css_tg(css);
+
+ return !!tg->tagged;
+}
+
+struct write_core_tag {
+ struct cgroup_subsys_state *css;
+ int val;
+};
+
+static int __sched_write_tag(void *data)
+{
+ struct write_core_tag *tag = (struct write_core_tag *) data;
+ struct cgroup_subsys_state *css = tag->css;
+ int val = tag->val;
+ struct task_group *tg = css_tg(tag->css);
+ struct css_task_iter it;
+ struct task_struct *p;
+
+ tg->tagged = !!val;
+
+ css_task_iter_start(css, 0, &it);
+ /*
+ * Note: css_task_iter_next will skip dying tasks.
+ * There could still be dying tasks left in the core queue
+ * when we set cgroup tag to 0 when the loop is done below.
+ */
+ while ((p = css_task_iter_next(&it))) {
+ p->core_cookie = !!val ? (unsigned long)tg : 0UL;
+
+ if (sched_core_enqueued(p)) {
+ sched_core_dequeue(task_rq(p), p);
+ if (!p->core_cookie)
+ continue;
+ }
+
+ if (sched_core_enabled(task_rq(p)) &&
+ p->core_cookie && task_on_rq_queued(p))
+ sched_core_enqueue(task_rq(p), p);
+
+ }
+ css_task_iter_end(&it);
+
+ return 0;
+}
+
+static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+ struct task_group *tg = css_tg(css);
+ struct write_core_tag wtag;
+
+ if (val > 1)
+ return -ERANGE;
+
+ if (!static_branch_likely(&sched_smt_present))
+ return -EINVAL;
+
+ if (tg->tagged == !!val)
+ return 0;
+
+ if (!!val)
+ sched_core_get();
+
+ wtag.css = css;
+ wtag.val = val;
+ stop_machine(__sched_write_tag, (void *) &wtag, NULL);
+ if (!val)
+ sched_core_put();
+
+ return 0;
+}
+#endif
+
static struct cftype cpu_legacy_files[] = {
#ifdef CONFIG_FAIR_GROUP_SCHED
{
@@ -8337,6 +8489,14 @@ static struct cftype cpu_legacy_files[] = {
.write_u64 = cpu_rt_period_write_uint,
},
#endif
+#ifdef CONFIG_SCHED_CORE
+ {
+ .name = "tag",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_core_tag_read_u64,
+ .write_u64 = cpu_core_tag_write_u64,
+ },
+#endif
#ifdef CONFIG_UCLAMP_TASK_GROUP
{
.name = "uclamp.min",
@@ -8510,6 +8670,14 @@ static struct cftype cpu_files[] = {
.write_s64 = cpu_weight_nice_write_s64,
},
#endif
+#ifdef CONFIG_SCHED_CORE
+ {
+ .name = "tag",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_core_tag_read_u64,
+ .write_u64 = cpu_core_tag_write_u64,
+ },
+#endif
#ifdef CONFIG_CFS_BANDWIDTH
{
.name = "max",
@@ -8538,6 +8706,7 @@ static struct cftype cpu_files[] = {
struct cgroup_subsys cpu_cgrp_subsys = {
.css_alloc = cpu_cgroup_css_alloc,
.css_online = cpu_cgroup_css_online,
+ .css_offline = cpu_cgroup_css_offline,
.css_released = cpu_cgroup_css_released,
.css_free = cpu_cgroup_css_free,
.css_extra_stat_show = cpu_extra_stat_show,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 875796d43fca..c4b4640fcdc8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -366,6 +366,10 @@ struct cfs_bandwidth {
struct task_group {
struct cgroup_subsys_state css;

+#ifdef CONFIG_SCHED_CORE
+ int tagged;
+#endif
+
#ifdef CONFIG_FAIR_GROUP_SCHED
/* schedulable entities of this group on each CPU */
struct sched_entity **se;
--
2.17.1

2020-06-30 22:09:25

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 04/16] sched/fair: Add a few assertions

From: Peter Zijlstra <[email protected]>

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/fair.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5e9f11c8256f..e44a43b87975 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6203,6 +6203,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
}

symmetric:
+ /*
+ * per-cpu select_idle_mask usage
+ */
+ lockdep_assert_irqs_disabled();
+
if (available_idle_cpu(target) || sched_idle_cpu(target))
return target;

@@ -6644,8 +6649,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
* certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
*
* Returns the target CPU number.
- *
- * preempt must be disabled.
*/
static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
@@ -6656,6 +6659,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
int want_affine = 0;
int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);

+ /*
+ * required for stable ->cpus_allowed
+ */
+ lockdep_assert_held(&p->pi_lock);
+
if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);

--
2.17.1

2020-06-30 22:10:22

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 07/16] sched/fair: Fix forced idle sibling starvation corner case

From: vpillai <[email protected]>

If there is only one long-running local task and the sibling is
forced idle, it might not get a chance to run until a schedule
event happens on any cpu in the core.

So we check for this condition during a tick to see if a sibling
is starved and then give it a chance to schedule.

Signed-off-by: Vineeth Remanan Pillai <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
---
kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae17507533a0..49fb93296e35 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10613,6 +10613,40 @@ static void rq_offline_fair(struct rq *rq)

#endif /* CONFIG_SMP */

+#ifdef CONFIG_SCHED_CORE
+static inline bool
+__entity_slice_used(struct sched_entity *se)
+{
+ return (se->sum_exec_runtime - se->prev_sum_exec_runtime) >
+ sched_slice(cfs_rq_of(se), se);
+}
+
+/*
+ * If runqueue has only one task which used up its slice and if the sibling
+ * is forced idle, then trigger schedule to give forced idle task a chance.
+ */
+static void resched_forceidle_sibling(struct rq *rq, struct sched_entity *se)
+{
+ int cpu = cpu_of(rq), sibling_cpu;
+
+ if (rq->cfs.nr_running > 1 || !__entity_slice_used(se))
+ return;
+
+ for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) {
+ struct rq *sibling_rq;
+ if (sibling_cpu == cpu)
+ continue;
+ if (cpu_is_offline(sibling_cpu))
+ continue;
+
+ sibling_rq = cpu_rq(sibling_cpu);
+ if (sibling_rq->core_forceidle) {
+ resched_curr(sibling_rq);
+ }
+ }
+}
+#endif
+
/*
* scheduler tick hitting a task of our scheduling class.
*
@@ -10636,6 +10670,11 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

update_misfit_status(curr, rq);
update_overutilized_status(task_rq(curr));
+
+#ifdef CONFIG_SCHED_CORE
+ if (sched_core_enabled(rq))
+ resched_forceidle_sibling(rq, &curr->se);
+#endif
}

/*
--
2.17.1

2020-06-30 22:11:24

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 13/16] sched: Fix pick_next_task() race condition in core scheduling

From: Chen Yu <[email protected]>

As Peter mentioned, commit 6e2df0581f56 ("sched: Fix pick_next_task()
vs 'change' pattern race") fixed a race condition caused by rq->lock
being improperly released after put_prev_task(); backport this fix to core
scheduling's pick_next_task() as well.

Without this fix, Aubrey, Long and I hit a NULL pointer dereference within
one hour when running RDT MBA (Intel Resource Director Technology Memory
Bandwidth Allocation) benchmarks on a 36-core (72 HTs) platform; the
crash tries to dereference a NULL sched_entity:

[ 3618.429053] BUG: kernel NULL pointer dereference, address: 0000000000000160
[ 3618.429039] RIP: 0010:pick_task_fair+0x2e/0xa0
[ 3618.429042] RSP: 0018:ffffc90000317da8 EFLAGS: 00010046
[ 3618.429044] RAX: 0000000000000000 RBX: ffff88afdf4ad100 RCX: 0000000000000001
[ 3618.429045] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88afdf4ad100
[ 3618.429045] RBP: ffffc90000317dc0 R08: 0000000000000048 R09: 0100000000100000
[ 3618.429046] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[ 3618.429047] R13: 000000000002d080 R14: ffff88afdf4ad080 R15: 0000000000000014
[ 3618.429048] ? pick_task_fair+0x48/0xa0
[ 3618.429048] pick_next_task+0x34c/0x7e0
[ 3618.429049] ? tick_program_event+0x44/0x70
[ 3618.429049] __schedule+0xee/0x5d0
[ 3618.429050] schedule_idle+0x2c/0x40
[ 3618.429051] do_idle+0x175/0x280
[ 3618.429051] cpu_startup_entry+0x1d/0x30
[ 3618.429052] start_secondary+0x169/0x1c0
[ 3618.429052] secondary_startup_64+0xa4/0xb0

With this patch applied, no NULL pointer exception has been observed in 14
hours of testing so far. Although there is no direct evidence that this fix
solves the issue, it does fix a potential race condition.

Signed-off-by: Chen Yu <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
---
kernel/sched/core.c | 44 +++++++++++++++++++++++++-------------------
kernel/sched/fair.c | 9 ++++++---
kernel/sched/sched.h | 7 -------
3 files changed, 31 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c84f209b8591..ede86fb37b4e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4169,6 +4169,29 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt)
schedstat_inc(this_rq()->sched_count);
}

+static inline void
+finish_prev_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+#ifdef CONFIG_SMP
+ const struct sched_class *class;
+
+ /*
+ * We must do the balancing pass before put_next_task(), such
+ * that when we release the rq->lock the task is in the same
+ * state as before we took rq->lock.
+ *
+ * We can terminate the balance pass as soon as we know there is
+ * a runnable task of @class priority or higher.
+ */
+ for_class_range(class, prev->sched_class, &idle_sched_class) {
+ if (class->balance(rq, prev, rf))
+ break;
+ }
+#endif
+
+ put_prev_task(rq, prev);
+}
+
/*
* Pick up the highest-prio task:
*/
@@ -4202,22 +4225,7 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
}

restart:
-#ifdef CONFIG_SMP
- /*
- * We must do the balancing pass before put_next_task(), such
- * that when we release the rq->lock the task is in the same
- * state as before we took rq->lock.
- *
- * We can terminate the balance pass as soon as we know there is
- * a runnable task of @class priority or higher.
- */
- for_class_range(class, prev->sched_class, &idle_sched_class) {
- if (class->balance(rq, prev, rf))
- break;
- }
-#endif
-
- put_prev_task(rq, prev);
+ finish_prev_task(rq, prev, rf);

for_each_class(class) {
p = class->pick_next_task(rq);
@@ -4323,9 +4331,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
return next;
}

- prev->sched_class->put_prev_task(rq, prev);
- if (!rq->nr_running)
- newidle_balance(rq, rf);
+ finish_prev_task(rq, prev, rf);

cpu = cpu_of(rq);
smt_mask = cpu_smt_mask(cpu);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33dc4bf01817..435b460d3c3f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -115,6 +115,7 @@ int __weak arch_asym_cpu_priority(int cpu)
*/
#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)

+static int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
#endif

#ifdef CONFIG_CFS_BANDWIDTH
@@ -7116,9 +7117,11 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
struct cfs_rq *cfs_rq = &rq->cfs;
struct sched_entity *se;
struct task_struct *p;
+#ifdef CONFIG_SMP
int new_tasks;

again:
+#endif
if (!sched_fair_runnable(rq))
goto idle;

@@ -7232,6 +7235,7 @@ done: __maybe_unused;
if (!rf)
return NULL;

+#ifdef CONFIG_SMP
new_tasks = newidle_balance(rq, rf);

/*
@@ -7244,6 +7248,7 @@ done: __maybe_unused;

if (new_tasks > 0)
goto again;
+#endif

/*
* rq is about to be idle, check if we need to update the
@@ -8750,10 +8755,8 @@ static int idle_cpu_without(int cpu, struct task_struct *p)
* be computed and tested before calling idle_cpu_without().
*/

-#ifdef CONFIG_SMP
if (!llist_empty(&rq->wake_list))
return 0;
-#endif

return 1;
}
@@ -10636,7 +10639,7 @@ static inline void nohz_newidle_balance(struct rq *this_rq) { }
* 0 - failed, no new tasks
* > 0 - success, new (fair) tasks present
*/
-int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
+static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
{
unsigned long next_balance = jiffies + HZ;
int this_cpu = this_rq->cpu;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c4b4640fcdc8..a5450886c4e4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1640,15 +1640,8 @@ static inline void unregister_sched_domain_sysctl(void)
{
}
#endif
-
-extern int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
-
#else
-
static inline void sched_ttwu_pending(void) { }
-
-static inline int newidle_balance(struct rq *this_rq, struct rq_flags *rf) { return 0; }
-
#endif /* CONFIG_SMP */

#include "stats.h"
--
2.17.1

2020-06-30 22:11:29

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 08/16] sched/fair: wrapper for cfs_rq->min_vruntime

From: Aaron Lu <[email protected]>

Add a wrapper function cfs_rq_min_vruntime(cfs_rq) to
return cfs_rq->min_vruntime.

It will be used in the following patch, no functionality
change.

Signed-off-by: Aaron Lu <[email protected]>
---
kernel/sched/fair.c | 27 ++++++++++++++++-----------
1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 49fb93296e35..61d19e573443 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -462,6 +462,11 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)

#endif /* CONFIG_FAIR_GROUP_SCHED */

+static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->min_vruntime;
+}
+
static __always_inline
void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);

@@ -498,7 +503,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
struct sched_entity *curr = cfs_rq->curr;
struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);

- u64 vruntime = cfs_rq->min_vruntime;
+ u64 vruntime = cfs_rq_min_vruntime(cfs_rq);

if (curr) {
if (curr->on_rq)
@@ -518,7 +523,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
}

/* ensure we never gain time by being placed backwards. */
- cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
+ cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
#ifndef CONFIG_64BIT
smp_wmb();
cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
@@ -4026,7 +4031,7 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
#ifdef CONFIG_SCHED_DEBUG
- s64 d = se->vruntime - cfs_rq->min_vruntime;
+ s64 d = se->vruntime - cfs_rq_min_vruntime(cfs_rq);

if (d < 0)
d = -d;
@@ -4039,7 +4044,7 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
- u64 vruntime = cfs_rq->min_vruntime;
+ u64 vruntime = cfs_rq_min_vruntime(cfs_rq);

/*
* The 'current' period is already promised to the current tasks,
@@ -4133,7 +4138,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* update_curr().
*/
if (renorm && curr)
- se->vruntime += cfs_rq->min_vruntime;
+ se->vruntime += cfs_rq_min_vruntime(cfs_rq);

update_curr(cfs_rq);

@@ -4144,7 +4149,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* fairness detriment of existing tasks.
*/
if (renorm && !curr)
- se->vruntime += cfs_rq->min_vruntime;
+ se->vruntime += cfs_rq_min_vruntime(cfs_rq);

/*
* When enqueuing a sched_entity, we must:
@@ -4263,7 +4268,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* can move min_vruntime forward still more.
*/
if (!(flags & DEQUEUE_SLEEP))
- se->vruntime -= cfs_rq->min_vruntime;
+ se->vruntime -= cfs_rq_min_vruntime(cfs_rq);

/* return excess runtime on last dequeue */
return_cfs_rq_runtime(cfs_rq);
@@ -6700,7 +6705,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
min_vruntime = cfs_rq->min_vruntime;
} while (min_vruntime != min_vruntime_copy);
#else
- min_vruntime = cfs_rq->min_vruntime;
+ min_vruntime = cfs_rq_min_vruntime(cfs_rq);
#endif

se->vruntime -= min_vruntime;
@@ -10709,7 +10714,7 @@ static void task_fork_fair(struct task_struct *p)
resched_curr(rq);
}

- se->vruntime -= cfs_rq->min_vruntime;
+ se->vruntime -= cfs_rq_min_vruntime(cfs_rq);
rq_unlock(rq, &rf);
}

@@ -10832,7 +10837,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
* cause 'unlimited' sleep bonus.
*/
place_entity(cfs_rq, se, 0);
- se->vruntime -= cfs_rq->min_vruntime;
+ se->vruntime -= cfs_rq_min_vruntime(cfs_rq);
}

detach_entity_cfs_rq(se);
@@ -10846,7 +10851,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
attach_entity_cfs_rq(se);

if (!vruntime_normalized(p))
- se->vruntime += cfs_rq->min_vruntime;
+ se->vruntime += cfs_rq_min_vruntime(cfs_rq);
}

static void switched_from_fair(struct rq *rq, struct task_struct *p)
--
2.17.1

2020-06-30 22:12:02

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 03/16] sched: Core-wide rq->lock

From: Peter Zijlstra <[email protected]>

Introduce the basic infrastructure to have a core-wide rq->lock.
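
A minimal sketch of the resulting calling convention (the function below is
illustrative, not part of the patch): call sites take whatever lock
rq_lockp() resolves to, so the same code works whether the lock is per-rq
or core-wide:

    static void example_rq_update(struct rq *rq)
    {
            unsigned long flags;

            /* Resolves to &rq->core->__lock when core scheduling is enabled. */
            raw_spin_lock_irqsave(rq_lockp(rq), flags);
            /* ... update runqueue state ... */
            raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
    }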

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
---
kernel/Kconfig.preempt | 6 +++
kernel/sched/core.c | 91 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 31 ++++++++++++++
3 files changed, 128 insertions(+)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index bf82259cff96..4488fbf4d3a8 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -80,3 +80,9 @@ config PREEMPT_COUNT
config PREEMPTION
bool
select PREEMPT_COUNT
+
+config SCHED_CORE
+ bool "Core Scheduling for SMT"
+ default y
+ depends on SCHED_SMT
+
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ef594ace6ffd..4b81301e3f21 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -73,6 +73,70 @@ __read_mostly int scheduler_running;
*/
int sysctl_sched_rt_runtime = 950000;

+#ifdef CONFIG_SCHED_CORE
+
+DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+/*
+ * The static-key + stop-machine variable are needed such that:
+ *
+ * spin_lock(rq_lockp(rq));
+ * ...
+ * spin_unlock(rq_lockp(rq));
+ *
+ * ends up locking and unlocking the _same_ lock, and all CPUs
+ * always agree on what rq has what lock.
+ *
+ * XXX entirely possible to selectively enable cores, don't bother for now.
+ */
+static int __sched_core_stopper(void *data)
+{
+ bool enabled = !!(unsigned long)data;
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ cpu_rq(cpu)->core_enabled = enabled;
+
+ return 0;
+}
+
+static DEFINE_MUTEX(sched_core_mutex);
+static int sched_core_count;
+
+static void __sched_core_enable(void)
+{
+ // XXX verify there are no cookie tasks (yet)
+
+ static_branch_enable(&__sched_core_enabled);
+ stop_machine(__sched_core_stopper, (void *)true, NULL);
+}
+
+static void __sched_core_disable(void)
+{
+ // XXX verify there are no cookie tasks (left)
+
+ stop_machine(__sched_core_stopper, (void *)false, NULL);
+ static_branch_disable(&__sched_core_enabled);
+}
+
+void sched_core_get(void)
+{
+ mutex_lock(&sched_core_mutex);
+ if (!sched_core_count++)
+ __sched_core_enable();
+ mutex_unlock(&sched_core_mutex);
+}
+
+void sched_core_put(void)
+{
+ mutex_lock(&sched_core_mutex);
+ if (!--sched_core_count)
+ __sched_core_disable();
+ mutex_unlock(&sched_core_mutex);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
/*
* __task_rq_lock - lock the rq @p resides on.
*/
@@ -6475,6 +6539,28 @@ static void sched_rq_cpu_starting(unsigned int cpu)

int sched_cpu_starting(unsigned int cpu)
{
+#ifdef CONFIG_SCHED_CORE
+ const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+ struct rq *rq, *core_rq = NULL;
+ int i;
+
+ for_each_cpu(i, smt_mask) {
+ rq = cpu_rq(i);
+ if (rq->core && rq->core == rq)
+ core_rq = rq;
+ }
+
+ if (!core_rq)
+ core_rq = cpu_rq(cpu);
+
+ for_each_cpu(i, smt_mask) {
+ rq = cpu_rq(i);
+
+ WARN_ON_ONCE(rq->core && rq->core != core_rq);
+ rq->core = core_rq;
+ }
+#endif /* CONFIG_SCHED_CORE */
+
sched_rq_cpu_starting(cpu);
sched_tick_start(cpu);
return 0;
@@ -6696,6 +6782,11 @@ void __init sched_init(void)
#endif /* CONFIG_SMP */
hrtick_rq_init(rq);
atomic_set(&rq->nr_iowait, 0);
+
+#ifdef CONFIG_SCHED_CORE
+ rq->core = NULL;
+ rq->core_enabled = 0;
+#endif
}

set_load_weight(&init_task, false);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a63c3115d212..66e586adee18 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1028,6 +1028,12 @@ struct rq {
/* Must be inspected within a rcu lock section */
struct cpuidle_state *idle_state;
#endif
+
+#ifdef CONFIG_SCHED_CORE
+ /* per rq */
+ struct rq *core;
+ unsigned int core_enabled;
+#endif
};

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1055,11 +1061,36 @@ static inline int cpu_of(struct rq *rq)
#endif
}

+#ifdef CONFIG_SCHED_CORE
+DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+ return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
+}
+
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+ if (sched_core_enabled(rq))
+ return &rq->core->__lock;
+
+ return &rq->__lock;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+ return false;
+}
+
static inline raw_spinlock_t *rq_lockp(struct rq *rq)
{
return &rq->__lock;
}

+#endif /* CONFIG_SCHED_CORE */
+
#ifdef CONFIG_SCHED_SMT
extern void __update_idle_core(struct rq *rq);

--
2.17.1

2020-06-30 22:12:25

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 15/16] Documentation: Add documentation on core scheduling

From: "Joel Fernandes (Google)" <[email protected]>

Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
---
.../admin-guide/hw-vuln/core-scheduling.rst | 241 ++++++++++++++++++
Documentation/admin-guide/hw-vuln/index.rst | 1 +
2 files changed, 242 insertions(+)
create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index 000000000000..275568162a74
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,241 @@
+Core Scheduling
+================
+MDS and L1TF mitigations do not protect from cross-HT attacks (attacker running
+on one HT with victim running on another). For proper mitigation of this,
+core scheduling support is available via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that trust each other.
+The core scheduler uses this information to make sure that tasks that do not
+trust each other will never run simultaneously on a core, while trying to
+maintain scheduler properties and requirements.
+
+Usage
+-----
+The user interface to this feature is not yet finalized. The current implementation
+uses the CPU controller cgroup. Core scheduling adds a ``cpu.tag`` file to the CPU
+controller CGroup. If the content of this file is 1, then all the tasks in this
+CGroup trust each other and are allowed to run concurrently on the siblings of
+a core.
+
+This interface has drawbacks. Trusted tasks have to be grouped into one CPU CGroup
+and this is not always possible based on the system's existing cgroup configuration,
+where trusted tasks could already be in different CPU Cgroups. Also, this feature
+will have a hard dependency on CGroups, and systems with CGroups disabled would not
+be able to use core scheduling. See `Future work`_ for other API proposals.
+
+Design/Implementation
+---------------------
+Tasks are grouped as mentioned in `Usage`_ and tasks that trust each other
+share the same cookie value (in task_struct).
+
+The basic idea is that every schedule event tries to select tasks for all the
+siblings of a core such that all the selected tasks are trusted (same cookie).
+
+During a schedule event on any sibling of a core, the highest priority task for
+that core is picked and assigned to the sibling which has it enqueued. For the
+rest of the siblings in the core, the highest priority task with the same cookie
+is selected if there is one runnable in the run queue. If a task with the same
+cookie is not available, the idle task is selected. The idle task is globally trusted.
+
+Once a task has been selected for all the siblings in the core, an IPI is sent to
+all the siblings that have a new task selected. On receiving the IPI, the siblings
+will switch to the new task immediately.
+
+Force-idling of tasks
+---------------------
+The scheduler tries its best to find tasks that trust each other such that all
+tasks selected to be scheduled are of the highest priority in that runqueue.
+However, this is not always possible. Favoring security over fairness, one or
+more siblings could be forced to select a lower priority task if the highest
+priority task is not trusted with respect to the core wide highest priority task.
+If a sibling does not have a trusted task to run, it will be forced idle by the
+scheduler (the idle thread is scheduled to run).
+
+When the highest priority task is selected to run, a reschedule-IPI is sent to
+the sibling to force it into idle. This results in 4 cases which need to be
+considered depending on whether a VM or a regular usermode process was running
+on either HT:
+
+::
+
+          HT1 (attack)          HT2 (victim)
+
+   A      idle -> user space    user space -> idle
+
+   B      idle -> user space    guest -> idle
+
+   C      idle -> guest         user space -> idle
+
+   D      idle -> guest         guest -> idle
+
+Note that for better performance, we do not wait for the destination CPU
+(victim) to enter idle mode. This is because the sending of the IPI would
+bring the destination CPU immediately into kernel mode from user space, or
+VMEXIT from guests. At best, this would only leak some scheduler metadata which
+may not be worth protecting.
+
+Protection against interrupts using IRQ pausing
+-----------------------------------------------
+The scheduler on its own cannot protect interrupt data. This is because the
+scheduler is unaware of interrupts at scheduling time. To mitigate this, we
+send an IPI to siblings on IRQ entry. This IPI handler busy-waits until the IRQ
+on the sending HT exits. For good performance, we send an IPI only if it is
+detected that the core is running tasks that have been marked for
+core scheduling. Both interrupts and softirqs are protected.
+
+This protection can be disabled by disabling ``CONFIG_SCHED_CORE_IRQ_PAUSE`` or
+through the ``sched_core_irq_pause`` boot parameter.
+
+If it is desired to disable IRQ pausing, other mitigation methods can be considered:
+
+1. Changing interrupt affinities to a trusted core which does not execute untrusted tasks
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+By changing the interrupt affinities to a designated safe-CPU which runs
+only trusted tasks, IRQ data can be protected. One issue is this involves
+giving up a full CPU core of the system to run safe tasks. Another is that
+per-cpu interrupts such as the local timer interrupt cannot have their
+affinity changed. Also, sensitive timer callbacks such as the random entropy timer
+can run in softirq on return from these interrupts and expose sensitive
+data. In the future, that could be mitigated by forcing softirqs into threaded
+mode by utilizing a mechanism similar to ``PREEMPT_RT``.
+
+Yet another issue is that, for multiqueue devices with managed
+interrupts, the IRQ affinities cannot be changed; however, it could be
+possible to force a reduced number of queues, which would in turn allow
+shielding one or two CPUs from such interrupts and queue handling, for the
+price of indirection.
+
+2. Running IRQs as threaded-IRQs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+This would result in forcing IRQs into the scheduler which would then provide
+the process-context mitigation. However, not all interrupts can be threaded.
+
+Trust model
+-----------
+Core scheduling understands trust relationships by assignment of a cookie to
+related tasks using the above mentioned interface. When a system with core
+scheduling boots, all tasks are considered to trust each other. This is because
+the scheduler does not have information about trust relationships. That is, all
+tasks have a default cookie value of 0. This cookie value is also considered
+the system-wide cookie value and the IRQ-pausing mitigation is avoided if
+siblings are running these cookie-0 tasks.
+
+By default, all system processes on boot are considered trusted and userspace
+has to explicitly use the interfaces mentioned above to group sets of tasks.
+Tasks within the group trust each other, but not those outside. Tasks outside
+the group don't trust the tasks inside.
+
+Limitations
+-----------
+Core scheduling tries to guarantee that only trusted tasks run concurrently on a
+core. But there could be a small window of time during which untrusted tasks run
+concurrently, or the kernel could be running concurrently with a task not trusted
+by the kernel.
+
+1. IPI processing delays
+^^^^^^^^^^^^^^^^^^^^^^^^
+Core scheduling selects only trusted tasks to run together. An IPI is used to
+notify the siblings to switch to the new task. But there could be hardware delays
+in receiving the IPI on some architectures (on x86, this has not been observed).
+This may cause an attacker task to start running on a CPU before its siblings
+receive the IPI. Even though the cache is flushed on entry to user mode, victim
+tasks on siblings may populate data in the cache and micro-architectural buffers
+after the attacker starts to run, and this is a possible data leak.
+
+2. Asynchronous Kernel entries
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+A task can switch to kernel mode at any time due to events like IRQs, system
+calls, etc. Since core scheduling synchronizes only during a schedule event, the
+kernel can run alongside a task that it doesn't trust. The IRQ pause mechanism
+mentioned above provides protection during NMIs/IRQs/softirqs. But tasks could
+still enter kernel mode via system calls, and this is not currently protected.
+
+There are ideas about mitigating this:
+ - Kernel Address Space Isolation: System calls could run in a much restricted
+ address space which is guaranteed not to leak any sensitive data. There are
+ practical limitations in implementing this - the main concern being how to
+ decide on an address space that is guaranteed not to have any sensitive
+ data.
+ - On a system call, change the cookie to the system trusted cookie and initiate
+ a schedule event. This would be better than pausing all the siblings during
+ the entire duration of the system call, but would still be a big performance
+ hit.
+
+Open cross-HT issues that core scheduling does not solve
+--------------------------------------------------------
+1. For MDS
+^^^^^^^^^^
+Core scheduling cannot protect against MDS attacks between an HT running in
+user mode and another running in kernel mode. Even though both HTs run tasks
+which trust each other, kernel memory is still considered untrusted. Such
+attacks are possible for any combination of sibling CPU modes (host or guest mode).
+
+2. For L1TF
+^^^^^^^^^^^
+Core scheduling cannot protect against an L1TF guest attacker exploiting a
+guest or host victim. This is because the guest attacker can craft invalid
+PTEs which are not inverted due to a vulnerable guest kernel. The only
+solution is to disable EPT.
+
+For both MDS and L1TF, if the guest vCPUs are configured to not trust each
+other (by tagging them separately), then guest-to-guest attacks would go away.
+Or it could be a system admin policy which considers guest-to-guest attacks as
+a guest problem.
+
+Another approach to resolve these would be to make every untrusted task on the
+system not trust every other untrusted task. While this could reduce
+parallelism of the untrusted tasks, it would still solve the above issues while
+allowing system processes (trusted tasks) to share a core.
+
+Use cases
+---------
+The main use case for core scheduling is mitigating the cross-HT vulnerabilities
+with SMT enabled. There are other use cases where this feature could be used:
+
+- Isolating tasks that need a whole core: Examples include realtime tasks, tasks
+ that use SIMD instructions, etc.
+- Gang scheduling: Requirements for a group of tasks that need to be scheduled
+ together could also be realized using core scheduling. One example is vCPUs of
+ a VM.
+
+Future work
+-----------
+1. API Proposals
+^^^^^^^^^^^^^^^^
+
+As mentioned in `Usage`_ section, various API proposals are listed here:
+
+- ``prctl`` : We can pass in a tag and all tasks with the same tag set by prctl
+  form a trusted group (a hypothetical usage sketch follows this list).
+
+- ``sched_setattr`` : Similar to prctl, but has the advantage that tasks could be
+ tagged by other tasks with appropriate permissions.
+
+- ``Auto Tagging`` : Related tasks are tagged automatically. The relation could
+  be threads of the same process, tasks owned by a user, a group or a session etc.
+
+- Dedicated cgroup or procfs/sysfs interface for grouping trusted tasks. This could
+ be combined with prctl/sched_setattr as well.
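+
+A purely illustrative user-space example of the ``prctl`` variant; the command
+name and its value are made up for this sketch, no such interface exists yet::
+
+    #include <stdio.h>
+    #include <sys/prctl.h>
+
+    /* Made-up prctl command number, for illustration only. */
+    #define PR_SET_CORE_SCHED_COOKIE  1000
+
+    int main(void)
+    {
+            /*
+             * Tasks setting the same cookie value would form one trusted
+             * group and be allowed to run concurrently on a core.
+             */
+            if (prctl(PR_SET_CORE_SCHED_COOKIE, 0x1234, 0, 0, 0) == -1)
+                    perror("prctl");
+            return 0;
+    }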
+
+2. Auto-tagging of KVM vCPU threads
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+To make configuration easier, it would be great if KVM auto-tags vCPU threads
+such that a given VM only trusts other vCPUs of the same VM. Or something more
+aggressive like assigning each vCPU thread a unique tag.
+
+3. Auto-tagging of processes by default
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Currently core scheduling does not prevent 'unconfigured' tasks from being
+co-scheduled on the same core. In other words, everything trusts everything
+else by default. If a user wants everything default untrusted, a CONFIG option
+could be added to assign every task with a unique tag by default.
+
+4. Auto-tagging on fork
+^^^^^^^^^^^^^^^^^^^^^^^
+Currently, on fork a thread is added to the same trust-domain as the parent. For
+systems which want all tasks to have a unique tag, it could be desirable to
+assign the child a unique tag so that the parent does not trust the child by
+default.
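+
+A hypothetical sketch of what this could look like at fork time; the config
+option name is made up and the choice of cookie value is only illustrative::
+
+    #ifdef CONFIG_SCHED_CORE_DEFAULT_UNTRUSTED      /* made-up option */
+            /*
+             * Give the child its own cookie so that parent and child do not
+             * trust each other unless user space explicitly tags them together.
+             */
+            p->core_cookie = (unsigned long)p;
+    #endif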
+
+5. Skipping per-HT mitigations if task is trusted
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+If core scheduling is enabled, by default all tasks trust each other as
+mentioned above. In such a scenario, it may be desirable to skip the same-HT
+mitigations on return to the trusted user-mode to improve performance.
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index ca4dbdd9016d..f12cda55538b 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -15,3 +15,4 @@ are configurable at compile, boot or run time.
tsx_async_abort
multihit.rst
special-register-buffer-data-sampling.rst
+ core-scheduling.rst
--
2.17.1

2020-06-30 23:27:22

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: [RFC PATCH 11/16] sched: migration changes for core scheduling

From: Aubrey Li <[email protected]>

- Don't migrate if there is a cookie mismatch
  Load balancing tries to move a task from the busiest CPU to the
  destination CPU. When core scheduling is enabled, if the
  task's cookie does not match the destination CPU's
  core cookie, this task will be skipped by this CPU. This
  mitigates the forced idle time on the destination CPU.

- Select a cookie-matched idle CPU
  In the fast path of task wakeup, select the first cookie-matched
  idle CPU instead of the first idle CPU.

- Find a cookie-matched idlest CPU
  In the slow path of task wakeup, find the idlest CPU whose core
  cookie matches the task's cookie.

- Don't migrate a task if its cookie does not match
  For NUMA load balancing, don't migrate a task to a CPU whose
  core cookie does not match the task's cookie.

Signed-off-by: Aubrey Li <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
---
kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 29 ++++++++++++++++++++
2 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d16939766361..33dc4bf01817 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
continue;

+#ifdef CONFIG_SCHED_CORE
+ /*
+ * Skip this cpu if source task's cookie does not match
+ * with CPU's core cookie.
+ */
+ if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+ continue;
+#endif
+
env->dst_cpu = cpu;
if (task_numa_compare(env, taskimp, groupimp, maymove))
break;
@@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this

/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+ struct rq *rq = cpu_rq(i);
+
+#ifdef CONFIG_SCHED_CORE
+ if (!sched_core_cookie_match(rq, p))
+ continue;
+#endif
+
if (sched_idle_cpu(i))
return i;

if (available_idle_cpu(i)) {
- struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
/*
@@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
return -1;
- if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
- break;
+
+ if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
+#ifdef CONFIG_SCHED_CORE
+ /*
+ * If Core Scheduling is enabled, select this cpu
+ * only if the process cookie matches core cookie.
+ */
+ if (sched_core_enabled(cpu_rq(cpu)) &&
+ p->core_cookie == cpu_rq(cpu)->core->core_cookie)
+#endif
+ break;
+ }
}

time = cpu_clock(this) - time;
@@ -7609,8 +7634,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
* We do not migrate tasks that are:
* 1) throttled_lb_pair, or
* 2) cannot be migrated to this CPU due to cpus_ptr, or
- * 3) running (obviously), or
- * 4) are cache-hot on their current CPU.
+ * 3) task's cookie does not match with this CPU's core cookie
+ * 4) running (obviously), or
+ * 5) are cache-hot on their current CPU.
*/
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
@@ -7645,6 +7671,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
return 0;
}

+#ifdef CONFIG_SCHED_CORE
+ /*
+ * Don't migrate task if the task's cookie does not match
+ * with the destination CPU's core cookie.
+ */
+ if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+ return 0;
+#endif
+
/* Record that we found atleast one task that could run on dst_cpu */
env->flags &= ~LBF_ALL_PINNED;

@@ -8857,6 +8892,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
p->cpus_ptr))
continue;

+#ifdef CONFIG_SCHED_CORE
+ if (sched_core_enabled(cpu_rq(this_cpu))) {
+ int i = 0;
+ bool cookie_match = false;
+
+ for_each_cpu(i, sched_group_span(group)) {
+ struct rq *rq = cpu_rq(i);
+
+ if (sched_core_cookie_match(rq, p)) {
+ cookie_match = true;
+ break;
+ }
+ }
+ /* Skip over this group if no cookie matched */
+ if (!cookie_match)
+ continue;
+ }
+#endif
+
local_group = cpumask_test_cpu(this_cpu,
sched_group_span(group));

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 464559676fd2..875796d43fca 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1089,6 +1089,35 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
void sched_core_adjust_sibling_vruntime(int cpu, bool coresched_enabled);

+/*
+ * Helper to check if the CPU's core cookie matches with the task's cookie
+ * when core scheduling is enabled.
+ * A special case is that the task's cookie always matches with CPU's core
+ * cookie if the CPU is in an idle core.
+ */
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+ bool idle_core = true;
+ int cpu;
+
+ /* Ignore cookie match if core scheduler is not enabled on the CPU. */
+ if (!sched_core_enabled(rq))
+ return true;
+
+ for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
+ if (!available_idle_cpu(cpu)) {
+ idle_core = false;
+ break;
+ }
+ }
+
+ /*
+ * A CPU in an idle core is always the best choice for tasks with
+ * cookies.
+ */
+ return idle_core || rq->core->core_cookie == p->core_cookie;
+}
+
extern void queue_core_balance(struct rq *rq);

#else /* !CONFIG_SCHED_CORE */
--
2.17.1

2020-07-10 12:23:04

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq

Hi Joel/Vineeth,

On 2020/7/1 5:32, Vineeth Remanan Pillai wrote:
> From: "Joel Fernandes (Google)" <[email protected]>
>
> With current core scheduling patchset, non-threaded IRQ and softirq
> victims can leak data from its hyperthread to a sibling hyperthread
> running an attacker.
>
> For MDS, it is possible for the IRQ and softirq handlers to leak data to
> either host or guest attackers. For L1TF, it is possible to leak to
> guest attackers. There is no possible mitigation involving flushing of
> buffers to avoid this since the execution of attacker and victims happen
> concurrently on 2 or more HTs.
>
> The solution in this patch is to monitor the outer-most core-wide
> irq_enter() and irq_exit() executed by any sibling. In between these
> two, we mark the core to be in a special core-wide IRQ state.
>
> In the IRQ entry, if we detect that the sibling is running untrusted
> code, we send a reschedule IPI so that the sibling transitions through
> the sibling's irq_exit() to do any waiting there, till the IRQ being
> protected finishes.
>
> We also monitor the per-CPU outer-most irq_exit(). If during the per-cpu
> outer-most irq_exit(), the core is still in the special core-wide IRQ
> state, we perform a busy-wait till the core exits this state. This
> combination of per-cpu and core-wide IRQ states helps to handle any
> combination of irq_entry()s and irq_exit()s happening on all of the
> siblings of the core in any order.
>
> Lastly, we also check in the schedule loop if we are about to schedule
> an untrusted process while the core is in such a state. This is possible
> if a trusted thread enters the scheduler by way of yielding CPU. This
> would involve no transitions through the irq_exit() point to do any
> waiting, so we have to explicitly do the waiting there.
>
> Every attempt is made to prevent a busy-wait unnecessarily, and in
> testing on real-world ChromeOS usecases, it has not shown a performance
> drop. In ChromeOS, with this and the rest of the core scheduling
> patchset, we see around a 300% improvement in key press latencies into
> Google docs when Camera streaming is running simultaneously (90th
> percentile latency of ~150ms drops to ~50ms).
>
> This feature is controlled by the build-time config option
> CONFIG_SCHED_CORE_IRQ_PAUSE and is enabled by default. There is also a
> kernel boot parameter 'sched_core_irq_pause' to enable/disable the
> feature at boot time. Default is enabled at boot time.

We saw a lot of soft lockups on the screen when we tested v6.

[ 186.527883] watchdog: BUG: soft lockup - CPU#86 stuck for 22s! [uperf:5551]
[ 186.535884] watchdog: BUG: soft lockup - CPU#87 stuck for 22s! [uperf:5444]
[ 186.555883] watchdog: BUG: soft lockup - CPU#89 stuck for 22s! [uperf:5547]
[ 187.547884] rcu: INFO: rcu_sched self-detected stall on CPU
[ 187.553760] rcu: 40-....: (14997 ticks this GP) idle=49a/1/0x4000000000000002 softirq=1711/1711 fqs=7279
[ 187.564685] NMI watchdog: Watchdog detected hard LOCKUP on cpu 14
[ 187.564723] NMI watchdog: Watchdog detected hard LOCKUP on cpu 38

The problem is gone when we reverted this patch. We are running multiple
uperf threads(equal to cpu number) in a cgroup with coresched enabled.
This is 100% reproducible on our side.

Just wonder if anything already known before we dig into it.

Thanks,
-Aubrey




2020-07-10 13:23:58

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq

On Fri, Jul 10, 2020 at 08:19:24PM +0800, Li, Aubrey wrote:
> Hi Joel/Vineeth,
>
> On 2020/7/1 5:32, Vineeth Remanan Pillai wrote:
> > From: "Joel Fernandes (Google)" <[email protected]>
> >
> > With current core scheduling patchset, non-threaded IRQ and softirq
> > victims can leak data from its hyperthread to a sibling hyperthread
> > running an attacker.
> >
> > For MDS, it is possible for the IRQ and softirq handlers to leak data to
> > either host or guest attackers. For L1TF, it is possible to leak to
> > guest attackers. There is no possible mitigation involving flushing of
> > buffers to avoid this since the execution of attacker and victims happen
> > concurrently on 2 or more HTs.
> >
> > The solution in this patch is to monitor the outer-most core-wide
> > irq_enter() and irq_exit() executed by any sibling. In between these
> > two, we mark the core to be in a special core-wide IRQ state.
> >
> > In the IRQ entry, if we detect that the sibling is running untrusted
> > code, we send a reschedule IPI so that the sibling transitions through
> > the sibling's irq_exit() to do any waiting there, till the IRQ being
> > protected finishes.
> >
> > We also monitor the per-CPU outer-most irq_exit(). If during the per-cpu
> > outer-most irq_exit(), the core is still in the special core-wide IRQ
> > state, we perform a busy-wait till the core exits this state. This
> > combination of per-cpu and core-wide IRQ states helps to handle any
> > combination of irq_entry()s and irq_exit()s happening on all of the
> > siblings of the core in any order.
> >
> > Lastly, we also check in the schedule loop if we are about to schedule
> > an untrusted process while the core is in such a state. This is possible
> > if a trusted thread enters the scheduler by way of yielding CPU. This
> > would involve no transitions through the irq_exit() point to do any
> > waiting, so we have to explicitly do the waiting there.
> >
> > Every attempt is made to prevent a busy-wait unnecessarily, and in
> > testing on real-world ChromeOS usecases, it has not shown a performance
> > drop. In ChromeOS, with this and the rest of the core scheduling
> > patchset, we see around a 300% improvement in key press latencies into
> > Google docs when Camera streaming is running simultaneously (90th
> > percentile latency of ~150ms drops to ~50ms).
> >
> > This feature is controlled by the build-time config option
> > CONFIG_SCHED_CORE_IRQ_PAUSE and is enabled by default. There is also a
> > kernel boot parameter 'sched_core_irq_pause' to enable/disable the
> > feature at boot time. Default is enabled at boot time.
>
> We saw a lot of soft lockups on the screen when we tested v6.
>
> [ 186.527883] watchdog: BUG: soft lockup - CPU#86 stuck for 22s! [uperf:5551]
> [ 186.535884] watchdog: BUG: soft lockup - CPU#87 stuck for 22s! [uperf:5444]
> [ 186.555883] watchdog: BUG: soft lockup - CPU#89 stuck for 22s! [uperf:5547]
> [ 187.547884] rcu: INFO: rcu_sched self-detected stall on CPU
> [ 187.553760] rcu: 40-....: (14997 ticks this GP) idle=49a/1/0x4000000000000002 softirq=1711/1711 fqs=7279
> [ 187.564685] NMI watchdog: Watchdog detected hard LOCKUP on cpu 14
> [ 187.564723] NMI watchdog: Watchdog detected hard LOCKUP on cpu 38
>
> The problem is gone when we reverted this patch. We are running multiple
> uperf threads(equal to cpu number) in a cgroup with coresched enabled.
> This is 100% reproducible on our side.

Interesting. I am guessing you are not doing any hotplug since those fixes
were removed from v6 to expose those hotplug issues..

The last known lockups with this patch were fixed. Appreciate if you can dig
in more and provide logs/traces. The last one I remember was:

        HT1                               HT2
 irq_enter()
    - sets the core-wide flag
                                   <softirq running>
                                    acquires a lock.
                                   <gets irq>
                                   irq_enter() - do nothing.
                                   irq_exit() - busy wait on flag.
 irq_exit()
   <softirq running>
    acquire a lock and deadlock.

The fix was to call sched_core_irq_enter() when you enter a softirq
from paths other than irq_exit().
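
Very roughly, the idea looks something like the sketch below (hand-wavy
illustration only, not the actual diff; the helper name is made up):

    /*
     * Make softirq entry points that are not reached via irq_exit()
     * (e.g. ksoftirqd, local_bh_enable()) also mark the core-wide
     * protected state. A nested hardirq's irq_exit() on that CPU is then
     * no longer the outermost entry and won't busy-wait while the
     * interrupted softirq still holds a lock.
     */
    static void do_softirq_core_protected(void)
    {
        sched_core_irq_enter();   /* enter core-wide protected state */
        __do_softirq();           /* run the pending softirqs */
        sched_core_irq_exit();    /* leave it; wait here if needed */
    }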

Other than this one, we have not seen lockups in heavy testing over the last
2 months since we redesigned this patch to enter the 'private state' on the
outer-most core-wide sched_core_irq_enter().

thanks,

- Joel

2020-07-10 13:39:59

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq

Hi Aubrey,

On Fri, Jul 10, 2020 at 8:19 AM Li, Aubrey <[email protected]> wrote:
>
> Hi Joel/Vineeth,
> [...]
> The problem is gone when we reverted this patch. We are running multiple
> uperf threads(equal to cpu number) in a cgroup with coresched enabled.
> This is 100% reproducible on our side.
>
> Just wonder if anything already known before we dig into it.
>
Thanks for reporting this. We haven't seen any lockups like this
in our testing yet.

Could you please add more information on how to reproduce this?
Was it a simple uperf run without any options or was it running any
specific kind of network test?

We shall also try to reproduce this and investigate.

Thanks,
Vineeth

2020-07-11 01:36:51

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq

On Fri, Jul 10, 2020 at 9:36 PM Vineeth Remanan Pillai
<[email protected]> wrote:
>
> Hi Aubrey,
>
> On Fri, Jul 10, 2020 at 8:19 AM Li, Aubrey <[email protected]> wrote:
> >
> > Hi Joel/Vineeth,
> > [...]
> > The problem is gone when we reverted this patch. We are running multiple
> > uperf threads(equal to cpu number) in a cgroup with coresched enabled.
> > This is 100% reproducible on our side.
> >
> > Just wonder if anything already known before we dig into it.
> >
> Thanks for reporting this. We haven't seen any lockups like this
> in our testing yet.

This is reproducible on a bare metal machine. We tried to reproduce it
on an 8-cpu KVM VM but failed.

> Could you please add more information on how to reproduce this?
> Was it a simple uperf run without any options or was it running any
> specific kind of network test?

I put our scripts at here:
https://github.com/aubreyli/uperf

>
> We shall also try to reproduce this and investigate.

I'll try to see if I can narrow down the test case and grab some logs
next week.

Thanks,
-Aubrey

2020-07-13 02:32:20

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq

On 2020/7/10 21:21, Joel Fernandes wrote:
> On Fri, Jul 10, 2020 at 08:19:24PM +0800, Li, Aubrey wrote:
>> Hi Joel/Vineeth,
>>
>>
>> The problem is gone when we reverted this patch. We are running multiple
>> uperf threads(equal to cpu number) in a cgroup with coresched enabled.
>> This is 100% reproducible on our side.
>
> Interesting. I am guessing you are not doing any hotplug since those fixes
> were removed from v6 to expose those hotplug issues..
>
> The last known lockups with this patch were fixed. Appreciate if you can dig
> in more and provide logs/traces. The last one I remember was:
>
>         HT1                               HT2
>  irq_enter()
>     - sets the core-wide flag
>                                    <softirq running>
>                                     acquires a lock.
>                                    <gets irq>
>                                    irq_enter() - do nothing.
>                                    irq_exit() - busy wait on flag.
>  irq_exit()
>    <softirq running>
>     acquire a lock and deadlock.
>
> The fix was to call sched_core_irq_enter() when you enter a softirq
> from paths other than irq_exit().
>
> Other than this one, we have not seen lockups in heavy testing over the last
> 2 months since we redesigned this patch to enter the 'private state' on the
> outer-most core-wide sched_core_irq_enter().

When the first soft lockup panicked on CPU75, it was waiting on a flush tlb IPI.

[ 170.641645] CPU: 75 PID: 5393 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
[ 170.641651] RIP: 0010:smp_call_function_many_cond+0x2b1/0x2e0
----snip----
[ 170.641660] Call Trace:
[ 170.641666] ? flush_tlb_func_common.constprop.10+0x220/0x220
[ 170.641668] ? x86_configure_nx+0x50/0x50
[ 170.641669] ? flush_tlb_func_common.constprop.10+0x220/0x220
[ 170.641670] on_each_cpu_cond_mask+0x2f/0x80
[ 170.641671] flush_tlb_mm_range+0xab/0xe0
[ 170.641677] change_protection+0x18a/0xca0
[ 170.641682] ? __switch_to_asm+0x34/0x70
[ 170.641685] change_prot_numa+0x15/0x30
[ 170.641689] task_numa_work+0x1aa/0x2c0
[ 170.641694] task_work_run+0x76/0xa0
[ 170.641698] exit_to_usermode_loop+0xeb/0xf0
[ 170.641700] do_syscall_64+0x1aa/0x1d0
[ 170.641701] entry_SYSCALL_64_after_hwframe+0x44/0xa9

If I read the code correctly, I assume someone is pending in irq_exit() so the IPI
can't return to CPU75, and I found it's CPU91

[ 170.652257] CPU: 91 PID: 5401 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
[ 170.652257] RIP: 0010:sched_core_irq_exit+0xcc/0x110
----snip----
[ 170.652261] Call Trace:
[ 170.652262] <IRQ>
[ 170.652262] irq_exit+0x6a/0xb0
[ 170.652262] smp_apic_timer_interrupt+0x74/0x130
[ 170.652262] apic_timer_interrupt+0xf/0x20

Then I checked the stack of CPU91's sibling CPU19, and found it's spinning on a spin lock.

[ 170.643678] CPU: 19 PID: 5385 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
[ 170.643679] RIP: 0010:native_queued_spin_lock_slowpath+0x137/0x1e0
[ 170.643684] Call Trace:
[ 170.643684] <IRQ>
[ 170.643684] _raw_spin_lock+0x1b/0x20
[ 170.643685] tcp_delack_timer+0x2c/0xf0
[ 170.643685] ? tcp_delack_timer_handler+0x170/0x170
[ 170.643685] call_timer_fn+0x2d/0x130
[ 170.643685] run_timer_softirq+0x420/0x450
[ 170.643686] ? enqueue_hrtimer+0x39/0x90
[ 170.643686] ? __hrtimer_run_queues+0x138/0x290
[ 170.643686] __do_softirq+0xed/0x2f0
[ 170.643686] irq_exit+0xad/0xb0
[ 170.643686] smp_apic_timer_interrupt+0x74/0x130
[ 170.643687] apic_timer_interrupt+0xf/0x20
----snip----
[ 170.643738] entry_SYSCALL_64_after_hwframe+0x44/0xa9

So I guess the problem is,

        CPU91                             CPU19
 (1) hold a bh_lock_sock(sk)
 (2) <gets irq>
                                   (3) <gets irq>
 (4) irq_exit()
     -> sched_core_irq_exit()
        - not outermost, wait()
                                   (5) invoke softirq
                                   (6) acquire bh_lock_sock() and deadlock
                                   (7) sched_core_irq_exit()

In case I understood anything wrong, I attached the full dmesg.

IMHO, can we let the irq exit and do the wait before returning to user mode? I
think we can trust anything running in the kernel.

Thanks,
-Aubrey


Attachments:
dmesg.txt (211.81 kB)

2020-07-13 16:01:16

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq

On Mon, Jul 13, 2020 at 10:23:31AM +0800, Li, Aubrey wrote:
> On 2020/7/10 21:21, Joel Fernandes wrote:
> > On Fri, Jul 10, 2020 at 08:19:24PM +0800, Li, Aubrey wrote:
> >> Hi Joel/Vineeth,
> >>
> >>
> >> The problem is gone when we reverted this patch. We are running multiple
> >> uperf threads(equal to cpu number) in a cgroup with coresched enabled.
> >> This is 100% reproducible on our side.
> >
> > Interesting. I am guessing you are not doing any hotplug since those fixes
> > were removed from v6 to expose those hotplug issues..
> >
> > The last known lockups with this patch were fixed. Appreciate if you can dig
> > in more and provide logs/traces. The last one I remember was:
> >
> >         HT1                               HT2
> >  irq_enter()
> >     - sets the core-wide flag
> >                                    <softirq running>
> >                                     acquires a lock.
> >                                    <gets irq>
> >                                    irq_enter() - do nothing.
> >                                    irq_exit() - busy wait on flag.
> >  irq_exit()
> >    <softirq running>
> >     acquire a lock and deadlock.
> >
> > The fix was to call sched_core_irq_enter() when you enter a softirq
> > from paths other than irq_exit().
> >
> > Other than this one, we have not seen lockups in heavy testing over the last
> > 2 months since we redesigned this patch to enter the 'private state' on the
> > outer-most core-wide sched_core_irq_enter().
>
> When the first soft lockup panicked on CPU75, it was waiting on a flush tlb IPI.
>
> [ 170.641645] CPU: 75 PID: 5393 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
> [ 170.641651] RIP: 0010:smp_call_function_many_cond+0x2b1/0x2e0
> ----snip----
> [ 170.641660] Call Trace:
> [ 170.641666] ? flush_tlb_func_common.constprop.10+0x220/0x220
> [ 170.641668] ? x86_configure_nx+0x50/0x50
> [ 170.641669] ? flush_tlb_func_common.constprop.10+0x220/0x220
> [ 170.641670] on_each_cpu_cond_mask+0x2f/0x80
> [ 170.641671] flush_tlb_mm_range+0xab/0xe0
> [ 170.641677] change_protection+0x18a/0xca0
> [ 170.641682] ? __switch_to_asm+0x34/0x70
> [ 170.641685] change_prot_numa+0x15/0x30
> [ 170.641689] task_numa_work+0x1aa/0x2c0
> [ 170.641694] task_work_run+0x76/0xa0
> [ 170.641698] exit_to_usermode_loop+0xeb/0xf0
> [ 170.641700] do_syscall_64+0x1aa/0x1d0
> [ 170.641701] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> If I read the code correctly, I assume someone is pending in irq_exit() so the IPI
> can't return to CPU75, and I found it's CPU91
>
> [ 170.652257] CPU: 91 PID: 5401 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
> [ 170.652257] RIP: 0010:sched_core_irq_exit+0xcc/0x110
> ----snip----
> [ 170.652261] Call Trace:
> [ 170.652262] <IRQ>
> [ 170.652262] irq_exit+0x6a/0xb0
> [ 170.652262] smp_apic_timer_interrupt+0x74/0x130
> [ 170.652262] apic_timer_interrupt+0xf/0x20
>
> Then I checked the stack of CPU91's sibling CPU19, and found it's spinning on a spin lock.
>
> [ 170.643678] CPU: 19 PID: 5385 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
> [ 170.643679] RIP: 0010:native_queued_spin_lock_slowpath+0x137/0x1e0
> [ 170.643684] Call Trace:
> [ 170.643684] <IRQ>
> [ 170.643684] _raw_spin_lock+0x1b/0x20
> [ 170.643685] tcp_delack_timer+0x2c/0xf0
> [ 170.643685] ? tcp_delack_timer_handler+0x170/0x170
> [ 170.643685] call_timer_fn+0x2d/0x130
> [ 170.643685] run_timer_softirq+0x420/0x450
> [ 170.643686] ? enqueue_hrtimer+0x39/0x90
> [ 170.643686] ? __hrtimer_run_queues+0x138/0x290
> [ 170.643686] __do_softirq+0xed/0x2f0
> [ 170.643686] irq_exit+0xad/0xb0
> [ 170.643686] smp_apic_timer_interrupt+0x74/0x130
> [ 170.643687] apic_timer_interrupt+0xf/0x20
> ----snip----
> [ 170.643738] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> So I guess the problem is,
>
>         CPU91                             CPU19
>  (1) hold a bh_lock_sock(sk)
>  (2) <gets irq>
>                                    (3) <gets irq>
>  (4) irq_exit()
>      -> sched_core_irq_exit()
>         - not outermost, wait()
>                                    (5) invoke softirq
>                                    (6) acquire bh_lock_sock() and deadlock
>                                    (7) sched_core_irq_exit()
>
> In case I understood anything wrong, I attached the full dmesg.

Thanks a lot for the debugging. This matches the case I mentioned yesterday
so I think that's it.

> IMHO, can we let the irq exit and do the wait before returning to user mode? I
> think we can trust anything running in the kernel.

Sure, we are waiting in the schedule loop after context_switch(), so the patch
Vineeth sent does exactly what you mentioned.

Another idea is to do it from softirq (so we can raise_softirq). That will
eliminate the deadlock you mentioned since CPU91, which holds the bh_lock_sock,
will keep bh disabled until the lock is dropped. Once the lock is dropped, it
will then do the wait in softirq.

But we are thinking the softirq-wait might not be needed and resched_curr(rq) from
irq_exit() is sufficient, if the overhead of calling into the
schedule loop is not too high. Vineeth is collecting some data on this.

I was also thinking we should add lockdep annotations to the wait code to
have lockdep catch lock vs wait dependency issues. But probably that is more
useful only if we decide to go the softirq route (since in the __schedule()
code, we drop the rq lock before spinning and no other lock should be held).

thanks,

- Joel


> [ 0.000000] Linux version 5.7.6+ (root@ssp_clx_cdi391) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04), GNU ld (GNU Binutils for Ubuntu) 2.30) #3 SMP Mon Jul 13 01:11:27 UTC 2020
> [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.7.6+ root=/dev/mapper/ubuntu--vg-ubuntu--lv ro console=tty0 console=ttyS0,115200n8 softlockup_panic=1 crashkernel=512M-:768M
> [ 0.000000] KERNEL supported cpus:
> [ 0.000000] Intel GenuineIntel
> [ 0.000000] AMD AuthenticAMD
> [ 0.000000] Hygon HygonGenuine
> [ 0.000000] Centaur CentaurHauls
> [ 0.000000] zhaoxin Shanghai
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
> [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
> [ 0.000000] x86/fpu: xstate_offset[3]: 832, xstate_sizes[3]: 64
> [ 0.000000] x86/fpu: xstate_offset[4]: 896, xstate_sizes[4]: 64
> [ 0.000000] x86/fpu: xstate_offset[5]: 960, xstate_sizes[5]: 64
> [ 0.000000] x86/fpu: xstate_offset[6]: 1024, xstate_sizes[6]: 512
> [ 0.000000] x86/fpu: xstate_offset[7]: 1536, xstate_sizes[7]: 1024
> [ 0.000000] x86/fpu: xstate_offset[9]: 2560, xstate_sizes[9]: 8
> [ 0.000000] x86/fpu: Enabled xstate features 0x2ff, context size is 2568 bytes, using 'compacted' format.
> [ 0.000000] BIOS-provided physical RAM map:
> [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
> [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009efff] reserved
> [ 0.000000] BIOS-e820: [mem 0x000000000009f000-0x000000000009ffff] ACPI NVS
> [ 0.000000] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000065e0efff] usable
> [ 0.000000] BIOS-e820: [mem 0x0000000065e0f000-0x000000006764efff] reserved
> [ 0.000000] BIOS-e820: [mem 0x000000006764f000-0x0000000068ca7fff] ACPI NVS
> [ 0.000000] BIOS-e820: [mem 0x0000000068ca8000-0x00000000695fdfff] ACPI data
> [ 0.000000] BIOS-e820: [mem 0x00000000695fe000-0x000000006f7fffff] usable
> [ 0.000000] BIOS-e820: [mem 0x000000006f800000-0x000000008fffffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
> [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000506fffffff] usable
> [ 0.000000] NX (Execute Disable) protection: active
> [ 0.000000] e820: update [mem 0x5c84f018-0x5c883c57] usable ==> usable
> [ 0.000000] e820: update [mem 0x5c84f018-0x5c883c57] usable ==> usable
> [ 0.000000] e820: update [mem 0x5c81a018-0x5c84ec57] usable ==> usable
> [ 0.000000] e820: update [mem 0x5c81a018-0x5c84ec57] usable ==> usable
> [ 0.000000] extended physical RAM map:
> [ 0.000000] reserve setup_data: [mem 0x0000000000000000-0x000000000009dfff] usable
> [ 0.000000] reserve setup_data: [mem 0x000000000009e000-0x000000000009efff] reserved
> [ 0.000000] reserve setup_data: [mem 0x000000000009f000-0x000000000009ffff] ACPI NVS
> [ 0.000000] reserve setup_data: [mem 0x00000000000a0000-0x00000000000fffff] reserved
> [ 0.000000] reserve setup_data: [mem 0x0000000000100000-0x000000005c81a017] usable
> [ 0.000000] reserve setup_data: [mem 0x000000005c81a018-0x000000005c84ec57] usable
> [ 0.000000] reserve setup_data: [mem 0x000000005c84ec58-0x000000005c84f017] usable
> [ 0.000000] reserve setup_data: [mem 0x000000005c84f018-0x000000005c883c57] usable
> [ 0.000000] reserve setup_data: [mem 0x000000005c883c58-0x0000000065e0efff] usable
> [ 0.000000] reserve setup_data: [mem 0x0000000065e0f000-0x000000006764efff] reserved
> [ 0.000000] reserve setup_data: [mem 0x000000006764f000-0x0000000068ca7fff] ACPI NVS
> [ 0.000000] reserve setup_data: [mem 0x0000000068ca8000-0x00000000695fdfff] ACPI data
> [ 0.000000] reserve setup_data: [mem 0x00000000695fe000-0x000000006f7fffff] usable
> [ 0.000000] reserve setup_data: [mem 0x000000006f800000-0x000000008fffffff] reserved
> [ 0.000000] reserve setup_data: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
> [ 0.000000] reserve setup_data: [mem 0x0000000100000000-0x000000506fffffff] usable
> [ 0.000000] efi: EFI v2.70 by EDK II
> [ 0.000000] efi: ACPI=0x695fd000 ACPI 2.0=0x695fd014 TPMFinalLog=0x68bee000 SMBIOS=0x65e98000 SMBIOS 3.0=0x65e96000 MEMATTR=0x65d35018 TPMEventLog=0x5f610018
> [ 0.000000] SMBIOS 3.2.0 present.
> [ 0.000000] DMI: Intel Corporation CooperCity/CooperCity, BIOS WLYDCRB1.SYS.0016.P38.2006170234 06/17/2020
> [ 0.000000] tsc: Detected 2500.000 MHz processor
> [ 0.000541] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> [ 0.000542] e820: remove [mem 0x000a0000-0x000fffff] usable
> [ 0.000547] last_pfn = 0x5070000 max_arch_pfn = 0x400000000
> [ 0.000551] MTRR default type: uncachable
> [ 0.000552] MTRR fixed ranges enabled:
> [ 0.000553] 00000-9FFFF write-back
> [ 0.000554] A0000-FFFFF uncachable
> [ 0.000554] MTRR variable ranges enabled:
> [ 0.000556] 0 base 000000000000 mask 3FC000000000 write-back
> [ 0.000557] 1 base 004000000000 mask 3FF000000000 write-back
> [ 0.000557] 2 base 005000000000 mask 3FFF80000000 write-back
> [ 0.000558] 3 base 005070000000 mask 3FFFF0000000 uncachable
> [ 0.000559] 4 base 000080000000 mask 3FFF80000000 uncachable
> [ 0.000560] 5 base 00007F000000 mask 3FFFFF000000 uncachable
> [ 0.000560] 6 disabled
> [ 0.000561] 7 disabled
> [ 0.000561] 8 disabled
> [ 0.000562] 9 disabled
> [ 0.001367] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
> [ 0.002413] total RAM covered: 327408M
> [ 0.002560] gran_size: 64K chunk_size: 64K num_reg: 10 lose cover RAM: 296704M
> [ 0.002562] gran_size: 64K chunk_size: 128K num_reg: 10 lose cover RAM: 296704M
> [ 0.002563] gran_size: 64K chunk_size: 256K num_reg: 10 lose cover RAM: 296704M
> [ 0.002564] gran_size: 64K chunk_size: 512K num_reg: 10 lose cover RAM: 296704M
> [ 0.002565] gran_size: 64K chunk_size: 1M num_reg: 10 lose cover RAM: 296704M
> [ 0.002567] gran_size: 64K chunk_size: 2M num_reg: 10 lose cover RAM: 296704M
> [ 0.002568] gran_size: 64K chunk_size: 4M num_reg: 10 lose cover RAM: 296704M
> [ 0.002569] gran_size: 64K chunk_size: 8M num_reg: 10 lose cover RAM: 296704M
> [ 0.002570] gran_size: 64K chunk_size: 16M num_reg: 10 lose cover RAM: 296704M
> [ 0.002571] gran_size: 64K chunk_size: 32M num_reg: 10 lose cover RAM: 768M
> [ 0.002572] gran_size: 64K chunk_size: 64M num_reg: 10 lose cover RAM: 768M
> [ 0.002573] gran_size: 64K chunk_size: 128M num_reg: 10 lose cover RAM: 768M
> [ 0.002575] gran_size: 64K chunk_size: 256M num_reg: 10 lose cover RAM: 768M
> [ 0.002576] *BAD*gran_size: 64K chunk_size: 512M num_reg: 10 lose cover RAM: -256M
> [ 0.002577] *BAD*gran_size: 64K chunk_size: 1G num_reg: 10 lose cover RAM: -256M
> [ 0.002578] *BAD*gran_size: 64K chunk_size: 2G num_reg: 10 lose cover RAM: -256M
> [ 0.002579] gran_size: 128K chunk_size: 128K num_reg: 10 lose cover RAM: 296704M
> [ 0.002580] gran_size: 128K chunk_size: 256K num_reg: 10 lose cover RAM: 296704M
> [ 0.002582] gran_size: 128K chunk_size: 512K num_reg: 10 lose cover RAM: 296704M
> [ 0.002583] gran_size: 128K chunk_size: 1M num_reg: 10 lose cover RAM: 296704M
> [ 0.002584] gran_size: 128K chunk_size: 2M num_reg: 10 lose cover RAM: 296704M
> [ 0.002585] gran_size: 128K chunk_size: 4M num_reg: 10 lose cover RAM: 296704M
> [ 0.002586] gran_size: 128K chunk_size: 8M num_reg: 10 lose cover RAM: 296704M
> [ 0.002587] gran_size: 128K chunk_size: 16M num_reg: 10 lose cover RAM: 296704M
> [ 0.002588] gran_size: 128K chunk_size: 32M num_reg: 10 lose cover RAM: 768M
> [ 0.002589] gran_size: 128K chunk_size: 64M num_reg: 10 lose cover RAM: 768M
> [ 0.002590] gran_size: 128K chunk_size: 128M num_reg: 10 lose cover RAM: 768M
> [ 0.002592] gran_size: 128K chunk_size: 256M num_reg: 10 lose cover RAM: 768M
> [ 0.002593] *BAD*gran_size: 128K chunk_size: 512M num_reg: 10 lose cover RAM: -256M
> [ 0.002594] *BAD*gran_size: 128K chunk_size: 1G num_reg: 10 lose cover RAM: -256M
> [ 0.002595] *BAD*gran_size: 128K chunk_size: 2G num_reg: 10 lose cover RAM: -256M
> [ 0.002596] gran_size: 256K chunk_size: 256K num_reg: 10 lose cover RAM: 296704M
> [ 0.002597] gran_size: 256K chunk_size: 512K num_reg: 10 lose cover RAM: 296704M
> [ 0.002598] gran_size: 256K chunk_size: 1M num_reg: 10 lose cover RAM: 296704M
> [ 0.002599] gran_size: 256K chunk_size: 2M num_reg: 10 lose cover RAM: 296704M
> [ 0.002601] gran_size: 256K chunk_size: 4M num_reg: 10 lose cover RAM: 296704M
> [ 0.002602] gran_size: 256K chunk_size: 8M num_reg: 10 lose cover RAM: 296704M
> [ 0.002603] gran_size: 256K chunk_size: 16M num_reg: 10 lose cover RAM: 296704M
> [ 0.002604] gran_size: 256K chunk_size: 32M num_reg: 10 lose cover RAM: 768M
> [ 0.002605] gran_size: 256K chunk_size: 64M num_reg: 10 lose cover RAM: 768M
> [ 0.002606] gran_size: 256K chunk_size: 128M num_reg: 10 lose cover RAM: 768M
> [ 0.002607] gran_size: 256K chunk_size: 256M num_reg: 10 lose cover RAM: 768M
> [ 0.002608] *BAD*gran_size: 256K chunk_size: 512M num_reg: 10 lose cover RAM: -256M
> [ 0.002610] *BAD*gran_size: 256K chunk_size: 1G num_reg: 10 lose cover RAM: -256M
> [ 0.002611] *BAD*gran_size: 256K chunk_size: 2G num_reg: 10 lose cover RAM: -256M
> [ 0.002612] gran_size: 512K chunk_size: 512K num_reg: 10 lose cover RAM: 296704M
> [ 0.002613] gran_size: 512K chunk_size: 1M num_reg: 10 lose cover RAM: 296704M
> [ 0.002614] gran_size: 512K chunk_size: 2M num_reg: 10 lose cover RAM: 296704M
> [ 0.002615] gran_size: 512K chunk_size: 4M num_reg: 10 lose cover RAM: 296704M
> [ 0.002616] gran_size: 512K chunk_size: 8M num_reg: 10 lose cover RAM: 296704M
> [ 0.002617] gran_size: 512K chunk_size: 16M num_reg: 10 lose cover RAM: 296704M
> [ 0.002618] gran_size: 512K chunk_size: 32M num_reg: 10 lose cover RAM: 768M
> [ 0.002619] gran_size: 512K chunk_size: 64M num_reg: 10 lose cover RAM: 768M
> [ 0.002621] gran_size: 512K chunk_size: 128M num_reg: 10 lose cover RAM: 768M
> [ 0.002622] gran_size: 512K chunk_size: 256M num_reg: 10 lose cover RAM: 768M
> [ 0.002623] *BAD*gran_size: 512K chunk_size: 512M num_reg: 10 lose cover RAM: -256M
> [ 0.002624] *BAD*gran_size: 512K chunk_size: 1G num_reg: 10 lose cover RAM: -256M
> [ 0.002625] *BAD*gran_size: 512K chunk_size: 2G num_reg: 10 lose cover RAM: -256M
> [ 0.002626] gran_size: 1M chunk_size: 1M num_reg: 10 lose cover RAM: 296704M
> [ 0.002627] gran_size: 1M chunk_size: 2M num_reg: 10 lose cover RAM: 296704M
> [ 0.002629] gran_size: 1M chunk_size: 4M num_reg: 10 lose cover RAM: 296704M
> [ 0.002630] gran_size: 1M chunk_size: 8M num_reg: 10 lose cover RAM: 296704M
> [ 0.002631] gran_size: 1M chunk_size: 16M num_reg: 10 lose cover RAM: 296704M
> [ 0.002632] gran_size: 1M chunk_size: 32M num_reg: 10 lose cover RAM: 768M
> [ 0.002633] gran_size: 1M chunk_size: 64M num_reg: 10 lose cover RAM: 768M
> [ 0.002634] gran_size: 1M chunk_size: 128M num_reg: 10 lose cover RAM: 768M
> [ 0.002635] gran_size: 1M chunk_size: 256M num_reg: 10 lose cover RAM: 768M
> [ 0.002636] *BAD*gran_size: 1M chunk_size: 512M num_reg: 10 lose cover RAM: -256M
> [ 0.002638] *BAD*gran_size: 1M chunk_size: 1G num_reg: 10 lose cover RAM: -256M
> [ 0.002639] *BAD*gran_size: 1M chunk_size: 2G num_reg: 10 lose cover RAM: -256M
> [ 0.002640] gran_size: 2M chunk_size: 2M num_reg: 10 lose cover RAM: 296704M
> [ 0.002641] gran_size: 2M chunk_size: 4M num_reg: 10 lose cover RAM: 296704M
> [ 0.002642] gran_size: 2M chunk_size: 8M num_reg: 10 lose cover RAM: 296704M
> [ 0.002643] gran_size: 2M chunk_size: 16M num_reg: 10 lose cover RAM: 296704M
> [ 0.002644] gran_size: 2M chunk_size: 32M num_reg: 10 lose cover RAM: 768M
> [ 0.002645] gran_size: 2M chunk_size: 64M num_reg: 10 lose cover RAM: 768M
> [ 0.002646] gran_size: 2M chunk_size: 128M num_reg: 10 lose cover RAM: 768M
> [ 0.002648] gran_size: 2M chunk_size: 256M num_reg: 10 lose cover RAM: 768M
> [ 0.002649] *BAD*gran_size: 2M chunk_size: 512M num_reg: 10 lose cover RAM: -256M
> [ 0.002650] *BAD*gran_size: 2M chunk_size: 1G num_reg: 10 lose cover RAM: -256M
> [ 0.002651] *BAD*gran_size: 2M chunk_size: 2G num_reg: 10 lose cover RAM: -256M
> [ 0.002652] gran_size: 4M chunk_size: 4M num_reg: 10 lose cover RAM: 296704M
> [ 0.002653] gran_size: 4M chunk_size: 8M num_reg: 10 lose cover RAM: 296704M
> [ 0.002654] gran_size: 4M chunk_size: 16M num_reg: 10 lose cover RAM: 296704M
> [ 0.002655] gran_size: 4M chunk_size: 32M num_reg: 10 lose cover RAM: 768M
> [ 0.002656] gran_size: 4M chunk_size: 64M num_reg: 10 lose cover RAM: 768M
> [ 0.002657] gran_size: 4M chunk_size: 128M num_reg: 10 lose cover RAM: 768M
> [ 0.002659] gran_size: 4M chunk_size: 256M num_reg: 10 lose cover RAM: 768M
> [ 0.002660] *BAD*gran_size: 4M chunk_size: 512M num_reg: 10 lose cover RAM: -256M
> [ 0.002661] *BAD*gran_size: 4M chunk_size: 1G num_reg: 10 lose cover RAM: -256M
> [ 0.002662] *BAD*gran_size: 4M chunk_size: 2G num_reg: 10 lose cover RAM: -256M
> [ 0.002663] gran_size: 8M chunk_size: 8M num_reg: 10 lose cover RAM: 296704M
> [ 0.002664] gran_size: 8M chunk_size: 16M num_reg: 10 lose cover RAM: 296704M
> [ 0.002665] gran_size: 8M chunk_size: 32M num_reg: 10 lose cover RAM: 768M
> [ 0.002666] gran_size: 8M chunk_size: 64M num_reg: 10 lose cover RAM: 768M
> [ 0.002667] gran_size: 8M chunk_size: 128M num_reg: 10 lose cover RAM: 768M
> [ 0.002669] gran_size: 8M chunk_size: 256M num_reg: 10 lose cover RAM: 768M
> [ 0.002670] *BAD*gran_size: 8M chunk_size: 512M num_reg: 10 lose cover RAM: -256M
> [ 0.002671] *BAD*gran_size: 8M chunk_size: 1G num_reg: 10 lose cover RAM: -256M
> [ 0.002672] *BAD*gran_size: 8M chunk_size: 2G num_reg: 10 lose cover RAM: -256M
> [ 0.002673] gran_size: 16M chunk_size: 16M num_reg: 10 lose cover RAM: 296704M
> [ 0.002674] gran_size: 16M chunk_size: 32M num_reg: 10 lose cover RAM: 768M
> [ 0.002675] gran_size: 16M chunk_size: 64M num_reg: 10 lose cover RAM: 768M
> [ 0.002676] gran_size: 16M chunk_size: 128M num_reg: 10 lose cover RAM: 768M
> [ 0.002677] gran_size: 16M chunk_size: 256M num_reg: 10 lose cover RAM: 768M
> [ 0.002679] *BAD*gran_size: 16M chunk_size: 512M num_reg: 10 lose cover RAM: -256M
> [ 0.002680] *BAD*gran_size: 16M chunk_size: 1G num_reg: 10 lose cover RAM: -256M
> [ 0.002681] *BAD*gran_size: 16M chunk_size: 2G num_reg: 10 lose cover RAM: -256M
> [ 0.002682] gran_size: 32M chunk_size: 32M num_reg: 10 lose cover RAM: 263952M
> [ 0.002683] gran_size: 32M chunk_size: 64M num_reg: 10 lose cover RAM: 784M
> [ 0.002684] gran_size: 32M chunk_size: 128M num_reg: 10 lose cover RAM: 784M
> [ 0.002685] gran_size: 32M chunk_size: 256M num_reg: 10 lose cover RAM: 784M
> [ 0.002687] *BAD*gran_size: 32M chunk_size: 512M num_reg: 10 lose cover RAM: -240M
> [ 0.002688] *BAD*gran_size: 32M chunk_size: 1G num_reg: 10 lose cover RAM: -240M
> [ 0.002689] *BAD*gran_size: 32M chunk_size: 2G num_reg: 10 lose cover RAM: -240M
> [ 0.002690] gran_size: 64M chunk_size: 64M num_reg: 10 lose cover RAM: 198448M
> [ 0.002691] gran_size: 64M chunk_size: 128M num_reg: 10 lose cover RAM: 816M
> [ 0.002692] gran_size: 64M chunk_size: 256M num_reg: 10 lose cover RAM: 816M
> [ 0.002693] *BAD*gran_size: 64M chunk_size: 512M num_reg: 10 lose cover RAM: -208M
> [ 0.002694] *BAD*gran_size: 64M chunk_size: 1G num_reg: 10 lose cover RAM: -208M
> [ 0.002696] *BAD*gran_size: 64M chunk_size: 2G num_reg: 10 lose cover RAM: -208M
> [ 0.002697] gran_size: 128M chunk_size: 128M num_reg: 10 lose cover RAM: 67440M
> [ 0.002698] gran_size: 128M chunk_size: 256M num_reg: 10 lose cover RAM: 880M
> [ 0.002699] *BAD*gran_size: 128M chunk_size: 512M num_reg: 10 lose cover RAM: -144M
> [ 0.002700] *BAD*gran_size: 128M chunk_size: 1G num_reg: 10 lose cover RAM: -144M
> [ 0.002701] *BAD*gran_size: 128M chunk_size: 2G num_reg: 10 lose cover RAM: -144M
> [ 0.002702] gran_size: 256M chunk_size: 256M num_reg: 10 lose cover RAM: 2032M
> [ 0.002704] *BAD*gran_size: 256M chunk_size: 512M num_reg: 10 lose cover RAM: -16M
> [ 0.002705] *BAD*gran_size: 256M chunk_size: 1G num_reg: 10 lose cover RAM: -16M
> [ 0.002706] *BAD*gran_size: 256M chunk_size: 2G num_reg: 10 lose cover RAM: -16M
> [ 0.002707] gran_size: 512M chunk_size: 512M num_reg: 10 lose cover RAM: 1264M
> [ 0.002708] gran_size: 512M chunk_size: 1G num_reg: 10 lose cover RAM: 240M
> [ 0.002709] gran_size: 512M chunk_size: 2G num_reg: 10 lose cover RAM: 240M
> [ 0.002711] gran_size: 1G chunk_size: 1G num_reg: 9 lose cover RAM: 1776M
> [ 0.002712] gran_size: 1G chunk_size: 2G num_reg: 10 lose cover RAM: 1776M
> [ 0.002713] gran_size: 2G chunk_size: 2G num_reg: 7 lose cover RAM: 3824M
> [ 0.002714] mtrr_cleanup: can not find optimal value
> [ 0.002715] please specify mtrr_gran_size/mtrr_chunk_size
> [ 0.002719] e820: update [mem 0x7f000000-0xffffffff] usable ==> reserved
> [ 0.002722] last_pfn = 0x6f800 max_arch_pfn = 0x400000000
> [ 0.028467] check: Scanning 1 areas for low memory corruption
> [ 0.028471] kexec: Reserving the low 1M of memory for crashkernel
> [ 0.028475] Using GB pages for direct mapping
> [ 0.028812] Secure boot disabled
> [ 0.028813] RAMDISK: [mem 0x3c803000-0x3fffdfff]
> [ 0.028826] ACPI: Early table checksum verification disabled
> [ 0.028829] ACPI: RSDP 0x00000000695FD014 000024 (v02 INTEL )
> [ 0.028833] ACPI: XSDT 0x0000000068E98188 0000F4 (v01 INTEL INTEL ID 00000000 INTL 01000013)
> [ 0.028839] ACPI: FACP 0x00000000695F4000 000114 (v06 INTEL INTEL ID 00000000 INTL 20091013)
> [ 0.028844] ACPI: DSDT 0x000000006955C000 091BA6 (v02 INTEL INTEL ID 00000003 INTL 20091013)
> [ 0.028848] ACPI: FACS 0x0000000068BE5000 000040
> [ 0.028850] ACPI: SSDT 0x00000000695FB000 0004C5 (v02 INTEL INTEL ID 00000000 MSFT 0100000D)
> [ 0.028853] ACPI: SSDT 0x00000000695FA000 000479 (v02 INTEL RAS_ACPI 00000001 INTL 20181003)
> [ 0.028856] ACPI: SSDT 0x00000000695F9000 00060E (v02 INTEL Tpm2Tabl 00001000 INTL 20181003)
> [ 0.028859] ACPI: TPM2 0x00000000695F8000 00004C (v04 INTEL INTEL ID 00000002 INTL 01000013)
> [ 0.028862] ACPI: SSDT 0x00000000695F7000 000663 (v02 INTEL ADDRXLAT 00000001 INTL 20181003)
> [ 0.028865] ACPI: BERT 0x00000000695F6000 000030 (v01 INTEL INTEL ID 00000001 INTL 00000001)
> [ 0.028868] ACPI: ERST 0x00000000695F5000 000230 (v01 INTEL INTEL ID 00000001 INTL 00000001)
> [ 0.028870] ACPI: HMAT 0x00000000695F3000 000270 (v01 INTEL INTEL ID 00000001 INTL 20091013)
> [ 0.028873] ACPI: HPET 0x00000000695F2000 000038 (v01 INTEL INTEL ID 00000001 INTL 20091013)
> [ 0.028876] ACPI: MCFG 0x00000000695F1000 00003C (v01 INTEL INTEL ID 00000001 INTL 20091013)
> [ 0.028879] ACPI: MIGT 0x00000000695F0000 000040 (v01 INTEL INTEL ID 00000000 INTL 20091013)
> [ 0.028882] ACPI: MSCT 0x00000000695EF000 0000E8 (v01 INTEL INTEL ID 00000001 INTL 20091013)
> [ 0.028884] ACPI: WDDT 0x00000000695EE000 000040 (v01 INTEL INTEL ID 00000000 INTL 20091013)
> [ 0.028887] ACPI: APIC 0x000000006955B000 00059E (v04 INTEL INTEL ID 00000000 INTL 20091013)
> [ 0.028890] ACPI: SLIT 0x0000000069556000 00402C (v01 INTEL INTEL ID 00000001 INTL 01000013)
> [ 0.028893] ACPI: SRAT 0x000000006954A000 00B430 (v03 INTEL INTEL ID 00000002 INTL 01000013)
> [ 0.028896] ACPI: OEM4 0x000000006923A000 30F461 (v02 INTEL CPU CST 00003000 INTL 20181003)
> [ 0.028899] ACPI: OEM1 0x0000000069013000 226889 (v02 INTEL CPU EIST 00003000 INTL 20181003)
> [ 0.028902] ACPI: OEM2 0x0000000068F86000 08C031 (v02 INTEL CPU HWP 00003000 INTL 20181003)
> [ 0.028905] ACPI: SSDT 0x0000000068E99000 0EC8A5 (v02 INTEL SSDT PM 00004000 INTL 20181003)
> [ 0.028907] ACPI: SSDT 0x00000000695FC000 000943 (v02 INTEL INTEL ID 00000000 INTL 20091013)
> [ 0.028910] ACPI: HEST 0x0000000068E97000 00013C (v01 INTEL INTEL ID 00000001 INTL 00000001)
> [ 0.028913] ACPI: SSDT 0x0000000068E88000 00E497 (v02 INTEL SpsNm 00000002 INTL 20181003)
> [ 0.028916] ACPI: SPCR 0x0000000068E87000 000050 (v02 INTEL INTEL ID 00000000 INTL 01000013)
> [ 0.028919] ACPI: FPDT 0x0000000068E86000 000044 (v01 INTEL INTEL ID 00000002 INTL 01000013)
> [ 0.028926] ACPI: Local APIC address 0xfee00000
> [ 0.029007] SRAT: PXM 0 -> APIC 0x00 -> Node 0
> [ 0.029008] SRAT: PXM 0 -> APIC 0x02 -> Node 0
> [ 0.029008] SRAT: PXM 0 -> APIC 0x04 -> Node 0
> [ 0.029009] SRAT: PXM 0 -> APIC 0x06 -> Node 0
> [ 0.029010] SRAT: PXM 0 -> APIC 0x08 -> Node 0
> [ 0.029011] SRAT: PXM 0 -> APIC 0x10 -> Node 0
> [ 0.029012] SRAT: PXM 0 -> APIC 0x12 -> Node 0
> [ 0.029012] SRAT: PXM 0 -> APIC 0x14 -> Node 0
> [ 0.029013] SRAT: PXM 0 -> APIC 0x16 -> Node 0
> [ 0.029014] SRAT: PXM 0 -> APIC 0x20 -> Node 0
> [ 0.029015] SRAT: PXM 0 -> APIC 0x22 -> Node 0
> [ 0.029015] SRAT: PXM 0 -> APIC 0x24 -> Node 0
> [ 0.029016] SRAT: PXM 0 -> APIC 0x26 -> Node 0
> [ 0.029017] SRAT: PXM 0 -> APIC 0x28 -> Node 0
> [ 0.029018] SRAT: PXM 0 -> APIC 0x30 -> Node 0
> [ 0.029018] SRAT: PXM 0 -> APIC 0x32 -> Node 0
> [ 0.029019] SRAT: PXM 0 -> APIC 0x34 -> Node 0
> [ 0.029020] SRAT: PXM 0 -> APIC 0x36 -> Node 0
> [ 0.029021] SRAT: PXM 1 -> APIC 0x40 -> Node 1
> [ 0.029021] SRAT: PXM 1 -> APIC 0x42 -> Node 1
> [ 0.029022] SRAT: PXM 1 -> APIC 0x44 -> Node 1
> [ 0.029023] SRAT: PXM 1 -> APIC 0x46 -> Node 1
> [ 0.029024] SRAT: PXM 1 -> APIC 0x48 -> Node 1
> [ 0.029024] SRAT: PXM 1 -> APIC 0x50 -> Node 1
> [ 0.029025] SRAT: PXM 1 -> APIC 0x52 -> Node 1
> [ 0.029026] SRAT: PXM 1 -> APIC 0x54 -> Node 1
> [ 0.029027] SRAT: PXM 1 -> APIC 0x56 -> Node 1
> [ 0.029027] SRAT: PXM 1 -> APIC 0x60 -> Node 1
> [ 0.029028] SRAT: PXM 1 -> APIC 0x62 -> Node 1
> [ 0.029029] SRAT: PXM 1 -> APIC 0x64 -> Node 1
> [ 0.029030] SRAT: PXM 1 -> APIC 0x66 -> Node 1
> [ 0.029030] SRAT: PXM 1 -> APIC 0x68 -> Node 1
> [ 0.029031] SRAT: PXM 1 -> APIC 0x70 -> Node 1
> [ 0.029032] SRAT: PXM 1 -> APIC 0x72 -> Node 1
> [ 0.029033] SRAT: PXM 1 -> APIC 0x74 -> Node 1
> [ 0.029033] SRAT: PXM 1 -> APIC 0x76 -> Node 1
> [ 0.029034] SRAT: PXM 2 -> APIC 0x80 -> Node 2
> [ 0.029035] SRAT: PXM 2 -> APIC 0x82 -> Node 2
> [ 0.029036] SRAT: PXM 2 -> APIC 0x84 -> Node 2
> [ 0.029036] SRAT: PXM 2 -> APIC 0x86 -> Node 2
> [ 0.029037] SRAT: PXM 2 -> APIC 0x88 -> Node 2
> [ 0.029038] SRAT: PXM 2 -> APIC 0x90 -> Node 2
> [ 0.029039] SRAT: PXM 2 -> APIC 0x92 -> Node 2
> [ 0.029039] SRAT: PXM 2 -> APIC 0x94 -> Node 2
> [ 0.029040] SRAT: PXM 2 -> APIC 0x96 -> Node 2
> [ 0.029041] SRAT: PXM 2 -> APIC 0xa0 -> Node 2
> [ 0.029042] SRAT: PXM 2 -> APIC 0xa2 -> Node 2
> [ 0.029042] SRAT: PXM 2 -> APIC 0xa4 -> Node 2
> [ 0.029043] SRAT: PXM 2 -> APIC 0xa6 -> Node 2
> [ 0.029044] SRAT: PXM 2 -> APIC 0xa8 -> Node 2
> [ 0.029045] SRAT: PXM 2 -> APIC 0xb0 -> Node 2
> [ 0.029045] SRAT: PXM 2 -> APIC 0xb2 -> Node 2
> [ 0.029046] SRAT: PXM 2 -> APIC 0xb4 -> Node 2
> [ 0.029047] SRAT: PXM 2 -> APIC 0xb6 -> Node 2
> [ 0.029048] SRAT: PXM 3 -> APIC 0xc0 -> Node 3
> [ 0.029048] SRAT: PXM 3 -> APIC 0xc2 -> Node 3
> [ 0.029049] SRAT: PXM 3 -> APIC 0xc4 -> Node 3
> [ 0.029050] SRAT: PXM 3 -> APIC 0xc6 -> Node 3
> [ 0.029051] SRAT: PXM 3 -> APIC 0xc8 -> Node 3
> [ 0.029051] SRAT: PXM 3 -> APIC 0xd0 -> Node 3
> [ 0.029052] SRAT: PXM 3 -> APIC 0xd2 -> Node 3
> [ 0.029053] SRAT: PXM 3 -> APIC 0xd4 -> Node 3
> [ 0.029054] SRAT: PXM 3 -> APIC 0xd6 -> Node 3
> [ 0.029054] SRAT: PXM 3 -> APIC 0xe0 -> Node 3
> [ 0.029055] SRAT: PXM 3 -> APIC 0xe2 -> Node 3
> [ 0.029056] SRAT: PXM 3 -> APIC 0xe4 -> Node 3
> [ 0.029057] SRAT: PXM 3 -> APIC 0xe6 -> Node 3
> [ 0.029057] SRAT: PXM 3 -> APIC 0xe8 -> Node 3
> [ 0.029058] SRAT: PXM 3 -> APIC 0xf0 -> Node 3
> [ 0.029059] SRAT: PXM 3 -> APIC 0xf2 -> Node 3
> [ 0.029060] SRAT: PXM 3 -> APIC 0xf4 -> Node 3
> [ 0.029060] SRAT: PXM 3 -> APIC 0xf6 -> Node 3
> [ 0.029061] SRAT: PXM 0 -> APIC 0x01 -> Node 0
> [ 0.029062] SRAT: PXM 0 -> APIC 0x03 -> Node 0
> [ 0.029063] SRAT: PXM 0 -> APIC 0x05 -> Node 0
> [ 0.029063] SRAT: PXM 0 -> APIC 0x07 -> Node 0
> [ 0.029064] SRAT: PXM 0 -> APIC 0x09 -> Node 0
> [ 0.029065] SRAT: PXM 0 -> APIC 0x11 -> Node 0
> [ 0.029066] SRAT: PXM 0 -> APIC 0x13 -> Node 0
> [ 0.029066] SRAT: PXM 0 -> APIC 0x15 -> Node 0
> [ 0.029067] SRAT: PXM 0 -> APIC 0x17 -> Node 0
> [ 0.029068] SRAT: PXM 0 -> APIC 0x21 -> Node 0
> [ 0.029069] SRAT: PXM 0 -> APIC 0x23 -> Node 0
> [ 0.029069] SRAT: PXM 0 -> APIC 0x25 -> Node 0
> [ 0.029070] SRAT: PXM 0 -> APIC 0x27 -> Node 0
> [ 0.029071] SRAT: PXM 0 -> APIC 0x29 -> Node 0
> [ 0.029071] SRAT: PXM 0 -> APIC 0x31 -> Node 0
> [ 0.029072] SRAT: PXM 0 -> APIC 0x33 -> Node 0
> [ 0.029073] SRAT: PXM 0 -> APIC 0x35 -> Node 0
> [ 0.029074] SRAT: PXM 0 -> APIC 0x37 -> Node 0
> [ 0.029074] SRAT: PXM 1 -> APIC 0x41 -> Node 1
> [ 0.029075] SRAT: PXM 1 -> APIC 0x43 -> Node 1
> [ 0.029076] SRAT: PXM 1 -> APIC 0x45 -> Node 1
> [ 0.029077] SRAT: PXM 1 -> APIC 0x47 -> Node 1
> [ 0.029077] SRAT: PXM 1 -> APIC 0x49 -> Node 1
> [ 0.029078] SRAT: PXM 1 -> APIC 0x51 -> Node 1
> [ 0.029079] SRAT: PXM 1 -> APIC 0x53 -> Node 1
> [ 0.029080] SRAT: PXM 1 -> APIC 0x55 -> Node 1
> [ 0.029080] SRAT: PXM 1 -> APIC 0x57 -> Node 1
> [ 0.029081] SRAT: PXM 1 -> APIC 0x61 -> Node 1
> [ 0.029082] SRAT: PXM 1 -> APIC 0x63 -> Node 1
> [ 0.029083] SRAT: PXM 1 -> APIC 0x65 -> Node 1
> [ 0.029083] SRAT: PXM 1 -> APIC 0x67 -> Node 1
> [ 0.029084] SRAT: PXM 1 -> APIC 0x69 -> Node 1
> [ 0.029085] SRAT: PXM 1 -> APIC 0x71 -> Node 1
> [ 0.029086] SRAT: PXM 1 -> APIC 0x73 -> Node 1
> [ 0.029086] SRAT: PXM 1 -> APIC 0x75 -> Node 1
> [ 0.029087] SRAT: PXM 1 -> APIC 0x77 -> Node 1
> [ 0.029088] SRAT: PXM 2 -> APIC 0x81 -> Node 2
> [ 0.029089] SRAT: PXM 2 -> APIC 0x83 -> Node 2
> [ 0.029089] SRAT: PXM 2 -> APIC 0x85 -> Node 2
> [ 0.029090] SRAT: PXM 2 -> APIC 0x87 -> Node 2
> [ 0.029091] SRAT: PXM 2 -> APIC 0x89 -> Node 2
> [ 0.029092] SRAT: PXM 2 -> APIC 0x91 -> Node 2
> [ 0.029092] SRAT: PXM 2 -> APIC 0x93 -> Node 2
> [ 0.029093] SRAT: PXM 2 -> APIC 0x95 -> Node 2
> [ 0.029094] SRAT: PXM 2 -> APIC 0x97 -> Node 2
> [ 0.029095] SRAT: PXM 2 -> APIC 0xa1 -> Node 2
> [ 0.029095] SRAT: PXM 2 -> APIC 0xa3 -> Node 2
> [ 0.029096] SRAT: PXM 2 -> APIC 0xa5 -> Node 2
> [ 0.029097] SRAT: PXM 2 -> APIC 0xa7 -> Node 2
> [ 0.029097] SRAT: PXM 2 -> APIC 0xa9 -> Node 2
> [ 0.029098] SRAT: PXM 2 -> APIC 0xb1 -> Node 2
> [ 0.029099] SRAT: PXM 2 -> APIC 0xb3 -> Node 2
> [ 0.029100] SRAT: PXM 2 -> APIC 0xb5 -> Node 2
> [ 0.029100] SRAT: PXM 2 -> APIC 0xb7 -> Node 2
> [ 0.029101] SRAT: PXM 3 -> APIC 0xc1 -> Node 3
> [ 0.029102] SRAT: PXM 3 -> APIC 0xc3 -> Node 3
> [ 0.029103] SRAT: PXM 3 -> APIC 0xc5 -> Node 3
> [ 0.029103] SRAT: PXM 3 -> APIC 0xc7 -> Node 3
> [ 0.029104] SRAT: PXM 3 -> APIC 0xc9 -> Node 3
> [ 0.029105] SRAT: PXM 3 -> APIC 0xd1 -> Node 3
> [ 0.029106] SRAT: PXM 3 -> APIC 0xd3 -> Node 3
> [ 0.029106] SRAT: PXM 3 -> APIC 0xd5 -> Node 3
> [ 0.029107] SRAT: PXM 3 -> APIC 0xd7 -> Node 3
> [ 0.029108] SRAT: PXM 3 -> APIC 0xe1 -> Node 3
> [ 0.029109] SRAT: PXM 3 -> APIC 0xe3 -> Node 3
> [ 0.029109] SRAT: PXM 3 -> APIC 0xe5 -> Node 3
> [ 0.029110] SRAT: PXM 3 -> APIC 0xe7 -> Node 3
> [ 0.029111] SRAT: PXM 3 -> APIC 0xe9 -> Node 3
> [ 0.029112] SRAT: PXM 3 -> APIC 0xf1 -> Node 3
> [ 0.029112] SRAT: PXM 3 -> APIC 0xf3 -> Node 3
> [ 0.029113] SRAT: PXM 3 -> APIC 0xf5 -> Node 3
> [ 0.029114] SRAT: PXM 3 -> APIC 0xf7 -> Node 3
> [ 0.029167] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
> [ 0.029168] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x186fffffff]
> [ 0.029169] ACPI: SRAT: Node 1 PXM 1 [mem 0x1870000000-0x306fffffff]
> [ 0.029170] ACPI: SRAT: Node 2 PXM 2 [mem 0x3070000000-0x406fffffff]
> [ 0.029171] ACPI: SRAT: Node 3 PXM 3 [mem 0x4070000000-0x506fffffff]
> [ 0.029184] NUMA: Initialized distance table, cnt=4
> [ 0.029187] NUMA: Node 0 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0x186fffffff] -> [mem 0x00000000-0x186fffffff]
> [ 0.029193] NODE_DATA(0) allocated [mem 0x186fffb000-0x186fffffff]
> [ 0.029198] NODE_DATA(1) allocated [mem 0x306fffb000-0x306fffffff]
> [ 0.029205] NODE_DATA(2) allocated [mem 0x406fffb000-0x406fffffff]
> [ 0.029213] NODE_DATA(3) allocated [mem 0x506fffa000-0x506fffefff]
> [ 0.029230] Reserving 768MB of memory at 192MB for crashkernel (System RAM: 327103MB)
> [ 0.029595] Zone ranges:
> [ 0.029596] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> [ 0.029597] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
> [ 0.029598] Normal [mem 0x0000000100000000-0x000000506fffffff]
> [ 0.029599] Device empty
> [ 0.029600] Movable zone start for each node
> [ 0.029601] Early memory node ranges
> [ 0.029602] node 0: [mem 0x0000000000001000-0x000000000009dfff]
> [ 0.029604] node 0: [mem 0x0000000000100000-0x0000000065e0efff]
> [ 0.029605] node 0: [mem 0x00000000695fe000-0x000000006f7fffff]
> [ 0.029606] node 0: [mem 0x0000000100000000-0x000000186fffffff]
> [ 0.029614] node 1: [mem 0x0000001870000000-0x000000306fffffff]
> [ 0.029623] node 2: [mem 0x0000003070000000-0x000000406fffffff]
> [ 0.029629] node 3: [mem 0x0000004070000000-0x000000506fffffff]
> [ 0.029819] Zeroed struct page in unavailable ranges: 16466 pages
> [ 0.029820] Initmem setup node 0 [mem 0x0000000000001000-0x000000186fffffff]
> [ 0.029822] On node 0 totalpages: 25018286
> [ 0.029823] DMA zone: 64 pages used for memmap
> [ 0.029824] DMA zone: 157 pages reserved
> [ 0.029825] DMA zone: 3997 pages, LIFO batch:0
> [ 0.029895] DMA32 zone: 6849 pages used for memmap
> [ 0.029896] DMA32 zone: 438289 pages, LIFO batch:63
> [ 0.037451] Normal zone: 384000 pages used for memmap
> [ 0.037452] Normal zone: 24576000 pages, LIFO batch:63
> [ 0.437523] Initmem setup node 1 [mem 0x0000001870000000-0x000000306fffffff]
> [ 0.437525] On node 1 totalpages: 25165824
> [ 0.437526] Normal zone: 393216 pages used for memmap
> [ 0.437526] Normal zone: 25165824 pages, LIFO batch:63
> [ 0.886342] Initmem setup node 2 [mem 0x0000003070000000-0x000000406fffffff]
> [ 0.886344] On node 2 totalpages: 16777216
> [ 0.886344] Normal zone: 262144 pages used for memmap
> [ 0.886345] Normal zone: 16777216 pages, LIFO batch:63
> [ 1.187252] Initmem setup node 3 [mem 0x0000004070000000-0x000000506fffffff]
> [ 1.187253] On node 3 totalpages: 16777216
> [ 1.187254] Normal zone: 262144 pages used for memmap
> [ 1.187254] Normal zone: 16777216 pages, LIFO batch:63
> [ 1.485872] ACPI: PM-Timer IO Port: 0x508
> [ 1.485874] ACPI: Local APIC address 0xfee00000
> [ 1.485897] ACPI: X2APIC_NMI (uid[0xffffffff] high edge lint[0x1])
> [ 1.485900] ACPI: LAPIC_NMI (acpi_id[0xff] high edge lint[0x1])
> [ 1.485922] IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
> [ 1.485926] IOAPIC[1]: apic_id 9, version 32, address 0xfec01000, GSI 24-31
> [ 1.485931] IOAPIC[2]: apic_id 10, version 32, address 0xfec08000, GSI 32-39
> [ 1.485935] IOAPIC[3]: apic_id 11, version 32, address 0xfec10000, GSI 40-47
> [ 1.485939] IOAPIC[4]: apic_id 12, version 32, address 0xfec18000, GSI 48-55
> [ 1.485944] IOAPIC[5]: apic_id 13, version 32, address 0xfec20000, GSI 56-63
> [ 1.485948] IOAPIC[6]: apic_id 14, version 32, address 0xfec28000, GSI 64-71
> [ 1.485953] IOAPIC[7]: apic_id 15, version 32, address 0xfec30000, GSI 72-79
> [ 1.485957] IOAPIC[8]: apic_id 16, version 32, address 0xfec38000, GSI 80-87
> [ 1.485962] IOAPIC[9]: apic_id 17, version 32, address 0xfec40000, GSI 88-95
> [ 1.485967] IOAPIC[10]: apic_id 18, version 32, address 0xfec48000, GSI 96-103
> [ 1.485972] IOAPIC[11]: apic_id 19, version 32, address 0xfec50000, GSI 104-111
> [ 1.485976] IOAPIC[12]: apic_id 20, version 32, address 0xfec58000, GSI 112-119
> [ 1.485981] IOAPIC[13]: apic_id 21, version 32, address 0xfec60000, GSI 120-127
> [ 1.485986] IOAPIC[14]: apic_id 22, version 32, address 0xfec68000, GSI 128-135
> [ 1.485991] IOAPIC[15]: apic_id 23, version 32, address 0xfec70000, GSI 136-143
> [ 1.485995] IOAPIC[16]: apic_id 24, version 32, address 0xfec78000, GSI 144-151
> [ 1.485998] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
> [ 1.486000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
> [ 1.486002] ACPI: IRQ0 used by override.
> [ 1.486003] ACPI: IRQ9 used by override.
> [ 1.486006] Using ACPI (MADT) for SMP configuration information
> [ 1.486007] ACPI: HPET id: 0x8086a701 base: 0xfed00000
> [ 1.486011] ACPI: SPCR: console: uart,io,0x3f8,115200
> [ 1.486012] TSC deadline timer available
> [ 1.486013] smpboot: Allowing 144 CPUs, 0 hotplug CPUs
> [ 1.486040] PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
> [ 1.486042] PM: hibernation: Registered nosave memory: [mem 0x0009e000-0x0009efff]
> [ 1.486043] PM: hibernation: Registered nosave memory: [mem 0x0009f000-0x0009ffff]
> [ 1.486044] PM: hibernation: Registered nosave memory: [mem 0x000a0000-0x000fffff]
> [ 1.486046] PM: hibernation: Registered nosave memory: [mem 0x5c81a000-0x5c81afff]
> [ 1.486048] PM: hibernation: Registered nosave memory: [mem 0x5c84e000-0x5c84efff]
> [ 1.486049] PM: hibernation: Registered nosave memory: [mem 0x5c84f000-0x5c84ffff]
> [ 1.486051] PM: hibernation: Registered nosave memory: [mem 0x5c883000-0x5c883fff]
> [ 1.486053] PM: hibernation: Registered nosave memory: [mem 0x65e0f000-0x6764efff]
> [ 1.486054] PM: hibernation: Registered nosave memory: [mem 0x6764f000-0x68ca7fff]
> [ 1.486054] PM: hibernation: Registered nosave memory: [mem 0x68ca8000-0x695fdfff]
> [ 1.486056] PM: hibernation: Registered nosave memory: [mem 0x6f800000-0x8fffffff]
> [ 1.486057] PM: hibernation: Registered nosave memory: [mem 0x90000000-0xfdffffff]
> [ 1.486058] PM: hibernation: Registered nosave memory: [mem 0xfe000000-0xfe010fff]
> [ 1.486059] PM: hibernation: Registered nosave memory: [mem 0xfe011000-0xffffffff]
> [ 1.486061] [mem 0x90000000-0xfdffffff] available for PCI devices
> [ 1.486062] Booting paravirtualized kernel on bare hardware
> [ 1.486065] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
> [ 1.491035] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:144 nr_node_ids:4
> [ 1.507188] percpu: Embedded 56 pages/cpu s188440 r8192 d32744 u262144
> [ 1.507206] pcpu-alloc: s188440 r8192 d32744 u262144 alloc=1*2097152
> [ 1.507207] pcpu-alloc: [0] 000 001 002 003 004 005 006 007
> [ 1.507209] pcpu-alloc: [0] 008 009 010 011 012 013 014 015
> [ 1.507211] pcpu-alloc: [0] 016 017 072 073 074 075 076 077
> [ 1.507213] pcpu-alloc: [0] 078 079 080 081 082 083 084 085
> [ 1.507215] pcpu-alloc: [0] 086 087 088 089 --- --- --- ---
> [ 1.507217] pcpu-alloc: [1] 018 019 020 021 022 023 024 025
> [ 1.507219] pcpu-alloc: [1] 026 027 028 029 030 031 032 033
> [ 1.507221] pcpu-alloc: [1] 034 035 090 091 092 093 094 095
> [ 1.507222] pcpu-alloc: [1] 096 097 098 099 100 101 102 103
> [ 1.507224] pcpu-alloc: [1] 104 105 106 107 --- --- --- ---
> [ 1.507226] pcpu-alloc: [2] 036 037 038 039 040 041 042 043
> [ 1.507228] pcpu-alloc: [2] 044 045 046 047 048 049 050 051
> [ 1.507230] pcpu-alloc: [2] 052 053 108 109 110 111 112 113
> [ 1.507232] pcpu-alloc: [2] 114 115 116 117 118 119 120 121
> [ 1.507233] pcpu-alloc: [2] 122 123 124 125 --- --- --- ---
> [ 1.507235] pcpu-alloc: [3] 054 055 056 057 058 059 060 061
> [ 1.507237] pcpu-alloc: [3] 062 063 064 065 066 067 068 069
> [ 1.507239] pcpu-alloc: [3] 070 071 126 127 128 129 130 131
> [ 1.507241] pcpu-alloc: [3] 132 133 134 135 136 137 138 139
> [ 1.507242] pcpu-alloc: [3] 140 141 142 143 --- --- --- ---
> [ 1.507310] Built 4 zonelists, mobility grouping on. Total pages: 82429968
> [ 1.507311] Policy zone: Normal
> [ 1.507313] Kernel command line: BOOT_IMAGE=/vmlinuz-5.7.6+ root=/dev/mapper/ubuntu--vg-ubuntu--lv ro console=tty0 console=ttyS0,115200n8 softlockup_panic=1 crashkernel=512M-:768M
> [ 1.507452] printk: log_buf_len individual max cpu contribution: 4096 bytes
> [ 1.507453] printk: log_buf_len total cpu_extra contributions: 585728 bytes
> [ 1.507453] printk: log_buf_len min size: 262144 bytes
> [ 1.507789] printk: log_buf_len: 1048576 bytes
> [ 1.507790] printk: early log buf free: 224012(85%)
> [ 1.510295] mem auto-init: stack:off, heap alloc:off, heap free:off
> [ 2.362842] Memory: 328546464K/334954168K available (12291K kernel code, 1669K rwdata, 8216K rodata, 1780K init, 2752K bss, 6407704K reserved, 0K cma-reserved)
> [ 2.364642] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=144, Nodes=4
> [ 2.364729] ftrace: allocating 42557 entries in 167 pages
> [ 2.384214] ftrace: allocated 167 pages with 5 groups
> [ 2.384329]
> [ 2.384330] **********************************************************
> [ 2.384330] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
> [ 2.384331] ** **
> [ 2.384331] ** trace_printk() being used. Allocating extra memory. **
> [ 2.384332] ** **
> [ 2.384333] ** This means that this is a DEBUG kernel and it is **
> [ 2.384333] ** unsafe for production use. **
> [ 2.384334] ** **
> [ 2.384335] ** If you see this message and you are not debugging **
> [ 2.384335] ** the kernel, report this immediately to your vendor! **
> [ 2.384336] ** **
> [ 2.384336] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
> [ 2.384337] **********************************************************
> [ 2.385516] rcu: Hierarchical RCU implementation.
> [ 2.385518] rcu: RCU restricting CPUs from NR_CPUS=512 to nr_cpu_ids=144.
> [ 2.385520] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
> [ 2.385521] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=144
> [ 2.389286] NR_IRQS: 33024, nr_irqs: 3752, preallocated irqs: 16
> [ 2.390192] random: get_random_bytes called from start_kernel+0x371/0x558 with crng_init=0
> [ 2.390433] Console: colour dummy device 80x25
> [ 2.391152] printk: console [tty0] enabled
> [ 5.520858] printk: console [ttyS0] enabled
> [ 5.525370] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
> [ 5.536556] ACPI: Core revision 20200326
> [ 5.546943] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635855245 ns
> [ 5.556022] APIC: Switch to symmetric I/O mode setup
> [ 5.561290] x2apic: IRQ remapping doesn't support X2APIC mode
> [ 5.567225] Switched APIC routing to physical flat.
> [ 5.573427] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [ 5.595930] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x240939f1bb2, max_idle_ns: 440795263295 ns
> [ 5.606425] Calibrating delay loop (skipped), value calculated using timer frequency.. 5000.00 BogoMIPS (lpj=10000000)
> [ 5.610424] pid_max: default: 147456 minimum: 1152
> [ 5.624339] LSM: Security Framework initializing
> [ 5.626431] Yama: becoming mindful.
> [ 5.630494] AppArmor: AppArmor initialized
> [ 5.634425] TOMOYO Linux initialized
> [ 5.664730] Dentry cache hash table entries: 16777216 (order: 15, 134217728 bytes, vmalloc)
> [ 5.680915] Inode-cache hash table entries: 8388608 (order: 14, 67108864 bytes, vmalloc)
> [ 5.682900] Mount-cache hash table entries: 262144 (order: 9, 2097152 bytes, vmalloc)
> [ 5.686785] Mountpoint-cache hash table entries: 262144 (order: 9, 2097152 bytes, vmalloc)
> [ 5.695446] mce: CPU0: Thermal monitoring enabled (TM1)
> [ 5.698484] process: using mwait in idle threads
> [ 5.702426] Last level iTLB entries: 4KB 64, 2MB 8, 4MB 8
> [ 5.706423] Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4
> [ 5.710429] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
> [ 5.714424] Spectre V2 : Mitigation: Enhanced IBRS
> [ 5.718423] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
> [ 5.722425] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
> [ 5.726425] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
> [ 5.730905] Freeing SMP alternatives memory: 40K
> [ 5.740732] smpboot: CPU0: Genuine Intel(R) CPU 0000%@ (family: 0x6, model: 0x55, stepping: 0xb)
> [ 5.742751] Performance Events: PEBS fmt3+, Skylake events, 32-deep LBR, full-width counters, Intel PMU driver.
> [ 5.746427] ... version: 4
> [ 5.750424] ... bit width: 48
> [ 5.754423] ... generic registers: 4
> [ 5.758424] ... value mask: 0000ffffffffffff
> [ 5.762424] ... max period: 00007fffffffffff
> [ 5.766423] ... fixed-purpose events: 3
> [ 5.770424] ... event mask: 000000070000000f
> [ 5.774621] core: 0 -> 0
> [ 5.777154] rcu: Hierarchical SRCU implementation.
> [ 5.793265] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
> [ 5.796304] smp: Bringing up secondary CPUs ...
> [ 5.798631] x86: Booting SMP configuration:
> [ 5.802425] .... node #0, CPUs: #1
> [ 3.180284] core: 1 -> 1
> [ 5.811825] #2
> [ 3.180284] core: 2 -> 2
> [ 5.819129] #3
> [ 3.180284] core: 3 -> 3
> [ 5.826649] #4
> [ 3.180284] core: 4 -> 4
> [ 5.833874] #5
> [ 3.180284] core: 5 -> 5
> [ 5.841260] #6
> [ 3.180284] core: 6 -> 6
> [ 5.848513] #7
> [ 3.180284] core: 7 -> 7
> [ 5.855778] #8
> [ 3.180284] core: 8 -> 8
> [ 5.863078] #9
> [ 3.180284] core: 9 -> 9
> [ 5.870650] #10
> [ 3.180284] core: 10 -> 10
> [ 5.877912] #11
> [ 3.180284] core: 11 -> 11
> [ 5.885305] #12
> [ 3.180284] core: 12 -> 12
> [ 5.892804] #13
> [ 3.180284] core: 13 -> 13
> [ 5.900433] #14
> [ 3.180284] core: 14 -> 14
> [ 5.907905] #15
> [ 3.180284] core: 15 -> 15
> [ 5.915299] #16
> [ 3.180284] core: 16 -> 16
> [ 5.922814] #17
> [ 3.180284] core: 17 -> 17
> [ 5.930718] .... node #1, CPUs: #18
> [ 3.180284] smpboot: CPU 18 Converting physical 0 to logical die 1
> [ 3.180284] core: 18 -> 18
> [ 6.026679] #19
> [ 3.180284] core: 19 -> 19
> [ 6.034470] #20
> [ 3.180284] core: 20 -> 20
> [ 6.041621] #21
> [ 3.180284] core: 21 -> 21
> [ 6.049191] #22
> [ 3.180284] core: 22 -> 22
> [ 6.056874] #23
> [ 3.180284] core: 23 -> 23
> [ 6.064488] #24
> [ 3.180284] core: 24 -> 24
> [ 6.071963] #25
> [ 3.180284] core: 25 -> 25
> [ 6.079457] #26
> [ 3.180284] core: 26 -> 26
> [ 6.087035] #27
> [ 3.180284] core: 27 -> 27
> [ 6.094683] #28
> [ 3.180284] core: 28 -> 28
> [ 6.102664] #29
> [ 3.180284] core: 29 -> 29
> [ 6.109793] #30
> [ 3.180284] core: 30 -> 30
> [ 6.117405] #31
> [ 3.180284] core: 31 -> 31
> [ 6.125147] #32
> [ 3.180284] core: 32 -> 32
> [ 6.132735] #33
> [ 3.180284] core: 33 -> 33
> [ 6.140281] #34
> [ 3.180284] core: 34 -> 34
> [ 6.147827] #35
> [ 3.180284] core: 35 -> 35
> [ 6.155499] .... node #2, CPUs: #36
> [ 3.180284] smpboot: CPU 36 Converting physical 0 to logical die 2
> [ 3.180284] core: 36 -> 36
> [ 6.250677] #37
> [ 3.180284] core: 37 -> 37
> [ 6.258524] #38
> [ 3.180284] core: 38 -> 38
> [ 6.265713] #39
> [ 3.180284] core: 39 -> 39
> [ 6.273291] #40
> [ 3.180284] core: 40 -> 40
> [ 6.280915] #41
> [ 3.180284] core: 41 -> 41
> [ 6.288556] #42
> [ 3.180284] core: 42 -> 42
> [ 6.296104] #43
> [ 3.180284] core: 43 -> 43
> [ 6.303635] #44
> [ 3.180284] core: 44 -> 44
> [ 6.311240] #45
> [ 3.180284] core: 45 -> 45
> [ 6.318939] #46
> [ 3.180284] core: 46 -> 46
> [ 6.326672] #47
> [ 3.180284] core: 47 -> 47
> [ 6.334513] #48
> [ 3.180284] core: 48 -> 48
> [ 6.341710] #49
> [ 3.180284] core: 49 -> 49
> [ 6.349393] #50
> [ 3.180284] core: 50 -> 50
> [ 6.357069] #51
> [ 3.180284] core: 51 -> 51
> [ 6.364647] #52
> [ 3.180284] core: 52 -> 52
> [ 6.372279] #53
> [ 3.180284] core: 53 -> 53
> [ 6.379986] .... node #3, CPUs: #54
> [ 3.180284] smpboot: CPU 54 Converting physical 0 to logical die 3
> [ 3.180284] core: 54 -> 54
> [ 6.474698] #55
> [ 3.180284] core: 55 -> 55
> [ 6.482665] #56
> [ 3.180284] core: 56 -> 56
> [ 6.489900] #57
> [ 3.180284] core: 57 -> 57
> [ 6.497557] #58
> [ 3.180284] core: 58 -> 58
> [ 6.505282] #59
> [ 3.180284] core: 59 -> 59
> [ 6.513000] #60
> [ 3.180284] core: 60 -> 60
> [ 6.520607] #61
> [ 3.180284] core: 61 -> 61
> [ 6.528202] #62
> [ 3.180284] core: 62 -> 62
> [ 6.535894] #63
> [ 3.180284] core: 63 -> 63
> [ 6.543752] #64
> [ 3.180284] core: 64 -> 64
> [ 6.551475] #65
> [ 3.180284] core: 65 -> 65
> [ 6.559134] #66
> [ 3.180284] core: 66 -> 66
> [ 6.566790] #67
> [ 3.180284] core: 67 -> 67
> [ 6.574673] #68
> [ 3.180284] core: 68 -> 68
> [ 6.582690] #69
> [ 3.180284] core: 69 -> 69
> [ 6.589978] #70
> [ 3.180284] core: 70 -> 70
> [ 6.597673] #71
> [ 3.180284] core: 71 -> 71
> [ 6.605486] .... node #0, CPUs: #72
> [ 3.180284] core: 72 -> 0
> [ 6.613660] #73
> [ 3.180284] core: 73 -> 1
> [ 6.621322] #74
> [ 3.180284] core: 74 -> 2
> [ 6.628771] #75
> [ 3.180284] core: 75 -> 3
> [ 6.636258] #76
> [ 3.180284] core: 76 -> 4
> [ 6.643814] #77
> [ 3.180284] core: 77 -> 5
> [ 6.651394] #78
> [ 3.180284] core: 78 -> 6
> [ 6.658850] #79
> [ 3.180284] core: 79 -> 7
> [ 6.666659] #80
> [ 3.180284] core: 80 -> 8
> [ 6.673780] #81
> [ 3.180284] core: 81 -> 9
> [ 6.681287] #82
> [ 3.180284] core: 82 -> 10
> [ 6.688870] #83
> [ 3.180284] core: 83 -> 11
> [ 6.696357] #84
> [ 3.180284] core: 84 -> 12
> [ 6.704053] #85
> [ 3.180284] core: 85 -> 13
> [ 6.711881] #86
> [ 3.180284] core: 86 -> 14
> [ 6.719458] #87
> [ 3.180284] core: 87 -> 15
> [ 6.726971] #88
> [ 3.180284] core: 88 -> 16
> [ 6.734662] #89
> [ 3.180284] core: 89 -> 17
> [ 6.742727] .... node #1, CPUs: #90
> [ 3.180284] core: 90 -> 18
> [ 6.751856] #91
> [ 3.180284] core: 91 -> 19
> [ 6.759556] #92
> [ 3.180284] core: 92 -> 20
> [ 6.767150] #93
> [ 3.180284] core: 93 -> 21
> [ 6.774786] #94
> [ 3.180284] core: 94 -> 22
> [ 6.782696] #95
> [ 3.180284] core: 95 -> 23
> [ 6.790660] #96
> [ 3.180284] core: 96 -> 24
> [ 6.797792] #97
> [ 3.180284] core: 97 -> 25
> [ 6.805373] #98
> [ 3.180284] core: 98 -> 26
> [ 6.813082] #99
> [ 3.180284] core: 99 -> 27
> [ 6.820835] #100
> [ 3.180284] core: 100 -> 28
> [ 6.828563] #101
> [ 3.180284] core: 101 -> 29
> [ 6.836315] #102
> [ 3.180284] core: 102 -> 30
> [ 6.844093] #103
> [ 3.180284] core: 103 -> 31
> [ 6.852007] #104
> [ 3.180284] core: 104 -> 32
> [ 6.859778] #105
> [ 3.180284] core: 105 -> 33
> [ 6.867486] #106
> [ 3.180284] core: 106 -> 34
> [ 6.875230] #107
> [ 3.180284] core: 107 -> 35
> [ 6.883085] .... node #2, CPUs: #108
> [ 3.180284] core: 108 -> 36
> [ 6.892931] #109
> [ 3.180284] core: 109 -> 37
> [ 6.900752] #110
> [ 3.180284] core: 110 -> 38
> [ 6.908510] #111
> [ 3.180284] core: 111 -> 39
> [ 6.916281] #112
> [ 3.180284] core: 112 -> 40
> [ 6.924094] #113
> [ 3.180284] core: 113 -> 41
> [ 6.931914] #114
> [ 3.180284] core: 114 -> 42
> [ 6.939622] #115
> [ 3.180284] core: 115 -> 43
> [ 6.947351] #116
> [ 3.180284] core: 116 -> 44
> [ 6.955130] #117
> [ 3.180284] core: 117 -> 45
> [ 6.962991] #118
> [ 3.180284] core: 118 -> 46
> [ 6.970725] #119
> [ 3.180284] core: 119 -> 47
> [ 6.978680] #120
> [ 3.180284] core: 120 -> 48
> [ 6.986676] #121
> [ 3.180284] core: 121 -> 49
> [ 6.994641] #122
> [ 3.180284] core: 122 -> 50
> [ 7.002541] #123
> [ 3.180284] core: 123 -> 51
> [ 7.009793] #124
> [ 3.180284] core: 124 -> 52
> [ 7.017584] #125
> [ 3.180284] core: 125 -> 53
> [ 7.025473] .... node #3, CPUs: #126
> [ 3.180284] core: 126 -> 54
> [ 7.035294] #127
> [ 3.180284] core: 127 -> 55
> [ 7.043254] #128
> [ 3.180284] core: 128 -> 56
> [ 7.051111] #129
> [ 3.180284] core: 129 -> 57
> [ 7.058952] #130
> [ 3.180284] core: 130 -> 58
> [ 7.066914] #131
> [ 3.180284] core: 131 -> 59
> [ 7.074839] #132
> [ 3.180284] core: 132 -> 60
> [ 7.082708] #133
> [ 3.180284] core: 133 -> 61
> [ 7.090720] #134
> [ 3.180284] core: 134 -> 62
> [ 7.098679] #135
> [ 3.180284] core: 135 -> 63
> [ 7.106669] #136
> [ 3.180284] core: 136 -> 64
> [ 7.114697] #137
> [ 3.180284] core: 137 -> 65
> [ 7.122632] #138
> [ 3.180284] core: 138 -> 66
> [ 7.130491] #139
> [ 3.180284] core: 139 -> 67
> [ 7.138475] #140
> [ 3.180284] core: 140 -> 68
> [ 7.145811] #141
> [ 3.180284] core: 141 -> 69
> [ 7.153662] #142
> [ 3.180284] core: 142 -> 70
> [ 7.161531] #143
> [ 3.180284] core: 143 -> 71
> [ 7.169477] smp: Brought up 4 nodes, 144 CPUs
> [ 7.170425] smpboot: Max logical packages: 4
> [ 7.174428] smpboot: Total of 144 processors activated (720297.64 BogoMIPS)
> [ 7.205286] devtmpfs: initialized
> [ 7.206478] x86/mm: Memory block size: 256MB
> [ 7.263082] PM: Registering ACPI NVS region [mem 0x0009f000-0x0009ffff] (4096 bytes)
> [ 7.266425] PM: Registering ACPI NVS region [mem 0x6764f000-0x68ca7fff] (23433216 bytes)
> [ 7.271171] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
> [ 7.274548] futex hash table entries: 65536 (order: 10, 4194304 bytes, vmalloc)
> [ 7.280264] pinctrl core: initialized pinctrl subsystem
> [ 7.282603] PM: RTC time: 01:14:52, date: 2020-07-13
> [ 7.286426] thermal_sys: Registered thermal governor 'fair_share'
> [ 7.286427] thermal_sys: Registered thermal governor 'bang_bang'
> [ 7.290424] thermal_sys: Registered thermal governor 'step_wise'
> [ 7.294424] thermal_sys: Registered thermal governor 'user_space'
> [ 7.298952] NET: Registered protocol family 16
> [ 7.306519] audit: initializing netlink subsys (disabled)
> [ 7.310453] audit: type=2000 audit(1594602887.744:1): state=initialized audit_enabled=0 res=1
> [ 7.318428] cpuidle: using governor ladder
> [ 7.322458] cpuidle: using governor menu
> [ 7.326558] ACPI: bus type PCI registered
> [ 7.330427] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
> [ 7.334676] PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000)
> [ 7.346506] PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] reserved in E820
> [ 7.350447] pmd_set_huge: Cannot satisfy [mem 0x80000000-0x80200000] with a huge-page mapping due to MTRR override.
> [ 7.362838] PCI: Using configuration type 1 for base access
> [ 7.418430] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
> [ 7.426888] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
> [ 7.430426] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
> [ 7.443003] ACPI: Added _OSI(Module Device)
> [ 7.446425] ACPI: Added _OSI(Processor Device)
> [ 7.450424] ACPI: Added _OSI(3.0 _SCP Extensions)
> [ 7.454424] ACPI: Added _OSI(Processor Aggregator Device)
> [ 7.462424] ACPI: Added _OSI(Linux-Dell-Video)
> [ 7.466424] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
> [ 7.470424] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
> [ 7.733400] ACPI: 8 ACPI AML tables successfully acquired and loaded
> [ 7.802010] ACPI: Dynamic OEM Table Load:
> [ 8.057172] ACPI: Dynamic OEM Table Load:
> [ 8.105768] ACPI: Dynamic OEM Table Load:
> [ 8.632513] ACPI: Interpreter enabled
> [ 8.638445] ACPI: (supports S0 S3 S4 S5)
> [ 8.642365] ACPI: Using IOAPIC for interrupt routing
> [ 8.646524] HEST: Table parsing has been initialized.
> [ 8.650426] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
> [ 8.696839] ACPI: Enabled 6 GPEs in block 00 to 7F
> [ 8.885041] ACPI: PCI Root Bridge [PC00] (domain 0000 [bus 00-14])
> [ 8.890428] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 8.902460] acpi PNP0A08:00: PCIe AER handled by firmware
> [ 8.906456] acpi PNP0A08:00: _OSC: platform does not support [LTR]
> [ 8.910480] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
> [ 8.919970] PCI host bridge to bus 0000:00
> [ 8.926425] pci_bus 0000:00: root bus resource [io 0x0000-0x0cf7 window]
> [ 8.930424] pci_bus 0000:00: root bus resource [io 0x1000-0x4fff window]
> [ 8.938424] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
> [ 8.946424] pci_bus 0000:00: root bus resource [mem 0x000c4000-0x000c7fff window]
> [ 8.954424] pci_bus 0000:00: root bus resource [mem 0xfe010000-0xfe010fff window]
> [ 8.962425] pci_bus 0000:00: root bus resource [mem 0x90000000-0x96bfffff window]
> [ 8.970424] pci_bus 0000:00: root bus resource [mem 0x200000000000-0x200fffffffff window]
> [ 8.978424] pci_bus 0000:00: root bus resource [bus 00-14]
> [ 8.982434] pci 0000:00:00.0: [8086:2020] type 00 class 0x060000
> [ 8.991125] pci 0000:00:04.0: [8086:2021] type 00 class 0x088000
> [ 8.994435] pci 0000:00:04.0: reg 0x10: [mem 0x200ffff2c000-0x200ffff2ffff 64bit]
> [ 9.003060] pci 0000:00:04.1: [8086:2021] type 00 class 0x088000
> [ 9.010434] pci 0000:00:04.1: reg 0x10: [mem 0x200ffff28000-0x200ffff2bfff 64bit]
> [ 9.019058] pci 0000:00:04.2: [8086:2021] type 00 class 0x088000
> [ 9.022434] pci 0000:00:04.2: reg 0x10: [mem 0x200ffff24000-0x200ffff27fff 64bit]
> [ 9.031055] pci 0000:00:04.3: [8086:2021] type 00 class 0x088000
> [ 9.038434] pci 0000:00:04.3: reg 0x10: [mem 0x200ffff20000-0x200ffff23fff 64bit]
> [ 9.047057] pci 0000:00:04.4: [8086:2021] type 00 class 0x088000
> [ 9.050433] pci 0000:00:04.4: reg 0x10: [mem 0x200ffff1c000-0x200ffff1ffff 64bit]
> [ 9.059054] pci 0000:00:04.5: [8086:2021] type 00 class 0x088000
> [ 9.066434] pci 0000:00:04.5: reg 0x10: [mem 0x200ffff18000-0x200ffff1bfff 64bit]
> [ 9.075053] pci 0000:00:04.6: [8086:2021] type 00 class 0x088000
> [ 9.078433] pci 0000:00:04.6: reg 0x10: [mem 0x200ffff14000-0x200ffff17fff 64bit]
> [ 9.087053] pci 0000:00:04.7: [8086:2021] type 00 class 0x088000
> [ 9.094435] pci 0000:00:04.7: reg 0x10: [mem 0x200ffff10000-0x200ffff13fff 64bit]
> [ 9.103057] pci 0000:00:05.0: [8086:2024] type 00 class 0x088000
> [ 9.107053] pci 0000:00:05.2: [8086:2025] type 00 class 0x088000
> [ 9.115047] pci 0000:00:05.4: [8086:2026] type 00 class 0x080020
> [ 9.122434] pci 0000:00:05.4: reg 0x10: [mem 0x9230b000-0x9230bfff]
> [ 9.127048] pci 0000:00:08.0: [8086:2014] type 00 class 0x088000
> [ 9.135044] pci 0000:00:08.1: [8086:2015] type 00 class 0x110100
> [ 9.143023] pci 0000:00:08.2: [8086:2016] type 00 class 0x088000
> [ 9.147048] pci 0000:00:11.0: [8086:a1ec] type 00 class 0xff0000
> [ 9.155067] pci 0000:00:11.1: [8086:a1ed] type 00 class 0xff0000
> [ 9.163068] pci 0000:00:11.5: [8086:a1d2] type 00 class 0x010601
> [ 9.166441] pci 0000:00:11.5: reg 0x10: [mem 0x92306000-0x92307fff]
> [ 9.174430] pci 0000:00:11.5: reg 0x14: [mem 0x9230a000-0x9230a0ff]
> [ 9.178430] pci 0000:00:11.5: reg 0x18: [io 0x4068-0x406f]
> [ 9.186430] pci 0000:00:11.5: reg 0x1c: [io 0x4074-0x4077]
> [ 9.190430] pci 0000:00:11.5: reg 0x20: [io 0x4040-0x405f]
> [ 9.194430] pci 0000:00:11.5: reg 0x24: [mem 0x92280000-0x922fffff]
> [ 9.202462] pci 0000:00:11.5: PME# supported from D3hot
> [ 9.207075] pci 0000:00:14.0: [8086:a1af] type 00 class 0x0c0330
> [ 9.214445] pci 0000:00:14.0: reg 0x10: [mem 0x200ffff00000-0x200ffff0ffff 64bit]
> [ 9.222488] pci 0000:00:14.0: PME# supported from D3hot D3cold
> [ 9.227060] pci 0000:00:14.2: [8086:a1b1] type 00 class 0x118000
> [ 9.234444] pci 0000:00:14.2: reg 0x10: [mem 0x200ffff34000-0x200ffff34fff 64bit]
> [ 9.243100] pci 0000:00:16.0: [8086:a1ba] type 00 class 0x078000
> [ 9.250456] pci 0000:00:16.0: reg 0x10: [mem 0x200ffff33000-0x200ffff33fff 64bit]
> [ 9.254508] pci 0000:00:16.0: PME# supported from D3hot
> [ 9.263037] pci 0000:00:16.1: [8086:a1bb] type 00 class 0x078000
> [ 9.266452] pci 0000:00:16.1: reg 0x10: [mem 0x200ffff32000-0x200ffff32fff 64bit]
> [ 9.274508] pci 0000:00:16.1: PME# supported from D3hot
> [ 9.283041] pci 0000:00:16.4: [8086:a1be] type 00 class 0x078000
> [ 9.286451] pci 0000:00:16.4: reg 0x10: [mem 0x200ffff31000-0x200ffff31fff 64bit]
> [ 9.294506] pci 0000:00:16.4: PME# supported from D3hot
> [ 9.299040] pci 0000:00:17.0: [8086:a182] type 00 class 0x010601
> [ 9.306441] pci 0000:00:17.0: reg 0x10: [mem 0x92304000-0x92305fff]
> [ 9.314436] pci 0000:00:17.0: reg 0x14: [mem 0x92309000-0x923090ff]
> [ 9.318430] pci 0000:00:17.0: reg 0x18: [io 0x4060-0x4067]
> [ 9.326430] pci 0000:00:17.0: reg 0x1c: [io 0x4070-0x4073]
> [ 9.330430] pci 0000:00:17.0: reg 0x20: [io 0x4020-0x403f]
> [ 9.334430] pci 0000:00:17.0: reg 0x24: [mem 0x92200000-0x9227ffff]
> [ 9.342487] pci 0000:00:17.0: PME# supported from D3hot
> [ 9.347101] pci 0000:00:1c.0: [8086:a194] type 01 class 0x060400
> [ 9.354503] pci 0000:00:1c.0: PME# supported from D0 D3hot D3cold
> [ 9.359109] pci 0000:00:1c.5: [8086:a195] type 01 class 0x060400
> [ 9.366500] pci 0000:00:1c.5: PME# supported from D0 D3hot D3cold
> [ 9.375097] pci 0000:00:1f.0: [8086:a1c6] type 00 class 0x060100
> [ 9.379124] pci 0000:00:1f.2: [8086:a1a1] type 00 class 0x058000
> [ 9.386437] pci 0000:00:1f.2: reg 0x10: [mem 0x92300000-0x92303fff]
> [ 9.395086] pci 0000:00:1f.4: [8086:a1a3] type 00 class 0x0c0500
> [ 9.398441] pci 0000:00:1f.4: reg 0x10: [mem 0x200ffff30000-0x200ffff300ff 64bit]
> [ 9.406440] pci 0000:00:1f.4: reg 0x20: [io 0x4000-0x401f]
> [ 9.411041] pci 0000:00:1f.5: [8086:a1a4] type 00 class 0x0c8000
> [ 9.418440] pci 0000:00:1f.5: reg 0x10: [mem 0xfe010000-0xfe010fff]
> [ 9.427146] pci 0000:01:00.0: [8086:1533] type 00 class 0x020000
> [ 9.430463] pci 0000:01:00.0: reg 0x10: [mem 0x92100000-0x9217ffff]
> [ 9.438450] pci 0000:01:00.0: reg 0x18: [io 0x3000-0x301f]
> [ 9.442441] pci 0000:01:00.0: reg 0x1c: [mem 0x92180000-0x92183fff]
> [ 9.450559] pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
> [ 9.454571] pci 0000:00:1c.0: PCI bridge to [bus 01]
> [ 9.462426] pci 0000:00:1c.0: bridge window [io 0x3000-0x3fff]
> [ 9.466426] pci 0000:00:1c.0: bridge window [mem 0x92100000-0x921fffff]
> [ 9.474487] pci 0000:02:00.0: [1a03:1150] type 01 class 0x060400
> [ 9.478505] pci 0000:02:00.0: enabling Extended Tags
> [ 9.486506] pci 0000:02:00.0: supports D1 D2
> [ 9.490424] pci 0000:02:00.0: PME# supported from D0 D1 D2 D3hot D3cold
> [ 9.494540] pci 0000:00:1c.5: PCI bridge to [bus 02-03]
> [ 9.502426] pci 0000:00:1c.5: bridge window [io 0x2000-0x2fff]
> [ 9.506425] pci 0000:00:1c.5: bridge window [mem 0x91000000-0x920fffff]
> [ 9.514477] pci_bus 0000:03: extended config space not accessible
> [ 9.518451] pci 0000:03:00.0: [1a03:2000] type 00 class 0x030000
> [ 9.526451] pci 0000:03:00.0: reg 0x10: [mem 0x91000000-0x91ffffff]
> [ 9.534439] pci 0000:03:00.0: reg 0x14: [mem 0x92000000-0x9201ffff]
> [ 9.538438] pci 0000:03:00.0: reg 0x18: [io 0x2000-0x207f]
> [ 9.542492] pci 0000:03:00.0: BAR 0: assigned to efifb
> [ 9.550478] pci 0000:03:00.0: supports D1 D2
> [ 9.554424] pci 0000:03:00.0: PME# supported from D0 D1 D2 D3hot D3cold
> [ 9.562538] pci 0000:02:00.0: PCI bridge to [bus 03]
> [ 9.566432] pci 0000:02:00.0: bridge window [io 0x2000-0x2fff]
> [ 9.570430] pci 0000:02:00.0: bridge window [mem 0x91000000-0x920fffff]
> [ 9.578451] pci_bus 0000:00: on NUMA node 0
> [ 9.578969] ACPI: PCI Root Bridge [PC01] (domain 0000 [bus 15-22])
> [ 9.586427] acpi PNP0A08:01: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 9.594489] acpi PNP0A08:01: PCIe AER handled by firmware
> [ 9.598845] acpi PNP0A08:01: _OSC: OS now controls [PCIeHotplug PME PCIeCapability LTR]
> [ 9.606583] PCI host bridge to bus 0000:15
> [ 9.610425] pci_bus 0000:15: root bus resource [io 0x5000-0x5fff window]
> [ 9.618424] pci_bus 0000:15: root bus resource [mem 0x96c00000-0x9d7fffff window]
> [ 9.626424] pci_bus 0000:15: root bus resource [mem 0x201000000000-0x201fffffffff window]
> [ 9.634425] pci_bus 0000:15: root bus resource [bus 15-22]
> [ 9.638433] pci 0000:15:05.0: [8086:2034] type 00 class 0x088000
> [ 9.646515] pci 0000:15:05.2: [8086:2035] type 00 class 0x088000
> [ 9.650505] pci 0000:15:05.4: [8086:2036] type 00 class 0x080020
> [ 9.658432] pci 0000:15:05.4: reg 0x10: [mem 0x96c00000-0x96c00fff]
> [ 9.662511] pci 0000:15:08.0: [8086:208d] type 00 class 0x088000
> [ 9.670496] pci 0000:15:08.1: [8086:208d] type 00 class 0x088000
> [ 9.678480] pci 0000:15:08.2: [8086:208d] type 00 class 0x088000
> [ 9.682481] pci 0000:15:08.3: [8086:208d] type 00 class 0x088000
> [ 9.690480] pci 0000:15:08.4: [8086:208d] type 00 class 0x088000
> [ 9.694481] pci 0000:15:08.5: [8086:208d] type 00 class 0x088000
> [ 9.702480] pci 0000:15:08.6: [8086:208d] type 00 class 0x088000
> [ 9.706477] pci 0000:15:08.7: [8086:208d] type 00 class 0x088000
> [ 9.714481] pci 0000:15:09.0: [8086:208d] type 00 class 0x088000
> [ 9.718491] pci 0000:15:09.1: [8086:208d] type 00 class 0x088000
> [ 9.726478] pci 0000:15:09.2: [8086:208d] type 00 class 0x088000
> [ 9.730480] pci 0000:15:09.3: [8086:208d] type 00 class 0x088000
> [ 9.738480] pci 0000:15:09.4: [8086:208d] type 00 class 0x088000
> [ 9.742481] pci 0000:15:09.5: [8086:208d] type 00 class 0x088000
> [ 9.750478] pci 0000:15:09.6: [8086:208d] type 00 class 0x088000
> [ 9.754479] pci 0000:15:09.7: [8086:208d] type 00 class 0x088000
> [ 9.762481] pci 0000:15:0a.0: [8086:208d] type 00 class 0x088000
> [ 9.766488] pci 0000:15:0a.1: [8086:208d] type 00 class 0x088000
> [ 9.774480] pci 0000:15:0a.2: [8086:208d] type 00 class 0x088000
> [ 9.778480] pci 0000:15:0a.3: [8086:208d] type 00 class 0x088000
> [ 9.786480] pci 0000:15:0a.4: [8086:208d] type 00 class 0x088000
> [ 9.790482] pci 0000:15:0a.5: [8086:208d] type 00 class 0x088000
> [ 9.798479] pci 0000:15:0a.6: [8086:208d] type 00 class 0x088000
> [ 9.802479] pci 0000:15:0a.7: [8086:208d] type 00 class 0x088000
> [ 9.810481] pci 0000:15:0b.0: [8086:208d] type 00 class 0x088000
> [ 9.814488] pci 0000:15:0b.1: [8086:208d] type 00 class 0x088000
> [ 9.822483] pci 0000:15:0b.2: [8086:208d] type 00 class 0x088000
> [ 9.826479] pci 0000:15:0b.3: [8086:208d] type 00 class 0x088000
> [ 9.834485] pci 0000:15:0e.0: [8086:208e] type 00 class 0x088000
> [ 9.838490] pci 0000:15:0e.1: [8086:208e] type 00 class 0x088000
> [ 9.846482] pci 0000:15:0e.2: [8086:208e] type 00 class 0x088000
> [ 9.850481] pci 0000:15:0e.3: [8086:208e] type 00 class 0x088000
> [ 9.858481] pci 0000:15:0e.4: [8086:208e] type 00 class 0x088000
> [ 9.862478] pci 0000:15:0e.5: [8086:208e] type 00 class 0x088000
> [ 9.870481] pci 0000:15:0e.6: [8086:208e] type 00 class 0x088000
> [ 9.874478] pci 0000:15:0e.7: [8086:208e] type 00 class 0x088000
> [ 9.882483] pci 0000:15:0f.0: [8086:208e] type 00 class 0x088000
> [ 9.886495] pci 0000:15:0f.1: [8086:208e] type 00 class 0x088000
> [ 9.894479] pci 0000:15:0f.2: [8086:208e] type 00 class 0x088000
> [ 9.898480] pci 0000:15:0f.3: [8086:208e] type 00 class 0x088000
> [ 9.906482] pci 0000:15:0f.4: [8086:208e] type 00 class 0x088000
> [ 9.910480] pci 0000:15:0f.5: [8086:208e] type 00 class 0x088000
> [ 9.918482] pci 0000:15:0f.6: [8086:208e] type 00 class 0x088000
> [ 9.926480] pci 0000:15:0f.7: [8086:208e] type 00 class 0x088000
> [ 9.930484] pci 0000:15:10.0: [8086:208e] type 00 class 0x088000
> [ 9.938492] pci 0000:15:10.1: [8086:208e] type 00 class 0x088000
> [ 9.942479] pci 0000:15:10.2: [8086:208e] type 00 class 0x088000
> [ 9.950479] pci 0000:15:10.3: [8086:208e] type 00 class 0x088000
> [ 9.954481] pci 0000:15:10.4: [8086:208e] type 00 class 0x088000
> [ 9.962484] pci 0000:15:10.5: [8086:208e] type 00 class 0x088000
> [ 9.966487] pci 0000:15:10.6: [8086:208e] type 00 class 0x088000
> [ 9.974477] pci 0000:15:10.7: [8086:208e] type 00 class 0x088000
> [ 9.978483] pci 0000:15:11.0: [8086:208e] type 00 class 0x088000
> [ 9.986490] pci 0000:15:11.1: [8086:208e] type 00 class 0x088000
> [ 9.990479] pci 0000:15:11.2: [8086:208e] type 00 class 0x088000
> [ 9.998482] pci 0000:15:11.3: [8086:208e] type 00 class 0x088000
> [ 10.002493] pci 0000:15:1d.0: [8086:2054] type 00 class 0x088000
> [ 10.010492] pci 0000:15:1d.1: [8086:2055] type 00 class 0x088000
> [ 10.014482] pci 0000:15:1d.2: [8086:2056] type 00 class 0x088000
> [ 10.022479] pci 0000:15:1d.3: [8086:2057] type 00 class 0x088000
> [ 10.026488] pci 0000:15:1e.0: [8086:2080] type 00 class 0x088000
> [ 10.034492] pci 0000:15:1e.1: [8086:2081] type 00 class 0x088000
> [ 10.038484] pci 0000:15:1e.2: [8086:2082] type 00 class 0x088000
> [ 10.046482] pci 0000:15:1e.3: [8086:2083] type 00 class 0x088000
> [ 10.050603] pci 0000:15:1e.4: [8086:2084] type 00 class 0x088000
> [ 10.058478] pci 0000:15:1e.5: [8086:2085] type 00 class 0x088000
> [ 10.062483] pci 0000:15:1e.6: [8086:2086] type 00 class 0x088000
> [ 10.070483] pci_bus 0000:15: on NUMA node 0
> [ 10.070637] ACPI: PCI Root Bridge [PC02] (domain 0000 [bus 23-30])
> [ 10.074426] acpi PNP0A08:02: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 10.086489] acpi PNP0A08:02: PCIe AER handled by firmware
> [ 10.090843] acpi PNP0A08:02: _OSC: OS now controls [PCIeHotplug PME PCIeCapability LTR]
> [ 10.098539] PCI host bridge to bus 0000:23
> [ 10.102425] pci_bus 0000:23: root bus resource [io 0x6000-0x6fff window]
> [ 10.110424] pci_bus 0000:23: root bus resource [mem 0x9d800000-0xa43fffff window]
> [ 10.118424] pci_bus 0000:23: root bus resource [mem 0x202000000000-0x202fffffffff window]
> [ 10.126424] pci_bus 0000:23: root bus resource [bus 23-30]
> [ 10.130433] pci 0000:23:05.0: [8086:2034] type 00 class 0x088000
> [ 10.138507] pci 0000:23:05.2: [8086:2035] type 00 class 0x088000
> [ 10.142498] pci 0000:23:05.4: [8086:2036] type 00 class 0x080020
> [ 10.150432] pci 0000:23:05.4: reg 0x10: [mem 0x9d800000-0x9d800fff]
> [ 10.154505] pci 0000:23:08.0: [8086:2066] type 00 class 0x088000
> [ 10.162504] pci 0000:23:09.0: [8086:2066] type 00 class 0x088000
> [ 10.166504] pci 0000:23:0a.0: [8086:2040] type 00 class 0x088000
> [ 10.174513] pci 0000:23:0a.1: [8086:2041] type 00 class 0x088000
> [ 10.178488] pci 0000:23:0a.2: [8086:2042] type 00 class 0x088000
> [ 10.186498] pci 0000:23:0a.3: [8086:2043] type 00 class 0x088000
> [ 10.190490] pci 0000:23:0a.4: [8086:2044] type 00 class 0x088000
> [ 10.198486] pci 0000:23:0a.5: [8086:2045] type 00 class 0x088000
> [ 10.202495] pci 0000:23:0a.6: [8086:2046] type 00 class 0x088000
> [ 10.210489] pci 0000:23:0a.7: [8086:2047] type 00 class 0x088000
> [ 10.214487] pci 0000:23:0b.0: [8086:2048] type 00 class 0x088000
> [ 10.222499] pci 0000:23:0b.1: [8086:2049] type 00 class 0x088000
> [ 10.226489] pci 0000:23:0b.2: [8086:204a] type 00 class 0x088000
> [ 10.234487] pci 0000:23:0b.3: [8086:204b] type 00 class 0x088000
> [ 10.238490] pci 0000:23:0c.0: [8086:2040] type 00 class 0x088000
> [ 10.246500] pci 0000:23:0c.1: [8086:2041] type 00 class 0x088000
> [ 10.250490] pci 0000:23:0c.2: [8086:2042] type 00 class 0x088000
> [ 10.258487] pci 0000:23:0c.3: [8086:2043] type 00 class 0x088000
> [ 10.262489] pci 0000:23:0c.4: [8086:2044] type 00 class 0x088000
> [ 10.270491] pci 0000:23:0c.5: [8086:2045] type 00 class 0x088000
> [ 10.278488] pci 0000:23:0c.6: [8086:2046] type 00 class 0x088000
> [ 10.282490] pci 0000:23:0c.7: [8086:2047] type 00 class 0x088000
> [ 10.290492] pci 0000:23:0d.0: [8086:2048] type 00 class 0x088000
> [ 10.294498] pci 0000:23:0d.1: [8086:2049] type 00 class 0x088000
> [ 10.302494] pci 0000:23:0d.2: [8086:204a] type 00 class 0x088000
> [ 10.306491] pci 0000:23:0d.3: [8086:204b] type 00 class 0x088000
> [ 10.314499] pci_bus 0000:23: on NUMA node 0
> [ 10.314626] ACPI: PCI Root Bridge [UC03] (domain 0000 [bus 31])
> [ 10.318426] acpi PNP0A08:03: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 10.326476] acpi PNP0A08:03: PCIe AER handled by firmware
> [ 10.334490] acpi PNP0A08:03: _OSC: platform does not support [LTR]
> [ 10.338545] acpi PNP0A08:03: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
> [ 10.346477] PCI host bridge to bus 0000:31
> [ 10.350424] pci_bus 0000:31: root bus resource [bus 31]
> [ 10.358435] pci 0000:31:0e.0: [8086:2058] type 00 class 0x110100
> [ 10.362483] pci 0000:31:0e.1: [8086:2059] type 00 class 0x088000
> [ 10.370480] pci 0000:31:0e.4: [8086:2058] type 00 class 0xffffff
> [ 10.374478] pci 0000:31:0e.5: [8086:2059] type 00 class 0xffffff
> [ 10.382489] pci 0000:31:0e.6: [8086:205a] type 00 class 0xffffff
> [ 10.386475] pci 0000:31:0f.0: [8086:2058] type 00 class 0x110100
> [ 10.394478] pci 0000:31:0f.1: [8086:2059] type 00 class 0x088000
> [ 10.398480] pci 0000:31:0f.3: [8086:205b] type 00 class 0xffffff
> [ 10.406475] pci 0000:31:0f.4: [8086:2058] type 00 class 0xffffff
> [ 10.410477] pci 0000:31:0f.5: [8086:2059] type 00 class 0xffffff
> [ 10.418479] pci 0000:31:0f.6: [8086:205a] type 00 class 0xffffff
> [ 10.422476] pci 0000:31:10.0: [8086:2058] type 00 class 0x110100
> [ 10.430480] pci 0000:31:10.1: [8086:2059] type 00 class 0x088000
> [ 10.434478] pci 0000:31:10.4: [8086:2058] type 00 class 0xffffff
> [ 10.442479] pci 0000:31:10.5: [8086:2059] type 00 class 0xffffff
> [ 10.446480] pci 0000:31:10.6: [8086:205a] type 00 class 0xffffff
> [ 10.454477] pci 0000:31:12.0: [8086:204c] type 00 class 0x110100
> [ 10.458476] pci 0000:31:12.1: [8086:204d] type 00 class 0x110100
> [ 10.466466] pci 0000:31:12.2: [8086:204e] type 00 class 0x088000
> [ 10.470466] pci 0000:31:13.0: [8086:204c] type 00 class 0xffffff
> [ 10.478478] pci 0000:31:13.1: [8086:204d] type 00 class 0xffffff
> [ 10.482463] pci 0000:31:13.2: [8086:204e] type 00 class 0xffffff
> [ 10.490463] pci 0000:31:13.3: [8086:204f] type 00 class 0xffffff
> [ 10.494468] pci 0000:31:14.0: [8086:204c] type 00 class 0xffffff
> [ 10.502475] pci 0000:31:14.1: [8086:204d] type 00 class 0xffffff
> [ 10.506465] pci 0000:31:14.2: [8086:204e] type 00 class 0xffffff
> [ 10.514462] pci 0000:31:14.3: [8086:204f] type 00 class 0xffffff
> [ 10.518467] pci 0000:31:15.0: [8086:2018] type 00 class 0x088000
> [ 10.526464] pci 0000:31:15.1: [8086:2088] type 00 class 0x110100
> [ 10.530464] pci 0000:31:16.0: [8086:2018] type 00 class 0x088000
> [ 10.538462] pci 0000:31:16.1: [8086:2088] type 00 class 0x110100
> [ 10.542465] pci 0000:31:17.0: [8086:2018] type 00 class 0x088000
> [ 10.550464] pci 0000:31:17.1: [8086:2088] type 00 class 0x110100
> [ 10.554469] pci_bus 0000:31: on NUMA node 0
> [ 10.554538] ACPI: PCI Root Bridge [PC04] (domain 0000 [bus 32-3d])
> [ 10.562426] acpi PNP0A08:04: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 10.570485] acpi PNP0A08:04: PCIe AER handled by firmware
> [ 10.578839] acpi PNP0A08:04: _OSC: OS now controls [PCIeHotplug PME PCIeCapability LTR]
> [ 10.586501] PCI host bridge to bus 0000:32
> [ 10.590425] pci_bus 0000:32: root bus resource [io 0x7000-0x7fff window]
> [ 10.594424] pci_bus 0000:32: root bus resource [mem 0xa4400000-0xaaafffff window]
> [ 10.602424] pci_bus 0000:32: root bus resource [mem 0x203000000000-0x203fffffffff window]
> [ 10.610424] pci_bus 0000:32: root bus resource [bus 32-3d]
> [ 10.618433] pci 0000:32:05.0: [8086:2034] type 00 class 0x088000
> [ 10.622512] pci 0000:32:05.2: [8086:2035] type 00 class 0x088000
> [ 10.630492] pci 0000:32:05.4: [8086:2036] type 00 class 0x080020
> [ 10.634432] pci 0000:32:05.4: reg 0x10: [mem 0xa4400000-0xa4400fff]
> [ 10.642504] pci_bus 0000:32: on NUMA node 0
> [ 10.642613] ACPI: PCI Root Bridge [PC06] (domain 0000 [bus 40-42])
> [ 10.646425] acpi PNP0A08:05: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 10.658484] acpi PNP0A08:05: PCIe AER handled by firmware
> [ 10.662708] acpi PNP0A08:05: _OSC: OS now controls [PCIeHotplug PME PCIeCapability LTR]
> [ 10.670526] PCI host bridge to bus 0000:40
> [ 10.674424] pci_bus 0000:40: root bus resource [io 0x8000-0x8fff window]
> [ 10.682424] pci_bus 0000:40: root bus resource [mem 0xab000000-0xb1bfffff window]
> [ 10.690425] pci_bus 0000:40: root bus resource [mem 0x204000000000-0x204fffffffff window]
> [ 10.698424] pci_bus 0000:40: root bus resource [bus 40-42]
> [ 10.702432] pci 0000:40:04.0: [8086:2021] type 00 class 0x088000
> [ 10.710433] pci 0000:40:04.0: reg 0x10: [mem 0x204ffff1c000-0x204ffff1ffff 64bit]
> [ 10.714511] pci 0000:40:04.1: [8086:2021] type 00 class 0x088000
> [ 10.722433] pci 0000:40:04.1: reg 0x10: [mem 0x204ffff18000-0x204ffff1bfff 64bit]
> [ 10.730508] pci 0000:40:04.2: [8086:2021] type 00 class 0x088000
> [ 10.734432] pci 0000:40:04.2: reg 0x10: [mem 0x204ffff14000-0x204ffff17fff 64bit]
> [ 10.742503] pci 0000:40:04.3: [8086:2021] type 00 class 0x088000
> [ 10.750432] pci 0000:40:04.3: reg 0x10: [mem 0x204ffff10000-0x204ffff13fff 64bit]
> [ 10.758502] pci 0000:40:04.4: [8086:2021] type 00 class 0x088000
> [ 10.762432] pci 0000:40:04.4: reg 0x10: [mem 0x204ffff0c000-0x204ffff0ffff 64bit]
> [ 10.770501] pci 0000:40:04.5: [8086:2021] type 00 class 0x088000
> [ 10.774432] pci 0000:40:04.5: reg 0x10: [mem 0x204ffff08000-0x204ffff0bfff 64bit]
> [ 10.782505] pci 0000:40:04.6: [8086:2021] type 00 class 0x088000
> [ 10.790432] pci 0000:40:04.6: reg 0x10: [mem 0x204ffff04000-0x204ffff07fff 64bit]
> [ 10.798501] pci 0000:40:04.7: [8086:2021] type 00 class 0x088000
> [ 10.802432] pci 0000:40:04.7: reg 0x10: [mem 0x204ffff00000-0x204ffff03fff 64bit]
> [ 10.810510] pci 0000:40:05.0: [8086:2024] type 00 class 0x088000
> [ 10.818504] pci 0000:40:05.2: [8086:2025] type 00 class 0x088000
> [ 10.822502] pci 0000:40:05.4: [8086:2026] type 00 class 0x080020
> [ 10.830431] pci 0000:40:05.4: reg 0x10: [mem 0xab000000-0xab000fff]
> [ 10.834495] pci 0000:40:08.0: [8086:2014] type 00 class 0x088000
> [ 10.842490] pci 0000:40:08.1: [8086:2015] type 00 class 0x110100
> [ 10.846472] pci 0000:40:08.2: [8086:2016] type 00 class 0x088000
> [ 10.854489] pci_bus 0000:40: on NUMA node 1
> [ 10.854574] ACPI: PCI Root Bridge [PC07] (domain 0000 [bus 43-56])
> [ 10.858426] acpi PNP0A08:06: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 10.870486] acpi PNP0A08:06: PCIe AER handled by firmware
> [ 10.874840] acpi PNP0A08:06: _OSC: OS now controls [PCIeHotplug PME PCIeCapability LTR]
> [ 10.882568] PCI host bridge to bus 0000:43
> [ 10.886425] pci_bus 0000:43: root bus resource [io 0x9000-0x9fff window]
> [ 10.894424] pci_bus 0000:43: root bus resource [mem 0xb1c00000-0xb87fffff window]
> [ 10.902424] pci_bus 0000:43: root bus resource [mem 0x205000000000-0x205fffffffff window]
> [ 10.910424] pci_bus 0000:43: root bus resource [bus 43-56]
> [ 10.914432] pci 0000:43:05.0: [8086:2034] type 00 class 0x088000
> [ 10.922507] pci 0000:43:05.2: [8086:2035] type 00 class 0x088000
> [ 10.926502] pci 0000:43:05.4: [8086:2036] type 00 class 0x080020
> [ 10.934432] pci 0000:43:05.4: reg 0x10: [mem 0xb1c00000-0xb1c00fff]
> [ 10.938502] pci 0000:43:08.0: [8086:208d] type 00 class 0x088000
> [ 10.946493] pci 0000:43:08.1: [8086:208d] type 00 class 0x088000
> [ 10.950478] pci 0000:43:08.2: [8086:208d] type 00 class 0x088000
> [ 10.958475] pci 0000:43:08.3: [8086:208d] type 00 class 0x088000
> [ 10.962477] pci 0000:43:08.4: [8086:208d] type 00 class 0x088000
> [ 10.970475] pci 0000:43:08.5: [8086:208d] type 00 class 0x088000
> [ 10.974476] pci 0000:43:08.6: [8086:208d] type 00 class 0x088000
> [ 10.982479] pci 0000:43:08.7: [8086:208d] type 00 class 0x088000
> [ 10.986475] pci 0000:43:09.0: [8086:208d] type 00 class 0x088000
> [ 10.994487] pci 0000:43:09.1: [8086:208d] type 00 class 0x088000
> [ 10.998479] pci 0000:43:09.2: [8086:208d] type 00 class 0x088000
> [ 11.006475] pci 0000:43:09.3: [8086:208d] type 00 class 0x088000
> [ 11.010477] pci 0000:43:09.4: [8086:208d] type 00 class 0x088000
> [ 11.018476] pci 0000:43:09.5: [8086:208d] type 00 class 0x088000
> [ 11.022478] pci 0000:43:09.6: [8086:208d] type 00 class 0x088000
> [ 11.030479] pci 0000:43:09.7: [8086:208d] type 00 class 0x088000
> [ 11.034476] pci 0000:43:0a.0: [8086:208d] type 00 class 0x088000
> [ 11.042488] pci 0000:43:0a.1: [8086:208d] type 00 class 0x088000
> [ 11.046478] pci 0000:43:0a.2: [8086:208d] type 00 class 0x088000
> [ 11.054474] pci 0000:43:0a.3: [8086:208d] type 00 class 0x088000
> [ 11.058476] pci 0000:43:0a.4: [8086:208d] type 00 class 0x088000
> [ 11.066474] pci 0000:43:0a.5: [8086:208d] type 00 class 0x088000
> [ 11.070476] pci 0000:43:0a.6: [8086:208d] type 00 class 0x088000
> [ 11.078478] pci 0000:43:0a.7: [8086:208d] type 00 class 0x088000
> [ 11.082478] pci 0000:43:0b.0: [8086:208d] type 00 class 0x088000
> [ 11.090488] pci 0000:43:0b.1: [8086:208d] type 00 class 0x088000
> [ 11.094475] pci 0000:43:0b.2: [8086:208d] type 00 class 0x088000
> [ 11.102477] pci 0000:43:0b.3: [8086:208d] type 00 class 0x088000
> [ 11.106485] pci 0000:43:0e.0: [8086:208e] type 00 class 0x088000
> [ 11.114484] pci 0000:43:0e.1: [8086:208e] type 00 class 0x088000
> [ 11.118476] pci 0000:43:0e.2: [8086:208e] type 00 class 0x088000
> [ 11.126478] pci 0000:43:0e.3: [8086:208e] type 00 class 0x088000
> [ 11.130474] pci 0000:43:0e.4: [8086:208e] type 00 class 0x088000
> [ 11.138476] pci 0000:43:0e.5: [8086:208e] type 00 class 0x088000
> [ 11.142475] pci 0000:43:0e.6: [8086:208e] type 00 class 0x088000
> [ 11.150477] pci 0000:43:0e.7: [8086:208e] type 00 class 0x088000
> [ 11.154476] pci 0000:43:0f.0: [8086:208e] type 00 class 0x088000
> [ 11.162486] pci 0000:43:0f.1: [8086:208e] type 00 class 0x088000
> [ 11.170486] pci 0000:43:0f.2: [8086:208e] type 00 class 0x088000
> [ 11.174479] pci 0000:43:0f.3: [8086:208e] type 00 class 0x088000
> [ 11.182476] pci 0000:43:0f.4: [8086:208e] type 00 class 0x088000
> [ 11.186478] pci 0000:43:0f.5: [8086:208e] type 00 class 0x088000
> [ 11.194475] pci 0000:43:0f.6: [8086:208e] type 00 class 0x088000
> [ 11.198478] pci 0000:43:0f.7: [8086:208e] type 00 class 0x088000
> [ 11.206481] pci 0000:43:10.0: [8086:208e] type 00 class 0x088000
> [ 11.210492] pci 0000:43:10.1: [8086:208e] type 00 class 0x088000
> [ 11.218478] pci 0000:43:10.2: [8086:208e] type 00 class 0x088000
> [ 11.222477] pci 0000:43:10.3: [8086:208e] type 00 class 0x088000
> [ 11.230477] pci 0000:43:10.4: [8086:208e] type 00 class 0x088000
> [ 11.234477] pci 0000:43:10.5: [8086:208e] type 00 class 0x088000
> [ 11.242475] pci 0000:43:10.6: [8086:208e] type 00 class 0x088000
> [ 11.246477] pci 0000:43:10.7: [8086:208e] type 00 class 0x088000
> [ 11.254476] pci 0000:43:11.0: [8086:208e] type 00 class 0x088000
> [ 11.258488] pci 0000:43:11.1: [8086:208e] type 00 class 0x088000
> [ 11.266477] pci 0000:43:11.2: [8086:208e] type 00 class 0x088000
> [ 11.270477] pci 0000:43:11.3: [8086:208e] type 00 class 0x088000
> [ 11.278490] pci 0000:43:1d.0: [8086:2054] type 00 class 0x088000
> [ 11.282489] pci 0000:43:1d.1: [8086:2055] type 00 class 0x088000
> [ 11.290476] pci 0000:43:1d.2: [8086:2056] type 00 class 0x088000
> [ 11.294478] pci 0000:43:1d.3: [8086:2057] type 00 class 0x088000
> [ 11.302481] pci 0000:43:1e.0: [8086:2080] type 00 class 0x088000
> [ 11.306485] pci 0000:43:1e.1: [8086:2081] type 00 class 0x088000
> [ 11.314487] pci 0000:43:1e.2: [8086:2082] type 00 class 0x088000
> [ 11.318481] pci 0000:43:1e.3: [8086:2083] type 00 class 0x088000
> [ 11.326477] pci 0000:43:1e.4: [8086:2084] type 00 class 0x088000
> [ 11.330477] pci 0000:43:1e.5: [8086:2085] type 00 class 0x088000
> [ 11.338476] pci 0000:43:1e.6: [8086:2086] type 00 class 0x088000
> [ 11.342480] pci_bus 0000:43: on NUMA node 1
> [ 11.342648] ACPI: PCI Root Bridge [PC08] (domain 0000 [bus 57-6a])
> [ 11.350426] acpi PNP0A08:07: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 11.358486] acpi PNP0A08:07: PCIe AER handled by firmware
> [ 11.366841] acpi PNP0A08:07: _OSC: OS now controls [PCIeHotplug PME PCIeCapability LTR]
> [ 11.374545] PCI host bridge to bus 0000:57
> [ 11.378425] pci_bus 0000:57: root bus resource [io 0xa000-0xafff window]
> [ 11.382424] pci_bus 0000:57: root bus resource [mem 0xb8800000-0xbf3fffff window]
> [ 11.390424] pci_bus 0000:57: root bus resource [mem 0x206000000000-0x206fffffffff window]
> [ 11.398424] pci_bus 0000:57: root bus resource [bus 57-6a]
> [ 11.406432] pci 0000:57:05.0: [8086:2034] type 00 class 0x088000
> [ 11.410499] pci 0000:57:05.2: [8086:2035] type 00 class 0x088000
> [ 11.418494] pci 0000:57:05.4: [8086:2036] type 00 class 0x080020
> [ 11.422431] pci 0000:57:05.4: reg 0x10: [mem 0xb8800000-0xb8800fff]
> [ 11.430498] pci 0000:57:08.0: [8086:2066] type 00 class 0x088000
> [ 11.434499] pci 0000:57:09.0: [8086:2066] type 00 class 0x088000
> [ 11.442497] pci 0000:57:0a.0: [8086:2040] type 00 class 0x088000
> [ 11.446492] pci 0000:57:0a.1: [8086:2041] type 00 class 0x088000
> [ 11.454486] pci 0000:57:0a.2: [8086:2042] type 00 class 0x088000
> [ 11.458484] pci 0000:57:0a.3: [8086:2043] type 00 class 0x088000
> [ 11.466482] pci 0000:57:0a.4: [8086:2044] type 00 class 0x088000
> [ 11.470483] pci 0000:57:0a.5: [8086:2045] type 00 class 0x088000
> [ 11.478480] pci 0000:57:0a.6: [8086:2046] type 00 class 0x088000
> [ 11.482484] pci 0000:57:0a.7: [8086:2047] type 00 class 0x088000
> [ 11.490487] pci 0000:57:0b.0: [8086:2048] type 00 class 0x088000
> [ 11.494494] pci 0000:57:0b.1: [8086:2049] type 00 class 0x088000
> [ 11.502482] pci 0000:57:0b.2: [8086:204a] type 00 class 0x088000
> [ 11.506484] pci 0000:57:0b.3: [8086:204b] type 00 class 0x088000
> [ 11.514483] pci 0000:57:0c.0: [8086:2040] type 00 class 0x088000
> [ 11.518497] pci 0000:57:0c.1: [8086:2041] type 00 class 0x088000
> [ 11.526482] pci 0000:57:0c.2: [8086:2042] type 00 class 0x088000
> [ 11.530486] pci 0000:57:0c.3: [8086:2043] type 00 class 0x088000
> [ 11.538483] pci 0000:57:0c.4: [8086:2044] type 00 class 0x088000
> [ 11.542483] pci 0000:57:0c.5: [8086:2045] type 00 class 0x088000
> [ 11.550486] pci 0000:57:0c.6: [8086:2046] type 00 class 0x088000
> [ 11.558484] pci 0000:57:0c.7: [8086:2047] type 00 class 0x088000
> [ 11.562485] pci 0000:57:0d.0: [8086:2048] type 00 class 0x088000
> [ 11.570498] pci 0000:57:0d.1: [8086:2049] type 00 class 0x088000
> [ 11.574485] pci 0000:57:0d.2: [8086:204a] type 00 class 0x088000
> [ 11.582494] pci 0000:57:0d.3: [8086:204b] type 00 class 0x088000
> [ 11.586494] pci_bus 0000:57: on NUMA node 1
> [ 11.586624] ACPI: PCI Root Bridge [UC13] (domain 0000 [bus 6b])
> [ 11.594427] acpi PNP0A08:08: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 11.602475] acpi PNP0A08:08: PCIe AER handled by firmware
> [ 11.606490] acpi PNP0A08:08: _OSC: platform does not support [LTR]
> [ 11.614546] acpi PNP0A08:08: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
> [ 11.622478] PCI host bridge to bus 0000:6b
> [ 11.626424] pci_bus 0000:6b: root bus resource [bus 6b]
> [ 11.630433] pci 0000:6b:0e.0: [8086:2058] type 00 class 0x110100
> [ 11.638474] pci 0000:6b:0e.1: [8086:2059] type 00 class 0x088000
> [ 11.642474] pci 0000:6b:0e.4: [8086:2058] type 00 class 0xffffff
> [ 11.650474] pci 0000:6b:0e.5: [8086:2059] type 00 class 0xffffff
> [ 11.654470] pci 0000:6b:0e.6: [8086:205a] type 00 class 0xffffff
> [ 11.662472] pci 0000:6b:0f.0: [8086:2058] type 00 class 0x110100
> [ 11.666471] pci 0000:6b:0f.1: [8086:2059] type 00 class 0x088000
> [ 11.674475] pci 0000:6b:0f.3: [8086:205b] type 00 class 0xffffff
> [ 11.678471] pci 0000:6b:0f.4: [8086:2058] type 00 class 0xffffff
> [ 11.686470] pci 0000:6b:0f.5: [8086:2059] type 00 class 0xffffff
> [ 11.690471] pci 0000:6b:0f.6: [8086:205a] type 00 class 0xffffff
> [ 11.698473] pci 0000:6b:10.0: [8086:2058] type 00 class 0x110100
> [ 11.702471] pci 0000:6b:10.1: [8086:2059] type 00 class 0x088000
> [ 11.710473] pci 0000:6b:10.4: [8086:2058] type 00 class 0xffffff
> [ 11.714474] pci 0000:6b:10.5: [8086:2059] type 00 class 0xffffff
> [ 11.722473] pci 0000:6b:10.6: [8086:205a] type 00 class 0xffffff
> [ 11.726476] pci 0000:6b:12.0: [8086:204c] type 00 class 0x110100
> [ 11.734470] pci 0000:6b:12.1: [8086:204d] type 00 class 0x110100
> [ 11.738459] pci 0000:6b:12.2: [8086:204e] type 00 class 0x088000
> [ 11.746466] pci 0000:6b:13.0: [8086:204c] type 00 class 0xffffff
> [ 11.750469] pci 0000:6b:13.1: [8086:204d] type 00 class 0xffffff
> [ 11.758461] pci 0000:6b:13.2: [8086:204e] type 00 class 0xffffff
> [ 11.762459] pci 0000:6b:13.3: [8086:204f] type 00 class 0xffffff
> [ 11.770465] pci 0000:6b:14.0: [8086:204c] type 00 class 0xffffff
> [ 11.774473] pci 0000:6b:14.1: [8086:204d] type 00 class 0xffffff
> [ 11.782458] pci 0000:6b:14.2: [8086:204e] type 00 class 0xffffff
> [ 11.786460] pci 0000:6b:14.3: [8086:204f] type 00 class 0xffffff
> [ 11.794459] pci 0000:6b:15.0: [8086:2018] type 00 class 0x088000
> [ 11.798459] pci 0000:6b:15.1: [8086:2088] type 00 class 0x110100
> [ 11.806462] pci 0000:6b:16.0: [8086:2018] type 00 class 0x088000
> [ 11.810458] pci 0000:6b:16.1: [8086:2088] type 00 class 0x110100
> [ 11.818464] pci 0000:6b:17.0: [8086:2018] type 00 class 0x088000
> [ 11.822457] pci 0000:6b:17.1: [8086:2088] type 00 class 0x110100
> [ 11.830467] pci_bus 0000:6b: on NUMA node 1
> [ 11.830535] ACPI: PCI Root Bridge [PC10] (domain 0000 [bus 6c-7d])
> [ 11.834425] acpi PNP0A08:09: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 11.846484] acpi PNP0A08:09: PCIe AER handled by firmware
> [ 11.850840] acpi PNP0A08:09: _OSC: OS now controls [PCIeHotplug PME PCIeCapability LTR]
> [ 11.858503] PCI host bridge to bus 0000:6c
> [ 11.862425] pci_bus 0000:6c: root bus resource [io 0xb000-0xbfff window]
> [ 11.870424] pci_bus 0000:6c: root bus resource [mem 0xbf400000-0xc5afffff window]
> [ 11.878424] pci_bus 0000:6c: root bus resource [mem 0x207000000000-0x207fffffffff window]
> [ 11.886424] pci_bus 0000:6c: root bus resource [bus 6c-7d]
> [ 11.890431] pci 0000:6c:00.0: [8086:203f] type 01 class 0x060400
> [ 11.898450] pci 0000:6c:00.0: enabling Extended Tags
> [ 11.902455] pci 0000:6c:00.0: PME# supported from D0 D3hot D3cold
> [ 11.906524] pci 0000:6c:05.0: [8086:2034] type 00 class 0x088000
> [ 11.914494] pci 0000:6c:05.2: [8086:2035] type 00 class 0x088000
> [ 11.918488] pci 0000:6c:05.4: [8086:2036] type 00 class 0x080020
> [ 11.926431] pci 0000:6c:05.4: reg 0x10: [mem 0xbf880000-0xbf880fff]
> [ 11.930623] acpiphp: Slot [9] registered
> [ 11.934457] pci 0000:6d:00.0: [8086:1563] type 00 class 0x020000
> [ 11.942448] pci 0000:6d:00.0: reg 0x10: [mem 0x207fff800000-0x207fffbfffff 64bit pref]
> [ 11.950442] pci 0000:6d:00.0: reg 0x20: [mem 0x207fffc04000-0x207fffc07fff 64bit pref]
> [ 11.958432] pci 0000:6d:00.0: reg 0x30: [mem 0xfff80000-0xffffffff pref]
> [ 11.966490] pci 0000:6d:00.0: PME# supported from D0 D3hot
> [ 11.970449] pci 0000:6d:00.0: reg 0x184: [mem 0xbf400000-0xbf403fff 64bit]
> [ 11.978426] pci 0000:6d:00.0: VF(n) BAR0 space: [mem 0xbf400000-0xbf4fffff 64bit] (contains BAR0 for 64 VFs)
> [ 11.986436] pci 0000:6d:00.0: reg 0x190: [mem 0xbf500000-0xbf503fff 64bit]
> [ 11.994424] pci 0000:6d:00.0: VF(n) BAR3 space: [mem 0xbf500000-0xbf5fffff 64bit] (contains BAR3 for 64 VFs)
> [ 12.002631] pci 0000:6d:00.1: [8086:1563] type 00 class 0x020000
> [ 12.010446] pci 0000:6d:00.1: reg 0x10: [mem 0x207fff400000-0x207fff7fffff 64bit pref]
> [ 12.018442] pci 0000:6d:00.1: reg 0x20: [mem 0x207fffc00000-0x207fffc03fff 64bit pref]
> [ 12.026432] pci 0000:6d:00.1: reg 0x30: [mem 0xfff80000-0xffffffff pref]
> [ 12.030496] pci 0000:6d:00.1: PME# supported from D0 D3hot
> [ 12.038445] pci 0000:6d:00.1: reg 0x184: [mem 0xbf600000-0xbf603fff 64bit]
> [ 12.046424] pci 0000:6d:00.1: VF(n) BAR0 space: [mem 0xbf600000-0xbf6fffff 64bit] (contains BAR0 for 64 VFs)
> [ 12.054436] pci 0000:6d:00.1: reg 0x190: [mem 0xbf700000-0xbf703fff 64bit]
> [ 12.062424] pci 0000:6d:00.1: VF(n) BAR3 space: [mem 0xbf700000-0xbf7fffff 64bit] (contains BAR3 for 64 VFs)
> [ 12.070641] pci 0000:6c:00.0: PCI bridge to [bus 6d-6e]
> [ 12.078426] pci 0000:6c:00.0: bridge window [mem 0xbf400000-0xbf7fffff]
> [ 12.082426] pci 0000:6c:00.0: bridge window [mem 0x207fff400000-0x207fffcfffff 64bit pref]
> [ 12.090428] pci_bus 0000:6c: on NUMA node 1
> [ 12.090534] ACPI: PCI Root Bridge [PC12] (domain 0000 [bus 80-82])
> [ 12.098426] acpi PNP0A08:0a: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 12.106488] acpi PNP0A08:0a: PCIe AER handled by firmware
> [ 12.114716] acpi PNP0A08:0a: _OSC: OS now controls [PCIeHotplug PME PCIeCapability LTR]
> [ 12.122528] PCI host bridge to bus 0000:80
> [ 12.126425] pci_bus 0000:80: root bus resource [io 0xc000-0xcfff window]
> [ 12.130424] pci_bus 0000:80: root bus resource [mem 0xc6000000-0xccbfffff window]
> [ 12.138424] pci_bus 0000:80: root bus resource [mem 0x208000000000-0x208fffffffff window]
> [ 12.146424] pci_bus 0000:80: root bus resource [bus 80-82]
> [ 12.154431] pci 0000:80:00.0: [8086:2020] type 00 class 0x060000
> [ 12.158588] pci 0000:80:04.0: [8086:2021] type 00 class 0x088000
> [ 12.166434] pci 0000:80:04.0: reg 0x10: [mem 0x208ffff1c000-0x208ffff1ffff 64bit]
> [ 12.174522] pci 0000:80:04.1: [8086:2021] type 00 class 0x088000
> [ 12.178433] pci 0000:80:04.1: reg 0x10: [mem 0x208ffff18000-0x208ffff1bfff 64bit]
> [ 12.186511] pci 0000:80:04.2: [8086:2021] type 00 class 0x088000
> [ 12.190434] pci 0000:80:04.2: reg 0x10: [mem 0x208ffff14000-0x208ffff17fff 64bit]
> [ 12.198510] pci 0000:80:04.3: [8086:2021] type 00 class 0x088000
> [ 12.206433] pci 0000:80:04.3: reg 0x10: [mem 0x208ffff10000-0x208ffff13fff 64bit]
> [ 12.214509] pci 0000:80:04.4: [8086:2021] type 00 class 0x088000
> [ 12.218433] pci 0000:80:04.4: reg 0x10: [mem 0x208ffff0c000-0x208ffff0ffff 64bit]
> [ 12.226510] pci 0000:80:04.5: [8086:2021] type 00 class 0x088000
> [ 12.234436] pci 0000:80:04.5: reg 0x10: [mem 0x208ffff08000-0x208ffff0bfff 64bit]
> [ 12.238515] pci 0000:80:04.6: [8086:2021] type 00 class 0x088000
> [ 12.246434] pci 0000:80:04.6: reg 0x10: [mem 0x208ffff04000-0x208ffff07fff 64bit]
> [ 12.254510] pci 0000:80:04.7: [8086:2021] type 00 class 0x088000
> [ 12.258433] pci 0000:80:04.7: reg 0x10: [mem 0x208ffff00000-0x208ffff03fff 64bit]
> [ 12.266509] pci 0000:80:05.0: [8086:2024] type 00 class 0x088000
> [ 12.274514] pci 0000:80:05.2: [8086:2025] type 00 class 0x088000
> [ 12.278501] pci 0000:80:05.4: [8086:2026] type 00 class 0x080020
> [ 12.286432] pci 0000:80:05.4: reg 0x10: [mem 0xc6000000-0xc6000fff]
> [ 12.290503] pci 0000:80:08.0: [8086:2014] type 00 class 0x088000
> [ 12.298494] pci 0000:80:08.1: [8086:2015] type 00 class 0x110100
> [ 12.302477] pci 0000:80:08.2: [8086:2016] type 00 class 0x088000
> [ 12.310497] pci_bus 0000:80: on NUMA node 2
> [ 12.310584] ACPI: PCI Root Bridge [PC13] (domain 0000 [bus 83-96])
> [ 12.314426] acpi PNP0A08:0b: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 12.326488] acpi PNP0A08:0b: PCIe AER handled by firmware
> [ 12.330844] acpi PNP0A08:0b: _OSC: OS now controls [PCIeHotplug PME PCIeCapability LTR]
> [ 12.338567] PCI host bridge to bus 0000:83
> [ 12.342425] pci_bus 0000:83: root bus resource [io 0xd000-0xdfff window]
> [ 12.350424] pci_bus 0000:83: root bus resource [mem 0xccc00000-0xd37fffff window]
> [ 12.358424] pci_bus 0000:83: root bus resource [mem 0x209000000000-0x209fffffffff window]
> [ 12.366425] pci_bus 0000:83: root bus resource [bus 83-96]
> [ 12.370433] pci 0000:83:05.0: [8086:2034] type 00 class 0x088000
> [ 12.378516] pci 0000:83:05.2: [8086:2035] type 00 class 0x088000
> [ 12.382507] pci 0000:83:05.4: [8086:2036] type 00 class 0x080020
> [ 12.390432] pci 0000:83:05.4: reg 0x10: [mem 0xccc00000-0xccc00fff]
> [ 12.394506] pci 0000:83:08.0: [8086:208d] type 00 class 0x088000
> [ 12.402500] pci 0000:83:08.1: [8086:208d] type 00 class 0x088000
> [ 12.406483] pci 0000:83:08.2: [8086:208d] type 00 class 0x088000
> [ 12.414480] pci 0000:83:08.3: [8086:208d] type 00 class 0x088000
> [ 12.418481] pci 0000:83:08.4: [8086:208d] type 00 class 0x088000
> [ 12.426480] pci 0000:83:08.5: [8086:208d] type 00 class 0x088000
> [ 12.430479] pci 0000:83:08.6: [8086:208d] type 00 class 0x088000
> [ 12.438480] pci 0000:83:08.7: [8086:208d] type 00 class 0x088000
> [ 12.442479] pci 0000:83:09.0: [8086:208d] type 00 class 0x088000
> [ 12.450492] pci 0000:83:09.1: [8086:208d] type 00 class 0x088000
> [ 12.454481] pci 0000:83:09.2: [8086:208d] type 00 class 0x088000
> [ 12.462479] pci 0000:83:09.3: [8086:208d] type 00 class 0x088000
> [ 12.466481] pci 0000:83:09.4: [8086:208d] type 00 class 0x088000
> [ 12.474478] pci 0000:83:09.5: [8086:208d] type 00 class 0x088000
> [ 12.478483] pci 0000:83:09.6: [8086:208d] type 00 class 0x088000
> [ 12.486481] pci 0000:83:09.7: [8086:208d] type 00 class 0x088000
> [ 12.490479] pci 0000:83:0a.0: [8086:208d] type 00 class 0x088000
> [ 12.498495] pci 0000:83:0a.1: [8086:208d] type 00 class 0x088000
> [ 12.502481] pci 0000:83:0a.2: [8086:208d] type 00 class 0x088000
> [ 12.510478] pci 0000:83:0a.3: [8086:208d] type 00 class 0x088000
> [ 12.518484] pci 0000:83:0a.4: [8086:208d] type 00 class 0x088000
> [ 12.522478] pci 0000:83:0a.5: [8086:208d] type 00 class 0x088000
> [ 12.530481] pci 0000:83:0a.6: [8086:208d] type 00 class 0x088000
> [ 12.534481] pci 0000:83:0a.7: [8086:208d] type 00 class 0x088000
> [ 12.542479] pci 0000:83:0b.0: [8086:208d] type 00 class 0x088000
> [ 12.546493] pci 0000:83:0b.1: [8086:208d] type 00 class 0x088000
> [ 12.554482] pci 0000:83:0b.2: [8086:208d] type 00 class 0x088000
> [ 12.558479] pci 0000:83:0b.3: [8086:208d] type 00 class 0x088000
> [ 12.566487] pci 0000:83:0e.0: [8086:208e] type 00 class 0x088000
> [ 12.570491] pci 0000:83:0e.1: [8086:208e] type 00 class 0x088000
> [ 12.578481] pci 0000:83:0e.2: [8086:208e] type 00 class 0x088000
> [ 12.582481] pci 0000:83:0e.3: [8086:208e] type 00 class 0x088000
> [ 12.590480] pci 0000:83:0e.4: [8086:208e] type 00 class 0x088000
> [ 12.594481] pci 0000:83:0e.5: [8086:208e] type 00 class 0x088000
> [ 12.602479] pci 0000:83:0e.6: [8086:208e] type 00 class 0x088000
> [ 12.606482] pci 0000:83:0e.7: [8086:208e] type 00 class 0x088000
> [ 12.614482] pci 0000:83:0f.0: [8086:208e] type 00 class 0x088000
> [ 12.618488] pci 0000:83:0f.1: [8086:208e] type 00 class 0x088000
> [ 12.626484] pci 0000:83:0f.2: [8086:208e] type 00 class 0x088000
> [ 12.630480] pci 0000:83:0f.3: [8086:208e] type 00 class 0x088000
> [ 12.638479] pci 0000:83:0f.4: [8086:208e] type 00 class 0x088000
> [ 12.642482] pci 0000:83:0f.5: [8086:208e] type 00 class 0x088000
> [ 12.650481] pci 0000:83:0f.6: [8086:208e] type 00 class 0x088000
> [ 12.654482] pci 0000:83:0f.7: [8086:208e] type 00 class 0x088000
> [ 12.662482] pci 0000:83:10.0: [8086:208e] type 00 class 0x088000
> [ 12.666489] pci 0000:83:10.1: [8086:208e] type 00 class 0x088000
> [ 12.674483] pci 0000:83:10.2: [8086:208e] type 00 class 0x088000
> [ 12.678480] pci 0000:83:10.3: [8086:208e] type 00 class 0x088000
> [ 12.686479] pci 0000:83:10.4: [8086:208e] type 00 class 0x088000
> [ 12.690481] pci 0000:83:10.5: [8086:208e] type 00 class 0x088000
> [ 12.698479] pci 0000:83:10.6: [8086:208e] type 00 class 0x088000
> [ 12.702480] pci 0000:83:10.7: [8086:208e] type 00 class 0x088000
> [ 12.710479] pci 0000:83:11.0: [8086:208e] type 00 class 0x088000
> [ 12.714491] pci 0000:83:11.1: [8086:208e] type 00 class 0x088000
> [ 12.722482] pci 0000:83:11.2: [8086:208e] type 00 class 0x088000
> [ 12.726479] pci 0000:83:11.3: [8086:208e] type 00 class 0x088000
> [ 12.734495] pci 0000:83:1d.0: [8086:2054] type 00 class 0x088000
> [ 12.738494] pci 0000:83:1d.1: [8086:2055] type 00 class 0x088000
> [ 12.746481] pci 0000:83:1d.2: [8086:2056] type 00 class 0x088000
> [ 12.750482] pci 0000:83:1d.3: [8086:2057] type 00 class 0x088000
> [ 12.758487] pci 0000:83:1e.0: [8086:2080] type 00 class 0x088000
> [ 12.762490] pci 0000:83:1e.1: [8086:2081] type 00 class 0x088000
> [ 12.770484] pci 0000:83:1e.2: [8086:2082] type 00 class 0x088000
> [ 12.778482] pci 0000:83:1e.3: [8086:2083] type 00 class 0x088000
> [ 12.782481] pci 0000:83:1e.4: [8086:2084] type 00 class 0x088000
> [ 12.790483] pci 0000:83:1e.5: [8086:2085] type 00 class 0x088000
> [ 12.794479] pci 0000:83:1e.6: [8086:2086] type 00 class 0x088000
> [ 12.802494] pci_bus 0000:83: on NUMA node 2
> [ 12.802648] ACPI: PCI Root Bridge [PC14] (domain 0000 [bus 97-aa])
> [ 12.806426] acpi PNP0A08:0c: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 12.814487] acpi PNP0A08:0c: PCIe AER handled by firmware
> [ 12.822843] acpi PNP0A08:0c: _OSC: OS now controls [PCIeHotplug PME PCIeCapability LTR]
> [ 12.830545] PCI host bridge to bus 0000:97
> [ 12.834425] pci_bus 0000:97: root bus resource [io 0xe000-0xefff window]
> [ 12.842424] pci_bus 0000:97: root bus resource [mem 0xd3800000-0xda3fffff window]
> [ 12.846424] pci_bus 0000:97: root bus resource [mem 0x20a000000000-0x20afffffffff window]
> [ 12.858424] pci_bus 0000:97: root bus resource [bus 97-aa]
> [ 12.862433] pci 0000:97:05.0: [8086:2034] type 00 class 0x088000
> [ 12.866504] pci 0000:97:05.2: [8086:2035] type 00 class 0x088000
> [ 12.874500] pci 0000:97:05.4: [8086:2036] type 00 class 0x080020
> [ 12.878432] pci 0000:97:05.4: reg 0x10: [mem 0xd3800000-0xd3800fff]
> [ 12.886506] pci 0000:97:08.0: [8086:2066] type 00 class 0x088000
> [ 12.890504] pci 0000:97:09.0: [8086:2066] type 00 class 0x088000
> [ 12.898506] pci 0000:97:0a.0: [8086:2040] type 00 class 0x088000
> [ 12.906499] pci 0000:97:0a.1: [8086:2041] type 00 class 0x088000
> [ 12.910491] pci 0000:97:0a.2: [8086:2042] type 00 class 0x088000
> [ 12.918490] pci 0000:97:0a.3: [8086:2043] type 00 class 0x088000
> [ 12.922489] pci 0000:97:0a.4: [8086:2044] type 00 class 0x088000
> [ 12.930489] pci 0000:97:0a.5: [8086:2045] type 00 class 0x088000
> [ 12.934490] pci 0000:97:0a.6: [8086:2046] type 00 class 0x088000
> [ 12.942488] pci 0000:97:0a.7: [8086:2047] type 00 class 0x088000
> [ 12.946490] pci 0000:97:0b.0: [8086:2048] type 00 class 0x088000
> [ 12.954501] pci 0000:97:0b.1: [8086:2049] type 00 class 0x088000
> [ 12.958488] pci 0000:97:0b.2: [8086:204a] type 00 class 0x088000
> [ 12.966490] pci 0000:97:0b.3: [8086:204b] type 00 class 0x088000
> [ 12.970492] pci 0000:97:0c.0: [8086:2040] type 00 class 0x088000
> [ 12.978502] pci 0000:97:0c.1: [8086:2041] type 00 class 0x088000
> [ 12.982490] pci 0000:97:0c.2: [8086:2042] type 00 class 0x088000
> [ 12.990492] pci 0000:97:0c.3: [8086:2043] type 00 class 0x088000
> [ 12.994492] pci 0000:97:0c.4: [8086:2044] type 00 class 0x088000
> [ 13.002488] pci 0000:97:0c.5: [8086:2045] type 00 class 0x088000
> [ 13.006491] pci 0000:97:0c.6: [8086:2046] type 00 class 0x088000
> [ 13.014489] pci 0000:97:0c.7: [8086:2047] type 00 class 0x088000
> [ 13.018491] pci 0000:97:0d.0: [8086:2048] type 00 class 0x088000
> [ 13.026501] pci 0000:97:0d.1: [8086:2049] type 00 class 0x088000
> [ 13.030493] pci 0000:97:0d.2: [8086:204a] type 00 class 0x088000
> [ 13.038489] pci 0000:97:0d.3: [8086:204b] type 00 class 0x088000
> [ 13.042502] pci_bus 0000:97: on NUMA node 2
> [ 13.042621] ACPI: PCI Root Bridge [UC23] (domain 0000 [bus ab])
> [ 13.050425] acpi PNP0A08:0d: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 13.058475] acpi PNP0A08:0d: PCIe AER handled by firmware
> [ 13.062490] acpi PNP0A08:0d: _OSC: platform does not support [LTR]
> [ 13.070545] acpi PNP0A08:0d: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
> [ 13.078478] PCI host bridge to bus 0000:ab
> [ 13.082424] pci_bus 0000:ab: root bus resource [bus ab]
> [ 13.086435] pci 0000:ab:0e.0: [8086:2058] type 00 class 0x110100
> [ 13.094481] pci 0000:ab:0e.1: [8086:2059] type 00 class 0x088000
> [ 13.098479] pci 0000:ab:0e.4: [8086:2058] type 00 class 0xffffff
> [ 13.106477] pci 0000:ab:0e.5: [8086:2059] type 00 class 0xffffff
> [ 13.110477] pci 0000:ab:0e.6: [8086:205a] type 00 class 0xffffff
> [ 13.118478] pci 0000:ab:0f.0: [8086:2058] type 00 class 0x110100
> [ 13.122478] pci 0000:ab:0f.1: [8086:2059] type 00 class 0x088000
> [ 13.130481] pci 0000:ab:0f.3: [8086:205b] type 00 class 0xffffff
> [ 13.134479] pci 0000:ab:0f.4: [8086:2058] type 00 class 0xffffff
> [ 13.142477] pci 0000:ab:0f.5: [8086:2059] type 00 class 0xffffff
> [ 13.146480] pci 0000:ab:0f.6: [8086:205a] type 00 class 0xffffff
> [ 13.154488] pci 0000:ab:10.0: [8086:2058] type 00 class 0x110100
> [ 13.158477] pci 0000:ab:10.1: [8086:2059] type 00 class 0x088000
> [ 13.166481] pci 0000:ab:10.4: [8086:2058] type 00 class 0xffffff
> [ 13.170479] pci 0000:ab:10.5: [8086:2059] type 00 class 0xffffff
> [ 13.178477] pci 0000:ab:10.6: [8086:205a] type 00 class 0xffffff
> [ 13.182481] pci 0000:ab:12.0: [8086:204c] type 00 class 0x110100
> [ 13.190477] pci 0000:ab:12.1: [8086:204d] type 00 class 0x110100
> [ 13.198463] pci 0000:ab:12.2: [8086:204e] type 00 class 0x088000
> [ 13.202469] pci 0000:ab:13.0: [8086:204c] type 00 class 0xffffff
> [ 13.210475] pci 0000:ab:13.1: [8086:204d] type 00 class 0xffffff
> [ 13.214465] pci 0000:ab:13.2: [8086:204e] type 00 class 0xffffff
> [ 13.222465] pci 0000:ab:13.3: [8086:204f] type 00 class 0xffffff
> [ 13.226465] pci 0000:ab:14.0: [8086:204c] type 00 class 0xffffff
> [ 13.234483] pci 0000:ab:14.1: [8086:204d] type 00 class 0xffffff
> [ 13.238462] pci 0000:ab:14.2: [8086:204e] type 00 class 0xffffff
> [ 13.246463] pci 0000:ab:14.3: [8086:204f] type 00 class 0xffffff
> [ 13.250466] pci 0000:ab:15.0: [8086:2018] type 00 class 0x088000
> [ 13.258461] pci 0000:ab:15.1: [8086:2088] type 00 class 0x110100
> [ 13.262468] pci 0000:ab:16.0: [8086:2018] type 00 class 0x088000
> [ 13.270462] pci 0000:ab:16.1: [8086:2088] type 00 class 0x110100
> [ 13.274467] pci 0000:ab:17.0: [8086:2018] type 00 class 0x088000
> [ 13.282462] pci 0000:ab:17.1: [8086:2088] type 00 class 0x110100
> [ 13.286476] pci_bus 0000:ab: on NUMA node 2
> [ 13.286545] ACPI: PCI Root Bridge [PC16] (domain 0000 [bus ac-bd])
> [ 13.294426] acpi PNP0A08:0e: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 13.302485] acpi PNP0A08:0e: PCIe AER handled by firmware
> [ 13.306842] acpi PNP0A08:0e: _OSC: OS now controls [PCIeHotplug PME PCIeCapability LTR]
> [ 13.314505] PCI host bridge to bus 0000:ac
> [ 13.318425] pci_bus 0000:ac: root bus resource [io 0xf000-0xffff window]
> [ 13.326424] pci_bus 0000:ac: root bus resource [mem 0xda400000-0xe0afffff window]
> [ 13.334424] pci_bus 0000:ac: root bus resource [mem 0x20b000000000-0x20bfffffffff window]
> [ 13.342424] pci_bus 0000:ac: root bus resource [bus ac-bd]
> [ 13.346433] pci 0000:ac:05.0: [8086:2034] type 00 class 0x088000
> [ 13.354496] pci 0000:ac:05.2: [8086:2035] type 00 class 0x088000
> [ 13.358495] pci 0000:ac:05.4: [8086:2036] type 00 class 0x080020
> [ 13.366432] pci 0000:ac:05.4: reg 0x10: [mem 0xda400000-0xda400fff]
> [ 13.374505] pci_bus 0000:ac: on NUMA node 2
> [ 13.374604] ACPI: PCI Root Bridge [PC18] (domain 0000 [bus c0-c2])
> [ 13.378426] acpi PNP0A08:0f: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 13.386484] acpi PNP0A08:0f: PCIe AER handled by firmware
> [ 13.394710] acpi PNP0A08:0f: _OSC: OS now controls [PCIeHotplug PME PCIeCapability LTR]
> [ 13.402462] acpi PNP0A08:0f: host bridge window [io 0x0000 window] (ignored, not CPU addressable)
> [ 13.410488] PCI host bridge to bus 0000:c0
> [ 13.414424] pci_bus 0000:c0: root bus resource [mem 0xe1000000-0xe7bfffff window]
> [ 13.422424] pci_bus 0000:c0: root bus resource [mem 0x20c000000000-0x20cfffffffff window]
> [ 13.430424] pci_bus 0000:c0: root bus resource [bus c0-c2]
> [ 13.434433] pci 0000:c0:04.0: [8086:2021] type 00 class 0x088000
> [ 13.442434] pci 0000:c0:04.0: reg 0x10: [mem 0x20cffff1c000-0x20cffff1ffff 64bit]
> [ 13.450515] pci 0000:c0:04.1: [8086:2021] type 00 class 0x088000
> [ 13.454433] pci 0000:c0:04.1: reg 0x10: [mem 0x20cffff18000-0x20cffff1bfff 64bit]
> [ 13.462519] pci 0000:c0:04.2: [8086:2021] type 00 class 0x088000
> [ 13.470433] pci 0000:c0:04.2: reg 0x10: [mem 0x20cffff14000-0x20cffff17fff 64bit]
> [ 13.474512] pci 0000:c0:04.3: [8086:2021] type 00 class 0x088000
> [ 13.482433] pci 0000:c0:04.3: reg 0x10: [mem 0x20cffff10000-0x20cffff13fff 64bit]
> [ 13.490510] pci 0000:c0:04.4: [8086:2021] type 00 class 0x088000
> [ 13.494434] pci 0000:c0:04.4: reg 0x10: [mem 0x20cffff0c000-0x20cffff0ffff 64bit]
> [ 13.502510] pci 0000:c0:04.5: [8086:2021] type 00 class 0x088000
> [ 13.510433] pci 0000:c0:04.5: reg 0x10: [mem 0x20cffff08000-0x20cffff0bfff 64bit]
> [ 13.518510] pci 0000:c0:04.6: [8086:2021] type 00 class 0x088000
> [ 13.522433] pci 0000:c0:04.6: reg 0x10: [mem 0x20cffff04000-0x20cffff07fff 64bit]
> [ 13.530512] pci 0000:c0:04.7: [8086:2021] type 00 class 0x088000
> [ 13.538434] pci 0000:c0:04.7: reg 0x10: [mem 0x20cffff00000-0x20cffff03fff 64bit]
> [ 13.542508] pci 0000:c0:05.0: [8086:2024] type 00 class 0x088000
> [ 13.550512] pci 0000:c0:05.2: [8086:2025] type 00 class 0x088000
> [ 13.554499] pci 0000:c0:05.4: [8086:2026] type 00 class 0x080020
> [ 13.562432] pci 0000:c0:05.4: reg 0x10: [mem 0xe1000000-0xe1000fff]
> [ 13.566516] pci 0000:c0:08.0: [8086:2014] type 00 class 0x088000
> [ 13.574497] pci 0000:c0:08.1: [8086:2015] type 00 class 0x110100
> [ 13.582476] pci 0000:c0:08.2: [8086:2016] type 00 class 0x088000
> [ 13.586495] pci_bus 0000:c0: on NUMA node 3
> [ 13.586580] ACPI: PCI Root Bridge [PC19] (domain 0000 [bus c3-d6])
> [ 13.594426] acpi PNP0A08:10: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 13.602485] acpi PNP0A08:10: PCIe AER handled by firmware
> [ 13.606847] acpi PNP0A08:10: _OSC: OS now controls [PCIeHotplug PME PCIeCapability LTR]
> [ 13.614460] acpi PNP0A08:10: host bridge window [io 0x0000 window] (ignored, not CPU addressable)
> [ 13.626537] PCI host bridge to bus 0000:c3
> [ 13.630425] pci_bus 0000:c3: root bus resource [mem 0xe7c00000-0xee7fffff window]
> [ 13.634424] pci_bus 0000:c3: root bus resource [mem 0x20d000000000-0x20dfffffffff window]
> [ 13.646424] pci_bus 0000:c3: root bus resource [bus c3-d6]
> [ 13.650433] pci 0000:c3:05.0: [8086:2034] type 00 class 0x088000
> [ 13.654513] pci 0000:c3:05.2: [8086:2035] type 00 class 0x088000
> [ 13.662506] pci 0000:c3:05.4: [8086:2036] type 00 class 0x080020
> [ 13.666432] pci 0000:c3:05.4: reg 0x10: [mem 0xe7c00000-0xe7c00fff]
> [ 13.674511] pci 0000:c3:08.0: [8086:208d] type 00 class 0x088000
> [ 13.678495] pci 0000:c3:08.1: [8086:208d] type 00 class 0x088000
> [ 13.686480] pci 0000:c3:08.2: [8086:208d] type 00 class 0x088000
> [ 13.694482] pci 0000:c3:08.3: [8086:208d] type 00 class 0x088000
> [ 13.698478] pci 0000:c3:08.4: [8086:208d] type 00 class 0x088000
> [ 13.706482] pci 0000:c3:08.5: [8086:208d] type 00 class 0x088000
> [ 13.710482] pci 0000:c3:08.6: [8086:208d] type 00 class 0x088000
> [ 13.718479] pci 0000:c3:08.7: [8086:208d] type 00 class 0x088000
> [ 13.722482] pci 0000:c3:09.0: [8086:208d] type 00 class 0x088000
> [ 13.730495] pci 0000:c3:09.1: [8086:208d] type 00 class 0x088000
> [ 13.734478] pci 0000:c3:09.2: [8086:208d] type 00 class 0x088000
> [ 13.742480] pci 0000:c3:09.3: [8086:208d] type 00 class 0x088000
> [ 13.746479] pci 0000:c3:09.4: [8086:208d] type 00 class 0x088000
> [ 13.754484] pci 0000:c3:09.5: [8086:208d] type 00 class 0x088000
> [ 13.758482] pci 0000:c3:09.6: [8086:208d] type 00 class 0x088000
> [ 13.766481] pci 0000:c3:09.7: [8086:208d] type 00 class 0x088000
> [ 13.770481] pci 0000:c3:0a.0: [8086:208d] type 00 class 0x088000
> [ 13.778495] pci 0000:c3:0a.1: [8086:208d] type 00 class 0x088000
> [ 13.782480] pci 0000:c3:0a.2: [8086:208d] type 00 class 0x088000
> [ 13.790483] pci 0000:c3:0a.3: [8086:208d] type 00 class 0x088000
> [ 13.794480] pci 0000:c3:0a.4: [8086:208d] type 00 class 0x088000
> [ 13.802483] pci 0000:c3:0a.5: [8086:208d] type 00 class 0x088000
> [ 13.806484] pci 0000:c3:0a.6: [8086:208d] type 00 class 0x088000
> [ 13.814481] pci 0000:c3:0a.7: [8086:208d] type 00 class 0x088000
> [ 13.818483] pci 0000:c3:0b.0: [8086:208d] type 00 class 0x088000
> [ 13.826491] pci 0000:c3:0b.1: [8086:208d] type 00 class 0x088000
> [ 13.830482] pci 0000:c3:0b.2: [8086:208d] type 00 class 0x088000
> [ 13.838481] pci 0000:c3:0b.3: [8086:208d] type 00 class 0x088000
> [ 13.842483] pci 0000:c3:0e.0: [8086:208e] type 00 class 0x088000
> [ 13.850492] pci 0000:c3:0e.1: [8086:208e] type 00 class 0x088000
> [ 13.854481] pci 0000:c3:0e.2: [8086:208e] type 00 class 0x088000
> [ 13.862478] pci 0000:c3:0e.3: [8086:208e] type 00 class 0x088000
> [ 13.866481] pci 0000:c3:0e.4: [8086:208e] type 00 class 0x088000
> [ 13.874486] pci 0000:c3:0e.5: [8086:208e] type 00 class 0x088000
> [ 13.878480] pci 0000:c3:0e.6: [8086:208e] type 00 class 0x088000
> [ 13.886478] pci 0000:c3:0e.7: [8086:208e] type 00 class 0x088000
> [ 13.890482] pci 0000:c3:0f.0: [8086:208e] type 00 class 0x088000
> [ 13.898494] pci 0000:c3:0f.1: [8086:208e] type 00 class 0x088000
> [ 13.902482] pci 0000:c3:0f.2: [8086:208e] type 00 class 0x088000
> [ 13.910479] pci 0000:c3:0f.3: [8086:208e] type 00 class 0x088000
> [ 13.914481] pci 0000:c3:0f.4: [8086:208e] type 00 class 0x088000
> [ 13.922481] pci 0000:c3:0f.5: [8086:208e] type 00 class 0x088000
> [ 13.926482] pci 0000:c3:0f.6: [8086:208e] type 00 class 0x088000
> [ 13.934479] pci 0000:c3:0f.7: [8086:208e] type 00 class 0x088000
> [ 13.942482] pci 0000:c3:10.0: [8086:208e] type 00 class 0x088000
> [ 13.946494] pci 0000:c3:10.1: [8086:208e] type 00 class 0x088000
> [ 13.954482] pci 0000:c3:10.2: [8086:208e] type 00 class 0x088000
> [ 13.958481] pci 0000:c3:10.3: [8086:208e] type 00 class 0x088000
> [ 13.966485] pci 0000:c3:10.4: [8086:208e] type 00 class 0x088000
> [ 13.970479] pci 0000:c3:10.5: [8086:208e] type 00 class 0x088000
> [ 13.978481] pci 0000:c3:10.6: [8086:208e] type 00 class 0x088000
> [ 13.982479] pci 0000:c3:10.7: [8086:208e] type 00 class 0x088000
> [ 13.990483] pci 0000:c3:11.0: [8086:208e] type 00 class 0x088000
> [ 13.994494] pci 0000:c3:11.1: [8086:208e] type 00 class 0x088000
> [ 14.002479] pci 0000:c3:11.2: [8086:208e] type 00 class 0x088000
> [ 14.006481] pci 0000:c3:11.3: [8086:208e] type 00 class 0x088000
> [ 14.014492] pci 0000:c3:1d.0: [8086:2054] type 00 class 0x088000
> [ 14.018495] pci 0000:c3:1d.1: [8086:2055] type 00 class 0x088000
> [ 14.026483] pci 0000:c3:1d.2: [8086:2056] type 00 class 0x088000
> [ 14.030481] pci 0000:c3:1d.3: [8086:2057] type 00 class 0x088000
> [ 14.038490] pci 0000:c3:1e.0: [8086:2080] type 00 class 0x088000
> [ 14.042495] pci 0000:c3:1e.1: [8086:2081] type 00 class 0x088000
> [ 14.050482] pci 0000:c3:1e.2: [8086:2082] type 00 class 0x088000
> [ 14.054480] pci 0000:c3:1e.3: [8086:2083] type 00 class 0x088000
> [ 14.062483] pci 0000:c3:1e.4: [8086:2084] type 00 class 0x088000
> [ 14.066480] pci 0000:c3:1e.5: [8086:2085] type 00 class 0x088000
> [ 14.074482] pci 0000:c3:1e.6: [8086:2086] type 00 class 0x088000
> [ 14.078481] pci_bus 0000:c3: on NUMA node 3
> [ 14.078635] ACPI: PCI Root Bridge [PC20] (domain 0000 [bus d7-ea])
> [ 14.086426] acpi PNP0A08:11: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 14.094487] acpi PNP0A08:11: PCIe AER handled by firmware
> [ 14.098847] acpi PNP0A08:11: _OSC: OS now controls [PCIeHotplug PME PCIeCapability LTR]
> [ 14.110462] acpi PNP0A08:11: host bridge window [io 0x0000 window] (ignored, not CPU addressable)
> [ 14.118504] PCI host bridge to bus 0000:d7
> [ 14.122424] pci_bus 0000:d7: root bus resource [mem 0xee800000-0xf53fffff window]
> [ 14.130424] pci_bus 0000:d7: root bus resource [mem 0x20e000000000-0x20efffffffff window]
> [ 14.138424] pci_bus 0000:d7: root bus resource [bus d7-ea]
> [ 14.142433] pci 0000:d7:05.0: [8086:2034] type 00 class 0x088000
> [ 14.150507] pci 0000:d7:05.2: [8086:2035] type 00 class 0x088000
> [ 14.154498] pci 0000:d7:05.4: [8086:2036] type 00 class 0x080020
> [ 14.162432] pci 0000:d7:05.4: reg 0x10: [mem 0xee800000-0xee800fff]
> [ 14.166504] pci 0000:d7:08.0: [8086:2066] type 00 class 0x088000
> [ 14.174517] pci 0000:d7:09.0: [8086:2066] type 00 class 0x088000
> [ 14.178506] pci 0000:d7:0a.0: [8086:2040] type 00 class 0x088000
> [ 14.186504] pci 0000:d7:0a.1: [8086:2041] type 00 class 0x088000
> [ 14.190489] pci 0000:d7:0a.2: [8086:2042] type 00 class 0x088000
> [ 14.198493] pci 0000:d7:0a.3: [8086:2043] type 00 class 0x088000
> [ 14.202491] pci 0000:d7:0a.4: [8086:2044] type 00 class 0x088000
> [ 14.210497] pci 0000:d7:0a.5: [8086:2045] type 00 class 0x088000
> [ 14.214498] pci 0000:d7:0a.6: [8086:2046] type 00 class 0x088000
> [ 14.222492] pci 0000:d7:0a.7: [8086:2047] type 00 class 0x088000
> [ 14.226488] pci 0000:d7:0b.0: [8086:2048] type 00 class 0x088000
> [ 14.234504] pci 0000:d7:0b.1: [8086:2049] type 00 class 0x088000
> [ 14.238489] pci 0000:d7:0b.2: [8086:204a] type 00 class 0x088000
> [ 14.246492] pci 0000:d7:0b.3: [8086:204b] type 00 class 0x088000
> [ 14.250489] pci 0000:d7:0c.0: [8086:2040] type 00 class 0x088000
> [ 14.258501] pci 0000:d7:0c.1: [8086:2041] type 00 class 0x088000
> [ 14.262492] pci 0000:d7:0c.2: [8086:2042] type 00 class 0x088000
> [ 14.270489] pci 0000:d7:0c.3: [8086:2043] type 00 class 0x088000
> [ 14.278491] pci 0000:d7:0c.4: [8086:2044] type 00 class 0x088000
> [ 14.282493] pci 0000:d7:0c.5: [8086:2045] type 00 class 0x088000
> [ 14.290489] pci 0000:d7:0c.6: [8086:2046] type 00 class 0x088000
> [ 14.294492] pci 0000:d7:0c.7: [8086:2047] type 00 class 0x088000
> [ 14.302492] pci 0000:d7:0d.0: [8086:2048] type 00 class 0x088000
> [ 14.306502] pci 0000:d7:0d.1: [8086:2049] type 00 class 0x088000
> [ 14.314489] pci 0000:d7:0d.2: [8086:204a] type 00 class 0x088000
> [ 14.318492] pci 0000:d7:0d.3: [8086:204b] type 00 class 0x088000
> [ 14.326508] pci_bus 0000:d7: on NUMA node 3
> [ 14.326634] ACPI: PCI Root Bridge [UC33] (domain 0000 [bus eb])
> [ 14.330426] acpi PNP0A08:12: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 14.338476] acpi PNP0A08:12: PCIe AER handled by firmware
> [ 14.346491] acpi PNP0A08:12: _OSC: platform does not support [LTR]
> [ 14.350546] acpi PNP0A08:12: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
> [ 14.358477] PCI host bridge to bus 0000:eb
> [ 14.362425] pci_bus 0000:eb: root bus resource [bus eb]
> [ 14.370435] pci 0000:eb:0e.0: [8086:2058] type 00 class 0x110100
> [ 14.374484] pci 0000:eb:0e.1: [8086:2059] type 00 class 0x088000
> [ 14.382477] pci 0000:eb:0e.4: [8086:2058] type 00 class 0xffffff
> [ 14.386481] pci 0000:eb:0e.5: [8086:2059] type 00 class 0xffffff
> [ 14.394479] pci 0000:eb:0e.6: [8086:205a] type 00 class 0xffffff
> [ 14.398478] pci 0000:eb:0f.0: [8086:2058] type 00 class 0x110100
> [ 14.406478] pci 0000:eb:0f.1: [8086:2059] type 00 class 0x088000
> [ 14.410479] pci 0000:eb:0f.3: [8086:205b] type 00 class 0xffffff
> [ 14.418475] pci 0000:eb:0f.4: [8086:2058] type 00 class 0xffffff
> [ 14.422480] pci 0000:eb:0f.5: [8086:2059] type 00 class 0xffffff
> [ 14.430480] pci 0000:eb:0f.6: [8086:205a] type 00 class 0xffffff
> [ 14.434479] pci 0000:eb:10.0: [8086:2058] type 00 class 0x110100
> [ 14.442482] pci 0000:eb:10.1: [8086:2059] type 00 class 0x088000
> [ 14.446482] pci 0000:eb:10.4: [8086:2058] type 00 class 0xffffff
> [ 14.454477] pci 0000:eb:10.5: [8086:2059] type 00 class 0xffffff
> [ 14.458481] pci 0000:eb:10.6: [8086:205a] type 00 class 0xffffff
> [ 14.466479] pci 0000:eb:12.0: [8086:204c] type 00 class 0x110100
> [ 14.470478] pci 0000:eb:12.1: [8086:204d] type 00 class 0x110100
> [ 14.478469] pci 0000:eb:12.2: [8086:204e] type 00 class 0x088000
> [ 14.482468] pci 0000:eb:13.0: [8086:204c] type 00 class 0xffffff
> [ 14.490480] pci 0000:eb:13.1: [8086:204d] type 00 class 0xffffff
> [ 14.494463] pci 0000:eb:13.2: [8086:204e] type 00 class 0xffffff
> [ 14.502465] pci 0000:eb:13.3: [8086:204f] type 00 class 0xffffff
> [ 14.506469] pci 0000:eb:14.0: [8086:204c] type 00 class 0xffffff
> [ 14.514477] pci 0000:eb:14.1: [8086:204d] type 00 class 0xffffff
> [ 14.518467] pci 0000:eb:14.2: [8086:204e] type 00 class 0xffffff
> [ 14.526464] pci 0000:eb:14.3: [8086:204f] type 00 class 0xffffff
> [ 14.530467] pci 0000:eb:15.0: [8086:2018] type 00 class 0x088000
> [ 14.538466] pci 0000:eb:15.1: [8086:2088] type 00 class 0x110100
> [ 14.542466] pci 0000:eb:16.0: [8086:2018] type 00 class 0x088000
> [ 14.550464] pci 0000:eb:16.1: [8086:2088] type 00 class 0x110100
> [ 14.554466] pci 0000:eb:17.0: [8086:2018] type 00 class 0x088000
> [ 14.562466] pci 0000:eb:17.1: [8086:2088] type 00 class 0x110100
> [ 14.566470] pci_bus 0000:eb: on NUMA node 3
> [ 14.566541] ACPI: PCI Root Bridge [PC22] (domain 0000 [bus ec-fd])
> [ 14.574426] acpi PNP0A08:13: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [ 14.582486] acpi PNP0A08:13: PCIe AER handled by firmware
> [ 14.590844] acpi PNP0A08:13: _OSC: OS now controls [PCIeHotplug PME PCIeCapability LTR]
> [ 14.598461] acpi PNP0A08:13: host bridge window [io 0x0000 window] (ignored, not CPU addressable)
> [ 14.606473] PCI host bridge to bus 0000:ec
> [ 14.610425] pci_bus 0000:ec: root bus resource [mem 0xf5400000-0xfbafffff window]
> [ 14.618424] pci_bus 0000:ec: root bus resource [mem 0x20f000000000-0x20ffffffffff window]
> [ 14.626424] pci_bus 0000:ec: root bus resource [bus ec-fd]
> [ 14.630433] pci 0000:ec:05.0: [8086:2034] type 00 class 0x088000
> [ 14.638501] pci 0000:ec:05.2: [8086:2035] type 00 class 0x088000
> [ 14.642497] pci 0000:ec:05.4: [8086:2036] type 00 class 0x080020
> [ 14.650433] pci 0000:ec:05.4: reg 0x10: [mem 0xf5400000-0xf5400fff]
> [ 14.654503] pci_bus 0000:ec: on NUMA node 3
> [ 14.658462] iommu: Default domain type: Translated
> [ 14.662467] pci 0000:03:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
> [ 14.670589] pci 0000:03:00.0: vgaarb: bridge control possible
> [ 14.678424] pci 0000:03:00.0: vgaarb: setting as boot device
> [ 14.682424] vgaarb: loaded
> [ 14.685551] SCSI subsystem initialized
> [ 14.690669] libata version 3.00 loaded.
> [ 14.690669] ACPI: bus type USB registered
> [ 14.694436] usbcore: registered new interface driver usbfs
> [ 14.698429] usbcore: registered new interface driver hub
> [ 14.706821] usbcore: registered new device driver usb
> [ 14.710791] EDAC MC: Ver: 3.0.0
> [ 14.714892] Registered efivars operations
> [ 14.718473] PCI: Using ACPI for IRQ routing
> [ 14.725960] PCI: pci_cache_line_size set to 64 bytes
> [ 14.726625] e820: reserve RAM buffer [mem 0x0009e000-0x0009ffff]
> [ 14.726626] e820: reserve RAM buffer [mem 0x5c81a018-0x5fffffff]
> [ 14.726627] e820: reserve RAM buffer [mem 0x5c84f018-0x5fffffff]
> [ 14.726628] e820: reserve RAM buffer [mem 0x65e0f000-0x67ffffff]
> [ 14.726629] e820: reserve RAM buffer [mem 0x6f800000-0x6fffffff]
> [ 14.726840] NetLabel: Initializing
> [ 14.730236] NetLabel: domain hash size = 128
> [ 14.734424] NetLabel: protocols = UNLABELED CIPSOv4 CALIPSO
> [ 14.742439] NetLabel: unlabeled traffic allowed by default
> [ 14.747132] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0, 0, 0, 0, 0
> [ 14.754426] hpet0: 8 comparators, 64-bit 24.000000 MHz counter
> [ 14.762577] clocksource: Switched to clocksource tsc-early
> [ 14.782582] VFS: Disk quotas dquot_6.6.0
> [ 14.786572] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> [ 14.793674] AppArmor: AppArmor Filesystem Enabled
> [ 14.798397] pnp: PnP ACPI init
> [ 14.820035] pnp 00:00: Plug and Play ACPI device, IDs PNP0b00 (active)
> [ 14.820390] system 00:01: [io 0x0500-0x05fe] has been reserved
> [ 14.826301] system 00:01: [io 0x0400-0x041f] has been reserved
> [ 14.832206] system 00:01: [io 0x0600-0x061f] has been reserved
> [ 14.838108] system 00:01: [io 0x0ca0-0x0ca1] has been reserved
> [ 14.844014] system 00:01: [io 0x0ca4-0x0ca6] has been reserved
> [ 14.849921] system 00:01: [mem 0xfed40000-0xfed47fff] could not be reserved
> [ 14.856863] system 00:01: [mem 0xff000000-0xffffffff] has been reserved
> [ 14.863465] system 00:01: Plug and Play ACPI device, IDs PNP0c02 (active)
> [ 14.863663] pnp 00:02: Plug and Play ACPI device, IDs PNP0501 (active)
> [ 14.863826] pnp 00:03: Plug and Play ACPI device, IDs PNP0501 (active)
> [ 14.864172] system 00:04: [mem 0xfd000000-0xfdabffff] has been reserved
> [ 14.870776] system 00:04: [mem 0xfdad0000-0xfdadffff] has been reserved
> [ 14.877372] system 00:04: [mem 0xfdb00000-0xfdffffff] has been reserved
> [ 14.883969] system 00:04: [mem 0xfe000000-0xfe00ffff] has been reserved
> [ 14.890568] system 00:04: [mem 0xfe011000-0xfe01ffff] has been reserved
> [ 14.897163] system 00:04: [mem 0xfe036000-0xfe03bfff] has been reserved
> [ 14.903763] system 00:04: [mem 0xfe03d000-0xfe3fffff] has been reserved
> [ 14.910356] system 00:04: [mem 0xfe410000-0xfe7fffff] has been reserved
> [ 14.916958] system 00:04: Plug and Play ACPI device, IDs PNP0c02 (active)
> [ 14.917444] system 00:05: [io 0x1000-0x10fe] has been reserved
> [ 14.923356] system 00:05: Plug and Play ACPI device, IDs PNP0c02 (active)
> [ 14.924549] pnp: PnP ACPI: found 6 devices
> [ 14.934800] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> [ 14.943684] pci 0000:6d:00.0: can't claim BAR 6 [mem 0xfff80000-0xffffffff pref]: no compatible bridge window
> [ 14.953572] pci 0000:6d:00.1: can't claim BAR 6 [mem 0xfff80000-0xffffffff pref]: no compatible bridge window
> [ 14.963470] pci 0000:00:1c.0: PCI bridge to [bus 01]
> [ 14.968427] pci 0000:00:1c.0: bridge window [io 0x3000-0x3fff]
> [ 14.974506] pci 0000:00:1c.0: bridge window [mem 0x92100000-0x921fffff]
> [ 14.981278] pci 0000:02:00.0: PCI bridge to [bus 03]
> [ 14.986232] pci 0000:02:00.0: bridge window [io 0x2000-0x2fff]
> [ 14.992312] pci 0000:02:00.0: bridge window [mem 0x91000000-0x920fffff]
> [ 14.999088] pci 0000:00:1c.5: PCI bridge to [bus 02-03]
> [ 15.004301] pci 0000:00:1c.5: bridge window [io 0x2000-0x2fff]
> [ 15.010378] pci 0000:00:1c.5: bridge window [mem 0x91000000-0x920fffff]
> [ 15.017153] pci_bus 0000:00: resource 4 [io 0x0000-0x0cf7 window]
> [ 15.023310] pci_bus 0000:00: resource 5 [io 0x1000-0x4fff window]
> [ 15.029476] pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff window]
> [ 15.036333] pci_bus 0000:00: resource 7 [mem 0x000c4000-0x000c7fff window]
> [ 15.043189] pci_bus 0000:00: resource 8 [mem 0xfe010000-0xfe010fff window]
> [ 15.050045] pci_bus 0000:00: resource 9 [mem 0x90000000-0x96bfffff window]
> [ 15.056904] pci_bus 0000:00: resource 10 [mem 0x200000000000-0x200fffffffff window]
> [ 15.064540] pci_bus 0000:01: resource 0 [io 0x3000-0x3fff]
> [ 15.070106] pci_bus 0000:01: resource 1 [mem 0x92100000-0x921fffff]
> [ 15.076358] pci_bus 0000:02: resource 0 [io 0x2000-0x2fff]
> [ 15.081916] pci_bus 0000:02: resource 1 [mem 0x91000000-0x920fffff]
> [ 15.088168] pci_bus 0000:03: resource 0 [io 0x2000-0x2fff]
> [ 15.093727] pci_bus 0000:03: resource 1 [mem 0x91000000-0x920fffff]
> [ 15.100055] pci_bus 0000:15: resource 4 [io 0x5000-0x5fff window]
> [ 15.106222] pci_bus 0000:15: resource 5 [mem 0x96c00000-0x9d7fffff window]
> [ 15.113076] pci_bus 0000:15: resource 6 [mem 0x201000000000-0x201fffffffff window]
> [ 15.120652] pci_bus 0000:23: resource 4 [io 0x6000-0x6fff window]
> [ 15.126815] pci_bus 0000:23: resource 5 [mem 0x9d800000-0xa43fffff window]
> [ 15.133670] pci_bus 0000:23: resource 6 [mem 0x202000000000-0x202fffffffff window]
> [ 15.141252] pci_bus 0000:32: resource 4 [io 0x7000-0x7fff window]
> [ 15.147422] pci_bus 0000:32: resource 5 [mem 0xa4400000-0xaaafffff window]
> [ 15.154277] pci_bus 0000:32: resource 6 [mem 0x203000000000-0x203fffffffff window]
> [ 15.161844] pci_bus 0000:40: resource 4 [io 0x8000-0x8fff window]
> [ 15.168016] pci_bus 0000:40: resource 5 [mem 0xab000000-0xb1bfffff window]
> [ 15.174873] pci_bus 0000:40: resource 6 [mem 0x204000000000-0x204fffffffff window]
> [ 15.182447] pci_bus 0000:43: resource 4 [io 0x9000-0x9fff window]
> [ 15.188609] pci_bus 0000:43: resource 5 [mem 0xb1c00000-0xb87fffff window]
> [ 15.195465] pci_bus 0000:43: resource 6 [mem 0x205000000000-0x205fffffffff window]
> [ 15.203042] pci_bus 0000:57: resource 4 [io 0xa000-0xafff window]
> [ 15.209203] pci_bus 0000:57: resource 5 [mem 0xb8800000-0xbf3fffff window]
> [ 15.216061] pci_bus 0000:57: resource 6 [mem 0x206000000000-0x206fffffffff window]
> [ 15.223641] pci 0000:6d:00.0: BAR 6: no space for [mem size 0x00080000 pref]
> [ 15.230673] pci 0000:6d:00.0: BAR 6: failed to assign [mem size 0x00080000 pref]
> [ 15.238049] pci 0000:6d:00.1: BAR 6: no space for [mem size 0x00080000 pref]
> [ 15.245079] pci 0000:6d:00.1: BAR 6: failed to assign [mem size 0x00080000 pref]
> [ 15.252452] pci 0000:6c:00.0: PCI bridge to [bus 6d-6e]
> [ 15.257669] pci 0000:6c:00.0: bridge window [mem 0xbf400000-0xbf7fffff]
> [ 15.264437] pci 0000:6c:00.0: bridge window [mem 0x207fff400000-0x207fffcfffff 64bit pref]
> [ 15.272854] pci_bus 0000:6c: resource 4 [io 0xb000-0xbfff window]
> [ 15.279021] pci_bus 0000:6c: resource 5 [mem 0xbf400000-0xc5afffff window]
> [ 15.285871] pci_bus 0000:6c: resource 6 [mem 0x207000000000-0x207fffffffff window]
> [ 15.293418] pci_bus 0000:6d: resource 1 [mem 0xbf400000-0xbf7fffff]
> [ 15.299669] pci_bus 0000:6d: resource 2 [mem 0x207fff400000-0x207fffcfffff 64bit pref]
> [ 15.307584] pci_bus 0000:80: resource 4 [io 0xc000-0xcfff window]
> [ 15.313747] pci_bus 0000:80: resource 5 [mem 0xc6000000-0xccbfffff window]
> [ 15.320603] pci_bus 0000:80: resource 6 [mem 0x208000000000-0x208fffffffff window]
> [ 15.328168] pci_bus 0000:83: resource 4 [io 0xd000-0xdfff window]
> [ 15.334335] pci_bus 0000:83: resource 5 [mem 0xccc00000-0xd37fffff window]
> [ 15.341190] pci_bus 0000:83: resource 6 [mem 0x209000000000-0x209fffffffff window]
> [ 15.348765] pci_bus 0000:97: resource 4 [io 0xe000-0xefff window]
> [ 15.354927] pci_bus 0000:97: resource 5 [mem 0xd3800000-0xda3fffff window]
> [ 15.361785] pci_bus 0000:97: resource 6 [mem 0x20a000000000-0x20afffffffff window]
> [ 15.369359] pci_bus 0000:ac: resource 4 [io 0xf000-0xffff window]
> [ 15.375522] pci_bus 0000:ac: resource 5 [mem 0xda400000-0xe0afffff window]
> [ 15.382371] pci_bus 0000:ac: resource 6 [mem 0x20b000000000-0x20bfffffffff window]
> [ 15.389939] pci_bus 0000:c0: resource 4 [mem 0xe1000000-0xe7bfffff window]
> [ 15.396794] pci_bus 0000:c0: resource 5 [mem 0x20c000000000-0x20cfffffffff window]
> [ 15.404359] pci_bus 0000:c3: resource 4 [mem 0xe7c00000-0xee7fffff window]
> [ 15.411215] pci_bus 0000:c3: resource 5 [mem 0x20d000000000-0x20dfffffffff window]
> [ 15.418790] pci_bus 0000:d7: resource 4 [mem 0xee800000-0xf53fffff window]
> [ 15.425647] pci_bus 0000:d7: resource 5 [mem 0x20e000000000-0x20efffffffff window]
> [ 15.433218] pci_bus 0000:ec: resource 4 [mem 0xf5400000-0xfbafffff window]
> [ 15.440076] pci_bus 0000:ec: resource 5 [mem 0x20f000000000-0x20ffffffffff window]
> [ 15.447852] NET: Registered protocol family 2
> [ 15.452970] tcp_listen_portaddr_hash hash table entries: 65536 (order: 8, 1048576 bytes, vmalloc)
> [ 15.462122] TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc)
> [ 15.471084] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes, vmalloc)
> [ 15.478715] TCP: Hash tables configured (established 524288 bind 65536)
> [ 15.485631] UDP hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc)
> [ 15.493061] UDP-Lite hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc)
> [ 15.501370] NET: Registered protocol family 1
> [ 15.506362] pci 0000:15:05.0: disabled boot interrupts on device [8086:2034]
> [ 15.513484] pci 0000:23:05.0: disabled boot interrupts on device [8086:2034]
> [ 15.520599] pci 0000:32:05.0: disabled boot interrupts on device [8086:2034]
> [ 15.527656] pci 0000:43:05.0: disabled boot interrupts on device [8086:2034]
> [ 15.534772] pci 0000:57:05.0: disabled boot interrupts on device [8086:2034]
> [ 15.541890] pci 0000:6c:05.0: disabled boot interrupts on device [8086:2034]
> [ 15.548935] pci 0000:6d:00.0: CLS mismatch (128 != 32), using 64 bytes
> [ 15.555470] pci 0000:83:05.0: disabled boot interrupts on device [8086:2034]
> [ 15.562566] pci 0000:97:05.0: disabled boot interrupts on device [8086:2034]
> [ 15.569655] pci 0000:ac:05.0: disabled boot interrupts on device [8086:2034]
> [ 15.576704] pci 0000:c3:05.0: disabled boot interrupts on device [8086:2034]
> [ 15.583801] pci 0000:d7:05.0: disabled boot interrupts on device [8086:2034]
> [ 15.590894] pci 0000:ec:05.0: disabled boot interrupts on device [8086:2034]
> [ 15.597981] Trying to unpack rootfs image as initramfs...
> [ 16.459493] Freeing initrd memory: 57324K
> [ 16.463533] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> [ 16.469966] software IO TLB: mapped [mem 0x587a2000-0x5c7a2000] (64MB)
> [ 16.477930] RAPL PMU: API unit is 2^-32 Joules, 2 fixed counters, 655360 ms ovfl timer
> [ 16.485831] RAPL PMU: hw unit of domain package 2^-14 Joules
> [ 16.491482] RAPL PMU: hw unit of domain dram 2^-16 Joules
> [ 16.506481] check: Scanning for low memory corruption every 60 seconds
> [ 16.515515] Initialise system trusted keyrings
> [ 16.519967] Key type blacklist registered
> [ 16.524166] workingset: timestamp_bits=40 max_order=27 bucket_order=0
> [ 16.531698] zbud: loaded
> [ 16.534709] squashfs: version 4.0 (2009/01/31) Phillip Lougher
> [ 16.540753] fuse: init (API version 7.31)
> [ 16.548302] Key type asymmetric registered
> [ 16.552414] Asymmetric key parser 'x509' registered
> [ 16.557295] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 247)
> [ 16.565005] io scheduler mq-deadline registered
> [ 16.569531] io scheduler kyber registered
> [ 16.575208] pcieport 0000:00:1c.0: PME: Signaling with IRQ 24
> [ 16.581226] pcieport 0000:00:1c.5: PME: Signaling with IRQ 25
> [ 16.587318] pcieport 0000:6c:00.0: PME: Signaling with IRQ 27
> [ 16.593873] efifb: probing for efifb
> [ 16.597479] efifb: framebuffer at 0x91000000, using 1876k, total 1875k
> [ 16.603993] efifb: mode is 800x600x32, linelength=3200, pages=1
> [ 16.609900] efifb: scrolling: redraw
> [ 16.613470] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
> [ 16.624423] Console: switching to colour frame buffer device 100x37
> [ 16.635845] fb0: EFI VGA frame buffer device
> [ 16.640265] intel_idle: MWAIT substates: 0x2020
> [ 16.640401] Monitor-Mwait will be used to enter C-1 state
> [ 16.640489] Monitor-Mwait will be used to enter C-2 state
> [ 16.640496] ACPI: \_SB_.SCK0.C000: Found 2 idle states
> [ 16.645704] intel_idle: v0.5.1 model 0x55
> [ 16.650797] intel_idle: Local APIC timer is reliable in all C-states
> [ 16.652329] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> [ 16.659908] ACPI: Power Button [PWRF]
> [ 16.678008] ERST: Error Record Serialization Table (ERST) support is initialized.
> [ 16.685587] pstore: Registered erst as persistent store backend
> [ 16.691686] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.
> [ 16.699551] Serial: 8250/16550 driver, 32 ports, IRQ sharing enabled
> [ 16.706077] 00:02: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> [ 16.713633] 00:03: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
> [ 16.722868] Linux agpgart interface v0.103
> [ 16.732273] tpm_tis IFX0762:00: 2.0 TPM (device-id 0x1B, rev-id 16)
> [ 16.822422] loop: module loaded
> [ 16.829282] libphy: Fixed MDIO Bus: probed
> [ 16.836190] tun: Universal TUN/TAP device driver, 1.6
> [ 16.844089] PPP generic driver version 2.4.2
> [ 16.851158] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
> [ 16.860495] ehci-pci: EHCI PCI platform driver
> [ 16.867796] ehci-platform: EHCI generic platform driver
> [ 16.875811] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
> [ 16.884766] ohci-pci: OHCI PCI platform driver
> [ 16.892016] ohci-platform: OHCI generic platform driver
> [ 16.900020] uhci_hcd: USB Universal Host Controller Interface driver
> [ 16.909629] xhci_hcd 0000:00:14.0: xHCI Host Controller
> [ 16.917644] xhci_hcd 0000:00:14.0: new USB bus registered, assigned bus number 1
> [ 16.928938] xhci_hcd 0000:00:14.0: hcc params 0x200077c1 hci version 0x100 quirks 0x0000000000009810
> [ 16.943647] xhci_hcd 0000:00:14.0: cache line size of 64 is not supported
> [ 16.953538] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 5.07
> [ 16.964826] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> [ 16.975115] usb usb1: Product: xHCI Host Controller
> [ 16.983041] usb usb1: Manufacturer: Linux 5.7.6+ xhci-hcd
> [ 16.991466] usb usb1: SerialNumber: 0000:00:14.0
> [ 16.999201] hub 1-0:1.0: USB hub found
> [ 17.005946] hub 1-0:1.0: 16 ports detected
> [ 17.014272] xhci_hcd 0000:00:14.0: xHCI Host Controller
> [ 17.022464] xhci_hcd 0000:00:14.0: new USB bus registered, assigned bus number 2
> [ 17.032836] xhci_hcd 0000:00:14.0: Host supports USB 3.0 SuperSpeed
> [ 17.042095] usb usb2: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 5.07
> [ 17.053345] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> [ 17.063530] usb usb2: Product: xHCI Host Controller
> [ 17.071318] usb usb2: Manufacturer: Linux 5.7.6+ xhci-hcd
> [ 17.079590] usb usb2: SerialNumber: 0000:00:14.0
> [ 17.087137] hub 2-0:1.0: USB hub found
> [ 17.093671] hub 2-0:1.0: 10 ports detected
> [ 17.100906] usb: port power management may be unreliable
> [ 17.109387] i8042: PNP: No PS/2 controller found.
> [ 17.116937] mousedev: PS/2 mouse device common for all mice
> [ 17.125396] rtc_cmos 00:00: RTC can wake from S4
> [ 17.133089] rtc_cmos 00:00: registered as rtc0
> [ 17.140425] rtc_cmos 00:00: setting system clock to 2020-07-13T01:15:01 UTC (1594602901)
> [ 17.151319] rtc_cmos 00:00: alarms up to one month, y3k, 114 bytes nvram, hpet irqs
> [ 17.161774] i2c /dev entries driver
> [ 17.168109] device-mapper: uevent: version 1.0.3
> [ 17.175597] device-mapper: ioctl: 4.42.0-ioctl (2020-02-27) initialised: [email protected]
> [ 17.186872] intel_pstate: Intel P-state driver initializing
> [ 17.218306] intel_pstate: HWP enabled
> [ 17.233297] ledtrig-cpu: registered to indicate activity on CPUs
> [ 17.242177] EFI Variables Facility v0.08 2004-May-17
> [ 17.251912] drop_monitor: Initializing network drop monitor service
> [ 17.261294] NET: Registered protocol family 10
> [ 17.269435] Segment Routing with IPv6
> [ 17.275941] NET: Registered protocol family 17
> [ 17.283187] Key type dns_resolver registered
> [ 17.306053] microcode: sig=0x5065b, pf=0x80, revision=0x700001c
> [ 17.322630] microcode: Microcode Update Driver: v2.2.
> [ 17.322634] IPI shorthand broadcast: enabled
> [ 17.337292] sched_clock: Marking stable (14160999482, 3176284304)->(17654284093, -317000307)
> [ 17.346505] usb 1-2: new high-speed USB device number 2 using xhci_hcd
> [ 17.358225] registered taskstats version 1
> [ 17.365146] Loading compiled-in X.509 certificates
> [ 17.373536] Loaded X.509 cert 'Build time autogenerated kernel key: 2621fd24cd5cb190d3a45c4a3990fd12787c67b2'
> [ 17.389409] zswap: loaded using pool lzo/zbud
> [ 17.397290] pstore: Using crash dump compression: deflate
> [ 17.408686] Key type big_key registered
> [ 17.416605] Key type trusted registered
> [ 17.424426] Key type encrypted registered
> [ 17.431222] AppArmor: AppArmor sha1 policy hashing enabled
> [ 17.439534] ima: Allocated hash algorithm: sha1
> [ 17.483973] ima: No architecture policies found
> [ 17.491350] evm: Initialising EVM extended attributes:
> [ 17.499327] evm: security.selinux
> [ 17.503875] usb 1-2: New USB device found, idVendor=1d6b, idProduct=0107, bcdDevice= 1.00
> [ 17.505457] evm: security.SMACK64
> [ 17.516536] usb 1-2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> [ 17.522742] evm: security.SMACK64EXEC
> [ 17.522745] evm: security.SMACK64TRANSMUTE
> [ 17.530513] tsc: Refined TSC clocksource calibration: 2494.279 MHz
> [ 17.530592] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x23f41d64cb8, max_idle_ns: 440795293669 ns
> [ 17.532887] usb 1-2: Product: USB Virtual Hub
> [ 17.539484] evm: security.SMACK64MMAP
> [ 17.546464] usb 1-2: Manufacturer: Aspeed
> [ 17.546466] usb 1-2: SerialNumber: 00000000
> [ 17.555549] evm: security.apparmor
> [ 17.605326] evm: security.ima
> [ 17.611018] evm: security.capability
> [ 17.617289] evm: HMAC attrs: 0x1
> [ 17.623297] clocksource: Switched to clocksource tsc
> [ 17.623768] hub 1-2:1.0: USB hub found
> [ 17.625197] PM: Magic number: 12:253:205
> [ 17.625497] acpi device:357: hash matches
> [ 17.625503] acpi device:32a: hash matches
> [ 17.625565] acpi PNP0A08:0a: hash matches
> [ 17.625635] acpi device:2d: hash matches
> [ 17.625979] memory memory772: hash matches
> [ 17.676770] hub 1-2:1.0: 5 ports detected
> [ 17.684010] Freeing unused kernel image (initmem) memory: 1780K
> [ 17.692240] Write protecting the kernel read-only data: 24576k
> [ 17.701416] Freeing unused kernel image (text/rodata gap) memory: 2044K
> [ 17.711503] Freeing unused kernel image (rodata/data gap) memory: 2024K
> [ 17.720399] Run /init as init process
> [ 17.726375] with arguments:
> [ 17.726376] /init
> [ 17.726376] with environment:
> [ 17.726376] HOME=/
> [ 17.726377] TERM=linux
> [ 17.726377] BOOT_IMAGE=/vmlinuz-5.7.6+
> [ 17.726378] crashkernel=512M-:768M
> [ 17.810720] usb 1-5: new high-speed USB device number 3 using xhci_hcd
> [ 17.967728] usb 1-5: New USB device found, idVendor=14dd, idProduct=0002, bcdDevice= 0.01
> [ 17.978350] usb 1-5: New USB device strings: Mfr=1, Product=2, SerialNumber=3
> [ 17.987919] usb 1-5: Product: Multidevice
> [ 17.994347] usb 1-5: Manufacturer: Peppercon AG
> [ 18.001295] usb 1-5: SerialNumber: B0BD311A39339843C0C54A0DB8EAECBF
> [ 18.062467] hid: raw HID events driver (C) Jiri Kosina
> [ 18.066655] usb 1-2.1: new high-speed USB device number 4 using xhci_hcd
> [ 18.082914] usbcore: registered new interface driver usbhid
> [ 18.090848] usbhid: USB HID core driver
> [ 18.102110] input: Peppercon AG Multidevice as /devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5:1.0/0003:14DD:0002.0001/input/input1
> [ 18.122089] pps_core: LinuxPPS API ver. 1 registered
> [ 18.129537] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <[email protected]>
> [ 18.144124] ahci 0000:00:11.5: version 3.0
> [ 18.144425] ahci 0000:00:11.5: AHCI 0001.0301 32 slots 2 ports 6 Gbps 0x3 impl SATA mode
> [ 18.155122] ahci 0000:00:11.5: flags: 64bit ncq sntf pm led clo only pio slum part ems deso sadm sds apst
> [ 18.170624] PTP clock support registered
> [ 18.180851] cryptd: max_cpu_qlen set to 1000
> [ 18.182759] hid-generic 0003:14DD:0002.0001: input,hidraw0: USB HID v1.01 Keyboard [Peppercon AG Multidevice] on usb-0000:00:14.0-5/input0
> [ 18.192006] usb 1-2.1: New USB device found, idVendor=1d6b, idProduct=0104, bcdDevice= 1.00
> [ 18.206519] input: Peppercon AG Multidevice as /devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5:1.1/0003:14DD:0002.0002/input/input2
> [ 18.217551] usb 1-2.1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
> [ 18.217552] usb 1-2.1: Product: virtual_input
> [ 18.217552] usb 1-2.1: Manufacturer: OpenBMC
> [ 18.217553] usb 1-2.1: SerialNumber: OBMC0001
> [ 18.218516] scsi host0: ahci
> [ 18.235600] hid-generic 0003:14DD:0002.0002: input,hidraw1: USB HID v1.01 Mouse [Peppercon AG Multidevice] on usb-0000:00:14.0-5/input1
> [ 18.246371] scsi host1: ahci
> [ 18.300367] ata1: SATA max UDMA/133 abar m524288@0x92280000 port 0x92280100 irq 29
> [ 18.311436] ata2: SATA max UDMA/133 abar m524288@0x92280000 port 0x92280180 irq 29
> [ 18.323099] ahci 0000:00:17.0: AHCI 0001.0301 32 slots 8 ports 6 Gbps 0xff impl SATA mode
> [ 18.334764] ahci 0000:00:17.0: flags: 64bit ncq sntf ilck pm led clo only pio slum part ems deso sadm sds apst
> [ 18.352218] AVX2 version of gcm_enc/dec engaged.
> [ 18.352297] dca service started, version 1.12.1
> [ 18.353454] input: OpenBMC virtual_input as /devices/pci0000:00/0000:00:14.0/usb1/1-2/1-2.1/1-2.1:1.0/0003:1D6B:0104.0003/input/input3
> [ 18.360352] AES CTR mode by8 optimization enabled
> [ 18.410800] hid-generic 0003:1D6B:0104.0003: input,hidraw2: USB HID v1.01 Keyboard [OpenBMC virtual_input] on usb-0000:00:14.0-2.1/input0
> [ 18.431658] input: OpenBMC virtual_input as /devices/pci0000:00/0000:00:14.0/usb1/1-2/1-2.1/1-2.1:1.1/0003:1D6B:0104.0004/input/input4
> [ 18.451358] hid-generic 0003:1D6B:0104.0004: input,hidraw3: USB HID v1.01 Mouse [OpenBMC virtual_input] on usb-0000:00:14.0-2.1/input1
> [ 18.471613] scsi host2: ahci
> [ 18.478356] scsi host3: ahci
> [ 18.484957] scsi host4: ahci
> [ 18.491543] scsi host5: ahci
> [ 18.498129] scsi host6: ahci
> [ 18.504609] scsi host7: ahci
> [ 18.510998] scsi host8: ahci
> [ 18.517265] scsi host9: ahci
> [ 18.523289] ata3: SATA max UDMA/133 abar m524288@0x92200000 port 0x92200100 irq 30
> [ 18.533969] ata4: SATA max UDMA/133 abar m524288@0x92200000 port 0x92200180 irq 30
> [ 18.544547] ata5: SATA max UDMA/133 abar m524288@0x92200000 port 0x92200200 irq 30
> [ 18.555011] ata6: SATA max UDMA/133 abar m524288@0x92200000 port 0x92200280 irq 30
> [ 18.565377] ata7: SATA max UDMA/133 abar m524288@0x92200000 port 0x92200300 irq 30
> [ 18.575623] ata8: SATA max UDMA/133 abar m524288@0x92200000 port 0x92200380 irq 30
> [ 18.585792] ata9: SATA max UDMA/133 abar m524288@0x92200000 port 0x92200400 irq 30
> [ 18.595945] ata10: SATA max UDMA/133 abar m524288@0x92200000 port 0x92200480 irq 30
> [ 18.607300] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 5.1.0-k
> [ 18.607301] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.6.0-k
> [ 18.607303] igb: Copyright (c) 2007-2014 Intel Corporation.
> [ 18.617657] ixgbe: Copyright (c) 1999-2016 Intel Corporation.
> [ 18.627408] igb 0000:01:00.0: enabling device (0140 -> 0142)
> [ 18.635519] ata2: SATA link down (SStatus 4 SControl 300)
> [ 18.635719] ixgbe 0000:6d:00.0: enabling device (0140 -> 0142)
> [ 18.668829] ata1: SATA link down (SStatus 4 SControl 300)
> [ 18.678953] pps pps0: new PPS source ptp0
> [ 18.685545] igb 0000:01:00.0: added PHC on eth0
> [ 18.692587] igb 0000:01:00.0: Intel(R) Gigabit Ethernet Network Connection
> [ 18.702016] igb 0000:01:00.0: eth0: (PCIe:2.5Gb/s:Width x1) a4:bf:01:70:01:b6
> [ 18.711706] igb 0000:01:00.0: eth0: PBA No: 000300-000
> [ 18.719238] igb 0000:01:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx queue(s)
> [ 18.742755] igb 0000:01:00.0 enp1s0: renamed from eth0
> [ 18.750633] checking generic (91000000 1d5000) vs hw (91000000 1000000)
> [ 18.750637] fb0: switching to astdrmfb from EFI VGA
> [ 18.757954] Console: switching to colour dummy device 80x25
> [ 18.763667] [drm] P2A bridge disabled, using default configuration
> [ 18.769842] [drm] AST 2500 detected
> [ 18.773343] [drm] Analog VGA only
> [ 18.776672] [drm] dram MCLK=800 Mhz type=1 bus_width=16 size=01000000
> [ 18.783313] [TTM] Zone kernel: Available graphics memory: 164401722 KiB
> [ 18.790020] [TTM] Zone dma32: Available graphics memory: 2097152 KiB
> [ 18.796540] [TTM] Initializing pool allocator
> [ 18.800895] [TTM] Initializing DMA pool allocator
> [ 18.921958] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [ 18.928161] ata9: SATA link down (SStatus 4 SControl 300)
> [ 18.933582] ata5: SATA link down (SStatus 4 SControl 300)
> [ 18.939014] ata6: SATA link down (SStatus 4 SControl 300)
> [ 18.944428] fbcon: astdrmfb (fb0) is primary device
> [ 18.944445] ata8: SATA link down (SStatus 4 SControl 300)
> [ 18.944463] ata10: SATA link down (SStatus 4 SControl 300)
> [ 18.944482] ata3: SATA link down (SStatus 4 SControl 300)
> [ 18.944499] ata4: SATA link down (SStatus 4 SControl 300)
> [ 18.944523] ata7.00: ATA-10: INTEL SSDSC2KG960G8, XCV10110, max UDMA/133
> [ 18.944525] ata7.00: 1875385008 sectors, multi 1: LBA48 NCQ (depth 32)
> [ 18.944982] ata7.00: configured for UDMA/133
> [ 18.950948] scsi 6:0:0:0: Direct-Access ATA INTEL SSDSC2KG96 0110 PQ: 0 ANSI: 5
> [ 18.952397] sd 6:0:0:0: Attached scsi generic sg0 type 0
> [ 18.952946] ata7.00: Enabling discard_zeroes_data
> [ 18.953049] sd 6:0:0:0: [sda] 1875385008 512-byte logical blocks: (960 GB/894 GiB)
> [ 18.953050] sd 6:0:0:0: [sda] 4096-byte physical blocks
> [ 18.953089] sd 6:0:0:0: [sda] Write Protect is off
> [ 18.953090] sd 6:0:0:0: [sda] Mode Sense: 00 3a 00 00
> [ 18.953095] sd 6:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> [ 18.957125] Console: switching to colour frame buffer device 128x48
> [ 19.053243] ast 0000:03:00.0: fb0: astdrmfb frame buffer device
> [ 19.078946] ata7.00: Enabling discard_zeroes_data
> [ 19.086057] sda: sda1 sda2 sda3
> [ 19.090485] ata7.00: Enabling discard_zeroes_data
> [ 19.094837] [drm] Initialized ast 0.1.0 20120228 for 0000:03:00.0 on minor 0
> [ 19.095407] sd 6:0:0:0: [sda] Attached SCSI disk
> [ 19.112357] random: lvm: uninitialized urandom read (4 bytes read)
> [ 19.130820] random: lvm: uninitialized urandom read (2 bytes read)
> [ 19.348421] ixgbe 0000:6d:00.0: Multiqueue Enabled: Rx Queue count = 63, Tx Queue count = 63 XDP Queue count = 0
> [ 19.460872] ixgbe 0000:6d:00.0: 31.504 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x4 link)
> [ 19.584847] ixgbe 0000:6d:00.0: MAC: 4, PHY: 0, PBA No: H86377-006
> [ 19.591423] ixgbe 0000:6d:00.0: b4:96:91:46:86:a8
> [ 19.760928] ixgbe 0000:6d:00.0: Intel(R) 10 Gigabit Network Connection
> [ 19.767945] libphy: ixgbe-mdio: probed
> [ 19.780470] ixgbe 0000:6d:00.1: enabling device (0140 -> 0142)
> [ 20.476018] ixgbe 0000:6d:00.1: Multiqueue Enabled: Rx Queue count = 63, Tx Queue count = 63 XDP Queue count = 0
> [ 20.587180] ixgbe 0000:6d:00.1: 31.504 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x4 link)
> [ 20.711184] ixgbe 0000:6d:00.1: MAC: 4, PHY: 0, PBA No: H86377-006
> [ 20.717771] ixgbe 0000:6d:00.1: b4:96:91:46:86:a9
> [ 20.887186] ixgbe 0000:6d:00.1: Intel(R) 10 Gigabit Network Connection
> [ 20.894185] libphy: ixgbe-mdio: probed
> [ 20.899926] ixgbe 0000:6d:00.0 ens9f0: renamed from eth0
> [ 20.943086] ixgbe 0000:6d:00.1 ens9f1: renamed from eth1
> [ 21.166694] raid6: avx512x4 gen() 16567 MB/s
> [ 21.238696] raid6: avx512x4 xor() 10251 MB/s
> [ 21.310698] raid6: avx512x2 gen() 16589 MB/s
> [ 21.382693] raid6: avx512x2 xor() 34532 MB/s
> [ 21.454694] raid6: avx512x1 gen() 16566 MB/s
> [ 21.526697] raid6: avx512x1 xor() 30507 MB/s
> [ 21.598689] raid6: avx2x4 gen() 16486 MB/s
> [ 21.670697] raid6: avx2x4 xor() 8626 MB/s
> [ 21.742701] raid6: avx2x2 gen() 16556 MB/s
> [ 21.814695] raid6: avx2x2 xor() 25098 MB/s
> [ 21.886687] raid6: avx2x1 gen() 12603 MB/s
> [ 21.958699] raid6: avx2x1 xor() 19952 MB/s
> [ 22.030685] raid6: sse2x4 gen() 12875 MB/s
> [ 22.102694] raid6: sse2x4 xor() 7747 MB/s
> [ 22.174692] raid6: sse2x2 gen() 12740 MB/s
> [ 22.246701] raid6: sse2x2 xor() 8453 MB/s
> [ 22.318694] raid6: sse2x1 gen() 12492 MB/s
> [ 22.390699] raid6: sse2x1 xor() 6813 MB/s
> [ 22.395311] raid6: using algorithm avx512x2 gen() 16589 MB/s
> [ 22.401282] raid6: .... xor() 34532 MB/s, rmw enabled
> [ 22.406643] raid6: using avx512x2 recovery algorithm
> [ 22.414609] xor: automatically using best checksumming function avx
> [ 22.427982] async_tx: api initialized (async)
> [ 22.452788] random: lvm: uninitialized urandom read (4 bytes read)
> [ 22.499747] Btrfs loaded, crc32c=crc32c-intel
> [ 22.531785] process '/bin/fstype' started with executable stack
> [ 22.551719] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
> [ 22.643394] Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist.
> [ 22.695947] random: fast init done
> [ 22.783398] systemd[1]: systemd 237 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
> [ 22.840304] systemd[1]: Detected architecture x86-64.
> [ 22.883983] systemd[1]: Set hostname to <ssp_clx_cdi391>.
> [ 22.989699] systemd[1]: File /lib/systemd/system/systemd-journald.service:36 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
> [ 23.010818] systemd[1]: Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
> [ 23.049149] systemd[1]: /etc/systemd/system/snapd.service.d/snap_proxy.conf:4: Missing '='.
> [ 23.080938] random: systemd: uninitialized urandom read (16 bytes read)
> [ 23.092077] systemd[1]: Created slice System Slice.
> [ 23.110831] random: systemd: uninitialized urandom read (16 bytes read)
> [ 23.117976] systemd[1]: Listening on Journal Socket (/dev/log).
> [ 23.138909] systemd[1]: Listening on udev Kernel Socket.
> [ 23.158850] systemd[1]: Listening on LVM2 poll daemon socket.
> [ 23.548163] Loading iSCSI transport class v2.0-870.
> [ 23.568945] iscsi: registered transport (tcp)
> [ 23.598204] EXT4-fs (dm-0): re-mounted. Opts: (null)
> [ 23.614287] iscsi: registered transport (iser)
> [ 23.807968] swapon: swapfile has holes
> [ 24.306066] systemd-journald[1512]: Received request to flush runtime journal from PID 1
> [ 24.437411] IPMI message handler: version 39.2
> [ 24.494999] power_meter ACPI000D:00: Found ACPI power meter.
> [ 24.495816] power_meter ACPI000D:00: Ignoring unsafe software power cap!
> [ 24.495843] ipmi device interface
> [ 24.496136] power_meter ACPI000D:00: hwmon_device_register() is deprecated. Please convert the driver to use hwmon_device_register_with_info().
> [ 24.502615] ipmi_si: IPMI System Interface driver
> [ 24.502891] ipmi_si dmi-ipmi-si.0: ipmi_platform: probing via SMBIOS
> [ 24.502893] ipmi_platform: ipmi_si: SMBIOS: io 0xca2 regsize 1 spacing 1 irq 0
> [ 24.502894] ipmi_si: Adding SMBIOS-specified kcs state machine
> [ 24.503799] ipmi_si IPI0001:00: ipmi_platform: probing via ACPI
> [ 24.503814] ipmi_si IPI0001:00: ipmi_platform: [io 0x0ca2-0x0ca3] regsize 1 spacing 1 irq 0
> [ 24.503815] ipmi_si dmi-ipmi-si.0: Removing SMBIOS-specified kcs state machine in favor of ACPI
> [ 24.503816] ipmi_si: Adding ACPI-specified kcs state machine
> [ 24.504732] ipmi_si: Trying ACPI-specified kcs state machine at i/o address 0xca2, slave address 0x20, irq 0
> [ 24.717623] ioatdma: Intel(R) QuickData Technology Driver 5.00
> [ 24.718225] ioatdma 0000:00:04.0: enabling device (0004 -> 0006)
> [ 24.733114] ioatdma 0000:00:04.1: enabling device (0004 -> 0006)
> [ 24.745640] ioatdma 0000:00:04.2: enabling device (0004 -> 0006)
> [ 24.753908] ioatdma 0000:00:04.3: enabling device (0004 -> 0006)
> [ 24.761207] ioatdma 0000:00:04.4: enabling device (0004 -> 0006)
> [ 24.769541] ioatdma 0000:00:04.5: enabling device (0004 -> 0006)
> [ 24.779266] lpc_ich 0000:00:1f.0: I/O space for ACPI uninitialized
> [ 24.779267] lpc_ich 0000:00:1f.0: No MFD cells added
> [ 24.779464] ioatdma 0000:00:04.6: enabling device (0004 -> 0006)
> [ 24.788609] ioatdma 0000:00:04.7: enabling device (0004 -> 0006)
> [ 24.793000] mei_me 0000:00:16.0: Device doesn't have valid ME Interface
> [ 24.796171] ioatdma 0000:40:04.0: enabling device (0004 -> 0006)
> [ 24.806485] ioatdma 0000:40:04.1: enabling device (0004 -> 0006)
> [ 24.814306] ioatdma 0000:40:04.2: enabling device (0004 -> 0006)
> [ 24.820992] ioatdma 0000:40:04.3: enabling device (0004 -> 0006)
> [ 24.827464] ioatdma 0000:40:04.4: enabling device (0004 -> 0006)
> [ 24.834392] ioatdma 0000:40:04.5: enabling device (0004 -> 0006)
> [ 24.840857] ioatdma 0000:40:04.6: enabling device (0004 -> 0006)
> [ 24.847475] ioatdma 0000:40:04.7: enabling device (0004 -> 0006)
> [ 24.854042] ioatdma 0000:80:04.0: enabling device (0004 -> 0006)
> [ 24.863727] ioatdma 0000:80:04.1: enabling device (0004 -> 0006)
> [ 24.871113] ioatdma 0000:80:04.2: enabling device (0004 -> 0006)
> [ 24.877372] ioatdma 0000:80:04.3: enabling device (0004 -> 0006)
> [ 24.884084] ioatdma 0000:80:04.4: enabling device (0004 -> 0006)
> [ 24.890571] ioatdma 0000:80:04.5: enabling device (0004 -> 0006)
> [ 24.890703] ipmi_si IPI0001:00: The BMC does not support clearing the recv irq bit, compensating, but the BMC needs to be fixed.
> [ 24.897757] ioatdma 0000:80:04.6: enabling device (0004 -> 0006)
> [ 24.903505] ioatdma 0000:80:04.7: enabling device (0004 -> 0006)
> [ 24.908744] pstore: ignoring unexpected backend 'efi'
> [ 24.909914] ioatdma 0000:c0:04.0: enabling device (0004 -> 0006)
> [ 24.920021] ioatdma 0000:c0:04.1: enabling device (0004 -> 0006)
> [ 24.929095] ioatdma 0000:c0:04.2: enabling device (0004 -> 0006)
> [ 24.938541] ioatdma 0000:c0:04.3: enabling device (0004 -> 0006)
> [ 24.948251] ioatdma 0000:c0:04.4: enabling device (0004 -> 0006)
> [ 24.957287] ioatdma 0000:c0:04.5: enabling device (0004 -> 0006)
> [ 24.965213] ioatdma 0000:c0:04.6: enabling device (0004 -> 0006)
> [ 24.973148] ioatdma 0000:c0:04.7: enabling device (0004 -> 0006)
> [ 25.126247] ipmi_si IPI0001:00: IPMI message handler: Found new BMC (man_id: 0x000157, prod_id: 0x009d, dev_id: 0x23)
> [ 25.137772] EDAC MC0: Giving out device to module skx_edac controller Skylake Socket#0 IMC#0: DEV 0000:23:0a.0 (INTERRUPT)
> [ 25.139660] EDAC MC1: Giving out device to module skx_edac controller Skylake Socket#0 IMC#1: DEV 0000:23:0c.0 (INTERRUPT)
> [ 25.141569] EDAC MC2: Giving out device to module skx_edac controller Skylake Socket#1 IMC#0: DEV 0000:57:0a.0 (INTERRUPT)
> [ 25.143545] EDAC MC3: Giving out device to module skx_edac controller Skylake Socket#1 IMC#1: DEV 0000:57:0c.0 (INTERRUPT)
> [ 25.146823] EDAC MC4: Giving out device to module skx_edac controller Skylake Socket#2 IMC#0: DEV 0000:97:0a.0 (INTERRUPT)
> [ 25.148615] EDAC MC5: Giving out device to module skx_edac controller Skylake Socket#2 IMC#1: DEV 0000:97:0c.0 (INTERRUPT)
> [ 25.152571] EDAC MC6: Giving out device to module skx_edac controller Skylake Socket#3 IMC#0: DEV 0000:d7:0a.0 (INTERRUPT)
> [ 25.157823] EDAC MC7: Giving out device to module skx_edac controller Skylake Socket#3 IMC#1: DEV 0000:d7:0c.0 (INTERRUPT)
> [ 25.290190] intel_rapl_common: Found RAPL domain package
> [ 25.290194] intel_rapl_common: Found RAPL domain dram
> [ 25.290195] intel_rapl_common: DRAM domain energy unit 15300pj
> [ 25.291652] intel_rapl_common: Found RAPL domain package
> [ 25.291656] intel_rapl_common: Found RAPL domain dram
> [ 25.291657] intel_rapl_common: DRAM domain energy unit 15300pj
> [ 25.292226] intel_rapl_common: Found RAPL domain package
> [ 25.292230] intel_rapl_common: Found RAPL domain dram
> [ 25.292231] intel_rapl_common: DRAM domain energy unit 15300pj
> [ 25.292863] intel_rapl_common: Found RAPL domain package
> [ 25.292870] intel_rapl_common: Found RAPL domain dram
> [ 25.292872] intel_rapl_common: DRAM domain energy unit 15300pj
> [ 25.574964] ipmi_si IPI0001:00: IPMI kcs interface initialized
> [ 25.577234] ipmi_ssif: IPMI SSIF Interface driver
> [ 25.594902] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
> [ 25.619499] FAT-fs (sda1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
> [ 25.851473] audit: type=1400 audit(1594602910.206:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/lxc-start" pid=4764 comm="apparmor_parser"
> [ 25.854466] audit: type=1400 audit(1594602910.206:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=4765 comm="apparmor_parser"
> [ 25.854468] audit: type=1400 audit(1594602910.206:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=4765 comm="apparmor_parser"
> [ 25.854469] audit: type=1400 audit(1594602910.206:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=4765 comm="apparmor_parser"
> [ 25.856163] audit: type=1400 audit(1594602910.210:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=4766 comm="apparmor_parser"
> [ 25.856166] audit: type=1400 audit(1594602910.210:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=4766 comm="apparmor_parser"
> [ 25.857153] audit: type=1400 audit(1594602910.210:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/tcpdump" pid=4768 comm="apparmor_parser"
> [ 25.859144] audit: type=1400 audit(1594602910.214:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default" pid=4762 comm="apparmor_parser"
> [ 25.859146] audit: type=1400 audit(1594602910.214:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-cgns" pid=4762 comm="apparmor_parser"
> [ 25.859148] audit: type=1400 audit(1594602910.214:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-with-mounting" pid=4762 comm="apparmor_parser"
> [ 26.372887] random: crng init done
> [ 26.372890] random: 5 urandom warning(s) missed due to ratelimiting
> [ 29.903345] igb 0000:01:00.0 enp1s0: igb: enp1s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
> [ 29.903653] IPv6: ADDRCONF(NETDEV_CHANGE): enp1s0: link becomes ready
> [ 34.417764] new mount options do not match the existing superblock, will be ignored
> [ 128.500959] core sched enabled
> [ 128.593222] core sched disabled
> [ 139.711143] core sched enabled
> [ 141.774846] NOHZ: local_softirq_pending 0a
> [ 141.782905] NOHZ: local_softirq_pending 0a
> [ 141.790973] NOHZ: local_softirq_pending 20a
> [ 141.794644] NOHZ: local_softirq_pending 20a
> [ 141.798640] NOHZ: local_softirq_pending 20a
> [ 141.798641] NOHZ: local_softirq_pending 20a
> [ 141.826902] NOHZ: local_softirq_pending 0a
> [ 141.830641] NOHZ: local_softirq_pending 0a
> [ 141.894765] NOHZ: local_softirq_pending 0a
> [ 141.922850] NOHZ: local_softirq_pending 208
> [ 170.634633] watchdog: BUG: soft lockup - CPU#75 stuck for 22s! [uperf:5393]
> [ 170.641614] Modules linked in: binfmt_misc ipmi_ssif intel_rapl_msr intel_rapl_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass joydev efi_pstore input_leds mei_me lpc_ich mei ioatdma ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_decompress zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ast drm_vram_helper drm_ttm_helper ttm crct10dif_pclmul ixgbe drm_kms_helper crc32_pclmul igb ghash_clmulni_intel aesni_intel syscopyarea crypto_simd sysfillrect dca cryptd sysimgblt glue_helper fb_sys_fops ptp mdio ahci pps_core drm i2c_algo_bit libahci wmi hid_generic usbhid hid
> [ 170.641645] CPU: 75 PID: 5393 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
> [ 170.641646] Hardware name: Intel Corporation CooperCity/CooperCity, BIOS WLYDCRB1.SYS.0016.P38.2006170234 06/17/2020
> [ 170.641651] RIP: 0010:smp_call_function_many_cond+0x2b1/0x2e0
> [ 170.641653] Code: c7 e8 d3 c5 3c 00 3b 05 71 1b 86 01 0f 83 e8 fd ff ff 48 63 c8 48 8b 13 48 03 14 cd 20 29 58 90 8b 4a 18 83 e1 01 74 0a f3 90 <8b> 4a 18 83 e1 01 75 f6 eb c6 0f 0b e9 97 fd ff ff 48 c7 c2 e0 80
> [ 170.641654] RSP: 0018:ffffb115617afca0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
> [ 170.641655] RAX: 0000000000000014 RBX: ffff9e559016dc40 RCX: 0000000000000001
> [ 170.641655] RDX: ffff9e6d8f4b44c0 RSI: 0000000000000000 RDI: ffff9e559016dc48
> [ 170.641656] RBP: 000000000000005b R08: 0000000000000000 R09: 0000000000000014
> [ 170.641656] R10: 0000000000000007 R11: 0000000000000008 R12: ffffffff8f078800
> [ 170.641656] R13: 0000000000000001 R14: ffff9e559016c600 R15: 0000000000000200
> [ 170.641657] FS: 00007f0051ffb700(0000) GS:ffff9e5590140000(0000) knlGS:0000000000000000
> [ 170.641657] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 170.641658] CR2: 00007f00678b9400 CR3: 00000017e5fb0001 CR4: 00000000007606e0
> [ 170.641659] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 170.641659] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 170.641659] PKRU: 55555554
> [ 170.641660] Call Trace:
> [ 170.641666] ? flush_tlb_func_common.constprop.10+0x220/0x220
> [ 170.641668] ? x86_configure_nx+0x50/0x50
> [ 170.641669] ? flush_tlb_func_common.constprop.10+0x220/0x220
> [ 170.641670] on_each_cpu_cond_mask+0x2f/0x80
> [ 170.641671] flush_tlb_mm_range+0xab/0xe0
> [ 170.641677] change_protection+0x18a/0xca0
> [ 170.641682] ? __switch_to_asm+0x34/0x70
> [ 170.641685] change_prot_numa+0x15/0x30
> [ 170.641689] task_numa_work+0x1aa/0x2c0
> [ 170.641694] task_work_run+0x76/0xa0
> [ 170.641698] exit_to_usermode_loop+0xeb/0xf0
> [ 170.641700] do_syscall_64+0x1aa/0x1d0
> [ 170.641701] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 170.641703] RIP: 0033:0x7f00676ac2c7
> [ 170.641704] Code: 44 00 00 41 54 55 49 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 5b fd ff ff 4c 89 e2 41 89 c0 48 89 ee 89 df b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 94 fd ff ff 48
> [ 170.641705] RSP: 002b:00007f0051ffa9b0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
> [ 170.641705] RAX: 000000000000005a RBX: 000000000000002a RCX: 00007f00676ac2c7
> [ 170.641706] RDX: 000000000000005a RSI: 00007f0030000b20 RDI: 000000000000002a
> [ 170.641706] RBP: 00007f0030000b20 R08: 0000000000000000 R09: 0000563ccd7ca0b4
> [ 170.641707] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000005a
> [ 170.641707] R13: 00007f006804e228 R14: 0000000000000000 R15: 0000000000000000
> [ 170.641710] Sending NMI from CPU 75 to CPUs 0-74,76-143:
> [ 170.641749] NMI backtrace for cpu 0
> [ 170.641749] CPU: 0 PID: 5406 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
> [ 170.641749] Hardware name: Intel Corporation CooperCity/CooperCity, BIOS WLYDCRB1.SYS.0016.P38.2006170234 06/17/2020
> [ 170.641750] RIP: 0010:smp_call_function_many_cond+0x2b1/0x2e0
> [ 170.641750] Code: c7 e8 d3 c5 3c 00 3b 05 71 1b 86 01 0f 83 e8 fd ff ff 48 63 c8 48 8b 13 48 03 14 cd 20 29 58 90 8b 4a 18 83 e1 01 74 0a f3 90 <8b> 4a 18 83 e1 01 75 f6 eb c6 0f 0b e9 97 fd ff ff 48 c7 c2 e0 80
> [ 170.641751] RSP: 0018:ffffb11561c3fca0 EFLAGS: 00000202
> [ 170.641752] RAX: 000000000000005b RBX: ffff9e558fc2dc40 RCX: 0000000000000001
> [ 170.641752] RDX: ffff9e6d8f8f20c0 RSI: 0000000000000000 RDI: ffff9e558fc2dc48
> [ 170.641752] RBP: 000000000000005d R08: 0000000000000000 R09: 000000000000001b
> [ 170.641753] R10: 0000000000000001 R11: 0000000000000008 R12: ffffffff8f078800
> [ 170.641753] R13: 0000000000000001 R14: ffff9e558fc2c600 R15: 0000000000000200
> [ 170.641753] FS: 00007fed9d743700(0000) GS:ffff9e558fc00000(0000) knlGS:0000000000000000
> [ 170.641753] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 170.641754] CR2: 00007f002f7fea08 CR3: 00000017eb8b8001 CR4: 00000000007606f0
> [ 170.641754] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 170.641754] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 170.641754] PKRU: 55555554
> [ 170.641754] Call Trace:
> [ 170.641755] ? flush_tlb_func_common.constprop.10+0x220/0x220
> [ 170.641755] ? x86_configure_nx+0x50/0x50
> [ 170.641755] ? flush_tlb_func_common.constprop.10+0x220/0x220
> [ 170.641755] on_each_cpu_cond_mask+0x2f/0x80
> [ 170.641755] flush_tlb_mm_range+0xab/0xe0
> [ 170.641756] change_protection+0x18a/0xca0
> [ 170.641756] change_prot_numa+0x15/0x30
> [ 170.641756] task_numa_work+0x1aa/0x2c0
> [ 170.641756] task_work_run+0x76/0xa0
> [ 170.641756] exit_to_usermode_loop+0xeb/0xf0
> [ 170.641757] do_syscall_64+0x1aa/0x1d0
> [ 170.641757] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 170.641757] RIP: 0033:0x7feda134d394
> [ 170.641757] Code: 84 00 00 00 00 00 41 54 55 49 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 8b fc ff ff 4c 89 e2 41 89 c0 48 89 ee 89 df 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 38 44 89 c7 48 89 44 24 08 e8 c7 fc ff ff 48
> [ 170.641758] RSP: 002b:00007fed9d742980 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [ 170.641758] RAX: 000000000000005a RBX: 0000000000000031 RCX: 00007feda134d394
> [ 170.641758] RDX: 000000000000005a RSI: 00007fed78000b20 RDI: 0000000000000031
> [ 170.641759] RBP: 00007fed78000b20 R08: 0000000000000000 R09: 00007ffcfe964080
> [ 170.641759] R10: 00007fed9d742dd0 R11: 0000000000000246 R12: 000000000000005a
> [ 170.641759] R13: 00007feda1d15d50 R14: 0000000000000000 R15: 0000000000000000
> [ 170.641842] NMI backtrace for cpu 1
> [ 170.641844] CPU: 1 PID: 5412 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
> [ 170.641845] Hardware name: Intel Corporation CooperCity/CooperCity, BIOS WLYDCRB1.SYS.0016.P38.2006170234 06/17/2020
> [ 170.641847] RIP: 0010:smp_call_function_many_cond+0x2b1/0x2e0
> [ 170.641848] Code: c7 e8 d3 c5 3c 00 3b 05 71 1b 86 01 0f 83 e8 fd ff ff 48 63 c8 48 8b 13 48 03 14 cd 20 29 58 90 8b 4a 18 83 e1 01 74 0a f3 90 <8b> 4a 18 83 e1 01 75 f6 eb c6 0f 0b e9 97 fd ff ff 48 c7 c2 e0 80
> [ 170.641849] RSP: 0018:ffffb11561c6fca0 EFLAGS: 00000202
> [ 170.641852] RAX: 000000000000005b RBX: ffff9e558fc6dc40 RCX: 0000000000000001
> [ 170.641854] RDX: ffff9e6d8f8f38c0 RSI: 0000000000000000 RDI: ffff9e558fc6dc48
> [ 170.641855] RBP: 000000000000005d R08: 0000000000000000 R09: 000000000000001b
> [ 170.641857] R10: 0000000000000001 R11: 0000000000000008 R12: ffffffff8f078800
> [ 170.641858] R13: 0000000000000001 R14: ffff9e558fc6c600 R15: 0000000000000200
> [ 170.641863] FS: 00007fed7dffb700(0000) GS:ffff9e558fc40000(0000) knlGS:0000000000000000
> [ 170.641864] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 170.641866] CR2: 00007f0058000b20 CR3: 00000017eb8b8004 CR4: 00000000007606e0
> [ 170.641867] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 170.641869] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 170.641870] PKRU: 55555554
> [ 170.641872] Call Trace:
> [ 170.641873] ? flush_tlb_func_common.constprop.10+0x220/0x220
> [ 170.641875] ? x86_configure_nx+0x50/0x50
> [ 170.641876] ? flush_tlb_func_common.constprop.10+0x220/0x220
> [ 170.641877] on_each_cpu_cond_mask+0x2f/0x80
> [ 170.641879] flush_tlb_mm_range+0xab/0xe0
> [ 170.641880] change_protection+0x18a/0xca0
> [ 170.641882] change_prot_numa+0x15/0x30
> [ 170.641883] task_numa_work+0x1aa/0x2c0
> [ 170.641885] task_work_run+0x76/0xa0
> [ 170.641886] exit_to_usermode_loop+0xeb/0xf0
> [ 170.641888] do_syscall_64+0x1aa/0x1d0
> [ 170.641889] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 170.641890] RIP: 0033:0x7feda134d394
> [ 170.641892] Code: 84 00 00 00 00 00 41 54 55 49 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 8b fc ff ff 4c 89 e2 41 89 c0 48 89 ee 89 df 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 38 44 89 c7 48 89 44 24 08 e8 c7 fc ff ff 48
> [ 170.641894] RSP: 002b:00007fed7dffa980 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [ 170.641896] RAX: 000000000000005a RBX: 0000000000000043 RCX: 00007feda134d394
> [ 170.641898] RDX: 000000000000005a RSI: 00007fed5c000b20 RDI: 0000000000000043
> [ 170.641899] RBP: 00007fed5c000b20 R08: 0000000000000000 R09: 00007ffcfe964080
> [ 170.641901] R10: 00007fed7dffadd0 R11: 0000000000000246 R12: 000000000000005a
> [ 170.641902] R13: 00007feda1d166e0 R14: 0000000000000000 R15: 0000000000000000
> [ 170.641917] NMI backtrace for cpu 2 skipped: idling at intel_idle+0x85/0x130
> [ 170.642016] NMI backtrace for cpu 3 skipped: idling at intel_idle+0x85/0x130
> [ 170.642201] NMI backtrace for cpu 4
> [ 170.642202] CPU: 4 PID: 1512 Comm: systemd-journal Kdump: loaded Not tainted 5.7.6+ #3
> [ 170.642203] Hardware name: Intel Corporation CooperCity/CooperCity, BIOS WLYDCRB1.SYS.0016.P38.2006170234 06/17/2020
> [ 170.642204] RIP: 0010:memset_erms+0x9/0x10
> [ 170.642206] Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
> [ 170.642206] RSP: 0018:ffffb115603a7da8 EFLAGS: 00010246
> [ 170.642208] RAX: ffff9e7da8114900 RBX: 0000000000000000 RCX: 0000000000000fc0
> [ 170.642209] RDX: 0000000000004000 RSI: 0000000000000000 RDI: ffffb1156038b040
> [ 170.642210] RBP: ffffb115603a7e50 R08: ffffd0dd1fd19350 R09: ffffb11560388000
> [ 170.642211] R10: 00007ff8b36aa9d0 R11: 0000000000000000 R12: 00000000003d0f00
> [ 170.642212] R13: 00000000ffffffff R14: ffff9e5587749640 R15: ffffb115603a7ec8
> [ 170.642213] FS: 00007ff8b8c98940(0000) GS:ffff9e558fd00000(0000) knlGS:0000000000000000
> [ 170.642213] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 170.642214] CR2: 00007ff8b36a9db8 CR3: 000000501eef6005 CR4: 00000000007606e0
> [ 170.642215] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 170.642216] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 170.642217] PKRU: 55555554
> [ 170.642219] Call Trace:
> [ 170.642221] copy_process+0x22c/0x19f0
> [ 170.642223] ? mem_cgroup_throttle_swaprate+0x19/0x140
> [ 170.642224] _do_fork+0x74/0x3b0
> [ 170.642226] __do_sys_clone+0x7e/0xb0
> [ 170.642228] do_syscall_64+0x52/0x1d0
> [ 170.642230] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 170.642231] RIP: 0033:0x7ff8b87bda31
> [ 170.642234] Code: 48 85 ff 74 3d 48 85 f6 74 38 48 83 ee 10 48 89 4e 08 48 89 3e 48 89 d7 4c 89 c2 4d 89 c8 4c 8b 54 24 08 b8 38 00 00 00 0f 05 <48> 85 c0 7c 13 74 01 c3 31 ed 58 5f ff d0 48 89 c7 b8 3c 00 00 00
> [ 170.642235] RSP: 002b:00007fff9a239518 EFLAGS: 00000202 ORIG_RAX: 0000000000000038
> [ 170.642239] RAX: ffffffffffffffda RBX: 00007ff8b36aa700 RCX: 00007ff8b87bda31
> [ 170.642241] RDX: 00007ff8b36aa9d0 RSI: 00007ff8b36a9db0 RDI: 00000000003d0f00
> [ 170.642243] RBP: 00007fff9a2395f0 R08: 00007ff8b36aa700 R09: 00007ff8b36aa700
> [ 170.642244] R10: 00007ff8b36aa9d0 R11: 0000000000000202 R12: 00007ff8b36a9dc0
> [ 170.642246] R13: 0000000000000000 R14: 000056283a684980 R15: 00007fff9a239580
> [ 170.642273] NMI backtrace for cpu 5 skipped: idling at intel_idle+0x85/0x130
> [ 170.642333] NMI backtrace for cpu 6 skipped: idling at intel_idle+0x85/0x130
> [ 170.642476] NMI backtrace for cpu 7 skipped: idling at intel_idle+0x85/0x130
> [ 170.642529] NMI backtrace for cpu 8 skipped: idling at intel_idle+0x85/0x130
> [ 170.642665] NMI backtrace for cpu 9
> [ 170.642667] CPU: 9 PID: 0 Comm: swapper/9 Kdump: loaded Not tainted 5.7.6+ #3
> [ 170.642668] Hardware name: Intel Corporation CooperCity/CooperCity, BIOS WLYDCRB1.SYS.0016.P38.2006170234 06/17/2020
> [ 170.642668] RIP: 0010:cpu_startup_entry+0x14/0x20
> [ 170.642670] Code: e0 9e 9f 8f 72 0a 48 81 ff 86 a7 9f 8f 0f 92 c0 f3 c3 0f 1f 40 00 0f 1f 44 00 00 53 89 fb e8 b3 fb ff ff 89 df e8 0c 71 fc ff <e8> 07 fc ff ff eb f9 cc cc cc cc cc 0f 1f 44 00 00 48 8b 8f 58 04
> [ 170.642671] RSP: 0018:ffffb115401f7f28 EFLAGS: 00000286
> [ 170.642673] RAX: 0000000000000000 RBX: 000000000000008e RCX: 0000000000000000
> [ 170.642674] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9e558fe6ccc0
> [ 170.642675] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff9e558ac58000
> [ 170.642676] R10: 0000000000000007 R11: ffff9e558b138a00 R12: 0000000000000000
> [ 170.642676] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 170.642677] FS: 0000000000000000(0000) GS:ffff9e558fe40000(0000) knlGS:0000000000000000
> [ 170.642678] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 170.642679] CR2: 000055ed9a8a4e04 CR3: 00000017eb532004 CR4: 00000000007606e0
> [ 170.642680] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 170.642681] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 170.642681] PKRU: 55555554
> [ 170.642682] Call Trace:
> [ 170.642682] start_secondary+0x169/0x1c0
> [ 170.642683] secondary_startup_64+0xb6/0xc0
> [ 170.642769] NMI backtrace for cpu 10 skipped: idling at intel_idle+0x85/0x130
> [ 170.642868] NMI backtrace for cpu 11 skipped: idling at intel_idle+0x85/0x130
> [ 170.642969] NMI backtrace for cpu 12 skipped: idling at intel_idle+0x85/0x130
> [ 170.643067] NMI backtrace for cpu 13 skipped: idling at intel_idle+0x85/0x130
> [ 170.643171] NMI backtrace for cpu 14 skipped: idling at intel_idle+0x85/0x130
> [ 170.643268] NMI backtrace for cpu 15 skipped: idling at intel_idle+0x85/0x130
> [ 170.643372] NMI backtrace for cpu 16 skipped: idling at intel_idle+0x85/0x130
> [ 170.643473] NMI backtrace for cpu 17 skipped: idling at intel_idle+0x85/0x130
> [ 170.643555] NMI backtrace for cpu 18
> [ 170.643556] CPU: 18 PID: 5389 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
> [ 170.643556] Hardware name: Intel Corporation CooperCity/CooperCity, BIOS WLYDCRB1.SYS.0016.P38.2006170234 06/17/2020
> [ 170.643557] RIP: 0010:smp_call_function_many_cond+0x2b1/0x2e0
> [ 170.643558] Code: c7 e8 d3 c5 3c 00 3b 05 71 1b 86 01 0f 83 e8 fd ff ff 48 63 c8 48 8b 13 48 03 14 cd 20 29 58 90 8b 4a 18 83 e1 01 74 0a f3 90 <8b> 4a 18 83 e1 01 75 f6 eb c6 0f 0b e9 97 fd ff ff 48 c7 c2 e0 80
> [ 170.643558] RSP: 0018:ffffb1156178fca0 EFLAGS: 00000202
> [ 170.643559] RAX: 0000000000000014 RBX: ffff9e6d8f42dc40 RCX: 0000000000000001
> [ 170.643560] RDX: ffff9e6d8f4b3da0 RSI: 0000000000000000 RDI: ffff9e6d8f42dc48
> [ 170.643560] RBP: 000000000000005a R08: 0000000000000000 R09: 0000000000000014
> [ 170.643560] R10: 0000000000000007 R11: 0000000000000008 R12: ffffffff8f078800
> [ 170.643560] R13: 0000000000000001 R14: ffff9e6d8f42c600 R15: 0000000000000200
> [ 170.643561] FS: 00007f0053fff700(0000) GS:ffff9e6d8f400000(0000) knlGS:0000000000000000
> [ 170.643561] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 170.643561] CR2: 00007f0054002e58 CR3: 00000017e5fb0002 CR4: 00000000007606e0
> [ 170.643561] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 170.643562] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 170.643562] PKRU: 55555554
> [ 170.643562] Call Trace:
> [ 170.643562] ? flush_tlb_func_common.constprop.10+0x220/0x220
> [ 170.643562] ? x86_configure_nx+0x50/0x50
> [ 170.643563] ? flush_tlb_func_common.constprop.10+0x220/0x220
> [ 170.643563] on_each_cpu_cond_mask+0x2f/0x80
> [ 170.643563] flush_tlb_mm_range+0xab/0xe0
> [ 170.643563] change_protection+0x18a/0xca0
> [ 170.643564] change_prot_numa+0x15/0x30
> [ 170.643564] task_numa_work+0x1aa/0x2c0
> [ 170.643564] task_work_run+0x76/0xa0
> [ 170.643565] exit_to_usermode_loop+0xeb/0xf0
> [ 170.643565] do_syscall_64+0x1aa/0x1d0
> [ 170.643565] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 170.643566] RIP: 0033:0x7f00676ac394
> [ 170.643566] Code: 84 00 00 00 00 00 41 54 55 49 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 8b fc ff ff 4c 89 e2 41 89 c0 48 89 ee 89 df 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 38 44 89 c7 48 89 44 24 08 e8 c7 fc ff ff 48
> [ 170.643566] RSP: 002b:00007f0053ffe980 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [ 170.643567] RAX: 000000000000005a RBX: 0000000000000028 RCX: 00007f00676ac394
> [ 170.643567] RDX: 000000000000005a RSI: 00007f0040000b20 RDI: 0000000000000028
> [ 170.643567] RBP: 00007f0040000b20 R08: 0000000000000000 R09: 0000563ccd7ca0b4
> [ 170.643568] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000005a
> [ 170.643568] R13: 00007f006804dbc8 R14: 0000000000000004 R15: 0000000000000000
> [ 170.643678] NMI backtrace for cpu 19
> [ 170.643678] CPU: 19 PID: 5385 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
> [ 170.643679] Hardware name: Intel Corporation CooperCity/CooperCity, BIOS WLYDCRB1.SYS.0016.P38.2006170234 06/17/2020
> [ 170.643679] RIP: 0010:native_queued_spin_lock_slowpath+0x137/0x1e0
> [ 170.643680] Code: 6a ff ff ff 41 83 c0 01 c1 e1 10 41 c1 e0 12 44 09 c1 89 c8 c1 e8 10 66 87 47 02 89 c6 c1 e6 10 85 f6 75 3c 31 f6 eb 02 f3 90 <8b> 07 66 85 c0 75 f7 41 89 c0 66 45 31 c0 44 39 c1 74 7e 48 85 f6
> [ 170.643681] RSP: 0018:ffffb1154d43ce78 EFLAGS: 00000202
> [ 170.643681] RAX: 0000000000500101 RBX: ffff9e555dea2948 RCX: 0000000000500000
> [ 170.643681] RDX: ffff9e6d8f46d9c0 RSI: 0000000000000000 RDI: ffff9e555dea2588
> [ 170.643682] RBP: ffff9e555dea2588 R08: 0000000000500000 R09: ffff9e6d8f45dc28
> [ 170.643682] R10: ffffb1154d43cef8 R11: ffff9e6d8f45dc70 R12: ffff9e555dea2500
> [ 170.643682] R13: ffff9e555dea2580 R14: 0000000000000001 R15: ffff9e555dea2948
> [ 170.643683] FS: 00007f00662a7700(0000) GS:ffff9e6d8f440000(0000) knlGS:0000000000000000
> [ 170.643683] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 170.643683] CR2: 00007f00678b9400 CR3: 00000017e5fb0003 CR4: 00000000007606e0
> [ 170.643683] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 170.643684] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 170.643684] PKRU: 55555554
> [ 170.643684] Call Trace:
> [ 170.643684] <IRQ>
> [ 170.643684] _raw_spin_lock+0x1b/0x20
> [ 170.643685] tcp_delack_timer+0x2c/0xf0
> [ 170.643685] ? tcp_delack_timer_handler+0x170/0x170
> [ 170.643685] call_timer_fn+0x2d/0x130
> [ 170.643685] run_timer_softirq+0x420/0x450
> [ 170.643686] ? enqueue_hrtimer+0x39/0x90
> [ 170.643686] ? __hrtimer_run_queues+0x138/0x290
> [ 170.643686] __do_softirq+0xed/0x2f0
> [ 170.643686] irq_exit+0xad/0xb0
> [ 170.643686] smp_apic_timer_interrupt+0x74/0x130
> [ 170.643687] apic_timer_interrupt+0xf/0x20
> [ 170.643687] </IRQ>
> [ 170.643687] RIP: 0010:do_softirq.part.18+0x29/0x50
> [ 170.643688] Code: 90 0f 1f 44 00 00 53 9c 58 0f 1f 44 00 00 48 89 c3 fa 66 0f 1f 44 00 00 65 66 8b 3d 81 80 f9 70 66 85 ff 75 0c 48 89 df 57 9d <0f> 1f 44 00 00 5b c3 0f b7 ff e8 a8 fc ff ff 84 c0 75 e8 e8 bf ca
> [ 170.643688] RSP: 0018:ffffb1156174fac8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
> [ 170.643688] RAX: ffff9e555de3ac80 RBX: 0000000000000202 RCX: ffff9e6d8f46ccc0
> [ 170.643689] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000202
> [ 170.643691] RBP: ffffb1156174fb30 R08: 0000000000000001 R09: 0000000000000101
> [ 170.643694] R10: ffffb1154d43cb08 R11: 0000000000000320 R12: ffff9e6d733c2ae0
> [ 170.643696] R13: 0000000000000010 R14: 000000000000000e R15: 0000000000000000
> [ 170.643698] __local_bh_enable_ip+0x4b/0x50
> [ 170.643701] ip_finish_output2+0x1af/0x5a0
> [ 170.643703] ? __switch_to_asm+0x34/0x70
> [ 170.643705] ? __switch_to_asm+0x40/0x70
> [ 170.643707] ? ip_output+0x6d/0xe0
> [ 170.643709] ip_output+0x6d/0xe0
> [ 170.643711] ? __ip_finish_output+0x1c0/0x1c0
> [ 170.643713] __ip_queue_xmit+0x14d/0x3d0
> [ 170.643716] __tcp_transmit_skb+0x594/0xb20
> [ 170.643718] tcp_write_xmit+0x241/0x1150
> [ 170.643720] __tcp_push_pending_frames+0x32/0xf0
> [ 170.643722] tcp_sendmsg_locked+0x50f/0xd30
> [ 170.643724] tcp_sendmsg+0x27/0x40
> [ 170.643726] sock_sendmsg+0x3e/0x60
> [ 170.643728] sock_write_iter+0x97/0x100
> [ 170.643730] new_sync_write+0x1b6/0x1d0
> [ 170.643732] vfs_write+0xad/0x1b0
> [ 170.643734] ksys_write+0x50/0xe0
> [ 170.643736] do_syscall_64+0x52/0x1d0
> [ 170.643738] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 170.643740] RIP: 0033:0x7f00676ac2c7
> [ 170.643742] Code: 44 00 00 41 54 55 49 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 5b fd ff ff 4c 89 e2 41 89 c0 48 89 ee 89 df b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 94 fd ff ff 48
> [ 170.643744] RSP: 002b:00007f00662a69b0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
> [ 170.643748] RAX: ffffffffffffffda RBX: 0000000000000026 RCX: 00007f00676ac2c7
> [ 170.643750] RDX: 000000000000005a RSI: 00007f005c000b20 RDI: 0000000000000026
> [ 170.643752] RBP: 00007f005c000b20 R08: 0000000000000000 R09: 0000563ccd7ca0b4
> [ 170.643754] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000005a
> [ 170.643756] R13: 00007f006804d568 R14: 0000000000000000 R15: 0000000000000000
> [ 170.643759] NMI backtrace for cpu 21 skipped: idling at intel_idle+0x85/0x130
> [ 170.643767] NMI backtrace for cpu 22 skipped: idling at intel_idle+0x85/0x130
> [ 170.643827] NMI backtrace for cpu 24 skipped: idling at intel_idle+0x85/0x130
> [ 170.643865] NMI backtrace for cpu 23 skipped: idling at intel_idle+0x85/0x130
> [ 170.643965] NMI backtrace for cpu 25 skipped: idling at intel_idle+0x85/0x130
> [ 170.643986] NMI backtrace for cpu 26 skipped: idling at intel_idle+0x85/0x130
> [ 170.644062] NMI backtrace for cpu 28 skipped: idling at intel_idle+0x85/0x130
> [ 170.644066] NMI backtrace for cpu 27 skipped: idling at intel_idle+0x85/0x130
> [ 170.644164] NMI backtrace for cpu 29 skipped: idling at intel_idle+0x85/0x130
> [ 170.644187] NMI backtrace for cpu 30 skipped: idling at intel_idle+0x85/0x130
> [ 170.644267] NMI backtrace for cpu 31 skipped: idling at intel_idle+0x85/0x130
> [ 170.644271] NMI backtrace for cpu 32 skipped: idling at intel_idle+0x85/0x130
> [ 170.644369] NMI backtrace for cpu 33 skipped: idling at intel_idle+0x85/0x130
> [ 170.644373] NMI backtrace for cpu 34 skipped: idling at intel_idle+0x85/0x130
> [ 170.644464] NMI backtrace for cpu 36 skipped: idling at intel_idle+0x85/0x130
> [ 170.644467] NMI backtrace for cpu 35 skipped: idling at intel_idle+0x85/0x130
> [ 170.644571] NMI backtrace for cpu 37 skipped: idling at intel_idle+0x85/0x130
> [ 170.644586] NMI backtrace for cpu 38 skipped: idling at intel_idle+0x85/0x130
> [ 170.644661] NMI backtrace for cpu 40 skipped: idling at intel_idle+0x85/0x130
> [ 170.644665] NMI backtrace for cpu 39 skipped: idling at intel_idle+0x85/0x130
> [ 170.644760] NMI backtrace for cpu 41 skipped: idling at intel_idle+0x85/0x130
> [ 170.644764] NMI backtrace for cpu 42 skipped: idling at intel_idle+0x85/0x130
> [ 170.644860] NMI backtrace for cpu 43 skipped: idling at intel_idle+0x85/0x130
> [ 170.644864] NMI backtrace for cpu 44 skipped: idling at intel_idle+0x85/0x130
> [ 170.644965] NMI backtrace for cpu 45 skipped: idling at intel_idle+0x85/0x130
> [ 170.645073] NMI backtrace for cpu 46 skipped: idling at intel_idle+0x85/0x130
> [ 170.645177] NMI backtrace for cpu 47 skipped: idling at intel_idle+0x85/0x130
> [ 170.645270] NMI backtrace for cpu 48 skipped: idling at intel_idle+0x85/0x130
> [ 170.645367] NMI backtrace for cpu 49 skipped: idling at intel_idle+0x85/0x130
> [ 170.645465] NMI backtrace for cpu 50 skipped: idling at intel_idle+0x85/0x130
> [ 170.645469] NMI backtrace for cpu 51 skipped: idling at intel_idle+0x85/0x130
> [ 170.645551] NMI backtrace for cpu 53 skipped: idling at intel_idle+0x85/0x130
> [ 170.645573] NMI backtrace for cpu 52 skipped: idling at intel_idle+0x85/0x130
> [ 170.645666] NMI backtrace for cpu 54 skipped: idling at intel_idle+0x85/0x130
> [ 170.645673] NMI backtrace for cpu 55 skipped: idling at intel_idle+0x85/0x130
> [ 170.645763] NMI backtrace for cpu 56 skipped: idling at intel_idle+0x85/0x130
> [ 170.645782] NMI backtrace for cpu 57 skipped: idling at intel_idle+0x85/0x130
> [ 170.645857] NMI backtrace for cpu 58 skipped: idling at intel_idle+0x85/0x130
> [ 170.645861] NMI backtrace for cpu 59 skipped: idling at intel_idle+0x85/0x130
> [ 170.645958] NMI backtrace for cpu 61 skipped: idling at intel_idle+0x85/0x130
> [ 170.645962] NMI backtrace for cpu 60 skipped: idling at intel_idle+0x85/0x130
> [ 170.646061] NMI backtrace for cpu 62 skipped: idling at intel_idle+0x85/0x130
> [ 170.646081] NMI backtrace for cpu 63 skipped: idling at intel_idle+0x85/0x130
> [ 170.646158] NMI backtrace for cpu 64 skipped: idling at intel_idle+0x85/0x130
> [ 170.646162] NMI backtrace for cpu 65 skipped: idling at intel_idle+0x85/0x130
> [ 170.646260] NMI backtrace for cpu 66 skipped: idling at intel_idle+0x85/0x130
> [ 170.646281] NMI backtrace for cpu 67 skipped: idling at intel_idle+0x85/0x130
> [ 170.646360] NMI backtrace for cpu 68 skipped: idling at intel_idle+0x85/0x130
> [ 170.646381] NMI backtrace for cpu 69 skipped: idling at intel_idle+0x85/0x130
> [ 170.646450] NMI backtrace for cpu 71 skipped: idling at intel_idle+0x85/0x130
> [ 170.646461] NMI backtrace for cpu 70 skipped: idling at intel_idle+0x85/0x130
> [ 170.646505] NMI backtrace for cpu 73 skipped: idling at intel_idle+0x85/0x130
> [ 170.646507] NMI backtrace for cpu 72 skipped: idling at intel_idle+0x85/0x130
> [ 170.646629] NMI backtrace for cpu 74
> [ 170.646629] CPU: 74 PID: 5413 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
> [ 170.646630] Hardware name: Intel Corporation CooperCity/CooperCity, BIOS WLYDCRB1.SYS.0016.P38.2006170234 06/17/2020
> [ 170.646630] RIP: 0010:smp_call_function_many_cond+0x2b4/0x2e0
> [ 170.646631] Code: c5 3c 00 3b 05 71 1b 86 01 0f 83 e8 fd ff ff 48 63 c8 48 8b 13 48 03 14 cd 20 29 58 90 8b 4a 18 83 e1 01 74 0a f3 90 8b 4a 18 <83> e1 01 75 f6 eb c6 0f 0b e9 97 fd ff ff 48 c7 c2 e0 80 99 90 48
> [ 170.646632] RSP: 0018:ffffb11561c77ca0 EFLAGS: 00000202
> [ 170.646633] RAX: 000000000000005b RBX: ffff9e559012dc40 RCX: 0000000000000003
> [ 170.646633] RDX: ffff9e6d8f8f44a0 RSI: 0000000000000000 RDI: ffff9e559012dc48
> [ 170.646633] RBP: 000000000000005d R08: 0000000000000000 R09: 000000000000001b
> [ 170.646634] R10: 0000000000000001 R11: 0000000000000008 R12: ffffffff8f078800
> [ 170.646634] R13: 0000000000000001 R14: ffff9e559012c600 R15: 0000000000000200
> [ 170.646634] FS: 00007fed7d7fa700(0000) GS:ffff9e5590100000(0000) knlGS:0000000000000000
> [ 170.646634] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 170.646634] CR2: 000055fd7314e1e0 CR3: 00000017eb8b8006 CR4: 00000000007606e0
> [ 170.646635] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 170.646635] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 170.646635] PKRU: 55555554
> [ 170.646635] Call Trace:
> [ 170.646636] ? flush_tlb_func_common.constprop.10+0x220/0x220
> [ 170.646636] ? x86_configure_nx+0x50/0x50
> [ 170.646636] ? flush_tlb_func_common.constprop.10+0x220/0x220
> [ 170.646636] on_each_cpu_cond_mask+0x2f/0x80
> [ 170.646636] flush_tlb_mm_range+0xab/0xe0
> [ 170.646637] change_protection+0x18a/0xca0
> [ 170.646637] ? __switch_to_asm+0x34/0x70
> [ 170.646637] change_prot_numa+0x15/0x30
> [ 170.646637] task_numa_work+0x1aa/0x2c0
> [ 170.646637] task_work_run+0x76/0xa0
> [ 170.646637] exit_to_usermode_loop+0xeb/0xf0
> [ 170.646638] do_syscall_64+0x1aa/0x1d0
> [ 170.646638] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 170.646638] RIP: 0033:0x7feda134d2c7
> [ 170.646639] Code: 44 00 00 41 54 55 49 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 5b fd ff ff 4c 89 e2 41 89 c0 48 89 ee 89 df b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 94 fd ff ff 48
> [ 170.646639] RSP: 002b:00007fed7d7f99b0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
> [ 170.646640] RAX: 000000000000005a RBX: 0000000000000025 RCX: 00007feda134d2c7
> [ 170.646640] RDX: 000000000000005a RSI: 00007fed60000b20 RDI: 0000000000000025
> [ 170.646641] RBP: 00007fed60000b20 R08: 0000000000000000 R09: 00007ffcfe964080
> [ 170.646641] R10: 00007fed7d7f9dd0 R11: 0000000000000293 R12: 000000000000005a
> [ 170.646641] R13: 00007feda1d16878 R14: 0000000000000004 R15: 0000000000000000
> [ 170.646724] NMI backtrace for cpu 76 skipped: idling at intel_idle+0x85/0x130
> [ 170.646822] NMI backtrace for cpu 77 skipped: idling at intel_idle+0x85/0x130
> [ 170.646826] NMI backtrace for cpu 78
> [ 170.646827] CPU: 78 PID: 4845 Comm: rs:main Q:Reg Kdump: loaded Not tainted 5.7.6+ #3
> [ 170.646829] Hardware name: Intel Corporation CooperCity/CooperCity, BIOS WLYDCRB1.SYS.0016.P38.2006170234 06/17/2020
> [ 170.646830] RIP: 0033:0x5614d82f7e11
> [ 170.646832] Code: 10 45 31 c9 48 85 c9 74 6b 49 89 c2 48 89 de 49 29 ca 48 39 c1 b8 00 00 00 00 4c 0f 43 d0 45 31 c9 4d 39 d1 41 0f 96 c0 31 ff <85> ff 75 3b 45 84 c0 74 36 48 8b 55 00 0f b6 02 38 06 75 1d 31 c0
> [ 170.646833] RSP: 002b:00007fc035a059b0 EFLAGS: 00000246
> [ 170.646835] RAX: 000000000000005b RBX: 0000000000000055 RCX: 0000000000000005
> [ 170.646836] RDX: 00005614d963e200 RSI: 00007fc028022d06 RDI: 0000000000000000
> [ 170.646837] RBP: 00005614d963e1e0 R08: 00007fc035a05a01 R09: 0000000000000006
> [ 170.646839] R10: 0000000000000024 R11: 0000000000000000 R12: 00005614d963e010
> [ 170.646841] R13: 00005614d833aef8 R14: 00005614d855926c R15: 0000000000000000
> [ 170.646843] FS: 00007fc035a06700 GS: 0000000000000000
> [ 170.646958] NMI backtrace for cpu 79 skipped: idling at intel_idle+0x85/0x130
> [ 170.647073] NMI backtrace for cpu 80
> [ 170.647074] CPU: 80 PID: 0 Comm: swapper/80 Kdump: loaded Not tainted 5.7.6+ #3
> [ 170.647075] Hardware name: Intel Corporation CooperCity/CooperCity, BIOS WLYDCRB1.SYS.0016.P38.2006170234 06/17/2020
> [ 170.647076] RIP: 0010:pick_task_fair+0xb/0x90
> [ 170.647077] Code: 01 00 00 48 83 c1 01 48 f7 f1 49 89 86 50 0a 00 00 e9 a4 fc ff ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 8b 8f d0 00 00 00 <85> c9 74 72 55 53 48 8d 9f c0 00 00 00 48 83 ec 08 eb 1f 8b 55 38
> [ 170.647078] RSP: 0018:ffffb115407d7e18 EFLAGS: 00000046
> [ 170.647080] RAX: ffffffff8f0cf510 RBX: ffff9e55902accc0 RCX: 0000000000000001
> [ 170.647081] RDX: 0000000000010001 RSI: 0000000000000000 RDI: ffff9e55902accc0
> [ 170.647082] RBP: ffffb115407d7ec8 R08: 0000000000000000 R09: 0000000000000010
> [ 170.647083] R10: 0000000000000007 R11: 000000000000002d R12: 0000000000000000
> [ 170.647083] R13: 0000000000000050 R14: 0000000000000000 R15: 0000000000000000
> [ 170.647084] FS: 0000000000000000(0000) GS:ffff9e5590280000(0000) knlGS:0000000000000000
> [ 170.647085] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 170.647086] CR2: 000055e027ae2868 CR3: 00000017e5c30005 CR4: 00000000007606e0
> [ 170.647087] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 170.647088] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 170.647088] PKRU: 55555554
> [ 170.647089] Call Trace:
> [ 170.647089] __schedule+0x7a1/0x1450
> [ 170.647090] ? sched_clock+0x5/0x10
> [ 170.647091] ? update_ts_time_stats+0x53/0x80
> [ 170.647091] schedule_idle+0x1e/0x40
> [ 170.647092] do_idle+0x174/0x280
> [ 170.647093] cpu_startup_entry+0x19/0x20
> [ 170.647093] start_secondary+0x169/0x1c0
> [ 170.647094] secondary_startup_64+0xb6/0xc0
> [ 170.647113] NMI backtrace for cpu 81 skipped: idling at intel_idle+0x85/0x130
> [ 170.647246] NMI backtrace for cpu 82 skipped: idling at intel_idle+0x85/0x130
> [ 170.647316] NMI backtrace for cpu 83 skipped: idling at intel_idle+0x85/0x130
> [ 170.647457] NMI backtrace for cpu 84 skipped: idling at intel_idle+0x85/0x130
> [ 170.647569] NMI backtrace for cpu 85 skipped: idling at intel_idle+0x85/0x130
> [ 170.647661] NMI backtrace for cpu 86 skipped: idling at intel_idle+0x85/0x130
> [ 170.647756] NMI backtrace for cpu 87 skipped: idling at intel_idle+0x85/0x130
> [ 170.647859] NMI backtrace for cpu 88 skipped: idling at intel_idle+0x85/0x130
> [ 170.647954] NMI backtrace for cpu 89 skipped: idling at intel_idle+0x85/0x130
> [ 170.648001] NMI backtrace for cpu 90 skipped: idling at intel_idle+0x85/0x130
> [ 170.648147] NMI backtrace for cpu 92
> [ 170.648148] CPU: 92 PID: 5404 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
> [ 170.648149] Hardware name: Intel Corporation CooperCity/CooperCity, BIOS WLYDCRB1.SYS.0016.P38.2006170234 06/17/2020
> [ 170.648149] RIP: 0010:native_queued_spin_lock_slowpath+0x63/0x1e0
> [ 170.648150] Code: ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1e 85 c0 75 0b b8 01 00 00 00 66 89 07 c3 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
> [ 170.648151] RSP: 0018:ffffb1154e0d8d68 EFLAGS: 00000202
> [ 170.648152] RAX: 0000000000500101 RBX: ffff9e555d8354e0 RCX: 0000000000000020
> [ 170.648152] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9e555dea2588
> [ 170.648152] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
> [ 170.648152] R10: 000000000100007f R11: 000000000000b7c2 R12: ffff9e555dea2588
> [ 170.648153] R13: ffffffff9093cb40 R14: ffff9e555dea2500 R15: ffff9e8da197e120
> [ 170.648153] FS: 00007fed9e745700(0000) GS:ffff9e6d8f900000(0000) knlGS:0000000000000000
> [ 170.648153] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 170.648154] CR2: 00007f0018000b20 CR3: 00000017eb8b8006 CR4: 00000000007606e0
> [ 170.648154] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 170.648154] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 170.648154] PKRU: 55555554
> [ 170.648155] Call Trace:
> [ 170.648155] <IRQ>
> [ 170.648155] _raw_spin_lock+0x1b/0x20
> [ 170.648155] tcp_v4_rcv+0xaa5/0xbc0
> [ 170.648155] ip_protocol_deliver_rcu+0x2c/0x1a0
> [ 170.648156] ip_local_deliver_finish+0x44/0x50
> [ 170.648156] ip_local_deliver+0x79/0xe0
> [ 170.648156] ? ip_rcv_finish+0x64/0xa0
> [ 170.648156] ip_rcv+0xbc/0xd0
> [ 170.648156] __netif_receive_skb_one_core+0x85/0xa0
> [ 170.648157] process_backlog+0x9b/0x150
> [ 170.648157] net_rx_action+0x149/0x3f0
> [ 170.648157] __do_softirq+0xed/0x2f0
> [ 170.648157] do_softirq_own_stack+0x2a/0x40
> [ 170.648157] </IRQ>
> [ 170.648158] do_softirq.part.18+0x41/0x50
> [ 170.648158] __local_bh_enable_ip+0x4b/0x50
> [ 170.648158] ip_finish_output2+0x1af/0x5a0
> [ 170.648158] ? __switch_to_asm+0x34/0x70
> [ 170.648159] ? __switch_to_asm+0x40/0x70
> [ 170.648159] ? ip_output+0x6d/0xe0
> [ 170.648159] ip_output+0x6d/0xe0
> [ 170.648159] ? __ip_finish_output+0x1c0/0x1c0
> [ 170.648159] __ip_queue_xmit+0x14d/0x3d0
> [ 170.648160] __tcp_transmit_skb+0x594/0xb20
> [ 170.648160] tcp_write_xmit+0x241/0x1150
> [ 170.648160] __tcp_push_pending_frames+0x32/0xf0
> [ 170.648160] tcp_sendmsg_locked+0x50f/0xd30
> [ 170.648161] tcp_sendmsg+0x27/0x40
> [ 170.648161] sock_sendmsg+0x3e/0x60
> [ 170.648161] sock_write_iter+0x97/0x100
> [ 170.648161] new_sync_write+0x1b6/0x1d0
> [ 170.648161] vfs_write+0xad/0x1b0
> [ 170.648162] ksys_write+0x50/0xe0
> [ 170.648162] do_syscall_64+0x52/0x1d0
> [ 170.648162] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 170.648162] RIP: 0033:0x7feda134d2c7
> [ 170.648163] Code: 44 00 00 41 54 55 49 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 5b fd ff ff 4c 89 e2 41 89 c0 48 89 ee 89 df b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 94 fd ff ff 48
> [ 170.648163] RSP: 002b:00007fed9e7449b0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
> [ 170.648164] RAX: ffffffffffffffda RBX: 000000000000002c RCX: 00007feda134d2c7
> [ 170.648164] RDX: 000000000000005a RSI: 00007fed80000b20 RDI: 000000000000002c
> [ 170.648164] RBP: 00007fed80000b20 R08: 0000000000000000 R09: 00007ffcfe964080
> [ 170.648165] R10: 00007fed9e744dd0 R11: 0000000000000293 R12: 000000000000005a
> [ 170.648165] R13: 00007feda1d15a20 R14: 0000000000000004 R15: 0000000000000000
> [ 170.648240] NMI backtrace for cpu 93
> [ 170.648241] CPU: 93 PID: 5402 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
> [ 170.648241] Hardware name: Intel Corporation CooperCity/CooperCity, BIOS WLYDCRB1.SYS.0016.P38.2006170234 06/17/2020
> [ 170.648242] RIP: 0010:smp_call_function_many_cond+0x2b1/0x2e0
> [ 170.648242] Code: c7 e8 d3 c5 3c 00 3b 05 71 1b 86 01 0f 83 e8 fd ff ff 48 63 c8 48 8b 13 48 03 14 cd 20 29 58 90 8b 4a 18 83 e1 01 74 0a f3 90 <8b> 4a 18 83 e1 01 75 f6 eb c6 0f 0b e9 97 fd ff ff 48 c7 c2 e0 80
> [ 170.648243] RSP: 0018:ffffb11561c17ca0 EFLAGS: 00000202
> [ 170.648243] RAX: 000000000000005b RBX: ffff9e6d8f96dc40 RCX: 0000000000000001
> [ 170.648244] RDX: ffff9e6d8f8f4700 RSI: 0000000000000000 RDI: ffff9e6d8f96dc48
> [ 170.648244] RBP: 000000000000005c R08: 0000000000000000 R09: 000000000000001b
> [ 170.648244] R10: 0000000000000001 R11: 0000000000000008 R12: ffffffff8f078800
> [ 170.648244] R13: 0000000000000001 R14: ffff9e6d8f96c600 R15: 0000000000000200
> [ 170.648245] FS: 00007fed9f747700(0000) GS:ffff9e6d8f940000(0000) knlGS:0000000000000000
> [ 170.648245] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 170.648245] CR2: 00007f006804d9b8 CR3: 00000017eb8b8001 CR4: 00000000007606e0
> [ 170.648245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 170.648246] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 170.648246] PKRU: 55555554
> [ 170.648246] Call Trace:
> [ 170.648246] ? flush_tlb_func_common.constprop.10+0x220/0x220
> [ 170.648246] ? x86_configure_nx+0x50/0x50
> [ 170.648247] ? flush_tlb_func_common.constprop.10+0x220/0x220
> [ 170.648247] on_each_cpu_cond_mask+0x2f/0x80
> [ 170.648247] flush_tlb_mm_range+0xab/0xe0
> [ 170.648247] change_protection+0x18a/0xca0
> [ 170.648247] change_prot_numa+0x15/0x30
> [ 170.648248] task_numa_work+0x1aa/0x2c0
> [ 170.648248] task_work_run+0x76/0xa0
> [ 170.648248] exit_to_usermode_loop+0xeb/0xf0
> [ 170.648248] do_syscall_64+0x1aa/0x1d0
> [ 170.648248] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 170.648249] RIP: 0033:0x7feda134d2c7
> [ 170.648249] Code: 44 00 00 41 54 55 49 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 5b fd ff ff 4c 89 e2 41 89 c0 48 89 ee 89 df b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 94 fd ff ff 48
> [ 170.648249] RSP: 002b:00007fed9f7469b0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
> [ 170.648250] RAX: 000000000000005a RBX: 0000000000000034 RCX: 00007feda134d2c7
> [ 170.648250] RDX: 000000000000005a RSI: 00007fed88000b20 RDI: 0000000000000034
> [ 170.648250] RBP: 00007fed88000b20 R08: 0000000000000000 R09: 00007ffcfe964080
> [ 170.648250] R10: 00007fed9f746dd0 R11: 0000000000000293 R12: 000000000000005a
> [ 170.648251] R13: 00007feda1d156f0 R14: 0000000000000004 R15: 0000000000000000
> [ 170.648259] NMI backtrace for cpu 94 skipped: idling at intel_idle+0x85/0x130
> [ 170.648354] NMI backtrace for cpu 95 skipped: idling at intel_idle+0x85/0x130
> [ 170.648373] NMI backtrace for cpu 96 skipped: idling at intel_idle+0x85/0x130
> [ 170.648453] NMI backtrace for cpu 97 skipped: idling at intel_idle+0x85/0x130
> [ 170.648475] NMI backtrace for cpu 98 skipped: idling at intel_idle+0x85/0x130
> [ 170.648550] NMI backtrace for cpu 100 skipped: idling at intel_idle+0x85/0x130
> [ 170.648554] NMI backtrace for cpu 99 skipped: idling at intel_idle+0x85/0x130
> [ 170.648653] NMI backtrace for cpu 101 skipped: idling at intel_idle+0x85/0x130
> [ 170.648673] NMI backtrace for cpu 102 skipped: idling at intel_idle+0x85/0x130
> [ 170.648755] NMI backtrace for cpu 104 skipped: idling at intel_idle+0x85/0x130
> [ 170.648759] NMI backtrace for cpu 103 skipped: idling at intel_idle+0x85/0x130
> [ 170.648855] NMI backtrace for cpu 105 skipped: idling at intel_idle+0x85/0x130
> [ 170.648858] NMI backtrace for cpu 106 skipped: idling at intel_idle+0x85/0x130
> [ 170.648954] NMI backtrace for cpu 108 skipped: idling at intel_idle+0x85/0x130
> [ 170.648958] NMI backtrace for cpu 107 skipped: idling at intel_idle+0x85/0x130
> [ 170.649053] NMI backtrace for cpu 109 skipped: idling at intel_idle+0x85/0x130
> [ 170.649074] NMI backtrace for cpu 110 skipped: idling at intel_idle+0x85/0x130
> [ 170.649149] NMI backtrace for cpu 111 skipped: idling at intel_idle+0x85/0x130
> [ 170.649153] NMI backtrace for cpu 112 skipped: idling at intel_idle+0x85/0x130
> [ 170.649249] NMI backtrace for cpu 114 skipped: idling at intel_idle+0x85/0x130
> [ 170.649253] NMI backtrace for cpu 113 skipped: idling at intel_idle+0x85/0x130
> [ 170.649352] NMI backtrace for cpu 115 skipped: idling at intel_idle+0x85/0x130
> [ 170.649453] NMI backtrace for cpu 116 skipped: idling at intel_idle+0x85/0x130
> [ 170.649552] NMI backtrace for cpu 117 skipped: idling at intel_idle+0x85/0x130
> [ 170.649652] NMI backtrace for cpu 118 skipped: idling at intel_idle+0x85/0x130
> [ 170.649752] NMI backtrace for cpu 119 skipped: idling at intel_idle+0x85/0x130
> [ 170.649852] NMI backtrace for cpu 121 skipped: idling at intel_idle+0x85/0x130
> [ 170.649856] NMI backtrace for cpu 120 skipped: idling at intel_idle+0x85/0x130
> [ 170.649954] NMI backtrace for cpu 122 skipped: idling at intel_idle+0x85/0x130
> [ 170.649957] NMI backtrace for cpu 123 skipped: idling at intel_idle+0x85/0x130
> [ 170.650057] NMI backtrace for cpu 124 skipped: idling at intel_idle+0x85/0x130
> [ 170.650148] NMI backtrace for cpu 125 skipped: idling at intel_idle+0x85/0x130
> [ 170.650250] NMI backtrace for cpu 126 skipped: idling at intel_idle+0x85/0x130
> [ 170.650254] NMI backtrace for cpu 127 skipped: idling at intel_idle+0x85/0x130
> [ 170.650347] NMI backtrace for cpu 128 skipped: idling at intel_idle+0x85/0x130
> [ 170.650351] NMI backtrace for cpu 129 skipped: idling at intel_idle+0x85/0x130
> [ 170.650446] NMI backtrace for cpu 130 skipped: idling at intel_idle+0x85/0x130
> [ 170.650450] NMI backtrace for cpu 131 skipped: idling at intel_idle+0x85/0x130
> [ 170.650549] NMI backtrace for cpu 132 skipped: idling at intel_idle+0x85/0x130
> [ 170.650573] NMI backtrace for cpu 133 skipped: idling at intel_idle+0x85/0x130
> [ 170.650649] NMI backtrace for cpu 134 skipped: idling at intel_idle+0x85/0x130
> [ 170.650749] NMI backtrace for cpu 135 skipped: idling at intel_idle+0x85/0x130
> [ 170.650848] NMI backtrace for cpu 136 skipped: idling at intel_idle+0x85/0x130
> [ 170.650945] NMI backtrace for cpu 138 skipped: idling at intel_idle+0x85/0x130
> [ 170.650949] NMI backtrace for cpu 137 skipped: idling at intel_idle+0x85/0x130
> [ 170.651046] NMI backtrace for cpu 139 skipped: idling at intel_idle+0x85/0x130
> [ 170.651050] NMI backtrace for cpu 140 skipped: idling at intel_idle+0x85/0x130
> [ 170.651145] NMI backtrace for cpu 142 skipped: idling at intel_idle+0x85/0x130
> [ 170.651149] NMI backtrace for cpu 141 skipped: idling at intel_idle+0x85/0x130
> [ 170.651244] NMI backtrace for cpu 143 skipped: idling at intel_idle+0x85/0x130
> [ 170.652186] NMI watchdog: Watchdog detected hard LOCKUP on cpu 20
> [ 170.652186] Modules linked in: binfmt_misc ipmi_ssif intel_rapl_msr intel_rapl_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass joydev efi_pstore input_leds mei_me lpc_ich mei ioatdma ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_decompress zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ast drm_vram_helper drm_ttm_helper ttm crct10dif_pclmul ixgbe drm_kms_helper crc32_pclmul igb ghash_clmulni_intel aesni_intel syscopyarea crypto_simd sysfillrect dca cryptd sysimgblt glue_helper fb_sys_fops ptp mdio ahci pps_core drm i2c_algo_bit libahci wmi hid_generic usbhid hid
> [ 170.652198] CPU: 20 PID: 5388 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
> [ 170.652199] Hardware name: Intel Corporation CooperCity/CooperCity, BIOS WLYDCRB1.SYS.0016.P38.2006170234 06/17/2020
> [ 170.652199] RIP: 0010:sched_core_irq_exit+0xcc/0x110
> [ 170.652200] Code: 48 8b 8b 30 0c 00 00 83 e8 01 89 81 68 0c 00 00 e9 3c 00 00 00 48 89 df c6 07 00 0f 1f 40 00 84 d2 75 07 e9 4f ff ff ff f3 90 <48> 8b 83 30 0c 00 00 8b 80 68 0c 00 00 85 c0 75 ed 5b c3 48 8b bb
> [ 170.652201] RSP: 0018:ffffb1154d468fe0 EFLAGS: 00000002
> [ 170.652201] RAX: 0000000000000001 RBX: ffff9e6d8f4accc0 RCX: ffff9e6d8f4accc0
> [ 170.652202] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff9e6d8f4accc0
> [ 170.652202] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> [ 170.652202] R10: ffffb1154d468f80 R11: 0000000000000000 R12: 0000000000000000
> [ 170.652203] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 170.652203] FS: 00007f0064aa4700(0000) GS:ffff9e6d8f480000(0000) knlGS:0000000000000000
> [ 170.652203] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 170.652203] CR2: 00007f002effda08 CR3: 00000017e5fb0002 CR4: 00000000007606e0
> [ 170.652204] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 170.652204] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 170.652204] PKRU: 55555554
> [ 170.652204] Call Trace:
> [ 170.652204] <IRQ>
> [ 170.652205] irq_exit+0x6a/0xb0
> [ 170.652205] call_function_interrupt+0xf/0x20
> [ 170.652205] </IRQ>
> [ 170.652205] RIP: 0010:_raw_spin_lock_bh+0x1b/0x30
> [ 170.652206] Code: 0f b1 17 75 06 b8 01 00 00 00 c3 31 c0 c3 90 0f 1f 44 00 00 65 81 05 18 e3 61 70 00 02 00 00 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 02 f3 c3 89 c6 e8 8a 2e 6f ff 66 90 c3 0f 1f 80 00 00 00 00 0f
> [ 170.652206] RSP: 0018:ffffb11561787cd8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff03
> [ 170.652206] RAX: 0000000000000000 RBX: ffff9e555dea2588 RCX: 0000000000000000
> [ 170.652206] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9e555dea2588
> [ 170.652207] RBP: ffff9e555dea2500 R08: 0000000000000000 R09: ffffb11561787dbc
> [ 170.652207] R10: ffffb11561787ea0 R11: ffff9e555ff10e38 R12: 0000000000000000
> [ 170.652207] R13: ffff9e555dea2500 R14: ffff9e555ff10e00 R15: 0000000000000000
> [ 170.652207] ? _cond_resched+0x15/0x40
> [ 170.652207] lock_sock_nested+0x1e/0x60
> [ 170.652208] tcp_recvmsg+0x8c/0xb50
> [ 170.652208] ? _cond_resched+0x15/0x40
> [ 170.652208] ? aa_sk_perm+0x3e/0x1f0
> [ 170.652208] inet6_recvmsg+0x5d/0xe0
> [ 170.652208] sock_read_iter+0x92/0xf0
> [ 170.652209] new_sync_read+0x1a7/0x1c0
> [ 170.652209] vfs_read+0x89/0x140
> [ 170.652209] ksys_read+0x50/0xe0
> [ 170.652209] do_syscall_64+0x52/0x1d0
> [ 170.652209] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 170.652209] RIP: 0033:0x7f00676ac394
> [ 170.652210] Code: 84 00 00 00 00 00 41 54 55 49 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 8b fc ff ff 4c 89 e2 41 89 c0 48 89 ee 89 df 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 38 44 89 c7 48 89 44 24 08 e8 c7 fc ff ff 48
> [ 170.652210] RSP: 002b:00007f0064aa3980 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [ 170.652210] RAX: ffffffffffffffda RBX: 000000000000001d RCX: 00007f00676ac394
> [ 170.652211] RDX: 000000000000005a RSI: 00007f0048000b20 RDI: 000000000000001d
> [ 170.652211] RBP: 00007f0048000b20 R08: 0000000000000000 R09: 0000563ccd7ca0b4
> [ 170.652211] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000005a
> [ 170.652211] R13: 00007f006804da30 R14: 0000000000000004 R15: 0000000000000000
> [ 170.652212] NMI backtrace for cpu 20
> [ 170.652212] CPU: 20 PID: 5388 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
> [ 170.652212] Hardware name: Intel Corporation CooperCity/CooperCity, BIOS WLYDCRB1.SYS.0016.P38.2006170234 06/17/2020
> [ 170.652212] RIP: 0010:sched_core_irq_exit+0xcc/0x110
> [ 170.652213] Code: 48 8b 8b 30 0c 00 00 83 e8 01 89 81 68 0c 00 00 e9 3c 00 00 00 48 89 df c6 07 00 0f 1f 40 00 84 d2 75 07 e9 4f ff ff ff f3 90 <48> 8b 83 30 0c 00 00 8b 80 68 0c 00 00 85 c0 75 ed 5b c3 48 8b bb
> [ 170.652214] RSP: 0018:ffffb1154d468fe0 EFLAGS: 00000002
> [ 170.652214] RAX: 0000000000000001 RBX: ffff9e6d8f4accc0 RCX: ffff9e6d8f4accc0
> [ 170.652214] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff9e6d8f4accc0
> [ 170.652215] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> [ 170.652215] R10: ffffb1154d468f80 R11: 0000000000000000 R12: 0000000000000000
> [ 170.652216] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 170.652216] FS: 00007f0064aa4700(0000) GS:ffff9e6d8f480000(0000) knlGS:0000000000000000
> [ 170.652216] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 170.652217] CR2: 00007f002effda08 CR3: 00000017e5fb0002 CR4: 00000000007606e0
> [ 170.652217] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 170.652217] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 170.652217] PKRU: 55555554
> [ 170.652217] Call Trace:
> [ 170.652218] <IRQ>
> [ 170.652218] irq_exit+0x6a/0xb0
> [ 170.652218] call_function_interrupt+0xf/0x20
> [ 170.652218] </IRQ>
> [ 170.652218] RIP: 0010:_raw_spin_lock_bh+0x1b/0x30
> [ 170.652219] Code: 0f b1 17 75 06 b8 01 00 00 00 c3 31 c0 c3 90 0f 1f 44 00 00 65 81 05 18 e3 61 70 00 02 00 00 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 02 f3 c3 89 c6 e8 8a 2e 6f ff 66 90 c3 0f 1f 80 00 00 00 00 0f
> [ 170.652219] RSP: 0018:ffffb11561787cd8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff03
> [ 170.652219] RAX: 0000000000000000 RBX: ffff9e555dea2588 RCX: 0000000000000000
> [ 170.652220] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9e555dea2588
> [ 170.652220] RBP: ffff9e555dea2500 R08: 0000000000000000 R09: ffffb11561787dbc
> [ 170.652220] R10: ffffb11561787ea0 R11: ffff9e555ff10e38 R12: 0000000000000000
> [ 170.652221] R13: ffff9e555dea2500 R14: ffff9e555ff10e00 R15: 0000000000000000
> [ 170.652221] ? _cond_resched+0x15/0x40
> [ 170.652221] lock_sock_nested+0x1e/0x60
> [ 170.652221] tcp_recvmsg+0x8c/0xb50
> [ 170.652222] ? _cond_resched+0x15/0x40
> [ 170.652222] ? aa_sk_perm+0x3e/0x1f0
> [ 170.652222] inet6_recvmsg+0x5d/0xe0
> [ 170.652222] sock_read_iter+0x92/0xf0
> [ 170.652222] new_sync_read+0x1a7/0x1c0
> [ 170.652222] vfs_read+0x89/0x140
> [ 170.652223] ksys_read+0x50/0xe0
> [ 170.652224] do_syscall_64+0x52/0x1d0
> [ 170.652224] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 170.652224] RIP: 0033:0x7f00676ac394
> [ 170.652225] Code: 84 00 00 00 00 00 41 54 55 49 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 8b fc ff ff 4c 89 e2 41 89 c0 48 89 ee 89 df 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 38 44 89 c7 48 89 44 24 08 e8 c7 fc ff ff 48
> [ 170.652225] RSP: 002b:00007f0064aa3980 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [ 170.652225] RAX: ffffffffffffffda RBX: 000000000000001d RCX: 00007f00676ac394
> [ 170.652226] RDX: 000000000000005a RSI: 00007f0048000b20 RDI: 000000000000001d
> [ 170.652226] RBP: 00007f0048000b20 R08: 0000000000000000 R09: 0000563ccd7ca0b4
> [ 170.652226] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000005a
> [ 170.652226] R13: 00007f006804da30 R14: 0000000000000004 R15: 0000000000000000
> [ 170.652256] NMI backtrace for cpu 91
> [ 170.652257] CPU: 91 PID: 5401 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
> [ 170.652257] Hardware name: Intel Corporation CooperCity/CooperCity, BIOS WLYDCRB1.SYS.0016.P38.2006170234 06/17/2020
> [ 170.652257] RIP: 0010:sched_core_irq_exit+0xcc/0x110
> [ 170.652258] Code: 48 8b 8b 30 0c 00 00 83 e8 01 89 81 68 0c 00 00 e9 3c 00 00 00 48 89 df c6 07 00 0f 1f 40 00 84 d2 75 07 e9 4f ff ff ff f3 90 <48> 8b 83 30 0c 00 00 8b 80 68 0c 00 00 85 c0 75 ed 5b c3 48 8b bb
> [ 170.652258] RSP: 0018:ffffb1154e0acfc8 EFLAGS: 00000002
> [ 170.652258] RAX: 0000000000000001 RBX: ffff9e6d8f8eccc0 RCX: ffff9e6d8f46ccc0
> [ 170.652258] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9e6d8f46ccc0
> [ 170.652259] RBP: 0000000000000000 R08: ffff9e6d82763e00 R09: 0000000000000100
> [ 170.652259] R10: 0000000000000000 R11: 0000000000000329 R12: 0000000000000000
> [ 170.652259] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 170.652260] FS: 00007fed9ff48700(0000) GS:ffff9e6d8f8c0000(0000) knlGS:0000000000000000
> [ 170.652260] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 170.652260] CR2: 00007f0053fffa08 CR3: 00000017eb8b8006 CR4: 00000000007606e0
> [ 170.652261] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 170.652261] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 170.652261] PKRU: 55555554
> [ 170.652261] Call Trace:
> [ 170.652262] <IRQ>
> [ 170.652262] irq_exit+0x6a/0xb0
> [ 170.652262] smp_apic_timer_interrupt+0x74/0x130
> [ 170.652262] apic_timer_interrupt+0xf/0x20
> [ 170.652262] </IRQ>
> [ 170.652263] RIP: 0010:finish_task_switch+0x7e/0x280
> [ 170.652263] Code: 01 00 0f 1f 44 00 00 0f 1f 44 00 00 41 c7 47 38 00 00 00 00 e9 00 01 00 00 48 89 df c6 07 00 0f 1f 40 00 fb 66 0f 1f 44 00 00 <65> 48 8b 04 25 c0 8b 01 00 0f 1f 44 00 00 4d 85 e4 74 22 65 48 8b
> [ 170.652263] RSP: 0018:ffffb11561c0fb08 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
> [ 170.652264] RAX: 0000000000000001 RBX: ffff9e6d8f8eccc0 RCX: 0000000000000000
> [ 170.652264] RDX: 0000000080004000 RSI: 0000000000000009 RDI: ffff9e6d8f46ccc0
> [ 170.652264] RBP: ffffb11561c0fb30 R08: 0000000000000001 R09: 0000000000000000
> [ 170.652265] R10: ffffb11540a5bd98 R11: ffff9e6d8bb37600 R12: ffff9e5567070440
> [ 170.652265] R13: 0000000000000000 R14: ffff9e555de62c80 R15: ffff9e6d8baac2c0
> [ 170.652265] ? __switch_to_asm+0x34/0x70
> [ 170.652265] __schedule+0x300/0x1450
> [ 170.652265] schedule+0x40/0xb0
> [ 170.652266] schedule_timeout+0x1dd/0x300
> [ 170.652266] wait_woken+0x44/0x80
> [ 170.652266] ? release_sock+0x43/0x90
> [ 170.652266] sk_wait_data+0x128/0x140
> [ 170.652266] ? do_wait_intr_irq+0x90/0x90
> [ 170.652267] tcp_recvmsg+0x5e0/0xb50
> [ 170.652267] ? _cond_resched+0x15/0x40
> [ 170.652267] ? aa_sk_perm+0x3e/0x1f0
> [ 170.652267] inet6_recvmsg+0x5d/0xe0
> [ 170.652267] sock_read_iter+0x92/0xf0
> [ 170.652267] new_sync_read+0x1a7/0x1c0
> [ 170.652268] vfs_read+0x89/0x140
> [ 170.652268] ksys_read+0x50/0xe0
> [ 170.652268] do_syscall_64+0x52/0x1d0
> [ 170.652268] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 170.652268] RIP: 0033:0x7feda134d394
> [ 170.652269] Code: 84 00 00 00 00 00 41 54 55 49 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 8b fc ff ff 4c 89 e2 41 89 c0 48 89 ee 89 df 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 38 44 89 c7 48 89 44 24 08 e8 c7 fc ff ff 48
> [ 170.652269] RSP: 002b:00007fed9ff47980 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [ 170.652269] RAX: ffffffffffffffda RBX: 0000000000000035 RCX: 00007feda134d394
> [ 170.652270] RDX: 000000000000005a RSI: 00007fed94000b20 RDI: 0000000000000035
> [ 170.652270] RBP: 00007fed94000b20 R08: 0000000000000000 R09: 00007ffcfe964080
> [ 170.652270] R10: 00007fed9ff47dd0 R11: 0000000000000246 R12: 000000000000005a
> [ 170.652270] R13: 00007feda1d15558 R14: 0000000000000000 R15: 0000000000000000
> [ 170.652299] Kernel panic - not syncing: softlockup: hung tasks
> [ 171.760272] CPU: 75 PID: 5393 Comm: uperf Kdump: loaded Tainted: G L 5.7.6+ #3
> [ 171.769122] Hardware name: Intel Corporation CooperCity/CooperCity, BIOS WLYDCRB1.SYS.0016.P38.2006170234 06/17/2020
> [ 171.779996] Call Trace:
> [ 171.782818] <IRQ>
> [ 171.785183] dump_stack+0x66/0x8b
> [ 171.788850] panic+0xfe/0x2e0
> [ 171.792187] watchdog_timer_fn+0x210/0x220
> [ 171.796638] ? softlockup_fn+0x30/0x30
> [ 171.800744] __hrtimer_run_queues+0x108/0x290
> [ 171.805459] hrtimer_interrupt+0xe5/0x240
> [ 171.809830] smp_apic_timer_interrupt+0x6a/0x130
> [ 171.814823] apic_timer_interrupt+0xf/0x20
> [ 171.819272] </IRQ>
> [ 171.821710] RIP: 0010:smp_call_function_many_cond+0x2b1/0x2e0
> [ 171.827795] Code: c7 e8 d3 c5 3c 00 3b 05 71 1b 86 01 0f 83 e8 fd ff ff 48 63 c8 48 8b 13 48 03 14 cd 20 29 58 90 8b 4a 18 83 e1 01 74 0a f3 90 <8b> 4a 18 83 e1 01 75 f6 eb c6 0f 0b e9 97 fd ff ff 48 c7 c2 e0 80
> [ 171.847266] RSP: 0018:ffffb115617afca0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
> [ 171.855198] RAX: 0000000000000014 RBX: ffff9e559016dc40 RCX: 0000000000000001
> [ 171.862702] RDX: ffff9e6d8f4b44c0 RSI: 0000000000000000 RDI: ffff9e559016dc48
> [ 171.870194] RBP: 000000000000005b R08: 0000000000000000 R09: 0000000000000014
> [ 171.877686] R10: 0000000000000007 R11: 0000000000000008 R12: ffffffff8f078800
> [ 171.885174] R13: 0000000000000001 R14: ffff9e559016c600 R15: 0000000000000200
> [ 171.892667] ? x86_configure_nx+0x50/0x50
> [ 171.897047] ? flush_tlb_func_common.constprop.10+0x220/0x220
> [ 171.903173] ? x86_configure_nx+0x50/0x50
> [ 171.907565] ? flush_tlb_func_common.constprop.10+0x220/0x220
> [ 171.913688] on_each_cpu_cond_mask+0x2f/0x80
> [ 171.918343] flush_tlb_mm_range+0xab/0xe0
> [ 171.922758] change_protection+0x18a/0xca0
> [ 171.927248] ? __switch_to_asm+0x34/0x70
> [ 171.931565] change_prot_numa+0x15/0x30
> [ 171.935794] task_numa_work+0x1aa/0x2c0
> [ 171.940036] task_work_run+0x76/0xa0
> [ 171.944020] exit_to_usermode_loop+0xeb/0xf0
> [ 171.948683] do_syscall_64+0x1aa/0x1d0
> [ 171.952827] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 171.958275] RIP: 0033:0x7f00676ac2c7
> [ 171.962251] Code: 44 00 00 41 54 55 49 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 5b fd ff ff 4c 89 e2 41 89 c0 48 89 ee 89 df b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 94 fd ff ff 48
> [ 171.981817] RSP: 002b:00007f0051ffa9b0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
> [ 171.989788] RAX: 000000000000005a RBX: 000000000000002a RCX: 00007f00676ac2c7
> [ 171.997322] RDX: 000000000000005a RSI: 00007f0030000b20 RDI: 000000000000002a
> [ 172.004849] RBP: 00007f0030000b20 R08: 0000000000000000 R09: 0000563ccd7ca0b4
> [ 172.012390] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000005a
> [ 172.019929] R13: 00007f006804e228 R14: 0000000000000000 R15: 0000000000000000

2020-07-17 23:39:10

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq

"Li, Aubrey" <[email protected]> writes:
> On 2020/7/1 5:32, Vineeth Remanan Pillai wrote:
>
> We saw a lot of soft lockups on the screen when we tested v6.
>
> [ 186.527883] watchdog: BUG: soft lockup - CPU#86 stuck for 22s! [uperf:5551]
> [ 186.535884] watchdog: BUG: soft lockup - CPU#87 stuck for 22s! [uperf:5444]
> [ 186.555883] watchdog: BUG: soft lockup - CPU#89 stuck for 22s! [uperf:5547]
> [ 187.547884] rcu: INFO: rcu_sched self-detected stall on CPU
> [ 187.553760] rcu: 40-....: (14997 ticks this GP) idle=49a/1/0x4000000000000002 softirq=1711/1711 fqs=7279
> [ 187.564685] NMI watchdog: Watchdog detected hard LOCKUP on cpu 14
> [ 187.564723] NMI watchdog: Watchdog detected hard LOCKUP on cpu 38
>
> The problem is gone when we reverted this patch. We are running multiple
> uperf threads (equal to the cpu number) in a cgroup with coresched enabled.
> This is 100% reproducible on our side.

ROTFL. I just predicted that from staring at the patch ....

2020-07-17 23:39:20

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq

Vineeth, Joel!

Vineeth Remanan Pillai <[email protected]> writes:
> From: "Joel Fernandes (Google)" <[email protected]>
>
> Lastly, we also check in the schedule loop if we are about to schedule
> an untrusted process while the core is in such a state. This is possible
> if a trusted thread enters the scheduler by way of yielding CPU. This
> would involve no transitions through the irq_exit() point to do any
> waiting, so we have to explicitly do the waiting there.

Lots of 'we' and 'this patch' in this changelog. Care to read
Documentation/process/submitting-patches.rst?

> Every attempt is made to prevent a busy-wait unnecessarily, and in
> testing on real-world ChromeOS usecases, it has not shown a performance
> drop. In ChromeOS, with this and the rest of the core scheduling
> patchset, we see around a 300% improvement in key press latencies into
> Google docs when Camera streaming is running simultaneously (90th
> percentile latency of ~150ms drops to ~50ms).
>
> This feature is controlled by the build-time config option
> CONFIG_SCHED_CORE_IRQ_PAUSE and is enabled by default. There is also a
> kernel boot parameter 'sched_core_irq_pause' to enable/disable the
> feature at boot time. Default is enabled at boot time.

Enabled by default? At least that wants to depend on the enablement of
core scheduling at boot time.

>
> +#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
> +DEFINE_STATIC_KEY_TRUE(sched_core_irq_pause);
> +static int __init set_sched_core_irq_pause(char *str)

Gah. newlines are overrated, right? Glueing variable declarations and
function declarations next to each other is unreadable.

> +{
> + unsigned long val = 0;
> + if (!str)

Missing newline between declaration and code. checkpatch.pl has told you
so.

> + return 0;
> +
> + val = simple_strtoul(str, &str, 0);

Huch? Why are you using simple_strtoul() for parsing a boolean and
what's the point of setting the **end argument to &str?

> +
> + if (val == 0)

if (!val)

is the usual kernel coding style.

> + static_branch_disable(&sched_core_irq_pause);
> +
> + return 1;
> +}
> +__setup("sched_core_irq_pause=", set_sched_core_irq_pause);
> +#endif
> +
> asmlinkage __visible void __softirq_entry __do_softirq(void)
> {
> unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
> @@ -273,6 +291,16 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
> /* Reset the pending bitmask before enabling irqs */
> set_softirq_pending(0);
>
> +#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
> + /*
> + * Core scheduling mitigations require entry into softirq to send stall
> + * IPIs to sibling hyperthreads if needed (ex, sibling is running
> + * untrusted task). If we are here from irq_exit(), no IPIs are sent.
> + */
> + if (static_branch_likely(&sched_core_irq_pause))
> + sched_core_irq_enter();

So you already have a #ifdef CONFIG_SCHED_CORE_IRQ_PAUSE section
above. Why on earth can't you just provide helper functions and stubs
for the CONFIG_SCHED_CORE_IRQ_PAUSE=n case there instead of sprinkling
#ifdeffery all over the place?

> +#endif
> +
> local_irq_enable();

How is __do_softirq() relevant at all?

__do_softirq() is called from:

1) the tail of interrupt handling
2) the tail of local_bh_enable()
3) the ksoftirq thread

None of this is relevant for what you are trying to do:

#1: Is covered by the interrupt handling code simply because this nests
into the interrupt handling.

#2: local_bh_enable() only ends up there when invoked from thread
context.

This would only make sense if one sibling is in the kernel and the
other in user space or guest mode. Isn't that forbidden by core
scheduling in the first place?

#3: See #2

But what's worse is that this is called from an interrupt disabled
region. So you brilliantly created a trivial source of livelocks.
stomp_machine() being the most obvious candidate. But it's easy enough
to come up with more subtle ones.

The point is that you want to bring out the sibling thread from user
space or guest mode when the other HT thread:

A) invokes a syscall or traps into the hypervisor which is
semantically equivalent (not code wise).

B) handles an interrupt/exception

So #A covers #2 and #3 above and #B covers #1 above.

But you have the same livelock problem with doing this core wait thingy
within any interrupt disabled region.

What you really want to do is on syscall and interrupt/exception entry:

if (other_sibling_in_user_or_guest())
send_IPI();

Of course it's convenient to piggyback that onto the reschedule IPI,
but that sucks. What you want is a dedicated IPI which does:

magic_ipi()
set_thread_flag(TIF_CORE_SCHED_EXIT);

And on the exit side you want to look at the entry code generalisation I
just posted:

https://lore.kernel.org/r/[email protected]

Add TIF_CORE_SCHED_EXIT to EXIT_TO_USER_MODE_WORK and to
EXIT_TO_GUEST_MODE_WORK and handle it in exit_to_guest_mode_work() and
exit_to_user_mode_loop() with interrupts enabled.

Trying to do this in fully interrupt disabled context is just a recipe
for disaster.

The entry case condition wants to have a TIF bit as well, i.e.

if (thread_test_bit(TIF_CORE_SCHED_REQUIRED) {
sched_ipi_dance() {
if (other_sibling_in_user_or_guest())
send_IPI();
}
}

I know you hate this, but we are not going to add tons of unholy hackery
to make this "work" by some definition of work.

The only way to make this work without making a complete mess and
killing performance is to do this on a best effort basis. It really does
not matter whether the result of the orchestration is perfect or
not. What matters is that you destroy any predictability. If the
orchestration terminates early occasionally then the resulting damage is
a purely academic exercise.

Making it work reliably without hard to debug livelocks and other
horrors is not so academic.

So please stop overengineering this and just take the pragmatic approach
of making it mostly correct. That will prevent 99.999% of realistic
attacks.

It might not prevent the carefully orchestrated academic paper attack
which utilizes the 0.001% failure rate by artificially enforcing a
scenario which cannot be enforced in a real world environment.

Thanks,

tglx
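
For illustration, a minimal sketch of the exit-side handling described above,
assuming the generic entry rework Thomas links to. TIF_CORE_SCHED_EXIT and
other_sibling_in_kernel() are placeholder names taken from this discussion,
not existing kernel interfaces:

/*
 * Sketch only: with the generic entry code, exit work runs in a loop with
 * interrupts enabled, so the core-wide wait can become just another work
 * bit instead of spinning inside irq_exit() with interrupts disabled.
 */
static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
                                            unsigned long ti_work)
{
        while (ti_work & EXIT_TO_USER_MODE_WORK) {
                local_irq_enable();

                if (ti_work & _TIF_NEED_RESCHED)
                        schedule();

                if (ti_work & _TIF_CORE_SCHED_EXIT) {
                        clear_thread_flag(TIF_CORE_SCHED_EXIT);
                        /* interrupts are on here, so this cannot livelock */
                        while (other_sibling_in_kernel())
                                cpu_relax();
                }

                local_irq_disable();
                ti_work = READ_ONCE(current_thread_info()->flags);
        }
        return ti_work;
}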

2020-07-18 17:31:49

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq

On Sat, Jul 18, 2020 at 01:37:50AM +0200, Thomas Gleixner wrote:
> "Li, Aubrey" <[email protected]> writes:
> > On 2020/7/1 5:32, Vineeth Remanan Pillai wrote:
> >
> > We saw a lot of soft lockups on the screen when we tested v6.
> >
> > [ 186.527883] watchdog: BUG: soft lockup - CPU#86 stuck for 22s! [uperf:5551]
> > [ 186.535884] watchdog: BUG: soft lockup - CPU#87 stuck for 22s! [uperf:5444]
> > [ 186.555883] watchdog: BUG: soft lockup - CPU#89 stuck for 22s! [uperf:5547]
> > [ 187.547884] rcu: INFO: rcu_sched self-detected stall on CPU
> > [ 187.553760] rcu: 40-....: (14997 ticks this GP) idle=49a/1/0x4000000000000002 softirq=1711/1711 fqs=7279
> > [ 187.564685] NMI watchdog: Watchdog detected hard LOCKUP on cpu 14
> > [ 187.564723] NMI watchdog: Watchdog detected hard LOCKUP on cpu 38
> >
> > The problem is gone when we reverted this patch. We are running multiple
> > uperf threads (equal to the cpu number) in a cgroup with coresched enabled.
> > This is 100% reproducible on our side.
>
> ROTFL. I just predicted that from staring at the patch ....

Yes, sorry. It got fixed in the patch below, which I was about to share
before you sent this email. However, it does not yet address your new
comments, which we still have to discuss (will reply to that email).

It's purely a dirty test patch, but it makes Aubrey's deadlock go away.
Basically, I moved the waiting code to prepare_exit_to_usermode(), which
removes the lock vs wait dependencies. I will work on moving the code to the
right place based on the suggestions in the other email.

https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=sched/core-sched-fix-irq-deadlocks-v1&id=8857a261f3305887b063001c6c899869206667b6

thanks,

- Joel

2020-07-20 03:55:08

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq

On Sat, Jul 18, 2020 at 01:36:16AM +0200, Thomas Gleixner wrote:
> Vineeth, Joel!

Hi Thomas,

> Vineeth Remanan Pillai <[email protected]> writes:
> > From: "Joel Fernandes (Google)" <[email protected]>
> >
> > Lastly, we also check in the schedule loop if we are about to schedule
> > an untrusted process while the core is in such a state. This is possible
> > if a trusted thread enters the scheduler by way of yielding CPU. This
> > would involve no transitions through the irq_exit() point to do any
> > waiting, so we have to explicitly do the waiting there.
>
> Lots of 'we' and 'this patch' in this changelog. Care to read
> Documentation/process/submitting-patches.rst?

Sure, will fix it.

[..]
> is the usual kernel coding style.
> > + static_branch_disable(&sched_core_irq_pause);
> > +
> > + return 1;
> > +}
> > +__setup("sched_core_irq_pause=", set_sched_core_irq_pause);
> > +#endif
> > +
> > asmlinkage __visible void __softirq_entry __do_softirq(void)
> > {
> > unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
> > @@ -273,6 +291,16 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
> > /* Reset the pending bitmask before enabling irqs */
> > set_softirq_pending(0);
> >
> > +#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
> > + /*
> > + * Core scheduling mitigations require entry into softirq to send stall
> > + * IPIs to sibling hyperthreads if needed (ex, sibling is running
> > + * untrusted task). If we are here from irq_exit(), no IPIs are sent.
> > + */
> > + if (static_branch_likely(&sched_core_irq_pause))
> > + sched_core_irq_enter();
>
> So you already have a #ifdef CONFIG_SCHED_CORE_IRQ_PAUSE section
> above. Why on earth can't you just provide helper functions and stubs
> for the CONFIG_SCHED_CORE_IRQ_PAUSE=n case there instead of sprinkling
> #ifdeffery all over the place?

These ifdeffery, checkpatch and command line parameter issues were added by
Vineeth before he sent out my patch. I'll let him comment on these; agreed,
they all need fixing!

> > +#endif
> > +
> > local_irq_enable();
>
> How is __do_softirq() relevant at all?
>
> __do_softirq() is called from:
>
> 1) the tail of interrupt handling
> 2) the tail of local_bh_enable()
> 3) the ksoftirq thread
>
> None of this is relevant for what you are trying to do:
>
> #1: Is covered by the interrupt handling code simply because this nests
> into the interrupt handling.
>
> #2: local_bh_enable() only ends up there when invoked from thread
> context.
>
> This would only make sense if one sibling is in the kernel and the
> other in user space or guest mode. Isn't that forbidden by core
> scheduling in the first place?

One sibling being in the kernel, while the other sibling is in userspace,
is not yet forbidden by core-scheduling. We need to add support for that.

I think, after reading your email and thinking about this for the last few
days, we can make it work by sending IPIs on both syscall and IRQ entries.

> #3: See #2
>
> But what's worse is that this is called from an interrupt disabled
> region. So you brilliantly created a trivial source of livelocks.
> stomp_machine() being the most obvious candidate. But it's easy enough
> to come up with more subtle ones.
>
> The point is that you want to bring out the sibling thread from user
> space or guest mode when the other HT thread:
>
> A) invokes a syscall or traps into the hypervisor which is
> semantically equivalent (not code wise).
>
> B) handles an interrupt/exception
>
> So #A covers #2 and #3 above and #B covers #1 above.

Right. #A) was never implemented. But now I'm going to do it!

> But you have the same livelock problem with doing this core wait thingy
> within any interrupt disabled region.

Right, which we fixed by moving the waiting code to
prepare_exit_to_usermode() in the other test patch I sent to fix Aubrey's
deadlock, but as you suggested I will make it work with the new generic entry
code.

> What you really want to do is on syscall and interrupt/exception entry:
>
> if (other_sibling_in_user_or_guest())
> send_IPI();

Ok.

> Of course it's convenient to piggyback that onto the reschedule IPI,
> but that sucks. What you want is a dedicated IPI which does:
>
> magic_ipi()
> set_thread_flag(TIF_CORE_SCHED_EXIT);

I tried IPIs before and ran into a few issues, mostly related to CSD locking
if I remember; I will try again.

> And on the exit side you want to look at the entry code generalisation I
> just posted:
>
> https://lore.kernel.org/r/[email protected]

Thank you, I went over this code and it makes sense to base the work on it.

> Add TIF_CORE_SCHED_EXIT to EXIT_TO_USER_MODE_WORK and to
> EXIT_TO_GUEST_MODE_WORK and handle it in exit_to_guest_mode_work() and
> exit_to_user_mode_loop() with interrupts enabled.
>

There's already state to determine whether waiting is needed: we have
counters that track whether any sibling HT in the core is handling an IRQ,
and if so I wait before exiting to user mode. The outermost IRQ entry is
tracked so we don't send IPIs on every entry.

But I got an idea from your email. Whenever a 'tagged' task becomes current,
I can set a TIF flag marking that it is a tagged task and needs special
handling when exiting to user mode (or guest). Then on exiting to usermode
(or guest), I can check if any waiting is needed and wait there in the
interrupt enabled regions you are suggesting. This should simplify the code
since currently we have that complex if expression to determine if the task
is "special".

> Trying to do this in fully interrupt disabled context is just a recipe
> for disaster.
>
> The entry case condition wants to have a TIF bit as well, i.e.
>
> if (thread_test_bit(TIF_CORE_SCHED_REQUIRED) {
> sched_ipi_dance() {
> if (other_sibling_in_user_or_guest())
> send_IPI();
> }
> }

I did not understand this bit. Could you explain more about it? Are you
talking about the IPIs sent from the schedule() loop in this series?

> I know you hate this, but we are not going to add tons of unholy hackery
> to make this "work" by some definition of work.
>
> The only way to make this work without making a complete mess and
> killing performance is to do this on a best effort basis. It really does
> not matter whether the result of the orchestration is perfect or
> not. What matters is that you destroy any predictability. If the
> orchestration terminates early occasionally then the resulting damage is
> a purely academic exercise.
>
> Making it work reliably without hard to debug livelocks and other
> horrors is not so academic.
>
> So please stop overengineering this and just take the pragmatic approach
> of making it mostly correct. That will prevent 99.999% of realistic
> attacks.

Agreed. Actually that's what I was aiming for. Even the IPIs are sent only
if needed. I did a lot of benchmarking and did not see a performance hit.

> It might not prevent the carefully orchestrated academic paper attack
> which utilizes the 0.001% failure rate by artificially enforcing a
> scenario which cannot be enforced in a real world environment.
>

Got it.

thanks,

- Joel
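
For illustration, a rough sketch of the kind of per-core accounting Joel
describes above: only the outermost IRQ entry/exit on a CPU touches a
core-wide counter, which the exit path can then poll. The helper names
mirror those visible in the posted v6 patch (sched_core_irq_enter/exit show
up in the backtraces earlier in this thread), but the fields and bodies here
are illustrative only, not the actual patch code:

/* Sketch only: count of hardware threads of this core currently in irq. */
void sched_core_irq_enter(void)
{
        struct rq *rq = this_rq();

        /* account only the outermost irq entry on this cpu */
        if (rq->core_this_irq_nest++ == 0)
                atomic_inc(&rq->core->core_irq_nest);
}

void sched_core_irq_exit(void)
{
        struct rq *rq = this_rq();

        if (--rq->core_this_irq_nest == 0)
                atomic_dec(&rq->core->core_irq_nest);
}

/* polled from the exit-to-user path, with interrupts enabled */
static bool other_sibling_in_kernel(void)
{
        return atomic_read(&this_rq()->core->core_irq_nest) != 0;
}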

2020-07-20 04:17:43

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [RFC PATCH 10/16] sched: Trivial forced-newidle balancer(Internet mail)

Hi,

> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>
> From: Peter Zijlstra <[email protected]>
>
> When a sibling is forced-idle to match the core-cookie; search for
> matching tasks to fill the core.
>
> rcu_read_unlock() can incur an infrequent deadlock in
> sched_core_balance(). Fix this by using the RCU-sched flavor instead.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
> Acked-by: Paul E. McKenney <[email protected]>
> ---
> include/linux/sched.h | 1 +
> kernel/sched/core.c | 131 +++++++++++++++++++++++++++++++++++++++++-
> kernel/sched/idle.c | 1 +
> kernel/sched/sched.h | 6 ++
> 4 files changed, 138 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 3c8dcc5ff039..4f9edf013df3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -688,6 +688,7 @@ struct task_struct {
> #ifdef CONFIG_SCHED_CORE
> struct rb_node core_node;
> unsigned long core_cookie;
> + unsigned int core_occupation;
> #endif
>
> #ifdef CONFIG_CGROUP_SCHED
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4d6d6a678013..fb9edb09ead7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -201,6 +201,21 @@ static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
> return match;
> }
>
> +static struct task_struct *sched_core_next(struct task_struct *p, unsigned long cookie)
> +{
> + struct rb_node *node = &p->core_node;
> +
> + node = rb_next(node);
> + if (!node)
> + return NULL;
> +
> + p = container_of(node, struct task_struct, core_node);
> + if (p->core_cookie != cookie)
> + return NULL;
> +
> + return p;
> +}
> +
> /*
> * The static-key + stop-machine variable are needed such that:
> *
> @@ -4233,7 +4248,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> struct task_struct *next, *max = NULL;
> const struct sched_class *class;
> const struct cpumask *smt_mask;
> - int i, j, cpu;
> + int i, j, cpu, occ = 0;
> bool need_sync;
>
> if (!sched_core_enabled(rq))
> @@ -4332,6 +4347,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> goto done;
> }
>
> + if (!is_idle_task(p))
> + occ++;
> +
> rq_i->core_pick = p;
>
> /*
> @@ -4357,6 +4375,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>
> cpu_rq(j)->core_pick = NULL;
> }
> + occ = 1;
> goto again;
> } else {
> /*
> @@ -4393,6 +4412,8 @@ next_class:;
> if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> rq_i->core_forceidle = true;
>
> + rq_i->core_pick->core_occupation = occ;
> +
> if (i == cpu)
> continue;
>
> @@ -4408,6 +4429,114 @@ next_class:;
> return next;
> }
>
> +static bool try_steal_cookie(int this, int that)
> +{
> + struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
> + struct task_struct *p;
> + unsigned long cookie;
> + bool success = false;
> +
> + local_irq_disable();
> + double_rq_lock(dst, src);
> +
> + cookie = dst->core->core_cookie;
> + if (!cookie)
> + goto unlock;
> +
> + if (dst->curr != dst->idle)
> + goto unlock;
> +
> + p = sched_core_find(src, cookie);
> + if (p == src->idle)
> + goto unlock;
> +
> + do {
> + if (p == src->core_pick || p == src->curr)
> + goto next;
> +
> + if (!cpumask_test_cpu(this, &p->cpus_mask))
> + goto next;
> +
> + if (p->core_occupation > dst->idle->core_occupation)
> + goto next;
> +
> + p->on_rq = TASK_ON_RQ_MIGRATING;
> + deactivate_task(src, p, 0);
> + set_task_cpu(p, this);
> + activate_task(dst, p, 0);
> + p->on_rq = TASK_ON_RQ_QUEUED;
> +
> + resched_curr(dst);
> +
> + success = true;
> + break;
> +
> +next:
> + p = sched_core_next(p, cookie);
> + } while (p);
> +
> +unlock:
> + double_rq_unlock(dst, src);
> + local_irq_enable();
> +
> + return success;
> +}
> +
> +static bool steal_cookie_task(int cpu, struct sched_domain *sd)
> +{
> + int i;
> +
> + for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
Since (i == cpu) should be skipped, should we start iteration at cpu+1? like,
for_each_cpu_wrap(i, sched_domain_span(sd), cpu+1) {

}
That way, we could avoid always hitting the following if (i == cpu) check.
> + if (i == cpu)
> + continue;
> +
> + if (need_resched())
> + break;
Should we return true here to accelerate the break out of sched_core_balance?
Otherwise the break would be delayed to the next-level sd iteration.
> +
> + if (try_steal_cookie(cpu, i))
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static void sched_core_balance(struct rq *rq)
> +{
> + struct sched_domain *sd;
> + int cpu = cpu_of(rq);
> +
> + rcu_read_lock_sched();
> + raw_spin_unlock_irq(rq_lockp(rq));
> + for_each_domain(cpu, sd) {
> + if (!(sd->flags & SD_LOAD_BALANCE))
> + break;
> +
> + if (need_resched())
> + break;
If resched'ed here, we miss the chance to do further forced-newidle balancing,
and the idle core could stay idle for a long time for lack of a pulling chance.
Could it be possible to add a new forced-newidle balance chance in task_tick_idle?
That could make it more efficient.

> + if (steal_cookie_task(cpu, sd))
> + break;
> + }
> + raw_spin_lock_irq(rq_lockp(rq));
> + rcu_read_unlock_sched();
> +}
> +
> +static DEFINE_PER_CPU(struct callback_head, core_balance_head);
> +
> +void queue_core_balance(struct rq *rq)
> +{
> + if (!sched_core_enabled(rq))
> + return;
> +
> + if (!rq->core->core_cookie)
> + return;
> +
> + if (!rq->nr_running) /* not forced idle */
> + return;
> +
> + queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu), sched_core_balance);
> +}
> +
> #else /* !CONFIG_SCHED_CORE */
>
> static struct task_struct *
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index a8d40ffab097..dff6ba220ed7 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -395,6 +395,7 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
> {
> update_idle_core(rq);
> schedstat_inc(rq->sched_goidle);
> + queue_core_balance(rq);
> }
>
> #ifdef CONFIG_SMP
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 293aa1ae0308..464559676fd2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1089,6 +1089,8 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
> bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
> void sched_core_adjust_sibling_vruntime(int cpu, bool coresched_enabled);
>
> +extern void queue_core_balance(struct rq *rq);
> +
> #else /* !CONFIG_SCHED_CORE */
>
> static inline bool sched_core_enabled(struct rq *rq)
> @@ -1101,6 +1103,10 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
> return &rq->__lock;
> }
>
> +static inline void queue_core_balance(struct rq *rq)
> +{
> +}
> +
> #endif /* CONFIG_SCHED_CORE */
>
> #ifdef CONFIG_SCHED_SMT
> --
> 2.17.1
>
>

2020-07-20 06:10:37

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH 10/16] sched: Trivial forced-newidle balancer(Internet mail)

On 2020/7/20 12:06, benbjiang(蒋彪) wrote:
> Hi,
>
>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>>
>> From: Peter Zijlstra <[email protected]>
>>
>> When a sibling is forced-idle to match the core-cookie; search for
>> matching tasks to fill the core.
>>
>> rcu_read_unlock() can incur an infrequent deadlock in
>> sched_core_balance(). Fix this by using the RCU-sched flavor instead.
>>
>> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>> Signed-off-by: Joel Fernandes (Google) <[email protected]>
>> Acked-by: Paul E. McKenney <[email protected]>
>> ---
>> include/linux/sched.h | 1 +
>> kernel/sched/core.c | 131 +++++++++++++++++++++++++++++++++++++++++-
>> kernel/sched/idle.c | 1 +
>> kernel/sched/sched.h | 6 ++
>> 4 files changed, 138 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 3c8dcc5ff039..4f9edf013df3 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -688,6 +688,7 @@ struct task_struct {
>> #ifdef CONFIG_SCHED_CORE
>> struct rb_node core_node;
>> unsigned long core_cookie;
>> + unsigned int core_occupation;
>> #endif
>>
>> #ifdef CONFIG_CGROUP_SCHED
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 4d6d6a678013..fb9edb09ead7 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -201,6 +201,21 @@ static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
>> return match;
>> }
>>
>> +static struct task_struct *sched_core_next(struct task_struct *p, unsigned long cookie)
>> +{
>> + struct rb_node *node = &p->core_node;
>> +
>> + node = rb_next(node);
>> + if (!node)
>> + return NULL;
>> +
>> + p = container_of(node, struct task_struct, core_node);
>> + if (p->core_cookie != cookie)
>> + return NULL;
>> +
>> + return p;
>> +}
>> +
>> /*
>> * The static-key + stop-machine variable are needed such that:
>> *
>> @@ -4233,7 +4248,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>> struct task_struct *next, *max = NULL;
>> const struct sched_class *class;
>> const struct cpumask *smt_mask;
>> - int i, j, cpu;
>> + int i, j, cpu, occ = 0;
>> bool need_sync;
>>
>> if (!sched_core_enabled(rq))
>> @@ -4332,6 +4347,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>> goto done;
>> }
>>
>> + if (!is_idle_task(p))
>> + occ++;
>> +
>> rq_i->core_pick = p;
>>
>> /*
>> @@ -4357,6 +4375,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>>
>> cpu_rq(j)->core_pick = NULL;
>> }
>> + occ = 1;
>> goto again;
>> } else {
>> /*
>> @@ -4393,6 +4412,8 @@ next_class:;
>> if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
>> rq_i->core_forceidle = true;
>>
>> + rq_i->core_pick->core_occupation = occ;
>> +
>> if (i == cpu)
>> continue;
>>
>> @@ -4408,6 +4429,114 @@ next_class:;
>> return next;
>> }
>>
>> +static bool try_steal_cookie(int this, int that)
>> +{
>> + struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
>> + struct task_struct *p;
>> + unsigned long cookie;
>> + bool success = false;
>> +
>> + local_irq_disable();
>> + double_rq_lock(dst, src);
>> +
>> + cookie = dst->core->core_cookie;
>> + if (!cookie)
>> + goto unlock;
>> +
>> + if (dst->curr != dst->idle)
>> + goto unlock;
>> +
>> + p = sched_core_find(src, cookie);
>> + if (p == src->idle)
>> + goto unlock;
>> +
>> + do {
>> + if (p == src->core_pick || p == src->curr)
>> + goto next;
>> +
>> + if (!cpumask_test_cpu(this, &p->cpus_mask))
>> + goto next;
>> +
>> + if (p->core_occupation > dst->idle->core_occupation)
>> + goto next;
>> +
>> + p->on_rq = TASK_ON_RQ_MIGRATING;
>> + deactivate_task(src, p, 0);
>> + set_task_cpu(p, this);
>> + activate_task(dst, p, 0);
>> + p->on_rq = TASK_ON_RQ_QUEUED;
>> +
>> + resched_curr(dst);
>> +
>> + success = true;
>> + break;
>> +
>> +next:
>> + p = sched_core_next(p, cookie);
>> + } while (p);
>> +
>> +unlock:
>> + double_rq_unlock(dst, src);
>> + local_irq_enable();
>> +
>> + return success;
>> +}
>> +
>> +static bool steal_cookie_task(int cpu, struct sched_domain *sd)
>> +{
>> + int i;
>> +
>> + for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
> Since (i == cpu) should be skipped, should we start iteration at cpu+1? like,
> for_each_cpu_wrap(i, sched_domain_span(sd), cpu+1) {
> …
> }
> That way, we could avoid always hitting the following if (i == cpu) check.

IMHO, this won't work, as cpu ids are not contiguous.

>> + if (i == cpu)
>> + continue;
>> +
>> + if (need_resched())
>> + break;
> Should we return true here to accelerate the break out of sched_core_balance?
> Otherwise the break would be delayed to the next-level sd iteration.
>> +
>> + if (try_steal_cookie(cpu, i))
>> + return true;
>> + }
>> +
>> + return false;
>> +}
>> +
>> +static void sched_core_balance(struct rq *rq)
>> +{
>> + struct sched_domain *sd;
>> + int cpu = cpu_of(rq);
>> +
>> + rcu_read_lock_sched();
>> + raw_spin_unlock_irq(rq_lockp(rq));
>> + for_each_domain(cpu, sd) {
>> + if (!(sd->flags & SD_LOAD_BALANCE))
>> + break;
>> +
>> + if (need_resched())
>> + break;
> If resched'ed here, we miss the chance to do further forced-newidle balancing,
> and the idle core could stay idle for a long time for lack of a pulling chance.
> Could it be possible to add a new forced-newidle balance chance in task_tick_idle?
> That could make it more efficient.

This flag indicates there is another thread that deserves to run, so I guess
the core won't be idle for a long time.

Thanks,
-Aubrey
>
>> + if (steal_cookie_task(cpu, sd))
>> + break;
>> + }
>> + raw_spin_lock_irq(rq_lockp(rq));
>> + rcu_read_unlock_sched();
>> +}
>> +
>> +static DEFINE_PER_CPU(struct callback_head, core_balance_head);
>> +
>> +void queue_core_balance(struct rq *rq)
>> +{
>> + if (!sched_core_enabled(rq))
>> + return;
>> +
>> + if (!rq->core->core_cookie)
>> + return;
>> +
>> + if (!rq->nr_running) /* not forced idle */
>> + return;
>> +
>> + queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu), sched_core_balance);
>> +}
>> +
>> #else /* !CONFIG_SCHED_CORE */
>>
>> static struct task_struct *
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index a8d40ffab097..dff6ba220ed7 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -395,6 +395,7 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
>> {
>> update_idle_core(rq);
>> schedstat_inc(rq->sched_goidle);
>> + queue_core_balance(rq);
>> }
>>
>> #ifdef CONFIG_SMP
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 293aa1ae0308..464559676fd2 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1089,6 +1089,8 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
>> bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
>> void sched_core_adjust_sibling_vruntime(int cpu, bool coresched_enabled);
>>
>> +extern void queue_core_balance(struct rq *rq);
>> +
>> #else /* !CONFIG_SCHED_CORE */
>>
>> static inline bool sched_core_enabled(struct rq *rq)
>> @@ -1101,6 +1103,10 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
>> return &rq->__lock;
>> }
>>
>> +static inline void queue_core_balance(struct rq *rq)
>> +{
>> +}
>> +
>> #endif /* CONFIG_SCHED_CORE */
>>
>> #ifdef CONFIG_SCHED_SMT
>> --
>> 2.17.1
>>
>>
>
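
As an aside on the iterator itself: for_each_cpu_wrap(i, mask, start) wraps
around the whole mask, so it does not require cpu ids to be contiguous, but
an iteration started at cpu + 1 would still visit cpu (merely last instead of
first). The explicit skip is therefore still needed either way; roughly:

        /* illustration only: even starting at cpu + 1, the wrap reaches cpu */
        for_each_cpu_wrap(i, sched_domain_span(sd), cpu + 1) {
                if (i == cpu)   /* still hit, just on the last iteration */
                        continue;
                /* ... */
        }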

2020-07-20 08:22:11

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq

Joel,

Joel Fernandes <[email protected]> writes:
> On Sat, Jul 18, 2020 at 01:36:16AM +0200, Thomas Gleixner wrote:
>>
>> The entry case condition wants to have a TIF bit as well, i.e.
>>
>> if (thread_test_bit(TIF_CORE_SCHED_REQUIRED) {
>> sched_ipi_dance() {
>> if (other_sibling_in_user_or_guest())
>> send_IPI();
>> }
>> }
>
> I did not understand this bit. Could you explain more about it? Are you
> talking about the IPIs sent from the schedule() loop in this series?

Nah, let me try again. If two tasks are out in user space (or guest
mode) and they fall under the isolation rule that they either are both
in user space or both in the kernel then you tag both with
TIF_CORE_SCHED_REQUIRED or whatever bit is appropriate.

So in entry from user you do:

if (thread_test_bit(TIF_CORE_SCHED_REQUIRED))
sched_orchestrate_entry();

void sched_orchestrate_entry(void)
{
if (other_sibling_in_user_or_guest())
send_IPI_to_sibling();
}

That IPI brings the sibling out of user or guest mode.

On the way back to user/guest you do:

if (thread_test_bit(TIF_CORE_SCHED_REQUIRED))
sched_orchestrate_exit();

void sched_orchestrate_exit(void)
{
while (other_sibling_in_kernel())
twiddle_thumbs();
}

Hope that clarifies it.

Thanks,

tglx
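
For illustration, a sketch of the IPI side that the pseudocode above leaves
implicit, using the placeholder names from this thread (TIF_CORE_SCHED_EXIT
and cpu_in_user_or_guest() are not existing kernel interfaces). A real
implementation would want the dedicated IPI Thomas describes; the
smp_call_function_single() here only keeps the sketch short:

static void sched_core_kick_sibling(void *unused)
{
        /*
         * Runs on the sibling CPU: tag whatever task is current there so
         * that its return-to-user/guest path performs the exit work.
         */
        set_thread_flag(TIF_CORE_SCHED_EXIT);
}

void sched_orchestrate_entry(void)
{
        int cpu = smp_processor_id(), sibling;

        for_each_cpu(sibling, cpu_smt_mask(cpu)) {
                if (sibling == cpu)
                        continue;
                /* placeholder: per-cpu tracking of user/guest mode */
                if (cpu_in_user_or_guest(sibling))
                        smp_call_function_single(sibling,
                                                 sched_core_kick_sibling,
                                                 NULL, 0);
        }
}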

2020-07-20 11:12:42

by Vineeth Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq

On Sun, Jul 19, 2020 at 11:54 PM Joel Fernandes <[email protected]> wrote:
>
> These ifdeffery and checkpatch / command line parameter issues were added by
> Vineeth before he sent out my patch. I'll let him comment on these, agreed
> they all need fixing!
>
Will fix this in the next iteration. Regarding the __setup code,
I shamelessly copied it from page_alloc.c during development and
missed reviewing it before sending it out. All valid points; I will fix them
in the next iteration.

Thanks,
Vineeth
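
For reference, a sketch of what the boot-parameter handling could look like
once the review points above are addressed (kstrtobool() instead of
simple_strtoul(), a blank line after the declaration, and a stubbed helper
for the CONFIG_SCHED_CORE_IRQ_PAUSE=n case so callers need no #ifdefs).
Illustrative only, not the actual follow-up patch:

#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
DEFINE_STATIC_KEY_TRUE(sched_core_irq_pause);

static int __init set_sched_core_irq_pause(char *str)
{
        bool enabled = true;

        if (!str || kstrtobool(str, &enabled))
                return 0;

        if (!enabled)
                static_branch_disable(&sched_core_irq_pause);

        return 1;
}
__setup("sched_core_irq_pause=", set_sched_core_irq_pause);

static inline void sched_core_maybe_irq_enter(void)
{
        if (static_branch_likely(&sched_core_irq_pause))
                sched_core_irq_enter();
}
#else
static inline void sched_core_maybe_irq_enter(void) { }
#endif

The call site (wherever it finally lands, per the rest of this thread) can
then call sched_core_maybe_irq_enter() unconditionally.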

2020-07-20 14:50:32

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [RFC PATCH 10/16] sched: Trivial forced-newidle balancer(Internet mail)



> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>
> From: Peter Zijlstra <[email protected]>
>
> When a sibling is forced-idle to match the core-cookie; search for
> matching tasks to fill the core.
>
> rcu_read_unlock() can incur an infrequent deadlock in
> sched_core_balance(). Fix this by using the RCU-sched flavor instead.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
> Acked-by: Paul E. McKenney <[email protected]>
> ---
> include/linux/sched.h | 1 +
> kernel/sched/core.c | 131 +++++++++++++++++++++++++++++++++++++++++-
> kernel/sched/idle.c | 1 +
> kernel/sched/sched.h | 6 ++
> 4 files changed, 138 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 3c8dcc5ff039..4f9edf013df3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -688,6 +688,7 @@ struct task_struct {
> #ifdef CONFIG_SCHED_CORE
> struct rb_node core_node;
> unsigned long core_cookie;
> + unsigned int core_occupation;
> #endif
>
> #ifdef CONFIG_CGROUP_SCHED
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4d6d6a678013..fb9edb09ead7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -201,6 +201,21 @@ static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
> return match;
> }
>
> +static struct task_struct *sched_core_next(struct task_struct *p, unsigned long cookie)
> +{
> + struct rb_node *node = &p->core_node;
> +
> + node = rb_next(node);
> + if (!node)
> + return NULL;
> +
> + p = container_of(node, struct task_struct, core_node);
> + if (p->core_cookie != cookie)
> + return NULL;
> +
> + return p;
> +}
> +
> /*
> * The static-key + stop-machine variable are needed such that:
> *
> @@ -4233,7 +4248,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> struct task_struct *next, *max = NULL;
> const struct sched_class *class;
> const struct cpumask *smt_mask;
> - int i, j, cpu;
> + int i, j, cpu, occ = 0;
> bool need_sync;
>
> if (!sched_core_enabled(rq))
> @@ -4332,6 +4347,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> goto done;
> }
>
> + if (!is_idle_task(p))
> + occ++;
> +
> rq_i->core_pick = p;
>
> /*
> @@ -4357,6 +4375,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>
> cpu_rq(j)->core_pick = NULL;
> }
> + occ = 1;
> goto again;
> } else {
> /*
> @@ -4393,6 +4412,8 @@ next_class:;
> if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> rq_i->core_forceidle = true;
>
> + rq_i->core_pick->core_occupation = occ;
> +
> if (i == cpu)
> continue;
>
> @@ -4408,6 +4429,114 @@ next_class:;
> return next;
> }
>
> +static bool try_steal_cookie(int this, int that)
> +{
> + struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
> + struct task_struct *p;
> + unsigned long cookie;
> + bool success = false;
> +
> + local_irq_disable();
> + double_rq_lock(dst, src);
> +
> + cookie = dst->core->core_cookie;
> + if (!cookie)
> + goto unlock;
> +
> + if (dst->curr != dst->idle)
> + goto unlock;
> +
Could it be ok to add another fast return here?
if (src->nr_running == 1)
goto unlock;
When the src cpu has only 1 running task, there is no need to pull and no need to do sched_core_find.

Thx.
Regards,
Jiang

> + p = sched_core_find(src, cookie);
> + if (p == src->idle)
> + goto unlock;
> +
> + do {
> + if (p == src->core_pick || p == src->curr)
> + goto next;
> +
> + if (!cpumask_test_cpu(this, &p->cpus_mask))
> + goto next;
> +
> + if (p->core_occupation > dst->idle->core_occupation)
> + goto next;
> +
> + p->on_rq = TASK_ON_RQ_MIGRATING;
> + deactivate_task(src, p, 0);
> + set_task_cpu(p, this);
> + activate_task(dst, p, 0);
> + p->on_rq = TASK_ON_RQ_QUEUED;
> +
> + resched_curr(dst);
> +
> + success = true;
> + break;
> +
> +next:
> + p = sched_core_next(p, cookie);
> + } while (p);
> +
> +unlock:
> + double_rq_unlock(dst, src);
> + local_irq_enable();
> +
> + return success;
> +}
> +
> +static bool steal_cookie_task(int cpu, struct sched_domain *sd)
> +{
> + int i;
> +
> + for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
> + if (i == cpu)
> + continue;
> +
> + if (need_resched())
> + break;
> +
> + if (try_steal_cookie(cpu, i))
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static void sched_core_balance(struct rq *rq)
> +{
> + struct sched_domain *sd;
> + int cpu = cpu_of(rq);
> +
> + rcu_read_lock_sched();
> + raw_spin_unlock_irq(rq_lockp(rq));
> + for_each_domain(cpu, sd) {
> + if (!(sd->flags & SD_LOAD_BALANCE))
> + break;
> +
> + if (need_resched())
> + break;
> +
> + if (steal_cookie_task(cpu, sd))
> + break;
> + }
> + raw_spin_lock_irq(rq_lockp(rq));
> + rcu_read_unlock_sched();
> +}
> +
> +static DEFINE_PER_CPU(struct callback_head, core_balance_head);
> +
> +void queue_core_balance(struct rq *rq)
> +{
> + if (!sched_core_enabled(rq))
> + return;
> +
> + if (!rq->core->core_cookie)
> + return;
> +
> + if (!rq->nr_running) /* not forced idle */
> + return;
> +
> + queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu), sched_core_balance);
> +}
> +
> #else /* !CONFIG_SCHED_CORE */
>
> static struct task_struct *
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index a8d40ffab097..dff6ba220ed7 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -395,6 +395,7 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
> {
> update_idle_core(rq);
> schedstat_inc(rq->sched_goidle);
> + queue_core_balance(rq);
> }
>
> #ifdef CONFIG_SMP
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 293aa1ae0308..464559676fd2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1089,6 +1089,8 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
> bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
> void sched_core_adjust_sibling_vruntime(int cpu, bool coresched_enabled);
>
> +extern void queue_core_balance(struct rq *rq);
> +
> #else /* !CONFIG_SCHED_CORE */
>
> static inline bool sched_core_enabled(struct rq *rq)
> @@ -1101,6 +1103,10 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
> return &rq->__lock;
> }
>
> +static inline void queue_core_balance(struct rq *rq)
> +{
> +}
> +
> #endif /* CONFIG_SCHED_CORE */
>
> #ifdef CONFIG_SCHED_SMT
> --
> 2.17.1
>
>

2020-07-21 07:38:03

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [RFC PATCH 07/16] sched/fair: Fix forced idle sibling starvation corner case(Internet mail)

Hi,

> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>
> From: vpillai <[email protected]>
>
> If there is only one long running local task and the sibling is
> forced idle, it might not get a chance to run until a schedule
> event happens on any cpu in the core.
>
> So we check for this condition during a tick to see if a sibling
> is starved and then give it a chance to schedule.
>
> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
> Signed-off-by: Julien Desfossez <[email protected]>
> ---
> kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
> 1 file changed, 39 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ae17507533a0..49fb93296e35 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10613,6 +10613,40 @@ static void rq_offline_fair(struct rq *rq)
>
> #endif /* CONFIG_SMP */
>
> +#ifdef CONFIG_SCHED_CORE
> +static inline bool
> +__entity_slice_used(struct sched_entity *se)
> +{
> + return (se->sum_exec_runtime - se->prev_sum_exec_runtime) >
> + sched_slice(cfs_rq_of(se), se);
> +}
> +
> +/*
> + * If runqueue has only one task which used up its slice and if the sibling
> + * is forced idle, then trigger schedule to give forced idle task a chance.
> + */
> +static void resched_forceidle_sibling(struct rq *rq, struct sched_entity *se)
> +{
> + int cpu = cpu_of(rq), sibling_cpu;
> +
> + if (rq->cfs.nr_running > 1 || !__entity_slice_used(se))
> + return;
> +
> + for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) {
> + struct rq *sibling_rq;
> + if (sibling_cpu == cpu)
> + continue;
> + if (cpu_is_offline(sibling_cpu))
> + continue;
> +
> + sibling_rq = cpu_rq(sibling_cpu);
> + if (sibling_rq->core_forceidle) {
> + resched_curr(sibling_rq);
> + }
> + }
> +}
> +#endif
> +
> /*
> * scheduler tick hitting a task of our scheduling class.
> *
> @@ -10636,6 +10670,11 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
> update_misfit_status(curr, rq);
> update_overutilized_status(task_rq(curr));
> +
> +#ifdef CONFIG_SCHED_CORE
> + if (sched_core_enabled(rq))
> + resched_forceidle_sibling(rq, &curr->se);
> +#endif
Hi,

resched_forceidle_sibling() depends on the tick, but there could be no tick for up to 1s
(scheduler_tick_max_deferment) after entering nohz_full mode.
And with nohz_full enabled, a cpu will enter nohz_full mode frequently precisely when *there is
only one long-running local task*.
That means the sibling's rescheduling could be delayed much longer than sched_slice(), which
could be unfair and result in big latency.

Should we restrict a cpu with a forced-idle sibling from entering nohz_full mode, by adding a
specific flag and checking it before stopping the tick?

Or could we do the rescheduling of siblings in task_tick_idle by checking the starvation time? :)

Thx
Regards,
Jiang

> }
>
> /*
> --
> 2.17.1
>
>
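
For illustration, one possible shape of the restriction suggested above: a
check in sched_can_stop_tick() that keeps the tick running on a cpu whose
SMT sibling is currently forced idle. This is only a sketch against the
fields used in this series, not a posted patch:

        /* sketch: fragment for sched_can_stop_tick(), under CONFIG_NO_HZ_FULL */
#ifdef CONFIG_SCHED_CORE
        if (sched_core_enabled(rq)) {
                int cpu = cpu_of(rq), sibling;

                /* keep the tick so resched_forceidle_sibling() keeps running */
                for_each_cpu(sibling, cpu_smt_mask(cpu)) {
                        if (sibling != cpu && cpu_rq(sibling)->core_forceidle)
                                return false;
                }
        }
#endif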

2020-07-21 14:03:23

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [RFC PATCH 05/16] sched: Basic tracking of matching tasks(Internet mail)

Hi, Peter

> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>
> From: Peter Zijlstra <[email protected]>
>
> Introduce task_struct::core_cookie as an opaque identifier for core
> scheduling. When enabled; core scheduling will only allow matching
> task to be on the core; where idle matches everything.
>
> When task_struct::core_cookie is set (and core scheduling is enabled)
> these tasks are indexed in a second RB-tree, first on cookie value
> then on scheduling function, such that matching task selection always
> finds the most elegible match.
>
> NOTE: *shudder* at the overhead...
>
> NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
> is per class tracking of cookies and that just duplicates a lot of
> stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
> Signed-off-by: Julien Desfossez <[email protected]>
> ---
> include/linux/sched.h | 8 ++-
> kernel/sched/core.c | 146 ++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 46 -------------
> kernel/sched/sched.h | 55 ++++++++++++++++
> 4 files changed, 208 insertions(+), 47 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 4418f5cb8324..3c8dcc5ff039 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -683,10 +683,16 @@ struct task_struct {
> const struct sched_class *sched_class;
> struct sched_entity se;
> struct sched_rt_entity rt;
> + struct sched_dl_entity dl;
> +
> +#ifdef CONFIG_SCHED_CORE
> + struct rb_node core_node;
> + unsigned long core_cookie;
> +#endif
> +
> #ifdef CONFIG_CGROUP_SCHED
> struct task_group *sched_task_group;
> #endif
> - struct sched_dl_entity dl;
>
> #ifdef CONFIG_UCLAMP_TASK
> /* Clamp values requested for a scheduling entity */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4b81301e3f21..b21bcab20da6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -77,6 +77,141 @@ int sysctl_sched_rt_runtime = 950000;
>
> DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
>
> +/* kernel prio, less is more */
> +static inline int __task_prio(struct task_struct *p)
> +{
> + if (p->sched_class == &stop_sched_class) /* trumps deadline */
> + return -2;
> +
> + if (rt_prio(p->prio)) /* includes deadline */
> + return p->prio; /* [-1, 99] */
> +
> + if (p->sched_class == &idle_sched_class)
> + return MAX_RT_PRIO + NICE_WIDTH; /* 140 */

	return MAX_PRIO;
may be simpler here, since MAX_PRIO is defined as exactly MAX_RT_PRIO + NICE_WIDTH (140)?

> +
> + return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */

MAX_RT_PRIO (100) + MAX_NICE (19) is 119, so shouldn't the comment say 119 rather than 120? :)
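
For reference, the priority constants (as defined in include/linux/sched/prio.h around v5.7) are
roughly:

#define MAX_NICE		19
#define MIN_NICE		-20
#define NICE_WIDTH		(MAX_NICE - MIN_NICE + 1)	/* 40 */
#define MAX_USER_RT_PRIO	100
#define MAX_RT_PRIO		MAX_USER_RT_PRIO
#define MAX_PRIO		(MAX_RT_PRIO + NICE_WIDTH)	/* 140 */

so MAX_RT_PRIO + NICE_WIDTH equals MAX_PRIO, while MAX_RT_PRIO + MAX_NICE is 119.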

Thx.
Regards,
Jiang

> +}
> +
> +/*
> + * l(a,b)
> + * le(a,b) := !l(b,a)
> + * g(a,b) := l(b,a)
> + * ge(a,b) := !l(a,b)
> + */
> +
> +/* real prio, less is less */
> +static inline bool prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +
> + int pa = __task_prio(a), pb = __task_prio(b);
> +
> + if (-pa < -pb)
> + return true;
> +
> + if (-pb < -pa)
> + return false;
> +
> + if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
> + return !dl_time_before(a->dl.deadline, b->dl.deadline);
> +
> + if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
> + u64 vruntime = b->se.vruntime;
> +
> + /*
> + * Normalize the vruntime if tasks are in different cpus.
> + */
> + if (task_cpu(a) != task_cpu(b)) {
> + vruntime -= task_cfs_rq(b)->min_vruntime;
> + vruntime += task_cfs_rq(a)->min_vruntime;
> + }
> +
> + return !((s64)(a->se.vruntime - vruntime) <= 0);
> + }
> +
> + return false;
> +}
> +
> +static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
> +{
> + if (a->core_cookie < b->core_cookie)
> + return true;
> +
> + if (a->core_cookie > b->core_cookie)
> + return false;
> +
> + /* flip prio, so high prio is leftmost */
> + if (prio_less(b, a))
> + return true;
> +
> + return false;
> +}
> +
> +static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> +{
> + struct rb_node *parent, **node;
> + struct task_struct *node_task;
> +
> + rq->core->core_task_seq++;
> +
> + if (!p->core_cookie)
> + return;
> +
> + node = &rq->core_tree.rb_node;
> + parent = *node;
> +
> + while (*node) {
> + node_task = container_of(*node, struct task_struct, core_node);
> + parent = *node;
> +
> + if (__sched_core_less(p, node_task))
> + node = &parent->rb_left;
> + else
> + node = &parent->rb_right;
> + }
> +
> + rb_link_node(&p->core_node, parent, node);
> + rb_insert_color(&p->core_node, &rq->core_tree);
> +}
> +
> +static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
> +{
> + rq->core->core_task_seq++;
> +
> + if (!p->core_cookie)
> + return;
> +
> + rb_erase(&p->core_node, &rq->core_tree);
> +}
> +
> +/*
> + * Find left-most (aka, highest priority) task matching @cookie.
> + */
> +static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
> +{
> + struct rb_node *node = rq->core_tree.rb_node;
> + struct task_struct *node_task, *match;
> +
> + /*
> + * The idle task always matches any cookie!
> + */
> + match = idle_sched_class.pick_task(rq);
> +
> + while (node) {
> + node_task = container_of(node, struct task_struct, core_node);
> +
> + if (cookie < node_task->core_cookie) {
> + node = node->rb_left;
> + } else if (cookie > node_task->core_cookie) {
> + node = node->rb_right;
> + } else {
> + match = node_task;
> + node = node->rb_left;
> + }
> + }
> +
> + return match;
> +}
> +
> /*
> * The static-key + stop-machine variable are needed such that:
> *
> @@ -135,6 +270,11 @@ void sched_core_put(void)
> mutex_unlock(&sched_core_mutex);
> }
>
> +#else /* !CONFIG_SCHED_CORE */
> +
> +static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
> +static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
> +
> #endif /* CONFIG_SCHED_CORE */
>
> /*
> @@ -1347,6 +1487,9 @@ static inline void init_uclamp(void) { }
>
> static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> {
> + if (sched_core_enabled(rq))
> + sched_core_enqueue(rq, p);
> +
> if (!(flags & ENQUEUE_NOCLOCK))
> update_rq_clock(rq);
>
> @@ -1361,6 +1504,9 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>
> static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> {
> + if (sched_core_enabled(rq))
> + sched_core_dequeue(rq, p);
> +
> if (!(flags & DEQUEUE_NOCLOCK))
> update_rq_clock(rq);
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e44a43b87975..ae17507533a0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -260,33 +260,11 @@ const struct sched_class fair_sched_class;
> */
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> -static inline struct task_struct *task_of(struct sched_entity *se)
> -{
> - SCHED_WARN_ON(!entity_is_task(se));
> - return container_of(se, struct task_struct, se);
> -}
>
> /* Walk up scheduling entities hierarchy */
> #define for_each_sched_entity(se) \
> for (; se; se = se->parent)
>
> -static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
> -{
> - return p->se.cfs_rq;
> -}
> -
> -/* runqueue on which this entity is (to be) queued */
> -static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
> -{
> - return se->cfs_rq;
> -}
> -
> -/* runqueue "owned" by this group */
> -static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
> -{
> - return grp->my_q;
> -}
> -
> static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
> {
> if (!path)
> @@ -447,33 +425,9 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
>
> #else /* !CONFIG_FAIR_GROUP_SCHED */
>
> -static inline struct task_struct *task_of(struct sched_entity *se)
> -{
> - return container_of(se, struct task_struct, se);
> -}
> -
> #define for_each_sched_entity(se) \
> for (; se; se = NULL)
>
> -static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
> -{
> - return &task_rq(p)->cfs;
> -}
> -
> -static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
> -{
> - struct task_struct *p = task_of(se);
> - struct rq *rq = task_rq(p);
> -
> - return &rq->cfs;
> -}
> -
> -/* runqueue "owned" by this group */
> -static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
> -{
> - return NULL;
> -}
> -
> static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
> {
> if (path)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 66e586adee18..c85c5a4bc21f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1033,6 +1033,10 @@ struct rq {
> /* per rq */
> struct rq *core;
> unsigned int core_enabled;
> + struct rb_root core_tree;
> +
> + /* shared state */
> + unsigned int core_task_seq;
> #endif
> };
>
> @@ -1112,6 +1116,57 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> #define cpu_curr(cpu) (cpu_rq(cpu)->curr)
> #define raw_rq() raw_cpu_ptr(&runqueues)
>
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +static inline struct task_struct *task_of(struct sched_entity *se)
> +{
> + SCHED_WARN_ON(!entity_is_task(se));
> + return container_of(se, struct task_struct, se);
> +}
> +
> +static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
> +{
> + return p->se.cfs_rq;
> +}
> +
> +/* runqueue on which this entity is (to be) queued */
> +static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
> +{
> + return se->cfs_rq;
> +}
> +
> +/* runqueue "owned" by this group */
> +static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
> +{
> + return grp->my_q;
> +}
> +
> +#else
> +
> +static inline struct task_struct *task_of(struct sched_entity *se)
> +{
> + return container_of(se, struct task_struct, se);
> +}
> +
> +static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
> +{
> + return &task_rq(p)->cfs;
> +}
> +
> +static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
> +{
> + struct task_struct *p = task_of(se);
> + struct rq *rq = task_rq(p);
> +
> + return &rq->cfs;
> +}
> +
> +/* runqueue "owned" by this group */
> +static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
> +{
> + return NULL;
> +}
> +#endif
> +
> extern void update_rq_clock(struct rq *rq);
>
> static inline u64 __rq_clock_broken(struct rq *rq)
> --
> 2.17.1
>
>

2020-07-22 07:20:53

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [RFC PATCH 07/16] sched/fair: Fix forced idle sibling starvation corner case(Internet mail)



> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>
> From: vpillai <[email protected]>
>
> If there is only one long running local task and the sibling is
> forced idle, it might not get a chance to run until a schedule
> event happens on any cpu in the core.
>
> So we check for this condition during a tick to see if a sibling
> is starved and then give it a chance to schedule.
Hi,

There may be other, similar starvation cases that this patch cannot cover.
For example, if there is one long-running RT task and the sibling is forced idle, then all tasks with different cookies on all the siblings could starve forever.
The current load balancer does not seem able to pull the starved tasks away.
Could load balancing be made more aware of core scheduling to make things better? :)

Thx.
Regards,
Jiang

>
> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
> Signed-off-by: Julien Desfossez <[email protected]>
> ---
> kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
> 1 file changed, 39 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ae17507533a0..49fb93296e35 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10613,6 +10613,40 @@ static void rq_offline_fair(struct rq *rq)
>
> #endif /* CONFIG_SMP */
>
> +#ifdef CONFIG_SCHED_CORE
> +static inline bool
> +__entity_slice_used(struct sched_entity *se)
> +{
> + return (se->sum_exec_runtime - se->prev_sum_exec_runtime) >
> + sched_slice(cfs_rq_of(se), se);
> +}
> +
> +/*
> + * If runqueue has only one task which used up its slice and if the sibling
> + * is forced idle, then trigger schedule to give forced idle task a chance.
> + */
> +static void resched_forceidle_sibling(struct rq *rq, struct sched_entity *se)
> +{
> + int cpu = cpu_of(rq), sibling_cpu;
> +
> + if (rq->cfs.nr_running > 1 || !__entity_slice_used(se))
> + return;
> +
> + for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) {
> + struct rq *sibling_rq;
> + if (sibling_cpu == cpu)
> + continue;
> + if (cpu_is_offline(sibling_cpu))
> + continue;
> +
> + sibling_rq = cpu_rq(sibling_cpu);
> + if (sibling_rq->core_forceidle) {
> + resched_curr(sibling_rq);
> + }
> + }
> +}
> +#endif
> +
> /*
> * scheduler tick hitting a task of our scheduling class.
> *
> @@ -10636,6 +10670,11 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
> update_misfit_status(curr, rq);
> update_overutilized_status(task_rq(curr));
> +
> +#ifdef CONFIG_SCHED_CORE
> + if (sched_core_enabled(rq))
> + resched_forceidle_sibling(rq, &curr->se);
> +#endif
> }
>
> /*
> --
> 2.17.1
>
>

2020-07-22 08:55:31

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)

Hi, Aubrey,

> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>
> From: Aubrey Li <[email protected]>
>
> - Don't migrate if there is a cookie mismatch
> Load balance tries to move task from busiest CPU to the
> destination CPU. When core scheduling is enabled, if the
> task's cookie does not match with the destination CPU's
> core cookie, this task will be skipped by this CPU. This
> mitigates the forced idle time on the destination CPU.
>
> - Select cookie matched idle CPU
> In the fast path of task wakeup, select the first cookie matched
> idle CPU instead of the first idle CPU.
>
> - Find cookie matched idlest CPU
> In the slow path of task wakeup, find the idlest CPU whose core
> cookie matches with task's cookie
>
> - Don't migrate task if cookie not match
> For the NUMA load balance, don't migrate task to the CPU whose
> core cookie does not match with task's cookie
>
> Signed-off-by: Aubrey Li <[email protected]>
> Signed-off-by: Tim Chen <[email protected]>
> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
> ---
> kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++----
> kernel/sched/sched.h | 29 ++++++++++++++++++++
> 2 files changed, 88 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d16939766361..33dc4bf01817 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
> continue;
>
> +#ifdef CONFIG_SCHED_CORE
> + /*
> + * Skip this cpu if source task's cookie does not match
> + * with CPU's core cookie.
> + */
> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
> + continue;
> +#endif
> +
> env->dst_cpu = cpu;
> if (task_numa_compare(env, taskimp, groupimp, maymove))
> break;
> @@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>
> /* Traverse only the allowed CPUs */
> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
> + struct rq *rq = cpu_rq(i);
> +
> +#ifdef CONFIG_SCHED_CORE
> + if (!sched_core_cookie_match(rq, p))
> + continue;
> +#endif
> +
> if (sched_idle_cpu(i))
> return i;
>
> if (available_idle_cpu(i)) {
> - struct rq *rq = cpu_rq(i);
> struct cpuidle_state *idle = idle_get_state(rq);
> if (idle && idle->exit_latency < min_exit_latency) {
> /*
> @@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> for_each_cpu_wrap(cpu, cpus, target) {
> if (!--nr)
> return -1;
> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> - break;
> +
> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
> +#ifdef CONFIG_SCHED_CORE
> + /*
> + * If Core Scheduling is enabled, select this cpu
> + * only if the process cookie matches core cookie.
> + */
> + if (sched_core_enabled(cpu_rq(cpu)) &&
> + p->core_cookie == cpu_rq(cpu)->core->core_cookie)
Why not also add similar logic in select_idle_smt() to reduce forced idle? :)
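
Something like this, based on the v5.7 shape of select_idle_smt() (just a sketch of the idea,
not a tested change):

static int select_idle_smt(struct task_struct *p, int target)
{
	int cpu;

	if (!static_branch_likely(&sched_smt_present))
		return -1;

	for_each_cpu(cpu, cpu_smt_mask(target)) {
		if (!cpumask_test_cpu(cpu, p->cpus_ptr))
			continue;
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
#ifdef CONFIG_SCHED_CORE
			/* Skip an idle sibling whose core cookie does not match. */
			if (sched_core_enabled(cpu_rq(cpu)) &&
			    p->core_cookie != cpu_rq(cpu)->core->core_cookie)
				continue;
#endif
			return cpu;
		}
	}

	return -1;
}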


> +#endif
> + break;
> + }
> }
>
> time = cpu_clock(this) - time;
> @@ -7609,8 +7634,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> * We do not migrate tasks that are:
> * 1) throttled_lb_pair, or
> * 2) cannot be migrated to this CPU due to cpus_ptr, or
> - * 3) running (obviously), or
> - * 4) are cache-hot on their current CPU.
> + * 3) task's cookie does not match with this CPU's core cookie
> + * 4) running (obviously), or
> + * 5) are cache-hot on their current CPU.
> */
> if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
> return 0;
> @@ -7645,6 +7671,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> return 0;
> }
>
> +#ifdef CONFIG_SCHED_CORE
> + /*
> + * Don't migrate task if the task's cookie does not match
> + * with the destination CPU's core cookie.
> + */
> + if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
> + return 0;
> +#endif
> +
> /* Record that we found atleast one task that could run on dst_cpu */
> env->flags &= ~LBF_ALL_PINNED;
>
> @@ -8857,6 +8892,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> p->cpus_ptr))
> continue;
>
> +#ifdef CONFIG_SCHED_CORE
> + if (sched_core_enabled(cpu_rq(this_cpu))) {
> + int i = 0;
> + bool cookie_match = false;
> +
> + for_each_cpu(i, sched_group_span(group)) {
Should we also consider p->cpus_ptr here? Like:
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
...
}
Thx.
Regards,
Jiang

> + struct rq *rq = cpu_rq(i);
> +
> + if (sched_core_cookie_match(rq, p)) {
> + cookie_match = true;
> + break;
> + }
> + }
> + /* Skip over this group if no cookie matched */
> + if (!cookie_match)
> + continue;
> + }
> +#endif
> +
> local_group = cpumask_test_cpu(this_cpu,
> sched_group_span(group));
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 464559676fd2..875796d43fca 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1089,6 +1089,35 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
> bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
> void sched_core_adjust_sibling_vruntime(int cpu, bool coresched_enabled);
>
> +/*
> + * Helper to check if the CPU's core cookie matches with the task's cookie
> + * when core scheduling is enabled.
> + * A special case is that the task's cookie always matches with CPU's core
> + * cookie if the CPU is in an idle core.
> + */
> +static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
> +{
> + bool idle_core = true;
> + int cpu;
> +
> + /* Ignore cookie match if core scheduler is not enabled on the CPU. */
> + if (!sched_core_enabled(rq))
> + return true;
> +
> + for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
> + if (!available_idle_cpu(cpu)) {
> + idle_core = false;
> + break;
> + }
> + }
> +
> + /*
> + * A CPU in an idle core is always the best choice for tasks with
> + * cookies.
> + */
> + return idle_core || rq->core->core_cookie == p->core_cookie;
> +}
> +
> extern void queue_core_balance(struct rq *rq);
>
> #else /* !CONFIG_SCHED_CORE */
> --
> 2.17.1
>
>

2020-07-22 12:15:45

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)

On 2020/7/22 16:54, benbjiang(蒋彪) wrote:
> Hi, Aubrey,
>
>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>>
>> From: Aubrey Li <[email protected]>
>>
>> - Don't migrate if there is a cookie mismatch
>> Load balance tries to move task from busiest CPU to the
>> destination CPU. When core scheduling is enabled, if the
>> task's cookie does not match with the destination CPU's
>> core cookie, this task will be skipped by this CPU. This
>> mitigates the forced idle time on the destination CPU.
>>
>> - Select cookie matched idle CPU
>> In the fast path of task wakeup, select the first cookie matched
>> idle CPU instead of the first idle CPU.
>>
>> - Find cookie matched idlest CPU
>> In the slow path of task wakeup, find the idlest CPU whose core
>> cookie matches with task's cookie
>>
>> - Don't migrate task if cookie not match
>> For the NUMA load balance, don't migrate task to the CPU whose
>> core cookie does not match with task's cookie
>>
>> Signed-off-by: Aubrey Li <[email protected]>
>> Signed-off-by: Tim Chen <[email protected]>
>> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
>> ---
>> kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++----
>> kernel/sched/sched.h | 29 ++++++++++++++++++++
>> 2 files changed, 88 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d16939766361..33dc4bf01817 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>> continue;
>>
>> +#ifdef CONFIG_SCHED_CORE
>> + /*
>> + * Skip this cpu if source task's cookie does not match
>> + * with CPU's core cookie.
>> + */
>> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>> + continue;
>> +#endif
>> +
>> env->dst_cpu = cpu;
>> if (task_numa_compare(env, taskimp, groupimp, maymove))
>> break;
>> @@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>
>> /* Traverse only the allowed CPUs */
>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>> + struct rq *rq = cpu_rq(i);
>> +
>> +#ifdef CONFIG_SCHED_CORE
>> + if (!sched_core_cookie_match(rq, p))
>> + continue;
>> +#endif
>> +
>> if (sched_idle_cpu(i))
>> return i;
>>
>> if (available_idle_cpu(i)) {
>> - struct rq *rq = cpu_rq(i);
>> struct cpuidle_state *idle = idle_get_state(rq);
>> if (idle && idle->exit_latency < min_exit_latency) {
>> /*
>> @@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>> for_each_cpu_wrap(cpu, cpus, target) {
>> if (!--nr)
>> return -1;
>> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>> - break;
>> +
>> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>> +#ifdef CONFIG_SCHED_CORE
>> + /*
>> + * If Core Scheduling is enabled, select this cpu
>> + * only if the process cookie matches core cookie.
>> + */
>> + if (sched_core_enabled(cpu_rq(cpu)) &&
>> + p->core_cookie == cpu_rq(cpu)->core->core_cookie)
> Why not also add similar logic in select_idle_smt to reduce forced-idle? :)
We hit select_idle_smt() only after we have scanned the entire LLC domain for idle cores
and idle cpus and failed, so IMHO an idle smt is probably a good choice under
this scenario.

>
>> +#endif
>> + break;
>> + }
>> }
>>
>> time = cpu_clock(this) - time;
>> @@ -7609,8 +7634,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>> * We do not migrate tasks that are:
>> * 1) throttled_lb_pair, or
>> * 2) cannot be migrated to this CPU due to cpus_ptr, or
>> - * 3) running (obviously), or
>> - * 4) are cache-hot on their current CPU.
>> + * 3) task's cookie does not match with this CPU's core cookie
>> + * 4) running (obviously), or
>> + * 5) are cache-hot on their current CPU.
>> */
>> if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>> return 0;
>> @@ -7645,6 +7671,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>> return 0;
>> }
>>
>> +#ifdef CONFIG_SCHED_CORE
>> + /*
>> + * Don't migrate task if the task's cookie does not match
>> + * with the destination CPU's core cookie.
>> + */
>> + if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
>> + return 0;
>> +#endif
>> +
>> /* Record that we found atleast one task that could run on dst_cpu */
>> env->flags &= ~LBF_ALL_PINNED;
>>
>> @@ -8857,6 +8892,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>> p->cpus_ptr))
>> continue;
>>
>> +#ifdef CONFIG_SCHED_CORE
>> + if (sched_core_enabled(cpu_rq(this_cpu))) {
>> + int i = 0;
>> + bool cookie_match = false;
>> +
>> + for_each_cpu(i, sched_group_span(group)) {
> Should we consider the p->cpus_ptr here? like,
> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr ) {

This is already handled just above the #ifdef CONFIG_SCHED_CORE block; that context is simply
not visible in the patch hunk.

Thanks,
-Aubrey

> ...
> }
> Thx.
> Regards,
> Jiang
>
>> + struct rq *rq = cpu_rq(i);
>> +
>> + if (sched_core_cookie_match(rq, p)) {
>> + cookie_match = true;
>> + break;
>> + }
>> + }
>> + /* Skip over this group if no cookie matched */
>> + if (!cookie_match)
>> + continue;
>> + }
>> +#endif
>> +
>> local_group = cpumask_test_cpu(this_cpu,
>> sched_group_span(group));
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 464559676fd2..875796d43fca 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1089,6 +1089,35 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
>> bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
>> void sched_core_adjust_sibling_vruntime(int cpu, bool coresched_enabled);
>>
>> +/*
>> + * Helper to check if the CPU's core cookie matches with the task's cookie
>> + * when core scheduling is enabled.
>> + * A special case is that the task's cookie always matches with CPU's core
>> + * cookie if the CPU is in an idle core.
>> + */
>> +static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
>> +{
>> + bool idle_core = true;
>> + int cpu;
>> +
>> + /* Ignore cookie match if core scheduler is not enabled on the CPU. */
>> + if (!sched_core_enabled(rq))
>> + return true;
>> +
>> + for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
>> + if (!available_idle_cpu(cpu)) {
>> + idle_core = false;
>> + break;
>> + }
>> + }
>> +
>> + /*
>> + * A CPU in an idle core is always the best choice for tasks with
>> + * cookies.
>> + */
>> + return idle_core || rq->core->core_cookie == p->core_cookie;
>> +}
>> +
>> extern void queue_core_balance(struct rq *rq);
>>
>> #else /* !CONFIG_SCHED_CORE */
>> --
>> 2.17.1
>>
>>
>

2020-07-22 14:36:17

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)

Hi,

> On Jul 22, 2020, at 8:13 PM, Li, Aubrey <[email protected]> wrote:
>
> On 2020/7/22 16:54, benbjiang(蒋彪) wrote:
>> Hi, Aubrey,
>>
>>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>>>
>>> From: Aubrey Li <[email protected]>
>>>
>>> - Don't migrate if there is a cookie mismatch
>>> Load balance tries to move task from busiest CPU to the
>>> destination CPU. When core scheduling is enabled, if the
>>> task's cookie does not match with the destination CPU's
>>> core cookie, this task will be skipped by this CPU. This
>>> mitigates the forced idle time on the destination CPU.
>>>
>>> - Select cookie matched idle CPU
>>> In the fast path of task wakeup, select the first cookie matched
>>> idle CPU instead of the first idle CPU.
>>>
>>> - Find cookie matched idlest CPU
>>> In the slow path of task wakeup, find the idlest CPU whose core
>>> cookie matches with task's cookie
>>>
>>> - Don't migrate task if cookie not match
>>> For the NUMA load balance, don't migrate task to the CPU whose
>>> core cookie does not match with task's cookie
>>>
>>> Signed-off-by: Aubrey Li <[email protected]>
>>> Signed-off-by: Tim Chen <[email protected]>
>>> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
>>> ---
>>> kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++----
>>> kernel/sched/sched.h | 29 ++++++++++++++++++++
>>> 2 files changed, 88 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index d16939766361..33dc4bf01817 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>> continue;
>>>
>>> +#ifdef CONFIG_SCHED_CORE
>>> + /*
>>> + * Skip this cpu if source task's cookie does not match
>>> + * with CPU's core cookie.
>>> + */
>>> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>> + continue;
>>> +#endif
>>> +
>>> env->dst_cpu = cpu;
>>> if (task_numa_compare(env, taskimp, groupimp, maymove))
>>> break;
>>> @@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>>
>>> /* Traverse only the allowed CPUs */
>>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>>> + struct rq *rq = cpu_rq(i);
>>> +
>>> +#ifdef CONFIG_SCHED_CORE
>>> + if (!sched_core_cookie_match(rq, p))
>>> + continue;
>>> +#endif
>>> +
>>> if (sched_idle_cpu(i))
>>> return i;
>>>
>>> if (available_idle_cpu(i)) {
>>> - struct rq *rq = cpu_rq(i);
>>> struct cpuidle_state *idle = idle_get_state(rq);
>>> if (idle && idle->exit_latency < min_exit_latency) {
>>> /*
>>> @@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>> for_each_cpu_wrap(cpu, cpus, target) {
>>> if (!--nr)
>>> return -1;
>>> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>>> - break;
>>> +
>>> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>>> +#ifdef CONFIG_SCHED_CORE
>>> + /*
>>> + * If Core Scheduling is enabled, select this cpu
>>> + * only if the process cookie matches core cookie.
>>> + */
>>> + if (sched_core_enabled(cpu_rq(cpu)) &&
>>> + p->core_cookie == cpu_rq(cpu)->core->core_cookie)
>> Why not also add similar logic in select_idle_smt to reduce forced-idle? :)
> We hit select_idle_smt after we scaned the entire LLC domain for idle cores
> and idle cpus and failed,so IMHO, an idle smt is probably a good choice under
> this scenario.

AFAIC, selecting an idle sibling with an unmatched cookie will cause unnecessary forced idle, unfairness and latency, compared to choosing the *target* cpu.
Besides, choosing the *target* cpu may be more cache friendly. So IMHO, the *target* cpu may be a better choice than the idle sibling when the cookie does not match.

>
>>
>>> +#endif
>>> + break;
>>> + }
>>> }
>>>
>>> time = cpu_clock(this) - time;
>>> @@ -7609,8 +7634,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>> * We do not migrate tasks that are:
>>> * 1) throttled_lb_pair, or
>>> * 2) cannot be migrated to this CPU due to cpus_ptr, or
>>> - * 3) running (obviously), or
>>> - * 4) are cache-hot on their current CPU.
>>> + * 3) task's cookie does not match with this CPU's core cookie
>>> + * 4) running (obviously), or
>>> + * 5) are cache-hot on their current CPU.
>>> */
>>> if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>>> return 0;
>>> @@ -7645,6 +7671,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>> return 0;
>>> }
>>>
>>> +#ifdef CONFIG_SCHED_CORE
>>> + /*
>>> + * Don't migrate task if the task's cookie does not match
>>> + * with the destination CPU's core cookie.
>>> + */
>>> + if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
>>> + return 0;
>>> +#endif
>>> +
>>> /* Record that we found atleast one task that could run on dst_cpu */
>>> env->flags &= ~LBF_ALL_PINNED;
>>>
>>> @@ -8857,6 +8892,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>> p->cpus_ptr))
>>> continue;
>>>
>>> +#ifdef CONFIG_SCHED_CORE
>>> + if (sched_core_enabled(cpu_rq(this_cpu))) {
>>> + int i = 0;
>>> + bool cookie_match = false;
>>> +
>>> + for_each_cpu(i, sched_group_span(group)) {
>> Should we consider the p->cpus_ptr here? like,
>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr ) {
>
> This is already considered just above #ifdef CONFIG_SCHED_CORE, but not included
> in the patch file.
>
> Thanks,
> -Aubrey

The consideration above is:
8893 /* Skip over this group if it has no CPUs allowed */
8894 if (!cpumask_intersects(sched_group_span(group),
8895 p->cpus_ptr))
8896 continue;
8897
It only covers the case where *p is allowed on none of the group's CPUs*, which is not enough.
If p->cpus_ptr covers only part of sched_group_span(group), the following sched_core_cookie_match() loop may match the cookie on a cpu that p is not actually allowed to run on. In that case the match result is misleading and can lead to a wrong group being chosen.
On the other hand, considering p->cpus_ptr here would also reduce the number of loop iterations whenever the intersection of p->cpus_ptr and sched_group_span(group) is smaller than sched_group_span(group) itself.
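
Put together, the loop body in the find_idlest_group() hunk would roughly become (sketch only):

#ifdef CONFIG_SCHED_CORE
		if (sched_core_enabled(cpu_rq(this_cpu))) {
			int i;
			bool cookie_match = false;

			/* Only look at CPUs the task is allowed to run on. */
			for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
				if (sched_core_cookie_match(cpu_rq(i), p)) {
					cookie_match = true;
					break;
				}
			}
			/* Skip over this group if no cookie matched */
			if (!cookie_match)
				continue;
		}
#endif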

Thx.
Regards,
Jiang

>
>> ...
>> }
>> Thx.
>> Regards,
>> Jiang
>>
>>> + struct rq *rq = cpu_rq(i);
>>> +
>>> + if (sched_core_cookie_match(rq, p)) {
>>> + cookie_match = true;
>>> + break;
>>> + }
>>> + }
>>> + /* Skip over this group if no cookie matched */
>>> + if (!cookie_match)
>>> + continue;
>>> + }
>>> +#endif
>>> +
>>> local_group = cpumask_test_cpu(this_cpu,
>>> sched_group_span(group));
>>>
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 464559676fd2..875796d43fca 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -1089,6 +1089,35 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
>>> bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
>>> void sched_core_adjust_sibling_vruntime(int cpu, bool coresched_enabled);
>>>
>>> +/*
>>> + * Helper to check if the CPU's core cookie matches with the task's cookie
>>> + * when core scheduling is enabled.
>>> + * A special case is that the task's cookie always matches with CPU's core
>>> + * cookie if the CPU is in an idle core.
>>> + */
>>> +static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
>>> +{
>>> + bool idle_core = true;
>>> + int cpu;
>>> +
>>> + /* Ignore cookie match if core scheduler is not enabled on the CPU. */
>>> + if (!sched_core_enabled(rq))
>>> + return true;
>>> +
>>> + for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
>>> + if (!available_idle_cpu(cpu)) {
>>> + idle_core = false;
>>> + break;
>>> + }
>>> + }
>>> +
>>> + /*
>>> + * A CPU in an idle core is always the best choice for tasks with
>>> + * cookies.
>>> + */
>>> + return idle_core || rq->core->core_cookie == p->core_cookie;
>>> +}
>>> +
>>> extern void queue_core_balance(struct rq *rq);
>>>
>>> #else /* !CONFIG_SCHED_CORE */
>>> --
>>> 2.17.1

2020-07-23 01:59:08

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)

On 2020/7/22 22:32, benbjiang(蒋彪) wrote:
> Hi,
>
>> On Jul 22, 2020, at 8:13 PM, Li, Aubrey <[email protected]> wrote:
>>
>> On 2020/7/22 16:54, benbjiang(蒋彪) wrote:
>>> Hi, Aubrey,
>>>
>>>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>>>>
>>>> From: Aubrey Li <[email protected]>
>>>>
>>>> - Don't migrate if there is a cookie mismatch
>>>> Load balance tries to move task from busiest CPU to the
>>>> destination CPU. When core scheduling is enabled, if the
>>>> task's cookie does not match with the destination CPU's
>>>> core cookie, this task will be skipped by this CPU. This
>>>> mitigates the forced idle time on the destination CPU.
>>>>
>>>> - Select cookie matched idle CPU
>>>> In the fast path of task wakeup, select the first cookie matched
>>>> idle CPU instead of the first idle CPU.
>>>>
>>>> - Find cookie matched idlest CPU
>>>> In the slow path of task wakeup, find the idlest CPU whose core
>>>> cookie matches with task's cookie
>>>>
>>>> - Don't migrate task if cookie not match
>>>> For the NUMA load balance, don't migrate task to the CPU whose
>>>> core cookie does not match with task's cookie
>>>>
>>>> Signed-off-by: Aubrey Li <[email protected]>
>>>> Signed-off-by: Tim Chen <[email protected]>
>>>> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
>>>> ---
>>>> kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++----
>>>> kernel/sched/sched.h | 29 ++++++++++++++++++++
>>>> 2 files changed, 88 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index d16939766361..33dc4bf01817 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>>> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>>> continue;
>>>>
>>>> +#ifdef CONFIG_SCHED_CORE
>>>> + /*
>>>> + * Skip this cpu if source task's cookie does not match
>>>> + * with CPU's core cookie.
>>>> + */
>>>> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>> + continue;
>>>> +#endif
>>>> +
>>>> env->dst_cpu = cpu;
>>>> if (task_numa_compare(env, taskimp, groupimp, maymove))
>>>> break;
>>>> @@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>>>
>>>> /* Traverse only the allowed CPUs */
>>>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>>>> + struct rq *rq = cpu_rq(i);
>>>> +
>>>> +#ifdef CONFIG_SCHED_CORE
>>>> + if (!sched_core_cookie_match(rq, p))
>>>> + continue;
>>>> +#endif
>>>> +
>>>> if (sched_idle_cpu(i))
>>>> return i;
>>>>
>>>> if (available_idle_cpu(i)) {
>>>> - struct rq *rq = cpu_rq(i);
>>>> struct cpuidle_state *idle = idle_get_state(rq);
>>>> if (idle && idle->exit_latency < min_exit_latency) {
>>>> /*
>>>> @@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>> for_each_cpu_wrap(cpu, cpus, target) {
>>>> if (!--nr)
>>>> return -1;
>>>> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>>>> - break;
>>>> +
>>>> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>>>> +#ifdef CONFIG_SCHED_CORE
>>>> + /*
>>>> + * If Core Scheduling is enabled, select this cpu
>>>> + * only if the process cookie matches core cookie.
>>>> + */
>>>> + if (sched_core_enabled(cpu_rq(cpu)) &&
>>>> + p->core_cookie == cpu_rq(cpu)->core->core_cookie)
>>> Why not also add similar logic in select_idle_smt to reduce forced-idle? :)
>> We hit select_idle_smt after we scaned the entire LLC domain for idle cores
>> and idle cpus and failed,so IMHO, an idle smt is probably a good choice under
>> this scenario.
>
> AFAIC, selecting idle sibling with unmatched cookie will cause unnecessary fored-idle, unfairness and latency, compared to choosing *target* cpu.
Choosing the target cpu could increase the number of runnable tasks on the target runqueue; that
could trigger the busiest->nr_running > 1 logic and make the idle sibling try to pull, but the
pull cannot succeed (because the cookie does not match). Putting the task on the idle sibling is relatively stable IMHO.

> Besides, choosing *target* cpu may be more cache friendly. So IMHO, *target* cpu may be a better choice if cookie not match, instead of idle sibling.
I'm not sure it is more cache friendly, as the target is busy and the incoming task
has an unmatched cookie.

>
>>
>>>
>>>> +#endif
>>>> + break;
>>>> + }
>>>> }
>>>>
>>>> time = cpu_clock(this) - time;
>>>> @@ -7609,8 +7634,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>>> * We do not migrate tasks that are:
>>>> * 1) throttled_lb_pair, or
>>>> * 2) cannot be migrated to this CPU due to cpus_ptr, or
>>>> - * 3) running (obviously), or
>>>> - * 4) are cache-hot on their current CPU.
>>>> + * 3) task's cookie does not match with this CPU's core cookie
>>>> + * 4) running (obviously), or
>>>> + * 5) are cache-hot on their current CPU.
>>>> */
>>>> if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>>>> return 0;
>>>> @@ -7645,6 +7671,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>>> return 0;
>>>> }
>>>>
>>>> +#ifdef CONFIG_SCHED_CORE
>>>> + /*
>>>> + * Don't migrate task if the task's cookie does not match
>>>> + * with the destination CPU's core cookie.
>>>> + */
>>>> + if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
>>>> + return 0;
>>>> +#endif
>>>> +
>>>> /* Record that we found atleast one task that could run on dst_cpu */
>>>> env->flags &= ~LBF_ALL_PINNED;
>>>>
>>>> @@ -8857,6 +8892,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>>> p->cpus_ptr))
>>>> continue;
>>>>
>>>> +#ifdef CONFIG_SCHED_CORE
>>>> + if (sched_core_enabled(cpu_rq(this_cpu))) {
>>>> + int i = 0;
>>>> + bool cookie_match = false;
>>>> +
>>>> + for_each_cpu(i, sched_group_span(group)) {
>>> Should we consider the p->cpus_ptr here? like,
>>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr ) {
>>
>> This is already considered just above #ifdef CONFIG_SCHED_CORE, but not included
>> in the patch file.
>>
>> Thanks,
>> -Aubrey
>
> The above consideration is,
> 8893 /* Skip over this group if it has no CPUs allowed */
> 8894 if (!cpumask_intersects(sched_group_span(group),
> 8895 p->cpus_ptr))
> 8896 continue;
> 8897
> It only considers the case of *p is not allowed for the whole group*, which is not enough.
> If( cpumask_subset(p->cpus_ptr, sched_group_span(group)), the following sched_core_cookie_match() may choose a *wrong(not allowed)* cpu to match cookie. In that case, the matching result could be confusing and lead to wrong result.
> On the other hand, considering p->cpus_ptr here could reduce the loop times and cost, if cpumask_and(p->cpus_ptr, sched_group_span(group)) is the subset of sched_group_span(group).

Though find_idlest_group_cpu() will check p->cpus_ptr again, I believe this is a good catch and
should be fixed in the next iteration.

Thanks,
-Aubrey

2020-07-23 02:45:49

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)

Hi,

> On Jul 23, 2020, at 9:57 AM, Li, Aubrey <[email protected]> wrote:
>
> On 2020/7/22 22:32, benbjiang(蒋彪) wrote:
>> Hi,
>>
>>> On Jul 22, 2020, at 8:13 PM, Li, Aubrey <[email protected]> wrote:
>>>
>>> On 2020/7/22 16:54, benbjiang(蒋彪) wrote:
>>>> Hi, Aubrey,
>>>>
>>>>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>>>>>
>>>>> From: Aubrey Li <[email protected]>
>>>>>
>>>>> - Don't migrate if there is a cookie mismatch
>>>>> Load balance tries to move task from busiest CPU to the
>>>>> destination CPU. When core scheduling is enabled, if the
>>>>> task's cookie does not match with the destination CPU's
>>>>> core cookie, this task will be skipped by this CPU. This
>>>>> mitigates the forced idle time on the destination CPU.
>>>>>
>>>>> - Select cookie matched idle CPU
>>>>> In the fast path of task wakeup, select the first cookie matched
>>>>> idle CPU instead of the first idle CPU.
>>>>>
>>>>> - Find cookie matched idlest CPU
>>>>> In the slow path of task wakeup, find the idlest CPU whose core
>>>>> cookie matches with task's cookie
>>>>>
>>>>> - Don't migrate task if cookie not match
>>>>> For the NUMA load balance, don't migrate task to the CPU whose
>>>>> core cookie does not match with task's cookie
>>>>>
>>>>> Signed-off-by: Aubrey Li <[email protected]>
>>>>> Signed-off-by: Tim Chen <[email protected]>
>>>>> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
>>>>> ---
>>>>> kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++----
>>>>> kernel/sched/sched.h | 29 ++++++++++++++++++++
>>>>> 2 files changed, 88 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index d16939766361..33dc4bf01817 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>>>> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>>>> continue;
>>>>>
>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>> + /*
>>>>> + * Skip this cpu if source task's cookie does not match
>>>>> + * with CPU's core cookie.
>>>>> + */
>>>>> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>>> + continue;
>>>>> +#endif
>>>>> +
>>>>> env->dst_cpu = cpu;
>>>>> if (task_numa_compare(env, taskimp, groupimp, maymove))
>>>>> break;
>>>>> @@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>>>>
>>>>> /* Traverse only the allowed CPUs */
>>>>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>>>>> + struct rq *rq = cpu_rq(i);
>>>>> +
>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>> + if (!sched_core_cookie_match(rq, p))
>>>>> + continue;
>>>>> +#endif
>>>>> +
>>>>> if (sched_idle_cpu(i))
>>>>> return i;
>>>>>
>>>>> if (available_idle_cpu(i)) {
>>>>> - struct rq *rq = cpu_rq(i);
>>>>> struct cpuidle_state *idle = idle_get_state(rq);
>>>>> if (idle && idle->exit_latency < min_exit_latency) {
>>>>> /*
>>>>> @@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>>> for_each_cpu_wrap(cpu, cpus, target) {
>>>>> if (!--nr)
>>>>> return -1;
>>>>> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>>>>> - break;
>>>>> +
>>>>> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>> + /*
>>>>> + * If Core Scheduling is enabled, select this cpu
>>>>> + * only if the process cookie matches core cookie.
>>>>> + */
>>>>> + if (sched_core_enabled(cpu_rq(cpu)) &&
>>>>> + p->core_cookie == cpu_rq(cpu)->core->core_cookie)
>>>> Why not also add similar logic in select_idle_smt to reduce forced-idle? :)
>>> We hit select_idle_smt after we scaned the entire LLC domain for idle cores
>>> and idle cpus and failed,so IMHO, an idle smt is probably a good choice under
>>> this scenario.
>>
>> AFAIC, selecting idle sibling with unmatched cookie will cause unnecessary fored-idle, unfairness and latency, compared to choosing *target* cpu.
> Choosing target cpu could increase the runnable task number on the target runqueue, this
> could trigger busiest->nr_running > 1 logic and makes the idle sibling trying to pull but
> not success(due to cookie not match). Putting task to the idle sibling is relatively stable IMHO.

I'm afraid that *unsuccessful* pulls between smts would not result in instability, because
load balancing always runs periodically, and an unsuccessful pull means nothing happens.
On the contrary, unmatched sibling tasks running concurrently could force each other into
forced idle repeatedly, which is more unstable, and more costly when pick_next_task() runs for all siblings.
Given that load balancing is currently not fully aware of core scheduling and cannot improve
the *unmatched sibling* case, the *find_idlest_** entry points should try their best to avoid that case, IMHO.
Just a suggestion and an option, of course. :)

Thx.
Regards,
Jiang

>
>> Besides, choosing *target* cpu may be more cache friendly. So IMHO, *target* cpu may be a better choice if cookie not match, instead of idle sibling.
> I'm not sure if it's more cache friendly as the target is busy, and the coming task
> is a cookie unmatched task.
>
>>
>>>
>>>>
>>>>> +#endif
>>>>> + break;
>>>>> + }
>>>>> }
>>>>>
>>>>> time = cpu_clock(this) - time;
>>>>> @@ -7609,8 +7634,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>>>> * We do not migrate tasks that are:
>>>>> * 1) throttled_lb_pair, or
>>>>> * 2) cannot be migrated to this CPU due to cpus_ptr, or
>>>>> - * 3) running (obviously), or
>>>>> - * 4) are cache-hot on their current CPU.
>>>>> + * 3) task's cookie does not match with this CPU's core cookie
>>>>> + * 4) running (obviously), or
>>>>> + * 5) are cache-hot on their current CPU.
>>>>> */
>>>>> if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>>>>> return 0;
>>>>> @@ -7645,6 +7671,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>>>> return 0;
>>>>> }
>>>>>
>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>> + /*
>>>>> + * Don't migrate task if the task's cookie does not match
>>>>> + * with the destination CPU's core cookie.
>>>>> + */
>>>>> + if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
>>>>> + return 0;
>>>>> +#endif
>>>>> +
>>>>> /* Record that we found atleast one task that could run on dst_cpu */
>>>>> env->flags &= ~LBF_ALL_PINNED;
>>>>>
>>>>> @@ -8857,6 +8892,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>>>> p->cpus_ptr))
>>>>> continue;
>>>>>
>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>> + if (sched_core_enabled(cpu_rq(this_cpu))) {
>>>>> + int i = 0;
>>>>> + bool cookie_match = false;
>>>>> +
>>>>> + for_each_cpu(i, sched_group_span(group)) {
>>>> Should we consider the p->cpus_ptr here? like,
>>>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr ) {
>>>
>>> This is already considered just above #ifdef CONFIG_SCHED_CORE, but not included
>>> in the patch file.
>>>
>>> Thanks,
>>> -Aubrey
>>
>> The above consideration is,
>> 8893 /* Skip over this group if it has no CPUs allowed */
>> 8894 if (!cpumask_intersects(sched_group_span(group),
>> 8895 p->cpus_ptr))
>> 8896 continue;
>> 8897
>> It only considers the case of *p is not allowed for the whole group*, which is not enough.
>> If( cpumask_subset(p->cpus_ptr, sched_group_span(group)), the following sched_core_cookie_match() may choose a *wrong(not allowed)* cpu to match cookie. In that case, the matching result could be confusing and lead to wrong result.
>> On the other hand, considering p->cpus_ptr here could reduce the loop times and cost, if cpumask_and(p->cpus_ptr, sched_group_span(group)) is the subset of sched_group_span(group).
>
> Though find_idlest_group_cpu() will check p->cpus_ptr again, I believe this is a good catch and
> should be fixed in the next iteration.
>
> Thanks,
> -Aubrey

2020-07-23 03:36:52

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)

On 2020/7/23 10:42, benbjiang(蒋彪) wrote:
> Hi,
>
>> On Jul 23, 2020, at 9:57 AM, Li, Aubrey <[email protected]> wrote:
>>
>> On 2020/7/22 22:32, benbjiang(蒋彪) wrote:
>>> Hi,
>>>
>>>> On Jul 22, 2020, at 8:13 PM, Li, Aubrey <[email protected]> wrote:
>>>>
>>>> On 2020/7/22 16:54, benbjiang(蒋彪) wrote:
>>>>> Hi, Aubrey,
>>>>>
>>>>>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>>>>>>
>>>>>> From: Aubrey Li <[email protected]>
>>>>>>
>>>>>> - Don't migrate if there is a cookie mismatch
>>>>>> Load balance tries to move task from busiest CPU to the
>>>>>> destination CPU. When core scheduling is enabled, if the
>>>>>> task's cookie does not match with the destination CPU's
>>>>>> core cookie, this task will be skipped by this CPU. This
>>>>>> mitigates the forced idle time on the destination CPU.
>>>>>>
>>>>>> - Select cookie matched idle CPU
>>>>>> In the fast path of task wakeup, select the first cookie matched
>>>>>> idle CPU instead of the first idle CPU.
>>>>>>
>>>>>> - Find cookie matched idlest CPU
>>>>>> In the slow path of task wakeup, find the idlest CPU whose core
>>>>>> cookie matches with task's cookie
>>>>>>
>>>>>> - Don't migrate task if cookie not match
>>>>>> For the NUMA load balance, don't migrate task to the CPU whose
>>>>>> core cookie does not match with task's cookie
>>>>>>
>>>>>> Signed-off-by: Aubrey Li <[email protected]>
>>>>>> Signed-off-by: Tim Chen <[email protected]>
>>>>>> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
>>>>>> ---
>>>>>> kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++----
>>>>>> kernel/sched/sched.h | 29 ++++++++++++++++++++
>>>>>> 2 files changed, 88 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index d16939766361..33dc4bf01817 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>>>>> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>>>>> continue;
>>>>>>
>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>> + /*
>>>>>> + * Skip this cpu if source task's cookie does not match
>>>>>> + * with CPU's core cookie.
>>>>>> + */
>>>>>> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>>>> + continue;
>>>>>> +#endif
>>>>>> +
>>>>>> env->dst_cpu = cpu;
>>>>>> if (task_numa_compare(env, taskimp, groupimp, maymove))
>>>>>> break;
>>>>>> @@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>>>>>
>>>>>> /* Traverse only the allowed CPUs */
>>>>>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>>>>>> + struct rq *rq = cpu_rq(i);
>>>>>> +
>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>> + if (!sched_core_cookie_match(rq, p))
>>>>>> + continue;
>>>>>> +#endif
>>>>>> +
>>>>>> if (sched_idle_cpu(i))
>>>>>> return i;
>>>>>>
>>>>>> if (available_idle_cpu(i)) {
>>>>>> - struct rq *rq = cpu_rq(i);
>>>>>> struct cpuidle_state *idle = idle_get_state(rq);
>>>>>> if (idle && idle->exit_latency < min_exit_latency) {
>>>>>> /*
>>>>>> @@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>>>> for_each_cpu_wrap(cpu, cpus, target) {
>>>>>> if (!--nr)
>>>>>> return -1;
>>>>>> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>>>>>> - break;
>>>>>> +
>>>>>> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>> + /*
>>>>>> + * If Core Scheduling is enabled, select this cpu
>>>>>> + * only if the process cookie matches core cookie.
>>>>>> + */
>>>>>> + if (sched_core_enabled(cpu_rq(cpu)) &&
>>>>>> + p->core_cookie == cpu_rq(cpu)->core->core_cookie)
>>>>> Why not also add similar logic in select_idle_smt to reduce forced-idle? :)
>>>> We hit select_idle_smt after we scaned the entire LLC domain for idle cores
>>>> and idle cpus and failed,so IMHO, an idle smt is probably a good choice under
>>>> this scenario.
>>>
>>> AFAIC, selecting idle sibling with unmatched cookie will cause unnecessary fored-idle, unfairness and latency, compared to choosing *target* cpu.
>> Choosing target cpu could increase the runnable task number on the target runqueue, this
>> could trigger busiest->nr_running > 1 logic and makes the idle sibling trying to pull but
>> not success(due to cookie not match). Putting task to the idle sibling is relatively stable IMHO.
>
> I’m afraid that *unsuccessful* pullings between smts would not result in unstableness, because
> the load-balance always do periodicly , and unsuccess means nothing happen.
An unsuccessful pull means more unnecessary overhead in load balancing.

> On the contrary, unmatched sibling tasks running concurrently could bring forced-idle to each other repeatedly,
> Which is more unstable, and more costly when pick_next_task for all siblings.
Not worse than two tasks ping-ponging on the same target runqueue, I guess, and better if:
- task1 (cookie A) is running on the target, and task2 (cookie B) is in its runqueue,
- task3 (cookie B) is coming.

If task3 chooses the target's sibling, it has a chance to run concurrently with task2.
But if task3 chooses the target, it has to wait for the next lucky pull by the load balancer.

Thanks,
-Aubrey

> In consideration of currently load-balance being not fully aware of core-scheduling, and can not improve
> the *unmatched sibling* case, the *find_idlest_** entry should try its best to avoid the case, IMHO.

> Also, just an advice and an option. :)
>
> Thx.
> Regards,
> Jiang
>
>>
>>> Besides, choosing *target* cpu may be more cache friendly. So IMHO, *target* cpu may be a better choice if cookie not match, instead of idle sibling.
>> I'm not sure if it's more cache friendly as the target is busy, and the coming task
>> is a cookie unmatched task.
>>

2020-07-23 04:25:05

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)

Hi,
> On Jul 23, 2020, at 11:35 AM, Li, Aubrey <[email protected]> wrote:
>
> On 2020/7/23 10:42, benbjiang(蒋彪) wrote:
>> Hi,
>>
>>> On Jul 23, 2020, at 9:57 AM, Li, Aubrey <[email protected]> wrote:
>>>
>>> On 2020/7/22 22:32, benbjiang(蒋彪) wrote:
>>>> Hi,
>>>>
>>>>> On Jul 22, 2020, at 8:13 PM, Li, Aubrey <[email protected]> wrote:
>>>>>
>>>>> On 2020/7/22 16:54, benbjiang(蒋彪) wrote:
>>>>>> Hi, Aubrey,
>>>>>>
>>>>>>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>>>>>>>
>>>>>>> From: Aubrey Li <[email protected]>
>>>>>>>
>>>>>>> - Don't migrate if there is a cookie mismatch
>>>>>>> Load balance tries to move task from busiest CPU to the
>>>>>>> destination CPU. When core scheduling is enabled, if the
>>>>>>> task's cookie does not match with the destination CPU's
>>>>>>> core cookie, this task will be skipped by this CPU. This
>>>>>>> mitigates the forced idle time on the destination CPU.
>>>>>>>
>>>>>>> - Select cookie matched idle CPU
>>>>>>> In the fast path of task wakeup, select the first cookie matched
>>>>>>> idle CPU instead of the first idle CPU.
>>>>>>>
>>>>>>> - Find cookie matched idlest CPU
>>>>>>> In the slow path of task wakeup, find the idlest CPU whose core
>>>>>>> cookie matches with task's cookie
>>>>>>>
>>>>>>> - Don't migrate task if cookie not match
>>>>>>> For the NUMA load balance, don't migrate task to the CPU whose
>>>>>>> core cookie does not match with task's cookie
>>>>>>>
>>>>>>> Signed-off-by: Aubrey Li <[email protected]>
>>>>>>> Signed-off-by: Tim Chen <[email protected]>
>>>>>>> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
>>>>>>> ---
>>>>>>> kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++----
>>>>>>> kernel/sched/sched.h | 29 ++++++++++++++++++++
>>>>>>> 2 files changed, 88 insertions(+), 5 deletions(-)
>>>>>>>
>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>>> index d16939766361..33dc4bf01817 100644
>>>>>>> --- a/kernel/sched/fair.c
>>>>>>> +++ b/kernel/sched/fair.c
>>>>>>> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>>>>>> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>>>>>> continue;
>>>>>>>
>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>> + /*
>>>>>>> + * Skip this cpu if source task's cookie does not match
>>>>>>> + * with CPU's core cookie.
>>>>>>> + */
>>>>>>> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>>>>> + continue;
>>>>>>> +#endif
>>>>>>> +
>>>>>>> env->dst_cpu = cpu;
>>>>>>> if (task_numa_compare(env, taskimp, groupimp, maymove))
>>>>>>> break;
>>>>>>> @@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>>>>>>
>>>>>>> /* Traverse only the allowed CPUs */
>>>>>>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>>>>>>> + struct rq *rq = cpu_rq(i);
>>>>>>> +
>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>> + if (!sched_core_cookie_match(rq, p))
>>>>>>> + continue;
>>>>>>> +#endif
>>>>>>> +
>>>>>>> if (sched_idle_cpu(i))
>>>>>>> return i;
>>>>>>>
>>>>>>> if (available_idle_cpu(i)) {
>>>>>>> - struct rq *rq = cpu_rq(i);
>>>>>>> struct cpuidle_state *idle = idle_get_state(rq);
>>>>>>> if (idle && idle->exit_latency < min_exit_latency) {
>>>>>>> /*
>>>>>>> @@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>>>>> for_each_cpu_wrap(cpu, cpus, target) {
>>>>>>> if (!--nr)
>>>>>>> return -1;
>>>>>>> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>>>>>>> - break;
>>>>>>> +
>>>>>>> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>> + /*
>>>>>>> + * If Core Scheduling is enabled, select this cpu
>>>>>>> + * only if the process cookie matches core cookie.
>>>>>>> + */
>>>>>>> + if (sched_core_enabled(cpu_rq(cpu)) &&
>>>>>>> + p->core_cookie == cpu_rq(cpu)->core->core_cookie)
>>>>>> Why not also add similar logic in select_idle_smt to reduce forced-idle? :)
>>>>> We hit select_idle_smt after we scaned the entire LLC domain for idle cores
>>>>> and idle cpus and failed,so IMHO, an idle smt is probably a good choice under
>>>>> this scenario.
>>>>
>>>> AFAIC, selecting idle sibling with unmatched cookie will cause unnecessary fored-idle, unfairness and latency, compared to choosing *target* cpu.
>>> Choosing target cpu could increase the runnable task number on the target runqueue, this
>>> could trigger busiest->nr_running > 1 logic and makes the idle sibling trying to pull but
>>> not success(due to cookie not match). Putting task to the idle sibling is relatively stable IMHO.
>>
>> I’m afraid that *unsuccessful* pullings between smts would not result in unstableness, because
>> the load-balance always do periodicly , and unsuccess means nothing happen.
> unsuccess pulling means more unnecessary overhead in load balance.
>
>> On the contrary, unmatched sibling tasks running concurrently could bring forced-idle to each other repeatedly,
>> Which is more unstable, and more costly when pick_next_task for all siblings.
> Not worse than two tasks ping-pong on the same target run queue I guess, and better if
> - task1(cookie A) is running on the target, and task2(cookie B) in the runqueue,
> - task3(cookie B) coming
>
> If task3 chooses target's sibling, it could have a chance to run concurrently with task2.
> But if task3 chooses target, it will wait for next pulling luck of load balancer
That’s more interesting. :)
Distributing different-cookie tasks onto different cpus (or cpusets) could be the *ideal stable status* we want, as I understand it.
Different-cookie tasks running on sibling smts could hurt performance, and that should be avoided with best effort.
For the above case, selecting the idle sibling cpu can indeed improve concurrency, but it reduces the imbalance seen by the load balancer.
In that case, the load balancer would not notice the imbalance and would do nothing to improve the unmatched situation.
On the contrary, choosing the *target* cpu would increase the imbalance, and the load balancer could then try to pull the unmatched task away,
which could improve the unmatched situation and help us reach the *ideal stable status*. Maybe that’s what we expect. :)
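
To make the pulling part concrete: the rejection we both refer to ("trying to
pull but not success due to cookie not match") presumably sits in the
load-balance path along the lines below. This is only a sketch for the sake of
the argument; the placement in can_migrate_task() is an assumption, since that
hunk is not quoted in this sub-thread:

/*
 * Sketch only: assumed shape of the cookie filter on the pull path.
 * A candidate task whose cookie does not match the destination CPU's
 * core cookie is rejected, so the balance pass pays only the scan cost.
 */
static int can_migrate_task(struct task_struct *p, struct lb_env *env)
{
	/* ... existing affinity and cache-hotness checks ... */

#ifdef CONFIG_SCHED_CORE
	/*
	 * Don't migrate the task if its cookie does not match the
	 * destination CPU's core cookie.
	 */
	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
		return 0;
#endif

	/* ... */
	return 1;
}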

Thx.
Regards,
Jiang

>
> Thanks,
> -Aubrey
>
>> In consideration of currently load-balance being not fully aware of core-scheduling, and can not improve
>> the *unmatched sibling* case, the *find_idlest_** entry should try its best to avoid the case, IMHO.
>
>> Also, just an advice and an option. :)
>>
>> Thx.
>> Regards,
>> Jiang
>>
>>>
>>>> Besides, choosing *target* cpu may be more cache friendly. So IMHO, *target* cpu may be a better choice if cookie not match, instead of idle sibling.
>>> I'm not sure if it's more cache friendly as the target is busy, and the coming task
>>> is a cookie unmatched task.

2020-07-23 05:42:44

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)

On 2020/7/23 12:23, benbjiang(蒋彪) wrote:
> Hi,
>> On Jul 23, 2020, at 11:35 AM, Li, Aubrey <[email protected]> wrote:
>>
>> On 2020/7/23 10:42, benbjiang(蒋彪) wrote:
>>> Hi,
>>>
>>>> On Jul 23, 2020, at 9:57 AM, Li, Aubrey <[email protected]> wrote:
>>>>
>>>> On 2020/7/22 22:32, benbjiang(蒋彪) wrote:
>>>>> Hi,
>>>>>
>>>>>> On Jul 22, 2020, at 8:13 PM, Li, Aubrey <[email protected]> wrote:
>>>>>>
>>>>>> On 2020/7/22 16:54, benbjiang(蒋彪) wrote:
>>>>>>> Hi, Aubrey,
>>>>>>>
>>>>>>>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>>>>>>>>
>>>>>>>> From: Aubrey Li <[email protected]>
>>>>>>>>
>>>>>>>> - Don't migrate if there is a cookie mismatch
>>>>>>>> Load balance tries to move task from busiest CPU to the
>>>>>>>> destination CPU. When core scheduling is enabled, if the
>>>>>>>> task's cookie does not match with the destination CPU's
>>>>>>>> core cookie, this task will be skipped by this CPU. This
>>>>>>>> mitigates the forced idle time on the destination CPU.
>>>>>>>>
>>>>>>>> - Select cookie matched idle CPU
>>>>>>>> In the fast path of task wakeup, select the first cookie matched
>>>>>>>> idle CPU instead of the first idle CPU.
>>>>>>>>
>>>>>>>> - Find cookie matched idlest CPU
>>>>>>>> In the slow path of task wakeup, find the idlest CPU whose core
>>>>>>>> cookie matches with task's cookie
>>>>>>>>
>>>>>>>> - Don't migrate task if cookie not match
>>>>>>>> For the NUMA load balance, don't migrate task to the CPU whose
>>>>>>>> core cookie does not match with task's cookie
>>>>>>>>
>>>>>>>> Signed-off-by: Aubrey Li <[email protected]>
>>>>>>>> Signed-off-by: Tim Chen <[email protected]>
>>>>>>>> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
>>>>>>>> ---
>>>>>>>> kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++----
>>>>>>>> kernel/sched/sched.h | 29 ++++++++++++++++++++
>>>>>>>> 2 files changed, 88 insertions(+), 5 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>>>> index d16939766361..33dc4bf01817 100644
>>>>>>>> --- a/kernel/sched/fair.c
>>>>>>>> +++ b/kernel/sched/fair.c
>>>>>>>> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>>>>>>> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>>>>>>> continue;
>>>>>>>>
>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>> + /*
>>>>>>>> + * Skip this cpu if source task's cookie does not match
>>>>>>>> + * with CPU's core cookie.
>>>>>>>> + */
>>>>>>>> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>>>>>> + continue;
>>>>>>>> +#endif
>>>>>>>> +
>>>>>>>> env->dst_cpu = cpu;
>>>>>>>> if (task_numa_compare(env, taskimp, groupimp, maymove))
>>>>>>>> break;
>>>>>>>> @@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>>>>>>>
>>>>>>>> /* Traverse only the allowed CPUs */
>>>>>>>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>>>>>>>> + struct rq *rq = cpu_rq(i);
>>>>>>>> +
>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>> + if (!sched_core_cookie_match(rq, p))
>>>>>>>> + continue;
>>>>>>>> +#endif
>>>>>>>> +
>>>>>>>> if (sched_idle_cpu(i))
>>>>>>>> return i;
>>>>>>>>
>>>>>>>> if (available_idle_cpu(i)) {
>>>>>>>> - struct rq *rq = cpu_rq(i);
>>>>>>>> struct cpuidle_state *idle = idle_get_state(rq);
>>>>>>>> if (idle && idle->exit_latency < min_exit_latency) {
>>>>>>>> /*
>>>>>>>> @@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>>>>>> for_each_cpu_wrap(cpu, cpus, target) {
>>>>>>>> if (!--nr)
>>>>>>>> return -1;
>>>>>>>> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>>>>>>>> - break;
>>>>>>>> +
>>>>>>>> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>> + /*
>>>>>>>> + * If Core Scheduling is enabled, select this cpu
>>>>>>>> + * only if the process cookie matches core cookie.
>>>>>>>> + */
>>>>>>>> + if (sched_core_enabled(cpu_rq(cpu)) &&
>>>>>>>> + p->core_cookie == cpu_rq(cpu)->core->core_cookie)
>>>>>>> Why not also add similar logic in select_idle_smt to reduce forced-idle? :)
>>>>>> We hit select_idle_smt after we scaned the entire LLC domain for idle cores
>>>>>> and idle cpus and failed,so IMHO, an idle smt is probably a good choice under
>>>>>> this scenario.
>>>>>
>>>>> AFAIC, selecting idle sibling with unmatched cookie will cause unnecessary fored-idle, unfairness and latency, compared to choosing *target* cpu.
>>>> Choosing target cpu could increase the runnable task number on the target runqueue, this
>>>> could trigger busiest->nr_running > 1 logic and makes the idle sibling trying to pull but
>>>> not success(due to cookie not match). Putting task to the idle sibling is relatively stable IMHO.
>>>
>>> I’m afraid that *unsuccessful* pullings between smts would not result in unstableness, because
>>> the load-balance always do periodicly , and unsuccess means nothing happen.
>> unsuccess pulling means more unnecessary overhead in load balance.
>>
>>> On the contrary, unmatched sibling tasks running concurrently could bring forced-idle to each other repeatedly,
>>> Which is more unstable, and more costly when pick_next_task for all siblings.
>> Not worse than two tasks ping-pong on the same target run queue I guess, and better if
>> - task1(cookie A) is running on the target, and task2(cookie B) in the runqueue,
>> - task3(cookie B) coming
>>
>> If task3 chooses target's sibling, it could have a chance to run concurrently with task2.
>> But if task3 chooses target, it will wait for next pulling luck of load balancer
> That’s more interesting. :)
> Distributing different cookie tasks onto different cpus(or cpusets) could be the *ideal stable status* we want, as I understood.
> Different cookie tasks running on sibling smts could hurt performance, and that should be avoided with best effort.
We already try to avoid that when we scan idle cores and idle cpus in the llc domain.

> For above case, selecting idle sibling cpu can improve the concurrency indeed, but it decrease the imbalance for load-balancer.
> In that case, load-balancer could not notice the imbalance, and would do nothing to improve the unmatched situation.
> On the contrary, choosing the *target* cpu could enhance the imbalance, and load-balancer could try to pull unmatched task away,
Pulling away to where needs a bit more elaboration.

> which could improve the unmatched situation and be helpful to reach the *ideal stable status*. Maybe that’s what we expect. :)
>
If we limit ourselves to this one-core, two-sibling, three-task case, choosing the idle sibling is the ideal
stable state, as it saves one lucky load-balancer pull and one task migration.

Thanks,
-Aubrey

2020-07-23 07:47:51

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)

Hi,

> On Jul 23, 2020, at 1:39 PM, Li, Aubrey <[email protected]> wrote:
>
> On 2020/7/23 12:23, benbjiang(蒋彪) wrote:
>> Hi,
>>> On Jul 23, 2020, at 11:35 AM, Li, Aubrey <[email protected]> wrote:
>>>
>>> On 2020/7/23 10:42, benbjiang(蒋彪) wrote:
>>>> Hi,
>>>>
>>>>> On Jul 23, 2020, at 9:57 AM, Li, Aubrey <[email protected]> wrote:
>>>>>
>>>>> On 2020/7/22 22:32, benbjiang(蒋彪) wrote:
>>>>>> Hi,
>>>>>>
>>>>>>> On Jul 22, 2020, at 8:13 PM, Li, Aubrey <[email protected]> wrote:
>>>>>>>
>>>>>>> On 2020/7/22 16:54, benbjiang(蒋彪) wrote:
>>>>>>>> Hi, Aubrey,
>>>>>>>>
>>>>>>>>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> From: Aubrey Li <[email protected]>
>>>>>>>>>
>>>>>>>>> - Don't migrate if there is a cookie mismatch
>>>>>>>>> Load balance tries to move task from busiest CPU to the
>>>>>>>>> destination CPU. When core scheduling is enabled, if the
>>>>>>>>> task's cookie does not match with the destination CPU's
>>>>>>>>> core cookie, this task will be skipped by this CPU. This
>>>>>>>>> mitigates the forced idle time on the destination CPU.
>>>>>>>>>
>>>>>>>>> - Select cookie matched idle CPU
>>>>>>>>> In the fast path of task wakeup, select the first cookie matched
>>>>>>>>> idle CPU instead of the first idle CPU.
>>>>>>>>>
>>>>>>>>> - Find cookie matched idlest CPU
>>>>>>>>> In the slow path of task wakeup, find the idlest CPU whose core
>>>>>>>>> cookie matches with task's cookie
>>>>>>>>>
>>>>>>>>> - Don't migrate task if cookie not match
>>>>>>>>> For the NUMA load balance, don't migrate task to the CPU whose
>>>>>>>>> core cookie does not match with task's cookie
>>>>>>>>>
>>>>>>>>> Signed-off-by: Aubrey Li <[email protected]>
>>>>>>>>> Signed-off-by: Tim Chen <[email protected]>
>>>>>>>>> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
>>>>>>>>> ---
>>>>>>>>> kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++----
>>>>>>>>> kernel/sched/sched.h | 29 ++++++++++++++++++++
>>>>>>>>> 2 files changed, 88 insertions(+), 5 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>>>>> index d16939766361..33dc4bf01817 100644
>>>>>>>>> --- a/kernel/sched/fair.c
>>>>>>>>> +++ b/kernel/sched/fair.c
>>>>>>>>> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>>>>>>>> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>>>>>>>> continue;
>>>>>>>>>
>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>> + /*
>>>>>>>>> + * Skip this cpu if source task's cookie does not match
>>>>>>>>> + * with CPU's core cookie.
>>>>>>>>> + */
>>>>>>>>> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>>>>>>> + continue;
>>>>>>>>> +#endif
>>>>>>>>> +
>>>>>>>>> env->dst_cpu = cpu;
>>>>>>>>> if (task_numa_compare(env, taskimp, groupimp, maymove))
>>>>>>>>> break;
>>>>>>>>> @@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>>>>>>>>
>>>>>>>>> /* Traverse only the allowed CPUs */
>>>>>>>>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>>>>>>>>> + struct rq *rq = cpu_rq(i);
>>>>>>>>> +
>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>> + if (!sched_core_cookie_match(rq, p))
>>>>>>>>> + continue;
>>>>>>>>> +#endif
>>>>>>>>> +
>>>>>>>>> if (sched_idle_cpu(i))
>>>>>>>>> return i;
>>>>>>>>>
>>>>>>>>> if (available_idle_cpu(i)) {
>>>>>>>>> - struct rq *rq = cpu_rq(i);
>>>>>>>>> struct cpuidle_state *idle = idle_get_state(rq);
>>>>>>>>> if (idle && idle->exit_latency < min_exit_latency) {
>>>>>>>>> /*
>>>>>>>>> @@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>>>>>>> for_each_cpu_wrap(cpu, cpus, target) {
>>>>>>>>> if (!--nr)
>>>>>>>>> return -1;
>>>>>>>>> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>>>>>>>>> - break;
>>>>>>>>> +
>>>>>>>>> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>> + /*
>>>>>>>>> + * If Core Scheduling is enabled, select this cpu
>>>>>>>>> + * only if the process cookie matches core cookie.
>>>>>>>>> + */
>>>>>>>>> + if (sched_core_enabled(cpu_rq(cpu)) &&
>>>>>>>>> + p->core_cookie == cpu_rq(cpu)->core->core_cookie)
>>>>>>>> Why not also add similar logic in select_idle_smt to reduce forced-idle? :)
>>>>>>> We hit select_idle_smt after we scaned the entire LLC domain for idle cores
>>>>>>> and idle cpus and failed,so IMHO, an idle smt is probably a good choice under
>>>>>>> this scenario.
>>>>>>
>>>>>> AFAIC, selecting idle sibling with unmatched cookie will cause unnecessary fored-idle, unfairness and latency, compared to choosing *target* cpu.
>>>>> Choosing target cpu could increase the runnable task number on the target runqueue, this
>>>>> could trigger busiest->nr_running > 1 logic and makes the idle sibling trying to pull but
>>>>> not success(due to cookie not match). Putting task to the idle sibling is relatively stable IMHO.
>>>>
>>>> I’m afraid that *unsuccessful* pullings between smts would not result in unstableness, because
>>>> the load-balance always do periodicly , and unsuccess means nothing happen.
>>> unsuccess pulling means more unnecessary overhead in load balance.
>>>
>>>> On the contrary, unmatched sibling tasks running concurrently could bring forced-idle to each other repeatedly,
>>>> Which is more unstable, and more costly when pick_next_task for all siblings.
>>> Not worse than two tasks ping-pong on the same target run queue I guess, and better if
>>> - task1(cookie A) is running on the target, and task2(cookie B) in the runqueue,
>>> - task3(cookie B) coming
>>>
>>> If task3 chooses target's sibling, it could have a chance to run concurrently with task2.
>>> But if task3 chooses target, it will wait for next pulling luck of load balancer
>> That’s more interesting. :)
>> Distributing different cookie tasks onto different cpus(or cpusets) could be the *ideal stable status* we want, as I understood.
>> Different cookie tasks running on sibling smts could hurt performance, and that should be avoided with best effort.
> We already tried to avoid when we scan idle cores and idle cpus in llc domain.

I’m afraid that’s not enough either. :)
1. Scanning idle cpus is not a full scan; it is limited according to the scan cost.
2. That's only trying at the *core/cpu* level; the *SMT* level should be considered too, as in the sketch below.
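
A minimal sketch of point 2, just to illustrate the suggestion (untested, and
assuming the v5.7 select_idle_smt() signature), reusing the same filter the
patch already applies in select_idle_cpu():

/*
 * Sketch only: apply the cookie filter at the SMT level as well,
 * mirroring what the patch does in select_idle_cpu().
 */
static int select_idle_smt(struct task_struct *p, int target)
{
	int cpu;

	if (!static_branch_likely(&sched_smt_present))
		return -1;

	for_each_cpu(cpu, cpu_smt_mask(target)) {
		if (!cpumask_test_cpu(cpu, p->cpus_ptr))
			continue;
#ifdef CONFIG_SCHED_CORE
		/* Skip a sibling whose core cookie does not match the task. */
		if (!sched_core_cookie_match(cpu_rq(cpu), p))
			continue;
#endif
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			return cpu;
	}

	return -1;
}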

>
>> For above case, selecting idle sibling cpu can improve the concurrency indeed, but it decrease the imbalance for load-balancer.
>> In that case, load-balancer could not notice the imbalance, and would do nothing to improve the unmatched situation.
>> On the contrary, choosing the *target* cpu could enhance the imbalance, and load-balancer could try to pull unmatched task away,
> Pulling away to where needs another bunch of elaboration.

Still with the SMT2 + 3-task case:
If the *idle sibling* is chosen,
smt1’s load = task1+task2, smt2’s load = task3. Task3 will run intermittently because of forced idle,
so smt2’s real load could be low enough that it can never be pulled away. That’s indeed a stable state,
but with performance at a discount.

If the *target* cpu is chosen,
smt1’s load = task1+task2+task3, smt2’s load = 0. That’s an obvious imbalance, and the load balancer will pick a task to pull:
1. If task1 (cookie A) is picked, that’s done for good.
2. If task2 (cookie B) or task3 (cookie B) is picked, that’s ok too; the remaining task (cookie B) could be pulled away at the next balance (maybe the pulling needs to be improved to pick matched tasks more aggressively).
And then we may reach a more stable state *globally*, without a performance discount.

IMHO, maybe wrong. :-)

Thx.
Regards,
Jiang

>> which could improve the unmatched situation and be helpful to reach the *ideal stable status*. Maybe that’s what we expect. :)
>>
> If we limit to this one-core two-sibling three-tasks case, choosing the idle sibling is the ideal stable
> status, as it saves one lucky load balancer pulling and task migration.
>
> Thanks,
> -Aubrey

2020-07-23 08:08:22

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)

On 2020/7/23 15:47, benbjiang(蒋彪) wrote:
> Hi,
>
>> On Jul 23, 2020, at 1:39 PM, Li, Aubrey <[email protected]> wrote:
>>
>> On 2020/7/23 12:23, benbjiang(蒋彪) wrote:
>>> Hi,
>>>> On Jul 23, 2020, at 11:35 AM, Li, Aubrey <[email protected]> wrote:
>>>>
>>>> On 2020/7/23 10:42, benbjiang(蒋彪) wrote:
>>>>> Hi,
>>>>>
>>>>>> On Jul 23, 2020, at 9:57 AM, Li, Aubrey <[email protected]> wrote:
>>>>>>
>>>>>> On 2020/7/22 22:32, benbjiang(蒋彪) wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>>> On Jul 22, 2020, at 8:13 PM, Li, Aubrey <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 2020/7/22 16:54, benbjiang(蒋彪) wrote:
>>>>>>>>> Hi, Aubrey,
>>>>>>>>>
>>>>>>>>>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> From: Aubrey Li <[email protected]>
>>>>>>>>>>
>>>>>>>>>> - Don't migrate if there is a cookie mismatch
>>>>>>>>>> Load balance tries to move task from busiest CPU to the
>>>>>>>>>> destination CPU. When core scheduling is enabled, if the
>>>>>>>>>> task's cookie does not match with the destination CPU's
>>>>>>>>>> core cookie, this task will be skipped by this CPU. This
>>>>>>>>>> mitigates the forced idle time on the destination CPU.
>>>>>>>>>>
>>>>>>>>>> - Select cookie matched idle CPU
>>>>>>>>>> In the fast path of task wakeup, select the first cookie matched
>>>>>>>>>> idle CPU instead of the first idle CPU.
>>>>>>>>>>
>>>>>>>>>> - Find cookie matched idlest CPU
>>>>>>>>>> In the slow path of task wakeup, find the idlest CPU whose core
>>>>>>>>>> cookie matches with task's cookie
>>>>>>>>>>
>>>>>>>>>> - Don't migrate task if cookie not match
>>>>>>>>>> For the NUMA load balance, don't migrate task to the CPU whose
>>>>>>>>>> core cookie does not match with task's cookie
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Aubrey Li <[email protected]>
>>>>>>>>>> Signed-off-by: Tim Chen <[email protected]>
>>>>>>>>>> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
>>>>>>>>>> ---
>>>>>>>>>> kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++----
>>>>>>>>>> kernel/sched/sched.h | 29 ++++++++++++++++++++
>>>>>>>>>> 2 files changed, 88 insertions(+), 5 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>>>>>> index d16939766361..33dc4bf01817 100644
>>>>>>>>>> --- a/kernel/sched/fair.c
>>>>>>>>>> +++ b/kernel/sched/fair.c
>>>>>>>>>> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>>>>>>>>> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>>>>>>>>> continue;
>>>>>>>>>>
>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>>> + /*
>>>>>>>>>> + * Skip this cpu if source task's cookie does not match
>>>>>>>>>> + * with CPU's core cookie.
>>>>>>>>>> + */
>>>>>>>>>> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>>>>>>>> + continue;
>>>>>>>>>> +#endif
>>>>>>>>>> +
>>>>>>>>>> env->dst_cpu = cpu;
>>>>>>>>>> if (task_numa_compare(env, taskimp, groupimp, maymove))
>>>>>>>>>> break;
>>>>>>>>>> @@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>>>>>>>>>
>>>>>>>>>> /* Traverse only the allowed CPUs */
>>>>>>>>>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>>>>>>>>>> + struct rq *rq = cpu_rq(i);
>>>>>>>>>> +
>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>>> + if (!sched_core_cookie_match(rq, p))
>>>>>>>>>> + continue;
>>>>>>>>>> +#endif
>>>>>>>>>> +
>>>>>>>>>> if (sched_idle_cpu(i))
>>>>>>>>>> return i;
>>>>>>>>>>
>>>>>>>>>> if (available_idle_cpu(i)) {
>>>>>>>>>> - struct rq *rq = cpu_rq(i);
>>>>>>>>>> struct cpuidle_state *idle = idle_get_state(rq);
>>>>>>>>>> if (idle && idle->exit_latency < min_exit_latency) {
>>>>>>>>>> /*
>>>>>>>>>> @@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>>>>>>>> for_each_cpu_wrap(cpu, cpus, target) {
>>>>>>>>>> if (!--nr)
>>>>>>>>>> return -1;
>>>>>>>>>> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>>>>>>>>>> - break;
>>>>>>>>>> +
>>>>>>>>>> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>>> + /*
>>>>>>>>>> + * If Core Scheduling is enabled, select this cpu
>>>>>>>>>> + * only if the process cookie matches core cookie.
>>>>>>>>>> + */
>>>>>>>>>> + if (sched_core_enabled(cpu_rq(cpu)) &&
>>>>>>>>>> + p->core_cookie == cpu_rq(cpu)->core->core_cookie)
>>>>>>>>> Why not also add similar logic in select_idle_smt to reduce forced-idle? :)
>>>>>>>> We hit select_idle_smt after we scaned the entire LLC domain for idle cores
>>>>>>>> and idle cpus and failed,so IMHO, an idle smt is probably a good choice under
>>>>>>>> this scenario.
>>>>>>>
>>>>>>> AFAIC, selecting idle sibling with unmatched cookie will cause unnecessary fored-idle, unfairness and latency, compared to choosing *target* cpu.
>>>>>> Choosing target cpu could increase the runnable task number on the target runqueue, this
>>>>>> could trigger busiest->nr_running > 1 logic and makes the idle sibling trying to pull but
>>>>>> not success(due to cookie not match). Putting task to the idle sibling is relatively stable IMHO.
>>>>>
>>>>> I’m afraid that *unsuccessful* pullings between smts would not result in unstableness, because
>>>>> the load-balance always do periodicly , and unsuccess means nothing happen.
>>>> unsuccess pulling means more unnecessary overhead in load balance.
>>>>
>>>>> On the contrary, unmatched sibling tasks running concurrently could bring forced-idle to each other repeatedly,
>>>>> Which is more unstable, and more costly when pick_next_task for all siblings.
>>>> Not worse than two tasks ping-pong on the same target run queue I guess, and better if
>>>> - task1(cookie A) is running on the target, and task2(cookie B) in the runqueue,
>>>> - task3(cookie B) coming
>>>>
>>>> If task3 chooses target's sibling, it could have a chance to run concurrently with task2.
>>>> But if task3 chooses target, it will wait for next pulling luck of load balancer
>>> That’s more interesting. :)
>>> Distributing different cookie tasks onto different cpus(or cpusets) could be the *ideal stable status* we want, as I understood.
>>> Different cookie tasks running on sibling smts could hurt performance, and that should be avoided with best effort.
>> We already tried to avoid when we scan idle cores and idle cpus in llc domain.
>
> I’m afraid that’s not enough either, :)
> 1. Scanning Idle cpus is not a full scan, there is limit according to scan cost.
> 2. That's only trying at the *core/cpu* level, *SMT* level should be considered too.
>
>>
>>> For above case, selecting idle sibling cpu can improve the concurrency indeed, but it decrease the imbalance for load-balancer.
>>> In that case, load-balancer could not notice the imbalance, and would do nothing to improve the unmatched situation.
>>> On the contrary, choosing the *target* cpu could enhance the imbalance, and load-balancer could try to pull unmatched task away,
>> Pulling away to where needs another bunch of elaboration.
>
> Still with the SMT2+3tasks case,
> if *idle sibling* chosen,
> Smt1’s load = task1+task2, smt2’s load = task3. Task3 will run intermittently because of forced-idle,
> so smt2’s real load could low enough, that it could not be pulled away forever. That’s indeed a stable state,
> but with performance at a discount.
>
> If *target sibling* chose,
> Smt1’s load = task1+task2+task3, smt2’s load=0. It’s a obvious imbalance, and load-balancer will pick a task to pull,
> 1. If task1(cookie A) picked, that’s done for good.
> 2. If task2(cookie B) or task3(cookie B) picked, that’s ok too, the rest task(cookie B) could be pulled away at next balance(maybe need to improve the pulling to tend to pull matched task more aggressively).
> And then, we may reach a more stable state *globally* without performance discount.

I'm not sure what you mean by pulled away:
- if you mean pulled away from this core, cookieA in the idle-sibling case can be
pulled away too.
- and if you mean pulled away but within this core, I guess cookieB in the
target-sibling case can't be pulled away either, as the nr_running difference = 1.

Thanks,
-Aubrey

2020-07-23 08:31:35

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)

Hi,

> On Jul 23, 2020, at 4:06 PM, Li, Aubrey <[email protected]> wrote:
>
> On 2020/7/23 15:47, benbjiang(蒋彪) wrote:
>> Hi,
>>
>>> On Jul 23, 2020, at 1:39 PM, Li, Aubrey <[email protected]> wrote:
>>>
>>> On 2020/7/23 12:23, benbjiang(蒋彪) wrote:
>>>> Hi,
>>>>> On Jul 23, 2020, at 11:35 AM, Li, Aubrey <[email protected]> wrote:
>>>>>
>>>>> On 2020/7/23 10:42, benbjiang(蒋彪) wrote:
>>>>>> Hi,
>>>>>>
>>>>>>> On Jul 23, 2020, at 9:57 AM, Li, Aubrey <[email protected]> wrote:
>>>>>>>
>>>>>>> On 2020/7/22 22:32, benbjiang(蒋彪) wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>> On Jul 22, 2020, at 8:13 PM, Li, Aubrey <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On 2020/7/22 16:54, benbjiang(蒋彪) wrote:
>>>>>>>>>> Hi, Aubrey,
>>>>>>>>>>
>>>>>>>>>>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> From: Aubrey Li <[email protected]>
>>>>>>>>>>>
>>>>>>>>>>> - Don't migrate if there is a cookie mismatch
>>>>>>>>>>> Load balance tries to move task from busiest CPU to the
>>>>>>>>>>> destination CPU. When core scheduling is enabled, if the
>>>>>>>>>>> task's cookie does not match with the destination CPU's
>>>>>>>>>>> core cookie, this task will be skipped by this CPU. This
>>>>>>>>>>> mitigates the forced idle time on the destination CPU.
>>>>>>>>>>>
>>>>>>>>>>> - Select cookie matched idle CPU
>>>>>>>>>>> In the fast path of task wakeup, select the first cookie matched
>>>>>>>>>>> idle CPU instead of the first idle CPU.
>>>>>>>>>>>
>>>>>>>>>>> - Find cookie matched idlest CPU
>>>>>>>>>>> In the slow path of task wakeup, find the idlest CPU whose core
>>>>>>>>>>> cookie matches with task's cookie
>>>>>>>>>>>
>>>>>>>>>>> - Don't migrate task if cookie not match
>>>>>>>>>>> For the NUMA load balance, don't migrate task to the CPU whose
>>>>>>>>>>> core cookie does not match with task's cookie
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Aubrey Li <[email protected]>
>>>>>>>>>>> Signed-off-by: Tim Chen <[email protected]>
>>>>>>>>>>> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
>>>>>>>>>>> ---
>>>>>>>>>>> kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++----
>>>>>>>>>>> kernel/sched/sched.h | 29 ++++++++++++++++++++
>>>>>>>>>>> 2 files changed, 88 insertions(+), 5 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>>>>>>> index d16939766361..33dc4bf01817 100644
>>>>>>>>>>> --- a/kernel/sched/fair.c
>>>>>>>>>>> +++ b/kernel/sched/fair.c
>>>>>>>>>>> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>>>>>>>>>> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>>>>>>>>>> continue;
>>>>>>>>>>>
>>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>>>> + /*
>>>>>>>>>>> + * Skip this cpu if source task's cookie does not match
>>>>>>>>>>> + * with CPU's core cookie.
>>>>>>>>>>> + */
>>>>>>>>>>> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>>>>>>>>> + continue;
>>>>>>>>>>> +#endif
>>>>>>>>>>> +
>>>>>>>>>>> env->dst_cpu = cpu;
>>>>>>>>>>> if (task_numa_compare(env, taskimp, groupimp, maymove))
>>>>>>>>>>> break;
>>>>>>>>>>> @@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>>>>>>>>>>
>>>>>>>>>>> /* Traverse only the allowed CPUs */
>>>>>>>>>>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>>>>>>>>>>> + struct rq *rq = cpu_rq(i);
>>>>>>>>>>> +
>>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>>>> + if (!sched_core_cookie_match(rq, p))
>>>>>>>>>>> + continue;
>>>>>>>>>>> +#endif
>>>>>>>>>>> +
>>>>>>>>>>> if (sched_idle_cpu(i))
>>>>>>>>>>> return i;
>>>>>>>>>>>
>>>>>>>>>>> if (available_idle_cpu(i)) {
>>>>>>>>>>> - struct rq *rq = cpu_rq(i);
>>>>>>>>>>> struct cpuidle_state *idle = idle_get_state(rq);
>>>>>>>>>>> if (idle && idle->exit_latency < min_exit_latency) {
>>>>>>>>>>> /*
>>>>>>>>>>> @@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>>>>>>>>> for_each_cpu_wrap(cpu, cpus, target) {
>>>>>>>>>>> if (!--nr)
>>>>>>>>>>> return -1;
>>>>>>>>>>> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>>>>>>>>>>> - break;
>>>>>>>>>>> +
>>>>>>>>>>> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>>>> + /*
>>>>>>>>>>> + * If Core Scheduling is enabled, select this cpu
>>>>>>>>>>> + * only if the process cookie matches core cookie.
>>>>>>>>>>> + */
>>>>>>>>>>> + if (sched_core_enabled(cpu_rq(cpu)) &&
>>>>>>>>>>> + p->core_cookie == cpu_rq(cpu)->core->core_cookie)
>>>>>>>>>> Why not also add similar logic in select_idle_smt to reduce forced-idle? :)
>>>>>>>>> We hit select_idle_smt after we scaned the entire LLC domain for idle cores
>>>>>>>>> and idle cpus and failed,so IMHO, an idle smt is probably a good choice under
>>>>>>>>> this scenario.
>>>>>>>>
>>>>>>>> AFAIC, selecting idle sibling with unmatched cookie will cause unnecessary fored-idle, unfairness and latency, compared to choosing *target* cpu.
>>>>>>> Choosing target cpu could increase the runnable task number on the target runqueue, this
>>>>>>> could trigger busiest->nr_running > 1 logic and makes the idle sibling trying to pull but
>>>>>>> not success(due to cookie not match). Putting task to the idle sibling is relatively stable IMHO.
>>>>>>
>>>>>> I’m afraid that *unsuccessful* pullings between smts would not result in unstableness, because
>>>>>> the load-balance always do periodicly , and unsuccess means nothing happen.
>>>>> unsuccess pulling means more unnecessary overhead in load balance.
>>>>>
>>>>>> On the contrary, unmatched sibling tasks running concurrently could bring forced-idle to each other repeatedly,
>>>>>> Which is more unstable, and more costly when pick_next_task for all siblings.
>>>>> Not worse than two tasks ping-pong on the same target run queue I guess, and better if
>>>>> - task1(cookie A) is running on the target, and task2(cookie B) in the runqueue,
>>>>> - task3(cookie B) coming
>>>>>
>>>>> If task3 chooses target's sibling, it could have a chance to run concurrently with task2.
>>>>> But if task3 chooses target, it will wait for next pulling luck of load balancer
>>>> That’s more interesting. :)
>>>> Distributing different cookie tasks onto different cpus(or cpusets) could be the *ideal stable status* we want, as I understood.
>>>> Different cookie tasks running on sibling smts could hurt performance, and that should be avoided with best effort.
>>> We already tried to avoid when we scan idle cores and idle cpus in llc domain.
>>
>> I’m afraid that’s not enough either, :)
>> 1. Scanning Idle cpus is not a full scan, there is limit according to scan cost.
>> 2. That's only trying at the *core/cpu* level, *SMT* level should be considered too.
>>
>>>
>>>> For above case, selecting idle sibling cpu can improve the concurrency indeed, but it decrease the imbalance for load-balancer.
>>>> In that case, load-balancer could not notice the imbalance, and would do nothing to improve the unmatched situation.
>>>> On the contrary, choosing the *target* cpu could enhance the imbalance, and load-balancer could try to pull unmatched task away,
>>> Pulling away to where needs another bunch of elaboration.
>>
>> Still with the SMT2+3tasks case,
>> if *idle sibling* chosen,
>> Smt1’s load = task1+task2, smt2’s load = task3. Task3 will run intermittently because of forced-idle,
>> so smt2’s real load could low enough, that it could not be pulled away forever. That’s indeed a stable state,
>> but with performance at a discount.
>>
>> If *target sibling* chose,
>> Smt1’s load = task1+task2+task3, smt2’s load=0. It’s a obvious imbalance, and load-balancer will pick a task to pull,
>> 1. If task1(cookie A) picked, that’s done for good.
>> 2. If task2(cookie B) or task3(cookie B) picked, that’s ok too, the rest task(cookie B) could be pulled away at next balance(maybe need to improve the pulling to tend to pull matched task more aggressively).
>> And then, we may reach a more stable state *globally* without performance discount.
>
> I'm not sure what you mean pulled away,
I mean pulled away by other cpus, which may be triggered by idle balance or periodic balance on those cpus.

> - if you mean pulled away from this core, cookieA in idle sibling case can be
> pulled away too.
Yep, cookieA (task1) in the idle-sibling case could be pulled away, but
cookieB (task3) on smt2 could never get the chance to be pulled
away (unless it is woken up).
If cookieA (task1) fails to be pulled (cookieB (task2) on smt1 may be pulled
instead, a 50% chance), cookieA (task1) and cookieB (task3) would reach the
stable state with a performance discount.

Thx.
Regards,
Jiang

> - and if you mean pulled away but within this core, I guess cookieB in target
> sibling case can't be pulled away either, as nr_running difference = 1



>
> Thanks,
> -Aubrey

2020-07-23 23:46:38

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)

On Thu, Jul 23, 2020 at 4:28 PM benbjiang(蒋彪) <[email protected]> wrote:
>
> Hi,
>
> > On Jul 23, 2020, at 4:06 PM, Li, Aubrey <[email protected]> wrote:
> >
> > On 2020/7/23 15:47, benbjiang(蒋彪) wrote:
> >> Hi,
> >>
> >>> On Jul 23, 2020, at 1:39 PM, Li, Aubrey <[email protected]> wrote:
> >>>
> >>> On 2020/7/23 12:23, benbjiang(蒋彪) wrote:
> >>>> Hi,
> >>>>> On Jul 23, 2020, at 11:35 AM, Li, Aubrey <[email protected]> wrote:
> >>>>>
> >>>>> On 2020/7/23 10:42, benbjiang(蒋彪) wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>>> On Jul 23, 2020, at 9:57 AM, Li, Aubrey <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On 2020/7/22 22:32, benbjiang(蒋彪) wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>>> On Jul 22, 2020, at 8:13 PM, Li, Aubrey <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> On 2020/7/22 16:54, benbjiang(蒋彪) wrote:
> >>>>>>>>>> Hi, Aubrey,
> >>>>>>>>>>
> >>>>>>>>>>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> From: Aubrey Li <[email protected]>
> >>>>>>>>>>>
> >>>>>>>>>>> - Don't migrate if there is a cookie mismatch
> >>>>>>>>>>> Load balance tries to move task from busiest CPU to the
> >>>>>>>>>>> destination CPU. When core scheduling is enabled, if the
> >>>>>>>>>>> task's cookie does not match with the destination CPU's
> >>>>>>>>>>> core cookie, this task will be skipped by this CPU. This
> >>>>>>>>>>> mitigates the forced idle time on the destination CPU.
> >>>>>>>>>>>
> >>>>>>>>>>> - Select cookie matched idle CPU
> >>>>>>>>>>> In the fast path of task wakeup, select the first cookie matched
> >>>>>>>>>>> idle CPU instead of the first idle CPU.
> >>>>>>>>>>>
> >>>>>>>>>>> - Find cookie matched idlest CPU
> >>>>>>>>>>> In the slow path of task wakeup, find the idlest CPU whose core
> >>>>>>>>>>> cookie matches with task's cookie
> >>>>>>>>>>>
> >>>>>>>>>>> - Don't migrate task if cookie not match
> >>>>>>>>>>> For the NUMA load balance, don't migrate task to the CPU whose
> >>>>>>>>>>> core cookie does not match with task's cookie
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Aubrey Li <[email protected]>
> >>>>>>>>>>> Signed-off-by: Tim Chen <[email protected]>
> >>>>>>>>>>> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
> >>>>>>>>>>> ---
> >>>>>>>>>>> kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++----
> >>>>>>>>>>> kernel/sched/sched.h | 29 ++++++++++++++++++++
> >>>>>>>>>>> 2 files changed, 88 insertions(+), 5 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>>>>>>>>> index d16939766361..33dc4bf01817 100644
> >>>>>>>>>>> --- a/kernel/sched/fair.c
> >>>>>>>>>>> +++ b/kernel/sched/fair.c
> >>>>>>>>>>> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
> >>>>>>>>>>> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
> >>>>>>>>>>> continue;
> >>>>>>>>>>>
> >>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
> >>>>>>>>>>> + /*
> >>>>>>>>>>> + * Skip this cpu if source task's cookie does not match
> >>>>>>>>>>> + * with CPU's core cookie.
> >>>>>>>>>>> + */
> >>>>>>>>>>> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
> >>>>>>>>>>> + continue;
> >>>>>>>>>>> +#endif
> >>>>>>>>>>> +
> >>>>>>>>>>> env->dst_cpu = cpu;
> >>>>>>>>>>> if (task_numa_compare(env, taskimp, groupimp, maymove))
> >>>>>>>>>>> break;
> >>>>>>>>>>> @@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
> >>>>>>>>>>>
> >>>>>>>>>>> /* Traverse only the allowed CPUs */
> >>>>>>>>>>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
> >>>>>>>>>>> + struct rq *rq = cpu_rq(i);
> >>>>>>>>>>> +
> >>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
> >>>>>>>>>>> + if (!sched_core_cookie_match(rq, p))
> >>>>>>>>>>> + continue;
> >>>>>>>>>>> +#endif
> >>>>>>>>>>> +
> >>>>>>>>>>> if (sched_idle_cpu(i))
> >>>>>>>>>>> return i;
> >>>>>>>>>>>
> >>>>>>>>>>> if (available_idle_cpu(i)) {
> >>>>>>>>>>> - struct rq *rq = cpu_rq(i);
> >>>>>>>>>>> struct cpuidle_state *idle = idle_get_state(rq);
> >>>>>>>>>>> if (idle && idle->exit_latency < min_exit_latency) {
> >>>>>>>>>>> /*
> >>>>>>>>>>> @@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >>>>>>>>>>> for_each_cpu_wrap(cpu, cpus, target) {
> >>>>>>>>>>> if (!--nr)
> >>>>>>>>>>> return -1;
> >>>>>>>>>>> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> >>>>>>>>>>> - break;
> >>>>>>>>>>> +
> >>>>>>>>>>> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
> >>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
> >>>>>>>>>>> + /*
> >>>>>>>>>>> + * If Core Scheduling is enabled, select this cpu
> >>>>>>>>>>> + * only if the process cookie matches core cookie.
> >>>>>>>>>>> + */
> >>>>>>>>>>> + if (sched_core_enabled(cpu_rq(cpu)) &&
> >>>>>>>>>>> + p->core_cookie == cpu_rq(cpu)->core->core_cookie)
> >>>>>>>>>> Why not also add similar logic in select_idle_smt to reduce forced-idle? :)
> >>>>>>>>> We hit select_idle_smt after we scaned the entire LLC domain for idle cores
> >>>>>>>>> and idle cpus and failed,so IMHO, an idle smt is probably a good choice under
> >>>>>>>>> this scenario.
> >>>>>>>>
> >>>>>>>> AFAIC, selecting idle sibling with unmatched cookie will cause unnecessary fored-idle, unfairness and latency, compared to choosing *target* cpu.
> >>>>>>> Choosing target cpu could increase the runnable task number on the target runqueue, this
> >>>>>>> could trigger busiest->nr_running > 1 logic and makes the idle sibling trying to pull but
> >>>>>>> not success(due to cookie not match). Putting task to the idle sibling is relatively stable IMHO.
> >>>>>>
> >>>>>> I’m afraid that *unsuccessful* pullings between smts would not result in unstableness, because
> >>>>>> the load-balance always do periodicly , and unsuccess means nothing happen.
> >>>>> unsuccess pulling means more unnecessary overhead in load balance.
> >>>>>
> >>>>>> On the contrary, unmatched sibling tasks running concurrently could bring forced-idle to each other repeatedly,
> >>>>>> Which is more unstable, and more costly when pick_next_task for all siblings.
> >>>>> Not worse than two tasks ping-pong on the same target run queue I guess, and better if
> >>>>> - task1(cookie A) is running on the target, and task2(cookie B) in the runqueue,
> >>>>> - task3(cookie B) coming
> >>>>>
> >>>>> If task3 chooses target's sibling, it could have a chance to run concurrently with task2.
> >>>>> But if task3 chooses target, it will wait for next pulling luck of load balancer
> >>>> That’s more interesting. :)
> >>>> Distributing different cookie tasks onto different cpus(or cpusets) could be the *ideal stable status* we want, as I understood.
> >>>> Different cookie tasks running on sibling smts could hurt performance, and that should be avoided with best effort.
> >>> We already tried to avoid when we scan idle cores and idle cpus in llc domain.
> >>
> >> I’m afraid that’s not enough either, :)
> >> 1. Scanning Idle cpus is not a full scan, there is limit according to scan cost.
> >> 2. That's only trying at the *core/cpu* level, *SMT* level should be considered too.
> >>
> >>>
> >>>> For above case, selecting idle sibling cpu can improve the concurrency indeed, but it decrease the imbalance for load-balancer.
> >>>> In that case, load-balancer could not notice the imbalance, and would do nothing to improve the unmatched situation.
> >>>> On the contrary, choosing the *target* cpu could enhance the imbalance, and load-balancer could try to pull unmatched task away,
> >>> Pulling away to where needs another bunch of elaboration.
> >>
> >> Still with the SMT2+3tasks case,
> >> if *idle sibling* chosen,
> >> Smt1’s load = task1+task2, smt2’s load = task3. Task3 will run intermittently because of forced-idle,
> >> so smt2’s real load could low enough, that it could not be pulled away forever. That’s indeed a stable state,
> >> but with performance at a discount.
> >>
> >> If *target sibling* chose,
> >> Smt1’s load = task1+task2+task3, smt2’s load=0. It’s a obvious imbalance, and load-balancer will pick a task to pull,
> >> 1. If task1(cookie A) picked, that’s done for good.
> >> 2. If task2(cookie B) or task3(cookie B) picked, that’s ok too, the rest task(cookie B) could be pulled away at next balance(maybe need to improve the pulling to tend to pull matched task more aggressively).
> >> And then, we may reach a more stable state *globally* without performance discount.
> >
> > I'm not sure what you mean pulled away,
> I mean pulled away by other cpus, may be triggered by idle balance or periodic balance on other cpus.
>
> > - if you mean pulled away from this core, cookieA in idle sibling case can be
> > pulled away too.
> Yep, cookieA(task1) in idle sibling case could be pulled away, but
> cookieB(task3) on the smt2 could never get the chance being pulled
> away(unless being waken up).
> If cookieA(task1) failed being pulled(cookieB(task2) on smt1 may be pulled,
> 50% chance), cookieA(task1) and cookieB(task3) would reach the stable state
> with performance discount.
>
If you meant pulled away from this core, I don't see how the two cases are
different either. For example, when task2 (cookieB) runs on SMT1, task3
(cookieB) can be pulled to SMT2; and when task1 (cookieA) switches onto SMT1,
task2 (cookieB) can be pulled away by other cpus, too.

Thanks,
-Aubrey

2020-07-24 01:28:40

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)

Hi,

> On Jul 24, 2020, at 7:43 AM, Aubrey Li <[email protected]> wrote:
>
> On Thu, Jul 23, 2020 at 4:28 PM benbjiang(蒋彪) <[email protected]> wrote:
>>
>> Hi,
>>
>>> On Jul 23, 2020, at 4:06 PM, Li, Aubrey <[email protected]> wrote:
>>>
>>> On 2020/7/23 15:47, benbjiang(蒋彪) wrote:
>>>> Hi,
>>>>
>>>>> On Jul 23, 2020, at 1:39 PM, Li, Aubrey <[email protected]> wrote:
>>>>>
>>>>> On 2020/7/23 12:23, benbjiang(蒋彪) wrote:
>>>>>> Hi,
>>>>>>> On Jul 23, 2020, at 11:35 AM, Li, Aubrey <[email protected]> wrote:
>>>>>>>
>>>>>>> On 2020/7/23 10:42, benbjiang(蒋彪) wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>> On Jul 23, 2020, at 9:57 AM, Li, Aubrey <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On 2020/7/22 22:32, benbjiang(蒋彪) wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>>> On Jul 22, 2020, at 8:13 PM, Li, Aubrey <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 2020/7/22 16:54, benbjiang(蒋彪) wrote:
>>>>>>>>>>>> Hi, Aubrey,
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> From: Aubrey Li <[email protected]>
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Don't migrate if there is a cookie mismatch
>>>>>>>>>>>>> Load balance tries to move task from busiest CPU to the
>>>>>>>>>>>>> destination CPU. When core scheduling is enabled, if the
>>>>>>>>>>>>> task's cookie does not match with the destination CPU's
>>>>>>>>>>>>> core cookie, this task will be skipped by this CPU. This
>>>>>>>>>>>>> mitigates the forced idle time on the destination CPU.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Select cookie matched idle CPU
>>>>>>>>>>>>> In the fast path of task wakeup, select the first cookie matched
>>>>>>>>>>>>> idle CPU instead of the first idle CPU.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Find cookie matched idlest CPU
>>>>>>>>>>>>> In the slow path of task wakeup, find the idlest CPU whose core
>>>>>>>>>>>>> cookie matches with task's cookie
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Don't migrate task if cookie not match
>>>>>>>>>>>>> For the NUMA load balance, don't migrate task to the CPU whose
>>>>>>>>>>>>> core cookie does not match with task's cookie
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Aubrey Li <[email protected]>
>>>>>>>>>>>>> Signed-off-by: Tim Chen <[email protected]>
>>>>>>>>>>>>> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>> kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++----
>>>>>>>>>>>>> kernel/sched/sched.h | 29 ++++++++++++++++++++
>>>>>>>>>>>>> 2 files changed, 88 insertions(+), 5 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>>>>>>>>> index d16939766361..33dc4bf01817 100644
>>>>>>>>>>>>> --- a/kernel/sched/fair.c
>>>>>>>>>>>>> +++ b/kernel/sched/fair.c
>>>>>>>>>>>>> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>>>>>>>>>>>> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>>>>>>>>>>>> continue;
>>>>>>>>>>>>>
>>>>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>>>>>> + /*
>>>>>>>>>>>>> + * Skip this cpu if source task's cookie does not match
>>>>>>>>>>>>> + * with CPU's core cookie.
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>>>>>>>>>>> + continue;
>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>> +
>>>>>>>>>>>>> env->dst_cpu = cpu;
>>>>>>>>>>>>> if (task_numa_compare(env, taskimp, groupimp, maymove))
>>>>>>>>>>>>> break;
>>>>>>>>>>>>> @@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>>>>>>>>>>>>
>>>>>>>>>>>>> /* Traverse only the allowed CPUs */
>>>>>>>>>>>>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>>>>>>>>>>>>> + struct rq *rq = cpu_rq(i);
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>>>>>> + if (!sched_core_cookie_match(rq, p))
>>>>>>>>>>>>> + continue;
>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>> +
>>>>>>>>>>>>> if (sched_idle_cpu(i))
>>>>>>>>>>>>> return i;
>>>>>>>>>>>>>
>>>>>>>>>>>>> if (available_idle_cpu(i)) {
>>>>>>>>>>>>> - struct rq *rq = cpu_rq(i);
>>>>>>>>>>>>> struct cpuidle_state *idle = idle_get_state(rq);
>>>>>>>>>>>>> if (idle && idle->exit_latency < min_exit_latency) {
>>>>>>>>>>>>> /*
>>>>>>>>>>>>> @@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>>>>>>>>>>> for_each_cpu_wrap(cpu, cpus, target) {
>>>>>>>>>>>>> if (!--nr)
>>>>>>>>>>>>> return -1;
>>>>>>>>>>>>> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>>>>>>>>>>>>> - break;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>>>>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>>>>>> + /*
>>>>>>>>>>>>> + * If Core Scheduling is enabled, select this cpu
>>>>>>>>>>>>> + * only if the process cookie matches core cookie.
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> + if (sched_core_enabled(cpu_rq(cpu)) &&
>>>>>>>>>>>>> + p->core_cookie == cpu_rq(cpu)->core->core_cookie)
>>>>>>>>>>>> Why not also add similar logic in select_idle_smt to reduce forced-idle? :)
>>>>>>>>>>> We hit select_idle_smt after we scaned the entire LLC domain for idle cores
>>>>>>>>>>> and idle cpus and failed,so IMHO, an idle smt is probably a good choice under
>>>>>>>>>>> this scenario.
>>>>>>>>>>
>>>>>>>>>> AFAIC, selecting idle sibling with unmatched cookie will cause unnecessary fored-idle, unfairness and latency, compared to choosing *target* cpu.
>>>>>>>>> Choosing target cpu could increase the runnable task number on the target runqueue, this
>>>>>>>>> could trigger busiest->nr_running > 1 logic and makes the idle sibling trying to pull but
>>>>>>>>> not success(due to cookie not match). Putting task to the idle sibling is relatively stable IMHO.
>>>>>>>>
>>>>>>>> I’m afraid that *unsuccessful* pullings between smts would not result in unstableness, because
>>>>>>>> the load-balance always do periodicly , and unsuccess means nothing happen.
>>>>>>> unsuccess pulling means more unnecessary overhead in load balance.
>>>>>>>
>>>>>>>> On the contrary, unmatched sibling tasks running concurrently could bring forced-idle to each other repeatedly,
>>>>>>>> Which is more unstable, and more costly when pick_next_task for all siblings.
>>>>>>> Not worse than two tasks ping-pong on the same target run queue I guess, and better if
>>>>>>> - task1(cookie A) is running on the target, and task2(cookie B) in the runqueue,
>>>>>>> - task3(cookie B) coming
>>>>>>>
>>>>>>> If task3 chooses target's sibling, it could have a chance to run concurrently with task2.
>>>>>>> But if task3 chooses target, it will wait for next pulling luck of load balancer
>>>>>> That’s more interesting. :)
>>>>>> Distributing different cookie tasks onto different cpus(or cpusets) could be the *ideal stable status* we want, as I understood.
>>>>>> Different cookie tasks running on sibling smts could hurt performance, and that should be avoided with best effort.
>>>>> We already tried to avoid when we scan idle cores and idle cpus in llc domain.
>>>>
>>>> I’m afraid that’s not enough either, :)
>>>> 1. Scanning Idle cpus is not a full scan, there is limit according to scan cost.
>>>> 2. That's only trying at the *core/cpu* level, *SMT* level should be considered too.
>>>>
>>>>>
>>>>>> For above case, selecting idle sibling cpu can improve the concurrency indeed, but it decrease the imbalance for load-balancer.
>>>>>> In that case, load-balancer could not notice the imbalance, and would do nothing to improve the unmatched situation.
>>>>>> On the contrary, choosing the *target* cpu could enhance the imbalance, and load-balancer could try to pull unmatched task away,
>>>>> Pulling away to where needs another bunch of elaboration.
>>>>
>>>> Still with the SMT2+3tasks case,
>>>> if *idle sibling* chosen,
>>>> Smt1’s load = task1+task2, smt2’s load = task3. Task3 will run intermittently because of forced-idle,
>>>> so smt2’s real load could low enough, that it could not be pulled away forever. That’s indeed a stable state,
>>>> but with performance at a discount.
>>>>
>>>> If *target sibling* chose,
>>>> Smt1’s load = task1+task2+task3, smt2’s load=0. It’s a obvious imbalance, and load-balancer will pick a task to pull,
>>>> 1. If task1(cookie A) picked, that’s done for good.
>>>> 2. If task2(cookie B) or task3(cookie B) picked, that’s ok too, the rest task(cookie B) could be pulled away at next balance(maybe need to improve the pulling to tend to pull matched task more aggressively).
>>>> And then, we may reach a more stable state *globally* without performance discount.
>>>
>>> I'm not sure what you mean pulled away,
>> I mean pulled away by other cpus, may be triggered by idle balance or periodic balance on other cpus.
>>
>>> - if you mean pulled away from this core, cookieA in idle sibling case can be
>>> pulled away too.
>> Yep, cookieA(task1) in idle sibling case could be pulled away, but
>> cookieB(task3) on the smt2 could never get the chance being pulled
>> away(unless being waken up).
>> If cookieA(task1) failed being pulled(cookieB(task2) on smt1 may be pulled,
>> 50% chance), cookieA(task1) and cookieB(task3) would reach the stable state
>> with performance discount.
>>
> If you meant pulled away from this core, I didn't see how two cases are
> different either. For example, when task2(cookieB) runs on SMT1, task3
> cookieb can be pulled to SMT2. and when task1(cookieA) switch onto SMT1,
> task2(cookieB) can be pulled away by other cpus, too.
That’s the case only if SMT2’s pull happens while task2 (cookieB) is running
on SMT1, which depends on:
1. SMT2 not having entered tickless, or nohz_balancer_kick picking SMT2 before
another cpu pulls, which may be unlikely. :)
2. Task1 (cookieA) not running on SMT1.
Otherwise it would be the case I described above.

Besides, for other cases, like SMT2 + 2 tasks (cookieA + cookieB), picking the *target*
cpu instead of the *idle sibling* could be more helpful in reaching the globally stable
status (distributing different cookies onto different cores).

Thx.
Regards,
Jiang

>
> Thanks,
> -Aubrey

2020-07-24 02:06:13

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)

On 2020/7/24 9:26, benbjiang(蒋彪) wrote:
> Hi,
>
>> On Jul 24, 2020, at 7:43 AM, Aubrey Li <[email protected]> wrote:
>>
>> On Thu, Jul 23, 2020 at 4:28 PM benbjiang(蒋彪) <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>>> On Jul 23, 2020, at 4:06 PM, Li, Aubrey <[email protected]> wrote:
>>>>
>>>> On 2020/7/23 15:47, benbjiang(蒋彪) wrote:
>>>>> Hi,
>>>>>
>>>>>> On Jul 23, 2020, at 1:39 PM, Li, Aubrey <[email protected]> wrote:
>>>>>>
>>>>>> On 2020/7/23 12:23, benbjiang(蒋彪) wrote:
>>>>>>> Hi,
>>>>>>>> On Jul 23, 2020, at 11:35 AM, Li, Aubrey <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 2020/7/23 10:42, benbjiang(蒋彪) wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>> On Jul 23, 2020, at 9:57 AM, Li, Aubrey <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> On 2020/7/22 22:32, benbjiang(蒋彪) wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>>> On Jul 22, 2020, at 8:13 PM, Li, Aubrey <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 2020/7/22 16:54, benbjiang(蒋彪) wrote:
>>>>>>>>>>>>> Hi, Aubrey,
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> From: Aubrey Li <[email protected]>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Don't migrate if there is a cookie mismatch
>>>>>>>>>>>>>> Load balance tries to move task from busiest CPU to the
>>>>>>>>>>>>>> destination CPU. When core scheduling is enabled, if the
>>>>>>>>>>>>>> task's cookie does not match with the destination CPU's
>>>>>>>>>>>>>> core cookie, this task will be skipped by this CPU. This
>>>>>>>>>>>>>> mitigates the forced idle time on the destination CPU.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Select cookie matched idle CPU
>>>>>>>>>>>>>> In the fast path of task wakeup, select the first cookie matched
>>>>>>>>>>>>>> idle CPU instead of the first idle CPU.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Find cookie matched idlest CPU
>>>>>>>>>>>>>> In the slow path of task wakeup, find the idlest CPU whose core
>>>>>>>>>>>>>> cookie matches with task's cookie
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Don't migrate task if cookie not match
>>>>>>>>>>>>>> For the NUMA load balance, don't migrate task to the CPU whose
>>>>>>>>>>>>>> core cookie does not match with task's cookie
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Aubrey Li <[email protected]>
>>>>>>>>>>>>>> Signed-off-by: Tim Chen <[email protected]>
>>>>>>>>>>>>>> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>> kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++----
>>>>>>>>>>>>>> kernel/sched/sched.h | 29 ++++++++++++++++++++
>>>>>>>>>>>>>> 2 files changed, 88 insertions(+), 5 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>>>>>>>>>> index d16939766361..33dc4bf01817 100644
>>>>>>>>>>>>>> --- a/kernel/sched/fair.c
>>>>>>>>>>>>>> +++ b/kernel/sched/fair.c
>>>>>>>>>>>>>> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>>>>>>>>>>>>> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>>>>>>>>>>>>> continue;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>> + * Skip this cpu if source task's cookie does not match
>>>>>>>>>>>>>> + * with CPU's core cookie.
>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>>>>>>>>>>>> + continue;
>>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> env->dst_cpu = cpu;
>>>>>>>>>>>>>> if (task_numa_compare(env, taskimp, groupimp, maymove))
>>>>>>>>>>>>>> break;
>>>>>>>>>>>>>> @@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /* Traverse only the allowed CPUs */
>>>>>>>>>>>>>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>>>>>>>>>>>>>> + struct rq *rq = cpu_rq(i);
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>>>>>>> + if (!sched_core_cookie_match(rq, p))
>>>>>>>>>>>>>> + continue;
>>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> if (sched_idle_cpu(i))
>>>>>>>>>>>>>> return i;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> if (available_idle_cpu(i)) {
>>>>>>>>>>>>>> - struct rq *rq = cpu_rq(i);
>>>>>>>>>>>>>> struct cpuidle_state *idle = idle_get_state(rq);
>>>>>>>>>>>>>> if (idle && idle->exit_latency < min_exit_latency) {
>>>>>>>>>>>>>> /*
>>>>>>>>>>>>>> @@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>>>>>>>>>>>> for_each_cpu_wrap(cpu, cpus, target) {
>>>>>>>>>>>>>> if (!--nr)
>>>>>>>>>>>>>> return -1;
>>>>>>>>>>>>>> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>>>>>>>>>>>>>> - break;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>>>>>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>> + * If Core Scheduling is enabled, select this cpu
>>>>>>>>>>>>>> + * only if the process cookie matches core cookie.
>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>> + if (sched_core_enabled(cpu_rq(cpu)) &&
>>>>>>>>>>>>>> + p->core_cookie == cpu_rq(cpu)->core->core_cookie)
>>>>>>>>>>>>> Why not also add similar logic in select_idle_smt to reduce forced-idle? :)
>>>>>>>>>>>> We hit select_idle_smt after we scaned the entire LLC domain for idle cores
>>>>>>>>>>>> and idle cpus and failed,so IMHO, an idle smt is probably a good choice under
>>>>>>>>>>>> this scenario.
>>>>>>>>>>>
>>>>>>>>>>> AFAIC, selecting idle sibling with unmatched cookie will cause unnecessary fored-idle, unfairness and latency, compared to choosing *target* cpu.
>>>>>>>>>> Choosing target cpu could increase the runnable task number on the target runqueue, this
>>>>>>>>>> could trigger busiest->nr_running > 1 logic and makes the idle sibling trying to pull but
>>>>>>>>>> not success(due to cookie not match). Putting task to the idle sibling is relatively stable IMHO.
>>>>>>>>>
>>>>>>>>> I’m afraid that *unsuccessful* pullings between smts would not result in unstableness, because
>>>>>>>>> the load-balance always do periodicly , and unsuccess means nothing happen.
>>>>>>>> unsuccess pulling means more unnecessary overhead in load balance.
>>>>>>>>
>>>>>>>>> On the contrary, unmatched sibling tasks running concurrently could bring forced-idle to each other repeatedly,
>>>>>>>>> Which is more unstable, and more costly when pick_next_task for all siblings.
>>>>>>>> Not worse than two tasks ping-pong on the same target run queue I guess, and better if
>>>>>>>> - task1(cookie A) is running on the target, and task2(cookie B) in the runqueue,
>>>>>>>> - task3(cookie B) coming
>>>>>>>>
>>>>>>>> If task3 chooses target's sibling, it could have a chance to run concurrently with task2.
>>>>>>>> But if task3 chooses target, it will wait for next pulling luck of load balancer
>>>>>>> That’s more interesting. :)
>>>>>>> Distributing different cookie tasks onto different cpus(or cpusets) could be the *ideal stable status* we want, as I understood.
>>>>>>> Different cookie tasks running on sibling smts could hurt performance, and that should be avoided with best effort.
>>>>>> We already tried to avoid when we scan idle cores and idle cpus in llc domain.
>>>>>
>>>>> I’m afraid that’s not enough either, :)
>>>>> 1. Scanning Idle cpus is not a full scan, there is limit according to scan cost.
>>>>> 2. That's only trying at the *core/cpu* level, *SMT* level should be considered too.
>>>>>
>>>>>>
>>>>>>> For above case, selecting idle sibling cpu can improve the concurrency indeed, but it decrease the imbalance for load-balancer.
>>>>>>> In that case, load-balancer could not notice the imbalance, and would do nothing to improve the unmatched situation.
>>>>>>> On the contrary, choosing the *target* cpu could enhance the imbalance, and load-balancer could try to pull unmatched task away,
>>>>>> Pulling away to where needs another bunch of elaboration.
>>>>>
>>>>> Still with the SMT2+3tasks case,
>>>>> if *idle sibling* chosen,
>>>>> Smt1’s load = task1+task2, smt2’s load = task3. Task3 will run intermittently because of forced-idle,
>>>>> so smt2’s real load could low enough, that it could not be pulled away forever. That’s indeed a stable state,
>>>>> but with performance at a discount.
>>>>>
>>>>> If *target sibling* chose,
>>>>> Smt1’s load = task1+task2+task3, smt2’s load=0. It’s a obvious imbalance, and load-balancer will pick a task to pull,
>>>>> 1. If task1(cookie A) picked, that’s done for good.
>>>>> 2. If task2(cookie B) or task3(cookie B) picked, that’s ok too, the rest task(cookie B) could be pulled away at next balance(maybe need to improve the pulling to tend to pull matched task more aggressively).
>>>>> And then, we may reach a more stable state *globally* without performance discount.
>>>>
>>>> I'm not sure what you mean pulled away,
>>> I mean pulled away by other cpus, may be triggered by idle balance or periodic balance on other cpus.
>>>
>>>> - if you mean pulled away from this core, cookieA in idle sibling case can be
>>>> pulled away too.
>>> Yep, cookieA(task1) in idle sibling case could be pulled away, but
>>> cookieB(task3) on the smt2 could never get the chance being pulled
>>> away(unless being waken up).
>>> If cookieA(task1) failed being pulled(cookieB(task2) on smt1 may be pulled,
>>> 50% chance), cookieA(task1) and cookieB(task3) would reach the stable state
>>> with performance discount.
>>>
>> If you meant pulled away from this core, I didn't see how two cases are
>> different either. For example, when task2(cookieB) runs on SMT1, task3
>> cookieb can be pulled to SMT2. and when task1(cookieA) switch onto SMT1,
>> task2(cookieB) can be pulled away by other cpus, too.
> That’s the case only if SMT2’s pulling happens when task2(cookieB) is running
> on SMT1, which depends on,
> 1. Smt2 not entering tickless or nohz_balancer_kick picks smt2 before other
> cpu’s pulling, may be unlikely. :)
> 2. Task1(cookieA) is not running on SMT1.
> otherwise it would be the case I described above.
>
> Besides, for other cases, like smt2+2task(CookieA+CookieB), picking *target*
> cpu instead of *idle sibling* could be more helpful to reach the global stable
> status(distribute different cookies onto different cores).
If the task number of the two cookies has a big difference, then distributing
different cookies onto different cores leads to a big imbalance. That state may
be stable but not optimal, and I guess that's why the core run queue does
not refuse different cookies onto its rb tree.

I think I understand your concern, but IMHO I'm not convinced that adding a cookie
match in idle SMT selection is the best choice. If you have some performance data
for your workload, that would be very helpful for understanding the case.

If distributing different cookies onto different cores is a hard requirement from
your side, you are welcome to submit a patch to see others' opinions.
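For reference, a rough sketch of what such a cookie check in select_idle_smt()
could look like, mirroring the select_idle_cpu() hunk quoted above (this is only
an illustration of the idea, not code from the posted series, and the loop body
is simplified):

	for_each_cpu(cpu, cpu_smt_mask(target)) {
		if (!cpumask_test_cpu(cpu, p->cpus_ptr))
			continue;
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
#ifdef CONFIG_SCHED_CORE
			/*
			 * Skip an idle sibling whose core cookie does not
			 * match the waking task's cookie.
			 */
			if (sched_core_enabled(cpu_rq(cpu)) &&
			    p->core_cookie != cpu_rq(cpu)->core->core_cookie)
				continue;
#endif
			return cpu;
		}
	}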

Thanks,
-Aubrey

2020-07-24 02:30:46

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)

Hi,

> On Jul 24, 2020, at 10:05 AM, Li, Aubrey <[email protected]> wrote:
>
> On 2020/7/24 9:26, benbjiang(蒋彪) wrote:
>> Hi,
>>
>>> On Jul 24, 2020, at 7:43 AM, Aubrey Li <[email protected]> wrote:
>>>
>>> On Thu, Jul 23, 2020 at 4:28 PM benbjiang(蒋彪) <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>>> On Jul 23, 2020, at 4:06 PM, Li, Aubrey <[email protected]> wrote:
>>>>>
>>>>> On 2020/7/23 15:47, benbjiang(蒋彪) wrote:
>>>>>> Hi,
>>>>>>
>>>>>>> On Jul 23, 2020, at 1:39 PM, Li, Aubrey <[email protected]> wrote:
>>>>>>>
>>>>>>> On 2020/7/23 12:23, benbjiang(蒋彪) wrote:
>>>>>>>> Hi,
>>>>>>>>> On Jul 23, 2020, at 11:35 AM, Li, Aubrey <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On 2020/7/23 10:42, benbjiang(蒋彪) wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>>> On Jul 23, 2020, at 9:57 AM, Li, Aubrey <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 2020/7/22 22:32, benbjiang(蒋彪) wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 22, 2020, at 8:13 PM, Li, Aubrey <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2020/7/22 16:54, benbjiang(蒋彪) wrote:
>>>>>>>>>>>>>> Hi, Aubrey,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From: Aubrey Li <[email protected]>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Don't migrate if there is a cookie mismatch
>>>>>>>>>>>>>>> Load balance tries to move task from busiest CPU to the
>>>>>>>>>>>>>>> destination CPU. When core scheduling is enabled, if the
>>>>>>>>>>>>>>> task's cookie does not match with the destination CPU's
>>>>>>>>>>>>>>> core cookie, this task will be skipped by this CPU. This
>>>>>>>>>>>>>>> mitigates the forced idle time on the destination CPU.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Select cookie matched idle CPU
>>>>>>>>>>>>>>> In the fast path of task wakeup, select the first cookie matched
>>>>>>>>>>>>>>> idle CPU instead of the first idle CPU.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Find cookie matched idlest CPU
>>>>>>>>>>>>>>> In the slow path of task wakeup, find the idlest CPU whose core
>>>>>>>>>>>>>>> cookie matches with task's cookie
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Don't migrate task if cookie not match
>>>>>>>>>>>>>>> For the NUMA load balance, don't migrate task to the CPU whose
>>>>>>>>>>>>>>> core cookie does not match with task's cookie
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Aubrey Li <[email protected]>
>>>>>>>>>>>>>>> Signed-off-by: Tim Chen <[email protected]>
>>>>>>>>>>>>>>> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>> kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++----
>>>>>>>>>>>>>>> kernel/sched/sched.h | 29 ++++++++++++++++++++
>>>>>>>>>>>>>>> 2 files changed, 88 insertions(+), 5 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>>>>>>>>>>> index d16939766361..33dc4bf01817 100644
>>>>>>>>>>>>>>> --- a/kernel/sched/fair.c
>>>>>>>>>>>>>>> +++ b/kernel/sched/fair.c
>>>>>>>>>>>>>>> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>>>>>>>>>>>>>> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
>>>>>>>>>>>>>>> continue;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>>> + * Skip this cpu if source task's cookie does not match
>>>>>>>>>>>>>>> + * with CPU's core cookie.
>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>>>>>>>>>>>>>> + continue;
>>>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> env->dst_cpu = cpu;
>>>>>>>>>>>>>>> if (task_numa_compare(env, taskimp, groupimp, maymove))
>>>>>>>>>>>>>>> break;
>>>>>>>>>>>>>>> @@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /* Traverse only the allowed CPUs */
>>>>>>>>>>>>>>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>>>>>>>>>>>>>>> + struct rq *rq = cpu_rq(i);
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>>>>>>>> + if (!sched_core_cookie_match(rq, p))
>>>>>>>>>>>>>>> + continue;
>>>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> if (sched_idle_cpu(i))
>>>>>>>>>>>>>>> return i;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> if (available_idle_cpu(i)) {
>>>>>>>>>>>>>>> - struct rq *rq = cpu_rq(i);
>>>>>>>>>>>>>>> struct cpuidle_state *idle = idle_get_state(rq);
>>>>>>>>>>>>>>> if (idle && idle->exit_latency < min_exit_latency) {
>>>>>>>>>>>>>>> /*
>>>>>>>>>>>>>>> @@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>>>>>>>>>>>>> for_each_cpu_wrap(cpu, cpus, target) {
>>>>>>>>>>>>>>> if (!--nr)
>>>>>>>>>>>>>>> return -1;
>>>>>>>>>>>>>>> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>>>>>>>>>>>>>>> - break;
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>>>>>>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE
>>>>>>>>>>>>>>> + /*
>>>>>>>>>>>>>>> + * If Core Scheduling is enabled, select this cpu
>>>>>>>>>>>>>>> + * only if the process cookie matches core cookie.
>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>> + if (sched_core_enabled(cpu_rq(cpu)) &&
>>>>>>>>>>>>>>> + p->core_cookie == cpu_rq(cpu)->core->core_cookie)
>>>>>>>>>>>>>> Why not also add similar logic in select_idle_smt to reduce forced-idle? :)
>>>>>>>>>>>>> We hit select_idle_smt after we scaned the entire LLC domain for idle cores
>>>>>>>>>>>>> and idle cpus and failed,so IMHO, an idle smt is probably a good choice under
>>>>>>>>>>>>> this scenario.
>>>>>>>>>>>>
>>>>>>>>>>>> AFAIC, selecting idle sibling with unmatched cookie will cause unnecessary fored-idle, unfairness and latency, compared to choosing *target* cpu.
>>>>>>>>>>> Choosing target cpu could increase the runnable task number on the target runqueue, this
>>>>>>>>>>> could trigger busiest->nr_running > 1 logic and makes the idle sibling trying to pull but
>>>>>>>>>>> not success(due to cookie not match). Putting task to the idle sibling is relatively stable IMHO.
>>>>>>>>>>
>>>>>>>>>> I’m afraid that *unsuccessful* pullings between smts would not result in unstableness, because
>>>>>>>>>> the load-balance always do periodicly , and unsuccess means nothing happen.
>>>>>>>>> unsuccess pulling means more unnecessary overhead in load balance.
>>>>>>>>>
>>>>>>>>>> On the contrary, unmatched sibling tasks running concurrently could bring forced-idle to each other repeatedly,
>>>>>>>>>> Which is more unstable, and more costly when pick_next_task for all siblings.
>>>>>>>>> Not worse than two tasks ping-pong on the same target run queue I guess, and better if
>>>>>>>>> - task1(cookie A) is running on the target, and task2(cookie B) in the runqueue,
>>>>>>>>> - task3(cookie B) coming
>>>>>>>>>
>>>>>>>>> If task3 chooses target's sibling, it could have a chance to run concurrently with task2.
>>>>>>>>> But if task3 chooses target, it will wait for next pulling luck of load balancer
>>>>>>>> That’s more interesting. :)
>>>>>>>> Distributing different cookie tasks onto different cpus(or cpusets) could be the *ideal stable status* we want, as I understood.
>>>>>>>> Different cookie tasks running on sibling smts could hurt performance, and that should be avoided with best effort.
>>>>>>> We already tried to avoid when we scan idle cores and idle cpus in llc domain.
>>>>>>
>>>>>> I’m afraid that’s not enough either, :)
>>>>>> 1. Scanning Idle cpus is not a full scan, there is limit according to scan cost.
>>>>>> 2. That's only trying at the *core/cpu* level, *SMT* level should be considered too.
>>>>>>
>>>>>>>
>>>>>>>> For above case, selecting idle sibling cpu can improve the concurrency indeed, but it decrease the imbalance for load-balancer.
>>>>>>>> In that case, load-balancer could not notice the imbalance, and would do nothing to improve the unmatched situation.
>>>>>>>> On the contrary, choosing the *target* cpu could enhance the imbalance, and load-balancer could try to pull unmatched task away,
>>>>>>> Pulling away to where needs another bunch of elaboration.
>>>>>>
>>>>>> Still with the SMT2+3tasks case,
>>>>>> if *idle sibling* chosen,
>>>>>> Smt1’s load = task1+task2, smt2’s load = task3. Task3 will run intermittently because of forced-idle,
>>>>>> so smt2’s real load could low enough, that it could not be pulled away forever. That’s indeed a stable state,
>>>>>> but with performance at a discount.
>>>>>>
>>>>>> If *target sibling* chose,
>>>>>> Smt1’s load = task1+task2+task3, smt2’s load=0. It’s a obvious imbalance, and load-balancer will pick a task to pull,
>>>>>> 1. If task1(cookie A) picked, that’s done for good.
>>>>>> 2. If task2(cookie B) or task3(cookie B) picked, that’s ok too, the rest task(cookie B) could be pulled away at next balance(maybe need to improve the pulling to tend to pull matched task more aggressively).
>>>>>> And then, we may reach a more stable state *globally* without performance discount.
>>>>>
>>>>> I'm not sure what you mean pulled away,
>>>> I mean pulled away by other cpus, may be triggered by idle balance or periodic balance on other cpus.
>>>>
>>>>> - if you mean pulled away from this core, cookieA in idle sibling case can be
>>>>> pulled away too.
>>>> Yep, cookieA(task1) in idle sibling case could be pulled away, but
>>>> cookieB(task3) on the smt2 could never get the chance being pulled
>>>> away(unless being waken up).
>>>> If cookieA(task1) failed being pulled(cookieB(task2) on smt1 may be pulled,
>>>> 50% chance), cookieA(task1) and cookieB(task3) would reach the stable state
>>>> with performance discount.
>>>>
>>> If you meant pulled away from this core, I didn't see how two cases are
>>> different either. For example, when task2(cookieB) runs on SMT1, task3
>>> cookieb can be pulled to SMT2. and when task1(cookieA) switch onto SMT1,
>>> task2(cookieB) can be pulled away by other cpus, too.
>> That’s the case only if SMT2’s pulling happens when task2(cookieB) is running
>> on SMT1, which depends on,
>> 1. Smt2 not entering tickless or nohz_balancer_kick picks smt2 before other
>> cpu’s pulling, may be unlikely. :)
>> 2. Task1(cookieA) is not running on SMT1.
>> otherwise it would be the case I described above.
>>
>> Besides, for other cases, like smt2+2task(CookieA+CookieB), picking *target*
>> cpu instead of *idle sibling* could be more helpful to reach the global stable
>> status(distribute different cookies onto different cores).
> If the task number of two cookies has a big difference, then distributing
> different cookies onto different cores leads to a big imbalance, that state may
> be stable but not an optimal state, I guess that's why core run queue does
> not refuse different cookies onto its rb tree.
That’s the overcommit case, in which distributing is not possible, and it would
fall back to the *local stable state* (as the current implementation does) too.

If the total task number does not overcommit the cores, distributing could work. :)

>
> I think I understand your concern but IMHO I'm not convinced adding cookie match
> in idle SMT selection is a best choice, if you have some performance data of your
> workload, that will be very helpful to understand the case.
Got it, I’ll try later.

>
> If distributing different cookies onto different cores is a hard requirement from
> your side, you are welcome to submit a patch to see others opinion.

Thanks for your patience. If possible, could you please loop me in for future
discussions about core-scheduling?

Thx.
Regard,
Jiang

>
> Thanks,
> -Aubrey

2020-07-31 16:42:32

by Vineeth Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] Core scheduling v6

On 20/07/26 06:49AM, Vineeth Pillai wrote:
>
>
> Sixth iteration of the Core-Scheduling feature.
>
I am no longer with DigitalOcean. Kindly use this email address for all
future responses.

Thanks,
Vineeth

2020-08-03 08:26:41

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] Core scheduling v6

On 2020/7/1 5:32, Vineeth Remanan Pillai wrote:
> Sixth iteration of the Core-Scheduling feature.
>
> Core scheduling is a feature that allows only trusted tasks to run
> concurrently on cpus sharing compute resources (eg: hyperthreads on a
> core). The goal is to mitigate the core-level side-channel attacks
> without requiring to disable SMT (which has a significant impact on
> performance in some situations). Core scheduling (as of v6) mitigates
> user-space to user-space attacks and user to kernel attack when one of
> the siblings enters the kernel via interrupts. It is still possible to
> have a task attack the sibling thread when it enters the kernel via
> syscalls.
>
> By default, the feature doesn't change any of the current scheduler
> behavior. The user decides which tasks can run simultaneously on the
> same core (for now by having them in the same tagged cgroup). When a
> tag is enabled in a cgroup and a task from that cgroup is running on a
> hardware thread, the scheduler ensures that only idle or trusted tasks
> run on the other sibling(s). Besides security concerns, this feature
> can also be beneficial for RT and performance applications where we
> want to control how tasks make use of SMT dynamically.
>
> This iteration is mostly a cleanup of v5 except for a major feature of
> pausing sibling when a cpu enters kernel via nmi/irq/softirq. Also
> introducing documentation and includes minor crash fixes.
>
> One major cleanup was removing the hotplug support and related code.
> The hotplug related crashes were not documented and the fixes piled up
> over time leading to complex code. We were not able to reproduce the
> crashes in the limited testing done. But if they are reroducable, we
> don't want to hide them. We should document them and design better
> fixes if any.
>
> In terms of performance, the results in this release are similar to
> v5. On a x86 system with N hardware threads:
> - if only N/2 hardware threads are busy, the performance is similar
> between baseline, corescheduling and nosmt
> - if N hardware threads are busy with N different corescheduling
> groups, the impact of corescheduling is similar to nosmt
> - if N hardware threads are busy and multiple active threads share the
> same corescheduling cookie, they gain a performance improvement over
> nosmt.
> The specific performance impact depends on the workload, but for a
> really busy database 12-vcpu VM (1 coresched tag) running on a 36
> hardware threads NUMA node with 96 mostly idle neighbor VMs (each in
> their own coresched tag), the performance drops by 54% with
> corescheduling and drops by 90% with nosmt.
>

We found that uperf (in a cgroup) throughput drops by ~50% with corescheduling.

The problem is that uperf triggers a lot of softirqs and offloads softirq
service to the *ksoftirqd* thread.

- by default, the ksoftirqd thread can run with uperf on the same core; we saw
100% CPU utilization.
- with coresched enabled, ksoftirqd's core cookie is different from uperf's, so
they can't run concurrently on the same core; we saw ~15% forced idle.

I guess this kind of performance drop can be replicated by other similar
workloads (with a lot of softirq activity).

Currently the core scheduler picks cookie-matched tasks for all SMT siblings; does
it make sense to add a policy that allows cookie-compatible tasks to run together?
For example, if a task is trusted (set by an admin), it can run with kernel threads.
The difference from corescheduling being disabled is that we still have user-to-user
isolation.
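To illustrate the constraint (a minimal sketch only, not the helper from the
series; the function name here is made up), the sibling may only run a task
whose cookie matches the core-wide cookie, so a cookie-0 ksoftirqd can never
pair with a tagged uperf task and the sibling is forced idle instead:

	/* Sketch: strict cookie matching, as core scheduling enforces today. */
	static inline bool can_run_on_sibling(struct rq *rq, struct task_struct *p)
	{
		return !sched_core_enabled(rq) ||
		       rq->core->core_cookie == p->core_cookie;
	}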

Thanks,
-Aubrey






2020-08-03 16:54:14

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] Core scheduling v6

Hi Aubrey,

On Mon, Aug 3, 2020 at 4:23 AM Li, Aubrey <[email protected]> wrote:
>
> On 2020/7/1 5:32, Vineeth Remanan Pillai wrote:
> > Sixth iteration of the Core-Scheduling feature.
> >
> > Core scheduling is a feature that allows only trusted tasks to run
> > concurrently on cpus sharing compute resources (eg: hyperthreads on a
> > core). The goal is to mitigate the core-level side-channel attacks
> > without requiring to disable SMT (which has a significant impact on
> > performance in some situations). Core scheduling (as of v6) mitigates
> > user-space to user-space attacks and user to kernel attack when one of
> > the siblings enters the kernel via interrupts. It is still possible to
> > have a task attack the sibling thread when it enters the kernel via
> > syscalls.
> >
> > By default, the feature doesn't change any of the current scheduler
> > behavior. The user decides which tasks can run simultaneously on the
> > same core (for now by having them in the same tagged cgroup). When a
> > tag is enabled in a cgroup and a task from that cgroup is running on a
> > hardware thread, the scheduler ensures that only idle or trusted tasks
> > run on the other sibling(s). Besides security concerns, this feature
> > can also be beneficial for RT and performance applications where we
> > want to control how tasks make use of SMT dynamically.
> >
> > This iteration is mostly a cleanup of v5 except for a major feature of
> > pausing sibling when a cpu enters kernel via nmi/irq/softirq. Also
> > introducing documentation and includes minor crash fixes.
> >
> > One major cleanup was removing the hotplug support and related code.
> > The hotplug related crashes were not documented and the fixes piled up
> > over time leading to complex code. We were not able to reproduce the
> > crashes in the limited testing done. But if they are reroducable, we
> > don't want to hide them. We should document them and design better
> > fixes if any.
> >
> > In terms of performance, the results in this release are similar to
> > v5. On a x86 system with N hardware threads:
> > - if only N/2 hardware threads are busy, the performance is similar
> > between baseline, corescheduling and nosmt
> > - if N hardware threads are busy with N different corescheduling
> > groups, the impact of corescheduling is similar to nosmt
> > - if N hardware threads are busy and multiple active threads share the
> > same corescheduling cookie, they gain a performance improvement over
> > nosmt.
> > The specific performance impact depends on the workload, but for a
> > really busy database 12-vcpu VM (1 coresched tag) running on a 36
> > hardware threads NUMA node with 96 mostly idle neighbor VMs (each in
> > their own coresched tag), the performance drops by 54% with
> > corescheduling and drops by 90% with nosmt.
> >
>
> We found uperf(in cgroup) throughput drops by ~50% with corescheduling.
>
> The problem is, uperf triggered a lot of softirq and offloaded softirq
> service to *ksoftirqd* thread.
>
> - default, ksoftirqd thread can run with uperf on the same core, we saw
> 100% CPU utilization.
> - coresched enabled, ksoftirqd's core cookie is different from uperf, so
> they can't run concurrently on the same core, we saw ~15% forced idle.
>
> I guess this kind of performance drop can be replicated by other similar
> (a lot of softirq activities) workloads.
>
> Currently core scheduler picks cookie-match tasks for all SMT siblings, does
> it make sense we add a policy to allow cookie-compatible task running together?
> For example, if a task is trusted(set by admin), it can work with kernel thread.
> The difference from corescheduling disabled is that we still have user to user
> isolation.

In ChromeOS we are considering all cookie-0 tasks as trusted.
Basically if you don't trust a task, then that is when you assign the
task a tag. We do this for the sandboxed processes.

Is the uperf throughput worse with SMT+core-scheduling versus no-SMT?

thanks,

- Joel
PS: I am planning to write a patch behind a CONFIG option that tags
all processes (default untrusted) so everything gets a cookie, which
some folks said was what they wanted (a whitelist instead of a
blacklist).

2020-08-05 03:57:56

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] Core scheduling v6

On 2020/8/4 0:53, Joel Fernandes wrote:
> Hi Aubrey,
>
> On Mon, Aug 3, 2020 at 4:23 AM Li, Aubrey <[email protected]> wrote:
>>
>> On 2020/7/1 5:32, Vineeth Remanan Pillai wrote:
>>> Sixth iteration of the Core-Scheduling feature.
>>>
>>> Core scheduling is a feature that allows only trusted tasks to run
>>> concurrently on cpus sharing compute resources (eg: hyperthreads on a
>>> core). The goal is to mitigate the core-level side-channel attacks
>>> without requiring to disable SMT (which has a significant impact on
>>> performance in some situations). Core scheduling (as of v6) mitigates
>>> user-space to user-space attacks and user to kernel attack when one of
>>> the siblings enters the kernel via interrupts. It is still possible to
>>> have a task attack the sibling thread when it enters the kernel via
>>> syscalls.
>>>
>>> By default, the feature doesn't change any of the current scheduler
>>> behavior. The user decides which tasks can run simultaneously on the
>>> same core (for now by having them in the same tagged cgroup). When a
>>> tag is enabled in a cgroup and a task from that cgroup is running on a
>>> hardware thread, the scheduler ensures that only idle or trusted tasks
>>> run on the other sibling(s). Besides security concerns, this feature
>>> can also be beneficial for RT and performance applications where we
>>> want to control how tasks make use of SMT dynamically.
>>>
>>> This iteration is mostly a cleanup of v5 except for a major feature of
>>> pausing sibling when a cpu enters kernel via nmi/irq/softirq. Also
>>> introducing documentation and includes minor crash fixes.
>>>
>>> One major cleanup was removing the hotplug support and related code.
>>> The hotplug related crashes were not documented and the fixes piled up
>>> over time leading to complex code. We were not able to reproduce the
>>> crashes in the limited testing done. But if they are reroducable, we
>>> don't want to hide them. We should document them and design better
>>> fixes if any.
>>>
>>> In terms of performance, the results in this release are similar to
>>> v5. On a x86 system with N hardware threads:
>>> - if only N/2 hardware threads are busy, the performance is similar
>>> between baseline, corescheduling and nosmt
>>> - if N hardware threads are busy with N different corescheduling
>>> groups, the impact of corescheduling is similar to nosmt
>>> - if N hardware threads are busy and multiple active threads share the
>>> same corescheduling cookie, they gain a performance improvement over
>>> nosmt.
>>> The specific performance impact depends on the workload, but for a
>>> really busy database 12-vcpu VM (1 coresched tag) running on a 36
>>> hardware threads NUMA node with 96 mostly idle neighbor VMs (each in
>>> their own coresched tag), the performance drops by 54% with
>>> corescheduling and drops by 90% with nosmt.
>>>
>>
>> We found uperf(in cgroup) throughput drops by ~50% with corescheduling.
>>
>> The problem is, uperf triggered a lot of softirq and offloaded softirq
>> service to *ksoftirqd* thread.
>>
>> - default, ksoftirqd thread can run with uperf on the same core, we saw
>> 100% CPU utilization.
>> - coresched enabled, ksoftirqd's core cookie is different from uperf, so
>> they can't run concurrently on the same core, we saw ~15% forced idle.
>>
>> I guess this kind of performance drop can be replicated by other similar
>> (a lot of softirq activities) workloads.
>>
>> Currently core scheduler picks cookie-match tasks for all SMT siblings, does
>> it make sense we add a policy to allow cookie-compatible task running together?
>> For example, if a task is trusted(set by admin), it can work with kernel thread.
>> The difference from corescheduling disabled is that we still have user to user
>> isolation.
>
> In ChromeOS we are considering all cookie-0 tasks as trusted.
> Basically if you don't trust a task, then that is when you assign the
> task a tag. We do this for the sandboxed processes.

I have a proposal for this: change cpu.tag to cpu.coresched_policy,
something like the following:

+/*
+ * Core scheduling policy:
+ * - CORE_SCHED_DISABLED: core scheduling is disabled.
+ * - CORE_COOKIE_MATCH: tasks with same cookie can run
+ * on the same core concurrently.
+ * - CORE_COOKIE_TRUST: trusted task can run with kernel
+ * thread on the same core concurrently.
+ * - CORE_COOKIE_LONELY: tasks with cookie can run only
+ * with idle thread on the same core.
+ */
+enum coresched_policy {
+ CORE_SCHED_DISABLED,
+ CORE_SCHED_COOKIE_MATCH,
+ CORE_SCHED_COOKIE_TRUST,
+ CORE_SCHED_COOKIE_LONELY,
+};

We can set the policy of the uperf cgroup to CORE_COOKIE_TRUST and fix this kind
of performance regression. Not sure if this sounds attractive?
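As a rough illustration of how the TRUST policy could relax the cookie check
(a hypothetical sketch; the coresched_policy field on the core rq and the
helper name are made up for this example and are not part of the series):

	static inline bool cookie_compatible(struct rq *rq, struct task_struct *p)
	{
		if (!sched_core_enabled(rq))
			return true;

		/* Current behaviour: exact cookie match. */
		if (rq->core->core_cookie == p->core_cookie)
			return true;

		/*
		 * TRUST: allow a kernel thread (e.g. ksoftirqd) to share the
		 * core with a task from a trusted, tagged cgroup.
		 */
		if (rq->core->coresched_policy == CORE_SCHED_COOKIE_TRUST &&
		    (p->flags & PF_KTHREAD))
			return true;

		return false;
	}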

>
> Is the uperf throughput worse with SMT+core-scheduling versus no-SMT ?

This is a good question. From the data we measured with uperf,
SMT+core-scheduling is 28.2% worse than no-SMT. :(

Thanks,
-Aubrey

>
> thanks,
>
> - Joel
> PS: I am planning to write a patch behind a CONFIG option that tags
> all processes (default untrusted) so everything gets a cookie which
> some folks said was how they wanted (have a whitelist instead of
> blacklist).
>

2020-08-05 06:17:52

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] Core scheduling v6(Internet mail)

Hi,

> On Aug 5, 2020, at 11:57 AM, Li, Aubrey <[email protected]> wrote:
>
> On 2020/8/4 0:53, Joel Fernandes wrote:
>> Hi Aubrey,
>>
>> On Mon, Aug 3, 2020 at 4:23 AM Li, Aubrey <[email protected]> wrote:
>>>
>>> On 2020/7/1 5:32, Vineeth Remanan Pillai wrote:
>>>> Sixth iteration of the Core-Scheduling feature.
>>>>
>>>> Core scheduling is a feature that allows only trusted tasks to run
>>>> concurrently on cpus sharing compute resources (eg: hyperthreads on a
>>>> core). The goal is to mitigate the core-level side-channel attacks
>>>> without requiring to disable SMT (which has a significant impact on
>>>> performance in some situations). Core scheduling (as of v6) mitigates
>>>> user-space to user-space attacks and user to kernel attack when one of
>>>> the siblings enters the kernel via interrupts. It is still possible to
>>>> have a task attack the sibling thread when it enters the kernel via
>>>> syscalls.
>>>>
>>>> By default, the feature doesn't change any of the current scheduler
>>>> behavior. The user decides which tasks can run simultaneously on the
>>>> same core (for now by having them in the same tagged cgroup). When a
>>>> tag is enabled in a cgroup and a task from that cgroup is running on a
>>>> hardware thread, the scheduler ensures that only idle or trusted tasks
>>>> run on the other sibling(s). Besides security concerns, this feature
>>>> can also be beneficial for RT and performance applications where we
>>>> want to control how tasks make use of SMT dynamically.
>>>>
>>>> This iteration is mostly a cleanup of v5 except for a major feature of
>>>> pausing sibling when a cpu enters kernel via nmi/irq/softirq. Also
>>>> introducing documentation and includes minor crash fixes.
>>>>
>>>> One major cleanup was removing the hotplug support and related code.
>>>> The hotplug related crashes were not documented and the fixes piled up
>>>> over time leading to complex code. We were not able to reproduce the
>>>> crashes in the limited testing done. But if they are reroducable, we
>>>> don't want to hide them. We should document them and design better
>>>> fixes if any.
>>>>
>>>> In terms of performance, the results in this release are similar to
>>>> v5. On a x86 system with N hardware threads:
>>>> - if only N/2 hardware threads are busy, the performance is similar
>>>> between baseline, corescheduling and nosmt
>>>> - if N hardware threads are busy with N different corescheduling
>>>> groups, the impact of corescheduling is similar to nosmt
>>>> - if N hardware threads are busy and multiple active threads share the
>>>> same corescheduling cookie, they gain a performance improvement over
>>>> nosmt.
>>>> The specific performance impact depends on the workload, but for a
>>>> really busy database 12-vcpu VM (1 coresched tag) running on a 36
>>>> hardware threads NUMA node with 96 mostly idle neighbor VMs (each in
>>>> their own coresched tag), the performance drops by 54% with
>>>> corescheduling and drops by 90% with nosmt.
>>>>
>>>
>>> We found uperf(in cgroup) throughput drops by ~50% with corescheduling.
>>>
>>> The problem is, uperf triggered a lot of softirq and offloaded softirq
>>> service to *ksoftirqd* thread.
>>>
>>> - default, ksoftirqd thread can run with uperf on the same core, we saw
>>> 100% CPU utilization.
>>> - coresched enabled, ksoftirqd's core cookie is different from uperf, so
>>> they can't run concurrently on the same core, we saw ~15% forced idle.
>>>
>>> I guess this kind of performance drop can be replicated by other similar
>>> (a lot of softirq activities) workloads.
>>>
>>> Currently core scheduler picks cookie-match tasks for all SMT siblings, does
>>> it make sense we add a policy to allow cookie-compatible task running together?
>>> For example, if a task is trusted(set by admin), it can work with kernel thread.
>>> The difference from corescheduling disabled is that we still have user to user
>>> isolation.
>>
>> In ChromeOS we are considering all cookie-0 tasks as trusted.
>> Basically if you don't trust a task, then that is when you assign the
>> task a tag. We do this for the sandboxed processes.
>
> I have a proposal of this, by changing cpu.tag to cpu.coresched_policy,
> something like the following:
>
> +/*
> + * Core scheduling policy:
> + * - CORE_SCHED_DISABLED: core scheduling is disabled.
> + * - CORE_COOKIE_MATCH: tasks with same cookie can run
> + * on the same core concurrently.
> + * - CORE_COOKIE_TRUST: trusted task can run with kernel
> thread on the same core concurrently.
How about other OS tasks (like systemd) besides kernel threads? :)

Thx.
Regards,
Jiang
> + * - CORE_COOKIE_LONELY: tasks with cookie can run only
> + * with idle thread on the same core.
> + */
> +enum coresched_policy {
> + CORE_SCHED_DISABLED,
> + CORE_SCHED_COOKIE_MATCH,
> + CORE_SCHED_COOKIE_TRUST,
> + CORE_SCHED_COOKIE_LONELY,
> +};
>
> We can set policy to CORE_COOKIE_TRUST of uperf cgroup and fix this kind
> of performance regression. Not sure if this sounds attractive?
>
>>
>> Is the uperf throughput worse with SMT+core-scheduling versus no-SMT ?
>
> This is a good question, from the data we measured by uperf,
> SMT+core-scheduling is 28.2% worse than no-SMT, :(
>
> Thanks,
> -Aubrey
>
>>
>> thanks,
>>
>> - Joel
>> PS: I am planning to write a patch behind a CONFIG option that tags
>> all processes (default untrusted) so everything gets a cookie which
>> some folks said was how they wanted (have a whitelist instead of
>> blacklist).

2020-08-09 16:45:10

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] Core scheduling v6

Hi Aubrey,

Apologies for replying late as I was still looking into the details.

On Wed, Aug 05, 2020 at 11:57:20AM +0800, Li, Aubrey wrote:
[...]
> +/*
> + * Core scheduling policy:
> + * - CORE_SCHED_DISABLED: core scheduling is disabled.
> + * - CORE_COOKIE_MATCH: tasks with same cookie can run
> + * on the same core concurrently.
> + * - CORE_COOKIE_TRUST: trusted task can run with kernel
> thread on the same core concurrently.
> + * - CORE_COOKIE_LONELY: tasks with cookie can run only
> + * with idle thread on the same core.
> + */
> +enum coresched_policy {
> + CORE_SCHED_DISABLED,
> + CORE_SCHED_COOKIE_MATCH,
> + CORE_SCHED_COOKIE_TRUST,
> + CORE_SCHED_COOKIE_LONELY,
> +};
>
> We can set policy to CORE_COOKIE_TRUST of uperf cgroup and fix this kind
> of performance regression. Not sure if this sounds attractive?

Instead of this, I think it can be something simpler IMHO:

1. Consider all cookie-0 tasks as trusted. (Even right now, if you apply the
core-scheduling patchset, such tasks will share a core and sniff on each
other. So let us not pretend that such tasks are not trusted).

2. All kernel threads and the idle task would have a cookie 0 (so that will cover
the ksoftirqd case reported in your original issue).

3. Add a config option (CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED). Default
enable it. Setting this option would tag all tasks that are forked from a
cookie-0 task with their own cookie. Later on, such tasks can be added to
a group. This covers PeterZ's ask about having 'default untrusted'.
(Users like ChromeOS that don't want userspace system processes to be
tagged can disable this option so such tasks will be cookie-0.) A rough
sketch of this fork-time tagging follows right after this list.

4. Allow prctl/cgroup interfaces to create groups of tasks and override the
above behaviors.

5. Document everything clearly so the semantics are clear both to the
developers of core scheduling and to system administrators.
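Here is a rough sketch of the fork-time tagging in item 3 (illustration only;
the helper name, the call site, and the cookie value are assumptions, not code
from the series):

	#ifdef CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED
	/* Hypothetically called from sched_fork() for the new child. */
	static void sched_core_default_untrusted(struct task_struct *p)
	{
		/* Kernel threads keep cookie 0 (item 2 above). */
		if (p->flags & PF_KTHREAD)
			return;

		/* Children of already-tagged tasks keep inheriting that tag. */
		if (current->core_cookie)
			return;

		/*
		 * Forked from a cookie-0 task: give the child its own cookie.
		 * Any unique non-zero value works for illustration.
		 */
		p->core_cookie = (unsigned long)p;
	}
	#endif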

Note that, with the concept of "system trusted cookie", we can also do
optimizations like:
1. Disable STIBP when switching into trusted tasks.
2. Disable L1D flushing / verw stuff for L1TF/MDS issues, when switching into
trusted tasks.

At least #1 seems to be biting enabling HT on ChromeOS right now, and one
other engineer requested I do something like #2 already.

Once we get full syscall isolation working, threads belonging to a process
can also share a core, so those can just share a core with the task-group
leader.

> > Is the uperf throughput worse with SMT+core-scheduling versus no-SMT ?
>
> This is a good question, from the data we measured by uperf,
> SMT+core-scheduling is 28.2% worse than no-SMT, :(

This is worrying for sure. :-( We ought to debug/profile it more to see what
is causing the overhead. Vineeth and I added it as a topic for LPC as well.

Any other thoughts from others on this?

thanks,

- Joel


> > thanks,
> >
> > - Joel
> > PS: I am planning to write a patch behind a CONFIG option that tags
> > all processes (default untrusted) so everything gets a cookie which
> > some folks said was how they wanted (have a whitelist instead of
> > blacklist).
> >
>

2020-08-12 02:02:40

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] Core scheduling v6

Hi Joel,

On 2020/8/10 0:44, Joel Fernandes wrote:
> Hi Aubrey,
>
> Apologies for replying late as I was still looking into the details.
>
> On Wed, Aug 05, 2020 at 11:57:20AM +0800, Li, Aubrey wrote:
> [...]
>> +/*
>> + * Core scheduling policy:
>> + * - CORE_SCHED_DISABLED: core scheduling is disabled.
>> + * - CORE_COOKIE_MATCH: tasks with same cookie can run
>> + * on the same core concurrently.
>> + * - CORE_COOKIE_TRUST: trusted task can run with kernel
>> thread on the same core concurrently.
>> + * - CORE_COOKIE_LONELY: tasks with cookie can run only
>> + * with idle thread on the same core.
>> + */
>> +enum coresched_policy {
>> + CORE_SCHED_DISABLED,
>> + CORE_SCHED_COOKIE_MATCH,
>> + CORE_SCHED_COOKIE_TRUST,
>> + CORE_SCHED_COOKIE_LONELY,
>> +};
>>
>> We can set policy to CORE_COOKIE_TRUST of uperf cgroup and fix this kind
>> of performance regression. Not sure if this sounds attractive?
>
> Instead of this, I think it can be something simpler IMHO:
>
> 1. Consider all cookie-0 task as trusted. (Even right now, if you apply the
> core-scheduling patchset, such tasks will share a core and sniff on each
> other. So let us not pretend that such tasks are not trusted).
>
> 2. All kernel threads and idle task would have a cookie 0 (so that will cover
> ksoftirqd reported in your original issue).
>
> 3. Add a config option (CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED). Default
> enable it. Setting this option would tag all tasks that are forked from a
> cookie-0 task with their own cookie. Later on, such tasks can be added to
> a group. This cover's PeterZ's ask about having 'default untrusted').
> (Users like ChromeOS that don't want to userspace system processes to be
> tagged can disable this option so such tasks will be cookie-0).
>
> 4. Allow prctl/cgroup interfaces to create groups of tasks and override the
> above behaviors.

How does uperf in a cgroup work with ksoftirqd? Are you suggesting I set uperf's
cookie to be cookie-0 via prctl?

Thanks,
-Aubrey
>
> 5. Document everything clearly so the semantics are clear both to the
> developers of core scheduling and to system administrators.
>
> Note that, with the concept of "system trusted cookie", we can also do
> optimizations like:
> 1. Disable STIBP when switching into trusted tasks.
> 2. Disable L1D flushing / verw stuff for L1TF/MDS issues, when switching into
> trusted tasks.
>
> At least #1 seems to be biting enabling HT on ChromeOS right now, and one
> other engineer requested I do something like #2 already.
>
> Once we get full-syscall isolation working, threads belonging to a process
> can also share a core so those can just share a core with the task-group
> leader.
>
>>> Is the uperf throughput worse with SMT+core-scheduling versus no-SMT ?
>>
>> This is a good question, from the data we measured by uperf,
>> SMT+core-scheduling is 28.2% worse than no-SMT, :(
>
> This is worrying for sure. :-(. We ought to debug/profile it more to see what
> is causing the overhead. Me/Vineeth added it as a topic for LPC as well.
>
> Any other thoughts from others on this?
>
> thanks,
>
> - Joel
>
>
>>> thanks,
>>>
>>> - Joel
>>> PS: I am planning to write a patch behind a CONFIG option that tags
>>> all processes (default untrusted) so everything gets a cookie which
>>> some folks said was how they wanted (have a whitelist instead of
>>> blacklist).
>>>
>>

2020-08-12 23:10:45

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] Core scheduling v6

On Wed, Aug 12, 2020 at 10:01:24AM +0800, Li, Aubrey wrote:
> Hi Joel,
>
> On 2020/8/10 0:44, Joel Fernandes wrote:
> > Hi Aubrey,
> >
> > Apologies for replying late as I was still looking into the details.
> >
> > On Wed, Aug 05, 2020 at 11:57:20AM +0800, Li, Aubrey wrote:
> > [...]
> >> +/*
> >> + * Core scheduling policy:
> >> + * - CORE_SCHED_DISABLED: core scheduling is disabled.
> >> + * - CORE_COOKIE_MATCH: tasks with same cookie can run
> >> + * on the same core concurrently.
> >> + * - CORE_COOKIE_TRUST: trusted task can run with kernel
> >> thread on the same core concurrently.
> >> + * - CORE_COOKIE_LONELY: tasks with cookie can run only
> >> + * with idle thread on the same core.
> >> + */
> >> +enum coresched_policy {
> >> + CORE_SCHED_DISABLED,
> >> + CORE_SCHED_COOKIE_MATCH,
> >> + CORE_SCHED_COOKIE_TRUST,
> >> + CORE_SCHED_COOKIE_LONELY,
> >> +};
> >>
> >> We can set policy to CORE_COOKIE_TRUST of uperf cgroup and fix this kind
> >> of performance regression. Not sure if this sounds attractive?
> >
> > Instead of this, I think it can be something simpler IMHO:
> >
> > 1. Consider all cookie-0 task as trusted. (Even right now, if you apply the
> > core-scheduling patchset, such tasks will share a core and sniff on each
> > other. So let us not pretend that such tasks are not trusted).
> >
> > 2. All kernel threads and idle task would have a cookie 0 (so that will cover
> > ksoftirqd reported in your original issue).
> >
> > 3. Add a config option (CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED). Default
> > enable it. Setting this option would tag all tasks that are forked from a
> > cookie-0 task with their own cookie. Later on, such tasks can be added to
> > a group. This cover's PeterZ's ask about having 'default untrusted').
> > (Users like ChromeOS that don't want to userspace system processes to be
> > tagged can disable this option so such tasks will be cookie-0).
> >
> > 4. Allow prctl/cgroup interfaces to create groups of tasks and override the
> > above behaviors.
>
> How does uperf in a cgroup work with ksoftirqd? Are you suggesting I set uperf's
> cookie to be cookie-0 via prctl?

Yes, but let me try to understand better. There are 2 problems here I think:

1. ksoftirqd getting idled when HT is turned on, because uperf is sharing a
core with it: This should not be any worse than SMT OFF, because even SMT OFF
would also reduce ksoftirqd's CPU time, just like core sched is doing. Sure,
core-scheduling adds some overhead with IPIs, but such a huge drop of perf is
strange. Peter, any thoughts on that?

2. Interface: To solve the performance problem, you are saying you want uperf
to share a core with ksoftirqd so that it is not forced into idle. Why not
just keep uperf out of the cgroup? Then it will have cookie 0 and be able to
share a core with kernel threads. About the user-user isolation that you need:
if you tag any "untrusted" threads by adding them to the cgroup, then they will
automatically be isolated from uperf, while uperf can still share CPU with
kernel threads.

Please let me know your thoughts and thanks,

- Joel

>
> Thanks,
> -Aubrey
> >
> > 5. Document everything clearly so the semantics are clear both to the
> > developers of core scheduling and to system administrators.
> >
> > Note that, with the concept of "system trusted cookie", we can also do
> > optimizations like:
> > 1. Disable STIBP when switching into trusted tasks.
> > 2. Disable L1D flushing / verw stuff for L1TF/MDS issues, when switching into
> > trusted tasks.
> >
> > At least #1 seems to be biting enabling HT on ChromeOS right now, and one
> > other engineer requested I do something like #2 already.
> >
> > Once we get full-syscall isolation working, threads belonging to a process
> > can also share a core so those can just share a core with the task-group
> > leader.
> >
> >>> Is the uperf throughput worse with SMT+core-scheduling versus no-SMT ?
> >>
> >> This is a good question, from the data we measured by uperf,
> >> SMT+core-scheduling is 28.2% worse than no-SMT, :(
> >
> > This is worrying for sure. :-(. We ought to debug/profile it more to see what
> > is causing the overhead. Me/Vineeth added it as a topic for LPC as well.
> >
> > Any other thoughts from others on this?
> >
> > thanks,
> >
> > - Joel
> >
> >
> >>> thanks,
> >>>
> >>> - Joel
> >>> PS: I am planning to write a patch behind a CONFIG option that tags
> >>> all processes (default untrusted) so everything gets a cookie which
> >>> some folks said was how they wanted (have a whitelist instead of
> >>> blacklist).
> >>>
> >>
>

2020-08-13 04:30:05

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] Core scheduling v6

On 2020/8/13 7:08, Joel Fernandes wrote:
> On Wed, Aug 12, 2020 at 10:01:24AM +0800, Li, Aubrey wrote:
>> Hi Joel,
>>
>> On 2020/8/10 0:44, Joel Fernandes wrote:
>>> Hi Aubrey,
>>>
>>> Apologies for replying late as I was still looking into the details.
>>>
>>> On Wed, Aug 05, 2020 at 11:57:20AM +0800, Li, Aubrey wrote:
>>> [...]
>>>> +/*
>>>> + * Core scheduling policy:
>>>> + * - CORE_SCHED_DISABLED: core scheduling is disabled.
>>>> + * - CORE_COOKIE_MATCH: tasks with same cookie can run
>>>> + * on the same core concurrently.
>>>> + * - CORE_COOKIE_TRUST: trusted task can run with kernel
>>>> thread on the same core concurrently.
>>>> + * - CORE_COOKIE_LONELY: tasks with cookie can run only
>>>> + * with idle thread on the same core.
>>>> + */
>>>> +enum coresched_policy {
>>>> + CORE_SCHED_DISABLED,
>>>> + CORE_SCHED_COOKIE_MATCH,
>>>> + CORE_SCHED_COOKIE_TRUST,
>>>> + CORE_SCHED_COOKIE_LONELY,
>>>> +};
>>>>
>>>> We can set policy to CORE_COOKIE_TRUST of uperf cgroup and fix this kind
>>>> of performance regression. Not sure if this sounds attractive?
>>>
>>> Instead of this, I think it can be something simpler IMHO:
>>>
>>> 1. Consider all cookie-0 task as trusted. (Even right now, if you apply the
>>> core-scheduling patchset, such tasks will share a core and sniff on each
>>> other. So let us not pretend that such tasks are not trusted).
>>>
>>> 2. All kernel threads and idle task would have a cookie 0 (so that will cover
>>> ksoftirqd reported in your original issue).
>>>
>>> 3. Add a config option (CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED). Default
>>> enable it. Setting this option would tag all tasks that are forked from a
>>> cookie-0 task with their own cookie. Later on, such tasks can be added to
>>> a group. This cover's PeterZ's ask about having 'default untrusted').
>>> (Users like ChromeOS that don't want to userspace system processes to be
>>> tagged can disable this option so such tasks will be cookie-0).
>>>
>>> 4. Allow prctl/cgroup interfaces to create groups of tasks and override the
>>> above behaviors.
>>
>> How does uperf in a cgroup work with ksoftirqd? Are you suggesting I set uperf's
>> cookie to be cookie-0 via prctl?
>
> Yes, but let me try to understand better. There are 2 problems here I think:
>
> 1. ksoftirqd getting idled when HT is turned on, because uperf is sharing a
> core with it: This should not be any worse than SMT OFF, because even SMT OFF
> would also reduce ksoftirqd's CPU time just core sched is doing. Sure
> core-scheduling adds some overhead with IPIs but such a huge drop of perf is
> strange. Peter any thoughts on that?
>
> 2. Interface: To solve the performance problem, you are saying you want uperf
> to share a core with ksoftirqd so that it is not forced into idle. Why not
> just keep uperf out of the cgroup?

I guess this is unacceptable for those who run their apps in containers and VMs.

Thanks,
-Aubrey

> Then it will have cookie 0 and be able to
> share core with kernel threads. About user-user isolation that you need, if
> you tag any "untrusted" threads by adding it to CGroup, then there will
> automatically isolated from uperf while allowing uperf to share CPU with
> kernel threads.
>
> Please let me know your thoughts and thanks,
>
> - Joel
>
>>
>> Thanks,
>> -Aubrey
>>>
>>> 5. Document everything clearly so the semantics are clear both to the
>>> developers of core scheduling and to system administrators.
>>>
>>> Note that, with the concept of "system trusted cookie", we can also do
>>> optimizations like:
>>> 1. Disable STIBP when switching into trusted tasks.
>>> 2. Disable L1D flushing / verw stuff for L1TF/MDS issues, when switching into
>>> trusted tasks.
>>>
>>> At least #1 seems to be biting enabling HT on ChromeOS right now, and one
>>> other engineer requested I do something like #2 already.
>>>
>>> Once we get full-syscall isolation working, threads belonging to a process
>>> can also share a core so those can just share a core with the task-group
>>> leader.
>>>
>>>>> Is the uperf throughput worse with SMT+core-scheduling versus no-SMT ?
>>>>
>>>> This is a good question, from the data we measured by uperf,
>>>> SMT+core-scheduling is 28.2% worse than no-SMT, :(
>>>
>>> This is worrying for sure. :-(. We ought to debug/profile it more to see what
>>> is causing the overhead. Me/Vineeth added it as a topic for LPC as well.
>>>
>>> Any other thoughts from others on this?
>>>
>>> thanks,
>>>
>>> - Joel
>>>
>>>
>>>>> thanks,
>>>>>
>>>>> - Joel
>>>>> PS: I am planning to write a patch behind a CONFIG option that tags
>>>>> all processes (default untrusted) so everything gets a cookie which
>>>>> some folks said was how they wanted (have a whitelist instead of
>>>>> blacklist).
>>>>>
>>>>
>>

2020-08-14 00:27:14

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] Core scheduling v6(Internet mail)



> On Aug 13, 2020, at 12:28 PM, Li, Aubrey <[email protected]> wrote:
>
> On 2020/8/13 7:08, Joel Fernandes wrote:
>> On Wed, Aug 12, 2020 at 10:01:24AM +0800, Li, Aubrey wrote:
>>> Hi Joel,
>>>
>>> On 2020/8/10 0:44, Joel Fernandes wrote:
>>>> Hi Aubrey,
>>>>
>>>> Apologies for replying late as I was still looking into the details.
>>>>
>>>> On Wed, Aug 05, 2020 at 11:57:20AM +0800, Li, Aubrey wrote:
>>>> [...]
>>>>> +/*
>>>>> + * Core scheduling policy:
>>>>> + * - CORE_SCHED_DISABLED: core scheduling is disabled.
>>>>> + * - CORE_COOKIE_MATCH: tasks with same cookie can run
>>>>> + * on the same core concurrently.
>>>>> + * - CORE_COOKIE_TRUST: trusted task can run with kernel
>>>>> thread on the same core concurrently.
>>>>> + * - CORE_COOKIE_LONELY: tasks with cookie can run only
>>>>> + * with idle thread on the same core.
>>>>> + */
>>>>> +enum coresched_policy {
>>>>> + CORE_SCHED_DISABLED,
>>>>> + CORE_SCHED_COOKIE_MATCH,
>>>>> + CORE_SCHED_COOKIE_TRUST,
>>>>> + CORE_SCHED_COOKIE_LONELY,
>>>>> +};
>>>>>
>>>>> We can set policy to CORE_COOKIE_TRUST of uperf cgroup and fix this kind
>>>>> of performance regression. Not sure if this sounds attractive?
>>>>
>>>> Instead of this, I think it can be something simpler IMHO:
>>>>
>>>> 1. Consider all cookie-0 task as trusted. (Even right now, if you apply the
>>>> core-scheduling patchset, such tasks will share a core and sniff on each
>>>> other. So let us not pretend that such tasks are not trusted).
>>>>
>>>> 2. All kernel threads and idle task would have a cookie 0 (so that will cover
>>>> ksoftirqd reported in your original issue).
>>>>
>>>> 3. Add a config option (CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED). Default
>>>> enable it. Setting this option would tag all tasks that are forked from a
>>>> cookie-0 task with their own cookie. Later on, such tasks can be added to
>>>> a group. This cover's PeterZ's ask about having 'default untrusted').
>>>> (Users like ChromeOS that don't want to userspace system processes to be
>>>> tagged can disable this option so such tasks will be cookie-0).
>>>>
>>>> 4. Allow prctl/cgroup interfaces to create groups of tasks and override the
>>>> above behaviors.
>>>
>>> How does uperf in a cgroup work with ksoftirqd? Are you suggesting I set uperf's
>>> cookie to be cookie-0 via prctl?
>>
>> Yes, but let me try to understand better. There are 2 problems here I think:
>>
>> 1. ksoftirqd getting idled when HT is turned on, because uperf is sharing a
>> core with it: This should not be any worse than SMT OFF, because even SMT OFF
>> would also reduce ksoftirqd's CPU time just core sched is doing. Sure
>> core-scheduling adds some overhead with IPIs but such a huge drop of perf is
>> strange. Peter any thoughts on that?
>>
>> 2. Interface: To solve the performance problem, you are saying you want uperf
>> to share a core with ksoftirqd so that it is not forced into idle. Why not
>> just keep uperf out of the cgroup?
>
> I guess this is unacceptable for who runs their apps in container and vm.
IMHO, just as Joel proposed,
1. Consider all cookie-0 tasks as trusted.
2. All kernel threads and the idle task would have cookie 0.
In that way, all tasks with cookies (including uperf in a cgroup) could run
concurrently with kernel threads.
That could be a good solution for the issue. :)

If CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED is enabled,
maybe we should set ksoftirqd's cookie to cookie-0 to solve the issue.
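
To make that concrete, a minimal sketch of what such a "system trusted" test
could look like. This is only an illustration: task_is_system_trusted() and
the way core_cookie is used here are assumptions, not code from the patch set.

static inline bool task_is_system_trusted(struct task_struct *p)
{
        /* Sketch: kernel threads, the idle task and anything left
         * untagged (core_cookie == 0) are treated as system trusted. */
        return is_idle_task(p) || (p->flags & PF_KTHREAD) || p->core_cookie == 0;
}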

Thx.
Regards,
Jiang
>
> Thanks,
> -Aubrey
>
>> Then it will have cookie 0 and be able to
>> share core with kernel threads. About user-user isolation that you need, if
>> you tag any "untrusted" threads by adding it to CGroup, then there will
>> automatically isolated from uperf while allowing uperf to share CPU with
>> kernel threads.
>>
>> Please let me know your thoughts and thanks,
>>
>> - Joel
>>
>>>
>>> Thanks,
>>> -Aubrey
>>>>
>>>> 5. Document everything clearly so the semantics are clear both to the
>>>> developers of core scheduling and to system administrators.
>>>>
>>>> Note that, with the concept of "system trusted cookie", we can also do
>>>> optimizations like:
>>>> 1. Disable STIBP when switching into trusted tasks.
>>>> 2. Disable L1D flushing / verw stuff for L1TF/MDS issues, when switching into
>>>> trusted tasks.
>>>>
>>>> At least #1 seems to be biting enabling HT on ChromeOS right now, and one
>>>> other engineer requested I do something like #2 already.
>>>>
>>>> Once we get full-syscall isolation working, threads belonging to a process
>>>> can also share a core so those can just share a core with the task-group
>>>> leader.
>>>>
>>>>>> Is the uperf throughput worse with SMT+core-scheduling versus no-SMT ?
>>>>>
>>>>> This is a good question, from the data we measured by uperf,
>>>>> SMT+core-scheduling is 28.2% worse than no-SMT, :(
>>>>
>>>> This is worrying for sure. :-(. We ought to debug/profile it more to see what
>>>> is causing the overhead. Me/Vineeth added it as a topic for LPC as well.
>>>>
>>>> Any other thoughts from others on this?
>>>>
>>>> thanks,
>>>>
>>>> - Joel
>>>>
>>>>
>>>>>> thanks,
>>>>>>
>>>>>> - Joel
>>>>>> PS: I am planning to write a patch behind a CONFIG option that tags
>>>>>> all processes (default untrusted) so everything gets a cookie which
>>>>>> some folks said was how they wanted (have a whitelist instead of
>>>>>> blacklist).
>>>>>>
>>>>>
>>>
>
>

2020-08-14 01:39:50

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] Core scheduling v6(Internet mail)

On 2020/8/14 8:26, benbjiang(蒋彪) wrote:
>
>
>> On Aug 13, 2020, at 12:28 PM, Li, Aubrey <[email protected]> wrote:
>>
>> On 2020/8/13 7:08, Joel Fernandes wrote:
>>> On Wed, Aug 12, 2020 at 10:01:24AM +0800, Li, Aubrey wrote:
>>>> Hi Joel,
>>>>
>>>> On 2020/8/10 0:44, Joel Fernandes wrote:
>>>>> Hi Aubrey,
>>>>>
>>>>> Apologies for replying late as I was still looking into the details.
>>>>>
>>>>> On Wed, Aug 05, 2020 at 11:57:20AM +0800, Li, Aubrey wrote:
>>>>> [...]
>>>>>> +/*
>>>>>> + * Core scheduling policy:
>>>>>> + * - CORE_SCHED_DISABLED: core scheduling is disabled.
>>>>>> + * - CORE_COOKIE_MATCH: tasks with same cookie can run
>>>>>> + * on the same core concurrently.
>>>>>> + * - CORE_COOKIE_TRUST: trusted task can run with kernel
>>>>>> thread on the same core concurrently.
>>>>>> + * - CORE_COOKIE_LONELY: tasks with cookie can run only
>>>>>> + * with idle thread on the same core.
>>>>>> + */
>>>>>> +enum coresched_policy {
>>>>>> + CORE_SCHED_DISABLED,
>>>>>> + CORE_SCHED_COOKIE_MATCH,
>>>>>> + CORE_SCHED_COOKIE_TRUST,
>>>>>> + CORE_SCHED_COOKIE_LONELY,
>>>>>> +};
>>>>>>
>>>>>> We can set policy to CORE_COOKIE_TRUST of uperf cgroup and fix this kind
>>>>>> of performance regression. Not sure if this sounds attractive?
>>>>>
>>>>> Instead of this, I think it can be something simpler IMHO:
>>>>>
>>>>> 1. Consider all cookie-0 task as trusted. (Even right now, if you apply the
>>>>> core-scheduling patchset, such tasks will share a core and sniff on each
>>>>> other. So let us not pretend that such tasks are not trusted).
>>>>>
>>>>> 2. All kernel threads and idle task would have a cookie 0 (so that will cover
>>>>> ksoftirqd reported in your original issue).
>>>>>
>>>>> 3. Add a config option (CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED). Default
>>>>> enable it. Setting this option would tag all tasks that are forked from a
>>>>> cookie-0 task with their own cookie. Later on, such tasks can be added to
>>>>> a group. This cover's PeterZ's ask about having 'default untrusted').
>>>>> (Users like ChromeOS that don't want to userspace system processes to be
>>>>> tagged can disable this option so such tasks will be cookie-0).
>>>>>
>>>>> 4. Allow prctl/cgroup interfaces to create groups of tasks and override the
>>>>> above behaviors.
>>>>
>>>> How does uperf in a cgroup work with ksoftirqd? Are you suggesting I set uperf's
>>>> cookie to be cookie-0 via prctl?
>>>
>>> Yes, but let me try to understand better. There are 2 problems here I think:
>>>
>>> 1. ksoftirqd getting idled when HT is turned on, because uperf is sharing a
>>> core with it: This should not be any worse than SMT OFF, because even SMT OFF
>>> would also reduce ksoftirqd's CPU time just core sched is doing. Sure
>>> core-scheduling adds some overhead with IPIs but such a huge drop of perf is
>>> strange. Peter any thoughts on that?
>>>
>>> 2. Interface: To solve the performance problem, you are saying you want uperf
>>> to share a core with ksoftirqd so that it is not forced into idle. Why not
>>> just keep uperf out of the cgroup?
>>
>> I guess this is unacceptable for who runs their apps in container and vm.
> IMHO, just as Joel proposed,
> 1. Consider all cookie-0 task as trusted.
> 2. All kernel threads and idle task would have a cookie 0
> In that way, all tasks with cookies(including uperf in a cgroup) could run
> concurrently with kernel threads.
> That could be a good solution for the issue. :)

From uperf's point of view, it can trust cookie-0 (I assume we still need
some modifications to change cookie-match to cookie-compatible, to allow
ZERO and NONZERO to run together).

But from the kernel thread's point of view, it can NOT trust uperf, unless
we set uperf's cookie to 0.
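
As a purely hypothetical sketch (not code from the patch set), a naive
"cookie compatible" check would look roughly like this. Note that it is
symmetric, which is exactly the problem: it lets uperf pair with cookie-0,
but it equally lets the kernel thread pair with uperf.

static inline bool cookies_compatible(unsigned long a, unsigned long b)
{
        /* Sketch only: treat cookie 0 as compatible with everything. */
        return a == 0 || b == 0 || a == b;
}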

Thanks,
-Aubrey

2020-08-14 04:06:42

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] Core scheduling v6(Internet mail)



> On Aug 14, 2020, at 9:36 AM, Li, Aubrey <[email protected]> wrote:
>
> On 2020/8/14 8:26, benbjiang(蒋彪) wrote:
>>
>>
>>> On Aug 13, 2020, at 12:28 PM, Li, Aubrey <[email protected]> wrote:
>>>
>>> On 2020/8/13 7:08, Joel Fernandes wrote:
>>>> On Wed, Aug 12, 2020 at 10:01:24AM +0800, Li, Aubrey wrote:
>>>>> Hi Joel,
>>>>>
>>>>> On 2020/8/10 0:44, Joel Fernandes wrote:
>>>>>> Hi Aubrey,
>>>>>>
>>>>>> Apologies for replying late as I was still looking into the details.
>>>>>>
>>>>>> On Wed, Aug 05, 2020 at 11:57:20AM +0800, Li, Aubrey wrote:
>>>>>> [...]
>>>>>>> +/*
>>>>>>> + * Core scheduling policy:
>>>>>>> + * - CORE_SCHED_DISABLED: core scheduling is disabled.
>>>>>>> + * - CORE_COOKIE_MATCH: tasks with same cookie can run
>>>>>>> + * on the same core concurrently.
>>>>>>> + * - CORE_COOKIE_TRUST: trusted task can run with kernel
>>>>>>> thread on the same core concurrently.
>>>>>>> + * - CORE_COOKIE_LONELY: tasks with cookie can run only
>>>>>>> + * with idle thread on the same core.
>>>>>>> + */
>>>>>>> +enum coresched_policy {
>>>>>>> + CORE_SCHED_DISABLED,
>>>>>>> + CORE_SCHED_COOKIE_MATCH,
>>>>>>> + CORE_SCHED_COOKIE_TRUST,
>>>>>>> + CORE_SCHED_COOKIE_LONELY,
>>>>>>> +};
>>>>>>>
>>>>>>> We can set policy to CORE_COOKIE_TRUST of uperf cgroup and fix this kind
>>>>>>> of performance regression. Not sure if this sounds attractive?
>>>>>>
>>>>>> Instead of this, I think it can be something simpler IMHO:
>>>>>>
>>>>>> 1. Consider all cookie-0 task as trusted. (Even right now, if you apply the
>>>>>> core-scheduling patchset, such tasks will share a core and sniff on each
>>>>>> other. So let us not pretend that such tasks are not trusted).
>>>>>>
>>>>>> 2. All kernel threads and idle task would have a cookie 0 (so that will cover
>>>>>> ksoftirqd reported in your original issue).
>>>>>>
>>>>>> 3. Add a config option (CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED). Default
>>>>>> enable it. Setting this option would tag all tasks that are forked from a
>>>>>> cookie-0 task with their own cookie. Later on, such tasks can be added to
>>>>>> a group. This cover's PeterZ's ask about having 'default untrusted').
>>>>>> (Users like ChromeOS that don't want to userspace system processes to be
>>>>>> tagged can disable this option so such tasks will be cookie-0).
>>>>>>
>>>>>> 4. Allow prctl/cgroup interfaces to create groups of tasks and override the
>>>>>> above behaviors.
>>>>>
>>>>> How does uperf in a cgroup work with ksoftirqd? Are you suggesting I set uperf's
>>>>> cookie to be cookie-0 via prctl?
>>>>
>>>> Yes, but let me try to understand better. There are 2 problems here I think:
>>>>
>>>> 1. ksoftirqd getting idled when HT is turned on, because uperf is sharing a
>>>> core with it: This should not be any worse than SMT OFF, because even SMT OFF
>>>> would also reduce ksoftirqd's CPU time just core sched is doing. Sure
>>>> core-scheduling adds some overhead with IPIs but such a huge drop of perf is
>>>> strange. Peter any thoughts on that?
>>>>
>>>> 2. Interface: To solve the performance problem, you are saying you want uperf
>>>> to share a core with ksoftirqd so that it is not forced into idle. Why not
>>>> just keep uperf out of the cgroup?
>>>
>>> I guess this is unacceptable for who runs their apps in container and vm.
>> IMHO, just as Joel proposed,
>> 1. Consider all cookie-0 task as trusted.
>> 2. All kernel threads and idle task would have a cookie 0
>> In that way, all tasks with cookies(including uperf in a cgroup) could run
>> concurrently with kernel threads.
>> That could be a good solution for the issue. :)
>
> From uperf point of review, it can trust cookie-0(I assume we still need
> some modifications to change cookie-match to cookie-compatible to allow
> ZERO and NONZERO run together).
>
> But from kernel thread point of review, it can NOT trust uperf, unless
> we set uperf's cookie to 0.
That’s right. :)
Could we set the cookie of the cgroup where uperf lies to 0?

Thx.
Regards,
Jiang

>
> Thanks,
> -Aubrey
>

2020-08-14 05:20:04

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] Core scheduling v6(Internet mail)

On 2020/8/14 12:04, benbjiang(蒋彪) wrote:
>
>
>> On Aug 14, 2020, at 9:36 AM, Li, Aubrey <[email protected]> wrote:
>>
>> On 2020/8/14 8:26, benbjiang(蒋彪) wrote:
>>>
>>>
>>>> On Aug 13, 2020, at 12:28 PM, Li, Aubrey <[email protected]> wrote:
>>>>
>>>> On 2020/8/13 7:08, Joel Fernandes wrote:
>>>>> On Wed, Aug 12, 2020 at 10:01:24AM +0800, Li, Aubrey wrote:
>>>>>> Hi Joel,
>>>>>>
>>>>>> On 2020/8/10 0:44, Joel Fernandes wrote:
>>>>>>> Hi Aubrey,
>>>>>>>
>>>>>>> Apologies for replying late as I was still looking into the details.
>>>>>>>
>>>>>>> On Wed, Aug 05, 2020 at 11:57:20AM +0800, Li, Aubrey wrote:
>>>>>>> [...]
>>>>>>>> +/*
>>>>>>>> + * Core scheduling policy:
>>>>>>>> + * - CORE_SCHED_DISABLED: core scheduling is disabled.
>>>>>>>> + * - CORE_COOKIE_MATCH: tasks with same cookie can run
>>>>>>>> + * on the same core concurrently.
>>>>>>>> + * - CORE_COOKIE_TRUST: trusted task can run with kernel
>>>>>>>> thread on the same core concurrently.
>>>>>>>> + * - CORE_COOKIE_LONELY: tasks with cookie can run only
>>>>>>>> + * with idle thread on the same core.
>>>>>>>> + */
>>>>>>>> +enum coresched_policy {
>>>>>>>> + CORE_SCHED_DISABLED,
>>>>>>>> + CORE_SCHED_COOKIE_MATCH,
>>>>>>>> + CORE_SCHED_COOKIE_TRUST,
>>>>>>>> + CORE_SCHED_COOKIE_LONELY,
>>>>>>>> +};
>>>>>>>>
>>>>>>>> We can set policy to CORE_COOKIE_TRUST of uperf cgroup and fix this kind
>>>>>>>> of performance regression. Not sure if this sounds attractive?
>>>>>>>
>>>>>>> Instead of this, I think it can be something simpler IMHO:
>>>>>>>
>>>>>>> 1. Consider all cookie-0 task as trusted. (Even right now, if you apply the
>>>>>>> core-scheduling patchset, such tasks will share a core and sniff on each
>>>>>>> other. So let us not pretend that such tasks are not trusted).
>>>>>>>
>>>>>>> 2. All kernel threads and idle task would have a cookie 0 (so that will cover
>>>>>>> ksoftirqd reported in your original issue).
>>>>>>>
>>>>>>> 3. Add a config option (CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED). Default
>>>>>>> enable it. Setting this option would tag all tasks that are forked from a
>>>>>>> cookie-0 task with their own cookie. Later on, such tasks can be added to
>>>>>>> a group. This cover's PeterZ's ask about having 'default untrusted').
>>>>>>> (Users like ChromeOS that don't want to userspace system processes to be
>>>>>>> tagged can disable this option so such tasks will be cookie-0).
>>>>>>>
>>>>>>> 4. Allow prctl/cgroup interfaces to create groups of tasks and override the
>>>>>>> above behaviors.
>>>>>>
>>>>>> How does uperf in a cgroup work with ksoftirqd? Are you suggesting I set uperf's
>>>>>> cookie to be cookie-0 via prctl?
>>>>>
>>>>> Yes, but let me try to understand better. There are 2 problems here I think:
>>>>>
>>>>> 1. ksoftirqd getting idled when HT is turned on, because uperf is sharing a
>>>>> core with it: This should not be any worse than SMT OFF, because even SMT OFF
>>>>> would also reduce ksoftirqd's CPU time just core sched is doing. Sure
>>>>> core-scheduling adds some overhead with IPIs but such a huge drop of perf is
>>>>> strange. Peter any thoughts on that?
>>>>>
>>>>> 2. Interface: To solve the performance problem, you are saying you want uperf
>>>>> to share a core with ksoftirqd so that it is not forced into idle. Why not
>>>>> just keep uperf out of the cgroup?
>>>>
>>>> I guess this is unacceptable for who runs their apps in container and vm.
>>> IMHO, just as Joel proposed,
>>> 1. Consider all cookie-0 task as trusted.
>>> 2. All kernel threads and idle task would have a cookie 0
>>> In that way, all tasks with cookies(including uperf in a cgroup) could run
>>> concurrently with kernel threads.
>>> That could be a good solution for the issue. :)
>>
>> From uperf point of review, it can trust cookie-0(I assume we still need
>> some modifications to change cookie-match to cookie-compatible to allow
>> ZERO and NONZERO run together).
>>
>> But from kernel thread point of review, it can NOT trust uperf, unless
>> we set uperf's cookie to 0.
> That’s right. :)
> Could we set the cookie of cgroup where uperf lies to 0?
>
IMHO the disadvantage is that if two or more cgroups are set to cookie-0,
then the user applications in these cgroups could run concurrently on a core.
Even though all of them are marked as trusted, this opens a hole in user->user isolation.

Thanks,
-Aubrey

2020-08-14 09:20:14

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] Core scheduling v6(Internet mail)

Hi,

> On Aug 14, 2020, at 1:18 PM, Li, Aubrey <[email protected]> wrote:
>
> On 2020/8/14 12:04, benbjiang(蒋彪) wrote:
>>
>>
>>> On Aug 14, 2020, at 9:36 AM, Li, Aubrey <[email protected]> wrote:
>>>
>>> On 2020/8/14 8:26, benbjiang(蒋彪) wrote:
>>>>
>>>>
>>>>> On Aug 13, 2020, at 12:28 PM, Li, Aubrey <[email protected]> wrote:
>>>>>
>>>>> On 2020/8/13 7:08, Joel Fernandes wrote:
>>>>>> On Wed, Aug 12, 2020 at 10:01:24AM +0800, Li, Aubrey wrote:
>>>>>>> Hi Joel,
>>>>>>>
>>>>>>> On 2020/8/10 0:44, Joel Fernandes wrote:
>>>>>>>> Hi Aubrey,
>>>>>>>>
>>>>>>>> Apologies for replying late as I was still looking into the details.
>>>>>>>>
>>>>>>>> On Wed, Aug 05, 2020 at 11:57:20AM +0800, Li, Aubrey wrote:
>>>>>>>> [...]
>>>>>>>>> +/*
>>>>>>>>> + * Core scheduling policy:
>>>>>>>>> + * - CORE_SCHED_DISABLED: core scheduling is disabled.
>>>>>>>>> + * - CORE_COOKIE_MATCH: tasks with same cookie can run
>>>>>>>>> + * on the same core concurrently.
>>>>>>>>> + * - CORE_COOKIE_TRUST: trusted task can run with kernel
>>>>>>>>> thread on the same core concurrently.
>>>>>>>>> + * - CORE_COOKIE_LONELY: tasks with cookie can run only
>>>>>>>>> + * with idle thread on the same core.
>>>>>>>>> + */
>>>>>>>>> +enum coresched_policy {
>>>>>>>>> + CORE_SCHED_DISABLED,
>>>>>>>>> + CORE_SCHED_COOKIE_MATCH,
>>>>>>>>> + CORE_SCHED_COOKIE_TRUST,
>>>>>>>>> + CORE_SCHED_COOKIE_LONELY,
>>>>>>>>> +};
>>>>>>>>>
>>>>>>>>> We can set policy to CORE_COOKIE_TRUST of uperf cgroup and fix this kind
>>>>>>>>> of performance regression. Not sure if this sounds attractive?
>>>>>>>>
>>>>>>>> Instead of this, I think it can be something simpler IMHO:
>>>>>>>>
>>>>>>>> 1. Consider all cookie-0 task as trusted. (Even right now, if you apply the
>>>>>>>> core-scheduling patchset, such tasks will share a core and sniff on each
>>>>>>>> other. So let us not pretend that such tasks are not trusted).
>>>>>>>>
>>>>>>>> 2. All kernel threads and idle task would have a cookie 0 (so that will cover
>>>>>>>> ksoftirqd reported in your original issue).
>>>>>>>>
>>>>>>>> 3. Add a config option (CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED). Default
>>>>>>>> enable it. Setting this option would tag all tasks that are forked from a
>>>>>>>> cookie-0 task with their own cookie. Later on, such tasks can be added to
>>>>>>>> a group. This cover's PeterZ's ask about having 'default untrusted').
>>>>>>>> (Users like ChromeOS that don't want to userspace system processes to be
>>>>>>>> tagged can disable this option so such tasks will be cookie-0).
>>>>>>>>
>>>>>>>> 4. Allow prctl/cgroup interfaces to create groups of tasks and override the
>>>>>>>> above behaviors.
>>>>>>>
>>>>>>> How does uperf in a cgroup work with ksoftirqd? Are you suggesting I set uperf's
>>>>>>> cookie to be cookie-0 via prctl?
>>>>>>
>>>>>> Yes, but let me try to understand better. There are 2 problems here I think:
>>>>>>
>>>>>> 1. ksoftirqd getting idled when HT is turned on, because uperf is sharing a
>>>>>> core with it: This should not be any worse than SMT OFF, because even SMT OFF
>>>>>> would also reduce ksoftirqd's CPU time just core sched is doing. Sure
>>>>>> core-scheduling adds some overhead with IPIs but such a huge drop of perf is
>>>>>> strange. Peter any thoughts on that?
>>>>>>
>>>>>> 2. Interface: To solve the performance problem, you are saying you want uperf
>>>>>> to share a core with ksoftirqd so that it is not forced into idle. Why not
>>>>>> just keep uperf out of the cgroup?
>>>>>
>>>>> I guess this is unacceptable for who runs their apps in container and vm.
>>>> IMHO, just as Joel proposed,
>>>> 1. Consider all cookie-0 task as trusted.
>>>> 2. All kernel threads and idle task would have a cookie 0
>>>> In that way, all tasks with cookies(including uperf in a cgroup) could run
>>>> concurrently with kernel threads.
>>>> That could be a good solution for the issue. :)
>>>
>>> From uperf point of review, it can trust cookie-0(I assume we still need
>>> some modifications to change cookie-match to cookie-compatible to allow
>>> ZERO and NONZERO run together).
>>>
>>> But from kernel thread point of review, it can NOT trust uperf, unless
>>> we set uperf's cookie to 0.
>> That’s right. :)
>> Could we set the cookie of cgroup where uperf lies to 0?
>>
> IMHO the disadvantage is that if there are two or more cgroups set cookie-0,
> then the user applications in these cgroups could run concurrently on a core,
> though all of them are set as trusted, we made a hole of user->user isolation.
For that case, how about:
- use a special cookie (cookie-trust) instead of cookie-0 for kernel threads
- implement cookie_partial_match() to match part of the cookie
- cookie-normal (used by normal tasks) could trust cookie-trust
- tasks that want to be trusted by cookie-trust could use cookies that
  include cookie-trust partially, while cookie-normal does not include cookie-trust
- cookie-trust tasks use cookie_partial_match() to match cookies
- normal tasks use the standard cookie match (full match) interface to match cookies.

Just a sudden thought. :)
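
To sketch the idea (everything here is hypothetical: the bit layout,
COOKIE_TRUST_BIT and both helpers are assumptions for illustration, not
anything in the patch set):

/* Hypothetical layout: one low bit of the cookie marks "includes
 * cookie-trust", the remaining bits identify the group. */
#define COOKIE_TRUST_BIT        0x1UL

/* A kernel thread (cookie-trust) pairs with any task whose cookie
 * includes the trust bit. */
static inline bool cookie_partial_match(unsigned long kthread_cookie,
                                        unsigned long task_cookie)
{
        return (kthread_cookie & task_cookie & COOKIE_TRUST_BIT) != 0;
}

/* Normal tasks keep the existing exact-match rule, so two different
 * user groups still cannot share a core. */
static inline bool cookie_full_match(unsigned long a, unsigned long b)
{
        return a == b;
}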

Thx.
Regards,
Jiang

2020-08-20 22:39:32

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] Core scheduling v6

On Thu, Aug 13, 2020 at 12:28:17PM +0800, Li, Aubrey wrote:
> On 2020/8/13 7:08, Joel Fernandes wrote:
> > On Wed, Aug 12, 2020 at 10:01:24AM +0800, Li, Aubrey wrote:
> >> Hi Joel,
> >>
> >> On 2020/8/10 0:44, Joel Fernandes wrote:
> >>> Hi Aubrey,
> >>>
> >>> Apologies for replying late as I was still looking into the details.
> >>>
> >>> On Wed, Aug 05, 2020 at 11:57:20AM +0800, Li, Aubrey wrote:
> >>> [...]
> >>>> +/*
> >>>> + * Core scheduling policy:
> >>>> + * - CORE_SCHED_DISABLED: core scheduling is disabled.
> >>>> + * - CORE_COOKIE_MATCH: tasks with same cookie can run
> >>>> + * on the same core concurrently.
> >>>> + * - CORE_COOKIE_TRUST: trusted task can run with kernel
> >>>> thread on the same core concurrently.
> >>>> + * - CORE_COOKIE_LONELY: tasks with cookie can run only
> >>>> + * with idle thread on the same core.
> >>>> + */
> >>>> +enum coresched_policy {
> >>>> + CORE_SCHED_DISABLED,
> >>>> + CORE_SCHED_COOKIE_MATCH,
> >>>> + CORE_SCHED_COOKIE_TRUST,
> >>>> + CORE_SCHED_COOKIE_LONELY,
> >>>> +};
> >>>>
> >>>> We can set policy to CORE_COOKIE_TRUST of uperf cgroup and fix this kind
> >>>> of performance regression. Not sure if this sounds attractive?
> >>>
> >>> Instead of this, I think it can be something simpler IMHO:
> >>>
> >>> 1. Consider all cookie-0 task as trusted. (Even right now, if you apply the
> >>> core-scheduling patchset, such tasks will share a core and sniff on each
> >>> other. So let us not pretend that such tasks are not trusted).
> >>>
> >>> 2. All kernel threads and idle task would have a cookie 0 (so that will cover
> >>> ksoftirqd reported in your original issue).
> >>>
> >>> 3. Add a config option (CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED). Default
> >>> enable it. Setting this option would tag all tasks that are forked from a
> >>> cookie-0 task with their own cookie. Later on, such tasks can be added to
> >>> a group. This cover's PeterZ's ask about having 'default untrusted').
> >>> (Users like ChromeOS that don't want to userspace system processes to be
> >>> tagged can disable this option so such tasks will be cookie-0).
> >>>
> >>> 4. Allow prctl/cgroup interfaces to create groups of tasks and override the
> >>> above behaviors.
> >>
> >> How does uperf in a cgroup work with ksoftirqd? Are you suggesting I set uperf's
> >> cookie to be cookie-0 via prctl?
> >
> > Yes, but let me try to understand better. There are 2 problems here I think:
> >
> > 1. ksoftirqd getting idled when HT is turned on, because uperf is sharing a
> > core with it: This should not be any worse than SMT OFF, because even SMT OFF
> > would also reduce ksoftirqd's CPU time just core sched is doing. Sure
> > core-scheduling adds some overhead with IPIs but such a huge drop of perf is
> > strange. Peter any thoughts on that?
> >
> > 2. Interface: To solve the performance problem, you are saying you want uperf
> > to share a core with ksoftirqd so that it is not forced into idle. Why not
> > just keep uperf out of the cgroup?
>
> I guess this is unacceptable for who runs their apps in container and vm.

I think let us forget about #2, that's just a workaround. #1 is probably
what we should look into for your problem. I was talking to Vineeth earlier;
is it possible that the fairness issues that Aaron and Peter are looking into
are causing the performance problem here?

So, for example, if ksoftirqd being higher prio makes the vruntime delta between
two CFS tasks sharing a core quite high, it causes the core-wide
min_vruntime to be high. Then if uperf gets enqueued, it will get starved by
ksoftirqd and not be able to run until ksoftirqd's vruntime catches up.
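
As a toy model of that scenario (this is not the kernel code; the numbers and
the simplified weight handling are made up purely to illustrate the mechanics):
a higher-priority task's vruntime advances more slowly, so once a big gap has
built up, the newly enqueued task keeps losing the pick until the gap closes.

#include <stdio.h>

struct toy_task {
        const char *name;
        unsigned long vruntime; /* arbitrary units */
        unsigned long weight;   /* higher weight => vruntime advances slower */
};

int main(void)
{
        /* Made-up numbers: ksoftirqd ran at higher priority, so its vruntime
         * lags far behind the (high) core-wide min_vruntime that a newly
         * woken uperf thread is placed at. */
        struct toy_task ksoftirqd = { "ksoftirqd", 1000, 4 };
        struct toy_task uperf     = { "uperf",     5000, 1 };
        int tick;

        for (tick = 0; tick < 20; tick++) {
                /* CFS-style pick: the task with the smaller vruntime runs,
                 * and its vruntime then advances by slice/weight. */
                struct toy_task *pick =
                        ksoftirqd.vruntime <= uperf.vruntime ? &ksoftirqd : &uperf;
                printf("tick %2d: run %s\n", tick, pick->name);
                pick->vruntime += 1000 / pick->weight;
        }
        return 0;
}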

Other than that, the only other possibility (AFAIK) is that the IPI/scheduler
overhead is giving uperf worse performance than SMT-off, and we ought to reduce
that overhead somehow. Does a kernel perf profile show you any smoking guns?

thanks,

- Joel

>
> Thanks,
> -Aubrey
>
> > Then it will have cookie 0 and be able to
> > share core with kernel threads. About user-user isolation that you need, if
> > you tag any "untrusted" threads by adding it to CGroup, then there will
> > automatically isolated from uperf while allowing uperf to share CPU with
> > kernel threads.
> >
> > Please let me know your thoughts and thanks,
> >
> > - Joel
> >
> >>
> >> Thanks,
> >> -Aubrey
> >>>
> >>> 5. Document everything clearly so the semantics are clear both to the
> >>> developers of core scheduling and to system administrators.
> >>>
> >>> Note that, with the concept of "system trusted cookie", we can also do
> >>> optimizations like:
> >>> 1. Disable STIBP when switching into trusted tasks.
> >>> 2. Disable L1D flushing / verw stuff for L1TF/MDS issues, when switching into
> >>> trusted tasks.
> >>>
> >>> At least #1 seems to be biting enabling HT on ChromeOS right now, and one
> >>> other engineer requested I do something like #2 already.
> >>>
> >>> Once we get full-syscall isolation working, threads belonging to a process
> >>> can also share a core so those can just share a core with the task-group
> >>> leader.
> >>>
> >>>>> Is the uperf throughput worse with SMT+core-scheduling versus no-SMT ?
> >>>>
> >>>> This is a good question, from the data we measured by uperf,
> >>>> SMT+core-scheduling is 28.2% worse than no-SMT, :(
> >>>
> >>> This is worrying for sure. :-(. We ought to debug/profile it more to see what
> >>> is causing the overhead. Me/Vineeth added it as a topic for LPC as well.
> >>>
> >>> Any other thoughts from others on this?
> >>>
> >>> thanks,
> >>>
> >>> - Joel
> >>>
> >>>
> >>>>> thanks,
> >>>>>
> >>>>> - Joel
> >>>>> PS: I am planning to write a patch behind a CONFIG option that tags
> >>>>> all processes (default untrusted) so everything gets a cookie which
> >>>>> some folks said was how they wanted (have a whitelist instead of
> >>>>> blacklist).
> >>>>>
> >>>>
> >>
>

2020-08-27 00:32:08

by Alexander Graf

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] Core scheduling v6

Hi Vineeth,

On 30.06.20 23:32, Vineeth Remanan Pillai wrote:
> Sixth iteration of the Core-Scheduling feature.
>
> Core scheduling is a feature that allows only trusted tasks to run
> concurrently on cpus sharing compute resources (eg: hyperthreads on a
> core). The goal is to mitigate the core-level side-channel attacks
> without requiring to disable SMT (which has a significant impact on
> performance in some situations). Core scheduling (as of v6) mitigates
> user-space to user-space attacks and user to kernel attack when one of
> the siblings enters the kernel via interrupts. It is still possible to
> have a task attack the sibling thread when it enters the kernel via
> syscalls.
>
> By default, the feature doesn't change any of the current scheduler
> behavior. The user decides which tasks can run simultaneously on the
> same core (for now by having them in the same tagged cgroup). When a
> tag is enabled in a cgroup and a task from that cgroup is running on a
> hardware thread, the scheduler ensures that only idle or trusted tasks
> run on the other sibling(s). Besides security concerns, this feature
> can also be beneficial for RT and performance applications where we
> want to control how tasks make use of SMT dynamically.
>
> This iteration is mostly a cleanup of v5 except for a major feature of
> pausing sibling when a cpu enters kernel via nmi/irq/softirq. Also
> introducing documentation and includes minor crash fixes.
>
> One major cleanup was removing the hotplug support and related code.
> The hotplug related crashes were not documented and the fixes piled up
> over time leading to complex code. We were not able to reproduce the
> crashes in the limited testing done. But if they are reroducable, we
> don't want to hide them. We should document them and design better
> fixes if any.
>
> In terms of performance, the results in this release are similar to
> v5. On a x86 system with N hardware threads:
> - if only N/2 hardware threads are busy, the performance is similar
> between baseline, corescheduling and nosmt
> - if N hardware threads are busy with N different corescheduling
> groups, the impact of corescheduling is similar to nosmt
> - if N hardware threads are busy and multiple active threads share the
> same corescheduling cookie, they gain a performance improvement over
> nosmt.
> The specific performance impact depends on the workload, but for a
> really busy database 12-vcpu VM (1 coresched tag) running on a 36
> hardware threads NUMA node with 96 mostly idle neighbor VMs (each in
> their own coresched tag), the performance drops by 54% with
> corescheduling and drops by 90% with nosmt.
>
> v6 is rebased on 5.7.6(a06eb423367e)
> https://github.com/digitalocean/linux-coresched/tree/coresched/v6-v5.7.y

As discussed during Linux Plumbers, here is a small repo with test
scripts and applications that I've used to look at core scheduling
unfairness:

https://github.com/agraf/schedgaps

Please let me know if it's unclear how to use it or if you see issues in
your environment.

Please also make sure to run this only on idle server-class hardware.
Notebooks will most definitely have too many uncontrollable sources of
timing entropy to give sensible results.


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



2020-08-27 01:21:27

by Vineeth Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH 00/16] Core scheduling v6

Hi Alex,

>
> As discussed during Linux Plumbers, here is a small repo with test
> scripts and applications that I've used to look at core scheduling
> unfairness:
>
> https://github.com/agraf/schedgaps
>
Thanks for sharing :).

> Please let me know if it's unclear how to use it or if you see issues in
> your environment.
>
I will give it a try soon and let you know. I went through the
README quickly and the documentation is very clear.

This is really helpful and will be very useful in future
testing as well.

Thanks,
Vineeth