Hi,
I previously submitted a version of this patch set called "memdelay",
which translated delays from reclaim, swap-in, thrashing page cache
into a pressure percentage of lost walltime. I've since extended this
code to aggregate all delay states tracked by delayacct in order to
have generalized pressure/overcommit levels for CPU, memory, and IO.
There was feedback from Peter on the previous version, which I have
incorporated as much as possible, where it still applies to this code:
- got rid of the extra lock in the sched callbacks; all task
state changes we care about serialize through rq->lock
- got rid of ktime_get() inside the sched callbacks and
switched time measuring to rq_clock()
- got rid of all divisions inside the sched callbacks,
tracking everything natively in ns now
I also moved this stuff into existing sched/stat.h callbacks, so it
doesn't get in the way in sched/core.c, and of course moved the whole
thing behind CONFIG_PSI since not everyone is going to want it.
Real-world applications
Since the last posting, we've begun using the data collected by this
code quite extensively at Facebook, and with several success stories.
First we used it on systems that frequently locked up in low memory
situations. The reason this happens is that the OOM killer is
triggered by reclaim not being able to make forward progress, but with
fast flash devices there is *always* some clean and uptodate cache to
reclaim; the OOM killer never kicks in, even as tasks spend 80-90% of
their time faulting in executables. There is no situation where this
ever makes sense in practice. We wrote a <100 line POC python script
to monitor memory pressure and kill stuff manually, well before such
pathological thrashing sets in.
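For illustration, the gist of such a script can be sketched like this
(the parsing follows the /proc/pressure format described further down
in this mail; the 40% threshold and the function names are invented,
and the real POC and oomd are more involved):

```python
# Rough sketch of a pressure-based killer. All names and the 40%
# threshold are invented for illustration; the real POC and oomd differ.

def parse_pressure(text):
    """Parse /proc/pressure/* content into {'some': {...}, 'full': {...}}."""
    out = {}
    for line in text.splitlines():
        kind, *fields = line.split()
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out

def should_kill(memory_pressure_text, full_avg10_threshold=40.0):
    """Act ahead of a livelock once sustained full memory pressure is high."""
    pressure = parse_pressure(memory_pressure_text)
    return pressure.get("full", {}).get("avg10", 0.0) > full_avg10_threshold

sample = ("some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828\n"
          "full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258")
print(should_kill(sample))  # prints True for this sample
```

A real monitor would read the file in a loop and pick a victim by some
policy; the point is only that the decision reduces to a threshold on
the "full" averages.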
We've since extended the python script into a more generic oomd that
we use all over the place, not just to avoid livelocks but also to
guarantee latency and throughput SLAs, since they're usually violated
way before the kernel OOM killer would ever kick in.
We also use the memory pressure info for load shedding. Our batch job
infrastructure used to refuse new requests based on heuristics around
RSS and other existing VM metrics, in an attempt to avoid OOM kills
and maximize utilization. Since it was still plagued by frequent OOM
kills, we switched it to shed load on psi memory pressure, which has
turned out to be a much better bellwether, and we managed to reduce
OOM kills drastically. Reducing the rate of OOM outages from the
worker pool raised its aggregate productivity, and we were able to
switch that service to smaller machines.
Lastly, we use cgroups to isolate a machine's main workload from
maintenance crap like package upgrades, logging, configuration, as
well as to prevent multiple workloads on a machine from stepping on
each others' toes. We were not able to do this properly without the
pressure metrics; we would see latency or bandwidth drops, but it
would often be hard or impossible to root-cause post-mortem. We now
log and graph the pressure metrics for all containers in our fleet and
can trivially link service drops to resource pressure after the fact.
How do you use this?
A kernel with CONFIG_PSI=y will create a /proc/pressure directory with
3 files: cpu, memory, and io. If using cgroup2, cgroups will also have
cpu.pressure, memory.pressure and io.pressure files, which simply
calculate pressure at the cgroup level instead of system-wide.
The cpu file contains one line:
some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
The averages give the percentage of walltime in which some tasks are
delayed on the runqueue while another task has the CPU. They're recent
averages over 10s, 1m, 5m windows, so you can tell short term trends
from long term ones, similarly to the load average.
What to make of this number? If CPU utilization is at 100% and CPU
pressure is 0, it means the system is perfectly utilized, with one
runnable thread per CPU and nobody waiting. At two or more runnable
tasks per CPU, the system is 100% overcommitted and the pressure
average will indicate as much. From a utilization perspective this is
a great state of course: no CPU cycles go to waste, even if 50% of
the threads were to go idle (and most workloads do vary). From the
perspective of the individual job it's not great, however, and they
might do better with more resources. Depending on what your priority
is, an elevated "some" number may or may not require action.
The memory file contains two lines:
some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
The "some" line is the same as for cpu: the time in which at least one
task is stalled on the resource.
The "full" line, however, indicates time in which *nobody* is using the
CPU productively due to pressure: all non-idle tasks could be waiting
on thrashing cache simultaneously. It can also happen when a single
reclaimer occupies the CPU, since nothing else can make forward
progress during that time. CPU cycles are being wasted. Significant
time spent in there is a good trigger for killing, moving jobs to
other machines, or dropping incoming requests, since neither the jobs
nor the machine overall is making too much headway.
The total= value gives the absolute stall time in microseconds. This
allows detecting latency spikes that might be too short to sway the
running averages. It also allows custom time averaging in case the
10s/1m/5m windows aren't adequate for the use case (or are too coarse
with future hardware).
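For example, sampling total= twice over a short interval yields a
pressure percentage for an arbitrary window (a sketch; the sample
values are made up):

```python
def stall_share(total_us_before, total_us_after, interval_seconds):
    """Percentage of an interval spent stalled, from two total= samples (us)."""
    delta_us = total_us_after - total_us_before
    return 100.0 * delta_us / (interval_seconds * 1_000_000)

# 250ms of stall observed across a 500ms window -> 50% pressure,
# a spike that a 10-second average could easily smooth over.
print(stall_share(157656722, 157906722, 0.5))  # 50.0
```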
The io file is similar to memory. However, unlike CPU and memory, the
block layer doesn't have a concept of hardware contention. We cannot
know if the IO a task is waiting on is being performed by the device
or whether the device is busy with or slowed down by other requests. As a
result, we can tell how many CPU cycles go to waste due to IO delays,
but we can not identify the competition factor in those delays.
These patches are against v4.17-rc4.
Documentation/accounting/psi.txt | 73 ++++
Documentation/cgroup-v2.txt | 18 +
arch/powerpc/platforms/cell/cpufreq_spudemand.c | 2 +-
arch/powerpc/platforms/cell/spufs/sched.c | 9 +-
arch/s390/appldata/appldata_os.c | 4 -
drivers/cpuidle/governors/menu.c | 4 -
fs/proc/loadavg.c | 3 -
include/linux/cgroup-defs.h | 4 +
include/linux/cgroup.h | 15 +
include/linux/delayacct.h | 23 +
include/linux/mmzone.h | 1 +
include/linux/page-flags.h | 5 +-
include/linux/psi.h | 52 +++
include/linux/psi_types.h | 84 ++++
include/linux/sched.h | 10 +
include/linux/sched/loadavg.h | 90 +++-
include/linux/sched/stat.h | 10 +-
include/linux/swap.h | 2 +-
include/trace/events/mmflags.h | 1 +
include/uapi/linux/taskstats.h | 6 +-
init/Kconfig | 20 +
kernel/cgroup/cgroup.c | 45 +-
kernel/debug/kdb/kdb_main.c | 7 +-
kernel/delayacct.c | 15 +
kernel/fork.c | 4 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 3 +
kernel/sched/loadavg.c | 84 ----
kernel/sched/psi.c | 499 ++++++++++++++++++++++
kernel/sched/sched.h | 166 +++----
kernel/sched/stats.h | 91 +++-
mm/compaction.c | 5 +
mm/filemap.c | 27 +-
mm/huge_memory.c | 1 +
mm/memcontrol.c | 2 +
mm/migrate.c | 2 +
mm/page_alloc.c | 10 +
mm/swap_state.c | 1 +
mm/vmscan.c | 14 +
mm/vmstat.c | 1 +
mm/workingset.c | 113 +++--
tools/accounting/getdelays.c | 8 +-
42 files changed, 1279 insertions(+), 256 deletions(-)
Refaults happen during transitions between workingsets as well as
during in-place thrashing. Knowing the difference between the two has
a range of applications, including measuring the impact of memory
shortage on system performance, as well as enabling smarter balancing
of pressure between the filesystem cache and the swap-backed
workingset.
During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bona fide thrashing.
Introduce a new page flag that tells on eviction whether the page has
been active or not in its lifetime. This bit is then stored in the
shadow entry, to classify refaults as transitioning or thrashing.
How many page->flags does this leave us with on 32-bit?
20 bits are always page flags
21 if you have an MMU
23 with the zone bits for DMA, Normal, HighMem, Movable
29 with the sparsemem section bits
30 if PAE is enabled
31 with this patch.
So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA
nodes. If that's not enough, the system can switch to discontigmem and
regain the 6 or 7 sparsemem section bits.
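The tally above can be double-checked with a bit of arithmetic (the
grouping below just restates the increments from the list):

```python
# Restating the page->flags budget from the list above.
bits = {
    "always-present page flags": 20,
    "MMU": 1,
    "zone bits (DMA, Normal, HighMem, Movable)": 2,
    "sparsemem section bits": 6,
    "PAE": 1,
    "PG_workingset (this patch)": 1,
}
used = sum(bits.values())
print(used, 32 - used)  # 31 flags used, 1 bit left for NUMA nodes
```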
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/mmzone.h | 1 +
include/linux/page-flags.h | 5 +-
include/linux/swap.h | 2 +-
include/trace/events/mmflags.h | 1 +
mm/filemap.c | 9 ++--
mm/huge_memory.c | 1 +
mm/memcontrol.c | 2 +
mm/migrate.c | 2 +
mm/swap_state.c | 1 +
mm/vmscan.c | 1 +
mm/vmstat.c | 1 +
mm/workingset.c | 95 ++++++++++++++++++++++------------
12 files changed, 79 insertions(+), 42 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2dc52a..6af87946d241 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -163,6 +163,7 @@ enum node_stat_item {
NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
+ WORKINGSET_RESTORE,
WORKINGSET_NODERECLAIM,
NR_ANON_MAPPED, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e34a27727b9a..7af1c3c15d8e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -69,13 +69,14 @@
*/
enum pageflags {
PG_locked, /* Page is locked. Don't touch. */
- PG_error,
PG_referenced,
PG_uptodate,
PG_dirty,
PG_lru,
PG_active,
+ PG_workingset,
PG_waiters, /* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
+ PG_error,
PG_slab,
PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
PG_arch_1,
@@ -280,6 +281,8 @@ PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
+PAGEFLAG(Workingset, workingset, PF_HEAD)
+ TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
__PAGEFLAG(Slab, slab, PF_NO_TAIL)
__PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
PAGEFLAG(Checked, checked, PF_NO_COMPOUND) /* Used by some filesystems */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2417d288e016..d8c47dcdec6f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -296,7 +296,7 @@ struct vma_swap_readahead {
/* linux/mm/workingset.c */
void *workingset_eviction(struct address_space *mapping, struct page *page);
-bool workingset_refault(void *shadow);
+void workingset_refault(struct page *page, void *shadow);
void workingset_activation(struct page *page);
/* Do not use directly, use workingset_lookup_update */
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a81cffb76d89..a1675d43777e 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -88,6 +88,7 @@
{1UL << PG_dirty, "dirty" }, \
{1UL << PG_lru, "lru" }, \
{1UL << PG_active, "active" }, \
+ {1UL << PG_workingset, "workingset" }, \
{1UL << PG_slab, "slab" }, \
{1UL << PG_owner_priv_1, "owner_priv_1" }, \
{1UL << PG_arch_1, "arch_1" }, \
diff --git a/mm/filemap.c b/mm/filemap.c
index 0604cb02e6f3..bd36b7226cf4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -915,12 +915,9 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
* data from the working set, only to cache data that will
* get overwritten with something else, is a waste of memory.
*/
- if (!(gfp_mask & __GFP_WRITE) &&
- shadow && workingset_refault(shadow)) {
- SetPageActive(page);
- workingset_activation(page);
- } else
- ClearPageActive(page);
+ WARN_ON_ONCE(PageActive(page));
+ if (!(gfp_mask & __GFP_WRITE) && shadow)
+ workingset_refault(page, shadow);
lru_cache_add(page);
}
return ret;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a3a1815f8e11..82bb427dea20 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2370,6 +2370,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
(1L << PG_mlocked) |
(1L << PG_uptodate) |
(1L << PG_active) |
+ (1L << PG_workingset) |
(1L << PG_locked) |
(1L << PG_unevictable) |
(1L << PG_dirty)));
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2bd3df3d101a..c59519d600ea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5283,6 +5283,8 @@ static int memory_stat_show(struct seq_file *m, void *v)
stat[WORKINGSET_REFAULT]);
seq_printf(m, "workingset_activate %lu\n",
stat[WORKINGSET_ACTIVATE]);
+ seq_printf(m, "workingset_restore %lu\n",
+ stat[WORKINGSET_RESTORE]);
seq_printf(m, "workingset_nodereclaim %lu\n",
stat[WORKINGSET_NODERECLAIM]);
diff --git a/mm/migrate.c b/mm/migrate.c
index 568433023831..94bfa52dc610 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -684,6 +684,8 @@ void migrate_page_states(struct page *newpage, struct page *page)
SetPageActive(newpage);
} else if (TestClearPageUnevictable(page))
SetPageUnevictable(newpage);
+ if (PageWorkingset(page))
+ SetPageWorkingset(newpage);
if (PageChecked(page))
SetPageChecked(newpage);
if (PageMappedToDisk(page))
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 07f9aa2340c3..2721ef8862d1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -451,6 +451,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
/*
* Initiate read into locked page and return.
*/
+ SetPageWorkingset(new_page);
lru_cache_add_anon(new_page);
*new_page_allocated = true;
return new_page;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9b697323a88c..4ae5d0eb9489 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1976,6 +1976,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
}
ClearPageActive(page); /* we are de-activating */
+ SetPageWorkingset(page);
list_add(&page->lru, &l_inactive);
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 536332e988b8..3f02e8672356 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1145,6 +1145,7 @@ const char * const vmstat_text[] = {
"nr_isolated_file",
"workingset_refault",
"workingset_activate",
+ "workingset_restore",
"workingset_nodereclaim",
"nr_anon_pages",
"nr_mapped",
diff --git a/mm/workingset.c b/mm/workingset.c
index 53759a3cf99a..ef6be3d92116 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -121,7 +121,7 @@
* the only thing eating into inactive list space is active pages.
*
*
- * Activating refaulting pages
+ * Refaulting inactive pages
*
* All that is known about the active list is that the pages have been
* accessed more than once in the past. This means that at any given
@@ -134,6 +134,10 @@
* used less frequently than the refaulting page - or even not used at
* all anymore.
*
+ * That means if inactive cache is refaulting with a suitable refault
+ * distance, we assume the cache workingset is transitioning and put
+ * pressure on the current active list.
+ *
* If this is wrong and demotion kicks in, the pages which are truly
* used more frequently will be reactivated while the less frequently
* used once will be evicted from memory.
@@ -141,6 +145,14 @@
* But if this is right, the stale pages will be pushed out of memory
* and the used pages get to stay in cache.
*
+ * Refaulting active pages
+ *
+ * If on the other hand the refaulting pages have recently been
+ * deactivated, it means that the active list is no longer protecting
+ * actively used cache from reclaim. The cache is NOT transitioning to
+ * a different workingset; the existing workingset is thrashing in the
+ * space allocated to the page cache.
+ *
*
* Implementation
*
@@ -156,8 +168,7 @@
*/
#define EVICTION_SHIFT (RADIX_TREE_EXCEPTIONAL_ENTRY + \
- NODES_SHIFT + \
- MEM_CGROUP_ID_SHIFT)
+ 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT)
#define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
/*
@@ -170,23 +181,28 @@
*/
static unsigned int bucket_order __read_mostly;
-static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
+static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
+ bool workingset)
{
eviction >>= bucket_order;
eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
+ eviction = (eviction << 1) | workingset;
eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
}
static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
- unsigned long *evictionp)
+ unsigned long *evictionp, bool *workingsetp)
{
unsigned long entry = (unsigned long)shadow;
int memcgid, nid;
+ bool workingset;
entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+ workingset = entry & 1;
+ entry >>= 1;
nid = entry & ((1UL << NODES_SHIFT) - 1);
entry >>= NODES_SHIFT;
memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
@@ -195,6 +211,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
*memcgidp = memcgid;
*pgdat = NODE_DATA(nid);
*evictionp = entry << bucket_order;
+ *workingsetp = workingset;
}
/**
@@ -207,8 +224,8 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
*/
void *workingset_eviction(struct address_space *mapping, struct page *page)
{
- struct mem_cgroup *memcg = page_memcg(page);
struct pglist_data *pgdat = page_pgdat(page);
+ struct mem_cgroup *memcg = page_memcg(page);
int memcgid = mem_cgroup_id(memcg);
unsigned long eviction;
struct lruvec *lruvec;
@@ -220,30 +237,30 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
lruvec = mem_cgroup_lruvec(pgdat, memcg);
eviction = atomic_long_inc_return(&lruvec->inactive_age);
- return pack_shadow(memcgid, pgdat, eviction);
+ return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
}
/**
* workingset_refault - evaluate the refault of a previously evicted page
+ * @page: the freshly allocated replacement page
* @shadow: shadow entry of the evicted page
*
* Calculates and evaluates the refault distance of the previously
* evicted page in the context of the node it was allocated in.
- *
- * Returns %true if the page should be activated, %false otherwise.
*/
-bool workingset_refault(void *shadow)
+void workingset_refault(struct page *page, void *shadow)
{
unsigned long refault_distance;
+ struct pglist_data *pgdat;
unsigned long active_file;
struct mem_cgroup *memcg;
unsigned long eviction;
struct lruvec *lruvec;
unsigned long refault;
- struct pglist_data *pgdat;
+ bool workingset;
int memcgid;
- unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
+ unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
rcu_read_lock();
/*
@@ -263,41 +280,51 @@ bool workingset_refault(void *shadow)
* configurations instead.
*/
memcg = mem_cgroup_from_id(memcgid);
- if (!mem_cgroup_disabled() && !memcg) {
- rcu_read_unlock();
- return false;
- }
+ if (!mem_cgroup_disabled() && !memcg)
+ goto out;
lruvec = mem_cgroup_lruvec(pgdat, memcg);
refault = atomic_long_read(&lruvec->inactive_age);
active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
/*
- * The unsigned subtraction here gives an accurate distance
- * across inactive_age overflows in most cases.
+ * Calculate the refault distance
*
- * There is a special case: usually, shadow entries have a
- * short lifetime and are either refaulted or reclaimed along
- * with the inode before they get too old. But it is not
- * impossible for the inactive_age to lap a shadow entry in
- * the field, which can then can result in a false small
- * refault distance, leading to a false activation should this
- * old entry actually refault again. However, earlier kernels
- * used to deactivate unconditionally with *every* reclaim
- * invocation for the longest time, so the occasional
- * inappropriate activation leading to pressure on the active
- * list is not a problem.
+ * The unsigned subtraction here gives an accurate distance
+ * across inactive_age overflows in most cases. There is a
+ * special case: usually, shadow entries have a short lifetime
+ * and are either refaulted or reclaimed along with the inode
+ * before they get too old. But it is not impossible for the
+ * inactive_age to lap a shadow entry in the field, which can
+ * then can result in a false small refault distance, leading
+ * to a false activation should this old entry actually
+ * refault again. However, earlier kernels used to deactivate
+ * unconditionally with *every* reclaim invocation for the
+ * longest time, so the occasional inappropriate activation
+ * leading to pressure on the active list is not a problem.
*/
refault_distance = (refault - eviction) & EVICTION_MASK;
inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
- if (refault_distance <= active_file) {
- inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
- rcu_read_unlock();
- return true;
+ /*
+ * Compare the distance to the existing workingset size. We
+ * don't act on pages that couldn't stay resident even if all
+ * the memory was available to the page cache.
+ */
+ if (refault_distance > active_file)
+ goto out;
+
+ SetPageActive(page);
+ atomic_long_inc(&lruvec->inactive_age);
+ inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+
+ /* Page was active prior to eviction */
+ if (workingset) {
+ SetPageWorkingset(page);
+ inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
}
+out:
rcu_read_unlock();
- return false;
}
/**
--
2.17.0
There are several copies of these functions/macros in places that
mess with fixed-point load averages. Provide an official version.
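For reference, the fixed-point scheme behind these helpers can be
sketched in Python; the constants mirror the kernel's FSHIFT, FIXED_1
and EXP_1:

```python
FSHIFT = 11
FIXED_1 = 1 << FSHIFT   # 2048 represents 1.0
EXP_1 = 1884            # 1/exp(5sec/1min) in fixed point

def calc_load(load, exp, active):
    """One update step of a1 = a0*e + a*(1 - e), in fixed point."""
    newload = load * exp + active * (FIXED_1 - exp)
    if active >= load:
        newload += FIXED_1 - 1  # round up when ramping, truncate when decaying
    return newload // FIXED_1

def load_int(x):  return x >> FSHIFT
def load_frac(x): return load_int((x & (FIXED_1 - 1)) * 100)

avg = 0
for _ in range(100):  # 100 five-second ticks with one runnable task
    avg = calc_load(avg, EXP_1, 1 * FIXED_1)
print(f"{load_int(avg)}.{load_frac(avg):02d}")  # prints 1.00
```

The asymmetric rounding is why a constant load settles at exactly 1.00
instead of hovering just below it.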
Signed-off-by: Johannes Weiner <[email protected]>
---
.../platforms/cell/cpufreq_spudemand.c | 2 +-
arch/powerpc/platforms/cell/spufs/sched.c | 9 +++-----
arch/s390/appldata/appldata_os.c | 4 ----
drivers/cpuidle/governors/menu.c | 4 ----
fs/proc/loadavg.c | 3 ---
include/linux/sched/loadavg.h | 21 +++++++++++++++----
kernel/debug/kdb/kdb_main.c | 7 +------
kernel/sched/loadavg.c | 15 -------------
8 files changed, 22 insertions(+), 43 deletions(-)
diff --git a/arch/powerpc/platforms/cell/cpufreq_spudemand.c b/arch/powerpc/platforms/cell/cpufreq_spudemand.c
index 882944c36ef5..5d8e8b6bb1cc 100644
--- a/arch/powerpc/platforms/cell/cpufreq_spudemand.c
+++ b/arch/powerpc/platforms/cell/cpufreq_spudemand.c
@@ -49,7 +49,7 @@ static int calc_freq(struct spu_gov_info_struct *info)
cpu = info->policy->cpu;
busy_spus = atomic_read(&cbe_spu_info[cpu_to_node(cpu)].busy_spus);
- CALC_LOAD(info->busy_spus, EXP, busy_spus * FIXED_1);
+ info->busy_spus = calc_load(info->busy_spus, EXP, busy_spus * FIXED_1);
pr_debug("cpu %d: busy_spus=%d, info->busy_spus=%ld\n",
cpu, busy_spus, info->busy_spus);
diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index ccc421503363..70101510b19d 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -987,9 +987,9 @@ static void spu_calc_load(void)
unsigned long active_tasks; /* fixed-point */
active_tasks = count_active_contexts() * FIXED_1;
- CALC_LOAD(spu_avenrun[0], EXP_1, active_tasks);
- CALC_LOAD(spu_avenrun[1], EXP_5, active_tasks);
- CALC_LOAD(spu_avenrun[2], EXP_15, active_tasks);
+ spu_avenrun[0] = calc_load(spu_avenrun[0], EXP_1, active_tasks);
+ spu_avenrun[1] = calc_load(spu_avenrun[1], EXP_5, active_tasks);
+ spu_avenrun[2] = calc_load(spu_avenrun[2], EXP_15, active_tasks);
}
static void spusched_wake(struct timer_list *unused)
@@ -1071,9 +1071,6 @@ void spuctx_switch_state(struct spu_context *ctx,
}
}
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
static int show_spu_loadavg(struct seq_file *s, void *private)
{
int a, b, c;
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index 433a994b1a89..54f375627532 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -25,10 +25,6 @@
#include "appldata.h"
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
/*
* OS data
*
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 1bfe03ceb236..3738b670df7a 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -133,10 +133,6 @@ struct menu_device {
int interval_ptr;
};
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
static inline int get_loadavg(unsigned long load)
{
return LOAD_INT(load) * 10 + LOAD_FRAC(load) / 10;
diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index b572cc865b92..8bee50a97c0f 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -10,9 +10,6 @@
#include <linux/seqlock.h>
#include <linux/time.h>
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
static int loadavg_proc_show(struct seq_file *m, void *v)
{
unsigned long avnrun[3];
diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index 80bc84ba5d2a..cc9cc62bb1f8 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -22,10 +22,23 @@ extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);
#define EXP_5 2014 /* 1/exp(5sec/5min) */
#define EXP_15 2037 /* 1/exp(5sec/15min) */
-#define CALC_LOAD(load,exp,n) \
- load *= exp; \
- load += n*(FIXED_1-exp); \
- load >>= FSHIFT;
+/*
+ * a1 = a0 * e + a * (1 - e)
+ */
+static inline unsigned long
+calc_load(unsigned long load, unsigned long exp, unsigned long active)
+{
+ unsigned long newload;
+
+ newload = load * exp + active * (FIXED_1 - exp);
+ if (active >= load)
+ newload += FIXED_1-1;
+
+ return newload / FIXED_1;
+}
+
+#define LOAD_INT(x) ((x) >> FSHIFT)
+#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
extern void calc_global_load(unsigned long ticks);
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index e405677ee08d..a8f5aca5eb5e 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -2556,16 +2556,11 @@ static int kdb_summary(int argc, const char **argv)
}
kdb_printf("%02ld:%02ld\n", val.uptime/(60*60), (val.uptime/60)%60);
- /* lifted from fs/proc/proc_misc.c::loadavg_read_proc() */
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
kdb_printf("load avg %ld.%02ld %ld.%02ld %ld.%02ld\n",
LOAD_INT(val.loads[0]), LOAD_FRAC(val.loads[0]),
LOAD_INT(val.loads[1]), LOAD_FRAC(val.loads[1]),
LOAD_INT(val.loads[2]), LOAD_FRAC(val.loads[2]));
-#undef LOAD_INT
-#undef LOAD_FRAC
+
/* Display in kilobytes */
#define K(x) ((x) << (PAGE_SHIFT - 10))
kdb_printf("\nMemTotal: %8lu kB\nMemFree: %8lu kB\n"
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index a171c1258109..54fbdfb2d86c 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -91,21 +91,6 @@ long calc_load_fold_active(struct rq *this_rq, long adjust)
return delta;
}
-/*
- * a1 = a0 * e + a * (1 - e)
- */
-static unsigned long
-calc_load(unsigned long load, unsigned long exp, unsigned long active)
-{
- unsigned long newload;
-
- newload = load * exp + active * (FIXED_1 - exp);
- if (active >= load)
- newload += FIXED_1-1;
-
- return newload / FIXED_1;
-}
-
#ifdef CONFIG_NO_HZ_COMMON
/*
* Handle NO_HZ for the global load-average.
--
2.17.0
When systems are overcommitted and resources become contended, it's
hard to tell exactly what impact this has on workload productivity, or
how close the system is to lockups and OOM kills. In particular, when
machines work multiple jobs concurrently, the impact of overcommit in
terms of latency and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing
individual job health or risk complete machine lockups, this patch
implements a way to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or
IO, respectively. Stall states are aggregate versions of the per-task
delay accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure
percentages, and they give a general sense of system health and
productivity loss incurred by resource overcommit. They can also
indicate when the system is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each
CPU and samples the time they spend in stall states. Every 2 seconds,
the samples are averaged across CPUs - weighted by the CPUs' non-idle
time to eliminate artifacts from unused CPUs - and translated into
percentages of walltime. A running average of those percentages is
maintained over 10s, 1m, and 5m periods (similar to the load average).
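A simplified floating-point model of that averaging (the kernel uses
fixed-point math and weights each CPU's contribution by its non-idle
time; the decay factors here are just the continuous-time
equivalents):

```python
import math

PERIOD = 2.0  # seconds between samples
DECAY = {w: math.exp(-PERIOD / w) for w in (10, 60, 300)}  # per-window e

def update_averages(avgs, stalled_ns, nonidle_ns):
    """Fold one sample into the 10s/1m/5m running averages (percent)."""
    sample_pct = 100.0 * stalled_ns / max(nonidle_ns, 1)
    return {w: avgs[w] * e + sample_pct * (1.0 - e) for w, e in DECAY.items()}

avgs = {10: 0.0, 60: 0.0, 300: 0.0}
for _ in range(300):  # ten minutes of 50%-stalled samples
    avgs = update_averages(avgs, stalled_ns=10**9, nonidle_ns=2 * 10**9)
# shorter windows converge first: avg10 ~= 50, avg300 is still catching up
```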
Signed-off-by: Johannes Weiner <[email protected]>
---
Documentation/accounting/psi.txt | 73 ++++++
include/linux/psi.h | 27 ++
include/linux/psi_types.h | 84 ++++++
include/linux/sched.h | 10 +
include/linux/sched/stat.h | 10 +-
init/Kconfig | 16 ++
kernel/fork.c | 4 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 3 +
kernel/sched/psi.c | 424 +++++++++++++++++++++++++++++++
kernel/sched/sched.h | 166 ++++++------
kernel/sched/stats.h | 91 ++++++-
mm/compaction.c | 5 +
mm/filemap.c | 15 +-
mm/page_alloc.c | 10 +
mm/vmscan.c | 13 +
16 files changed, 859 insertions(+), 93 deletions(-)
create mode 100644 Documentation/accounting/psi.txt
create mode 100644 include/linux/psi.h
create mode 100644 include/linux/psi_types.h
create mode 100644 kernel/sched/psi.c
diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
new file mode 100644
index 000000000000..e051810d5127
--- /dev/null
+++ b/Documentation/accounting/psi.txt
@@ -0,0 +1,73 @@
+================================
+PSI - Pressure Stall Information
+================================
+
+:Date: April, 2018
+:Author: Johannes Weiner <[email protected]>
+
+When CPU, memory or IO devices are contended, workloads experience
+latency spikes, throughput losses, and run the risk of OOM kills.
+
+Without an accurate measure of such contention, users are forced to
+either play it safe and under-utilize their hardware resources, or
+roll the dice and frequently suffer the disruptions resulting from
+excessive overcommit.
+
+The psi feature identifies and quantifies the disruptions caused by
+such resource crunches and the time impact it has on complex workloads
+or even entire systems.
+
+Having an accurate measure of productivity losses caused by resource
+scarcity aids users in sizing workloads to hardware--or provisioning
+hardware according to workload demand.
+
+As psi aggregates this information in realtime, systems can be managed
+dynamically using techniques such as load shedding, migrating jobs to
+other systems or data centers, or strategically pausing or killing low
+priority or restartable batch jobs.
+
+This allows maximizing hardware utilization without sacrificing
+workload health or risking major disruptions such as OOM kills.
+
+Pressure interface
+==================
+
+Pressure information for each resource is exported through the
+respective file in /proc/pressure/ -- cpu, memory, and io.
+
+The format for CPU is as such:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+and for memory and IO:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+full avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+The "some" line indicates the share of time in which at least some
+tasks are stalled on a given resource.
+
+The "full" line indicates the share of time in which all non-idle
+tasks are stalled on a given resource simultaneously. In this state
+actual CPU cycles are going to waste, and a workload that spends
+extended time in this state is considered to be thrashing. This has
+severe impact on performance, and it's useful to distinguish this
+situation from a state where some tasks are stalled but the CPU is
+still doing productive work. As such, time spent in this subset of the
+stall state is tracked separately and exported in the "full" averages.
+
+The ratios are tracked as recent trends over ten, sixty, and three
+hundred second windows, which gives insight into short term events as
+well as medium and long term trends. The total absolute stall time is
+tracked and exported as well, to allow detection of latency spikes
+which wouldn't necessarily make a dent in the time averages, or to
+average trends over custom time frames.
+
+Cgroup2 interface
+=================
+
+In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem
+mounted, pressure stall information is also tracked for tasks grouped
+into cgroups. Each subdirectory in the cgroupfs mountpoint contains
+cpu.pressure, memory.pressure, and io.pressure files; the format is
+the same as the /proc/pressure/ files.
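To make the "some" vs "full" distinction in the documentation above
concrete, here is a hypothetical userspace sketch (not part of this
patch) that classifies one CPU's memory pressure state from its task
counts, mirroring the decision made in psi_group_update() later in
this series:

```c
#include <assert.h>

/* Illustrative sketch only, not kernel code: classify a CPU's memory
 * pressure state from its task counts, following the logic of the
 * patch's psi_group_update(). */
enum psi_state { PSI_NONE, PSI_SOME, PSI_FULL };

static enum psi_state mem_state(unsigned int nr_running,
				unsigned int nr_memstall,
				int curr_in_memstall)
{
	if (!nr_memstall)
		return PSI_NONE;	/* nobody is stalled on memory */
	if (!nr_running || curr_in_memstall)
		return PSI_FULL;	/* no productive work is happening */
	return PSI_SOME;		/* stalled and working tasks coexist */
}
```

Note that a running task which is itself in a memory stall
(PF_MEMSTALL) counts as "full": the CPU is busy, but only with
reclaim, not productive work.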
diff --git a/include/linux/psi.h b/include/linux/psi.h
new file mode 100644
index 000000000000..371af1479699
--- /dev/null
+++ b/include/linux/psi.h
@@ -0,0 +1,27 @@
+#ifndef _LINUX_PSI_H
+#define _LINUX_PSI_H
+
+#include <linux/psi_types.h>
+#include <linux/sched.h>
+
+#ifdef CONFIG_PSI
+
+extern bool psi_disabled;
+
+void psi_init(void);
+
+void psi_task_change(struct task_struct *task, u64 now, int clear, int set);
+
+void psi_memstall_enter(unsigned long *flags);
+void psi_memstall_leave(unsigned long *flags);
+
+#else /* CONFIG_PSI */
+
+static inline void psi_init(void) {}
+
+static inline void psi_memstall_enter(unsigned long *flags) {}
+static inline void psi_memstall_leave(unsigned long *flags) {}
+
+#endif /* CONFIG_PSI */
+
+#endif /* _LINUX_PSI_H */
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
new file mode 100644
index 000000000000..b22b0ffc729d
--- /dev/null
+++ b/include/linux/psi_types.h
@@ -0,0 +1,84 @@
+#ifndef _LINUX_PSI_TYPES_H
+#define _LINUX_PSI_TYPES_H
+
+#include <linux/types.h>
+
+#ifdef CONFIG_PSI
+
+/* Tracked task states */
+enum psi_task_count {
+ NR_RUNNING,
+ NR_IOWAIT,
+ NR_MEMSTALL,
+ NR_PSI_TASK_COUNTS,
+};
+
+/* Task state bitmasks */
+#define TSK_RUNNING (1 << NR_RUNNING)
+#define TSK_IOWAIT (1 << NR_IOWAIT)
+#define TSK_MEMSTALL (1 << NR_MEMSTALL)
+
+/* Resources that workloads could be stalled on */
+enum psi_res {
+ PSI_CPU,
+ PSI_MEM,
+ PSI_IO,
+ NR_PSI_RESOURCES,
+};
+
+/* Pressure states for a group of tasks */
+enum psi_state {
+ PSI_NONE, /* No stalled tasks */
+ PSI_SOME, /* Stalled tasks & working tasks */
+ PSI_FULL, /* Stalled tasks & no working tasks */
+ NR_PSI_STATES,
+};
+
+struct psi_resource {
+ /* Current pressure state for this resource */
+ enum psi_state state;
+
+ /* Start of current state (cpu_clock) */
+ u64 state_start;
+
+ /* Time sampling buckets for pressure states (ns) */
+ u64 times[NR_PSI_STATES - 1];
+};
+
+struct psi_group_cpu {
+ /* States of the tasks belonging to this group */
+ unsigned int tasks[NR_PSI_TASK_COUNTS];
+
+ /* Per-resource pressure tracking in this group */
+ struct psi_resource res[NR_PSI_RESOURCES];
+
+ /* There are runnable or D-state tasks */
+ bool nonidle;
+
+ /* Start of current non-idle state (cpu_clock) */
+ u64 nonidle_start;
+
+ /* Time sampling bucket for non-idle state (ns) */
+ u64 nonidle_time;
+};
+
+struct psi_group {
+ struct psi_group_cpu *cpus;
+
+ struct delayed_work clock_work;
+ unsigned long period_expires;
+
+ u64 some[NR_PSI_RESOURCES];
+ u64 full[NR_PSI_RESOURCES];
+
+ unsigned long avg_some[NR_PSI_RESOURCES][3];
+ unsigned long avg_full[NR_PSI_RESOURCES][3];
+};
+
+#else /* CONFIG_PSI */
+
+struct psi_group { };
+
+#endif /* CONFIG_PSI */
+
+#endif /* _LINUX_PSI_TYPES_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b3d697f3b573..d854652f9603 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -25,6 +25,7 @@
#include <linux/latencytop.h>
#include <linux/sched/prio.h>
#include <linux/signal_types.h>
+#include <linux/psi_types.h>
#include <linux/mm_types_task.h>
#include <linux/task_io_accounting.h>
@@ -669,6 +670,10 @@ struct task_struct {
unsigned sched_contributes_to_load:1;
unsigned sched_migrated:1;
unsigned sched_remote_wakeup:1;
+#ifdef CONFIG_PSI
+ unsigned sched_psi_wake_requeue:1;
+#endif
+
/* Force alignment to the next boundary: */
unsigned :0;
@@ -916,6 +921,10 @@ struct task_struct {
siginfo_t *last_siginfo;
struct task_io_accounting ioac;
+#ifdef CONFIG_PSI
+ /* Pressure stall state */
+ unsigned int psi_flags;
+#endif
#ifdef CONFIG_TASK_XACCT
/* Accumulated RSS usage: */
u64 acct_rss_mem1;
@@ -1345,6 +1354,7 @@ extern struct pid *cad_pid;
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
+#define PF_MEMSTALL 0x01000000 /* Stalled due to lack of memory */
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
index 04f1321d14c4..ac39435d1521 100644
--- a/include/linux/sched/stat.h
+++ b/include/linux/sched/stat.h
@@ -28,10 +28,14 @@ static inline int sched_info_on(void)
return 1;
#elif defined(CONFIG_TASK_DELAY_ACCT)
extern int delayacct_on;
- return delayacct_on;
-#else
- return 0;
+ if (delayacct_on)
+ return 1;
+#elif defined(CONFIG_PSI)
+ extern int psi_disabled;
+ if (!psi_disabled)
+ return 1;
#endif
+ return 0;
}
#ifdef CONFIG_SCHEDSTATS
diff --git a/init/Kconfig b/init/Kconfig
index f013afc74b11..36208c2a386c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -457,6 +457,22 @@ config TASK_IO_ACCOUNTING
Say N if unsure.
+config PSI
+ bool "Pressure stall information tracking"
+ select SCHED_INFO
+ help
+ Collect metrics that indicate how overcommitted the CPU, memory,
+ and IO capacity are in the system.
+
+ If you say Y here, the kernel will create /proc/pressure/ with the
+ pressure statistics files cpu, memory, and io. These will indicate
+ the share of walltime in which some or all tasks in the system are
+ delayed due to contention of the respective resource.
+
+ For more details see Documentation/accounting/psi.txt.
+
+ Say N if unsure.
+
endmenu # "CPU/Task time and stats accounting"
config CPU_ISOLATION
diff --git a/kernel/fork.c b/kernel/fork.c
index a5d21c42acfc..067aa5c28526 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1704,6 +1704,10 @@ static __latent_entropy struct task_struct *copy_process(
p->default_timer_slack_ns = current->timer_slack_ns;
+#ifdef CONFIG_PSI
+ p->psi_flags = 0;
+#endif
+
task_io_accounting_init(&p->ioac);
acct_clear_integrals(p);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index d9a02b318108..b29bc18f2704 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_CPU_FREQ) += cpufreq.o
obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
obj-$(CONFIG_MEMBARRIER) += membarrier.o
obj-$(CONFIG_CPU_ISOLATION) += isolation.o
+obj-$(CONFIG_PSI) += psi.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e10aaeebfcc..e663333ec6fb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2038,6 +2038,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
if (task_cpu(p) != cpu) {
wake_flags |= WF_MIGRATED;
+ psi_ttwu_dequeue(p);
set_task_cpu(p, cpu);
}
@@ -6113,6 +6114,8 @@ void __init sched_init(void)
init_schedstats();
+ psi_init();
+
scheduler_running = 1;
}
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
new file mode 100644
index 000000000000..052c529a053b
--- /dev/null
+++ b/kernel/sched/psi.c
@@ -0,0 +1,424 @@
+/*
+ * Measure workload productivity impact from overcommitting CPU, memory, IO
+ *
+ * Copyright (c) 2017 Facebook, Inc.
+ * Author: Johannes Weiner <[email protected]>
+ *
+ * Implementation
+ *
+ * Task states -- running, iowait, memstall -- are tracked through the
+ * scheduler and aggregated into a system-wide productivity state. The
+ * ratio between the times spent in productive states and delays tells
+ * us the overall productivity of the workload.
+ *
+ * The ratio is tracked in decaying time averages over 10s, 1m, 5m
+ * windows. Cumulative stall times are tracked and exported as well to
+ * allow detection of latency spikes and custom time averaging.
+ *
+ * Multiple CPUs
+ *
+ * To avoid cache contention, times are tracked local to the CPUs. To
+ * get a comprehensive view of a system or cgroup, we have to consider
+ * the fact that CPUs could be unevenly loaded or even entirely idle
+ * if the workload doesn't have enough threads. To avoid artifacts
+ * caused by that, when adding up the global pressure ratio, the
+ * CPU-local ratios are weighed according to their non-idle time:
+ *
+ * Time the CPU had stalled tasks Time the CPU was non-idle
+ * ------------------------------ * ---------------------------
+ * Walltime Time all CPUs were non-idle
+ */
+
+#include <linux/sched/loadavg.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+#include <linux/cgroup.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/psi.h>
+#include "sched.h"
+
+static int psi_bug __read_mostly;
+
+bool psi_disabled __read_mostly;
+core_param(psi_disabled, psi_disabled, bool, 0644);
+
+/* Running averages - we need to be higher-res than loadavg */
+#define MY_LOAD_FREQ (2*HZ+1) /* 2 sec intervals */
+#define EXP_10s 1677 /* 1/exp(2s/10s) as fixed-point */
+#define EXP_60s 1981 /* 1/exp(2s/60s) */
+#define EXP_300s 2034 /* 1/exp(2s/300s) */
+
+/* Load frequency in nanoseconds */
+static u64 load_period __read_mostly;
+
+/* System-level pressure tracking */
+static DEFINE_PER_CPU(struct psi_group_cpu, system_group_cpus);
+static struct psi_group psi_system = {
+ .cpus = &system_group_cpus,
+};
+
+static void psi_clock(struct work_struct *work);
+
+static void psi_group_init(struct psi_group *group)
+{
+ group->period_expires = jiffies + MY_LOAD_FREQ;
+ INIT_DELAYED_WORK(&group->clock_work, psi_clock);
+}
+
+void __init psi_init(void)
+{
+ load_period = jiffies_to_nsecs(MY_LOAD_FREQ);
+ psi_group_init(&psi_system);
+}
+
+static void calc_avgs(unsigned long avg[3], u64 time, int missed_periods)
+{
+ unsigned long pct;
+
+ /* Sample the most recent active period */
+ pct = time * 100 / load_period;
+ pct *= FIXED_1;
+ avg[0] = calc_load(avg[0], EXP_10s, pct);
+ avg[1] = calc_load(avg[1], EXP_60s, pct);
+ avg[2] = calc_load(avg[2], EXP_300s, pct);
+
+ /* Fill in zeroes for periods of no activity */
+ if (missed_periods) {
+ avg[0] = calc_load_n(avg[0], EXP_10s, 0, missed_periods);
+ avg[1] = calc_load_n(avg[1], EXP_60s, 0, missed_periods);
+ avg[2] = calc_load_n(avg[2], EXP_300s, 0, missed_periods);
+ }
+}
+
+static void psi_clock(struct work_struct *work)
+{
+ u64 some[NR_PSI_RESOURCES] = { 0, };
+ u64 full[NR_PSI_RESOURCES] = { 0, };
+ unsigned long nonidle_total = 0;
+ unsigned long missed_periods;
+ struct delayed_work *dwork;
+ struct psi_group *group;
+ unsigned long expires;
+ int cpu;
+ int r;
+
+ dwork = to_delayed_work(work);
+ group = container_of(dwork, struct psi_group, clock_work);
+
+ /*
+ * Calculate the sampling period. The clock might have been
+ * stopped for a while.
+ */
+ expires = group->period_expires;
+ missed_periods = (jiffies - expires) / MY_LOAD_FREQ;
+ group->period_expires = expires + ((1 + missed_periods) * MY_LOAD_FREQ);
+
+ /*
+ * Aggregate the per-cpu state into a global state. Each CPU
+ * is weighted by its non-idle time in the sampling period.
+ */
+ for_each_online_cpu(cpu) {
+ struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
+ unsigned long nonidle;
+
+ nonidle = nsecs_to_jiffies(groupc->nonidle_time);
+ groupc->nonidle_time = 0;
+ nonidle_total += nonidle;
+
+ for (r = 0; r < NR_PSI_RESOURCES; r++) {
+ struct psi_resource *res = &groupc->res[r];
+
+ some[r] += (res->times[0] + res->times[1]) * nonidle;
+ full[r] += res->times[1] * nonidle;
+
+ /* It's racy, but we can tolerate some error */
+ res->times[0] = 0;
+ res->times[1] = 0;
+ }
+ }
+
+ for (r = 0; r < NR_PSI_RESOURCES; r++) {
+ /* Finish the weighted aggregation */
+ some[r] /= max(nonidle_total, 1UL);
+ full[r] /= max(nonidle_total, 1UL);
+
+ /* Accumulate stall time */
+ group->some[r] += some[r];
+ group->full[r] += full[r];
+
+ /* Calculate recent pressure averages */
+ calc_avgs(group->avg_some[r], some[r], missed_periods);
+ calc_avgs(group->avg_full[r], full[r], missed_periods);
+ }
+
+ /* Keep the clock ticking only when there is action */
+ if (nonidle_total)
+ schedule_delayed_work(dwork, MY_LOAD_FREQ);
+}
+
+static void time_state(struct psi_resource *res, int state, u64 now)
+{
+ if (res->state != PSI_NONE) {
+ bool was_full = res->state == PSI_FULL;
+
+ res->times[was_full] += now - res->state_start;
+ }
+ if (res->state != state)
+ res->state = state;
+ if (res->state != PSI_NONE)
+ res->state_start = now;
+}
+
+static void psi_group_update(struct psi_group *group, int cpu, u64 now,
+ unsigned int clear, unsigned int set)
+{
+ enum psi_state state = PSI_NONE;
+ struct psi_group_cpu *groupc;
+ unsigned int *tasks;
+ unsigned int to, bo;
+
+ groupc = per_cpu_ptr(group->cpus, cpu);
+ tasks = groupc->tasks;
+
+ /* Update task counts according to the set/clear bitmasks */
+ for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
+ int idx = to + (bo - 1);
+
+ if (tasks[idx] == 0 && !psi_bug) {
+ printk_deferred(KERN_ERR "psi: task underflow! cpu=%d idx=%d tasks=[%u %u %u %u]\n",
+ cpu, idx, tasks[0], tasks[1],
+ tasks[2], tasks[3]);
+ psi_bug = 1;
+ }
+ tasks[idx]--;
+ }
+ for (to = 0; (bo = ffs(set)); to += bo, set >>= bo)
+ tasks[to + (bo - 1)]++;
+
+ /* Time in which tasks wait for the CPU */
+ state = PSI_NONE;
+ if (tasks[NR_RUNNING] > 1)
+ state = PSI_SOME;
+ time_state(&groupc->res[PSI_CPU], state, now);
+
+ /* Time in which tasks wait for memory */
+ state = PSI_NONE;
+ if (tasks[NR_MEMSTALL]) {
+ if (!tasks[NR_RUNNING] ||
+ (cpu_curr(cpu)->flags & PF_MEMSTALL))
+ state = PSI_FULL;
+ else
+ state = PSI_SOME;
+ }
+ time_state(&groupc->res[PSI_MEM], state, now);
+
+ /* Time in which tasks wait for IO */
+ state = PSI_NONE;
+ if (tasks[NR_IOWAIT]) {
+ if (!tasks[NR_RUNNING])
+ state = PSI_FULL;
+ else
+ state = PSI_SOME;
+ }
+ time_state(&groupc->res[PSI_IO], state, now);
+
+ /* Time in which tasks are non-idle, to weigh the CPU in summaries */
+ if (groupc->nonidle)
+ groupc->nonidle_time += now - groupc->nonidle_start;
+ groupc->nonidle = tasks[NR_RUNNING] ||
+ tasks[NR_IOWAIT] || tasks[NR_MEMSTALL];
+ if (groupc->nonidle)
+ groupc->nonidle_start = now;
+
+ /* Kick the stats aggregation worker if it's gone to sleep */
+ if (!delayed_work_pending(&group->clock_work))
+ schedule_delayed_work(&group->clock_work, MY_LOAD_FREQ);
+}
+
+void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
+{
+ struct cgroup *cgroup, *parent;
+ int cpu = task_cpu(task);
+
+ if (psi_disabled)
+ return;
+
+ if (!task->pid)
+ return;
+
+ if (((task->psi_flags & set) ||
+ (task->psi_flags & clear) != clear) &&
+ !psi_bug) {
+ printk_deferred(KERN_ERR "psi: inconsistent task state! task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n",
+ task->pid, task->comm, cpu,
+ task->psi_flags, clear, set);
+ psi_bug = 1;
+ }
+
+ task->psi_flags &= ~clear;
+ task->psi_flags |= set;
+
+ psi_group_update(&psi_system, cpu, now, clear, set);
+}
+
+/**
+ * psi_memstall_enter - mark the beginning of a memory stall section
+ * @flags: flags to handle nested sections
+ *
+ * Marks the calling task as being stalled due to a lack of memory,
+ * such as waiting for a refault or performing reclaim.
+ */
+void psi_memstall_enter(unsigned long *flags)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ *flags = current->flags & PF_MEMSTALL;
+ if (*flags)
+ return;
+ /*
+ * PF_MEMSTALL setting & accounting needs to be atomic wrt
+ * changes to the task's scheduling state, otherwise we can
+ * race with CPU migration.
+ */
+ local_irq_disable();
+ rq = this_rq();
+ raw_spin_lock(&rq->lock);
+ rq_pin_lock(rq, &rf);
+
+ update_rq_clock(rq);
+
+ current->flags |= PF_MEMSTALL;
+ psi_task_change(current, rq_clock(rq), 0, TSK_MEMSTALL);
+
+ rq_unpin_lock(rq, &rf);
+ raw_spin_unlock(&rq->lock);
+ local_irq_enable();
+}
+
+/**
+ * psi_memstall_leave - mark the end of a memory stall section
+ * @flags: flags to handle nested sections
+ *
+ * Marks the calling task as no longer stalled due to lack of memory.
+ */
+void psi_memstall_leave(unsigned long *flags)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ if (*flags)
+ return;
+ /*
+ * PF_MEMSTALL clearing & accounting needs to be atomic wrt
+ * changes to the task's scheduling state, otherwise we could
+ * race with CPU migration.
+ */
+ local_irq_disable();
+ rq = this_rq();
+ raw_spin_lock(&rq->lock);
+ rq_pin_lock(rq, &rf);
+
+ update_rq_clock(rq);
+
+ current->flags &= ~PF_MEMSTALL;
+ psi_task_change(current, rq_clock(rq), TSK_MEMSTALL, 0);
+
+ rq_unpin_lock(rq, &rf);
+ raw_spin_unlock(&rq->lock);
+ local_irq_enable();
+}
+
+static int psi_show(struct seq_file *m, struct psi_group *group,
+ enum psi_res res)
+{
+ unsigned long avg[2][3];
+ int w;
+
+ if (psi_disabled)
+ return -EOPNOTSUPP;
+
+ for (w = 0; w < 3; w++) {
+ avg[0][w] = group->avg_some[res][w];
+ avg[1][w] = group->avg_full[res][w];
+ }
+
+ seq_printf(m, "some avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n",
+ LOAD_INT(avg[0][0]), LOAD_FRAC(avg[0][0]),
+ LOAD_INT(avg[0][1]), LOAD_FRAC(avg[0][1]),
+ LOAD_INT(avg[0][2]), LOAD_FRAC(avg[0][2]),
+ group->some[res] / NSEC_PER_USEC);
+
+ if (res == PSI_CPU)
+ return 0;
+
+ seq_printf(m, "full avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n",
+ LOAD_INT(avg[1][0]), LOAD_FRAC(avg[1][0]),
+ LOAD_INT(avg[1][1]), LOAD_FRAC(avg[1][1]),
+ LOAD_INT(avg[1][2]), LOAD_FRAC(avg[1][2]),
+ group->full[res] / NSEC_PER_USEC);
+
+ return 0;
+}
+
+static int psi_cpu_show(struct seq_file *m, void *v)
+{
+ return psi_show(m, &psi_system, PSI_CPU);
+}
+
+static int psi_memory_show(struct seq_file *m, void *v)
+{
+ return psi_show(m, &psi_system, PSI_MEM);
+}
+
+static int psi_io_show(struct seq_file *m, void *v)
+{
+ return psi_show(m, &psi_system, PSI_IO);
+}
+
+static int psi_cpu_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, psi_cpu_show, NULL);
+}
+
+static int psi_memory_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, psi_memory_show, NULL);
+}
+
+static int psi_io_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, psi_io_show, NULL);
+}
+
+static const struct file_operations psi_cpu_fops = {
+ .open = psi_cpu_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static const struct file_operations psi_memory_fops = {
+ .open = psi_memory_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static const struct file_operations psi_io_fops = {
+ .open = psi_io_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init psi_proc_init(void)
+{
+ proc_mkdir("pressure", NULL);
+ proc_create("pressure/cpu", 0, NULL, &psi_cpu_fops);
+ proc_create("pressure/memory", 0, NULL, &psi_memory_fops);
+ proc_create("pressure/io", 0, NULL, &psi_io_fops);
+ return 0;
+}
+module_init(psi_proc_init);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 15750c222ca2..1658477466d5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,6 +54,7 @@
#include <linux/proc_fs.h>
#include <linux/prefetch.h>
#include <linux/profile.h>
+#include <linux/psi.h>
#include <linux/rcupdate_wait.h>
#include <linux/security.h>
#include <linux/stackprotector.h>
@@ -320,6 +321,7 @@ extern bool dl_cpu_busy(unsigned int cpu);
#ifdef CONFIG_CGROUP_SCHED
#include <linux/cgroup.h>
+#include <linux/psi.h>
struct cfs_rq;
struct rt_rq;
@@ -919,6 +921,8 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() raw_cpu_ptr(&runqueues)
+extern void update_rq_clock(struct rq *rq);
+
static inline u64 __rq_clock_broken(struct rq *rq)
{
return READ_ONCE(rq->clock);
@@ -1037,6 +1041,86 @@ static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
#endif
}
+struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
+ __acquires(rq->lock);
+
+struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
+ __acquires(p->pi_lock)
+ __acquires(rq->lock);
+
+static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
+ __releases(rq->lock)
+{
+ rq_unpin_lock(rq, rf);
+ raw_spin_unlock(&rq->lock);
+}
+
+static inline void
+task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+ __releases(rq->lock)
+ __releases(p->pi_lock)
+{
+ rq_unpin_lock(rq, rf);
+ raw_spin_unlock(&rq->lock);
+ raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
+}
+
+static inline void
+rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
+ __acquires(rq->lock)
+{
+ raw_spin_lock_irqsave(&rq->lock, rf->flags);
+ rq_pin_lock(rq, rf);
+}
+
+static inline void
+rq_lock_irq(struct rq *rq, struct rq_flags *rf)
+ __acquires(rq->lock)
+{
+ raw_spin_lock_irq(&rq->lock);
+ rq_pin_lock(rq, rf);
+}
+
+static inline void
+rq_lock(struct rq *rq, struct rq_flags *rf)
+ __acquires(rq->lock)
+{
+ raw_spin_lock(&rq->lock);
+ rq_pin_lock(rq, rf);
+}
+
+static inline void
+rq_relock(struct rq *rq, struct rq_flags *rf)
+ __acquires(rq->lock)
+{
+ raw_spin_lock(&rq->lock);
+ rq_repin_lock(rq, rf);
+}
+
+static inline void
+rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
+ __releases(rq->lock)
+{
+ rq_unpin_lock(rq, rf);
+ raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
+}
+
+static inline void
+rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
+ __releases(rq->lock)
+{
+ rq_unpin_lock(rq, rf);
+ raw_spin_unlock_irq(&rq->lock);
+}
+
+static inline void
+rq_unlock(struct rq *rq, struct rq_flags *rf)
+ __releases(rq->lock)
+{
+ rq_unpin_lock(rq, rf);
+ raw_spin_unlock(&rq->lock);
+}
+
#ifdef CONFIG_NUMA
enum numa_topology_type {
NUMA_DIRECT,
@@ -1670,8 +1754,6 @@ static inline void sub_nr_running(struct rq *rq, unsigned count)
sched_update_tick_dependency(rq);
}
-extern void update_rq_clock(struct rq *rq);
-
extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
@@ -1752,86 +1834,6 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
static inline void sched_avg_update(struct rq *rq) { }
#endif
-struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
- __acquires(rq->lock);
-
-struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
- __acquires(p->pi_lock)
- __acquires(rq->lock);
-
-static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
- __releases(rq->lock)
-{
- rq_unpin_lock(rq, rf);
- raw_spin_unlock(&rq->lock);
-}
-
-static inline void
-task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
- __releases(rq->lock)
- __releases(p->pi_lock)
-{
- rq_unpin_lock(rq, rf);
- raw_spin_unlock(&rq->lock);
- raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
-}
-
-static inline void
-rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
- __acquires(rq->lock)
-{
- raw_spin_lock_irqsave(&rq->lock, rf->flags);
- rq_pin_lock(rq, rf);
-}
-
-static inline void
-rq_lock_irq(struct rq *rq, struct rq_flags *rf)
- __acquires(rq->lock)
-{
- raw_spin_lock_irq(&rq->lock);
- rq_pin_lock(rq, rf);
-}
-
-static inline void
-rq_lock(struct rq *rq, struct rq_flags *rf)
- __acquires(rq->lock)
-{
- raw_spin_lock(&rq->lock);
- rq_pin_lock(rq, rf);
-}
-
-static inline void
-rq_relock(struct rq *rq, struct rq_flags *rf)
- __acquires(rq->lock)
-{
- raw_spin_lock(&rq->lock);
- rq_repin_lock(rq, rf);
-}
-
-static inline void
-rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
- __releases(rq->lock)
-{
- rq_unpin_lock(rq, rf);
- raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
-}
-
-static inline void
-rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
- __releases(rq->lock)
-{
- rq_unpin_lock(rq, rf);
- raw_spin_unlock_irq(&rq->lock);
-}
-
-static inline void
-rq_unlock(struct rq *rq, struct rq_flags *rf)
- __releases(rq->lock)
-{
- rq_unpin_lock(rq, rf);
- raw_spin_unlock(&rq->lock);
-}
-
#ifdef CONFIG_SMP
#ifdef CONFIG_PREEMPT
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8aea199a39b4..cb4a68bcf37a 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -55,12 +55,90 @@ static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delt
# define schedstat_val_or_zero(var) 0
#endif /* CONFIG_SCHEDSTATS */
+#ifdef CONFIG_PSI
+/*
+ * PSI tracks state that persists across sleeps, such as iowaits and
+ * memory stalls. As a result, it has to distinguish between sleeps,
+ * where a task's runnable state changes, and requeues, where a task
+ * and its state are being moved between CPUs and runqueues.
+ */
+static inline void psi_enqueue(struct task_struct *p, u64 now)
+{
+ int clear = 0, set = TSK_RUNNING;
+
+ if (p->state == TASK_RUNNING || p->sched_psi_wake_requeue) {
+ if (p->flags & PF_MEMSTALL)
+ set |= TSK_MEMSTALL;
+ p->sched_psi_wake_requeue = 0;
+ } else {
+ if (p->in_iowait)
+ clear |= TSK_IOWAIT;
+ }
+
+ psi_task_change(p, now, clear, set);
+}
+static inline void psi_dequeue(struct task_struct *p, u64 now)
+{
+ int clear = TSK_RUNNING, set = 0;
+
+ if (p->state == TASK_RUNNING) {
+ if (p->flags & PF_MEMSTALL)
+ clear |= TSK_MEMSTALL;
+ } else {
+ if (p->in_iowait)
+ set |= TSK_IOWAIT;
+ }
+
+ psi_task_change(p, now, clear, set);
+}
+static inline void psi_ttwu_dequeue(struct task_struct *p)
+{
+ /*
+ * Is the task being migrated during a wakeup? Make sure to
+ * deregister its sleep-persistent psi states from the old
+ * queue, and let psi_enqueue() know it has to requeue.
+ */
+ if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
+ struct rq_flags rf;
+ struct rq *rq;
+ int clear = 0;
+
+ if (p->in_iowait)
+ clear |= TSK_IOWAIT;
+ if (p->flags & PF_MEMSTALL)
+ clear |= TSK_MEMSTALL;
+
+ rq = __task_rq_lock(p, &rf);
+ update_rq_clock(rq);
+ psi_task_change(p, rq_clock(rq), clear, 0);
+ p->sched_psi_wake_requeue = 1;
+ __task_rq_unlock(rq, &rf);
+ }
+}
+#else /* CONFIG_PSI */
+static inline void psi_enqueue(struct task_struct *p, u64 now)
+{
+}
+static inline void psi_dequeue(struct task_struct *p, u64 now)
+{
+}
+static inline void psi_ttwu_dequeue(struct task_struct *p)
+{
+}
+#endif /* CONFIG_PSI */
+
#ifdef CONFIG_SCHED_INFO
static inline void sched_info_reset_dequeued(struct task_struct *t)
{
t->sched_info.last_queued = 0;
}
+static inline void sched_info_reset_queued(struct task_struct *t, u64 now)
+{
+ if (!t->sched_info.last_queued)
+ t->sched_info.last_queued = now;
+}
+
/*
* We are interested in knowing how long it was from the *first* time a
* task was queued to the time that it finally hit a CPU, we call this routine
@@ -71,9 +149,11 @@ static inline void sched_info_dequeued(struct rq *rq, struct task_struct *t)
{
unsigned long long now = rq_clock(rq), delta = 0;
- if (unlikely(sched_info_on()))
+ if (unlikely(sched_info_on())) {
if (t->sched_info.last_queued)
delta = now - t->sched_info.last_queued;
+ psi_dequeue(t, now);
+ }
sched_info_reset_dequeued(t);
t->sched_info.run_delay += delta;
@@ -107,8 +187,10 @@ static void sched_info_arrive(struct rq *rq, struct task_struct *t)
static inline void sched_info_queued(struct rq *rq, struct task_struct *t)
{
if (unlikely(sched_info_on())) {
- if (!t->sched_info.last_queued)
- t->sched_info.last_queued = rq_clock(rq);
+ unsigned long long now = rq_clock(rq);
+
+ sched_info_reset_queued(t, now);
+ psi_enqueue(t, now);
}
}
@@ -127,7 +209,8 @@ static inline void sched_info_depart(struct rq *rq, struct task_struct *t)
rq_sched_info_depart(rq, delta);
if (t->state == TASK_RUNNING)
- sched_info_queued(rq, t);
+ if (unlikely(sched_info_on()))
+ sched_info_reset_queued(t, rq_clock(rq));
}
/*
diff --git a/mm/compaction.c b/mm/compaction.c
index 028b7210a669..7f51685d493b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -22,6 +22,7 @@
#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/page_owner.h>
+#include <linux/psi.h>
#include "internal.h"
#ifdef CONFIG_COMPACTION
@@ -2066,11 +2067,15 @@ static int kcompactd(void *p)
pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
while (!kthread_should_stop()) {
+ unsigned long pflags;
+
trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
wait_event_freezable(pgdat->kcompactd_wait,
kcompactd_work_requested(pgdat));
+ psi_memstall_enter(&pflags);
kcompactd_do_work(pgdat);
+ psi_memstall_leave(&pflags);
}
return 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index e49961e13dd9..eee06145b997 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -37,6 +37,7 @@
#include <linux/shmem_fs.h>
#include <linux/rmap.h>
#include <linux/delayacct.h>
+#include <linux/psi.h>
#include "internal.h"
#define CREATE_TRACE_POINTS
@@ -1075,11 +1076,14 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
struct wait_page_queue wait_page;
wait_queue_entry_t *wait = &wait_page.wait;
bool thrashing = false;
+ unsigned long pflags;
int ret = 0;
- if (bit_nr == PG_locked && !PageSwapBacked(page) &&
+ if (bit_nr == PG_locked &&
!PageUptodate(page) && PageWorkingset(page)) {
- delayacct_thrashing_start();
+ if (!PageSwapBacked(page))
+ delayacct_thrashing_start();
+ psi_memstall_enter(&pflags);
thrashing = true;
}
@@ -1121,8 +1125,11 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
finish_wait(q, wait);
- if (thrashing)
- delayacct_thrashing_end();
+ if (thrashing) {
+ if (!PageSwapBacked(page))
+ delayacct_thrashing_end();
+ psi_memstall_leave(&pflags);
+ }
/*
* A signal could leave PageWaiters set. Clearing it here if
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 905db9d7962f..a4b5673166a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -67,6 +67,7 @@
#include <linux/ftrace.h>
#include <linux/lockdep.h>
#include <linux/nmi.h>
+#include <linux/psi.h>
#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -3559,15 +3560,20 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
enum compact_priority prio, enum compact_result *compact_result)
{
struct page *page;
+ unsigned long pflags;
unsigned int noreclaim_flag;
if (!order)
return NULL;
+ psi_memstall_enter(&pflags);
noreclaim_flag = memalloc_noreclaim_save();
+
*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
prio);
+
memalloc_noreclaim_restore(noreclaim_flag);
+ psi_memstall_leave(&pflags);
if (*compact_result <= COMPACT_INACTIVE)
return NULL;
@@ -3756,11 +3762,14 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
struct reclaim_state reclaim_state;
int progress;
unsigned int noreclaim_flag;
+ unsigned long pflags;
cond_resched();
/* We now go into synchronous reclaim */
cpuset_memory_pressure_bump();
+
+ psi_memstall_enter(&pflags);
noreclaim_flag = memalloc_noreclaim_save();
fs_reclaim_acquire(gfp_mask);
reclaim_state.reclaimed_slab = 0;
@@ -3772,6 +3781,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
current->reclaim_state = NULL;
fs_reclaim_release(gfp_mask);
memalloc_noreclaim_restore(noreclaim_flag);
+ psi_memstall_leave(&pflags);
cond_resched();
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4ae5d0eb9489..f05a8ef1db15 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -49,6 +49,7 @@
#include <linux/prefetch.h>
#include <linux/printk.h>
#include <linux/dax.h>
+#include <linux/psi.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -3115,6 +3116,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
{
struct zonelist *zonelist;
unsigned long nr_reclaimed;
+ unsigned long pflags;
int nid;
unsigned int noreclaim_flag;
struct scan_control sc = {
@@ -3143,9 +3145,13 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
sc.gfp_mask,
sc.reclaim_idx);
+ psi_memstall_enter(&pflags);
noreclaim_flag = memalloc_noreclaim_save();
+
nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+
memalloc_noreclaim_restore(noreclaim_flag);
+ psi_memstall_leave(&pflags);
trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
@@ -3565,6 +3571,7 @@ static int kswapd(void *p)
pgdat->kswapd_order = 0;
pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
for ( ; ; ) {
+ unsigned long pflags;
bool ret;
alloc_order = reclaim_order = pgdat->kswapd_order;
@@ -3601,9 +3608,15 @@ static int kswapd(void *p)
*/
trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx,
alloc_order);
+
+ psi_memstall_enter(&pflags);
fs_reclaim_acquire(GFP_KERNEL);
+
reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
+
fs_reclaim_release(GFP_KERNEL);
+ psi_memstall_leave(&pflags);
+
if (reclaim_order < alloc_order)
goto kswapd_try_sleep;
}
--
2.17.0
On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system,
but also that of individual jobs, so that we can ensure individual job
health, fairness between jobs, or prioritize some jobs over others.
This patch implements pressure stall tracking for cgroups. In kernels
with CONFIG_PSI=y, cgroups will have cpu.pressure, memory.pressure,
and io.pressure files that track aggregate pressure stall times for
only the tasks inside the cgroup.
Signed-off-by: Johannes Weiner <[email protected]>
---
Documentation/cgroup-v2.txt | 18 +++++++++
include/linux/cgroup-defs.h | 4 ++
include/linux/cgroup.h | 15 +++++++
include/linux/psi.h | 25 ++++++++++++
init/Kconfig | 4 ++
kernel/cgroup/cgroup.c | 45 ++++++++++++++++++++-
kernel/sched/psi.c | 79 ++++++++++++++++++++++++++++++++++++-
7 files changed, 186 insertions(+), 4 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeaed9f7a..a22879dba019 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -963,6 +963,12 @@ All time durations are in microseconds.
$PERIOD duration. "max" for $MAX indicates no limit. If only
one number is written, $MAX is updated.
+ cpu.pressure
+ A read-only nested-key file which exists on non-root cgroups.
+
+ Shows pressure stall information for CPU. See
+ Documentation/accounting/psi.txt for details.
+
Memory
------
@@ -1199,6 +1205,12 @@ PAGE_SIZE multiple when read back.
Swap usage hard limit. If a cgroup's swap usage reaches this
limit, anonymous memory of the cgroup will not be swapped out.
+ memory.pressure
+ A read-only nested-key file which exists on non-root cgroups.
+
+ Shows pressure stall information for memory. See
+ Documentation/accounting/psi.txt for details.
+
Usage Guidelines
~~~~~~~~~~~~~~~~
@@ -1334,6 +1346,12 @@ IO Interface Files
8:16 rbps=2097152 wbps=max riops=max wiops=max
+ io.pressure
+ A read-only nested-key file which exists on non-root cgroups.
+
+ Shows pressure stall information for IO. See
+ Documentation/accounting/psi.txt for details.
+
Writeback
~~~~~~~~~
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index dc5b70449dc6..280f18da956a 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -20,6 +20,7 @@
#include <linux/u64_stats_sync.h>
#include <linux/workqueue.h>
#include <linux/bpf-cgroup.h>
+#include <linux/psi_types.h>
#ifdef CONFIG_CGROUPS
@@ -424,6 +425,9 @@ struct cgroup {
/* used to schedule release agent */
struct work_struct release_agent_work;
+ /* used to track pressure stalls */
+ struct psi_group psi;
+
/* used to store eBPF programs */
struct cgroup_bpf bpf;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 473e0c0abb86..fd94c294c207 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -627,6 +627,11 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
pr_cont_kernfs_path(cgrp->kn);
}
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+ return &cgrp->psi;
+}
+
static inline void cgroup_init_kthreadd(void)
{
/*
@@ -680,6 +685,16 @@ static inline union kernfs_node_id *cgroup_get_kernfs_id(struct cgroup *cgrp)
return NULL;
}
+static inline struct cgroup *cgroup_parent(struct cgroup *cgrp)
+{
+ return NULL;
+}
+
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+ return NULL;
+}
+
static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
struct cgroup *ancestor)
{
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 371af1479699..05c3dae3e9c5 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -4,6 +4,9 @@
#include <linux/psi_types.h>
#include <linux/sched.h>
+struct seq_file;
+struct css_set;
+
#ifdef CONFIG_PSI
extern bool psi_disabled;
@@ -15,6 +18,14 @@ void psi_task_change(struct task_struct *task, u64 now, int clear, int set);
void psi_memstall_enter(unsigned long *flags);
void psi_memstall_leave(unsigned long *flags);
+int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
+
+#ifdef CONFIG_CGROUPS
+int psi_cgroup_alloc(struct cgroup *cgrp);
+void psi_cgroup_free(struct cgroup *cgrp);
+void cgroup_move_task(struct task_struct *p, struct css_set *to);
+#endif
+
#else /* CONFIG_PSI */
static inline void psi_init(void) {}
@@ -22,6 +33,20 @@ static inline void psi_init(void) {}
static inline void psi_memstall_enter(unsigned long *flags) {}
static inline void psi_memstall_leave(unsigned long *flags) {}
+#ifdef CONFIG_CGROUPS
+static inline int psi_cgroup_alloc(struct cgroup *cgrp)
+{
+ return 0;
+}
+static inline void psi_cgroup_free(struct cgroup *cgrp)
+{
+}
+static inline void cgroup_move_task(struct task_struct *p, struct css_set *to)
+{
+ rcu_assign_pointer(p->cgroups, to);
+}
+#endif
+
#endif /* CONFIG_PSI */
#endif /* _LINUX_PSI_H */
diff --git a/init/Kconfig b/init/Kconfig
index 36208c2a386c..a34e33aae638 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -469,6 +469,10 @@ config PSI
the share of walltime in which some or all tasks in the system are
delayed due to contention of the respective resource.
+ In kernels with cgroup support (cgroup2 only), cgroups will
+ have cpu.pressure, memory.pressure, and io.pressure files,
+ which aggregate pressure stalls for the grouped tasks only.
+
For more details see Documentation/accounting/psi.txt.
Say N if unsure.
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a662bfcbea0e..de1ca380f234 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -54,6 +54,7 @@
#include <linux/proc_ns.h>
#include <linux/nsproxy.h>
#include <linux/file.h>
+#include <linux/psi.h>
#include <net/sock.h>
#define CREATE_TRACE_POINTS
@@ -826,7 +827,7 @@ static void css_set_move_task(struct task_struct *task,
*/
WARN_ON_ONCE(task->flags & PF_EXITING);
- rcu_assign_pointer(task->cgroups, to_cset);
+ cgroup_move_task(task, to_cset);
list_add_tail(&task->cg_list, use_mg_tasks ? &to_cset->mg_tasks :
&to_cset->tasks);
}
@@ -3388,6 +3389,21 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
return ret;
}
+#ifdef CONFIG_PSI
+static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
+{
+ return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU);
+}
+static int cgroup_memory_pressure_show(struct seq_file *seq, void *v)
+{
+ return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_MEM);
+}
+static int cgroup_io_pressure_show(struct seq_file *seq, void *v)
+{
+ return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_IO);
+}
+#endif
+
static int cgroup_file_open(struct kernfs_open_file *of)
{
struct cftype *cft = of->kn->priv;
@@ -4499,6 +4515,23 @@ static struct cftype cgroup_base_files[] = {
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cpu_stat_show,
},
+#ifdef CONFIG_PSI
+ {
+ .name = "cpu.pressure",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cgroup_cpu_pressure_show,
+ },
+ {
+ .name = "memory.pressure",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cgroup_memory_pressure_show,
+ },
+ {
+ .name = "io.pressure",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cgroup_io_pressure_show,
+ },
+#endif
{ } /* terminate */
};
@@ -4559,6 +4592,7 @@ static void css_free_rwork_fn(struct work_struct *work)
*/
cgroup_put(cgroup_parent(cgrp));
kernfs_put(cgrp->kn);
+ psi_cgroup_free(cgrp);
if (cgroup_on_dfl(cgrp))
cgroup_stat_exit(cgrp);
kfree(cgrp);
@@ -4805,10 +4839,15 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
cgrp->self.parent = &parent->self;
cgrp->root = root;
cgrp->level = level;
- ret = cgroup_bpf_inherit(cgrp);
+
+ ret = psi_cgroup_alloc(cgrp);
if (ret)
goto out_idr_free;
+ ret = cgroup_bpf_inherit(cgrp);
+ if (ret)
+ goto out_psi_free;
+
for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp)) {
cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;
@@ -4846,6 +4885,8 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
return cgrp;
+out_psi_free:
+ psi_cgroup_free(cgrp);
out_idr_free:
cgroup_idr_remove(&root->cgroup_idr, cgrp->id);
out_stat_exit:
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 052c529a053b..783b35b744b4 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -260,6 +260,18 @@ void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
task->psi_flags |= set;
psi_group_update(&psi_system, cpu, now, clear, set);
+
+#ifdef CONFIG_CGROUPS
+ cgroup = task->cgroups->dfl_cgrp;
+ while (cgroup && (parent = cgroup_parent(cgroup))) {
+ struct psi_group *group;
+
+ group = cgroup_psi(cgroup);
+ psi_group_update(group, cpu, now, clear, set);
+
+ cgroup = parent;
+ }
+#endif
}
/**
@@ -330,8 +342,71 @@ void psi_memstall_leave(unsigned long *flags)
local_irq_enable();
}
-static int psi_show(struct seq_file *m, struct psi_group *group,
- enum psi_res res)
+#ifdef CONFIG_CGROUPS
+int psi_cgroup_alloc(struct cgroup *cgroup)
+{
+ cgroup->psi.cpus = alloc_percpu(struct psi_group_cpu);
+ if (!cgroup->psi.cpus)
+ return -ENOMEM;
+ psi_group_init(&cgroup->psi);
+ return 0;
+}
+
+void psi_cgroup_free(struct cgroup *cgroup)
+{
+ cancel_delayed_work_sync(&cgroup->psi.clock_work);
+ free_percpu(cgroup->psi.cpus);
+}
+
+/**
+ * cgroup_move_task - move task to a different cgroup
+ * @task: the task
+ * @to: the target css_set
+ *
+ * Move task to a new cgroup and safely migrate its associated stall
+ * state between the different groups.
+ *
+ * This function acquires the task's rq lock to lock out concurrent
+ * changes to the task's scheduling state and - in case the task is
+ * running - concurrent changes to its stall state.
+ */
+void cgroup_move_task(struct task_struct *task, struct css_set *to)
+{
+ unsigned int task_flags = 0;
+ struct rq_flags rf;
+ struct rq *rq;
+ u64 now;
+
+ rq = task_rq_lock(task, &rf);
+
+ if (task_on_rq_queued(task)) {
+ task_flags = TSK_RUNNING;
+ } else if (task->in_iowait) {
+ task_flags = TSK_IOWAIT;
+ }
+ if (task->flags & PF_MEMSTALL)
+ task_flags |= TSK_MEMSTALL;
+
+ if (task_flags) {
+ update_rq_clock(rq);
+ now = rq_clock(rq);
+ psi_task_change(task, now, task_flags, 0);
+ }
+
+ /*
+ * Lame to do this here, but the scheduler cannot be locked
+ * from the outside, so we move cgroups from inside sched/.
+ */
+ rcu_assign_pointer(task->cgroups, to);
+
+ if (task_flags)
+ psi_task_change(task, now, 0, task_flags);
+
+ task_rq_unlock(rq, task, &rf);
+}
+#endif /* CONFIG_CGROUPS */
+
+int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
{
unsigned long avg[2][3];
int w;
--
2.17.0
Delay accounting already measures the time a task spends in direct
reclaim and waiting for swapin, but in low memory situations tasks
can spend a significant amount of their time waiting on
thrashing page cache. This isn't tracked right now.
To know the full impact of memory contention on an individual task,
measure the delay when waiting for a recently evicted active cache
page to read back into memory.
Also update tools/accounting/getdelays.c:
[hannes@computer accounting]$ sudo ./getdelays -d -p 1
print delayacct stats ON
PID 1
CPU count real total virtual total delay total delay average
50318 745000000 847346785 400533713 0.008ms
IO count delay total delay average
435 122601218 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
0 0 0ms
THRASHING count delay total delay average
19 12621439 0ms
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/delayacct.h | 23 +++++++++++++++++++++++
include/uapi/linux/taskstats.h | 6 +++++-
kernel/delayacct.c | 15 +++++++++++++++
mm/filemap.c | 11 +++++++++++
tools/accounting/getdelays.c | 8 +++++++-
5 files changed, 61 insertions(+), 2 deletions(-)
diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index 5e335b6203f4..d3e75b3ba487 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -57,7 +57,12 @@ struct task_delay_info {
u64 freepages_start;
u64 freepages_delay; /* wait for memory reclaim */
+
+ u64 thrashing_start;
+ u64 thrashing_delay; /* wait for thrashing page */
+
u32 freepages_count; /* total count of memory reclaim */
+ u32 thrashing_count; /* total count of thrash waits */
};
#endif
@@ -76,6 +81,8 @@ extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
extern __u64 __delayacct_blkio_ticks(struct task_struct *);
extern void __delayacct_freepages_start(void);
extern void __delayacct_freepages_end(void);
+extern void __delayacct_thrashing_start(void);
+extern void __delayacct_thrashing_end(void);
static inline int delayacct_is_task_waiting_on_io(struct task_struct *p)
{
@@ -156,6 +163,18 @@ static inline void delayacct_freepages_end(void)
__delayacct_freepages_end();
}
+static inline void delayacct_thrashing_start(void)
+{
+ if (current->delays)
+ __delayacct_thrashing_start();
+}
+
+static inline void delayacct_thrashing_end(void)
+{
+ if (current->delays)
+ __delayacct_thrashing_end();
+}
+
#else
static inline void delayacct_set_flag(int flag)
{}
@@ -182,6 +201,10 @@ static inline void delayacct_freepages_start(void)
{}
static inline void delayacct_freepages_end(void)
{}
+static inline void delayacct_thrashing_start(void)
+{}
+static inline void delayacct_thrashing_end(void)
+{}
#endif /* CONFIG_TASK_DELAY_ACCT */
diff --git a/include/uapi/linux/taskstats.h b/include/uapi/linux/taskstats.h
index b7aa7bb2349f..5e8ca16a9079 100644
--- a/include/uapi/linux/taskstats.h
+++ b/include/uapi/linux/taskstats.h
@@ -34,7 +34,7 @@
*/
-#define TASKSTATS_VERSION 8
+#define TASKSTATS_VERSION 9
#define TS_COMM_LEN 32 /* should be >= TASK_COMM_LEN
* in linux/sched.h */
@@ -164,6 +164,10 @@ struct taskstats {
/* Delay waiting for memory reclaim */
__u64 freepages_count;
__u64 freepages_delay_total;
+
+ /* Delay waiting for thrashing page */
+ __u64 thrashing_count;
+ __u64 thrashing_delay_total;
};
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index e2764d767f18..02ba745c448d 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -134,9 +134,12 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
d->swapin_delay_total = (tmp < d->swapin_delay_total) ? 0 : tmp;
tmp = d->freepages_delay_total + tsk->delays->freepages_delay;
d->freepages_delay_total = (tmp < d->freepages_delay_total) ? 0 : tmp;
+ tmp = d->thrashing_delay_total + tsk->delays->thrashing_delay;
+ d->thrashing_delay_total = (tmp < d->thrashing_delay_total) ? 0 : tmp;
d->blkio_count += tsk->delays->blkio_count;
d->swapin_count += tsk->delays->swapin_count;
d->freepages_count += tsk->delays->freepages_count;
+ d->thrashing_count += tsk->delays->thrashing_count;
spin_unlock_irqrestore(&tsk->delays->lock, flags);
return 0;
@@ -168,3 +171,15 @@ void __delayacct_freepages_end(void)
&current->delays->freepages_count);
}
+void __delayacct_thrashing_start(void)
+{
+ current->delays->thrashing_start = ktime_get_ns();
+}
+
+void __delayacct_thrashing_end(void)
+{
+ delayacct_end(&current->delays->lock,
+ &current->delays->thrashing_start,
+ &current->delays->thrashing_delay,
+ &current->delays->thrashing_count);
+}
diff --git a/mm/filemap.c b/mm/filemap.c
index bd36b7226cf4..e49961e13dd9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -36,6 +36,7 @@
#include <linux/cleancache.h>
#include <linux/shmem_fs.h>
#include <linux/rmap.h>
+#include <linux/delayacct.h>
#include "internal.h"
#define CREATE_TRACE_POINTS
@@ -1073,8 +1074,15 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
{
struct wait_page_queue wait_page;
wait_queue_entry_t *wait = &wait_page.wait;
+ bool thrashing = false;
int ret = 0;
+ if (bit_nr == PG_locked && !PageSwapBacked(page) &&
+ !PageUptodate(page) && PageWorkingset(page)) {
+ delayacct_thrashing_start();
+ thrashing = true;
+ }
+
init_wait(wait);
wait->flags = lock ? WQ_FLAG_EXCLUSIVE : 0;
wait->func = wake_page_function;
@@ -1113,6 +1121,9 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
finish_wait(q, wait);
+ if (thrashing)
+ delayacct_thrashing_end();
+
/*
* A signal could leave PageWaiters set. Clearing it here if
* !waitqueue_active would be possible (by open-coding finish_wait),
diff --git a/tools/accounting/getdelays.c b/tools/accounting/getdelays.c
index 9f420d98b5fb..8cb504d30384 100644
--- a/tools/accounting/getdelays.c
+++ b/tools/accounting/getdelays.c
@@ -203,6 +203,8 @@ static void print_delayacct(struct taskstats *t)
"SWAP %15s%15s%15s\n"
" %15llu%15llu%15llums\n"
"RECLAIM %12s%15s%15s\n"
+ " %15llu%15llu%15llums\n"
+ "THRASHING%12s%15s%15s\n"
" %15llu%15llu%15llums\n",
"count", "real total", "virtual total",
"delay total", "delay average",
@@ -222,7 +224,11 @@ static void print_delayacct(struct taskstats *t)
"count", "delay total", "delay average",
(unsigned long long)t->freepages_count,
(unsigned long long)t->freepages_delay_total,
- average_ms(t->freepages_delay_total, t->freepages_count));
+ average_ms(t->freepages_delay_total, t->freepages_count),
+ "count", "delay total", "delay average",
+ (unsigned long long)t->thrashing_count,
+ (unsigned long long)t->thrashing_delay_total,
+ average_ms(t->thrashing_delay_total, t->thrashing_count));
}
static void task_context_switch_counts(struct taskstats *t)
--
2.17.0
It's going to be used in the following patch. Keep the churn separate.
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/sched/loadavg.h | 69 +++++++++++++++++++++++++++++++++++
kernel/sched/loadavg.c | 69 -----------------------------------
2 files changed, 69 insertions(+), 69 deletions(-)
diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index cc9cc62bb1f8..0e4c24978751 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -37,6 +37,75 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
return newload / FIXED_1;
}
+/**
+ * fixed_power_int - compute: x^n, in O(log n) time
+ *
+ * @x: base of the power
+ * @frac_bits: fractional bits of @x
+ * @n: power to raise @x to.
+ *
+ * By exploiting the relation between the definition of the natural power
+ * function: x^n := x*x*...*x (x multiplied by itself for n times), and
+ * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i,
+ * (where: n_i \elem {0, 1}, the binary vector representing n),
+ * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is
+ * of course trivially computable in O(log_2 n), the length of our binary
+ * vector.
+ */
+static inline unsigned long
+fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
+{
+ unsigned long result = 1UL << frac_bits;
+
+ if (n) {
+ for (;;) {
+ if (n & 1) {
+ result *= x;
+ result += 1UL << (frac_bits - 1);
+ result >>= frac_bits;
+ }
+ n >>= 1;
+ if (!n)
+ break;
+ x *= x;
+ x += 1UL << (frac_bits - 1);
+ x >>= frac_bits;
+ }
+ }
+
+ return result;
+}
+
+/*
+ * a1 = a0 * e + a * (1 - e)
+ *
+ * a2 = a1 * e + a * (1 - e)
+ * = (a0 * e + a * (1 - e)) * e + a * (1 - e)
+ * = a0 * e^2 + a * (1 - e) * (1 + e)
+ *
+ * a3 = a2 * e + a * (1 - e)
+ * = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
+ * = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
+ *
+ * ...
+ *
+ * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1]
+ * = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
+ * = a0 * e^n + a * (1 - e^n)
+ *
+ * [1] application of the geometric series:
+ *
+ * n 1 - x^(n+1)
+ * S_n := \Sum x^i = -------------
+ * i=0 1 - x
+ */
+static inline unsigned long
+calc_load_n(unsigned long load, unsigned long exp,
+ unsigned long active, unsigned int n)
+{
+ return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
+}
+
#define LOAD_INT(x) ((x) >> FSHIFT)
#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 54fbdfb2d86c..0736e349a54e 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -210,75 +210,6 @@ static long calc_load_nohz_fold(void)
return delta;
}
-/**
- * fixed_power_int - compute: x^n, in O(log n) time
- *
- * @x: base of the power
- * @frac_bits: fractional bits of @x
- * @n: power to raise @x to.
- *
- * By exploiting the relation between the definition of the natural power
- * function: x^n := x*x*...*x (x multiplied by itself for n times), and
- * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i,
- * (where: n_i \elem {0, 1}, the binary vector representing n),
- * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is
- * of course trivially computable in O(log_2 n), the length of our binary
- * vector.
- */
-static unsigned long
-fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
-{
- unsigned long result = 1UL << frac_bits;
-
- if (n) {
- for (;;) {
- if (n & 1) {
- result *= x;
- result += 1UL << (frac_bits - 1);
- result >>= frac_bits;
- }
- n >>= 1;
- if (!n)
- break;
- x *= x;
- x += 1UL << (frac_bits - 1);
- x >>= frac_bits;
- }
- }
-
- return result;
-}
-
-/*
- * a1 = a0 * e + a * (1 - e)
- *
- * a2 = a1 * e + a * (1 - e)
- * = (a0 * e + a * (1 - e)) * e + a * (1 - e)
- * = a0 * e^2 + a * (1 - e) * (1 + e)
- *
- * a3 = a2 * e + a * (1 - e)
- * = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
- * = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
- *
- * ...
- *
- * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1]
- * = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
- * = a0 * e^n + a * (1 - e^n)
- *
- * [1] application of the geometric series:
- *
- * n 1 - x^(n+1)
- * S_n := \Sum x^i = -------------
- * i=0 1 - x
- */
-static unsigned long
-calc_load_n(unsigned long load, unsigned long exp,
- unsigned long active, unsigned int n)
-{
- return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
-}
-
/*
* NO_HZ can leave us missing all per-CPU ticks calling
* calc_load_fold_active(), but since a NO_HZ CPU folds its delta into
--
2.17.0
From: Johannes Weiner <[email protected]>
If we just keep enough refault information to match the CURRENT page
cache during reclaim time, we could lose a lot of events when there is
only a temporary spike in non-cache memory consumption that pushes out
all the cache. Once cache comes back, we won't see those refaults.
They might not be actionable for LRU aging, but we want to know about
them for measuring memory pressure.
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/workingset.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index 40ee02c83978..53759a3cf99a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -364,7 +364,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
{
unsigned long max_nodes;
unsigned long nodes;
- unsigned long cache;
+ unsigned long pages;
/* list_lru lock nests inside the IRQ-safe i_pages lock */
local_irq_disable();
@@ -393,14 +393,14 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
*
* PAGE_SIZE / radix_tree_nodes / node_entries * 8 / PAGE_SIZE
*/
- if (sc->memcg) {
- cache = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
- LRU_ALL_FILE);
- } else {
- cache = node_page_state(NODE_DATA(sc->nid), NR_ACTIVE_FILE) +
- node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE);
- }
- max_nodes = cache >> (RADIX_TREE_MAP_SHIFT - 3);
+#ifdef CONFIG_MEMCG
+ if (sc->memcg)
+ pages = page_counter_read(&sc->memcg->memory);
+ else
+#endif
+ pages = node_present_pages(sc->nid);
+
+ max_nodes = pages >> (RADIX_TREE_MAP_SHIFT - 3);
if (nodes <= max_nodes)
return 0;
--
2.17.0
On 05/07/2018 02:01 PM, Johannes Weiner wrote:
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> Documentation/accounting/psi.txt | 73 ++++++
> include/linux/psi.h | 27 ++
> include/linux/psi_types.h | 84 ++++++
> include/linux/sched.h | 10 +
> include/linux/sched/stat.h | 10 +-
> init/Kconfig | 16 ++
> kernel/fork.c | 4 +
> kernel/sched/Makefile | 1 +
> kernel/sched/core.c | 3 +
> kernel/sched/psi.c | 424 +++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 166 ++++++------
> kernel/sched/stats.h | 91 ++++++-
> mm/compaction.c | 5 +
> mm/filemap.c | 15 +-
> mm/page_alloc.c | 10 +
> mm/vmscan.c | 13 +
> 16 files changed, 859 insertions(+), 93 deletions(-)
> create mode 100644 Documentation/accounting/psi.txt
> create mode 100644 include/linux/psi.h
> create mode 100644 include/linux/psi_types.h
> create mode 100644 kernel/sched/psi.c
>
> diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
> new file mode 100644
> index 000000000000..e051810d5127
> --- /dev/null
> +++ b/Documentation/accounting/psi.txt
> @@ -0,0 +1,73 @@
Looks good to me.
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> new file mode 100644
> index 000000000000..052c529a053b
> --- /dev/null
> +++ b/kernel/sched/psi.c
> @@ -0,0 +1,424 @@
> +/*
> + * Measure workload productivity impact from overcommitting CPU, memory, IO
> + *
> + * Copyright (c) 2017 Facebook, Inc.
> + * Author: Johannes Weiner <[email protected]>
> + *
> + * Implementation
> + *
> + * Task states -- running, iowait, memstall -- are tracked through the
> + * scheduler and aggregated into a system-wide productivity state. The
> + * ratio between the times spent in productive states and delays tells
> + * us the overall productivity of the workload.
> + *
> + * The ratio is tracked in decaying time averages over 10s, 1m, 5m
> + * windows. Cumluative stall times are tracked and exported as well to
Cumulative
> + * allow detection of latency spikes and custom time averaging.
> + *
> + * Multiple CPUs
> + *
> + * To avoid cache contention, times are tracked local to the CPUs. To
> + * get a comprehensive view of a system or cgroup, we have to consider
> + * the fact that CPUs could be unevenly loaded or even entirely idle
> + * if the workload doesn't have enough threads. To avoid artifacts
> + * caused by that, when adding up the global pressure ratio, the
> + * CPU-local ratios are weighed according to their non-idle time:
> + *
> + * Time the CPU had stalled tasks Time the CPU was non-idle
> + * ------------------------------ * ---------------------------
> + * Walltime Time all CPUs were non-idle
> + */
> +
> +/**
> + * psi_memstall_leave - mark the end of an memory stall section
end of a memory
> + * @flags: flags to handle nested memdelay sections
> + *
> + * Marks the calling task as no longer stalled due to lack of memory.
> + */
> +void psi_memstall_leave(unsigned long *flags)
> +{
--
~Randy
Hi Johannes,
I love your patch! Yet something to improve:
[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc4]
[cannot apply to next-20180507]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Johannes-Weiner/mm-workingset-don-t-drop-refault-information-prematurely/20180508-081214
config: i386-randconfig-x073-201818 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All errors (new ones prefixed by >>):
In file included from kernel/sched/sched.h:1317:0,
from kernel/sched/core.c:8:
>> kernel/sched/stats.h:126:1: error: expected identifier or '(' before '{' token
{
^
vim +126 kernel/sched/stats.h
57
58 #ifdef CONFIG_PSI
59 /*
60 * PSI tracks state that persists across sleeps, such as iowaits and
61 * memory stalls. As a result, it has to distinguish between sleeps,
62 * where a task's runnable state changes, and requeues, where a task
63 * and its state are being moved between CPUs and runqueues.
64 */
65 static inline void psi_enqueue(struct task_struct *p, u64 now)
66 {
67 int clear = 0, set = TSK_RUNNING;
68
69 if (p->state == TASK_RUNNING || p->sched_psi_wake_requeue) {
70 if (p->flags & PF_MEMSTALL)
71 set |= TSK_MEMSTALL;
72 p->sched_psi_wake_requeue = 0;
73 } else {
74 if (p->in_iowait)
75 clear |= TSK_IOWAIT;
76 }
77
78 psi_task_change(p, now, clear, set);
79 }
80 static inline void psi_dequeue(struct task_struct *p, u64 now)
81 {
82 int clear = TSK_RUNNING, set = 0;
83
84 if (p->state == TASK_RUNNING) {
85 if (p->flags & PF_MEMSTALL)
86 clear |= TSK_MEMSTALL;
87 } else {
88 if (p->in_iowait)
89 set |= TSK_IOWAIT;
90 }
91
92 psi_task_change(p, now, clear, set);
93 }
94 static inline void psi_ttwu_dequeue(struct task_struct *p)
95 {
96 /*
97 * Is the task being migrated during a wakeup? Make sure to
98 * deregister its sleep-persistent psi states from the old
99 * queue, and let psi_enqueue() know it has to requeue.
100 */
101 if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
102 struct rq_flags rf;
103 struct rq *rq;
104 int clear = 0;
105
106 if (p->in_iowait)
107 clear |= TSK_IOWAIT;
108 if (p->flags & PF_MEMSTALL)
109 clear |= TSK_MEMSTALL;
110
111 rq = __task_rq_lock(p, &rf);
112 update_rq_clock(rq);
113 psi_task_change(p, rq_clock(rq), clear, 0);
114 p->sched_psi_wake_requeue = 1;
115 __task_rq_unlock(rq, &rf);
116 }
117 }
118 #else /* CONFIG_PSI */
119 static inline void psi_enqueue(struct task_struct *p, u64 now)
120 {
121 }
122 static inline void psi_dequeue(struct task_struct *p, u64 now)
123 {
124 }
125 static inline void psi_ttwu_dequeue(struct task_struct *p) {}
> 126 {
127 }
128 #endif /* CONFIG_PSI */
129
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi Johannes,
I love your patch! Yet something to improve:
[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc4]
[cannot apply to next-20180507]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Johannes-Weiner/mm-workingset-don-t-drop-refault-information-prematurely/20180508-081214
config: x86_64-randconfig-x012-201818 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
All errors (new ones prefixed by >>):
In file included from kernel/livepatch/../sched/sched.h:1317:0,
from kernel/livepatch/transition.c:27:
>> kernel/livepatch/../sched/stats.h:126:1: error: expected identifier or '(' before '{' token
{
^
vim +126 kernel/livepatch/../sched/stats.h
57
58 #ifdef CONFIG_PSI
59 /*
60 * PSI tracks state that persists across sleeps, such as iowaits and
61 * memory stalls. As a result, it has to distinguish between sleeps,
62 * where a task's runnable state changes, and requeues, where a task
63 * and its state are being moved between CPUs and runqueues.
64 */
65 static inline void psi_enqueue(struct task_struct *p, u64 now)
66 {
67 int clear = 0, set = TSK_RUNNING;
68
69 if (p->state == TASK_RUNNING || p->sched_psi_wake_requeue) {
70 if (p->flags & PF_MEMSTALL)
71 set |= TSK_MEMSTALL;
72 p->sched_psi_wake_requeue = 0;
73 } else {
74 if (p->in_iowait)
75 clear |= TSK_IOWAIT;
76 }
77
78 psi_task_change(p, now, clear, set);
79 }
80 static inline void psi_dequeue(struct task_struct *p, u64 now)
81 {
82 int clear = TSK_RUNNING, set = 0;
83
84 if (p->state == TASK_RUNNING) {
85 if (p->flags & PF_MEMSTALL)
86 clear |= TSK_MEMSTALL;
87 } else {
88 if (p->in_iowait)
89 set |= TSK_IOWAIT;
90 }
91
92 psi_task_change(p, now, clear, set);
93 }
94 static inline void psi_ttwu_dequeue(struct task_struct *p)
95 {
96 /*
97 * Is the task being migrated during a wakeup? Make sure to
98 * deregister its sleep-persistent psi states from the old
99 * queue, and let psi_enqueue() know it has to requeue.
100 */
101 if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
102 struct rq_flags rf;
103 struct rq *rq;
104 int clear = 0;
105
106 if (p->in_iowait)
107 clear |= TSK_IOWAIT;
108 if (p->flags & PF_MEMSTALL)
109 clear |= TSK_MEMSTALL;
110
111 rq = __task_rq_lock(p, &rf);
112 update_rq_clock(rq);
113 psi_task_change(p, rq_clock(rq), clear, 0);
114 p->sched_psi_wake_requeue = 1;
115 __task_rq_unlock(rq, &rf);
116 }
117 }
118 #else /* CONFIG_PSI */
119 static inline void psi_enqueue(struct task_struct *p, u64 now)
120 {
121 }
122 static inline void psi_dequeue(struct task_struct *p, u64 now)
123 {
124 }
125 static inline void psi_ttwu_dequeue(struct task_struct *p) {}
> 126 {
127 }
128 #endif /* CONFIG_PSI */
129
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On Tue, May 08, 2018 at 11:04:09AM +0800, kbuild test robot wrote:
> 118 #else /* CONFIG_PSI */
> 119 static inline void psi_enqueue(struct task_struct *p, u64 now)
> 120 {
> 121 }
> 122 static inline void psi_dequeue(struct task_struct *p, u64 now)
> 123 {
> 124 }
> 125 static inline void psi_ttwu_dequeue(struct task_struct *p) {}
> > 126 {
> 127 }
Stupid last-minute cleanup reshuffling. v2 will have:
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index cb4a68bcf37a..ff6256b3d216 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -122,7 +122,7 @@ static inline void psi_enqueue(struct task_struct *p, u64 now)
static inline void psi_dequeue(struct task_struct *p, u64 now)
{
}
-static inline void psi_ttwu_dequeue(struct task_struct *p) {}
+static inline void psi_ttwu_dequeue(struct task_struct *p)
{
}
#endif /* CONFIG_PSI */
On Mon, May 07, 2018 at 05:42:36PM -0700, Randy Dunlap wrote:
> On 05/07/2018 02:01 PM, Johannes Weiner wrote:
> > + * The ratio is tracked in decaying time averages over 10s, 1m, 5m
> > + * windows. Cumluative stall times are tracked and exported as well to
>
> Cumulative
>
> > +/**
> > + * psi_memstall_leave - mark the end of an memory stall section
>
> end of a memory
Thanks Randy, I'll get those fixed.
On Mon, May 07, 2018 at 05:01:33PM -0400, Johannes Weiner wrote:
> +static inline unsigned long
> +fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
> +{
> + unsigned long result = 1UL << frac_bits;
> +
> + if (n) {
> + for (;;) {
> + if (n & 1) {
> + result *= x;
> + result += 1UL << (frac_bits - 1);
> + result >>= frac_bits;
> + }
> + n >>= 1;
> + if (!n)
> + break;
> + x *= x;
> + x += 1UL << (frac_bits - 1);
> + x >>= frac_bits;
> + }
> + }
> +
> + return result;
> +}
No real objection; but that does look a wee bit fat for an inline I
suppose.
On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> new file mode 100644
> index 000000000000..b22b0ffc729d
> --- /dev/null
> +++ b/include/linux/psi_types.h
> @@ -0,0 +1,84 @@
> +#ifndef _LINUX_PSI_TYPES_H
> +#define _LINUX_PSI_TYPES_H
> +
> +#include <linux/types.h>
> +
> +#ifdef CONFIG_PSI
> +
> +/* Tracked task states */
> +enum psi_task_count {
> + NR_RUNNING,
> + NR_IOWAIT,
> + NR_MEMSTALL,
> + NR_PSI_TASK_COUNTS,
> +};
> +
> +/* Task state bitmasks */
> +#define TSK_RUNNING (1 << NR_RUNNING)
> +#define TSK_IOWAIT (1 << NR_IOWAIT)
> +#define TSK_MEMSTALL (1 << NR_MEMSTALL)
> +
> +/* Resources that workloads could be stalled on */
> +enum psi_res {
> + PSI_CPU,
> + PSI_MEM,
> + PSI_IO,
> + NR_PSI_RESOURCES,
> +};
> +
> +/* Pressure states for a group of tasks */
> +enum psi_state {
> + PSI_NONE, /* No stalled tasks */
> + PSI_SOME, /* Stalled tasks & working tasks */
> + PSI_FULL, /* Stalled tasks & no working tasks */
> + NR_PSI_STATES,
> +};
> +
> +struct psi_resource {
> + /* Current pressure state for this resource */
> + enum psi_state state;
> +
> + /* Start of current state (cpu_clock) */
> + u64 state_start;
> +
> + /* Time sampling buckets for pressure states (ns) */
> + u64 times[NR_PSI_STATES - 1];
Fails to explain why no FULL.
> +};
> +
> +struct psi_group_cpu {
> + /* States of the tasks belonging to this group */
> + unsigned int tasks[NR_PSI_TASK_COUNTS];
> +
AFAICT there's a hole here, that would fit the @nonidle member. Which
also avoids the later hole generated by it.
> + /* Per-resource pressure tracking in this group */
> + struct psi_resource res[NR_PSI_RESOURCES];
> +
> + /* There are runnable or D-state tasks */
> + bool nonidle;
Mandatory complaint about using _Bool in composites goes here.
> + /* Start of current non-idle state (cpu_clock) */
> + u64 nonidle_start;
> +
> + /* Time sampling bucket for non-idle state (ns) */
> + u64 nonidle_time;
> +};
> +
> +struct psi_group {
> + struct psi_group_cpu *cpus;
> +
> + struct delayed_work clock_work;
> + unsigned long period_expires;
> +
> + u64 some[NR_PSI_RESOURCES];
> + u64 full[NR_PSI_RESOURCES];
> +
> + unsigned long avg_some[NR_PSI_RESOURCES][3];
> + unsigned long avg_full[NR_PSI_RESOURCES][3];
> +};
On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> +static void psi_clock(struct work_struct *work)
> +{
> + u64 some[NR_PSI_RESOURCES] = { 0, };
> + u64 full[NR_PSI_RESOURCES] = { 0, };
> + unsigned long nonidle_total = 0;
> + unsigned long missed_periods;
> + struct delayed_work *dwork;
> + struct psi_group *group;
> + unsigned long expires;
> + int cpu;
> + int r;
> +
> + dwork = to_delayed_work(work);
> + group = container_of(dwork, struct psi_group, clock_work);
> +
> + /*
> + * Calculate the sampling period. The clock might have been
> + * stopped for a while.
> + */
> + expires = group->period_expires;
> + missed_periods = (jiffies - expires) / MY_LOAD_FREQ;
> + group->period_expires = expires + ((1 + missed_periods) * MY_LOAD_FREQ);
> +
> + /*
> + * Aggregate the per-cpu state into a global state. Each CPU
> + * is weighted by its non-idle time in the sampling period.
> + */
> + for_each_online_cpu(cpu) {
Typically when using online CPU state, you also need hotplug notifiers
to deal with changes in the online set.
You also typically need something like cpus_read_lock() around an
iteration of online CPUs, to avoid the set changing while you're poking
at them.
The absence of both is neither evident nor explained.
> + struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
> + unsigned long nonidle;
> +
> + nonidle = nsecs_to_jiffies(groupc->nonidle_time);
> + groupc->nonidle_time = 0;
> + nonidle_total += nonidle;
> +
> + for (r = 0; r < NR_PSI_RESOURCES; r++) {
> + struct psi_resource *res = &groupc->res[r];
> +
> + some[r] += (res->times[0] + res->times[1]) * nonidle;
> + full[r] += res->times[1] * nonidle;
> +
> + /* It's racy, but we can tolerate some error */
> + res->times[0] = 0;
> + res->times[1] = 0;
> + }
> + }
> +
> + for (r = 0; r < NR_PSI_RESOURCES; r++) {
> + /* Finish the weighted aggregation */
> + some[r] /= max(nonidle_total, 1UL);
> + full[r] /= max(nonidle_total, 1UL);
> +
> + /* Accumulate stall time */
> + group->some[r] += some[r];
> + group->full[r] += full[r];
> +
> + /* Calculate recent pressure averages */
> + calc_avgs(group->avg_some[r], some[r], missed_periods);
> + calc_avgs(group->avg_full[r], full[r], missed_periods);
> + }
> +
> + /* Keep the clock ticking only when there is action */
> + if (nonidle_total)
> + schedule_delayed_work(dwork, MY_LOAD_FREQ);
> +}
On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> + u64 some[NR_PSI_RESOURCES] = { 0, };
> + u64 full[NR_PSI_RESOURCES] = { 0, };
> + some[r] /= max(nonidle_total, 1UL);
> + full[r] /= max(nonidle_total, 1UL);
That's a bare 64-bit divide... that typically fails to build on 32-bit
archs.
On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 15750c222ca2..1658477466d5 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -919,6 +921,8 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> #define cpu_curr(cpu) (cpu_rq(cpu)->curr)
> #define raw_rq() raw_cpu_ptr(&runqueues)
>
> +extern void update_rq_clock(struct rq *rq);
> +
> static inline u64 __rq_clock_broken(struct rq *rq)
> {
> return READ_ONCE(rq->clock);
> @@ -1037,6 +1041,86 @@ static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
> #endif
> }
>
> +struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
> + __acquires(rq->lock);
> +
> +struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
> + __acquires(p->pi_lock)
> + __acquires(rq->lock);
> +
> +static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
> + __releases(rq->lock)
> +{
> + rq_unpin_lock(rq, rf);
> + raw_spin_unlock(&rq->lock);
> +}
> +
> +static inline void
> +task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
> + __releases(rq->lock)
> + __releases(p->pi_lock)
> +{
> + rq_unpin_lock(rq, rf);
> + raw_spin_unlock(&rq->lock);
> + raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
> +}
> +
> +static inline void
> +rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
> + __acquires(rq->lock)
> +{
> + raw_spin_lock_irqsave(&rq->lock, rf->flags);
> + rq_pin_lock(rq, rf);
> +}
> +
> +static inline void
> +rq_lock_irq(struct rq *rq, struct rq_flags *rf)
> + __acquires(rq->lock)
> +{
> + raw_spin_lock_irq(&rq->lock);
> + rq_pin_lock(rq, rf);
> +}
> +
> +static inline void
> +rq_lock(struct rq *rq, struct rq_flags *rf)
> + __acquires(rq->lock)
> +{
> + raw_spin_lock(&rq->lock);
> + rq_pin_lock(rq, rf);
> +}
> +
> +static inline void
> +rq_relock(struct rq *rq, struct rq_flags *rf)
> + __acquires(rq->lock)
> +{
> + raw_spin_lock(&rq->lock);
> + rq_repin_lock(rq, rf);
> +}
> +
> +static inline void
> +rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
> + __releases(rq->lock)
> +{
> + rq_unpin_lock(rq, rf);
> + raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
> +}
> +
> +static inline void
> +rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
> + __releases(rq->lock)
> +{
> + rq_unpin_lock(rq, rf);
> + raw_spin_unlock_irq(&rq->lock);
> +}
> +
> +static inline void
> +rq_unlock(struct rq *rq, struct rq_flags *rf)
> + __releases(rq->lock)
> +{
> + rq_unpin_lock(rq, rf);
> + raw_spin_unlock(&rq->lock);
> +}
> +
> #ifdef CONFIG_NUMA
> enum numa_topology_type {
> NUMA_DIRECT,
> @@ -1670,8 +1754,6 @@ static inline void sub_nr_running(struct rq *rq, unsigned count)
> sched_update_tick_dependency(rq);
> }
>
> -extern void update_rq_clock(struct rq *rq);
> -
> extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
> extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
>
> @@ -1752,86 +1834,6 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
> static inline void sched_avg_update(struct rq *rq) { }
> #endif
>
> -struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
> - __acquires(rq->lock);
> -
> -struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
> - __acquires(p->pi_lock)
> - __acquires(rq->lock);
> -
> -static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
> - __releases(rq->lock)
> -{
> - rq_unpin_lock(rq, rf);
> - raw_spin_unlock(&rq->lock);
> -}
> -
> -static inline void
> -task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
> - __releases(rq->lock)
> - __releases(p->pi_lock)
> -{
> - rq_unpin_lock(rq, rf);
> - raw_spin_unlock(&rq->lock);
> - raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
> -}
> -
> -static inline void
> -rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
> - __acquires(rq->lock)
> -{
> - raw_spin_lock_irqsave(&rq->lock, rf->flags);
> - rq_pin_lock(rq, rf);
> -}
> -
> -static inline void
> -rq_lock_irq(struct rq *rq, struct rq_flags *rf)
> - __acquires(rq->lock)
> -{
> - raw_spin_lock_irq(&rq->lock);
> - rq_pin_lock(rq, rf);
> -}
> -
> -static inline void
> -rq_lock(struct rq *rq, struct rq_flags *rf)
> - __acquires(rq->lock)
> -{
> - raw_spin_lock(&rq->lock);
> - rq_pin_lock(rq, rf);
> -}
> -
> -static inline void
> -rq_relock(struct rq *rq, struct rq_flags *rf)
> - __acquires(rq->lock)
> -{
> - raw_spin_lock(&rq->lock);
> - rq_repin_lock(rq, rf);
> -}
> -
> -static inline void
> -rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
> - __releases(rq->lock)
> -{
> - rq_unpin_lock(rq, rf);
> - raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
> -}
> -
> -static inline void
> -rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
> - __releases(rq->lock)
> -{
> - rq_unpin_lock(rq, rf);
> - raw_spin_unlock_irq(&rq->lock);
> -}
> -
> -static inline void
> -rq_unlock(struct rq *rq, struct rq_flags *rf)
> - __releases(rq->lock)
> -{
> - rq_unpin_lock(rq, rf);
> - raw_spin_unlock(&rq->lock);
> -}
> -
> #ifdef CONFIG_SMP
> #ifdef CONFIG_PREEMPT
>
What's all this churn about?
On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> +/**
> + * psi_memstall_enter - mark the beginning of a memory stall section
> + * @flags: flags to handle nested sections
> + *
> + * Marks the calling task as being stalled due to a lack of memory,
> + * such as waiting for a refault or performing reclaim.
> + */
> +void psi_memstall_enter(unsigned long *flags)
> +{
> + struct rq_flags rf;
> + struct rq *rq;
> +
> + *flags = current->flags & PF_MEMSTALL;
> + if (*flags)
> + return;
> + /*
> + * PF_MEMSTALL setting & accounting needs to be atomic wrt
> + * changes to the task's scheduling state, otherwise we can
> + * race with CPU migration.
> + */
> + local_irq_disable();
> + rq = this_rq();
> + raw_spin_lock(&rq->lock);
> + rq_pin_lock(rq, &rf);
Given that churn in sched.h, you've seen rq_lock() and friends.
Either write this like:
local_irq_disable();
rq = this_rq();
rq_lock(rq, &rf);
Or introduce "rq = this_rq_lock_irq()", which we could also use in
do_sched_yield().
> + update_rq_clock(rq);
> +
> + current->flags |= PF_MEMSTALL;
> + psi_task_change(current, rq_clock(rq), 0, TSK_MEMSTALL);
> +
> + rq_unpin_lock(rq, &rf);
> + raw_spin_unlock(&rq->lock);
> + local_irq_enable();
That's called rq_unlock_irq().
> +}
> +
> +/**
> + * psi_memstall_leave - mark the end of an memory stall section
> + * @flags: flags to handle nested memdelay sections
> + *
> + * Marks the calling task as no longer stalled due to lack of memory.
> + */
> +void psi_memstall_leave(unsigned long *flags)
> +{
> + struct rq_flags rf;
> + struct rq *rq;
> +
> + if (*flags)
> + return;
> + /*
> + * PF_MEMSTALL clearing & accounting needs to be atomic wrt
> + * changes to the task's scheduling state, otherwise we could
> + * race with CPU migration.
> + */
> + local_irq_disable();
> + rq = this_rq();
> + raw_spin_lock(&rq->lock);
> + rq_pin_lock(rq, &rf);
> +
> + update_rq_clock(rq);
> +
> + current->flags &= ~PF_MEMSTALL;
> + psi_task_change(current, rq_clock(rq), TSK_MEMSTALL, 0);
> +
> + rq_unpin_lock(rq, &rf);
> + raw_spin_unlock(&rq->lock);
> + local_irq_enable();
> +}
Idem.
On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> +static void psi_clock(struct work_struct *work)
> +{
> + dwork = to_delayed_work(work);
> + group = container_of(dwork, struct psi_group, clock_work);
> +
> +
> + /* Keep the clock ticking only when there is action */
> + if (nonidle_total)
> + schedule_delayed_work(dwork, MY_LOAD_FREQ);
> +}
Note that this doesn't generate a stable frequency for the callback.
The (nondeterministic) time spent doing the actual work is added to each
period; this gives the frequency an unconditional downward bias, and
also makes it very unstable.
You want explicit management of timer->expires: add MY_LOAD_FREQ
(which is a misnomer) to it, rather than resetting it based on jiffies.
On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> @@ -2038,6 +2038,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
> if (task_cpu(p) != cpu) {
> wake_flags |= WF_MIGRATED;
> + psi_ttwu_dequeue(p);
> set_task_cpu(p, cpu);
> }
>
> +static inline void psi_ttwu_dequeue(struct task_struct *p)
> +{
> + /*
> + * Is the task being migrated during a wakeup? Make sure to
> + * deregister its sleep-persistent psi states from the old
> + * queue, and let psi_enqueue() know it has to requeue.
> + */
> + if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> + struct rq_flags rf;
> + struct rq *rq;
> + int clear = 0;
> +
> + if (p->in_iowait)
> + clear |= TSK_IOWAIT;
> + if (p->flags & PF_MEMSTALL)
> + clear |= TSK_MEMSTALL;
> +
> + rq = __task_rq_lock(p, &rf);
> + update_rq_clock(rq);
> + psi_task_change(p, rq_clock(rq), clear, 0);
> + p->sched_psi_wake_requeue = 1;
> + __task_rq_unlock(rq, &rf);
> + }
> +}
Yeah, no... not happening.
We spend a lot of time to never touch the old rq->lock on wakeups. Mason
was the one pushing for that, so he should very well know this.
The one cross-cpu atomic (iowait) is already a problem (the whole iowait
accounting being useless makes it even worse), adding significant remote
prodding is just really bad.
On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> @@ -28,10 +28,14 @@ static inline int sched_info_on(void)
> return 1;
> #elif defined(CONFIG_TASK_DELAY_ACCT)
> extern int delayacct_on;
> - return delayacct_on;
> -#else
> - return 0;
> + if (delayacct_on)
> + return 1;
> +#elif defined(CONFIG_PSI)
> + extern int psi_disabled;
> + if (!psi_disabled)
> + return 1;
> #endif
> + return 0;
> }
> diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
> index 8aea199a39b4..cb4a68bcf37a 100644
> --- a/kernel/sched/stats.h
> +++ b/kernel/sched/stats.h
> @@ -55,12 +55,90 @@ static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delt
> # define schedstat_val_or_zero(var) 0
> #endif /* CONFIG_SCHEDSTATS */
>
> +#ifdef CONFIG_PSI
> +/*
> + * PSI tracks state that persists across sleeps, such as iowaits and
> + * memory stalls. As a result, it has to distinguish between sleeps,
> + * where a task's runnable state changes, and requeues, where a task
> + * and its state are being moved between CPUs and runqueues.
> + */
> +static inline void psi_enqueue(struct task_struct *p, u64 now)
> +{
> + int clear = 0, set = TSK_RUNNING;
> +
> + if (p->state == TASK_RUNNING || p->sched_psi_wake_requeue) {
> + if (p->flags & PF_MEMSTALL)
> + set |= TSK_MEMSTALL;
> + p->sched_psi_wake_requeue = 0;
> + } else {
> + if (p->in_iowait)
> + clear |= TSK_IOWAIT;
> + }
> +
> + psi_task_change(p, now, clear, set);
> +}
> +static inline void psi_dequeue(struct task_struct *p, u64 now)
> +{
> + int clear = TSK_RUNNING, set = 0;
> +
> + if (p->state == TASK_RUNNING) {
> + if (p->flags & PF_MEMSTALL)
> + clear |= TSK_MEMSTALL;
> + } else {
> + if (p->in_iowait)
> + set |= TSK_IOWAIT;
> + }
> +
> + psi_task_change(p, now, clear, set);
> +}
> +static inline void psi_ttwu_dequeue(struct task_struct *p)
> +{
> + /*
> + * Is the task being migrated during a wakeup? Make sure to
> + * deregister its sleep-persistent psi states from the old
> + * queue, and let psi_enqueue() know it has to requeue.
> + */
> + if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> + struct rq_flags rf;
> + struct rq *rq;
> + int clear = 0;
> +
> + if (p->in_iowait)
> + clear |= TSK_IOWAIT;
> + if (p->flags & PF_MEMSTALL)
> + clear |= TSK_MEMSTALL;
> +
> + rq = __task_rq_lock(p, &rf);
> + update_rq_clock(rq);
> + psi_task_change(p, rq_clock(rq), clear, 0);
> + p->sched_psi_wake_requeue = 1;
> + __task_rq_unlock(rq, &rf);
> + }
> +}
That all seems to be missing psi_disabled tests... Yes, I know it's
buried down in psi_task_change() somewhere, but that's really (too)
late.
(also, you seem to be conserving whitespace; typically we have an
empty line between functions)
On 5/8/2018 2:31 AM, Johannes Weiner wrote:
> +static void psi_group_update(struct psi_group *group, int cpu, u64 now,
> + unsigned int clear, unsigned int set)
> +{
> + enum psi_state state = PSI_NONE;
> + struct psi_group_cpu *groupc;
> + unsigned int *tasks;
> + unsigned int to, bo;
> +
> + groupc = per_cpu_ptr(group->cpus, cpu);
> + tasks = groupc->tasks;
> +
> + /* Update task counts according to the set/clear bitmasks */
> + for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
> + int idx = to + (bo - 1);
> +
> + if (tasks[idx] == 0 && !psi_bug) {
> + printk_deferred(KERN_ERR "psi: task underflow! cpu=%d idx=%d tasks=[%u %u %u %u]\n",
> + cpu, idx, tasks[0], tasks[1],
> + tasks[2], tasks[3]);
> + psi_bug = 1;
> + }
> + tasks[idx]--;
> + }
> + for (to = 0; (bo = ffs(set)); to += bo, set >>= bo)
> + tasks[to + (bo - 1)]++;
> +
> + /* Time in which tasks wait for the CPU */
> + state = PSI_NONE;
> + if (tasks[NR_RUNNING] > 1)
> + state = PSI_SOME;
> + time_state(&groupc->res[PSI_CPU], state, now);
> +
> + /* Time in which tasks wait for memory */
> + state = PSI_NONE;
> + if (tasks[NR_MEMSTALL]) {
> + if (!tasks[NR_RUNNING] ||
> + (cpu_curr(cpu)->flags & PF_MEMSTALL))
> + state = PSI_FULL;
> + else
> + state = PSI_SOME;
> + }
> + time_state(&groupc->res[PSI_MEM], state, now);
> +
> + /* Time in which tasks wait for IO */
> + state = PSI_NONE;
> + if (tasks[NR_IOWAIT]) {
> + if (!tasks[NR_RUNNING])
> + state = PSI_FULL;
> + else
> + state = PSI_SOME;
> + }
> + time_state(&groupc->res[PSI_IO], state, now);
> +
> + /* Time in which tasks are non-idle, to weigh the CPU in summaries */
> + if (groupc->nonidle)
> + groupc->nonidle_time += now - groupc->nonidle_start;
> + groupc->nonidle = tasks[NR_RUNNING] ||
> + tasks[NR_IOWAIT] || tasks[NR_MEMSTALL];
> + if (groupc->nonidle)
> + groupc->nonidle_start = now;
> +
> + /* Kick the stats aggregation worker if it's gone to sleep */
> + if (!delayed_work_pending(&group->clock_work))
This causes a crash when the work is scheduled before system_wq is up; in my
case the first schedule was called from kthreadd. I had to do this to make it work:
if (keventd_up() && !delayed_work_pending(&group->clock_work))
> + schedule_delayed_work(&group->clock_work, MY_LOAD_FREQ);
> +}
> +
> +void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
> +{
> + struct cgroup *cgroup, *parent;
unused variables
Thanks,
Vinayak
On Mon, May 07, 2018 at 05:01:35PM -0400, Johannes Weiner wrote:
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -260,6 +260,18 @@ void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
> task->psi_flags |= set;
>
> psi_group_update(&psi_system, cpu, now, clear, set);
> +
> +#ifdef CONFIG_CGROUPS
> + cgroup = task->cgroups->dfl_cgrp;
> + while (cgroup && (parent = cgroup_parent(cgroup))) {
> + struct psi_group *group;
> +
> + group = cgroup_psi(cgroup);
> + psi_group_update(group, cpu, now, clear, set);
> +
> + cgroup = parent;
> + }
> +#endif
> }
TJ fixed needing that for stats at some point, why can't you do the
same?
On Wed, May 09, 2018 at 12:46:18PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
>
> > @@ -2038,6 +2038,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
> > if (task_cpu(p) != cpu) {
> > wake_flags |= WF_MIGRATED;
> > + psi_ttwu_dequeue(p);
> > set_task_cpu(p, cpu);
> > }
> >
>
> > +static inline void psi_ttwu_dequeue(struct task_struct *p)
> > +{
> > + /*
> > + * Is the task being migrated during a wakeup? Make sure to
> > + * deregister its sleep-persistent psi states from the old
> > + * queue, and let psi_enqueue() know it has to requeue.
> > + */
> > + if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> > + struct rq_flags rf;
> > + struct rq *rq;
> > + int clear = 0;
> > +
> > + if (p->in_iowait)
> > + clear |= TSK_IOWAIT;
> > + if (p->flags & PF_MEMSTALL)
> > + clear |= TSK_MEMSTALL;
> > +
> > + rq = __task_rq_lock(p, &rf);
> > + update_rq_clock(rq);
> > + psi_task_change(p, rq_clock(rq), clear, 0);
> > + p->sched_psi_wake_requeue = 1;
> > + __task_rq_unlock(rq, &rf);
> > + }
> > +}
>
> Yeah, no... not happening.
>
> We spend a lot of time to never touch the old rq->lock on wakeups. Mason
> was the one pushing for that, so he should very well know this.
>
> The one cross-cpu atomic (iowait) is already a problem (the whole iowait
> accounting being useless makes it even worse), adding significant remote
> prodding is just really bad.
Also, since all you need is the global number, I don't think you
actually need any of this. See what we do for nr_uninterruptible.
In general I think you want to (re)read loadavg.c some more, and maybe
reuse a bit more of that.
On Wed, May 09, 2018 at 01:38:49PM +0200, Peter Zijlstra wrote:
> On Wed, May 09, 2018 at 12:46:18PM +0200, Peter Zijlstra wrote:
> > On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> >
> > > @@ -2038,6 +2038,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > > cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
> > > if (task_cpu(p) != cpu) {
> > > wake_flags |= WF_MIGRATED;
> > > + psi_ttwu_dequeue(p);
> > > set_task_cpu(p, cpu);
> > > }
> > >
> >
> > > +static inline void psi_ttwu_dequeue(struct task_struct *p)
> > > +{
> > > + /*
> > > + * Is the task being migrated during a wakeup? Make sure to
> > > + * deregister its sleep-persistent psi states from the old
> > > + * queue, and let psi_enqueue() know it has to requeue.
> > > + */
> > > + if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> > > + struct rq_flags rf;
> > > + struct rq *rq;
> > > + int clear = 0;
> > > +
> > > + if (p->in_iowait)
> > > + clear |= TSK_IOWAIT;
> > > + if (p->flags & PF_MEMSTALL)
> > > + clear |= TSK_MEMSTALL;
> > > +
> > > + rq = __task_rq_lock(p, &rf);
> > > + update_rq_clock(rq);
> > > + psi_task_change(p, rq_clock(rq), clear, 0);
> > > + p->sched_psi_wake_requeue = 1;
> > > + __task_rq_unlock(rq, &rf);
> > > + }
> > > +}
> >
> > Yeah, no... not happening.
> >
> > We spend a lot of time to never touch the old rq->lock on wakeups. Mason
> > was the one pushing for that, so he should very well know this.
> >
> > The one cross-cpu atomic (iowait) is already a problem (the whole iowait
> > accounting being useless makes it even worse), adding significant remote
> > prodding is just really bad.
>
> Also, since all you need is the global number, I don't think you
> actually need any of this. See what we do for nr_uninterruptible.
>
> In general I think you want to (re)read loadavg.c some more, and maybe
> reuse a bit more of that.
So there is a reason I'm tracking productivity states per-cpu and not
globally. Consider the following example periods on two CPUs:
CPU 0
Task 1: | EXECUTING | memstalled |
Task 2: | runqueued | EXECUTING |
CPU 1
Task 3: | memstalled | EXECUTING |
If we tracked only the global number of stalled tasks, similarly to
nr_uninterruptible, the number would be elevated throughout the whole
sampling period, giving a pressure value of 100% for "some stalled".
And, since there is always something executing, a "full stall" of 0%.
Now consider what happens when the Task 3 sequence is the other way
around:
CPU 0
Task 1: | EXECUTING | memstalled |
Task 2: | runqueued | EXECUTING |
CPU 1
Task 3: | EXECUTING | memstalled |
Here the number of stalled tasks is elevated only during half of the
sampling period, this time giving a pressure reading of 50% for "some"
(and again 0% for "full").
That's a different measurement, but in terms of workload progress, the
sequences are functionally equivalent. In both scenarios the same
amount of productive CPU cycles is spent advancing tasks 1, 2 and 3,
and the same amount of potentially productive CPU time is lost due to
the contention of memory. We really ought to read the same pressure.
So what I'm doing is calculating the productivity loss on each CPU in
a sampling period as if they were independent time slices. It doesn't
matter how you slice and dice the sequences within each one - if used
CPU time and lost CPU time have the same proportion, we have the same
pressure.
In both scenarios above, this method will give a pressure reading of
some=50% and full=25% of "normalized walltime", which is the time loss
the work would experience on a single CPU executing it serially.
To illustrate:
CPU X
1 2 3 4
Task 1: | EXECUTING | memstalled | sleeping | sleeping |
Task 2: | runqueued | EXECUTING | sleeping | sleeping |
Task 3: | sleeping | sleeping | EXECUTING | memstalled |
You can clearly see the 50% of walltime in which *somebody* isn't
advancing (2 and 4), and the 25% of walltime in which *no* tasks are
(3). Same amount of work, same memory stalls, same pressure numbers.
Globalized state tracking would produce those numbers on the single
CPU (obviously), but once concurrency gets into the mix, it's
questionable what its results mean. It certainly isn't able to
reliably detect equivalent slowdowns of individual tasks ("some" is
all over the place), and in this example wasn't able to capture the
impact of contention on overall work completion ("full" is 0%).
* CPU 0: some = 50%, full = 0%
CPU 1: some = 50%, full = 50%
avg: some = 50%, full = 25%
On Wed, May 09, 2018 at 11:49:06AM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:33PM -0400, Johannes Weiner wrote:
> > +static inline unsigned long
> > +fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
> > +{
> > + unsigned long result = 1UL << frac_bits;
> > +
> > + if (n) {
> > + for (;;) {
> > + if (n & 1) {
> > + result *= x;
> > + result += 1UL << (frac_bits - 1);
> > + result >>= frac_bits;
> > + }
> > + n >>= 1;
> > + if (!n)
> > + break;
> > + x *= x;
> > + x += 1UL << (frac_bits - 1);
> > + x >>= frac_bits;
> > + }
> > + }
> > +
> > + return result;
> > +}
>
> No real objection; but that does look a wee bit fat for an inline I
> suppose.
Fair enough, I'll put these back where I found them and make
calc_load_n() extern instead.
On Wed, May 09, 2018 at 11:59:38AM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> > new file mode 100644
> > index 000000000000..b22b0ffc729d
> > --- /dev/null
> > +++ b/include/linux/psi_types.h
> > @@ -0,0 +1,84 @@
> > +#ifndef _LINUX_PSI_TYPES_H
> > +#define _LINUX_PSI_TYPES_H
> > +
> > +#include <linux/types.h>
> > +
> > +#ifdef CONFIG_PSI
> > +
> > +/* Tracked task states */
> > +enum psi_task_count {
> > + NR_RUNNING,
> > + NR_IOWAIT,
> > + NR_MEMSTALL,
> > + NR_PSI_TASK_COUNTS,
> > +};
> > +
> > +/* Task state bitmasks */
> > +#define TSK_RUNNING (1 << NR_RUNNING)
> > +#define TSK_IOWAIT (1 << NR_IOWAIT)
> > +#define TSK_MEMSTALL (1 << NR_MEMSTALL)
> > +
> > +/* Resources that workloads could be stalled on */
> > +enum psi_res {
> > + PSI_CPU,
> > + PSI_MEM,
> > + PSI_IO,
> > + NR_PSI_RESOURCES,
> > +};
> > +
> > +/* Pressure states for a group of tasks */
> > +enum psi_state {
> > + PSI_NONE, /* No stalled tasks */
> > + PSI_SOME, /* Stalled tasks & working tasks */
> > + PSI_FULL, /* Stalled tasks & no working tasks */
> > + NR_PSI_STATES,
> > +};
> > +
> > +struct psi_resource {
> > + /* Current pressure state for this resource */
> > + enum psi_state state;
> > +
> > + /* Start of current state (cpu_clock) */
> > + u64 state_start;
> > +
> > + /* Time sampling buckets for pressure states (ns) */
> > + u64 times[NR_PSI_STATES - 1];
>
> Fails to explain why no FULL.
It's NONE that's excluded. I'll add a comment.
> > +struct psi_group_cpu {
> > + /* States of the tasks belonging to this group */
> > + unsigned int tasks[NR_PSI_TASK_COUNTS];
> > +
>
> AFAICT there's a hole here, that would fit the @nonidle member. Which
> also avoids the later hole generated by it.
Good spot, I'll reshuffle this accordingly.
> > + /* Per-resource pressure tracking in this group */
> > + struct psi_resource res[NR_PSI_RESOURCES];
> > +
> > + /* There are runnable or D-state tasks */
> > + bool nonidle;
>
> Mandatory complaint about using _Bool in composites goes here.
int it is.
Thanks
On Wed, May 09, 2018 at 12:04:55PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > +static void psi_clock(struct work_struct *work)
> > +{
> > + u64 some[NR_PSI_RESOURCES] = { 0, };
> > + u64 full[NR_PSI_RESOURCES] = { 0, };
> > + unsigned long nonidle_total = 0;
> > + unsigned long missed_periods;
> > + struct delayed_work *dwork;
> > + struct psi_group *group;
> > + unsigned long expires;
> > + int cpu;
> > + int r;
> > +
> > + dwork = to_delayed_work(work);
> > + group = container_of(dwork, struct psi_group, clock_work);
> > +
> > + /*
> > + * Calculate the sampling period. The clock might have been
> > + * stopped for a while.
> > + */
> > + expires = group->period_expires;
> > + missed_periods = (jiffies - expires) / MY_LOAD_FREQ;
> > + group->period_expires = expires + ((1 + missed_periods) * MY_LOAD_FREQ);
> > +
> > + /*
> > + * Aggregate the per-cpu state into a global state. Each CPU
> > + * is weighted by its non-idle time in the sampling period.
> > + */
> > + for_each_online_cpu(cpu) {
>
> Typically when using online CPU state, you also need hotplug notifiers
> to deal with changes in the online set.
>
> You also typically need something like cpus_read_lock() around an
> iteration of online CPUs, to avoid the set changing while you're poking
> at them.
>
> The lack of either is not evident or explained.
The per-cpu state we access is allocated for each possible CPU, so
that is safe (and state being all 0 is semantically sound, too). In a
race with onlining, we might miss some per-cpu samples, but would
catch them the next time. In a race with offlining, we may never
consider the final up to 2s state history of the disappearing CPU; we
could have an offlining callback to flush the state, but I'm not sure
this would be an actual problem in the real world since the error is
small (smallest averaging window is 5 sampling periods) and then would
age out quickly.
I can certainly add a comment explaining this at least.
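As an aside, the period catch-up arithmetic in the quoted psi_clock() above is easy to sanity-check in plain userspace C. This is only an illustrative sketch (the function name is mine, and MY_LOAD_FREQ is just "period" here); jiffies wraparound is handled the same way, via unsigned subtraction:

```c
#include <assert.h>

/*
 * Sketch of the missed-period catch-up: if the clock worker was
 * stopped for a while, skip past all fully elapsed periods and
 * land on the next period boundary.
 */
static unsigned long next_period_expiry(unsigned long expires,
					unsigned long now,
					unsigned long period)
{
	unsigned long missed_periods = (now - expires) / period;

	return expires + (1 + missed_periods) * period;
}
```

With a period of 50 jiffies and an expiry at 100, waking up at 130 still targets 150, while waking up at 260 (three periods missed) targets 300.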
> > + struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
> > + unsigned long nonidle;
> > +
> > + nonidle = nsecs_to_jiffies(groupc->nonidle_time);
> > + groupc->nonidle_time = 0;
> > + nonidle_total += nonidle;
> > +
> > + for (r = 0; r < NR_PSI_RESOURCES; r++) {
> > + struct psi_resource *res = &groupc->res[r];
> > +
> > + some[r] += (res->times[0] + res->times[1]) * nonidle;
> > + full[r] += res->times[1] * nonidle;
> > +
> > + /* It's racy, but we can tolerate some error */
> > + res->times[0] = 0;
> > + res->times[1] = 0;
> > + }
> > + }
On Wed, May 09, 2018 at 12:05:51PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > + u64 some[NR_PSI_RESOURCES] = { 0, };
> > + u64 full[NR_PSI_RESOURCES] = { 0, };
>
> > + some[r] /= max(nonidle_total, 1UL);
> > + full[r] /= max(nonidle_total, 1UL);
>
> That's a bare 64bit divide.. that typically failed to build on 32bit
> archs.
Ah yes, I'll switch that to do_div(). Thanks
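For illustration, the per-CPU aggregation being discussed boils down to something like the following userspace sketch, where each CPU's stall times are weighted by that CPU's non-idle time and then normalized by the total non-idle time. Names are illustrative, not the kernel's, and the plain 64-bit division here stands in for the do_div() the kernel needs on 32-bit architectures:

```c
#include <assert.h>
#include <stdint.h>

struct cpu_sample {
	uint64_t some_time;    /* ns with some tasks stalled, others working */
	uint64_t full_time;    /* ns with all non-idle tasks stalled */
	uint64_t nonidle_time; /* ns the CPU spent non-idle */
};

/* Aggregate per-CPU state into one some/full pair, nonidle-weighted */
static void aggregate(const struct cpu_sample *cpus, int ncpus,
		      uint64_t *some, uint64_t *full)
{
	uint64_t nonidle_total = 0;
	int cpu;

	*some = 0;
	*full = 0;
	for (cpu = 0; cpu < ncpus; cpu++) {
		nonidle_total += cpus[cpu].nonidle_time;
		/* full time counts toward some as well */
		*some += (cpus[cpu].some_time + cpus[cpu].full_time) *
			 cpus[cpu].nonidle_time;
		*full += cpus[cpu].full_time * cpus[cpu].nonidle_time;
	}
	if (!nonidle_total)
		nonidle_total = 1; /* avoid div-by-zero, as max(..., 1UL) does */
	*some /= nonidle_total;
	*full /= nonidle_total;
}
```

A fully idle CPU contributes weight zero, so it cannot dilute the pressure reading of the CPUs that were actually doing work.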
On Wed, May 09, 2018 at 12:14:54PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 15750c222ca2..1658477466d5 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
[...]
> What's all this churn about?
The psi callbacks in kernel/sched/stat.h use these rq lock functions
from this file, but sched.h includes stat.h before those definitions.
I'll move this into a separate patch with a proper explanation.
On Wed, May 09, 2018 at 12:21:00PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > + local_irq_disable();
> > + rq = this_rq();
> > + raw_spin_lock(&rq->lock);
> > + rq_pin_lock(rq, &rf);
>
> Given that churn in sched.h, you've seen rq_lock() and friends.
>
> Either write this like:
>
> local_irq_disable();
> rq = this_rq();
> rq_lock(rq, &rf);
>
> Or introduce "rq = this_rq_lock_irq()", which we could also use in
> do_sched_yield().
Sounds good, I'll add that.
> > + update_rq_clock(rq);
> > +
> > + current->flags |= PF_MEMSTALL;
> > + psi_task_change(current, rq_clock(rq), 0, TSK_MEMSTALL);
> > +
> > + rq_unpin_lock(rq, &rf);
> > + raw_spin_unlock(&rq->lock);
> > + local_irq_enable();
>
> That's called rq_unlock_irq().
I'll use that. This code was first written against a kernel that
didn't have 8a8c69c32778 ("sched/core: Add rq->lock wrappers.") yet ;)
On Wed, May 09, 2018 at 01:07:36PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:35PM -0400, Johannes Weiner wrote:
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -260,6 +260,18 @@ void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
> > task->psi_flags |= set;
> >
> > psi_group_update(&psi_system, cpu, now, clear, set);
> > +
> > +#ifdef CONFIG_CGROUPS
> > + cgroup = task->cgroups->dfl_cgrp;
> > + while (cgroup && (parent = cgroup_parent(cgroup))) {
> > + struct psi_group *group;
> > +
> > + group = cgroup_psi(cgroup);
> > + psi_group_update(group, cpu, now, clear, set);
> > +
> > + cgroup = parent;
> > + }
> > +#endif
> > }
>
> TJ fixed needing that for stats at some point, why can't you do the
> same?
The stats deltas are all additive, so it's okay to delay flushing them
up the tree right before somebody is trying to look at them.
With this, though, we are tracking time of an aggregate state composed
of child tasks, and that state might not be identical for you and all
your ancestors, so every time a task state changes we have to evaluate
and start/stop clocks on every level, because we cannot derive our
state from the state history of our child groups.
For example, say you have the following tree:
          root
         /
        A
       /  \
      A1    A2
 running=1  running=1
I.e. there is a running task in A1 and one in A2.
root, A, A1, and A2 are all PSI_NONE as nothing is stalled.
Now the task in A2 enters a memstall.
          root
         /
        A
       /  \
      A1    A2
 running=1  memstall=1
From the perspective of A2, the group is now fully blocked and starts
recording time in PSI_FULL.
From the perspective of A, it has a working group below it and a
stalled one, which would make it PSI_SOME, so it starts recording time
in PSI_SOME.
The root/system level likewise has to start the timer on PSI_SOME.
Now the task in A1 enters a memstall, and we have to propagate the
PSI_FULL state up A1 -> A -> root.
I'm not quite sure how we could make this lazy. Say we hadn't
propagated the state from A1 and A2 right away, and somebody is asking
about the averages for A. We could tell that A1 and A2 had been in
PSI_FULL recently, but we wouldn't know whether their time in those
states fully overlapped (all PSI_FULL), partially overlapped (some
PSI_FULL and some PSI_SOME), or didn't overlap at all (only PSI_SOME).
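To make the per-level evaluation concrete, here is a sketch of how a group's pressure state for one resource follows from its task counts. This is the derivation that has to run eagerly at every level on every task state change, because an ancestor's state cannot be reconstructed from the child groups' state histories (the function name and signature are illustrative):

```c
#include <assert.h>

/* Pressure states, mirroring enum psi_state from the patch */
enum psi_state { PSI_NONE, PSI_SOME, PSI_FULL };

/*
 * A group is PSI_NONE with no stalled tasks, PSI_SOME when stalled
 * and working tasks coexist, and PSI_FULL when every non-idle task
 * in the group is stalled.
 */
static enum psi_state group_state(unsigned int nr_working,
				  unsigned int nr_stalled)
{
	if (!nr_stalled)
		return PSI_NONE;
	if (nr_working)
		return PSI_SOME;
	return PSI_FULL;
}
```

In the example tree: A2 with its one memstalled task evaluates to PSI_FULL (0 working, 1 stalled), while A, which sees A1's runner plus A2's staller (1 working, 1 stalled), evaluates to PSI_SOME.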
On Thu, May 10, 2018 at 09:41:32AM -0400, Johannes Weiner wrote:
> So there is a reason I'm tracking productivity states per-cpu and not
> globally. Consider the following example periods on two CPUs:
>
> CPU 0
> Task 1: | EXECUTING | memstalled |
> Task 2: | runqueued | EXECUTING |
>
> CPU 1
> Task 3: | memstalled | EXECUTING |
>
> If we tracked only the global number of stalled tasks, similarly to
> nr_uninterruptible, the number would be elevated throughout the whole
> sampling period, giving a pressure value of 100% for "some stalled".
> And, since there is always something executing, a "full stall" of 0%.
But if you read the comment about SMP IO-wait; see commit:
e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
you'll see that per-cpu accounting has issues too.
Also, note that in your example above you have 1 memstalled task (at any
one time), but _2_ CPUs. So at most you should end up with a 50% value.
There is no way 1 task could consume 2 CPUs worth of time.
Furthermore, associating a blocked task to any particular CPU is
fundamentally broken and I'll hard NAK anything that relies on it.
> Now consider what happens when the Task 3 sequence is the other way
> around:
>
> CPU 0
> Task 1: | EXECUTING | memstalled |
> Task 2: | runqueued | EXECUTING |
>
> CPU 1
> Task 3: | EXECUTING | memstalled |
>
> Here the number of stalled tasks is elevated only during half of the
> sampling period, this time giving a pressure reading of 50% for "some"
> (and again 0% for "full").
That entirely depends on your averaging; an exponentially decaying
average would not typically result in 50% for the above case. But I
think we can agree that this results in one 0% and one 100% sample -- we
have two stalled tasks and two CPUs.
> That's a different measurement, but in terms of workload progress, the
> sequences are functionally equivalent. In both scenarios the same
> amount of productive CPU cycles is spent advancing tasks 1, 2 and 3,
> and the same amount of potentially productive CPU time is lost due to
> the contention of memory. We really ought to read the same pressure.
And you do -- subject to the averaging used, as per the above.
The first gives two 50% samples, the second gives 0%, 100%.
> So what I'm doing is calculating the productivity loss on each CPU in
> a sampling period as if they were independent time slices. It doesn't
> matter how you slice and dice the sequences within each one - if used
> CPU time and lost CPU time have the same proportion, we have the same
> pressure.
I'm still thinking you can do basically the same without the strong
CPU relation.
> To illustrate:
>
> CPU X
> 1 2 3 4
> Task 1: | EXECUTING | memstalled | sleeping | sleeping |
> Task 2: | runqueued | EXECUTING | sleeping | sleeping |
> Task 3: | sleeping | sleeping | EXECUTING | memstalled |
>
> You can clearly see the 50% of walltime in which *somebody* isn't
> advancing (2 and 4), and the 25% of walltime in which *no* tasks are
> (3). Same amount of work, same memory stalls, same pressure numbers.
>
> Globalized state tracking would produce those numbers on the single
> CPU (obviously), but once concurrency gets into the mix, it's
> questionable what its results mean. It certainly isn't able to
> reliably detect equivalent slowdowns of individual tasks ("some" is
> all over the place), and in this example wasn't able to capture the
> impact of contention on overall work completion ("full" is 0%).
>
> * CPU 0: some = 50%, full = 0%
> CPU 1: some = 50%, full = 50%
> avg: some = 50%, full = 25%
I'm not entirely sure I get your point here; but note that a task
doesn't sleep on a CPU. When it sleeps it is not strictly associated
with a CPU, only when it runs does it have an association.
What is the value of accounting a sleep state to a particular CPU if the
task when wakes up on another? Where did the sleep take place?
All we really can say is that a task slept, and if we can reduce the
reason for its sleeping (IO, reclaim, whatever) then it could've run
sooner. And then you can make predictions based on the number of CPUs
and global idle time, how much that could improve things.
On Mon, 7 May 2018, Johannes Weiner wrote:
> What to make of this number? If CPU utilization is at 100% and CPU
> pressure is 0, it means the system is perfectly utilized, with one
> runnable thread per CPU and nobody waiting. At two or more runnable
> tasks per CPU, the system is 100% overcommitted and the pressure
> average will indicate as much. From a utilization perspective this is
> a great state of course: no CPU cycles are being wasted, even when 50%
> of the threads were to go idle (and most workloads do vary). From the
> perspective of the individual job it's not great, however, and they
> might do better with more resources. Depending on what your priority
> is, an elevated "some" number may or may not require action.
This looks awfully similar to loadavg. Problem is that loadavg gets
screwed up by tasks blocked waiting for I/O. Isn't there some way to fix
loadavg instead?
On 05/14/18 08:39, Christopher Lameter wrote:
> On Mon, 7 May 2018, Johannes Weiner wrote:
>> What to make of this number? If CPU utilization is at 100% and CPU
>> pressure is 0, it means the system is perfectly utilized, with one
>> runnable thread per CPU and nobody waiting. At two or more runnable
>> tasks per CPU, the system is 100% overcommitted and the pressure
>> average will indicate as much. From a utilization perspective this is
>> a great state of course: no CPU cycles are being wasted, even when 50%
>> of the threads were to go idle (and most workloads do vary). From the
>> perspective of the individual job it's not great, however, and they
>> might do better with more resources. Depending on what your priority
>> is, an elevated "some" number may or may not require action.
>
> This looks awfully similar to loadavg. Problem is that loadavg gets
> screwed up by tasks blocked waiting for I/O. Isn't there some way to fix
> loadavg instead?
The following article explains why it probably made sense in 1993 to
include TASK_UNINTERRUPTIBLE in loadavg and also why this no longer
makes sense today:
http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
Bart.
On Mon, May 14, 2018 at 03:39:33PM +0000, Christopher Lameter wrote:
> On Mon, 7 May 2018, Johannes Weiner wrote:
>
> > What to make of this number? If CPU utilization is at 100% and CPU
> > pressure is 0, it means the system is perfectly utilized, with one
> > runnable thread per CPU and nobody waiting. At two or more runnable
> > tasks per CPU, the system is 100% overcommitted and the pressure
> > average will indicate as much. From a utilization perspective this is
> > a great state of course: no CPU cycles are being wasted, even when 50%
> > of the threads were to go idle (and most workloads do vary). From the
> > perspective of the individual job it's not great, however, and they
> > might do better with more resources. Depending on what your priority
> > is, an elevated "some" number may or may not require action.
>
> This looks awfully similar to loadavg. Problem is that loadavg gets
> screwed up by tasks blocked waiting for I/O. Isn't there some way to fix
> loadavg instead?
Counting iowaiting tasks is one thing, but there are a few more things
that make it hard to use for telling the impact of CPU competition:
- It's not normalized to available CPU count. The loadavg in isolation
doesn't mean anything, and you have to know the number of CPUs and
any CPU bindings / restrictions in effect, which presents at least
some difficulty when monitoring a big heterogeneous fleet.
- The way it's sampled makes it impossible to use for latencies. You
could be mostly idle but periodically have herds of tasks competing
for the CPU for short, low-latency operations. Even if we changed
this in the implementation, you're still stuck with the interface
that has...
- ...a short-term load window of 1m. This is generally fairly coarse
for something that can be loaded and unloaded as abruptly as the CPU.
I'm trying to fix these with a portable way of aggregating multi-cpu
states, as well as tracking the true time spent in a state instead of
sampling it. Plus a smaller short-term window of 10s, but that's
almost irrelevant because I'm exporting the absolute state time clock
so you can calculate your own averages over any time window you want.
Since I'm using the same model and infrastructure for memory and IO
load as well, IMO it makes more sense to present them in a coherent
interface instead of trying to retrofit and change the loadavg file,
which might not even be possible.
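Since the absolute state time clock (the total= field, in microseconds) is exported, a monitor can derive its own pressure percentage over any window by sampling it twice. A minimal userspace sketch, with illustrative names:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pressure over a custom window: the share of walltime spent
 * stalled between two samples of the cumulative total= clock
 * taken window_us microseconds apart.
 */
static unsigned int pressure_pct(uint64_t total_prev_us,
				 uint64_t total_now_us,
				 uint64_t window_us)
{
	return (unsigned int)((total_now_us - total_prev_us) * 100 / window_us);
}
```

For example, 500ms of accumulated stall time over a 1s sampling window reads as 50% pressure, regardless of how the kernel's built-in 10s/1m/5m averages would smooth it.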
On Mon, 14 May 2018, Johannes Weiner wrote:
> Since I'm using the same model and infrastructure for memory and IO
> load as well, IMO it makes more sense to present them in a coherent
> interface instead of trying to retrofit and change the loadavg file,
> which might not even be possible.
Well I keep looking at the loadavg output from numerous tools and then in
my mind I divide by the number of processors, guess if any of the threads
would be doing I/O, and if I cannot figure that out, groan and run "vmstat"
for a while to figure it out.
Let's have some numbers there that make more sense, please.
On Wed, May 09, 2018 at 04:33:24PM +0530, Vinayak Menon wrote:
> On 5/8/2018 2:31 AM, Johannes Weiner wrote:
> > + /* Kick the stats aggregation worker if it's gone to sleep */
> > + if (!delayed_work_pending(&group->clock_work))
>
> This causes a crash when the work is scheduled before system_wq is up. In my case when the first
> schedule was called from kthreadd. And I had to do this to make it work.
> if (keventd_up() && !delayed_work_pending(&group->clock_work))
>
> > + schedule_delayed_work(&group->clock_work, MY_LOAD_FREQ);
I was trying to figure out how this is possible, and it didn't make
sense because we do initialize the system_wq way before kthreadd.
Did you by any chance backport this to a pre-4.10 kernel which does
not have 3347fa092821 ("workqueue: make workqueue available early
during boot") yet?
> > +void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
> > +{
> > + struct cgroup *cgroup, *parent;
>
> unused variables
They're used in the next patch, I'll fix that up.
Thanks
On 5/23/2018 6:47 PM, Johannes Weiner wrote:
> On Wed, May 09, 2018 at 04:33:24PM +0530, Vinayak Menon wrote:
>> On 5/8/2018 2:31 AM, Johannes Weiner wrote:
>>> + /* Kick the stats aggregation worker if it's gone to sleep */
>>> + if (!delayed_work_pending(&group->clock_work))
>> This causes a crash when the work is scheduled before system_wq is up. In my case when the first
>> schedule was called from kthreadd. And I had to do this to make it work.
>> if (keventd_up() && !delayed_work_pending(&group->clock_work))
>>
>>> + schedule_delayed_work(&group->clock_work, MY_LOAD_FREQ);
> I was trying to figure out how this is possible, and it didn't make
> sense because we do initialize the system_wq way before kthreadd.
>
> Did you by any chance backport this to a pre-4.10 kernel which does
> not have 3347fa092821 ("workqueue: make workqueue available early
> during boot") yet?
Sorry, I did not mention that. I was trying it on a 4.9 kernel. It's clear now. Thanks.
>>> +void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
>>> +{
>>> + struct cgroup *cgroup, *parent;
>> unused variables
> They're used in the next patch, I'll fix that up.
>
> Thanks
Hi Johannes,
I tried your previous memdelay patches before this new set was posted
and results were promising for predicting when Android system is close
to OOM. I'm definitely going to try this one after I backport it to
4.9.
On Mon, May 7, 2018 at 2:01 PM, Johannes Weiner <[email protected]> wrote:
> Hi,
>
> I previously submitted a version of this patch set called "memdelay",
> which translated delays from reclaim, swap-in, thrashing page cache
> into a pressure percentage of lost walltime. I've since extended this
> code to aggregate all delay states tracked by delayacct in order to
> have generalized pressure/overcommit levels for CPU, memory, and IO.
>
> There was feedback from Peter on the previous version that I have
> incorporated as much as possible and as it still applies to this code:
>
> - got rid of the extra lock in the sched callbacks; all task
> state changes we care about serialize through rq->lock
>
> - got rid of ktime_get() inside the sched callbacks and
> switched time measuring to rq_clock()
>
> - got rid of all divisions inside the sched callbacks,
> tracking everything natively in ns now
>
> I also moved this stuff into existing sched/stat.h callbacks, so it
> doesn't get in the way in sched/core.c, and of course moved the whole
> thing behind CONFIG_PSI since not everyone is going to want it.
Would it make sense to split CONFIG_PSI into CONFIG_PSI_CPU,
CONFIG_PSI_MEM and CONFIG_PSI_IO since one might need only specific
subset of this feature?
>
> Real-world applications
>
> Since the last posting, we've begun using the data collected by this
> code quite extensively at Facebook, and with several success stories.
>
> First we used it on systems that frequently locked up in low memory
> situations. The reason this happens is that the OOM killer is
> triggered by reclaim not being able to make forward progress, but with
> fast flash devices there is *always* some clean and uptodate cache to
> reclaim; the OOM killer never kicks in, even as tasks wait 80-90% of
> the time faulting executables. There is no situation where this ever
> makes sense in practice. We wrote a <100 line POC python script to
> monitor memory pressure and kill stuff manually, way before such
> pathological thrashing.
>
> We've since extended the python script into a more generic oomd that
> we use all over the place, not just to avoid livelocks but also to
> guarantee latency and throughput SLAs, since they're usually violated
> way before the kernel OOM killer would ever kick in.
>
> We also use the memory pressure info for loadshedding. Our batch job
> infrastructure used to refuse new requests on heuristics based on RSS
> and other existing VM metrics in an attempt to avoid OOM kills and
> maximize utilization. Since it was still plagued by frequent OOM
> kills, we switched it to shed load on psi memory pressure, which has
> turned out to be a much better bellwether, and we managed to reduce
> OOM kills drastically. Reducing the rate of OOM outages from the
> worker pool raised its aggregate productivity, and we were able to
> switch that service to smaller machines.
>
> Lastly, we use cgroups to isolate a machine's main workload from
> maintenance crap like package upgrades, logging, configuration, as
> well as to prevent multiple workloads on a machine from stepping on
> each others' toes. We were not able to do this properly without the
> pressure metrics; we would see latency or bandwidth drops, but it
> would often be hard to impossible to rootcause it post-mortem. We now
> log and graph the pressure metrics for all containers in our fleet and
> can trivially link service drops to resource pressure after the fact.
>
> How do you use this?
>
> A kernel with CONFIG_PSI=y will create a /proc/pressure directory with
> 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have
> cpu.pressure, memory.pressure and io.pressure files, which simply
> calculate pressure at the cgroup level instead of system-wide.
>
> The cpu file contains one line:
>
> some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
>
> The averages give the percentage of walltime in which some tasks are
> delayed on the runqueue while another task has the CPU. They're recent
> averages over 10s, 1m, 5m windows, so you can tell short term trends
> from long term ones, similarly to the load average.
>
> What to make of this number? If CPU utilization is at 100% and CPU
> pressure is 0, it means the system is perfectly utilized, with one
> runnable thread per CPU and nobody waiting. At two or more runnable
> tasks per CPU, the system is 100% overcommitted and the pressure
> average will indicate as much. From a utilization perspective this is
> a great state of course: no CPU cycles are being wasted, even when 50%
> of the threads were to go idle (and most workloads do vary). From the
> perspective of the individual job it's not great, however, and they
> might do better with more resources. Depending on what your priority
> is, an elevated "some" number may or may not require action.
>
> The memory file contains two lines:
>
> some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
> full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
>
> The some line is the same as for cpu: the time in which at least one
> task is stalled on the resource.
>
> The full line, however, indicates time in which *nobody* is using the
> CPU productively due to pressure: all non-idle tasks could be waiting
> on thrashing cache simultaneously. It can also happen when a single
> reclaimer occupies the CPU, since nothing else can make forward
> progress during that time. CPU cycles are being wasted. Significant
> time spent in there is a good trigger for killing, moving jobs to
> other machines, or dropping incoming requests, since neither the jobs
> nor the machine overall is making too much headway.
>
> The total= value gives the absolute stall time in microseconds. This
> allows detecting latency spikes that might be too short to sway the
> running averages. It also allows custom time averaging in case the
> 10s/1m/5m windows aren't adequate for the usecase (or are too coarse
> with future hardware).
>
Any reasons these specific windows were chosen (empirical
data/historical reasons)? I'm worried that with the smallest window
being 10s the signal might be too inert to detect fast memory pressure
buildup before OOM kill happens. I'll have to experiment with that
first, however if you have some insights into this already please
share them.
> The io file is similar to memory. However, unlike CPU and memory, the
> block layer doesn't have a concept of hardware contention. We cannot
> know if the IO a task is waiting on is being performed by the device
> or whether the device is busy with or slowed down by other requests. As a
> result, we can tell how many CPU cycles go to waste due to IO delays,
> but we can not identify the competition factor in those delays.
>
> These patches are against v4.17-rc4.
>
> Documentation/accounting/psi.txt | 73 ++++
> Documentation/cgroup-v2.txt | 18 +
> arch/powerpc/platforms/cell/cpufreq_spudemand.c | 2 +-
> arch/powerpc/platforms/cell/spufs/sched.c | 9 +-
> arch/s390/appldata/appldata_os.c | 4 -
> drivers/cpuidle/governors/menu.c | 4 -
> fs/proc/loadavg.c | 3 -
> include/linux/cgroup-defs.h | 4 +
> include/linux/cgroup.h | 15 +
> include/linux/delayacct.h | 23 +
> include/linux/mmzone.h | 1 +
> include/linux/page-flags.h | 5 +-
> include/linux/psi.h | 52 +++
> include/linux/psi_types.h | 84 ++++
> include/linux/sched.h | 10 +
> include/linux/sched/loadavg.h | 90 +++-
> include/linux/sched/stat.h | 10 +-
> include/linux/swap.h | 2 +-
> include/trace/events/mmflags.h | 1 +
> include/uapi/linux/taskstats.h | 6 +-
> init/Kconfig | 20 +
> kernel/cgroup/cgroup.c | 45 +-
> kernel/debug/kdb/kdb_main.c | 7 +-
> kernel/delayacct.c | 15 +
> kernel/fork.c | 4 +
> kernel/sched/Makefile | 1 +
> kernel/sched/core.c | 3 +
> kernel/sched/loadavg.c | 84 ----
> kernel/sched/psi.c | 499 ++++++++++++++++++++++
> kernel/sched/sched.h | 166 +++----
> kernel/sched/stats.h | 91 +++-
> mm/compaction.c | 5 +
> mm/filemap.c | 27 +-
> mm/huge_memory.c | 1 +
> mm/memcontrol.c | 2 +
> mm/migrate.c | 2 +
> mm/page_alloc.c | 10 +
> mm/swap_state.c | 1 +
> mm/vmscan.c | 14 +
> mm/vmstat.c | 1 +
> mm/workingset.c | 113 +++--
> tools/accounting/getdelays.c | 8 +-
> 42 files changed, 1279 insertions(+), 256 deletions(-)
>
>
>
>
Thanks,
Suren.
Hi Suren,
On Fri, May 25, 2018 at 05:29:30PM -0700, Suren Baghdasaryan wrote:
> Hi Johannes,
> I tried your previous memdelay patches before this new set was posted
> and results were promising for predicting when Android system is close
> to OOM. I'm definitely going to try this one after I backport it to
> 4.9.
I'm happy to hear that!
> Would it make sense to split CONFIG_PSI into CONFIG_PSI_CPU,
> CONFIG_PSI_MEM and CONFIG_PSI_IO since one might need only specific
> subset of this feature?
Yes, that should be doable. I'll split them out in the next version.
> > The total= value gives the absolute stall time in microseconds. This
> > allows detecting latency spikes that might be too short to sway the
> > running averages. It also allows custom time averaging in case the
> > 10s/1m/5m windows aren't adequate for the usecase (or are too coarse
> > with future hardware).
>
> Any reasons these specific windows were chosen (empirical
> data/historical reasons)? I'm worried that with the smallest window
> being 10s the signal might be too inert to detect fast memory pressure
> buildup before OOM kill happens. I'll have to experiment with that
> first, however if you have some insights into this already please
> share them.
They were chosen empirically. We started out with the loadavg window
sizes, but had to reduce them for exactly the reason you mention -
they're way too coarse to detect acute pressure buildup.
10s has been working well for us. We could make it smaller, but there
is some worry that we don't have enough samples then and the average
becomes too erratic - whereas monitoring total= directly would allow
you to detect acute spikes and handle this erraticness explicitly.
Let me know how it works out in your tests.
Thanks for your feedback.
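For reference, the avg10/avg60/avg300 numbers follow the same fixed-point exponentially decaying scheme as the load average: each sampling period, the old average is decayed toward the new sample. A plain-C sketch of that update step, using 11-bit fixed point as in linux/sched/loadavg.h (the exact decay factors psi uses per window are not reproduced here):

```c
#include <assert.h>

#define FP_SHIFT 11
#define FP_ONE   (1UL << FP_SHIFT)

/*
 * One step of an exponentially decaying average: keep exp_factor
 * worth of the old average, blend in (FP_ONE - exp_factor) worth
 * of the new sample. avg, sample, and exp_factor share the same
 * fixed-point scale.
 */
static unsigned long decay_avg(unsigned long avg, unsigned long sample,
			       unsigned long exp_factor)
{
	return (avg * exp_factor + sample * (FP_ONE - exp_factor)) >> FP_SHIFT;
}
```

A smaller exp_factor makes the average react faster but also more erratically, which is exactly the trade-off behind the 10s minimum window discussed above.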
On Tue, May 29, 2018 at 11:16 AM, Johannes Weiner <[email protected]> wrote:
> Hi Suren,
>
> On Fri, May 25, 2018 at 05:29:30PM -0700, Suren Baghdasaryan wrote:
>> Hi Johannes,
>> I tried your previous memdelay patches before this new set was posted
>> and results were promising for predicting when Android system is close
>> to OOM. I'm definitely going to try this one after I backport it to
>> 4.9.
>
> I'm happy to hear that!
>
>> Would it make sense to split CONFIG_PSI into CONFIG_PSI_CPU,
>> CONFIG_PSI_MEM and CONFIG_PSI_IO since one might need only specific
>> subset of this feature?
>
> Yes, that should be doable. I'll split them out in the next version.
>
>> > The total= value gives the absolute stall time in microseconds. This
>> > allows detecting latency spikes that might be too short to sway the
>> > running averages. It also allows custom time averaging in case the
>> > 10s/1m/5m windows aren't adequate for the usecase (or are too coarse
>> > with future hardware).
>>
>> Any reasons these specific windows were chosen (empirical
>> data/historical reasons)? I'm worried that with the smallest window
>> being 10s the signal might be too inert to detect fast memory pressure
>> buildup before OOM kill happens. I'll have to experiment with that
>> first, however if you have some insights into this already please
>> share them.
>
> They were chosen empirically. We started out with the loadavg window
> sizes, but had to reduce them for exactly the reason you mention -
> they're way too coarse to detect acute pressure buildup.
>
> 10s has been working well for us. We could make it smaller, but there
> is some worry that we don't have enough samples then and the average
> becomes too erratic - whereas monitoring total= directly would allow
> you to detect accute spikes and handle this erraticness explicitly.
Unfortunately, the total= field is now updated only at 2sec intervals, which
might be too late to react to mounting memory pressure. With previous
memdelay patchset md->aggregate which is reported as "total" was
calculated directly from inside memdelay_task_change, so it was always
up-to-date. Now group->some and group->full are updated from inside
psi_clock with up to 2sec delay. This prevents us from detecting these
acute pressure spikes immediately. I understand why you moved these
calculations out of the hot path but maybe we could keep updating
"total" inside psi_group_update? This would allow for custom averaging
and eliminate this delay for detecting spikes in the pressure signal.
More conceptually I would love to have a way to monitor the averages
at a slow rate and when they rise and cross some threshold to increase
the monitoring rate and react quickly in case they shoot up. Current
2sec delay poses a problem for doing that.
>
> Let me know how it works out in your tests.
I've done the backporting to 4.9 and am running the tests, but the 2sec
delay is problematic for getting a detailed look at the signal and its
usefulness. I'm thinking about workarounds, if only for data collection,
but don't want to deviate too much from your baseline. Would love to hear
from you if a good compromise can be reached here.
>
> Thanks for your feedback.
Hi Johannes,
On Mon, May 7, 2018 at 2:01 PM, Johannes Weiner <[email protected]> wrote:
> +static void psi_clock(struct work_struct *work)
> +{
> + u64 some[NR_PSI_RESOURCES] = { 0, };
> + u64 full[NR_PSI_RESOURCES] = { 0, };
> + unsigned long nonidle_total = 0;
> + unsigned long missed_periods;
> + struct delayed_work *dwork;
> + struct psi_group *group;
> + unsigned long expires;
> + int cpu;
> + int r;
> +
> + dwork = to_delayed_work(work);
> + group = container_of(dwork, struct psi_group, clock_work);
> +
> + /*
> + * Calculate the sampling period. The clock might have been
> + * stopped for a while.
> + */
> + expires = group->period_expires;
> + missed_periods = (jiffies - expires) / MY_LOAD_FREQ;
> + group->period_expires = expires + ((1 + missed_periods) * MY_LOAD_FREQ);
> +
> + /*
> + * Aggregate the per-cpu state into a global state. Each CPU
> + * is weighted by its non-idle time in the sampling period.
> + */
Would it be possible to move this aggregation code (excluding
calc_avgs()) into a separate function that is called from here as
well as from psi_show(), before group->some[] and group->full[] are
reported? That would not affect performance when the information is
not requested, and at the same time would keep at least the "total"
field up-to-date whenever the data is requested.

For calc_avgs() we would have to compute the change in nonidle_total,
group->some[] and group->full[] differently, because a call to
psi_show() between two psi_clock() invocations would refresh these
fields before the 2secs expire. Calculating that change is trivial,
though, if we store the previous group->some[], group->full[] and
nonidle_total values inside psi_clock(). This would require new
fields in the psi_group struct to hold the previous values, but the
upside is that we eliminate the problem of reporting potentially
stale data (up to a 2sec update delay) and provide a function one can
use to refresh group->some[] and group->full[] and implement custom
averaging.
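To make the delta bookkeeping concrete, here is a minimal userspace
model of what I mean (the struct and function names are invented for
illustration, not actual patch code): the refresh path folds new
stall time into a running total whenever it is called, and the
periodic averaging path takes its per-period delta against a stored
snapshot, so it keeps working no matter how many read-side refreshes
happened in between.

```c
#include <assert.h>

/* Toy model: one resource, totals in arbitrary time units. */
struct group_model {
	unsigned long long total;	/* accumulated stall time */
	unsigned long long prev_total;	/* snapshot at last averaging pass */
};

/* Callable from both the clock tick and the read side (psi_show). */
static void refresh_total(struct group_model *g, unsigned long long delta)
{
	g->total += delta;
}

/*
 * Called only from the periodic clock: the per-period change is taken
 * against the stored snapshot, so intermediate refreshes from the
 * read side don't distort the averages.
 */
static unsigned long long period_delta(struct group_model *g)
{
	unsigned long long d = g->total - g->prev_total;

	g->prev_total = g->total;
	return d;
}
```

With this split, "total" is always current for readers while
calc_avgs() still sees exactly one period's worth of change per tick.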
> + for_each_online_cpu(cpu) {
> + struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
> + unsigned long nonidle;
> +
> + nonidle = nsecs_to_jiffies(groupc->nonidle_time);
> + groupc->nonidle_time = 0;
> + nonidle_total += nonidle;
> +
> + for (r = 0; r < NR_PSI_RESOURCES; r++) {
> + struct psi_resource *res = &groupc->res[r];
> +
> + some[r] += (res->times[0] + res->times[1]) * nonidle;
> + full[r] += res->times[1] * nonidle;
> +
> + /* It's racy, but we can tolerate some error */
> + res->times[0] = 0;
> + res->times[1] = 0;
> + }
> + }
> +
> + for (r = 0; r < NR_PSI_RESOURCES; r++) {
> + /* Finish the weighted aggregation */
> + some[r] /= max(nonidle_total, 1UL);
> + full[r] /= max(nonidle_total, 1UL);
> +
> + /* Accumulate stall time */
> + group->some[r] += some[r];
> + group->full[r] += full[r];
> +
> + /* Calculate recent pressure averages */
> + calc_avgs(group->avg_some[r], some[r], missed_periods);
> + calc_avgs(group->avg_full[r], full[r], missed_periods);
> + }
> +
> + /* Keep the clock ticking only when there is action */
> + if (nonidle_total)
> + schedule_delayed_work(dwork, MY_LOAD_FREQ);
> +}
> +
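For reference, the aggregation above boils down to a nonidle-weighted
average of per-cpu stall times. A standalone version of that
arithmetic (with made-up sample numbers; the function name is mine,
not from the patch) looks like this:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Weighted aggregation as in psi_clock(): each CPU's stall time is
 * weighted by its non-idle time, so a mostly idle CPU contributes
 * little to the global pressure figure.
 */
static unsigned long weighted_stall(const unsigned long *stall,
				    const unsigned long *nonidle,
				    size_t ncpu)
{
	unsigned long long sum = 0;
	unsigned long total_nonidle = 0;
	size_t i;

	for (i = 0; i < ncpu; i++) {
		sum += (unsigned long long)stall[i] * nonidle[i];
		total_nonidle += nonidle[i];
	}
	/* Avoid dividing by zero, like max(nonidle_total, 1UL) above. */
	if (total_nonidle < 1)
		total_nonidle = 1;
	return (unsigned long)(sum / total_nonidle);
}
```

E.g. a CPU that was non-idle 90% of the period dominates the result,
which is exactly why a refresh helper could reuse this code unchanged
from both psi_clock() and psi_show().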
Thanks,
Suren.