2019-01-24 21:16:12

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v3 0/5] psi: pressure stall monitors v3

This is respin of:
https://lwn.net/ml/linux-kernel/[email protected]/

Android is adopting psi to detect and remedy memory pressure that
results in stuttering and decreased responsiveness on mobile devices.

Psi gives us the stall information, but because we're dealing with
latencies in the millisecond range, periodically reading the pressure
files to detect stalls in a timely fashion is not feasible. Psi also
doesn't aggregate its averages at a high-enough frequency right now.

This patch series extends the psi interface such that users can
configure sensitive latency thresholds and use poll() and friends to
be notified when these are breached.

As high-frequency aggregation is costly, it implements an aggregation
method that is optimized for fast, short-interval averaging, and makes
the aggregation frequency adaptive, such that high-frequency updates
only happen while monitored stall events are actively occurring.

With these patches applied, Android can monitor for, and ward off,
mounting memory shortages before they cause problems for the user.
For example, using memory stall monitors in userspace low memory
killer daemon (lmkd) we can detect mounting pressure and kill less
important processes before device becomes visibly sluggish. In our
memory stress testing psi memory monitors produce roughly 10x less
false positives compared to vmpressure signals. Having ability to
specify multiple triggers for the same psi metric allows other parts
of Android framework to monitor memory state of the device and act
accordingly.

The new interface is straight-forward. The user opens one of the
pressure files for writing and writes a trigger description into the
file descriptor that defines the stall state - some or full, and the
maximum stall time over a given window of time. E.g.:

/* Signal when stall time exceeds 100ms of a 1s window */
char trigger[] = "full 100000 1000000"
fd = open("/proc/pressure/memory")
write(fd, trigger, sizeof(trigger))
while (poll() >= 0) {
...
};
close(fd);

When the monitored stall state is entered, psi adapts its aggregation
frequency according to what the configured time window requires in
order to emit event signals in a timely fashion. Once the stalling
subsides, aggregation reverts back to normal.

The trigger is associated with the open file descriptor. To stop
monitoring, the user only needs to close the file descriptor and the
trigger is discarded.

Patches 1-4 prepare the psi code for polling support. Patch 5
implements the adaptive polling logic, the pressure growth detection
optimized for short intervals, and hooks up write() and poll() on the
pressure files.

The patches were developed in collaboration with Johannes Weiner.

The patches are based on 4.20-rc7.

Johannes Weiner (2):
fs: kernfs: add poll file operation
kernel: cgroup: add poll file operation

Suren Baghdasaryan (3):
psi: introduce state_mask to represent stalled psi states
psi: rename psi fields in preparation for psi trigger addition
psi: introduce psi monitor

Documentation/accounting/psi.txt | 104 ++++++
fs/kernfs/file.c | 31 +-
include/linux/cgroup-defs.h | 4 +
include/linux/kernfs.h | 6 +
include/linux/psi.h | 10 +
include/linux/psi_types.h | 83 ++++-
kernel/cgroup/cgroup.c | 119 +++++-
kernel/sched/psi.c | 605 ++++++++++++++++++++++++++++---
8 files changed, 891 insertions(+), 71 deletions(-)

Changes in v3:
- Added smp_mb in the slowpath, as per Peter
- Changed psi_group.polling flag to atomic_t, as per Peter
- Added a comment in the hotpath about implicit smp_wmb in the
chmxchg, as per Johannes
- Minimized line-breaks wherever possible, as per Peter

--
2.20.1.321.g9e740568ce-goog



2019-01-24 21:16:13

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v3 1/5] fs: kernfs: add poll file operation

From: Johannes Weiner <[email protected]>

Kernfs has a standardized poll/notification mechanism for waking all
pollers on all fds when a filesystem node changes. To allow polling
for custom events, add a .poll callback that can override the default.

This is in preparation for pollable cgroup pressure files which have
per-fd trigger configurations.

Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
fs/kernfs/file.c | 31 ++++++++++++++++++++-----------
include/linux/kernfs.h | 6 ++++++
2 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index f8d5021a652e..ae948aaa4c53 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -832,26 +832,35 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
* to see if it supports poll (Neither 'poll' nor 'select' return
* an appropriate error code). When in doubt, set a suitable timeout value.
*/
+__poll_t kernfs_generic_poll(struct kernfs_open_file *of, poll_table *wait)
+{
+ struct kernfs_node *kn = kernfs_dentry_node(of->file->f_path.dentry);
+ struct kernfs_open_node *on = kn->attr.open;
+
+ poll_wait(of->file, &on->poll, wait);
+
+ if (of->event != atomic_read(&on->event))
+ return DEFAULT_POLLMASK|EPOLLERR|EPOLLPRI;
+
+ return DEFAULT_POLLMASK;
+}
+
static __poll_t kernfs_fop_poll(struct file *filp, poll_table *wait)
{
struct kernfs_open_file *of = kernfs_of(filp);
struct kernfs_node *kn = kernfs_dentry_node(filp->f_path.dentry);
- struct kernfs_open_node *on = kn->attr.open;
+ __poll_t ret;

if (!kernfs_get_active(kn))
- goto trigger;
+ return DEFAULT_POLLMASK|EPOLLERR|EPOLLPRI;

- poll_wait(filp, &on->poll, wait);
+ if (kn->attr.ops->poll)
+ ret = kn->attr.ops->poll(of, wait);
+ else
+ ret = kernfs_generic_poll(of, wait);

kernfs_put_active(kn);
-
- if (of->event != atomic_read(&on->event))
- goto trigger;
-
- return DEFAULT_POLLMASK;
-
- trigger:
- return DEFAULT_POLLMASK|EPOLLERR|EPOLLPRI;
+ return ret;
}

static void kernfs_notify_workfn(struct work_struct *work)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 5b36b1287a5a..0cac1207bb00 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -25,6 +25,7 @@ struct seq_file;
struct vm_area_struct;
struct super_block;
struct file_system_type;
+struct poll_table_struct;

struct kernfs_open_node;
struct kernfs_iattrs;
@@ -261,6 +262,9 @@ struct kernfs_ops {
ssize_t (*write)(struct kernfs_open_file *of, char *buf, size_t bytes,
loff_t off);

+ __poll_t (*poll)(struct kernfs_open_file *of,
+ struct poll_table_struct *pt);
+
int (*mmap)(struct kernfs_open_file *of, struct vm_area_struct *vma);

#ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -350,6 +354,8 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
const char *new_name, const void *new_ns);
int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr);
+__poll_t kernfs_generic_poll(struct kernfs_open_file *of,
+ struct poll_table_struct *pt);
void kernfs_notify(struct kernfs_node *kn);

const void *kernfs_super_ns(struct super_block *sb);
--
2.20.1.321.g9e740568ce-goog


2019-01-24 21:16:24

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v3 3/5] psi: introduce state_mask to represent stalled psi states

The psi monitoring patches will need to determine the same states as
record_times(). To avoid calculating them twice, maintain a state mask
that can be consulted cheaply. Do this in a separate patch to keep the
churn in the main feature patch at a minimum.
This adds 4-byte state_mask member into psi_group_cpu struct which
results in its first cacheline-aligned part to become 52 bytes long.
Add explicit values to enumeration element counters that affect
psi_group_cpu struct size.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/psi_types.h | 9 ++++++---
kernel/sched/psi.c | 29 +++++++++++++++++++----------
2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 2cf422db5d18..762c6bb16f3c 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -11,7 +11,7 @@ enum psi_task_count {
NR_IOWAIT,
NR_MEMSTALL,
NR_RUNNING,
- NR_PSI_TASK_COUNTS,
+ NR_PSI_TASK_COUNTS = 3,
};

/* Task state bitmasks */
@@ -24,7 +24,7 @@ enum psi_res {
PSI_IO,
PSI_MEM,
PSI_CPU,
- NR_PSI_RESOURCES,
+ NR_PSI_RESOURCES = 3,
};

/*
@@ -41,7 +41,7 @@ enum psi_states {
PSI_CPU_SOME,
/* Only per-CPU, to weigh the CPU in the global average: */
PSI_NONIDLE,
- NR_PSI_STATES,
+ NR_PSI_STATES = 6,
};

struct psi_group_cpu {
@@ -53,6 +53,9 @@ struct psi_group_cpu {
/* States of the tasks belonging to this group */
unsigned int tasks[NR_PSI_TASK_COUNTS];

+ /* Aggregate pressure state derived from the tasks */
+ u32 state_mask;
+
/* Period time sampling buckets for each state of interest (ns) */
u32 times[NR_PSI_STATES];

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index fe24de3fbc93..2262d920295f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -212,17 +212,17 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
{
struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
- unsigned int tasks[NR_PSI_TASK_COUNTS];
u64 now, state_start;
+ enum psi_states s;
unsigned int seq;
- int s;
+ u32 state_mask;

/* Snapshot a coherent view of the CPU state */
do {
seq = read_seqcount_begin(&groupc->seq);
now = cpu_clock(cpu);
memcpy(times, groupc->times, sizeof(groupc->times));
- memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
+ state_mask = groupc->state_mask;
state_start = groupc->state_start;
} while (read_seqcount_retry(&groupc->seq, seq));

@@ -238,7 +238,7 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
* (u32) and our reported pressure close to what's
* actually happening.
*/
- if (test_state(tasks, s))
+ if (state_mask & (1 << s))
times[s] += now - state_start;

delta = times[s] - groupc->times_prev[s];
@@ -406,15 +406,15 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
delta = now - groupc->state_start;
groupc->state_start = now;

- if (test_state(groupc->tasks, PSI_IO_SOME)) {
+ if (groupc->state_mask & (1 << PSI_IO_SOME)) {
groupc->times[PSI_IO_SOME] += delta;
- if (test_state(groupc->tasks, PSI_IO_FULL))
+ if (groupc->state_mask & (1 << PSI_IO_FULL))
groupc->times[PSI_IO_FULL] += delta;
}

- if (test_state(groupc->tasks, PSI_MEM_SOME)) {
+ if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
groupc->times[PSI_MEM_SOME] += delta;
- if (test_state(groupc->tasks, PSI_MEM_FULL))
+ if (groupc->state_mask & (1 << PSI_MEM_FULL))
groupc->times[PSI_MEM_FULL] += delta;
else if (memstall_tick) {
u32 sample;
@@ -435,10 +435,10 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
}
}

- if (test_state(groupc->tasks, PSI_CPU_SOME))
+ if (groupc->state_mask & (1 << PSI_CPU_SOME))
groupc->times[PSI_CPU_SOME] += delta;

- if (test_state(groupc->tasks, PSI_NONIDLE))
+ if (groupc->state_mask & (1 << PSI_NONIDLE))
groupc->times[PSI_NONIDLE] += delta;
}

@@ -447,6 +447,8 @@ static void psi_group_change(struct psi_group *group, int cpu,
{
struct psi_group_cpu *groupc;
unsigned int t, m;
+ enum psi_states s;
+ u32 state_mask = 0;

groupc = per_cpu_ptr(group->pcpu, cpu);

@@ -479,6 +481,13 @@ static void psi_group_change(struct psi_group *group, int cpu,
if (set & (1 << t))
groupc->tasks[t]++;

+ /* Calculate state mask representing active states */
+ for (s = 0; s < NR_PSI_STATES; s++) {
+ if (test_state(groupc->tasks, s))
+ state_mask |= (1 << s);
+ }
+ groupc->state_mask = state_mask;
+
write_seqcount_end(&groupc->seq);

if (!delayed_work_pending(&group->clock_work))
--
2.20.1.321.g9e740568ce-goog


2019-01-24 21:16:29

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v3 4/5] psi: rename psi fields in preparation for psi trigger addition

Renaming psi_group structure member fields used for calculating psi
totals and averages for clear distinction between them and trigger-related
fields that will be added next.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/psi_types.h | 15 ++++++++-------
kernel/sched/psi.c | 26 ++++++++++++++------------
2 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 762c6bb16f3c..47757668bdcb 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -69,20 +69,21 @@ struct psi_group_cpu {
};

struct psi_group {
- /* Protects data updated during an aggregation */
- struct mutex stat_lock;
+ /* Protects data used by the aggregator */
+ struct mutex update_lock;

/* Per-cpu task state & time tracking */
struct psi_group_cpu __percpu *pcpu;

- /* Periodic aggregation state */
- u64 total_prev[NR_PSI_STATES - 1];
- u64 last_update;
- u64 next_update;
struct delayed_work clock_work;

- /* Total stall times and sampled pressure averages */
+ /* Total stall times observed */
u64 total[NR_PSI_STATES - 1];
+
+ /* Running pressure averages */
+ u64 avg_total[NR_PSI_STATES - 1];
+ u64 avg_last_update;
+ u64 avg_next_update;
unsigned long avg[NR_PSI_STATES - 1][3];
};

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 2262d920295f..c366503ba135 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -172,9 +172,9 @@ static void group_init(struct psi_group *group)

for_each_possible_cpu(cpu)
seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
- group->next_update = sched_clock() + psi_period;
+ group->avg_next_update = sched_clock() + psi_period;
INIT_DELAYED_WORK(&group->clock_work, psi_update_work);
- mutex_init(&group->stat_lock);
+ mutex_init(&group->update_lock);
}

void __init psi_init(void)
@@ -277,7 +277,7 @@ static bool update_stats(struct psi_group *group)
int cpu;
int s;

- mutex_lock(&group->stat_lock);
+ mutex_lock(&group->update_lock);

/*
* Collect the per-cpu time buckets and average them into a
@@ -318,7 +318,7 @@ static bool update_stats(struct psi_group *group)

/* avgX= */
now = sched_clock();
- expires = group->next_update;
+ expires = group->avg_next_update;
if (now < expires)
goto out;
if (now - expires > psi_period)
@@ -331,14 +331,14 @@ static bool update_stats(struct psi_group *group)
* But the deltas we sample out of the per-cpu buckets above
* are based on the actual time elapsing between clock ticks.
*/
- group->next_update = expires + ((1 + missed_periods) * psi_period);
- period = now - (group->last_update + (missed_periods * psi_period));
- group->last_update = now;
+ group->avg_next_update = expires + ((1 + missed_periods) * psi_period);
+ period = now - (group->avg_last_update + (missed_periods * psi_period));
+ group->avg_last_update = now;

for (s = 0; s < NR_PSI_STATES - 1; s++) {
u32 sample;

- sample = group->total[s] - group->total_prev[s];
+ sample = group->total[s] - group->avg_total[s];
/*
* Due to the lockless sampling of the time buckets,
* recorded time deltas can slip into the next period,
@@ -358,11 +358,11 @@ static bool update_stats(struct psi_group *group)
*/
if (sample > period)
sample = period;
- group->total_prev[s] += sample;
+ group->avg_total[s] += sample;
calc_avgs(group->avg[s], missed_periods, sample, period);
}
out:
- mutex_unlock(&group->stat_lock);
+ mutex_unlock(&group->update_lock);
return nonidle_total;
}

@@ -390,8 +390,10 @@ static void psi_update_work(struct work_struct *work)
u64 now;

now = sched_clock();
- if (group->next_update > now)
- delay = nsecs_to_jiffies(group->next_update - now) + 1;
+ if (group->avg_next_update > now) {
+ delay = nsecs_to_jiffies(
+ group->avg_next_update - now) + 1;
+ }
schedule_delayed_work(dwork, delay);
}
}
--
2.20.1.321.g9e740568ce-goog


2019-01-24 21:17:01

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v3 5/5] psi: introduce psi monitor

Psi monitor aims to provide a low-latency short-term pressure
detection mechanism configurable by users. It allows users to
monitor psi metrics growth and trigger events whenever a metric
raises above user-defined threshold within user-defined time window.

Time window and threshold are both expressed in usecs. Multiple psi
resources with different thresholds and window sizes can be monitored
concurrently.

Psi monitors activate when system enters stall state for the monitored
psi metric and deactivate upon exit from the stall state. While system
is in the stall state psi signal growth is monitored at a rate of 10 times
per tracking window. Min window size is 500ms, therefore the min monitoring
interval is 50ms. Max window size is 10s with monitoring interval of 1s.

When activated psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when psi
signal is bouncing.

Notifications to the users are rate-limited to one per tracking window.

Signed-off-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
Documentation/accounting/psi.txt | 104 ++++++
include/linux/psi.h | 10 +
include/linux/psi_types.h | 59 ++++
kernel/cgroup/cgroup.c | 107 +++++-
kernel/sched/psi.c | 562 +++++++++++++++++++++++++++++--
5 files changed, 808 insertions(+), 34 deletions(-)

diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
index b8ca28b60215..6b21c72aa87c 100644
--- a/Documentation/accounting/psi.txt
+++ b/Documentation/accounting/psi.txt
@@ -63,6 +63,107 @@ tracked and exported as well, to allow detection of latency spikes
which wouldn't necessarily make a dent in the time averages, or to
average trends over custom time frames.

+Monitoring for pressure thresholds
+==================================
+
+Users can register triggers and use poll() to be woken up when resource
+pressure exceeds certain thresholds.
+
+A trigger describes the maximum cumulative stall time over a specific
+time window, e.g. 100ms of total stall time within any 500ms window to
+generate a wakeup event.
+
+To register a trigger user has to open psi interface file under
+/proc/pressure/ representing the resource to be monitored and write the
+desired threshold and time window. The open file descriptor should be
+used to wait for trigger events using select(), poll() or epoll().
+The following format is used:
+
+<some|full> <stall amount in us> <time window in us>
+
+For example writing "some 150000 1000000" into /proc/pressure/memory
+would add 150ms threshold for partial memory stall measured within
+1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
+would add 50ms threshold for full io stall measured within 1sec time window.
+
+Triggers can be set on more than one psi metric and more than one trigger
+for the same psi metric can be specified. However for each trigger a separate
+file descriptor is required to be able to poll it separately from others,
+therefore for each trigger a separate open() syscall should be made even
+when opening the same psi interface file.
+
+Monitors activate only when system enters stall state for the monitored
+psi metric and deactivates upon exit from the stall state. While system is
+in the stall state psi signal growth is monitored at a rate of 10 times per
+tracking window.
+
+The kernel accepts window sizes ranging from 500ms to 10s, therefore min
+monitoring update interval is 50ms and max is 1s.
+
+When activated, psi monitor stays active for at least the duration of one
+tracking window to avoid repeated activations/deactivations when system is
+bouncing in and out of the stall state.
+
+Notifications to the userspace are rate-limited to one per tracking window.
+
+The trigger will de-register when the file descriptor used to define the
+trigger is closed.
+
+Userspace monitor usage example
+===============================
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <poll.h>
+#include <string.h>
+#include <unistd.h>
+
+/*
+ * Monitor memory partial stall with 1s tracking window size
+ * and 150ms threshold.
+ */
+int main() {
+ const char trig[] = "some 150000 1000000";
+ struct pollfd fds;
+ int n;
+
+ fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
+ if (fds.fd < 0) {
+ printf("/proc/pressure/memory open error: %s\n",
+ strerror(errno));
+ return 1;
+ }
+ fds.events = POLLPRI;
+
+ if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
+ printf("/proc/pressure/memory write error: %s\n",
+ strerror(errno));
+ return 1;
+ }
+
+ printf("waiting for events...\n");
+ while (1) {
+ n = poll(&fds, 1, -1);
+ if (n < 0) {
+ printf("poll error: %s\n", strerror(errno));
+ return 1;
+ }
+ if (fds.revents & POLLERR) {
+ printf("got POLLERR, event source is gone\n");
+ return 0;
+ }
+ if (fds.revents & POLLPRI) {
+ printf("event triggered!\n");
+ } else {
+ printf("unknown event received: 0x%x\n", fds.revents);
+ return 1;
+ }
+ }
+
+ return 0;
+}
+
Cgroup2 interface
=================

@@ -71,3 +172,6 @@ mounted, pressure stall information is also tracked for tasks grouped
into cgroups. Each subdirectory in the cgroupfs mountpoint contains
cpu.pressure, memory.pressure, and io.pressure files; the format is
the same as the /proc/pressure/ files.
+
+Per-cgroup psi monitors can be specified and used the same way as
+system-wide ones.
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 7006008d5b72..7490f8ef83ac 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -4,6 +4,7 @@
#include <linux/jump_label.h>
#include <linux/psi_types.h>
#include <linux/sched.h>
+#include <linux/poll.h>

struct seq_file;
struct css_set;
@@ -26,6 +27,15 @@ int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
int psi_cgroup_alloc(struct cgroup *cgrp);
void psi_cgroup_free(struct cgroup *cgrp);
void cgroup_move_task(struct task_struct *p, struct css_set *to);
+
+ssize_t psi_trigger_parse(char *buf, size_t nbytes, enum psi_res res,
+ enum psi_states *state, u32 *threshold_us, u32 *win_sz_us);
+struct psi_trigger *psi_trigger_create(struct psi_group *group,
+ enum psi_states state, u32 threshold_us, u32 win_sz_us);
+void psi_trigger_destroy(struct psi_trigger *t);
+
+__poll_t psi_trigger_poll(struct psi_trigger *t, struct file *file,
+ poll_table *wait);
#endif

#else /* CONFIG_PSI */
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 47757668bdcb..41654473cb69 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -3,6 +3,7 @@

#include <linux/seqlock.h>
#include <linux/types.h>
+#include <linux/wait.h>

#ifdef CONFIG_PSI

@@ -68,6 +69,50 @@ struct psi_group_cpu {
u32 times_prev[NR_PSI_STATES] ____cacheline_aligned_in_smp;
};

+/* PSI growth tracking window */
+struct psi_window {
+ /* Window size in ns */
+ u64 size;
+
+ /* Start time of the current window in ns */
+ u64 start_time;
+
+ /* Value at the start of the window */
+ u64 start_value;
+
+ /* Value growth per previous window(s) */
+ u64 per_win_growth;
+};
+
+struct psi_trigger {
+ /* PSI state being monitored by the trigger */
+ enum psi_states state;
+
+ /* User-spacified threshold in ns */
+ u64 threshold;
+
+ /* List node inside triggers list */
+ struct list_head node;
+
+ /* Backpointer needed during trigger destruction */
+ struct psi_group *group;
+
+ /* Wait queue for polling */
+ wait_queue_head_t event_wait;
+
+ /* Pending event flag */
+ int event;
+
+ /* Tracking window */
+ struct psi_window win;
+
+ /*
+ * Time last event was generated. Used for rate-limiting
+ * events to one per window
+ */
+ u64 last_event_time;
+};
+
struct psi_group {
/* Protects data used by the aggregator */
struct mutex update_lock;
@@ -75,6 +120,8 @@ struct psi_group {
/* Per-cpu task state & time tracking */
struct psi_group_cpu __percpu *pcpu;

+ /* Periodic work control */
+ atomic_t polling;
struct delayed_work clock_work;

/* Total stall times observed */
@@ -85,6 +132,18 @@ struct psi_group {
u64 avg_last_update;
u64 avg_next_update;
unsigned long avg[NR_PSI_STATES - 1][3];
+
+ /* Configured polling triggers */
+ struct list_head triggers;
+ u32 nr_triggers[NR_PSI_STATES - 1];
+ u32 trigger_mask;
+ u64 trigger_min_period;
+
+ /* Polling state */
+ /* Total stall times at the start of monitor activation */
+ u64 polling_total[NR_PSI_STATES - 1];
+ u64 polling_next_update;
+ u64 polling_until;
};

#else /* CONFIG_PSI */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index e8cd12c6a553..de3ac22a5e23 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3464,7 +3464,101 @@ static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
{
return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU);
}
-#endif
+
+static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, enum psi_res res)
+{
+ enum psi_states state;
+ struct psi_trigger *old;
+ struct psi_trigger *new;
+ struct cgroup *cgrp;
+ u32 threshold_us;
+ u32 win_sz_us;
+ ssize_t ret;
+
+ cgrp = cgroup_kn_lock_live(of->kn, false);
+ if (!cgrp)
+ return -ENODEV;
+
+ cgroup_get(cgrp);
+ cgroup_kn_unlock(of->kn);
+
+ ret = psi_trigger_parse(buf, nbytes, res,
+ &state, &threshold_us, &win_sz_us);
+ if (ret) {
+ cgroup_put(cgrp);
+ return ret;
+ }
+
+ new = psi_trigger_create(&cgrp->psi,
+ state, threshold_us, win_sz_us);
+ if (IS_ERR(new)) {
+ cgroup_put(cgrp);
+ return PTR_ERR(new);
+ }
+
+ old = of->priv;
+ rcu_assign_pointer(of->priv, new);
+ if (old) {
+ synchronize_rcu();
+ psi_trigger_destroy(old);
+ }
+
+ cgroup_put(cgrp);
+
+ return nbytes;
+}
+
+static ssize_t cgroup_io_pressure_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ return cgroup_pressure_write(of, buf, nbytes, PSI_IO);
+}
+
+static ssize_t cgroup_memory_pressure_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ return cgroup_pressure_write(of, buf, nbytes, PSI_MEM);
+}
+
+static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
+}
+
+static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
+ poll_table *pt)
+{
+ struct psi_trigger *t;
+ __poll_t ret;
+
+ rcu_read_lock();
+ t = rcu_dereference(of->priv);
+ if (t)
+ ret = psi_trigger_poll(t, of->file, pt);
+ else
+ ret = DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI;
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static void cgroup_pressure_release(struct kernfs_open_file *of)
+{
+ struct psi_trigger *t = of->priv;
+
+ if (!t)
+ return;
+
+ rcu_assign_pointer(of->priv, NULL);
+ synchronize_rcu();
+ psi_trigger_destroy(t);
+}
+#endif /* CONFIG_PSI */

static int cgroup_file_open(struct kernfs_open_file *of)
{
@@ -4619,18 +4713,27 @@ static struct cftype cgroup_base_files[] = {
.name = "io.pressure",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cgroup_io_pressure_show,
+ .write = cgroup_io_pressure_write,
+ .poll = cgroup_pressure_poll,
+ .release = cgroup_pressure_release,
},
{
.name = "memory.pressure",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cgroup_memory_pressure_show,
+ .write = cgroup_memory_pressure_write,
+ .poll = cgroup_pressure_poll,
+ .release = cgroup_pressure_release,
},
{
.name = "cpu.pressure",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cgroup_cpu_pressure_show,
+ .write = cgroup_cpu_pressure_write,
+ .poll = cgroup_pressure_poll,
+ .release = cgroup_pressure_release,
},
-#endif
+#endif /* CONFIG_PSI */
{ } /* terminate */
};

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index c366503ba135..fefb98f19a80 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -4,6 +4,9 @@
* Copyright (c) 2018 Facebook, Inc.
* Author: Johannes Weiner <[email protected]>
*
+ * Polling support by Suren Baghdasaryan <[email protected]>
+ * Copyright (c) 2018 Google, Inc.
+ *
* When CPU, memory and IO are contended, tasks experience delays that
* reduce throughput and introduce latencies into the workload. Memory
* and IO contention, in addition, can cause a full loss of forward
@@ -126,11 +129,16 @@

#include <linux/sched/loadavg.h>
#include <linux/seq_file.h>
+#include <linux/eventfd.h>
#include <linux/proc_fs.h>
#include <linux/seqlock.h>
+#include <linux/uaccess.h>
#include <linux/cgroup.h>
#include <linux/module.h>
#include <linux/sched.h>
+#include <linux/ctype.h>
+#include <linux/file.h>
+#include <linux/poll.h>
#include <linux/psi.h>
#include "sched.h"

@@ -150,11 +158,16 @@ static int __init setup_psi(char *str)
__setup("psi=", setup_psi);

/* Running averages - we need to be higher-res than loadavg */
-#define PSI_FREQ (2*HZ+1) /* 2 sec intervals */
+#define PSI_FREQ (2*HZ+1UL) /* 2 sec intervals */
#define EXP_10s 1677 /* 1/exp(2s/10s) as fixed-point */
#define EXP_60s 1981 /* 1/exp(2s/60s) */
#define EXP_300s 2034 /* 1/exp(2s/300s) */

+/* PSI trigger definitions */
+#define PSI_TRIG_MIN_WIN_US 500000 /* Min window size is 500ms */
+#define PSI_TRIG_MAX_WIN_US 10000000 /* Max window size is 10s */
+#define PSI_TRIG_UPDATES_PER_WIN 10 /* 10 updates per window */
+
/* Sampling frequency in nanoseconds */
static u64 psi_period __read_mostly;

@@ -173,8 +186,17 @@ static void group_init(struct psi_group *group)
for_each_possible_cpu(cpu)
seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
group->avg_next_update = sched_clock() + psi_period;
+ atomic_set(&group->polling, 0);
INIT_DELAYED_WORK(&group->clock_work, psi_update_work);
mutex_init(&group->update_lock);
+ /* Init trigger-related members */
+ INIT_LIST_HEAD(&group->triggers);
+ memset(group->nr_triggers, 0, sizeof(group->nr_triggers));
+ group->trigger_mask = 0;
+ group->trigger_min_period = U32_MAX;
+ memset(group->polling_total, 0, sizeof(group->polling_total));
+ group->polling_next_update = ULLONG_MAX;
+ group->polling_until = 0;
}

void __init psi_init(void)
@@ -209,10 +231,11 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
}
}

-static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
+static u32 get_recent_times(struct psi_group *group, int cpu, u32 *times)
{
struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
u64 now, state_start;
+ u32 change_mask = 0;
enum psi_states s;
unsigned int seq;
u32 state_mask;
@@ -245,7 +268,11 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
groupc->times_prev[s] = times[s];

times[s] = delta;
+ if (delta)
+ change_mask |= (1 << s);
}
+
+ return change_mask;
}

static void calc_avgs(unsigned long avg[3], int missed_periods,
@@ -268,17 +295,14 @@ static void calc_avgs(unsigned long avg[3], int missed_periods,
avg[2] = calc_load(avg[2], EXP_300s, pct);
}

-static bool update_stats(struct psi_group *group)
+static u32 collect_percpu_times(struct psi_group *group)
{
u64 deltas[NR_PSI_STATES - 1] = { 0, };
- unsigned long missed_periods = 0;
unsigned long nonidle_total = 0;
- u64 now, expires, period;
+ u32 change_mask = 0;
int cpu;
int s;

- mutex_lock(&group->update_lock);
-
/*
* Collect the per-cpu time buckets and average them into a
* single time sample that is normalized to wallclock time.
@@ -291,7 +315,7 @@ static bool update_stats(struct psi_group *group)
u32 times[NR_PSI_STATES];
u32 nonidle;

- get_recent_times(group, cpu, times);
+ change_mask |= get_recent_times(group, cpu, times);

nonidle = nsecs_to_jiffies(times[PSI_NONIDLE]);
nonidle_total += nonidle;
@@ -316,11 +340,18 @@ static bool update_stats(struct psi_group *group)
for (s = 0; s < NR_PSI_STATES - 1; s++)
group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL));

+ return change_mask;
+}
+
+static u64 calculate_averages(struct psi_group *group, u64 now)
+{
+ unsigned long missed_periods = 0;
+ u64 expires, period;
+ u64 avg_next_update;
+ int s;
+
/* avgX= */
- now = sched_clock();
expires = group->avg_next_update;
- if (now < expires)
- goto out;
if (now - expires > psi_period)
missed_periods = div_u64(now - expires, psi_period);

@@ -331,7 +362,7 @@ static bool update_stats(struct psi_group *group)
* But the deltas we sample out of the per-cpu buckets above
* are based on the actual time elapsing between clock ticks.
*/
- group->avg_next_update = expires + ((1 + missed_periods) * psi_period);
+ avg_next_update = expires + ((1 + missed_periods) * psi_period);
period = now - (group->avg_last_update + (missed_periods * psi_period));
group->avg_last_update = now;

@@ -361,20 +392,237 @@ static bool update_stats(struct psi_group *group)
group->avg_total[s] += sample;
calc_avgs(group->avg[s], missed_periods, sample, period);
}
-out:
- mutex_unlock(&group->update_lock);
- return nonidle_total;
+
+ return avg_next_update;
+}
+
+/* Trigger tracking window manupulations */
+static void window_init(struct psi_window *win, u64 now, u64 value)
+{
+ win->start_value = value;
+ win->start_time = now;
+ win->per_win_growth = 0;
+}
+
+/*
+ * PSI growth tracking window update and growth calculation routine.
+ * This approximates a sliding tracking window by interpolating
+ * partially elapsed windows using historical growth data from the
+ * previous intervals. This minimizes memory requirements (by not storing
+ * all the intermediate values in the previous window) and simplifies
+ * the calculations. It works well because PSI signal changes only in
+ * positive direction and over relatively small window sizes the growth
+ * is close to linear.
+ */
+static u64 window_update(struct psi_window *win, u64 now, u64 value)
+{
+ u64 interval;
+ u64 growth;
+
+ interval = now - win->start_time;
+ growth = value - win->start_value;
+ /*
+ * After each tracking window passes win->start_value and
+ * win->start_time get reset and win->per_win_growth stores
+ * the average per-window growth of the previous window.
+ * win->per_win_growth is then used to interpolate additional
+ * growth from the previous window assuming it was linear.
+ */
+ if (interval > win->size) {
+ win->per_win_growth = growth;
+ win->start_value = value;
+ win->start_time = now;
+ } else {
+ u32 unelapsed;
+
+ unelapsed = win->size - interval;
+ growth += div_u64(win->per_win_growth * unelapsed, win->size);
+ }
+
+ return growth;
+}
+
+static void init_triggers(struct psi_group *group, u64 now)
+{
+ struct psi_trigger *t;
+
+ list_for_each_entry(t, &group->triggers, node)
+ window_init(&t->win, now, group->total[t->state]);
+ memcpy(group->polling_total, group->total,
+ sizeof(group->polling_total));
+ group->polling_next_update = now + group->trigger_min_period;
+}
+
+static u64 poll_triggers(struct psi_group *group, u64 now)
+{
+ struct psi_trigger *t;
+ bool new_stall = false;
+
+ /*
+ * On subsequent updates, calculate growth deltas and let
+ * watchers know when their specified thresholds are exceeded.
+ */
+ list_for_each_entry(t, &group->triggers, node) {
+ u64 growth;
+
+ /* Check for stall activity */
+ if (group->polling_total[t->state] == group->total[t->state])
+ continue;
+
+ /*
+ * Multiple triggers might be looking at the same state,
+ * remember to update group->polling_total[] once we've
+ * been through all of them. Also remember to extend the
+ * polling time if we see new stall activity.
+ */
+ new_stall = true;
+
+ /* Calculate growth since last update */
+ growth = window_update(&t->win, now, group->total[t->state]);
+ if (growth < t->threshold)
+ continue;
+
+ /* Limit event signaling to once per window */
+ if (now < t->last_event_time + t->win.size)
+ continue;
+
+ /* Generate an event */
+ if (cmpxchg(&t->event, 0, 1) == 0)
+ wake_up_interruptible(&t->event_wait);
+ t->last_event_time = now;
+ }
+
+ if (new_stall) {
+ memcpy(group->polling_total, group->total,
+ sizeof(group->polling_total));
+ }
+
+ return now + group->trigger_min_period;
}

+/*
+ * psi_update_work represents slowpath accounting part while psi_group_change
+ * represents hotpath part. There are two potential races between them:
+ * 1. Changes to group->polling when slowpath checks for new stall, then hotpath
+ * records new stall and then slowpath resets group->polling flag. This leads
+ * to the exit from the polling mode while monitored state is still changing.
+ * 2. Slowpath overwriting an immediate update scheduled from the hotpath with
+ * a regular update further in the future and missing the immediate update.
+ * Both races are handled with a retry cycle in the slowpath:
+ *
+ * HOTPATH: | SLOWPATH:
+ * |
+ * A) times[cpu] += delta | E) delta = times[*]
+ * B) start_poll = (delta[poll_mask] &&| if delta[poll_mask]:
+ * cmpxchg(g->polling, 0, 1) == 0)| F) polling_until = now + grace_period
+ * if start_poll: | if now > polling_until:
+ * C) mod_delayed_work(1) | if g->polling:
+ * else if !delayed_work_pending():| G) g->polling = polling = 0
+ * D) schedule_delayed_work(PSI_FREQ)| smp_mb
+ * | H) goto SLOWPATH
+ * | else:
+ * | if !g->polling:
+ * | I) g->polling = polling = 1
+ * | J) if delta && first_pass:
+ * | next_avg = calculate_averages()
+ * | if polling:
+ * | next_poll = poll_triggers()
+ * | if (delta && first_pass) || polling:
+ * | K) mod_delayed_work(
+ * | min(next_avg, next_poll))
+ * | if !polling:
+ * | first_pass = false
+ * | L) goto SLOWPATH
+ *
+ * Race #1 is represented by (EABGD) sequence in which case slowpath deactivates
+ * polling mode because it misses new monitored stall and hotpath doesn't
+ * activate it because at (B) g->polling is not yet reset by slowpath in (G).
+ * This race is handled by the (H) retry, which in the race described above
+ * results in the new sequence of (EABGDHEIK) that reactivates polling mode.
+ *
+ * Race #2 is represented by polling==false && (JABCK) sequence which overwrites
+ * immediate update scheduled at (C) with a later (next_avg) update scheduled at
+ * (K). This race is handled by the (L) retry which results in the new sequence
+ * of polling==false && (JABCKLEIK) that reactivates polling mode and
+ * reschedules the next polling update (next_poll).
+ *
+ * Note that retries can't result in an infinite loop because retry #1 happens
+ * only during polling reactivation and retry #2 happens only on the first pass.
+ * Constant reactivations are impossible because polling will stay active for at
+ * least grace_period. Worst case scenario involves two retries (HEJKLE)
+ */
static void psi_update_work(struct work_struct *work)
{
struct delayed_work *dwork;
struct psi_group *group;
+ bool first_pass = true;
+ u64 next_update;
+ u32 change_mask;
+ int polling;
bool nonidle;
+ u64 now;

dwork = to_delayed_work(work);
group = container_of(dwork, struct psi_group, clock_work);

+ now = sched_clock();
+
+ mutex_lock(&group->update_lock);
+
+retry:
+ change_mask = collect_percpu_times(group);
+ polling = atomic_read(&group->polling);
+
+ if (change_mask & group->trigger_mask) {
+ /* Initialize trigger windows when entering polling mode */
+ if (now > group->polling_until)
+ init_triggers(group, now);
+
+ /*
+ * Keep the monitor active for at least the duration of the
+ * minimum tracking window as long as monitor states are
+ * changing. This prevents frequent changes to polling flag
+ * when system bounces in and out of stall states.
+ */
+ group->polling_until = now +
+ group->trigger_min_period * PSI_TRIG_UPDATES_PER_WIN;
+ }
+
+ /* Handle polling flag transitions */
+ if (now > group->polling_until) {
+ if (polling) {
+ group->polling_next_update = ULLONG_MAX;
+ polling = 0;
+ atomic_set(&group->polling, polling);
+ /*
+ * Memory barrier is needed to order group->polling
+ * write before times[] read in collect_percpu_times()
+ */
+ smp_mb__after_atomic();
+ /*
+ * Check if we missed stall recorded by hotpath while
+ * polling flag was set to 1 causing hotpath to skip
+ * entering polling mode
+ */
+ goto retry;
+ }
+ } else {
+ if (!polling) {
+ /*
+ * This can happen as a fixup in the retry cycle after
+ * new stall is discovered
+ */
+ polling = 1;
+ atomic_set(&group->polling, polling);
+ }
+ }
+ /*
+ * At this point group->polling race with hotpath is resolved and
+ * we rely on local polling flag ignoring possible further changes
+ * to group->polling
+ */
+
+ nonidle = (change_mask & (1 << PSI_NONIDLE));
/*
* If there is task activity, periodically fold the per-cpu
* times and feed samples into the running averages. If things
@@ -382,20 +630,32 @@ static void psi_update_work(struct work_struct *work)
* Once restarted, we'll catch up the running averages in one
* go - see calc_avgs() and missed_periods.
*/
+ if (nonidle && first_pass) {
+ if (now >= group->avg_next_update)
+ group->avg_next_update = calculate_averages(group, now);

- nonidle = update_stats(group);
-
- if (nonidle) {
- unsigned long delay = 0;
- u64 now;
-
- now = sched_clock();
- if (group->avg_next_update > now) {
- delay = nsecs_to_jiffies(
- group->avg_next_update - now) + 1;
+ if (now >= group->polling_next_update)
+ group->polling_next_update = poll_triggers(group, now);
+ }
+ if ((nonidle && first_pass) || polling) {
+ /* Calculate closest update time */
+ next_update = min(group->polling_next_update,
+ group->avg_next_update);
+ mod_delayed_work(system_wq, dwork, nsecs_to_jiffies(
+ next_update - now) + 1);
+ if (!polling) {
+ /*
+ * We might have overwritten an immediate update
+ * scheduled from the hotpath with a longer regular
+ * update (group->avg_next_update). Execute second pass
+ * retry to discover that and resume polling.
+ */
+ first_pass = false;
+ goto retry;
}
- schedule_delayed_work(dwork, delay);
}
+
+ mutex_unlock(&group->update_lock);
}

static void record_times(struct psi_group_cpu *groupc, int cpu,
@@ -492,8 +752,25 @@ static void psi_group_change(struct psi_group *group, int cpu,

write_seqcount_end(&groupc->seq);

- if (!delayed_work_pending(&group->clock_work))
- schedule_delayed_work(&group->clock_work, PSI_FREQ);
+ /*
+ * Polling flag resets to 0 at the max rate of once per update window
+ * (at least 500ms interval). smp_wmb is required after group->polling
+ * 0-to-1 transition to order groupc->times and group->polling writes
+ * because stall detection logic in the slowpath relies on groupc->times
+ * changing before group->polling. Explicit smp_wmb is missing because
+ * cmpxchg() implies smp_mb.
+ */
+ if ((state_mask & group->trigger_mask) &&
+ atomic_cmpxchg(&group->polling, 0, 1) == 0) {
+ /*
+ * Start polling immediately even if the work is already
+ * scheduled
+ */
+ mod_delayed_work(system_wq, &group->clock_work, 1);
+ } else {
+ if (!delayed_work_pending(&group->clock_work))
+ schedule_delayed_work(&group->clock_work, PSI_FREQ);
+ }
}

static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
@@ -640,6 +917,8 @@ void psi_cgroup_free(struct cgroup *cgroup)

cancel_delayed_work_sync(&cgroup->psi.clock_work);
free_percpu(cgroup->psi.pcpu);
+ /* All triggers must be removed by now by psi_trigger_destroy */
+ WARN_ONCE(cgroup->psi.trigger_mask, "psi: trigger leak\n");
}

/**
@@ -699,7 +978,11 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
if (static_branch_likely(&psi_disabled))
return -EOPNOTSUPP;

- update_stats(group);
+ /* Update averages before reporting them */
+ mutex_lock(&group->update_lock);
+ collect_percpu_times(group);
+ calculate_averages(group, sched_clock());
+ mutex_unlock(&group->update_lock);

for (full = 0; full < 2 - (res == PSI_CPU); full++) {
unsigned long avg[3];
@@ -751,25 +1034,240 @@ static int psi_cpu_open(struct inode *inode, struct file *file)
return single_open(file, psi_cpu_show, NULL);
}

+ssize_t psi_trigger_parse(char *buf, size_t nbytes, enum psi_res res,
+ enum psi_states *pstate, u32 *pthreshold_us, u32 *pwin_sz_us)
+{
+ enum psi_states state;
+ u32 threshold_us;
+ u32 win_sz_us;
+
+ if (sscanf(buf, "some %u %u", &threshold_us, &win_sz_us) == 2)
+ state = PSI_IO_SOME + res * 2;
+ else if (sscanf(buf, "full %u %u", &threshold_us, &win_sz_us) == 2)
+ state = PSI_IO_FULL + res * 2;
+ else
+ return -EINVAL;
+
+ if (state >= PSI_NONIDLE)
+ return -EINVAL;
+
+ if (win_sz_us < PSI_TRIG_MIN_WIN_US ||
+ win_sz_us > PSI_TRIG_MAX_WIN_US)
+ return -EINVAL;
+
+ /* Check threshold */
+ if (threshold_us == 0 || threshold_us > win_sz_us)
+ return -EINVAL;
+
+ *pstate = state;
+ *pthreshold_us = threshold_us;
+ *pwin_sz_us = win_sz_us;
+ return 0;
+}
+
+struct psi_trigger *psi_trigger_create(struct psi_group *group,
+ enum psi_states state, u32 threshold_us, u32 win_sz_us)
+{
+ struct psi_trigger *t;
+
+ if (static_branch_likely(&psi_disabled))
+ return ERR_PTR(-EOPNOTSUPP);
+
+ t = kzalloc(sizeof(*t), GFP_KERNEL);
+ if (!t)
+ return ERR_PTR(-ENOMEM);
+
+ t->group = group;
+ t->state = state;
+ t->threshold = threshold_us * NSEC_PER_USEC;
+ t->win.size = win_sz_us * NSEC_PER_USEC;
+ t->event = 0;
+ init_waitqueue_head(&t->event_wait);
+
+ mutex_lock(&group->update_lock);
+
+ list_add(&t->node, &group->triggers);
+ group->trigger_min_period = min(group->trigger_min_period,
+ div_u64(t->win.size, PSI_TRIG_UPDATES_PER_WIN));
+ group->nr_triggers[t->state]++;
+ group->trigger_mask |= (1 << t->state);
+
+ mutex_unlock(&group->update_lock);
+
+ return t;
+}
+
+void psi_trigger_destroy(struct psi_trigger *t)
+{
+ struct psi_group *group = t->group;
+
+ if (static_branch_likely(&psi_disabled))
+ return;
+
+ mutex_lock(&group->update_lock);
+ if (!list_empty(&t->node)) {
+ struct psi_trigger *tmp;
+ u64 period = ULLONG_MAX;
+
+ list_del_init(&t->node);
+ group->nr_triggers[t->state]--;
+ if (!group->nr_triggers[t->state])
+ group->trigger_mask &= ~(1 << t->state);
+ /* reset min update period for the remaining triggers */
+ list_for_each_entry(tmp, &group->triggers, node) {
+ period = min(period, div_u64(tmp->win.size,
+ PSI_TRIG_UPDATES_PER_WIN));
+ }
+ group->trigger_min_period = period;
+ /*
+ * Wakeup waiters to stop polling.
+ * Can happen if cgroup is deleted from under
+ * a polling process.
+ */
+ wake_up_interruptible(&t->event_wait);
+ kfree(t);
+ }
+ mutex_unlock(&group->update_lock);
+}
+
+__poll_t psi_trigger_poll(struct psi_trigger *t,
+ struct file *file, poll_table *wait)
+{
+ if (static_branch_likely(&psi_disabled))
+ return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI;
+
+ poll_wait(file, &t->event_wait, wait);
+
+ if (cmpxchg(&t->event, 1, 0) == 1)
+ return DEFAULT_POLLMASK | EPOLLPRI;
+
+ /* Wait */
+ return DEFAULT_POLLMASK;
+}
+
+static ssize_t psi_write(struct file *file, const char __user *user_buf,
+ size_t nbytes, enum psi_res res)
+{
+ char buf[32];
+ size_t buf_size;
+ struct seq_file *seq;
+ struct psi_trigger *old;
+ struct psi_trigger *new;
+ enum psi_states state;
+ u32 threshold_us;
+ u32 win_sz_us;
+ ssize_t ret;
+
+ if (static_branch_likely(&psi_disabled))
+ return -EOPNOTSUPP;
+
+ buf_size = min(nbytes, (sizeof(buf) - 1));
+ if (copy_from_user(buf, user_buf, buf_size))
+ return -EFAULT;
+
+ buf[buf_size - 1] = '\0';
+
+ ret = psi_trigger_parse(buf, nbytes, res,
+ &state, &threshold_us, &win_sz_us);
+ if (ret < 0)
+ return ret;
+
+ new = psi_trigger_create(&psi_system,
+ state, threshold_us, win_sz_us);
+ if (IS_ERR(new))
+ return PTR_ERR(new);
+
+ seq = file->private_data;
+ /* Take seq->lock to protect seq->private from concurrent writes */
+ mutex_lock(&seq->lock);
+ old = seq->private;
+ rcu_assign_pointer(seq->private, new);
+ mutex_unlock(&seq->lock);
+
+ if (old) {
+ synchronize_rcu();
+ psi_trigger_destroy(old);
+ }
+
+ return nbytes;
+}
+
+static ssize_t psi_io_write(struct file *file,
+ const char __user *user_buf, size_t nbytes, loff_t *ppos)
+{
+ return psi_write(file, user_buf, nbytes, PSI_IO);
+}
+
+static ssize_t psi_memory_write(struct file *file,
+ const char __user *user_buf, size_t nbytes, loff_t *ppos)
+{
+ return psi_write(file, user_buf, nbytes, PSI_MEM);
+}
+
+static ssize_t psi_cpu_write(struct file *file,
+ const char __user *user_buf, size_t nbytes, loff_t *ppos)
+{
+ return psi_write(file, user_buf, nbytes, PSI_CPU);
+}
+
+static __poll_t psi_fop_poll(struct file *file, poll_table *wait)
+{
+ struct seq_file *seq = file->private_data;
+ struct psi_trigger *t;
+ __poll_t ret;
+
+ rcu_read_lock();
+ t = rcu_dereference(seq->private);
+ if (t)
+ ret = psi_trigger_poll(t, file, wait);
+ else
+ ret = DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI;
+ rcu_read_unlock();
+
+ return ret;
+
+}
+
+static int psi_fop_release(struct inode *inode, struct file *file)
+{
+ struct seq_file *seq = file->private_data;
+ struct psi_trigger *t = seq->private;
+
+ if (static_branch_likely(&psi_disabled) || !t)
+ goto out;
+
+ rcu_assign_pointer(seq->private, NULL);
+ synchronize_rcu();
+ psi_trigger_destroy(t);
+out:
+ return single_release(inode, file);
+}
+
static const struct file_operations psi_io_fops = {
.open = psi_io_open,
.read = seq_read,
.llseek = seq_lseek,
- .release = single_release,
+ .write = psi_io_write,
+ .poll = psi_fop_poll,
+ .release = psi_fop_release,
};

static const struct file_operations psi_memory_fops = {
.open = psi_memory_open,
.read = seq_read,
.llseek = seq_lseek,
- .release = single_release,
+ .write = psi_memory_write,
+ .poll = psi_fop_poll,
+ .release = psi_fop_release,
};

static const struct file_operations psi_cpu_fops = {
.open = psi_cpu_open,
.read = seq_read,
.llseek = seq_lseek,
- .release = single_release,
+ .write = psi_cpu_write,
+ .poll = psi_fop_poll,
+ .release = psi_fop_release,
};

static int __init psi_proc_init(void)
--
2.20.1.321.g9e740568ce-goog


2019-01-24 21:17:17

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v3 2/5] kernel: cgroup: add poll file operation

From: Johannes Weiner <[email protected]>

Cgroup has a standardized poll/notification mechanism for waking all
pollers on all fds when a filesystem node changes. To allow polling
for custom events, add a .poll callback that can override the default.

This is in preparation for pollable cgroup pressure files which have
per-fd trigger configurations.

Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/cgroup-defs.h | 4 ++++
kernel/cgroup/cgroup.c | 12 ++++++++++++
2 files changed, 16 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 8fcbae1b8db0..aad3babef007 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -32,6 +32,7 @@ struct kernfs_node;
struct kernfs_ops;
struct kernfs_open_file;
struct seq_file;
+struct poll_table_struct;

#define MAX_CGROUP_TYPE_NAMELEN 32
#define MAX_CGROUP_ROOT_NAMELEN 64
@@ -574,6 +575,9 @@ struct cftype {
ssize_t (*write)(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);

+ __poll_t (*poll)(struct kernfs_open_file *of,
+ struct poll_table_struct *pt);
+
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lock_class_key lockdep_key;
#endif
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index f31bd61c9466..e8cd12c6a553 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3533,6 +3533,16 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
return ret ?: nbytes;
}

+static __poll_t cgroup_file_poll(struct kernfs_open_file *of, poll_table *pt)
+{
+ struct cftype *cft = of->kn->priv;
+
+ if (cft->poll)
+ return cft->poll(of, pt);
+
+ return kernfs_generic_poll(of, pt);
+}
+
static void *cgroup_seqfile_start(struct seq_file *seq, loff_t *ppos)
{
return seq_cft(seq)->seq_start(seq, ppos);
@@ -3571,6 +3581,7 @@ static struct kernfs_ops cgroup_kf_single_ops = {
.open = cgroup_file_open,
.release = cgroup_file_release,
.write = cgroup_file_write,
+ .poll = cgroup_file_poll,
.seq_show = cgroup_seqfile_show,
};

@@ -3579,6 +3590,7 @@ static struct kernfs_ops cgroup_kf_ops = {
.open = cgroup_file_open,
.release = cgroup_file_release,
.write = cgroup_file_write,
+ .poll = cgroup_file_poll,
.seq_start = cgroup_seqfile_start,
.seq_next = cgroup_seqfile_next,
.seq_stop = cgroup_seqfile_stop,
--
2.20.1.321.g9e740568ce-goog


2019-01-28 21:23:19

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] psi: introduce state_mask to represent stalled psi states

On Thu, Jan 24, 2019 at 01:15:16PM -0800, Suren Baghdasaryan wrote:
> The psi monitoring patches will need to determine the same states as
> record_times(). To avoid calculating them twice, maintain a state mask
> that can be consulted cheaply. Do this in a separate patch to keep the
> churn in the main feature patch at a minimum.
> This adds 4-byte state_mask member into psi_group_cpu struct which
> results in its first cacheline-aligned part to become 52 bytes long.
> Add explicit values to enumeration element counters that affect
> psi_group_cpu struct size.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2019-01-28 21:25:02

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v3 4/5] psi: rename psi fields in preparation for psi trigger addition

On Thu, Jan 24, 2019 at 01:15:17PM -0800, Suren Baghdasaryan wrote:
> Renaming psi_group structure member fields used for calculating psi
> totals and averages for clear distinction between them and trigger-related
> fields that will be added next.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2019-01-28 21:27:43

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] psi: introduce psi monitor

One thought on the v3 delta that I missed earlier:

On Thu, Jan 24, 2019 at 01:15:18PM -0800, Suren Baghdasaryan wrote:
> +/*
> + * psi_update_work represents slowpath accounting part while psi_group_change
> + * represents hotpath part. There are two potential races between them:
> + * 1. Changes to group->polling when slowpath checks for new stall, then hotpath
> + * records new stall and then slowpath resets group->polling flag. This leads
> + * to the exit from the polling mode while monitored state is still changing.
> + * 2. Slowpath overwriting an immediate update scheduled from the hotpath with
> + * a regular update further in the future and missing the immediate update.
> + * Both races are handled with a retry cycle in the slowpath:
> + *
> + * HOTPATH: | SLOWPATH:
> + * |
> + * A) times[cpu] += delta | E) delta = times[*]
> + * B) start_poll = (delta[poll_mask] &&| if delta[poll_mask]:
> + * cmpxchg(g->polling, 0, 1) == 0)| F) polling_until = now + grace_period
> + * if start_poll: | if now > polling_until:
> + * C) mod_delayed_work(1) | if g->polling:

With the polling flag being atomic now, this "if g->polling" line
isn't accurate anymore. Since this diagram is specifically about
memory ordering, this should move the g->polling load up to where
delta is read and then refer to unordered local variables down here.

2019-01-28 23:54:29

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] psi: introduce psi monitor

Hi Suren,

When I review first time, it was rather hard to understand due to naming
so below comments are mostly cleanup or minor.
I'm not strong against if you don't think it's helpful.
Feel free to select parts.

Thanks.

On Thu, Jan 24, 2019 at 01:15:18PM -0800, Suren Baghdasaryan wrote:
> Psi monitor aims to provide a low-latency short-term pressure
> detection mechanism configurable by users. It allows users to
> monitor psi metrics growth and trigger events whenever a metric
> raises above user-defined threshold within user-defined time window.
>
> Time window and threshold are both expressed in usecs. Multiple psi
> resources with different thresholds and window sizes can be monitored
> concurrently.
>
> Psi monitors activate when system enters stall state for the monitored
> psi metric and deactivate upon exit from the stall state. While system
> is in the stall state psi signal growth is monitored at a rate of 10 times
> per tracking window. Min window size is 500ms, therefore the min monitoring
> interval is 50ms. Max window size is 10s with monitoring interval of 1s.
>
> When activated psi monitor stays active for at least the duration of one
> tracking window to avoid repeated activations/deactivations when psi
> signal is bouncing.
>
> Notifications to the users are rate-limited to one per tracking window.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> Documentation/accounting/psi.txt | 104 ++++++
> include/linux/psi.h | 10 +
> include/linux/psi_types.h | 59 ++++
> kernel/cgroup/cgroup.c | 107 +++++-
> kernel/sched/psi.c | 562 +++++++++++++++++++++++++++++--
> 5 files changed, 808 insertions(+), 34 deletions(-)
>
> diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
> index b8ca28b60215..6b21c72aa87c 100644
> --- a/Documentation/accounting/psi.txt
> +++ b/Documentation/accounting/psi.txt
> @@ -63,6 +63,107 @@ tracked and exported as well, to allow detection of latency spikes
> which wouldn't necessarily make a dent in the time averages, or to
> average trends over custom time frames.
>
> +Monitoring for pressure thresholds
> +==================================
> +
> +Users can register triggers and use poll() to be woken up when resource
> +pressure exceeds certain thresholds.
> +
> +A trigger describes the maximum cumulative stall time over a specific
> +time window, e.g. 100ms of total stall time within any 500ms window to
> +generate a wakeup event.
> +
> +To register a trigger user has to open psi interface file under
> +/proc/pressure/ representing the resource to be monitored and write the
> +desired threshold and time window. The open file descriptor should be
> +used to wait for trigger events using select(), poll() or epoll().
> +The following format is used:
> +
> +<some|full> <stall amount in us> <time window in us>
> +
> +For example writing "some 150000 1000000" into /proc/pressure/memory
> +would add 150ms threshold for partial memory stall measured within
> +1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
> +would add 50ms threshold for full io stall measured within 1sec time window.
> +
> +Triggers can be set on more than one psi metric and more than one trigger
> +for the same psi metric can be specified. However for each trigger a separate
> +file descriptor is required to be able to poll it separately from others,
> +therefore for each trigger a separate open() syscall should be made even
> +when opening the same psi interface file.
> +
> +Monitors activate only when system enters stall state for the monitored
> +psi metric and deactivates upon exit from the stall state. While system is
> +in the stall state psi signal growth is monitored at a rate of 10 times per
> +tracking window.
> +
> +The kernel accepts window sizes ranging from 500ms to 10s, therefore min
> +monitoring update interval is 50ms and max is 1s.

Hope to add why we decide these number into min/max.

> +
> +When activated, psi monitor stays active for at least the duration of one
> +tracking window to avoid repeated activations/deactivations when system is
> +bouncing in and out of the stall state.
> +
> +Notifications to the userspace are rate-limited to one per tracking window.
> +
> +The trigger will de-register when the file descriptor used to define the
> +trigger is closed.
> +
> +Userspace monitor usage example
> +===============================
> +
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <stdio.h>
> +#include <poll.h>
> +#include <string.h>
> +#include <unistd.h>
> +
> +/*
> + * Monitor memory partial stall with 1s tracking window size
> + * and 150ms threshold.
> + */
> +int main() {
> + const char trig[] = "some 150000 1000000";
> + struct pollfd fds;
> + int n;
> +
> + fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
> + if (fds.fd < 0) {
> + printf("/proc/pressure/memory open error: %s\n",
> + strerror(errno));
> + return 1;
> + }
> + fds.events = POLLPRI;
> +
> + if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
> + printf("/proc/pressure/memory write error: %s\n",
> + strerror(errno));
> + return 1;
> + }
> +
> + printf("waiting for events...\n");
> + while (1) {
> + n = poll(&fds, 1, -1);
> + if (n < 0) {
> + printf("poll error: %s\n", strerror(errno));
> + return 1;
> + }
> + if (fds.revents & POLLERR) {
> + printf("got POLLERR, event source is gone\n");
> + return 0;
> + }
> + if (fds.revents & POLLPRI) {
> + printf("event triggered!\n");
> + } else {
> + printf("unknown event received: 0x%x\n", fds.revents);
> + return 1;
> + }
> + }
> +
> + return 0;
> +}
> +
> Cgroup2 interface
> =================
>
> @@ -71,3 +172,6 @@ mounted, pressure stall information is also tracked for tasks grouped
> into cgroups. Each subdirectory in the cgroupfs mountpoint contains
> cpu.pressure, memory.pressure, and io.pressure files; the format is
> the same as the /proc/pressure/ files.
> +
> +Per-cgroup psi monitors can be specified and used the same way as
> +system-wide ones.
> diff --git a/include/linux/psi.h b/include/linux/psi.h
> index 7006008d5b72..7490f8ef83ac 100644
> --- a/include/linux/psi.h
> +++ b/include/linux/psi.h
> @@ -4,6 +4,7 @@
> #include <linux/jump_label.h>
> #include <linux/psi_types.h>
> #include <linux/sched.h>
> +#include <linux/poll.h>
>
> struct seq_file;
> struct css_set;
> @@ -26,6 +27,15 @@ int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
> int psi_cgroup_alloc(struct cgroup *cgrp);
> void psi_cgroup_free(struct cgroup *cgrp);
> void cgroup_move_task(struct task_struct *p, struct css_set *to);
> +
> +ssize_t psi_trigger_parse(char *buf, size_t nbytes, enum psi_res res,
> + enum psi_states *state, u32 *threshold_us, u32 *win_sz_us);
> +struct psi_trigger *psi_trigger_create(struct psi_group *group,
> + enum psi_states state, u32 threshold_us, u32 win_sz_us);
> +void psi_trigger_destroy(struct psi_trigger *t);
> +
> +__poll_t psi_trigger_poll(struct psi_trigger *t, struct file *file,
> + poll_table *wait);
> #endif
>
> #else /* CONFIG_PSI */
> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> index 47757668bdcb..41654473cb69 100644
> --- a/include/linux/psi_types.h
> +++ b/include/linux/psi_types.h
> @@ -3,6 +3,7 @@
>
> #include <linux/seqlock.h>
> #include <linux/types.h>
> +#include <linux/wait.h>
>
> #ifdef CONFIG_PSI
>
> @@ -68,6 +69,50 @@ struct psi_group_cpu {
> u32 times_prev[NR_PSI_STATES] ____cacheline_aligned_in_smp;
> };
>
> +/* PSI growth tracking window */
> +struct psi_window {
> + /* Window size in ns */
> + u64 size;

As rest of field are time, how about "interval" instead of size?

> +
> + /* Start time of the current window in ns */
> + u64 start_time;
> +
> + /* Value at the start of the window */
> + u64 start_value;

"value" is rather vague. starting_stall?

> +
> + /* Value growth per previous window(s) */
> + u64 per_win_growth;

Rather than per, prev would be more meaninful, I think.
How about prev_win_stall?

> +};
> +
> +struct psi_trigger {
> + /* PSI state being monitored by the trigger */
> + enum psi_states state;
> +
> + /* User-spacified threshold in ns */
> + u64 threshold;
> +
> + /* List node inside triggers list */
> + struct list_head node;
> +
> + /* Backpointer needed during trigger destruction */
> + struct psi_group *group;
> +
> + /* Wait queue for polling */
> + wait_queue_head_t event_wait;
> +
> + /* Pending event flag */
> + int event;
> +
> + /* Tracking window */
> + struct psi_window win;
> +
> + /*
> + * Time last event was generated. Used for rate-limiting
> + * events to one per window
> + */
> + u64 last_event_time;
> +};
> +
> struct psi_group {
> /* Protects data used by the aggregator */
> struct mutex update_lock;
> @@ -75,6 +120,8 @@ struct psi_group {
> /* Per-cpu task state & time tracking */
> struct psi_group_cpu __percpu *pcpu;
>
> + /* Periodic work control */
> + atomic_t polling;
> struct delayed_work clock_work;
>
> /* Total stall times observed */
> @@ -85,6 +132,18 @@ struct psi_group {
> u64 avg_last_update;
> u64 avg_next_update;
> unsigned long avg[NR_PSI_STATES - 1][3];
> +
> + /* Configured polling triggers */
> + struct list_head triggers;
> + u32 nr_triggers[NR_PSI_STATES - 1];
> + u32 trigger_mask;

This is a state we have an interest.
How about trigger_states?

> + u64 trigger_min_period;
> +
> + /* Polling state */
> + /* Total stall times at the start of monitor activation */
> + u64 polling_total[NR_PSI_STATES - 1];
> + u64 polling_next_update;
> + u64 polling_until;
> };
>
> #else /* CONFIG_PSI */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index e8cd12c6a553..de3ac22a5e23 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3464,7 +3464,101 @@ static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
> {
> return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU);
> }
> -#endif
> +
> +static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
> + size_t nbytes, enum psi_res res)
> +{
> + enum psi_states state;
> + struct psi_trigger *old;
> + struct psi_trigger *new;
> + struct cgroup *cgrp;
> + u32 threshold_us;
> + u32 win_sz_us;

window_us?

> + ssize_t ret;
> +
> + cgrp = cgroup_kn_lock_live(of->kn, false);
> + if (!cgrp)
> + return -ENODEV;
> +
> + cgroup_get(cgrp);
> + cgroup_kn_unlock(of->kn);
> +
> + ret = psi_trigger_parse(buf, nbytes, res,
> + &state, &threshold_us, &win_sz_us);
> + if (ret) {
> + cgroup_put(cgrp);
> + return ret;
> + }
> +
> + new = psi_trigger_create(&cgrp->psi,
> + state, threshold_us, win_sz_us);
> + if (IS_ERR(new)) {
> + cgroup_put(cgrp);
> + return PTR_ERR(new);
> + }
> +
> + old = of->priv;
> + rcu_assign_pointer(of->priv, new);
> + if (old) {
> + synchronize_rcu();
> + psi_trigger_destroy(old);
> + }
> +
> + cgroup_put(cgrp);
> +
> + return nbytes;
> +}
> +
> +static ssize_t cgroup_io_pressure_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes,
> + loff_t off)
> +{
> + return cgroup_pressure_write(of, buf, nbytes, PSI_IO);
> +}
> +
> +static ssize_t cgroup_memory_pressure_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes,
> + loff_t off)
> +{
> + return cgroup_pressure_write(of, buf, nbytes, PSI_MEM);
> +}
> +
> +static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes,
> + loff_t off)
> +{
> + return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
> +}
> +
> +static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
> + poll_table *pt)
> +{
> + struct psi_trigger *t;
> + __poll_t ret;
> +
> + rcu_read_lock();
> + t = rcu_dereference(of->priv);
> + if (t)
> + ret = psi_trigger_poll(t, of->file, pt);
> + else
> + ret = DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI;
> + rcu_read_unlock();
> +
> + return ret;
> +}
> +
> +static void cgroup_pressure_release(struct kernfs_open_file *of)
> +{
> + struct psi_trigger *t = of->priv;
> +
> + if (!t)
> + return;
> +
> + rcu_assign_pointer(of->priv, NULL);
> + synchronize_rcu();
> + psi_trigger_destroy(t);
> +}
> +#endif /* CONFIG_PSI */
>
> static int cgroup_file_open(struct kernfs_open_file *of)
> {
> @@ -4619,18 +4713,27 @@ static struct cftype cgroup_base_files[] = {
> .name = "io.pressure",
> .flags = CFTYPE_NOT_ON_ROOT,
> .seq_show = cgroup_io_pressure_show,
> + .write = cgroup_io_pressure_write,
> + .poll = cgroup_pressure_poll,
> + .release = cgroup_pressure_release,
> },
> {
> .name = "memory.pressure",
> .flags = CFTYPE_NOT_ON_ROOT,
> .seq_show = cgroup_memory_pressure_show,
> + .write = cgroup_memory_pressure_write,
> + .poll = cgroup_pressure_poll,
> + .release = cgroup_pressure_release,
> },
> {
> .name = "cpu.pressure",
> .flags = CFTYPE_NOT_ON_ROOT,
> .seq_show = cgroup_cpu_pressure_show,
> + .write = cgroup_cpu_pressure_write,
> + .poll = cgroup_pressure_poll,
> + .release = cgroup_pressure_release,
> },
> -#endif
> +#endif /* CONFIG_PSI */
> { } /* terminate */
> };
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index c366503ba135..fefb98f19a80 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -4,6 +4,9 @@
> * Copyright (c) 2018 Facebook, Inc.
> * Author: Johannes Weiner <[email protected]>
> *
> + * Polling support by Suren Baghdasaryan <[email protected]>
> + * Copyright (c) 2018 Google, Inc.
> + *
> * When CPU, memory and IO are contended, tasks experience delays that
> * reduce throughput and introduce latencies into the workload. Memory
> * and IO contention, in addition, can cause a full loss of forward
> @@ -126,11 +129,16 @@
>
> #include <linux/sched/loadavg.h>
> #include <linux/seq_file.h>
> +#include <linux/eventfd.h>
> #include <linux/proc_fs.h>
> #include <linux/seqlock.h>
> +#include <linux/uaccess.h>
> #include <linux/cgroup.h>
> #include <linux/module.h>
> #include <linux/sched.h>
> +#include <linux/ctype.h>
> +#include <linux/file.h>
> +#include <linux/poll.h>
> #include <linux/psi.h>
> #include "sched.h"
>
> @@ -150,11 +158,16 @@ static int __init setup_psi(char *str)
> __setup("psi=", setup_psi);
>
> /* Running averages - we need to be higher-res than loadavg */
> -#define PSI_FREQ (2*HZ+1) /* 2 sec intervals */
> +#define PSI_FREQ (2*HZ+1UL) /* 2 sec intervals */
> #define EXP_10s 1677 /* 1/exp(2s/10s) as fixed-point */
> #define EXP_60s 1981 /* 1/exp(2s/60s) */
> #define EXP_300s 2034 /* 1/exp(2s/300s) */
>
> +/* PSI trigger definitions */
> +#define PSI_TRIG_MIN_WIN_US 500000 /* Min window size is 500ms */
> +#define PSI_TRIG_MAX_WIN_US 10000000 /* Max window size is 10s */
> +#define PSI_TRIG_UPDATES_PER_WIN 10 /* 10 updates per window */

To me, it's rather long.
How about WINDOW_MIN_US, WINDOW_MAX_US, UPDATES_PER_WINDOW?

> +
> /* Sampling frequency in nanoseconds */
> static u64 psi_period __read_mostly;
>
> @@ -173,8 +186,17 @@ static void group_init(struct psi_group *group)
> for_each_possible_cpu(cpu)
> seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> group->avg_next_update = sched_clock() + psi_period;
> + atomic_set(&group->polling, 0);
> INIT_DELAYED_WORK(&group->clock_work, psi_update_work);
> mutex_init(&group->update_lock);
> + /* Init trigger-related members */
> + INIT_LIST_HEAD(&group->triggers);
> + memset(group->nr_triggers, 0, sizeof(group->nr_triggers));
> + group->trigger_mask = 0;
> + group->trigger_min_period = U32_MAX;
> + memset(group->polling_total, 0, sizeof(group->polling_total));
> + group->polling_next_update = ULLONG_MAX;
> + group->polling_until = 0;
> }
>
> void __init psi_init(void)
> @@ -209,10 +231,11 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
> }
> }
>
> -static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
> +static u32 get_recent_times(struct psi_group *group, int cpu, u32 *times)

Rather awkward to return change_mask when we consider function name as
get_recent_times It would be better to add additional parameter
instead of return value.

> {
> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
> u64 now, state_start;
> + u32 change_mask = 0;
> enum psi_states s;
> unsigned int seq;
> u32 state_mask;
> @@ -245,7 +268,11 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
> groupc->times_prev[s] = times[s];
>
> times[s] = delta;
> + if (delta)
> + change_mask |= (1 << s);
> }
> +
> + return change_mask;
> }
>
> static void calc_avgs(unsigned long avg[3], int missed_periods,
> @@ -268,17 +295,14 @@ static void calc_avgs(unsigned long avg[3], int missed_periods,
> avg[2] = calc_load(avg[2], EXP_300s, pct);
> }
>
> -static bool update_stats(struct psi_group *group)
> +static u32 collect_percpu_times(struct psi_group *group)

Not sure it's a good idea to add "implementation facility" in here.
How about update_stall_time with additional parameter of
"[changed|updated]_states?

IOW,
static void update_stall_times(struct psi_group *group, u32 *changed_states)

> {
> u64 deltas[NR_PSI_STATES - 1] = { 0, };
> - unsigned long missed_periods = 0;
> unsigned long nonidle_total = 0;
> - u64 now, expires, period;
> + u32 change_mask = 0;
> int cpu;
> int s;
>
> - mutex_lock(&group->update_lock);
> -
> /*
> * Collect the per-cpu time buckets and average them into a
> * single time sample that is normalized to wallclock time.
> @@ -291,7 +315,7 @@ static bool update_stats(struct psi_group *group)
> u32 times[NR_PSI_STATES];
> u32 nonidle;
>
> - get_recent_times(group, cpu, times);
> + change_mask |= get_recent_times(group, cpu, times);
>
> nonidle = nsecs_to_jiffies(times[PSI_NONIDLE]);
> nonidle_total += nonidle;
> @@ -316,11 +340,18 @@ static bool update_stats(struct psi_group *group)
> for (s = 0; s < NR_PSI_STATES - 1; s++)
> group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL));
>
> + return change_mask;
> +}
> +
> +static u64 calculate_averages(struct psi_group *group, u64 now)

time?

The function name is still awkward to me.
If someone see this function, he will expect return value as "average", not next_update.
If we want to have next_update as return value, it would better to rename *update_avgs*.


> +{
> + unsigned long missed_periods = 0;
> + u64 expires, period;
> + u64 avg_next_update;
> + int s;
> +
> /* avgX= */
> - now = sched_clock();
> expires = group->avg_next_update;
> - if (now < expires)
> - goto out;
> if (now - expires > psi_period)
> missed_periods = div_u64(now - expires, psi_period);
>
> @@ -331,7 +362,7 @@ static bool update_stats(struct psi_group *group)
> * But the deltas we sample out of the per-cpu buckets above
> * are based on the actual time elapsing between clock ticks.
> */
> - group->avg_next_update = expires + ((1 + missed_periods) * psi_period);
> + avg_next_update = expires + ((1 + missed_periods) * psi_period);
> period = now - (group->avg_last_update + (missed_periods * psi_period));
> group->avg_last_update = now;
>
> @@ -361,20 +392,237 @@ static bool update_stats(struct psi_group *group)
> group->avg_total[s] += sample;
> calc_avgs(group->avg[s], missed_periods, sample, period);
> }
> -out:
> - mutex_unlock(&group->update_lock);
> - return nonidle_total;
> +
> + return avg_next_update;
> +}
> +
> +/* Trigger tracking window manupulations */
> +static void window_init(struct psi_window *win, u64 now, u64 value)
> +{
> + win->start_value = value;
> + win->start_time = now;
> + win->per_win_growth = 0;
> +}
> +
> +/*
> + * PSI growth tracking window update and growth calculation routine.

Let's add empty line here.

> + * This approximates a sliding tracking window by interpolating
> + * partially elapsed windows using historical growth data from the
> + * previous intervals. This minimizes memory requirements (by not storing
> + * all the intermediate values in the previous window) and simplifies
> + * the calculations. It works well because PSI signal changes only in
> + * positive direction and over relatively small window sizes the growth
> + * is close to linear.
> + */
> +static u64 window_update(struct psi_window *win, u64 now, u64 value)

Hope to change now as just time for function.

Insetad of value, couldn't we use more concrete naming?
Maybe stall_time or just stall?

> +{
> + u64 interval;

elapsed?

> + u64 growth;
> +
> + interval = now - win->start_time;
> + growth = value - win->start_value;
> + /*
> + * After each tracking window passes win->start_value and
> + * win->start_time get reset and win->per_win_growth stores
> + * the average per-window growth of the previous window.
> + * win->per_win_growth is then used to interpolate additional
> + * growth from the previous window assuming it was linear.
> + */
> + if (interval > win->size) {
> + win->per_win_growth = growth;
> + win->start_value = value;
> + win->start_time = now;

We can use window_init via adding per_win_growth in the function
parameter. Maybe, window_reset would be better function name.

> + } else {
> + u32 unelapsed;

remaining? remained?

> +
> + unelapsed = win->size - interval;
> + growth += div_u64(win->per_win_growth * unelapsed, win->size);
> + }
> +
> + return growth;
> +}
> +
> +static void init_triggers(struct psi_group *group, u64 now)
> +{
> + struct psi_trigger *t;
> +
> + list_for_each_entry(t, &group->triggers, node)
> + window_init(&t->win, now, group->total[t->state]);
> + memcpy(group->polling_total, group->total,
> + sizeof(group->polling_total));
> + group->polling_next_update = now + group->trigger_min_period;
> +}
> +
> +static u64 poll_triggers(struct psi_group *group, u64 now)

How about update_[poll|trigger]_stat?

> +{
> + struct psi_trigger *t;
> + bool new_stall = false;
> +
> + /*
> + * On subsequent updates, calculate growth deltas and let
> + * watchers know when their specified thresholds are exceeded.
> + */
> + list_for_each_entry(t, &group->triggers, node) {
> + u64 growth;
> +
> + /* Check for stall activity */
> + if (group->polling_total[t->state] == group->total[t->state])
> + continue;
> +
> + /*
> + * Multiple triggers might be looking at the same state,
> + * remember to update group->polling_total[] once we've
> + * been through all of them. Also remember to extend the
> + * polling time if we see new stall activity.
> + */
> + new_stall = true;
> +
> + /* Calculate growth since last update */
> + growth = window_update(&t->win, now, group->total[t->state]);
> + if (growth < t->threshold)
> + continue;
> +
> + /* Limit event signaling to once per window */
> + if (now < t->last_event_time + t->win.size)
> + continue;
> +
> + /* Generate an event */
> + if (cmpxchg(&t->event, 0, 1) == 0)
> + wake_up_interruptible(&t->event_wait);
> + t->last_event_time = now;
> + }
> +
> + if (new_stall) {
> + memcpy(group->polling_total, group->total,
> + sizeof(group->polling_total));
> + }
> +
> + return now + group->trigger_min_period;
> }
>
> +/*
> + * psi_update_work represents slowpath accounting part while psi_group_change
> + * represents hotpath part. There are two potential races between them:
> + * 1. Changes to group->polling when slowpath checks for new stall, then hotpath
> + * records new stall and then slowpath resets group->polling flag. This leads
> + * to the exit from the polling mode while monitored state is still changing.
> + * 2. Slowpath overwriting an immediate update scheduled from the hotpath with
> + * a regular update further in the future and missing the immediate update.
> + * Both races are handled with a retry cycle in the slowpath:
> + *
> + * HOTPATH: | SLOWPATH:
> + * |
> + * A) times[cpu] += delta | E) delta = times[*]
> + * B) start_poll = (delta[poll_mask] &&| if delta[poll_mask]:
> + * cmpxchg(g->polling, 0, 1) == 0)| F) polling_until = now + grace_period
> + * if start_poll: | if now > polling_until:
> + * C) mod_delayed_work(1) | if g->polling:
> + * else if !delayed_work_pending():| G) g->polling = polling = 0
> + * D) schedule_delayed_work(PSI_FREQ)| smp_mb
> + * | H) goto SLOWPATH
> + * | else:
> + * | if !g->polling:
> + * | I) g->polling = polling = 1
> + * | J) if delta && first_pass:
> + * | next_avg = calculate_averages()
> + * | if polling:
> + * | next_poll = poll_triggers()
> + * | if (delta && first_pass) || polling:
> + * | K) mod_delayed_work(
> + * | min(next_avg, next_poll))
> + * | if !polling:
> + * | first_pass = false
> + * | L) goto SLOWPATH
> + *
> + * Race #1 is represented by (EABGD) sequence in which case slowpath deactivates
> + * polling mode because it misses new monitored stall and hotpath doesn't
> + * activate it because at (B) g->polling is not yet reset by slowpath in (G).
> + * This race is handled by the (H) retry, which in the race described above
> + * results in the new sequence of (EABGDHEIK) that reactivates polling mode.
> + *
> + * Race #2 is represented by polling==false && (JABCK) sequence which overwrites
> + * immediate update scheduled at (C) with a later (next_avg) update scheduled at
> + * (K). This race is handled by the (L) retry which results in the new sequence
> + * of polling==false && (JABCKLEIK) that reactivates polling mode and
> + * reschedules the next polling update (next_poll).
> + *
> + * Note that retries can't result in an infinite loop because retry #1 happens
> + * only during polling reactivation and retry #2 happens only on the first pass.
> + * Constant reactivations are impossible because polling will stay active for at
> + * least grace_period. Worst case scenario involves two retries (HEJKLE)
> + */
> static void psi_update_work(struct work_struct *work)
> {
> struct delayed_work *dwork;
> struct psi_group *group;
> + bool first_pass = true;
> + u64 next_update;
> + u32 change_mask;

How about [changed|updated]_states?

> + int polling;
> bool nonidle;
> + u64 now;
>
> dwork = to_delayed_work(work);
> group = container_of(dwork, struct psi_group, clock_work);
>
> + now = sched_clock();
> +
> + mutex_lock(&group->update_lock);
> +
> +retry:
> + change_mask = collect_percpu_times(group);
> + polling = atomic_read(&group->polling);
> +
> + if (change_mask & group->trigger_mask) {
> + /* Initialize trigger windows when entering polling mode */
> + if (now > group->polling_until)
> + init_triggers(group, now);
> +
> + /*
> + * Keep the monitor active for at least the duration of the
> + * minimum tracking window as long as monitor states are
> + * changing. This prevents frequent changes to polling flag
> + * when system bounces in and out of stall states.
> + */
> + group->polling_until = now +
> + group->trigger_min_period * PSI_TRIG_UPDATES_PER_WIN;
> + }
> +
> + /* Handle polling flag transitions */
> + if (now > group->polling_until) {
> + if (polling) {
> + group->polling_next_update = ULLONG_MAX;
> + polling = 0;
> + atomic_set(&group->polling, polling);
> + /*
> + * Memory barrier is needed to order group->polling
> + * write before times[] read in collect_percpu_times()
> + */
> + smp_mb__after_atomic();
> + /*
> + * Check if we missed stall recorded by hotpath while
> + * polling flag was set to 1 causing hotpath to skip
> + * entering polling mode
> + */
> + goto retry;
> + }
> + } else {
> + if (!polling) {
> + /*
> + * This can happen as a fixup in the retry cycle after
> + * new stall is discovered
> + */
> + polling = 1;
> + atomic_set(&group->polling, polling);
> + }
> + }
> + /*
> + * At this point group->polling race with hotpath is resolved and
> + * we rely on local polling flag ignoring possible further changes
> + * to group->polling
> + */
> +
> + nonidle = (change_mask & (1 << PSI_NONIDLE));
> /*
> * If there is task activity, periodically fold the per-cpu
> * times and feed samples into the running averages. If things
> @@ -382,20 +630,32 @@ static void psi_update_work(struct work_struct *work)
> * Once restarted, we'll catch up the running averages in one
> * go - see calc_avgs() and missed_periods.
> */
> + if (nonidle && first_pass) {
> + if (now >= group->avg_next_update)
> + group->avg_next_update = calculate_averages(group, now);
>
> - nonidle = update_stats(group);
> -
> - if (nonidle) {
> - unsigned long delay = 0;
> - u64 now;
> -
> - now = sched_clock();
> - if (group->avg_next_update > now) {
> - delay = nsecs_to_jiffies(
> - group->avg_next_update - now) + 1;
> + if (now >= group->polling_next_update)
> + group->polling_next_update = poll_triggers(group, now);
> + }
> + if ((nonidle && first_pass) || polling) {
> + /* Calculate closest update time */
> + next_update = min(group->polling_next_update,
> + group->avg_next_update);
> + mod_delayed_work(system_wq, dwork, nsecs_to_jiffies(
> + next_update - now) + 1);
> + if (!polling) {
> + /*
> + * We might have overwritten an immediate update
> + * scheduled from the hotpath with a longer regular
> + * update (group->avg_next_update). Execute second pass
> + * retry to discover that and resume polling.
> + */
> + first_pass = false;
> + goto retry;
> }
> - schedule_delayed_work(dwork, delay);
> }
> +
> + mutex_unlock(&group->update_lock);
> }
>
> static void record_times(struct psi_group_cpu *groupc, int cpu,
> @@ -492,8 +752,25 @@ static void psi_group_change(struct psi_group *group, int cpu,
>
> write_seqcount_end(&groupc->seq);
>
> - if (!delayed_work_pending(&group->clock_work))
> - schedule_delayed_work(&group->clock_work, PSI_FREQ);
> + /*
> + * Polling flag resets to 0 at the max rate of once per update window
> + * (at least 500ms interval). smp_wmb is required after group->polling
> + * 0-to-1 transition to order groupc->times and group->polling writes
> + * because stall detection logic in the slowpath relies on groupc->times
> + * changing before group->polling. Explicit smp_wmb is missing because
> + * cmpxchg() implies smp_mb.
> + */
> + if ((state_mask & group->trigger_mask) &&
> + atomic_cmpxchg(&group->polling, 0, 1) == 0) {
> + /*
> + * Start polling immediately even if the work is already
> + * scheduled
> + */
> + mod_delayed_work(system_wq, &group->clock_work, 1);
> + } else {
> + if (!delayed_work_pending(&group->clock_work))
> + schedule_delayed_work(&group->clock_work, PSI_FREQ);
> + }
> }
>
> static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
> @@ -640,6 +917,8 @@ void psi_cgroup_free(struct cgroup *cgroup)
>
> cancel_delayed_work_sync(&cgroup->psi.clock_work);
> free_percpu(cgroup->psi.pcpu);
> + /* All triggers must be removed by now by psi_trigger_destroy */
> + WARN_ONCE(cgroup->psi.trigger_mask, "psi: trigger leak\n");
> }
>
> /**
> @@ -699,7 +978,11 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
> if (static_branch_likely(&psi_disabled))
> return -EOPNOTSUPP;
>
> - update_stats(group);
> + /* Update averages before reporting them */
> + mutex_lock(&group->update_lock);
> + collect_percpu_times(group);
> + calculate_averages(group, sched_clock());
> + mutex_unlock(&group->update_lock);
>
> for (full = 0; full < 2 - (res == PSI_CPU); full++) {
> unsigned long avg[3];
> @@ -751,25 +1034,240 @@ static int psi_cpu_open(struct inode *inode, struct file *file)
> return single_open(file, psi_cpu_show, NULL);
> }
>
> +ssize_t psi_trigger_parse(char *buf, size_t nbytes, enum psi_res res,
> + enum psi_states *pstate, u32 *pthreshold_us, u32 *pwin_sz_us)
> +{
> + enum psi_states state;
> + u32 threshold_us;
> + u32 win_sz_us;
> +
> + if (sscanf(buf, "some %u %u", &threshold_us, &win_sz_us) == 2)
> + state = PSI_IO_SOME + res * 2;
> + else if (sscanf(buf, "full %u %u", &threshold_us, &win_sz_us) == 2)
> + state = PSI_IO_FULL + res * 2;
> + else
> + return -EINVAL;
> +
> + if (state >= PSI_NONIDLE)
> + return -EINVAL;
> +
> + if (win_sz_us < PSI_TRIG_MIN_WIN_US ||
> + win_sz_us > PSI_TRIG_MAX_WIN_US)
> + return -EINVAL;
> +
> + /* Check threshold */
> + if (threshold_us == 0 || threshold_us > win_sz_us)
> + return -EINVAL;
> +
> + *pstate = state;
> + *pthreshold_us = threshold_us;
> + *pwin_sz_us = win_sz_us;
> + return 0;
> +}
> +
> +struct psi_trigger *psi_trigger_create(struct psi_group *group,
> + enum psi_states state, u32 threshold_us, u32 win_sz_us)
> +{
> + struct psi_trigger *t;
> +
> + if (static_branch_likely(&psi_disabled))
> + return ERR_PTR(-EOPNOTSUPP);
> +
> + t = kzalloc(sizeof(*t), GFP_KERNEL);

You reset most of field again in here. Then, I think we don't need to use
kzalloc in here. let's use kmalloc and initialize other field, too.

> + if (!t)
> + return ERR_PTR(-ENOMEM);
> +
> + t->group = group;
> + t->state = state;
> + t->threshold = threshold_us * NSEC_PER_USEC;
> + t->win.size = win_sz_us * NSEC_PER_USEC;
> + t->event = 0;
> + init_waitqueue_head(&t->event_wait);
> +
> + mutex_lock(&group->update_lock);
> +
> + list_add(&t->node, &group->triggers);
> + group->trigger_min_period = min(group->trigger_min_period,
> + div_u64(t->win.size, PSI_TRIG_UPDATES_PER_WIN));
> + group->nr_triggers[t->state]++;
> + group->trigger_mask |= (1 << t->state);
> +
> + mutex_unlock(&group->update_lock);
> +
> + return t;
> +}
> +

2019-01-29 01:56:15

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] psi: introduce psi monitor

On Mon, Jan 28, 2019 at 3:54 PM Minchan Kim <[email protected]> wrote:
>
> Hi Suren,
>
> When I review first time, it was rather hard to understand due to naming
> so below comments are mostly cleanup or minor.
> I'm not strong against if you don't think it's helpful.
> Feel free to select parts.
>
> Thanks.

Thanks for your input Minchan! This is very timely because I have to
re-post ver 4. Will address your comments before re-posting it.

> On Thu, Jan 24, 2019 at 01:15:18PM -0800, Suren Baghdasaryan wrote:
> > Psi monitor aims to provide a low-latency short-term pressure
> > detection mechanism configurable by users. It allows users to
> > monitor psi metrics growth and trigger events whenever a metric
> > raises above user-defined threshold within user-defined time window.
> >
> > Time window and threshold are both expressed in usecs. Multiple psi
> > resources with different thresholds and window sizes can be monitored
> > concurrently.
> >
> > Psi monitors activate when system enters stall state for the monitored
> > psi metric and deactivate upon exit from the stall state. While system
> > is in the stall state psi signal growth is monitored at a rate of 10 times
> > per tracking window. Min window size is 500ms, therefore the min monitoring
> > interval is 50ms. Max window size is 10s with monitoring interval of 1s.
> >
> > When activated psi monitor stays active for at least the duration of one
> > tracking window to avoid repeated activations/deactivations when psi
> > signal is bouncing.
> >
> > Notifications to the users are rate-limited to one per tracking window.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > Signed-off-by: Johannes Weiner <[email protected]>
> > ---
> > Documentation/accounting/psi.txt | 104 ++++++
> > include/linux/psi.h | 10 +
> > include/linux/psi_types.h | 59 ++++
> > kernel/cgroup/cgroup.c | 107 +++++-
> > kernel/sched/psi.c | 562 +++++++++++++++++++++++++++++--
> > 5 files changed, 808 insertions(+), 34 deletions(-)
> >
> > diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
> > index b8ca28b60215..6b21c72aa87c 100644
> > --- a/Documentation/accounting/psi.txt
> > +++ b/Documentation/accounting/psi.txt
> > @@ -63,6 +63,107 @@ tracked and exported as well, to allow detection of latency spikes
> > which wouldn't necessarily make a dent in the time averages, or to
> > average trends over custom time frames.
> >
> > +Monitoring for pressure thresholds
> > +==================================
> > +
> > +Users can register triggers and use poll() to be woken up when resource
> > +pressure exceeds certain thresholds.
> > +
> > +A trigger describes the maximum cumulative stall time over a specific
> > +time window, e.g. 100ms of total stall time within any 500ms window to
> > +generate a wakeup event.
> > +
> > +To register a trigger user has to open psi interface file under
> > +/proc/pressure/ representing the resource to be monitored and write the
> > +desired threshold and time window. The open file descriptor should be
> > +used to wait for trigger events using select(), poll() or epoll().
> > +The following format is used:
> > +
> > +<some|full> <stall amount in us> <time window in us>
> > +
> > +For example writing "some 150000 1000000" into /proc/pressure/memory
> > +would add 150ms threshold for partial memory stall measured within
> > +1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
> > +would add 50ms threshold for full io stall measured within 1sec time window.
> > +
> > +Triggers can be set on more than one psi metric and more than one trigger
> > +for the same psi metric can be specified. However for each trigger a separate
> > +file descriptor is required to be able to poll it separately from others,
> > +therefore for each trigger a separate open() syscall should be made even
> > +when opening the same psi interface file.
> > +
> > +Monitors activate only when system enters stall state for the monitored
> > +psi metric and deactivates upon exit from the stall state. While system is
> > +in the stall state psi signal growth is monitored at a rate of 10 times per
> > +tracking window.
> > +
> > +The kernel accepts window sizes ranging from 500ms to 10s, therefore min
> > +monitoring update interval is 50ms and max is 1s.
>
> Hope to add why we decide these number into min/max.

Will add more explanation. The min limit is important to make sure psi
monitor does not poll too often affecting overall system performance.
Max limit of 10sec was chosen as a big enough limit after which the
monitor is probably not the best choice. 10sec window means that
polling internal is 1sec, so userspace is not concerned about reaction
time to mounting pressure. In such situation it's probably better to
read corresponding psi pressure averages instead of using the
monitors. I might be wrong here but in that case the limit can be
increased.

> > +
> > +When activated, psi monitor stays active for at least the duration of one
> > +tracking window to avoid repeated activations/deactivations when system is
> > +bouncing in and out of the stall state.
> > +
> > +Notifications to the userspace are rate-limited to one per tracking window.
> > +
> > +The trigger will de-register when the file descriptor used to define the
> > +trigger is closed.
> > +
> > +Userspace monitor usage example
> > +===============================
> > +
> > +#include <errno.h>
> > +#include <fcntl.h>
> > +#include <stdio.h>
> > +#include <poll.h>
> > +#include <string.h>
> > +#include <unistd.h>
> > +
> > +/*
> > + * Monitor memory partial stall with 1s tracking window size
> > + * and 150ms threshold.
> > + */
> > +int main() {
> > + const char trig[] = "some 150000 1000000";
> > + struct pollfd fds;
> > + int n;
> > +
> > + fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
> > + if (fds.fd < 0) {
> > + printf("/proc/pressure/memory open error: %s\n",
> > + strerror(errno));
> > + return 1;
> > + }
> > + fds.events = POLLPRI;
> > +
> > + if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
> > + printf("/proc/pressure/memory write error: %s\n",
> > + strerror(errno));
> > + return 1;
> > + }
> > +
> > + printf("waiting for events...\n");
> > + while (1) {
> > + n = poll(&fds, 1, -1);
> > + if (n < 0) {
> > + printf("poll error: %s\n", strerror(errno));
> > + return 1;
> > + }
> > + if (fds.revents & POLLERR) {
> > + printf("got POLLERR, event source is gone\n");
> > + return 0;
> > + }
> > + if (fds.revents & POLLPRI) {
> > + printf("event triggered!\n");
> > + } else {
> > + printf("unknown event received: 0x%x\n", fds.revents);
> > + return 1;
> > + }
> > + }
> > +
> > + return 0;
> > +}
> > +
> > Cgroup2 interface
> > =================
> >
> > @@ -71,3 +172,6 @@ mounted, pressure stall information is also tracked for tasks grouped
> > into cgroups. Each subdirectory in the cgroupfs mountpoint contains
> > cpu.pressure, memory.pressure, and io.pressure files; the format is
> > the same as the /proc/pressure/ files.
> > +
> > +Per-cgroup psi monitors can be specified and used the same way as
> > +system-wide ones.
> > diff --git a/include/linux/psi.h b/include/linux/psi.h
> > index 7006008d5b72..7490f8ef83ac 100644
> > --- a/include/linux/psi.h
> > +++ b/include/linux/psi.h
> > @@ -4,6 +4,7 @@
> > #include <linux/jump_label.h>
> > #include <linux/psi_types.h>
> > #include <linux/sched.h>
> > +#include <linux/poll.h>
> >
> > struct seq_file;
> > struct css_set;
> > @@ -26,6 +27,15 @@ int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
> > int psi_cgroup_alloc(struct cgroup *cgrp);
> > void psi_cgroup_free(struct cgroup *cgrp);
> > void cgroup_move_task(struct task_struct *p, struct css_set *to);
> > +
> > +ssize_t psi_trigger_parse(char *buf, size_t nbytes, enum psi_res res,
> > + enum psi_states *state, u32 *threshold_us, u32 *win_sz_us);
> > +struct psi_trigger *psi_trigger_create(struct psi_group *group,
> > + enum psi_states state, u32 threshold_us, u32 win_sz_us);
> > +void psi_trigger_destroy(struct psi_trigger *t);
> > +
> > +__poll_t psi_trigger_poll(struct psi_trigger *t, struct file *file,
> > + poll_table *wait);
> > #endif
> >
> > #else /* CONFIG_PSI */
> > diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> > index 47757668bdcb..41654473cb69 100644
> > --- a/include/linux/psi_types.h
> > +++ b/include/linux/psi_types.h
> > @@ -3,6 +3,7 @@
> >
> > #include <linux/seqlock.h>
> > #include <linux/types.h>
> > +#include <linux/wait.h>
> >
> > #ifdef CONFIG_PSI
> >
> > @@ -68,6 +69,50 @@ struct psi_group_cpu {
> > u32 times_prev[NR_PSI_STATES] ____cacheline_aligned_in_smp;
> > };
> >
> > +/* PSI growth tracking window */
> > +struct psi_window {
> > + /* Window size in ns */
> > + u64 size;
>
> As rest of field are time, how about "interval" instead of size?

Sounds reasonable.

> > +
> > + /* Start time of the current window in ns */
> > + u64 start_time;
> > +
> > + /* Value at the start of the window */
> > + u64 start_value;
>
> "value" is rather vague. starting_stall?
>
> > +
> > + /* Value growth per previous window(s) */
> > + u64 per_win_growth;
>
> Rather than per, prev would be more meaninful, I think.
> How about prev_win_stall?

Yes, that sounds better.

> > +};
> > +
> > +struct psi_trigger {
> > + /* PSI state being monitored by the trigger */
> > + enum psi_states state;
> > +
> > + /* User-spacified threshold in ns */
> > + u64 threshold;
> > +
> > + /* List node inside triggers list */
> > + struct list_head node;
> > +
> > + /* Backpointer needed during trigger destruction */
> > + struct psi_group *group;
> > +
> > + /* Wait queue for polling */
> > + wait_queue_head_t event_wait;
> > +
> > + /* Pending event flag */
> > + int event;
> > +
> > + /* Tracking window */
> > + struct psi_window win;
> > +
> > + /*
> > + * Time last event was generated. Used for rate-limiting
> > + * events to one per window
> > + */
> > + u64 last_event_time;
> > +};
> > +
> > struct psi_group {
> > /* Protects data used by the aggregator */
> > struct mutex update_lock;
> > @@ -75,6 +120,8 @@ struct psi_group {
> > /* Per-cpu task state & time tracking */
> > struct psi_group_cpu __percpu *pcpu;
> >
> > + /* Periodic work control */
> > + atomic_t polling;
> > struct delayed_work clock_work;
> >
> > /* Total stall times observed */
> > @@ -85,6 +132,18 @@ struct psi_group {
> > u64 avg_last_update;
> > u64 avg_next_update;
> > unsigned long avg[NR_PSI_STATES - 1][3];
> > +
> > + /* Configured polling triggers */
> > + struct list_head triggers;
> > + u32 nr_triggers[NR_PSI_STATES - 1];
> > + u32 trigger_mask;
>
> This is a state we have an interest.
> How about trigger_states?

Will change.

> > + u64 trigger_min_period;
> > +
> > + /* Polling state */
> > + /* Total stall times at the start of monitor activation */
> > + u64 polling_total[NR_PSI_STATES - 1];
> > + u64 polling_next_update;
> > + u64 polling_until;
> > };
> >
> > #else /* CONFIG_PSI */
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index e8cd12c6a553..de3ac22a5e23 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -3464,7 +3464,101 @@ static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
> > {
> > return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU);
> > }
> > -#endif
> > +
> > +static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
> > + size_t nbytes, enum psi_res res)
> > +{
> > + enum psi_states state;
> > + struct psi_trigger *old;
> > + struct psi_trigger *new;
> > + struct cgroup *cgrp;
> > + u32 threshold_us;
> > + u32 win_sz_us;
>
> window_us?

Will change.

> > + ssize_t ret;
> > +
> > + cgrp = cgroup_kn_lock_live(of->kn, false);
> > + if (!cgrp)
> > + return -ENODEV;
> > +
> > + cgroup_get(cgrp);
> > + cgroup_kn_unlock(of->kn);
> > +
> > + ret = psi_trigger_parse(buf, nbytes, res,
> > + &state, &threshold_us, &win_sz_us);
> > + if (ret) {
> > + cgroup_put(cgrp);
> > + return ret;
> > + }
> > +
> > + new = psi_trigger_create(&cgrp->psi,
> > + state, threshold_us, win_sz_us);
> > + if (IS_ERR(new)) {
> > + cgroup_put(cgrp);
> > + return PTR_ERR(new);
> > + }
> > +
> > + old = of->priv;
> > + rcu_assign_pointer(of->priv, new);
> > + if (old) {
> > + synchronize_rcu();
> > + psi_trigger_destroy(old);
> > + }
> > +
> > + cgroup_put(cgrp);
> > +
> > + return nbytes;
> > +}
> > +
> > +static ssize_t cgroup_io_pressure_write(struct kernfs_open_file *of,
> > + char *buf, size_t nbytes,
> > + loff_t off)
> > +{
> > + return cgroup_pressure_write(of, buf, nbytes, PSI_IO);
> > +}
> > +
> > +static ssize_t cgroup_memory_pressure_write(struct kernfs_open_file *of,
> > + char *buf, size_t nbytes,
> > + loff_t off)
> > +{
> > + return cgroup_pressure_write(of, buf, nbytes, PSI_MEM);
> > +}
> > +
> > +static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
> > + char *buf, size_t nbytes,
> > + loff_t off)
> > +{
> > + return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
> > +}
> > +
> > +static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
> > + poll_table *pt)
> > +{
> > + struct psi_trigger *t;
> > + __poll_t ret;
> > +
> > + rcu_read_lock();
> > + t = rcu_dereference(of->priv);
> > + if (t)
> > + ret = psi_trigger_poll(t, of->file, pt);
> > + else
> > + ret = DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI;
> > + rcu_read_unlock();
> > +
> > + return ret;
> > +}
> > +
> > +static void cgroup_pressure_release(struct kernfs_open_file *of)
> > +{
> > + struct psi_trigger *t = of->priv;
> > +
> > + if (!t)
> > + return;
> > +
> > + rcu_assign_pointer(of->priv, NULL);
> > + synchronize_rcu();
> > + psi_trigger_destroy(t);
> > +}
> > +#endif /* CONFIG_PSI */
> >
> > static int cgroup_file_open(struct kernfs_open_file *of)
> > {
> > @@ -4619,18 +4713,27 @@ static struct cftype cgroup_base_files[] = {
> > .name = "io.pressure",
> > .flags = CFTYPE_NOT_ON_ROOT,
> > .seq_show = cgroup_io_pressure_show,
> > + .write = cgroup_io_pressure_write,
> > + .poll = cgroup_pressure_poll,
> > + .release = cgroup_pressure_release,
> > },
> > {
> > .name = "memory.pressure",
> > .flags = CFTYPE_NOT_ON_ROOT,
> > .seq_show = cgroup_memory_pressure_show,
> > + .write = cgroup_memory_pressure_write,
> > + .poll = cgroup_pressure_poll,
> > + .release = cgroup_pressure_release,
> > },
> > {
> > .name = "cpu.pressure",
> > .flags = CFTYPE_NOT_ON_ROOT,
> > .seq_show = cgroup_cpu_pressure_show,
> > + .write = cgroup_cpu_pressure_write,
> > + .poll = cgroup_pressure_poll,
> > + .release = cgroup_pressure_release,
> > },
> > -#endif
> > +#endif /* CONFIG_PSI */
> > { } /* terminate */
> > };
> >
> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > index c366503ba135..fefb98f19a80 100644
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -4,6 +4,9 @@
> > * Copyright (c) 2018 Facebook, Inc.
> > * Author: Johannes Weiner <[email protected]>
> > *
> > + * Polling support by Suren Baghdasaryan <[email protected]>
> > + * Copyright (c) 2018 Google, Inc.
> > + *
> > * When CPU, memory and IO are contended, tasks experience delays that
> > * reduce throughput and introduce latencies into the workload. Memory
> > * and IO contention, in addition, can cause a full loss of forward
> > @@ -126,11 +129,16 @@
> >
> > #include <linux/sched/loadavg.h>
> > #include <linux/seq_file.h>
> > +#include <linux/eventfd.h>
> > #include <linux/proc_fs.h>
> > #include <linux/seqlock.h>
> > +#include <linux/uaccess.h>
> > #include <linux/cgroup.h>
> > #include <linux/module.h>
> > #include <linux/sched.h>
> > +#include <linux/ctype.h>
> > +#include <linux/file.h>
> > +#include <linux/poll.h>
> > #include <linux/psi.h>
> > #include "sched.h"
> >
> > @@ -150,11 +158,16 @@ static int __init setup_psi(char *str)
> > __setup("psi=", setup_psi);
> >
> > /* Running averages - we need to be higher-res than loadavg */
> > -#define PSI_FREQ (2*HZ+1) /* 2 sec intervals */
> > +#define PSI_FREQ (2*HZ+1UL) /* 2 sec intervals */
> > #define EXP_10s 1677 /* 1/exp(2s/10s) as fixed-point */
> > #define EXP_60s 1981 /* 1/exp(2s/60s) */
> > #define EXP_300s 2034 /* 1/exp(2s/300s) */
> >
> > +/* PSI trigger definitions */
> > +#define PSI_TRIG_MIN_WIN_US 500000 /* Min window size is 500ms */
> > +#define PSI_TRIG_MAX_WIN_US 10000000 /* Max window size is 10s */
> > +#define PSI_TRIG_UPDATES_PER_WIN 10 /* 10 updates per window */
>
> To me, it's rather long.
> How about WINDOW_MIN_US, WINDOW_MAX_US, UPDATES_PER_WINDOW?

Will change.

> > +
> > /* Sampling frequency in nanoseconds */
> > static u64 psi_period __read_mostly;
> >
> > @@ -173,8 +186,17 @@ static void group_init(struct psi_group *group)
> > for_each_possible_cpu(cpu)
> > seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> > group->avg_next_update = sched_clock() + psi_period;
> > + atomic_set(&group->polling, 0);
> > INIT_DELAYED_WORK(&group->clock_work, psi_update_work);
> > mutex_init(&group->update_lock);
> > + /* Init trigger-related members */
> > + INIT_LIST_HEAD(&group->triggers);
> > + memset(group->nr_triggers, 0, sizeof(group->nr_triggers));
> > + group->trigger_mask = 0;
> > + group->trigger_min_period = U32_MAX;
> > + memset(group->polling_total, 0, sizeof(group->polling_total));
> > + group->polling_next_update = ULLONG_MAX;
> > + group->polling_until = 0;
> > }
> >
> > void __init psi_init(void)
> > @@ -209,10 +231,11 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
> > }
> > }
> >
> > -static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
> > +static u32 get_recent_times(struct psi_group *group, int cpu, u32 *times)
>
> Rather awkward to return change_mask when we consider function name as
> get_recent_times It would be better to add additional parameter
> instead of return value.

SGTM. Will double-check with Johannes and if he agrees will change.


> > {
> > struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
> > u64 now, state_start;
> > + u32 change_mask = 0;
> > enum psi_states s;
> > unsigned int seq;
> > u32 state_mask;
> > @@ -245,7 +268,11 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
> > groupc->times_prev[s] = times[s];
> >
> > times[s] = delta;
> > + if (delta)
> > + change_mask |= (1 << s);
> > }
> > +
> > + return change_mask;
> > }
> >
> > static void calc_avgs(unsigned long avg[3], int missed_periods,
> > @@ -268,17 +295,14 @@ static void calc_avgs(unsigned long avg[3], int missed_periods,
> > avg[2] = calc_load(avg[2], EXP_300s, pct);
> > }
> >
> > -static bool update_stats(struct psi_group *group)
> > +static u32 collect_percpu_times(struct psi_group *group)
>
> Not sure it's a good idea to add "implementation facility" in here.
> How about update_stall_time with additional parameter of
> "[changed|updated]_states?
>
> IOW,
> static void update_stall_times(struct psi_group *group, u32 *changed_states)

SGTM. Will double-check with Johannes and if he agrees will change.

> > {
> > u64 deltas[NR_PSI_STATES - 1] = { 0, };
> > - unsigned long missed_periods = 0;
> > unsigned long nonidle_total = 0;
> > - u64 now, expires, period;
> > + u32 change_mask = 0;
> > int cpu;
> > int s;
> >
> > - mutex_lock(&group->update_lock);
> > -
> > /*
> > * Collect the per-cpu time buckets and average them into a
> > * single time sample that is normalized to wallclock time.
> > @@ -291,7 +315,7 @@ static bool update_stats(struct psi_group *group)
> > u32 times[NR_PSI_STATES];
> > u32 nonidle;
> >
> > - get_recent_times(group, cpu, times);
> > + change_mask |= get_recent_times(group, cpu, times);
> >
> > nonidle = nsecs_to_jiffies(times[PSI_NONIDLE]);
> > nonidle_total += nonidle;
> > @@ -316,11 +340,18 @@ static bool update_stats(struct psi_group *group)
> > for (s = 0; s < NR_PSI_STATES - 1; s++)
> > group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL));
> >
> > + return change_mask;
> > +}
> > +
> > +static u64 calculate_averages(struct psi_group *group, u64 now)
>
> time?
>
> The function name is still awkward to me.
> If someone see this function, he will expect return value as "average", not next_update.
> If we want to have next_update as return value, it would better to rename *update_avgs*.
>

Ok. If we change this to update_avgs() I think it also makes sense to
change poll_triggers() into update_triggers(). That way in the
psi_update_work() we will have:

next_avg = update_avgs()
if polling:
next_poll = update_triggers()
mod_delayed_work(min(next_avg, next_poll))

> > +{
> > + unsigned long missed_periods = 0;
> > + u64 expires, period;
> > + u64 avg_next_update;
> > + int s;
> > +
> > /* avgX= */
> > - now = sched_clock();
> > expires = group->avg_next_update;
> > - if (now < expires)
> > - goto out;
> > if (now - expires > psi_period)
> > missed_periods = div_u64(now - expires, psi_period);
> >
> > @@ -331,7 +362,7 @@ static bool update_stats(struct psi_group *group)
> > * But the deltas we sample out of the per-cpu buckets above
> > * are based on the actual time elapsing between clock ticks.
> > */
> > - group->avg_next_update = expires + ((1 + missed_periods) * psi_period);
> > + avg_next_update = expires + ((1 + missed_periods) * psi_period);
> > period = now - (group->avg_last_update + (missed_periods * psi_period));
> > group->avg_last_update = now;
> >
> > @@ -361,20 +392,237 @@ static bool update_stats(struct psi_group *group)
> > group->avg_total[s] += sample;
> > calc_avgs(group->avg[s], missed_periods, sample, period);
> > }
> > -out:
> > - mutex_unlock(&group->update_lock);
> > - return nonidle_total;
> > +
> > + return avg_next_update;
> > +}
> > +
> > +/* Trigger tracking window manupulations */
> > +static void window_init(struct psi_window *win, u64 now, u64 value)
> > +{
> > + win->start_value = value;
> > + win->start_time = now;
> > + win->per_win_growth = 0;
> > +}
> > +
> > +/*
> > + * PSI growth tracking window update and growth calculation routine.
>
> Let's add empty line here.

Will do.

> > + * This approximates a sliding tracking window by interpolating
> > + * partially elapsed windows using historical growth data from the
> > + * previous intervals. This minimizes memory requirements (by not storing
> > + * all the intermediate values in the previous window) and simplifies
> > + * the calculations. It works well because PSI signal changes only in
> > + * positive direction and over relatively small window sizes the growth
> > + * is close to linear.
> > + */
> > +static u64 window_update(struct psi_window *win, u64 now, u64 value)
>
> Hope to change now as just time for function.
>
> Insetad of value, couldn't we use more concrete naming?
> Maybe stall_time or just stall?

Will do.

> > +{
> > + u64 interval;
>
> elapsed?

SGTM

> > + u64 growth;
> > +
> > + interval = now - win->start_time;
> > + growth = value - win->start_value;
> > + /*
> > + * After each tracking window passes win->start_value and
> > + * win->start_time get reset and win->per_win_growth stores
> > + * the average per-window growth of the previous window.
> > + * win->per_win_growth is then used to interpolate additional
> > + * growth from the previous window assuming it was linear.
> > + */
> > + if (interval > win->size) {
> > + win->per_win_growth = growth;
> > + win->start_value = value;
> > + win->start_time = now;
>
> We can use window_init via adding per_win_growth in the function
> parameter. Maybe, window_reset would be better function name.

Good idea, will do.

> > + } else {
> > + u32 unelapsed;
>
> remaining? remained?

SGTM

> > +
> > + unelapsed = win->size - interval;
> > + growth += div_u64(win->per_win_growth * unelapsed, win->size);
> > + }
> > +
> > + return growth;
> > +}
> > +
> > +static void init_triggers(struct psi_group *group, u64 now)
> > +{
> > + struct psi_trigger *t;
> > +
> > + list_for_each_entry(t, &group->triggers, node)
> > + window_init(&t->win, now, group->total[t->state]);
> > + memcpy(group->polling_total, group->total,
> > + sizeof(group->polling_total));
> > + group->polling_next_update = now + group->trigger_min_period;
> > +}
> > +
> > +static u64 poll_triggers(struct psi_group *group, u64 now)
>
> How about update_[poll|trigger]_stat?

update_triggers() to be symmetric with update_avgs() as I suggested
in an earlier comment?

> > +{
> > + struct psi_trigger *t;
> > + bool new_stall = false;
> > +
> > + /*
> > + * On subsequent updates, calculate growth deltas and let
> > + * watchers know when their specified thresholds are exceeded.
> > + */
> > + list_for_each_entry(t, &group->triggers, node) {
> > + u64 growth;
> > +
> > + /* Check for stall activity */
> > + if (group->polling_total[t->state] == group->total[t->state])
> > + continue;
> > +
> > + /*
> > + * Multiple triggers might be looking at the same state,
> > + * remember to update group->polling_total[] once we've
> > + * been through all of them. Also remember to extend the
> > + * polling time if we see new stall activity.
> > + */
> > + new_stall = true;
> > +
> > + /* Calculate growth since last update */
> > + growth = window_update(&t->win, now, group->total[t->state]);
> > + if (growth < t->threshold)
> > + continue;
> > +
> > + /* Limit event signaling to once per window */
> > + if (now < t->last_event_time + t->win.size)
> > + continue;
> > +
> > + /* Generate an event */
> > + if (cmpxchg(&t->event, 0, 1) == 0)
> > + wake_up_interruptible(&t->event_wait);
> > + t->last_event_time = now;
> > + }
> > +
> > + if (new_stall) {
> > + memcpy(group->polling_total, group->total,
> > + sizeof(group->polling_total));
> > + }
> > +
> > + return now + group->trigger_min_period;
> > }
> >
> > +/*
> > + * psi_update_work represents slowpath accounting part while psi_group_change
> > + * represents hotpath part. There are two potential races between them:
> > + * 1. Changes to group->polling when slowpath checks for new stall, then hotpath
> > + * records new stall and then slowpath resets group->polling flag. This leads
> > + * to the exit from the polling mode while monitored state is still changing.
> > + * 2. Slowpath overwriting an immediate update scheduled from the hotpath with
> > + * a regular update further in the future and missing the immediate update.
> > + * Both races are handled with a retry cycle in the slowpath:
> > + *
> > + * HOTPATH: | SLOWPATH:
> > + * |
> > + * A) times[cpu] += delta | E) delta = times[*]
> > + * B) start_poll = (delta[poll_mask] &&| if delta[poll_mask]:
> > + * cmpxchg(g->polling, 0, 1) == 0)| F) polling_until = now + grace_period
> > + * if start_poll: | if now > polling_until:
> > + * C) mod_delayed_work(1) | if g->polling:
> > + * else if !delayed_work_pending():| G) g->polling = polling = 0
> > + * D) schedule_delayed_work(PSI_FREQ)| smp_mb
> > + * | H) goto SLOWPATH
> > + * | else:
> > + * | if !g->polling:
> > + * | I) g->polling = polling = 1
> > + * | J) if delta && first_pass:
> > + * | next_avg = calculate_averages()
> > + * | if polling:
> > + * | next_poll = poll_triggers()
> > + * | if (delta && first_pass) || polling:
> > + * | K) mod_delayed_work(
> > + * | min(next_avg, next_poll))
> > + * | if !polling:
> > + * | first_pass = false
> > + * | L) goto SLOWPATH
> > + *
> > + * Race #1 is represented by (EABGD) sequence in which case slowpath deactivates
> > + * polling mode because it misses new monitored stall and hotpath doesn't
> > + * activate it because at (B) g->polling is not yet reset by slowpath in (G).
> > + * This race is handled by the (H) retry, which in the race described above
> > + * results in the new sequence of (EABGDHEIK) that reactivates polling mode.
> > + *
> > + * Race #2 is represented by polling==false && (JABCK) sequence which overwrites
> > + * immediate update scheduled at (C) with a later (next_avg) update scheduled at
> > + * (K). This race is handled by the (L) retry which results in the new sequence
> > + * of polling==false && (JABCKLEIK) that reactivates polling mode and
> > + * reschedules the next polling update (next_poll).
> > + *
> > + * Note that retries can't result in an infinite loop because retry #1 happens
> > + * only during polling reactivation and retry #2 happens only on the first pass.
> > + * Constant reactivations are impossible because polling will stay active for at
> > + * least grace_period. Worst case scenario involves two retries (HEJKLE)
> > + */
> > static void psi_update_work(struct work_struct *work)
> > {
> > struct delayed_work *dwork;
> > struct psi_group *group;
> > + bool first_pass = true;
> > + u64 next_update;
> > + u32 change_mask;
>
> How about [changed|updated]_states?

Will do.

> > + int polling;
> > bool nonidle;
> > + u64 now;
> >
> > dwork = to_delayed_work(work);
> > group = container_of(dwork, struct psi_group, clock_work);
> >
> > + now = sched_clock();
> > +
> > + mutex_lock(&group->update_lock);
> > +
> > +retry:
> > + change_mask = collect_percpu_times(group);
> > + polling = atomic_read(&group->polling);
> > +
> > + if (change_mask & group->trigger_mask) {
> > + /* Initialize trigger windows when entering polling mode */
> > + if (now > group->polling_until)
> > + init_triggers(group, now);
> > +
> > + /*
> > + * Keep the monitor active for at least the duration of the
> > + * minimum tracking window as long as monitor states are
> > + * changing. This prevents frequent changes to polling flag
> > + * when system bounces in and out of stall states.
> > + */
> > + group->polling_until = now +
> > + group->trigger_min_period * PSI_TRIG_UPDATES_PER_WIN;
> > + }
> > +
> > + /* Handle polling flag transitions */
> > + if (now > group->polling_until) {
> > + if (polling) {
> > + group->polling_next_update = ULLONG_MAX;
> > + polling = 0;
> > + atomic_set(&group->polling, polling);
> > + /*
> > + * Memory barrier is needed to order group->polling
> > + * write before times[] read in collect_percpu_times()
> > + */
> > + smp_mb__after_atomic();
> > + /*
> > + * Check if we missed stall recorded by hotpath while
> > + * polling flag was set to 1 causing hotpath to skip
> > + * entering polling mode
> > + */
> > + goto retry;
> > + }
> > + } else {
> > + if (!polling) {
> > + /*
> > + * This can happen as a fixup in the retry cycle after
> > + * new stall is discovered
> > + */
> > + polling = 1;
> > + atomic_set(&group->polling, polling);
> > + }
> > + }
> > + /*
> > + * At this point group->polling race with hotpath is resolved and
> > + * we rely on local polling flag ignoring possible further changes
> > + * to group->polling
> > + */
> > +
> > + nonidle = (change_mask & (1 << PSI_NONIDLE));
> > /*
> > * If there is task activity, periodically fold the per-cpu
> > * times and feed samples into the running averages. If things
> > @@ -382,20 +630,32 @@ static void psi_update_work(struct work_struct *work)
> > * Once restarted, we'll catch up the running averages in one
> > * go - see calc_avgs() and missed_periods.
> > */
> > + if (nonidle && first_pass) {
> > + if (now >= group->avg_next_update)
> > + group->avg_next_update = calculate_averages(group, now);
> >
> > - nonidle = update_stats(group);
> > -
> > - if (nonidle) {
> > - unsigned long delay = 0;
> > - u64 now;
> > -
> > - now = sched_clock();
> > - if (group->avg_next_update > now) {
> > - delay = nsecs_to_jiffies(
> > - group->avg_next_update - now) + 1;
> > + if (now >= group->polling_next_update)
> > + group->polling_next_update = poll_triggers(group, now);
> > + }
> > + if ((nonidle && first_pass) || polling) {
> > + /* Calculate closest update time */
> > + next_update = min(group->polling_next_update,
> > + group->avg_next_update);
> > + mod_delayed_work(system_wq, dwork, nsecs_to_jiffies(
> > + next_update - now) + 1);
> > + if (!polling) {
> > + /*
> > + * We might have overwritten an immediate update
> > + * scheduled from the hotpath with a longer regular
> > + * update (group->avg_next_update). Execute second pass
> > + * retry to discover that and resume polling.
> > + */
> > + first_pass = false;
> > + goto retry;
> > }
> > - schedule_delayed_work(dwork, delay);
> > }
> > +
> > + mutex_unlock(&group->update_lock);
> > }
> >
> > static void record_times(struct psi_group_cpu *groupc, int cpu,
> > @@ -492,8 +752,25 @@ static void psi_group_change(struct psi_group *group, int cpu,
> >
> > write_seqcount_end(&groupc->seq);
> >
> > - if (!delayed_work_pending(&group->clock_work))
> > - schedule_delayed_work(&group->clock_work, PSI_FREQ);
> > + /*
> > + * Polling flag resets to 0 at the max rate of once per update window
> > + * (at least 500ms interval). smp_wmb is required after group->polling
> > + * 0-to-1 transition to order groupc->times and group->polling writes
> > + * because stall detection logic in the slowpath relies on groupc->times
> > + * changing before group->polling. Explicit smp_wmb is missing because
> > + * cmpxchg() implies smp_mb.
> > + */
> > + if ((state_mask & group->trigger_mask) &&
> > + atomic_cmpxchg(&group->polling, 0, 1) == 0) {
> > + /*
> > + * Start polling immediately even if the work is already
> > + * scheduled
> > + */
> > + mod_delayed_work(system_wq, &group->clock_work, 1);
> > + } else {
> > + if (!delayed_work_pending(&group->clock_work))
> > + schedule_delayed_work(&group->clock_work, PSI_FREQ);
> > + }
> > }
> >
> > static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
> > @@ -640,6 +917,8 @@ void psi_cgroup_free(struct cgroup *cgroup)
> >
> > cancel_delayed_work_sync(&cgroup->psi.clock_work);
> > free_percpu(cgroup->psi.pcpu);
> > + /* All triggers must be removed by now by psi_trigger_destroy */
> > + WARN_ONCE(cgroup->psi.trigger_mask, "psi: trigger leak\n");
> > }
> >
> > /**
> > @@ -699,7 +978,11 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
> > if (static_branch_likely(&psi_disabled))
> > return -EOPNOTSUPP;
> >
> > - update_stats(group);
> > + /* Update averages before reporting them */
> > + mutex_lock(&group->update_lock);
> > + collect_percpu_times(group);
> > + calculate_averages(group, sched_clock());
> > + mutex_unlock(&group->update_lock);
> >
> > for (full = 0; full < 2 - (res == PSI_CPU); full++) {
> > unsigned long avg[3];
> > @@ -751,25 +1034,240 @@ static int psi_cpu_open(struct inode *inode, struct file *file)
> > return single_open(file, psi_cpu_show, NULL);
> > }
> >
> > +ssize_t psi_trigger_parse(char *buf, size_t nbytes, enum psi_res res,
> > + enum psi_states *pstate, u32 *pthreshold_us, u32 *pwin_sz_us)
> > +{
> > + enum psi_states state;
> > + u32 threshold_us;
> > + u32 win_sz_us;
> > +
> > + if (sscanf(buf, "some %u %u", &threshold_us, &win_sz_us) == 2)
> > + state = PSI_IO_SOME + res * 2;
> > + else if (sscanf(buf, "full %u %u", &threshold_us, &win_sz_us) == 2)
> > + state = PSI_IO_FULL + res * 2;
> > + else
> > + return -EINVAL;
> > +
> > + if (state >= PSI_NONIDLE)
> > + return -EINVAL;
> > +
> > + if (win_sz_us < PSI_TRIG_MIN_WIN_US ||
> > + win_sz_us > PSI_TRIG_MAX_WIN_US)
> > + return -EINVAL;
> > +
> > + /* Check threshold */
> > + if (threshold_us == 0 || threshold_us > win_sz_us)
> > + return -EINVAL;
> > +
> > + *pstate = state;
> > + *pthreshold_us = threshold_us;
> > + *pwin_sz_us = win_sz_us;
> > + return 0;
> > +}
> > +
> > +struct psi_trigger *psi_trigger_create(struct psi_group *group,
> > + enum psi_states state, u32 threshold_us, u32 win_sz_us)
> > +{
> > + struct psi_trigger *t;
> > +
> > + if (static_branch_likely(&psi_disabled))
> > + return ERR_PTR(-EOPNOTSUPP);
> > +
> > + t = kzalloc(sizeof(*t), GFP_KERNEL);
>
> You reset most of field again in here. Then, I think we don't need to use
> kzalloc in here. let's use kmalloc and initialize other field, too.

SGTM.

> > + if (!t)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + t->group = group;
> > + t->state = state;
> > + t->threshold = threshold_us * NSEC_PER_USEC;
> > + t->win.size = win_sz_us * NSEC_PER_USEC;
> > + t->event = 0;
> > + init_waitqueue_head(&t->event_wait);
> > +
> > + mutex_lock(&group->update_lock);
> > +
> > + list_add(&t->node, &group->triggers);
> > + group->trigger_min_period = min(group->trigger_min_period,
> > + div_u64(t->win.size, PSI_TRIG_UPDATES_PER_WIN));
> > + group->nr_triggers[t->state]++;
> > + group->trigger_mask |= (1 << t->state);
> > +
> > + mutex_unlock(&group->update_lock);
> > +
> > + return t;
> > +}
> > +
>
> --
> You received this message because you are subscribed to the Google Groups "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>

2019-01-29 10:45:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] psi: introduce psi monitor

On Thu, Jan 24, 2019 at 01:15:18PM -0800, Suren Baghdasaryan wrote:
> static void psi_update_work(struct work_struct *work)
> {
> struct delayed_work *dwork;
> struct psi_group *group;
> + bool first_pass = true;
> + u64 next_update;
> + u32 change_mask;
> + int polling;
> bool nonidle;
> + u64 now;
>
> dwork = to_delayed_work(work);
> group = container_of(dwork, struct psi_group, clock_work);
>
> + now = sched_clock();
> +
> + mutex_lock(&group->update_lock);

actually acquiring a mutex can take a fairly long while; would it not
make more sense to take the @now timestanp _after_ it, instead of
before?

2019-01-29 12:39:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] psi: introduce psi monitor

On Thu, Jan 24, 2019 at 01:15:18PM -0800, Suren Baghdasaryan wrote:
> + atomic_set(&group->polling, polling);
> + /*
> + * Memory barrier is needed to order group->polling
> + * write before times[] read in collect_percpu_times()
> + */
> + smp_mb__after_atomic();

That's broken, smp_mb__{before,after}_atomic() can only be used on
atomic RmW operations, something atomic_set() is _not_.

2019-01-29 15:17:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] psi: introduce psi monitor

On Tue, Jan 29, 2019 at 01:38:43PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 24, 2019 at 01:15:18PM -0800, Suren Baghdasaryan wrote:
> > + atomic_set(&group->polling, polling);
> > + /*
> > + * Memory barrier is needed to order group->polling
> > + * write before times[] read in collect_percpu_times()
> > + */
> > + smp_mb__after_atomic();
>
> That's broken, smp_mb__{before,after}_atomic() can only be used on
> atomic RmW operations, something atomic_set() is _not_.

Also; the comment should explain _why_ not only what.

2019-01-29 18:05:59

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] psi: introduce psi monitor

Thanks for review Peter!

On Tue, Jan 29, 2019 at 2:44 AM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Jan 24, 2019 at 01:15:18PM -0800, Suren Baghdasaryan wrote:
> > static void psi_update_work(struct work_struct *work)
> > {
> > struct delayed_work *dwork;
> > struct psi_group *group;
> > + bool first_pass = true;
> > + u64 next_update;
> > + u32 change_mask;
> > + int polling;
> > bool nonidle;
> > + u64 now;
> >
> > dwork = to_delayed_work(work);
> > group = container_of(dwork, struct psi_group, clock_work);
> >
> > + now = sched_clock();
> > +
> > + mutex_lock(&group->update_lock);
>
> actually acquiring a mutex can take a fairly long while; would it not
> make more sense to take the @now timestanp _after_ it, instead of
> before?

Yes, that makes sense. As long as *now* is set before the *retry*
label, otherwise the retry mechanism would get even more complicated
to understand with floating *now* timestamp. Will move the assignment
right after the mutex_lock()

> --
> You received this message because you are subscribed to the Google Groups "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>

2019-01-29 18:19:06

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] psi: introduce psi monitor

On Tue, Jan 29, 2019 at 4:38 AM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Jan 24, 2019 at 01:15:18PM -0800, Suren Baghdasaryan wrote:
> > + atomic_set(&group->polling, polling);
> > + /*
> > + * Memory barrier is needed to order group->polling
> > + * write before times[] read in collect_percpu_times()
> > + */
> > + smp_mb__after_atomic();
>
> That's broken, smp_mb__{before,after}_atomic() can only be used on
> atomic RmW operations, something atomic_set() is _not_.

Oh, I didn't realize that. After reading the following example from
atomic_ops.txt I was under impression that smp_mb__after_atomic()
would make changes done by atomic_set() visible:

/* All memory operations before this call will
* be globally visible before the clear_bit().
*/
smp_mb__before_atomic();
clear_bit( ... );
/* The clear_bit() will be visible before all
* subsequent memory operations.
*/
smp_mb__after_atomic();

but I'm probably missing something. Is there a more detailed
description of these rules anywhere else?

Meanwhile I'll change smp_mb__after_atomic() into smp_mb(). Would that
fix the ordering?

> --
> You received this message because you are subscribed to the Google Groups "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>

2019-01-29 18:27:09

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] psi: introduce psi monitor

On Tue, Jan 29, 2019 at 7:16 AM Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Jan 29, 2019 at 01:38:43PM +0100, Peter Zijlstra wrote:
> > On Thu, Jan 24, 2019 at 01:15:18PM -0800, Suren Baghdasaryan wrote:
> > > + atomic_set(&group->polling, polling);
> > > + /*
> > > + * Memory barrier is needed to order group->polling
> > > + * write before times[] read in collect_percpu_times()
> > > + */
> > > + smp_mb__after_atomic();
> >
> > That's broken, smp_mb__{before,after}_atomic() can only be used on
> > atomic RmW operations, something atomic_set() is _not_.
>
> Also; the comment should explain _why_ not only what.

Got it. Will change the comment to something like:

Order group->polling=0 before reading times[] in
collect_percpu_times() to detect possible race with hotpath that
modifies times[] before it sets group->polling=1 (see Race #1 in the
comments at the top).

> --
> You received this message because you are subscribed to the Google Groups "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>

2019-01-29 18:33:51

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] psi: introduce psi monitor

On Tue, Jan 29, 2019 at 10:18 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Tue, Jan 29, 2019 at 4:38 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Thu, Jan 24, 2019 at 01:15:18PM -0800, Suren Baghdasaryan wrote:
> > > + atomic_set(&group->polling, polling);
> > > + /*
> > > + * Memory barrier is needed to order group->polling
> > > + * write before times[] read in collect_percpu_times()
> > > + */
> > > + smp_mb__after_atomic();
> >
> > That's broken, smp_mb__{before,after}_atomic() can only be used on
> > atomic RmW operations, something atomic_set() is _not_.
>
> Oh, I didn't realize that. After reading the following example from
> atomic_ops.txt I was under impression that smp_mb__after_atomic()
> would make changes done by atomic_set() visible:
>
> /* All memory operations before this call will
> * be globally visible before the clear_bit().
> */
> smp_mb__before_atomic();
> clear_bit( ... );
> /* The clear_bit() will be visible before all
> * subsequent memory operations.
> */
> smp_mb__after_atomic();
>
> but I'm probably missing something. Is there a more detailed
> description of these rules anywhere else?

I was referred to memory-barriers.txt that explains this clearly
stating that "These functions do not imply memory barriers.". Thanks
for noticing! Will change to smp_mb().

> Meanwhile I'll change smp_mb__after_atomic() into smp_mb(). Would that
> fix the ordering?
>
> > --
> > You received this message because you are subscribed to the Google Groups "kernel-team" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> >

2019-01-29 18:36:19

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] psi: introduce psi monitor

Hi Minchan,

good to see your name on the lists again :)

On Tue, Jan 29, 2019 at 08:53:58AM +0900, Minchan Kim wrote:
> On Thu, Jan 24, 2019 at 01:15:18PM -0800, Suren Baghdasaryan wrote:
> > @@ -68,6 +69,50 @@ struct psi_group_cpu {
> > u32 times_prev[NR_PSI_STATES] ____cacheline_aligned_in_smp;
> > };
> >
> > +/* PSI growth tracking window */
> > +struct psi_window {
> > + /* Window size in ns */
> > + u64 size;
>
> As rest of field are time, how about "interval" instead of size?

If it were "size" on its own, I would agree, but "window size" is an
existing term that works pretty well here. "window interval" wouldn't.

> > + /* Start time of the current window in ns */
> > + u64 start_time;
> > +
> > + /* Value at the start of the window */
> > + u64 start_value;
>
> "value" is rather vague. starting_stall?

I'm not a fan of using stall here, because it reads like an event,
when it's really a time interval we're interested in.

For an abstraction that samples time intervals, value seems like a
pretty good, straight-forward name...

> > +
> > + /* Value growth per previous window(s) */
> > + u64 per_win_growth;
>
> Rather than per, prev would be more meaninful, I think.
> How about prev_win_stall?

Agreed on the "per", but still not loving the stall. How about
prev_delta? prev_growth?

> > +struct psi_trigger {
> > + /* PSI state being monitored by the trigger */
> > + enum psi_states state;
> > +
> > + /* User-spacified threshold in ns */
> > + u64 threshold;
> > +
> > + /* List node inside triggers list */
> > + struct list_head node;
> > +
> > + /* Backpointer needed during trigger destruction */
> > + struct psi_group *group;
> > +
> > + /* Wait queue for polling */
> > + wait_queue_head_t event_wait;
> > +
> > + /* Pending event flag */
> > + int event;
> > +
> > + /* Tracking window */
> > + struct psi_window win;
> > +
> > + /*
> > + * Time last event was generated. Used for rate-limiting
> > + * events to one per window
> > + */
> > + u64 last_event_time;
> > +};
> > +
> > struct psi_group {
> > /* Protects data used by the aggregator */
> > struct mutex update_lock;
> > @@ -75,6 +120,8 @@ struct psi_group {
> > /* Per-cpu task state & time tracking */
> > struct psi_group_cpu __percpu *pcpu;
> >
> > + /* Periodic work control */
> > + atomic_t polling;
> > struct delayed_work clock_work;
> >
> > /* Total stall times observed */
> > @@ -85,6 +132,18 @@ struct psi_group {
> > u64 avg_last_update;
> > u64 avg_next_update;
> > unsigned long avg[NR_PSI_STATES - 1][3];
> > +
> > + /* Configured polling triggers */
> > + struct list_head triggers;
> > + u32 nr_triggers[NR_PSI_STATES - 1];
> > + u32 trigger_mask;
>
> This is a state we have an interest.
> How about trigger_states?

Sounds good to me, I'd also switch change_mask below to
changed_states:

if (changed_states & trigger_states)
/* engage! */

[ After reading the rest, I see Minchan proposed the same. ]

> > + u64 trigger_min_period;
> > +
> > + /* Polling state */
> > + /* Total stall times at the start of monitor activation */
> > + u64 polling_total[NR_PSI_STATES - 1];
> > + u64 polling_next_update;
> > + u64 polling_until;
> > };
> >
> > #else /* CONFIG_PSI */
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index e8cd12c6a553..de3ac22a5e23 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -3464,7 +3464,101 @@ static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
> > {
> > return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU);
> > }
> > -#endif
> > +
> > +static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
> > + size_t nbytes, enum psi_res res)
> > +{
> > + enum psi_states state;
> > + struct psi_trigger *old;
> > + struct psi_trigger *new;
> > + struct cgroup *cgrp;
> > + u32 threshold_us;
> > + u32 win_sz_us;
>
> window_us?

We don't really encode units in variables in the rest of the code,
maybe we can drop it here as well.

Btw, it looks like the original reason for splitting up trigger_parse
and trigger_create seems gone from the code. Can we merge them again
and keep all those details out of the filesystem ->write methods?

new = psi_trigger_create(group, buf, nbytes, res);

> > + ssize_t ret;
> > +
> > + cgrp = cgroup_kn_lock_live(of->kn, false);
> > + if (!cgrp)
> > + return -ENODEV;
> > +
> > + cgroup_get(cgrp);
> > + cgroup_kn_unlock(of->kn);
> > +
> > + ret = psi_trigger_parse(buf, nbytes, res,
> > + &state, &threshold_us, &win_sz_us);
> > + if (ret) {
> > + cgroup_put(cgrp);
> > + return ret;
> > + }
> > +
> > + new = psi_trigger_create(&cgrp->psi,
> > + state, threshold_us, win_sz_us);
> > + if (IS_ERR(new)) {
> > + cgroup_put(cgrp);
> > + return PTR_ERR(new);
> > + }
> > +
> > + old = of->priv;
> > + rcu_assign_pointer(of->priv, new);
> > + if (old) {
> > + synchronize_rcu();
> > + psi_trigger_destroy(old);
> > + }
> > +
> > + cgroup_put(cgrp);
> > +
> > + return nbytes;
> > +}
> > +
> > +static ssize_t cgroup_io_pressure_write(struct kernfs_open_file *of,
> > + char *buf, size_t nbytes,
> > + loff_t off)
> > +{
> > + return cgroup_pressure_write(of, buf, nbytes, PSI_IO);
> > +}
> > +
> > +static ssize_t cgroup_memory_pressure_write(struct kernfs_open_file *of,
> > + char *buf, size_t nbytes,
> > + loff_t off)
> > +{
> > + return cgroup_pressure_write(of, buf, nbytes, PSI_MEM);
> > +}
> > +
> > +static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
> > + char *buf, size_t nbytes,
> > + loff_t off)
> > +{
> > + return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
> > +}
> > +
> > +static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
> > + poll_table *pt)
> > +{
> > + struct psi_trigger *t;
> > + __poll_t ret;
> > +
> > + rcu_read_lock();
> > + t = rcu_dereference(of->priv);
> > + if (t)
> > + ret = psi_trigger_poll(t, of->file, pt);
> > + else
> > + ret = DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI;
> > + rcu_read_unlock();
> > +
> > + return ret;
> > +}
> > +
> > +static void cgroup_pressure_release(struct kernfs_open_file *of)
> > +{
> > + struct psi_trigger *t = of->priv;
> > +
> > + if (!t)
> > + return;
> > +
> > + rcu_assign_pointer(of->priv, NULL);
> > + synchronize_rcu();
> > + psi_trigger_destroy(t);
> > +}
> > +#endif /* CONFIG_PSI */
> >
> > static int cgroup_file_open(struct kernfs_open_file *of)
> > {
> > @@ -4619,18 +4713,27 @@ static struct cftype cgroup_base_files[] = {
> > .name = "io.pressure",
> > .flags = CFTYPE_NOT_ON_ROOT,
> > .seq_show = cgroup_io_pressure_show,
> > + .write = cgroup_io_pressure_write,
> > + .poll = cgroup_pressure_poll,
> > + .release = cgroup_pressure_release,
> > },
> > {
> > .name = "memory.pressure",
> > .flags = CFTYPE_NOT_ON_ROOT,
> > .seq_show = cgroup_memory_pressure_show,
> > + .write = cgroup_memory_pressure_write,
> > + .poll = cgroup_pressure_poll,
> > + .release = cgroup_pressure_release,
> > },
> > {
> > .name = "cpu.pressure",
> > .flags = CFTYPE_NOT_ON_ROOT,
> > .seq_show = cgroup_cpu_pressure_show,
> > + .write = cgroup_cpu_pressure_write,
> > + .poll = cgroup_pressure_poll,
> > + .release = cgroup_pressure_release,
> > },
> > -#endif
> > +#endif /* CONFIG_PSI */
> > { } /* terminate */
> > };
> >
> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > index c366503ba135..fefb98f19a80 100644
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -4,6 +4,9 @@
> > * Copyright (c) 2018 Facebook, Inc.
> > * Author: Johannes Weiner <[email protected]>
> > *
> > + * Polling support by Suren Baghdasaryan <[email protected]>
> > + * Copyright (c) 2018 Google, Inc.
> > + *
> > * When CPU, memory and IO are contended, tasks experience delays that
> > * reduce throughput and introduce latencies into the workload. Memory
> > * and IO contention, in addition, can cause a full loss of forward
> > @@ -126,11 +129,16 @@
> >
> > #include <linux/sched/loadavg.h>
> > #include <linux/seq_file.h>
> > +#include <linux/eventfd.h>
> > #include <linux/proc_fs.h>
> > #include <linux/seqlock.h>
> > +#include <linux/uaccess.h>
> > #include <linux/cgroup.h>
> > #include <linux/module.h>
> > #include <linux/sched.h>
> > +#include <linux/ctype.h>
> > +#include <linux/file.h>
> > +#include <linux/poll.h>
> > #include <linux/psi.h>
> > #include "sched.h"
> >
> > @@ -150,11 +158,16 @@ static int __init setup_psi(char *str)
> > __setup("psi=", setup_psi);
> >
> > /* Running averages - we need to be higher-res than loadavg */
> > -#define PSI_FREQ (2*HZ+1) /* 2 sec intervals */
> > +#define PSI_FREQ (2*HZ+1UL) /* 2 sec intervals */
> > #define EXP_10s 1677 /* 1/exp(2s/10s) as fixed-point */
> > #define EXP_60s 1981 /* 1/exp(2s/60s) */
> > #define EXP_300s 2034 /* 1/exp(2s/300s) */
> >
> > +/* PSI trigger definitions */
> > +#define PSI_TRIG_MIN_WIN_US 500000 /* Min window size is 500ms */
> > +#define PSI_TRIG_MAX_WIN_US 10000000 /* Max window size is 10s */
> > +#define PSI_TRIG_UPDATES_PER_WIN 10 /* 10 updates per window */
>
> To me, it's rather long.
> How about WINDOW_MIN_US, WINDOW_MAX_US, UPDATES_PER_WINDOW?

Sounds good to me too. I'm just ambivalent on the _US suffix. Dealer's
choice, though.

> > +
> > /* Sampling frequency in nanoseconds */
> > static u64 psi_period __read_mostly;
> >
> > @@ -173,8 +186,17 @@ static void group_init(struct psi_group *group)
> > for_each_possible_cpu(cpu)
> > seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> > group->avg_next_update = sched_clock() + psi_period;
> > + atomic_set(&group->polling, 0);
> > INIT_DELAYED_WORK(&group->clock_work, psi_update_work);
> > mutex_init(&group->update_lock);
> > + /* Init trigger-related members */
> > + INIT_LIST_HEAD(&group->triggers);
> > + memset(group->nr_triggers, 0, sizeof(group->nr_triggers));
> > + group->trigger_mask = 0;
> > + group->trigger_min_period = U32_MAX;
> > + memset(group->polling_total, 0, sizeof(group->polling_total));
> > + group->polling_next_update = ULLONG_MAX;
> > + group->polling_until = 0;
> > }
> >
> > void __init psi_init(void)
> > @@ -209,10 +231,11 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
> > }
> > }
> >
> > -static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
> > +static u32 get_recent_times(struct psi_group *group, int cpu, u32 *times)
>
> Rather awkward to return change_mask when we consider function name as
> get_recent_times It would be better to add additional parameter
> instead of return value.

Good suggestion, I have to agree this would be nicer.

> > {
> > struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
> > u64 now, state_start;
> > + u32 change_mask = 0;
> > enum psi_states s;
> > unsigned int seq;
> > u32 state_mask;
> > @@ -245,7 +268,11 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
> > groupc->times_prev[s] = times[s];
> >
> > times[s] = delta;
> > + if (delta)
> > + change_mask |= (1 << s);
> > }
> > +
> > + return change_mask;
> > }
> >
> > static void calc_avgs(unsigned long avg[3], int missed_periods,
> > @@ -268,17 +295,14 @@ static void calc_avgs(unsigned long avg[3], int missed_periods,
> > avg[2] = calc_load(avg[2], EXP_300s, pct);
> > }
> >
> > -static bool update_stats(struct psi_group *group)
> > +static u32 collect_percpu_times(struct psi_group *group)
>
> Not sure it's a good idea to add "implementation facility" in here.
> How about update_stall_time with additional parameter of
> "[changed|updated]_states?
>
> IOW,
> static void update_stall_times(struct psi_group *group, u32 *changed_states)

I disagree on this one. collect_percpu_times() isn't too detailed of a
name, but it does reflect the complexity/cost of the function and the
structure that is being aggregated, which is a good thing.

But the return-by-parameter is a good idea.

> > u64 deltas[NR_PSI_STATES - 1] = { 0, };
> > - unsigned long missed_periods = 0;
> > unsigned long nonidle_total = 0;
> > - u64 now, expires, period;
> > + u32 change_mask = 0;
> > int cpu;
> > int s;
> >
> > - mutex_lock(&group->update_lock);
> > -
> > /*
> > * Collect the per-cpu time buckets and average them into a
> > * single time sample that is normalized to wallclock time.
> > @@ -291,7 +315,7 @@ static bool update_stats(struct psi_group *group)
> > u32 times[NR_PSI_STATES];
> > u32 nonidle;
> >
> > - get_recent_times(group, cpu, times);
> > + change_mask |= get_recent_times(group, cpu, times);
> >
> > nonidle = nsecs_to_jiffies(times[PSI_NONIDLE]);
> > nonidle_total += nonidle;
> > @@ -316,11 +340,18 @@ static bool update_stats(struct psi_group *group)
> > for (s = 0; s < NR_PSI_STATES - 1; s++)
> > group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL));
> >
> > + return change_mask;
> > +}
> > +
> > +static u64 calculate_averages(struct psi_group *group, u64 now)
>
> time?
>
> The function name is still awkward to me.
> If someone see this function, he will expect return value as "average", not next_update.
> If we want to have next_update as return value, it would better to rename *update_avgs*.

update_averages() would be nice, agreed.

But I disagree on the now -> time. time is really vague - could be a
random timestamp or a period. We use "now" everywhere in this code to
mean the current time (cpu clock in cpu-local paths, sched clock for
global stuff), so let's keep it here as well.

> > +{
> > + unsigned long missed_periods = 0;
> > + u64 expires, period;
> > + u64 avg_next_update;
> > + int s;
> > +
> > /* avgX= */
> > - now = sched_clock();
> > expires = group->avg_next_update;
> > - if (now < expires)
> > - goto out;
> > if (now - expires > psi_period)
> > missed_periods = div_u64(now - expires, psi_period);
> >
> > @@ -331,7 +362,7 @@ static bool update_stats(struct psi_group *group)
> > * But the deltas we sample out of the per-cpu buckets above
> > * are based on the actual time elapsing between clock ticks.
> > */
> > - group->avg_next_update = expires + ((1 + missed_periods) * psi_period);
> > + avg_next_update = expires + ((1 + missed_periods) * psi_period);
> > period = now - (group->avg_last_update + (missed_periods * psi_period));
> > group->avg_last_update = now;
> >
> > @@ -361,20 +392,237 @@ static bool update_stats(struct psi_group *group)
> > group->avg_total[s] += sample;
> > calc_avgs(group->avg[s], missed_periods, sample, period);
> > }
> > -out:
> > - mutex_unlock(&group->update_lock);
> > - return nonidle_total;
> > +
> > + return avg_next_update;
> > +}
> > +
> > +/* Trigger tracking window manupulations */
> > +static void window_init(struct psi_window *win, u64 now, u64 value)
> > +{
> > + win->start_value = value;
> > + win->start_time = now;
> > + win->per_win_growth = 0;
> > +}
> > +
> > +/*
> > + * PSI growth tracking window update and growth calculation routine.
>
> Let's add empty line here.

Agreed.

> > + * This approximates a sliding tracking window by interpolating
> > + * partially elapsed windows using historical growth data from the
> > + * previous intervals. This minimizes memory requirements (by not storing
> > + * all the intermediate values in the previous window) and simplifies
> > + * the calculations. It works well because PSI signal changes only in
> > + * positive direction and over relatively small window sizes the growth
> > + * is close to linear.
> > + */
> > +static u64 window_update(struct psi_window *win, u64 now, u64 value)
>
> Hope to change now as just time for function.
>
> Insetad of value, couldn't we use more concrete naming?
> Maybe stall_time or just stall?

Disagreed on both :)

> > +{
> > + u64 interval;
>
> elapsed?

Hm, elapsed is a bit better, but how about period? We use that in the
averages code for the same functionality.

> > + u64 growth;
> > +
> > + interval = now - win->start_time;
> > + growth = value - win->start_value;
> > + /*
> > + * After each tracking window passes win->start_value and
> > + * win->start_time get reset and win->per_win_growth stores
> > + * the average per-window growth of the previous window.
> > + * win->per_win_growth is then used to interpolate additional
> > + * growth from the previous window assuming it was linear.
> > + */
> > + if (interval > win->size) {
> > + win->per_win_growth = growth;
> > + win->start_value = value;
> > + win->start_time = now;
>
> We can use window_init via adding per_win_growth in the function
> parameter. Maybe, window_reset would be better function name.
>
> > + } else {
> > + u32 unelapsed;
>
> remaining? remained?

Yup, much better.

> > +
> > + unelapsed = win->size - interval;
> > + growth += div_u64(win->per_win_growth * unelapsed, win->size);
> > + }
> > +
> > + return growth;
> > +}
> > +
> > +static void init_triggers(struct psi_group *group, u64 now)
> > +{
> > + struct psi_trigger *t;
> > +
> > + list_for_each_entry(t, &group->triggers, node)
> > + window_init(&t->win, now, group->total[t->state]);
> > + memcpy(group->polling_total, group->total,
> > + sizeof(group->polling_total));
> > + group->polling_next_update = now + group->trigger_min_period;
> > +}
> > +
> > +static u64 poll_triggers(struct psi_group *group, u64 now)
>
> How about update_[poll|trigger]_stat?

update_triggers()? The signature already matches the update_averages()
one, so we might as well do the same thing there I guess.

> > +{
> > + struct psi_trigger *t;
> > + bool new_stall = false;
> > +
> > + /*
> > + * On subsequent updates, calculate growth deltas and let
> > + * watchers know when their specified thresholds are exceeded.
> > + */
> > + list_for_each_entry(t, &group->triggers, node) {
> > + u64 growth;
> > +
> > + /* Check for stall activity */
> > + if (group->polling_total[t->state] == group->total[t->state])
> > + continue;
> > +
> > + /*
> > + * Multiple triggers might be looking at the same state,
> > + * remember to update group->polling_total[] once we've
> > + * been through all of them. Also remember to extend the
> > + * polling time if we see new stall activity.
> > + */
> > + new_stall = true;
> > +
> > + /* Calculate growth since last update */
> > + growth = window_update(&t->win, now, group->total[t->state]);
> > + if (growth < t->threshold)
> > + continue;
> > +
> > + /* Limit event signaling to once per window */
> > + if (now < t->last_event_time + t->win.size)
> > + continue;
> > +
> > + /* Generate an event */
> > + if (cmpxchg(&t->event, 0, 1) == 0)
> > + wake_up_interruptible(&t->event_wait);
> > + t->last_event_time = now;
> > + }
> > +
> > + if (new_stall) {
> > + memcpy(group->polling_total, group->total,
> > + sizeof(group->polling_total));
> > + }
> > +
> > + return now + group->trigger_min_period;
> > }
> >
> > +/*
> > + * psi_update_work represents slowpath accounting part while psi_group_change
> > + * represents hotpath part. There are two potential races between them:
> > + * 1. Changes to group->polling when slowpath checks for new stall, then hotpath
> > + * records new stall and then slowpath resets group->polling flag. This leads
> > + * to the exit from the polling mode while monitored state is still changing.
> > + * 2. Slowpath overwriting an immediate update scheduled from the hotpath with
> > + * a regular update further in the future and missing the immediate update.
> > + * Both races are handled with a retry cycle in the slowpath:
> > + *
> > + * HOTPATH: | SLOWPATH:
> > + * |
> > + * A) times[cpu] += delta | E) delta = times[*]
> > + * B) start_poll = (delta[poll_mask] &&| if delta[poll_mask]:
> > + * cmpxchg(g->polling, 0, 1) == 0)| F) polling_until = now + grace_period
> > + * if start_poll: | if now > polling_until:
> > + * C) mod_delayed_work(1) | if g->polling:
> > + * else if !delayed_work_pending():| G) g->polling = polling = 0
> > + * D) schedule_delayed_work(PSI_FREQ)| smp_mb
> > + * | H) goto SLOWPATH
> > + * | else:
> > + * | if !g->polling:
> > + * | I) g->polling = polling = 1
> > + * | J) if delta && first_pass:
> > + * | next_avg = calculate_averages()
> > + * | if polling:
> > + * | next_poll = poll_triggers()
> > + * | if (delta && first_pass) || polling:
> > + * | K) mod_delayed_work(
> > + * | min(next_avg, next_poll))
> > + * | if !polling:
> > + * | first_pass = false
> > + * | L) goto SLOWPATH
> > + *
> > + * Race #1 is represented by (EABGD) sequence in which case slowpath deactivates
> > + * polling mode because it misses new monitored stall and hotpath doesn't
> > + * activate it because at (B) g->polling is not yet reset by slowpath in (G).
> > + * This race is handled by the (H) retry, which in the race described above
> > + * results in the new sequence of (EABGDHEIK) that reactivates polling mode.
> > + *
> > + * Race #2 is represented by polling==false && (JABCK) sequence which overwrites
> > + * immediate update scheduled at (C) with a later (next_avg) update scheduled at
> > + * (K). This race is handled by the (L) retry which results in the new sequence
> > + * of polling==false && (JABCKLEIK) that reactivates polling mode and
> > + * reschedules the next polling update (next_poll).
> > + *
> > + * Note that retries can't result in an infinite loop because retry #1 happens
> > + * only during polling reactivation and retry #2 happens only on the first pass.
> > + * Constant reactivations are impossible because polling will stay active for at
> > + * least grace_period. Worst case scenario involves two retries (HEJKLE)
> > + */
> > static void psi_update_work(struct work_struct *work)
> > {
> > struct delayed_work *dwork;
> > struct psi_group *group;
> > + bool first_pass = true;
> > + u64 next_update;
> > + u32 change_mask;
>
> How about [changed|updated]_states?

changed_states sounds good to me.

2019-01-29 19:10:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] psi: introduce psi monitor

On Tue, Jan 29, 2019 at 10:18:20AM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 29, 2019 at 4:38 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Thu, Jan 24, 2019 at 01:15:18PM -0800, Suren Baghdasaryan wrote:
> > > + atomic_set(&group->polling, polling);
> > > + /*
> > > + * Memory barrier is needed to order group->polling
> > > + * write before times[] read in collect_percpu_times()
> > > + */
> > > + smp_mb__after_atomic();
> >
> > That's broken, smp_mb__{before,after}_atomic() can only be used on
> > atomic RmW operations, something atomic_set() is _not_.
>
> Oh, I didn't realize that. After reading the following example from
> atomic_ops.txt

That document it woefully out of date (and I should double check, but I
think we can actually delete it now). Please see
Documentation/atomic_t.txt

> I was under impression that smp_mb__after_atomic()
> would make changes done by atomic_set() visible:
>
> /* All memory operations before this call will
> * be globally visible before the clear_bit().
> */
> smp_mb__before_atomic();
> clear_bit( ... );
> /* The clear_bit() will be visible before all
> * subsequent memory operations.
> */
> smp_mb__after_atomic();
>
> but I'm probably missing something. Is there a more detailed
> description of these rules anywhere else?

See atomic_t.txt; but the difference is that clear_bit() is a RmW, while
atomic_set() is just a plain store.

> Meanwhile I'll change smp_mb__after_atomic() into smp_mb(). Would that
> fix the ordering?

It would work here; but I'm still trying to actually understand all
this. So while the detail would be fine, I'm not ready to judge the
over-all thing.