2023-03-09 17:10:54

by Domenico Cerasuolo

[permalink] [raw]
Subject: [PATCH 0/4] sched/psi: Allow unprivileged PSI polling

PSI offers 2 mechanisms to get information about a specific resource
pressure. One is reading from /proc/pressure/<resource>, which gives
average pressures aggregated every 2s. The other is creating a pollable
fd for a specific resource and cgroup.

The trigger creation requires CAP_SYS_RESOURCE, and gives the
possibility to pick specific time window and threshold, spawing an RT
thread to aggregate the data.

Systemd would like to provide containers the option to monitor pressure
on their own cgroup and sub-cgroups. For example, if systemd launches a
container that itself then launches services, the container should have
the ability to poll() for pressure in individual services. But neither
the container nor the services are privileged.

The series is implemented in 4 steps in order to reduce the noise of
the change.

Domenico Cerasuolo (4):
sched/psi: rearrange polling code in preparation
sched/psi: rename existing poll members in preparation
sched/psi: extract update_triggers side effect
sched/psi: allow unprivileged polling of N*2s period

Documentation/accounting/psi.rst | 4 +
include/linux/psi.h | 2 +-
include/linux/psi_types.h | 43 ++--
kernel/cgroup/cgroup.c | 2 +-
kernel/sched/psi.c | 412 ++++++++++++++++---------------
5 files changed, 250 insertions(+), 213 deletions(-)

--
2.34.1



2023-03-09 17:11:03

by Domenico Cerasuolo

[permalink] [raw]
Subject: [PATCH 3/4] sched/psi: extract update_triggers side effect

The update of rtpoll_total inside update_triggers can be moved out of
the function since changed_states has the same information as the
update_total flag used in the function. Besides the simplification of
the function, with the next patch it would become an unwanted side
effect needed only for PSI_POLL.

Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Domenico Cerasuolo <[email protected]>
---
kernel/sched/psi.c | 20 +++++---------------
1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index a3d0b5cf797a..476941c1cbea 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -433,7 +433,6 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
static u64 update_triggers(struct psi_group *group, u64 now)
{
struct psi_trigger *t;
- bool update_total = false;
u64 *total = group->total[PSI_POLL];

/*
@@ -456,14 +455,6 @@ static u64 update_triggers(struct psi_group *group, u64 now)
* events without dropping any).
*/
if (new_stall) {
- /*
- * Multiple triggers might be looking at the same state,
- * remember to update group->polling_total[] once we've
- * been through all of them. Also remember to extend the
- * polling time if we see new stall activity.
- */
- update_total = true;
-
/* Calculate growth since last update */
growth = window_update(&t->win, now, total[t->state]);
if (!t->pending_event) {
@@ -484,11 +475,6 @@ static u64 update_triggers(struct psi_group *group, u64 now)
/* Reset threshold breach flag once event got generated */
t->pending_event = false;
}
-
- if (update_total)
- memcpy(group->rtpoll_total, total,
- sizeof(group->rtpoll_total));
-
return now + group->rtpoll_min_period;
}

@@ -686,8 +672,12 @@ static void psi_rtpoll_work(struct psi_group *group)
goto out;
}

- if (now >= group->rtpoll_next_update)
+ if (now >= group->rtpoll_next_update) {
group->rtpoll_next_update = update_triggers(group, now);
+ if (changed_states & group->rtpoll_states)
+ memcpy(group->rtpoll_total, group->total[PSI_POLL],
+ sizeof(group->rtpoll_total));
+ }

psi_schedule_rtpoll_work(group,
nsecs_to_jiffies(group->rtpoll_next_update - now) + 1,
--
2.34.1


2023-03-09 17:11:11

by Domenico Cerasuolo

[permalink] [raw]
Subject: [PATCH 1/4] sched/psi: rearrange polling code in preparation

Move a few functions up in the file to avoid forward declaration needed
in the patch implementing unprivileged PSI triggers.

Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Domenico Cerasuolo <[email protected]>
---
kernel/sched/psi.c | 196 ++++++++++++++++++++++-----------------------
1 file changed, 98 insertions(+), 98 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 02e011cabe91..fe9269f1d2a4 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -384,92 +384,6 @@ static void collect_percpu_times(struct psi_group *group,
*pchanged_states = changed_states;
}

-static u64 update_averages(struct psi_group *group, u64 now)
-{
- unsigned long missed_periods = 0;
- u64 expires, period;
- u64 avg_next_update;
- int s;
-
- /* avgX= */
- expires = group->avg_next_update;
- if (now - expires >= psi_period)
- missed_periods = div_u64(now - expires, psi_period);
-
- /*
- * The periodic clock tick can get delayed for various
- * reasons, especially on loaded systems. To avoid clock
- * drift, we schedule the clock in fixed psi_period intervals.
- * But the deltas we sample out of the per-cpu buckets above
- * are based on the actual time elapsing between clock ticks.
- */
- avg_next_update = expires + ((1 + missed_periods) * psi_period);
- period = now - (group->avg_last_update + (missed_periods * psi_period));
- group->avg_last_update = now;
-
- for (s = 0; s < NR_PSI_STATES - 1; s++) {
- u32 sample;
-
- sample = group->total[PSI_AVGS][s] - group->avg_total[s];
- /*
- * Due to the lockless sampling of the time buckets,
- * recorded time deltas can slip into the next period,
- * which under full pressure can result in samples in
- * excess of the period length.
- *
- * We don't want to report non-sensical pressures in
- * excess of 100%, nor do we want to drop such events
- * on the floor. Instead we punt any overage into the
- * future until pressure subsides. By doing this we
- * don't underreport the occurring pressure curve, we
- * just report it delayed by one period length.
- *
- * The error isn't cumulative. As soon as another
- * delta slips from a period P to P+1, by definition
- * it frees up its time T in P.
- */
- if (sample > period)
- sample = period;
- group->avg_total[s] += sample;
- calc_avgs(group->avg[s], missed_periods, sample, period);
- }
-
- return avg_next_update;
-}
-
-static void psi_avgs_work(struct work_struct *work)
-{
- struct delayed_work *dwork;
- struct psi_group *group;
- u32 changed_states;
- u64 now;
-
- dwork = to_delayed_work(work);
- group = container_of(dwork, struct psi_group, avgs_work);
-
- mutex_lock(&group->avgs_lock);
-
- now = sched_clock();
-
- collect_percpu_times(group, PSI_AVGS, &changed_states);
- /*
- * If there is task activity, periodically fold the per-cpu
- * times and feed samples into the running averages. If things
- * are idle and there is no data to process, stop the clock.
- * Once restarted, we'll catch up the running averages in one
- * go - see calc_avgs() and missed_periods.
- */
- if (now >= group->avg_next_update)
- group->avg_next_update = update_averages(group, now);
-
- if (changed_states & PSI_STATE_RESCHEDULE) {
- schedule_delayed_work(dwork, nsecs_to_jiffies(
- group->avg_next_update - now) + 1);
- }
-
- mutex_unlock(&group->avgs_lock);
-}
-
/* Trigger tracking window manipulations */
static void window_reset(struct psi_window *win, u64 now, u64 value,
u64 prev_growth)
@@ -516,18 +430,6 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
return growth;
}

-static void init_triggers(struct psi_group *group, u64 now)
-{
- struct psi_trigger *t;
-
- list_for_each_entry(t, &group->triggers, node)
- window_reset(&t->win, now,
- group->total[PSI_POLL][t->state], 0);
- memcpy(group->polling_total, group->total[PSI_POLL],
- sizeof(group->polling_total));
- group->polling_next_update = now + group->poll_min_period;
-}
-
static u64 update_triggers(struct psi_group *group, u64 now)
{
struct psi_trigger *t;
@@ -590,6 +492,104 @@ static u64 update_triggers(struct psi_group *group, u64 now)
return now + group->poll_min_period;
}

+static u64 update_averages(struct psi_group *group, u64 now)
+{
+ unsigned long missed_periods = 0;
+ u64 expires, period;
+ u64 avg_next_update;
+ int s;
+
+ /* avgX= */
+ expires = group->avg_next_update;
+ if (now - expires >= psi_period)
+ missed_periods = div_u64(now - expires, psi_period);
+
+ /*
+ * The periodic clock tick can get delayed for various
+ * reasons, especially on loaded systems. To avoid clock
+ * drift, we schedule the clock in fixed psi_period intervals.
+ * But the deltas we sample out of the per-cpu buckets above
+ * are based on the actual time elapsing between clock ticks.
+ */
+ avg_next_update = expires + ((1 + missed_periods) * psi_period);
+ period = now - (group->avg_last_update + (missed_periods * psi_period));
+ group->avg_last_update = now;
+
+ for (s = 0; s < NR_PSI_STATES - 1; s++) {
+ u32 sample;
+
+ sample = group->total[PSI_AVGS][s] - group->avg_total[s];
+ /*
+ * Due to the lockless sampling of the time buckets,
+ * recorded time deltas can slip into the next period,
+ * which under full pressure can result in samples in
+ * excess of the period length.
+ *
+ * We don't want to report non-sensical pressures in
+ * excess of 100%, nor do we want to drop such events
+ * on the floor. Instead we punt any overage into the
+ * future until pressure subsides. By doing this we
+ * don't underreport the occurring pressure curve, we
+ * just report it delayed by one period length.
+ *
+ * The error isn't cumulative. As soon as another
+ * delta slips from a period P to P+1, by definition
+ * it frees up its time T in P.
+ */
+ if (sample > period)
+ sample = period;
+ group->avg_total[s] += sample;
+ calc_avgs(group->avg[s], missed_periods, sample, period);
+ }
+
+ return avg_next_update;
+}
+
+static void psi_avgs_work(struct work_struct *work)
+{
+ struct delayed_work *dwork;
+ struct psi_group *group;
+ u32 changed_states;
+ u64 now;
+
+ dwork = to_delayed_work(work);
+ group = container_of(dwork, struct psi_group, avgs_work);
+
+ mutex_lock(&group->avgs_lock);
+
+ now = sched_clock();
+
+ collect_percpu_times(group, PSI_AVGS, &changed_states);
+ /*
+ * If there is task activity, periodically fold the per-cpu
+ * times and feed samples into the running averages. If things
+ * are idle and there is no data to process, stop the clock.
+ * Once restarted, we'll catch up the running averages in one
+ * go - see calc_avgs() and missed_periods.
+ */
+ if (now >= group->avg_next_update)
+ group->avg_next_update = update_averages(group, now);
+
+ if (changed_states & PSI_STATE_RESCHEDULE) {
+ schedule_delayed_work(dwork, nsecs_to_jiffies(
+ group->avg_next_update - now) + 1);
+ }
+
+ mutex_unlock(&group->avgs_lock);
+}
+
+static void init_triggers(struct psi_group *group, u64 now)
+{
+ struct psi_trigger *t;
+
+ list_for_each_entry(t, &group->triggers, node)
+ window_reset(&t->win, now,
+ group->total[PSI_POLL][t->state], 0);
+ memcpy(group->polling_total, group->total[PSI_POLL],
+ sizeof(group->polling_total));
+ group->polling_next_update = now + group->poll_min_period;
+}
+
/* Schedule polling if it's not already scheduled or forced. */
static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay,
bool force)
--
2.34.1


2023-03-09 17:11:18

by Domenico Cerasuolo

[permalink] [raw]
Subject: [PATCH 4/4] sched/psi: allow unprivileged polling of N*2s period

PSI offers 2 mechanisms to get information about a specific resource
pressure. One is reading from /proc/pressure/<resource>, which gives
average pressures aggregated every 2s. The other is creating a pollable
fd for a specific resource and cgroup.

The trigger creation requires CAP_SYS_RESOURCE, and gives the
possibility to pick specific time window and threshold, spawing an RT
thread to aggregate the data.

Systemd would like to provide containers the option to monitor pressure
on their own cgroup and sub-cgroups. For example, if systemd launches a
container that itself then launches services, the container should have
the ability to poll() for pressure in individual services. But neither
the container nor the services are privileged.

This patch implements a mechanism to allow unprivileged users to create
pressure triggers. The difference with privileged triggers creation is
that unprivileged ones must have a time window that's a multiple of 2s.
This is so that we can avoid unrestricted spawning of rt threads, and
use instead the same aggregation mechanism done for the averages, which
runs indipendently of any triggers.

Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Domenico Cerasuolo <[email protected]>
---
Documentation/accounting/psi.rst | 4 ++
include/linux/psi.h | 2 +-
include/linux/psi_types.h | 7 +++
kernel/cgroup/cgroup.c | 2 +-
kernel/sched/psi.c | 105 ++++++++++++++++++++-----------
5 files changed, 83 insertions(+), 37 deletions(-)

diff --git a/Documentation/accounting/psi.rst b/Documentation/accounting/psi.rst
index 5e40b3f437f9..df6062eb3abb 100644
--- a/Documentation/accounting/psi.rst
+++ b/Documentation/accounting/psi.rst
@@ -105,6 +105,10 @@ prevent overly frequent polling. Max limit is chosen as a high enough number
after which monitors are most likely not needed and psi averages can be used
instead.

+Unprivileged users can also create monitors, with the only limitation that the
+window size must be a multiple of 2s, in order to prevent excessive resource
+usage.
+
When activated, psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when system is
bouncing in and out of the stall state.
diff --git a/include/linux/psi.h b/include/linux/psi.h
index b029a847def1..ab26200c2803 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -24,7 +24,7 @@ void psi_memstall_leave(unsigned long *flags);

int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
struct psi_trigger *psi_trigger_create(struct psi_group *group,
- char *buf, enum psi_res res);
+ char *buf, enum psi_res res, struct file *file);
void psi_trigger_destroy(struct psi_trigger *t);

__poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 1819afa8b198..e439f411c23b 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -151,6 +151,9 @@ struct psi_trigger {

/* Deferred event(s) from previous ratelimit window */
bool pending_event;
+
+ /* Used to differentiate destruction action*/
+ enum psi_aggregators aggregator;
};

struct psi_group {
@@ -171,6 +174,10 @@ struct psi_group {
/* Aggregator work control */
struct delayed_work avgs_work;

+ /* Unprivileged triggers against N*PSI_FREQ windows */
+ struct list_head triggers;
+ u32 nr_triggers[NR_PSI_STATES - 1];
+
/* Total stall times and sampled pressure averages */
u64 total[NR_PSI_AGGREGATORS][NR_PSI_STATES - 1];
unsigned long avg[NR_PSI_STATES - 1][3];
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 935e8121b21e..dead36969bba 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3761,7 +3761,7 @@ static ssize_t pressure_write(struct kernfs_open_file *of, char *buf,
}

psi = cgroup_psi(cgrp);
- new = psi_trigger_create(psi, buf, res);
+ new = psi_trigger_create(psi, buf, res, of->file);
if (IS_ERR(new)) {
cgroup_put(cgrp);
return PTR_ERR(new);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 476941c1cbea..fde91aa4e55f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -186,9 +186,9 @@ static void group_init(struct psi_group *group)
seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
group->avg_last_update = sched_clock();
group->avg_next_update = group->avg_last_update + psi_period;
- INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
mutex_init(&group->avgs_lock);
- /* Init trigger-related members */
+
+ /* Init rtpoll trigger-related members */
atomic_set(&group->rtpoll_scheduled, 0);
mutex_init(&group->rtpoll_trigger_lock);
INIT_LIST_HEAD(&group->rtpoll_triggers);
@@ -197,6 +197,11 @@ static void group_init(struct psi_group *group)
init_waitqueue_head(&group->rtpoll_wait);
timer_setup(&group->rtpoll_timer, poll_timer_fn, 0);
rcu_assign_pointer(group->rtpoll_task, NULL);
+
+ /* Init avg trigger-related members */
+ INIT_LIST_HEAD(&group->triggers);
+ memset(group->nr_triggers, 0, sizeof(group->nr_triggers));
+ INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
}

void __init psi_init(void)
@@ -430,20 +435,23 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
return growth;
}

-static u64 update_triggers(struct psi_group *group, u64 now)
+static u64 update_triggers(struct psi_group *group, u64 now, enum psi_aggregators aggregator)
{
struct psi_trigger *t;
- u64 *total = group->total[PSI_POLL];
+ u64 *total = group->total[aggregator];
+ struct list_head *triggers = aggregator == PSI_AVGS ? &group->triggers
+ : &group->rtpoll_triggers;
+ u64 *aggregator_total = aggregator == PSI_AVGS ? group->avg_total : group->rtpoll_total;

/*
* On subsequent updates, calculate growth deltas and let
* watchers know when their specified thresholds are exceeded.
*/
- list_for_each_entry(t, &group->rtpoll_triggers, node) {
+ list_for_each_entry(t, triggers, node) {
u64 growth;
bool new_stall;

- new_stall = group->rtpoll_total[t->state] != total[t->state];
+ new_stall = aggregator_total[t->state] != total[t->state];

/* Check for stall activity or a previous threshold breach */
if (!new_stall && !t->pending_event)
@@ -553,8 +561,10 @@ static void psi_avgs_work(struct work_struct *work)
* Once restarted, we'll catch up the running averages in one
* go - see calc_avgs() and missed_periods.
*/
- if (now >= group->avg_next_update)
+ if (now >= group->avg_next_update) {
+ update_triggers(group, now, PSI_AVGS);
group->avg_next_update = update_averages(group, now);
+ }

if (changed_states & PSI_STATE_RESCHEDULE) {
schedule_delayed_work(dwork, nsecs_to_jiffies(
@@ -571,9 +581,17 @@ static void init_triggers(struct psi_group *group, u64 now)
list_for_each_entry(t, &group->rtpoll_triggers, node)
window_reset(&t->win, now,
group->total[PSI_POLL][t->state], 0);
+
+ list_for_each_entry(t, &group->triggers, node)
+ window_reset(&t->win, now,
+ group->total[PSI_AVGS][t->state], 0);
+
memcpy(group->rtpoll_total, group->total[PSI_POLL],
sizeof(group->rtpoll_total));
group->rtpoll_next_update = now + group->rtpoll_min_period;
+
+ memcpy(group->avg_total, group->total[PSI_AVGS],
+ sizeof(group->avg_total));
}

/* Schedule polling if it's not already scheduled or forced. */
@@ -673,7 +691,7 @@ static void psi_rtpoll_work(struct psi_group *group)
}

if (now >= group->rtpoll_next_update) {
- group->rtpoll_next_update = update_triggers(group, now);
+ group->rtpoll_next_update = update_triggers(group, now, PSI_POLL);
if (changed_states & group->rtpoll_states)
memcpy(group->rtpoll_total, group->total[PSI_POLL],
sizeof(group->rtpoll_total));
@@ -1243,16 +1261,19 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
}

struct psi_trigger *psi_trigger_create(struct psi_group *group,
- char *buf, enum psi_res res)
+ char *buf, enum psi_res res, struct file *file)
{
struct psi_trigger *t;
enum psi_states state;
u32 threshold_us;
+ bool privileged;
u32 window_us;

if (static_branch_likely(&psi_disabled))
return ERR_PTR(-EOPNOTSUPP);

+ privileged = cap_raised(file->f_cred->cap_effective, CAP_SYS_RESOURCE);
+
if (sscanf(buf, "some %u %u", &threshold_us, &window_us) == 2)
state = PSI_IO_SOME + res * 2;
else if (sscanf(buf, "full %u %u", &threshold_us, &window_us) == 2)
@@ -1272,6 +1293,13 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
window_us > WINDOW_MAX_US)
return ERR_PTR(-EINVAL);

+ /*
+ * Unprivileged users can only use 2s windows so that averages aggregation
+ * work is used, and no RT threads need to be spawned.
+ */
+ if (!privileged && window_us % 2000000)
+ return ERR_PTR(-EINVAL);
+
/* Check threshold */
if (threshold_us == 0 || threshold_us > window_us)
return ERR_PTR(-EINVAL);
@@ -1291,10 +1319,11 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
t->last_event_time = 0;
init_waitqueue_head(&t->event_wait);
t->pending_event = false;
+ t->aggregator = privileged ? PSI_POLL : PSI_AVGS;

mutex_lock(&group->rtpoll_trigger_lock);

- if (!rcu_access_pointer(group->rtpoll_task)) {
+ if (privileged && !rcu_access_pointer(group->rtpoll_task)) {
struct task_struct *task;

task = kthread_create(psi_rtpoll_worker, group, "psimon");
@@ -1308,11 +1337,16 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
rcu_assign_pointer(group->rtpoll_task, task);
}

- list_add(&t->node, &group->rtpoll_triggers);
- group->rtpoll_min_period = min(group->rtpoll_min_period,
- div_u64(t->win.size, UPDATES_PER_WINDOW));
- group->rtpoll_nr_triggers[t->state]++;
- group->rtpoll_states |= (1 << t->state);
+ if (privileged) {
+ list_add(&t->node, &group->rtpoll_triggers);
+ group->rtpoll_min_period = min(group->rtpoll_min_period,
+ div_u64(t->win.size, UPDATES_PER_WINDOW));
+ group->rtpoll_nr_triggers[t->state]++;
+ group->rtpoll_states |= (1 << t->state);
+ } else {
+ list_add(&t->node, &group->triggers);
+ group->nr_triggers[t->state]++;
+ }

mutex_unlock(&group->rtpoll_trigger_lock);

@@ -1346,22 +1380,26 @@ void psi_trigger_destroy(struct psi_trigger *t)
u64 period = ULLONG_MAX;

list_del(&t->node);
- group->rtpoll_nr_triggers[t->state]--;
- if (!group->rtpoll_nr_triggers[t->state])
- group->rtpoll_states &= ~(1 << t->state);
- /* reset min update period for the remaining triggers */
- list_for_each_entry(tmp, &group->rtpoll_triggers, node)
- period = min(period, div_u64(tmp->win.size,
- UPDATES_PER_WINDOW));
- group->rtpoll_min_period = period;
- /* Destroy rtpoll_task when the last trigger is destroyed */
- if (group->rtpoll_states == 0) {
- group->rtpoll_until = 0;
- task_to_destroy = rcu_dereference_protected(
- group->rtpoll_task,
- lockdep_is_held(&group->rtpoll_trigger_lock));
- rcu_assign_pointer(group->rtpoll_task, NULL);
- del_timer(&group->rtpoll_timer);
+ if (t->aggregator == PSI_AVGS) {
+ group->nr_triggers[t->state]--;
+ } else {
+ group->rtpoll_nr_triggers[t->state]--;
+ if (!group->rtpoll_nr_triggers[t->state])
+ group->rtpoll_states &= ~(1 << t->state);
+ /* reset min update period for the remaining triggers */
+ list_for_each_entry(tmp, &group->rtpoll_triggers, node)
+ period = min(period, div_u64(tmp->win.size,
+ UPDATES_PER_WINDOW));
+ group->rtpoll_min_period = period;
+ /* Destroy rtpoll_task when the last trigger is destroyed */
+ if (group->rtpoll_states == 0) {
+ group->rtpoll_until = 0;
+ task_to_destroy = rcu_dereference_protected(
+ group->rtpoll_task,
+ lockdep_is_held(&group->rtpoll_trigger_lock));
+ rcu_assign_pointer(group->rtpoll_task, NULL);
+ del_timer(&group->rtpoll_timer);
+ }
}
}

@@ -1428,9 +1466,6 @@ static int psi_cpu_show(struct seq_file *m, void *v)

static int psi_open(struct file *file, int (*psi_show)(struct seq_file *, void *))
{
- if (file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE))
- return -EPERM;
-
return single_open(file, psi_show, NULL);
}

@@ -1480,7 +1515,7 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,
return -EBUSY;
}

- new = psi_trigger_create(&psi_system, buf, res);
+ new = psi_trigger_create(&psi_system, buf, res, file);
if (IS_ERR(new)) {
mutex_unlock(&seq->lock);
return PTR_ERR(new);
--
2.34.1


2023-03-13 15:30:00

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 0/4] sched/psi: Allow unprivileged PSI polling

On Thu, Mar 9, 2023 at 9:08 AM Domenico Cerasuolo
<[email protected]> wrote:
>
> PSI offers 2 mechanisms to get information about a specific resource
> pressure. One is reading from /proc/pressure/<resource>, which gives
> average pressures aggregated every 2s. The other is creating a pollable
> fd for a specific resource and cgroup.
>
> The trigger creation requires CAP_SYS_RESOURCE, and gives the
> possibility to pick specific time window and threshold, spawing an RT
> thread to aggregate the data.
>
> Systemd would like to provide containers the option to monitor pressure
> on their own cgroup and sub-cgroups. For example, if systemd launches a
> container that itself then launches services, the container should have
> the ability to poll() for pressure in individual services. But neither
> the container nor the services are privileged.

This sounds like an interesting usecase. I'll need to take a closer
look once I'm back from vacation later this week.
Thanks!

>
> The series is implemented in 4 steps in order to reduce the noise of
> the change.
>
> Domenico Cerasuolo (4):
> sched/psi: rearrange polling code in preparation
> sched/psi: rename existing poll members in preparation
> sched/psi: extract update_triggers side effect
> sched/psi: allow unprivileged polling of N*2s period
>
> Documentation/accounting/psi.rst | 4 +
> include/linux/psi.h | 2 +-
> include/linux/psi_types.h | 43 ++--
> kernel/cgroup/cgroup.c | 2 +-
> kernel/sched/psi.c | 412 ++++++++++++++++---------------
> 5 files changed, 250 insertions(+), 213 deletions(-)
>
> --
> 2.34.1
>

2023-03-14 16:10:36

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 0/4] sched/psi: Allow unprivileged PSI polling

On Mon, Mar 13, 2023 at 08:29:37AM -0700, Suren Baghdasaryan wrote:
> On Thu, Mar 9, 2023 at 9:08 AM Domenico Cerasuolo
> <[email protected]> wrote:
> >
> > PSI offers 2 mechanisms to get information about a specific resource
> > pressure. One is reading from /proc/pressure/<resource>, which gives
> > average pressures aggregated every 2s. The other is creating a pollable
> > fd for a specific resource and cgroup.
> >
> > The trigger creation requires CAP_SYS_RESOURCE, and gives the
> > possibility to pick specific time window and threshold, spawing an RT
> > thread to aggregate the data.
> >
> > Systemd would like to provide containers the option to monitor pressure
> > on their own cgroup and sub-cgroups. For example, if systemd launches a
> > container that itself then launches services, the container should have
> > the ability to poll() for pressure in individual services. But neither
> > the container nor the services are privileged.
>
> This sounds like an interesting usecase. I'll need to take a closer
> look once I'm back from vacation later this week.
> Thanks!

Thanks, Suren!

There is also the desktop monitoring usecase that Chris Down had
inquired about some while back:

https://lore.kernel.org/all/CAJuCfpGnJBEvQTUeJ_U6+rHmPcMjw_pPL+QFj7Sec5fHZPH67w@mail.gmail.com/T/

The patches should help with that as well.

2023-03-20 21:06:29

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 1/4] sched/psi: rearrange polling code in preparation

On Thu, Mar 9, 2023 at 9:08 AM Domenico Cerasuolo
<[email protected]> wrote:
>
> Move a few functions up in the file to avoid forward declaration needed
> in the patch implementing unprivileged PSI triggers.
>
> Suggested-by: Johannes Weiner <[email protected]>
> Signed-off-by: Domenico Cerasuolo <[email protected]>

LGTM. Will Ack when we finalize the rest of the patchset.

> ---
> kernel/sched/psi.c | 196 ++++++++++++++++++++++-----------------------
> 1 file changed, 98 insertions(+), 98 deletions(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 02e011cabe91..fe9269f1d2a4 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -384,92 +384,6 @@ static void collect_percpu_times(struct psi_group *group,
> *pchanged_states = changed_states;
> }
>
> -static u64 update_averages(struct psi_group *group, u64 now)
> -{
> - unsigned long missed_periods = 0;
> - u64 expires, period;
> - u64 avg_next_update;
> - int s;
> -
> - /* avgX= */
> - expires = group->avg_next_update;
> - if (now - expires >= psi_period)
> - missed_periods = div_u64(now - expires, psi_period);
> -
> - /*
> - * The periodic clock tick can get delayed for various
> - * reasons, especially on loaded systems. To avoid clock
> - * drift, we schedule the clock in fixed psi_period intervals.
> - * But the deltas we sample out of the per-cpu buckets above
> - * are based on the actual time elapsing between clock ticks.
> - */
> - avg_next_update = expires + ((1 + missed_periods) * psi_period);
> - period = now - (group->avg_last_update + (missed_periods * psi_period));
> - group->avg_last_update = now;
> -
> - for (s = 0; s < NR_PSI_STATES - 1; s++) {
> - u32 sample;
> -
> - sample = group->total[PSI_AVGS][s] - group->avg_total[s];
> - /*
> - * Due to the lockless sampling of the time buckets,
> - * recorded time deltas can slip into the next period,
> - * which under full pressure can result in samples in
> - * excess of the period length.
> - *
> - * We don't want to report non-sensical pressures in
> - * excess of 100%, nor do we want to drop such events
> - * on the floor. Instead we punt any overage into the
> - * future until pressure subsides. By doing this we
> - * don't underreport the occurring pressure curve, we
> - * just report it delayed by one period length.
> - *
> - * The error isn't cumulative. As soon as another
> - * delta slips from a period P to P+1, by definition
> - * it frees up its time T in P.
> - */
> - if (sample > period)
> - sample = period;
> - group->avg_total[s] += sample;
> - calc_avgs(group->avg[s], missed_periods, sample, period);
> - }
> -
> - return avg_next_update;
> -}
> -
> -static void psi_avgs_work(struct work_struct *work)
> -{
> - struct delayed_work *dwork;
> - struct psi_group *group;
> - u32 changed_states;
> - u64 now;
> -
> - dwork = to_delayed_work(work);
> - group = container_of(dwork, struct psi_group, avgs_work);
> -
> - mutex_lock(&group->avgs_lock);
> -
> - now = sched_clock();
> -
> - collect_percpu_times(group, PSI_AVGS, &changed_states);
> - /*
> - * If there is task activity, periodically fold the per-cpu
> - * times and feed samples into the running averages. If things
> - * are idle and there is no data to process, stop the clock.
> - * Once restarted, we'll catch up the running averages in one
> - * go - see calc_avgs() and missed_periods.
> - */
> - if (now >= group->avg_next_update)
> - group->avg_next_update = update_averages(group, now);
> -
> - if (changed_states & PSI_STATE_RESCHEDULE) {
> - schedule_delayed_work(dwork, nsecs_to_jiffies(
> - group->avg_next_update - now) + 1);
> - }
> -
> - mutex_unlock(&group->avgs_lock);
> -}
> -
> /* Trigger tracking window manipulations */
> static void window_reset(struct psi_window *win, u64 now, u64 value,
> u64 prev_growth)
> @@ -516,18 +430,6 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
> return growth;
> }
>
> -static void init_triggers(struct psi_group *group, u64 now)
> -{
> - struct psi_trigger *t;
> -
> - list_for_each_entry(t, &group->triggers, node)
> - window_reset(&t->win, now,
> - group->total[PSI_POLL][t->state], 0);
> - memcpy(group->polling_total, group->total[PSI_POLL],
> - sizeof(group->polling_total));
> - group->polling_next_update = now + group->poll_min_period;
> -}
> -
> static u64 update_triggers(struct psi_group *group, u64 now)
> {
> struct psi_trigger *t;
> @@ -590,6 +492,104 @@ static u64 update_triggers(struct psi_group *group, u64 now)
> return now + group->poll_min_period;
> }
>
> +static u64 update_averages(struct psi_group *group, u64 now)
> +{
> + unsigned long missed_periods = 0;
> + u64 expires, period;
> + u64 avg_next_update;
> + int s;
> +
> + /* avgX= */
> + expires = group->avg_next_update;
> + if (now - expires >= psi_period)
> + missed_periods = div_u64(now - expires, psi_period);
> +
> + /*
> + * The periodic clock tick can get delayed for various
> + * reasons, especially on loaded systems. To avoid clock
> + * drift, we schedule the clock in fixed psi_period intervals.
> + * But the deltas we sample out of the per-cpu buckets above
> + * are based on the actual time elapsing between clock ticks.
> + */
> + avg_next_update = expires + ((1 + missed_periods) * psi_period);
> + period = now - (group->avg_last_update + (missed_periods * psi_period));
> + group->avg_last_update = now;
> +
> + for (s = 0; s < NR_PSI_STATES - 1; s++) {
> + u32 sample;
> +
> + sample = group->total[PSI_AVGS][s] - group->avg_total[s];
> + /*
> + * Due to the lockless sampling of the time buckets,
> + * recorded time deltas can slip into the next period,
> + * which under full pressure can result in samples in
> + * excess of the period length.
> + *
> + * We don't want to report non-sensical pressures in
> + * excess of 100%, nor do we want to drop such events
> + * on the floor. Instead we punt any overage into the
> + * future until pressure subsides. By doing this we
> + * don't underreport the occurring pressure curve, we
> + * just report it delayed by one period length.
> + *
> + * The error isn't cumulative. As soon as another
> + * delta slips from a period P to P+1, by definition
> + * it frees up its time T in P.
> + */
> + if (sample > period)
> + sample = period;
> + group->avg_total[s] += sample;
> + calc_avgs(group->avg[s], missed_periods, sample, period);
> + }
> +
> + return avg_next_update;
> +}
> +
> +static void psi_avgs_work(struct work_struct *work)
> +{
> + struct delayed_work *dwork;
> + struct psi_group *group;
> + u32 changed_states;
> + u64 now;
> +
> + dwork = to_delayed_work(work);
> + group = container_of(dwork, struct psi_group, avgs_work);
> +
> + mutex_lock(&group->avgs_lock);
> +
> + now = sched_clock();
> +
> + collect_percpu_times(group, PSI_AVGS, &changed_states);
> + /*
> + * If there is task activity, periodically fold the per-cpu
> + * times and feed samples into the running averages. If things
> + * are idle and there is no data to process, stop the clock.
> + * Once restarted, we'll catch up the running averages in one
> + * go - see calc_avgs() and missed_periods.
> + */
> + if (now >= group->avg_next_update)
> + group->avg_next_update = update_averages(group, now);
> +
> + if (changed_states & PSI_STATE_RESCHEDULE) {
> + schedule_delayed_work(dwork, nsecs_to_jiffies(
> + group->avg_next_update - now) + 1);
> + }
> +
> + mutex_unlock(&group->avgs_lock);
> +}
> +
> +static void init_triggers(struct psi_group *group, u64 now)
> +{
> + struct psi_trigger *t;
> +
> + list_for_each_entry(t, &group->triggers, node)
> + window_reset(&t->win, now,
> + group->total[PSI_POLL][t->state], 0);
> + memcpy(group->polling_total, group->total[PSI_POLL],
> + sizeof(group->polling_total));
> + group->polling_next_update = now + group->poll_min_period;
> +}
> +
> /* Schedule polling if it's not already scheduled or forced. */
> static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay,
> bool force)
> --
> 2.34.1
>

2023-03-20 23:01:04

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 3/4] sched/psi: extract update_triggers side effect

On Thu, Mar 9, 2023 at 9:08 AM Domenico Cerasuolo
<[email protected]> wrote:
>
> The update of rtpoll_total inside update_triggers can be moved out of
> the function since changed_states has the same information as the
> update_total flag used in the function. Besides the simplification of
> the function, with the next patch it would become an unwanted side
> effect needed only for PSI_POLL.

(changed_states & group->rtpoll_states) and update_total flag are not
really equivalent. update_total flag depends on the difference between
group->polling_total[state] and group->total[PSI_POLL][state] while
changed_states depends on the difference between groupc->times and
groupc->times_prev. groupc->times_prev is updated every time
collect_percpu_times() is called and there are 3 places where that
happens: from psi_avgs_work(), from psi_poll_work() and from
psi_show(). group->polling_total[state] is updated only from
psi_poll_work(). Therefore the deltas between these values might not
always be in-sync.

Consider the following sequence as an example:

psi_poll_work()
...
psi_avgs_work()/psi_show()
collect_percpu_times() // we detect a change in a monitored state
...
psi_poll_work()
collect_percpu_times() // this time no change in monitored states
update_triggers() // group->polling_total[state] !=
group->total[PSI_POLL][state]

In the last psi_poll_work() collect_percpu_times() recorded no change
in monitored states, so (changed_states & group->rtpoll_states) == 0,
however since the last time psi_poll_work() was called there was
actually a change in monitored states recorded by the first
collect_percpu_times(), therefore (group->polling_total[t->state] !=
total[t->state]) and we should update the totals. With your change we
will miss that update.

I think you can easily fix that by introducing update_triggers as an
output parameter in window_update() like this:

static u64 window_update(struct psi_window *win, u64 now, u64 value,
bool *update_triggers) {
*update_total = false;
...
if (new_stall) {
*update_total = true;
...
}

static void psi_rtpoll_work(struct psi_group *group) {
+ bool update_triggers;
...
- if (now >= group->rtpoll_next_update)
+ if (now >= group->rtpoll_next_update) {
group->rtpoll_next_update = update_triggers(group,
now, &update_triggers);
+ if (update_triggers)
+ memcpy(group->rtpoll_total, group->total[PSI_POLL],
+ sizeof(group->rtpoll_total));
+ }
}


>
> Suggested-by: Johannes Weiner <[email protected]>
> Signed-off-by: Domenico Cerasuolo <[email protected]>
> ---
> kernel/sched/psi.c | 20 +++++---------------
> 1 file changed, 5 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index a3d0b5cf797a..476941c1cbea 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -433,7 +433,6 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
> static u64 update_triggers(struct psi_group *group, u64 now)
> {
> struct psi_trigger *t;
> - bool update_total = false;
> u64 *total = group->total[PSI_POLL];
>
> /*
> @@ -456,14 +455,6 @@ static u64 update_triggers(struct psi_group *group, u64 now)
> * events without dropping any).
> */
> if (new_stall) {
> - /*
> - * Multiple triggers might be looking at the same state,
> - * remember to update group->polling_total[] once we've
> - * been through all of them. Also remember to extend the
> - * polling time if we see new stall activity.
> - */
> - update_total = true;
> -
> /* Calculate growth since last update */
> growth = window_update(&t->win, now, total[t->state]);
> if (!t->pending_event) {
> @@ -484,11 +475,6 @@ static u64 update_triggers(struct psi_group *group, u64 now)
> /* Reset threshold breach flag once event got generated */
> t->pending_event = false;
> }
> -
> - if (update_total)
> - memcpy(group->rtpoll_total, total,
> - sizeof(group->rtpoll_total));
> -
> return now + group->rtpoll_min_period;
> }
>
> @@ -686,8 +672,12 @@ static void psi_rtpoll_work(struct psi_group *group)
> goto out;
> }
>
> - if (now >= group->rtpoll_next_update)
> + if (now >= group->rtpoll_next_update) {
> group->rtpoll_next_update = update_triggers(group, now);
> + if (changed_states & group->rtpoll_states)
> + memcpy(group->rtpoll_total, group->total[PSI_POLL],
> + sizeof(group->rtpoll_total));
> + }
>
> psi_schedule_rtpoll_work(group,
> nsecs_to_jiffies(group->rtpoll_next_update - now) + 1,
> --
> 2.34.1
>

2023-03-20 23:18:40

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 4/4] sched/psi: allow unprivileged polling of N*2s period

On Thu, Mar 9, 2023 at 9:08 AM Domenico Cerasuolo
<[email protected]> wrote:
>
> PSI offers 2 mechanisms to get information about a specific resource
> pressure. One is reading from /proc/pressure/<resource>, which gives
> average pressures aggregated every 2s. The other is creating a pollable
> fd for a specific resource and cgroup.
>
> The trigger creation requires CAP_SYS_RESOURCE, and gives the
> possibility to pick specific time window and threshold, spawing an RT
> thread to aggregate the data.
>
> Systemd would like to provide containers the option to monitor pressure
> on their own cgroup and sub-cgroups. For example, if systemd launches a
> container that itself then launches services, the container should have
> the ability to poll() for pressure in individual services. But neither
> the container nor the services are privileged.
>
> This patch implements a mechanism to allow unprivileged users to create
> pressure triggers. The difference with privileged triggers creation is
> that unprivileged ones must have a time window that's a multiple of 2s.
> This is so that we can avoid unrestricted spawning of rt threads, and
> use instead the same aggregation mechanism done for the averages, which
> runs indipendently of any triggers.

s/indipendently/independently

>
> Suggested-by: Johannes Weiner <[email protected]>
> Signed-off-by: Domenico Cerasuolo <[email protected]>
> ---
> Documentation/accounting/psi.rst | 4 ++
> include/linux/psi.h | 2 +-
> include/linux/psi_types.h | 7 +++
> kernel/cgroup/cgroup.c | 2 +-
> kernel/sched/psi.c | 105 ++++++++++++++++++++-----------
> 5 files changed, 83 insertions(+), 37 deletions(-)
>
> diff --git a/Documentation/accounting/psi.rst b/Documentation/accounting/psi.rst
> index 5e40b3f437f9..df6062eb3abb 100644
> --- a/Documentation/accounting/psi.rst
> +++ b/Documentation/accounting/psi.rst
> @@ -105,6 +105,10 @@ prevent overly frequent polling. Max limit is chosen as a high enough number
> after which monitors are most likely not needed and psi averages can be used
> instead.
>
> +Unprivileged users can also create monitors, with the only limitation that the
> +window size must be a multiple of 2s, in order to prevent excessive resource
> +usage.
> +
> When activated, psi monitor stays active for at least the duration of one
> tracking window to avoid repeated activations/deactivations when system is
> bouncing in and out of the stall state.
> diff --git a/include/linux/psi.h b/include/linux/psi.h
> index b029a847def1..ab26200c2803 100644
> --- a/include/linux/psi.h
> +++ b/include/linux/psi.h
> @@ -24,7 +24,7 @@ void psi_memstall_leave(unsigned long *flags);
>
> int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
> struct psi_trigger *psi_trigger_create(struct psi_group *group,
> - char *buf, enum psi_res res);
> + char *buf, enum psi_res res, struct file *file);
> void psi_trigger_destroy(struct psi_trigger *t);
>
> __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> index 1819afa8b198..e439f411c23b 100644
> --- a/include/linux/psi_types.h
> +++ b/include/linux/psi_types.h
> @@ -151,6 +151,9 @@ struct psi_trigger {
>
> /* Deferred event(s) from previous ratelimit window */
> bool pending_event;
> +
> + /* Used to differentiate destruction action*/
> + enum psi_aggregators aggregator;
> };
>
> struct psi_group {
> @@ -171,6 +174,10 @@ struct psi_group {
> /* Aggregator work control */
> struct delayed_work avgs_work;
>
> + /* Unprivileged triggers against N*PSI_FREQ windows */
> + struct list_head triggers;
> + u32 nr_triggers[NR_PSI_STATES - 1];
> +
> /* Total stall times and sampled pressure averages */
> u64 total[NR_PSI_AGGREGATORS][NR_PSI_STATES - 1];
> unsigned long avg[NR_PSI_STATES - 1][3];
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 935e8121b21e..dead36969bba 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3761,7 +3761,7 @@ static ssize_t pressure_write(struct kernfs_open_file *of, char *buf,
> }
>
> psi = cgroup_psi(cgrp);
> - new = psi_trigger_create(psi, buf, res);
> + new = psi_trigger_create(psi, buf, res, of->file);
> if (IS_ERR(new)) {
> cgroup_put(cgrp);
> return PTR_ERR(new);
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 476941c1cbea..fde91aa4e55f 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -186,9 +186,9 @@ static void group_init(struct psi_group *group)
> seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> group->avg_last_update = sched_clock();
> group->avg_next_update = group->avg_last_update + psi_period;
> - INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
> mutex_init(&group->avgs_lock);
> - /* Init trigger-related members */
> +
> + /* Init rtpoll trigger-related members */
> atomic_set(&group->rtpoll_scheduled, 0);
> mutex_init(&group->rtpoll_trigger_lock);
> INIT_LIST_HEAD(&group->rtpoll_triggers);
> @@ -197,6 +197,11 @@ static void group_init(struct psi_group *group)
> init_waitqueue_head(&group->rtpoll_wait);
> timer_setup(&group->rtpoll_timer, poll_timer_fn, 0);
> rcu_assign_pointer(group->rtpoll_task, NULL);
> +
> + /* Init avg trigger-related members */
> + INIT_LIST_HEAD(&group->triggers);
> + memset(group->nr_triggers, 0, sizeof(group->nr_triggers));
> + INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
> }
>
> void __init psi_init(void)
> @@ -430,20 +435,23 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
> return growth;
> }
>
> -static u64 update_triggers(struct psi_group *group, u64 now)
> +static u64 update_triggers(struct psi_group *group, u64 now, enum psi_aggregators aggregator)
> {
> struct psi_trigger *t;
> - u64 *total = group->total[PSI_POLL];
> + u64 *total = group->total[aggregator];
> + struct list_head *triggers = aggregator == PSI_AVGS ? &group->triggers
> + : &group->rtpoll_triggers;
> + u64 *aggregator_total = aggregator == PSI_AVGS ? group->avg_total : group->rtpoll_total;

I think if you follow my suggestion in patch#2 you can do:

struct trigger_info* tinfo = group->trig_info[aggregator];

and use that instead of the above if/else conditions.


>
> /*
> * On subsequent updates, calculate growth deltas and let
> * watchers know when their specified thresholds are exceeded.
> */
> - list_for_each_entry(t, &group->rtpoll_triggers, node) {
> + list_for_each_entry(t, triggers, node) {
> u64 growth;
> bool new_stall;
>
> - new_stall = group->rtpoll_total[t->state] != total[t->state];
> + new_stall = aggregator_total[t->state] != total[t->state];
>
> /* Check for stall activity or a previous threshold breach */
> if (!new_stall && !t->pending_event)
> @@ -553,8 +561,10 @@ static void psi_avgs_work(struct work_struct *work)
> * Once restarted, we'll catch up the running averages in one
> * go - see calc_avgs() and missed_periods.
> */
> - if (now >= group->avg_next_update)
> + if (now >= group->avg_next_update) {
> + update_triggers(group, now, PSI_AVGS);
> group->avg_next_update = update_averages(group, now);
> + }
>
> if (changed_states & PSI_STATE_RESCHEDULE) {
> schedule_delayed_work(dwork, nsecs_to_jiffies(
> @@ -571,9 +581,17 @@ static void init_triggers(struct psi_group *group, u64 now)
> list_for_each_entry(t, &group->rtpoll_triggers, node)
> window_reset(&t->win, now,
> group->total[PSI_POLL][t->state], 0);
> +
> + list_for_each_entry(t, &group->triggers, node)
> + window_reset(&t->win, now,
> + group->total[PSI_AVGS][t->state], 0);
> +
> memcpy(group->rtpoll_total, group->total[PSI_POLL],
> sizeof(group->rtpoll_total));
> group->rtpoll_next_update = now + group->rtpoll_min_period;
> +
> + memcpy(group->avg_total, group->total[PSI_AVGS],
> + sizeof(group->avg_total));
> }
>
> /* Schedule polling if it's not already scheduled or forced. */
> @@ -673,7 +691,7 @@ static void psi_rtpoll_work(struct psi_group *group)
> }
>
> if (now >= group->rtpoll_next_update) {
> - group->rtpoll_next_update = update_triggers(group, now);
> + group->rtpoll_next_update = update_triggers(group, now, PSI_POLL);
> if (changed_states & group->rtpoll_states)
> memcpy(group->rtpoll_total, group->total[PSI_POLL],
> sizeof(group->rtpoll_total));
> @@ -1243,16 +1261,19 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
> }
>
> struct psi_trigger *psi_trigger_create(struct psi_group *group,
> - char *buf, enum psi_res res)
> + char *buf, enum psi_res res, struct file *file)
> {
> struct psi_trigger *t;
> enum psi_states state;
> u32 threshold_us;
> + bool privileged;
> u32 window_us;
>
> if (static_branch_likely(&psi_disabled))
> return ERR_PTR(-EOPNOTSUPP);
>
> + privileged = cap_raised(file->f_cred->cap_effective, CAP_SYS_RESOURCE);
> +
> if (sscanf(buf, "some %u %u", &threshold_us, &window_us) == 2)
> state = PSI_IO_SOME + res * 2;
> else if (sscanf(buf, "full %u %u", &threshold_us, &window_us) == 2)
> @@ -1272,6 +1293,13 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
> window_us > WINDOW_MAX_US)
> return ERR_PTR(-EINVAL);
>
> + /*
> + * Unprivileged users can only use 2s windows so that averages aggregation
> + * work is used, and no RT threads need to be spawned.
> + */
> + if (!privileged && window_us % 2000000)
> + return ERR_PTR(-EINVAL);
> +
> /* Check threshold */
> if (threshold_us == 0 || threshold_us > window_us)
> return ERR_PTR(-EINVAL);
> @@ -1291,10 +1319,11 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
> t->last_event_time = 0;
> init_waitqueue_head(&t->event_wait);
> t->pending_event = false;
> + t->aggregator = privileged ? PSI_POLL : PSI_AVGS;
>
> mutex_lock(&group->rtpoll_trigger_lock);
>
> - if (!rcu_access_pointer(group->rtpoll_task)) {
> + if (privileged && !rcu_access_pointer(group->rtpoll_task)) {
> struct task_struct *task;
>
> task = kthread_create(psi_rtpoll_worker, group, "psimon");
> @@ -1308,11 +1337,16 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
> rcu_assign_pointer(group->rtpoll_task, task);
> }
>
> - list_add(&t->node, &group->rtpoll_triggers);
> - group->rtpoll_min_period = min(group->rtpoll_min_period,
> - div_u64(t->win.size, UPDATES_PER_WINDOW));
> - group->rtpoll_nr_triggers[t->state]++;
> - group->rtpoll_states |= (1 << t->state);
> + if (privileged) {
> + list_add(&t->node, &group->rtpoll_triggers);
> + group->rtpoll_min_period = min(group->rtpoll_min_period,
> + div_u64(t->win.size, UPDATES_PER_WINDOW));
> + group->rtpoll_nr_triggers[t->state]++;
> + group->rtpoll_states |= (1 << t->state);
> + } else {
> + list_add(&t->node, &group->triggers);
> + group->nr_triggers[t->state]++;
> + }
>
> mutex_unlock(&group->rtpoll_trigger_lock);
>
> @@ -1346,22 +1380,26 @@ void psi_trigger_destroy(struct psi_trigger *t)
> u64 period = ULLONG_MAX;
>
> list_del(&t->node);
> - group->rtpoll_nr_triggers[t->state]--;
> - if (!group->rtpoll_nr_triggers[t->state])
> - group->rtpoll_states &= ~(1 << t->state);
> - /* reset min update period for the remaining triggers */
> - list_for_each_entry(tmp, &group->rtpoll_triggers, node)
> - period = min(period, div_u64(tmp->win.size,
> - UPDATES_PER_WINDOW));
> - group->rtpoll_min_period = period;
> - /* Destroy rtpoll_task when the last trigger is destroyed */
> - if (group->rtpoll_states == 0) {
> - group->rtpoll_until = 0;
> - task_to_destroy = rcu_dereference_protected(
> - group->rtpoll_task,
> - lockdep_is_held(&group->rtpoll_trigger_lock));
> - rcu_assign_pointer(group->rtpoll_task, NULL);
> - del_timer(&group->rtpoll_timer);
> + if (t->aggregator == PSI_AVGS) {
> + group->nr_triggers[t->state]--;
> + } else {
> + group->rtpoll_nr_triggers[t->state]--;

Same here. This can be:

group->trig_info[t->aggregator].nr_triggers[t->state]--;
if (t->aggregator == PSI_AVGS)
return;
// original code for PSI_POLL triggers

> + if (!group->rtpoll_nr_triggers[t->state])
> + group->rtpoll_states &= ~(1 << t->state);
> + /* reset min update period for the remaining triggers */
> + list_for_each_entry(tmp, &group->rtpoll_triggers, node)
> + period = min(period, div_u64(tmp->win.size,
> + UPDATES_PER_WINDOW));
> + group->rtpoll_min_period = period;
> + /* Destroy rtpoll_task when the last trigger is destroyed */
> + if (group->rtpoll_states == 0) {
> + group->rtpoll_until = 0;
> + task_to_destroy = rcu_dereference_protected(
> + group->rtpoll_task,
> + lockdep_is_held(&group->rtpoll_trigger_lock));
> + rcu_assign_pointer(group->rtpoll_task, NULL);
> + del_timer(&group->rtpoll_timer);
> + }
> }
> }
>
> @@ -1428,9 +1466,6 @@ static int psi_cpu_show(struct seq_file *m, void *v)
>
> static int psi_open(struct file *file, int (*psi_show)(struct seq_file *, void *))
> {
> - if (file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE))
> - return -EPERM;
> -
> return single_open(file, psi_show, NULL);
> }

Here you can eliminate psi_open() completely and use single_open() directly.

>
> @@ -1480,7 +1515,7 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,
> return -EBUSY;
> }
>
> - new = psi_trigger_create(&psi_system, buf, res);
> + new = psi_trigger_create(&psi_system, buf, res, file);
> if (IS_ERR(new)) {
> mutex_unlock(&seq->lock);
> return PTR_ERR(new);
> --
> 2.34.1
>

2023-03-22 03:55:04

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 3/4] sched/psi: extract update_triggers side effect

On Tue, Mar 21, 2023 at 3:18 AM Domenico Cerasuolo
<[email protected]> wrote:
>
> Hi Suren, thanks for all the feedback! This makes sense, I only have one doubt, if we set update_total flag to window_update() and update_triggers(), that flag would be ignored when the caller is psi_avgs_work(), this would be happening in the next patch in the set.

I don't see why the update_triggers part should be conceptually
different between RT and unprivileged triggers. Could you please
explain?

> What do you think if I just remove this change from the patchset and then work on a solution after the iterations on the main change are completed? This was in fact just an attempt to clean up.
> I'll apply your suggested changes on the other patches, wait a bit for comments from someone else and then send V2.
>
> On Tue, Mar 21, 2023 at 12:00 AM Suren Baghdasaryan <[email protected]> wrote:
>>
>> On Thu, Mar 9, 2023 at 9:08 AM Domenico Cerasuolo
>> <[email protected]> wrote:
>> >
>> > The update of rtpoll_total inside update_triggers can be moved out of
>> > the function since changed_states has the same information as the
>> > update_total flag used in the function. Besides the simplification of
>> > the function, with the next patch it would become an unwanted side
>> > effect needed only for PSI_POLL.
>>
>> (changed_states & group->rtpoll_states) and update_total flag are not
>> really equivalent. update_total flag depends on the difference between
>> group->polling_total[state] and group->total[PSI_POLL][state] while
>> changed_states depends on the difference between groupc->times and
>> groupc->times_prev. groupc->times_prev is updated every time
>> collect_percpu_times() is called and there are 3 places where that
>> happens: from psi_avgs_work(), from psi_poll_work() and from
>> psi_show(). group->polling_total[state] is updated only from
>> psi_poll_work(). Therefore the deltas between these values might not
>> always be in-sync.
>>
>> Consider the following sequence as an example:
>>
>> psi_poll_work()
>> ...
>> psi_avgs_work()/psi_show()
>> collect_percpu_times() // we detect a change in a monitored state
>> ...
>> psi_poll_work()
>> collect_percpu_times() // this time no change in monitored states
>> update_triggers() // group->polling_total[state] !=
>> group->total[PSI_POLL][state]
>>
>> In the last psi_poll_work() collect_percpu_times() recorded no change
>> in monitored states, so (changed_states & group->rtpoll_states) == 0,
>> however since the last time psi_poll_work() was called there was
>> actually a change in monitored states recorded by the first
>> collect_percpu_times(), therefore (group->polling_total[t->state] !=
>> total[t->state]) and we should update the totals. With your change we
>> will miss that update.
>>
>> I think you can easily fix that by introducing update_triggers as an
>> output parameter in window_update() like this:
>>
>> static u64 window_update(struct psi_window *win, u64 now, u64 value,
>> bool *update_triggers) {
>> *update_total = false;
>> ...
>> if (new_stall) {
>> *update_total = true;
>> ...
>> }
>>
>> static void psi_rtpoll_work(struct psi_group *group) {
>> + bool update_triggers;
>> ...
>> - if (now >= group->rtpoll_next_update)
>> + if (now >= group->rtpoll_next_update) {
>> group->rtpoll_next_update = update_triggers(group,
>> now, &update_triggers);
>> + if (update_triggers)
>> + memcpy(group->rtpoll_total, group->total[PSI_POLL],
>> + sizeof(group->rtpoll_total));
>> + }
>> }
>>
>>
>> >
>> > Suggested-by: Johannes Weiner <[email protected]>
>> > Signed-off-by: Domenico Cerasuolo <[email protected]>
>> > ---
>> > kernel/sched/psi.c | 20 +++++---------------
>> > 1 file changed, 5 insertions(+), 15 deletions(-)
>> >
>> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>> > index a3d0b5cf797a..476941c1cbea 100644
>> > --- a/kernel/sched/psi.c
>> > +++ b/kernel/sched/psi.c
>> > @@ -433,7 +433,6 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
>> > static u64 update_triggers(struct psi_group *group, u64 now)
>> > {
>> > struct psi_trigger *t;
>> > - bool update_total = false;
>> > u64 *total = group->total[PSI_POLL];
>> >
>> > /*
>> > @@ -456,14 +455,6 @@ static u64 update_triggers(struct psi_group *group, u64 now)
>> > * events without dropping any).
>> > */
>> > if (new_stall) {
>> > - /*
>> > - * Multiple triggers might be looking at the same state,
>> > - * remember to update group->polling_total[] once we've
>> > - * been through all of them. Also remember to extend the
>> > - * polling time if we see new stall activity.
>> > - */
>> > - update_total = true;
>> > -
>> > /* Calculate growth since last update */
>> > growth = window_update(&t->win, now, total[t->state]);
>> > if (!t->pending_event) {
>> > @@ -484,11 +475,6 @@ static u64 update_triggers(struct psi_group *group, u64 now)
>> > /* Reset threshold breach flag once event got generated */
>> > t->pending_event = false;
>> > }
>> > -
>> > - if (update_total)
>> > - memcpy(group->rtpoll_total, total,
>> > - sizeof(group->rtpoll_total));
>> > -
>> > return now + group->rtpoll_min_period;
>> > }
>> >
>> > @@ -686,8 +672,12 @@ static void psi_rtpoll_work(struct psi_group *group)
>> > goto out;
>> > }
>> >
>> > - if (now >= group->rtpoll_next_update)
>> > + if (now >= group->rtpoll_next_update) {
>> > group->rtpoll_next_update = update_triggers(group, now);
>> > + if (changed_states & group->rtpoll_states)
>> > + memcpy(group->rtpoll_total, group->total[PSI_POLL],
>> > + sizeof(group->rtpoll_total));
>> > + }
>> >
>> > psi_schedule_rtpoll_work(group,
>> > nsecs_to_jiffies(group->rtpoll_next_update - now) + 1,
>> > --
>> > 2.34.1
>> >

2023-03-22 16:46:51

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 3/4] sched/psi: extract update_triggers side effect

On Wed, Mar 22, 2023 at 3:14 AM Domenico Cerasuolo
<[email protected]> wrote:
>
> I'm not suggesting that update_triggers should be different, I agree that they should behave the same for both types of trigger.
> The problem is that if we extract the update_total information out of update_triggers, that information will be ignored by psi_avgs_work because avg_total is always updated in update_averages, only psi_poll_work would use it to copy the total to polling_total.
> If this is the only alternative to having `if (update_total && aggregator == PSI_POLL)` inside update_triggers, I'll add the argument to update_triggers, I'm just wondering if there could be another alternative.

I suggest you post the V2 with suggested changes and this approach and
it will be easier to decide whether this can be improved further.
Also, please do not top-post (read through
https://kernelnewbies.org/mailinglistguidelines for more hints).
Thanks,
Suren.

>
> On Wed, Mar 22, 2023 at 4:41 AM Suren Baghdasaryan <[email protected]> wrote:
>>
>> On Tue, Mar 21, 2023 at 3:18 AM Domenico Cerasuolo
>> <[email protected]> wrote:
>> >
>> > Hi Suren, thanks for all the feedback! This makes sense, I only have one doubt, if we set update_total flag to window_update() and update_triggers(), that flag would be ignored when the caller is psi_avgs_work(), this would be happening in the next patch in the set.
>>
>> I don't see why the update_triggers part should be conceptually
>> different between RT and unprivileged triggers. Could you please
>> explain?
>>
>> > What do you think if I just remove this change from the patchset and then work on a solution after the iterations on the main change are completed? This was in fact just an attempt to clean up.
>> > I'll apply your suggested changes on the other patches, wait a bit for comments from someone else and then send V2.
>> >
>> > On Tue, Mar 21, 2023 at 12:00 AM Suren Baghdasaryan <[email protected]> wrote:
>> >>
>> >> On Thu, Mar 9, 2023 at 9:08 AM Domenico Cerasuolo
>> >> <[email protected]> wrote:
>> >> >
>> >> > The update of rtpoll_total inside update_triggers can be moved out of
>> >> > the function since changed_states has the same information as the
>> >> > update_total flag used in the function. Besides the simplification of
>> >> > the function, with the next patch it would become an unwanted side
>> >> > effect needed only for PSI_POLL.
>> >>
>> >> (changed_states & group->rtpoll_states) and update_total flag are not
>> >> really equivalent. update_total flag depends on the difference between
>> >> group->polling_total[state] and group->total[PSI_POLL][state] while
>> >> changed_states depends on the difference between groupc->times and
>> >> groupc->times_prev. groupc->times_prev is updated every time
>> >> collect_percpu_times() is called and there are 3 places where that
>> >> happens: from psi_avgs_work(), from psi_poll_work() and from
>> >> psi_show(). group->polling_total[state] is updated only from
>> >> psi_poll_work(). Therefore the deltas between these values might not
>> >> always be in-sync.
>> >>
>> >> Consider the following sequence as an example:
>> >>
>> >> psi_poll_work()
>> >> ...
>> >> psi_avgs_work()/psi_show()
>> >> collect_percpu_times() // we detect a change in a monitored state
>> >> ...
>> >> psi_poll_work()
>> >> collect_percpu_times() // this time no change in monitored states
>> >> update_triggers() // group->polling_total[state] !=
>> >> group->total[PSI_POLL][state]
>> >>
>> >> In the last psi_poll_work() collect_percpu_times() recorded no change
>> >> in monitored states, so (changed_states & group->rtpoll_states) == 0,
>> >> however since the last time psi_poll_work() was called there was
>> >> actually a change in monitored states recorded by the first
>> >> collect_percpu_times(), therefore (group->polling_total[t->state] !=
>> >> total[t->state]) and we should update the totals. With your change we
>> >> will miss that update.
>> >>
>> >> I think you can easily fix that by introducing update_triggers as an
>> >> output parameter in window_update() like this:
>> >>
>> >> static u64 window_update(struct psi_window *win, u64 now, u64 value,
>> >> bool *update_triggers) {
>> >> *update_total = false;
>> >> ...
>> >> if (new_stall) {
>> >> *update_total = true;
>> >> ...
>> >> }
>> >>
>> >> static void psi_rtpoll_work(struct psi_group *group) {
>> >> + bool update_triggers;
>> >> ...
>> >> - if (now >= group->rtpoll_next_update)
>> >> + if (now >= group->rtpoll_next_update) {
>> >> group->rtpoll_next_update = update_triggers(group,
>> >> now, &update_triggers);
>> >> + if (update_triggers)
>> >> + memcpy(group->rtpoll_total, group->total[PSI_POLL],
>> >> + sizeof(group->rtpoll_total));
>> >> + }
>> >> }
>> >>
>> >>
>> >> >
>> >> > Suggested-by: Johannes Weiner <[email protected]>
>> >> > Signed-off-by: Domenico Cerasuolo <[email protected]>
>> >> > ---
>> >> > kernel/sched/psi.c | 20 +++++---------------
>> >> > 1 file changed, 5 insertions(+), 15 deletions(-)
>> >> >
>> >> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>> >> > index a3d0b5cf797a..476941c1cbea 100644
>> >> > --- a/kernel/sched/psi.c
>> >> > +++ b/kernel/sched/psi.c
>> >> > @@ -433,7 +433,6 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
>> >> > static u64 update_triggers(struct psi_group *group, u64 now)
>> >> > {
>> >> > struct psi_trigger *t;
>> >> > - bool update_total = false;
>> >> > u64 *total = group->total[PSI_POLL];
>> >> >
>> >> > /*
>> >> > @@ -456,14 +455,6 @@ static u64 update_triggers(struct psi_group *group, u64 now)
>> >> > * events without dropping any).
>> >> > */
>> >> > if (new_stall) {
>> >> > - /*
>> >> > - * Multiple triggers might be looking at the same state,
>> >> > - * remember to update group->polling_total[] once we've
>> >> > - * been through all of them. Also remember to extend the
>> >> > - * polling time if we see new stall activity.
>> >> > - */
>> >> > - update_total = true;
>> >> > -
>> >> > /* Calculate growth since last update */
>> >> > growth = window_update(&t->win, now, total[t->state]);
>> >> > if (!t->pending_event) {
>> >> > @@ -484,11 +475,6 @@ static u64 update_triggers(struct psi_group *group, u64 now)
>> >> > /* Reset threshold breach flag once event got generated */
>> >> > t->pending_event = false;
>> >> > }
>> >> > -
>> >> > - if (update_total)
>> >> > - memcpy(group->rtpoll_total, total,
>> >> > - sizeof(group->rtpoll_total));
>> >> > -
>> >> > return now + group->rtpoll_min_period;
>> >> > }
>> >> >
>> >> > @@ -686,8 +672,12 @@ static void psi_rtpoll_work(struct psi_group *group)
>> >> > goto out;
>> >> > }
>> >> >
>> >> > - if (now >= group->rtpoll_next_update)
>> >> > + if (now >= group->rtpoll_next_update) {
>> >> > group->rtpoll_next_update = update_triggers(group, now);
>> >> > + if (changed_states & group->rtpoll_states)
>> >> > + memcpy(group->rtpoll_total, group->total[PSI_POLL],
>> >> > + sizeof(group->rtpoll_total));
>> >> > + }
>> >> >
>> >> > psi_schedule_rtpoll_work(group,
>> >> > nsecs_to_jiffies(group->rtpoll_next_update - now) + 1,
>> >> > --
>> >> > 2.34.1
>> >> >