From: Johannes Weiner
To: Peter Zijlstra
Cc: Suren Baghdasaryan, Jared Pochtar, linux-kernel@vger.kernel.org,
    kernel-team@fb.com
Subject: [PATCH] psi: fix sampling artifact from pressure file read frequency
Date: Tue, 8 Jun 2021 15:03:36 -0400
Message-Id: <20210608190336.77380-1-hannes@cmpxchg.org>

Currently, every time a psi pressure file is read, the per-cpu states
are collected and aggregated according to the CPU's non-idle time
weight. This dynamically changes the sampling period, which means read
frequency can introduce variance into the observed results. This is
somewhat unexpected for users and can be confusing when, e.g., two
nested cgroups with the same workload report different pressure levels
just because they have different consumers reading the pressure files.
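To make the weighting concrete, here is a minimal userspace sketch of
that aggregation scheme (illustrative only: the struct, the function
name aggregate_stall(), and the floating-point arithmetic are made up
for readability; the kernel does the equivalent fixed-point u64/u32
math in collect_percpu_times()). It reproduces the one-sample vs.
two-sample totals for the CPU timelines discussed below:

#include <stdio.h>

#define NR_CPUS 2

struct cpu_sample {
	double stall;	/* stall time observed on this CPU in the sample */
	double nonidle;	/* non-idle time on this CPU in the sample */
};

/*
 * Fold per-CPU stall times into one number, weighing each CPU by its
 * share of the total non-idle time in the sample.
 */
static double aggregate_stall(const struct cpu_sample *s, int nr_cpus)
{
	double nonidle_total = 0.0, total = 0.0;
	int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		nonidle_total += s[cpu].nonidle;

	if (nonidle_total == 0.0)
		return 0.0;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		total += s[cpu].stall * s[cpu].nonidle / nonidle_total;

	return total;
}

int main(void)
{
	/* One sample covering the whole period of the timelines below. */
	struct cpu_sample whole[NR_CPUS] = {
		{ .stall = 1.0, .nonidle = 1.0 },	/* CPU0: stalled, then slept */
		{ .stall = 0.0, .nonidle = 2.0 },	/* CPU1: ran the whole time  */
	};
	/* The same period split into two samples. */
	struct cpu_sample first_half[NR_CPUS] = {
		{ .stall = 1.0, .nonidle = 1.0 },
		{ .stall = 0.0, .nonidle = 1.0 },
	};
	struct cpu_sample second_half[NR_CPUS] = {
		{ .stall = 0.0, .nonidle = 0.0 },
		{ .stall = 0.0, .nonidle = 1.0 },
	};

	printf("sampled once:  %.2f\n", aggregate_stall(whole, NR_CPUS));
	printf("sampled twice: %.2f\n",
	       aggregate_stall(first_half, NR_CPUS) +
	       aggregate_stall(second_half, NR_CPUS));
	return 0;
}

Compiled as-is, this prints roughly 0.33 for the single sample and
0.50 for the two samples, matching the numbers worked out next.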
Consider the following two CPU timelines:

	CPU0: [ STALL ] [ SLEEP ]
	CPU1: [  RUN  ] [  RUN  ]

If we sample and aggregate once for the whole period, we get the
following total stall time for CPU0:

	CPU0 = stall(1) * nonidle(1) / nonidle_total(3) = 0.3

But if we sample twice, the total for the period is higher:

	CPU0 = stall(1) * nonidle(1) / nonidle_total(2) = 0.5
	CPU0 = stall(0) * nonidle(0) / nonidle_total(1) = 0
	                                                  ---
	                                                  0.5

Neither answer is inherently wrong: if the user asks for pressure
after half the period, we can't know yet that the CPU will go to sleep
right after, and that its weight over the combined period would
therefore be lower.

We could normalize the weight averaging to a fixed window regardless
of how often stall times themselves are sampled. But that would make
reporting less adaptive to sudden changes when the user intentionally
uses poll() with short intervals in order to get a higher resolution.

For now, simply limit sampling of the pressure file contents to the
fixed two-second period already used by the aggregation worker.
psi_show() still needs to force a catch-up run in case the workload
went idle and periodic aggregation shut off. Since that can race with
periodic aggregation, the worker and psi_show() need to fully
serialize. And with sampling becoming exclusive, whoever wins the race
must also requeue the worker if necessary.

Move the locking, the aggregation work, and the worker scheduling
logic into update_averages(); use that from the worker and psi_show().
poll() continues to use the proportional weight sampling window.

Reported-by: Jared Pochtar
Signed-off-by: Johannes Weiner
---
 kernel/sched/psi.c | 83 ++++++++++++++++++++++++----------------------
 1 file changed, 43 insertions(+), 40 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index cc25a3cff41f..9d647d974f55 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -358,17 +358,36 @@ static void collect_percpu_times(struct psi_group *group,
 	*pchanged_states = changed_states;
 }
 
-static u64 update_averages(struct psi_group *group, u64 now)
+static void update_averages(struct psi_group *group)
 {
 	unsigned long missed_periods = 0;
-	u64 expires, period;
-	u64 avg_next_update;
+	u64 now, expires, period;
+	u32 changed_states;
 	int s;
 
 	/* avgX= */
+	mutex_lock(&group->avgs_lock);
+
+	now = sched_clock();
 	expires = group->avg_next_update;
-	if (now - expires >= psi_period)
-		missed_periods = div_u64(now - expires, psi_period);
+
+	/*
+	 * Periodic aggregation.
+	 *
+	 * When tasks in the group are active, we make sure to
+	 * aggregate per-cpu samples and calculate the running
+	 * averages at exactly once per PSI_FREQ period.
+	 *
+	 * When tasks go idle, there is no point in keeping the
+	 * workers running, so we shut them down too. Once restarted,
+	 * we backfill zeroes for the missed periods in calc_avgs().
+	 *
+	 * We can get here from inside the aggregation worker, but
+	 * also from psi_show() as userspace may query pressure files
+	 * of an idle group whose aggregation worker shut down.
+	 */
+	if (now < expires)
+		goto unlock;
 
 	/*
 	 * The periodic clock tick can get delayed for various
@@ -377,10 +396,13 @@ static u64 update_averages(struct psi_group *group, u64 now)
 	 * But the deltas we sample out of the per-cpu buckets above
 	 * are based on the actual time elapsing between clock ticks.
 	 */
-	avg_next_update = expires + ((1 + missed_periods) * psi_period);
+	if (now - expires >= psi_period)
+		missed_periods = div_u64(now - expires, psi_period);
 	period = now - (group->avg_last_update + (missed_periods * psi_period));
 	group->avg_last_update = now;
 
+	group->avg_next_update = expires + ((1 + missed_periods) * psi_period);
+
+	collect_percpu_times(group, PSI_AVGS, &changed_states);
 	for (s = 0; s < NR_PSI_STATES - 1; s++) {
 		u32 sample;
 
@@ -408,42 +430,25 @@ static u64 update_averages(struct psi_group *group, u64 now)
 		calc_avgs(group->avg[s], missed_periods, sample, period);
 	}
 
-	return avg_next_update;
+	if (changed_states & (1 << PSI_NONIDLE)) {
+		unsigned long delay;
+
+		delay = nsecs_to_jiffies(group->avg_next_update - now) + 1;
+		schedule_delayed_work(&group->avgs_work, delay);
+	}
+unlock:
+	mutex_unlock(&group->avgs_lock);
 }
 
 static void psi_avgs_work(struct work_struct *work)
 {
 	struct delayed_work *dwork;
 	struct psi_group *group;
-	u32 changed_states;
-	bool nonidle;
-	u64 now;
 
 	dwork = to_delayed_work(work);
 	group = container_of(dwork, struct psi_group, avgs_work);
 
-	mutex_lock(&group->avgs_lock);
-
-	now = sched_clock();
-
-	collect_percpu_times(group, PSI_AVGS, &changed_states);
-	nonidle = changed_states & (1 << PSI_NONIDLE);
-	/*
-	 * If there is task activity, periodically fold the per-cpu
-	 * times and feed samples into the running averages. If things
-	 * are idle and there is no data to process, stop the clock.
-	 * Once restarted, we'll catch up the running averages in one
-	 * go - see calc_avgs() and missed_periods.
-	 */
-	if (now >= group->avg_next_update)
-		group->avg_next_update = update_averages(group, now);
-
-	if (nonidle) {
-		schedule_delayed_work(dwork, nsecs_to_jiffies(
-				group->avg_next_update - now) + 1);
-	}
-
-	mutex_unlock(&group->avgs_lock);
+	update_averages(group);
 }
 
 /* Trigger tracking window manipulations */
@@ -1029,18 +1034,16 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 {
 	int full;
-	u64 now;
 
 	if (static_branch_likely(&psi_disabled))
 		return -EOPNOTSUPP;
 
-	/* Update averages before reporting them */
-	mutex_lock(&group->avgs_lock);
-	now = sched_clock();
-	collect_percpu_times(group, PSI_AVGS, NULL);
-	if (now >= group->avg_next_update)
-		group->avg_next_update = update_averages(group, now);
-	mutex_unlock(&group->avgs_lock);
+	/*
+	 * Periodic aggregation should do all the sampling for us, but
+	 * if the workload goes idle, the worker goes to sleep before
+	 * old stalls may have averaged out. Backfill any idle zeroes.
+	 */
+	update_averages(group);
 
 	for (full = 0; full < 2; full++) {
 		unsigned long avg[3];
-- 
2.32.0