Received-SPF: pass (google.com: domain of linux-kernel+bounces-139145-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) client-ip=2604:1380:4601:e00::3;
Date: Wed, 10 Apr 2024 19:47:49 +0200
From: Peter Zijlstra <peterz@infradead.org>
To: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>, Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	linux-kernel@vger.kernel.org,
	Luca Abeni <luca.abeni@santannapisa.it>,
	Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it>,
	Thomas Gleixner <tglx@linutronix.de>,
	Joel Fernandes <joel@joelfernandes.org>,
	Vineeth Pillai <vineeth@bitbyteword.org>,
	Shuah Khan <skhan@linuxfoundation.org>,
	Phil Auld <pauld@redhat.com>,
	Suleiman Souhlal <suleiman@google.com>,
	Youssef Esmat <youssefesmat@google.com>
Subject: Re: [PATCH V6 2/6] sched/deadline: Deferrable dl server
Message-ID: <20240410174749.GC30852@noisy.programming.kicks-ass.net>
References: <cover.1712337227.git.bristot@kernel.org>
 <7b9c206e914ef257a2534199f25938ffafa3e59e.1712337227.git.bristot@kernel.org>
Precedence: bulk
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <7b9c206e914ef257a2534199f25938ffafa3e59e.1712337227.git.bristot@kernel.org>

On Fri, Apr 05, 2024 at 07:28:01PM +0200, Daniel Bristot de Oliveira wrote:
> @@ -874,6 +895,37 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  		dl_se->dl_yielded = 0;
>  	if (dl_se->dl_throttled)
>  		dl_se->dl_throttled = 0;
> +
> +	/*
> +	 * If this is the replenishment of a deferred reservation,
> +	 * clear the flag and return.
> +	 */
> +	if (dl_se->dl_defer_armed) {
> +		dl_se->dl_defer_armed = 0;
> +		return;
> +	}
> +
> +	/*
> +	 * A this point, if the deferred server is not armed, and the deadline
> +	 * is in the future, if it is not running already, throttle the server
> +	 * and arm the defer timer.
> +	 */
> +	if (dl_se->dl_defer && !dl_se->dl_defer_running &&
> +	    dl_time_before(rq_clock(dl_se->rq), dl_se->deadline - dl_se->runtime)) {
> +		if (!is_dl_boosted(dl_se) && dl_se->server_has_tasks(dl_se)) {
> +			dl_se->dl_defer_armed = 1;
> +			dl_se->dl_throttled = 1;
> +			if (!start_dl_timer(dl_se)) {
> +				/*
> +				 * If for whatever reason (delays), if a previous timer was
> +				 *  queued but not serviced, cancel it.

(whitespace damage)

> +				 */
> +				hrtimer_try_to_cancel(&dl_se->dl_timer);
> +				dl_se->dl_defer_armed = 0;
> +				dl_se->dl_throttled = 0;
> +			}

This looks funny in that it 'obviously' should only set the variables to
1 on success, but I'm thinking it is this way because the timer (when
programming is successful) needs to observe the 1s.

That is, there is an implicit memory ordering here, perhaps put in a
comment to avoid someone 'fixing' this later?

> +		}
> +	}
>  }
>  
>  /*

> @@ -1056,8 +1117,20 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
>  	 * We want the timer to fire at the deadline, but considering
>  	 * that it is actually coming from rq->clock and not from
>  	 * hrtimer's time base reading.
> +	 *
> +	 * The deferred reservation will have its timer set to
> +	 * (deadline - runtime). At that point, the CBS rule will decide
> +	 * if the current deadline can be used, or if a replenishment is
> +	 * required to avoid add too much pressure on the system
> +	 * (current u > U).

(I wanted to type a comment about how this comment might not do the
subtlety justice, OTOH, fixing that might require much more text and
become unwieldy, so meh..)

>  	 */
> -	act = ns_to_ktime(dl_next_period(dl_se));
> +	if (dl_se->dl_defer_armed) {
> +		WARN_ON_ONCE(!dl_se->dl_throttled);
> +		act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
> +	} else {

		/* act = deadline - rel-deadline + period */

> +		act = ns_to_ktime(dl_next_period(dl_se));

I had to look up what that function does, it either needs a better name
or I just need more exposure to this code I suppose :-)

> +	}
> +
>  	now = hrtimer_cb_get_time(timer);
>  	delta = ktime_to_ns(now) - rq_clock(rq);
>  	act = ktime_add_ns(act, delta);
> @@ -1107,6 +1180,64 @@ static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
>  #endif
>  }
>  
> +/* a defer timer will not be reset if the runtime consumed was < dl_server_min_res */
> +static const u64 dl_server_min_res = 1 * NSEC_PER_MSEC;
> +
> +static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_dl_entity *dl_se)
> +{
> +	struct rq *rq = rq_of_dl_se(dl_se);
> +	enum hrtimer_restart restart = 0;
> +	struct rq_flags rf;
> +	u64 fw;
> +
> +	rq_lock(rq, &rf);

	guard(rq_lock)(rq, &rf);

> +	if (dl_se->dl_throttled) {
> +		sched_clock_tick();
> +		update_rq_clock(rq);
> +
> +		if (!dl_se->dl_runtime)
> +			goto unlock;
> +
> +		if (!dl_se->server_has_tasks(dl_se)) {
> +			replenish_dl_entity(dl_se);
> +			goto unlock;
> +		}
> +
> +		if (dl_se->dl_defer_armed) {
> +			/*
> +			 * First check if the server could consume runtime in background.
> +			 * If so, it is possible to push the defer timer for this amount
> +			 * of time. The dl_server_min_res serves as a limit to avoid
> +			 * forwarding the timer for a too small amount of time.
> +			 */
> +			if (dl_time_before(rq_clock(dl_se->rq),
> +				(dl_se->deadline - dl_se->runtime - dl_server_min_res))) {

:se cino=(0:0

that is, this wants to be something like:

			if (dl_time_before(rq_clock(dl_se->rq),
					   (dl_se->deadline - dl_se->runtime - dl_server_min_res))) {

> +
> +				/* reset the defer timer */
> +				fw = dl_se->deadline - rq_clock(dl_se->rq) - dl_se->runtime;
> +
> +				hrtimer_forward_now(timer, ns_to_ktime(fw));

> +				restart = 1;
> +				goto unlock;

				return HRTIMER_RESTART;

> +			}
> +
> +			dl_se->dl_defer_running = 1;
> +		}
> +
> +		enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
> +
> +		if (!dl_task(dl_se->rq->curr) ||
> +			dl_entity_preempt(dl_se, &dl_se->rq->curr->dl))
> +			resched_curr(rq);
> +
> +		__push_dl_task(rq, &rf);
> +	}
> +unlock:
> +	rq_unlock(rq, &rf);
> +
> +	return restart ? HRTIMER_RESTART : HRTIMER_NORESTART;

	return HRTIMER_NORESTART;

> +}
> +
>  /*
>   * This is the bandwidth enforcement timer callback. If here, we know
>   * a task is not on its dl_rq, since the fact that the timer was running

> @@ -1320,22 +1431,10 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
>  	return (delta * u_act) >> BW_SHIFT;
>  }
>  
> -static inline void
> -update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
> -                        int flags);
> -static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
> +s64 dl_scalled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)

%s/_scalled_/_scaled_/g ?

>  {
>  	s64 scaled_delta_exec;
>  
> -	if (unlikely(delta_exec <= 0)) {
> -		if (unlikely(dl_se->dl_yielded))
> -			goto throttle;
> -		return;
> -	}
> -
> -	if (dl_entity_is_special(dl_se))
> -		return;
> -
>  	/*
>  	 * For tasks that participate in GRUB, we implement GRUB-PA: the
>  	 * spare reclaimed bandwidth is used to clock down frequency.
> @@ -1354,8 +1453,64 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
>  		scaled_delta_exec = cap_scale(scaled_delta_exec, scale_cpu);
>  	}
>  
> +	return scaled_delta_exec;
> +}
> +
> +static inline void
> +update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
> +			int flags);
> +static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
> +{
> +	s64 scaled_delta_exec;
> +
> +	if (unlikely(delta_exec <= 0)) {
> +		if (unlikely(dl_se->dl_yielded))
> +			goto throttle;
> +		return;
> +	}
> +
> +	if (dl_server(dl_se) && dl_se->dl_throttled && !dl_se->dl_defer)
> +		return;
> +
> +	if (dl_entity_is_special(dl_se))
> +		return;
> +
> +	scaled_delta_exec = dl_scalled_delta_exec(rq, dl_se, delta_exec);
> +
>  	dl_se->runtime -= scaled_delta_exec;
>  
> +	/*
> +	 * The fair server can consume its runtime while throttled (not queued/
> +	 * running as regular CFS).
> +	 *
> +	 * If the server consumes its entire runtime in this state. The server
> +	 * is not required for the current period. Thus, reset the server by
> +	 * starting a new period, pushing the activation.
> +	 */
> +	if (dl_se->dl_defer && dl_se->dl_throttled && dl_runtime_exceeded(dl_se)) {
> +		/*
> +		 * If the server was previously activated - the starving condition
> +		 * took place, it this point it went away because the fair scheduler

                               ^ at ?

> +		 * was able to get runtime in background. So return to the initial
> +		 * state.
> +		 */
> +		dl_se->dl_defer_running = 0;
> +
> +		hrtimer_try_to_cancel(&dl_se->dl_timer);
> +
> +		replenish_dl_new_period(dl_se, dl_se->rq);
> +
> +		/*
> +		 * Not being able to start the timer seems problematic. If it could not
> +		 * be started for whatever reason, we need to "unthrottle" the DL server
> +		 * and queue right away. Otherwise nothing might queue it. That's similar
> +		 * to what enqueue_dl_entity() does on start_dl_timer==0. For now, just warn.
> +		 */
> +		WARN_ON_ONCE(!start_dl_timer(dl_se));
> +
> +		return;
> +	}
> +
>  throttle:
>  	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
>  		dl_se->dl_throttled = 1;
> @@ -1415,9 +1570,47 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
>  	}
>  }
>  
> +/*
> + * In the non-defer mode, the idle time is not accounted, as the
> + * server provides a guarantee.
> + *
> + * If the dl_server is in defer mode, the idle time is also considered
> + * as time available for the fair server. This avoids creating a
> + * regression with the rt throttling behavior where the idle time did
> + * not create a penalty to the rt schedulers.

I don't think it makes sense to refer to rt throttle behaviour here. The
goal is to delete that code, at which this comment becomes a hysterical
artifact.

I think we can easily give a rational reason for this behaviour without
referring current behaviour. That is, the point of all this is to grant
fair a chance to run, but if there is no fair task (idle), there is no
point in trying to run.

Hmm?

Also, this is done to avoid having to reprogram the timer muck when
cfs_rq::nr_running changes to/from 0 ?

> + */
> +void dl_server_update_idle_time(struct rq *rq, struct task_struct *p)
> +{
> +	s64 delta_exec, scaled_delta_exec;
> +
> +	if (!rq->fair_server.dl_defer)
> +		return;
> +
> +	/* no need to discount more */
> +	if (rq->fair_server.runtime < 0)
> +		return;
> +
> +	delta_exec = rq_clock_task(rq) - p->se.exec_start;
> +	if (delta_exec < 0)
> +		return;
> +
> +	scaled_delta_exec = dl_scalled_delta_exec(rq, &rq->fair_server, delta_exec);
> +
> +	rq->fair_server.runtime -= scaled_delta_exec;
> +
> +	if (rq->fair_server.runtime < 0) {
> +		rq->fair_server.dl_defer_running = 0;
> +		rq->fair_server.runtime = 0;
> +	}
> +
> +	p->se.exec_start = rq_clock_task(rq);
> +}
> +