__compute_runnable_contrib() uses a loop to compute sum, whereas a
table lookup can do it faster in constant time.
Signed-off-by: Yuyang Du <[email protected]>
---
kernel/sched/fair.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b8cc1c3..32bc4ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2603,6 +2603,15 @@ static const u32 runnable_avg_yN_sum[] = {
};
/*
+ * Precomputed \Sum y^k (1<=k<=n, where n%32==0). Values are rolled down to
+ * lower integers.
+ */
+static const u32 __accumulated_sum_N32[] = {
+ 0, 23371, 35056, 40899, 43820, 45281,
+ 46011, 46376, 46559, 46650, 46696, 46719,
+};
+
+/*
* Approximate:
* val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
*/
@@ -2650,14 +2659,9 @@ static u32 __compute_runnable_contrib(u64 n)
else if (unlikely(n >= LOAD_AVG_MAX_N))
return LOAD_AVG_MAX;
- /* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
- do {
- contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
- contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
-
- n -= LOAD_AVG_PERIOD;
- } while (n > LOAD_AVG_PERIOD);
-
+ /* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
+ contrib = __accumulated_sum_N32[n/LOAD_AVG_PERIOD];
+ n %= LOAD_AVG_PERIOD;
contrib = decay_load(contrib, n);
return contrib + runnable_avg_yN_sum[n];
}
--
1.7.9.5
On Fri, 2016-04-08 at 10:07 +0800, Yuyang Du wrote:
> __compute_runnable_contrib() uses a loop to compute sum, whereas a
> table lookup can do it faster in a constant time.
Perhaps this becomes rather fragile code overly dependent on the
current #define values of LOAD_AVG_MAX_N and LOAD_AVG_PERIOD.
Perhaps this comment just above the definitions of LOAD_AVG_MAX_N
and LOAD_AVG_PERIOD should be updated to include this new table:
 * Note: The tables runnable_avg_yN_inv and runnable_avg_yN_sum are
 * dependent on this value.
Perhaps the __ prefix for __accumulated_sum_N32 is odd as both of
the runnable_avg_yN_ tables are not __ prefixed.
On Fri, Apr 08, 2016 at 10:07:20AM +0800, Yuyang Du wrote:
> __compute_runnable_contrib() uses a loop to compute sum, whereas a
> table loopup can do it faster in a constant time.
> - /* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
> - do {
> - contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
> - contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
> -
> - n -= LOAD_AVG_PERIOD;
> - } while (n > LOAD_AVG_PERIOD);
> -
> + /* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
> + contrib = __accumulated_sum_N32[n/LOAD_AVG_PERIOD];
> + n %= LOAD_AVG_PERIOD;
> contrib = decay_load(contrib, n);
> return contrib + runnable_avg_yN_sum[n];
You replace a simple loop with a DIV instruction and a potential extra
cachemiss.
Is that really faster? What is the median 'n' for which we run that
loop? IOW how many loops do we normally do?
And remember that while recent Intel chips are really good at divisions,
not everybody is (and even then they're still slow).
On Fri, Apr 08, 2016 at 12:44:32PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 08, 2016 at 10:07:20AM +0800, Yuyang Du wrote:
> > __compute_runnable_contrib() uses a loop to compute sum, whereas a
> > table loopup can do it faster in a constant time.
>
> > - /* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
> > - do {
> > - contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
> > - contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
> > -
> > - n -= LOAD_AVG_PERIOD;
> > - } while (n > LOAD_AVG_PERIOD);
> > -
> > + /* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
> > + contrib = __accumulated_sum_N32[n/LOAD_AVG_PERIOD];
> > + n %= LOAD_AVG_PERIOD;
> > contrib = decay_load(contrib, n);
> > return contrib + runnable_avg_yN_sum[n];
>
> You replace a simple loop with a DIV instruction and a potential extra
> cachemiss.
Duh, LOAD_AVG_PERIOD is 32, that's a simple shift.
On Fri, Apr 08, 2016 at 03:31:41AM -0700, Joe Perches wrote:
> On Fri, 2016-04-08 at 10:07 +0800, Yuyang Du wrote:
> > __compute_runnable_contrib() uses a loop to compute sum, whereas a
> > table lookup can do it faster in a constant time.
>
> Perhaps this becomes rather fragile code overly dependent on the
> current #define values of LOAD_AVG_MAX_N and LOAD_AVG_PERIOD.
Too late for that, these tables already heavily depend on these values.
Commit 5b51f2f80b3b ("sched: Make __update_entity_runnable_avg() fast"),
which introduced these tables, includes a program to re-compute the
values if we ever need to change them.
It might maybe make sense for Yuyang to update that program to also
compute this new table and include the whole new program in the
Changelog of this patch.
On Fri, Apr 08, 2016 at 10:07:20AM +0800, Yuyang Du wrote:
> __compute_runnable_contrib() uses a loop to compute sum, whereas a
> table loopup can do it faster in a constant time.
>
> Signed-off-by: Yuyang Du <[email protected]>
> ---
> kernel/sched/fair.c | 20 ++++++++++++--------
> 1 file changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b8cc1c3..32bc4ef 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2603,6 +2603,15 @@ static const u32 runnable_avg_yN_sum[] = {
> };
>
> /*
> + * Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to
> + * lower integers.
> + */
> +static const u32 __accumulated_sum_N32[] = {
> + 0, 23371, 35056, 40899, 43820, 45281,
> + 46011, 46376, 46559, 46650, 46696, 46719,
> +};
Could we come up with a name that better indicates that there is
relation between this new table and runnable_avg_yN_sum[]?
runnable_avg_N32_sum[]?
> +
> +/*
> * Approximate:
> * val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
> */
> @@ -2650,14 +2659,9 @@ static u32 __compute_runnable_contrib(u64 n)
> else if (unlikely(n >= LOAD_AVG_MAX_N))
> return LOAD_AVG_MAX;
>
> - /* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
> - do {
> - contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
> - contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
> -
> - n -= LOAD_AVG_PERIOD;
> - } while (n > LOAD_AVG_PERIOD);
> -
> + /* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
> + contrib = __accumulated_sum_N32[n/LOAD_AVG_PERIOD];
> + n %= LOAD_AVG_PERIOD;
> contrib = decay_load(contrib, n);
> return contrib + runnable_avg_yN_sum[n];
> }
Looks okay to me, but as PeterZ already said, it would be good to have
the code to compute that values.
With that, feel free to put my acked/reviewed-by on it if you like.
Hi,
On 08/04/16 12:54, Peter Zijlstra wrote:
> On Fri, Apr 08, 2016 at 03:31:41AM -0700, Joe Perches wrote:
> > On Fri, 2016-04-08 at 10:07 +0800, Yuyang Du wrote:
> > > __compute_runnable_contrib() uses a loop to compute sum, whereas a
> > > table lookup can do it faster in a constant time.
> >
> > Perhaps this becomes rather fragile code overly dependent on the
> > current #define values of LOAD_AVG_MAX_N and LOAD_AVG_PERIOD.
>
> Too late for that, these tables already heavily depend on these values.
>
> Commit 5b51f2f80b3b ("sched: Make __update_entity_runnable_avg() fast"),
> which introduced these tables, includes a program to re-compute the
> values if we ever need to change them.
>
> It might maybe make sense for Yuyang to update that program to also
> compute this new table and include the whole new program in the
> Changelog of this patch.
>
+1 on this. That program turns out to be really useful if one needs to
recompute tables.
Best,
- Juri