From: KUROSAWA Takahiro
To: linux-kernel@vger.kernel.org, ckrm-tech@lists.sourceforge.net
Cc: KUROSAWA Takahiro
Message-Id: <20060209061142.2164.35994.sendpatchset@debian>
Subject: [PATCH 0/2] CKRM CPU resource controller
Date: Thu, 9 Feb 2006 15:11:42 +0900 (JST)

This patchset adds a CPU resource controller for CKRM.  The CPU resource
controller manages CPU resources by scaling the timeslice allocated to
each task, without changing the algorithm of the O(1) scheduler.  A
document that describes how the resource controller works is attached at
the end of this mail.

Performance tests showed that the overhead introduced by this patchset is
negligibly small.  Here is a summary of the performance tests:

(1) overhead

    lat_ctx -s 0 -W 100 -N 1000 $N, with N = 2..500

      N    2.6.15   cpurc   delta (cpurc - 2.6.15) [us]
    ---------------------------------------------------
      2     0.48     0.50     0.02
      3     0.69     0.70     0.01
      4     0.65     0.71     0.06
      5     0.68     0.70     0.02
      6     0.66     0.73     0.07
      7     0.74     0.69    -0.05
      8     0.67     0.76     0.09
      9     0.68     0.71     0.03
     10     0.66     0.71     0.05
     20     0.72     0.74     0.02
     30     0.75     0.77     0.04
     40     0.77     0.79     0.02
     50     0.83     0.84     0.02
     60     0.86     0.86    -0.02
     70     0.90     0.86    -0.04
     80     0.92     0.89    -0.03
     90     1.00     0.92    -0.08
    100     0.96     1.03     0.07
    200     1.55     1.65     0.10
    300     2.30     2.23    -0.07
    400     2.80     2.85     0.05
    500     3.12     3.20     0.08

    The overhead is smaller than the measurement error.  See also the
    following graph:
    http://prdownloads.sourceforge.net/ckrm/cpurc-v0.3-2615-lat_ctx.pdf?download

(2) accuracy of share

    kernbench run together with two infinite loops on a 2-CPU machine
    (times in seconds)

    %Share    Elapsed    User    Sys    %CPU
    ----------------------------------------
       10     2515.5    571.8    61.6    24.8
       20     1349.3    578.1    59.7    46.8
       30      905.5    575.7    59.8    69.8
       40      706.9    574.1    60.1    89.4
       50      586.8    572.4    60.1   107.2
       60      494.2    572.6    60.1   127.2
       70      430.0    572.4    60.9   146.6
       80      368.4    571.2    60.5   171.0
       90      328.4    572.1    60.6   192.2
      100      320.5    571.8    60.6   196.8

    Notice that the maximum share is 100% regardless of the number of
    CPUs, while the maximum %CPU on this machine is 200%.  Therefore,
    %CPU should be twice %Share.  See also the following graph:
    http://prdownloads.sourceforge.net/ckrm/cpurc-v0.3-2615-kernbench.pdf?download

-----------------------------------------------------------------------------
How the CPU resource controller works

There are three components in the CPU resource controller, plus one
supplementary mechanism:

  (1)  load estimation
  (2)  hungry detection
  (3)  timeslice scaling
  (3') task requeueing

We estimate the class load in order to check whether the guarantee is
satisfied.  The class load can also drop below the guarantee simply
because all the tasks in the class tend to sleep, so hungry detection
checks whether the class actually needs to be scheduled more.  If a class
needs to be scheduled more, the timeslices of tasks are adjusted by
timeslice scaling.  When timeslice scaling cannot satisfy the guarantee,
task requeueing supplements it.
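As a rough sketch of how these pieces might fit together, the following
shows the flow at timeslice expiry.  This is not the patch code: apart
from cpu_rc_account() and cpu_rc_record_allocation(), which are described
below, every identifier here is made up for illustration.

    /*
     * Sketch only; not the patch code.  Shows how the components
     * described below might hook into the O(1) scheduler's timeslice
     * refill path.  Only cpu_rc_account() and cpu_rc_record_allocation()
     * are described in this document; the rest is illustrative.
     */
    static void cpu_rc_timeslice_expired(struct task_struct *p)
    {
            struct cpu_rc *rc = task_cpu_rc(p);          /* hypothetical lookup  */
            unsigned int slice = task_timeslice_orig(p); /* O(1) scheduler value */

            cpu_rc_account(p);                  /* (1) fold Ts/Tr into class load */

            if (cpu_rc_recalc_due(rc)) {
                    cpu_rc_detect_hungry(rc);   /* (2) class load vs. guarantee   */
                    cpu_rc_update_tsfactor(rc); /* (3) adapt the scaling factor F */
            }

            /* (3) hand out a scaled timeslice for the next round */
            p->time_slice = slice * rc->tsfactor / CPU_RC_TSFACTOR_MAX;
            cpu_rc_record_allocation(p);        /* remember Ts and ts_alloced     */

            /* (3') tasks of a starving class go back to the active queue */
            if (cpu_rc_starving(rc))
                    requeue_to_active(p);
    }

The sections below describe each step in detail.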
1. Load estimation

We calculate the class load as the accumulation of the task loads in the
class, so we first calculate each task load and then derive the class
load from the task loads.

Task load estimation:

The task load is estimated as the ratio of:
  * the timeslice value allocated to the task (Ts)
to:
  * the time the task takes to run out of the allocated timeslice (Tr).
If a task can use all of the CPU time, for example, Ts / Tr becomes 1.

The detailed procedure of the calculation is as follows:

(1) Record the timeslice (Ts) and the time when the timeslice is
    allocated to the task (by calling cpu_rc_record_allocation()).
      * The timeslice value is recorded in task->last_slice (= Ts).
      * The time is recorded in task->ts_alloced.

(2) Calculate the task load when the timeslice expires (by calling
    cpu_rc_account()).  Tr is calculated as:

        Tr = jiffies - task->ts_alloced

    Then the task load (Ts / Tr) becomes:

        Ts / Tr = task->last_slice / (jiffies - task->ts_alloced)

    The load value is scaled by CPU_RC_LOAD_SCALE; a load value equal to
    CPU_RC_LOAD_SCALE indicates 100% CPU usage.

    task->ts_alloced   task scheduled                     now
    v                  v                                  v
    |------------------=================================== |
                       |<-------------------------------->| Ts  (= task->last_slice)
    |<------------------------------------------------->| Tr  (= now - task->ts_alloced)
    |<---------------->|                                     (the time the task isn't scheduled)

Note that, strictly speaking, the task load should also be recalculated
when a task forks or exits, because the timeslice changes on fork and
exit.  We do not do so, both to keep the code simple and to avoid adding
overhead to fork and exit; the estimate is likely accurate enough without
recalculating the task load on fork/exit.

Class load estimation:

The class load is the accumulation of the load values of the tasks in the
class over the period CPU_RC_SPREAD_PERIOD.  The per-CPU class load is
recalculated each time a task load is calculated in the cpu_rc_account()
function.  Then, every CPU_RC_RECALC_INTERVAL, the class-wide load value
is calculated as the average of the per-CPU class loads.

A task load is accumulated into the per-CPU class load as if the class
had used Ts/Tr of the CPU time from task->ts_alloced to now (the time the
timeslice expired).  So the time Ttsk that the task has used the CPU from
(now - CPU_RC_SPREAD_PERIOD) to now should be:

    if task->ts_alloced < now - CPU_RC_SPREAD_PERIOD:

        Ttsk = Ts/Tr * CPU_RC_SPREAD_PERIOD

        (We assume that the task has used the CPU at the constant rate
        of Ts/Tr.)

             now-CPU_RC_SPREAD_PERIOD               now
             v                                      v
             |--------------------------------------|
        |===========================================| load: Ts/Tr
        ^
        task->ts_alloced

    else:

        Ttsk = Ts

             now-CPU_RC_SPREAD_PERIOD               now
             v                                      v
             |--------------------------------------|
                         |==========================| load: Ts/Tr
                         ^
                         task->ts_alloced

Also, we assume that the class used the CPU at the rate of the previous
class load from (now - CPU_RC_SPREAD_PERIOD) until the last time the
per-CPU class load was calculated (stored in cpu_rc::stat[cpu].timestamp).
If cpu_rc::stat[cpu].timestamp < now - CPU_RC_SPREAD_PERIOD, we assume
that the class did not use the CPU from (now - CPU_RC_SPREAD_PERIOD) to
task->ts_alloced.  So the time Tcls that the class used the CPU from
(now - CPU_RC_SPREAD_PERIOD) to now should be:

    if cpu_rc::stat[cpu].timestamp < now - CPU_RC_SPREAD_PERIOD:

        Tcls = 0

    else:

        Tcls = cpu_rc::stat[cpu].load
                 * (cpu_rc::stat[cpu].timestamp - (now - CPU_RC_SPREAD_PERIOD))

The new per-CPU class load, which will be assigned to
cpu_rc::stat[cpu].load, is calculated as:

    (Ttsk + Tcls) / CPU_RC_SPREAD_PERIOD
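To make the arithmetic above concrete, here is a minimal sketch of the
per-CPU class load update performed in cpu_rc_account(), written against
the quantities just defined.  The struct layout, the constant values, and
the helper itself are assumptions for illustration, not the patch code.

    #include <linux/jiffies.h>      /* jiffies, time_before(), HZ */

    #define CPU_RC_LOAD_SCALE       1024    /* 100% CPU usage (assumed value)   */
    #define CPU_RC_SPREAD_PERIOD    HZ      /* averaging window (assumed value) */

    struct cpu_rc_stat {
            unsigned long load;             /* per-CPU class load, scaled       */
            unsigned long timestamp;        /* last recalculation, in jiffies   */
    };

    /* Called when a task's timeslice expires: fold Ts/Tr into the class load. */
    static void cpu_rc_account_sketch(struct cpu_rc_stat *st,
                                      unsigned long last_slice,  /* Ts, jiffies */
                                      unsigned long ts_alloced)
    {
            unsigned long now = jiffies;
            unsigned long start = now - CPU_RC_SPREAD_PERIOD;
            unsigned long tr = now - ts_alloced;                 /* Tr */
            unsigned long task_load, ttsk, tcls;

            if (!tr)                /* slice expired within the same jiffy */
                    tr = 1;
            task_load = last_slice * CPU_RC_LOAD_SCALE / tr;     /* Ts/Tr, scaled */

            /* Ttsk: CPU time the task is assumed to have used in the window. */
            if (time_before(ts_alloced, start))
                    ttsk = task_load * CPU_RC_SPREAD_PERIOD / CPU_RC_LOAD_SCALE;
            else
                    ttsk = last_slice;

            /* Tcls: CPU time the class is assumed to have used earlier in the
             * window, at the rate of the previous per-CPU class load.        */
            if (time_before(st->timestamp, start))
                    tcls = 0;
            else
                    tcls = st->load * (st->timestamp - start) / CPU_RC_LOAD_SCALE;

            /* New per-CPU class load, again scaled by CPU_RC_LOAD_SCALE.     */
            st->load = (ttsk + tcls) * CPU_RC_LOAD_SCALE / CPU_RC_SPREAD_PERIOD;
            st->timestamp = now;
    }

Note the fixed-point convention: a load of CPU_RC_LOAD_SCALE means 100%
of one CPU, so Ttsk and Tcls are kept in jiffies and only converted back
to a scaled rate in the final division.

2.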
Hungry detection

When the class load is less than the guarantee, there are two cases:

  (a) the guarantee is sufficient, and the tasks in the class have spare
      time to sleep
  (b) tasks in other classes are overusing the CPU

We should not scale the timeslices in case (a), even though the class
load is lower than the guarantee.  In order to distinguish case (b) from
case (a), we measure the time (Tsch) from when a task is activated
(stored in task->last_activated) until the task is actually scheduled.
If the class load is lower than the guarantee but the tasks in the class
are scheduled quickly, the class can be classified as case (a).

If the ratio of the timeslice of a task to its Tsch is lower than the
guarantee, the class that contains the task is marked as "maybe hungry."
If the class load of a class marked "maybe hungry" is also lower than the
guarantee, the class is treated as hungry, and the timeslices of tasks in
the other classes will be scaled down.

3. Timeslice scaling

If there are hungry classes, we need to adjust the timeslices to satisfy
the guarantees.  To scale timeslices, we introduce a scaling factor F,
associated with each class (stored in the cpu_rc structure) and
adaptively adjusted according to the class load and the guarantee.

While some classes are hungry, the scaling factor of each class that is
not hungry is recalculated as:

    F_new = F * guarantee / class_load

and the scaling factor of a hungry class as:

    F_new = F + 1

When no class is hungry, the scaling factor is recalculated as follows,
so that the timeslices recover:

    F_new = F + CPU_RC_TSFACTOR_INC    (CPU_RC_TSFACTOR_INC is defined as 5)

Note that the maximum value of F is capped at CPU_RC_TSFACTOR_MAX.  The
timeslice assigned to each task is:

    timeslice_scaled = timeslice_orig * F / CPU_RC_TSFACTOR_MAX

where timeslice_orig is the value calculated by the unmodified O(1)
scheduler.  For example, a class whose F has settled at half of
CPU_RC_TSFACTOR_MAX hands out timeslices of half the usual length.

3'. Task requeueing

There are cases where timeslice scaling alone cannot satisfy the
guarantee, because the timeslice has defined minimum and maximum values.
The minimum value is 1, so a class with a low guarantee but a huge number
of tasks may still beat a class with a high guarantee but a small number
of tasks.  Task requeueing handles such cases.  If the hungry state
persists for more than CPU_RC_STARVE_THRESHOLD (defined as 2)
scaling-factor recalculations, the class is considered to be starving.
Tasks in a starving class are requeued to the active queue when their
timeslices expire.

--
KUROSAWA, Takahiro
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/