Hello!
Making some progress on CONFIG_PREEMPT_RT-compatible RCU. Am exploring
two approaches, the lock-based approach discussed earlier and a
counter-flip approach, with the latter very likely being the method
of choice. The reason for working the former is to get myself up to
speed on details of CONFIG_PREEMPT_RT with something relatively simple.
I am basing my work off of:
http://people.redhat.com/mingo/realtime-preempt/older/realtime-preempt-2.6.12-rc1-V0.7.41-14
since it works well on a four-CPU x86 system. I will port forward when
I have something worth considering for inclusion. To reiterate, the
patches referenced below are playtoys, for experimental and educational
use only.

Lock-Based Approach

1. Trivial working patch:
http://www.rdrop.com/users/paulmck/RCU/patches/lbdf-2a.patch
This one uses a single rwlock and a single global callback
list. This of course means that only one task may be in
an RCU read-side critical section at a time. Even so,
I had to split synchronize_kernel() into synchronize_rcu()
and synchronize_sched() -- I get infrequent hangs otherwise.
The implementation of synchronize_sched() is probably not what
it eventually needs to be, since it simply forces each CPU to
context switch, whether voluntarily or via preemption. Will be looking
into this later on.
2. Slightly less trivial working patch:
http://www.rdrop.com/users/paulmck/RCU/patches/lbdf-3a.patch
This one uses a per-CPU rwlock, but keeps the single global
callback list. It is otherwise identical to #1.
Next step is to go to per-CPU callback lists. If I was taking this
approach seriously, I would also experiment with multiple RCU read-side
locks per CPU, but I don't believe I would learn anything from that
exercise.
The reason that I am not taking this approach seriously is that it
can impose high latencies on RCU read-side critical sections, as
discussed earlier on LKML. It also has high rcu_read_lock() and
rcu_read_unlock() overhead.
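
Roughly speaking, the read side of these lock-based patches boils down
to something like the following sketch (names made up, callback-list
handling omitted); under CONFIG_PREEMPT_RT the rwlock is a sleeping
lock that admits only one reader, which is where both the read-side
latency and the read-side overhead come from:

static rwlock_t rcu_global_lock = RW_LOCK_UNLOCKED;    /* name is a guess */

void rcu_read_lock(void)
{
        read_lock(&rcu_global_lock);
}

void rcu_read_unlock(void)
{
        read_unlock(&rcu_global_lock);
}

void synchronize_rcu(void)
{
        /* Once the write lock is acquired, all prior readers are done. */
        write_lock(&rcu_global_lock);
        write_unlock(&rcu_global_lock);
}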

Counter-Based Approach

The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
counter-based approach, which seems to work, but which can result in
indefinite-duration grace periods. The following are very hazy thoughts
on how to get the benefits of this approach, but with short grace periods.
1. The basic trick is to maintain a pair of counters per CPU.
There would also be a global boolean variable that would select
one or the other of each pair. The rcu_read_lock() primitive
would then increment the counter indicated by the boolean
corresponding to the CPU that it is currently running on.
It would also keep a pointer to that particular counter in
the task structure. The rcu_read_unlock() primitive would
decrement this counter. (And, yes, you would also have a
counter in the task structure so that only the outermost of
a set of nested rcu_read_lock()/rcu_read_unlock() pairs would
actually increment/decrement the per-CPU counter pairs.)
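(For concreteness, a rough sketch of these read-side primitives appears
after this list.)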
To force a grace period, one would invert the value of the
global boolean variable. Once all the counters indicated
by the old value of the global boolean variable hit zero,
the corresponding set of RCU callbacks can be safely invoked.
The big problem with this approach is that a pair of inversions
of the global boolean variable could be spaced arbitrarily
closely, especially when you consider that the read side code
can be preempted. This could cause RCU callbacks to be invoked
prematurely, which could greatly reduce the life expectancy
of your kernel.
2. #1 above, but use a seqlock to guard the counter selection in
rcu_read_lock(). One potential disadvantage of this approach
is that an extremely unlucky instance of rcu_read_lock() might
be indefinitely delayed by a series of counter flips. I am
concerned that this might actually happen under low-memory
conditions. Also requires memory barriers on the read side,
which we might be stuck with, but still hope to be able to
get rid of. And the per-CPU counter manipulation must use
atomic instructions.
3. #1 above, but use per-CPU locks to guard the counter selection.
I don't like this any better than #2, worse, in fact, since it
requires expensive atomic instructions as well.
4. The Australian National Zoo alternative: keep the counter pairs
in the task structure rather than keeping them per-CPU. This
eliminates the need for atomic operations in rcu_read_lock() and
rcu_read_unlock(), but makes the update side do horribly expensive
task-list trawls. [So named because I thought of it while trying
to jog to the Australian National Zoo. I took a wrong turn, and
ended up running up a valley on the other side of Black Mountain,
so never did make it to the zoo. On the other hand, I did encounter
a herd of wild kangaroo and also thought up this approach, so I
think I came out OK on the deal.]
5. The National Train notion: #4 above, but keep a separate list
containing only preempted tasks that had non-zero RCU counters
at the time of preemption. In the (presumably) common case of
no preemption in RCU read-side critical sections, both the
read-side and the update-side overhead is low. But... There
is a problem with detecting tasks that are in long-running
RCU read-side critical sections that don't get preempted.
[So named because I thought of it on the UK National Train
somewhere between London and Winchester.]
6. Oak Hills option: keep per-CPU counters, which require atomic
increment/decrement in the general case, but use a fastpath
that (with preemption disabled) checks to see if the value of
the counter is zero (for rcu_read_lock()) or one (for
rcu_read_unlock()), and, if so, does the counter manipulation
non-atomically. Use atomics on the (presumably infrequent)
slow path, which is taken if someone gets preempted in the middle
of an RCU read-side critical section.
Handle races between rcu_read_lock() and counter flips by
having rcu_read_lock() increment the counter, then checking
to see if it incremented the correct counter of the pair.
If it did not (i.e., the flip just happened), increment
the other counter of the pair as well, recording the fact that
both were incremented in the task struct. The rcu_read_unlock()
primitive then decrements any/all counters that rcu_read_lock()
incremented.
Memory barriers are still needed in the non-atomic increment
and decrement cases. However, it may be possible to leverage
naturally occurring memory barriers (see for example Joe Seigh's
recent LKML posting on RCU+SMR: http://lkml.org/lkml/2005/5/9/129).
If the naturally occurring memory barriers aren't happening fast
enough (e.g., low memory situation), a round of IPIs should
suffice, for example, smp_call_function() to a function that
advances the callbacks on each CPU.
If this one pans out, the common-case overhead of rcu_read_lock()
and rcu_read_unlock() would not be much more expensive than the
current CONFIG_PREEMPT implementations.
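
For concreteness, the read side of #1 might look something like the
sketch below. The rcu_read_nesting and rcu_flip_ctr fields in the task
structure are hypothetical names, the update side and headers are
omitted, and the sketch deliberately ignores the flip race called out
in #1 (which is what #2, #3, and #6 are trying to deal with):

struct rcu_ctr_pair {
        atomic_t ctr[2];                /* the pair of counters for this CPU */
};
static DEFINE_PER_CPU(struct rcu_ctr_pair, rcu_ctr_pair);
static int rcu_flip_idx;                /* the global boolean */

void rcu_read_lock(void)
{
        if (current->rcu_read_nesting++ == 0) {
                atomic_t *ctr;

                preempt_disable();      /* stay on this CPU while choosing a counter */
                ctr = &__get_cpu_var(rcu_ctr_pair).ctr[rcu_flip_idx];
                atomic_inc(ctr);
                current->rcu_flip_ctr = ctr;    /* remember which counter we bumped */
                preempt_enable();
        }
}

void rcu_read_unlock(void)
{
        if (--current->rcu_read_nesting == 0)
                atomic_dec(current->rcu_flip_ctr);
}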
There are probably better approaches, but that is what I have thus far.
Thoughts?
Thanx, Paul
On Mon, 9 May 2005, Paul E. McKenney wrote:
> Hello!
>
> Making some progress on CONFIG_PREEMPT_RT-compatible RCU. Am exploring
> two approaches, the lock-based approach discussed earlier and a
> counter-flip approach, with the latter very likely being the method
> of choice. The reason for working the former is to get myself up to
> speed on details of CONFIG_PREEMPT_RT with something relatively simple.
>
> I am basing my work off of:
>
> http://people.redhat.com/mingo/realtime-preempt/older/realtime-preempt-2.6.12-rc1-V0.7.41-14
>
> since it works well on a four-CPU x86 system. I will port forward when
> I have something worth considering for inclusion. To reiterate, the
> patches referenced below are playtoys, for experimental and educational
> use only.
>
>
> Lock-Based Approach
>
> 1. Trivial working patch:
>
> http://www.rdrop.com/users/paulmck/RCU/patches/lbdf-2a.patch
>
> This one uses a single rwlock and a single global callback
> list. This of course means that only one task may be in
> an RCU read-side critical section at a time. Even so,
> I had to split synchronize_kernel() into synchronize_rcu()
> and synchronize_sched() -- I get infrequent hangs otherwise.
> The implementation of synchronize_sched() is probably not what
> it eventually needs to be, since it simply forces each CPU to
> context switch, whether voluntarily or via preemption. Will be looking
> into this later on.
>
*shiver* A single rwlock is bad in all respects, both regarding
scaling and real-time behaviour.
> 2. Slightly less trivial working patch:
>
> http://www.rdrop.com/users/paulmck/RCU/patches/lbdf-3a.patch
>
> This one uses a per-CPU rwlock, but keeps the single global
> callback list. It is otherwise identical to #1.
>
Again, a single rwlock is very bad for RT behaviour.
> Next step is to go to per-CPU callback lists. If I was taking this
> approach seriously, I would also experiment with multiple RCU read-side
> locks per CPU, but I don't believe I would learn anything from that
> exercise.
>
> The reason that I am not taking this approach seriously is that it
> can impose high latencies on RCU read-side critical sections, as
> discussed earlier on LKML. It also has high rcu_read_lock() and
> rcu_read_unlock() overhead.
>
>
> Counter-Based Approach
>
> The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> counter-based approach, which seems to work, but which can result in
> indefinite-duration grace periods.
Is that really a problem in practice? After all, RCU updates should happen
rarely.
> The following are very hazy thoughts
> on how to get the benefits of this approach, but with short grace periods.
>
> 1. The basic trick is to maintain a pair of counters per CPU.
> There would also be a global boolean variable that would select
> one or the other of each pair. The rcu_read_lock() primitive
> would then increment the counter indicated by the boolean
> corresponding to the CPU that it is currently running on.
> It would also keep a pointer to that particular counter in
> the task structure. The rcu_read_unlock() primitive would
> decrement this counter.
(And, yes, you would also have a
> counter in the task structure so that only the outermost of
> a set of nested rcu_read_lock()/rcu_read_unlock() pairs would
> actually increment/decrement the per-CPU counter pairs.)
Why bother? What is the overhead of checking relative to just doing the
increment/decrement?
I had another idea: Do it in the scheduler:

per_cpu_rcu_count += prev->rcu_read_count;
if (per_cpu_rcu_count == 0) {
        /* grace period encountered */
}
(do the task switch)
per_cpu_rcu_count -= next->rcu_read_count;

Then per_cpu_rcu_count is the rcu-read-count for all tasks except the
current. During schedule there is no current and therefore it is the total
count.
>
> To force a grace period, one would invert the value of the
> global boolean variable. Once all the counters indicated
> by the old value of the global boolean variable hit zero,
> the corresponding set of RCU callbacks can be safely invoked.
>
> The big problem with this approach is that a pair of inversions
> of the global boolean variable could be spaced arbitrarily
> closely, especially when you consider that the read side code
> can be preempted. This could cause RCU callbacks to be invoked
> prematurely, which could greatly reduce the life expectancy
> of your kernel.
>
> 2. #1 above, but use a seqlock to guard the counter selection in
> rcu_read_lock(). One potential disadvantage of this approach
> is that an extremely unlucky instance of rcu_read_lock() might
> be indefinitely delayed by a series of counter flips. I am
> concerned that this might actually happen under low-memory
> conditions. Also requires memory barriers on the read side,
> which we might be stuck with, but still hope to be able to
> get rid of. And the per-CPU counter manipulation must use
> atomic instructions.
Why not just use
preempt_disable();
manipulate counters;
preempt_enable();
??
>
> 3. #1 above, but use per-CPU locks to guard the counter selection.
> I don't like this any better than #2, worse, in fact, since it
> requires expensive atomic instructions as well.
local_irq_disable() and/or preempt_disable() should work, as the counters
are per-cpu, right?
Or see the alternative above: The counter is per task but latched to a
per-cpu on schedule().
>
> 4. The Australian National Zoo alternative: keep the counter pairs
> in the task structure rather than keeping them per-CPU. This
> eliminates the need for atomic operations in rcu_read_lock() and
> rcu_read_unlock(), but makes the update side do horribly expensive
> task-list trawls. [So named because I thought of it while trying
> to jog to the Australian National Zoo. I took a wrong turn, and
> ended up running up a valley on the other side of Black Mountain,
> so never did make it to the zoo. On the other hand, I did encounter
> a herd of wild kangaroo and also thought up this approach, so I
> think I came out OK on the deal.]
>
> 5. The National Train notion: #4 above, but keep a separate list
> containing only preempted tasks that had non-zero RCU counters
> at the time of preemption. In the (presumably) common case of
> no preemption in RCU read-side critical sections, both the
> read-side and the update-side overhead is low. But... There
> is a problem with detecting tasks that are in long-running
> RCU read-side critical sections that don't get preempted.
> [So named because I thought of it on the UK National Train
> somewhere between London and Winchester.]
>
> 6. Oak Hills option: keep per-CPU counters, which require atomic
> increment/decrement in the general case, but use a fastpath
> that (with preemption disabled) checks to see if the value of
> the counter is zero (for rcu_read_lock()) or one (for
> rcu_read_unlock()), and, if so, does the counter manipulation
> non-atomically. Use atomics on the (presumably infrequent)
> slow path, which is taken if someone gets preempted in the middle
> of an RCU read-side critical section.
>
> Handle races between rcu_read_lock() and counter flips by
> having rcu_read_lock() increment the counter, then checking
> to see if it incremented the correct counter of the pair.
> If it did not (i.e., the flip just happened), increment
> the other counter of the pair as well, recording the fact that
> both were incremented in the task struct. The rcu_read_unlock()
> primitive then decrements any/all counters that rcu_read_lock()
> incremented.
>
> Memory barriers are still needed in the non-atomic increment
> and decrement cases. However, it may be possible to leverage
> naturally occurring memory barriers (see for example Joe Seigh's
> recent LKML posting on RCU+SMR: http://lkml.org/lkml/2005/5/9/129).
> If the naturally occurring memory barriers aren't happening fast
> enough (e.g., low memory situation), a round of IPIs should
> suffice, for example, smp_call_function() to a function that
> advances the callbacks on each CPU.
>
> If this one pans out, the common-case overhead of rcu_read_lock()
> and rcu_read_unlock() would not be much more expensive than the
> current CONFIG_PREEMPT implementations.
>
> There are probably better approaches, but that is what I have thus far.
>
I was playing around with it a little while back. I didn't send a patch
though as I didn't have time. It works on a UP machine but I don't have an
SMP to test on :-(
The ingredients are:
1) A per-cpu read-side counter, protected locally by preempt_disable().
2) A task read-side counter (doesn't need protection at all).
3) Tasks having non-zero read-side counter can't be migrated to other
CPUs. That is to make the per-cpu counter truly per-cpu.
4) When the scheduler sees a SCHED_OTHER task with a rcu-read-counter>0 it
boosts the priority to the maximum non-RT priority so that it will not be
preempted by other SCHED_OTHER tasks. This makes the delay in
synchronize_kernel() somewhat deterministic (it depends on how
"deterministic" the RT tasks in the system are.)
I'll try to send you what I got when I get my computer at home back up.
I have only tested it on my UP laptop, where it seems to run fine.
> Thoughts?
>
> Thanx, Paul
Esben
On Tue, May 10, 2005 at 12:55:31PM +0200, Esben Nielsen wrote:
> On Mon, 9 May 2005, Paul E. McKenney wrote:
>
> > Hello!
> >
> > Making some progress on CONFIG_PREEMPT_RT-compatible RCU. Am exploring
> > two approaches, the lock-based approach discussed earlier and a
> > counter-flip approach, with the latter very likely being the method
> > of choice. The reason for working the former is to get myself up to
> > speed on details of CONFIG_PREEMPT_RT with something relatively simple.
> >
> > I am basing my work off of:
> >
> > http://people.redhat.com/mingo/realtime-preempt/older/realtime-preempt-2.6.12-rc1-V0.7.41-14
> >
> > since it works well on a four-CPU x86 system. I will port forward when
> > I have something worth considering for inclusion. To reiterate, the
> > patches referenced below are playtoys, for experimental and educational
> > use only.
> >
> >
> > Lock-Based Approach
> >
> > 1. Trivial working patch:
> >
> > http://www.rdrop.com/users/paulmck/RCU/patches/lbdf-2a.patch
> >
> > This one uses a single rwlock and a single global callback
> > list. This of course means that only one task may be in
> > an RCU read-side critical section at a time. Even so,
> > I had to split synchronize_kernel() into synchronize_rcu()
> > and synchronize_sched() -- I get infrequent hangs otherwise.
> > The implementation of synchronize_sched() is probably not what
> > it eventually needs to be, since it simply forces each CPU to
> > context switch, whether voluntarily or via preemption. Will be looking
> > into this later on.
> >
> *shiver* A single rwlock is bad in all respects, both regarding
> scaling and real-time behaviour.
Agreed, see above: "for experimental and educational purposes only".
> > 2. Slightly less trivial working patch:
> >
> > http://www.rdrop.com/users/paulmck/RCU/patches/lbdf-3a.patch
> >
> > This one uses a per-CPU rwlock, but keeps the single global
> > callback list. It is otherwise identical to #1.
> >
> Again, a single rwlock is very bad for RT behaviour.
Again, please see above. BTW, this is per-CPU rwlock, not a single
global one. But of course your point still stands, since bad latencies
can still be transmitted via the locks.
The purpose of this patch is to educate me on CONFIG_PREEMPT_RT, -not-
to be submitted for inclusion! ;-)
> > Next step is to go to per-CPU callback lists. If I was taking this
> > approach seriously, I would also experiment with multiple RCU read-side
> > locks per CPU, but I don't believe I would learn anything from that
> > exercise.
> >
> > The reason that I am not taking this approach seriously is that it
> > can impose high latencies on RCU read-side critical sections, as
> > discussed earlier on LKML. It also has high rcu_read_lock() and
> > rcu_read_unlock() overhead.
> >
> >
> > Counter-Based Approach
> >
> > The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> > counter-based approach, which seems to work, but which can result in
> > indefinite-duration grace periods.
>
> Is that really a problem in practice? After all, RCU updates should happen
> rarely.
Yes. Several people have run into serious problems on small-memory
machines under denial-of-service workloads. Not pretty.
> > The following are very hazy thoughts
> > on how to get the benefits of this approach, but with short grace periods.
> >
> > 1. The basic trick is to maintain a pair of counters per CPU.
> > There would also be a global boolean variable that would select
> > one or the other of each pair. The rcu_read_lock() primitive
> > would then increment the counter indicated by the boolean
> > corresponding to the CPU that it is currently running on.
> > It would also keep a pointer to that particular counter in
> > the task structure. The rcu_read_unlock() primitive would
> > decrement this counter. (And, yes, you would also have a
> > counter in the task structure so that only the outermost of
> > a set of nested rcu_read_lock()/rcu_read_unlock() pairs would
> > actually increment/decrement the per-CPU counter pairs.)
>
> Why bother? What is the overhead of checking relative to just doing the
> increment/decrement?
The increment of the per-CPU counter pair is likely to incur a cache miss,
and thus be orders of magnitude more expensive than the increment of
the variable in the task structure. I did not propose this optimization
lightly. ;-)
> I had another idea: Do it in the scheduler:
> per_cpu_rcu_count += prev->rcu_read_count;
> if(per_cpu_rcu_count==0) {
> /* grace period encountered */
> }
> (do the task switch)
>
> per_cpu_rcu_count -= next->rcu_read_count;
>
> Then per_cpu_rcu_count is the rcu-read-count for all tasks except the
> current. During schedule there is no current and therefore it is the total
> count.
One downside of this approach is that it would maximally expand the
RCU read-side critical sections -- one of the things we need is to
avoid holding onto memory longer than necessary, as noted above.
Also, what prevents a grace period from ending while a task is
preempted in an RCU read-side critical section? Or are you intending
that synchronize_rcu() scan the task list as well as the per-CPU
counters?
> > To force a grace period, one would invert the value of the
> > global boolean variable. Once all the counters indicated
> > by the old value of the global boolean variable hit zero,
> > the corresponding set of RCU callbacks can be safely invoked.
> >
> > The big problem with this approach is that a pair of inversions
> > of the global boolean variable could be spaced arbitrarily
> > closely, especially when you consider that the read side code
> > can be preempted. This could cause RCU callbacks to be invoked
> > prematurely, which could greatly reduce the life expectancy
> > of your kernel.
> >
> > 2. #1 above, but use a seqlock to guard the counter selection in
> > rcu_read_lock(). One potential disadvantage of this approach
> > is that an extremely unlucky instance of rcu_read_lock() might
> > be indefinitely delayed by a series of counter flips. I am
> > concerned that this might actually happen under low-memory
> > conditions. Also requires memory barriers on the read side,
> > which we might be stuck with, but still hope to be able to
> > get rid of. And the per-CPU counter manipulation must use
> > atomic instructions.
>
> Why not just use
> preempt_disable();
> manipulate counters;
> preempt_enable();
> ??
Because we can race with the counter switch, which can happen on some
other CPU. Therefore, preempt_disable() does not help with this race.
> > 3. #1 above, but use per-CPU locks to guard the counter selection.
> > I don't like this any better than #2, worse, in fact, since it
> > requires expensive atomic instructions as well.
>
> local_irq_disable() and/or preempt_disable() should work, as the counters
> are per-cpu, right?
Again, neither of these help with the race against the counter-switch,
which would likely be happening on some other CPU.
> Or see the alternative above: The counter is per task but latched to a
> per-cpu on schedule().
This could work (assuming that synchronize_rcu() scanned the task list
as well as the per-CPU counters), but again would extend grace periods
too long for small-memory machines subject to denial-of-service attacks.
> > 4. The Australian National Zoo alternative: keep the counter pairs
> > in the task structure rather than keeping them per-CPU. This
> > eliminates the need for atomic operations in rcu_read_lock() and
> > rcu_read_unlock(), but makes the update side do horribly expensive
> > task-list trawls. [So named because I thought of it while trying
> > to jog to the Australian National Zoo. I took a wrong turn, and
> > ended up running up a valley on the other side of Black Mountain,
> > so never did make it to the zoo. On the other hand, I did encounter
> > a herd of wild kangaroo and also thought up this approach, so I
> > think I came out OK on the deal.]
> >
> > 5. The National Train notion: #4 above, but keep a separate list
> > containing only preempted tasks that had non-zero RCU counters
> > at the time of preemption. In the (presumably) common case of
> > no preemption in RCU read-side critical sections, both the
> > read-side and the update-side overhead is low. But... There
> > is a problem with detecting tasks that are in long-running
> > RCU read-side critical sections that don't get preempted.
> > [So named because I thought of it on the UK National Train
> > somewhere between London and Winchester.]
> >
> > 6. Oak Hills option: keep per-CPU counters, which require atomic
> > increment/decrement in the general case, but use a fastpath
> > that (with preemption disabled) checks to see if the value of
> > the counter is zero (for rcu_read_lock()) or one (for
> > rcu_read_unlock()), and, if so, does the counter manipulation
> > non-atomically. Use atomics on the (presumably infrequent)
> > slow path, which is taken if someone gets preempted in the middle
> > of an RCU read-side critical section.
> >
> > Handle races between rcu_read_lock() and counter flips by
> > having rcu_read_lock() increment the counter, then checking
> > to see if it incremented the correct counter of the pair.
> > If it did not (i.e., the flip just happened), increment
> > the other counter of the pair as well, recording the fact that
> > both were incremented in the task struct. The rcu_read_unlock()
> > primitive then decrements any/all counters that rcu_read_lock()
> > incremented.
> >
> > Memory barriers are still needed in the non-atomic increment
> > and decrement cases. However, it may be possible to leverage
> > naturally occurring memory barriers (see for example Joe Seigh's
> > recent LKML posting on RCU+SMR: http://lkml.org/lkml/2005/5/9/129).
> > If the naturally occurring memory barriers aren't happening fast
> > enough (e.g., low memory situation), a round of IPIs should
> > suffice, for example, smp_call_function() to a function that
> > advances the callbacks on each CPU.
> >
> > If this one pans out, the common-case overhead of rcu_read_lock()
> > and rcu_read_unlock() would not be much more expensive than the
> > current CONFIG_PREEMPT implementations.
> >
> > There are probably better approaches, but that is what I have thus far.
>
> I was playing around with it a little while back. I didn't send a patch
> though as I didn't have time. It works on a UP machine but I don't have an
> SMP to test on :-(
There are some available from OSDL, right? Or am I out of date here?
> The ingredients are:
> 1) A per-cpu read-side counter, protected locally by preempt_disable().
In order to avoid too-long grace periods, you need a pair of counters.
If you only have a single counter, you can end up with a series of
RCU read-side critical-section preemptions resulting in the counter
never reaching zero, and thus the grace period never terminating.
Not good, especially in small-memory machines subject to denial-of-service
attacks.
> 2) A task read-side counter (doesn't need protection at all).
Yep. Also a field in the task struct indicating which per-CPU counter
applies (unless the scheduler is managing this).
> 3) Tasks having non-zero read-side counter can't be migrated to other
> CPUs. That is to make the per-cpu counter truly per-cpu.
If CPU hotplug can be disallowed, then this can work. Do we really want
to disallow CPU hotplug? Especially given that some implementations
might want to use CPU hotplug to reduce power consumption? There also
might be issues with tickless operation.
> 4) When the scheduler sees a SCHED_OTHER task with a rcu-read-counter>0 it
> boosts the priority to the maximum non-RT priority so that it will not be
> preempted by other SCHED_OTHER tasks. This makes the delay in
> synchronize_kernel() somewhat deterministic (it depends on how
> "deterministic" the RT tasks in the system are.)
But one would only want to do this boost if there was a synchronize_rcu()
waiting -and- if there is some reason to accelerate grace periods (e.g.,
low on memory), right? If there is no reason to accelerate grace periods,
then extending them increases efficiency by amortizing grace-period
overhead over many updates.
> I'll try to send you what I got when I get my computer at home back up.
> I have only tested it on my UP laptop, where it seems to run fine.
I look forward to seeing it!
Thanx, Paul
On Mon, 2005-05-09 at 18:24 -0700, Paul E. McKenney wrote:
>
> Counter-Based Approach
>
> The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> counter-based approach, which seems to work, but which can result in
> indefinite-duration grace periods. The following are very hazy thoughts
> on how to get the benefits of this approach, but with short grace periods.
>
> 1. The basic trick is to maintain a pair of counters per CPU.
> There would also be a global boolean variable that would select
> one or the other of each pair. The rcu_read_lock() primitive
> would then increment the counter indicated by the boolean
> corresponding to the CPU that it is currently running on.
> It would also keep a pointer to that particular counter in
> the task structure. The rcu_read_unlock() primitive would
> decrement this counter. (And, yes, you would also have a
> counter in the task structure so that only the outermost of
> a set of nested rcu_read_lock()/rcu_read_unlock() pairs would
> actually increment/decrement the per-CPU counter pairs.)
>
> To force a grace period, one would invert the value of the
> global boolean variable. Once all the counters indicated
> by the old value of the global boolean variable hit zero,
> the corresponding set of RCU callbacks can be safely invoked.
>
> The big problem with this approach is that a pair of inversions
> of the global boolean variable could be spaced arbitrarily
> closely, especially when you consider that the read side code
> can be preempted. This could cause RCU callbacks to be invoked
> prematurely, which could greatly reduce the life expectancy
> of your kernel.
> Thoughts?
>
How about having another boolean indicating whether it is OK to flip the
selector boolean? This boolean would be cleared on an actual flip and set
again when the grace period completes. That way the flips cannot ever
interfere with one another such that the callbacks would be invoked
prematurely.
--
Peter Zijlstra <[email protected]>
On Tue, 10 May 2005 22:08:11 +0200, Peter Zijlstra said:
> How about having another boolean indicating whether it is OK to flip the
> selector boolean? This boolean would be cleared on an actual flip and set
> again when the grace period completes. That way the flips cannot ever
> interfere with one another such that the callbacks would be invoked
> prematurely.
As all the dining philosophers grab a fork and a spoon and dig in. ;)
On Tue, May 10, 2005 at 10:08:11PM +0200, Peter Zijlstra wrote:
> On Mon, 2005-05-09 at 18:24 -0700, Paul E. McKenney wrote:
>
> >
> > Counter-Based Approach
> >
> > The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> > counter-based approach, which seems to work, but which can result in
> > indefinite-duration grace periods. The following are very hazy thoughts
> > on how to get the benefits of this approach, but with short grace periods.
> >
> > 1. The basic trick is to maintain a pair of counters per CPU.
> > There would also be a global boolean variable that would select
> > one or the other of each pair. The rcu_read_lock() primitive
> > would then increment the counter indicated by the boolean
> > corresponding to the CPU that it is currently running on.
> > It would also keep a pointer to that particular counter in
> > the task structure. The rcu_read_unlock() primitive would
> > decrement this counter. (And, yes, you would also have a
> > counter in the task structure so that only the outermost of
> > a set of nested rcu_read_lock()/rcu_read_unlock() pairs would
> > actually increment/decrement the per-CPU counter pairs.)
> >
> > To force a grace period, one would invert the value of the
> > global boolean variable. Once all the counters indicated
> > by the old value of the global boolean variable hit zero,
> > the corresponding set of RCU callbacks can be safely invoked.
> >
> > The big problem with this approach is that a pair of inversions
> > of the global boolean variable could be spaced arbitrarily
> > closely, especially when you consider that the read side code
> > can be preempted. This could cause RCU callbacks to be invoked
> > prematurely, which could greatly reduce the life expectancy
> > of your kernel.
>
> > Thoughts?
>
> How about having another boolean indicating whether it is OK to flip the
> selector boolean? This boolean would be cleared on an actual flip and set
> again when the grace period completes. That way the flips cannot ever
> interfere with one another such that the callbacks would be invoked
> prematurely.
But the flip is an integral part of detecting a grace period. So, if I
understand your proposal correctly, I would have to flip to figure out
when it was safe to flip.
What am I missing here?
Thanx, Paul
On Tue, 2005-05-10 at 13:29 -0700, Paul E. McKenney wrote:
> On Tue, May 10, 2005 at 10:08:11PM +0200, Peter Zijlstra wrote:
> > On Mon, 2005-05-09 at 18:24 -0700, Paul E. McKenney wrote:
> >
> > >
> > > Counter-Based Approach
> > >
> > > The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> > > counter-based approach, which seems to work, but which can result in
> > > indefinite-duration grace periods. The following are very hazy thoughts
> > > on how to get the benefits of this approach, but with short grace periods.
> > >
> > > 1. The basic trick is to maintain a pair of counters per CPU.
> > > There would also be a global boolean variable that would select
> > > one or the other of each pair. The rcu_read_lock() primitive
> > > would then increment the counter indicated by the boolean
> > > corresponding to the CPU that it is currently running on.
> > > It would also keep a pointer to that particular counter in
> > > the task structure. The rcu_read_unlock() primitive would
> > > decrement this counter. (And, yes, you would also have a
> > > counter in the task structure so that only the outermost of
> > > a set of nested rcu_read_lock()/rcu_read_unlock() pairs would
> > > actually increment/decrement the per-CPU counter pairs.)
> > >
> > > To force a grace period, one would invert the value of the
> > > global boolean variable. Once all the counters indicated
> > > by the old value of the global boolean variable hit zero,
> > > the corresponding set of RCU callbacks can be safely invoked.
> > >
> > > The big problem with this approach is that a pair of inversions
> > > of the global boolean variable could be spaced arbitrarily
> > > closely, especially when you consider that the read side code
> > > can be preempted. This could cause RCU callbacks to be invoked
> > > prematurely, which could greatly reduce the life expectancy
> > > of your kernel.
> >
> > > Thoughts?
> >
> > How about having another boolean indicating whether it is OK to flip the
> > selector boolean? This boolean would be cleared on an actual flip and set
> > again when the grace period completes. That way the flips cannot ever
> > interfere with one another such that the callbacks would be invoked
> > prematurely.
>
> But the flip is an integral part of detecting a grace period. So, if I
> understand your proposal correctly, I would have to flip to figure out
> when it was safe to flip.
>
> What am I missing here?
int can_flip = 1;
int selector = 0;

int counter[2] = {0, 0};

void up()
{
        ++counter[current->selection = selector];
}

void down()
{
        if (!--counter[current->selection])
                do_grace();
}

void do_grace()
{
        // free stuff
        can_flip = 1;
}

void force_grace()
{
        if (can_flip)
        {
                can_flip = 0;
                selector ^= 1;
        }
}
The way I understood your proposal was that in order to force a grace
period you flip the selector and once the old one reaches zero again it
does a cleanup.
Now your problem was that when you force another flip before the old
counter reached zero the shit will hit the proverbial fan. So what I
proposed (as hopefully illustrated by the code) was another boolean; my
can_flip; that controls the flippability :-)
One can for example call force_grace every few seconds or when a
watermark on the rcu-callback queue has been reached. If can_flip blocks
the actual flip nothing is lost since a cleanup is already pending.
I hope to have been clearer in explaining my idea; or if I'm missing the
issue tell me to read the thread in the morning ;)
--
Peter Zijlstra <[email protected]>
On Tue, May 10, 2005 at 10:56:24PM +0200, Peter Zijlstra wrote:
> On Tue, 2005-05-10 at 13:29 -0700, Paul E. McKenney wrote:
> > On Tue, May 10, 2005 at 10:08:11PM +0200, Peter Zijlstra wrote:
> > > On Mon, 2005-05-09 at 18:24 -0700, Paul E. McKenney wrote:
> > >
> > > >
> > > > Counter-Based Approach
> > > >
> > > > The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> > > > counter-based approach, which seems to work, but which can result in
> > > > indefinite-duration grace periods. The following are very hazy thoughts
> > > > on how to get the benefits of this approach, but with short grace periods.
> > > >
> > > > 1. The basic trick is to maintain a pair of counters per CPU.
> > > > There would also be a global boolean variable that would select
> > > > one or the other of each pair. The rcu_read_lock() primitive
> > > > would then increment the counter indicated by the boolean
> > > > corresponding to the CPU that it is currently running on.
> > > > It would also keep a pointer to that particular counter in
> > > > the task structure. The rcu_read_unlock() primitive would
> > > > decrement this counter. (And, yes, you would also have a
> > > > counter in the task structure so that only the outermost of
> > > > a set of nested rcu_read_lock()/rcu_read_unlock() pairs would
> > > > actually increment/decrement the per-CPU counter pairs.)
> > > >
> > > > To force a grace period, one would invert the value of the
> > > > global boolean variable. Once all the counters indicated
> > > > by the old value of the global boolean variable hit zero,
> > > > the corresponding set of RCU callbacks can be safely invoked.
> > > >
> > > > The big problem with this approach is that a pair of inversions
> > > > of the global boolean variable could be spaced arbitrarily
> > > > closely, especially when you consider that the read side code
> > > > can be preempted. This could cause RCU callbacks to be invoked
> > > > prematurely, which could greatly reduce the life expectancy
> > > > of your kernel.
> > >
> > > > Thoughts?
> > >
> > > How about having another boolean indicating whether it is OK to flip the
> > > selector boolean? This boolean would be cleared on an actual flip and set
> > > again when the grace period completes. That way the flips cannot ever
> > > interfere with one another such that the callbacks would be invoked
> > > prematurely.
> >
> > But the flip is an integral part of detecting a grace period. So, if I
> > understand your proposal correctly, I would have to flip to figure out
> > when it was safe to flip.
> >
> > What am I missing here?
>
>
> int can_flip = 1;
> int selector = 0;
>
> int counter[2] = {0, 0};
>
> void up()
> {
> ++counter[current->selection = selector];
Suppose task 0 has just fetched the value of "selector". How does
force_grace() know that it is now inappropriate to invert the value
of "selector"?
One might suppress preemption, but there can still be interrupts,
ECC error correction, and just plain bad luck. So up() needs to
be able to deal with "selector" getting inverted out from under it.
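
That is, with the names from your sketch:

        task 0                                  some other CPU
        ------                                  --------------
        up(): fetches selector (== 0)
          ...interrupt, preemption, ...
                                                force_grace(): selector = 1
        ++counter[0]    /* registered on the side that was just retired */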
Unless I am missing something still...
Thanx, Paul
> }
>
> void down()
> {
> if (!--counter[current->selection])
> do_grace();
> }
>
> void do_grace()
> {
> // free stuff
> can_flip = 1;
> }
>
> void force_grace()
> {
> if (can_flip)
> {
> can_flip = 0;
> selector ^= 1;
> }
> }
>
>
> The way I understood your proposal was that in order to force a grace
> period you flip the selector and once the old one reaches zero again it
> does a cleanup.
>
> Now your problem was that when you force another flip before the old
> counter reached zero the shit will hit the proverbial fan. So what I
> proposed (as hopefully illustrated by the code) was another boolean; my
> can_flip; that controls the flippability :-)
>
> One can for example call force_grace every few seconds or when a
> watermark on the rcu-callback queue has been reached. If can_flip blocks
> the actual flip nothing is lost since a cleanup is already pending.
>
> I hope to have been clearer in explaining my idea; or if I'm missing the
> issue tell me to read the thread in the morning ;)
>
> --
> Peter Zijlstra <[email protected]>
>
>
On Tue, 10 May 2005, Paul E. McKenney wrote:
> On Tue, May 10, 2005 at 12:55:31PM +0200, Esben Nielsen wrote:
> [...]
> > >
> > > The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> > > counter-based approach, which seems to work, but which can result in
> > > indefinite-duration grace periods.
> >
> > Is that really a problem in practice? After all, RCU updates should happen
> > rarely.
>
> Yes. Several people have run into serious problems on small-memory
> machines under denial-of-service workloads. Not pretty.
>
That is what I tell people at work: Dynamic memory allocation is _not_ a
good thing in real-time embedded systems. They call me old fashioned when
I say I want anything important pre-allocated in separate pools. But it
avoids exactly these kinds of surprises, makes the system more
deterministic and makes it easier to find leaks.
> [...]
> >
> > Why bother? What is the overhead of checking relative to just doing the
> > increment/decrement?
>
> The increment of the per-CPU counter pair is likely to incur a cache miss,
> and thus be orders of magnitude more expensive than the increment of
> the variable in the task structure. I did not propose this optimization
> lightly. ;-)
>
> > I had another idea: Do it in the scheduler:
> > per_cpu_rcu_count += prev->rcu_read_count;
> > if(per_cpu_rcu_count==0) {
> > /* grace period encountered */
> > }
> > (do the task switch)
> >
> > per_cpu_rcu_count -= next->rcu_read_count;
> >
> > Then per_cpu_rcu_count is the rcu-read-count for all tasks except the
> > current. During schedule there is no current and therefore it is the total
> > count.
>
> One downside of this approach is that it would maximally expand the
> RCU read-side critical sections -- one of the things we need is to
> avoid holding onto memory longer than necessary, as noted above.
>
> Also, what prevents a grace period from ending while a task is
> preempted in an RCU read-side critical section? Or are you intending
> that synchronize_rcu() scan the task list as well as the per-CPU
> counters?
Eh, if an RCU task is preempted in a read-side CS there IS no grace period!
The trick is to avoid having it hanging there for too long (see my boosting
mechanism below).
>
> > > [...]
> > > 2. #1 above, but use a seqlock to guard the counter selection in
> > > rcu_read_lock(). One potential disadvantage of this approach
> > > is that an extremely unlucky instance of rcu_read_lock() might
> > > be indefinitely delayed by a series of counter flips. I am
> > > concerned that this might actually happen under low-memory
> > > conditions. Also requires memory barriers on the read side,
> > > which we might be stuck with, but still hope to be able to
> > > get rid of. And the per-CPU counter manipulation must use
> > > atomic instructions.
> >
> > Why not just use
> > preempt_disable();
> > manipulate counters;
> > preempt_enable();
> > ??
>
> Because we can race with the counter switch, which can happen on some
> other CPU. Therefore, preempt_disable() does not help with this race.
Well, the per-cpu counters should be truly per-cpu. I.e. all tasks must
_only_ touch the counter on their current cpu. preempt_disable() prevents
the migration of the current task, right?
>
> > > 3. #1 above, but use per-CPU locks to guard the counter selection.
> > > I don't like this any better than #2, worse, in fact, since it
> > > requires expensive atomic instructions as well.
> >
> local_irq_disable() and/or preempt_disable() should work, as the counters
> are per-cpu, right?
>
> Again, neither of these help with the race against the counter-switch,
> which would likely be happening on some other CPU.
As above :-)
>
> > Or see the alternative above: The counter is per task but latched to a
> > per-cpu on schedule().
>
> This could work (assuming that synchronize_rcu() scanned the task list
> as well as the per-CPU counters), but again would extend grace periods
> too long for small-memory machines subject to denial-of-service attacks.
>
No need to traverse the list. Again, synchronize_rcu() waits for all
CPUs to check their number _locally_. The task doing that can simply check
its latched number, as it knows it is not in a read-side CS.
> > [...]
> > I was playing around with it a little while back. I didn't send a patch
> > though as I didn't have time. It works on a UP machine but I don't have an
> > SMP to test on :-(
>
> There are some available from OSDL, right? Or am I out of date here?
Honestly, I can't remember if I mailed it to anyone. I only have the code
on my home PC, which broke down. I am right now running off a backup. I
have attached the patch to this mail. I don't have time to clean it up :-(
I see it also includes code for a testing device where I try
to measure the grace period.
>
> > The ingredients are:
> > 1) A per-cpu read-side counter, protected locally by preempt_disable().
>
> In order to avoid too-long grace periods, you need a pair of counters.
> If you only have a single counter, you can end up with a series of
> RCU read-side critical-section preemptions resulting in the counter
> never reaching zero, and thus the grace period never terminating.
> Not good, especially in small-memory machines subject to denial-of-service
> attacks.
See the boosting trick below. That will do the trick.
>
> > 2) A task read-side counter (doesn't need protection at all).
>
> Yep. Also a field in the task struct indicating which per-CPU counter
> applies (unless the scheduler is managing this).
No, it is _always_ the local CPU which applies, as I have tried to arrange
with 3) below (which isn't tested at all).
>
> > 3) Tasks having non-zero read-side counter can't be migrated to other
> > CPUs. That is to make the per-cpu counter truly per-cpu.
>
> If CPU hotplug can be disallowed, then this can work. Do we really want
> to disallow CPU hotplug? Especially given that some implementations
> might want to use CPU hotplug to reduce power consumption? There also
> might be issues with tickless operation.
>
Yeah, I noticed the hotplug code. What are the RT requirements for
hotplugging? With 4) below it shouldn't be a problem to wait until all
tasks are out of their read-side critical sections.
A better alternative is to make migration work like this:
1) Deactivate the task.
2) Make the target cpu increment its rcu-counter.
3) Migrate the task.
4) Make the source cpu decrement its rcu-counter.
Then you can safely migrate tasks preempted in read-side CS as well.
Harder to code than my small hack, though.
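
In pseudo-code (helper names made up, and glossing over locking and how
the remote CPU's counter actually gets poked):

        /* migrate a task that may be preempted inside a read-side CS */
        deactivate(p);                                           /* 1) */
        per_cpu(rcu_read_count, dst_cpu) += p->rcu_read_count;   /* 2) */
        move_to_cpu(p, dst_cpu);                                 /* 3) */
        per_cpu(rcu_read_count, src_cpu) -= p->rcu_read_count;   /* 4) */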
> > 4) When the scheduler sees a SCHED_OTHER task with a rcu-read-counter>0 it
> boosts the priority to the maximum non-RT priority so that it will not be
> > preempted by other SCHED_OTHER tasks. This makes the delay in
> synchronize_kernel() somewhat deterministic (it depends on how
> > "deterministic" the RT tasks in the system are.)
>
> But one would only want to do this boost if there was a synchronize_rcu()
> waiting -and- if there is some reason to accelerate grace periods (e.g.,
> low on memory), right? If there is no reason to accelerate grace periods,
> then extending them increases efficiency by amortizing grace-period
> overhead over many updates.
Well, I only do the boost if the task is actually preempted in the
read-side CS - it should be relatively rare. I could make a check and
only do the boost if there is a request for a grace period set. BUT the
problem is that when somebody requests a grace-period later on it would
have to traverse the runqueue and boost the tasks in read-side CS. That
sounds more expensive, but one can only know that by testing....
>
> > I'll try to send you what I got when I get my computer at home back up.
> > I have only tested it on my UP laptop, where it seems to run fine.
>
> I look forward to seeing it!
>
Attached here :-)
> Thanx, Paul
>
Esben
On Wed, May 11, 2005 at 08:14:49PM +0200, Esben Nielsen wrote:
> On Tue, 10 May 2005, Paul E. McKenney wrote:
>
> > On Tue, May 10, 2005 at 12:55:31PM +0200, Esben Nielsen wrote:
> > [...]
> > > >
> > > > The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> > > > counter-based approach, which seems to work, but which can result in
> > > > indefinite-duration grace periods.
> > >
> > > Is that really a problem in practice? After all, RCU updates should happen
> > > rarely.
> >
> > Yes. Several people have run into serious problems on small-memory
> > machines under denial-of-service workloads. Not pretty.
> >
> That is what I tell people at work: Dynamic memory allocation is _not_ a
> good thing in real-time embedded systems. They call me old fashioned when
> I say I want anything important pre-allocated in separate pools. But it
> avoids exactly these kinds of surprises, makes the system more
> deterministic and makes it easier to find leaks.
As always, depends on the situation. Configuring all the separate
pools can be a bit on the painful side, right? ;-)
> > [...]
> > >
> > > Why bother? What is the overhead of checking relative to just doing the
> > > increment/decrement?
> >
> > The increment of the per-CPU counter pair is likely to incur a cache miss,
> > and thus be orders of magnitude more expensive than the increment of
> > the variable in the task structure. I did not propose this optimization
> > lightly. ;-)
> >
> > > I had another idea: Do it in the scheduler:
> > > per_cpu_rcu_count += prev->rcu_read_count;
> > > if(per_cpu_rcu_count==0) {
> > > /* grace period encountered */
> > > }
> > > (do the task switch)
> > >
> > > per_cpu_rcu_count -= next->rcu_read_count;
> > >
> > > Then per_cpu_rcu_count is the rcu-read-count for all tasks except the
> > > current. During schedule there is no current and therefore it is the total
> > > count.
> >
> > One downside of this approach is that it would maximally expand the
> > RCU read-side critical sections -- one of the things we need is to
> > avoid holding onto memory longer than necessary, as noted above.
> >
> > Also, what prevents a grace period from ending while a task is
> > preempted in an RCU read-side critical section? Or are you intending
> > that synchronize_rcu() scan the task list as well as the per-CPU
> > counters?
>
> Eh, if an RCU task is preempted in a read-side CS there IS no grace period!
> The trick is to avoid having it hanging there for too long (see my boosting
> mechanism below).
There is not -supposed- to be a grace period in that case.
> > > > [...]
> > > > 2. #1 above, but use a seqlock to guard the counter selection in
> > > > rcu_read_lock(). One potential disadvantage of this approach
> > > > is that an extremely unlucky instance of rcu_read_lock() might
> > > > be indefinitely delayed by a series of counter flips. I am
> > > > concerned that this might actually happen under low-memory
> > > > conditions. Also requires memory barriers on the read side,
> > > > which we might be stuck with, but still hope to be able to
> > > > get rid of. And the per-CPU counter manipulation must use
> > > > atomic instructions.
> > >
> > > Why not just use
> > > preempt_disable();
> > > manipulate counters;
> > > preempt_enable();
> > > ??
> >
> > Because we can race with the counter switch, which can happen on some
> > other CPU. Therefore, preempt_disable() does not help with this race.
>
> Well, the per-cpu counters should be truly per-cpu. I.e. all tasks must
> _only_ touch the counter on their current cpu. preempt_disable() prevents
> the migration of the current task, right?
>
> >
> > > > 3. #1 above, but use per-CPU locks to guard the counter selection.
> > > > I don't like this any better than #2, worse, in fact, since it
> > > > requires expensive atomic instructions as well.
> > >
> > > local_irq_disable() and/or preempt_disable() should work, as the counters
> > > are per-cpu, right?
> >
> > Again, neither of these help with the race against the counter-switch,
> > which would likely be happening on some other CPU.
> As above :-)
>
> >
> > > Or see the alternative above: The counter is per task but latched to a
> > > per-cpu on schedule().
> >
> > This could work (assuming that synchronize_rcu() scanned the task list
> > as well as the per-CPU counters), but again would extend grace periods
> > too long for small-memory machines subject to denial-of-service attacks.
> >
> No need to traverse the list. Again, synchronize_rcu() waits for all
> CPUs to check their number _locally_. The task doing that can simply check
> its latched number, as it knows it is not in a read-side CS.
>
> > > [...]
> > > I was playing around with it a little while back. I didn't send a patch
> > > though as I didn't have time. It works on a UP machine but I don't have an
> > > SMP to test on :-(
> >
> > There are some available from OSDL, right? Or am I out of date here?
> Honestly, I can't remember if I mailed it to anyone. I only have the code
> on my home PC, which broke down. I am right now running off a backup. I
> have attached the patch to this mail. I don't have time to clean it up :-(
> I see it also includes code for a testing device where I try
> to measure the grace period.
>
> >
> > > The ingredients are:
> > > 1) A per-cpu read-side counter, protected locally by preempt_disable().
> >
> > In order to avoid too-long grace periods, you need a pair of counters.
> > If you only have a single counter, you can end up with a series of
> > RCU read-side critical-section preemptions resulting in the counter
> > never reaching zero, and thus the grace period never terminating.
> > Not good, especially in small-memory machines subject to denial-of-service
> > attacks.
> See the boosting trick below. That will do the trick.
>
> >
> > > 2) A task read-side counter (doesn't need protection at all).
> >
> > Yep. Also a field in the task struct indicating which per-CPU counter
> > applies (unless the scheduler is managing this).
> No, it is _always_ the local CPU which applies, as I have tried to arrange
> with 3) below (which isn't tested at all).
> >
> > > 3) Tasks having non-zero read-side counter can't be migrated to other
> > > CPUs. That is to make the per-cpu counter truly per-cpu.
> >
> > If CPU hotplug can be disallowed, then this can work. Do we really want
> > to disallow CPU hotplug? Especially given that some implementations
> > might want to use CPU hotplug to reduce power consumption? There also
> > might be issues with tickless operation.
> >
> Yeah, I noticed the hotplug code. What are the RT requirements for
> hotplugging? With 4) below it shouldn't be a problem to wait until all
> tasks are out of their read-side critical sections.
>
> A better alternative is to make migration work like this:
> 1) Deactivate the task.
> 2) Make the target cpu increment its rcu-counter.
> 3) Migrate the task.
> 4) Make the source cpu decrement its rcu-counter.
> Then you can safely migrate tasks preempted in read-side CS as well.
> Harder to code than my small hack, though.
>
> > > 4) When the scheduler sees a SCHED_OTHER task with a rcu-read-counter>0 it
> > > boosts the priority to the maximum non-RT priority so that it will not be
> > > preempted by other SCHED_OTHER tasks. This makes the delay in
> > > synchronize_kernel() somewhat deterministic (it depends on how
> > > "deterministic" the RT tasks in the system are.)
> >
> > But one would only want to do this boost if there was a synchronize_rcu()
> > waiting -and- if there is some reason to accelerate grace periods (e.g.,
> > low on memory), right? If there is no reason to accelerate grace periods,
> > then extending them increases efficiency by amortizing grace-period
> > overhead over many updates.
> Well, I only do the boost if the task is actually preempted in the
> read-side CS - it should be relatively rare. I could make a check and
> only do the boost if there is a request for a grace period set. BUT the
> problem is that when somebody requests a grace-period later on it would
> have to traverse the runqueue and boost the tasks in read-side CS. That
> sounds more expensive, but one can only know that by testing....
>
> >
> > > I'll try to send you what I've got when I get my computer at home back up.
> > > I have only tested it on my UP laptop, where it seems to run fine.
> >
> > I look forward to seeing it!
> >
> Attached here :-)
Thank you, see comments interspersed.
> > Thanx, Paul
> >
>
> Esben
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/Makefile linux-2.6.12-rc1-RCU/Makefile
> --- Orig/linux-2.6.12-rc1/Makefile Sun Mar 20 20:30:55 2005
> +++ linux-2.6.12-rc1-RCU/Makefile Sun Mar 27 21:05:58 2005
> @@ -1,7 +1,7 @@
> VERSION = 2
> PATCHLEVEL = 6
> SUBLEVEL = 12
> -EXTRAVERSION =-rc1
> +EXTRAVERSION =-rc1-RCU-SMP
> NAME=Woozy Numbat
>
> # *DOCUMENTATION*
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/drivers/char/Kconfig linux-2.6.12-rc1-RCU/drivers/char/Kconfig
> --- Orig/linux-2.6.12-rc1/drivers/char/Kconfig Sun Mar 20 20:31:03 2005
> +++ linux-2.6.12-rc1-RCU/drivers/char/Kconfig Fri Mar 25 00:19:07 2005
> @@ -978,6 +978,13 @@
> The mmtimer device allows direct userspace access to the
> Altix system timer.
>
> +
> +config RCU_TESTER
> + tristate "Test device for testing RCU code."
> + default n
> + help
> + To help debugging the RCU code. Don't include this.
> +
> source "drivers/char/tpm/Kconfig"
>
> endmenu
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/drivers/char/Makefile linux-2.6.12-rc1-RCU/drivers/char/Makefile
> --- Orig/linux-2.6.12-rc1/drivers/char/Makefile Sun Mar 20 20:31:03 2005
> +++ linux-2.6.12-rc1-RCU/drivers/char/Makefile Fri Mar 25 00:19:11 2005
> @@ -48,6 +48,8 @@
> obj-$(CONFIG_VIOTAPE) += viotape.o
> obj-$(CONFIG_HVCS) += hvcs.o
>
> +obj-$(CONFIG_RCU_TESTER) += rcu_tester.o
> +
> obj-$(CONFIG_PRINTER) += lp.o
> obj-$(CONFIG_TIPAR) += tipar.o
>
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/drivers/char/rcu_tester.c linux-2.6.12-rc1-RCU/drivers/char/rcu_tester.c
> --- Orig/linux-2.6.12-rc1/drivers/char/rcu_tester.c Thu Jan 1 01:00:00 1970
> +++ linux-2.6.12-rc1-RCU/drivers/char/rcu_tester.c Wed Mar 30 00:15:52 2005
> @@ -0,0 +1,120 @@
> +/*
> + * test device for testing RCU code
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/miscdevice.h>
> +
> +#define RCU_TESTER_MINOR 222
> +
> +#define RCU_TESTER_WRITE 4247
> +#define RCU_TESTER_READ 4248
> +
> +#define ERCU_FAILED 35
> +
> +#ifndef notrace
> +#define notrace
> +#endif
> +
> +u64 notrace get_cpu_tick(void)
> +{
> + u64 tsc;
> +#ifdef ARCHARM
> + tsc = *oscr;
> +#else
> + __asm__ __volatile__("rdtsc" : "=A" (tsc));
> +#endif
> + return tsc;
> +}
> +
> +void notrace loop(int loops)
> +{
> + int i;
> +
> + for (i = 0; i < loops; i++)
> + get_cpu_tick();
> +}
> +
> +static spinlock_t write_lock;
> +static int *protected_data = NULL;
> +
> +static int rcu_tester_open(struct inode *in, struct file *file)
> +{
> + return 0;
> +}
> +
> +static long rcu_tester_ioctl(struct file *file,
> + unsigned int cmd, unsigned long args)
> +{
> + switch(cmd) {
> + case RCU_TESTER_READ: {
> + int *dataptr, data1,data2;
> + rcu_read_lock();
> + dataptr = protected_data;
> + data1 = *dataptr;
> + loop(args);
> + data2 = *dataptr;
> + rcu_read_unlock();
> + if(data1!=data2)
> + return -ERCU_FAILED;
> + else
> + return 0; /* ok */
> + }
> + case RCU_TESTER_WRITE: {
> + int *newdata, *olddata;
> + newdata = kmalloc(sizeof(int), GFP_KERNEL);
> + if(!newdata)
> + return -ENOMEM;
> +
> + spin_lock(&write_lock);
> + olddata = rcu_dereference(protected_data);
> + *newdata = *olddata + 1;
> + rcu_assign_pointer(protected_data, newdata);
> + spin_unlock(&write_lock);
> + synchronize_kernel();
> + kfree(olddata);
> + return 0;
> + }
> + default:
> + return -EINVAL;
> + }
> +}
Cute RCU test code!
You could get better coverage by switching between two different structs,
with each containing a pair of integers.
Initialize both, point to one of them. Repeatedly do the following
in RCU_TESTER_WRITE:
1. Point to the other struct.
2. Wait for a grace period to elapse.
3. Increment one of the integers in the not-pointed-to structure.
4. Delay for some time.
5. Increment the other integer in the not-pointed-to structure.
Then RCU_TESTER_READ can repeatedly dereference the pointer under RCU
protection, and compare the two integers. If they differ, RCU's grace
period was not long enough.
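Something like the following could implement that suggestion (a sketch only:
it reuses the patch's write_lock, loop(), and ERCU_FAILED, while
rcu_test_pair, pairs[], cur, and the two function names are made up):

struct rcu_test_pair {
	int a;
	int b;
};

static struct rcu_test_pair pairs[2];		/* both start out as {0, 0} */
static struct rcu_test_pair *cur = &pairs[0];

/* Writer: flip the pointer, wait for a grace period, then make the
 * now-unreferenced structure temporarily inconsistent. */
static int rcu_tester_write_pair(unsigned long delay)
{
	struct rcu_test_pair *old;

	spin_lock(&write_lock);
	old = cur;
	rcu_assign_pointer(cur, (old == &pairs[0]) ? &pairs[1] : &pairs[0]);
	spin_unlock(&write_lock);

	synchronize_kernel();		/* all readers of "old" must be done */

	old->a++;
	loop(delay);			/* widen the inconsistency window */
	old->b++;			/* now old->a == old->b again */
	return 0;
}

/* Reader: under RCU protection the two fields must always agree;
 * a mismatch means the grace period above was too short. */
static int rcu_tester_read_pair(unsigned long delay)
{
	struct rcu_test_pair *p;
	int a, b;

	rcu_read_lock();
	p = rcu_dereference(cur);
	a = p->a;
	loop(delay);
	b = p->b;
	rcu_read_unlock();

	return (a == b) ? 0 : -ERCU_FAILED;
}

These two could be wired into the RCU_TESTER_WRITE and RCU_TESTER_READ cases
of rcu_tester_ioctl() in place of the single-integer version.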
> +
> +static struct file_operations rcu_tester_fops = {
> + .owner = THIS_MODULE,
> + .llseek = no_llseek,
> + .unlocked_ioctl = rcu_tester_ioctl,
> + .open = rcu_tester_open,
> +};
> +
> +static struct miscdevice rcu_tester_dev =
> +{
> + RCU_TESTER_MINOR,
> + "rcu_tester",
> + &rcu_tester_fops
> +};
> +
> +static int __init rcu_tester_init(void)
> +{
> + if (misc_register(&rcu_tester_dev))
> + return -ENODEV;
> +
> + protected_data = kmalloc(sizeof(int), GFP_KERNEL);
> + *protected_data = 0;
> +
> + spin_lock_init(&write_lock);
> +
> + return 0;
> +}
> +
> +void __exit rcu_tester_exit(void)
> +{
> + printk(KERN_INFO "rcu-tester device uninstalled\n");
> + misc_deregister(&rcu_tester_dev);
> +}
> +
> +module_init(rcu_tester_init);
> +module_exit(rcu_tester_exit);
> +
> +MODULE_LICENSE("GPL");
> +
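For completeness, a guess at how the tester might be driven from user space;
the ioctl numbers and the /dev/rcu_tester name come from the patch above, but
everything else (how the device node gets created, the iteration counts) is
an assumption:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define RCU_TESTER_WRITE 4247
#define RCU_TESTER_READ  4248

/* Run one copy with "write" and one without, at the same time. */
int main(int argc, char **argv)
{
	int fd = open("/dev/rcu_tester", O_RDWR);
	int writer = (argc > 1 && strcmp(argv[1], "write") == 0);
	long i;

	if (fd < 0) {
		perror("open /dev/rcu_tester");
		return 1;
	}
	for (i = 0; i < 100000; i++) {
		if (writer) {
			if (ioctl(fd, RCU_TESTER_WRITE, 0) < 0)
				perror("RCU_TESTER_WRITE");
		} else if (ioctl(fd, RCU_TESTER_READ, 10000) < 0) {
			printf("reader saw a torn update at iteration %ld\n", i);
		}
	}
	close(fd);
	return 0;
}

Running one reader and one writer concurrently (ideally pinned to different
CPUs) is what actually exercises the grace-period machinery; the third
ioctl() argument is how long loop() spins inside the read-side critical
section.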
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/include/linux/init_task.h linux-2.6.12-rc1-RCU/include/linux/init_task.h
> --- Orig/linux-2.6.12-rc1/include/linux/init_task.h Sun Mar 20 20:31:28 2005
> +++ linux-2.6.12-rc1-RCU/include/linux/init_task.h Fri Mar 25 00:22:00 2005
> @@ -74,6 +74,7 @@
> .usage = ATOMIC_INIT(2), \
> .flags = 0, \
> .lock_depth = -1, \
> + .rcu_read_lock_nesting = 0, \
> .prio = MAX_PRIO-20, \
> .static_prio = MAX_PRIO-20, \
> .policy = SCHED_NORMAL, \
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/include/linux/rcupdate.h linux-2.6.12-rc1-RCU/include/linux/rcupdate.h
> --- Orig/linux-2.6.12-rc1/include/linux/rcupdate.h Wed Mar 2 08:37:50 2005
> +++ linux-2.6.12-rc1-RCU/include/linux/rcupdate.h Sun Mar 27 18:12:45 2005
> @@ -41,6 +41,8 @@
> #include <linux/percpu.h>
> #include <linux/cpumask.h>
> #include <linux/seqlock.h>
> +#include <linux/sched.h>
> +#include <linux/interrupt.h>
>
> /**
> * struct rcu_head - callback structure for use with RCU
> @@ -85,6 +87,7 @@
> * curlist - current batch for which quiescent cycle started if any
> */
> struct rcu_data {
> + int active_readers;
> /* 1) quiescent state handling : */
> long quiescbatch; /* Batch # for grace period */
> int passed_quiesc; /* User-mode/idle loop etc. */
> @@ -115,12 +118,14 @@
> static inline void rcu_qsctr_inc(int cpu)
> {
> struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
> - rdp->passed_quiesc = 1;
> + if(rdp->active_readers==0)
> + rdp->passed_quiesc = 1;
> }
> static inline void rcu_bh_qsctr_inc(int cpu)
> {
> struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
> - rdp->passed_quiesc = 1;
> + if(rdp->active_readers==0)
> + rdp->passed_quiesc = 1;
> }
This seems to suffer from extended grace periods. Or am I missing something?
> static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
> @@ -183,14 +188,34 @@
> *
> * It is illegal to block while in an RCU read-side critical section.
> */
> -#define rcu_read_lock() preempt_disable()
> +static inline void rcu_read_lock(void)
> +{
> + unsigned long flags;
> +
> + local_irq_save(flags);
> + __get_cpu_var(rcu_data).active_readers++;
> + current->rcu_read_lock_nesting++;
> + local_irq_restore(flags);
> +}
>
> /**
> * rcu_read_unlock - marks the end of an RCU read-side critical section.
> *
> * See rcu_read_lock() for more information.
> */
> -#define rcu_read_unlock() preempt_enable()
> +static inline void rcu_read_unlock(void)
> +{
> + unsigned long flags;
> + int cpu;
> +
> + local_irq_save(flags);
> + cpu = smp_processor_id();
> +
> + per_cpu(rcu_data,cpu).active_readers--;
> + current->rcu_read_lock_nesting--;
> + rcu_qsctr_inc(cpu);
> + local_irq_restore(flags);
> +}
>
> /*
> * So where is rcu_write_lock()? It does not exist, as there is no
> @@ -213,14 +238,34 @@
> * can use just rcu_read_lock().
> *
> */
> -#define rcu_read_lock_bh() local_bh_disable()
> +static inline void rcu_read_lock_bh(void)
> +{
> + unsigned long flags;
> +
> + local_irq_save(flags);
> + __get_cpu_var(rcu_bh_data).active_readers++;
> + current->rcu_read_lock_nesting++;
> + local_irq_restore(flags);
> +}
>
> /*
> * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
> *
> * See rcu_read_lock_bh() for more information.
> */
> -#define rcu_read_unlock_bh() local_bh_enable()
> +static inline void rcu_read_unlock_bh(void)
> +{
> + unsigned long flags;
> + int cpu;
> +
> + local_irq_save(flags);
> + cpu = smp_processor_id();
> +
> + per_cpu(rcu_bh_data,cpu).active_readers--;
> + current->rcu_read_lock_nesting--;
> + rcu_bh_qsctr_inc(cpu);
> + local_irq_restore(flags);
> +}
>
> /**
> * rcu_dereference - fetch an RCU-protected pointer in an
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/include/linux/sched.h linux-2.6.12-rc1-RCU/include/linux/sched.h
> --- Orig/linux-2.6.12-rc1/include/linux/sched.h Sun Mar 20 20:31:29 2005
> +++ linux-2.6.12-rc1-RCU/include/linux/sched.h Fri Mar 25 00:22:00 2005
> @@ -569,6 +569,9 @@
> unsigned long ptrace;
>
> int lock_depth; /* Lock depth */
> +
> + int rcu_read_lock_nesting; /* How many RCU read regions the thread is
> + in */
>
> int prio, static_prio;
> struct list_head run_list;
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/init/main.c linux-2.6.12-rc1-RCU/init/main.c
> --- Orig/linux-2.6.12-rc1/init/main.c Sun Mar 20 20:31:30 2005
> +++ linux-2.6.12-rc1-RCU/init/main.c Fri Mar 25 13:36:26 2005
> @@ -688,8 +688,11 @@
> system_state = SYSTEM_RUNNING;
> numa_default_policy();
>
> - if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
> - printk(KERN_WARNING "Warning: unable to open an initial console.\n");
> + {
> + int res = sys_open((const char __user *) "/dev/console", O_RDWR, 0);
> + if(res<0)
> + printk(KERN_WARNING "Warning: unable to open an initial console.: retval=%d\n",res);
> + }
>
> (void) sys_dup(0);
> (void) sys_dup(0);
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/kernel/rcupdate.c linux-2.6.12-rc1-RCU/kernel/rcupdate.c
> --- Orig/linux-2.6.12-rc1/kernel/rcupdate.c Wed Mar 2 08:37:30 2005
> +++ linux-2.6.12-rc1-RCU/kernel/rcupdate.c Fri Mar 25 13:08:38 2005
> @@ -66,7 +66,10 @@
> {.lock = SPIN_LOCK_UNLOCKED, .cpumask = CPU_MASK_NONE };
>
> DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
> +EXPORT_PER_CPU_SYMBOL(rcu_data);
> DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };
> +EXPORT_PER_CPU_SYMBOL(rcu_bh_data);
> +
>
> /* Fake initialization required by compiler */
> static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
> diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/kernel/sched.c linux-2.6.12-rc1-RCU/kernel/sched.c
> --- Orig/linux-2.6.12-rc1/kernel/sched.c Sun Mar 20 20:31:30 2005
> +++ linux-2.6.12-rc1-RCU/kernel/sched.c Wed Mar 30 00:06:39 2005
> @@ -594,6 +594,9 @@
> if (rt_task(p))
> return p->prio;
>
> + if (p->rcu_read_lock_nesting)
> + return MAX_RT_PRIO;
> +
Do you really want to do this priority boost unconditionally?
Seems like you would only want to if there are callbacks waiting
for a grace period when the system is low on memory.
Otherwise, how does boosting the priority help?
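A strawman of what a conditional version might look like, assuming a
hypothetical rcu_grace_period_needed() predicate (say, "callbacks are queued
and memory is tight") that does not exist in the kernel:

/* Hypothetical helper: boost only when a grace period is urgently needed. */
static inline int rcu_boost_wanted(task_t *p)
{
	return p->rcu_read_lock_nesting &&
	       rcu_grace_period_needed(task_cpu(p));	/* hypothetical */
}

	/* ...and in effective_prio(), instead of the unconditional test: */
	if (rcu_boost_wanted(p))
		return MAX_RT_PRIO;

Otherwise the task would keep its normal interactivity bonus and the grace
period would simply be allowed to stretch.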
> bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
>
> prio = p->static_prio - bonus;
> @@ -986,6 +989,9 @@
> if (unlikely(task_running(rq, p)))
> goto out_activate;
>
> + if (unlikely(p->rcu_read_lock_nesting))
> + goto out_activate;
> +
> #ifdef CONFIG_SCHEDSTATS
> schedstat_inc(rq, ttwu_cnt);
> if (cpu == this_cpu) {
> @@ -1644,6 +1650,8 @@
> return 0;
> if (!cpu_isset(this_cpu, p->cpus_allowed))
> return 0;
> + if (p->rcu_read_lock_nesting)
> + return 0;
Could you please add "-p" to your "diff" command so that the enclosing
function is listed in each hunk header?
> /*
> * Aggressive migration if:
> @@ -2633,6 +2641,7 @@
> need_resched:
> preempt_disable();
> prev = current;
> +
> release_kernel_lock(prev);
> need_resched_nonpreemptible:
> rq = this_rq();
> @@ -2675,6 +2684,7 @@
> else {
> if (prev->state == TASK_UNINTERRUPTIBLE)
> rq->nr_uninterruptible++;
> +
> deactivate_task(prev, rq);
> }
> }
> @@ -2710,6 +2720,17 @@
> }
>
> array = rq->active;
> +
> + if( unlikely(prev->rcu_read_lock_nesting) &&
> + prev->prio > MAX_RT_PRIO ) {
> + prio_array_t *prev_array = prev->array;
> + if(prev_array)
> + dequeue_task(prev, prev_array);
> + prev->prio = MAX_RT_PRIO;
> + if(prev_array)
> + enqueue_task_head(prev, prev_array);
> + }
> +
Again, do you really want to boost priority unconditionally when in
an RCU read-side critical section?
> if (unlikely(!array->nr_active)) {
> /*
> * Switch the active and expired arrays.
On Tue, 2005-05-10 at 15:36 -0700, Paul E. McKenney wrote:
> On Tue, May 10, 2005 at 10:56:24PM +0200, Peter Zijlstra wrote:
> > On Tue, 2005-05-10 at 13:29 -0700, Paul E. McKenney wrote:
> > > On Tue, May 10, 2005 at 10:08:11PM +0200, Peter Zijlstra wrote:
> > > > On Mon, 2005-05-09 at 18:24 -0700, Paul E. McKenney wrote:
> > > >
> > > > >
> > > > > Counter-Based Approach
> > > > >
> > > > > The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> > > > > counter-based approach, which seems to work, but which can result in
> > > > > indefinite-duration grace periods. The following are very hazy thoughts
> > > > > on how to get the benefits of this approach, but with short grace periods.
> > > > >
> > > > > 1. The basic trick is to maintain a pair of counters per CPU.
> > > > > There would also be a global boolean variable that would select
> > > > > one or the other of each pair. The rcu_read_lock() primitive
> > > > > would then increment the counter indicated by the boolean
> > > > > corresponding to the CPU that it is currently running on.
> > > > > It would also keep a pointer to that particular counter in
> > > > > the task structure. The rcu_read_unlock() primitive would
> > > > > decrement this counter. (And, yes, you would also have a
> > > > > counter in the task structure so that only the outermost of
> > > > > a set of nested rcu_read_lock()/rcu_read_unlock() pairs would
> > > > > actually increment/decrement the per-CPU counter pairs.)
> > > > >
> > > > > To force a grace period, one would invert the value of the
> > > > > global boolean variable. Once all the counters indicated
> > > > > by the old value of the global boolean variable hit zero,
> > > > > the corresponding set of RCU callbacks can be safely invoked.
> > > > >
> > > > > The big problem with this approach is that a pair of inversions
> > > > > of the global boolean variable could be spaced arbitrarily
> > > > > closely, especially when you consider that the read side code
> > > > > can be preempted. This could cause RCU callbacks to be invoked
> > > > > prematurely, which could greatly reduce the life expectancy
> > > > > of your kernel.
> > > >
> > > > > Thoughts?
> > > >
> > > > How about having another boolean indicating the ability to flip the
> > > > selector boolean. This boolean would be set false on an actual flip and
> > > > cleared during a grace period. That way the flips cannot ever interfere
> > > > with one another such that the callbacks would be cleared prematurely.
> > >
> > > But the flip is an integral part of detecting a grace period. So, if I
> > > understand your proposal correctly, I would have to flip to figure out
> > > when it was safe to flip.
> > >
> > > What am I missing here?
> >
> >
> > int can_flip = 1;
> > int selector = 0;
> >
> > int counter[2] = {0, 0};
> >
> > void up()
> > {
> > ++counter[current->selection = selector];
>
> Suppose task 0 has just fetched the value of "selector". How does
> force_grace() know that it is now inappropriate to invert the value
> of "selector"?
>
> One might suppress preemption, but there can still be interrupts,
> ECC error correction, and just plain bad luck. So up() needs to
> be able to deal with "selector" getting inverted out from under it.
>
> Unless I am missing something still...
True, I see your point; there is a race between the = and ++ operators.
current->selection = selector;
--> gap
++counter[current->selection];
If you flip, and the now-old current->selection reaches 0 before this
task gets executed again, do_grace() gets called and cleans up the callbacks;
that should not matter, since this task has not yet started using any
data. However, it is problematic because when the task does get scheduled
again it works on the wrong counter, and thus does not prevent a grace
period on the data it will be using.
However, I assumed you had these problems solved with your counter-based
approach. My code was meant to illustrate how I thought your double
inversion problem could be avoided.
Do you by any chance have an RCU implementation based on the counter-based
approach, so I can try to understand it and maybe integrate my
ideas?
> > }
> >
> > void down()
> > {
> > if (!--counter[current->selection])
> > do_grace();
> > }
> >
> > void do_grace()
> > {
> > // free stuff
> > can_flip = 1;
> > }
> >
> > void force_grace()
> > {
> > if (can_flip)
> > {
> > can_flip = 0;
> > selector ^= 1;
> > }
> > }
> >
> >
> > The way I understood your proposal was that in order to force a grace
> > period you flip the selector and once the old one reaches zero again it
> > does a cleanup.
> >
> > Now your problem was that when you force another flip before the old
> > counter has reached zero, the shit will hit the proverbial fan. So what I
> > proposed (as hopefully illustrated by the code) was another boolean, my
> > can_flip, that controls the flippability :-)
> >
> > One can, for example, call force_grace() every few seconds or when a
> > watermark on the rcu-callback queue has been reached. If can_flip blocks
> > the actual flip, nothing is lost, since a cleanup is already pending.
> >
> > I hope to have been clearer in explaining my idea; or if I'm missing the
> > issue tell me to read the thread in the morning ;)
> >
> > --
> > Peter Zijlstra <[email protected]>
> >
> >
--
Peter Zijlstra <[email protected]>
On Thu, May 12, 2005 at 10:24:37AM +0200, Peter Zijlstra wrote:
> On Tue, 2005-05-10 at 15:36 -0700, Paul E. McKenney wrote:
[ . . . ]
> > > > But the flip is an integral part of detecting a grace period. So, if I
> > > > understand your proposal correctly, I would have to flip to figure out
> > > > when it was safe to flip.
> > > >
> > > > What am I missing here?
> > >
> > >
> > > int can_flip = 1;
> > > int selector = 0;
> > >
> > > int counter[2] = {0, 0};
> > >
> > > void up()
> > > {
> > > ++counter[current->selection = selector];
> >
> > Suppose task 0 has just fetched the value of "selector". How does
> > force_grace() know that it is now inappropriate to invert the value
> > of "selector"?
> >
> > One might suppress preemption, but there can still be interrupts,
> > ECC error correction, and just plain bad luck. So up() needs to
> > be able to deal with "selector" getting inverted out from under it.
> >
> > Unless I am missing something still...
>
> True, I see your point; there is a race between the = and ++ operators.
>
> current->selection = selector;
> --> gap
> ++counter[current->selection];
>
> If you flip, and the now-old current->selection reaches 0 before this
> task gets executed again, do_grace() gets called and cleans up the callbacks;
> that should not matter, since this task has not yet started using any
> data. However, it is problematic because when the task does get scheduled
> again it works on the wrong counter, and thus does not prevent a grace
> period on the data it will be using.
Yep, that is indeed my concern.
> However, I assumed you had these problems solved with your counter-based
> approach. My code was meant to illustrate how I thought your double
> inversion problem could be avoided.
>
> Do you by any chance have an RCU implementation based on the counter-based
> approach, so I can try to understand it and maybe integrate my
> ideas?
There are a couple of counter-based implementations out there:
o The K42/Tornado implementation of RCU.
o Dipankar Sarma's rcu_preempt-2.5.8-3.patch and
rcu_poll_preempt-2.5.14-2.patch that were proposed for
Linux some years back. These are attached.
Both of these can potentially suffer from huge grace periods, though
the K42/Tornado guys may have come up with some tricks to avoid this
by now. These long grace periods were why the above patches were
passed over for Linux's RCU.
Both of these also rely on the requirement that the time interval between
a pair of counter switches contain a context switch on every CPU
(or in every thread, depending on the exact implementation). We need to
avoid this requirement in CONFIG_PREEMPT_RT in order to deal with
small-memory environments.
Thanx, Paul