2005-03-18 00:25:24

by Paul E. McKenney

Subject: Real-Time Preemption and RCU

Hello!

As promised/threatened earlier in another forum...

Thanx, Paul

------------------------------------------------------------------------

The Real-Time Preemption patch modified RCU to permit its read-side
critical sections to be safely preempted. This has worked, but there
are a few serious problems with this variant of RCU. [If you just want
to skip directly to the code, search for "synchronize_kernel(void)".
There are five occurrences, each a variation on the theme. I recommend
the fifth one. The third one might also be OK in some environments.
If you have better approaches, please do not keep them a secret!!!]

So, why am I saying that there are problems with the real-time preemption
implementation of RCU?

o RCU read-side critical sections cannot be freely nested,
since the read-side critical section now acquires locks.
This means that the real-time preemption variant of RCU
is subject to deadlock conditions that "classic" RCU
is immune to. This is not just a theoretical concern,
for example, see nf_hook_slow() in linux/net/core/netfilter.c:

+	/*
+	 * PREEMPT_RT semantics: different-type read-locks
+	 * dont nest that easily:
+	 */
+//	rcu_read_lock_read(&ptype_lock);

A number of other RCU read-side critical sections have been
similarly disabled (17 total in the patch). Perhaps most embedded
systems will not be using netfilter heavily, but this does bring
up a very real stability concern, even on UP systems (since
preemption will eventually occur when and where least expected).

o RCU read-side critical sections cannot be unconditionally
upgraded to write-side critical sections in all cases.
For example, in classic RCU, it is perfectly legal to do
the following:

	rcu_read_lock();
	list_for_each_entry_rcu(p, head, list) {
		if (p->needs_update) {
			spin_lock(&update_lock);
			update_it(p);
			spin_unlock(&update_lock);
		}
	}
	rcu_read_unlock();

This would need to change to the following for real-time
preempt kernels:

	rcu_read_lock_spin(&update_lock);
	list_for_each_entry_rcu(p, head, list) {
		if (p->needs_update) {
			spin_lock(&update_lock);
			update_it(p);
			spin_unlock(&update_lock);
		}
	}
	rcu_read_unlock_spin(&update_lock);

This results in self-deadlock.

o There is an API expansion, with five different variants
of rcu_read_lock():

	API                          # uses
	-------------------------    ------
	rcu_read_lock_spin()             11
	rcu_read_unlock_spin()           12
	rcu_read_lock_read()             42
	rcu_read_unlock_read()           42
	rcu_read_lock_bh_read()           2
	rcu_read_unlock_bh_read()         3
	rcu_read_lock_down_read()        14
	rcu_read_unlock_up_read()        20
	rcu_read_lock_nort()              3
	rcu_read_unlock_nort()            4

	TOTAL                           153

o The need to modify lots of RCU code expands the size of this
patch -- roughly 10% of the 20K lines of this patch are devoted
to modifying RCU code to meet this new API. 10% may not sound
like much, but it comes to more than 2,000 lines of context diffs.

Seems to me that it would be good to have an RCU implementation
that meets the requirements of the Real-Time Preemption patch,
but that is 100% compatible with the "classic RCU" API. Such
an implementation must meet a number of requirements, which are
listed at the end of this message (search for "REQUIREMENTS").

I have looked into a number of seductive but subtly broken "solutions"
to this problem. The solution listed here is not perfect, but I believe
that it has enough advantages to be worth pursuing. The solution is
quite simple, and I feel a bit embarrassed that it took me so long
to come up with it. All I can say in my defense is that the idea
of -adding- locks to improve scalability and eliminate deadlocks is
quite counterintuitive to me. And, like I said earlier, if you
know of a better approach, please don't keep it a secret!

The following verbiage steps through several variations on this
solution, as follows:

1. "Toy" implementation that has numerous API, scalability,
and realtime problems, but is a very simple 28-line
illustration of the underlying principles. (In case you
get excited about this being much smaller than "classic
RCU", keep in mind that a similar "toy" implementation of
"classic RCU" is even smaller. To say nothing of faster
and more scalable.)

2. Expanded version of the "toy" implementation that exactly
matches the "classic RCU" API, but still has realtime and
scalability problems.

3. Expanded version of solution #2 that meets realtime
requirements (at least -I- think it does...). It still does
not scale well.

4. Expanded version of solution #2 that scales, but that does
not meet realtime requirements.

5. The final version, which both scales and meets realtime
requirements, as well as exactly matching the "classic RCU"
API.

I have tested this approach, but in user-level scaffolding. All of
these implementations should therefore be regarded with great suspicion:
untested, probably don't even compile. Besides which, I certainly can't
claim to fully understand the real-time preempt patch, so I am bound
to have gotten something wrong somewhere. In any case, none of these
implementations are a suitable replacement for "classic RCU" on large
servers, since they acquire locks in the RCU read-side critical sections.
However, they should scale enough to support small SMP systems, inflicting
only a modest performance penalty.

I believe that implementation #5 is most appropriate for real-time
preempt kernels. Note that #1, #2, and #4 are not at all appropriate for
-any- kernel. In theory, #3 might be appropriate, but if I understand
the real-time preempt implementation of reader-writer lock, it will
not perform well if there are long RCU read-side critical sections,
even in UP kernels.


------------------------------------------------------------------------

1. "Toy" Implementation.

[Inappropriate for actual use.]

The basic principle may be familiar to people who worked with the Linux
2.4 and early 2.5 networking code. The RCU read-side critical section
read-acquires a lock, recursively if need be. This lock is -not- acquired
by the update side, thus preventing the deadlocks caused by the current
real-time preempt implementation of RCU (and thus also deviating from the
aforementioned networking approach). Instead, the lock is write-acquired
(and then immediately released) before freeing any memory removed from the
data structure. The lock cannot be write-acquired until all currently
running read-side critical sections complete. Therefore, the memory
cannot be released until all read-side critical sections have released
any references that they might have to the memory.

rwlock_t rcu_deferral_lock = RW_LOCK_UNLOCKED;

void
rcu_read_lock(void)
{
	read_lock(&rcu_deferral_lock);
}

void
rcu_read_unlock(void)
{
	read_unlock(&rcu_deferral_lock);
}

void
synchronize_kernel(void)
{
	write_lock(&rcu_deferral_lock);
	write_unlock(&rcu_deferral_lock);
}

void
call_rcu(struct rcu_head *head,
	 void (*func)(struct rcu_head *rcu))
{
	synchronize_kernel();
	func(head);
}

This is quite small and simple, but before getting too excited, keep
in mind that a similar "toy" implementation of classic RCU would be
even smaller and simpler.

Note that, just as with classic RCU, the read-side and update-side
critical sections can run concurrently. This means that the current
rcu_dereference() and rcu_assign_pointer() macros remain just as they
are in the current in-tree RCU implementation. This is also true of
the other four implementations.

Note also that, just as with classic RCU, there has to be some sort
of mechanism (lock, semaphore, non-blocking synchronization, ...)
that prevents concurrent updates from making a mess.
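
For illustration, a minimal sketch of the update side under implementation
#1 might look as follows (struct foo, foo_list, and foo_lock are made up
for the example; the spinlock serializes updaters only, while readers use
plain rcu_read_lock()):

struct foo {
	struct list_head list;
	int key;
	struct rcu_head rcu;
};

static LIST_HEAD(foo_list);
static spinlock_t foo_lock = SPIN_LOCK_UNLOCKED;

static void foo_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct foo, rcu));
}

void foo_del(int key)
{
	struct foo *p;

	spin_lock(&foo_lock);			/* exclude other updaters. */
	list_for_each_entry(p, &foo_list, list) {
		if (p->key == key) {
			list_del_rcu(&p->list);	/* readers may still hold references. */
			spin_unlock(&foo_lock);
			call_rcu(&p->rcu, foo_free_rcu);  /* free once readers are done. */
			return;
		}
	}
	spin_unlock(&foo_lock);
}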

Quick Quiz 1: In what way does implementation #1 violate the
current "classic RCU" API? (Search for "ANSWER" if you can't
wait.)

------------------------------------------------------------------------

2. "Toy" Implementation That Exactly Matches the "Classic RCU" API.

[Inappropriate for actual use.]

The following implementation exactly matches the "Classic RCU" API,
but does not meet realtime requirements, nor does it scale.

Quick Quiz 2: Why does implementation #2 not meet realtime requirements?

Quick Quiz 3: Why does implementation #2 fail to scale?

struct rcu_data {
	long batch;
	struct rcu_head *waitlist;
	struct rcu_head **waittail;
};
struct rcu_ctrlblk {
	rwlock_t lock;
	long batch;
};
DECLARE_PER_CPU(struct rcu_data, rcu_data);
struct rcu_ctrlblk rcu_ctrlblk = {
	.lock = RW_LOCK_UNLOCKED,
	.batch = 0,
};

void
rcu_read_lock(void)
{
	read_lock(&rcu_ctrlblk.lock);
}

void
rcu_read_unlock(void)
{
	read_unlock(&rcu_ctrlblk.lock);
}

void
synchronize_kernel(void)
{
	write_lock(&rcu_ctrlblk.lock);
	rcu_ctrlblk.batch++;
	write_unlock(&rcu_ctrlblk.lock);
}

void
call_rcu(struct rcu_head *head,
	 void (*func)(struct rcu_head *rcu))
{
	unsigned long flags;
	struct rcu_data *rdp;

	head->func = func;
	head->next = NULL;
	local_irq_save(flags);
	rcu_do_my_batch();
	rdp = &__get_cpu_var(rcu_data);
	*rdp->waittail = head;
	rdp->waittail = &head->next;
	local_irq_restore(flags);
}

void rcu_do_my_batch(void)
{
	unsigned long flags;
	struct rcu_data *rdp;
	struct rcu_head *next, *list;

	local_irq_save(flags);
	rdp = &__get_cpu_var(rcu_data);
	smp_mb();  /* prevent sampling batch # before list removal. */
	if (rdp->batch == rcu_ctrlblk.batch) {
		local_irq_restore(flags);
		return;
	}
	list = rdp->waitlist;
	rdp->waitlist = NULL;
	rdp->waittail = &rdp->waitlist;
	rdp->batch = rcu_ctrlblk.batch;
	local_irq_restore(flags);
	while (list) {
		next = list->next;
		list->func(list);
		list = next;
	}
}

/*
 * Note: rcu_dereference() and rcu_assign_pointer() unchanged
 * from current in-tree implementation.
 */

The trick here is that call_rcu() does not write-acquire the lock,
either directly or indirectly. This means that there is no way
for it to deadlock with an RCU read-side critical section.
However, the callbacks are still invoked with interrupts disabled,
so this implementation still does not meet realtime requirements.
In addition, it still uses a global reader-writer lock, so it does
not scale, either.

However, it -does- faithfully implement the classic RCU API.


------------------------------------------------------------------------

3. Meets Realtime Requirements, Doesn't Scale

The trick here is to break the callback processing out into a separate
rcu_process_callbacks() function, so that it may be invoked from a kernel
daemon, a work queue, or whatever. It must be called periodically.
If the kernel is under memory pressure, one can get immediate relief
by doing the following:

synchronize_kernel();
rcu_process_callbacks();

Even more relief can be obtained by using something like
smp_call_function() in order to invoke rcu_process_callbacks() on each
CPU in an SMP system. Alternatively, one could awaken the kernel daemons
or schedule the function into the appropriate work queue, depending
on how this code fragment is integrated in. Or, perhaps better yet,
restrict use of this approach to UP kernels. ;-)
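
For illustration only, one way to drive this on a UP kernel would be a
trivial daemon along the following lines, using the functions defined
below (the thread itself is not part of the implementation; its name and
polling interval are arbitrary):

static int krcud(void *unused)
{
	while (!kthread_should_stop()) {
		set_current_state(TASK_INTERRUPTIBLE);
		schedule_timeout(HZ / 10);	/* poll ~10 times per second. */
		synchronize_kernel();		/* force a grace period. */
		rcu_advance_callbacks();	/* move newly safe callbacks to donelist. */
		rcu_process_callbacks();	/* invoke them in process context. */
	}
	return 0;
}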

struct rcu_data {
	long batch;
	struct rcu_head *waitlist;
	struct rcu_head **waittail;
	struct rcu_head *donelist;
	struct rcu_head **donetail;
};
struct rcu_ctrlblk {
	rwlock_t lock;
	long batch;
};
DECLARE_PER_CPU(struct rcu_data, rcu_data);
struct rcu_ctrlblk rcu_ctrlblk = {
	.lock = RW_LOCK_UNLOCKED,
	.batch = 0,
};

void
rcu_read_lock(void)
{
	read_lock(&rcu_ctrlblk.lock);
}

void
rcu_read_unlock(void)
{
	read_unlock(&rcu_ctrlblk.lock);
}

void
synchronize_kernel(void)
{
	write_lock(&rcu_ctrlblk.lock);
	rcu_ctrlblk.batch++;
	write_unlock(&rcu_ctrlblk.lock);
}

void
call_rcu(struct rcu_head *head,
	 void (*func)(struct rcu_head *rcu))
{
	unsigned long flags;
	struct rcu_data *rdp;

	head->func = func;
	head->next = NULL;
	local_irq_save(flags);
	rcu_advance_callbacks();
	rdp = &__get_cpu_var(rcu_data);
	*rdp->waittail = head;
	rdp->waittail = &head->next;
	local_irq_restore(flags);
}

void rcu_advance_callbacks(void)
{
	unsigned long flags;
	struct rcu_data *rdp;

	local_irq_save(flags);
	rdp = &__get_cpu_var(rcu_data);
	smp_mb();  /* prevent sampling batch # before list removal. */
	if (rdp->batch != rcu_ctrlblk.batch) {
		*rdp->donetail = rdp->waitlist;
		rdp->donetail = rdp->waittail;
		rdp->waitlist = NULL;
		rdp->waittail = &rdp->waitlist;
		rdp->batch = rcu_ctrlblk.batch;
	}
	local_irq_restore(flags);
}

void rcu_process_callbacks(void)
{
	unsigned long flags;
	struct rcu_head *next, *list;
	struct rcu_data *rdp;

	local_irq_save(flags);
	rdp = &__get_cpu_var(rcu_data);
	list = rdp->donelist;
	if (list == NULL) {
		local_irq_restore(flags);
		return;
	}
	rdp->donelist = NULL;
	rdp->donetail = &rdp->donelist;
	local_irq_restore(flags);
	while (list) {
		next = list->next;
		list->func(list);
		list = next;
	}
}

/*
 * Note: rcu_dereference() and rcu_assign_pointer() unchanged
 * from current in-tree implementation.
 */

Since a single global rwlock is used, this implementation still does
not scale.


------------------------------------------------------------------------

4. Scalable Implementation

[Inappropriate for realtime use.]

This variant scales, though not well enough to recommend use on a
mid-range POWER machine, let alone a full-blown Altix. In addition,
it fails to meet some realtime requirements.

Quick Quiz 4: Why does implementation #4 fail the realtime test?

#define GRACE_PERIODS_PER_SEC 10

struct rcu_data {
	rwlock_t lock;
	long batch;
	struct rcu_head *waitlist;
	struct rcu_head **waittail;
};
struct rcu_ctrlblk {
	long batch;
};
DECLARE_PER_CPU(struct rcu_data, rcu_data);
struct rcu_ctrlblk rcu_ctrlblk = {
	.batch = 0,
};

void
rcu_read_lock(void)
{
	read_lock(&__get_cpu_var(rcu_data).lock);
}

void
rcu_read_unlock(void)
{
	read_unlock(&__get_cpu_var(rcu_data).lock);
}

void
_synchronize_kernel(void)
{
	int cpu;

	for_each_cpu(cpu) {
		write_lock(&per_cpu(rcu_data, cpu).lock);
	}
	rcu_ctrlblk.batch++;
	for_each_cpu(cpu) {
		write_unlock(&per_cpu(rcu_data, cpu).lock);
	}
}

void
synchronize_kernel(void)
{
	long oldbatch;

	smp_mb();
	oldbatch = rcu_ctrlblk.batch;
	schedule_timeout(HZ/GRACE_PERIODS_PER_SEC);
	if (rcu_ctrlblk.batch == oldbatch) {
		_synchronize_kernel();
	}
}

void
call_rcu(struct rcu_head *head,
	 void (*func)(struct rcu_head *rcu))
{
	unsigned long flags;
	struct rcu_data *rdp;

	head->func = func;
	head->next = NULL;
	local_irq_save(flags);
	rcu_do_my_batch();
	rdp = &__get_cpu_var(rcu_data);
	*rdp->waittail = head;
	rdp->waittail = &head->next;
	local_irq_restore(flags);
}

void rcu_do_my_batch(void)
{
	unsigned long flags;
	struct rcu_data *rdp;
	struct rcu_head *next, *list;

	local_irq_save(flags);
	rdp = &__get_cpu_var(rcu_data);
	smp_mb();  /* prevent sampling batch # before list removal. */
	if (rdp->batch == rcu_ctrlblk.batch) {
		local_irq_restore(flags);
		return;
	}
	list = rdp->waitlist;
	rdp->waitlist = NULL;
	rdp->waittail = &rdp->waitlist;
	rdp->batch = rcu_ctrlblk.batch;
	local_irq_restore(flags);
	while (list) {
		next = list->next;
		list->func(list);
		list = next;
	}
}

/*
 * Note: rcu_dereference() and rcu_assign_pointer() unchanged
 * from current in-tree implementation.
 */

Quick Quiz 5: Why the use of for_each_cpu() rather than (say)
for_each_online_cpu() in _synchronize_kernel() in
implementation #4?

Quick Quiz 6: Why aren't there write-side contention problems
in implementation #4?


------------------------------------------------------------------------

5. Scalability -and- Realtime Response.

The trick here is to keep track of the CPU that did the rcu_read_lock()
in the task structure. If there is a preemption, there will be some
slight inefficiency, but the correct lock will be released, preserving
correctness.
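
This assumes two new fields in the task structure; a sketch of the
additions (field names taken from the code below) would be roughly:

struct task_struct {
	/* ... existing fields ... */
	int		rcu_read_lock_nesting;	/* rcu_read_lock() nesting depth. */
	rwlock_t	*rcu_read_lock_ptr;	/* per-CPU lock this task read-holds. */
	/* ... existing fields ... */
};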

#define GRACE_PERIODS_PER_SEC 10

struct rcu_data {
	rwlock_t lock;
	long batch;
	struct rcu_head *waitlist;
	struct rcu_head **waittail;
	struct rcu_head *donelist;
	struct rcu_head **donetail;
};
struct rcu_ctrlblk {
	long batch;
};
DECLARE_PER_CPU(struct rcu_data, rcu_data);
struct rcu_ctrlblk rcu_ctrlblk = {
	.batch = 0,
};

void
rcu_read_lock(void)
{
	preempt_disable();
	if (current->rcu_read_lock_nesting++ == 0) {
		current->rcu_read_lock_ptr =
			&__get_cpu_var(rcu_data).lock;
		read_lock(current->rcu_read_lock_ptr);
	}
	preempt_enable();
}

void
rcu_read_unlock(void)
{
	preempt_disable();
	if (--current->rcu_read_lock_nesting == 0) {
		read_unlock(current->rcu_read_lock_ptr);
	}
	preempt_enable();
}

void
_synchronize_kernel(void)
{
	int cpu;

	for_each_cpu(cpu) {
		write_lock(&per_cpu(rcu_data, cpu).lock);
	}
	rcu_ctrlblk.batch++;
	for_each_cpu(cpu) {
		write_unlock(&per_cpu(rcu_data, cpu).lock);
	}
}

void
synchronize_kernel(void)
{
	long oldbatch;

	smp_mb();
	oldbatch = rcu_ctrlblk.batch;
	schedule_timeout(HZ/GRACE_PERIODS_PER_SEC);
	if (rcu_ctrlblk.batch == oldbatch) {
		_synchronize_kernel();
	}
}

void
call_rcu(struct rcu_head *head,
	 void (*func)(struct rcu_head *rcu))
{
	unsigned long flags;
	struct rcu_data *rdp;

	head->func = func;
	head->next = NULL;
	local_irq_save(flags);
	rcu_advance_callbacks();
	rdp = &__get_cpu_var(rcu_data);
	*rdp->waittail = head;
	rdp->waittail = &head->next;
	local_irq_restore(flags);
}

void rcu_advance_callbacks(void)
{
	unsigned long flags;
	struct rcu_data *rdp;

	local_irq_save(flags);  /* allow invocation from OOM handler. */
	rdp = &__get_cpu_var(rcu_data);
	smp_mb();  /* prevent sampling batch # before list removal. */
	if (rdp->batch != rcu_ctrlblk.batch) {
		*rdp->donetail = rdp->waitlist;
		rdp->donetail = rdp->waittail;
		rdp->waitlist = NULL;
		rdp->waittail = &rdp->waitlist;
		rdp->batch = rcu_ctrlblk.batch;
	}
	local_irq_restore(flags);
}

void rcu_process_callbacks(void)
{
	unsigned long flags;
	struct rcu_head *next, *list;
	struct rcu_data *rdp;

	local_irq_save(flags);
	rdp = &__get_cpu_var(rcu_data);
	list = rdp->donelist;
	if (list == NULL) {
		local_irq_restore(flags);
		return;
	}
	rdp->donelist = NULL;
	rdp->donetail = &rdp->donelist;
	local_irq_restore(flags);
	while (list) {
		next = list->next;
		list->func(list);
		list = next;
	}
}

/*
 * Note: rcu_dereference() and rcu_assign_pointer() unchanged
 * from current in-tree implementation.
 */

This implementation meets all of the requirements, -except- that it
imposes significant (but probably not crippling) synchronization overhead
on RCU read-side critical sections. This implementation is probably
serviceable on systems with up to 2-4 CPUs, and perhaps a few more.
However, it is not going to cut it on a mid-range POWER system, much
less a high-end Altix system. So, if you don't need heavy-duty realtime
response, you should stick with "classic RCU". Or, better yet, come
up with a single RCU implementation that is suited to both realtime and
server usage!

Quick Quiz 7: What if there is real interrupt or softirq code that
needs to call rcu_read_lock()?

Quick Quiz 8: What other bugs are present in implementation #5?


------------------------------------------------------------------------

REQUIREMENTS

These are -not- in strict priority order. Nor can they be, since
different people have different priorities. ;-)

1. Deferred destruction.

No data element may be destroyed (for example, freed) while an
RCU read-side critical section is referencing it. This property
absolutely must be implemented, as failing to do so will break
pretty much all the code that uses RCU.

2. Reliable.

The implementation must not be prone to gratuitous failure; it
must be able to run 24x7 for months at a time, and preferably
for years.

[This is a problem for so-called "hazard pointer" mechanisms,
which require that hazard pointers be allocated and freed.
So what do you do when the allocation fails while nested
way down in the bowels of some lock's critical section?
Avoiding this sort of unwind code was a major motivation
for Jack and me to come up with RCU in the first place!]

3. Callable from IRQ.

Since RCU is used from softirq state, a realtime implementation
must either eliminate softirq and interrupt states, eliminate use
of RCU from these states, or make RCU callable from these states.

[I believe that the real-time preemption patch moves code
from softirq/interrupt to process context, but could easily be
missing something. If need be, there are ways of handling cases
where realtime RCU is called from softirq and interrupt contexts;
for an example approach, see the Quick Quiz answers.]

4. Preemptible read side.

RCU read-side critical sections can (in theory, anyway) be quite
large, degrading realtime scheduling response. Preemptible RCU
read-side critical sections avoid such degradation. Manual
insertion of "preemption points" might be an alternative as well.
But I have no intention of trying to settle the long-running
argument between proponents of preemption and of preemption
points. Not today, anyway! ;-)

[This is one of classic RCU's weak points. No surprise, as
it was not designed with aggressive realtime in mind.]

5. Small memory footprint.

Many realtime systems are configured with modest amounts
of memory, so it is highly desirable to limit the number of
outstanding RCU callbacks. It is also necessary to force
"reaping" of callbacks should the system run short of memory.

[This is another of classic RCU's weak points. No surprise,
it was not designed with tiny-memory embedded systems in
mind.]

6. Independent of memory blocks.

The implementation should not make assumptions about the size
and extent of the data elements being protected by RCU, since
such assumptions constrain memory allocation design and can
impose increased complexity.

[This requirement makes it difficult to use so-called
"hazard pointer" techniques.]

7. Synchronization-free read side.

RCU read-side critical sections should avoid expensive
synchronization instructions or cache misses. It is most
important to avoid global locks and atomic instructions that
modify global memory, as these operations can inflict severe
performance and scalability bottlenecks. Avoiding memory barriers
is desirable but not as critical.

[This is where classic RCU shines. The realtime scheme does not
have a synchronization-free read side, but implementation #5 at
least has decent cache locality for read-mostly workloads.]

8. Freely nestable read-side critical sections.

Restrictions on nestability can result in code duplication,
so should be avoided if possible.

[This is where the current real-time preempt RCU implementation
has -serious- problems.]

9. Unconditional read-to-write upgrade.

RCU permits a read-side critical section to acquire the
corresponding write-side lock. This capability can be convenient
in some cases. However, since it is rarely used, it cannot be
considered mandatory. Then again, working around the lack of
such upgrade usually results in code bloat.

[This is another big issue with the current real-time preempt
RCU implementation.]

10. Compatible API.

A realtime RCU implementation should have an API compatible with
that in the kernel.

[Another big issue with the current real-time preempt RCU
implementation.]


------------------------------------------------------------------------

ANSWERS TO QUICK QUIZ

Quick Quiz 1: In what way does implementation #1 violate the
current "classic RCU" API?

Answer: Unlike "classic RCU", it is not legal to invoke
call_rcu() from within an RCU read-side critical
section. Doing so will result in deadlock.

Quick Quiz 2: Why does implementation #2 not meet realtime requirements?

Answer: Because all the callbacks are invoked with interrupts
disabled. If there are a lot of callbacks, realtime
latency will suffer.

Quick Quiz 3: Why does implementation #2 fail to scale?

Answer: Because of the large cache-miss penalties of modern
microprocessors. See http://linuxjournal.com/article/6993
for more details on why this is so.

Quick Quiz 4: Why does implementation #4 fail the realtime test?

Answer: Because the realtime-preempt version of locks can be
preempted. Therefore, the rcu_read_unlock() might
run on a different CPU than did the corresponding
rcu_read_lock(), resulting in horrible confusion and
a great decrease in the expected lifespan of your
kernel.

Also because it reverts to invoking the callbacks
with interrupts disabled.

Quick Quiz 5: Why the use of for_each_cpu() rather than (say)
for_each_online_cpu() in _synchronize_kernel()?

Answer: To completely avoid any races with CPU hotplug.
Admittedly a large hammer to use, but it does
have the advantage of simplicity. Suggestions
for improvements welcome!

Quick Quiz 6: Why aren't there write-side contention problems
in implementation #4?

Answer: Because synchronize_kernel() refuses to acquire
the lock more than 10 times a second. Of course,
if you invoke _synchronize_kernel() frequently
from an OOM handler, or if you are running on
a 100-CPU machine, some adjustments may be needed.
In a -lot- of places in the latter case.

Quick Quiz 7: What if there is real interrupt or softirq code that
needs to call rcu_read_lock()?

Answer: Use something like the call_rcu_bh() API. This could
have a similar implementation to the ones laid out here,
but rcu_read_lock_bh() must disable preemption, and
perhaps also interrupts. As must _synchronize_kernel().

Quick Quiz 8: What other bugs are present in implementation #5?

Answer: Beats me! If I knew, I might have fixed them. ;-)


2005-03-18 07:49:42

by Ingo Molnar

Subject: Re: Real-Time Preemption and RCU


* Paul E. McKenney <[email protected]> wrote:

> Seems to me that it would be good to have an RCU implementation that
> meet the requirements of the Real-Time Preemption patch, but that is
> 100% compatible with the "classic RCU" API. Such an implementation
> must meet a number of requirements, which are listed at the end of
> this message (search for "REQUIREMENTS").

[ Wow. you must be a secret telepathic mind-reader - yesterday i was
thinking about mailing you, because my approach to RCU preemptability
(the API variants) clearly sucked and caused problems all around, both
in terms of maintainability and in terms of stability and
scalability. ]

> 5. The final version, which both scales and meets realtime
> requirements, as well as exactly matching the "classic RCU"
> API.
>
> I have tested this approach, but in user-level scaffolding. All of
> these implementations should therefore be regarded with great
> suspicion: untested, probably don't even compile. Besides which, I
> certainly can't claim to fully understand the real-time preempt patch,
> so I am bound to have gotten something wrong somewhere. In any case,
> none of these implementations are a suitable replacement for "classic
> RCU" on large servers, since they acquire locks in the RCU read-side
> critical sections. However, they should scale enough to support small
> SMP systems, inflicting only a modest performance penalty.

basically for PREEMPT_RT the only constraint is that RCU sections should
be preemptable. Whatever the performance cost. If PREEMPT_RT is merged
into the upstream kernel then it will (at least initially) be at a
status similar to NOMMU: it will be tolerated as long as it causes no
'drag' on the main code. The RCU API variants i introduced clearly
violated this requirement, and were my #1 worry wrt. upstream
mergeability.

> I believe that implementation #5 is most appropriate for real-time
> preempt kernels. [...]

yeah, agreed - it looks perfect - both the read and write sides are
preemptable. Can i just plug the code you sent into rcupdate.c and
expect it to work, or would you like to send a patch? If you prefer you
can make it an unconditional patch against an upstream kernel to keep
things simple for you - i'll then massage it to be properly PREEMPT_RT
dependent.

> [...] In theory, #3 might be appropriate, but if I understand the
> real-time preempt implementation of reader-writer lock, it will not
> perform well if there are long RCU read-side critical sections, even
> in UP kernels.

all RCU-locked sections must be preemptable in -RT. Basically RCU is a
mainstream API that is used by lots of code and will be introduced in
many other areas as well. The -RT kernel sees this as an
'uncontrollable latency source', which keeps introducing critical
sections. One major goal of PREEMPT_RT is to convert all popular
critical section APIs into preemptible sections, so that the amount of
code that is non-preemptable is drastically reduced and can be managed
(and thus can be trusted). This goal has a higher priority than any
performance consideration, because it doesnt matter what performance you
have, if you cannot trust the kernel to be deterministic.

Ingo

2005-03-18 08:45:41

by Ingo Molnar

Subject: Re: Real-Time Preemption and RCU


* Paul E. McKenney <[email protected]> wrote:

> [I believe that the real-time preemption patch moves code
> from softirq/interrupt to process context, but could easily be
> missing something. If need be, there are ways of handling cases
> were realtime RCU is called from softirq and interrupt contexts,
> for an example approach, see the Quick Quiz answers.]

correct, on PREEMPT_RT, IRQ processing is (and must be) in process
context.

(it's pretty much a must-have thing to enable the preemption of irq
handlers and softirq processing: if irq/softirq contexts nested on the
kernel stack then irq contexts would 'pin down' the process context that
was interrupted, so even if irq preemption was possible, the underlying
process context couldnt be freely scheduled.)

still, it's nice that your #5 design does not rely on it - it's a sign
of fundamental robustness, plus we could (in theory) offer preemptable
RCU on PREEMPT_DESKTOP (== current PREEMPT) as well - just like there is
PREEMPT_BKL.

Ingo

2005-03-18 09:05:59

by Ingo Molnar

Subject: Re: Real-Time Preemption and RCU


* Paul E. McKenney <[email protected]> wrote:

> 4. Preemptible read side.
>
> RCU read-side critical sections can (in theory, anyway) be quite
> large, degrading realtime scheduling response. Preemptible RCU
> read-side critical sections avoid such degradation. Manual
> insertion of "preemption points" might be an alternative as well.
> But I have no intention of trying to settle the long-running
> argument between proponents of preemption and of preemption
> points. Not today, anyway! ;-)

i'm cleverly sidestepping that argument by offering 4 preemption models
in the -RT tree :-) We dont have to pick a winner, users will. The way
low latencies are achieved depends on the preemption model:

( ) No Forced Preemption (Server)
( ) Voluntary Kernel Preemption (Desktop)
( ) Preemptible Kernel (Low-Latency Desktop)
(X) Complete Preemption (Real-Time)

"Server" is the current default !PREEMPT model. Best throughput, bad
worst-case latencies.

"Low-Latency Desktop" is the current PREEMPT model. Has some runtime
overhead relative to Server, offers fair worst-case latencies.

"Desktop" is a new mode that is somewhere between Server and Low-Latency
Desktop: it's what was initially called 'voluntary preemption'. It
doesnt make the kernel forcibly preemptible anywhere, but inserts a fair
number of preemption points to decrease latencies statistically, while
keeping the runtime overhead close to the "Server". (a variant of this
model is utilized by Fedora and RHEL4 currently, so if this gets
upstream i expect distros to pick it up - it can be a migration helper
towards the "Low-Latency Desktop" model.)

"Real-Time" is the no-compromises hard-RT model: almost everything but
the scheduler itself is fully preemptible. Phenomenal worst-case
latencies in every workload scenario, but also has the highest runtime
overhead.

preemptable RCU makes sense for the "Low-Latency Desktop" and
"Real-Time" preemption models - these are the ones that do forced
preemption.

Ingo

2005-03-18 09:15:42

by Ingo Molnar

Subject: Re: Real-Time Preemption and RCU


* Paul E. McKenney <[email protected]> wrote:

> I have tested this approach, but in user-level scaffolding. All of
> these implementations should therefore be regarded with great
> suspicion: untested, probably don't even compile. Besides which, I
> certainly can't claim to fully understand the real-time preempt patch,
> so I am bound to have gotten something wrong somewhere. [...]

you dont even have to consider the -RT patchset: if the scheme allows
forced preemption of read-side RCU sections on current upstream
CONFIG_PREEMPT, then it's perfect for PREEMPT_RT too.

Ingo

2005-03-18 09:28:41

by Ingo Molnar

Subject: Re: Real-Time Preemption and RCU


* Ingo Molnar <[email protected]> wrote:

>
> * Paul E. McKenney <[email protected]> wrote:
>
> > I have tested this approach, but in user-level scaffolding. All of
> > these implementations should therefore be regarded with great
> > suspicion: untested, probably don't even compile. Besides which, I
> > certainly can't claim to fully understand the real-time preempt patch,
> > so I am bound to have gotten something wrong somewhere. [...]
>
> you dont even have to consider the -RT patchset: if the scheme allows
> forced preemption of read-side RCU sections on current upstream
> CONFIG_PREEMPT, then it's perfect for PREEMPT_RT too.

there's one detail on PREEMPT_RT though (which i think you noticed too).

Priority inheritance handling can be done in a pretty straightforward
way as long as no true read-side nesting is allowed for rwsems and
rwlocks - i.e. there's only one owner of a lock at a time. So PREEMPT_RT
restricts rwsem and rwlock concurrency: readers are writers, with the
only exception that they are allowed to 'self-nest'. I.e. things like:

read_lock(&rwlock);
...
read_lock(&rwlock);

are still legal. (it's also done quite often.)

(it is virtually impossible to implement priority inheritance for true
multi-reader locks in any sane way: i've done it initially and it sucks
very much. It also fundamentally increases the 'lock-dependent'
worst-case latencies - imagine 4 readers having to finish first if a
higher-prio writer comes along. It's insane.)

Ingo

2005-03-18 09:38:39

by Ingo Molnar

Subject: Re: Real-Time Preemption and RCU


* Ingo Molnar <[email protected]> wrote:

> * Paul E. McKenney <[email protected]> wrote:
>
> > 4. Preemptible read side.
> >
> > RCU read-side critical sections can (in theory, anyway) be quite
> > large, degrading realtime scheduling response. Preemptible RCU
> > read-side critical sections avoid such degradation. Manual
> > insertion of "preemption points" might be an alternative as well.
> > But I have no intention of trying to settle the long-running
> > argument between proponents of preemption and of preemption
> > points. Not today, anyway! ;-)
>
> i'm cleverly sidestepping that argument by offering 4 preemption
> models in the -RT tree :-) We dont have to pick a winner, users will.
> [...]

also, it turned out that "preemption points" vs. "forced preemption" are
not exclusive concepts: PREEMPT_RT relies on _both_ of them.

when a highprio task tries to acquire a lock that another, lower-prio
task already holds, then 'the time it takes for the lowprio task to drop
the lock' directly depends on the frequency of explicit preemption
points within the critical section. So to get good 'lock dependent'
latencies on PREEMPT_RT we also need a good distribution of preemption
points within locked sections.

(obviously, 'lock independent' preemption latencies depend purely on
the quality of forced preemption and the size of non-preempt critical
sections, not on any explicit preemption point.)

dont we love seemingly conflicting concepts that end up helping each
other? It's a nice flamewar-killer ;-)

Ingo

2005-03-18 09:53:42

by Ingo Molnar

Subject: Re: Real-Time Preemption and RCU


* Ingo Molnar <[email protected]> wrote:

> there's one detail on PREEMPT_RT though (which i think you noticed too).
>
> Priority inheritance handling can be done in a pretty straightforward
> way as long as no true read-side nesting is allowed for rwsems and
> rwlocks - i.e. there's only one owner of a lock at a time. So
> PREEMPT_RT restricts rwsem and rwlock concurrency: readers are
> writers, with the only exception that they are allowed to 'self-nest'.
> [...]

this does not affect read-side RCU, because read-side RCU can never
block a higher-prio thread. (as long as callback processing is pushed
into a separate kernel thread.)

so RCU will be pretty much the only mechanism (besides lock-free code)
that allows reader concurrency on PREEMPT_RT.

Ingo

2005-03-18 10:03:51

by Ingo Molnar

Subject: Re: Real-Time Preemption and RCU


there's a problem in #5's rcu_read_lock():

void
rcu_read_lock(void)
{
	preempt_disable();
	if (current->rcu_read_lock_nesting++ == 0) {
		current->rcu_read_lock_ptr =
			&__get_cpu_var(rcu_data).lock;
		read_lock(current->rcu_read_lock_ptr);
	}
	preempt_enable();
}

not only are read_lock()-ed sections preemptible, read_lock() itself may
block, so it cannot be called from within preempt_disable(). How about
something like:

void
rcu_read_lock(void)
{
	preempt_disable();
	if (current->rcu_read_lock_nesting++ == 0) {
		current->rcu_read_lock_ptr =
			&__get_cpu_var(rcu_data).lock;
		preempt_enable();
		read_lock(current->rcu_read_lock_ptr);
	} else
		preempt_enable();
}

this would still make it 'statistically scalable' - but is it correct?

Ingo

2005-03-18 11:31:33

by Ingo Molnar

Subject: Re: Real-Time Preemption and RCU


* Ingo Molnar <[email protected]> wrote:

> [...] How about something like:
>
> void
> rcu_read_lock(void)
> {
> preempt_disable();
> if (current->rcu_read_lock_nesting++ == 0) {
> current->rcu_read_lock_ptr =
> &__get_cpu_var(rcu_data).lock;
> preempt_enable();
> read_lock(current->rcu_read_lock_ptr);
> } else
> preempt_enable();
> }
>
> this would still make it 'statistically scalable' - but is it correct?

thinking some more about it, i believe it's correct, because it picks
one particular CPU's lock and correctly releases that lock.

(read_unlock() is atomic even on PREEMPT_RT, so rcu_read_unlock() is
fine.)

Ingo

2005-03-18 11:38:27

by Ingo Molnar

Subject: Re: Real-Time Preemption and RCU


* Paul E. McKenney <[email protected]> wrote:

> 5. Scalability -and- Realtime Response.
>
> The trick here is to keep track of the CPU that did the
> rcu_read_lock() in the task structure. If there is a preemption,
> there will be some slight inefficiency, but the correct lock will be
> released, preserving correctness.

the inefficiency will be larger, because (as explained in a previous
mail) multiple concurrent owners of the read-lock are not allowed. This
adds to the overhead of PREEMPT_RT on SMP, but is an intentional
tradeoff. I dont expect PREEMPT_RT to be used on an Altix :-|

#5 is still the best option, because in the normal 'infrequent
preemption' case it will behave in a cache-friendly way. A positive
effect of the lock serializing is that the callback backlog will stay
relatively small and that the RCU backlog processing can be made
deterministic as well (if needed), by making the backlog processing
thread(s) SCHED_FIFO.
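
(For illustration, and assuming such a dedicated callback thread exists,
making it SCHED_FIFO could be as simple as the sketch below, with the
priority value chosen to suit the workload:

	static void rcu_set_callback_thread_rt(struct task_struct *t)
	{
		struct sched_param param = { .sched_priority = 1 };

		sched_setscheduler(t, SCHED_FIFO, &param);
	}
)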

Ingo

2005-03-18 12:55:10

by Bill Huey

Subject: Re: Real-Time Preemption and RCU

On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> 5. Scalability -and- Realtime Response.
...

> void
> rcu_read_lock(void)
> {
> preempt_disable();
> if (current->rcu_read_lock_nesting++ == 0) {
> current->rcu_read_lock_ptr =
> &__get_cpu_var(rcu_data).lock;
> read_lock(current->rcu_read_lock_ptr);
> }
> preempt_enable();
> }

Ok, here's a rather unsure question...

Uh, is that a sleep violation if that is exclusively held since it
can block within an atomic critical section (deadlock) ?

bill


2005-03-18 13:15:39

by Bill Huey

Subject: Re: Real-Time Preemption and RCU

On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote:
> On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> > 5. Scalability -and- Realtime Response.
> ...
>
> > void
> > rcu_read_lock(void)
> > {
> > preempt_disable();
> > if (current->rcu_read_lock_nesting++ == 0) {
> > current->rcu_read_lock_ptr =
> > &__get_cpu_var(rcu_data).lock;
> > read_lock(current->rcu_read_lock_ptr);
> > }
> > preempt_enable();
> > }
>
> Ok, here's a rather unsure question...
>
> Uh, is that a sleep violation if that is exclusively held since it
> can block within an atomic critical section (deadlock) ?

I'd like to note another problem. Mingo's current implementation of rt_mutex
(super mutex for all blocking synchronization) is still missing reader counts
and something like that would have to be implemented if you want to do priority
inheritance over blocks.

This is going to throw a wrench into your implementation if you assume that.

bill


2005-03-18 15:51:51

by Paul E. McKenney

Subject: Re: Real-Time Preemption and RCU

On Fri, Mar 18, 2005 at 10:53:27AM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > there's one detail on PREEMPT_RT though (which i think you noticed too).
> >
> > Priority inheritance handling can be done in a pretty straightforward
> > way as long as no true read-side nesting is allowed for rwsems and
> > rwlocks - i.e. there's only one owner of a lock at a time. So
> > PREEMPT_RT restricts rwsem and rwlock concurrency: readers are
> > writers, with the only exception that they are allowed to 'self-nest'.
> > [...]
>
> this does not affect read-side RCU, because read-side RCU can never
> block a higher-prio thread. (as long as callback processing is pushed
> into a separate kernel thread.)
>
> so RCU will be pretty much the only mechanism (besides lock-free code)
> that allows reader concurrency on PREEMPT_RT.

This is a relief! I was wondering how on earth I was going to solve
the multi-task priority-inheritance problem!

But... How do we handle the following scenario?

0. A bunch of low-priority threads are preempted in the
middle of various RCU read-side critical sections.

1. High-priority thread does kmalloc(), but there is no
memory, so it blocks.

2. OOM handling notices, and decides to clean up the outstanding
RCU callbacks. It therefore invokes _synchronize_kernel()
(in implementation #5).

3. The _synchronize_kernel() function tries to acquire all of
the read-side locks, which are held by numerous preempted
low-priority threads.

4. What now???

Or does the current patch do priority inheritance across the memory
allocator? In other words, can we get away with ignoring this one? ;-)

Thanx, Paul

2005-03-18 15:53:06

by Paul E. McKenney

Subject: Re: Real-Time Preemption and RCU

On Fri, Mar 18, 2005 at 11:03:39AM +0100, Ingo Molnar wrote:
>
> there's a problem in #5's rcu_read_lock():
>
> void
> rcu_read_lock(void)
> {
> preempt_disable();
> if (current->rcu_read_lock_nesting++ == 0) {
> current->rcu_read_lock_ptr =
> &__get_cpu_var(rcu_data).lock;
> read_lock(current->rcu_read_lock_ptr);
> }
> preempt_enable();
> }
>
> not only are read_lock()-ed sections preemptible, read_lock() itself may
> block, so it cannot be called from within preempt_disable(). How about
> something like:
>
> void
> rcu_read_lock(void)
> {
> preempt_disable();
> if (current->rcu_read_lock_nesting++ == 0) {
> current->rcu_read_lock_ptr =
> &__get_cpu_var(rcu_data).lock;
> preempt_enable();
> read_lock(current->rcu_read_lock_ptr);
> } else
> preempt_enable();
> }
>
> this would still make it 'statistically scalable' - but is it correct?

Good catch!

Also good question...

Strictly speaking, it is not necessary to block callback invocation until
just after rcu_read_lock() returns.

It is correct as long as there is no sort of "upcall" or "callback" that
can masquerade as this task. I know of no such thing in the Linux kernel.
In fact such a thing would break a lot of code, right?

Any tool that relied on the ->rcu_read_lock_nesting counter to deduce
RCU state would be confused by this change, but there might be other
ways of handling this. Also, we are currently making do without such
a tool.

It should be possible to move the preempt_enable() further forward
ahead of the assignment to ->rcu_read_lock_ptr, since the assignment
to ->rcu_read_lock_ptr is strictly local. Not sure that this is
worthwhile, thoughts?

void
rcu_read_lock(void)
{
	preempt_disable();
	if (current->rcu_read_lock_nesting++ == 0) {
		preempt_enable();
		current->rcu_read_lock_ptr =
			&__get_cpu_var(rcu_data).lock;
		read_lock(current->rcu_read_lock_ptr);
	} else
		preempt_enable();
}

The other question is whether preempt_disable() is needed in the first
place. The two task-structure fields are not accessed except by the
task itself. I bet that the following is just fine:

void
rcu_read_lock(void)
{
	if (current->rcu_read_lock_nesting++ == 0) {
		current->rcu_read_lock_ptr =
			&__get_cpu_var(rcu_data).lock;
		read_lock(current->rcu_read_lock_ptr);
	}
}
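
The matching rcu_read_unlock() under the same assumption would presumably
just be:

void
rcu_read_unlock(void)
{
	if (--current->rcu_read_lock_nesting == 0)
		read_unlock(current->rcu_read_lock_ptr);
}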

Thoughts?

Thanx, Paul

2005-03-18 15:54:18

by Paul E. McKenney

Subject: Re: Real-Time Preemption and RCU

On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote:
> On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> > 5. Scalability -and- Realtime Response.
> ...
>
> > void
> > rcu_read_lock(void)
> > {
> > preempt_disable();
> > if (current->rcu_read_lock_nesting++ == 0) {
> > current->rcu_read_lock_ptr =
> > &__get_cpu_var(rcu_data).lock;
> > read_lock(current->rcu_read_lock_ptr);
> > }
> > preempt_enable();
> > }
>
> Ok, here's a rather unsure question...
>
> Uh, is that a sleep violation if that is exclusively held since it
> can block within an atomic critical section (deadlock) ?

Hey, I wasn't joking when I said that I probably got something wrong! ;-)

My current thought is that the preempt_disable()/preempt_enable() can
be dropped entirely. Messes up any tool that browses through
->rcu_read_lock_nesting, but don't see any other problem. Yet, anyway!

Thanx, Paul

2005-03-18 15:59:47

by Paul E. McKenney

Subject: Re: Real-Time Preemption and RCU

On Fri, Mar 18, 2005 at 05:17:29AM -0800, Bill Huey wrote:
> On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote:
> > On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> > > 5. Scalability -and- Realtime Response.
> > ...
> >
> > > void
> > > rcu_read_lock(void)
> > > {
> > > preempt_disable();
> > > if (current->rcu_read_lock_nesting++ == 0) {
> > > current->rcu_read_lock_ptr =
> > > &__get_cpu_var(rcu_data).lock;
> > > read_lock(current->rcu_read_lock_ptr);
> > > }
> > > preempt_enable();
> > > }
> >
> > Ok, here's a rather unsure question...
> >
> > Uh, is that a sleep violation if that is exclusively held since it
> > can block within an atomic critical section (deadlock) ?
>
> I'd like to note another problem. Mingo's current implementation of rt_mutex
> (super mutex for all blocking synchronization) is still missing reader counts
> and something like that would have to be implemented if you want to do priority
> inheritance over blocks.
>
> This is going to throw a wrench into your implementation if you assume that.

If we need to do priority inheritance across the memory allocator, so
that high-priority tasks blocked waiting for memory pass their priority
on to RCU readers, agreed. But I don't see any sign that real-time
preempt does this.

In absence of this, as Ingo noted, the fact that readers don't block
each other should make things be safe. I think...

Thanx, Paul

2005-03-18 16:01:36

by Ingo Molnar

Subject: Re: Real-Time Preemption and RCU


* Paul E. McKenney <[email protected]> wrote:

> > > preempt_disable();
> > > if (current->rcu_read_lock_nesting++ == 0) {
> > > current->rcu_read_lock_ptr =
> > > &__get_cpu_var(rcu_data).lock;
> > > read_lock(current->rcu_read_lock_ptr);
> > > }
> > > preempt_enable();

> My current thought is that the preempt_disable()/preempt_enable() can
> be dropped entirely. Messes up any tool that browses through
> ->rcu_read_lock_nesting, but don't see any other problem. Yet,
> anyway!

yeah - this sounds good. (We are not aiming for irq-safe RCU anyway, on
PREEMPT_RT.)

Ingo

2005-03-18 16:05:53

by Ingo Molnar

Subject: Re: Real-Time Preemption and RCU


* Bill Huey <[email protected]> wrote:

> I'd like to note another problem. Mingo's current implementation of
> rt_mutex (super mutex for all blocking synchronization) is still
> missing reader counts and something like that would have to be
> implemented if you want to do priority inheritance over blocks.

i really have no intention to allow multiple readers for rt-mutexes. We
got away with that so far, and i'd like to keep it so. Imagine 100
threads all blocked in the same critical section (holding the read-lock)
when a highprio writer thread comes around: instant 100x latency to let
all of them roll forward. The only sane solution is to not allow
excessive concurrency. (That limits SMP scalability, but there's no
other choice i can see.)

Ingo

2005-03-18 16:44:29

by Paul E. McKenney

Subject: Re: Real-Time Preemption and RCU

On Fri, Mar 18, 2005 at 08:49:03AM +0100, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > Seems to me that it would be good to have an RCU implementation that
> > meet the requirements of the Real-Time Preemption patch, but that is
> > 100% compatible with the "classic RCU" API. Such an implementation
> > must meet a number of requirements, which are listed at the end of
> > this message (search for "REQUIREMENTS").
>
> [ Wow. you must be a secret telepatic mind-reader - yesterday i was
> thinking about mailing you, because my approach to RCU preemptability
> (the API variants) clearly sucked and caused problems all around, both
> in terms of maintainability and in terms of stability and
> scalability. ]

Wish I could claim that, but must credit dumb luck. ;-)

> > 5. The final version, which both scales and meets realtime
> > requirements, as well as exactly matching the "classic RCU"
> > API.
> >
> > I have tested this approach, but in user-level scaffolding. All of
> > these implementations should therefore be regarded with great
> > suspicion: untested, probably don't even compile. Besides which, I
> > certainly can't claim to fully understand the real-time preempt patch,
> > so I am bound to have gotten something wrong somewhere. In any case,
> > none of these implementations are a suitable replacement for "classic
> > RCU" on large servers, since they acquire locks in the RCU read-side
> > critical sections. However, they should scale enough to support small
> > SMP systems, inflicting only a modest performance penalty.
>
> basically for PREEMPT_RT the only constraint is that RCU sections should
> be preemptable. Whatever the performance cost. If PREEMPT_RT is merged
> into the upstream kernel then it will (at least initially) be at a
> status similar to NOMMU: it will be tolerated as long as it causes no
> 'drag' on the main code. The RCU API variants i introduced clearly
> violated this requirement, and were my #1 worry wrt. upstream
> mergability.

Fair enough!

> > I believe that implementation #5 is most appropriate for real-time
> > preempt kernels. [...]
>
> yeah, agreed - it looks perfect - both the read and write side is
> preemptable. Can i just plug the code you sent into rcupudate.c and
> expect it to work, or would you like to send a patch? If you prefer you
> can make it an unconditional patch against an upstream kernel to keep
> things simple for you - i'll then massage it to be properly PREEMPT_RT
> dependent.

I will give it a shot, but feel free to grab anything at any point,
starting either from the patch I will send in a bit or from my earlier
email. Do whatever works best for you.

For the patch, here are my questions:

o What is the best way to select between classic RCU and this
scheme?

1. Massive #ifdef across rcupdate.c

2. Create an rcupdate_rt.c and browbeat the build system
into picking one or the other (no clue if this is
possible...)

3. Create an rcupdate_rt.c and rely on the linker to pick
one or the other, with rcupdate.h generating different
external symbol names to make the choice.

Left to myself, I would grit my teeth and do #1, since I
understand how to do it. Flames welcome, as long as accompanied
by good advice!

o How best to interface to OOM? Left to myself, I leave this
for later. ;-)

I will take the cowardly approach of patching against the upstream kernel.

> > [...] In theory, #3 might be appropriate, but if I understand the
> > real-time preempt implementation of reader-writer lock, it will not
> > perform well if there are long RCU read-side critical sections, even
> > in UP kernels.
>
> all RCU-locked sections must be preemptable in -RT. Basically RCU is a
> mainstream API that is used by lots of code and will be introduced in
> many other areas as well. From the -RT kernel's POV sees this as an
> 'uncontrollable latency source', which keeps introducing critical
> sections. One major goal of PREEMPT_RT is to convert all popular
> critical section APIs into preemptible sections, so that the amount of
> code that is non-preemptable is drastically reduced and can be managed
> (and thus can be trusted). This goal has a higher priority than any
> performance consideration, because it doesnt matter what performance you
> have, if you cannot trust the kernel to be deterministic.

Makes sense to me!

Thanx, Paul

2005-03-18 16:49:48

by Esben Nielsen

Subject: Re: Real-Time Preemption and RCU

On Fri, 18 Mar 2005, Ingo Molnar wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> > [...] How about something like:
> >
> > void
> > rcu_read_lock(void)
> > {
> > preempt_disable();
> > if (current->rcu_read_lock_nesting++ == 0) {
> > current->rcu_read_lock_ptr =
> > &__get_cpu_var(rcu_data).lock;
> > preempt_enable();
> > read_lock(current->rcu_read_lock_ptr);
> > } else
> > preempt_enable();
> > }
> >
> > this would still make it 'statistically scalable' - but is it correct?
>
> thinking some more about it, i believe it's correct, because it picks
> one particular CPU's lock and correctly releases that lock.
>
> (read_unlock() is atomic even on PREEMPT_RT, so rcu_read_unlock() is
> fine.)
>

Why should there be only one RCU reader per CPU at any given
instant? Even on a real-time UP system it would be very helpful to have
RCU areas be enterable by several tasks at once. It would perform
better, both wrt. latencies and throughput:
With the above implementation, a high-priority task entering an RCU area
will have to boost the current RCU reader, make a task switch until that
one finishes, and then make yet another task switch to get back to the
high-priority task. With an RCU implementation which can take n RCU
readers per CPU there is no such problem.

Also having all tasks serializing on one lock (per CPU) really destroys
the real-time properties: The latency of anything which uses RCU will be
the worst latency of anything done under the RCU lock.

When I looked briefly at it in the fall, the following solution jumped to
mind: Have an RCU-reader count, rcu_read_count, for each CPU. When you
enter an RCU read region, increment it, and decrement it when you go out
of it. When it is 0, RCU cleanups are allowed - a perfect quiescent state.
At that point, call rcu_qsctr_inc(). Or call it in schedule() as now, just
with an if (rcu_read_count == 0) around it.

I don't think I understand the current code. But if it works now with
preempt_disable()/preempt_enable() around all the read-regions it ought to
work with

	preempt_enable();
	rcu_read_count++/--;
	preempt_disable();

around the same regions, and the above check for rcu_read_count==0 in or
around rcu_qsctr_inc() as well.
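
A minimal sketch of this counter scheme (one possible reading, with
hypothetical names, and ignoring for now readers that get preempted and
migrate to another CPU) could be:

	DEFINE_PER_CPU(int, rcu_read_count);	/* readers currently active on this CPU. */

	void
	rcu_read_lock(void)
	{
		preempt_disable();
		__get_cpu_var(rcu_read_count)++;
		preempt_enable();
	}

	void
	rcu_read_unlock(void)
	{
		preempt_disable();
		__get_cpu_var(rcu_read_count)--;
		preempt_enable();
	}

	/* Called from schedule() or a similar hook on this CPU. */
	static void rcu_note_quiescent_state(int cpu)
	{
		if (per_cpu(rcu_read_count, cpu) == 0)
			rcu_qsctr_inc(cpu);
	}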

It might take a long time before the rcu-batches are actually called,
though, but that is a different story, which can be improved upon. An
improvement would be to boost the non-RT tasks entering an RCU read region
into the lowest RT priority. That way there can't be a lot of low-priority
tasks hanging around keeping rcu_read_count non-zero for a long period of
time, since these tasks can only be preempted by RT tasks while in the
RCU region.

> Ingo

Esben



2005-03-18 16:57:25

by Esben Nielsen

Subject: Re: Real-Time Preemption and RCU


On Fri, 18 Mar 2005, Ingo Molnar wrote:

>
> * Bill Huey <[email protected]> wrote:
>
> > I'd like to note another problem. Mingo's current implementation of
> > rt_mutex (super mutex for all blocking synchronization) is still
> > missing reader counts and something like that would have to be
> > implemented if you want to do priority inheritance over blocks.
>
> i really have no intention to allow multiple readers for rt-mutexes. We
> got away with that so far, and i'd like to keep it so. Imagine 100
> threads all blocked in the same critical section (holding the read-lock)
> when a highprio writer thread comes around: instant 100x latency to let
> all of them roll forward. The only sane solution is to not allow
> excessive concurrency. (That limits SMP scalability, but there's no
> other choice i can see.)
>

Unless a design change is made: one could argue for semantics where
write-locking _isn't_ deterministic and thus does not have to boost all
the readers. Readers boost the writers, but not the other way around.
Readers would be deterministic, but writers would not.
Such semantics would probably work for a lot of RT applications that
happen not to take any write-locks - these would in fact perform better.
But it would give the rest a lot of problems.

> Ingo

Esben

2005-03-18 17:12:21

by Ingo Molnar

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU


* Paul E. McKenney <[email protected]> wrote:

> For the patch, here are my questions:
>
> o What is the best way to select between classic RCU and this
> scheme?
>
> 1. Massive #ifdef across rcupdate.c
>
> 2. Create an rcupdate_rt.c and browbeat the build system
> into picking one or the other (no clue if this is
> possible...)
>
> 3. Create an rcupdate_rt.c and rely on the linker to pick
> one or the other, with rcupdate.h generating different
> external symbol names to make the choice.

you can also go for option #0: just replace the existing RCU code with
the new one, and i'll then deal with the configuration details.

what will have to happen is most likely #2 (since there is near zero
code sharing between the two variants, right?). Picking rcupdate_rt.c is
as simple as doing this:

obj-$(CONFIG_DONT_PREEMPT_RCU) += rcupdate.o
obj-$(CONFIG_PREEMPT_RCU) += rcupdate_rt.o

and then use Kconfig to generate either CONFIG_DONT_PREEMPT_RCU
(default) or CONFIG_PREEMPT_RCU (if the user selects it).

but it's not yet clear whether we want to offer this to users as a
configurable option. The simplest solution for you would be to go with
option #0 :-) [or if you prefer switchability, #1 is good too - i can
then extract the bits and do #2 based on that.]

> o How best to interface to OOM? Left to myself, I leave this
> for later. ;-)

yeah, i'd not worry about OOM that much at this stage.

> I will take the cowardly approach of patching against the upstream
> kernel.

sure. This is in fact easier for me: i'll first rip all my RCU hackery
out of -RT and then add your code, so the base i'll be merging against
will be closer to upstream than to current -RT.

Ingo

2005-03-18 17:20:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU


* Esben Nielsen <[email protected]> wrote:

> Why should there be only one RCU reader per CPU at any given instant?
> Even on a real-time UP system it would be very helpful for RCU regions
> to be enterable by several tasks at once. It would perform better,
> both wrt. latencies and throughput: With the above implementation a
> high-priority task entering an RCU area has to boost the current RCU
> reader, switch to that task until it finishes, and then make yet
> another task switch to get back to the high-priority task. With an RCU
> implementation which can take n RCU readers per CPU there is no such
> problem.

correct, for RCU we could allow multiple readers per lock, because the
'blocking' side of RCU (callback processing) is never (supposed to be)
in any latency path.

except if someone wants to make RCU callback processing deterministic at
some point. (e.g. for memory management reasons.)

clearly the simplest solution is to go with the single-reader locks for
now - a separate experiment could be done with a new type of rwlock that
can only be used by the RCU code. (I'm not quite sure whether we could
guarantee a minimum rate of RCU callback processing under such a scheme
though. It's an eventual memory DoS otherwise.)
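
Concretely, such an RCU-only multi-reader lock might look something like
the sketch below (names and structure are made up for illustration; the
unbounded writer-side wait is exactly where the callback-rate / memory
DoS worry comes in):

struct rcu_reader_lock {
	atomic_t readers[NR_CPUS];	/* zero-initialised */
};

static int rcu_reader_enter(struct rcu_reader_lock *l)
{
	int cpu;

	preempt_disable();
	cpu = smp_processor_id();
	atomic_inc(&l->readers[cpu]);
	preempt_enable();
	return cpu;	/* caller hands this back to rcu_reader_exit() */
}

static void rcu_reader_exit(struct rcu_reader_lock *l, int cpu)
{
	atomic_dec(&l->readers[cpu]);
}

/* grace-period side: wait for all readers to drain */
static void rcu_wait_for_readers(struct rcu_reader_lock *l)
{
	int cpu;

	for_each_cpu(cpu)
		while (atomic_read(&l->readers[cpu]) != 0)
			cpu_relax();	/* unbounded without reader boosting */
}

Because nothing ever boosts the readers, the wait in
rcu_wait_for_readers() has no latency bound - acceptable for RCU
callback processing, but also the reason such a lock could not be used
for anything else.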

Ingo

2005-03-18 17:29:27

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Fri, Mar 18, 2005 at 06:11:26PM +0100, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > For the patch, here are my questions:
> >
> > o What is the best way to select between classic RCU and this
> > scheme?
> >
> > 1. Massive #ifdef across rcupdate.c
> >
> > 2. Create an rcupdate_rt.c and browbeat the build system
> > into picking one or the other (no clue if this is
> > possible...)
> >
> > 3. Create an rcupdate_rt.c and rely on the linker to pick
> > one or the other, with rcupdate.h generating different
> > external symbol names to make the choice.
>
> you can also go for option #0: just replace the existing RCU code with
> the new one, and i'll then deal with the configuration details.

Having just spent the past few minutes descending into #ifdef hell,
I agree completely with your option #0.

> what will have to happen is most likely #2 (since there is near zero
> code sharing between the two variants, right?). Picking rcupdate_rt.c is
> as simple as doing this:
>
> obj-$(CONFIG_DONT_PREEMPT_RCU) += rcupdate.o
> obj-$(CONFIG_PREEMPT_RCU) += rcupdate_rt.o
>
> and then use Kconfig to generate either CONFIG_DONT_PREEMPT_RCU
> (default) or CONFIG_PREEMPT_RCU (if the user selects it).

Cool! Thank you for the tutorial! And yes, there is no shared code
that I can see.

> but it's not yet clear whether we want to offer this to users as a
> configurable option. The simplest solution for you would be to go with
> option #0 :-) [or if you prefer switchability, #1 is good too - i can
> then extract the bits and do #2 based on that.]

Option #0 it is -- I will stick with the locking algorithms and let
wiser and more experienced heads work out the configuration issues.

> > o How best to interface to OOM? Left to myself, I leave this
> > for later. ;-)
>
> yeah, i'd not worry about OOM that much at this stage.
>
> > I will take the cowardly approach of patching against the upstream
> > kernel.
>
> sure. This is in fact easier for me: i'll first rip all my RCU hackery
> out of -RT and then add your code, so the base i'll be merging against
> will be closer to upstream than to current -RT.

Sounds good to me!

Thanx, Paul

2005-03-18 20:37:03

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Fri, Mar 18, 2005 at 06:11:26PM +0100, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > For the patch, here are my questions:
> >
> > o What is the best way to select between classic RCU and this
> > scheme?
> >
> > 1. Massive #ifdef across rcupdate.c
> >
> > 2. Create an rcupdate_rt.c and browbeat the build system
> > into picking one or the other (no clue if this is
> > possible...)
> >
> > 3. Create an rcupdate_rt.c and rely on the linker to pick
> > one or the other, with rcupdate.h generating different
> > external symbol names to make the choice.
>
> you can also go for option #0: just replace the existing RCU code with
> the new one, and i'll then deal with the configuration details.
>
> what will have to happen is most likely #2 (since there is near zero
> code sharing between the two variants, right?). Picking rcupdate_rt.c is
> as simple as doing this:
>
> obj-$(CONFIG_DONT_PREEMPT_RCU) += rcupdate.o
> obj-$(CONFIG_PREEMPT_RCU) += rcupdate_rt.o
>
> and then use Kconfig to generate either CONFIG_DONT_PREEMPT_RCU
> (default) or CONFIG_PREEMPT_RCU (if the user selects it).
>
> but it's not yet clear whether we want to offer this to users as a
> configurable option. The simplest solution for you would be to go with
> option #0 :-) [or if you prefer switchability, #1 is good too - i can
> then extract the bits and do #2 based on that.]
>
> > o How best to interface to OOM? Left to myself, I leave this
> > for later. ;-)
>
> yeah, i'd not worry about OOM that much at this stage.
>
> > I will take the cowardly approach of patching against the upstream
> > kernel.
>
> sure. This is in fact easier for me: i'll first rip all my RCU hackery
> out of -RT and then add your code, so the base i'll be merging against
> will be closer to upstream than to current -RT.

Compiles, probably dies horribly. "diff" didn't do such a good job
on this one, so attaching the raw rcupdate.[hc] files as well.

Thanx, Paul

Signed-off-by: <[email protected]>

diff -urpN -X ../dontdiff linux-2.5/include/linux/rcupdate.h linux-2.5-rtRCU/include/linux/rcupdate.h
--- linux-2.5/include/linux/rcupdate.h Wed Mar 9 12:37:06 2005
+++ linux-2.5-rtRCU/include/linux/rcupdate.h Fri Mar 18 11:37:02 2005
@@ -58,169 +58,11 @@ struct rcu_head {
(ptr)->next = NULL; (ptr)->func = NULL; \
} while (0)

-
-
-/* Global control variables for rcupdate callback mechanism. */
-struct rcu_ctrlblk {
- long cur; /* Current batch number. */
- long completed; /* Number of the last completed batch */
- int next_pending; /* Is the next batch already waiting? */
-} ____cacheline_maxaligned_in_smp;
-
-/* Is batch a before batch b ? */
-static inline int rcu_batch_before(long a, long b)
-{
- return (a - b) < 0;
-}
-
-/* Is batch a after batch b ? */
-static inline int rcu_batch_after(long a, long b)
-{
- return (a - b) > 0;
-}
-
-/*
- * Per-CPU data for Read-Copy UPdate.
- * nxtlist - new callbacks are added here
- * curlist - current batch for which quiescent cycle started if any
- */
-struct rcu_data {
- /* 1) quiescent state handling : */
- long quiescbatch; /* Batch # for grace period */
- int passed_quiesc; /* User-mode/idle loop etc. */
- int qs_pending; /* core waits for quiesc state */
-
- /* 2) batch handling */
- long batch; /* Batch # for current RCU batch */
- struct rcu_head *nxtlist;
- struct rcu_head **nxttail;
- struct rcu_head *curlist;
- struct rcu_head **curtail;
- struct rcu_head *donelist;
- struct rcu_head **donetail;
- int cpu;
-};
-
-DECLARE_PER_CPU(struct rcu_data, rcu_data);
-DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
-extern struct rcu_ctrlblk rcu_ctrlblk;
-extern struct rcu_ctrlblk rcu_bh_ctrlblk;
-
-/*
- * Increment the quiescent state counter.
- * The counter is a bit degenerated: We do not need to know
- * how many quiescent states passed, just if there was at least
- * one since the start of the grace period. Thus just a flag.
- */
-static inline void rcu_qsctr_inc(int cpu)
-{
- struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
- rdp->passed_quiesc = 1;
-}
-static inline void rcu_bh_qsctr_inc(int cpu)
-{
- struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
- rdp->passed_quiesc = 1;
-}
-
-static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
- struct rcu_data *rdp)
-{
- /* This cpu has pending rcu entries and the grace period
- * for them has completed.
- */
- if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch))
- return 1;
-
- /* This cpu has no pending entries, but there are new entries */
- if (!rdp->curlist && rdp->nxtlist)
- return 1;
-
- /* This cpu has finished callbacks to invoke */
- if (rdp->donelist)
- return 1;
-
- /* The rcu core waits for a quiescent state from the cpu */
- if (rdp->quiescbatch != rcp->cur || rdp->qs_pending)
- return 1;
-
- /* nothing to do */
- return 0;
-}
-
-static inline int rcu_pending(int cpu)
-{
- return __rcu_pending(&rcu_ctrlblk, &per_cpu(rcu_data, cpu)) ||
- __rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu));
-}
-
-/**
- * rcu_read_lock - mark the beginning of an RCU read-side critical section.
- *
- * When synchronize_kernel() is invoked on one CPU while other CPUs
- * are within RCU read-side critical sections, then the
- * synchronize_kernel() is guaranteed to block until after all the other
- * CPUs exit their critical sections. Similarly, if call_rcu() is invoked
- * on one CPU while other CPUs are within RCU read-side critical
- * sections, invocation of the corresponding RCU callback is deferred
- * until after the all the other CPUs exit their critical sections.
- *
- * Note, however, that RCU callbacks are permitted to run concurrently
- * with RCU read-side critical sections. One way that this can happen
- * is via the following sequence of events: (1) CPU 0 enters an RCU
- * read-side critical section, (2) CPU 1 invokes call_rcu() to register
- * an RCU callback, (3) CPU 0 exits the RCU read-side critical section,
- * (4) CPU 2 enters a RCU read-side critical section, (5) the RCU
- * callback is invoked. This is legal, because the RCU read-side critical
- * section that was running concurrently with the call_rcu() (and which
- * therefore might be referencing something that the corresponding RCU
- * callback would free up) has completed before the corresponding
- * RCU callback is invoked.
- *
- * RCU read-side critical sections may be nested. Any deferred actions
- * will be deferred until the outermost RCU read-side critical section
- * completes.
- *
- * It is illegal to block while in an RCU read-side critical section.
- */
-#define rcu_read_lock() preempt_disable()
-
-/**
- * rcu_read_unlock - marks the end of an RCU read-side critical section.
- *
- * See rcu_read_lock() for more information.
- */
-#define rcu_read_unlock() preempt_enable()
-
-/*
- * So where is rcu_write_lock()? It does not exist, as there is no
- * way for writers to lock out RCU readers. This is a feature, not
- * a bug -- this property is what provides RCU's performance benefits.
- * Of course, writers must coordinate with each other. The normal
- * spinlock primitives work well for this, but any other technique may be
- * used as well. RCU does not care how the writers keep out of each
- * others' way, as long as they do so.
- */
-
-/**
- * rcu_read_lock_bh - mark the beginning of a softirq-only RCU critical section
- *
- * This is equivalent of rcu_read_lock(), but to be used when updates
- * are being done using call_rcu_bh(). Since call_rcu_bh() callbacks
- * consider completion of a softirq handler to be a quiescent state,
- * a process in RCU read-side critical section must be protected by
- * disabling softirqs. Read-side critical sections in interrupt context
- * can use just rcu_read_lock().
- *
- */
-#define rcu_read_lock_bh() local_bh_disable()
-
-/*
- * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
- *
- * See rcu_read_lock_bh() for more information.
- */
-#define rcu_read_unlock_bh() local_bh_enable()
+#define rcu_read_lock_bh() rcu_read_lock()
+#define rcu_read_unlock_bh() rcu_read_unlock()
+#define call_rcu_bh(head, func) call_rcu(head, func)
+#define rcu_bh_qsctr_inc(cpu)
+#define rcu_qsctr_inc(cpu)

/**
* rcu_dereference - fetch an RCU-protected pointer in an
@@ -257,15 +99,15 @@ static inline int rcu_pending(int cpu)
})

extern void rcu_init(void);
-extern void rcu_check_callbacks(int cpu, int user);
-extern void rcu_restart_cpu(int cpu);

/* Exported interfaces */
extern void FASTCALL(call_rcu(struct rcu_head *head,
void (*func)(struct rcu_head *head)));
-extern void FASTCALL(call_rcu_bh(struct rcu_head *head,
- void (*func)(struct rcu_head *head)));
+extern void rcu_read_lock(void);
+extern void rcu_read_unlock(void);
extern void synchronize_kernel(void);
+extern int rcu_pending(int cpu);
+extern void rcu_check_callbacks(int cpu, int user);

#endif /* __KERNEL__ */
#endif /* __LINUX_RCUPDATE_H */
diff -urpN -X ../dontdiff linux-2.5/include/linux/sched.h linux-2.5-rtRCU/include/linux/sched.h
--- linux-2.5/include/linux/sched.h Wed Mar 9 12:37:07 2005
+++ linux-2.5-rtRCU/include/linux/sched.h Fri Mar 18 11:33:13 2005
@@ -707,6 +707,9 @@ struct task_struct {
struct mempolicy *mempolicy;
short il_next;
#endif
+
+ int rcu_read_lock_nesting;
+ rwlock_t *rcu_read_lock_ptr;
};

static inline pid_t process_group(struct task_struct *tsk)
diff -urpN -X ../dontdiff linux-2.5/kernel/rcupdate.c linux-2.5-rtRCU/kernel/rcupdate.c
--- linux-2.5/kernel/rcupdate.c Wed Mar 9 12:37:22 2005
+++ linux-2.5-rtRCU/kernel/rcupdate.c Fri Mar 18 11:22:24 2005
@@ -47,424 +47,166 @@
#include <linux/rcupdate.h>
#include <linux/cpu.h>

-/* Definition for rcupdate control block. */
-struct rcu_ctrlblk rcu_ctrlblk =
- { .cur = -300, .completed = -300 };
-struct rcu_ctrlblk rcu_bh_ctrlblk =
- { .cur = -300, .completed = -300 };
-
-/* Bookkeeping of the progress of the grace period */
-struct rcu_state {
- spinlock_t lock; /* Guard this struct and writes to rcu_ctrlblk */
- cpumask_t cpumask; /* CPUs that need to switch in order */
- /* for current batch to proceed. */
+#define GRACE_PERIODS_PER_SEC 10
+
+struct rcu_data {
+ rwlock_t lock;
+ long batch;
+ struct rcu_head *waitlist;
+ struct rcu_head **waittail;
+ struct rcu_head *donelist;
+ struct rcu_head **donetail;
+};
+struct rcu_ctrlblk {
+ long batch;
+ unsigned long last_sk;
+};
+DEFINE_PER_CPU(struct rcu_data, rcu_data) = {
+ .lock = RW_LOCK_UNLOCKED,
+ .batch = 0,
+ .waitlist = NULL,
+ .donelist = NULL
+};
+struct rcu_ctrlblk rcu_ctrlblk = {
+ .batch = 0,
};

-static struct rcu_state rcu_state ____cacheline_maxaligned_in_smp =
- {.lock = SPIN_LOCK_UNLOCKED, .cpumask = CPU_MASK_NONE };
-static struct rcu_state rcu_bh_state ____cacheline_maxaligned_in_smp =
- {.lock = SPIN_LOCK_UNLOCKED, .cpumask = CPU_MASK_NONE };
-
-DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
-DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };
-
-/* Fake initialization required by compiler */
-static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
-static int maxbatch = 10;
-
-/**
- * call_rcu - Queue an RCU callback for invocation after a grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual update function to be invoked after the grace period
- *
- * The update function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed. RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
- * and may be nested.
- */
-void fastcall call_rcu(struct rcu_head *head,
- void (*func)(struct rcu_head *rcu))
+void rcu_init(void)
{
- unsigned long flags;
+ int cpu;
struct rcu_data *rdp;

- head->func = func;
- head->next = NULL;
- local_irq_save(flags);
- rdp = &__get_cpu_var(rcu_data);
- *rdp->nxttail = head;
- rdp->nxttail = &head->next;
- local_irq_restore(flags);
+ for_each_cpu(cpu) {
+ rdp = &per_cpu(rcu_data, cpu);
+ rdp->waittail = &rdp->waitlist;
+ rdp->donetail = &rdp->donelist;
+ }
}

-/**
- * call_rcu_bh - Queue an RCU for invocation after a quicker grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual update function to be invoked after the grace period
- *
- * The update function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed. call_rcu_bh() assumes
- * that the read-side critical sections end on completion of a softirq
- * handler. This means that read-side critical sections in process
- * context must not be interrupted by softirqs. This interface is to be
- * used when most of the read-side critical sections are in softirq context.
- * RCU read-side critical sections are delimited by rcu_read_lock() and
- * rcu_read_unlock(), * if in interrupt context or rcu_read_lock_bh()
- * and rcu_read_unlock_bh(), if in process context. These may be nested.
- */
-void fastcall call_rcu_bh(struct rcu_head *head,
- void (*func)(struct rcu_head *rcu))
+void rcu_read_lock(void)
{
- unsigned long flags;
- struct rcu_data *rdp;
-
- head->func = func;
- head->next = NULL;
- local_irq_save(flags);
- rdp = &__get_cpu_var(rcu_bh_data);
- *rdp->nxttail = head;
- rdp->nxttail = &head->next;
- local_irq_restore(flags);
+ preempt_disable();
+ if (current->rcu_read_lock_nesting++ == 0) {
+ current->rcu_read_lock_ptr = &__get_cpu_var(rcu_data).lock;
+ read_lock(current->rcu_read_lock_ptr);
+ }
+ preempt_enable();
}

-/*
- * Invoke the completed RCU callbacks. They are expected to be in
- * a per-cpu list.
- */
-static void rcu_do_batch(struct rcu_data *rdp)
+void rcu_read_unlock(void)
{
- struct rcu_head *next, *list;
- int count = 0;
-
- list = rdp->donelist;
- while (list) {
- next = rdp->donelist = list->next;
- list->func(list);
- list = next;
- if (++count >= maxbatch)
- break;
+ preempt_disable();
+ if (--current->rcu_read_lock_nesting == 0) {
+ read_unlock(current->rcu_read_lock_ptr);
}
- if (!rdp->donelist)
- rdp->donetail = &rdp->donelist;
- else
- tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu));
+ preempt_enable();
}

-/*
- * Grace period handling:
- * The grace period handling consists out of two steps:
- * - A new grace period is started.
- * This is done by rcu_start_batch. The start is not broadcasted to
- * all cpus, they must pick this up by comparing rcp->cur with
- * rdp->quiescbatch. All cpus are recorded in the
- * rcu_state.cpumask bitmap.
- * - All cpus must go through a quiescent state.
- * Since the start of the grace period is not broadcasted, at least two
- * calls to rcu_check_quiescent_state are required:
- * The first call just notices that a new grace period is running. The
- * following calls check if there was a quiescent state since the beginning
- * of the grace period. If so, it updates rcu_state.cpumask. If
- * the bitmap is empty, then the grace period is completed.
- * rcu_check_quiescent_state calls rcu_start_batch(0) to start the next grace
- * period (if necessary).
- */
-/*
- * Register a new batch of callbacks, and start it up if there is currently no
- * active batch and the batch to be registered has not already occurred.
- * Caller must hold rcu_state.lock.
- */
-static void rcu_start_batch(struct rcu_ctrlblk *rcp, struct rcu_state *rsp,
- int next_pending)
+void _synchronize_kernel(void)
{
- if (next_pending)
- rcp->next_pending = 1;
+ int cpu;

- if (rcp->next_pending &&
- rcp->completed == rcp->cur) {
- /* Can't change, since spin lock held. */
- cpus_andnot(rsp->cpumask, cpu_online_map, nohz_cpu_mask);
-
- rcp->next_pending = 0;
- /* next_pending == 0 must be visible in __rcu_process_callbacks()
- * before it can see new value of cur.
- */
- smp_wmb();
- rcp->cur++;
+ for_each_cpu(cpu) { /* _online() or _present() races with hotplug */
+ write_lock(&per_cpu(rcu_data, cpu).lock);
}
-}
-
-/*
- * cpu went through a quiescent state since the beginning of the grace period.
- * Clear it from the cpu mask and complete the grace period if it was the last
- * cpu. Start another grace period if someone has further entries pending
- */
-static void cpu_quiet(int cpu, struct rcu_ctrlblk *rcp, struct rcu_state *rsp)
-{
- cpu_clear(cpu, rsp->cpumask);
- if (cpus_empty(rsp->cpumask)) {
- /* batch completed ! */
- rcp->completed = rcp->cur;
- rcu_start_batch(rcp, rsp, 0);
+ rcu_ctrlblk.batch++;
+ rcu_ctrlblk.last_sk = jiffies;
+ for_each_cpu(cpu) {
+ write_unlock(&per_cpu(rcu_data, cpu).lock);
}
}

-/*
- * Check if the cpu has gone through a quiescent state (say context
- * switch). If so and if it already hasn't done so in this RCU
- * quiescent cycle, then indicate that it has done so.
- */
-static void rcu_check_quiescent_state(struct rcu_ctrlblk *rcp,
- struct rcu_state *rsp, struct rcu_data *rdp)
+void synchronize_kernel(void)
{
- if (rdp->quiescbatch != rcp->cur) {
- /* start new grace period: */
- rdp->qs_pending = 1;
- rdp->passed_quiesc = 0;
- rdp->quiescbatch = rcp->cur;
- return;
- }
-
- /* Grace period already completed for this cpu?
- * qs_pending is checked instead of the actual bitmap to avoid
- * cacheline trashing.
- */
- if (!rdp->qs_pending)
- return;
-
- /*
- * Was there a quiescent state since the beginning of the grace
- * period? If no, then exit and wait for the next call.
- */
- if (!rdp->passed_quiesc)
- return;
- rdp->qs_pending = 0;
-
- spin_lock(&rsp->lock);
- /*
- * rdp->quiescbatch/rcp->cur and the cpu bitmap can come out of sync
- * during cpu startup. Ignore the quiescent state.
- */
- if (likely(rdp->quiescbatch == rcp->cur))
- cpu_quiet(rdp->cpu, rcp, rsp);
-
- spin_unlock(&rsp->lock);
-}
-
+ long oldbatch;

-#ifdef CONFIG_HOTPLUG_CPU
-
-/* warning! helper for rcu_offline_cpu. do not use elsewhere without reviewing
- * locking requirements, the list it's pulling from has to belong to a cpu
- * which is dead and hence not processing interrupts.
- */
-static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list,
- struct rcu_head **tail)
-{
- local_irq_disable();
- *this_rdp->nxttail = list;
- if (list)
- this_rdp->nxttail = tail;
- local_irq_enable();
+ smp_mb();
+ oldbatch = rcu_ctrlblk.batch;
+ schedule_timeout(HZ/GRACE_PERIODS_PER_SEC);
+ if (rcu_ctrlblk.batch == oldbatch) {
+ _synchronize_kernel();
+ }
}

-static void __rcu_offline_cpu(struct rcu_data *this_rdp,
- struct rcu_ctrlblk *rcp, struct rcu_state *rsp, struct rcu_data *rdp)
+void rcu_advance_callbacks(void)
{
- /* if the cpu going offline owns the grace period
- * we can block indefinitely waiting for it, so flush
- * it here
- */
- spin_lock_bh(&rsp->lock);
- if (rcp->cur != rcp->completed)
- cpu_quiet(rdp->cpu, rcp, rsp);
- spin_unlock_bh(&rsp->lock);
- rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail);
- rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail);
+ unsigned long flags;
+ struct rcu_data *rdp;

+ local_irq_save(flags); /* allow invocation from OOM handler. */
+ rdp = &__get_cpu_var(rcu_data);
+ smp_mb(); /* prevent sampling batch # before list removal. */
+ if (rdp->batch != rcu_ctrlblk.batch) {
+ *rdp->donetail = rdp->waitlist;
+ rdp->donetail = rdp->waittail;
+ rdp->waitlist = NULL;
+ rdp->waittail = &rdp->waitlist;
+ rdp->batch = rcu_ctrlblk.batch;
+ }
+ local_irq_restore(flags);
}
-static void rcu_offline_cpu(int cpu)
+
+void call_rcu(struct rcu_head *head,
+ void (*func)(struct rcu_head *rcu))
{
- struct rcu_data *this_rdp = &get_cpu_var(rcu_data);
- struct rcu_data *this_bh_rdp = &get_cpu_var(rcu_bh_data);
+ unsigned long flags;
+ struct rcu_data *rdp;

- __rcu_offline_cpu(this_rdp, &rcu_ctrlblk, &rcu_state,
- &per_cpu(rcu_data, cpu));
- __rcu_offline_cpu(this_bh_rdp, &rcu_bh_ctrlblk, &rcu_bh_state,
- &per_cpu(rcu_bh_data, cpu));
- put_cpu_var(rcu_data);
- put_cpu_var(rcu_bh_data);
- tasklet_kill_immediate(&per_cpu(rcu_tasklet, cpu), cpu);
+ head->func = func;
+ head->next = NULL;
+ local_irq_save(flags);
+ rcu_advance_callbacks();
+ rdp = &__get_cpu_var(rcu_data);
+ *rdp->waittail = head;
+ rdp->waittail = &head->next;
+ local_irq_restore(flags);
}

-#else
-
-static void rcu_offline_cpu(int cpu)
+void rcu_process_callbacks(void)
{
-}
-
-#endif
+ unsigned long flags;
+ struct rcu_head *next, *list;
+ struct rcu_data *rdp;

-/*
- * This does the RCU processing work from tasklet context.
- */
-static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
- struct rcu_state *rsp, struct rcu_data *rdp)
-{
- if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) {
- *rdp->donetail = rdp->curlist;
- rdp->donetail = rdp->curtail;
- rdp->curlist = NULL;
- rdp->curtail = &rdp->curlist;
+ local_irq_save(flags);
+ rdp = &__get_cpu_var(rcu_data);
+ list = rdp->donelist;
+ if (list == NULL) {
+ local_irq_restore(flags);
+ return;
}
-
- local_irq_disable();
- if (rdp->nxtlist && !rdp->curlist) {
- rdp->curlist = rdp->nxtlist;
- rdp->curtail = rdp->nxttail;
- rdp->nxtlist = NULL;
- rdp->nxttail = &rdp->nxtlist;
- local_irq_enable();
-
- /*
- * start the next batch of callbacks
- */
-
- /* determine batch number */
- rdp->batch = rcp->cur + 1;
- /* see the comment and corresponding wmb() in
- * the rcu_start_batch()
- */
- smp_rmb();
-
- if (!rcp->next_pending) {
- /* and start it/schedule start if it's a new batch */
- spin_lock(&rsp->lock);
- rcu_start_batch(rcp, rsp, 1);
- spin_unlock(&rsp->lock);
- }
- } else {
- local_irq_enable();
+ rdp->donelist = NULL;
+ rdp->donetail = &rdp->waitlist;
+ local_irq_restore(flags);
+ while (list) {
+ next = list->next;
+ list->func(list);
+ list = next;
}
- rcu_check_quiescent_state(rcp, rsp, rdp);
- if (rdp->donelist)
- rcu_do_batch(rdp);
-}
-
-static void rcu_process_callbacks(unsigned long unused)
-{
- __rcu_process_callbacks(&rcu_ctrlblk, &rcu_state,
- &__get_cpu_var(rcu_data));
- __rcu_process_callbacks(&rcu_bh_ctrlblk, &rcu_bh_state,
- &__get_cpu_var(rcu_bh_data));
}

void rcu_check_callbacks(int cpu, int user)
{
- if (user ||
- (idle_cpu(cpu) && !in_softirq() &&
- hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
- rcu_qsctr_inc(cpu);
- rcu_bh_qsctr_inc(cpu);
- } else if (!in_softirq())
- rcu_bh_qsctr_inc(cpu);
- tasklet_schedule(&per_cpu(rcu_tasklet, cpu));
-}
-
-static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
- struct rcu_data *rdp)
-{
- memset(rdp, 0, sizeof(*rdp));
- rdp->curtail = &rdp->curlist;
- rdp->nxttail = &rdp->nxtlist;
- rdp->donetail = &rdp->donelist;
- rdp->quiescbatch = rcp->completed;
- rdp->qs_pending = 0;
- rdp->cpu = cpu;
-}
-
-static void __devinit rcu_online_cpu(int cpu)
-{
- struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
- struct rcu_data *bh_rdp = &per_cpu(rcu_bh_data, cpu);
-
- rcu_init_percpu_data(cpu, &rcu_ctrlblk, rdp);
- rcu_init_percpu_data(cpu, &rcu_bh_ctrlblk, bh_rdp);
- tasklet_init(&per_cpu(rcu_tasklet, cpu), rcu_process_callbacks, 0UL);
-}
-
-static int __devinit rcu_cpu_notify(struct notifier_block *self,
- unsigned long action, void *hcpu)
-{
- long cpu = (long)hcpu;
- switch (action) {
- case CPU_UP_PREPARE:
- rcu_online_cpu(cpu);
- break;
- case CPU_DEAD:
- rcu_offline_cpu(cpu);
- break;
- default:
- break;
+ if ((unsigned long)(jiffies - rcu_ctrlblk.last_sk) >
+ HZ/GRACE_PERIODS_PER_SEC) {
+ synchronize_kernel();
+ rcu_advance_callbacks();
+ rcu_process_callbacks();
}
- return NOTIFY_OK;
-}
-
-static struct notifier_block __devinitdata rcu_nb = {
- .notifier_call = rcu_cpu_notify,
-};
-
-/*
- * Initializes rcu mechanism. Assumed to be called early.
- * That is before local timer(SMP) or jiffie timer (uniproc) is setup.
- * Note that rcu_qsctr and friends are implicitly
- * initialized due to the choice of ``0'' for RCU_CTR_INVALID.
- */
-void __init rcu_init(void)
-{
- rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE,
- (void *)(long)smp_processor_id());
- /* Register notifier for non-boot CPUs */
- register_cpu_notifier(&rcu_nb);
}

-struct rcu_synchronize {
- struct rcu_head head;
- struct completion completion;
-};
-
-/* Because of FASTCALL declaration of complete, we use this wrapper */
-static void wakeme_after_rcu(struct rcu_head *head)
-{
- struct rcu_synchronize *rcu;
-
- rcu = container_of(head, struct rcu_synchronize, head);
- complete(&rcu->completion);
-}
-
-/**
- * synchronize_kernel - wait until a grace period has elapsed.
- *
- * Control will return to the caller some time after a full grace
- * period has elapsed, in other words after all currently executing RCU
- * read-side critical sections have completed. RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
- * and may be nested.
- */
-void synchronize_kernel(void)
+int rcu_pending(int cpu)
{
- struct rcu_synchronize rcu;
-
- init_completion(&rcu.completion);
- /* Will wake me after RCU finished */
- call_rcu(&rcu.head, wakeme_after_rcu);
+ unsigned long flags;
+ struct rcu_data *rdp;
+ int retval;

- /* Wait for it */
- wait_for_completion(&rcu.completion);
+ local_irq_save(flags);
+ rdp = &__get_cpu_var(rcu_data);
+ retval = (rdp->waitlist || rdp->donelist);
+ local_irq_restore(flags);
+ return (retval);
}

-module_param(maxbatch, int, 0);
EXPORT_SYMBOL(call_rcu);
-EXPORT_SYMBOL(call_rcu_bh);
EXPORT_SYMBOL(synchronize_kernel);


Attachments:
rcupdate.h (3.59 kB)
rcupdate.c (4.88 kB)

2005-03-18 22:35:44

by Herbert Xu

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

Ingo Molnar <[email protected]> wrote:
>
> i really have no intention to allow multiple readers for rt-mutexes. We
> got away with that so far, and i'd like to keep it so. Imagine 100
> threads all blocked in the same critical section (holding the read-lock)
> when a highprio writer thread comes around: instant 100x latency to let
> all of them roll forward. The only sane solution is to not allow
> excessive concurrency. (That limits SMP scalability, but there's no
> other choice i can see.)

What about allowing only as many concurrent readers as there are CPUs?
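
A sketch of that idea - purely illustrative, with the RCU entry points
reused only to show where it would plug in, and with the interaction
with priority inheritance left as the open question it is - would be a
counting semaphore sized to the number of CPUs:

static struct semaphore rcu_reader_sem;

void rcu_reader_limit_init(void)
{
	sema_init(&rcu_reader_sem, num_online_cpus());
}

void rcu_read_lock(void)
{
	down(&rcu_reader_sem);	/* at most num_online_cpus() readers at once */
}

void rcu_read_unlock(void)
{
	up(&rcu_reader_sem);
}

A writer would then face at most num_online_cpus() blocked readers
instead of an unbounded number. Like the rwlock-based schemes above,
this only fits a kernel where rcu_read_lock() is already allowed to
sleep.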
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-03-18 22:50:43

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Fri, Mar 18, 2005 at 12:35:17PM -0800, Paul E. McKenney wrote:
> Compiles, probably dies horribly. "diff" didn't do such a good job
> on this one, so attaching the raw rcupdate.[hc] files as well.

My prediction was all too accurate. ;-)

The attached patch at least boots on a 1-CPU x86 box. I added some
interrupt disabling that is a bad idea in real-time preempt kernels,
but necessary for stock kernels to even have a ghost of a chance.

Again, the diff is quite confusing to read (for me, anyway!), so I have
attached the rcupdate.[hc] files as well.

Assuming this patch survives the LTP run (hah!!!), next step is a small
SMP system.

Thanx, Paul

Signed-off-by: <[email protected]>

diff -urpN -X ../dontdiff linux-2.5/include/linux/rcupdate.h linux-2.5-rtRCU/include/linux/rcupdate.h
--- linux-2.5/include/linux/rcupdate.h Wed Mar 9 12:37:06 2005
+++ linux-2.5-rtRCU/include/linux/rcupdate.h Fri Mar 18 11:37:02 2005
@@ -58,169 +58,11 @@ struct rcu_head {
(ptr)->next = NULL; (ptr)->func = NULL; \
} while (0)

-
-
-/* Global control variables for rcupdate callback mechanism. */
-struct rcu_ctrlblk {
- long cur; /* Current batch number. */
- long completed; /* Number of the last completed batch */
- int next_pending; /* Is the next batch already waiting? */
-} ____cacheline_maxaligned_in_smp;
-
-/* Is batch a before batch b ? */
-static inline int rcu_batch_before(long a, long b)
-{
- return (a - b) < 0;
-}
-
-/* Is batch a after batch b ? */
-static inline int rcu_batch_after(long a, long b)
-{
- return (a - b) > 0;
-}
-
-/*
- * Per-CPU data for Read-Copy UPdate.
- * nxtlist - new callbacks are added here
- * curlist - current batch for which quiescent cycle started if any
- */
-struct rcu_data {
- /* 1) quiescent state handling : */
- long quiescbatch; /* Batch # for grace period */
- int passed_quiesc; /* User-mode/idle loop etc. */
- int qs_pending; /* core waits for quiesc state */
-
- /* 2) batch handling */
- long batch; /* Batch # for current RCU batch */
- struct rcu_head *nxtlist;
- struct rcu_head **nxttail;
- struct rcu_head *curlist;
- struct rcu_head **curtail;
- struct rcu_head *donelist;
- struct rcu_head **donetail;
- int cpu;
-};
-
-DECLARE_PER_CPU(struct rcu_data, rcu_data);
-DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
-extern struct rcu_ctrlblk rcu_ctrlblk;
-extern struct rcu_ctrlblk rcu_bh_ctrlblk;
-
-/*
- * Increment the quiescent state counter.
- * The counter is a bit degenerated: We do not need to know
- * how many quiescent states passed, just if there was at least
- * one since the start of the grace period. Thus just a flag.
- */
-static inline void rcu_qsctr_inc(int cpu)
-{
- struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
- rdp->passed_quiesc = 1;
-}
-static inline void rcu_bh_qsctr_inc(int cpu)
-{
- struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
- rdp->passed_quiesc = 1;
-}
-
-static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
- struct rcu_data *rdp)
-{
- /* This cpu has pending rcu entries and the grace period
- * for them has completed.
- */
- if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch))
- return 1;
-
- /* This cpu has no pending entries, but there are new entries */
- if (!rdp->curlist && rdp->nxtlist)
- return 1;
-
- /* This cpu has finished callbacks to invoke */
- if (rdp->donelist)
- return 1;
-
- /* The rcu core waits for a quiescent state from the cpu */
- if (rdp->quiescbatch != rcp->cur || rdp->qs_pending)
- return 1;
-
- /* nothing to do */
- return 0;
-}
-
-static inline int rcu_pending(int cpu)
-{
- return __rcu_pending(&rcu_ctrlblk, &per_cpu(rcu_data, cpu)) ||
- __rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu));
-}
-
-/**
- * rcu_read_lock - mark the beginning of an RCU read-side critical section.
- *
- * When synchronize_kernel() is invoked on one CPU while other CPUs
- * are within RCU read-side critical sections, then the
- * synchronize_kernel() is guaranteed to block until after all the other
- * CPUs exit their critical sections. Similarly, if call_rcu() is invoked
- * on one CPU while other CPUs are within RCU read-side critical
- * sections, invocation of the corresponding RCU callback is deferred
- * until after the all the other CPUs exit their critical sections.
- *
- * Note, however, that RCU callbacks are permitted to run concurrently
- * with RCU read-side critical sections. One way that this can happen
- * is via the following sequence of events: (1) CPU 0 enters an RCU
- * read-side critical section, (2) CPU 1 invokes call_rcu() to register
- * an RCU callback, (3) CPU 0 exits the RCU read-side critical section,
- * (4) CPU 2 enters a RCU read-side critical section, (5) the RCU
- * callback is invoked. This is legal, because the RCU read-side critical
- * section that was running concurrently with the call_rcu() (and which
- * therefore might be referencing something that the corresponding RCU
- * callback would free up) has completed before the corresponding
- * RCU callback is invoked.
- *
- * RCU read-side critical sections may be nested. Any deferred actions
- * will be deferred until the outermost RCU read-side critical section
- * completes.
- *
- * It is illegal to block while in an RCU read-side critical section.
- */
-#define rcu_read_lock() preempt_disable()
-
-/**
- * rcu_read_unlock - marks the end of an RCU read-side critical section.
- *
- * See rcu_read_lock() for more information.
- */
-#define rcu_read_unlock() preempt_enable()
-
-/*
- * So where is rcu_write_lock()? It does not exist, as there is no
- * way for writers to lock out RCU readers. This is a feature, not
- * a bug -- this property is what provides RCU's performance benefits.
- * Of course, writers must coordinate with each other. The normal
- * spinlock primitives work well for this, but any other technique may be
- * used as well. RCU does not care how the writers keep out of each
- * others' way, as long as they do so.
- */
-
-/**
- * rcu_read_lock_bh - mark the beginning of a softirq-only RCU critical section
- *
- * This is equivalent of rcu_read_lock(), but to be used when updates
- * are being done using call_rcu_bh(). Since call_rcu_bh() callbacks
- * consider completion of a softirq handler to be a quiescent state,
- * a process in RCU read-side critical section must be protected by
- * disabling softirqs. Read-side critical sections in interrupt context
- * can use just rcu_read_lock().
- *
- */
-#define rcu_read_lock_bh() local_bh_disable()
-
-/*
- * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
- *
- * See rcu_read_lock_bh() for more information.
- */
-#define rcu_read_unlock_bh() local_bh_enable()
+#define rcu_read_lock_bh() rcu_read_lock()
+#define rcu_read_unlock_bh() rcu_read_unlock()
+#define call_rcu_bh(head, func) call_rcu(head, func)
+#define rcu_bh_qsctr_inc(cpu)
+#define rcu_qsctr_inc(cpu)

/**
* rcu_dereference - fetch an RCU-protected pointer in an
@@ -257,15 +99,15 @@ static inline int rcu_pending(int cpu)
})

extern void rcu_init(void);
-extern void rcu_check_callbacks(int cpu, int user);
-extern void rcu_restart_cpu(int cpu);

/* Exported interfaces */
extern void FASTCALL(call_rcu(struct rcu_head *head,
void (*func)(struct rcu_head *head)));
-extern void FASTCALL(call_rcu_bh(struct rcu_head *head,
- void (*func)(struct rcu_head *head)));
+extern void rcu_read_lock(void);
+extern void rcu_read_unlock(void);
extern void synchronize_kernel(void);
+extern int rcu_pending(int cpu);
+extern void rcu_check_callbacks(int cpu, int user);

#endif /* __KERNEL__ */
#endif /* __LINUX_RCUPDATE_H */
diff -urpN -X ../dontdiff linux-2.5/include/linux/sched.h linux-2.5-rtRCU/include/linux/sched.h
--- linux-2.5/include/linux/sched.h Wed Mar 9 12:37:07 2005
+++ linux-2.5-rtRCU/include/linux/sched.h Fri Mar 18 11:33:13 2005
@@ -707,6 +707,9 @@ struct task_struct {
struct mempolicy *mempolicy;
short il_next;
#endif
+
+ int rcu_read_lock_nesting;
+ rwlock_t *rcu_read_lock_ptr;
};

static inline pid_t process_group(struct task_struct *tsk)
diff -urpN -X ../dontdiff linux-2.5/kernel/rcupdate.c linux-2.5-rtRCU/kernel/rcupdate.c
--- linux-2.5/kernel/rcupdate.c Wed Mar 9 12:37:22 2005
+++ linux-2.5-rtRCU/kernel/rcupdate.c Fri Mar 18 13:21:55 2005
@@ -47,424 +47,173 @@
#include <linux/rcupdate.h>
#include <linux/cpu.h>

-/* Definition for rcupdate control block. */
-struct rcu_ctrlblk rcu_ctrlblk =
- { .cur = -300, .completed = -300 };
-struct rcu_ctrlblk rcu_bh_ctrlblk =
- { .cur = -300, .completed = -300 };
-
-/* Bookkeeping of the progress of the grace period */
-struct rcu_state {
- spinlock_t lock; /* Guard this struct and writes to rcu_ctrlblk */
- cpumask_t cpumask; /* CPUs that need to switch in order */
- /* for current batch to proceed. */
+#define GRACE_PERIODS_PER_SEC 10
+
+struct rcu_data {
+ rwlock_t lock;
+ long batch;
+ struct rcu_head *waitlist;
+ struct rcu_head **waittail;
+ struct rcu_head *donelist;
+ struct rcu_head **donetail;
+};
+struct rcu_ctrlblk {
+ long batch;
+ unsigned long last_sk;
+};
+DEFINE_PER_CPU(struct rcu_data, rcu_data) = {
+ .lock = RW_LOCK_UNLOCKED,
+ .batch = 0,
+ .waitlist = NULL,
+ .donelist = NULL
+};
+struct rcu_ctrlblk rcu_ctrlblk = {
+ .batch = 0,
};

-static struct rcu_state rcu_state ____cacheline_maxaligned_in_smp =
- {.lock = SPIN_LOCK_UNLOCKED, .cpumask = CPU_MASK_NONE };
-static struct rcu_state rcu_bh_state ____cacheline_maxaligned_in_smp =
- {.lock = SPIN_LOCK_UNLOCKED, .cpumask = CPU_MASK_NONE };
-
-DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
-DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };
-
-/* Fake initialization required by compiler */
-static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
-static int maxbatch = 10;
-
-/**
- * call_rcu - Queue an RCU callback for invocation after a grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual update function to be invoked after the grace period
- *
- * The update function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed. RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
- * and may be nested.
- */
-void fastcall call_rcu(struct rcu_head *head,
- void (*func)(struct rcu_head *rcu))
+void rcu_init(void)
{
- unsigned long flags;
+ int cpu;
struct rcu_data *rdp;

- head->func = func;
- head->next = NULL;
- local_irq_save(flags);
- rdp = &__get_cpu_var(rcu_data);
- *rdp->nxttail = head;
- rdp->nxttail = &head->next;
- local_irq_restore(flags);
+ for_each_cpu(cpu) {
+ rdp = &per_cpu(rcu_data, cpu);
+ rdp->waittail = &rdp->waitlist;
+ rdp->donetail = &rdp->donelist;
+ }
}

-/**
- * call_rcu_bh - Queue an RCU for invocation after a quicker grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual update function to be invoked after the grace period
- *
- * The update function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed. call_rcu_bh() assumes
- * that the read-side critical sections end on completion of a softirq
- * handler. This means that read-side critical sections in process
- * context must not be interrupted by softirqs. This interface is to be
- * used when most of the read-side critical sections are in softirq context.
- * RCU read-side critical sections are delimited by rcu_read_lock() and
- * rcu_read_unlock(), * if in interrupt context or rcu_read_lock_bh()
- * and rcu_read_unlock_bh(), if in process context. These may be nested.
- */
-void fastcall call_rcu_bh(struct rcu_head *head,
- void (*func)(struct rcu_head *rcu))
+void rcu_read_lock(void)
{
unsigned long flags;
- struct rcu_data *rdp;

- head->func = func;
- head->next = NULL;
- local_irq_save(flags);
- rdp = &__get_cpu_var(rcu_bh_data);
- *rdp->nxttail = head;
- rdp->nxttail = &head->next;
+ local_irq_save(flags); /* allow invocation from OOM handler. */
+ if (current->rcu_read_lock_nesting++ == 0) {
+ current->rcu_read_lock_ptr = &__get_cpu_var(rcu_data).lock;
+ read_lock(current->rcu_read_lock_ptr);
+ }
local_irq_restore(flags);
}

-/*
- * Invoke the completed RCU callbacks. They are expected to be in
- * a per-cpu list.
- */
-static void rcu_do_batch(struct rcu_data *rdp)
+void rcu_read_unlock(void)
{
- struct rcu_head *next, *list;
- int count = 0;
+ unsigned long flags;

- list = rdp->donelist;
- while (list) {
- next = rdp->donelist = list->next;
- list->func(list);
- list = next;
- if (++count >= maxbatch)
- break;
+ local_irq_save(flags); /* allow invocation from OOM handler. */
+ if (--current->rcu_read_lock_nesting == 0) {
+ read_unlock(current->rcu_read_lock_ptr);
}
- if (!rdp->donelist)
- rdp->donetail = &rdp->donelist;
- else
- tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu));
+ local_irq_restore(flags);
}

-/*
- * Grace period handling:
- * The grace period handling consists out of two steps:
- * - A new grace period is started.
- * This is done by rcu_start_batch. The start is not broadcasted to
- * all cpus, they must pick this up by comparing rcp->cur with
- * rdp->quiescbatch. All cpus are recorded in the
- * rcu_state.cpumask bitmap.
- * - All cpus must go through a quiescent state.
- * Since the start of the grace period is not broadcasted, at least two
- * calls to rcu_check_quiescent_state are required:
- * The first call just notices that a new grace period is running. The
- * following calls check if there was a quiescent state since the beginning
- * of the grace period. If so, it updates rcu_state.cpumask. If
- * the bitmap is empty, then the grace period is completed.
- * rcu_check_quiescent_state calls rcu_start_batch(0) to start the next grace
- * period (if necessary).
- */
-/*
- * Register a new batch of callbacks, and start it up if there is currently no
- * active batch and the batch to be registered has not already occurred.
- * Caller must hold rcu_state.lock.
- */
-static void rcu_start_batch(struct rcu_ctrlblk *rcp, struct rcu_state *rsp,
- int next_pending)
+void _synchronize_kernel(void)
{
- if (next_pending)
- rcp->next_pending = 1;
+ int cpu;
+ unsigned long flags;

- if (rcp->next_pending &&
- rcp->completed == rcp->cur) {
- /* Can't change, since spin lock held. */
- cpus_andnot(rsp->cpumask, cpu_online_map, nohz_cpu_mask);
-
- rcp->next_pending = 0;
- /* next_pending == 0 must be visible in __rcu_process_callbacks()
- * before it can see new value of cur.
- */
- smp_wmb();
- rcp->cur++;
+ local_irq_save(flags); /* allow invocation from OOM handler. */
+ for_each_cpu(cpu) { /* _online() or _present() races with hotplug */
+ write_lock(&per_cpu(rcu_data, cpu).lock);
+ }
+ rcu_ctrlblk.batch++;
+ rcu_ctrlblk.last_sk = jiffies;
+ for_each_cpu(cpu) {
+ write_unlock(&per_cpu(rcu_data, cpu).lock);
}
+ local_irq_restore(flags);
}

-/*
- * cpu went through a quiescent state since the beginning of the grace period.
- * Clear it from the cpu mask and complete the grace period if it was the last
- * cpu. Start another grace period if someone has further entries pending
- */
-static void cpu_quiet(int cpu, struct rcu_ctrlblk *rcp, struct rcu_state *rsp)
+void synchronize_kernel(void)
{
- cpu_clear(cpu, rsp->cpumask);
- if (cpus_empty(rsp->cpumask)) {
- /* batch completed ! */
- rcp->completed = rcp->cur;
- rcu_start_batch(rcp, rsp, 0);
- }
-}
+ long oldbatch;

-/*
- * Check if the cpu has gone through a quiescent state (say context
- * switch). If so and if it already hasn't done so in this RCU
- * quiescent cycle, then indicate that it has done so.
- */
-static void rcu_check_quiescent_state(struct rcu_ctrlblk *rcp,
- struct rcu_state *rsp, struct rcu_data *rdp)
-{
- if (rdp->quiescbatch != rcp->cur) {
- /* start new grace period: */
- rdp->qs_pending = 1;
- rdp->passed_quiesc = 0;
- rdp->quiescbatch = rcp->cur;
- return;
+ smp_mb();
+ oldbatch = rcu_ctrlblk.batch;
+ schedule_timeout(HZ/GRACE_PERIODS_PER_SEC);
+ if (rcu_ctrlblk.batch == oldbatch) {
+ _synchronize_kernel();
}
-
- /* Grace period already completed for this cpu?
- * qs_pending is checked instead of the actual bitmap to avoid
- * cacheline trashing.
- */
- if (!rdp->qs_pending)
- return;
-
- /*
- * Was there a quiescent state since the beginning of the grace
- * period? If no, then exit and wait for the next call.
- */
- if (!rdp->passed_quiesc)
- return;
- rdp->qs_pending = 0;
-
- spin_lock(&rsp->lock);
- /*
- * rdp->quiescbatch/rcp->cur and the cpu bitmap can come out of sync
- * during cpu startup. Ignore the quiescent state.
- */
- if (likely(rdp->quiescbatch == rcp->cur))
- cpu_quiet(rdp->cpu, rcp, rsp);
-
- spin_unlock(&rsp->lock);
-}
-
-
-#ifdef CONFIG_HOTPLUG_CPU
-
-/* warning! helper for rcu_offline_cpu. do not use elsewhere without reviewing
- * locking requirements, the list it's pulling from has to belong to a cpu
- * which is dead and hence not processing interrupts.
- */
-static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list,
- struct rcu_head **tail)
-{
- local_irq_disable();
- *this_rdp->nxttail = list;
- if (list)
- this_rdp->nxttail = tail;
- local_irq_enable();
}

-static void __rcu_offline_cpu(struct rcu_data *this_rdp,
- struct rcu_ctrlblk *rcp, struct rcu_state *rsp, struct rcu_data *rdp)
+void rcu_advance_callbacks(void)
{
- /* if the cpu going offline owns the grace period
- * we can block indefinitely waiting for it, so flush
- * it here
- */
- spin_lock_bh(&rsp->lock);
- if (rcp->cur != rcp->completed)
- cpu_quiet(rdp->cpu, rcp, rsp);
- spin_unlock_bh(&rsp->lock);
- rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail);
- rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail);
+ unsigned long flags;
+ struct rcu_data *rdp;

+ local_irq_save(flags); /* allow invocation from OOM handler. */
+ rdp = &__get_cpu_var(rcu_data);
+ smp_mb(); /* prevent sampling batch # before list removal. */
+ if (rdp->batch != rcu_ctrlblk.batch) {
+ *rdp->donetail = rdp->waitlist;
+ rdp->donetail = rdp->waittail;
+ rdp->waitlist = NULL;
+ rdp->waittail = &rdp->waitlist;
+ rdp->batch = rcu_ctrlblk.batch;
+ }
+ local_irq_restore(flags);
}
-static void rcu_offline_cpu(int cpu)
+
+void call_rcu(struct rcu_head *head,
+ void (*func)(struct rcu_head *rcu))
{
- struct rcu_data *this_rdp = &get_cpu_var(rcu_data);
- struct rcu_data *this_bh_rdp = &get_cpu_var(rcu_bh_data);
+ unsigned long flags;
+ struct rcu_data *rdp;

- __rcu_offline_cpu(this_rdp, &rcu_ctrlblk, &rcu_state,
- &per_cpu(rcu_data, cpu));
- __rcu_offline_cpu(this_bh_rdp, &rcu_bh_ctrlblk, &rcu_bh_state,
- &per_cpu(rcu_bh_data, cpu));
- put_cpu_var(rcu_data);
- put_cpu_var(rcu_bh_data);
- tasklet_kill_immediate(&per_cpu(rcu_tasklet, cpu), cpu);
+ head->func = func;
+ head->next = NULL;
+ local_irq_save(flags);
+ rcu_advance_callbacks();
+ rdp = &__get_cpu_var(rcu_data);
+ *rdp->waittail = head;
+ rdp->waittail = &head->next;
+ local_irq_restore(flags);
}

-#else
-
-static void rcu_offline_cpu(int cpu)
+void rcu_process_callbacks(void)
{
-}
-
-#endif
+ unsigned long flags;
+ struct rcu_head *next, *list;
+ struct rcu_data *rdp;

-/*
- * This does the RCU processing work from tasklet context.
- */
-static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
- struct rcu_state *rsp, struct rcu_data *rdp)
-{
- if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) {
- *rdp->donetail = rdp->curlist;
- rdp->donetail = rdp->curtail;
- rdp->curlist = NULL;
- rdp->curtail = &rdp->curlist;
+ local_irq_save(flags);
+ rdp = &__get_cpu_var(rcu_data);
+ list = rdp->donelist;
+ if (list == NULL) {
+ local_irq_restore(flags);
+ return;
}
-
- local_irq_disable();
- if (rdp->nxtlist && !rdp->curlist) {
- rdp->curlist = rdp->nxtlist;
- rdp->curtail = rdp->nxttail;
- rdp->nxtlist = NULL;
- rdp->nxttail = &rdp->nxtlist;
- local_irq_enable();
-
- /*
- * start the next batch of callbacks
- */
-
- /* determine batch number */
- rdp->batch = rcp->cur + 1;
- /* see the comment and corresponding wmb() in
- * the rcu_start_batch()
- */
- smp_rmb();
-
- if (!rcp->next_pending) {
- /* and start it/schedule start if it's a new batch */
- spin_lock(&rsp->lock);
- rcu_start_batch(rcp, rsp, 1);
- spin_unlock(&rsp->lock);
- }
- } else {
- local_irq_enable();
+ rdp->donelist = NULL;
+ rdp->donetail = &rdp->waitlist;
+ local_irq_restore(flags);
+ while (list) {
+ next = list->next;
+ list->func(list);
+ list = next;
}
- rcu_check_quiescent_state(rcp, rsp, rdp);
- if (rdp->donelist)
- rcu_do_batch(rdp);
-}
-
-static void rcu_process_callbacks(unsigned long unused)
-{
- __rcu_process_callbacks(&rcu_ctrlblk, &rcu_state,
- &__get_cpu_var(rcu_data));
- __rcu_process_callbacks(&rcu_bh_ctrlblk, &rcu_bh_state,
- &__get_cpu_var(rcu_bh_data));
}

void rcu_check_callbacks(int cpu, int user)
{
- if (user ||
- (idle_cpu(cpu) && !in_softirq() &&
- hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
- rcu_qsctr_inc(cpu);
- rcu_bh_qsctr_inc(cpu);
- } else if (!in_softirq())
- rcu_bh_qsctr_inc(cpu);
- tasklet_schedule(&per_cpu(rcu_tasklet, cpu));
-}
-
-static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
- struct rcu_data *rdp)
-{
- memset(rdp, 0, sizeof(*rdp));
- rdp->curtail = &rdp->curlist;
- rdp->nxttail = &rdp->nxtlist;
- rdp->donetail = &rdp->donelist;
- rdp->quiescbatch = rcp->completed;
- rdp->qs_pending = 0;
- rdp->cpu = cpu;
-}
-
-static void __devinit rcu_online_cpu(int cpu)
-{
- struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
- struct rcu_data *bh_rdp = &per_cpu(rcu_bh_data, cpu);
-
- rcu_init_percpu_data(cpu, &rcu_ctrlblk, rdp);
- rcu_init_percpu_data(cpu, &rcu_bh_ctrlblk, bh_rdp);
- tasklet_init(&per_cpu(rcu_tasklet, cpu), rcu_process_callbacks, 0UL);
-}
-
-static int __devinit rcu_cpu_notify(struct notifier_block *self,
- unsigned long action, void *hcpu)
-{
- long cpu = (long)hcpu;
- switch (action) {
- case CPU_UP_PREPARE:
- rcu_online_cpu(cpu);
- break;
- case CPU_DEAD:
- rcu_offline_cpu(cpu);
- break;
- default:
- break;
+ if ((unsigned long)(jiffies - rcu_ctrlblk.last_sk) >
+ HZ/GRACE_PERIODS_PER_SEC) {
+ synchronize_kernel();
+ rcu_advance_callbacks();
+ rcu_process_callbacks();
}
- return NOTIFY_OK;
}

-static struct notifier_block __devinitdata rcu_nb = {
- .notifier_call = rcu_cpu_notify,
-};
-
-/*
- * Initializes rcu mechanism. Assumed to be called early.
- * That is before local timer(SMP) or jiffie timer (uniproc) is setup.
- * Note that rcu_qsctr and friends are implicitly
- * initialized due to the choice of ``0'' for RCU_CTR_INVALID.
- */
-void __init rcu_init(void)
+int rcu_pending(int cpu)
{
- rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE,
- (void *)(long)smp_processor_id());
- /* Register notifier for non-boot CPUs */
- register_cpu_notifier(&rcu_nb);
-}
-
-struct rcu_synchronize {
- struct rcu_head head;
- struct completion completion;
-};
-
-/* Because of FASTCALL declaration of complete, we use this wrapper */
-static void wakeme_after_rcu(struct rcu_head *head)
-{
- struct rcu_synchronize *rcu;
-
- rcu = container_of(head, struct rcu_synchronize, head);
- complete(&rcu->completion);
-}
-
-/**
- * synchronize_kernel - wait until a grace period has elapsed.
- *
- * Control will return to the caller some time after a full grace
- * period has elapsed, in other words after all currently executing RCU
- * read-side critical sections have completed. RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
- * and may be nested.
- */
-void synchronize_kernel(void)
-{
- struct rcu_synchronize rcu;
-
- init_completion(&rcu.completion);
- /* Will wake me after RCU finished */
- call_rcu(&rcu.head, wakeme_after_rcu);
+ unsigned long flags;
+ struct rcu_data *rdp;
+ int retval;

- /* Wait for it */
- wait_for_completion(&rcu.completion);
+ local_irq_save(flags);
+ rdp = &__get_cpu_var(rcu_data);
+ retval = (rdp->waitlist || rdp->donelist);
+ local_irq_restore(flags);
+ return (retval);
}

-module_param(maxbatch, int, 0);
EXPORT_SYMBOL(call_rcu);
-EXPORT_SYMBOL(call_rcu_bh);
EXPORT_SYMBOL(synchronize_kernel);


Attachments:
rcupdate.h (3.59 kB)
rcupdate.c (5.12 kB)

2005-03-19 00:49:24

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Fri, Mar 18, 2005 at 02:22:30PM -0800, Paul E. McKenney wrote:
> On Fri, Mar 18, 2005 at 12:35:17PM -0800, Paul E. McKenney wrote:
> > Compiles, probably dies horribly. "diff" didn't do such a good job
> > on this one, so attaching the raw rcupdate.[hc] files as well.
>
> My prediction was all too accurate. ;-)
>
> The attached patch at least boots on a 1-CPU x86 box. I added some
> interrupt disabling that is a bad idea in real-time preempt kernels,
> but necessary for stock kernels to even have a ghost of a chance.
>
> Again, the diff is quite confusing to read (for me, anyway!), so attached
> the rcupdate.[hc] files.
>
> Assuming this patch survives the LTP run (hah!!!), next step is a small
> SMP system.

It did survive the UP LTP run; here is an updated patch with a fix needed
for SMP. I am having trouble with my test system (independent of this
patch), but hopefully will get more testing in over the weekend.

Thanx, Paul

Signed-off-by: <[email protected]>

diff -urpN -X ../dontdiff linux-2.5/include/linux/rcupdate.h linux-2.5-rtRCU/include/linux/rcupdate.h
--- linux-2.5/include/linux/rcupdate.h Wed Mar 9 12:37:06 2005
+++ linux-2.5-rtRCU/include/linux/rcupdate.h Fri Mar 18 11:37:02 2005
@@ -58,169 +58,11 @@ struct rcu_head {
(ptr)->next = NULL; (ptr)->func = NULL; \
} while (0)

-
-
-/* Global control variables for rcupdate callback mechanism. */
-struct rcu_ctrlblk {
- long cur; /* Current batch number. */
- long completed; /* Number of the last completed batch */
- int next_pending; /* Is the next batch already waiting? */
-} ____cacheline_maxaligned_in_smp;
-
-/* Is batch a before batch b ? */
-static inline int rcu_batch_before(long a, long b)
-{
- return (a - b) < 0;
-}
-
-/* Is batch a after batch b ? */
-static inline int rcu_batch_after(long a, long b)
-{
- return (a - b) > 0;
-}
-
-/*
- * Per-CPU data for Read-Copy UPdate.
- * nxtlist - new callbacks are added here
- * curlist - current batch for which quiescent cycle started if any
- */
-struct rcu_data {
- /* 1) quiescent state handling : */
- long quiescbatch; /* Batch # for grace period */
- int passed_quiesc; /* User-mode/idle loop etc. */
- int qs_pending; /* core waits for quiesc state */
-
- /* 2) batch handling */
- long batch; /* Batch # for current RCU batch */
- struct rcu_head *nxtlist;
- struct rcu_head **nxttail;
- struct rcu_head *curlist;
- struct rcu_head **curtail;
- struct rcu_head *donelist;
- struct rcu_head **donetail;
- int cpu;
-};
-
-DECLARE_PER_CPU(struct rcu_data, rcu_data);
-DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
-extern struct rcu_ctrlblk rcu_ctrlblk;
-extern struct rcu_ctrlblk rcu_bh_ctrlblk;
-
-/*
- * Increment the quiescent state counter.
- * The counter is a bit degenerated: We do not need to know
- * how many quiescent states passed, just if there was at least
- * one since the start of the grace period. Thus just a flag.
- */
-static inline void rcu_qsctr_inc(int cpu)
-{
- struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
- rdp->passed_quiesc = 1;
-}
-static inline void rcu_bh_qsctr_inc(int cpu)
-{
- struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
- rdp->passed_quiesc = 1;
-}
-
-static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
- struct rcu_data *rdp)
-{
- /* This cpu has pending rcu entries and the grace period
- * for them has completed.
- */
- if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch))
- return 1;
-
- /* This cpu has no pending entries, but there are new entries */
- if (!rdp->curlist && rdp->nxtlist)
- return 1;
-
- /* This cpu has finished callbacks to invoke */
- if (rdp->donelist)
- return 1;
-
- /* The rcu core waits for a quiescent state from the cpu */
- if (rdp->quiescbatch != rcp->cur || rdp->qs_pending)
- return 1;
-
- /* nothing to do */
- return 0;
-}
-
-static inline int rcu_pending(int cpu)
-{
- return __rcu_pending(&rcu_ctrlblk, &per_cpu(rcu_data, cpu)) ||
- __rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu));
-}
-
-/**
- * rcu_read_lock - mark the beginning of an RCU read-side critical section.
- *
- * When synchronize_kernel() is invoked on one CPU while other CPUs
- * are within RCU read-side critical sections, then the
- * synchronize_kernel() is guaranteed to block until after all the other
- * CPUs exit their critical sections. Similarly, if call_rcu() is invoked
- * on one CPU while other CPUs are within RCU read-side critical
- * sections, invocation of the corresponding RCU callback is deferred
- * until after the all the other CPUs exit their critical sections.
- *
- * Note, however, that RCU callbacks are permitted to run concurrently
- * with RCU read-side critical sections. One way that this can happen
- * is via the following sequence of events: (1) CPU 0 enters an RCU
- * read-side critical section, (2) CPU 1 invokes call_rcu() to register
- * an RCU callback, (3) CPU 0 exits the RCU read-side critical section,
- * (4) CPU 2 enters a RCU read-side critical section, (5) the RCU
- * callback is invoked. This is legal, because the RCU read-side critical
- * section that was running concurrently with the call_rcu() (and which
- * therefore might be referencing something that the corresponding RCU
- * callback would free up) has completed before the corresponding
- * RCU callback is invoked.
- *
- * RCU read-side critical sections may be nested. Any deferred actions
- * will be deferred until the outermost RCU read-side critical section
- * completes.
- *
- * It is illegal to block while in an RCU read-side critical section.
- */
-#define rcu_read_lock() preempt_disable()
-
-/**
- * rcu_read_unlock - marks the end of an RCU read-side critical section.
- *
- * See rcu_read_lock() for more information.
- */
-#define rcu_read_unlock() preempt_enable()
-
-/*
- * So where is rcu_write_lock()? It does not exist, as there is no
- * way for writers to lock out RCU readers. This is a feature, not
- * a bug -- this property is what provides RCU's performance benefits.
- * Of course, writers must coordinate with each other. The normal
- * spinlock primitives work well for this, but any other technique may be
- * used as well. RCU does not care how the writers keep out of each
- * others' way, as long as they do so.
- */
-
-/**
- * rcu_read_lock_bh - mark the beginning of a softirq-only RCU critical section
- *
- * This is equivalent of rcu_read_lock(), but to be used when updates
- * are being done using call_rcu_bh(). Since call_rcu_bh() callbacks
- * consider completion of a softirq handler to be a quiescent state,
- * a process in RCU read-side critical section must be protected by
- * disabling softirqs. Read-side critical sections in interrupt context
- * can use just rcu_read_lock().
- *
- */
-#define rcu_read_lock_bh() local_bh_disable()
-
-/*
- * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
- *
- * See rcu_read_lock_bh() for more information.
- */
-#define rcu_read_unlock_bh() local_bh_enable()
+#define rcu_read_lock_bh() rcu_read_lock()
+#define rcu_read_unlock_bh() rcu_read_unlock()
+#define call_rcu_bh(head, func) call_rcu(head, func)
+#define rcu_bh_qsctr_inc(cpu)
+#define rcu_qsctr_inc(cpu)

/**
* rcu_dereference - fetch an RCU-protected pointer in an
@@ -257,15 +99,15 @@ static inline int rcu_pending(int cpu)
})

extern void rcu_init(void);
-extern void rcu_check_callbacks(int cpu, int user);
-extern void rcu_restart_cpu(int cpu);

/* Exported interfaces */
extern void FASTCALL(call_rcu(struct rcu_head *head,
void (*func)(struct rcu_head *head)));
-extern void FASTCALL(call_rcu_bh(struct rcu_head *head,
- void (*func)(struct rcu_head *head)));
+extern void rcu_read_lock(void);
+extern void rcu_read_unlock(void);
extern void synchronize_kernel(void);
+extern int rcu_pending(int cpu);
+extern void rcu_check_callbacks(int cpu, int user);

#endif /* __KERNEL__ */
#endif /* __LINUX_RCUPDATE_H */
diff -urpN -X ../dontdiff linux-2.5/include/linux/sched.h linux-2.5-rtRCU/include/linux/sched.h
--- linux-2.5/include/linux/sched.h Wed Mar 9 12:37:07 2005
+++ linux-2.5-rtRCU/include/linux/sched.h Fri Mar 18 11:33:13 2005
@@ -707,6 +707,9 @@ struct task_struct {
struct mempolicy *mempolicy;
short il_next;
#endif
+
+ int rcu_read_lock_nesting;
+ rwlock_t *rcu_read_lock_ptr;
};

static inline pid_t process_group(struct task_struct *tsk)
diff -urpN -X ../dontdiff linux-2.5/kernel/rcupdate.c linux-2.5-rtRCU/kernel/rcupdate.c
--- linux-2.5/kernel/rcupdate.c Wed Mar 9 12:37:22 2005
+++ linux-2.5-rtRCU/kernel/rcupdate.c Fri Mar 18 15:28:28 2005
@@ -47,424 +47,173 @@
#include <linux/rcupdate.h>
#include <linux/cpu.h>

-/* Definition for rcupdate control block. */
-struct rcu_ctrlblk rcu_ctrlblk =
- { .cur = -300, .completed = -300 };
-struct rcu_ctrlblk rcu_bh_ctrlblk =
- { .cur = -300, .completed = -300 };
-
-/* Bookkeeping of the progress of the grace period */
-struct rcu_state {
- spinlock_t lock; /* Guard this struct and writes to rcu_ctrlblk */
- cpumask_t cpumask; /* CPUs that need to switch in order */
- /* for current batch to proceed. */
+#define GRACE_PERIODS_PER_SEC 10
+
+struct rcu_data {
+ rwlock_t lock;
+ long batch;
+ struct rcu_head *waitlist;
+ struct rcu_head **waittail;
+ struct rcu_head *donelist;
+ struct rcu_head **donetail;
+};
+struct rcu_ctrlblk {
+ long batch;
+ unsigned long last_sk;
+};
+DEFINE_PER_CPU(struct rcu_data, rcu_data) = {
+ .lock = RW_LOCK_UNLOCKED,
+ .batch = 0,
+ .waitlist = NULL,
+ .donelist = NULL
+};
+struct rcu_ctrlblk rcu_ctrlblk = {
+ .batch = 0,
};

-static struct rcu_state rcu_state ____cacheline_maxaligned_in_smp =
- {.lock = SPIN_LOCK_UNLOCKED, .cpumask = CPU_MASK_NONE };
-static struct rcu_state rcu_bh_state ____cacheline_maxaligned_in_smp =
- {.lock = SPIN_LOCK_UNLOCKED, .cpumask = CPU_MASK_NONE };
-
-DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
-DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };
-
-/* Fake initialization required by compiler */
-static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
-static int maxbatch = 10;
-
-/**
- * call_rcu - Queue an RCU callback for invocation after a grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual update function to be invoked after the grace period
- *
- * The update function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed. RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
- * and may be nested.
- */
-void fastcall call_rcu(struct rcu_head *head,
- void (*func)(struct rcu_head *rcu))
+void rcu_init(void)
{
- unsigned long flags;
+ int cpu;
struct rcu_data *rdp;

- head->func = func;
- head->next = NULL;
- local_irq_save(flags);
- rdp = &__get_cpu_var(rcu_data);
- *rdp->nxttail = head;
- rdp->nxttail = &head->next;
- local_irq_restore(flags);
+ for_each_cpu(cpu) {
+ rdp = &per_cpu(rcu_data, cpu);
+ rdp->waittail = &rdp->waitlist;
+ rdp->donetail = &rdp->donelist;
+ }
}

-/**
- * call_rcu_bh - Queue an RCU for invocation after a quicker grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual update function to be invoked after the grace period
- *
- * The update function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed. call_rcu_bh() assumes
- * that the read-side critical sections end on completion of a softirq
- * handler. This means that read-side critical sections in process
- * context must not be interrupted by softirqs. This interface is to be
- * used when most of the read-side critical sections are in softirq context.
- * RCU read-side critical sections are delimited by rcu_read_lock() and
- * rcu_read_unlock(), * if in interrupt context or rcu_read_lock_bh()
- * and rcu_read_unlock_bh(), if in process context. These may be nested.
- */
-void fastcall call_rcu_bh(struct rcu_head *head,
- void (*func)(struct rcu_head *rcu))
+void rcu_read_lock(void)
{
unsigned long flags;
- struct rcu_data *rdp;

- head->func = func;
- head->next = NULL;
- local_irq_save(flags);
- rdp = &__get_cpu_var(rcu_bh_data);
- *rdp->nxttail = head;
- rdp->nxttail = &head->next;
+ local_irq_save(flags); /* allow invocation from OOM handler. */
+ if (current->rcu_read_lock_nesting++ == 0) {
+ current->rcu_read_lock_ptr = &__get_cpu_var(rcu_data).lock;
+ read_lock(current->rcu_read_lock_ptr);
+ }
local_irq_restore(flags);
}

-/*
- * Invoke the completed RCU callbacks. They are expected to be in
- * a per-cpu list.
- */
-static void rcu_do_batch(struct rcu_data *rdp)
+void rcu_read_unlock(void)
{
- struct rcu_head *next, *list;
- int count = 0;
+ unsigned long flags;

- list = rdp->donelist;
- while (list) {
- next = rdp->donelist = list->next;
- list->func(list);
- list = next;
- if (++count >= maxbatch)
- break;
+ local_irq_save(flags); /* allow invocation from OOM handler. */
+ if (--current->rcu_read_lock_nesting == 0) {
+ read_unlock(current->rcu_read_lock_ptr);
}
- if (!rdp->donelist)
- rdp->donetail = &rdp->donelist;
- else
- tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu));
+ local_irq_restore(flags);
}

-/*
- * Grace period handling:
- * The grace period handling consists out of two steps:
- * - A new grace period is started.
- * This is done by rcu_start_batch. The start is not broadcasted to
- * all cpus, they must pick this up by comparing rcp->cur with
- * rdp->quiescbatch. All cpus are recorded in the
- * rcu_state.cpumask bitmap.
- * - All cpus must go through a quiescent state.
- * Since the start of the grace period is not broadcasted, at least two
- * calls to rcu_check_quiescent_state are required:
- * The first call just notices that a new grace period is running. The
- * following calls check if there was a quiescent state since the beginning
- * of the grace period. If so, it updates rcu_state.cpumask. If
- * the bitmap is empty, then the grace period is completed.
- * rcu_check_quiescent_state calls rcu_start_batch(0) to start the next grace
- * period (if necessary).
- */
-/*
- * Register a new batch of callbacks, and start it up if there is currently no
- * active batch and the batch to be registered has not already occurred.
- * Caller must hold rcu_state.lock.
- */
-static void rcu_start_batch(struct rcu_ctrlblk *rcp, struct rcu_state *rsp,
- int next_pending)
+void _synchronize_kernel(void)
{
- if (next_pending)
- rcp->next_pending = 1;
+ int cpu;
+ unsigned long flags;

- if (rcp->next_pending &&
- rcp->completed == rcp->cur) {
- /* Can't change, since spin lock held. */
- cpus_andnot(rsp->cpumask, cpu_online_map, nohz_cpu_mask);
-
- rcp->next_pending = 0;
- /* next_pending == 0 must be visible in __rcu_process_callbacks()
- * before it can see new value of cur.
- */
- smp_wmb();
- rcp->cur++;
+ local_irq_save(flags); /* allow invocation from OOM handler. */
+ for_each_cpu(cpu) { /* _online() or _present() races with hotplug */
+ write_lock(&per_cpu(rcu_data, cpu).lock);
+ }
+ rcu_ctrlblk.batch++;
+ rcu_ctrlblk.last_sk = jiffies;
+ for_each_cpu(cpu) {
+ write_unlock(&per_cpu(rcu_data, cpu).lock);
}
+ local_irq_restore(flags);
}

-/*
- * cpu went through a quiescent state since the beginning of the grace period.
- * Clear it from the cpu mask and complete the grace period if it was the last
- * cpu. Start another grace period if someone has further entries pending
- */
-static void cpu_quiet(int cpu, struct rcu_ctrlblk *rcp, struct rcu_state *rsp)
+void synchronize_kernel(void)
{
- cpu_clear(cpu, rsp->cpumask);
- if (cpus_empty(rsp->cpumask)) {
- /* batch completed ! */
- rcp->completed = rcp->cur;
- rcu_start_batch(rcp, rsp, 0);
- }
-}
+ long oldbatch;

-/*
- * Check if the cpu has gone through a quiescent state (say context
- * switch). If so and if it already hasn't done so in this RCU
- * quiescent cycle, then indicate that it has done so.
- */
-static void rcu_check_quiescent_state(struct rcu_ctrlblk *rcp,
- struct rcu_state *rsp, struct rcu_data *rdp)
-{
- if (rdp->quiescbatch != rcp->cur) {
- /* start new grace period: */
- rdp->qs_pending = 1;
- rdp->passed_quiesc = 0;
- rdp->quiescbatch = rcp->cur;
- return;
+ smp_mb();
+ oldbatch = rcu_ctrlblk.batch;
+ schedule_timeout(HZ/GRACE_PERIODS_PER_SEC);
+ if (rcu_ctrlblk.batch == oldbatch) {
+ _synchronize_kernel();
}
-
- /* Grace period already completed for this cpu?
- * qs_pending is checked instead of the actual bitmap to avoid
- * cacheline trashing.
- */
- if (!rdp->qs_pending)
- return;
-
- /*
- * Was there a quiescent state since the beginning of the grace
- * period? If no, then exit and wait for the next call.
- */
- if (!rdp->passed_quiesc)
- return;
- rdp->qs_pending = 0;
-
- spin_lock(&rsp->lock);
- /*
- * rdp->quiescbatch/rcp->cur and the cpu bitmap can come out of sync
- * during cpu startup. Ignore the quiescent state.
- */
- if (likely(rdp->quiescbatch == rcp->cur))
- cpu_quiet(rdp->cpu, rcp, rsp);
-
- spin_unlock(&rsp->lock);
-}
-
-
-#ifdef CONFIG_HOTPLUG_CPU
-
-/* warning! helper for rcu_offline_cpu. do not use elsewhere without reviewing
- * locking requirements, the list it's pulling from has to belong to a cpu
- * which is dead and hence not processing interrupts.
- */
-static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list,
- struct rcu_head **tail)
-{
- local_irq_disable();
- *this_rdp->nxttail = list;
- if (list)
- this_rdp->nxttail = tail;
- local_irq_enable();
}

-static void __rcu_offline_cpu(struct rcu_data *this_rdp,
- struct rcu_ctrlblk *rcp, struct rcu_state *rsp, struct rcu_data *rdp)
+void rcu_advance_callbacks(void)
{
- /* if the cpu going offline owns the grace period
- * we can block indefinitely waiting for it, so flush
- * it here
- */
- spin_lock_bh(&rsp->lock);
- if (rcp->cur != rcp->completed)
- cpu_quiet(rdp->cpu, rcp, rsp);
- spin_unlock_bh(&rsp->lock);
- rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail);
- rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail);
+ unsigned long flags;
+ struct rcu_data *rdp;

+ local_irq_save(flags); /* allow invocation from OOM handler. */
+ rdp = &__get_cpu_var(rcu_data);
+ smp_mb(); /* prevent sampling batch # before list removal. */
+ if (rdp->batch != rcu_ctrlblk.batch) {
+ *rdp->donetail = rdp->waitlist;
+ rdp->donetail = rdp->waittail;
+ rdp->waitlist = NULL;
+ rdp->waittail = &rdp->waitlist;
+ rdp->batch = rcu_ctrlblk.batch;
+ }
+ local_irq_restore(flags);
}
-static void rcu_offline_cpu(int cpu)
+
+void call_rcu(struct rcu_head *head,
+ void (*func)(struct rcu_head *rcu))
{
- struct rcu_data *this_rdp = &get_cpu_var(rcu_data);
- struct rcu_data *this_bh_rdp = &get_cpu_var(rcu_bh_data);
+ unsigned long flags;
+ struct rcu_data *rdp;

- __rcu_offline_cpu(this_rdp, &rcu_ctrlblk, &rcu_state,
- &per_cpu(rcu_data, cpu));
- __rcu_offline_cpu(this_bh_rdp, &rcu_bh_ctrlblk, &rcu_bh_state,
- &per_cpu(rcu_bh_data, cpu));
- put_cpu_var(rcu_data);
- put_cpu_var(rcu_bh_data);
- tasklet_kill_immediate(&per_cpu(rcu_tasklet, cpu), cpu);
+ head->func = func;
+ head->next = NULL;
+ local_irq_save(flags);
+ rcu_advance_callbacks();
+ rdp = &__get_cpu_var(rcu_data);
+ *rdp->waittail = head;
+ rdp->waittail = &head->next;
+ local_irq_restore(flags);
}

-#else
-
-static void rcu_offline_cpu(int cpu)
+void rcu_process_callbacks(void)
{
-}
-
-#endif
+ unsigned long flags;
+ struct rcu_head *next, *list;
+ struct rcu_data *rdp;

-/*
- * This does the RCU processing work from tasklet context.
- */
-static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
- struct rcu_state *rsp, struct rcu_data *rdp)
-{
- if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) {
- *rdp->donetail = rdp->curlist;
- rdp->donetail = rdp->curtail;
- rdp->curlist = NULL;
- rdp->curtail = &rdp->curlist;
+ local_irq_save(flags);
+ rdp = &__get_cpu_var(rcu_data);
+ list = rdp->donelist;
+ if (list == NULL) {
+ local_irq_restore(flags);
+ return;
}
-
- local_irq_disable();
- if (rdp->nxtlist && !rdp->curlist) {
- rdp->curlist = rdp->nxtlist;
- rdp->curtail = rdp->nxttail;
- rdp->nxtlist = NULL;
- rdp->nxttail = &rdp->nxtlist;
- local_irq_enable();
-
- /*
- * start the next batch of callbacks
- */
-
- /* determine batch number */
- rdp->batch = rcp->cur + 1;
- /* see the comment and corresponding wmb() in
- * the rcu_start_batch()
- */
- smp_rmb();
-
- if (!rcp->next_pending) {
- /* and start it/schedule start if it's a new batch */
- spin_lock(&rsp->lock);
- rcu_start_batch(rcp, rsp, 1);
- spin_unlock(&rsp->lock);
- }
- } else {
- local_irq_enable();
+ rdp->donelist = NULL;
+ rdp->donetail = &rdp->waitlist;
+ local_irq_restore(flags);
+ while (list) {
+ next = list->next;
+ list->func(list);
+ list = next;
}
- rcu_check_quiescent_state(rcp, rsp, rdp);
- if (rdp->donelist)
- rcu_do_batch(rdp);
-}
-
-static void rcu_process_callbacks(unsigned long unused)
-{
- __rcu_process_callbacks(&rcu_ctrlblk, &rcu_state,
- &__get_cpu_var(rcu_data));
- __rcu_process_callbacks(&rcu_bh_ctrlblk, &rcu_bh_state,
- &__get_cpu_var(rcu_bh_data));
}

void rcu_check_callbacks(int cpu, int user)
{
- if (user ||
- (idle_cpu(cpu) && !in_softirq() &&
- hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
- rcu_qsctr_inc(cpu);
- rcu_bh_qsctr_inc(cpu);
- } else if (!in_softirq())
- rcu_bh_qsctr_inc(cpu);
- tasklet_schedule(&per_cpu(rcu_tasklet, cpu));
-}
-
-static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
- struct rcu_data *rdp)
-{
- memset(rdp, 0, sizeof(*rdp));
- rdp->curtail = &rdp->curlist;
- rdp->nxttail = &rdp->nxtlist;
- rdp->donetail = &rdp->donelist;
- rdp->quiescbatch = rcp->completed;
- rdp->qs_pending = 0;
- rdp->cpu = cpu;
-}
-
-static void __devinit rcu_online_cpu(int cpu)
-{
- struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
- struct rcu_data *bh_rdp = &per_cpu(rcu_bh_data, cpu);
-
- rcu_init_percpu_data(cpu, &rcu_ctrlblk, rdp);
- rcu_init_percpu_data(cpu, &rcu_bh_ctrlblk, bh_rdp);
- tasklet_init(&per_cpu(rcu_tasklet, cpu), rcu_process_callbacks, 0UL);
-}
-
-static int __devinit rcu_cpu_notify(struct notifier_block *self,
- unsigned long action, void *hcpu)
-{
- long cpu = (long)hcpu;
- switch (action) {
- case CPU_UP_PREPARE:
- rcu_online_cpu(cpu);
- break;
- case CPU_DEAD:
- rcu_offline_cpu(cpu);
- break;
- default:
- break;
+ if ((unsigned long)(jiffies - rcu_ctrlblk.last_sk) >
+ HZ/GRACE_PERIODS_PER_SEC) {
+ synchronize_kernel();
+ rcu_advance_callbacks();
+ rcu_process_callbacks();
}
- return NOTIFY_OK;
}

-static struct notifier_block __devinitdata rcu_nb = {
- .notifier_call = rcu_cpu_notify,
-};
-
-/*
- * Initializes rcu mechanism. Assumed to be called early.
- * That is before local timer(SMP) or jiffie timer (uniproc) is setup.
- * Note that rcu_qsctr and friends are implicitly
- * initialized due to the choice of ``0'' for RCU_CTR_INVALID.
- */
-void __init rcu_init(void)
+int rcu_pending(int cpu)
{
- rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE,
- (void *)(long)smp_processor_id());
- /* Register notifier for non-boot CPUs */
- register_cpu_notifier(&rcu_nb);
-}
-
-struct rcu_synchronize {
- struct rcu_head head;
- struct completion completion;
-};
-
-/* Because of FASTCALL declaration of complete, we use this wrapper */
-static void wakeme_after_rcu(struct rcu_head *head)
-{
- struct rcu_synchronize *rcu;
-
- rcu = container_of(head, struct rcu_synchronize, head);
- complete(&rcu->completion);
-}
-
-/**
- * synchronize_kernel - wait until a grace period has elapsed.
- *
- * Control will return to the caller some time after a full grace
- * period has elapsed, in other words after all currently executing RCU
- * read-side critical sections have completed. RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
- * and may be nested.
- */
-void synchronize_kernel(void)
-{
- struct rcu_synchronize rcu;
-
- init_completion(&rcu.completion);
- /* Will wake me after RCU finished */
- call_rcu(&rcu.head, wakeme_after_rcu);
+ unsigned long flags;
+ struct rcu_data *rdp;
+ int retval;

- /* Wait for it */
- wait_for_completion(&rcu.completion);
+ local_irq_save(flags);
+ rdp = &__get_cpu_var(rcu_data);
+ retval = (rdp->waitlist || rdp->donelist);
+ local_irq_restore(flags);
+ return (retval);
}

-module_param(maxbatch, int, 0);
EXPORT_SYMBOL(call_rcu);
-EXPORT_SYMBOL(call_rcu_bh);
EXPORT_SYMBOL(synchronize_kernel);

2005-03-19 05:04:20

by Manfred Spraul

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

Ingo Molnar wrote:

> read_lock(&rwlock);
> ...
> read_lock(&rwlock);
>
>are still legal. (it's also done quite often.)
>
>
>
How do you handle the write_lock_irq()/read_lock locks?
E.g. the tasklist_lock or the fasync_lock.

--
Manfred

2005-03-19 16:27:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU


* Manfred Spraul <[email protected]> wrote:

> Ingo Molnar wrote:
>
> > read_lock(&rwlock);
> > ...
> > read_lock(&rwlock);
> >
> >are still legal. (it's also done quite often.)
> >
> >
> >
>
> How do you handle the write_lock_irq()/read_lock locks? E.g. the
> tasklist_lock or the fasync_lock.

which precise locking situation do you mean?

Ingo

2005-03-19 16:32:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU


* Herbert Xu <[email protected]> wrote:

> Ingo Molnar <[email protected]> wrote:
> >
> > i really have no intention to allow multiple readers for rt-mutexes. We
> > got away with that so far, and i'd like to keep it so. Imagine 100
> > threads all blocked in the same critical section (holding the read-lock)
> > when a highprio writer thread comes around: instant 100x latency to let
> > all of them roll forward. The only sane solution is to not allow
> > excessive concurrency. (That limits SMP scalability, but there's no
> > other choice i can see.)
>
> What about allowing only as many concurrent readers as there are CPUs?

since a reader may be preempted by a higher prio task, there is no
linear relationship between CPU utilization and the number of readers
allowed. You could easily end up having all the nr_cpus readers
preempted on one CPU. It gets pretty messy.

Ingo

2005-03-20 06:37:27

by Manfred Spraul

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

Ingo Molnar wrote:

>which precise locking situation do you mean?
>
>
>
cpu 1:
acquire random networking spin_lock_bh()

cpu 2:
read_lock(&tasklist_lock) from process context
interrupt. softirq. within softirq: try to acquire the networking lock.
* spins.

cpu 1:
hardware interrupt
within hw interrupt: signal delivery. tries to acquire tasklist_lock.

--> deadlock.

--
Manfred

2005-03-20 08:01:39

by Kyle Moffett

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Mar 19, 2005, at 11:31, Ingo Molnar wrote:
>> What about allowing only as many concurrent readers as there are CPUs?
>
> since a reader may be preempted by a higher prio task, there is no
> linear relationship between CPU utilization and the number of readers
> allowed. You could easily end up having all the nr_cpus readers
> preempted on one CPU. It gets pretty messy

One solution I can think of, although it bloats memory usage for many-way
boxen, is to just have a table in the rwlock with one entry per CPU. Each
CPU would get one concurrent reader; others would need to sleep.
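
A rough sketch of the kind of structure described above (purely
illustrative -- the type and function names are invented and appear in no
posted patch; each slot would be set up with sema_init(&slot[i], 1)):

    struct percpu_reader_lock {
            struct semaphore slot[NR_CPUS];  /* one reader slot per CPU */
    };

    /* Reader: claim one slot; a second reader on the same CPU sleeps.
     * The slot index must be remembered until unlock, since the task may
     * migrate in between -- the same problem the per-task
     * rcu_read_lock_ptr pointer solves in the patch earlier in this
     * thread. */
    static int percpu_read_lock(struct percpu_reader_lock *l)
    {
            int cpu = get_cpu();

            put_cpu();
            down(&l->slot[cpu]);
            return cpu;
    }

    static void percpu_read_unlock(struct percpu_reader_lock *l, int cpu)
    {
            up(&l->slot[cpu]);
    }

    /* Writer: sweep every slot in order, excluding all readers on all
     * CPUs, much as _synchronize_kernel() above sweeps the per-CPU
     * rwlocks. */
    static void percpu_write_lock(struct percpu_reader_lock *l)
    {
            int cpu;

            for (cpu = 0; cpu < NR_CPUS; cpu++)
                    down(&l->slot[cpu]);
    }

    static void percpu_write_unlock(struct percpu_reader_lock *l)
    {
            int cpu;

            for (cpu = 0; cpu < NR_CPUS; cpu++)
                    up(&l->slot[cpu]);
    }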

Cheers,
Kyle Moffett

-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCM/CS/IT/U d- s++: a18 C++++>$ UB/L/X/*++++(+)>$ P+++(++++)>$
L++++(+++) E W++(+) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+
PGP+++ t+(+++) 5 X R? tv-(--) b++++(++) DI+ D+ G e->++++$ h!*()>++$ r
!y?(-)
------END GEEK CODE BLOCK------


2005-03-20 09:25:47

by Thomas Gleixner

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Sun, 2005-03-20 at 07:36 +0100, Manfred Spraul wrote:
> cpu 1:
> acquire random networking spin_lock_bh()
>
> cpu 2:
> read_lock(&tasklist_lock) from process context
> interrupt. softirq. within softirq: try to acquire the networking lock.
> * spins.
>
> cpu 1:
> hardware interrupt
> within hw interrupt: signal delivery. tries to acquire tasklist_lock.
>
> --> deadlock.

Signal delivery from hw interrupt context (interrupt is flagged
SA_NODELAY) is not possible in RT preemption mode. The
local_irq_save_nort() check in __cache_alloc will catch you.

When it happens from a threaded irq handler the situation is solvable by
the PI code.

tglx


2005-03-20 13:30:24

by Esben Nielsen

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Fri, 18 Mar 2005, Ingo Molnar wrote:

>
> * Esben Nielsen <[email protected]> wrote:
>
> > Why can should there only be one RCU-reader per CPU at each given
> > instance? Even on a real-time UP system it would be very helpfull to
> > have RCU areas to be enterable by several tasks as once. It would
> > perform better, both wrt. latencies and throughput: With the above
> > implementation an high priority task entering an RCU area will have to
> > boost the current RCU reader, make a task switch until that one
> > finishes and makes yet another task switch. to get back to the high
> > priority task. With an RCU implementation which can take n RCU readers
> > per CPU there is no such problem.
>
> correct, for RCU we could allow multiple readers per lock, because the
> 'blocking' side of RCU (callback processing) is never (supposed to be)
> in any latency path.
>
> except if someone wants to make RCU callback processing deterministic at
> some point. (e.g. for memory management reasons.)

I think it can be deterministic (on the long timescale of memory management)
anyway: Boost any non-RT task entering an RCU region to the lowest RT priority.
This way, only the RT tasks plus one non-RT task can be within those
regions. The RT tasks are supposed to have some kind of upper bound on
their CPU usage. The non-RT task will also finish "soon" as it is
boosted. If the RCU batches are also at the lowest RT priority, they can be
run immediately after the non-RT task is done.

>
> clearly the simplest solution is to go with the single-reader locks for
> now - a separate experiment could be done with a new type of rwlock that
> can only be used by the RCU code. (I'm not quite sure whether we could
> guarantee a minimum rate of RCU callback processing under such a scheme
> though. It's an eventual memory DoS otherwise.)
>

Why is a lock needed at all? If it is doable without locking for a
non-preemptable SMP kernel, it must be doable for a preemptable kernel as
well. I am convinced some kind of per-CPU rcu_read_count as I specified in
my previous mail can work some way or the other. call_rcu() might need to
do more complicated stuff and thus use CPU, but call_rcu() is supposed to
be a relatively rare event not worth optimizing for. Such an
implementation will work for any preemptable kernel, not only PREEMPT_RT.
Where performance is concerned, it is important not to acquire any locks in
the rcu-read regions.

I tried this approach. My UP laptop did boot on it, but I haven't tested
it further. I have included the very small patch as an attachment.

> Ingo

I have not yet looked at -V0.7.41-00...

Esben


Attachments:
rcu_patch (3.65 kB)

2005-03-20 16:58:16

by Manfred Spraul

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

Thomas Gleixner wrote:

>On Sun, 2005-03-20 at 07:36 +0100, Manfred Spraul wrote:
>
>
>>cpu 1:
>>acquire random networking spin_lock_bh()
>>
>>cpu 2:
>>read_lock(&tasklist_lock) from process context
>>interrupt. softirq. within softirq: try to acquire the networking lock.
>>* spins.
>>
>>cpu 1:
>>hardware interrupt
>>within hw interrupt: signal delivery. tries to acquire tasklist_lock.
>>
>>--> deadlock.
>>
>>
>
>Signal delivery from hw interrupt context (interrupt is flagged
>SA_NODELAY) is not possible in RT preemption mode. The
>local_irq_save_nort() check in __cache_alloc will catch you.
>
>
>
That was just one random example.
Another one would be:

drivers/char/tty_io.c, __do_SAK() contains
read_lock(&tasklist_lock);
task_lock(p);

kernel/sys.c, sys_setrlimit contains
task_lock(current->group_leader);
read_lock(&tasklist_lock);

task_lock is a shorthand for spin_lock(&p->alloc_lock). If read_lock is
a normal spinlock, then this is an A/B B/A deadlock.
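
Spelled out as code, the inverted ordering looks roughly like this (sketch
only; error handling and the bodies of the critical sections are elided):

    /* drivers/char/tty_io.c, __do_SAK(): tasklist_lock, then alloc_lock */
    read_lock(&tasklist_lock);
    task_lock(p);                       /* spin_lock(&p->alloc_lock) */
    /* ... */
    task_unlock(p);
    read_unlock(&tasklist_lock);

    /* kernel/sys.c, sys_setrlimit(): alloc_lock, then tasklist_lock */
    task_lock(current->group_leader);
    read_lock(&tasklist_lock);
    /* ... */
    read_unlock(&tasklist_lock);
    task_unlock(current->group_leader);

With today's reader-shared read_lock() the two paths can run concurrently;
if read_lock() becomes an exclusive, single-owner lock, each path can end
up holding its first lock while spinning on the other's.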

--
Manfred

2005-03-20 21:37:02

by Bill Huey

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Sun, Mar 20, 2005 at 05:57:23PM +0100, Manfred Spraul wrote:
> That was just one random example.
> Another one would be :
>
> drivers/chat/tty_io.c, __do_SAK() contains
> read_lock(&tasklist_lock);
> task_lock(p);
>
> kernel/sys.c, sys_setrlimit contains
> task_lock(current->group_leader);
> read_lock(&tasklist_lock);
>
> task_lock is a shorthand for spin_lock(&p->alloc_lock). If read_lock is
> a normal spinlock, then this is an A/B B/A deadlock.

That code was already dubious in the first place just because it
contained that circularity. If you had an rwlock that blocks on an
upper read count maximum, a deadlock situation would trigger anyway,
say, upon a flood of threads trying to do that sequence of acquires.

I'd probably experiment with using the {spin,read,write}-trylock
logic and release all the locks acquired in a sequence like that
on failure to acquire any of the locks in the chain as an
initial fix. A longer term fix might be to break things up a bit
so that whatever ordering is being done would have that circularity.
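
A minimal sketch of that trylock-and-back-off idea, applied to the
__do_SAK() ordering above (illustrative only -- a real fix would have to
bound the retry loop and think about livelock under contention):

    /* Take tasklist_lock, then try the inner lock; if that fails, drop
     * everything and retry, so the A/B vs. B/A inversion can no longer
     * wedge both paths. */
    for (;;) {
            read_lock(&tasklist_lock);
            if (spin_trylock(&p->alloc_lock))
                    break;                  /* got both locks */
            read_unlock(&tasklist_lock);    /* back off and retry */
            cpu_relax();
    }
    /* ... critical section using both locks ... */
    spin_unlock(&p->alloc_lock);
    read_unlock(&tasklist_lock);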

BTW, the runtime lock circularity detector was designed to trigger
on that situation anyway.

That's my thoughts on the matter.

bill

2005-03-20 21:58:06

by Bill Huey

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Sun, Mar 20, 2005 at 01:38:24PM -0800, Bill Huey wrote:
> On Sun, Mar 20, 2005 at 05:57:23PM +0100, Manfred Spraul wrote:
> > That was just one random example.
> > Another one would be :
> >
> > drivers/chat/tty_io.c, __do_SAK() contains
> > read_lock(&tasklist_lock);
> > task_lock(p);
> >
> > kernel/sys.c, sys_setrlimit contains
> > task_lock(current->group_leader);
> > read_lock(&tasklist_lock);
> >
> > task_lock is a shorthand for spin_lock(&p->alloc_lock). If read_lock is
> > a normal spinlock, then this is an A/B B/A deadlock.
>
> That code was already dubious in the first place just because it
> contained that circularity. If you had a rwlock that block on an
> upper read count maximum[,] a deadlock situation would trigger anyways,
> say, upon a flood of threads trying to do that sequence of aquires.

The RT patch uses the lock ordering "in place", and whatever nasty
situation was going on previously will effectively be under high load,
which increases the chance of it being triggered. Removal of the read
side semantic just increases load more, so that those cases can trigger.

I disagree with this approach and I have an alternate implementation
here that restores it. It's only half tested and fairly meaningless
until an extreme contention case is revealed with the current rt lock
implementation. Numbers need to be gathered to prove or disprove this
conjecture.

> I'd probably experiment with using the {spin,read,write}-trylock
> logic and release the all locks contains in a sequence like that
> on the failure to aquire any of the locks in the chain as an
> initial fix. A longer term fix might be to break things up a bit
> so that whatever ordering being done would have that circularity.

Excuse me, ...would *not* have that circularity.

bill

2005-03-20 22:38:43

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Sun, Mar 20, 2005 at 02:29:17PM +0100, Esben Nielsen wrote:
> On Fri, 18 Mar 2005, Ingo Molnar wrote:
>
> >
> > * Esben Nielsen <[email protected]> wrote:
> >
> > > Why can should there only be one RCU-reader per CPU at each given
> > > instance? Even on a real-time UP system it would be very helpfull to
> > > have RCU areas to be enterable by several tasks as once. It would
> > > perform better, both wrt. latencies and throughput: With the above
> > > implementation an high priority task entering an RCU area will have to
> > > boost the current RCU reader, make a task switch until that one
> > > finishes and makes yet another task switch. to get back to the high
> > > priority task. With an RCU implementation which can take n RCU readers
> > > per CPU there is no such problem.
> >
> > correct, for RCU we could allow multiple readers per lock, because the
> > 'blocking' side of RCU (callback processing) is never (supposed to be)
> > in any latency path.
> >
> > except if someone wants to make RCU callback processing deterministic at
> > some point. (e.g. for memory management reasons.)
>
> I think it can be deterministic (on the long timescale of memory management)
> anyway: Boost any non-RT task entering an RCU region to the lowest RT priority.
> This way only all the RT tasks + one non-RT task can be within those
> regions. The RT-tasks are supposed to have some kind of upper bound to
> their CPU-usage. The non-RT task will also finish "soon" as it is
> boosted. If the RCU batches are also at the lowest RT-priority they can be
> run immediately after the non-RT task is done.

Hmmm... Sort of a preemptive-first-strike priority boost. Cute! ;-)

> > clearly the simplest solution is to go with the single-reader locks for
> > now - a separate experiment could be done with a new type of rwlock that
> > can only be used by the RCU code. (I'm not quite sure whether we could
> > guarantee a minimum rate of RCU callback processing under such a scheme
> > though. It's an eventual memory DoS otherwise.)
> >
>
> Why are a lock needed at all? If it is doable without locking for an
> non-preemptable SMP kernel it must be doable for an preemptable kernel as
> well.I am convinced some kind of per-CPU rcu_read_count as I specified in
> my previous mail can work some way or the other. call_rcu() might need to
> do more complicated stuff and thus use CPU but call_rcu() is supposed to
> be an relative rare event not worth optimizing for. Such an
> implementation will work for any preemptable kernel, not only PREEMPT_RT.
> For performance is considered it is important not to acquire any locks in
> the rcu-read regions.

You definitely don't need a lock -- you can just suppress preemption
on the read side instead. But that potentially makes for long scheduling
latencies.

The counter approach might work, and is also what the implementation #5
does -- check out rcu_read_lock() in Ingo's most recent patch.

Thanx, Paul

> I tried this approach. My UP labtop did boot on it, but I haven't testet
> it further. I have included the very small patch as an attachment.
>
> > Ingo
>
> I have not yet looked at -V0.7.41-00...
>
> Esben
>

> diff -Naur --exclude-from diff_exclude linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h
> --- linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h 2005-03-11 23:40:13.000000000 +0100
> +++ linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h 2005-03-19 12:47:09.000000000 +0100
> @@ -85,6 +85,7 @@
> * curlist - current batch for which quiescent cycle started if any
> */
> struct rcu_data {
> + long active_readers;
> /* 1) quiescent state handling : */
> long quiescbatch; /* Batch # for grace period */
> int passed_quiesc; /* User-mode/idle loop etc. */
> @@ -115,12 +116,14 @@
> static inline void rcu_qsctr_inc(int cpu)
> {
> struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
> - rdp->passed_quiesc = 1;
> + if(rdp->active_readers==0)
> + rdp->passed_quiesc = 1;
> }
> static inline void rcu_bh_qsctr_inc(int cpu)
> {
> struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
> - rdp->passed_quiesc = 1;
> + if(rdp->active_readers==0)
> + rdp->passed_quiesc = 1;
> }
>
> static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
> @@ -183,29 +186,27 @@
> *
> * It is illegal to block while in an RCU read-side critical section.
> */
> -#define rcu_read_lock() preempt_disable()
> +static inline void rcu_read_lock(void)
> +{
> + preempt_disable();
> + __get_cpu_var(rcu_data).active_readers++;
> + preempt_enable();
> +}
>
> /**
> * rcu_read_unlock - marks the end of an RCU read-side critical section.
> *
> * See rcu_read_lock() for more information.
> */
> -#define rcu_read_unlock() preempt_enable()
> +static inline void rcu_read_unlock(void)
> +{
> + preempt_disable();
> + __get_cpu_var(rcu_data).active_readers--;
> + preempt_enable();
> +}
>
> #define IGNORE_LOCK(op, lock) do { (void)(lock); op(); } while (0)
>
> -#ifdef CONFIG_PREEMPT_RT
> -# define rcu_read_lock_spin(lock) spin_lock(lock)
> -# define rcu_read_unlock_spin(lock) spin_unlock(lock)
> -# define rcu_read_lock_read(lock) read_lock(lock)
> -# define rcu_read_unlock_read(lock) read_unlock(lock)
> -# define rcu_read_lock_bh_read(lock) read_lock_bh(lock)
> -# define rcu_read_unlock_bh_read(lock) read_unlock_bh(lock)
> -# define rcu_read_lock_down_read(rwsem) down_read(rwsem)
> -# define rcu_read_unlock_up_read(rwsem) up_read(rwsem)
> -# define rcu_read_lock_nort() do { } while (0)
> -# define rcu_read_unlock_nort() do { } while (0)
> -#else
> # define rcu_read_lock_spin(lock) IGNORE_LOCK(rcu_read_lock, lock)
> # define rcu_read_unlock_spin(lock) IGNORE_LOCK(rcu_read_unlock, lock)
> # define rcu_read_lock_read(lock) IGNORE_LOCK(rcu_read_lock, lock)
> @@ -216,15 +217,10 @@
> # define rcu_read_unlock_nort() rcu_read_unlock()
> # define rcu_read_lock_bh_read(lock) IGNORE_LOCK(rcu_read_lock_bh, lock)
> # define rcu_read_unlock_bh_read(lock) IGNORE_LOCK(rcu_read_unlock_bh, lock)
> -#endif
>
> -#ifdef CONFIG_PREEMPT_RT
> -# define rcu_read_lock_sem(lock) down(lock)
> -# define rcu_read_unlock_sem(lock) up(lock)
> -#else
> # define rcu_read_lock_sem(lock) IGNORE_LOCK(rcu_read_lock, lock)
> # define rcu_read_unlock_sem(lock) IGNORE_LOCK(rcu_read_unlock, lock)
> -#endif
> +
> /*
> * So where is rcu_write_lock()? It does not exist, as there is no
> * way for writers to lock out RCU readers. This is a feature, not
> diff -Naur --exclude-from diff_exclude linux-2.6.11-final-V0.7.40-00/Makefile linux-2.6.11-final-V0.7.40-00-RCU/Makefile
> --- linux-2.6.11-final-V0.7.40-00/Makefile 2005-03-11 23:40:13.000000000 +0100
> +++ linux-2.6.11-final-V0.7.40-00-RCU/Makefile 2005-03-19 12:41:09.000000000 +0100
> @@ -1,7 +1,7 @@
> VERSION = 2
> PATCHLEVEL = 6
> SUBLEVEL = 11
> -EXTRAVERSION = -RT-V0.7.40-00
> +EXTRAVERSION = -RT-V0.7.40-00-RCU
> NAME=Woozy Numbat
>
> # *DOCUMENTATION*

2005-03-20 23:31:39

by Esben Nielsen

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Sun, 20 Mar 2005, Paul E. McKenney wrote:

> On Sun, Mar 20, 2005 at 02:29:17PM +0100, Esben Nielsen wrote:
> > On Fri, 18 Mar 2005, Ingo Molnar wrote:
> >
> > > [...]
> >
> > I think it can be deterministic (on the long timescale of memory management)
> > anyway: Boost any non-RT task entering an RCU region to the lowest RT priority.
> > This way only all the RT tasks + one non-RT task can be within those
> > regions. The RT-tasks are supposed to have some kind of upper bound to
> > their CPU-usage. The non-RT task will also finish "soon" as it is
> > boosted. If the RCU batches are also at the lowest RT-priority they can be
> > run immediately after the non-RT task is done.
>
> Hmmm... Sort of a preemptive-first-strike priority boost. Cute! ;-)
>
Well, I was actually thinking of an API like
preempt_by_nonrt_disable()
preempt_by_nonrt_enable()
working like the old preempt_disable()/preempt_enable() but still
allowing RT tasks (as well as priority inheriting non-RT tasks) to be
scheduled. That would kind of help "split" the kernel into two halves: the
RT part and the non-RT part. The non-RT part would in many ways work as it
has always done.
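
Neither primitive exists in any posted patch; as a purely hypothetical
usage sketch, the RCU read side might then become something like:

    static inline void rcu_read_lock(void)
    {
            /* Other non-RT tasks can no longer preempt this reader, so a
             * non-RT reader finishes "soon"; RT tasks (and PI-boosted
             * tasks) may still preempt it and keep their latencies. */
            preempt_by_nonrt_disable();
    }

    static inline void rcu_read_unlock(void)
    {
            preempt_by_nonrt_enable();
    }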

> > > clearly the simplest solution is to go with the single-reader locks for
> > > now - a separate experiment could be done with a new type of rwlock that
> > > can only be used by the RCU code. (I'm not quite sure whether we could
> > > guarantee a minimum rate of RCU callback processing under such a scheme
> > > though. It's an eventual memory DoS otherwise.)
> > >
> >
> > Why are a lock needed at all? If it is doable without locking for an
> > non-preemptable SMP kernel it must be doable for an preemptable kernel as
> > well.I am convinced some kind of per-CPU rcu_read_count as I specified in
> > my previous mail can work some way or the other. call_rcu() might need to
> > do more complicated stuff and thus use CPU but call_rcu() is supposed to
> > be an relative rare event not worth optimizing for. Such an
> > implementation will work for any preemptable kernel, not only PREEMPT_RT.
> > For performance is considered it is important not to acquire any locks in
> > the rcu-read regions.
>
> You definitely don't need a lock -- you can just suppress preemption
> on the read side instead. But that potentially makes for long scheduling
> latencies.

Well, in my patch I do not leave preemption off - only while doing the
simple ++/--. In effect, I let rcu_qsctr_inc know that some RCU reader
might be active, i.e. preempted, on the current CPU, such that this isn't
a quiescent point after all.
(To others: Paul nicely unfolded my attachment below - I left it in
the mail such that you can read it.)
The problem with this approach is of course that user space programs might
preempt an RCU reader for a very long time, such that RCU batches are never
really run. The boosting of non-RT tasks mentioned above would help a
lot.
A plus(?) in it: You can actually sleep while holding the rcu_read_lock!!

>
> The counter approach might work, and is also what the implementation #5
> does -- check out rcu_read_lock() in Ingo's most recent patch.
>

Do you refer to your original mail implementing it in 5 steps?
In #5 in that one (-V0.7.41-00, right?) you use a lock, and as you say that
forces synchronization between the CPUs - bad for scaling. It does make the
RCU batches somewhat deterministic, as the RCU task can boost the readers
to the rcu-task's priority.
The problem with this approach is that everybody calling into RCU code
has a worst-case behaviour of the systemwide worst-case RCU reader
section - which can be pretty large (in principle infinite if somebody.)
So if somebody calls a function containing an RCU read
area, the worst-case behaviour would be the same as the worst-case latency
in the simple world where preempt_disable()/preempt_enable() was used.

> Thanx, Paul
>
> > I tried this approach. My UP labtop did boot on it, but I haven't testet
> > it further. I have included the very small patch as an attachment.
> >
> > > Ingo
> >
> > I have not yet looked at -V0.7.41-00...
> >
> > Esben
> >
>
> > diff -Naur --exclude-from diff_exclude linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h
> > --- linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h 2005-03-11 23:40:13.000000000 +0100
> > +++ linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h 2005-03-19 12:47:09.000000000 +0100
> > @@ -85,6 +85,7 @@
> > * curlist - current batch for which quiescent cycle started if any
> > */
> > struct rcu_data {
> > + long active_readers;
> > /* 1) quiescent state handling : */
> > long quiescbatch; /* Batch # for grace period */
> > int passed_quiesc; /* User-mode/idle loop etc. */
> > @@ -115,12 +116,14 @@
> > static inline void rcu_qsctr_inc(int cpu)
> > {
> > struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
> > - rdp->passed_quiesc = 1;
> > + if(rdp->active_readers==0)
> > + rdp->passed_quiesc = 1;
> > }
> > static inline void rcu_bh_qsctr_inc(int cpu)
> > {
> > struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
> > - rdp->passed_quiesc = 1;
> > + if(rdp->active_readers==0)
> > + rdp->passed_quiesc = 1;
> > }
> >
> > static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
> > @@ -183,29 +186,27 @@
> > *
> > * It is illegal to block while in an RCU read-side critical section.
> > */
> > -#define rcu_read_lock() preempt_disable()
> > +static inline void rcu_read_lock(void)
> > +{
> > + preempt_disable();
> > + __get_cpu_var(rcu_data).active_readers++;
> > + preempt_enable();
> > +}
> >
> > /**
> > * rcu_read_unlock - marks the end of an RCU read-side critical section.
> > *
> > * See rcu_read_lock() for more information.
> > */
> > -#define rcu_read_unlock() preempt_enable()
> > +static inline void rcu_read_unlock(void)
> > +{
> > + preempt_disable();
> > + __get_cpu_var(rcu_data).active_readers--;
> > + preempt_enable();
> > +}
> >
> > #define IGNORE_LOCK(op, lock) do { (void)(lock); op(); } while (0)
> >
> > -#ifdef CONFIG_PREEMPT_RT
> > -# define rcu_read_lock_spin(lock) spin_lock(lock)
> > -# define rcu_read_unlock_spin(lock) spin_unlock(lock)
> > -# define rcu_read_lock_read(lock) read_lock(lock)
> > -# define rcu_read_unlock_read(lock) read_unlock(lock)
> > -# define rcu_read_lock_bh_read(lock) read_lock_bh(lock)
> > -# define rcu_read_unlock_bh_read(lock) read_unlock_bh(lock)
> > -# define rcu_read_lock_down_read(rwsem) down_read(rwsem)
> > -# define rcu_read_unlock_up_read(rwsem) up_read(rwsem)
> > -# define rcu_read_lock_nort() do { } while (0)
> > -# define rcu_read_unlock_nort() do { } while (0)
> > -#else
> > # define rcu_read_lock_spin(lock) IGNORE_LOCK(rcu_read_lock, lock)
> > # define rcu_read_unlock_spin(lock) IGNORE_LOCK(rcu_read_unlock, lock)
> > # define rcu_read_lock_read(lock) IGNORE_LOCK(rcu_read_lock, lock)
> > @@ -216,15 +217,10 @@
> > # define rcu_read_unlock_nort() rcu_read_unlock()
> > # define rcu_read_lock_bh_read(lock) IGNORE_LOCK(rcu_read_lock_bh, lock)
> > # define rcu_read_unlock_bh_read(lock) IGNORE_LOCK(rcu_read_unlock_bh, lock)
> > -#endif
> >
> > -#ifdef CONFIG_PREEMPT_RT
> > -# define rcu_read_lock_sem(lock) down(lock)
> > -# define rcu_read_unlock_sem(lock) up(lock)
> > -#else
> > # define rcu_read_lock_sem(lock) IGNORE_LOCK(rcu_read_lock, lock)
> > # define rcu_read_unlock_sem(lock) IGNORE_LOCK(rcu_read_unlock, lock)
> > -#endif
> > +
> > /*
> > * So where is rcu_write_lock()? It does not exist, as there is no
> > * way for writers to lock out RCU readers. This is a feature, not
> > diff -Naur --exclude-from diff_exclude linux-2.6.11-final-V0.7.40-00/Makefile linux-2.6.11-final-V0.7.40-00-RCU/Makefile
> > --- linux-2.6.11-final-V0.7.40-00/Makefile 2005-03-11 23:40:13.000000000 +0100
> > +++ linux-2.6.11-final-V0.7.40-00-RCU/Makefile 2005-03-19 12:41:09.000000000 +0100
> > @@ -1,7 +1,7 @@
> > VERSION = 2
> > PATCHLEVEL = 6
> > SUBLEVEL = 11
> > -EXTRAVERSION = -RT-V0.7.40-00
> > +EXTRAVERSION = -RT-V0.7.40-00-RCU
> > NAME=Woozy Numbat
> >
> > # *DOCUMENTATION*
>
>

2005-03-22 05:58:19

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
> On Sun, 20 Mar 2005, Paul E. McKenney wrote:
>
> > On Sun, Mar 20, 2005 at 02:29:17PM +0100, Esben Nielsen wrote:
> > > On Fri, 18 Mar 2005, Ingo Molnar wrote:
> > >
> > > > [...]
> > >
> > > I think it can be deterministic (on the long timescale of memory management)
> > > anyway: Boost any non-RT task entering an RCU region to the lowest RT priority.
> > > This way only all the RT tasks + one non-RT task can be within those
> > > regions. The RT-tasks are supposed to have some kind of upper bound to
> > > their CPU-usage. The non-RT task will also finish "soon" as it is
> > > boosted. If the RCU batches are also at the lowest RT-priority they can be
> > > run immediately after the non-RT task is done.
> >
> > Hmmm... Sort of a preemptive-first-strike priority boost. Cute! ;-)
> >
> Well, I was actually thinking of an API like
> preempt_by_nonrt_disable()
> preempt_by_nonrt_enable()
> working like the old preempt_disable()/preempt_enable() but still
> allowing RT tasks (as well as priority inheriting non-RT tasks) to be
> scheduled. That would kind of help "split" the kernel into two halfs: the
> RT part and the non-RT part. The non-RT part would in many ways work as it
> has always done.

Does sound in some ways similar to the migration approach -- there, the
RT/non-RT split is made across CPUs. But if RT is allowed to preempt,
then you still have to deal with preemption for locking correctness, right?

> > > > clearly the simplest solution is to go with the single-reader locks for
> > > > now - a separate experiment could be done with a new type of rwlock that
> > > > can only be used by the RCU code. (I'm not quite sure whether we could
> > > > guarantee a minimum rate of RCU callback processing under such a scheme
> > > > though. It's an eventual memory DoS otherwise.)
> > > >
> > >
> > > Why are a lock needed at all? If it is doable without locking for an
> > > non-preemptable SMP kernel it must be doable for an preemptable kernel as
> > > well.I am convinced some kind of per-CPU rcu_read_count as I specified in
> > > my previous mail can work some way or the other. call_rcu() might need to
> > > do more complicated stuff and thus use CPU but call_rcu() is supposed to
> > > be an relative rare event not worth optimizing for. Such an
> > > implementation will work for any preemptable kernel, not only PREEMPT_RT.
> > > For performance is considered it is important not to acquire any locks in
> > > the rcu-read regions.
> >
> > You definitely don't need a lock -- you can just suppress preemption
> > on the read side instead. But that potentially makes for long scheduling
> > latencies.
>
> Well, in my patch I do not leave preemption off - only while doing the
> simple ++/--. In effect, I let rcu_qsctr_inc know that some RCU reader
> might be active, i.e. preempted, on the current CPU such that this isn't
> and quiescent point after all.
> (To others: Paul nicely unfolded my attachment below - I left it in
> the mail such you can read it.)
> The problem with this approach is ofcourse that user space programs might
> preempt an RCU reader for a very long time such that RCU batches are never
> really run. The boosting of non-RT tasks mentioned above would help a
> lot.
> A plus(?) in it: You can actually sleep while having the rcu_read_lock !!

This is in some ways similar to the K42 approach to RCU (which they call
"generations"). Dipankar put together a similar patch for Linux, but
the problem was that grace periods could be deferred for an extremely
long time. Which I suspect is what you were calling out as causing
RCU batches never to run.

> > The counter approach might work, and is also what the implementation #5
> > does -- check out rcu_read_lock() in Ingo's most recent patch.
> >
>
> Do you refer to your original mail with implementing it in 5 steps?
> In #5 in that one (-V0.7.41-00, right?) you use a lock and as you say that
> forces syncronization between the CPUs - bad for scaling. It does make the
> RCU batches somewhat deterministic, as the RCU task can boost the readers
> to the rcu-task's priority.
> The problem about this approach is that everybody calling into RCU code
> have a worst case behaviour of the systemwide worst case RCU reader
> section - which can be pretty large (in principle infinite if somebody.)
> So if somebody uses a call to a function in the code containing a RCU read
> area the worst case behavious would be the same as teh worst case latency
> in the simple world where preempt_disable()/preempt_enable() was used.

I missed something here -- readers would see the worst-case -writer-
latency rather than the worst-case -reader- latency, right? Or are you
concerned about the case where some reader blocks the write-side
acquisitions in _synchronize_kernel(), but not before the writer has
grabbed a lock, blocking any readers on the corresponding CPU?

Yes, but this is true of every other lock in the system as well, not?
In a separate conversation a few weeks ago, Steve Hemminger suggested
placing checks in long RCU read-side critical sections -- this could
be used to keep the worst-case reader latency within the desired bounds.
Not pretty, but, then again, bounding lock latency will require a fair
number of similar changes, I suspect.
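
As a sketch of what such a check might look like in a long reader
(hypothetical -- struct foo, do_something_with() and RCU_READER_CHUNK are
invented for illustration, and a real conversion would need to revalidate
its position after re-entering the critical section):

    #define RCU_READER_CHUNK 128

    static void scan_foo_list(struct list_head *head)
    {
            struct foo *p;
            int scanned;

    restart:
            scanned = 0;
            rcu_read_lock();
            list_for_each_entry_rcu(p, head, list) {
                    do_something_with(p);
                    if (++scanned >= RCU_READER_CHUNK) {
                            /* Leave the read-side critical section so
                             * that one long traversal cannot hold off
                             * grace periods (and hence callback
                             * processing) indefinitely. */
                            rcu_read_unlock();
                            cond_resched();
                            goto restart;   /* 'p' may now be stale */
                    }
            }
            rcu_read_unlock();
    }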

Thanx, Paul

> > > I tried this approach. My UP labtop did boot on it, but I haven't testet
> > > it further. I have included the very small patch as an attachment.
> > >
> > > > Ingo
> > >
> > > I have not yet looked at -V0.7.41-00...
> > >
> > > Esben
> > >
> >
> > > diff -Naur --exclude-from diff_exclude linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h
> > > --- linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h 2005-03-11 23:40:13.000000000 +0100
> > > +++ linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h 2005-03-19 12:47:09.000000000 +0100
> > > @@ -85,6 +85,7 @@
> > > * curlist - current batch for which quiescent cycle started if any
> > > */
> > > struct rcu_data {
> > > + long active_readers;
> > > /* 1) quiescent state handling : */
> > > long quiescbatch; /* Batch # for grace period */
> > > int passed_quiesc; /* User-mode/idle loop etc. */
> > > @@ -115,12 +116,14 @@
> > > static inline void rcu_qsctr_inc(int cpu)
> > > {
> > > struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
> > > - rdp->passed_quiesc = 1;
> > > + if(rdp->active_readers==0)
> > > + rdp->passed_quiesc = 1;
> > > }
> > > static inline void rcu_bh_qsctr_inc(int cpu)
> > > {
> > > struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
> > > - rdp->passed_quiesc = 1;
> > > + if(rdp->active_readers==0)
> > > + rdp->passed_quiesc = 1;
> > > }
> > >
> > > static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
> > > @@ -183,29 +186,27 @@
> > > *
> > > * It is illegal to block while in an RCU read-side critical section.
> > > */
> > > -#define rcu_read_lock() preempt_disable()
> > > +static inline void rcu_read_lock(void)
> > > +{
> > > + preempt_disable();
> > > + __get_cpu_var(rcu_data).active_readers++;
> > > + preempt_enable();
> > > +}
> > >
> > > /**
> > > * rcu_read_unlock - marks the end of an RCU read-side critical section.
> > > *
> > > * See rcu_read_lock() for more information.
> > > */
> > > -#define rcu_read_unlock() preempt_enable()
> > > +static inline void rcu_read_unlock(void)
> > > +{
> > > + preempt_disable();
> > > + __get_cpu_var(rcu_data).active_readers--;
> > > + preempt_enable();
> > > +}
> > >
> > > #define IGNORE_LOCK(op, lock) do { (void)(lock); op(); } while (0)
> > >
> > > -#ifdef CONFIG_PREEMPT_RT
> > > -# define rcu_read_lock_spin(lock) spin_lock(lock)
> > > -# define rcu_read_unlock_spin(lock) spin_unlock(lock)
> > > -# define rcu_read_lock_read(lock) read_lock(lock)
> > > -# define rcu_read_unlock_read(lock) read_unlock(lock)
> > > -# define rcu_read_lock_bh_read(lock) read_lock_bh(lock)
> > > -# define rcu_read_unlock_bh_read(lock) read_unlock_bh(lock)
> > > -# define rcu_read_lock_down_read(rwsem) down_read(rwsem)
> > > -# define rcu_read_unlock_up_read(rwsem) up_read(rwsem)
> > > -# define rcu_read_lock_nort() do { } while (0)
> > > -# define rcu_read_unlock_nort() do { } while (0)
> > > -#else
> > > # define rcu_read_lock_spin(lock) IGNORE_LOCK(rcu_read_lock, lock)
> > > # define rcu_read_unlock_spin(lock) IGNORE_LOCK(rcu_read_unlock, lock)
> > > # define rcu_read_lock_read(lock) IGNORE_LOCK(rcu_read_lock, lock)
> > > @@ -216,15 +217,10 @@
> > > # define rcu_read_unlock_nort() rcu_read_unlock()
> > > # define rcu_read_lock_bh_read(lock) IGNORE_LOCK(rcu_read_lock_bh, lock)
> > > # define rcu_read_unlock_bh_read(lock) IGNORE_LOCK(rcu_read_unlock_bh, lock)
> > > -#endif
> > >
> > > -#ifdef CONFIG_PREEMPT_RT
> > > -# define rcu_read_lock_sem(lock) down(lock)
> > > -# define rcu_read_unlock_sem(lock) up(lock)
> > > -#else
> > > # define rcu_read_lock_sem(lock) IGNORE_LOCK(rcu_read_lock, lock)
> > > # define rcu_read_unlock_sem(lock) IGNORE_LOCK(rcu_read_unlock, lock)
> > > -#endif
> > > +
> > > /*
> > > * So where is rcu_write_lock()? It does not exist, as there is no
> > > * way for writers to lock out RCU readers. This is a feature, not
> > > diff -Naur --exclude-from diff_exclude linux-2.6.11-final-V0.7.40-00/Makefile linux-2.6.11-final-V0.7.40-00-RCU/Makefile
> > > --- linux-2.6.11-final-V0.7.40-00/Makefile 2005-03-11 23:40:13.000000000 +0100
> > > +++ linux-2.6.11-final-V0.7.40-00-RCU/Makefile 2005-03-19 12:41:09.000000000 +0100
> > > @@ -1,7 +1,7 @@
> > > VERSION = 2
> > > PATCHLEVEL = 6
> > > SUBLEVEL = 11
> > > -EXTRAVERSION = -RT-V0.7.40-00
> > > +EXTRAVERSION = -RT-V0.7.40-00-RCU
> > > NAME=Woozy Numbat
> > >
> > > # *DOCUMENTATION*
> >
> >
>
>

2005-03-22 08:08:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU


* Kyle Moffett <[email protected]> wrote:

> One solution I can think of, although it bloats memory usage for
> many-way boxen, is to just have a table in the rwlock with one entry
> per cpu. Each CPU would get one concurrent reader, others would need
> to sleep

yes, it bloats memory usage, and complicates PI quite significantly.
We've been there and have done something similar to that in earlier -RT
patches - 64 'owner' entries in the lock already caused problems on 8K
stacks. 32 seemed to work but caused quite some bloat in data structure
sizes.

i'd rather live with some scalability bottleneck for now, in exchange for
a vastly easier-to-handle design. We can still complicate things later
on for better scalability, once all the other issues have been sorted
out.
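
roughly, the per-CPU reader-slot idea looks like this (a sketch only, not
the actual lock from those earlier -RT patches; writer serialization and
PI bookkeeping omitted):

    struct pcpu_rw_lock {
        struct task_struct *reader[NR_CPUS];    /* one reader slot per CPU */
        /* writer state, PI bookkeeping, ... */
    };

with NR_CPUS at 64 on a 64-bit box the slot array alone is 512 bytes per
lock, before any of the other state - which is the kind of bloat i was
referring to.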

Ingo

2005-03-22 08:56:14

by Esben Nielsen

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU


On Mon, 21 Mar 2005, Paul E. McKenney wrote:

> On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
> > > [...]
> > Well, I was actually thinking of an API like
> > preempt_by_nonrt_disable()
> > preempt_by_nonrt_enable()
> > working like the old preempt_disable()/preempt_enable() but still
> > allowing RT tasks (as well as priority inheriting non-RT tasks) to be
> > scheduled. That would kind of help "split" the kernel into two halfs: the
> > RT part and the non-RT part. The non-RT part would in many ways work as it
> > has always done.
>
> Does sound in some ways similar to the migration approach -- there, the
> RT/non-RT split is made across CPUs. But if RT is allowed to preempt,
> then you still have to deal with preemption for locking correctness, right?
>
Yes. It is not a locking mechanism; it should just prevent scheduling of
"normal" userspace tasks.

> > [..]
> > Well, in my patch I do not leave preemption off - only while doing the
> > simple ++/--. In effect, I let rcu_qsctr_inc know that some RCU reader
> > might be active, i.e. preempted, on the current CPU such that this isn't
> > and quiescent point after all.
> > (To others: Paul nicely unfolded my attachment below - I left it in
> > the mail such you can read it.)
> > The problem with this approach is ofcourse that user space programs might
> > preempt an RCU reader for a very long time such that RCU batches are never
> > really run. The boosting of non-RT tasks mentioned above would help a
> > lot.
> > A plus(?) in it: You can actually sleep while having the rcu_read_lock !!
>
> This is in some ways similar to the K42 approach to RCU (which they call
> "generations"). Dipankar put together a similar patch for Linux, but
> the problem was that grace periods could be deferred for an extremely
> long time. Which I suspect is what you were calling out as causing
> RCU batches never to run.
>

That is where the preempt_by_nonrt_disable/enable() is supposed to help:
Then it can't take longer than the normal kernel in the situation where
there are no RT tasks running. RT tasks will prolong the grace periods if
they go into RCU regions, but they are supposed to be relatively small -
and deterministic!

> > > The counter approach might work, and is also what the implementation #5
> > > does -- check out rcu_read_lock() in Ingo's most recent patch.
> > >
> >
> > Do you refer to your original mail with implementing it in 5 steps?
> > In #5 in that one (-V0.7.41-00, right?) you use a lock and as you say that
> > forces syncronization between the CPUs - bad for scaling. It does make the
> > RCU batches somewhat deterministic, as the RCU task can boost the readers
> > to the rcu-task's priority.
> > The problem about this approach is that everybody calling into RCU code
> > have a worst case behaviour of the systemwide worst case RCU reader
> > section - which can be pretty large (in principle infinite if somebody.)
> > So if somebody uses a call to a function in the code containing a RCU read
> > area the worst case behavious would be the same as teh worst case latency
> > in the simple world where preempt_disable()/preempt_enable() was used.
>
> I missed something here -- readers would see the worst-case -writer-
> latency rather than the worst-case -reader- latency, right? Or are you
> concerned about the case where some reader blocks the write-side
> acquisitions in _synchronize_kernel(), but not before the writer has
> grabbed a lock, blocking any readers on the corresponding CPU?
>

I am concerned that readers block each other, too. You do need an rw-mutex
allowing an unlimited number of readers for doing this. With the current
rw-mutex the readers block each other, i.e. the worst case latency is the
worst case reader latency - globally!
On the other hand, with an rw-lock that is unlimited - and thus does not
keep track of its readers - the readers can't be boosted by the writer.
Then you are back to square 1: the grace period can take a very long time.

> Yes, but this is true of every other lock in the system as well, not?

Other locks are not globally used but only used for a specific subsystem.
On a real-time system you are supposed to know which subsystems you can
call into and still have a low enough latency, as each subsystem has its
own bound. But with a global RCU locking mechanism, all RCU-using code is
to be regarded as _one_ such subsystem.

> In a separate conversation a few weeks ago, Steve Hemminger suggested
> placing checks in long RCU read-side critical sections -- this could
> be used to keep the worst-case reader latency within the desired bounds.
> Not pretty, but, then again, bounding lock latency will require a fair
> number of similar changes, I suspect.
>
> Thanx, Paul
>

Esben

2005-03-22 09:21:08

by Ingo Molnar

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU


* Esben Nielsen <[email protected]> wrote:

> On the other hand with a rw-lock being unlimited - and thus do not
> keep track of it readers - the readers can't be boosted by the writer.
> Then you are back to square 1: The grace period can take a very long
> time.

btw., is the 'very long grace period' a real issue? We could avoid all
the RCU read-side locking latencies by making it truly unlocked and just
living with the long grace periods. Perhaps it's enough to add an
emergency mechanism to the OOM handler, which frees up all the 'blocked
by preemption' RCU callbacks via some scheduler magic. (e.g. such an
emergency mechanism could be _conditional_ locking on the read side -
i.e. new RCU read-side users would be blocked until the OOM situation
goes away, or something like that.)
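
i.e. something vaguely like this on the read side (a sketch only - the
flag, the waitqueue and the function name are all made up):

    extern atomic_t rcu_oom_throttle;       /* set/cleared by the OOM handler */
    extern wait_queue_head_t rcu_oom_wait;

    static inline void rcu_read_lock_throttled(void)
    {
        /*
         * only _new_ readers are held back - readers already inside
         * their critical sections are unaffected.
         */
        if (unlikely(atomic_read(&rcu_oom_throttle)))
            wait_event(rcu_oom_wait, !atomic_read(&rcu_oom_throttle));
        rcu_read_lock();
    }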

your patch is implementing just that, correct? Would you mind redoing it
against a recent -RT base? (-40-04 or so)

also, what would be the worst-case workload causing long grace periods?

Ingo

2005-03-22 10:08:00

by Bill Huey

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Fri, Mar 18, 2005 at 05:55:44PM +0100, Esben Nielsen wrote:
> On Fri, 18 Mar 2005, Ingo Molnar wrote:
> > i really have no intention to allow multiple readers for rt-mutexes. We
> > got away with that so far, and i'd like to keep it so. Imagine 100
> > threads all blocked in the same critical section (holding the read-lock)
> > when a highprio writer thread comes around: instant 100x latency to let
> > all of them roll forward. The only sane solution is to not allow
> > excessive concurrency. (That limits SMP scalability, but there's no
> > other choice i can see.)
>
> Unless a design change is made: One could argue for a semantics where
> write-locking _isn't_ deterministic and thus do not have to boost all the

RCU isn't write deterministic like typical RT apps are we can... (below :-))

> readers. Readers boost the writers but not the other way around. Readers
> will be deterministic, but not writers.
> Such a semantics would probably work for a lot of RT applications
> happening not to take any write-locks - these will in fact perform better.
> But it will give the rest a lot of problems.

Just came up with an idea after I thought about how much of a bitch it
would be to get a fast RCU multipule reader semantic (our current shared-
exclusive lock inserts owners into a sorted priority list per-thread which
makes it very expensive for a simple RCU case since they are typically very
small batches of items being altered). Basically the RCU algorithm has *no*
notion of writer priority and to propagate a PI operation down all reader
is meaningless, so why not revert back to the original rwlock-semaphore to
get the multipule reader semantics ?

A notion of priority across a quiescience operation is crazy anyways, so
it would be safe just to use to the old rwlock-semaphore "in place" without
any changes or priorty handling addtions. The RCU algorithm is only concerned
with what is basically a coarse data guard and it isn't time or priority
critical.

What do you folks think ? That would make Paul's stuff respect multipule
readers which reduces contention and gets around the problem of possibly
overloading the current rt lock implementation that we've been bitching
about. The current RCU development track seem wrong in the first place and
this seem like it could be a better and more complete solution to the problem.

If this works, well, you heard it here first. :)

bill

2005-03-22 10:18:06

by Bill Huey

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Tue, Mar 22, 2005 at 02:04:46AM -0800, Bill Huey wrote:
> RCU isn't write deterministic like typical RT apps are[, so] we can... (below :-))
...
> Just came up with an idea after I thought about how much of a bitch it
> would be to get a fast RCU multipule reader semantic (our current shared-
> exclusive lock inserts owners into a sorted priority list per-thread which
> makes it very expensive for a simple RCU case[,] since they are typically very
> small batches of items being altered). Basically the RCU algorithm has *no*
> notion of writer priority and to propagate a PI operation down all reader[s]
> is meaningless, so why not revert back to the original rwlock-semaphore to
> get the multipule reader semantics ?

The original lock, for those that don't know, doesn't strictly track read owners
so reentrancy is cheap.

> A notion of priority across a quiescience operation is crazy anyways[-,-] so
> it would be safe just to use to the old rwlock-semaphore "in place" without
> any changes or priorty handling add[i]tions. The RCU algorithm is only concerned
> with what is basically a coarse data guard and it isn't time or priority
> critical.

A little jitter in a quiescence operation isn't going to hurt things right ?.

> What do you folks think ? That would make Paul's stuff respect multipule
> readers which reduces contention and gets around the problem of possibly
> overloading the current rt lock implementation that we've been bitching
> about. The current RCU development track seem wrong in the first place and
> this seem like it could be a better and more complete solution to the problem.

Got to get rid of those typos :)

bill

2005-03-22 10:20:39

by Esben Nielsen

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Tue, 22 Mar 2005, Ingo Molnar wrote:

>
> * Esben Nielsen <[email protected]> wrote:
>
> > On the other hand with a rw-lock being unlimited - and thus do not
> > keep track of it readers - the readers can't be boosted by the writer.
> > Then you are back to square 1: The grace period can take a very long
> > time.
>
> btw., is the 'very long grace period' a real issue? We could avoid all
> the RCU read-side locking latencies by making it truly unlocked and just
> living with the long grace periods. Perhaps it's enough to add an
> emergency mechanism to the OOM handler, which frees up all the 'blocked
> by preemption' RCU callbacks via some scheduler magic. (e.g. such an
> emergency mechanism could be _conditional_ locking on the read side -
> i.e. new RCU read-side users would be blocked until the OOM situation
> goes away, or something like that.)

You won't catch RCU read-sides already entered - see below.

>
> your patch is implementing just that, correct? Would you mind redoing it
> against a recent -RT base? (-40-04 or so)
>

In fact I am working on a clean 2.6.12-rc1 right now. I figured this is
orthogonal to the rest of the RT patch and can probably go upstream
immediately. It seemed to work. I'll try to make it into a clean patch
soonish and also try it on -40-04.
I am trying to make a boosting mechanism in the scheduler such that RCU
readers are boosted to MAX_RT_PRIO when preempted. I have to take it out
first.

Are there any specific tests I should run? I am considering making an RCU
test device.

> also, what would be the worst-case workload causing long grace periods?

A nice-19 task, A, enters an RCU region and is preempted. A lot of other
tasks start running. Task A might then be starved for _minutes_, such that
there are no RCU grace periods in all that time.

>
> Ingo

Esben

2005-03-22 10:35:21

by Bill Huey

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Tue, Mar 22, 2005 at 02:17:27AM -0800, Bill Huey wrote:
> > A notion of priority across a quiescience operation is crazy anyways[-,-] so
> > it would be safe just to use to the old rwlock-semaphore "in place" without
> > any changes or priorty handling add[i]tions. The RCU algorithm is only concerned
> > with what is basically a coarse data guard and it isn't time or priority
> > critical.
>
> A little jitter in a quiescence operation isn't going to hurt things right ?.

The only thing that I can think of that can go wrong here is what kind
of effect it would have on a thread write-blocking against a bunch of
RCU readers. It could introduce a chain of delays into, say, a timer event
and might cause problems/side-effects for other things being processed.
RCU processing might have to be decoupled and processed by a different
thread to avoid some of that latency weirdness.

What do you folks think ?

bill

2005-03-22 10:39:36

by Esben Nielsen

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Tue, 22 Mar 2005, Bill Huey wrote:

> On Fri, Mar 18, 2005 at 05:55:44PM +0100, Esben Nielsen wrote:
> > On Fri, 18 Mar 2005, Ingo Molnar wrote:
> > > i really have no intention to allow multiple readers for rt-mutexes. We
> > > got away with that so far, and i'd like to keep it so. Imagine 100
> > > threads all blocked in the same critical section (holding the read-lock)
> > > when a highprio writer thread comes around: instant 100x latency to let
> > > all of them roll forward. The only sane solution is to not allow
> > > excessive concurrency. (That limits SMP scalability, but there's no
> > > other choice i can see.)
> >
> > Unless a design change is made: One could argue for a semantics where
> > write-locking _isn't_ deterministic and thus do not have to boost all the
>
> RCU isn't write deterministic like typical RT apps are we can... (below :-))

It is: it takes place right away. But _when_ all readers actually see it
is non-deterministic, and so is the cleanup.
So as long as you defer the cleanup instead of actually _wait_ing for it,
you can safely do RCU writes in an RT task.
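
To make the distinction concrete, a rough sketch (update-side locking
omitted; struct foo and its callers are hypothetical):

    struct foo {
        struct list_head list;
        struct rcu_head rcu;
        /* ... payload ... */
    };

    static void foo_free_rcu(struct rcu_head *head)
    {
        kfree(container_of(head, struct foo, rcu));
    }

    /* fine from an RT task: unlink now, defer the cleanup */
    void foo_remove_deferred(struct foo *p)
    {
        list_del_rcu(&p->list);
        call_rcu(&p->rcu, foo_free_rcu);
    }

    /* not fine from an RT task: waits for an unbounded grace period */
    void foo_remove_waiting(struct foo *p)
    {
        list_del_rcu(&p->list);
        synchronize_kernel();
        kfree(p);
    }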

>
> > readers. Readers boost the writers but not the other way around. Readers
> > will be deterministic, but not writers.
> > Such a semantics would probably work for a lot of RT applications
> > happening not to take any write-locks - these will in fact perform better.
> > But it will give the rest a lot of problems.
>
> Just came up with an idea after I thought about how much of a bitch it
> would be to get a fast RCU multipule reader semantic (our current shared-
> exclusive lock inserts owners into a sorted priority list per-thread which
> makes it very expensive for a simple RCU case since they are typically very
> small batches of items being altered). Basically the RCU algorithm has *no*
> notion of writer priority and to propagate a PI operation down all reader
> is meaningless, so why not revert back to the original rwlock-semaphore to
> get the multipule reader semantics ?

Remember to boost the writer such that RT tasks can enter read regions. I
must also warn against the dangers: a lot of code where a write-lock is
taken needs to be marked as non-deterministic, i.e. must not be called
from RT tasks (maybe put a WARN_ON(rt_task(current)) in the write-lock
operation?).
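
i.e. something as simple as this in the write-lock path of the proposed
rwlock-semaphore (sketch only):

    /* writers are non-deterministic - catch RT tasks taking the write side */
    WARN_ON(rt_task(current));
    down_write(sem);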

>
> A notion of priority across a quiescience operation is crazy anyways, so
> it would be safe just to use to the old rwlock-semaphore "in place" without
> any changes or priorty handling addtions. The RCU algorithm is only concerned
> with what is basically a coarse data guard and it isn't time or priority
> critical.

I don't find it crazy. I think it is elegant - but also dangerous as it
might take a long time.

>
> What do you folks think ? That would make Paul's stuff respect multipule
> readers which reduces contention and gets around the problem of possibly
> overloading the current rt lock implementation that we've been bitching
> about. The current RCU development track seem wrong in the first place and
> this seem like it could be a better and more complete solution to the problem.
>
> If this works, well, you heard it here first. :)
>
> bill
>
Esben

2005-03-22 10:56:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU


* Esben Nielsen <[email protected]> wrote:

> +static inline void rcu_read_lock(void)
> +{
> + preempt_disable();
> + __get_cpu_var(rcu_data).active_readers++;
> + preempt_enable();
> +}

this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on
the same CPU, and hence ->active_readers can get out of sync.

Ingo

2005-03-22 11:40:34

by Esben Nielsen

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Tue, 22 Mar 2005, Ingo Molnar wrote:

>
> * Esben Nielsen <[email protected]> wrote:
>
> > +static inline void rcu_read_lock(void)
> > +{
> > + preempt_disable();
> > + __get_cpu_var(rcu_data).active_readers++;
> > + preempt_enable();
> > +}
>
> this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on
> the same CPU, and hence ->active_readers can get out of sync.
>

Ok, this has to be handled in the migration code somehow. I have already
added a
current->rcu_read_depth++
so it ought to be painless. A simple solution would be not to migrate
threads with rcu_read_depth != 0.

> Ingo
>

Esben

2005-03-22 13:11:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU


* Esben Nielsen <[email protected]> wrote:

> > > +static inline void rcu_read_lock(void)
> > > +{
> > > + preempt_disable();
> > > + __get_cpu_var(rcu_data).active_readers++;
> > > + preempt_enable();
> > > +}
> >
> > this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on
> > the same CPU, and hence ->active_readers can get out of sync.
> >
>
> Ok, this have to be handled in the mitigration code somehow. I have already
> added an
> current->rcu_read_depth++
> so it ought to be painless. A simple solution would be not to
> mititagrate threads with rcu_read_depth!=0.

could you elaborate?

In any case, see the new PREEMPT_RCU code in the -40-07 patch (and
upwards). I've also attached a separate patch, it should apply cleanly
to 2.6.12-rc1.

Ingo


Attachments:
preempt-rcu-2.6.12-rc1-A0 (3.41 kB)

2005-03-22 15:10:20

by Esben Nielsen

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Tue, 22 Mar 2005, Ingo Molnar wrote:

>
> * Esben Nielsen <[email protected]> wrote:
>
> > > > +static inline void rcu_read_lock(void)
> > > > +{
> > > > + preempt_disable();
> > > > + __get_cpu_var(rcu_data).active_readers++;
> > > > + preempt_enable();
> > > > +}
> > >
> > > this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on
> > > the same CPU, and hence ->active_readers can get out of sync.
> > >
> >
> > Ok, this have to be handled in the mitigration code somehow. I have already
> > added an
> > current->rcu_read_depth++
> > so it ought to be painless. A simple solution would be not to
> > mititagrate threads with rcu_read_depth!=0.
>
> could you elaborate?
>
Put an rcu_read_depth on each task. In rcu_read_lock() do a
current->rcu_read_depth++;
and vice versa in rcu_read_unlock(). In can_migrate_task() add
if (p->rcu_read_depth)
return 0;
That might do the trick.
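
Spelled out a bit more (placement of the pieces is hypothetical, and the
can_migrate_task() fragment shows only the added check):

    /* include/linux/sched.h: new field in struct task_struct */
        int rcu_read_depth;     /* nesting count of RCU read-side sections */

    /* include/linux/rcupdate.h */
    static inline void rcu_read_lock(void)
    {
        preempt_disable();
        __get_cpu_var(rcu_data).active_readers++;
        current->rcu_read_depth++;
        preempt_enable();
    }

    static inline void rcu_read_unlock(void)
    {
        preempt_disable();
        current->rcu_read_depth--;
        __get_cpu_var(rcu_data).active_readers--;
        preempt_enable();
    }

    /* kernel/sched.c: at the top of can_migrate_task() */
        if (p->rcu_read_depth)
            return 0;   /* don't migrate a preempted RCU reader */

(Other migration paths, e.g. set_cpus_allowed(), would need the same
check.)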


> In any case, see the new PREEMPT_RCU code in the -40-07 patch (and
> upwards). I've also attached a separate patch, it should apply cleanly
> to 2.6.12-rc1.
>

I barely have time to download the patches - let alone apply them!

Anyway, there is one thing I don't like: using atomic_inc()/dec() in
rcu_read_lock()/unlock() to touch rcu_data which might be on another
CPU. Then rcu_data is not really per-CPU data anymore, and it also hurts
the performance of RCU readers.
I think it will be cheaper to use the above rcu_read_depth and then either
not migrate tasks at all, or make the migration code take care of moving
the rcu_read_depth count to the new CPU. One would have to take care to
increment it in the rcu_data of the new CPU, on the new CPU (it isn't
atomic), and only then decrement it in the rcu_data of the old CPU, on the
old CPU - in that order.
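
Roughly like this (function names hypothetical; each step must run on the
CPU named, from the migration code, with preemption off there):

    /* step 1: on the destination CPU, before the task can run there */
    static void rcu_reader_arrive(struct task_struct *p)
    {
        __get_cpu_var(rcu_data).active_readers += p->rcu_read_depth;
    }

    /* step 2: on the source CPU, only after step 1 has completed */
    static void rcu_reader_depart(struct task_struct *p)
    {
        __get_cpu_var(rcu_data).active_readers -= p->rcu_read_depth;
    }

In between the two steps the reader is counted on both CPUs, which merely
delays a grace period a little; being counted on neither CPU would be the
bug.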


> Ingo
>


Esben

2005-03-23 05:40:51

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Tue, Mar 22, 2005 at 09:55:26AM +0100, Esben Nielsen wrote:
> On Mon, 21 Mar 2005, Paul E. McKenney wrote:
[ . . . ]
> > On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
> > This is in some ways similar to the K42 approach to RCU (which they call
> > "generations"). Dipankar put together a similar patch for Linux, but
> > the problem was that grace periods could be deferred for an extremely
> > long time. Which I suspect is what you were calling out as causing
> > RCU batches never to run.
>
> That is where the preempt_by_nonrt_disable/enable() is supposed to help:
> Then it can't take longer than the normal kernel in the situation where
> there is no RT tasks running. RT tasks will prolong the grace periods if
> they go into RCU regions, but they are supposed to be relatively small -
> and deterministic!

The part that I am missing is how this helps in the case where a non-RT
task gets preempted in the middle of an RCU read-side critical section
indefinitely. Or are you boosting the priority of any task that
enters an RCU read-side critical section?

> > > > The counter approach might work, and is also what the implementation #5
> > > > does -- check out rcu_read_lock() in Ingo's most recent patch.
> > > >
> > >
> > > Do you refer to your original mail with implementing it in 5 steps?
> > > In #5 in that one (-V0.7.41-00, right?) you use a lock and as you say that
> > > forces syncronization between the CPUs - bad for scaling. It does make the
> > > RCU batches somewhat deterministic, as the RCU task can boost the readers
> > > to the rcu-task's priority.
> > > The problem about this approach is that everybody calling into RCU code
> > > have a worst case behaviour of the systemwide worst case RCU reader
> > > section - which can be pretty large (in principle infinite if somebody.)
> > > So if somebody uses a call to a function in the code containing a RCU read
> > > area the worst case behavious would be the same as teh worst case latency
> > > in the simple world where preempt_disable()/preempt_enable() was used.
> >
> > I missed something here -- readers would see the worst-case -writer-
> > latency rather than the worst-case -reader- latency, right? Or are you
> > concerned about the case where some reader blocks the write-side
> > acquisitions in _synchronize_kernel(), but not before the writer has
> > grabbed a lock, blocking any readers on the corresponding CPU?
>
> I am conserned that readers block each other, too. You do need an rw-mutex
> allowing an unlimited number of readers for doing this. With the current
> rw-mutex the readers block each other. I.e. the worst case latency is the
> worst case reader latency - globally!
> On the other hand with a rw-lock being unlimited - and thus do not keep
> track of it readers - the readers can't be boosted by the writer. Then you
> are back to square 1: The grace period can take a very long time.

OK, good point.

> > Yes, but this is true of every other lock in the system as well, not?
>
> Other locks are not globaly used but only used for a specific subsystem.
> On a real-time system you are supposed to know which subsystems you can
> call into and still have a low enough latency as each subsystem has it's
> own bound. But with a global RCU locking mechanism all RCU using code is
> to be regarded as _one_ such subsystem.

Yep. As would the things protected by the dcache lock, task list lock,
and so on, right?

Thanx, Paul

> > In a separate conversation a few weeks ago, Steve Hemminger suggested
> > placing checks in long RCU read-side critical sections -- this could
> > be used to keep the worst-case reader latency within the desired bounds.
> > Not pretty, but, then again, bounding lock latency will require a fair
> > number of similar changes, I suspect.
> >
> > Thanx, Paul
> >
>
> Esben
>
>

2005-03-23 11:45:40

by Esben Nielsen

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Tue, 22 Mar 2005, Paul E. McKenney wrote:

> On Tue, Mar 22, 2005 at 09:55:26AM +0100, Esben Nielsen wrote:
> > On Mon, 21 Mar 2005, Paul E. McKenney wrote:
> [ . . . ]
> > > On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
> > > This is in some ways similar to the K42 approach to RCU (which they call
> > > "generations"). Dipankar put together a similar patch for Linux, but
> > > the problem was that grace periods could be deferred for an extremely
> > > long time. Which I suspect is what you were calling out as causing
> > > RCU batches never to run.
> >
> > That is where the preempt_by_nonrt_disable/enable() is supposed to help:
> > Then it can't take longer than the normal kernel in the situation where
> > there is no RT tasks running. RT tasks will prolong the grace periods if
> > they go into RCU regions, but they are supposed to be relatively small -
> > and deterministic!
>
> The part that I am missing is how this helps in the case where a non-RT
> task gets preempted in the middle of an RCU read-side critical section
> indefinitely. Or are you boosting the priority of any task that
> enters an RCU read-side critical section?

Yes, in effect: I set the priority to MAX_RT_PRIO. But actually I am
playing around (when I get time for it, that is :-( ) with a cheaper
solution:
I assume that the regions where you don't want to be
preempted by non-RT tasks are relatively short. Therefore the risk of
getting preempted is small. Moving the priority is expensive since you
need to lock the runqueue. I only want to do the movement when
there is a preemption. Therefore I added code in schedule() to take care
of it: if a task is in an RCU read section, is non-RT, and is preempted,
its priority is set to MAX_RT_PRIO for the time being. It will keep that
priority until the priority is recalculated, but that shouldn't hurt
anyone.
I am not happy about adding code to schedule(), but setting the
priority in there is very cheap because it already has the lock
on the runqueue. Furthermore, I assume it only happens very rarely. In the
execution of schedule() my code only does a single test on whether the
previous task was in an RCU section or not. That is not very much code.

I have not yet tested it (no time :-( )
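
The schedule() hook would look roughly like this (a sketch; the exact
placement, the requeueing details and the rcu_read_depth counter from my
earlier mail are assumed):

    /* kernel/sched.c, in schedule(), where prev is being switched away */
    if (unlikely(prev->rcu_read_depth &&
                 !rt_task(prev) &&
                 prev->state == TASK_RUNNING)) {
        /*
         * A non-RT task preempted inside an RCU read-side section:
         * boost it so the grace period cannot be deferred forever.
         */
        prev->prio = MAX_RT_PRIO;   /* value as described above */
        /* moving prev to the matching priority array omitted */
    }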


> [...]
> > > Yes, but this is true of every other lock in the system as well, not?
> >
> > Other locks are not globaly used but only used for a specific subsystem.
> > On a real-time system you are supposed to know which subsystems you can
> > call into and still have a low enough latency as each subsystem has it's
> > own bound. But with a global RCU locking mechanism all RCU using code is
> > to be regarded as _one_ such subsystem.
>
> Yep. As would the things protected by the dcache lock, task list lock,
> and so on, right?

Yep

>
> Thanx, Paul
>
Esben

2005-03-24 07:05:46

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Real-Time Preemption and RCU

On Wed, Mar 23, 2005 at 12:44:52PM +0100, Esben Nielsen wrote:
> On Tue, 22 Mar 2005, Paul E. McKenney wrote:
>
> > On Tue, Mar 22, 2005 at 09:55:26AM +0100, Esben Nielsen wrote:
> > > On Mon, 21 Mar 2005, Paul E. McKenney wrote:
> > [ . . . ]
> > > > On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
> > > > This is in some ways similar to the K42 approach to RCU (which they call
> > > > "generations"). Dipankar put together a similar patch for Linux, but
> > > > the problem was that grace periods could be deferred for an extremely
> > > > long time. Which I suspect is what you were calling out as causing
> > > > RCU batches never to run.
> > >
> > > That is where the preempt_by_nonrt_disable/enable() is supposed to help:
> > > Then it can't take longer than the normal kernel in the situation where
> > > there is no RT tasks running. RT tasks will prolong the grace periods if
> > > they go into RCU regions, but they are supposed to be relatively small -
> > > and deterministic!
> >
> > The part that I am missing is how this helps in the case where a non-RT
> > task gets preempted in the middle of an RCU read-side critical section
> > indefinitely. Or are you boosting the priority of any task that
> > enters an RCU read-side critical section?
>
> Yes in effect: I set the priority to MAX_RT_PRIO. But actually I am
> playing around (when I get time for it that is :-( ) with cheaper
> solution:
> I assume you enter these regions where you don't want to be
> preempted by non-RT tasks are relatively short. Therefore the risc of
> getting preempted is small. Moving the priority is expensive since you
> need to lock the runqueue. I only want to do the movement when
> there is an preemption. Therefore I added code in schedule() to take care
> of it: If a task is in a rcu-read section, is non-RT and is preempted it's
> priority is set to MAX_RT_PRIO for the time being. It will keep that
> priority until the priority is recalculated, but that shouldn't hurt
> anyone.
> I am not happy about adding code to schedule() but setting the
> priority in there is very cheap because it already has the lock
> on the runqueue. Furthermore, I assume it only happens very rarely. In the
> execution of schedule() my code only takes a single test on wether the
> previous task was in a rcu-section or not. That is not very much code.

Interesting approach -- could come in handy.

> I have not yet tested it (no time :-( )

Well, being as I haven't got the lock-based scheme fully running yet,
I can't give you too much trouble about that. :-/

Thanx, Paul

> > [...]
> > > > Yes, but this is true of every other lock in the system as well, not?
> > >
> > > Other locks are not globaly used but only used for a specific subsystem.
> > > On a real-time system you are supposed to know which subsystems you can
> > > call into and still have a low enough latency as each subsystem has it's
> > > own bound. But with a global RCU locking mechanism all RCU using code is
> > > to be regarded as _one_ such subsystem.
> >
> > Yep. As would the things protected by the dcache lock, task list lock,
> > and so on, right?
>
> Yep
>
> >
> > Thanx, Paul
> >
> Esben
>
>