2008-08-01 21:23:34

by Gregory Haskins

Subject: [PATCH RT RFC 0/7] Priority Inheritance enhancements

** RFC for PREEMPT_RT branch, 26-rt1 **

Hi All,

The following series applies to 26-rt1 as a request-for-comment on a
new approach to priority-inheritance (PI), as well as some performance
enhancements that take advantage of that new approach. This yields at
least a 10-15% improvement for disk I/O on my 4-way x86_64 system. An
8-way system saw as much as a 700% improvement during early testing, but
I have not recently reconfirmed this number.

Motivation for series:

I have several ideas on things we can do to enhance kernel
performance with respect to PREEMPT_RT:

1) For instance, it would be nice to support priority queuing and
(at least positional) inheritance in the wait-queue infrastructure.

2) Reducing overhead in the real-time locks (sleepable replacements for
spinlock_t in PREEMPT_RT) to try to approach the minimal overhead
of their non-rt equivalents. We have determined via instrumentation
that one area of major overhead is the pi-boost logic.

Today, however, the PI code is entwined in the rtmutex infrastructure,
and we need more flexibility if we want to address (1) and (2)
above. Therefore the first step is to separate the PI code from
rtmutex into its own library (libpi). This is covered in patches 1-6.

(I realize patch #6 is a little hard to review, since I removed and added
a lot of code which the unified diff mashes together... I will try
to find a way to make this more readable.)

Patch 7 is the first real consumer of the libpi logic, aimed at enhancing
performance. It accomplishes this by deferring the pi-boost of a lock
owner until it is absolutely necessary. Since instrumentation
shows that the majority of locks are acquired either via the fast-path
or via the adaptive-spin path, we can eliminate most of the pi overhead
with this technique. This yields a measurable performance gain (at least
a 10% improvement was observed in our lab for workloads with heavy
lock contention).
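
To give a rough feel for the idea, here is a purely conceptual sketch (this
is not the patch-7 code itself; every *_sketch() helper below is hypothetical):

    /*
     * Conceptual sketch only: pay the pi-boost/deboost cost solely on the
     * path that actually has to sleep.  Fast-path and adaptive-spin
     * acquisitions never touch the pi logic at all.
     */
    static void rt_lock_acquire_sketch(struct rt_mutex *lock)
    {
            for (;;) {
                    if (rt_lock_trylock_sketch(lock))         /* hypothetical */
                            return;                  /* no boost needed */

                    if (lock_owner_is_running_sketch(lock)) { /* hypothetical */
                            cpu_relax();     /* adaptive spin: still no boost */
                            continue;
                    }

                    /* We really have to sleep: boost the owner only now */
                    boost_lock_owner_sketch(lock);            /* hypothetical */
                    sleep_on_lock_sketch(lock);               /* hypothetical */
                    deboost_lock_owner_sketch(lock);          /* hypothetical */
            }
    }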

For your convenience, you can also find these patches in a git tree here:

git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git pi-rework

We have not yet completed the work on the pi-waitqueues or any of the other
related pi enhancements. Those will be coming in a follow-on announcement.

Feedback/comments welcome!

Regards,
-Greg


2008-08-01 21:23:54

by Gregory Haskins

Subject: [PATCH RT RFC 1/7] add generalized priority-inheritance interface

The kernel currently addresses priority-inversion through priority-
inheritance. However, all of the priority-inheritance logic is
integrated into the Real-Time Mutex infrastructure. This causes a few
problems:

1) This tightly coupled relationship makes it difficult to extend to
other areas of the kernel (for instance, pi-aware wait-queues may
be desirable).
2) Enhancing the rtmutex infrastructure becomes challenging because
there is no separation between the locking code and the pi code.

This patch aims to rectify these shortcomings by designing a stand-alone
pi framework which can then be used to replace the rtmutex-specific
version. The goal of this framework is to provide similar functionality
to the existing subsystem, but with a sole focus on PI and the
relationships between the objects that can boost priority and the
objects that get boosted.

We introduce the concepts of a "pi_source" and a "pi_sink", which, as the
names suggest, provide the basic relationship of a priority source and
its boosted target. A pi_source acts as a reference to some arbitrary
source of priority, and a pi_sink can be boosted (or deboosted) by
a pi_source. For more details, please read the library documentation.

There are currently no users of this interface.
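
As a quick illustration of the intended usage, here is a minimal sketch
(the "my_object" structure and its callbacks are purely hypothetical and
not part of this patch; the pattern mirrors what a later patch does for
task_struct):

    #include <linux/kernel.h>
    #include <linux/sched.h>
    #include <linux/pi.h>

    struct my_object {                      /* hypothetical consumer */
            struct pi_node node;            /* our link in a pi-chain */
            struct pi_sink snk;             /* leaf-sink: observes boosts */
            int prio;                       /* last priority we were told */
    };

    /* leaf-sink callback: simply remember the new priority */
    static int my_object_boost_cb(struct pi_sink *snk, struct pi_source *src,
                                  unsigned int flags)
    {
            struct my_object *obj = container_of(snk, struct my_object, snk);

            obj->prio = *src->prio;
            return 0;
    }

    static struct pi_sink my_object_sink = {
            .boost = my_object_boost_cb,
    };

    static void my_object_init(struct my_object *obj)
    {
            pi_node_init(&obj->node);
            obj->prio = MAX_PRIO;
            obj->snk = my_object_sink;

            /* get notified whenever the node's aggregate priority changes */
            pi_add_sink(&obj->node, &obj->snk,
                        PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
    }

    /*
     * A boosting entity owns a pi_source referencing its own priority
     * value.  It registers the source once, and simply calls pi_boost()
     * again whenever the referenced value changes.
     */
    static void my_object_add_boost(struct my_object *obj,
                                    struct pi_source *src, int *prio)
    {
            pi_source_init(src, prio);      /* once per source */
            pi_boost(&obj->node, src, 0);   /* 0 == synchronous chain update */
    }

    static void my_object_refresh_boost(struct my_object *obj,
                                        struct pi_source *src)
    {
            pi_boost(&obj->node, src, 0);   /* re-boost after *prio changed */
    }

pi_deboost() and pi_del_sink() undo the source and sink registrations,
respectively.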

Signed-off-by: Gregory Haskins <[email protected]>
---

Documentation/libpi.txt | 59 +++++++
include/linux/pi.h | 226 +++++++++++++++++++++++++++
lib/Makefile | 3
lib/pi.c | 398 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 685 insertions(+), 1 deletions(-)

diff --git a/Documentation/libpi.txt b/Documentation/libpi.txt
new file mode 100644
index 0000000..197b21a
--- /dev/null
+++ b/Documentation/libpi.txt
@@ -0,0 +1,59 @@
+lib/pi.c - Priority Inheritance library
+
+Sources and sinks:
+------------------
+
+This library introduces the basic concepts of a "pi_source" and a "pi_sink", which, as the names suggest, provide the basic relationship of a priority source and its boosted target.
+
+A pi_source is simply a reference to some arbitrary priority value that may range from 0 (highest prio) to MAX_PRIO (currently 140, lowest prio). A pi_source calls pi_sink.boost() whenever it wishes to boost the sink to (at least minimally) the priority value that the source represents. It uses pi_sink.boost() both for the initial boost and for any subsequent refreshes of the value (even if the value is decreasing in logical priority). The policy of the sink will dictate what happens as a result of that boost. Likewise, a pi_source calls pi_sink.deboost() to stop contributing to the sink's minimum priority.
+
+It is important to note that a source is a reference to a priority value, not a value itself. This is one of the concepts that allows the interface to be idempotent, which is important for updating a chain of sources and sinks in the proper order. If we passed the priority by value on the stack, the order in which the system executes could allow the actual value that gets set to race.
+
+Nodes:
+
+A pi_node is a convenience object which is simultaneously a source and a sink. As its name suggests, it would typically be deployed as a node in a pi-chain. Other pi_sources can boost a node via its pi_sink.boost() interface. Likewise, a node can boost a fixed number of sinks via the node.add_sink() interface.
+
+Generally speaking, a node takes care of many common operations associated with being a "link in the chain", such as:
+
+ 1) determining the current priority of the node based on the (logically) highest priority source that is boosting the node.
+ 2) boosting/deboosting upstream sinks whenever the node locally changes priority.
+ 3) taking care to avoid deadlock during a chain update.
+
+Design details:
+
+Destruction:
+
+The pi-library objects are designed to be implicitly-destructible (meaning they do not require an explicit "free()" operation when they are no longer used). This is important considering their intended use (spinlock_t's, which are also implicitly-destructible). As such, any allocations needed for operation must come from internal structure storage, as there will be no opportunity to free them later.
+
+Multiple sinks per Node:
+
+We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation, which had the notion of only a single sink (i.e. "task->pi_blocked_on"). The reason we added the ability to add more than one sink was not to change the default chaining model (i.e. multiple boost targets), but rather to add a flexible notification mechanism that is peripheral to the chain, informally called "leaf sinks".
+
+Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints of a priority boost. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation-specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary, whereas an rwlock leaf-sink may boost a list of reader-owners.
+
+The following diagram depicts an example relationship (warning: cheesy ascii art)
+
+ --------- ---------
+ | leaf | | leaf |
+ --------- ---------
+ / /
+ --------- / ---------- / --------- ---------
+ ->-| node |->---| node |-->---| node |->---| leaf |
+ --------- ---------- --------- ---------
+
+This was done to unify the notion of a "sink" behind a single interface, rather than having something like task->pi_blocked_on plus a separate callback for the leaf action. Instead, any downstream object can be represented by a sink, and the implementation details are hidden (e.g. a task, a lock, a node, a work-item, a wait-queue, etc).
+
+Sinkrefs:
+
+Each pi_sink.boost() operation is represented by a unique pi_source to properly facilitate a one-node-to-many-source relationship. Therefore, if a pi_node is to act as an aggregator for multiple sinks, it implicitly must have one internal pi_source object for every sink that is added (via node.add_sink()). This pi_source object has to be internally managed for the lifetime of the sink reference.
+
+Recall that due to the implicit-destruction requirement above, and the fact that we will typically be executing in a preempt-disabled region, we have to be very careful about how we allocate references to those sinks. More on that below, but the long and short of it is that we limit the number of sinks to MAX_PI_DEPENDENCIES (currently 5).
+
+Locking:
+
+(work in progress....)
+
+
+
+
+
diff --git a/include/linux/pi.h b/include/linux/pi.h
new file mode 100644
index 0000000..80b8d96
--- /dev/null
+++ b/include/linux/pi.h
@@ -0,0 +1,226 @@
+/*
+ * see Documentation/libpi.txt for details
+ */
+
+#ifndef _LINUX_PI_H
+#define _LINUX_PI_H
+
+#include <linux/list.h>
+#include <linux/plist.h>
+#include <asm/atomic.h>
+
+#define MAX_PI_DEPENDENCIES 5
+
+struct pi_source {
+ struct plist_node list;
+ int *prio;
+ int boosted;
+};
+
+
+#define PI_FLAG_DEFER_UPDATE (1 << 0)
+#define PI_FLAG_ALREADY_BOOSTED (1 << 1)
+
+struct pi_sink {
+ int (*boost)(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags);
+ int (*deboost)(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags);
+ int (*update)(struct pi_sink *snk,
+ unsigned int flags);
+};
+
+enum pi_state {
+ pi_state_boost,
+ pi_state_boosted,
+ pi_state_deboost,
+ pi_state_free,
+};
+
+/*
+ * NOTE: PI must always use a true (e.g. raw) spinlock, since it is used by
+ * rtmutex infrastructure.
+ */
+
+struct pi_sinkref {
+ raw_spinlock_t lock;
+ struct list_head list;
+ enum pi_state state;
+ struct pi_sink *snk;
+ struct pi_source src;
+ atomic_t refs;
+};
+
+struct pi_sinkref_pool {
+ struct list_head free;
+ struct pi_sinkref data[MAX_PI_DEPENDENCIES];
+ int count;
+};
+
+struct pi_node {
+ raw_spinlock_t lock;
+ int prio;
+ struct pi_sink snk;
+ struct pi_sinkref_pool sinkref_pool;
+ struct list_head snks;
+ struct plist_head srcs;
+};
+
+/**
+ * pi_node_init - initialize a pi_node before use
+ * @node: a node context
+ */
+extern void pi_node_init(struct pi_node *node);
+
+/**
+ * pi_add_sink - add a sink as a downstream object
+ * @node: the node context
+ * @snk: the sink context to add to the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ * PI_FLAG_ALREADY_BOOSTED - Do not perform initial boosting
+ *
+ * This function registers a sink to get notified whenever the
+ * node changes priority.
+ *
+ * Note: By default, this function will schedule the newly added sink
+ * to get an initial boost notification on the next update (even
+ * without the presence of a priority transition). However, if the
+ * ALREADY_BOOSTED flag is specified, the sink is initially marked as
+ * BOOSTED and will only get notified if the node changes priority
+ * in the future.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+extern int pi_add_sink(struct pi_node *node, struct pi_sink *snk,
+ unsigned int flags);
+
+/**
+ * pi_del_sink - del a sink from the current downstream objects
+ * @node: the node context
+ * @snk: the sink context to delete from the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function unregisters a sink from the node.
+ *
+ * Note: The sink will not actually become fully deboosted until
+ * a call to node.update() successfully returns.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+extern int pi_del_sink(struct pi_node *node, struct pi_sink *snk,
+ unsigned int flags);
+
+/**
+ * pi_source_init - initialize a pi_source before use
+ * @src: a src context
+ * @prio: pointer to a priority value
+ *
+ * A pointer to a priority value is used so that boost and update
+ * are fully idempotent.
+ */
+static inline void
+pi_source_init(struct pi_source *src, int *prio)
+{
+ plist_node_init(&src->list, *prio);
+ src->prio = prio;
+ src->boosted = 0;
+}
+
+/**
+ * pi_boost - boost a node with a pi_source
+ * @node: the node context
+ * @src: the src context to boost the node with
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function registers a priority source with the node, possibly
+ * boosting its value if the new source is the highest registered source.
+ *
+ * This function is used to both initially register a source, as well as
+ * to notify the node if the value changes in the future (even if the
+ * priority is decreasing).
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_boost(struct pi_node *node, struct pi_source *src, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ if (snk->boost)
+ return snk->boost(snk, src, flags);
+
+ return 0;
+}
+
+/**
+ * pi_deboost - deboost a pi_source from a node
+ * @node: the node context
+ * @src: the src context to remove from the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function unregisters a priority source from the node, possibly
+ * deboosting its value if the departing source was the highest
+ * registered source.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_deboost(struct pi_node *node, struct pi_source *src, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ if (snk->deboost)
+ return snk->deboost(snk, src, flags);
+
+ return 0;
+}
+
+/**
+ * pi_update - force a manual chain update
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * This function will push any priority changes (as a result of
+ * boost/deboost or add_sink/del_sink) down through the chain.
+ * If no changes are necessary, this function is a no-op.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_update(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ if (snk->update)
+ return snk->update(snk, flags);
+
+ return 0;
+}
+
+#endif /* _LINUX_PI_H */
diff --git a/lib/Makefile b/lib/Makefile
index 5187924..df81ad7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -23,7 +23,8 @@ lib-$(CONFIG_SMP) += cpumask.o
lib-y += kobject.o kref.o klist.o

obj-y += div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
- bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o
+ bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \
+ pi.o

ifeq ($(CONFIG_DEBUG_KOBJECT),y)
CFLAGS_kobject.o += -DDEBUG
diff --git a/lib/pi.c b/lib/pi.c
new file mode 100644
index 0000000..a1646db
--- /dev/null
+++ b/lib/pi.c
@@ -0,0 +1,398 @@
+/*
+ * lib/pi.c
+ *
+ * Priority-Inheritance library
+ *
+ * Copyright (C) 2008 Novell
+ *
+ * Author: Gregory Haskins <[email protected]>
+ *
+ * This code provides a generic framework for preventing priority
+ * inversion by means of priority-inheritance. (see Documentation/libpi.txt
+ * for details)
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <linux/sched.h>
+#include <linux/module.h>
+#include <linux/pi.h>
+
+/*
+ *-----------------------------------------------------------
+ * pi_sinkref_pool
+ *-----------------------------------------------------------
+ */
+
+static void
+pi_sinkref_pool_init(struct pi_sinkref_pool *pool)
+{
+ int i;
+
+ INIT_LIST_HEAD(&pool->free);
+ pool->count = 0;
+
+ for (i = 0; i < MAX_PI_DEPENDENCIES; ++i) {
+ struct pi_sinkref *sinkref = &pool->data[i];
+
+ memset(sinkref, 0, sizeof(*sinkref));
+ INIT_LIST_HEAD(&sinkref->list);
+ list_add_tail(&sinkref->list, &pool->free);
+ pool->count++;
+ }
+}
+
+static struct pi_sinkref *
+pi_sinkref_alloc(struct pi_sinkref_pool *pool)
+{
+ struct pi_sinkref *sinkref;
+
+ BUG_ON(!pool->count);
+
+ if (list_empty(&pool->free))
+ return NULL;
+
+ sinkref = list_first_entry(&pool->free, struct pi_sinkref, list);
+ list_del(&sinkref->list);
+ memset(sinkref, 0, sizeof(*sinkref));
+ pool->count--;
+
+ return sinkref;
+}
+
+static void
+pi_sinkref_free(struct pi_sinkref_pool *pool,
+ struct pi_sinkref *sinkref)
+{
+ list_add_tail(&sinkref->list, &pool->free);
+ pool->count++;
+}
+
+/*
+ *-----------------------------------------------------------
+ * pi_sinkref
+ *-----------------------------------------------------------
+ */
+
+static inline void
+_pi_sink_addref(struct pi_sinkref *sinkref)
+{
+ atomic_inc(&sinkref->refs);
+}
+
+static inline void
+_pi_sink_dropref(struct pi_node *node, struct pi_sinkref *sinkref)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&node->lock, flags);
+
+ if (atomic_dec_and_test(&sinkref->refs)) {
+ list_del(&sinkref->list);
+ pi_sinkref_free(&node->sinkref_pool, sinkref);
+ }
+
+ spin_unlock_irqrestore(&node->lock, flags);
+}
+
+/*
+ *-----------------------------------------------------------
+ * pi_node
+ *-----------------------------------------------------------
+ */
+
+static struct pi_node *node_of(struct pi_sink *snk)
+{
+ return container_of(snk, struct pi_node, snk);
+}
+
+static inline void
+__pi_boost(struct pi_node *node, struct pi_source *src)
+{
+ BUG_ON(src->boosted);
+
+ plist_node_init(&src->list, *src->prio);
+ plist_add(&src->list, &node->srcs);
+ src->boosted = 1;
+}
+
+static inline void
+__pi_deboost(struct pi_node *node, struct pi_source *src)
+{
+ BUG_ON(!src->boosted);
+
+ plist_del(&src->list, &node->srcs);
+ src->boosted = 0;
+}
+
+static int
+_pi_node_update(struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ unsigned long iflags;
+ int prio;
+ int count = 0;
+ int i;
+ struct pi_sinkref *sinkref;
+ struct pi_sinkref *sinkrefs[MAX_PI_DEPENDENCIES];
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ if (!plist_head_empty(&node->srcs))
+ prio = plist_first(&node->srcs)->prio;
+ else
+ prio = MAX_PRIO;
+
+ list_for_each_entry(sinkref, &node->snks, list) {
+ /*
+ * If the priority is changing, or if this is a BOOST/DEBOOST
+ */
+ if (node->prio != prio
+ || sinkref->state != pi_state_boosted) {
+
+ BUG_ON(!atomic_read(&sinkref->refs));
+ _pi_sink_addref(sinkref);
+
+ sinkrefs[count++] = sinkref;
+ }
+ }
+
+ node->prio = prio;
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ /*
+ * Perform the actual operation on each sink
+ */
+ for (i = 0; i < count; ++i) {
+ struct pi_sink *snk;
+ unsigned int lflags = 0;
+ int update = 0;
+
+ sinkref = sinkrefs[i];
+ snk = sinkref->snk;
+
+ spin_lock_irqsave(&sinkref->lock, iflags);
+
+ if (snk->update) {
+ lflags |= PI_FLAG_DEFER_UPDATE;
+ update = 1;
+ }
+
+ switch (sinkref->state) {
+ case pi_state_boost:
+ sinkref->state = pi_state_boosted;
+ /* Fall through */
+ case pi_state_boosted:
+ snk->boost(snk, &sinkref->src, lflags);
+ break;
+ case pi_state_deboost:
+ snk->deboost(snk, &sinkref->src, lflags);
+ sinkref->state = pi_state_free;
+
+ /*
+ * drop the ref that we took when the sinkref
+ * was allocated. We still hold a ref from
+ * above for the duration of the update
+ */
+ atomic_dec(&sinkref->refs);
+ break;
+ case pi_state_free:
+ update = 0;
+ break;
+ default:
+ panic("illegal sinkref type: %d", sinkref->state);
+ }
+
+ spin_unlock_irqrestore(&sinkref->lock, iflags);
+
+ if (update)
+ snk->update(snk, 0);
+
+ _pi_sink_dropref(node, sinkref);
+ }
+
+ return 0;
+}
+
+static int
+_pi_node_boost(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+ if (src->boosted)
+ __pi_deboost(node, src);
+ __pi_boost(node, src);
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(snk, 0);
+
+ return 0;
+}
+
+static int
+_pi_node_deboost(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+ __pi_deboost(node, src);
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(snk, 0);
+
+ return 0;
+}
+
+static struct pi_sink pi_node_snk = {
+ .boost = _pi_node_boost,
+ .deboost = _pi_node_deboost,
+ .update = _pi_node_update,
+};
+
+void pi_node_init(struct pi_node *node)
+{
+ spin_lock_init(&node->lock);
+ node->prio = MAX_PRIO;
+ node->snk = pi_node_snk;
+ pi_sinkref_pool_init(&node->sinkref_pool);
+ INIT_LIST_HEAD(&node->snks);
+ plist_head_init(&node->srcs, &node->lock);
+}
+
+int pi_add_sink(struct pi_node *node, struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_sinkref *sinkref;
+ int ret = 0;
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ sinkref = pi_sinkref_alloc(&node->sinkref_pool);
+ if (!sinkref) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ spin_lock_init(&sinkref->lock);
+ INIT_LIST_HEAD(&sinkref->list);
+
+ if (flags & PI_FLAG_ALREADY_BOOSTED)
+ sinkref->state = pi_state_boosted;
+ else
+ /*
+ * Schedule it for addition at the next update
+ */
+ sinkref->state = pi_state_boost;
+
+ pi_source_init(&sinkref->src, &node->prio);
+ sinkref->snk = snk;
+
+ /* set one ref from ourselves. It will be dropped on del_sink */
+ atomic_set(&sinkref->refs, 1);
+
+ list_add_tail(&sinkref->list, &node->snks);
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->snk, 0);
+
+ return 0;
+
+ out:
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ return ret;
+}
+
+int pi_del_sink(struct pi_node *node, struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_sinkref *sinkref;
+ struct pi_sinkref *sinkrefs[MAX_PI_DEPENDENCIES];
+ unsigned long iflags;
+ int count = 0;
+ int i;
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ /*
+ * There may be multiple matches to snk because sometimes a
+ * deboost/free may still be pending an update when the same
+ * sink has been added again. So we want to process all instances
+ */
+ list_for_each_entry(sinkref, &node->snks, list) {
+ if (sinkref->snk == snk) {
+ _pi_sink_addref(sinkref);
+ sinkrefs[count++] = sinkref;
+ }
+ }
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ for (i = 0; i < count; ++i) {
+ int remove = 0;
+
+ sinkref = sinkrefs[i];
+
+ spin_lock_irqsave(&sinkref->lock, iflags);
+
+ switch (sinkref->state) {
+ case pi_state_boost:
+ /*
+ * This state indicates the sink was never formally
+ * boosted so we can just delete it immediately
+ */
+ remove = 1;
+ break;
+ case pi_state_boosted:
+ if (sinkref->snk->deboost)
+ /*
+ * If the sink supports deboost notification,
+ * schedule it for deboost at the next update
+ */
+ sinkref->state = pi_state_deboost;
+ else
+ /*
+ * ..otherwise schedule it for immediate
+ * removal
+ */
+ remove = 1;
+ break;
+ default:
+ break;
+ }
+
+ if (remove) {
+ /*
+ * drop the ref that we took when the sinkref
+ * was allocated. We still hold a ref from
+ * above for the duration of the update
+ */
+ atomic_dec(&sinkref->refs);
+ sinkref->state = pi_state_free;
+ }
+
+ spin_unlock_irqrestore(&sinkref->lock, iflags);
+
+ _pi_sink_dropref(node, sinkref);
+ }
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->snk, 0);
+
+ return 0;
+}
+
+
+

2008-08-01 21:24:25

by Gregory Haskins

Subject: [PATCH RT RFC 2/7] sched: add the basic PI infrastructure to the task_struct

This is a first pass at converting the system to use the new PI library.
We don't go for a wholesale replacement quite yet so that we can focus
on getting the basic plumbing in place. Later in the series we will
begin replacing some of the proprietary logic with the generic
framework.
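
To make the pattern concrete, here is a minimal sketch of how a subsystem
boosts a task with the new helpers (the "my_booster" names are
hypothetical; the rcu_prio/rcu_prio_src and rtmutex_prio/rtmutex_prio_src
conversions in the diff below follow this same shape):

    #include <linux/sched.h>
    #include <linux/pi.h>

    struct my_booster {                     /* hypothetical */
            int prio;                       /* value we want the task to see */
            struct pi_source src;           /* reference handed to the task */
    };

    static void my_booster_init(struct my_booster *b)
    {
            b->prio = MAX_PRIO;             /* "not boosting" */
            pi_source_init(&b->src, &b->prio);
    }

    /* initial boost, or refresh after the value changes (up or down) */
    static void my_boost(struct my_booster *b, struct task_struct *p, int prio)
    {
            b->prio = prio;
            task_pi_boost(p, &b->src, 0);   /* 0 == synchronous update */
    }

    /* stop contributing to p's priority altogether */
    static void my_unboost(struct my_booster *b, struct task_struct *p)
    {
            task_pi_deboost(p, &b->src, 0);
    }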

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/sched.h | 37 +++++++--
include/linux/workqueue.h | 2
kernel/fork.c | 1
kernel/rcupreempt-boost.c | 18 +---
kernel/rtmutex.c | 6 +
kernel/sched.c | 188 ++++++++++++++++++++++++++++++++-------------
kernel/workqueue.c | 39 ++++++++-
7 files changed, 207 insertions(+), 84 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c885f78..63ddd1f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -87,6 +87,7 @@ struct sched_param {
#include <linux/task_io_accounting.h>
#include <linux/kobject.h>
#include <linux/latencytop.h>
+#include <linux/pi.h>

#include <asm/processor.h>

@@ -1125,6 +1126,7 @@ struct task_struct {
int prio, static_prio, normal_prio;
#ifdef CONFIG_PREEMPT_RCU_BOOST
int rcu_prio;
+ struct pi_source rcu_prio_src;
#endif
const struct sched_class *sched_class;
struct sched_entity se;
@@ -1298,11 +1300,20 @@ struct task_struct {
/* Protection of the PI data structures: */
raw_spinlock_t pi_lock;

+ struct {
+ struct pi_source src; /* represents normal_prio to 'this' */
+ struct pi_node node;
+ struct pi_sink snk; /* registered to 'this' to get updates */
+ int prio;
+ } pi;
+
#ifdef CONFIG_RT_MUTEXES
/* PI waiters blocked on a rt_mutex held by this task */
struct plist_head pi_waiters;
/* Deadlock detection and priority inheritance handling */
struct rt_mutex_waiter *pi_blocked_on;
+ int rtmutex_prio;
+ struct pi_source rtmutex_prio_src;
#endif

#ifdef CONFIG_DEBUG_MUTEXES
@@ -1440,6 +1451,26 @@ struct task_struct {
#endif
};

+static inline int
+task_pi_boost(struct task_struct *p, struct pi_source *src,
+ unsigned int flags)
+{
+ return pi_boost(&p->pi.node, src, flags);
+}
+
+static inline int
+task_pi_deboost(struct task_struct *p, struct pi_source *src,
+ unsigned int flags)
+{
+ return pi_deboost(&p->pi.node, src, flags);
+}
+
+static inline int
+task_pi_update(struct task_struct *p, unsigned int flags)
+{
+ return pi_update(&p->pi.node, flags);
+}
+
#ifdef CONFIG_PREEMPT_RT
# define set_printk_might_sleep(x) do { current->in_printk = x; } while(0)
#else
@@ -1774,14 +1805,8 @@ int sched_rt_handler(struct ctl_table *table, int write,

extern unsigned int sysctl_sched_compat_yield;

-extern void task_setprio(struct task_struct *p, int prio);
-
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
-static inline void rt_mutex_setprio(struct task_struct *p, int prio)
-{
- task_setprio(p, prio);
-}
extern void rt_mutex_adjust_pi(struct task_struct *p);
#else
static inline int rt_mutex_getprio(struct task_struct *p)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 229179e..3dc4ed9 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -11,6 +11,7 @@
#include <linux/lockdep.h>
#include <linux/plist.h>
#include <linux/sched_prio.h>
+#include <linux/pi.h>
#include <asm/atomic.h>

struct workqueue_struct;
@@ -31,6 +32,7 @@ struct work_struct {
#define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)
struct plist_node entry;
work_func_t func;
+ struct pi_source pi_src;
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index b49488d..399a0d0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -990,6 +990,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->rcu_flipctr_idx = 0;
#ifdef CONFIG_PREEMPT_RCU_BOOST
p->rcu_prio = MAX_PRIO;
+ pi_source_init(&p->rcu_prio_src, &p->rcu_prio);
p->rcub_rbdp = NULL;
p->rcub_state = RCU_BOOST_IDLE;
INIT_LIST_HEAD(&p->rcub_entry);
diff --git a/kernel/rcupreempt-boost.c b/kernel/rcupreempt-boost.c
index 5282b19..9373b9e 100644
--- a/kernel/rcupreempt-boost.c
+++ b/kernel/rcupreempt-boost.c
@@ -236,10 +236,8 @@ static void rcu_boost_task(struct task_struct *task)

rcu_trace_boost_task_boost_called(RCU_BOOST_ME);

- if (task->rcu_prio < task->prio) {
+ if (task_pi_boost(task, &task->rcu_prio_src, 0))
rcu_trace_boost_task_boosted(RCU_BOOST_ME);
- task_setprio(task, task->rcu_prio);
- }
}

/**
@@ -281,14 +279,8 @@ void __rcu_preempt_boost(void)

rcu_trace_boost_try_boost(rbd);

- prio = rt_mutex_getprio(curr);
-
if (list_empty(&curr->rcub_entry))
list_add_tail(&curr->rcub_entry, &rbd->rbs_toboost);
- if (prio <= rbd->rbs_prio)
- goto out;
-
- rcu_trace_boost_boosted(curr->rcub_rbdp);

set_rcu_prio(curr, rbd->rbs_prio);
rcu_boost_task(curr);
@@ -353,11 +345,11 @@ void __rcu_preempt_unboost(void)

rcu_trace_boost_unboosted(rbd);

+ task_pi_deboost(curr, &curr->rcu_prio_src, 0);
+
set_rcu_prio(curr, MAX_PRIO);

spin_lock(&curr->pi_lock);
- prio = rt_mutex_getprio(curr);
- task_setprio(curr, prio);

curr->rcub_rbdp = NULL;

@@ -393,9 +385,7 @@ static int __rcu_boost_readers(struct rcu_boost_dat *rbd, int prio, unsigned lon
list_move_tail(&p->rcub_entry,
&rbd->rbs_boosted);
set_rcu_prio(p, prio);
- spin_lock(&p->pi_lock);
- rcu_boost_task(p);
- spin_unlock(&p->pi_lock);
+ task_pi_boost(p, &p->rcu_prio_src, 0);

/*
* Now we release the lock to allow for a higher
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 377949a..7d11380 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -178,8 +178,10 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
{
int prio = rt_mutex_getprio(task);

- if (task->prio != prio)
- rt_mutex_setprio(task, prio);
+ if (task->rtmutex_prio != prio) {
+ task->rtmutex_prio = prio;
+ task_pi_boost(task, &task->rtmutex_prio_src, 0);
+ }
}

/*
diff --git a/kernel/sched.c b/kernel/sched.c
index 54ea580..c129b10 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1709,26 +1709,6 @@ static inline int normal_prio(struct task_struct *p)
}

/*
- * Calculate the current priority, i.e. the priority
- * taken into account by the scheduler. This value might
- * be boosted by RT tasks, or might be boosted by
- * interactivity modifiers. Will be RT if the task got
- * RT-boosted. If not then it returns p->normal_prio.
- */
-static int effective_prio(struct task_struct *p)
-{
- p->normal_prio = normal_prio(p);
- /*
- * If we are RT tasks or we were boosted to RT priority,
- * keep the priority unchanged. Otherwise, update priority
- * to the normal priority:
- */
- if (!rt_prio(p->prio))
- return p->normal_prio;
- return p->prio;
-}
-
-/*
* activate_task - move a task to the runqueue.
*/
static void activate_task(struct rq *rq, struct task_struct *p, int wakeup)
@@ -2375,6 +2355,58 @@ static void __sched_fork(struct task_struct *p)
p->state = TASK_RUNNING;
}

+static int
+task_pi_boost_cb(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags)
+{
+ struct task_struct *p = container_of(snk, struct task_struct, pi.snk);
+
+ /*
+ * We don't need any locking here, since the .boost operation
+ * is already guaranteed to be mutually exclusive
+ */
+ p->pi.prio = *src->prio;
+
+ return 0;
+}
+
+static int task_pi_update_cb(struct pi_sink *snk, unsigned int flags);
+
+static struct pi_sink task_pi_sink = {
+ .boost = task_pi_boost_cb,
+ .update = task_pi_update_cb,
+};
+
+static inline void
+task_pi_init(struct task_struct *p)
+{
+ pi_node_init(&p->pi.node);
+
+ /*
+ * Feed our initial state of normal_prio into the PI infrastructure.
+ * We will update this whenever it changes
+ */
+ p->pi.prio = p->normal_prio;
+ pi_source_init(&p->pi.src, &p->normal_prio);
+ task_pi_boost(p, &p->pi.src, PI_FLAG_DEFER_UPDATE);
+
+#ifdef CONFIG_RT_MUTEXES
+ p->rtmutex_prio = MAX_PRIO;
+ pi_source_init(&p->rtmutex_prio_src, &p->rtmutex_prio);
+ task_pi_boost(p, &p->rtmutex_prio_src, PI_FLAG_DEFER_UPDATE);
+#endif
+
+ /*
+ * We add our own task as a dependency of ourselves so that
+ * we get boost-notifications (via task_pi_boost_cb) whenever
+ * our priority is changed (locally e.g. setscheduler() or
+ * remotely via a pi-boost).
+ */
+ p->pi.snk = task_pi_sink;
+ pi_add_sink(&p->pi.node, &p->pi.snk,
+ PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
+}
+
/*
* fork()/clone()-time setup:
*/
@@ -2396,6 +2428,8 @@ void sched_fork(struct task_struct *p, int clone_flags)
if (!rt_prio(p->prio))
p->sched_class = &fair_sched_class;

+ task_pi_init(p);
+
#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
if (likely(sched_info_on()))
memset(&p->sched_info, 0, sizeof(p->sched_info));
@@ -2411,6 +2445,55 @@ void sched_fork(struct task_struct *p, int clone_flags)
}

/*
+ * In the past, task_setprio was exposed as an API. This variant is only
+ * meant to be called from pi_update functions (namely, task_updateprio() and
+ * task_pi_update_cb()). If you need to adjust the priority of a task,
+ * you should be using something like setscheduler() (permanent adjustments)
+ * or task_pi_boost() (temporary adjustments).
+ */
+static void
+task_setprio(struct task_struct *p, int prio)
+{
+ if (prio == p->prio)
+ return;
+
+ if (rt_prio(prio))
+ p->sched_class = &rt_sched_class;
+ else
+ p->sched_class = &fair_sched_class;
+
+ p->prio = prio;
+}
+
+static inline void
+task_updateprio(struct task_struct *p)
+{
+ int prio = normal_prio(p);
+
+ if (p->normal_prio != prio) {
+ p->normal_prio = prio;
+ set_load_weight(p);
+
+ /*
+ * Reboost our normal_prio entry, which will
+ * also chain-update any of our PI dependencies (of course)
+ * on our next update
+ */
+ task_pi_boost(p, &p->pi.src, PI_FLAG_DEFER_UPDATE);
+ }
+
+ /*
+ * If normal_prio is logically higher than our current setting,
+ * just assign the priority/class immediately so that any callers
+ * will see the update as synchronous without dropping the rq-lock
+ * to do a pi_update. Any discrepancy with pending pi-updates will
+ * automatically be corrected after we drop the rq-lock.
+ */
+ if (p->normal_prio < p->prio)
+ task_setprio(p, p->normal_prio);
+}
+
+/*
* wake_up_new_task - wake up a newly created task for the first time.
*
* This function will do some initial scheduler statistics housekeeping
@@ -2426,7 +2509,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
BUG_ON(p->state != TASK_RUNNING);
update_rq_clock(rq);

- p->prio = effective_prio(p);
+ task_updateprio(p);

if (!p->sched_class->task_new || !current->se.on_rq) {
activate_task(rq, p, 0);
@@ -2447,6 +2530,8 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
p->sched_class->task_wake_up(rq, p);
#endif
task_rq_unlock(rq, &flags);
+
+ task_pi_update(p, 0);
}

#ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -4887,27 +4972,25 @@ long __sched sleep_on_timeout(wait_queue_head_t *q, long timeout)
EXPORT_SYMBOL(sleep_on_timeout);

/*
- * task_setprio - set the current priority of a task
- * @p: task
- * @prio: prio value (kernel-internal form)
+ * Invoked whenever our priority changes by the PI library
*
* This function changes the 'effective' priority of a task. It does
* not touch ->normal_prio like __setscheduler().
*
- * Used by the rt_mutex code to implement priority inheritance logic
- * and by rcupreempt-boost to boost priorities of tasks sleeping
- * with rcu locks.
*/
-void task_setprio(struct task_struct *p, int prio)
+static int
+task_pi_update_cb(struct pi_sink *snk, unsigned int flags)
{
- unsigned long flags;
+ struct task_struct *p = container_of(snk, struct task_struct, pi.snk);
+ unsigned long iflags;
int oldprio, on_rq, running;
+ int prio = p->pi.prio;
struct rq *rq;
const struct sched_class *prev_class = p->sched_class;

BUG_ON(prio < 0 || prio > MAX_PRIO);

- rq = task_rq_lock(p, &flags);
+ rq = task_rq_lock(p, &iflags);

/*
* Idle task boosting is a nono in general. There is one
@@ -4929,6 +5012,10 @@ void task_setprio(struct task_struct *p, int prio)

update_rq_clock(rq);

+ /* If prio is not changing, bail */
+ if (prio == p->prio)
+ goto out_unlock;
+
oldprio = p->prio;
on_rq = p->se.on_rq;
running = task_current(rq, p);
@@ -4937,12 +5024,7 @@ void task_setprio(struct task_struct *p, int prio)
if (running)
p->sched_class->put_prev_task(rq, p);

- if (rt_prio(prio))
- p->sched_class = &rt_sched_class;
- else
- p->sched_class = &fair_sched_class;
-
- p->prio = prio;
+ task_setprio(p, prio);

// trace_special_pid(p->pid, __PRIO(oldprio), PRIO(p));

@@ -4956,7 +5038,9 @@ void task_setprio(struct task_struct *p, int prio)
// trace_special(prev_resched, _need_resched(), 0);

out_unlock:
- task_rq_unlock(rq, &flags);
+ task_rq_unlock(rq, &iflags);
+
+ return 0;
}

void set_user_nice(struct task_struct *p, long nice)
@@ -4990,9 +5074,9 @@ void set_user_nice(struct task_struct *p, long nice)
}

p->static_prio = NICE_TO_PRIO(nice);
- set_load_weight(p);
old_prio = p->prio;
- p->prio = effective_prio(p);
+ task_updateprio(p);
+
delta = p->prio - old_prio;

if (on_rq) {
@@ -5007,6 +5091,8 @@ void set_user_nice(struct task_struct *p, long nice)
}
out_unlock:
task_rq_unlock(rq, &flags);
+
+ task_pi_update(p, 0);
}
EXPORT_SYMBOL(set_user_nice);

@@ -5123,23 +5209,9 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
BUG_ON(p->se.on_rq);

p->policy = policy;
- switch (p->policy) {
- case SCHED_NORMAL:
- case SCHED_BATCH:
- case SCHED_IDLE:
- p->sched_class = &fair_sched_class;
- break;
- case SCHED_FIFO:
- case SCHED_RR:
- p->sched_class = &rt_sched_class;
- break;
- }
-
p->rt_priority = prio;
- p->normal_prio = normal_prio(p);
- /* we are holding p->pi_lock already */
- p->prio = rt_mutex_getprio(p);
- set_load_weight(p);
+
+ task_updateprio(p);
}

/**
@@ -5264,6 +5336,7 @@ recheck:
__task_rq_unlock(rq);
spin_unlock_irqrestore(&p->pi_lock, flags);

+ task_pi_update(p, 0);
rt_mutex_adjust_pi(p);

return 0;
@@ -6686,6 +6759,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
deactivate_task(rq, rq->idle, 0);
rq->idle->static_prio = MAX_PRIO;
__setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
+ rq->idle->prio = rq->idle->normal_prio;
rq->idle->sched_class = &idle_sched_class;
migrate_dead_tasks(cpu);
spin_unlock_irq(&rq->lock);
@@ -8395,6 +8469,8 @@ void __init sched_init(void)
open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL);
#endif

+ task_pi_init(&init_task);
+
#ifdef CONFIG_RT_MUTEXES
plist_head_init(&init_task.pi_waiters, &init_task.pi_lock);
#endif
@@ -8460,7 +8536,9 @@ static void normalize_task(struct rq *rq, struct task_struct *p)
on_rq = p->se.on_rq;
if (on_rq)
deactivate_task(rq, p, 0);
+
__setscheduler(rq, p, SCHED_NORMAL, 0);
+
if (on_rq) {
activate_task(rq, p, 0);
resched_task(rq->curr);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9f37979..5cd4b0e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -145,8 +145,13 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
plist_node_init(&work->entry, prio);
plist_add(&work->entry, &cwq->worklist);

- if (boost_prio < cwq->thread->prio)
- task_setprio(cwq->thread, boost_prio);
+ /*
+ * FIXME: We want to boost to boost_prio, but we don't record that
+ * value in the work_struct for later deboosting
+ */
+ pi_source_init(&work->pi_src, &work->entry.prio);
+ task_pi_boost(cwq->thread, &work->pi_src, 0);
+
wake_up(&cwq->more_work);
}

@@ -280,6 +285,10 @@ struct wq_barrier {
static void run_workqueue(struct cpu_workqueue_struct *cwq)
{
struct plist_head *worklist = &cwq->worklist;
+ struct pi_source pi_src;
+ int prio;
+
+ pi_source_init(&pi_src, &prio);

spin_lock_irq(&cwq->lock);
cwq->run_depth++;
@@ -292,10 +301,10 @@ static void run_workqueue(struct cpu_workqueue_struct *cwq)

again:
while (!plist_head_empty(worklist)) {
- int prio;
struct work_struct *work = plist_first_entry(worklist,
struct work_struct, entry);
work_func_t f = work->func;
+
#ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the struct work_struct
@@ -316,14 +325,28 @@ again:
}
prio = max(prio, 0);

- if (likely(cwq->thread->prio != prio))
- task_setprio(cwq->thread, prio);
-
cwq->current_work = work;
plist_del(&work->entry, worklist);
plist_node_init(&work->entry, MAX_PRIO);
spin_unlock_irq(&cwq->lock);

+ /*
+ * The owner is free to reuse the work object once we execute
+ * the work->func() below. Therefore we cannot leave the
+ * work->pi_src boosting our thread or it may get stomped
+ * on when the work item is requeued.
+ *
+ * So what we do is boost ourselves with an on-the-stack
+ * copy of the priority of the work item, and then
+ * deboost the workitem. Once the work is complete, we
+ * can then simply deboost the stack version.
+ *
+ * Note that this will not typically cause a pi-chain
+ * update since we are boosting the node laterally
+ */
+ task_pi_boost(current, &pi_src, PI_FLAG_DEFER_UPDATE);
+ task_pi_deboost(current, &work->pi_src, PI_FLAG_DEFER_UPDATE);
+
BUG_ON(get_wq_data(work) != cwq);
work_clear_pending(work);
leak_check(NULL);
@@ -334,6 +357,9 @@ again:
lock_release(&cwq->wq->lockdep_map, 1, _THIS_IP_);
leak_check(f);

+ /* Deboost the stack copy of the work->prio (see above) */
+ task_pi_deboost(current, &pi_src, 0);
+
spin_lock_irq(&cwq->lock);
cwq->current_work = NULL;
wake_up_all(&cwq->work_done);
@@ -357,7 +383,6 @@ again:
goto again;
}

- task_setprio(cwq->thread, current->normal_prio);
cwq->run_depth--;
spin_unlock_irq(&cwq->lock);
}

2008-08-01 21:24:43

by Gregory Haskins

Subject: [PATCH RT RFC 3/7] rtmutex: formally initialize the rt_mutex_waiters

We will be adding more logic to rt_mutex_waiters, and therefore let's
centralize the initialization to make this easier going forward.

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/rtmutex.c | 26 ++++++++++++++------------
1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 7d11380..12de859 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -805,6 +805,15 @@ static int adaptive_wait(struct rt_mutex_waiter *waiter,
}
#endif

+static void init_waiter(struct rt_mutex_waiter *waiter)
+{
+ memset(waiter, 0, sizeof(*waiter));
+
+ debug_rt_mutex_init_waiter(waiter);
+ waiter->task = NULL;
+ waiter->write_lock = 0;
+}
+
/*
* Slow path lock function spin_lock style: this variant is very
* careful not to miss any non-lock wakeups.
@@ -823,9 +832,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
struct task_struct *orig_owner;
int missed = 0;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
- waiter.write_lock = 0;
+ init_waiter(&waiter);

spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);
@@ -1324,6 +1331,8 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
int saved_lock_depth = -1;
unsigned long saved_state = -1, state, flags;

+ init_waiter(&waiter);
+
spin_lock_irqsave(&mutex->wait_lock, flags);
init_rw_lists(rwm);

@@ -1335,10 +1344,6 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)

/* Owner is a writer (or a blocked writer). Block on the lock */

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
- waiter.write_lock = 0;
-
if (mtx) {
/*
* We drop the BKL here before we go into the wait loop to avoid a
@@ -1538,8 +1543,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
int saved_lock_depth = -1;
unsigned long flags, saved_state = -1, state;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
+ init_waiter(&waiter);

/* we do PI different for writers that are blocked */
waiter.write_lock = 1;
@@ -2270,9 +2274,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
struct rt_mutex_waiter waiter;
unsigned long flags;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
- waiter.write_lock = 0;
+ init_waiter(&waiter);

spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);

2008-08-01 21:24:59

by Gregory Haskins

Subject: [PATCH RT RFC 4/7] RT: wrap the rt_rwlock "add reader" logic

We will use this later in the series to add PI functions on "add".

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/rtmutex.c | 16 +++++++++++-----
1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 12de859..62fdc3d 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -1122,6 +1122,12 @@ static void rw_check_held(struct rw_mutex *rwm)
# define rw_check_held(rwm) do { } while (0)
#endif

+static inline void
+rt_rwlock_add_reader(struct reader_lock_struct *rls, struct rw_mutex *rwm)
+{
+ list_add(&rls->list, &rwm->readers);
+}
+
/*
* The fast path does not add itself to the reader list to keep
* from needing to grab the spinlock. We need to add the owner
@@ -1163,7 +1169,7 @@ rt_rwlock_update_owner(struct rw_mutex *rwm, struct task_struct *own)
if (rls->list.prev && !list_empty(&rls->list))
return;

- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);

/* change to reader, so no one else updates too */
rt_rwlock_set_owner(rwm, RT_RW_READER, RT_RWLOCK_CHECK);
@@ -1197,7 +1203,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
* it hasn't been added to the link list yet.
*/
if (!rls->list.prev || list_empty(&rls->list))
- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
rt_rwlock_set_owner(rwm, RT_RW_READER, 0);
rls->count++;
incr = 0;
@@ -1276,7 +1282,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
rls->lock = rwm;
rls->count = 1;
WARN_ON(rls->list.prev && !list_empty(&rls->list));
- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
} else
WARN_ON_ONCE(1);
spin_unlock(&current->pi_lock);
@@ -1473,7 +1479,7 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
spin_lock(&mutex->wait_lock);
rls = &current->owned_read_locks[reader_count];
if (!rls->list.prev || list_empty(&rls->list))
- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
spin_unlock(&mutex->wait_lock);
} else
spin_unlock(&current->pi_lock);
@@ -2083,7 +2089,7 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)

/* Set us up for multiple readers or conflicts */

- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
rwm->owner = RT_RW_READER;

/*

2008-08-01 21:25:26

by Gregory Haskins

Subject: [PATCH RT RFC 5/7] rtmutex: use runtime init for rtmutexes

The system already has facilities to perform late/run-time init for
rtmutexes. We want to add more advanced initialization later in the
series, so we force all rtmutexes through the init path in preparation
for the later patches.

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/rtmutex.h | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index b263bac..14774ce 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -64,8 +64,6 @@ struct hrtimer_sleeper;

#define __RT_MUTEX_INITIALIZER(mutexname) \
{ .wait_lock = RAW_SPIN_LOCK_UNLOCKED(mutexname) \
- , .wait_list = PLIST_HEAD_INIT(mutexname.wait_list, &mutexname.wait_lock) \
- , .owner = NULL \
__DEBUG_RT_MUTEX_INITIALIZER(mutexname)}

#define DEFINE_RT_MUTEX(mutexname) \

2008-08-01 21:25:51

by Gregory Haskins

Subject: [PATCH RT RFC 6/7] rtmutex: convert rtmutexes to fully use the PI library

We have previously only laid some of the groundwork to use the PI
library, but left the existing infrastructure in place in the
rtmutex code. This patch converts the rtmutex PI code to officially
use the PI library.
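
In rough strokes, the lock itself becomes a pi object: lock->pi.prio
tracks the priority being asserted on the lock, lock->pi.src references
that value, and the owner task is boosted/deboosted through that single
source. Below is a condensed, illustrative sketch of the resulting owner
hand-off (the helper name is hypothetical; the real logic lives in
rt_mutex_set_owner() and friends in the diff below):

    static void rtmutex_switch_owner_sketch(struct rt_mutex *lock,
                                            struct task_struct *old_owner,
                                            struct task_struct *new_owner)
    {
            if (old_owner)
                    task_pi_deboost(old_owner, &lock->pi.src,
                                    PI_FLAG_DEFER_UPDATE);
            if (new_owner)
                    task_pi_boost(new_owner, &lock->pi.src,
                                  PI_FLAG_DEFER_UPDATE);

            /* push the deferred changes through the task chain(s) */
            if (old_owner)
                    task_pi_update(old_owner, 0);
            if (new_owner)
                    task_pi_update(new_owner, 0);
    }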

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/rt_lock.h | 2
include/linux/rtmutex.h | 15 -
include/linux/sched.h | 21 -
kernel/fork.c | 2
kernel/rcupreempt-boost.c | 2
kernel/rtmutex-debug.c | 4
kernel/rtmutex.c | 944 ++++++++++++++-------------------------------
kernel/rtmutex_common.h | 18 -
kernel/rwlock_torture.c | 32 --
kernel/sched.c | 12 -
10 files changed, 319 insertions(+), 733 deletions(-)

diff --git a/include/linux/rt_lock.h b/include/linux/rt_lock.h
index c00cfb3..d0ef0f1 100644
--- a/include/linux/rt_lock.h
+++ b/include/linux/rt_lock.h
@@ -14,6 +14,7 @@
#include <asm/atomic.h>
#include <linux/spinlock_types.h>
#include <linux/sched_prio.h>
+#include <linux/pi.h>

#ifdef CONFIG_PREEMPT_RT
/*
@@ -67,6 +68,7 @@ struct rw_mutex {
atomic_t count; /* number of times held for read */
atomic_t owners; /* number of owners as readers */
struct list_head readers;
+ struct pi_sink pi_snk;
int prio;
};

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 14774ce..d984244 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -15,6 +15,7 @@
#include <linux/linkage.h>
#include <linux/plist.h>
#include <linux/spinlock_types.h>
+#include <linux/pi.h>

/**
* The rt_mutex structure
@@ -27,6 +28,12 @@ struct rt_mutex {
raw_spinlock_t wait_lock;
struct plist_head wait_list;
struct task_struct *owner;
+ struct {
+ struct pi_source src;
+ struct pi_node node;
+ struct pi_sink snk;
+ int prio;
+ } pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
int save_state;
const char *name, *file;
@@ -96,12 +103,4 @@ extern int rt_mutex_trylock(struct rt_mutex *lock);

extern void rt_mutex_unlock(struct rt_mutex *lock);

-#ifdef CONFIG_RT_MUTEXES
-# define INIT_RT_MUTEXES(tsk) \
- .pi_waiters = PLIST_HEAD_INIT(tsk.pi_waiters, &tsk.pi_lock), \
- INIT_RT_MUTEX_DEBUG(tsk)
-#else
-# define INIT_RT_MUTEXES(tsk)
-#endif
-
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63ddd1f..7af6c3f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1106,6 +1106,7 @@ struct reader_lock_struct {
struct rw_mutex *lock;
struct list_head list;
struct task_struct *task;
+ struct pi_source pi_src;
int count;
};

@@ -1307,15 +1308,6 @@ struct task_struct {
int prio;
} pi;

-#ifdef CONFIG_RT_MUTEXES
- /* PI waiters blocked on a rt_mutex held by this task */
- struct plist_head pi_waiters;
- /* Deadlock detection and priority inheritance handling */
- struct rt_mutex_waiter *pi_blocked_on;
- int rtmutex_prio;
- struct pi_source rtmutex_prio_src;
-#endif
-
#ifdef CONFIG_DEBUG_MUTEXES
/* mutex deadlock detection */
struct mutex_waiter *blocked_on;
@@ -1805,17 +1797,6 @@ int sched_rt_handler(struct ctl_table *table, int write,

extern unsigned int sysctl_sched_compat_yield;

-#ifdef CONFIG_RT_MUTEXES
-extern int rt_mutex_getprio(struct task_struct *p);
-extern void rt_mutex_adjust_pi(struct task_struct *p);
-#else
-static inline int rt_mutex_getprio(struct task_struct *p)
-{
- return p->normal_prio;
-}
-# define rt_mutex_adjust_pi(p) do { } while (0)
-#endif
-
extern void set_user_nice(struct task_struct *p, long nice);
extern int task_prio(const struct task_struct *p);
extern int task_nice(const struct task_struct *p);
diff --git a/kernel/fork.c b/kernel/fork.c
index 399a0d0..80ca71d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -887,8 +887,6 @@ static void rt_mutex_init_task(struct task_struct *p)
{
spin_lock_init(&p->pi_lock);
#ifdef CONFIG_RT_MUTEXES
- plist_head_init(&p->pi_waiters, &p->pi_lock);
- p->pi_blocked_on = NULL;
# ifdef CONFIG_DEBUG_RT_MUTEXES
p->last_kernel_lock = NULL;
# endif
diff --git a/kernel/rcupreempt-boost.c b/kernel/rcupreempt-boost.c
index 9373b9e..f3eeca3 100644
--- a/kernel/rcupreempt-boost.c
+++ b/kernel/rcupreempt-boost.c
@@ -431,7 +431,7 @@ void rcu_boost_readers(void)

spin_lock_irqsave(&rcu_boost_wake_lock, flags);

- prio = rt_mutex_getprio(curr);
+ prio = get_rcu_prio(curr);

rcu_trace_boost_try_boost_readers(RCU_BOOST_ME);

diff --git a/kernel/rtmutex-debug.c b/kernel/rtmutex-debug.c
index 0d9cb54..2034ce1 100644
--- a/kernel/rtmutex-debug.c
+++ b/kernel/rtmutex-debug.c
@@ -57,8 +57,6 @@ static void printk_lock(struct rt_mutex *lock, int print_owner)

void rt_mutex_debug_task_free(struct task_struct *task)
{
- DEBUG_LOCKS_WARN_ON(!plist_head_empty(&task->pi_waiters));
- DEBUG_LOCKS_WARN_ON(task->pi_blocked_on);
#ifdef CONFIG_PREEMPT_RT
WARN_ON(task->reader_lock_count);
#endif
@@ -156,7 +154,6 @@ void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
{
memset(waiter, 0x11, sizeof(*waiter));
plist_node_init(&waiter->list_entry, MAX_PRIO);
- plist_node_init(&waiter->pi_list_entry, MAX_PRIO);
waiter->deadlock_task_pid = NULL;
}

@@ -164,7 +161,6 @@ void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)
{
put_pid(waiter->deadlock_task_pid);
DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->list_entry));
- DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
DEBUG_LOCKS_WARN_ON(waiter->task);
memset(waiter, 0x22, sizeof(*waiter));
}
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 62fdc3d..0f64298 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -58,14 +58,32 @@
* state.
*/

+static inline void
+rtmutex_pi_owner(struct rt_mutex *lock, struct task_struct *p, int add)
+{
+ if (!p || p == RT_RW_READER)
+ return;
+
+ if (add)
+ task_pi_boost(p, &lock->pi.src, PI_FLAG_DEFER_UPDATE);
+ else
+ task_pi_deboost(p, &lock->pi.src, PI_FLAG_DEFER_UPDATE);
+}
+
static void
rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner,
unsigned long mask)
{
unsigned long val = (unsigned long)owner | mask;

- if (rt_mutex_has_waiters(lock))
+ if (rt_mutex_has_waiters(lock)) {
+ struct task_struct *prev_owner = rt_mutex_owner(lock);
+
+ rtmutex_pi_owner(lock, prev_owner, 0);
+ rtmutex_pi_owner(lock, owner, 1);
+
val |= RT_MUTEX_HAS_WAITERS;
+ }

lock->owner = (struct task_struct *)val;
}
@@ -134,245 +152,88 @@ static inline int task_is_reader(struct task_struct *task) { return 0; }
#endif

int pi_initialized;
-
-/*
- * we initialize the wait_list runtime. (Could be done build-time and/or
- * boot-time.)
- */
-static inline void init_lists(struct rt_mutex *lock)
+static inline int rtmutex_pi_boost(struct pi_sink *snk,
+ struct pi_source *src,
+ unsigned int flags)
{
- if (unlikely(!lock->wait_list.prio_list.prev)) {
- plist_head_init(&lock->wait_list, &lock->wait_lock);
-#ifdef CONFIG_DEBUG_RT_MUTEXES
- pi_initialized++;
-#endif
- }
-}
-
-static int rt_mutex_get_readers_prio(struct task_struct *task, int prio);
-
-/*
- * Calculate task priority from the waiter list priority
- *
- * Return task->normal_prio when the waiter list is empty or when
- * the waiter is not allowed to do priority boosting
- */
-int rt_mutex_getprio(struct task_struct *task)
-{
- int prio = min(task->normal_prio, get_rcu_prio(task));
-
- prio = rt_mutex_get_readers_prio(task, prio);
-
- if (likely(!task_has_pi_waiters(task)))
- return prio;
-
- return min(task_top_pi_waiter(task)->pi_list_entry.prio, prio);
-}
+ struct rt_mutex *lock = container_of(snk, struct rt_mutex, pi.snk);

-/*
- * Adjust the priority of a task, after its pi_waiters got modified.
- *
- * This can be both boosting and unboosting. task->pi_lock must be held.
- */
-static void __rt_mutex_adjust_prio(struct task_struct *task)
-{
- int prio = rt_mutex_getprio(task);
-
- if (task->rtmutex_prio != prio) {
- task->rtmutex_prio = prio;
- task_pi_boost(task, &task->rtmutex_prio_src, 0);
- }
-}
-
-/*
- * Adjust task priority (undo boosting). Called from the exit path of
- * rt_mutex_slowunlock() and rt_mutex_slowlock().
- *
- * (Note: We do this outside of the protection of lock->wait_lock to
- * allow the lock to be taken while or before we readjust the priority
- * of task. We do not use the spin_xx_mutex() variants here as we are
- * outside of the debug path.)
- */
-static void rt_mutex_adjust_prio(struct task_struct *task)
-{
- unsigned long flags;
+ /*
+ * We don't need to take any locks here because the
+ * lock->pi.node interlock is already guaranteeing mutual
+ * exclusion.
+ */
+ lock->pi.prio = *src->prio;

- spin_lock_irqsave(&task->pi_lock, flags);
- __rt_mutex_adjust_prio(task);
- spin_unlock_irqrestore(&task->pi_lock, flags);
+ return 0;
}

-/*
- * Max number of times we'll walk the boosting chain:
- */
-int max_lock_depth = 1024;
-
-static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- struct rt_mutex *lock,
- int recursion_depth);
-/*
- * Adjust the priority chain. Also used for deadlock detection.
- * Decreases task's usage by one - may thus free the task.
- * Returns 0 or -EDEADLK.
- */
-static int rt_mutex_adjust_prio_chain(struct task_struct *task,
- int deadlock_detect,
- struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- int recursion_depth)
+static inline int rtmutex_pi_update(struct pi_sink *snk,
+ unsigned int flags)
{
- struct rt_mutex *lock;
- struct rt_mutex_waiter *waiter, *top_waiter = orig_waiter;
- int detect_deadlock, ret = 0, depth = 0;
- unsigned long flags;
+ struct rt_mutex *lock = container_of(snk, struct rt_mutex, pi.snk);
+ struct task_struct *owner = NULL;
+ unsigned long iflags;

- detect_deadlock = debug_rt_mutex_detect_deadlock(orig_waiter,
- deadlock_detect);
+ spin_lock_irqsave(&lock->wait_lock, iflags);

- /*
- * The (de)boosting is a step by step approach with a lot of
- * pitfalls. We want this to be preemptible and we want hold a
- * maximum of two locks per step. So we have to check
- * carefully whether things change under us.
- */
- again:
- if (++depth > max_lock_depth) {
- static int prev_max;
+ if (rt_mutex_has_waiters(lock)) {
+ owner = rt_mutex_owner(lock);

- /*
- * Print this only once. If the admin changes the limit,
- * print a new message when reaching the limit again.
- */
- if (prev_max != max_lock_depth) {
- prev_max = max_lock_depth;
- printk(KERN_WARNING "Maximum lock depth %d reached "
- "task: %s (%d)\n", max_lock_depth,
- top_task->comm, task_pid_nr(top_task));
+ if (owner && owner != RT_RW_READER) {
+ rtmutex_pi_owner(lock, owner, 1);
+ get_task_struct(owner);
}
- put_task_struct(task);
-
- return deadlock_detect ? -EDEADLK : 0;
}
- retry:
- /*
- * Task can not go away as we did a get_task() before !
- */
- spin_lock_irqsave(&task->pi_lock, flags);

- waiter = task->pi_blocked_on;
- /*
- * Check whether the end of the boosting chain has been
- * reached or the state of the chain has changed while we
- * dropped the locks.
- */
- if (!waiter || !waiter->task)
- goto out_unlock_pi;
-
- /*
- * Check the orig_waiter state. After we dropped the locks,
- * the previous owner of the lock might have released the lock
- * and made us the pending owner:
- */
- if (orig_waiter && !orig_waiter->task)
- goto out_unlock_pi;
-
- /*
- * Drop out, when the task has no waiters. Note,
- * top_waiter can be NULL, when we are in the deboosting
- * mode!
- */
- if (top_waiter && (!task_has_pi_waiters(task) ||
- top_waiter != task_top_pi_waiter(task)))
- goto out_unlock_pi;
-
- /*
- * When deadlock detection is off then we check, if further
- * priority adjustment is necessary.
- */
- if (!detect_deadlock && waiter->list_entry.prio == task->prio)
- goto out_unlock_pi;
+ spin_unlock_irqrestore(&lock->wait_lock, iflags);

- lock = waiter->lock;
- if (!spin_trylock(&lock->wait_lock)) {
- spin_unlock_irqrestore(&task->pi_lock, flags);
- cpu_relax();
- goto retry;
+ if (owner && owner != RT_RW_READER) {
+ task_pi_update(owner, 0);
+ put_task_struct(owner);
}

- /* Deadlock detection */
- if (lock == orig_lock || rt_mutex_owner(lock) == top_task) {
- debug_rt_mutex_deadlock(deadlock_detect, orig_waiter, lock);
- spin_unlock(&lock->wait_lock);
- ret = deadlock_detect ? -EDEADLK : 0;
- goto out_unlock_pi;
- }
+ return 0;
+}

- top_waiter = rt_mutex_top_waiter(lock);
+static struct pi_sink rtmutex_pi_snk = {
+ .boost = rtmutex_pi_boost,
+ .update = rtmutex_pi_update,
+};

- /* Requeue the waiter */
- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->list_entry.prio = task->prio;
- plist_add(&waiter->list_entry, &lock->wait_list);
-
- /* Release the task */
- spin_unlock(&task->pi_lock);
- put_task_struct(task);
+static void init_pi(struct rt_mutex *lock)
+{
+ pi_node_init(&lock->pi.node);

- /* Grab the next task */
- task = rt_mutex_owner(lock);
+ lock->pi.prio = MAX_PRIO;
+ pi_source_init(&lock->pi.src, &lock->pi.prio);
+ lock->pi.snk = rtmutex_pi_snk;

- /*
- * Readers are special. We may need to boost more than one owner.
- */
- if (task_is_reader(task)) {
- ret = rt_mutex_adjust_readers(orig_lock, orig_waiter,
- top_task, lock,
- recursion_depth);
- spin_unlock_irqrestore(&lock->wait_lock, flags);
- goto out;
- }
+ pi_add_sink(&lock->pi.node, &lock->pi.snk,
+ PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
+}

- get_task_struct(task);
- spin_lock(&task->pi_lock);
-
- if (waiter == rt_mutex_top_waiter(lock)) {
- /* Boost the owner */
- plist_del(&top_waiter->pi_list_entry, &task->pi_waiters);
- waiter->pi_list_entry.prio = waiter->list_entry.prio;
- plist_add(&waiter->pi_list_entry, &task->pi_waiters);
- __rt_mutex_adjust_prio(task);
-
- } else if (top_waiter == waiter) {
- /* Deboost the owner */
- plist_del(&waiter->pi_list_entry, &task->pi_waiters);
- waiter = rt_mutex_top_waiter(lock);
- waiter->pi_list_entry.prio = waiter->list_entry.prio;
- plist_add(&waiter->pi_list_entry, &task->pi_waiters);
- __rt_mutex_adjust_prio(task);
+/*
+ * we initialize the wait_list runtime. (Could be done build-time and/or
+ * boot-time.)
+ */
+static inline void init_lists(struct rt_mutex *lock)
+{
+ if (unlikely(!lock->wait_list.prio_list.prev)) {
+ plist_head_init(&lock->wait_list, &lock->wait_lock);
+ init_pi(lock);
+#ifdef CONFIG_DEBUG_RT_MUTEXES
+ pi_initialized++;
+#endif
}
-
- spin_unlock(&task->pi_lock);
-
- top_waiter = rt_mutex_top_waiter(lock);
- spin_unlock_irqrestore(&lock->wait_lock, flags);
-
- if (!detect_deadlock && waiter != top_waiter)
- goto out_put_task;
-
- goto again;
-
- out_unlock_pi:
- spin_unlock_irqrestore(&task->pi_lock, flags);
- out_put_task:
- put_task_struct(task);
- out:
- return ret;
}

/*
+ * Max number of times we'll walk the boosting chain:
+ */
+int max_lock_depth = 1024;
+
+/*
* Optimization: check if we can steal the lock from the
* assigned pending owner [which might not have taken the
* lock yet]:
@@ -380,7 +241,6 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
static inline int try_to_steal_lock(struct rt_mutex *lock, int mode)
{
struct task_struct *pendowner = rt_mutex_owner(lock);
- struct rt_mutex_waiter *next;

if (!rt_mutex_owner_pending(lock))
return 0;
@@ -390,49 +250,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock, int mode)

WARN_ON(task_is_reader(rt_mutex_owner(lock)));

- spin_lock(&pendowner->pi_lock);
- if (!lock_is_stealable(pendowner, mode)) {
- spin_unlock(&pendowner->pi_lock);
- return 0;
- }
-
- /*
- * Check if a waiter is enqueued on the pending owners
- * pi_waiters list. Remove it and readjust pending owners
- * priority.
- */
- if (likely(!rt_mutex_has_waiters(lock))) {
- spin_unlock(&pendowner->pi_lock);
- return 1;
- }
-
- /* No chain handling, pending owner is not blocked on anything: */
- next = rt_mutex_top_waiter(lock);
- plist_del(&next->pi_list_entry, &pendowner->pi_waiters);
- __rt_mutex_adjust_prio(pendowner);
- spin_unlock(&pendowner->pi_lock);
-
- /*
- * We are going to steal the lock and a waiter was
- * enqueued on the pending owners pi_waiters queue. So
- * we have to enqueue this waiter into
- * current->pi_waiters list. This covers the case,
- * where current is boosted because it holds another
- * lock and gets unboosted because the booster is
- * interrupted, so we would delay a waiter with higher
- * priority as current->normal_prio.
- *
- * Note: in the rare case of a SCHED_OTHER task changing
- * its priority and thus stealing the lock, next->task
- * might be current:
- */
- if (likely(next->task != current)) {
- spin_lock(&current->pi_lock);
- plist_add(&next->pi_list_entry, &current->pi_waiters);
- __rt_mutex_adjust_prio(current);
- spin_unlock(&current->pi_lock);
- }
- return 1;
+ return lock_is_stealable(pendowner, mode);
}

/*
@@ -486,74 +304,145 @@ static inline int try_to_take_rt_mutex(struct rt_mutex *lock)
}

/*
- * Task blocks on lock.
- *
- * Prepare waiter and propagate pi chain
- *
- * This must be called with lock->wait_lock held.
+ * These callbacks are invoked whenever a waiter has changed priority.
+ * So we should requeue it within the lock->wait_list
*/
-static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter,
- int detect_deadlock, unsigned long flags)
+
+static inline int rtmutex_waiter_pi_boost(struct pi_sink *snk,
+ struct pi_source *src,
+ unsigned int flags)
{
- struct task_struct *owner = rt_mutex_owner(lock);
- struct rt_mutex_waiter *top_waiter = waiter;
- int chain_walk = 0, res;
+ struct rt_mutex_waiter *waiter;

- spin_lock(&current->pi_lock);
- __rt_mutex_adjust_prio(current);
- waiter->task = current;
- waiter->lock = lock;
- plist_node_init(&waiter->list_entry, current->prio);
- plist_node_init(&waiter->pi_list_entry, current->prio);
+ waiter = container_of(snk, struct rt_mutex_waiter, pi.snk);

- /* Get the top priority waiter on the lock */
- if (rt_mutex_has_waiters(lock))
- top_waiter = rt_mutex_top_waiter(lock);
- plist_add(&waiter->list_entry, &lock->wait_list);
+ /*
+ * We don't need to take any locks here because the
+ * waiter->pi.node interlock is already guaranteeing mutual
+ * exclusion.
+ */
+ waiter->pi.prio = *src->prio;

- current->pi_blocked_on = waiter;
+ return 0;
+}

- spin_unlock(&current->pi_lock);
+static inline int rtmutex_waiter_pi_update(struct pi_sink *snk,
+ unsigned int flags)
+{
+ struct rt_mutex *lock;
+ struct rt_mutex_waiter *waiter;
+ unsigned long iflags;

- if (waiter == rt_mutex_top_waiter(lock)) {
- /* readers are handled differently */
- if (task_is_reader(owner)) {
- res = rt_mutex_adjust_readers(lock, waiter,
- current, lock, 0);
- return res;
- }
+ waiter = container_of(snk, struct rt_mutex_waiter, pi.snk);
+ lock = waiter->lock;

- spin_lock(&owner->pi_lock);
- plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters);
- plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
+ spin_lock_irqsave(&lock->wait_lock, iflags);

- __rt_mutex_adjust_prio(owner);
- if (owner->pi_blocked_on)
- chain_walk = 1;
- spin_unlock(&owner->pi_lock);
+ /*
+ * If waiter->task is non-NULL, it means we are still valid in the
+ * pi list. Therefore, if waiter->pi.prio has changed since we
+ * queued ourselves, requeue it.
+ */
+ if (waiter->task && waiter->list_entry.prio != waiter->pi.prio) {
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
}
- else if (debug_rt_mutex_detect_deadlock(waiter, detect_deadlock))
- chain_walk = 1;

- if (!chain_walk || task_is_reader(owner))
- return 0;
+ spin_unlock_irqrestore(&lock->wait_lock, iflags);
+
+ return 0;
+}
+
+static struct pi_sink rtmutex_waiter_pi_snk = {
+ .boost = rtmutex_waiter_pi_boost,
+ .update = rtmutex_waiter_pi_update,
+};
+
+/*
+ * This must be called with lock->wait_lock held.
+ */
+static int add_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ unsigned long *flags)
+{
+ int has_waiters = rt_mutex_has_waiters(lock);
+
+ waiter->task = current;
+ waiter->lock = lock;
+ waiter->pi.prio = current->prio;
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
+ waiter->pi.snk = rtmutex_waiter_pi_snk;

/*
- * The owner can't disappear while holding a lock,
- * so the owner struct is protected by wait_lock.
- * Gets dropped in rt_mutex_adjust_prio_chain()!
+ * Link the waiter object to the task so that we can adjust our
+ * position on the prio list if the priority is changed. Note
+ * that if the priority races between the time we recorded it
+ * above and the time it is set here, we will correct the race
+ * when we call task_pi_update(current) below. Otherwise the
+ * update is a no-op.
*/
- get_task_struct(owner);
+ pi_add_sink(&current->pi.node, &waiter->pi.snk,
+ PI_FLAG_DEFER_UPDATE);

- spin_unlock_irqrestore(&lock->wait_lock, flags);
+ /*
+ * Link the lock object to the waiter so that we can form a chain
+ * to the owner
+ */
+ pi_add_sink(&current->pi.node, &lock->pi.node.snk,
+ PI_FLAG_DEFER_UPDATE);

- res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter,
- current, 0);
+ /*
+ * If we previously had no waiters, we are transitioning to
+ * a mode where we need to boost the owner
+ */
+ if (!has_waiters) {
+ struct task_struct *owner = rt_mutex_owner(lock);
+ rtmutex_pi_owner(lock, owner, 1);
+ }

- spin_lock_irq(&lock->wait_lock);
+ spin_unlock_irqrestore(&lock->wait_lock, *flags);
+ task_pi_update(current, 0);
+ spin_lock_irqsave(&lock->wait_lock, *flags);
+
+ return 0;
+}
+
+/*
+ * Remove a waiter from a lock
+ *
+ * Must be called with lock->wait_lock held
+ */
+static void remove_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ struct task_struct *p = waiter->task;
+
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ waiter->task = NULL;
+
+ /*
+ * We can stop boosting the owner if there are no more waiters
+ */
+ if (!rt_mutex_has_waiters(lock)) {
+ struct task_struct *owner = rt_mutex_owner(lock);
+ rtmutex_pi_owner(lock, owner, 0);
+ }

- return res;
+ /*
+ * Unlink the lock object from the waiter
+ */
+ pi_del_sink(&p->pi.node, &lock->pi.node.snk, PI_FLAG_DEFER_UPDATE);
+
+ /*
+ * Unlink the waiter object from the task. Note that we
+ * technically do not need an update for "p" because the
+ * .deboost will be processed synchronously with this call
+ * since there is no .deboost handler registered for
+ * the waiter sink
+ */
+ pi_del_sink(&p->pi.node, &waiter->pi.snk, PI_FLAG_DEFER_UPDATE);
}

/*
@@ -566,24 +455,10 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
*/
static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
{
- struct rt_mutex_waiter *waiter;
- struct task_struct *pendowner;
- struct rt_mutex_waiter *next;
-
- spin_lock(&current->pi_lock);
+ struct rt_mutex_waiter *waiter = rt_mutex_top_waiter(lock);
+ struct task_struct *pendowner = waiter->task;

- waiter = rt_mutex_top_waiter(lock);
- plist_del(&waiter->list_entry, &lock->wait_list);
-
- /*
- * Remove it from current->pi_waiters. We do not adjust a
- * possible priority boost right now. We execute wakeup in the
- * boosted mode and go back to normal after releasing
- * lock->wait_lock.
- */
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
- pendowner = waiter->task;
- waiter->task = NULL;
+ remove_waiter(lock, waiter);

/*
* Do the wakeup before the ownership change to give any spinning
@@ -621,113 +496,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
}

rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
-
- spin_unlock(&current->pi_lock);
-
- /*
- * Clear the pi_blocked_on variable and enqueue a possible
- * waiter into the pi_waiters list of the pending owner. This
- * prevents that in case the pending owner gets unboosted a
- * waiter with higher priority than pending-owner->normal_prio
- * is blocked on the unboosted (pending) owner.
- */
-
- if (rt_mutex_has_waiters(lock))
- next = rt_mutex_top_waiter(lock);
- else
- next = NULL;
-
- spin_lock(&pendowner->pi_lock);
-
- WARN_ON(!pendowner->pi_blocked_on);
- WARN_ON(pendowner->pi_blocked_on != waiter);
- WARN_ON(pendowner->pi_blocked_on->lock != lock);
-
- pendowner->pi_blocked_on = NULL;
-
- if (next)
- plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
-
- spin_unlock(&pendowner->pi_lock);
-}
-
-/*
- * Remove a waiter from a lock
- *
- * Must be called with lock->wait_lock held
- */
-static void remove_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter,
- unsigned long flags)
-{
- int first = (waiter == rt_mutex_top_waiter(lock));
- struct task_struct *owner = rt_mutex_owner(lock);
- int chain_walk = 0;
-
- spin_lock(&current->pi_lock);
- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->task = NULL;
- current->pi_blocked_on = NULL;
- spin_unlock(&current->pi_lock);
-
- if (first && owner != current && !task_is_reader(owner)) {
-
- spin_lock(&owner->pi_lock);
-
- plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
-
- if (rt_mutex_has_waiters(lock)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(lock);
- plist_add(&next->pi_list_entry, &owner->pi_waiters);
- }
- __rt_mutex_adjust_prio(owner);
-
- if (owner->pi_blocked_on)
- chain_walk = 1;
-
- spin_unlock(&owner->pi_lock);
- }
-
- WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
-
- if (!chain_walk)
- return;
-
- /* gets dropped in rt_mutex_adjust_prio_chain()! */
- get_task_struct(owner);
-
- spin_unlock_irqrestore(&lock->wait_lock, flags);
-
- rt_mutex_adjust_prio_chain(owner, 0, lock, NULL, current, 0);
-
- spin_lock_irq(&lock->wait_lock);
-}
-
-/*
- * Recheck the pi chain, in case we got a priority setting
- *
- * Called from sched_setscheduler
- */
-void rt_mutex_adjust_pi(struct task_struct *task)
-{
- struct rt_mutex_waiter *waiter;
- unsigned long flags;
-
- spin_lock_irqsave(&task->pi_lock, flags);
-
- waiter = task->pi_blocked_on;
- if (!waiter || waiter->list_entry.prio == task->prio) {
- spin_unlock_irqrestore(&task->pi_lock, flags);
- return;
- }
-
- /* gets dropped in rt_mutex_adjust_prio_chain()! */
- get_task_struct(task);
- spin_unlock_irqrestore(&task->pi_lock, flags);
-
- rt_mutex_adjust_prio_chain(task, 0, NULL, NULL, task, 0);
}

/*
@@ -869,7 +637,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* but the lock got stolen by an higher prio task.
*/
if (!waiter.task) {
- task_blocks_on_rt_mutex(lock, &waiter, 0, flags);
+ add_waiter(lock, &waiter, &flags);
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
@@ -917,7 +685,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* can end up with a non-NULL waiter.task:
*/
if (unlikely(waiter.task))
- remove_waiter(lock, &waiter, flags);
+ remove_waiter(lock, &waiter);
/*
* try_to_take_rt_mutex() sets the waiter bit
* unconditionally. We might have to fix that up:
@@ -927,6 +695,9 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
unlock:
spin_unlock_irqrestore(&lock->wait_lock, flags);

+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);
+
debug_rt_mutex_free_waiter(&waiter);
}

@@ -954,8 +725,8 @@ rt_spin_lock_slowunlock(struct rt_mutex *lock)

spin_unlock_irqrestore(&lock->wait_lock, flags);

- /* Undo pi boosting.when necessary */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

void __lockfunc rt_spin_lock(spinlock_t *lock)
@@ -1126,6 +897,9 @@ static inline void
rt_rwlock_add_reader(struct reader_lock_struct *rls, struct rw_mutex *rwm)
{
list_add(&rls->list, &rwm->readers);
+
+ pi_source_init(&rls->pi_src, &rwm->prio);
+ task_pi_boost(rls->task, &rls->pi_src, PI_FLAG_DEFER_UPDATE);
}

/*
@@ -1249,21 +1023,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
waiter = rt_mutex_top_waiter(mutex);
if (!lock_is_stealable(waiter->task, mode))
return 0;
- /*
- * The pending reader has PI waiters,
- * but we are taking the lock.
- * Remove the waiters from the pending owner.
- */
- spin_lock(&mtxowner->pi_lock);
- plist_del(&waiter->pi_list_entry, &mtxowner->pi_waiters);
- spin_unlock(&mtxowner->pi_lock);
}
- } else if (rt_mutex_has_waiters(mutex)) {
- /* Readers do things differently with respect to PI */
- waiter = rt_mutex_top_waiter(mutex);
- spin_lock(&current->pi_lock);
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
- spin_unlock(&current->pi_lock);
}
/* Readers never own the mutex */
rt_mutex_set_owner(mutex, RT_RW_READER, 0);
@@ -1275,7 +1035,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
if (incr) {
atomic_inc(&rwm->owners);
rw_check_held(rwm);
- spin_lock(&current->pi_lock);
+ preempt_disable();
reader_count = current->reader_lock_count++;
if (likely(reader_count < MAX_RWLOCK_DEPTH)) {
rls = &current->owned_read_locks[reader_count];
@@ -1285,10 +1045,11 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
rt_rwlock_add_reader(rls, rwm);
} else
WARN_ON_ONCE(1);
- spin_unlock(&current->pi_lock);
+ preempt_enable();
}
rt_mutex_deadlock_account_lock(mutex, current);
atomic_inc(&rwm->count);
+
return 1;
}

@@ -1378,7 +1139,7 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
* but the lock got stolen by a higher prio task.
*/
if (!waiter.task) {
- task_blocks_on_rt_mutex(mutex, &waiter, 0, flags);
+ add_waiter(mutex, &waiter, &flags);
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
@@ -1417,7 +1178,7 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
}

if (unlikely(waiter.task))
- remove_waiter(mutex, &waiter, flags);
+ remove_waiter(mutex, &waiter);

WARN_ON(rt_mutex_owner(mutex) &&
rt_mutex_owner(mutex) != current &&
@@ -1430,6 +1191,9 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
if (mtx && unlikely(saved_lock_depth >= 0))
rt_reacquire_bkl(saved_lock_depth);

+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);
+
debug_rt_mutex_free_waiter(&waiter);
}

@@ -1457,13 +1221,13 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
atomic_inc(&rwm->owners);
rw_check_held(rwm);
local_irq_save(flags);
- spin_lock(&current->pi_lock);
reader_count = current->reader_lock_count++;
if (likely(reader_count < MAX_RWLOCK_DEPTH)) {
current->owned_read_locks[reader_count].lock = rwm;
current->owned_read_locks[reader_count].count = 1;
} else
WARN_ON_ONCE(1);
+
/*
* If this task is no longer the sole owner of the lock
* or someone is blocking, then we need to add the task
@@ -1473,16 +1237,12 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
struct rt_mutex *mutex = &rwm->mutex;
struct reader_lock_struct *rls;

- /* preserve lock order, we only need wait_lock now */
- spin_unlock(&current->pi_lock);
-
spin_lock(&mutex->wait_lock);
rls = &current->owned_read_locks[reader_count];
if (!rls->list.prev || list_empty(&rls->list))
- rt_rwlock_add_reader(rlw, rwm);
+ rt_rwlock_add_reader(rls, rwm);
spin_unlock(&mutex->wait_lock);
- } else
- spin_unlock(&current->pi_lock);
+ }
local_irq_restore(flags);
return 1;
}
@@ -1591,7 +1351,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
* but the lock got stolen by a higher prio task.
*/
if (!waiter.task) {
- task_blocks_on_rt_mutex(mutex, &waiter, 0, flags);
+ add_waiter(mutex, &waiter, &flags);
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
@@ -1630,7 +1390,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
}

if (unlikely(waiter.task))
- remove_waiter(mutex, &waiter, flags);
+ remove_waiter(mutex, &waiter);

/* check on unlock if we have any waiters. */
if (rt_mutex_has_waiters(mutex))
@@ -1642,6 +1402,9 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
if (mtx && unlikely(saved_lock_depth >= 0))
rt_reacquire_bkl(saved_lock_depth);

+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);
+
debug_rt_mutex_free_waiter(&waiter);

}
@@ -1733,7 +1496,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)

for (i = current->reader_lock_count - 1; i >= 0; i--) {
if (current->owned_read_locks[i].lock == rwm) {
- spin_lock(&current->pi_lock);
+ preempt_disable();
current->owned_read_locks[i].count--;
if (!current->owned_read_locks[i].count) {
current->reader_lock_count--;
@@ -1743,9 +1506,11 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)
WARN_ON(!rls->list.prev || list_empty(&rls->list));
list_del_init(&rls->list);
rls->lock = NULL;
+ task_pi_deboost(current, &rls->pi_src,
+ PI_FLAG_DEFER_UPDATE);
rw_check_held(rwm);
}
- spin_unlock(&current->pi_lock);
+ preempt_enable();
break;
}
}
@@ -1776,7 +1541,6 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)

/* If no one is blocked, then clear all ownership */
if (!rt_mutex_has_waiters(mutex)) {
- rwm->prio = MAX_PRIO;
/*
* If count is not zero, we are under the limit with
* no other readers.
@@ -1835,28 +1599,11 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)
rt_mutex_set_owner(mutex, RT_RW_READER, 0);
}

- if (rt_mutex_has_waiters(mutex)) {
- waiter = rt_mutex_top_waiter(mutex);
- rwm->prio = waiter->task->prio;
- /*
- * If readers still own this lock, then we need
- * to update the pi_list too. Readers have a separate
- * path in the PI chain.
- */
- if (reader_count) {
- spin_lock(&pendowner->pi_lock);
- plist_del(&waiter->pi_list_entry,
- &pendowner->pi_waiters);
- spin_unlock(&pendowner->pi_lock);
- }
- } else
- rwm->prio = MAX_PRIO;
-
out:
spin_unlock_irqrestore(&mutex->wait_lock, flags);

- /* Undo pi boosting.when necessary */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

static inline void
@@ -1874,9 +1621,9 @@ rt_read_fastunlock(struct rw_mutex *rwm,
int reader_count;
int owners;

- spin_lock_irqsave(&current->pi_lock, flags);
+ local_irq_save(flags);
reader_count = --current->reader_lock_count;
- spin_unlock_irqrestore(&current->pi_lock, flags);
+ local_irq_restore(flags);

rt_mutex_deadlock_account_unlock(current);
if (unlikely(reader_count < 0)) {
@@ -1972,17 +1719,7 @@ rt_write_slowunlock(struct rw_mutex *rwm, int mtx)
while (waiter && !waiter->write_lock) {
struct task_struct *reader = waiter->task;

- spin_lock(&pendowner->pi_lock);
- plist_del(&waiter->list_entry, &mutex->wait_list);
-
- /* nop if not on a list */
- plist_del(&waiter->pi_list_entry, &pendowner->pi_waiters);
- spin_unlock(&pendowner->pi_lock);
-
- spin_lock(&reader->pi_lock);
- waiter->task = NULL;
- reader->pi_blocked_on = NULL;
- spin_unlock(&reader->pi_lock);
+ remove_waiter(mutex, waiter);

if (savestate)
wake_up_process_mutex(reader);
@@ -1995,32 +1732,12 @@ rt_write_slowunlock(struct rw_mutex *rwm, int mtx)
waiter = NULL;
}

- /* If a writer is still pending, then update its plist. */
- if (rt_mutex_has_waiters(mutex)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(mutex);
-
- spin_lock(&pendowner->pi_lock);
- /* delete incase we didn't go through the loop */
- plist_del(&next->pi_list_entry, &pendowner->pi_waiters);
-
- /* This could also be a reader (if reader_limit is set) */
- if (next->write_lock)
- /* add back in as top waiter */
- plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
- spin_unlock(&pendowner->pi_lock);
-
- rwm->prio = next->task->prio;
- } else
- rwm->prio = MAX_PRIO;
-
out:

spin_unlock_irqrestore(&mutex->wait_lock, flags);

- /* Undo pi boosting.when necessary */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

static inline void
@@ -2068,7 +1785,7 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
atomic_inc(&rwm->owners);
rw_check_held(rwm);

- spin_lock(&current->pi_lock);
+ preempt_disable();
reader_count = current->reader_lock_count++;
rls = &current->owned_read_locks[reader_count];
if (likely(reader_count < MAX_RWLOCK_DEPTH)) {
@@ -2076,12 +1793,11 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
rls->count = 1;
} else
WARN_ON_ONCE(1);
- spin_unlock(&current->pi_lock);
+ preempt_enable();

if (!rt_mutex_has_waiters(mutex)) {
/* We are sole owner, we are done */
rwm->owner = current;
- rwm->prio = MAX_PRIO;
mutex->owner = NULL;
spin_unlock_irqrestore(&mutex->wait_lock, flags);
return;
@@ -2102,17 +1818,8 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
while (waiter && !waiter->write_lock) {
struct task_struct *reader = waiter->task;

- spin_lock(&current->pi_lock);
plist_del(&waiter->list_entry, &mutex->wait_list);
-
- /* nop if not on a list */
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
- spin_unlock(&current->pi_lock);
-
- spin_lock(&reader->pi_lock);
waiter->task = NULL;
- reader->pi_blocked_on = NULL;
- spin_unlock(&reader->pi_lock);

/* downgrade is only for mutexes */
wake_up_process(reader);
@@ -2123,124 +1830,81 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
waiter = NULL;
}

- /* If a writer is still pending, then update its plist. */
- if (rt_mutex_has_waiters(mutex)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(mutex);
-
- /* setup this mutex prio for read */
- rwm->prio = next->task->prio;
-
- spin_lock(&current->pi_lock);
- /* delete incase we didn't go through the loop */
- plist_del(&next->pi_list_entry, &current->pi_waiters);
- spin_unlock(&current->pi_lock);
- /* No need to add back since readers don't have PI waiters */
- } else
- rwm->prio = MAX_PRIO;
-
rt_mutex_set_owner(mutex, RT_RW_READER, 0);

spin_unlock_irqrestore(&mutex->wait_lock, flags);
-
- /*
- * Undo pi boosting when necessary.
- * If one of the awoken readers boosted us, we don't want to keep
- * that priority.
- */
- rt_mutex_adjust_prio(current);
-}
-
-void rt_mutex_rwsem_init(struct rw_mutex *rwm, const char *name)
-{
- struct rt_mutex *mutex = &rwm->mutex;
-
- rwm->owner = NULL;
- atomic_set(&rwm->count, 0);
- atomic_set(&rwm->owners, 0);
- rwm->prio = MAX_PRIO;
- INIT_LIST_HEAD(&rwm->readers);
-
- __rt_mutex_init(mutex, name);
}

-static int rt_mutex_get_readers_prio(struct task_struct *task, int prio)
+/*
+ * These callbacks are invoked whenever a rwlock has changed priority.
+ * Since rwlocks maintain their own lists of reader dependencies, we
+ * may need to reboost any readers manually
+ */
+static inline int rt_rwlock_pi_boost(struct pi_sink *snk,
+ struct pi_source *src,
+ unsigned int flags)
{
- struct reader_lock_struct *rls;
struct rw_mutex *rwm;
- int lock_prio;
- int i;

- for (i = 0; i < task->reader_lock_count; i++) {
- rls = &task->owned_read_locks[i];
- rwm = rls->lock;
- if (rwm) {
- lock_prio = rwm->prio;
- if (prio > lock_prio)
- prio = lock_prio;
- }
- }
+ rwm = container_of(snk, struct rw_mutex, pi_snk);

- return prio;
+ /*
+ * We don't need to take any locks here because the
+ * lock->pi.node interlock is already guaranteeing mutual
+ * exclusion.
+ */
+ rwm->prio = *src->prio;
+
+ return 0;
}

-static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- struct rt_mutex *lock,
- int recursion_depth)
+static inline int rt_rwlock_pi_update(struct pi_sink *snk,
+ unsigned int flags)
{
+ struct rw_mutex *rwm;
+ struct rt_mutex *mutex;
struct reader_lock_struct *rls;
- struct rt_mutex_waiter *waiter;
- struct task_struct *task;
- struct rw_mutex *rwm = container_of(lock, struct rw_mutex, mutex);
+ unsigned long iflags;

- if (rt_mutex_has_waiters(lock)) {
- waiter = rt_mutex_top_waiter(lock);
- /*
- * Do we need to grab the task->pi_lock?
- * Really, we are only reading it. If it
- * changes, then that should follow this chain
- * too.
- */
- rwm->prio = waiter->task->prio;
- } else
- rwm->prio = MAX_PRIO;
+ rwm = container_of(snk, struct rw_mutex, pi_snk);
+ mutex = &rwm->mutex;

- if (recursion_depth >= MAX_RWLOCK_DEPTH) {
- WARN_ON(1);
- return 1;
- }
+ spin_lock_irqsave(&mutex->wait_lock, iflags);

- list_for_each_entry(rls, &rwm->readers, list) {
- task = rls->task;
- get_task_struct(task);
- /*
- * rt_mutex_adjust_prio_chain will do
- * the put_task_struct
- */
- rt_mutex_adjust_prio_chain(task, 0, orig_lock,
- orig_waiter, top_task,
- recursion_depth+1);
- }
+ list_for_each_entry(rls, &rwm->readers, list)
+ task_pi_boost(rls->task, &rls->pi_src, 0);
+
+ spin_unlock_irqrestore(&mutex->wait_lock, iflags);

return 0;
}
-#else
-static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- struct rt_mutex *lock,
- int recursion_depth)
-{
- return 0;
-}

-static int rt_mutex_get_readers_prio(struct task_struct *task, int prio)
+static struct pi_sink rt_rwlock_pi_snk = {
+ .boost = rt_rwlock_pi_boost,
+ .update = rt_rwlock_pi_update,
+};
+
+void rt_mutex_rwsem_init(struct rw_mutex *rwm, const char *name)
{
- return prio;
+ struct rt_mutex *mutex = &rwm->mutex;
+
+ rwm->owner = NULL;
+ atomic_set(&rwm->count, 0);
+ atomic_set(&rwm->owners, 0);
+ rwm->prio = MAX_PRIO;
+ INIT_LIST_HEAD(&rwm->readers);
+
+ __rt_mutex_init(mutex, name);
+
+ /*
+ * Link the rwlock object to the mutex so we get notified
+ * of any priority changes in the future
+ */
+ rwm->pi_snk = rt_rwlock_pi_snk;
+ pi_add_sink(&mutex->pi.node, &rwm->pi_snk,
+ PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
}
+
#endif /* CONFIG_PREEMPT_RT */

static inline int rt_release_bkl(struct rt_mutex *lock, unsigned long flags)
@@ -2335,8 +1999,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
* but the lock got stolen by a higher prio task.
*/
if (!waiter.task) {
- ret = task_blocks_on_rt_mutex(lock, &waiter,
- detect_deadlock, flags);
+ ret = add_waiter(lock, &waiter, &flags);
/*
* If we got woken up by the owner then start loop
* all over without going into schedule to try
@@ -2374,7 +2037,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
set_current_state(TASK_RUNNING);

if (unlikely(waiter.task))
- remove_waiter(lock, &waiter, flags);
+ remove_waiter(lock, &waiter);

/*
* try_to_take_rt_mutex() sets the waiter bit
@@ -2388,13 +2051,8 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
if (unlikely(timeout))
hrtimer_cancel(&timeout->timer);

- /*
- * Readjust priority, when we did not get the lock. We might
- * have been the pending owner and boosted. Since we did not
- * take the lock, the PI boost has to go.
- */
- if (unlikely(ret))
- rt_mutex_adjust_prio(current);
+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);

/* Must we reaquire the BKL? */
if (unlikely(saved_lock_depth >= 0))
@@ -2457,8 +2115,8 @@ rt_mutex_slowunlock(struct rt_mutex *lock)

spin_unlock_irqrestore(&lock->wait_lock, flags);

- /* Undo pi boosting if necessary: */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

/*
@@ -2654,6 +2312,8 @@ void __rt_mutex_init(struct rt_mutex *lock, const char *name)
spin_lock_init(&lock->wait_lock);
plist_head_init(&lock->wait_list, &lock->wait_lock);

+ init_pi(lock);
+
debug_rt_mutex_init(lock, name);
}
EXPORT_SYMBOL_GPL(__rt_mutex_init);
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 70df5f5..7bf32d0 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -14,6 +14,7 @@

#include <linux/rtmutex.h>
#include <linux/rt_lock.h>
+#include <linux/pi.h>

/*
* The rtmutex in kernel tester is independent of rtmutex debugging. We
@@ -48,10 +49,13 @@ extern void schedule_rt_mutex_test(struct rt_mutex *lock);
*/
struct rt_mutex_waiter {
struct plist_node list_entry;
- struct plist_node pi_list_entry;
struct task_struct *task;
struct rt_mutex *lock;
int write_lock;
+ struct {
+ struct pi_sink snk;
+ int prio;
+ } pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
unsigned long ip;
struct pid *deadlock_task_pid;
@@ -79,18 +83,6 @@ rt_mutex_top_waiter(struct rt_mutex *lock)
return w;
}

-static inline int task_has_pi_waiters(struct task_struct *p)
-{
- return !plist_head_empty(&p->pi_waiters);
-}
-
-static inline struct rt_mutex_waiter *
-task_top_pi_waiter(struct task_struct *p)
-{
- return plist_first_entry(&p->pi_waiters, struct rt_mutex_waiter,
- pi_list_entry);
-}
-
/*
* lock->owner state tracking:
*/
diff --git a/kernel/rwlock_torture.c b/kernel/rwlock_torture.c
index 2820815..689a0d0 100644
--- a/kernel/rwlock_torture.c
+++ b/kernel/rwlock_torture.c
@@ -682,37 +682,7 @@ static int __init mutex_stress_init(void)

print_owned_read_locks(tsks[i]);

- if (tsks[i]->pi_blocked_on) {
- w = (void *)tsks[i]->pi_blocked_on;
- mtx = w->lock;
- spin_unlock_irq(&tsks[i]->pi_lock);
- spin_lock_irq(&mtx->wait_lock);
- spin_lock(&tsks[i]->pi_lock);
- own = (unsigned long)mtx->owner & ~3UL;
- oops_in_progress++;
- printk("%s:%d is blocked on ",
- tsks[i]->comm, tsks[i]->pid);
- __print_symbol("%s", (unsigned long)mtx);
- if (own == 0x100)
- printk(" owner is READER\n");
- else if (!(own & ~300))
- printk(" owner is ILLEGAL!!\n");
- else if (!own)
- printk(" has no owner!\n");
- else {
- struct task_struct *owner = (void*)own;
-
- printk(" owner is %s:%d\n",
- owner->comm, owner->pid);
- }
- oops_in_progress--;
-
- spin_unlock(&tsks[i]->pi_lock);
- spin_unlock_irq(&mtx->wait_lock);
- } else {
- print_owned_read_locks(tsks[i]);
- spin_unlock_irq(&tsks[i]->pi_lock);
- }
+ spin_unlock_irq(&tsks[i]->pi_lock);
}
}
#endif
diff --git a/kernel/sched.c b/kernel/sched.c
index c129b10..363fc86 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2390,12 +2390,6 @@ task_pi_init(struct task_struct *p)
pi_source_init(&p->pi.src, &p->normal_prio);
task_pi_boost(p, &p->pi.src, PI_FLAG_DEFER_UPDATE);

-#ifdef CONFIG_RT_MUTEXES
- p->rtmutex_prio = MAX_PRIO;
- pi_source_init(&p->rtmutex_prio_src, &p->rtmutex_prio);
- task_pi_boost(p, &p->rtmutex_prio_src, PI_FLAG_DEFER_UPDATE);
-#endif
-
/*
* We add our own task as a dependency of ourselves so that
* we get boost-notifications (via task_pi_boost_cb) whenever
@@ -5006,7 +5000,6 @@ task_pi_update_cb(struct pi_sink *snk, unsigned int flags)
*/
if (unlikely(p == rq->idle)) {
WARN_ON(p != rq->curr);
- WARN_ON(p->pi_blocked_on);
goto out_unlock;
}

@@ -5337,7 +5330,6 @@ recheck:
spin_unlock_irqrestore(&p->pi_lock, flags);

task_pi_update(p, 0);
- rt_mutex_adjust_pi(p);

return 0;
}
@@ -8471,10 +8463,6 @@ void __init sched_init(void)

task_pi_init(&init_task);

-#ifdef CONFIG_RT_MUTEXES
- plist_head_init(&init_task.pi_waiters, &init_task.pi_lock);
-#endif
-
/*
* The boot idle thread does lazy MMU switching as well:
*/

2008-08-01 21:26:29

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC 7/7] rtmutex: pi-boost locks as late as possible

Adaptive-locking technology often acquires the lock by
spinning on a running owner instead of sleeping. It is unnecessary
to go through pi-boosting if the owner is of equal or (logically)
lower priority. Therefore, we can save significant overhead
by deferring the boost until it is absolutely necessary. This has
been shown to improve overall performance in PREEMPT_RT.
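
Put another way, the whole optimization hinges on one priority
comparison made before any pi machinery is engaged. A minimal
sketch of that test (the helper name is hypothetical and not part
of the patch; lower numeric ->prio means higher priority):

        static inline int waiter_should_boost(struct task_struct *waiter,
                                              struct task_struct *owner)
        {
                /* boost only if the spinning waiter outranks the owner */
                return waiter->prio < owner->prio;
        }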

Special thanks to Peter Morreale for suggesting the optimization to
only consider skipping the boost if the owner is >= to current

Signed-off-by: Gregory Haskins <[email protected]>
CC: Peter Morreale <[email protected]>
---

include/linux/rtmutex.h | 1
kernel/rtmutex.c | 195 ++++++++++++++++++++++++++++++++++++-----------
kernel/rtmutex_common.h | 1
3 files changed, 153 insertions(+), 44 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index d984244..1d98107 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -33,6 +33,7 @@ struct rt_mutex {
struct pi_node node;
struct pi_sink snk;
int prio;
+ int boosters;
} pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
int save_state;
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 0f64298..de213ac 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -76,14 +76,15 @@ rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner,
{
unsigned long val = (unsigned long)owner | mask;

- if (rt_mutex_has_waiters(lock)) {
+ if (lock->pi.boosters) {
struct task_struct *prev_owner = rt_mutex_owner(lock);

rtmutex_pi_owner(lock, prev_owner, 0);
rtmutex_pi_owner(lock, owner, 1);
+ }

+ if (rt_mutex_has_waiters(lock))
val |= RT_MUTEX_HAS_WAITERS;
- }

lock->owner = (struct task_struct *)val;
}
@@ -177,7 +178,7 @@ static inline int rtmutex_pi_update(struct pi_sink *snk,

spin_lock_irqsave(&lock->wait_lock, iflags);

- if (rt_mutex_has_waiters(lock)) {
+ if (lock->pi.boosters) {
owner = rt_mutex_owner(lock);

if (owner && owner != RT_RW_READER) {
@@ -206,6 +207,7 @@ static void init_pi(struct rt_mutex *lock)
pi_node_init(&lock->pi.node);

lock->pi.prio = MAX_PRIO;
+ lock->pi.boosters = 0;
pi_source_init(&lock->pi.src, &lock->pi.prio);
lock->pi.snk = rtmutex_pi_snk;

@@ -303,6 +305,16 @@ static inline int try_to_take_rt_mutex(struct rt_mutex *lock)
return do_try_to_take_rt_mutex(lock, STEAL_NORMAL);
}

+static inline void requeue_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ BUG_ON(!waiter->task);
+
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
+}
+
/*
* These callbacks are invoked whenever a waiter has changed priority.
* So we should requeue it within the lock->wait_list
@@ -343,11 +355,8 @@ static inline int rtmutex_waiter_pi_update(struct pi_sink *snk,
* pi list. Therefore, if waiter->pi.prio has changed since we
* queued ourselves, requeue it.
*/
- if (waiter->task && waiter->list_entry.prio != waiter->pi.prio) {
- plist_del(&waiter->list_entry, &lock->wait_list);
- plist_node_init(&waiter->list_entry, waiter->pi.prio);
- plist_add(&waiter->list_entry, &lock->wait_list);
- }
+ if (waiter->task && waiter->list_entry.prio != waiter->pi.prio)
+ requeue_waiter(lock, waiter);

spin_unlock_irqrestore(&lock->wait_lock, iflags);

@@ -359,20 +368,9 @@ static struct pi_sink rtmutex_waiter_pi_snk = {
.update = rtmutex_waiter_pi_update,
};

-/*
- * This must be called with lock->wait_lock held.
- */
-static int add_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter,
- unsigned long *flags)
+static void boost_lock(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
{
- int has_waiters = rt_mutex_has_waiters(lock);
-
- waiter->task = current;
- waiter->lock = lock;
- waiter->pi.prio = current->prio;
- plist_node_init(&waiter->list_entry, waiter->pi.prio);
- plist_add(&waiter->list_entry, &lock->wait_list);
waiter->pi.snk = rtmutex_waiter_pi_snk;

/*
@@ -397,35 +395,28 @@ static int add_waiter(struct rt_mutex *lock,
* If we previously had no waiters, we are transitioning to
* a mode where we need to boost the owner
*/
- if (!has_waiters) {
+ if (!lock->pi.boosters) {
struct task_struct *owner = rt_mutex_owner(lock);
rtmutex_pi_owner(lock, owner, 1);
}

- spin_unlock_irqrestore(&lock->wait_lock, *flags);
- task_pi_update(current, 0);
- spin_lock_irqsave(&lock->wait_lock, *flags);
-
- return 0;
+ lock->pi.boosters++;
+ waiter->pi.boosted = 1;
}

-/*
- * Remove a waiter from a lock
- *
- * Must be called with lock->wait_lock held
- */
-static void remove_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter)
+static void deboost_lock(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ struct task_struct *p)
{
- struct task_struct *p = waiter->task;
+ BUG_ON(!waiter->pi.boosted);

- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->task = NULL;
+ waiter->pi.boosted = 0;
+ lock->pi.boosters--;

/*
* We can stop boosting the owner if there are no more waiters
*/
- if (!rt_mutex_has_waiters(lock)) {
+ if (!lock->pi.boosters) {
struct task_struct *owner = rt_mutex_owner(lock);
rtmutex_pi_owner(lock, owner, 0);
}
@@ -446,6 +437,51 @@ static void remove_waiter(struct rt_mutex *lock,
}

/*
+ * This must be called with lock->wait_lock held.
+ */
+static void _add_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ waiter->task = current;
+ waiter->lock = lock;
+ waiter->pi.prio = current->prio;
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
+}
+
+static int add_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ unsigned long *flags)
+{
+ _add_waiter(lock, waiter);
+
+ boost_lock(lock, waiter);
+
+ spin_unlock_irqrestore(&lock->wait_lock, *flags);
+ task_pi_update(current, 0);
+ spin_lock_irqsave(&lock->wait_lock, *flags);
+
+ return 0;
+}
+
+/*
+ * Remove a waiter from a lock
+ *
+ * Must be called with lock->wait_lock held
+ */
+static void remove_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ struct task_struct *p = waiter->task;
+
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ waiter->task = NULL;
+
+ if (waiter->pi.boosted)
+ deboost_lock(lock, waiter, p);
+}
+
+/*
* Wake up the next waiter on the lock.
*
* Remove the top waiter from the current tasks waiter list and from
@@ -558,6 +594,24 @@ static int adaptive_wait(struct rt_mutex_waiter *waiter,
if (orig_owner != rt_mutex_owner(waiter->lock))
return 0;

+ /* Special handling for when we are not in pi-boost mode */
+ if (!waiter->pi.boosted) {
+ /*
+ * Are we higher priority than the owner? If so
+ * we should bail out immediately so that we can
+ * pi boost them.
+ */
+ if (current->prio < orig_owner->prio)
+ return 0;
+
+ /*
+ * Did our priority change? If so, we need to
+ * requeue our position in the list
+ */
+ if (waiter->pi.prio != current->prio)
+ return 0;
+ }
+
/* Owner went to bed, so should we */
if (!task_is_current(orig_owner))
return 1;
@@ -599,6 +653,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
unsigned long saved_state, state, flags;
struct task_struct *orig_owner;
int missed = 0;
+ int boosted = 0;

init_waiter(&waiter);

@@ -631,26 +686,54 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
}
missed = 1;

+ orig_owner = rt_mutex_owner(lock);
+
/*
* waiter.task is NULL the first time we come here and
* when we have been woken up by the previous owner
* but the lock got stolen by an higher prio task.
*/
- if (!waiter.task) {
- add_waiter(lock, &waiter, &flags);
+ if (!waiter.task)
+ _add_waiter(lock, &waiter);
+
+ /*
+ * We only need to pi-boost the owner if they are lower
+ * priority than us. We dont care if this is racy
+ * against priority changes as we will break out of
+ * the adaptive spin anytime any priority changes occur
+ * without boosting enabled.
+ */
+ if (!waiter.pi.boosted && current->prio < orig_owner->prio) {
+ boost_lock(lock, &waiter);
+ boosted = 1;
+
+ spin_unlock_irqrestore(&lock->wait_lock, flags);
+ task_pi_update(current, 0);
+ spin_lock_irqsave(&lock->wait_lock, flags);
+
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
}

/*
+ * If we are not currently pi-boosting the lock, we have to
+ * monitor whether our priority changed since the last
+ * time it was recorded and requeue ourselves if it moves.
+ */
+ if (!waiter.pi.boosted && waiter.pi.prio != current->prio) {
+ waiter.pi.prio = current->prio;
+
+ requeue_waiter(lock, &waiter);
+ }
+
+ /*
* Prevent schedule() to drop BKL, while waiting for
* the lock ! We restore lock_depth when we come back.
*/
saved_flags = current->flags & PF_NOSCHED;
current->lock_depth = -1;
current->flags &= ~PF_NOSCHED;
- orig_owner = rt_mutex_owner(lock);
get_task_struct(orig_owner);
spin_unlock_irqrestore(&lock->wait_lock, flags);

@@ -664,6 +747,24 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* barrier which we rely upon to ensure current->state
* is visible before we test waiter.task.
*/
+ if (waiter.task && !waiter.pi.boosted) {
+ spin_lock_irqsave(&lock->wait_lock, flags);
+
+ /*
+ * We get here if we have not yet boosted
+ * the lock, yet we are going to sleep. If
+ * we are still pending (waiter.task != 0),
+ * then go ahead and boost them now
+ */
+ if (waiter.task) {
+ boost_lock(lock, &waiter);
+ boosted = 1;
+ }
+
+ spin_unlock_irqrestore(&lock->wait_lock, flags);
+ task_pi_update(current, 0);
+ }
+
if (waiter.task)
schedule_rt_mutex(lock);
} else
@@ -696,7 +797,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
spin_unlock_irqrestore(&lock->wait_lock, flags);

/* Undo any pi boosting, if necessary */
- task_pi_update(current, 0);
+ if (boosted)
+ task_pi_update(current, 0);

debug_rt_mutex_free_waiter(&waiter);
}
@@ -708,6 +810,7 @@ static void noinline __sched
rt_spin_lock_slowunlock(struct rt_mutex *lock)
{
unsigned long flags;
+ int deboost = 0;

spin_lock_irqsave(&lock->wait_lock, flags);

@@ -721,12 +824,16 @@ rt_spin_lock_slowunlock(struct rt_mutex *lock)
return;
}

+ if (lock->pi.boosters)
+ deboost = 1;
+
wakeup_next_waiter(lock, 1);

spin_unlock_irqrestore(&lock->wait_lock, flags);

- /* Undo pi boosting when necessary */
- task_pi_update(current, 0);
+ if (deboost)
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

void __lockfunc rt_spin_lock(spinlock_t *lock)
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 7bf32d0..34e2381 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -55,6 +55,7 @@ struct rt_mutex_waiter {
struct {
struct pi_sink snk;
int prio;
+ int boosted;
} pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
unsigned long ip;

2008-08-04 13:23:36

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH RT RFC 7/7] rtmutex: pi-boost locks as late as possible

Gregory Haskins wrote:
> Adaptive-locking technology often acquires the lock by
> spinning on a running owner instead of sleeping. It is unnecessary
> to go through pi-boosting if the owner is of equal or (logically)
> lower priority. Therefore, we can save significant overhead
> by deferring the boost until it is absolutely necessary. This has
> been shown to improve overall performance in PREEMPT_RT.
>
> Special thanks to Peter Morreale for suggesting the optimization to
> only consider skipping the boost if the owner is >= to current
>
> Signed-off-by: Gregory Haskins <[email protected]>
> CC: Peter Morreale <[email protected]>
>

I received feedback that this prologue was too vague to accurately
describe what this patch does and why this optimization is not
broken. Therefore, here is the new prologue:

-----------------------
From: Gregory Haskins <[email protected]>

rtmutex: pi-boost locks as late as possible

PREEMPT_RT replaces most spinlock_t instances with a preemptible
real-time lock that supports priority inheritance. An uncontended
(fastpath) acquisition of this lock has no more overhead than
its non-rt spinlock_t counterpart. However, the contended case
has considerably more overhead so that the lock can maintain
proper priority queue order and support pi-boosting of the lock
owner, while remaining fully preemptible.
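
(For reference, the fastpath in question is essentially a single
atomic compare-and-exchange on the owner field. The sketch below is
illustrative only; rt_mutex_cmpxchg() is the existing helper in
kernel/rtmutex.c, while the wrapper name here is hypothetical.)

        static inline int rt_mutex_try_fastpath(struct rt_mutex *lock)
        {
                /* NULL owner: take it; otherwise fall into the slowpath */
                return rt_mutex_cmpxchg(lock, NULL, current);
        }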

Instrumentation shows that the majority of acquisitions under most
workloads fall either into the fastpath category or the adaptive-spin
category within the slowpath. The need to pi-boost a lock owner
should be sufficiently rare, yet the slowpath blindly incurs this
overhead in 100% of contentions.

Therefore, this patch intends to capitalize on this observation
in order to reduce overhead and improve acquisition throughput.
It is important to note that real-time latency is still treated
as a higher-order constraint than throughput, so the full
pi-protocol is still observed, via new, carefully constructed rules
around the old concepts:

1) We check the priority of the owner relative to the waiter on
each spin of the lock (if we are not boosted already). If the
owner's effective priority is logically less than the waiter's
priority, we must boost them.

2) We check our own priority against our current queue
position on the waiters-list (if we are not boosted already).
If our priority has changed, we need to re-queue ourselves to
update our position.

3) We break out of the adaptive-spin if either of the above
conditions (1) or (2) changes, so that we can re-evaluate the
lock conditions.

4) We must enter pi-boost mode if, at any time, we decide to
voluntarily preempt, since we then lose the ability to
dynamically process the conditions above (see the sketch below).
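
A condensed sketch of how rules (1) and (2) surface in the
adaptive-spin check (folded into one hypothetical helper here for
readability; the real logic lives in the adaptive_wait() and
rt_spin_lock_slowlock() hunks below):

        /*
         * Nonzero means a spinning, not-yet-boosted waiter must stop
         * spinning so the slowpath can boost the owner (rule 1) or
         * requeue the waiter (rule 2).  Lower ->prio == higher priority.
         */
        static inline int spin_needs_break(struct rt_mutex_waiter *waiter,
                                           struct task_struct *owner)
        {
                if (waiter->pi.boosted)
                        return 0;       /* already boosting: keep spinning */

                if (current->prio < owner->prio)        /* rule (1) */
                        return 1;

                if (waiter->pi.prio != current->prio)   /* rule (2) */
                        return 1;

                return 0;
        }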

Note: We still fully support priority inheritance with this
protocol, even if we defer the low-level calls to adjust priority.
The difference is really in terms of being a proactive protocol
(boost on entry) versus a reactive protocol (boost when
necessary). The upside to the latter is that we don't take a
penalty for pi when it is not necessary. The downside is that we
technically leave the owner exposed to getting preempted, even if
our waiter is the highest priority task in the system. When this
happens, the owner would be immediately boosted (because we would
hit the "oncpu" condition, and subsequently follow the voluntary
preempt path which boosts the owner). Therefore, inversion is
fully prevented, but we have the extra latency of the
preempt/boost/wakeup that could have been avoided in the proactive
model.
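
Rule (4) is the safety net for exactly that window. Condensed from
the rt_spin_lock_slowlock() hunk below (loop and error handling
trimmed):

        /*
         * About to voluntarily sleep without having boosted yet: boost
         * now, under wait_lock, so that no inversion can persist while
         * we are off the cpu.
         */
        if (waiter.task && !waiter.pi.boosted) {
                spin_lock_irqsave(&lock->wait_lock, flags);
                if (waiter.task)        /* still a pending waiter? */
                        boost_lock(lock, &waiter);
                spin_unlock_irqrestore(&lock->wait_lock, flags);
                task_pi_update(current, 0);
        }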

However, the design of the algorithm described above confines this
phenomenon to setscheduler()
operations. Since rt-locks do not support being interrupted by
signals or timeouts, waiters only depart via the acquisition path.
And while acquisitions do deboost the owner, the owner also
changes simultaneously, rendering the deboost moot relative to the
other waiters.

What this all means is that the downside to this implementation is
that a high-priority waiter *may* see an extra latency (equivalent
to roughly two wake-ups) if the owner has its priority reduced via
setscheduler() while it holds the lock. The penalty is
deterministic, arguably small enough, and sufficiently rare that I
do not believe it should be an issue.

Note: If other exit paths are ever introduced in the
future, simply adapting the condition to look at owner->normal_prio
instead of owner->prio should once again constrain the limitation
to setscheduler().

Special thanks to Peter Morreale for suggesting the optimization to
only consider skipping the boost if the owner is >= to current.

Signed-off-by: Gregory Haskins <[email protected]>
CC: Peter Morreale <[email protected]>


> ---
>
> include/linux/rtmutex.h | 1
> kernel/rtmutex.c | 195 ++++++++++++++++++++++++++++++++++++-----------
> kernel/rtmutex_common.h | 1
> 3 files changed, 153 insertions(+), 44 deletions(-)
>
> diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
> index d984244..1d98107 100644
> --- a/include/linux/rtmutex.h
> +++ b/include/linux/rtmutex.h
> @@ -33,6 +33,7 @@ struct rt_mutex {
> struct pi_node node;
> struct pi_sink snk;
> int prio;
> + int boosters;
> } pi;
> #ifdef CONFIG_DEBUG_RT_MUTEXES
> int save_state;
> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> index 0f64298..de213ac 100644
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -76,14 +76,15 @@ rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner,
> {
> unsigned long val = (unsigned long)owner | mask;
>
> - if (rt_mutex_has_waiters(lock)) {
> + if (lock->pi.boosters) {
> struct task_struct *prev_owner = rt_mutex_owner(lock);
>
> rtmutex_pi_owner(lock, prev_owner, 0);
> rtmutex_pi_owner(lock, owner, 1);
> + }
>
> + if (rt_mutex_has_waiters(lock))
> val |= RT_MUTEX_HAS_WAITERS;
> - }
>
> lock->owner = (struct task_struct *)val;
> }
> @@ -177,7 +178,7 @@ static inline int rtmutex_pi_update(struct pi_sink *snk,
>
> spin_lock_irqsave(&lock->wait_lock, iflags);
>
> - if (rt_mutex_has_waiters(lock)) {
> + if (lock->pi.boosters) {
> owner = rt_mutex_owner(lock);
>
> if (owner && owner != RT_RW_READER) {
> @@ -206,6 +207,7 @@ static void init_pi(struct rt_mutex *lock)
> pi_node_init(&lock->pi.node);
>
> lock->pi.prio = MAX_PRIO;
> + lock->pi.boosters = 0;
> pi_source_init(&lock->pi.src, &lock->pi.prio);
> lock->pi.snk = rtmutex_pi_snk;
>
> @@ -303,6 +305,16 @@ static inline int try_to_take_rt_mutex(struct rt_mutex *lock)
> return do_try_to_take_rt_mutex(lock, STEAL_NORMAL);
> }
>
> +static inline void requeue_waiter(struct rt_mutex *lock,
> + struct rt_mutex_waiter *waiter)
> +{
> + BUG_ON(!waiter->task);
> +
> + plist_del(&waiter->list_entry, &lock->wait_list);
> + plist_node_init(&waiter->list_entry, waiter->pi.prio);
> + plist_add(&waiter->list_entry, &lock->wait_list);
> +}
> +
> /*
> * These callbacks are invoked whenever a waiter has changed priority.
> * So we should requeue it within the lock->wait_list
> @@ -343,11 +355,8 @@ static inline int rtmutex_waiter_pi_update(struct pi_sink *snk,
> * pi list. Therefore, if waiter->pi.prio has changed since we
> * queued ourselves, requeue it.
> */
> - if (waiter->task && waiter->list_entry.prio != waiter->pi.prio) {
> - plist_del(&waiter->list_entry, &lock->wait_list);
> - plist_node_init(&waiter->list_entry, waiter->pi.prio);
> - plist_add(&waiter->list_entry, &lock->wait_list);
> - }
> + if (waiter->task && waiter->list_entry.prio != waiter->pi.prio)
> + requeue_waiter(lock, waiter);
>
> spin_unlock_irqrestore(&lock->wait_lock, iflags);
>
> @@ -359,20 +368,9 @@ static struct pi_sink rtmutex_waiter_pi_snk = {
> .update = rtmutex_waiter_pi_update,
> };
>
> -/*
> - * This must be called with lock->wait_lock held.
> - */
> -static int add_waiter(struct rt_mutex *lock,
> - struct rt_mutex_waiter *waiter,
> - unsigned long *flags)
> +static void boost_lock(struct rt_mutex *lock,
> + struct rt_mutex_waiter *waiter)
> {
> - int has_waiters = rt_mutex_has_waiters(lock);
> -
> - waiter->task = current;
> - waiter->lock = lock;
> - waiter->pi.prio = current->prio;
> - plist_node_init(&waiter->list_entry, waiter->pi.prio);
> - plist_add(&waiter->list_entry, &lock->wait_list);
> waiter->pi.snk = rtmutex_waiter_pi_snk;
>
> /*
> @@ -397,35 +395,28 @@ static int add_waiter(struct rt_mutex *lock,
> * If we previously had no waiters, we are transitioning to
> * a mode where we need to boost the owner
> */
> - if (!has_waiters) {
> + if (!lock->pi.boosters) {
> struct task_struct *owner = rt_mutex_owner(lock);
> rtmutex_pi_owner(lock, owner, 1);
> }
>
> - spin_unlock_irqrestore(&lock->wait_lock, *flags);
> - task_pi_update(current, 0);
> - spin_lock_irqsave(&lock->wait_lock, *flags);
> -
> - return 0;
> + lock->pi.boosters++;
> + waiter->pi.boosted = 1;
> }
>
> -/*
> - * Remove a waiter from a lock
> - *
> - * Must be called with lock->wait_lock held
> - */
> -static void remove_waiter(struct rt_mutex *lock,
> - struct rt_mutex_waiter *waiter)
> +static void deboost_lock(struct rt_mutex *lock,
> + struct rt_mutex_waiter *waiter,
> + struct task_struct *p)
> {
> - struct task_struct *p = waiter->task;
> + BUG_ON(!waiter->pi.boosted);
>
> - plist_del(&waiter->list_entry, &lock->wait_list);
> - waiter->task = NULL;
> + waiter->pi.boosted = 0;
> + lock->pi.boosters--;
>
> /*
> * We can stop boosting the owner if there are no more waiters
> */
> - if (!rt_mutex_has_waiters(lock)) {
> + if (!lock->pi.boosters) {
> struct task_struct *owner = rt_mutex_owner(lock);
> rtmutex_pi_owner(lock, owner, 0);
> }
> @@ -446,6 +437,51 @@ static void remove_waiter(struct rt_mutex *lock,
> }
>
> /*
> + * This must be called with lock->wait_lock held.
> + */
> +static void _add_waiter(struct rt_mutex *lock,
> + struct rt_mutex_waiter *waiter)
> +{
> + waiter->task = current;
> + waiter->lock = lock;
> + waiter->pi.prio = current->prio;
> + plist_node_init(&waiter->list_entry, waiter->pi.prio);
> + plist_add(&waiter->list_entry, &lock->wait_list);
> +}
> +
> +static int add_waiter(struct rt_mutex *lock,
> + struct rt_mutex_waiter *waiter,
> + unsigned long *flags)
> +{
> + _add_waiter(lock, waiter);
> +
> + boost_lock(lock, waiter);
> +
> + spin_unlock_irqrestore(&lock->wait_lock, *flags);
> + task_pi_update(current, 0);
> + spin_lock_irqsave(&lock->wait_lock, *flags);
> +
> + return 0;
> +}
> +
> +/*
> + * Remove a waiter from a lock
> + *
> + * Must be called with lock->wait_lock held
> + */
> +static void remove_waiter(struct rt_mutex *lock,
> + struct rt_mutex_waiter *waiter)
> +{
> + struct task_struct *p = waiter->task;
> +
> + plist_del(&waiter->list_entry, &lock->wait_list);
> + waiter->task = NULL;
> +
> + if (waiter->pi.boosted)
> + deboost_lock(lock, waiter, p);
> +}
> +
> +/*
> * Wake up the next waiter on the lock.
> *
> * Remove the top waiter from the current tasks waiter list and from
> @@ -558,6 +594,24 @@ static int adaptive_wait(struct rt_mutex_waiter *waiter,
> if (orig_owner != rt_mutex_owner(waiter->lock))
> return 0;
>
> + /* Special handling for when we are not in pi-boost mode */
> + if (!waiter->pi.boosted) {
> + /*
> + * Are we higher priority than the owner? If so
> + * we should bail out immediately so that we can
> + * pi boost them.
> + */
> + if (current->prio < orig_owner->prio)
> + return 0;
> +
> + /*
> + * Did our priority change? If so, we need to
> + * requeue our position in the list
> + */
> + if (waiter->pi.prio != current->prio)
> + return 0;
> + }
> +
> /* Owner went to bed, so should we */
> if (!task_is_current(orig_owner))
> return 1;
> @@ -599,6 +653,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
> unsigned long saved_state, state, flags;
> struct task_struct *orig_owner;
> int missed = 0;
> + int boosted = 0;
>
> init_waiter(&waiter);
>
> @@ -631,26 +686,54 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
> }
> missed = 1;
>
> + orig_owner = rt_mutex_owner(lock);
> +
> /*
> * waiter.task is NULL the first time we come here and
> * when we have been woken up by the previous owner
> * but the lock got stolen by an higher prio task.
> */
> - if (!waiter.task) {
> - add_waiter(lock, &waiter, &flags);
> + if (!waiter.task)
> + _add_waiter(lock, &waiter);
> +
> + /*
> + * We only need to pi-boost the owner if they are lower
> + * priority than us. We dont care if this is racy
> + * against priority changes as we will break out of
> + * the adaptive spin anytime any priority changes occur
> + * without boosting enabled.
> + */
> + if (!waiter.pi.boosted && current->prio < orig_owner->prio) {
> + boost_lock(lock, &waiter);
> + boosted = 1;
> +
> + spin_unlock_irqrestore(&lock->wait_lock, flags);
> + task_pi_update(current, 0);
> + spin_lock_irqsave(&lock->wait_lock, flags);
> +
> /* Wakeup during boost ? */
> if (unlikely(!waiter.task))
> continue;
> }
>
> /*
> + * If we are not currently pi-boosting the lock, we have to
> + * monitor whether our priority changed since the last
> + * time it was recorded and requeue ourselves if it moves.
> + */
> + if (!waiter.pi.boosted && waiter.pi.prio != current->prio) {
> + waiter.pi.prio = current->prio;
> +
> + requeue_waiter(lock, &waiter);
> + }
> +
> + /*
> * Prevent schedule() to drop BKL, while waiting for
> * the lock ! We restore lock_depth when we come back.
> */
> saved_flags = current->flags & PF_NOSCHED;
> current->lock_depth = -1;
> current->flags &= ~PF_NOSCHED;
> - orig_owner = rt_mutex_owner(lock);
> get_task_struct(orig_owner);
> spin_unlock_irqrestore(&lock->wait_lock, flags);
>
> @@ -664,6 +747,24 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
> * barrier which we rely upon to ensure current->state
> * is visible before we test waiter.task.
> */
> + if (waiter.task && !waiter.pi.boosted) {
> + spin_lock_irqsave(&lock->wait_lock, flags);
> +
> + /*
> + * We get here if we have not yet boosted
> + * the lock, yet we are going to sleep. If
> + * we are still pending (waiter.task != 0),
> + * then go ahead and boost them now
> + */
> + if (waiter.task) {
> + boost_lock(lock, &waiter);
> + boosted = 1;
> + }
> +
> + spin_unlock_irqrestore(&lock->wait_lock, flags);
> + task_pi_update(current, 0);
> + }
> +
> if (waiter.task)
> schedule_rt_mutex(lock);
> } else
> @@ -696,7 +797,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
> spin_unlock_irqrestore(&lock->wait_lock, flags);
>
> /* Undo any pi boosting, if necessary */
> - task_pi_update(current, 0);
> + if (boosted)
> + task_pi_update(current, 0);
>
> debug_rt_mutex_free_waiter(&waiter);
> }
> @@ -708,6 +810,7 @@ static void noinline __sched
> rt_spin_lock_slowunlock(struct rt_mutex *lock)
> {
> unsigned long flags;
> + int deboost = 0;
>
> spin_lock_irqsave(&lock->wait_lock, flags);
>
> @@ -721,12 +824,16 @@ rt_spin_lock_slowunlock(struct rt_mutex *lock)
> return;
> }
>
> + if (lock->pi.boosters)
> + deboost = 1;
> +
> wakeup_next_waiter(lock, 1);
>
> spin_unlock_irqrestore(&lock->wait_lock, flags);
>
> - /* Undo pi boosting when necessary */
> - task_pi_update(current, 0);
> + if (deboost)
> + /* Undo pi boosting when necessary */
> + task_pi_update(current, 0);
> }
>
> void __lockfunc rt_spin_lock(spinlock_t *lock)
> diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
> index 7bf32d0..34e2381 100644
> --- a/kernel/rtmutex_common.h
> +++ b/kernel/rtmutex_common.h
> @@ -55,6 +55,7 @@ struct rt_mutex_waiter {
> struct {
> struct pi_sink snk;
> int prio;
> + int boosted;
> } pi;
> #ifdef CONFIG_DEBUG_RT_MUTEXES
> unsigned long ip;
>




2008-08-05 03:03:48

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH RT RFC 7/7] rtmutex: pi-boost locks as late as possible

Gregory Haskins wrote:
> Gregory Haskins wrote:
>> Adaptive-locking technology oftentimes acquires the lock by
>> spinning on a running owner instead of sleeping. It is unnecessary
>> to go through pi-boosting if the owner is of equal or (logically)
>> lower priority. Therefore, we can save some significant overhead
>> by deferring the boost until absolutely necessary. This has been
>> shown to improve overall performance in PREEMPT_RT.
>>
>> Special thanks to Peter Morreale for suggesting the optimization to
>> only consider skipping the boost if the owner is >= to current
>>
>> Signed-off-by: Gregory Haskins <[email protected]>
>> CC: Peter Morreale <[email protected]>
>>
>
> I received feedback that this prologue was too vague to accurately
> describe what this patch does and why it is not broken to use this
> optimization. Therefore, here is the new prologue:
>
> -----------------------
> From: Gregory Haskins <[email protected]>
>
> rtmutex: pi-boost locks as late as possible
>
> PREEMPT_RT replaces most spinlock_t instances with a preemptible
> real-time lock that supports priority inheritance. An uncontended
> (fastpath) acquisition of this lock has no more overhead than
> its non-rt spinlock_t counterpart. However, the contended case
> has considerably more overhead so that the lock can maintain
> proper priority queue order and support pi-boosting of the lock
> owner, while remaining fully preemptible.
>
> Instrumentation shows that the majority of acquisitions under most
> workloads falls either into the fastpath category or the adaptive
> spin category within the slowpath. The necessity to pi-boost a
> lock-owner should be sufficiently rare, yet the slow path
> blindly incurs this overhead in 100% of contentions.
>
> Therefore, this patch intends to capitalize on this observation
> in order to reduce overhead and improve acquisition throughput.
> It is important to note that real-time latency is still treated
> as a higher order constraint than throughput, so the full
> pi-protocol is observed using new carefully constructed rules
> around the old concepts.
>
> 1) We check the priority of the owner relative to the waiter on
> each spin of the lock (if we are not boosted already). If the
> owner's effective priority is logically less than the waiter's
> priority, we must boost them.
>
> 2) We check our own priority against our current queue
> position on the waiters-list (if we are not boosted already).
> If our priority was changed, we need to re-queue ourselves to
> update our position.
>
> 3) We break out of the adaptive-spin if either of the above
> conditions (1) or (2) changes, so that we can re-evaluate the
> lock conditions.
>
> 4) We must enter pi-boost mode if, at any time, we decide to
> voluntarily preempt since we are losing our ability to
> dynamically process the conditions above.
>
> Note: We still fully support priority inheritance with this
> protocol, even if we defer the low-level calls to adjust priority.
> The difference is really in terms of being a pro-active protocol
> (boost on entry) versus a reactive protocol (boost when
> necessary). The upside to the latter is that we don't take a
> penalty for pi when it is not necessary. The downside is that we
> technically leave the owner exposed to getting preempted, even if
> our waiter is the highest priority task in the system.

David Holmes (CC'd) pointed out that this statement is a little vague
and confusing as well. The question is: how could the owner be exposed
to preemption, since it would presumably be running at or above the
waiter's priority, or we would have boosted it already? The answer is
that the owner may have its priority lowered after we have already made
the decision to defer boosting.

Therefore, my updated statement should read:

"The downside is that we technically leave the owner exposed to getting
preempted *should it get asynchronously deprioritized*, even if ...."
As I go on to explain below, this deboosting could only happen as the
result of a setscheduler() call, which I assert should not be cause for
concern. However, I wanted to highlight this phenomenon in the interest
of full disclosure since it is technically a difference in behavior from
the original algorithm. I will update the header with this edit for
clarity.
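
To make that concrete, the per-spin checks from the adaptive_wait() hunk
quoted above reduce to roughly the following (condensed excerpt, not a
standalone function; an asynchronous setscheduler() on the owner is only
caught on the next spin iteration, or at the voluntary-preempt boost):

	if (!waiter->pi.boosted) {
		/*
		 * (1) The owner is now logically lower priority than us:
		 * bail out of the spin so we can pi-boost it.
		 */
		if (current->prio < orig_owner->prio)
			return 0;

		/*
		 * (2) Our own priority moved: bail out so the slowpath
		 * can requeue our waiter at the new priority.
		 */
		if (waiter->pi.prio != current->prio)
			return 0;
	}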

Thanks for the review, David!

-Greg

> When this
> happens, the owner would be immediately boosted (because we would
> hit the "oncpu" condition, and subsequently follow the voluntary
> preempt path which boosts the owner). Therefore, inversion is
> fully prevented, but we have the extra latency of the
> preempt/boost/wakeup that could have been avoided in the proactive
> model.
>
> However, the design of the algorithm described above constrains the
> probability of this phenomenon occurring to setscheduler()
> operations. Since rt-locks do not support being interrupted by
> signals or timeouts, waiters only depart via the acquisition path.
> And while acquisitions do deboost the owner, the owner also
> changes simultaneously, rendering the deboost moot relative to the
> other waiters.
>
> What this all means is that the downside to this implementation is
> that a high-priority waiter *may* see an extra latency (equivalent
> to roughly two wake-ups) if the owner has its priority reduced via
> setscheduler() while it holds the lock. The penalty is
> deterministic, arguably small enough, and sufficiently rare that I
> do not believe it should be an issue.
>
> Note: If the concept of other exit paths is ever introduced in the
> future, simply adapting the condition to look at owner->normal_prio
> instead of owner->prio should once again constrain the limitation
> to setscheduler().
>
> Special thanks to Peter Morreale for suggesting the optimization to
> only consider skipping the boost if the owner is >= to current.
>
> Signed-off-by: Gregory Haskins <[email protected]>
> CC: Peter Morreale <[email protected]>
>
>
>> ---
>>
>> include/linux/rtmutex.h | 1
>> kernel/rtmutex.c | 195 ++++++++++++++++++++++++++++++++++++-----------
>> kernel/rtmutex_common.h | 1
>> 3 files changed, 153 insertions(+), 44 deletions(-)
>>
>> diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
>> index d984244..1d98107 100644
>> --- a/include/linux/rtmutex.h
>> +++ b/include/linux/rtmutex.h
>> @@ -33,6 +33,7 @@ struct rt_mutex {
>> struct pi_node node;
>> struct pi_sink snk;
>> int prio;
>> + int boosters;
>> } pi;
>> #ifdef CONFIG_DEBUG_RT_MUTEXES
>> int save_state;
>> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
>> index 0f64298..de213ac 100644
>> --- a/kernel/rtmutex.c
>> +++ b/kernel/rtmutex.c
>> @@ -76,14 +76,15 @@ rt_mutex_set_owner(struct rt_mutex *lock, struct
>> task_struct *owner,
>> {
>> unsigned long val = (unsigned long)owner | mask;
>>
>> - if (rt_mutex_has_waiters(lock)) {
>> + if (lock->pi.boosters) {
>> struct task_struct *prev_owner = rt_mutex_owner(lock);
>>
>> rtmutex_pi_owner(lock, prev_owner, 0);
>> rtmutex_pi_owner(lock, owner, 1);
>> + }
>>
>> + if (rt_mutex_has_waiters(lock))
>> val |= RT_MUTEX_HAS_WAITERS;
>> - }
>>
>> lock->owner = (struct task_struct *)val;
>> }
>> @@ -177,7 +178,7 @@ static inline int rtmutex_pi_update(struct
>> pi_sink *snk,
>>
>> spin_lock_irqsave(&lock->wait_lock, iflags);
>>
>> - if (rt_mutex_has_waiters(lock)) {
>> + if (lock->pi.boosters) {
>> owner = rt_mutex_owner(lock);
>>
>> if (owner && owner != RT_RW_READER) {
>> @@ -206,6 +207,7 @@ static void init_pi(struct rt_mutex *lock)
>> pi_node_init(&lock->pi.node);
>>
>> lock->pi.prio = MAX_PRIO;
>> + lock->pi.boosters = 0;
>> pi_source_init(&lock->pi.src, &lock->pi.prio);
>> lock->pi.snk = rtmutex_pi_snk;
>>
>> @@ -303,6 +305,16 @@ static inline int try_to_take_rt_mutex(struct
>> rt_mutex *lock)
>> return do_try_to_take_rt_mutex(lock, STEAL_NORMAL);
>> }
>>
>> +static inline void requeue_waiter(struct rt_mutex *lock,
>> + struct rt_mutex_waiter *waiter)
>> +{
>> + BUG_ON(!waiter->task);
>> +
>> + plist_del(&waiter->list_entry, &lock->wait_list);
>> + plist_node_init(&waiter->list_entry, waiter->pi.prio);
>> + plist_add(&waiter->list_entry, &lock->wait_list);
>> +}
>> +
>> /*
>> * These callbacks are invoked whenever a waiter has changed priority.
>> * So we should requeue it within the lock->wait_list
>> @@ -343,11 +355,8 @@ static inline int
>> rtmutex_waiter_pi_update(struct pi_sink *snk,
>> * pi list. Therefore, if waiter->pi.prio has changed since we
>> * queued ourselves, requeue it.
>> */
>> - if (waiter->task && waiter->list_entry.prio != waiter->pi.prio) {
>> - plist_del(&waiter->list_entry, &lock->wait_list);
>> - plist_node_init(&waiter->list_entry, waiter->pi.prio);
>> - plist_add(&waiter->list_entry, &lock->wait_list);
>> - }
>> + if (waiter->task && waiter->list_entry.prio != waiter->pi.prio)
>> + requeue_waiter(lock, waiter);
>>
>> spin_unlock_irqrestore(&lock->wait_lock, iflags);
>>
>> @@ -359,20 +368,9 @@ static struct pi_sink rtmutex_waiter_pi_snk = {
>> .update = rtmutex_waiter_pi_update,
>> };
>>
>> -/*
>> - * This must be called with lock->wait_lock held.
>> - */
>> -static int add_waiter(struct rt_mutex *lock,
>> - struct rt_mutex_waiter *waiter,
>> - unsigned long *flags)
>> +static void boost_lock(struct rt_mutex *lock,
>> + struct rt_mutex_waiter *waiter)
>> {
>> - int has_waiters = rt_mutex_has_waiters(lock);
>> -
>> - waiter->task = current;
>> - waiter->lock = lock;
>> - waiter->pi.prio = current->prio;
>> - plist_node_init(&waiter->list_entry, waiter->pi.prio);
>> - plist_add(&waiter->list_entry, &lock->wait_list);
>> waiter->pi.snk = rtmutex_waiter_pi_snk;
>>
>> /*
>> @@ -397,35 +395,28 @@ static int add_waiter(struct rt_mutex *lock,
>> * If we previously had no waiters, we are transitioning to
>> * a mode where we need to boost the owner
>> */
>> - if (!has_waiters) {
>> + if (!lock->pi.boosters) {
>> struct task_struct *owner = rt_mutex_owner(lock);
>> rtmutex_pi_owner(lock, owner, 1);
>> }
>>
>> - spin_unlock_irqrestore(&lock->wait_lock, *flags);
>> - task_pi_update(current, 0);
>> - spin_lock_irqsave(&lock->wait_lock, *flags);
>> -
>> - return 0;
>> + lock->pi.boosters++;
>> + waiter->pi.boosted = 1;
>> }
>>
>> -/*
>> - * Remove a waiter from a lock
>> - *
>> - * Must be called with lock->wait_lock held
>> - */
>> -static void remove_waiter(struct rt_mutex *lock,
>> - struct rt_mutex_waiter *waiter)
>> +static void deboost_lock(struct rt_mutex *lock,
>> + struct rt_mutex_waiter *waiter,
>> + struct task_struct *p)
>> {
>> - struct task_struct *p = waiter->task;
>> + BUG_ON(!waiter->pi.boosted);
>>
>> - plist_del(&waiter->list_entry, &lock->wait_list);
>> - waiter->task = NULL;
>> + waiter->pi.boosted = 0;
>> + lock->pi.boosters--;
>>
>> /*
>> * We can stop boosting the owner if there are no more waiters
>> */
>> - if (!rt_mutex_has_waiters(lock)) {
>> + if (!lock->pi.boosters) {
>> struct task_struct *owner = rt_mutex_owner(lock);
>> rtmutex_pi_owner(lock, owner, 0);
>> }
>> @@ -446,6 +437,51 @@ static void remove_waiter(struct rt_mutex *lock,
>> }
>>
>> /*
>> + * This must be called with lock->wait_lock held.
>> + */
>> +static void _add_waiter(struct rt_mutex *lock,
>> + struct rt_mutex_waiter *waiter)
>> +{
>> + waiter->task = current;
>> + waiter->lock = lock;
>> + waiter->pi.prio = current->prio;
>> + plist_node_init(&waiter->list_entry, waiter->pi.prio);
>> + plist_add(&waiter->list_entry, &lock->wait_list);
>> +}
>> +
>> +static int add_waiter(struct rt_mutex *lock,
>> + struct rt_mutex_waiter *waiter,
>> + unsigned long *flags)
>> +{
>> + _add_waiter(lock, waiter);
>> +
>> + boost_lock(lock, waiter);
>> +
>> + spin_unlock_irqrestore(&lock->wait_lock, *flags);
>> + task_pi_update(current, 0);
>> + spin_lock_irqsave(&lock->wait_lock, *flags);
>> +
>> + return 0;
>> +}
>> +
>> +/*
>> + * Remove a waiter from a lock
>> + *
>> + * Must be called with lock->wait_lock held
>> + */
>> +static void remove_waiter(struct rt_mutex *lock,
>> + struct rt_mutex_waiter *waiter)
>> +{
>> + struct task_struct *p = waiter->task;
>> +
>> + plist_del(&waiter->list_entry, &lock->wait_list);
>> + waiter->task = NULL;
>> +
>> + if (waiter->pi.boosted)
>> + deboost_lock(lock, waiter, p);
>> +}
>> +
>> +/*
>> * Wake up the next waiter on the lock.
>> *
>> * Remove the top waiter from the current tasks waiter list and from
>> @@ -558,6 +594,24 @@ static int adaptive_wait(struct rt_mutex_waiter
>> *waiter,
>> if (orig_owner != rt_mutex_owner(waiter->lock))
>> return 0;
>>
>> + /* Special handling for when we are not in pi-boost mode */
>> + if (!waiter->pi.boosted) {
>> + /*
>> + * Are we higher priority than the owner? If so
>> + * we should bail out immediately so that we can
>> + * pi boost them.
>> + */
>> + if (current->prio < orig_owner->prio)
>> + return 0;
>> +
>> + /*
>> + * Did our priority change? If so, we need to
>> + * requeue our position in the list
>> + */
>> + if (waiter->pi.prio != current->prio)
>> + return 0;
>> + }
>> +
>> /* Owner went to bed, so should we */
>> if (!task_is_current(orig_owner))
>> return 1;
>> @@ -599,6 +653,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
>> unsigned long saved_state, state, flags;
>> struct task_struct *orig_owner;
>> int missed = 0;
>> + int boosted = 0;
>>
>> init_waiter(&waiter);
>>
>> @@ -631,26 +686,54 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
>> }
>> missed = 1;
>>
>> + orig_owner = rt_mutex_owner(lock);
>> +
>> /*
>> * waiter.task is NULL the first time we come here and
>> * when we have been woken up by the previous owner
>> * but the lock got stolen by an higher prio task.
>> */
>> - if (!waiter.task) {
>> - add_waiter(lock, &waiter, &flags);
>> + if (!waiter.task)
>> + _add_waiter(lock, &waiter);
>> +
>> + /*
>> + * We only need to pi-boost the owner if they are lower
>> + * priority than us. We dont care if this is racy
>> + * against priority changes as we will break out of
>> + * the adaptive spin anytime any priority changes occur
>> + * without boosting enabled.
>> + */
>> + if (!waiter.pi.boosted && current->prio < orig_owner->prio) {
>> + boost_lock(lock, &waiter);
>> + boosted = 1;
>> +
>> + spin_unlock_irqrestore(&lock->wait_lock, flags);
>> + task_pi_update(current, 0);
>> + spin_lock_irqsave(&lock->wait_lock, flags);
>> +
>> /* Wakeup during boost ? */
>> if (unlikely(!waiter.task))
>> continue;
>> }
>>
>> /*
>> + * If we are not currently pi-boosting the lock, we have to
>> + * monitor whether our priority changed since the last
>> + * time it was recorded and requeue ourselves if it moves.
>> + */
>> + if (!waiter.pi.boosted && waiter.pi.prio != current->prio) {
>> + waiter.pi.prio = current->prio;
>> +
>> + requeue_waiter(lock, &waiter);
>> + }
>> +
>> + /*
>> * Prevent schedule() to drop BKL, while waiting for
>> * the lock ! We restore lock_depth when we come back.
>> */
>> saved_flags = current->flags & PF_NOSCHED;
>> current->lock_depth = -1;
>> current->flags &= ~PF_NOSCHED;
>> - orig_owner = rt_mutex_owner(lock);
>> get_task_struct(orig_owner);
>> spin_unlock_irqrestore(&lock->wait_lock, flags);
>>
>> @@ -664,6 +747,24 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
>> * barrier which we rely upon to ensure current->state
>> * is visible before we test waiter.task.
>> */
>> + if (waiter.task && !waiter.pi.boosted) {
>> + spin_lock_irqsave(&lock->wait_lock, flags);
>> +
>> + /*
>> + * We get here if we have not yet boosted
>> + * the lock, yet we are going to sleep. If
>> + * we are still pending (waiter.task != 0),
>> + * then go ahead and boost them now
>> + */
>> + if (waiter.task) {
>> + boost_lock(lock, &waiter);
>> + boosted = 1;
>> + }
>> +
>> + spin_unlock_irqrestore(&lock->wait_lock, flags);
>> + task_pi_update(current, 0);
>> + }
>> +
>> if (waiter.task)
>> schedule_rt_mutex(lock);
>> } else
>> @@ -696,7 +797,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
>> spin_unlock_irqrestore(&lock->wait_lock, flags);
>>
>> /* Undo any pi boosting, if necessary */
>> - task_pi_update(current, 0);
>> + if (boosted)
>> + task_pi_update(current, 0);
>>
>> debug_rt_mutex_free_waiter(&waiter);
>> }
>> @@ -708,6 +810,7 @@ static void noinline __sched
>> rt_spin_lock_slowunlock(struct rt_mutex *lock)
>> {
>> unsigned long flags;
>> + int deboost = 0;
>>
>> spin_lock_irqsave(&lock->wait_lock, flags);
>>
>> @@ -721,12 +824,16 @@ rt_spin_lock_slowunlock(struct rt_mutex *lock)
>> return;
>> }
>>
>> + if (lock->pi.boosters)
>> + deboost = 1;
>> +
>> wakeup_next_waiter(lock, 1);
>>
>> spin_unlock_irqrestore(&lock->wait_lock, flags);
>>
>> - /* Undo pi boosting when necessary */
>> - task_pi_update(current, 0);
>> + if (deboost)
>> + /* Undo pi boosting when necessary */
>> + task_pi_update(current, 0);
>> }
>>
>> void __lockfunc rt_spin_lock(spinlock_t *lock)
>> diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
>> index 7bf32d0..34e2381 100644
>> --- a/kernel/rtmutex_common.h
>> +++ b/kernel/rtmutex_common.h
>> @@ -55,6 +55,7 @@ struct rt_mutex_waiter {
>> struct {
>> struct pi_sink snk;
>> int prio;
>> + int boosted;
>> } pi;
>> #ifdef CONFIG_DEBUG_RT_MUTEXES
>> unsigned long ip;
>>
>>
>
>




2008-08-15 12:19:11

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v2 4/8] rtmutex: formally initialize the rt_mutex_waiters

We will be adding more logic to rt_mutex_waiters, so let's centralize
the initialization now to make this easier going forward.

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/rtmutex.c | 26 ++++++++++++++------------
1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 7d11380..12de859 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -805,6 +805,15 @@ static int adaptive_wait(struct rt_mutex_waiter *waiter,
}
#endif

+static void init_waiter(struct rt_mutex_waiter *waiter)
+{
+ memset(waiter, 0, sizeof(*waiter));
+
+ debug_rt_mutex_init_waiter(waiter);
+ waiter->task = NULL;
+ waiter->write_lock = 0;
+}
+
/*
* Slow path lock function spin_lock style: this variant is very
* careful not to miss any non-lock wakeups.
@@ -823,9 +832,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
struct task_struct *orig_owner;
int missed = 0;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
- waiter.write_lock = 0;
+ init_waiter(&waiter);

spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);
@@ -1324,6 +1331,8 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
int saved_lock_depth = -1;
unsigned long saved_state = -1, state, flags;

+ init_waiter(&waiter);
+
spin_lock_irqsave(&mutex->wait_lock, flags);
init_rw_lists(rwm);

@@ -1335,10 +1344,6 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)

/* Owner is a writer (or a blocked writer). Block on the lock */

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
- waiter.write_lock = 0;
-
if (mtx) {
/*
* We drop the BKL here before we go into the wait loop to avoid a
@@ -1538,8 +1543,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
int saved_lock_depth = -1;
unsigned long flags, saved_state = -1, state;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
+ init_waiter(&waiter);

/* we do PI different for writers that are blocked */
waiter.write_lock = 1;
@@ -2270,9 +2274,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
struct rt_mutex_waiter waiter;
unsigned long flags;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
- waiter.write_lock = 0;
+ init_waiter(&waiter);

spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);

2008-08-15 12:19:31

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v2 1/8] add generalized priority-inheritance interface

The kernel currently addresses priority-inversion through priority-
inheritance. However, all of the priority-inheritance logic is
integrated into the Real-Time Mutex infrastructure. This causes a few
problems:

1) This tightly coupled relationship makes it difficult to extend to
other areas of the kernel (for instance, pi-aware wait-queues may
be desirable).
2) Enhancing the rtmutex infrastructure becomes challenging because
there is no separation between the locking code and the pi-code.

This patch aims to rectify these shortcomings by designing a stand-alone
pi framework which can then be used to replace the rtmutex-specific
version. The goal of this framework is to provide similar functionality
to the existing subsystem, but with sole focus on PI and the
relationships between objects that can boost priority, and the objects
that get boosted.

We introduce the concept of a "pi_source" and a "pi_sink", which, as the
names suggest, provide the basic relationship of a priority source and
its boosted target. A pi_source acts as a reference to some arbitrary
source of priority, and a pi_sink can be boosted (or deboosted) by
a pi_source. For more details, please read the library documentation.

There are currently no users of this interface.

Signed-off-by: Gregory Haskins <[email protected]>
---

Documentation/libpi.txt | 59 +++++
include/linux/pi.h | 278 +++++++++++++++++++++++++
lib/Makefile | 3
lib/pi.c | 516 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 855 insertions(+), 1 deletions(-)
create mode 100644 Documentation/libpi.txt
create mode 100644 include/linux/pi.h
create mode 100644 lib/pi.c

diff --git a/Documentation/libpi.txt b/Documentation/libpi.txt
new file mode 100644
index 0000000..197b21a
--- /dev/null
+++ b/Documentation/libpi.txt
@@ -0,0 +1,59 @@
+lib/pi.c - Priority Inheritance library
+
+Sources and sinks:
+------------
+
+This library introduces the basic concept of a "pi_source" and a "pi_sink", which, as the names suggest, provide the basic relationship of a priority source and its boosted target.
+
+A pi_source is simply a reference to some arbitrary priority value that may range from 0 (highest prio) to MAX_PRIO (currently 140, lowest prio). A pi_source calls pi_sink.boost() whenever it wishes to boost the sink to (at least minimally) the priority value that the source represents. It uses pi_sink.boost() both for the initial boost and for any subsequent refreshes to the value (even if the value is decreasing in logical priority). The policy of the sink will dictate what happens as a result of that boost. Likewise, a pi_source calls pi_sink.deboost() to stop contributing to the sink's minimum priority.
+
+It is important to note that a source is a reference to a priority value, not a value itself. This is one of the concepts that allows the interface to be idempotent, which is important for properly updating a chain of sources and sinks in the proper order. If we passed the priority on the stack, the order in which the system executes could allow the actual value that is set to race.
+
+Nodes:
+
+A pi_node is a convenience object which is simultaneously a source and a sink. As its name suggests, it would typically be deployed as a node in a pi-chain. Other pi_sources can boost a node via its pi_sink.boost() interface. Likewise, a node can boost a fixed number of sinks via the node.add_sink() interface.
+
+Generally speaking, a node takes care of many common operations associated with being a “link in the chain”, such as:
+
+ 1) determining the current priority of the node based on the (logically) highest priority source that is boosting the node.
+ 2) boosting/deboosting upstream sinks whenever the node locally changes priority.
+ 3) taking care to avoid deadlock during a chain update.
+
+Design details:
+
+Destruction:
+
+The pi-library objects are designed to be implicitly-destructable (meaning they do not require an explicit “free()” operation when they are not used anymore). This is important considering their intended use (spinlock_t's which are also implicitly-destructable). As such, any allocations needed for operation must come from internal structure storage as there will be no opportunity to free it later.
+
+Multiple sinks per Node:
+
+We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation, which had the notion of only a single sink (i.e. “task->pi_blocked_on”). The reason we added the ability to attach more than one sink was not to change the default chaining model (i.e. multiple boost targets), but rather to add a flexible notification mechanism that is peripheral to the chain; these peripheral sinks are informally called “leaf sinks”.
+
+Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints of a priority boost. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation-specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary, whereas an rwlock leaf-sink may boost a list of reader-owners.
+
+The following diagram depicts an example relationship (warning: cheesy ascii art)
+
+ --------- ---------
+ | leaf | | leaf |
+ --------- ---------
+ / /
+ --------- / ---------- / --------- ---------
+ ->-| node |->---| node |-->---| node |->---| leaf |
+ --------- ---------- --------- ---------
+
+The reason this was done was to unify the notion of a “sink” into a single interface, rather than having something like task->pi_blocked_on and a separate callback for the leaf action. Instead, any downstream object can be represented by a sink, and the implementation details are hidden (e.g. “I'm a task”, “I'm a lock”, “I'm a node”, “I'm a work-item”, “I'm a wait-queue”, etc.).
+
+Sinkrefs:
+
+Each pi_sink.boost() operation is represented by a unique pi_source to properly facilitate a one-node-to-many-source relationship. Therefore, if a pi_node is to act as an aggregator for multiple sinks, it implicitly must have one internal pi_source object for every sink that is added (via node.add_sink()). This pi_source object has to be internally managed for the lifetime of the sink reference.
+
+Recall that due to the implicit-destruction requirement above, and the fact that we will typically be executing in a preempt-disabled region, we have to be very careful about how we allocate references to those sinks. More on that next; but long story short, we limit the number of sinks to MAX_PI_DEPENDENCIES (currently 5).
+
+Locking:
+
+(work in progress....)
+
+
+
+
+
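
For illustration only (not part of the patch), here is a minimal sketch of
the source-as-reference idea described in the documentation above; my_node,
my_prio and my_src are hypothetical names:

	#include <linux/pi.h>

	static struct pi_node my_node;		/* hypothetical node */
	static int my_prio = 42;		/* the referenced priority value */
	static struct pi_source my_src;

	static void example_boost(void)
	{
		pi_node_init(&my_node);

		pi_source_init(&my_src, &my_prio);	/* src points at my_prio */
		pi_boost(&my_node, &my_src, 0);		/* node now sees prio 42 */

		my_prio = 17;				/* referenced value changes... */
		pi_boost(&my_node, &my_src, 0);		/* ...refresh by boosting again */

		pi_deboost(&my_node, &my_src, 0);	/* stop contributing */
	}
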
diff --git a/include/linux/pi.h b/include/linux/pi.h
new file mode 100644
index 0000000..d839d4f
--- /dev/null
+++ b/include/linux/pi.h
@@ -0,0 +1,278 @@
+/*
+ * see Documentation/libpi.txt for details
+ */
+
+#ifndef _LINUX_PI_H
+#define _LINUX_PI_H
+
+#include <linux/list.h>
+#include <linux/plist.h>
+#include <asm/atomic.h>
+
+#define MAX_PI_DEPENDENCIES 5
+
+struct pi_source {
+ struct plist_node list;
+ int *prio;
+ int boosted;
+};
+
+
+#define PI_FLAG_DEFER_UPDATE (1 << 0)
+#define PI_FLAG_ALREADY_BOOSTED (1 << 1)
+#define PI_FLAG_NO_DROPREF (1 << 2)
+
+struct pi_sink {
+ atomic_t refs;
+ int (*boost)(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags);
+ int (*deboost)(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags);
+ int (*update)(struct pi_sink *snk,
+ unsigned int flags);
+ int (*free)(struct pi_sink *snk,
+ unsigned int flags);
+};
+
+enum pi_state {
+ pi_state_boost,
+ pi_state_boosted,
+ pi_state_deboost,
+ pi_state_free,
+};
+
+/*
+ * NOTE: PI must always use a true (e.g. raw) spinlock, since it is used by
+ * rtmutex infrastructure.
+ */
+
+struct pi_sinkref {
+ raw_spinlock_t lock;
+ struct list_head list;
+ enum pi_state state;
+ struct pi_sink *snk;
+ struct pi_source src;
+ atomic_t refs;
+ int prio;
+};
+
+struct pi_sinkref_pool {
+ struct list_head free;
+ struct pi_sinkref data[MAX_PI_DEPENDENCIES];
+ int count;
+};
+
+struct pi_node {
+ raw_spinlock_t lock;
+ int prio;
+ struct pi_sink snk;
+ struct pi_sinkref_pool sinkref_pool;
+ struct list_head snks;
+ struct plist_head srcs;
+};
+
+/**
+ * pi_node_init - initialize a pi_node before use
+ * @node: a node context
+ */
+extern void pi_node_init(struct pi_node *node);
+
+/**
+ * pi_add_sink - add a sink as a downstream object
+ * @node: the node context
+ * @snk: the sink context to add to the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ * PI_FLAG_ALREADY_BOOSTED - Do not perform initial boosting
+ *
+ * This function registers a sink to get notified whenever the
+ * node changes priority.
+ *
+ * Note: By default, this function will schedule the newly added sink
+ * to get an inital boost notification on the next update (even
+ * without the presence of a priority transition). However, if the
+ * ALREADY_BOOSTED flag is specified, the sink is initially marked as
+ * BOOSTED and will only get notified if the node changes priority
+ * in the future.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+extern int pi_add_sink(struct pi_node *node, struct pi_sink *snk,
+ unsigned int flags);
+
+/**
+ * pi_del_sink - del a sink from the current downstream objects
+ * @node: the node context
+ * @snk: the sink context to delete from the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function unregisters a sink from the node.
+ *
+ * Note: The sink will not actually become fully deboosted until
+ * a call to node.update() successfully returns.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+extern int pi_del_sink(struct pi_node *node, struct pi_sink *snk,
+ unsigned int flags);
+
+/**
+ * pi_source_init - initialize a pi_source before use
+ * @src: a src context
+ * @prio: pointer to a priority value
+ *
+ * A pointer to a priority value is used so that boost and update
+ * are fully idempotent.
+ */
+static inline void
+pi_source_init(struct pi_source *src, int *prio)
+{
+ plist_node_init(&src->list, *prio);
+ src->prio = prio;
+ src->boosted = 0;
+}
+
+/**
+ * pi_boost - boost a node with a pi_source
+ * @node: the node context
+ * @src: the src context to boost the node with
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function registers a priority source with the node, possibly
+ * boosting its value if the new source is the highest registered source.
+ *
+ * This function is used to both initially register a source, as well as
+ * to notify the node if the value changes in the future (even if the
+ * priority is decreasing).
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_boost(struct pi_node *node, struct pi_source *src, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ if (snk->boost)
+ return snk->boost(snk, src, flags);
+
+ return 0;
+}
+
+/**
+ * pi_deboost - deboost a pi_source from a node
+ * @node: the node context
+ * @src: the src context to deboost from the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function unregisters a priority source from the node, possibly
+ * deboosting its value if the departing source was the highest
+ * registered source.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_deboost(struct pi_node *node, struct pi_source *src, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ if (snk->deboost)
+ return snk->deboost(snk, src, flags);
+
+ return 0;
+}
+
+/**
+ * pi_update - force a manual chain update
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * This function will push any priority changes (as a result of
+ * boost/deboost or add_sink/del_sink) down through the chain.
+ * If no changes are necessary, this function is a no-op.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_update(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ if (snk->update)
+ return snk->update(snk, flags);
+
+ return 0;
+}
+
+/**
+ * pi_sink_dropref - down the reference count, freeing the sink if 0
+ * @snk: the sink context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * Returns: none
+ */
+static inline void
+pi_sink_dropref(struct pi_sink *snk, unsigned int flags)
+{
+ if (atomic_dec_and_test(&snk->refs)) {
+ if (snk->free)
+ snk->free(snk, flags);
+ }
+}
+
+
+/**
+ * pi_addref - up the reference count
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * Returns: none
+ */
+static inline void
+pi_addref(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ atomic_inc(&snk->refs);
+}
+
+/**
+ * pi_dropref - down the reference count, freeing the node if 0
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * Returns: none
+ */
+static inline void
+pi_dropref(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ pi_sink_dropref(snk, flags);
+}
+
+#endif /* _LINUX_PI_H */
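
As a rough usage sketch of the interface above (illustrative only, not
part of the patch; my_leaf/my_node are hypothetical names, refcounting and
error handling are elided, and a real sink would likely also provide
.deboost and .free callbacks):

	#include <linux/pi.h>

	/* A hypothetical leaf sink that simply reacts to priority changes */
	static int my_leaf_boost(struct pi_sink *snk, struct pi_source *src,
				 unsigned int flags)
	{
		/* record the new minimum priority *src->prio for later use */
		return 0;
	}

	static int my_leaf_update(struct pi_sink *snk, unsigned int flags)
	{
		/* act on the priority recorded by ->boost() */
		return 0;
	}

	static struct pi_sink my_leaf = {
		.refs   = ATOMIC_INIT(1),
		.boost  = my_leaf_boost,
		.update = my_leaf_update,
	};

	static struct pi_node my_node;
	static struct pi_node my_node2;

	static void example_wire_up(void)
	{
		pi_node_init(&my_node);
		pi_node_init(&my_node2);

		/* my_leaf is now notified whenever my_node changes priority */
		pi_add_sink(&my_node, &my_leaf, 0);

		/* chaining: a node is itself a sink, so it can sit downstream */
		pi_add_sink(&my_node, &my_node2.snk, 0);
	}
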
diff --git a/lib/Makefile b/lib/Makefile
index 5187924..df81ad7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -23,7 +23,8 @@ lib-$(CONFIG_SMP) += cpumask.o
lib-y += kobject.o kref.o klist.o

obj-y += div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
- bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o
+ bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \
+ pi.o

ifeq ($(CONFIG_DEBUG_KOBJECT),y)
CFLAGS_kobject.o += -DDEBUG
diff --git a/lib/pi.c b/lib/pi.c
new file mode 100644
index 0000000..46736e4
--- /dev/null
+++ b/lib/pi.c
@@ -0,0 +1,516 @@
+/*
+ * lib/pi.c
+ *
+ * Priority-Inheritance library
+ *
+ * Copyright (C) 2008 Novell
+ *
+ * Author: Gregory Haskins <[email protected]>
+ *
+ * This code provides a generic framework for preventing priority
+ * inversion by means of priority-inheritance. (see Documentation/libpi.txt
+ * for details)
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <linux/sched.h>
+#include <linux/module.h>
+#include <linux/pi.h>
+
+/*
+ *-----------------------------------------------------------
+ * pi_sinkref_pool
+ *-----------------------------------------------------------
+ */
+
+static void
+pi_sinkref_pool_init(struct pi_sinkref_pool *pool)
+{
+ int i;
+
+ INIT_LIST_HEAD(&pool->free);
+ pool->count = 0;
+
+ for (i = 0; i < MAX_PI_DEPENDENCIES; ++i) {
+ struct pi_sinkref *sinkref = &pool->data[i];
+
+ memset(sinkref, 0, sizeof(*sinkref));
+ INIT_LIST_HEAD(&sinkref->list);
+ list_add_tail(&sinkref->list, &pool->free);
+ pool->count++;
+ }
+}
+
+static struct pi_sinkref *
+pi_sinkref_alloc(struct pi_sinkref_pool *pool)
+{
+ struct pi_sinkref *sinkref;
+
+ BUG_ON(!pool->count);
+
+ if (list_empty(&pool->free))
+ return NULL;
+
+ sinkref = list_first_entry(&pool->free, struct pi_sinkref, list);
+ list_del(&sinkref->list);
+ memset(sinkref, 0, sizeof(*sinkref));
+ pool->count--;
+
+ return sinkref;
+}
+
+static void
+pi_sinkref_free(struct pi_sinkref_pool *pool,
+ struct pi_sinkref *sinkref)
+{
+ list_add_tail(&sinkref->list, &pool->free);
+ pool->count++;
+}
+
+/*
+ *-----------------------------------------------------------
+ * pi_sinkref
+ *-----------------------------------------------------------
+ */
+
+static inline void
+_pi_sink_addref(struct pi_sinkref *sinkref)
+{
+ atomic_inc(&sinkref->snk->refs);
+ atomic_inc(&sinkref->refs);
+}
+
+static inline void
+_pi_sink_dropref_local(struct pi_node *node, struct pi_sinkref *sinkref)
+{
+ if (atomic_dec_and_lock(&sinkref->refs, &node->lock)) {
+ list_del(&sinkref->list);
+ pi_sinkref_free(&node->sinkref_pool, sinkref);
+ spin_unlock(&node->lock);
+ }
+}
+
+static inline void
+_pi_sink_dropref_all(struct pi_node *node, struct pi_sinkref *sinkref)
+{
+ struct pi_sink *snk = sinkref->snk;
+
+ _pi_sink_dropref_local(node, sinkref);
+ pi_sink_dropref(snk, 0);
+}
+
+/*
+ *-----------------------------------------------------------
+ * pi_node
+ *-----------------------------------------------------------
+ */
+
+static struct pi_node *node_of(struct pi_sink *snk)
+{
+ return container_of(snk, struct pi_node, snk);
+}
+
+static inline void
+__pi_update_prio(struct pi_node *node)
+{
+ if (!plist_head_empty(&node->srcs))
+ node->prio = plist_first(&node->srcs)->prio;
+ else
+ node->prio = MAX_PRIO;
+}
+
+static inline void
+__pi_boost(struct pi_node *node, struct pi_source *src)
+{
+ BUG_ON(src->boosted);
+
+ plist_node_init(&src->list, *src->prio);
+ plist_add(&src->list, &node->srcs);
+ src->boosted = 1;
+
+ __pi_update_prio(node);
+}
+
+static inline void
+__pi_deboost(struct pi_node *node, struct pi_source *src)
+{
+ BUG_ON(!src->boosted);
+
+ plist_del(&src->list, &node->srcs);
+ src->boosted = 0;
+
+ __pi_update_prio(node);
+}
+
+/*
+ * _pi_node_update - update the chain
+ *
+ * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
+ * that need to propagate up the chain. This is a step-wise process where we
+ * have to be careful about locking and preemption. By trying MAX_PI_DEPs
+ * times, we guarantee that this update routine is an effective barrier...
+ * all modifications made prior to the call to this barrier will have completed.
+ *
+ * Deadlock avoidance: This node may participate in a chain of nodes which
+ * form a graph of arbitrary structure. While the graph should technically
+ * never close on itself barring any bugs, we still want to protect against
+ * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
+ * from detecting this potential). To do this, we employ a dual-locking
+ * scheme where we can carefully control the order. That is: node->lock
+ * protects most of the node's internal state, but it will never be held
+ * across a chain update. sinkref->lock, on the other hand, can be held
+ * across a boost/deboost, and also guarantees proper execution order. Also
+ * note that no locks are held across an snk->update.
+ */
+static int
+_pi_node_update(struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ struct pi_sinkref *sinkref;
+ unsigned long iflags;
+ int count = 0;
+ int i;
+
+ struct updater {
+ int update;
+ struct pi_sinkref *sinkref;
+ struct pi_sink *snk;
+ } updaters[MAX_PI_DEPENDENCIES];
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ list_for_each_entry(sinkref, &node->snks, list) {
+ /*
+ * If the priority is changing, or if this is a
+ * BOOST/DEBOOST, we consider this sink "stale"
+ */
+ if (sinkref->prio != node->prio
+ || sinkref->state != pi_state_boosted) {
+ struct updater *iter = &updaters[count++];
+
+ BUG_ON(!atomic_read(&sinkref->snk->refs));
+ _pi_sink_addref(sinkref);
+ sinkref->prio = node->prio;
+
+ iter->update = 1;
+ iter->sinkref = sinkref;
+ iter->snk = sinkref->snk;
+ }
+ }
+
+ spin_unlock(&node->lock);
+
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+ unsigned int lflags = PI_FLAG_DEFER_UPDATE;
+ struct pi_sink *snk;
+
+ sinkref = iter->sinkref;
+ snk = iter->snk;
+
+ spin_lock(&sinkref->lock);
+
+ switch (sinkref->state) {
+ case pi_state_boost:
+ sinkref->state = pi_state_boosted;
+ /* Fall through */
+ case pi_state_boosted:
+ snk->boost(snk, &sinkref->src, lflags);
+ break;
+ case pi_state_deboost:
+ snk->deboost(snk, &sinkref->src, lflags);
+ sinkref->state = pi_state_free;
+
+ /*
+ * drop the ref that we took when the sinkref
+ * was allocated. We still hold a ref from
+ * the above.
+ */
+ _pi_sink_dropref_all(node, sinkref);
+ break;
+ case pi_state_free:
+ iter->update = 0;
+ break;
+ default:
+ panic("illegal sinkref type: %d", sinkref->state);
+ }
+
+ spin_unlock(&sinkref->lock);
+
+ /*
+ * We will drop the sinkref reference while still holding the
+ * preempt/irqs off so that the memory is returned synchronously
+ * to the system.
+ */
+ _pi_sink_dropref_local(node, sinkref);
+
+ /*
+ * The sinkref is no longer valid since we dropped the reference
+ * above, so symbolically drop it here too to make it more
+ * obvious if we try to use it later
+ */
+ iter->sinkref = NULL;
+ }
+
+ local_irq_restore(iflags);
+
+ /*
+ * Note: At this point, sinkref is invalid since we dropref'd
+ * it above, but snk is valid since we still hold the remote
+ * reference. This is key to the design because it allows us
+ * to synchronously free the sinkref object, yet maintain a
+ * reference to the sink across the update
+ */
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+
+ if (iter->update)
+ iter->snk->update(iter->snk, 0);
+ }
+
+ /*
+ * We perform all the free operations together at the end, using
+ * only automatic/stack variables since any one of these operations
+ * could result in our node object being deallocated
+ */
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+
+ pi_sink_dropref(iter->snk, 0);
+ }
+
+ return 0;
+}
+
+static void
+_pi_del_sinkref(struct pi_node *node, struct pi_sinkref *sinkref)
+{
+ struct pi_sink *snk = sinkref->snk;
+ int remove = 0;
+ unsigned long iflags;
+
+ local_irq_save(iflags);
+ spin_lock(&sinkref->lock);
+
+ switch (sinkref->state) {
+ case pi_state_boost:
+ /*
+ * This state indicates the sink was never formally
+ * boosted so we can just delete it immediately
+ */
+ remove = 1;
+ break;
+ case pi_state_boosted:
+ if (snk->deboost)
+ /*
+ * If the sink supports deboost notification,
+ * schedule it for deboost at the next update
+ */
+ sinkref->state = pi_state_deboost;
+ else
+ /*
+ * ..otherwise schedule it for immediate
+ * removal
+ */
+ remove = 1;
+ break;
+ default:
+ break;
+ }
+
+ if (remove) {
+ /*
+ * drop the ref that we took when the sinkref
+ * was allocated. We still hold a ref from
+ * when the caller performed the lookup
+ */
+ _pi_sink_dropref_all(node, sinkref);
+ sinkref->state = pi_state_free;
+ }
+
+ spin_unlock(&sinkref->lock);
+
+ _pi_sink_dropref_local(node, sinkref);
+ local_irq_restore(iflags);
+
+ pi_sink_dropref(snk, 0);
+}
+
+static int
+_pi_node_boost(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+ if (src->boosted)
+ __pi_deboost(node, src);
+ __pi_boost(node, src);
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(snk, 0);
+
+ return 0;
+}
+
+static int
+_pi_node_deboost(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+ __pi_deboost(node, src);
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(snk, 0);
+
+ return 0;
+}
+
+static int
+_pi_node_free(struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ struct pi_sinkref *sinkref;
+ struct pi_sinkref *sinkrefs[MAX_PI_DEPENDENCIES];
+ unsigned long iflags;
+ int count = 0;
+ int i;
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ /*
+ * When the node is freed, we should perform an implicit
+ * del_sink on any remaining sinks we may have
+ */
+ list_for_each_entry(sinkref, &node->snks, list) {
+ _pi_sink_addref(sinkref);
+ sinkrefs[count++] = sinkref;
+ }
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ for (i = 0; i < count; ++i)
+ _pi_del_sinkref(node, sinkrefs[i]);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->snk, 0);
+
+ return 0;
+}
+
+static struct pi_sink pi_node_snk = {
+ .boost = _pi_node_boost,
+ .deboost = _pi_node_deboost,
+ .update = _pi_node_update,
+ .free = _pi_node_free,
+};
+
+void pi_node_init(struct pi_node *node)
+{
+ spin_lock_init(&node->lock);
+ node->prio = MAX_PRIO;
+ node->snk = pi_node_snk;
+ pi_sinkref_pool_init(&node->sinkref_pool);
+ INIT_LIST_HEAD(&node->snks);
+ plist_head_init(&node->srcs, &node->lock);
+ atomic_set(&node->snk.refs, 1);
+}
+
+int pi_add_sink(struct pi_node *node, struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_sinkref *sinkref;
+ int ret = 0;
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ if (!atomic_read(&node->snk.refs)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ sinkref = pi_sinkref_alloc(&node->sinkref_pool);
+ if (!sinkref) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ spin_lock_init(&sinkref->lock);
+ INIT_LIST_HEAD(&sinkref->list);
+
+ if (flags & PI_FLAG_ALREADY_BOOSTED)
+ sinkref->state = pi_state_boosted;
+ else
+ /*
+ * Schedule it for addition at the next update
+ */
+ sinkref->state = pi_state_boost;
+
+ sinkref->prio = node->prio;
+ pi_source_init(&sinkref->src, &sinkref->prio);
+ sinkref->snk = snk;
+
+ /* set one ref from ourselves. It will be dropped on del_sink */
+ atomic_inc(&sinkref->snk->refs);
+ atomic_set(&sinkref->refs, 1);
+
+ list_add_tail(&sinkref->list, &node->snks);
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->snk, 0);
+
+ return 0;
+
+ out:
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ return ret;
+}
+
+int pi_del_sink(struct pi_node *node, struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_sinkref *sinkref;
+ struct pi_sinkref *sinkrefs[MAX_PI_DEPENDENCIES];
+ unsigned long iflags;
+ int count = 0;
+ int i;
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ /*
+ * There may be multiple matches to snk because sometimes a
+ * deboost/free may still be pending an update when the same
+ * node has been added. So we want to process all instances
+ */
+ list_for_each_entry(sinkref, &node->snks, list) {
+ if (sinkref->snk == snk) {
+ _pi_sink_addref(sinkref);
+ sinkrefs[count++] = sinkref;
+ }
+ }
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ for (i = 0; i < count; ++i)
+ _pi_del_sinkref(node, sinkrefs[i]);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->snk, 0);
+
+ return 0;
+}
+
+
+

2008-08-15 12:20:04

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v2 8/8] rtmutex: pi-boost locks as late as possible

PREEMPT_RT replaces most spinlock_t instances with a preemptible
real-time lock that supports priority inheritance. An uncontended
(fastpath) acquisition of this lock has no more overhead than
its non-rt spinlock_t counterpart. However, the contended case
has considerably more overhead so that the lock can maintain
proper priority queue order and support pi-boosting of the lock
owner, while remaining fully preemptible.

Instrumentation shows that the majority of acquisitions under most
workloads falls either into the fastpath category or the adaptive
spin category within the slowpath. The necessity to pi-boost a
lock-owner should be sufficiently rare, yet the slow path
blindly incurs this overhead in 100% of contentions.

Therefore, this patch intends to capitalize on this observation
in order to reduce overhead and improve acquisition throughput.
It is important to note that real-time latency is still treated
as a higher order constraint than throughput, so the full
pi-protocol is observed using new carefully constructed rules
around the old concepts.

1) We check the priority of the owner relative to the waiter on
each spin of the lock (if we are not boosted already). If the
owner's effective priority is logically less than the waiter's
priority, we must boost them.

2) We check our own priority against our current queue
position on the waiters-list (if we are not boosted already).
If our priority was changed, we need to re-queue ourselves to
update our position.

3) We break out of the adaptive-spin if either of the above
conditions (1) or (2) changes, so that we can re-evaluate the
lock conditions.

4) We must enter pi-boost mode if, at any time, we decide to
voluntarily preempt since we are losing our ability to
dynamically process the conditions above.

Note: We still fully support priority inheritance with this
protocol, even if we defer the low-level calls to adjust priority.
The difference is really in terms of being a pro-active protocol
(boost on entry) versus a reactive protocol (boost when
necessary). The upside to the latter is that we don't take a
penalty for pi when it is not necessary (which is most of the time).
The downside is that we technically leave the owner exposed to
getting preempted (should it get asynchronously deprioritized), even
if our waiter is the highest priority task in the system. When this
happens, the owner would be immediately boosted (because we would
hit the "oncpu" condition, and subsequently follow the voluntary
preempt path which boosts the owner). Therefore, inversion is
correctly prevented, but we have the extra latency of the
preempt/boost/wakeup that could have been avoided in the proactive
model.

However, the design of the algorithm described above constrains the
probability of this phenomenon occurring to setscheduler()
operations. Since rt-locks do not support being interrupted by
signals or timeouts, waiters only depart via the acquisition path.
And while acquisitions do deboost the owner, the owner also
changes simultaneously, rendering the deboost moot relative to the
other waiters.

What this all means is that the downside to this implementation is
that a high-priority waiter *may* see an extra latency (equivalent
to roughly two wake-ups) if the owner has its priority reduced via
setscheduler() while it holds the lock. The penalty is
deterministic, arguably small enough, and sufficiently rare that I
do not believe it should be an issue.

Note: If other exit paths are ever introduced in the future,
simply adapting the condition to look at owner->normal_prio
instead of owner->prio should once again constrain the limitation
to setscheduler().
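
For illustration, here is a hedged, standalone sketch of the two
variants of that condition. The task_model type and the helper name
are made up for this example; in the real patch the comparison is
made against task_struct fields in the adaptive-spin path.

#include <stdbool.h>

struct task_model {
	int prio;		/* effective priority (pi boosting applied) */
	int normal_prio;	/* base priority, never boosted */
};

/* Should a spinning waiter break out to pi-boost the owner?
 * 'other_exit_paths' models the hypothetical future case above. */
static bool should_break_to_boost(const struct task_model *waiter,
				  const struct task_model *owner,
				  bool other_exit_paths)
{
	int owner_prio = other_exit_paths ? owner->normal_prio
					  : owner->prio;

	return waiter->prio < owner_prio;	/* lower value == higher prio */
}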

Special thanks to Peter Morreale for suggesting the optimization
of only considering skipping the boost when the owner's priority
is >= current's.

Signed-off-by: Gregory Haskins <[email protected]>
CC: Peter Morreale <[email protected]>
---

include/linux/rtmutex.h | 1
kernel/rtmutex.c | 195 ++++++++++++++++++++++++++++++++++++-----------
kernel/rtmutex_common.h | 1
3 files changed, 153 insertions(+), 44 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index d984244..1d98107 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -33,6 +33,7 @@ struct rt_mutex {
struct pi_node node;
struct pi_sink snk;
int prio;
+ int boosters;
} pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
int save_state;
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 0f64298..de213ac 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -76,14 +76,15 @@ rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner,
{
unsigned long val = (unsigned long)owner | mask;

- if (rt_mutex_has_waiters(lock)) {
+ if (lock->pi.boosters) {
struct task_struct *prev_owner = rt_mutex_owner(lock);

rtmutex_pi_owner(lock, prev_owner, 0);
rtmutex_pi_owner(lock, owner, 1);
+ }

+ if (rt_mutex_has_waiters(lock))
val |= RT_MUTEX_HAS_WAITERS;
- }

lock->owner = (struct task_struct *)val;
}
@@ -177,7 +178,7 @@ static inline int rtmutex_pi_update(struct pi_sink *snk,

spin_lock_irqsave(&lock->wait_lock, iflags);

- if (rt_mutex_has_waiters(lock)) {
+ if (lock->pi.boosters) {
owner = rt_mutex_owner(lock);

if (owner && owner != RT_RW_READER) {
@@ -206,6 +207,7 @@ static void init_pi(struct rt_mutex *lock)
pi_node_init(&lock->pi.node);

lock->pi.prio = MAX_PRIO;
+ lock->pi.boosters = 0;
pi_source_init(&lock->pi.src, &lock->pi.prio);
lock->pi.snk = rtmutex_pi_snk;

@@ -303,6 +305,16 @@ static inline int try_to_take_rt_mutex(struct rt_mutex *lock)
return do_try_to_take_rt_mutex(lock, STEAL_NORMAL);
}

+static inline void requeue_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ BUG_ON(!waiter->task);
+
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
+}
+
/*
* These callbacks are invoked whenever a waiter has changed priority.
* So we should requeue it within the lock->wait_list
@@ -343,11 +355,8 @@ static inline int rtmutex_waiter_pi_update(struct pi_sink *snk,
* pi list. Therefore, if waiter->pi.prio has changed since we
* queued ourselves, requeue it.
*/
- if (waiter->task && waiter->list_entry.prio != waiter->pi.prio) {
- plist_del(&waiter->list_entry, &lock->wait_list);
- plist_node_init(&waiter->list_entry, waiter->pi.prio);
- plist_add(&waiter->list_entry, &lock->wait_list);
- }
+ if (waiter->task && waiter->list_entry.prio != waiter->pi.prio)
+ requeue_waiter(lock, waiter);

spin_unlock_irqrestore(&lock->wait_lock, iflags);

@@ -359,20 +368,9 @@ static struct pi_sink rtmutex_waiter_pi_snk = {
.update = rtmutex_waiter_pi_update,
};

-/*
- * This must be called with lock->wait_lock held.
- */
-static int add_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter,
- unsigned long *flags)
+static void boost_lock(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
{
- int has_waiters = rt_mutex_has_waiters(lock);
-
- waiter->task = current;
- waiter->lock = lock;
- waiter->pi.prio = current->prio;
- plist_node_init(&waiter->list_entry, waiter->pi.prio);
- plist_add(&waiter->list_entry, &lock->wait_list);
waiter->pi.snk = rtmutex_waiter_pi_snk;

/*
@@ -397,35 +395,28 @@ static int add_waiter(struct rt_mutex *lock,
* If we previously had no waiters, we are transitioning to
* a mode where we need to boost the owner
*/
- if (!has_waiters) {
+ if (!lock->pi.boosters) {
struct task_struct *owner = rt_mutex_owner(lock);
rtmutex_pi_owner(lock, owner, 1);
}

- spin_unlock_irqrestore(&lock->wait_lock, *flags);
- task_pi_update(current, 0);
- spin_lock_irqsave(&lock->wait_lock, *flags);
-
- return 0;
+ lock->pi.boosters++;
+ waiter->pi.boosted = 1;
}

-/*
- * Remove a waiter from a lock
- *
- * Must be called with lock->wait_lock held
- */
-static void remove_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter)
+static void deboost_lock(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ struct task_struct *p)
{
- struct task_struct *p = waiter->task;
+ BUG_ON(!waiter->pi.boosted);

- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->task = NULL;
+ waiter->pi.boosted = 0;
+ lock->pi.boosters--;

/*
* We can stop boosting the owner if there are no more waiters
*/
- if (!rt_mutex_has_waiters(lock)) {
+ if (!lock->pi.boosters) {
struct task_struct *owner = rt_mutex_owner(lock);
rtmutex_pi_owner(lock, owner, 0);
}
@@ -446,6 +437,51 @@ static void remove_waiter(struct rt_mutex *lock,
}

/*
+ * This must be called with lock->wait_lock held.
+ */
+static void _add_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ waiter->task = current;
+ waiter->lock = lock;
+ waiter->pi.prio = current->prio;
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
+}
+
+static int add_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ unsigned long *flags)
+{
+ _add_waiter(lock, waiter);
+
+ boost_lock(lock, waiter);
+
+ spin_unlock_irqrestore(&lock->wait_lock, *flags);
+ task_pi_update(current, 0);
+ spin_lock_irqsave(&lock->wait_lock, *flags);
+
+ return 0;
+}
+
+/*
+ * Remove a waiter from a lock
+ *
+ * Must be called with lock->wait_lock held
+ */
+static void remove_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ struct task_struct *p = waiter->task;
+
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ waiter->task = NULL;
+
+ if (waiter->pi.boosted)
+ deboost_lock(lock, waiter, p);
+}
+
+/*
* Wake up the next waiter on the lock.
*
* Remove the top waiter from the current tasks waiter list and from
@@ -558,6 +594,24 @@ static int adaptive_wait(struct rt_mutex_waiter *waiter,
if (orig_owner != rt_mutex_owner(waiter->lock))
return 0;

+ /* Special handling for when we are not in pi-boost mode */
+ if (!waiter->pi.boosted) {
+ /*
+ * Are we higher priority than the owner? If so
+ * we should bail out immediately so that we can
+ * pi boost them.
+ */
+ if (current->prio < orig_owner->prio)
+ return 0;
+
+ /*
+ * Did our priority change? If so, we need to
+ * requeue our position in the list
+ */
+ if (waiter->pi.prio != current->prio)
+ return 0;
+ }
+
/* Owner went to bed, so should we */
if (!task_is_current(orig_owner))
return 1;
@@ -599,6 +653,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
unsigned long saved_state, state, flags;
struct task_struct *orig_owner;
int missed = 0;
+ int boosted = 0;

init_waiter(&waiter);

@@ -631,26 +686,54 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
}
missed = 1;

+ orig_owner = rt_mutex_owner(lock);
+
/*
* waiter.task is NULL the first time we come here and
* when we have been woken up by the previous owner
* but the lock got stolen by an higher prio task.
*/
- if (!waiter.task) {
- add_waiter(lock, &waiter, &flags);
+ if (!waiter.task)
+ _add_waiter(lock, &waiter);
+
+ /*
+ * We only need to pi-boost the owner if they are lower
+ * priority than us. We dont care if this is racy
+ * against priority changes as we will break out of
+ * the adaptive spin anytime any priority changes occur
+ * without boosting enabled.
+ */
+ if (!waiter.pi.boosted && current->prio < orig_owner->prio) {
+ boost_lock(lock, &waiter);
+ boosted = 1;
+
+ spin_unlock_irqrestore(&lock->wait_lock, flags);
+ task_pi_update(current, 0);
+ spin_lock_irqsave(&lock->wait_lock, flags);
+
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
}

/*
+ * If we are not currently pi-boosting the lock, we have to
+ * monitor whether our priority changed since the last
+ * time it was recorded and requeue ourselves if it moves.
+ */
+ if (!waiter.pi.boosted && waiter.pi.prio != current->prio) {
+ waiter.pi.prio = current->prio;
+
+ requeue_waiter(lock, &waiter);
+ }
+
+ /*
* Prevent schedule() to drop BKL, while waiting for
* the lock ! We restore lock_depth when we come back.
*/
saved_flags = current->flags & PF_NOSCHED;
current->lock_depth = -1;
current->flags &= ~PF_NOSCHED;
- orig_owner = rt_mutex_owner(lock);
get_task_struct(orig_owner);
spin_unlock_irqrestore(&lock->wait_lock, flags);

@@ -664,6 +747,24 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* barrier which we rely upon to ensure current->state
* is visible before we test waiter.task.
*/
+ if (waiter.task && !waiter.pi.boosted) {
+ spin_lock_irqsave(&lock->wait_lock, flags);
+
+ /*
+ * We get here if we have not yet boosted
+ * the lock, yet we are going to sleep. If
+ * we are still pending (waiter.task != 0),
+ * then go ahead and boost them now
+ */
+ if (waiter.task) {
+ boost_lock(lock, &waiter);
+ boosted = 1;
+ }
+
+ spin_unlock_irqrestore(&lock->wait_lock, flags);
+ task_pi_update(current, 0);
+ }
+
if (waiter.task)
schedule_rt_mutex(lock);
} else
@@ -696,7 +797,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
spin_unlock_irqrestore(&lock->wait_lock, flags);

/* Undo any pi boosting, if necessary */
- task_pi_update(current, 0);
+ if (boosted)
+ task_pi_update(current, 0);

debug_rt_mutex_free_waiter(&waiter);
}
@@ -708,6 +810,7 @@ static void noinline __sched
rt_spin_lock_slowunlock(struct rt_mutex *lock)
{
unsigned long flags;
+ int deboost = 0;

spin_lock_irqsave(&lock->wait_lock, flags);

@@ -721,12 +824,16 @@ rt_spin_lock_slowunlock(struct rt_mutex *lock)
return;
}

+ if (lock->pi.boosters)
+ deboost = 1;
+
wakeup_next_waiter(lock, 1);

spin_unlock_irqrestore(&lock->wait_lock, flags);

- /* Undo pi boosting when necessary */
- task_pi_update(current, 0);
+ if (deboost)
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

void __lockfunc rt_spin_lock(spinlock_t *lock)
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 7bf32d0..34e2381 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -55,6 +55,7 @@ struct rt_mutex_waiter {
struct {
struct pi_sink snk;
int prio;
+ int boosted;
} pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
unsigned long ip;

2008-08-15 12:19:48

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v2 7/8] rtmutex: convert rtmutexes to fully use the PI library

We have previously only laid some of the groundwork to use the PI
library, but left the existing infrastructure in place in the
rtmutex code. This patch converts the rtmutex PI code to officially
use the PI library.
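
As a rough illustration of the conversion pattern (not the patch
itself): the lock now participates in the PI graph through a pi_sink
whose .boost() callback merely records the new priority, and whose
.update() callback later applies it to the current owner under
lock->wait_lock. The sketch below models that record/apply split
with invented names (lock_model, model_boost, model_update); the real
callbacks are rtmutex_pi_boost() and rtmutex_pi_update() in the diff.

struct lock_model {
	int prio;		/* cached top-waiter priority */
	int owner_prio;		/* stand-in for the owner task's priority */
};

/* .boost(): record the new priority.  No locks are needed because
 * the node interlock already serializes these calls. */
static int model_boost(struct lock_model *lock, int new_prio)
{
	lock->prio = new_prio;
	return 0;
}

/* .update(): apply the cached priority to whoever owns the lock.
 * In the real code this runs under lock->wait_lock and forwards the
 * value to the owner via task_pi_boost()/task_pi_update(). */
static int model_update(struct lock_model *lock)
{
	if (lock->prio < lock->owner_prio)	/* lower value == higher prio */
		lock->owner_prio = lock->prio;
	return 0;
}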

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/rt_lock.h | 2
include/linux/rtmutex.h | 15 -
include/linux/sched.h | 21 -
kernel/fork.c | 2
kernel/rcupreempt-boost.c | 2
kernel/rtmutex-debug.c | 4
kernel/rtmutex-tester.c | 4
kernel/rtmutex.c | 944 ++++++++++++++-------------------------------
kernel/rtmutex_common.h | 18 -
kernel/rwlock_torture.c | 32 --
kernel/sched.c | 12 -
11 files changed, 321 insertions(+), 735 deletions(-)

diff --git a/include/linux/rt_lock.h b/include/linux/rt_lock.h
index c00cfb3..d0ef0f1 100644
--- a/include/linux/rt_lock.h
+++ b/include/linux/rt_lock.h
@@ -14,6 +14,7 @@
#include <asm/atomic.h>
#include <linux/spinlock_types.h>
#include <linux/sched_prio.h>
+#include <linux/pi.h>

#ifdef CONFIG_PREEMPT_RT
/*
@@ -67,6 +68,7 @@ struct rw_mutex {
atomic_t count; /* number of times held for read */
atomic_t owners; /* number of owners as readers */
struct list_head readers;
+ struct pi_sink pi_snk;
int prio;
};

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 14774ce..d984244 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -15,6 +15,7 @@
#include <linux/linkage.h>
#include <linux/plist.h>
#include <linux/spinlock_types.h>
+#include <linux/pi.h>

/**
* The rt_mutex structure
@@ -27,6 +28,12 @@ struct rt_mutex {
raw_spinlock_t wait_lock;
struct plist_head wait_list;
struct task_struct *owner;
+ struct {
+ struct pi_source src;
+ struct pi_node node;
+ struct pi_sink snk;
+ int prio;
+ } pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
int save_state;
const char *name, *file;
@@ -96,12 +103,4 @@ extern int rt_mutex_trylock(struct rt_mutex *lock);

extern void rt_mutex_unlock(struct rt_mutex *lock);

-#ifdef CONFIG_RT_MUTEXES
-# define INIT_RT_MUTEXES(tsk) \
- .pi_waiters = PLIST_HEAD_INIT(tsk.pi_waiters, &tsk.pi_lock), \
- INIT_RT_MUTEX_DEBUG(tsk)
-#else
-# define INIT_RT_MUTEXES(tsk)
-#endif
-
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9132b42..d59c804 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1106,6 +1106,7 @@ struct reader_lock_struct {
struct rw_mutex *lock;
struct list_head list;
struct task_struct *task;
+ struct pi_source pi_src;
int count;
};

@@ -1309,15 +1310,6 @@ struct task_struct {

} pi;

-#ifdef CONFIG_RT_MUTEXES
- /* PI waiters blocked on a rt_mutex held by this task */
- struct plist_head pi_waiters;
- /* Deadlock detection and priority inheritance handling */
- struct rt_mutex_waiter *pi_blocked_on;
- int rtmutex_prio;
- struct pi_source rtmutex_prio_src;
-#endif
-
#ifdef CONFIG_DEBUG_MUTEXES
/* mutex deadlock detection */
struct mutex_waiter *blocked_on;
@@ -1806,17 +1798,6 @@ int sched_rt_handler(struct ctl_table *table, int write,

extern unsigned int sysctl_sched_compat_yield;

-#ifdef CONFIG_RT_MUTEXES
-extern int rt_mutex_getprio(struct task_struct *p);
-extern void rt_mutex_adjust_pi(struct task_struct *p);
-#else
-static inline int rt_mutex_getprio(struct task_struct *p)
-{
- return p->normal_prio;
-}
-# define rt_mutex_adjust_pi(p) do { } while (0)
-#endif
-
extern void set_user_nice(struct task_struct *p, long nice);
extern int task_prio(const struct task_struct *p);
extern int task_nice(const struct task_struct *p);
diff --git a/kernel/fork.c b/kernel/fork.c
index 399a0a9..759c6de 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -885,8 +885,6 @@ static void rt_mutex_init_task(struct task_struct *p)
{
spin_lock_init(&p->pi_lock);
#ifdef CONFIG_RT_MUTEXES
- plist_head_init(&p->pi_waiters, &p->pi_lock);
- p->pi_blocked_on = NULL;
# ifdef CONFIG_DEBUG_RT_MUTEXES
p->last_kernel_lock = NULL;
# endif
diff --git a/kernel/rcupreempt-boost.c b/kernel/rcupreempt-boost.c
index e8d9d76..85b3c2b 100644
--- a/kernel/rcupreempt-boost.c
+++ b/kernel/rcupreempt-boost.c
@@ -424,7 +424,7 @@ void rcu_boost_readers(void)

spin_lock_irqsave(&rcu_boost_wake_lock, flags);

- prio = rt_mutex_getprio(curr);
+ prio = get_rcu_prio(curr);

rcu_trace_boost_try_boost_readers(RCU_BOOST_ME);

diff --git a/kernel/rtmutex-debug.c b/kernel/rtmutex-debug.c
index 0d9cb54..2034ce1 100644
--- a/kernel/rtmutex-debug.c
+++ b/kernel/rtmutex-debug.c
@@ -57,8 +57,6 @@ static void printk_lock(struct rt_mutex *lock, int print_owner)

void rt_mutex_debug_task_free(struct task_struct *task)
{
- DEBUG_LOCKS_WARN_ON(!plist_head_empty(&task->pi_waiters));
- DEBUG_LOCKS_WARN_ON(task->pi_blocked_on);
#ifdef CONFIG_PREEMPT_RT
WARN_ON(task->reader_lock_count);
#endif
@@ -156,7 +154,6 @@ void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
{
memset(waiter, 0x11, sizeof(*waiter));
plist_node_init(&waiter->list_entry, MAX_PRIO);
- plist_node_init(&waiter->pi_list_entry, MAX_PRIO);
waiter->deadlock_task_pid = NULL;
}

@@ -164,7 +161,6 @@ void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)
{
put_pid(waiter->deadlock_task_pid);
DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->list_entry));
- DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
DEBUG_LOCKS_WARN_ON(waiter->task);
memset(waiter, 0x22, sizeof(*waiter));
}
diff --git a/kernel/rtmutex-tester.c b/kernel/rtmutex-tester.c
index 092e4c6..dff8781 100644
--- a/kernel/rtmutex-tester.c
+++ b/kernel/rtmutex-tester.c
@@ -373,11 +373,11 @@ static ssize_t sysfs_test_status(struct sys_device *dev, char *buf)
spin_lock(&rttest_lock);

curr += sprintf(curr,
- "O: %4d, E:%8d, S: 0x%08lx, P: %4d, N: %4d, B: %p, K: %d, M:",
+ "O: %4d, E:%8d, S: 0x%08lx, P: %4d, N: %4d, K: %d M:",
td->opcode, td->event, tsk->state,
(MAX_RT_PRIO - 1) - tsk->prio,
(MAX_RT_PRIO - 1) - tsk->normal_prio,
- tsk->pi_blocked_on, td->bkl);
+ td->bkl);

for (i = MAX_RT_TEST_MUTEXES - 1; i >=0 ; i--)
curr += sprintf(curr, "%d", td->mutexes[i]);
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 62fdc3d..0f64298 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -58,14 +58,32 @@
* state.
*/

+static inline void
+rtmutex_pi_owner(struct rt_mutex *lock, struct task_struct *p, int add)
+{
+ if (!p || p == RT_RW_READER)
+ return;
+
+ if (add)
+ task_pi_boost(p, &lock->pi.src, PI_FLAG_DEFER_UPDATE);
+ else
+ task_pi_deboost(p, &lock->pi.src, PI_FLAG_DEFER_UPDATE);
+}
+
static void
rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner,
unsigned long mask)
{
unsigned long val = (unsigned long)owner | mask;

- if (rt_mutex_has_waiters(lock))
+ if (rt_mutex_has_waiters(lock)) {
+ struct task_struct *prev_owner = rt_mutex_owner(lock);
+
+ rtmutex_pi_owner(lock, prev_owner, 0);
+ rtmutex_pi_owner(lock, owner, 1);
+
val |= RT_MUTEX_HAS_WAITERS;
+ }

lock->owner = (struct task_struct *)val;
}
@@ -134,245 +152,88 @@ static inline int task_is_reader(struct task_struct *task) { return 0; }
#endif

int pi_initialized;
-
-/*
- * we initialize the wait_list runtime. (Could be done build-time and/or
- * boot-time.)
- */
-static inline void init_lists(struct rt_mutex *lock)
+static inline int rtmutex_pi_boost(struct pi_sink *snk,
+ struct pi_source *src,
+ unsigned int flags)
{
- if (unlikely(!lock->wait_list.prio_list.prev)) {
- plist_head_init(&lock->wait_list, &lock->wait_lock);
-#ifdef CONFIG_DEBUG_RT_MUTEXES
- pi_initialized++;
-#endif
- }
-}
-
-static int rt_mutex_get_readers_prio(struct task_struct *task, int prio);
-
-/*
- * Calculate task priority from the waiter list priority
- *
- * Return task->normal_prio when the waiter list is empty or when
- * the waiter is not allowed to do priority boosting
- */
-int rt_mutex_getprio(struct task_struct *task)
-{
- int prio = min(task->normal_prio, get_rcu_prio(task));
-
- prio = rt_mutex_get_readers_prio(task, prio);
-
- if (likely(!task_has_pi_waiters(task)))
- return prio;
-
- return min(task_top_pi_waiter(task)->pi_list_entry.prio, prio);
-}
+ struct rt_mutex *lock = container_of(snk, struct rt_mutex, pi.snk);

-/*
- * Adjust the priority of a task, after its pi_waiters got modified.
- *
- * This can be both boosting and unboosting. task->pi_lock must be held.
- */
-static void __rt_mutex_adjust_prio(struct task_struct *task)
-{
- int prio = rt_mutex_getprio(task);
-
- if (task->rtmutex_prio != prio) {
- task->rtmutex_prio = prio;
- task_pi_boost(task, &task->rtmutex_prio_src, 0);
- }
-}
-
-/*
- * Adjust task priority (undo boosting). Called from the exit path of
- * rt_mutex_slowunlock() and rt_mutex_slowlock().
- *
- * (Note: We do this outside of the protection of lock->wait_lock to
- * allow the lock to be taken while or before we readjust the priority
- * of task. We do not use the spin_xx_mutex() variants here as we are
- * outside of the debug path.)
- */
-static void rt_mutex_adjust_prio(struct task_struct *task)
-{
- unsigned long flags;
+ /*
+ * We dont need to take any locks here because the
+ * lock->pi.node interlock is already guaranteeing mutual
+ * exclusion.
+ */
+ lock->pi.prio = *src->prio;

- spin_lock_irqsave(&task->pi_lock, flags);
- __rt_mutex_adjust_prio(task);
- spin_unlock_irqrestore(&task->pi_lock, flags);
+ return 0;
}

-/*
- * Max number of times we'll walk the boosting chain:
- */
-int max_lock_depth = 1024;
-
-static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- struct rt_mutex *lock,
- int recursion_depth);
-/*
- * Adjust the priority chain. Also used for deadlock detection.
- * Decreases task's usage by one - may thus free the task.
- * Returns 0 or -EDEADLK.
- */
-static int rt_mutex_adjust_prio_chain(struct task_struct *task,
- int deadlock_detect,
- struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- int recursion_depth)
+static inline int rtmutex_pi_update(struct pi_sink *snk,
+ unsigned int flags)
{
- struct rt_mutex *lock;
- struct rt_mutex_waiter *waiter, *top_waiter = orig_waiter;
- int detect_deadlock, ret = 0, depth = 0;
- unsigned long flags;
+ struct rt_mutex *lock = container_of(snk, struct rt_mutex, pi.snk);
+ struct task_struct *owner = NULL;
+ unsigned long iflags;

- detect_deadlock = debug_rt_mutex_detect_deadlock(orig_waiter,
- deadlock_detect);
+ spin_lock_irqsave(&lock->wait_lock, iflags);

- /*
- * The (de)boosting is a step by step approach with a lot of
- * pitfalls. We want this to be preemptible and we want hold a
- * maximum of two locks per step. So we have to check
- * carefully whether things change under us.
- */
- again:
- if (++depth > max_lock_depth) {
- static int prev_max;
+ if (rt_mutex_has_waiters(lock)) {
+ owner = rt_mutex_owner(lock);

- /*
- * Print this only once. If the admin changes the limit,
- * print a new message when reaching the limit again.
- */
- if (prev_max != max_lock_depth) {
- prev_max = max_lock_depth;
- printk(KERN_WARNING "Maximum lock depth %d reached "
- "task: %s (%d)\n", max_lock_depth,
- top_task->comm, task_pid_nr(top_task));
+ if (owner && owner != RT_RW_READER) {
+ rtmutex_pi_owner(lock, owner, 1);
+ get_task_struct(owner);
}
- put_task_struct(task);
-
- return deadlock_detect ? -EDEADLK : 0;
}
- retry:
- /*
- * Task can not go away as we did a get_task() before !
- */
- spin_lock_irqsave(&task->pi_lock, flags);

- waiter = task->pi_blocked_on;
- /*
- * Check whether the end of the boosting chain has been
- * reached or the state of the chain has changed while we
- * dropped the locks.
- */
- if (!waiter || !waiter->task)
- goto out_unlock_pi;
-
- /*
- * Check the orig_waiter state. After we dropped the locks,
- * the previous owner of the lock might have released the lock
- * and made us the pending owner:
- */
- if (orig_waiter && !orig_waiter->task)
- goto out_unlock_pi;
-
- /*
- * Drop out, when the task has no waiters. Note,
- * top_waiter can be NULL, when we are in the deboosting
- * mode!
- */
- if (top_waiter && (!task_has_pi_waiters(task) ||
- top_waiter != task_top_pi_waiter(task)))
- goto out_unlock_pi;
-
- /*
- * When deadlock detection is off then we check, if further
- * priority adjustment is necessary.
- */
- if (!detect_deadlock && waiter->list_entry.prio == task->prio)
- goto out_unlock_pi;
+ spin_unlock_irqrestore(&lock->wait_lock, iflags);

- lock = waiter->lock;
- if (!spin_trylock(&lock->wait_lock)) {
- spin_unlock_irqrestore(&task->pi_lock, flags);
- cpu_relax();
- goto retry;
+ if (owner && owner != RT_RW_READER) {
+ task_pi_update(owner, 0);
+ put_task_struct(owner);
}

- /* Deadlock detection */
- if (lock == orig_lock || rt_mutex_owner(lock) == top_task) {
- debug_rt_mutex_deadlock(deadlock_detect, orig_waiter, lock);
- spin_unlock(&lock->wait_lock);
- ret = deadlock_detect ? -EDEADLK : 0;
- goto out_unlock_pi;
- }
+ return 0;
+}

- top_waiter = rt_mutex_top_waiter(lock);
+static struct pi_sink rtmutex_pi_snk = {
+ .boost = rtmutex_pi_boost,
+ .update = rtmutex_pi_update,
+};

- /* Requeue the waiter */
- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->list_entry.prio = task->prio;
- plist_add(&waiter->list_entry, &lock->wait_list);
-
- /* Release the task */
- spin_unlock(&task->pi_lock);
- put_task_struct(task);
+static void init_pi(struct rt_mutex *lock)
+{
+ pi_node_init(&lock->pi.node);

- /* Grab the next task */
- task = rt_mutex_owner(lock);
+ lock->pi.prio = MAX_PRIO;
+ pi_source_init(&lock->pi.src, &lock->pi.prio);
+ lock->pi.snk = rtmutex_pi_snk;

- /*
- * Readers are special. We may need to boost more than one owner.
- */
- if (task_is_reader(task)) {
- ret = rt_mutex_adjust_readers(orig_lock, orig_waiter,
- top_task, lock,
- recursion_depth);
- spin_unlock_irqrestore(&lock->wait_lock, flags);
- goto out;
- }
+ pi_add_sink(&lock->pi.node, &lock->pi.snk,
+ PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
+}

- get_task_struct(task);
- spin_lock(&task->pi_lock);
-
- if (waiter == rt_mutex_top_waiter(lock)) {
- /* Boost the owner */
- plist_del(&top_waiter->pi_list_entry, &task->pi_waiters);
- waiter->pi_list_entry.prio = waiter->list_entry.prio;
- plist_add(&waiter->pi_list_entry, &task->pi_waiters);
- __rt_mutex_adjust_prio(task);
-
- } else if (top_waiter == waiter) {
- /* Deboost the owner */
- plist_del(&waiter->pi_list_entry, &task->pi_waiters);
- waiter = rt_mutex_top_waiter(lock);
- waiter->pi_list_entry.prio = waiter->list_entry.prio;
- plist_add(&waiter->pi_list_entry, &task->pi_waiters);
- __rt_mutex_adjust_prio(task);
+/*
+ * we initialize the wait_list runtime. (Could be done build-time and/or
+ * boot-time.)
+ */
+static inline void init_lists(struct rt_mutex *lock)
+{
+ if (unlikely(!lock->wait_list.prio_list.prev)) {
+ plist_head_init(&lock->wait_list, &lock->wait_lock);
+ init_pi(lock);
+#ifdef CONFIG_DEBUG_RT_MUTEXES
+ pi_initialized++;
+#endif
}
-
- spin_unlock(&task->pi_lock);
-
- top_waiter = rt_mutex_top_waiter(lock);
- spin_unlock_irqrestore(&lock->wait_lock, flags);
-
- if (!detect_deadlock && waiter != top_waiter)
- goto out_put_task;
-
- goto again;
-
- out_unlock_pi:
- spin_unlock_irqrestore(&task->pi_lock, flags);
- out_put_task:
- put_task_struct(task);
- out:
- return ret;
}

/*
+ * Max number of times we'll walk the boosting chain:
+ */
+int max_lock_depth = 1024;
+
+/*
* Optimization: check if we can steal the lock from the
* assigned pending owner [which might not have taken the
* lock yet]:
@@ -380,7 +241,6 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
static inline int try_to_steal_lock(struct rt_mutex *lock, int mode)
{
struct task_struct *pendowner = rt_mutex_owner(lock);
- struct rt_mutex_waiter *next;

if (!rt_mutex_owner_pending(lock))
return 0;
@@ -390,49 +250,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock, int mode)

WARN_ON(task_is_reader(rt_mutex_owner(lock)));

- spin_lock(&pendowner->pi_lock);
- if (!lock_is_stealable(pendowner, mode)) {
- spin_unlock(&pendowner->pi_lock);
- return 0;
- }
-
- /*
- * Check if a waiter is enqueued on the pending owners
- * pi_waiters list. Remove it and readjust pending owners
- * priority.
- */
- if (likely(!rt_mutex_has_waiters(lock))) {
- spin_unlock(&pendowner->pi_lock);
- return 1;
- }
-
- /* No chain handling, pending owner is not blocked on anything: */
- next = rt_mutex_top_waiter(lock);
- plist_del(&next->pi_list_entry, &pendowner->pi_waiters);
- __rt_mutex_adjust_prio(pendowner);
- spin_unlock(&pendowner->pi_lock);
-
- /*
- * We are going to steal the lock and a waiter was
- * enqueued on the pending owners pi_waiters queue. So
- * we have to enqueue this waiter into
- * current->pi_waiters list. This covers the case,
- * where current is boosted because it holds another
- * lock and gets unboosted because the booster is
- * interrupted, so we would delay a waiter with higher
- * priority as current->normal_prio.
- *
- * Note: in the rare case of a SCHED_OTHER task changing
- * its priority and thus stealing the lock, next->task
- * might be current:
- */
- if (likely(next->task != current)) {
- spin_lock(&current->pi_lock);
- plist_add(&next->pi_list_entry, &current->pi_waiters);
- __rt_mutex_adjust_prio(current);
- spin_unlock(&current->pi_lock);
- }
- return 1;
+ return lock_is_stealable(pendowner, mode);
}

/*
@@ -486,74 +304,145 @@ static inline int try_to_take_rt_mutex(struct rt_mutex *lock)
}

/*
- * Task blocks on lock.
- *
- * Prepare waiter and propagate pi chain
- *
- * This must be called with lock->wait_lock held.
+ * These callbacks are invoked whenever a waiter has changed priority.
+ * So we should requeue it within the lock->wait_list
*/
-static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter,
- int detect_deadlock, unsigned long flags)
+
+static inline int rtmutex_waiter_pi_boost(struct pi_sink *snk,
+ struct pi_source *src,
+ unsigned int flags)
{
- struct task_struct *owner = rt_mutex_owner(lock);
- struct rt_mutex_waiter *top_waiter = waiter;
- int chain_walk = 0, res;
+ struct rt_mutex_waiter *waiter;

- spin_lock(&current->pi_lock);
- __rt_mutex_adjust_prio(current);
- waiter->task = current;
- waiter->lock = lock;
- plist_node_init(&waiter->list_entry, current->prio);
- plist_node_init(&waiter->pi_list_entry, current->prio);
+ waiter = container_of(snk, struct rt_mutex_waiter, pi.snk);

- /* Get the top priority waiter on the lock */
- if (rt_mutex_has_waiters(lock))
- top_waiter = rt_mutex_top_waiter(lock);
- plist_add(&waiter->list_entry, &lock->wait_list);
+ /*
+ * We dont need to take any locks here because the
+ * waiter->pi.node interlock is already guaranteeing mutual
+ * exclusion.
+ */
+ waiter->pi.prio = *src->prio;

- current->pi_blocked_on = waiter;
+ return 0;
+}

- spin_unlock(&current->pi_lock);
+static inline int rtmutex_waiter_pi_update(struct pi_sink *snk,
+ unsigned int flags)
+{
+ struct rt_mutex *lock;
+ struct rt_mutex_waiter *waiter;
+ unsigned long iflags;

- if (waiter == rt_mutex_top_waiter(lock)) {
- /* readers are handled differently */
- if (task_is_reader(owner)) {
- res = rt_mutex_adjust_readers(lock, waiter,
- current, lock, 0);
- return res;
- }
+ waiter = container_of(snk, struct rt_mutex_waiter, pi.snk);
+ lock = waiter->lock;

- spin_lock(&owner->pi_lock);
- plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters);
- plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
+ spin_lock_irqsave(&lock->wait_lock, iflags);

- __rt_mutex_adjust_prio(owner);
- if (owner->pi_blocked_on)
- chain_walk = 1;
- spin_unlock(&owner->pi_lock);
+ /*
+ * If waiter->task is non-NULL, it means we are still valid in the
+ * pi list. Therefore, if waiter->pi.prio has changed since we
+ * queued ourselves, requeue it.
+ */
+ if (waiter->task && waiter->list_entry.prio != waiter->pi.prio) {
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
}
- else if (debug_rt_mutex_detect_deadlock(waiter, detect_deadlock))
- chain_walk = 1;

- if (!chain_walk || task_is_reader(owner))
- return 0;
+ spin_unlock_irqrestore(&lock->wait_lock, iflags);
+
+ return 0;
+}
+
+static struct pi_sink rtmutex_waiter_pi_snk = {
+ .boost = rtmutex_waiter_pi_boost,
+ .update = rtmutex_waiter_pi_update,
+};
+
+/*
+ * This must be called with lock->wait_lock held.
+ */
+static int add_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ unsigned long *flags)
+{
+ int has_waiters = rt_mutex_has_waiters(lock);
+
+ waiter->task = current;
+ waiter->lock = lock;
+ waiter->pi.prio = current->prio;
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
+ waiter->pi.snk = rtmutex_waiter_pi_snk;

/*
- * The owner can't disappear while holding a lock,
- * so the owner struct is protected by wait_lock.
- * Gets dropped in rt_mutex_adjust_prio_chain()!
+ * Link the waiter object to the task so that we can adjust our
+ * position on the prio list if the priority is changed. Note
+ * that if the priority races between the time we recorded it
+ * above and the time it is set here, we will correct the race
+ * when we task_pi_update(current) below. Otherwise the
+ * update is a no-op
*/
- get_task_struct(owner);
+ pi_add_sink(&current->pi.node, &waiter->pi.snk,
+ PI_FLAG_DEFER_UPDATE);

- spin_unlock_irqrestore(&lock->wait_lock, flags);
+ /*
+ * Link the lock object to the waiter so that we can form a chain
+ * to the owner
+ */
+ pi_add_sink(&current->pi.node, &lock->pi.node.snk,
+ PI_FLAG_DEFER_UPDATE);

- res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter,
- current, 0);
+ /*
+ * If we previously had no waiters, we are transitioning to
+ * a mode where we need to boost the owner
+ */
+ if (!has_waiters) {
+ struct task_struct *owner = rt_mutex_owner(lock);
+ rtmutex_pi_owner(lock, owner, 1);
+ }

- spin_lock_irq(&lock->wait_lock);
+ spin_unlock_irqrestore(&lock->wait_lock, *flags);
+ task_pi_update(current, 0);
+ spin_lock_irqsave(&lock->wait_lock, *flags);
+
+ return 0;
+}
+
+/*
+ * Remove a waiter from a lock
+ *
+ * Must be called with lock->wait_lock held
+ */
+static void remove_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ struct task_struct *p = waiter->task;
+
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ waiter->task = NULL;
+
+ /*
+ * We can stop boosting the owner if there are no more waiters
+ */
+ if (!rt_mutex_has_waiters(lock)) {
+ struct task_struct *owner = rt_mutex_owner(lock);
+ rtmutex_pi_owner(lock, owner, 0);
+ }

- return res;
+ /*
+ * Unlink the lock object from the waiter
+ */
+ pi_del_sink(&p->pi.node, &lock->pi.node.snk, PI_FLAG_DEFER_UPDATE);
+
+ /*
+ * Unlink the waiter object from the task. Note that we
+ * technically do not need an update for "p" because the
+ * .deboost will be processed synchronous to this call
+ * since there is no .deboost handler registered for
+ * the waiter sink
+ */
+ pi_del_sink(&p->pi.node, &waiter->pi.snk, PI_FLAG_DEFER_UPDATE);
}

/*
@@ -566,24 +455,10 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
*/
static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
{
- struct rt_mutex_waiter *waiter;
- struct task_struct *pendowner;
- struct rt_mutex_waiter *next;
-
- spin_lock(&current->pi_lock);
+ struct rt_mutex_waiter *waiter = rt_mutex_top_waiter(lock);
+ struct task_struct *pendowner = waiter->task;

- waiter = rt_mutex_top_waiter(lock);
- plist_del(&waiter->list_entry, &lock->wait_list);
-
- /*
- * Remove it from current->pi_waiters. We do not adjust a
- * possible priority boost right now. We execute wakeup in the
- * boosted mode and go back to normal after releasing
- * lock->wait_lock.
- */
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
- pendowner = waiter->task;
- waiter->task = NULL;
+ remove_waiter(lock, waiter);

/*
* Do the wakeup before the ownership change to give any spinning
@@ -621,113 +496,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
}

rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
-
- spin_unlock(&current->pi_lock);
-
- /*
- * Clear the pi_blocked_on variable and enqueue a possible
- * waiter into the pi_waiters list of the pending owner. This
- * prevents that in case the pending owner gets unboosted a
- * waiter with higher priority than pending-owner->normal_prio
- * is blocked on the unboosted (pending) owner.
- */
-
- if (rt_mutex_has_waiters(lock))
- next = rt_mutex_top_waiter(lock);
- else
- next = NULL;
-
- spin_lock(&pendowner->pi_lock);
-
- WARN_ON(!pendowner->pi_blocked_on);
- WARN_ON(pendowner->pi_blocked_on != waiter);
- WARN_ON(pendowner->pi_blocked_on->lock != lock);
-
- pendowner->pi_blocked_on = NULL;
-
- if (next)
- plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
-
- spin_unlock(&pendowner->pi_lock);
-}
-
-/*
- * Remove a waiter from a lock
- *
- * Must be called with lock->wait_lock held
- */
-static void remove_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter,
- unsigned long flags)
-{
- int first = (waiter == rt_mutex_top_waiter(lock));
- struct task_struct *owner = rt_mutex_owner(lock);
- int chain_walk = 0;
-
- spin_lock(&current->pi_lock);
- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->task = NULL;
- current->pi_blocked_on = NULL;
- spin_unlock(&current->pi_lock);
-
- if (first && owner != current && !task_is_reader(owner)) {
-
- spin_lock(&owner->pi_lock);
-
- plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
-
- if (rt_mutex_has_waiters(lock)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(lock);
- plist_add(&next->pi_list_entry, &owner->pi_waiters);
- }
- __rt_mutex_adjust_prio(owner);
-
- if (owner->pi_blocked_on)
- chain_walk = 1;
-
- spin_unlock(&owner->pi_lock);
- }
-
- WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
-
- if (!chain_walk)
- return;
-
- /* gets dropped in rt_mutex_adjust_prio_chain()! */
- get_task_struct(owner);
-
- spin_unlock_irqrestore(&lock->wait_lock, flags);
-
- rt_mutex_adjust_prio_chain(owner, 0, lock, NULL, current, 0);
-
- spin_lock_irq(&lock->wait_lock);
-}
-
-/*
- * Recheck the pi chain, in case we got a priority setting
- *
- * Called from sched_setscheduler
- */
-void rt_mutex_adjust_pi(struct task_struct *task)
-{
- struct rt_mutex_waiter *waiter;
- unsigned long flags;
-
- spin_lock_irqsave(&task->pi_lock, flags);
-
- waiter = task->pi_blocked_on;
- if (!waiter || waiter->list_entry.prio == task->prio) {
- spin_unlock_irqrestore(&task->pi_lock, flags);
- return;
- }
-
- /* gets dropped in rt_mutex_adjust_prio_chain()! */
- get_task_struct(task);
- spin_unlock_irqrestore(&task->pi_lock, flags);
-
- rt_mutex_adjust_prio_chain(task, 0, NULL, NULL, task, 0);
}

/*
@@ -869,7 +637,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* but the lock got stolen by an higher prio task.
*/
if (!waiter.task) {
- task_blocks_on_rt_mutex(lock, &waiter, 0, flags);
+ add_waiter(lock, &waiter, &flags);
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
@@ -917,7 +685,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* can end up with a non-NULL waiter.task:
*/
if (unlikely(waiter.task))
- remove_waiter(lock, &waiter, flags);
+ remove_waiter(lock, &waiter);
/*
* try_to_take_rt_mutex() sets the waiter bit
* unconditionally. We might have to fix that up:
@@ -927,6 +695,9 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
unlock:
spin_unlock_irqrestore(&lock->wait_lock, flags);

+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);
+
debug_rt_mutex_free_waiter(&waiter);
}

@@ -954,8 +725,8 @@ rt_spin_lock_slowunlock(struct rt_mutex *lock)

spin_unlock_irqrestore(&lock->wait_lock, flags);

- /* Undo pi boosting.when necessary */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

void __lockfunc rt_spin_lock(spinlock_t *lock)
@@ -1126,6 +897,9 @@ static inline void
rt_rwlock_add_reader(struct reader_lock_struct *rls, struct rw_mutex *rwm)
{
list_add(&rls->list, &rwm->readers);
+
+ pi_source_init(&rls->pi_src, &rwm->prio);
+ task_pi_boost(rls->task, &rls->pi_src, PI_FLAG_DEFER_UPDATE);
}

/*
@@ -1249,21 +1023,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
waiter = rt_mutex_top_waiter(mutex);
if (!lock_is_stealable(waiter->task, mode))
return 0;
- /*
- * The pending reader has PI waiters,
- * but we are taking the lock.
- * Remove the waiters from the pending owner.
- */
- spin_lock(&mtxowner->pi_lock);
- plist_del(&waiter->pi_list_entry, &mtxowner->pi_waiters);
- spin_unlock(&mtxowner->pi_lock);
}
- } else if (rt_mutex_has_waiters(mutex)) {
- /* Readers do things differently with respect to PI */
- waiter = rt_mutex_top_waiter(mutex);
- spin_lock(&current->pi_lock);
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
- spin_unlock(&current->pi_lock);
}
/* Readers never own the mutex */
rt_mutex_set_owner(mutex, RT_RW_READER, 0);
@@ -1275,7 +1035,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
if (incr) {
atomic_inc(&rwm->owners);
rw_check_held(rwm);
- spin_lock(&current->pi_lock);
+ preempt_disable();
reader_count = current->reader_lock_count++;
if (likely(reader_count < MAX_RWLOCK_DEPTH)) {
rls = &current->owned_read_locks[reader_count];
@@ -1285,10 +1045,11 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
rt_rwlock_add_reader(rls, rwm);
} else
WARN_ON_ONCE(1);
- spin_unlock(&current->pi_lock);
+ preempt_enable();
}
rt_mutex_deadlock_account_lock(mutex, current);
atomic_inc(&rwm->count);
+
return 1;
}

@@ -1378,7 +1139,7 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
* but the lock got stolen by a higher prio task.
*/
if (!waiter.task) {
- task_blocks_on_rt_mutex(mutex, &waiter, 0, flags);
+ add_waiter(mutex, &waiter, &flags);
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
@@ -1417,7 +1178,7 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
}

if (unlikely(waiter.task))
- remove_waiter(mutex, &waiter, flags);
+ remove_waiter(mutex, &waiter);

WARN_ON(rt_mutex_owner(mutex) &&
rt_mutex_owner(mutex) != current &&
@@ -1430,6 +1191,9 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
if (mtx && unlikely(saved_lock_depth >= 0))
rt_reacquire_bkl(saved_lock_depth);

+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);
+
debug_rt_mutex_free_waiter(&waiter);
}

@@ -1457,13 +1221,13 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
atomic_inc(&rwm->owners);
rw_check_held(rwm);
local_irq_save(flags);
- spin_lock(&current->pi_lock);
reader_count = current->reader_lock_count++;
if (likely(reader_count < MAX_RWLOCK_DEPTH)) {
current->owned_read_locks[reader_count].lock = rwm;
current->owned_read_locks[reader_count].count = 1;
} else
WARN_ON_ONCE(1);
+
/*
* If this task is no longer the sole owner of the lock
* or someone is blocking, then we need to add the task
@@ -1473,16 +1237,12 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
struct rt_mutex *mutex = &rwm->mutex;
struct reader_lock_struct *rls;

- /* preserve lock order, we only need wait_lock now */
- spin_unlock(&current->pi_lock);
-
spin_lock(&mutex->wait_lock);
rls = &current->owned_read_locks[reader_count];
if (!rls->list.prev || list_empty(&rls->list))
- rt_rwlock_add_reader(rlw, rwm);
+ rt_rwlock_add_reader(rls, rwm);
spin_unlock(&mutex->wait_lock);
- } else
- spin_unlock(&current->pi_lock);
+ }
local_irq_restore(flags);
return 1;
}
@@ -1591,7 +1351,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
* but the lock got stolen by a higher prio task.
*/
if (!waiter.task) {
- task_blocks_on_rt_mutex(mutex, &waiter, 0, flags);
+ add_waiter(mutex, &waiter, &flags);
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
@@ -1630,7 +1390,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
}

if (unlikely(waiter.task))
- remove_waiter(mutex, &waiter, flags);
+ remove_waiter(mutex, &waiter);

/* check on unlock if we have any waiters. */
if (rt_mutex_has_waiters(mutex))
@@ -1642,6 +1402,9 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
if (mtx && unlikely(saved_lock_depth >= 0))
rt_reacquire_bkl(saved_lock_depth);

+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);
+
debug_rt_mutex_free_waiter(&waiter);

}
@@ -1733,7 +1496,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)

for (i = current->reader_lock_count - 1; i >= 0; i--) {
if (current->owned_read_locks[i].lock == rwm) {
- spin_lock(&current->pi_lock);
+ preempt_disable();
current->owned_read_locks[i].count--;
if (!current->owned_read_locks[i].count) {
current->reader_lock_count--;
@@ -1743,9 +1506,11 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)
WARN_ON(!rls->list.prev || list_empty(&rls->list));
list_del_init(&rls->list);
rls->lock = NULL;
+ task_pi_deboost(current, &rls->pi_src,
+ PI_FLAG_DEFER_UPDATE);
rw_check_held(rwm);
}
- spin_unlock(&current->pi_lock);
+ preempt_enable();
break;
}
}
@@ -1776,7 +1541,6 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)

/* If no one is blocked, then clear all ownership */
if (!rt_mutex_has_waiters(mutex)) {
- rwm->prio = MAX_PRIO;
/*
* If count is not zero, we are under the limit with
* no other readers.
@@ -1835,28 +1599,11 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)
rt_mutex_set_owner(mutex, RT_RW_READER, 0);
}

- if (rt_mutex_has_waiters(mutex)) {
- waiter = rt_mutex_top_waiter(mutex);
- rwm->prio = waiter->task->prio;
- /*
- * If readers still own this lock, then we need
- * to update the pi_list too. Readers have a separate
- * path in the PI chain.
- */
- if (reader_count) {
- spin_lock(&pendowner->pi_lock);
- plist_del(&waiter->pi_list_entry,
- &pendowner->pi_waiters);
- spin_unlock(&pendowner->pi_lock);
- }
- } else
- rwm->prio = MAX_PRIO;
-
out:
spin_unlock_irqrestore(&mutex->wait_lock, flags);

- /* Undo pi boosting.when necessary */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

static inline void
@@ -1874,9 +1621,9 @@ rt_read_fastunlock(struct rw_mutex *rwm,
int reader_count;
int owners;

- spin_lock_irqsave(&current->pi_lock, flags);
+ local_irq_save(flags);
reader_count = --current->reader_lock_count;
- spin_unlock_irqrestore(&current->pi_lock, flags);
+ local_irq_restore(flags);

rt_mutex_deadlock_account_unlock(current);
if (unlikely(reader_count < 0)) {
@@ -1972,17 +1719,7 @@ rt_write_slowunlock(struct rw_mutex *rwm, int mtx)
while (waiter && !waiter->write_lock) {
struct task_struct *reader = waiter->task;

- spin_lock(&pendowner->pi_lock);
- plist_del(&waiter->list_entry, &mutex->wait_list);
-
- /* nop if not on a list */
- plist_del(&waiter->pi_list_entry, &pendowner->pi_waiters);
- spin_unlock(&pendowner->pi_lock);
-
- spin_lock(&reader->pi_lock);
- waiter->task = NULL;
- reader->pi_blocked_on = NULL;
- spin_unlock(&reader->pi_lock);
+ remove_waiter(mutex, waiter);

if (savestate)
wake_up_process_mutex(reader);
@@ -1995,32 +1732,12 @@ rt_write_slowunlock(struct rw_mutex *rwm, int mtx)
waiter = NULL;
}

- /* If a writer is still pending, then update its plist. */
- if (rt_mutex_has_waiters(mutex)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(mutex);
-
- spin_lock(&pendowner->pi_lock);
- /* delete incase we didn't go through the loop */
- plist_del(&next->pi_list_entry, &pendowner->pi_waiters);
-
- /* This could also be a reader (if reader_limit is set) */
- if (next->write_lock)
- /* add back in as top waiter */
- plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
- spin_unlock(&pendowner->pi_lock);
-
- rwm->prio = next->task->prio;
- } else
- rwm->prio = MAX_PRIO;
-
out:

spin_unlock_irqrestore(&mutex->wait_lock, flags);

- /* Undo pi boosting.when necessary */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

static inline void
@@ -2068,7 +1785,7 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
atomic_inc(&rwm->owners);
rw_check_held(rwm);

- spin_lock(&current->pi_lock);
+ preempt_disable();
reader_count = current->reader_lock_count++;
rls = &current->owned_read_locks[reader_count];
if (likely(reader_count < MAX_RWLOCK_DEPTH)) {
@@ -2076,12 +1793,11 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
rls->count = 1;
} else
WARN_ON_ONCE(1);
- spin_unlock(&current->pi_lock);
+ preempt_enable();

if (!rt_mutex_has_waiters(mutex)) {
/* We are sole owner, we are done */
rwm->owner = current;
- rwm->prio = MAX_PRIO;
mutex->owner = NULL;
spin_unlock_irqrestore(&mutex->wait_lock, flags);
return;
@@ -2102,17 +1818,8 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
while (waiter && !waiter->write_lock) {
struct task_struct *reader = waiter->task;

- spin_lock(&current->pi_lock);
plist_del(&waiter->list_entry, &mutex->wait_list);
-
- /* nop if not on a list */
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
- spin_unlock(&current->pi_lock);
-
- spin_lock(&reader->pi_lock);
waiter->task = NULL;
- reader->pi_blocked_on = NULL;
- spin_unlock(&reader->pi_lock);

/* downgrade is only for mutexes */
wake_up_process(reader);
@@ -2123,124 +1830,81 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
waiter = NULL;
}

- /* If a writer is still pending, then update its plist. */
- if (rt_mutex_has_waiters(mutex)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(mutex);
-
- /* setup this mutex prio for read */
- rwm->prio = next->task->prio;
-
- spin_lock(&current->pi_lock);
- /* delete incase we didn't go through the loop */
- plist_del(&next->pi_list_entry, &current->pi_waiters);
- spin_unlock(&current->pi_lock);
- /* No need to add back since readers don't have PI waiters */
- } else
- rwm->prio = MAX_PRIO;
-
rt_mutex_set_owner(mutex, RT_RW_READER, 0);

spin_unlock_irqrestore(&mutex->wait_lock, flags);
-
- /*
- * Undo pi boosting when necessary.
- * If one of the awoken readers boosted us, we don't want to keep
- * that priority.
- */
- rt_mutex_adjust_prio(current);
-}
-
-void rt_mutex_rwsem_init(struct rw_mutex *rwm, const char *name)
-{
- struct rt_mutex *mutex = &rwm->mutex;
-
- rwm->owner = NULL;
- atomic_set(&rwm->count, 0);
- atomic_set(&rwm->owners, 0);
- rwm->prio = MAX_PRIO;
- INIT_LIST_HEAD(&rwm->readers);
-
- __rt_mutex_init(mutex, name);
}

-static int rt_mutex_get_readers_prio(struct task_struct *task, int prio)
+/*
+ * These callbacks are invoked whenever a rwlock has changed priority.
+ * Since rwlocks maintain their own lists of reader dependencies, we
+ * may need to reboost any readers manually
+ */
+static inline int rt_rwlock_pi_boost(struct pi_sink *snk,
+ struct pi_source *src,
+ unsigned int flags)
{
- struct reader_lock_struct *rls;
struct rw_mutex *rwm;
- int lock_prio;
- int i;

- for (i = 0; i < task->reader_lock_count; i++) {
- rls = &task->owned_read_locks[i];
- rwm = rls->lock;
- if (rwm) {
- lock_prio = rwm->prio;
- if (prio > lock_prio)
- prio = lock_prio;
- }
- }
+ rwm = container_of(snk, struct rw_mutex, pi_snk);

- return prio;
+ /*
+ * We dont need to take any locks here because the
+ * lock->pi.node interlock is already guaranteeing mutual
+ * exclusion.
+ */
+ rwm->prio = *src->prio;
+
+ return 0;
}

-static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- struct rt_mutex *lock,
- int recursion_depth)
+static inline int rt_rwlock_pi_update(struct pi_sink *snk,
+ unsigned int flags)
{
+ struct rw_mutex *rwm;
+ struct rt_mutex *mutex;
struct reader_lock_struct *rls;
- struct rt_mutex_waiter *waiter;
- struct task_struct *task;
- struct rw_mutex *rwm = container_of(lock, struct rw_mutex, mutex);
+ unsigned long iflags;

- if (rt_mutex_has_waiters(lock)) {
- waiter = rt_mutex_top_waiter(lock);
- /*
- * Do we need to grab the task->pi_lock?
- * Really, we are only reading it. If it
- * changes, then that should follow this chain
- * too.
- */
- rwm->prio = waiter->task->prio;
- } else
- rwm->prio = MAX_PRIO;
+ rwm = container_of(snk, struct rw_mutex, pi_snk);
+ mutex = &rwm->mutex;

- if (recursion_depth >= MAX_RWLOCK_DEPTH) {
- WARN_ON(1);
- return 1;
- }
+ spin_lock_irqsave(&mutex->wait_lock, iflags);

- list_for_each_entry(rls, &rwm->readers, list) {
- task = rls->task;
- get_task_struct(task);
- /*
- * rt_mutex_adjust_prio_chain will do
- * the put_task_struct
- */
- rt_mutex_adjust_prio_chain(task, 0, orig_lock,
- orig_waiter, top_task,
- recursion_depth+1);
- }
+ list_for_each_entry(rls, &rwm->readers, list)
+ task_pi_boost(rls->task, &rls->pi_src, 0);
+
+ spin_unlock_irqrestore(&mutex->wait_lock, iflags);

return 0;
}
-#else
-static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- struct rt_mutex *lock,
- int recursion_depth)
-{
- return 0;
-}

-static int rt_mutex_get_readers_prio(struct task_struct *task, int prio)
+static struct pi_sink rt_rwlock_pi_snk = {
+ .boost = rt_rwlock_pi_boost,
+ .update = rt_rwlock_pi_update,
+};
+
+void rt_mutex_rwsem_init(struct rw_mutex *rwm, const char *name)
{
- return prio;
+ struct rt_mutex *mutex = &rwm->mutex;
+
+ rwm->owner = NULL;
+ atomic_set(&rwm->count, 0);
+ atomic_set(&rwm->owners, 0);
+ rwm->prio = MAX_PRIO;
+ INIT_LIST_HEAD(&rwm->readers);
+
+ __rt_mutex_init(mutex, name);
+
+ /*
+ * Link the rwlock object to the mutex so we get notified
+ * of any priority changes in the future
+ */
+ rwm->pi_snk = rt_rwlock_pi_snk;
+ pi_add_sink(&mutex->pi.node, &rwm->pi_snk,
+ PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
}
+
#endif /* CONFIG_PREEMPT_RT */

static inline int rt_release_bkl(struct rt_mutex *lock, unsigned long flags)
@@ -2335,8 +1999,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
* but the lock got stolen by a higher prio task.
*/
if (!waiter.task) {
- ret = task_blocks_on_rt_mutex(lock, &waiter,
- detect_deadlock, flags);
+ ret = add_waiter(lock, &waiter, &flags);
/*
* If we got woken up by the owner then start loop
* all over without going into schedule to try
@@ -2374,7 +2037,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
set_current_state(TASK_RUNNING);

if (unlikely(waiter.task))
- remove_waiter(lock, &waiter, flags);
+ remove_waiter(lock, &waiter);

/*
* try_to_take_rt_mutex() sets the waiter bit
@@ -2388,13 +2051,8 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
if (unlikely(timeout))
hrtimer_cancel(&timeout->timer);

- /*
- * Readjust priority, when we did not get the lock. We might
- * have been the pending owner and boosted. Since we did not
- * take the lock, the PI boost has to go.
- */
- if (unlikely(ret))
- rt_mutex_adjust_prio(current);
+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);

/* Must we reaquire the BKL? */
if (unlikely(saved_lock_depth >= 0))
@@ -2457,8 +2115,8 @@ rt_mutex_slowunlock(struct rt_mutex *lock)

spin_unlock_irqrestore(&lock->wait_lock, flags);

- /* Undo pi boosting if necessary: */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

/*
@@ -2654,6 +2312,8 @@ void __rt_mutex_init(struct rt_mutex *lock, const char *name)
spin_lock_init(&lock->wait_lock);
plist_head_init(&lock->wait_list, &lock->wait_lock);

+ init_pi(lock);
+
debug_rt_mutex_init(lock, name);
}
EXPORT_SYMBOL_GPL(__rt_mutex_init);
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 70df5f5..7bf32d0 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -14,6 +14,7 @@

#include <linux/rtmutex.h>
#include <linux/rt_lock.h>
+#include <linux/pi.h>

/*
* The rtmutex in kernel tester is independent of rtmutex debugging. We
@@ -48,10 +49,13 @@ extern void schedule_rt_mutex_test(struct rt_mutex *lock);
*/
struct rt_mutex_waiter {
struct plist_node list_entry;
- struct plist_node pi_list_entry;
struct task_struct *task;
struct rt_mutex *lock;
int write_lock;
+ struct {
+ struct pi_sink snk;
+ int prio;
+ } pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
unsigned long ip;
struct pid *deadlock_task_pid;
@@ -79,18 +83,6 @@ rt_mutex_top_waiter(struct rt_mutex *lock)
return w;
}

-static inline int task_has_pi_waiters(struct task_struct *p)
-{
- return !plist_head_empty(&p->pi_waiters);
-}
-
-static inline struct rt_mutex_waiter *
-task_top_pi_waiter(struct task_struct *p)
-{
- return plist_first_entry(&p->pi_waiters, struct rt_mutex_waiter,
- pi_list_entry);
-}
-
/*
* lock->owner state tracking:
*/
diff --git a/kernel/rwlock_torture.c b/kernel/rwlock_torture.c
index 2820815..689a0d0 100644
--- a/kernel/rwlock_torture.c
+++ b/kernel/rwlock_torture.c
@@ -682,37 +682,7 @@ static int __init mutex_stress_init(void)

print_owned_read_locks(tsks[i]);

- if (tsks[i]->pi_blocked_on) {
- w = (void *)tsks[i]->pi_blocked_on;
- mtx = w->lock;
- spin_unlock_irq(&tsks[i]->pi_lock);
- spin_lock_irq(&mtx->wait_lock);
- spin_lock(&tsks[i]->pi_lock);
- own = (unsigned long)mtx->owner & ~3UL;
- oops_in_progress++;
- printk("%s:%d is blocked on ",
- tsks[i]->comm, tsks[i]->pid);
- __print_symbol("%s", (unsigned long)mtx);
- if (own == 0x100)
- printk(" owner is READER\n");
- else if (!(own & ~300))
- printk(" owner is ILLEGAL!!\n");
- else if (!own)
- printk(" has no owner!\n");
- else {
- struct task_struct *owner = (void*)own;
-
- printk(" owner is %s:%d\n",
- owner->comm, owner->pid);
- }
- oops_in_progress--;
-
- spin_unlock(&tsks[i]->pi_lock);
- spin_unlock_irq(&mtx->wait_lock);
- } else {
- print_owned_read_locks(tsks[i]);
- spin_unlock_irq(&tsks[i]->pi_lock);
- }
+ spin_unlock_irq(&tsks[i]->pi_lock);
}
}
#endif
diff --git a/kernel/sched.c b/kernel/sched.c
index 729139d..a373250 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2413,12 +2413,6 @@ task_pi_init(struct task_struct *p)
pi_source_init(&p->pi.src, &p->normal_prio);
task_pi_boost(p, &p->pi.src, PI_FLAG_DEFER_UPDATE);

-#ifdef CONFIG_RT_MUTEXES
- p->rtmutex_prio = MAX_PRIO;
- pi_source_init(&p->rtmutex_prio_src, &p->rtmutex_prio);
- task_pi_boost(p, &p->rtmutex_prio_src, PI_FLAG_DEFER_UPDATE);
-#endif
-
/*
* We add our own task as a dependency of ourselves so that
* we get boost-notifications (via task_pi_boost_cb) whenever
@@ -5029,7 +5023,6 @@ task_pi_update_cb(struct pi_sink *snk, unsigned int flags)
*/
if (unlikely(p == rq->idle)) {
WARN_ON(p != rq->curr);
- WARN_ON(p->pi_blocked_on);
goto out_unlock;
}

@@ -5360,7 +5353,6 @@ recheck:
spin_unlock_irqrestore(&p->pi_lock, flags);

task_pi_update(p, 0);
- rt_mutex_adjust_pi(p);

return 0;
}
@@ -8494,10 +8486,6 @@ void __init sched_init(void)

task_pi_init(&init_task);

-#ifdef CONFIG_RT_MUTEXES
- plist_head_init(&init_task.pi_waiters, &init_task.pi_lock);
-#endif
-
/*
* The boot idle thread does lazy MMU switching as well:
*/

2008-08-15 12:20:37

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v2 2/8] sched: add the basic PI infrastructure to the task_struct

This is a first pass at converting the system to use the new PI library.
We don't go for a wholesale replacement quite yet so that we can focus
on getting the basic plumbing in place. Later in the series we will
begin replacing some of the existing subsystem-specific logic with the
generic framework.
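
For illustration only (not part of the patch), here is a minimal sketch of
how a subsystem would contribute priority to a task under the new scheme,
using the pi_source_init()/task_pi_boost()/task_pi_deboost() helpers added
below. The "my_boost" structure and the prio values are hypothetical:

#include <linux/sched.h>	/* task_pi_boost()/task_pi_deboost() */
#include <linux/pi.h>		/* pi_source_init() */

/* hypothetical per-subsystem boost state */
struct my_boost {
	int prio;		/* the priority we want to contribute */
	struct pi_source src;	/* reference to 'prio', handed to the task */
};

static void my_boost_init(struct my_boost *b)
{
	b->prio = MAX_PRIO;
	pi_source_init(&b->src, &b->prio);
}

static void my_boost_task(struct task_struct *p, struct my_boost *b, int prio)
{
	b->prio = prio;
	/*
	 * boost() both registers the source and refreshes its value, so
	 * the same call also reports any later change to b->prio.  This
	 * replaces a direct task_setprio() call in the old scheme.
	 */
	task_pi_boost(p, &b->src, 0);
}

static void my_unboost_task(struct task_struct *p, struct my_boost *b)
{
	/*
	 * Withdraw our contribution; the task recomputes its prio from
	 * the remaining sources (normal_prio, rtmutex, rcu-boost, ...).
	 */
	task_pi_deboost(p, &b->src, 0);
}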

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/sched.h | 37 +++++++--
include/linux/workqueue.h | 2
kernel/fork.c | 1
kernel/rcupreempt-boost.c | 23 +-----
kernel/rtmutex.c | 6 +
kernel/sched.c | 188 ++++++++++++++++++++++++++++++++-------------
kernel/workqueue.c | 39 ++++++++-
7 files changed, 206 insertions(+), 90 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c885f78..63ddd1f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -87,6 +87,7 @@ struct sched_param {
#include <linux/task_io_accounting.h>
#include <linux/kobject.h>
#include <linux/latencytop.h>
+#include <linux/pi.h>

#include <asm/processor.h>

@@ -1125,6 +1126,7 @@ struct task_struct {
int prio, static_prio, normal_prio;
#ifdef CONFIG_PREEMPT_RCU_BOOST
int rcu_prio;
+ struct pi_source rcu_prio_src;
#endif
const struct sched_class *sched_class;
struct sched_entity se;
@@ -1298,11 +1300,20 @@ struct task_struct {
/* Protection of the PI data structures: */
raw_spinlock_t pi_lock;

+ struct {
+ struct pi_source src; /* represents normal_prio to 'this' */
+ struct pi_node node;
+ struct pi_sink snk; /* registered to 'this' to get updates */
+ int prio;
+ } pi;
+
#ifdef CONFIG_RT_MUTEXES
/* PI waiters blocked on a rt_mutex held by this task */
struct plist_head pi_waiters;
/* Deadlock detection and priority inheritance handling */
struct rt_mutex_waiter *pi_blocked_on;
+ int rtmutex_prio;
+ struct pi_source rtmutex_prio_src;
#endif

#ifdef CONFIG_DEBUG_MUTEXES
@@ -1440,6 +1451,26 @@ struct task_struct {
#endif
};

+static inline int
+task_pi_boost(struct task_struct *p, struct pi_source *src,
+ unsigned int flags)
+{
+ return pi_boost(&p->pi.node, src, flags);
+}
+
+static inline int
+task_pi_deboost(struct task_struct *p, struct pi_source *src,
+ unsigned int flags)
+{
+ return pi_deboost(&p->pi.node, src, flags);
+}
+
+static inline int
+task_pi_update(struct task_struct *p, unsigned int flags)
+{
+ return pi_update(&p->pi.node, flags);
+}
+
#ifdef CONFIG_PREEMPT_RT
# define set_printk_might_sleep(x) do { current->in_printk = x; } while(0)
#else
@@ -1774,14 +1805,8 @@ int sched_rt_handler(struct ctl_table *table, int write,

extern unsigned int sysctl_sched_compat_yield;

-extern void task_setprio(struct task_struct *p, int prio);
-
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
-static inline void rt_mutex_setprio(struct task_struct *p, int prio)
-{
- task_setprio(p, prio);
-}
extern void rt_mutex_adjust_pi(struct task_struct *p);
#else
static inline int rt_mutex_getprio(struct task_struct *p)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 229179e..3dc4ed9 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -11,6 +11,7 @@
#include <linux/lockdep.h>
#include <linux/plist.h>
#include <linux/sched_prio.h>
+#include <linux/pi.h>
#include <asm/atomic.h>

struct workqueue_struct;
@@ -31,6 +32,7 @@ struct work_struct {
#define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)
struct plist_node entry;
work_func_t func;
+ struct pi_source pi_src;
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index b49488d..399a0d0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -990,6 +990,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->rcu_flipctr_idx = 0;
#ifdef CONFIG_PREEMPT_RCU_BOOST
p->rcu_prio = MAX_PRIO;
+ pi_source_init(&p->rcu_prio_src, &p->rcu_prio);
p->rcub_rbdp = NULL;
p->rcub_state = RCU_BOOST_IDLE;
INIT_LIST_HEAD(&p->rcub_entry);
diff --git a/kernel/rcupreempt-boost.c b/kernel/rcupreempt-boost.c
index 5282b19..e8d9d76 100644
--- a/kernel/rcupreempt-boost.c
+++ b/kernel/rcupreempt-boost.c
@@ -232,14 +232,11 @@ static inline int rcu_is_boosted(struct task_struct *task)
static void rcu_boost_task(struct task_struct *task)
{
WARN_ON(!irqs_disabled());
- WARN_ON_SMP(!spin_is_locked(&task->pi_lock));

rcu_trace_boost_task_boost_called(RCU_BOOST_ME);

- if (task->rcu_prio < task->prio) {
+ if (task_pi_boost(task, &task->rcu_prio_src, 0))
rcu_trace_boost_task_boosted(RCU_BOOST_ME);
- task_setprio(task, task->rcu_prio);
- }
}

/**
@@ -275,26 +272,17 @@ void __rcu_preempt_boost(void)
rbd = &__get_cpu_var(rcu_boost_data);
spin_lock(&rbd->rbs_lock);

- spin_lock(&curr->pi_lock);
-
curr->rcub_rbdp = rbd;

rcu_trace_boost_try_boost(rbd);

- prio = rt_mutex_getprio(curr);
-
if (list_empty(&curr->rcub_entry))
list_add_tail(&curr->rcub_entry, &rbd->rbs_toboost);
- if (prio <= rbd->rbs_prio)
- goto out;
-
- rcu_trace_boost_boosted(curr->rcub_rbdp);

set_rcu_prio(curr, rbd->rbs_prio);
rcu_boost_task(curr);

out:
- spin_unlock(&curr->pi_lock);
spin_unlock_irqrestore(&rbd->rbs_lock, flags);
}

@@ -353,15 +341,12 @@ void __rcu_preempt_unboost(void)

rcu_trace_boost_unboosted(rbd);

- set_rcu_prio(curr, MAX_PRIO);
+ task_pi_deboost(curr, &curr->rcu_prio_src, 0);

- spin_lock(&curr->pi_lock);
- prio = rt_mutex_getprio(curr);
- task_setprio(curr, prio);
+ set_rcu_prio(curr, MAX_PRIO);

curr->rcub_rbdp = NULL;

- spin_unlock(&curr->pi_lock);
out:
spin_unlock_irqrestore(&rbd->rbs_lock, flags);
}
@@ -393,9 +378,7 @@ static int __rcu_boost_readers(struct rcu_boost_dat *rbd, int prio, unsigned lon
list_move_tail(&p->rcub_entry,
&rbd->rbs_boosted);
set_rcu_prio(p, prio);
- spin_lock(&p->pi_lock);
rcu_boost_task(p);
- spin_unlock(&p->pi_lock);

/*
* Now we release the lock to allow for a higher
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 377949a..7d11380 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -178,8 +178,10 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
{
int prio = rt_mutex_getprio(task);

- if (task->prio != prio)
- rt_mutex_setprio(task, prio);
+ if (task->rtmutex_prio != prio) {
+ task->rtmutex_prio = prio;
+ task_pi_boost(task, &task->rtmutex_prio_src, 0);
+ }
}

/*
diff --git a/kernel/sched.c b/kernel/sched.c
index 54ea580..c129b10 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1709,26 +1709,6 @@ static inline int normal_prio(struct task_struct *p)
}

/*
- * Calculate the current priority, i.e. the priority
- * taken into account by the scheduler. This value might
- * be boosted by RT tasks, or might be boosted by
- * interactivity modifiers. Will be RT if the task got
- * RT-boosted. If not then it returns p->normal_prio.
- */
-static int effective_prio(struct task_struct *p)
-{
- p->normal_prio = normal_prio(p);
- /*
- * If we are RT tasks or we were boosted to RT priority,
- * keep the priority unchanged. Otherwise, update priority
- * to the normal priority:
- */
- if (!rt_prio(p->prio))
- return p->normal_prio;
- return p->prio;
-}
-
-/*
* activate_task - move a task to the runqueue.
*/
static void activate_task(struct rq *rq, struct task_struct *p, int wakeup)
@@ -2375,6 +2355,58 @@ static void __sched_fork(struct task_struct *p)
p->state = TASK_RUNNING;
}

+static int
+task_pi_boost_cb(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags)
+{
+ struct task_struct *p = container_of(snk, struct task_struct, pi.snk);
+
+ /*
+ * We don't need any locking here, since the .boost operation
+ * is already guaranteed to be mutually exclusive
+ */
+ p->pi.prio = *src->prio;
+
+ return 0;
+}
+
+static int task_pi_update_cb(struct pi_sink *snk, unsigned int flags);
+
+static struct pi_sink task_pi_sink = {
+ .boost = task_pi_boost_cb,
+ .update = task_pi_update_cb,
+};
+
+static inline void
+task_pi_init(struct task_struct *p)
+{
+ pi_node_init(&p->pi.node);
+
+ /*
+ * Feed our initial state of normal_prio into the PI infrastructure.
+ * We will update this whenever it changes
+ */
+ p->pi.prio = p->normal_prio;
+ pi_source_init(&p->pi.src, &p->normal_prio);
+ task_pi_boost(p, &p->pi.src, PI_FLAG_DEFER_UPDATE);
+
+#ifdef CONFIG_RT_MUTEXES
+ p->rtmutex_prio = MAX_PRIO;
+ pi_source_init(&p->rtmutex_prio_src, &p->rtmutex_prio);
+ task_pi_boost(p, &p->rtmutex_prio_src, PI_FLAG_DEFER_UPDATE);
+#endif
+
+ /*
+ * We add our own task as a dependency of ourselves so that
+ * we get boost-notifications (via task_pi_boost_cb) whenever
+ * our priority is changed (locally e.g. setscheduler() or
+ * remotely via a pi-boost).
+ */
+ p->pi.snk = task_pi_sink;
+ pi_add_sink(&p->pi.node, &p->pi.snk,
+ PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
+}
+
/*
* fork()/clone()-time setup:
*/
@@ -2396,6 +2428,8 @@ void sched_fork(struct task_struct *p, int clone_flags)
if (!rt_prio(p->prio))
p->sched_class = &fair_sched_class;

+ task_pi_init(p);
+
#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
if (likely(sched_info_on()))
memset(&p->sched_info, 0, sizeof(p->sched_info));
@@ -2411,6 +2445,55 @@ void sched_fork(struct task_struct *p, int clone_flags)
}

/*
+ * In the past, task_setprio was exposed as an API. This variant is only
+ * meant to be called from pi_update functions (namely, task_updateprio() and
+ * task_pi_update_cb()). If you need to adjust the priority of a task,
+ * you should be using something like setscheduler() (permanent adjustments)
+ * or task_pi_boost() (temporary adjustments).
+ */
+static void
+task_setprio(struct task_struct *p, int prio)
+{
+ if (prio == p->prio)
+ return;
+
+ if (rt_prio(prio))
+ p->sched_class = &rt_sched_class;
+ else
+ p->sched_class = &fair_sched_class;
+
+ p->prio = prio;
+}
+
+static inline void
+task_updateprio(struct task_struct *p)
+{
+ int prio = normal_prio(p);
+
+ if (p->normal_prio != prio) {
+ p->normal_prio = prio;
+ set_load_weight(p);
+
+ /*
+ * Reboost our normal_prio entry, which will
+ * also chain-update any of our PI dependencies (of course)
+ * on our next update
+ */
+ task_pi_boost(p, &p->pi.src, PI_FLAG_DEFER_UPDATE);
+ }
+
+ /*
+ * If normal_prio is logically higher than our current setting,
+ * just assign the priority/class immediately so that any callers
+ * will see the update as synchronous without dropping the rq-lock
+ * to do a pi_update. Any discrepancy with pending pi-updates will
+ * automatically be corrected after we drop the rq-lock.
+ */
+ if (p->normal_prio < p->prio)
+ task_setprio(p, p->normal_prio);
+}
+
+/*
* wake_up_new_task - wake up a newly created task for the first time.
*
* This function will do some initial scheduler statistics housekeeping
@@ -2426,7 +2509,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
BUG_ON(p->state != TASK_RUNNING);
update_rq_clock(rq);

- p->prio = effective_prio(p);
+ task_updateprio(p);

if (!p->sched_class->task_new || !current->se.on_rq) {
activate_task(rq, p, 0);
@@ -2447,6 +2530,8 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
p->sched_class->task_wake_up(rq, p);
#endif
task_rq_unlock(rq, &flags);
+
+ task_pi_update(p, 0);
}

#ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -4887,27 +4972,25 @@ long __sched sleep_on_timeout(wait_queue_head_t *q, long timeout)
EXPORT_SYMBOL(sleep_on_timeout);

/*
- * task_setprio - set the current priority of a task
- * @p: task
- * @prio: prio value (kernel-internal form)
+ * Invoked whenever our priority changes by the PI library
*
* This function changes the 'effective' priority of a task. It does
* not touch ->normal_prio like __setscheduler().
*
- * Used by the rt_mutex code to implement priority inheritance logic
- * and by rcupreempt-boost to boost priorities of tasks sleeping
- * with rcu locks.
*/
-void task_setprio(struct task_struct *p, int prio)
+static int
+task_pi_update_cb(struct pi_sink *snk, unsigned int flags)
{
- unsigned long flags;
+ struct task_struct *p = container_of(snk, struct task_struct, pi.snk);
+ unsigned long iflags;
int oldprio, on_rq, running;
+ int prio = p->pi.prio;
struct rq *rq;
const struct sched_class *prev_class = p->sched_class;

BUG_ON(prio < 0 || prio > MAX_PRIO);

- rq = task_rq_lock(p, &flags);
+ rq = task_rq_lock(p, &iflags);

/*
* Idle task boosting is a nono in general. There is one
@@ -4929,6 +5012,10 @@ void task_setprio(struct task_struct *p, int prio)

update_rq_clock(rq);

+ /* If prio is not changing, bail */
+ if (prio == p->prio)
+ goto out_unlock;
+
oldprio = p->prio;
on_rq = p->se.on_rq;
running = task_current(rq, p);
@@ -4937,12 +5024,7 @@ void task_setprio(struct task_struct *p, int prio)
if (running)
p->sched_class->put_prev_task(rq, p);

- if (rt_prio(prio))
- p->sched_class = &rt_sched_class;
- else
- p->sched_class = &fair_sched_class;
-
- p->prio = prio;
+ task_setprio(p, prio);

// trace_special_pid(p->pid, __PRIO(oldprio), PRIO(p));

@@ -4956,7 +5038,9 @@ void task_setprio(struct task_struct *p, int prio)
// trace_special(prev_resched, _need_resched(), 0);

out_unlock:
- task_rq_unlock(rq, &flags);
+ task_rq_unlock(rq, &iflags);
+
+ return 0;
}

void set_user_nice(struct task_struct *p, long nice)
@@ -4990,9 +5074,9 @@ void set_user_nice(struct task_struct *p, long nice)
}

p->static_prio = NICE_TO_PRIO(nice);
- set_load_weight(p);
old_prio = p->prio;
- p->prio = effective_prio(p);
+ task_updateprio(p);
+
delta = p->prio - old_prio;

if (on_rq) {
@@ -5007,6 +5091,8 @@ void set_user_nice(struct task_struct *p, long nice)
}
out_unlock:
task_rq_unlock(rq, &flags);
+
+ task_pi_update(p, 0);
}
EXPORT_SYMBOL(set_user_nice);

@@ -5123,23 +5209,9 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
BUG_ON(p->se.on_rq);

p->policy = policy;
- switch (p->policy) {
- case SCHED_NORMAL:
- case SCHED_BATCH:
- case SCHED_IDLE:
- p->sched_class = &fair_sched_class;
- break;
- case SCHED_FIFO:
- case SCHED_RR:
- p->sched_class = &rt_sched_class;
- break;
- }
-
p->rt_priority = prio;
- p->normal_prio = normal_prio(p);
- /* we are holding p->pi_lock already */
- p->prio = rt_mutex_getprio(p);
- set_load_weight(p);
+
+ task_updateprio(p);
}

/**
@@ -5264,6 +5336,7 @@ recheck:
__task_rq_unlock(rq);
spin_unlock_irqrestore(&p->pi_lock, flags);

+ task_pi_update(p, 0);
rt_mutex_adjust_pi(p);

return 0;
@@ -6686,6 +6759,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
deactivate_task(rq, rq->idle, 0);
rq->idle->static_prio = MAX_PRIO;
__setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
+ rq->idle->prio = rq->idle->normal_prio;
rq->idle->sched_class = &idle_sched_class;
migrate_dead_tasks(cpu);
spin_unlock_irq(&rq->lock);
@@ -8395,6 +8469,8 @@ void __init sched_init(void)
open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL);
#endif

+ task_pi_init(&init_task);
+
#ifdef CONFIG_RT_MUTEXES
plist_head_init(&init_task.pi_waiters, &init_task.pi_lock);
#endif
@@ -8460,7 +8536,9 @@ static void normalize_task(struct rq *rq, struct task_struct *p)
on_rq = p->se.on_rq;
if (on_rq)
deactivate_task(rq, p, 0);
+
__setscheduler(rq, p, SCHED_NORMAL, 0);
+
if (on_rq) {
activate_task(rq, p, 0);
resched_task(rq->curr);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9f37979..5cd4b0e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -145,8 +145,13 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
plist_node_init(&work->entry, prio);
plist_add(&work->entry, &cwq->worklist);

- if (boost_prio < cwq->thread->prio)
- task_setprio(cwq->thread, boost_prio);
+ /*
+ * FIXME: We want to boost to boost_prio, but we don't record that
+ * value in the work_struct for later deboosting
+ */
+ pi_source_init(&work->pi_src, &work->entry.prio);
+ task_pi_boost(cwq->thread, &work->pi_src, 0);
+
wake_up(&cwq->more_work);
}

@@ -280,6 +285,10 @@ struct wq_barrier {
static void run_workqueue(struct cpu_workqueue_struct *cwq)
{
struct plist_head *worklist = &cwq->worklist;
+ struct pi_source pi_src;
+ int prio;
+
+ pi_source_init(&pi_src, &prio);

spin_lock_irq(&cwq->lock);
cwq->run_depth++;
@@ -292,10 +301,10 @@ static void run_workqueue(struct cpu_workqueue_struct *cwq)

again:
while (!plist_head_empty(worklist)) {
- int prio;
struct work_struct *work = plist_first_entry(worklist,
struct work_struct, entry);
work_func_t f = work->func;
+
#ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the struct work_struct
@@ -316,14 +325,28 @@ again:
}
prio = max(prio, 0);

- if (likely(cwq->thread->prio != prio))
- task_setprio(cwq->thread, prio);
-
cwq->current_work = work;
plist_del(&work->entry, worklist);
plist_node_init(&work->entry, MAX_PRIO);
spin_unlock_irq(&cwq->lock);

+ /*
+ * The owner is free to reuse the work object once we execute
+ * the work->func() below. Therefore we cannot leave the
+ * work->pi_src boosting our thread or it may get stomped
+ * on when the work item is requeued.
+ *
+ * So what we do is boost ourselves with an on-the-stack
+ * copy of the priority of the work item, and then
+ * deboost the work item. Once the work is complete, we
+ * can then simply deboost the stack version.
+ *
+ * Note that this will not typically cause a pi-chain
+ * update since we are boosting the node laterally
+ */
+ task_pi_boost(current, &pi_src, PI_FLAG_DEFER_UPDATE);
+ task_pi_deboost(current, &work->pi_src, PI_FLAG_DEFER_UPDATE);
+
BUG_ON(get_wq_data(work) != cwq);
work_clear_pending(work);
leak_check(NULL);
@@ -334,6 +357,9 @@ again:
lock_release(&cwq->wq->lockdep_map, 1, _THIS_IP_);
leak_check(f);

+ /* Deboost the stack copy of the work->prio (see above) */
+ task_pi_deboost(current, &pi_src, 0);
+
spin_lock_irq(&cwq->lock);
cwq->current_work = NULL;
wake_up_all(&cwq->work_done);
@@ -357,7 +383,6 @@ again:
goto again;
}

- task_setprio(cwq->thread, current->normal_prio);
cwq->run_depth--;
spin_unlock_irq(&cwq->lock);
}

2008-08-15 12:21:13

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v2 5/8] RT: wrap the rt_rwlock "add reader" logic

We will use this later in the series to add PI functions on "add".
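
To make the intent concrete, here is a hedged sketch of the kind of hook
this wrapper is meant to host later in the series; the
rt_rwlock_pi_boost_reader() helper named below is hypothetical and not
part of this patch:

static inline void
rt_rwlock_add_reader(struct reader_lock_struct *rls, struct rw_mutex *rwm)
{
	list_add(&rls->list, &rwm->readers);

	/*
	 * Hypothetical follow-on (see the later rtmutex/libpi patches):
	 * once the reader is on the list, a pi_source could be registered
	 * for it so that blocked writers can boost this reader through
	 * the libpi chain, e.g.:
	 *
	 *	rt_rwlock_pi_boost_reader(rwm, rls);
	 */
}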

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/rtmutex.c | 16 +++++++++++-----
1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 12de859..62fdc3d 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -1122,6 +1122,12 @@ static void rw_check_held(struct rw_mutex *rwm)
# define rw_check_held(rwm) do { } while (0)
#endif

+static inline void
+rt_rwlock_add_reader(struct reader_lock_struct *rls, struct rw_mutex *rwm)
+{
+ list_add(&rls->list, &rwm->readers);
+}
+
/*
* The fast path does not add itself to the reader list to keep
* from needing to grab the spinlock. We need to add the owner
@@ -1163,7 +1169,7 @@ rt_rwlock_update_owner(struct rw_mutex *rwm, struct task_struct *own)
if (rls->list.prev && !list_empty(&rls->list))
return;

- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);

/* change to reader, so no one else updates too */
rt_rwlock_set_owner(rwm, RT_RW_READER, RT_RWLOCK_CHECK);
@@ -1197,7 +1203,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
* it hasn't been added to the link list yet.
*/
if (!rls->list.prev || list_empty(&rls->list))
- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
rt_rwlock_set_owner(rwm, RT_RW_READER, 0);
rls->count++;
incr = 0;
@@ -1276,7 +1282,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
rls->lock = rwm;
rls->count = 1;
WARN_ON(rls->list.prev && !list_empty(&rls->list));
- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
} else
WARN_ON_ONCE(1);
spin_unlock(&current->pi_lock);
@@ -1473,7 +1479,7 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
spin_lock(&mutex->wait_lock);
rls = &current->owned_read_locks[reader_count];
if (!rls->list.prev || list_empty(&rls->list))
- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
spin_unlock(&mutex->wait_lock);
} else
spin_unlock(&current->pi_lock);
@@ -2083,7 +2089,7 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)

/* Set us up for multiple readers or conflicts */

- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
rwm->owner = RT_RW_READER;

/*

2008-08-15 12:21:32

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v2 0/8] Priority Inheritance enhancements

** RFC for PREEMPT_RT branch, 26-rt1 **

Synopsis: We gain a 13%+ IO improvement in the PREEMPT_RT kernel by
re-working some of the PI logic.

[
pi-enhancements v2

Changes since v1:

*) Added proper reference counting to prevent tasks from
being deleted while a node->update() is still in flight
*) unified the RCU boost path
]


[
fyi -> you can find this series at the following URLs in
addition to this thread:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/linux-2.6-hacks.git;a=shortlog;h=pi-rework

ftp://ftp.novell.com/dev/ghaskins/pi-rework-v2.tar.bz2

]

Hi All,

The following series applies to 26-rt1 as a request-for-comment on a
new approach to priority-inheritance (PI), as well as some performance
enhancements to take advantage of those new approaches. This yields at
least a 13-15% improvement for diskio on my 4-way x86_64 system. An
8-way system saw as much as 700% improvement during early testing, but
I have not recently reconfirmed this number.

Motivation for series:

I have several ideas on things we can do to enhance and improve kernel
performance with respect to PREEMPT_RT

1) For instance, it would be nice to support priority queuing and
(at least positional) inheritance in the wait-queue infrastructure.

2) Reducing overhead in the real-time locks (sleepable replacements for
spinlock_t in PREEMPT_RT) to try to approach the minimal overhead
of their non-rt equivalents. We have determined via instrumentation
that one area of major overhead is the pi-boost logic.

However, today the PI code is entwined in the rtmutex infrastructure,
yet we require more flexibility if we want to address (1) and (2)
above. Therefore the first step is to separate the PI code away from
rtmutex into its own library (libpi). This is covered in patches 1-7.

(I realize patch #7 is a little hard to review since I removed and added
a lot of code, which the unified diff mashes together... I will try
to find a way to make this more readable).

Patch 8 is the first real consumer of the libpi logic to try to enhance
performance. It accomplishes this by deferring pi-boosting a lock
owner unless it is absolutely necessary. Since instrumentation
shows that the majority of locks are acquired either via the fast-path,
or via the adaptive-spin path, we can eliminate most of the pi-overhead
with this technique. This yields a measurable performance gain (at least
13% for workloads with heavy lock contention was observed in our lab).
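
For readers skimming the cover letter, the deferral in patch 8 amounts to
something like the following conceptual sketch (the helper names here are
placeholders for illustration, not the functions the patch actually adds):

	/* slow path of a contended rt lock, conceptually */
	while (!try_to_take_lock(lock)) {
		if (owner_still_running(lock))
			continue;	/* adaptive-spin: no pi activity */
		/*
		 * Only now, when we are really going to sleep, do we
		 * pi-boost the lock owner.  Fast-path and adaptive-spin
		 * acquisitions therefore never touch the pi code at all.
		 */
		pi_boost_owner(lock);
		schedule();
	}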

We have not yet completed the work on the pi-waitqueues or any of the other
related pi enhancements. Those will be coming in a follow-on announcement.

Feedback/comments welcome!

Regards,
-Greg


---

Gregory Haskins (8):
rtmutex: pi-boost locks as late as possible
rtmutex: convert rtmutexes to fully use the PI library
rtmutex: use runtime init for rtmutexes
RT: wrap the rt_rwlock "add reader" logic
rtmutex: formally initialize the rt_mutex_waiters
sched: rework task reference counting to work with the pi infrastructure
sched: add the basic PI infrastructure to the task_struct
add generalized priority-inheritance interface


Documentation/libpi.txt | 59 ++
include/linux/pi.h | 278 +++++++++++
include/linux/rt_lock.h | 2
include/linux/rtmutex.h | 18 -
include/linux/sched.h | 57 +-
include/linux/workqueue.h | 2
kernel/fork.c | 35 +
kernel/rcupreempt-boost.c | 25 -
kernel/rtmutex-debug.c | 4
kernel/rtmutex-tester.c | 4
kernel/rtmutex.c | 1091 ++++++++++++++++++---------------------------
kernel/rtmutex_common.h | 19 -
kernel/rwlock_torture.c | 32 -
kernel/sched.c | 209 ++++++---
kernel/workqueue.c | 39 +-
lib/Makefile | 3
lib/pi.c | 516 +++++++++++++++++++++
17 files changed, 1543 insertions(+), 850 deletions(-)
create mode 100644 Documentation/libpi.txt
create mode 100644 include/linux/pi.h
create mode 100644 lib/pi.c


2008-08-15 12:20:51

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v2 3/8] sched: rework task reference counting to work with the pi infrastructure
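
[ Since this patch has no changelog text, here is a rough sketch of the
  teardown flow it creates, pieced together from the hunks below and the
  libpi free semantics elsewhere in this thread; treat the exact callback
  ordering as a reviewer's reading of the code, not as authoritative: ]

/*
 * put_task_struct(t)                       last task reference dropped
 *   -> pi_dropref(&t->pi.node, 0)          node refcount hits zero
 *      -> the node performs an implicit del_sink() on its remaining
 *         sinks, including the task's own pi.snk from task_pi_init()
 *         -> task_pi_free_cb()             the .free op of the task sink
 *            -> call_rcu(&p->pi.rcu, task_pi_free_rcu)
 *               -> prepare_free_task(tsk)  old __put_task_struct() body
 *                  -> free_task(tsk)
 *
 * On PREEMPT_RT, put_task_struct() defers through call_rcu() and
 * __put_task_struct_cb() before reaching the pi_dropref() step.
 */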

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/sched.h | 5 +++--
kernel/fork.c | 32 +++++++++++++++-----------------
kernel/sched.c | 23 +++++++++++++++++++++++
3 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63ddd1f..9132b42 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1305,6 +1305,8 @@ struct task_struct {
struct pi_node node;
struct pi_sink snk; /* registered to 'this' to get updates */
int prio;
+ struct rcu_head rcu; /* for destruction cleanup */
+
} pi;

#ifdef CONFIG_RT_MUTEXES
@@ -1633,12 +1635,11 @@ static inline void put_task_struct(struct task_struct *t)
call_rcu(&t->rcu, __put_task_struct_cb);
}
#else
-extern void __put_task_struct(struct task_struct *t);

static inline void put_task_struct(struct task_struct *t)
{
if (atomic_dec_and_test(&t->usage))
- __put_task_struct(t);
+ pi_dropref(&t->pi.node, 0);
}
#endif

diff --git a/kernel/fork.c b/kernel/fork.c
index 399a0d0..399a0a9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -130,39 +130,37 @@ void free_task(struct task_struct *tsk)
}
EXPORT_SYMBOL(free_task);

-#ifdef CONFIG_PREEMPT_RT
-void __put_task_struct_cb(struct rcu_head *rhp)
+void prepare_free_task(struct task_struct *tsk)
{
- struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
-
BUG_ON(atomic_read(&tsk->usage));
- WARN_ON(!tsk->exit_state);
WARN_ON(tsk == current);

+#ifdef CONFIG_PREEMPT_RT
+ WARN_ON(!tsk->exit_state);
+#else
+ WARN_ON(!(tsk->exit_state & (EXIT_DEAD | EXIT_ZOMBIE)));
+#endif
+
security_task_free(tsk);
free_uid(tsk->user);
put_group_info(tsk->group_info);
+
+#ifdef CONFIG_PREEMPT_RT
delayacct_tsk_free(tsk);
+#endif

if (!profile_handoff_task(tsk))
free_task(tsk);
}

-#else
-
-void __put_task_struct(struct task_struct *tsk)
+#ifdef CONFIG_PREEMPT_RT
+void __put_task_struct_cb(struct rcu_head *rhp)
{
- WARN_ON(!(tsk->exit_state & (EXIT_DEAD | EXIT_ZOMBIE)));
- BUG_ON(atomic_read(&tsk->usage));
- WARN_ON(tsk == current);
-
- security_task_free(tsk);
- free_uid(tsk->user);
- put_group_info(tsk->group_info);
+ struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

- if (!profile_handoff_task(tsk))
- free_task(tsk);
+ pi_dropref(&tsk->pi.node, 0);
}
+
#endif

/*
diff --git a/kernel/sched.c b/kernel/sched.c
index c129b10..729139d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2370,11 +2370,34 @@ task_pi_boost_cb(struct pi_sink *snk, struct pi_source *src,
return 0;
}

+extern void prepare_free_task(struct task_struct *tsk);
+
+static void task_pi_free_rcu(struct rcu_head *rhp)
+{
+ struct task_struct *tsk = container_of(rhp, struct task_struct, pi.rcu);
+
+ prepare_free_task(tsk);
+}
+
+/*
+ * This function is invoked whenever the last references to a task have
+ * been dropped, and we should free the memory on the next rcu grace period
+ */
+static int task_pi_free_cb(struct pi_sink *snk, unsigned int flags)
+{
+ struct task_struct *p = container_of(snk, struct task_struct, pi.snk);
+
+ call_rcu(&p->pi.rcu, task_pi_free_rcu);
+
+ return 0;
+}
+
static int task_pi_update_cb(struct pi_sink *snk, unsigned int flags);

static struct pi_sink task_pi_sink = {
.boost = task_pi_boost_cb,
.update = task_pi_update_cb,
+ .free = task_pi_free_cb,
};

static inline void

2008-08-15 12:29:05

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v2 6/8] rtmutex: use runtime init for rtmutexes

The system already has facilities to perform late/run-time init for
rtmutexes. We want to add more advanced initialization later in the
series so we force all rtmutexes through the init path in preparation
for the later patches.
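
For reference, the runtime path that now carries the full initialization
is __rt_mutex_init() (abridged sketch based on the hunk shown elsewhere in
this thread; init_pi() only appears once the later libpi conversion is
applied):

void __rt_mutex_init(struct rt_mutex *lock, const char *name)
{
	spin_lock_init(&lock->wait_lock);
	plist_head_init(&lock->wait_list, &lock->wait_lock);

	init_pi(lock);		/* added by the later libpi conversion */

	debug_rt_mutex_init(lock, name);
}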

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/rtmutex.h | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index b263bac..14774ce 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -64,8 +64,6 @@ struct hrtimer_sleeper;

#define __RT_MUTEX_INITIALIZER(mutexname) \
{ .wait_lock = RAW_SPIN_LOCK_UNLOCKED(mutexname) \
- , .wait_list = PLIST_HEAD_INIT(mutexname.wait_list, &mutexname.wait_lock) \
- , .owner = NULL \
__DEBUG_RT_MUTEX_INITIALIZER(mutexname)}

#define DEFINE_RT_MUTEX(mutexname) \

2008-08-15 13:19:18

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v3] add generalized priority-inheritance interface

[
of course, 2 seconds after I hit "send" on v2 I realized there was a
race condition in libpi w.r.t. the sinkref->prio. Rather than spam
you guys with a full "v3" refresh of the series, here is a fixed
version of patch 1/8 which constitutes "v3" when used with patches
2-8 from v2.
]

The kernel currently addresses priority-inversion through priority-
inheritance. However, all of the priority-inheritance logic is
integrated into the Real-Time Mutex infrastructure. This causes a few
problems:

1) This tightly coupled relationship makes it difficult to extend to
other areas of the kernel (for instance, pi-aware wait-queues may
be desirable).
2) Enhancing the rtmutex infrastructure becomes challenging because
there is no separation between the locking code and the pi-code.

This patch aims to rectify these shortcomings by designing a stand-alone
pi framework which can then be used to replace the rtmutex-specific
version. The goal of this framework is to provide similar functionality
to the existing subsystem, but with sole focus on PI and the
relationships between objects that can boost priority, and the objects
that get boosted.

We introduce the concepts of a "pi_source" and a "pi_sink", which, as the
names suggest, provide the basic relationship of a priority source and
its boosted target. A pi_source acts as a reference to some arbitrary
source of priority, and a pi_sink can be boosted (or deboosted) by
a pi_source. For more details, please read the library documentation.

There are currently no users of this interface.
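
As a quick orientation for reviewers, here is a minimal, hypothetical
consumer of the interface (against the v3 "snk" naming used below; the
demo_* names are made up and not part of the patch): a leaf sink that
simply records the node's current priority as it is pushed down.

#include <linux/kernel.h>	/* container_of() */
#include <linux/pi.h>

/* hypothetical leaf sink: remembers the last priority pushed at it */
struct demo_sink {
	struct pi_sink snk;
	int prio;
};

static int demo_boost(struct pi_sink *snk, struct pi_source *src,
		      unsigned int flags)
{
	struct demo_sink *d = container_of(snk, struct demo_sink, snk);

	d->prio = *src->prio;	/* sources are references, so dereference */
	return 0;
}

static int demo_update(struct pi_sink *snk, unsigned int flags)
{
	/* act on the new priority here (requeue, reschedule, ...) */
	return 0;
}

static struct demo_sink demo = {
	.snk = { .boost = demo_boost, .update = demo_update },
};

static struct pi_node node;
static int my_prio = 50;		/* arbitrary example priority */
static struct pi_source my_src;

static void demo_usage(void)
{
	pi_node_init(&node);

	/* hang the leaf sink off the node */
	pi_add_sink(&node, &demo.snk, 0);

	/*
	 * Contribute a priority to the node; the node propagates it to
	 * its sinks via their ->boost()/->update() ops on the update.
	 */
	pi_source_init(&my_src, &my_prio);
	pi_boost(&node, &my_src, 0);

	/* ... and later withdraw the contribution again */
	pi_deboost(&node, &my_src, 0);
	pi_update(&node, 0);
}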

Signed-off-by: Gregory Haskins <[email protected]>
---

Documentation/libpi.txt | 59 +++++
include/linux/pi.h | 277 ++++++++++++++++++++++++++
lib/Makefile | 3
lib/pi.c | 509 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 847 insertions(+), 1 deletions(-)
create mode 100644 Documentation/libpi.txt
create mode 100644 include/linux/pi.h
create mode 100644 lib/pi.c

diff --git a/Documentation/libpi.txt b/Documentation/libpi.txt
new file mode 100644
index 0000000..197b21a
--- /dev/null
+++ b/Documentation/libpi.txt
@@ -0,0 +1,59 @@
+lib/pi.c - Priority Inheritance library
+
+Sources and sinks:
+------------
+
+This library introduces the basic concepts of a "pi_source" and a "pi_sink", which, as the names suggest, provide the basic relationship of a priority source and its boosted target.
+
+A pi_source is simply a reference to some arbitrary priority value that may range from 0 (highest prio) to MAX_PRIO (currently 140, lowest prio). A pi_source calls pi_sink.boost() whenever it wishes to boost the sink to (at least minimally) the priority value that the source represents. It uses pi_sink.boost() both for the initial boosting and for any subsequent refreshes to the value (even if the value is decreasing in logical priority). The policy of the sink will dictate what happens as a result of that boost. Likewise, a pi_source calls pi_sink.deboost() to stop contributing to the sink's minimum priority.
+
+It is important to note that a source is a reference to a priority value, not a value itself. This is one of the concepts that allows the interface to be idempotent, which is important for updating a chain of sources and sinks in the proper order. If we passed the priority by value, the order in which the chain executes could allow the value that is ultimately set to race.
+
+Nodes:
+
+A pi_node is a convenience object which is simultaneously a source and a sink. As its name suggests, it would typically be deployed as a node in a pi-chain. Other pi_sources can boost a node via its pi_sink.boost() interface. Likewise, a node can boost a fixed number of sinks via the node.add_sink() interface.
+
+Generally speaking, a node takes care of many common operations associated with being a “link in the chain”, such as:
+
+ 1) determining the current priority of the node based on the (logically) highest priority source that is boosting the node.
+ 2) boosting/deboosting upstream sinks whenever the node locally changes priority.
+ 3) taking care to avoid deadlock during a chain update.
+
+Design details:
+
+Destruction:
+
+The pi-library objects are designed to be implicitly-destructable (meaning they do not require an explicit “free()” operation when they are not used anymore). This is important considering their intended use (spinlock_t's which are also implicitly-destructable). As such, any allocations needed for operation must come from internal structure storage as there will be no opportunity to free it later.
+
+Multiple sinks per Node:
+
+We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation, which had the notion of only a single sink (i.e. “task->pi_blocked_on”). The reason why we added the ability to add more than one sink was not to change the default chaining model (i.e. multiple boost targets), but rather to add flexible notification mechanisms that are peripheral to the chain, which are informally called “leaf sinks”.
+
+Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints to a priority boosting. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation-specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary, whereas an rwlock leaf-sink may boost a list of reader-owners.
+
+The following diagram depicts an example relationship (warning: cheesy ascii art)
+
+ --------- ---------
+ | leaf | | leaf |
+ --------- ---------
+ / /
+ --------- / ---------- / --------- ---------
+ ->-| node |->---| node |-->---| node |->---| leaf |
+ --------- ---------- --------- ---------
+
+The reason why this was done was to unify the notion of a “sink” to a single interface, rather than having something like task->pi_blocked_on and a separate callback for the leaf action. Instead, any downstream object can be represented by a sink, and the implementation details are hidden (e.g. I'm a task, I'm a lock, I'm a node, I'm a work-item, I'm a wait-queue, etc.).
+
+Sinkrefs:
+
+Each pi_sink.boost() operation is represented by a unique pi_source to properly facilitate a one node to many source relationship. Therefore, if a pi_node is to act as an aggregator to multiple sinks, it implicitly must have one internal pi_source object for every sink that is added (via node.add_sink()). This pi_source object has to be internally managed for the lifetime of the sink reference.
+
+Recall that due to the implicit-destruction requirement above, and the fact that we will typically be executing in a preempt-disabled region, we have to be very careful about how we allocate references to those sinks. More on that next. But, long story short, we limit the number of sinks to MAX_PI_DEPENDENCIES (currently 5).
+
+Locking:
+
+(work in progress....)
+
+
+
+
+
diff --git a/include/linux/pi.h b/include/linux/pi.h
new file mode 100644
index 0000000..ed1fcf0
--- /dev/null
+++ b/include/linux/pi.h
@@ -0,0 +1,277 @@
+/*
+ * see Documentation/libpi.txt for details
+ */
+
+#ifndef _LINUX_PI_H
+#define _LINUX_PI_H
+
+#include <linux/list.h>
+#include <linux/plist.h>
+#include <asm/atomic.h>
+
+#define MAX_PI_DEPENDENCIES 5
+
+struct pi_source {
+ struct plist_node list;
+ int *prio;
+ int boosted;
+};
+
+
+#define PI_FLAG_DEFER_UPDATE (1 << 0)
+#define PI_FLAG_ALREADY_BOOSTED (1 << 1)
+#define PI_FLAG_NO_DROPREF (1 << 2)
+
+struct pi_sink {
+ atomic_t refs;
+ int (*boost)(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags);
+ int (*deboost)(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags);
+ int (*update)(struct pi_sink *snk,
+ unsigned int flags);
+ int (*free)(struct pi_sink *snk,
+ unsigned int flags);
+};
+
+enum pi_state {
+ pi_state_boost,
+ pi_state_boosted,
+ pi_state_deboost,
+ pi_state_free,
+};
+
+/*
+ * NOTE: PI must always use a true (e.g. raw) spinlock, since it is used by
+ * rtmutex infrastructure.
+ */
+
+struct pi_sinkref {
+ raw_spinlock_t lock;
+ struct list_head list;
+ enum pi_state state;
+ struct pi_sink *snk;
+ struct pi_source src;
+ atomic_t refs;
+};
+
+struct pi_sinkref_pool {
+ struct list_head free;
+ struct pi_sinkref data[MAX_PI_DEPENDENCIES];
+ int count;
+};
+
+struct pi_node {
+ raw_spinlock_t lock;
+ int prio;
+ struct pi_sink snk;
+ struct pi_sinkref_pool sinkref_pool;
+ struct list_head snks;
+ struct plist_head srcs;
+};
+
+/**
+ * pi_node_init - initialize a pi_node before use
+ * @node: a node context
+ */
+extern void pi_node_init(struct pi_node *node);
+
+/**
+ * pi_add_sink - add a sink as a downstream object
+ * @node: the node context
+ * @snk: the sink context to add to the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ * PI_FLAG_ALREADY_BOOSTED - Do not perform initial boosting
+ *
+ * This function registers a sink to get notified whenever the
+ * node changes priority.
+ *
+ * Note: By default, this function will schedule the newly added sink
+ * to get an initial boost notification on the next update (even
+ * without the presence of a priority transition). However, if the
+ * ALREADY_BOOSTED flag is specified, the sink is initially marked as
+ * BOOSTED and will only get notified if the node changes priority
+ * in the future.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+extern int pi_add_sink(struct pi_node *node, struct pi_sink *snk,
+ unsigned int flags);
+
+/**
+ * pi_del_sink - del a sink from the current downstream objects
+ * @node: the node context
+ * @snk: the sink context to delete from the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function unregisters a sink from the node.
+ *
+ * Note: The sink will not actually become fully deboosted until
+ * a call to node.update() successfully returns.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+extern int pi_del_sink(struct pi_node *node, struct pi_sink *snk,
+ unsigned int flags);
+
+/**
+ * pi_source_init - initialize a pi_source before use
+ * @src: a src context
+ * @prio: pointer to a priority value
+ *
+ * A pointer to a priority value is used so that boost and update
+ * are fully idempotent.
+ */
+static inline void
+pi_source_init(struct pi_source *src, int *prio)
+{
+ plist_node_init(&src->list, *prio);
+ src->prio = prio;
+ src->boosted = 0;
+}
+
+/**
+ * pi_boost - boost a node with a pi_source
+ * @node: the node context
+ * @src: the src context to boost the node with
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function registers a priority source with the node, possibly
+ * boosting its value if the new source is the highest registered source.
+ *
+ * This function is used to both initially register a source, as well as
+ * to notify the node if the value changes in the future (even if the
+ * priority is decreasing).
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_boost(struct pi_node *node, struct pi_source *src, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ if (snk->boost)
+ return snk->boost(snk, src, flags);
+
+ return 0;
+}
+
+/**
+ * pi_deboost - deboost a pi_source from a node
+ * @node: the node context
+ * @src: the src context to deboost from the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function unregisters a priority source from the node, possibly
+ * deboosting its value if the departing source was the highest
+ * registered source.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_deboost(struct pi_node *node, struct pi_source *src, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ if (snk->deboost)
+ return snk->deboost(snk, src, flags);
+
+ return 0;
+}
+
+/**
+ * pi_update - force a manual chain update
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * This function will push any priority changes (as a result of
+ * boost/deboost or add_sink/del_sink) down through the chain.
+ * If no changes are necessary, this function is a no-op.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_update(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ if (snk->update)
+ return snk->update(snk, flags);
+
+ return 0;
+}
+
+/**
+ * pi_sink_dropref - down the reference count, freeing the sink if 0
+ * @snk: the sink context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * Returns: none
+ */
+static inline void
+pi_sink_dropref(struct pi_sink *snk, unsigned int flags)
+{
+ if (atomic_dec_and_test(&snk->refs)) {
+ if (snk->free)
+ snk->free(snk, flags);
+ }
+}
+
+
+/**
+ * pi_addref - up the reference count
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * Returns: none
+ */
+static inline void
+pi_addref(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ atomic_inc(&snk->refs);
+}
+
+/**
+ * pi_dropref - down the reference count, freeing the node if 0
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * Returns: none
+ */
+static inline void
+pi_dropref(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *snk = &node->snk;
+
+ pi_sink_dropref(snk, flags);
+}
+
+#endif /* _LINUX_PI_H */
diff --git a/lib/Makefile b/lib/Makefile
index 5187924..df81ad7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -23,7 +23,8 @@ lib-$(CONFIG_SMP) += cpumask.o
lib-y += kobject.o kref.o klist.o

obj-y += div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
- bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o
+ bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \
+ pi.o

ifeq ($(CONFIG_DEBUG_KOBJECT),y)
CFLAGS_kobject.o += -DDEBUG
diff --git a/lib/pi.c b/lib/pi.c
new file mode 100644
index 0000000..74e4dad
--- /dev/null
+++ b/lib/pi.c
@@ -0,0 +1,509 @@
+/*
+ * lib/pi.c
+ *
+ * Priority-Inheritance library
+ *
+ * Copyright (C) 2008 Novell
+ *
+ * Author: Gregory Haskins <[email protected]>
+ *
+ * This code provides a generic framework for preventing priority
+ * inversion by means of priority-inheritance. (see Documentation/libpi.txt
+ * for details)
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <linux/sched.h>
+#include <linux/module.h>
+#include <linux/pi.h>
+
+/*
+ *-----------------------------------------------------------
+ * pi_sinkref_pool
+ *-----------------------------------------------------------
+ */
+
+static void
+pi_sinkref_pool_init(struct pi_sinkref_pool *pool)
+{
+ int i;
+
+ INIT_LIST_HEAD(&pool->free);
+ pool->count = 0;
+
+ for (i = 0; i < MAX_PI_DEPENDENCIES; ++i) {
+ struct pi_sinkref *sinkref = &pool->data[i];
+
+ memset(sinkref, 0, sizeof(*sinkref));
+ INIT_LIST_HEAD(&sinkref->list);
+ list_add_tail(&sinkref->list, &pool->free);
+ pool->count++;
+ }
+}
+
+static struct pi_sinkref *
+pi_sinkref_alloc(struct pi_sinkref_pool *pool)
+{
+ struct pi_sinkref *sinkref;
+
+ BUG_ON(!pool->count);
+
+ if (list_empty(&pool->free))
+ return NULL;
+
+ sinkref = list_first_entry(&pool->free, struct pi_sinkref, list);
+ list_del(&sinkref->list);
+ memset(sinkref, 0, sizeof(*sinkref));
+ pool->count--;
+
+ return sinkref;
+}
+
+static void
+pi_sinkref_free(struct pi_sinkref_pool *pool,
+ struct pi_sinkref *sinkref)
+{
+ list_add_tail(&sinkref->list, &pool->free);
+ pool->count++;
+}
+
+/*
+ *-----------------------------------------------------------
+ * pi_sinkref
+ *-----------------------------------------------------------
+ */
+
+static inline void
+_pi_sink_addref(struct pi_sinkref *sinkref)
+{
+ atomic_inc(&sinkref->snk->refs);
+ atomic_inc(&sinkref->refs);
+}
+
+static inline void
+_pi_sink_dropref_local(struct pi_node *node, struct pi_sinkref *sinkref)
+{
+ if (atomic_dec_and_lock(&sinkref->refs, &node->lock)) {
+ list_del(&sinkref->list);
+ pi_sinkref_free(&node->sinkref_pool, sinkref);
+ spin_unlock(&node->lock);
+ }
+}
+
+static inline void
+_pi_sink_dropref_all(struct pi_node *node, struct pi_sinkref *sinkref)
+{
+ struct pi_sink *snk = sinkref->snk;
+
+ _pi_sink_dropref_local(node, sinkref);
+ pi_sink_dropref(snk, 0);
+}
+
+/*
+ *-----------------------------------------------------------
+ * pi_node
+ *-----------------------------------------------------------
+ */
+
+static struct pi_node *node_of(struct pi_sink *snk)
+{
+ return container_of(snk, struct pi_node, snk);
+}
+
+static inline void
+__pi_boost(struct pi_node *node, struct pi_source *src)
+{
+ BUG_ON(src->boosted);
+
+ plist_node_init(&src->list, *src->prio);
+ plist_add(&src->list, &node->srcs);
+ src->boosted = 1;
+}
+
+static inline void
+__pi_deboost(struct pi_node *node, struct pi_source *src)
+{
+ BUG_ON(!src->boosted);
+
+ plist_del(&src->list, &node->srcs);
+ src->boosted = 0;
+}
+
+/*
+ * _pi_node_update - update the chain
+ *
+ * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
+ * that need to propagate up the chain. This is a step-wise process where we
+ * have to be careful about locking and preemption. By trying MAX_PI_DEPs
+ * times, we guarantee that this update routine is an effective barrier...
+ * all modifications made prior to the call to this barrier will have completed.
+ *
+ * Deadlock avoidance: This node may participate in a chain of nodes which
+ * form a graph of arbitrary structure. While the graph should technically
+ * never close on itself barring any bugs, we still want to protect against
+ * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
+ * from detecting this potential). To do this, we employ a dual-locking
+ * scheme where we can carefully control the order. That is: node->lock
+ * protects most of the node's internal state, but it will never be held
+ * across a chain update. sinkref->lock, on the other hand, can be held
+ * across a boost/deboost, and also guarantees proper execution order. Also
+ * note that no locks are held across an snk->update.
+ */
+static int
+_pi_node_update(struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ struct pi_sinkref *sinkref;
+ unsigned long iflags;
+ int count = 0;
+ int i;
+ int pprio;
+
+ struct updater {
+ int update;
+ struct pi_sinkref *sinkref;
+ struct pi_sink *snk;
+ } updaters[MAX_PI_DEPENDENCIES];
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ pprio = node->prio;
+
+ if (!plist_head_empty(&node->srcs))
+ node->prio = plist_first(&node->srcs)->prio;
+ else
+ node->prio = MAX_PRIO;
+
+ list_for_each_entry(sinkref, &node->snks, list) {
+ /*
+ * If the priority is changing, or if this is a
+ * BOOST/DEBOOST, we consider this sink "stale"
+ */
+ if (pprio != node->prio
+ || sinkref->state != pi_state_boosted) {
+ struct updater *iter = &updaters[count++];
+
+ BUG_ON(!atomic_read(&sinkref->snk->refs));
+ _pi_sink_addref(sinkref);
+
+ iter->update = 1;
+ iter->sinkref = sinkref;
+ iter->snk = sinkref->snk;
+ }
+ }
+
+ spin_unlock(&node->lock);
+
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+ unsigned int lflags = PI_FLAG_DEFER_UPDATE;
+ struct pi_sink *snk;
+
+ sinkref = iter->sinkref;
+ snk = iter->snk;
+
+ spin_lock(&sinkref->lock);
+
+ switch (sinkref->state) {
+ case pi_state_boost:
+ sinkref->state = pi_state_boosted;
+ /* Fall through */
+ case pi_state_boosted:
+ snk->boost(snk, &sinkref->src, lflags);
+ break;
+ case pi_state_deboost:
+ snk->deboost(snk, &sinkref->src, lflags);
+ sinkref->state = pi_state_free;
+
+ /*
+ * drop the ref that we took when the sinkref
+ * was allocated. We still hold a ref from
+ * the above.
+ */
+ _pi_sink_dropref_all(node, sinkref);
+ break;
+ case pi_state_free:
+ iter->update = 0;
+ break;
+ default:
+ panic("illegal sinkref type: %d", sinkref->state);
+ }
+
+ spin_unlock(&sinkref->lock);
+
+ /*
+ * We will drop the sinkref reference while still holding the
+ * preempt/irqs off so that the memory is returned synchronously
+ * to the system.
+ */
+ _pi_sink_dropref_local(node, sinkref);
+
+ /*
+ * The sinkref is no longer valid since we dropped the reference
+ * above, so symbolically drop it here too to make it more
+ * obvious if we try to use it later
+ */
+ iter->sinkref = NULL;
+ }
+
+ local_irq_restore(iflags);
+
+ /*
+ * Note: At this point, sinkref is invalid since we dropref'd
+ * it above, but snk is valid since we still hold the remote
+ * reference. This is key to the design because it allows us
+ * to synchronously free the sinkref object, yet maintain a
+ * reference to the sink across the update
+ */
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+
+ if (iter->update)
+ iter->snk->update(iter->snk, 0);
+ }
+
+ /*
+ * We perform all the free operations together at the end, using
+ * only automatic/stack variables since any one of these operations
+ * could result in our node object being deallocated
+ */
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+
+ pi_sink_dropref(iter->snk, 0);
+ }
+
+ return 0;
+}
+
+static void
+_pi_del_sinkref(struct pi_node *node, struct pi_sinkref *sinkref)
+{
+ struct pi_sink *snk = sinkref->snk;
+ int remove = 0;
+ unsigned long iflags;
+
+ local_irq_save(iflags);
+ spin_lock(&sinkref->lock);
+
+ switch (sinkref->state) {
+ case pi_state_boost:
+ /*
+ * This state indicates the sink was never formally
+ * boosted so we can just delete it immediately
+ */
+ remove = 1;
+ break;
+ case pi_state_boosted:
+ if (snk->deboost)
+ /*
+ * If the sink supports deboost notification,
+ * schedule it for deboost at the next update
+ */
+ sinkref->state = pi_state_deboost;
+ else
+ /*
+ * ..otherwise schedule it for immediate
+ * removal
+ */
+ remove = 1;
+ break;
+ default:
+ break;
+ }
+
+ if (remove) {
+ /*
+ * drop the ref that we took when the sinkref
+ * was allocated. We still hold a ref from
+ * when the caller performed the lookup
+ */
+ _pi_sink_dropref_all(node, sinkref);
+ sinkref->state = pi_state_free;
+ }
+
+ spin_unlock(&sinkref->lock);
+
+ _pi_sink_dropref_local(node, sinkref);
+ local_irq_restore(iflags);
+
+ pi_sink_dropref(snk, 0);
+}
+
+static int
+_pi_node_boost(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+ if (src->boosted)
+ __pi_deboost(node, src);
+ __pi_boost(node, src);
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(snk, 0);
+
+ return 0;
+}
+
+static int
+_pi_node_deboost(struct pi_sink *snk, struct pi_source *src,
+ unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+ __pi_deboost(node, src);
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(snk, 0);
+
+ return 0;
+}
+
+static int
+_pi_node_free(struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_node *node = node_of(snk);
+ struct pi_sinkref *sinkref;
+ struct pi_sinkref *sinkrefs[MAX_PI_DEPENDENCIES];
+ unsigned long iflags;
+ int count = 0;
+ int i;
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ /*
+ * When the node is freed, we should perform an implicit
+ * del_sink on any remaining sinks we may have
+ */
+ list_for_each_entry(sinkref, &node->snks, list) {
+ _pi_sink_addref(sinkref);
+ sinkrefs[count++] = sinkref;
+ }
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ for (i = 0; i < count; ++i)
+ _pi_del_sinkref(node, sinkrefs[i]);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->snk, 0);
+
+ return 0;
+}
+
+static struct pi_sink pi_node_snk = {
+ .boost = _pi_node_boost,
+ .deboost = _pi_node_deboost,
+ .update = _pi_node_update,
+ .free = _pi_node_free,
+};
+
+void pi_node_init(struct pi_node *node)
+{
+ spin_lock_init(&node->lock);
+ node->prio = MAX_PRIO;
+ node->snk = pi_node_snk;
+ pi_sinkref_pool_init(&node->sinkref_pool);
+ INIT_LIST_HEAD(&node->snks);
+ plist_head_init(&node->srcs, &node->lock);
+ atomic_set(&node->snk.refs, 1);
+}
+
+int pi_add_sink(struct pi_node *node, struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_sinkref *sinkref;
+ int ret = 0;
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ if (!atomic_read(&node->snk.refs)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ sinkref = pi_sinkref_alloc(&node->sinkref_pool);
+ if (!sinkref) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ spin_lock_init(&sinkref->lock);
+ INIT_LIST_HEAD(&sinkref->list);
+
+ if (flags & PI_FLAG_ALREADY_BOOSTED)
+ sinkref->state = pi_state_boosted;
+ else
+ /*
+ * Schedule it for addition at the next update
+ */
+ sinkref->state = pi_state_boost;
+
+ pi_source_init(&sinkref->src, &node->prio);
+ sinkref->snk = snk;
+
+ /* set one ref from ourselves. It will be dropped on del_sink */
+ atomic_inc(&sinkref->snk->refs);
+ atomic_set(&sinkref->refs, 1);
+
+ list_add_tail(&sinkref->list, &node->snks);
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->snk, 0);
+
+ return 0;
+
+ out:
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ return ret;
+}
+
+int pi_del_sink(struct pi_node *node, struct pi_sink *snk, unsigned int flags)
+{
+ struct pi_sinkref *sinkref;
+ struct pi_sinkref *sinkrefs[MAX_PI_DEPENDENCIES];
+ unsigned long iflags;
+ int count = 0;
+ int i;
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ /*
+ * There may be multiple matches to snk because sometimes a
+ * deboost/free may still be pending an update when the same
+ * sink has been re-added. So we want to process all instances
+ */
+ list_for_each_entry(sinkref, &node->snks, list) {
+ if (sinkref->snk == snk) {
+ _pi_sink_addref(sinkref);
+ sinkrefs[count++] = sinkref;
+ }
+ }
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ for (i = 0; i < count; ++i)
+ _pi_del_sinkref(node, sinkrefs[i]);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->snk, 0);
+
+ return 0;
+}
+
+
+

2008-08-15 20:30:42

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v4 0/8] Priority Inheritance enhancements

** RFC for PREEMPT_RT branch, 26-rt1 **

Synopsis: We gain a 13%+ IO improvement in the PREEMPT_RT kernel by
re-working some of the PI logic.

[
Changelog:

v4:

1) Incorporated review comments

*) Fixed checkpatch warning about extern in .c
*) Renamed s/snk/sink
*) Renamed s/addref/get
*) Renamed s/dropref/put
*) Made pi_sink use static *ops

2) Fixed a bug w.r.t. enabling interrupts too early

v3:

*) fixed a race with sinkref->prio

v2:

*) Added proper reference counting to prevent tasks from
being deleted while a node->update() is still in flight
*) unified the RCU boost path

v1:

*) initial release
]


[
fyi -> you can find this series at the following URLs in
addition to this thread:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/linux-2.6-hacks.git;a=shortlog;h=pi-rework

ftp://ftp.novell.com/dev/ghaskins/pi-rework.tar.bz2

]

Hi All,

The following series applies to 26-rt1 as a request-for-comment on a
new approach to priority-inheritance (PI), as well as some performance
enhancements to take advantage of those new approaches. This yields at
least a 13-15% improvement for diskio on my 4-way x86_64 system. An
8-way system saw as much as 700% improvement during early testing, but
I have not recently reconfirmed this number.

Motivation for series:

I have several ideas on things we can do to enhance and improve kernel
performance with respect to PREEMPT_RT

1) For instance, it would be nice to support priority queuing and
(at least positional) inheritance in the wait-queue infrastructure.

2) Reducing overhead in the real-time locks (sleepable replacements for
spinlock_t in PREEMPT_RT) to try to approach the minimal overhead
of their non-rt equivalent. We have determined via instrumentation
that one area of major overhead is the pi-boost logic.

However, today the PI code is entwined with the rtmutex infrastructure,
yet we require more flexibility if we want to address (1) and (2)
above. Therefore the first step is to separate the PI code from
rtmutex into its own library (libpi). This is covered in patches 1-7.

(I realize patch #7 is a little hard to review since I removed and added
a lot of code, which the unified diff mashes together... I will try
to find a way to make this more readable.)

Patch 8 is the first real consumer of the libpi logic to try to enhance
performance. It accomplishes this by deferring pi-boosting a lock
owner unless it is absolutely necessary. Since instrumentation
shows that the majority of locks are acquired either via the fast-path,
or via the adaptive-spin path, we can eliminate most of the pi-overhead
with this technique. This yields a measurable performance gain (at least
13% was observed in our lab for workloads with heavy lock contention).
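
To make the idea concrete, here is a rough sketch (not the actual patch 8
code) of where the deferral happens: the waiter only pays for a pi-boost
once the cheap acquisition paths have failed and it is committed to
sleeping. try_fast_acquire() and adaptive_spin() are stand-ins for the
existing fast/adaptive paths, and task_pi_boost() is the helper added
later in this series.

	/*
	 * Illustrative sketch only: defer pi-boosting the lock owner until
	 * we know we are actually going to block on the lock.
	 */
	static void deferred_boost_sketch(struct rt_mutex *lock,
					  struct pi_source *my_prio_src)
	{
		if (try_fast_acquire(lock))	/* uncontended: no PI overhead */
			return;

		if (adaptive_spin(lock))	/* owner is running: spin, no PI overhead */
			return;

		/* Only now, just before blocking, boost the owner */
		task_pi_boost(rt_mutex_owner(lock), my_prio_src, 0);
		schedule();
	}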

We have not yet completed the work on the pi-waitqueues or any of the other
related pi enhancements. Those will be coming in a follow-on announcement.

Feedback/comments welcome!

Regards,
-Greg


---

Gregory Haskins (8):
rtmutex: pi-boost locks as late as possible
rtmutex: convert rtmutexes to fully use the PI library
rtmutex: use runtime init for rtmutexes
RT: wrap the rt_rwlock "add reader" logic
rtmutex: formally initialize the rt_mutex_waiters
sched: rework task reference counting to work with the pi infrastructure
sched: add the basic PI infrastructure to the task_struct
add generalized priority-inheritance interface


Documentation/libpi.txt | 59 ++
include/linux/pi.h | 293 ++++++++++++
include/linux/rt_lock.h | 2
include/linux/rtmutex.h | 18 -
include/linux/sched.h | 59 +-
include/linux/workqueue.h | 2
kernel/fork.c | 35 +
kernel/rcupreempt-boost.c | 25 -
kernel/rtmutex-debug.c | 4
kernel/rtmutex-tester.c | 4
kernel/rtmutex.c | 1091 ++++++++++++++++++---------------------------
kernel/rtmutex_common.h | 19 -
kernel/rwlock_torture.c | 32 -
kernel/sched.c | 207 ++++++---
kernel/workqueue.c | 39 +-
lib/Makefile | 3
lib/pi.c | 489 ++++++++++++++++++++
17 files changed, 1531 insertions(+), 850 deletions(-)
create mode 100644 Documentation/libpi.txt
create mode 100644 include/linux/pi.h
create mode 100644 lib/pi.c

--
Signature

2008-08-15 20:30:58

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

The kernel currently addresses priority-inversion through priority-
inheritance. However, all of the priority-inheritance logic is
integrated into the Real-Time Mutex infrastructure. This causes a few
problems:

1) This tightly coupled relationship makes it difficult to extend to
other areas of the kernel (for instance, pi-aware wait-queues may
be desirable).
2) Enhancing the rtmutex infrastructure becomes challenging because
there is no separation between the locking code and the pi-code.

This patch aims to rectify these shortcomings by designing a stand-alone
pi framework which can then be used to replace the rtmutex-specific
version. The goal of this framework is to provide similar functionality
to the existing subsystem, but with sole focus on PI and the
relationships between objects that can boost priority, and the objects
that get boosted.

We introduce the concept of a "pi_source" and a "pi_sink", which, as the
names suggest, provide the basic relationship of a priority source and
its boosted target. A pi_source acts as a reference to some arbitrary
source of priority, and a pi_sink can be boosted (or deboosted) by
a pi_source. For more details, please read the library documentation.

There are currently no users of this interface.
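
As a quick taste of the interface (a hypothetical consumer sketch; the
library itself follows below), a minimal leaf sink can be registered on a
node and then boosted through a pi_source. Note that a source is
initialized with a pointer to a priority value, which is what makes a
repeated pi_boost() an idempotent refresh rather than a new operation.
The "my_leaf"/"waiter" names are illustrative only.

	#include <linux/sched.h>	/* MAX_PRIO */
	#include <linux/pi.h>

	/* Hypothetical leaf sink: it simply records the node's current priority */
	static int leaf_prio = MAX_PRIO;

	static int my_leaf_boost(struct pi_sink *sink, struct pi_source *src,
				 unsigned int flags)
	{
		leaf_prio = *src->prio;	/* sources carry a reference, not a value */
		return 0;
	}

	static int my_leaf_update(struct pi_sink *sink, unsigned int flags)
	{
		return 0;		/* nothing deferred in this toy sink */
	}

	static struct pi_sink_ops my_leaf_ops = {
		.boost	= my_leaf_boost,
		.update	= my_leaf_update,
	};

	static struct pi_node node;
	static struct pi_sink my_leaf;
	static int waiter_prio = 40;	/* 0 = highest, MAX_PRIO = lowest */
	static struct pi_source waiter_src;

	static void example(void)
	{
		pi_node_init(&node);
		pi_sink_init(&my_leaf, &my_leaf_ops);
		pi_add_sink(&node, &my_leaf, 0);	/* node now notifies the leaf */

		pi_source_init(&waiter_src, &waiter_prio);
		pi_boost(&node, &waiter_src, 0);	/* leaf_prio becomes 40 */

		waiter_prio = 35;			/* the referenced value changed */
		pi_boost(&node, &waiter_src, 0);	/* idempotent refresh -> 35 */

		pi_deboost(&node, &waiter_src, 0);	/* leaf falls back to MAX_PRIO */
	}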

Signed-off-by: Gregory Haskins <[email protected]>
---

Documentation/libpi.txt | 59 ++++++
include/linux/pi.h | 293 ++++++++++++++++++++++++++++
lib/Makefile | 3
lib/pi.c | 489 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 843 insertions(+), 1 deletions(-)
create mode 100644 Documentation/libpi.txt
create mode 100644 include/linux/pi.h
create mode 100644 lib/pi.c

diff --git a/Documentation/libpi.txt b/Documentation/libpi.txt
new file mode 100644
index 0000000..197b21a
--- /dev/null
+++ b/Documentation/libpi.txt
@@ -0,0 +1,59 @@
+lib/pi.c - Priority Inheritance library
+
+Sources and sinks:
+------------
+
+This library introduces the basic concepts of a "pi_source" and a "pi_sink", which, as the names suggest, provide the basic relationship of a priority source and its boosted target.
+
+A pi_source is simply a reference to some arbitrary priority value that may range from 0 (highest prio) to MAX_PRIO (currently 140, lowest prio). A pi_source calls pi_sink.boost() whenever it wishes to boost the sink to (at least minimally) the priority value that the source represents. It uses pi_sink.boost() both for the initial boosting and for any subsequent refreshes to the value (even if the value is decreasing in logical priority). The policy of the sink will dictate what happens as a result of that boost. Likewise, a pi_source calls pi_sink.deboost() to stop contributing to the sink's minimum priority.
+
+It is important to note that a source is a reference to a priority value, not a value itself. This is one of the concepts that allows the interface to be idempotent, which is important for properly updating a chain of sources and sinks in the proper order. If we passed the priority on the stack, the order in which the system executes could allow the actual value that is set to race.
+
+Nodes:
+
+A pi_node is a convenience object which is simultaneously a source and a sink. As its name suggests, it would typically be deployed as a node in a pi-chain. Other pi_sources can boost a node via its pi_sink.boost() interface. Likewise, a node can boost a fixed number of sinks via the node.add_sink() interface.
+
+Generally speaking, a node takes care of many common operations associated with being a “link in the chain”, such as:
+
+ 1) determining the current priority of the node based on the (logically) highest priority source that is boosting the node.
+ 2) boosting/deboosting upstream sinks whenever the node locally changes priority.
+ 3) taking care to avoid deadlock during a chain update.
+
+Design details:
+
+Destruction:
+
+The pi-library objects are designed to be implicitly destructible (meaning they do not require an explicit “free()” operation when they are no longer used). This is important considering their intended use (spinlock_t's, which are also implicitly destructible). As such, any allocations needed for operation must come from internal structure storage, as there will be no opportunity to free them later.
+
+Multiple sinks per Node:
+
+We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation, which had the notion of only a single sink (i.e. “task->pi_blocked_on”). The reason why we added the ability to add more than one sink was not to change the default chaining model (i.e. multiple boost targets), but rather to add flexible notification mechanisms that are peripheral to the chain, which we informally call “leaf sinks”.
+
+Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints of a priority boost. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation-specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary, whereas an rwlock leaf-sink may boost a list of reader-owners.
+
+The following diagram depicts an example relationship (warning: cheesy ascii art)
+
+              ---------       ---------
+              | leaf  |       | leaf  |
+              ---------       ---------
+                /               /
+     ---------  /    ----------  /    ---------      ---------
+  ->-| node  |-->---| node   |-->---| node  |-->---| leaf  |
+     ---------      ----------      ---------      ---------
+
+The reason why this was done was to unify the notion of a “sink” under a single interface, rather than having something like task->pi_blocked_on and a separate callback for the leaf action. Instead, any downstream object can be represented by a sink, and the implementation details are hidden (e.g. "I'm a task", "I'm a lock", "I'm a node", "I'm a work-item", "I'm a wait-queue", etc.).
+
+Sinkrefs:
+
+Each pi_sink.boost() operation is represented by a unique pi_source to properly facilitate a one-node-to-many-sources relationship. Therefore, if a pi_node is to act as an aggregator for multiple sinks, it implicitly must have one internal pi_source object for every sink that is added (via node.add_sink()). This pi_source object has to be internally managed for the lifetime of the sink reference.
+
+Recall that due to the implicit-destruction requirement above, and the fact that we will typically be executing in a preempt-disabled region, we have to be very careful about how we allocate references to those sinks. More on that next. But, long story short, we limit the number of sinks to MAX_PI_DEPENDENCIES (currently 5).
+
+Locking:
+
+(work in progress....)
+
+
+
+
+
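
The documentation above describes the node/leaf relationship in the
abstract; as a hedged sketch (hypothetical names, using only the
interfaces declared in the header below), chaining two nodes with a leaf
at the end looks roughly like this. "my_leaf_ops" is assumed to be a leaf
pi_sink_ops providing .boost and .update callbacks.

	static struct pi_node node_a, node_b;
	static struct pi_sink leaf;
	static int waiter_prio = 40;
	static struct pi_source waiter_src;

	static void chain_example(void)
	{
		pi_node_init(&node_a);
		pi_node_init(&node_b);
		pi_sink_init(&leaf, &my_leaf_ops);

		pi_add_sink(&node_b, &leaf, 0);		/* node_b -> leaf */
		pi_add_sink(&node_a, &node_b.sink, 0);	/* node_a -> node_b */

		pi_source_init(&waiter_src, &waiter_prio);
		pi_boost(&node_a, &waiter_src, 0);	/* propagates through node_b to the leaf */
	}
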
diff --git a/include/linux/pi.h b/include/linux/pi.h
new file mode 100644
index 0000000..5535474
--- /dev/null
+++ b/include/linux/pi.h
@@ -0,0 +1,293 @@
+/*
+ * see Documentation/libpi.txt for details
+ */
+
+#ifndef _LINUX_PI_H
+#define _LINUX_PI_H
+
+#include <linux/list.h>
+#include <linux/plist.h>
+#include <asm/atomic.h>
+
+#define MAX_PI_DEPENDENCIES 5
+
+struct pi_source {
+ struct plist_node list;
+ int *prio;
+ int boosted;
+};
+
+
+#define PI_FLAG_DEFER_UPDATE (1 << 0)
+#define PI_FLAG_ALREADY_BOOSTED (1 << 1)
+
+struct pi_sink;
+
+struct pi_sink_ops {
+ int (*boost)(struct pi_sink *sink, struct pi_source *src,
+ unsigned int flags);
+ int (*deboost)(struct pi_sink *sink, struct pi_source *src,
+ unsigned int flags);
+ int (*update)(struct pi_sink *sink,
+ unsigned int flags);
+ int (*free)(struct pi_sink *sink,
+ unsigned int flags);
+};
+
+struct pi_sink {
+ atomic_t refs;
+ struct pi_sink_ops *ops;
+};
+
+enum pi_state {
+ pi_state_boost,
+ pi_state_boosted,
+ pi_state_deboost,
+ pi_state_free,
+};
+
+/*
+ * NOTE: PI must always use a true (e.g. raw) spinlock, since it is used by
+ * rtmutex infrastructure.
+ */
+
+struct pi_sinkref {
+ raw_spinlock_t lock;
+ struct list_head list;
+ enum pi_state state;
+ struct pi_sink *sink;
+ struct pi_source src;
+ atomic_t refs;
+};
+
+struct pi_sinkref_pool {
+ struct list_head free;
+ struct pi_sinkref data[MAX_PI_DEPENDENCIES];
+};
+
+struct pi_node {
+ raw_spinlock_t lock;
+ int prio;
+ struct pi_sink sink;
+ struct pi_sinkref_pool sinkref_pool;
+ struct list_head sinks;
+ struct plist_head srcs;
+};
+
+/**
+ * pi_node_init - initialize a pi_node before use
+ * @node: a node context
+ */
+extern void pi_node_init(struct pi_node *node);
+
+/**
+ * pi_add_sink - add a sink as a downstream object
+ * @node: the node context
+ * @sink: the sink context to add to the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ * PI_FLAG_ALREADY_BOOSTED - Do not perform initial boosting
+ *
+ * This function registers a sink to get notified whenever the
+ * node changes priority.
+ *
+ * Note: By default, this function will schedule the newly added sink
+ * to get an initial boost notification on the next update (even
+ * without the presence of a priority transition). However, if the
+ * ALREADY_BOOSTED flag is specified, the sink is initially marked as
+ * BOOSTED and will only get notified if the node changes priority
+ * in the future.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+extern int pi_add_sink(struct pi_node *node, struct pi_sink *sink,
+ unsigned int flags);
+
+/**
+ * pi_del_sink - del a sink from the current downstream objects
+ * @node: the node context
+ * @sink: the sink context to delete from the node
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function unregisters a sink from the node.
+ *
+ * Note: The sink will not actually become fully deboosted until
+ * a call to node.update() successfully returns.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+extern int pi_del_sink(struct pi_node *node, struct pi_sink *sink,
+ unsigned int flags);
+
+/**
+ * pi_sink_init - initialize a pi_sink before use
+ * @sink: a sink context
+ * @ops: pointer to a pi_sink_ops structure
+ */
+static inline void
+pi_sink_init(struct pi_sink *sink, struct pi_sink_ops *ops)
+{
+ atomic_set(&sink->refs, 0);
+ sink->ops = ops;
+}
+
+/**
+ * pi_source_init - initialize a pi_source before use
+ * @src: a src context
+ * @prio: pointer to a priority value
+ *
+ * A pointer to a priority value is used so that boost and update
+ * are fully idempotent.
+ */
+static inline void
+pi_source_init(struct pi_source *src, int *prio)
+{
+ plist_node_init(&src->list, *prio);
+ src->prio = prio;
+ src->boosted = 0;
+}
+
+/**
+ * pi_boost - boost a node with a pi_source
+ * @node: the node context
+ * @src: the src context to boost the node with
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function registers a priority source with the node, possibly
+ * boosting its value if the new source is the highest registered source.
+ *
+ * This function is used to both initially register a source, as well as
+ * to notify the node if the value changes in the future (even if the
+ * priority is decreasing).
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_boost(struct pi_node *node, struct pi_source *src, unsigned int flags)
+{
+ struct pi_sink *sink = &node->sink;
+
+ if (sink->ops->boost)
+ return sink->ops->boost(sink, src, flags);
+
+ return 0;
+}
+
+/**
+ * pi_deboost - deboost a pi_source from a node
+ * @node: the node context
+ * @src: the src context to boost the node with
+ * @flags: optional flags to modify behavior
+ * PI_FLAG_DEFER_UPDATE - Do not perform sync update
+ *
+ * This function unregisters a priority source from the node, possibly
+ * deboosting its value if the departing source was the highest
+ * registered source.
+ *
+ * Note: By default, this function will synchronously update the
+ * chain unless the DEFER_UPDATE flag is specified.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_deboost(struct pi_node *node, struct pi_source *src, unsigned int flags)
+{
+ struct pi_sink *sink = &node->sink;
+
+ if (sink->ops->deboost)
+ return sink->ops->deboost(sink, src, flags);
+
+ return 0;
+}
+
+/**
+ * pi_update - force a manual chain update
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * This function will push any priority changes (as a result of
+ * boost/deboost or add_sink/del_sink) down through the chain.
+ * If no changes are necessary, this function is a no-op.
+ *
+ * Returns: (int)
+ * 0 = success
+ * any other value = failure
+ */
+static inline int
+pi_update(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *sink = &node->sink;
+
+ if (sink->ops->update)
+ return sink->ops->update(sink, flags);
+
+ return 0;
+}
+
+/**
+ * pi_sink_put - down the reference count, freeing the sink if 0
+ * @sink: the sink context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * Returns: none
+ */
+static inline void
+pi_sink_put(struct pi_sink *sink, unsigned int flags)
+{
+ if (atomic_dec_and_test(&sink->refs)) {
+ if (sink->ops->free)
+ sink->ops->free(sink, flags);
+ }
+}
+
+
+/**
+ * pi_get - up the reference count
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * Returns: none
+ */
+static inline void
+pi_get(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *sink = &node->sink;
+
+ atomic_inc(&sink->refs);
+}
+
+/**
+ * pi_put - down the reference count, freeing the node if 0
+ * @node: the node context
+ * @flags: optional flags to modify behavior. Reserved, must be 0.
+ *
+ * Returns: none
+ */
+static inline void
+pi_put(struct pi_node *node, unsigned int flags)
+{
+ struct pi_sink *sink = &node->sink;
+
+ pi_sink_put(sink, flags);
+}
+
+#endif /* _LINUX_PI_H */
diff --git a/lib/Makefile b/lib/Makefile
index 5187924..df81ad7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -23,7 +23,8 @@ lib-$(CONFIG_SMP) += cpumask.o
lib-y += kobject.o kref.o klist.o

obj-y += div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
- bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o
+ bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \
+ pi.o

ifeq ($(CONFIG_DEBUG_KOBJECT),y)
CFLAGS_kobject.o += -DDEBUG
diff --git a/lib/pi.c b/lib/pi.c
new file mode 100644
index 0000000..d00042c
--- /dev/null
+++ b/lib/pi.c
@@ -0,0 +1,489 @@
+/*
+ * lib/pi.c
+ *
+ * Priority-Inheritance library
+ *
+ * Copyright (C) 2008 Novell
+ *
+ * Author: Gregory Haskins <[email protected]>
+ *
+ * This code provides a generic framework for preventing priority
+ * inversion by means of priority-inheritance. (see Documentation/libpi.txt
+ * for details)
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <linux/sched.h>
+#include <linux/module.h>
+#include <linux/pi.h>
+
+
+struct updater {
+ int update;
+ struct pi_sinkref *sinkref;
+ struct pi_sink *sink;
+};
+
+/*
+ *-----------------------------------------------------------
+ * pi_sinkref_pool
+ *-----------------------------------------------------------
+ */
+
+static void
+pi_sinkref_pool_init(struct pi_sinkref_pool *pool)
+{
+ int i;
+
+ INIT_LIST_HEAD(&pool->free);
+
+ for (i = 0; i < MAX_PI_DEPENDENCIES; ++i) {
+ struct pi_sinkref *sinkref = &pool->data[i];
+
+ memset(sinkref, 0, sizeof(*sinkref));
+ INIT_LIST_HEAD(&sinkref->list);
+ list_add_tail(&sinkref->list, &pool->free);
+ }
+}
+
+static struct pi_sinkref *
+pi_sinkref_alloc(struct pi_sinkref_pool *pool)
+{
+ struct pi_sinkref *sinkref;
+
+ if (list_empty(&pool->free))
+ return NULL;
+
+ sinkref = list_first_entry(&pool->free, struct pi_sinkref, list);
+ list_del(&sinkref->list);
+ memset(sinkref, 0, sizeof(*sinkref));
+
+ return sinkref;
+}
+
+static void
+pi_sinkref_free(struct pi_sinkref_pool *pool,
+ struct pi_sinkref *sinkref)
+{
+ list_add_tail(&sinkref->list, &pool->free);
+}
+
+/*
+ *-----------------------------------------------------------
+ * pi_sinkref
+ *-----------------------------------------------------------
+ */
+
+static inline void
+_pi_sink_get(struct pi_sinkref *sinkref)
+{
+ atomic_inc(&sinkref->sink->refs);
+ atomic_inc(&sinkref->refs);
+}
+
+static inline void
+_pi_sink_put_local(struct pi_node *node, struct pi_sinkref *sinkref)
+{
+ if (atomic_dec_and_lock(&sinkref->refs, &node->lock)) {
+ list_del(&sinkref->list);
+ pi_sinkref_free(&node->sinkref_pool, sinkref);
+ spin_unlock(&node->lock);
+ }
+}
+
+static inline void
+_pi_sink_put_all(struct pi_node *node, struct pi_sinkref *sinkref)
+{
+ struct pi_sink *sink = sinkref->sink;
+
+ _pi_sink_put_local(node, sinkref);
+ pi_sink_put(sink, 0);
+}
+
+/*
+ *-----------------------------------------------------------
+ * pi_node
+ *-----------------------------------------------------------
+ */
+
+static struct pi_node *node_of(struct pi_sink *sink)
+{
+ return container_of(sink, struct pi_node, sink);
+}
+
+static inline void
+__pi_boost(struct pi_node *node, struct pi_source *src)
+{
+ BUG_ON(src->boosted);
+
+ plist_node_init(&src->list, *src->prio);
+ plist_add(&src->list, &node->srcs);
+ src->boosted = 1;
+}
+
+static inline void
+__pi_deboost(struct pi_node *node, struct pi_source *src)
+{
+ BUG_ON(!src->boosted);
+
+ plist_del(&src->list, &node->srcs);
+ src->boosted = 0;
+}
+
+/*
+ * _pi_node_update - update the chain
+ *
+ * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
+ * that need to propagate up the chain. This is a step-wise process where we
+ * have to be careful about locking and preemption. By trying MAX_PI_DEPs
+ * times, we guarantee that this update routine is an effective barrier...
+ * all modifications made prior to the call to this barrier will have completed.
+ *
+ * Deadlock avoidance: This node may participate in a chain of nodes which
+ * form a graph of arbitrary structure. While the graph should technically
+ * never close on itself barring any bugs, we still want to protect against
+ * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
+ * from detecting this potential). To do this, we employ a dual-locking
+ * scheme where we can carefully control the order. That is: node->lock
+ * protects most of the node's internal state, but it will never be held
+ * across a chain update. sinkref->lock, on the other hand, can be held
+ * across a boost/deboost, and also guarantees proper execution order. Also
+ * note that no locks are held across an sink->update.
+ */
+static int
+_pi_node_update(struct pi_sink *sink, unsigned int flags)
+{
+ struct pi_node *node = node_of(sink);
+ struct pi_sinkref *sinkref;
+ unsigned long iflags;
+ int count = 0;
+ int i;
+ int pprio;
+ struct updater updaters[MAX_PI_DEPENDENCIES];
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ pprio = node->prio;
+
+ if (!plist_head_empty(&node->srcs))
+ node->prio = plist_first(&node->srcs)->prio;
+ else
+ node->prio = MAX_PRIO;
+
+ list_for_each_entry(sinkref, &node->sinks, list) {
+ /*
+ * If the priority is changing, or if this is a
+ * BOOST/DEBOOST, we consider this sink "stale"
+ */
+ if (pprio != node->prio
+ || sinkref->state != pi_state_boosted) {
+ struct updater *iter = &updaters[count++];
+
+ BUG_ON(!atomic_read(&sinkref->sink->refs));
+ _pi_sink_get(sinkref);
+
+ iter->update = 1;
+ iter->sinkref = sinkref;
+ iter->sink = sinkref->sink;
+ }
+ }
+
+ spin_unlock(&node->lock);
+
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+ unsigned int lflags = PI_FLAG_DEFER_UPDATE;
+ struct pi_sink *sink;
+
+ sinkref = iter->sinkref;
+ sink = iter->sink;
+
+ spin_lock(&sinkref->lock);
+
+ switch (sinkref->state) {
+ case pi_state_boost:
+ sinkref->state = pi_state_boosted;
+ /* Fall through */
+ case pi_state_boosted:
+ sink->ops->boost(sink, &sinkref->src, lflags);
+ break;
+ case pi_state_deboost:
+ sink->ops->deboost(sink, &sinkref->src, lflags);
+ sinkref->state = pi_state_free;
+
+ /*
+ * drop the ref that we took when the sinkref
+ * was allocated. We still hold a ref from
+ * above.
+ */
+ _pi_sink_put_all(node, sinkref);
+ break;
+ case pi_state_free:
+ iter->update = 0;
+ break;
+ default:
+ panic("illegal sinkref type: %d", sinkref->state);
+ }
+
+ spin_unlock(&sinkref->lock);
+
+ /*
+ * We will drop the sinkref reference while still holding the
+ * preempt/irqs off so that the memory is returned synchronously
+ * to the system.
+ */
+ _pi_sink_put_local(node, sinkref);
+ }
+
+ local_irq_restore(iflags);
+
+ /*
+ * Note: At this point, sinkref is invalid since we put'd
+ * it above, but sink is valid since we still hold the remote
+ * reference. This is key to the design because it allows us
+ * to synchronously free the sinkref object, yet maintain a
+ * reference to the sink across the update
+ */
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+
+ if (iter->update)
+ iter->sink->ops->update(iter->sink, 0);
+ }
+
+ /*
+ * We perform all the free operations together at the end, using
+ * only automatic/stack variables since any one of these operations
+ * could result in our node object being deallocated
+ */
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+
+ pi_sink_put(iter->sink, 0);
+ }
+
+ return 0;
+}
+
+static int
+_pi_del_sink(struct pi_node *node, struct pi_sink *sink, unsigned int flags)
+{
+ struct pi_sinkref *sinkref;
+ struct updater updaters[MAX_PI_DEPENDENCIES];
+ unsigned long iflags;
+ int count = 0;
+ int i;
+
+ local_irq_save(iflags);
+ spin_lock(&node->lock);
+
+ list_for_each_entry(sinkref, &node->sinks, list) {
+ if (!sink || sink == sinkref->sink) {
+ struct updater *iter = &updaters[count++];
+
+ _pi_sink_get(sinkref);
+ iter->sinkref = sinkref;
+ iter->sink = sinkref->sink;
+ }
+ }
+
+ spin_unlock(&node->lock);
+
+ for (i = 0; i < count; ++i) {
+ struct updater *iter = &updaters[i];
+ int remove = 0;
+
+ sinkref = iter->sinkref;
+
+ spin_lock(&sinkref->lock);
+
+ switch (sinkref->state) {
+ case pi_state_boost:
+ /*
+ * This state indicates the sink was never formally
+ * boosted so we can just delete it immediately
+ */
+ remove = 1;
+ break;
+ case pi_state_boosted:
+ if (sinkref->sink->ops->deboost)
+ /*
+ * If the sink supports deboost notification,
+ * schedule it for deboost at the next update
+ */
+ sinkref->state = pi_state_deboost;
+ else
+ /*
+ * ..otherwise schedule it for immediate
+ * removal
+ */
+ remove = 1;
+ break;
+ default:
+ break;
+ }
+
+ if (remove) {
+ /*
+ * drop the ref that we took when the sinkref
+ * was allocated. We still hold a ref from
+ * above
+ */
+ _pi_sink_put_all(node, sinkref);
+ sinkref->state = pi_state_free;
+ }
+
+ spin_unlock(&sinkref->lock);
+
+ _pi_sink_put_local(node, sinkref);
+ }
+
+ local_irq_restore(iflags);
+
+ for (i = 0; i < count; ++i)
+ pi_sink_put(updaters[i].sink, 0);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->sink, 0);
+
+ return 0;
+}
+
+static int
+_pi_node_boost(struct pi_sink *sink, struct pi_source *src,
+ unsigned int flags)
+{
+ struct pi_node *node = node_of(sink);
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+ if (src->boosted)
+ __pi_deboost(node, src);
+ __pi_boost(node, src);
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(sink, 0);
+
+ return 0;
+}
+
+static int
+_pi_node_deboost(struct pi_sink *sink, struct pi_source *src,
+ unsigned int flags)
+{
+ struct pi_node *node = node_of(sink);
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+ __pi_deboost(node, src);
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(sink, 0);
+
+ return 0;
+}
+
+static int
+_pi_node_free(struct pi_sink *sink, unsigned int flags)
+{
+ struct pi_node *node = node_of(sink);
+
+ /*
+ * When the node is freed, we should perform an implicit
+ * del_sink on any remaining sinks we may have.
+ */
+ return _pi_del_sink(node, NULL, flags);
+}
+
+static struct pi_sink_ops pi_node_sink = {
+ .boost = _pi_node_boost,
+ .deboost = _pi_node_deboost,
+ .update = _pi_node_update,
+ .free = _pi_node_free,
+};
+
+void
+pi_node_init(struct pi_node *node)
+{
+ spin_lock_init(&node->lock);
+ node->prio = MAX_PRIO;
+ atomic_set(&node->sink.refs, 1);
+ node->sink.ops = &pi_node_sink;
+ pi_sinkref_pool_init(&node->sinkref_pool);
+ INIT_LIST_HEAD(&node->sinks);
+ plist_head_init(&node->srcs, &node->lock);
+}
+
+int
+pi_add_sink(struct pi_node *node, struct pi_sink *sink, unsigned int flags)
+{
+ struct pi_sinkref *sinkref;
+ int ret = 0;
+ unsigned long iflags;
+
+ spin_lock_irqsave(&node->lock, iflags);
+
+ if (!atomic_read(&node->sink.refs)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ sinkref = pi_sinkref_alloc(&node->sinkref_pool);
+ if (!sinkref) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ spin_lock_init(&sinkref->lock);
+ INIT_LIST_HEAD(&sinkref->list);
+
+ if (flags & PI_FLAG_ALREADY_BOOSTED)
+ sinkref->state = pi_state_boosted;
+ else
+ /*
+ * Schedule it for addition at the next update
+ */
+ sinkref->state = pi_state_boost;
+
+ pi_source_init(&sinkref->src, &node->prio);
+ sinkref->sink = sink;
+
+ /* set one ref from ourselves. It will be dropped on del_sink */
+ atomic_inc(&sinkref->sink->refs);
+ atomic_set(&sinkref->refs, 1);
+
+ list_add_tail(&sinkref->list, &node->sinks);
+
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ if (!(flags & PI_FLAG_DEFER_UPDATE))
+ _pi_node_update(&node->sink, 0);
+
+ return 0;
+
+ out:
+ spin_unlock_irqrestore(&node->lock, iflags);
+
+ return ret;
+}
+
+int
+pi_del_sink(struct pi_node *node, struct pi_sink *sink, unsigned int flags)
+{
+ /*
+ * There may be multiple matches to sink because sometimes a
+ * deboost/free may still be pending an update when the same
+ * node has been added. So we want to process any and all
+ * instances that match our target
+ */
+ return _pi_del_sink(node, sink, flags);
+}
+
+
+

2008-08-15 20:31:28

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v4 2/8] sched: add the basic PI infrastructure to the task_struct

This is a first pass at converting the system to use the new PI library.
We don't go for a wholesale replacement quite yet so that we can focus
on getting the basic plumbing in place. Later in the series we will
begin replacing some of the proprietary logic with the generic
framework.
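
For reference, a subsystem that wants to contribute a temporary priority
to a task would use the new helpers roughly as sketched below. The
"example_booster" wrapper is hypothetical; task_pi_boost(),
task_pi_deboost() and pi_source_init() are the helpers added by this
patch, and the pattern mirrors what the rcupreempt-boost conversion below
does with task->rcu_prio / task->rcu_prio_src.

	#include <linux/sched.h>

	/* Hypothetical wrapper: one priority contribution to one task */
	struct example_booster {
		int prio;		/* 0 = highest, MAX_PRIO = lowest */
		struct pi_source src;
	};

	static void example_booster_init(struct example_booster *b)
	{
		b->prio = MAX_PRIO;
		pi_source_init(&b->src, &b->prio);	/* one-time setup */
	}

	static void example_boost(struct task_struct *p,
				  struct example_booster *b, int prio)
	{
		b->prio = prio;
		task_pi_boost(p, &b->src, 0);	/* initial boost or refresh */
	}

	static void example_unboost(struct task_struct *p,
				    struct example_booster *b)
	{
		task_pi_deboost(p, &b->src, 0);	/* withdraw our contribution */
	}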

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/sched.h | 37 +++++++--
include/linux/workqueue.h | 2
kernel/fork.c | 1
kernel/rcupreempt-boost.c | 23 +-----
kernel/rtmutex.c | 6 +
kernel/sched.c | 188 ++++++++++++++++++++++++++++++++-------------
kernel/workqueue.c | 39 ++++++++-
7 files changed, 206 insertions(+), 90 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c885f78..5521a64 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -87,6 +87,7 @@ struct sched_param {
#include <linux/task_io_accounting.h>
#include <linux/kobject.h>
#include <linux/latencytop.h>
+#include <linux/pi.h>

#include <asm/processor.h>

@@ -1125,6 +1126,7 @@ struct task_struct {
int prio, static_prio, normal_prio;
#ifdef CONFIG_PREEMPT_RCU_BOOST
int rcu_prio;
+ struct pi_source rcu_prio_src;
#endif
const struct sched_class *sched_class;
struct sched_entity se;
@@ -1298,11 +1300,20 @@ struct task_struct {
/* Protection of the PI data structures: */
raw_spinlock_t pi_lock;

+ struct {
+ struct pi_source src; /* represents normal_prio to 'this' */
+ struct pi_node node;
+ struct pi_sink sink; /* registered to 'this' to get updates */
+ int prio;
+ } pi;
+
#ifdef CONFIG_RT_MUTEXES
/* PI waiters blocked on a rt_mutex held by this task */
struct plist_head pi_waiters;
/* Deadlock detection and priority inheritance handling */
struct rt_mutex_waiter *pi_blocked_on;
+ int rtmutex_prio;
+ struct pi_source rtmutex_prio_src;
#endif

#ifdef CONFIG_DEBUG_MUTEXES
@@ -1440,6 +1451,26 @@ struct task_struct {
#endif
};

+static inline int
+task_pi_boost(struct task_struct *p, struct pi_source *src,
+ unsigned int flags)
+{
+ return pi_boost(&p->pi.node, src, flags);
+}
+
+static inline int
+task_pi_deboost(struct task_struct *p, struct pi_source *src,
+ unsigned int flags)
+{
+ return pi_deboost(&p->pi.node, src, flags);
+}
+
+static inline int
+task_pi_update(struct task_struct *p, unsigned int flags)
+{
+ return pi_update(&p->pi.node, flags);
+}
+
#ifdef CONFIG_PREEMPT_RT
# define set_printk_might_sleep(x) do { current->in_printk = x; } while(0)
#else
@@ -1774,14 +1805,8 @@ int sched_rt_handler(struct ctl_table *table, int write,

extern unsigned int sysctl_sched_compat_yield;

-extern void task_setprio(struct task_struct *p, int prio);
-
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
-static inline void rt_mutex_setprio(struct task_struct *p, int prio)
-{
- task_setprio(p, prio);
-}
extern void rt_mutex_adjust_pi(struct task_struct *p);
#else
static inline int rt_mutex_getprio(struct task_struct *p)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 229179e..3dc4ed9 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -11,6 +11,7 @@
#include <linux/lockdep.h>
#include <linux/plist.h>
#include <linux/sched_prio.h>
+#include <linux/pi.h>
#include <asm/atomic.h>

struct workqueue_struct;
@@ -31,6 +32,7 @@ struct work_struct {
#define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)
struct plist_node entry;
work_func_t func;
+ struct pi_source pi_src;
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index b49488d..399a0d0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -990,6 +990,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->rcu_flipctr_idx = 0;
#ifdef CONFIG_PREEMPT_RCU_BOOST
p->rcu_prio = MAX_PRIO;
+ pi_source_init(&p->rcu_prio_src, &p->rcu_prio);
p->rcub_rbdp = NULL;
p->rcub_state = RCU_BOOST_IDLE;
INIT_LIST_HEAD(&p->rcub_entry);
diff --git a/kernel/rcupreempt-boost.c b/kernel/rcupreempt-boost.c
index 5282b19..e8d9d76 100644
--- a/kernel/rcupreempt-boost.c
+++ b/kernel/rcupreempt-boost.c
@@ -232,14 +232,11 @@ static inline int rcu_is_boosted(struct task_struct *task)
static void rcu_boost_task(struct task_struct *task)
{
WARN_ON(!irqs_disabled());
- WARN_ON_SMP(!spin_is_locked(&task->pi_lock));

rcu_trace_boost_task_boost_called(RCU_BOOST_ME);

- if (task->rcu_prio < task->prio) {
+ if (task_pi_boost(task, &task->rcu_prio_src, 0))
rcu_trace_boost_task_boosted(RCU_BOOST_ME);
- task_setprio(task, task->rcu_prio);
- }
}

/**
@@ -275,26 +272,17 @@ void __rcu_preempt_boost(void)
rbd = &__get_cpu_var(rcu_boost_data);
spin_lock(&rbd->rbs_lock);

- spin_lock(&curr->pi_lock);
-
curr->rcub_rbdp = rbd;

rcu_trace_boost_try_boost(rbd);

- prio = rt_mutex_getprio(curr);
-
if (list_empty(&curr->rcub_entry))
list_add_tail(&curr->rcub_entry, &rbd->rbs_toboost);
- if (prio <= rbd->rbs_prio)
- goto out;
-
- rcu_trace_boost_boosted(curr->rcub_rbdp);

set_rcu_prio(curr, rbd->rbs_prio);
rcu_boost_task(curr);

out:
- spin_unlock(&curr->pi_lock);
spin_unlock_irqrestore(&rbd->rbs_lock, flags);
}

@@ -353,15 +341,12 @@ void __rcu_preempt_unboost(void)

rcu_trace_boost_unboosted(rbd);

- set_rcu_prio(curr, MAX_PRIO);
+ task_pi_deboost(curr, &curr->rcu_prio_src, 0);

- spin_lock(&curr->pi_lock);
- prio = rt_mutex_getprio(curr);
- task_setprio(curr, prio);
+ set_rcu_prio(curr, MAX_PRIO);

curr->rcub_rbdp = NULL;

- spin_unlock(&curr->pi_lock);
out:
spin_unlock_irqrestore(&rbd->rbs_lock, flags);
}
@@ -393,9 +378,7 @@ static int __rcu_boost_readers(struct rcu_boost_dat *rbd, int prio, unsigned lon
list_move_tail(&p->rcub_entry,
&rbd->rbs_boosted);
set_rcu_prio(p, prio);
- spin_lock(&p->pi_lock);
rcu_boost_task(p);
- spin_unlock(&p->pi_lock);

/*
* Now we release the lock to allow for a higher
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 377949a..7d11380 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -178,8 +178,10 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
{
int prio = rt_mutex_getprio(task);

- if (task->prio != prio)
- rt_mutex_setprio(task, prio);
+ if (task->rtmutex_prio != prio) {
+ task->rtmutex_prio = prio;
+ task_pi_boost(task, &task->rtmutex_prio_src, 0);
+ }
}

/*
diff --git a/kernel/sched.c b/kernel/sched.c
index 54ea580..0732a9b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1709,26 +1709,6 @@ static inline int normal_prio(struct task_struct *p)
}

/*
- * Calculate the current priority, i.e. the priority
- * taken into account by the scheduler. This value might
- * be boosted by RT tasks, or might be boosted by
- * interactivity modifiers. Will be RT if the task got
- * RT-boosted. If not then it returns p->normal_prio.
- */
-static int effective_prio(struct task_struct *p)
-{
- p->normal_prio = normal_prio(p);
- /*
- * If we are RT tasks or we were boosted to RT priority,
- * keep the priority unchanged. Otherwise, update priority
- * to the normal priority:
- */
- if (!rt_prio(p->prio))
- return p->normal_prio;
- return p->prio;
-}
-
-/*
* activate_task - move a task to the runqueue.
*/
static void activate_task(struct rq *rq, struct task_struct *p, int wakeup)
@@ -2375,6 +2355,58 @@ static void __sched_fork(struct task_struct *p)
p->state = TASK_RUNNING;
}

+static int
+task_pi_boost_cb(struct pi_sink *sink, struct pi_source *src,
+ unsigned int flags)
+{
+ struct task_struct *p = container_of(sink, struct task_struct, pi.sink);
+
+ /*
+ * We dont need any locking here, since the .boost operation
+ * is already guaranteed to be mutually exclusive
+ */
+ p->pi.prio = *src->prio;
+
+ return 0;
+}
+
+static int task_pi_update_cb(struct pi_sink *sink, unsigned int flags);
+
+static struct pi_sink_ops task_pi_sink = {
+ .boost = task_pi_boost_cb,
+ .update = task_pi_update_cb,
+};
+
+static inline void
+task_pi_init(struct task_struct *p)
+{
+ pi_node_init(&p->pi.node);
+
+ /*
+ * Feed our initial state of normal_prio into the PI infrastructure.
+ * We will update this whenever it changes
+ */
+ p->pi.prio = p->normal_prio;
+ pi_source_init(&p->pi.src, &p->normal_prio);
+ task_pi_boost(p, &p->pi.src, PI_FLAG_DEFER_UPDATE);
+
+#ifdef CONFIG_RT_MUTEXES
+ p->rtmutex_prio = MAX_PRIO;
+ pi_source_init(&p->rtmutex_prio_src, &p->rtmutex_prio);
+ task_pi_boost(p, &p->rtmutex_prio_src, PI_FLAG_DEFER_UPDATE);
+#endif
+
+ /*
+ * We add our own task as a dependency of ourselves so that
+ * we get boost-notifications (via task_pi_boost_cb) whenever
+ * our priority is changed (locally e.g. setscheduler() or
+ * remotely via a pi-boost).
+ */
+ pi_sink_init(&p->pi.sink, &task_pi_sink);
+ pi_add_sink(&p->pi.node, &p->pi.sink,
+ PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
+}
+
/*
* fork()/clone()-time setup:
*/
@@ -2396,6 +2428,8 @@ void sched_fork(struct task_struct *p, int clone_flags)
if (!rt_prio(p->prio))
p->sched_class = &fair_sched_class;

+ task_pi_init(p);
+
#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
if (likely(sched_info_on()))
memset(&p->sched_info, 0, sizeof(p->sched_info));
@@ -2411,6 +2445,55 @@ void sched_fork(struct task_struct *p, int clone_flags)
}

/*
+ * In the past, task_setprio was exposed as an API. This variant is only
+ * meant to be called from pi_update functions (namely, task_updateprio() and
+ * task_pi_update_cb()). If you need to adjust the priority of a task,
+ * you should be using something like setscheduler() (permanent adjustments)
+ * or task_pi_boost() (temporary adjustments).
+ */
+static void
+task_setprio(struct task_struct *p, int prio)
+{
+ if (prio == p->prio)
+ return;
+
+ if (rt_prio(prio))
+ p->sched_class = &rt_sched_class;
+ else
+ p->sched_class = &fair_sched_class;
+
+ p->prio = prio;
+}
+
+static inline void
+task_updateprio(struct task_struct *p)
+{
+ int prio = normal_prio(p);
+
+ if (p->normal_prio != prio) {
+ p->normal_prio = prio;
+ set_load_weight(p);
+
+ /*
+ * Reboost our normal_prio entry, which will
+ * also chain-update any of our PI dependencies (of course)
+ * on our next update
+ */
+ task_pi_boost(p, &p->pi.src, PI_FLAG_DEFER_UPDATE);
+ }
+
+ /*
+ * If normal_prio is logically higher than our current setting,
+ * just assign the priority/class immediately so that any callers
+ * will see the update as synchronous without dropping the rq-lock
+ * to do a pi_update. Any discrepancy with pending pi-updates will
+ * automatically be corrected after we drop the rq-lock.
+ */
+ if (p->normal_prio < p->prio)
+ task_setprio(p, p->normal_prio);
+}
+
+/*
* wake_up_new_task - wake up a newly created task for the first time.
*
* This function will do some initial scheduler statistics housekeeping
@@ -2426,7 +2509,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
BUG_ON(p->state != TASK_RUNNING);
update_rq_clock(rq);

- p->prio = effective_prio(p);
+ task_updateprio(p);

if (!p->sched_class->task_new || !current->se.on_rq) {
activate_task(rq, p, 0);
@@ -2447,6 +2530,8 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
p->sched_class->task_wake_up(rq, p);
#endif
task_rq_unlock(rq, &flags);
+
+ task_pi_update(p, 0);
}

#ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -4887,27 +4972,25 @@ long __sched sleep_on_timeout(wait_queue_head_t *q, long timeout)
EXPORT_SYMBOL(sleep_on_timeout);

/*
- * task_setprio - set the current priority of a task
- * @p: task
- * @prio: prio value (kernel-internal form)
+ * Invoked whenever our priority changes by the PI library
*
* This function changes the 'effective' priority of a task. It does
* not touch ->normal_prio like __setscheduler().
*
- * Used by the rt_mutex code to implement priority inheritance logic
- * and by rcupreempt-boost to boost priorities of tasks sleeping
- * with rcu locks.
*/
-void task_setprio(struct task_struct *p, int prio)
+static int
+task_pi_update_cb(struct pi_sink *sink, unsigned int flags)
{
- unsigned long flags;
+ struct task_struct *p = container_of(sink, struct task_struct, pi.sink);
+ unsigned long iflags;
int oldprio, on_rq, running;
+ int prio = p->pi.prio;
struct rq *rq;
const struct sched_class *prev_class = p->sched_class;

BUG_ON(prio < 0 || prio > MAX_PRIO);

- rq = task_rq_lock(p, &flags);
+ rq = task_rq_lock(p, &iflags);

/*
* Idle task boosting is a nono in general. There is one
@@ -4929,6 +5012,10 @@ void task_setprio(struct task_struct *p, int prio)

update_rq_clock(rq);

+ /* If prio is not changing, bail */
+ if (prio == p->prio)
+ goto out_unlock;
+
oldprio = p->prio;
on_rq = p->se.on_rq;
running = task_current(rq, p);
@@ -4937,12 +5024,7 @@ void task_setprio(struct task_struct *p, int prio)
if (running)
p->sched_class->put_prev_task(rq, p);

- if (rt_prio(prio))
- p->sched_class = &rt_sched_class;
- else
- p->sched_class = &fair_sched_class;
-
- p->prio = prio;
+ task_setprio(p, prio);

// trace_special_pid(p->pid, __PRIO(oldprio), PRIO(p));

@@ -4956,7 +5038,9 @@ void task_setprio(struct task_struct *p, int prio)
// trace_special(prev_resched, _need_resched(), 0);

out_unlock:
- task_rq_unlock(rq, &flags);
+ task_rq_unlock(rq, &iflags);
+
+ return 0;
}

void set_user_nice(struct task_struct *p, long nice)
@@ -4990,9 +5074,9 @@ void set_user_nice(struct task_struct *p, long nice)
}

p->static_prio = NICE_TO_PRIO(nice);
- set_load_weight(p);
old_prio = p->prio;
- p->prio = effective_prio(p);
+ task_updateprio(p);
+
delta = p->prio - old_prio;

if (on_rq) {
@@ -5007,6 +5091,8 @@ void set_user_nice(struct task_struct *p, long nice)
}
out_unlock:
task_rq_unlock(rq, &flags);
+
+ task_pi_update(p, 0);
}
EXPORT_SYMBOL(set_user_nice);

@@ -5123,23 +5209,9 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
BUG_ON(p->se.on_rq);

p->policy = policy;
- switch (p->policy) {
- case SCHED_NORMAL:
- case SCHED_BATCH:
- case SCHED_IDLE:
- p->sched_class = &fair_sched_class;
- break;
- case SCHED_FIFO:
- case SCHED_RR:
- p->sched_class = &rt_sched_class;
- break;
- }
-
p->rt_priority = prio;
- p->normal_prio = normal_prio(p);
- /* we are holding p->pi_lock already */
- p->prio = rt_mutex_getprio(p);
- set_load_weight(p);
+
+ task_updateprio(p);
}

/**
@@ -5264,6 +5336,7 @@ recheck:
__task_rq_unlock(rq);
spin_unlock_irqrestore(&p->pi_lock, flags);

+ task_pi_update(p, 0);
rt_mutex_adjust_pi(p);

return 0;
@@ -6686,6 +6759,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
deactivate_task(rq, rq->idle, 0);
rq->idle->static_prio = MAX_PRIO;
__setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
+ rq->idle->prio = rq->idle->normal_prio;
rq->idle->sched_class = &idle_sched_class;
migrate_dead_tasks(cpu);
spin_unlock_irq(&rq->lock);
@@ -8395,6 +8469,8 @@ void __init sched_init(void)
open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL);
#endif

+ task_pi_init(&init_task);
+
#ifdef CONFIG_RT_MUTEXES
plist_head_init(&init_task.pi_waiters, &init_task.pi_lock);
#endif
@@ -8460,7 +8536,9 @@ static void normalize_task(struct rq *rq, struct task_struct *p)
on_rq = p->se.on_rq;
if (on_rq)
deactivate_task(rq, p, 0);
+
__setscheduler(rq, p, SCHED_NORMAL, 0);
+
if (on_rq) {
activate_task(rq, p, 0);
resched_task(rq->curr);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9f37979..5cd4b0e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -145,8 +145,13 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
plist_node_init(&work->entry, prio);
plist_add(&work->entry, &cwq->worklist);

- if (boost_prio < cwq->thread->prio)
- task_setprio(cwq->thread, boost_prio);
+ /*
+ * FIXME: We want to boost to boost_prio, but we dont record that
+ * value in the work_struct for later deboosting
+ */
+ pi_source_init(&work->pi_src, &work->entry.prio);
+ task_pi_boost(cwq->thread, &work->pi_src, 0);
+
wake_up(&cwq->more_work);
}

@@ -280,6 +285,10 @@ struct wq_barrier {
static void run_workqueue(struct cpu_workqueue_struct *cwq)
{
struct plist_head *worklist = &cwq->worklist;
+ struct pi_source pi_src;
+ int prio;
+
+ pi_source_init(&pi_src, &prio);

spin_lock_irq(&cwq->lock);
cwq->run_depth++;
@@ -292,10 +301,10 @@ static void run_workqueue(struct cpu_workqueue_struct *cwq)

again:
while (!plist_head_empty(worklist)) {
- int prio;
struct work_struct *work = plist_first_entry(worklist,
struct work_struct, entry);
work_func_t f = work->func;
+
#ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the struct work_struct
@@ -316,14 +325,28 @@ again:
}
prio = max(prio, 0);

- if (likely(cwq->thread->prio != prio))
- task_setprio(cwq->thread, prio);
-
cwq->current_work = work;
plist_del(&work->entry, worklist);
plist_node_init(&work->entry, MAX_PRIO);
spin_unlock_irq(&cwq->lock);

+ /*
+ * The owner is free to reuse the work object once we execute
+ * the work->func() below. Therefore we cannot leave the
+ * work->pi_src boosting our thread or it may get stomped
+ * on when the work item is requeued.
+ *
+ * So what we do is we boost ourselves with an on-the
+ * stack copy of the priority of the work item, and then
+ * deboost the workitem. Once the work is complete, we
+ * can then simply deboost the stack version.
+ *
+ * Note that this will not typically cause a pi-chain
+ * update since we are boosting the node laterally
+ */
+ task_pi_boost(current, &pi_src, PI_FLAG_DEFER_UPDATE);
+ task_pi_deboost(current, &work->pi_src, PI_FLAG_DEFER_UPDATE);
+
BUG_ON(get_wq_data(work) != cwq);
work_clear_pending(work);
leak_check(NULL);
@@ -334,6 +357,9 @@ again:
lock_release(&cwq->wq->lockdep_map, 1, _THIS_IP_);
leak_check(f);

+ /* Deboost the stack copy of the work->prio (see above) */
+ task_pi_deboost(current, &pi_src, 0);
+
spin_lock_irq(&cwq->lock);
cwq->current_work = NULL;
wake_up_all(&cwq->work_done);
@@ -357,7 +383,6 @@ again:
goto again;
}

- task_setprio(cwq->thread, current->normal_prio);
cwq->run_depth--;
spin_unlock_irq(&cwq->lock);
}

2008-08-15 20:32:20

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v4 5/8] RT: wrap the rt_rwlock "add reader" logic

We will use this later in the series to add PI functions on "add".

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/rtmutex.c | 16 +++++++++++-----
1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 12de859..62fdc3d 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -1122,6 +1122,12 @@ static void rw_check_held(struct rw_mutex *rwm)
# define rw_check_held(rwm) do { } while (0)
#endif

+static inline void
+rt_rwlock_add_reader(struct reader_lock_struct *rls, struct rw_mutex *rwm)
+{
+ list_add(&rls->list, &rwm->readers);
+}
+
/*
* The fast path does not add itself to the reader list to keep
* from needing to grab the spinlock. We need to add the owner
@@ -1163,7 +1169,7 @@ rt_rwlock_update_owner(struct rw_mutex *rwm, struct task_struct *own)
if (rls->list.prev && !list_empty(&rls->list))
return;

- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);

/* change to reader, so no one else updates too */
rt_rwlock_set_owner(rwm, RT_RW_READER, RT_RWLOCK_CHECK);
@@ -1197,7 +1203,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
* it hasn't been added to the link list yet.
*/
if (!rls->list.prev || list_empty(&rls->list))
- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
rt_rwlock_set_owner(rwm, RT_RW_READER, 0);
rls->count++;
incr = 0;
@@ -1276,7 +1282,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
rls->lock = rwm;
rls->count = 1;
WARN_ON(rls->list.prev && !list_empty(&rls->list));
- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
} else
WARN_ON_ONCE(1);
spin_unlock(&current->pi_lock);
@@ -1473,7 +1479,7 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
spin_lock(&mutex->wait_lock);
rls = &current->owned_read_locks[reader_count];
if (!rls->list.prev || list_empty(&rls->list))
- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
spin_unlock(&mutex->wait_lock);
} else
spin_unlock(&current->pi_lock);
@@ -2083,7 +2089,7 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)

/* Set us up for multiple readers or conflicts */

- list_add(&rls->list, &rwm->readers);
+ rt_rwlock_add_reader(rls, rwm);
rwm->owner = RT_RW_READER;

/*

2008-08-15 20:31:58

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v4 4/8] rtmutex: formally initialize the rt_mutex_waiters

We will be adding more logic to rt_mutex_waiters, and therefore let's
centralize the initialization to make this easier going forward.

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/rtmutex.c | 26 ++++++++++++++------------
1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 7d11380..12de859 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -805,6 +805,15 @@ static int adaptive_wait(struct rt_mutex_waiter *waiter,
}
#endif

+static void init_waiter(struct rt_mutex_waiter *waiter)
+{
+ memset(waiter, 0, sizeof(*waiter));
+
+ debug_rt_mutex_init_waiter(waiter);
+ waiter->task = NULL;
+ waiter->write_lock = 0;
+}
+
/*
* Slow path lock function spin_lock style: this variant is very
* careful not to miss any non-lock wakeups.
@@ -823,9 +832,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
struct task_struct *orig_owner;
int missed = 0;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
- waiter.write_lock = 0;
+ init_waiter(&waiter);

spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);
@@ -1324,6 +1331,8 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
int saved_lock_depth = -1;
unsigned long saved_state = -1, state, flags;

+ init_waiter(&waiter);
+
spin_lock_irqsave(&mutex->wait_lock, flags);
init_rw_lists(rwm);

@@ -1335,10 +1344,6 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)

/* Owner is a writer (or a blocked writer). Block on the lock */

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
- waiter.write_lock = 0;
-
if (mtx) {
/*
* We drop the BKL here before we go into the wait loop to avoid a
@@ -1538,8 +1543,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
int saved_lock_depth = -1;
unsigned long flags, saved_state = -1, state;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
+ init_waiter(&waiter);

/* we do PI different for writers that are blocked */
waiter.write_lock = 1;
@@ -2270,9 +2274,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
struct rt_mutex_waiter waiter;
unsigned long flags;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
- waiter.write_lock = 0;
+ init_waiter(&waiter);

spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);

2008-08-15 20:31:42

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v4 3/8] sched: rework task reference counting to work with the pi infrastructure

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/sched.h | 7 +++++--
kernel/fork.c | 32 +++++++++++++++-----------------
kernel/sched.c | 21 +++++++++++++++++++++
3 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5521a64..7ae8eca 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1305,6 +1305,8 @@ struct task_struct {
struct pi_node node;
struct pi_sink sink; /* registered to 'this' to get updates */
int prio;
+ struct rcu_head rcu; /* for destruction cleanup */
+
} pi;

#ifdef CONFIG_RT_MUTEXES
@@ -1633,12 +1635,11 @@ static inline void put_task_struct(struct task_struct *t)
call_rcu(&t->rcu, __put_task_struct_cb);
}
#else
-extern void __put_task_struct(struct task_struct *t);

static inline void put_task_struct(struct task_struct *t)
{
if (atomic_dec_and_test(&t->usage))
- __put_task_struct(t);
+ pi_put(&t->pi.node, 0);
}
#endif

@@ -2469,6 +2470,8 @@ static inline int task_is_current(struct task_struct *task)
}
#endif

+extern void prepare_free_task(struct task_struct *tsk);
+
#define TASK_STATE_TO_CHAR_STR "RMSDTtZX"

#endif /* __KERNEL__ */
diff --git a/kernel/fork.c b/kernel/fork.c
index 399a0d0..4d4fba3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -130,39 +130,37 @@ void free_task(struct task_struct *tsk)
}
EXPORT_SYMBOL(free_task);

-#ifdef CONFIG_PREEMPT_RT
-void __put_task_struct_cb(struct rcu_head *rhp)
+void prepare_free_task(struct task_struct *tsk)
{
- struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
-
BUG_ON(atomic_read(&tsk->usage));
- WARN_ON(!tsk->exit_state);
WARN_ON(tsk == current);

+#ifdef CONFIG_PREEMPT_RT
+ WARN_ON(!tsk->exit_state);
+#else
+ WARN_ON(!(tsk->exit_state & (EXIT_DEAD | EXIT_ZOMBIE)));
+#endif
+
security_task_free(tsk);
free_uid(tsk->user);
put_group_info(tsk->group_info);
+
+#ifdef CONFIG_PREEMPT_RT
delayacct_tsk_free(tsk);
+#endif

if (!profile_handoff_task(tsk))
free_task(tsk);
}

-#else
-
-void __put_task_struct(struct task_struct *tsk)
+#ifdef CONFIG_PREEMPT_RT
+void __put_task_struct_cb(struct rcu_head *rhp)
{
- WARN_ON(!(tsk->exit_state & (EXIT_DEAD | EXIT_ZOMBIE)));
- BUG_ON(atomic_read(&tsk->usage));
- WARN_ON(tsk == current);
-
- security_task_free(tsk);
- free_uid(tsk->user);
- put_group_info(tsk->group_info);
+ struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

- if (!profile_handoff_task(tsk))
- free_task(tsk);
+ pi_put(&tsk->pi.node, 0);
}
+
#endif

/*
diff --git a/kernel/sched.c b/kernel/sched.c
index 0732a9b..eb14b9f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2370,11 +2370,32 @@ task_pi_boost_cb(struct pi_sink *sink, struct pi_source *src,
return 0;
}

+static void task_pi_free_rcu(struct rcu_head *rhp)
+{
+ struct task_struct *tsk = container_of(rhp, struct task_struct, pi.rcu);
+
+ prepare_free_task(tsk);
+}
+
+/*
+ * This function is invoked whenever the last references to a task have
+ * been dropped, and we should free the memory on the next rcu grace period
+ */
+static int task_pi_free_cb(struct pi_sink *sink, unsigned int flags)
+{
+ struct task_struct *p = container_of(sink, struct task_struct, pi.sink);
+
+ call_rcu(&p->pi.rcu, task_pi_free_rcu);
+
+ return 0;
+}
+
static int task_pi_update_cb(struct pi_sink *sink, unsigned int flags);

static struct pi_sink_ops task_pi_sink = {
.boost = task_pi_boost_cb,
.update = task_pi_update_cb,
+ .free = task_pi_free_cb,
};

static inline void

2008-08-15 20:32:39

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v4 6/8] rtmutex: use runtime init for rtmutexes

The system already has facilities to perform late/run-time init for
rtmutexes. We want to add more advanced initialization later in the
series, so we force all rtmutexes through the init path in preparation
for the later patches.
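
For reference, a statically defined lock now relies entirely on that
runtime path. A minimal sketch of what this means for a user
("example_lock" and example() are made-up names, not from this patch):

	/* the static definition no longer pre-builds wait_list/owner */
	static DEFINE_RT_MUTEX(example_lock);

	void example(void)
	{
		/*
		 * init_lists() in kernel/rtmutex.c performs the late
		 * initialization the first time the lock hits the
		 * slowpath; explicit users can also run
		 * rt_mutex_init()/__rt_mutex_init() up front.
		 */
		rt_mutex_lock(&example_lock);
		rt_mutex_unlock(&example_lock);
	}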

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/rtmutex.h | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index b263bac..14774ce 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -64,8 +64,6 @@ struct hrtimer_sleeper;

#define __RT_MUTEX_INITIALIZER(mutexname) \
{ .wait_lock = RAW_SPIN_LOCK_UNLOCKED(mutexname) \
- , .wait_list = PLIST_HEAD_INIT(mutexname.wait_list, &mutexname.wait_lock) \
- , .owner = NULL \
__DEBUG_RT_MUTEX_INITIALIZER(mutexname)}

#define DEFINE_RT_MUTEX(mutexname) \

2008-08-15 20:32:56

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v4 7/8] rtmutex: convert rtmutexes to fully use the PI library

We have previously only laid some of the groundwork to use the PI
library, but left the existing infrastructure in place in the
rtmutex code. This patch converts the rtmutex PI code to officially
use the PI library.
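
The core of the conversion is visible in the kernel/rtmutex.c hunks
below: each rt_mutex now owns a pi_node plus a source/sink pair, and
boosts/deboosts the current owner through task_pi_boost() and
task_pi_deboost(). The registration pattern, lifted from the patch
with comments added:

	static struct pi_sink_ops rtmutex_pi_sink = {
		.boost  = rtmutex_pi_boost,	/* record new top-waiter prio */
		.update = rtmutex_pi_update,	/* propagate it to the owner */
	};

	static void init_pi(struct rt_mutex *lock)
	{
		pi_node_init(&lock->pi.node);

		lock->pi.prio = MAX_PRIO;	/* unboosted until waiters arrive */
		pi_source_init(&lock->pi.src, &lock->pi.prio);
		pi_sink_init(&lock->pi.sink, &rtmutex_pi_sink);

		/* the lock listens on its own node for priority changes */
		pi_add_sink(&lock->pi.node, &lock->pi.sink,
			    PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
	}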

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/rt_lock.h | 2
include/linux/rtmutex.h | 15 -
include/linux/sched.h | 21 -
kernel/fork.c | 2
kernel/rcupreempt-boost.c | 2
kernel/rtmutex-debug.c | 4
kernel/rtmutex-tester.c | 4
kernel/rtmutex.c | 944 ++++++++++++++-------------------------------
kernel/rtmutex_common.h | 18 -
kernel/rwlock_torture.c | 32 --
kernel/sched.c | 12 -
11 files changed, 321 insertions(+), 735 deletions(-)

diff --git a/include/linux/rt_lock.h b/include/linux/rt_lock.h
index c00cfb3..c5da71d 100644
--- a/include/linux/rt_lock.h
+++ b/include/linux/rt_lock.h
@@ -14,6 +14,7 @@
#include <asm/atomic.h>
#include <linux/spinlock_types.h>
#include <linux/sched_prio.h>
+#include <linux/pi.h>

#ifdef CONFIG_PREEMPT_RT
/*
@@ -67,6 +68,7 @@ struct rw_mutex {
atomic_t count; /* number of times held for read */
atomic_t owners; /* number of owners as readers */
struct list_head readers;
+ struct pi_sink pi_sink;
int prio;
};

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 14774ce..e069182 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -15,6 +15,7 @@
#include <linux/linkage.h>
#include <linux/plist.h>
#include <linux/spinlock_types.h>
+#include <linux/pi.h>

/**
* The rt_mutex structure
@@ -27,6 +28,12 @@ struct rt_mutex {
raw_spinlock_t wait_lock;
struct plist_head wait_list;
struct task_struct *owner;
+ struct {
+ struct pi_source src;
+ struct pi_node node;
+ struct pi_sink sink;
+ int prio;
+ } pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
int save_state;
const char *name, *file;
@@ -96,12 +103,4 @@ extern int rt_mutex_trylock(struct rt_mutex *lock);

extern void rt_mutex_unlock(struct rt_mutex *lock);

-#ifdef CONFIG_RT_MUTEXES
-# define INIT_RT_MUTEXES(tsk) \
- .pi_waiters = PLIST_HEAD_INIT(tsk.pi_waiters, &tsk.pi_lock), \
- INIT_RT_MUTEX_DEBUG(tsk)
-#else
-# define INIT_RT_MUTEXES(tsk)
-#endif
-
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7ae8eca..6462846 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1106,6 +1106,7 @@ struct reader_lock_struct {
struct rw_mutex *lock;
struct list_head list;
struct task_struct *task;
+ struct pi_source pi_src;
int count;
};

@@ -1309,15 +1310,6 @@ struct task_struct {

} pi;

-#ifdef CONFIG_RT_MUTEXES
- /* PI waiters blocked on a rt_mutex held by this task */
- struct plist_head pi_waiters;
- /* Deadlock detection and priority inheritance handling */
- struct rt_mutex_waiter *pi_blocked_on;
- int rtmutex_prio;
- struct pi_source rtmutex_prio_src;
-#endif
-
#ifdef CONFIG_DEBUG_MUTEXES
/* mutex deadlock detection */
struct mutex_waiter *blocked_on;
@@ -1806,17 +1798,6 @@ int sched_rt_handler(struct ctl_table *table, int write,

extern unsigned int sysctl_sched_compat_yield;

-#ifdef CONFIG_RT_MUTEXES
-extern int rt_mutex_getprio(struct task_struct *p);
-extern void rt_mutex_adjust_pi(struct task_struct *p);
-#else
-static inline int rt_mutex_getprio(struct task_struct *p)
-{
- return p->normal_prio;
-}
-# define rt_mutex_adjust_pi(p) do { } while (0)
-#endif
-
extern void set_user_nice(struct task_struct *p, long nice);
extern int task_prio(const struct task_struct *p);
extern int task_nice(const struct task_struct *p);
diff --git a/kernel/fork.c b/kernel/fork.c
index 4d4fba3..79ba6fb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -885,8 +885,6 @@ static void rt_mutex_init_task(struct task_struct *p)
{
spin_lock_init(&p->pi_lock);
#ifdef CONFIG_RT_MUTEXES
- plist_head_init(&p->pi_waiters, &p->pi_lock);
- p->pi_blocked_on = NULL;
# ifdef CONFIG_DEBUG_RT_MUTEXES
p->last_kernel_lock = NULL;
# endif
diff --git a/kernel/rcupreempt-boost.c b/kernel/rcupreempt-boost.c
index e8d9d76..85b3c2b 100644
--- a/kernel/rcupreempt-boost.c
+++ b/kernel/rcupreempt-boost.c
@@ -424,7 +424,7 @@ void rcu_boost_readers(void)

spin_lock_irqsave(&rcu_boost_wake_lock, flags);

- prio = rt_mutex_getprio(curr);
+ prio = get_rcu_prio(curr);

rcu_trace_boost_try_boost_readers(RCU_BOOST_ME);

diff --git a/kernel/rtmutex-debug.c b/kernel/rtmutex-debug.c
index 0d9cb54..2034ce1 100644
--- a/kernel/rtmutex-debug.c
+++ b/kernel/rtmutex-debug.c
@@ -57,8 +57,6 @@ static void printk_lock(struct rt_mutex *lock, int print_owner)

void rt_mutex_debug_task_free(struct task_struct *task)
{
- DEBUG_LOCKS_WARN_ON(!plist_head_empty(&task->pi_waiters));
- DEBUG_LOCKS_WARN_ON(task->pi_blocked_on);
#ifdef CONFIG_PREEMPT_RT
WARN_ON(task->reader_lock_count);
#endif
@@ -156,7 +154,6 @@ void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
{
memset(waiter, 0x11, sizeof(*waiter));
plist_node_init(&waiter->list_entry, MAX_PRIO);
- plist_node_init(&waiter->pi_list_entry, MAX_PRIO);
waiter->deadlock_task_pid = NULL;
}

@@ -164,7 +161,6 @@ void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)
{
put_pid(waiter->deadlock_task_pid);
DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->list_entry));
- DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
DEBUG_LOCKS_WARN_ON(waiter->task);
memset(waiter, 0x22, sizeof(*waiter));
}
diff --git a/kernel/rtmutex-tester.c b/kernel/rtmutex-tester.c
index 092e4c6..dff8781 100644
--- a/kernel/rtmutex-tester.c
+++ b/kernel/rtmutex-tester.c
@@ -373,11 +373,11 @@ static ssize_t sysfs_test_status(struct sys_device *dev, char *buf)
spin_lock(&rttest_lock);

curr += sprintf(curr,
- "O: %4d, E:%8d, S: 0x%08lx, P: %4d, N: %4d, B: %p, K: %d, M:",
+ "O: %4d, E:%8d, S: 0x%08lx, P: %4d, N: %4d, K: %d M:",
td->opcode, td->event, tsk->state,
(MAX_RT_PRIO - 1) - tsk->prio,
(MAX_RT_PRIO - 1) - tsk->normal_prio,
- tsk->pi_blocked_on, td->bkl);
+ td->bkl);

for (i = MAX_RT_TEST_MUTEXES - 1; i >=0 ; i--)
curr += sprintf(curr, "%d", td->mutexes[i]);
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 62fdc3d..8acbf23 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -58,14 +58,32 @@
* state.
*/

+static inline void
+rtmutex_pi_owner(struct rt_mutex *lock, struct task_struct *p, int add)
+{
+ if (!p || p == RT_RW_READER)
+ return;
+
+ if (add)
+ task_pi_boost(p, &lock->pi.src, PI_FLAG_DEFER_UPDATE);
+ else
+ task_pi_deboost(p, &lock->pi.src, PI_FLAG_DEFER_UPDATE);
+}
+
static void
rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner,
unsigned long mask)
{
unsigned long val = (unsigned long)owner | mask;

- if (rt_mutex_has_waiters(lock))
+ if (rt_mutex_has_waiters(lock)) {
+ struct task_struct *prev_owner = rt_mutex_owner(lock);
+
+ rtmutex_pi_owner(lock, prev_owner, 0);
+ rtmutex_pi_owner(lock, owner, 1);
+
val |= RT_MUTEX_HAS_WAITERS;
+ }

lock->owner = (struct task_struct *)val;
}
@@ -134,245 +152,88 @@ static inline int task_is_reader(struct task_struct *task) { return 0; }
#endif

int pi_initialized;
-
-/*
- * we initialize the wait_list runtime. (Could be done build-time and/or
- * boot-time.)
- */
-static inline void init_lists(struct rt_mutex *lock)
+static inline int rtmutex_pi_boost(struct pi_sink *sink,
+ struct pi_source *src,
+ unsigned int flags)
{
- if (unlikely(!lock->wait_list.prio_list.prev)) {
- plist_head_init(&lock->wait_list, &lock->wait_lock);
-#ifdef CONFIG_DEBUG_RT_MUTEXES
- pi_initialized++;
-#endif
- }
-}
-
-static int rt_mutex_get_readers_prio(struct task_struct *task, int prio);
-
-/*
- * Calculate task priority from the waiter list priority
- *
- * Return task->normal_prio when the waiter list is empty or when
- * the waiter is not allowed to do priority boosting
- */
-int rt_mutex_getprio(struct task_struct *task)
-{
- int prio = min(task->normal_prio, get_rcu_prio(task));
-
- prio = rt_mutex_get_readers_prio(task, prio);
-
- if (likely(!task_has_pi_waiters(task)))
- return prio;
-
- return min(task_top_pi_waiter(task)->pi_list_entry.prio, prio);
-}
+ struct rt_mutex *lock = container_of(sink, struct rt_mutex, pi.sink);

-/*
- * Adjust the priority of a task, after its pi_waiters got modified.
- *
- * This can be both boosting and unboosting. task->pi_lock must be held.
- */
-static void __rt_mutex_adjust_prio(struct task_struct *task)
-{
- int prio = rt_mutex_getprio(task);
-
- if (task->rtmutex_prio != prio) {
- task->rtmutex_prio = prio;
- task_pi_boost(task, &task->rtmutex_prio_src, 0);
- }
-}
-
-/*
- * Adjust task priority (undo boosting). Called from the exit path of
- * rt_mutex_slowunlock() and rt_mutex_slowlock().
- *
- * (Note: We do this outside of the protection of lock->wait_lock to
- * allow the lock to be taken while or before we readjust the priority
- * of task. We do not use the spin_xx_mutex() variants here as we are
- * outside of the debug path.)
- */
-static void rt_mutex_adjust_prio(struct task_struct *task)
-{
- unsigned long flags;
+ /*
+ * We don't need to take any locks here because the
+ * lock->pi.node interlock is already guaranteeing mutual
+ * exclusion.
+ */
+ lock->pi.prio = *src->prio;

- spin_lock_irqsave(&task->pi_lock, flags);
- __rt_mutex_adjust_prio(task);
- spin_unlock_irqrestore(&task->pi_lock, flags);
+ return 0;
}

-/*
- * Max number of times we'll walk the boosting chain:
- */
-int max_lock_depth = 1024;
-
-static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- struct rt_mutex *lock,
- int recursion_depth);
-/*
- * Adjust the priority chain. Also used for deadlock detection.
- * Decreases task's usage by one - may thus free the task.
- * Returns 0 or -EDEADLK.
- */
-static int rt_mutex_adjust_prio_chain(struct task_struct *task,
- int deadlock_detect,
- struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- int recursion_depth)
+static inline int rtmutex_pi_update(struct pi_sink *sink,
+ unsigned int flags)
{
- struct rt_mutex *lock;
- struct rt_mutex_waiter *waiter, *top_waiter = orig_waiter;
- int detect_deadlock, ret = 0, depth = 0;
- unsigned long flags;
+ struct rt_mutex *lock = container_of(sink, struct rt_mutex, pi.sink);
+ struct task_struct *owner = NULL;
+ unsigned long iflags;

- detect_deadlock = debug_rt_mutex_detect_deadlock(orig_waiter,
- deadlock_detect);
+ spin_lock_irqsave(&lock->wait_lock, iflags);

- /*
- * The (de)boosting is a step by step approach with a lot of
- * pitfalls. We want this to be preemptible and we want hold a
- * maximum of two locks per step. So we have to check
- * carefully whether things change under us.
- */
- again:
- if (++depth > max_lock_depth) {
- static int prev_max;
+ if (rt_mutex_has_waiters(lock)) {
+ owner = rt_mutex_owner(lock);

- /*
- * Print this only once. If the admin changes the limit,
- * print a new message when reaching the limit again.
- */
- if (prev_max != max_lock_depth) {
- prev_max = max_lock_depth;
- printk(KERN_WARNING "Maximum lock depth %d reached "
- "task: %s (%d)\n", max_lock_depth,
- top_task->comm, task_pid_nr(top_task));
+ if (owner && owner != RT_RW_READER) {
+ rtmutex_pi_owner(lock, owner, 1);
+ get_task_struct(owner);
}
- put_task_struct(task);
-
- return deadlock_detect ? -EDEADLK : 0;
}
- retry:
- /*
- * Task can not go away as we did a get_task() before !
- */
- spin_lock_irqsave(&task->pi_lock, flags);

- waiter = task->pi_blocked_on;
- /*
- * Check whether the end of the boosting chain has been
- * reached or the state of the chain has changed while we
- * dropped the locks.
- */
- if (!waiter || !waiter->task)
- goto out_unlock_pi;
-
- /*
- * Check the orig_waiter state. After we dropped the locks,
- * the previous owner of the lock might have released the lock
- * and made us the pending owner:
- */
- if (orig_waiter && !orig_waiter->task)
- goto out_unlock_pi;
-
- /*
- * Drop out, when the task has no waiters. Note,
- * top_waiter can be NULL, when we are in the deboosting
- * mode!
- */
- if (top_waiter && (!task_has_pi_waiters(task) ||
- top_waiter != task_top_pi_waiter(task)))
- goto out_unlock_pi;
-
- /*
- * When deadlock detection is off then we check, if further
- * priority adjustment is necessary.
- */
- if (!detect_deadlock && waiter->list_entry.prio == task->prio)
- goto out_unlock_pi;
+ spin_unlock_irqrestore(&lock->wait_lock, iflags);

- lock = waiter->lock;
- if (!spin_trylock(&lock->wait_lock)) {
- spin_unlock_irqrestore(&task->pi_lock, flags);
- cpu_relax();
- goto retry;
+ if (owner && owner != RT_RW_READER) {
+ task_pi_update(owner, 0);
+ put_task_struct(owner);
}

- /* Deadlock detection */
- if (lock == orig_lock || rt_mutex_owner(lock) == top_task) {
- debug_rt_mutex_deadlock(deadlock_detect, orig_waiter, lock);
- spin_unlock(&lock->wait_lock);
- ret = deadlock_detect ? -EDEADLK : 0;
- goto out_unlock_pi;
- }
+ return 0;
+}

- top_waiter = rt_mutex_top_waiter(lock);
+static struct pi_sink_ops rtmutex_pi_sink = {
+ .boost = rtmutex_pi_boost,
+ .update = rtmutex_pi_update,
+};

- /* Requeue the waiter */
- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->list_entry.prio = task->prio;
- plist_add(&waiter->list_entry, &lock->wait_list);
-
- /* Release the task */
- spin_unlock(&task->pi_lock);
- put_task_struct(task);
+static void init_pi(struct rt_mutex *lock)
+{
+ pi_node_init(&lock->pi.node);

- /* Grab the next task */
- task = rt_mutex_owner(lock);
+ lock->pi.prio = MAX_PRIO;
+ pi_source_init(&lock->pi.src, &lock->pi.prio);
+ pi_sink_init(&lock->pi.sink, &rtmutex_pi_sink);

- /*
- * Readers are special. We may need to boost more than one owner.
- */
- if (task_is_reader(task)) {
- ret = rt_mutex_adjust_readers(orig_lock, orig_waiter,
- top_task, lock,
- recursion_depth);
- spin_unlock_irqrestore(&lock->wait_lock, flags);
- goto out;
- }
+ pi_add_sink(&lock->pi.node, &lock->pi.sink,
+ PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
+}

- get_task_struct(task);
- spin_lock(&task->pi_lock);
-
- if (waiter == rt_mutex_top_waiter(lock)) {
- /* Boost the owner */
- plist_del(&top_waiter->pi_list_entry, &task->pi_waiters);
- waiter->pi_list_entry.prio = waiter->list_entry.prio;
- plist_add(&waiter->pi_list_entry, &task->pi_waiters);
- __rt_mutex_adjust_prio(task);
-
- } else if (top_waiter == waiter) {
- /* Deboost the owner */
- plist_del(&waiter->pi_list_entry, &task->pi_waiters);
- waiter = rt_mutex_top_waiter(lock);
- waiter->pi_list_entry.prio = waiter->list_entry.prio;
- plist_add(&waiter->pi_list_entry, &task->pi_waiters);
- __rt_mutex_adjust_prio(task);
+/*
+ * we initialize the wait_list runtime. (Could be done build-time and/or
+ * boot-time.)
+ */
+static inline void init_lists(struct rt_mutex *lock)
+{
+ if (unlikely(!lock->wait_list.prio_list.prev)) {
+ plist_head_init(&lock->wait_list, &lock->wait_lock);
+ init_pi(lock);
+#ifdef CONFIG_DEBUG_RT_MUTEXES
+ pi_initialized++;
+#endif
}
-
- spin_unlock(&task->pi_lock);
-
- top_waiter = rt_mutex_top_waiter(lock);
- spin_unlock_irqrestore(&lock->wait_lock, flags);
-
- if (!detect_deadlock && waiter != top_waiter)
- goto out_put_task;
-
- goto again;
-
- out_unlock_pi:
- spin_unlock_irqrestore(&task->pi_lock, flags);
- out_put_task:
- put_task_struct(task);
- out:
- return ret;
}

/*
+ * Max number of times we'll walk the boosting chain:
+ */
+int max_lock_depth = 1024;
+
+/*
* Optimization: check if we can steal the lock from the
* assigned pending owner [which might not have taken the
* lock yet]:
@@ -380,7 +241,6 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
static inline int try_to_steal_lock(struct rt_mutex *lock, int mode)
{
struct task_struct *pendowner = rt_mutex_owner(lock);
- struct rt_mutex_waiter *next;

if (!rt_mutex_owner_pending(lock))
return 0;
@@ -390,49 +250,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock, int mode)

WARN_ON(task_is_reader(rt_mutex_owner(lock)));

- spin_lock(&pendowner->pi_lock);
- if (!lock_is_stealable(pendowner, mode)) {
- spin_unlock(&pendowner->pi_lock);
- return 0;
- }
-
- /*
- * Check if a waiter is enqueued on the pending owners
- * pi_waiters list. Remove it and readjust pending owners
- * priority.
- */
- if (likely(!rt_mutex_has_waiters(lock))) {
- spin_unlock(&pendowner->pi_lock);
- return 1;
- }
-
- /* No chain handling, pending owner is not blocked on anything: */
- next = rt_mutex_top_waiter(lock);
- plist_del(&next->pi_list_entry, &pendowner->pi_waiters);
- __rt_mutex_adjust_prio(pendowner);
- spin_unlock(&pendowner->pi_lock);
-
- /*
- * We are going to steal the lock and a waiter was
- * enqueued on the pending owners pi_waiters queue. So
- * we have to enqueue this waiter into
- * current->pi_waiters list. This covers the case,
- * where current is boosted because it holds another
- * lock and gets unboosted because the booster is
- * interrupted, so we would delay a waiter with higher
- * priority as current->normal_prio.
- *
- * Note: in the rare case of a SCHED_OTHER task changing
- * its priority and thus stealing the lock, next->task
- * might be current:
- */
- if (likely(next->task != current)) {
- spin_lock(&current->pi_lock);
- plist_add(&next->pi_list_entry, &current->pi_waiters);
- __rt_mutex_adjust_prio(current);
- spin_unlock(&current->pi_lock);
- }
- return 1;
+ return lock_is_stealable(pendowner, mode);
}

/*
@@ -486,74 +304,145 @@ static inline int try_to_take_rt_mutex(struct rt_mutex *lock)
}

/*
- * Task blocks on lock.
- *
- * Prepare waiter and propagate pi chain
- *
- * This must be called with lock->wait_lock held.
+ * These callbacks are invoked whenever a waiter has changed priority.
+ * So we should requeue it within the lock->wait_list
*/
-static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter,
- int detect_deadlock, unsigned long flags)
+
+static inline int rtmutex_waiter_pi_boost(struct pi_sink *sink,
+ struct pi_source *src,
+ unsigned int flags)
{
- struct task_struct *owner = rt_mutex_owner(lock);
- struct rt_mutex_waiter *top_waiter = waiter;
- int chain_walk = 0, res;
+ struct rt_mutex_waiter *waiter;

- spin_lock(&current->pi_lock);
- __rt_mutex_adjust_prio(current);
- waiter->task = current;
- waiter->lock = lock;
- plist_node_init(&waiter->list_entry, current->prio);
- plist_node_init(&waiter->pi_list_entry, current->prio);
+ waiter = container_of(sink, struct rt_mutex_waiter, pi.sink);

- /* Get the top priority waiter on the lock */
- if (rt_mutex_has_waiters(lock))
- top_waiter = rt_mutex_top_waiter(lock);
- plist_add(&waiter->list_entry, &lock->wait_list);
+ /*
+ * We don't need to take any locks here because the
+ * waiter->pi.node interlock is already guaranteeing mutual
+ * exclusion.
+ */
+ waiter->pi.prio = *src->prio;

- current->pi_blocked_on = waiter;
+ return 0;
+}

- spin_unlock(&current->pi_lock);
+static inline int rtmutex_waiter_pi_update(struct pi_sink *sink,
+ unsigned int flags)
+{
+ struct rt_mutex *lock;
+ struct rt_mutex_waiter *waiter;
+ unsigned long iflags;

- if (waiter == rt_mutex_top_waiter(lock)) {
- /* readers are handled differently */
- if (task_is_reader(owner)) {
- res = rt_mutex_adjust_readers(lock, waiter,
- current, lock, 0);
- return res;
- }
+ waiter = container_of(sink, struct rt_mutex_waiter, pi.sink);
+ lock = waiter->lock;

- spin_lock(&owner->pi_lock);
- plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters);
- plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
+ spin_lock_irqsave(&lock->wait_lock, iflags);

- __rt_mutex_adjust_prio(owner);
- if (owner->pi_blocked_on)
- chain_walk = 1;
- spin_unlock(&owner->pi_lock);
+ /*
+ * If waiter->task is non-NULL, it means we are still valid in the
+ * pi list. Therefore, if waiter->pi.prio has changed since we
+ * queued ourselves, requeue it.
+ */
+ if (waiter->task && waiter->list_entry.prio != waiter->pi.prio) {
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
}
- else if (debug_rt_mutex_detect_deadlock(waiter, detect_deadlock))
- chain_walk = 1;

- if (!chain_walk || task_is_reader(owner))
- return 0;
+ spin_unlock_irqrestore(&lock->wait_lock, iflags);
+
+ return 0;
+}
+
+static struct pi_sink_ops rtmutex_waiter_pi_sink = {
+ .boost = rtmutex_waiter_pi_boost,
+ .update = rtmutex_waiter_pi_update,
+};
+
+/*
+ * This must be called with lock->wait_lock held.
+ */
+static int add_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ unsigned long *flags)
+{
+ int has_waiters = rt_mutex_has_waiters(lock);
+
+ waiter->task = current;
+ waiter->lock = lock;
+ waiter->pi.prio = current->prio;
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
+ pi_sink_init(&waiter->pi.sink, &rtmutex_waiter_pi_sink);

/*
- * The owner can't disappear while holding a lock,
- * so the owner struct is protected by wait_lock.
- * Gets dropped in rt_mutex_adjust_prio_chain()!
+ * Link the waiter object to the task so that we can adjust our
+ * position on the prio list if the priority is changed. Note
+ * that if the priority races between the time we recorded it
+ * above and the time it is set here, we will correct the race
+ * when we task_pi_update(current) below. Otherwise the
+ * update is a no-op.
*/
- get_task_struct(owner);
+ pi_add_sink(&current->pi.node, &waiter->pi.sink,
+ PI_FLAG_DEFER_UPDATE);

- spin_unlock_irqrestore(&lock->wait_lock, flags);
+ /*
+ * Link the lock object to the waiter so that we can form a chain
+ * to the owner
+ */
+ pi_add_sink(&current->pi.node, &lock->pi.node.sink,
+ PI_FLAG_DEFER_UPDATE);

- res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter,
- current, 0);
+ /*
+ * If we previously had no waiters, we are transitioning to
+ * a mode where we need to boost the owner
+ */
+ if (!has_waiters) {
+ struct task_struct *owner = rt_mutex_owner(lock);
+ rtmutex_pi_owner(lock, owner, 1);
+ }

- spin_lock_irq(&lock->wait_lock);
+ spin_unlock_irqrestore(&lock->wait_lock, *flags);
+ task_pi_update(current, 0);
+ spin_lock_irqsave(&lock->wait_lock, *flags);
+
+ return 0;
+}
+
+/*
+ * Remove a waiter from a lock
+ *
+ * Must be called with lock->wait_lock held
+ */
+static void remove_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ struct task_struct *p = waiter->task;
+
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ waiter->task = NULL;
+
+ /*
+ * We can stop boosting the owner if there are no more waiters
+ */
+ if (!rt_mutex_has_waiters(lock)) {
+ struct task_struct *owner = rt_mutex_owner(lock);
+ rtmutex_pi_owner(lock, owner, 0);
+ }

- return res;
+ /*
+ * Unlink the lock object from the waiter
+ */
+ pi_del_sink(&p->pi.node, &lock->pi.node.sink, PI_FLAG_DEFER_UPDATE);
+
+ /*
+ * Unlink the waiter object from the task. Note that we
+ * technically do not need an update for "p" because the
+ * .deboost will be processed synchronous to this call
+ * since there is no .deboost handler registered for
+ * the waiter sink
+ */
+ pi_del_sink(&p->pi.node, &waiter->pi.sink, PI_FLAG_DEFER_UPDATE);
}

/*
@@ -566,24 +455,10 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
*/
static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
{
- struct rt_mutex_waiter *waiter;
- struct task_struct *pendowner;
- struct rt_mutex_waiter *next;
-
- spin_lock(&current->pi_lock);
+ struct rt_mutex_waiter *waiter = rt_mutex_top_waiter(lock);
+ struct task_struct *pendowner = waiter->task;

- waiter = rt_mutex_top_waiter(lock);
- plist_del(&waiter->list_entry, &lock->wait_list);
-
- /*
- * Remove it from current->pi_waiters. We do not adjust a
- * possible priority boost right now. We execute wakeup in the
- * boosted mode and go back to normal after releasing
- * lock->wait_lock.
- */
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
- pendowner = waiter->task;
- waiter->task = NULL;
+ remove_waiter(lock, waiter);

/*
* Do the wakeup before the ownership change to give any spinning
@@ -621,113 +496,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
}

rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
-
- spin_unlock(&current->pi_lock);
-
- /*
- * Clear the pi_blocked_on variable and enqueue a possible
- * waiter into the pi_waiters list of the pending owner. This
- * prevents that in case the pending owner gets unboosted a
- * waiter with higher priority than pending-owner->normal_prio
- * is blocked on the unboosted (pending) owner.
- */
-
- if (rt_mutex_has_waiters(lock))
- next = rt_mutex_top_waiter(lock);
- else
- next = NULL;
-
- spin_lock(&pendowner->pi_lock);
-
- WARN_ON(!pendowner->pi_blocked_on);
- WARN_ON(pendowner->pi_blocked_on != waiter);
- WARN_ON(pendowner->pi_blocked_on->lock != lock);
-
- pendowner->pi_blocked_on = NULL;
-
- if (next)
- plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
-
- spin_unlock(&pendowner->pi_lock);
-}
-
-/*
- * Remove a waiter from a lock
- *
- * Must be called with lock->wait_lock held
- */
-static void remove_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter,
- unsigned long flags)
-{
- int first = (waiter == rt_mutex_top_waiter(lock));
- struct task_struct *owner = rt_mutex_owner(lock);
- int chain_walk = 0;
-
- spin_lock(&current->pi_lock);
- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->task = NULL;
- current->pi_blocked_on = NULL;
- spin_unlock(&current->pi_lock);
-
- if (first && owner != current && !task_is_reader(owner)) {
-
- spin_lock(&owner->pi_lock);
-
- plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
-
- if (rt_mutex_has_waiters(lock)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(lock);
- plist_add(&next->pi_list_entry, &owner->pi_waiters);
- }
- __rt_mutex_adjust_prio(owner);
-
- if (owner->pi_blocked_on)
- chain_walk = 1;
-
- spin_unlock(&owner->pi_lock);
- }
-
- WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
-
- if (!chain_walk)
- return;
-
- /* gets dropped in rt_mutex_adjust_prio_chain()! */
- get_task_struct(owner);
-
- spin_unlock_irqrestore(&lock->wait_lock, flags);
-
- rt_mutex_adjust_prio_chain(owner, 0, lock, NULL, current, 0);
-
- spin_lock_irq(&lock->wait_lock);
-}
-
-/*
- * Recheck the pi chain, in case we got a priority setting
- *
- * Called from sched_setscheduler
- */
-void rt_mutex_adjust_pi(struct task_struct *task)
-{
- struct rt_mutex_waiter *waiter;
- unsigned long flags;
-
- spin_lock_irqsave(&task->pi_lock, flags);
-
- waiter = task->pi_blocked_on;
- if (!waiter || waiter->list_entry.prio == task->prio) {
- spin_unlock_irqrestore(&task->pi_lock, flags);
- return;
- }
-
- /* gets dropped in rt_mutex_adjust_prio_chain()! */
- get_task_struct(task);
- spin_unlock_irqrestore(&task->pi_lock, flags);
-
- rt_mutex_adjust_prio_chain(task, 0, NULL, NULL, task, 0);
}

/*
@@ -869,7 +637,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* but the lock got stolen by an higher prio task.
*/
if (!waiter.task) {
- task_blocks_on_rt_mutex(lock, &waiter, 0, flags);
+ add_waiter(lock, &waiter, &flags);
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
@@ -917,7 +685,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* can end up with a non-NULL waiter.task:
*/
if (unlikely(waiter.task))
- remove_waiter(lock, &waiter, flags);
+ remove_waiter(lock, &waiter);
/*
* try_to_take_rt_mutex() sets the waiter bit
* unconditionally. We might have to fix that up:
@@ -927,6 +695,9 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
unlock:
spin_unlock_irqrestore(&lock->wait_lock, flags);

+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);
+
debug_rt_mutex_free_waiter(&waiter);
}

@@ -954,8 +725,8 @@ rt_spin_lock_slowunlock(struct rt_mutex *lock)

spin_unlock_irqrestore(&lock->wait_lock, flags);

- /* Undo pi boosting.when necessary */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

void __lockfunc rt_spin_lock(spinlock_t *lock)
@@ -1126,6 +897,9 @@ static inline void
rt_rwlock_add_reader(struct reader_lock_struct *rls, struct rw_mutex *rwm)
{
list_add(&rls->list, &rwm->readers);
+
+ pi_source_init(&rls->pi_src, &rwm->prio);
+ task_pi_boost(rls->task, &rls->pi_src, PI_FLAG_DEFER_UPDATE);
}

/*
@@ -1249,21 +1023,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
waiter = rt_mutex_top_waiter(mutex);
if (!lock_is_stealable(waiter->task, mode))
return 0;
- /*
- * The pending reader has PI waiters,
- * but we are taking the lock.
- * Remove the waiters from the pending owner.
- */
- spin_lock(&mtxowner->pi_lock);
- plist_del(&waiter->pi_list_entry, &mtxowner->pi_waiters);
- spin_unlock(&mtxowner->pi_lock);
}
- } else if (rt_mutex_has_waiters(mutex)) {
- /* Readers do things differently with respect to PI */
- waiter = rt_mutex_top_waiter(mutex);
- spin_lock(&current->pi_lock);
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
- spin_unlock(&current->pi_lock);
}
/* Readers never own the mutex */
rt_mutex_set_owner(mutex, RT_RW_READER, 0);
@@ -1275,7 +1035,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
if (incr) {
atomic_inc(&rwm->owners);
rw_check_held(rwm);
- spin_lock(&current->pi_lock);
+ preempt_disable();
reader_count = current->reader_lock_count++;
if (likely(reader_count < MAX_RWLOCK_DEPTH)) {
rls = &current->owned_read_locks[reader_count];
@@ -1285,10 +1045,11 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
rt_rwlock_add_reader(rls, rwm);
} else
WARN_ON_ONCE(1);
- spin_unlock(&current->pi_lock);
+ preempt_enable();
}
rt_mutex_deadlock_account_lock(mutex, current);
atomic_inc(&rwm->count);
+
return 1;
}

@@ -1378,7 +1139,7 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
* but the lock got stolen by a higher prio task.
*/
if (!waiter.task) {
- task_blocks_on_rt_mutex(mutex, &waiter, 0, flags);
+ add_waiter(mutex, &waiter, &flags);
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
@@ -1417,7 +1178,7 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
}

if (unlikely(waiter.task))
- remove_waiter(mutex, &waiter, flags);
+ remove_waiter(mutex, &waiter);

WARN_ON(rt_mutex_owner(mutex) &&
rt_mutex_owner(mutex) != current &&
@@ -1430,6 +1191,9 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
if (mtx && unlikely(saved_lock_depth >= 0))
rt_reacquire_bkl(saved_lock_depth);

+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);
+
debug_rt_mutex_free_waiter(&waiter);
}

@@ -1457,13 +1221,13 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
atomic_inc(&rwm->owners);
rw_check_held(rwm);
local_irq_save(flags);
- spin_lock(&current->pi_lock);
reader_count = current->reader_lock_count++;
if (likely(reader_count < MAX_RWLOCK_DEPTH)) {
current->owned_read_locks[reader_count].lock = rwm;
current->owned_read_locks[reader_count].count = 1;
} else
WARN_ON_ONCE(1);
+
/*
* If this task is no longer the sole owner of the lock
* or someone is blocking, then we need to add the task
@@ -1473,16 +1237,12 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
struct rt_mutex *mutex = &rwm->mutex;
struct reader_lock_struct *rls;

- /* preserve lock order, we only need wait_lock now */
- spin_unlock(&current->pi_lock);
-
spin_lock(&mutex->wait_lock);
rls = &current->owned_read_locks[reader_count];
if (!rls->list.prev || list_empty(&rls->list))
- rt_rwlock_add_reader(rlw, rwm);
+ rt_rwlock_add_reader(rls, rwm);
spin_unlock(&mutex->wait_lock);
- } else
- spin_unlock(&current->pi_lock);
+ }
local_irq_restore(flags);
return 1;
}
@@ -1591,7 +1351,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
* but the lock got stolen by a higher prio task.
*/
if (!waiter.task) {
- task_blocks_on_rt_mutex(mutex, &waiter, 0, flags);
+ add_waiter(mutex, &waiter, &flags);
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
@@ -1630,7 +1390,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
}

if (unlikely(waiter.task))
- remove_waiter(mutex, &waiter, flags);
+ remove_waiter(mutex, &waiter);

/* check on unlock if we have any waiters. */
if (rt_mutex_has_waiters(mutex))
@@ -1642,6 +1402,9 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
if (mtx && unlikely(saved_lock_depth >= 0))
rt_reacquire_bkl(saved_lock_depth);

+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);
+
debug_rt_mutex_free_waiter(&waiter);

}
@@ -1733,7 +1496,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)

for (i = current->reader_lock_count - 1; i >= 0; i--) {
if (current->owned_read_locks[i].lock == rwm) {
- spin_lock(&current->pi_lock);
+ preempt_disable();
current->owned_read_locks[i].count--;
if (!current->owned_read_locks[i].count) {
current->reader_lock_count--;
@@ -1743,9 +1506,11 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)
WARN_ON(!rls->list.prev || list_empty(&rls->list));
list_del_init(&rls->list);
rls->lock = NULL;
+ task_pi_deboost(current, &rls->pi_src,
+ PI_FLAG_DEFER_UPDATE);
rw_check_held(rwm);
}
- spin_unlock(&current->pi_lock);
+ preempt_enable();
break;
}
}
@@ -1776,7 +1541,6 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)

/* If no one is blocked, then clear all ownership */
if (!rt_mutex_has_waiters(mutex)) {
- rwm->prio = MAX_PRIO;
/*
* If count is not zero, we are under the limit with
* no other readers.
@@ -1835,28 +1599,11 @@ rt_read_slowunlock(struct rw_mutex *rwm, int mtx)
rt_mutex_set_owner(mutex, RT_RW_READER, 0);
}

- if (rt_mutex_has_waiters(mutex)) {
- waiter = rt_mutex_top_waiter(mutex);
- rwm->prio = waiter->task->prio;
- /*
- * If readers still own this lock, then we need
- * to update the pi_list too. Readers have a separate
- * path in the PI chain.
- */
- if (reader_count) {
- spin_lock(&pendowner->pi_lock);
- plist_del(&waiter->pi_list_entry,
- &pendowner->pi_waiters);
- spin_unlock(&pendowner->pi_lock);
- }
- } else
- rwm->prio = MAX_PRIO;
-
out:
spin_unlock_irqrestore(&mutex->wait_lock, flags);

- /* Undo pi boosting.when necessary */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

static inline void
@@ -1874,9 +1621,9 @@ rt_read_fastunlock(struct rw_mutex *rwm,
int reader_count;
int owners;

- spin_lock_irqsave(&current->pi_lock, flags);
+ local_irq_save(flags);
reader_count = --current->reader_lock_count;
- spin_unlock_irqrestore(&current->pi_lock, flags);
+ local_irq_restore(flags);

rt_mutex_deadlock_account_unlock(current);
if (unlikely(reader_count < 0)) {
@@ -1972,17 +1719,7 @@ rt_write_slowunlock(struct rw_mutex *rwm, int mtx)
while (waiter && !waiter->write_lock) {
struct task_struct *reader = waiter->task;

- spin_lock(&pendowner->pi_lock);
- plist_del(&waiter->list_entry, &mutex->wait_list);
-
- /* nop if not on a list */
- plist_del(&waiter->pi_list_entry, &pendowner->pi_waiters);
- spin_unlock(&pendowner->pi_lock);
-
- spin_lock(&reader->pi_lock);
- waiter->task = NULL;
- reader->pi_blocked_on = NULL;
- spin_unlock(&reader->pi_lock);
+ remove_waiter(mutex, waiter);

if (savestate)
wake_up_process_mutex(reader);
@@ -1995,32 +1732,12 @@ rt_write_slowunlock(struct rw_mutex *rwm, int mtx)
waiter = NULL;
}

- /* If a writer is still pending, then update its plist. */
- if (rt_mutex_has_waiters(mutex)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(mutex);
-
- spin_lock(&pendowner->pi_lock);
- /* delete incase we didn't go through the loop */
- plist_del(&next->pi_list_entry, &pendowner->pi_waiters);
-
- /* This could also be a reader (if reader_limit is set) */
- if (next->write_lock)
- /* add back in as top waiter */
- plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
- spin_unlock(&pendowner->pi_lock);
-
- rwm->prio = next->task->prio;
- } else
- rwm->prio = MAX_PRIO;
-
out:

spin_unlock_irqrestore(&mutex->wait_lock, flags);

- /* Undo pi boosting.when necessary */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

static inline void
@@ -2068,7 +1785,7 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
atomic_inc(&rwm->owners);
rw_check_held(rwm);

- spin_lock(&current->pi_lock);
+ preempt_disable();
reader_count = current->reader_lock_count++;
rls = &current->owned_read_locks[reader_count];
if (likely(reader_count < MAX_RWLOCK_DEPTH)) {
@@ -2076,12 +1793,11 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
rls->count = 1;
} else
WARN_ON_ONCE(1);
- spin_unlock(&current->pi_lock);
+ preempt_enable();

if (!rt_mutex_has_waiters(mutex)) {
/* We are sole owner, we are done */
rwm->owner = current;
- rwm->prio = MAX_PRIO;
mutex->owner = NULL;
spin_unlock_irqrestore(&mutex->wait_lock, flags);
return;
@@ -2102,17 +1818,8 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
while (waiter && !waiter->write_lock) {
struct task_struct *reader = waiter->task;

- spin_lock(&current->pi_lock);
plist_del(&waiter->list_entry, &mutex->wait_list);
-
- /* nop if not on a list */
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
- spin_unlock(&current->pi_lock);
-
- spin_lock(&reader->pi_lock);
waiter->task = NULL;
- reader->pi_blocked_on = NULL;
- spin_unlock(&reader->pi_lock);

/* downgrade is only for mutexes */
wake_up_process(reader);
@@ -2123,124 +1830,81 @@ rt_mutex_downgrade_write(struct rw_mutex *rwm)
waiter = NULL;
}

- /* If a writer is still pending, then update its plist. */
- if (rt_mutex_has_waiters(mutex)) {
- struct rt_mutex_waiter *next;
-
- next = rt_mutex_top_waiter(mutex);
-
- /* setup this mutex prio for read */
- rwm->prio = next->task->prio;
-
- spin_lock(&current->pi_lock);
- /* delete incase we didn't go through the loop */
- plist_del(&next->pi_list_entry, &current->pi_waiters);
- spin_unlock(&current->pi_lock);
- /* No need to add back since readers don't have PI waiters */
- } else
- rwm->prio = MAX_PRIO;
-
rt_mutex_set_owner(mutex, RT_RW_READER, 0);

spin_unlock_irqrestore(&mutex->wait_lock, flags);
-
- /*
- * Undo pi boosting when necessary.
- * If one of the awoken readers boosted us, we don't want to keep
- * that priority.
- */
- rt_mutex_adjust_prio(current);
-}
-
-void rt_mutex_rwsem_init(struct rw_mutex *rwm, const char *name)
-{
- struct rt_mutex *mutex = &rwm->mutex;
-
- rwm->owner = NULL;
- atomic_set(&rwm->count, 0);
- atomic_set(&rwm->owners, 0);
- rwm->prio = MAX_PRIO;
- INIT_LIST_HEAD(&rwm->readers);
-
- __rt_mutex_init(mutex, name);
}

-static int rt_mutex_get_readers_prio(struct task_struct *task, int prio)
+/*
+ * These callbacks are invoked whenever a rwlock has changed priority.
+ * Since rwlocks maintain their own lists of reader dependencies, we
+ * may need to reboost any readers manually
+ */
+static inline int rt_rwlock_pi_boost(struct pi_sink *sink,
+ struct pi_source *src,
+ unsigned int flags)
{
- struct reader_lock_struct *rls;
struct rw_mutex *rwm;
- int lock_prio;
- int i;

- for (i = 0; i < task->reader_lock_count; i++) {
- rls = &task->owned_read_locks[i];
- rwm = rls->lock;
- if (rwm) {
- lock_prio = rwm->prio;
- if (prio > lock_prio)
- prio = lock_prio;
- }
- }
+ rwm = container_of(sink, struct rw_mutex, pi_sink);

- return prio;
+ /*
+ * We don't need to take any locks here because the
+ * lock->pi.node interlock is already guaranteeing mutual
+ * exclusion.
+ */
+ rwm->prio = *src->prio;
+
+ return 0;
}

-static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- struct rt_mutex *lock,
- int recursion_depth)
+static inline int rt_rwlock_pi_update(struct pi_sink *sink,
+ unsigned int flags)
{
+ struct rw_mutex *rwm;
+ struct rt_mutex *mutex;
struct reader_lock_struct *rls;
- struct rt_mutex_waiter *waiter;
- struct task_struct *task;
- struct rw_mutex *rwm = container_of(lock, struct rw_mutex, mutex);
+ unsigned long iflags;

- if (rt_mutex_has_waiters(lock)) {
- waiter = rt_mutex_top_waiter(lock);
- /*
- * Do we need to grab the task->pi_lock?
- * Really, we are only reading it. If it
- * changes, then that should follow this chain
- * too.
- */
- rwm->prio = waiter->task->prio;
- } else
- rwm->prio = MAX_PRIO;
+ rwm = container_of(sink, struct rw_mutex, pi_sink);
+ mutex = &rwm->mutex;

- if (recursion_depth >= MAX_RWLOCK_DEPTH) {
- WARN_ON(1);
- return 1;
- }
+ spin_lock_irqsave(&mutex->wait_lock, iflags);

- list_for_each_entry(rls, &rwm->readers, list) {
- task = rls->task;
- get_task_struct(task);
- /*
- * rt_mutex_adjust_prio_chain will do
- * the put_task_struct
- */
- rt_mutex_adjust_prio_chain(task, 0, orig_lock,
- orig_waiter, top_task,
- recursion_depth+1);
- }
+ list_for_each_entry(rls, &rwm->readers, list)
+ task_pi_boost(rls->task, &rls->pi_src, 0);
+
+ spin_unlock_irqrestore(&mutex->wait_lock, iflags);

return 0;
}
-#else
-static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock,
- struct rt_mutex_waiter *orig_waiter,
- struct task_struct *top_task,
- struct rt_mutex *lock,
- int recursion_depth)
-{
- return 0;
-}

-static int rt_mutex_get_readers_prio(struct task_struct *task, int prio)
+static struct pi_sink_ops rt_rwlock_pi_sink = {
+ .boost = rt_rwlock_pi_boost,
+ .update = rt_rwlock_pi_update,
+};
+
+void rt_mutex_rwsem_init(struct rw_mutex *rwm, const char *name)
{
- return prio;
+ struct rt_mutex *mutex = &rwm->mutex;
+
+ rwm->owner = NULL;
+ atomic_set(&rwm->count, 0);
+ atomic_set(&rwm->owners, 0);
+ rwm->prio = MAX_PRIO;
+ INIT_LIST_HEAD(&rwm->readers);
+
+ __rt_mutex_init(mutex, name);
+
+ /*
+ * Link the rwlock object to the mutex so we get notified
+ * of any priority changes in the future
+ */
+ pi_sink_init(&rwm->pi_sink, &rt_rwlock_pi_sink);
+ pi_add_sink(&mutex->pi.node, &rwm->pi_sink,
+ PI_FLAG_DEFER_UPDATE | PI_FLAG_ALREADY_BOOSTED);
}
+
#endif /* CONFIG_PREEMPT_RT */

static inline int rt_release_bkl(struct rt_mutex *lock, unsigned long flags)
@@ -2335,8 +1999,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
* but the lock got stolen by a higher prio task.
*/
if (!waiter.task) {
- ret = task_blocks_on_rt_mutex(lock, &waiter,
- detect_deadlock, flags);
+ ret = add_waiter(lock, &waiter, &flags);
/*
* If we got woken up by the owner then start loop
* all over without going into schedule to try
@@ -2374,7 +2037,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
set_current_state(TASK_RUNNING);

if (unlikely(waiter.task))
- remove_waiter(lock, &waiter, flags);
+ remove_waiter(lock, &waiter);

/*
* try_to_take_rt_mutex() sets the waiter bit
@@ -2388,13 +2051,8 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
if (unlikely(timeout))
hrtimer_cancel(&timeout->timer);

- /*
- * Readjust priority, when we did not get the lock. We might
- * have been the pending owner and boosted. Since we did not
- * take the lock, the PI boost has to go.
- */
- if (unlikely(ret))
- rt_mutex_adjust_prio(current);
+ /* Undo any pi boosting, if necessary */
+ task_pi_update(current, 0);

/* Must we reaquire the BKL? */
if (unlikely(saved_lock_depth >= 0))
@@ -2457,8 +2115,8 @@ rt_mutex_slowunlock(struct rt_mutex *lock)

spin_unlock_irqrestore(&lock->wait_lock, flags);

- /* Undo pi boosting if necessary: */
- rt_mutex_adjust_prio(current);
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

/*
@@ -2654,6 +2312,8 @@ void __rt_mutex_init(struct rt_mutex *lock, const char *name)
spin_lock_init(&lock->wait_lock);
plist_head_init(&lock->wait_list, &lock->wait_lock);

+ init_pi(lock);
+
debug_rt_mutex_init(lock, name);
}
EXPORT_SYMBOL_GPL(__rt_mutex_init);
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 70df5f5..b0c6c16 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -14,6 +14,7 @@

#include <linux/rtmutex.h>
#include <linux/rt_lock.h>
+#include <linux/pi.h>

/*
* The rtmutex in kernel tester is independent of rtmutex debugging. We
@@ -48,10 +49,13 @@ extern void schedule_rt_mutex_test(struct rt_mutex *lock);
*/
struct rt_mutex_waiter {
struct plist_node list_entry;
- struct plist_node pi_list_entry;
struct task_struct *task;
struct rt_mutex *lock;
int write_lock;
+ struct {
+ struct pi_sink sink;
+ int prio;
+ } pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
unsigned long ip;
struct pid *deadlock_task_pid;
@@ -79,18 +83,6 @@ rt_mutex_top_waiter(struct rt_mutex *lock)
return w;
}

-static inline int task_has_pi_waiters(struct task_struct *p)
-{
- return !plist_head_empty(&p->pi_waiters);
-}
-
-static inline struct rt_mutex_waiter *
-task_top_pi_waiter(struct task_struct *p)
-{
- return plist_first_entry(&p->pi_waiters, struct rt_mutex_waiter,
- pi_list_entry);
-}
-
/*
* lock->owner state tracking:
*/
diff --git a/kernel/rwlock_torture.c b/kernel/rwlock_torture.c
index 2820815..689a0d0 100644
--- a/kernel/rwlock_torture.c
+++ b/kernel/rwlock_torture.c
@@ -682,37 +682,7 @@ static int __init mutex_stress_init(void)

print_owned_read_locks(tsks[i]);

- if (tsks[i]->pi_blocked_on) {
- w = (void *)tsks[i]->pi_blocked_on;
- mtx = w->lock;
- spin_unlock_irq(&tsks[i]->pi_lock);
- spin_lock_irq(&mtx->wait_lock);
- spin_lock(&tsks[i]->pi_lock);
- own = (unsigned long)mtx->owner & ~3UL;
- oops_in_progress++;
- printk("%s:%d is blocked on ",
- tsks[i]->comm, tsks[i]->pid);
- __print_symbol("%s", (unsigned long)mtx);
- if (own == 0x100)
- printk(" owner is READER\n");
- else if (!(own & ~300))
- printk(" owner is ILLEGAL!!\n");
- else if (!own)
- printk(" has no owner!\n");
- else {
- struct task_struct *owner = (void*)own;
-
- printk(" owner is %s:%d\n",
- owner->comm, owner->pid);
- }
- oops_in_progress--;
-
- spin_unlock(&tsks[i]->pi_lock);
- spin_unlock_irq(&mtx->wait_lock);
- } else {
- print_owned_read_locks(tsks[i]);
- spin_unlock_irq(&tsks[i]->pi_lock);
- }
+ spin_unlock_irq(&tsks[i]->pi_lock);
}
}
#endif
diff --git a/kernel/sched.c b/kernel/sched.c
index eb14b9f..d1db367 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2411,12 +2411,6 @@ task_pi_init(struct task_struct *p)
pi_source_init(&p->pi.src, &p->normal_prio);
task_pi_boost(p, &p->pi.src, PI_FLAG_DEFER_UPDATE);

-#ifdef CONFIG_RT_MUTEXES
- p->rtmutex_prio = MAX_PRIO;
- pi_source_init(&p->rtmutex_prio_src, &p->rtmutex_prio);
- task_pi_boost(p, &p->rtmutex_prio_src, PI_FLAG_DEFER_UPDATE);
-#endif
-
/*
* We add our own task as a dependency of ourselves so that
* we get boost-notifications (via task_pi_boost_cb) whenever
@@ -5027,7 +5021,6 @@ task_pi_update_cb(struct pi_sink *sink, unsigned int flags)
*/
if (unlikely(p == rq->idle)) {
WARN_ON(p != rq->curr);
- WARN_ON(p->pi_blocked_on);
goto out_unlock;
}

@@ -5358,7 +5351,6 @@ recheck:
spin_unlock_irqrestore(&p->pi_lock, flags);

task_pi_update(p, 0);
- rt_mutex_adjust_pi(p);

return 0;
}
@@ -8492,10 +8484,6 @@ void __init sched_init(void)

task_pi_init(&init_task);

-#ifdef CONFIG_RT_MUTEXES
- plist_head_init(&init_task.pi_waiters, &init_task.pi_lock);
-#endif
-
/*
* The boot idle thread does lazy MMU switching as well:
*/

2008-08-15 20:33:20

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH RT RFC v4 8/8] rtmutex: pi-boost locks as late as possible

PREEMPT_RT replaces most spinlock_t instances with a preemptible
real-time lock that supports priority inheritance. An uncontended
(fastpath) acquisition of this lock has no more overhead than
its non-rt spinlock_t counterpart. However, the contended case
has considerably more overhead, since the lock must maintain
proper priority-queue order and support pi-boosting of the lock
owner while remaining fully preemptible.
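
For context, the fastpath/slowpath split referred to above looks
roughly like this (a simplified sketch, not patch code;
"rt_lock_sketch" is an illustrative name):

	static inline void rt_lock_sketch(struct rt_mutex *lock)
	{
		/* fastpath: uncontended, one atomic op, no PI work */
		if (likely(rt_mutex_cmpxchg(lock, NULL, current)))
			return;

		/*
		 * slowpath: queue as a waiter, adaptive-spin or sleep,
		 * and (today) unconditionally pi-boost the owner
		 */
		rt_spin_lock_slowlock(lock);
	}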

Instrumentation shows that the majority of acquisitions under most
workloads fall either into the fastpath category or the
adaptive-spin category within the slowpath. The necessity to
pi-boost a lock owner should be sufficiently rare, yet the
slowpath blindly incurs this overhead in 100% of contentions.

Therefore, this patch intends to capitalize on this observation
in order to reduce overhead and improve acquisition throughput.
It is important to note that real-time latency is still treated
as a higher order constraint than throughput, so the full
pi-protocol is observed using new carefully constructed rules
around the old concepts.

1) We check the priority of the owner relative to the waiter on
each spin of the lock (if we are not boosted already). If the
owner's effective priority is logically less than the waiter's
priority, we must boost the owner.

2) We check our own priority against our current queue
position on the waiters-list (if we are not boosted already).
If our priority was changed, we need to re-queue ourselves to
update our position.

3) We break out of the adaptive-spin if either of the above
conditions (1), (2) change so that we can re-evaluate the
lock conditions.

4) We must enter pi-boost mode if, at any time, we decide to
voluntarily preempt since we are losing our ability to
dynamically process the conditions above (a condensed sketch of
these checks follows below).
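
Condensed from the rt_spin_lock_slowlock()/adaptive_wait() changes
below, the in-loop checks for rules (1)-(3) look roughly like this
(sketch only; locking and wakeup details elided):

	/* inside the spin loop, while we have not yet boosted the owner */
	if (!waiter.pi.boosted) {
		/* rule 1: owner is lower priority than us -> boost now */
		if (current->prio < orig_owner->prio) {
			boost_lock(lock, &waiter);
			task_pi_update(current, 0);
		}

		/* rule 2: our own priority changed -> requeue ourselves */
		if (waiter.pi.prio != current->prio) {
			waiter.pi.prio = current->prio;
			requeue_waiter(lock, &waiter);
		}
	}

	/*
	 * Rule 3: adaptive_wait() returns 0 (stop spinning) whenever
	 * either condition above trips, so the loop re-evaluates them.
	 * Rule 4: if we give up spinning and go to sleep, we boost first.
	 */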

Note: We still fully support priority inheritance with this
protocol, even if we defer the low-level calls to adjust priority.
The difference is really in terms of being a pro-active protocol
(boost on entry) versus a reactive protocol (boost when
necessary). The upside to the latter is that we don't take a
penalty for pi when it is not necessary (which is most of the time).
The downside is that we technically leave the owner exposed to
getting preempted (should it get asynchronously deprioritized), even
if our waiter is the highest priority task in the system. When this
happens, the owner would be immediately boosted (because we would
hit the "oncpu" condition, and subsequently follow the voluntary
preempt path which boosts the owner). Therefore, inversion is
correctly prevented, but we have the extra latency of the
preempt/boost/wakeup that could have been avoided in the proactive
model.

However, the design of the algorithm described above constrains the
probability of this phenomenon occurring to setscheduler()
operations. Since rt-locks do not support being interrupted by
signals or timeouts, waiters only depart via the acquisition path.
And while acquisitions do deboost the owner, the owner also
changes simultaneously, rendering the deboost moot relative to the
other waiters.

What this all means is that the downside to this implementation is
that a high-priority waiter *may* see an extra latency (equivalent
to roughly two wake-ups) if the owner has its priority reduced via
setscheduler() while it holds the lock. The penalty is
deterministic, arguably small enough, and sufficiently rare that I
do not believe it should be an issue.

Note: If the concept of other exit paths is ever introduced in the
future, simply adapting the condition to look at owner->normal_prio
instead of owner->prio should once again constrain the limitation
to setscheduler().

Special thanks to Peter Morreale for suggesting the optimization to
only consider skipping the boost if the owner's priority is >= current's.

Signed-off-by: Gregory Haskins <[email protected]>
CC: Peter Morreale <[email protected]>
---

include/linux/rtmutex.h | 1
kernel/rtmutex.c | 195 ++++++++++++++++++++++++++++++++++++-----------
kernel/rtmutex_common.h | 1
3 files changed, 153 insertions(+), 44 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index e069182..656610b 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -33,6 +33,7 @@ struct rt_mutex {
struct pi_node node;
struct pi_sink sink;
int prio;
+ int boosters;
} pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
int save_state;
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 8acbf23..ef2b508 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -76,14 +76,15 @@ rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner,
{
unsigned long val = (unsigned long)owner | mask;

- if (rt_mutex_has_waiters(lock)) {
+ if (lock->pi.boosters) {
struct task_struct *prev_owner = rt_mutex_owner(lock);

rtmutex_pi_owner(lock, prev_owner, 0);
rtmutex_pi_owner(lock, owner, 1);
+ }

+ if (rt_mutex_has_waiters(lock))
val |= RT_MUTEX_HAS_WAITERS;
- }

lock->owner = (struct task_struct *)val;
}
@@ -177,7 +178,7 @@ static inline int rtmutex_pi_update(struct pi_sink *sink,

spin_lock_irqsave(&lock->wait_lock, iflags);

- if (rt_mutex_has_waiters(lock)) {
+ if (lock->pi.boosters) {
owner = rt_mutex_owner(lock);

if (owner && owner != RT_RW_READER) {
@@ -206,6 +207,7 @@ static void init_pi(struct rt_mutex *lock)
pi_node_init(&lock->pi.node);

lock->pi.prio = MAX_PRIO;
+ lock->pi.boosters = 0;
pi_source_init(&lock->pi.src, &lock->pi.prio);
pi_sink_init(&lock->pi.sink, &rtmutex_pi_sink);

@@ -303,6 +305,16 @@ static inline int try_to_take_rt_mutex(struct rt_mutex *lock)
return do_try_to_take_rt_mutex(lock, STEAL_NORMAL);
}

+static inline void requeue_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ BUG_ON(!waiter->task);
+
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
+}
+
/*
* These callbacks are invoked whenever a waiter has changed priority.
* So we should requeue it within the lock->wait_list
@@ -343,11 +355,8 @@ static inline int rtmutex_waiter_pi_update(struct pi_sink *sink,
* pi list. Therefore, if waiter->pi.prio has changed since we
* queued ourselves, requeue it.
*/
- if (waiter->task && waiter->list_entry.prio != waiter->pi.prio) {
- plist_del(&waiter->list_entry, &lock->wait_list);
- plist_node_init(&waiter->list_entry, waiter->pi.prio);
- plist_add(&waiter->list_entry, &lock->wait_list);
- }
+ if (waiter->task && waiter->list_entry.prio != waiter->pi.prio)
+ requeue_waiter(lock, waiter);

spin_unlock_irqrestore(&lock->wait_lock, iflags);

@@ -359,20 +368,9 @@ static struct pi_sink_ops rtmutex_waiter_pi_sink = {
.update = rtmutex_waiter_pi_update,
};

-/*
- * This must be called with lock->wait_lock held.
- */
-static int add_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter,
- unsigned long *flags)
+static void boost_lock(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
{
- int has_waiters = rt_mutex_has_waiters(lock);
-
- waiter->task = current;
- waiter->lock = lock;
- waiter->pi.prio = current->prio;
- plist_node_init(&waiter->list_entry, waiter->pi.prio);
- plist_add(&waiter->list_entry, &lock->wait_list);
pi_sink_init(&waiter->pi.sink, &rtmutex_waiter_pi_sink);

/*
@@ -397,35 +395,28 @@ static int add_waiter(struct rt_mutex *lock,
* If we previously had no waiters, we are transitioning to
* a mode where we need to boost the owner
*/
- if (!has_waiters) {
+ if (!lock->pi.boosters) {
struct task_struct *owner = rt_mutex_owner(lock);
rtmutex_pi_owner(lock, owner, 1);
}

- spin_unlock_irqrestore(&lock->wait_lock, *flags);
- task_pi_update(current, 0);
- spin_lock_irqsave(&lock->wait_lock, *flags);
-
- return 0;
+ lock->pi.boosters++;
+ waiter->pi.boosted = 1;
}

-/*
- * Remove a waiter from a lock
- *
- * Must be called with lock->wait_lock held
- */
-static void remove_waiter(struct rt_mutex *lock,
- struct rt_mutex_waiter *waiter)
+static void deboost_lock(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ struct task_struct *p)
{
- struct task_struct *p = waiter->task;
+ BUG_ON(!waiter->pi.boosted);

- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->task = NULL;
+ waiter->pi.boosted = 0;
+ lock->pi.boosters--;

/*
* We can stop boosting the owner if there are no more waiters
*/
- if (!rt_mutex_has_waiters(lock)) {
+ if (!lock->pi.boosters) {
struct task_struct *owner = rt_mutex_owner(lock);
rtmutex_pi_owner(lock, owner, 0);
}
@@ -446,6 +437,51 @@ static void remove_waiter(struct rt_mutex *lock,
}

/*
+ * This must be called with lock->wait_lock held.
+ */
+static void _add_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ waiter->task = current;
+ waiter->lock = lock;
+ waiter->pi.prio = current->prio;
+ plist_node_init(&waiter->list_entry, waiter->pi.prio);
+ plist_add(&waiter->list_entry, &lock->wait_list);
+}
+
+static int add_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ unsigned long *flags)
+{
+ _add_waiter(lock, waiter);
+
+ boost_lock(lock, waiter);
+
+ spin_unlock_irqrestore(&lock->wait_lock, *flags);
+ task_pi_update(current, 0);
+ spin_lock_irqsave(&lock->wait_lock, *flags);
+
+ return 0;
+}
+
+/*
+ * Remove a waiter from a lock
+ *
+ * Must be called with lock->wait_lock held
+ */
+static void remove_waiter(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter)
+{
+ struct task_struct *p = waiter->task;
+
+ plist_del(&waiter->list_entry, &lock->wait_list);
+ waiter->task = NULL;
+
+ if (waiter->pi.boosted)
+ deboost_lock(lock, waiter, p);
+}
+
+/*
* Wake up the next waiter on the lock.
*
* Remove the top waiter from the current tasks waiter list and from
@@ -558,6 +594,24 @@ static int adaptive_wait(struct rt_mutex_waiter *waiter,
if (orig_owner != rt_mutex_owner(waiter->lock))
return 0;

+ /* Special handling for when we are not in pi-boost mode */
+ if (!waiter->pi.boosted) {
+ /*
+ * Are we higher priority than the owner? If so
+ * we should bail out immediately so that we can
+ * pi boost them.
+ */
+ if (current->prio < orig_owner->prio)
+ return 0;
+
+ /*
+ * Did our priority change? If so, we need to
+ * requeue our position in the list
+ */
+ if (waiter->pi.prio != current->prio)
+ return 0;
+ }
+
/* Owner went to bed, so should we */
if (!task_is_current(orig_owner))
return 1;
@@ -599,6 +653,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
unsigned long saved_state, state, flags;
struct task_struct *orig_owner;
int missed = 0;
+ int boosted = 0;

init_waiter(&waiter);

@@ -631,26 +686,54 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
}
missed = 1;

+ orig_owner = rt_mutex_owner(lock);
+
/*
* waiter.task is NULL the first time we come here and
* when we have been woken up by the previous owner
* but the lock got stolen by an higher prio task.
*/
- if (!waiter.task) {
- add_waiter(lock, &waiter, &flags);
+ if (!waiter.task)
+ _add_waiter(lock, &waiter);
+
+ /*
+ * We only need to pi-boost the owner if they are lower
+ * priority than us. We don't care if this is racy
+ * against priority changes as we will break out of
+ * the adaptive spin anytime any priority changes occur
+ * without boosting enabled.
+ */
+ if (!waiter.pi.boosted && current->prio < orig_owner->prio) {
+ boost_lock(lock, &waiter);
+ boosted = 1;
+
+ spin_unlock_irqrestore(&lock->wait_lock, flags);
+ task_pi_update(current, 0);
+ spin_lock_irqsave(&lock->wait_lock, flags);
+
/* Wakeup during boost ? */
if (unlikely(!waiter.task))
continue;
}

/*
+ * If we are not currently pi-boosting the lock, we have to
+ * monitor whether our priority changed since the last
+ * time it was recorded and requeue ourselves if it moves.
+ */
+ if (!waiter.pi.boosted && waiter.pi.prio != current->prio) {
+ waiter.pi.prio = current->prio;
+
+ requeue_waiter(lock, &waiter);
+ }
+
+ /*
* Prevent schedule() to drop BKL, while waiting for
* the lock ! We restore lock_depth when we come back.
*/
saved_flags = current->flags & PF_NOSCHED;
current->lock_depth = -1;
current->flags &= ~PF_NOSCHED;
- orig_owner = rt_mutex_owner(lock);
get_task_struct(orig_owner);
spin_unlock_irqrestore(&lock->wait_lock, flags);

@@ -664,6 +747,24 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
* barrier which we rely upon to ensure current->state
* is visible before we test waiter.task.
*/
+ if (waiter.task && !waiter.pi.boosted) {
+ spin_lock_irqsave(&lock->wait_lock, flags);
+
+ /*
+ * We get here if we have not yet boosted
+ * the lock, yet we are going to sleep. If
+ * we are still pending (waiter.task != 0),
+ * then go ahead and boost them now
+ */
+ if (waiter.task) {
+ boost_lock(lock, &waiter);
+ boosted = 1;
+ }
+
+ spin_unlock_irqrestore(&lock->wait_lock, flags);
+ task_pi_update(current, 0);
+ }
+
if (waiter.task)
schedule_rt_mutex(lock);
} else
@@ -696,7 +797,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
spin_unlock_irqrestore(&lock->wait_lock, flags);

/* Undo any pi boosting, if necessary */
- task_pi_update(current, 0);
+ if (boosted)
+ task_pi_update(current, 0);

debug_rt_mutex_free_waiter(&waiter);
}
@@ -708,6 +810,7 @@ static void noinline __sched
rt_spin_lock_slowunlock(struct rt_mutex *lock)
{
unsigned long flags;
+ int deboost = 0;

spin_lock_irqsave(&lock->wait_lock, flags);

@@ -721,12 +824,16 @@ rt_spin_lock_slowunlock(struct rt_mutex *lock)
return;
}

+ if (lock->pi.boosters)
+ deboost = 1;
+
wakeup_next_waiter(lock, 1);

spin_unlock_irqrestore(&lock->wait_lock, flags);

- /* Undo pi boosting when necessary */
- task_pi_update(current, 0);
+ if (deboost)
+ /* Undo pi boosting when necessary */
+ task_pi_update(current, 0);
}

void __lockfunc rt_spin_lock(spinlock_t *lock)
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index b0c6c16..8d4f745 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -55,6 +55,7 @@ struct rt_mutex_waiter {
struct {
struct pi_sink sink;
int prio;
+ int boosted;
} pi;
#ifdef CONFIG_DEBUG_RT_MUTEXES
unsigned long ip;
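
A condensed sketch of the resulting rt_spin_lock_slowlock() flow may help
when reading the hunks above. This is illustrative only, not the literal
patch code: the wait_lock handling, the saved task state, the BKL
juggling, the wakeup races and the exit path are all omitted, and only
the boost-related decisions are shown.

static void rt_spin_lock_slowlock_outline(struct rt_mutex *lock)
{
        struct rt_mutex_waiter waiter;
        struct task_struct *orig_owner;

        init_waiter(&waiter);

        for (;;) {
                /* (take wait_lock, try to take the lock, break on success) */

                if (!waiter.task)
                        _add_waiter(lock, &waiter); /* queue, but do not boost yet */

                orig_owner = rt_mutex_owner(lock);

                /* Boost only if the owner is actually in our way */
                if (!waiter.pi.boosted && current->prio < orig_owner->prio) {
                        boost_lock(lock, &waiter);
                        task_pi_update(current, 0); /* propagate to the owner */
                }

                /* While unboosted, track our own priority changes by hand */
                if (!waiter.pi.boosted && waiter.pi.prio != current->prio) {
                        waiter.pi.prio = current->prio;
                        requeue_waiter(lock, &waiter);
                }

                if (adaptive_wait(&waiter, orig_owner)) {
                        /* About to sleep: boost now if we never did */
                        if (waiter.task && !waiter.pi.boosted) {
                                boost_lock(lock, &waiter);
                                task_pi_update(current, 0);
                        }
                        if (waiter.task)
                                schedule_rt_mutex(lock);
                }
        }
}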

2008-08-15 20:35:23

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

Gregory Haskins wrote:
> The kernel currently addresses priority-inversion through priority-
> inheritence. However, all of the priority-inheritence logic is
> integrated into the Real-Time Mutex infrastructure. This causes a few
> problems:
>
> 1) This tightly coupled relationship makes it difficult to extend to
> other areas of the kernel (for instance, pi-aware wait-queues may
> be desirable).
> 2) Enhancing the rtmutex infrastructure becomes challenging because
> there is no seperation between the locking code, and the pi-code.
>
> This patch aims to rectify these shortcomings by designing a stand-alone
> pi framework which can then be used to replace the rtmutex-specific
> version. The goal of this framework is to provide similar functionality
> to the existing subsystem, but with sole focus on PI and the
> relationships between objects that can boost priority, and the objects
> that get boosted.
>
> We introduce the concept of a "pi_source" and a "pi_sink", where, as the
> name suggests provides the basic relationship of a priority source, and
> its boosted target. A pi_source acts as a reference to some arbitrary
> source of priority, and a pi_sink can be boosted (or deboosted) by
> a pi_source. For more details, please read the library documentation.
>
> There are currently no users of this inteface.
>
> Signed-off-by: Gregory Haskins <[email protected]>
> ---
>
> Documentation/libpi.txt | 59 ++++++
> include/linux/pi.h | 293 ++++++++++++++++++++++++++++
> lib/Makefile | 3
> lib/pi.c | 489 +++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 843 insertions(+), 1 deletions(-)
> create mode 100644 Documentation/libpi.txt
> create mode 100644 include/linux/pi.h
> create mode 100644 lib/pi.c
>
> diff --git a/Documentation/libpi.txt b/Documentation/libpi.txt
> new file mode 100644
> index 0000000..197b21a
> --- /dev/null
> +++ b/Documentation/libpi.txt
> @@ -0,0 +1,59 @@
> +lib/pi.c - Priority Inheritance library
> +
> +Sources and sinks:
> +------------------
> +
> +This library introduces the basic concept of a "pi_source" and a "pi_sink", which, as the names suggest, provide the basic relationship of a priority source and its boosted target.
> +
> +A pi_source is simply a reference to some arbitrary priority value that may range from 0 (highest prio) to MAX_PRIO (currently 140, lowest prio). A pi_source calls pi_sink.boost() whenever it wishes to boost the sink to (at least minimally) the priority value that the source represents. It uses pi_sink.boost() both for the initial boost and for any subsequent refreshes to the value (even if the value is decreasing in logical priority). The policy of the sink will dictate what happens as a result of that boost. Likewise, a pi_source calls pi_sink.deboost() to stop contributing to the sink's minimum priority.
> +
> +It is important to note that a source is a reference to a priority value, not a value itself. This is one of the concepts that allows the interface to be idempotent, which is important for updating a chain of sources and sinks in the proper order. If we passed the priority by value on the stack, the order in which the system executes could allow the actual value that ends up being set to race.
> +
> +Nodes:
> +
> +A pi_node is a convenience object which is simultaneously a source and a sink. As its name suggests, it would typically be deployed as a node in a pi-chain. Other pi_sources can boost a node via its pi_sink.boost() interface. Likewise, a node can boost a fixed number of sinks via the node.add_sink() interface.
> +
> +Generally speaking, a node takes care of many common operations associated with being a “link in the chain”, such as:
> +
> + 1) determining the current priority of the node based on the (logically) highest priority source that is boosting the node.
> + 2) boosting/deboosting upstream sinks whenever the node locally changes priority.
> + 3) taking care to avoid deadlock during a chain update.
> +
> +Design details:
> +
> +Destruction:
> +
> +The pi-library objects are designed to be implicitly destructible (meaning they do not require an explicit “free()” operation when they are no longer used). This is important considering their intended use (spinlock_t's, which are also implicitly destructible). As such, any allocations needed for operation must come from internal structure storage, as there will be no opportunity to free them later.
> +
> +Multiple sinks per Node:
> +
> +We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation which had the notion of only a single sink (i.e. “task->pi_blocked_on”). The reason why we added the ability to add more than one sink was not to change the default chaining model (I.e. multiple boost targets), but rather to add a flexible notification mechanism that is peripheral to the chain, which are informally called “leaf sinks”.
> +
> +Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints to a priority boosting. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary. Whereas an rwlock leaf-sink may boost a list of reader-owners.
> +
> +The following diagram depicts an example relationship (warning: cheesy ascii art)
> +
> + --------- ---------
> + | leaf | | leaf |
> + --------- ---------
> + / /
> + --------- / ---------- / --------- ---------
> + ->-| node |->---| node |-->---| node |->---| leaf |
> + --------- ---------- --------- ---------
> +
> +The reason why this was done was to unify the notion of a “sink” under a single interface, rather than having something like task->pi_blocked_on and a separate callback for the leaf action. Instead, any downstream object can be represented by a sink, and the implementation details are hidden (e.g. I'm a task, I'm a lock, I'm a node, I'm a work-item, I'm a wait-queue, etc.).
> +
> +Sinkrefs:
> +
> +Each pi_sink.boost() operation is represented by a unique pi_source to properly facilitate a one-node-to-many-sources relationship. Therefore, if a pi_node is to act as an aggregator for multiple sinks, it implicitly must have one internal pi_source object for every sink that is added (via node.add_sink()). This pi_source object has to be internally managed for the lifetime of the sink reference.
> +
> +Recall that due to the implicit-destruction requirement above, and the fact that we will typically be executing in a preempt-disabled region, we have to be very careful about how we allocate references to those sinks. More on that next. But, long story short, we limit the number of sinks to MAX_PI_DEPENDENCIES (currently 5).
> +
> +Locking:
> +
> +(work in progress....)
> +
> +
> +
> +
> +
> diff --git a/include/linux/pi.h b/include/linux/pi.h
> new file mode 100644
> index 0000000..5535474
> --- /dev/null
> +++ b/include/linux/pi.h
> @@ -0,0 +1,293 @@
> +/*
> + * see Documentation/libpi.txt for details
> + */
> +
> +#ifndef _LINUX_PI_H
> +#define _LINUX_PI_H
> +
> +#include <linux/list.h>
> +#include <linux/plist.h>
> +#include <asm/atomic.h>
> +
> +#define MAX_PI_DEPENDENCIES 5
> +
> +struct pi_source {
> + struct plist_node list;
> + int *prio;
> + int boosted;
> +};
> +
> +
> +#define PI_FLAG_DEFER_UPDATE (1 << 0)
> +#define PI_FLAG_ALREADY_BOOSTED (1 << 1)
> +
> +struct pi_sink;
> +
> +struct pi_sink_ops {
> + int (*boost)(struct pi_sink *sink, struct pi_source *src,
> + unsigned int flags);
> + int (*deboost)(struct pi_sink *sink, struct pi_source *src,
> + unsigned int flags);
> + int (*update)(struct pi_sink *sink,
> + unsigned int flags);
> + int (*free)(struct pi_sink *sink,
> + unsigned int flags);
> +};
> +
> +struct pi_sink {
> + atomic_t refs;
> + struct pi_sink_ops *ops;
> +};
> +
> +enum pi_state {
> + pi_state_boost,
> + pi_state_boosted,
> + pi_state_deboost,
> + pi_state_free,
> +};
> +
> +/*
> + * NOTE: PI must always use a true (e.g. raw) spinlock, since it is used by
> + * rtmutex infrastructure.
> + */
> +
> +struct pi_sinkref {
> + raw_spinlock_t lock;
> + struct list_head list;
> + enum pi_state state;
> + struct pi_sink *sink;
> + struct pi_source src;
> + atomic_t refs;
> +};
> +
> +struct pi_sinkref_pool {
> + struct list_head free;
> + struct pi_sinkref data[MAX_PI_DEPENDENCIES];
> +};
> +
> +struct pi_node {
> + raw_spinlock_t lock;
> + int prio;
> + struct pi_sink sink;
> + struct pi_sinkref_pool sinkref_pool;
> + struct list_head sinks;
> + struct plist_head srcs;
> +};
> +
> +/**
> + * pi_node_init - initialize a pi_node before use
> + * @node: a node context
> + */
> +extern void pi_node_init(struct pi_node *node);
> +
> +/**
> + * pi_add_sink - add a sink as a downstream object
> + * @node: the node context
> + * @sink: the sink context to add to the node
> + * @flags: optional flags to modify behavior
> + * PI_FLAG_DEFER_UPDATE - Do not perform sync update
> + * PI_FLAG_ALREADY_BOOSTED - Do not perform initial boosting
> + *
> + * This function registers a sink to get notified whenever the
> + * node changes priority.
> + *
> + * Note: By default, this function will schedule the newly added sink
> + * to get an initial boost notification on the next update (even
> + * without the presence of a priority transition). However, if the
> + * ALREADY_BOOSTED flag is specified, the sink is initially marked as
> + * BOOSTED and will only get notified if the node changes priority
> + * in the future.
> + *
> + * Note: By default, this function will synchronously update the
> + * chain unless the DEFER_UPDATE flag is specified.
> + *
> + * Returns: (int)
> + * 0 = success
> + * any other value = failure
> + */
> +extern int pi_add_sink(struct pi_node *node, struct pi_sink *sink,
> + unsigned int flags);
> +
> +/**
> + * pi_del_sink - del a sink from the current downstream objects
> + * @node: the node context
> + * @sink: the sink context to delete from the node
> + * @flags: optional flags to modify behavior
> + * PI_FLAG_DEFER_UPDATE - Do not perform sync update
> + *
> + * This function unregisters a sink from the node.
> + *
> + * Note: The sink will not actually become fully deboosted until
> + * a call to node.update() successfully returns.
> + *
> + * Note: By default, this function will synchronously update the
> + * chain unless the DEFER_UPDATE flag is specified.
> + *
> + * Returns: (int)
> + * 0 = success
> + * any other value = failure
> + */
> +extern int pi_del_sink(struct pi_node *node, struct pi_sink *sink,
> + unsigned int flags);
> +
> +/**
> + * pi_sink_init - initialize a pi_sink before use
> + * @sink: a sink context
> + * @ops: pointer to an pi_sink_ops structure
> + */
> +static inline void
> +pi_sink_init(struct pi_sink *sink, struct pi_sink_ops *ops)
> +{
> + atomic_set(&sink->refs, 0);
> + sink->ops = ops;
> +}
> +
> +/**
> + * pi_source_init - initialize a pi_source before use
> + * @src: a src context
> + * @prio: pointer to a priority value
> + *
> + * A pointer to a priority value is used so that boost and update
> + * are fully idempotent.
> + */
> +static inline void
> +pi_source_init(struct pi_source *src, int *prio)
> +{
> + plist_node_init(&src->list, *prio);
> + src->prio = prio;
> + src->boosted = 0;
> +}
> +
> +/**
> + * pi_boost - boost a node with a pi_source
> + * @node: the node context
> + * @src: the src context to boost the node with
> + * @flags: optional flags to modify behavior
> + * PI_FLAG_DEFER_UPDATE - Do not perform sync update
> + *
> + * This function registers a priority source with the node, possibly
> + * boosting its value if the new source is the highest registered source.
> + *
> + * This function is used to both initially register a source, as well as
> + * to notify the node if the value changes in the future (even if the
> + * priority is decreasing).
> + *
> + * Note: By default, this function will synchronously update the
> + * chain unless the DEFER_UPDATE flag is specified.
> + *
> + * Returns: (int)
> + * 0 = success
> + * any other value = failure
> + */
> +static inline int
> +pi_boost(struct pi_node *node, struct pi_source *src, unsigned int flags)
> +{
> + struct pi_sink *sink = &node->sink;
> +
> + if (sink->ops->boost)
> + return sink->ops->boost(sink, src, flags);
> +
> + return 0;
> +}
> +
> +/**
> + * pi_deboost - deboost a pi_source from a node
> + * @node: the node context
> + * @src: the src context to boost the node with
> + * @flags: optional flags to modify behavior
> + * PI_FLAG_DEFER_UPDATE - Do not perform sync update
> + *
> + * This function unregisters a priority source from the node, possibly
> + * deboosting its value if the departing source was the highest
> + * registered source.
> + *
> + * Note: By default, this function will synchronously update the
> + * chain unless the DEFER_UPDATE flag is specified.
> + *
> + * Returns: (int)
> + * 0 = success
> + * any other value = failure
> + */
> +static inline int
> +pi_deboost(struct pi_node *node, struct pi_source *src, unsigned int flags)
> +{
> + struct pi_sink *sink = &node->sink;
> +
> + if (sink->ops->deboost)
> + return sink->ops->deboost(sink, src, flags);
> +
> + return 0;
> +}
> +
> +/**
> + * pi_update - force a manual chain update
> + * @node: the node context
> + * @flags: optional flags to modify behavior. Reserved, must be 0.
> + *
> + * This function will push any priority changes (as a result of
> + * boost/deboost or add_sink/del_sink) down through the chain.
> + * If no changes are necessary, this function is a no-op.
> + *
> + * Returns: (int)
> + * 0 = success
> + * any other value = failure
> + */
> +static inline int
> +pi_update(struct pi_node *node, unsigned int flags)
> +{
> + struct pi_sink *sink = &node->sink;
> +
> + if (sink->ops->update)
> + return sink->ops->update(sink, flags);
> +
> + return 0;
> +}
> +
> +/**
> + * pi_sink_put - down the reference count, freeing the sink if 0
> + * @node: the node context
> + * @flags: optional flags to modify behavior. Reserved, must be 0.
> + *
> + * Returns: none
> + */
> +static inline void
> +pi_sink_put(struct pi_sink *sink, unsigned int flags)
> +{
> + if (atomic_dec_and_test(&sink->refs)) {
> + if (sink->ops->free)
> + sink->ops->free(sink, flags);
> + }
> +}
> +
> +
> +/**
> + * pi_get - up the reference count
> + * @node: the node context
> + * @flags: optional flags to modify behavior. Reserved, must be 0.
> + *
> + * Returns: none
> + */
> +static inline void
> +pi_get(struct pi_node *node, unsigned int flags)
> +{
> + struct pi_sink *sink = &node->sink;
> +
> + atomic_inc(&sink->refs);
> +}
> +
> +/**
> + * pi_put - down the reference count, freeing the node if 0
> + * @node: the node context
> + * @flags: optional flags to modify behavior. Reserved, must be 0.
> + *
> + * Returns: none
> + */
> +static inline void
> +pi_put(struct pi_node *node, unsigned int flags)
> +{
> + struct pi_sink *sink = &node->sink;
> +
> + pi_sink_put(sink, flags);
> +}
> +
> +#endif /* _LINUX_PI_H */
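
As a reading aid (not part of the patch), here is a minimal sketch of how
the interfaces above compose: two nodes chained together, with a single
external source boosting the head of the chain. The node_a/node_b and
waiter_* names are invented for this example.

#include <linux/pi.h>

static struct pi_node node_a;           /* e.g. embedded in a lock */
static struct pi_node node_b;           /* e.g. representing the lock owner */

static int waiter_prio;                 /* the priority value being donated */
static struct pi_source waiter_src;

static void pi_chain_sketch(void)
{
        pi_node_init(&node_a);
        pi_node_init(&node_b);

        /* node_b becomes a downstream sink of node_a:  a ->- b */
        pi_add_sink(&node_a, &node_b.sink, 0);

        /* a source is a *reference* to a priority, not a snapshot of it */
        waiter_prio = 10;
        pi_source_init(&waiter_src, &waiter_prio);

        /* boost a; the synchronous update propagates the new prio to b */
        pi_boost(&node_a, &waiter_src, 0);

        /* refreshing the value goes through the very same call ... */
        waiter_prio = 5;
        pi_boost(&node_a, &waiter_src, 0);

        /* ... and dropping the contribution deboosts the chain again */
        pi_deboost(&node_a, &waiter_src, 0);

        pi_del_sink(&node_a, &node_b.sink, 0);
}

Because a node is simultaneously a sink (it can be boosted) and a source
(it re-boosts its own sinks), arbitrarily long chains are built simply by
adding the next node's sink to the previous node.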
> diff --git a/lib/Makefile b/lib/Makefile
> index 5187924..df81ad7 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -23,7 +23,8 @@ lib-$(CONFIG_SMP) += cpumask.o
> lib-y += kobject.o kref.o klist.o
>
> obj-y += div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
> - bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o
> + bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \
> + pi.o
>
> ifeq ($(CONFIG_DEBUG_KOBJECT),y)
> CFLAGS_kobject.o += -DDEBUG
> diff --git a/lib/pi.c b/lib/pi.c
> new file mode 100644
> index 0000000..d00042c
> --- /dev/null
> +++ b/lib/pi.c
> @@ -0,0 +1,489 @@
> +/*
> + * lib/pi.c
> + *
> + * Priority-Inheritance library
> + *
> + * Copyright (C) 2008 Novell
> + *
> + * Author: Gregory Haskins <[email protected]>
> + *
> + * This code provides a generic framework for preventing priority
> + * inversion by means of priority-inheritance. (see Documentation/libpi.txt
> + * for details)
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; version 2
> + * of the License.
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/module.h>
> +#include <linux/pi.h>
> +
> +
> +struct updater {
> + int update;
> + struct pi_sinkref *sinkref;
> + struct pi_sink *sink;
> +};
> +
> +/*
> + *-----------------------------------------------------------
> + * pi_sinkref_pool
> + *-----------------------------------------------------------
> + */
> +
> +static void
> +pi_sinkref_pool_init(struct pi_sinkref_pool *pool)
> +{
> + int i;
> +
> + INIT_LIST_HEAD(&pool->free);
> +
> + for (i = 0; i < MAX_PI_DEPENDENCIES; ++i) {
> + struct pi_sinkref *sinkref = &pool->data[i];
> +
> + memset(sinkref, 0, sizeof(*sinkref));
> + INIT_LIST_HEAD(&sinkref->list);
> + list_add_tail(&sinkref->list, &pool->free);
> + }
> +}
> +
> +static struct pi_sinkref *
> +pi_sinkref_alloc(struct pi_sinkref_pool *pool)
> +{
> + struct pi_sinkref *sinkref;
> +
> + if (list_empty(&pool->free))
> + return NULL;
> +
> + sinkref = list_first_entry(&pool->free, struct pi_sinkref, list);
> + list_del(&sinkref->list);
> + memset(sinkref, 0, sizeof(*sinkref));
> +
> + return sinkref;
> +}
> +
> +static void
> +pi_sinkref_free(struct pi_sinkref_pool *pool,
> + struct pi_sinkref *sinkref)
> +{
> + list_add_tail(&sinkref->list, &pool->free);
> +}
> +
> +/*
> + *-----------------------------------------------------------
> + * pi_sinkref
> + *-----------------------------------------------------------
> + */
> +
> +static inline void
> +_pi_sink_get(struct pi_sinkref *sinkref)
> +{
> + atomic_inc(&sinkref->sink->refs);
> + atomic_inc(&sinkref->refs);
> +}
> +
> +static inline void
> +_pi_sink_put_local(struct pi_node *node, struct pi_sinkref *sinkref)
> +{
> + if (atomic_dec_and_lock(&sinkref->refs, &node->lock)) {
> + list_del(&sinkref->list);
> + pi_sinkref_free(&node->sinkref_pool, sinkref);
> + spin_unlock(&node->lock);
> + }
> +}
> +
> +static inline void
> +_pi_sink_put_all(struct pi_node *node, struct pi_sinkref *sinkref)
> +{
> + struct pi_sink *sink = sinkref->sink;
> +
> + _pi_sink_put_local(node, sinkref);
> + pi_sink_put(sink, 0);
> +}
> +
> +/*
> + *-----------------------------------------------------------
> + * pi_node
> + *-----------------------------------------------------------
> + */
> +
> +static struct pi_node *node_of(struct pi_sink *sink)
> +{
> + return container_of(sink, struct pi_node, sink);
> +}
> +
> +static inline void
> +__pi_boost(struct pi_node *node, struct pi_source *src)
> +{
> + BUG_ON(src->boosted);
> +
> + plist_node_init(&src->list, *src->prio);
> + plist_add(&src->list, &node->srcs);
> + src->boosted = 1;
> +}
> +
> +static inline void
> +__pi_deboost(struct pi_node *node, struct pi_source *src)
> +{
> + BUG_ON(!src->boosted);
> +
> + plist_del(&src->list, &node->srcs);
> + src->boosted = 0;
> +}
> +
> +/*
> + * _pi_node_update - update the chain
> + *
> + * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
> + * that need to propagate up the chain. This is a step-wise process where we
> + * have to be careful about locking and preemption. By trying MAX_PI_DEPs
> + * times, we guarantee that this update routine is an effective barrier...
> + * all modifications made prior to the call to this barrier will have completed.
> + *
> + * Deadlock avoidance: This node may participate in a chain of nodes which
> + * form a graph of arbitrary structure. While the graph should technically
> + * never close on itself barring any bugs, we still want to protect against
> + * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
> + * from detecting this potential). To do this, we employ a dual-locking
> + * scheme where we can carefully control the order. That is: node->lock
> + * protects most of the node's internal state, but it will never be held
> + * across a chain update. sinkref->lock, on the other hand, can be held
> + * across a boost/deboost, and also guarantees proper execution order. Also
> + * note that no locks are held across an sink->update.
> + */
> +static int
> +_pi_node_update(struct pi_sink *sink, unsigned int flags)
> +{
> + struct pi_node *node = node_of(sink);
> + struct pi_sinkref *sinkref;
> + unsigned long iflags;
> + int count = 0;
> + int i;
> + int pprio;
> + struct updater updaters[MAX_PI_DEPENDENCIES];
> +
> + spin_lock_irqsave(&node->lock, iflags);
> +
> + pprio = node->prio;
> +
> + if (!plist_head_empty(&node->srcs))
> + node->prio = plist_first(&node->srcs)->prio;
> + else
> + node->prio = MAX_PRIO;
> +
> + list_for_each_entry(sinkref, &node->sinks, list) {
> + /*
> + * If the priority is changing, or if this is a
> + * BOOST/DEBOOST, we consider this sink "stale"
> + */
> + if (pprio != node->prio
> + || sinkref->state != pi_state_boosted) {
> + struct updater *iter = &updaters[count++];
> +
> + BUG_ON(!atomic_read(&sinkref->sink->refs));
> + _pi_sink_get(sinkref);
> +
> + iter->update = 1;
> + iter->sinkref = sinkref;
> + iter->sink = sinkref->sink;
> + }
> + }
> +
> + spin_unlock(&node->lock);
> +
> + for (i = 0; i < count; ++i) {
> + struct updater *iter = &updaters[i];
> + unsigned int lflags = PI_FLAG_DEFER_UPDATE;
> + struct pi_sink *sink;
> +
> + sinkref = iter->sinkref;
> + sink = iter->sink;
> +
> + spin_lock(&sinkref->lock);
> +
> + switch (sinkref->state) {
> + case pi_state_boost:
> + sinkref->state = pi_state_boosted;
> + /* Fall through */
> + case pi_state_boosted:
> + sink->ops->boost(sink, &sinkref->src, lflags);
> + break;
> + case pi_state_deboost:
> + sink->ops->deboost(sink, &sinkref->src, lflags);
> + sinkref->state = pi_state_free;
> +
> + /*
> + * drop the ref that we took when the sinkref
> + * was allocated. We still hold a ref from
> + * above.
> + */
> + _pi_sink_put_all(node, sinkref);
> + break;
> + case pi_state_free:
> + iter->update = 0;
> + break;
> + default:
> + panic("illegal sinkref type: %d", sinkref->state);
> + }
> +
> + spin_unlock(&sinkref->lock);
> +
> + /*
> + * We will drop the sinkref reference while still holding the
> + * preempt/irqs off so that the memory is returned synchronously
> + * to the system.
> + */
> + _pi_sink_put_local(node, sinkref);
> + }
> +
> + local_irq_restore(iflags);
> +
> + /*
> + * Note: At this point, sinkref is invalid since we put'd
> + * it above, but sink is valid since we still hold the remote
> + * reference. This is key to the design because it allows us
> + * to synchronously free the sinkref object, yet maintain a
> + * reference to the sink across the update
> + */
> + for (i = 0; i < count; ++i) {
> + struct updater *iter = &updaters[i];
> +
> + if (iter->update)
> + iter->sink->ops->update(iter->sink, 0);
> + }
> +
> + /*
> + * We perform all the free operations together at the end, using
> + * only automatic/stack variables since any one of these operations
> + * could result in our node object being deallocated
> + */
> + for (i = 0; i < count; ++i) {
> + struct updater *iter = &updaters[i];
> +
> + pi_sink_put(iter->sink, 0);
> + }
> +
> + return 0;
> +}
> +
> +static int
> +_pi_del_sink(struct pi_node *node, struct pi_sink *sink, unsigned int flags)
> +{
> + struct pi_sinkref *sinkref;
> + struct updater updaters[MAX_PI_DEPENDENCIES];
> + unsigned long iflags;
> + int count = 0;
> + int i;
> +
> + local_irq_save(iflags);
> + spin_lock(&node->lock);
> +
> + list_for_each_entry(sinkref, &node->sinks, list) {
> + if (!sink || sink == sinkref->sink) {
> + struct updater *iter = &updaters[count++];
> +
> + _pi_sink_get(sinkref);
> + iter->sinkref = sinkref;
> + iter->sink = sinkref->sink;
> + }
> + }
> +
> + spin_unlock(&node->lock);
> +
> + for (i = 0; i < count; ++i) {
> + struct updater *iter = &updaters[i];
> + int remove = 0;
> +
> + sinkref = iter->sinkref;
> +
> + spin_lock(&sinkref->lock);
> +
> + switch (sinkref->state) {
> + case pi_state_boost:
> + /*
> + * This state indicates the sink was never formally
> + * boosted so we can just delete it immediately
> + */
> + remove = 1;
> + break;
> + case pi_state_boosted:
> + if (sinkref->sink->ops->deboost)
> + /*
> + * If the sink supports deboost notification,
> + * schedule it for deboost at the next update
> + */
> + sinkref->state = pi_state_deboost;
> + else
> + /*
> + * ..otherwise schedule it for immediate
> + * removal
> + */
> + remove = 1;
> + break;
> + default:
> + break;
> + }
> +
> + if (remove) {
> + /*
> + * drop the ref that we took when the sinkref
> + * was allocated. We still hold a ref from
> + * above
> + */
> + _pi_sink_put_all(node, sinkref);
> + sinkref->state = pi_state_free;
> + }
> +
> + spin_unlock(&sinkref->lock);
> +
> + _pi_sink_put_local(node, sinkref);
> + }
> +
> + local_irq_restore(iflags);
> +
> + for (i = 0; i < count; ++i)
> + pi_sink_put(updaters[i].sink, 0);
> +
> + if (!(flags & PI_FLAG_DEFER_UPDATE))
> + _pi_node_update(&node->sink, 0);
> +
> + return 0;
> +}
> +
> +static int
> +_pi_node_boost(struct pi_sink *sink, struct pi_source *src,
> + unsigned int flags)
> +{
> + struct pi_node *node = node_of(sink);
> + unsigned long iflags;
> +
> + spin_lock_irqsave(&node->lock, iflags);
> + if (src->boosted)
> + __pi_deboost(node, src);
> + __pi_boost(node, src);
> + spin_unlock_irqrestore(&node->lock, iflags);
> +
> + if (!(flags & PI_FLAG_DEFER_UPDATE))
> + _pi_node_update(sink, 0);
> +
> + return 0;
> +}
> +
> +static int
> +_pi_node_deboost(struct pi_sink *sink, struct pi_source *src,
> + unsigned int flags)
> +{
> + struct pi_node *node = node_of(sink);
> + unsigned long iflags;
> +
> + spin_lock_irqsave(&node->lock, iflags);
> + __pi_deboost(node, src);
> + spin_unlock_irqrestore(&node->lock, iflags);
> +
> + if (!(flags & PI_FLAG_DEFER_UPDATE))
> + _pi_node_update(sink, 0);
> +
> + return 0;
> +}
> +
> +static int
> +_pi_node_free(struct pi_sink *sink, unsigned int flags)
> +{
> + struct pi_node *node = node_of(sink);
> +
> + /*
> + * When the node is freed, we should perform an implicit
> + * del_sink on any remaining sinks we may have.
> + */
> + return _pi_del_sink(node, NULL, flags);
> +}
> +
> +static struct pi_sink_ops pi_node_sink = {
> + .boost = _pi_node_boost,
> + .deboost = _pi_node_deboost,
> + .update = _pi_node_update,
> + .free = _pi_node_free,
> +};
> +
> +void
> +pi_node_init(struct pi_node *node)
> +{
> + spin_lock_init(&node->lock);
> + node->prio = MAX_PRIO;
> + atomic_set(&node->sink.refs, 1);
> + node->sink.ops = &pi_node_sink;
>
^^^^^^

Note to self: this should use pi_sink_init()


> + pi_sinkref_pool_init(&node->sinkref_pool);
> + INIT_LIST_HEAD(&node->sinks);
> + plist_head_init(&node->srcs, &node->lock);
> +}
> +
> +int
> +pi_add_sink(struct pi_node *node, struct pi_sink *sink, unsigned int flags)
> +{
> + struct pi_sinkref *sinkref;
> + int ret = 0;
> + unsigned long iflags;
> +
> + spin_lock_irqsave(&node->lock, iflags);
> +
> + if (!atomic_read(&node->sink.refs)) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + sinkref = pi_sinkref_alloc(&node->sinkref_pool);
> + if (!sinkref) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + spin_lock_init(&sinkref->lock);
> + INIT_LIST_HEAD(&sinkref->list);
> +
> + if (flags & PI_FLAG_ALREADY_BOOSTED)
> + sinkref->state = pi_state_boosted;
> + else
> + /*
> + * Schedule it for addition at the next update
> + */
> + sinkref->state = pi_state_boost;
> +
> + pi_source_init(&sinkref->src, &node->prio);
> + sinkref->sink = sink;
> +
> + /* set one ref from ourselves. It will be dropped on del_sink */
> + atomic_inc(&sinkref->sink->refs);
> + atomic_set(&sinkref->refs, 1);
> +
> + list_add_tail(&sinkref->list, &node->sinks);
> +
> + spin_unlock_irqrestore(&node->lock, iflags);
> +
> + if (!(flags & PI_FLAG_DEFER_UPDATE))
> + _pi_node_update(&node->sink, 0);
> +
> + return 0;
> +
> + out:
> + spin_unlock_irqrestore(&node->lock, iflags);
> +
> + return ret;
> +}
> +
> +int
> +pi_del_sink(struct pi_node *node, struct pi_sink *sink, unsigned int flags)
> +{
> + /*
> + * There may be multiple matches to sink because sometimes a
> + * deboost/free may still be pending an update when the same
> + * node has been added. So we want to process any and all
> + * instances that match our target
> + */
> + return _pi_del_sink(node, sink, flags);
> +}
> +
> +
> +
>
>

2008-08-16 16:00:54

by Matthias Behr

[permalink] [raw]
Subject: AW: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

Hi Greg,

I got a few review comments/questions. Pls see below.

Best Regards,
Matthias

P.S. I'm a kernel newbie so don't hesitate to tell me if I'm wrong ;-)

> +/**
> + * pi_sink_init - initialize a pi_sink before use
> + * @sink: a sink context
> + * @ops: pointer to an pi_sink_ops structure
> + */
> +static inline void
> +pi_sink_init(struct pi_sink *sink, struct pi_sink_ops *ops)
> +{
> + atomic_set(&sink->refs, 0);
> + sink->ops = ops;
> +}

Shouldn't ops be tested for 0 here? (ASSERT/BUG_ON/...) It gets dereferenced later quite often in the form "if (sink->ops->...)".

> +/**
> + * pi_sink_put - down the reference count, freeing the sink if 0
> + * @node: the node context
> + * @flags: optional flags to modify behavior. Reserved, must be 0.
> + *
> + * Returns: none
> + */
> +static inline void
> +pi_sink_put(struct pi_sink *sink, unsigned int flags)
> +{
> + if (atomic_dec_and_test(&sink->refs)) {
> + if (sink->ops->free)
> + sink->ops->free(sink, flags);
> + }
> +}

Shouldn't the atomic/locked part cover the ...->free(...) as well? A pi_get right after the atomic_dec_and_test but before the free() could lead to a free() with refs>0?


2008-08-16 19:56:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

On Fri, 2008-08-15 at 16:28 -0400, Gregory Haskins wrote:
> The kernel currently addresses priority-inversion through priority-
> inheritence. However, all of the priority-inheritence logic is
> integrated into the Real-Time Mutex infrastructure. This causes a few
> problems:
>
> 1) This tightly coupled relationship makes it difficult to extend to
> other areas of the kernel (for instance, pi-aware wait-queues may
> be desirable).
> 2) Enhancing the rtmutex infrastructure becomes challenging because
> there is no seperation between the locking code, and the pi-code.
>
> This patch aims to rectify these shortcomings by designing a stand-alone
> pi framework which can then be used to replace the rtmutex-specific
> version. The goal of this framework is to provide similar functionality
> to the existing subsystem, but with sole focus on PI and the
> relationships between objects that can boost priority, and the objects
> that get boosted.
>
> We introduce the concept of a "pi_source" and a "pi_sink", where, as the
> name suggests provides the basic relationship of a priority source, and
> its boosted target. A pi_source acts as a reference to some arbitrary
> source of priority, and a pi_sink can be boosted (or deboosted) by
> a pi_source. For more details, please read the library documentation.
>
> There are currently no users of this inteface.

You should have started out by discussing your design - the document
just rambles a bit about some implementation details - it doesn't talk
about how it maps to the PI problem space.

Anyway - from what I can make of the code, you managed to convert the pi
graph walking code that used to be in rt_mutex_adjust_prio_chain() and
was iterative, into a recursive function call.

Not something you should do lightly..

2008-08-19 08:37:20

by Gregory Haskins

[permalink] [raw]
Subject: Re: AW: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

Hi Matthias,

Matthias Behr wrote:
> Hi Greg,
>
> I got a few review comments/questions. Pls see below.
>
> Best Regards,
> Matthias
>
> P.S. I'm a kernel newbie so don't hesitate to tell me if I'm wrong ;-)
>
>
>> +/**
>> + * pi_sink_init - initialize a pi_sink before use
>> + * @sink: a sink context
>> + * @ops: pointer to an pi_sink_ops structure
>> + */
>> +static inline void
>> +pi_sink_init(struct pi_sink *sink, struct pi_sink_ops *ops)
>> +{
>> + atomic_set(&sink->refs, 0);
>> + sink->ops = ops;
>> +}
>>
>
> Shouldn't ops be tested for 0 here? (ASSERT/BUG_ON/...) (get's dereferenced later quite often in the form "if (sink->ops->...)".
>

This is a good idea. I will add this.

>
>> +/**
>> + * pi_sink_put - down the reference count, freeing the sink if 0
>> + * @node: the node context
>> + * @flags: optional flags to modify behavior. Reserved, must be 0.
>> + *
>> + * Returns: none
>> + */
>> +static inline void
>> +pi_sink_put(struct pi_sink *sink, unsigned int flags)
>> +{
>> + if (atomic_dec_and_test(&sink->refs)) {
>> + if (sink->ops->free)
>> + sink->ops->free(sink, flags);
>> + }
>> +}
>>
>
> Shouldn't the atomic/locked part cover the ...->free(...) as well?

Actually, it already does. The free can only be called by the last
reference dropping the ref-count.


> A pi_get right after the atomic_dec_and_test but before the free() could lead to a free() with refs>0?
>

A pi_get() issued after the ref count could already have dropped to zero
indicates breakage at a higher layer. E.g. the caller of pi_get() has to
ensure that there are no races against the reference dropping to begin
with. This is the same as any reference-counted object (for instance,
see get_task_struct()).
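
To put that in code form (a sketch against the pi.h interfaces; "my_node"
and the function are invented for illustration):

static struct pi_node my_node;

static void refcount_sketch(void)
{
        pi_node_init(&my_node);         /* sink.refs = 1, owned by the creator */

        pi_get(&my_node, 0);            /* legal: taken while a ref is known
                                         * to be held */
        /* ... use the node ... */
        pi_put(&my_node, 0);            /* drop our extra reference */

        pi_put(&my_node, 0);            /* drop the creator's reference;
                                         * node->sink.ops->free() may run here */

        /*
         * A pi_get() at this point would be a bug in the *caller*: the
         * object may already be gone, and no locking inside pi_sink_put()
         * could repair that; this is just like get_task_struct().
         */
}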

Thanks for the review, Matthias!

Regards,
-Greg



Attachments:
signature.asc (257.00 B)
OpenPGP digital signature

2008-08-19 08:42:44

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

Hi Peter,

Peter Zijlstra wrote:
> On Fri, 2008-08-15 at 16:28 -0400, Gregory Haskins wrote:
>
>> The kernel currently addresses priority-inversion through priority-
>> inheritence. However, all of the priority-inheritence logic is
>> integrated into the Real-Time Mutex infrastructure. This causes a few
>> problems:
>>
>> 1) This tightly coupled relationship makes it difficult to extend to
>> other areas of the kernel (for instance, pi-aware wait-queues may
>> be desirable).
>> 2) Enhancing the rtmutex infrastructure becomes challenging because
>> there is no seperation between the locking code, and the pi-code.
>>
>> This patch aims to rectify these shortcomings by designing a stand-alone
>> pi framework which can then be used to replace the rtmutex-specific
>> version. The goal of this framework is to provide similar functionality
>> to the existing subsystem, but with sole focus on PI and the
>> relationships between objects that can boost priority, and the objects
>> that get boosted.
>>
>> We introduce the concept of a "pi_source" and a "pi_sink", where, as the
>> name suggests provides the basic relationship of a priority source, and
>> its boosted target. A pi_source acts as a reference to some arbitrary
>> source of priority, and a pi_sink can be boosted (or deboosted) by
>> a pi_source. For more details, please read the library documentation.
>>
>> There are currently no users of this inteface.
>>
>
> You should have started out by discussing your design - the document
> just rambles a bit about some implementation details - it doesn't talk
> about how it maps to the PI problem space.
>

The doc is still a work-in-progress, but point taken ;) I will address
this shortly.


> Anyway - from what I can make of the code, you managed to convert the pi
> graph walking code that used to be in rt_mutex_adjust_prio_chain() and
> was iterative, into a recursive function call.
>
> Not something you should do lightly..
>

As we discussed on IRC yesterday, you are correct here. I was thinking
that the graph couldn't get deeper than a few dozen entries, but I
forgot about userspace futex access. But, this is precisely what the
"release early" policy is designed to catch ;)

I think I can make a slight adjustment to the model to return it to an
iterative design. I will address this in v5.

Thanks for the review, Peter!

Regards,
-Greg





Attachments:
signature.asc (257.00 B)
OpenPGP digital signature

2008-08-22 12:55:52

by Esben Nielsen

[permalink] [raw]
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

Disclaimer: I am no longer actively involved and I must admit I might
have lost out on much of what has been going on since I contributed to
the PI system 2 years ago. But I allow myself to comment anyway.

On Fri, Aug 15, 2008 at 10:28 PM, Gregory Haskins <[email protected]> wrote:
> The kernel currently addresses priority-inversion through priority-
> inheritence. However, all of the priority-inheritence logic is
> integrated into the Real-Time Mutex infrastructure. This causes a few
> problems:
>
> 1) This tightly coupled relationship makes it difficult to extend to
> other areas of the kernel (for instance, pi-aware wait-queues may
> be desirable).
> 2) Enhancing the rtmutex infrastructure becomes challenging because
> there is no seperation between the locking code, and the pi-code.
>
> This patch aims to rectify these shortcomings by designing a stand-alone
> pi framework which can then be used to replace the rtmutex-specific
> version. The goal of this framework is to provide similar functionality
> to the existing subsystem, but with sole focus on PI and the
> relationships between objects that can boost priority, and the objects
> that get boosted.

This is really a good idea. When I had time (2 years ago) to actively
work on these problems, I also came to the conclusion that PI should be
more general than just the rtmutex. Preemptive RCU was the example which
drove it.

But I do disagree that general objects should get boosted: the end
targets are always tasks. The objects might be boosted as intermediate
steps, but in the end priority only applies to tasks.

I also have a few comments to the actual design:

> ....
> +
> +Multiple sinks per Node:
> +
> +We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation which had the notion of only a single sink (i.e. "task->pi_blocked_on"). The reason why we added the ability to add more than one sink was not to change the default chaining model (I.e. multiple boost targets), but rather to add a flexible notification mechanism that is peripheral to the chain, which are informally called "leaf sinks".
> +
> +Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints to a priority boosting. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary. Whereas an rwlock leaf-sink may boost a list of reader-owners.

This is bad from an RT point of view: you have a hard time determining
the number of sinks per node. An rw-lock could have an arbitrary number
of readers (it is supposed to, really). Therefore you have no chance of
knowing how long the boost/deboost operation will take. And you also
don't know for how long the boosted tasks stay boosted. If there can be
an arbitrary number of such tasks, you can no longer be deterministic.

> ...
> +
> +#define MAX_PI_DEPENDENCIES 5


WHAT??? There is a finite lock depth defined. I know we did that
originally but it wasn't hardcoded (as far as I remember) and
it was certainly not as low as 5.

Remember: PI is used by the user space futexes as well!

> ....
> +/*
> + * _pi_node_update - update the chain
> + *
> + * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
> + * that need to propagate up the chain. This is a step-wise process where we
> + * have to be careful about locking and preemption. By trying MAX_PI_DEPs
> + * times, we guarantee that this update routine is an effective barrier...
> + * all modifications made prior to the call to this barrier will have completed.
> + *
> + * Deadlock avoidance: This node may participate in a chain of nodes which
> + * form a graph of arbitrary structure. While the graph should technically
> + * never close on itself barring any bugs, we still want to protect against
> + * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
> + * from detecting this potential). To do this, we employ a dual-locking
> + * scheme where we can carefully control the order. That is: node->lock
> + * protects most of the node's internal state, but it will never be held
> + * across a chain update. sinkref->lock, on the other hand, can be held
> + * across a boost/deboost, and also guarantees proper execution order. Also
> + * note that no locks are held across an sink->update.
> + */
> +static int
> +_pi_node_update(struct pi_sink *sink, unsigned int flags)
> +{
> + struct pi_node *node = node_of(sink);
> + struct pi_sinkref *sinkref;
> + unsigned long iflags;
> + int count = 0;
> + int i;
> + int pprio;
> + struct updater updaters[MAX_PI_DEPENDENCIES];
> +
> + spin_lock_irqsave(&node->lock, iflags);
> +
> + pprio = node->prio;
> +
> + if (!plist_head_empty(&node->srcs))
> + node->prio = plist_first(&node->srcs)->prio;
> + else
> + node->prio = MAX_PRIO;
> +
> + list_for_each_entry(sinkref, &node->sinks, list) {
> + /*
> + * If the priority is changing, or if this is a
> + * BOOST/DEBOOST, we consider this sink "stale"
> + */
> + if (pprio != node->prio
> + || sinkref->state != pi_state_boosted) {
> + struct updater *iter = &updaters[count++];

What prevents count from overrunning?

> +
> + BUG_ON(!atomic_read(&sinkref->sink->refs));
> + _pi_sink_get(sinkref);
> +
> + iter->update = 1;
> + iter->sinkref = sinkref;
> + iter->sink = sinkref->sink;
> + }
> + }
> +
> + spin_unlock(&node->lock);
> +
> + for (i = 0; i < count; ++i) {
> + struct updater *iter = &updaters[i];
> + unsigned int lflags = PI_FLAG_DEFER_UPDATE;
> + struct pi_sink *sink;
> +
> + sinkref = iter->sinkref;
> + sink = iter->sink;
> +
> + spin_lock(&sinkref->lock);
> +
> + switch (sinkref->state) {
> + case pi_state_boost:
> + sinkref->state = pi_state_boosted;
> + /* Fall through */
> + case pi_state_boosted:
> + sink->ops->boost(sink, &sinkref->src, lflags);
> + break;
> + case pi_state_deboost:
> + sink->ops->deboost(sink, &sinkref->src, lflags);
> + sinkref->state = pi_state_free;
> +
> + /*
> + * drop the ref that we took when the sinkref
> + * was allocated. We still hold a ref from
> + * above.
> + */
> + _pi_sink_put_all(node, sinkref);
> + break;
> + case pi_state_free:
> + iter->update = 0;
> + break;
> + default:
> + panic("illegal sinkref type: %d", sinkref->state);
> + }
> +
> + spin_unlock(&sinkref->lock);
> +
> + /*
> + * We will drop the sinkref reference while still holding the
> + * preempt/irqs off so that the memory is returned synchronously
> + * to the system.
> + */
> + _pi_sink_put_local(node, sinkref);
> + }
> +
> + local_irq_restore(iflags);

Yack! You keep interrupts off while doing the chain. I think my main
contribution to the PI system 2 years ago was to do this preemptively.
I.e. there were points in the loop where interrupts and preemption
were turned on.

Remember: it goes into user space again. An evil user could craft an
application with a very long lock depth and keep higher-priority
real-time tasks from running for an arbitrarily long time (if no limit
on the lock depth is set, which is bad because it will be too low in
some cases).

But as I said, I have had no time to watch what has actually been going
on in the kernel for roughly the last 2 years. The said defects might
have crept in via other contributors already :-(

Esben

2008-08-22 13:17:36

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

Hi Esben,
Thank you for the review. Comments inline.

Esben Nielsen wrote:
> Disclaimer: I am no longer actively involved and I must admit I might
> have lost out on much of
> what have been going on since I contributed to the PI system 2 years
> ago. But I allow myself to comment
> anyway.
>
> On Fri, Aug 15, 2008 at 10:28 PM, Gregory Haskins <[email protected]> wrote:
>
>> The kernel currently addresses priority-inversion through priority-
>> inheritence. However, all of the priority-inheritence logic is
>> integrated into the Real-Time Mutex infrastructure. This causes a few
>> problems:
>>
>> 1) This tightly coupled relationship makes it difficult to extend to
>> other areas of the kernel (for instance, pi-aware wait-queues may
>> be desirable).
>> 2) Enhancing the rtmutex infrastructure becomes challenging because
>> there is no seperation between the locking code, and the pi-code.
>>
>> This patch aims to rectify these shortcomings by designing a stand-alone
>> pi framework which can then be used to replace the rtmutex-specific
>> version. The goal of this framework is to provide similar functionality
>> to the existing subsystem, but with sole focus on PI and the
>> relationships between objects that can boost priority, and the objects
>> that get boosted.
>>
>
> This is really a good idea. When I had time (2 years ago) to actively
> work on these problem
> I also came to the conclusion that PI should be more general than just
> the rtmutex. Preemptive RCU
> was the example which drove it.
>
> But I do disagree that general objects should get boosted: The end
> targets are always tasks. The objects might
> be boosted as intermediate steps, but priority end the only applies to tasks.
>
Actually, I fully agree with you here. It's probably just poor wording on
my part, but this is exactly what happens. We may "boost" arbitrary
objects on the way to boosting a task...but the intermediate objects are
just there to help find our way to the proper tasks. Ultimately
everything ends up at the scheduler eventually ;)
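
Purely as a hypothetical illustration of that point (neither "task_pi_leaf"
nor its ops exist in this series; the real task-side integration lives in
the later patches), a task leaf-sink could look roughly like this:

#include <linux/sched.h>
#include <linux/pi.h>

struct task_pi_leaf {
        struct pi_sink          sink;
        struct task_struct      *task;
        int                     prio;   /* latest boosted priority */
};

static int task_leaf_boost(struct pi_sink *sink, struct pi_source *src,
                           unsigned int flags)
{
        struct task_pi_leaf *leaf = container_of(sink, struct task_pi_leaf, sink);

        leaf->prio = *src->prio;        /* just record the referenced value */
        return 0;
}

static int task_leaf_update(struct pi_sink *sink, unsigned int flags)
{
        /*
         * container_of(sink, struct task_pi_leaf, sink) gives the leaf;
         * this is where the scheduler would be asked to apply leaf->prio
         * to leaf->task and reschedule it if necessary.
         */
        return 0;
}

static struct pi_sink_ops task_leaf_ops = {
        .boost  = task_leaf_boost,
        .update = task_leaf_update,
};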

> I also have a few comments to the actual design:
>
>
>> ....
>> +
>> +Multiple sinks per Node:
>> +
>> +We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation which had the notion of only a single sink (i.e. "task->pi_blocked_on"). The reason why we added the ability to add more than one sink was not to change the default chaining model (I.e. multiple boost targets), but rather to add a flexible notification mechanism that is peripheral to the chain, which are informally called "leaf sinks".
>> +
>> +Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints to a priority boosting. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary. Whereas an rwlock leaf-sink may boost a list of reader-owners.
>>
>
> This is bad from a RT point of view: You have a hard time determininig
> the number of sinks per node. An rw-lock could have an arbitrary
> number of readers (is supposed to really). Therefore
> you have no chance of knowing how long the boost/deboost operation
> will take. And you also know for how long the boosted tasks stay
> boosted. If there can be an arbitrary number of
> such tasks you can no longer be deterministic.
>

While you may have a valid concern about what rwlocks can do to
determinism, note that we already had PI-enabled rwlocks before my
patch, so I am neither increasing nor decreasing determinism in this
regard. That being said, Steven Rostedt (author of the pi-rwlocks,
CC'd) has facilities to manage this (such as limiting the number of
readers to num_online_cpus) which this design would retain. Long story
short, I do not believe I have made anything worse here, so this is a
different discussion if you are still concerned.


>
>> ...
>> +
>> +#define MAX_PI_DEPENDENCIES 5
>>
>
>
> WHAT??? There is a finite lock depth defined. I know we did that
> originally but it wasn't hardcoded (as far as I remember) and
> it was certainly not as low as 5.
>

Note that this is simply in reference to how many direct sinks you can
link to a node, not how long the resulting chain can grow. The chain
depth is actually completely unconstrained by the design. I chose "5"
here because typically we need 1 sink for the next link in the chain,
and 1 sink for local notifications. The other 3 are there for head-room
(we often hit 3-4 as we transition between nodes: add one node -> delete
another, etc.).

You are not the first to comment about this, however, so it makes me
realize it is not very clear ;) I will comment the code better.
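
To make the distinction concrete (purely a sketch with invented names,
none of this is in the patches): per-node sinks are bounded by
MAX_PI_DEPENDENCIES, but chain depth is not, because every hop is a
separate node with its own sinkref pool.

#include <linux/pi.h>

static struct pi_node chain[16];        /* a 16-deep chain is perfectly legal */

static void build_chain_sketch(void)
{
        int i;

        for (i = 0; i < 16; i++)
                pi_node_init(&chain[i]);

        /*
         * Each node gets exactly ONE downstream sink here, well inside
         * the per-node limit of 5, yet the resulting chain is 16 nodes
         * deep.  The depth limit, if any, comes from elsewhere.
         */
        for (i = 0; i < 15; i++)
                pi_add_sink(&chain[i], &chain[i + 1].sink,
                            PI_FLAG_DEFER_UPDATE);
}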


> Remember: PI is used by the user space futeces as well!
>

Yes, and on a slight tangent from your point, this incidentally is
actually a problem in the design such that I need to respin at least a
v5. My current design uses recursion against the sink->update()
methods, which Peter Zijlstra pointed out would blow up with large
userspace chains. My next version will forgo the recursion in favor of
an iterative method more reminiscent of the original design.

>
>> ....
>> +/*
>> + * _pi_node_update - update the chain
>> + *
>> + * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
>> + * that need to propagate up the chain. This is a step-wise process where we
>> + * have to be careful about locking and preemption. By trying MAX_PI_DEPs
>> + * times, we guarantee that this update routine is an effective barrier...
>> + * all modifications made prior to the call to this barrier will have completed.
>> + *
>> + * Deadlock avoidance: This node may participate in a chain of nodes which
>> + * form a graph of arbitrary structure. While the graph should technically
>> + * never close on itself barring any bugs, we still want to protect against
>> + * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
>> + * from detecting this potential). To do this, we employ a dual-locking
>> + * scheme where we can carefully control the order. That is: node->lock
>> + * protects most of the node's internal state, but it will never be held
>> + * across a chain update. sinkref->lock, on the other hand, can be held
>> + * across a boost/deboost, and also guarantees proper execution order. Also
>> + * note that no locks are held across an sink->update.
>> + */
>> +static int
>> +_pi_node_update(struct pi_sink *sink, unsigned int flags)
>> +{
>> + struct pi_node *node = node_of(sink);
>> + struct pi_sinkref *sinkref;
>> + unsigned long iflags;
>> + int count = 0;
>> + int i;
>> + int pprio;
>> + struct updater updaters[MAX_PI_DEPENDENCIES];
>> +
>> + spin_lock_irqsave(&node->lock, iflags);
>> +
>> + pprio = node->prio;
>> +
>> + if (!plist_head_empty(&node->srcs))
>> + node->prio = plist_first(&node->srcs)->prio;
>> + else
>> + node->prio = MAX_PRIO;
>> +
>> + list_for_each_entry(sinkref, &node->sinks, list) {
>> + /*
>> + * If the priority is changing, or if this is a
>> + * BOOST/DEBOOST, we consider this sink "stale"
>> + */
>> + if (pprio != node->prio
>> + || sinkref->state != pi_state_boosted) {
>> + struct updater *iter = &updaters[count++];
>>
>
> What prevents count from overrun?
>

The node->sinks list will never have more than MAX_PI_DEPs in it, by design.

>
>> +
>> + BUG_ON(!atomic_read(&sinkref->sink->refs));
>> + _pi_sink_get(sinkref);
>> +
>> + iter->update = 1;
>> + iter->sinkref = sinkref;
>> + iter->sink = sinkref->sink;
>> + }
>> + }
>> +
>> + spin_unlock(&node->lock);
>> +
>> + for (i = 0; i < count; ++i) {
>> + struct updater *iter = &updaters[i];
>> + unsigned int lflags = PI_FLAG_DEFER_UPDATE;
>> + struct pi_sink *sink;
>> +
>> + sinkref = iter->sinkref;
>> + sink = iter->sink;
>> +
>> + spin_lock(&sinkref->lock);
>> +
>> + switch (sinkref->state) {
>> + case pi_state_boost:
>> + sinkref->state = pi_state_boosted;
>> + /* Fall through */
>> + case pi_state_boosted:
>> + sink->ops->boost(sink, &sinkref->src, lflags);
>> + break;
>> + case pi_state_deboost:
>> + sink->ops->deboost(sink, &sinkref->src, lflags);
>> + sinkref->state = pi_state_free;
>> +
>> + /*
>> + * drop the ref that we took when the sinkref
>> + * was allocated. We still hold a ref from
>> + * above.
>> + */
>> + _pi_sink_put_all(node, sinkref);
>> + break;
>> + case pi_state_free:
>> + iter->update = 0;
>> + break;
>> + default:
>> + panic("illegal sinkref type: %d", sinkref->state);
>> + }
>> +
>> + spin_unlock(&sinkref->lock);
>> +
>> + /*
>> + * We will drop the sinkref reference while still holding the
>> + * preempt/irqs off so that the memory is returned synchronously
>> + * to the system.
>> + */
>> + _pi_sink_put_local(node, sinkref);
>> + }
>> +
>> + local_irq_restore(iflags);
>>
>
> Yack! You keep interrupts off while doing the chain.

Actually, not quite. The first pass (with interrupts off) simply sets
the new priority value at each local element (limited to 5, typically
1-2). Short and sweet. It's the "update" that happens next (with
interrupts/preemption enabled) that updates the chain.
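
Schematically it looks something like this (a rough sketch only; the
helper names here are invented for illustration, the real code is the
_pi_node_update() you quoted, and PI_FLAG_DEFER_UPDATE is what defers
the chain walk):

static void node_update_sketch(struct pi_node_sketch *node)
{
        unsigned long flags;

        /* Pass 1: interrupts off, but strictly local and bounded work */
        local_irq_save(flags);
        recompute_local_prio(node);             /* scans <= 5 srcs/sinks */
        boost_deboost_stale_sinks(node, PI_FLAG_DEFER_UPDATE);
        local_irq_restore(flags);

        /*
         * Pass 2: the deferred sink->update() on the next node in the
         * chain runs from here, with interrupts and preemption enabled,
         * and repeats the same bounded local step one hop further on.
         */
        run_deferred_updates(node);
}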


> I think my main
> contribution to the PI system 2 years ago was to do this preemptively.
> I.e. there were points in the loop where interrupts and preemption
> were turned on.
>

I agree this is important, but I think you will see with further review
that this is in fact what I do too.

> Remember: It goes into user space again. An evil user could craft an
> application with a very long lock depth and keep higher-priority
> real-time tasks from running for an arbitrarily long time (if no limit
> on the lock depth is set, which is bad because it will be too low in
> some cases).
>
> But as I said I have had no time to watch what has actually been going
> on in the kernel for roughly the last 2 years. Such defects might
> have crept in via other contributors already :-(
>
> Esben
>

Esben,
Your review and insight are very much appreciated. I will be sure to
address the concerns mentioned above and CC you on the next release.

Thanks again,
-Greg




2008-08-22 13:17:55

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface



On Fri, 22 Aug 2008, Esben Nielsen wrote:

> Disclaimer: I am no longer actively involved and I must admit I might
> have lost track of much of what has been going on since I contributed
> to the PI system 2 years ago. But I allow myself to comment anyway.

Esben, you are always welcome. You are one of the copyright owners of
rtmutex.c ;-)

>
> On Fri, Aug 15, 2008 at 10:28 PM, Gregory Haskins <[email protected]> wrote:
> > The kernel currently addresses priority-inversion through priority-
> > inheritence. However, all of the priority-inheritence logic is
> > integrated into the Real-Time Mutex infrastructure. This causes a few
> > problems:
> >
> > 1) This tightly coupled relationship makes it difficult to extend to
> > other areas of the kernel (for instance, pi-aware wait-queues may
> > be desirable).
> > 2) Enhancing the rtmutex infrastructure becomes challenging because
> > there is no seperation between the locking code, and the pi-code.
> >
> > This patch aims to rectify these shortcomings by designing a stand-alone
> > pi framework which can then be used to replace the rtmutex-specific
> > version. The goal of this framework is to provide similar functionality
> > to the existing subsystem, but with sole focus on PI and the
> > relationships between objects that can boost priority, and the objects
> > that get boosted.
>
> This is really a good idea. When I had time (2 years ago) to actively
> work on these problems I also came to the conclusion that PI should be
> more general than just the rtmutex. Preemptive RCU was the example
> which drove it.
>
> But I do disagree that general objects should get boosted: the end
> targets are always tasks. The objects might be boosted as intermediate
> steps, but priority in the end only applies to tasks.
>
> I also have a few comments to the actual design:
>
> > ....
> > +
> > +Multiple sinks per Node:
> > +
> > +We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation which had the notion of only a single sink (i.e. "task->pi_blocked_on"). The reason why we added the ability to add more than one sink was not to change the default chaining model (I.e. multiple boost targets), but rather to add a flexible notification mechanism that is peripheral to the chain, which are informally called "leaf sinks".
> > +
> > +Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints to a priority boosting. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary. Whereas an rwlock leaf-sink may boost a list of reader-owners.
>
> This is bad from an RT point of view: you have a hard time determining
> the number of sinks per node. An rw-lock could have an arbitrary
> number of readers (and is supposed to, really). Therefore you have no
> chance of knowing how long the boost/deboost operation will take. And
> you also don't know for how long the boosted tasks stay boosted. If
> there can be an arbitrary number of such tasks you can no longer be
> deterministic.
>
> > ...
> > +
> > +#define MAX_PI_DEPENDENCIES 5
>
>
> WHAT??? There is a finite lock depth defined. I know we did that
> originally but it wasn't hardcoded (as far as I remember) and
> it was certainly not as low as 5.

Yeah, I believe our number is 1024; it is not hardcoded, and it is there
to detect recursive locks.
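
For reference, that check lives in the chain walk in kernel/rtmutex.c;
paraphrased from memory (not an exact quote, and I believe the limit is
also exposed as a sysctl):

/* Max number of hops we will walk up the boosting chain: */
int max_lock_depth = 1024;

        /* ... inside rt_mutex_adjust_prio_chain()'s walk ... */
        if (++depth > max_lock_depth) {
                /* warn about the task that hit the limit, then give up */
                return deadlock_detect ? -EDEADLK : 0;
        }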

>
> Remember: PI is used by user-space futexes as well!

I haven't looked too hard at this code yet, but this may only be
kernel-related for multiple owners (see my explanation below).

>
> > ....
> > +/*
> > + * _pi_node_update - update the chain
> > + *
> > + * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
> > + * that need to propagate up the chain. This is a step-wise process where we
> > + * have to be careful about locking and preemption. By trying MAX_PI_DEPs
> > + * times, we guarantee that this update routine is an effective barrier...
> > + * all modifications made prior to the call to this barrier will have completed.
> > + *
> > + * Deadlock avoidance: This node may participate in a chain of nodes which
> > + * form a graph of arbitrary structure. While the graph should technically
> > + * never close on itself barring any bugs, we still want to protect against
> > + * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
> > + * from detecting this potential). To do this, we employ a dual-locking
> > + * scheme where we can carefully control the order. That is: node->lock
> > + * protects most of the node's internal state, but it will never be held
> > + * across a chain update. sinkref->lock, on the other hand, can be held
> > + * across a boost/deboost, and also guarantees proper execution order. Also
> > + * note that no locks are held across an sink->update.
> > + */
> > +static int
> > +_pi_node_update(struct pi_sink *sink, unsigned int flags)
> > +{
> > + struct pi_node *node = node_of(sink);
> > + struct pi_sinkref *sinkref;
> > + unsigned long iflags;
> > + int count = 0;
> > + int i;
> > + int pprio;
> > + struct updater updaters[MAX_PI_DEPENDENCIES];
> > +
> > + spin_lock_irqsave(&node->lock, iflags);
> > +
> > + pprio = node->prio;
> > +
> > + if (!plist_head_empty(&node->srcs))
> > + node->prio = plist_first(&node->srcs)->prio;
> > + else
> > + node->prio = MAX_PRIO;
> > +
> > + list_for_each_entry(sinkref, &node->sinks, list) {
> > + /*
> > + * If the priority is changing, or if this is a
> > + * BOOST/DEBOOST, we consider this sink "stale"
> > + */
> > + if (pprio != node->prio
> > + || sinkref->state != pi_state_boosted) {
> > + struct updater *iter = &updaters[count++];
>
> What prevents count from overrun?
>
> > +
> > + BUG_ON(!atomic_read(&sinkref->sink->refs));
> > + _pi_sink_get(sinkref);
> > +
> > + iter->update = 1;
> > + iter->sinkref = sinkref;
> > + iter->sink = sinkref->sink;
> > + }
> > + }
> > +
> > + spin_unlock(&node->lock);
> > +
> > + for (i = 0; i < count; ++i) {
> > + struct updater *iter = &updaters[i];
> > + unsigned int lflags = PI_FLAG_DEFER_UPDATE;
> > + struct pi_sink *sink;
> > +
> > + sinkref = iter->sinkref;
> > + sink = iter->sink;
> > +
> > + spin_lock(&sinkref->lock);
> > +
> > + switch (sinkref->state) {
> > + case pi_state_boost:
> > + sinkref->state = pi_state_boosted;
> > + /* Fall through */
> > + case pi_state_boosted:
> > + sink->ops->boost(sink, &sinkref->src, lflags);
> > + break;
> > + case pi_state_deboost:
> > + sink->ops->deboost(sink, &sinkref->src, lflags);
> > + sinkref->state = pi_state_free;
> > +
> > + /*
> > + * drop the ref that we took when the sinkref
> > + * was allocated. We still hold a ref from
> > + * above.
> > + */
> > + _pi_sink_put_all(node, sinkref);
> > + break;
> > + case pi_state_free:
> > + iter->update = 0;
> > + break;
> > + default:
> > + panic("illegal sinkref type: %d", sinkref->state);
> > + }
> > +
> > + spin_unlock(&sinkref->lock);
> > +
> > + /*
> > + * We will drop the sinkref reference while still holding the
> > + * preempt/irqs off so that the memory is returned synchronously
> > + * to the system.
> > + */
> > + _pi_sink_put_local(node, sinkref);
> > + }
> > +
> > + local_irq_restore(iflags);
>
> Yack! You keep interrupts off while doing the chain. I think my main
> contribution to the PI system 2 years ago was to do this preemptively.
> I.e. there were points in the loop where interrupts and preemption
> were turned on.
>
> Remember: It goes into user space again. An evil user could craft an
> application with a very long lock depth and keep higher-priority
> real-time tasks from running for an arbitrarily long time (if no limit
> on the lock depth is set, which is bad because it will be too low in
> some cases).
>
> But as I said I have had no time to watch what has actually been going
> on in the kernel for roughly the last 2 years. Such defects might
> have crept in via other contributors already :-(

The rtmutex.c has hardly changed since you last left it. The two big
additions were adaptive locks, which hardly touched the pi chain, and my
rwlocks allowing multiple readers. The latter added a hook to allow going
into the pi chain for all readers while holding a spinlock and, yes, irqs
off. The difference is that it is a bug to hold an rwlock (internal
kernel lock only) and take a futex. Thus, this rwlock code did have a
recursive depth of 5. Perhaps that's where the PI depth above comes from?

I still haven't had the time to analyze Gregory's code, so the points
you made may only relate to kernel activities (like the new rwlock
code). But generalizing it decouples the PI from the locking, which in
general is a good thing; for the multiple-reader locks, though, it is
dangerous to decouple, since there are a lot of assumptions shared
between the multiple-PI-owner code and the rwlocks. In a general
approach, those assumptions will be harder to see.

-- Steve

2008-08-22 16:10:41

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

Gregory Haskins wrote:
> Hi Esben,
> Thank you for the review. Comments inline.
>
> Esben Nielsen wrote:
>
>> Disclaimer: I am no longer actively involved and I must admit I might
>> have lost track of much of what has been going on since I contributed
>> to the PI system 2 years ago. But I allow myself to comment anyway.
>>
>> On Fri, Aug 15, 2008 at 10:28 PM, Gregory Haskins <[email protected]> wrote:
>>
>>
>>> The kernel currently addresses priority-inversion through priority-
>>> inheritence. However, all of the priority-inheritence logic is
>>> integrated into the Real-Time Mutex infrastructure. This causes a few
>>> problems:
>>>
>>> 1) This tightly coupled relationship makes it difficult to extend to
>>> other areas of the kernel (for instance, pi-aware wait-queues may
>>> be desirable).
>>> 2) Enhancing the rtmutex infrastructure becomes challenging because
>>> there is no seperation between the locking code, and the pi-code.
>>>
>>> This patch aims to rectify these shortcomings by designing a stand-alone
>>> pi framework which can then be used to replace the rtmutex-specific
>>> version. The goal of this framework is to provide similar functionality
>>> to the existing subsystem, but with sole focus on PI and the
>>> relationships between objects that can boost priority, and the objects
>>> that get boosted.
>>>
>>>
>> This is really a good idea. When I had time (2 years ago) to actively
>> work on these problems I also came to the conclusion that PI should be
>> more general than just the rtmutex. Preemptive RCU was the example
>> which drove it.
>>
>> But I do disagree that general objects should get boosted: the end
>> targets are always tasks. The objects might be boosted as intermediate
>> steps, but priority in the end only applies to tasks.
>>
>>
> Actually I fully agree with you here. It's probably just poor wording on
> my part, but this is exactly what happens. We may "boost" arbitrary
> objects on the way to boosting a task... but the intermediate objects are
> just there to help find our way to the proper tasks. Ultimately
> everything ends up at the scheduler ;)
>
>
>> I also have a few comments to the actual design:
>>
>>
>>
>>> ....
>>> +
>>> +Multiple sinks per Node:
>>> +
>>> +We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation which had the notion of only a single sink (i.e. "task->pi_blocked_on"). The reason why we added the ability to add more than one sink was not to change the default chaining model (I.e. multiple boost targets), but rather to add a flexible notification mechanism that is peripheral to the chain, which are informally called "leaf sinks".
>>> +
>>> +Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints to a priority boosting. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary. Whereas an rwlock leaf-sink may boost a list of reader-owners.
>>>
>>>
>> This is bad from an RT point of view: you have a hard time determining
>> the number of sinks per node. An rw-lock could have an arbitrary
>> number of readers (and is supposed to, really). Therefore you have no
>> chance of knowing how long the boost/deboost operation will take. And
>> you also don't know for how long the boosted tasks stay boosted. If
>> there can be an arbitrary number of such tasks you can no longer be
>> deterministic.
>>
>>
>
> While you may have a valid concern about what rwlocks can do to
> determinism, note that we already had PI-enabled rwlocks before my
> patch, so I am neither increasing nor decreasing determinism in this
> regard. That being said, Steven Rostedt (author of the pi-rwlocks,
> CC'd) has facilities to manage this (such as limiting the number of
> readers to num_online_cpus) which this design would retain. Long story
> short, I do not believe I have made anything worse here, so this is a
> different discussion if you are still concerned.
>
>
>
>>
>>
>>> ...
>>> +
>>> +#define MAX_PI_DEPENDENCIES 5
>>>
>>>
>> WHAT??? There is a finite lock depth defined. I know we did that
>> originally but it wasn't hardcoded (as far as I remember) and
>> it was certainly not as low as 5.
>>
>>
>
> Note that this is simply in reference to how many direct sinks you can
> link to a node, not how long the resulting chain can grow. The chain
> depth is actually completely unconstrained by the design. I chose "5"
> here because typically we need 1 sink for the next link in the chain,
> and 1 sink for local notifications. The other 3 are there for head-room
> (we often hit 3-4 as we transition between nodes: add one node, delete
> another, etc.).
>

To clarify what I meant here: if you think of a normal linked-list node
having a single "next" pointer, this implementation is like each node
having up to 5 "next" pointers. However, typically only 1-2 are used,
and all but one will usually point to a "leaf" node, meaning it does not
form a chain but terminates processing locally. Typically there will be
only one link to something that forms a chain with other nodes. I did
this because I realized the pattern (boost/deboost/update) was similar
whether the node was a leaf or a chain-link, so I unified both behind
the single pi_sink interface.

That being understood, note that, as with any linked list, the nodes can
still form a chain of arbitrary depth (and I will fix the update to be
iterative instead of recursive, as previously mentioned).
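
Roughly, the interface both kinds of sink implement looks like this
(reconstructed from the fragments quoted in this thread; the exact
declarations in the patch may differ slightly):

struct pi_sink_ops_sketch {
        /* a source's priority contribution changed for this sink */
        int (*boost)(struct pi_sink *sink, struct pi_source *src,
                     unsigned int flags);
        int (*deboost)(struct pi_sink *sink, struct pi_source *src,
                       unsigned int flags);
        /* propagate any deferred changes: a chain-link walks onward to
         * the next node here, a leaf applies the new priority locally
         * and stops */
        int (*update)(struct pi_sink *sink, unsigned int flags);
};

A task leaf-sink implements boost/deboost by adjusting the task's
priority (rescheduling if necessary), while a chain-link sink forwards
them toward the next node.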

> You are not the first to comment about this, however, so it makes me
> realize it is not very clear ;) I will comment the code better.
>
>
>
>> Remember: PI is used by user-space futexes as well!
>>
>>
>
> Yes, and on a slight tangent from your point, this incidentally is
> actually a problem in the design, one that means I need to respin at
> least a v5. My current design uses recursion against the sink->update()
> methods, which Peter Zijlstra pointed out would blow up with large
> userspace chains. My next version will forgo the recursion in favor of
> an iterative method more reminiscent of the original design.
>
>
>>
>>
>>> ....
>>> +/*
>>> + * _pi_node_update - update the chain
>>> + *
>>> + * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
>>> + * that need to propagate up the chain. This is a step-wise process where we
>>> + * have to be careful about locking and preemption. By trying MAX_PI_DEPs
>>> + * times, we guarantee that this update routine is an effective barrier...
>>> + * all modifications made prior to the call to this barrier will have completed.
>>> + *
>>> + * Deadlock avoidance: This node may participate in a chain of nodes which
>>> + * form a graph of arbitrary structure. While the graph should technically
>>> + * never close on itself barring any bugs, we still want to protect against
>>> + * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
>>> + * from detecting this potential). To do this, we employ a dual-locking
>>> + * scheme where we can carefully control the order. That is: node->lock
>>> + * protects most of the node's internal state, but it will never be held
>>> + * across a chain update. sinkref->lock, on the other hand, can be held
>>> + * across a boost/deboost, and also guarantees proper execution order. Also
>>> + * note that no locks are held across an sink->update.
>>> + */
>>> +static int
>>> +_pi_node_update(struct pi_sink *sink, unsigned int flags)
>>> +{
>>> + struct pi_node *node = node_of(sink);
>>> + struct pi_sinkref *sinkref;
>>> + unsigned long iflags;
>>> + int count = 0;
>>> + int i;
>>> + int pprio;
>>> + struct updater updaters[MAX_PI_DEPENDENCIES];
>>> +
>>> + spin_lock_irqsave(&node->lock, iflags);
>>> +
>>> + pprio = node->prio;
>>> +
>>> + if (!plist_head_empty(&node->srcs))
>>> + node->prio = plist_first(&node->srcs)->prio;
>>> + else
>>> + node->prio = MAX_PRIO;
>>> +
>>> + list_for_each_entry(sinkref, &node->sinks, list) {
>>> + /*
>>> + * If the priority is changing, or if this is a
>>> + * BOOST/DEBOOST, we consider this sink "stale"
>>> + */
>>> + if (pprio != node->prio
>>> + || sinkref->state != pi_state_boosted) {
>>> + struct updater *iter = &updaters[count++];
>>>
>>>
>> What prevents count from overrun?
>>
>>
>
> The node->sinks list will never hold more than MAX_PI_DEPENDENCIES entries, by design.
>
>
>>
>>
>>> +
>>> + BUG_ON(!atomic_read(&sinkref->sink->refs));
>>> + _pi_sink_get(sinkref);
>>> +
>>> + iter->update = 1;
>>> + iter->sinkref = sinkref;
>>> + iter->sink = sinkref->sink;
>>> + }
>>> + }
>>> +
>>> + spin_unlock(&node->lock);
>>> +
>>> + for (i = 0; i < count; ++i) {
>>> + struct updater *iter = &updaters[i];
>>> + unsigned int lflags = PI_FLAG_DEFER_UPDATE;
>>> + struct pi_sink *sink;
>>> +
>>> + sinkref = iter->sinkref;
>>> + sink = iter->sink;
>>> +
>>> + spin_lock(&sinkref->lock);
>>> +
>>> + switch (sinkref->state) {
>>> + case pi_state_boost:
>>> + sinkref->state = pi_state_boosted;
>>> + /* Fall through */
>>> + case pi_state_boosted:
>>> + sink->ops->boost(sink, &sinkref->src, lflags);
>>> + break;
>>> + case pi_state_deboost:
>>> + sink->ops->deboost(sink, &sinkref->src, lflags);
>>> + sinkref->state = pi_state_free;
>>> +
>>> + /*
>>> + * drop the ref that we took when the sinkref
>>> + * was allocated. We still hold a ref from
>>> + * above.
>>> + */
>>> + _pi_sink_put_all(node, sinkref);
>>> + break;
>>> + case pi_state_free:
>>> + iter->update = 0;
>>> + break;
>>> + default:
>>> + panic("illegal sinkref type: %d", sinkref->state);
>>> + }
>>> +
>>> + spin_unlock(&sinkref->lock);
>>> +
>>> + /*
>>> + * We will drop the sinkref reference while still holding the
>>> + * preempt/irqs off so that the memory is returned synchronously
>>> + * to the system.
>>> + */
>>> + _pi_sink_put_local(node, sinkref);
>>> + }
>>> +
>>> + local_irq_restore(iflags);
>>>
>>>
>> Yack! You keep interrupts off while doing the chain.
>>
>
> Actually, not quite. The first pass (with interrupts off) simply sets
> the new priority value at each local element (limited to 5, typically
> 1-2). Short and sweet. It's the "update" that happens next (with
> interrupts/preemption enabled) that updates the chain.
>
>
>
>> I think my main
>> contribution to the PI system 2 years ago was to do this preemptively.
>> I.e. there were points in the loop where interrupts and preemption
>> were turned on.
>>
>>
>
> I agree this is important, but I think you will see with further review
> that this is in fact what I do too.
>
>
>> Remember: It goes into user space again. An evil user could craft an
>> application with a very long lock depth and keep higher-priority
>> real-time tasks from running for an arbitrarily long time (if no limit
>> on the lock depth is set, which is bad because it will be too low in
>> some cases).
>>
>> But as I said I have had no time to watch what has actually been going
>> on in the kernel for roughly the last 2 years. Such defects might
>> have crept in via other contributors already :-(
>>
>> Esben
>>
>>
>
> Esben,
> Your review and insight are very much appreciated. I will be sure to
> address the concerns mentioned above and CC you on the next release.
>
> Thanks again,
> -Greg
>
>
>


