2007-08-01 12:01:17

by Gregory Haskins

Subject: Re: [PATCH] RT: Add priority-queuing and priority-inheritance to workqueue infrastructure

On Tue, 2007-07-31 at 20:52 -0700, Daniel Walker wrote:

>
> Here's a simpler version .. uses the plist data structure instead of the
> 100 queues, which makes for a cleaner patch ..

Hi Daniel,

I like your idea on the plist simplification a lot. I will definitely
roll that into my series.

I am not too psyched about using the rt_mutex_setprio() API directly,
however. It seems broken to be calling that directly from non rt_mutex
code, IMHO. Perhaps the PI subsystem should be broken out from the
rt_mutex code so it can be used generally? There are other areas where
PI could potentially be used besides rt_mutex (this patch as a prime
example), so this might make sense.

Anyway, thanks for finding that simplification! I definitely learned
something.

-Greg


>
> Signed-off-by: Daniel Walker <[email protected]>
>
> ---
> include/linux/workqueue.h | 7 ++++---
> kernel/power/poweroff.c | 1 +
> kernel/sched.c | 4 ----
> kernel/workqueue.c | 40 +++++++++++++++++++++++++---------------
> 4 files changed, 30 insertions(+), 22 deletions(-)
>
> Index: linux-2.6.22/include/linux/workqueue.h
> ===================================================================
> --- linux-2.6.22.orig/include/linux/workqueue.h
> +++ linux-2.6.22/include/linux/workqueue.h
> @@ -8,6 +8,7 @@
> #include <linux/timer.h>
> #include <linux/linkage.h>
> #include <linux/bitops.h>
> +#include <linux/plist.h>
> #include <asm/atomic.h>
>
> struct workqueue_struct;
> @@ -26,7 +27,7 @@ struct work_struct {
> #define WORK_STRUCT_PENDING 0 /* T if work item pending execution */
> #define WORK_STRUCT_FLAG_MASK (3UL)
> #define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)
> - struct list_head entry;
> + struct plist_node entry;
> work_func_t func;
> };
>
> @@ -43,7 +44,7 @@ struct execute_work {
>
> #define __WORK_INITIALIZER(n, f) { \
> .data = WORK_DATA_INIT(), \
> - .entry = { &(n).entry, &(n).entry }, \
> + .entry = PLIST_NODE_INIT(n.entry, MAX_PRIO), \
> .func = (f), \
> }
>
> @@ -79,7 +80,7 @@ struct execute_work {
> #define INIT_WORK(_work, _func) \
> do { \
> (_work)->data = (atomic_long_t) WORK_DATA_INIT(); \
> - INIT_LIST_HEAD(&(_work)->entry); \
> + plist_node_init(&(_work)->entry, -1); \
> PREPARE_WORK((_work), (_func)); \
> } while (0)
>
> Index: linux-2.6.22/kernel/power/poweroff.c
> ===================================================================
> --- linux-2.6.22.orig/kernel/power/poweroff.c
> +++ linux-2.6.22/kernel/power/poweroff.c
> @@ -8,6 +8,7 @@
> #include <linux/sysrq.h>
> #include <linux/init.h>
> #include <linux/pm.h>
> +#include <linux/sched.h>
> #include <linux/workqueue.h>
> #include <linux/reboot.h>
>
> Index: linux-2.6.22/kernel/sched.c
> ===================================================================
> --- linux-2.6.22.orig/kernel/sched.c
> +++ linux-2.6.22/kernel/sched.c
> @@ -4379,8 +4379,6 @@ long __sched sleep_on_timeout(wait_queue
> }
> EXPORT_SYMBOL(sleep_on_timeout);
>
> -#ifdef CONFIG_RT_MUTEXES
> -
> /*
> * rt_mutex_setprio - set the current priority of a task
> * @p: task
> @@ -4457,8 +4455,6 @@ out_unlock:
> task_rq_unlock(rq, &flags);
> }
>
> -#endif
> -
> void set_user_nice(struct task_struct *p, long nice)
> {
> int old_prio, delta, on_rq;
> Index: linux-2.6.22/kernel/workqueue.c
> ===================================================================
> --- linux-2.6.22.orig/kernel/workqueue.c
> +++ linux-2.6.22/kernel/workqueue.c
> @@ -44,7 +44,7 @@ struct cpu_workqueue_struct {
>
> spinlock_t lock;
>
> - struct list_head worklist;
> + struct plist_head worklist;
> wait_queue_head_t more_work;
> struct work_struct *current_work;
>
> @@ -127,16 +127,19 @@ struct cpu_workqueue_struct *get_wq_data
> static void insert_work(struct cpu_workqueue_struct *cwq,
> struct work_struct *work, int tail)
> {
> + int prio = current->normal_prio;
> +
> set_wq_data(work, cwq);
> /*
> * Ensure that we get the right work->data if we see the
> * result of list_add() below, see try_to_grab_pending().
> */
> smp_wmb();
> - if (tail)
> - list_add_tail(&work->entry, &cwq->worklist);
> - else
> - list_add(&work->entry, &cwq->worklist);
> + plist_node_init(&work->entry, prio);
> + plist_add(&work->entry, &cwq->worklist);
> +
> + if (prio < cwq->thread->prio)
> + rt_mutex_setprio(cwq->thread, prio);
> wake_up(&cwq->more_work);
> }
>
> @@ -168,7 +171,7 @@ int fastcall queue_work(struct workqueue
> int ret = 0, cpu = raw_smp_processor_id();
>
> if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
> - BUG_ON(!list_empty(&work->entry));
> + BUG_ON(!plist_node_empty(&work->entry));
> __queue_work(wq_per_cpu(wq, cpu), work);
> ret = 1;
> }
> @@ -222,7 +225,7 @@ int queue_delayed_work_on(int cpu, struc
>
> if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
> BUG_ON(timer_pending(timer));
> - BUG_ON(!list_empty(&work->entry));
> + BUG_ON(!plist_node_empty(&work->entry));
>
> /* This stores cwq for the moment, for the timer_fn */
> set_wq_data(work, wq_per_cpu(wq, raw_smp_processor_id()));
> @@ -264,13 +267,17 @@ static void run_workqueue(struct cpu_wor
> __FUNCTION__, cwq->run_depth);
> dump_stack();
> }
> - while (!list_empty(&cwq->worklist)) {
> - struct work_struct *work = list_entry(cwq->worklist.next,
> + while (!plist_head_empty(&cwq->worklist)) {
> + struct work_struct *work = plist_first_entry(&cwq->worklist,
> struct work_struct, entry);
> work_func_t f = work->func;
>
> + if (likely(cwq->thread->prio != work->entry.prio))
> + rt_mutex_setprio(cwq->thread, work->entry.prio);
> +
> cwq->current_work = work;
> - list_del_init(cwq->worklist.next);
> + plist_del(&work->entry, &cwq->worklist);
> + plist_node_init(&work->entry, MAX_PRIO);
> spin_unlock_irq(&cwq->lock);
>
> BUG_ON(get_wq_data(work) != cwq);
> @@ -283,6 +290,7 @@ static void run_workqueue(struct cpu_wor
> spin_lock_irq(&cwq->lock);
> cwq->current_work = NULL;
> }
> + rt_mutex_setprio(cwq->thread, current->normal_prio);
> cwq->run_depth--;
> spin_unlock_irq(&cwq->lock);
> }
> @@ -301,7 +309,7 @@ static int worker_thread(void *__cwq)
> prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
> if (!freezing(current) &&
> !kthread_should_stop() &&
> - list_empty(&cwq->worklist))
> + plist_head_empty(&cwq->worklist))
> schedule();
> finish_wait(&cwq->more_work, &wait);
>
> @@ -354,7 +362,8 @@ static int flush_cpu_workqueue(struct cp
>
> active = 0;
> spin_lock_irq(&cwq->lock);
> - if (!list_empty(&cwq->worklist) || cwq->current_work != NULL) {
> + if (!plist_head_empty(&cwq->worklist) ||
> + cwq->current_work != NULL) {
> insert_wq_barrier(cwq, &barr, 1);
> active = 1;
> }
> @@ -413,7 +422,7 @@ static int try_to_grab_pending(struct wo
> return ret;
>
> spin_lock_irq(&cwq->lock);
> - if (!list_empty(&work->entry)) {
> + if (!plist_node_empty(&work->entry)) {
> /*
> * This work is queued, but perhaps we locked the wrong cwq.
> * In that case we must see the new value after rmb(), see
> @@ -421,7 +430,8 @@ static int try_to_grab_pending(struct wo
> */
> smp_rmb();
> if (cwq == get_wq_data(work)) {
> - list_del_init(&work->entry);
> + plist_del(&work->entry, &cwq->worklist);
> + plist_node_init(&work->entry, MAX_PRIO);
> ret = 1;
> }
> }
> @@ -747,7 +757,7 @@ init_cpu_workqueue(struct workqueue_stru
>
> cwq->wq = wq;
> spin_lock_init(&cwq->lock);
> - INIT_LIST_HEAD(&cwq->worklist);
> + plist_head_init(&cwq->worklist, NULL);
> init_waitqueue_head(&cwq->more_work);
>
> return cwq;
>
>
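The patch hinges on the plist keeping each CPU's worklist sorted so that plist_first_entry() always returns the highest-priority pending item (in the kernel, a *lower* prio value means *higher* priority). As a rough userspace sketch of that ordering invariant — all names here (pq_node, pq_add, pq_take) are invented for illustration; the real API lives in <linux/plist.h> and is more involved:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a priority-sorted list: nodes are kept in ascending
 * prio order (lower value = higher priority), FIFO among equal
 * priorities, so the head is always the most urgent entry. */
struct pq_node {
	int prio;
	struct pq_node *next;
};

struct pq_head {
	struct pq_node *first;
};

static void pq_init(struct pq_head *h)
{
	h->first = NULL;
}

/* Insert while preserving ascending prio order. */
static void pq_add(struct pq_head *h, struct pq_node *n)
{
	struct pq_node **pp = &h->first;

	while (*pp && (*pp)->prio <= n->prio)
		pp = &(*pp)->next;
	n->next = *pp;
	*pp = n;
}

/* Pop the highest-priority (numerically lowest) node, or NULL. */
static struct pq_node *pq_take(struct pq_head *h)
{
	struct pq_node *n = h->first;

	if (n)
		h->first = n->next;
	return n;
}
```

With this ordering, run_workqueue() in the patch naturally services the most urgent work item first, and the worker thread only needs to match its priority to the head of the list.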


2007-08-01 15:12:17

by Daniel Walker

Subject: Re: [PATCH] RT: Add priority-queuing and priority-inheritance to workqueue infrastructure

On Wed, 2007-08-01 at 07:59 -0400, Gregory Haskins wrote:
> On Tue, 2007-07-31 at 20:52 -0700, Daniel Walker wrote:
>
> >
> > Here's a simpler version .. uses the plist data structure instead of the
> > 100 queues, which makes for a cleaner patch ..
>
> Hi Daniel,
>
> I like your idea on the plist simplification a lot. I will definitely
> roll that into my series.
>
> I am not too psyched about using the rt_mutex_setprio() API directly,
> however. It seems broken to be calling that directly from non rt_mutex
> code, IMHO. Perhaps the PI subsystem should be broken out from the
> rt_mutex code so it can be used generally? There are other areas where
> PI could potentially be used besides rt_mutex (this patch as a prime
> example), so this might make sense.

rt_mutex_setprio() is just a function. It was also designed specifically
for PI, so it seems fairly sane to use it in other PI-type
situations ..

Daniel

2007-08-01 15:21:53

by Gregory Haskins

Subject: Re: [PATCH] RT: Add priority-queuing and priority-inheritance to workqueue infrastructure

On Wed, 2007-08-01 at 08:10 -0700, Daniel Walker wrote:

>
> rt_mutex_setprio() is just a function. It was also designed specifically
> for PI , so it seems fairly sane to use it in other PI type
> situations ..
>

Yes. It is designed for PI and I wasn't suggesting you shouldn't use
the logic itself. What I was suggesting is that dealing with an API
that has "rt_mutex" in it for something that has nothing to do with
rt_mutexes is, well...

All I was suggesting is that we break out the PI subsystem from rt_mutex
code so it's an independent PI API and have the rt_mutex subsystem become
a user. That's a far cleaner way to do it, IMHO.

-Greg



2007-08-01 15:57:40

by Daniel Walker

Subject: Re: [PATCH] RT: Add priority-queuing and priority-inheritance to workqueue infrastructure

On Wed, 2007-08-01 at 11:19 -0400, Gregory Haskins wrote:
> On Wed, 2007-08-01 at 08:10 -0700, Daniel Walker wrote:
>
> >
> > rt_mutex_setprio() is just a function. It was also designed specifically
> > for PI , so it seems fairly sane to use it in other PI type
> > situations ..
> >
>
> Yes. It is designed for PI and I wasn't suggesting you shouldn't use
> the logic itself. What I was suggesting is that dealing with an API
> that has "rt_mutex" in it for something that has nothing to do with
> rt_mutexes is, well...

It's fine for now .. One step at a time..

> All I was suggesting is that we break out the PI subsystem from rt_mutex
> code so it's an independent PI API and have the rt_mutex subsystem become
> a user. That's a far cleaner way to do it, IMHO.

The workqueues don't really need full-blown transitive PI. So without
that you're back to the rt_mutex_setprio function .. which could be
renamed ..

Here was my attempt years ago,

http://lkml.org/lkml/2005/5/31/288

Looking back on it, I'm not sure what users I was planning to implement
along with it .. I'm sure I was thinking "There must be other blocking
primitives that could use this..", but now I don't think there are ..
Everything pretty much runs through the rt_mutex.. workqueues are just
"dancing", or changing priorities up/down, which is really only the
lowest level of what the rt_mutex does.

Daniel

2007-08-01 17:34:22

by Gregory Haskins

Subject: Re: [PATCH] RT: Add priority-queuing and priority-inheritance to workqueue infrastructure

On Wed, 2007-08-01 at 08:55 -0700, Daniel Walker wrote:
> On Wed, 2007-08-01 at 11:19 -0400, Gregory Haskins wrote:
> > On Wed, 2007-08-01 at 08:10 -0700, Daniel Walker wrote:
> >
> > >
> > > rt_mutex_setprio() is just a function. It was also designed specifically
> > > for PI , so it seems fairly sane to use it in other PI type
> > > situations ..
> > >
> >
> > Yes. It is designed for PI and I wasn't suggesting you shouldn't use
> > the logic itself. What I was suggesting is that dealing with an API
> > that has "rt_mutex" in it for something that has nothing to do with
> > rt_mutexes is, well...
>
> It's fine for now .. One step at a time..

I can live with that.

>
> > All I was suggesting is that we break out the PI subsystem from rt_mutex
> > code so it's an independent PI API and have the rt_mutex subsystem become
> > a user. That's a far cleaner way to do it, IMHO.
>
> The workqueues don't really need full-blown transitive PI. So without
> that you're back to the rt_mutex_setprio function .. which could be
> renamed ..

Right ;) That's all I am really saying, but it's not a big deal.

>
> Here was my attempt years ago ,
>
> http://lkml.org/lkml/2005/5/31/288

Cool! I haven't looked at the patches in depth yet, but perhaps that
concept should be revisited.

> Looking back on it, I'm not sure what users I was planning to implement
> along with it .. I'm sure I was thinking "There must be other blocking
> primitives that could use this.." , but now I don't think there are ..
> Everything pretty much runs through the rt mutex..

Well, not necessarily. I think your first instinct was right. Really
any form of waiting can potentially be subject to inversion, IMHO.
Waiting for a mutex is an obvious form, and also easy to solve since you
have clear delineation between the waiters (and their priorities) and
the resource holder.

However, inversion can happen any time you are waiting on a resource
regardless of the blocking mechanism...for instance, waiting on a
wait_queue. Consider the situation of three RT tasks A, B, C. A is the
highest, C the lowest. Assume C is a threaded IRQ handler for, say,
diskio, and C will signal a wait_queue when IO completes. In this
scenario, A can get inverted behind B if it does diskio. (In fact, I
think someone was reporting a similar problem on the list last week, and
the response was that we don't support that yet...I would like to change
that :).

The problem with something like a wait_queue, as previously mentioned,
is you don't have a clear way to delineate who should get PI boosted. I
am wondering if we can put our heads together and come up with a
comprehensive solution to this problem (even if it only works in certain
situations..e.g. threads that explicitly register with a waitqueue,
etc).
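One way to picture the registration idea is that a waiter on some resource registers its priority, and whoever services the resource runs at the best of its own priority and all registered waiters'. The following is a purely hypothetical userspace sketch — none of these names (boosted_resource, waiter_register, effective_prio) exist in the kernel, and a real implementation would need locking and scheduler integration:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 8

/* A resource with explicitly registered waiters.  As in the kernel,
 * a lower prio value means a higher priority. */
struct boosted_resource {
	int owner_normal_prio;		/* servicing thread's own prio */
	int waiter_prio[MAX_WAITERS];	/* registered waiter prios     */
	int nwaiters;
};

static void waiter_register(struct boosted_resource *r, int prio)
{
	if (r->nwaiters < MAX_WAITERS)
		r->waiter_prio[r->nwaiters++] = prio;
}

static void waiter_unregister(struct boosted_resource *r, int prio)
{
	for (int i = 0; i < r->nwaiters; i++) {
		if (r->waiter_prio[i] == prio) {
			/* order doesn't matter; swap in the last entry */
			r->waiter_prio[i] = r->waiter_prio[--r->nwaiters];
			return;
		}
	}
}

/* The servicing thread should run at the minimum (i.e. highest)
 * of its own priority and every registered waiter's priority. */
static int effective_prio(const struct boosted_resource *r)
{
	int prio = r->owner_normal_prio;

	for (int i = 0; i < r->nwaiters; i++)
		if (r->waiter_prio[i] < prio)
			prio = r->waiter_prio[i];
	return prio;
}
```

In the A/B/C scenario above, A registering with the IRQ thread's resource would lift C's effective priority above B's, letting the IO completion preempt B.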

The solution here might be as simple as the previously mentioned
registration API, it might be something that can be done automatically
in the infrastructure, or it might be something along the lines of
converting more things over to use rt_mutex? I don't know what the
answer will be yet. All have their challenges. But I do think it's a
problem worth solving.

I think the workqueue stuff we are doing is really the first step in
this direction...but really some parts of that work are just special
casing one form of the "wait_queue" scenario. I would like to
generalize this going forward.

> workqueues are just
> "dancing", or changing priorities up/down, which is really only the
> lowest level of what the rt_mutex does.

I totally agree that what the workqueues are currently doing in our
patches is just a subset of PI. But I absolutely think there are
potential users of at least some PI-like facilities other than just
mutexes.

I appreciate your review, discussion, and contribution on this matter,
Daniel. Thank you! I will take a look at your PI patches to see if I
think it aligns with my current thoughts.

-Greg

2007-08-01 21:48:52

by Esben Nielsen

Subject: Re: [PATCH] RT: Add priority-queuing and priority-inheritance to workqueue infrastructure



On Wed, 1 Aug 2007, Daniel Walker wrote:

> On Wed, 2007-08-01 at 07:59 -0400, Gregory Haskins wrote:
>> On Tue, 2007-07-31 at 20:52 -0700, Daniel Walker wrote:
>>
>>>
>>> Here's a simpler version .. uses the plist data structure instead of the
>>> 100 queues, which makes for a cleaner patch ..
>>
>> Hi Daniel,
>>
>> I like your idea on the plist simplification a lot. I will definitely
>> roll that into my series.
>>
>> I am not too psyched about using the rt_mutex_setprio() API directly,
>> however. It seems broken to be calling that directly from non rt_mutex
>> code, IMHO. Perhaps the PI subsystem should be broken out from the
>> rt_mutex code so it can be used generally? There are other areas where
>> PI could potentially be used besides rt_mutex (this patch as a prime
>> example), so this might make sense.
>
> rt_mutex_setprio() is just a function. It was also designed specifically
> for PI , so it seems fairly sane to use it in other PI type
> situations ..
>
> Daniel
>

There seems to be a general need for boosting threads temporarily in a few
places. HR-timers also have it, last time I checked, and preemptive RCU as
well for boosting RCU readers. They all seem to deal with the same issues
of correctly setting the priority and handling PI boosting from mutexes.

When boosting of RCU readers was discussed I came to the conclusion that
the boosting property should be taken out of the rt_mutex module and
instead be made into a sub-property of the scheduler:

task->pi_waiters should be replaced with task->prio_boosters, a pi_list of
struct prio_booster representing something which temporarily wants to
boost a task. A rt_mutex_waiter should of course contain a prio_booster
which is added to owner->prio_boosters. A work queue element should
contain a prio_booster. When boosting an RCU reader, a prio_booster is
added to the reader's prio_boosters.

Such a system will correctly deal with boosters going away in arbitrary
order, something which is not straightforward when each user of boosting
is trying to do it on its own.
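A userspace sketch of that proposal, following the naming in the mail (prio_booster, prio_boosters) — nothing here is existing kernel API, and the real thing would of course need the task's pi_lock and a resorted plist rather than a recomputation pass:

```c
#include <assert.h>
#include <stddef.h>

/* Each temporary boost is a prio_booster linked onto the task.
 * Lower prio value = higher priority, as in the kernel. */
struct prio_booster {
	int prio;			/* priority this booster demands */
	struct prio_booster *next;
};

struct task_model {
	int normal_prio;		/* task's own priority           */
	int prio;			/* effective, possibly boosted   */
	struct prio_booster *prio_boosters;
};

/* Effective prio is the best of normal_prio and all live boosters,
 * so it stays correct no matter what order boosters go away in. */
static void recompute_prio(struct task_model *t)
{
	int prio = t->normal_prio;

	for (struct prio_booster *b = t->prio_boosters; b; b = b->next)
		if (b->prio < prio)
			prio = b->prio;
	t->prio = prio;
}

static void booster_add(struct task_model *t, struct prio_booster *b)
{
	b->next = t->prio_boosters;
	t->prio_boosters = b;
	recompute_prio(t);
}

static void booster_del(struct task_model *t, struct prio_booster *b)
{
	struct prio_booster **pp = &t->prio_boosters;

	while (*pp && *pp != b)
		pp = &(*pp)->next;
	if (*pp)
		*pp = b->next;
	recompute_prio(t);
}
```

An rt_mutex waiter, a queued work item, or a boosted RCU read-side section would each simply be one booster on this list, and the scheduler alone would own the boosting logic.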

Esben