2008-06-12 16:49:24

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH] workqueues: insert_work: use "list_head *" instead of "int tail"

insert_work() inserts the new work_struct before or after cwq->worklist,
depending on the "int tail" parameter. Change it to accept "list_head *"
instead, this shrinks .text a bit and allows us to insert the barrier
after specific work_struct.

Signed-off-by: Oleg Nesterov <[email protected]>

--- 26-rc2/kernel/workqueue.c~WQ_1_S_TAIL_PREV 2008-05-18 15:44:19.000000000 +0400
+++ 26-rc2/kernel/workqueue.c 2008-06-12 19:12:46.000000000 +0400
@@ -125,7 +125,7 @@ struct cpu_workqueue_struct *get_wq_data
}

static void insert_work(struct cpu_workqueue_struct *cwq,
- struct work_struct *work, int tail)
+ struct work_struct *work, struct list_head *head)
{
set_wq_data(work, cwq);
/*
@@ -133,10 +133,7 @@ static void insert_work(struct cpu_workq
* result of list_add() below, see try_to_grab_pending().
*/
smp_wmb();
- if (tail)
- list_add_tail(&work->entry, &cwq->worklist);
- else
- list_add(&work->entry, &cwq->worklist);
+ list_add_tail(&work->entry, head);
wake_up(&cwq->more_work);
}

@@ -147,7 +144,7 @@ static void __queue_work(struct cpu_work
unsigned long flags;

spin_lock_irqsave(&cwq->lock, flags);
- insert_work(cwq, work, 1);
+ insert_work(cwq, work, &cwq->worklist);
spin_unlock_irqrestore(&cwq->lock, flags);
}

@@ -337,14 +334,14 @@ static void wq_barrier_func(struct work_
}

static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
- struct wq_barrier *barr, int tail)
+ struct wq_barrier *barr, struct list_head *head)
{
INIT_WORK(&barr->work, wq_barrier_func);
__set_bit(WORK_STRUCT_PENDING, work_data_bits(&barr->work));

init_completion(&barr->done);

- insert_work(cwq, &barr->work, tail);
+ insert_work(cwq, &barr->work, head);
}

static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
@@ -364,7 +361,7 @@ static int flush_cpu_workqueue(struct cp
active = 0;
spin_lock_irq(&cwq->lock);
if (!list_empty(&cwq->worklist) || cwq->current_work != NULL) {
- insert_wq_barrier(cwq, &barr, 1);
+ insert_wq_barrier(cwq, &barr, &cwq->worklist);
active = 1;
}
spin_unlock_irq(&cwq->lock);
@@ -449,7 +446,7 @@ static void wait_on_cpu_work(struct cpu_

spin_lock_irq(&cwq->lock);
if (unlikely(cwq->current_work == work)) {
- insert_wq_barrier(cwq, &barr, 0);
+ insert_wq_barrier(cwq, &barr, cwq->worklist.next);
running = 1;
}
spin_unlock_irq(&cwq->lock);


2008-06-12 16:53:51

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] workqueues: insert_work: use "list_head *" instead of "int tail"

On 06/12, Oleg Nesterov wrote:
>
> insert_work() inserts the new work_struct before or after cwq->worklist,
> depending on the "int tail" parameter. Change it to accept "list_head *"
> instead, this shrinks .text a bit and allows us to insert the barrier
> after specific work_struct.

This allows us to implement

int flush_work(struct work_struct *work)
{
struct cpu_workqueue_struct *cwq;
struct list_head *head;
struct wq_barrier barr;

cwq = get_wq_data(work);
if (!cwq)
return 0;

head = NULL;
spin_lock_irq(&cwq->lock);
if (!list_empty(&work->entry)) {
smp_rmb();
/*
* ---- FAT COMMENT ----
*/
if (cwq == get_wq_data(work))
head = work->entry.next;
} else if (cwq->current_work == work) {
head = cwq->worklist.next;
}

if (head)
insert_wq_barrier(cwq, &barr, head);
spin_unlock_irq(&cwq->lock);

if (!head)
return 0;
wait_for_completion(&barr.done);
return 1;
}

suggested by Peter. It only waits for selected work_struct.

I doubt it will have a lot of users though. In most cases we need
cancel_work_sync() and nothing more.

Oleg.

2008-06-12 17:01:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] workqueues: insert_work: use "list_head *" instead of "int tail"

On Thu, 2008-06-12 at 20:55 +0400, Oleg Nesterov wrote:
> On 06/12, Oleg Nesterov wrote:
> >
> > insert_work() inserts the new work_struct before or after cwq->worklist,
> > depending on the "int tail" parameter. Change it to accept "list_head *"
> > instead, this shrinks .text a bit and allows us to insert the barrier
> > after specific work_struct.
>
> This allows us to implement
>
> int flush_work(struct work_struct *work)
> {
> struct cpu_workqueue_struct *cwq;
> struct list_head *head;
> struct wq_barrier barr;
>
> cwq = get_wq_data(work);
> if (!cwq)
> return 0;
>
> head = NULL;
> spin_lock_irq(&cwq->lock);
> if (!list_empty(&work->entry)) {
> smp_rmb();
> /*
> * ---- FAT COMMENT ----
> */
> if (cwq == get_wq_data(work))
> head = work->entry.next;
> } else if (cwq->current_work == work) {
> head = cwq->worklist.next;
> }
>
> if (head)
> insert_wq_barrier(cwq, &barr, head);
> spin_unlock_irq(&cwq->lock);
>
> if (!head)
> return 0;
> wait_for_completion(&barr.done);
> return 1;
> }
>
> suggested by Peter. It only waits for selected work_struct.
>
> I doubt it will have a lot of users though. In most cases we need
> cancel_work_sync() and nothing more.

Are there cases where we dynamically allocate work structs and queue
them and then forget about them? In such cases we'd need something a
little more complex as we don't have work pointers to flush or cancel.

Hence that idea of flush context and completions.

Aside from that, this seems like a fine idea. :-)

2008-06-12 17:42:59

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] workqueues: insert_work: use "list_head *" instead of "int tail"

On 06/12, Peter Zijlstra wrote:
>
> On Thu, 2008-06-12 at 20:55 +0400, Oleg Nesterov wrote:
> >
> > This allows us to implement
> >
> > int flush_work(struct work_struct *work)
> > {
> > struct cpu_workqueue_struct *cwq;
> > struct list_head *head;
> > struct wq_barrier barr;
> >
> > cwq = get_wq_data(work);
> > if (!cwq)
> > return 0;
> >
> > head = NULL;
> > spin_lock_irq(&cwq->lock);
> > if (!list_empty(&work->entry)) {
> > smp_rmb();
> > /*
> > * ---- FAT COMMENT ----
> > */
> > if (cwq == get_wq_data(work))
> > head = work->entry.next;
> > } else if (cwq->current_work == work) {
> > head = cwq->worklist.next;
> > }
> >
> > if (head)
> > insert_wq_barrier(cwq, &barr, head);
> > spin_unlock_irq(&cwq->lock);
> >
> > if (!head)
> > return 0;
> > wait_for_completion(&barr.done);
> > return 1;
> > }
> >
> > suggested by Peter. It only waits for selected work_struct.
> >
> > I doubt it will have a lot of users though. In most cases we need
> > cancel_work_sync() and nothing more.
>
> Are there cases where we dynamically allocate work structs and queue
> them and then forget about them? In such cases we'd need something a
> little more complex as we don't have work pointers to flush or cancel.
>
> Hence that idea of flush context and completions.

Do you mean something like (just for example) below? If yes, then yes
sure, flush_work() is limited. But I can't see how it is possible to
"generalize" this idea.

(hmm... actually, if we add flush_work(), we can speedup schedule_on_each_cpu(),
instead of flush_workqueue(keventd_wq) we can do

for_each_online_cpu(cpu)
flush_work(per_cpu_ptr(works, cpu));

not sure this really makes sense though).

Oleg.

--- kernel/workqueue.c~ 2007-07-28 16:58:17.000000000 +0400
+++ kernel/workqueue.c 2007-08-06 20:33:25.000000000 +0400
@@ -590,25 +590,54 @@ EXPORT_SYMBOL(schedule_delayed_work_on);
*
* schedule_on_each_cpu() is very slow.
*/
+
+struct xxx
+{
+ atomic_t count;
+ struct completion done;
+ work_func_t func;
+};
+
+struct yyy
+{
+ struct work_struct work;
+ struct xxx *xxx;
+};
+
+static void yyy_func(struct work_struct *work)
+{
+ struct xxx *xxx = container_of(work, struct yyy, work)->xxx;
+ xxx->func(work);
+
+ if (atomic_dec_and_test(&xxx->count))
+ complete(&xxx->done);
+}
+
int schedule_on_each_cpu(work_func_t func)
{
int cpu;
- struct work_struct *works;
+ struct xxx xxx;
+ struct yyy *works;

- works = alloc_percpu(struct work_struct);
+ init_completion(&xxx.done);
+ xxx.func = func;
+
+ works = alloc_percpu(struct yyy);
if (!works)
return -ENOMEM;

get_online_cpus();
+ atomic_set(&xxx.count, num_online_cpus());
for_each_online_cpu(cpu) {
- struct work_struct *work = per_cpu_ptr(works, cpu);
+ struct yyy *yyy = per_cpu_ptr(works, cpu);

- INIT_WORK(work, func);
- set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
- __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), work);
+ yyy->xxx = &xxx;
+ INIT_WORK(&yyy->work, yyy_func);
+ set_bit(WORK_STRUCT_PENDING, work_data_bits(&yyy->work));
+ __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), &yyy->work);
}
- flush_workqueue(keventd_wq);
put_online_cpus();
+ wait_for_completion(&xxx.done);
free_percpu(works);
return 0;
}

2008-06-12 18:39:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] workqueues: insert_work: use "list_head *" instead of "int tail"

On Thu, 2008-06-12 at 21:44 +0400, Oleg Nesterov wrote:

> > Hence that idea of flush context and completions.
>
> Do you mean something like (just for example) below? If yes, then yes
> sure, flush_work() is limited. But I can't see how it is possible to
> "generalize" this idea.
>
> (hmm... actually, if we add flush_work(), we can speedup schedule_on_each_cpu(),
> instead of flush_workqueue(keventd_wq) we can do
>
> for_each_online_cpu(cpu)
> flush_work(per_cpu_ptr(works, cpu));
>
> not sure this really makes sense though).

Speedups are always nice ;-), but the below also gets us there.

> Oleg.
>
> --- kernel/workqueue.c~ 2007-07-28 16:58:17.000000000 +0400
> +++ kernel/workqueue.c 2007-08-06 20:33:25.000000000 +0400
> @@ -590,25 +590,54 @@ EXPORT_SYMBOL(schedule_delayed_work_on);
> *
> * schedule_on_each_cpu() is very slow.
> */
> +
> +struct xxx
> +{
> + atomic_t count;
> + struct completion done;
> + work_func_t func;
> +};
> +
> +struct yyy
> +{
> + struct work_struct work;
> + struct xxx *xxx;
> +};
> +
> +static void yyy_func(struct work_struct *work)
> +{
> + struct xxx *xxx = container_of(work, struct yyy, work)->xxx;
> + xxx->func(work);
> +
> + if (atomic_dec_and_test(&xxx->count))
> + complete(&xxx->done);
> +}
> +
> int schedule_on_each_cpu(work_func_t func)
> {
> int cpu;
> - struct work_struct *works;
> + struct xxx xxx;
> + struct yyy *works;
>
> - works = alloc_percpu(struct work_struct);
> + init_completion(&xxx.done);
> + xxx.func = func;
> +
> + works = alloc_percpu(struct yyy);
> if (!works)
> return -ENOMEM;
>
> get_online_cpus();
> + atomic_set(&xxx.count, num_online_cpus());
> for_each_online_cpu(cpu) {
> - struct work_struct *work = per_cpu_ptr(works, cpu);
> + struct yyy *yyy = per_cpu_ptr(works, cpu);
>
> - INIT_WORK(work, func);
> - set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
> - __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), work);
> + yyy->xxx = &xxx;
> + INIT_WORK(&yyy->work, yyy_func);
> + set_bit(WORK_STRUCT_PENDING, work_data_bits(&yyy->work));
> + __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), &yyy->work);
> }
> - flush_workqueue(keventd_wq);
> put_online_cpus();
> + wait_for_completion(&xxx.done);
> free_percpu(works);
> return 0;
> }

Yes, along those lines.

you can call xxx a flush_context and create an interface like:

int queue_work_contex(struct workqueue_struct *wq,
struct flush_context *fc, struct work_struct *work)
{
work->context = fc;
return queue_work(wq, work);
}

void flush_workqueue_context(struct workqueue_strucy *wq, t
struct flush_context *fc)
{
if (atomic_read(&context->count))
wait_for_completion(&fc->completion);
/* except that the above is racy, wait_event() comes to mind */
}

of course run_workqueue() would then need to be augmented with something
like:

context = work->context;
...
f(work);
...
if (context && atomic_dec_and_test(&context->count))
complete(&context->done);

making all this PI savvy for -rt is going to be fun though.. I guess we
can just queue a normal barrier of the flusher's priority, and cancel it
once we complete.. hey - that doesn't sound hard at all :-)

also, I seem to have quitely ignored the fact that struct work doesn't
have the context pointer, and growing it unconditionally like this isn't
nice - hummm,. perhaps we have a bit left in data and can signify a
larger struct work_struct.. ?


2008-06-12 22:24:57

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH] workqueues: insert_work: use "list_head *" instead of "int tail"

On Thu, Jun 12, 2008 at 08:55:50PM +0400, Oleg Nesterov wrote:
> On 06/12, Oleg Nesterov wrote:
> >
> > insert_work() inserts the new work_struct before or after cwq->worklist,
> > depending on the "int tail" parameter. Change it to accept "list_head *"
> > instead, this shrinks .text a bit and allows us to insert the barrier
> > after specific work_struct.
>
> This allows us to implement
>
> int flush_work(struct work_struct *work)
> {
...
> }
>
> suggested by Peter. It only waits for selected work_struct.
>
> I doubt it will have a lot of users though. In most cases we need
> cancel_work_sync() and nothing more.

I guess it could've had enough users if it were done a bit sooner...

I didn't check this implementation yet, but if it's "rtnl_lock in
other works" safe then it could've been used in David Miller's fresh
patch replacing last uses of flush_scheduled_work() in net drivers'
->stop() etc (thread: "Re: 2.6.25rc7 lockdep trace") - there would
be far less doubts about possible change of functionality.

Anyway, this idea looks right to me.

Thanks,
Jarek P.

2008-06-13 10:09:18

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH] workqueues: insert_work: use "list_head *" instead of "int tail"

On 13-06-2008 00:24, Jarek Poplawski wrote:
> On Thu, Jun 12, 2008 at 08:55:50PM +0400, Oleg Nesterov wrote:
>> On 06/12, Oleg Nesterov wrote:
>>> insert_work() inserts the new work_struct before or after cwq->worklist,
>>> depending on the "int tail" parameter. Change it to accept "list_head *"
>>> instead, this shrinks .text a bit and allows us to insert the barrier
>>> after specific work_struct.
>> This allows us to implement
>>
>> int flush_work(struct work_struct *work)
>> {
> ...
>> }
>>
>> suggested by Peter. It only waits for selected work_struct.
>>
>> I doubt it will have a lot of users though. In most cases we need
>> cancel_work_sync() and nothing more.
>
> I guess it could've had enough users if it were done a bit sooner...
>
> I didn't check this implementation yet, but if it's "rtnl_lock in
> other works" safe then it could've been used in David Miller's fresh
> patch replacing last uses of flush_scheduled_work() in net drivers'
> ->stop() etc (thread: "Re: 2.6.25rc7 lockdep trace") - there would
> be far less doubts about possible change of functionality.

Hmm... I see it's definitely not for this. I should forget about my
crazy idea. Yes, cancel_work_sync() is mostly enough, and flush_
remains dangerous. (Maybe it's better not to get new users for this?)

Regards,
Jarek P.

2008-06-13 14:25:18

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] workqueues: insert_work: use "list_head *" instead of "int tail"

On 06/12, Peter Zijlstra wrote:
>
> On Thu, 2008-06-12 at 21:44 +0400, Oleg Nesterov wrote:
>
> > > Hence that idea of flush context and completions.
> >
> > Do you mean something like (just for example) below? If yes, then yes
> > sure, flush_work() is limited. But I can't see how it is possible to
> > "generalize" this idea.
> >
> > (hmm... actually, if we add flush_work(), we can speedup schedule_on_each_cpu(),
> > instead of flush_workqueue(keventd_wq) we can do
> >
> > for_each_online_cpu(cpu)
> > flush_work(per_cpu_ptr(works, cpu));
> >
> > not sure this really makes sense though).
>
> Speedups are always nice ;-),

OK, I'm sending the patch.

> but the below also gets us there.

yeah, and it needs only 1 wakeup. But otoh it is much more complex :(

> > +struct xxx
> > +{
> > + atomic_t count;
> > + struct completion done;
> > + work_func_t func;
> > +};
> > +
> > +struct yyy
> > +{
> > + struct work_struct work;
> > + struct xxx *xxx;
> > +};
> > +
> > +static void yyy_func(struct work_struct *work)
> > +{
> > + struct xxx *xxx = container_of(work, struct yyy, work)->xxx;
> > + xxx->func(work);
> > +
> > + if (atomic_dec_and_test(&xxx->count))
> > + complete(&xxx->done);
> > +}
> > ...
>
> Yes, along those lines.
>
> you can call xxx a flush_context and create an interface like:
>
> int queue_work_contex(struct workqueue_struct *wq,
> struct flush_context *fc, struct work_struct *work)
> {
> work->context = fc;
> return queue_work(wq, work);
> }
>
> void flush_workqueue_context(struct workqueue_strucy *wq, t
> struct flush_context *fc)
> {
> if (atomic_read(&context->count))
> wait_for_completion(&fc->completion);
> /* except that the above is racy, wait_event() comes to mind */
> }
>
> of course run_workqueue() would then need to be augmented with something
> like:
>
> context = work->context;
> ...
> f(work);
> ...
> if (context && atomic_dec_and_test(&context->count))
> complete(&context->done);

> also, I seem to have quitely ignored the fact that struct work doesn't
> have the context pointer, and growing it unconditionally like this isn't
> nice - hummm,. perhaps we have a bit left in data and can signify a
> larger struct work_struct.. ?

Yes, we have a free bit... but afaics we can do better.

struct context_barrier {
struct work_struct work;
struct flush_context *fc;
...
}

void context_barrier_barrier_func(struct work_struct *work)
{
struct flush_context *fc = container_of();
if (atomic_dec_and_test())
...
}

void insert_context_barrier(work, barr)
{
...insert barr after work, like flush_work() does...
}

queue_work_contex(struct workqueue_struct *wq,
struct work_struct *work,
struct flush_context *fc)
{
int ret = queue_work(wq, work);
if (ret)
insert_context_barrier(work, barr);
return ret;
}

this way we shouldn't change run_workqueue() and introduce a "parallel"
larger work_struct which needs its own INIT_()/etc.

However I'm a bit sceptical this will be widely used... I may be wrong.

> making all this PI savvy for -rt is going to be fun though.. I guess we
> can just queue a normal barrier of the flusher's priority, and cancel it
> once we complete.. hey - that doesn't sound hard at all :-)

Yes!!! I think this is much better (because _much_ simple) than re-ordering
the pending work_struct's, we can just boost the whole ->worklist. We can
implement flush_work_pi() in the same manner as queue_work_contex() above.
That is why I said previously that flush_() should govern the priority,
not queue.

But we can also implement queue_work_pi(struct work_struct_pi *work).

Oleg.

2008-06-13 14:43:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] workqueues: insert_work: use "list_head *" instead of "int tail"

On Fri, 2008-06-13 at 18:26 +0400, Oleg Nesterov wrote:
> On 06/12, Peter Zijlstra wrote:
> >
> > On Thu, 2008-06-12 at 21:44 +0400, Oleg Nesterov wrote:
> >
> > > > Hence that idea of flush context and completions.
> > >
> > > Do you mean something like (just for example) below? If yes, then yes
> > > sure, flush_work() is limited. But I can't see how it is possible to
> > > "generalize" this idea.
> > >
> > > (hmm... actually, if we add flush_work(), we can speedup schedule_on_each_cpu(),
> > > instead of flush_workqueue(keventd_wq) we can do
> > >
> > > for_each_online_cpu(cpu)
> > > flush_work(per_cpu_ptr(works, cpu));
> > >
> > > not sure this really makes sense though).
> >
> > Speedups are always nice ;-),
>
> OK, I'm sending the patch.
>
> > but the below also gets us there.
>
> yeah, and it needs only 1 wakeup. But otoh it is much more complex :(
>
> > > +struct xxx
> > > +{
> > > + atomic_t count;
> > > + struct completion done;
> > > + work_func_t func;
> > > +};
> > > +
> > > +struct yyy
> > > +{
> > > + struct work_struct work;
> > > + struct xxx *xxx;
> > > +};
> > > +
> > > +static void yyy_func(struct work_struct *work)
> > > +{
> > > + struct xxx *xxx = container_of(work, struct yyy, work)->xxx;
> > > + xxx->func(work);
> > > +
> > > + if (atomic_dec_and_test(&xxx->count))
> > > + complete(&xxx->done);
> > > +}
> > > ...
> >
> > Yes, along those lines.
> >
> > you can call xxx a flush_context and create an interface like:
> >
> > int queue_work_contex(struct workqueue_struct *wq,
> > struct flush_context *fc, struct work_struct *work)
> > {
> > work->context = fc;
> > return queue_work(wq, work);
> > }
> >
> > void flush_workqueue_context(struct workqueue_strucy *wq, t
> > struct flush_context *fc)
> > {
> > if (atomic_read(&context->count))
> > wait_for_completion(&fc->completion);
> > /* except that the above is racy, wait_event() comes to mind */
> > }
> >
> > of course run_workqueue() would then need to be augmented with something
> > like:
> >
> > context = work->context;
> > ...
> > f(work);
> > ...
> > if (context && atomic_dec_and_test(&context->count))
> > complete(&context->done);
>
> > also, I seem to have quitely ignored the fact that struct work doesn't
> > have the context pointer, and growing it unconditionally like this isn't
> > nice - hummm,. perhaps we have a bit left in data and can signify a
> > larger struct work_struct.. ?
>
> Yes, we have a free bit... but afaics we can do better.
>
> struct context_barrier {
> struct work_struct work;
> struct flush_context *fc;
> ...
> }
>
> void context_barrier_barrier_func(struct work_struct *work)
> {
> struct flush_context *fc = container_of();
> if (atomic_dec_and_test())
> ...
> }
>
> void insert_context_barrier(work, barr)
> {
> ...insert barr after work, like flush_work() does...
> }
>
> queue_work_contex(struct workqueue_struct *wq,
> struct work_struct *work,
> struct flush_context *fc)
> {
> int ret = queue_work(wq, work);
> if (ret)
> insert_context_barrier(work, barr);
> return ret;
> }
>
> this way we shouldn't change run_workqueue() and introduce a "parallel"
> larger work_struct which needs its own INIT_()/etc.

Where does do the context_barrier instances come from, are they
allocated in insert_context_barrier() ?

> However I'm a bit sceptical this will be widely used... I may be wrong.

I have no idea either - but it doesn't seem weird to have multiple
worklets outstanding without knowing which.

> > making all this PI savvy for -rt is going to be fun though.. I guess we
> > can just queue a normal barrier of the flusher's priority, and cancel it
> > once we complete.. hey - that doesn't sound hard at all :-)
>
> Yes!!! I think this is much better (because _much_ simple) than re-ordering
> the pending work_struct's, we can just boost the whole ->worklist. We can
> implement flush_work_pi() in the same manner as queue_work_contex() above.
> That is why I said previously that flush_() should govern the priority,
> not queue.
>
> But we can also implement queue_work_pi(struct work_struct_pi *work).

in -rt all the workqueue stuff is PI already, we replaced the list_head
in work_struct with a plist_node and queue_work() enqueues worklets at
the ->normal_prio of the calling task.

Hmm, the required barrier might still spoil - or at least complicate the
matter.

In order to boost the pending work, we'd need to enqueue a barrer before
the high prio worklet - otherwise the high prio worklet will complete
before the work we're waiting on.

And we can cancel the high prio worklet once all our worklets of
interrest are done, but we cannot cancel the barrier.

So it would leave a barrier behind.

It comes back to me, PI workqueues was comlicated..

2008-06-13 15:15:32

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] workqueues: insert_work: use "list_head *" instead of "int tail"

On 06/13, Peter Zijlstra wrote:
>
> On Fri, 2008-06-13 at 18:26 +0400, Oleg Nesterov wrote:
> >
> > Yes, we have a free bit... but afaics we can do better.
> >
> > struct context_barrier {
> > struct work_struct work;
> > struct flush_context *fc;
> > ...
> > }
> >
> > void context_barrier_barrier_func(struct work_struct *work)
> > {
> > struct flush_context *fc = container_of();
> > if (atomic_dec_and_test())
> > ...
> > }
> >
> > void insert_context_barrier(work, barr)
> > {
> > ...insert barr after work, like flush_work() does...
> > }
> >
> > queue_work_contex(struct workqueue_struct *wq,
> > struct work_struct *work,
> > struct flush_context *fc)
> > {
> > int ret = queue_work(wq, work);
> > if (ret)
> > insert_context_barrier(work, barr);
> > return ret;
> > }
> >
> > this way we shouldn't change run_workqueue() and introduce a "parallel"
> > larger work_struct which needs its own INIT_()/etc.
>
> Where does do the context_barrier instances come from, are they
> allocated in insert_context_barrier() ?

Ah, indeed. We have to kmalloc() or use a larger work_struct... And if we
use a larger work_struct, we can make

struct work_struct_context
{
struct work_struct work;
work_func_t real_func;
struct flush_context *fc;
};

static void work_context_func(struct work_struct *work)
{
struct work_struct_context *wc = container_of();
struct flush_context *fc = wc->context;

wc->real_func(work);

if (atomic_dec_and_test(fc->count))
...
}

to avoid changing run_workqueue(), but this is nasty. Perhaps flush_context
can be just array or list of queued work_structs, then we can do flush_work()
for each work_struct...

> > > making all this PI savvy for -rt is going to be fun though.. I guess we
> > > can just queue a normal barrier of the flusher's priority, and cancel it
> > > once we complete.. hey - that doesn't sound hard at all :-)
> >
> > Yes!!! I think this is much better (because _much_ simple) than re-ordering
> > the pending work_struct's, we can just boost the whole ->worklist. We can
> > implement flush_work_pi() in the same manner as queue_work_contex() above.
> > That is why I said previously that flush_() should govern the priority,
> > not queue.
> >
> > But we can also implement queue_work_pi(struct work_struct_pi *work).
>
> in -rt all the workqueue stuff is PI already, we replaced the list_head
> in work_struct with a plist_node and queue_work() enqueues worklets at
> the ->normal_prio of the calling task.

Oh well. IIRC, these patches really uglified^Wcompicated the code. Don't
get me wrong, if I knew how to make this better, I'd sent the patch ;)

> Hmm, the required barrier might still spoil - or at least complicate the
> matter.
>
> In order to boost the pending work, we'd need to enqueue a barrer before
> the high prio worklet - otherwise the high prio worklet will complete
> before the work we're waiting on.

flush_work_pi() can boost the priority, and then barrier restores it.
Can't this work?

> And we can cancel the high prio worklet once all our worklets of
> interrest are done, but we cannot cancel the barrier.

Yes, if we cancel the high prio worklet, this doesn't restore the prio
immediately.

Oleg.

2008-06-13 15:32:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] workqueues: insert_work: use "list_head *" instead of "int tail"

On Fri, 2008-06-13 at 19:17 +0400, Oleg Nesterov wrote:
> On 06/13, Peter Zijlstra wrote:
> >
> > On Fri, 2008-06-13 at 18:26 +0400, Oleg Nesterov wrote:
> > >
> > > Yes, we have a free bit... but afaics we can do better.
> > >
> > > struct context_barrier {
> > > struct work_struct work;
> > > struct flush_context *fc;
> > > ...
> > > }
> > >
> > > void context_barrier_barrier_func(struct work_struct *work)
> > > {
> > > struct flush_context *fc = container_of();
> > > if (atomic_dec_and_test())
> > > ...
> > > }
> > >
> > > void insert_context_barrier(work, barr)
> > > {
> > > ...insert barr after work, like flush_work() does...
> > > }
> > >
> > > queue_work_contex(struct workqueue_struct *wq,
> > > struct work_struct *work,
> > > struct flush_context *fc)
> > > {
> > > int ret = queue_work(wq, work);
> > > if (ret)
> > > insert_context_barrier(work, barr);
> > > return ret;
> > > }
> > >
> > > this way we shouldn't change run_workqueue() and introduce a "parallel"
> > > larger work_struct which needs its own INIT_()/etc.
> >
> > Where does do the context_barrier instances come from, are they
> > allocated in insert_context_barrier() ?
>
> Ah, indeed. We have to kmalloc() or use a larger work_struct... And if we
> use a larger work_struct, we can make
>
> struct work_struct_context
> {
> struct work_struct work;
> work_func_t real_func;
> struct flush_context *fc;
> };
>
> static void work_context_func(struct work_struct *work)
> {
> struct work_struct_context *wc = container_of();
> struct flush_context *fc = wc->context;
>
> wc->real_func(work);
>
> if (atomic_dec_and_test(fc->count))
> ...
> }
>
> to avoid changing run_workqueue(), but this is nasty. Perhaps flush_context
> can be just array or list of queued work_structs, then we can do flush_work()
> for each work_struct...

list would grow work_struct with a list_head, and sizeof(list_head) >
sizeof(conetxt *), and an array would require krealloc().

Neither sounds really appealing..

Anyway, I think before we go further down this road, we'd better see if
anybody actually needs this. Not that theorizing about this problem
isn't fun,... but... :-)

> > > > making all this PI savvy for -rt is going to be fun though.. I guess we
> > > > can just queue a normal barrier of the flusher's priority, and cancel it
> > > > once we complete.. hey - that doesn't sound hard at all :-)
> > >
> > > Yes!!! I think this is much better (because _much_ simple) than re-ordering
> > > the pending work_struct's, we can just boost the whole ->worklist. We can
> > > implement flush_work_pi() in the same manner as queue_work_contex() above.
> > > That is why I said previously that flush_() should govern the priority,
> > > not queue.
> > >
> > > But we can also implement queue_work_pi(struct work_struct_pi *work).
> >
> > in -rt all the workqueue stuff is PI already, we replaced the list_head
> > in work_struct with a plist_node and queue_work() enqueues worklets at
> > the ->normal_prio of the calling task.
>
> Oh well. IIRC, these patches really uglified^Wcompicated the code. Don't
> get me wrong, if I knew how to make this better, I'd sent the patch ;)
>
> > Hmm, the required barrier might still spoil - or at least complicate the
> > matter.
> >
> > In order to boost the pending work, we'd need to enqueue a barrer before
> > the high prio worklet - otherwise the high prio worklet will complete
> > before the work we're waiting on.
>
> flush_work_pi() can boost the priority, and then barrier restores it.
> Can't this work?
>
> > And we can cancel the high prio worklet once all our worklets of
> > interrest are done, but we cannot cancel the barrier.
>
> Yes, if we cancel the high prio worklet, this doesn't restore the prio
> immediately.

/me reads back that -rt barrier code,... *groan* head explodes.. I'm
sure we can make it drop the prio on cancel, but I'd have to take a
proper look at it.

But getting rid of the barrier will be very tricky. So it will always
have some side-effects..


2008-06-24 05:41:39

by Max Krasnyansky

[permalink] [raw]
Subject: Re: [PATCH] workqueues: insert_work: use "list_head *" instead of "int tail"

Sorry for the silence. I stirred the discussion but got buried in other stuff.

Peter Zijlstra wrote:
> Anyway, I think before we go further down this road, we'd better see if
> anybody actually needs this. Not that theorizing about this problem
> isn't fun,... but... :-)

Let me see if I can sum up current state of affairs. Looks like people are in
general ok with Oleg's patches. Fancier stuff is much more complex and may not
be needed.
Combining Oleg's patches with auditing current flush_scheduled_work() users
and fixing them to use cancel_work_sync() (and in some cases flush_work())
gives us desired behaviour. Which is:
1. minimizing flush overhead
2. handling work queue thread starvation

Does that sound right ? Or did I miss something in the discussion ?

If that sounds right we should resend the patches to Andrew with formal ACKs
because I do not seem them in mainline, linux-next or -mm.

Thanks
Max