2006-08-20 13:15:49

by Tejun Heo

[permalink] [raw]
Subject: [PATCH] file: kill unnecessary timer in fdtable_defer

free_fdtable_rc() schedules timer to reschedule fddef->wq if
schedule_work() on it returns 0. However, schedule_work() guarantees
that the target work is executed at least once after the scheduling
regardless of its return value. 0 return simply means that the work
was already pending and thus no further action was required.

Another problem is that it used contant '5' as @expires argument to
mod_timer().

Kill unnecessary fddef->timer.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Dipankar Sarma <[email protected]>
---
fs/file.c | 29 ++---------------------------
1 file changed, 2 insertions(+), 27 deletions(-)

Index: work/fs/file.c
===================================================================
--- work.orig/fs/file.c
+++ work/fs/file.c
@@ -21,7 +21,6 @@
struct fdtable_defer {
spinlock_t lock;
struct work_struct wq;
- struct timer_list timer;
struct fdtable *next;
};

@@ -75,22 +74,6 @@ static void __free_fdtable(struct fdtabl
kfree(fdt);
}

-static void fdtable_timer(unsigned long data)
-{
- struct fdtable_defer *fddef = (struct fdtable_defer *)data;
-
- spin_lock(&fddef->lock);
- /*
- * If someone already emptied the queue return.
- */
- if (!fddef->next)
- goto out;
- if (!schedule_work(&fddef->wq))
- mod_timer(&fddef->timer, 5);
-out:
- spin_unlock(&fddef->lock);
-}
-
static void free_fdtable_work(struct fdtable_defer *f)
{
struct fdtable *fdt;
@@ -142,13 +125,8 @@ static void free_fdtable_rcu(struct rcu_
spin_lock(&fddef->lock);
fdt->next = fddef->next;
fddef->next = fdt;
- /*
- * vmallocs are handled from the workqueue context.
- * If the per-cpu workqueue is running, then we
- * defer work scheduling through a timer.
- */
- if (!schedule_work(&fddef->wq))
- mod_timer(&fddef->timer, 5);
+ /* vmallocs are handled from the workqueue context */
+ schedule_work(&fddef->wq);
spin_unlock(&fddef->lock);
put_cpu_var(fdtable_defer_list);
}
@@ -362,9 +340,6 @@ static void __devinit fdtable_defer_list
struct fdtable_defer *fddef = &per_cpu(fdtable_defer_list, cpu);
spin_lock_init(&fddef->lock);
INIT_WORK(&fddef->wq, (void (*)(void *))free_fdtable_work, fddef);
- init_timer(&fddef->timer);
- fddef->timer.data = (unsigned long)fddef;
- fddef->timer.function = fdtable_timer;
fddef->next = NULL;
}


2006-08-21 04:30:09

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [PATCH] file: kill unnecessary timer in fdtable_defer

On Sun, Aug 20, 2006 at 10:15:42PM +0900, Tejun Heo wrote:
> free_fdtable_rc() schedules timer to reschedule fddef->wq if
> schedule_work() on it returns 0. However, schedule_work() guarantees
> that the target work is executed at least once after the scheduling
> regardless of its return value. 0 return simply means that the work
> was already pending and thus no further action was required.

Hmm.. Is this really true ? IIRC, schedule_work() checks pending
work based on bit ops on work->pending and clear_bit() is not
a memory barrier. So, if I see work->pending = 1 in free_fdtable_work(), how
do I know that the work function is already executing and
missed the new work that I had queued ?


Thanks
Dipankar

2006-08-21 05:18:24

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] file: kill unnecessary timer in fdtable_defer

On Mon, Aug 21, 2006 at 10:02:57AM +0530, Dipankar Sarma wrote:
> On Sun, Aug 20, 2006 at 10:15:42PM +0900, Tejun Heo wrote:
> > free_fdtable_rc() schedules timer to reschedule fddef->wq if
> > schedule_work() on it returns 0. However, schedule_work() guarantees
> > that the target work is executed at least once after the scheduling
> > regardless of its return value. 0 return simply means that the work
> > was already pending and thus no further action was required.
>
> Hmm.. Is this really true ? IIRC, schedule_work() checks pending
> work based on bit ops on work->pending and clear_bit() is not
> a memory barrier.

Those bitops are not memory barriers but they can define order between
them alright. Once the execution order is correct, the rest of
synchronization is the caller's responsbility. In fddef's case, it's
achieved via fddef->lock.

> So, if I see work->pending = 1 in free_fdtable_work(), how do I know
> that the work function is already executing and missed the new work
> that I had queued ?

This is classical edge-triggered event handling where each event
doesn't require separate handling. The producer produces one or more
items at a time and the consumer consumes the whole queue on
invocation.

For example, when Pn indicates producing of item n and C indicates
consuming. The seqeunce "P0 C P1 C P2 C" and "P0 P1 P2 C" are
effectively identical, so all that's required for correct operation
(that is, full queue consumption) is that the consumer is run at least
once after producing of the last item.

In workqueue, this is guaranteed by

1. If pending bit is set, the work is guaranteed to be executed in
some future - it's on timer or queue.

2. The queuer sets the pending bit *after* producing a job to be
done.

3. The worker clears the pending *before* executing the job.

I sometimes find it easier to understand with a diagram which looks
like the following. Time flows from top to bottom.

Queuer Worker

-------------
| produce job |
| | <--- clr pending --->
------------- |
| v
v --------------
<--- set pending ---> | consume jobs |
| |
--------------

Now move the queuer and worker up and down in your mind. If 'set
pending' is higher than clr pending 'consume job' is guaranteed to see
the job queuer has produced whether pending is set or clear
beforehand. If 'set pending' is lower than 'clr pending', it is
guaranteed to set pending and schedule worker. The workqueue is
straight-forward expansion where there are N queuers and a repeating
consumer.

--
tejun

2006-08-21 07:55:30

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [PATCH] file: kill unnecessary timer in fdtable_defer

On Mon, Aug 21, 2006 at 02:18:16PM +0900, Tejun Heo wrote:
> On Mon, Aug 21, 2006 at 10:02:57AM +0530, Dipankar Sarma wrote:
> > > regardless of its return value. 0 return simply means that the work
> > > was already pending and thus no further action was required.
> >
> > Hmm.. Is this really true ? IIRC, schedule_work() checks pending
> > work based on bit ops on work->pending and clear_bit() is not
> > a memory barrier.
>
> Those bitops are not memory barriers but they can define order between
> them alright. Once the execution order is correct, the rest of

Huh ? If they are not memory barriers, they how can you guranttee
ordering on weakly ordered CPUs ?


> In workqueue, this is guaranteed by
>
> 1. If pending bit is set, the work is guaranteed to be executed in
> some future - it's on timer or queue.
>
> 2. The queuer sets the pending bit *after* producing a job to be
> done.
>
> 3. The worker clears the pending *before* executing the job.
>
> I sometimes find it easier to understand with a diagram which looks
> like the following. Time flows from top to bottom.
>
> Queuer Worker
>
> -------------
> | produce job |
> | | <--- clr pending --->
> ------------- |
> | v
> v --------------
> <--- set pending ---> | consume jobs |
> | |
> --------------
>
> Now move the queuer and worker up and down in your mind. If 'set
> pending' is higher than clr pending 'consume job' is guaranteed to see
> the job queuer has produced whether pending is set or clear
> beforehand. If 'set pending' is lower than 'clr pending', it is
> guaranteed to set pending and schedule worker. The workqueue is
> straight-forward expansion where there are N queuers and a repeating
> consumer.

Given that there is no smp_mb__after_clear_bit() after clearing
work->pending, what prevents the worker thread from seeing
the state of the deferred fd queue before setting the pending bit ?
IOW, the queuer sees pending = 1 and ignores waking the
worker thread, worker sees a stale state of the deferred fd queue
ignoring the newly queued work. That should be possible on
a cpu with weak memory ordering. Perhaps, we should fix
__queue_work() to add the smp_mb__after_clear_bit() and
make sure that we have a memory barrier after adding the
deferred fds.

Thanks
Dipankar


2006-08-21 08:25:06

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] file: kill unnecessary timer in fdtable_defer

On Mon, Aug 21, 2006 at 01:28:18PM +0530, Dipankar Sarma wrote:
> On Mon, Aug 21, 2006 at 02:18:16PM +0900, Tejun Heo wrote:
> > On Mon, Aug 21, 2006 at 10:02:57AM +0530, Dipankar Sarma wrote:
> > > > regardless of its return value. 0 return simply means that the work
> > > > was already pending and thus no further action was required.
> > >
> > > Hmm.. Is this really true ? IIRC, schedule_work() checks pending
> > > work based on bit ops on work->pending and clear_bit() is not
> > > a memory barrier.
> >
> > Those bitops are not memory barriers but they can define order between
> > them alright. Once the execution order is correct, the rest of
>
> Huh ? If they are not memory barriers, they how can you guranttee
> ordering on weakly ordered CPUs ?

Atomic bitops define orders *between* them not *around* them. ie. if
you have two atomic bitops on the same bit, they're ordered one way or
the other. As the workqueue code currently stands, ordering around
the bitops is the caller's responsibility and fddef does it with a
spinlock. As the producer and consumer usually access a common
resource, that kind of synchrnoization is inherent in most cases.

[--snip--]
> Given that there is no smp_mb__after_clear_bit() after clearing
> work->pending, what prevents the worker thread from seeing

Let me revise the diagram.

Queuer Worker

lock
------------- <--- clr pending --->
| produce job | |
| | v
------------- trying to lock
unlock v
| lock granted
v --------------
<--- set pending ---> | consume jobs |
| |
--------------
unlock

> the state of the deferred fd queue before setting the pending bit ?
> IOW, the queuer sees pending = 1 and ignores waking the
> worker thread, worker sees a stale state of the deferred fd queue
> ignoring the newly queued work. That should be possible on
> a cpu with weak memory ordering.

Queue being a shared resource, people usually synchronize around it.

> Perhaps, we should fix __queue_work() to add the
> smp_mb__after_clear_bit() and make sure that we have a memory
> barrier after adding the deferred fds.

That would help if the producer and consumer depend on memory ordering
for synchronization, which usually isn't the case. I'm not sure
whether adding that is a good idea or not. IMHO, it's better to use
explicit memory barriers with enough comments in such subtle cases to
make things clear.

Either way, fddef is just straight forward producer-consumer case with
no need for requeueing mechanism. There is no need and no other user
does anything similar.

Thanks.

--
tejun

2006-09-22 15:46:20

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] file: kill unnecessary timer in fdtable_defer

Hello, Dipankar.

It has been a month and this thread hasn't gotten anywhere. If there is
no further objection, I'll try to push it through -mm for testing.

Thanks.

--
tejun