2015-08-29 02:42:36

by Eric Dumazet

Subject: [PATCH] task_work: remove fifo ordering guarantee

From: Eric Dumazet <[email protected]>

In commit f341861fb0b ("task_work: add a scheduling point in
task_work_run()") I fixed a latency problem adding a cond_resched()
call.

Later, commit ac3d0da8f329 added yet another loop to reverse a list,
bringing back the latency spike :

I've seen in some cases this loop taking 275 ms, if for example a
process with 2,000,000 files is killed.

We could add yet another cond_resched() in the reverse loop, or we
can simply remove the reversal, as I do not think anything
would depend on order of task_work_add() submitted works.
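
For reference, the cond_resched() alternative would look roughly like
the sketch below (not what this patch does; "cnt" would be a new local):

	/* hypothetical: keep the reversal, but yield periodically */
	head = NULL;
	cnt = 0;
	do {
		next = work->next;
		work->next = head;
		head = work;
		work = next;
		if (!(++cnt & 1023))
			cond_resched();
	} while (work);
	work = head;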

Fixes: ac3d0da8f329 ("task_work: Make task_work_add() lockless")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: Maciej Żenczykowski <[email protected]>
---
kernel/task_work.c | 12 ++----------
1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index 8727032e3a6f..53fa971d000d 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -18,6 +18,8 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
 * This is like the signal handler which runs in kernel mode, but it doesn't
 * try to wake up the @task.
 *
+ * Note: there is no ordering guarantee on works queued here.
+ *
 * RETURNS:
 * 0 if succeeds or -ESRCH.
 */
@@ -108,16 +110,6 @@ void task_work_run(void)
 		raw_spin_unlock_wait(&task->pi_lock);
 		smp_mb();
 
-		/* Reverse the list to run the works in fifo order */
-		head = NULL;
-		do {
-			next = work->next;
-			work->next = head;
-			head = work;
-			work = next;
-		} while (work);
-
-		work = head;
 		do {
 			next = work->next;
 			work->func(work);


2015-08-29 03:19:44

by Linus Torvalds

Subject: Re: [PATCH] task_work: remove fifo ordering guarantee

On Fri, Aug 28, 2015 at 7:42 PM, Eric Dumazet <[email protected]> wrote:
>
> We could add yet another cond_resched() in the reverse loop, or we
> can simply remove the reversal, as I do not think anything
> would depend on order of task_work_add() submitted works.

So I think this should be ok, with things like file closing not really
caring about ordering as far as I can tell.

However, has anybody gone through all the task-work users? I looked
quickly at the task_work_add() cases, and didn't see anything that
looked like it would care, but others should look too. In the vfs,
there's the delayed fput and mnt freeing, and there's a keyring
installation one.

The threaded irq handlers use it as that exit-time hack, which
certainly shouldn't care, and there's some uprobe thing.

Can anybody see anything fishy?

Linus

2015-08-29 09:22:25

by Ingo Molnar

Subject: Re: [PATCH] task_work: remove fifo ordering guarantee


* Linus Torvalds <[email protected]> wrote:

> On Fri, Aug 28, 2015 at 7:42 PM, Eric Dumazet <[email protected]> wrote:
> >
> > We could add yet another cond_resched() in the reverse loop, or we can simply
> > remove the reversal, as I do not think anything would depend on order of
> > task_work_add() submitted works.
>
> So I think this should be ok, with things like file closing not really caring
> about ordering as far as I can tell.
>
> However, has anybody gone through all the task-work users? I looked quickly at
> the task_work_add() cases, and didn't see anything that looked like it would
> care, but others should look too. In the vfs, there's the delayed fput and mnt
> freeing, and there's a keyring installation one.
>
> The threaded irq handlers use it as that exit-time hack, which certainly
> shouldn't care, and there's some uprobe thing.
>
> Can anybody see anything fishy?

So I'm wondering, is there any strong reason why we couldn't use a double linked
list and still do FIFO and remove that silly linear list walking hack?

Thanks,

Ingo

2015-08-29 12:51:57

by Oleg Nesterov

Subject: Re: [PATCH] task_work: remove fifo ordering guarantee

On 08/28, Eric Dumazet wrote:
>
> From: Eric Dumazet <[email protected]>
>
> In commit f341861fb0b ("task_work: add a scheduling point in
> task_work_run()") I fixed a latency problem adding a cond_resched()
> call.
>
> Later, commit ac3d0da8f329 added yet another loop to reverse a list,
> bringing back the latency spike :
>
> I've seen in some cases this loop taking 275 ms, if for example a
> process with 2,000,000 files is killed.
>
> We could add yet another cond_resched() in the reverse loop,

Can't we do this?

> or we
> can simply remove the reversal, as I do not think anything
> would depend on order of task_work_add() submitted works.

Personally I'd prefer to keep the fifo ordering. It just makes
more sense imho. Even if currently nobody depends on it (although
I am not sure about out-of-tree modules, say, systemtap).

Let's look at keyctl_session_to_parent(). It does task_work_cancel()
but only because we cannot trust user-space. Otherwise we could
remove it and just do task_work_add(), but this needs fifo.

Fifo just looks more sane to me.

Oleg.

2015-08-29 12:56:57

by Oleg Nesterov

Subject: Re: [PATCH] task_work: remove fifo ordering guarantee

On 08/29, Ingo Molnar wrote:
>
> So I'm wondering, is there any strong reason why we couldn't use a double linked
> list and still do FIFO and remove that silly linear list walking hack?

This will obviously enlarge callback_head, and it is often embedded.
But this is minor.

If we use a double linked list we can't do task_work_add() lockless.
So we will need another spinlock_t in task_struct. We can't use pi_lock.
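
Roughly, the doubly linked variant would need something like the sketch
below (hypothetical ->works_lock and a fatter callback_head; the notify
argument is omitted):

	/* callback_head would have to carry a list_head instead of ->next */
	struct callback_head {
		struct list_head entry;
		void (*func)(struct callback_head *);
	};

	/* in task_struct: */
	spinlock_t works_lock;		/* hypothetical new lock */
	struct list_head works;		/* FIFO, no reversal needed */

	int task_work_add(struct task_struct *task, struct callback_head *work)
	{
		unsigned long flags;

		spin_lock_irqsave(&task->works_lock, flags);
		list_add_tail(&work->entry, &task->works);
		spin_unlock_irqrestore(&task->works_lock, flags);
		return 0;
	}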

Oleg.

2015-08-29 13:57:34

by Eric Dumazet

Subject: Re: [PATCH] task_work: remove fifo ordering guarantee

On Sat, 2015-08-29 at 14:49 +0200, Oleg Nesterov wrote:
> On 08/28, Eric Dumazet wrote:
> >
> > From: Eric Dumazet <[email protected]>
> >
> > In commit f341861fb0b ("task_work: add a scheduling point in
> > task_work_run()") I fixed a latency problem adding a cond_resched()
> > call.
> >
> > Later, commit ac3d0da8f329 added yet another loop to reverse a list,
> > bringing back the latency spike :
> >
> > I've seen in some cases this loop taking 275 ms, if for example a
> > process with 2,000,000 files is killed.
> >
> > We could add yet another cond_resched() in the reverse loop,
>
> Can't we do this?

Well, I stated in the changelog we could. Obviously we can.

Adding 275 ms of pure overhead to perform this list reversal for files
to be closed is quite unfortunate.


> Personally I'd prefer to keep the fifo ordering. It just makes
> more sense imho. Even if currently nobody depends on it (although
> I am not sure about out-of-tree modules, say, systemtap).
>
> Let's look at keyctl_session_to_parent(). It does task_work_cancel()
> but only because we cannot trust user-space. Otherwise we could
> remove it and just do task_work_add(), but this needs fifo.

So it looks like there is no problem today, right, other than the
possibility of walking a long list while blocking IRQs?

>
> Fifo just looks more sane to me.

Well, files are closed in a random order, and file closing is the main
user of this stuff.

If this is that critical, maybe use 2 lists, one for stuff needing fifo,
and another one for un-ordered stuff (e.g. file closing), and add a
boolean to task_work_add()/task_work_cancel(). This adds yet another
field into struct task_struct.
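
Just to illustrate the shape of that idea, with hypothetical fields and
signature:

	/* in task_struct: */
	struct callback_head *task_works;	/* existing lockless LIFO */
	struct callback_head *task_works_fifo;	/* hypothetical ordered list */

	int task_work_add(struct task_struct *task, struct callback_head *work,
			  bool notify, bool fifo);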

Now we also could question why we needed commit
4a9d4b024a3102fc083c925c242d98ac27b1c5f6 ("switch fput to
task_work_add") since it seems quite an overhead at task exit with
10^6 files to close.

I understood the 'schedule_work() for interrupt/kernel_thread callers'
part, but not the task_work_add() one.



2015-08-29 14:11:17

by Eric Dumazet

Subject: Re: [PATCH] task_work: remove fifo ordering guarantee

On Sat, 2015-08-29 at 06:57 -0700, Eric Dumazet wrote:

> Now we also could question why we needed commit
> 4a9d4b024a3102fc083c925c242d98ac27b1c5f6 ("switch fput to
> task_work_add") since it seems quite an overhead at task exit with
> 10^6 files to close.
>
> I understood the 'schedule_work() for interrupt/kernel_thread callers'
> part, but not the task_work_add() one.

If this needs to be kept, maybe then add the following, to make sure
we flush the list at most every BITS_PER_LONG files

diff --git a/fs/file.c b/fs/file.c
index 6c672ad329e9..f3d0a79cef05 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -22,6 +22,7 @@
 #include <linux/spinlock.h>
 #include <linux/rcupdate.h>
 #include <linux/workqueue.h>
+#include <linux/task_work.h>
 
 int sysctl_nr_open __read_mostly = 1024*1024;
 int sysctl_nr_open_min = BITS_PER_LONG;
@@ -392,6 +393,7 @@ static struct fdtable *close_files(struct files_struct * files)
 			i++;
 			set >>= 1;
 		}
+		task_work_run();
 	}
 
 	return fdt;

2015-08-29 17:08:32

by Linus Torvalds

Subject: Re: [PATCH] task_work: remove fifo ordering guarantee

On Sat, Aug 29, 2015 at 7:11 AM, Eric Dumazet <[email protected]> wrote:
>
> If this needs to be kept, maybe then add the following, to make sure
> we flush the list at most every BITS_PER_LONG files

Hmm.

I'm wondering if we should just make close_files() (or maybe even
filp_close()) use a synchronous fput().

Iirc, the reason we delay fput() is that we had some nasty issues for
the generic fput case. It was called from interrupt context by the aio
code, and just in general there's a lot of nasty cases that can cause
the final fput to happen (so there are lockdep issues with the mmap
locks because the last fput can come from munmap etc).

Maybe I forget some detail - it's been several years by now - but I
think we could make the regular "close()" and "exit()" cases just use
the synchronous fput (it's called "__fput_sync()" and currently
explicitly limited to just kernel threads).

Al?

Because it feels all kinds of stupid to add things to the task-work
queue just to then remove it almost immediately again. And
close_files() is also called from various contexts, but the whole "put
the final 'files_struct'" case is certainly not at all as special as
the "put the final file" one.

Linus

2015-08-29 21:08:19

by George Spelvin

Subject: Re: [PATCH] task_work: remove fifo ordering guarantee

Oleg Nesterov wrote:
> On 08/29, Ingo Molnar wrote:
>> So I'm wondering, is there any strong reason why we couldn't use a double
>> linked list and still do FIFO and remove that silly linear list walking hack?
>
> This will obviously enlarge callback_head, and it is often embedded.
> But this is minor.
>
> If we use a double linked list we can't do task_work_add() lockless.
> So we will need another spinlock_t in task_struct. We can't use pi_lock.

You only need a singly linked list for FIFO, but indeed lockless
access is a pain.

For a LIFO stack, you just do a single compare-and-swap on the head.
Once an entry is in the list, it's immutable.
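
Roughly what the current lockless push loop in kernel/task_work.c looks
like (a sketch; details may differ):

	do {
		head = ACCESS_ONCE(task->task_works);
		if (unlikely(head == &work_exited))
			return -ESRCH;
		work->next = head;
	} while (cmpxchg(&task->task_works, head, work) != head);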

For FIFO, you only need one pointer in the nodes, but two in the list
head: a head pointer and a tail pointer.

The problem for lockless access is that you have to update both the next
pointer and the tail pointer, and without very specialized instructions
like 680x0's CAS2, there's no way to do them both atomically.

So the procedure to append (write) to the list is:
- Set your newly added node's next pointer to NULL.
- CAS the tail pointer. This establishes your place in line,
  but does not make it visible to the reader.
- At this point, the next pointer is not yours any more and may be
  written by someone else, but you may still finish filling in the
  rest of the node, if you like.
- But also, there's a sort of priority inversion problem. If a writer
  stalls here, no following writer is visible to the reader.
- Then you update the previous tail (obtained from the CAS) to link via its
  next pointer to your newly added node. This makes it visible to the reader.

- The reader, meanwhile, totally ignores the tail pointer, and
  only uses the head and next pointers.
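
In code, the writer side above might look like this (illustrative names;
assumes the queue is initialized with a dummy node so ->tail is never
NULL, and uses an unconditional xchg() in place of the CAS, which serves
the same purpose here):

	struct node {
		struct node *next;
	};

	struct queue {
		struct node *head;	/* used by the (single) reader only */
		struct node *tail;	/* updated by writers only */
	};

	void q_append(struct queue *q, struct node *n)
	{
		struct node *prev;

		n->next = NULL;
		prev = xchg(&q->tail, n);	/* claim our place in line */
		/*
		 * If we stall here, nodes appended after us stay invisible
		 * to the reader: the "priority inversion" mentioned above.
		 */
		WRITE_ONCE(prev->next, n);	/* publish to the reader */
	}

	/* The reader follows head/next only, and must never free a node
	 * whose ->next is still NULL. */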

Now, one trick with the above is that the tail pointer must *only*
be accessed by writers. The single-threaded technique of representing
an empty queue by using a "struct **" tail pointer and having tail = &head
to represent an empty list is not okay.

Rather, the reader must never deallocate a node with a NULL next pointer.

If the nodes are large, you can minimize the overhead of this by having a
dedicated sentinel/dummy node which lives in the queue and gets moved to
the tail whenever it reaches the head *and* has a non-NULL next pointer.
But the reader may not depend on this node always being visible!

The reader might find the sentinel pointing to node A, and then move it to
the tail by doing a CAS on the tail pointer, but the tail pointer
by then points to B. It's possible that A's next pointer is still NULL
and the writer adding B has not gotten to the second step yet.


Anyway, it can be done, but honestly the current "empty the stack in
batches and then reverse it" is simpler.

The cond_resched is ugly, but the overhead is minimal, and the FIFO
guarantee is convenient.


By the way, it's quite possible (LIFO or FIFO) for the writer to batch up
queue adds and only do one CAS to add an arbitrary number.
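
E.g., reusing the illustrative q_append() types above, a writer can link
a private chain first..last and publish the whole batch with one atomic op:

	void q_append_batch(struct queue *q, struct node *first,
			    struct node *last)
	{
		struct node *prev;

		last->next = NULL;	/* first..last already linked privately */
		prev = xchg(&q->tail, last);
		WRITE_ONCE(prev->next, first);
	}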

2015-08-31 05:23:36

by yalin wang

Subject: Re: [PATCH] task_work: remove fifo ordering guarantee


> On Aug 30, 2015, at 01:08, Linus Torvalds <[email protected]> wrote:
>
> On Sat, Aug 29, 2015 at 7:11 AM, Eric Dumazet <[email protected]> wrote:
>>
>> If this needs to be kept, maybe then add the following, to make sure
>> we flush the list at most every BITS_PER_LONG files
>
> Hmm.
>
> I'm wondering if we should just make close_files() (or maybe even
> filp_close()) use a synchronous fput().
>
> Iirc, the reason we delay fput() is that we had some nasty issues for
> the generic fput case. It was called from interrupt context by the aio
> code, and just in general there's a lot of nasty cases that can cause
> the final fput to happen (so there are lockdep issues with the mmap
> locks because the last fput can come from munmap etc).
>
> Maybe I forget some detail - it's been several years by now - but I
> think we could make the regular "close()" and "exit()" cases just use
> the synchronous fput (it's called "__fput_sync()" and currently
> explicitly limited to just kernel threads).
>
> Al?
>
> Because it feels all kinds of stupid to add things to the task-work
> queue just to then remove it almost immediately again. And
> close_files() is also called from various contexts, but the whole "put
> the final 'files_struct'" case is certainly not at all as special as
> the "put the final file" one.
>
> Linus
> --
why not provide an API pair like:
fput()
fput_nosync() ?

The synchronous version is reasonable and safe most of the time;
letting the caller select which version to use seems more flexible,
whether it is a kthread or not.

Thanks

2015-08-31 06:02:14

by Ingo Molnar

Subject: Re: [PATCH] task_work: remove fifo ordering guarantee


* Oleg Nesterov <[email protected]> wrote:

> On 08/29, Ingo Molnar wrote:
> >
> > So I'm wondering, is there any strong reason why we couldn't use a double linked
> > list and still do FIFO and remove that silly linear list walking hack?
>
> This will obviously enlarge callback_head, and it is often embedded.
> But this is minor.
>
> If we use a double linked list we can't do task_work_add() lockless.
> So we will need another spinlock_t in task_struct. We can't use pi_lock.

The fact that the O(N) overhead was measured in real apps to be in the
milliseconds IMHO weakens cycle-level concerns about also having a spinlock next
to the list head. (There's no additional cacheline bouncing concerns with the
spinlock: the head of a LIFO list is essentially a bouncing cacheline.)

If there's some other solution, sure, but LIFO queues tend to be trouble down the
line.

Thanks,

Ingo

2015-08-31 12:08:04

by Oleg Nesterov

Subject: change filp_close() to use __fput_sync() ? (Was: [PATCH] task_work: remove fifo ordering guarantee)

On 08/29, Eric Dumazet wrote:
>
> On Sat, 2015-08-29 at 14:49 +0200, Oleg Nesterov wrote:
> > On 08/28, Eric Dumazet wrote:
> > >
> > > From: Eric Dumazet <[email protected]>
> > >
> > > In commit f341861fb0b ("task_work: add a scheduling point in
> > > task_work_run()") I fixed a latency problem adding a cond_resched()
> > > call.
> > >
> > > Later, commit ac3d0da8f329 added yet another loop to reverse a list,
> > > bringing back the latency spike :
> > >
> > > I've seen in some cases this loop taking 275 ms, if for example a
> > > process with 2,000,000 files is killed.
> > >
> > > We could add yet another cond_resched() in the reverse loop,
> >
> > Can't we do this?
>
> Well, I stated in the changelog we could. Obviously we can.
>
> Adding 275 ms of pure overhead to perform this list reversal for files
> to be closed is quite unfortunate.

Well, if the first loop takes 275 ms, then probably the next one which
actually does a lot of __fput's takes much, much more time, so perhaps
these 275 ms are not very noticeable. Ignoring the latency problem.

But of course, this is not good, I agree. Please see below.

> > Fifo just looks more sane to me.
>
> Well, files are closed in a random order, and file closing is the main
> user of this stuff.

This is the most "heavy" user. But task_works is the generic API.

> Now we also could question why we needed commit
> 4a9d4b024a3102fc083c925c242d98ac27b1c5f6 ("switch fput to
> task_work_add") since it seems quite an overhead at task exit with
> 10^6 files to close.

How about the patch below? I didn't try to test it yet, but since
filp_close() does ->flush() I think __fput_sync() should be safe here.

Al, what do you think?

Oleg.


--- x/fs/file_table.c
+++ x/fs/file_table.c
@@ -292,11 +292,8 @@ void fput(struct file *file)
  */
 void __fput_sync(struct file *file)
 {
-	if (atomic_long_dec_and_test(&file->f_count)) {
-		struct task_struct *task = current;
-		BUG_ON(!(task->flags & PF_KTHREAD));
+	if (atomic_long_dec_and_test(&file->f_count))
 		__fput(file);
-	}
 }
 
 EXPORT_SYMBOL(fput);
--- x/fs/open.c
+++ x/fs/open.c
@@ -1074,7 +1074,7 @@ int filp_close(struct file *filp, fl_owner_t id)
 		dnotify_flush(filp, id);
 		locks_remove_posix(filp, id);
 	}
-	fput(filp);
+	__fput_sync(filp);
 	return retval;
 }

2015-08-31 12:47:20

by Oleg Nesterov

Subject: Re: [PATCH] task_work: remove fifo ordering guarantee

On 08/29, Linus Torvalds wrote:
>
> On Sat, Aug 29, 2015 at 7:11 AM, Eric Dumazet <[email protected]> wrote:
> >
> > If this needs to be kept, maybe then add the following, to make sure
> > we flush the list at most every BITS_PER_LONG files
>
> Hmm.
>
> I'm wondering if we should just make close_files() (or maybe even
> filp_close()) use a synchronous fput().

Heh. I thought about the same change. So perhaps it is even the right
thing to do. Still I am worried, because "it can't be that simple" ;)

And, with this change close_files() is called before exit_fs() and
exit_task_namespaces(). This is fine (iiuc), but this means that the
creative code in drivers/ can (wrongly) rely on this fact again. IIRC,
the change which moved __fput() into exit_task_work() uncovered some
interesting problems, like filp_open() called from f_op->release().

Anyway, this is the question to Al, I guess.

Oleg.

2015-08-31 12:54:17

by Oleg Nesterov

Subject: Re: [PATCH] task_work: remove fifo ordering guarantee

On 08/31, Ingo Molnar wrote:
>
> * Oleg Nesterov <[email protected]> wrote:
>
> > On 08/29, Ingo Molnar wrote:
> > >
> > > So I'm wondering, is there any strong reason why we couldn't use a double linked
> > > list and still do FIFO and remove that silly linear list walking hack?
> >
> > This will obviously enlarge callback_head, and it is often embedded.
> > But this is minor.
> >
> > If we use a double linked list we can't do task_work_add() lockless.
> > So we will need another spinlock_t in task_struct. We can't use pi_lock.
>
> The fact that the O(N) overhead was measured in real apps to be in the
> milliseconds IMHO weakens cycle-level concerns about also having a spinlock next
> to the list head. (There's no additional cacheline bouncing concerns with the
> spinlock: the head of a LIFO list is essentially a bouncing cacheline.)

I agree. I just tried to explain that we need a bit more than just
s/callback_head/list_head/ in task_struct.

And the fact that this O(N) overhead was measured means that we have
more overhead from offload-fput-to-exit_task_work, which would be nice
to remove as well.

Oleg.

2015-08-31 13:24:48

by Oleg Nesterov

Subject: Re: [PATCH] task_work: remove fifo ordering guarantee

On 08/29, George Spelvin wrote:
>
> Oleg Nesterov wrote:
> > On 08/29, Ingo Molnar wrote:
> >> So I'm wondering, is there any strong reason why we couldn't use a double
> >> linked list and still do FIFO and remove that silly linear list walking hack?
> >
> > This will obviously enlarge callback_head, and it is often embedded.
> > But this is minor.
> >
> > If we use a double linked list we can't do task_work_add() lockless.
> > So we will need another spinlock_t in task_struct. We can't use pi_lock.
>
> You only need a singly linked list for FIFO, but indeed lockless
> access is a pain.
>
> For a LIFO stack, you just do a single compare-and-swap on the head.
> Once an entry is in the list, it's immutable.
>
> For FIFO, you only need one pointer in the nodes, but two in the list
> head: a head pointer and a tail pointer.

Actually you only need a single tail pointer, see commit 158e1645e07f3e9f7e49.
But this doesn't matter.

> The problem for lockless access is that you have to update both the next
> pointer and the tail pointer, and without very specialized instructions
> like 680x0's CAS2, there's no way to do them both atomically.
>
> So the procedure to append (write) to the list is:
> ...
> - But also, there's a sort of priority inversion problem. If a writer
> stalls here, no following writer is visible to the reader.

And this also means that the next writer which does task_work_add() +
task_work_cancel() will be surprised. Worse, this means that work->func()
doesn't own its callback_head/container_of. The previous tail is visible
to task_work_run().

Perhaps I missed something. But to me this all looks too clever ;)
Personally I'd prefer to just add another spinlock_t.

But so far I hope we can keep this stupid but simple "reverse the list"
loop.

Oleg.

2015-08-31 15:21:29

by George Spelvin

Subject: Re: [PATCH] task_work: remove fifo ordering guarantee

Oleg Nesterov wrote:
> Actually you only need a single tail pointer, see commit 158e1645e07f3e9f7e49.
> But this doesn't matter.

(This uses a circularly linked list, keeping a pointer to the
tail, and tail->next pointing to the head.)

Yes, if you're not trying to be lockless, that works quite well.
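
For reference, a sketch of that single-tail-pointer scheme (locked, not
lockless; names are illustrative):

	/* tail->next points to the head; an empty list is tail == NULL */
	void enqueue(struct node **ptail, struct node *n)
	{
		if (*ptail) {
			n->next = (*ptail)->next;	/* old head */
			(*ptail)->next = n;
		} else {
			n->next = n;			/* sole element */
		}
		*ptail = n;
	}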

> And this also means that the next writer which does task_work_add() +
> task_work_cancel() will be surprised. Worse, this means that work->func()
> doesn't own its callback_head/container_of. The previous tail is visible
> to task_work_run().

I forgot about task_work_cancel()! Yes, supporting that as well would
be a problem, and make a messy algorithm even messier.

> Perhaps I missed something. But to me this all looks too clever ;)
> Personally I'd prefer to just add another spinlock_t.
>
> But so far I hope we can keep this stupid but simple "reverse the list"
> loop.

It *is* too complicated. That's why I described it: to discourage people
from implementing it.