2009-01-08 08:07:33

by Grissiom

[permalink] [raw]
Subject: "BUG: scheduling while atomic: pdflush/30/0x00000002" in latest git

When I using the latest git version, I occasionally got this in dmesg:

[ 2008.237234] BUG: scheduling while atomic: pdflush/30/0x00000002
[ 2008.237240] 2 locks held by pdflush/30:
[ 2008.237244] #0: (mutex){--..}, at: [<c01a57a1>] sync_filesystems+0x11/0x120
[ 2008.237258] #1: (sb_lock){--..}, at: [<c01a57ab>]
sync_filesystems+0x1b/0x120
[ 2008.237277] Modules linked in: fuse ricoh_mmc b43
[ 2008.237288] Pid: 30, comm: pdflush Not tainted
2.6.28-g14-rfkill-nophy-ledon-07485-g9e42d0c #62
[ 2008.237294] Call Trace:
[ 2008.237303] [<c04d7576>] schedule+0x326/0x8e0
[ 2008.237311] [<c01500d3>] __lock_acquire+0x293/0xa20
[ 2008.237321] [<c014e307>] mark_held_locks+0x67/0x80
[ 2008.237330] [<c04da58c>] _spin_unlock_irqrestore+0x4c/0x60
[ 2008.237339] [<c014e519>] trace_hardirqs_on_caller+0x149/0x1a0
[ 2008.237351] [<c0143d55>] async_synchronize_cookie_special+0xb5/0x140
[ 2008.237362] [<c013e1a0>] autoremove_wake_function+0x0/0x40
[ 2008.237372] [<c01a57fc>] sync_filesystems+0x6c/0x120
[ 2008.237381] [<c017ee90>] pdflush+0x0/0x1e0
[ 2008.237392] [<c01c0b90>] do_sync+0x20/0x60
[ 2008.237402] [<c01c0bda>] sys_sync+0xa/0x10
[ 2008.237412] [<c017ef9e>] pdflush+0x10e/0x1e0
[ 2008.237420] [<c014e519>] trace_hardirqs_on_caller+0x149/0x1a0
[ 2008.237429] [<c017de80>] laptop_flush+0x0/0x10
[ 2008.237437] [<c013dea2>] kthread+0x42/0x70
[ 2008.237444] [<c013de60>] kthread+0x0/0x70
[ 2008.237452] [<c0103c03>] kernel_thread_helper+0x7/0x14
(repeat some times)

Is there any resolution?
Please add me to the CC list. Thanks.

--
Cheers,
Grissiom


2009-01-08 14:38:28

by Dave Kleikamp

[permalink] [raw]
Subject: Re: "BUG: scheduling while atomic: pdflush/30/0x00000002" in latest git

Adding cc:lix-fsdevel

On Thu, 2009-01-08 at 16:07 +0800, Grissiom wrote:
> When I using the latest git version, I occasionally got this in dmesg:
>
> [ 2008.237234] BUG: scheduling while atomic: pdflush/30/0x00000002
> [ 2008.237240] 2 locks held by pdflush/30:
> [ 2008.237244] #0: (mutex){--..}, at: [<c01a57a1>] sync_filesystems+0x11/0x120
> [ 2008.237258] #1: (sb_lock){--..}, at: [<c01a57ab>]
> sync_filesystems+0x1b/0x120
> [ 2008.237277] Modules linked in: fuse ricoh_mmc b43
> [ 2008.237288] Pid: 30, comm: pdflush Not tainted
> 2.6.28-g14-rfkill-nophy-ledon-07485-g9e42d0c #62
> [ 2008.237294] Call Trace:
> [ 2008.237303] [<c04d7576>] schedule+0x326/0x8e0
> [ 2008.237311] [<c01500d3>] __lock_acquire+0x293/0xa20
> [ 2008.237321] [<c014e307>] mark_held_locks+0x67/0x80
> [ 2008.237330] [<c04da58c>] _spin_unlock_irqrestore+0x4c/0x60
> [ 2008.237339] [<c014e519>] trace_hardirqs_on_caller+0x149/0x1a0
> [ 2008.237351] [<c0143d55>] async_synchronize_cookie_special+0xb5/0x140
> [ 2008.237362] [<c013e1a0>] autoremove_wake_function+0x0/0x40
> [ 2008.237372] [<c01a57fc>] sync_filesystems+0x6c/0x120
> [ 2008.237381] [<c017ee90>] pdflush+0x0/0x1e0
> [ 2008.237392] [<c01c0b90>] do_sync+0x20/0x60
> [ 2008.237402] [<c01c0bda>] sys_sync+0xa/0x10
> [ 2008.237412] [<c017ef9e>] pdflush+0x10e/0x1e0
> [ 2008.237420] [<c014e519>] trace_hardirqs_on_caller+0x149/0x1a0
> [ 2008.237429] [<c017de80>] laptop_flush+0x0/0x10
> [ 2008.237437] [<c013dea2>] kthread+0x42/0x70
> [ 2008.237444] [<c013de60>] kthread+0x0/0x70
> [ 2008.237452] [<c0103c03>] kernel_thread_helper+0x7/0x14
> (repeat some times)
>

The offender is
http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=efaee192

async_synchronize_full_special() shouldn't be called while holding a
spinlock, sb_lock.

I think this patch should fix it. Arjan, would this work?

Signed-off-by: Dave Kleikamp <[email protected]>

diff --git a/fs/super.c b/fs/super.c
index cb20744..7d67387 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -458,7 +458,6 @@ void sync_filesystems(int wait)
if (sb->s_flags & MS_RDONLY)
continue;
sb->s_need_sync_fs = 1;
- async_synchronize_full_special(&sb->s_async_list);
}

restart:
@@ -471,6 +470,7 @@ restart:
sb->s_count++;
spin_unlock(&sb_lock);
down_read(&sb->s_umount);
+ async_synchronize_full_special(&sb->s_async_list);
if (sb->s_root && (wait || sb->s_dirt))
sb->s_op->sync_fs(sb, wait);
up_read(&sb->s_umount);

--
David Kleikamp
IBM Linux Technology Center

2009-01-08 15:19:25

by Arjan van de Ven

[permalink] [raw]
Subject: Re: "BUG: scheduling while atomic: pdflush/30/0x00000002" in latest git

On Thu, 08 Jan 2009 08:37:52 -0600
Dave Kleikamp <[email protected]> wrote:

> Adding cc:lix-fsdevel
>
> On Thu, 2009-01-08 at 16:07 +0800, Grissiom wrote:
> > When I using the latest git version, I occasionally got this in
> > dmesg:
> >
> > [ 2008.237234] BUG: scheduling while atomic: pdflush/30/0x00000002
> > [ 2008.237240] 2 locks held by pdflush/30:
> > [ 2008.237244] #0: (mutex){--..}, at: [<c01a57a1>]
> > sync_filesystems+0x11/0x120 [ 2008.237258] #1: (sb_lock){--..},
> > at: [<c01a57ab>] sync_filesystems+0x1b/0x120
> > [ 2008.237277] Modules linked in: fuse ricoh_mmc b43
> > [ 2008.237288] Pid: 30, comm: pdflush Not tainted
> > 2.6.28-g14-rfkill-nophy-ledon-07485-g9e42d0c #62
> > [ 2008.237294] Call Trace:
> > [ 2008.237303] [<c04d7576>] schedule+0x326/0x8e0
> > [ 2008.237311] [<c01500d3>] __lock_acquire+0x293/0xa20
> > [ 2008.237321] [<c014e307>] mark_held_locks+0x67/0x80
> > [ 2008.237330] [<c04da58c>] _spin_unlock_irqrestore+0x4c/0x60
> > [ 2008.237339] [<c014e519>] trace_hardirqs_on_caller+0x149/0x1a0
> > [ 2008.237351] [<c0143d55>]
> > async_synchronize_cookie_special+0xb5/0x140 [ 2008.237362]
> > [<c013e1a0>] autoremove_wake_function+0x0/0x40 [ 2008.237372]
> > [<c01a57fc>] sync_filesystems+0x6c/0x120 [ 2008.237381]
> > [<c017ee90>] pdflush+0x0/0x1e0 [ 2008.237392] [<c01c0b90>]
> > do_sync+0x20/0x60 [ 2008.237402] [<c01c0bda>] sys_sync+0xa/0x10
> > [ 2008.237412] [<c017ef9e>] pdflush+0x10e/0x1e0
> > [ 2008.237420] [<c014e519>] trace_hardirqs_on_caller+0x149/0x1a0
> > [ 2008.237429] [<c017de80>] laptop_flush+0x0/0x10
> > [ 2008.237437] [<c013dea2>] kthread+0x42/0x70
> > [ 2008.237444] [<c013de60>] kthread+0x0/0x70
> > [ 2008.237452] [<c0103c03>] kernel_thread_helper+0x7/0x14
> > (repeat some times)
> >
>
> The offender is
> http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=efaee192
>
> async_synchronize_full_special() shouldn't be called while holding a
> spinlock, sb_lock.
>
> I think this patch should fix it. Arjan, would this work?

yeah it will

woops/

2009-01-08 15:46:49

by Dave Kleikamp

[permalink] [raw]
Subject: [PATCH] async: Don't call async_synchronize_full_special() while holding sb_lock

sync_filesystems() shouldn't be calling async_synchronize_full_special
while holding a spinlock. The second while loop in that function is the
right place for this anyway.

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Reported-by: Grissiom <[email protected]>

diff --git a/fs/super.c b/fs/super.c
index cb20744..7d67387 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -458,7 +458,6 @@ void sync_filesystems(int wait)
if (sb->s_flags & MS_RDONLY)
continue;
sb->s_need_sync_fs = 1;
- async_synchronize_full_special(&sb->s_async_list);
}

restart:
@@ -471,6 +470,7 @@ restart:
sb->s_count++;
spin_unlock(&sb_lock);
down_read(&sb->s_umount);
+ async_synchronize_full_special(&sb->s_async_list);
if (sb->s_root && (wait || sb->s_dirt))
sb->s_op->sync_fs(sb, wait);
up_read(&sb->s_umount);

--
David Kleikamp
IBM Linux Technology Center

2009-01-08 22:51:28

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] async: Don't call async_synchronize_full_special() while holding sb_lock

On Thu, Jan 08, 2009 at 09:46:31AM -0600, Dave Kleikamp wrote:
> sync_filesystems() shouldn't be calling async_synchronize_full_special
> while holding a spinlock. The second while loop in that function is the
> right place for this anyway.

Out of curiousity, what on earth does
async_synchronize_full_special() do and why does it need to be in
sync_filesystems()?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2009-01-08 22:52:32

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH] async: Don't call async_synchronize_full_special() while holding sb_lock

Dave Chinner wrote:
> On Thu, Jan 08, 2009 at 09:46:31AM -0600, Dave Kleikamp wrote:
>> sync_filesystems() shouldn't be calling async_synchronize_full_special
>> while holding a spinlock. The second while loop in that function is the
>> right place for this anyway.
>
> Out of curiousity, what on earth does
> async_synchronize_full_special() do and why does it need to be in
> sync_filesystems()?
>
now that we have asynchronous operations, this function makes sure that all the functions
that we started async before this point complete, so that what when you sync, you sync all in progress work.

2009-01-09 00:33:41

by Alan

[permalink] [raw]
Subject: Re: [PATCH] async: Don't call async_synchronize_full_special() while holding sb_lock

On Thu, 08 Jan 2009 14:51:52 -0800
Arjan van de Ven <[email protected]> wrote:

> Dave Chinner wrote:
> > On Thu, Jan 08, 2009 at 09:46:31AM -0600, Dave Kleikamp wrote:
> >> sync_filesystems() shouldn't be calling async_synchronize_full_special
> >> while holding a spinlock. The second while loop in that function is the
> >> right place for this anyway.
> >
> > Out of curiousity, what on earth does
> > async_synchronize_full_special() do and why does it need to be in
> > sync_filesystems()?
> >
> now that we have asynchronous operations, this function makes sure that all the functions
> that we started async before this point complete, so that what when you sync, you sync all in progress work.

So why is it special - wouldn't async_synchronize_all() be a bit (or
complete_all) be a bit more clear ?

Alan

2009-01-09 00:39:04

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH] async: Don't call async_synchronize_full_special() while holding sb_lock

Alan Cox wrote:
> On Thu, 08 Jan 2009 14:51:52 -0800
> Arjan van de Ven <[email protected]> wrote:
>
>> Dave Chinner wrote:
>>> On Thu, Jan 08, 2009 at 09:46:31AM -0600, Dave Kleikamp wrote:
>>>> sync_filesystems() shouldn't be calling async_synchronize_full_special
>>>> while holding a spinlock. The second while loop in that function is the
>>>> right place for this anyway.
>>> Out of curiousity, what on earth does
>>> async_synchronize_full_special() do and why does it need to be in
>>> sync_filesystems()?
>>>
>> now that we have asynchronous operations, this function makes sure that all the functions
>> that we started async before this point complete, so that what when you sync, you sync all in progress work.
>
> So why is it special - wouldn't async_synchronize_all() be a bit (or
> complete_all) be a bit more clear ?

there is 2 types; one synchronizes the global kernel events
one is special in that it only synchronizes the events you give it (via a list head)

2009-01-09 01:41:20

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] async: Don't call async_synchronize_full_special() while holding sb_lock

On Thu, Jan 08, 2009 at 02:51:52PM -0800, Arjan van de Ven wrote:
> Dave Chinner wrote:
>> On Thu, Jan 08, 2009 at 09:46:31AM -0600, Dave Kleikamp wrote:
>>> sync_filesystems() shouldn't be calling async_synchronize_full_special
>>> while holding a spinlock. The second while loop in that function is the
>>> right place for this anyway.
>>
>> Out of curiousity, what on earth does
>> async_synchronize_full_special() do and why does it need to be in
>> sync_filesystems()?
>>
> now that we have asynchronous operations, this function makes sure that all the functions
> that we started async before this point complete, so that what when you sync, you sync all in progress work.

So what you are saying that we now have async operations that
need magical undocumented sync primitives apparently randomly
strewn throughout the code base?

Could you name the interface somewhat better? it's an async
operation synchronisation primitive - currently the name is
full of words that are typically suffixes to something that
is typically "namespace_operation_suffix". This is whole API
looks like "suffix_suffix_suffix_suffix".....

<rummage around the code>

Oh, fuck. You've made generic_delete_inode() an asychronous operation.
This is bad. Really Bad.

XFS does transactions in ->clear_inode so what you've done is
removed one of the synchronous throttleѕ on inode removal that
prevents build-up of massive numbers of deleted XFS inodes in
memory that have run the first unlink phase but not the second.

I know for a fact (because this has been done within XFS before)
that this causes serious writeback, sync and unmount latency
problems with XFS as well as introducing a whole new class of OOM
problems when under heavy load. The depth of the async queues
requires serious throttling to prevent these issues from occurring.

How bad? For example, rm -rf そf a larg enumber of files can cause
unmount or sync can take hours to run the async queues under heavy
load because the inodes existed only in memory and are randomly
spread around the filesystem (this is a real world example). In this
case under heavy memory load, XFS required inode buffer RMW to do
the modifcation, and given that it was on software RAID5 this also
required a RAID5 RMW cycle. The result? The async inode delete
queue drained at *50* deleted inodes per second. The synchronous
unlink rate was around 5000 inodes per second.....

So, given the potential impact of this change, what testing have
you done in terms of:

- performance impact
- regression testing
- sync() safety
- removing a million files and queuing all of the
deletes in the async queues....

On all the different filesystems under heavy I/O workloads?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2009-01-09 04:45:31

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH] async: Don't call async_synchronize_full_special() while holding sb_lock

Dave Chinner wrote:

>
> So, given the potential impact of this change, what testing have
> you done in terms of:
>
> - performance impact

I tested this on my machines and it gave a real performance improvement (11 to 8 seconds for a full kernel tree unlink, and cutting out
latency for normal applications)
> - sync() safety
that was exactly the synchronization point that's discussed here.

> - removing a million files and queuing all of the
> deletes in the async queues....

the async code throttles at 32k outstanding.
Yes 32K is arbitrary, but if you delete a million files fast, all but the first few thousand are
synchronous.

2009-01-09 05:18:26

by Grissiom

[permalink] [raw]
Subject: Re: "BUG: scheduling while atomic: pdflush/30/0x00000002" in latest git

On Thu, Jan 8, 2009 at 22:37, Dave Kleikamp <[email protected]> wrote:
>
> The offender is
> http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=efaee192
>
> async_synchronize_full_special() shouldn't be called while holding a
> spinlock, sb_lock.
>
> I think this patch should fix it. Arjan, would this work?
>
> Signed-off-by: Dave Kleikamp <[email protected]>
>
> diff --git a/fs/super.c b/fs/super.c
> index cb20744..7d67387 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -458,7 +458,6 @@ void sync_filesystems(int wait)
> if (sb->s_flags & MS_RDONLY)
> continue;
> sb->s_need_sync_fs = 1;
> - async_synchronize_full_special(&sb->s_async_list);
> }
>
> restart:
> @@ -471,6 +470,7 @@ restart:
> sb->s_count++;
> spin_unlock(&sb_lock);
> down_read(&sb->s_umount);
> + async_synchronize_full_special(&sb->s_async_list);
> if (sb->s_root && (wait || sb->s_dirt))
> sb->s_op->sync_fs(sb, wait);
> up_read(&sb->s_umount);
>
> --
> David Kleikamp
> IBM Linux Technology Center
>
>

Problem gone with this patch applied. Thanks for the fixing.

--
Cheers,
Grissiom

2009-01-09 08:23:15

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] async: Don't call async_synchronize_full_special() while holding sb_lock

On Thu, Jan 08, 2009 at 08:45:06PM -0800, Arjan van de Ven wrote:
> Dave Chinner wrote:
>
>>
>> So, given the potential impact of this change, what testing have
>> you done in terms of:
>>
>> - performance impact
>
> I tested this on my machines and it gave a real performance
> improvement (11 to 8 seconds for a full kernel tree unlink, and
> cutting out latency for normal applications)

Not really waht I'd call a significant performance test. What
happens under serious sustained I/O load? On multiple different
filesystems?

FWIW, my current kernel tree is:

$ find . -print | wc -l
30082

Which is just under the async operation limit you set so this
probably didn't even stress the mixing of sync and async deletes at
the same time...

>> - sync() safety
> that was exactly the synchronization point that's discussed here.

Right - how many fs developers did you discuss the impact of this
with? Did you crash test any filesystems? Did you run any fs QA
suites to check for regressions? How many different filesystems
did you even test?

We already know that sync is busted and doesn't provide the
guarantees it is supposed so on some filesystems. I'm extremely
concerned that changing unlink behaviour in this manner subtly
breaks sync even more than it already is....

>> - removing a million files and queuing all of the
>> deletes in the async queues....
>
> the async code throttles at 32k outstanding.
> Yes 32K is arbitrary, but if you delete a million files fast, all
> but the first few thousand are synchronous.

No, after the first 32k you get a mix of synchronous and async as
the async queue gets processed in parallel. This means you are
screwing up the temporal locality of the unlink of inodes. i.e. some
unlinks are being done immediately, others are being delayed until
another 32k inodes have been processed of the unlink list.

If we've been careful about allocation locality during untar so
that inodes that are allocated sequentially by untar will be
deleted sequential by rm -rf, then this new pattern will break those
optimisations because the unlink locality is being broken by
the some sync/some async behaviour that this queuing mechanism
creates.

Hmmmm - I suspect that rm -rf will also trigger lots of kernel
threads to be created. We do not want 256 kernel threads to
be spawned to bang on the serialised regions of filesystem
journals (generating lock contention) when one thread would
be sufficient....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2009-01-09 12:31:58

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] async: Don't call async_synchronize_full_special() while holding sb_lock

Hi Arjan.

Likely my mail is in your blacklist that's why you did not reply on
comment on the discussion about this async interface month or so ago :)

Anyway, what you did for the boot process should be only there, but vfs
changes have to be discussed in fsdevel.

For example generic_delete_inode() does not delete inode anymore,
instead it queues the work to the set of threads, serialized by the
global spinlock and atomic variables, allocating additional structure
for this not from the memory pool or at least memory cache, but
directly via slab.
This atomic allocation thus may be invoked during the cache shrink
path to free some memory, which is not a good idea. So if you do insist
that deletion of every inode should be async, at least shrink size of
the async_event and embed it into the inode, or use memory pool.

Just several other notes on the code.

__async_schedule() could first check amount of pending requests and
do not allocate and free the entry if all threads are busy.

entry processing code run_one_entry() should not free entry under the
irq-off lock.

Thread start routing should check if thread successfully started before
increasing number of the running thread, or this can be done in the
thread callback itself.

--
Evgeniy Polyakov

2009-01-09 15:10:39

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] async: Don't call async_synchronize_full_special() while holding sb_lock

On Fri, 2009-01-09 at 19:22 +1100, Dave Chinner wrote:
> >
> > the async code throttles at 32k outstanding.
> > Yes 32K is arbitrary, but if you delete a million files fast, all
> > but the first few thousand are synchronous.
>
> No, after the first 32k you get a mix of synchronous and async as
> the async queue gets processed in parallel. This means you are
> screwing up the temporal locality of the unlink of inodes. i.e. some
> unlinks are being done immediately, others are being delayed until
> another 32k inodes have been processed of the unlink list.
>
>
> If we've been careful about allocation locality during untar so
> that inodes that are allocated sequentially by untar will be
> deleted sequential by rm -rf, then this new pattern will break those
> optimisations because the unlink locality is being broken by
> the some sync/some async behaviour that this queuing mechanism
> creates.
>

You won't notice on ext34, but xfs, btrfs and probably jfs should go
slower with this. The throttle should wait for one of the entries on
the queue to finish, and then put this inode on the queue in order.
Doing the current one synchronously will generate random IO.

On XFS, create 256,000 4k files, drop caches and then time find . |
xargs rm

> Hmmmm - I suspect that rm -rf will also trigger lots of kernel
> threads to be created. We do not want 256 kernel threads to
> be spawned to bang on the serialised regions of filesystem
> journals (generating lock contention) when one thread would
> be sufficient....

This could be good and bad.

-chris

2009-01-12 02:31:54

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] async: Don't call async_synchronize_full_special() while holding sb_lock

Arjan van de Ven wrote:
> > - removing a million files and queuing all of the
> > deletes in the async queues....
>
> the async code throttles at 32k outstanding.
> Yes 32K is arbitrary, but if you delete a million files fast, all but the
> first few thousand are
> synchronous.

Hmm.

If I call unlink() a thousand times and then call fsync() on the
parent directories covering files I've unlinked... I expect the
deletes to be committed to disk when the last fsync() has returned. I
require that a crash and restart will not see the files. Several
kinds of transactional software and even some shell scripts expect this.

Will these asynchronous deletes break the guaranteed
commit-of-the-delete provided by fsync() on the parent directory?

-- Jamie

2009-01-12 03:55:08

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH] async: Don't call async_synchronize_full_special() while holding sb_lock

Jamie Lokier wrote:
> Arjan van de Ven wrote:
>>> - removing a million files and queuing all of the
>>> deletes in the async queues....
>> the async code throttles at 32k outstanding.
>> Yes 32K is arbitrary, but if you delete a million files fast, all but the
>> first few thousand are
>> synchronous.
>
> Hmm.
>
> If I call unlink() a thousand times and then call fsync() on the
> parent directories covering files I've unlinked... I expect the
> deletes to be committed to disk when the last fsync() has returned. I
> require that a crash and restart will not see the files. Several
> kinds of transactional software and even some shell scripts expect this.
>
> Will these asynchronous deletes break the guaranteed
> commit-of-the-delete provided by fsync() on the parent directory?

3 things:
1) removing the name from the directory and removing the data from disk are independent things.
The former happens from unlink(), the later happens when the refcount hits 0 (eg no more openers nor
any directory on disk referencing it). fsync() on a parent dir obviously only covers the first part,
while only the 2nd part was made asynchronous.
2) with the right synchronization point in fsync, it will still work out
3) this code will be redone for 2.6.30; for 2.6.29 it is removed.

2009-01-12 07:49:11

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] async: Don't call async_synchronize_full_special() while holding sb_lock

On Mon, Jan 12, 2009 at 02:31:38AM +0000, Jamie Lokier wrote:
> Arjan van de Ven wrote:
> > > - removing a million files and queuing all of the
> > > deletes in the async queues....
> >
> > the async code throttles at 32k outstanding.
> > Yes 32K is arbitrary, but if you delete a million files fast, all but the
> > first few thousand are
> > synchronous.
>
> Hmm.
>
> If I call unlink() a thousand times and then call fsync() on the
> parent directories covering files I've unlinked... I expect the
> deletes to be committed to disk when the last fsync() has returned. I
> require that a crash and restart will not see the files. Several
> kinds of transactional software and even some shell scripts expect this.
>
> Will these asynchronous deletes break the guaranteed
> commit-of-the-delete provided by fsync() on the parent directory?

It depends on the implementation of the filesystem you are running
on.

On XFS, it won't break anything because it uses a two-phase unlink.
The unlink() syscall removes the inode from the namespace
transactionally which means that even if you crash after the
directory fsync then they will never, ever appear in the directory
after recovery. In fact, recovery will see all those inodes as
existing on the unlinked lists and hence the execute the second
phase of the unlink during recovery.

I have no idea what the behaviour of other filesystems will
be but it needs to be evaluated on a filesystem-by-filesystem
basis.

Hmmmm. That introduces another interesting question w.r.t async
unlink in XFS - it's going to hammer the on-disk unlinked list hash
tables in XFS. The hashes are only 64 chains wide (per AG) so
pumping thousands of inodes into them is going to increase the
minimum memory footprint of XFS during rm -rf operations
substantially.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2009-01-12 07:55:38

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] async: Don't call async_synchronize_full_special() while holding sb_lock

On Mon, Jan 12, 2009 at 03:54:41AM +0000, Arjan van de Ven wrote:
> Jamie Lokier wrote:
>> Arjan van de Ven wrote:
>>>> - removing a million files and queuing all of the
>>>> deletes in the async queues....
>>> the async code throttles at 32k outstanding.
>>> Yes 32K is arbitrary, but if you delete a million files fast, all
>>> but the first few thousand are
>>> synchronous.
>>
>> Hmm.
>>
>> If I call unlink() a thousand times and then call fsync() on the
>> parent directories covering files I've unlinked... I expect the
>> deletes to be committed to disk when the last fsync() has returned. I
>> require that a crash and restart will not see the files. Several
>> kinds of transactional software and even some shell scripts expect this.
>>
>> Will these asynchronous deletes break the guaranteed
>> commit-of-the-delete provided by fsync() on the parent directory?
>
> 3 things:
> 1) removing the name from the directory and removing the data from disk are independent things.
> The former happens from unlink(), the later happens when the refcount hits 0 (eg no more openers nor
> any directory on disk referencing it). fsync() on a parent dir obviously only covers the first part,
> while only the 2nd part was made asynchronous.
> 2) with the right synchronization point in fsync, it will still work out

What scope does that synchronisation point have? I sincerely
hope you are not proposing to put a filesystem global
synchronisation point into fsync....

> 3) this code will be redone for 2.6.30; for 2.6.29 it is removed.

Can you tell use how you plan to redo this code and test it
adequately for 2.6.30?

Cheers,

Dave.
--
Dave Chinner
[email protected]