2009-07-04 21:49:20

by Roland Dreier

Subject: ext4 lockdep report re possible reclaim deadlock on jbd2_handle

Hi,

I recently got the following lockdep warning on my laptop. The kernel
is Ubuntu's tree with Linus's git up to d960eea9 pulled in; I don't
think there are any non-mainline ext4 changes involved, and the
warning below does look valid on mainline.

As far as I can tell, lockdep is warning that jbd2_handle is usually
acquired while doing reclaim -- which makes sense, as pushing dirty
inodes out of memory is of course going to want to do fs transactions;
but also there is the code path:

ext4_ext_truncate():
...
handle = ext4_journal_start(inode, err);
...
ext4_block_truncate_page(handle, mapping, inode->i_size);
...

where we have

ext4_block_truncate_page():
...
page = grab_cache_page(mapping, from >> PAGE_CACHE_SHIFT);

and grab_cache_page() expands as:

static inline struct page *grab_cache_page(struct address_space *mapping,
					   pgoff_t index)
{
	return find_or_create_page(mapping, index, mapping_gfp_mask(mapping));
}

and if find_or_create_page() doesn't find the page, it just does:

page = __page_cache_alloc(gfp_mask);

Now, I'm not much of a VM or FS expert, but I see no reason why
find_or_create_page() would never have to allocate a new page in the
ext4 usage, and I don't see any reason why mapping_gfp_mask() would
never include __GFP_FS; so it seems lockdep is correct that this
codepath could trigger reclaim and deadlock by calling back into ext4
and trying to start another transaction. (And indeed my system did
hit the page allocation path, although perhaps that is a bug
somewhere else.)
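
To spell out the __GFP_FS part of that: as far as I can tell, unless a
filesystem explicitly clears it, the per-mapping gfp mask is set up at
inode initialization time to GFP_HIGHUSER_MOVABLE, which does include
__GFP_FS. Roughly (paraphrasing from memory of include/linux/pagemap.h
and fs/inode.c, so the details may differ):

static inline gfp_t mapping_gfp_mask(struct address_space *mapping)
{
	/* the mask is whatever was last stored with mapping_set_gfp_mask() */
	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
}

and in inode_init_always():

	/* default mask for a new inode's mapping; GFP_HIGHUSER_MOVABLE
	 * includes __GFP_WAIT | __GFP_IO | __GFP_FS (among others), so
	 * __GFP_FS stays set unless the fs masks it off itself */
	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);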

It does seem a little odd to me that no one else has seen this,
since I thought quite a few people tested with lockdep enabled. So
maybe something is odd about my system or kernel config. Anyway, I'm
happy to try further tests and/or debugging patches.

Full lockdep output is below.

Thanks,
Roland

[20786.363249] =================================
[20786.363257] [ INFO: inconsistent lock state ]
[20786.363265] 2.6.31-2-generic #14~rbd4gitd960eea9
[20786.363270] ---------------------------------
[20786.363276] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
[20786.363285] http/8397 [HC0[0]:SC0[0]:HE1:SE1] takes:
[20786.363291] (jbd2_handle){+.+.?.}, at: [<ffffffff812008bb>] jbd2_journal_start+0xdb/0x150
[20786.363314] {IN-RECLAIM_FS-W} state was registered at:
[20786.363320] [<ffffffff8108bef6>] mark_irqflags+0xc6/0x1a0
[20786.363334] [<ffffffff8108d347>] __lock_acquire+0x287/0x430
[20786.363345] [<ffffffff8108d595>] lock_acquire+0xa5/0x150
[20786.363355] [<ffffffff812008da>] jbd2_journal_start+0xfa/0x150
[20786.363365] [<ffffffff811d98a8>] ext4_journal_start_sb+0x58/0x90
[20786.363377] [<ffffffff811cce85>] ext4_delete_inode+0xc5/0x2c0
[20786.363389] [<ffffffff81146fa3>] generic_delete_inode+0xd3/0x1a0
[20786.363401] [<ffffffff81147095>] generic_drop_inode+0x25/0x30
[20786.363411] [<ffffffff81145ce2>] iput+0x62/0x70
[20786.363420] [<ffffffff81142878>] dentry_iput+0x98/0x110
[20786.363429] [<ffffffff81142a00>] d_kill+0x50/0x80
[20786.363438] [<ffffffff811444c5>] dput+0x95/0x180
[20786.363447] [<ffffffff8120de4b>] ecryptfs_d_release+0x2b/0x70
[20786.363459] [<ffffffff81142978>] d_free+0x28/0x60
[20786.363468] [<ffffffff81142a18>] d_kill+0x68/0x80
[20786.363477] [<ffffffff81142ad3>] prune_one_dentry+0xa3/0xc0
[20786.363487] [<ffffffff81142d61>] __shrink_dcache_sb+0x271/0x290
[20786.363497] [<ffffffff81142e89>] prune_dcache+0x109/0x1b0
[20786.363506] [<ffffffff81142f6f>] shrink_dcache_memory+0x3f/0x50
[20786.363516] [<ffffffff810f6d3d>] shrink_slab+0x12d/0x190
[20786.363527] [<ffffffff810f97d7>] balance_pgdat+0x4d7/0x640
[20786.363537] [<ffffffff810f9a57>] kswapd+0x117/0x170
[20786.363546] [<ffffffff810773ce>] kthread+0x9e/0xb0
[20786.363558] [<ffffffff8101430a>] child_rip+0xa/0x20
[20786.363569] [<ffffffffffffffff>] 0xffffffffffffffff
[20786.363598] irq event stamp: 15997
[20786.363603] hardirqs last enabled at (15997): [<ffffffff81125f9d>] kmem_cache_alloc+0xfd/0x1a0
[20786.363617] hardirqs last disabled at (15996): [<ffffffff81125f01>] kmem_cache_alloc+0x61/0x1a0
[20786.363628] softirqs last enabled at (15966): [<ffffffff810631ea>] __do_softirq+0x14a/0x220
[20786.363641] softirqs last disabled at (15861): [<ffffffff8101440c>] call_softirq+0x1c/0x30
[20786.363651]
[20786.363653] other info that might help us debug this:
[20786.363660] 3 locks held by http/8397:
[20786.363665] #0: (&sb->s_type->i_mutex_key#8){+.+.+.}, at: [<ffffffff8112ed24>] do_truncate+0x64/0x90
[20786.363685] #1: (&sb->s_type->i_alloc_sem_key#5){+++++.}, at: [<ffffffff81147f90>] notify_change+0x250/0x350
[20786.363707] #2: (jbd2_handle){+.+.?.}, at: [<ffffffff812008bb>] jbd2_journal_start+0xdb/0x150
[20786.363724]
[20786.363726] stack backtrace:
[20786.363734] Pid: 8397, comm: http Tainted: G C 2.6.31-2-generic #14~rbd4gitd960eea9
[20786.363741] Call Trace:
[20786.363752] [<ffffffff8108ad7c>] print_usage_bug+0x18c/0x1a0
[20786.363763] [<ffffffff8108b0c0>] ? check_usage_backwards+0x0/0xb0
[20786.363773] [<ffffffff8108bad2>] mark_lock_irq+0xf2/0x280
[20786.363783] [<ffffffff8108bd97>] mark_lock+0x137/0x1d0
[20786.363793] [<ffffffff8108c03c>] mark_held_locks+0x6c/0xa0
[20786.363803] [<ffffffff8108c11f>] lockdep_trace_alloc+0xaf/0xe0
[20786.363813] [<ffffffff810efbac>] __alloc_pages_nodemask+0x7c/0x180
[20786.363824] [<ffffffff810e9411>] ? find_get_page+0x91/0xf0
[20786.363835] [<ffffffff8111d3b7>] alloc_pages_current+0x87/0xd0
[20786.363845] [<ffffffff810e9827>] __page_cache_alloc+0x67/0x70
[20786.363856] [<ffffffff810eb7df>] find_or_create_page+0x4f/0xb0
[20786.363867] [<ffffffff811cb3be>] ext4_block_truncate_page+0x3e/0x460
[20786.363876] [<ffffffff812008da>] ? jbd2_journal_start+0xfa/0x150
[20786.363885] [<ffffffff812008bb>] ? jbd2_journal_start+0xdb/0x150
[20786.363895] [<ffffffff811c6415>] ? ext4_meta_trans_blocks+0x75/0xf0
[20786.363905] [<ffffffff811e8d8b>] ext4_ext_truncate+0x1bb/0x1e0
[20786.363916] [<ffffffff811072c5>] ? unmap_mapping_range+0x75/0x290
[20786.363926] [<ffffffff811ccc28>] ext4_truncate+0x498/0x630
[20786.363938] [<ffffffff8129b4ce>] ? _raw_spin_unlock+0x5e/0xb0
[20786.363947] [<ffffffff81107306>] ? unmap_mapping_range+0xb6/0x290
[20786.363957] [<ffffffff8108c3ad>] ? trace_hardirqs_on+0xd/0x10
[20786.363966] [<ffffffff811ffe58>] ? jbd2_journal_stop+0x1f8/0x2e0
[20786.363976] [<ffffffff81107690>] vmtruncate+0xb0/0x110
[20786.363986] [<ffffffff81147c05>] inode_setattr+0x35/0x170
[20786.363995] [<ffffffff811c9906>] ext4_setattr+0x186/0x370
[20786.364005] [<ffffffff81147eab>] notify_change+0x16b/0x350
[20786.364014] [<ffffffff8112ed30>] do_truncate+0x70/0x90
[20786.364021] [<ffffffff8112f48b>] T.657+0xeb/0x110
[20786.364021] [<ffffffff8112f4be>] sys_ftruncate+0xe/0x10
[20786.364021] [<ffffffff81013132>] system_call_fastpath+0x16/0x1b


2009-07-05 12:44:14

by Theodore Ts'o

Subject: Re: ext4 lockdep report re possible reclaim deadlock on jbd2_handle

On Sat, Jul 04, 2009 at 02:49:06PM -0700, Roland Dreier wrote:
>
> I recently got the following lockdep warning on my laptop. The kernel
> is Ubuntu's tree with Linus's git up to d960eea9 pulled in; I don't
> think there are any non-mainline ext4 changes involved, and the
> warning below does look valid on mainline.

Yeah, it looks valid. And thanks. This looks like it might be
related to a long-standing bug which has puzzled me and a lot of
other people, Ubuntu Launchpad #330824:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/330824

> It does seem a little odd to me that no one else has seen this,
> since I thought quite a few people tested with lockdep enabled. So
> maybe something is odd about my system or kernel config. Anyway, I'm
> happy to try further tests and/or debugging patches.

The funny thing has been that many people using Ubuntu kernels haven't
been able to replicate it (myself and a number of Canonical kernel
people included), and people who *can* reproduce it report that it
goes away the moment they go use a stock mainline kernel ---
regardless of whether it's 2.6.28, 2.6.29, 2.6.30, or any number of
other -rc kernels.

The other funny thing is that if the lockdep warning you've reported
is also what's behind Ubuntu Launchpad #330824, it doesn't quite make
sense, since there the problem shows up when deleting a large number
of files, and in that case we should be truncating the file down to
zero --- so (inode->i_size & (sb->s_blocksize - 1)) should evaluate to
zero, and ext4_block_truncate_page() shouldn't be getting called in
that case.
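
That is, as far as I remember the code, the only call to
ext4_block_truncate_page() in the extents truncate path is guarded by
the partial-block check -- roughly, from fs/ext4/extents.c,
ext4_ext_truncate(), quoting from memory so the surrounding context
may differ:

	handle = ext4_journal_start(inode, err);
	if (IS_ERR(handle))
		return;

	/* only a truncate to a size that isn't block-aligned needs to
	 * zero out the tail of a partial block; truncate-to-zero skips
	 * this entirely */
	if (inode->i_size & (sb->s_blocksize - 1))
		ext4_block_truncate_page(handle, mapping, inode->i_size);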

> As far as I can tell, lockdep is warning that jbd2_handle is usually
> acquired while doing reclaim -- which makes sense, as pushing dirty
> inodes out of memory is of course going to want to do fs transactions;

Actually, it's not that common that we would be pushing dirty inodes
out of memory during a reclaim, since normally the transaction is
associated with the foreground kernel syscall that modified the inode
in the first place (i.e., chmod, rename, utimes, etc.). And when an
inode gets deleted, the filesystem transaction either happens
immediately, as a result of the unlink call, or if there is a file
descriptor holding it open, in the close() system call that releases
the file descriptor.

What seems to be happening here is that ecryptfs is holding the file
open, because of the upper dentry reference. So that's not a normal
thing, and maybe that's why most people haven't noticed the problem;
they're not doing enough with ecryptfs to trigger things.


How easily can you reproduce the lockdep warning? Does this patch
(not tested; sorry, am in the Berkshires for the July 4th holiday)
make it go away?

- Ted

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 60a26f3..9760ba0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3583,7 +3583,8 @@ int ext4_block_truncate_page(handle_t *handle,
 	struct page *page;
 	int err = 0;
 
-	page = grab_cache_page(mapping, from >> PAGE_CACHE_SHIFT);
+	page = find_or_create_page(mapping, from >> PAGE_CACHE_SHIFT,
+				   mapping_gfp_mask(mapping) & ~__GFP_FS);
 	if (!page)
 		return -EINVAL;


2009-07-05 19:10:10

by Roland Dreier

Subject: Re: ext4 lockdep report re possible reclaim deadlock on jbd2_handle

> What seems to be happening here is that ecryptfs is holding the file
> open, because of the upper dentry reference. So that's not a normal
> thing, and maybe that's why most people haven't noticed the problem;
> they're not doing enough with ecryptfs to trigger things.

Makes sense -- yes, I should have mentioned that I am using the Ubuntu
"$HOME on ecryptfs" feature; and also I had to fix 3 lockdep warnings
with ecryptfs before I hit this one. Still haven't caught the issue
with my own module code that I'm actually trying to debug :)

> How easily can you reproduce the lockdep warning? Does this patch
> (not tested; sorry, am in the Berkshires for the July 4th holiday)
> make it go away?

It's not very reproducible, I don't think; although I've just been
running with lockdep for about a week or so, and as I said I had to
work through 3 ecryptfs false positives before I got to this. I think
I triggered this while building a new kernel, and I haven't done that
since, so it may be reproducible, but I'm not sure.

In any case, I'd be surprised if:

-	page = grab_cache_page(mapping, from >> PAGE_CACHE_SHIFT);
+	page = find_or_create_page(mapping, from >> PAGE_CACHE_SHIFT,
+				   mapping_gfp_mask(mapping) & ~__GFP_FS);

doesn't get rid of the warning, and it looks like a good fix, although
I wonder whether for a real upstream patch we'd want to add a new
grab_cache_page()-like wrapper that does the "& ~__GFP_FS" --
grab_cache_page_nowait() isn't quite what's wanted here, I don't
think, but something like the sketch below.
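
Just as a sketch of what I mean -- grab_cache_page_nofs() is a name
I'm making up here, not an existing pagemap API:

/* like grab_cache_page(), but with __GFP_FS masked out so that a cache
 * miss can't recurse back into the filesystem via reclaim while the
 * caller is holding a transaction handle */
static inline struct page *grab_cache_page_nofs(struct address_space *mapping,
						pgoff_t index)
{
	return find_or_create_page(mapping, index,
				   mapping_gfp_mask(mapping) & ~__GFP_FS);
}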

Anyway, I'll add this patch to my kernel and let you know if anything
ext4-related pops up.

Thanks,
Roland
--
Roland Dreier <[email protected]> GPG Key: 1024D/E0EEFAC0
Fingerprint: A89F B5E9 C185 F34D BD50 4009 37E2 25CC E0EE FAC0

Sending >500KB attachments is forbidden by the Geneva Convention.
Your country may be at risk if you fail to comply.