Hi,
I managed to trigger:
[ 1015.776029] kernel BUG at mm/list_lru.c:92!
[ 1015.776029] invalid opcode: 0000 [#1] SMP
[ 1015.776029] Modules linked in: edd nfsv3 nfs_acl nfs fscache lockd sunrpc af_packet bridge stp llc cpufreq_conservative cpufreq_userspace cpufreq_powersave powernow_k8 fuse loop dm_mod ohci_pci ohci_hcd ehci_hcd usbcore e1000 kvm_amd kvm tg3 usb_common ptp pps_core sg shpchp edac_core pci_hotplug sr_mod k8temp i2c_amd8111 i2c_amd756 amd_rng edac_mce_amd button serio_raw cdrom pcspkr processor thermal_sys scsi_dh_emc scsi_dh_rdac scsi_dh_hp_sw scsi_dh ata_generic sata_sil pata_amd
[ 1015.776029] CPU: 5 PID: 10480 Comm: cc1 Not tainted 3.10.0-rc4-next-20130607nextbadpagefix+ #1
[ 1015.776029] Hardware name: AMD A8440/WARTHOG, BIOS PW2A00-5 09/23/2005
[ 1015.776029] task: ffff8800327fc240 ti: ffff88003a59a000 task.ti: ffff88003a59a000
[ 1015.776029] RIP: 0010:[<ffffffff81122d9c>] [<ffffffff81122d9c>] list_lru_walk_node+0x10c/0x140
[ 1015.776029] RSP: 0018:ffff88003a59b7a8 EFLAGS: 00010286
[ 1015.776029] RAX: ffffffffffffffff RBX: ffff880002f7ae80 RCX: ffff880002f7ae80
[ 1015.776029] RDX: 0000000000000000 RSI: ffff8800370dacc0 RDI: ffff880002f7ad88
[ 1015.776029] RBP: ffff88003a59b808 R08: 0000000000000000 R09: ffff88001ffeafc0
[ 1015.776029] R10: 0000000000000002 R11: 0000000000000000 R12: ffff8800370dacc0
[ 1015.776029] R13: 0000000000000227 R14: ffff880002fb6850 R15: ffff8800370dacc8
[ 1015.776029] FS: 00002aaaaaada600(0000) GS:ffff88001f300000(0000) knlGS:0000000000000000
[ 1015.776029] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1015.776029] CR2: 00000000025aac5c CR3: 000000001cf9d000 CR4: 00000000000007e0
[ 1015.776029] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1015.776029] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1015.776029] Stack:
[ 1015.776029] ffff8800151c6440 ffff88003a59b820 ffff88003a59b828 ffffffff8117e4a0
[ 1015.776029] 000000008117e6e1 ffff88003e174c48 00ff88003a59b828 ffff88003a59b828
[ 1015.776029] 000000000000021f ffff88003e174800 ffff88003a59b9f8 0000000000000220
[ 1015.776029] Call Trace:
[ 1015.776029] [<ffffffff8117e4a0>] ? insert_inode_locked+0x160/0x160
[ 1015.776029] [<ffffffff8117e74c>] prune_icache_sb+0x3c/0x60
[ 1015.776029] [<ffffffff81167a2e>] super_cache_scan+0x12e/0x1b0
[ 1015.776029] [<ffffffff8110f3da>] shrink_slab_node+0x13a/0x250
[ 1015.776029] [<ffffffff8111256b>] shrink_slab+0xab/0x120
[ 1015.776029] [<ffffffff81113784>] do_try_to_free_pages+0x264/0x360
[ 1015.776029] [<ffffffff81113bd0>] try_to_free_pages+0x130/0x180
[ 1015.776029] [<ffffffff81106fce>] __alloc_pages_slowpath+0x39e/0x790
[ 1015.776029] [<ffffffff811075ba>] __alloc_pages_nodemask+0x1fa/0x210
[ 1015.776029] [<ffffffff81147470>] alloc_pages_vma+0xa0/0x120
[ 1015.776029] [<ffffffff81124c33>] do_anonymous_page+0x133/0x300
[ 1015.776029] [<ffffffff8112a10d>] handle_pte_fault+0x22d/0x240
[ 1015.776029] [<ffffffff81122f58>] ? list_lru_add+0x68/0xe0
[ 1015.776029] [<ffffffff8112a3e3>] handle_mm_fault+0x2c3/0x3e0
[ 1015.776029] [<ffffffff815a4997>] __do_page_fault+0x227/0x4e0
[ 1015.776029] [<ffffffff81002930>] ? do_notify_resume+0x90/0x1d0
[ 1015.776029] [<ffffffff81163d18>] ? fsnotify_access+0x68/0x80
[ 1015.776029] [<ffffffff811660a4>] ? file_sb_list_del+0x44/0x50
[ 1015.776029] [<ffffffff81060b05>] ? task_work_add+0x55/0x70
[ 1015.776029] [<ffffffff81166214>] ? fput+0x74/0xd0
[ 1015.776029] [<ffffffff815a4c59>] do_page_fault+0x9/0x10
[ 1015.776029] [<ffffffff815a1632>] page_fault+0x22/0x30
[ 1015.776029] Code: b3 66 0f 1f 44 00 00 48 8b 03 48 8b 53 08 48 89 50 08 48 89 02 49 8b 44 24 10 49 89 5c 24 10 4c 89 3b 48 89 43 08 48 89 18 eb 89 <0f> 0b eb fe 8b 55 c4 48 8b 45 c8 f0 0f b3 10 e9 69 ff ff ff 66
[ 1015.776029] RIP [<ffffffff81122d9c>] list_lru_walk_node+0x10c/0x140
with Linux next (next-20130607) with https://lkml.org/lkml/2013/6/17/203
on top.
This is obviously BUG_ON(nlru->nr_items < 0) and
ffffffff81122d0b: 48 85 c0 test %rax,%rax
ffffffff81122d0e: 49 89 44 24 18 mov %rax,0x18(%r12)
ffffffff81122d13: 0f 84 87 00 00 00 je ffffffff81122da0 <list_lru_walk_node+0x110>
ffffffff81122d19: 49 83 7c 24 18 00 cmpq $0x0,0x18(%r12)
ffffffff81122d1f: 78 7b js ffffffff81122d9c <list_lru_walk_node+0x10c>
[...]
ffffffff81122d9c: 0f 0b ud2
RAX is -1UL.
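For context, the check that fires is the LRU_REMOVED leg of list_lru_walk_node();
roughly (paraphrased from mm/list_lru.c in this tree, not verbatim):

	case LRU_REMOVED:
		if (--nlru->nr_items == 0)
			node_clear(nid, lru->active_nodes);
		BUG_ON(nlru->nr_items < 0);	/* mm/list_lru.c:92 */
		break;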
I assume that the current backtrace is of no use and it would most
probably be some shrinker which doesn't behave.
Any idea how to pin this down?
Thanks!
--
Michal Hocko
SUSE Labs
On Mon, Jun 17, 2013 at 04:18:22PM +0200, Michal Hocko wrote:
> Hi,
Hi,
> I managed to trigger:
> [ 1015.776029] kernel BUG at mm/list_lru.c:92!
> [ 1015.776029] invalid opcode: 0000 [#1] SMP
> with Linux next (next-20130607) with https://lkml.org/lkml/2013/6/17/203
> on top.
>
> This is obviously BUG_ON(nlru->nr_items < 0) and
> ffffffff81122d0b: 48 85 c0 test %rax,%rax
> ffffffff81122d0e: 49 89 44 24 18 mov %rax,0x18(%r12)
> ffffffff81122d13: 0f 84 87 00 00 00 je ffffffff81122da0 <list_lru_walk_node+0x110>
> ffffffff81122d19: 49 83 7c 24 18 00 cmpq $0x0,0x18(%r12)
> ffffffff81122d1f: 78 7b js ffffffff81122d9c <list_lru_walk_node+0x10c>
> [...]
> ffffffff81122d9c: 0f 0b ud2
>
> RAX is -1UL.
Yes, fearing exactly this kind of imbalance, we decided to keep the counter as a signed quantity
and BUG on underflow, instead of using an unsigned quantity.
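(For reference, the counter in struct list_lru_node is a plain signed long exactly so
that an underflow can be caught; paraphrased from include/linux/list_lru.h, not verbatim:)

	struct list_lru_node {
		spinlock_t		lock;
		struct list_head	list;
		/* kept signed so that an imbalance shows up as < 0 */
		long			nr_items;
	};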
>
> I assume that the current backtrace is of no use and it would most
> probably be some shrinker which doesn't behave.
>
There are currently 3 users of list_lru in the tree: dentries, inodes and xfs.
Assuming you are not using xfs, we are left with dentries and inodes.
The first thing to do is to find which one of them is misbehaving. You can try
to find this out from the address of the list_lru and where it lies in the superblock.
Once we know which of them is misbehaving, we'll have to figure out why.
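To tell them apart without killing the box, something like the following completely
untested sketch in list_lru_walk_node() would report the lru address instead of BUGing;
the printed address can then be matched against &sb->s_inode_lru / &sb->s_dentry_lru of
the superblock:

	/* untested debugging sketch, not a fix */
	if (WARN_ONCE(nlru->nr_items < 0,
		      "list_lru %p (node %d) went negative: %ld\n",
		      lru, nid, nlru->nr_items))
		nlru->nr_items = 0;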
Any special filesystem workload?
On Mon 17-06-13 19:14:12, Glauber Costa wrote:
> On Mon, Jun 17, 2013 at 04:18:22PM +0200, Michal Hocko wrote:
> > Hi,
>
> Hi,
>
> > I managed to trigger:
> > [ 1015.776029] kernel BUG at mm/list_lru.c:92!
> > [ 1015.776029] invalid opcode: 0000 [#1] SMP
> > with Linux next (next-20130607) with https://lkml.org/lkml/2013/6/17/203
> > on top.
> >
> > This is obviously BUG_ON(nlru->nr_items < 0) and
> > ffffffff81122d0b: 48 85 c0 test %rax,%rax
> > ffffffff81122d0e: 49 89 44 24 18 mov %rax,0x18(%r12)
> > ffffffff81122d13: 0f 84 87 00 00 00 je ffffffff81122da0 <list_lru_walk_node+0x110>
> > ffffffff81122d19: 49 83 7c 24 18 00 cmpq $0x0,0x18(%r12)
> > ffffffff81122d1f: 78 7b js ffffffff81122d9c <list_lru_walk_node+0x10c>
> > [...]
> > ffffffff81122d9c: 0f 0b ud2
> >
> > RAX is -1UL.
>
> Yes, fearing those kind of imbalances, we decided to leave the counter
> as a signed quantity and BUG, instead of an unsigned quantity.
>
> > I assume that the current backtrace is of no use and it would most
> > probably be some shrinker which doesn't behave.
> >
> There are currently 3 users of list_lru in tree: dentries, inodes and xfs.
> Assuming you are not using xfs, we are left with dentries and inodes.
>
> The first thing to do is to find which one of them is misbehaving. You
> can try finding this out by the address of the list_lru, and where it
> lays in the superblock.
I am not sure I understand. Care to prepare a debugging patch for me?
> Once we know each of them is misbehaving, then we'll have to figure
> out why.
>
> Any special filesystem workload ?
This is two parallel kernel builds with separate kernel trees, running
under 2 hard-unlimited groups (with a 0 soft limit), followed by rm -rf
of the source trees + drop caches. Sometimes I have to repeat this multiple
times. I can also see some timer-specific crashes which are most
probably not related, so I am getting back to my -mm tree and hoping
that tree is healthy.
I have seen some other traces as well (mentioning ext3 dput paths) but I
cannot reproduce them anymore.
Thanks!
--
Michal Hocko
SUSE Labs
On Mon, Jun 17, 2013 at 05:33:02PM +0200, Michal Hocko wrote:
> On Mon 17-06-13 19:14:12, Glauber Costa wrote:
> > On Mon, Jun 17, 2013 at 04:18:22PM +0200, Michal Hocko wrote:
> > > Hi,
> >
> > Hi,
> >
> > > I managed to trigger:
> > > [ 1015.776029] kernel BUG at mm/list_lru.c:92!
> > > [ 1015.776029] invalid opcode: 0000 [#1] SMP
> > > with Linux next (next-20130607) with https://lkml.org/lkml/2013/6/17/203
> > > on top.
> > >
> > > This is obviously BUG_ON(nlru->nr_items < 0) and
> > > ffffffff81122d0b: 48 85 c0 test %rax,%rax
> > > ffffffff81122d0e: 49 89 44 24 18 mov %rax,0x18(%r12)
> > > ffffffff81122d13: 0f 84 87 00 00 00 je ffffffff81122da0 <list_lru_walk_node+0x110>
> > > ffffffff81122d19: 49 83 7c 24 18 00 cmpq $0x0,0x18(%r12)
> > > ffffffff81122d1f: 78 7b js ffffffff81122d9c <list_lru_walk_node+0x10c>
> > > [...]
> > > ffffffff81122d9c: 0f 0b ud2
> > >
> > > RAX is -1UL.
> >
> > Yes, fearing those kind of imbalances, we decided to leave the counter
> > as a signed quantity and BUG, instead of an unsigned quantity.
> >
> > > I assume that the current backtrace is of no use and it would most
> > > probably be some shrinker which doesn't behave.
> > >
> > There are currently 3 users of list_lru in tree: dentries, inodes and xfs.
> > Assuming you are not using xfs, we are left with dentries and inodes.
> >
> > The first thing to do is to find which one of them is misbehaving. You
> > can try finding this out by the address of the list_lru, and where it
> > lays in the superblock.
>
> I am not sure I understand. Care to prepare a debugging patch for me?
>
> > Once we know each of them is misbehaving, then we'll have to figure
> > out why.
> >
> > Any special filesystem workload ?
>
> This is two parallel kernel builds with separate kernel trees running
> under 2 hard unlimitted groups (with 0 soft limit) followed by rm -rf
> source trees + drop caches. Sometimes I have to repeat this multiple
> times. I can also see some timer specific crashes which are most
> probably not related so I am getting back to my mm tree and will hope
> the tree is healthy.
>
> I have seen some other traces as well (mentioning ext3 dput paths) but I
> cannot reproduce them anymore.
>
Do you have those traces? If there is a bug in the ext3 dput path, then it is
most likely the culprit. dput() is where we insert things into the LRU. So
if we fail to fully insert an element that we should have - and later
try to remove it - we'll go negative.
Can we see those traces?
On Mon, 17 Jun 2013 19:14:12 +0400 Glauber Costa <[email protected]> wrote:
> > I managed to trigger:
> > [ 1015.776029] kernel BUG at mm/list_lru.c:92!
> > [ 1015.776029] invalid opcode: 0000 [#1] SMP
> > with Linux next (next-20130607) with https://lkml.org/lkml/2013/6/17/203
> > on top.
> >
> > This is obviously BUG_ON(nlru->nr_items < 0) and
> > ffffffff81122d0b: 48 85 c0 test %rax,%rax
> > ffffffff81122d0e: 49 89 44 24 18 mov %rax,0x18(%r12)
> > ffffffff81122d13: 0f 84 87 00 00 00 je ffffffff81122da0 <list_lru_walk_node+0x110>
> > ffffffff81122d19: 49 83 7c 24 18 00 cmpq $0x0,0x18(%r12)
> > ffffffff81122d1f: 78 7b js ffffffff81122d9c <list_lru_walk_node+0x10c>
> > [...]
> > ffffffff81122d9c: 0f 0b ud2
> >
> > RAX is -1UL.
> Yes, fearing those kind of imbalances, we decided to leave the counter as a signed quantity
> and BUG, instead of an unsigned quantity.
>
> >
> > I assume that the current backtrace is of no use and it would most
> > probably be some shrinker which doesn't behave.
> >
> There are currently 3 users of list_lru in tree: dentries, inodes and xfs.
> Assuming you are not using xfs, we are left with dentries and inodes.
>
> The first thing to do is to find which one of them is misbehaving. You can try finding
> this out by the address of the list_lru, and where it lays in the superblock.
>
> Once we know each of them is misbehaving, then we'll have to figure out why.
The trace says shrink_slab_node->super_cache_scan->prune_icache_sb. So
it's inodes?
diff --git a/fs/inode.c b/fs/inode.c
index 00b804e..c46c92e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -419,6 +419,8 @@ void inode_add_lru(struct inode *inode)
static void inode_lru_list_del(struct inode *inode)
{
+ if (inode->i_state & I_FREEING)
+ return;
if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
this_cpu_dec(nr_unused);
@@ -1381,9 +1383,8 @@ static void iput_final(struct inode *inode)
inode->i_state &= ~I_WILL_FREE;
}
+ inode_lru_list_del(inode);
inode->i_state |= I_FREEING;
- if (!list_empty(&inode->i_lru))
- inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
evict(inode);
On Tue, Jun 18, 2013 at 02:30:05AM +0400, Glauber Costa wrote:
> On Mon, Jun 17, 2013 at 02:35:08PM -0700, Andrew Morton wrote:
> > On Mon, 17 Jun 2013 19:14:12 +0400 Glauber Costa <[email protected]> wrote:
> >
> > > > I managed to trigger:
> > > > [ 1015.776029] kernel BUG at mm/list_lru.c:92!
> > > > [ 1015.776029] invalid opcode: 0000 [#1] SMP
> > > > with Linux next (next-20130607) with https://lkml.org/lkml/2013/6/17/203
> > > > on top.
> > > >
> > > > This is obviously BUG_ON(nlru->nr_items < 0) and
> > > > ffffffff81122d0b: 48 85 c0 test %rax,%rax
> > > > ffffffff81122d0e: 49 89 44 24 18 mov %rax,0x18(%r12)
> > > > ffffffff81122d13: 0f 84 87 00 00 00 je ffffffff81122da0 <list_lru_walk_node+0x110>
> > > > ffffffff81122d19: 49 83 7c 24 18 00 cmpq $0x0,0x18(%r12)
> > > > ffffffff81122d1f: 78 7b js ffffffff81122d9c <list_lru_walk_node+0x10c>
> > > > [...]
> > > > ffffffff81122d9c: 0f 0b ud2
> > > >
> > > > RAX is -1UL.
> > > Yes, fearing those kind of imbalances, we decided to leave the counter as a signed quantity
> > > and BUG, instead of an unsigned quantity.
> > >
> > > >
> > > > I assume that the current backtrace is of no use and it would most
> > > > probably be some shrinker which doesn't behave.
> > > >
> > > There are currently 3 users of list_lru in tree: dentries, inodes and xfs.
> > > Assuming you are not using xfs, we are left with dentries and inodes.
> > >
> > > The first thing to do is to find which one of them is misbehaving. You can try finding
> > > this out by the address of the list_lru, and where it lays in the superblock.
> > >
> > > Once we know each of them is misbehaving, then we'll have to figure out why.
> >
> > The trace says shrink_slab_node->super_cache_scan->prune_icache_sb. So
> > it's inodes?
> >
> Assuming there is no memory corruption of any sort going on , let's check the code.
> nr_item is only manipulated in 3 places:
>
> 1) list_lru_add, where it is increased
> 2) list_lru_del, where it is decreased in case the user have voluntarily removed the
> element from the list
> 3) list_lru_walk_node, where an element is removing during shrink.
>
> All three excerpts seem to be correctly locked, so something like this indicates an imbalance.
inode_lru_isolate() looks suspicious to me:
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
list_move(&inode->i_lru, freeable);
this_cpu_dec(nr_unused);
return LRU_REMOVED;
}
All the other cases where I_FREEING is set and the inode is removed
from the LRU are completely done under the inode->i_lock. i.e. from
an external POV, the state change to I_FREEING and removal from LRU
are supposed to be atomic, but they are not here.
I'm not sure this is the source of the problem, but it definitely
needs fixing.
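i.e. the list_move() would need to be done before the i_lock is dropped, something
like this (untested):

		WARN_ON(inode->i_state & I_NEW);
		inode->i_state |= I_FREEING;
		list_move(&inode->i_lru, freeable);	/* now under i_lock */
		spin_unlock(&inode->i_lock);

		this_cpu_dec(nr_unused);
		return LRU_REMOVED;
	}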
> callers:
> iput_final, evict_inodes, invalidate_inodes.
> Both evict_inodes and invalidate_inodes will do the following pattern:
>
> inode->i_state |= I_FREEING;
> inode_lru_list_del(inode);
> spin_unlock(&inode->i_lock);
> list_add(&inode->i_lru, &dispose);
>
> IOW, they will remove the element from the LRU, and add it to the dispose list.
> Both of them will also bail out if they see I_FREEING already set, so they are safe
> against each other - because the flag is manipulated inside the lock.
>
> But how about iput_final? It seems to me that if we are calling iput_final at the
> same time as the other two, this *could* happen (maybe there is some extra protection
> that can be seen from Australia but not from here. Dave?)
If I_FREEING is set before we enter iput_final(), then something
else is screwed up. I_FREEING is only set once the last reference
has gone away and we are killing the inode. All the other callers
that set I_FREEING check that the reference count on the inode is
zero before they set I_FREEING. Hence I_FREEING cannot be set on the
transition of i_count from 1 to 0 when iput_final() is called. So
the patch won't do anything to avoid the problem being seen.
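e.g. evict_inodes() does roughly (paraphrased, not verbatim):

	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
		if (atomic_read(&inode->i_count))	/* still referenced */
			continue;

		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
			spin_unlock(&inode->i_lock);
			continue;
		}

		inode->i_state |= I_FREEING;
		...
	}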
Keep in mind that this is actually a new warning on the count of
inodes on the LRU - we never had a check that it didn't go negative
before....
Cheers,
Dave.
--
Dave Chinner
[email protected]
diff --git a/fs/inode.c b/fs/inode.c
index 00b804e..48eafa6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -419,6 +419,8 @@ void inode_add_lru(struct inode *inode)
static void inode_lru_list_del(struct inode *inode)
{
+ if (inode->i_state & I_FREEING)
+ return;
if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
this_cpu_dec(nr_unused);
@@ -609,8 +611,8 @@ void evict_inodes(struct super_block *sb)
continue;
}
- inode->i_state |= I_FREEING;
inode_lru_list_del(inode);
+ inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
list_add(&inode->i_lru, &dispose);
}
@@ -653,8 +655,8 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
continue;
}
- inode->i_state |= I_FREEING;
inode_lru_list_del(inode);
+ inode->i_state |= I_FREEING;
spin_unlock(&inode->i_lock);
list_add(&inode->i_lru, &dispose);
}
@@ -1381,9 +1383,8 @@ static void iput_final(struct inode *inode)
inode->i_state &= ~I_WILL_FREE;
}
+ inode_lru_list_del(inode);
inode->i_state |= I_FREEING;
- if (!list_empty(&inode->i_lru))
- inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
evict(inode);
On Tue, Jun 18, 2013 at 12:46:23PM +1000, Dave Chinner wrote:
> On Tue, Jun 18, 2013 at 02:30:05AM +0400, Glauber Costa wrote:
> > On Mon, Jun 17, 2013 at 02:35:08PM -0700, Andrew Morton wrote:
> > > On Mon, 17 Jun 2013 19:14:12 +0400 Glauber Costa <[email protected]> wrote:
> > >
> > > > > I managed to trigger:
> > > > > [ 1015.776029] kernel BUG at mm/list_lru.c:92!
> > > > > [ 1015.776029] invalid opcode: 0000 [#1] SMP
> > > > > with Linux next (next-20130607) with https://lkml.org/lkml/2013/6/17/203
> > > > > on top.
> > > > >
> > > > > This is obviously BUG_ON(nlru->nr_items < 0) and
> > > > > ffffffff81122d0b: 48 85 c0 test %rax,%rax
> > > > > ffffffff81122d0e: 49 89 44 24 18 mov %rax,0x18(%r12)
> > > > > ffffffff81122d13: 0f 84 87 00 00 00 je ffffffff81122da0 <list_lru_walk_node+0x110>
> > > > > ffffffff81122d19: 49 83 7c 24 18 00 cmpq $0x0,0x18(%r12)
> > > > > ffffffff81122d1f: 78 7b js ffffffff81122d9c <list_lru_walk_node+0x10c>
> > > > > [...]
> > > > > ffffffff81122d9c: 0f 0b ud2
> > > > >
> > > > > RAX is -1UL.
> > > > Yes, fearing those kind of imbalances, we decided to leave the counter as a signed quantity
> > > > and BUG, instead of an unsigned quantity.
> > > >
> > > > >
> > > > > I assume that the current backtrace is of no use and it would most
> > > > > probably be some shrinker which doesn't behave.
> > > > >
> > > > There are currently 3 users of list_lru in tree: dentries, inodes and xfs.
> > > > Assuming you are not using xfs, we are left with dentries and inodes.
> > > >
> > > > The first thing to do is to find which one of them is misbehaving. You can try finding
> > > > this out by the address of the list_lru, and where it lays in the superblock.
> > > >
> > > > Once we know each of them is misbehaving, then we'll have to figure out why.
> > >
> > > The trace says shrink_slab_node->super_cache_scan->prune_icache_sb. So
> > > it's inodes?
> > >
> > Assuming there is no memory corruption of any sort going on , let's check the code.
> > nr_item is only manipulated in 3 places:
> >
> > 1) list_lru_add, where it is increased
> > 2) list_lru_del, where it is decreased in case the user have voluntarily removed the
> > element from the list
> > 3) list_lru_walk_node, where an element is removing during shrink.
> >
> > All three excerpts seem to be correctly locked, so something like this indicates an imbalance.
>
> inode_lru_isolate() looks suspicious to me:
>
> WARN_ON(inode->i_state & I_NEW);
> inode->i_state |= I_FREEING;
> spin_unlock(&inode->i_lock);
>
> list_move(&inode->i_lru, freeable);
> this_cpu_dec(nr_unused);
> return LRU_REMOVED;
> }
>
> All the other cases where I_FREEING is set and the inode is removed
> from the LRU are completely done under the inode->i_lock. i.e. from
> an external POV, the state change to I_FREEING and removal from LRU
> are supposed to be atomic, but they are not here.
>
> I'm not sure this is the source of the problem, but it definitely
> needs fixing.
>
Yes, I missed that yesterday, but it does look suspicious to me as well.
Michal, could you manually move this one inside the lock as well and see
if it fixes your problem? Otherwise I can send you a patch
so we don't get lost on what is patched and what is not.
Let us at least know whether this is the problem.
> > callers:
> > iput_final, evict_inodes, invalidate_inodes.
> > Both evict_inodes and invalidate_inodes will do the following pattern:
> >
> > inode->i_state |= I_FREEING;
> > inode_lru_list_del(inode);
> > spin_unlock(&inode->i_lock);
> > list_add(&inode->i_lru, &dispose);
> >
> > IOW, they will remove the element from the LRU, and add it to the dispose list.
> > Both of them will also bail out if they see I_FREEING already set, so they are safe
> > against each other - because the flag is manipulated inside the lock.
> >
> > But how about iput_final? It seems to me that if we are calling iput_final at the
> > same time as the other two, this *could* happen (maybe there is some extra protection
> > that can be seen from Australia but not from here. Dave?)
>
> If I_FREEING is set before we enter iput_final(), then something
> else is screwed up. I_FREEING is only set once the last reference
> has gone away and we are killing the inode. All the other callers
> that set I_FREEING check that the reference count on the inode is
> zero before they set I_FREEING. Hence I_FREEING cannot be set on the
> transition of i_count from 1 to 0 when iput_final() is called. So
> the patch won't do anything to avoid the problem being seen.
>
Yes, but aren't things like evict_inodes and invalidate_inodes called at
umount time, for instance? Can't it be that we drop the last reference
to a valid, in-use inode while someone else is invalidating them all?
> Keep in mind that we this is actually a new warning on the count of
> inodes on the LRU - we never had a check that it didn't go negative
> before....
>
On Mon 17-06-13 20:54:10, Glauber Costa wrote:
> On Mon, Jun 17, 2013 at 05:33:02PM +0200, Michal Hocko wrote:
[...]
> > I have seen some other traces as well (mentioning ext3 dput paths) but I
> > cannot reproduce them anymore.
> >
>
> Do you have those traces? If there is a bug in the ext3 dput, then it is
> most likely the culprit. dput() is when we insert things into the LRU. So
> if we are not fully inserting an element that we should have - and later
> on try to remove it, we'll go negative.
>
> Can we see those traces?
Unfortunately I don't, because the machine where I saw those didn't have
a serial console and the traces were scrolling like crazy. Anyway, I am
working on reproducing this. Linux next is hard to debug due to
unrelated crashes, so I am still on my -mm git tree.
Anyway, I was able to reproduce one of those hangs, which smells like the
same or a similar issue:
4659 pts/0 S+ 0:00 /bin/sh ./run_batch.sh mmotm
4661 pts/0 S+ 0:00 /bin/bash ./start.sh
4666 pts/0 S+ 5:08 /bin/bash ./start.sh
18294 pts/0 S+ 0:00 sleep 1s
4682 pts/0 S+ 0:00 /bin/bash ./run_test.sh /dev/cgroup B 2
4683 pts/0 S+ 5:16 /bin/bash ./run_test.sh /dev/cgroup B 2
18293 pts/0 S+ 0:00 sleep 1s
8509 pts/0 S+ 0:00 /usr/bin/time -v make -j4 vmlinux
8510 pts/0 S+ 0:00 make -j4 vmlinux
11730 pts/0 S+ 0:00 make -f scripts/Makefile.build obj=drivers
13135 pts/0 S+ 0:00 make -f scripts/Makefile.build obj=drivers/net
13415 pts/0 S+ 0:00 make -f scripts/Makefile.build obj=drivers/net/wireless
13657 pts/0 S+ 0:00 make -f scripts/Makefile.build obj=drivers/net/wireless/rtl818x
13665 pts/0 D+ 0:00 make -f scripts/Makefile.build obj=drivers/net/wireless/rtl818x/rtl8180
13737 pts/0 S+ 0:00 make -f scripts/Makefile.build obj=drivers/net/wireless/rtlwifi
13754 pts/0 D+ 0:00 make -f scripts/Makefile.build obj=drivers/net/wireless/rtlwifi/rtl8192de
13917 pts/0 D+ 0:00 make -f scripts/Makefile.build obj=drivers/net/wireless/rtlwifi/rtl8192se
demon:/home/mhocko # cat /proc/13917/stack
[<ffffffff81179862>] path_lookupat+0x792/0x830
[<ffffffff81179933>] filename_lookup+0x33/0xd0
[<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
[<ffffffff8117ab4c>] user_path_at+0xc/0x10
[<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
[<ffffffff81170116>] vfs_stat+0x16/0x20
[<ffffffff8117013f>] sys_newstat+0x1f/0x50
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
demon:/home/mhocko # cat /proc/13754/stack
[<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0
[<ffffffff81183321>] find_inode_fast+0xa1/0xc0
[<ffffffff8118526f>] iget_locked+0x4f/0x180
[<ffffffff811ef9f3>] ext4_iget+0x33/0x9f0
[<ffffffff811f6a2c>] ext4_lookup+0xbc/0x160
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81175254>] __lookup_hash+0x34/0x40
[<ffffffff81179872>] path_lookupat+0x7a2/0x830
[<ffffffff81179933>] filename_lookup+0x33/0xd0
[<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
[<ffffffff8117ab4c>] user_path_at+0xc/0x10
[<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
[<ffffffff81170116>] vfs_stat+0x16/0x20
[<ffffffff8117013f>] sys_newstat+0x1f/0x50
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
demon:/home/mhocko # cat /proc/13665/stack
[<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0
[<ffffffff81183321>] find_inode_fast+0xa1/0xc0
[<ffffffff8118526f>] iget_locked+0x4f/0x180
[<ffffffff811ef9f3>] ext4_iget+0x33/0x9f0
[<ffffffff811f6a2c>] ext4_lookup+0xbc/0x160
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81177e25>] lookup_open+0x175/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
Sysrq+l doesn't show only idle CPUs. Ext4 shows up in the traces
because of CONFIG_EXT4_USE_FOR_EXT23=y.
--
Michal Hocko
SUSE Labs
On Tue 18-06-13 02:30:05, Glauber Costa wrote:
> On Mon, Jun 17, 2013 at 02:35:08PM -0700, Andrew Morton wrote:
[...]
> > The trace says shrink_slab_node->super_cache_scan->prune_icache_sb. So
> > it's inodes?
> >
> Assuming there is no memory corruption of any sort going on , let's
> check the code. nr_item is only manipulated in 3 places:
>
> 1) list_lru_add, where it is increased
> 2) list_lru_del, where it is decreased in case the user have voluntarily removed the
> element from the list
> 3) list_lru_walk_node, where an element is removing during shrink.
>
> All three excerpts seem to be correctly locked, so something like this
> indicates an imbalance. Either the element was never added to the
> list, or it was added, removed, and we didn't notice it. (Again, your
> backing storage is not XFS, is it? If it is , we have another user to
> look for)
No, this is ext3. But I can try to test with xfs as well if it helps.
[...]
--
Michal Hocko
SUSE Labs
On Tue, Jun 18, 2013 at 10:19:31AM +0200, Michal Hocko wrote:
> On Tue 18-06-13 02:30:05, Glauber Costa wrote:
> > On Mon, Jun 17, 2013 at 02:35:08PM -0700, Andrew Morton wrote:
> [...]
> > > The trace says shrink_slab_node->super_cache_scan->prune_icache_sb. So
> > > it's inodes?
> > >
> > Assuming there is no memory corruption of any sort going on , let's
> > check the code. nr_item is only manipulated in 3 places:
> >
> > 1) list_lru_add, where it is increased
> > 2) list_lru_del, where it is decreased in case the user have voluntarily removed the
> > element from the list
> > 3) list_lru_walk_node, where an element is removing during shrink.
> >
> > All three excerpts seem to be correctly locked, so something like this
> > indicates an imbalance. Either the element was never added to the
> > list, or it was added, removed, and we didn't notice it. (Again, your
> > backing storage is not XFS, is it? If it is , we have another user to
> > look for)
>
> No this is ext3. But I can try to test with xfs as well if it helps.
> [...]
XFS won't help here - on the contrary. The reason I asked is that XFS
uses list_lru for its internal structures as well, so it is actually preferable
if you are reproducing this without it, so that we can at least isolate that part.
On Tue 18-06-13 10:31:05, Glauber Costa wrote:
> On Tue, Jun 18, 2013 at 12:46:23PM +1000, Dave Chinner wrote:
> > On Tue, Jun 18, 2013 at 02:30:05AM +0400, Glauber Costa wrote:
> > > On Mon, Jun 17, 2013 at 02:35:08PM -0700, Andrew Morton wrote:
> > > > On Mon, 17 Jun 2013 19:14:12 +0400 Glauber Costa <[email protected]> wrote:
> > > >
> > > > > > I managed to trigger:
> > > > > > [ 1015.776029] kernel BUG at mm/list_lru.c:92!
> > > > > > [ 1015.776029] invalid opcode: 0000 [#1] SMP
> > > > > > with Linux next (next-20130607) with https://lkml.org/lkml/2013/6/17/203
> > > > > > on top.
> > > > > >
> > > > > > This is obviously BUG_ON(nlru->nr_items < 0) and
> > > > > > ffffffff81122d0b: 48 85 c0 test %rax,%rax
> > > > > > ffffffff81122d0e: 49 89 44 24 18 mov %rax,0x18(%r12)
> > > > > > ffffffff81122d13: 0f 84 87 00 00 00 je ffffffff81122da0 <list_lru_walk_node+0x110>
> > > > > > ffffffff81122d19: 49 83 7c 24 18 00 cmpq $0x0,0x18(%r12)
> > > > > > ffffffff81122d1f: 78 7b js ffffffff81122d9c <list_lru_walk_node+0x10c>
> > > > > > [...]
> > > > > > ffffffff81122d9c: 0f 0b ud2
> > > > > >
> > > > > > RAX is -1UL.
> > > > > Yes, fearing those kind of imbalances, we decided to leave the counter as a signed quantity
> > > > > and BUG, instead of an unsigned quantity.
> > > > >
> > > > > >
> > > > > > I assume that the current backtrace is of no use and it would most
> > > > > > probably be some shrinker which doesn't behave.
> > > > > >
> > > > > There are currently 3 users of list_lru in tree: dentries, inodes and xfs.
> > > > > Assuming you are not using xfs, we are left with dentries and inodes.
> > > > >
> > > > > The first thing to do is to find which one of them is misbehaving. You can try finding
> > > > > this out by the address of the list_lru, and where it lays in the superblock.
> > > > >
> > > > > Once we know each of them is misbehaving, then we'll have to figure out why.
> > > >
> > > > The trace says shrink_slab_node->super_cache_scan->prune_icache_sb. So
> > > > it's inodes?
> > > >
> > > Assuming there is no memory corruption of any sort going on , let's check the code.
> > > nr_item is only manipulated in 3 places:
> > >
> > > 1) list_lru_add, where it is increased
> > > 2) list_lru_del, where it is decreased in case the user have voluntarily removed the
> > > element from the list
> > > 3) list_lru_walk_node, where an element is removing during shrink.
> > >
> > > All three excerpts seem to be correctly locked, so something like this indicates an imbalance.
> >
> > inode_lru_isolate() looks suspicious to me:
> >
> > WARN_ON(inode->i_state & I_NEW);
> > inode->i_state |= I_FREEING;
> > spin_unlock(&inode->i_lock);
> >
> > list_move(&inode->i_lru, freeable);
> > this_cpu_dec(nr_unused);
> > return LRU_REMOVED;
> > }
> >
> > All the other cases where I_FREEING is set and the inode is removed
> > from the LRU are completely done under the inode->i_lock. i.e. from
> > an external POV, the state change to I_FREEING and removal from LRU
> > are supposed to be atomic, but they are not here.
> >
> > I'm not sure this is the source of the problem, but it definitely
> > needs fixing.
> >
> Yes, I missed that yesterday, but that does look suspicious to me as well.
>
> Michal, if you can manually move this one inside the lock as well and see
> if it fixes your problem as well... Otherwise I can send you a patch as well
> so we don't get lost on what is patched and what is not.
OK, I am testing with this now:
diff --git a/fs/inode.c b/fs/inode.c
index 604c15e..95e598c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -733,9 +733,9 @@ inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
WARN_ON(inode->i_state & I_NEW);
inode->i_state |= I_FREEING;
+ list_move(&inode->i_lru, freeable);
spin_unlock(&inode->i_lock);
- list_move(&inode->i_lru, freeable);
this_cpu_dec(nr_unused);
return LRU_REMOVED;
}
> Let us at least know if this is the problem.
>
> > > callers:
> > > iput_final, evict_inodes, invalidate_inodes.
> > > Both evict_inodes and invalidate_inodes will do the following pattern:
> > >
> > > inode->i_state |= I_FREEING;
> > > inode_lru_list_del(inode);
> > > spin_unlock(&inode->i_lock);
> > > list_add(&inode->i_lru, &dispose);
> > >
> > > IOW, they will remove the element from the LRU, and add it to the dispose list.
> > > Both of them will also bail out if they see I_FREEING already set, so they are safe
> > > against each other - because the flag is manipulated inside the lock.
> > >
> > > But how about iput_final? It seems to me that if we are calling iput_final at the
> > > same time as the other two, this *could* happen (maybe there is some extra protection
> > > that can be seen from Australia but not from here. Dave?)
> >
> > If I_FREEING is set before we enter iput_final(), then something
> > else is screwed up. I_FREEING is only set once the last reference
> > has gone away and we are killing the inode. All the other callers
> > that set I_FREEING check that the reference count on the inode is
> > zero before they set I_FREEING. Hence I_FREEING cannot be set on the
> > transition of i_count from 1 to 0 when iput_final() is called. So
> > the patch won't do anything to avoid the problem being seen.
> >
> Yes, but isn't things like evict_inodes and invalidate_inodes called at
> umount time, for instance?
JFYI: no unmount is going on in my test case.
--
Michal Hocko
SUSE Labs
On Tue 18-06-13 10:26:24, Glauber Costa wrote:
[...]
> Which is obviously borked since I did not fix the other callers so to move I_FREEING
> after lru del.
>
> Michal, would you mind testing the following patch?
I was about to start testing with the inode_lru_isolate fix. I will give it
a few runs and then test this one if it is still relevant.
> diff --git a/fs/inode.c b/fs/inode.c
> index 00b804e..48eafa6 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -419,6 +419,8 @@ void inode_add_lru(struct inode *inode)
>
> static void inode_lru_list_del(struct inode *inode)
> {
> + if (inode->i_state & I_FREEING)
> + return;
>
> if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
> this_cpu_dec(nr_unused);
> @@ -609,8 +611,8 @@ void evict_inodes(struct super_block *sb)
> continue;
> }
>
> - inode->i_state |= I_FREEING;
> inode_lru_list_del(inode);
> + inode->i_state |= I_FREEING;
> spin_unlock(&inode->i_lock);
> list_add(&inode->i_lru, &dispose);
> }
> @@ -653,8 +655,8 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
> continue;
> }
>
> - inode->i_state |= I_FREEING;
> inode_lru_list_del(inode);
> + inode->i_state |= I_FREEING;
> spin_unlock(&inode->i_lock);
> list_add(&inode->i_lru, &dispose);
> }
> @@ -1381,9 +1383,8 @@ static void iput_final(struct inode *inode)
> inode->i_state &= ~I_WILL_FREE;
> }
>
> + inode_lru_list_del(inode);
> inode->i_state |= I_FREEING;
> - if (!list_empty(&inode->i_lru))
> - inode_lru_list_del(inode);
> spin_unlock(&inode->i_lock);
>
> evict(inode);
--
Michal Hocko
SUSE Labs
On Tue 18-06-13 12:21:33, Glauber Costa wrote:
> On Tue, Jun 18, 2013 at 10:19:31AM +0200, Michal Hocko wrote:
[...]
> > No this is ext3. But I can try to test with xfs as well if it helps.
> > [...]
>
> XFS won't help this, on the contrary. The reason I asked is because XFS
> uses list_lru for its internal structures as well. So it is actually preferred
> if you are reproducing this without it, so we can at least isolate that part.
OK
--
Michal Hocko
SUSE Labs
On Tue 18-06-13 10:24:14, Michal Hocko wrote:
> On Tue 18-06-13 10:31:05, Glauber Costa wrote:
> > On Tue, Jun 18, 2013 at 12:46:23PM +1000, Dave Chinner wrote:
> > > On Tue, Jun 18, 2013 at 02:30:05AM +0400, Glauber Costa wrote:
> > > > On Mon, Jun 17, 2013 at 02:35:08PM -0700, Andrew Morton wrote:
> > > > > On Mon, 17 Jun 2013 19:14:12 +0400 Glauber Costa <[email protected]> wrote:
> > > > >
> > > > > > > I managed to trigger:
> > > > > > > [ 1015.776029] kernel BUG at mm/list_lru.c:92!
> > > > > > > [ 1015.776029] invalid opcode: 0000 [#1] SMP
> > > > > > > with Linux next (next-20130607) with https://lkml.org/lkml/2013/6/17/203
> > > > > > > on top.
> > > > > > >
> > > > > > > This is obviously BUG_ON(nlru->nr_items < 0) and
> > > > > > > ffffffff81122d0b: 48 85 c0 test %rax,%rax
> > > > > > > ffffffff81122d0e: 49 89 44 24 18 mov %rax,0x18(%r12)
> > > > > > > ffffffff81122d13: 0f 84 87 00 00 00 je ffffffff81122da0 <list_lru_walk_node+0x110>
> > > > > > > ffffffff81122d19: 49 83 7c 24 18 00 cmpq $0x0,0x18(%r12)
> > > > > > > ffffffff81122d1f: 78 7b js ffffffff81122d9c <list_lru_walk_node+0x10c>
> > > > > > > [...]
> > > > > > > ffffffff81122d9c: 0f 0b ud2
> > > > > > >
> > > > > > > RAX is -1UL.
> > > > > > Yes, fearing those kind of imbalances, we decided to leave the counter as a signed quantity
> > > > > > and BUG, instead of an unsigned quantity.
> > > > > >
> > > > > > >
> > > > > > > I assume that the current backtrace is of no use and it would most
> > > > > > > probably be some shrinker which doesn't behave.
> > > > > > >
> > > > > > There are currently 3 users of list_lru in tree: dentries, inodes and xfs.
> > > > > > Assuming you are not using xfs, we are left with dentries and inodes.
> > > > > >
> > > > > > The first thing to do is to find which one of them is misbehaving. You can try finding
> > > > > > this out by the address of the list_lru, and where it lays in the superblock.
> > > > > >
> > > > > > Once we know each of them is misbehaving, then we'll have to figure out why.
> > > > >
> > > > > The trace says shrink_slab_node->super_cache_scan->prune_icache_sb. So
> > > > > it's inodes?
> > > > >
> > > > Assuming there is no memory corruption of any sort going on , let's check the code.
> > > > nr_item is only manipulated in 3 places:
> > > >
> > > > 1) list_lru_add, where it is increased
> > > > 2) list_lru_del, where it is decreased in case the user have voluntarily removed the
> > > > element from the list
> > > > 3) list_lru_walk_node, where an element is removing during shrink.
> > > >
> > > > All three excerpts seem to be correctly locked, so something like this indicates an imbalance.
> > >
> > > inode_lru_isolate() looks suspicious to me:
> > >
> > > WARN_ON(inode->i_state & I_NEW);
> > > inode->i_state |= I_FREEING;
> > > spin_unlock(&inode->i_lock);
> > >
> > > list_move(&inode->i_lru, freeable);
> > > this_cpu_dec(nr_unused);
> > > return LRU_REMOVED;
> > > }
> > >
> > > All the other cases where I_FREEING is set and the inode is removed
> > > from the LRU are completely done under the inode->i_lock. i.e. from
> > > an external POV, the state change to I_FREEING and removal from LRU
> > > are supposed to be atomic, but they are not here.
> > >
> > > I'm not sure this is the source of the problem, but it definitely
> > > needs fixing.
> > >
> > Yes, I missed that yesterday, but that does look suspicious to me as well.
> >
> > Michal, if you can manually move this one inside the lock as well and see
> > if it fixes your problem as well... Otherwise I can send you a patch as well
> > so we don't get lost on what is patched and what is not.
>
> OK, I am testing with this now:
> diff --git a/fs/inode.c b/fs/inode.c
> index 604c15e..95e598c 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -733,9 +733,9 @@ inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
>
> WARN_ON(inode->i_state & I_NEW);
> inode->i_state |= I_FREEING;
> + list_move(&inode->i_lru, freeable);
> spin_unlock(&inode->i_lock);
>
> - list_move(&inode->i_lru, freeable);
> this_cpu_dec(nr_unused);
> return LRU_REMOVED;
> }
And this hung again:
4434 pts/0 S+ 0:00 /bin/sh ./run_batch.sh mmotmdebug
4436 pts/0 S+ 0:00 /bin/bash ./start.sh
4441 pts/0 S+ 0:26 /bin/bash ./start.sh
1919 pts/0 S+ 0:00 sleep 1s
4457 pts/0 S+ 0:00 /bin/bash ./run_test.sh /dev/cgroup A 2
4459 pts/0 S+ 0:27 /bin/bash ./run_test.sh /dev/cgroup A 2
1913 pts/0 S+ 0:00 sleep 1s
5626 pts/0 S+ 0:00 /usr/bin/time -v make -j4 vmlinux
5628 pts/0 S+ 0:00 make -j4 vmlinux
2676 pts/0 S+ 0:00 make -f scripts/Makefile.build obj=sound
2893 pts/0 S+ 0:00 make -f scripts/Makefile.build obj=sound/pci
2998 pts/0 D+ 0:00 make -f scripts/Makefile.build obj=sound/pci/emu10k1
6590 pts/0 D+ 0:00 make -f scripts/Makefile.build obj=net
4458 pts/0 S+ 0:00 /bin/bash ./run_test.sh /dev/cgroup B 2
4464 pts/0 S+ 0:27 /bin/bash ./run_test.sh /dev/cgroup B 2
1914 pts/0 S+ 0:00 sleep 1s
5625 pts/0 S+ 0:00 /usr/bin/time -v make -j4 vmlinux
5627 pts/0 S+ 0:00 make -j4 vmlinux
13010 pts/0 D+ 0:00 make -f scripts/Makefile.build obj=kernel
3933 pts/0 Z+ 0:00 [sh] <defunct>
14459 pts/0 S+ 0:00 make -f scripts/Makefile.build obj=fs
784 pts/0 D+ 0:00 make -f scripts/Makefile.build obj=fs/romfs
2401 pts/0 S+ 0:00 make -f scripts/Makefile.build obj=crypto
4614 pts/0 D+ 0:00 make -f scripts/Makefile.build obj=crypto/asymmetric_keys
3343 pts/0 S+ 0:00 make -f scripts/Makefile.build obj=block
5167 pts/0 D+ 0:00 make -f scripts/Makefile.build obj=block/partitions
demon:/home/mhocko # cat /proc/2998/stack
[<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0
[<ffffffff81183321>] find_inode_fast+0xa1/0xc0
[<ffffffff8118525f>] iget_locked+0x4f/0x180
[<ffffffff811ef9e3>] ext4_iget+0x33/0x9f0
[<ffffffff811f6a1c>] ext4_lookup+0xbc/0x160
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81177e25>] lookup_open+0x175/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
demon:/home/mhocko # cat /proc/6590/stack
[<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0
[<ffffffff81183321>] find_inode_fast+0xa1/0xc0
[<ffffffff8118525f>] iget_locked+0x4f/0x180
[<ffffffff811ef9e3>] ext4_iget+0x33/0x9f0
[<ffffffff811f6a1c>] ext4_lookup+0xbc/0x160
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81175254>] __lookup_hash+0x34/0x40
[<ffffffff81179872>] path_lookupat+0x7a2/0x830
[<ffffffff81179933>] filename_lookup+0x33/0xd0
[<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
[<ffffffff8117ab4c>] user_path_at+0xc/0x10
[<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
[<ffffffff81170116>] vfs_stat+0x16/0x20
[<ffffffff8117013f>] sys_newstat+0x1f/0x50
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
demon:/home/mhocko # cat /proc/13010/stack
[<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0
[<ffffffff81183321>] find_inode_fast+0xa1/0xc0
[<ffffffff8118525f>] iget_locked+0x4f/0x180
[<ffffffff811ef9e3>] ext4_iget+0x33/0x9f0
[<ffffffff811f6a1c>] ext4_lookup+0xbc/0x160
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81175254>] __lookup_hash+0x34/0x40
[<ffffffff81179872>] path_lookupat+0x7a2/0x830
[<ffffffff81179933>] filename_lookup+0x33/0xd0
[<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
[<ffffffff8117ab4c>] user_path_at+0xc/0x10
[<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
[<ffffffff81170116>] vfs_stat+0x16/0x20
[<ffffffff8117013f>] sys_newstat+0x1f/0x50
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
demon:/home/mhocko # cat /proc/784/stack
[<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0
[<ffffffff81183321>] find_inode_fast+0xa1/0xc0
[<ffffffff8118525f>] iget_locked+0x4f/0x180
[<ffffffff811ef9e3>] ext4_iget+0x33/0x9f0
[<ffffffff811f6a1c>] ext4_lookup+0xbc/0x160
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81175254>] __lookup_hash+0x34/0x40
[<ffffffff81179872>] path_lookupat+0x7a2/0x830
[<ffffffff81179933>] filename_lookup+0x33/0xd0
[<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
[<ffffffff8117ab4c>] user_path_at+0xc/0x10
[<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
[<ffffffff81170116>] vfs_stat+0x16/0x20
[<ffffffff8117013f>] sys_newstat+0x1f/0x50
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
demon:/home/mhocko # cat /proc/4614/stack
[<ffffffff8117d0ca>] vfs_readdir+0x7a/0xc0
[<ffffffff8117d1a6>] sys_getdents64+0x96/0x100
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
demon:/home/mhocko # cat /proc/5167/stack
[<ffffffff8117d0ca>] vfs_readdir+0x7a/0xc0
[<ffffffff8117d1a6>] sys_getdents64+0x96/0x100
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
JFYI: this is still with my -mm git tree + the above diff + the referenced
patch from Mel (mm: Clear page active before releasing pages).
--
Michal Hocko
SUSE Labs
And again, another hang. It looks like the inode deletion never
finishes. The good thing is that I do not see any LRU-related BUG_ONs
anymore. I am going to test with the other patch in the thread.
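For reference, __wait_on_freeing_inode() sleeps until evict() drops the inode from
the hash and wakes the __I_NEW waiters, so if the eviction never completes everything
piles up behind it (and behind i_mutex); roughly (paraphrased from fs/inode.c, not
verbatim):

	static void __wait_on_freeing_inode(struct inode *inode)
	{
		wait_queue_head_t *wq;
		DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);

		wq = bit_waitqueue(&inode->i_state, __I_NEW);
		prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
		spin_unlock(&inode->i_lock);
		spin_unlock(&inode_hash_lock);
		schedule();
		finish_wait(wq, &wait.wait);
		spin_lock(&inode_hash_lock);
	}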
2476 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0 <<< waiting for an inode to go away
[<ffffffff81183321>] find_inode_fast+0xa1/0xc0
[<ffffffff8118525f>] iget_locked+0x4f/0x180
[<ffffffff811ef9e3>] ext4_iget+0x33/0x9f0
[<ffffffff811f6a1c>] ext4_lookup+0xbc/0x160
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81177e25>] lookup_open+0x175/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780 <<< holds i_mutex
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
11214 [<ffffffff81178144>] do_last+0x2c4/0x780 <<< blocked on i_mutex
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
11217 [<ffffffff81178144>] do_last+0x2c4/0x780 <<< blocked on i_mutex
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
11288 [<ffffffff81178144>] do_last+0x2c4/0x780 <<< blocked on i_mutex
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
11453 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0
[<ffffffff81183321>] find_inode_fast+0xa1/0xc0
[<ffffffff8118525f>] iget_locked+0x4f/0x180
[<ffffffff811ef9e3>] ext4_iget+0x33/0x9f0
[<ffffffff811f6a1c>] ext4_lookup+0xbc/0x160
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81177e25>] lookup_open+0x175/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
12439 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0
[<ffffffff81183321>] find_inode_fast+0xa1/0xc0
[<ffffffff8118525f>] iget_locked+0x4f/0x180
[<ffffffff811ef9e3>] ext4_iget+0x33/0x9f0
[<ffffffff811f6a1c>] ext4_lookup+0xbc/0x160
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81175254>] __lookup_hash+0x34/0x40
[<ffffffff81179872>] path_lookupat+0x7a2/0x830
[<ffffffff81179933>] filename_lookup+0x33/0xd0
[<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
[<ffffffff8117ab4c>] user_path_at+0xc/0x10
[<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
[<ffffffff81170116>] vfs_stat+0x16/0x20
[<ffffffff8117013f>] sys_newstat+0x1f/0x50
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
12542 [<ffffffff81179862>] path_lookupat+0x792/0x830
[<ffffffff81179933>] filename_lookup+0x33/0xd0
[<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
[<ffffffff8117ab4c>] user_path_at+0xc/0x10
[<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
[<ffffffff81170116>] vfs_stat+0x16/0x20
[<ffffffff8117013f>] sys_newstat+0x1f/0x50
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
12588 [<ffffffff81179862>] path_lookupat+0x792/0x830
[<ffffffff81179933>] filename_lookup+0x33/0xd0
[<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
[<ffffffff8117ab4c>] user_path_at+0xc/0x10
[<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
[<ffffffff81170116>] vfs_stat+0x16/0x20
[<ffffffff8117013f>] sys_newstat+0x1f/0x50
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
12589 [<ffffffff81179862>] path_lookupat+0x792/0x830
[<ffffffff81179933>] filename_lookup+0x33/0xd0
[<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
[<ffffffff8117ab4c>] user_path_at+0xc/0x10
[<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
[<ffffffff81170116>] vfs_stat+0x16/0x20
[<ffffffff8117013f>] sys_newstat+0x1f/0x50
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
13098 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0
[<ffffffff81183321>] find_inode_fast+0xa1/0xc0
[<ffffffff8118525f>] iget_locked+0x4f/0x180
[<ffffffff811ef9e3>] ext4_iget+0x33/0x9f0
[<ffffffff811f6a1c>] ext4_lookup+0xbc/0x160
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81177e25>] lookup_open+0x175/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
19091 [<ffffffff81179862>] path_lookupat+0x792/0x830
[<ffffffff81179933>] filename_lookup+0x33/0xd0
[<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
[<ffffffff8117ab4c>] user_path_at+0xc/0x10
[<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
[<ffffffff81170116>] vfs_stat+0x16/0x20
[<ffffffff8117013f>] sys_newstat+0x1f/0x50
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
19092 [<ffffffff81179862>] path_lookupat+0x792/0x830
[<ffffffff81179933>] filename_lookup+0x33/0xd0
[<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
[<ffffffff8117ab4c>] user_path_at+0xc/0x10
[<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
[<ffffffff81170116>] vfs_stat+0x16/0x20
[<ffffffff8117013f>] sys_newstat+0x1f/0x50
[<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
--
Michal Hocko
SUSE Labs
On Tue 18-06-13 10:26:24, Glauber Costa wrote:
[...]
> Michal, would you mind testing the following patch?
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 00b804e..48eafa6 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -419,6 +419,8 @@ void inode_add_lru(struct inode *inode)
>
> static void inode_lru_list_del(struct inode *inode)
> {
> + if (inode->i_state & I_FREEING)
> + return;
>
> if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
> this_cpu_dec(nr_unused);
> @@ -609,8 +611,8 @@ void evict_inodes(struct super_block *sb)
> continue;
> }
>
> - inode->i_state |= I_FREEING;
> inode_lru_list_del(inode);
> + inode->i_state |= I_FREEING;
> spin_unlock(&inode->i_lock);
> list_add(&inode->i_lru, &dispose);
> }
> @@ -653,8 +655,8 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
> continue;
> }
>
> - inode->i_state |= I_FREEING;
> inode_lru_list_del(inode);
> + inode->i_state |= I_FREEING;
> spin_unlock(&inode->i_lock);
> list_add(&inode->i_lru, &dispose);
> }
> @@ -1381,9 +1383,8 @@ static void iput_final(struct inode *inode)
> inode->i_state &= ~I_WILL_FREE;
> }
>
> + inode_lru_list_del(inode);
> inode->i_state |= I_FREEING;
> - if (!list_empty(&inode->i_lru))
> - inode_lru_list_del(inode);
> spin_unlock(&inode->i_lock);
>
> evict(inode);
No luck. I have this on top of the inode_lru_isolate one but can still see
hangs:
911 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0
[<ffffffff81183321>] find_inode_fast+0xa1/0xc0
[<ffffffff8118529f>] iget_locked+0x4f/0x180
[<ffffffff811efa23>] ext4_iget+0x33/0x9f0
[<ffffffff811f6a5c>] ext4_lookup+0xbc/0x160
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81175254>] __lookup_hash+0x34/0x40
[<ffffffff81179872>] path_lookupat+0x7a2/0x830
[<ffffffff81179933>] filename_lookup+0x33/0xd0
[<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
[<ffffffff8117ab4c>] user_path_at+0xc/0x10
[<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
[<ffffffff81170116>] vfs_stat+0x16/0x20
[<ffffffff8117013f>] sys_newstat+0x1f/0x50
[<ffffffff81583129>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
21409 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0
[<ffffffff81183321>] find_inode_fast+0xa1/0xc0
[<ffffffff8118529f>] iget_locked+0x4f/0x180
[<ffffffff811efa23>] ext4_iget+0x33/0x9f0
[<ffffffff811f6a5c>] ext4_lookup+0xbc/0x160
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81177e25>] lookup_open+0x175/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff81583129>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
21745 [<ffffffff81179862>] path_lookupat+0x792/0x830
[<ffffffff81179933>] filename_lookup+0x33/0xd0
[<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
[<ffffffff8117ab4c>] user_path_at+0xc/0x10
[<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
[<ffffffff81170116>] vfs_stat+0x16/0x20
[<ffffffff8117013f>] sys_newstat+0x1f/0x50
[<ffffffff81583129>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
22032 [<ffffffff81179862>] path_lookupat+0x792/0x830
[<ffffffff81179933>] filename_lookup+0x33/0xd0
[<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
[<ffffffff8117ab4c>] user_path_at+0xc/0x10
[<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
[<ffffffff81170116>] vfs_stat+0x16/0x20
[<ffffffff8117013f>] sys_newstat+0x1f/0x50
[<ffffffff81583129>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
22621 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0
[<ffffffff81183321>] find_inode_fast+0xa1/0xc0
[<ffffffff8118529f>] iget_locked+0x4f/0x180
[<ffffffff811efa23>] ext4_iget+0x33/0x9f0
[<ffffffff811f6a5c>] ext4_lookup+0xbc/0x160
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81177e25>] lookup_open+0x175/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff81583129>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
22711 [<ffffffff81178144>] do_last+0x2c4/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff81583129>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
22946 [<ffffffff81178144>] do_last+0x2c4/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff81583129>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
23393 [<ffffffff81178144>] do_last+0x2c4/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff81583129>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
--
Michal Hocko
SUSE Labs
On Wed, Jun 19, 2013 at 09:13:46AM +0200, Michal Hocko wrote:
> On Tue 18-06-13 10:26:24, Glauber Costa wrote:
> [...]
> > Michal, would you mind testing the following patch?
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 00b804e..48eafa6 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -419,6 +419,8 @@ void inode_add_lru(struct inode *inode)
> >
> > static void inode_lru_list_del(struct inode *inode)
> > {
> > + if (inode->i_state & I_FREEING)
> > + return;
> >
> > if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
> > this_cpu_dec(nr_unused);
> > @@ -609,8 +611,8 @@ void evict_inodes(struct super_block *sb)
> > continue;
> > }
> >
> > - inode->i_state |= I_FREEING;
> > inode_lru_list_del(inode);
> > + inode->i_state |= I_FREEING;
> > spin_unlock(&inode->i_lock);
> > list_add(&inode->i_lru, &dispose);
> > }
> > @@ -653,8 +655,8 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
> > continue;
> > }
> >
> > - inode->i_state |= I_FREEING;
> > inode_lru_list_del(inode);
> > + inode->i_state |= I_FREEING;
> > spin_unlock(&inode->i_lock);
> > list_add(&inode->i_lru, &dispose);
> > }
> > @@ -1381,9 +1383,8 @@ static void iput_final(struct inode *inode)
> > inode->i_state &= ~I_WILL_FREE;
> > }
> >
> > + inode_lru_list_del(inode);
> > inode->i_state |= I_FREEING;
> > - if (!list_empty(&inode->i_lru))
> > - inode_lru_list_del(inode);
> > spin_unlock(&inode->i_lock);
> >
> > evict(inode);
>
> No luck. I have this on top of inode_lru_isolate one but still can see
> hangs:
> 911 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0
> [<ffffffff81183321>] find_inode_fast+0xa1/0xc0
> [<ffffffff8118529f>] iget_locked+0x4f/0x180
> [<ffffffff811efa23>] ext4_iget+0x33/0x9f0
> [<ffffffff811f6a5c>] ext4_lookup+0xbc/0x160
> [<ffffffff81174ad0>] lookup_real+0x20/0x60
> [<ffffffff81175254>] __lookup_hash+0x34/0x40
> [<ffffffff81179872>] path_lookupat+0x7a2/0x830
> [<ffffffff81179933>] filename_lookup+0x33/0xd0
> [<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
> [<ffffffff8117ab4c>] user_path_at+0xc/0x10
> [<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
> [<ffffffff81170116>] vfs_stat+0x16/0x20
> [<ffffffff8117013f>] sys_newstat+0x1f/0x50
> [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> 21409 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0
> [<ffffffff81183321>] find_inode_fast+0xa1/0xc0
> [<ffffffff8118529f>] iget_locked+0x4f/0x180
> [<ffffffff811efa23>] ext4_iget+0x33/0x9f0
> [<ffffffff811f6a5c>] ext4_lookup+0xbc/0x160
> [<ffffffff81174ad0>] lookup_real+0x20/0x60
> [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> [<ffffffff8117815e>] do_last+0x2de/0x780
> [<ffffffff8117ae9a>] path_openat+0xda/0x400
> [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> [<ffffffff81168f9c>] sys_open+0x1c/0x20
> [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> 21745 [<ffffffff81179862>] path_lookupat+0x792/0x830
> [<ffffffff81179933>] filename_lookup+0x33/0xd0
> [<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
> [<ffffffff8117ab4c>] user_path_at+0xc/0x10
> [<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
> [<ffffffff81170116>] vfs_stat+0x16/0x20
> [<ffffffff8117013f>] sys_newstat+0x1f/0x50
> [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> 22032 [<ffffffff81179862>] path_lookupat+0x792/0x830
> [<ffffffff81179933>] filename_lookup+0x33/0xd0
> [<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
> [<ffffffff8117ab4c>] user_path_at+0xc/0x10
> [<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
> [<ffffffff81170116>] vfs_stat+0x16/0x20
> [<ffffffff8117013f>] sys_newstat+0x1f/0x50
> [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> 22621 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0
> [<ffffffff81183321>] find_inode_fast+0xa1/0xc0
> [<ffffffff8118529f>] iget_locked+0x4f/0x180
> [<ffffffff811efa23>] ext4_iget+0x33/0x9f0
> [<ffffffff811f6a5c>] ext4_lookup+0xbc/0x160
> [<ffffffff81174ad0>] lookup_real+0x20/0x60
> [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> [<ffffffff8117815e>] do_last+0x2de/0x780
> [<ffffffff8117ae9a>] path_openat+0xda/0x400
> [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> [<ffffffff81168f9c>] sys_open+0x1c/0x20
> [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> 22711 [<ffffffff81178144>] do_last+0x2c4/0x780
> [<ffffffff8117ae9a>] path_openat+0xda/0x400
> [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> [<ffffffff81168f9c>] sys_open+0x1c/0x20
> [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> 22946 [<ffffffff81178144>] do_last+0x2c4/0x780
> [<ffffffff8117ae9a>] path_openat+0xda/0x400
> [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> [<ffffffff81168f9c>] sys_open+0x1c/0x20
> [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> 23393 [<ffffffff81178144>] do_last+0x2c4/0x780
> [<ffffffff8117ae9a>] path_openat+0xda/0x400
> [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> [<ffffffff81168f9c>] sys_open+0x1c/0x20
> [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> --
Sorry if you said that before, Michal.
But given the backtrace, are you sure this is LRU-related? You mentioned you bisected
it but found nothing conclusive. I will keep looking, but maybe this could benefit from
a broader fs look.
In any case, the patch we suggested is obviously correct and we should apply it nevertheless.
I will write it up and send it to Andrew.
On Wed, Jun 19, 2013 at 11:35:27AM +0400, Glauber Costa wrote:
> On Wed, Jun 19, 2013 at 09:13:46AM +0200, Michal Hocko wrote:
> > On Tue 18-06-13 10:26:24, Glauber Costa wrote:
> > [...]
> > > Michal, would you mind testing the following patch?
> > >
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index 00b804e..48eafa6 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -419,6 +419,8 @@ void inode_add_lru(struct inode *inode)
> > >
> > > static void inode_lru_list_del(struct inode *inode)
> > > {
> > > + if (inode->i_state & I_FREEING)
> > > + return;
> > >
> > > if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
> > > this_cpu_dec(nr_unused);
> > > @@ -609,8 +611,8 @@ void evict_inodes(struct super_block *sb)
> > > continue;
> > > }
> > >
> > > - inode->i_state |= I_FREEING;
> > > inode_lru_list_del(inode);
> > > + inode->i_state |= I_FREEING;
> > > spin_unlock(&inode->i_lock);
> > > list_add(&inode->i_lru, &dispose);
> > > }
> > > @@ -653,8 +655,8 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
> > > continue;
> > > }
> > >
> > > - inode->i_state |= I_FREEING;
> > > inode_lru_list_del(inode);
> > > + inode->i_state |= I_FREEING;
> > > spin_unlock(&inode->i_lock);
> > > list_add(&inode->i_lru, &dispose);
> > > }
> > > @@ -1381,9 +1383,8 @@ static void iput_final(struct inode *inode)
> > > inode->i_state &= ~I_WILL_FREE;
> > > }
> > >
> > > + inode_lru_list_del(inode);
> > > inode->i_state |= I_FREEING;
> > > - if (!list_empty(&inode->i_lru))
> > > - inode_lru_list_del(inode);
> > > spin_unlock(&inode->i_lock);
> > >
> > > evict(inode);
> >
> > No luck. I have this on top of inode_lru_isolate one but still can see
> > hangs:
> > 911 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0
> > [<ffffffff81183321>] find_inode_fast+0xa1/0xc0
> > [<ffffffff8118529f>] iget_locked+0x4f/0x180
> > [<ffffffff811efa23>] ext4_iget+0x33/0x9f0
> > [<ffffffff811f6a5c>] ext4_lookup+0xbc/0x160
> > [<ffffffff81174ad0>] lookup_real+0x20/0x60
> > [<ffffffff81175254>] __lookup_hash+0x34/0x40
> > [<ffffffff81179872>] path_lookupat+0x7a2/0x830
> > [<ffffffff81179933>] filename_lookup+0x33/0xd0
> > [<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
> > [<ffffffff8117ab4c>] user_path_at+0xc/0x10
> > [<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
> > [<ffffffff81170116>] vfs_stat+0x16/0x20
> > [<ffffffff8117013f>] sys_newstat+0x1f/0x50
> > [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 21409 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0
> > [<ffffffff81183321>] find_inode_fast+0xa1/0xc0
> > [<ffffffff8118529f>] iget_locked+0x4f/0x180
> > [<ffffffff811efa23>] ext4_iget+0x33/0x9f0
> > [<ffffffff811f6a5c>] ext4_lookup+0xbc/0x160
> > [<ffffffff81174ad0>] lookup_real+0x20/0x60
> > [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> > [<ffffffff8117815e>] do_last+0x2de/0x780
> > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 21745 [<ffffffff81179862>] path_lookupat+0x792/0x830
> > [<ffffffff81179933>] filename_lookup+0x33/0xd0
> > [<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
> > [<ffffffff8117ab4c>] user_path_at+0xc/0x10
> > [<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
> > [<ffffffff81170116>] vfs_stat+0x16/0x20
> > [<ffffffff8117013f>] sys_newstat+0x1f/0x50
> > [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 22032 [<ffffffff81179862>] path_lookupat+0x792/0x830
> > [<ffffffff81179933>] filename_lookup+0x33/0xd0
> > [<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
> > [<ffffffff8117ab4c>] user_path_at+0xc/0x10
> > [<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
> > [<ffffffff81170116>] vfs_stat+0x16/0x20
> > [<ffffffff8117013f>] sys_newstat+0x1f/0x50
> > [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 22621 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0
> > [<ffffffff81183321>] find_inode_fast+0xa1/0xc0
> > [<ffffffff8118529f>] iget_locked+0x4f/0x180
> > [<ffffffff811efa23>] ext4_iget+0x33/0x9f0
> > [<ffffffff811f6a5c>] ext4_lookup+0xbc/0x160
> > [<ffffffff81174ad0>] lookup_real+0x20/0x60
> > [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> > [<ffffffff8117815e>] do_last+0x2de/0x780
> > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 22711 [<ffffffff81178144>] do_last+0x2c4/0x780
> > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 22946 [<ffffffff81178144>] do_last+0x2c4/0x780
> > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 23393 [<ffffffff81178144>] do_last+0x2c4/0x780
> > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > --
> Sorry if you said that before Michal.
>
> But given the backtrace, are you sure this is LRU-related? You mentioned you bisected
> it but found nothing conclusive. I will keep looking but maybe this could benefit from
> a broader fs look
>
> In any case, the patch we suggested is obviously correct and we should apply nevertheless.
> I will write it down and send it to Andrew.
My analysis of the LRU side code so far is:
* Assuming we are not hanging because of held locks, the fact that we
are hung at __wait_on_freeing_inode() means that someone who should
be waking us up is not. This would indicate that an inode is marked
as to-be-freed, but later on not freed.
* We will wait for an inode that is being freed if its state is I_FREEING or I_WILL_FREE.
* I_WILL_FREE is only set for a very short time during iput_final, and
the code path leads unconditionally to evict(), which wakes up any
waiters.
* clear_inode sets I_FREEING but it is only called from within evict,
which means we will wake up the waiters shortly.
* The LRU will not necessarily put the element into those states, but when
it does, it moves them to the dispose list. We will call evict() for all
elements in the dispose list, and that will unconditionally call wake_up_bit.
So it seems that if the LRU sets I_FREEING (we never set I_WILL_FREE), the
waiters will eventually be woken once the dispose list is processed.
* The same is true for evict_inodes and invalidate_inodes. They test
for the freeing bits and will skip the inodes marked as such. This seems
okay, since this means someone else marked them as freeing and it should
be their responsibility to wake up the callers.
So this shows that, strangely enough, the code seems very safe and fine.
Still, you are seeing hangs... Any chance we are hanging on the acquisition
of inode_hash_lock?
I need to be away for some hours, but I will be back to it soon. Meanwhile,
if Dave could take a look at it, that would be truly great.
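(For reference, the dispose-list processing described above looks roughly like
this; a simplified sketch paraphrased from fs/inode.c of that era, not the
exact code in the tree under test:)

static void evict(struct inode *inode)
{
	/* ... writeback list removal, ->evict_inode()/clear_inode() ... */

	remove_inode_hash(inode);

	spin_lock(&inode->i_lock);
	wake_up_bit(&inode->i_state, __I_NEW);	/* wakes __wait_on_freeing_inode() waiters */
	spin_unlock(&inode->i_lock);

	destroy_inode(inode);
}

static void dispose_list(struct list_head *head)
{
	while (!list_empty(head)) {
		struct inode *inode;

		inode = list_first_entry(head, struct inode, i_lru);
		list_del_init(&inode->i_lru);

		evict(inode);	/* every inode put on a dispose list ends up here */
	}
}

Every path that sets I_FREEING is supposed to funnel the inode into evict()
sooner or later, which is why the wakeup looks unconditional.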
On Wed 19-06-13 11:35:27, Glauber Costa wrote:
[...]
> Sorry if you said that before Michal.
>
> But given the backtrace, are you sure this is LRU-related?
No idea. I just know that my mm tree behaves correctly after the whole
series has been reverted (58f6e0c8fb37e8e37d5ac17a61a53ac236c15047) and
before the latest version of the patchset has been applied.
> You mentioned you bisected it but found nothing conclusive.
Yes, but I was interested in crashes and not hangs so I will try it
again.
I really hope this is not just some stupidity in my tree.
> I will keep looking but maybe this could benefit from
> a broader fs look
>
> In any case, the patch we suggested is obviously correct and we should
> apply nevertheless. I will write it down and send it to Andrew.
OK, feel free to stick my Tested-by there.
--
Michal Hocko
SUSE Labs
On Wed, Jun 19, 2013 at 03:57:16PM +0200, Michal Hocko wrote:
> On Wed 19-06-13 11:35:27, Glauber Costa wrote:
> [...]
> > Sorry if you said that before Michal.
> >
> > But given the backtrace, are you sure this is LRU-related?
>
> No idea. I just know that my mm tree behaves correctly after the whole
> series has been reverted (58f6e0c8fb37e8e37d5ac17a61a53ac236c15047) and
> before the latest version of the patchset has been applied.
>
> > You mentioned you bisected it but found nothing conclusive.
>
> Yes, but I was interested in crashes and not hangs so I will try it
> again.
>
> I really hope this is not just some stupidness in my tree.
>
Okay. Just looking at the stack trace you provided, it would be hard
to implicate us. But it is not totally unreasonable either, since we
touch things in this area. Right now I would assign more probability to
a tricky bug than to some misconfiguration on your side.
On Wed 19-06-13 09:13:46, Michal Hocko wrote:
> On Tue 18-06-13 10:26:24, Glauber Costa wrote:
> [...]
> > Michal, would you mind testing the following patch?
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 00b804e..48eafa6 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -419,6 +419,8 @@ void inode_add_lru(struct inode *inode)
> >
> > static void inode_lru_list_del(struct inode *inode)
> > {
> > + if (inode->i_state & I_FREEING)
> > + return;
> >
> > if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
> > this_cpu_dec(nr_unused);
> > @@ -609,8 +611,8 @@ void evict_inodes(struct super_block *sb)
> > continue;
> > }
> >
> > - inode->i_state |= I_FREEING;
> > inode_lru_list_del(inode);
> > + inode->i_state |= I_FREEING;
> > spin_unlock(&inode->i_lock);
> > list_add(&inode->i_lru, &dispose);
> > }
> > @@ -653,8 +655,8 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
> > continue;
> > }
> >
> > - inode->i_state |= I_FREEING;
> > inode_lru_list_del(inode);
> > + inode->i_state |= I_FREEING;
> > spin_unlock(&inode->i_lock);
> > list_add(&inode->i_lru, &dispose);
> > }
> > @@ -1381,9 +1383,8 @@ static void iput_final(struct inode *inode)
> > inode->i_state &= ~I_WILL_FREE;
> > }
> >
> > + inode_lru_list_del(inode);
> > inode->i_state |= I_FREEING;
> > - if (!list_empty(&inode->i_lru))
> > - inode_lru_list_del(inode);
> > spin_unlock(&inode->i_lock);
> >
> > evict(inode);
>
> No luck. I have this on top of inode_lru_isolate one but still can see
And I was lucky enough to hit another BUG_ON with this kernel (the above
patch and inode_lru_isolate-fix):
[84091.219056] ------------[ cut here ]------------
[84091.220015] kernel BUG at mm/list_lru.c:42!
[84091.220015] invalid opcode: 0000 [#1] SMP
[84091.220015] Modules linked in: edd nfsv3 nfs_acl nfs fscache lockd sunrpc af_packet bridge stp llc cpufreq_conservative cpufreq_userspace cpufreq_powersave fuse loop dm_mod powernow_k8 tg3 kvm_amd kvm ptp e1000 pps_core shpchp edac_core i2c_amd756 amd_rng pci_hotplug k8temp sg i2c_amd8111 edac_mce_amd serio_raw sr_mod pcspkr cdrom button ohci_hcd ehci_hcd usbcore usb_common processor thermal_sys scsi_dh_emc scsi_dh_rdac scsi_dh_hp_sw scsi_dh ata_generic sata_sil pata_amd
[84091.220015] CPU 1
[84091.220015] Pid: 32545, comm: rm Not tainted 3.9.0mmotmdebugging1+ #1472 AMD A8440/WARTHOG
[84091.220015] RIP: 0010:[<ffffffff81127fff>] [<ffffffff81127fff>] list_lru_del+0xcf/0xe0
[84091.220015] RSP: 0018:ffff88001de85df8 EFLAGS: 00010286
[84091.220015] RAX: ffffffffffffffff RBX: ffff88001e1ce2c0 RCX: 0000000000000002
[84091.220015] RDX: ffff88001e1ce2c8 RSI: ffff8800087f4220 RDI: ffff88001e1ce2c0
[84091.220015] RBP: ffff88001de85e18 R08: 0000000000000000 R09: 0000000000000000
[84091.220015] R10: ffff88001d539128 R11: ffff880018234882 R12: ffff8800087f4220
[84091.220015] R13: ffff88001c68bc40 R14: 0000000000000000 R15: ffff88001de85ea8
[84091.220015] FS: 00007f43adb30700(0000) GS:ffff88001f100000(0000) knlGS:0000000000000000
[84091.220015] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[84091.220015] CR2: 0000000001ffed30 CR3: 000000001e02e000 CR4: 00000000000007e0
[84091.220015] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[84091.220015] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[84091.220015] Process rm (pid: 32545, threadinfo ffff88001de84000, task ffff88001c22e5c0)
[84091.220015] Stack:
[84091.220015] ffff8800087f4130 ffff8800087f41b8 ffff88001c68b800 0000000000000000
[84091.220015] ffff88001de85e48 ffffffff81184357 ffff88001de85e48 ffff8800087f4130
[84091.220015] ffff88001e005000 ffff880014e4eb40 ffff88001de85e68 ffffffff81184418
[84091.220015] Call Trace:
[84091.220015] [<ffffffff81184357>] iput_final+0x117/0x190
[84091.220015] [<ffffffff81184418>] iput+0x48/0x60
[84091.220015] [<ffffffff8117a804>] do_unlinkat+0x214/0x240
[84091.220015] [<ffffffff8117aa4d>] sys_unlinkat+0x1d/0x40
[84091.220015] [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
[84091.220015] Code: 5c 41 5d b8 01 00 00 00 41 5e c9 c3 49 8d 45 08 f0 45 0f b3 75 08 eb db 0f 1f 40 00 66 83 03 01 5b 41 5c 41 5d 31 c0 41 5e c9 c3 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 ba 00 00
[84091.220015] RIP [<ffffffff81127fff>] list_lru_del+0xcf/0xe0
[84091.220015] RSP <ffff88001de85df8>
[84091.470390] ---[ end trace e6915e8ee0f5f079 ]---
Which is the BUG_ON(nlru->nr_items < 0) from the iput_final path. So it seems
that there is still a race there.
--
Michal Hocko
SUSE Labs
On Wed, Jun 19, 2013 at 04:28:01PM +0200, Michal Hocko wrote:
> On Wed 19-06-13 09:13:46, Michal Hocko wrote:
> > On Tue 18-06-13 10:26:24, Glauber Costa wrote:
> > [...]
> > > Michal, would you mind testing the following patch?
> > >
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index 00b804e..48eafa6 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -419,6 +419,8 @@ void inode_add_lru(struct inode *inode)
> > >
> > > static void inode_lru_list_del(struct inode *inode)
> > > {
> > > + if (inode->i_state & I_FREEING)
> > > + return;
> > >
> > > if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
> > > this_cpu_dec(nr_unused);
> > > @@ -609,8 +611,8 @@ void evict_inodes(struct super_block *sb)
> > > continue;
> > > }
> > >
> > > - inode->i_state |= I_FREEING;
> > > inode_lru_list_del(inode);
> > > + inode->i_state |= I_FREEING;
> > > spin_unlock(&inode->i_lock);
> > > list_add(&inode->i_lru, &dispose);
> > > }
> > > @@ -653,8 +655,8 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
> > > continue;
> > > }
> > >
> > > - inode->i_state |= I_FREEING;
> > > inode_lru_list_del(inode);
> > > + inode->i_state |= I_FREEING;
> > > spin_unlock(&inode->i_lock);
> > > list_add(&inode->i_lru, &dispose);
> > > }
> > > @@ -1381,9 +1383,8 @@ static void iput_final(struct inode *inode)
> > > inode->i_state &= ~I_WILL_FREE;
> > > }
> > >
> > > + inode_lru_list_del(inode);
> > > inode->i_state |= I_FREEING;
> > > - if (!list_empty(&inode->i_lru))
> > > - inode_lru_list_del(inode);
> > > spin_unlock(&inode->i_lock);
> > >
> > > evict(inode);
> >
> > No luck. I have this on top of inode_lru_isolate one but still can see
>
> And I was lucky enough to hit another BUG_ON with this kernel (the above
> patch and inode_lru_isolate-fix):
> [84091.219056] ------------[ cut here ]------------
> [84091.220015] kernel BUG at mm/list_lru.c:42!
> [84091.220015] invalid opcode: 0000 [#1] SMP
> [84091.220015] Modules linked in: edd nfsv3 nfs_acl nfs fscache lockd sunrpc af_packet bridge stp llc cpufreq_conservative cpufreq_userspace cpufreq_powersave fuse loop dm_mod powernow_k8 tg3 kvm_amd kvm ptp e1000 pps_core shpchp edac_core i2c_amd756 amd_rng pci_hotplug k8temp sg i2c_amd8111 edac_mce_amd serio_raw sr_mod pcspkr cdrom button ohci_hcd ehci_hcd usbcore usb_common processor thermal_sys scsi_dh_emc scsi_dh_rdac scsi_dh_hp_sw scsi_dh ata_generic sata_sil pata_amd
> [84091.220015] CPU 1
> [84091.220015] Pid: 32545, comm: rm Not tainted 3.9.0mmotmdebugging1+ #1472 AMD A8440/WARTHOG
> [84091.220015] RIP: 0010:[<ffffffff81127fff>] [<ffffffff81127fff>] list_lru_del+0xcf/0xe0
> [84091.220015] RSP: 0018:ffff88001de85df8 EFLAGS: 00010286
> [84091.220015] RAX: ffffffffffffffff RBX: ffff88001e1ce2c0 RCX: 0000000000000002
> [84091.220015] RDX: ffff88001e1ce2c8 RSI: ffff8800087f4220 RDI: ffff88001e1ce2c0
> [84091.220015] RBP: ffff88001de85e18 R08: 0000000000000000 R09: 0000000000000000
> [84091.220015] R10: ffff88001d539128 R11: ffff880018234882 R12: ffff8800087f4220
> [84091.220015] R13: ffff88001c68bc40 R14: 0000000000000000 R15: ffff88001de85ea8
> [84091.220015] FS: 00007f43adb30700(0000) GS:ffff88001f100000(0000) knlGS:0000000000000000
> [84091.220015] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [84091.220015] CR2: 0000000001ffed30 CR3: 000000001e02e000 CR4: 00000000000007e0
> [84091.220015] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [84091.220015] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [84091.220015] Process rm (pid: 32545, threadinfo ffff88001de84000, task ffff88001c22e5c0)
> [84091.220015] Stack:
> [84091.220015] ffff8800087f4130 ffff8800087f41b8 ffff88001c68b800 0000000000000000
> [84091.220015] ffff88001de85e48 ffffffff81184357 ffff88001de85e48 ffff8800087f4130
> [84091.220015] ffff88001e005000 ffff880014e4eb40 ffff88001de85e68 ffffffff81184418
> [84091.220015] Call Trace:
> [84091.220015] [<ffffffff81184357>] iput_final+0x117/0x190
> [84091.220015] [<ffffffff81184418>] iput+0x48/0x60
> [84091.220015] [<ffffffff8117a804>] do_unlinkat+0x214/0x240
> [84091.220015] [<ffffffff8117aa4d>] sys_unlinkat+0x1d/0x40
> [84091.220015] [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> [84091.220015] Code: 5c 41 5d b8 01 00 00 00 41 5e c9 c3 49 8d 45 08 f0 45 0f b3 75 08 eb db 0f 1f 40 00 66 83 03 01 5b 41 5c 41 5d 31 c0 41 5e c9 c3 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 ba 00 00
> [84091.220015] RIP [<ffffffff81127fff>] list_lru_del+0xcf/0xe0
> [84091.220015] RSP <ffff88001de85df8>
> [84091.470390] ---[ end trace e6915e8ee0f5f079 ]---
>
> Which is BUG_ON(nlru->nr_items < 0) from iput_final path. So it seems
> that there is still a race there.
I am still looking at this - still can't reproduce, still don't know what is going
on.
Could you share your .config, hardware info and dmesg with me? In particular, I want
to know how many nodes you have.
On Thu 20-06-13 18:11:38, Glauber Costa wrote:
[...]
> > [84091.219056] ------------[ cut here ]------------
> > [84091.220015] kernel BUG at mm/list_lru.c:42!
> > [84091.220015] invalid opcode: 0000 [#1] SMP
> > [84091.220015] Modules linked in: edd nfsv3 nfs_acl nfs fscache lockd sunrpc af_packet bridge stp llc cpufreq_conservative cpufreq_userspace cpufreq_powersave fuse loop dm_mod powernow_k8 tg3 kvm_amd kvm ptp e1000 pps_core shpchp edac_core i2c_amd756 amd_rng pci_hotplug k8temp sg i2c_amd8111 edac_mce_amd serio_raw sr_mod pcspkr cdrom button ohci_hcd ehci_hcd usbcore usb_common processor thermal_sys scsi_dh_emc scsi_dh_rdac scsi_dh_hp_sw scsi_dh ata_generic sata_sil pata_amd
> > [84091.220015] CPU 1
> > [84091.220015] Pid: 32545, comm: rm Not tainted 3.9.0mmotmdebugging1+ #1472 AMD A8440/WARTHOG
> > [84091.220015] RIP: 0010:[<ffffffff81127fff>] [<ffffffff81127fff>] list_lru_del+0xcf/0xe0
> > [84091.220015] RSP: 0018:ffff88001de85df8 EFLAGS: 00010286
> > [84091.220015] RAX: ffffffffffffffff RBX: ffff88001e1ce2c0 RCX: 0000000000000002
> > [84091.220015] RDX: ffff88001e1ce2c8 RSI: ffff8800087f4220 RDI: ffff88001e1ce2c0
> > [84091.220015] RBP: ffff88001de85e18 R08: 0000000000000000 R09: 0000000000000000
> > [84091.220015] R10: ffff88001d539128 R11: ffff880018234882 R12: ffff8800087f4220
> > [84091.220015] R13: ffff88001c68bc40 R14: 0000000000000000 R15: ffff88001de85ea8
> > [84091.220015] FS: 00007f43adb30700(0000) GS:ffff88001f100000(0000) knlGS:0000000000000000
> > [84091.220015] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [84091.220015] CR2: 0000000001ffed30 CR3: 000000001e02e000 CR4: 00000000000007e0
> > [84091.220015] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [84091.220015] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [84091.220015] Process rm (pid: 32545, threadinfo ffff88001de84000, task ffff88001c22e5c0)
> > [84091.220015] Stack:
> > [84091.220015] ffff8800087f4130 ffff8800087f41b8 ffff88001c68b800 0000000000000000
> > [84091.220015] ffff88001de85e48 ffffffff81184357 ffff88001de85e48 ffff8800087f4130
> > [84091.220015] ffff88001e005000 ffff880014e4eb40 ffff88001de85e68 ffffffff81184418
> > [84091.220015] Call Trace:
> > [84091.220015] [<ffffffff81184357>] iput_final+0x117/0x190
> > [84091.220015] [<ffffffff81184418>] iput+0x48/0x60
> > [84091.220015] [<ffffffff8117a804>] do_unlinkat+0x214/0x240
> > [84091.220015] [<ffffffff8117aa4d>] sys_unlinkat+0x1d/0x40
> > [84091.220015] [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> > [84091.220015] Code: 5c 41 5d b8 01 00 00 00 41 5e c9 c3 49 8d 45 08 f0 45 0f b3 75 08 eb db 0f 1f 40 00 66 83 03 01 5b 41 5c 41 5d 31 c0 41 5e c9 c3 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 ba 00 00
> > [84091.220015] RIP [<ffffffff81127fff>] list_lru_del+0xcf/0xe0
> > [84091.220015] RSP <ffff88001de85df8>
> > [84091.470390] ---[ end trace e6915e8ee0f5f079 ]---
> >
> > Which is BUG_ON(nlru->nr_items < 0) from iput_final path. So it seems
> > that there is still a race there.
>
> I am still looking at this - still can't reproduce, still don't know what is going
> on.
I am bisecting it again. It is quite tedious, though, because good case
is hard to be sure about.
> Could you share with me your .config and your hardware info and
> dmesg? In particular, I want to know how many nodes do you have.
Well, the machine has 4 nodes but only 2 of them are initialized because
I am booting with mem=1G to create more memory pressure. dmesg and
zoneinfo are attached, as well as the config.
--
Michal Hocko
SUSE Labs
On Thu 20-06-13 17:12:01, Michal Hocko wrote:
> On Thu 20-06-13 18:11:38, Glauber Costa wrote:
> [...]
> > > [84091.219056] ------------[ cut here ]------------
> > > [84091.220015] kernel BUG at mm/list_lru.c:42!
> > > [84091.220015] invalid opcode: 0000 [#1] SMP
> > > [84091.220015] Modules linked in: edd nfsv3 nfs_acl nfs fscache lockd sunrpc af_packet bridge stp llc cpufreq_conservative cpufreq_userspace cpufreq_powersave fuse loop dm_mod powernow_k8 tg3 kvm_amd kvm ptp e1000 pps_core shpchp edac_core i2c_amd756 amd_rng pci_hotplug k8temp sg i2c_amd8111 edac_mce_amd serio_raw sr_mod pcspkr cdrom button ohci_hcd ehci_hcd usbcore usb_common processor thermal_sys scsi_dh_emc scsi_dh_rdac scsi_dh_hp_sw scsi_dh ata_generic sata_sil pata_amd
> > > [84091.220015] CPU 1
> > > [84091.220015] Pid: 32545, comm: rm Not tainted 3.9.0mmotmdebugging1+ #1472 AMD A8440/WARTHOG
> > > [84091.220015] RIP: 0010:[<ffffffff81127fff>] [<ffffffff81127fff>] list_lru_del+0xcf/0xe0
> > > [84091.220015] RSP: 0018:ffff88001de85df8 EFLAGS: 00010286
> > > [84091.220015] RAX: ffffffffffffffff RBX: ffff88001e1ce2c0 RCX: 0000000000000002
> > > [84091.220015] RDX: ffff88001e1ce2c8 RSI: ffff8800087f4220 RDI: ffff88001e1ce2c0
> > > [84091.220015] RBP: ffff88001de85e18 R08: 0000000000000000 R09: 0000000000000000
> > > [84091.220015] R10: ffff88001d539128 R11: ffff880018234882 R12: ffff8800087f4220
> > > [84091.220015] R13: ffff88001c68bc40 R14: 0000000000000000 R15: ffff88001de85ea8
> > > [84091.220015] FS: 00007f43adb30700(0000) GS:ffff88001f100000(0000) knlGS:0000000000000000
> > > [84091.220015] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > [84091.220015] CR2: 0000000001ffed30 CR3: 000000001e02e000 CR4: 00000000000007e0
> > > [84091.220015] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [84091.220015] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > [84091.220015] Process rm (pid: 32545, threadinfo ffff88001de84000, task ffff88001c22e5c0)
> > > [84091.220015] Stack:
> > > [84091.220015] ffff8800087f4130 ffff8800087f41b8 ffff88001c68b800 0000000000000000
> > > [84091.220015] ffff88001de85e48 ffffffff81184357 ffff88001de85e48 ffff8800087f4130
> > > [84091.220015] ffff88001e005000 ffff880014e4eb40 ffff88001de85e68 ffffffff81184418
> > > [84091.220015] Call Trace:
> > > [84091.220015] [<ffffffff81184357>] iput_final+0x117/0x190
> > > [84091.220015] [<ffffffff81184418>] iput+0x48/0x60
> > > [84091.220015] [<ffffffff8117a804>] do_unlinkat+0x214/0x240
> > > [84091.220015] [<ffffffff8117aa4d>] sys_unlinkat+0x1d/0x40
> > > [84091.220015] [<ffffffff81583129>] system_call_fastpath+0x16/0x1b
> > > [84091.220015] Code: 5c 41 5d b8 01 00 00 00 41 5e c9 c3 49 8d 45 08 f0 45 0f b3 75 08 eb db 0f 1f 40 00 66 83 03 01 5b 41 5c 41 5d 31 c0 41 5e c9 c3 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 ba 00 00
> > > [84091.220015] RIP [<ffffffff81127fff>] list_lru_del+0xcf/0xe0
> > > [84091.220015] RSP <ffff88001de85df8>
> > > [84091.470390] ---[ end trace e6915e8ee0f5f079 ]---
> > >
> > > Which is BUG_ON(nlru->nr_items < 0) from iput_final path. So it seems
> > > that there is still a race there.
> >
> > I am still looking at this - still can't reproduce, still don't know what is going
> > on.
>
> I am bisecting it again. It is quite tedious, though, because good case
> is hard to be sure about.
And my test case runs the following (there is an identical B.run; each of
them runs in its own group):
#JOBS=4
#KERNEL_CONFIG="./config"
#KERNEL_TAR="./linux-3.7-rc5.tar.bz2"
#KERNEL_OUT="build/$CGROUP/kernel"
#CGROUP=A
$ cat A.run
KERNEL_DIR="$KERNEL_OUT/${KERNEL_TAR%.tar.bz2}"
mkdir -p "$KERNEL_DIR"
tar -xf $KERNEL_TAR -C $KERNEL_OUT || fail "get the source for $KERNEL_TAR->$KERNEL_OUT"
cp "$KERNEL_CONFIG" "$KERNEL_DIR/.config" || fail "Get the config"
LOG="`pwd`/$LOG_OUT_DIR"
mkdir -p "$LOG"
old_path="`pwd`"
cd "$KERNEL_DIR"
info "$CGROUP starting build jobs:$JOBS"
TIMESTAMP=`date +%s`
( /usr/bin/time -v make -j$JOBS vmlinux >/dev/null ) > $LOG/time.$CGROUP.$TIMESTAMP 2>&1 || fail "Build the kernel at $KERNEL_DIR"
cd "$old_path"
rm -rf "$KERNEL_OUT"
sync
echo 3 > /proc/sys/vm/drop_caches
--
Michal Hocko
SUSE Labs
On Thu 20-06-13 17:12:01, Michal Hocko wrote:
> I am bisecting it again. It is quite tedious, though, because good case
> is hard to be sure about.
OK, so now I have converged to 2d4fc052 (inode: convert inode lru list to generic lru
list code.) in my tree and I have double checked that it matches what is in
linux-next. This doesn't help much to pinpoint the issue, I am
afraid :/
I have applied the inode_lru_isolate fix on each step.
$ git bisect log
git bisect start
# bad: [d02c11c146b626cf8e2586446773ba02999e4e2f] mm/sparse.c: put clear_hwpoisoned_pages within CONFIG_MEMORY_HOTREMOVE
git bisect bad d02c11c146b626cf8e2586446773ba02999e4e2f
# good: [58f6e0c8fb37e8e37d5ac17a61a53ac236c15047] Reverted "mm: tlb_fast_mode check missing in tlb_finish_mmu()"
git bisect good 58f6e0c8fb37e8e37d5ac17a61a53ac236c15047
# bad: [4ec7ecd30d643b12e1041226ff180da3d88918ee] include/linux/math64.h: add div64_ul()
git bisect bad 4ec7ecd30d643b12e1041226ff180da3d88918ee
# bad: [96dd4e69dc50c7ed18e407798f7f677fa5588eae] xfs: convert dquot cache lru to list_lru
git bisect bad 96dd4e69dc50c7ed18e407798f7f677fa5588eae
# bad: [2d4fc052823c2f598f03633e64bc0439cd2bfa04] inode: convert inode lru list to generic lru list code.
git bisect bad 2d4fc052823c2f598f03633e64bc0439cd2bfa04
# good: [cacbac6d9d80cc5277e8f67c7a474edb6488a5ea] dcache: remove dentries from LRU before putting on dispose list
git bisect good cacbac6d9d80cc5277e8f67c7a474edb6488a5ea
# good: [03a05514e71551bfdff1e3496a30b0fd5083f8fe] shrinker: convert superblock shrinkers to new API
git bisect good 03a05514e71551bfdff1e3496a30b0fd5083f8fe
# good: [ddc5bc7a8856e0e61ea9c2d4fcbcd3f85ecc92e7] list: add a new LRU list type
git bisect good ddc5bc7a8856e0e61ea9c2d4fcbcd3f85ecc92e7
--
Michal Hocko
SUSE Labs
On Fri, Jun 21, 2013 at 11:00:21AM +0200, Michal Hocko wrote:
> On Thu 20-06-13 17:12:01, Michal Hocko wrote:
> > I am bisecting it again. It is quite tedious, though, because good case
> > is hard to be sure about.
>
> OK, so now I converged to 2d4fc052 (inode: convert inode lru list to generic lru
> list code.) in my tree and I have double checked it matches what is in
> the linux-next. This doesn't help much to pin point the issue I am
> afraid :/
>
Can you revert this patch (easiest way ATM is to rewind your tree to a point
right before it) and apply the following patch?
As Dave has mentioned, it is very likely that this bug was already there; we
were just never checking for imbalances. The attached patch would at least tell
us whether the imbalance was there before. If that is the case, I would suggest
turning the BUG condition into a WARN_ON_ONCE, since we would then officially
not be introducing any regression. It is no less of a bug, though, and we should
keep looking for it.
The main change between before and after the patch is that we now keep things
per node. One possibility for this BUG firing would be an inode being
inserted into one node LRU and removed from another. I cannot see how that could
happen, because kernel pages are stable in memory and are not moved from node
to node. We could still have some sort of weird bug in the node calculation
function. In any case, would it be possible for you to artificially restrict
your setup to a single node? I have no idea how to do that, though; we seem
to have no parameter to disable NUMA. Maybe booting with less memory, enough to
fit a single node?
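(To make the node calculation mentioned above concrete: the generic LRU
derives the node from the address of the embedded list_head, roughly as
sketched below. This is a paraphrase of the add/del paths, not the exact
patchset code; if the two sides ever computed different nids for the same
object, one node's counter would go negative and trip the check.)

int list_lru_add(struct list_lru *lru, struct list_head *item)
{
	int nid = page_to_nid(virt_to_page(item));	/* node of the page backing the object */
	struct list_lru_node *nlru = &lru->node[nid];

	spin_lock(&nlru->lock);
	if (list_empty(item)) {
		list_add_tail(item, &nlru->list);
		nlru->nr_items++;
		spin_unlock(&nlru->lock);
		return 1;
	}
	spin_unlock(&nlru->lock);
	return 0;
}

int list_lru_del(struct list_lru *lru, struct list_head *item)
{
	int nid = page_to_nid(virt_to_page(item));	/* must match the nid used at add time */
	struct list_lru_node *nlru = &lru->node[nid];

	spin_lock(&nlru->lock);
	if (!list_empty(item)) {
		list_del_init(item);
		nlru->nr_items--;
		BUG_ON(nlru->nr_items < 0);	/* roughly where the reported mm/list_lru.c check sits */
		spin_unlock(&nlru->lock);
		return 1;
	}
	spin_unlock(&nlru->lock);
	return 0;
}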
diff --git a/fs/inode.c b/fs/inode.c
index 1ddaa2e..0b5c3fa 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -427,6 +427,7 @@ static void inode_lru_list_del(struct inode *inode)
if (!list_empty(&inode->i_lru)) {
list_del_init(&inode->i_lru);
inode->i_sb->s_nr_inodes_unused--;
+ BUG_ON(inode->i_sb->s_nr_inodes_unused < 0);
this_cpu_dec(nr_unused);
}
spin_unlock(&inode->i_sb->s_inode_lru_lock);
@@ -739,6 +740,7 @@ long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan)
list_del_init(&inode->i_lru);
spin_unlock(&inode->i_lock);
sb->s_nr_inodes_unused--;
+ BUG_ON(sb->s_nr_inodes_unused < 0);
this_cpu_dec(nr_unused);
continue;
}
@@ -777,6 +779,7 @@ long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan)
list_move(&inode->i_lru, &freeable);
sb->s_nr_inodes_unused--;
+ BUG_ON(sb->s_nr_inodes_unused < 0);
this_cpu_dec(nr_unused);
freed++;
}
On Tue, Jun 18, 2013 at 03:50:25PM +0200, Michal Hocko wrote:
> And again, another hang. It looks like the inode deletion never
> finishes. The good thing is that I do not see any LRU related BUG_ONs
> anymore. I am going to test with the other patch in the thread.
>
> 2476 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0 <<< waiting for an inode to go away
> [<ffffffff81183321>] find_inode_fast+0xa1/0xc0
> [<ffffffff8118525f>] iget_locked+0x4f/0x180
> [<ffffffff811ef9e3>] ext4_iget+0x33/0x9f0
> [<ffffffff811f6a1c>] ext4_lookup+0xbc/0x160
> [<ffffffff81174ad0>] lookup_real+0x20/0x60
> [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> [<ffffffff8117815e>] do_last+0x2de/0x780 <<< holds i_mutex
> [<ffffffff8117ae9a>] path_openat+0xda/0x400
> [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> [<ffffffff81168f9c>] sys_open+0x1c/0x20
> [<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
I don't think this has anything to do with LRUs.
__wait_on_freeing_inode() only blocks once the inode is being freed
(i.e. I_FREEING is set), and that happens when a lookup is done while
the inode is still in the inode hash.
I_FREEING is set on the inode at the same time it is removed from
the LRU, and from that point onwards the LRUs play no part in the
inode being freed and anyone waiting on the inode being freed
getting woken.
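(For reference, the waiter in question looks roughly like this; paraphrased
from fs/inode.c and simplified. The matching wakeup is the
wake_up_bit(&inode->i_state, __I_NEW) done at the end of evict(), after
remove_inode_hash().)

static void __wait_on_freeing_inode(struct inode *inode)
{
	wait_queue_head_t *wq;
	DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);

	wq = bit_waitqueue(&inode->i_state, __I_NEW);
	prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
	spin_unlock(&inode->i_lock);
	spin_unlock(&inode_hash_lock);
	schedule();			/* sleeps until the freeing side wakes the bit waitqueue */
	finish_wait(wq, &wait.wait);
	spin_lock(&inode_hash_lock);
}

So a hang here means some inode had I_FREEING set but never reached the
wake_up_bit() in evict().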
The only way I can see this happening is if there is a dispose list
that is not getting processed properly. E.g., we move a bunch of
inodes to the dispose list setting I_FREEING, then for some reason
it gets dropped on the ground and so the wakeup call doesn't happen
when the inode has been removed from the hash.
I can't see anywhere in the code where this happens, though. It
might be some pre-existing race in the inode hash that you are now
triggering because freeing will be happening in parallel on multiple
nodes rather than serialising on a global lock...
I won't have seen this on XFS stress testing, because it doesn't use
the VFS inode hashes for inode lookups. Given that XFS is not
triggering either problem you are seeing, that makes me think
that it might be a pre-existing inode hash lookup/reclaim race
condition, not a LRU problem.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Sun, Jun 23, 2013 at 03:51:29PM +0400, Glauber Costa wrote:
> On Fri, Jun 21, 2013 at 11:00:21AM +0200, Michal Hocko wrote:
> > On Thu 20-06-13 17:12:01, Michal Hocko wrote:
> > > I am bisecting it again. It is quite tedious, though, because good case
> > > is hard to be sure about.
> >
> > OK, so now I converged to 2d4fc052 (inode: convert inode lru list to generic lru
> > list code.) in my tree and I have double checked it matches what is in
> > the linux-next. This doesn't help much to pin point the issue I am
> > afraid :/
> >
> Can you revert this patch (easiest way ATM is to rewind your tree to a point
> right before it) and apply the following patch?
>
> As Dave has mentioned, it is very likely that this bug was already there, we
> were just not ever checking imbalances. The attached patch would tell us at
> least if the imbalance was there before. If this is the case, I would suggest
> turning the BUG condition into a WARN_ON_ONCE since we would be officially
> not introducing any regression. It is no less of a bug, though, and we should
> keep looking for it.
We probably should do that BUG->WARN change anyway. BUG_ON is pretty
obnoxious in places where we can probably continue on without much
impact....
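(Concretely, something along these lines; illustrative only, the exact
context in mm/list_lru.c depends on the tree the series is applied to:)

--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ ... @@
-	BUG_ON(nlru->nr_items < 0);
+	WARN_ON_ONCE(nlru->nr_items < 0);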
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue 25-06-13 12:27:54, Dave Chinner wrote:
> On Tue, Jun 18, 2013 at 03:50:25PM +0200, Michal Hocko wrote:
> > And again, another hang. It looks like the inode deletion never
> > finishes. The good thing is that I do not see any LRU related BUG_ONs
> > anymore. I am going to test with the other patch in the thread.
> >
> > 2476 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0 <<< waiting for an inode to go away
> > [<ffffffff81183321>] find_inode_fast+0xa1/0xc0
> > [<ffffffff8118525f>] iget_locked+0x4f/0x180
> > [<ffffffff811ef9e3>] ext4_iget+0x33/0x9f0
> > [<ffffffff811f6a1c>] ext4_lookup+0xbc/0x160
> > [<ffffffff81174ad0>] lookup_real+0x20/0x60
> > [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> > [<ffffffff8117815e>] do_last+0x2de/0x780 <<< holds i_mutex
> > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > [<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
>
> I don't think this has anything to do with LRUs.
I am not claiming that. It might be a timing issue which never mattered
before, but it is strange that I can reproduce this so easily and
repeatedly with the shrinkers patchset applied.
As I said earlier, this might be breakage in my -mm tree as well (a
missing patch which didn't go via Andrew, or a misapplied patch). The
situation is worsened by the state of linux-next, which has some
unrelated issues.
I really do not want to delay the whole patchset just because of some
problem on my side. Do you have any tree that I should try to test?
> __wait_on_freeing_inode() only blocks once the inode is being freed
> (i.e. I_FREEING is set), and that happens when a lookup is done when
> the inode is still in the inode hash.
>
> I_FREEING is set on the inode at the same time it is removed from
> the LRU, and from that point onwards the LRUs play no part in the
> inode being freed and anyone waiting on the inode being freed
> getting woken.
>
> The only way I can see this happening, is if there is a dispose list
> that is not getting processed properly. e.g., we move a bunch on
> inodes to the dispose list setting I_FREEING, then for some reason
> it gets dropped on the ground and so the wakeup call doesn't happen
> when the inode has been removed from the hash.
>
> I can't see anywhere in the code that this happens, though, but it
> might be some pre-existing race in the inode hash that you are now
> triggering because freeing will be happening in parallel on multiple
> nodes rather than serialising on a global lock...
>
> I won't have seen this on XFS stress testing, because it doesn't use
> the VFS inode hashes for inode lookups. Given that XFS is not
> triggering either problem you are seeing, that makes me think
I haven't tested with xfs.
> that it might be a pre-existing inode hash lookup/reclaim race
> condition, not a LRU problem.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
--
Michal Hocko
SUSE Labs
On Sun 23-06-13 15:51:29, Glauber Costa wrote:
> On Fri, Jun 21, 2013 at 11:00:21AM +0200, Michal Hocko wrote:
> > On Thu 20-06-13 17:12:01, Michal Hocko wrote:
> > > I am bisecting it again. It is quite tedious, though, because good case
> > > is hard to be sure about.
> >
> > OK, so now I converged to 2d4fc052 (inode: convert inode lru list to generic lru
> > list code.) in my tree and I have double checked it matches what is in
> > the linux-next. This doesn't help much to pin point the issue I am
> > afraid :/
> >
> Can you revert this patch (easiest way ATM is to rewind your tree to a point
> right before it) and apply the following patch?
OK, I am testing it now.
> As Dave has mentioned, it is very likely that this bug was already there, we
> were just not ever checking imbalances. The attached patch would tell us at
> least if the imbalance was there before.
Maybe I wasn't clear before, but I have seen mostly hangs (posted
earlier) during bisection. I do not remember BUG_ONs on imbalances after
the inode_lru_isolate fix.
> If this is the case, I would suggest
> turning the BUG condition into a WARN_ON_ONCE since we would be officially
> not introducing any regression. It is no less of a bug, though, and we should
> keep looking for it.
>
> The main change from before / after the patch is that we are now keeping things
> per node. One possibility of having this BUGing would be to have an inode to be
> inserted into one node-lru and removed from another. I cannot see how it could
> happen, because kernel pages are stable in memory and are not moved from node
> to node. We could still have some sort of weird bug in the node calculation
> function.
> In any case, would it be possible for you to artificially restrict
> your setup to a single node ? Although I have no idea how to do that, we seem
> to have no parameter to disable numa. Maybe booting with less memory, enough to
> fit a single node?
I can play with memmap to use areas from a single node. Let's see
whether the patch in the follow-up email shows something first.
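(For the record, one way to do that from the boot command line is to describe
the usable RAM explicitly, along the lines of

	memmap=exactmap memmap=640K@0 memmap=1G@0x100000000

with the addresses taken from the machine's own e820/SRAT layout so that the
chosen range sits on a single node; plain mem=1G only caps the total amount
and may still span nodes. This is only a sketch of the idea, not a tested
command line for this box.)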
--
Michal Hocko
SUSE Labs
On Wed, Jun 26, 2013 at 10:15:09AM +0200, Michal Hocko wrote:
> On Tue 25-06-13 12:27:54, Dave Chinner wrote:
> > On Tue, Jun 18, 2013 at 03:50:25PM +0200, Michal Hocko wrote:
> > > And again, another hang. It looks like the inode deletion never
> > > finishes. The good thing is that I do not see any LRU related BUG_ONs
> > > anymore. I am going to test with the other patch in the thread.
> > >
> > > 2476 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0 <<< waiting for an inode to go away
> > > [<ffffffff81183321>] find_inode_fast+0xa1/0xc0
> > > [<ffffffff8118525f>] iget_locked+0x4f/0x180
> > > [<ffffffff811ef9e3>] ext4_iget+0x33/0x9f0
> > > [<ffffffff811f6a1c>] ext4_lookup+0xbc/0x160
> > > [<ffffffff81174ad0>] lookup_real+0x20/0x60
> > > [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> > > [<ffffffff8117815e>] do_last+0x2de/0x780 <<< holds i_mutex
> > > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > > [<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
> > > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > I don't think this has anything to do with LRUs.
>
> I am not claiming that. It might be a timing issue which never mattered
> but it is strange I can reproduce this so easily and repeatedly with the
> shrinkers patchset applied.
> As I said earlier, this might be breakage in my -mm tree as well
> (missing some patch which didn't go via Andrew or misapplied patch). The
> situation is worsen by the state of linux-next which has some unrelated
> issues.
>
> I really do not want to delay the whole patchset just because of some
> problem on my side. Do you have any tree that I should try to test?
No, I've just been testing Glauber's tree and sending patches for
problems back to him based on it.
> > I won't have seen this on XFS stress testing, because it doesn't use
> > the VFS inode hashes for inode lookups. Given that XFS is not
> > triggering either problem you are seeing, that makes me think
>
> I haven't tested with xfs.
That might be worthwhile if you can easily do that - another data
point indicating a hang or absence of a hang will help point us in
the right direction here...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu 27-06-13 09:24:26, Dave Chinner wrote:
> On Wed, Jun 26, 2013 at 10:15:09AM +0200, Michal Hocko wrote:
> > On Tue 25-06-13 12:27:54, Dave Chinner wrote:
> > > On Tue, Jun 18, 2013 at 03:50:25PM +0200, Michal Hocko wrote:
> > > > And again, another hang. It looks like the inode deletion never
> > > > finishes. The good thing is that I do not see any LRU related BUG_ONs
> > > > anymore. I am going to test with the other patch in the thread.
> > > >
> > > > 2476 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0 <<< waiting for an inode to go away
> > > > [<ffffffff81183321>] find_inode_fast+0xa1/0xc0
> > > > [<ffffffff8118525f>] iget_locked+0x4f/0x180
> > > > [<ffffffff811ef9e3>] ext4_iget+0x33/0x9f0
> > > > [<ffffffff811f6a1c>] ext4_lookup+0xbc/0x160
> > > > [<ffffffff81174ad0>] lookup_real+0x20/0x60
> > > > [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> > > > [<ffffffff8117815e>] do_last+0x2de/0x780 <<< holds i_mutex
> > > > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > > > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > > > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > > > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > > > [<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
> > > > [<ffffffffffffffff>] 0xffffffffffffffff
> > >
> > > I don't think this has anything to do with LRUs.
> >
> > I am not claiming that. It might be a timing issue which never mattered
> > but it is strange I can reproduce this so easily and repeatedly with the
> > shrinkers patchset applied.
> > As I said earlier, this might be breakage in my -mm tree as well
> > (missing some patch which didn't go via Andrew or misapplied patch). The
> > situation is worsen by the state of linux-next which has some unrelated
> > issues.
> >
> > I really do not want to delay the whole patchset just because of some
> > problem on my side. Do you have any tree that I should try to test?
>
> No, I've just been testing Glauber's tree and sending patches for
> problems back to him based on it.
>
> > > I won't have seen this on XFS stress testing, because it doesn't use
> > > the VFS inode hashes for inode lookups. Given that XFS is not
> > > triggering either problem you are seeing, that makes me think
> >
> > I haven't tested with xfs.
>
> That might be worthwhile if you can easily do that - another data
> point indicating a hang or absence of a hang will help point us in
> the right direction here...
OK, still hanging (with inode_lru_isolate-fix.patch). It is not the same
thing, though, as xfs seems to do lookups slightly differently.
12467 [<ffffffffa02ca03e>] xfs_iget+0xbe/0x190 [xfs]
[<ffffffffa02d6e98>] xfs_lookup+0xe8/0x110 [xfs]
[<ffffffffa02cdad9>] xfs_vn_lookup+0x49/0x90 [xfs]
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81177e25>] lookup_open+0x175/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
12667 [<ffffffffa02ca03e>] xfs_iget+0xbe/0x190 [xfs]
[<ffffffffa02d6e98>] xfs_lookup+0xe8/0x110 [xfs]
[<ffffffffa02cdad9>] xfs_vn_lookup+0x49/0x90 [xfs]
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81177e25>] lookup_open+0x175/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
13830 [<ffffffffa02ca03e>] xfs_iget+0xbe/0x190 [xfs]
[<ffffffffa02d6e98>] xfs_lookup+0xe8/0x110 [xfs]
[<ffffffffa02cdad9>] xfs_vn_lookup+0x49/0x90 [xfs]
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81175254>] __lookup_hash+0x34/0x40
[<ffffffff81179872>] path_lookupat+0x7a2/0x830
[<ffffffff81179933>] filename_lookup+0x33/0xd0
[<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
[<ffffffff8117ab4c>] user_path_at+0xc/0x10
[<ffffffff81169c30>] sys_faccessat+0xe0/0x230
[<ffffffff81169d93>] sys_access+0x13/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
13913 [<ffffffffa02ca03e>] xfs_iget+0xbe/0x190 [xfs]
[<ffffffffa02d6e98>] xfs_lookup+0xe8/0x110 [xfs]
[<ffffffffa02cdad9>] xfs_vn_lookup+0x49/0x90 [xfs]
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81177e25>] lookup_open+0x175/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
14245 [<ffffffff81178144>] do_last+0x2c4/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
14365 [<ffffffffa02ca03e>] xfs_iget+0xbe/0x190 [xfs]
[<ffffffffa02d6e98>] xfs_lookup+0xe8/0x110 [xfs]
[<ffffffffa02cdad9>] xfs_vn_lookup+0x49/0x90 [xfs]
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81177e25>] lookup_open+0x175/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
14384 [<ffffffff81179862>] path_lookupat+0x792/0x830
[<ffffffff81179933>] filename_lookup+0x33/0xd0
[<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
[<ffffffff8117ab4c>] user_path_at+0xc/0x10
[<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
[<ffffffff81170116>] vfs_stat+0x16/0x20
[<ffffffff8117013f>] sys_newstat+0x1f/0x50
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
14413 [<ffffffff81179862>] path_lookupat+0x792/0x830
[<ffffffff81179933>] filename_lookup+0x33/0xd0
[<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
[<ffffffff8117ab4c>] user_path_at+0xc/0x10
[<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
[<ffffffff81170116>] vfs_stat+0x16/0x20
[<ffffffff8117013f>] sys_newstat+0x1f/0x50
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
--
Michal Hocko
SUSE Labs
I have just triggered this one.
[37955.354041] BUG: unable to handle kernel paging request at 000000572ead7838
[37955.356032] IP: [<ffffffff81127e5b>] list_lru_walk_node+0xab/0x140
[37955.364062] PGD 2bf0a067 PUD 0
[37955.364062] Oops: 0000 [#1] SMP
[37955.364062] Modules linked in: edd nfsv3 nfs_acl nfs fscache lockd sunrpc af_packet bridge stp llc cpufreq_conservative cpufreq_userspace cpufreq_powersave fuse xfs libcrc32c loop dm_mod tg3 ptp powernow_k8 pps_core e1000 kvm_amd shpchp kvm edac_core i2c_amd756 pci_hotplug i2c_amd8111 sg edac_mce_amd amd_rng k8temp sr_mod pcspkr cdrom serio_raw button ohci_hcd ehci_hcd usbcore usb_common processor thermal_sys scsi_dh_emc scsi_dh_rdac scsi_dh_hp_sw scsi_dh ata_generic sata_sil pata_amd
[37955.364062] CPU 3
[37955.364062] Pid: 3351, comm: as Not tainted 3.9.0mmotm+ #1490 AMD A8440/WARTHOG
[37955.364062] RIP: 0010:[<ffffffff81127e5b>] [<ffffffff81127e5b>] list_lru_walk_node+0xab/0x140
[37955.364062] RSP: 0000:ffff8800374af7b8 EFLAGS: 00010286
[37955.364062] RAX: 0000000000000106 RBX: ffff88002ead7838 RCX: ffff8800374af830
[37955.364062] RDX: 0000000000000107 RSI: ffff88001d250dc0 RDI: ffff88002ead77d0
[37955.364062] RBP: ffff8800374af818 R08: 0000000000000000 R09: ffff88001ffeafc0
[37955.364062] R10: 0000000000000002 R11: 0000000000000000 R12: ffff88001d250dc0
[37955.364062] R13: 00000000000000a0 R14: 000000572ead7838 R15: ffff88001d250dc8
[37955.364062] FS: 00002aaaaaadb100(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
[37955.364062] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[37955.364062] CR2: 000000572ead7838 CR3: 0000000036f61000 CR4: 00000000000007e0
[37955.364062] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[37955.364062] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[37955.364062] Process as (pid: 3351, threadinfo ffff8800374ae000, task ffff880036d665c0)
[37955.364062] Stack:
[37955.364062] ffff88001da3e700 ffff8800374af830 ffff8800374af838 ffffffff811846d0
[37955.364062] 0000000000000000 ffff88001ce75c48 01ff8800374af838 ffff8800374af838
[37955.364062] 0000000000000000 ffff88001ce75800 ffff8800374afa08 0000000000001014
[37955.364062] Call Trace:
[37955.364062] [<ffffffff811846d0>] ? insert_inode_locked+0x160/0x160
[37955.364062] [<ffffffff8118496c>] prune_icache_sb+0x3c/0x60
[37955.364062] [<ffffffff8116dcbe>] super_cache_scan+0x12e/0x1b0
[37955.364062] [<ffffffff8111354a>] shrink_slab_node+0x13a/0x250
[37955.364062] [<ffffffff8111671b>] shrink_slab+0xab/0x120
[37955.364062] [<ffffffff81117944>] do_try_to_free_pages+0x264/0x360
[37955.364062] [<ffffffff81117d90>] try_to_free_pages+0x130/0x180
[37955.364062] [<ffffffff81001974>] ? __switch_to+0x1b4/0x550
[37955.364062] [<ffffffff8110a2fe>] __alloc_pages_slowpath+0x39e/0x790
[37955.364062] [<ffffffff8110a8ea>] __alloc_pages_nodemask+0x1fa/0x210
[37955.364062] [<ffffffff8114d1b0>] alloc_pages_vma+0xa0/0x120
[37955.364062] [<ffffffff81129ebb>] do_anonymous_page+0x16b/0x350
[37955.364062] [<ffffffff8112f9c5>] handle_pte_fault+0x235/0x240
[37955.364062] [<ffffffff8107b8b0>] ? set_next_entity+0xb0/0xd0
[37955.364062] [<ffffffff8112fcbf>] handle_mm_fault+0x2ef/0x400
[37955.364062] [<ffffffff8157e927>] __do_page_fault+0x237/0x4f0
[37955.364062] [<ffffffff8116a8a8>] ? fsnotify_access+0x68/0x80
[37955.364062] [<ffffffff8116b0b8>] ? vfs_read+0xd8/0x130
[37955.364062] [<ffffffff8157ebe9>] do_page_fault+0x9/0x10
[37955.364062] [<ffffffff8157b348>] page_fault+0x28/0x30
[37955.364062] Code: 44 24 18 0f 84 87 00 00 00 49 83 7c 24 18 00 78 7b 49 83 c5 01 48 8b 4d a8 48 8b 11 48 8d 42 ff 48 85 d2 48 89 01 74 78 4d 39 f7 <49> 8b 06 4c 89 f3 74 6d 49 89 c6 eb a6 0f 1f 84 00 00 00 00 00
[37955.364062] RIP [<ffffffff81127e5b>] list_lru_walk_node+0xab/0x140
ffffffff81127e0e: 48 8b 55 b0 mov -0x50(%rbp),%rdx
ffffffff81127e12: 4c 89 e6 mov %r12,%rsi
ffffffff81127e15: 48 89 df mov %rbx,%rdi
ffffffff81127e18: ff 55 b8 callq *-0x48(%rbp) # isolate(item, &nlru->lock, cb_arg)
ffffffff81127e1b: 83 f8 01 cmp $0x1,%eax
ffffffff81127e1e: 74 78 je ffffffff81127e98 <list_lru_walk_node+0xe8>
ffffffff81127e20: 73 4e jae ffffffff81127e70 <list_lru_walk_node+0xc0>
[...]
ffffffff81127e45: 48 8b 4d a8 mov -0x58(%rbp),%rcx # LRU_ROTATE:
ffffffff81127e49: 48 8b 11 mov (%rcx),%rdx
ffffffff81127e4c: 48 8d 42 ff lea -0x1(%rdx),%rax
ffffffff81127e50: 48 85 d2 test %rdx,%rdx # if ((*nr_to_walk)-- == 0)
ffffffff81127e53: 48 89 01 mov %rax,(%rcx)
ffffffff81127e56: 74 78 je ffffffff81127ed0 <list_lru_walk_node+0x120>
ffffffff81127e58: 4d 39 f7 cmp %r14,%r15
ffffffff81127e5b: 49 8b 06 mov (%r14),%rax <<< BANG
ffffffff81127e5e: 4c 89 f3 mov %r14,%rbx
ffffffff81127e61: 74 6d je ffffffff81127ed0 <list_lru_walk_node+0x120>
ffffffff81127e63: 49 89 c6 mov %rax,%r14
ffffffff81127e66: eb a6 jmp ffffffff81127e0e <list_lru_walk_node+0x5e>
[...]
ffffffff81127e70: 83 f8 02 cmp $0x2,%eax
ffffffff81127e73: 74 d0 je ffffffff81127e45 <list_lru_walk_node+0x95>
ffffffff81127e75: 83 f8 03 cmp $0x3,%eax
ffffffff81127e78: 74 06 je ffffffff81127e80 <list_lru_walk_node+0xd0>
ffffffff81127e7a: 0f 0b ud2
[...]
ffffffff81127ed0: 66 41 83 04 24 01 addw $0x1,(%r12)
ffffffff81127ed6: 48 83 c4 38 add $0x38,%rsp
ffffffff81127eda: 4c 89 e8 mov %r13,%rax
ffffffff81127edd: 5b pop %rbx
ffffffff81127ede: 41 5c pop %r12
ffffffff81127ee0: 41 5d pop %r13
ffffffff81127ee2: 41 5e pop %r14
ffffffff81127ee4: 41 5f pop %r15
ffffffff81127ee6: c9 leaveq
ffffffff81127ee7: c3 retq
We are tripping over in list_for_each_safe and r14 (000000572ead7838) is
obviously garbage. So the lru is clobbered?
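For reference, the C this disassembly corresponds to looks roughly like
the following (reconstructed from the annotated asm above and from
mm/list_lru.c in this tree; the nr_items bookkeeping is elided, so treat
it as a sketch rather than a verbatim copy):

	struct list_lru_node *nlru = &lru->node[nid];
	struct list_head *item, *n;
	unsigned long isolated = 0;
	bool first_pass = true;

	spin_lock(&nlru->lock);
restart:
	list_for_each_safe(item, n, &nlru->list) {
		enum lru_status ret;

		/* the callback may drop and re-take nlru->lock internally */
		ret = isolate(item, &nlru->lock, cb_arg);
		switch (ret) {
		case LRU_REMOVED:
			isolated++;		/* nr_items accounting elided */
			break;
		case LRU_ROTATE:
			list_move_tail(item, &nlru->list);
			break;
		case LRU_SKIP:
			break;
		case LRU_RETRY:
			if (!first_pass) {	/* second retry on this walk:  */
				first_pass = true;
				break;		/* keep going via the cached n */
			}
			first_pass = false;
			goto restart;
		default:
			BUG();
		}

		if ((*nr_to_walk)-- == 0)
			break;
	}
	spin_unlock(&nlru->lock);

list_for_each_safe() loads the next pointer into n (%r14 here) before
isolate() runs and nothing revalidates it afterwards, so if n gets
unlinked or freed while the lock is dropped, the loop advance
dereferences stale memory - which is exactly the faulting
mov (%r14),%rax.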
--
Michal Hocko
SUSE Labs
On Fri, Jun 28, 2013 at 10:39:43AM +0200, Michal Hocko wrote:
> I have just triggered this one.
>
> [37955.364062] RIP: 0010:[<ffffffff81127e5b>] [<ffffffff81127e5b>] list_lru_walk_node+0xab/0x140
> [37955.364062] RSP: 0000:ffff8800374af7b8 EFLAGS: 00010286
> [37955.364062] RAX: 0000000000000106 RBX: ffff88002ead7838 RCX: ffff8800374af830
Note ebx
> [37955.364062] RDX: 0000000000000107 RSI: ffff88001d250dc0 RDI: ffff88002ead77d0
> [37955.364062] RBP: ffff8800374af818 R08: 0000000000000000 R09: ffff88001ffeafc0
> [37955.364062] R10: 0000000000000002 R11: 0000000000000000 R12: ffff88001d250dc0
> [37955.364062] R13: 00000000000000a0 R14: 000000572ead7838 R15: ffff88001d250dc8
Note r14
> [37955.364062] Process as (pid: 3351, threadinfo ffff8800374ae000, task ffff880036d665c0)
> [37955.364062] Stack:
> [37955.364062] ffff88001da3e700 ffff8800374af830 ffff8800374af838 ffffffff811846d0
> [37955.364062] 0000000000000000 ffff88001ce75c48 01ff8800374af838 ffff8800374af838
> [37955.364062] 0000000000000000 ffff88001ce75800 ffff8800374afa08 0000000000001014
> [37955.364062] Call Trace:
> [37955.364062] [<ffffffff811846d0>] ? insert_inode_locked+0x160/0x160
> [37955.364062] [<ffffffff8118496c>] prune_icache_sb+0x3c/0x60
> [37955.364062] [<ffffffff8116dcbe>] super_cache_scan+0x12e/0x1b0
> [37955.364062] [<ffffffff8111354a>] shrink_slab_node+0x13a/0x250
> [37955.364062] [<ffffffff8111671b>] shrink_slab+0xab/0x120
> [37955.364062] [<ffffffff81117944>] do_try_to_free_pages+0x264/0x360
> [37955.364062] [<ffffffff81117d90>] try_to_free_pages+0x130/0x180
> [37955.364062] [<ffffffff81001974>] ? __switch_to+0x1b4/0x550
> [37955.364062] [<ffffffff8110a2fe>] __alloc_pages_slowpath+0x39e/0x790
> [37955.364062] [<ffffffff8110a8ea>] __alloc_pages_nodemask+0x1fa/0x210
> [37955.364062] [<ffffffff8114d1b0>] alloc_pages_vma+0xa0/0x120
> [37955.364062] [<ffffffff81129ebb>] do_anonymous_page+0x16b/0x350
> [37955.364062] [<ffffffff8112f9c5>] handle_pte_fault+0x235/0x240
> [37955.364062] [<ffffffff8107b8b0>] ? set_next_entity+0xb0/0xd0
> [37955.364062] [<ffffffff8112fcbf>] handle_mm_fault+0x2ef/0x400
> [37955.364062] [<ffffffff8157e927>] __do_page_fault+0x237/0x4f0
> [37955.364062] [<ffffffff8116a8a8>] ? fsnotify_access+0x68/0x80
> [37955.364062] [<ffffffff8116b0b8>] ? vfs_read+0xd8/0x130
> [37955.364062] [<ffffffff8157ebe9>] do_page_fault+0x9/0x10
> [37955.364062] [<ffffffff8157b348>] page_fault+0x28/0x30
> [37955.364062] Code: 44 24 18 0f 84 87 00 00 00 49 83 7c 24 18 00 78 7b 49 83 c5 01 48 8b 4d a8 48 8b 11 48 8d 42 ff 48 85 d2 48 89 01 74 78 4d 39 f7 <49> 8b 06 4c 89 f3 74 6d 49 89 c6 eb a6 0f 1f 84 00 00 00 00 00
> [37955.364062] RIP [<ffffffff81127e5b>] list_lru_walk_node+0xab/0x140
>
> ffffffff81127e0e: 48 8b 55 b0 mov -0x50(%rbp),%rdx
> ffffffff81127e12: 4c 89 e6 mov %r12,%rsi
> ffffffff81127e15: 48 89 df mov %rbx,%rdi
> ffffffff81127e18: ff 55 b8 callq *-0x48(%rbp) # isolate(item, &nlru->lock, cb_arg)
> ffffffff81127e1b: 83 f8 01 cmp $0x1,%eax
> ffffffff81127e1e: 74 78 je ffffffff81127e98 <list_lru_walk_node+0xe8>
> ffffffff81127e20: 73 4e jae ffffffff81127e70 <list_lru_walk_node+0xc0>
> [...]
One interesting thing I have noted here is that r14 matches the lower half of rbx, with
the upper part borked.
Because we are talking about a single word, this does not look like the usual
update-half-of-a-double-word-without-locking issue.
From your excerpt, it is not totally clear what r14 is. But looking at rdi, which
is 0xffff88002ead77d0 and very probably nlru->lock given the calling convention,
that would indicate that this is nlru->list, assuming you have spinlock debugging enabled.
So yes, someone destroyed our next pointer, and amazingly only half of it.
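(For reference, the per-node structure we are talking about is, as far as I
remember the patchset, essentially:

struct list_lru_node {
	spinlock_t		lock;
	struct list_head	list;
	/* kept as signed so we can catch imbalance bugs */
	long			nr_items;
} ____cacheline_aligned_in_smp;

i.e. the lock, the list head and the item count sit right next to each other,
which is why the lock pointer handed to isolate() and the list pointers show
up so close together in the register dump.)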
Still, the only time we ever release this lock is when isolate returns LRU_RETRY. Maybe the
way we restart is wrong? (Although I can't see how.)
An iput() happens outside the lock in that case, but it seems safe: if that ends up manipulating
the lru it will do so through our accessors.
I will have to think a bit more... Did anything else strange happen before it?
On Fri 28-06-13 18:31:26, Glauber Costa wrote:
> On Fri, Jun 28, 2013 at 10:39:43AM +0200, Michal Hocko wrote:
> > I have just triggered this one.
> >
> > [37955.364062] RIP: 0010:[<ffffffff81127e5b>] [<ffffffff81127e5b>] list_lru_walk_node+0xab/0x140
> > [37955.364062] RSP: 0000:ffff8800374af7b8 EFLAGS: 00010286
> > [37955.364062] RAX: 0000000000000106 RBX: ffff88002ead7838 RCX: ffff8800374af830
> Note ebx
>
> > [37955.364062] RDX: 0000000000000107 RSI: ffff88001d250dc0 RDI: ffff88002ead77d0
> > [37955.364062] RBP: ffff8800374af818 R08: 0000000000000000 R09: ffff88001ffeafc0
> > [37955.364062] R10: 0000000000000002 R11: 0000000000000000 R12: ffff88001d250dc0
> > [37955.364062] R13: 00000000000000a0 R14: 000000572ead7838 R15: ffff88001d250dc8
> Note r14
Hmm, the upper part is 0x57, which is also weird. Do you think this might be
a HW issue?
It would be strange, though: I cannot reproduce it without the series
applied, and I was testing the "good" case for a long time to be sure
that this is not just consistent good luck.
> > [37955.364062] Process as (pid: 3351, threadinfo ffff8800374ae000, task ffff880036d665c0)
> > [37955.364062] Stack:
> > [37955.364062] ffff88001da3e700 ffff8800374af830 ffff8800374af838 ffffffff811846d0
> > [37955.364062] 0000000000000000 ffff88001ce75c48 01ff8800374af838 ffff8800374af838
> > [37955.364062] 0000000000000000 ffff88001ce75800 ffff8800374afa08 0000000000001014
> > [37955.364062] Call Trace:
> > [37955.364062] [<ffffffff811846d0>] ? insert_inode_locked+0x160/0x160
> > [37955.364062] [<ffffffff8118496c>] prune_icache_sb+0x3c/0x60
> > [37955.364062] [<ffffffff8116dcbe>] super_cache_scan+0x12e/0x1b0
> > [37955.364062] [<ffffffff8111354a>] shrink_slab_node+0x13a/0x250
> > [37955.364062] [<ffffffff8111671b>] shrink_slab+0xab/0x120
> > [37955.364062] [<ffffffff81117944>] do_try_to_free_pages+0x264/0x360
> > [37955.364062] [<ffffffff81117d90>] try_to_free_pages+0x130/0x180
> > [37955.364062] [<ffffffff81001974>] ? __switch_to+0x1b4/0x550
> > [37955.364062] [<ffffffff8110a2fe>] __alloc_pages_slowpath+0x39e/0x790
> > [37955.364062] [<ffffffff8110a8ea>] __alloc_pages_nodemask+0x1fa/0x210
> > [37955.364062] [<ffffffff8114d1b0>] alloc_pages_vma+0xa0/0x120
> > [37955.364062] [<ffffffff81129ebb>] do_anonymous_page+0x16b/0x350
> > [37955.364062] [<ffffffff8112f9c5>] handle_pte_fault+0x235/0x240
> > [37955.364062] [<ffffffff8107b8b0>] ? set_next_entity+0xb0/0xd0
> > [37955.364062] [<ffffffff8112fcbf>] handle_mm_fault+0x2ef/0x400
> > [37955.364062] [<ffffffff8157e927>] __do_page_fault+0x237/0x4f0
> > [37955.364062] [<ffffffff8116a8a8>] ? fsnotify_access+0x68/0x80
> > [37955.364062] [<ffffffff8116b0b8>] ? vfs_read+0xd8/0x130
> > [37955.364062] [<ffffffff8157ebe9>] do_page_fault+0x9/0x10
> > [37955.364062] [<ffffffff8157b348>] page_fault+0x28/0x30
> > [37955.364062] Code: 44 24 18 0f 84 87 00 00 00 49 83 7c 24 18 00 78 7b 49 83 c5 01 48 8b 4d a8 48 8b 11 48 8d 42 ff 48 85 d2 48 89 01 74 78 4d 39 f7 <49> 8b 06 4c 89 f3 74 6d 49 89 c6 eb a6 0f 1f 84 00 00 00 00 00
> > [37955.364062] RIP [<ffffffff81127e5b>] list_lru_walk_node+0xab/0x140
> >
> > ffffffff81127e0e: 48 8b 55 b0 mov -0x50(%rbp),%rdx
> > ffffffff81127e12: 4c 89 e6 mov %r12,%rsi
> > ffffffff81127e15: 48 89 df mov %rbx,%rdi
> > ffffffff81127e18: ff 55 b8 callq *-0x48(%rbp) # isolate(item, &nlru->lock, cb_arg)
> > ffffffff81127e1b: 83 f8 01 cmp $0x1,%eax
> > ffffffff81127e1e: 74 78 je ffffffff81127e98 <list_lru_walk_node+0xe8>
> > ffffffff81127e20: 73 4e jae ffffffff81127e70 <list_lru_walk_node+0xc0>
> > [...]
> One interesting thing I have noted here, is that r14 is basically the lower half of rbx, with
> the upper part borked.
>
> Because we are talking about a single word, this does not seem the usual update-half-of-double-word
> without locking issue.
>
> From your excerpt, it is not totally clear what r14 is. But by looking at rdi which
> is 0xffff88002ead77d0 and very probable nlru->lock due to the calling convention,
> that would indicate that this is nlru->list in case you have spinlock debugging enabled.
>
> So yes, someone destroyed our next pointer, and amazingly only half of it.
>
> Still, the only time we ever release this lock is when isolate returns LRU_RETRY. Maybe the
> way we restart is wrong? (although I can't see how)
>
> An iput() happens outside the lock in that case, but it seems safe : if that ends up manipulating
> the lru it will do so through our accessors.
>
> I will have to think a bit more... Any other strange thing happening before it ?
Nothing. I basically see two cases here. One is a hang and the other is
this crash followed by many soft lockups because we died with the lock
held.
When I think about it, the hang case might still be a false positive
(for xfs at least) because it might just be that the processes I have
listed stay in the D state for a long time. I gave them tens of
seconds, then compared traces and considered them hung if they didn't
change. I can retest and give them hours to be absolutely sure.
--
Michal Hocko
SUSE Labs
On Thu, Jun 27, 2013 at 04:54:11PM +0200, Michal Hocko wrote:
> On Thu 27-06-13 09:24:26, Dave Chinner wrote:
> > On Wed, Jun 26, 2013 at 10:15:09AM +0200, Michal Hocko wrote:
> > > On Tue 25-06-13 12:27:54, Dave Chinner wrote:
> > > > On Tue, Jun 18, 2013 at 03:50:25PM +0200, Michal Hocko wrote:
> > > > > And again, another hang. It looks like the inode deletion never
> > > > > finishes. The good thing is that I do not see any LRU related BUG_ONs
> > > > > anymore. I am going to test with the other patch in the thread.
> > > > >
> > > > > 2476 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0 <<< waiting for an inode to go away
> > > > > [<ffffffff81183321>] find_inode_fast+0xa1/0xc0
> > > > > [<ffffffff8118525f>] iget_locked+0x4f/0x180
> > > > > [<ffffffff811ef9e3>] ext4_iget+0x33/0x9f0
> > > > > [<ffffffff811f6a1c>] ext4_lookup+0xbc/0x160
> > > > > [<ffffffff81174ad0>] lookup_real+0x20/0x60
> > > > > [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> > > > > [<ffffffff8117815e>] do_last+0x2de/0x780 <<< holds i_mutex
> > > > > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > > > > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > > > > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > > > > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > > > > [<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
> > > > > [<ffffffffffffffff>] 0xffffffffffffffff
> > > >
> > > > I don't think this has anything to do with LRUs.
> > >
> > > I am not claiming that. It might be a timing issue which never mattered
> > > but it is strange I can reproduce this so easily and repeatedly with the
> > > shrinkers patchset applied.
> > > As I said earlier, this might be breakage in my -mm tree as well
> > > (missing some patch which didn't go via Andrew or misapplied patch). The
> > > situation is worsen by the state of linux-next which has some unrelated
> > > issues.
> > >
> > > I really do not want to delay the whole patchset just because of some
> > > problem on my side. Do you have any tree that I should try to test?
> >
> > No, I've just been testing Glauber's tree and sending patches for
> > problems back to him based on it.
> >
> > > > I won't have seen this on XFS stress testing, because it doesn't use
> > > > the VFS inode hashes for inode lookups. Given that XFS is not
> > > > triggering either problem you are seeing, that makes me think
> > >
> > > I haven't tested with xfs.
> >
> > That might be worthwhile if you can easily do that - another data
> > point indicating a hang or absence of a hang will help point us in
> > the right direction here...
>
> OK, still hanging (with inode_lru_isolate-fix.patch). It is not the same
> thing, though, as xfs seem to do lookup slightly differently.
> 12467 [<ffffffffa02ca03e>] xfs_iget+0xbe/0x190 [xfs]
> [<ffffffffa02d6e98>] xfs_lookup+0xe8/0x110 [xfs]
> [<ffffffffa02cdad9>] xfs_vn_lookup+0x49/0x90 [xfs]
> [<ffffffff81174ad0>] lookup_real+0x20/0x60
> [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> [<ffffffff8117815e>] do_last+0x2de/0x780
> [<ffffffff8117ae9a>] path_openat+0xda/0x400
> [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> [<ffffffff81168f9c>] sys_open+0x1c/0x20
> [<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
What are the full traces? This could be blocking on IO or locks
here if it's a cache miss and we are reading an inode....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Sat 29-06-13 12:55:09, Dave Chinner wrote:
> On Thu, Jun 27, 2013 at 04:54:11PM +0200, Michal Hocko wrote:
> > On Thu 27-06-13 09:24:26, Dave Chinner wrote:
> > > On Wed, Jun 26, 2013 at 10:15:09AM +0200, Michal Hocko wrote:
> > > > On Tue 25-06-13 12:27:54, Dave Chinner wrote:
> > > > > On Tue, Jun 18, 2013 at 03:50:25PM +0200, Michal Hocko wrote:
> > > > > > And again, another hang. It looks like the inode deletion never
> > > > > > finishes. The good thing is that I do not see any LRU related BUG_ONs
> > > > > > anymore. I am going to test with the other patch in the thread.
> > > > > >
> > > > > > 2476 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0 <<< waiting for an inode to go away
> > > > > > [<ffffffff81183321>] find_inode_fast+0xa1/0xc0
> > > > > > [<ffffffff8118525f>] iget_locked+0x4f/0x180
> > > > > > [<ffffffff811ef9e3>] ext4_iget+0x33/0x9f0
> > > > > > [<ffffffff811f6a1c>] ext4_lookup+0xbc/0x160
> > > > > > [<ffffffff81174ad0>] lookup_real+0x20/0x60
> > > > > > [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> > > > > > [<ffffffff8117815e>] do_last+0x2de/0x780 <<< holds i_mutex
> > > > > > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > > > > > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > > > > > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > > > > > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > > > > > [<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
> > > > > > [<ffffffffffffffff>] 0xffffffffffffffff
> > > > >
> > > > > I don't think this has anything to do with LRUs.
> > > >
> > > > I am not claiming that. It might be a timing issue which never mattered
> > > > but it is strange I can reproduce this so easily and repeatedly with the
> > > > shrinkers patchset applied.
> > > > As I said earlier, this might be breakage in my -mm tree as well
> > > > (missing some patch which didn't go via Andrew or misapplied patch). The
> > > > situation is worsen by the state of linux-next which has some unrelated
> > > > issues.
> > > >
> > > > I really do not want to delay the whole patchset just because of some
> > > > problem on my side. Do you have any tree that I should try to test?
> > >
> > > No, I've just been testing Glauber's tree and sending patches for
> > > problems back to him based on it.
> > >
> > > > > I won't have seen this on XFS stress testing, because it doesn't use
> > > > > the VFS inode hashes for inode lookups. Given that XFS is not
> > > > > triggering either problem you are seeing, that makes me think
> > > >
> > > > I haven't tested with xfs.
> > >
> > > That might be worthwhile if you can easily do that - another data
> > > point indicating a hang or absence of a hang will help point us in
> > > the right direction here...
> >
> > OK, still hanging (with inode_lru_isolate-fix.patch). It is not the same
> > thing, though, as xfs seem to do lookup slightly differently.
> > 12467 [<ffffffffa02ca03e>] xfs_iget+0xbe/0x190 [xfs]
> > [<ffffffffa02d6e98>] xfs_lookup+0xe8/0x110 [xfs]
> > [<ffffffffa02cdad9>] xfs_vn_lookup+0x49/0x90 [xfs]
> > [<ffffffff81174ad0>] lookup_real+0x20/0x60
> > [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> > [<ffffffff8117815e>] do_last+0x2de/0x780
> > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > [<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
>
> What are the full traces?
Do you mean sysrq+t? It is attached.
Btw. I was able to reproduce this again. The stuck processes were
sitting in the same traces for more than 28 hours without any change so
I do not think this is a transient condition.
Traces of all processes in the D state:
7561 [<ffffffffa029c03e>] xfs_iget+0xbe/0x190 [xfs]
[<ffffffffa02a8e98>] xfs_lookup+0xe8/0x110 [xfs]
[<ffffffffa029fad9>] xfs_vn_lookup+0x49/0x90 [xfs]
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81177e25>] lookup_open+0x175/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
8156 [<ffffffff81178144>] do_last+0x2c4/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
8913 [<ffffffff81178144>] do_last+0x2c4/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
9100 [<ffffffffa029c03e>] xfs_iget+0xbe/0x190 [xfs]
[<ffffffffa02a8e98>] xfs_lookup+0xe8/0x110 [xfs]
[<ffffffffa029fad9>] xfs_vn_lookup+0x49/0x90 [xfs]
[<ffffffff81174ad0>] lookup_real+0x20/0x60
[<ffffffff81177e25>] lookup_open+0x175/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
9158 [<ffffffff81178144>] do_last+0x2c4/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
11247 [<ffffffff81178144>] do_last+0x2c4/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
12161 [<ffffffff81179862>] path_lookupat+0x792/0x830
[<ffffffff81179933>] filename_lookup+0x33/0xd0
[<ffffffff8117ab0b>] user_path_at_empty+0x7b/0xb0
[<ffffffff8117ab4c>] user_path_at+0xc/0x10
[<ffffffff8116ff91>] vfs_fstatat+0x51/0xb0
[<ffffffff81170116>] vfs_stat+0x16/0x20
[<ffffffff8117013f>] sys_newstat+0x1f/0x50
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
12585 [<ffffffff81178144>] do_last+0x2c4/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
--
Michal Hocko
SUSE Labs
On Sun, Jun 30, 2013 at 08:33:49PM +0200, Michal Hocko wrote:
> On Sat 29-06-13 12:55:09, Dave Chinner wrote:
> > On Thu, Jun 27, 2013 at 04:54:11PM +0200, Michal Hocko wrote:
> > > On Thu 27-06-13 09:24:26, Dave Chinner wrote:
> > > > On Wed, Jun 26, 2013 at 10:15:09AM +0200, Michal Hocko wrote:
> > > > > On Tue 25-06-13 12:27:54, Dave Chinner wrote:
> > > > > > On Tue, Jun 18, 2013 at 03:50:25PM +0200, Michal Hocko wrote:
> > > > > > > And again, another hang. It looks like the inode deletion never
> > > > > > > finishes. The good thing is that I do not see any LRU related BUG_ONs
> > > > > > > anymore. I am going to test with the other patch in the thread.
> > > > > > >
> > > > > > > 2476 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0 <<< waiting for an inode to go away
> > > > > > > [<ffffffff81183321>] find_inode_fast+0xa1/0xc0
> > > > > > > [<ffffffff8118525f>] iget_locked+0x4f/0x180
> > > > > > > [<ffffffff811ef9e3>] ext4_iget+0x33/0x9f0
> > > > > > > [<ffffffff811f6a1c>] ext4_lookup+0xbc/0x160
> > > > > > > [<ffffffff81174ad0>] lookup_real+0x20/0x60
> > > > > > > [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> > > > > > > [<ffffffff8117815e>] do_last+0x2de/0x780 <<< holds i_mutex
> > > > > > > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > > > > > > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > > > > > > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > > > > > > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > > > > > > [<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
> > > > > > > [<ffffffffffffffff>] 0xffffffffffffffff
.....
> Do you mean sysrq+t? It is attached.
>
> Btw. I was able to reproduce this again. The stuck processes were
> sitting in the same traces for more than 28 hours without any change so
> I do not think this is a temporal condition.
>
> Traces of all processes in the D state:
> 7561 [<ffffffffa029c03e>] xfs_iget+0xbe/0x190 [xfs]
> [<ffffffffa02a8e98>] xfs_lookup+0xe8/0x110 [xfs]
> [<ffffffffa029fad9>] xfs_vn_lookup+0x49/0x90 [xfs]
> [<ffffffff81174ad0>] lookup_real+0x20/0x60
> [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> [<ffffffff8117815e>] do_last+0x2de/0x780
> [<ffffffff8117ae9a>] path_openat+0xda/0x400
> [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> [<ffffffff81168f9c>] sys_open+0x1c/0x20
> [<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
This looks like it may be equivalent to the ext4 trace above, though
I'm not totally sure on that yet. Can you get me the line of code
where the above code is sleeping - 'gdb> l *(xfs_iget+0xbe)' output
is sufficient.
If it's where I suspect it is, we are hitting a VFS inode that
igrab() is failing on because I_FREEING is set and that is returning
EAGAIN. Hence xfs_iget() sleeps for a short period and retries the
lookup. If you've still got a system in this state, can you dump the
xfs stats a few times about 5s apart i.e.
$ for i in `seq 0 1 5`; do echo ; date; cat /proc/fs/xfs/stat ; sleep 5 ; done
Depending on what stat is changing (i'm looking for skip vs recycle
in the inode cache stats), that will tell us why the lookup is
failing...
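If I'm right, the path we're hitting is roughly this (3.9-era
fs/xfs/xfs_icache.c, xfs_iget_cache_hit(), quoted from memory rather than
from your tree, so the details may differ slightly):

	/* cache hit on an inode that is still live at the VFS level */
	if (!igrab(VFS_I(ip))) {
		/* the VFS is tearing the inode down: back off and retry */
		trace_xfs_iget_skip(ip);
		error = EAGAIN;
		goto out_error;
	}

and the EAGAIN propagates back to xfs_iget(), which does a short delay and
retries the whole lookup. If the inode never finishes being torn down, that
retry loop never terminates, which would match your hang.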
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon 01-07-13 11:25:58, Dave Chinner wrote:
> On Sun, Jun 30, 2013 at 08:33:49PM +0200, Michal Hocko wrote:
> > On Sat 29-06-13 12:55:09, Dave Chinner wrote:
> > > On Thu, Jun 27, 2013 at 04:54:11PM +0200, Michal Hocko wrote:
> > > > On Thu 27-06-13 09:24:26, Dave Chinner wrote:
> > > > > On Wed, Jun 26, 2013 at 10:15:09AM +0200, Michal Hocko wrote:
> > > > > > On Tue 25-06-13 12:27:54, Dave Chinner wrote:
> > > > > > > On Tue, Jun 18, 2013 at 03:50:25PM +0200, Michal Hocko wrote:
> > > > > > > > And again, another hang. It looks like the inode deletion never
> > > > > > > > finishes. The good thing is that I do not see any LRU related BUG_ONs
> > > > > > > > anymore. I am going to test with the other patch in the thread.
> > > > > > > >
> > > > > > > > 2476 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0 <<< waiting for an inode to go away
> > > > > > > > [<ffffffff81183321>] find_inode_fast+0xa1/0xc0
> > > > > > > > [<ffffffff8118525f>] iget_locked+0x4f/0x180
> > > > > > > > [<ffffffff811ef9e3>] ext4_iget+0x33/0x9f0
> > > > > > > > [<ffffffff811f6a1c>] ext4_lookup+0xbc/0x160
> > > > > > > > [<ffffffff81174ad0>] lookup_real+0x20/0x60
> > > > > > > > [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> > > > > > > > [<ffffffff8117815e>] do_last+0x2de/0x780 <<< holds i_mutex
> > > > > > > > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > > > > > > > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > > > > > > > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > > > > > > > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > > > > > > > [<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
> > > > > > > > [<ffffffffffffffff>] 0xffffffffffffffff
>
> .....
> > Do you mean sysrq+t? It is attached.
> >
> > Btw. I was able to reproduce this again. The stuck processes were
> > sitting in the same traces for more than 28 hours without any change so
> > I do not think this is a temporal condition.
> >
> > Traces of all processes in the D state:
> > 7561 [<ffffffffa029c03e>] xfs_iget+0xbe/0x190 [xfs]
> > [<ffffffffa02a8e98>] xfs_lookup+0xe8/0x110 [xfs]
> > [<ffffffffa029fad9>] xfs_vn_lookup+0x49/0x90 [xfs]
> > [<ffffffff81174ad0>] lookup_real+0x20/0x60
> > [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> > [<ffffffff8117815e>] do_last+0x2de/0x780
> > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > [<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
>
> This looks like it may be equivalent to the ext4 trace above, though
> I'm not totally sure on that yet. Can you get me the line of code
> where the above code is sleeping - 'gdb> l *(xfs_iget+0xbe)' output
> is sufficient.
OK, this is a bit tricky because I have xfs built as a module so objdump
on xfs.ko shows nonsense
19039: e8 00 00 00 00 callq 1903e <xfs_iget+0xbe>
1903e: 48 8b 75 c0 mov -0x40(%rbp),%rsi
crash was more clever though and it says:
0xffffffffa029c034 <xfs_iget+180>: mov $0x1,%edi
0xffffffffa029c039 <xfs_iget+185>: callq 0xffffffff815776d0
<schedule_timeout_uninterruptible>
/dev/shm/mhocko-build/BUILD/kernel-3.9.0mmotm+/fs/xfs/xfs_icache.c: 423
0xffffffffa029c03e <xfs_iget+190>: mov -0x40(%rbp),%rsi
which maps to:
out_error_or_again:
if (error == EAGAIN) {
delay(1);
goto again;
}
So it looks like this path loops between 'goto again' and out_error_or_again.
> If it's where I suspect it is, we are hitting a VFS inode that
> igrab() is failing on because I_FREEING is set and that is returning
> EAGAIN. Hence xfs_iget() sleeps for a short period and retries the
> lookup. If you've still got a system in this state, can you dump the
> xfs stats a few times about 5s apart i.e.
>
> $ for i in `seq 0 1 5`; do echo ; date; cat /proc/fs/xfs/stat ; sleep 5 ; done
>
> Depending on what stat is changing (i'm looking for skip vs recycle
> in the inode cache stats), that will tell us why the lookup is
> failing...
$ for i in `seq 0 1 5`; do echo ; date; cat /proc/fs/xfs/stat ; sleep 5 ; done
Mon Jul 1 09:29:57 CEST 2013
extent_alloc 1484333 2038118 1678 13182
abt 0 0 0 0
blk_map 21004635 3433178 1450438 1461372 1450017 25888309 0
bmbt 0 0 0 0
dir 1482235 1466711 7281 2529
trans 7676 6231535 1444850
ig 0 8534 299 1463749 0 1256778 262381
log 37039 2082072 414 8808 16395
push_ail 7684106 0 519016 449446 0 12401 64613 2970751 0 1036
xstrat 1441551 0
rw 1744884 1351499
attr 84933 0 0 0
icluster 130532 102985 2389817
vnodes 4293706604 0 0 0 1260692 1260692 1260692 0
buf 24539551 79603 24464366 2126 8792 75185 0 129859 9654
abtb2 1520647 1551239 12314 12331 0 0 0 0 0 0 0 0 0 0 15613
abtc2 2972473 1641548 1486215 1486232 0 0 0 0 0 0 0 0 0 0 258694
bmbt2 16968 199868 14855 0 3 0 89 0 6414 89 58 0 61 0 1800151
ibt2 4289847 39122572 22887 1 4 0 644 59 10700 0 88 0 92 0 2732985
qm 0 0 0 0 0 0 0 0
xpc 7892422656 3364392442 7942370166
debug 0
Mon Jul 1 09:30:02 CEST 2013
extent_alloc 1484362 2038147 1678 13182
abt 0 0 0 0
blk_map 21005075 3433237 1450468 1461401 1450047 25888838 0
bmbt 0 0 0 0
dir 1482265 1466741 7281 2529
trans 7676 6231652 1444880
ig 0 8534 299 1463779 0 1256778 262381
log 37039 2082072 414 8808 16395
push_ail 7684253 0 519016 449446 0 12401 64613 2970751 0 1036
xstrat 1441579 0
rw 1744914 1351499
attr 84933 0 0 0
icluster 130532 102985 2389817
vnodes 4293706604 0 0 0 1260692 1260692 1260692 0
buf 24540112 79607 24464923 2126 8792 75189 0 129863 9657
abtb2 1520676 1551268 12314 12331 0 0 0 0 0 0 0 0 0 0 15613
abtc2 2972531 1641578 1486244 1486261 0 0 0 0 0 0 0 0 0 0 258696
bmbt2 16969 199882 14856 0 3 0 89 0 6415 89 58 0 61 0 1800406
ibt2 4289937 39123472 22887 1 4 0 644 59 10700 0 88 0 92 0 2732985
qm 0 0 0 0 0 0 0 0
xpc 7892537344 3364415667 7942370166
debug 0
Mon Jul 1 09:30:07 CEST 2013
extent_alloc 1484393 2038181 1678 13182
abt 0 0 0 0
blk_map 21005515 3433297 1450498 1461431 1450077 25889368 0
bmbt 0 0 0 0
dir 1482295 1466771 7281 2529
trans 7676 6231774 1444910
ig 0 8534 299 1463809 0 1256778 262381
log 37039 2082072 414 8808 16395
push_ail 7684405 0 519016 449446 0 12401 64613 2970751 0 1036
xstrat 1441609 0
rw 1744944 1351499
attr 84933 0 0 0
icluster 130532 102985 2389817
vnodes 4293706604 0 0 0 1260692 1260692 1260692 0
buf 24540682 79609 24465491 2126 8792 75191 0 129867 9657
abtb2 1520708 1551300 12314 12331 0 0 0 0 0 0 0 0 0 0 15613
abtc2 2972593 1641609 1486275 1486292 0 0 0 0 0 0 0 0 0 0 258696
bmbt2 16969 199882 14856 0 3 0 89 0 6415 89 58 0 61 0 1800406
ibt2 4290028 39124384 22888 1 4 0 644 59 10700 0 88 0 92 0 2732985
qm 0 0 0 0 0 0 0 0
xpc 7892660224 3364438892 7942370166
debug 0
Mon Jul 1 09:30:12 CEST 2013
extent_alloc 1484424 2038215 1678 13182
abt 0 0 0 0
blk_map 21005901 3433353 1450524 1461461 1450103 25889836 0
bmbt 0 0 0 0
dir 1482321 1466797 7281 2529
trans 7677 6231889 1444936
ig 0 8534 299 1463835 0 1256778 262381
log 37045 2082361 414 8810 16398
push_ail 7684547 0 519079 449508 0 12408 64613 2971092 0 1037
xstrat 1441639 0
rw 1744970 1351499
attr 84933 0 0 0
icluster 130548 102999 2390155
vnodes 4293706604 0 0 0 1260692 1260692 1260692 0
buf 24541210 79611 24466017 2126 8792 75193 0 129871 9657
abtb2 1520740 1551332 12314 12331 0 0 0 0 0 0 0 0 0 0 15613
abtc2 2972655 1641640 1486306 1486323 0 0 0 0 0 0 0 0 0 0 258696
bmbt2 16969 199882 14856 0 3 0 89 0 6415 89 58 0 61 0 1800406
ibt2 4290107 39125176 22889 1 4 0 644 59 10700 0 88 0 92 0 2732985
qm 0 0 0 0 0 0 0 0
xpc 7892783104 3364458016 7942370166
debug 0
Mon Jul 1 09:30:17 CEST 2013
extent_alloc 1484454 2038245 1678 13182
abt 0 0 0 0
blk_map 21006341 3433413 1450554 1461491 1450133 25890366 0
bmbt 0 0 0 0
dir 1482351 1466827 7281 2529
trans 7677 6232011 1444966
ig 0 8534 299 1463865 0 1256778 262381
log 37045 2082361 414 8810 16398
push_ail 7684699 0 519175 449508 0 12408 64613 2971092 0 1037
xstrat 1441669 0
rw 1745000 1351499
attr 84933 0 0 0
icluster 130548 102999 2390155
vnodes 4293706604 0 0 0 1260692 1260692 1260692 0
buf 24541770 79611 24466577 2126 8792 75193 0 129871 9657
abtb2 1520770 1551362 12314 12331 0 0 0 0 0 0 0 0 0 0 15613
abtc2 2972715 1641670 1486336 1486353 0 0 0 0 0 0 0 0 0 0 258696
bmbt2 16969 199882 14856 0 3 0 89 0 6415 89 58 0 61 0 1800406
ibt2 4290197 39126076 22889 1 4 0 644 59 10700 0 88 0 92 0 2732985
qm 0 0 0 0 0 0 0 0
xpc 7892905984 3364481241 7942370166
debug 0
Mon Jul 1 09:30:22 CEST 2013
extent_alloc 1484486 2038280 1678 13182
abt 0 0 0 0
blk_map 21006782 3433474 1450584 1461522 1450163 25890898 0
bmbt 0 0 0 0
dir 1482381 1466857 7281 2529
trans 7677 6232134 1444996
ig 0 8534 299 1463895 0 1256778 262381
log 37045 2082361 414 8810 16398
push_ail 7684852 0 519272 449508 0 12408 64613 2971092 0 1037
xstrat 1441699 0
rw 1745030 1351499
attr 84933 0 0 0
icluster 130548 102999 2390155
vnodes 4293706604 0 0 0 1260692 1260692 1260692 0
buf 24542347 79614 24467151 2126 8792 75196 0 129876 9657
abtb2 1520803 1551395 12314 12331 0 0 0 0 0 0 0 0 0 0 15613
abtc2 2972779 1641702 1486368 1486385 0 0 0 0 0 0 0 0 0 0 258696
bmbt2 16970 199896 14857 0 3 0 89 0 6415 89 58 0 61 0 1800407
ibt2 4290288 39126988 22890 1 4 0 644 59 10700 0 88 0 92 0 2732985
qm 0 0 0 0 0 0 0 0
xpc 7893028864 3364504466 7942370166
debug 0
--
Michal Hocko
SUSE Labs
On Mon, Jul 01, 2013 at 09:50:05AM +0200, Michal Hocko wrote:
> On Mon 01-07-13 11:25:58, Dave Chinner wrote:
> > On Sun, Jun 30, 2013 at 08:33:49PM +0200, Michal Hocko wrote:
> > > On Sat 29-06-13 12:55:09, Dave Chinner wrote:
> > > > On Thu, Jun 27, 2013 at 04:54:11PM +0200, Michal Hocko wrote:
> > > > > On Thu 27-06-13 09:24:26, Dave Chinner wrote:
> > > > > > On Wed, Jun 26, 2013 at 10:15:09AM +0200, Michal Hocko wrote:
> > > > > > > On Tue 25-06-13 12:27:54, Dave Chinner wrote:
> > > > > > > > On Tue, Jun 18, 2013 at 03:50:25PM +0200, Michal Hocko wrote:
> > > > > > > > > And again, another hang. It looks like the inode deletion never
> > > > > > > > > finishes. The good thing is that I do not see any LRU related BUG_ONs
> > > > > > > > > anymore. I am going to test with the other patch in the thread.
> > > > > > > > >
> > > > > > > > > 2476 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0 <<< waiting for an inode to go away
> > > > > > > > > [<ffffffff81183321>] find_inode_fast+0xa1/0xc0
> > > > > > > > > [<ffffffff8118525f>] iget_locked+0x4f/0x180
> > > > > > > > > [<ffffffff811ef9e3>] ext4_iget+0x33/0x9f0
> > > > > > > > > [<ffffffff811f6a1c>] ext4_lookup+0xbc/0x160
> > > > > > > > > [<ffffffff81174ad0>] lookup_real+0x20/0x60
> > > > > > > > > [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> > > > > > > > > [<ffffffff8117815e>] do_last+0x2de/0x780 <<< holds i_mutex
> > > > > > > > > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > > > > > > > > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > > > > > > > > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > > > > > > > > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > > > > > > > > [<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
> > > > > > > > > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > .....
> > > Do you mean sysrq+t? It is attached.
> > >
> > > Btw. I was able to reproduce this again. The stuck processes were
> > > sitting in the same traces for more than 28 hours without any change so
> > > I do not think this is a temporal condition.
> > >
> > > Traces of all processes in the D state:
> > > 7561 [<ffffffffa029c03e>] xfs_iget+0xbe/0x190 [xfs]
> > > [<ffffffffa02a8e98>] xfs_lookup+0xe8/0x110 [xfs]
> > > [<ffffffffa029fad9>] xfs_vn_lookup+0x49/0x90 [xfs]
> > > [<ffffffff81174ad0>] lookup_real+0x20/0x60
> > > [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> > > [<ffffffff8117815e>] do_last+0x2de/0x780
> > > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > > [<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
> > > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > This looks like it may be equivalent to the ext4 trace above, though
> > I'm not totally sure on that yet. Can you get me the line of code
> > where the above code is sleeping - 'gdb> l *(xfs_iget+0xbe)' output
> > is sufficient.
>
> OK, this is a bit tricky because I have xfs built as a module so objdump
> on xfs.ko shows nonsense
> 19039: e8 00 00 00 00 callq 1903e <xfs_iget+0xbe>
> 1903e: 48 8b 75 c0 mov -0x40(%rbp),%rsi
>
> crash was more clever though and it says:
> 0xffffffffa029c034 <xfs_iget+180>: mov $0x1,%edi
> 0xffffffffa029c039 <xfs_iget+185>: callq 0xffffffff815776d0
> <schedule_timeout_uninterruptible>
> /dev/shm/mhocko-build/BUILD/kernel-3.9.0mmotm+/fs/xfs/xfs_icache.c: 423
> 0xffffffffa029c03e <xfs_iget+190>: mov -0x40(%rbp),%rsi
>
> which maps to:
> out_error_or_again:
> if (error == EAGAIN) {
> delay(1);
> goto again;
> }
>
> So this looks like this path loops in goto again and out_error_or_again.
Yup, that's what I suspected.
> > If it's where I suspect it is, we are hitting a VFS inode that
> > igrab() is failing on because I_FREEING is set and that is returning
> > EAGAIN. Hence xfs_iget() sleeps for a short period and retries the
> > lookup. If you've still got a system in this state, can you dump the
> > xfs stats a few times about 5s apart i.e.
> >
> > $ for i in `seq 0 1 5`; do echo ; date; cat /proc/fs/xfs/stat ; sleep 5 ; done
> >
> > Depending on what stat is changing (i'm looking for skip vs recycle
> > in the inode cache stats), that will tell us why the lookup is
> > failing...
>
> $ for i in `seq 0 1 5`; do echo ; date; cat /proc/fs/xfs/stat ; sleep 5 ; done
>
> Mon Jul 1 09:29:57 CEST 2013
> extent_alloc 1484333 2038118 1678 13182
> abt 0 0 0 0
> blk_map 21004635 3433178 1450438 1461372 1450017 25888309 0
> bmbt 0 0 0 0
> dir 1482235 1466711 7281 2529
> trans 7676 6231535 1444850
> ig 0 8534 299 1463749 0 1256778 262381
^^^
That is the recycle stat, which indicates we've found an inode being
reclaimed. When the lookup finds an inode that has been evicted, but
not yet reclaimed at the XFS level, that stat will increase. If the
inode is still valid at the VFS level, and igrab() fails, then we'll
get EAGAIN without that stat being increased. So, igrab() is
failing, and that means I_FREEING|I_WILL_FREE are set.
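For reference, igrab() is what enforces that; it looks like this
(fs/inode.c, paraphrased - the comments are mine):

struct inode *igrab(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	if (!(inode->i_state & (I_FREEING | I_WILL_FREE))) {
		__iget(inode);			/* take a new reference */
		spin_unlock(&inode->i_lock);
	} else {
		/* the inode is being evicted; refuse to resurrect it */
		spin_unlock(&inode->i_lock);
		inode = NULL;
	}
	return inode;
}

A NULL return here is what the XFS lookup turns into EAGAIN.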
So, it looks to be the same case as the ext4 hang, and it's likely
that we have some dangling inode dispose list somewhere. So, here's
the fun part. Use tracing to grab the inode number that is stuck
(tracepoint xfs::xfs_iget_skip), and then run crash on the live
kernel on the process that is looping, and find the struct xfs_inode
and print it. Use the inode number from the trace point to check
you've got the right inode.
The struct inode of the VFS inode is embedded into the struct
xfs_inode, and the dispose list that it is on should be reachable
through inode->i_lru. Walk that, and see how many other inodes are on
that list. Once we know whether it's a single inode, and whether the
dispose list it is on is intact, empty or corrupt, we might have a
better idea of how these inodes are getting lost....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon 01-07-13 18:10:56, Dave Chinner wrote:
> On Mon, Jul 01, 2013 at 09:50:05AM +0200, Michal Hocko wrote:
> > On Mon 01-07-13 11:25:58, Dave Chinner wrote:
> > > On Sun, Jun 30, 2013 at 08:33:49PM +0200, Michal Hocko wrote:
> > > > On Sat 29-06-13 12:55:09, Dave Chinner wrote:
> > > > > On Thu, Jun 27, 2013 at 04:54:11PM +0200, Michal Hocko wrote:
> > > > > > On Thu 27-06-13 09:24:26, Dave Chinner wrote:
> > > > > > > On Wed, Jun 26, 2013 at 10:15:09AM +0200, Michal Hocko wrote:
> > > > > > > > On Tue 25-06-13 12:27:54, Dave Chinner wrote:
> > > > > > > > > On Tue, Jun 18, 2013 at 03:50:25PM +0200, Michal Hocko wrote:
> > > > > > > > > > And again, another hang. It looks like the inode deletion never
> > > > > > > > > > finishes. The good thing is that I do not see any LRU related BUG_ONs
> > > > > > > > > > anymore. I am going to test with the other patch in the thread.
> > > > > > > > > >
> > > > > > > > > > 2476 [<ffffffff8118325e>] __wait_on_freeing_inode+0x9e/0xc0 <<< waiting for an inode to go away
> > > > > > > > > > [<ffffffff81183321>] find_inode_fast+0xa1/0xc0
> > > > > > > > > > [<ffffffff8118525f>] iget_locked+0x4f/0x180
> > > > > > > > > > [<ffffffff811ef9e3>] ext4_iget+0x33/0x9f0
> > > > > > > > > > [<ffffffff811f6a1c>] ext4_lookup+0xbc/0x160
> > > > > > > > > > [<ffffffff81174ad0>] lookup_real+0x20/0x60
> > > > > > > > > > [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> > > > > > > > > > [<ffffffff8117815e>] do_last+0x2de/0x780 <<< holds i_mutex
> > > > > > > > > > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > > > > > > > > > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > > > > > > > > > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > > > > > > > > > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > > > > > > > > > [<ffffffff81582fe9>] system_call_fastpath+0x16/0x1b
> > > > > > > > > > [<ffffffffffffffff>] 0xffffffffffffffff
> > >
> > > .....
> > > > Do you mean sysrq+t? It is attached.
> > > >
> > > > Btw. I was able to reproduce this again. The stuck processes were
> > > > sitting in the same traces for more than 28 hours without any change so
> > > > I do not think this is a temporal condition.
> > > >
> > > > Traces of all processes in the D state:
> > > > 7561 [<ffffffffa029c03e>] xfs_iget+0xbe/0x190 [xfs]
> > > > [<ffffffffa02a8e98>] xfs_lookup+0xe8/0x110 [xfs]
> > > > [<ffffffffa029fad9>] xfs_vn_lookup+0x49/0x90 [xfs]
> > > > [<ffffffff81174ad0>] lookup_real+0x20/0x60
> > > > [<ffffffff81177e25>] lookup_open+0x175/0x1d0
> > > > [<ffffffff8117815e>] do_last+0x2de/0x780
> > > > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > > > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > > > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > > > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > > > [<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
> > > > [<ffffffffffffffff>] 0xffffffffffffffff
> > >
> > > This looks like it may be equivalent to the ext4 trace above, though
> > > I'm not totally sure on that yet. Can you get me the line of code
> > > where the above code is sleeping - 'gdb> l *(xfs_iget+0xbe)' output
> > > is sufficient.
> >
> > OK, this is a bit tricky because I have xfs built as a module so objdump
> > on xfs.ko shows nonsense
> > 19039: e8 00 00 00 00 callq 1903e <xfs_iget+0xbe>
> > 1903e: 48 8b 75 c0 mov -0x40(%rbp),%rsi
> >
> > crash was more clever though and it says:
> > 0xffffffffa029c034 <xfs_iget+180>: mov $0x1,%edi
> > 0xffffffffa029c039 <xfs_iget+185>: callq 0xffffffff815776d0
> > <schedule_timeout_uninterruptible>
> > /dev/shm/mhocko-build/BUILD/kernel-3.9.0mmotm+/fs/xfs/xfs_icache.c: 423
> > 0xffffffffa029c03e <xfs_iget+190>: mov -0x40(%rbp),%rsi
> >
> > which maps to:
> > out_error_or_again:
> > if (error == EAGAIN) {
> > delay(1);
> > goto again;
> > }
> >
> > So this looks like this path loops in goto again and out_error_or_again.
>
> Yup, that's what I suspected.
>
> > > If it's where I suspect it is, we are hitting a VFS inode that
> > > igrab() is failing on because I_FREEING is set and that is returning
> > > EAGAIN. Hence xfs_iget() sleeps for a short period and retries the
> > > lookup. If you've still got a system in this state, can you dump the
> > > xfs stats a few times about 5s apart i.e.
> > >
> > > $ for i in `seq 0 1 5`; do echo ; date; cat /proc/fs/xfs/stat ; sleep 5 ; done
> > >
> > > Depending on what stat is changing (i'm looking for skip vs recycle
> > > in the inode cache stats), that will tell us why the lookup is
> > > failing...
> >
> > $ for i in `seq 0 1 5`; do echo ; date; cat /proc/fs/xfs/stat ; sleep 5 ; done
> >
> > Mon Jul 1 09:29:57 CEST 2013
> > extent_alloc 1484333 2038118 1678 13182
> > abt 0 0 0 0
> > blk_map 21004635 3433178 1450438 1461372 1450017 25888309 0
> > bmbt 0 0 0 0
> > dir 1482235 1466711 7281 2529
> > trans 7676 6231535 1444850
> > ig 0 8534 299 1463749 0 1256778 262381
> ^^^
>
> That is the recycle stat, which indicates we've found an inode being
> reclaimed. When it's found an inode that have been evicted, but not
> yet reclaimed at the XFS level, that stat will increase. If the
> inode is still valid at the VFS level, and igrab() fails, then we'll
> get EAGAIN without that stat being increased. So, igrab() is
> failing, and that means I_FREEING|I_WILL_FREE are set.
>
> So, it looks to be the same case as the ext4 hang, and it's likely
> that we have some dangling inode dispose list somewhere. So, here's
> the fun part. Use tracing to grab the inode number that is stuck
> (tracepoint xfs::xfs_iget_skip),
$ cat /sys/kernel/debug/tracing/trace_pipe > demon.trace.log &
$ pid=$!
$ sleep 10s ; kill $pid
$ awk '{print $1, $9}' demon.trace.log | sort -u
cc1-7561 0xf78d4f
cc1-9100 0x80b2a35
> and then run crash on the live kernel on the process that is looping,
> and find the struct xfs_inode and print it. Use the inode number from
> the trace point to check you've got the right inode.
crash> bt -f 7561
#4 [ffff88003744db40] xfs_iget at ffffffffa029c03e [xfs]
ffff88003744db48: 0000000000000000 0000000000000000
ffff88003744db58: 0000000000013b40 ffff88003744dc30
ffff88003744db68: 0000000000000000 0000000000000000
ffff88003744db78: 0000000000f78d4f ffffffffa02dafec
ffff88003744db88: ffff88000c09e1c0 0000000000000008
ffff88003744db98: 0000000000000000 ffff88000c0a0ac0
ffff88003744dba8: ffff88003744dc18 0000000000000000
ffff88003744dbb8: ffff88003744dc08 ffffffffa02a8e98
crash> dis xfs_iget
[...]
0xffffffffa029c045 <xfs_iget+197>: callq 0xffffffff812ca190 <radix_tree_lookup>
0xffffffffa029c04a <xfs_iget+202>: test %rax,%rax
0xffffffffa029c04d <xfs_iget+205>: mov %rax,-0x30(%rbp)
So the inode should be at -0x30(%rbp) which is
crash> struct xfs_inode.i_ino ffff88000c09e1c0
i_ino = 16223567
crash> p /x 16223567
$15 = 0xf78d4f
> Th struct inode of the VFS inode is embedded into the struct
> xfs_inode,
crash> struct -o xfs_inode.i_vnode ffff88000c09e1c0
struct xfs_inode {
[ffff88000c09e2f8] struct inode i_vnode;
}
> and the dispose list that it is on should be the on the
> inode->i_lru_list.
crash> struct inode.i_lru ffff88000c09e2f8
i_lru = {
next = 0xffff88000c09e3e8,
prev = 0xffff88000c09e3e8
}
crash> struct inode.i_flags ffff88000c09e2f8
i_flags = 4096
The full xfs_inode dump is attached.
> What that, and see how many other inodes are on that list. Once we
> know if it's a single inode,
The list seems to be empty. And the same is the case for the other inode:
crash> bt -f 9100
#4 [ffff88001c8c5b40] xfs_iget at ffffffffa029c03e [xfs]
ffff88001c8c5b48: 0000000000000000 0000000000000000
ffff88001c8c5b58: 0000000000013b40 ffff88001c8c5c30
ffff88001c8c5b68: 0000000000000000 0000000000000000
ffff88001c8c5b78: 00000000000b2a35 ffffffffa02dafec
ffff88001c8c5b88: ffff88000c09ec40 0000000000000008
ffff88001c8c5b98: 0000000000000000 ffff8800359e9b00
ffff88001c8c5ba8: ffff88001c8c5c18 0000000000000000
ffff88001c8c5bb8: ffff88001c8c5c08 ffffffffa02a8e98
crash> p /x 0xffff88001c8c5bb8-0x30
$16 = 0xffff88001c8c5b88
crash> struct xfs_inode.i_ino ffff88000c09ec40
i_ino = 134949429
crash> p /x 134949429
$17 = 0x80b2a35
crash> struct -o xfs_inode.i_vnode ffff88000c09ec40
struct xfs_inode {
[ffff88000c09ed78] struct inode i_vnode;
}
crash> struct inode.i_lru ffff88000c09ed78
i_lru = {
next = 0xffff88000c09ee68,
prev = 0xffff88000c09ee68
}
crash> struct inode.i_flags ffff88000c09ed78
i_flags = 4096
> and whether the dispose list it is on is intact, empty or corrupt, we
> might have a better idea of how these inodes are getting lost....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
--
Michal Hocko
SUSE Labs
On Tue, Jul 02, 2013 at 11:22:00AM +0200, Michal Hocko wrote:
> On Mon 01-07-13 18:10:56, Dave Chinner wrote:
> > On Mon, Jul 01, 2013 at 09:50:05AM +0200, Michal Hocko wrote:
> > > On Mon 01-07-13 11:25:58, Dave Chinner wrote:
> > That is the recycle stat, which indicates we've found an inode being
> > reclaimed. When it's found an inode that have been evicted, but not
> > yet reclaimed at the XFS level, that stat will increase. If the
> > inode is still valid at the VFS level, and igrab() fails, then we'll
> > get EAGAIN without that stat being increased. So, igrab() is
> > failing, and that means I_FREEING|I_WILL_FREE are set.
> >
> > So, it looks to be the same case as the ext4 hang, and it's likely
> > that we have some dangling inode dispose list somewhere. So, here's
> > the fun part. Use tracing to grab the inode number that is stuck
> > (tracepoint xfs::xfs_iget_skip),
>
> $ cat /sys/kernel/debug/tracing/trace_pipe > demon.trace.log &
> $ pid=$!
> $ sleep 10s ; kill $pid
> $ awk '{print $1, $9}' demon.trace.log | sort -u
> cc1-7561 0xf78d4f
> cc1-9100 0x80b2a35
.....
>
> > and the dispose list that it is on should be the on the
> > inode->i_lru_list.
>
> crash> struct inode.i_lru ffff88000c09e2f8
> i_lru = {
> next = 0xffff88000c09e3e8,
> prev = 0xffff88000c09e3e8
> }
Hmmm, that's empty.
> crash> struct inode.i_flags ffff88000c09e2f8
> i_flags = 4096
I asked for the wrong field, I wanted i_state, but seeing as you:
> The full xfs_inode dump is attached.
Dumped the whole inode, I got it from below :)
> i_state = 32,
so, i_state = I_FREEING.
IOWs, we've got an inode marked I_FREEING that isn't on a dispose
list but hasn't passed through evict() correctly.
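For anyone decoding that by hand, the relevant i_state bits are (from
include/linux/fs.h, values from memory):

#define I_NEW			(1 << 3)	/*   8 */
#define I_WILL_FREE		(1 << 4)	/*  16 */
#define I_FREEING		(1 << 5)	/*  32 */
#define I_REFERENCED		(1 << 8)	/* 256 */

so i_state = 32 really is I_FREEING and nothing else.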
> crash> struct xfs_inode ffff88000c09e1c0
> struct xfs_inode {
.....
> i_flags = 0,
XFS doesn't see the inode as reclaimable yet, either.
Ok, so it's been leaked from a dispose list somehow. Thanks for the
info, Michal, it's time to go look at the code....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue 02-07-13 22:19:47, Dave Chinner wrote:
[...]
> Ok, so it's been leaked from a dispose list somehow. Thanks for the
> info, Michal, it's time to go look at the code....
OK, just in case we will need it, I am keeping the machine in this state
for now. So we still can play with crash and check all the juicy
internals.
--
Michal Hocko
SUSE Labs
On Tue, Jul 02, 2013 at 02:44:27PM +0200, Michal Hocko wrote:
> On Tue 02-07-13 22:19:47, Dave Chinner wrote:
> [...]
> > Ok, so it's been leaked from a dispose list somehow. Thanks for the
> > info, Michal, it's time to go look at the code....
>
> OK, just in case we will need it, I am keeping the machine in this state
> for now. So we still can play with crash and check all the juicy
> internals.
My current suspect is the LRU_RETRY code. I don't think what it is
doing is at all valid - list_for_each_safe() is not safe if you drop
the lock that protects the list. i.e. there is nothing that protects
the stored next pointer from being removed from the list by someone
else. Hence what I think is occurring is this:
thread 1                        thread 2
lock(lru)
list_for_each_safe(lru)         lock(lru)
  isolate                       ......
    lock(i_lock)
    has buffers
      __iget
    unlock(i_lock)
    unlock(lru)
    .....                       (gets lru lock)
                                list_for_each_safe(lru)
                                  walks all the inodes
                                  finds inode being isolated by other thread
                                    isolate
                                      i_count > 0
                                        list_del_init(i_lru)
                                        return LRU_REMOVED;
                                  moves to next inode, inode that
                                  other thread has stored as next
                                    isolate
                                      i_state |= I_FREEING
                                      list_move(dispose_list)
                                      return LRU_REMOVED
                                ....
                                unlock(lru)
    lock(lru)
    return LRU_RETRY;
  if (!first_pass)
    ....
  --nr_to_scan
  (loop again using next, which has already been removed from the
   LRU by the other thread!)
  isolate
    lock(i_lock)
    if (i_state & ~I_REFERENCED)
      list_del_init(i_lru)      <<<<< inode is on dispose list!
                                <<<<< inode is now isolated, with I_FREEING set
    return LRU_REMOVED;
That fits the corpse left on your machine, Michal. One thread has
moved the inode to a dispose list, the other thread thinks it is
still on the LRU and should be removed, and removes it.
This also explains the lru item count going negative - the same item
is being removed from the lru twice. So it seems like all the
problems you've been seeing are caused by this one problem....
Patch below that should fix this.
Cheers,
Dave.
--
Dave Chinner
[email protected]
list_lru: fix broken LRU_RETRY behaviour
From: Dave Chinner <[email protected]>
The LRU_RETRY code assumes that the list traversal status remains valid
after we have dropped and regained the list lock. Unfortunately, this is
not a valid assumption, and that can lead to racing traversals isolating
objects that the other traversal expects to be the next item on the
list.
This is causing problems with the inode cache shrinker isolation,
with races resulting in an inode on a dispose list being "isolated"
because a racing traversal still thinks it is on the LRU. The inode
is then never reclaimed and that causes hangs if a subsequent lookup
on that inode occurs.
Fix it by always restarting the list walk on an LRU_RETRY return from
the isolate callback. Avoid the possibility of livelock the current
code was trying to avoid by always decrementing the nr_to_walk
counter on retries, so that even if we keep hitting the same item on
the list we'll eventually stop trying to walk and exit out of the
situation causing the problem.
Reported-by: Michal Hocko <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
---
mm/list_lru.c | 29 ++++++++++++-----------------
1 file changed, 12 insertions(+), 17 deletions(-)
diff --git a/mm/list_lru.c b/mm/list_lru.c
index dc71659..7246791 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -71,19 +71,19 @@ list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
struct list_lru_node *nlru = &lru->node[nid];
struct list_head *item, *n;
unsigned long isolated = 0;
- /*
- * If we don't keep state of at which pass we are, we can loop at
- * LRU_RETRY, since we have no guarantees that the caller will be able
- * to do something other than retry on the next pass. We handle this by
- * allowing at most one retry per object. This should not be altered
- * by any condition other than LRU_RETRY.
- */
- bool first_pass = true;
spin_lock(&nlru->lock);
restart:
list_for_each_safe(item, n, &nlru->list) {
enum lru_status ret;
+
+ /*
+ * decrement nr_to_walk first so that we don't livelock if we
+ * get stuck on large numbers of LRU_RETRY items
+ */
+ if (--(*nr_to_walk) == 0)
+ break;
+
ret = isolate(item, &nlru->lock, cb_arg);
switch (ret) {
case LRU_REMOVED:
@@ -98,19 +98,14 @@ restart:
case LRU_SKIP:
break;
case LRU_RETRY:
- if (!first_pass) {
- first_pass = true;
- break;
- }
- first_pass = false;
+ /*
+ * The lru lock has been dropped, our list traversal is
+ * now invalid and so we have to restart from scratch.
+ */
goto restart;
default:
BUG();
}
-
- if ((*nr_to_walk)-- == 0)
- break;
-
}
spin_unlock(&nlru->lock);
On Wed, Jul 03, 2013 at 09:24:03PM +1000, Dave Chinner wrote:
> On Tue, Jul 02, 2013 at 02:44:27PM +0200, Michal Hocko wrote:
> > On Tue 02-07-13 22:19:47, Dave Chinner wrote:
> > [...]
> > > Ok, so it's been leaked from a dispose list somehow. Thanks for the
> > > info, Michal, it's time to go look at the code....
> >
> > OK, just in case we will need it, I am keeping the machine in this state
> > for now. So we still can play with crash and check all the juicy
> > internals.
>
> My current suspect is the LRU_RETRY code. I don't think what it is
> doing is at all valid - list_for_each_safe() is not safe if you drop
> the lock that protects the list. i.e. there is nothing that protects
> the stored next pointer from being removed from the list by someone
> else. Hence what I think is occurring is this:
>
>
> thread 1 thread 2
> lock(lru)
> list_for_each_safe(lru) lock(lru)
> isolate ......
> lock(i_lock)
> has buffers
> __iget
> unlock(i_lock)
> unlock(lru)
> ..... (gets lru lock)
> list_for_each_safe(lru)
> walks all the inodes
> finds inode being isolated by other thread
> isolate
> i_count > 0
> list_del_init(i_lru)
> return LRU_REMOVED;
> moves to next inode, inode that
> other thread has stored as next
> isolate
> i_state |= I_FREEING
> list_move(dispose_list)
> return LRU_REMOVED
> ....
> unlock(lru)
> lock(lru)
> return LRU_RETRY;
> if (!first_pass)
> ....
> --nr_to_scan
> (loop again using next, which has already been removed from the
> LRU by the other thread!)
> isolate
> lock(i_lock)
> if (i_state & ~I_REFERENCED)
> list_del_init(i_lru) <<<<< inode is on dispose list!
> <<<<< inode is now isolated, with I_FREEING set
> return LRU_REMOVED;
>
> That fits the corpse left on your machine, Michal. One thread has
> moved the inode to a dispose list, the other thread thinks it is
> still on the LRU and should be removed, and removes it.
>
> This also explains the lru item count going negative - the same item
> is being removed from the lru twice. So it seems like all the
> problems you've been seeing are caused by this one problem....
>
> Patch below that should fix this.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>
> list_lru: fix broken LRU_RETRY behaviour
>
> From: Dave Chinner <[email protected]>
>
> The LRU_RETRY code assumes that the list traversal status remains
> valid after we have dropped and regained the list lock. Unfortunately,
> this is not
> a valid assumption, and that can lead to racing traversals isolating
> objects that the other traversal expects to be the next item on the
> list.
>
> This is causing problems with the inode cache shrinker isolation,
> with races resulting in an inode on a dispose list being "isolated"
> because a racing traversal still thinks it is on the LRU. The inode
> is then never reclaimed and that causes hangs if a subsequent lookup
> on that inode occurs.
>
> Fix it by always restarting the list walk on a LRU_RETRY return from
> the isolate callback. Avoid the possibility of livelocks the current
> code was trying to avoid by always decrementing the nr_to_walk
> counter on retries so that even if we keep hitting the same item on
> the list we'll eventually stop trying to walk and exit out of the
> situation causing the problem.
>
> Reported-by: Michal Hocko <[email protected]>
> Signed-off-by: Dave Chinner <[email protected]>
> ---
> mm/list_lru.c | 29 ++++++++++++-----------------
> 1 file changed, 12 insertions(+), 17 deletions(-)
>
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index dc71659..7246791 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -71,19 +71,19 @@ list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
> struct list_lru_node *nlru = &lru->node[nid];
> struct list_head *item, *n;
> unsigned long isolated = 0;
> - /*
> - * If we don't keep state of at which pass we are, we can loop at
> - * LRU_RETRY, since we have no guarantees that the caller will be able
> - * to do something other than retry on the next pass. We handle this by
> - * allowing at most one retry per object. This should not be altered
> - * by any condition other than LRU_RETRY.
> - */
> - bool first_pass = true;
>
> spin_lock(&nlru->lock);
> restart:
> list_for_each_safe(item, n, &nlru->list) {
> enum lru_status ret;
> +
> + /*
> + * decrement nr_to_walk first so that we don't livelock if we
> + * get stuck on large numbers of LRU_RETRY items
> + */
> + if (--(*nr_to_walk) == 0)
> + break;
> +
> ret = isolate(item, &nlru->lock, cb_arg);
> switch (ret) {
> case LRU_REMOVED:
> @@ -98,19 +98,14 @@ restart:
> case LRU_SKIP:
> break;
> case LRU_RETRY:
> - if (!first_pass) {
> - first_pass = true;
> - break;
> - }
> - first_pass = false;
> + /*
> + * The lru lock has been dropped, our list traversal is
> + * now invalid and so we have to restart from scratch.
> + */
> goto restart;
> default:
> BUG();
> }
> -
> - if ((*nr_to_walk)-- == 0)
> - break;
> -
> }
This patch makes perfect sense to me, along with your description.
On Wed 03-07-13 21:24:03, Dave Chinner wrote:
> On Tue, Jul 02, 2013 at 02:44:27PM +0200, Michal Hocko wrote:
> > On Tue 02-07-13 22:19:47, Dave Chinner wrote:
> > [...]
> > > Ok, so it's been leaked from a dispose list somehow. Thanks for the
> > > info, Michal, it's time to go look at the code....
> >
> > OK, just in case we will need it, I am keeping the machine in this state
> > for now. So we still can play with crash and check all the juicy
> > internals.
>
> My current suspect is the LRU_RETRY code. I don't think what it is
> doing is at all valid - list_for_each_safe() is not safe if you drop
> the lock that protects the list. i.e. there is nothing that protects
> the stored next pointer from being removed from the list by someone
> else. Hence what I think is occurring is this:
>
>
> thread 1 thread 2
> lock(lru)
> list_for_each_safe(lru) lock(lru)
> isolate ......
> lock(i_lock)
> has buffers
> __iget
> unlock(i_lock)
> unlock(lru)
> ..... (gets lru lock)
> list_for_each_safe(lru)
> walks all the inodes
> finds inode being isolated by other thread
> isolate
> i_count > 0
> list_del_init(i_lru)
> return LRU_REMOVED;
> moves to next inode, inode that
> other thread has stored as next
> isolate
> i_state |= I_FREEING
> list_move(dispose_list)
> return LRU_REMOVED
> ....
> unlock(lru)
> lock(lru)
> return LRU_RETRY;
> if (!first_pass)
> ....
> --nr_to_scan
> (loop again using next, which has already been removed from the
> LRU by the other thread!)
> isolate
> lock(i_lock)
> if (i_state & ~I_REFERENCED)
> list_del_init(i_lru) <<<<< inode is on dispose list!
> <<<<< inode is now isolated, with I_FREEING set
> return LRU_REMOVED;
>
> That fits the corpse left on your machine, Michal. One thread has
> moved the inode to a dispose list, the other thread thinks it is
> still on the LRU and should be removed, and removes it.
>
> This also explains the lru item count going negative - the same item
> is being removed from the lru twice. So it seems like all the
> problems you've been seeing are caused by this one problem....
>
> Patch below that should fix this.
Good news! The test has been running since morning and it didn't hang or
crash. So this really looks like the right fix. It will also run over
the weekend to be 100% sure. But I guess it is safe to say
Tested-by: Michal Hocko <[email protected]>
Thanks a lot Dave!
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>
> list_lru: fix broken LRU_RETRY behaviour
>
> From: Dave Chinner <[email protected]>
>
> The LRU_RETRY code assumes that the list traversal status remains
> valid after we have dropped and regained the list lock. Unfortunately,
> this is not
> a valid assumption, and that can lead to racing traversals isolating
> objects that the other traversal expects to be the next item on the
> list.
>
> This is causing problems with the inode cache shrinker isolation,
> with races resulting in an inode on a dispose list being "isolated"
> because a racing traversal still thinks it is on the LRU. The inode
> is then never reclaimed and that causes hangs if a subsequent lookup
> on that inode occurs.
>
> Fix it by always restarting the list walk on a LRU_RETRY return from
> the isolate callback. Avoid the possibility of livelocks the current
> code was trying to avoid by always decrementing the nr_to_walk
> counter on retries so that even if we keep hitting the same item on
> the list we'll eventually stop trying to walk and exit out of the
> situation causing the problem.
>
> Reported-by: Michal Hocko <[email protected]>
> Signed-off-by: Dave Chinner <[email protected]>
> ---
> mm/list_lru.c | 29 ++++++++++++-----------------
> 1 file changed, 12 insertions(+), 17 deletions(-)
>
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index dc71659..7246791 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -71,19 +71,19 @@ list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
> struct list_lru_node *nlru = &lru->node[nid];
> struct list_head *item, *n;
> unsigned long isolated = 0;
> - /*
> - * If we don't keep state of at which pass we are, we can loop at
> - * LRU_RETRY, since we have no guarantees that the caller will be able
> - * to do something other than retry on the next pass. We handle this by
> - * allowing at most one retry per object. This should not be altered
> - * by any condition other than LRU_RETRY.
> - */
> - bool first_pass = true;
>
> spin_lock(&nlru->lock);
> restart:
> list_for_each_safe(item, n, &nlru->list) {
> enum lru_status ret;
> +
> + /*
> + * decrement nr_to_walk first so that we don't livelock if we
> + * get stuck on large numbers of LRU_RETRY items
> + */
> + if (--(*nr_to_walk) == 0)
> + break;
> +
> ret = isolate(item, &nlru->lock, cb_arg);
> switch (ret) {
> case LRU_REMOVED:
> @@ -98,19 +98,14 @@ restart:
> case LRU_SKIP:
> break;
> case LRU_RETRY:
> - if (!first_pass) {
> - first_pass = true;
> - break;
> - }
> - first_pass = false;
> + /*
> + * The lru lock has been dropped, our list traversal is
> + * now invalid and so we have to restart from scratch.
> + */
> goto restart;
> default:
> BUG();
> }
> -
> - if ((*nr_to_walk)-- == 0)
> - break;
> -
> }
>
> spin_unlock(&nlru->lock);
--
Michal Hocko
SUSE Labs
On Thu 04-07-13 18:36:43, Michal Hocko wrote:
> On Wed 03-07-13 21:24:03, Dave Chinner wrote:
> > On Tue, Jul 02, 2013 at 02:44:27PM +0200, Michal Hocko wrote:
> > > On Tue 02-07-13 22:19:47, Dave Chinner wrote:
> > > [...]
> > > > Ok, so it's been leaked from a dispose list somehow. Thanks for the
> > > > info, Michal, it's time to go look at the code....
> > >
> > > OK, just in case we will need it, I am keeping the machine in this state
> > > for now. So we still can play with crash and check all the juicy
> > > internals.
> >
> > My current suspect is the LRU_RETRY code. I don't think what it is
> > doing is at all valid - list_for_each_safe() is not safe if you drop
> > the lock that protects the list. i.e. there is nothing that protects
> > the stored next pointer from being removed from the list by someone
> > else. Hence what I think is occurring is this:
> >
> >
> > thread 1 thread 2
> > lock(lru)
> > list_for_each_safe(lru) lock(lru)
> > isolate ......
> > lock(i_lock)
> > has buffers
> > __iget
> > unlock(i_lock)
> > unlock(lru)
> > ..... (gets lru lock)
> > list_for_each_safe(lru)
> > walks all the inodes
> > finds inode being isolated by other thread
> > isolate
> > i_count > 0
> > list_del_init(i_lru)
> > return LRU_REMOVED;
> > moves to next inode, inode that
> > other thread has stored as next
> > isolate
> > i_state |= I_FREEING
> > list_move(dispose_list)
> > return LRU_REMOVED
> > ....
> > unlock(lru)
> > lock(lru)
> > return LRU_RETRY;
> > if (!first_pass)
> > ....
> > --nr_to_scan
> > (loop again using next, which has already been removed from the
> > LRU by the other thread!)
> > isolate
> > lock(i_lock)
> > if (i_state & ~I_REFERENCED)
> > list_del_init(i_lru) <<<<< inode is on dispose list!
> > <<<<< inode is now isolated, with I_FREEING set
> > return LRU_REMOVED;
> >
> > That fits the corpse left on your machine, Michal. One thread has
> > moved the inode to a dispose list, the other thread thinks it is
> > still on the LRU and should be removed, and removes it.
> >
> > This also explains the lru item count going negative - the same item
> > is being removed from the lru twice. So it seems like all the
> > problems you've been seeing are caused by this one problem....
> >
> > Patch below that should fix this.
>
> Good news! The test was running since morning and it didn't hang nor
> crashed. So this really looks like the right fix. It will run also
> during weekend to be 100% sure. But I guess it is safe to say
Hmm, it seems I was too optimistic, or we have yet another issue here (I
guess the latter is more probable).
The weekend testing got stuck as well.
The dmesg shows there were some hung tasks:
[275284.264312] start.sh (11025): dropped kernel caches: 3
[276962.652076] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
[276962.652087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[276962.652093] xfs-data/sda9 D ffff88001ffb9cc8 0 930 2 0x00000000
[276962.652102] ffff88003794d198 0000000000000046 ffff8800325f4480 0000000000000000
[276962.652113] ffff88003794c010 0000000000012dc0 0000000000012dc0 0000000000012dc0
[276962.652121] 0000000000012dc0 ffff88003794dfd8 ffff88003794dfd8 0000000000012dc0
[276962.652128] Call Trace:
[276962.652151] [<ffffffff812a2c22>] ? __blk_run_queue+0x32/0x40
[276962.652160] [<ffffffff812a31f8>] ? queue_unplugged+0x78/0xb0
[276962.652171] [<ffffffff815793a4>] schedule+0x24/0x70
[276962.652178] [<ffffffff8157948c>] io_schedule+0x9c/0xf0
[276962.652187] [<ffffffff811011a9>] sleep_on_page+0x9/0x10
[276962.652194] [<ffffffff815778ca>] __wait_on_bit+0x5a/0x90
[276962.652200] [<ffffffff811011a0>] ? __lock_page+0x70/0x70
[276962.652206] [<ffffffff8110150f>] wait_on_page_bit+0x6f/0x80
[276962.652215] [<ffffffff81067190>] ? autoremove_wake_function+0x40/0x40
[276962.652224] [<ffffffff81112ee1>] ? page_evictable+0x11/0x50
[276962.652231] [<ffffffff81114e43>] shrink_page_list+0x503/0x790
[276962.652239] [<ffffffff8111570b>] shrink_inactive_list+0x1bb/0x570
[276962.652246] [<ffffffff81115d5f>] ? shrink_active_list+0x29f/0x340
[276962.652254] [<ffffffff81115ef9>] shrink_lruvec+0xf9/0x330
[276962.652262] [<ffffffff8111660a>] mem_cgroup_shrink_node_zone+0xda/0x140
[276962.652274] [<ffffffff81160c28>] ? mem_cgroup_reclaimable+0x108/0x150
[276962.652282] [<ffffffff81163382>] mem_cgroup_soft_reclaim+0xb2/0x140
[276962.652291] [<ffffffff811634af>] mem_cgroup_soft_limit_reclaim+0x9f/0x270
[276962.652298] [<ffffffff81116418>] shrink_zones+0x108/0x220
[276962.652305] [<ffffffff8111776a>] do_try_to_free_pages+0x8a/0x360
[276962.652313] [<ffffffff81117d90>] try_to_free_pages+0x130/0x180
[276962.652323] [<ffffffff8110a2fe>] __alloc_pages_slowpath+0x39e/0x790
[276962.652332] [<ffffffff8110a8ea>] __alloc_pages_nodemask+0x1fa/0x210
[276962.652343] [<ffffffff81151c72>] kmem_getpages+0x62/0x1d0
[276962.652351] [<ffffffff81153869>] fallback_alloc+0x189/0x250
[276962.652359] [<ffffffff8115360d>] ____cache_alloc_node+0x8d/0x160
[276962.652367] [<ffffffff81153e51>] __kmalloc+0x281/0x290
[276962.652490] [<ffffffffa02c6e97>] ? kmem_alloc+0x77/0xe0 [xfs]
[276962.652540] [<ffffffffa02c6e97>] kmem_alloc+0x77/0xe0 [xfs]
[276962.652588] [<ffffffffa02c6e97>] ? kmem_alloc+0x77/0xe0 [xfs]
[276962.652653] [<ffffffffa030a334>] xfs_inode_item_format_extents+0x54/0x100 [xfs]
[276962.652714] [<ffffffffa030a63a>] xfs_inode_item_format+0x25a/0x4f0 [xfs]
[276962.652774] [<ffffffffa03081a0>] xlog_cil_prepare_log_vecs+0xa0/0x170 [xfs]
[276962.652834] [<ffffffffa03082a8>] xfs_log_commit_cil+0x38/0x1c0 [xfs]
[276962.652894] [<ffffffffa0303304>] xfs_trans_commit+0x74/0x260 [xfs]
[276962.652935] [<ffffffffa02ac70b>] xfs_setfilesize+0x12b/0x130 [xfs]
[276962.652947] [<ffffffff81076bd0>] ? __migrate_task+0x150/0x150
[276962.652988] [<ffffffffa02ac985>] xfs_end_io+0x75/0xc0 [xfs]
[276962.652997] [<ffffffff8105e934>] process_one_work+0x1b4/0x380
[276962.653004] [<ffffffff8105f294>] rescuer_thread+0x234/0x320
[276962.653011] [<ffffffff8105f060>] ? free_pwqs+0x30/0x30
[276962.653017] [<ffffffff81066a86>] kthread+0xc6/0xd0
[276962.653025] [<ffffffff810669c0>] ? kthread_freezable_should_stop+0x70/0x70
[276962.653034] [<ffffffff8158303c>] ret_from_fork+0x7c/0xb0
[276962.653041] [<ffffffff810669c0>] ? kthread_freezable_should_stop+0x70/0x70
$ dmesg | grep "blocked for more than"
[276962.652076] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
[276962.653097] INFO: task kworker/2:2:17823 blocked for more than 480 seconds.
[276962.653940] INFO: task ld:14442 blocked for more than 480 seconds.
[276962.654297] INFO: task ld:14962 blocked for more than 480 seconds.
[277442.652123] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
[277442.653153] INFO: task kworker/2:2:17823 blocked for more than 480 seconds.
[277442.653997] INFO: task ld:14442 blocked for more than 480 seconds.
[277442.654353] INFO: task ld:14962 blocked for more than 480 seconds.
[277922.652069] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
[277922.653089] INFO: task kworker/2:2:17823 blocked for more than 480 seconds.
All of them are sitting in io_schedule, triggered from the memcg soft
reclaim, waiting for a wakeup (full dmesg is attached). I guess this has
nothing to do with the slab shrinkers directly. It is probably the
priority 0 reclaim which is done in the soft reclaim path.
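For reference, the "priority 0" part comes from the scan control that
the soft reclaim path sets up; roughly (a simplified field subset from
that era's mem_cgroup_shrink_node_zone() in mm/vmscan.c, exact details
vary between trees):

	struct scan_control sc = {
		.nr_to_reclaim	   = SWAP_CLUSTER_MAX,
		.may_writepage	   = !laptop_mode,
		.may_unmap	   = 1,
		.may_swap	   = !noswap,
		/* priority 0: scan the whole LRU in one pass, no backoff */
		.priority	   = 0,
		.target_mem_cgroup = memcg,
	};

	shrink_lruvec(mem_cgroup_zone_lruvec(zone, memcg), &sc);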
$ uptime
13:32pm up 4 days 2:54, 2 users, load average: 25.00, 24.97, 24.66
so the current timestamp should be 352854 which means that all of them
happened quite some time ago and the system obviously resurrected from
this state.
What is more important, though, is that we still have the following
tasks stuck in D state for hours:
14442 [<ffffffff811011a9>] sleep_on_page+0x9/0x10
[<ffffffff8110150f>] wait_on_page_bit+0x6f/0x80
[<ffffffff81114e43>] shrink_page_list+0x503/0x790
[<ffffffff8111570b>] shrink_inactive_list+0x1bb/0x570
[<ffffffff81115ef9>] shrink_lruvec+0xf9/0x330
[<ffffffff8111660a>] mem_cgroup_shrink_node_zone+0xda/0x140
[<ffffffff81163382>] mem_cgroup_soft_reclaim+0xb2/0x140
[<ffffffff811634af>] mem_cgroup_soft_limit_reclaim+0x9f/0x270
[<ffffffff81116418>] shrink_zones+0x108/0x220
[<ffffffff8111776a>] do_try_to_free_pages+0x8a/0x360
[<ffffffff81117d90>] try_to_free_pages+0x130/0x180
[<ffffffff8110a2fe>] __alloc_pages_slowpath+0x39e/0x790
[<ffffffff8110a8ea>] __alloc_pages_nodemask+0x1fa/0x210
[<ffffffff8114d1b0>] alloc_pages_vma+0xa0/0x120
[<ffffffff8113fe93>] read_swap_cache_async+0x113/0x160
[<ffffffff8113ffe1>] swapin_readahead+0x101/0x190
[<ffffffff8112e93f>] do_swap_page+0xef/0x5e0
[<ffffffff8112f94d>] handle_pte_fault+0x1bd/0x240
[<ffffffff8112fcbf>] handle_mm_fault+0x2ef/0x400
[<ffffffff8157e927>] __do_page_fault+0x237/0x4f0
[<ffffffff8157ebe9>] do_page_fault+0x9/0x10
[<ffffffff8157b348>] page_fault+0x28/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
14962 [<ffffffff811011a9>] sleep_on_page+0x9/0x10
[<ffffffff8110150f>] wait_on_page_bit+0x6f/0x80
[<ffffffff81114e43>] shrink_page_list+0x503/0x790
[<ffffffff8111570b>] shrink_inactive_list+0x1bb/0x570
[<ffffffff81115ef9>] shrink_lruvec+0xf9/0x330
[<ffffffff8111660a>] mem_cgroup_shrink_node_zone+0xda/0x140
[<ffffffff81163382>] mem_cgroup_soft_reclaim+0xb2/0x140
[<ffffffff811634af>] mem_cgroup_soft_limit_reclaim+0x9f/0x270
[<ffffffff81116418>] shrink_zones+0x108/0x220
[<ffffffff8111776a>] do_try_to_free_pages+0x8a/0x360
[<ffffffff81117d90>] try_to_free_pages+0x130/0x180
[<ffffffff8110a2fe>] __alloc_pages_slowpath+0x39e/0x790
[<ffffffff8110a8ea>] __alloc_pages_nodemask+0x1fa/0x210
[<ffffffff8114d1b0>] alloc_pages_vma+0xa0/0x120
[<ffffffff81129ebb>] do_anonymous_page+0x16b/0x350
[<ffffffff8112f9c5>] handle_pte_fault+0x235/0x240
[<ffffffff8112fcbf>] handle_mm_fault+0x2ef/0x400
[<ffffffff8157e927>] __do_page_fault+0x237/0x4f0
[<ffffffff8157ebe9>] do_page_fault+0x9/0x10
[<ffffffff8157b348>] page_fault+0x28/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
20757 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
[<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs]
[<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs]
[<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs]
[<ffffffffa02c62d0>] xfs_free_eofblocks+0x180/0x250 [xfs]
[<ffffffffa02c68e6>] xfs_release+0x106/0x1d0 [xfs]
[<ffffffffa02b3b20>] xfs_file_release+0x10/0x20 [xfs]
[<ffffffff8116c86d>] __fput+0xbd/0x240
[<ffffffff8116ca49>] ____fput+0x9/0x10
[<ffffffff81063221>] task_work_run+0xb1/0xe0
[<ffffffff810029e0>] do_notify_resume+0x90/0x1d0
[<ffffffff815833a2>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
20758 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
[<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs]
[<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs]
[<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs]
[<ffffffffa02c5999>] xfs_create+0x1a9/0x5c0 [xfs]
[<ffffffffa02bccca>] xfs_vn_mknod+0x8a/0x1a0 [xfs]
[<ffffffffa02bce0e>] xfs_vn_create+0xe/0x10 [xfs]
[<ffffffff811763dd>] vfs_create+0xad/0xd0
[<ffffffff81177e68>] lookup_open+0x1b8/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
20761 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
[<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs]
[<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs]
[<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs]
[<ffffffffa02c5999>] xfs_create+0x1a9/0x5c0 [xfs]
[<ffffffffa02bccca>] xfs_vn_mknod+0x8a/0x1a0 [xfs]
[<ffffffffa02bce0e>] xfs_vn_create+0xe/0x10 [xfs]
[<ffffffff811763dd>] vfs_create+0xad/0xd0
[<ffffffff81177e68>] lookup_open+0x1b8/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
Hohmm, now that I am looking at the pids of the stuck processes, two of
them, 14442 and 14962, are mentioned in the soft lockup warnings. It is
really weird that the lockups stopped quite some time ago (~20h ago).
I am keeping the system in this state in case you want to examine
details via crash again.
Let me know whether you need any further details.
Thanks!
--
Michal Hocko
SUSE Labs
On Mon, 8 Jul 2013 14:53:52 +0200 Michal Hocko <[email protected]> wrote:
> > Good news! The test was running since morning and it didn't hang nor
> > crashed. So this really looks like the right fix. It will run also
> > during weekend to be 100% sure. But I guess it is safe to say
>
> Hmm, it seems I was too optimistic or we have yet another issue here (I
> guess the later is more probable).
>
> The weekend testing got stuck as well.
>
> The dmesg shows there were some hung tasks:
That looks like the classic "we lost an IO completion" trace.
I think it would be prudent to defer these patches into 3.12.
On Mon, Jul 08, 2013 at 02:53:52PM +0200, Michal Hocko wrote:
> On Thu 04-07-13 18:36:43, Michal Hocko wrote:
> > On Wed 03-07-13 21:24:03, Dave Chinner wrote:
> > > On Tue, Jul 02, 2013 at 02:44:27PM +0200, Michal Hocko wrote:
> > > > On Tue 02-07-13 22:19:47, Dave Chinner wrote:
> > > > [...]
> > > > > Ok, so it's been leaked from a dispose list somehow. Thanks for the
> > > > > info, Michal, it's time to go look at the code....
> > > >
> > > > OK, just in case we will need it, I am keeping the machine in this state
> > > > for now. So we still can play with crash and check all the juicy
> > > > internals.
> > >
> > > My current suspect is the LRU_RETRY code. I don't think what it is
> > > doing is at all valid - list_for_each_safe() is not safe if you drop
> > > the lock that protects the list. i.e. there is nothing that protects
> > > the stored next pointer from being removed from the list by someone
> > > else. Hence what I think is occurring is this:
> > >
> > >
> > > thread 1 thread 2
> > > lock(lru)
> > > list_for_each_safe(lru) lock(lru)
> > > isolate ......
> > > lock(i_lock)
> > > has buffers
> > > __iget
> > > unlock(i_lock)
> > > unlock(lru)
> > > ..... (gets lru lock)
> > > list_for_each_safe(lru)
> > > walks all the inodes
> > > finds inode being isolated by other thread
> > > isolate
> > > i_count > 0
> > > list_del_init(i_lru)
> > > return LRU_REMOVED;
> > > moves to next inode, inode that
> > > other thread has stored as next
> > > isolate
> > > i_state |= I_FREEING
> > > list_move(dispose_list)
> > > return LRU_REMOVED
> > > ....
> > > unlock(lru)
> > > lock(lru)
> > > return LRU_RETRY;
> > > if (!first_pass)
> > > ....
> > > --nr_to_scan
> > > (loop again using next, which has already been removed from the
> > > LRU by the other thread!)
> > > isolate
> > > lock(i_lock)
> > > if (i_state & ~I_REFERENCED)
> > > list_del_init(i_lru) <<<<< inode is on dispose list!
> > > <<<<< inode is now isolated, with I_FREEING set
> > > return LRU_REMOVED;
> > >
> > > That fits the corpse left on your machine, Michal. One thread has
> > > moved the inode to a dispose list, the other thread thinks it is
> > > still on the LRU and should be removed, and removes it.
> > >
> > > This also explains the lru item count going negative - the same item
> > > is being removed from the lru twice. So it seems like all the
> > > problems you've been seeing are caused by this one problem....
> > >
> > > Patch below that should fix this.
> >
> > Good news! The test was running since morning and it didn't hang nor
> > crashed. So this really looks like the right fix. It will run also
> > during weekend to be 100% sure. But I guess it is safe to say
>
> Hmm, it seems I was too optimistic or we have yet another issue here (I
> guess the later is more probable).
>
> The weekend testing got stuck as well.
>
> The dmesg shows there were some hung tasks:
> [275284.264312] start.sh (11025): dropped kernel caches: 3
> [276962.652076] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
> [276962.652087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [276962.652093] xfs-data/sda9 D ffff88001ffb9cc8 0 930 2 0x00000000
> [276962.652102] ffff88003794d198 0000000000000046 ffff8800325f4480 0000000000000000
> [276962.652113] ffff88003794c010 0000000000012dc0 0000000000012dc0 0000000000012dc0
> [276962.652121] 0000000000012dc0 ffff88003794dfd8 ffff88003794dfd8 0000000000012dc0
> [276962.652128] Call Trace:
> [276962.652151] [<ffffffff812a2c22>] ? __blk_run_queue+0x32/0x40
> [276962.652160] [<ffffffff812a31f8>] ? queue_unplugged+0x78/0xb0
> [276962.652171] [<ffffffff815793a4>] schedule+0x24/0x70
> [276962.652178] [<ffffffff8157948c>] io_schedule+0x9c/0xf0
> [276962.652187] [<ffffffff811011a9>] sleep_on_page+0x9/0x10
> [276962.652194] [<ffffffff815778ca>] __wait_on_bit+0x5a/0x90
> [276962.652200] [<ffffffff811011a0>] ? __lock_page+0x70/0x70
> [276962.652206] [<ffffffff8110150f>] wait_on_page_bit+0x6f/0x80
> [276962.652215] [<ffffffff81067190>] ? autoremove_wake_function+0x40/0x40
> [276962.652224] [<ffffffff81112ee1>] ? page_evictable+0x11/0x50
> [276962.652231] [<ffffffff81114e43>] shrink_page_list+0x503/0x790
> [276962.652239] [<ffffffff8111570b>] shrink_inactive_list+0x1bb/0x570
> [276962.652246] [<ffffffff81115d5f>] ? shrink_active_list+0x29f/0x340
> [276962.652254] [<ffffffff81115ef9>] shrink_lruvec+0xf9/0x330
> [276962.652262] [<ffffffff8111660a>] mem_cgroup_shrink_node_zone+0xda/0x140
> [276962.652274] [<ffffffff81160c28>] ? mem_cgroup_reclaimable+0x108/0x150
> [276962.652282] [<ffffffff81163382>] mem_cgroup_soft_reclaim+0xb2/0x140
> [276962.652291] [<ffffffff811634af>] mem_cgroup_soft_limit_reclaim+0x9f/0x270
> [276962.652298] [<ffffffff81116418>] shrink_zones+0x108/0x220
> [276962.652305] [<ffffffff8111776a>] do_try_to_free_pages+0x8a/0x360
> [276962.652313] [<ffffffff81117d90>] try_to_free_pages+0x130/0x180
> [276962.652323] [<ffffffff8110a2fe>] __alloc_pages_slowpath+0x39e/0x790
> [276962.652332] [<ffffffff8110a8ea>] __alloc_pages_nodemask+0x1fa/0x210
> [276962.652343] [<ffffffff81151c72>] kmem_getpages+0x62/0x1d0
> [276962.652351] [<ffffffff81153869>] fallback_alloc+0x189/0x250
> [276962.652359] [<ffffffff8115360d>] ____cache_alloc_node+0x8d/0x160
> [276962.652367] [<ffffffff81153e51>] __kmalloc+0x281/0x290
> [276962.652490] [<ffffffffa02c6e97>] ? kmem_alloc+0x77/0xe0 [xfs]
> [276962.652540] [<ffffffffa02c6e97>] kmem_alloc+0x77/0xe0 [xfs]
> [276962.652588] [<ffffffffa02c6e97>] ? kmem_alloc+0x77/0xe0 [xfs]
> [276962.652653] [<ffffffffa030a334>] xfs_inode_item_format_extents+0x54/0x100 [xfs]
> [276962.652714] [<ffffffffa030a63a>] xfs_inode_item_format+0x25a/0x4f0 [xfs]
> [276962.652774] [<ffffffffa03081a0>] xlog_cil_prepare_log_vecs+0xa0/0x170 [xfs]
> [276962.652834] [<ffffffffa03082a8>] xfs_log_commit_cil+0x38/0x1c0 [xfs]
> [276962.652894] [<ffffffffa0303304>] xfs_trans_commit+0x74/0x260 [xfs]
> [276962.652935] [<ffffffffa02ac70b>] xfs_setfilesize+0x12b/0x130 [xfs]
> [276962.652947] [<ffffffff81076bd0>] ? __migrate_task+0x150/0x150
> [276962.652988] [<ffffffffa02ac985>] xfs_end_io+0x75/0xc0 [xfs]
> [276962.652997] [<ffffffff8105e934>] process_one_work+0x1b4/0x380
> [276962.653004] [<ffffffff8105f294>] rescuer_thread+0x234/0x320
> [276962.653011] [<ffffffff8105f060>] ? free_pwqs+0x30/0x30
> [276962.653017] [<ffffffff81066a86>] kthread+0xc6/0xd0
> [276962.653025] [<ffffffff810669c0>] ? kthread_freezable_should_stop+0x70/0x70
> [276962.653034] [<ffffffff8158303c>] ret_from_fork+0x7c/0xb0
> [276962.653041] [<ffffffff810669c0>] ? kthread_freezable_should_stop+0x70/0x70
>
> $ dmesg | grep "blocked for more than"
> [276962.652076] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
> [276962.653097] INFO: task kworker/2:2:17823 blocked for more than 480 seconds.
> [276962.653940] INFO: task ld:14442 blocked for more than 480 seconds.
> [276962.654297] INFO: task ld:14962 blocked for more than 480 seconds.
> [277442.652123] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
> [277442.653153] INFO: task kworker/2:2:17823 blocked for more than 480 seconds.
> [277442.653997] INFO: task ld:14442 blocked for more than 480 seconds.
> [277442.654353] INFO: task ld:14962 blocked for more than 480 seconds.
> [277922.652069] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
> [277922.653089] INFO: task kworker/2:2:17823 blocked for more than 480 seconds.
>
You seem to have switched to XFS. Dave posted a patch two days ago fixing some
missing conversions on the XFS side. AFAIK, Andrew hasn't picked up the patch yet.
Are you running with that patch applied?
On Mon, Jul 08, 2013 at 02:04:19PM -0700, Andrew Morton wrote:
> On Mon, 8 Jul 2013 14:53:52 +0200 Michal Hocko <[email protected]> wrote:
>
> > > Good news! The test was running since morning and it didn't hang nor
> > > crashed. So this really looks like the right fix. It will run also
> > > during weekend to be 100% sure. But I guess it is safe to say
> >
> > Hmm, it seems I was too optimistic or we have yet another issue here (I
> > guess the later is more probable).
> >
> > The weekend testing got stuck as well.
> >
> > The dmesg shows there were some hung tasks:
>
> That looks like the classic "we lost an IO completion" trace.
>
> I think it would be prudent to defer these patches into 3.12.
Agree.
Will they stay in -mm, or do I have to resend?
On Tue, 9 Jul 2013 21:32:51 +0400 Glauber Costa <[email protected]> wrote:
> > $ dmesg | grep "blocked for more than"
> > [276962.652076] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
> > [276962.653097] INFO: task kworker/2:2:17823 blocked for more than 480 seconds.
> > [276962.653940] INFO: task ld:14442 blocked for more than 480 seconds.
> > [276962.654297] INFO: task ld:14962 blocked for more than 480 seconds.
> > [277442.652123] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
> > [277442.653153] INFO: task kworker/2:2:17823 blocked for more than 480 seconds.
> > [277442.653997] INFO: task ld:14442 blocked for more than 480 seconds.
> > [277442.654353] INFO: task ld:14962 blocked for more than 480 seconds.
> > [277922.652069] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
> > [277922.653089] INFO: task kworker/2:2:17823 blocked for more than 480 seconds.
> >
>
> You seem to have switched to XFS. Dave posted a patch two days ago fixing some
> missing conversions in the XFS side. AFAIK, Andrew hasn't yet picked the patch.
I can't find that patch. Please resend?
There's also "list_lru: fix broken LRU_RETRY behaviour", which I
assume we need?
On Tue, 9 Jul 2013 21:34:08 +0400 Glauber Costa <[email protected]> wrote:
> On Mon, Jul 08, 2013 at 02:04:19PM -0700, Andrew Morton wrote:
> > On Mon, 8 Jul 2013 14:53:52 +0200 Michal Hocko <[email protected]> wrote:
> >
> > > > Good news! The test was running since morning and it didn't hang nor
> > > > crashed. So this really looks like the right fix. It will run also
> > > > during weekend to be 100% sure. But I guess it is safe to say
> > >
> > > Hmm, it seems I was too optimistic or we have yet another issue here (I
> > > guess the later is more probable).
> > >
> > > The weekend testing got stuck as well.
> > >
> > > The dmesg shows there were some hung tasks:
> >
> > That looks like the classic "we lost an IO completion" trace.
> >
> > I think it would be prudent to defer these patches into 3.12.
> Agree.
>
> Will they still in -mm, or do I have to resend ?
No, I don't intend to drop them from -mm.
On Tue, Jul 09, 2013 at 10:50:32AM -0700, Andrew Morton wrote:
> On Tue, 9 Jul 2013 21:32:51 +0400 Glauber Costa <[email protected]> wrote:
>
> > > $ dmesg | grep "blocked for more than"
> > > [276962.652076] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
> > > [276962.653097] INFO: task kworker/2:2:17823 blocked for more than 480 seconds.
> > > [276962.653940] INFO: task ld:14442 blocked for more than 480 seconds.
> > > [276962.654297] INFO: task ld:14962 blocked for more than 480 seconds.
> > > [277442.652123] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
> > > [277442.653153] INFO: task kworker/2:2:17823 blocked for more than 480 seconds.
> > > [277442.653997] INFO: task ld:14442 blocked for more than 480 seconds.
> > > [277442.654353] INFO: task ld:14962 blocked for more than 480 seconds.
> > > [277922.652069] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
> > > [277922.653089] INFO: task kworker/2:2:17823 blocked for more than 480 seconds.
> > >
> >
> > You seem to have switched to XFS. Dave posted a patch two days ago fixing some
> > missing conversions in the XFS side. AFAIK, Andrew hasn't yet picked the patch.
>
> I can't find that patch. Please resend?
>
> There's also "list_lru: fix broken LRU_RETRY behaviour", which I
> assume we need?
Yes, we can either apply or stash that one - up to you.
On Tue 09-07-13 21:32:51, Glauber Costa wrote:
[...]
> You seem to have switched to XFS.
Yes, to make sure that the original hang is not fs specific. I can
switch to another fs if it helps. This seems to be really hard to
reproduce now, so I would rather not change things if possible.
> Dave posted a patch two days ago fixing some missing conversions in
> the XFS side. AFAIK, Andrew hasn't yet picked the patch.
Could you point me to those patches, please?
> Are you running with that patch applied?
I am currently running with "list_lru: fix broken LRU_RETRY behaviour"
--
Michal Hocko
SUSE Labs
On Tue, 9 Jul 2013 19:57:49 +0200 Michal Hocko <[email protected]> wrote:
> On Tue 09-07-13 21:32:51, Glauber Costa wrote:
> [...]
> > You seem to have switched to XFS.
>
> Yes, to make sure that the original hang is not fs specific. I can
> switch to other fs if it helps. This seems to be really hard to
> reproduce now so I would rather not change things if possible.
>
> > Dave posted a patch two days ago fixing some missing conversions in
> > the XFS side. AFAIK, Andrew hasn't yet picked the patch.
>
> Could you point me to those patches, please?
This one:
From: Dave Chinner <[email protected]>
Subject: xfs: fix dquot isolation hang
The new LRU list isolation code in xfs_qm_dquot_isolate() isn't
completely up to date. Firstly, it needs conversion to return enum
lru_status values, not raw numbers. Secondly - most importantly - it
fails to unlock the dquot and relock the LRU in the LRU_RETRY path.
This leads to deadlocks in xfstests generic/232. Fix them.
Signed-off-by: Dave Chinner <[email protected]>
Cc: Glauber Costa <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/xfs/xfs_qm.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff -puN fs/xfs/xfs_qm.c~xfs-convert-dquot-cache-lru-to-list_lru-fix-dquot-isolation-hang fs/xfs/xfs_qm.c
--- a/fs/xfs/xfs_qm.c~xfs-convert-dquot-cache-lru-to-list_lru-fix-dquot-isolation-hang
+++ a/fs/xfs/xfs_qm.c
@@ -659,7 +659,7 @@ xfs_qm_dquot_isolate(
trace_xfs_dqreclaim_want(dqp);
list_del_init(&dqp->q_lru);
XFS_STATS_DEC(xs_qm_dquot_unused);
- return 0;
+ return LRU_REMOVED;
}
/*
@@ -705,17 +705,19 @@ xfs_qm_dquot_isolate(
XFS_STATS_DEC(xs_qm_dquot_unused);
trace_xfs_dqreclaim_done(dqp);
XFS_STATS_INC(xs_qm_dqreclaims);
- return 0;
+ return LRU_REMOVED;
out_miss_busy:
trace_xfs_dqreclaim_busy(dqp);
XFS_STATS_INC(xs_qm_dqreclaim_misses);
- return 2;
+ return LRU_SKIP;
out_unlock_dirty:
trace_xfs_dqreclaim_busy(dqp);
XFS_STATS_INC(xs_qm_dqreclaim_misses);
- return 3;
+ xfs_dqunlock(dqp);
+ spin_lock(lru_lock);
+ return LRU_RETRY;
}
static unsigned long
_
On Mon, Jul 08, 2013 at 02:53:52PM +0200, Michal Hocko wrote:
> On Thu 04-07-13 18:36:43, Michal Hocko wrote:
> > On Wed 03-07-13 21:24:03, Dave Chinner wrote:
> > > On Tue, Jul 02, 2013 at 02:44:27PM +0200, Michal Hocko wrote:
> > > > On Tue 02-07-13 22:19:47, Dave Chinner wrote:
> > > > [...]
> > > > > Ok, so it's been leaked from a dispose list somehow. Thanks for the
> > > > > info, Michal, it's time to go look at the code....
> > > >
> > > > OK, just in case we will need it, I am keeping the machine in this state
> > > > for now. So we still can play with crash and check all the juicy
> > > > internals.
> > >
> > > My current suspect is the LRU_RETRY code. I don't think what it is
> > > doing is at all valid - list_for_each_safe() is not safe if you drop
> > > the lock that protects the list. i.e. there is nothing that protects
> > > the stored next pointer from being removed from the list by someone
> > > else. Hence what I think is occurring is this:
> > >
> > >
> > > thread 1 thread 2
> > > lock(lru)
> > > list_for_each_safe(lru) lock(lru)
> > > isolate ......
> > > lock(i_lock)
> > > has buffers
> > > __iget
> > > unlock(i_lock)
> > > unlock(lru)
> > > ..... (gets lru lock)
> > > list_for_each_safe(lru)
> > > walks all the inodes
> > > finds inode being isolated by other thread
> > > isolate
> > > i_count > 0
> > > list_del_init(i_lru)
> > > return LRU_REMOVED;
> > > moves to next inode, inode that
> > > other thread has stored as next
> > > isolate
> > > i_state |= I_FREEING
> > > list_move(dispose_list)
> > > return LRU_REMOVED
> > > ....
> > > unlock(lru)
> > > lock(lru)
> > > return LRU_RETRY;
> > > if (!first_pass)
> > > ....
> > > --nr_to_scan
> > > (loop again using next, which has already been removed from the
> > > LRU by the other thread!)
> > > isolate
> > > lock(i_lock)
> > > if (i_state & ~I_REFERENCED)
> > > list_del_init(i_lru) <<<<< inode is on dispose list!
> > > <<<<< inode is now isolated, with I_FREEING set
> > > return LRU_REMOVED;
> > >
> > > That fits the corpse left on your machine, Michal. One thread has
> > > moved the inode to a dispose list, the other thread thinks it is
> > > still on the LRU and should be removed, and removes it.
> > >
> > > This also explains the lru item count going negative - the same item
> > > is being removed from the lru twice. So it seems like all the
> > > problems you've been seeing are caused by this one problem....
> > >
> > > Patch below that should fix this.
> >
> > Good news! The test was running since morning and it didn't hang nor
> > crashed. So this really looks like the right fix. It will run also
> > during weekend to be 100% sure. But I guess it is safe to say
>
> Hmm, it seems I was too optimistic or we have yet another issue here (I
> guess the later is more probable).
>
> The weekend testing got stuck as well.
....
> 20761 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
> [<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs]
> [<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs]
> [<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs]
> [<ffffffffa02c5999>] xfs_create+0x1a9/0x5c0 [xfs]
> [<ffffffffa02bccca>] xfs_vn_mknod+0x8a/0x1a0 [xfs]
> [<ffffffffa02bce0e>] xfs_vn_create+0xe/0x10 [xfs]
> [<ffffffff811763dd>] vfs_create+0xad/0xd0
> [<ffffffff81177e68>] lookup_open+0x1b8/0x1d0
> [<ffffffff8117815e>] do_last+0x2de/0x780
> [<ffffffff8117ae9a>] path_openat+0xda/0x400
> [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> [<ffffffff81168f9c>] sys_open+0x1c/0x20
> [<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
That's an XFS log space issue, indicating that it has run out of
space in the log and is waiting for more to come free. That
requires IO completion to occur.
> [276962.652076] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
> [276962.652087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [276962.652093] xfs-data/sda9 D ffff88001ffb9cc8 0 930 2 0x00000000
Oh, that's why. This is the IO completion worker...
> [276962.652102] ffff88003794d198 0000000000000046 ffff8800325f4480 0000000000000000
> [276962.652113] ffff88003794c010 0000000000012dc0 0000000000012dc0 0000000000012dc0
> [276962.652121] 0000000000012dc0 ffff88003794dfd8 ffff88003794dfd8 0000000000012dc0
> [276962.652128] Call Trace:
> [276962.652151] [<ffffffff812a2c22>] ? __blk_run_queue+0x32/0x40
> [276962.652160] [<ffffffff812a31f8>] ? queue_unplugged+0x78/0xb0
> [276962.652171] [<ffffffff815793a4>] schedule+0x24/0x70
> [276962.652178] [<ffffffff8157948c>] io_schedule+0x9c/0xf0
> [276962.652187] [<ffffffff811011a9>] sleep_on_page+0x9/0x10
> [276962.652194] [<ffffffff815778ca>] __wait_on_bit+0x5a/0x90
> [276962.652200] [<ffffffff811011a0>] ? __lock_page+0x70/0x70
> [276962.652206] [<ffffffff8110150f>] wait_on_page_bit+0x6f/0x80
> [276962.652215] [<ffffffff81067190>] ? autoremove_wake_function+0x40/0x40
> [276962.652224] [<ffffffff81112ee1>] ? page_evictable+0x11/0x50
> [276962.652231] [<ffffffff81114e43>] shrink_page_list+0x503/0x790
> [276962.652239] [<ffffffff8111570b>] shrink_inactive_list+0x1bb/0x570
> [276962.652246] [<ffffffff81115d5f>] ? shrink_active_list+0x29f/0x340
> [276962.652254] [<ffffffff81115ef9>] shrink_lruvec+0xf9/0x330
> [276962.652262] [<ffffffff8111660a>] mem_cgroup_shrink_node_zone+0xda/0x140
> [276962.652274] [<ffffffff81160c28>] ? mem_cgroup_reclaimable+0x108/0x150
> [276962.652282] [<ffffffff81163382>] mem_cgroup_soft_reclaim+0xb2/0x140
> [276962.652291] [<ffffffff811634af>] mem_cgroup_soft_limit_reclaim+0x9f/0x270
> [276962.652298] [<ffffffff81116418>] shrink_zones+0x108/0x220
> [276962.652305] [<ffffffff8111776a>] do_try_to_free_pages+0x8a/0x360
> [276962.652313] [<ffffffff81117d90>] try_to_free_pages+0x130/0x180
> [276962.652323] [<ffffffff8110a2fe>] __alloc_pages_slowpath+0x39e/0x790
> [276962.652332] [<ffffffff8110a8ea>] __alloc_pages_nodemask+0x1fa/0x210
> [276962.652343] [<ffffffff81151c72>] kmem_getpages+0x62/0x1d0
> [276962.652351] [<ffffffff81153869>] fallback_alloc+0x189/0x250
> [276962.652359] [<ffffffff8115360d>] ____cache_alloc_node+0x8d/0x160
> [276962.652367] [<ffffffff81153e51>] __kmalloc+0x281/0x290
> [276962.652490] [<ffffffffa02c6e97>] ? kmem_alloc+0x77/0xe0 [xfs]
> [276962.652540] [<ffffffffa02c6e97>] kmem_alloc+0x77/0xe0 [xfs]
> [276962.652588] [<ffffffffa02c6e97>] ? kmem_alloc+0x77/0xe0 [xfs]
> [276962.652653] [<ffffffffa030a334>] xfs_inode_item_format_extents+0x54/0x100 [xfs]
> [276962.652714] [<ffffffffa030a63a>] xfs_inode_item_format+0x25a/0x4f0 [xfs]
> [276962.652774] [<ffffffffa03081a0>] xlog_cil_prepare_log_vecs+0xa0/0x170 [xfs]
> [276962.652834] [<ffffffffa03082a8>] xfs_log_commit_cil+0x38/0x1c0 [xfs]
> [276962.652894] [<ffffffffa0303304>] xfs_trans_commit+0x74/0x260 [xfs]
> [276962.652935] [<ffffffffa02ac70b>] xfs_setfilesize+0x12b/0x130 [xfs]
> [276962.652947] [<ffffffff81076bd0>] ? __migrate_task+0x150/0x150
> [276962.652988] [<ffffffffa02ac985>] xfs_end_io+0x75/0xc0 [xfs]
> [276962.652997] [<ffffffff8105e934>] process_one_work+0x1b4/0x380
... is running IO completion work and trying to commit a transaction
that is blocked in memory allocation which is waiting for IO
completion. It's disappeared up its own fundamental orifice.
Ok, this has absolutely nothing to do with the LRU changes - this is
a pre-existing XFS/mm interaction problem from around 3.2. The
question is now this: how the hell do I get memory allocation to not
block waiting on IO completion here? This is already being done in
GFP_NOFS allocation context here....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed 10-07-13 12:31:39, Dave Chinner wrote:
> On Mon, Jul 08, 2013 at 02:53:52PM +0200, Michal Hocko wrote:
[...]
> > Hmm, it seems I was too optimistic or we have yet another issue here (I
> > guess the later is more probable).
> >
> > The weekend testing got stuck as well.
> ....
> > 20761 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
> > [<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs]
> > [<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs]
> > [<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs]
> > [<ffffffffa02c5999>] xfs_create+0x1a9/0x5c0 [xfs]
> > [<ffffffffa02bccca>] xfs_vn_mknod+0x8a/0x1a0 [xfs]
> > [<ffffffffa02bce0e>] xfs_vn_create+0xe/0x10 [xfs]
> > [<ffffffff811763dd>] vfs_create+0xad/0xd0
> > [<ffffffff81177e68>] lookup_open+0x1b8/0x1d0
> > [<ffffffff8117815e>] do_last+0x2de/0x780
> > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > [<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
>
> That's an XFS log space issue, indicating that it has run out of
> space in IO the log and it is waiting for more to come free. That
> requires IO completion to occur.
>
> > [276962.652076] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
> > [276962.652087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [276962.652093] xfs-data/sda9 D ffff88001ffb9cc8 0 930 2 0x00000000
>
> Oh, that's why. This is the IO completion worker...
>
> > [276962.652102] ffff88003794d198 0000000000000046 ffff8800325f4480 0000000000000000
> > [276962.652113] ffff88003794c010 0000000000012dc0 0000000000012dc0 0000000000012dc0
> > [276962.652121] 0000000000012dc0 ffff88003794dfd8 ffff88003794dfd8 0000000000012dc0
> > [276962.652128] Call Trace:
> > [276962.652151] [<ffffffff812a2c22>] ? __blk_run_queue+0x32/0x40
> > [276962.652160] [<ffffffff812a31f8>] ? queue_unplugged+0x78/0xb0
> > [276962.652171] [<ffffffff815793a4>] schedule+0x24/0x70
> > [276962.652178] [<ffffffff8157948c>] io_schedule+0x9c/0xf0
> > [276962.652187] [<ffffffff811011a9>] sleep_on_page+0x9/0x10
> > [276962.652194] [<ffffffff815778ca>] __wait_on_bit+0x5a/0x90
> > [276962.652200] [<ffffffff811011a0>] ? __lock_page+0x70/0x70
> > [276962.652206] [<ffffffff8110150f>] wait_on_page_bit+0x6f/0x80
> > [276962.652215] [<ffffffff81067190>] ? autoremove_wake_function+0x40/0x40
> > [276962.652224] [<ffffffff81112ee1>] ? page_evictable+0x11/0x50
> > [276962.652231] [<ffffffff81114e43>] shrink_page_list+0x503/0x790
> > [276962.652239] [<ffffffff8111570b>] shrink_inactive_list+0x1bb/0x570
> > [276962.652246] [<ffffffff81115d5f>] ? shrink_active_list+0x29f/0x340
> > [276962.652254] [<ffffffff81115ef9>] shrink_lruvec+0xf9/0x330
> > [276962.652262] [<ffffffff8111660a>] mem_cgroup_shrink_node_zone+0xda/0x140
> > [276962.652274] [<ffffffff81160c28>] ? mem_cgroup_reclaimable+0x108/0x150
> > [276962.652282] [<ffffffff81163382>] mem_cgroup_soft_reclaim+0xb2/0x140
> > [276962.652291] [<ffffffff811634af>] mem_cgroup_soft_limit_reclaim+0x9f/0x270
> > [276962.652298] [<ffffffff81116418>] shrink_zones+0x108/0x220
> > [276962.652305] [<ffffffff8111776a>] do_try_to_free_pages+0x8a/0x360
> > [276962.652313] [<ffffffff81117d90>] try_to_free_pages+0x130/0x180
> > [276962.652323] [<ffffffff8110a2fe>] __alloc_pages_slowpath+0x39e/0x790
> > [276962.652332] [<ffffffff8110a8ea>] __alloc_pages_nodemask+0x1fa/0x210
> > [276962.652343] [<ffffffff81151c72>] kmem_getpages+0x62/0x1d0
> > [276962.652351] [<ffffffff81153869>] fallback_alloc+0x189/0x250
> > [276962.652359] [<ffffffff8115360d>] ____cache_alloc_node+0x8d/0x160
> > [276962.652367] [<ffffffff81153e51>] __kmalloc+0x281/0x290
> > [276962.652490] [<ffffffffa02c6e97>] ? kmem_alloc+0x77/0xe0 [xfs]
> > [276962.652540] [<ffffffffa02c6e97>] kmem_alloc+0x77/0xe0 [xfs]
> > [276962.652588] [<ffffffffa02c6e97>] ? kmem_alloc+0x77/0xe0 [xfs]
> > [276962.652653] [<ffffffffa030a334>] xfs_inode_item_format_extents+0x54/0x100 [xfs]
> > [276962.652714] [<ffffffffa030a63a>] xfs_inode_item_format+0x25a/0x4f0 [xfs]
> > [276962.652774] [<ffffffffa03081a0>] xlog_cil_prepare_log_vecs+0xa0/0x170 [xfs]
> > [276962.652834] [<ffffffffa03082a8>] xfs_log_commit_cil+0x38/0x1c0 [xfs]
> > [276962.652894] [<ffffffffa0303304>] xfs_trans_commit+0x74/0x260 [xfs]
> > [276962.652935] [<ffffffffa02ac70b>] xfs_setfilesize+0x12b/0x130 [xfs]
> > [276962.652947] [<ffffffff81076bd0>] ? __migrate_task+0x150/0x150
> > [276962.652988] [<ffffffffa02ac985>] xfs_end_io+0x75/0xc0 [xfs]
> > [276962.652997] [<ffffffff8105e934>] process_one_work+0x1b4/0x380
>
> ... is running IO completion work and trying to commit a transaction
> that is blocked in memory allocation which is waiting for IO
> completion. It's disappeared up it's own fundamental orifice.
>
> Ok, this has absolutely nothing to do with the LRU changes - this is
> a pre-existing XFS/mm interaction problem from around 3.2.
OK. I am retesting with ext3 now.
--
Michal Hocko
SUSE Labs
On Wed 10-07-13 12:31:39, Dave Chinner wrote:
[...]
> > 20761 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
> > [<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs]
> > [<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs]
> > [<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs]
> > [<ffffffffa02c5999>] xfs_create+0x1a9/0x5c0 [xfs]
> > [<ffffffffa02bccca>] xfs_vn_mknod+0x8a/0x1a0 [xfs]
> > [<ffffffffa02bce0e>] xfs_vn_create+0xe/0x10 [xfs]
> > [<ffffffff811763dd>] vfs_create+0xad/0xd0
> > [<ffffffff81177e68>] lookup_open+0x1b8/0x1d0
> > [<ffffffff8117815e>] do_last+0x2de/0x780
> > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > [<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
>
> That's an XFS log space issue, indicating that it has run out of
> space in IO the log and it is waiting for more to come free. That
> requires IO completion to occur.
>
> > [276962.652076] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
> > [276962.652087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [276962.652093] xfs-data/sda9 D ffff88001ffb9cc8 0 930 2 0x00000000
>
> Oh, that's why. This is the IO completion worker...
But that task doesn't seem to be stuck anymore (at least the lockup
watchdog doesn't report it anymore and I have already rebooted to test
with ext3 :/). I am sorry if these lockup logs were more confusing than
helpful, but they happened a _long_ time ago and the system obviously
recovered from them. I am pasting only the traces for processes in D
state here again for reference.
14442 [<ffffffff811011a9>] sleep_on_page+0x9/0x10
[<ffffffff8110150f>] wait_on_page_bit+0x6f/0x80
[<ffffffff81114e43>] shrink_page_list+0x503/0x790
[<ffffffff8111570b>] shrink_inactive_list+0x1bb/0x570
[<ffffffff81115ef9>] shrink_lruvec+0xf9/0x330
[<ffffffff8111660a>] mem_cgroup_shrink_node_zone+0xda/0x140
[<ffffffff81163382>] mem_cgroup_soft_reclaim+0xb2/0x140
[<ffffffff811634af>] mem_cgroup_soft_limit_reclaim+0x9f/0x270
[<ffffffff81116418>] shrink_zones+0x108/0x220
[<ffffffff8111776a>] do_try_to_free_pages+0x8a/0x360
[<ffffffff81117d90>] try_to_free_pages+0x130/0x180
[<ffffffff8110a2fe>] __alloc_pages_slowpath+0x39e/0x790
[<ffffffff8110a8ea>] __alloc_pages_nodemask+0x1fa/0x210
[<ffffffff8114d1b0>] alloc_pages_vma+0xa0/0x120
[<ffffffff8113fe93>] read_swap_cache_async+0x113/0x160
[<ffffffff8113ffe1>] swapin_readahead+0x101/0x190
[<ffffffff8112e93f>] do_swap_page+0xef/0x5e0
[<ffffffff8112f94d>] handle_pte_fault+0x1bd/0x240
[<ffffffff8112fcbf>] handle_mm_fault+0x2ef/0x400
[<ffffffff8157e927>] __do_page_fault+0x237/0x4f0
[<ffffffff8157ebe9>] do_page_fault+0x9/0x10
[<ffffffff8157b348>] page_fault+0x28/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
14962 [<ffffffff811011a9>] sleep_on_page+0x9/0x10
[<ffffffff8110150f>] wait_on_page_bit+0x6f/0x80
[<ffffffff81114e43>] shrink_page_list+0x503/0x790
[<ffffffff8111570b>] shrink_inactive_list+0x1bb/0x570
[<ffffffff81115ef9>] shrink_lruvec+0xf9/0x330
[<ffffffff8111660a>] mem_cgroup_shrink_node_zone+0xda/0x140
[<ffffffff81163382>] mem_cgroup_soft_reclaim+0xb2/0x140
[<ffffffff811634af>] mem_cgroup_soft_limit_reclaim+0x9f/0x270
[<ffffffff81116418>] shrink_zones+0x108/0x220
[<ffffffff8111776a>] do_try_to_free_pages+0x8a/0x360
[<ffffffff81117d90>] try_to_free_pages+0x130/0x180
[<ffffffff8110a2fe>] __alloc_pages_slowpath+0x39e/0x790
[<ffffffff8110a8ea>] __alloc_pages_nodemask+0x1fa/0x210
[<ffffffff8114d1b0>] alloc_pages_vma+0xa0/0x120
[<ffffffff81129ebb>] do_anonymous_page+0x16b/0x350
[<ffffffff8112f9c5>] handle_pte_fault+0x235/0x240
[<ffffffff8112fcbf>] handle_mm_fault+0x2ef/0x400
[<ffffffff8157e927>] __do_page_fault+0x237/0x4f0
[<ffffffff8157ebe9>] do_page_fault+0x9/0x10
[<ffffffff8157b348>] page_fault+0x28/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
20757 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
[<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs]
[<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs]
[<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs]
[<ffffffffa02c62d0>] xfs_free_eofblocks+0x180/0x250 [xfs]
[<ffffffffa02c68e6>] xfs_release+0x106/0x1d0 [xfs]
[<ffffffffa02b3b20>] xfs_file_release+0x10/0x20 [xfs]
[<ffffffff8116c86d>] __fput+0xbd/0x240
[<ffffffff8116ca49>] ____fput+0x9/0x10
[<ffffffff81063221>] task_work_run+0xb1/0xe0
[<ffffffff810029e0>] do_notify_resume+0x90/0x1d0
[<ffffffff815833a2>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
20758 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
[<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs]
[<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs]
[<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs]
[<ffffffffa02c5999>] xfs_create+0x1a9/0x5c0 [xfs]
[<ffffffffa02bccca>] xfs_vn_mknod+0x8a/0x1a0 [xfs]
[<ffffffffa02bce0e>] xfs_vn_create+0xe/0x10 [xfs]
[<ffffffff811763dd>] vfs_create+0xad/0xd0
[<ffffffff81177e68>] lookup_open+0x1b8/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
20761 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
[<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs]
[<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs]
[<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs]
[<ffffffffa02c5999>] xfs_create+0x1a9/0x5c0 [xfs]
[<ffffffffa02bccca>] xfs_vn_mknod+0x8a/0x1a0 [xfs]
[<ffffffffa02bce0e>] xfs_vn_create+0xe/0x10 [xfs]
[<ffffffff811763dd>] vfs_create+0xad/0xd0
[<ffffffff81177e68>] lookup_open+0x1b8/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
We are waiting for a page under writeback, but neither of the two paths
starts in xfs code. So I do not think waiting for PageWriteback causes a
deadlock here.
[...]
> ... is running IO completion work and trying to commit a transaction
> that is blocked in memory allocation which is waiting for IO
> completion. It's disappeared up it's own fundamental orifice.
>
> Ok, this has absolutely nothing to do with the LRU changes - this is
> a pre-existing XFS/mm interaction problem from around 3.2. The
> question is now this: how the hell do I get memory allocation to not
> block waiting on IO completion here? This is already being done in
> GFP_NOFS allocation context here....
Just for reference: wait_on_page_writeback is issued only for memcg
reclaim because there is no other throttling mechanism to keep too many
dirty pages from piling up on the list and triggering a premature OOM
kill. See e62e384e9d (memcg: prevent OOM with too many dirty pages) for
more details. The original patch relied on may_enter_fs, but that check
disappeared with the later change c3b94f44fc (memcg: further prevent OOM
with too many dirty pages).
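The check in question looks roughly like this inside shrink_page_list()
(a simplified sketch of the logic after those two commits, not the
verbatim code in this tree):

	if (PageWriteback(page)) {
		if (global_reclaim(sc) ||
		    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
			/*
			 * Global reclaim is throttled elsewhere, so just
			 * tag the page and keep going.
			 */
			SetPageReclaim(page);
			goto keep_locked;
		}
		/*
		 * Memcg reclaim has no dirty-page throttling of its own,
		 * so wait for writeback to finish rather than risk a
		 * premature OOM. Note the gfp test is on __GFP_IO, not
		 * __GFP_FS, so GFP_NOFS callers still end up waiting
		 * here -- this is the wait the stuck ld tasks above are
		 * sitting in.
		 */
		wait_on_page_writeback(page);
	}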
--
Michal Hocko
SUSE Labs
On Wed, Jul 10, 2013 at 10:06:05AM +0200, Michal Hocko wrote:
> On Wed 10-07-13 12:31:39, Dave Chinner wrote:
> [...]
> > > 20761 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
> > > [<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs]
> > > [<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs]
> > > [<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs]
> > > [<ffffffffa02c5999>] xfs_create+0x1a9/0x5c0 [xfs]
> > > [<ffffffffa02bccca>] xfs_vn_mknod+0x8a/0x1a0 [xfs]
> > > [<ffffffffa02bce0e>] xfs_vn_create+0xe/0x10 [xfs]
> > > [<ffffffff811763dd>] vfs_create+0xad/0xd0
> > > [<ffffffff81177e68>] lookup_open+0x1b8/0x1d0
> > > [<ffffffff8117815e>] do_last+0x2de/0x780
> > > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > > [<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
> > > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > That's an XFS log space issue, indicating that it has run out of
> > space in the log and is waiting for more to come free. That
> > requires IO completion to occur.
> >
> > > [276962.652076] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
> > > [276962.652087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [276962.652093] xfs-data/sda9 D ffff88001ffb9cc8 0 930 2 0x00000000
> >
> > Oh, that's why. This is the IO completion worker...
>
> But that task doesn't seem to be stuck anymore (at least the lockup
> watchdog doesn't report it anymore and I have already rebooted to test
> with ext3 :/). I am sorry if these lockup logs were more confusing than
> helpful, but they happened a _long_ time ago and the system obviously
> recovered from them. I am pasting only the traces for processes in D
> state here again for reference.
Right, there are various triggers that can get XFS out of the
situation - it takes something to kick the log or metadata writeback
and that can make space in the log free up and hence things get
moving again. The problem will be that once in this low memory state
everything in the filesystem will back up on slow memory allocation
and it might take minutes to clear the backlog of IO completions....
> 20757 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
> [<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs]
> [<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs]
> [<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs]
That is the stack of a process waiting for log space to come
available.
> We are waiting for a page under writeback, but neither of the two paths
> starts in xfs code. So I do not think waiting for PageWriteback causes a
> deadlock here.
The problem is this: the page that we are waiting for IO on is in
the IO completion queue, but the IO completion requires memory
allocation to complete the transaction. That memory allocation is
causing memcg reclaim, which then waits for IO completion on another
page, which may or may not end up in the same IO completion queue.
The CMWQ can continue to process new IO completions - up to a point
- so slow progress will be made. In the worst case, it can deadlock.
GFP_NOFS allocation is the mechanism by which filesystems are
supposed to be able to avoid this recursive deadlock...
> [...]
> > ... is running IO completion work and trying to commit a transaction
> > that is blocked in memory allocation which is waiting for IO
> > completion. It's disappeared up its own fundamental orifice.
> >
> > Ok, this has absolutely nothing to do with the LRU changes - this is
> > a pre-existing XFS/mm interaction problem from around 3.2. The
> > question is now this: how the hell do I get memory allocation to not
> > block waiting on IO completion here? This is already being done in
> > GFP_NOFS allocation context here....
>
> Just for reference: wait_on_page_writeback is issued only for memcg
> reclaim because there is no other throttling mechanism to prevent too
> many dirty pages from piling up on the list, and thus a premature OOM
> kill. See e62e384e9d (memcg: prevent OOM with too many dirty pages) for
> more details. The original patch relied on may_enter_fs, but that check
> disappeared with the later change c3b94f44fc (memcg: further prevent OOM
> with too many dirty pages).
Aye. That's the exact code I was looking at yesterday and wondering
"how the hell is waiting on page writeback valid in GFP_NOFS
context?". It seems that memcg reclaim is intentionally ignoring
GFP_NOFS to avoid OOM issues. That's a memcg implementation problem,
not a filesystem or LRU infrastructure problem....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, 11 Jul 2013 12:26:34 +1000 Dave Chinner <[email protected]> wrote:
> > Just for reference: wait_on_page_writeback is issued only for memcg
> > reclaim because there is no other throttling mechanism to prevent too
> > many dirty pages from piling up on the list, and thus a premature OOM
> > kill. See e62e384e9d (memcg: prevent OOM with too many dirty pages) for
> > more details. The original patch relied on may_enter_fs, but that check
> > disappeared with the later change c3b94f44fc (memcg: further prevent OOM
> > with too many dirty pages).
>
> Aye. That's the exact code I was looking at yesterday and wondering
> "how the hell is waiting on page writeback valid in GFP_NOFS
> context?". It seems that memcg reclaim is intentionally ignoring
> GFP_NOFS to avoid OOM issues. That's a memcg implementation problem,
> not a filesystem or LRU infrastructure problem....
Yup, c3b94f44fc shouldn't have done that.
Throttling by waiting on a specific page is indeed prone to deadlocks
and has a number of efficiency problems as well: if 1,000,000 pages
came clean while you're waiting for *this* page to come clean, you're
left looking pretty stupid.
Hence congestion_wait(), which perhaps can save us here. I'm not sure
how the wait_on_page_writeback() got back in there - I must have been
asleep at the time.
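The congestion_wait() approach hinted at above would, very roughly, replace
the per-page wait with a bounded backoff and let the caller rescan -
something like the sketch below. Purely illustrative, not a proposed patch.

#include <linux/backing-dev.h>
#include <linux/pagemap.h>

/*
 * Hedged sketch: instead of wait_on_page_writeback(page), which pins
 * reclaim on one page for an unbounded time, sleep for at most HZ/10 on
 * the async write queue and let the caller retry the scan.
 */
static void throttle_memcg_reclaim(struct page *page)
{
        if (PageWriteback(page))
                congestion_wait(BLK_RW_ASYNC, HZ / 10);
}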
On Thu 11-07-13 12:26:34, Dave Chinner wrote:
> On Wed, Jul 10, 2013 at 10:06:05AM +0200, Michal Hocko wrote:
> > On Wed 10-07-13 12:31:39, Dave Chinner wrote:
> > [...]
> > > > 20761 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
> > > > [<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs]
> > > > [<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs]
> > > > [<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs]
> > > > [<ffffffffa02c5999>] xfs_create+0x1a9/0x5c0 [xfs]
> > > > [<ffffffffa02bccca>] xfs_vn_mknod+0x8a/0x1a0 [xfs]
> > > > [<ffffffffa02bce0e>] xfs_vn_create+0xe/0x10 [xfs]
> > > > [<ffffffff811763dd>] vfs_create+0xad/0xd0
> > > > [<ffffffff81177e68>] lookup_open+0x1b8/0x1d0
> > > > [<ffffffff8117815e>] do_last+0x2de/0x780
> > > > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > > > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > > > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > > > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > > > [<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
> > > > [<ffffffffffffffff>] 0xffffffffffffffff
> > >
> > > That's an XFS log space issue, indicating that it has run out of
> > > space in the log and is waiting for more to come free. That
> > > requires IO completion to occur.
> > >
> > > > [276962.652076] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
> > > > [276962.652087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > [276962.652093] xfs-data/sda9 D ffff88001ffb9cc8 0 930 2 0x00000000
> > >
> > > Oh, that's why. This is the IO completion worker...
> >
> > But that task doesn't seem to be stuck anymore (at least the lockup
> > watchdog doesn't report it anymore and I have already rebooted to test
> > with ext3 :/). I am sorry if these lockup logs were more confusing than
> > helpful, but they happened a _long_ time ago and the system obviously
> > recovered from them. I am pasting only the traces for processes in D
> > state here again for reference.
>
> Right, there are various triggers that can get XFS out of the
> situation - it takes something to kick the log or metadata writeback
> and that can make space in the log free up and hence things get
> moving again. The problem will be that once in this low memory state
> everything in the filesystem will back up on slow memory allocation
> and it might take minutes to clear the backlog of IO completions....
>
> > 20757 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
> > [<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs]
> > [<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs]
> > [<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs]
>
> That is the stack of a process waiting for log space to come
> available.
>
> > We are waiting for a page under writeback, but neither of the two paths
> > starts in xfs code. So I do not think waiting for PageWriteback causes a
> > deadlock here.
>
> The problem is this: the page that we are waiting for IO on is in
> the IO completion queue, but the IO completion requires memory
> allocation to complete the transaction. That memory allocation is
> causing memcg reclaim, which then waits for IO completion on another
> page, which may or may not end up in the same IO completion queue.
> The CMWQ can continue to process new IO completions - up to a point
> - so slow progress will be made. In the worst case, it can deadlock.
OK, I thought something like that was going on but I just wanted to be
sure that I didn't manage to confuse you by the lockup messages.
>
> GFP_NOFS allocation is the mechanism by which filesystems are
> supposed to be able to avoid this recursive deadlock...
Yes.
> > [...]
> > > ... is running IO completion work and trying to commit a transaction
> > > that is blocked in memory allocation which is waiting for IO
> > > completion. It's disappeared up its own fundamental orifice.
> > >
> > > Ok, this has absolutely nothing to do with the LRU changes - this is
> > > a pre-existing XFS/mm interaction problem from around 3.2. The
> > > question is now this: how the hell do I get memory allocation to not
> > > block waiting on IO completion here? This is already being done in
> > > GFP_NOFS allocation context here....
> >
> > Just for reference: wait_on_page_writeback is issued only for memcg
> > reclaim because there is no other throttling mechanism to prevent too
> > many dirty pages from piling up on the list, and thus a premature OOM
> > kill. See e62e384e9d (memcg: prevent OOM with too many dirty pages) for
> > more details. The original patch relied on may_enter_fs, but that check
> > disappeared with the later change c3b94f44fc (memcg: further prevent OOM
> > with too many dirty pages).
>
> Aye. That's the exact code I was looking at yesterday and wondering
> "how the hell is waiting on page writeback valid in GFP_NOFS
> context?". It seems that memcg reclaim is intentionally ignoring
> GFP_NOFS to avoid OOM issues. That's a memcg implementation problem,
> not a filesystem or LRU infrastructure problem....
Agreed, and until we have proper per-memcg dirty memory throttling we
will always be in workaround mode. Which is sad, but that is the
reality...
I am CCing Hugh (the discussion was long and started with a different
issue, but the above should explain the current xfs hang. It seems
that c3b94f44fc makes xfs hang).
--
Michal Hocko
SUSE Labs
On Thu, 11 Jul 2013, Michal Hocko wrote:
> On Thu 11-07-13 12:26:34, Dave Chinner wrote:
> > On Wed, Jul 10, 2013 at 10:06:05AM +0200, Michal Hocko wrote:
> > > On Wed 10-07-13 12:31:39, Dave Chinner wrote:
> > > [...]
> > > > > 20761 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
> > > > > [<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs]
> > > > > [<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs]
> > > > > [<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs]
> > > > > [<ffffffffa02c5999>] xfs_create+0x1a9/0x5c0 [xfs]
> > > > > [<ffffffffa02bccca>] xfs_vn_mknod+0x8a/0x1a0 [xfs]
> > > > > [<ffffffffa02bce0e>] xfs_vn_create+0xe/0x10 [xfs]
> > > > > [<ffffffff811763dd>] vfs_create+0xad/0xd0
> > > > > [<ffffffff81177e68>] lookup_open+0x1b8/0x1d0
> > > > > [<ffffffff8117815e>] do_last+0x2de/0x780
> > > > > [<ffffffff8117ae9a>] path_openat+0xda/0x400
> > > > > [<ffffffff8117b303>] do_filp_open+0x43/0xa0
> > > > > [<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
> > > > > [<ffffffff81168f9c>] sys_open+0x1c/0x20
> > > > > [<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
> > > > > [<ffffffffffffffff>] 0xffffffffffffffff
> > > >
> > > > That's an XFS log space issue, indicating that it has run out of
> > > > space in the log and is waiting for more to come free. That
> > > > requires IO completion to occur.
> > > >
> > > > > [276962.652076] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
> > > > > [276962.652087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > [276962.652093] xfs-data/sda9 D ffff88001ffb9cc8 0 930 2 0x00000000
> > > >
> > > > Oh, that's why. This is the IO completion worker...
> > >
> > > But that task doesn't seem to be stuck anymore (at least the lockup
> > > watchdog doesn't report it anymore and I have already rebooted to test
> > > with ext3 :/). I am sorry if these lockup logs were more confusing than
> > > helpful, but they happened a _long_ time ago and the system obviously
> > > recovered from them. I am pasting only the traces for processes in D
> > > state here again for reference.
> >
> > Right, there are various triggers that can get XFS out of the
> > situation - it takes something to kick the log or metadata writeback
> > and that can make space in the log free up and hence things get
> > moving again. The problem will be that once in this low memory state
> > everything in the filesystem will back up on slow memory allocation
> > and it might take minutes to clear the backlog of IO completions....
> >
> > > 20757 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
> > > [<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs]
> > > [<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs]
> > > [<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs]
> >
> > That is the stack of a process waiting for log space to come
> > available.
> >
> > > We are waiting for a page under writeback, but neither of the two paths
> > > starts in xfs code. So I do not think waiting for PageWriteback causes a
> > > deadlock here.
> >
> > The problem is this: the page that we are waiting for IO on is in
> > the IO completion queue, but the IO completion requires memory
> > allocation to complete the transaction. That memory allocation is
> > causing memcg reclaim, which then waits for IO completion on another
> > page, which may or may not end up in the same IO completion queue.
> > The CMWQ can continue to process new IO completions - up to a point
> > - so slow progress will be made. In the worst case, it can deadlock.
>
> OK, I thought something like that was going on but I just wanted to be
> sure that I didn't manage to confuse you by the lockup messages.
> >
> > GFP_NOFS allocation is the mechanism by which filesystems are
> > supposed to be able to avoid this recursive deadlock...
>
> Yes.
>
> > > [...]
> > > > ... is running IO completion work and trying to commit a transaction
> > > > that is blocked in memory allocation which is waiting for IO
> > > > completion. It's disappeared up its own fundamental orifice.
> > > >
> > > > Ok, this has absolutely nothing to do with the LRU changes - this is
> > > > a pre-existing XFS/mm interaction problem from around 3.2. The
> > > > question is now this: how the hell do I get memory allocation to not
> > > > block waiting on IO completion here? This is already being done in
> > > > GFP_NOFS allocation context here....
> > >
> > > Just for reference: wait_on_page_writeback is issued only for memcg
> > > reclaim because there is no other throttling mechanism to prevent too
> > > many dirty pages from piling up on the list, and thus a premature OOM
> > > kill. See e62e384e9d (memcg: prevent OOM with too many dirty pages) for
> > > more details. The original patch relied on may_enter_fs, but that check
> > > disappeared with the later change c3b94f44fc (memcg: further prevent OOM
> > > with too many dirty pages).
> >
> > Aye. That's the exact code I was looking at yesterday and wondering
> > "how the hell is waiting on page writeback valid in GFP_NOFS
> > context?". It seems that memcg reclaim is intentionally ignoring
> > GFP_NOFS to avoid OOM issues. That's a memcg implementation problem,
> > not a filesystem or LRU infrastructure problem....
>
> Agreed, and until we have proper per-memcg dirty memory throttling we
> will always be in workaround mode. Which is sad, but that is the
> reality...
>
> I am CCing Hugh (the discussion was long and started with a different
> issue, but the above should explain the current xfs hang. It seems
> that c3b94f44fc makes xfs hang).
The may_enter_fs test came and went several times as we prepared those
patches: one set of problems with it in, another set with it out.
When I made c3b94f44fc, I was not imagining that I/O completion might
have to wait on a further __GFP_IO allocation. But I can see the sense
of what XFS is doing there: after writing the data, it wants to perform
(initiate?) a transaction; but if that happens to fail, wants to mark
the written data pages as bad before reaching the end_page_writeback.
I've toyed with reordering that, but its order does seem sensible.
I've always thought of GFP_NOFS as meaning "don't recurse into the
filesystem" (and wondered what that amounts to since direct reclaim
stopped doing filesystem writeback); but here XFS is expecting it
to include "and don't wait for PageWriteback to be cleared".
I've mused on this for a while, and haven't arrived at any conclusion;
but do have several mutterings on different kinds of solution.
Probably the easiest solution, but not necessarily the right solution,
would be for XFS to add a KM_NOIO akin to its KM_NOFS, and use KM_NOIO
instead of KM_NOFS in xfs_iomap_write_unwritten() (anywhere else?).
I'd find that more convincing if it were not so obviously designed
to match an assumption I'd once made over in mm/vmscan.c.
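To make the KM_NOIO idea concrete: XFS already translates its KM_* flags
into a gfp mask (fs/xfs/kmem.h), and a KM_NOIO would presumably slot in
next to KM_NOFS, roughly as sketched below. KM_NOIO and the helper shown
are hypothetical here - a rough approximation of kmem_flags_convert(), not
the real thing.

#include <linux/gfp.h>

/* Existing XFS allocation flags, plus the hypothetical one proposed above. */
#define KM_SLEEP        0x0001u
#define KM_NOSLEEP      0x0002u
#define KM_NOFS         0x0004u
#define KM_MAYFAIL      0x0008u
#define KM_NOIO         0x0010u         /* hypothetical */

/* Rough approximation of XFS's KM_* -> gfp_t conversion. */
static inline gfp_t km_flags_to_gfp(unsigned int km_flags)
{
        gfp_t gfp = (km_flags & KM_NOSLEEP) ? GFP_ATOMIC : GFP_KERNEL;

        if (km_flags & KM_NOFS)
                gfp &= ~__GFP_FS;               /* today's behaviour */
        if (km_flags & KM_NOIO)
                gfp &= ~(__GFP_FS | __GFP_IO);  /* the proposed stronger form */

        return gfp;
}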
A harder solution, but one which I'd expect to have larger benefits,
would be to reinstate the may_enter_fs test there in shrink_page_list(),
but modify ext4 and xfs and gfs2 to use grab_cache_page_write_begin()
without needing AOP_FLAG_NOFS: I think it is very sad that major FS
page allocations are made with the limiting GFP_NOFS, and I hope there
might be an efficient way to make those page allocations outside of the
transaction, with __GFP_FS instead.
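For context on the AOP_FLAG_NOFS point: in kernels of this era
grab_cache_page_write_begin() strips __GFP_FS from the mapping's allocation
mask when that flag is set, so the write-path page-cache allocations in
ext4/xfs/gfs2 end up being GFP_NOFS. A simplified sketch of that effect
(not the verbatim mm/filemap.c code):

#include <linux/fs.h>
#include <linux/pagemap.h>

/*
 * Simplified sketch of what AOP_FLAG_NOFS does to the page-cache
 * allocation in grab_cache_page_write_begin() (mm/filemap.c, ~3.10).
 */
static struct page *grab_page_sketch(struct address_space *mapping,
                                     pgoff_t index, unsigned int flags)
{
        gfp_t gfp = mapping_gfp_mask(mapping);

        if (flags & AOP_FLAG_NOFS)
                gfp &= ~__GFP_FS;       /* the limitation lamented above */

        return find_or_create_page(mapping, index, gfp);
}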
Another kind of solution: I did originally worry about your e62e384e9d
in rather the same way that akpm has, thinking a wait on return from
shrink_page_list() more appropriate than waiting on a single page
(with a hold on all the other pages of the page_list). I did have a
patch I'd been playing with about the time you posted yours, but we
agreed to go ahead with yours unless problems showed up (I think mine
was not so pretty as yours). Maybe I need to dust off my old
alternative now - though I've rather forgotten how to test it.
Hugh
On Thu, Jul 11, 2013 at 06:42:03PM -0700, Hugh Dickins wrote:
> On Thu, 11 Jul 2013, Michal Hocko wrote:
> > On Thu 11-07-13 12:26:34, Dave Chinner wrote:
> > > > We are waiting for a page under writeback, but neither of the two paths
> > > > starts in xfs code. So I do not think waiting for PageWriteback causes a
> > > > deadlock here.
> > >
> > > The problem is this: the page that we are waiting for IO on is in
> > > the IO completion queue, but the IO completion requires memory
> > > allocation to complete the transaction. That memory allocation is
> > > causing memcg reclaim, which then waits for IO completion on another
> > > page, which may or may not end up in the same IO completion queue.
> > > The CMWQ can continue to process new IO completions - up to a point
> > > - so slow progress will be made. In the worst case, it can deadlock.
> >
> > OK, I thought something like that was going on but I just wanted to be
> > sure that I didn't manage to confuse you by the lockup messages.
> > >
> > > GFP_NOFS allocation is the mechanism by which filesystems are
> > > supposed to be able to avoid this recursive deadlock...
> >
> > Yes.
> >
> > > > [...]
> > > > > ... is running IO completion work and trying to commit a transaction
> > > > > that is blocked in memory allocation which is waiting for IO
> > > > > completion. It's disappeared up its own fundamental orifice.
> > > > >
> > > > > Ok, this has absolutely nothing to do with the LRU changes - this is
> > > > > a pre-existing XFS/mm interaction problem from around 3.2. The
> > > > > question is now this: how the hell do I get memory allocation to not
> > > > > block waiting on IO completion here? This is already being done in
> > > > > GFP_NOFS allocation context here....
> > > >
> > > > Just for reference: wait_on_page_writeback is issued only for memcg
> > > > reclaim because there is no other throttling mechanism to prevent too
> > > > many dirty pages from piling up on the list, and thus a premature OOM
> > > > kill. See e62e384e9d (memcg: prevent OOM with too many dirty pages) for
> > > > more details. The original patch relied on may_enter_fs, but that check
> > > > disappeared with the later change c3b94f44fc (memcg: further prevent OOM
> > > > with too many dirty pages).
> > >
> > > Aye. That's the exact code I was looking at yesterday and wondering
> > > "how the hell is waiting on page writeback valid in GFP_NOFS
> > > context?". It seems that memcg reclaim is intentionally ignoring
> > > GFP_NOFS to avoid OOM issues. That's a memcg implementation problem,
> > > not a filesystem or LRU infrastructure problem....
> >
> > Agreed, and until we have proper per-memcg dirty memory throttling we
> > will always be in workaround mode. Which is sad, but that is the
> > reality...
> >
> > I am CCing Hugh (the discussion was long and started with a different
> > issue, but the above should explain the current xfs hang. It seems
> > that c3b94f44fc makes xfs hang).
>
> The may_enter_fs test came and went several times as we prepared those
> patches: one set of problems with it in, another set with it out.
>
> When I made c3b94f44fc, I was not imagining that I/O completion might
> have to wait on a further __GFP_IO allocation. But I can see the sense
> of what XFS is doing there: after writing the data, it wants to perform
> (initiate?) a transaction; but if that happens to fail, wants to mark
> the written data pages as bad before reaching the end_page_writeback.
> I've toyed with reordering that, but its order does seem sensible.
>
> I've always thought of GFP_NOFS as meaning "don't recurse into the
> filesystem" (and wondered what that amounts to since direct reclaim
> stopped doing filesystem writeback); but here XFS is expecting it
> to include "and don't wait for PageWriteback to be cleared".
Well, it's more general than that - my understanding of GFP_NOFS is
that it means "don't block reclaim on anything filesystem related
because a filesystem deadlock is possible from this calling
context". Even without direct reclaim doing writeback, there are
still shrinkers that need to avoid locking filesystem objects during
direct reclaim, and there is the fact that waiting on writeback for
specific pages to complete may (indirectly) block a memory allocation
required to complete the writeback of that page. It's the latter
case that is the problem here...
> I've mused on this for a while, and haven't arrived at any conclusion;
> but do have several mutterings on different kinds of solution.
>
> Probably the easiest solution, but not necessarily the right solution,
> would be for XFS to add a KM_NOIO akin to its KM_NOFS, and use KM_NOIO
> instead of KM_NOFS in xfs_iomap_write_unwritten() (anywhere else?).
> I'd find that more convincing if it were not so obviously designed
> to match an assumption I'd once made over in mm/vmscan.c.
I'd prefer not to have to start using KM_NOIO in specific places in
the filesystem layer. I can see how it may be relevant, though,
because we are in the IO completion path here, and so -technically-
we are dealing with IO layer interactions here. Hmmm - it looks like
there is already a task flag to tell memory allocation we are in IO
context without needing to pass GFP_IO: PF_MEMALLOC_NOIO.
[ As an idle thought, if we drove PF_FSTRANS into the memory
allocation to clear __GFP_FS like PF_MEMALLOC_NOIO clears __GFP_IO,
we could probably get rid of a large amount of the XFS specific
memory allocation wrappers. Hmmm, it would solve all the "we
need to do GFP_NOFS for vmalloc()" problems we have as well, which
is what DM uses PF_MEMALLOC_NOIO for.... ]
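The mechanism referred to above works by masking gfp bits off at allocation
time based on a task flag; PF_MEMALLOC_NOIO is handled by a small helper in
include/linux/sched.h (memalloc_noio_flags(), since ~3.9). The PF_FSTRANS
analogue being mused about would look roughly like the hypothetical sketch
below; it is not existing kernel code.

#include <linux/sched.h>
#include <linux/gfp.h>

/*
 * Hypothetical sketch of the "idle thought" above, modelled on
 * memalloc_noio_flags(): if the task is inside a filesystem
 * transaction (PF_FSTRANS), strip __GFP_FS from every allocation
 * instead of relying on per-callsite GFP_NOFS / KM_NOFS annotations.
 */
static inline gfp_t fstrans_mask_gfp(gfp_t flags)
{
        if (current->flags & PF_FSTRANS)
                flags &= ~__GFP_FS;
        return flags;
}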
> A harder solution, but one which I'd expect to have larger benefits,
> would be to reinstate the may_enter_fs test there in shrink_page_list(),
> but modify ext4 and xfs and gfs2 to use grab_cache_page_write_begin()
> without needing AOP_FLAG_NOFS: I think it is very sad that major FS
> page allocations are made with the limiting GFP_NOFS, and I hope there
> might be an efficient way to make those page allocations outside of the
> transaction, with __GFP_FS instead.
I don't think that helps the AOP_FLAG_NOFS case - even if we aren't
in a transaction context, we're still holding (multiple) filesystem
locks when doing memory allocation...
> Another kind of solution: I did originally worry about your e62e384e9d
> in rather the same way that akpm has, thinking a wait on return from
> shrink_page_list() more appropriate than waiting on a single page
> (with a hold on all the other pages of the page_list). I did have a
> patch I'd been playing with about the time you posted yours, but we
> agreed to go ahead with yours unless problems showed up (I think mine
> was not so pretty as yours). Maybe I need to dust off my old
> alternative now - though I've rather forgotten how to test it.
I think that a congestion_wait()-style change (as Andrew suggested) is
a workable interim solution, but I suspect - like Michal - that we
need proper memcg awareness in balance_dirty_pages() to really solve
this problem fully....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu 04-07-13 18:36:43, Michal Hocko wrote:
> On Wed 03-07-13 21:24:03, Dave Chinner wrote:
> > On Tue, Jul 02, 2013 at 02:44:27PM +0200, Michal Hocko wrote:
> > > On Tue 02-07-13 22:19:47, Dave Chinner wrote:
> > > [...]
> > > > Ok, so it's been leaked from a dispose list somehow. Thanks for the
> > > > info, Michal, it's time to go look at the code....
> > >
> > > OK, just in case we will need it, I am keeping the machine in this state
> > > for now. So we still can play with crash and check all the juicy
> > > internals.
> >
> > My current suspect is the LRU_RETRY code. I don't think what it is
> > doing is at all valid - list_for_each_safe() is not safe if you drop
> > the lock that protects the list. i.e. there is nothing that protects
> > the stored next pointer from being removed from the list by someone
> > else. Hence what I think is occurring is this:
> >
> >
> > thread 1                        thread 2
> > lock(lru)
> > list_for_each_safe(lru)         lock(lru)
> >   isolate                       ......
> >     lock(i_lock)
> >     has buffers
> >       __iget
> >     unlock(i_lock)
> > unlock(lru)
> > .....                           (gets lru lock)
> >                                 list_for_each_safe(lru)
> >                                   walks all the inodes
> >                                   finds inode being isolated by other thread
> >                                   isolate
> >                                     i_count > 0
> >                                     list_del_init(i_lru)
> >                                     return LRU_REMOVED;
> >                                   moves to next inode, inode that
> >                                   other thread has stored as next
> >                                   isolate
> >                                     i_state |= I_FREEING
> >                                     list_move(dispose_list)
> >                                     return LRU_REMOVED
> >                                 ....
> >                                 unlock(lru)
> > lock(lru)
> > return LRU_RETRY;
> > if (!first_pass)
> >   ....
> > --nr_to_scan
> > (loop again using next, which has already been removed from the
> > LRU by the other thread!)
> > isolate
> >   lock(i_lock)
> >   if (i_state & ~I_REFERENCED)
> >     list_del_init(i_lru)        <<<<< inode is on dispose list!
> >                                 <<<<< inode is now isolated, with I_FREEING set
> >     return LRU_REMOVED;
> >
> > That fits the corpse left on your machine, Michal. One thread has
> > moved the inode to a dispose list, the other thread thinks it is
> > still on the LRU and should be removed, and removes it.
> >
> > This also explains the lru item count going negative - the same item
> > is being removed from the lru twice. So it seems like all the
> > problems you've been seeing are caused by this one problem....
> >
> > Patch below that should fix this.
>
> Good news! The test has been running since morning and it didn't hang
> or crash. So this really looks like the right fix. It will also run
> over the weekend to be 100% sure. But I guess it is safe to say
>
> Tested-by: Michal Hocko <[email protected]>
And I can finally confirm this after testing over the weekend on ext3.
Thanks a lot for your help, Dave!
--
Michal Hocko
SUSE Labs
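A closing note on the list walk bug diagnosed in the quoted interleaving
above: list_for_each_safe() only protects the walker against its own
deletions; once the lock is dropped, the cached next pointer can be removed
(and even freed) by another walker. A minimal sketch of the safe pattern -
restart from the list head after every lock drop - is below; it assumes a
simplified LRU and is not the actual list_lru fix.

#include <linux/list.h>
#include <linux/spinlock.h>

struct item {
        struct list_head lru;
        /* ... */
};

/*
 * Hedged sketch: if the isolate step may drop 'lock', the cursor cached
 * by list_for_each_entry_safe() becomes stale, so restart the walk from
 * the head after the lock has been reacquired instead of continuing.
 * isolate() is assumed to return true iff it dropped and retook 'lock'.
 */
static void walk_lru(struct list_head *head, spinlock_t *lock,
                     bool (*isolate)(struct item *))
{
        struct item *item, *next;

restart:
        spin_lock(lock);
        list_for_each_entry_safe(item, next, head, lru) {
                if (isolate(item)) {
                        /* 'next' may no longer be on the list: rescan */
                        spin_unlock(lock);
                        goto restart;
                }
        }
        spin_unlock(lock);
}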