2008-06-22 09:59:10

by Daniel J Blueman

Subject: [2.6.26-rc7] shrink_icache from pagefault locking (nee: nfsd hangs for a few sec)...

I'm seeing a similar issue [2] to what was recently reported [1] by
Alexander, but with another workload involving XFS and memory
pressure.

SLUB allocator is in use and config is at http://quora.org/config-client-debug .

Let me know if you'd like more details/vmlinux objdump etc.

Thanks,
Daniel

--- [1]

http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/e673c9173d45a735/db9213ef39e4e11c

--- [2]

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.26-rc7-210c #2
-------------------------------------------------------
AutopanoPro/4470 is trying to acquire lock:
(iprune_mutex){--..}, at: [<ffffffff802d94fd>] shrink_icache_memory+0x7d/0x290

but task is already holding lock:
(&mm->mmap_sem){----}, at: [<ffffffff805e3e15>] do_page_fault+0x255/0x890

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (&mm->mmap_sem){----}:
[<ffffffff80278f4d>] __lock_acquire+0xbdd/0x1020
[<ffffffff802793f5>] lock_acquire+0x65/0x90
[<ffffffff805df5ab>] down_read+0x3b/0x70
[<ffffffff805e3e3c>] do_page_fault+0x27c/0x890
[<ffffffff805e16cd>] error_exit+0x0/0xa9
[<ffffffffffffffff>] 0xffffffffffffffff

-> #1 (&(&ip->i_iolock)->mr_lock){----}:
[<ffffffff80278f4d>] __lock_acquire+0xbdd/0x1020
[<ffffffff802793f5>] lock_acquire+0x65/0x90
[<ffffffff8026d746>] down_write_nested+0x46/0x80
[<ffffffff8039df29>] xfs_ilock+0x99/0xa0
[<ffffffff8039e0cf>] xfs_ireclaim+0x3f/0x90
[<ffffffff803ba889>] xfs_finish_reclaim+0x59/0x1a0
[<ffffffff803bc199>] xfs_reclaim+0x109/0x110
[<ffffffff803c9541>] xfs_fs_clear_inode+0xe1/0x110
[<ffffffff802d906d>] clear_inode+0x7d/0x110
[<ffffffff802d93aa>] dispose_list+0x2a/0x100
[<ffffffff802d96af>] shrink_icache_memory+0x22f/0x290
[<ffffffff8029d868>] shrink_slab+0x168/0x1d0
[<ffffffff8029e0b6>] kswapd+0x3b6/0x560
[<ffffffff8026921d>] kthread+0x4d/0x80
[<ffffffff80227428>] child_rip+0xa/0x12
[<ffffffffffffffff>] 0xffffffffffffffff

-> #0 (iprune_mutex){--..}:
[<ffffffff80278db7>] __lock_acquire+0xa47/0x1020
[<ffffffff802793f5>] lock_acquire+0x65/0x90
[<ffffffff805dedd5>] mutex_lock_nested+0xb5/0x300
[<ffffffff802d94fd>] shrink_icache_memory+0x7d/0x290
[<ffffffff8029d868>] shrink_slab+0x168/0x1d0
[<ffffffff8029db38>] try_to_free_pages+0x268/0x3a0
[<ffffffff802979d6>] __alloc_pages_internal+0x206/0x4b0
[<ffffffff80297c89>] __alloc_pages_nodemask+0x9/0x10
[<ffffffff802b2bc2>] alloc_page_vma+0x72/0x1b0
[<ffffffff802a3642>] handle_mm_fault+0x462/0x7b0
[<ffffffff805e3ecc>] do_page_fault+0x30c/0x890
[<ffffffff805e16cd>] error_exit+0x0/0xa9
[<ffffffffffffffff>] 0xffffffffffffffff

other info that might help us debug this:

2 locks held by AutopanoPro/4470:
#0: (&mm->mmap_sem){----}, at: [<ffffffff805e3e15>] do_page_fault+0x255/0x890
#1: (shrinker_rwsem){----}, at: [<ffffffff8029d732>] shrink_slab+0x32/0x1d0

stack backtrace:
Pid: 4470, comm: AutopanoPro Not tainted 2.6.26-rc7-210c #2

Call Trace:
[<ffffffff80276823>] print_circular_bug_tail+0x83/0x90
[<ffffffff80275e09>] ? print_circular_bug_entry+0x49/0x60
[<ffffffff80278db7>] __lock_acquire+0xa47/0x1020
[<ffffffff802793f5>] lock_acquire+0x65/0x90
[<ffffffff802d94fd>] ? shrink_icache_memory+0x7d/0x290
[<ffffffff805dedd5>] mutex_lock_nested+0xb5/0x300
[<ffffffff802d94fd>] ? shrink_icache_memory+0x7d/0x290
[<ffffffff802d94fd>] shrink_icache_memory+0x7d/0x290
[<ffffffff8029d732>] ? shrink_slab+0x32/0x1d0
[<ffffffff8029d868>] shrink_slab+0x168/0x1d0
[<ffffffff8029db38>] try_to_free_pages+0x268/0x3a0
[<ffffffff8029c240>] ? isolate_pages_global+0x0/0x40
[<ffffffff802979d6>] __alloc_pages_internal+0x206/0x4b0
[<ffffffff80297c89>] __alloc_pages_nodemask+0x9/0x10
[<ffffffff802b2bc2>] alloc_page_vma+0x72/0x1b0
[<ffffffff802a3642>] handle_mm_fault+0x462/0x7b0
[<ffffffff80277e2f>] ? trace_hardirqs_on+0xbf/0x150
[<ffffffff805e3e15>] ? do_page_fault+0x255/0x890
[<ffffffff805e3ecc>] do_page_fault+0x30c/0x890
[<ffffffff805e16cd>] error_exit+0x0/0xa9
--
Daniel J Blueman


2008-06-22 18:10:27

by Mel Gorman

Subject: Re: [2.6.26-rc7] shrink_icache from pagefault locking (nee: nfsd hangs for a few sec)...

On (22/06/08 10:58), Daniel J Blueman didst pronounce:
> I'm seeing a similar issue [2] to what was recently reported [1] by
> Alexander, but with another workload involving XFS and memory
> pressure.
>

Is NFS involved or is this XFS only? It looks like XFS-only but no harm in
being sure.

I'm beginning to wonder if this is a problem where a lot of dirty inodes are
being written back in this path and we stall while that happens. I'm still
not getting why we are triggering this now and did not before 2.6.26-rc1
or why it bisects to the zonelist modifications. Diffing the reclaim and
allocation paths between 2.6.25 and 2.6.26-rc1 has not yielded any candidates
for me yet that would explain this.

> SLUB allocator is in use and config is at http://quora.org/config-client-debug .
>
> Let me know if you'd like more details/vmlinux objdump etc.
>
> Thanks,
> Daniel
>
> --- [1]
>
> http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/e673c9173d45a735/db9213ef39e4e11c
>
> --- [2]
>
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.26-rc7-210c #2
> -------------------------------------------------------
> AutopanoPro/4470 is trying to acquire lock:
> (iprune_mutex){--..}, at: [<ffffffff802d94fd>] shrink_icache_memory+0x7d/0x290
>
> but task is already holding lock:
> (&mm->mmap_sem){----}, at: [<ffffffff805e3e15>] do_page_fault+0x255/0x890
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (&mm->mmap_sem){----}:
> [<ffffffff80278f4d>] __lock_acquire+0xbdd/0x1020
> [<ffffffff802793f5>] lock_acquire+0x65/0x90
> [<ffffffff805df5ab>] down_read+0x3b/0x70
> [<ffffffff805e3e3c>] do_page_fault+0x27c/0x890
> [<ffffffff805e16cd>] error_exit+0x0/0xa9
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> -> #1 (&(&ip->i_iolock)->mr_lock){----}:
> [<ffffffff80278f4d>] __lock_acquire+0xbdd/0x1020
> [<ffffffff802793f5>] lock_acquire+0x65/0x90
> [<ffffffff8026d746>] down_write_nested+0x46/0x80
> [<ffffffff8039df29>] xfs_ilock+0x99/0xa0
> [<ffffffff8039e0cf>] xfs_ireclaim+0x3f/0x90
> [<ffffffff803ba889>] xfs_finish_reclaim+0x59/0x1a0
> [<ffffffff803bc199>] xfs_reclaim+0x109/0x110
> [<ffffffff803c9541>] xfs_fs_clear_inode+0xe1/0x110
> [<ffffffff802d906d>] clear_inode+0x7d/0x110
> [<ffffffff802d93aa>] dispose_list+0x2a/0x100
> [<ffffffff802d96af>] shrink_icache_memory+0x22f/0x290
> [<ffffffff8029d868>] shrink_slab+0x168/0x1d0
> [<ffffffff8029e0b6>] kswapd+0x3b6/0x560
> [<ffffffff8026921d>] kthread+0x4d/0x80
> [<ffffffff80227428>] child_rip+0xa/0x12
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> -> #0 (iprune_mutex){--..}:
> [<ffffffff80278db7>] __lock_acquire+0xa47/0x1020
> [<ffffffff802793f5>] lock_acquire+0x65/0x90
> [<ffffffff805dedd5>] mutex_lock_nested+0xb5/0x300
> [<ffffffff802d94fd>] shrink_icache_memory+0x7d/0x290
> [<ffffffff8029d868>] shrink_slab+0x168/0x1d0
> [<ffffffff8029db38>] try_to_free_pages+0x268/0x3a0
> [<ffffffff802979d6>] __alloc_pages_internal+0x206/0x4b0
> [<ffffffff80297c89>] __alloc_pages_nodemask+0x9/0x10
> [<ffffffff802b2bc2>] alloc_page_vma+0x72/0x1b0
> [<ffffffff802a3642>] handle_mm_fault+0x462/0x7b0
> [<ffffffff805e3ecc>] do_page_fault+0x30c/0x890
> [<ffffffff805e16cd>] error_exit+0x0/0xa9
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> other info that might help us debug this:
>
> 2 locks held by AutopanoPro/4470:
> #0: (&mm->mmap_sem){----}, at: [<ffffffff805e3e15>] do_page_fault+0x255/0x890
> #1: (shrinker_rwsem){----}, at: [<ffffffff8029d732>] shrink_slab+0x32/0x1d0
>
> stack backtrace:
> Pid: 4470, comm: AutopanoPro Not tainted 2.6.26-rc7-210c #2
>
> Call Trace:
> [<ffffffff80276823>] print_circular_bug_tail+0x83/0x90
> [<ffffffff80275e09>] ? print_circular_bug_entry+0x49/0x60
> [<ffffffff80278db7>] __lock_acquire+0xa47/0x1020
> [<ffffffff802793f5>] lock_acquire+0x65/0x90
> [<ffffffff802d94fd>] ? shrink_icache_memory+0x7d/0x290
> [<ffffffff805dedd5>] mutex_lock_nested+0xb5/0x300
> [<ffffffff802d94fd>] ? shrink_icache_memory+0x7d/0x290
> [<ffffffff802d94fd>] shrink_icache_memory+0x7d/0x290
> [<ffffffff8029d732>] ? shrink_slab+0x32/0x1d0
> [<ffffffff8029d868>] shrink_slab+0x168/0x1d0
> [<ffffffff8029db38>] try_to_free_pages+0x268/0x3a0
> [<ffffffff8029c240>] ? isolate_pages_global+0x0/0x40
> [<ffffffff802979d6>] __alloc_pages_internal+0x206/0x4b0
> [<ffffffff80297c89>] __alloc_pages_nodemask+0x9/0x10
> [<ffffffff802b2bc2>] alloc_page_vma+0x72/0x1b0
> [<ffffffff802a3642>] handle_mm_fault+0x462/0x7b0
> [<ffffffff80277e2f>] ? trace_hardirqs_on+0xbf/0x150
> [<ffffffff805e3e15>] ? do_page_fault+0x255/0x890
> [<ffffffff805e3ecc>] do_page_fault+0x30c/0x890
> [<ffffffff805e16cd>] error_exit+0x0/0xa9
> --
> Daniel J Blueman
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-06-22 18:15:01

by Mel Gorman

Subject: Re: [2.6.26-rc7] shrink_icache from pagefault locking (nee: nfsd hangs for a few sec)...

(Sorry for the resend; the wrong email address was used for Dave Chinner)

On (22/06/08 10:58), Daniel J Blueman didst pronounce:
> I'm seeing a similar issue [2] to what was recently reported [1] by
> Alexander, but with another workload involving XFS and memory
> pressure.
>

Is NFS involved or is this XFS only? It looks like XFS-only but no harm in
being sure.

I'm beginning to wonder if this is a problem where a lot of dirty inodes are
being written back in this path and we stall while that happens. I'm still
not getting why we are triggering this now and did not before 2.6.26-rc1
or why it bisects to the zonelist modifications. Diffing the reclaim and
allocation paths between 2.6.25 and 2.6.26-rc1 has not yielded any candidates
for me yet that would explain this.

> SLUB allocator is in use and config is at http://quora.org/config-client-debug .
>
> Let me know if you'd like more details/vmlinux objdump etc.
>
> Thanks,
> Daniel
>
> --- [1]
>
> http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/e673c9173d45a735/db9213ef39e4e11c
>
> --- [2]
>
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.26-rc7-210c #2
> -------------------------------------------------------
> AutopanoPro/4470 is trying to acquire lock:
> (iprune_mutex){--..}, at: [<ffffffff802d94fd>] shrink_icache_memory+0x7d/0x290
>
> but task is already holding lock:
> (&mm->mmap_sem){----}, at: [<ffffffff805e3e15>] do_page_fault+0x255/0x890
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (&mm->mmap_sem){----}:
> [<ffffffff80278f4d>] __lock_acquire+0xbdd/0x1020
> [<ffffffff802793f5>] lock_acquire+0x65/0x90
> [<ffffffff805df5ab>] down_read+0x3b/0x70
> [<ffffffff805e3e3c>] do_page_fault+0x27c/0x890
> [<ffffffff805e16cd>] error_exit+0x0/0xa9
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> -> #1 (&(&ip->i_iolock)->mr_lock){----}:
> [<ffffffff80278f4d>] __lock_acquire+0xbdd/0x1020
> [<ffffffff802793f5>] lock_acquire+0x65/0x90
> [<ffffffff8026d746>] down_write_nested+0x46/0x80
> [<ffffffff8039df29>] xfs_ilock+0x99/0xa0
> [<ffffffff8039e0cf>] xfs_ireclaim+0x3f/0x90
> [<ffffffff803ba889>] xfs_finish_reclaim+0x59/0x1a0
> [<ffffffff803bc199>] xfs_reclaim+0x109/0x110
> [<ffffffff803c9541>] xfs_fs_clear_inode+0xe1/0x110
> [<ffffffff802d906d>] clear_inode+0x7d/0x110
> [<ffffffff802d93aa>] dispose_list+0x2a/0x100
> [<ffffffff802d96af>] shrink_icache_memory+0x22f/0x290
> [<ffffffff8029d868>] shrink_slab+0x168/0x1d0
> [<ffffffff8029e0b6>] kswapd+0x3b6/0x560
> [<ffffffff8026921d>] kthread+0x4d/0x80
> [<ffffffff80227428>] child_rip+0xa/0x12
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> -> #0 (iprune_mutex){--..}:
> [<ffffffff80278db7>] __lock_acquire+0xa47/0x1020
> [<ffffffff802793f5>] lock_acquire+0x65/0x90
> [<ffffffff805dedd5>] mutex_lock_nested+0xb5/0x300
> [<ffffffff802d94fd>] shrink_icache_memory+0x7d/0x290
> [<ffffffff8029d868>] shrink_slab+0x168/0x1d0
> [<ffffffff8029db38>] try_to_free_pages+0x268/0x3a0
> [<ffffffff802979d6>] __alloc_pages_internal+0x206/0x4b0
> [<ffffffff80297c89>] __alloc_pages_nodemask+0x9/0x10
> [<ffffffff802b2bc2>] alloc_page_vma+0x72/0x1b0
> [<ffffffff802a3642>] handle_mm_fault+0x462/0x7b0
> [<ffffffff805e3ecc>] do_page_fault+0x30c/0x890
> [<ffffffff805e16cd>] error_exit+0x0/0xa9
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> other info that might help us debug this:
>
> 2 locks held by AutopanoPro/4470:
> #0: (&mm->mmap_sem){----}, at: [<ffffffff805e3e15>] do_page_fault+0x255/0x890
> #1: (shrinker_rwsem){----}, at: [<ffffffff8029d732>] shrink_slab+0x32/0x1d0
>
> stack backtrace:
> Pid: 4470, comm: AutopanoPro Not tainted 2.6.26-rc7-210c #2
>
> Call Trace:
> [<ffffffff80276823>] print_circular_bug_tail+0x83/0x90
> [<ffffffff80275e09>] ? print_circular_bug_entry+0x49/0x60
> [<ffffffff80278db7>] __lock_acquire+0xa47/0x1020
> [<ffffffff802793f5>] lock_acquire+0x65/0x90
> [<ffffffff802d94fd>] ? shrink_icache_memory+0x7d/0x290
> [<ffffffff805dedd5>] mutex_lock_nested+0xb5/0x300
> [<ffffffff802d94fd>] ? shrink_icache_memory+0x7d/0x290
> [<ffffffff802d94fd>] shrink_icache_memory+0x7d/0x290
> [<ffffffff8029d732>] ? shrink_slab+0x32/0x1d0
> [<ffffffff8029d868>] shrink_slab+0x168/0x1d0
> [<ffffffff8029db38>] try_to_free_pages+0x268/0x3a0
> [<ffffffff8029c240>] ? isolate_pages_global+0x0/0x40
> [<ffffffff802979d6>] __alloc_pages_internal+0x206/0x4b0
> [<ffffffff80297c89>] __alloc_pages_nodemask+0x9/0x10
> [<ffffffff802b2bc2>] alloc_page_vma+0x72/0x1b0
> [<ffffffff802a3642>] handle_mm_fault+0x462/0x7b0
> [<ffffffff80277e2f>] ? trace_hardirqs_on+0xbf/0x150
> [<ffffffff805e3e15>] ? do_page_fault+0x255/0x890
> [<ffffffff805e3ecc>] do_page_fault+0x30c/0x890
> [<ffffffff805e16cd>] error_exit+0x0/0xa9
> --
> Daniel J Blueman
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-06-22 18:21:21

by Arjan van de Ven

Subject: Re: [2.6.26-rc7] shrink_icache from pagefault locking (nee: nfsd hangs for a few sec)...

On Sun, 22 Jun 2008 19:10:11 +0100
Mel Gorman <[email protected]> wrote:

> On (22/06/08 10:58), Daniel J Blueman didst pronounce:
> > I'm seeing a similar issue [2] to what was recently reported [1] by
> > Alexander, but with another workload involving XFS and memory
> > pressure.
> >
>
> Is NFS involved or is this XFS only? It looks like XFS-only but no
> harm in being sure.
>
> I'm beginning to wonder if this is a problem where a lot of dirty
> inodes are being written back in this path and we stall while that
> happens. I'm still not getting why we are triggering this now and did
> not before 2.6.26-rc1 or why it bisects to the zonelist
> modifications. Diffing the reclaim and allocation paths between
> 2.6.25 and 2.6.26-rc1 has not yielded any candidates for me yet that
> would explain this.
>

this sort of thing can easily be exposed with the latencytop tool...
it will at least tell you WHAT the system is blocking on.
(not so much the why, the tool isn't smart enough to automatically spit
out kernel patches yet)

--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-06-22 18:55:20

by Daniel J Blueman

Subject: Re: [2.6.26-rc7] shrink_icache from pagefault locking (nee: nfsd hangs for a few sec)...

On Sun, Jun 22, 2008 at 7:14 PM, Mel Gorman <[email protected]> wrote:
> On (22/06/08 10:58), Daniel J Blueman didst pronounce:
>> I'm seeing a similar issue [2] to what was recently reported [1] by
>> Alexander, but with another workload involving XFS and memory
>> pressure.
>>
>
> Is NFS involved or is this XFS only? It looks like XFS-only but no harm in
> being sure.

The application is reading a 10MB file from NFS every few seconds,
and writing back ~30MB every few seconds to local XFS;
another thread in the same application is consuming that data and
writing out ~2MB to NFS after circa 10 input files, so NFS isn't
dominant but is involved.

> I'm beginning to wonder if this is a problem where a lot of dirty inodes are
> being written back in this path and we stall while that happens. I'm still
> not getting why we are triggering this now and did not before 2.6.26-rc1
> or why it bisects to the zonelist modifications. Diffing the reclaim and
> allocation paths between 2.6.25 and 2.6.26-rc1 has not yielded any candidates
> for me yet that would explain this.
>
>> SLUB allocator is in use and config is at http://quora.org/config-client-debug .
>>
>> Let me know if you'd like more details/vmlinux objdump etc.
>>
>> Thanks,
>> Daniel
>>
>> --- [1]
>>
>> http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/e673c9173d45a735/db9213ef39e4e11c
>>
>> --- [2]
>>
>> =======================================================
>> [ INFO: possible circular locking dependency detected ]
>> 2.6.26-rc7-210c #2
>> -------------------------------------------------------
>> AutopanoPro/4470 is trying to acquire lock:
>> (iprune_mutex){--..}, at: [<ffffffff802d94fd>] shrink_icache_memory+0x7d/0x290
>>
>> but task is already holding lock:
>> (&mm->mmap_sem){----}, at: [<ffffffff805e3e15>] do_page_fault+0x255/0x890
>>
>> which lock already depends on the new lock.
>>
>>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #2 (&mm->mmap_sem){----}:
>> [<ffffffff80278f4d>] __lock_acquire+0xbdd/0x1020
>> [<ffffffff802793f5>] lock_acquire+0x65/0x90
>> [<ffffffff805df5ab>] down_read+0x3b/0x70
>> [<ffffffff805e3e3c>] do_page_fault+0x27c/0x890
>> [<ffffffff805e16cd>] error_exit+0x0/0xa9
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> -> #1 (&(&ip->i_iolock)->mr_lock){----}:
>> [<ffffffff80278f4d>] __lock_acquire+0xbdd/0x1020
>> [<ffffffff802793f5>] lock_acquire+0x65/0x90
>> [<ffffffff8026d746>] down_write_nested+0x46/0x80
>> [<ffffffff8039df29>] xfs_ilock+0x99/0xa0
>> [<ffffffff8039e0cf>] xfs_ireclaim+0x3f/0x90
>> [<ffffffff803ba889>] xfs_finish_reclaim+0x59/0x1a0
>> [<ffffffff803bc199>] xfs_reclaim+0x109/0x110
>> [<ffffffff803c9541>] xfs_fs_clear_inode+0xe1/0x110
>> [<ffffffff802d906d>] clear_inode+0x7d/0x110
>> [<ffffffff802d93aa>] dispose_list+0x2a/0x100
>> [<ffffffff802d96af>] shrink_icache_memory+0x22f/0x290
>> [<ffffffff8029d868>] shrink_slab+0x168/0x1d0
>> [<ffffffff8029e0b6>] kswapd+0x3b6/0x560
>> [<ffffffff8026921d>] kthread+0x4d/0x80
>> [<ffffffff80227428>] child_rip+0xa/0x12
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> -> #0 (iprune_mutex){--..}:
>> [<ffffffff80278db7>] __lock_acquire+0xa47/0x1020
>> [<ffffffff802793f5>] lock_acquire+0x65/0x90
>> [<ffffffff805dedd5>] mutex_lock_nested+0xb5/0x300
>> [<ffffffff802d94fd>] shrink_icache_memory+0x7d/0x290
>> [<ffffffff8029d868>] shrink_slab+0x168/0x1d0
>> [<ffffffff8029db38>] try_to_free_pages+0x268/0x3a0
>> [<ffffffff802979d6>] __alloc_pages_internal+0x206/0x4b0
>> [<ffffffff80297c89>] __alloc_pages_nodemask+0x9/0x10
>> [<ffffffff802b2bc2>] alloc_page_vma+0x72/0x1b0
>> [<ffffffff802a3642>] handle_mm_fault+0x462/0x7b0
>> [<ffffffff805e3ecc>] do_page_fault+0x30c/0x890
>> [<ffffffff805e16cd>] error_exit+0x0/0xa9
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> other info that might help us debug this:
>>
>> 2 locks held by AutopanoPro/4470:
>> #0: (&mm->mmap_sem){----}, at: [<ffffffff805e3e15>] do_page_fault+0x255/0x890
>> #1: (shrinker_rwsem){----}, at: [<ffffffff8029d732>] shrink_slab+0x32/0x1d0
>>
>> stack backtrace:
>> Pid: 4470, comm: AutopanoPro Not tainted 2.6.26-rc7-210c #2
>>
>> Call Trace:
>> [<ffffffff80276823>] print_circular_bug_tail+0x83/0x90
>> [<ffffffff80275e09>] ? print_circular_bug_entry+0x49/0x60
>> [<ffffffff80278db7>] __lock_acquire+0xa47/0x1020
>> [<ffffffff802793f5>] lock_acquire+0x65/0x90
>> [<ffffffff802d94fd>] ? shrink_icache_memory+0x7d/0x290
>> [<ffffffff805dedd5>] mutex_lock_nested+0xb5/0x300
>> [<ffffffff802d94fd>] ? shrink_icache_memory+0x7d/0x290
>> [<ffffffff802d94fd>] shrink_icache_memory+0x7d/0x290
>> [<ffffffff8029d732>] ? shrink_slab+0x32/0x1d0
>> [<ffffffff8029d868>] shrink_slab+0x168/0x1d0
>> [<ffffffff8029db38>] try_to_free_pages+0x268/0x3a0
>> [<ffffffff8029c240>] ? isolate_pages_global+0x0/0x40
>> [<ffffffff802979d6>] __alloc_pages_internal+0x206/0x4b0
>> [<ffffffff80297c89>] __alloc_pages_nodemask+0x9/0x10
>> [<ffffffff802b2bc2>] alloc_page_vma+0x72/0x1b0
>> [<ffffffff802a3642>] handle_mm_fault+0x462/0x7b0
>> [<ffffffff80277e2f>] ? trace_hardirqs_on+0xbf/0x150
>> [<ffffffff805e3e15>] ? do_page_fault+0x255/0x890
>> [<ffffffff805e3ecc>] do_page_fault+0x30c/0x890
>> [<ffffffff805e16cd>] error_exit+0x0/0xa9
>> --
>> Daniel J Blueman
>>
>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab
--
Daniel J Blueman

2008-06-22 20:56:27

by Daniel J Blueman

Subject: Re: [2.6.26-rc7] shrink_icache from pagefault locking (nee: nfsd hangs for a few sec)...

On Sun, Jun 22, 2008 at 7:21 PM, Arjan van de Ven <[email protected]> wrote:
> On Sun, 22 Jun 2008 19:10:11 +0100
> Mel Gorman <[email protected]> wrote:
>
>> On (22/06/08 10:58), Daniel J Blueman didst pronounce:
>> > I'm seeing a similar issue [2] to what was recently reported [1] by
>> > Alexander, but with another workload involving XFS and memory
>> > pressure.
>> >
>>
>> Is NFS involved or is this XFS only? It looks like XFS-only but no
>> harm in being sure.
>>
>> I'm beginning to wonder if this is a problem where a lot of dirty
>> inodes are being written back in this path and we stall while that
>> happens. I'm still not getting why we are triggering this now and did
>> not before 2.6.26-rc1 or why it bisects to the zonelist
>> modifications. Diffing the reclaim and allocation paths between
>> 2.6.25 and 2.6.26-rc1 has not yielded any candidates for me yet that
>> would explain this.
>
> this sort of thing can easily be exposed with the latencytop tool...
> it will at least tell you WHAT the system is blocking on.
> (not so much the why, the tool isn't smart enough to automatically spit
> out kernel patches yet)

Good plan. I reproduced this without NFS mounted, so pure XFS. I
wasn't able to capture the same process's (ie 5480) latencytop trace,
but it may have contributed to the list.

A fair amount of debugging was turned on, hurting the latency some
(despite it being a 3.6GHz Penryn).

Daniel

--- [1]

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
<snip>
1 1 156 14424 12 2329672 0 0 0 110755 177 3820 57 6 36 0
2 1 156 14736 12 2348152 0 0 24 25172 204 26018 35 21 43 1
5 0 156 24252 12 2363800 0 0 59656 31 545 25292 35 14 28 23
4 0 156 14696 12 2317784 0 0 3824 0 38 23083 95 6 0 0
4 0 156 14440 12 2319304 0 0 4672 0 72 3372 93 3 3 2
2 0 156 14428 12 2318484 0 0 0 4 27 731 52 0 49 0
2 0 156 14480 12 2308512 0 0 0 12 32 36629 39 13 49 0
2 0 156 14572 12 2301220 0 0 3904 12316 117 10760 58 7 26 11

--- [2]

Cause Maximum Percentage
down xfs_buf_lock _xfs_buf_find xfs_buf_get_flags 271.1 msec 0.4 %
down xfs_buf_iowait xfs_buf_iostart xfs_buf_read_f206.1 msec 1.3 %
down xfs_buf_lock xfs_getsb xfs_trans_getsb xfs_tr160.4 msec 0.5 %
sync_page sync_page_killable __lock_page_killable 155.3 msec 26.1 %
down xfs_buf_iowait xfs_buf_iostart xfs_buf_read_f101.6 msec 0.6 %
down xfs_buf_iowait xfs_buf_iostart xfs_buf_read_f 99.1 msec 0.7 %
sync_page __lock_page find_lock_page filemap_fault 84.0 msec 1.5 %
down xfs_buf_iowait xfs_buf_iostart xfs_buf_read_f 70.8 msec 0.4 %
sync_buffer __wait_on_buffer __block_prepare_write 66.4 msec 0.1 %
o_sync_readv_writev do_readv_writev vfs_writev

Process AutopanoPro (4697) Total: 4693.4 msec
down xfs_buf_lock xfs_getsb xfs_trans_getsb xfs_tr 126.2 msec 2.7 %
  trans_commit xfs_create xfs_vn_mknod xfs_vn_create vfs_create
  do_filp_open do_sys_open
sync_page sync_page_killable __lock_page_killable 89.3 msec 15.3 %
  _read __xfs_file_read xfs_file_aio_read do_sync_read vfs_read
  sys_read system_call_after
down xfs_buf_iowait xfs_buf_iostart xfs_buf_read_f 88.5 msec 2.2 %
  fs_da_do_buf xfs_da_read_buf xfs_dir2_leaf_lookup_int
  xfs_dir2_leaf_lookup xfs_dir_looku
Scheduler: waiting for cpu 49.4 msec 77.7 %
congestion_wait __alloc_pages_internal __alloc_pag 19.8 msec 1.4 %
  e_fault error_exit
congestion_wait __alloc_pages_internal __alloc_pag 19.4 msec 0.4 %
  ead ondemand_readahead page_cache_async_readahead
  generic_file_aio_read xfs_read __xfs_f
  ile_read xfs_file_aio_read do_sync_read vfs_read

--- [3]

Cause Maximum Percentage
sync_page sync_page_killable __lock_page_killable 103.9 msec 36.5 %
Scheduler: waiting for cpu 37.0 msec 49.5 %
sync_page __lock_page find_lock_page filemap_fault 28.6 msec 1.2 %
congestion_wait __alloc_pages_internal __alloc_pag 19.5 msec 1.3 %
congestion_wait __alloc_pages_internal __alloc_pag 19.2 msec 0.4 %
congestion_wait __alloc_pages_internal __alloc_pag 19.2 msec 0.4 %
do_page_fault error_exit d do_sync_read vfs_read 11.6 msec 0.5 %
sys_mmap system_call_after_swapgs 7.8 msec 0.2 %
do_select core_sys_select sys_select system_call_a 4.9 msec 8.1 %

Process AutopanoPro (4697) Total: 977.5 msec
sync_page sync_page_killable __lock_page_killable 97.7 msec 76.6 %
  _read __xfs_file_read xfs_file_aio_read do_sync_read vfs_read
  sys_read system_call_after
sync_page __lock_page find_lock_page filemap_fault 28.6 msec 5.6 %
  lt do_page_fault error_exit
congestion_wait __alloc_pages_internal __alloc_pag 19.5 msec 2.0 %
  e_fault error_exit
congestion_wait __alloc_pages_internal __alloc_pag 19.2 msec 2.0 %
  k_write_begin xfs_vm_write_begin generic_file_buffered_write
  xfs_write __xfs_file_write _write vfs_write
Scheduler: waiting for cpu 16.5 msec 8.7 %
do_page_fault error_exit 11.6 msec 2.3 %

--- [4]

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.26-rc7-211c #2
-------------------------------------------------------
AutopanoPro/5480 is trying to acquire lock:
(iprune_mutex){--..}, at: [<ffffffff802d2a5d>] shrink_icache_memory+0x7d/0x280

but task is already holding lock:
(&(&ip->i_iolock)->mr_lock){----}, at: [<ffffffff803a459f>]
xfs_ilock+0xbf/0x110

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&(&ip->i_iolock)->mr_lock){----}:
[<ffffffff802769ad>] __lock_acquire+0xbdd/0x1020
[<ffffffff80276e55>] lock_acquire+0x65/0x90
[<ffffffff8026b41b>] down_write_nested+0x4b/0x90
[<ffffffff803a45df>] xfs_ilock+0xff/0x110
[<ffffffff803a47af>] xfs_ireclaim+0x3f/0x90
[<ffffffff803c9169>] xfs_finish_reclaim+0x59/0x220
[<ffffffff803cc3b5>] xfs_reclaim+0x185/0x190
[<ffffffff803d9211>] xfs_fs_clear_inode+0xe1/0x130
[<ffffffff802d25c7>] clear_inode+0x87/0x120
[<ffffffff802d290a>] dispose_list+0x2a/0x100
[<ffffffff802d2c06>] shrink_icache_memory+0x226/0x280
[<ffffffff8029d9c5>] shrink_slab+0x125/0x180
[<ffffffff8029dc52>] try_to_free_pages+0x232/0x360
[<ffffffff80297f0d>] __alloc_pages_internal+0x1ed/0x4a0
[<ffffffff802981db>] __alloc_pages+0xb/0x10
[<ffffffff802a312a>] handle_mm_fault+0x46a/0x6d0
[<ffffffff8060510a>] do_page_fault+0x3ca/0x830
[<ffffffff80602add>] error_exit+0x0/0xa9
[<ffffffffffffffff>] 0xffffffffffffffff

-> #0 (iprune_mutex){--..}:
[<ffffffff80276817>] __lock_acquire+0xa47/0x1020
[<ffffffff80276e55>] lock_acquire+0x65/0x90
[<ffffffff8060059a>] mutex_lock_nested+0xba/0x2b0
[<ffffffff802d2a5d>] shrink_icache_memory+0x7d/0x280
[<ffffffff8029d9c5>] shrink_slab+0x125/0x180
[<ffffffff8029dc52>] try_to_free_pages+0x232/0x360
[<ffffffff80297f0d>] __alloc_pages_internal+0x1ed/0x4a0
[<ffffffff802981db>] __alloc_pages+0xb/0x10
[<ffffffff8029a6b6>] __do_page_cache_readahead+0x136/0x230
[<ffffffff8029aa08>] ondemand_readahead+0x128/0x1f0
[<ffffffff8029ab45>] page_cache_async_readahead+0x75/0xa0
[<ffffffff80293a8a>] generic_file_aio_read+0x28a/0x610
[<ffffffff803d78c4>] xfs_read+0x124/0x270
[<ffffffff803d4416>] __xfs_file_read+0x46/0x50
[<ffffffff803d4451>] xfs_file_aio_read+0x11/0x20
[<ffffffff802bc1b1>] do_sync_read+0xf1/0x130
[<ffffffff802bca74>] vfs_read+0xc4/0x160
[<ffffffff802bcf10>] sys_read+0x50/0x90
[<ffffffff8022639b>] system_call_after_swapgs+0x7b/0x80
[<ffffffffffffffff>] 0xffffffffffffffff

other info that might help us debug this:

2 locks held by AutopanoPro/5480:
#0: (&(&ip->i_iolock)->mr_lock){----}, at: [<ffffffff803a459f>]
xfs_ilock+0xbf/0x110
#1: (shrinker_rwsem){----}, at: [<ffffffff8029d8d2>] shrink_slab+0x32/0x180

stack backtrace:
Pid: 5480, comm: AutopanoPro Not tainted 2.6.26-rc7-211c #2

Call Trace:
[<ffffffff80274293>] print_circular_bug_tail+0x83/0x90
[<ffffffff80276817>] __lock_acquire+0xa47/0x1020
[<ffffffff8027588f>] ? trace_hardirqs_on+0xbf/0x150
[<ffffffff80276e55>] lock_acquire+0x65/0x90
[<ffffffff802d2a5d>] ? shrink_icache_memory+0x7d/0x280
[<ffffffff8060059a>] mutex_lock_nested+0xba/0x2b0
[<ffffffff802d2a5d>] ? shrink_icache_memory+0x7d/0x280
[<ffffffff80602915>] ? _spin_unlock_irqrestore+0x55/0x60
[<ffffffff802d2a5d>] shrink_icache_memory+0x7d/0x280
[<ffffffff8029d8d2>] ? shrink_slab+0x32/0x180
[<ffffffff8029d9c5>] shrink_slab+0x125/0x180
[<ffffffff8029dc52>] try_to_free_pages+0x232/0x360
[<ffffffff8029c570>] ? isolate_pages_global+0x0/0x40
[<ffffffff80297f0d>] __alloc_pages_internal+0x1ed/0x4a0
[<ffffffff802981db>] __alloc_pages+0xb/0x10
[<ffffffff8029a6b6>] __do_page_cache_readahead+0x136/0x230
[<ffffffff8029aa08>] ondemand_readahead+0x128/0x1f0
[<ffffffff8029ab45>] page_cache_async_readahead+0x75/0xa0
[<ffffffff80293a8a>] generic_file_aio_read+0x28a/0x610
[<ffffffff803d78c4>] xfs_read+0x124/0x270
[<ffffffff803d4416>] __xfs_file_read+0x46/0x50
[<ffffffff803d4451>] xfs_file_aio_read+0x11/0x20
[<ffffffff802bc1b1>] do_sync_read+0xf1/0x130
[<ffffffff802673f0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8022ce57>] ? native_sched_clock+0x57/0x80
[<ffffffff802bda68>] ? fget_light+0xe8/0x100
[<ffffffff80272dea>] ? lock_release_holdtime+0x5a/0x80
[<ffffffff802bca74>] vfs_read+0xc4/0x160
[<ffffffff802bcf10>] sys_read+0x50/0x90
[<ffffffff8022639b>] system_call_after_swapgs+0x7b/0x80
--
Daniel J Blueman

2008-06-22 22:19:46

by Dave Chinner

Subject: Re: [2.6.26-rc7] shrink_icache from pagefault locking (nee: nfsd hangs for a few sec)...

[added [email protected] to cc]

On Sun, Jun 22, 2008 at 10:58:56AM +0100, Daniel J Blueman wrote:
> I'm seeing a similar issue [2] to what was recently reported [1] by
> Alexander, but with another workload involving XFS and memory
> pressure.
>
> SLUB allocator is in use and config is at http://quora.org/config-client-debug .
>
> Let me know if you'd like more details/vmlinux objdump etc.
>
> Thanks,
> Daniel
>
> --- [1]
>
> http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/e673c9173d45a735/db9213ef39e4e11c
>
> --- [2]
>
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.26-rc7-210c #2
> -------------------------------------------------------
> AutopanoPro/4470 is trying to acquire lock:
> (iprune_mutex){--..}, at: [<ffffffff802d94fd>] shrink_icache_memory+0x7d/0x290
>
> but task is already holding lock:
> (&mm->mmap_sem){----}, at: [<ffffffff805e3e15>] do_page_fault+0x255/0x890
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (&mm->mmap_sem){----}:
> [<ffffffff80278f4d>] __lock_acquire+0xbdd/0x1020
> [<ffffffff802793f5>] lock_acquire+0x65/0x90
> [<ffffffff805df5ab>] down_read+0x3b/0x70
> [<ffffffff805e3e3c>] do_page_fault+0x27c/0x890
> [<ffffffff805e16cd>] error_exit+0x0/0xa9
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> -> #1 (&(&ip->i_iolock)->mr_lock){----}:
> [<ffffffff80278f4d>] __lock_acquire+0xbdd/0x1020
> [<ffffffff802793f5>] lock_acquire+0x65/0x90
> [<ffffffff8026d746>] down_write_nested+0x46/0x80
> [<ffffffff8039df29>] xfs_ilock+0x99/0xa0
> [<ffffffff8039e0cf>] xfs_ireclaim+0x3f/0x90
> [<ffffffff803ba889>] xfs_finish_reclaim+0x59/0x1a0
> [<ffffffff803bc199>] xfs_reclaim+0x109/0x110
> [<ffffffff803c9541>] xfs_fs_clear_inode+0xe1/0x110
> [<ffffffff802d906d>] clear_inode+0x7d/0x110
> [<ffffffff802d93aa>] dispose_list+0x2a/0x100
> [<ffffffff802d96af>] shrink_icache_memory+0x22f/0x290
> [<ffffffff8029d868>] shrink_slab+0x168/0x1d0
> [<ffffffff8029e0b6>] kswapd+0x3b6/0x560
> [<ffffffff8026921d>] kthread+0x4d/0x80
> [<ffffffff80227428>] child_rip+0xa/0x12
> [<ffffffffffffffff>] 0xffffffffffffffff

You may as well ignore anything involving this path in XFS until
lockdep gets fixed. The kswapd reclaim path is inverted over the
synchronous reclaim path that is xfs_ilock -> run out of memory ->
prune_icache and then potentially another -> xfs_ilock.

In this case, XFS can *never* deadlock because the second xfs_ilock
is on a different, unreferenced, unlocked inode, but without turning
off lockdep there is nothing in XFS that can be done to prevent
this warning.

There is a similar bug in the VM w.r.t. the mmap_sem in that the
mmap_sem is held across a call to put_filp() which can result in
inversions between the xfs_ilock and mmap_sem.

Both of these cases cannot be solved by changing XFS - lockdep
needs to be made aware of paths that can invert normal locking
order (like prune_icache) so it doesn't give false positives
like this.
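
To make the shape of the report concrete, here is a minimal userspace
sketch (pthread mutexes standing in for the three kernel locks; the
names and the lock_pair() helper are made up for illustration, this is
not kernel code) of the three ordering edges lockdep has recorded:

#include <pthread.h>

static pthread_mutex_t iprune  = PTHREAD_MUTEX_INITIALIZER; /* iprune_mutex     */
static pthread_mutex_t ilock   = PTHREAD_MUTEX_INITIALIZER; /* i_iolock/mr_lock */
static pthread_mutex_t mmapsem = PTHREAD_MUTEX_INITIALIZER; /* mmap_sem         */

static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
    pthread_mutex_unlock(b);
    pthread_mutex_unlock(a);
}

int main(void)
{
    /* kswapd: prune_icache -> xfs_ilock while disposing an unused inode */
    lock_pair(&iprune, &ilock);
    /* read/write path: xfs_ilock -> mmap_sem, faulting on the user buffer */
    lock_pair(&ilock, &mmapsem);
    /* page fault: mmap_sem -> iprune_mutex via direct reclaim */
    lock_pair(&mmapsem, &iprune);
    /*
     * Nothing here can deadlock - the three orderings never overlap -
     * but a checker that only records acquisition order sees the
     * iprune -> ilock -> mmapsem -> iprune cycle and reports it,
     * which is exactly what lockdep is doing above.
     */
    return 0;
}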

> -> #0 (iprune_mutex){--..}:
> [<ffffffff80278db7>] __lock_acquire+0xa47/0x1020
> [<ffffffff802793f5>] lock_acquire+0x65/0x90
> [<ffffffff805dedd5>] mutex_lock_nested+0xb5/0x300
> [<ffffffff802d94fd>] shrink_icache_memory+0x7d/0x290
> [<ffffffff8029d868>] shrink_slab+0x168/0x1d0
> [<ffffffff8029db38>] try_to_free_pages+0x268/0x3a0
> [<ffffffff802979d6>] __alloc_pages_internal+0x206/0x4b0
> [<ffffffff80297c89>] __alloc_pages_nodemask+0x9/0x10
> [<ffffffff802b2bc2>] alloc_page_vma+0x72/0x1b0
> [<ffffffff802a3642>] handle_mm_fault+0x462/0x7b0
> [<ffffffff805e3ecc>] do_page_fault+0x30c/0x890
> [<ffffffff805e16cd>] error_exit+0x0/0xa9
> [<ffffffffffffffff>] 0xffffffffffffffff

This case is different in that it is complaining about mmap_sem vs
iprune_mutex, so I think that we can pretty much ignore the XFS side
of things here - the problem is higher level code....

> [<ffffffff8029db38>] try_to_free_pages+0x268/0x3a0
> [<ffffffff8029c240>] ? isolate_pages_global+0x0/0x40
> [<ffffffff802979d6>] __alloc_pages_internal+0x206/0x4b0
> [<ffffffff80297c89>] __alloc_pages_nodemask+0x9/0x10
> [<ffffffff802b2bc2>] alloc_page_vma+0x72/0x1b0
> [<ffffffff802a3642>] handle_mm_fault+0x462/0x7b0

FWIW, should page allocation in a page fault be allowed to recurse
into the filesystem? If I follow the spaghetti of inline and
compiler inlined functions correctly, this is a GFP_HIGHUSER_MOVABLE
allocation, right? Should we be allowing shrink_icache_memory()
to be called at all in the page fault path?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-06-22 22:30:09

by Dave Chinner

Subject: Re: [2.6.26-rc7] shrink_icache from pagefault locking (nee: nfsd hangs for a few sec)...

On Sun, Jun 22, 2008 at 09:56:17PM +0100, Daniel J Blueman wrote:
> On Sun, Jun 22, 2008 at 7:21 PM, Arjan van de Ven <[email protected]> wrote:
> > this sort of thing can easily be exposed with the latencytop tool...
> > it will at least tell you WHAT the system is blocking on.
> > (not so much the why, the tool isn't smart enough to automatically spit
> > out kernel patches yet)
>
> Good plan. I reproduced this without NFS mounted, so pure XFS. I
> wasn't able to capture the same process's (ie 5480) latencytop trace,
> but it may have contributed to the list.
>
> A fair amount of debugging was turned on, hurting the latency some
> (despite it being a 3.6GHz Penryn).
>
> Daniel
>
> --- [1]
>
> $ vmstat 1
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> <snip>
> 1 1 156 14424 12 2329672 0 0 0 110755 177 3820 57 6 36 0
> 2 1 156 14736 12 2348152 0 0 24 25172 204 26018 35 21 43 1
> 5 0 156 24252 12 2363800 0 0 59656 31 545 25292 35 14 28 23
> 4 0 156 14696 12 2317784 0 0 3824 0 38 23083 95 6 0 0
> 4 0 156 14440 12 2319304 0 0 4672 0 72 3372 93 3 3 2
> 2 0 156 14428 12 2318484 0 0 0 4 27 731 52 0 49 0
> 2 0 156 14480 12 2308512 0 0 0 12 32 36629 39 13 49 0
> 2 0 156 14572 12 2301220 0 0 3904 12316 117 10760 58 7 26 11
>
> --- [2]
>
> Cause Maximum Percentage
> down xfs_buf_lock _xfs_buf_find xfs_buf_get_flags 271.1 msec 0.4 %

Waiting on I/O to complete. Your disk is busy.

> down xfs_buf_iowait xfs_buf_iostart xfs_buf_read_f206.1 msec 1.3 %

Waiting on I/O to complete. Your disk is busy.

> down xfs_buf_lock xfs_getsb xfs_trans_getsb xfs_tr160.4 msec 0.5 %

Waiting on a superblock I/O or a transaction to complete. Your
disk is busy. (Note, this one can be avoided with lazy
superblock counters).

[snip rest of "busy disk trace"]

But really, all latencytop is telling you here is that it takes
time to wait for I/O to complete. It's mostly useless for tracking
down locking issues when you've got I/O in progress...

> 2.6.26-rc7-211c #2
> -------------------------------------------------------
> AutopanoPro/5480 is trying to acquire lock:
> (iprune_mutex){--..}, at: [<ffffffff802d2a5d>] shrink_icache_memory+0x7d/0x280
>
> but task is already holding lock:
> (&(&ip->i_iolock)->mr_lock){----}, at: [<ffffffff803a459f>]
> xfs_ilock+0xbf/0x110
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (&(&ip->i_iolock)->mr_lock){----}:
> [<ffffffff802769ad>] __lock_acquire+0xbdd/0x1020
> [<ffffffff80276e55>] lock_acquire+0x65/0x90
> [<ffffffff8026b41b>] down_write_nested+0x4b/0x90
> [<ffffffff803a45df>] xfs_ilock+0xff/0x110
> [<ffffffff803a47af>] xfs_ireclaim+0x3f/0x90
> [<ffffffff803c9169>] xfs_finish_reclaim+0x59/0x220
> [<ffffffff803cc3b5>] xfs_reclaim+0x185/0x190
> [<ffffffff803d9211>] xfs_fs_clear_inode+0xe1/0x130
> [<ffffffff802d25c7>] clear_inode+0x87/0x120
> [<ffffffff802d290a>] dispose_list+0x2a/0x100
> [<ffffffff802d2c06>] shrink_icache_memory+0x226/0x280
> [<ffffffff8029d9c5>] shrink_slab+0x125/0x180
> [<ffffffff8029dc52>] try_to_free_pages+0x232/0x360
> [<ffffffff80297f0d>] __alloc_pages_internal+0x1ed/0x4a0
> [<ffffffff802981db>] __alloc_pages+0xb/0x10
> [<ffffffff802a312a>] handle_mm_fault+0x46a/0x6d0
> [<ffffffff8060510a>] do_page_fault+0x3ca/0x830
> [<ffffffff80602add>] error_exit+0x0/0xa9
> [<ffffffffffffffff>] 0xffffffffffffffff

mmap_sem -> iprune_mutex -> xfs_ilock

> -> #0 (iprune_mutex){--..}:
> [<ffffffff80276817>] __lock_acquire+0xa47/0x1020
> [<ffffffff80276e55>] lock_acquire+0x65/0x90
> [<ffffffff8060059a>] mutex_lock_nested+0xba/0x2b0
> [<ffffffff802d2a5d>] shrink_icache_memory+0x7d/0x280
> [<ffffffff8029d9c5>] shrink_slab+0x125/0x180
> [<ffffffff8029dc52>] try_to_free_pages+0x232/0x360
> [<ffffffff80297f0d>] __alloc_pages_internal+0x1ed/0x4a0
> [<ffffffff802981db>] __alloc_pages+0xb/0x10
> [<ffffffff8029a6b6>] __do_page_cache_readahead+0x136/0x230
> [<ffffffff8029aa08>] ondemand_readahead+0x128/0x1f0
> [<ffffffff8029ab45>] page_cache_async_readahead+0x75/0xa0
> [<ffffffff80293a8a>] generic_file_aio_read+0x28a/0x610
> [<ffffffff803d78c4>] xfs_read+0x124/0x270
> [<ffffffff803d4416>] __xfs_file_read+0x46/0x50
> [<ffffffff803d4451>] xfs_file_aio_read+0x11/0x20
> [<ffffffff802bc1b1>] do_sync_read+0xf1/0x130
> [<ffffffff802bca74>] vfs_read+0xc4/0x160
> [<ffffffff802bcf10>] sys_read+0x50/0x90
> [<ffffffff8022639b>] system_call_after_swapgs+0x7b/0x80
> [<ffffffffffffffff>] 0xffffffffffffffff

xfs_ilock -> iprune_mutex

This is exactly the situation I mentioned in the previous email.
There is no potential deadlock here between the xfs_ilock and
iprune_mutex as the xfs_ilock that is held before and/or after
iprune_mutex is guaranteed to be different (the first is in use
so will never be found by shrink_icache_memory())...
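
In toy form (a made-up fake_inode type and prune_unused() helper,
nothing from the kernel), the reference count is what keeps the two
instances apart - the pruning side only ever locks unreferenced
entries, so it cannot reach the inode the caller already holds:

#include <stdio.h>
#include <pthread.h>

struct fake_inode {
    int refcount;           /* > 0 while someone holds a reference    */
    pthread_mutex_t ilock;  /* stand-in for the XFS i_iolock/mr_lock  */
};

#define NINODES 3
static struct fake_inode cache[NINODES];

/* Reclaim side: only unreferenced entries are ever locked and torn down. */
static void prune_unused(void)
{
    for (int i = 0; i < NINODES; i++) {
        if (cache[i].refcount > 0)
            continue;       /* skips the inode the caller holds */
        pthread_mutex_lock(&cache[i].ilock);
        printf("reclaimed fake inode %d\n", i);
        pthread_mutex_unlock(&cache[i].ilock);
    }
}

int main(void)
{
    for (int i = 0; i < NINODES; i++)
        pthread_mutex_init(&cache[i].ilock, NULL);

    /* "Reader" holds inode 0 referenced and locked, then falls into reclaim. */
    cache[0].refcount = 1;
    pthread_mutex_lock(&cache[0].ilock);
    prune_unused();         /* touches inodes 1 and 2 only */
    pthread_mutex_unlock(&cache[0].ilock);
    return 0;
}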

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-06-23 00:24:27

by Mel Gorman

Subject: Re: [2.6.26-rc7] shrink_icache from pagefault locking (nee: nfsd hangs for a few sec)...

On (23/06/08 08:19), Dave Chinner didst pronounce:
> [added [email protected] to cc]
>
> On Sun, Jun 22, 2008 at 10:58:56AM +0100, Daniel J Blueman wrote:
> > I'm seeing a similar issue [2] to what was recently reported [1] by
> > Alexander, but with another workload involving XFS and memory
> > pressure.
> >
> > SLUB allocator is in use and config is at http://quora.org/config-client-debug .
> >
> > Let me know if you'd like more details/vmlinux objdump etc.
> >
> > Thanks,
> > Daniel
> >
> > --- [1]
> >
> > http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/e673c9173d45a735/db9213ef39e4e11c
> >
> > --- [2]
> >
> > =======================================================
> > [ INFO: possible circular locking dependency detected ]
> > 2.6.26-rc7-210c #2
> > -------------------------------------------------------
> > AutopanoPro/4470 is trying to acquire lock:
> > (iprune_mutex){--..}, at: [<ffffffff802d94fd>] shrink_icache_memory+0x7d/0x290
> >
> > but task is already holding lock:
> > (&mm->mmap_sem){----}, at: [<ffffffff805e3e15>] do_page_fault+0x255/0x890
> >
> > which lock already depends on the new lock.
> >
> >
> > the existing dependency chain (in reverse order) is:
> >
> > -> #2 (&mm->mmap_sem){----}:
> > [<ffffffff80278f4d>] __lock_acquire+0xbdd/0x1020
> > [<ffffffff802793f5>] lock_acquire+0x65/0x90
> > [<ffffffff805df5ab>] down_read+0x3b/0x70
> > [<ffffffff805e3e3c>] do_page_fault+0x27c/0x890
> > [<ffffffff805e16cd>] error_exit+0x0/0xa9
> > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > -> #1 (&(&ip->i_iolock)->mr_lock){----}:
> > [<ffffffff80278f4d>] __lock_acquire+0xbdd/0x1020
> > [<ffffffff802793f5>] lock_acquire+0x65/0x90
> > [<ffffffff8026d746>] down_write_nested+0x46/0x80
> > [<ffffffff8039df29>] xfs_ilock+0x99/0xa0
> > [<ffffffff8039e0cf>] xfs_ireclaim+0x3f/0x90
> > [<ffffffff803ba889>] xfs_finish_reclaim+0x59/0x1a0
> > [<ffffffff803bc199>] xfs_reclaim+0x109/0x110
> > [<ffffffff803c9541>] xfs_fs_clear_inode+0xe1/0x110
> > [<ffffffff802d906d>] clear_inode+0x7d/0x110
> > [<ffffffff802d93aa>] dispose_list+0x2a/0x100
> > [<ffffffff802d96af>] shrink_icache_memory+0x22f/0x290
> > [<ffffffff8029d868>] shrink_slab+0x168/0x1d0
> > [<ffffffff8029e0b6>] kswapd+0x3b6/0x560
> > [<ffffffff8026921d>] kthread+0x4d/0x80
> > [<ffffffff80227428>] child_rip+0xa/0x12
> > [<ffffffffffffffff>] 0xffffffffffffffff
>
> You may as well ignore anything involving this path in XFS until
> lockdep gets fixed. The kswapd reclaim path is inverted over the
> synchronous reclaim path that is xfs_ilock -> run out of memory ->
> prune_icache and then potentially another -> xfs_ilock.
>

In that case, have you any theory as to why this circular dependency is
being reported now but wasn't before 2.6.26-rc1? I'm beginning to wonder
if the bisection fingering the zonelist modification is just a
co-incidence.

Also, do you think the stalls were happening before but just not being noticed?

> In this case, XFS can *never* deadlock because the second xfs_ilock
> is on a different, unreferenced, unlocked inode, but without turning
> off lockdep there is nothing in XFS that can be done to prevent
> this warning.
>
> There is a similar bug in the VM w.r.t. the mmap_sem in that the
> mmap_sem is held across a call to put_filp() which can result in
> inversions between the xfs_ilock and mmap_sem.
>
> Both of these cases cannot be solved by changing XFS - lockdep
> needs to be made aware of paths that can invert normal locking
> order (like prune_icache) so it doesn't give false positives
> like this.
>
> > -> #0 (iprune_mutex){--..}:
> > [<ffffffff80278db7>] __lock_acquire+0xa47/0x1020
> > [<ffffffff802793f5>] lock_acquire+0x65/0x90
> > [<ffffffff805dedd5>] mutex_lock_nested+0xb5/0x300
> > [<ffffffff802d94fd>] shrink_icache_memory+0x7d/0x290
> > [<ffffffff8029d868>] shrink_slab+0x168/0x1d0
> > [<ffffffff8029db38>] try_to_free_pages+0x268/0x3a0
> > [<ffffffff802979d6>] __alloc_pages_internal+0x206/0x4b0
> > [<ffffffff80297c89>] __alloc_pages_nodemask+0x9/0x10
> > [<ffffffff802b2bc2>] alloc_page_vma+0x72/0x1b0
> > [<ffffffff802a3642>] handle_mm_fault+0x462/0x7b0
> > [<ffffffff805e3ecc>] do_page_fault+0x30c/0x890
> > [<ffffffff805e16cd>] error_exit+0x0/0xa9
> > [<ffffffffffffffff>] 0xffffffffffffffff
>
> This case is different in that it is complaining about mmap_sem vs
> iprune_mutex, so I think that we can pretty much ignore the XFS side
> of things here - the problem is higher level code....
>
> > [<ffffffff8029db38>] try_to_free_pages+0x268/0x3a0
> > [<ffffffff8029c240>] ? isolate_pages_global+0x0/0x40
> > [<ffffffff802979d6>] __alloc_pages_internal+0x206/0x4b0
> > [<ffffffff80297c89>] __alloc_pages_nodemask+0x9/0x10
> > [<ffffffff802b2bc2>] alloc_page_vma+0x72/0x1b0
> > [<ffffffff802a3642>] handle_mm_fault+0x462/0x7b0
>
> FWIW, should page allocation in a page fault be allowed to recurse
> into the filesystem? If I follow the spaghetti of inline and
> compiler inlined functions correctly, this is a GFP_HIGHUSER_MOVABLE
> allocation, right? Should we be allowing shrink_icache_memory()
> to be called at all in the page fault path?
>

Well, the page fault path is able to go to sleep and can enter direct
reclaim under low memory situations. Right now, I'm failing to see why a
page fault should not be allowed to reclaim pages in use by a
filesystem. It was allowed before so the question still is why the
circular lock warning appears now but didn't before.
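
For what it's worth, the gate that decides whether reclaim may recurse
into filesystem code on such an allocation is the __GFP_FS bit, and
GFP_HIGHUSER_MOVABLE carries it. A rough userspace sketch of that
decision - made-up flag values and a simplified shrinker signature,
not the real gfp.h or shrinker API:

#include <stdio.h>

/* Made-up flag values; the real masks live in include/linux/gfp.h. */
#define FAKE_GFP_IO 0x1     /* allocation may start I/O               */
#define FAKE_GFP_FS 0x2     /* allocation may recurse into fs code    */

/* Rough analogue of an inode-cache shrinker deciding whether to run. */
static int icache_shrinker(int nr_to_scan, unsigned int gfp_mask)
{
    if (nr_to_scan && !(gfp_mask & FAKE_GFP_FS))
        return -1;          /* caller may hold fs locks: refuse */
    printf("pruning up to %d unused inodes\n", nr_to_scan);
    return 0;
}

int main(void)
{
    /* Fault-path allocation (GFP_HIGHUSER_MOVABLE-like): FS recursion allowed. */
    icache_shrinker(128, FAKE_GFP_IO | FAKE_GFP_FS);

    /* An allocation made from inside fs code would clear the FS bit instead. */
    icache_shrinker(128, FAKE_GFP_IO);
    return 0;
}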

> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-06-23 00:54:19

by Dave Chinner

Subject: Re: [2.6.26-rc7] shrink_icache from pagefault locking (nee: nfsd hangs for a few sec)...

On Mon, Jun 23, 2008 at 01:24:15AM +0100, Mel Gorman wrote:
> On (23/06/08 08:19), Dave Chinner didst pronounce:
> > [added [email protected] to cc]
> >
> > On Sun, Jun 22, 2008 at 10:58:56AM +0100, Daniel J Blueman wrote:
> > > I'm seeing a similar issue [2] to what was recently reported [1] by
> > > Alexander, but with another workload involving XFS and memory
> > > pressure.

[....]

> > You may as well ignore anything involving this path in XFS until
> > lockdep gets fixed. The kswapd reclaim path is inverted over the
> > synchronous reclaim path that is xfs_ilock -> run out of memory ->
> > prune_icache and then potentially another -> xfs_ilock.
> >
>
> In that case, have you any theory as to why this circular dependency is
> being reported now but wasn't before 2.6.26-rc1? I'm beginning to wonder
> if the bisection fingering the zonelist modification is just a
> co-incidence.

Probably co-incidence. Perhaps it's simply changed the way reclaim
is behaving and we are more likely to be trimming slab caches
instead of getting free pages from the page lists?

> Also, do you think the stalls were happening before but just not
> being noticed?

Entirely possible, I think, but I know of no evidence one way or
another.

[....]

> > FWIW, should page allocation in a page fault be allowed to recurse
> > into the filesystem? If I follow the spaghetti of inline and
> > compiler inlined functions correctly, this is a GFP_HIGHUSER_MOVABLE
> > allocation, right? Should we be allowing shrink_icache_memory()
> > to be called at all in the page fault path?
>
> Well, the page fault path is able to go to sleep and can enter direct
> reclaim under low memory situations. Right now, I'm failing to see why a
> page fault should not be allowed to reclaim pages in use by a
> filesystem. It was allowed before so the question still is why the
> circular lock warning appears now but didn't before.

Yeah, it's the fact that this is the first time that this lockdep
warning has come up that prompted me to ask the question. I know
that we are not allowed to lock an inode in the fault path as that
can lead to deadlocks in the read and write paths, so what I was
really wondering is if we can deadlock in a more convoluted manner
by taking locks on *other inodes* in the page fault path....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-06-23 07:22:41

by Christoph Hellwig

Subject: Re: [2.6.26-rc7] shrink_icache from pagefault locking (nee: nfsd hangs for a few sec)...

On Mon, Jun 23, 2008 at 01:24:15AM +0100, Mel Gorman wrote:
> In that case, have you any theory as to why this circular dependency is
> being reported now but wasn't before 2.6.26-rc1? I'm beginning to wonder
> if the bisection fingering the zonelist modification is just a
> co-incidence.

I've seen these traces since lockdep was added when running xfsqa.

2008-06-23 18:38:54

by Mel Gorman

Subject: Re: [2.6.26-rc7] shrink_icache from pagefault locking (nee: nfsd hangs for a few sec)...

On (23/06/08 03:22), Christoph Hellwig didst pronounce:
> On Mon, Jun 23, 2008 at 01:24:15AM +0100, Mel Gorman wrote:
> > In that case, have you any theory as to why this circular dependency is
> > being reported now but wasn't before 2.6.26-rc1? I'm beginning to wonder
> > if the bisection fingering the zonelist modification is just a
> > co-incidence.
>
> I've seen these traces since lockdep was added when running xfsqa.
>

Oh right, so this isn't even 2.6.26-rc1 as such. It's an older problem that
seems to be happening in more cases now, right? At this point, I believe the
bisection fingering the zonelist modification was a co-incidence as reclaim
behaviour at least is equivalent, although catching the memory leak early
was a lucky positive outcome.

It's still not clear why the circular warning is happening more regularly now
but it's "something else". Considering the number of changes made to NFS,
XFS, reclaim and other areas since, I'm not sure how to go about finding
the real underlying problem or if it can be dealt with in a trivial manner.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab