This message was originally posted to the XFS mailing list, but received no responses. Thus, I am sending it to LKML on the advice of Martin.
Using the attached program, we are able to reproduce this bug reliably.
$ make vmtest
$ ./vmtest /xfs/hugefile.dat $(( 16 * 1024 * 1024 * 1024 )) # vmtest <path_to_file> <size_in_bytes>
/xfs/hugefile.dat: mapped 17179869184 bytes in 33822066943 ticks
749660: avg 13339 max 234667 ticks
371945: avg 26885 max 281616 ticks
---
At this point, we see the following on the console:
[593492.694806] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[593506.724367] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[593524.837717] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[593556.742386] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
This is the same message presented in
http://oss.sgi.com/bugzilla/show_bug.cgi?id=410
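For reference, the mode value in these messages is the gfp mask that was passed to the allocator. The following is a rough user-space sketch of decoding it. The bit values are assumptions recalled from the 2.6.38-era include/linux/gfp.h and should be checked against the tree under test; gfp_decode() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Assumed bit values (2.6.38-era gfp.h, from memory; verify before relying on them). */
struct gfp_bit {
    unsigned int bit;
    const char *name;
};

static const struct gfp_bit gfp_bits[] = {
    { 0x010u, "__GFP_WAIT" },
    { 0x020u, "__GFP_HIGH" },
    { 0x040u, "__GFP_IO" },
    { 0x080u, "__GFP_FS" },
    { 0x100u, "__GFP_COLD" },
    { 0x200u, "__GFP_NOWARN" },
};

/* Hypothetical helper: write the names of the set bits into out, separated by '|'. */
static void gfp_decode(unsigned int mask, char *out, size_t len)
{
    size_t used = 0;

    assert(out != NULL && len > 0);
    out[0] = '\0';
    for (size_t i = 0; i < sizeof(gfp_bits) / sizeof(gfp_bits[0]); i++) {
        if (!(mask & gfp_bits[i].bit))
            continue;
        int n = snprintf(out + used, len - used, "%s%s",
                         used ? "|" : "", gfp_bits[i].name);
        if (n < 0 || (size_t)n >= len - used)
            return; /* output truncated, stop here */
        used += (size_t)n;
    }
}
```

Under those assumed values, 0x250 decodes to __GFP_WAIT|__GFP_IO|__GFP_NOWARN, that is, an allocation that may sleep and start I/O but has __GFP_FS cleared.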
We started testing with 2.6.38-rc7 and have seen this bug through to the .0 release. This does not appear to be present in 2.6.33, but we have not done testing in between. We have tested with ext4 and do not encounter this bug.
CONFIG_XFS_FS=y
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_XFS_RT=y
# CONFIG_XFS_DEBUG is not set
# CONFIG_VXFS_FS is not set
Here is the stack from the process:
[<ffffffff81357553>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffff812ddf1e>] xfs_ilock+0x7e/0x110
[<ffffffff8130132f>] __xfs_get_blocks+0x8f/0x4e0
[<ffffffff813017b1>] xfs_get_blocks+0x11/0x20
[<ffffffff8114ba3e>] __block_write_begin+0x1ee/0x5b0
[<ffffffff8114be9d>] block_page_mkwrite+0x9d/0xf0
[<ffffffff81307e05>] xfs_vm_page_mkwrite+0x15/0x20
[<ffffffff810f2ddb>] do_wp_page+0x54b/0x820
[<ffffffff810f347c>] handle_pte_fault+0x3cc/0x820
[<ffffffff810f5145>] handle_mm_fault+0x175/0x2f0
[<ffffffff8102e399>] do_page_fault+0x159/0x470
[<ffffffff816cf6cf>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
# uname -a
Linux testhost 2.6.38 #2 SMP PREEMPT Fri Mar 18 15:00:59 GMT 2011 x86_64 GNU/Linux
Please let me know if additional information is required.
Thanks!
Sean
I believe this patch fixes the behavior:
diff --git a/mm/memory.c b/mm/memory.c
index e48945a..740d5ab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3461,7 +3461,9 @@ int make_pages_present(unsigned long addr, unsigned long end)
 	 * to break COW, except for shared mappings because these don't COW
 	 * and we would not want to dirty them for nothing.
 	 */
-	write = (vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE;
+	write = (vma->vm_flags & VM_WRITE) != 0;
+	if (write && ((vma->vm_flags & VM_SHARED) !=0) && (vma->vm_file == NULL))
+		write = 0;
 	BUG_ON(addr >= end);
 	BUG_ON(end > vma->vm_end);
 	len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE;
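The effect of the two expressions in this hunk can be compared in user space. This is only a sketch: the VM_WRITE and VM_SHARED values are assumed from include/linux/mm.h of that era, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed flag values (include/linux/mm.h, from memory). */
#define VM_WRITE  0x00000002UL
#define VM_SHARED 0x00000008UL

/* Expression removed by the hunk: write-fault only private writable vmas. */
static bool write_fault_before_patch(unsigned long vm_flags)
{
    return (vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE;
}

/*
 * Expression added by the hunk: write-fault every writable vma, except
 * shared vmas that have no backing file (has_file stands in for
 * vma->vm_file != NULL).
 */
static bool write_fault_after_patch(unsigned long vm_flags, bool has_file)
{
    bool write = (vm_flags & VM_WRITE) != 0;

    if (write && (vm_flags & VM_SHARED) != 0 && !has_file)
        write = false;
    return write;
}
```

For a writable, shared, file-backed mapping the first function returns false and the second returns true, so the patch restores write faults during population for that case.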
This was traced to the following commit:
5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272 is the first bad commit
commit 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272
Author: Michel Lespinasse <[email protected]>
Date: Thu Jan 13 15:46:09 2011 -0800
mlock: avoid dirtying pages and triggering writeback
When faulting in pages for mlock(), we want to break COW for anonymous or
file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is no
need to write-fault into VM_SHARED vmas since shared file pages can be
mlocked first and dirtied later, when/if they actually get written to.
Skipping the write fault is desirable, as we don't want to unnecessarily
cause these pages to be dirtied and queued for writeback.
Signed-off-by: Michel Lespinasse <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Kosaki Motohiro <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Theodore Tso <[email protected]>
Cc: Michael Rubin <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
:040000 040000 604eede2f45b7e5276ce9725b715ed15a868861d 3c175eadf4cf33d4f78d4d455c9a04f3df2c199e M mm
-----Original Message-----
From: Sean Noonan
Sent: Monday, March 21, 2011 12:20
To: '[email protected]'
Cc: Trammell Hudson; Martin Bligh; Stephen Degler; Christos Zoulas
Subject: XFS memory allocation deadlock in 2.6.38
Michel,
can you take a look at this bug report? It looks like a regression
in your mlock handling changes.
On Wed, Mar 23, 2011 at 03:39:05PM -0400, Sean Noonan wrote:
> I believe this patch fixes the behavior:
> [...]
On Thu, Mar 24, 2011 at 10:43 AM, Christoph Hellwig <[email protected]> wrote:
> Michel,
>
> can you take a look at this bug report?  It looks like a regression
> in your mlock handling changes.
I had a quick look and at this point I can describe how the patch will
affect behavior of this test, but not why this causes a deadlock with
xfs.
The test creates a writable, shared mapping of a file that does not
have data blocks allocated on disk, and also uses the MAP_POPULATE
flag.
Before 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272, make_pages_present
during the mmap would cause data blocks to get allocated on disk with
an xfs_vm_page_mkwrite call, and then the file pages would get mapped
as writable ptes.
After 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272, make_pages_present
does NOT cause data blocks to get allocated on disk. Instead,
xfs_vm_readpages is called, which (I suppose) does not allocate the
data blocks and returns zero filled pages instead, which get mapped as
readonly ptes. Later, the test tries writing into the mmap'ed block,
causing minor page faults, xfs_vm_page_mkwrite calls and data block
allocations to occur.
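The access pattern described above can be sketched as follows. This is not the vmtest program attached to the original report (which is not reproduced in this thread); touch_sparse_mapping() is a hypothetical helper that only assumes the standard open/ftruncate/mmap interfaces.

```c
#define _GNU_SOURCE /* for MAP_POPULATE */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a sparse file of the given size, map it
 * shared and writable (optionally with MAP_POPULATE), then dirty one byte
 * in every page. Returns the number of pages touched, or -1 on error.
 */
static long touch_sparse_mapping(const char *path, size_t size, int populate)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* ftruncate() extends the file without allocating data blocks. */
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return -1;
    }

    int flags = MAP_SHARED | (populate ? MAP_POPULATE : 0);
    unsigned char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;

    long page = sysconf(_SC_PAGESIZE);
    long touched = 0;
    for (size_t off = 0; off < size; off += (size_t)page) {
        p[off] = 1; /* write fault: the filesystem must now allocate blocks */
        touched++;
    }

    munmap(p, size);
    return touched;
}
```

Calling it with populate set to 1 and then 0 exercises the two variants discussed in this thread. The sizes that trigger the problem are far larger than anything a quick check would use.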
Regarding the deadlock: I am curious to see if it could be made to
happen before 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272. Could you test
what happens if you remove the MAP_POPULATE flag from your mmap call,
and instead read all pages from userspace right after the mmap ? I
expect you would then be able to trigger the deadlock before
5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272.
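A minimal sketch of the suggested variant: read one byte from every page right after the mmap, so each page is first faulted in through the read path, and only then write to it. Both function names are hypothetical, and the mapping is assumed to have been created without MAP_POPULATE.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Read one byte per page. The volatile qualifier keeps the compiler from
 * dropping the loads. Returns the number of pages read; the byte sum is
 * stored in *sum if sum is not NULL.
 */
static size_t read_fault_pages(const volatile unsigned char *map, size_t size,
                               size_t page_size, unsigned long *sum)
{
    size_t pages = 0;
    unsigned long s = 0;

    if (page_size == 0)
        return 0;
    for (size_t off = 0; off < size; off += page_size) {
        s += map[off];
        pages++;
    }
    if (sum)
        *sum = s;
    return pages;
}

/* Write one byte per page, which takes a write fault on read-only ptes. */
static size_t write_fault_pages(volatile unsigned char *map, size_t size,
                                size_t page_size)
{
    size_t pages = 0;

    if (page_size == 0)
        return 0;
    for (size_t off = 0; off < size; off += page_size) {
        map[off] = 1;
        pages++;
    }
    return pages;
}
```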
This leaves the issue of the change of behavior for MAP_POPULATE on
ftruncated file holes. I'm not sure what to say there though, because
MAP_POPULATE is documented to cause file read-ahead (and it still does
after 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272), but that doesn't say
anything about block allocation.
Hope this helps,
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
I created a Bugzilla entry at
https://bugzilla.kernel.org/show_bug.cgi?id=31982
for your bug report, please add your address to the CC list in there, thanks!
On Monday, 21 March 2011 at 17:19:44 Sean Noonan wrote:
> This message was originally posted to the XFS mailing list, but received no
> responses. Thus, I am sending it to LKML on the advice of Martin.
> [...]
--
Maciej Rutecki
http://www.maciek.unixy.pl
> Regarding the deadlock: I am curious to see if it could be made to
> happen before 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272. Could you test
> what happens if you remove the MAP_POPULATE flag from your mmap call,
> and instead read all pages from userspace right after the mmap ? I
> expect you would then be able to trigger the deadlock before
> 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272.
I still see the deadlock without MAP_POPULATE
Sean
On Mon, Mar 28, 2011 at 7:58 AM, Sean Noonan <[email protected]> wrote:
>> Regarding the deadlock: I am curious to see if it could be made to
>> happen before 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272. Could you test
>> what happens if you remove the MAP_POPULATE flag from your mmap call,
>> and instead read all pages from userspace right after the mmap ? I
>> expect you would then be able to trigger the deadlock before
>> 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272.
>
> I still see the deadlock without MAP_POPULATE
Could you test if you see the deadlock before
5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272 without MAP_POPULATE ?
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
> Could you test if you see the deadlock before
> 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272 without MAP_POPULATE ?
Built and tested 72ddc8f72270758951ccefb7d190f364d20215ab.
Confirmed that the original bug does not present in this version.
Confirmed that removing MAP_POPULATE does cause the deadlock to occur.
Here is the stack of the test:
# cat /proc/3846/stack
[<ffffffff812e8a64>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff81271c1d>] xfs_ilock+0x9d/0x110
[<ffffffff81271cae>] xfs_ilock_map_shared+0x1e/0x50
[<ffffffff81294985>] __xfs_get_blocks+0xc5/0x4e0
[<ffffffff81294dcc>] xfs_get_blocks+0xc/0x10
[<ffffffff811322c2>] do_mpage_readpage+0x462/0x660
[<ffffffff8113250a>] mpage_readpage+0x4a/0x60
[<ffffffff81295433>] xfs_vm_readpage+0x13/0x20
[<ffffffff810bb850>] filemap_fault+0x2d0/0x4e0
[<ffffffff810d8680>] __do_fault+0x50/0x510
[<ffffffff810da542>] handle_mm_fault+0x1a2/0xe60
[<ffffffff8102a466>] do_page_fault+0x146/0x440
[<ffffffff8164e6cf>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
xfssyncd is stuck in D state.
# cat /proc/2484/stack
[<ffffffff8106ee1c>] down+0x3c/0x50
[<ffffffff81297802>] xfs_buf_lock+0x72/0x170
[<ffffffff8128762d>] xfs_getsb+0x1d/0x50
[<ffffffff8128e6af>] xfs_trans_getsb+0x5f/0x150
[<ffffffff8128821e>] xfs_mod_sb+0x4e/0xe0
[<ffffffff8126e4ea>] xfs_fs_log_dummy+0x5a/0xb0
[<ffffffff812a2a13>] xfs_sync_worker+0x83/0x90
[<ffffffff812a28e2>] xfssyncd+0x172/0x220
[<ffffffff81069576>] kthread+0x96/0xa0
[<ffffffff81003354>] kernel_thread_helper+0x4/0x10
[<ffffffffffffffff>] 0xffffffffffffffff
Sean
On Mon, Mar 28, 2011 at 2:34 PM, Sean Noonan <[email protected]> wrote:
>> Could you test if you see the deadlock before
>> 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272 without MAP_POPULATE ?
>
> Built and tested 72ddc8f72270758951ccefb7d190f364d20215ab.
> Confirmed that the original bug does not present in this version.
> Confirmed that removing MAP_POPULATE does cause the deadlock to occur.
It seems that the test (without MAP_POPULATE) reveals that the root
cause is an xfs bug, which had been hidden up to now by MAP_POPULATE
preallocating disk blocks (but could always be triggered by the same
test without the MAP_POPULATE flag). I'm not sure how to go about
debugging the xfs deadlock; it would probably be best if an xfs person
could have a look ?
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
On Mon, Mar 28, 2011 at 05:34:09PM -0400, Sean Noonan wrote:
> > Could you test if you see the deadlock before
> > 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272 without MAP_POPULATE ?
>
> Built and tested 72ddc8f72270758951ccefb7d190f364d20215ab.
> Confirmed that the original bug does not present in this version.
> Confirmed that removing MAP_POPULATE does cause the deadlock to occur.
>
> Here is the stack of the test:
> # cat /proc/3846/stack
> [<ffffffff812e8a64>] call_rwsem_down_read_failed+0x14/0x30
> [<ffffffff81271c1d>] xfs_ilock+0x9d/0x110
> [<ffffffff81271cae>] xfs_ilock_map_shared+0x1e/0x50
> [<ffffffff81294985>] __xfs_get_blocks+0xc5/0x4e0
> [<ffffffff81294dcc>] xfs_get_blocks+0xc/0x10
> [<ffffffff811322c2>] do_mpage_readpage+0x462/0x660
> [<ffffffff8113250a>] mpage_readpage+0x4a/0x60
> [<ffffffff81295433>] xfs_vm_readpage+0x13/0x20
> [<ffffffff810bb850>] filemap_fault+0x2d0/0x4e0
> [<ffffffff810d8680>] __do_fault+0x50/0x510
> [<ffffffff810da542>] handle_mm_fault+0x1a2/0xe60
> [<ffffffff8102a466>] do_page_fault+0x146/0x440
> [<ffffffff8164e6cf>] page_fault+0x1f/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
Something else is holding the inode locked here.
> xfssyncd is stuck in D state.
> # cat /proc/2484/stack
> [<ffffffff8106ee1c>] down+0x3c/0x50
> [<ffffffff81297802>] xfs_buf_lock+0x72/0x170
> [<ffffffff8128762d>] xfs_getsb+0x1d/0x50
> [<ffffffff8128e6af>] xfs_trans_getsb+0x5f/0x150
> [<ffffffff8128821e>] xfs_mod_sb+0x4e/0xe0
> [<ffffffff8126e4ea>] xfs_fs_log_dummy+0x5a/0xb0
> [<ffffffff812a2a13>] xfs_sync_worker+0x83/0x90
> [<ffffffff812a28e2>] xfssyncd+0x172/0x220
> [<ffffffff81069576>] kthread+0x96/0xa0
> [<ffffffff81003354>] kernel_thread_helper+0x4/0x10
> [<ffffffffffffffff>] 0xffffffffffffffff
And this is indicating that something else is holding the superblock
locked here. IOWs, whatever thread is having trouble with memory
allocation is causing these threads to block and so they can be
ignored. What's the stack trace of the thread that is throwing the
"I can't allocate a page" errors?
As it is, the question I'd really like answered is how a machine with
48GB RAM can possibly be short of memory when running mmap() on a
16GB file. The error that XFS is throwing indicates that the
machine cannot allocate a single page of memory, so where has all
your memory gone, and why hasn't the OOM killer been let off the
leash? What is consuming the other 32GB of RAM or preventing it
from being allocated?
Also, I was unable to reproduce this at all on a machine with only
2GB of RAM, regardless of the kernel version and/or MAP_POPULATE, so
I'm left to wonder what is special about your test system...
Perhaps the output of xfs_bmap -vvp <file> after a successful vs
deadlocked run would be instructive....
Cheers,
Dave.
--
Dave Chinner
[email protected]
> As it is, the question I'd really like answered is how a machine with
> 48GB RAM can possibly be short of memory when running mmap() on a
> 16GB file. The error that XFS is throwing indicates that the
> machine cannot allocate a single page of memory, so where has all
> your memory gone, and why hasn't the OOM killer been let off the
> leash? What is consuming the other 32GB of RAM or preventing it
> from being allocated?
Here's meminfo while a test was deadlocking. As you can see, we certainly aren't running out of RAM.
# cat /proc/meminfo
MemTotal: 49551548 kB
MemFree: 44139876 kB
Buffers: 5324 kB
Cached: 4970552 kB
SwapCached: 0 kB
Active: 52772 kB
Inactive: 4960624 kB
Active(anon): 37864 kB
Inactive(anon): 0 kB
Active(file): 14908 kB
Inactive(file): 4960624 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 4914084 kB
Writeback: 0 kB
AnonPages: 37636 kB
Mapped: 4925460 kB
Shmem: 280 kB
Slab: 223212 kB
SReclaimable: 176280 kB
SUnreclaim: 46932 kB
KernelStack: 3968 kB
PageTables: 35228 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 47073968 kB
Committed_AS: 86556 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 380892 kB
VmallocChunk: 34331773836 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 2048 kB
DirectMap2M: 2086912 kB
DirectMap1G: 48234496 kB
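To compare such dumps between a successful and a deadlocked run, individual fields can be pulled out programmatically. A small sketch, assuming the usual "Key: value kB" line layout; meminfo_kb() is a hypothetical helper and works on text that has already been read from /proc/meminfo.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return the value (in kB) of the line that starts
 * with "key:", or -1 if there is no such line. Matching at the start of a
 * line keeps "Cached" from matching "SwapCached".
 */
static long meminfo_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':')
            return strtol(line + klen + 1, NULL, 10);
        line = strchr(line, '\n');
        if (line)
            line++; /* step past the newline to the next line */
    }
    return -1;
}
```

Applied to the dump above, MemFree is 44139876 kB (roughly 42 GiB) while Dirty is 4914084 kB (roughly 4.7 GiB).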
> Perhaps the output of xfs_bmap -vvp <file> after a successful vs
> deadlocked run would be instructive....
I will try to get this tomorrow.
Sean
>> Could you test if you see the deadlock before
>> 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272 without MAP_POPULATE ?
> Built and tested 72ddc8f72270758951ccefb7d190f364d20215ab.
> Confirmed that the original bug does not present in this version.
> Confirmed that removing MAP_POPULATE does cause the deadlock to occur.
git bisect leads to this:
bdfb04301fa5fdd95f219539a9a5b9663b1e5fc2 is the first bad commit
commit bdfb04301fa5fdd95f219539a9a5b9663b1e5fc2
Author: Christoph Hellwig <[email protected]>
Date: Wed Jan 20 21:55:30 2010 +0000
xfs: replace KM_LARGE with explicit vmalloc use
We use the KM_LARGE flag to make kmem_alloc and friends use vmalloc
if necessary. As we only need this for a few boot/mount time
allocations just switch to explicit vmalloc calls there.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Alex Elder <[email protected]>
:040000 040000 1eed68ced17d8794fa842396c01c3b9677c6e709 d462932a318f8c823fa2a73156e980a688968cb2 M fs
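To illustrate what "use vmalloc if necessary" meant before this commit, here is a user-space sketch of the selection logic. The two thresholds and the condition are taken from the revert patch posted later in this thread; pick_allocator() and the enum are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VMALLOCS  6
#define MAX_SLAB_SIZE 0x20000

enum alloc_path { ALLOC_KMALLOC, ALLOC_VMALLOC };

/*
 * Small requests go to the slab allocator. Large requests try vmalloc for
 * the first few attempts and fall back to the slab allocator once the
 * retry count exceeds MAX_VMALLOCS.
 */
static enum alloc_path pick_allocator(size_t size, int retries)
{
    if (size < MAX_SLAB_SIZE || retries > MAX_VMALLOCS)
        return ALLOC_KMALLOC;
    return ALLOC_VMALLOC;
}
```

After the bisected commit, kmem_alloc always takes the kmalloc path, so under this model any request of 128 KiB or more no longer has a vmalloc alternative.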
Can you check if the brute force patch below helps? If it does I
still need to refine it a bit, but it could be that we are doing
an allocation under an xfs lock that could recurse back into the
filesystem. We have a per-process flag to disable that for normal
kmalloc allocation, but we lost it for vmalloc in the commit you
bisected the regression to.
Index: xfs/fs/xfs/linux-2.6/kmem.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/kmem.h 2011-03-29 21:16:58.039224236 +0200
+++ xfs/fs/xfs/linux-2.6/kmem.h 2011-03-29 21:17:08.368223598 +0200
@@ -63,7 +63,7 @@ static inline void *kmem_zalloc_large(si
 {
 	void *ptr;
 
-	ptr = vmalloc(size);
+	ptr = __vmalloc(size, GFP_NOFS | __GFP_HIGHMEM, PAGE_KERNEL);
 	if (ptr)
 		memset(ptr, 0, size);
 	return ptr;
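The per-process flag mentioned above can be modelled in user space. This sketch follows the kmem_flags_convert() fragment visible in the revert patch later in this thread; the rest of the function body, the KM_SLEEP value and all gfp bit values are assumptions recalled from 2.6.38-era sources, and in_fs_transaction stands in for the per-process flag check.

```c
#include <assert.h>
#include <stdbool.h>

#define KM_SLEEP   0x0001u /* assumed; the other KM_ values appear in the thread */
#define KM_NOSLEEP 0x0002u
#define KM_NOFS    0x0004u
#define KM_MAYFAIL 0x0008u

/* Assumed gfp bit values (2.6.38-era gfp.h, from memory). */
#define GFP_WAIT_BIT   0x010u
#define GFP_HIGH_BIT   0x020u
#define GFP_IO_BIT     0x040u
#define GFP_FS_BIT     0x080u
#define GFP_NOWARN_BIT 0x200u

#define GFP_ATOMIC_MASK GFP_HIGH_BIT
#define GFP_KERNEL_MASK (GFP_WAIT_BIT | GFP_IO_BIT | GFP_FS_BIT)

/* Model of the KM_ to gfp conversion used by kmem_alloc(). */
static unsigned int kmem_flags_convert_model(unsigned int flags,
                                             bool in_fs_transaction)
{
    unsigned int lflags;

    /* The kernel uses BUG_ON() for this check. */
    assert(!(flags & ~(KM_SLEEP | KM_NOSLEEP | KM_NOFS | KM_MAYFAIL)));

    if (flags & KM_NOSLEEP) {
        lflags = GFP_ATOMIC_MASK | GFP_NOWARN_BIT;
    } else {
        lflags = GFP_KERNEL_MASK | GFP_NOWARN_BIT;
        /* Inside a transaction, or on request, forbid recursion into the fs. */
        if (in_fs_transaction || (flags & KM_NOFS))
            lflags &= ~GFP_FS_BIT;
    }
    return lflags;
}
```

Under these assumptions a sleeping allocation with filesystem recursion disabled yields 0x250, which is the mode printed in the console messages of the original report. A plain vmalloc() call does not go through this conversion.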
On Tue, Mar 29, 2011 at 03:24:34PM -0400, 'Christoph Hellwig' wrote:
> Can you check if the brute force patch below helps? If it does I
> still need to refine it a bit, but it could be that we are doing
> an allocation under an xfs lock that could recurse back into the
> filesystem. We have a per-process flag to disable that for normal
> kmalloc allocation, but we lost it for vmalloc in the commit you
> bisected the regression to.
>
>
> Index: xfs/fs/xfs/linux-2.6/kmem.h
> ===================================================================
> --- xfs.orig/fs/xfs/linux-2.6/kmem.h 2011-03-29 21:16:58.039224236 +0200
> +++ xfs/fs/xfs/linux-2.6/kmem.h 2011-03-29 21:17:08.368223598 +0200
> @@ -63,7 +63,7 @@ static inline void *kmem_zalloc_large(si
> {
> void *ptr;
>
> - ptr = vmalloc(size);
> + ptr = __vmalloc(size, GFP_NOFS | __GFP_HIGHMEM, PAGE_KERNEL);
> if (ptr)
> memset(ptr, 0, size);
> return ptr;
Note that vmalloc is currently broken in that it does a GFP_KERNEL
allocation if it has to allocate page table pages, even when invoked
with GFP_NOFS:
http://marc.info/?l=linux-mm&m=128942194520631&w=4
On Tue, Mar 29, 2011 at 09:39:07PM +0200, Johannes Weiner wrote:
> > - ptr = vmalloc(size);
> > + ptr = __vmalloc(size, GFP_NOFS | __GFP_HIGHMEM, PAGE_KERNEL);
> > if (ptr)
> > memset(ptr, 0, size);
> > return ptr;
>
> Note that vmalloc is currently broken in that it does a GFP_KERNEL
> allocation if it has to allocate page table pages, even when invoked
> with GFP_NOFS:
>
> http://marc.info/?l=linux-mm&m=128942194520631&w=4
Oh great. In that case we had a chance to hit the deadlock even before
the offending commit, just a much smaller one.
> Can you check if the brute force patch below helps?
No such luck.
> Can you check if the brute force patch below helps?
Not sure if this helps at all, but here are the stacks from all three processes involved. This is without MAP_POPULATE and with the patch you just sent.
# ps aux | grep 'D[+]*[[:space:]]'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2314 0.2 0.0 0 0 ? D 19:44 0:00 [flush-8:0]
root 2402 0.0 0.0 0 0 ? D 19:44 0:00 [xfssyncd/sda9]
root 3861 2.6 9.9 16785280 4912848 pts/0 D+ 19:45 0:07 ./vmtest /xfs/hugefile.dat 17179869184
# for p in 2314 2402 3861; do echo $p; cat /proc/$p/stack; done
2314
[<ffffffff810d634a>] congestion_wait+0x7a/0x130
[<ffffffff8129721c>] kmem_alloc+0x6c/0xf0
[<ffffffff8127c07e>] xfs_inode_item_format+0x36e/0x3b0
[<ffffffff8128401f>] xfs_log_commit_cil+0x4f/0x3b0
[<ffffffff8128ff31>] _xfs_trans_commit+0x1f1/0x2b0
[<ffffffff8127c716>] xfs_iomap_write_allocate+0x1a6/0x340
[<ffffffff81298883>] xfs_map_blocks+0x193/0x2c0
[<ffffffff812992fa>] xfs_vm_writepage+0x1ca/0x520
[<ffffffff810c4bd2>] __writepage+0x12/0x40
[<ffffffff810c53dd>] write_cache_pages+0x1dd/0x4f0
[<ffffffff810c573c>] generic_writepages+0x4c/0x70
[<ffffffff812986b8>] xfs_vm_writepages+0x58/0x70
[<ffffffff810c577c>] do_writepages+0x1c/0x40
[<ffffffff811247d1>] writeback_single_inode+0xf1/0x240
[<ffffffff81124edd>] writeback_sb_inodes+0xdd/0x1b0
[<ffffffff81125966>] writeback_inodes_wb+0x76/0x160
[<ffffffff81125d93>] wb_writeback+0x343/0x550
[<ffffffff81126126>] wb_do_writeback+0x186/0x2e0
[<ffffffff81126342>] bdi_writeback_thread+0xc2/0x310
[<ffffffff81067846>] kthread+0x96/0xa0
[<ffffffff8165a414>] kernel_thread_helper+0x4/0x10
[<ffffffffffffffff>] 0xffffffffffffffff
2402
[<ffffffff8106d0ec>] down+0x3c/0x50
[<ffffffff8129a7bd>] xfs_buf_lock+0x5d/0x170
[<ffffffff8128a87d>] xfs_getsb+0x1d/0x50
[<ffffffff81291bcf>] xfs_trans_getsb+0x5f/0x150
[<ffffffff8128b80e>] xfs_mod_sb+0x4e/0xe0
[<ffffffff81271dbf>] xfs_fs_log_dummy+0x4f/0x90
[<ffffffff812a61c1>] xfs_sync_worker+0x81/0x90
[<ffffffff812a6092>] xfssyncd+0x172/0x220
[<ffffffff81067846>] kthread+0x96/0xa0
[<ffffffff8165a414>] kernel_thread_helper+0x4/0x10
[<ffffffffffffffff>] 0xffffffffffffffff
3861
[<ffffffff812ec744>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff812754dd>] xfs_ilock+0x9d/0x110
[<ffffffff8127556e>] xfs_ilock_map_shared+0x1e/0x50
[<ffffffff81297c45>] __xfs_get_blocks+0xc5/0x4e0
[<ffffffff8129808c>] xfs_get_blocks+0xc/0x10
[<ffffffff81135ca2>] do_mpage_readpage+0x462/0x660
[<ffffffff81135eea>] mpage_readpage+0x4a/0x60
[<ffffffff812986e3>] xfs_vm_readpage+0x13/0x20
[<ffffffff810bd150>] filemap_fault+0x2d0/0x4e0
[<ffffffff810db0a0>] __do_fault+0x50/0x4f0
[<ffffffff810db85e>] handle_pte_fault+0x7e/0xc90
[<ffffffff810ddbf8>] handle_mm_fault+0x138/0x230
[<ffffffff8102b37c>] do_page_fault+0x12c/0x420
[<ffffffff81658fcf>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
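The flush thread (pid 2314) is sitting in congestion_wait called from kmem_alloc, which is the retry loop that prints the "possible memory allocation deadlock" message. A user-space model of that loop, based on the kmem_alloc() context lines in the revert patch later in this thread, is sketched below. The allocator hook, the stats structure and fail_n_times() are hypothetical additions so the loop can be exercised outside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define KM_NOSLEEP 0x0002u
#define KM_MAYFAIL 0x0008u

/* Hypothetical allocator hook standing in for kmalloc(). */
typedef void *(*alloc_fn)(size_t size, void *ctx);

struct retry_stats {
    unsigned int attempts; /* calls made to the allocator */
    unsigned int warnings; /* times the deadlock message would be printed */
};

static void *kmem_alloc_model(size_t size, unsigned int flags,
                              alloc_fn alloc, void *ctx,
                              struct retry_stats *st)
{
    unsigned int retries = 0;

    for (;;) {
        void *ptr = alloc(size, ctx);

        st->attempts++;
        /* Only callers that allow failure ever see NULL. */
        if (ptr || (flags & (KM_MAYFAIL | KM_NOSLEEP)))
            return ptr;
        /* Every 100th failed attempt the kernel logs the warning. */
        if (!(++retries % 100))
            st->warnings++;
        /* The kernel sleeps in congestion_wait() here before retrying. */
    }
}

/* Test allocator: fail while the counter in ctx is non-zero, then succeed. */
static void *fail_n_times(size_t size, void *ctx)
{
    unsigned int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return NULL;
    }
    return malloc(size);
}
```

If the underlying allocation can never succeed, a caller without KM_MAYFAIL or KM_NOSLEEP stays in this loop indefinitely while holding whatever locks it took earlier, which is consistent with the other two tasks being blocked.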
On Tue, Mar 29, 2011 at 03:46:21PM -0400, Sean Noonan wrote:
> > Can you check if the brute force patch below helps?
>
> No such luck.
Actually thinking about it - we never do the vmalloc under any fs lock,
so this can't be the reason. But nothing else in the patch springs to
mind either, so to narrow this down: does reverting the patch on
2.6.38 also fix it? The revert isn't quite trivial due to changes
since then, so here's the patch I came up with:
Index: xfs/fs/xfs/linux-2.6/kmem.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/kmem.c 2011-03-29 21:55:12.871726512 +0200
+++ xfs/fs/xfs/linux-2.6/kmem.c 2011-03-29 21:55:31.648723706 +0200
@@ -16,6 +16,7 @@
* Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/
#include <linux/mm.h>
+#include <linux/vmalloc.h>
#include <linux/highmem.h>
#include <linux/slab.h>
#include <linux/swap.h>
@@ -25,25 +26,8 @@
#include "kmem.h"
#include "xfs_message.h"
-/*
- * Greedy allocation. May fail and may return vmalloced memory.
- *
- * Must be freed using kmem_free_large.
- */
-void *
-kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
-{
- void *ptr;
- size_t kmsize = maxsize;
-
- while (!(ptr = kmem_zalloc_large(kmsize))) {
- if ((kmsize >>= 1) <= minsize)
- kmsize = minsize;
- }
- if (ptr)
- *size = kmsize;
- return ptr;
-}
+#define MAX_VMALLOCS 6
+#define MAX_SLAB_SIZE 0x20000
void *
kmem_alloc(size_t size, unsigned int __nocast flags)
@@ -52,8 +36,19 @@ kmem_alloc(size_t size, unsigned int __n
gfp_t lflags = kmem_flags_convert(flags);
void *ptr;
+#ifdef DEBUG
+ if (unlikely(!(flags & KM_LARGE) && (size > PAGE_SIZE))) {
+ printk(KERN_WARNING "Large %s attempt, size=%ld\n",
+ __func__, (long)size);
+ dump_stack();
+ }
+#endif
+
do {
- ptr = kmalloc(size, lflags);
+ if (size < MAX_SLAB_SIZE || retries > MAX_VMALLOCS)
+ ptr = kmalloc(size, lflags);
+ else
+ ptr = __vmalloc(size, lflags, PAGE_KERNEL);
if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
return ptr;
if (!(++retries % 100))
@@ -75,6 +70,27 @@ kmem_zalloc(size_t size, unsigned int __
return ptr;
}
+void *
+kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize,
+ unsigned int __nocast flags)
+{
+ void *ptr;
+ size_t kmsize = maxsize;
+ unsigned int kmflags = (flags & ~KM_SLEEP) | KM_NOSLEEP;
+
+ while (!(ptr = kmem_zalloc(kmsize, kmflags))) {
+ if ((kmsize <= minsize) && (flags & KM_NOSLEEP))
+ break;
+ if ((kmsize >>= 1) <= minsize) {
+ kmsize = minsize;
+ kmflags = flags;
+ }
+ }
+ if (ptr)
+ *size = kmsize;
+ return ptr;
+}
+
void
kmem_free(const void *ptr)
{
Index: xfs/fs/xfs/linux-2.6/kmem.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/kmem.h 2011-03-29 21:55:12.879725146 +0200
+++ xfs/fs/xfs/linux-2.6/kmem.h 2011-03-29 21:55:31.652725467 +0200
@@ -21,7 +21,6 @@
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/mm.h>
-#include <linux/vmalloc.h>
/*
* General memory allocation interfaces
@@ -31,6 +30,7 @@
#define KM_NOSLEEP 0x0002u
#define KM_NOFS 0x0004u
#define KM_MAYFAIL 0x0008u
+#define KM_LARGE 0x0010u
/*
* We use a special process flag to avoid recursive callbacks into
@@ -42,7 +42,7 @@ kmem_flags_convert(unsigned int __nocast
{
gfp_t lflags;
- BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL));
+ BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_LARGE));
if (flags & KM_NOSLEEP) {
lflags = GFP_ATOMIC | __GFP_NOWARN;
@@ -56,25 +56,10 @@ kmem_flags_convert(unsigned int __nocast
extern void *kmem_alloc(size_t, unsigned int __nocast);
extern void *kmem_zalloc(size_t, unsigned int __nocast);
+extern void *kmem_zalloc_greedy(size_t *, size_t, size_t, unsigned int __nocast);
extern void *kmem_realloc(const void *, size_t, size_t, unsigned int __nocast);
extern void kmem_free(const void *);
-static inline void *kmem_zalloc_large(size_t size)
-{
- void *ptr;
-
- ptr = vmalloc(size);
- if (ptr)
- memset(ptr, 0, size);
- return ptr;
-}
-static inline void kmem_free_large(void *ptr)
-{
- vfree(ptr);
-}
-
-extern void *kmem_zalloc_greedy(size_t *, size_t, size_t);
-
/*
* Zone interfaces
*/
Index: xfs/fs/xfs/quota/xfs_qm.c
===================================================================
--- xfs.orig/fs/xfs/quota/xfs_qm.c 2011-03-29 21:55:12.859726589 +0200
+++ xfs/fs/xfs/quota/xfs_qm.c 2011-03-29 21:55:41.387278609 +0200
@@ -110,11 +110,12 @@ xfs_Gqm_init(void)
*/
udqhash = kmem_zalloc_greedy(&hsize,
XFS_QM_HASHSIZE_LOW * sizeof(xfs_dqhash_t),
- XFS_QM_HASHSIZE_HIGH * sizeof(xfs_dqhash_t));
+ XFS_QM_HASHSIZE_HIGH * sizeof(xfs_dqhash_t),
+ KM_SLEEP | KM_MAYFAIL | KM_LARGE);
if (!udqhash)
goto out;
- gdqhash = kmem_zalloc_large(hsize);
+ gdqhash = kmem_zalloc(hsize, KM_SLEEP | KM_LARGE);
if (!gdqhash)
goto out_free_udqhash;
@@ -171,7 +172,7 @@ xfs_Gqm_init(void)
return xqm;
out_free_udqhash:
- kmem_free_large(udqhash);
+ kmem_free(udqhash);
out:
return NULL;
}
@@ -194,8 +195,8 @@ xfs_qm_destroy(
xfs_qm_list_destroy(&(xqm->qm_usr_dqhtable[i]));
xfs_qm_list_destroy(&(xqm->qm_grp_dqhtable[i]));
}
- kmem_free_large(xqm->qm_usr_dqhtable);
- kmem_free_large(xqm->qm_grp_dqhtable);
+ kmem_free(xqm->qm_usr_dqhtable);
+ kmem_free(xqm->qm_grp_dqhtable);
xqm->qm_usr_dqhtable = NULL;
xqm->qm_grp_dqhtable = NULL;
xqm->qm_dqhashmask = 0;
Index: xfs/fs/xfs/xfs_itable.c
===================================================================
--- xfs.orig/fs/xfs/xfs_itable.c 2011-03-29 21:55:12.851725366 +0200
+++ xfs/fs/xfs/xfs_itable.c 2011-03-29 21:55:31.660724287 +0200
@@ -259,10 +259,8 @@ xfs_bulkstat(
(XFS_INODE_CLUSTER_SIZE(mp) >> mp->m_sb.sb_inodelog);
nimask = ~(nicluster - 1);
nbcluster = nicluster >> mp->m_sb.sb_inopblog;
- irbuf = kmem_zalloc_greedy(&irbsize, PAGE_SIZE, PAGE_SIZE * 4);
- if (!irbuf)
- return ENOMEM;
-
+ irbuf = kmem_zalloc_greedy(&irbsize, PAGE_SIZE, PAGE_SIZE * 4,
+ KM_SLEEP | KM_MAYFAIL | KM_LARGE);
nirbuf = irbsize / sizeof(*irbuf);
/*
@@ -527,7 +525,7 @@ xfs_bulkstat(
/*
* Done, we're either out of filesystem or space to put the data.
*/
- kmem_free_large(irbuf);
+ kmem_free(irbuf);
*ubcountp = ubelem;
/*
* Found some inodes, return them now and return the error next time.
> mind either, so to narrow this down does reverting the patch on
> 2.6.38 also fix it? The revert isn't quite trivial due to changes
> since then, so here's the patch I came up with:
This patch does fix the problem.
On Tue, Mar 29, 2011 at 04:02:56PM -0400, 'Christoph Hellwig' wrote:
> On Tue, Mar 29, 2011 at 03:46:21PM -0400, Sean Noonan wrote:
> > > Can you check if the brute force patch below helps?
> >
> > No such luck.
>
> Actually thinking about it - we never do the vmalloc under any fs lock,
> so this can't be the reason. But nothing else in the patch spring to
> mind either, so to narrow this down does reverting the patch on
> 2.6.38 also fix it? The revert isn't quite trivial due to changes
> since then, so here's the patch I came up with:
>
>
> Index: xfs/fs/xfs/linux-2.6/kmem.c
> ===================================================================
> --- xfs.orig/fs/xfs/linux-2.6/kmem.c 2011-03-29 21:55:12.871726512 +0200
> +++ xfs/fs/xfs/linux-2.6/kmem.c 2011-03-29 21:55:31.648723706 +0200
> @@ -16,6 +16,7 @@
> * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> */
> #include <linux/mm.h>
> +#include <linux/vmalloc.h>
> #include <linux/highmem.h>
> #include <linux/slab.h>
> #include <linux/swap.h>
> @@ -25,25 +26,8 @@
> #include "kmem.h"
> #include "xfs_message.h"
>
> -/*
> - * Greedy allocation. May fail and may return vmalloced memory.
> - *
> - * Must be freed using kmem_free_large.
> - */
> -void *
> -kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> -{
> - void *ptr;
> - size_t kmsize = maxsize;
> -
> - while (!(ptr = kmem_zalloc_large(kmsize))) {
> - if ((kmsize >>= 1) <= minsize)
> - kmsize = minsize;
> - }
> - if (ptr)
> - *size = kmsize;
> - return ptr;
> -}
> +#define MAX_VMALLOCS 6
> +#define MAX_SLAB_SIZE 0x20000
Why those values for the magic numbers?
....
> Index: xfs/fs/xfs/quota/xfs_qm.c
> ===================================================================
> --- xfs.orig/fs/xfs/quota/xfs_qm.c 2011-03-29 21:55:12.859726589 +0200
> +++ xfs/fs/xfs/quota/xfs_qm.c 2011-03-29 21:55:41.387278609 +0200
> @@ -110,11 +110,12 @@ xfs_Gqm_init(void)
> */
> udqhash = kmem_zalloc_greedy(&hsize,
> XFS_QM_HASHSIZE_LOW * sizeof(xfs_dqhash_t),
> - XFS_QM_HASHSIZE_HIGH * sizeof(xfs_dqhash_t));
> + XFS_QM_HASHSIZE_HIGH * sizeof(xfs_dqhash_t),
> + KM_SLEEP | KM_MAYFAIL | KM_LARGE);
> if (!udqhash)
> goto out;
>
> - gdqhash = kmem_zalloc_large(hsize);
> + gdqhash = kmem_zalloc(hsize, KM_SLEEP | KM_LARGE);
Needs a KM_MAYFAIL as well?
> if (!gdqhash)
> goto out_free_udqhash;
>
....
> Index: xfs/fs/xfs/xfs_itable.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_itable.c 2011-03-29 21:55:12.851725366 +0200
> +++ xfs/fs/xfs/xfs_itable.c 2011-03-29 21:55:31.660724287 +0200
> @@ -259,10 +259,8 @@ xfs_bulkstat(
> (XFS_INODE_CLUSTER_SIZE(mp) >> mp->m_sb.sb_inodelog);
> nimask = ~(nicluster - 1);
> nbcluster = nicluster >> mp->m_sb.sb_inopblog;
> - irbuf = kmem_zalloc_greedy(&irbsize, PAGE_SIZE, PAGE_SIZE * 4);
> - if (!irbuf)
> - return ENOMEM;
> -
> + irbuf = kmem_zalloc_greedy(&irbsize, PAGE_SIZE, PAGE_SIZE * 4,
> + KM_SLEEP | KM_MAYFAIL | KM_LARGE);
> nirbuf = irbsize / sizeof(*irbuf);
Need to keep the if (!irbuf) check as KM_MAYFAIL is passed.
Cheers,
Dave
--
Dave Chinner
[email protected]
> Need to keep the if (!irbuf) check as KM_MAYFAIL is passed.
It wasn't in before the bug presented, so leaving it in wouldn't be a true test as to whether the bug has been tracked to the correct place. I'll test again with the if (!irbuf).
Sean
On Tue, Mar 29, 2011 at 03:54:12PM -0400, Sean Noonan wrote:
> > Can you check if the brute force patch below helps?
>
> Not sure if this helps at all, but here is the stack from all three processes involved. This is without MAP_POPULATE and with the patch you just sent.
>
> # ps aux | grep 'D[+]*[[:space:]]'
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 2314 0.2 0.0 0 0 ? D 19:44 0:00 [flush-8:0]
> root 2402 0.0 0.0 0 0 ? D 19:44 0:00 [xfssyncd/sda9]
> root 3861 2.6 9.9 16785280 4912848 pts/0 D+ 19:45 0:07 ./vmtest /xfs/hugefile.dat 17179869184
>
> # for p in 2314 2402 3861; do echo $p; cat /proc/$p/stack; done
> 2314
> [<ffffffff810d634a>] congestion_wait+0x7a/0x130
> [<ffffffff8129721c>] kmem_alloc+0x6c/0xf0
> [<ffffffff8127c07e>] xfs_inode_item_format+0x36e/0x3b0
> [<ffffffff8128401f>] xfs_log_commit_cil+0x4f/0x3b0
> [<ffffffff8128ff31>] _xfs_trans_commit+0x1f1/0x2b0
> [<ffffffff8127c716>] xfs_iomap_write_allocate+0x1a6/0x340
> [<ffffffff81298883>] xfs_map_blocks+0x193/0x2c0
> [<ffffffff812992fa>] xfs_vm_writepage+0x1ca/0x520
> [<ffffffff810c4bd2>] __writepage+0x12/0x40
> [<ffffffff810c53dd>] write_cache_pages+0x1dd/0x4f0
> [<ffffffff810c573c>] generic_writepages+0x4c/0x70
> [<ffffffff812986b8>] xfs_vm_writepages+0x58/0x70
> [<ffffffff810c577c>] do_writepages+0x1c/0x40
> [<ffffffff811247d1>] writeback_single_inode+0xf1/0x240
> [<ffffffff81124edd>] writeback_sb_inodes+0xdd/0x1b0
> [<ffffffff81125966>] writeback_inodes_wb+0x76/0x160
> [<ffffffff81125d93>] wb_writeback+0x343/0x550
> [<ffffffff81126126>] wb_do_writeback+0x186/0x2e0
> [<ffffffff81126342>] bdi_writeback_thread+0xc2/0x310
> [<ffffffff81067846>] kthread+0x96/0xa0
> [<ffffffff8165a414>] kernel_thread_helper+0x4/0x10
> [<ffffffffffffffff>] 0xffffffffffffffff
So, it's trying to allocate a buffer for the inode extent list, which
should only be a couple of hundred bytes, and at most ~2kB if you
are using large inodes. That still doesn't seem like it should be
having memory allocation problems here with 44GB of free RAM....
Hmmmm. I wonder - the process is doing a random walk of 16GB, so
it's probably created tens of thousands of delayed allocation
extents before any real allocation was done. xfs_inode_item_format()
uses the in-core data fork size for the extent buffer allocation
which in this case would be much larger than what can possibly fit
inside the inode data fork.
Let's see - worst case is 8GB of sparse blocks, which is 2^21
delalloc blocks, which gives a worst case allocation size of 2^21 *
sizeof(struct xfs_bmbt_rec), which is roughly 64MB. Which would
overflow the return value. Even at 1k delalloc extents, we'll be
asking for an order-15 allocation when all we really need is an
order-0 allocation.
Ok, so that looks like the root cause of the problem. Can you try the
patch below to see if it fixes the problem (without any other
patches applied or reverted)?
Cheers,
Dave.
--
Dave Chinner
[email protected]
xfs: fix extent format buffer allocation size
From: Dave Chinner <[email protected]>
When formatting an inode item, we have to allocate a separate buffer
to hold extents when there are delayed allocation extents on the
inode and it is in extent format. The allocation size is derived
from the in-core data fork representation, which accounts for
delayed allocation extents, while the on-disk representation does
not contain any delalloc extents.
As a result of this mismatch, the allocated buffer can be far larger
than needed to hold the real extent list which, due to the fact the
inode is in extent format, is limited to the size of the literal
area of the inode. However, we can have thousands of delalloc
extents, resulting in an allocation size orders of magnitude larger
than is needed to hold all the real extents.
Fix this by limiting the size of the buffer being allocated to the
size of the literal area of the inodes in the filesystem (i.e. the
maximum size an inode fork can grow to).
Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/xfs_inode_item.c | 69 ++++++++++++++++++++++++++++------------------
1 files changed, 42 insertions(+), 27 deletions(-)
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 46cc401..12cdc39 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -198,6 +198,43 @@ xfs_inode_item_size(
}
/*
+ * xfs_inode_item_format_extents - convert in-core extents to on-disk form
+ *
+ * For either the data or attr fork in extent format, we need to endian convert
+ * the in-core extent as we place them into the on-disk inode. In this case, we
+ * need to do this conversion before we write the extents into the log. Because
+ * we don't have the disk inode to write into here, we allocate a buffer and
+ * format the extents into it via xfs_iextents_copy(). We free the buffer in
+ * the unlock routine after the copy for the log has been made.
+ *
+ * For the data fork, there can be delayed allocation extents
+ * in the inode as well, so the in-core data fork can be much larger than the
+ * on-disk data representation of real inodes. Hence we need to limit the size
+ * of the allocation to what will fit in the inode fork, otherwise we could be
+ * asking for excessively large allocation sizes.
+ */
+STATIC void
+xfs_inode_item_format_extents(
+ struct xfs_inode *ip,
+ struct xfs_log_iovec *vecp,
+ int whichfork,
+ int type)
+{
+ xfs_bmbt_rec_t *ext_buffer;
+
+ ext_buffer = kmem_alloc(XFS_IFORK_SIZE(ip, whichfork),
+ KM_SLEEP | KM_NOFS);
+ if (whichfork == XFS_DATA_FORK)
+ ip->i_itemp->ili_extents_buf = ext_buffer;
+ else
+ ip->i_itemp->ili_aextents_buf = ext_buffer;
+
+ vecp->i_addr = ext_buffer;
+ vecp->i_len = xfs_iextents_copy(ip, ext_buffer, whichfork);
+ vecp->i_type = type;
+}
+
+/*
* This is called to fill in the vector of log iovecs for the
* given inode log item. It fills the first item with an inode
* log format structure, the second with the on-disk inode structure,
@@ -213,7 +250,6 @@ xfs_inode_item_format(
struct xfs_inode *ip = iip->ili_inode;
uint nvecs;
size_t data_bytes;
- xfs_bmbt_rec_t *ext_buffer;
xfs_mount_t *mp;
vecp->i_addr = &iip->ili_format;
@@ -320,22 +356,8 @@ xfs_inode_item_format(
} else
#endif
{
- /*
- * There are delayed allocation extents
- * in the inode, or we need to convert
- * the extents to on disk format.
- * Use xfs_iextents_copy()
- * to copy only the real extents into
- * a separate buffer. We'll free the
- * buffer in the unlock routine.
- */
- ext_buffer = kmem_alloc(ip->i_df.if_bytes,
- KM_SLEEP);
- iip->ili_extents_buf = ext_buffer;
- vecp->i_addr = ext_buffer;
- vecp->i_len = xfs_iextents_copy(ip, ext_buffer,
- XFS_DATA_FORK);
- vecp->i_type = XLOG_REG_TYPE_IEXT;
+ xfs_inode_item_format_extents(ip, vecp,
+ XFS_DATA_FORK, XLOG_REG_TYPE_IEXT);
}
ASSERT(vecp->i_len <= ip->i_df.if_bytes);
iip->ili_format.ilf_dsize = vecp->i_len;
@@ -445,19 +467,12 @@ xfs_inode_item_format(
*/
vecp->i_addr = ip->i_afp->if_u1.if_extents;
vecp->i_len = ip->i_afp->if_bytes;
+ vecp->i_type = XLOG_REG_TYPE_IATTR_EXT;
#else
ASSERT(iip->ili_aextents_buf == NULL);
- /*
- * Need to endian flip before logging
- */
- ext_buffer = kmem_alloc(ip->i_afp->if_bytes,
- KM_SLEEP);
- iip->ili_aextents_buf = ext_buffer;
- vecp->i_addr = ext_buffer;
- vecp->i_len = xfs_iextents_copy(ip, ext_buffer,
- XFS_ATTR_FORK);
+ xfs_inode_item_format_extents(ip, vecp,
+ XFS_ATTR_FORK, XLOG_REG_TYPE_IATTR_EXT);
#endif
- vecp->i_type = XLOG_REG_TYPE_IATTR_EXT;
iip->ili_format.ilf_asize = vecp->i_len;
vecp++;
nvecs++;
> Ok, so that looks like root cause of the problem. can you try the
> patch below to see if it fixes the problem (without any other
> patches applied or reverted).
It looks like this does fix the deadlock problem. However, it appears to come at the price of significantly higher mmap startup costs.
# ./vmtest /xfs/hugefile.dat $(( 16 * 1024 * 1024 * 1024 ))
/xfs/d-1/hugefile.dat: mapped 17179869184 bytes in 324387362198 ticks
Sean
On Tue, Mar 29, 2011 at 09:32:06PM -0400, Sean Noonan wrote:
> > Ok, so that looks like root cause of the problem. can you try the
> > patch below to see if it fixes the problem (without any other
> > patches applied or reverted).
>
> It looks like this does fix the deadlock problem. However, it
> appears to come at the price of significantly higher mmap startup
> costs.
It shouldn't make any difference to startup costs: the current
code uses read faults to populate the region, and that doesn't cause
any allocation to occur, so this code is not executed during
the populate phase.
Is this repeatable or is it just a one-off result?
Cheers,
Dave.
--
Dave Chinner
[email protected]
> Is this repeatable or is it just a one-off result?
It was repeated three times before I sent the email, but I can't reproduce it again now. Call it a fluke.
Sean
On Wed, Mar 30, 2011 at 09:42:30AM +1100, Dave Chinner wrote:
> > +#define MAX_VMALLOCS 6
> > +#define MAX_SLAB_SIZE 0x20000
>
> Why those values for the magic numbers?
Ask the person who added it originally, it's just a revert to the
code before my commit to clean up our vmalloc usage.
On Wed, Mar 30, 2011 at 11:09:42AM +1100, Dave Chinner wrote:
> + ext_buffer = kmem_alloc(XFS_IFORK_SIZE(ip, whichfork),
> + KM_SLEEP | KM_NOFS);
The old code didn't use KM_NOFS, and I don't think it needed it either,
as we call the iop_format handlers inside the region covered by the
PF_FSTRANS flag.
Also I think the routine needs to be under #ifndef XFS_NATIVE_HOST, as
we do not use it for big endian builds.