2011-03-14 19:21:04

by Hugh Dickins

Subject: ext4 deep stack with mark_page_dirty reclaim

When testing something else on 2.6.38-rc8 last night,
I hit this x86_64 stack overflow. I've never had one before,
it seems worth reporting. kdb was in, I jotted it down by hand
(the notifier part of it will be notifying kdb of the fault).
CONFIG_DEBUG_STACK_OVERFLOW and DEBUG_STACK_USAGE were not set.

I should disclose that I have a hack in which may make my stack
frames slightly larger than they should be: check against yours.
So it may not be an overflow for anyone else, but still a trace
to worry about.

The faulting address has a stray 2 in it: presumably from when
the stack descended into the thread_info and got corrupted.

BUG: unable to handle kernel paging request at ffff88027a704060
IP: vprintk+0x100
Thread overran stack, or stack corrupted
ffff88007a7040a8 notifier_call_chain+0x40
ffff88007a704108 __atomic_notifier_call_chain+0x48
ffff88007a704148 __die+0x48
ffff88007a7041a8 die+0x50
ffff88007a7041e8 do_general_protection+0x168
ffff88007a704248 general_protection+0x1f
ffff88007a704338 schedule+0x25a
ffff88007a7044a8 io_schedule+0x35
ffff88007a7044c8 get_request_wait+0xc6
ffff88007a704568 __make_request+0x36d
ffff88007a7045d8 generic_make_request+0x2f2
ffff88007a7046a8 submit_bio+0xe1
ffff88007a704738 swap_writepage+0xa3
ffff88007a704788 pageout+0x151
ffff88007a704808 shrink_page_list+0x2db
ffff88007a7048b8 shrink_inactive_list+0x2d3
ffff88007a7049b8 shrink_zone+0x17d
ffff88007a704a98 shrink_zones+0xa3
ffff88007a704b18 do_try_to_free_pages+0x87
ffff88007a704ba8 try_to_free_mem_cgroup_pages+0x8e
ffff88007a704c18 mem_cgroup_hierarchical_reclaim+0x220
ffff88007a704cc8 mem_cgroup_do_charge+0xdc
ffff88007a704d48 __mem_cgroup_try_charge+0x19c
ffff88007a704dc8 mem_cgroup_charge_common+0xa8
ffff88007a704e48 mem_cgroup_cache_charge+0x19a
ffff88007a704ec8 add_to_page_cache_locked+0x57
ffff88007a704f28 add_to_page_cache_lru+0x3e
ffff88007a704f78 find_or_create_page+0x69
ffff88007a704fe8 grow_dev_page+0x4a
ffff88007a705048 grow_buffers+0x41
ffff88007a705088 __getblk_slow+0xd7
ffff88007a7050d8 __getblk+0x44
ffff88007a705128 __ext4_get_inode_loc+0x12c
ffff88007a7051d8 ext4_get_inode_loc+0x30
ffff88007a705208 ext4_reserve_inode_write+0x21
ffff88007a705258 ext4_mark_inode_dirty+0x3b
ffff88007a7052f8 ext4_dirty_inode+0x3e
ffff88007a705338 __mark_inode_dirty+0x32
linux/fs.h mark_inode_dirty
linux/quotaops.h dquot_alloc_space
linux/quotaops.h dquot_alloc_block
ffff88007a705388 ext4_mb_new_blocks+0xc2
ffff88007a705418 ext4_alloc_blocks+0x189
ffff88007a7054e8 ext4_alloc_branch+0x73
ffff88007a7055b8 ext4_ind_map_blocks+0x148
ffff88007a7056c8 ext4_map_blocks+0x148
ffff88007a705738 ext4_getblk+0x5f
ffff88007a7057c8 ext4_bread+0x36
ffff88007a705828 ext4_append+0x52
ffff88007a705888 do_split+0x5b
ffff88007a705968 ext4_dx_add_entry+0x4b4
ffff88007a705a98 ext4_add_entry+0x7c
ffff88007a705b48 ext4_add_nondir+0x2e
ffff88007a705b98 ext4_create+0xf5
ffff88007a705c28 vfs_create+0x83
ffff88007a705c88 __open_namei_create+0x59
ffff88007a705ce8 do_last+0x13b
ffff88007a705d58 do_filp_open+0x2ae
ffff88007a705ed8 do_sys_open+0x72
ffff88007a705f58 sys_open+0x27

Hugh


2011-03-14 20:46:33

by Theodore Ts'o

Subject: Re: ext4 deep stack with mark_page_dirty reclaim

On Mon, Mar 14, 2011 at 12:20:52PM -0700, Hugh Dickins wrote:
> When testing something else on 2.6.38-rc8 last night,
> I hit this x86_64 stack overflow. I've never had one before,
> it seems worth reporting. kdb was in, I jotted it down by hand
> (the notifier part of it will be notifying kdb of the fault).
> CONFIG_DEBUG_STACK_OVERFLOW and DEBUG_STACK_USAGE were not set.
>
> I should disclose that I have a hack in which may make my stack
> frames slightly larger than they should be: check against yours.
> So it may not be an overflow for anyone else, but still a trace
> to worry about.

Here's the trace translated to the stack space used by each function.
There are a few piggy ext4 functions that we can try to shrink, but
the real problem is just how deep the whole stack is getting.

From the syscall to the lowest-level ext4 function is 3712 bytes, and
everything from there to the schedule() which then triggered the GPF
was another 3728 of stack space....

- Ted

240 schedule+0x25a
368 io_schedule+0x35
32 get_request_wait+0xc6
160 __make_request+0x36d
112 generic_make_request+0x2f2
208 submit_bio+0xe1
144 swap_writepage+0xa3
80 pageout+0x151
128 shrink_page_list+0x2db
176 shrink_inactive_list+0x2d3
256 shrink_zone+0x17d
224 shrink_zones+0xa3
128 do_try_to_free_pages+0x87
144 try_to_free_mem_cgroup_pages+0x8e
112 mem_cgroup_hierarchical_reclaim+0x220
176 mem_cgroup_do_charge+0xdc
128 __mem_cgroup_try_charge+0x19c
128 mem_cgroup_charge_common+0xa8
128 mem_cgroup_cache_charge+0x19a
128 add_to_page_cache_locked+0x57
96 add_to_page_cache_lru+0x3e
80 find_or_create_page+0x69
112 grow_dev_page+0x4a
96 grow_buffers+0x41
64 __getblk_slow+0xd7
80 __getblk+0x44
80 __ext4_get_inode_loc+0x12c
176 ext4_get_inode_loc+0x30
48 ext4_reserve_inode_write+0x21
80 ext4_mark_inode_dirty+0x3b
160 ext4_dirty_inode+0x3e
64 __mark_inode_dirty+0x32
80 linux/fs.h mark_inode_dirty
0 linux/quotaops.h dquot_alloc_space
0 linux/quotaops.h dquot_alloc_block
0 ext4_mb_new_blocks+0xc2
144 ext4_alloc_blocks+0x189
208 ext4_alloc_branch+0x73
208 ext4_ind_map_blocks+0x148
272 ext4_map_blocks+0x148
112 ext4_getblk+0x5f
144 ext4_bread+0x36
96 ext4_append+0x52
96 do_split+0x5b
224 ext4_dx_add_entry+0x4b4
304 ext4_add_entry+0x7c
176 ext4_add_nondir+0x2e
80 ext4_create+0xf5
144 vfs_create+0x83
96 __open_namei_create+0x59
96 do_last+0x13b
112 do_filp_open+0x2ae
384 do_sys_open+0x72
128 sys_open+0x27

2011-03-14 22:46:48

by Christoph Hellwig

Subject: Re: ext4 deep stack with mark_page_dirty reclaim

Direct reclaim (in the cgroup variant) at its work. We had a couple of
flamewars on this before, but this trivial example, with reclaim hitting
the simplest case (swap space), shows that we really should never reclaim
from memory allocation callers, for stack usage reasons.

2011-03-15 05:17:30

by Andreas Dilger

Subject: Re: ext4 deep stack with mark_page_dirty reclaim

On 2011-03-14, at 1:46 PM, Ted Ts'o wrote:
> On Mon, Mar 14, 2011 at 12:20:52PM -0700, Hugh Dickins wrote:
>> When testing something else on 2.6.38-rc8 last night,
>> I hit this x86_64 stack overflow. I've never had one before,
>> it seems worth reporting. kdb was in, I jotted it down by hand
>> (the notifier part of it will be notifying kdb of the fault).
>> CONFIG_DEBUG_STACK_OVERFLOW and DEBUG_STACK_USAGE were not set.
>>
>> I should disclose that I have a hack in which may make my stack
>> frames slightly larger than they should be: check against yours.
>> So it may not be an overflow for anyone else, but still a trace
>> to worry about.
>
> Here's the trace translated to the stack space used by each function.
> There are a few piggy ext4 functions that we can try to shrink, but
> the real problem is just how deep the whole stack is getting.
>
> From the syscall to the lowest-level ext4 function is 3712 bytes, and
> everything from there to the schedule() which then triggered the GPF
> was another 3728 of stack space....

Is there a script which you used to generate this stack trace to function size mapping, or did you do it by hand? I've always wanted such a script, but the tricky part is that there is so much garbage on the stack that any automated stack parsing is almost useless. Alternately, it would seem trivial to have the stack dumper print the relative address of each symbol, and the delta from the previous symbol...

To be honest, I think the stack size limitation is becoming a serious problem in itself. While some stack-size reduction effort is actually useful in removing inefficiency, I think a lot of crazy and inefficient things are done just to minimize stack usage (e.g. lots of kmalloc/kfree of temporary arrays instead of just putting them on the stack), which ends up consuming _more_ total memory.

This can be seen with deep storage stacks that are using the network on both ends, like NFS+{XFS, ext4}+LVM+DM+{fcoib,iSCSI}+driver+kmalloc or similar... The stack below isn't even using anything so convoluted.

> 240 schedule+0x25a
> 368 io_schedule+0x35
> 32 get_request_wait+0xc6
> 160 __make_request+0x36d
> 112 generic_make_request+0x2f2
> 208 submit_bio+0xe1
> 144 swap_writepage+0xa3
> 80 pageout+0x151
> 128 shrink_page_list+0x2db
> 176 shrink_inactive_list+0x2d3
> 256 shrink_zone+0x17d
> > 224 shrink_zones+0xa3
> 128 do_try_to_free_pages+0x87
> 144 try_to_free_mem_cgroup_pages+0x8e
> 112 mem_cgroup_hierarchical_reclaim+0x220
> 176 mem_cgroup_do_charge+0xdc
> 128 __mem_cgroup_try_charge+0x19c
> 128 mem_cgroup_charge_common+0xa8
> 128 mem_cgroup_cache_charge+0x19a
> 128 add_to_page_cache_locked+0x57
> 96 add_to_page_cache_lru+0x3e
> 80 find_or_create_page+0x69
> 112 grow_dev_page+0x4a
> 96 grow_buffers+0x41
> 64 __getblk_slow+0xd7
> 80 __getblk+0x44
> 80 __ext4_get_inode_loc+0x12c
> 176 ext4_get_inode_loc+0x30
> 48 ext4_reserve_inode_write+0x21
> 80 ext4_mark_inode_dirty+0x3b
> 160 ext4_dirty_inode+0x3e
> 64 __mark_inode_dirty+0x32
> 80 linux/fs.h mark_inode_dirty
> 0 linux/quotaops.h dquot_alloc_space
> 0 linux/quotaops.h dquot_alloc_block
> 0 ext4_mb_new_blocks+0xc2
> 144 ext4_alloc_blocks+0x189
> 208 ext4_alloc_branch+0x73
> 208 ext4_ind_map_blocks+0x148
> 272 ext4_map_blocks+0x148
> 112 ext4_getblk+0x5f
> 144 ext4_bread+0x36
> 96 ext4_append+0x52
> 96 do_split+0x5b
> 224 ext4_dx_add_entry+0x4b4
> 304 ext4_add_entry+0x7c
> 176 ext4_add_nondir+0x2e
> 80 ext4_create+0xf5
> 144 vfs_create+0x83
> 96 __open_namei_create+0x59
> 96 do_last+0x13b
> 112 do_filp_open+0x2ae
> 384 do_sys_open+0x72
> 128 sys_open+0x27


Cheers, Andreas




2011-03-15 16:03:25

by David Sterba

Subject: Re: ext4 deep stack with mark_page_dirty reclaim

On Mon, Mar 14, 2011 at 07:25:10PM -0700, Andreas Dilger wrote:
> Is there a script which you used to generate this stack trace to
> function size mapping, or did you do it by hand? I've always wanted
> such a script, but the tricky part is that there is so much garbage on
> the stack that any automated stack parsing is almost useless.
> Alternately, it would seem trivial to have the stack dumper print the
> relative address of each symbol, and the delta from the previous
> symbol...

> > 240 schedule+0x25a
> > 368 io_schedule+0x35
> > 32 get_request_wait+0xc6

from the callstack:

ffff88007a704338 schedule+0x25a
ffff88007a7044a8 io_schedule+0x35
ffff88007a7044c8 get_request_wait+0xc6

subtract the values and you get the ones Ted posted,

eg. for get_request_wait:

0xffff88007a7044c8 - 0xffff88007a7044a8 = 32
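
A small sketch that redoes this subtraction over a whole pasted trace (an
illustration, not a script anyone ran here; it assumes the trace arrives on
stdin as "address symbol+offset" lines like Hugh's dump above):

#!/usr/bin/env python3
# Minimal sketch: recompute the per-frame stack usage from a pasted trace of
# "stack-address symbol+offset" lines.
import sys

frames = []
for line in sys.stdin:
    parts = line.split()
    if len(parts) != 2 or '+' not in parts[1]:
        continue                      # skip inlined entries with no address
    try:
        frames.append((int(parts[0], 16), parts[1]))
    except ValueError:
        continue

# The dump walks the stack from the deepest frame upward and the stack grows
# down, so each frame's usage is its address minus the previous line's address.
for (prev, _), (addr, sym) in zip(frames, frames[1:]):
    print(f"{addr - prev:5d} {sym}")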

There's a script, scripts/checkstack.pl, which tries to determine stack
usage from 'objdump -d' output by looking for the 'sub $0x123,%rsp'
instruction and reporting the 0x123 as stack consumption. It does not give
the same results; for get_request_wait:

ffffffff81216205: 48 83 ec 68 sub $0x68,%rsp

reported as 104.
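
The x86_64 part of that heuristic boils down to something like the sketch
below (an illustration only; the real checkstack.pl handles more instruction
patterns and other architectures, and the script name is just a placeholder):

#!/usr/bin/env python3
# Simplified sketch of the checkstack.pl idea on x86_64: scan "objdump -d
# vmlinux" output for the "sub $0xNN,%rsp" that opens each function's frame
# and report that immediate as the function's stack usage.
# Example invocation:  objdump -d vmlinux | ./subrsp.py
import re
import sys

func_re = re.compile(r'^[0-9a-f]+ <([^>]+)>:')         # ffffffff81216160 <get_request_wait>:
sub_re = re.compile(r'\bsub\s+\$(0x[0-9a-f]+),%rsp')   # sub    $0x68,%rsp

current, usage = None, {}
for line in sys.stdin:
    m = func_re.match(line)
    if m:
        current = m.group(1)
    elif current and current not in usage:
        m = sub_re.search(line)
        if m:
            usage[current] = int(m.group(1), 16)

for func, size in sorted(usage.items(), key=lambda kv: -kv[1])[:40]:
    print(f"{size:6d} {func}")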


dave

2011-03-15 16:27:33

by Chris Mason

Subject: Re: ext4 deep stack with mark_page_dirty reclaim

Excerpts from David Sterba's message of 2011-03-15 11:22:22 -0400:
> On Mon, Mar 14, 2011 at 07:25:10PM -0700, Andreas Dilger wrote:
> > Is there a script which you used to generate this stack trace to
> > function size mapping, or did you do it by hand? I've always wanted
> > such a script, but the tricky part is that there is so much garbage on
> > the stack that any automated stack parsing is almost useless.
> > Alternately, it would seem trivial to have the stack dumper print the
> > relative address of each symbol, and the delta from the previous
> > symbol...
>
> > > 240 schedule+0x25a
> > > 368 io_schedule+0x35
> > > 32 get_request_wait+0xc6
>
> from the callstack:
>
> ffff88007a704338 schedule+0x25a
> ffff88007a7044a8 io_schedule+0x35
> ffff88007a7044c8 get_request_wait+0xc6
>
> subtract the values and you get the ones Ted posted,
>
> eg. for get_request_wait:
>
> 0xffff88007a7044c8 - 0xffff88007a7044a8 = 32
>
> There'se a script scripts/checkstack.pl which tries to determine stack
> usage from 'objdump -d' looking for the 'sub 0x123,%rsp' instruction and
> reporting the 0x123 as stack consumption. It does not give same results,
> for the get_request_wait:
>
> ffffffff81216205: 48 83 ec 68 sub $0x68,%rsp
>
> reported as 104.

Also, the ftrace stack usage tracer gives more verbose output that
includes the size of each function.
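
Something like the sketch below is enough to drive it from userspace (an
illustration, not part of ftrace itself; it assumes CONFIG_STACK_TRACER is
built in, debugfs is mounted at /sys/kernel/debug, and it is run as root):

#!/usr/bin/env python3
# Sketch of driving the ftrace stack tracer from userspace.
TRACING = "/sys/kernel/debug/tracing"

def enable():
    # same effect as: echo 1 > /proc/sys/kernel/stack_tracer_enabled
    with open("/proc/sys/kernel/stack_tracer_enabled", "w") as f:
        f.write("1\n")

def report():
    with open(TRACING + "/stack_max_size") as f:
        print("deepest stack seen (bytes):", f.read().strip())
    with open(TRACING + "/stack_trace") as f:
        print(f.read())   # one line per function, with its size on the stack

if __name__ == "__main__":
    enable()
    report()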

-chris

2011-03-23 14:03:23

by David Sterba

Subject: Re: ext4 deep stack with mark_page_dirty reclaim

On Tue, Mar 15, 2011 at 12:26:43PM -0400, Chris Mason wrote:
> Also, the ftrace stack usage tracer gives more verbose output that
> includes the size of each function.

Yet another one on the list is the -fstack-usage option in the new gcc 4.6 [*].
For each unit it creates a file with a .su extension containing lines in the format
file:line:column:function_name stack_size qualifier (static, dynamic or bounded)

eg.

a.c:168:5:main 224 static
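
The per-object .su files from a whole build can then be aggregated with
something like the sketch below (an illustration; it assumes the .su files
are left next to the objects under the given build directory):

#!/usr/bin/env python3
# Sketch: walk a build tree, read the .su files gcc -fstack-usage produced
# and list the largest frames.  Line format as in the example above:
# "file:line:column:function  size  qualifier".
import os
import sys

def su_entries(root):
    for dirpath, _, names in os.walk(root):
        for name in names:
            if name.endswith(".su"):
                with open(os.path.join(dirpath, name)) as f:
                    for line in f:
                        fields = line.split()
                        if len(fields) >= 3 and fields[1].isdigit():
                            yield int(fields[1]), fields[0], fields[2]

root = sys.argv[1] if len(sys.argv) > 1 else "."
for size, where, kind in sorted(su_entries(root), reverse=True)[:30]:
    print(f"{size:6d} {kind:8s} {where}")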


dave

* http://gcc.gnu.org/gcc-4.6/changes.html