2010-04-07 11:13:26

by John Berthels

Subject: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

Hi folks,

[I'm afraid that I'm not subscribed to the list, please cc: me on any
reply].

Problem: kernel.org 2.6.33.2 x86_64 kernel locks up under write-heavy
I/O load. It is "fixed" by changing THREAD_ORDER to 2.

Is this an OK long-term solution/should this be needed? As far as I can
see from searching, there is an expectation that xfs would generally
work with 8k stacks (THREAD_ORDER 1). We don't have xfs stacked over LVM
or anything else.
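
(For reference, the THREAD_ORDER change is just the obvious bump in
arch/x86/include/asm/page_64_types.h, along these lines:

#define THREAD_ORDER	2	/* was 1 */
#define THREAD_SIZE	(PAGE_SIZE << THREAD_ORDER)

i.e. each kernel thread stack grows from 8k to 16k, at the cost of an
order-2 page allocation per thread.)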

If anyone can offer any advice on this, that would be great. I
understand larger kernel stacks may introduce problems in getting an
allocation of the appropriate size. So am I right in thinking the
symptom we need to look out for would be an error on fork() or clone()?
Or will the box panic in that case?

Details below.

regards,

jb


Background: We have a cluster of systems with roughly the following
specs (2GB RAM, 24 (twenty-four) 1TB+ disks, Intel Core2 Duo @ 2.2GHz).

Following the addition of three new servers to the cluster, we started
seeing a high incidence of intermittent lockups (up to several times per
day for some servers) across both the old and new servers. Prior to
that, we saw this problem only rarely (perhaps once per 3 months).

Adding the new servers will have changed the I/O patterns to all
servers. The servers receive a heavy write load, often with many slow
writers (as well as a read load).

Servers would become unresponsive, with nothing written to
/var/log/messages. Setting sysctl kernel.panic=300 caused a restart
(which showed the kernel was panicking and unable to write at the time).
netconsole showed a variety of stack traces, mostly related to xfs_write
activity (but then, that's what the box spends its time doing).
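
(The panic reboot is just "sysctl -w kernel.panic=300", i.e. reboot 300
seconds after a panic instead of sitting there hung.)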

22/24 of the disks have 1 partition, formatted with xfs (over the
partition, not over LVM). The other 2 disks have 3 partitions: xfs data,
swap and a RAID1 partition contributing to an ext3 root filesystem
mounted on /dev/md0.

We have tried various solutions (different kernels from ubuntu server
2.6.28->2.6.32).

Vanilla 2.6.33.2 from kernel.org + stack tracing still has the problem,
and logged:

kernel: [58552.740032] flush-8:112 used greatest stack depth: 184 bytes left

a short while before dying.

Vanilla 2.6.33.2 + stack tracing + THREAD_ORDER 2 is much more stable
(no lockups so far, we would have expected 5-6 by now) and has logged:

kernel: [44798.183507] apache2 used greatest stack depth: 7208 bytes left

which I understand (possibly wrongly) as concrete evidence that we have
exceeded 8k of stack space (with THREAD_ORDER 2 the stack is 16k, so 7208
bytes left means roughly 9k was in use at the deepest point).
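
(As far as I can tell, the "used greatest stack depth" message comes from
CONFIG_DEBUG_STACK_USAGE. On process exit the kernel scans up from the end
of the thread stack, counting the zero words that were never touched -
roughly this, from include/linux/sched.h:

static inline unsigned long stack_not_used(struct task_struct *p)
{
	unsigned long *n = end_of_stack(p);

	do {		/* skip over the canary and untouched words */
		n++;
	} while (!*n);

	return (unsigned long)n - (unsigned long)end_of_stack(p);
}

so the "bytes left" figure is a high-water mark over the life of the
process, not the depth at the moment the line was printed.)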


2010-04-07 14:05:40

by Dave Chinner

Subject: Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

On Wed, Apr 07, 2010 at 12:06:01PM +0100, John Berthels wrote:
> Hi folks,
>
> [I'm afraid that I'm not subscribed to the list, please cc: me on
> any reply].
>
> Problem: kernel.org 2.6.33.2 x86_64 kernel locks up under
> write-heavy I/O load. It is "fixed" by changing THREAD_ORDER to 2.
>
> Is this an OK long-term solution/should this be needed? As far as I
> can see from searching, there is an expectation that xfs would
> generally work with 8k stacks (THREAD_ORDER 1). We don't have xfs
> stacked over LVM or anything else.

I'm not seeing stacks deeper than about 5.6k on XFS under heavy write
loads. That's nowhere near blowing an 8k stack, so there must be
something special about what you are doing. Can you post the stack
traces that are being generated for the deepest stack generated -
/sys/kernel/debug/tracing/stack_trace should contain it.
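
That assumes you have CONFIG_STACK_TRACER=y and the stack tracer enabled,
e.g.:

  # echo 1 > /proc/sys/kernel/stack_tracer_enabled
  # cat /sys/kernel/debug/tracing/stack_max_size
  # cat /sys/kernel/debug/tracing/stack_trace

stack_max_size is the deepest stack seen since the tracer was turned on,
and stack_trace is the call chain that produced it.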

> Background: We have a cluster of systems with roughly the following
> specs (2GB RAM, 24 (twenty-four) 1TB+ disks, Intel Core2 Duo @
> 2.2GHz).
>
> Following a the addition of three new servers to the cluster, we
> started seeing a high incidence of intermittent lockups (up to
> several times per day for some servers) across both the old and new
> servers. Prior to that, we saw this problem only rarely (perhaps
> once per 3 months).

What is generating the write load?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-07 15:54:54

by John Berthels

Subject: Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

Dave Chinner wrote:
> I'm not seeing stacks deeper than about 5.6k on XFS under heavy write
> loads. That's nowhere near blowing an 8k stack, so there must be
> something special about what you are doing. Can you post the stack
> traces that are being generated for the deepest stack generated -
> /sys/kernel/debug/tracing/stack_trace should contain it.
>
Appended below. That doesn't seem to reach 8192 but the box it's from
has logged:

[74649.579386] apache2 used greatest stack depth: 7024 bytes left

full dmesg (gzipped) attached.
> What is generating the write load?
>

WebDAV PUTs in a modified mogilefs cluster, running apache-mpm-worker
(threaded) as the DAV server. The write load is a mix of internet-upload
speed writers trickling files up and some local fast replicators copying
from elsewhere in the cluster. mpm worker cfg is:

ServerLimit 20
StartServers 5
MaxClients 300
MinSpareThreads 25
MaxSpareThreads 75
ThreadsPerChild 30
MaxRequestsPerChild 0

File sizes are a mix of small to large (4GB+). Each disk is exported as
a mogile device, so it's possible for mogile to pound a single disk with
lots of write activity (if the random number generator decides to put
lots of files on that device at the same time).

We're also seeing occasional slowdowns + high load avg (up to ~300, i.e.
MaxClients) with a corresponding number of threads in D state. (This
slowdown + high load avg seems to correlate with what would have
previously caused a panic on the THREAD_ORDER 1 kernel, but we're not
100% sure).

As you can see from the dmesg, this trips the "task xxx blocked for more
than 120 seconds." on some of the threads.

Don't know if that's related to the stack issue or to be expected under
the load.


jb

Depth Size Location (47 entries)
----- ---- --------
0) 7568 16 mempool_alloc_slab+0x16/0x20
1) 7552 144 mempool_alloc+0x65/0x140
2) 7408 96 get_request+0x124/0x370
3) 7312 144 get_request_wait+0x29/0x1b0
4) 7168 96 __make_request+0x9b/0x490
5) 7072 208 generic_make_request+0x3df/0x4d0
6) 6864 80 submit_bio+0x7c/0x100
7) 6784 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
8) 6688 48 xfs_buf_iorequest+0x75/0xd0 [xfs]
9) 6640 32 _xfs_buf_read+0x36/0x70 [xfs]
10) 6608 48 xfs_buf_read+0xda/0x110 [xfs]
11) 6560 80 xfs_trans_read_buf+0x2a7/0x410 [xfs]
12) 6480 80 xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
13) 6400 80 xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
14) 6320 176 xfs_btree_lookup+0xd7/0x490 [xfs]
15) 6144 16 xfs_alloc_lookup_eq+0x19/0x20 [xfs]
16) 6128 96 xfs_alloc_fixup_trees+0xee/0x350 [xfs]
17) 6032 144 xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
18) 5888 32 xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
19) 5856 96 xfs_alloc_vextent+0x49f/0x630 [xfs]
20) 5760 160 xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
21) 5600 208 xfs_btree_split+0xb3/0x6a0 [xfs]
22) 5392 96 xfs_btree_make_block_unfull+0x151/0x190 [xfs]
23) 5296 224 xfs_btree_insrec+0x39c/0x5b0 [xfs]
24) 5072 128 xfs_btree_insert+0x86/0x180 [xfs]
25) 4944 352 xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
26) 4592 208 xfs_bmap_add_extent+0x41c/0x450 [xfs]
27) 4384 448 xfs_bmapi+0x982/0x1200 [xfs]
28) 3936 256 xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
29) 3680 208 xfs_iomap+0x3d8/0x410 [xfs]
30) 3472 32 xfs_map_blocks+0x2c/0x30 [xfs]
31) 3440 256 xfs_page_state_convert+0x443/0x730 [xfs]
32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
33) 3120 384 shrink_page_list+0x65e/0x840
34) 2736 528 shrink_zone+0x63f/0xe10
35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
36) 2096 128 try_to_free_pages+0x77/0x80
37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
38) 1728 48 alloc_pages_current+0x8c/0xe0
39) 1680 16 __get_free_pages+0xe/0x50
40) 1664 48 __pollwait+0xca/0x110
41) 1616 32 unix_poll+0x28/0xc0
42) 1584 16 sock_poll+0x1d/0x20
43) 1568 912 do_select+0x3d6/0x700
44) 656 416 core_sys_select+0x18c/0x2c0
45) 240 112 sys_select+0x4f/0x110
46) 128 128 system_call_fastpath+0x16/0x1b


Attachments:
dmesg.txt.gz (18.31 kB)

2010-04-07 17:43:12

by Eric Sandeen

Subject: Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

John Berthels wrote:
> Dave Chinner wrote:
>> I'm not seeing stacks deeper than about 5.6k on XFS under heavy write
>> loads. That's nowhere near blowing an 8k stack, so there must be
>> something special about what you are doing. Can you post the stack
>> traces that are being generated for the deepest stack generated -
>> /sys/kernel/debug/tracing/stack_trace should contain it.
>>
> Appended below. That doesn't seem to reach 8192 but the box it's from
> has logged:
>
> [74649.579386] apache2 used greatest stack depth: 7024 bytes left

but that's -left- (out of 8k or is that from a THREAD_ORDER=2 box?)

I guess it must be out of 16k...

> Depth Size Location (47 entries)
> ----- ---- --------
> 0) 7568 16 mempool_alloc_slab+0x16/0x20
> 1) 7552 144 mempool_alloc+0x65/0x140
> 2) 7408 96 get_request+0x124/0x370
> 3) 7312 144 get_request_wait+0x29/0x1b0
> 4) 7168 96 __make_request+0x9b/0x490
> 5) 7072 208 generic_make_request+0x3df/0x4d0
> 6) 6864 80 submit_bio+0x7c/0x100
> 7) 6784 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> 8) 6688 48 xfs_buf_iorequest+0x75/0xd0 [xfs]
> 9) 6640 32 _xfs_buf_read+0x36/0x70 [xfs]
> 10) 6608 48 xfs_buf_read+0xda/0x110 [xfs]
> 11) 6560 80 xfs_trans_read_buf+0x2a7/0x410 [xfs]
> 12) 6480 80 xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
> 13) 6400 80 xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
> 14) 6320 176 xfs_btree_lookup+0xd7/0x490 [xfs]
> 15) 6144 16 xfs_alloc_lookup_eq+0x19/0x20 [xfs]
> 16) 6128 96 xfs_alloc_fixup_trees+0xee/0x350 [xfs]
> 17) 6032 144 xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
> 18) 5888 32 xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
> 19) 5856 96 xfs_alloc_vextent+0x49f/0x630 [xfs]
> 20) 5760 160 xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
> 21) 5600 208 xfs_btree_split+0xb3/0x6a0 [xfs]
> 22) 5392 96 xfs_btree_make_block_unfull+0x151/0x190 [xfs]
> 23) 5296 224 xfs_btree_insrec+0x39c/0x5b0 [xfs]
> 24) 5072 128 xfs_btree_insert+0x86/0x180 [xfs]
> 25) 4944 352 xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
> 26) 4592 208 xfs_bmap_add_extent+0x41c/0x450 [xfs]
> 27) 4384 448 xfs_bmapi+0x982/0x1200 [xfs]

This one, I'm afraid, has always been big.

> 28) 3936 256 xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
> 29) 3680 208 xfs_iomap+0x3d8/0x410 [xfs]
> 30) 3472 32 xfs_map_blocks+0x2c/0x30 [xfs]
> 31) 3440 256 xfs_page_state_convert+0x443/0x730 [xfs]
> 32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
> 33) 3120 384 shrink_page_list+0x65e/0x840
> 34) 2736 528 shrink_zone+0x63f/0xe10

that's a nice one (actually the two together at > 900 bytes, ouch)

> 35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
> 36) 2096 128 try_to_free_pages+0x77/0x80
> 37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
> 38) 1728 48 alloc_pages_current+0x8c/0xe0
> 39) 1680 16 __get_free_pages+0xe/0x50
> 40) 1664 48 __pollwait+0xca/0x110
> 41) 1616 32 unix_poll+0x28/0xc0
> 42) 1584 16 sock_poll+0x1d/0x20
> 43) 1568 912 do_select+0x3d6/0x700

912, ouch!

int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
{
ktime_t expire, *to = NULL;
struct poll_wqueues table;

(gdb) p sizeof(struct poll_wqueues)
$1 = 624

I guess that's been there forever, though.
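
Most of that is the inline_entries[] array that poll_wqueues keeps on the
stack so the common case doesn't have to allocate poll table pages - from
include/linux/poll.h, roughly:

#define WQUEUES_STACK_ALLOC	(MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
#define N_INLINE_POLL_ENTRIES	(WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))

struct poll_wqueues {
	poll_table pt;
	struct poll_table_page *table;
	struct task_struct *polling_task;
	int triggered;
	int error;
	int inline_index;
	struct poll_table_entry inline_entries[N_INLINE_POLL_ENTRIES];
};

A nice optimisation for select(), but it all lands on the stack.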

> 44) 656 416 core_sys_select+0x18c/0x2c0

416 hurts too.

The xfs callchain is deep, no doubt, but the combination of the select path
and the shrink calls is almost 2k in just a few calls, and that doesn't
help much.

-Eric

> 45) 240 112 sys_select+0x4f/0x110
> 46) 128 128 system_call_fastpath+0x16/0x1b
>
>

2010-04-07 23:44:00

by Dave Chinner

Subject: Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

[added linux-mm]

On Wed, Apr 07, 2010 at 04:57:11PM +0100, John Berthels wrote:
> Dave Chinner wrote:
> >I'm not seeing stacks deeper than about 5.6k on XFS under heavy write
> >loads. That's nowhere near blowing an 8k stack, so there must be
> >something special about what you are doing. Can you post the stack
> >traces that are being generated for the deepest stack generated -
> >/sys/kernel/debug/tracing/stack_trace should contain it.
> Appended below. That doesn't seem to reach 8192 but the box it's
> from has logged:
>
> [74649.579386] apache2 used greatest stack depth: 7024 bytes left
>
> full dmesg (gzipped) attached.
> >What is generating the write load?
>
> WebDAV PUTs in a modified mogilefs cluster, running
> apache-mpm-worker (threaded) as the DAV server. The write load is a
> mix of internet-upload speed writers trickling files up and some
> local fast replicators copying from elsewhere in the cluster. mpm
> worker cfg is:
>
> ServerLimit 20
> StartServers 5
> MaxClients 300
> MinSpareThreads 25
> MaxSpareThreads 75
> ThreadsPerChild 30
> MaxRequestsPerChild 0
>
> File sizes are a mix of small to large (4GB+). Each disk is exported
> as a mogile device, so it's possible for mogile to pound a single
> disk with lots of write activity (if the random number generator
> decides to put lots of files on that device at the same time).
>
> We're also seeing occasional slowdowns + high load avg (up to ~300,
> i.e. MaxClients) with a corresponding number of threads in D state.
> (This slowdown + high load avg seems to correlate with what would
> have previously caused a panic on the THREAD_ORDER 1, but not 100%
> sure).
>
> As you can see from the dmesg, this trips the "task xxx blocked for
> more than 120 seconds." on some of the threads.
>
> Don't know if that's related to the stack issue or to be expected
> under the load.

It looks to be caused by direct memory reclaim trying to clean pages
with a significant amount of stack already in use. Basically there
is not enough stack space left for the XFS ->writepage path to
execute in. I can't see any quick fix for this, so you are
probably best to run with a larger stack for the moment.

As it is, I don't think direct memory reclaim should be cleaning
dirty file pages - it should be leaving that to the writeback
threads (which are far more efficient at it) or, as a
last resort, kswapd. Direct memory reclaim is invoked with an
unknown amount of stack already in use, so there is never any
guarantee that there is enough stack space left to enter the
->writepage path of any filesystem.

MM-folk - have there been any changes recently to writeback of
pages from direct reclaim that may have caused this,
or have we just been lucky for a really long time?

Cheers,

Dave.

> Depth Size Location (47 entries)
> ----- ---- --------
> 0) 7568 16 mempool_alloc_slab+0x16/0x20
> 1) 7552 144 mempool_alloc+0x65/0x140
> 2) 7408 96 get_request+0x124/0x370
> 3) 7312 144 get_request_wait+0x29/0x1b0
> 4) 7168 96 __make_request+0x9b/0x490
> 5) 7072 208 generic_make_request+0x3df/0x4d0
> 6) 6864 80 submit_bio+0x7c/0x100
> 7) 6784 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> 8) 6688 48 xfs_buf_iorequest+0x75/0xd0 [xfs]
> 9) 6640 32 _xfs_buf_read+0x36/0x70 [xfs]
> 10) 6608 48 xfs_buf_read+0xda/0x110 [xfs]
> 11) 6560 80 xfs_trans_read_buf+0x2a7/0x410 [xfs]
> 12) 6480 80 xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
> 13) 6400 80 xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
> 14) 6320 176 xfs_btree_lookup+0xd7/0x490 [xfs]
> 15) 6144 16 xfs_alloc_lookup_eq+0x19/0x20 [xfs]
> 16) 6128 96 xfs_alloc_fixup_trees+0xee/0x350 [xfs]
> 17) 6032 144 xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
> 18) 5888 32 xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
> 19) 5856 96 xfs_alloc_vextent+0x49f/0x630 [xfs]
> 20) 5760 160 xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
> 21) 5600 208 xfs_btree_split+0xb3/0x6a0 [xfs]
> 22) 5392 96 xfs_btree_make_block_unfull+0x151/0x190 [xfs]
> 23) 5296 224 xfs_btree_insrec+0x39c/0x5b0 [xfs]
> 24) 5072 128 xfs_btree_insert+0x86/0x180 [xfs]
> 25) 4944 352 xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
> 26) 4592 208 xfs_bmap_add_extent+0x41c/0x450 [xfs]
> 27) 4384 448 xfs_bmapi+0x982/0x1200 [xfs]
> 28) 3936 256 xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
> 29) 3680 208 xfs_iomap+0x3d8/0x410 [xfs]
> 30) 3472 32 xfs_map_blocks+0x2c/0x30 [xfs]
> 31) 3440 256 xfs_page_state_convert+0x443/0x730 [xfs]
> 32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
> 33) 3120 384 shrink_page_list+0x65e/0x840
> 34) 2736 528 shrink_zone+0x63f/0xe10
> 35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
> 36) 2096 128 try_to_free_pages+0x77/0x80
> 37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
> 38) 1728 48 alloc_pages_current+0x8c/0xe0
> 39) 1680 16 __get_free_pages+0xe/0x50
> 40) 1664 48 __pollwait+0xca/0x110
> 41) 1616 32 unix_poll+0x28/0xc0
> 42) 1584 16 sock_poll+0x1d/0x20
> 43) 1568 912 do_select+0x3d6/0x700
> 44) 656 416 core_sys_select+0x18c/0x2c0
> 45) 240 112 sys_select+0x4f/0x110
> 46) 128 128 system_call_fastpath+0x16/0x1b
>



--
Dave Chinner
[email protected]

2010-04-08 03:04:13

by Dave Chinner

Subject: Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

On Thu, Apr 08, 2010 at 09:43:41AM +1000, Dave Chinner wrote:
> [added linux-mm]

Now really added linux-mm.

And there's a patch attached that stops direct reclaim from writing
back dirty pages - it seems to work fine from some rough testing
I've done. Perhaps you might want to give it a spin on a
test box, John?

> On Wed, Apr 07, 2010 at 04:57:11PM +0100, John Berthels wrote:
> > Dave Chinner wrote:
> > >I'm not seeing stacks deeper than about 5.6k on XFS under heavy write
> > >loads. That's nowhere near blowing an 8k stack, so there must be
> > >something special about what you are doing. Can you post the stack
> > >traces that are being generated for the deepest stack generated -
> > >/sys/kernel/debug/tracing/stack_trace should contain it.
> > Appended below. That doesn't seem to reach 8192 but the box it's
> > from has logged:
> >
> > [74649.579386] apache2 used greatest stack depth: 7024 bytes left
> >
> > full dmesg (gzipped) attached.
> > >What is generating the write load?
> >
> > WebDAV PUTs in a modified mogilefs cluster, running
> > apache-mpm-worker (threaded) as the DAV server. The write load is a
> > mix of internet-upload speed writers trickling files up and some
> > local fast replicators copying from elsewhere in the cluster. mpm
> > worker cfg is:
> >
> > ServerLimit 20
> > StartServers 5
> > MaxClients 300
> > MinSpareThreads 25
> > MaxSpareThreads 75
> > ThreadsPerChild 30
> > MaxRequestsPerChild 0
> >
> > File sizes are a mix of small to large (4GB+). Each disk is exported
> > as a mogile device, so it's possible for mogile to pound a single
> > disk with lots of write activity (if the random number generator
> > decides to put lots of files on that device at the same time).
> >
> > We're also seeing occasional slowdowns + high load avg (up to ~300,
> > i.e. MaxClients) with a corresponding number of threads in D state.
> > (This slowdown + high load avg seems to correlate with what would
> > have previously caused a panic on the THREAD_ORDER 1, but not 100%
> > sure).
> >
> > As you can see from the dmesg, this trips the "task xxx blocked for
> > more than 120 seconds." on some of the threads.
> >
> > Don't know if that's related to the stack issue or to be expected
> > under the load.
>
> It looks to be caused by direct memory reclaim trying to clean pages
> with a significant amount of stack already in use. basically there
> is not enough stack space left for the XFS ->writepage path to
> execute in. I can't see any fast fix for this occurring, so you are
> probably best to run with a larger stack for the moment.
>
> As it is, I don't think direct memory reclim should be cleaning
> dirty file pages - it should be leaving that to the writeback
> threads (which are far more efficient at it) or, as a
> last resort, kswapd. Direct memory reclaim is invoked with an
> unknown amount of stack already in use, so there is never any
> guarantee that there is enough stack space left to enter the
> ->writepage path of any filesystem.
>
> MM-folk - have there been any changes recently to writeback of
> pages from direct reclaim that may have caused this,
> or have we just been lucky for a really long time?
>
> Cheers,
>
> Dave.
>
> > Depth Size Location (47 entries)
> > ----- ---- --------
> > 0) 7568 16 mempool_alloc_slab+0x16/0x20
> > 1) 7552 144 mempool_alloc+0x65/0x140
> > 2) 7408 96 get_request+0x124/0x370
> > 3) 7312 144 get_request_wait+0x29/0x1b0
> > 4) 7168 96 __make_request+0x9b/0x490
> > 5) 7072 208 generic_make_request+0x3df/0x4d0
> > 6) 6864 80 submit_bio+0x7c/0x100
> > 7) 6784 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> > 8) 6688 48 xfs_buf_iorequest+0x75/0xd0 [xfs]
> > 9) 6640 32 _xfs_buf_read+0x36/0x70 [xfs]
> > 10) 6608 48 xfs_buf_read+0xda/0x110 [xfs]
> > 11) 6560 80 xfs_trans_read_buf+0x2a7/0x410 [xfs]
> > 12) 6480 80 xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
> > 13) 6400 80 xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
> > 14) 6320 176 xfs_btree_lookup+0xd7/0x490 [xfs]
> > 15) 6144 16 xfs_alloc_lookup_eq+0x19/0x20 [xfs]
> > 16) 6128 96 xfs_alloc_fixup_trees+0xee/0x350 [xfs]
> > 17) 6032 144 xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
> > 18) 5888 32 xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
> > 19) 5856 96 xfs_alloc_vextent+0x49f/0x630 [xfs]
> > 20) 5760 160 xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
> > 21) 5600 208 xfs_btree_split+0xb3/0x6a0 [xfs]
> > 22) 5392 96 xfs_btree_make_block_unfull+0x151/0x190 [xfs]
> > 23) 5296 224 xfs_btree_insrec+0x39c/0x5b0 [xfs]
> > 24) 5072 128 xfs_btree_insert+0x86/0x180 [xfs]
> > 25) 4944 352 xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
> > 26) 4592 208 xfs_bmap_add_extent+0x41c/0x450 [xfs]
> > 27) 4384 448 xfs_bmapi+0x982/0x1200 [xfs]
> > 28) 3936 256 xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
> > 29) 3680 208 xfs_iomap+0x3d8/0x410 [xfs]
> > 30) 3472 32 xfs_map_blocks+0x2c/0x30 [xfs]
> > 31) 3440 256 xfs_page_state_convert+0x443/0x730 [xfs]
> > 32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
> > 33) 3120 384 shrink_page_list+0x65e/0x840
> > 34) 2736 528 shrink_zone+0x63f/0xe10
> > 35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
> > 36) 2096 128 try_to_free_pages+0x77/0x80
> > 37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
> > 38) 1728 48 alloc_pages_current+0x8c/0xe0
> > 39) 1680 16 __get_free_pages+0xe/0x50
> > 40) 1664 48 __pollwait+0xca/0x110
> > 41) 1616 32 unix_poll+0x28/0xc0
> > 42) 1584 16 sock_poll+0x1d/0x20
> > 43) 1568 912 do_select+0x3d6/0x700
> > 44) 656 416 core_sys_select+0x18c/0x2c0
> > 45) 240 112 sys_select+0x4f/0x110
> > 46) 128 128 system_call_fastpath+0x16/0x1b

--
Dave Chinner
[email protected]

mm: disallow direct reclaim page writeback

From: Dave Chinner <[email protected]>

When we enter direct reclaim we may have used an arbitrary amount of stack
space, and hence entering the filesystem to do writeback can then lead to
stack overruns.

Writeback from direct reclaim is a bad idea, anyway. The background flusher
threads should be taking care of cleaning dirty pages, and direct reclaim will
kick them if they aren't already doing work. If direct reclaim is also calling
->writepage, it will cause the IO patterns from the background flusher threads
to be upset by LRU-order writeback from pageout(). Having competing sources of
IO trying to clean pages on the same backing device reduces throughput by
increasing the number of seeks that the backing device has to do to write back
the pages.

Hence for direct reclaim we should not allow ->writepages to be entered at all.
Set up the relevant scan_control structures to enforce this, and prevent
sc->may_writepage from being set in other places in the direct reclaim path in
response to other events.

Signed-off-by: Dave Chinner <[email protected]>
---
mm/vmscan.c | 13 ++++++-------
1 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f293372..3c194f4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1829,10 +1829,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
* writeout. So in laptop mode, write out the whole world.
*/
writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
- if (total_scanned > writeback_threshold) {
+ if (total_scanned > writeback_threshold)
wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
- sc->may_writepage = 1;
- }

/* Take a nap, wait for some writeback to complete */
if (!sc->hibernation_mode && sc->nr_scanned &&
@@ -1874,7 +1872,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
{
struct scan_control sc = {
.gfp_mask = gfp_mask,
- .may_writepage = !laptop_mode,
+ .may_writepage = 0,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
.may_unmap = 1,
.may_swap = 1,
@@ -1896,7 +1894,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
struct zone *zone, int nid)
{
struct scan_control sc = {
- .may_writepage = !laptop_mode,
+ .may_writepage = 0,
.may_unmap = 1,
.may_swap = !noswap,
.swappiness = swappiness,
@@ -1929,7 +1927,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
{
struct zonelist *zonelist;
struct scan_control sc = {
- .may_writepage = !laptop_mode,
+ .may_writepage = 0,
.may_unmap = 1,
.may_swap = !noswap,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
@@ -2570,7 +2568,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
struct reclaim_state reclaim_state;
int priority;
struct scan_control sc = {
- .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
+ .may_writepage = (current_is_kswapd() &&
+ (zone_reclaim_mode & RECLAIM_WRITE)),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
.may_swap = 1,
.nr_to_reclaim = max_t(unsigned long, nr_pages,

2010-04-08 12:23:35

by John Berthels

Subject: Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

Dave Chinner wrote:
> On Thu, Apr 08, 2010 at 09:43:41AM +1000, Dave Chinner wrote:
>
> And there's a patch attached that stops direct reclaim from writing
> back dirty pages - it seems to work fine from some rough testing
> I've done. Perhaps you might want to give it a spin on a
> test box, John?
>
Thanks very much for this. The patch is in and soaking on a THREAD_ORDER
1 kernel (2.6.33.2 + patch + stack instrumentation), so far so good, but
it's early days. After about 2hrs of uptime:

$ dmesg | grep stack | tail -1
[ 60.350766] apache2 used greatest stack depth: 2544 bytes left

(8192 - 2544 = 5648 bytes used, which tallies well with your ~5.6k usage figure).

I'll reply again after it's been running long enough to draw conclusions.

jb

2010-04-08 14:45:28

by John Berthels

Subject: Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

John Berthels wrote:
> I'll reply again after it's been running long enough to draw conclusions.
We're getting pretty close on the 8k stack on this box now. It's running
2.6.33.2 + your patch, with THREAD_ORDER 1, stack tracing and
CONFIG_LOCKDEP=y. (Sorry that LOCKDEP is on; please advise if that's
going to throw off the figures and we'll restart the test systems with new
kernels).

This is significantly more than 5.6K, so it shows a potential problem?
Or is 720 bytes enough headroom?

jb

[ 4005.541869] apache2 used greatest stack depth: 2480 bytes left
[ 4005.541973] apache2 used greatest stack depth: 2240 bytes left
[ 4005.542070] apache2 used greatest stack depth: 1936 bytes left
[ 4005.542614] apache2 used greatest stack depth: 1616 bytes left
[ 5531.406529] apache2 used greatest stack depth: 720 bytes left

$ cat /sys/kernel/debug/tracing/stack_trace
Depth Size Location (55 entries)
----- ---- --------
0) 7440 48 add_partial+0x26/0x90
1) 7392 64 __slab_free+0x1a9/0x380
2) 7328 64 kmem_cache_free+0xb9/0x160
3) 7264 16 free_buffer_head+0x25/0x50
4) 7248 64 try_to_free_buffers+0x79/0xc0
5) 7184 160 xfs_vm_releasepage+0xda/0x130 [xfs]
6) 7024 16 try_to_release_page+0x33/0x60
7) 7008 384 shrink_page_list+0x585/0x860
8) 6624 528 shrink_zone+0x636/0xdc0
9) 6096 112 do_try_to_free_pages+0xc2/0x3c0
10) 5984 112 try_to_free_pages+0x64/0x70
11) 5872 256 __alloc_pages_nodemask+0x3d2/0x710
12) 5616 48 alloc_pages_current+0x8c/0xe0
13) 5568 32 __page_cache_alloc+0x67/0x70
14) 5536 80 find_or_create_page+0x50/0xb0
15) 5456 160 _xfs_buf_lookup_pages+0x145/0x350 [xfs]
16) 5296 64 xfs_buf_get+0x74/0x1d0 [xfs]
17) 5232 48 xfs_buf_read+0x2f/0x110 [xfs]
18) 5184 80 xfs_trans_read_buf+0x2bf/0x430 [xfs]
19) 5104 80 xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
20) 5024 80 xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
21) 4944 176 xfs_btree_lookup+0xd7/0x490 [xfs]
22) 4768 16 xfs_alloc_lookup_ge+0x1c/0x20 [xfs]
23) 4752 144 xfs_alloc_ag_vextent_near+0x58/0xb30 [xfs]
24) 4608 32 xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
25) 4576 96 xfs_alloc_vextent+0x49f/0x630 [xfs]
26) 4480 160 xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
27) 4320 208 xfs_btree_split+0xb3/0x6a0 [xfs]
28) 4112 96 xfs_btree_make_block_unfull+0x151/0x190 [xfs]
29) 4016 224 xfs_btree_insrec+0x39c/0x5b0 [xfs]
30) 3792 128 xfs_btree_insert+0x86/0x180 [xfs]
31) 3664 352 xfs_bmap_add_extent_delay_real+0x564/0x1670 [xfs]
32) 3312 208 xfs_bmap_add_extent+0x41c/0x450 [xfs]
33) 3104 448 xfs_bmapi+0x982/0x1200 [xfs]
34) 2656 256 xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
35) 2400 208 xfs_iomap+0x3d8/0x410 [xfs]
36) 2192 32 xfs_map_blocks+0x2c/0x30 [xfs]
37) 2160 256 xfs_page_state_convert+0x443/0x730 [xfs]
38) 1904 64 xfs_vm_writepage+0xab/0x160 [xfs]
39) 1840 32 __writepage+0x1a/0x60
40) 1808 288 write_cache_pages+0x1f7/0x400
41) 1520 16 generic_writepages+0x27/0x30
42) 1504 48 xfs_vm_writepages+0x5a/0x70 [xfs]
43) 1456 16 do_writepages+0x24/0x40
44) 1440 64 writeback_single_inode+0xf1/0x3e0
45) 1376 128 writeback_inodes_wb+0x31e/0x510
46) 1248 16 writeback_inodes_wbc+0x1e/0x20
47) 1232 224 balance_dirty_pages_ratelimited_nr+0x277/0x410
48) 1008 192 generic_file_buffered_write+0x19b/0x240
49) 816 288 xfs_write+0x849/0x930 [xfs]
50) 528 16 xfs_file_aio_write+0x5b/0x70 [xfs]
51) 512 272 do_sync_write+0xd1/0x120
52) 240 48 vfs_write+0xcb/0x1a0
53) 192 64 sys_write+0x55/0x90
54) 128 128 system_call_fastpath+0x16/0x1b

2010-04-08 16:15:50

by John Berthels

Subject: Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

John Berthels wrote:
> John Berthels wrote:
>> I'll reply again after it's been running long enough to draw
>> conclusions.
The box with patch+THREAD_ORDER 1+LOCKDEP went down (with no further
logging retrievable by /var/log/messages or netconsole).

We're loading up a 2.6.33.2 + patch + THREAD_ORDER 2 (no LOCKDEP) to get
better info as to whether we are still blowing the 8k limit with the
patch in place.

jb

2010-04-08 23:39:20

by Dave Chinner

Subject: Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

On Thu, Apr 08, 2010 at 03:47:54PM +0100, John Berthels wrote:
> John Berthels wrote:
> >I'll reply again after it's been running long enough to draw conclusions.
> We're getting pretty close on the 8k stack on this box now. It's
> running 2.6.33.2 + your patch, with THREAD_ORDER 1, stack tracing
> and CONFIG_LOCKDEP=y. (Sorry that LOCKDEP is on, please advise if
> that's going to throw the figures and we'll restart the test systems
> with new kernels).
>
> This is significantly more than 5.6K, so it shows a potential
> problem? Or is 720 bytes enough headroom?
>
> jb
>
> [ 4005.541869] apache2 used greatest stack depth: 2480 bytes left
> [ 4005.541973] apache2 used greatest stack depth: 2240 bytes left
> [ 4005.542070] apache2 used greatest stack depth: 1936 bytes left
> [ 4005.542614] apache2 used greatest stack depth: 1616 bytes left
> [ 5531.406529] apache2 used greatest stack depth: 720 bytes left
>
> $ cat /sys/kernel/debug/tracing/stack_trace
> Depth Size Location (55 entries)
> ----- ---- --------
> 0) 7440 48 add_partial+0x26/0x90
> 1) 7392 64 __slab_free+0x1a9/0x380
> 2) 7328 64 kmem_cache_free+0xb9/0x160
> 3) 7264 16 free_buffer_head+0x25/0x50
> 4) 7248 64 try_to_free_buffers+0x79/0xc0
> 5) 7184 160 xfs_vm_releasepage+0xda/0x130 [xfs]
> 6) 7024 16 try_to_release_page+0x33/0x60
> 7) 7008 384 shrink_page_list+0x585/0x860
> 8) 6624 528 shrink_zone+0x636/0xdc0
> 9) 6096 112 do_try_to_free_pages+0xc2/0x3c0
> 10) 5984 112 try_to_free_pages+0x64/0x70
> 11) 5872 256 __alloc_pages_nodemask+0x3d2/0x710
> 12) 5616 48 alloc_pages_current+0x8c/0xe0
> 13) 5568 32 __page_cache_alloc+0x67/0x70
> 14) 5536 80 find_or_create_page+0x50/0xb0
> 15) 5456 160 _xfs_buf_lookup_pages+0x145/0x350 [xfs]
> 16) 5296 64 xfs_buf_get+0x74/0x1d0 [xfs]
> 17) 5232 48 xfs_buf_read+0x2f/0x110 [xfs]
> 18) 5184 80 xfs_trans_read_buf+0x2bf/0x430 [xfs]

We're entering memory reclaim with almost 6k of stack already in
use. If we get down into the IO layer and then have to do a memory
reclaim, then we'll have even less stack to work with. It looks like
memory allocation needs at least 2KB of stack to work with now,
so if we enter anywhere near the top of the stack we can blow it...

Basically this trace is telling us the stack we have to work with
is:

2KB memory allocation
4KB page writeback
2KB write foreground throttling path

So effectively the storage subsystem (NFS, filesystem, DM, MD,
device drivers) has about 4K of stack to work in now. That seems to
be a lot less than last time I looked at this, and we've been really
careful not to increase XFS's stack usage for quite some time now.

Hence I'm not sure exactly what to do about this, John. I can't
really do much about the stack footprint of XFS as all the
low-hanging fruit has already been trimmed. Even if I convert the
foreground throttling to not issue IO, the background flush threads
still have roughly the same stack usage, so a memory allocation and
reclaim in the wrong place could still blow the stack....

I'll have to have a bit of a think on this one - if you could
provide further stack traces as they get deeper (esp. if they go
past 8k) that would be really handy.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-09 11:39:56

by Chris Mason

Subject: Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

On Fri, Apr 09, 2010 at 09:38:37AM +1000, Dave Chinner wrote:
> On Thu, Apr 08, 2010 at 03:47:54PM +0100, John Berthels wrote:
> > John Berthels wrote:
> > >I'll reply again after it's been running long enough to draw conclusions.
> > We're getting pretty close on the 8k stack on this box now. It's
> > running 2.6.33.2 + your patch, with THREAD_ORDER 1, stack tracing
> > and CONFIG_LOCKDEP=y. (Sorry that LOCKDEP is on, please advise if
> > that's going to throw the figures and we'll restart the test systems
> > with new kernels).
> >
> > This is significantly more than 5.6K, so it shows a potential
> > problem? Or is 720 bytes enough headroom?
> >
> > jb
> >
> > [ 4005.541869] apache2 used greatest stack depth: 2480 bytes left
> > [ 4005.541973] apache2 used greatest stack depth: 2240 bytes left
> > [ 4005.542070] apache2 used greatest stack depth: 1936 bytes left
> > [ 4005.542614] apache2 used greatest stack depth: 1616 bytes left
> > [ 5531.406529] apache2 used greatest stack depth: 720 bytes left
> >
> > $ cat /sys/kernel/debug/tracing/stack_trace
> > Depth Size Location (55 entries)
> > ----- ---- --------
> > 0) 7440 48 add_partial+0x26/0x90
> > 1) 7392 64 __slab_free+0x1a9/0x380
> > 2) 7328 64 kmem_cache_free+0xb9/0x160
> > 3) 7264 16 free_buffer_head+0x25/0x50
> > 4) 7248 64 try_to_free_buffers+0x79/0xc0
> > 5) 7184 160 xfs_vm_releasepage+0xda/0x130 [xfs]
> > 6) 7024 16 try_to_release_page+0x33/0x60
> > 7) 7008 384 shrink_page_list+0x585/0x860
> > 8) 6624 528 shrink_zone+0x636/0xdc0
> > 9) 6096 112 do_try_to_free_pages+0xc2/0x3c0
> > 10) 5984 112 try_to_free_pages+0x64/0x70
> > 11) 5872 256 __alloc_pages_nodemask+0x3d2/0x710
> > 12) 5616 48 alloc_pages_current+0x8c/0xe0
> > 13) 5568 32 __page_cache_alloc+0x67/0x70
> > 14) 5536 80 find_or_create_page+0x50/0xb0
> > 15) 5456 160 _xfs_buf_lookup_pages+0x145/0x350 [xfs]
> > 16) 5296 64 xfs_buf_get+0x74/0x1d0 [xfs]
> > 17) 5232 48 xfs_buf_read+0x2f/0x110 [xfs]
> > 18) 5184 80 xfs_trans_read_buf+0x2bf/0x430 [xfs]
>
> We're entering memory reclaim with almost 6k of stack already in
> use. If we get down into the IO layer and then have to do a memory
> reclaim, then we'll have even less stack to work with. It looks like
> memory allocation needs at least 2KB of stack to work with now,
> so if we enter anywhere near the top of the stack we can blow it...

shrink_zone on my box isn't 500 bytes, but let's try the easy stuff
first. This is against .34; if you have any trouble applying to .32,
just add the word noinline after the word static on the function
definitions.

This makes shrink_zone disappear from my check_stack.pl output.
Basically I think the compiler is inlining the shrink_active_list and
shrink_inactive_list code into shrink_zone.
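
For anyone wanting to reproduce the numbers, the in-tree script does the
same sort of thing:

$ objdump -d mm/vmscan.o | perl scripts/checkstack.pl x86_64

which lists the stack frame each function reserves.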

-chris

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..c70593e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -620,7 +620,7 @@ static enum page_references page_check_references(struct page *page,
/*
* shrink_page_list() returns the number of reclaimed pages
*/
-static unsigned long shrink_page_list(struct list_head *page_list,
+static noinline unsigned long shrink_page_list(struct list_head *page_list,
struct scan_control *sc,
enum pageout_io sync_writeback)
{
@@ -1121,7 +1121,7 @@ static int too_many_isolated(struct zone *zone, int file,
* shrink_inactive_list() is a helper for shrink_zone(). It returns the number
* of reclaimed pages
*/
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static noinline unsigned long shrink_inactive_list(unsigned long max_scan,
struct zone *zone, struct scan_control *sc,
int priority, int file)
{
@@ -1341,7 +1341,7 @@ static void move_active_pages_to_lru(struct zone *zone,
__count_vm_events(PGDEACTIVATE, pgmoved);
}

-static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
+static noinline void shrink_active_list(unsigned long nr_pages, struct zone *zone,
struct scan_control *sc, int priority, int file)
{
unsigned long nr_taken;
@@ -1504,7 +1504,7 @@ static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
return inactive_anon_is_low(zone, sc);
}

-static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
+static noinline unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
struct zone *zone, struct scan_control *sc, int priority)
{
int file = is_file_lru(lru);

2010-04-09 13:44:34

by John Berthels

Subject: Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

=== server16 ===

apache2 used greatest stack depth: 7208 bytes left

Depth Size Location (72 entries)
----- ---- --------
0) 8336 304 select_task_rq_fair+0x235/0xad0
1) 8032 96 try_to_wake_up+0x189/0x3f0
2) 7936 16 default_wake_function+0x12/0x20
3) 7920 32 autoremove_wake_function+0x16/0x40
4) 7888 64 __wake_up_common+0x5a/0x90
5) 7824 64 __wake_up+0x48/0x70
6) 7760 64 insert_work+0x9f/0xb0
7) 7696 48 __queue_work+0x36/0x50
8) 7648 16 queue_work_on+0x4d/0x60
9) 7632 16 queue_work+0x1f/0x30
10) 7616 16 queue_delayed_work+0x2d/0x40
11) 7600 32 ata_pio_queue_task+0x35/0x40
12) 7568 48 ata_sff_qc_issue+0x146/0x2f0
13) 7520 96 mv_qc_issue+0x12d/0x540 [sata_mv]
14) 7424 96 ata_qc_issue+0x1fe/0x320
15) 7328 64 ata_scsi_translate+0xae/0x1a0
16) 7264 64 ata_scsi_queuecmd+0xbf/0x2f0
17) 7200 48 scsi_dispatch_cmd+0x114/0x2b0
18) 7152 96 scsi_request_fn+0x419/0x590
19) 7056 32 __blk_run_queue+0x82/0x150
20) 7024 48 elv_insert+0x1aa/0x2d0
21) 6976 48 __elv_add_request+0x83/0xd0
22) 6928 96 __make_request+0x139/0x490
23) 6832 208 generic_make_request+0x3df/0x4d0
24) 6624 80 submit_bio+0x7c/0x100
25) 6544 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
26) 6448 48 xfs_buf_iorequest+0x75/0xd0 [xfs]
27) 6400 32 xlog_bdstrat_cb+0x4d/0x60 [xfs]
28) 6368 80 xlog_sync+0x218/0x510 [xfs]
29) 6288 64 xlog_state_release_iclog+0xbb/0x100 [xfs]
30) 6224 160 xlog_state_sync+0x1ab/0x230 [xfs]
31) 6064 32 _xfs_log_force+0x5a/0x80 [xfs]
32) 6032 32 xfs_log_force+0x18/0x40 [xfs]
33) 6000 64 xfs_alloc_search_busy+0x14b/0x160 [xfs]
34) 5936 112 xfs_alloc_get_freelist+0x130/0x170 [xfs]
35) 5824 48 xfs_allocbt_alloc_block+0x33/0x70 [xfs]
36) 5776 208 xfs_btree_split+0xb3/0x6a0 [xfs]
37) 5568 96 xfs_btree_make_block_unfull+0x151/0x190 [xfs]
38) 5472 224 xfs_btree_insrec+0x39c/0x5b0 [xfs]
39) 5248 128 xfs_btree_insert+0x86/0x180 [xfs]
40) 5120 144 xfs_free_ag_extent+0x33b/0x7b0 [xfs]
41) 4976 224 xfs_alloc_fix_freelist+0x120/0x490 [xfs]
42) 4752 96 xfs_alloc_vextent+0x1f5/0x630 [xfs]
43) 4656 272 xfs_bmap_btalloc+0x497/0xa70 [xfs]
44) 4384 16 xfs_bmap_alloc+0x21/0x40 [xfs]
45) 4368 448 xfs_bmapi+0x85e/0x1200 [xfs]
46) 3920 256 xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
47) 3664 208 xfs_iomap+0x3d8/0x410 [xfs]
48) 3456 32 xfs_map_blocks+0x2c/0x30 [xfs]
49) 3424 256 xfs_page_state_convert+0x443/0x730 [xfs]
50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs]
51) 3104 384 shrink_page_list+0x65e/0x840
52) 2720 528 shrink_zone+0x63f/0xe10
53) 2192 112 do_try_to_free_pages+0xc2/0x3c0
54) 2080 128 try_to_free_pages+0x77/0x80
55) 1952 240 __alloc_pages_nodemask+0x3e4/0x710
56) 1712 48 alloc_pages_current+0x8c/0xe0
57) 1664 32 __page_cache_alloc+0x67/0x70
58) 1632 144 __do_page_cache_readahead+0xd3/0x220
59) 1488 16 ra_submit+0x21/0x30
60) 1472 80 ondemand_readahead+0x11d/0x250
61) 1392 64 page_cache_async_readahead+0xa9/0xe0
62) 1328 592 __generic_file_splice_read+0x48a/0x530
63) 736 48 generic_file_splice_read+0x4f/0x90
64) 688 96 xfs_splice_read+0xf2/0x130 [xfs]
65) 592 32 xfs_file_splice_read+0x4b/0x50 [xfs]
66) 560 64 do_splice_to+0x77/0xb0
67) 496 112 splice_direct_to_actor+0xcc/0x1c0
68) 384 80 do_splice_direct+0x57/0x80
69) 304 96 do_sendfile+0x16c/0x1e0
70) 208 80 sys_sendfile64+0x8d/0xb0
71) 128 128 system_call_fastpath+0x16/0x1b

=== server9 ===

[223269.859411] apache2 used greatest stack depth: 7088 bytes left

Depth Size Location (62 entries)
----- ---- --------

0) 8528 32 down_trylock+0x1e/0x50
1) 8496 80 _xfs_buf_find+0x12f/0x290 [xfs]
2) 8416 64 xfs_buf_get+0x61/0x1c0 [xfs]
3) 8352 48 xfs_buf_read+0x2f/0x110 [xfs]
4) 8304 48 xfs_buf_readahead+0x61/0x90 [xfs]
5) 8256 48 xfs_btree_readahead_sblock+0xea/0xf0 [xfs]
6) 8208 16 xfs_btree_readahead+0x5f/0x90 [xfs]
7) 8192 112 xfs_btree_increment+0x2e/0x2b0 [xfs]
8) 8080 176 xfs_btree_rshift+0x2f2/0x530 [xfs]
9) 7904 272 xfs_btree_delrec+0x4a3/0x1020 [xfs]
10) 7632 64 xfs_btree_delete+0x40/0xd0 [xfs]
11) 7568 96 xfs_alloc_fixup_trees+0x7d/0x350 [xfs]
12) 7472 144 xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
13) 7328 32 xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
14) 7296 96 xfs_alloc_vextent+0x49f/0x630 [xfs]
15) 7200 160 xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
16) 7040 208 xfs_btree_split+0xb3/0x6a0 [xfs]
17) 6832 96 xfs_btree_make_block_unfull+0x151/0x190 [xfs]
18) 6736 224 xfs_btree_insrec+0x39c/0x5b0 [xfs]
19) 6512 128 xfs_btree_insert+0x86/0x180 [xfs]
20) 6384 352 xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
21) 6032 208 xfs_bmap_add_extent+0x41c/0x450 [xfs]
22) 5824 448 xfs_bmapi+0x982/0x1200 [xfs]
23) 5376 256 xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
24) 5120 208 xfs_iomap+0x3d8/0x410 [xfs]
25) 4912 32 xfs_map_blocks+0x2c/0x30 [xfs]
26) 4880 256 xfs_page_state_convert+0x443/0x730 [xfs]
27) 4624 64 xfs_vm_writepage+0xab/0x160 [xfs]
28) 4560 384 shrink_page_list+0x65e/0x840
29) 4176 528 shrink_zone+0x63f/0xe10
30) 3648 112 do_try_to_free_pages+0xc2/0x3c0
31) 3536 128 try_to_free_pages+0x77/0x80
32) 3408 240 __alloc_pages_nodemask+0x3e4/0x710
33) 3168 48 alloc_pages_current+0x8c/0xe0
34) 3120 80 new_slab+0x247/0x300
35) 3040 96 __slab_alloc+0x137/0x490
36) 2944 64 kmem_cache_alloc+0x110/0x120
37) 2880 64 kmem_zone_alloc+0x9a/0xe0 [xfs]
38) 2816 32 kmem_zone_zalloc+0x1e/0x50 [xfs]
39) 2784 32 _xfs_trans_alloc+0x38/0x80 [xfs]
40) 2752 96 xfs_trans_alloc+0x9f/0xb0 [xfs]
41) 2656 256 xfs_iomap_write_allocate+0xf1/0x3c0 [xfs]
42) 2400 208 xfs_iomap+0x3d8/0x410 [xfs]
43) 2192 32 xfs_map_blocks+0x2c/0x30 [xfs]
44) 2160 256 xfs_page_state_convert+0x443/0x730 [xfs]
45) 1904 64 xfs_vm_writepage+0xab/0x160 [xfs]
46) 1840 32 __writepage+0x17/0x50
47) 1808 288 write_cache_pages+0x1f7/0x400
48) 1520 16 generic_writepages+0x24/0x30
49) 1504 48 xfs_vm_writepages+0x5c/0x80 [xfs]
50) 1456 16 do_writepages+0x21/0x40
51) 1440 64 writeback_single_inode+0xeb/0x3c0
52) 1376 128 writeback_inodes_wb+0x318/0x510
53) 1248 16 writeback_inodes_wbc+0x1e/0x20
54) 1232 224 balance_dirty_pages_ratelimited_nr+0x269/0x3a0
55) 1008 192 generic_file_buffered_write+0x19b/0x240
56) 816 288 xfs_write+0x837/0x920 [xfs]
57) 528 16 xfs_file_aio_write+0x5b/0x70 [xfs]
58) 512 272 do_sync_write+0xd1/0x120
59) 240 48 vfs_write+0xcb/0x1a0
60) 192 64 sys_write+0x55/0x90
61) 128 128 system_call_fastpath+0x16/0x1b


Attachments:
stack_traces.txt (7.65 kB)

2010-04-09 18:05:11

by Eric Sandeen

Subject: Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

Chris Mason wrote:

> shrink_zone on my box isn't 500 bytes, but lets try the easy stuff
> first. This is against .34, if you have any trouble applying to .32,
> just add the word noinline after the word static on the function
> definitions.
>
> This makes shrink_zone disappear from my check_stack.pl output.
> Basically I think the compiler is inlining the shrink_active_zone and
> shrink_inactive_zone code into shrink_zone.
>
> -chris
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 79c8098..c70593e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -620,7 +620,7 @@ static enum page_references page_check_references(struct page *page,
> /*
> * shrink_page_list() returns the number of reclaimed pages
> */
> -static unsigned long shrink_page_list(struct list_head *page_list,
> +static noinline unsigned long shrink_page_list(struct list_head *page_list,

FWIW akpm suggested that I add:

/*
* Rather than using noinline to prevent stack consumption, use
* noinline_for_stack instead. For documentation reasons.
*/
#define noinline_for_stack noinline

so maybe for a formal submission that'd be good to use.
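
i.e. each hunk would then read something like:

-static unsigned long shrink_page_list(struct list_head *page_list,
+static noinline_for_stack unsigned long shrink_page_list(struct list_head *page_list,

which makes the reason for the noinline obvious to later readers.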


> struct scan_control *sc,
> enum pageout_io sync_writeback)
> {
> @@ -1121,7 +1121,7 @@ static int too_many_isolated(struct zone *zone, int file,
> * shrink_inactive_list() is a helper for shrink_zone(). It returns the number
> * of reclaimed pages
> */
> -static unsigned long shrink_inactive_list(unsigned long max_scan,
> +static noinline unsigned long shrink_inactive_list(unsigned long max_scan,
> struct zone *zone, struct scan_control *sc,
> int priority, int file)
> {
> @@ -1341,7 +1341,7 @@ static void move_active_pages_to_lru(struct zone *zone,
> __count_vm_events(PGDEACTIVATE, pgmoved);
> }
>
> -static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
> +static noinline void shrink_active_list(unsigned long nr_pages, struct zone *zone,
> struct scan_control *sc, int priority, int file)
> {
> unsigned long nr_taken;
> @@ -1504,7 +1504,7 @@ static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
> return inactive_anon_is_low(zone, sc);
> }
>
> -static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
> +static noinline unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
> struct zone *zone, struct scan_control *sc, int priority)
> {
> int file = is_file_lru(lru);
>
>

2010-04-09 18:12:14

by Chris Mason

Subject: Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

On Fri, Apr 09, 2010 at 01:05:05PM -0500, Eric Sandeen wrote:
> Chris Mason wrote:
>
> > shrink_zone on my box isn't 500 bytes, but lets try the easy stuff
> > first. This is against .34, if you have any trouble applying to .32,
> > just add the word noinline after the word static on the function
> > definitions.
> >
> > This makes shrink_zone disappear from my check_stack.pl output.
> > Basically I think the compiler is inlining the shrink_active_zone and
> > shrink_inactive_zone code into shrink_zone.
> >
> > -chris
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 79c8098..c70593e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -620,7 +620,7 @@ static enum page_references page_check_references(struct page *page,
> > /*
> > * shrink_page_list() returns the number of reclaimed pages
> > */
> > -static unsigned long shrink_page_list(struct list_head *page_list,
> > +static noinline unsigned long shrink_page_list(struct list_head *page_list,
>
> FWIW akpm suggested that I add:
>
> /*
> * Rather then using noinline to prevent stack consumption, use
> * noinline_for_stack instead. For documentaiton reasons.
> */
> #define noinline_for_stack noinline
>
> so maybe for a formal submission that'd be good to use.

Oh yeah, I forgot about that one. If the patch actually helps we can
switch it.

-chris

2010-04-13 01:21:19

by Dave Chinner

Subject: Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

On Fri, Apr 09, 2010 at 02:11:08PM -0400, Chris Mason wrote:
> On Fri, Apr 09, 2010 at 01:05:05PM -0500, Eric Sandeen wrote:
> > Chris Mason wrote:
> >
> > > shrink_zone on my box isn't 500 bytes, but lets try the easy stuff
> > > first. This is against .34, if you have any trouble applying to .32,
> > > just add the word noinline after the word static on the function
> > > definitions.
> > >
> > > This makes shrink_zone disappear from my check_stack.pl output.
> > > Basically I think the compiler is inlining the shrink_active_zone and
> > > shrink_inactive_zone code into shrink_zone.
> > >
> > > -chris
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 79c8098..c70593e 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -620,7 +620,7 @@ static enum page_references page_check_references(struct page *page,
> > > /*
> > > * shrink_page_list() returns the number of reclaimed pages
> > > */
> > > -static unsigned long shrink_page_list(struct list_head *page_list,
> > > +static noinline unsigned long shrink_page_list(struct list_head *page_list,
> >
> > FWIW akpm suggested that I add:
> >
> > /*
> > * Rather then using noinline to prevent stack consumption, use
> > * noinline_for_stack instead. For documentaiton reasons.
> > */
> > #define noinline_for_stack noinline
> >
> > so maybe for a formal submission that'd be good to use.
>
> Oh yeah, I forgot about that one. If the patch actually helps we can
> switch it.

Well, given that the largest stack overflow reported was about 800
bytes, I don't think it's enough. All the fat has been trimmed from
XFS long ago, and there isn't that much in the generic code paths
to trim. And if we consider that this isn't including a significant
storage subsystem (i.e. NFS on top and stacked DM+MD+FC below), then
trimming a few hundred bytes is not enough to prevent an 8k stack
being blown sky high.

That is why I was saying I'm not sure what the best way to solve the
problem is - I've got a couple of ideas for fixing the problem in
XFS once and for all, but I'm not sure if they will fly or not
yet, let alone written any code....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-13 09:53:08

by John Berthels

Subject: Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

Chris Mason wrote:
> shrink_zone on my box isn't 500 bytes, but lets try the easy stuff
> first. This is against .34, if you have any trouble applying to .32,
> just add the word noinline after the word static on the function
> definitions.

Hi Chris,

Thanks for this, we've been soaking it for a while and get the stack
trace below (which is still >8k), which still has shrink_zone at 528 bytes.

I find it odd that the shrink_zone stack usage is different on our
systems. This is a stock kernel 2.6.33.2 kernel, x86_64 arch (plus your
patch + Dave Chinner's patch) built using ubuntu make-kpkg, with gcc
(Ubuntu 4.3.3-5ubuntu4) 4.3.3 (.vmscan.o.cmd with full build options is
below, gzipped .config attached).

Can you see any difference between your system and ours which might
explain the discrepancy? I note -g and -pg in there. (Does -pg have any
stack overhead? It seems to be enabled in ubuntu release kernels).

regards,

jb



mm/.vmscan.o.cmd:

cmd_mm/vmscan.o := gcc -Wp,-MD,mm/.vmscan.o.d -nostdinc -isystem
/usr/lib/gcc/x86_64-linux-gnu/4.3.3/include
-I/usr/local/src/kern/linux-2.6.33.2/arch/x86/include -Iinclude
-include include/generated/autoconf.h -D__KERNEL__ -Wall -Wundef
-Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common
-Werror-implicit-function-declaration -Wno-format-security
-fno-delete-null-pointer-checks -O2 -m64 -mtune=generic -mno-red-zone
-mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args
-fstack-protector -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe
-Wno-sign-compare -fno-asynchronous-unwind-tables -mno-sse -mno-mmx
-mno-sse2 -mno-3dnow -fno-omit-frame-pointer -fno-optimize-sibling-calls
-g -pg -Wdeclaration-after-statement -Wno-pointer-sign
-fno-strict-overflow -D"KBUILD_STR(s)=\#s"
-D"KBUILD_BASENAME=KBUILD_STR(vmscan)"
-D"KBUILD_MODNAME=KBUILD_STR(vmscan)" -c -o mm/.tmp_vmscan.o mm/vmscan.c



Apr 12 22:06:35 nas17 kernel: [36346.599076] apache2 used greatest stack
depth: 7904 bytes left
Depth Size Location (56 entries)
----- ---- --------
0) 7904 48 __call_rcu+0x67/0x190
1) 7856 16 call_rcu_sched+0x15/0x20
2) 7840 16 call_rcu+0xe/0x10
3) 7824 272 radix_tree_delete+0x159/0x2e0
4) 7552 32 __remove_from_page_cache+0x21/0x110
5) 7520 64 __remove_mapping+0xe8/0x130
6) 7456 384 shrink_page_list+0x400/0x860
7) 7072 528 shrink_zone+0x636/0xdc0
8) 6544 112 do_try_to_free_pages+0xc2/0x3c0
9) 6432 112 try_to_free_pages+0x64/0x70
10) 6320 256 __alloc_pages_nodemask+0x3d2/0x710
11) 6064 48 alloc_pages_current+0x8c/0xe0
12) 6016 32 __page_cache_alloc+0x67/0x70
13) 5984 80 find_or_create_page+0x50/0xb0
14) 5904 160 _xfs_buf_lookup_pages+0x145/0x350 [xfs]
15) 5744 64 xfs_buf_get+0x74/0x1d0 [xfs]
16) 5680 48 xfs_buf_read+0x2f/0x110 [xfs]
17) 5632 80 xfs_trans_read_buf+0x2bf/0x430 [xfs]
18) 5552 80 xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
19) 5472 176 xfs_btree_rshift+0xd7/0x530 [xfs]
20) 5296 96 xfs_btree_make_block_unfull+0x5b/0x190 [xfs]
21) 5200 224 xfs_btree_insrec+0x39c/0x5b0 [xfs]
22) 4976 128 xfs_btree_insert+0x86/0x180 [xfs]
23) 4848 96 xfs_alloc_fixup_trees+0x1fa/0x350 [xfs]
24) 4752 144 xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
25) 4608 32 xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
26) 4576 96 xfs_alloc_vextent+0x49f/0x630 [xfs]
27) 4480 160 xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
28) 4320 208 xfs_btree_split+0xb3/0x6a0 [xfs]
29) 4112 96 xfs_btree_make_block_unfull+0x151/0x190 [xfs]
30) 4016 224 xfs_btree_insrec+0x39c/0x5b0 [xfs]
31) 3792 128 xfs_btree_insert+0x86/0x180 [xfs]
32) 3664 352 xfs_bmap_add_extent_delay_real+0x41e/0x1670 [xfs]
33) 3312 208 xfs_bmap_add_extent+0x41c/0x450 [xfs]
34) 3104 448 xfs_bmapi+0x982/0x1200 [xfs]
35) 2656 256 xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
36) 2400 208 xfs_iomap+0x3d8/0x410 [xfs]
37) 2192 32 xfs_map_blocks+0x2c/0x30 [xfs]
38) 2160 256 xfs_page_state_convert+0x443/0x730 [xfs]
39) 1904 64 xfs_vm_writepage+0xab/0x160 [xfs]
40) 1840 32 __writepage+0x1a/0x60
41) 1808 288 write_cache_pages+0x1f7/0x400
42) 1520 16 generic_writepages+0x27/0x30
43) 1504 48 xfs_vm_writepages+0x5a/0x70 [xfs]
44) 1456 16 do_writepages+0x24/0x40
45) 1440 64 writeback_single_inode+0xf1/0x3e0
46) 1376 128 writeback_inodes_wb+0x31e/0x510
47) 1248 16 writeback_inodes_wbc+0x1e/0x20
48) 1232 224 balance_dirty_pages_ratelimited_nr+0x277/0x410
49) 1008 192 generic_file_buffered_write+0x19b/0x240
50) 816 288 xfs_write+0x849/0x930 [xfs]
51) 528 16 xfs_file_aio_write+0x5b/0x70 [xfs]
52) 512 272 do_sync_write+0xd1/0x120
53) 240 48 vfs_write+0xcb/0x1a0
54) 192 64 sys_write+0x55/0x90
55) 128 128 system_call_fastpath+0x16/0x1b


Attachments:
config.gz (27.92 kB)

2010-04-16 13:42:10

by John Berthels

Subject: Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

Chris Mason wrote:
> shrink_zone on my box isn't 500 bytes, but lets try the easy stuff
> first. This is against .34, if you have any trouble applying to .32,
> just add the word noinline after the word static on the function
> definitions.
>
> This makes shrink_zone disappear from my check_stack.pl output.
> Basically I think the compiler is inlining the shrink_active_zone and
> shrink_inactive_zone code into shrink_zone.

Hi Chris,

I hadn't seen the followup discussion on lkml until today, but this message:

http://marc.info/?l=linux-mm&m=127122143303771&w=2

allowed me to look at stack usage in our build environment. If I've
understood correctly, it seems that builds with gcc-4.4 and gcc-4.3
have very different stack usages for shrink_zone(): 0x88 versus 0x1d8.
(details below).

The reason appears to be the -fconserve-stack compilation option
specified when using 4.4, since running the cmdline from mm/.vmscan.o.cmd
with gcc-4.4 but *without* -fconserve-stack gives the same result as
with 4.3.

According to the discussion when the flag was added,
http://www.gossamer-threads.com/lists/linux/kernel/1131612
this flag seems primarily to affect inlining, so I double-checked the
noinline patch you sent to the list and discovered that it had been
incorrectly applied to the build tree. Correctly applying that patch to
mm/vmscan.c (and using gcc-4.3) gives a

sub $0x78,%rsp

line. I'm very sorry that this test of ours wasn't correct, and for
sending bad info to the list.

We're currently building a kernel with gcc-4.4 and will let you know if
it blows the 8k limit or not.

Thanks for your help.

regards,

jb

$ gcc-4.3 --version
gcc-4.3 (Ubuntu 4.3.4-5ubuntu1) 4.3.4
$ gcc-4.4 --version
gcc-4.4 (Ubuntu 4.4.1-4ubuntu9) 4.4.1


$ make CC=gcc-4.4 mm/vmscan.o
$ objdump -d mm/vmscan.o | less +/shrink_zone
0000000000002830 <shrink_zone>:
2830: 55 push %rbp
2831: 48 89 e5 mov %rsp,%rbp
2834: 41 57 push %r15
2836: 41 56 push %r14
2838: 41 55 push %r13
283a: 41 54 push %r12
283c: 53 push %rbx
283d: 48 81 ec 88 00 00 00 sub $0x88,%rsp
2844: e8 00 00 00 00 callq 2849 <shrink_zone+0x19>
$ make clean
$ make CC=gcc-4.3 mm/vmscan.o
$ objdump -d mm/vmscan.o | less +/shrink_zone
0000000000001ca0 <shrink_zone>:
1ca0: 55 push %rbp
1ca1: 48 89 e5 mov %rsp,%rbp
1ca4: 41 57 push %r15
1ca6: 41 56 push %r14
1ca8: 41 55 push %r13
1caa: 41 54 push %r12
1cac: 53 push %rbx
1cad: 48 81 ec d8 01 00 00 sub $0x1d8,%rsp
1cb4: e8 00 00 00 00 callq 1cb9 <shrink_zone+0x19>