2008-11-17 20:34:25

by Steven Rostedt

Subject: Large stack usage in fs code (especially for PPC64)


I've been hitting stack overflows on a PPC64 box, so I ran the ftrace
stack_tracer. Part of the problem with that box is that it can nest
interrupts too deeply. But what also worries me is that there are some
heavy stack hitters in generic code, notably in the fs directory.

Here's the full dump of the stack (PPC64):

root@electra ~> cat /debug/tracing/stack_trace
Depth Size Location (56 entries)
----- ---- --------
0) 14032 112 ftrace_call+0x4/0x14
1) 13920 128 .sched_clock+0x20/0x60
2) 13792 128 .sched_clock_cpu+0x34/0x50
3) 13664 144 .cpu_clock+0x3c/0xa0
4) 13520 144 .get_timestamp+0x2c/0x50
5) 13376 192 .softlockup_tick+0x100/0x220
6) 13184 128 .run_local_timers+0x34/0x50
7) 13056 160 .update_process_times+0x44/0xb0
8) 12896 176 .tick_sched_timer+0x8c/0x120
9) 12720 160 .__run_hrtimer+0xd8/0x130
10) 12560 240 .hrtimer_interrupt+0x16c/0x220
11) 12320 160 .timer_interrupt+0xcc/0x110
12) 12160 240 decrementer_common+0xe0/0x100
13) 11920 576 0x80
14) 11344 160 .usb_hcd_irq+0x94/0x150
15) 11184 160 .handle_IRQ_event+0x80/0x120
16) 11024 160 .handle_fasteoi_irq+0xd8/0x1e0
17) 10864 160 .do_IRQ+0xbc/0x150
18) 10704 144 hardware_interrupt_entry+0x1c/0x3c
19) 10560 672 0x0
20) 9888 144 ._spin_unlock_irqrestore+0x84/0xd0
21) 9744 160 .scsi_dispatch_cmd+0x170/0x360
22) 9584 208 .scsi_request_fn+0x324/0x5e0
23) 9376 144 .blk_invoke_request_fn+0xc8/0x1b0
24) 9232 144 .__blk_run_queue+0x48/0x60
25) 9088 144 .blk_run_queue+0x40/0x70
26) 8944 192 .scsi_run_queue+0x3a8/0x3e0
27) 8752 160 .scsi_next_command+0x58/0x90
28) 8592 176 .scsi_end_request+0xd4/0x130
29) 8416 208 .scsi_io_completion+0x15c/0x500
30) 8208 160 .scsi_finish_command+0x15c/0x190
31) 8048 160 .scsi_softirq_done+0x138/0x1e0
32) 7888 160 .blk_done_softirq+0xd0/0x100
33) 7728 192 .__do_softirq+0xe8/0x1e0
34) 7536 144 .do_softirq+0xa4/0xd0
35) 7392 144 .irq_exit+0xb4/0xf0
36) 7248 160 .do_IRQ+0x114/0x150
37) 7088 752 hardware_interrupt_entry+0x1c/0x3c
38) 6336 144 .blk_rq_init+0x28/0xc0
39) 6192 208 .get_request+0x13c/0x3d0
40) 5984 240 .get_request_wait+0x60/0x170
41) 5744 192 .__make_request+0xd4/0x560
42) 5552 192 .generic_make_request+0x210/0x300
43) 5360 208 .submit_bio+0x168/0x1a0
44) 5152 160 .submit_bh+0x188/0x1e0
45) 4992 1280 .block_read_full_page+0x23c/0x430
46) 3712 1280 .do_mpage_readpage+0x43c/0x740
47) 2432 352 .mpage_readpages+0x130/0x1c0
48) 2080 160 .ext3_readpages+0x50/0x80
49) 1920 256 .__do_page_cache_readahead+0x1e4/0x340
50) 1664 160 .do_page_cache_readahead+0x94/0xe0
51) 1504 240 .filemap_fault+0x360/0x530
52) 1264 256 .__do_fault+0xb8/0x600
53) 1008 240 .handle_mm_fault+0x190/0x920
54) 768 320 .do_page_fault+0x3d4/0x5f0
55) 448 448 handle_page_fault+0x20/0x5c


Notice at lines 45 and 46 the stack usage of block_read_full_page and
do_mpage_readpage: they each use 1280 bytes of stack! Looking at the start
of these two:

int block_read_full_page(struct page *page, get_block_t *get_block)
{
	struct inode *inode = page->mapping->host;
	sector_t iblock, lblock;
	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
	unsigned int blocksize;
	int nr, i;
	int fully_mapped = 1;
	[...]

static struct bio *
do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
		sector_t *last_block_in_bio, struct buffer_head *map_bh,
		unsigned long *first_logical_block, get_block_t get_block)
{
	struct inode *inode = page->mapping->host;
	const unsigned blkbits = inode->i_blkbits;
	const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
	const unsigned blocksize = 1 << blkbits;
	sector_t block_in_file;
	sector_t last_block;
	sector_t last_block_in_file;
	sector_t blocks[MAX_BUF_PER_PAGE];
	unsigned page_block;
	unsigned first_hole = blocks_per_page;
	struct block_device *bdev = NULL;
	int length;
	int fully_mapped = 1;
	unsigned nblocks;
	unsigned relative_block;


The thing that hits my eye on both is the MAX_BUF_PER_PAGE usage. That is
defined as:

#define MAX_BUF_PER_PAGE (PAGE_CACHE_SIZE / 512)

Where PAGE_CACHE_SIZE is the same as PAGE_SIZE.

On PPC64 I'm told that the page size is 64K, which makes the above equal
to 64K / 512 = 128. Multiply that by 8-byte words and we have 1024 bytes.
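
For reference, here's a minimal standalone sketch of that arithmetic. The
values below are assumed from CONFIG_PPC_64K_PAGES and an 8-byte sector_t;
they are not pulled from the kernel headers:

#include <stdio.h>

/* Illustration of the on-stack array cost discussed above, assuming
 * 64K pages, a 512-byte minimum block size, and 8-byte sector_t and
 * pointers as on ppc64. */
#define PAGE_CACHE_SIZE		(64 * 1024)
#define MAX_BUF_PER_PAGE	(PAGE_CACHE_SIZE / 512)	/* 128 entries */

typedef unsigned long long sector_t;			/* 8 bytes */

int main(void)
{
	printf("blocks[]: %zu bytes\n", MAX_BUF_PER_PAGE * sizeof(sector_t)); /* 1024 */
	printf("arr[]:    %zu bytes\n", MAX_BUF_PER_PAGE * sizeof(void *));   /* 1024 */
	return 0;
}

That single array accounts for 1024 of the 1280 bytes each of those two
functions shows in the trace above.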

The problem with PPC64 is that the stack size is not equal to the page
size. The stack size is only 16K, not 64K.

The above stack trace was taken right after boot up, and it was already at
14K, not too far from the 16K limit.

Note, I was using a default config that had CONFIG_IRQSTACKS off and
CONFIG_PPC_64K_PAGES on.

-- Steve


2008-11-17 21:00:18

by Steven Rostedt

Subject: Re: Large stack usage in fs code (especially for PPC64)


On Mon, 17 Nov 2008, Steven Rostedt wrote:
>
> Note, I was using a default config that had CONFIG_IRQSTACKS off and
> CONFIG_PPC_64K_PAGES on.

Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that
softirqs still use the same stack as the process.

root@electra ~> cat /debug/tracing/stack_trace
Depth Size Location (59 entries)
----- ---- --------
0) 12384 192 ftrace_call+0x4/0x14
1) 12192 128 .sched_clock+0x20/0x60
2) 12064 128 .sched_clock_cpu+0x34/0x50
3) 11936 144 .cpu_clock+0x3c/0xa0
4) 11792 144 .get_timestamp+0x2c/0x50
5) 11648 144 .__touch_softlockup_watchdog+0x3c/0x60
6) 11504 192 .softlockup_tick+0xe4/0x220
7) 11312 128 .run_local_timers+0x34/0x50
8) 11184 160 .update_process_times+0x44/0xb0
9) 11024 176 .tick_sched_timer+0x8c/0x120
10) 10848 160 .__run_hrtimer+0xd8/0x130
11) 10688 240 .hrtimer_interrupt+0x16c/0x220
12) 10448 160 .timer_interrupt+0xcc/0x110
13) 10288 96 decrementer_common+0xe0/0x100
14) 10192 784 .ftrace_test_stop_func+0x4c/0x70
15) 9408 112 ftrace_call+0x4/0x14
16) 9296 144 .init_buffer_head+0x20/0x60
17) 9152 256 .cache_alloc_refill+0x5a0/0x740
18) 8896 160 .kmem_cache_alloc+0xfc/0x140
19) 8736 144 .alloc_buffer_head+0x3c/0xc0
20) 8592 176 .alloc_page_buffers+0xb0/0x130
21) 8416 160 .create_empty_buffers+0x44/0x1a0
22) 8256 1280 .block_read_full_page+0x3b0/0x430
23) 6976 144 .blkdev_readpage+0x38/0x60
24) 6832 176 .read_cache_page_async+0x124/0x230
25) 6656 160 .read_cache_page+0x50/0xe0
26) 6496 160 .read_dev_sector+0x58/0x100
27) 6336 272 .msdos_partition+0xa4/0x880
28) 6064 240 .rescan_partitions+0x1f0/0x430
29) 5824 208 .__blkdev_get+0x124/0x3d0
30) 5616 144 .blkdev_get+0x3c/0x60
31) 5472 192 .register_disk+0x194/0x1f0
32) 5280 160 .add_disk+0xe8/0x160
33) 5120 224 .sd_probe+0x2f0/0x440
34) 4896 176 .driver_probe_device+0x14c/0x380
35) 4720 144 .__device_attach+0x38/0x60
36) 4576 192 .bus_for_each_drv+0xb8/0x120
37) 4384 160 .device_attach+0x94/0xe0
38) 4224 144 .bus_attach_device+0x78/0xb0
39) 4080 240 .device_add+0x54c/0x730
40) 3840 160 .scsi_sysfs_add_sdev+0x7c/0x2e0
41) 3680 336 .scsi_probe_and_add_lun+0xb7c/0xc60
42) 3344 192 .__scsi_add_device+0x11c/0x150
43) 3152 176 .ata_scsi_scan_host+0x238/0x330
44) 2976 208 .ata_host_register+0x320/0x3c0
45) 2768 192 .ata_host_activate+0xf8/0x190
46) 2576 224 .mv_pci_init_one+0x284/0x430
47) 2352 160 .pci_device_probe+0x98/0xf0
48) 2192 176 .driver_probe_device+0x14c/0x380
49) 2016 160 .__driver_attach+0xd4/0x110
50) 1856 192 .bus_for_each_dev+0xa8/0x100
51) 1664 144 .driver_attach+0x40/0x60
52) 1520 192 .bus_add_driver+0x268/0x340
53) 1328 176 .driver_register+0x88/0x1d0
54) 1152 176 .__pci_register_driver+0x90/0x100
55) 976 144 .mv_init+0x38/0xa0
56) 832 544 .do_one_initcall+0x78/0x240
57) 288 192 .kernel_init+0x21c/0x2b0
58) 96 96 .kernel_thread+0x54/0x70

This is still 12K. Kind of big even for a 16K stack.

-- Steve

2008-11-17 21:10:30

by Andrew Morton

Subject: Re: Large stack usage in fs code (especially for PPC64)

On Mon, 17 Nov 2008 15:34:13 -0500 (EST)
Steven Rostedt <[email protected]> wrote:

>
> I've been hitting stack overflows on a PPC64 box, so I ran the ftrace
> stack_tracer and part of the problem with that box is that it can nest
> interrupts too deep. But what also worries me is that there's some heavy
> hitters of stacks in generic code. Namely the fs directory has some.
>
> Here's the full dump of the stack (PPC64):
>
> ...
>
> do_mpage_readpage. They each use 1280 bytes of stack! Looking at the start
> of these two:
>
> int block_read_full_page(struct page *page, get_block_t *get_block)
> {
> struct inode *inode = page->mapping->host;
> sector_t iblock, lblock;
> struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
> unsigned int blocksize;
> int nr, i;
> int fully_mapped = 1;
> [...]
>
> static struct bio *
> do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
> sector_t *last_block_in_bio, struct buffer_head *map_bh,
> unsigned long *first_logical_block, get_block_t get_block)
> {
> struct inode *inode = page->mapping->host;
> const unsigned blkbits = inode->i_blkbits;
> const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
> const unsigned blocksize = 1 << blkbits;
> sector_t block_in_file;
> sector_t last_block;
> sector_t last_block_in_file;
> sector_t blocks[MAX_BUF_PER_PAGE];
> unsigned page_block;
> unsigned first_hole = blocks_per_page;
> struct block_device *bdev = NULL;
> int length;
> int fully_mapped = 1;
> unsigned nblocks;
> unsigned relative_block;
>
>
> The thing that hits my eye on both is the MAX_BUF_PER_PAGE usage. That is
> defined as:
>
> define MAX_BUF_PER_PAGE (PAGE_CACHE_SIZE / 512)
>
> Where PAGE_CACHE_SIZE is the same as PAGE_SIZE.
>
> On PPC64 I'm told that the page size is 64K, which makes the above equal
> to: 64K / 512 = 128 multiply that by 8 byte words, we have 1024 bytes.
>
> The problem with PPC64 is that the stack size is not equal to the page
> size. The stack size is only 16K not 64K.
>
> The above stack trace was taken right after boot up and it was already at
> 14K, not too far from the 16k limit.
>
> Note, I was using a default config that had CONFIG_IRQSTACKS off and
> CONFIG_PPC_64K_PAGES on.
>

Far be it from me to apportion blame, but THIS IS ALL LINUS'S FAULT!!!!! :)

I fixed this six years ago. See http://lkml.org/lkml/2002/6/17/68

I still have the patch but for some reason it appears to get some
rejects. However I think the approach (whatever it was ;)) is still
usable. Perhaps some keen young thing has time to get down and redo
it.




This patch fixes some potential stack consumption problems.

The storage for

sector_t blocks[MAX_BUF_PER_PAGE];
and
struct buffer_head *arr[MAX_BUF_PER_PAGE];

will each consume a kilobyte with a 64k page and a 64-bit sector_t. And on
the path

do_mpage_readpage
->block_read_full_page
-> <page allocation>
->mpage_writepage

they can be nested three-deep. 3k of stack gone. Presumably in this
case the stack page would be 64k anyway, so that's not a problem.
However if PAGE_SIZE=4k and PAGE_CACHE_SIZE=64k, we die.

I've yet to see any reason for larger PAGE_CACHE_SIZE, but this is a
neater implementation anyway.

The patch removes MAX_BUF_PER_PAGE and instead walks the page's buffer
ring to work out which buffers need to be submitted.



--- 2.5.22/fs/mpage.c~stack-space Sun Jun 16 22:50:17 2002
+++ 2.5.22-akpm/fs/mpage.c Sun Jun 16 22:50:17 2002
@@ -169,8 +169,9 @@ do_mpage_readpage(struct bio *bio, struc
const unsigned blocksize = 1 << blkbits;
struct bio_vec *bvec;
sector_t block_in_file;
- sector_t last_block;
- sector_t blocks[MAX_BUF_PER_PAGE];
+ sector_t last_file_block;
+ sector_t first_page_block = -1;
+ sector_t last_page_block = -1;
unsigned page_block;
unsigned first_hole = blocks_per_page;
struct block_device *bdev = NULL;
@@ -180,12 +181,12 @@ do_mpage_readpage(struct bio *bio, struc
goto confused;

block_in_file = page->index << (PAGE_CACHE_SHIFT - blkbits);
- last_block = (inode->i_size + blocksize - 1) >> blkbits;
+ last_file_block = (inode->i_size + blocksize - 1) >> blkbits;

for (page_block = 0; page_block < blocks_per_page;
page_block++, block_in_file++) {
bh.b_state = 0;
- if (block_in_file < last_block) {
+ if (block_in_file < last_file_block) {
if (get_block(inode, block_in_file, &bh, 0))
goto confused;
}
@@ -199,10 +200,14 @@ do_mpage_readpage(struct bio *bio, struc
if (first_hole != blocks_per_page)
goto confused; /* hole -> non-hole */

- /* Contiguous blocks? */
- if (page_block && blocks[page_block-1] != bh.b_blocknr-1)
- goto confused;
- blocks[page_block] = bh.b_blocknr;
+ if (page_block) {
+ /* Contiguous blocks? */
+ if (bh.b_blocknr != last_page_block + 1)
+ goto confused;
+ } else {
+ first_page_block = bh.b_blocknr;
+ }
+ last_page_block = bh.b_blocknr;
bdev = bh.b_bdev;
}

@@ -222,7 +227,7 @@ do_mpage_readpage(struct bio *bio, struc
* This page will go to BIO. Do we need to send this BIO off first?
*/
if (bio && (bio->bi_idx == bio->bi_vcnt ||
- *last_block_in_bio != blocks[0] - 1))
+ *last_block_in_bio != first_page_block - 1))
bio = mpage_bio_submit(READ, bio);

if (bio == NULL) {
@@ -230,7 +235,7 @@ do_mpage_readpage(struct bio *bio, struc

if (nr_bvecs > nr_pages)
nr_bvecs = nr_pages;
- bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
+ bio = mpage_alloc(bdev, first_page_block << (blkbits - 9),
nr_bvecs, GFP_KERNEL);
if (bio == NULL)
goto confused;
@@ -244,7 +249,7 @@ do_mpage_readpage(struct bio *bio, struc
if (buffer_boundary(&bh) || (first_hole != blocks_per_page))
bio = mpage_bio_submit(READ, bio);
else
- *last_block_in_bio = blocks[blocks_per_page - 1];
+ *last_block_in_bio = last_page_block;
out:
return bio;

@@ -322,9 +327,10 @@ mpage_writepage(struct bio *bio, struct
unsigned long end_index;
const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
struct bio_vec *bvec;
- sector_t last_block;
+ sector_t last_file_block;
sector_t block_in_file;
- sector_t blocks[MAX_BUF_PER_PAGE];
+ sector_t first_page_block = -1;
+ sector_t last_page_block = -1;
unsigned page_block;
unsigned first_unmapped = blocks_per_page;
struct block_device *bdev = NULL;
@@ -355,11 +361,13 @@ mpage_writepage(struct bio *bio, struct

if (!buffer_dirty(bh) || !buffer_uptodate(bh))
goto confused;
- if (page_block) {
- if (bh->b_blocknr != blocks[page_block-1] + 1)
+ if (page_block++) {
+ if (bh->b_blocknr != last_page_block + 1)
goto confused;
+ } else {
+ first_page_block = bh->b_blocknr;
}
- blocks[page_block++] = bh->b_blocknr;
+ last_page_block = bh->b_blocknr;
boundary = buffer_boundary(bh);
bdev = bh->b_bdev;
} while ((bh = bh->b_this_page) != head);
@@ -381,7 +389,7 @@ mpage_writepage(struct bio *bio, struct
*/
BUG_ON(!PageUptodate(page));
block_in_file = page->index << (PAGE_CACHE_SHIFT - blkbits);
- last_block = (inode->i_size - 1) >> blkbits;
+ last_file_block = (inode->i_size - 1) >> blkbits;
for (page_block = 0; page_block < blocks_per_page; ) {
struct buffer_head map_bh;

@@ -392,13 +400,16 @@ mpage_writepage(struct bio *bio, struct
unmap_underlying_metadata(map_bh.b_bdev,
map_bh.b_blocknr);
if (page_block) {
- if (map_bh.b_blocknr != blocks[page_block-1] + 1)
+ if (map_bh.b_blocknr != last_page_block + 1)
goto confused;
+ } else {
+ first_page_block = map_bh.b_blocknr;
}
- blocks[page_block++] = map_bh.b_blocknr;
+ page_block++;
+ last_page_block = map_bh.b_blocknr;
boundary = buffer_boundary(&map_bh);
bdev = map_bh.b_bdev;
- if (block_in_file == last_block)
+ if (block_in_file == last_file_block)
break;
block_in_file++;
}
@@ -424,13 +435,13 @@ page_is_mapped:
* This page will go to BIO. Do we need to send this BIO off first?
*/
if (bio && (bio->bi_idx == bio->bi_vcnt ||
- *last_block_in_bio != blocks[0] - 1))
+ *last_block_in_bio != first_page_block - 1))
bio = mpage_bio_submit(WRITE, bio);

if (bio == NULL) {
unsigned nr_bvecs = MPAGE_BIO_MAX_SIZE / PAGE_CACHE_SIZE;

- bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
+ bio = mpage_alloc(bdev, first_page_block << (blkbits - 9),
nr_bvecs, GFP_NOFS);
if (bio == NULL)
goto confused;
@@ -464,7 +475,7 @@ page_is_mapped:
if (boundary || (first_unmapped != blocks_per_page))
bio = mpage_bio_submit(WRITE, bio);
else
- *last_block_in_bio = blocks[blocks_per_page - 1];
+ *last_block_in_bio = last_page_block;
goto out;

confused:
--- 2.5.22/fs/buffer.c~stack-space Sun Jun 16 22:50:17 2002
+++ 2.5.22-akpm/fs/buffer.c Sun Jun 16 23:22:48 2002
@@ -1799,7 +1799,7 @@ int block_read_full_page(struct page *pa
{
struct inode *inode = page->mapping->host;
unsigned long iblock, lblock;
- struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+ struct buffer_head *bh, *head;
unsigned int blocksize, blocks;
int nr, i;

@@ -1842,7 +1842,7 @@ int block_read_full_page(struct page *pa
if (buffer_uptodate(bh))
continue;
}
- arr[nr++] = bh;
+ nr++;
} while (i++, iblock++, (bh = bh->b_this_page) != head);

if (!nr) {
@@ -1857,24 +1857,26 @@ int block_read_full_page(struct page *pa
}

/* Stage two: lock the buffers */
- for (i = 0; i < nr; i++) {
- bh = arr[i];
- lock_buffer(bh);
- mark_buffer_async_read(bh);
- }
+ do {
+ if (!buffer_uptodate(bh)) {
+ lock_buffer(bh);
+ mark_buffer_async_read(bh);
+ }
+ } while ((bh = bh->b_this_page) != head);

/*
* Stage 3: start the IO. Check for uptodateness
* inside the buffer lock in case another process reading
* the underlying blockdev brought it uptodate (the sct fix).
*/
- for (i = 0; i < nr; i++) {
- bh = arr[i];
- if (buffer_uptodate(bh))
- end_buffer_async_read(bh, 1);
- else
- submit_bh(READ, bh);
- }
+ do {
+ if (buffer_async_read(bh)) {
+ if (buffer_uptodate(bh))
+ end_buffer_async_read(bh, 1);
+ else
+ submit_bh(READ, bh);
+ }
+ } while ((bh = bh->b_this_page) != head);
return 0;
}

@@ -2490,8 +2492,8 @@ static void bh_mempool_free(void *elemen
return kmem_cache_free(bh_cachep, element);
}

-#define NR_RESERVED (10*MAX_BUF_PER_PAGE)
-#define MAX_UNUSED_BUFFERS NR_RESERVED+20
+
+#define MEMPOOL_BUFFERS (32 * PAGE_CACHE_SIZE / 512)

void __init buffer_init(void)
{
@@ -2500,7 +2502,7 @@ void __init buffer_init(void)
bh_cachep = kmem_cache_create("buffer_head",
sizeof(struct buffer_head), 0,
SLAB_HWCACHE_ALIGN, init_buffer_head, NULL);
- bh_mempool = mempool_create(MAX_UNUSED_BUFFERS, bh_mempool_alloc,
+ bh_mempool = mempool_create(MEMPOOL_BUFFERS, bh_mempool_alloc,
bh_mempool_free, NULL);
for (i = 0; i < ARRAY_SIZE(bh_wait_queue_heads); i++)
init_waitqueue_head(&bh_wait_queue_heads[i].wqh);
--- 2.5.22/fs/block_dev.c~stack-space Sun Jun 16 22:50:17 2002
+++ 2.5.22-akpm/fs/block_dev.c Sun Jun 16 22:50:17 2002
@@ -23,8 +23,6 @@

#include <asm/uaccess.h>

-#define MAX_BUF_PER_PAGE (PAGE_CACHE_SIZE / 512)
-
static unsigned long max_block(struct block_device *bdev)
{
unsigned int retval = ~0U;
--- 2.5.22/include/linux/buffer_head.h~stack-space Sun Jun 16 22:50:17 2002
+++ 2.5.22-akpm/include/linux/buffer_head.h Sun Jun 16 23:22:47 2002
@@ -29,8 +29,6 @@ enum bh_state_bits {
*/
};

-#define MAX_BUF_PER_PAGE (PAGE_CACHE_SIZE / 512)
-
struct page;
struct kiobuf;
struct buffer_head;
--- 2.5.22/fs/ntfs/aops.c~stack-space Sun Jun 16 22:50:17 2002
+++ 2.5.22-akpm/fs/ntfs/aops.c Sun Jun 16 23:22:46 2002
@@ -104,7 +104,7 @@ static int ntfs_file_read_block(struct p
LCN lcn;
ntfs_inode *ni;
ntfs_volume *vol;
- struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+ struct buffer_head *bh, *head;
sector_t iblock, lblock, zblock;
unsigned int blocksize, blocks, vcn_ofs;
int i, nr;
@@ -116,11 +116,12 @@ static int ntfs_file_read_block(struct p
blocksize_bits = VFS_I(ni)->i_blkbits;
blocksize = 1 << blocksize_bits;

- if (!page_has_buffers(page))
+ if (!page_has_buffers(page)) {
create_empty_buffers(page, blocksize, 0);
+ if (!page_has_buffers(page)) /* This can't happen */
+ return -ENOMEM;
+ }
bh = head = page_buffers(page);
- if (!bh)
- return -ENOMEM;

blocks = PAGE_CACHE_SIZE >> blocksize_bits;
iblock = page->index << (PAGE_CACHE_SHIFT - blocksize_bits);
@@ -138,10 +139,12 @@ static int ntfs_file_read_block(struct p
/* Loop through all the buffers in the page. */
nr = i = 0;
do {
+ BUG_ON(buffer_async_read(bh));
if (unlikely(buffer_uptodate(bh)))
continue;
if (unlikely(buffer_mapped(bh))) {
- arr[nr++] = bh;
+ set_buffer_async_read(bh);
+ nr++;
continue;
}
bh->b_bdev = vol->sb->s_bdev;
@@ -167,7 +170,8 @@ retry_remap:
set_buffer_mapped(bh);
/* Only read initialized data blocks. */
if (iblock < zblock) {
- arr[nr++] = bh;
+ set_buffer_async_read(bh);
+ nr++;
continue;
}
/* Fully non-initialized data block, zero it. */
@@ -208,15 +212,18 @@ handle_zblock:
/* Check we have at least one buffer ready for i/o. */
if (nr) {
/* Lock the buffers. */
- for (i = 0; i < nr; i++) {
- struct buffer_head *tbh = arr[i];
- lock_buffer(tbh);
- tbh->b_end_io = end_buffer_read_file_async;
- set_buffer_async_read(tbh);
- }
+ do {
+ if (buffer_async_read(bh)) {
+ lock_buffer(bh);
+ bh->b_end_io = end_buffer_read_file_async;
+ }
+ } while ((bh = bh->b_this_page) != head);
+
/* Finally, start i/o on the buffers. */
- for (i = 0; i < nr; i++)
- submit_bh(READ, arr[i]);
+ do {
+ if (buffer_async_read(bh))
+ submit_bh(READ, bh);
+ } while ((bh = bh->b_this_page) != head);
return 0;
}
/* No i/o was scheduled on any of the buffers. */
@@ -404,7 +411,7 @@ static int ntfs_mftbmp_readpage(ntfs_vol
{
VCN vcn;
LCN lcn;
- struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+ struct buffer_head *bh, *head;
sector_t iblock, lblock, zblock;
unsigned int blocksize, blocks, vcn_ofs;
int nr, i;
@@ -416,11 +423,12 @@ static int ntfs_mftbmp_readpage(ntfs_vol
blocksize = vol->sb->s_blocksize;
blocksize_bits = vol->sb->s_blocksize_bits;

- if (!page_has_buffers(page))
+ if (!page_has_buffers(page)) {
create_empty_buffers(page, blocksize, 0);
+ if (!page_has_buffers(page)) /* This can't happen */
+ return -ENOMEM;
+ }
bh = head = page_buffers(page);
- if (!bh)
- return -ENOMEM;

blocks = PAGE_CACHE_SIZE >> blocksize_bits;
iblock = page->index << (PAGE_CACHE_SHIFT - blocksize_bits);
@@ -431,10 +439,12 @@ static int ntfs_mftbmp_readpage(ntfs_vol
/* Loop through all the buffers in the page. */
nr = i = 0;
do {
+ BUG_ON(buffer_async_read(bh));
if (unlikely(buffer_uptodate(bh)))
continue;
if (unlikely(buffer_mapped(bh))) {
- arr[nr++] = bh;
+ set_buffer_async_read(bh);
+ nr++;
continue;
}
bh->b_bdev = vol->sb->s_bdev;
@@ -457,7 +467,8 @@ static int ntfs_mftbmp_readpage(ntfs_vol
set_buffer_mapped(bh);
/* Only read initialized data blocks. */
if (iblock < zblock) {
- arr[nr++] = bh;
+ set_buffer_async_read(bh);
+ nr++;
continue;
}
/* Fully non-initialized data block, zero it. */
@@ -491,15 +502,18 @@ handle_zblock:
/* Check we have at least one buffer ready for i/o. */
if (nr) {
/* Lock the buffers. */
- for (i = 0; i < nr; i++) {
- struct buffer_head *tbh = arr[i];
- lock_buffer(tbh);
- tbh->b_end_io = end_buffer_read_mftbmp_async;
- set_buffer_async_read(tbh);
- }
+ do {
+ if (buffer_async_read(bh)) {
+ lock_buffer(bh);
+ bh->b_end_io = end_buffer_read_mftbmp_async;
+ }
+ } while ((bh = bh->b_this_page) != head);
+
/* Finally, start i/o on the buffers. */
- for (i = 0; i < nr; i++)
- submit_bh(READ, arr[i]);
+ do {
+ if (buffer_async_read(bh))
+ submit_bh(READ, bh);
+ } while ((bh = bh->b_this_page) != head);
return 0;
}
/* No i/o was scheduled on any of the buffers. */
@@ -643,7 +657,7 @@ int ntfs_mst_readpage(struct file *dir,
LCN lcn;
ntfs_inode *ni;
ntfs_volume *vol;
- struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+ struct buffer_head *bh, *head;
sector_t iblock, lblock, zblock;
unsigned int blocksize, blocks, vcn_ofs;
int i, nr;
@@ -658,11 +672,12 @@ int ntfs_mst_readpage(struct file *dir,
blocksize_bits = VFS_I(ni)->i_blkbits;
blocksize = 1 << blocksize_bits;

- if (!page_has_buffers(page))
+ if (!page_has_buffers(page)) {
create_empty_buffers(page, blocksize, 0);
+ if (!page_has_buffers(page)) /* This can't happen */
+ return -ENOMEM;
+ }
bh = head = page_buffers(page);
- if (!bh)
- return -ENOMEM;

blocks = PAGE_CACHE_SIZE >> blocksize_bits;
iblock = page->index << (PAGE_CACHE_SHIFT - blocksize_bits);
@@ -678,10 +693,12 @@ int ntfs_mst_readpage(struct file *dir,
/* Loop through all the buffers in the page. */
nr = i = 0;
do {
+ BUG_ON(buffer_async_read(bh));
if (unlikely(buffer_uptodate(bh)))
continue;
if (unlikely(buffer_mapped(bh))) {
- arr[nr++] = bh;
+ set_buffer_async_read(bh);
+ nr++;
continue;
}
bh->b_bdev = vol->sb->s_bdev;
@@ -707,7 +724,8 @@ retry_remap:
set_buffer_mapped(bh);
/* Only read initialized data blocks. */
if (iblock < zblock) {
- arr[nr++] = bh;
+ set_buffer_async_read(bh);
+ nr++;
continue;
}
/* Fully non-initialized data block, zero it. */
@@ -748,15 +766,18 @@ handle_zblock:
/* Check we have at least one buffer ready for i/o. */
if (nr) {
/* Lock the buffers. */
- for (i = 0; i < nr; i++) {
- struct buffer_head *tbh = arr[i];
- lock_buffer(tbh);
- tbh->b_end_io = end_buffer_read_mst_async;
- set_buffer_async_read(tbh);
- }
+ do {
+ if (buffer_async_read(bh)) {
+ lock_buffer(bh);
+ bh->b_end_io = end_buffer_read_mst_async;
+ }
+ } while ((bh = bh->b_this_page) != head);
+
/* Finally, start i/o on the buffers. */
- for (i = 0; i < nr; i++)
- submit_bh(READ, arr[i]);
+ do {
+ if (buffer_async_read(bh))
+ submit_bh(READ, bh);
+ } while ((bh = bh->b_this_page) != head);
return 0;
}
/* No i/o was scheduled on any of the buffers. */


2008-11-17 21:11:12

by Linus Torvalds

Subject: Re: Large stack usage in fs code (especially for PPC64)



On Mon, 17 Nov 2008, Steven Rostedt wrote:
>
> 45) 4992 1280 .block_read_full_page+0x23c/0x430
> 46) 3712 1280 .do_mpage_readpage+0x43c/0x740

Ouch.

> Notice at line 45 and 46 the stack usage of block_read_full_page and
> do_mpage_readpage. They each use 1280 bytes of stack! Looking at the start
> of these two:
>
> int block_read_full_page(struct page *page, get_block_t *get_block)
> {
> struct inode *inode = page->mapping->host;
> sector_t iblock, lblock;
> struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];

Yeah, that's unacceptable.

Well, it's not unacceptable on good CPU's with 4kB blocks (just an 8-entry
array), but as you say:

> On PPC64 I'm told that the page size is 64K, which makes the above equal
> to: 64K / 512 = 128 multiply that by 8 byte words, we have 1024 bytes.

Yeah. Not good. I think 64kB pages are insane. In fact, I think 32kB
pages are insane, and 16kB pages are borderline. I've told people so.

The ppc people run databases, and they don't care about sane people
telling them the big pages suck. It's made worse by the fact that they
also have horribly bad TLB fills on their broken CPU's, and years and
years of telling people that the MMU on ppc's are sh*t has only been
reacted to with "talk to the hand, we know better".

Quite frankly, 64kB pages are INSANE. But yes, in this case they actually
cause bugs. With a sane page-size, that *arr[MAX_BUF_PER_PAGE] thing uses
64 bytes, not 1kB.

I suspect the PPC people need to figure out some way to handle this in
their broken setups (since I don't really expect them to finally admit
that they were full of sh*t with their big pages), but since I think it's
a ppc bug, I'm not at all interested in a fix that penalizes the _good_
case.

So either make it some kind of (clean) conditional dynamic non-stack
allocation, or make it do some outer loop over the whole page that turns
into a compile-time no-op when the page is sufficiently small to be done
in one go.
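
A rough sketch of that outer-loop shape (purely illustrative: the chunk
size and the map_block()/submit_chunk() helpers are made up, not kernel
API):

/* Keep the on-stack array at a fixed small size and walk the page in
 * chunks, so the stack cost no longer scales with the page size. With
 * 4K pages and 512-byte blocks the chunk loop runs exactly once. */
#define BLOCKS_PER_CHUNK 8			/* 64 bytes of sector_t on the stack */

typedef unsigned long long sector_t;

extern sector_t map_block(unsigned block);		 /* hypothetical helper */
extern void submit_chunk(sector_t *blocks, unsigned nr); /* hypothetical helper */

static void read_page_blocks(unsigned blocks_per_page)
{
	sector_t blocks[BLOCKS_PER_CHUNK];	/* fixed cost, page-size independent */
	unsigned done, i, nr;

	for (done = 0; done < blocks_per_page; done += nr) {
		nr = blocks_per_page - done;
		if (nr > BLOCKS_PER_CHUNK)
			nr = BLOCKS_PER_CHUNK;
		for (i = 0; i < nr; i++)
			blocks[i] = map_block(done + i);
		submit_chunk(blocks, nr);
	}
}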

Or perhaps say "if you have 64kB pages, you're a moron, and to counteract
that moronic page size, you cannot do 512-byte granularity IO any more".

Of course, that would likely mean that FAT etc wouldn't work on ppc64, so
I don't think that's a valid model either. But if the 64kB page size is
just a "database server crazy-people config option", then maybe it's
acceptable.

Database people usually don't want to connect their cameras or mp3-players
with their FAT12 filesystems.

Linus

2008-11-17 21:19:24

by Linus Torvalds

Subject: Re: Large stack usage in fs code (especially for PPC64)



On Mon, 17 Nov 2008, Steven Rostedt wrote:
>
> Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that
> softirqs still use the same stack as the process.

Yes.

> This is still 12K. Kind of big even for a 16K stack.

And while that 1kB+ stack slot for block_read_full_page still stands out
like a sore thumb, I do agree that there's way too many other functions
too with big stack frames.

I do wonder just _what_ it is that causes the stack frames to be so
horrid. For example, you have

18) 8896 160 .kmem_cache_alloc+0xfc/0x140

and I'm looking at my x86-64 compile, and it has a stack frame of just 8
bytes (!) for local variables plus the save/restore area (which looks like
three registers plus frame pointer plus return address). IOW, if I'm
looking at the code right (so big caveat: I did _not_ do a real stack
dump!) the x86-64 stack cost for that same function is on the order of 48
bytes. Not 160.

Where does that factor-of-three+ difference come from? From the numbers, I
suspect ppc64 has a 32-byte stack alignment, which may be part of it, and
I guess the compiler is more eager to use all those extra registers and
will happily have many more callee-saved regs that are actually used.

But that's still a _lot_ of extra stack.

Of course, you may have things like spinlock debugging etc enabled. Some
of our debugging options do tend to blow things up.

Linus

2008-11-17 21:24:28

by Linus Torvalds

Subject: Re: Large stack usage in fs code (especially for PPC64)



On Mon, 17 Nov 2008, Andrew Morton wrote:
>
> Far be it from me to apportion blame, but THIS IS ALL LINUS'S FAULT!!!!! :)
>
> I fixed this six years ago. See http://lkml.org/lkml/2002/6/17/68

Btw, in that thread I also said:

"If we have 64kB pages, such architectures will have to have a bigger
kernel stack. Which they will have, simply by virtue of having the very
same bigger page. So that problem kind of solves itself."

and that may still be the "right" solution - if somebody is so insane that
they want 64kB pages, then they might as well have a 64kB kernel stack as
well.

Trust me, the kernel stack isn't where you blow your memory with a 64kB
page. You blow all your memory on the memory fragmentation of your page
cache. I did the stats for the kernel source tree a long time ago, and I
think you wasted something like 4GB of RAM with a 64kB page size.

Linus

2008-11-17 21:25:42

by Pekka Enberg

Subject: Re: Large stack usage in fs code (especially for PPC64)

On Mon, Nov 17, 2008 at 11:18 PM, Linus Torvalds
<[email protected]> wrote:
> I do wonder just _what_ it is that causes the stack frames to be so
> horrid. For example, you have
>
> 18) 8896 160 .kmem_cache_alloc+0xfc/0x140
>
> and I'm looking at my x86-64 compile, and it has a stack frame of just 8
> bytes (!) for local variables plus the save/restore area (which looks like
> three registers plus frame pointer plus return address). IOW, if I'm
> looking at the code right (so big caveat: I did _not_ do a real stack
> dump!) the x86-64 stack cost for that same function is on the order of 48
> bytes. Not 160.
>
> Where does that factor-of-three+ difference come from? From the numbers, I
> suspect ppc64 has a 32-byte stack alignment, which may be part of it, and
> I guess the compiler is more eager to use all those extra registers and
> will happily have many more callee-saved regs that are actually used.
>
> But that still a _lot_ of extra stack.
>
> Of course, you may have things like spinlock debugging etc enabled. Some
> of our debugging options do tend to blow things up.

Note that kmem_cache_alloc() is likely to contain lots of inlined
functions for both SLAB and SLUB. Perhaps that blows up stack usage on
ppc?

2008-11-17 21:33:10

by Andrew Morton

Subject: Re: Large stack usage in fs code (especially for PPC64)

On Mon, 17 Nov 2008 13:23:23 -0800 (PST)
Linus Torvalds <[email protected]> wrote:

>
>
> On Mon, 17 Nov 2008, Andrew Morton wrote:
> >
> > Far be it from me to apportion blame, but THIS IS ALL LINUS'S FAULT!!!!! :)
> >
> > I fixed this six years ago. See http://lkml.org/lkml/2002/6/17/68
>
> Btw, in that thread I also said:
>
> "If we have 64kB pages, such architectures will have to have a bigger
> kernel stack. Which they will have, simply by virtue of having the very
> same bigger page. So that problem kind of solves itself."
>
> and that may still be the "right" solution - if somebody is so insane that
> they want 64kB pages, then they might as well have a 64kB kernel stack as
> well.

I'd have thought so, but I'm sure we're about to hear how important an
optimisation the smaller stacks are ;)

> Trust me, the kernel stack isn't where you blow your memory with a 64kB
> page. You blow all your memory on the memory fragmentation of your page
> cache. I did the stats for the kernel source tree a long time ago, and I
> think you wasted something like 4GB of RAM with a 64kB page size.
>

Yup. That being said, the younger me did assert that "this is a neater
implementation anyway". If we can implement those loops without
needing those on-stack temporary arrays then things probably are better
overall.

2008-11-17 21:43:46

by Linus Torvalds

Subject: Re: Large stack usage in fs code (especially for PPC64)



On Mon, 17 Nov 2008, Andrew Morton wrote:
>
> Yup. That being said, the younger me did assert that "this is a neater
> implementation anyway". If we can implement those loops without
> needing those on-stack temporary arrays then things probably are better
> overall.

Sure, if it actually ends up being nicer, I'll not argue with it. But from
an L1 I$ standpoint (and I$ is often very important, especially for kernel
loads where loops are fairly rare), it's often _much_ better to do two
"tight" loops over two subsystems (filesystem and block layer) than it is
to do one bigger loop that contains both. If the L1 can fit both subsystem
paths, you're fine - but if not, you may get a lot more misses.

So it's often nice if you can "stage" things so that you do a cluster of
calls to one area, followed by a cluster of calls to another, rather than
mix it up.

But numbers talk. And code cleanliness. If somebody has numbers that the
code size actually goes down for example, or the code is just more
readable, micro-optimizing cache patterns isn't worth it.

Linus

2008-11-17 22:19:55

by Paul Mackerras

Subject: Re: Large stack usage in fs code (especially for PPC64)

Steven Rostedt writes:

> Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that
> softirqs still use the same stack as the process.

They shouldn't. I don't see do_softirq in the trace, though. Which
functions did you think would be run in a softirq? It looks to me
like the deepest 10 or so functions are called at hard irq level,
within hrtimer_interrupt called from timer_interrupt.

> root@electra ~> cat /debug/tracing/stack_trace
> Depth Size Location (59 entries)
> ----- ---- --------
> 0) 12384 192 ftrace_call+0x4/0x14
> 1) 12192 128 .sched_clock+0x20/0x60
> 2) 12064 128 .sched_clock_cpu+0x34/0x50
> 3) 11936 144 .cpu_clock+0x3c/0xa0
> 4) 11792 144 .get_timestamp+0x2c/0x50
> 5) 11648 144 .__touch_softlockup_watchdog+0x3c/0x60
> 6) 11504 192 .softlockup_tick+0xe4/0x220
> 7) 11312 128 .run_local_timers+0x34/0x50
> 8) 11184 160 .update_process_times+0x44/0xb0
> 9) 11024 176 .tick_sched_timer+0x8c/0x120
> 10) 10848 160 .__run_hrtimer+0xd8/0x130
> 11) 10688 240 .hrtimer_interrupt+0x16c/0x220
> 12) 10448 160 .timer_interrupt+0xcc/0x110
> 13) 10288 96 decrementer_common+0xe0/0x100

Paul.

2008-11-17 22:53:50

by Paul Mackerras

Subject: Re: Large stack usage in fs code (especially for PPC64)

Linus Torvalds writes:

> The ppc people run databases, and they don't care about sane people

And HPC apps, and all sorts of other things...

> telling them the big pages suck. It's made worse by the fact that they
> also have horribly bad TLB fills on their broken CPU's, and years and

Taking page faults at a 4k granularity also sucks. A lot of the
performance improvement from using 64k pages comes just from executing
the page fault path (and other paths in the mm code) less frequently.
That's why we see a performance improvement from 64k pages even on
machines that don't support 64k pages in hardware (like the 21%
reduction in system time on a kernel compile that I reported earlier).
That has nothing to do with TLB miss times or anything to do with the
MMU.

I'd love to be able to use a 4k base page size if I could still get
the reduction in page faults and the expanded TLB reach that we get
now with 64k pages. If we could allocate the page cache for large
files with order-4 allocations wherever possible that would be a good
start. I think Christoph Lameter had some patches in that direction
but they didn't seem to get very far.

> years of telling people that the MMU on ppc's are sh*t has only been
> reacted to with "talk to the hand, we know better".

Well, it's been reacted to with "AIX can use small pages where it
makes sense, and large pages where that makes sense, so why can't
Linux?"

> I suspect the PPC people need to figure out some way to handle this in
> their broken setups (since I don't really expect them to finally admit
> that they were full of sh*t with their big pages), but since I think it's
> a ppc bug, I'm not at all interested in a fix that penalizes the _good_
> case.

Looks like we should make the stack a bit larger.

Paul.

2008-11-17 23:04:46

by Benjamin Herrenschmidt

Subject: Re: Large stack usage in fs code (especially for PPC64)

On Mon, 2008-11-17 at 15:34 -0500, Steven Rostedt wrote:
> Note, I was using a default config that had CONFIG_IRQSTACKS off and
> CONFIG_PPC_64K_PAGES on.

For one, we definitely need to turn IRQSTACKS on by default ... In fact,
I'm pondering just removing the option.

Cheers,
Ben.

2008-11-17 23:05:28

by Benjamin Herrenschmidt

Subject: Re: Large stack usage in fs code (especially for PPC64)

On Mon, 2008-11-17 at 15:59 -0500, Steven Rostedt wrote:
> On Mon, 17 Nov 2008, Steven Rostedt wrote:
> >
> > Note, I was using a default config that had CONFIG_IRQSTACKS off and
> > CONFIG_PPC_64K_PAGES on.
>
> Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that
> softirqs still use the same stack as the process.

No, they should use their own, unless somebody changed the softirq code
in ways that break us... (or unless I missed something; we based our
implementation roughly on what x86 did back then).

Ben.

> root@electra ~> cat /debug/tracing/stack_trace
> Depth Size Location (59 entries)
> ----- ---- --------
> 0) 12384 192 ftrace_call+0x4/0x14
> 1) 12192 128 .sched_clock+0x20/0x60
> 2) 12064 128 .sched_clock_cpu+0x34/0x50
> 3) 11936 144 .cpu_clock+0x3c/0xa0
> 4) 11792 144 .get_timestamp+0x2c/0x50
> 5) 11648 144 .__touch_softlockup_watchdog+0x3c/0x60
> 6) 11504 192 .softlockup_tick+0xe4/0x220
> 7) 11312 128 .run_local_timers+0x34/0x50
> 8) 11184 160 .update_process_times+0x44/0xb0
> 9) 11024 176 .tick_sched_timer+0x8c/0x120
> 10) 10848 160 .__run_hrtimer+0xd8/0x130
> 11) 10688 240 .hrtimer_interrupt+0x16c/0x220
> 12) 10448 160 .timer_interrupt+0xcc/0x110
> 13) 10288 96 decrementer_common+0xe0/0x100
> 14) 10192 784 .ftrace_test_stop_func+0x4c/0x70
> 15) 9408 112 ftrace_call+0x4/0x14
> 16) 9296 144 .init_buffer_head+0x20/0x60
> 17) 9152 256 .cache_alloc_refill+0x5a0/0x740
> 18) 8896 160 .kmem_cache_alloc+0xfc/0x140
> 19) 8736 144 .alloc_buffer_head+0x3c/0xc0
> 20) 8592 176 .alloc_page_buffers+0xb0/0x130
> 21) 8416 160 .create_empty_buffers+0x44/0x1a0
> 22) 8256 1280 .block_read_full_page+0x3b0/0x430
> 23) 6976 144 .blkdev_readpage+0x38/0x60
> 24) 6832 176 .read_cache_page_async+0x124/0x230
> 25) 6656 160 .read_cache_page+0x50/0xe0
> 26) 6496 160 .read_dev_sector+0x58/0x100
> 27) 6336 272 .msdos_partition+0xa4/0x880
> 28) 6064 240 .rescan_partitions+0x1f0/0x430
> 29) 5824 208 .__blkdev_get+0x124/0x3d0
> 30) 5616 144 .blkdev_get+0x3c/0x60
> 31) 5472 192 .register_disk+0x194/0x1f0
> 32) 5280 160 .add_disk+0xe8/0x160
> 33) 5120 224 .sd_probe+0x2f0/0x440
> 34) 4896 176 .driver_probe_device+0x14c/0x380
> 35) 4720 144 .__device_attach+0x38/0x60
> 36) 4576 192 .bus_for_each_drv+0xb8/0x120
> 37) 4384 160 .device_attach+0x94/0xe0
> 38) 4224 144 .bus_attach_device+0x78/0xb0
> 39) 4080 240 .device_add+0x54c/0x730
> 40) 3840 160 .scsi_sysfs_add_sdev+0x7c/0x2e0
> 41) 3680 336 .scsi_probe_and_add_lun+0xb7c/0xc60
> 42) 3344 192 .__scsi_add_device+0x11c/0x150
> 43) 3152 176 .ata_scsi_scan_host+0x238/0x330
> 44) 2976 208 .ata_host_register+0x320/0x3c0
> 45) 2768 192 .ata_host_activate+0xf8/0x190
> 46) 2576 224 .mv_pci_init_one+0x284/0x430
> 47) 2352 160 .pci_device_probe+0x98/0xf0
> 48) 2192 176 .driver_probe_device+0x14c/0x380
> 49) 2016 160 .__driver_attach+0xd4/0x110
> 50) 1856 192 .bus_for_each_dev+0xa8/0x100
> 51) 1664 144 .driver_attach+0x40/0x60
> 52) 1520 192 .bus_add_driver+0x268/0x340
> 53) 1328 176 .driver_register+0x88/0x1d0
> 54) 1152 176 .__pci_register_driver+0x90/0x100
> 55) 976 144 .mv_init+0x38/0xa0
> 56) 832 544 .do_one_initcall+0x78/0x240
> 57) 288 192 .kernel_init+0x21c/0x2b0
> 58) 96 96 .kernel_thread+0x54/0x70
>
> This is still 12K. Kind of big even for a 16K stack.
>
> -- Steve

2008-11-17 23:13:58

by Benjamin Herrenschmidt

Subject: Re: Large stack usage in fs code (especially for PPC64)


> Well, it's not unacceptable on good CPU's with 4kB blocks (just an 8-entry
> array), but as you say:
>
> > On PPC64 I'm told that the page size is 64K, which makes the above equal
> > to: 64K / 512 = 128 multiply that by 8 byte words, we have 1024 bytes.
>
> Yeah. Not good. I think 64kB pages are insane. In fact, I think 32kB
> pages are insane, and 16kB pages are borderline. I've told people so.
>
> The ppc people run databases, and they don't care about sane people
> telling them the big pages suck.

Hehe :-)

Guess who is pushing for larger page sizes nowadays? Embedded
people :-) In fact, we have patches submitted on the list to offer the
option for ... 256K pages on some 44x embedded CPUs :-)

It makes some sort of sense I suppose on very static embedded workloads
with no swap nor demand paging.

> It's made worse by the fact that they
> also have horribly bad TLB fills on their broken CPU's, and years and
> years of telling people that the MMU on ppc's are sh*t has only been
> reacted to with "talk to the hand, we know better".

Who are you talking about here precisely? I don't think either Paul or
I ever said anything along those lines ... Oh well.

But yeah, our existing server CPUs have pretty poor TLB refills and yes,
64K pages help. And yes, we would like things to be different, but they
aren't.

But there is also pressure to get larger page sizes from small embedded
field, where CPUs have even poorer TLB refill (software loaded
basically) :-)

> Quite frankly, 64kB pages are INSANE. But yes, in this case they actually
> cause bugs. With a sane page-size, that *arr[MAX_BUF_PER_PAGE] thing uses
> 64 bytes, not 1kB.

Come on, the code is crap to allocate that on the stack anyway :-)

> Of course, that would likely mean that FAT etc wouldn't work on ppc64, so
> I don't think that's a valid model either. But if the 64kB page size is
> just a "database server crazy-people config option", then maybe it's
> acceptable.

Well, as I said, embedded folks are wanting that too ...

Cheers,
Ben.

2008-11-17 23:17:39

by Benjamin Herrenschmidt

Subject: Re: Large stack usage in fs code (especially for PPC64)


> I'd have thought so, but I'm sure we're about to hear how important an
> optimisation the smaller stacks are ;)

Not sure; I tend to agree that it would make sense to bump our stack to
64K on 64K pages. It's not like we are saving anything, and we are
probably adding overhead in alloc/dealloc. I'll see what Paul thinks
here.

> Yup. That being said, the younger me did assert that "this is a neater
> implementation anyway". If we can implement those loops without
> needing those on-stack temporary arrays then things probably are better
> overall.

Amen.

Cheers,
Ben.

2008-11-17 23:22:27

by Josh Boyer

Subject: Re: Large stack usage in fs code (especially for PPC64)

On Tue, 18 Nov 2008 10:13:16 +1100
Benjamin Herrenschmidt <[email protected]> wrote:

>
> > Well, it's not unacceptable on good CPU's with 4kB blocks (just an 8-entry
> > array), but as you say:
> >
> > > On PPC64 I'm told that the page size is 64K, which makes the above equal
> > > to: 64K / 512 = 128 multiply that by 8 byte words, we have 1024 bytes.
> >
> > Yeah. Not good. I think 64kB pages are insane. In fact, I think 32kB
> > pages are insane, and 16kB pages are borderline. I've told people so.
> >
> > The ppc people run databases, and they don't care about sane people
> > telling them the big pages suck.
>
> Hehe :-)
>
> Guess who is pushing for larger page sizes nowadays ? Embedded
> people :-) In fact, we have patches submited on the list to offer the
> option for ... 256K pages on some 44x embedded CPUs :-)

For clarification, that workload is very specific: embedded 44x
CPUs used in RAID cards. I'm not entirely convinced bringing 256K
pages into mainline is a good thing yet anyway.

64K pages, while seemingly insane for embedded boards that typically
have less than 512 MiB of DRAM, help for a somewhat larger set of
workloads. A KVM host is the primary winner at the moment, but
given the small number of TLB entries on these CPUs, it can pay off
elsewhere as well.

josh

2008-11-17 23:29:54

by Linus Torvalds

Subject: Re: Large stack usage in fs code (especially for PPC64)



On Tue, 18 Nov 2008, Benjamin Herrenschmidt wrote:
>
> Guess who is pushing for larger page sizes nowadays ? Embedded
> people :-) In fact, we have patches submited on the list to offer the
> option for ... 256K pages on some 44x embedded CPUs :-)
>
> It makes some sort of sense I suppose on very static embedded workloads
> with no swap nor demand paging.

It makes perfect sense for anything that doesn't use any MMU.

The hugepage support seems to cover many of the relevant cases, ie
databases and things like big static mappings (frame buffers etc).

> > It's made worse by the fact that they
> > also have horribly bad TLB fills on their broken CPU's, and years and
> > years of telling people that the MMU on ppc's are sh*t has only been
> > reacted to with "talk to the hand, we know better".
>
> Who are you talking about here precisely ? I don't think either Paul or
> I every said something nearly around those lines ... Oh well.

Every single time I've complained about it, somebody from IBM has said "..
but but AIX".

This time it was Paul. Sometimes it has been software people who agree,
but point to hardware designers who "know better". If it's not some insane
database person, it's a Fortran program that runs for days.

> But there is also pressure to get larger page sizes from small embedded
> field, where CPUs have even poorer TLB refill (software loaded
> basically) :-)

Yeah, I agree that you _can_ have even worse MMU's. I'm not saying that
PPC64 is absolutely pessimal and cannot be made worse. Software fill is
indeed even worse from a performance angle, despite the fact that it's
really "nice" from a conceptual angle.

Of course, of the software-fill users that remain, many do seem to be ppc...
It's like the architecture brings out the worst in hardware designers.

> > Quite frankly, 64kB pages are INSANE. But yes, in this case they actually
> > cause bugs. With a sane page-size, that *arr[MAX_BUF_PER_PAGE] thing uses
> > 64 bytes, not 1kB.
>
> Come on, the code is crap to allocate that on the stack anyway :-)

Why? We do actually expect to be able to use stack-space for small
structures. We do it for a lot of things, including stuff like select()
optimistically using arrays allocated on the stack for the common small
case, just because it's, oh, about infinitely faster to do than to use
kmalloc().
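
The general shape of that pattern looks roughly like this (a sketch, not
the actual fs/select.c code; the buffer size and do_work() are
illustrative):

#include <linux/kernel.h>
#include <linux/slab.h>

extern long do_work(long *buf, unsigned int nr);	/* made-up worker */

/* Small on-stack buffer for the common case, kmalloc() fallback only
 * when the request is too large to fit. */
static long process(unsigned int nr)
{
	long stack_buf[32];		/* covers the common small case for free */
	long *buf = stack_buf;
	long ret;

	if (nr > ARRAY_SIZE(stack_buf)) {
		buf = kmalloc(nr * sizeof(*buf), GFP_KERNEL);
		if (!buf)
			return -ENOMEM;
	}

	ret = do_work(buf, nr);

	if (buf != stack_buf)
		kfree(buf);
	return ret;
}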

Many of the page cache functions also have the added twist that they get
called from low-memory setups (eg write_whole_page()), and so try to
minimize allocations for that reason too.

Linus

2008-11-17 23:31:14

by Steven Rostedt

Subject: Re: Large stack usage in fs code (especially for PPC64)


On Tue, 18 Nov 2008, Paul Mackerras wrote:

> Steven Rostedt writes:
>
> > Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that
> > softirqs still use the same stack as the process.
>
> They shouldn't. I don't see do_softirq in the trace, though. Which
> functions did you think would be run in a softirq? It looks to me
> like the deepest 10 or so functions are called at hard irq level,
> within hrtimer_interrupt called from timer_interrupt.
>
> > root@electra ~> cat /debug/tracing/stack_trace
> > Depth Size Location (59 entries)
> > ----- ---- --------
> > 0) 12384 192 ftrace_call+0x4/0x14
> > 1) 12192 128 .sched_clock+0x20/0x60
> > 2) 12064 128 .sched_clock_cpu+0x34/0x50
> > 3) 11936 144 .cpu_clock+0x3c/0xa0
> > 4) 11792 144 .get_timestamp+0x2c/0x50
> > 5) 11648 144 .__touch_softlockup_watchdog+0x3c/0x60
> > 6) 11504 192 .softlockup_tick+0xe4/0x220
> > 7) 11312 128 .run_local_timers+0x34/0x50
> > 8) 11184 160 .update_process_times+0x44/0xb0
> > 9) 11024 176 .tick_sched_timer+0x8c/0x120
> > 10) 10848 160 .__run_hrtimer+0xd8/0x130
> > 11) 10688 240 .hrtimer_interrupt+0x16c/0x220
> > 12) 10448 160 .timer_interrupt+0xcc/0x110
> > 13) 10288 96 decrementer_common+0xe0/0x100

Ah, you are right. I thought I saw softirq code in there, but must have
been seeing things. I need to get my eyesight checked. The other day I
totally botched a hand in cards because I thought I had 3 Aces of diamonds
when I really only had 2 Aces of diamonds and an Ace of hearts.

-- Steve

2008-11-17 23:31:36

by Benjamin Herrenschmidt

Subject: Re: Large stack usage in fs code (especially for PPC64)

On Mon, 2008-11-17 at 15:34 -0500, Steven Rostedt wrote:
>
> I've been hitting stack overflows on a PPC64 box, so I ran the ftrace
> stack_tracer and part of the problem with that box is that it can nest
> interrupts too deep. But what also worries me is that there's some heavy
> hitters of stacks in generic code. Namely the fs directory has some.

Note that we shouldn't stack interrupts much in practice. The PIC will
not let same or lower priority interrupts in until we have completed one.
However, the timer/decrementer does not go through the PIC, so I think what
happens is: we get a hw IRQ; on the way back, just before returning from
do_IRQ (so we have completed the IRQ from the PIC's standpoint), we go
into softirqs, at which point, deep inside SCSI, we get another HW IRQ
and we stack a decrementer interrupt on top of it.

Now, we should do stack switching for both HW IRQs and softirqs with
CONFIG_IRQSTACKS, which should significantly alleviate the problem.

Your second trace also shows how horrible the stack traces can be when
the device model kicks in, i.e. register -> probe -> register sub-device ->
etc... That isn't going to be nice on x86 with 4k stacks either.

I wonder if we should generally recommend that drivers of "bus" devices
not register sub-devices from their own probe() routine, but defer
that to a kernel thread... because the stacking can be pretty bad. I
mean, nobody's done SATA over USB yet, but heh :-)

Cheers,
Ben.

2008-11-18 00:12:04

by Paul Mackerras

Subject: Re: Large stack usage in fs code (especially for PPC64)

Linus Torvalds writes:

> > > It's made worse by the fact that they
> > > also have horribly bad TLB fills on their broken CPU's, and years and
> > > years of telling people that the MMU on ppc's are sh*t has only been
> > > reacted to with "talk to the hand, we know better".
> >
> > Who are you talking about here precisely ? I don't think either Paul or
> > I every said something nearly around those lines ... Oh well.
>
> Every single time I've complained about it, somebody from IBM has said "..
> but but AIX".
>
> This time it was Paul. Sometimes it has been software people who agree,

Maybe I wasn't clear. I was reporting the reaction I get from the
hardware designers, who often like to push off the hard problems to
software to fix.

Keep flaming, by the way. It provides me with some amount of
assistance in my quest to get the hardware designers to do better. ;)

Also, you didn't respond to my comments about the purely software
benefits of a larger page size.

Paul.

2008-11-18 00:54:24

by Steven Rostedt

Subject: Re: Large stack usage in fs code (especially for PPC64)


On Mon, 17 Nov 2008, Linus Torvalds wrote:
>
> I do wonder just _what_ it is that causes the stack frames to be so
> horrid. For example, you have
>
> 18) 8896 160 .kmem_cache_alloc+0xfc/0x140
>
> and I'm looking at my x86-64 compile, and it has a stack frame of just 8
> bytes (!) for local variables plus the save/restore area (which looks like
> three registers plus frame pointer plus return address). IOW, if I'm
> looking at the code right (so big caveat: I did _not_ do a real stack
> dump!) the x86-64 stack cost for that same function is on the order of 48
> bytes. Not 160.

Out of curiosity, I just ran stack_trace on the latest version of git
(pulled sometime today) and ran it on my x86_64.

I have SLUB and SLUB debug defined, and here's what I found:

11) 3592 64 kmem_cache_alloc+0x64/0xa3

64 bytes, still much lower than the 160 of PPC64.

Just to see where it got that number, I ran objdump on slub.o.

000000000000300e <kmem_cache_alloc_node>:
300e: 55 push %rbp
300f: 48 89 e5 mov %rsp,%rbp
3012: 41 57 push %r15
3014: 41 56 push %r14
3016: 41 55 push %r13
3018: 41 54 push %r12
301a: 53 push %rbx
301b: 48 83 ec 08 sub $0x8,%rsp

Six words pushed, plus the 8-byte stack frame:

6*8 + 8 = 56

Add the push of the function return address, which is another 8 bytes, and
that gives us 64 bytes.

Here's the complete dump:

[root@bxrhel51 linux-compile.git]# cat /debug/tracing/stack_trace
Depth Size Location (54 entries)
----- ---- --------
0) 4504 48 __mod_zone_page_state+0x59/0x68
1) 4456 64 __rmqueue_smallest+0xa0/0xd9
2) 4392 80 __rmqueue+0x24/0x172
3) 4312 96 rmqueue_bulk+0x57/0xa3
4) 4216 224 get_page_from_freelist+0x371/0x6e1
5) 3992 160 __alloc_pages_internal+0xe0/0x3f8
6) 3832 16 __alloc_pages_nodemask+0xe/0x10
7) 3816 48 alloc_pages_current+0xbe/0xc7
8) 3768 16 alloc_slab_page+0x28/0x34
9) 3752 64 new_slab+0x4a/0x1bb
10) 3688 96 __slab_alloc+0x203/0x364
11) 3592 64 kmem_cache_alloc+0x64/0xa3
12) 3528 48 alloc_buffer_head+0x22/0x9d
13) 3480 64 alloc_page_buffers+0x2f/0xd1
14) 3416 48 create_empty_buffers+0x22/0xb5
15) 3368 176 block_read_full_page+0x6b/0x25c
16) 3192 16 blkdev_readpage+0x18/0x1a
17) 3176 64 read_cache_page_async+0x85/0x11a
18) 3112 32 read_cache_page+0x13/0x48
19) 3080 48 read_dev_sector+0x36/0xaf
20) 3032 96 read_lba+0x51/0xb0
21) 2936 176 efi_partition+0x92/0x585
22) 2760 128 rescan_partitions+0x173/0x308
23) 2632 96 __blkdev_get+0x22a/0x2ea
24) 2536 16 blkdev_get+0x10/0x12
25) 2520 80 register_disk+0xe5/0x14a
26) 2440 48 add_disk+0xbd/0x11f
27) 2392 96 sd_probe+0x2c5/0x3ad
28) 2296 48 driver_probe_device+0xc5/0x14c
29) 2248 16 __device_attach+0xe/0x10
30) 2232 64 bus_for_each_drv+0x56/0x8d
31) 2168 48 device_attach+0x68/0x7f
32) 2120 32 bus_attach_device+0x2d/0x5e
33) 2088 112 device_add+0x45f/0x5d3
34) 1976 64 scsi_sysfs_add_sdev+0xb7/0x246
35) 1912 336 scsi_probe_and_add_lun+0x9f9/0xad7
36) 1576 80 __scsi_add_device+0xb6/0xe5
37) 1496 80 ata_scsi_scan_host+0x9e/0x1c6
38) 1416 64 ata_host_register+0x238/0x252
39) 1352 96 ata_pci_sff_activate_host+0x1a1/0x1ce
40) 1256 256 piix_init_one+0x646/0x6b2
41) 1000 144 pci_device_probe+0xc9/0x120
42) 856 48 driver_probe_device+0xc5/0x14c
43) 808 48 __driver_attach+0x67/0x91
44) 760 64 bus_for_each_dev+0x54/0x84
45) 696 16 driver_attach+0x21/0x23
46) 680 64 bus_add_driver+0xba/0x204
47) 616 64 driver_register+0x98/0x110
48) 552 48 __pci_register_driver+0x6b/0xa4
49) 504 16 piix_init+0x19/0x2c
50) 488 272 _stext+0x5d/0x145
51) 216 32 kernel_init+0x127/0x17a
52) 184 184 child_rip+0xa/0x11

-- Steve

2008-11-18 01:05:42

by Paul Mackerras

Subject: Re: Large stack usage in fs code (especially for PPC64)

Steve,

> On Mon, 17 Nov 2008, Linus Torvalds wrote:
> >
> > I do wonder just _what_ it is that causes the stack frames to be so
> > horrid. For example, you have
> >
> > 18) 8896 160 .kmem_cache_alloc+0xfc/0x140
> >
> > and I'm looking at my x86-64 compile, and it has a stack frame of just 8
> > bytes (!) for local variables plus the save/restore area (which looks like
> > three registers plus frame pointer plus return address). IOW, if I'm
> > looking at the code right (so big caveat: I did _not_ do a real stack
> > dump!) the x86-64 stack cost for that same function is on the order of 48
> > bytes. Not 160.
>
> Out of curiosity, I just ran stack_trace on the latest version of git
> (pulled sometime today) and ran it on my x86_64.
>
> I have SLUB and SLUB debug defined, and here's what I found:
>
> 11) 3592 64 kmem_cache_alloc+0x64/0xa3
>
> 64 bytes, still much lower than the 160 of PPC64.

The ppc64 ABI has a minimum stack frame of 112 bytes, due to having an
area for called functions to store their parameters (64 bytes) plus 6
slots for saving stuff and for the compiler and linker to use if they
need to. That's before any local variables are allocated.

The ppc32 ABI has a minimum stack frame of 16 bytes, which is much
nicer, at the expense of a much more complicated va_arg().
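
For the 64-bit case, those 112 bytes break down roughly like this (a sketch
of the 64-bit PowerPC ELF ABI frame header; the struct and field names are
illustrative, not taken from any kernel header):

/* Minimum ppc64 stack frame, per the description above. */
struct ppc64_min_frame {
        unsigned long back_chain;       /* pointer to the caller's frame */
        unsigned long cr_save;          /* condition register save slot */
        unsigned long lr_save;          /* link register (return address) */
        unsigned long compiler_dword;   /* reserved for the compiler */
        unsigned long linker_dword;     /* reserved for the linker */
        unsigned long toc_save;         /* TOC pointer (r2) save slot */
        unsigned long param_save[8];    /* parameter save area, 64 bytes */
};
/* 6*8 + 8*8 = 112 bytes before any local variables. */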

Paul.

2008-11-18 01:42:09

by Steven Rostedt

[permalink] [raw]
Subject: Re: Large stack usage in fs code (especially for PPC64)


On Tue, 18 Nov 2008, Paul Mackerras wrote:
> >
> > 64 bytes, still much lower than the 160 of PPC64.
>
> The ppc64 ABI has a minimum stack frame of 112 bytes, due to having an
> area for called functions to store their parameters (64 bytes) plus 6
> slots for saving stuff and for the compiler and linker to use if they
> need to. That's before any local variables are allocated.
>
> The ppc32 ABI has a minimum stack frame of 16 bytes, which is much
> nicer, at the expense of a much more complicated va_arg().

Here's a full dump that I got from my 32bit Powerbook:

rostedt@gollum:~$ cat /debug/tracing/stack_trace
Depth Size Location (57 entries)
----- ---- --------
0) 3196 48 ftrace_call+0x4/0x48
1) 3148 48 update_curr+0xa0/0x110
2) 3100 48 enqueue_task_fair+0x54/0xfc
3) 3052 32 enqueue_task+0x34/0x54
4) 3020 16 activate_task+0x40/0x64
5) 3004 48 try_to_wake_up+0x78/0x164
6) 2956 16 default_wake_function+0x28/0x40
7) 2940 48 __wake_up_common+0x54/0xac
8) 2892 32 __wake_up_sync+0x58/0x94
9) 2860 16 sock_def_readable+0x58/0xb8
10) 2844 32 sock_queue_rcv_skb+0x10c/0x128
11) 2812 32 __udp_queue_rcv_skb+0x2c/0xc4
12) 2780 32 udp_queue_rcv_skb+0x1d8/0x27c
13) 2748 96 __udp4_lib_rcv+0x514/0x77c
14) 2652 16 udp_rcv+0x30/0x48
15) 2636 32 ip_local_deliver_finish+0xf4/0x1cc
16) 2604 16 ip_local_deliver+0x94/0xac
17) 2588 64 ip_rcv_finish+0x2ec/0x31c
18) 2524 32 ip_rcv+0x23c/0x278
19) 2492 48 netif_receive_skb+0x48c/0x4c0
20) 2444 48 process_backlog+0xb0/0x144
21) 2396 48 net_rx_action+0x8c/0x1bc
22) 2348 48 __do_softirq+0x84/0x10c
23) 2300 16 do_softirq+0x50/0x6c
24) 2284 16 local_bh_enable+0x90/0xc8
25) 2268 32 dev_queue_xmit+0x564/0x5b0
26) 2236 48 ip_finish_output+0x2b4/0x2f4
27) 2188 32 ip_output+0xec/0x104
28) 2156 16 ip_local_out+0x44/0x5c
29) 2140 48 ip_push_pending_frames+0x340/0x3ec
30) 2092 48 udp_push_pending_frames+0x2ac/0x348
31) 2044 160 udp_sendmsg+0x458/0x5bc
32) 1884 32 inet_sendmsg+0x70/0x88
33) 1852 240 sock_sendmsg+0xd0/0xf8
34) 1612 32 kernel_sendmsg+0x3c/0x58
35) 1580 96 xs_send_kvec+0xb4/0xcc [sunrpc]
36) 1484 64 xs_sendpages+0x94/0x1f0 [sunrpc]
37) 1420 32 xs_udp_send_request+0x44/0x14c [sunrpc]
38) 1388 32 xprt_transmit+0x100/0x228 [sunrpc]
39) 1356 32 call_transmit+0x234/0x290 [sunrpc]
40) 1324 48 __rpc_execute+0xcc/0x2d4 [sunrpc]
41) 1276 32 rpc_execute+0x48/0x60 [sunrpc]
42) 1244 32 rpc_run_task+0x74/0x90 [sunrpc]
43) 1212 64 rpc_call_sync+0x6c/0xa0 [sunrpc]
44) 1148 80 rpcb_register_call+0xb0/0x110 [sunrpc]
45) 1068 96 rpcb_register+0xe8/0x100 [sunrpc]
46) 972 80 svc_register+0x118/0x168 [sunrpc]
47) 892 64 svc_setup_socket+0xa0/0x354 [sunrpc]
48) 828 272 svc_create_socket+0x1e0/0x250 [sunrpc]
49) 556 16 svc_udp_create+0x38/0x50 [sunrpc]
50) 540 96 svc_create_xprt+0x1c8/0x2fc [sunrpc]
51) 444 48 lockd_up+0xcc/0x27c [lockd]
52) 396 96 write_ports+0xf4/0x2cc [nfsd]
53) 300 32 nfsctl_transaction_write+0x78/0xbc [nfsd]
54) 268 32 vfs_write+0xc8/0x178
55) 236 48 sys_write+0x5c/0xa0
56) 188 188 ret_from_syscall+0x0/0x40

And here's my i386 max stack:

[root@mirhel51 ~]# cat /debug/tracing/stack_trace
Depth Size Location (47 entries)
----- ---- --------
0) 2216 240 blk_recount_segments+0x39/0x51
1) 1976 12 bio_phys_segments+0x16/0x1c
2) 1964 20 blk_rq_bio_prep+0x29/0xae
3) 1944 16 init_request_from_bio+0xc9/0xcd
4) 1928 60 __make_request+0x2f6/0x3c1
5) 1868 136 generic_make_request+0x36a/0x3a0
6) 1732 56 submit_bio+0xcd/0xd8
7) 1676 28 submit_bh+0xd1/0xf0
8) 1648 112 block_read_full_page+0x299/0x2a9
9) 1536 8 blkdev_readpage+0x14/0x16
10) 1528 28 read_cache_page_async+0x7e/0x109
11) 1500 16 read_cache_page+0x11/0x49
12) 1484 32 read_dev_sector+0x3c/0x72
13) 1452 48 read_lba+0x4d/0xaa
14) 1404 168 efi_partition+0x85/0x61b
15) 1236 84 rescan_partitions+0x14b/0x316
16) 1152 40 __blkdev_get+0x1b2/0x258
17) 1112 8 blkdev_get+0xf/0x11
18) 1104 36 register_disk+0xbc/0x112
19) 1068 32 add_disk+0x9f/0xf3
20) 1036 48 sd_probe+0x2d4/0x394
21) 988 20 driver_probe_device+0xa5/0x120
22) 968 8 __device_attach+0xd/0xf
23) 960 28 bus_for_each_drv+0x3e/0x67
24) 932 24 device_attach+0x56/0x6a
25) 908 16 bus_attach_device+0x26/0x4d
26) 892 64 device_add+0x380/0x4aa
27) 828 28 scsi_sysfs_add_sdev+0xa1/0x1c9
28) 800 160 scsi_probe_and_add_lun+0x97e/0xa8f
29) 640 36 __scsi_add_device+0x88/0xae
30) 604 40 ata_scsi_scan_host+0x8b/0x1cd
31) 564 28 ata_host_register+0x1da/0x1ea
32) 536 24 ata_host_activate+0x98/0xb5
33) 512 188 ahci_init_one+0xa23/0xa4f
34) 324 20 pci_device_probe+0x3e/0x5e
35) 304 20 driver_probe_device+0xa5/0x120
36) 284 20 __driver_attach+0x51/0x70
37) 264 32 bus_for_each_dev+0x40/0x62
38) 232 12 driver_attach+0x19/0x1b
39) 220 28 bus_add_driver+0x9c/0x1ac
40) 192 28 driver_register+0x76/0xd2
41) 164 20 __pci_register_driver+0x44/0x70
42) 144 8 ahci_init+0x14/0x16
43) 136 92 _stext+0x4f/0x116
44) 44 16 kernel_init+0xf7/0x145
45) 28 28 kernel_thread_helper+0x7/0x10

Note, the i386 box had 4KSTACKS defined and the PPC did not have
IRQSTACKS defined. I'll compile with IRQSTACKS defined and post that
result.

-- Steve

2008-11-18 02:02:14

by Steven Rostedt

[permalink] [raw]
Subject: Re: Large stack usage in fs code (especially for PPC64)


On Mon, 17 Nov 2008, Steven Rostedt wrote:
>
> And here's my i386 max stack:
>
> [root@mirhel51 ~]# cat /debug/tracing/stack_trace
> Depth Size Location (47 entries)
> ----- ---- --------
> 0) 2216 240 blk_recount_segments+0x39/0x51
> 1) 1976 12 bio_phys_segments+0x16/0x1c
> 2) 1964 20 blk_rq_bio_prep+0x29/0xae
> 3) 1944 16 init_request_from_bio+0xc9/0xcd
> 4) 1928 60 __make_request+0x2f6/0x3c1
> 5) 1868 136 generic_make_request+0x36a/0x3a0
> 6) 1732 56 submit_bio+0xcd/0xd8
> 7) 1676 28 submit_bh+0xd1/0xf0
> 8) 1648 112 block_read_full_page+0x299/0x2a9
> 9) 1536 8 blkdev_readpage+0x14/0x16
> 10) 1528 28 read_cache_page_async+0x7e/0x109
> 11) 1500 16 read_cache_page+0x11/0x49
> 12) 1484 32 read_dev_sector+0x3c/0x72
> 13) 1452 48 read_lba+0x4d/0xaa
> 14) 1404 168 efi_partition+0x85/0x61b
> 15) 1236 84 rescan_partitions+0x14b/0x316
> 16) 1152 40 __blkdev_get+0x1b2/0x258
> 17) 1112 8 blkdev_get+0xf/0x11
> 18) 1104 36 register_disk+0xbc/0x112
> 19) 1068 32 add_disk+0x9f/0xf3
> 20) 1036 48 sd_probe+0x2d4/0x394
> 21) 988 20 driver_probe_device+0xa5/0x120
> 22) 968 8 __device_attach+0xd/0xf
> 23) 960 28 bus_for_each_drv+0x3e/0x67
> 24) 932 24 device_attach+0x56/0x6a
> 25) 908 16 bus_attach_device+0x26/0x4d
> 26) 892 64 device_add+0x380/0x4aa
> 27) 828 28 scsi_sysfs_add_sdev+0xa1/0x1c9
> 28) 800 160 scsi_probe_and_add_lun+0x97e/0xa8f
> 29) 640 36 __scsi_add_device+0x88/0xae
> 30) 604 40 ata_scsi_scan_host+0x8b/0x1cd
> 31) 564 28 ata_host_register+0x1da/0x1ea
> 32) 536 24 ata_host_activate+0x98/0xb5
> 33) 512 188 ahci_init_one+0xa23/0xa4f
> 34) 324 20 pci_device_probe+0x3e/0x5e
> 35) 304 20 driver_probe_device+0xa5/0x120
> 36) 284 20 __driver_attach+0x51/0x70
> 37) 264 32 bus_for_each_dev+0x40/0x62
> 38) 232 12 driver_attach+0x19/0x1b
> 39) 220 28 bus_add_driver+0x9c/0x1ac
> 40) 192 28 driver_register+0x76/0xd2
> 41) 164 20 __pci_register_driver+0x44/0x70
> 42) 144 8 ahci_init+0x14/0x16
> 43) 136 92 _stext+0x4f/0x116
> 44) 44 16 kernel_init+0xf7/0x145
> 45) 28 28 kernel_thread_helper+0x7/0x10
>
> Note, the i386 box had 4KSTACKS defined and the PPC did not have
> IRQSTACKS defined. I'll compile with IRQSTACKS defined and post that
> result.

Here's the PPC32 with IRQSTACKS on:

rostedt@gollum:~$ cat /debug/tracing/stack_trace
Depth Size Location (42 entries)
----- ---- --------
0) 2940 48 ftrace_call+0x4/0x48
1) 2892 16 ide_map_sg+0x4c/0x88
2) 2876 32 ide_build_sglist+0x30/0xa0
3) 2844 64 pmac_ide_dma_setup+0x90/0x2a0
4) 2780 48 do_rw_taskfile+0x250/0x2a0
5) 2732 96 ide_do_rw_disk+0x290/0x2f8
6) 2636 16 ide_gd_do_request+0x30/0x48
7) 2620 176 ide_do_request+0x8e8/0xa70
8) 2444 16 do_ide_request+0x34/0x4c
9) 2428 16 __generic_unplug_device+0x4c/0x64
10) 2412 16 generic_unplug_device+0x4c/0x90
11) 2396 16 blk_unplug+0x34/0x4c
12) 2380 80 dm_table_unplug_all+0x54/0xc0 [dm_mod]
13) 2300 16 dm_unplug_all+0x34/0x54 [dm_mod]
14) 2284 16 blk_unplug+0x34/0x4c
15) 2268 16 blk_backing_dev_unplug+0x28/0x40
16) 2252 16 sync_buffer+0x60/0x80
17) 2236 48 __wait_on_bit+0x78/0xd4
18) 2188 80 out_of_line_wait_on_bit+0x88/0xa0
19) 2108 16 __wait_on_buffer+0x40/0x58
20) 2092 16 __bread+0xbc/0xf0
21) 2076 48 ext3_get_branch+0x90/0x118 [ext3]
22) 2028 208 ext3_get_blocks_handle+0xac/0xa10 [ext3]
23) 1820 48 ext3_get_block+0xc4/0x114 [ext3]
24) 1772 160 do_mpage_readpage+0x26c/0x6f4
25) 1612 144 mpage_readpages+0xe8/0x148
26) 1468 16 ext3_readpages+0x38/0x50 [ext3]
27) 1452 80 __do_page_cache_readahead+0x19c/0x248
28) 1372 32 do_page_cache_readahead+0x80/0x98
29) 1340 80 filemap_fault+0x1b0/0x444
30) 1260 80 __do_fault+0x68/0x61c
31) 1180 64 handle_mm_fault+0x500/0xafc
32) 1116 176 do_page_fault+0x2d4/0x450
33) 940 84 handle_page_fault+0xc/0x80
34) 856 140 0xdeaffa78
35) 716 128 load_elf_binary+0x6a0/0x1048
36) 588 64 search_binary_handler+0x10c/0x310
37) 524 176 load_script+0x248/0x268
38) 348 64 search_binary_handler+0x10c/0x310
39) 284 64 do_execve+0x15c/0x210
40) 220 32 sys_execve+0x68/0x98
41) 188 188 ret_from_syscall+0x0/0x40

It does indeed help.
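
For anyone who hasn't looked at what IRQSTACKS actually does: interrupt
handlers are switched onto a dedicated per-CPU stack instead of piling their
frames on top of whatever task stack they happened to interrupt. A loose
userspace analogy (just a sketch, unrelated to the kernel implementation) is
sigaltstack(), which gives a signal handler its own stack:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static void handler(int sig)
{
        int marker;
        /* printf() is not async-signal-safe; fine for a toy demo only */
        printf("signal %d handled near %p (alternate stack)\n",
               sig, (void *)&marker);
}

int main(void)
{
        stack_t ss = { .ss_sp = malloc(SIGSTKSZ), .ss_size = SIGSTKSZ };
        struct sigaction sa = { .sa_handler = handler, .sa_flags = SA_ONSTACK };

        sigaltstack(&ss, NULL);         /* register the alternate stack */
        sigaction(SIGUSR1, &sa, NULL);  /* deliver SIGUSR1 on that stack */
        raise(SIGUSR1);
        return 0;
}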

-- Steve

2008-11-18 02:08:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: Large stack usage in fs code (especially for PPC64)



On Tue, 18 Nov 2008, Paul Mackerras wrote:
>
> Also, you didn't respond to my comments about the purely software
> benefits of a larger page size.

I realize that there are benefits. It's just that the downsides tend to
swamp the upsides.

The fact is, Intel (and to a lesser degree, AMD) has shown how hardware
can do good TLB's with essentially gang lookups, giving almost effective
page sizes of 32kB with hardly any of the downsides. Couple that with
low-latency fault handling (not for when you miss in the TLB, but for when
something really isn't in the page tables), and it seems to be seldom the
biggest issue.

(Don't get me wrong - TLB's are not unimportant on x86 either. But on x86,
things are generally much better).

Yes, we could prefill the page tables and do other things, and ultimately
if you don't need to - by virtue of big pages - some loads will always
benefit from just making the page size larger.

But the people who advocate large pages seem to never really face the
downsides. They talk about their single loads, and optimize for that and
nothing else. They don't seem to even acknowledge the fact that a 64kB
page size is simply NOT EVEN REMOTELY ACCEPTABLE for other loads!

That's what gets to me. These absolute -idiots- talk about how they win 5%
on some (important, for them) benchmark by doing large pages, but then
ignore the fact that on other real-world loads they lose by several
HUNDRED percent because of the memory fragmentation costs.

(And btw, if they win more than 5%, it's because the hardware sucks really
badly).

THAT is what irritates me.

What also irritates me is the ".. but AIX" argument. The fact is, the AIX
memory management is very tightly tied to one particular broken MMU model.
Linux supports something like thirty architectures, and while PPC may be
one of the top ones, it is NOT EVEN CLOSE to being really relevant.

So ".. but AIX" simply doesn't matter. The Linux VM has other priorities.

And I _guarantee_ that in general, in the high-volume market (which is
what drives things, like it or not), page sizes will not be growing. In
that market, terabytes of RAM is not the primary case, and small files
that want mmap are one _very_ common case.

To make things worse, the biggest performance market has another vendor
that hasn't been saying ".. but AIX" for the last decade, and that
actually listens to input. And, perhaps not incidentally, outperforms the
highest-performance ppc64 chips mostly by a huge margin - while selling
their chips for a fraction of the price.

I realize that this may be hard to accept for some people. But somebody
who says "... but AIX" should be taking a damn hard look in the mirror,
and ask themselves some really tough questions. Because quite frankly, the
"..but AIX" market isn't the most interesting one.

Linus

2008-11-18 02:29:32

by Steven Rostedt

[permalink] [raw]
Subject: Re: Large stack usage in fs code (especially for PPC64)


On Mon, 17 Nov 2008, Steven Rostedt wrote:
>
> On Mon, 17 Nov 2008, Steven Rostedt wrote:
> >
> > Note, I was using a default config that had CONFIG_IRQSTACKS off and
> > CONFIG_PPC_64K_PAGES on.
>
> Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that
> softirqs still use the same stack as the process.
>

By the way, my box has been running stable ever since I switched to
CONFIG_IRQSTACKS.

-- Steve

2008-11-18 02:36:39

by Paul Mackerras

[permalink] [raw]
Subject: Re: Large stack usage in fs code (especially for PPC64)

Steven Rostedt writes:

> By the way, my box has been running stable ever since I switched to
> CONFIG_IRQSTACKS.

Great. We probably should remove the config option and just always
use irq stacks.

Paul.

2008-11-18 05:40:27

by David Miller

[permalink] [raw]
Subject: Re: Large stack usage in fs code (especially for PPC64)

From: Paul Mackerras <[email protected]>
Date: Tue, 18 Nov 2008 13:36:16 +1100

> Steven Rostedt writes:
>
> > By the way, my box has been running stable ever since I switched to
> > CONFIG_IRQSTACKS.
>
> Great. We probably should remove the config option and just always
> use irq stacks.

That's what I did from the start on sparc64 when I added
irqstacks support. It's pretty stupid to make it optional
when we know there are failure cases.

For example, is XFS dependent on IRQSTACKS on x86? It should be, or
even more so with XFS and NFS both enabled at the same time :-)

2008-11-18 07:26:18

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Large stack usage in fs code (especially for PPC64)


> > It makes some sort of sense I suppose on very static embedded workloads
> > with no swap nor demand paging.
>
> It makes perfect sense for anything that doesn't use any MMU.

To a certain extent. There are two different aspects to having an MMU,
and in embedded space it's useful to have one and not the other, i.e.,
protection & address space isolation vs. paging.

Thus, it does make sense for those embedded devices with few files
(mostly a statically linked busybox, a few /dev entries and some app
stuff) to use large page sizes. Of course, as soon as they try to
populate their ramfs with lots of small files they lose... but mostly,
the idea is that the entire working set fits in the TLB and thus the
cost of a TLB miss becomes irrelevant.

Now, regarding the shortcomings of the powerpc server MMU, well, we know
them, we know your opinion and mostly share it, and until we can get the
HW to change we are stuck with it.

Cheers,
Ben.

2008-11-18 09:53:20

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Large stack usage in fs code (especially for PPC64)

On Tue, Nov 18, 2008 at 10:03:25AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2008-11-17 at 15:34 -0500, Steven Rostedt wrote:
> > Note, I was using a default config that had CONFIG_IRQSTACKS off and
> > CONFIG_PPC_64K_PAGES on.
>
> For one, we definitely need to turn IRQSTACKS on by default ... In fact,
> I'm pondering just removing the option.

It shouldn't be a user-visible option. It shouldn't be on x86 either, but
there it always gets into the ideological 4k stacks flamewar, so people
usually gave up after a while instead of doing the sensible 8k +
irqstacks by default, with 4k as an option.

2008-11-18 10:07:30

by Nick Piggin

[permalink] [raw]
Subject: Re: Large stack usage in fs code (especially for PPC64)

On Tuesday 18 November 2008 09:53, Paul Mackerras wrote:

> I'd love to be able to use a 4k base page size if I could still get
> the reduction in page faults and the expanded TLB reach that we get
> now with 64k pages. If we could allocate the page cache for large
> files with order-4 allocations wherever possible that would be a good
> start.

That can still have nasty side-effects like fragmentation and 64k
granular reclaim. It also adds complexity to mapping the pages to
userspace.

Christoph's patchset IIRC also only did page cache, whereas I suppose
your kernbench workload is gaining mainly from anonymous page faults.

2008-11-18 10:24:50

by Nick Piggin

[permalink] [raw]
Subject: Re: Large stack usage in fs code (especially for PPC64)

On Tuesday 18 November 2008 13:08, Linus Torvalds wrote:
> On Tue, 18 Nov 2008, Paul Mackerras wrote:
> > Also, you didn't respond to my comments about the purely software
> > benefits of a larger page size.
>
> I realize that there are benefits. It's just that the downsides tend to
> swamp the upsides.
>
> The fact is, Intel (and to a lesser degree, AMD) has shown how hardware
> can do good TLB's with essentially gang lookups, giving almost effective
> page sizes of 32kB with hardly any of the downsides. Couple that with

It's much harder to do this with powerpc I think because they would need
to calculate 8 hashes and touch 8 cachelines to prefill 8 translations,
wouldn't they?
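
To make the cache-line point concrete, here's a toy comparison (the 8-byte
PTE, 128-byte PTE group and 128-byte cache line are assumed sizes for
illustration, not measurements from this thread):

#include <stdio.h>

/* Eight adjacent translations: in a radix-style table the PTEs share one
 * cache line, while in a hashed table each one lives in its own PTE group,
 * so prefilling them touches eight separate lines. */
int main(void)
{
        const unsigned long pte_size  = 8;      /* radix-style PTE */
        const unsigned long pteg_size = 128;    /* hashed PTE group */
        const unsigned long line_size = 128;    /* cache line */

        for (int i = 0; i < 8; i++)
                printf("page %d: radix PTE in line %lu, hashed PTE in line %lu\n",
                       i, (i * pte_size) / line_size, (i * pteg_size) / line_size);
        return 0;
}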


> low-latency fault handling (not for when you miss in the TLB, but for when
> something really isn't in the page tables), and it seems to be seldom the
> biggest issue.
>
> (Don't get me wrong - TLB's are not unimportant on x86 either. But on x86,
> things are generally much better).

The per-page processing costs are interesting too, but IMO there is more
work that should be done to speed up order-0 pages. The patches I had to
remove the sync instruction for smp_mb() in unlock_page sped up pagecache
throughput (populate, write(2), reclaim) on my G5 by something really
crazy like 50% (most of that's in, but I'm still sitting on that fancy
unlock_page speedup to remove the final smp_mb).

I suspect some of the costs are also in powerpc specific code to insert
linux ptes into their hash table. I think some of the synchronisation for
those could possibly be shared with generic code so you don't need the
extra layer of locks there.

2008-11-18 10:38:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: Large stack usage in fs code (especially for PPC64)


* Christoph Hellwig <[email protected]> wrote:

> On Tue, Nov 18, 2008 at 10:03:25AM +1100, Benjamin Herrenschmidt wrote:
> > On Mon, 2008-11-17 at 15:34 -0500, Steven Rostedt wrote:
> > > Note, I was using a default config that had CONFIG_IRQSTACKS off and
> > > CONFIG_PPC_64K_PAGES on.
> >
> > For one, we definitely need to turn IRQSTACKS on by default ... In fact,
> > I'm pondering just removing the option.
>
> It shouldn't be a user-visible option. It shouldn't be on x86 either,
> but there it always gets into the ideological 4k stacks flamewar, so
> people usually gave up after a while instead of doing the sensible
> 8k + irqstacks by default, with 4k as an option.

yep, i tend to agree that 8k + irqstacks on both 32-bit and 64-bit
would be the sane default. There's just too much gcc and other noise
for us not to have that cushion by default on 32-bit too. 64-bit is
already there; on 32-bit we should decouple irqstacks from 4K stacks
and just turn irqstacks on by default.

Life's too short to fight kernel stack overflows - and now we've got
the stack-tracer that will record and show the frame of the worst-ever
kernel stack situation the system was in since bootup. (see Steve's
tracer output in this thread)

Ingo

2008-11-18 11:45:05

by Paul Mackerras

[permalink] [raw]
Subject: Re: Large stack usage in fs code (especially for PPC64)

Nick Piggin writes:

> It's much harder to do this with powerpc I think because they would need
> to calculate 8 hashes and touch 8 cachelines to prefill 8 translations,
> wouldn't they?

Yeah, the hashed page table sucks. Film at 11, as they say. :)

Paul.

2008-11-18 16:02:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: Large stack usage in fs code (especially for PPC64)



On Tue, 18 Nov 2008, Nick Piggin wrote:
> >
> > The fact is, Intel (and to a lesser degree, AMD) has shown how hardware
> > can do good TLB's with essentially gang lookups, giving almost effective
> > page sizes of 32kB with hardly any of the downsides. Couple that with
>
> It's much harder to do this with powerpc I think because they would need
> to calculate 8 hashes and touch 8 cachelines to prefill 8 translations,
> wouldn't they?

Oh, absolutely. It's why I despise hashed page tables. It's a broken
concept.

> The per-page processing costs are interesting too, but IMO there is more
> work that should be done to speed up order-0 pages. The patches I had to
> remove the sync instruction for smp_mb() in unlock_page sped up pagecache
> throughput (populate, write(2), reclaim) on my G5 by something really
> crazy like 50% (most of that's in, but I'm still sitting on that fancy
> unlock_page speedup to remove the final smp_mb).
>
> I suspect some of the costs are also in powerpc specific code to insert
> linux ptes into their hash table. I think some of the synchronisation for
> those could possibly be shared with generic code so you don't need the
> extra layer of locks there.

Yeah, the hashed page tables get extra costs from the fact that they can't
share the software page tables with the hardware ones, plus the associated
coherency logic. It's even worse at unmap time, I think.

Linus