Date: Wed, 12 Feb 2014 19:35:13 +1100
From: Dave Chinner
To: Linus Torvalds
Cc: Jens Axboe, Tejun Heo, Steven Rostedt, Dave Jones, Al Viro,
    Eric Sandeen, Linux Kernel, xfs@oss.sgi.com
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.
Message-ID: <20140212083513.GF13997@dastard>
References: <20140212004403.GA17129@redhat.com>
 <20140212010941.GM18016@ZenIV.linux.org.uk>
 <20140212040358.GA25327@redhat.com>
 <20140212042215.GN18016@ZenIV.linux.org.uk>
 <20140212054043.GB13997@dastard>
 <20140212055027.GA28502@redhat.com>
 <20140212061038.GC13997@dastard>
 <20140212063150.GD13997@dastard>
User-Agent: Mutt/1.5.21 (2010-09-15)

On Tue, Feb 11, 2014 at 10:59:58PM -0800, Linus Torvalds wrote:
> On Tue, Feb 11, 2014 at 10:31 PM, Dave Chinner wrote:
> >
> > FYI, just creating lots of files with open(O_CREAT):
> >
> > [ 348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left
> > [ 348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left
> > [ 349.777717] fs_mark (4826) used greatest stack depth: 2280 bytes left
> > [ 418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left
> > [ 460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left
> > [ 544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left
> > [ 689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left
> >
> > We've got absolutely no spare stack space anymore in the IO path.
> > And the IO path can't get much simpler than filesystem -> virtio
> > block device.
>
> Ugh, that's bad. A thousand bytes of stack space is much too close to
> any limits.
>
> Do you have the stack traces for these things so that we can look at
> worst offenders?

Sure. It's almost always through the XFS block allocation path where we
do metadata reads or log writes. That's always been the worst-case stack
usage path for XFS.

The worst path is about 4800 bytes - it was a double btree split (an
inode extent map tree split, followed by a freespace btree split when
allocating a block for the extent map btree split), triggering a
reallocation of a btree block that was pinned and STALE, which triggered
a log force, which then made its way into the block layer. I don't have
a stack trace of that because I can't create the initial conditions
required to trigger it on demand. I've only ever seen it in the wild
once. More common, though, is the situation we see here - somewhere
around 3-4k of stack usage from a single extent btree update operation.

We've already got a stack switch in the data allocation path - we had to
add it to stop the bdi-flusher from busting the stack with alarming
regularity in production systems. I initially made the metadata
allocation paths use it as well, but that part got reverted (commit
aa29284 "xfs: don't defer metadata allocation to the workqueue") because
Mel Gorman demonstrated several workloads that regressed significantly
as a result of making that change. So now we are stuck between a rock
and a hard place - those metadata block allocation paths are triggering
the stack overflows, but switching stacks in the allocation path to
avoid stack overflows causes massive performance regressions....
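For anyone not familiar with the trick: the "stack switch" is just
punting the allocation to a workqueue and sleeping until it completes,
so the deep part of the call chain starts from a nearly empty kworker
stack. Roughly like this (illustrative names only, not the actual XFS
code - the real thing is the xfs_alloc_wq deferral):

	#include <linux/workqueue.h>
	#include <linux/completion.h>

	struct alloc_args {
		struct work_struct	work;
		struct completion	done;
		/* allocation parameters and results would live here */
		int			result;
	};

	/* stand-in for the real allocation call chain */
	int do_the_allocation(struct alloc_args *args);

	static void alloc_worker(struct work_struct *work)
	{
		struct alloc_args *args =
			container_of(work, struct alloc_args, work);

		/* runs on a kworker thread, i.e. on a fresh stack */
		args->result = do_the_allocation(args);
		complete(&args->done);
	}

	static int alloc_on_fresh_stack(struct workqueue_struct *wq,
					struct alloc_args *args)
	{
		INIT_WORK_ONSTACK(&args->work, alloc_worker);
		init_completion(&args->done);
		queue_work(wq, &args->work);
		wait_for_completion(&args->done);
		destroy_work_on_stack(&args->work);
		return args->result;
	}

The price is a queue/wake round trip and a context switch on every
allocation, which is exactly what Mel's workloads tripped over.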
The thing is, the XFS stack usage has not changed significantly over the
past 5 years - we keep a pretty close eye on it. The issue is that
everything around XFS has slowly been growing, and I can't point you at
one stack trace that demonstrates the worst case. What I can point you
at is function calls that consume a lot of stack space and hence limit
what is available to callers.

So here are a few stack fragments I've generated in the past few minutes
on a couple of XFS filesystems inside a 16p/16GB VM by running:

$ dd if=/dev/zero of=/mnt/test/foo bs=1024k &

which writes at about 600MB/s to an 18TB RAID array. That runs in
parallel with a multithreaded inode create/walk/remove workload on a
different filesystem (creates 50 million inodes at ~250k creates/s,
walks them, then removes them at around 170k removes/s) which is hosted
on a dm RAID0 stripe of two Samsung 840 EVO SSDs.

mutex_lock() requires at least 1.2k of stack because of the scheduler
overhead:

$ sudo cat /sys/kernel/debug/tracing/stack_trace
        Depth    Size   Location    (39 entries)
        -----    ----   --------
  0)     4368      64   update_group_power+0x2c/0x270
  1)     4304     384   find_busiest_group+0x10a/0x8b0
  2)     3920     288   load_balance+0x165/0x870
  3)     3632      96   idle_balance+0x106/0x1b0
  4)     3536     112   __schedule+0x7b6/0x7e0
  5)     3424      16   schedule+0x29/0x70
  6)     3408      16   schedule_preempt_disabled+0xe/0x10
  7)     3392     112   __mutex_lock_slowpath+0x11a/0x400
  8)     3280      32   mutex_lock+0x1e/0x32

i.e. any function that is going to schedule needs at least 1k of stack,
and some of the lead-in infrastructure (like wait_for_completion()) can
require a total of up to 1.5k...

Basic memory allocation can easily use >1k of stack without entering
reclaim:

        -----    ----   --------
  0)     4920      40   zone_statistics+0xbd/0xc0
  1)     4880     256   get_page_from_freelist+0x3a8/0x8a0
  2)     4624     256   __alloc_pages_nodemask+0x143/0x8e0
  3)     4368      80   alloc_pages_current+0xb2/0x170
  4)     4288      64   new_slab+0x265/0x2e0
  5)     4224     240   __slab_alloc+0x2fb/0x4c4
  6)     3984      80   kmem_cache_alloc+0x10b/0x140
  7)     3904      48   kmem_zone_alloc+0x77/0x100

If it enters reclaim and we are allowed to do IO? Well, you saw the
stack I posted in the other thread - 4700 bytes from kmem_cache_alloc()
to the top of the stack.
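To make the "allowed to do IO" part concrete, here's a hedged
illustration (made-up function names, not actual XFS code): the GFP
flags on an allocation decide how much of the reclaim and IO machinery
can end up stacked on top of the caller.

	#include <linux/slab.h>

	/*
	 * GFP_KERNEL: direct reclaim is allowed, and reclaim may submit
	 * IO - e.g. write dirty anonymous pages to swap - so the whole
	 * swap/SCSI trace below can sit on top of this call.
	 */
	static void *alloc_can_do_io(struct kmem_cache *cachep)
	{
		return kmem_cache_alloc(cachep, GFP_KERNEL);
	}

	/*
	 * GFP_NOFS only clears __GFP_FS: reclaim can't recurse back
	 * into filesystem code, but it can still issue swap IO, so the
	 * worst-case stack underneath the allocation doesn't shrink
	 * much.
	 */
	static void *alloc_nofs(struct kmem_cache *cachep)
	{
		return kmem_cache_alloc(cachep, GFP_NOFS);
	}

(kmem_zone_alloc() in the traces here is just XFS's wrapper around
kmem_cache_alloc() that converts its KM_* flags to GFP flags.)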
Another bad case is the swap path (using a scsi device rather than
virtio for the root/swap devices):

        Depth    Size   Location    (44 entries)
        -----    ----   --------
  0)     4496      16   mempool_alloc_slab+0x15/0x20
  1)     4480     128   mempool_alloc+0x66/0x170
  2)     4352      16   scsi_sg_alloc+0x4e/0x60
  3)     4336     112   __sg_alloc_table+0x68/0x130
  4)     4224      32   scsi_init_sgtable+0x34/0x90
  5)     4192      48   scsi_init_io+0x34/0xd0
  6)     4144      32   scsi_setup_fs_cmnd+0x66/0xa0
  7)     4112     112   sd_prep_fn+0x2a0/0xce0
  8)     4000      48   blk_peek_request+0x13c/0x260
  9)     3952     112   scsi_request_fn+0x4b/0x490
 10)     3840      32   __blk_run_queue+0x37/0x50
 11)     3808      64   queue_unplugged+0x39/0xb0
 12)     3744     112   blk_flush_plug_list+0x20b/0x240
 13)     3632      80   blk_queue_bio+0x1ca/0x310
 14)     3552      48   generic_make_request+0xc0/0x110
 15)     3504      96   submit_bio+0x71/0x120
 16)     3408     160   __swap_writepage+0x184/0x220
 17)     3248      32   swap_writepage+0x42/0x90
 18)     3216     304   shrink_page_list+0x6fd/0xa20
 19)     2912     208   shrink_inactive_list+0x243/0x480
 20)     2704     288   shrink_lruvec+0x371/0x670
 21)     2416     112   do_try_to_free_pages+0x11a/0x360
 22)     2304     192   try_to_free_pages+0x110/0x190
 23)     2112     256   __alloc_pages_nodemask+0x5a2/0x8e0
 24)     1856      80   alloc_pages_current+0xb2/0x170
 25)     1776      64   new_slab+0x265/0x2e0
 26)     1712     240   __slab_alloc+0x2fb/0x4c4
 27)     1472      80   kmem_cache_alloc+0x10b/0x140
 28)     1392      48   kmem_zone_alloc+0x77/0x100

There's over 3k in the stack from kmem_zone_alloc(GFP_KERNEL), and
that's got nothing XFS in it at all - it's just the memory allocation,
reclaim and IO paths. And it also demonstrates that the SCSI layer has
significant stack usage.

Here's 2.5k from submit_bio() on a virtio block device:

        Depth    Size   Location    (44 entries)
        -----    ----   --------
  0)     4512      64   update_curr+0x8b/0x160
  1)     4448      96   enqueue_task_fair+0x39d/0xd50
  2)     4352      48   enqueue_task+0x50/0x60
  3)     4304      16   activate_task+0x23/0x30
  4)     4288      32   ttwu_do_activate.constprop.84+0x3c/0x70
  5)     4256      96   try_to_wake_up+0x1b4/0x2c0
  6)     4160      16   default_wake_function+0x12/0x20
  7)     4144      32   autoremove_wake_function+0x16/0x40
  8)     4112      80   __wake_up_common+0x55/0x90
  9)     4032      64   __wake_up+0x48/0x70
 10)     3968      80   wakeup_kswapd+0xe7/0x130
 11)     3888     256   __alloc_pages_nodemask+0x371/0x8e0
 12)     3632      80   alloc_pages_current+0xb2/0x170
 13)     3552      64   new_slab+0x265/0x2e0
 14)     3488     240   __slab_alloc+0x2fb/0x4c4
 15)     3248      80   __kmalloc+0x133/0x180
 16)     3168     112   virtqueue_add_sgs+0x2fe/0x520
 17)     3056     288   __virtblk_add_req+0xd5/0x180
 18)     2768      96   virtio_queue_rq+0xdd/0x1d0
 19)     2672     112   __blk_mq_run_hw_queue+0x1c3/0x3c0
 20)     2560      16   blk_mq_run_hw_queue+0x35/0x40
 21)     2544      80   blk_mq_insert_requests+0xc5/0x120
 22)     2464      96   blk_mq_flush_plug_list+0x129/0x140
 23)     2368     112   blk_flush_plug_list+0xe7/0x240
 24)     2256     128   blk_mq_make_request+0x3ca/0x510
 25)     2128      48   generic_make_request+0xc0/0x110
 26)     2080      96   submit_bio+0x71/0x120

You can also see the difference in stack usage that the blk_mq layer
adds compared to the previous trace, which went through the old code
path to get to the SCSI stack.

And you saw the memory reclaim via shrinker path that I posted in
another email - that was 4700 bytes from kmem_cache_alloc() to the top
of the stack (it can be reduced to about 2000 bytes by having XFS chop
it in half).

> If the new block-mq code is to blame, it needs to be fixed.

It's a symptom of the pattern of gradual growth in just about every core
subsystem. Each individual change is not significant, but when you put
the whole stack together the canary is well and truly dead.
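(The canary is literal, by the way - those "used greatest stack depth:
N bytes left" lines at the top of this mail come from
CONFIG_DEBUG_STACK_USAGE, which zero-fills the task stack at fork and
then, at exit, scans for the deepest word the task ever dirtied. A
stripped-down sketch of the idea, not the exact kernel code - see
stack_not_used() and check_stack_usage() for the real thing:

	/*
	 * stack_base is the lowest address of the stack area; x86-64
	 * stacks grow down towards it, so any word that is still zero
	 * at exit was never reached by any call chain.  The real code
	 * starts the scan just above the struct thread_info that sits
	 * at the base of the stack.
	 */
	static unsigned long bytes_left_at_deepest(unsigned long *stack_base,
						   unsigned long stack_size)
	{
		unsigned long *p = stack_base;
		unsigned long free = 0;

		while (free < stack_size && *p == 0) {
			p++;
			free += sizeof(unsigned long);
		}
		return free;	/* worst-case headroom over the task's life */
	}

The stack tracer, by contrast, only measures at the points where the
function tracer hooks fire, so it misses whatever leaf frames add on
top - more on that below.)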
> __virtblk_add_req() has a 300-byte stack frame, it seems. Looking
> elsewhere, blkdev_issue_discard() has 350 bytes of stack frame, but is
> hopefully not in any normal path - online discard is moronic, and I'm
> assuming XFS doesn't do that.

We *support* it, but we don't *recommend* it. So, no, that's not the
problem path. ;)

> There's a lot of 200+ byte stack frames in block/blk-core.s, and they
> all seem to be of the type perf_trace_block_buffer() - things created
> with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
> frame, I have no idea. That sounds like a potential disaster too,
> although hopefully it's mostly leaf functions - but leaf functions
> *deep* in the callchain. Tejun? Steven, why _do_ they end up with such
> huge frames?

And it's leaf functions that CONFIG_STACK_TRACER doesn't catch on x86-64
(at least, according to the documentation). CONFIG_DEBUG_STACK_USAGE
output is showing up to 800 bytes more stack usage than the tracer. As
such, I also think that the CONFIG_DEBUG_STACK_USAGE output is a more
reliable indication of stack usage, because it is canary based and so
captures the very worst case usage of the process's stack...

It really seems to me that we have got to the point where it is not safe
to do anything like submit IO or try to do memory allocation with any
more than half the stack space consumed....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com