Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S932264AbdCFWlu (ORCPT ); Mon, 6 Mar 2017 17:41:50 -0500
Received: from mga05.intel.com ([192.55.52.43]:24616 "EHLO mga05.intel.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
 id S1754052AbdCFWll (ORCPT ); Mon, 6 Mar 2017 17:41:41 -0500
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.35,255,1484035200"; d="scan'208";a="1138730834"
Date: Mon, 6 Mar 2017 17:46:33 -0500
From: Keith Busch
To: Thomas Fjellstrom
Cc: linux-kernel, linux-nvme@lists.infradead.org
Subject: Re: nvme, allocation failures, and aborted commands
Message-ID: <20170306224633.GA21202@localhost.localdomain>
References: <2324331.vgmPNSfijK@natasha> <1898554.3jK49l1AhF@natasha> <24015519.RaUGNvtkS7@natasha>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <24015519.RaUGNvtkS7@natasha>
User-Agent: Mutt/1.7.0 (2016-08-17)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 2725
Lines: 56

On Sun, Mar 05, 2017 at 11:11:45PM -0700, Thomas Fjellstrom wrote:
> Tonight I decided to try kernel 4.11-rc1. Still getting page allocation
> failures and aborted nvme commands once iozone gets to the fwrite/fread
> testing.
>
> The taint seems to be coming from previous warnings from the radeon kernel
> driver.
>
> It's a pretty simple run: iozone -a -s 64G
>
> new log:
>
> [ 5710.312195] kworker/u16:1: page allocation failure: order:2, mode:0x1084020(GFP_ATOMIC|__GFP_COMP), nodemask=(null)
> [ 5710.312202] kworker/u16:1 cpuset=/ mems_allowed=0
> [ 5710.312208] CPU: 1 PID: 30166 Comm: kworker/u16:1 Tainted: G W 4.11.0-rc1 #2
> [ 5710.312209] Hardware name: To be filled by O.E.M.
To be filled by O.E.M./970 PRO GAMING/AURA, BIOS 0901 11/07/2016
> [ 5710.312214] Workqueue: writeback wb_workfn (flush-259:0)
> [ 5710.312217] Call Trace:
> [ 5710.312223] dump_stack+0x4f/0x73
> [ 5710.312225] warn_alloc+0xde/0x170
> [ 5710.312227] ? wake_all_kswapds+0x60/0xa0
> [ 5710.312229] __alloc_pages_slowpath+0xe21/0xe50
> [ 5710.312232] ? __inc_zone_state+0x42/0xb0
> [ 5710.312235] ? preempt_count_add+0xa3/0xc0
> [ 5710.312236] __alloc_pages_nodemask+0x1c6/0x1e0
> [ 5710.312239] alloc_pages_current+0x90/0x140
> [ 5710.312241] kmalloc_order_trace+0x29/0xf0
> [ 5710.312243] ? alloc_pages_current+0x90/0x140
> [ 5710.312245] __kmalloc+0x17a/0x1a0
> [ 5710.312250] nvme_queue_rq+0x12f/0x860 [nvme]
> [ 5710.312253] ? __sbitmap_queue_get+0x2b/0xa0
> [ 5710.312255] ? __blk_mq_get_tag.isra.7+0x62/0x70
> [ 5710.312256] ? blk_mq_get_tag+0x82/0x2b0
> [ 5710.312258] blk_mq_try_issue_directly+0x8f/0xe0
> [ 5710.312260] blk_mq_make_request+0x1ef/0x750

I'm not familiar with iozone, but I can explain what the nvme driver is
doing here.

The error says an order-2 allocation (4 physically contiguous pages)
failed in nvme_queue_rq. The only thing we allocate there is a driver IO
descriptor. Order 2 means your data transfer for this IO is pretty large,
somewhere between 1 and 4MB. I don't think we need contiguous memory
here, though; I'll see if we can use vmalloc in case memory is heavily
fragmented.

As for the timeout/abort, this appears to be a functioning device that is
completing commands, just more slowly than the kernel's timeout allows.
If your IO test creates a very high outstanding queue depth with very
large transfer sizes, it's possible to create tail latencies higher than
the default 30 second timeout (YMMV, check your device's capabilities).

Could you throttle the device's max transfer size and queue depth prior
to running your IO test and see if the error messages clear up? For
example:

  echo 128 > /sys/block/nvme0n1/queue/max_sectors_kb
  echo 128 > /sys/block/nvme0n1/queue/nr_requests
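
The throttling step can also be wrapped in a small script so the old
values are recorded before they are overwritten. This is only a sketch:
throttle_queue is a hypothetical helper name, and it assumes the writable
attributes in the queue directory are max_sectors_kb and nr_requests
(max_hw_sectors_kb reports the hardware limit and is read-only). The
queue directory is passed as an argument rather than hard-coded.

```shell
#!/bin/sh
# Sketch: cap the per-request transfer size and the queue depth of a block
# device through its sysfs queue directory. Needs root on real sysfs.
# throttle_queue is a hypothetical helper, not part of any existing tool.

throttle_queue() {
    qdir=$1       # queue directory, e.g. /sys/block/nvme0n1/queue
    max_kb=$2     # new max_sectors_kb value, in KiB
    depth=$3      # new nr_requests value

    # Fail early if an attribute is missing or not writable by this user.
    for attr in max_sectors_kb nr_requests; do
        if [ ! -w "$qdir/$attr" ]; then
            echo "cannot write $qdir/$attr" >&2
            return 1
        fi
    done

    # Print the old values so they can be restored after the test run.
    echo "old max_sectors_kb: $(cat "$qdir/max_sectors_kb")"
    echo "old nr_requests: $(cat "$qdir/nr_requests")"

    echo "$max_kb" > "$qdir/max_sectors_kb" || return 1
    echo "$depth" > "$qdir/nr_requests" || return 1
}

# Example (as root): throttle_queue /sys/block/nvme0n1/queue 128 128
```

Calling it again with the printed old values puts the queue back the way
it was once the iozone run is finished.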