Subject: Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
To: Anchal Agarwal, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: fllinden@amazon.com, msw@amazon.com, sblbir@amazon.com
References: <20180731213410.GA35291@kaos-source-ops-60001.pdx1.amazon.com>
From: Jens Axboe
Date: Wed, 1 Aug 2018 09:14:50 -0600
In-Reply-To: <20180731213410.GA35291@kaos-source-ops-60001.pdx1.amazon.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 7/31/18 3:34 PM, Anchal Agarwal wrote:
> Hi folks,
>
> This patch modifies commit e34cbd307477a
> (blk-wbt: add general throttling mechanism).
>
> I am currently running a large bare metal instance (i3.metal)
> on EC2 with 72 cores, 512GB of RAM and NVMe drives, with a
> 4.18 kernel.
> I have a workload that simulates a database
> workload, and I am running into lockup issues when writeback
> throttling is enabled, with the hung task detector also
> kicking in.
>
> Crash dumps show that most CPUs (up to 50 of them) are
> all trying to get the wbt wait queue lock while trying to add
> themselves to it in __wbt_wait (see stack traces below).
>
> [  0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
> [  0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
> [  0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
> [  0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
> [  0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
> [  0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
> [  0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
> [  0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
> [  0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
> [  0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
> [  0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
> [  0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
> [  0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  0.948138] Call Trace:
> [  0.948139]  <IRQ>
> [  0.948142]  do_raw_spin_lock+0xad/0xc0
> [  0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
> [  0.948149]  ? __wake_up_common_lock+0x53/0x90
> [  0.948150]  __wake_up_common_lock+0x53/0x90
> [  0.948155]  wbt_done+0x7b/0xa0
> [  0.948158]  blk_mq_free_request+0xb7/0x110
> [  0.948161]  __blk_mq_complete_request+0xcb/0x140
> [  0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
> [  0.948169]  nvme_irq+0x23/0x50 [nvme]
> [  0.948173]  __handle_irq_event_percpu+0x46/0x300
> [  0.948176]  handle_irq_event_percpu+0x20/0x50
> [  0.948179]  handle_irq_event+0x34/0x60
> [  0.948181]  handle_edge_irq+0x77/0x190
> [  0.948185]  handle_irq+0xaf/0x120
> [  0.948188]  do_IRQ+0x53/0x110
> [  0.948191]  common_interrupt+0x87/0x87
> [  0.948192]  </IRQ>
> ....
> [  0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
> [  0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
> [  0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
> [  0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
> [  0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
> [  0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
> [  0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
> [  0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
> [  0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
> [  0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
> [  0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
> [  0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
> [  0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  0.311154] Call Trace:
> [  0.311157]  do_raw_spin_lock+0xad/0xc0
> [  0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
> [  0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
> [  0.311164]  prepare_to_wait_exclusive+0x28/0xb0
> [  0.311167]  wbt_wait+0x127/0x330
> [  0.311169]  ? finish_wait+0x80/0x80
> [  0.311172]  ? generic_make_request+0xda/0x3b0
> [  0.311174]  blk_mq_make_request+0xd6/0x7b0
> [  0.311176]  ? blk_queue_enter+0x24/0x260
> [  0.311178]  ? generic_make_request+0xda/0x3b0
> [  0.311181]  generic_make_request+0x10c/0x3b0
> [  0.311183]  ? submit_bio+0x5c/0x110
> [  0.311185]  submit_bio+0x5c/0x110
> [  0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
> [  0.311210]  ext4_io_submit+0x48/0x60 [ext4]
> [  0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
> [  0.311229]  ? do_writepages+0x3c/0xd0
> [  0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
> [  0.311240]  do_writepages+0x3c/0xd0
> [  0.311243]  ? _raw_spin_unlock+0x24/0x30
> [  0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
> [  0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
> [  0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
> [  0.311253]  file_write_and_wait_range+0x34/0x90
> [  0.311264]  ext4_sync_file+0x151/0x500 [ext4]
> [  0.311267]  do_fsync+0x38/0x60
> [  0.311270]  SyS_fsync+0xc/0x10
> [  0.311272]  do_syscall_64+0x6f/0x170
> [  0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>
> In the original patch, wbt_done is waking up all the exclusive
> processes in the wait queue, which can cause a thundering herd
> if there is a large number of writer threads in the queue. The
> original intention of the code seems to be to wake up one thread
> only; however, it uses wake_up_all() in __wbt_done(), and then
> uses the following check in __wbt_wait to have only one thread
> actually get out of the wait loop:
>
> 	if (waitqueue_active(&rqw->wait) &&
> 	    rqw->wait.head.next != &wait->entry)
> 		return false;
>
> The problem with this is that the wait entry in wbt_wait is
> defined with DEFINE_WAIT, which uses the autoremove wakeup function.
> That means that the above check is invalid - the wait entry will
> have been removed from the queue already by the time we hit the
> check in the loop.
>
> Secondly, auto-removing the wait entries also means that the wait
> queue essentially gets reordered "randomly" (e.g. threads re-add
> themselves in the order they got to run after being woken up).
> Additionally, new requests entering wbt_wait might overtake requests
> that were queued earlier, because the wait queue will be
> (temporarily) empty after the wake_up_all, so the waitqueue_active
> check will not stop them. This can cause certain threads to starve
> under high load.
>
> The fix is to leave the woken-up requests in the queue and remove
> them in finish_wait() once the current thread breaks out of the
> wait loop in __wbt_wait. This will ensure new requests always
> end up at the back of the queue, and they won't overtake requests
> that are already in the wait queue. With that change, the loop
> in wbt_wait is also in line with many other wait loops in the kernel.
> Waking up just one thread drastically reduces lock contention, as
> does moving the wait queue add/remove out of the loop.
>
> A significant drop in lockdep's lock contention numbers is seen when
> running the test application on the patched kernel.

I like the patch, and a few weeks ago we independently discovered that
the waitqueue list checking was bogus as well. My only worry is that
changes like this can be delicate, meaning that it's easy to introduce
stall conditions. What kind of testing did you push this through?

-- 
Jens Axboe