Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp5131450imu; Tue, 8 Jan 2019 12:05:37 -0800 (PST) X-Google-Smtp-Source: ALg8bN4xAWSY3KfQIPujwU0GgC9FolhCEsewWs2Znsyt/yqf9ELLKrrxA82D3AGWAyRkTDWOsGoK X-Received: by 2002:a62:1c86:: with SMTP id c128mr3237641pfc.54.1546977937312; Tue, 08 Jan 2019 12:05:37 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1546977937; cv=none; d=google.com; s=arc-20160816; b=ZVItxcWDsP1hCEA8AXyY4+DB0YjQ9twtdWIyXIKOmg4Sn5D+PH1+owwB6CAPXSTrXW 6NgHNgK+IgpAEvV/MPLHEdtrJWrJJ9R0pYCfQ9YJHB0UqA3xMLYTGecNwcWVI/KNx+f2 i1rFOj9LGLhe33NYASbXjmfuKHdNiIUDEoEvtCOc+8Hxe5CSZGO29TT9eAT71JDzLytW ug03lFGQKI1sP6fBi4WUuhtglj07uS9PO5V60eqZDlcIeGoWX7fZR7TJpYeu3y01OLeR iV4EtAQgDQhEhF1rMHsP+Nb2P7Bet3X+VOZr804IRPxJ0U9puDicwrnkDc7kB/+1hfEf hTJA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=FgAD5nLXblAg8ZPoDpnRhhseD3t2+fkEj4IqgxympWE=; b=uCMaZo9ncMUfmdErWHtBb/kV2ic/3cT6PYxbROmRweEYdu8rX2pz+AnTw085NYgJhz jykSnBr94rI2ky1zQ1R7Qn43aizKXy4UyqEbWa9I7mEa4Debed5xXahEb3OcVfkdHX/N 2NwzinnhUZkiWOSHsv74QIkzh7XsqWmqCuHaywIXqZ0XWCMTAQBgCpc4DaF+oZWMtx6C asV4H9GlxjAgAPVj4FpPTBHEwn65QyQ21+5f5LLETe2zsm2KE4QLtVIasouXvMcA3dtW PTI6wFpOSaPI/kxv3eugD4uYMpx12x9TT44mIQU/xFGVWONEDSIglOlMOG5bz/Pvm8W0 46iA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=02b0zpx+; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id i64si20903795pge.361.2019.01.08.12.05.22; Tue, 08 Jan 2019 12:05:37 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=02b0zpx+; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729810AbfAHUCM (ORCPT + 99 others); Tue, 8 Jan 2019 15:02:12 -0500 Received: from mail.kernel.org ([198.145.29.99]:36334 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730292AbfAHT3H (ORCPT ); Tue, 8 Jan 2019 14:29:07 -0500 Received: from sasha-vm.mshome.net (c-73-47-72-35.hsd1.nh.comcast.net [73.47.72.35]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id B627F20883; Tue, 8 Jan 2019 19:29:05 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1546975746; bh=+0+o20fY3CIdZRKVD4cr1nqDc+/nt/TaWqJvUQmNBqs=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=02b0zpx+KV8RgDpeM4sEfPRBcR9ql+3Z5pWXgEFpXEqGeyRHTQ+464JwYNRnrv0Kt s0HhKRaa5sqNTttyaIMWcg667W4n/m/5MJ92duhLlNYP8yFbNAckfD1FYU5Tl2a9ku CMKcAMK8qjTTaAgH3gaYcG2OLYx/NtGjbycLgQl8= From: Sasha Levin To: linux-kernel@vger.kernel.org, stable@vger.kernel.org Cc: Nikos Tsironis , Ilias Tsitsimpis , Mike Snitzer , Sasha Levin Subject: [PATCH AUTOSEL 4.20 092/117] dm snapshot: Fix excessive memory usage and workqueue stalls Date: Tue, 8 Jan 2019 14:26:00 -0500 Message-Id: <20190108192628.121270-92-sashal@kernel.org> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20190108192628.121270-1-sashal@kernel.org> References: <20190108192628.121270-1-sashal@kernel.org> MIME-Version: 1.0 X-Patchwork-Hint: Ignore Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Nikos Tsironis [ Upstream commit 721b1d98fb517ae99ab3b757021cf81db41e67be ] kcopyd has no upper limit to the number of jobs one can allocate and issue. Under certain workloads this can lead to excessive memory usage and workqueue stalls. For example, when creating multiple dm-snapshot targets with a 4K chunk size and then writing to the origin through the page cache. Syncing the page cache causes a large number of BIOs to be issued to the dm-snapshot origin target, which itself issues an even larger (because of the BIO splitting taking place) number of kcopyd jobs. Running the following test, from the device mapper test suite [1], dmtest run --suite snapshot -n many_snapshots_of_same_volume_N , with 8 active snapshots, results in the kcopyd job slab cache growing to 10G. Depending on the available system RAM this can lead to the OOM killer killing user processes: [463.492878] kthreadd invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=1, oom_score_adj=0 [463.492894] kthreadd cpuset=/ mems_allowed=0 [463.492948] CPU: 7 PID: 2 Comm: kthreadd Not tainted 4.19.0-rc7 #3 [463.492950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [463.492952] Call Trace: [463.492964] dump_stack+0x7d/0xbb [463.492973] dump_header+0x6b/0x2fc [463.492987] ? lockdep_hardirqs_on+0xee/0x190 [463.493012] oom_kill_process+0x302/0x370 [463.493021] out_of_memory+0x113/0x560 [463.493030] __alloc_pages_slowpath+0xf40/0x1020 [463.493055] __alloc_pages_nodemask+0x348/0x3c0 [463.493067] cache_grow_begin+0x81/0x8b0 [463.493072] ? cache_grow_begin+0x874/0x8b0 [463.493078] fallback_alloc+0x1e4/0x280 [463.493092] kmem_cache_alloc_node+0xd6/0x370 [463.493098] ? copy_process.part.31+0x1c5/0x20d0 [463.493105] copy_process.part.31+0x1c5/0x20d0 [463.493115] ? __lock_acquire+0x3cc/0x1550 [463.493121] ? __switch_to_asm+0x34/0x70 [463.493129] ? kthread_create_worker_on_cpu+0x70/0x70 [463.493135] ? finish_task_switch+0x90/0x280 [463.493165] _do_fork+0xe0/0x6d0 [463.493191] ? kthreadd+0x19f/0x220 [463.493233] kernel_thread+0x25/0x30 [463.493235] kthreadd+0x1bf/0x220 [463.493242] ? kthread_create_on_cpu+0x90/0x90 [463.493248] ret_from_fork+0x3a/0x50 [463.493279] Mem-Info: [463.493285] active_anon:20631 inactive_anon:4831 isolated_anon:0 [463.493285] active_file:80216 inactive_file:80107 isolated_file:435 [463.493285] unevictable:0 dirty:51266 writeback:109372 unstable:0 [463.493285] slab_reclaimable:31191 slab_unreclaimable:3483521 [463.493285] mapped:526 shmem:4903 pagetables:1759 bounce:0 [463.493285] free:33623 free_pcp:2392 free_cma:0 ... [463.493489] Unreclaimable slab info: [463.493513] Name Used Total [463.493522] bio-6 1028KB 1028KB [463.493525] bio-5 1028KB 1028KB [463.493528] dm_snap_pending_exception 236783KB 243789KB [463.493531] dm_exception 41KB 42KB [463.493534] bio-4 1216KB 1216KB [463.493537] bio-3 439396KB 439396KB [463.493539] kcopyd_job 6973427KB 6973427KB ... [463.494340] Out of memory: Kill process 1298 (ruby2.3) score 1 or sacrifice child [463.494673] Killed process 1298 (ruby2.3) total-vm:435740kB, anon-rss:20180kB, file-rss:4kB, shmem-rss:0kB [463.506437] oom_reaper: reaped process 1298 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB Moreover, issuing a large number of kcopyd jobs results in kcopyd hogging the CPU, while processing them. As a result, processing of work items, queued for execution on the same CPU as the currently running kcopyd thread, is stalled for long periods of time, hurting performance. Running the aforementioned test we get, in dmesg, messages like the following: [67501.194592] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 27s! [67501.195586] Showing busy workqueues and worker pools: [67501.195591] workqueue events: flags=0x0 [67501.195597] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256 [67501.195611] pending: cache_reap [67501.195641] workqueue mm_percpu_wq: flags=0x8 [67501.195645] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256 [67501.195656] pending: vmstat_update [67501.195682] workqueue kblockd: flags=0x18 [67501.195687] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256 [67501.195698] pending: blk_timeout_work [67501.195753] workqueue kcopyd: flags=0x8 [67501.195757] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256 [67501.195768] pending: do_work [dm_mod] [67501.195802] workqueue kcopyd: flags=0x8 [67501.195806] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256 [67501.195817] pending: do_work [dm_mod] [67501.195834] workqueue kcopyd: flags=0x8 [67501.195838] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256 [67501.195848] pending: do_work [dm_mod] [67501.195881] workqueue kcopyd: flags=0x8 [67501.195885] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256 [67501.195896] pending: do_work [dm_mod] [67501.195920] workqueue kcopyd: flags=0x8 [67501.195924] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=2/256 [67501.195935] in-flight: 67:do_work [dm_mod] [67501.195945] pending: do_work [dm_mod] [67501.195961] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=27s workers=3 idle: 129 23765 The root cause for these issues is the way dm-snapshot uses kcopyd. In particular, the lack of an explicit or implicit limit to the maximum number of in-flight COW jobs. The merging path is not affected because it implicitly limits the in-flight kcopyd jobs to one. Fix these issues by using a semaphore to limit the maximum number of in-flight kcopyd jobs. We grab the semaphore before allocating a new kcopyd job in start_copy() and start_full_bio() and release it after the job finishes in copy_callback(). The initial semaphore value is configurable through a module parameter, to allow fine tuning the maximum number of in-flight COW jobs. Setting this parameter to zero initializes the semaphore to INT_MAX. A default value of 2048 maximum in-flight kcopyd jobs was chosen. This value was decided experimentally as a trade-off between memory consumption, stalling the kernel's workqueues and maintaining a high enough throughput. Re-running the aforementioned test: * Workqueue stalls are eliminated * kcopyd's job slab cache uses a maximum of 130MB * The time taken by the test to write to the snapshot-origin target is reduced from 05m20.48s to 03m26.38s [1] https://github.com/jthornber/device-mapper-test-suite Signed-off-by: Nikos Tsironis Signed-off-by: Ilias Tsitsimpis Signed-off-by: Mike Snitzer Signed-off-by: Sasha Levin --- drivers/md/dm-snap.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c index ae4b33d10924..36805b12661e 100644 --- a/drivers/md/dm-snap.c +++ b/drivers/md/dm-snap.c @@ -19,6 +19,7 @@ #include #include #include +#include #include "dm.h" @@ -105,6 +106,9 @@ struct dm_snapshot { /* The on disk metadata handler */ struct dm_exception_store *store; + /* Maximum number of in-flight COW jobs. */ + struct semaphore cow_count; + struct dm_kcopyd_client *kcopyd_client; /* Wait for events based on state_bits */ @@ -145,6 +149,19 @@ struct dm_snapshot { #define RUNNING_MERGE 0 #define SHUTDOWN_MERGE 1 +/* + * Maximum number of chunks being copied on write. + * + * The value was decided experimentally as a trade-off between memory + * consumption, stalling the kernel's workqueues and maintaining a high enough + * throughput. + */ +#define DEFAULT_COW_THRESHOLD 2048 + +static int cow_threshold = DEFAULT_COW_THRESHOLD; +module_param_named(snapshot_cow_threshold, cow_threshold, int, 0644); +MODULE_PARM_DESC(snapshot_cow_threshold, "Maximum number of chunks being copied on write"); + DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(snapshot_copy_throttle, "A percentage of time allocated for copy on write"); @@ -1190,6 +1207,8 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv) goto bad_hash_tables; } + sema_init(&s->cow_count, (cow_threshold > 0) ? cow_threshold : INT_MAX); + s->kcopyd_client = dm_kcopyd_client_create(&dm_kcopyd_throttle); if (IS_ERR(s->kcopyd_client)) { r = PTR_ERR(s->kcopyd_client); @@ -1575,6 +1594,7 @@ static void copy_callback(int read_err, unsigned long write_err, void *context) rb_link_node(&pe->out_of_order_node, parent, p); rb_insert_color(&pe->out_of_order_node, &s->out_of_order_tree); } + up(&s->cow_count); } /* @@ -1598,6 +1618,7 @@ static void start_copy(struct dm_snap_pending_exception *pe) dest.count = src.count; /* Hand over to kcopyd */ + down(&s->cow_count); dm_kcopyd_copy(s->kcopyd_client, &src, 1, &dest, 0, copy_callback, pe); } @@ -1617,6 +1638,7 @@ static void start_full_bio(struct dm_snap_pending_exception *pe, pe->full_bio = bio; pe->full_bio_end_io = bio->bi_end_io; + down(&s->cow_count); callback_data = dm_kcopyd_prepare_callback(s->kcopyd_client, copy_callback, pe); -- 2.19.1