Received: by 2002:a25:31c3:0:0:0:0:0 with SMTP id x186csp4496107ybx; Mon, 4 Nov 2019 14:22:21 -0800 (PST) X-Google-Smtp-Source: APXvYqz59j0Iu30EZ2quQ+JsNRa7gEi5QP5YA2WK1VeOtZPG8/HlHrqHIA6Q6htezl7et7W+E29c X-Received: by 2002:a17:906:3458:: with SMTP id d24mr11667343ejb.271.1572906141817; Mon, 04 Nov 2019 14:22:21 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1572906141; cv=none; d=google.com; s=arc-20160816; b=uRd6/cauZPqbduzP2Nv08Va42t/38BAYIDFxj2a5vwY6yk6SEr7cFF9Tg9CPkm6M9B CNDC4Brh80Gst9GJ1aRQ/hbh0A31j+BOvhxStIjQju2HE9uE8BzoCxzxsCq6epsAmhCs ugpviBkV3V0gQIwDN+pzCulkO1WtfYcnHjbM3fRMeMW9YIIcngQ6QYdv2qM4tTyH608P KOpPap2Mz9eOQcWyPTCSpC9ICUyCMZvp5m19jEPWwcqaDoz5HwPbI0onCaPVAkh0OiFZ 8J9b83DNwRJIE8NssNgtpRr+Ge88CNSOvLLZfb5iggmknRwWPi865ICJpa9BpwHYuxjn YMBg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :user-agent:references:in-reply-to:message-id:date:subject:cc:to :from:dkim-signature; bh=/3EKOLWvFK3k9VozJ378f+rupr0Yn9ANMDn0D1H1XDs=; b=nXDPHpTZGBfYLHZ6eU8Hnou/J6UcEuAiSyPTqaQqkEBsuAlES0iiYGrHNM0GgyqjVg WZ9z5UmVxDFaGdj1dR/OK+eD9J3DPbgp04y2GLWx7pHbYDuiZc7cicNxIxXEnaMk9So0 1eH9Hyk19NkSUCXJ2fL8vzh7Cj4e4qeQY3sYw0R0L5igh5HTlApWSqpSGo5N3l+F4fSJ HmKuXXCpASSkKl45eQ28bGxv0RcDyxBdrrghQ8VsD8MyKYxohCuLex0MfkkXuHZA21FI e5nig4q+qEGO7XcvDvlekbJb1WgdqgguKrmwH2Q+1WxQhG3npHZZwZU83vNtYkEaApL2 FIug== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=B58PqKns; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id x14si8570747edl.128.2019.11.04.14.21.58; Mon, 04 Nov 2019 14:22:21 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=B58PqKns; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2388701AbfKDV57 (ORCPT + 99 others); Mon, 4 Nov 2019 16:57:59 -0500 Received: from mail.kernel.org ([198.145.29.99]:54582 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387506AbfKDV55 (ORCPT ); Mon, 4 Nov 2019 16:57:57 -0500 Received: from localhost (6.204-14-84.ripe.coltfrance.com [84.14.204.6]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 03C0F214D8; Mon, 4 Nov 2019 21:57:55 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1572904676; bh=ISOCXHXFnxYp+D1ay8dVBO+AxI8M8jHi7r4PrG4+nMY=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=B58PqKnsT2SW9NY3p7u510NojGhBAvFmgW0A1B0q3QD1NLTQAh8UdkNNko118diRu dWKyAXH2dtHPMUMhfrKX7k+Ee79SuGtCN1kYsXRDw3JwuIgdlo+jqvQsa5g01IxZ9B 0XLE2hc0zLsD4kpyGeseA/HVp4UxAqPhOV0JAzaA= From: Greg Kroah-Hartman To: linux-kernel@vger.kernel.org Cc: Greg Kroah-Hartman , stable@vger.kernel.org, Guruswamy Basavaiah , Nikos Tsironis , Mikulas Patocka , Mike Snitzer , Sasha Levin Subject: [PATCH 4.19 003/149] dm snapshot: rework COW throttling to fix deadlock Date: Mon, 4 Nov 2019 22:43:16 +0100 Message-Id: <20191104212127.409465102@linuxfoundation.org> X-Mailer: git-send-email 2.23.0 In-Reply-To: <20191104212126.090054740@linuxfoundation.org> References: <20191104212126.090054740@linuxfoundation.org> User-Agent: quilt/0.66 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Mikulas Patocka [ Upstream commit b21555786f18cd77f2311ad89074533109ae3ffa ] Commit 721b1d98fb517a ("dm snapshot: Fix excessive memory usage and workqueue stalls") introduced a semaphore to limit the maximum number of in-flight kcopyd (COW) jobs. The implementation of this throttling mechanism is prone to a deadlock: 1. One or more threads write to the origin device causing COW, which is performed by kcopyd. 2. At some point some of these threads might reach the s->cow_count semaphore limit and block in down(&s->cow_count), holding a read lock on _origins_lock. 3. Someone tries to acquire a write lock on _origins_lock, e.g., snapshot_ctr(), which blocks because the threads at step (2) already hold a read lock on it. 4. A COW operation completes and kcopyd runs dm-snapshot's completion callback, which ends up calling pending_complete(). pending_complete() tries to resubmit any deferred origin bios. This requires acquiring a read lock on _origins_lock, which blocks. This happens because the read-write semaphore implementation gives priority to writers, meaning that as soon as a writer tries to enter the critical section, no readers will be allowed in, until all writers have completed their work. So, pending_complete() waits for the writer at step (3) to acquire and release the lock. This writer waits for the readers at step (2) to release the read lock and those readers wait for pending_complete() (the kcopyd thread) to signal the s->cow_count semaphore: DEADLOCK. The above was thoroughly analyzed and documented by Nikos Tsironis as part of his initial proposal for fixing this deadlock, see: https://www.redhat.com/archives/dm-devel/2019-October/msg00001.html Fix this deadlock by reworking COW throttling so that it waits without holding any locks. Add a variable 'in_progress' that counts how many kcopyd jobs are running. A function wait_for_in_progress() will sleep if 'in_progress' is over the limit. It drops _origins_lock in order to avoid the deadlock. Reported-by: Guruswamy Basavaiah Reported-by: Nikos Tsironis Reviewed-by: Nikos Tsironis Tested-by: Nikos Tsironis Fixes: 721b1d98fb51 ("dm snapshot: Fix excessive memory usage and workqueue stalls") Cc: stable@vger.kernel.org # v5.0+ Depends-on: 4a3f111a73a8c ("dm snapshot: introduce account_start_copy() and account_end_copy()") Signed-off-by: Mikulas Patocka Signed-off-by: Mike Snitzer Signed-off-by: Sasha Levin --- drivers/md/dm-snap.c | 78 ++++++++++++++++++++++++++++++++++++-------- 1 file changed, 64 insertions(+), 14 deletions(-) diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c index d9216afbd490f..d3f28a9e3fd95 100644 --- a/drivers/md/dm-snap.c +++ b/drivers/md/dm-snap.c @@ -19,7 +19,6 @@ #include #include #include -#include #include "dm.h" @@ -106,8 +105,8 @@ struct dm_snapshot { /* The on disk metadata handler */ struct dm_exception_store *store; - /* Maximum number of in-flight COW jobs. */ - struct semaphore cow_count; + unsigned in_progress; + struct wait_queue_head in_progress_wait; struct dm_kcopyd_client *kcopyd_client; @@ -158,8 +157,8 @@ struct dm_snapshot { */ #define DEFAULT_COW_THRESHOLD 2048 -static int cow_threshold = DEFAULT_COW_THRESHOLD; -module_param_named(snapshot_cow_threshold, cow_threshold, int, 0644); +static unsigned cow_threshold = DEFAULT_COW_THRESHOLD; +module_param_named(snapshot_cow_threshold, cow_threshold, uint, 0644); MODULE_PARM_DESC(snapshot_cow_threshold, "Maximum number of chunks being copied on write"); DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(snapshot_copy_throttle, @@ -1207,7 +1206,7 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv) goto bad_hash_tables; } - sema_init(&s->cow_count, (cow_threshold > 0) ? cow_threshold : INT_MAX); + init_waitqueue_head(&s->in_progress_wait); s->kcopyd_client = dm_kcopyd_client_create(&dm_kcopyd_throttle); if (IS_ERR(s->kcopyd_client)) { @@ -1396,17 +1395,54 @@ static void snapshot_dtr(struct dm_target *ti) dm_put_device(ti, s->origin); + WARN_ON(s->in_progress); + kfree(s); } static void account_start_copy(struct dm_snapshot *s) { - down(&s->cow_count); + spin_lock(&s->in_progress_wait.lock); + s->in_progress++; + spin_unlock(&s->in_progress_wait.lock); } static void account_end_copy(struct dm_snapshot *s) { - up(&s->cow_count); + spin_lock(&s->in_progress_wait.lock); + BUG_ON(!s->in_progress); + s->in_progress--; + if (likely(s->in_progress <= cow_threshold) && + unlikely(waitqueue_active(&s->in_progress_wait))) + wake_up_locked(&s->in_progress_wait); + spin_unlock(&s->in_progress_wait.lock); +} + +static bool wait_for_in_progress(struct dm_snapshot *s, bool unlock_origins) +{ + if (unlikely(s->in_progress > cow_threshold)) { + spin_lock(&s->in_progress_wait.lock); + if (likely(s->in_progress > cow_threshold)) { + /* + * NOTE: this throttle doesn't account for whether + * the caller is servicing an IO that will trigger a COW + * so excess throttling may result for chunks not required + * to be COW'd. But if cow_threshold was reached, extra + * throttling is unlikely to negatively impact performance. + */ + DECLARE_WAITQUEUE(wait, current); + __add_wait_queue(&s->in_progress_wait, &wait); + __set_current_state(TASK_UNINTERRUPTIBLE); + spin_unlock(&s->in_progress_wait.lock); + if (unlock_origins) + up_read(&_origins_lock); + io_schedule(); + remove_wait_queue(&s->in_progress_wait, &wait); + return false; + } + spin_unlock(&s->in_progress_wait.lock); + } + return true; } /* @@ -1424,7 +1460,7 @@ static void flush_bios(struct bio *bio) } } -static int do_origin(struct dm_dev *origin, struct bio *bio); +static int do_origin(struct dm_dev *origin, struct bio *bio, bool limit); /* * Flush a list of buffers. @@ -1437,7 +1473,7 @@ static void retry_origin_bios(struct dm_snapshot *s, struct bio *bio) while (bio) { n = bio->bi_next; bio->bi_next = NULL; - r = do_origin(s->origin, bio); + r = do_origin(s->origin, bio, false); if (r == DM_MAPIO_REMAPPED) generic_make_request(bio); bio = n; @@ -1739,6 +1775,11 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio) if (!s->valid) return DM_MAPIO_KILL; + if (bio_data_dir(bio) == WRITE) { + while (unlikely(!wait_for_in_progress(s, false))) + ; /* wait_for_in_progress() has slept */ + } + mutex_lock(&s->lock); if (!s->valid || (unlikely(s->snapshot_overflowed) && @@ -1887,7 +1928,7 @@ redirect_to_origin: if (bio_data_dir(bio) == WRITE) { mutex_unlock(&s->lock); - return do_origin(s->origin, bio); + return do_origin(s->origin, bio, false); } out_unlock: @@ -2224,15 +2265,24 @@ next_snapshot: /* * Called on a write from the origin driver. */ -static int do_origin(struct dm_dev *origin, struct bio *bio) +static int do_origin(struct dm_dev *origin, struct bio *bio, bool limit) { struct origin *o; int r = DM_MAPIO_REMAPPED; +again: down_read(&_origins_lock); o = __lookup_origin(origin->bdev); - if (o) + if (o) { + if (limit) { + struct dm_snapshot *s; + list_for_each_entry(s, &o->snapshots, list) + if (unlikely(!wait_for_in_progress(s, true))) + goto again; + } + r = __origin_write(&o->snapshots, bio->bi_iter.bi_sector, bio); + } up_read(&_origins_lock); return r; @@ -2345,7 +2395,7 @@ static int origin_map(struct dm_target *ti, struct bio *bio) dm_accept_partial_bio(bio, available_sectors); /* Only tell snapshots if this is a write */ - return do_origin(o->dev, bio); + return do_origin(o->dev, bio, true); } static long origin_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, -- 2.20.1