Received: by 2002:ac0:a5b6:0:0:0:0:0 with SMTP id m51-v6csp2249587imm; Mon, 28 May 2018 04:46:08 -0700 (PDT) X-Google-Smtp-Source: AB8JxZrcxI2JCBoQXEaV1btzwSVgfPcfBG5ERfB74aNU8enDWwOlXnJO9wYEX98HczHLzXmtzKID X-Received: by 2002:a17:902:1c8:: with SMTP id b66-v6mr12977358plb.156.1527507968017; Mon, 28 May 2018 04:46:08 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1527507967; cv=none; d=google.com; s=arc-20160816; b=BH51mu7PZgkbJlsTg4QvpJm4ZwvghWlFfk9or3hZQASX63wgA3muGTR0qa99i2Jjfb zSlQavLQ0GmTKdc0O27i1cqZ4jyI2cxDMe+1/syYvNqRnAxBV/qvdpzncVcFYrlK7JcQ ZQmx2UQaSPrS+AGIbgMvfgjdIaIAyJKDagI21VQQkZ80coYSAGW2Hc6jGKRZWaM51Kfz 5ar4TG0AtYH1A0BwfkqnTTyXNAdtphhS0xBueLWuJNmgJlFaEuiRSnKym3OuGWbkZHU+ /yQI6IC/2xFTFXnists8FCVemiS3eKo/ghJOCQKR+dUYIPPAdCZ7xAqqRkTNNw7R5LG+ PJmA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:mime-version:user-agent:references :in-reply-to:message-id:date:subject:cc:to:from:dkim-signature :arc-authentication-results; bh=jdYwShH8kSthbRz7JjKfYnW8shYyCQT/kamaI/z45m8=; b=txr2rTZosKNq5dI1Lj3SjJD7cmuA//6hAZPc6RVX93PB5J3LiTWYwABY5vTfW1EKQL AalezsbrD9rOYNd0w3da2zACJgh5DjjsXzt3fS6XJfEhELS20tUjDE0qM1bHKHtR+lt0 iZbCriszJfPuSGa/JyqYEzXX+7qhszUe+sz6IMgf1f/3Sj2jo/EXznDBDCePETZQlmO7 Me8608PMUgFgYibjpyupxNg+6n7fE1eGwSWTQWkX8Taf/Q+PkgOTgWkh+5nFiM1/+4M7 j+vyVjy6TsWMlcxLMx2MT5Ye1THzfrgzfBFaTKfXEcXb1fdePSmK8bv6/9sttLQWskNM +/Pw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=mYJMStfQ; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id a3-v6si23212886pgq.652.2018.05.28.04.45.53; Mon, 28 May 2018 04:46:07 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=mYJMStfQ; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1424089AbeE1LpG (ORCPT + 99 others); Mon, 28 May 2018 07:45:06 -0400 Received: from mail.kernel.org ([198.145.29.99]:59168 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1423968AbeE1LLJ (ORCPT ); Mon, 28 May 2018 07:11:09 -0400 Received: from localhost (LFbn-1-12247-202.w90-92.abo.wanadoo.fr [90.92.61.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id EA3622087E; Mon, 28 May 2018 11:11:07 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1527505868; bh=55XoDhCftcgLh2JAvDPyIpExRqrVPleI0Oet6HFsWqk=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=mYJMStfQtQU0NuSuVea22oXJgKWgYtnM915+3fc4uVIphWcs/1Pdo9LoIBbtxrArC 0gvGWTg0OaBok/nK8iOKa/j2NsWxWhsjOvm6E1lnguQzTlQ5wEwcn9TZGrDVixjgD8 8FkTMLvR/uD4TJJYl5FeWAKsjGqMHOdTdaQPFKm4= From: Greg Kroah-Hartman To: linux-kernel@vger.kernel.org Cc: Greg Kroah-Hartman , stable@vger.kernel.org, Coly Li , Hannes Reinecke , Michael Lyle , Junhui Tang , Jens Axboe , Sasha Levin Subject: [PATCH 4.16 150/272] bcache: fix cached_dev->count usage for bch_cache_set_error() Date: Mon, 28 May 2018 12:03:03 +0200 Message-Id: <20180528100253.560846038@linuxfoundation.org> X-Mailer: git-send-email 2.17.0 In-Reply-To: <20180528100240.256525891@linuxfoundation.org> References: <20180528100240.256525891@linuxfoundation.org> User-Agent: quilt/0.65 X-stable: review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org 4.16-stable review patch. If anyone has any objections, please let me know. ------------------ From: Coly Li [ Upstream commit 804f3c6981f5e4a506a8f14dc284cb218d0659ae ] When bcache metadata I/O fails, bcache will call bch_cache_set_error() to retire the whole cache set. The expected behavior to retire a cache set is to unregister the cache set, and unregister all backing device attached to this cache set, then remove sysfs entries of the cache set and all attached backing devices, finally release memory of structs cache_set, cache, cached_dev and bcache_device. In my testing when journal I/O failure triggered by disconnected cache device, sometimes the cache set cannot be retired, and its sysfs entry /sys/fs/bcache/ still exits and the backing device also references it. This is not expected behavior. When metadata I/O failes, the call senquence to retire whole cache set is, bch_cache_set_error() bch_cache_set_unregister() bch_cache_set_stop() __cache_set_unregister() <- called as callback by calling clousre_queue(&c->caching) cache_set_flush() <- called as a callback when refcount of cache_set->caching is 0 cache_set_free() <- called as a callback when refcount of catch_set->cl is 0 bch_cache_set_release() <- called as a callback when refcount of catch_set->kobj is 0 I find if kernel thread bch_writeback_thread() quits while-loop when kthread_should_stop() is true and searched_full_index is false, clousre callback cache_set_flush() set by continue_at() will never be called. The result is, bcache fails to retire whole cache set. cache_set_flush() will be called when refcount of closure c->caching is 0, and in function bcache_device_detach() refcount of closure c->caching is released to 0 by clousre_put(). In metadata error code path, function bcache_device_detach() is called by cached_dev_detach_finish(). This is a callback routine being called when cached_dev->count is 0. This refcount is decreased by cached_dev_put(). The above dependence indicates, cache_set_flush() will be called when refcount of cache_set->cl is 0, and refcount of cache_set->cl to be 0 when refcount of cache_dev->count is 0. The reason why sometimes cache_dev->count is not 0 (when metadata I/O fails and bch_cache_set_error() called) is, in bch_writeback_thread(), refcount of cache_dev is not decreased properly. In bch_writeback_thread(), cached_dev_put() is called only when searched_full_index is true and cached_dev->writeback_keys is empty, a.k.a there is no dirty data on cache. In most of run time it is correct, but when bch_writeback_thread() quits the while-loop while cache is still dirty, current code forget to call cached_dev_put() before this kernel thread exits. This is why sometimes cache_set_flush() is not executed and cache set fails to be retired. The reason to call cached_dev_put() in bch_writeback_rate() is, when the cache device changes from clean to dirty, cached_dev_get() is called, to make sure during writeback operatiions both backing and cache devices won't be released. Adding following code in bch_writeback_thread() does not work, static int bch_writeback_thread(void *arg) } + if (atomic_read(&dc->has_dirty)) + cached_dev_put() + return 0; } because writeback kernel thread can be waken up and start via sysfs entry: echo 1 > /sys/block/bcache/bcache/writeback_running It is difficult to check whether backing device is dirty without race and extra lock. So the above modification will introduce potential refcount underflow in some conditions. The correct fix is, to take cached dev refcount when creating the kernel thread, and put it before the kernel thread exits. Then bcache does not need to take a cached dev refcount when cache turns from clean to dirty, or to put a cached dev refcount when cache turns from ditry to clean. The writeback kernel thread is alwasy safe to reference data structure from cache set, cache and cached device (because a refcount of cache device is taken for it already), and no matter the kernel thread is stopped by I/O errors or system reboot, cached_dev->count can always be used correctly. The patch is simple, but understanding how it works is quite complicated. Changelog: v2: set dc->writeback_thread to NULL in this patch, as suggested by Hannes. v1: initial version for review. Signed-off-by: Coly Li Reviewed-by: Hannes Reinecke Reviewed-by: Michael Lyle Cc: Michael Lyle Cc: Junhui Tang Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman --- drivers/md/bcache/super.c | 1 - drivers/md/bcache/writeback.c | 11 ++++++++--- drivers/md/bcache/writeback.h | 2 -- 3 files changed, 8 insertions(+), 6 deletions(-) --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1065,7 +1065,6 @@ int bch_cached_dev_attach(struct cached_ if (BDEV_STATE(&dc->sb) == BDEV_STATE_DIRTY) { bch_sectors_dirty_init(&dc->disk); atomic_set(&dc->has_dirty, 1); - refcount_inc(&dc->count); bch_writeback_queue(dc); } --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -578,7 +578,7 @@ static int bch_writeback_thread(void *ar if (kthread_should_stop()) { set_current_state(TASK_RUNNING); - return 0; + break; } schedule(); @@ -591,7 +591,6 @@ static int bch_writeback_thread(void *ar if (searched_full_index && RB_EMPTY_ROOT(&dc->writeback_keys.keys)) { atomic_set(&dc->has_dirty, 0); - cached_dev_put(dc); SET_BDEV_STATE(&dc->sb, BDEV_STATE_CLEAN); bch_write_bdev_super(dc, NULL); /* @@ -620,6 +619,9 @@ static int bch_writeback_thread(void *ar } } + dc->writeback_thread = NULL; + cached_dev_put(dc); + return 0; } @@ -683,10 +685,13 @@ int bch_cached_dev_writeback_start(struc if (!dc->writeback_write_wq) return -ENOMEM; + cached_dev_get(dc); dc->writeback_thread = kthread_create(bch_writeback_thread, dc, "bcache_writeback"); - if (IS_ERR(dc->writeback_thread)) + if (IS_ERR(dc->writeback_thread)) { + cached_dev_put(dc); return PTR_ERR(dc->writeback_thread); + } schedule_delayed_work(&dc->writeback_rate_update, dc->writeback_rate_update_seconds * HZ); --- a/drivers/md/bcache/writeback.h +++ b/drivers/md/bcache/writeback.h @@ -105,8 +105,6 @@ static inline void bch_writeback_add(str { if (!atomic_read(&dc->has_dirty) && !atomic_xchg(&dc->has_dirty, 1)) { - refcount_inc(&dc->count); - if (BDEV_STATE(&dc->sb) != BDEV_STATE_DIRTY) { SET_BDEV_STATE(&dc->sb, BDEV_STATE_DIRTY); /* XXX: should do this synchronously */