From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Joe Thornber, Mike Snitzer, Sasha Levin
Subject: [PATCH 4.14 66/94] dm thin metadata: try to avoid ever aborting transactions
Date: Mon, 8 Oct 2018 20:31:47 +0200
Message-Id: <20181008175609.656421529@linuxfoundation.org>
In-Reply-To: <20181008175605.067676667@linuxfoundation.org>
References: <20181008175605.067676667@linuxfoundation.org>
User-Agent: quilt/0.65
X-stable: review

4.14-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Joe Thornber

[ Upstream commit 3ab91828166895600efd9cdc3a0eb32001f7204a ]

Committing a transaction can consume some metadata of its own, so we now
reserve a small amount of metadata to cover this.  Free metadata reported
by the kernel will not include this reserve.

If any of the reserve has been used after a commit, we enter a new
internal state PM_OUT_OF_METADATA_SPACE.  This is reported as
PM_READ_ONLY, so no userland changes are needed.  If the metadata device
is resized, the pool will move back to PM_WRITE.

These changes mean we never need to abort and roll back a transaction due
to running out of metadata space.  This is particularly important because
there have been a handful of reports of data corruption against DM
thin-provisioning that can all be attributed to the thin-pool having run
out of metadata space.

Signed-off-by: Joe Thornber
Signed-off-by: Mike Snitzer
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

---
 drivers/md/dm-thin-metadata.c |   36 ++++++++++++++++++++
 drivers/md/dm-thin.c          |   73 +++++++++++++++++++++++++++++++++++++-----
 2 files changed, 100 insertions(+), 9 deletions(-)

--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -189,6 +189,12 @@ struct dm_pool_metadata {
 	sector_t data_block_size;
 
 	/*
+	 * We reserve a section of the metadata for commit overhead.
+	 * All reported space does *not* include this.
+	 */
+	dm_block_t metadata_reserve;
+
+	/*
 	 * Set if a transaction has to be aborted but the attempt to roll back
 	 * to the previous (good) transaction failed.  The only pool metadata
 	 * operation possible in this state is the closing of the device.
@@ -825,6 +831,22 @@ static int __commit_transaction(struct d
 	return dm_tm_commit(pmd->tm, sblock);
 }
 
+static void __set_metadata_reserve(struct dm_pool_metadata *pmd)
+{
+	int r;
+	dm_block_t total;
+	dm_block_t max_blocks = 4096;	/* 16M */
+
+	r = dm_sm_get_nr_blocks(pmd->metadata_sm, &total);
+	if (r) {
+		DMERR("could not get size of metadata device");
+		pmd->metadata_reserve = max_blocks;
+	} else {
+		sector_div(total, 10);
+		pmd->metadata_reserve = min(max_blocks, total);
+	}
+}
+
 struct dm_pool_metadata *dm_pool_metadata_open(struct block_device *bdev,
 					       sector_t data_block_size,
 					       bool format_device)
@@ -858,6 +880,8 @@ struct dm_pool_metadata *dm_pool_metadat
 		return ERR_PTR(r);
 	}
 
+	__set_metadata_reserve(pmd);
+
 	return pmd;
 }
 
@@ -1829,6 +1853,13 @@ int dm_pool_get_free_metadata_block_coun
 	down_read(&pmd->root_lock);
 	if (!pmd->fail_io)
 		r = dm_sm_get_nr_free(pmd->metadata_sm, result);
+
+	if (!r) {
+		if (*result < pmd->metadata_reserve)
+			*result = 0;
+		else
+			*result -= pmd->metadata_reserve;
+	}
 	up_read(&pmd->root_lock);
 
 	return r;
@@ -1941,8 +1972,11 @@ int dm_pool_resize_metadata_dev(struct d
 	int r = -EINVAL;
 
 	down_write(&pmd->root_lock);
-	if (!pmd->fail_io)
+	if (!pmd->fail_io) {
 		r = __resize_space_map(pmd->metadata_sm, new_count);
+		if (!r)
+			__set_metadata_reserve(pmd);
+	}
 	up_write(&pmd->root_lock);
 
 	return r;
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -200,7 +200,13 @@ struct dm_thin_new_mapping;
 enum pool_mode {
 	PM_WRITE,		/* metadata may be changed */
 	PM_OUT_OF_DATA_SPACE,	/* metadata may be changed, though data
				   may not be allocated */
+
+	/*
+	 * Like READ_ONLY, except may switch back to WRITE on metadata resize. Reported as READ_ONLY.
+	 */
+	PM_OUT_OF_METADATA_SPACE,
 	PM_READ_ONLY,		/* metadata may not be changed */
+
 	PM_FAIL,		/* all I/O fails */
 };
 
@@ -1382,7 +1388,35 @@ static void set_pool_mode(struct pool *p
 
 static void requeue_bios(struct pool *pool);
 
-static void check_for_space(struct pool *pool)
+static bool is_read_only_pool_mode(enum pool_mode mode)
+{
+	return (mode == PM_OUT_OF_METADATA_SPACE || mode == PM_READ_ONLY);
+}
+
+static bool is_read_only(struct pool *pool)
+{
+	return is_read_only_pool_mode(get_pool_mode(pool));
+}
+
+static void check_for_metadata_space(struct pool *pool)
+{
+	int r;
+	const char *ooms_reason = NULL;
+	dm_block_t nr_free;
+
+	r = dm_pool_get_free_metadata_block_count(pool->pmd, &nr_free);
+	if (r)
+		ooms_reason = "Could not get free metadata blocks";
+	else if (!nr_free)
+		ooms_reason = "No free metadata blocks";
+
+	if (ooms_reason && !is_read_only(pool)) {
+		DMERR("%s", ooms_reason);
+		set_pool_mode(pool, PM_OUT_OF_METADATA_SPACE);
+	}
+}
+
+static void check_for_data_space(struct pool *pool)
 {
 	int r;
 	dm_block_t nr_free;
@@ -1408,14 +1442,16 @@ static int commit(struct pool *pool)
 {
 	int r;
 
-	if (get_pool_mode(pool) >= PM_READ_ONLY)
+	if (get_pool_mode(pool) >= PM_OUT_OF_METADATA_SPACE)
 		return -EINVAL;
 
 	r = dm_pool_commit_metadata(pool->pmd);
 	if (r)
 		metadata_operation_failed(pool, "dm_pool_commit_metadata", r);
-	else
-		check_for_space(pool);
+	else {
+		check_for_metadata_space(pool);
+		check_for_data_space(pool);
+	}
 
 	return r;
 }
@@ -1481,6 +1517,19 @@ static int alloc_data_block(struct thin_
 		return r;
 	}
 
+	r = dm_pool_get_free_metadata_block_count(pool->pmd, &free_blocks);
+	if (r) {
+		metadata_operation_failed(pool, "dm_pool_get_free_metadata_block_count", r);
+		return r;
+	}
+
+	if (!free_blocks) {
+		/* Let's commit before we use up the metadata reserve. */
+		r = commit(pool);
+		if (r)
+			return r;
+	}
+
 	return 0;
 }
 
@@ -1512,6 +1561,7 @@ static blk_status_t should_error_unservi
 	case PM_OUT_OF_DATA_SPACE:
 		return pool->pf.error_if_no_space ? BLK_STS_NOSPC : 0;
 
+	case PM_OUT_OF_METADATA_SPACE:
 	case PM_READ_ONLY:
 	case PM_FAIL:
 		return BLK_STS_IOERR;
@@ -2475,8 +2525,9 @@ static void set_pool_mode(struct pool *p
 		error_retry_list(pool);
 		break;
 
+	case PM_OUT_OF_METADATA_SPACE:
 	case PM_READ_ONLY:
-		if (old_mode != new_mode)
+		if (!is_read_only_pool_mode(old_mode))
 			notify_of_pool_mode_change(pool, "read-only");
 		dm_pool_metadata_read_only(pool->pmd);
 		pool->process_bio = process_bio_read_only;
@@ -3412,6 +3463,10 @@ static int maybe_resize_metadata_dev(str
 		DMINFO("%s: growing the metadata device from %llu to %llu blocks",
 		       dm_device_name(pool->pool_md),
 		       sb_metadata_dev_size, metadata_dev_size);
+
+		if (get_pool_mode(pool) == PM_OUT_OF_METADATA_SPACE)
+			set_pool_mode(pool, PM_WRITE);
+
 		r = dm_pool_resize_metadata_dev(pool->pmd, metadata_dev_size);
 		if (r) {
 			metadata_operation_failed(pool, "dm_pool_resize_metadata_dev", r);
@@ -3715,7 +3770,7 @@ static int pool_message(struct dm_target
 	struct pool_c *pt = ti->private;
 	struct pool *pool = pt->pool;
 
-	if (get_pool_mode(pool) >= PM_READ_ONLY) {
+	if (get_pool_mode(pool) >= PM_OUT_OF_METADATA_SPACE) {
 		DMERR("%s: unable to service pool target messages in READ_ONLY or FAIL mode",
 		      dm_device_name(pool->pool_md));
 		return -EOPNOTSUPP;
@@ -3789,6 +3844,7 @@ static void pool_status(struct dm_target
 	dm_block_t nr_blocks_data;
 	dm_block_t nr_blocks_metadata;
 	dm_block_t held_root;
+	enum pool_mode mode;
 	char buf[BDEVNAME_SIZE];
 	char buf2[BDEVNAME_SIZE];
 	struct pool_c *pt = ti->private;
@@ -3859,9 +3915,10 @@ static void pool_status(struct dm_target
 	else
 		DMEMIT("- ");
 
-	if (pool->pf.mode == PM_OUT_OF_DATA_SPACE)
+	mode = get_pool_mode(pool);
+	if (mode == PM_OUT_OF_DATA_SPACE)
 		DMEMIT("out_of_data_space ");
-	else if (pool->pf.mode == PM_READ_ONLY)
+	else if (is_read_only_pool_mode(mode))
 		DMEMIT("ro ");
 	else
 		DMEMIT("rw ");