Received: by 2002:ac0:a582:0:0:0:0:0 with SMTP id m2-v6csp3929912imm; Mon, 8 Oct 2018 11:52:03 -0700 (PDT) X-Google-Smtp-Source: ACcGV60jce/Zo6Fo/a2SXDN8yz21/4m55qq+ahP6wL9sqtjtDO/K85Lshsp86NmqRtkvTPFBvJH8 X-Received: by 2002:a17:902:48:: with SMTP id 66-v6mr14425592pla.7.1539024723165; Mon, 08 Oct 2018 11:52:03 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1539024723; cv=none; d=google.com; s=arc-20160816; b=hUjLXDB+H0QiSbQ/QIP0rMBGshm/8Ll/lRndutEjg8/bpkAWOJNXVT7RqkxJGbWdZU se6Jam/rjYWq1Ofx1Xj9M8kXBN4smInb6pesnrCOtFMNLsNE0rwzseu/hK7LtlX9HU2V 4R166Z6yBhObj7OyUh101lHFKM5GW/YK7Le/uOYKhpzGEUtbiJUGQIu4PFZrWhX/fLZL suCIlZG0zmn7kTPw6ZirwrwNmKRPi5MEiYLsqPHNMzuYmolrRn7NP6OhROU1OUTTnfXX 0gMzkH5z3rQc4ACY5jZEb+wZTBPVL9Z+CbsBfAlGT/oD+W+mg8v9o09tKzhQj3/B7MOa JM+w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :user-agent:references:in-reply-to:message-id:date:subject:cc:to :from:dkim-signature; bh=cOhjqlGob0o0TGiVE2mnPgu3/EpD/NnrcsfXvmfU7fo=; b=DWbqPOEXXCAha98jaf7AV3bcEgcDIBrpVEWKvAc8NghuN58ea78QkGm4G1Hp5cPDY/ laRkDaIAg2fjSn/HlDY/eFJAiNUYmj9BVVFWYVrTtpXnJvuW6RINFdzd8ccoq25GIu5+ IuErnYz9tM/O6TKNU3cxJldLSMg5nhsXmdi5rbhrlq3W6htnAzp0ndqiIcO9aJtwCcil +vWxnX/oWl+94hjGFYi3jLbINHEvyi5Y3IBhmT/8CAtbxX6j690JaGUtDJOwi9aAsyxB KjIplupgB3r1ZZ1kKgVGdLZiFFSOB7AG+Bs+weMQQW33IKUq9teX/HOiUnhlOiDwroNO J+Jg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=ZoNz1X1q; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id r68-v6si20012334pfk.151.2018.10.08.11.51.48; Mon, 08 Oct 2018 11:52:03 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=ZoNz1X1q; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732377AbeJICEu (ORCPT + 99 others); Mon, 8 Oct 2018 22:04:50 -0400 Received: from mail.kernel.org ([198.145.29.99]:54786 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1731723AbeJICEt (ORCPT ); Mon, 8 Oct 2018 22:04:49 -0400 Received: from localhost (ip-213-127-77-176.ip.prioritytelecom.net [213.127.77.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 8156D214C4; Mon, 8 Oct 2018 18:51:40 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1539024701; bh=i3MigrwyYAVdKKd58YtJkYF2BhbRda7n/zbKltyIG1k=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=ZoNz1X1qz/D8fhcQXwpw3FSmJgySRzJdZhmIXcFZf4Om1yIwCsy8cIgdWu7UH6Gun hVXHfgpStZqHK+InN6z7wqO4fKaolMYlnFOxnRduQF6dq3ZQe+7Lqpa2i5CJJfNvAA AcIUTiat8M0ik2PqaMPeKEj6ghep7hmTINYGiNdU= From: Greg Kroah-Hartman To: linux-kernel@vger.kernel.org Cc: Greg Kroah-Hartman , stable@vger.kernel.org, Joe Thornber , Mike Snitzer , Sasha Levin Subject: [PATCH 4.18 126/168] dm thin metadata: try to avoid ever aborting transactions Date: Mon, 8 Oct 2018 20:31:46 +0200 Message-Id: <20181008175624.845178509@linuxfoundation.org> X-Mailer: git-send-email 2.19.0 In-Reply-To: <20181008175620.043587728@linuxfoundation.org> References: <20181008175620.043587728@linuxfoundation.org> User-Agent: quilt/0.65 X-stable: review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org 4.18-stable review patch. If anyone has any objections, please let me know. ------------------ From: Joe Thornber [ Upstream commit 3ab91828166895600efd9cdc3a0eb32001f7204a ] Committing a transaction can consume some metadata of it's own, we now reserve a small amount of metadata to cover this. Free metadata reported by the kernel will not include this reserve. If any of the reserve has been used after a commit we enter a new internal state PM_OUT_OF_METADATA_SPACE. This is reported as PM_READ_ONLY, so no userland changes are needed. If the metadata device is resized the pool will move back to PM_WRITE. These changes mean we never need to abort and rollback a transaction due to running out of metadata space. This is particularly important because there have been a handful of reports of data corruption against DM thin-provisioning that can all be attributed to the thin-pool having ran out of metadata space. Signed-off-by: Joe Thornber Signed-off-by: Mike Snitzer Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman --- drivers/md/dm-thin-metadata.c | 36 ++++++++++++++++++++ drivers/md/dm-thin.c | 73 +++++++++++++++++++++++++++++++++++++----- 2 files changed, 100 insertions(+), 9 deletions(-) --- a/drivers/md/dm-thin-metadata.c +++ b/drivers/md/dm-thin-metadata.c @@ -189,6 +189,12 @@ struct dm_pool_metadata { sector_t data_block_size; /* + * We reserve a section of the metadata for commit overhead. + * All reported space does *not* include this. + */ + dm_block_t metadata_reserve; + + /* * Set if a transaction has to be aborted but the attempt to roll back * to the previous (good) transaction failed. The only pool metadata * operation possible in this state is the closing of the device. @@ -816,6 +822,22 @@ static int __commit_transaction(struct d return dm_tm_commit(pmd->tm, sblock); } +static void __set_metadata_reserve(struct dm_pool_metadata *pmd) +{ + int r; + dm_block_t total; + dm_block_t max_blocks = 4096; /* 16M */ + + r = dm_sm_get_nr_blocks(pmd->metadata_sm, &total); + if (r) { + DMERR("could not get size of metadata device"); + pmd->metadata_reserve = max_blocks; + } else { + sector_div(total, 10); + pmd->metadata_reserve = min(max_blocks, total); + } +} + struct dm_pool_metadata *dm_pool_metadata_open(struct block_device *bdev, sector_t data_block_size, bool format_device) @@ -849,6 +871,8 @@ struct dm_pool_metadata *dm_pool_metadat return ERR_PTR(r); } + __set_metadata_reserve(pmd); + return pmd; } @@ -1820,6 +1844,13 @@ int dm_pool_get_free_metadata_block_coun down_read(&pmd->root_lock); if (!pmd->fail_io) r = dm_sm_get_nr_free(pmd->metadata_sm, result); + + if (!r) { + if (*result < pmd->metadata_reserve) + *result = 0; + else + *result -= pmd->metadata_reserve; + } up_read(&pmd->root_lock); return r; @@ -1932,8 +1963,11 @@ int dm_pool_resize_metadata_dev(struct d int r = -EINVAL; down_write(&pmd->root_lock); - if (!pmd->fail_io) + if (!pmd->fail_io) { r = __resize_space_map(pmd->metadata_sm, new_count); + if (!r) + __set_metadata_reserve(pmd); + } up_write(&pmd->root_lock); return r; --- a/drivers/md/dm-thin.c +++ b/drivers/md/dm-thin.c @@ -200,7 +200,13 @@ struct dm_thin_new_mapping; enum pool_mode { PM_WRITE, /* metadata may be changed */ PM_OUT_OF_DATA_SPACE, /* metadata may be changed, though data may not be allocated */ + + /* + * Like READ_ONLY, except may switch back to WRITE on metadata resize. Reported as READ_ONLY. + */ + PM_OUT_OF_METADATA_SPACE, PM_READ_ONLY, /* metadata may not be changed */ + PM_FAIL, /* all I/O fails */ }; @@ -1388,7 +1394,35 @@ static void set_pool_mode(struct pool *p static void requeue_bios(struct pool *pool); -static void check_for_space(struct pool *pool) +static bool is_read_only_pool_mode(enum pool_mode mode) +{ + return (mode == PM_OUT_OF_METADATA_SPACE || mode == PM_READ_ONLY); +} + +static bool is_read_only(struct pool *pool) +{ + return is_read_only_pool_mode(get_pool_mode(pool)); +} + +static void check_for_metadata_space(struct pool *pool) +{ + int r; + const char *ooms_reason = NULL; + dm_block_t nr_free; + + r = dm_pool_get_free_metadata_block_count(pool->pmd, &nr_free); + if (r) + ooms_reason = "Could not get free metadata blocks"; + else if (!nr_free) + ooms_reason = "No free metadata blocks"; + + if (ooms_reason && !is_read_only(pool)) { + DMERR("%s", ooms_reason); + set_pool_mode(pool, PM_OUT_OF_METADATA_SPACE); + } +} + +static void check_for_data_space(struct pool *pool) { int r; dm_block_t nr_free; @@ -1414,14 +1448,16 @@ static int commit(struct pool *pool) { int r; - if (get_pool_mode(pool) >= PM_READ_ONLY) + if (get_pool_mode(pool) >= PM_OUT_OF_METADATA_SPACE) return -EINVAL; r = dm_pool_commit_metadata(pool->pmd); if (r) metadata_operation_failed(pool, "dm_pool_commit_metadata", r); - else - check_for_space(pool); + else { + check_for_metadata_space(pool); + check_for_data_space(pool); + } return r; } @@ -1487,6 +1523,19 @@ static int alloc_data_block(struct thin_ return r; } + r = dm_pool_get_free_metadata_block_count(pool->pmd, &free_blocks); + if (r) { + metadata_operation_failed(pool, "dm_pool_get_free_metadata_block_count", r); + return r; + } + + if (!free_blocks) { + /* Let's commit before we use up the metadata reserve. */ + r = commit(pool); + if (r) + return r; + } + return 0; } @@ -1518,6 +1567,7 @@ static blk_status_t should_error_unservi case PM_OUT_OF_DATA_SPACE: return pool->pf.error_if_no_space ? BLK_STS_NOSPC : 0; + case PM_OUT_OF_METADATA_SPACE: case PM_READ_ONLY: case PM_FAIL: return BLK_STS_IOERR; @@ -2481,8 +2531,9 @@ static void set_pool_mode(struct pool *p error_retry_list(pool); break; + case PM_OUT_OF_METADATA_SPACE: case PM_READ_ONLY: - if (old_mode != new_mode) + if (!is_read_only_pool_mode(old_mode)) notify_of_pool_mode_change(pool, "read-only"); dm_pool_metadata_read_only(pool->pmd); pool->process_bio = process_bio_read_only; @@ -3420,6 +3471,10 @@ static int maybe_resize_metadata_dev(str DMINFO("%s: growing the metadata device from %llu to %llu blocks", dm_device_name(pool->pool_md), sb_metadata_dev_size, metadata_dev_size); + + if (get_pool_mode(pool) == PM_OUT_OF_METADATA_SPACE) + set_pool_mode(pool, PM_WRITE); + r = dm_pool_resize_metadata_dev(pool->pmd, metadata_dev_size); if (r) { metadata_operation_failed(pool, "dm_pool_resize_metadata_dev", r); @@ -3724,7 +3779,7 @@ static int pool_message(struct dm_target struct pool_c *pt = ti->private; struct pool *pool = pt->pool; - if (get_pool_mode(pool) >= PM_READ_ONLY) { + if (get_pool_mode(pool) >= PM_OUT_OF_METADATA_SPACE) { DMERR("%s: unable to service pool target messages in READ_ONLY or FAIL mode", dm_device_name(pool->pool_md)); return -EOPNOTSUPP; @@ -3798,6 +3853,7 @@ static void pool_status(struct dm_target dm_block_t nr_blocks_data; dm_block_t nr_blocks_metadata; dm_block_t held_root; + enum pool_mode mode; char buf[BDEVNAME_SIZE]; char buf2[BDEVNAME_SIZE]; struct pool_c *pt = ti->private; @@ -3868,9 +3924,10 @@ static void pool_status(struct dm_target else DMEMIT("- "); - if (pool->pf.mode == PM_OUT_OF_DATA_SPACE) + mode = get_pool_mode(pool); + if (mode == PM_OUT_OF_DATA_SPACE) DMEMIT("out_of_data_space "); - else if (pool->pf.mode == PM_READ_ONLY) + else if (is_read_only_pool_mode(mode)) DMEMIT("ro "); else DMEMIT("rw ");