Received: by 2002:a05:6a10:9848:0:0:0:0 with SMTP id x8csp3063004pxf; Mon, 5 Apr 2021 02:06:01 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzb8AXhsHLdlv2eVl5UWx5bJFHYju3CYDweOrMhtLTcWVPfaQ3n3sNAp8faJkb4/yHFguwg X-Received: by 2002:a02:b890:: with SMTP id p16mr22619506jam.138.1617613561264; Mon, 05 Apr 2021 02:06:01 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1617613561; cv=none; d=google.com; s=arc-20160816; b=eLE+NZQY0NnRCUAtQIq+GCi/jlJF1UyS9yel+fckqCapWiDWBvnDNiPXRmhGHBK4np IrqS76GeY12EyxdMvAFKQT97KccKEKubeXY9rzdzolvHO3UOQPWOmmRiwCB1eNpDVbAZ 6gOqVMhXjrPy6CsA5Mz/FOtx3o4JNhcL4rUHsUWKZ8upDBrrKvqH/X+sG8lC1/5aYqSy IuJpLLV8T68g6iKCoFC2vJ3rnHA6HQAwvjAHWX7HKsyxbDrIwJwRSqA4WlMUFEDQ3zZW qKtFg7yve3wKn1ETTPJwNA5TOF3DAAqT9hbEuAeXmeSpgI8ZDdGgkIJpgWuK+XTVFdkS OlBg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :user-agent:references:in-reply-to:message-id:date:subject:cc:to :from:dkim-signature; bh=/3BieOJJaaCOVFFGK8CpiM3ksGrqR34s9AVWjhjq+6Q=; b=ciXucU/+sTTybk8H471osqJa26z53oaKyNmGpy86aLsChyr104CbQvLGKXKxqPrh24 9xzk29lQK8Mqxc53DlRxv+LIdeWtTxIm+zXmqK56fdXVRHy0jJ3Hnk1I3wrXqtv33MHU aCzCYo/vf1O0XHMm3fNRSRn11B+Ddd6k4QJVxaXjl77c3smrd60DjT7JsESvhMh9mDZr jvz3R3GJyl9DDJ5wU6QPdweOqbgSwg2B98plPhSEEJGrCv8A5yDr3buZmY57L8rKlGPb 6/CfpMGnCcnbVB1TZapulIz8Sz7KA9i3w8JQf2DysneLJ8uKjS4ipNE/4+eSoAcDCPmt Hopw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linuxfoundation.org header.s=korg header.b=aPdhgQXD; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linuxfoundation.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id y14si14646104ilu.81.2021.04.05.02.05.49; Mon, 05 Apr 2021 02:06:01 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@linuxfoundation.org header.s=korg header.b=aPdhgQXD; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linuxfoundation.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237998AbhDEJET (ORCPT + 99 others); Mon, 5 Apr 2021 05:04:19 -0400 Received: from mail.kernel.org ([198.145.29.99]:45676 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237256AbhDEJDC (ORCPT ); Mon, 5 Apr 2021 05:03:02 -0400 Received: by mail.kernel.org (Postfix) with ESMTPSA id 1658C613C5; Mon, 5 Apr 2021 09:02:55 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=linuxfoundation.org; s=korg; t=1617613376; bh=ytUUbqvWoGeoFBS64CMQtxqqcD2Gua5+oy7bqfCpo7E=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=aPdhgQXDYckB7g9gNdkXm5fZjxfAB5/RjTc7MpBkTpG0SK5GVgEaxGleh+1SxuCyt ZRm+fpDyBdXie3yJQNxmiZ986BZNULDLPHUjVXaapVhUqR0Y3pWGvQwjGPID0LyNlg 2WhZ4qXKBtESr3bFRkDai28DPSKrCMkUHJbSBvys= From: Greg Kroah-Hartman To: linux-kernel@vger.kernel.org Cc: Greg Kroah-Hartman , stable@vger.kernel.org, Eric Whitney , Theodore Tso , Sasha Levin Subject: [PATCH 5.4 06/74] ext4: shrink race window in ext4_should_retry_alloc() Date: Mon, 5 Apr 2021 10:53:30 +0200 Message-Id: <20210405085024.918760035@linuxfoundation.org> X-Mailer: git-send-email 2.31.1 In-Reply-To: <20210405085024.703004126@linuxfoundation.org> References: <20210405085024.703004126@linuxfoundation.org> User-Agent: quilt/0.66 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Eric Whitney [ Upstream commit efc61345274d6c7a46a0570efbc916fcbe3e927b ] When generic/371 is run on kvm-xfstests using 5.10 and 5.11 kernels, it fails at significant rates on the two test scenarios that disable delayed allocation (ext3conv and data_journal) and force actual block allocation for the fallocate and pwrite functions in the test. The failure rate on 5.10 for both ext3conv and data_journal on one test system typically runs about 85%. On 5.11, the failure rate on ext3conv sometimes drops to as low as 1% while the rate on data_journal increases to nearly 100%. The observed failures are largely due to ext4_should_retry_alloc() cutting off block allocation retries when s_mb_free_pending (used to indicate that a transaction in progress will free blocks) is 0. However, free space is usually available when this occurs during runs of generic/371. It appears that a thread attempting to allocate blocks is just missing transaction commits in other threads that increase the free cluster count and reset s_mb_free_pending while the allocating thread isn't running. Explicitly testing for free space availability avoids this race. The current code uses a post-increment operator in the conditional expression that determines whether the retry limit has been exceeded. This means that the conditional expression uses the value of the retry counter before it's increased, resulting in an extra retry cycle. The current code actually retries twice before hitting its retry limit rather than once. Increasing the retry limit to 3 from the current actual maximum retry count of 2 in combination with the change described above reduces the observed failure rate to less that 0.1% on both ext3conv and data_journal with what should be limited impact on users sensitive to the overhead caused by retries. A per filesystem percpu counter exported via sysfs is added to allow users or developers to track the number of times the retry limit is exceeded without resorting to debugging methods. This should provide some insight into worst case retry behavior. Signed-off-by: Eric Whitney Link: https://lore.kernel.org/r/20210218151132.19678-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o Signed-off-by: Sasha Levin --- fs/ext4/balloc.c | 38 ++++++++++++++++++++++++++------------ fs/ext4/ext4.h | 1 + fs/ext4/super.c | 5 +++++ fs/ext4/sysfs.c | 7 +++++++ 4 files changed, 39 insertions(+), 12 deletions(-) diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c index 5aba67a504cf..031ff3f19018 100644 --- a/fs/ext4/balloc.c +++ b/fs/ext4/balloc.c @@ -612,27 +612,41 @@ int ext4_claim_free_clusters(struct ext4_sb_info *sbi, /** * ext4_should_retry_alloc() - check if a block allocation should be retried - * @sb: super block - * @retries: number of attemps has been made + * @sb: superblock + * @retries: number of retry attempts made so far * - * ext4_should_retry_alloc() is called when ENOSPC is returned, and if - * it is profitable to retry the operation, this function will wait - * for the current or committing transaction to complete, and then - * return TRUE. We will only retry once. + * ext4_should_retry_alloc() is called when ENOSPC is returned while + * attempting to allocate blocks. If there's an indication that a pending + * journal transaction might free some space and allow another attempt to + * succeed, this function will wait for the current or committing transaction + * to complete and then return TRUE. */ int ext4_should_retry_alloc(struct super_block *sb, int *retries) { - if (!ext4_has_free_clusters(EXT4_SB(sb), 1, 0) || - (*retries)++ > 1 || - !EXT4_SB(sb)->s_journal) + struct ext4_sb_info *sbi = EXT4_SB(sb); + + if (!sbi->s_journal) return 0; - smp_mb(); - if (EXT4_SB(sb)->s_mb_free_pending == 0) + if (++(*retries) > 3) { + percpu_counter_inc(&sbi->s_sra_exceeded_retry_limit); return 0; + } + /* + * if there's no indication that blocks are about to be freed it's + * possible we just missed a transaction commit that did so + */ + smp_mb(); + if (sbi->s_mb_free_pending == 0) + return ext4_has_free_clusters(sbi, 1, 0); + + /* + * it's possible we've just missed a transaction commit here, + * so ignore the returned status + */ jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id); - jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal); + (void) jbd2_journal_force_commit_nested(sbi->s_journal); return 1; } diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 1c558b554788..bf3eaa903033 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1420,6 +1420,7 @@ struct ext4_sb_info { struct percpu_counter s_freeinodes_counter; struct percpu_counter s_dirs_counter; struct percpu_counter s_dirtyclusters_counter; + struct percpu_counter s_sra_exceeded_retry_limit; struct blockgroup_lock *s_blockgroup_lock; struct proc_dir_entry *s_proc; struct kobject s_kobj; diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 06568467b0c2..2ecf4594a20d 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1017,6 +1017,7 @@ static void ext4_put_super(struct super_block *sb) percpu_counter_destroy(&sbi->s_freeinodes_counter); percpu_counter_destroy(&sbi->s_dirs_counter); percpu_counter_destroy(&sbi->s_dirtyclusters_counter); + percpu_counter_destroy(&sbi->s_sra_exceeded_retry_limit); percpu_free_rwsem(&sbi->s_writepages_rwsem); #ifdef CONFIG_QUOTA for (i = 0; i < EXT4_MAXQUOTAS; i++) @@ -4597,6 +4598,9 @@ no_journal: if (!err) err = percpu_counter_init(&sbi->s_dirtyclusters_counter, 0, GFP_KERNEL); + if (!err) + err = percpu_counter_init(&sbi->s_sra_exceeded_retry_limit, 0, + GFP_KERNEL); if (!err) err = percpu_init_rwsem(&sbi->s_writepages_rwsem); @@ -4699,6 +4703,7 @@ failed_mount6: percpu_counter_destroy(&sbi->s_freeinodes_counter); percpu_counter_destroy(&sbi->s_dirs_counter); percpu_counter_destroy(&sbi->s_dirtyclusters_counter); + percpu_counter_destroy(&sbi->s_sra_exceeded_retry_limit); percpu_free_rwsem(&sbi->s_writepages_rwsem); failed_mount5: ext4_ext_release(sb); diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c index eb1efad0e20a..9394360ff137 100644 --- a/fs/ext4/sysfs.c +++ b/fs/ext4/sysfs.c @@ -23,6 +23,7 @@ typedef enum { attr_session_write_kbytes, attr_lifetime_write_kbytes, attr_reserved_clusters, + attr_sra_exceeded_retry_limit, attr_inode_readahead, attr_trigger_test_error, attr_first_error_time, @@ -176,6 +177,7 @@ EXT4_ATTR_FUNC(delayed_allocation_blocks, 0444); EXT4_ATTR_FUNC(session_write_kbytes, 0444); EXT4_ATTR_FUNC(lifetime_write_kbytes, 0444); EXT4_ATTR_FUNC(reserved_clusters, 0644); +EXT4_ATTR_FUNC(sra_exceeded_retry_limit, 0444); EXT4_ATTR_OFFSET(inode_readahead_blks, 0644, inode_readahead, ext4_sb_info, s_inode_readahead_blks); @@ -207,6 +209,7 @@ static struct attribute *ext4_attrs[] = { ATTR_LIST(session_write_kbytes), ATTR_LIST(lifetime_write_kbytes), ATTR_LIST(reserved_clusters), + ATTR_LIST(sra_exceeded_retry_limit), ATTR_LIST(inode_readahead_blks), ATTR_LIST(inode_goal), ATTR_LIST(mb_stats), @@ -308,6 +311,10 @@ static ssize_t ext4_attr_show(struct kobject *kobj, return snprintf(buf, PAGE_SIZE, "%llu\n", (unsigned long long) atomic64_read(&sbi->s_resv_clusters)); + case attr_sra_exceeded_retry_limit: + return snprintf(buf, PAGE_SIZE, "%llu\n", + (unsigned long long) + percpu_counter_sum(&sbi->s_sra_exceeded_retry_limit)); case attr_inode_readahead: case attr_pointer_ui: if (!ptr) -- 2.30.1