Patch 1-2: fix a potential filesystem inconsistency problem.
Patch 3-7: fix two use after free problem.
Changes since v1:
- Do not use j_checkpoint_mutex to fix the filesystem inconsistency
problem, introduce a new mark instead.
- Fix superblock use-after-free issue in blkdev_releasepage().
- Avoid race between bdev_try_to_free_page() and ext4_put_super().
Zhang Yi (7):
jbd2: remove the out label in __jbd2_journal_remove_checkpoint()
jbd2: ensure abort the journal if detect IO error when writing
original buffer back
jbd2: don't abort the journal when freeing buffers
jbd2: do not free buffers in jbd2_journal_try_to_free_buffers()
ext4: use RCU to protect accessing superblock in blkdev_releasepage()
fs: introduce a usage count into the superblock
ext4: fix race between blkdev_releasepage() and ext4_put_super()
fs/block_dev.c | 13 ++++++----
fs/ext4/inode.c | 6 +++--
fs/ext4/super.c | 32 +++++++++++++++++++++----
fs/jbd2/checkpoint.c | 56 ++++++++++++++++++++++++-------------------
fs/jbd2/journal.c | 9 +++++++
fs/jbd2/transaction.c | 32 ++++++-------------------
include/linux/fs.h | 29 ++++++++++++++++++++++
include/linux/jbd2.h | 7 ++++++
8 files changed, 123 insertions(+), 61 deletions(-)
--
2.25.4
Commit <87d8fe1ee6b8> ("add releasepage hooks to block devices which can
be used by file systems") introduce a hook that used by ext4 filesystem
to release journal buffers, but it doesn't add corresponding concurrency
protection that ->bdev_try_to_free_page() could be raced by umount
filesystem concurrently. This patch add a usage count on superblock that
filesystem can use it to prevent above race and make invoke
->bdev_try_to_free_page() safe.
Signed-off-by: Zhang Yi <[email protected]>
---
include/linux/fs.h | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ec8f3ddf4a6a..3c6a5c08c2df 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -41,6 +41,7 @@
#include <linux/stddef.h>
#include <linux/mount.h>
#include <linux/cred.h>
+#include <linux/percpu-refcount.h>
#include <asm/byteorder.h>
#include <uapi/linux/fs.h>
@@ -1547,6 +1548,13 @@ struct super_block {
spinlock_t s_inode_wblist_lock;
struct list_head s_inodes_wb; /* writeback inodes */
+
+ /*
+ * Count users who are using the super_block, used to protect
+ * umount filesystem concurrently with others.
+ */
+ struct percpu_ref s_usage_counter;
+ wait_queue_head_t s_usage_waitq;
} __randomize_layout;
/* Helper functions so that in most cases filesystems will
@@ -1765,6 +1773,27 @@ static inline bool sb_start_intwrite_trylock(struct super_block *sb)
bool inode_owner_or_capable(struct user_namespace *mnt_userns,
const struct inode *inode);
+static inline void sb_usage_counter_release(struct percpu_ref *ref)
+{
+ struct super_block *sb;
+
+ sb = container_of(ref, struct super_block, s_usage_counter);
+ wake_up(&sb->s_usage_waitq);
+}
+
+static inline int sb_usage_counter_init(struct super_block *sb)
+{
+ init_waitqueue_head(&sb->s_usage_waitq);
+ return percpu_ref_init(&sb->s_usage_counter, sb_usage_counter_release,
+ 0, GFP_KERNEL);
+}
+
+static inline void sb_usage_counter_wait(struct super_block *sb)
+{
+ percpu_ref_kill(&sb->s_usage_counter);
+ wait_event(sb->s_usage_waitq, percpu_ref_is_zero(&sb->s_usage_counter));
+}
+
/*
* VFS helper functions..
*/
--
2.25.4
This patch move try_to_free_buffers() out from
jbd2_journal_try_to_free_buffers() to the caller function, it just check
the buffers are JBD2 journal busy or not, and the caller should invoke
try_to_free_buffers() if it want to release page.
Signed-off-by: Zhang Yi <[email protected]>
---
fs/block_dev.c | 8 +++++---
fs/ext4/inode.c | 6 ++++--
fs/ext4/super.c | 5 +++--
fs/jbd2/transaction.c | 17 ++++++++---------
4 files changed, 20 insertions(+), 16 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 09d6f7229db9..5ed79a9063f6 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1735,11 +1735,13 @@ EXPORT_SYMBOL_GPL(blkdev_read_iter);
static int blkdev_releasepage(struct page *page, gfp_t wait)
{
struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;
+ int ret = 0;
if (super && super->s_op->bdev_try_to_free_page)
- return super->s_op->bdev_try_to_free_page(super, page, wait);
-
- return try_to_free_buffers(page);
+ ret = super->s_op->bdev_try_to_free_page(super, page, wait);
+ if (!ret)
+ return try_to_free_buffers(page);
+ return 0;
}
static int blkdev_writepages(struct address_space *mapping,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0948a43f1b3d..3211af9c969f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3301,6 +3301,7 @@ static void ext4_journalled_invalidatepage(struct page *page,
static int ext4_releasepage(struct page *page, gfp_t wait)
{
journal_t *journal = EXT4_JOURNAL(page->mapping->host);
+ int ret = 0;
trace_ext4_releasepage(page);
@@ -3308,9 +3309,10 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
if (PageChecked(page))
return 0;
if (journal)
- return jbd2_journal_try_to_free_buffers(journal, page);
- else
+ ret = jbd2_journal_try_to_free_buffers(journal, page);
+ if (!ret)
return try_to_free_buffers(page);
+ return 0;
}
static bool ext4_inode_datasync_dirty(struct inode *inode)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b9693680463a..55c7a0d8d77d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1444,7 +1444,8 @@ static int ext4_nfs_commit_metadata(struct inode *inode)
* Try to release metadata pages (indirect blocks, directories) which are
* mapped via the block device. Since these pages could have journal heads
* which would prevent try_to_free_buffers() from freeing them, we must use
- * jbd2 layer's try_to_free_buffers() function to release them.
+ * jbd2 layer's try_to_free_buffers() function to check and we could
+ * release them if it return 0.
*/
static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
gfp_t wait)
@@ -1457,7 +1458,7 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
if (journal)
return jbd2_journal_try_to_free_buffers(journal, page);
- return try_to_free_buffers(page);
+ return 0;
}
#ifdef CONFIG_FS_ENCRYPTION
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 3e0db4953fe4..451798051fde 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2089,10 +2089,9 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
* if they are fully written out ordered data, move them onto BUF_CLEAN
* so try_to_free_buffers() can reap them.
*
- * This function returns non-zero if we wish try_to_free_buffers()
- * to be called. We do this if the page is releasable by try_to_free_buffers().
- * We also do it if the page has locked or dirty buffers and the caller wants
- * us to perform sync or async writeout.
+ * This function returns zero if all the buffers on this page are
+ * journal cleaned and the caller should invoke try_to_free_buffers() and
+ * could release page if the page is releasable by try_to_free_buffers().
*
* This complicates JBD locking somewhat. We aren't protected by the
* BKL here. We wish to remove the buffer from its committing or
@@ -2112,7 +2111,7 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
* cannot happen because we never reallocate freed data as metadata
* while the data is part of a transaction. Yes?
*
- * Return 0 on failure, 1 on success
+ * Return 0 on success, -EBUSY if any buffer is still journal busy.
*/
int jbd2_journal_try_to_free_buffers(journal_t *journal, struct page *page)
{
@@ -2140,12 +2139,12 @@ int jbd2_journal_try_to_free_buffers(journal_t *journal, struct page *page)
__journal_try_to_free_buffer(journal, bh);
spin_unlock(&jh->b_state_lock);
jbd2_journal_put_journal_head(jh);
- if (buffer_jbd(bh))
- goto busy;
+ if (buffer_jbd(bh)) {
+ ret = -EBUSY;
+ break;
+ }
} while ((bh = bh->b_this_page) != head);
- ret = try_to_free_buffers(page);
-busy:
return ret;
}
--
2.25.4
Althourh we merged <c044f3d8360> ("jbd2: abort journal if free a async
write error metadata buffer"), there is a race between
jbd2_journal_try_to_free_buffers() and jbd2_journal_destroy(), so the
jbd2_log_do_checkpoint() may still missing to detect the buffer write
io error flag and will end up leading to filesystem inconsistency.
jbd2_journal_try_to_free_buffers() ext4_put_super()
jbd2_journal_destroy()
__jbd2_journal_remove_checkpoint()
detect buffer write error jbd2_log_do_checkpoint()
jbd2_cleanup_journal_tail()
<--- lead to inconsistency
jbd2_journal_abort()
Fix this issue by introduce a new mark j_checkpoint_io_error and set it
in __jbd2_journal_remove_checkpoint() when freeing a checkpoint buffer
has write_io_error flag. Then jbd2_journal_destroy() will detect this
mark and abort the journal to prevent updating log tail.
Signed-off-by: Zhang Yi <[email protected]>
---
fs/jbd2/checkpoint.c | 32 +++++++++++++++++++-------------
fs/jbd2/journal.c | 9 +++++++++
include/linux/jbd2.h | 7 +++++++
3 files changed, 35 insertions(+), 13 deletions(-)
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index bf5511d19ac5..ff3d38ad3a1e 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -221,14 +221,15 @@ int jbd2_log_do_checkpoint(journal_t *journal)
result = jbd2_cleanup_journal_tail(journal);
trace_jbd2_checkpoint(journal, result);
jbd_debug(1, "cleanup_journal_tail returned %d\n", result);
- if (result <= 0)
+ if (result == 0)
return result;
+ if (result < 0)
+ goto error;
/*
* OK, we need to start writing disk blocks. Take one transaction
* and write it.
*/
- result = 0;
spin_lock(&journal->j_list_lock);
if (!journal->j_checkpoint_transactions)
goto out;
@@ -295,8 +296,6 @@ int jbd2_log_do_checkpoint(journal_t *journal)
goto restart;
}
if (!buffer_dirty(bh)) {
- if (unlikely(buffer_write_io_error(bh)) && !result)
- result = -EIO;
BUFFER_TRACE(bh, "remove from checkpoint");
if (__jbd2_journal_remove_checkpoint(jh))
/* The transaction was released; we're done */
@@ -356,9 +355,6 @@ int jbd2_log_do_checkpoint(journal_t *journal)
spin_lock(&journal->j_list_lock);
goto restart2;
}
- if (unlikely(buffer_write_io_error(bh)) && !result)
- result = -EIO;
-
/*
* Now in whatever state the buffer currently is, we
* know that it has been written out and so we can
@@ -369,12 +365,13 @@ int jbd2_log_do_checkpoint(journal_t *journal)
}
out:
spin_unlock(&journal->j_list_lock);
- if (result < 0)
+ result = jbd2_cleanup_journal_tail(journal);
+ if (result >= 0)
+ return 0;
+error:
+ if (!is_journal_aborted(journal))
jbd2_journal_abort(journal, result);
- else
- result = jbd2_cleanup_journal_tail(journal);
-
- return (result < 0) ? result : 0;
+ return result;
}
/*
@@ -400,7 +397,7 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
tid_t first_tid;
unsigned long blocknr;
- if (is_journal_aborted(journal))
+ if (is_journal_aborted(journal) || journal->j_checkpoint_io_error)
return -EIO;
if (!jbd2_journal_get_log_tail(journal, &first_tid, &blocknr))
@@ -564,6 +561,7 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
struct transaction_chp_stats_s *stats;
transaction_t *transaction;
journal_t *journal;
+ struct buffer_head *bh = jh2bh(jh);
JBUFFER_TRACE(jh, "entry");
@@ -575,6 +573,14 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
journal = transaction->t_journal;
JBUFFER_TRACE(jh, "removing from transaction");
+ /*
+ * If the buffer has been failed to write out to disk, it may probably
+ * lead to filesystem inconsistency after remove it from the log, so
+ * mark the journal checkpoint write back IO error.
+ */
+ if (buffer_write_io_error(bh))
+ journal->j_checkpoint_io_error = true;
+
__buffer_unlink(jh);
jh->b_cp_transaction = NULL;
jbd2_journal_put_journal_head(jh);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 2dc944442802..6dbeab8b9feb 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1995,6 +1995,15 @@ int jbd2_journal_destroy(journal_t *journal)
J_ASSERT(journal->j_checkpoint_transactions == NULL);
spin_unlock(&journal->j_list_lock);
+ /*
+ * OK, all checkpoint transactions have been checked, now check the
+ * write out io error flag and abort the journal if some buffer failed
+ * to write back to the original location, otherwise the filesystem
+ * may probably becomes inconsistency.
+ */
+ if (!is_journal_aborted(journal) && journal->j_checkpoint_io_error)
+ jbd2_journal_abort(journal, -EIO);
+
if (journal->j_sb_buffer) {
if (!is_journal_aborted(journal)) {
mutex_lock_io(&journal->j_checkpoint_mutex);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 99d3cd051ac3..53ca70f80ad8 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -841,6 +841,13 @@ struct journal_s
*/
transaction_t *j_checkpoint_transactions;
+ /**
+ * @j_checkpoint_io_error:
+ *
+ * Detect io error while writing back original buffer to disk.
+ */
+ bool j_checkpoint_io_error;
+
/**
* @j_wait_transaction_locked:
*
--
2.25.4
There still exist a use after free issue when accessing the journal
structure and ext4_sb_info structure on freeing bdev buffers in
bdev_try_to_free_page(). The problem is bdev_try_to_free_page() could be
raced by ext4_put_super(), it dose freeing sb->s_fs_info and
sbi->s_journal while release page progress are still accessing them.
So it could end up trigger use-after-free or NULL pointer dereference.
drop cache umount filesystem
blkdev_releasepage()
get sb
bdev_try_to_free_page() ext4_put_super()
kfree(journal)
if access EXT4_SB(sb)->s_journal <-- leader to use after free
sb->s_fs_info = NULL;
kfree(sbi)
if access EXT4_SB(sb)->s_journal <-- trigger NULL pointer dereference
The above race could also happens on the error path of
ext4_fill_super(). Now the bdev_try_to_free_page() is under RCU
protection, this patch fix this race by use sb->s_usage_counter to
make bdev_try_to_free_page() safe against concurrent kill_block_super().
Fixes: c39a7f84d784 ("ext4: provide function to release metadata pages under memory pressure")
Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/super.c | 29 +++++++++++++++++++++++++----
1 file changed, 25 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 55c7a0d8d77d..14eedd8e5bd3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1172,6 +1172,12 @@ static void ext4_put_super(struct super_block *sb)
*/
ext4_unregister_sysfs(sb);
+ /*
+ * Prevent racing with bdev_try_to_free_page() access the sbi and
+ * journal concurrently.
+ */
+ sb_usage_counter_wait(sb);
+
if (sbi->s_journal) {
aborted = is_journal_aborted(sbi->s_journal);
err = jbd2_journal_destroy(sbi->s_journal);
@@ -1248,6 +1254,7 @@ static void ext4_put_super(struct super_block *sb)
kthread_stop(sbi->s_mmp_tsk);
brelse(sbi->s_sbh);
sb->s_fs_info = NULL;
+ percpu_ref_exit(&sb->s_usage_counter);
/*
* Now that we are completely done shutting down the
* superblock, we need to actually destroy the kobject.
@@ -1450,15 +1457,22 @@ static int ext4_nfs_commit_metadata(struct inode *inode)
static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
gfp_t wait)
{
- journal_t *journal = EXT4_SB(sb)->s_journal;
+ int ret = 0;
WARN_ON(PageChecked(page));
if (!page_has_buffers(page))
return 0;
- if (journal)
- return jbd2_journal_try_to_free_buffers(journal, page);
- return 0;
+ /* Racing with umount filesystem concurrently? */
+ if (percpu_ref_tryget_live(&sb->s_usage_counter)) {
+ journal_t *journal = EXT4_SB(sb)->s_journal;
+
+ if (journal)
+ ret = jbd2_journal_try_to_free_buffers(journal, page);
+ percpu_ref_put(&sb->s_usage_counter);
+ }
+
+ return ret;
}
#ifdef CONFIG_FS_ENCRYPTION
@@ -4709,6 +4723,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
spin_lock_init(&sbi->s_error_lock);
INIT_WORK(&sbi->s_error_work, flush_stashed_error_work);
+ if (sb_usage_counter_init(sb))
+ goto failed_mount2a;
+
/* Register extent status tree shrinker */
if (ext4_es_register_shrinker(sbi))
goto failed_mount3;
@@ -5148,6 +5165,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
ext4_xattr_destroy_cache(sbi->s_ea_block_cache);
sbi->s_ea_block_cache = NULL;
+ sb->s_bdev->bd_super = NULL;
+ sb_usage_counter_wait(sb);
if (sbi->s_journal) {
jbd2_journal_destroy(sbi->s_journal);
sbi->s_journal = NULL;
@@ -5155,6 +5174,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
failed_mount3a:
ext4_es_unregister_shrinker(sbi);
failed_mount3:
+ percpu_ref_exit(&sb->s_usage_counter);
+failed_mount2a:
flush_work(&sbi->s_error_work);
del_timer_sync(&sbi->s_err_report);
if (sbi->s_mmp_tsk)
--
2.25.4
In blkdev_releasepage() we access the superblock structure directly, it
could be raced by umount filesystem on destroy superblock in put_super(),
and end up triggering a use after free issue.
drop cache umount filesystem
bdev_try_to_free_page()
get superblock
deactivate_locked_super()
...
put_super() and free sb by destroy_work
access superblock <-- trigger use after free
This issue doesn't trigger easily in general because we get page locked
when invoking bdev_try_to_free_page(), and when umount filesystem the
kill_block_super()->..->kill_bdev()->truncate_inode_pages_range()
procedure wait on page unlock, but it's not a guarantee. Fix this race
by use RCU to protect superblock in blkdev_releasepage().
Fixes: 87d8fe1ee6b8 ("add releasepage hooks to block devices which can be used by file systems")
Reported-by: Jan Kara <[email protected]>
Signed-off-by: Zhang Yi <[email protected]>
---
fs/block_dev.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 5ed79a9063f6..cb84f347fb04 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1734,11 +1734,14 @@ EXPORT_SYMBOL_GPL(blkdev_read_iter);
*/
static int blkdev_releasepage(struct page *page, gfp_t wait)
{
- struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;
+ struct super_block *super;
int ret = 0;
+ rcu_read_lock();
+ super = READ_ONCE(BDEV_I(page->mapping->host)->bdev.bd_super);
if (super && super->s_op->bdev_try_to_free_page)
ret = super->s_op->bdev_try_to_free_page(super, page, wait);
+ rcu_read_unlock();
if (!ret)
return try_to_free_buffers(page);
return 0;
--
2.25.4
Now, we can make sure to abort the journal once the original buffer was
failed to write back to disk, we can remove the journal abort logic in
jbd2_journal_try_to_free_buffers() which was introduced in
<c044f3d8360d> ("jbd2: abort journal if free a async write error
metadata buffer"), because it may costs and propably not safe.
Signed-off-by: Zhang Yi <[email protected]>
---
fs/jbd2/transaction.c | 17 -----------------
1 file changed, 17 deletions(-)
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 9396666b7314..3e0db4953fe4 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2118,7 +2118,6 @@ int jbd2_journal_try_to_free_buffers(journal_t *journal, struct page *page)
{
struct buffer_head *head;
struct buffer_head *bh;
- bool has_write_io_error = false;
int ret = 0;
J_ASSERT(PageLocked(page));
@@ -2143,26 +2142,10 @@ int jbd2_journal_try_to_free_buffers(journal_t *journal, struct page *page)
jbd2_journal_put_journal_head(jh);
if (buffer_jbd(bh))
goto busy;
-
- /*
- * If we free a metadata buffer which has been failed to
- * write out, the jbd2 checkpoint procedure will not detect
- * this failure and may lead to filesystem inconsistency
- * after cleanup journal tail.
- */
- if (buffer_write_io_error(bh)) {
- pr_err("JBD2: Error while async write back metadata bh %llu.",
- (unsigned long long)bh->b_blocknr);
- has_write_io_error = true;
- }
} while ((bh = bh->b_this_page) != head);
ret = try_to_free_buffers(page);
-
busy:
- if (has_write_io_error)
- jbd2_journal_abort(journal, -EIO);
-
return ret;
}
--
2.25.4
The 'out' lable just return the 'ret' value and seems not required, so
remove this label and switch to return appropriate value immediately.
This patch also do some minor cleanup, no logical change.
Signed-off-by: Zhang Yi <[email protected]>
---
fs/jbd2/checkpoint.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 63b526d44886..bf5511d19ac5 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -564,13 +564,13 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
struct transaction_chp_stats_s *stats;
transaction_t *transaction;
journal_t *journal;
- int ret = 0;
JBUFFER_TRACE(jh, "entry");
- if ((transaction = jh->b_cp_transaction) == NULL) {
+ transaction = jh->b_cp_transaction;
+ if (!transaction) {
JBUFFER_TRACE(jh, "not on transaction");
- goto out;
+ return 0;
}
journal = transaction->t_journal;
@@ -579,9 +579,9 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
jh->b_cp_transaction = NULL;
jbd2_journal_put_journal_head(jh);
- if (transaction->t_checkpoint_list != NULL ||
- transaction->t_checkpoint_io_list != NULL)
- goto out;
+ /* Is this transaction empty? */
+ if (transaction->t_checkpoint_list || transaction->t_checkpoint_io_list)
+ return 0;
/*
* There is one special case to worry about: if we have just pulled the
@@ -593,10 +593,12 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
* See the comment at the end of jbd2_journal_commit_transaction().
*/
if (transaction->t_state != T_FINISHED)
- goto out;
+ return 0;
- /* OK, that was the last buffer for the transaction: we can now
- safely remove this transaction from the log */
+ /*
+ * OK, that was the last buffer for the transaction, we can now
+ * safely remove this transaction from the log.
+ */
stats = &transaction->t_chp_stats;
if (stats->cs_chp_time)
stats->cs_chp_time = jbd2_time_diff(stats->cs_chp_time,
@@ -606,9 +608,7 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
__jbd2_journal_drop_transaction(journal, transaction);
jbd2_journal_free_transaction(transaction);
- ret = 1;
-out:
- return ret;
+ return 1;
}
/*
--
2.25.4
On Wed, Apr 14, 2021 at 09:47:36PM +0800, Zhang Yi wrote:
> Commit <87d8fe1ee6b8> ("add releasepage hooks to block devices which can
> be used by file systems") introduce a hook that used by ext4 filesystem
> to release journal buffers, but it doesn't add corresponding concurrency
> protection that ->bdev_try_to_free_page() could be raced by umount
> filesystem concurrently. This patch add a usage count on superblock that
> filesystem can use it to prevent above race and make invoke
> ->bdev_try_to_free_page() safe.
We already have two refcounts in the superblock: s_active which counts
the active refernce, and s_count which prevents the data structures
from beeing freed. I don't think we need yet another one.
On Wed, Apr 14, 2021 at 09:47:34PM +0800, Zhang Yi wrote:
> static int blkdev_releasepage(struct page *page, gfp_t wait)
> {
> struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;
> + int ret = 0;
>
> if (super && super->s_op->bdev_try_to_free_page)
> + ret = super->s_op->bdev_try_to_free_page(super, page, wait);
> + if (!ret)
> + return try_to_free_buffers(page);
> + return 0;
This would rea much better as:
if (super && super->s_op->bdev_try_to_free_page &&
super->s_op->bdev_try_to_free_page(super, page, wait)
return 0;
return try_to_free_buffers(page);
and I think changing ->bdev_try_to_free_page to return a bool where true
means "yes, free the buffers" and false means "don't free buffers" would
be a better calling convention.
Also please split the changes to the method signature from the ext4
internal cleanups.
On Wed, Apr 14, 2021 at 09:47:35PM +0800, Zhang Yi wrote:
> In blkdev_releasepage() we access the superblock structure directly, it
> could be raced by umount filesystem on destroy superblock in put_super(),
> and end up triggering a use after free issue.
The subject is wrong, this does not change ext4 code.
On Wed, Apr 14, 2021 at 09:47:37PM +0800, Zhang Yi wrote:
> There still exist a use after free issue when accessing the journal
> structure and ext4_sb_info structure on freeing bdev buffers in
> bdev_try_to_free_page(). The problem is bdev_try_to_free_page() could be
> raced by ext4_put_super(), it dose freeing sb->s_fs_info and
> sbi->s_journal while release page progress are still accessing them.
> So it could end up trigger use-after-free or NULL pointer dereference.
I think the right fix is to not even call into ->bdev_try_to_free_page
unless the superblock is active.
Hi, Christoph
On 2021/4/15 22:40, Christoph Hellwig wrote:
> On Wed, Apr 14, 2021 at 09:47:36PM +0800, Zhang Yi wrote:
>> Commit <87d8fe1ee6b8> ("add releasepage hooks to block devices which can
>> be used by file systems") introduce a hook that used by ext4 filesystem
>> to release journal buffers, but it doesn't add corresponding concurrency
>> protection that ->bdev_try_to_free_page() could be raced by umount
>> filesystem concurrently. This patch add a usage count on superblock that
>> filesystem can use it to prevent above race and make invoke
>> ->bdev_try_to_free_page() safe.
>
> We already have two refcounts in the superblock: s_active which counts
> the active refernce, and s_count which prevents the data structures
> from beeing freed. I don't think we need yet another one.
> .
>
Thanks you for your response. I checked the s_count and s_active refcounts,
but it seems that we could not use these two refcounts directly.
For the s_active. If we get s_active refcount in blkdev_releasepage(), we may
put the final refcount when doing umount concurrently and have to do resource
recycling, but we got page locked here and lead to deadlock. Maybe we could do
async resource recycling through kworker, but I think it's not a good way.
For the s_count, it is operated under the global sb_lock now, so get this
refcount will serialize page release and affect performance. Besides, It's
semantics are different from the 'usage count' by private fileststem, and we
have to cooperate with sb->s_umount mutex lock to close the above race.
So I introduce another refcount. Am I missing something or any suggestions?
Thanks,
Yi.
Hi, Christoph
On 2021/4/15 22:52, Christoph Hellwig wrote:
> On Wed, Apr 14, 2021 at 09:47:37PM +0800, Zhang Yi wrote:
>> There still exist a use after free issue when accessing the journal
>> structure and ext4_sb_info structure on freeing bdev buffers in
>> bdev_try_to_free_page(). The problem is bdev_try_to_free_page() could be
>> raced by ext4_put_super(), it dose freeing sb->s_fs_info and
>> sbi->s_journal while release page progress are still accessing them.
>> So it could end up trigger use-after-free or NULL pointer dereference.
>
> I think the right fix is to not even call into ->bdev_try_to_free_page
> unless the superblock is active.
> .
>
Thanks for your suggestions.
Now, we use already use "if (bdev->bd_super)" to prevent call into
->bdev_try_to_free_page unless the super is alive, and the problem is
bd_super becomes NULL concurrently after this check. So, IIUC, I think it's
the same to switch to check the superblock is active or not. The acvive
flag also could becomes inactive (raced by umount) after we call into
bdev_try_to_free_page().
In order to close this race, One solution is introduce a lock to synchronize
the active state between kill_block_super() and blkdev_releasepage(), but
the releasing page process have to try to acquire this lock in
blkdev_releasepage() for each page, and the umount process still need to wait
until the page release if some one invoke into ->bdev_try_to_free_page().
I think this solution may affect performace and is not a good way.
Think about it in depth, use percpu refcount seems have the smallest
performance effect on blkdev_releasepage().
If you don't like the refcount, maybe we could add synchronize_rcu_expedited()
in ext4_put_super(), it also could prevent this race. Any suggestions?
Thanks,
Yi.
On Fri, Apr 16, 2021 at 04:00:48PM +0800, Zhang Yi wrote:
> Now, we use already use "if (bdev->bd_super)" to prevent call into
> ->bdev_try_to_free_page unless the super is alive, and the problem is
> bd_super becomes NULL concurrently after this check. So, IIUC, I think it's
> the same to switch to check the superblock is active or not. The acvive
> flag also could becomes inactive (raced by umount) after we call into
> bdev_try_to_free_page().
Indeed.
> In order to close this race, One solution is introduce a lock to synchronize
> the active state between kill_block_super() and blkdev_releasepage(), but
> the releasing page process have to try to acquire this lock in
> blkdev_releasepage() for each page, and the umount process still need to wait
> until the page release if some one invoke into ->bdev_try_to_free_page().
> I think this solution may affect performace and is not a good way.
> Think about it in depth, use percpu refcount seems have the smallest
> performance effect on blkdev_releasepage().
>
> If you don't like the refcount, maybe we could add synchronize_rcu_expedited()
> in ext4_put_super(), it also could prevent this race. Any suggestions?
I really don't like to put a lot of overhead into the core VFS and block
device code. ext4/jbd does not own the block device inode and really
has no business controlling releasepage for it. I suspect the right
answer might be to simply revert the commit that added this hook.
On Wed 14-04-21 21:47:31, Zhang Yi wrote:
> The 'out' lable just return the 'ret' value and seems not required, so
> remove this label and switch to return appropriate value immediately.
> This patch also do some minor cleanup, no logical change.
>
> Signed-off-by: Zhang Yi <[email protected]>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/jbd2/checkpoint.c | 24 ++++++++++++------------
> 1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index 63b526d44886..bf5511d19ac5 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -564,13 +564,13 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
> struct transaction_chp_stats_s *stats;
> transaction_t *transaction;
> journal_t *journal;
> - int ret = 0;
>
> JBUFFER_TRACE(jh, "entry");
>
> - if ((transaction = jh->b_cp_transaction) == NULL) {
> + transaction = jh->b_cp_transaction;
> + if (!transaction) {
> JBUFFER_TRACE(jh, "not on transaction");
> - goto out;
> + return 0;
> }
> journal = transaction->t_journal;
>
> @@ -579,9 +579,9 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
> jh->b_cp_transaction = NULL;
> jbd2_journal_put_journal_head(jh);
>
> - if (transaction->t_checkpoint_list != NULL ||
> - transaction->t_checkpoint_io_list != NULL)
> - goto out;
> + /* Is this transaction empty? */
> + if (transaction->t_checkpoint_list || transaction->t_checkpoint_io_list)
> + return 0;
>
> /*
> * There is one special case to worry about: if we have just pulled the
> @@ -593,10 +593,12 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
> * See the comment at the end of jbd2_journal_commit_transaction().
> */
> if (transaction->t_state != T_FINISHED)
> - goto out;
> + return 0;
>
> - /* OK, that was the last buffer for the transaction: we can now
> - safely remove this transaction from the log */
> + /*
> + * OK, that was the last buffer for the transaction, we can now
> + * safely remove this transaction from the log.
> + */
> stats = &transaction->t_chp_stats;
> if (stats->cs_chp_time)
> stats->cs_chp_time = jbd2_time_diff(stats->cs_chp_time,
> @@ -606,9 +608,7 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
>
> __jbd2_journal_drop_transaction(journal, transaction);
> jbd2_journal_free_transaction(transaction);
> - ret = 1;
> -out:
> - return ret;
> + return 1;
> }
>
> /*
> --
> 2.25.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed 14-04-21 21:47:32, Zhang Yi wrote:
> Althourh we merged <c044f3d8360> ("jbd2: abort journal if free a async
^^ Although ^^ We usually reference commits without <>.
> write error metadata buffer"), there is a race between
> jbd2_journal_try_to_free_buffers() and jbd2_journal_destroy(), so the
> jbd2_log_do_checkpoint() may still missing to detect the buffer write
^^^ fail
> io error flag and will end up leading to filesystem inconsistency.
^^^^^^^^^^^^^^^^^^^ which may lead
> jbd2_journal_try_to_free_buffers() ext4_put_super()
> jbd2_journal_destroy()
> __jbd2_journal_remove_checkpoint()
> detect buffer write error jbd2_log_do_checkpoint()
> jbd2_cleanup_journal_tail()
> <--- lead to inconsistency
> jbd2_journal_abort()
>
> Fix this issue by introduce a new mark j_checkpoint_io_error and set it
^^^ introducing
> in __jbd2_journal_remove_checkpoint() when freeing a checkpoint buffer
> has write_io_error flag. Then jbd2_journal_destroy() will detect this
^ which has
> mark and abort the journal to prevent updating log tail.
>
> Signed-off-by: Zhang Yi <[email protected]>
Thanks! In principle the patch looks good. Some spelling corrections and a
few suggestions for improvement below.
> ---
> fs/jbd2/checkpoint.c | 32 +++++++++++++++++++-------------
> fs/jbd2/journal.c | 9 +++++++++
> include/linux/jbd2.h | 7 +++++++
> 3 files changed, 35 insertions(+), 13 deletions(-)
>
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index bf5511d19ac5..ff3d38ad3a1e 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -221,14 +221,15 @@ int jbd2_log_do_checkpoint(journal_t *journal)
> result = jbd2_cleanup_journal_tail(journal);
> trace_jbd2_checkpoint(journal, result);
> jbd_debug(1, "cleanup_journal_tail returned %d\n", result);
> - if (result <= 0)
> + if (result == 0)
> return result;
> + if (result < 0)
> + goto error;
>
> /*
> * OK, we need to start writing disk blocks. Take one transaction
> * and write it.
> */
> - result = 0;
> spin_lock(&journal->j_list_lock);
> if (!journal->j_checkpoint_transactions)
> goto out;
> @@ -295,8 +296,6 @@ int jbd2_log_do_checkpoint(journal_t *journal)
> goto restart;
> }
> if (!buffer_dirty(bh)) {
> - if (unlikely(buffer_write_io_error(bh)) && !result)
> - result = -EIO;
> BUFFER_TRACE(bh, "remove from checkpoint");
> if (__jbd2_journal_remove_checkpoint(jh))
> /* The transaction was released; we're done */
> @@ -356,9 +355,6 @@ int jbd2_log_do_checkpoint(journal_t *journal)
> spin_lock(&journal->j_list_lock);
> goto restart2;
> }
> - if (unlikely(buffer_write_io_error(bh)) && !result)
> - result = -EIO;
> -
> /*
> * Now in whatever state the buffer currently is, we
> * know that it has been written out and so we can
> @@ -369,12 +365,13 @@ int jbd2_log_do_checkpoint(journal_t *journal)
> }
> out:
> spin_unlock(&journal->j_list_lock);
> - if (result < 0)
> + result = jbd2_cleanup_journal_tail(journal);
> + if (result >= 0)
> + return 0;
> +error:
> + if (!is_journal_aborted(journal))
> jbd2_journal_abort(journal, result);
> - else
> - result = jbd2_cleanup_journal_tail(journal);
> -
> - return (result < 0) ? result : 0;
> + return result;
> }
>
> /*
> @@ -400,7 +397,7 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
> tid_t first_tid;
> unsigned long blocknr;
>
> - if (is_journal_aborted(journal))
> + if (is_journal_aborted(journal) || journal->j_checkpoint_io_error)
> return -EIO;
I think it would be cleaner to have in jbd2_journal_update_sb_log_tail()
a code handling j_checkpoint_io_error like:
if (is_journal_aborted(journal))
return -EIO;
if (journal->j_checkpoint_io_error) {
jbd2_journal_abort(...);
return -EIO;
}
So we are sure we never update log tail once we hit checkpointing error.
That way we can also drop this special case from
jbd2_cleanup_journal_tail() as well as from jbd2_log_do_checkpoint().
>
> if (!jbd2_journal_get_log_tail(journal, &first_tid, &blocknr))
> @@ -564,6 +561,7 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
> struct transaction_chp_stats_s *stats;
> transaction_t *transaction;
> journal_t *journal;
> + struct buffer_head *bh = jh2bh(jh);
>
> JBUFFER_TRACE(jh, "entry");
>
> @@ -575,6 +573,14 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
> journal = transaction->t_journal;
>
> JBUFFER_TRACE(jh, "removing from transaction");
> + /*
> + * If the buffer has been failed to write out to disk, it may probably
> + * lead to filesystem inconsistency after remove it from the log, so
> + * mark the journal checkpoint write back IO error.
> + */
I'd rephrase this comment like:
If we have failed to write the buffer out to disk, the filesystem may
become inconsistent. We cannot abort the journal here since we hold
j_list_lock and we have to careful about races with jbd2_journal_destroy().
So mark the writeback IO error in the journal here and we abort the journal
later from a better context.
> + if (buffer_write_io_error(bh))
> + journal->j_checkpoint_io_error = true;
> +
I agree that using j_flags for this is not great because they require
j_state_lock which we cannot easily take in
__jbd2_journal_remove_checkpoint(). But I think it may be better to create
j_atomic_flags in journal_t, manipulate these with atomic bitops
(set_bit(), test_bit()) and have just a flag J_CHECKPOINT_IO_ERROR there.
That way we can use these flags for other purposes in the future as well
and we won't create new KCSAN warnings...
> __buffer_unlink(jh);
> jh->b_cp_transaction = NULL;
> jbd2_journal_put_journal_head(jh);
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 2dc944442802..6dbeab8b9feb 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -1995,6 +1995,15 @@ int jbd2_journal_destroy(journal_t *journal)
> J_ASSERT(journal->j_checkpoint_transactions == NULL);
> spin_unlock(&journal->j_list_lock);
>
> + /*
> + * OK, all checkpoint transactions have been checked, now check the
> + * write out io error flag and abort the journal if some buffer failed
> + * to write back to the original location, otherwise the filesystem
> + * may probably becomes inconsistency.
^^^^^^^^^^^^^^^^^^^^^^^^^^ may become inconsistent.
> + */
> + if (!is_journal_aborted(journal) && journal->j_checkpoint_io_error)
> + jbd2_journal_abort(journal, -EIO);
> +
> if (journal->j_sb_buffer) {
> if (!is_journal_aborted(journal)) {
> mutex_lock_io(&journal->j_checkpoint_mutex);
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 99d3cd051ac3..53ca70f80ad8 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -841,6 +841,13 @@ struct journal_s
> */
> transaction_t *j_checkpoint_transactions;
>
> + /**
> + * @j_checkpoint_io_error:
> + *
> + * Detect io error while writing back original buffer to disk.
> + */
> + bool j_checkpoint_io_error;
> +
> /**
> * @j_wait_transaction_locked:
> *
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed 14-04-21 21:47:33, Zhang Yi wrote:
> Now, we can make sure to abort the journal once the original buffer was
^^ Now that we can be sure the journal is aborted once a buffer has
failed to be written back to disk, ...
> failed to write back to disk, we can remove the journal abort logic in
> jbd2_journal_try_to_free_buffers() which was introduced in
> <c044f3d8360d> ("jbd2: abort journal if free a async write error
> metadata buffer"), because it may costs and propably not safe.
^^ cost ^^ probably is
> Signed-off-by: Zhang Yi <[email protected]>
Looks good. You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/jbd2/transaction.c | 17 -----------------
> 1 file changed, 17 deletions(-)
>
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 9396666b7314..3e0db4953fe4 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -2118,7 +2118,6 @@ int jbd2_journal_try_to_free_buffers(journal_t *journal, struct page *page)
> {
> struct buffer_head *head;
> struct buffer_head *bh;
> - bool has_write_io_error = false;
> int ret = 0;
>
> J_ASSERT(PageLocked(page));
> @@ -2143,26 +2142,10 @@ int jbd2_journal_try_to_free_buffers(journal_t *journal, struct page *page)
> jbd2_journal_put_journal_head(jh);
> if (buffer_jbd(bh))
> goto busy;
> -
> - /*
> - * If we free a metadata buffer which has been failed to
> - * write out, the jbd2 checkpoint procedure will not detect
> - * this failure and may lead to filesystem inconsistency
> - * after cleanup journal tail.
> - */
> - if (buffer_write_io_error(bh)) {
> - pr_err("JBD2: Error while async write back metadata bh %llu.",
> - (unsigned long long)bh->b_blocknr);
> - has_write_io_error = true;
> - }
> } while ((bh = bh->b_this_page) != head);
>
> ret = try_to_free_buffers(page);
> -
> busy:
> - if (has_write_io_error)
> - jbd2_journal_abort(journal, -EIO);
> -
> return ret;
> }
>
> --
> 2.25.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue 20-04-21 14:08:41, Christoph Hellwig wrote:
> On Fri, Apr 16, 2021 at 04:00:48PM +0800, Zhang Yi wrote:
> > Now, we use already use "if (bdev->bd_super)" to prevent call into
> > ->bdev_try_to_free_page unless the super is alive, and the problem is
> > bd_super becomes NULL concurrently after this check. So, IIUC, I think it's
> > the same to switch to check the superblock is active or not. The acvive
> > flag also could becomes inactive (raced by umount) after we call into
> > bdev_try_to_free_page().
>
> Indeed.
>
> > In order to close this race, One solution is introduce a lock to synchronize
> > the active state between kill_block_super() and blkdev_releasepage(), but
> > the releasing page process have to try to acquire this lock in
> > blkdev_releasepage() for each page, and the umount process still need to wait
> > until the page release if some one invoke into ->bdev_try_to_free_page().
> > I think this solution may affect performace and is not a good way.
> > Think about it in depth, use percpu refcount seems have the smallest
> > performance effect on blkdev_releasepage().
> >
> > If you don't like the refcount, maybe we could add synchronize_rcu_expedited()
> > in ext4_put_super(), it also could prevent this race. Any suggestions?
>
> I really don't like to put a lot of overhead into the core VFS and block
> device code. ext4/jbd does not own the block device inode and really
> has no business controlling releasepage for it. I suspect the right
> answer might be to simply revert the commit that added this hook.
Indeed, after 12 years in kernel .bdev_try_to_free_page is implemented only
by ext4. So maybe it is not that important? I agree with Zhang and
Christoph that getting the lifetime rules sorted out will be hairy and it
is questionable, whether it is worth the additional pages we can reclaim.
Ted, do you remember what was the original motivation for this?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, Apr 21, 2021 at 03:46:34PM +0200, Jan Kara wrote:
>
> Indeed, after 12 years in kernel .bdev_try_to_free_page is implemented only
> by ext4. So maybe it is not that important? I agree with Zhang and
> Christoph that getting the lifetime rules sorted out will be hairy and it
> is questionable, whether it is worth the additional pages we can reclaim.
> Ted, do you remember what was the original motivation for this?
The comment in fs/ext4/super.c is I thought a pretty good explanation:
/*
* Try to release metadata pages (indirect blocks, directories) which are
* mapped via the block device. Since these pages could have journal heads
* which would prevent try_to_free_buffers() from freeing them, we must use
* jbd2 layer's try_to_free_buffers() function to release them.
*/
When we modify a metadata block, we attach a journal_head (jh)
structure to the buffer_head, and bump the ref count to prevent the
buffer from being freed. Before the transaction is committed, the
buffer is marked jbddirty, but the dirty bit is not set until the
transaction commit.
At that back, writeback happens entirely at the discretion of the
buffer cache. The jbd layer doesn't get notification when the I/O is
completed, nor when there is an I/O error. (There was an attempt to
add a callback but that was NACK'ed because of a complaint that it was
jbd specific.)
So we don't actually know when it's safe to detach the jh from the
buffer_head and can drop the refcount so that the buffer_head can be
freed. When the space in the journal starts getting low, we'll look
at at the jh's attached to completed transactions, and see how many of
them have clean bh's, and at that point, we can release the buffer
heads.
The other time when we'll attempt to detach jh's from clean buffers is
via bdev_try_to_free_buffers(). So if we drop the
bdev_try_to_free_page hook, then when we are under memory pressure,
there could be potentially a large percentage of the buffer cache
which can't be freed, and so the OOM-killer might trigger more often.
Now, if we could get a callback on I/O completion on a per-bh basis,
then we could detach the jh when the buffer is clean --- and as a
bonus, we'd get a notification when there was an I/O error writing
back a metadata block, which would be even better.
So how about an even swap? If we can get a buffer I/O completion
callback, we can drop bdev_to_free_swap hook.....
- Ted
On Wed 21-04-21 12:57:39, Theodore Ts'o wrote:
> On Wed, Apr 21, 2021 at 03:46:34PM +0200, Jan Kara wrote:
> >
> > Indeed, after 12 years in kernel .bdev_try_to_free_page is implemented only
> > by ext4. So maybe it is not that important? I agree with Zhang and
> > Christoph that getting the lifetime rules sorted out will be hairy and it
> > is questionable, whether it is worth the additional pages we can reclaim.
> > Ted, do you remember what was the original motivation for this?
>
> The comment in fs/ext4/super.c is I thought a pretty good explanation:
>
> /*
> * Try to release metadata pages (indirect blocks, directories) which are
> * mapped via the block device. Since these pages could have journal heads
> * which would prevent try_to_free_buffers() from freeing them, we must use
> * jbd2 layer's try_to_free_buffers() function to release them.
> */
>
> When we modify a metadata block, we attach a journal_head (jh)
> structure to the buffer_head, and bump the ref count to prevent the
> buffer from being freed. Before the transaction is committed, the
> buffer is marked jbddirty, but the dirty bit is not set until the
> transaction commit.
>
> At that back, writeback happens entirely at the discretion of the
> buffer cache. The jbd layer doesn't get notification when the I/O is
> completed, nor when there is an I/O error. (There was an attempt to
> add a callback but that was NACK'ed because of a complaint that it was
> jbd specific.)
>
> So we don't actually know when it's safe to detach the jh from the
> buffer_head and can drop the refcount so that the buffer_head can be
> freed. When the space in the journal starts getting low, we'll look
> at at the jh's attached to completed transactions, and see how many of
> them have clean bh's, and at that point, we can release the buffer
> heads.
>
> The other time when we'll attempt to detach jh's from clean buffers is
> via bdev_try_to_free_buffers(). So if we drop the
> bdev_try_to_free_page hook, then when we are under memory pressure,
> there could be potentially a large percentage of the buffer cache
> which can't be freed, and so the OOM-killer might trigger more often.
Yes, I understand that. What I was more asking about is: Does it really
matter we leave those buffer heads and journal heads unreclaimed. I
understand it could be triggering premature OOM in theory but is it a
problem in practice? Was there some observed practical case for which this
was added or was it just added due to the theoretical concern?
> Now, if we could get a callback on I/O completion on a per-bh basis,
> then we could detach the jh when the buffer is clean --- and as a
> bonus, we'd get a notification when there was an I/O error writing
> back a metadata block, which would be even better.
>
> So how about an even swap? If we can get a buffer I/O completion
> callback, we can drop bdev_to_free_swap hook.....
I'm OK with that because mainly for IO error reporting it makes sense to
me. For this memory reclaim problem I think we have also other reasonably
sensible options. E.g. we could have a shrinker that would just walk the
checkpoint list and reclaim journal heads for whatever is already written
out... Or we could just release journal heads already after commit and
during checkpoint we'd fetch the list of blocks that may need to be written
out e.g. from journal descriptor blocks. This would be a larger overhaul
but as a bonus we'd get rid of probably the last place in the kernel which
can write out page contents through buffer heads without updating page
state properly (and thus get rid of some workarounds in mm code as well).
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On 2021/4/22 17:04, Jan Kara wrote:
> On Wed 21-04-21 12:57:39, Theodore Ts'o wrote:
>> On Wed, Apr 21, 2021 at 03:46:34PM +0200, Jan Kara wrote:
>>>
>>> Indeed, after 12 years in kernel .bdev_try_to_free_page is implemented only
>>> by ext4. So maybe it is not that important? I agree with Zhang and
>>> Christoph that getting the lifetime rules sorted out will be hairy and it
>>> is questionable, whether it is worth the additional pages we can reclaim.
>>> Ted, do you remember what was the original motivation for this?
>>
>> The comment in fs/ext4/super.c is I thought a pretty good explanation:
>>
>> /*
>> * Try to release metadata pages (indirect blocks, directories) which are
>> * mapped via the block device. Since these pages could have journal heads
>> * which would prevent try_to_free_buffers() from freeing them, we must use
>> * jbd2 layer's try_to_free_buffers() function to release them.
>> */
>>
>> When we modify a metadata block, we attach a journal_head (jh)
>> structure to the buffer_head, and bump the ref count to prevent the
>> buffer from being freed. Before the transaction is committed, the
>> buffer is marked jbddirty, but the dirty bit is not set until the
>> transaction commit.
>>
>> At that back, writeback happens entirely at the discretion of the
>> buffer cache. The jbd layer doesn't get notification when the I/O is
>> completed, nor when there is an I/O error. (There was an attempt to
>> add a callback but that was NACK'ed because of a complaint that it was
>> jbd specific.)
>>
>> So we don't actually know when it's safe to detach the jh from the
>> buffer_head and can drop the refcount so that the buffer_head can be
>> freed. When the space in the journal starts getting low, we'll look
>> at at the jh's attached to completed transactions, and see how many of
>> them have clean bh's, and at that point, we can release the buffer
>> heads.
>>
>> The other time when we'll attempt to detach jh's from clean buffers is
>> via bdev_try_to_free_buffers(). So if we drop the
>> bdev_try_to_free_page hook, then when we are under memory pressure,
>> there could be potentially a large percentage of the buffer cache
>> which can't be freed, and so the OOM-killer might trigger more often.
>
> Yes, I understand that. What I was more asking about is: Does it really
> matter we leave those buffer heads and journal heads unreclaimed. I
> understand it could be triggering premature OOM in theory but is it a
> problem in practice? Was there some observed practical case for which this
> was added or was it just added due to the theoretical concern?
>
>> Now, if we could get a callback on I/O completion on a per-bh basis,
>> then we could detach the jh when the buffer is clean --- and as a
>> bonus, we'd get a notification when there was an I/O error writing
>> back a metadata block, which would be even better.
>>
>> So how about an even swap? If we can get a buffer I/O completion
>> callback, we can drop bdev_to_free_swap hook.....
>
> I'm OK with that because mainly for IO error reporting it makes sense to
> me. For this memory reclaim problem I think we have also other reasonably
> sensible options. E.g. we could have a shrinker that would just walk the
> checkpoint list and reclaim journal heads for whatever is already written
> out... Or we could just release journal heads already after commit and
> during checkpoint we'd fetch the list of blocks that may need to be written
> out e.g. from journal descriptor blocks. This would be a larger overhaul
> but as a bonus we'd get rid of probably the last place in the kernel which
> can write out page contents through buffer heads without updating page
> state properly (and thus get rid of some workarounds in mm code as well).
Thanks for these suggestions, I get your first solution and sounds good, but
I do not understand your last sentence, how does ext4 not updating page state
properly? Could you explain it more clearly?
Thanks,
Yi.
On Thu, Apr 22, 2021 at 11:04:11AM +0200, Jan Kara wrote:
> Yes, I understand that. What I was more asking about is: Does it really
> matter we leave those buffer heads and journal heads unreclaimed. I
> understand it could be triggering premature OOM in theory but is it a
> problem in practice? Was there some observed practical case for which this
> was added or was it just added due to the theoretical concern?
I was doing some research, and found the mail thread which inspired
bdev_try_to_free_page():
https://lore.kernel.org/linux-ext4/[email protected]/
From what I can tell Toshi Okajima did have a test workload which
would trigger blkdev_releasepage(). He didn't specify it in the mail
thread as near as I can tell, but he did use it to test the page.
Thinking about it, it shouldn't be hard to trigger it via something like:
find /mnt -print0 | xargs -0 touch
in a memory contrained box with a large file system attached (a
bookshelf NAS scenario). Under the right circumstances, I'm pretty
sure a premature OOM could be demonstrated.
- Ted
On Fri 23-04-21 19:39:09, Zhang Yi wrote:
> On 2021/4/22 17:04, Jan Kara wrote:
> > I'm OK with that because mainly for IO error reporting it makes sense to
> > me. For this memory reclaim problem I think we have also other reasonably
> > sensible options. E.g. we could have a shrinker that would just walk the
> > checkpoint list and reclaim journal heads for whatever is already written
> > out... Or we could just release journal heads already after commit and
> > during checkpoint we'd fetch the list of blocks that may need to be written
> > out e.g. from journal descriptor blocks. This would be a larger overhaul
> > but as a bonus we'd get rid of probably the last place in the kernel which
> > can write out page contents through buffer heads without updating page
> > state properly (and thus get rid of some workarounds in mm code as well).
>
> Thanks for these suggestions, I get your first solution and sounds good, but
> I do not understand your last sentence, how does ext4 not updating page state
> properly? Could you explain it more clearly?
The problem with current checkpointing code is that it writes out dirty
buffer heads through submit_bh() function without updating page dirty state
or without setting PageWriteback bit (because we cannot easily grab page
lock in those places due to lock ordering). Thus we can end up with a page
that is dirty but in fact does not hold any dirty data (none of the buffer
heads is dirty) and also locking a page and checking for PageWriteback
isn't enough to be sure page is not under IO. This is ugly and requires
some workarounds in MM code...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR