We already allocate delalloc blocks before changing the inode mode into
"per-file data journal" mode to prevent delalloc blocks from remaining
not allocated, but another issue concerned with "BH_Unwritten" status
still exists. For example, by fallocate(), several buffers' status
change into "BH_Unwritten", but these buffers cannot be processed by
ext4_alloc_da_blocks(). So, they still remain in unwritten status after
per-file data journaling is enabled and they cannot be changed into
written status any more and, if they are journaled and eventually
checkpointed, these unwritten buffer will cause a kernel panic by the
below BUG_ON() function of submit_bh_wbc() when they are submitted
during checkpointing.
static int submit_bh_wbc(int rw, struct buffer_head *bh,...
{
...
BUG_ON(buffer_unwritten(bh));
Moreover, when "dioread_nolock" option is enabled, the status of a
buffer is changed into "BH_Unwritten" after write_begin() completes and
the "BH_Unwritten" status will be cleared after I/O is done. Therefore,
if a buffer's status is changed into unwrutten but the buffer's I/O is
not submitted and completed, it can cause the same problem after
enabling per-file data journaling. You can easily generate this bug by
executing the following command.
./kvm-xfstests -C 10000 -m nodelalloc,dioread_nolock generic/269
To resolve these problems and define a boundary between the previous
mode and per-file data journaling mode, we need to flush and wait all
the I/O of buffers of a file before enabling per-file data journaling
of the file.
Signed-off-by: Daeho Jeong <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 31 ++++++++++++++++++++-----------
1 file changed, 20 insertions(+), 11 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index aee960b..71fab4c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5382,22 +5382,29 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
return 0;
if (is_journal_aborted(journal))
return -EROFS;
- /* We have to allocate physical blocks for delalloc blocks
- * before flushing journal. otherwise delalloc blocks can not
- * be allocated any more. even more truncate on delalloc blocks
- * could trigger BUG by flushing delalloc blocks in journal.
- * There is no delalloc block in non-journal data mode.
- */
- if (val && test_opt(inode->i_sb, DELALLOC)) {
- err = ext4_alloc_da_blocks(inode);
- if (err < 0)
- return err;
- }
/* Wait for all existing dio workers */
ext4_inode_block_unlocked_dio(inode);
inode_dio_wait(inode);
+ /*
+ * Before flushing the journal and switching inode's aops, we have
+ * to flush all dirty data the inode has. There can be outstanding
+ * delayed allocations, there can be unwritten extents created by
+ * fallocate or buffered writes in dioread_nolock mode covered by
+ * dirty data which can be converted only after flushing the dirty
+ * data (and journalled aops don't know how to handle these cases).
+ */
+ if (val) {
+ down_write(&EXT4_I(inode)->i_mmap_sem);
+ err = filemap_write_and_wait(inode->i_mapping);
+ if (err < 0) {
+ up_write(&EXT4_I(inode)->i_mmap_sem);
+ ext4_inode_resume_unlocked_dio(inode);
+ return err;
+ }
+ }
+
jbd2_journal_lock_updates(journal);
/*
@@ -5422,6 +5429,8 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
ext4_set_aops(inode);
jbd2_journal_unlock_updates(journal);
+ if (val)
+ up_write(&EXT4_I(inode)->i_mmap_sem);
ext4_inode_resume_unlocked_dio(inode);
/* Finally we can mark the inode as dirty. */
--
1.7.9.5
Now, in ext4, there is a race condition between changing inode journal
mode and ext4_writepages(). While ext4_writepages() is executed on
a non-journalled mode inode, the inode's journal mode could be enabled
by ioctl() and then, some pages dirtied after switching the journal
mode will be still exposed to ext4_writepages() in non-journaled mode.
To resolve this problem, we use fs-wide per-cpu rw semaphore by
Jan Kara's suggestion because we don't want to waste ext4_inode_info's
space for this extra rare case.
Signed-off-by: Daeho Jeong <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 4 ++++
fs/ext4/inode.c | 15 ++++++++++++---
fs/ext4/super.c | 4 ++++
kernel/locking/percpu-rwsem.c | 1 +
4 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 157b458..c757a3d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -33,6 +33,7 @@
#include <linux/ratelimit.h>
#include <crypto/hash.h>
#include <linux/falloc.h>
+#include <linux/percpu-rwsem.h>
#ifdef __KERNEL__
#include <linux/compat.h>
#endif
@@ -1475,6 +1476,9 @@ struct ext4_sb_info {
struct ratelimit_state s_err_ratelimit_state;
struct ratelimit_state s_warning_ratelimit_state;
struct ratelimit_state s_msg_ratelimit_state;
+
+ /* Barrier between changing inodes' journal flags and writepages ops. */
+ struct percpu_rw_semaphore s_journal_flag_rwsem;
};
static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 71fab4c..4f45f24 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2476,11 +2476,14 @@ static int ext4_writepages(struct address_space *mapping,
struct blk_plug plug;
bool give_up_on_write = false;
+ percpu_down_read(&sbi->s_journal_flag_rwsem);
trace_ext4_writepages(inode, wbc);
- if (dax_mapping(mapping))
- return dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev,
- wbc);
+ if (dax_mapping(mapping)) {
+ ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev,
+ wbc);
+ goto out_writepages;
+ }
/*
* No pages to write? This is mainly a kludge to avoid starting
@@ -2650,6 +2653,7 @@ retry:
out_writepages:
trace_ext4_writepages_result(inode, wbc, ret,
nr_to_write - wbc->nr_to_write);
+ percpu_up_read(&sbi->s_journal_flag_rwsem);
return ret;
}
@@ -5366,6 +5370,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
journal_t *journal;
handle_t *handle;
int err;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
/*
* We have to be very careful here: changing a data block's
@@ -5405,6 +5410,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
}
}
+ percpu_down_write(&sbi->s_journal_flag_rwsem);
jbd2_journal_lock_updates(journal);
/*
@@ -5421,6 +5427,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
err = jbd2_journal_flush(journal);
if (err < 0) {
jbd2_journal_unlock_updates(journal);
+ percpu_up_write(&sbi->s_journal_flag_rwsem);
ext4_inode_resume_unlocked_dio(inode);
return err;
}
@@ -5429,6 +5436,8 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
ext4_set_aops(inode);
jbd2_journal_unlock_updates(journal);
+ percpu_up_write(&sbi->s_journal_flag_rwsem);
+
if (val)
up_write(&EXT4_I(inode)->i_mmap_sem);
ext4_inode_resume_unlocked_dio(inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3ed01ec..a12950d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -861,6 +861,7 @@ static void ext4_put_super(struct super_block *sb)
percpu_counter_destroy(&sbi->s_freeinodes_counter);
percpu_counter_destroy(&sbi->s_dirs_counter);
percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
+ percpu_free_rwsem(&sbi->s_journal_flag_rwsem);
brelse(sbi->s_sbh);
#ifdef CONFIG_QUOTA
for (i = 0; i < EXT4_MAXQUOTAS; i++)
@@ -3926,6 +3927,9 @@ no_journal:
if (!err)
err = percpu_counter_init(&sbi->s_dirtyclusters_counter, 0,
GFP_KERNEL);
+ if (!err)
+ err = percpu_init_rwsem(&sbi->s_journal_flag_rwsem);
+
if (err) {
ext4_msg(sb, KERN_ERR, "insufficient memory");
goto failed_mount6;
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index f231e0b..bec0b64 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -37,6 +37,7 @@ void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
free_percpu(brw->fast_read_ctr);
brw->fast_read_ctr = NULL; /* catch use after free bugs */
}
+EXPORT_SYMBOL_GPL(percpu_free_rwsem);
/*
* This is the fast-path for down_read/up_read. If it succeeds we rely
--
1.7.9.5
Several problems occurred when per-file data journaling is enabled on
"delalloc" mode, so the per-file data journaling was permanently
deactivated by commit 3d2b15826282 ("ext4: ignore
EXT4_INODE_JOURNAL_DATA flag with delalloc"). But, those are not
problems for only "delalloc" mode and when you execute xfstest on
"nodelalloc" mode, same problems happen on "nodelalloc" mode. We always
execute xfstest on "delalloc" mode, which is default mode, so we
haven't realized problems of per-file data journaling feature. Finally,
problems of per-file data journaling feature were fixed by
commit 9c02ac97989d ("ext4: fix xfstest generic/269 double revoked
buffer bug with bigalloc") and previous patchset. Now, we can re-enable
the feature on "delalloc" mode.
Signed-off-by: Daeho Jeong <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext4/ext4_jbd2.h | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 5f58462..9463bb8 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -393,12 +393,10 @@ static inline int ext4_inode_journal_mode(struct inode *inode)
{
if (EXT4_JOURNAL(inode) == NULL)
return EXT4_INODE_WRITEBACK_DATA_MODE; /* writeback */
- /* We do not support data journalling with delayed allocation */
if (!S_ISREG(inode->i_mode) ||
test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
return EXT4_INODE_JOURNAL_DATA_MODE; /* journal data */
- if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA) &&
- !test_opt(inode->i_sb, DELALLOC))
+ if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
return EXT4_INODE_JOURNAL_DATA_MODE; /* journal data */
if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA)
return EXT4_INODE_ORDERED_DATA_MODE; /* ordered */
--
1.7.9.5
On Mon 14-03-16 11:34:59, Daeho Jeong wrote:
> Now, in ext4, there is a race condition between changing inode journal
> mode and ext4_writepages(). While ext4_writepages() is executed on
> a non-journalled mode inode, the inode's journal mode could be enabled
> by ioctl() and then, some pages dirtied after switching the journal
> mode will be still exposed to ext4_writepages() in non-journaled mode.
> To resolve this problem, we use fs-wide per-cpu rw semaphore by
> Jan Kara's suggestion because we don't want to waste ext4_inode_info's
> space for this extra rare case.
>
> Signed-off-by: Daeho Jeong <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>
Yeah, this patch is fine now. Thanks!
Honza
> ---
> fs/ext4/ext4.h | 4 ++++
> fs/ext4/inode.c | 15 ++++++++++++---
> fs/ext4/super.c | 4 ++++
> kernel/locking/percpu-rwsem.c | 1 +
> 4 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 157b458..c757a3d 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -33,6 +33,7 @@
> #include <linux/ratelimit.h>
> #include <crypto/hash.h>
> #include <linux/falloc.h>
> +#include <linux/percpu-rwsem.h>
> #ifdef __KERNEL__
> #include <linux/compat.h>
> #endif
> @@ -1475,6 +1476,9 @@ struct ext4_sb_info {
> struct ratelimit_state s_err_ratelimit_state;
> struct ratelimit_state s_warning_ratelimit_state;
> struct ratelimit_state s_msg_ratelimit_state;
> +
> + /* Barrier between changing inodes' journal flags and writepages ops. */
> + struct percpu_rw_semaphore s_journal_flag_rwsem;
> };
>
> static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 71fab4c..4f45f24 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2476,11 +2476,14 @@ static int ext4_writepages(struct address_space *mapping,
> struct blk_plug plug;
> bool give_up_on_write = false;
>
> + percpu_down_read(&sbi->s_journal_flag_rwsem);
> trace_ext4_writepages(inode, wbc);
>
> - if (dax_mapping(mapping))
> - return dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev,
> - wbc);
> + if (dax_mapping(mapping)) {
> + ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev,
> + wbc);
> + goto out_writepages;
> + }
>
> /*
> * No pages to write? This is mainly a kludge to avoid starting
> @@ -2650,6 +2653,7 @@ retry:
> out_writepages:
> trace_ext4_writepages_result(inode, wbc, ret,
> nr_to_write - wbc->nr_to_write);
> + percpu_up_read(&sbi->s_journal_flag_rwsem);
> return ret;
> }
>
> @@ -5366,6 +5370,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
> journal_t *journal;
> handle_t *handle;
> int err;
> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>
> /*
> * We have to be very careful here: changing a data block's
> @@ -5405,6 +5410,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
> }
> }
>
> + percpu_down_write(&sbi->s_journal_flag_rwsem);
> jbd2_journal_lock_updates(journal);
>
> /*
> @@ -5421,6 +5427,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
> err = jbd2_journal_flush(journal);
> if (err < 0) {
> jbd2_journal_unlock_updates(journal);
> + percpu_up_write(&sbi->s_journal_flag_rwsem);
> ext4_inode_resume_unlocked_dio(inode);
> return err;
> }
> @@ -5429,6 +5436,8 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
> ext4_set_aops(inode);
>
> jbd2_journal_unlock_updates(journal);
> + percpu_up_write(&sbi->s_journal_flag_rwsem);
> +
> if (val)
> up_write(&EXT4_I(inode)->i_mmap_sem);
> ext4_inode_resume_unlocked_dio(inode);
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 3ed01ec..a12950d 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -861,6 +861,7 @@ static void ext4_put_super(struct super_block *sb)
> percpu_counter_destroy(&sbi->s_freeinodes_counter);
> percpu_counter_destroy(&sbi->s_dirs_counter);
> percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
> + percpu_free_rwsem(&sbi->s_journal_flag_rwsem);
> brelse(sbi->s_sbh);
> #ifdef CONFIG_QUOTA
> for (i = 0; i < EXT4_MAXQUOTAS; i++)
> @@ -3926,6 +3927,9 @@ no_journal:
> if (!err)
> err = percpu_counter_init(&sbi->s_dirtyclusters_counter, 0,
> GFP_KERNEL);
> + if (!err)
> + err = percpu_init_rwsem(&sbi->s_journal_flag_rwsem);
> +
> if (err) {
> ext4_msg(sb, KERN_ERR, "insufficient memory");
> goto failed_mount6;
> diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
> index f231e0b..bec0b64 100644
> --- a/kernel/locking/percpu-rwsem.c
> +++ b/kernel/locking/percpu-rwsem.c
> @@ -37,6 +37,7 @@ void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> free_percpu(brw->fast_read_ctr);
> brw->fast_read_ctr = NULL; /* catch use after free bugs */
> }
> +EXPORT_SYMBOL_GPL(percpu_free_rwsem);
>
> /*
> * This is the fast-path for down_read/up_read. If it succeeds we rely
> --
> 1.7.9.5
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon, Mar 14, 2016 at 11:34:59AM +0900, Daeho Jeong wrote:
> Now, in ext4, there is a race condition between changing inode journal
> mode and ext4_writepages(). While ext4_writepages() is executed on
> a non-journalled mode inode, the inode's journal mode could be enabled
> by ioctl() and then, some pages dirtied after switching the journal
> mode will be still exposed to ext4_writepages() in non-journaled mode.
> To resolve this problem, we use fs-wide per-cpu rw semaphore by
> Jan Kara's suggestion because we don't want to waste ext4_inode_info's
> space for this extra rare case.
>
> Signed-off-by: Daeho Jeong <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>
Hi Daeho,
This patch is causing a regression for xfstests generic/231 when using
a 1k block size (this case tests what happens when the blocksize <
page size, such as if you are using a 4k blocksize on a Power PC
system).
If you have kvm-xfstests or gce-xfstests[1] installed you can
reproduce it like this:
gce-xfstests -c 1k -C 10 generic/231
(Or use kvm-xfstests if you don't want to get a GCE account).
[1] http://thunk.org/gce-xfstests
You will get this:
BEGIN TEST 1k: Ext4 1k block Wed Apr 27 00:48:01 EDT 2016
DEVICE: /dev/mapper/xt-vdd
MK2FS OPTIONS: -q -b 1024
MOUNT OPTIONS: -o block_validity
FSTYP -- ext4
PLATFORM -- Linux/x86_64 xfstests-201604270046 4.6.0-rc4-ext4-00010-geed525e
MKFS_OPTIONS -- -q -b 1024 /dev/mapper/xt-vdc
MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity /dev/mapper/xt-vdc /xt-vdc
generic/231 64s ... [00:48:03] [00:49:44] [failed, exit status 1] - output mismatch (see /results/results-1k/generic/231.out.bad)
--- tests/generic/231.out 2016-04-27 00:15:36.000000000 -0400
+++ /results/results-1k/generic/231.out.bad 2016-04-27 00:49:44.701895666 -0400
@@ -4,13 +4,10041 @@
Comparing user usage
Comparing group usage
=== FSX Standard Mode, Memory Mapping, 4 Tasks ===
-All 20000 operations completed A-OK!
-All 20000 operations completed A-OK!
-All 20000 operations completed A-OK!
-All 20000 operations completed A-OK!
...
(Run 'diff -u tests/generic/231.out /results/results-1k/generic/231.out.bad' to see the entire diff)
Ran: generic/231
Failures: generic/231
Failed 1 of 1 tests
I'm going to drop patch 2 and 3 of your patch series from the ext4
tree for now. Could you take a closer look?
Thanks,
- Ted
Oops, sorry about causing the regression.
But in the version I worked on, we passed the test case and
we still pass that. I executed the test case as belows.
kvm-xfstests -c 1k -C 10 generic/231
I am going to retry the test using GCE with the latest kernel version.
Sorry again.
On Wed, Apr 27, 2016 at 08:04:10AM +0000, Daeho Jeong wrote:
> Oops, sorry about causing the regression.
>
> But in the version I worked on, we passed the test case and
> we still pass that. I executed the test case as belows.
>
> kvm-xfstests -c 1k -C 10 generic/231
>
> I am going to retry the test using GCE with the latest kernel version.
Hi, sorry, it looks like I jumped to conclusions too hastily.
I let the tests finish last night and generic/231 is failing around
10-20% of the time prior to your patch series. I just didn't notice it
at first because the failure happened at the very end:
xfstests results 201604270033 - 4.6.0-rc4-ext4
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Failures: generic/231
Failures: generic/231
With your patches, it remains at about the same level, but when the
tests failed caused me to assume they were the caused by your patches.
xfstests results 201604270044 - 4.6.0-rc4-ext4-00007-g76bee81 (Just before your patches)
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Failures: generic/231
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Failures: generic/231
Passed all 1 tests
xfstests results 201604270054 - 4.6.0-rc4-ext4-00008-g31cedaa (Your 1/3 patch)
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Failures: generic/231
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
xfstests results 201604270052 - 4.6.0-rc4-ext4-00009-g46ceec9 (Your 2/3 patch)
Failures: generic/231 <=== this is why I thought this commit was the regression
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Failures: generic/231
xfstests results 201604270046 - 4.6.0-rc4-ext4-00010-geed525e (Your 3/3 patch)
Failures: generic/231
Failures: generic/231
Failures: generic/231
Failures: generic/231
Failures: generic/231
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Passed all 1 tests
Failures: generic/231
So enabling per-file data journalling and delalloc is definitely makes
the problem much more visible, but it looks like it was always there.
Apologies for the initial misdiagnosis....
- Ted
Hi Ted,
I fully understood your words and I have been executing generic/231 test case
again and again with linux-4.6.0-rc4, linux-4.6.0-rc5 and the latest source codes
of ext4 git server.
But, whenever I execute the test case, it is always passed.
I think the difference of the test results is originated from the difference between
your test environment and mine, however I cannot figure out that.
Actually, I just used kvm-xftests because GCE site cannot be accessed in
our company by security reasons.
Could you suggest some directions for me?