2016-02-28 23:48:47

by Daeho Jeong

[permalink] [raw]
Subject: [PATCH v2 1/3] ext4: handle unwritten or delalloc buffers before enabling per-file data journaling

We already allocate delalloc blocks before changing the inode mode into
"per-file data journal" mode to prevent delalloc blocks from remaining
not allocated, but another issue concerned with "BH_Unwritten" status
still exists. For example, by fallocate(), several buffers' status
change into "BH_Unwritten", but these buffers cannot be processed by
ext4_alloc_da_blocks(). So, they still remain in unwritten status after
per-file data journaling is enabled and they cannot be changed into
written status any more and, if they are journaled and eventually
checkpointed, these unwritten buffer will cause a kernel panic by the
below BUG_ON() function of submit_bh_wbc() when they are submitted
during checkpointing.

static int submit_bh_wbc(int rw, struct buffer_head *bh,...
{
...
BUG_ON(buffer_unwritten(bh));

Moreover, when "dioread_nolock" option is enabled, the status of a
buffer is changed into "BH_Unwritten" after write_begin() completes and
the "BH_Unwritten" status will be cleared after I/O is done. Therefore,
if a buffer's status is changed into unwrutten but the buffer's I/O is
not submitted and completed, it can cause the same problem after
enabling per-file data journaling. You can easily generate this bug by
executing the following command.

./kvm-xfstests -C 10000 -m nodelalloc,dioread_nolock generic/269

To resolve these problems and define a boundary between the previous
mode and per-file data journaling mode, we need to flush and wait all
the I/O of buffers of a file before enabling per-file data journaling
of the file.

Signed-off-by: Daeho Jeong <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 31 ++++++++++++++++++++-----------
1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9cc57c3..9ecfb76 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5378,22 +5378,29 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
return 0;
if (is_journal_aborted(journal))
return -EROFS;
- /* We have to allocate physical blocks for delalloc blocks
- * before flushing journal. otherwise delalloc blocks can not
- * be allocated any more. even more truncate on delalloc blocks
- * could trigger BUG by flushing delalloc blocks in journal.
- * There is no delalloc block in non-journal data mode.
- */
- if (val && test_opt(inode->i_sb, DELALLOC)) {
- err = ext4_alloc_da_blocks(inode);
- if (err < 0)
- return err;
- }

/* Wait for all existing dio workers */
ext4_inode_block_unlocked_dio(inode);
inode_dio_wait(inode);

+ /*
+ * Before flushing the journal and switching inode's aops, we have
+ * to flush all dirty data the inode has. There can be outstanding
+ * delayed allocations, there can be unwritten extents created by
+ * fallocate or buffered writes in dioread_nolock mode covered by
+ * dirty data which can be converted only after flushing the dirty
+ * data (and journalled aops don't know how to handle these cases).
+ */
+ if (val) {
+ down_write(&EXT4_I(inode)->i_mmap_sem);
+ err = filemap_write_and_wait(inode->i_mapping);
+ if (err < 0) {
+ up_write(&EXT4_I(inode)->i_mmap_sem);
+ ext4_inode_resume_unlocked_dio(inode);
+ return err;
+ }
+ }
+
jbd2_journal_lock_updates(journal);

/*
@@ -5418,6 +5425,8 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
ext4_set_aops(inode);

jbd2_journal_unlock_updates(journal);
+ if (val)
+ up_write(&EXT4_I(inode)->i_mmap_sem);
ext4_inode_resume_unlocked_dio(inode);

/* Finally we can mark the inode as dirty. */
--
1.7.9.5



2016-02-28 23:48:53

by Daeho Jeong

[permalink] [raw]
Subject: [PATCH v2 3/3] ext4: enable again per-file data journaling on delalloc mode

Several problems occurred when per-file data journaling is enabled on
"delalloc" mode, so the per-file data journaling was permanently
deactivated by commit 3d2b15826282 ("ext4: ignore
EXT4_INODE_JOURNAL_DATA flag with delalloc"). But, those are not
problems for only "delalloc" mode and when you execute xfstest on
"nodelalloc" mode, same problems happen on "nodelalloc" mode. We always
execute xfstest on "delalloc" mode, which is default mode, so we
haven't realized problems of per-file data journaling feature. Finally,
problems of per-file data journaling feature were fixed by
commit 9c02ac97989d ("ext4: fix xfstest generic/269 double revoked
buffer bug with bigalloc") and previous patchset. Now, we can re-enable
the feature on "delalloc" mode.

Signed-off-by: Daeho Jeong <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext4/ext4_jbd2.h | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 5f58462..9463bb8 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -393,12 +393,10 @@ static inline int ext4_inode_journal_mode(struct inode *inode)
{
if (EXT4_JOURNAL(inode) == NULL)
return EXT4_INODE_WRITEBACK_DATA_MODE; /* writeback */
- /* We do not support data journalling with delayed allocation */
if (!S_ISREG(inode->i_mode) ||
test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
return EXT4_INODE_JOURNAL_DATA_MODE; /* journal data */
- if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA) &&
- !test_opt(inode->i_sb, DELALLOC))
+ if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
return EXT4_INODE_JOURNAL_DATA_MODE; /* journal data */
if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA)
return EXT4_INODE_ORDERED_DATA_MODE; /* ordered */
--
1.7.9.5


2016-02-28 23:48:51

by Daeho Jeong

[permalink] [raw]
Subject: [PATCH v2 2/3] ext4: fix races between changing inode journal mode and ext4_writepages

Now, in ext4, there is a race condition between changing inode journal
mode and ext4_writepages(). While ext4_writepages() is executed on
a non-journalled mode inode, the inode's journal mode could be enabled
by ioctl() and then, some pages dirtied after switching the journal
mode will be still exposed to ext4_writepages() in non-journaled mode.
To resolve this problem, we use fs-wide per-cpu rw semaphore by
Jan Kara's suggestion because we don't want to waste ext4_inode_info's
space for this extra rare case.

Signed-off-by: Daeho Jeong <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 4 ++++
fs/ext4/inode.c | 7 +++++++
fs/ext4/super.c | 4 ++++
3 files changed, 15 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 157b458..c757a3d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -33,6 +33,7 @@
#include <linux/ratelimit.h>
#include <crypto/hash.h>
#include <linux/falloc.h>
+#include <linux/percpu-rwsem.h>
#ifdef __KERNEL__
#include <linux/compat.h>
#endif
@@ -1475,6 +1476,9 @@ struct ext4_sb_info {
struct ratelimit_state s_err_ratelimit_state;
struct ratelimit_state s_warning_ratelimit_state;
struct ratelimit_state s_msg_ratelimit_state;
+
+ /* Barrier between changing inodes' journal flags and writepages ops. */
+ struct percpu_rw_semaphore s_journal_flag_rwsem;
};

static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9ecfb76..1176142 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2476,6 +2476,7 @@ static int ext4_writepages(struct address_space *mapping,
struct blk_plug plug;
bool give_up_on_write = false;

+ percpu_down_read(&sbi->s_journal_flag_rwsem);
trace_ext4_writepages(inode, wbc);

/*
@@ -2646,6 +2647,7 @@ retry:
out_writepages:
trace_ext4_writepages_result(inode, wbc, ret,
nr_to_write - wbc->nr_to_write);
+ percpu_up_read(&sbi->s_journal_flag_rwsem);
return ret;
}

@@ -5362,6 +5364,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
journal_t *journal;
handle_t *handle;
int err;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);

/*
* We have to be very careful here: changing a data block's
@@ -5401,6 +5404,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
}
}

+ percpu_down_write(&sbi->s_journal_flag_rwsem);
jbd2_journal_lock_updates(journal);

/*
@@ -5417,6 +5421,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
err = jbd2_journal_flush(journal);
if (err < 0) {
jbd2_journal_unlock_updates(journal);
+ percpu_up_write(&sbi->s_journal_flag_rwsem);
ext4_inode_resume_unlocked_dio(inode);
return err;
}
@@ -5425,6 +5430,8 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
ext4_set_aops(inode);

jbd2_journal_unlock_updates(journal);
+ percpu_up_write(&sbi->s_journal_flag_rwsem);
+
if (val)
up_write(&EXT4_I(inode)->i_mmap_sem);
ext4_inode_resume_unlocked_dio(inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3ed01ec..a12950d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -861,6 +861,7 @@ static void ext4_put_super(struct super_block *sb)
percpu_counter_destroy(&sbi->s_freeinodes_counter);
percpu_counter_destroy(&sbi->s_dirs_counter);
percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
+ percpu_free_rwsem(&sbi->s_journal_flag_rwsem);
brelse(sbi->s_sbh);
#ifdef CONFIG_QUOTA
for (i = 0; i < EXT4_MAXQUOTAS; i++)
@@ -3926,6 +3927,9 @@ no_journal:
if (!err)
err = percpu_counter_init(&sbi->s_dirtyclusters_counter, 0,
GFP_KERNEL);
+ if (!err)
+ err = percpu_init_rwsem(&sbi->s_journal_flag_rwsem);
+
if (err) {
ext4_msg(sb, KERN_ERR, "insufficient memory");
goto failed_mount6;
--
1.7.9.5


2016-02-29 01:29:51

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] ext4: fix races between changing inode journal mode and ext4_writepages

Hi Daeho,

[auto build test ERROR on ext4/dev]
[also build test ERROR on v4.5-rc6 next-20160226]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url: https://github.com/0day-ci/linux/commits/Daeho-Jeong/ext4-handle-unwritten-or-delalloc-buffers-before-enabling-per-file-data-journaling/20160229-075117
base: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: x86_64-randconfig-s2-02290828 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64

All errors (new ones prefixed by >>):

>> ERROR: "percpu_free_rwsem" [fs/ext4/ext4.ko] undefined!

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
(No filename) (847.00 B)
.config.gz (22.10 kB)
Download all attachments

2016-03-02 01:21:40

by Daeho Jeong

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] ext4: fix races between changing inode journal mode and ext4_writepages

Oops, This build error log is genereated when building ext4 filesystem as a module and
this is originated from that "percpu_free_rwsem()" function was not defined as an exported
function.

Could I define the "percpu_free_rwsem()" function as an exported function directly in this patch?
Or do I have to request the maintainer of percpu_rwsem to do this?

Thank you in advance. :-)

> url: https://github.com/0day-ci/linux/commits/Daeho-Jeong/ext4-handle-unwritten-or-delalloc-buffers-before-enabling-per-file-dat> a-journaling/20160229-075117
> base: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
> config: x86_64-randconfig-s2-02290828 (attached as .config)
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64

> All errors (new ones prefixed by >>):

> >> ERROR: "percpu_free_rwsem" [fs/ext4/ext4.ko] undefined!