We already allocate delalloc blocks before changing the inode mode into
"per-file data journal" mode to prevent delalloc blocks from remaining
not allocated, but another issue concerned with "BH_Unwritten" status
still exists. For example, by fallocate(), several buffers' status
change into "BH_Unwritten", but these buffers cannot be processed by
ext4_alloc_da_blocks(). So, they still remain in unwritten status after
per-file data journaling is enabled and they cannot be changed into
written status any more and, if they are journaled and eventually
checkpointed, these unwritten buffer will cause a kernel panic by the
below BUG_ON() function of submit_bh_wbc() when they are submitted
during checkpointing.
static int submit_bh_wbc(int rw, struct buffer_head *bh,...
{
...
BUG_ON(buffer_unwritten(bh));
Moreover, when "dioread_nolock" option is enabled, the status of a
buffer is changed into "BH_Unwritten" after write_begin() completes and
the "BH_Unwritten" status will be cleared after I/O is done. Therefore,
if a buffer's status is changed into unwrutten but the buffer's I/O is
not submitted and completed, it can cause the same problem after
enabling per-file data journaling. You can easily generate this bug by
executing the following command.
./kvm-xfstests -C 10000 -m nodelalloc,dioread_nolock generic/269
To resolve these problems and define a boundary between the previous
mode and per-file data journaling mode, we need to flush and wait all
the I/O of buffers of a file before enabling per-file data journaling
of the file.
Signed-off-by: Daeho Jeong <[email protected]>
---
fs/ext4/inode.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 612fbcf..1f9458e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5168,9 +5168,14 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
* be allocated any more. even more truncate on delalloc blocks
* could trigger BUG by flushing delalloc blocks in journal.
* There is no delalloc block in non-journal data mode.
+ * We also have to handle unwritten buffers generated by
+ * fallocate() and dioread_nolock option. Once per-file data
+ * journaling is enabled, unwritten buffers will remain in
+ * unwritten status forever and they will be the seeds of
+ * kernel panic when they are checkpointed.
*/
- if (val && test_opt(inode->i_sb, DELALLOC)) {
- err = ext4_alloc_da_blocks(inode);
+ if (val) {
+ err = filemap_write_and_wait(inode->i_mapping);
if (err < 0)
return err;
}
--
1.7.9.5
Now, in ext4, there is only one writepages() function and it is shared
by all the inode modes. Therefore, BUG_ON() for checking journaled
inode mode in ext4_writepages() is not correct anymore because, if
per-file data journaling of a file is enabled while ext4_writepages()
is being executed, this BUG_ON() function can cause a kernel panic
unintentionally even on "nodelalloc" mode.
Signed-off-by: Daeho Jeong <[email protected]>
---
fs/ext4/inode.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1f9458e..db24348 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2480,13 +2480,11 @@ retry:
}
/*
- * We have two constraints: We find one extent to map and we
+ * We have a constraint: We find one extent to map and we
* must always write out whole page (makes a difference when
* blocksize < pagesize) so that we don't block on IO when we
- * try to write out the rest of the page. Journalled mode is
- * not supported by delalloc.
+ * try to write out the rest of the page.
*/
- BUG_ON(ext4_should_journal_data(inode));
needed_blocks = ext4_da_writepages_trans_blocks(inode);
/* start a new transaction */
--
1.7.9.5
Several problems occurred when per-file data journaling is enabled on
"delalloc" mode, so the per-file data journaling was permanently
deactivated by commit 3d2b15826282 ("ext4: ignore
EXT4_INODE_JOURNAL_DATA flag with delalloc"). But, those are not
problems for only "delalloc" mode and when you execute xfstest on
"nodelalloc" mode, same problems happen on "nodelalloc" mode. We always
execute xfstest on "delalloc" mode, which is default mode, so we
haven't realized problems of per-file data journaling feature. Finally,
problems of per-file data journaling feature were fixed by commit
9c02ac97989d ("ext4: fix xfstest generic/269 double revoked buffer bug
with bigalloc") and previous patchset. Now, we can re-enable the
feature on "delalloc" mode.
Signed-off-by: Daeho Jeong <[email protected]>
---
fs/ext4/ext4_jbd2.h | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 9c5b49f..742b3ec 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -395,12 +395,10 @@ static inline int ext4_inode_journal_mode(struct inode *inode)
{
if (EXT4_JOURNAL(inode) == NULL)
return EXT4_INODE_WRITEBACK_DATA_MODE; /* writeback */
- /* We do not support data journalling with delayed allocation */
if (!S_ISREG(inode->i_mode) ||
test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
return EXT4_INODE_JOURNAL_DATA_MODE; /* journal data */
- if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA) &&
- !test_opt(inode->i_sb, DELALLOC))
+ if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
return EXT4_INODE_JOURNAL_DATA_MODE; /* journal data */
if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA)
return EXT4_INODE_ORDERED_DATA_MODE; /* ordered */
--
1.7.9.5
On Wed 18-11-15 10:34:32, Daeho Jeong wrote:
> We already allocate delalloc blocks before changing the inode mode into
> "per-file data journal" mode to prevent delalloc blocks from remaining
> not allocated, but another issue concerned with "BH_Unwritten" status
> still exists. For example, by fallocate(), several buffers' status
> change into "BH_Unwritten", but these buffers cannot be processed by
> ext4_alloc_da_blocks(). So, they still remain in unwritten status after
> per-file data journaling is enabled and they cannot be changed into
> written status any more and, if they are journaled and eventually
> checkpointed, these unwritten buffer will cause a kernel panic by the
> below BUG_ON() function of submit_bh_wbc() when they are submitted
> during checkpointing.
>
> static int submit_bh_wbc(int rw, struct buffer_head *bh,...
> {
> ...
> BUG_ON(buffer_unwritten(bh));
>
> Moreover, when "dioread_nolock" option is enabled, the status of a
> buffer is changed into "BH_Unwritten" after write_begin() completes and
> the "BH_Unwritten" status will be cleared after I/O is done. Therefore,
> if a buffer's status is changed into unwrutten but the buffer's I/O is
> not submitted and completed, it can cause the same problem after
> enabling per-file data journaling. You can easily generate this bug by
> executing the following command.
>
> ./kvm-xfstests -C 10000 -m nodelalloc,dioread_nolock generic/269
>
> To resolve these problems and define a boundary between the previous
> mode and per-file data journaling mode, we need to flush and wait all
> the I/O of buffers of a file before enabling per-file data journaling
> of the file.
>
> Signed-off-by: Daeho Jeong <[email protected]>
> ---
> fs/ext4/inode.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 612fbcf..1f9458e 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5168,9 +5168,14 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
> * be allocated any more. even more truncate on delalloc blocks
> * could trigger BUG by flushing delalloc blocks in journal.
> * There is no delalloc block in non-journal data mode.
> + * We also have to handle unwritten buffers generated by
> + * fallocate() and dioread_nolock option. Once per-file data
> + * journaling is enabled, unwritten buffers will remain in
> + * unwritten status forever and they will be the seeds of
> + * kernel panic when they are checkpointed.
Can we maybe rephrase the whole comment like:
/*
* Before flushing the journal and switching inode's aops, we have
* to flush all dirty data the inode has. There can be outstanding
* delayed allocations, there can be unwritten extents created by
* fallocate or buffered writes in dioread_nolock mode covered by
* dirty data which can be converted only after flushing the dirty
* data (and journalled aops don't know how to handle these cases).
*/
Otherwise the patch looks good to me. So after updating the comment you can
add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> - if (val && test_opt(inode->i_sb, DELALLOC)) {
> - err = ext4_alloc_da_blocks(inode);
> + if (val) {
> + err = filemap_write_and_wait(inode->i_mapping);
> if (err < 0)
> return err;
> }
> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon 30-11-15 14:45:37, Jan Kara wrote:
> On Wed 18-11-15 10:34:32, Daeho Jeong wrote:
> > We already allocate delalloc blocks before changing the inode mode into
> > "per-file data journal" mode to prevent delalloc blocks from remaining
> > not allocated, but another issue concerned with "BH_Unwritten" status
> > still exists. For example, by fallocate(), several buffers' status
> > change into "BH_Unwritten", but these buffers cannot be processed by
> > ext4_alloc_da_blocks(). So, they still remain in unwritten status after
> > per-file data journaling is enabled and they cannot be changed into
> > written status any more and, if they are journaled and eventually
> > checkpointed, these unwritten buffer will cause a kernel panic by the
> > below BUG_ON() function of submit_bh_wbc() when they are submitted
> > during checkpointing.
> >
> > static int submit_bh_wbc(int rw, struct buffer_head *bh,...
> > {
> > ...
> > BUG_ON(buffer_unwritten(bh));
> >
> > Moreover, when "dioread_nolock" option is enabled, the status of a
> > buffer is changed into "BH_Unwritten" after write_begin() completes and
> > the "BH_Unwritten" status will be cleared after I/O is done. Therefore,
> > if a buffer's status is changed into unwrutten but the buffer's I/O is
> > not submitted and completed, it can cause the same problem after
> > enabling per-file data journaling. You can easily generate this bug by
> > executing the following command.
> >
> > ./kvm-xfstests -C 10000 -m nodelalloc,dioread_nolock generic/269
> >
> > To resolve these problems and define a boundary between the previous
> > mode and per-file data journaling mode, we need to flush and wait all
> > the I/O of buffers of a file before enabling per-file data journaling
> > of the file.
>
>
> >
> > Signed-off-by: Daeho Jeong <[email protected]>
> > ---
> > fs/ext4/inode.c | 9 +++++++--
> > 1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 612fbcf..1f9458e 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -5168,9 +5168,14 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
> > * be allocated any more. even more truncate on delalloc blocks
> > * could trigger BUG by flushing delalloc blocks in journal.
> > * There is no delalloc block in non-journal data mode.
> > + * We also have to handle unwritten buffers generated by
> > + * fallocate() and dioread_nolock option. Once per-file data
> > + * journaling is enabled, unwritten buffers will remain in
> > + * unwritten status forever and they will be the seeds of
> > + * kernel panic when they are checkpointed.
>
> Can we maybe rephrase the whole comment like:
>
> /*
> * Before flushing the journal and switching inode's aops, we have
> * to flush all dirty data the inode has. There can be outstanding
> * delayed allocations, there can be unwritten extents created by
> * fallocate or buffered writes in dioread_nolock mode covered by
> * dirty data which can be converted only after flushing the dirty
> * data (and journalled aops don't know how to handle these cases).
> */
>
> Otherwise the patch looks good to me. So after updating the comment you can
> add:
>
> Reviewed-by: Jan Kara <[email protected]>
BTW, the code is still racy wrt mmap write faults. Once Ted merges my
patches for hole punching races, we need to grab i_mmap_sem for writing
here as well to avoid write page fault from creating dirty page while we
are switching aops... I'm mostly writing this here so that there's lower
chance this gets lost.
Honza
> > - if (val && test_opt(inode->i_sb, DELALLOC)) {
> > - err = ext4_alloc_da_blocks(inode);
> > + if (val) {
> > + err = filemap_write_and_wait(inode->i_mapping);
> > if (err < 0)
> > return err;
> > }
> > --
> > 1.7.9.5
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed 18-11-15 10:34:33, Daeho Jeong wrote:
> Now, in ext4, there is only one writepages() function and it is shared
> by all the inode modes. Therefore, BUG_ON() for checking journaled
> inode mode in ext4_writepages() is not correct anymore because, if
> per-file data journaling of a file is enabled while ext4_writepages()
> is being executed, this BUG_ON() function can cause a kernel panic
> unintentionally even on "nodelalloc" mode.
>
> Signed-off-by: Daeho Jeong <[email protected]>
> ---
> fs/ext4/inode.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1f9458e..db24348 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2480,13 +2480,11 @@ retry:
> }
>
> /*
> - * We have two constraints: We find one extent to map and we
> + * We have a constraint: We find one extent to map and we
> * must always write out whole page (makes a difference when
> * blocksize < pagesize) so that we don't block on IO when we
> - * try to write out the rest of the page. Journalled mode is
> - * not supported by delalloc.
> + * try to write out the rest of the page.
> */
Well, it is still true that journalled mode is not supported by delalloc so
I would not delete the comment.
Actually what you do only hides the real problem - that ext4_writepages()
in non-journalled mode can be still running for an inode which is already
switched to journalled mode. In theory if we manage to dirty some pages
after the switch, non-journalled writepages *can* see them an try to write
them back which will break spectacularly. So to fix this we need something
like a writeback barrier for the inode - make sure all ext4_writepages()
calls have completed before switching aops. Now I'd hate to grow struct
ext4_inode_info only for this extra rare case so we could probably
implement the barrier on per-filesystem basis - a fs-wide per-cpu rw
semaphore acquired for reading while ext4_writepages() runs and acquired
for writing when we switch aops for some inode.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed 18-11-15 10:34:34, Daeho Jeong wrote:
> Several problems occurred when per-file data journaling is enabled on
> "delalloc" mode, so the per-file data journaling was permanently
> deactivated by commit 3d2b15826282 ("ext4: ignore
> EXT4_INODE_JOURNAL_DATA flag with delalloc"). But, those are not
> problems for only "delalloc" mode and when you execute xfstest on
> "nodelalloc" mode, same problems happen on "nodelalloc" mode. We always
> execute xfstest on "delalloc" mode, which is default mode, so we
> haven't realized problems of per-file data journaling feature. Finally,
> problems of per-file data journaling feature were fixed by commit
> 9c02ac97989d ("ext4: fix xfstest generic/269 double revoked buffer bug
> with bigalloc") and previous patchset. Now, we can re-enable the
> feature on "delalloc" mode.
Yeah, once we fix what I pointed out for patch 2/3 and we fix races with
page faults, this should be OK to do.
Honza
> Signed-off-by: Daeho Jeong <[email protected]>
> ---
> fs/ext4/ext4_jbd2.h | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 9c5b49f..742b3ec 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -395,12 +395,10 @@ static inline int ext4_inode_journal_mode(struct inode *inode)
> {
> if (EXT4_JOURNAL(inode) == NULL)
> return EXT4_INODE_WRITEBACK_DATA_MODE; /* writeback */
> - /* We do not support data journalling with delayed allocation */
> if (!S_ISREG(inode->i_mode) ||
> test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
> return EXT4_INODE_JOURNAL_DATA_MODE; /* journal data */
> - if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA) &&
> - !test_opt(inode->i_sb, DELALLOC))
> + if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
> return EXT4_INODE_JOURNAL_DATA_MODE; /* journal data */
> if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA)
> return EXT4_INODE_ORDERED_DATA_MODE; /* ordered */
> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR
Thank you, Jan.
> Can we maybe rephrase the whole comment like:
>
> /*
> * Before flushing the journal and switching inode's aops, we have
> * to flush all dirty data the inode has. There can be outstanding
> * delayed allocations, there can be unwritten extents created by
> * fallocate or buffered writes in dioread_nolock mode covered by
> * dirty data which can be converted only after flushing the dirty
> * data (and journalled aops don't know how to handle these cases).
> */
Your comment is more understandable, so I will change the comment in
the patch.
> BTW, the code is still racy wrt mmap write faults. Once Ted merges my
> patches for hole punching races, we need to grab i_mmap_sem for writing
> here as well to avoid write page fault from creating dirty page while we
> are switching aops... I'm mostly writing this here so that there's lower
> chance this gets lost.
Ok. I've recognized another racy condition when switching aops of an inode.
I will also consider the point in v2 of this patch.
> Actually what you do only hides the real problem - that ext4_writepages()
> in non-journalled mode can be still running for an inode which is already
> switched to journalled mode. In theory if we manage to dirty some pages
> after the switch, non-journalled writepages *can* see them an try to write
> them back which will break spectacularly. So to fix this we need something
> like a writeback barrier for the inode - make sure all ext4_writepages()
> calls have completed before switching aops. Now I'd hate to grow struct
> ext4_inode_info only for this extra rare case so we could probably
> implement the barrier on per-filesystem basis - a fs-wide per-cpu rw
> semaphore acquired for reading while ext4_writepages() runs and acquired
> for writing when we switch aops for some inode.
Yes, there is another issue that newly dirtied pages after the switch can be
shown in ext4_writepages() in non-journalled mode. I overlooked that point.
According to your comment, I will modify this patch.
Thank you, again.