Hi,
following three patches reimplement ext4_page_mkwrite() so that it returns
locked pages (which is necessary for stable pages work and also for fixing of
freezing code). As a bonus we also avoid taking i_alloc_sem as it's not
necessary and use generic block_page_mkwrite() helper. The common delalloc
path should be more straightforward now.
The last patch in the series blocks mmaped writes on frozen filesystem which
is simple to do now.
I've tested these patches by xfstests and also running fsx-linux for all
modes - delalloc, nodelalloc (data=writeback), nodelalloc (data=ordered),
nodelalloc (data=journal).
Honza
Create __block_page_mkwrite() helper which does all what block_page_mkwrite()
does except that it passes back errors from __block_write_begin /
block_commit_write calls. This is needed for some filesystems so that they
can detect errors such as ENOSPC and try harder.
CC: Christoph Hellwig <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/buffer.c | 26 ++++++++++++++++++--------
include/linux/buffer_head.h | 2 ++
2 files changed, 20 insertions(+), 8 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index a08bb8e..469c832 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2333,7 +2333,7 @@ EXPORT_SYMBOL(block_commit_write);
* unlock the page.
*/
int
-block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+__block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
get_block_t get_block)
{
struct page *page = vmf->page;
@@ -2361,18 +2361,28 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
if (!ret)
ret = block_commit_write(page, 0, end);
- if (unlikely(ret)) {
+ if (unlikely(ret < 0))
unlock_page(page);
- if (ret == -ENOMEM)
- ret = VM_FAULT_OOM;
- else /* -ENOSPC, -EIO, etc */
- ret = VM_FAULT_SIGBUS;
- } else
+ else
ret = VM_FAULT_LOCKED;
-
out:
return ret;
}
+EXPORT_SYMBOL(__block_page_mkwrite);
+
+int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+ get_block_t get_block)
+{
+ int ret = __block_page_mkwrite(vma, vmf, get_block);
+
+ if (unlikely(ret < 0)) {
+ if (ret == -ENOMEM)
+ return VM_FAULT_OOM;
+ /* -ENOSPC, -EIO, etc */
+ return VM_FAULT_SIGBUS;
+ }
+ return ret;
+}
EXPORT_SYMBOL(block_page_mkwrite);
/*
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index f5df235..0b719b0 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -217,6 +217,8 @@ int cont_write_begin(struct file *, struct address_space *, loff_t,
get_block_t *, loff_t *);
int generic_cont_expand_simple(struct inode *inode, loff_t size);
int block_commit_write(struct page *page, unsigned from, unsigned to);
+int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+ get_block_t get_block);
int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
get_block_t get_block);
sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *);
--
1.7.1
We should not allow file modification via mmap while the filesystem is
frozen. So block in ext4_page_mkwrite() while the filesystem is frozen.
We have to check for frozen filesystem with the page marked dirty and under
page lock with which we then return from ext4_page_mkwrite(). Only that way we
cannot race with writeback done by freezing code - either we mark the page
dirty after the writeback has started, see freezing in progress and block, or
writeback will wait for our page lock which is released only when the fault is
done and then writeback will writeout and writeprotect the page again.
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 20 ++++++++++++++++++++
1 files changed, 20 insertions(+), 0 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 04b164d..a7e13b6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5801,6 +5801,12 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
get_block_t *get_block;
int retries = 0;
+restart:
+ /*
+ * This check is racy but catches the common case. The check at the
+ * end of this function is reliable.
+ */
+ vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
/* Delalloc case is easy... */
if (test_opt(inode->i_sb, DELALLOC) &&
!ext4_should_journal_data(inode) &&
@@ -5870,5 +5876,19 @@ out_ret:
ret = VM_FAULT_SIGBUS;
}
out:
+ /*
+ * Freezing in progress? We check after the page is marked dirty and
+ * with page lock held so if the test here fails, we are sure freezing
+ * code will wait during syncing until the page fault is done - at that
+ * point page will be dirty and unlocked so freezing code will write it
+ * and writeprotect it again.
+ */
+ if (ret == VM_FAULT_LOCKED) {
+ set_page_dirty(page);
+ if (inode->i_sb->s_frozen != SB_UNFROZEN) {
+ unlock_page(page);
+ goto restart;
+ }
+ }
return ret;
}
--
1.7.1
ext4_page_mkwrite() does not return page locked. This makes it hard
to avoid races with filesystem freezing code (so that we don't leave
writeable page on a frozen fs) or writeback code (so that we allow page
to be stable during writeback).
Also the current code uses i_alloc_sem to avoid races with truncate but that
seems to be the wrong locking order according to lock ordering documented in
mm/rmap.c.
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 98 ++++++++++++++++++++++++++++++++-----------------------
1 files changed, 57 insertions(+), 41 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f2fa5e8..04b164d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5793,66 +5793,82 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
struct page *page = vmf->page;
loff_t size;
unsigned long len;
- int ret = -EINVAL;
- void *fsdata;
+ int ret;
struct file *file = vma->vm_file;
struct inode *inode = file->f_path.dentry->d_inode;
struct address_space *mapping = inode->i_mapping;
+ handle_t *handle;
+ get_block_t *get_block;
+ int retries = 0;
- /*
- * Get i_alloc_sem to stop truncates messing with the inode. We cannot
- * get i_mutex because we are already holding mmap_sem.
- */
- down_read(&inode->i_alloc_sem);
+ /* Delalloc case is easy... */
+ if (test_opt(inode->i_sb, DELALLOC) &&
+ !ext4_should_journal_data(inode) &&
+ !ext4_nonda_switch(inode->i_sb)) {
+ do {
+ ret = __block_page_mkwrite(vma, vmf,
+ ext4_da_get_block_prep);
+ } while (ret == -ENOSPC &&
+ ext4_should_retry_alloc(inode->i_sb, &retries));
+ goto out_ret;
+ }
+
+ lock_page(page);
size = i_size_read(inode);
- if (page->mapping != mapping || size <= page_offset(page)
- || !PageUptodate(page)) {
- /* page got truncated from under us? */
- goto out_unlock;
+ /* Page got truncated from under us? */
+ if (page->mapping != mapping || page_offset(page) > size) {
+ unlock_page(page);
+ ret = VM_FAULT_NOPAGE;
+ goto out;
}
- ret = 0;
- if (PageMappedToDisk(page))
- goto out_unlock;
if (page->index == size >> PAGE_CACHE_SHIFT)
len = size & ~PAGE_CACHE_MASK;
else
len = PAGE_CACHE_SIZE;
-
- lock_page(page);
/*
- * return if we have all the buffers mapped. This avoid
- * the need to call write_begin/write_end which does a
- * journal_start/journal_stop which can block and take
- * long time
+ * Return if we have all the buffers mapped. This avoids the need to do
+ * journal_start/journal_stop which can block and take a long time
*/
if (page_has_buffers(page)) {
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
- unlock_page(page);
- goto out_unlock;
+ ret = VM_FAULT_LOCKED;
+ goto out;
}
}
unlock_page(page);
- /*
- * OK, we need to fill the hole... Do write_begin write_end
- * to do block allocation/reservation.We are not holding
- * inode.i__mutex here. That allow * parallel write_begin,
- * write_end call. lock_page prevent this from happening
- * on the same page though
- */
- ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
- len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
- if (ret < 0)
- goto out_unlock;
- ret = mapping->a_ops->write_end(file, mapping, page_offset(page),
- len, len, page, fsdata);
- if (ret < 0)
- goto out_unlock;
- ret = 0;
-out_unlock:
- if (ret)
+ /* OK, we need to fill the hole... */
+ if (ext4_should_dioread_nolock(inode))
+ get_block = ext4_get_block_write;
+ else
+ get_block = ext4_get_block;
+retry_alloc:
+ handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
+ if (IS_ERR(handle)) {
ret = VM_FAULT_SIGBUS;
- up_read(&inode->i_alloc_sem);
+ goto out;
+ }
+ ret = __block_page_mkwrite(vma, vmf, get_block);
+ if (ret == VM_FAULT_LOCKED && ext4_should_journal_data(inode)) {
+ if (walk_page_buffers(handle, page_buffers(page), 0,
+ PAGE_CACHE_SIZE, NULL, do_journal_get_write_access)) {
+ unlock_page(page);
+ ret = VM_FAULT_SIGBUS;
+ goto out;
+ }
+ ext4_set_inode_state(inode, EXT4_STATE_JDATA);
+ }
+ ext4_journal_stop(handle);
+ if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+ goto retry_alloc;
+out_ret:
+ if (ret < 0) {
+ if (ret == -ENOMEM)
+ ret = VM_FAULT_OOM;
+ else
+ ret = VM_FAULT_SIGBUS;
+ }
+out:
return ret;
}
--
1.7.1
On Wed, May 11, 2011 at 12:29:31AM +0200, Jan Kara wrote:
>
> following three patches reimplement ext4_page_mkwrite() so that it returns
> locked pages (which is necessary for stable pages work and also for fixing of
> freezing code). As a bonus we also avoid taking i_alloc_sem as it's not
> necessary and use generic block_page_mkwrite() helper. The common delalloc
> path should be more straightforward now.
How does this relate to Darrick Wong's patches? Are your patches in
addition to or instead of his?
- Ted
On Wed, May 11, 2011 at 12:43 PM, Jan Kara <[email protected]> wrote:
> ?Hi Amir,
>
> On Wed 11-05-11 10:33:17, Amir G. wrote:
>> Can you provide a bit of a wider scope review of how this related to the
>> work on stable pages.
> ?It is related in the sense that Darrick had to return locked page from
> ext4_page_mkwrite() to avoid races and the result (although working) is
> really like scratching your left ear with your right hand (not sure if
> English has this idiom ;). So with my patches, what Darrick needs to do is
> simply achieved by updating block_page_mkwrite() (__block_page_mkwrite() in
> fact).
>
>> For example, when to the pages get unlocked?
> ?That's handled by mm code in mm/memory.c. You can read do_wp_page() but
> it's not a light reading ;)
This is why I asked for the cheat sheet ;-)
>
>> If the pages supposed to be stable during writeback, how is this related
>> to returning locked pages from page_mkwrite?
> ?Returning locked page is needed to avoid races with page writeback - for
> stable pages we want to do wait_on_page_writeback() to be sure that there's
> no IO happening while we make the page writeable for user. But once we
> release page lock, writepage can come and start the writeback. The basic
> scheme of the race we want to avoid is:
> ?do_wp_page()
> ? ?ext4_page_mkwrite()
> ? ? ?wait_on_page_writeback()
> ? ? ?unlock_page()
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?clear_page_dirty_for_io()
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?page_mkclean()
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?writepage()
> ? ?maybe_mkwrite()
>
> Because this results in writeable page under writeback... We have to make
> sure page_mkclean() happens *after* maybe_mkwrite() from do_wp_page() in
> this scenario.
>
>> Is the page going to stay locked until writeback?
> ?No, until the page fault is finished.
>
>> Do I understand correctly that a page will be marked read-only after
>> writeback completes, so page_mkwrite will be called again on next write?
> ?Page is marked read-only before writeback is started in
> clear_page_dirty_for_io().
OK. that makes sense. I wasn't sure my move-on-write hook,
which is called from ext4_page_mkwrite() is going to be called on every
writeback.
The use case is:
data = mmap(file);
data[0] = 1;
take snapshot 1;
data[0] = 2;
take snapshot 2;
data[0] = 3;
...
That use case doesn't specify when writeback of first page of file happens
and I don't care which version of data[0] is seen by the snapshots, as long
as the snapshot view is internally consistent (it doesn't change over time or
after reboot).
So if ext4_page_mkwrite() is called after every writeback and the page is stable
during writeback, snapshots should be OK.
>
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
>
Hi Amir,
On Wed 11-05-11 10:33:17, Amir G. wrote:
> Can you provide a bit of a wider scope review of how this related to the
> work on stable pages.
It is related in the sense that Darrick had to return locked page from
ext4_page_mkwrite() to avoid races and the result (although working) is
really like scratching your left ear with your right hand (not sure if
English has this idiom ;). So with my patches, what Darrick needs to do is
simply achieved by updating block_page_mkwrite() (__block_page_mkwrite() in
fact).
> For example, when to the pages get unlocked?
That's handled by mm code in mm/memory.c. You can read do_wp_page() but
it's not a light reading ;)
> If the pages supposed to be stable during writeback, how is this related
> to returning locked pages from page_mkwrite?
Returning locked page is needed to avoid races with page writeback - for
stable pages we want to do wait_on_page_writeback() to be sure that there's
no IO happening while we make the page writeable for user. But once we
release page lock, writepage can come and start the writeback. The basic
scheme of the race we want to avoid is:
do_wp_page()
ext4_page_mkwrite()
wait_on_page_writeback()
unlock_page()
clear_page_dirty_for_io()
page_mkclean()
writepage()
maybe_mkwrite()
Because this results in writeable page under writeback... We have to make
sure page_mkclean() happens *after* maybe_mkwrite() from do_wp_page() in
this scenario.
> Is the page going to stay locked until writeback?
No, until the page fault is finished.
> Do I understand correctly that a page will be marked read-only after
> writeback completes, so page_mkwrite will be called again on next write?
Page is marked read-only before writeback is started in
clear_page_dirty_for_io().
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Hi Jan,
Can you provide a bit of a wider scope review of how this related to
the work on stable pages.
For example, when to the pages get unlocked?
If the pages supposed to be stable during writeback, how is this related to
returning locked pages from page_mkwrite?
Is the page going to stay locked until writeback?
Do I understand correctly that a page will be marked read-only after
writeback completes, so page_mkwrite will be called again on next write?
Thanks,
Amir.
On Wed, May 11, 2011 at 1:29 AM, Jan Kara <[email protected]> wrote:
>
> ?Hi,
>
> ?following three patches reimplement ext4_page_mkwrite() so that it returns
> locked pages (which is necessary for stable pages work and also for fixing of
> freezing code). As a bonus we also avoid taking i_alloc_sem as it's not
> necessary and use generic block_page_mkwrite() helper. The common delalloc
> path should be more straightforward now.
>
> The last patch in the series blocks mmaped writes on frozen filesystem which
> is simple to do now.
>
> I've tested these patches by xfstests and also running fsx-linux for all
> modes - delalloc, nodelalloc (data=writeback), nodelalloc (data=ordered),
> nodelalloc (data=journal).
>
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Honza
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
>
On Tue 10-05-11 20:30:12, Ted Tso wrote:
> On Wed, May 11, 2011 at 12:29:31AM +0200, Jan Kara wrote:
> >
> > following three patches reimplement ext4_page_mkwrite() so that it returns
> > locked pages (which is necessary for stable pages work and also for fixing of
> > freezing code). As a bonus we also avoid taking i_alloc_sem as it's not
> > necessary and use generic block_page_mkwrite() helper. The common delalloc
> > path should be more straightforward now.
>
> How does this relate to Darrick Wong's patches? Are your patches in
> addition to or instead of his?
It's in addition to those (although with my patches, a generic change to
block_page_mkwrite() would be enough for ext4). If you plan to merge those
now, I can rediff on top of those patches.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, May 11, 2011 at 11:27:52AM +0200, Jan Kara wrote:
> On Tue 10-05-11 20:30:12, Ted Tso wrote:
> > On Wed, May 11, 2011 at 12:29:31AM +0200, Jan Kara wrote:
> > >
> > > following three patches reimplement ext4_page_mkwrite() so that it returns
> > > locked pages (which is necessary for stable pages work and also for fixing of
> > > freezing code). As a bonus we also avoid taking i_alloc_sem as it's not
> > > necessary and use generic block_page_mkwrite() helper. The common delalloc
> > > path should be more straightforward now.
> >
> > How does this relate to Darrick Wong's patches? Are your patches in
> > addition to or instead of his?
> It's in addition to those (although with my patches, a generic change to
> block_page_mkwrite() would be enough for ext4). If you plan to merge those
> now, I can rediff on top of those patches.
Just to be clear, your new __block_page_mkwrite still needs a patch to call
wait_on_page_writeback to provide stable page writes, correct?
I think we'd still need a wait_on_page_writeback for the "all buffers are
mapped" VM_FAULT_LOCKED exit from ext4_page_mkwrite?
--D
On Wed 11-05-11 09:48:36, Darrick J. Wong wrote:
> On Wed, May 11, 2011 at 11:27:52AM +0200, Jan Kara wrote:
> > On Tue 10-05-11 20:30:12, Ted Tso wrote:
> > > On Wed, May 11, 2011 at 12:29:31AM +0200, Jan Kara wrote:
> > > >
> > > > following three patches reimplement ext4_page_mkwrite() so that it returns
> > > > locked pages (which is necessary for stable pages work and also for fixing of
> > > > freezing code). As a bonus we also avoid taking i_alloc_sem as it's not
> > > > necessary and use generic block_page_mkwrite() helper. The common delalloc
> > > > path should be more straightforward now.
> > >
> > > How does this relate to Darrick Wong's patches? Are your patches in
> > > addition to or instead of his?
> > It's in addition to those (although with my patches, a generic change to
> > block_page_mkwrite() would be enough for ext4). If you plan to merge those
> > now, I can rediff on top of those patches.
>
> Just to be clear, your new __block_page_mkwrite still needs a patch to call
> wait_on_page_writeback to provide stable page writes, correct?
Correct.
> I think we'd still need a wait_on_page_writeback for the "all buffers are
> mapped" VM_FAULT_LOCKED exit from ext4_page_mkwrite?
Ah, right, I forgot about that one.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
> + if (unlikely(ret < 0))
> unlock_page(page);
> + else
> ret = VM_FAULT_LOCKED;
> out:
> return ret;
Using two different types of return values here seems rather unclean.
I'd rather use 0 instead of VM_FAULT_LOCKED here, and maybe overload
-EAGAIN for VM_FAULT_NOPAGE.
> +int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> + get_block_t get_block)
> +{
> + int ret = __block_page_mkwrite(vma, vmf, get_block);
> +
> + if (unlikely(ret < 0)) {
> + if (ret == -ENOMEM)
> + return VM_FAULT_OOM;
> + /* -ENOSPC, -EIO, etc */
> + return VM_FAULT_SIGBUS;
> + }
> + return ret;
and maybe also add a small inlined block_page_mkwrite_error helper
to translate the values.
Alternatively it might make sense to add a VM_FAULT_ENOSPC return value
so that you could re-use block_page_mkwrite unmodified, and we could
also have better error reporting for that case for the core VM.
On Wed, May 11, 2011 at 12:29:34AM +0200, Jan Kara wrote:
> We should not allow file modification via mmap while the filesystem is
> frozen. So block in ext4_page_mkwrite() while the filesystem is frozen.
>
> We have to check for frozen filesystem with the page marked dirty and under
> page lock with which we then return from ext4_page_mkwrite(). Only that way we
> cannot race with writeback done by freezing code - either we mark the page
> dirty after the writeback has started, see freezing in progress and block, or
> writeback will wait for our page lock which is released only when the fault is
> done and then writeback will writeout and writeprotect the page again.
This really should be done in (__)block_page_mkwrite. I'd also return
VM_FAULT_RETRY instead of retrying inside the block_mkwrite handler
in case you hit the race.
On Tue 17-05-11 11:09:03, Christoph Hellwig wrote:
> > + if (unlikely(ret < 0))
> > unlock_page(page);
> > + else
> > ret = VM_FAULT_LOCKED;
> > out:
> > return ret;
>
> Using two different types of return values here seems rather unclean.
> I'd rather use 0 instead of VM_FAULT_LOCKED here, and maybe overload
> -EAGAIN for VM_FAULT_NOPAGE.
OK, makes sense. I'll do that.
> > +int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> > + get_block_t get_block)
> > +{
> > + int ret = __block_page_mkwrite(vma, vmf, get_block);
> > +
> > + if (unlikely(ret < 0)) {
> > + if (ret == -ENOMEM)
> > + return VM_FAULT_OOM;
> > + /* -ENOSPC, -EIO, etc */
> > + return VM_FAULT_SIGBUS;
> > + }
> > + return ret;
>
> and maybe also add a small inlined block_page_mkwrite_error helper
> to translate the values.
Good idea.
> Alternatively it might make sense to add a VM_FAULT_ENOSPC return value
> so that you could re-use block_page_mkwrite unmodified, and we could
> also have better error reporting for that case for the core VM.
Umm, I think the first solution is better. For example ext[34] needs to
differentiate ENOSPC and EDQUOT (for the second one it does not make sense
to force transaction commit because quota accounting is always precise) and
other filesystems might possibly have different requirements. So returning
the error value and letting fs sort it out looks like the most versatile
solution.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue 17-05-11 11:11:08, Christoph Hellwig wrote:
> On Wed, May 11, 2011 at 12:29:34AM +0200, Jan Kara wrote:
> > We should not allow file modification via mmap while the filesystem is
> > frozen. So block in ext4_page_mkwrite() while the filesystem is frozen.
> >
> > We have to check for frozen filesystem with the page marked dirty and under
> > page lock with which we then return from ext4_page_mkwrite(). Only that way we
> > cannot race with writeback done by freezing code - either we mark the page
> > dirty after the writeback has started, see freezing in progress and block, or
> > writeback will wait for our page lock which is released only when the fault is
> > done and then writeback will writeout and writeprotect the page again.
>
> This really should be done in (__)block_page_mkwrite. I'd also return
> VM_FAULT_RETRY instead of retrying inside the block_mkwrite handler
> in case you hit the race.
Well, we can do the non-blocking check at the end of
__block_page_mkwrite() and return some error value (EAGAIN translating to
VM_FAULT_RETRY would look logical, I just have to think off better error
value for VM_FAULT_NOPAGE). But vfs_check_frozen() cannot be in
__block_page_mkwrite() since ext4 needs to call that with a transaction
started so that would create a deadlock and we need to call
vfs_check_frozen() somewhere so that we don't busyloop.
I can call vfs_check_frozen() inside block_page_mkwrite() but it would be a
bit surprising difference from __block_page_mkwrite() to me. Not sure what
the cleanest solution would be here...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, May 18, 2011 at 09:56:14AM +0200, Jan Kara wrote:
> __block_page_mkwrite() and return some error value (EAGAIN translating to
> VM_FAULT_RETRY would look logical, I just have to think off better error
> value for VM_FAULT_NOPAGE). But vfs_check_frozen() cannot be in
> __block_page_mkwrite() since ext4 needs to call that with a transaction
> started so that would create a deadlock and we need to call
> vfs_check_frozen() somewhere so that we don't busyloop.
>
> I can call vfs_check_frozen() inside block_page_mkwrite() but it would be a
> bit surprising difference from __block_page_mkwrite() to me. Not sure what
> the cleanest solution would be here...
block_page_mkwrite is supposed to be used directly by filesystems and
do all the right things. IIRC Eric even mentioned he added
vfs_check_frozen to it for RHEL, but forgot to push it upstream.
On 5/18/11 3:07 AM, Christoph Hellwig wrote:
> On Wed, May 18, 2011 at 09:56:14AM +0200, Jan Kara wrote:
>> __block_page_mkwrite() and return some error value (EAGAIN translating to
>> VM_FAULT_RETRY would look logical, I just have to think off better error
>> value for VM_FAULT_NOPAGE). But vfs_check_frozen() cannot be in
>> __block_page_mkwrite() since ext4 needs to call that with a transaction
>> started so that would create a deadlock and we need to call
>> vfs_check_frozen() somewhere so that we don't busyloop.
>>
>> I can call vfs_check_frozen() inside block_page_mkwrite() but it would be a
>> bit surprising difference from __block_page_mkwrite() to me. Not sure what
>> the cleanest solution would be here...
>
> block_page_mkwrite is supposed to be used directly by filesystems and
> do all the right things. IIRC Eric even mentioned he added
> vfs_check_frozen to it for RHEL, but forgot to push it upstream.
Well, I tried, but it was rejected IIRC. Still, mea culpa....
I can resurrect what I did for RHEL5 and repost if desired...
-Eric
On Wed 18-05-11 09:03:27, Eric Sandeen wrote:
> On 5/18/11 3:07 AM, Christoph Hellwig wrote:
> > On Wed, May 18, 2011 at 09:56:14AM +0200, Jan Kara wrote:
> >> __block_page_mkwrite() and return some error value (EAGAIN translating to
> >> VM_FAULT_RETRY would look logical, I just have to think off better error
> >> value for VM_FAULT_NOPAGE). But vfs_check_frozen() cannot be in
> >> __block_page_mkwrite() since ext4 needs to call that with a transaction
> >> started so that would create a deadlock and we need to call
> >> vfs_check_frozen() somewhere so that we don't busyloop.
> >>
> >> I can call vfs_check_frozen() inside block_page_mkwrite() but it would be a
> >> bit surprising difference from __block_page_mkwrite() to me. Not sure what
> >> the cleanest solution would be here...
> >
> > block_page_mkwrite is supposed to be used directly by filesystems and
> > do all the right things. IIRC Eric even mentioned he added
> > vfs_check_frozen to it for RHEL, but forgot to push it upstream.
>
> Well, I tried, but it was rejected IIRC. Still, mea culpa....
>
> I can resurrect what I did for RHEL5 and repost if desired...
I've just submitted second version of the patch series. So please check
whether it does all you need... Thanks.
Honza
On May 18, 2011, at 10:25 AM, Jan Kara <[email protected]> wrote:
> On Wed 18-05-11 09:03:27, Eric Sandeen wrote:
>> On 5/18/11 3:07 AM, Christoph Hellwig wrote:
>>> On Wed, May 18, 2011 at 09:56:14AM +0200, Jan Kara wrote:
>>>> __block_page_mkwrite() and return some error value (EAGAIN translating to
>>>> VM_FAULT_RETRY would look logical, I just have to think off better error
>>>> value for VM_FAULT_NOPAGE). But vfs_check_frozen() cannot be in
>>>> __block_page_mkwrite() since ext4 needs to call that with a transaction
>>>> started so that would create a deadlock and we need to call
>>>> vfs_check_frozen() somewhere so that we don't busyloop.
>>>>
>>>> I can call vfs_check_frozen() inside block_page_mkwrite() but it would be a
>>>> bit surprising difference from __block_page_mkwrite() to me. Not sure what
>>>> the cleanest solution would be here...
>>>
>>> block_page_mkwrite is supposed to be used directly by filesystems and
>>> do all the right things. IIRC Eric even mentioned he added
>>> vfs_check_frozen to it for RHEL, but forgot to push it upstream.
>>
>> Well, I tried, but it was rejected IIRC. Still, mea culpa....
>>
>> I can resurrect what I did for RHEL5 and repost if desired...
> I've just submitted second version of the patch series. So please check
> whether it does all you need... Thanks.
>
Thanks! Btw the rejection I mentioned was years ago... Not you :)
-Eric
> Honza