2004-09-03 11:05:35

by Andrey Savochkin

[permalink] [raw]
Subject: EXT3: problem with copy_from_user inside a transaction

Hi Andrew,

filemap_copy_from_user() between prepare_write() and commit_write()
appears to be a problem for ext3.

prepare_write() starts a transaction, and if filemap_copy_from_user() causes
a page fault, we'll have
- order violation with mmap_sem taken inside a transaction (possible
deadlocks),
- __GFP_FS memory allocation with all re-entrancy problems (e.g.,
current->journal_info corruption).

Am I missing something?

If this observation is correct, the possible solution is to call
get_user_pages() or somehow pin the user pages before prepare_write(),
although it will hurt performance...

Andrey


2004-09-03 12:38:41

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: EXT3: problem with copy_from_user inside a transaction

On Fri, Sep 03, 2004 at 03:05:21PM +0400, Andrey Savochkin wrote:
> Hi Andrew,
>
> filemap_copy_from_user() between prepare_write() and commit_write()
> appears to be a problem for ext3.
>
> prepare_write() starts a transaction, and if filemap_copy_from_user() causes
> a page fault, we'll have
> - order violation with mmap_sem taken inside a transaction (possible
> deadlocks),
> - __GFP_FS memory allocation with all re-entrancy problems (e.g.,
> current->journal_info corruption).
>
> Am I missing something?
>
> If this observation is correct, the possible solution is to call
> get_user_pages() or somehow pin the user pages before prepare_write(),
> although it will hurt performance...

yes, Chris is working on it for a few months.

just for the mmap_sem the easiest fix I proposed him, was to take the
mmap_sem in read mode before starting the transaction in prepare_write,
then the mmap_sem has to become recursive but that's trivial (my 2.4
rw semaphores are much simpler, they don't crash with million of
waiters, and they can support recursion).

however it seems the mmap_sem is not the only issue, he found another
issue with the page lock, maybe Chris you want to elaborate (the above
deadlock is absolutely clear, the one with the page lock less, btw, I
don't care much with reading the same page from disk that we're writing
to in the write syscall, that's a case we may just want to forbidden
since it makes no sense and currently it's racey even in 2.4, no pin
happens, but the prefaulting hides it).

We also hidden the above deadlock with prefaulting (but it's far from
fixed), prefaulting is a good idea anyways.

2004-09-03 13:01:36

by Chris Mason

[permalink] [raw]
Subject: Re: EXT3: problem with copy_from_user inside a transaction

On Fri, 2004-09-03 at 08:35, Andrea Arcangeli wrote:
> On Fri, Sep 03, 2004 at 03:05:21PM +0400, Andrey Savochkin wrote:
> > Hi Andrew,
> >
> > filemap_copy_from_user() between prepare_write() and commit_write()
> > appears to be a problem for ext3.
> >
And reiserv3, and maybe the other journaled filesystems.

> yes, Chris is working on it for a few months.
>
Working is a generous term, I've somewhat been waiting for a better
solution to pop into my head. In the end, I think all we can do is not
allow filesystems to take locks (or implicit locks like starting a
transaction) inside the prepare_write call.

This would mean that all the work is done during the commit_write
stage. The trick is that we would have to handle -ENOSPC since we might
not know we've run out of room until after the data has been copied from
userland.

prepare_write could reserve blocks, which brings us half way to a
generic delayed allocation layer. But for a quick and dirty start,
doing it all in commit_write should work.

-chris


2004-09-03 13:04:52

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: EXT3: problem with copy_from_user inside a transaction

On Fri, Sep 03, 2004 at 08:06:20AM -0400, Chris Mason wrote:
> prepare_write could reserve blocks, which brings us half way to a
> generic delayed allocation layer. [..]

sounds good to me!

2004-09-03 13:57:49

by Andrey Savochkin

[permalink] [raw]
Subject: Re: EXT3: problem with copy_from_user inside a transaction

On Fri, Sep 03, 2004 at 08:06:20AM -0400, Chris Mason wrote:
> On Fri, 2004-09-03 at 08:35, Andrea Arcangeli wrote:
> > On Fri, Sep 03, 2004 at 03:05:21PM +0400, Andrey Savochkin wrote:
> > >
> > > filemap_copy_from_user() between prepare_write() and commit_write()
> > > appears to be a problem for ext3.
> > >
> And reiserv3, and maybe the other journaled filesystems.
>
> > yes, Chris is working on it for a few months.
> >
> Working is a generous term, I've somewhat been waiting for a better
> solution to pop into my head. In the end, I think all we can do is not
> allow filesystems to take locks (or implicit locks like starting a
> transaction) inside the prepare_write call.
>
> This would mean that all the work is done during the commit_write
> stage. The trick is that we would have to handle -ENOSPC since we might
> not know we've run out of room until after the data has been copied from
> userland.

What is the problem -ENOSPC?
Do you think about the problem of the page existing before this write, it's
content overwritten, but the filesystem being unable to commit that write
because it needs more space?

Andrey

2004-09-04 08:41:25

by Chris Mason

[permalink] [raw]
Subject: Re: EXT3: problem with copy_from_user inside a transaction

On Fri, 2004-09-03 at 09:57, Andrey Savochkin wrote:

> > This would mean that all the work is done during the commit_write
> > stage. The trick is that we would have to handle -ENOSPC since we might
> > not know we've run out of room until after the data has been copied from
> > userland.
>
> What is the problem -ENOSPC?
> Do you think about the problem of the page existing before this write, it's
> content overwritten, but the filesystem being unable to commit that write
> because it needs more space?

Exactly. In this case, we've effectively corrupted the page cache.
We've copied data in that isn't (and never will be) reflected on disk.
It isn't a horribly difficult case, we just need to overwrite the data
with zeros, making sure to only overwrite the data corresponding to the
-ENOSPC error.

-chris


2004-09-04 14:33:37

by Andrey Savochkin

[permalink] [raw]
Subject: Re: EXT3: problem with copy_from_user inside a transaction

On Sat, Sep 04, 2004 at 03:47:44AM -0400, Chris Mason wrote:
> On Fri, 2004-09-03 at 09:57, Andrey Savochkin wrote:
>
> > > This would mean that all the work is done during the commit_write
> > > stage. The trick is that we would have to handle -ENOSPC since we might
> > > not know we've run out of room until after the data has been copied from
> > > userland.
> >
> > What is the problem -ENOSPC?
> > Do you think about the problem of the page existing before this write, it's
> > content overwritten, but the filesystem being unable to commit that write
> > because it needs more space?
>
> Exactly. In this case, we've effectively corrupted the page cache.
> We've copied data in that isn't (and never will be) reflected on disk.
> It isn't a horribly difficult case, we just need to overwrite the data
> with zeros, making sure to only overwrite the data corresponding to the
> -ENOSPC error.

Zeroing not mapped buffers in case of error is not difficult, indeed.

I was speaking about the following case:
- one page of a file is dirtied through a userspace mapping,
- the page doesn't have disk mapping yet (a hole),
- someone issues write() to this page,
- the space allocation fails in commit_write(), when the page content has
already been overwritten with the new data.

Any holes in this scenario? :)

How to handle -ENOSPC in this case?

Andrey

2004-09-04 20:12:12

by Andrey Savochkin

[permalink] [raw]
Subject: Re: EXT3: problem with copy_from_user inside a transaction

On Sat, Sep 04, 2004 at 06:33:33PM +0400, Andrey Savochkin wrote:
> On Sat, Sep 04, 2004 at 03:47:44AM -0400, Chris Mason wrote:
> > On Fri, 2004-09-03 at 09:57, Andrey Savochkin wrote:
> > > What is the problem -ENOSPC?
> > > Do you think about the problem of the page existing before this write, it's
> > > content overwritten, but the filesystem being unable to commit that write
> > > because it needs more space?
> >
> > Exactly. In this case, we've effectively corrupted the page cache.
> > We've copied data in that isn't (and never will be) reflected on disk.
> > It isn't a horribly difficult case, we just need to overwrite the data
> > with zeros, making sure to only overwrite the data corresponding to the
> > -ENOSPC error.
>
> Zeroing not mapped buffers in case of error is not difficult, indeed.
>
> I was speaking about the following case:
> - one page of a file is dirtied through a userspace mapping,
> - the page doesn't have disk mapping yet (a hole),
> - someone issues write() to this page,
> - the space allocation fails in commit_write(), when the page content has
> already been overwritten with the new data.

After some thought, we can check if the page is not completely mapped to disk
and there is a possibility that it's referenced from pte's.
In this case we can "commit" the old content in prepare_write(), allocating
space and instantiating holes...

Any better ideas?

Andrey

2004-09-07 11:56:03

by Andrey Savochkin

[permalink] [raw]
Subject: [RFC][PATCH] EXT3: problem with copy_from_user inside a transaction

On Fri, Sep 03, 2004 at 08:06:20AM -0400, Chris Mason wrote:
> On Fri, 2004-09-03 at 08:35, Andrea Arcangeli wrote:
> > On Fri, Sep 03, 2004 at 03:05:21PM +0400, Andrey Savochkin wrote:
> > > Hi Andrew,
> > >
> > > filemap_copy_from_user() between prepare_write() and commit_write()
> > > appears to be a problem for ext3.
> > >
> And reiserv3, and maybe the other journaled filesystems.
>
> > yes, Chris is working on it for a few months.
> >
> Working is a generous term, I've somewhat been waiting for a better
> solution to pop into my head. In the end, I think all we can do is not
> allow filesystems to take locks (or implicit locks like starting a
> transaction) inside the prepare_write call.
>
> This would mean that all the work is done during the commit_write
> stage. The trick is that we would have to handle -ENOSPC since we might
> not know we've run out of room until after the data has been copied from
> userland.
>
> prepare_write could reserve blocks, which brings us half way to a
> generic delayed allocation layer. But for a quick and dirty start,
> doing it all in commit_write should work.

Ok, so here is the patch moving journal_start() together with space
allocation from ext3_prepare_write() to commit_write().

ENOSPC is handled in ext3_map_write() called from commit_write().
In ENOSPC or other error case, the data copied from the userspace is zeroed,
but only if the buffers and the whole page are not up to date.
Answering my previous questions, if
- the inode page is modified through mmap and then by write,
- the file has holes and
- there is no space on disk,
the page cache will have the new content (provided by write) not written to
disk. Similarly to modifications through pure mmap.

It's interesting that block_prepare_write() zeroes the page content in case
of error even if the page has been modified through mmap()...

Comments?

Andrey

Signed-off-by: Andrey Savochkin <[email protected]>

===== fs/buffer.c 1.255 vs edited =====
--- 1.255/fs/buffer.c 2004-08-27 10:31:38 +04:00
+++ edited/fs/buffer.c 2004-09-06 14:57:01 +04:00
@@ -2025,8 +2025,9 @@ static int __block_prepare_write(struct
goto out;
if (buffer_new(bh)) {
clear_buffer_new(bh);
- unmap_underlying_metadata(bh->b_bdev,
- bh->b_blocknr);
+ if (buffer_mapped(bh))
+ unmap_underlying_metadata(bh->b_bdev,
+ bh->b_blocknr);
if (PageUptodate(page)) {
set_buffer_uptodate(bh);
continue;
===== fs/ext3/inode.c 1.101 vs edited =====
--- 1.101/fs/ext3/inode.c 2004-08-27 10:31:38 +04:00
+++ edited/fs/ext3/inode.c 2004-09-07 13:00:36 +04:00
@@ -782,6 +782,7 @@ reread:
if (!partial) {
clear_buffer_new(bh_result);
got_it:
+ clear_buffer_delay(bh_result);
map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
if (boundary)
set_buffer_boundary(bh_result);
@@ -1065,11 +1066,13 @@ static int walk_page_buffers( handle_t *
* and the commit_write(). So doing the journal_start at the start of
* prepare_write() is the right place.
*
- * Also, this function can nest inside ext3_writepage() ->
- * block_write_full_page(). In that case, we *know* that ext3_writepage()
- * has generated enough buffer credits to do the whole page. So we won't
- * block on the journal in that case, which is good, because the caller may
- * be PF_MEMALLOC.
+ * [2004/09/04 SAW] journal_start() in prepare_write() causes different ranking
+ * violations if copy_from_user() triggers a page fault (mmap_sem, may be page
+ * lock, plus __GFP_FS allocations).
+ * Now we read in not up-to-date buffers in prepare_write(), and do the rest
+ * including hole instantiation and inode extension in commit_write().
+ *
+ * Other notes.
*
* By accident, ext3 can be reentered when a transaction is open via
* quota file writes. If we were to commit the transaction while thus
@@ -1084,6 +1087,66 @@ static int walk_page_buffers( handle_t *
* write.
*/

+static int ext3_get_block_delay(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh, int create)
+{
+ int ret;
+
+ ret = ext3_get_block_handle(NULL, inode, iblock, bh, 0, 0);
+ if (ret)
+ return ret;
+ if (!buffer_mapped(bh)) {
+ set_buffer_delay(bh);
+ set_buffer_new(bh);
+ }
+ return ret;
+}
+
+static int ext3_get_block_delay_uptodate(handle_t *handle,
+ struct buffer_head *bh)
+{
+ struct page *page;
+ struct inode *inode;
+ sector_t block;
+ int ret;
+
+ page = bh->b_page;
+ inode = page->mapping->host;
+ block = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ ret = ext3_get_block_handle(NULL, inode, block, bh, 0, 0);
+ if (ret)
+ return ret;
+ if (!buffer_uptodate(bh))
+ set_buffer_uptodate(bh); /* PageUptodate */
+ if (!buffer_mapped(bh)) {
+ set_buffer_delay(bh);
+ set_buffer_new(bh);
+ } else {
+ unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
+ }
+ return ret;
+}
+
+static int ext3_prepare_write(struct file *file, struct page *page,
+ unsigned from, unsigned to)
+{
+ int ret;
+
+ if (PageUptodate(page)) {
+ /* simplified version of block_prepare_write */
+ struct inode *inode = page->mapping->host;
+ if (!page_has_buffers(page))
+ create_empty_buffers(page, 1 << inode->i_blkbits, 0);
+ ret = walk_page_buffers(NULL, page_buffers(page),
+ from, to, NULL, ext3_get_block_delay_uptodate);
+ if (ret) /* XXX: really should do this? */
+ ClearPageUptodate(page);
+ } else
+ ret = block_prepare_write(page, from, to,
+ ext3_get_block_delay);
+ return ret;
+}
+
static int do_journal_get_write_access(handle_t *handle,
struct buffer_head *bh)
{
@@ -1092,8 +1155,52 @@ static int do_journal_get_write_access(h
return ext3_journal_get_write_access(handle, bh);
}

-static int ext3_prepare_write(struct file *file, struct page *page,
- unsigned from, unsigned to)
+/*
+ * This function zeroes buffers not mapped to disk.
+ * We do it similarly to the error path in __block_prepare_write() to avoid
+ * keeping garbage in the page cache.
+ * Here we check BH_delay state. We know that if the buffer appears
+ * !buffer_mapped then
+ * - it was !buffer_mapped at the moment of ext3_prepare_write, and
+ * - ext3_get_block failed to map this buffer (e.g., ENOSPC).
+ * If this !mapped buffer is not up to date (it can be up to date if
+ * PageUptodate), then we zero its content.
+ */
+static void ext3_clear_delayed_buffers(struct page *page,
+ unsigned from, unsigned to)
+{
+ struct buffer_head *bh, *head, *next;
+ unsigned block_start, block_end;
+ unsigned blocksize;
+ void *kaddr;
+
+ head = page_buffers(page);
+ blocksize = head->b_size;
+ for ( bh = head, block_start = 0;
+ bh != head || !block_start;
+ block_start = block_end, bh = next)
+ {
+ next = bh->b_this_page;
+ block_end = block_start + blocksize;
+ if (block_end <= from || block_start >= to)
+ continue;
+ if (!buffer_delay(bh))
+ continue;
+ J_ASSERT_BH(bh, !buffer_mapped(bh));
+ clear_buffer_new(bh);
+ clear_buffer_delay(bh);
+ if (!buffer_uptodate(bh)) {
+ kaddr = kmap_atomic(page, KM_USER0);
+ memset(kaddr + block_start, 0, bh->b_size);
+ kunmap_atomic(kaddr, KM_USER0);
+ set_buffer_uptodate(bh);
+ mark_buffer_dirty(bh);
+ }
+ }
+}
+
+static int ext3_map_write(struct file *file, struct page *page,
+ unsigned from, unsigned to)
{
struct inode *inode = page->mapping->host;
int ret, needed_blocks = ext3_writepage_trans_blocks(inode);
@@ -1106,19 +1213,19 @@ retry:
ret = PTR_ERR(handle);
goto out;
}
- ret = block_prepare_write(page, from, to, ext3_get_block);
- if (ret)
- goto prepare_write_failed;

- if (ext3_should_journal_data(inode)) {
+ ret = block_prepare_write(page, from, to, ext3_get_block);
+ if (!ret && ext3_should_journal_data(inode)) {
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, do_journal_get_write_access);
}
-prepare_write_failed:
- if (ret)
- ext3_journal_stop(handle);
+ if (!ret)
+ goto out;
+
+ ext3_journal_stop(handle);
if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
goto retry;
+ ext3_clear_delayed_buffers(page, from, to);
out:
return ret;
}
@@ -1153,10 +1260,15 @@ static int commit_write_fn(handle_t *han
static int ext3_ordered_commit_write(struct file *file, struct page *page,
unsigned from, unsigned to)
{
- handle_t *handle = ext3_journal_current_handle();
+ handle_t *handle;
struct inode *inode = page->mapping->host;
int ret = 0, ret2;

+ ret = ext3_map_write(file, page, from, to);
+ if (ret)
+ return ret;
+ handle = ext3_journal_current_handle();
+
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, ext3_journal_dirty_data);

@@ -1182,11 +1294,15 @@ static int ext3_ordered_commit_write(str
static int ext3_writeback_commit_write(struct file *file, struct page *page,
unsigned from, unsigned to)
{
- handle_t *handle = ext3_journal_current_handle();
+ handle_t *handle;
struct inode *inode = page->mapping->host;
int ret = 0, ret2;
loff_t new_i_size;

+ ret = ext3_map_write(file, page, from, to);
+ if (ret)
+ return ret;
+ handle = ext3_journal_current_handle();
new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
if (new_i_size > EXT3_I(inode)->i_disksize)
EXT3_I(inode)->i_disksize = new_i_size;
@@ -1200,11 +1316,16 @@ static int ext3_writeback_commit_write(s
static int ext3_journalled_commit_write(struct file *file,
struct page *page, unsigned from, unsigned to)
{
- handle_t *handle = ext3_journal_current_handle();
+ handle_t *handle;
struct inode *inode = page->mapping->host;
int ret = 0, ret2;
int partial = 0;
loff_t pos;
+
+ ret = ext3_map_write(file, page, from, to);
+ if (ret)
+ return ret;
+ handle = ext3_journal_current_handle();

/*
* Here we duplicate the generic_commit_write() functionality
===== fs/jbd/transaction.c 1.87 vs edited =====
--- 1.87/fs/jbd/transaction.c 2004-08-07 21:59:49 +04:00
+++ edited/fs/jbd/transaction.c 2004-09-04 16:17:15 +04:00
@@ -1870,6 +1870,7 @@ zap_buffer_unlocked:
clear_buffer_mapped(bh);
clear_buffer_req(bh);
clear_buffer_new(bh);
+ clear_buffer_delay(bh);
bh->b_bdev = NULL;
return may_free;
}