2007-05-16 10:19:44

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

David Chinner <[email protected]> wrote:

> + ret = block_prepare_write(page, 0, end, get_block);

As I understand the way prepare_write() works, this is incorrect.

The start and end points passed to block_prepare_write() delimit the region of
the page that is going to be modified. This means that prepare_write()
doesn't need to fill it in if the page is not up to date. It does, however,
need to fill in the region before (if present) and the region after (if
present). Look at it like this:

+---------------+
| |
| | <-- Filled in by prepare_write()
| |
to-> |:::::::::::::::|
| |
| | <-- Filled in by caller
| |
offset->|:::::::::::::::|
| |
| | <-- Filled in by prepare_write()
| |
page-> +---------------+

However, page_mkwrite() isn't told which bit of the page is going to be
written to. This means it has to ask prepare_write() to make sure the whole
page is filled in. In other words, offset and to must be equal (in AFS I set
them both to 0).

With what you've got, if, say, 'offset' is 0 and 'to' is calculated at
PAGE_SIZE, then if the page is not up to date for any reason, then none of the
page will be updated before the page is written on by the faulting code.

You probably get away with this in a blockdev-based filesystem because it's
unlikely that the page will cease to be up to date.

However, if someone adds a syscall to punch holes in files, this may change...

David


2007-05-16 12:00:36

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

David Howells wrote:
> David Chinner <[email protected]> wrote:
>
>
>>+ ret = block_prepare_write(page, 0, end, get_block);
>
>
> As I understand the way prepare_write() works, this is incorrect.

I think it is actually OK.


> The start and end points passed to block_prepare_write() delimit the region of
> the page that is going to be modified. This means that prepare_write()
> doesn't need to fill it in if the page is not up to date. It does, however,
> need to fill in the region before (if present) and the region after (if
> present). Look at it like this:
>
> +---------------+
> | |
> | | <-- Filled in by prepare_write()
> | |
> to-> |:::::::::::::::|
> | |
> | | <-- Filled in by caller
> | |
> offset->|:::::::::::::::|
> | |
> | | <-- Filled in by prepare_write()
> | |
> page-> +---------------+
>
> However, page_mkwrite() isn't told which bit of the page is going to be
> written to. This means it has to ask prepare_write() to make sure the whole
> page is filled in. In other words, offset and to must be equal (in AFS I set
> them both to 0).

Dave is using prepare_write here to ensure blocks are allocated in the
given range. The filesystem's ->nopage function must ensure it is uptodate
before allowing it to be mapped.


> With what you've got, if, say, 'offset' is 0 and 'to' is calculated at
> PAGE_SIZE, then if the page is not up to date for any reason, then none of the
> page will be updated before the page is written on by the faulting code.

Consider that the code currently works OK today _without_ page_mkwrite.
page_mkwrite is being added to do block allocation / reservation.


> You probably get away with this in a blockdev-based filesystem because it's
> unlikely that the page will cease to be up to date.
>
> However, if someone adds a syscall to punch holes in files, this may change...

We have one. Strangely enough, it is done with madvise(MADV_REMOVE).

--
SUSE Labs, Novell Inc.

2007-05-16 12:09:38

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

On Wed, 2007-05-16 at 11:19 +0100, David Howells wrote:
> The start and end points passed to block_prepare_write() delimit the region of
> the page that is going to be modified. This means that prepare_write()
> doesn't need to fill it in if the page is not up to date.

Really? Is it _really_ going to be modified? Even if the pointer
userspace gave to write() is bogus, and is going to fault half-way
through the copy_from_user()?

--
dwmw2

2007-05-16 12:55:46

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

On Wed, May 16, 2007 at 08:09:19PM +0800, David Woodhouse wrote:
> On Wed, 2007-05-16 at 11:19 +0100, David Howells wrote:
> > The start and end points passed to block_prepare_write() delimit the region of
> > the page that is going to be modified. This means that prepare_write()
> > doesn't need to fill it in if the page is not up to date.
>
> Really? Is it _really_ going to be modified? Even if the pointer
> userspace gave to write() is bogus, and is going to fault half-way
> through the copy_from_user()?

This is why there are so many variations on copy_from_user that zero on
faults. One way or another, the prepare_write/commit_write pair are
responsible for filling it in.

-chris

2007-05-16 13:04:28

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

Chris Mason wrote:
> On Wed, May 16, 2007 at 08:09:19PM +0800, David Woodhouse wrote:
>
>>On Wed, 2007-05-16 at 11:19 +0100, David Howells wrote:
>>
>>>The start and end points passed to block_prepare_write() delimit the region of
>>>the page that is going to be modified. This means that prepare_write()
>>>doesn't need to fill it in if the page is not up to date.
>>
>>Really? Is it _really_ going to be modified? Even if the pointer
>>userspace gave to write() is bogus, and is going to fault half-way
>>through the copy_from_user()?
>
>
> This is why there are so many variations on copy_from_user that zero on
> faults. One way or another, the prepare_write/commit_write pair are
> responsible for filling it in.

I'll add to David's question about David's comment on David's patch, yes
it will be modified but in that case it would be zero-filled as Chris
says. However I believe this is incorrect behaviour.

It is possible to easily fix that so it would only happen via a tiny race
window (where the source memory gets unmapped at just the right time)
however nobody seemed to interested (just by checking the return value of
fault_in_pages_readable).

The buffered write patches I'm working on fix that (among other things) of
course. But they do away with prepare_write and introduce new aops, and
they indeed must not expect the full range to have been written to.

--
SUSE Labs, Novell Inc.

2007-05-16 13:14:15

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

On Wed, May 16, 2007 at 11:04:11PM +1000, Nick Piggin wrote:
> Chris Mason wrote:
> >On Wed, May 16, 2007 at 08:09:19PM +0800, David Woodhouse wrote:
> >
> >>On Wed, 2007-05-16 at 11:19 +0100, David Howells wrote:
> >>
> >>>The start and end points passed to block_prepare_write() delimit the
> >>>region of
> >>>the page that is going to be modified. This means that prepare_write()
> >>>doesn't need to fill it in if the page is not up to date.
> >>
> >>Really? Is it _really_ going to be modified? Even if the pointer
> >>userspace gave to write() is bogus, and is going to fault half-way
> >>through the copy_from_user()?
> >
> >
> >This is why there are so many variations on copy_from_user that zero on
> >faults. One way or another, the prepare_write/commit_write pair are
> >responsible for filling it in.
>
> I'll add to David's question about David's comment on David's patch, yes
> it will be modified but in that case it would be zero-filled as Chris
> says. However I believe this is incorrect behaviour.
>
> It is possible to easily fix that so it would only happen via a tiny race
> window (where the source memory gets unmapped at just the right time)
> however nobody seemed to interested (just by checking the return value of
> fault_in_pages_readable).
>
> The buffered write patches I'm working on fix that (among other things) of
> course. But they do away with prepare_write and introduce new aops, and
> they indeed must not expect the full range to have been written to.

I was also wrong to say prepare_write and commit_write are
responsible, they work together with their callers to make the right
things happen. Oh well, so much for trying to give a short answer for a
chunk of code full of corner cases ;)

-chris

2007-05-16 13:21:03

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

Nick Piggin <[email protected]> wrote:

> Dave is using prepare_write here to ensure blocks are allocated in the
> given range. The filesystem's ->nopage function must ensure it is uptodate
> before allowing it to be mapped.

Which is fine... assuming it's called. For blockdev-based filesystems, this
is probably true. But I'm not sure you can guarantee it.

I've seen Ext3, for example, unlocking a page that isn't yet uptodate.
nopage() won't get called on it again, but prepare_write() might. I don't
know why this happens, but it's something I've fallen over in doing
CacheFiles. When reading, readpage() is just called on it again and again
until it is up to date. When writing, prepare_write() is called correctly.

> Consider that the code currently works OK today _without_ page_mkwrite.
> page_mkwrite is being added to do block allocation / reservation.

Which doesn't prove anything. All it means is that PG_uptodate being unset is
handled elsewhere.

David

2007-05-16 13:25:31

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

David Woodhouse <[email protected]> wrote:

> Really? Is it _really_ going to be modified?

Well, generic_file_buffered_write() doesn't check the success of the copy
before calling commit_write(), presumably because it uses
fault_in_pages_readable() first.

David

2007-05-16 13:42:03

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

David Howells wrote:
> Nick Piggin <[email protected]> wrote:
>
>
>>Dave is using prepare_write here to ensure blocks are allocated in the
>>given range. The filesystem's ->nopage function must ensure it is uptodate
>>before allowing it to be mapped.
>
>
> Which is fine... assuming it's called. For blockdev-based filesystems, this
> is probably true. But I'm not sure you can guarantee it.
>
> I've seen Ext3, for example, unlocking a page that isn't yet uptodate.
> nopage() won't get called on it again, but prepare_write() might. I don't
> know why this happens, but it's something I've fallen over in doing
> CacheFiles. When reading, readpage() is just called on it again and again
> until it is up to date. When writing, prepare_write() is called correctly.

There are bugs in the core VM and block filesystem code where !uptodate pages
are left in pagetables. Some of these are fixed in -mm.

But they aren't a good reason to invent completely different ways to do things.


>>Consider that the code currently works OK today _without_ page_mkwrite.
>>page_mkwrite is being added to do block allocation / reservation.
>
>
> Which doesn't prove anything. All it means is that PG_uptodate being unset is
> handled elsewhere.

It means that Dave's page_mkwrite function will do the block allocation
and everything else continues as it is. Your suggested change to pass in
offset == to is just completely wrong for this.

PG_uptodate being unset should be done via pagecache invalidation or truncation
APIs, which (sometimes... modulo bugs) tear down pagetables first.

--
SUSE Labs, Novell Inc.

2007-05-16 23:29:19

by David Chinner

[permalink] [raw]
Subject: Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

On Wed, May 16, 2007 at 11:19:29AM +0100, David Howells wrote:
>
> However, page_mkwrite() isn't told which bit of the page is going to be
> written to. This means it has to ask prepare_write() to make sure the whole
> page is filled in. In other words, offset and to must be equal (in AFS I set
> them both to 0).

The assumption is the page is already up to date and we are writing
the whole page unless EOF lands inside the page. AFAICT, we can't
get called with a page that is not uptodate and so page filling is
not something we should be doing (or want to be doing) here. All we
want to do is to be able to change the mapping from a read to a
write mapping (e.g. a read mapping of a hole needs to be changed on
write) and do the relevant space reservation/allocation and buffer
mapping needed for this change.

> However, if someone adds a syscall to punch holes in files, this may change...

We already have them - ioctl(XFS_IOC_UNRESVSP) and
madvise(MADV_REMOVE) - and another - fallocate(FA_DEALLOCATE) - is
on it's way. Racing with truncates should already be handled by the
truncate code (i.e. partial page truncation does the zero filling).

/me makes note to implement ->truncate_range() in XFS for MADV_REMOVE.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group