LinuxLists.cc - Re: splice(SPLICE_F

2006-05-01 06:53:34

Subject: Re: splice(SPLICE_F_MOVE) problems

On Mon, May 01 2006, Oleg Nesterov wrote:
> I noticed sys_splice() and friends were added. Cool!
> But I can't understand how SPLICE_F_MOVE is supposed to
> work.
>
> pipe_to_file:
>
> if (sd->flags & SPLICE_F_MOVE) {
>
> if (buf->ops->steal(info, buf))
> goto find_page;
>
> Let's suppose that buf->ops == page_cache_pipe_buf_ops.
> page_cache_pipe_buf_steal() returns PG_locked page, why?

Seems silly to unlock the page, when add_to_page_cache() will set it
locked the instant after again.

>
> page = buf->page;
> if (add_to_page_cache(page, mapping, index, gfp_mask))
>
> This adds entire page to page cache. What about partial pages?
> This can corrupt sd->file if offset != 0 || this_len != PAGE_SIZE.

Yeah that's silly, I'll add a check for sd->len == PAGE_CACHE_SIZE!

> goto find_page;
>
> Ok, add_to_page_cache() failed. 'page' is still locked.
> It will be released later, this should trigger bad_page().

Indeed, that was fixed and pushed to the splice repo yesterday actually
along with a fix for a missing lock in the pipe variant.

> Also, we don't clear PIPE_BUF_FLAG_STOLEN, so we will miss
> the data copying and page_cache_release(page) below:
>
> if (!(buf->flags & PIPE_BUF_FLAG_STOLEN)) {
> char *dst = kmap_atomic(page, KM_USER0);
>
> memcpy(dst + offset, src + buf->offset, this_len);
> flush_dcache_page(page);
> kunmap_atomic(dst, KM_USER0);
> }
>
> I can't understand why do we need PIPE_BUF_FLAG_STOLEN at all.
> It seems to me we need a local boolean in pipe_to_file.

PIPE_BUF_FLAG_STOLEN used to be used in the release function as well,
hence the flag. It could be removed now, yes. I'll make sure to clear
the flag as well on add_to_page_cache() failure.

> I downloaded splice-git-20060430152503.tar.gz, but was unable
> to demonstrate these problems until I found that this definition
>
> static inline int splice(int fdin, loff_t *off_in, int fdout, loff_t *off_out,
> size_t len, unsigned long flags)
> {
> return syscall(__NR_splice, fdin, off_in, fdout, off_out, len, flags);
> }
>
> is not correct. At least on i386 you need _syscall6() here.

I guess I might as well use the right syscallX, but so far things have
worked fine for me on ia64/x86/x86_64. I'll update it.

--
Jens Axboe

2006-05-01 15:06:26

by Oleg Nesterov

[permalink] [raw]

Subject: Re: splice(SPLICE_F_MOVE) problems

On 05/01, Jens Axboe wrote:
>
> On Mon, May 01 2006, Oleg Nesterov wrote:
> >
> > I can't understand why do we need PIPE_BUF_FLAG_STOLEN at all.
> > It seems to me we need a local boolean in pipe_to_file.
>
> PIPE_BUF_FLAG_STOLEN used to be used in the release function as well,
> hence the flag.

Ok, but in that case

> I'll make sure to clear
> the flag as well on add_to_page_cache() failure.

... it is not good to clear it in pipe_to_file(). The page remains
stolen from pipe_buf_operations pov, this flag imho should be private
to buf, and page_cache_pipe_buf_ops doesn't need it.

I think pipe_to_buf() can test 'buf->page == page' instead of
PIPE_BUF_FLAG_STOLEN.

Another question,

__generic_file_splice_read:

/*
* Initiate read-ahead on this page range. however, don't call into
* read-ahead if this is a non-zero offset (we are likely doing small
* chunk splice and the page is already there) for a single page.
*/
if (!loff || nr_pages > 1)
page_cache_readahead(mapping, &in->f_ra, in, index, nr_pages);

Why this check? page_cache_readahead() should detect sub-page
reads correctly.

page = find_get_page(mapping, index);
if (!page) {
page = page_cache_alloc_cold();

add_to_page_cache_lru(page);

I think it makes sense to add handle_ra_miss() here. Otherwise,
for example, readahead could be disabled by RA_FLAG_INCACHE
forever.

If readahead doesn't work, SPLICE_F_MOVE is problematic too.
add_to_page_cache_lru()->lru_cache_add() first increments
page->count and adds this page to lru_add_pvecs. This means
page_cache_pipe_buf_steal()->remove_mapping() will probably
fail.

Oleg.

2006-05-01 17:41:14

by Jens Axboe

[permalink] [raw]

Subject: Re: splice(SPLICE_F_MOVE) problems

On Mon, May 01 2006, Oleg Nesterov wrote:
> On 05/01, Jens Axboe wrote:
> >
> > On Mon, May 01 2006, Oleg Nesterov wrote:
> > >
> > > I can't understand why do we need PIPE_BUF_FLAG_STOLEN at all.
> > > It seems to me we need a local boolean in pipe_to_file.
> >
> > PIPE_BUF_FLAG_STOLEN used to be used in the release function as well,
> > hence the flag.
>
> Ok, but in that case
>
> > I'll make sure to clear
> > the flag as well on add_to_page_cache() failure.
>
> ... it is not good to clear it in pipe_to_file(). The page remains
> stolen from pipe_buf_operations pov, this flag imho should be private
> to buf, and page_cache_pipe_buf_ops doesn't need it.
>
> I think pipe_to_buf() can test 'buf->page == page' instead of
> PIPE_BUF_FLAG_STOLEN.

I ended up fixing it with a local variable, but you are right it can be
killed with just a buf->page != page == stolen check. I got rid of the
last check of that, so just one remaining. Will commit this change.

> Another question,
>
> __generic_file_splice_read:
>
> /*
> * Initiate read-ahead on this page range. however, don't call into
> * read-ahead if this is a non-zero offset (we are likely doing small
> * chunk splice and the page is already there) for a single page.
> */
> if (!loff || nr_pages > 1)
> page_cache_readahead(mapping, &in->f_ra, in, index, nr_pages);
>
> Why this check? page_cache_readahead() should detect sub-page
> reads correctly.

Leftover from do_page_cache_readahead I suppose. I probably shouldn't
try to second guess read-ahead, however.

> page = find_get_page(mapping, index);
> if (!page) {
> page = page_cache_alloc_cold();
>
> add_to_page_cache_lru(page);
>
> I think it makes sense to add handle_ra_miss() here. Otherwise,
> for example, readahead could be disabled by RA_FLAG_INCACHE
> forever.

Good point, added.

> If readahead doesn't work, SPLICE_F_MOVE is problematic too.
> add_to_page_cache_lru()->lru_cache_add() first increments
> page->count and adds this page to lru_add_pvecs. This means
> page_cache_pipe_buf_steal()->remove_mapping() will probably
> fail.

Because of the temporarily elevated page count?

--
Jens Axboe

2006-05-01 20:11:20

by Oleg Nesterov

[permalink] [raw]

Subject: Re: splice(SPLICE_F_MOVE) problems

On 05/01, Jens Axboe wrote:
>
> > If readahead doesn't work, SPLICE_F_MOVE is problematic too.
> > add_to_page_cache_lru()->lru_cache_add() first increments
> > page->count and adds this page to lru_add_pvecs. This means
> > page_cache_pipe_buf_steal()->remove_mapping() will probably
> > fail.
>
> Because of the temporarily elevated page count?

Yes.

On the other hand, if readahead doesn't work we already have a
bigger problem, and SPLICE_F_MOVE is not garanteed, so I think
this is very minor.

Oleg.

2006-05-02 05:28:07

by Jens Axboe

[permalink] [raw]

Subject: Re: splice(SPLICE_F_MOVE) problems

On Tue, May 02 2006, Oleg Nesterov wrote:
> On 05/01, Jens Axboe wrote:
> >
> > > If readahead doesn't work, SPLICE_F_MOVE is problematic too.
> > > add_to_page_cache_lru()->lru_cache_add() first increments
> > > page->count and adds this page to lru_add_pvecs. This means
> > > page_cache_pipe_buf_steal()->remove_mapping() will probably
> > > fail.
> >
> > Because of the temporarily elevated page count?
>
> Yes.
>
> On the other hand, if readahead doesn't work we already have a
> bigger problem, and SPLICE_F_MOVE is not garanteed, so I think
> this is very minor.

Yes, clearly readahead has to work as expected. I haven't noticed any
problems, even on half cached workloads. With your handle_ra_miss()
addition and possibly killing the redundant !offset || nr_pages check,
it should be fine.

--
Jens Axboe

2006-05-03 00:14:57

by Oleg Nesterov

[permalink] [raw]

Subject: Re: splice(SPLICE_F_MOVE) problems

I am looking at current fs/splice.c from
http://kernel.org/git/?p=linux/kernel/git/torvalds/....

pipe_to_file:

if ((sd->flags & SPLICE_F_MOVE) && this_len == PAGE_CACHE_SIZE) {
/*
* If steal succeeds, buf->page is now pruned from the vm
* side (page cache) and we can reuse it. The page will also
* be locked on successful return.
*/
if (buf->ops->steal(info, buf))
goto find_page;

page = buf->page;
page_cache_get(page);

Isn't it better to do this after successful successful add_to_page_cache() ?
This way you don't need to do page_cache_release() if add_to_page_cache fails.

/*
* page must be on the LRU for adding to the pagecache.

Is it true? (looking at add_to_page_cache_lru).

* Check this without grabbing the zone lock, if it isn't
* the do grab the zone lock, recheck, and add if necessary.
*/
if (!PageLRU(page)) {
struct zone *zone = page_zone(page);

spin_lock_irq(&zone->lru_lock);
if (!PageLRU(page)) {
SetPageLRU(page);
add_page_to_inactive_list(zone, page);
}
spin_unlock_irq(&zone->lru_lock);

Why not lru_cache_add() ?

if (add_to_page_cache(page, mapping, index, gfp_mask)) {
page_cache_release(page);
unlock_page(page);
goto find_page;

This leaves unmapped page on ->inactive_list. page_count(page) == 1. What if
shrink_inactive_list() finds this page, increments page->count, and calls
shrink_page_list()->remove_mapping() again? Hmm... So page_cache_pipe_buf_steal()
has a reason to return a locked page (I am not an expert, please correct me).

However, user_page_pipe_buf_steal() returns unlocked page in PIPE_BUF_FLAG_GIFT
case. So, if add_to_page_cache() fails, unlock_page() will trigger BUG().

ret = mapping->a_ops->prepare_write(file, page, offset, offset+this_len);
if (ret == AOP_TRUNCATED_PAGE) {
page_cache_release(page);
goto find_page;

We also need to unlock(page) if it was stealed.

Oleg.

2006-05-03 06:56:04

by Jens Axboe

[permalink] [raw]

Subject: Re: splice(SPLICE_F_MOVE) problems

On Wed, May 03 2006, Oleg Nesterov wrote:
> I am looking at current fs/splice.c from
> http://kernel.org/git/?p=linux/kernel/git/torvalds/....
>
>
> pipe_to_file:
>
> if ((sd->flags & SPLICE_F_MOVE) && this_len == PAGE_CACHE_SIZE) {
> /*
> * If steal succeeds, buf->page is now pruned from the vm
> * side (page cache) and we can reuse it. The page will also
> * be locked on successful return.
> */
> if (buf->ops->steal(info, buf))
> goto find_page;
>
> page = buf->page;
> page_cache_get(page);
>
> Isn't it better to do this after successful successful add_to_page_cache() ?
> This way you don't need to do page_cache_release() if add_to_page_cache fails.

It is, I dunno why I moved the ordering around when fixing that up. I'll
move it down a block and skip the early page_cache_get.

> /*
> * page must be on the LRU for adding to the pagecache.
>
> Is it true? (looking at add_to_page_cache_lru).

My wording is a little confusing probably, I didn't mean to imply any
ordering contraints there. Just simply state that a page cannot be in
the pagecache and not on the LRU.

> if (!PageLRU(page)) {
> struct zone *zone = page_zone(page);
>
> spin_lock_irq(&zone->lru_lock);
> if (!PageLRU(page)) {
> SetPageLRU(page);
> add_page_to_inactive_list(zone, page);
> }
> spin_unlock_irq(&zone->lru_lock);
>
> Why not lru_cache_add() ?

Because for calling lru_cache_add, I have to be absolutely certain that
the page isn't on the LRU already. So I'd have to grab the zone lru_lock
first anyways, and I cannot call lru_cache_add() with that held since
__pagevec_lru_add() also grabs it. Hence the "open coding".

> if (add_to_page_cache(page, mapping, index, gfp_mask)) {
> page_cache_release(page);
> unlock_page(page);
> goto find_page;
>
> This leaves unmapped page on ->inactive_list. page_count(page) == 1.
> What if shrink_inactive_list() finds this page, increments
> page->count, and calls shrink_page_list()->remove_mapping() again?
> Hmm... So page_cache_pipe_buf_steal() has a reason to return a locked
> page (I am not an expert, please correct me).

With the block moved above the LRU business, this disappears as well.

> However, user_page_pipe_buf_steal() returns unlocked page in
> PIPE_BUF_FLAG_GIFT case. So, if add_to_page_cache() fails,
> unlock_page() will trigger BUG().

It does, it calls generic_pipe_buf_steal() which locks it.

> ret = mapping->a_ops->prepare_write(file, page, offset, offset+this_len);
> if (ret == AOP_TRUNCATED_PAGE) {
> page_cache_release(page);
> goto find_page;
>
> We also need to unlock(page) if it was stealed.

Are you sure that's the right test? Don't you mean if ret !=
AOP_TRUNCATED_PAGE && ret?

How about the attached?

diff --git a/fs/splice.c b/fs/splice.c
index 7fb0497..9b10ec4 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -574,12 +574,17 @@ static int pipe_to_file(struct pipe_inod
goto find_page;

page = buf->page;
+ if (add_to_page_cache(page, mapping, index, gfp_mask)) {
+ unlock_page(page);
+ goto find_page;
+ }
+
page_cache_get(page);

/*
- * page must be on the LRU for adding to the pagecache.
- * Check this without grabbing the zone lock, if it isn't
- * the do grab the zone lock, recheck, and add if necessary.
+ * The page cannot reside in the pagecache and not be on
+ * the LRU list. It should be safe to check PageLRU()
+ * outside the zone lru_lock, as long as we recheck it inside.
*/
if (!PageLRU(page)) {
struct zone *zone = page_zone(page);
@@ -591,12 +596,6 @@ static int pipe_to_file(struct pipe_inod
}
spin_unlock_irq(&zone->lru_lock);
}
-
- if (add_to_page_cache(page, mapping, index, gfp_mask)) {
- page_cache_release(page);
- unlock_page(page);
- goto find_page;
- }
} else {
find_page:
page = find_lock_page(mapping, index);
@@ -647,11 +646,15 @@ find_page:
}

ret = mapping->a_ops->prepare_write(file, page, offset, offset+this_len);
- if (ret == AOP_TRUNCATED_PAGE) {
+ if (unlikely(ret)) {
+ if (ret != AOP_TRUNCATED_PAGE)
+ unlock_page(page);
page_cache_release(page);
- goto find_page;
- } else if (ret)
+ if (ret == AOP_TRUNCATED_PAGE)
+ goto find_page;
+
goto out;
+ }

if (buf->page != page) {
/*

--
Jens Axboe

2006-05-03 10:35:20

by Oleg Nesterov

[permalink] [raw]

Subject: Re: splice(SPLICE_F_MOVE) problems

On 05/03, Jens Axboe wrote:
>
> On Wed, May 03 2006, Oleg Nesterov wrote:
>
> > However, user_page_pipe_buf_steal() returns unlocked page in
> > PIPE_BUF_FLAG_GIFT case. So, if add_to_page_cache() fails,
> > unlock_page() will trigger BUG().
>
> It does, it calls generic_pipe_buf_steal() which locks it.

What I have in splice.c:

static int user_page_pipe_buf_steal(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
if (!(buf->flags & PIPE_BUF_FLAG_GIFT))
return 1;

return 0;
}

(I don't use git, reading Linus's tree via http).

> > ret = mapping->a_ops->prepare_write(file, page, offset, offset+this_len);
> > if (ret == AOP_TRUNCATED_PAGE) {
> > page_cache_release(page);
> > goto find_page;
> >
> > We also need to unlock(page) if it was stealed.
>
> Are you sure that's the right test? Don't you mean if ret !=
> AOP_TRUNCATED_PAGE && ret?
>
> How about the attached?

Ah, yes, you are right. Sorry for confusion.

Oleg.

2006-05-03 10:48:10

by Jens Axboe

[permalink] [raw]

Subject: Re: splice(SPLICE_F_MOVE) problems

On Wed, May 03 2006, Oleg Nesterov wrote:
> On 05/03, Jens Axboe wrote:
> >
> > On Wed, May 03 2006, Oleg Nesterov wrote:
> >
> > > However, user_page_pipe_buf_steal() returns unlocked page in
> > > PIPE_BUF_FLAG_GIFT case. So, if add_to_page_cache() fails,
> > > unlock_page() will trigger BUG().
> >
> > It does, it calls generic_pipe_buf_steal() which locks it.
>
> What I have in splice.c:
>
> static int user_page_pipe_buf_steal(struct pipe_inode_info *pipe,
> struct pipe_buffer *buf)
> {
> if (!(buf->flags & PIPE_BUF_FLAG_GIFT))
> return 1;
>
> return 0;
> }
>
> (I don't use git, reading Linus's tree via http).

Ah ok, it got fixed yesterday:

static int user_page_pipe_buf_steal(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
if (!(buf->flags & PIPE_BUF_FLAG_GIFT))
return 1;

return generic_pipe_buf_steal(pipe, buf);
}

> > > ret = mapping->a_ops->prepare_write(file, page, offset, offset+this_len);
> > > if (ret == AOP_TRUNCATED_PAGE) {
> > > page_cache_release(page);
> > > goto find_page;
> > >
> > > We also need to unlock(page) if it was stealed.
> >
> > Are you sure that's the right test? Don't you mean if ret !=
> > AOP_TRUNCATED_PAGE && ret?
> >
> > How about the attached?
>
> Ah, yes, you are right. Sorry for confusion.

Good, I already committed that variant of the fix :)

--
Jens Axboe