--- linux-2.6-tmp/mm/msync.c.=K0000=.orig
+++ linux-2.6-tmp/mm/msync.c
@@ -127,13 +127,10 @@ static int filemap_sync(struct vm_area_s
/*
* MS_SYNC syncs the entire file - including mappings.
*
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67). Instead, it just
- * marks the relevant pages dirty. The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
- * So my _not_ starting I/O in MS_ASYNC we provide complete flexibility to
- * applications.
+ * MS_ASYNC once again starts I/O (it did not between 2.5.68 and 2.6.4.)
+ * SingleUnix requires it. If an application wants to queue dirty pages
+ * for normal asychronous writeback, msync with flags==0 should achieve
+ * that on all kernels at least as far back as 2.4.
*/
static int msync_interval(struct vm_area_struct * vma,
unsigned long start, unsigned long end, int flags)
@@ -147,20 +144,22 @@ static int msync_interval(struct vm_area
if (file && (vma->vm_flags & VM_SHARED)) {
ret = filemap_sync(vma, start, end-start, flags);
- if (!ret && (flags & MS_SYNC)) {
+ if (!ret && (flags & (MS_SYNC|MS_ASYNC))) {
struct address_space *mapping = file->f_mapping;
int err;
down(&mapping->host->i_sem);
ret = filemap_fdatawrite(mapping);
- if (file->f_op && file->f_op->fsync) {
- err = file->f_op->fsync(file,file->f_dentry,1);
- if (err && !ret)
+ if (flags & MS_SYNC) {
+ if (file->f_op && file->f_op->fsync) {
+ err = file->f_op->fsync(file, file->f_dentry, 1);
+ if (err && !ret)
+ ret = err;
+ }
+ err = filemap_fdatawait(mapping);
+ if (!ret)
ret = err;
}
- err = filemap_fdatawait(mapping);
- if (!ret)
- ret = err;
up(&mapping->host->i_sem);
}
}
On Wed, 31 Mar 2004, Stephen C. Tweedie wrote:
>
> although I can't find an unambiguous definition of "queued for service"
> in the online standard. I'm reading it as requiring that the I/O has
> reached the block device layer, not simply that it has been marked dirty
> for some future writeback pass to catch; Uli agrees with that
> interpretation.
That interpretation makes pretty much zero sense.
If you care about the data hitting the disk, you have to use fsync() or
similar _anyway_, and pretending anything else is just bogus.
As such, just marking the pages dirty is as much of a "queuing" them for
write as actually writing them, since in both cases the guarantees are
_exactly_ the same: the pages have not hit the disk by the time the system
call returns, but will hit the disk at some time in the future.
Having the requirement that it is on some sw-only request queue is
nonsensical, since such a queue is totally invisible from a user
perspective.
User space has no idea about "block device layer" vs "VM layer" queues,
and trying to distinguish between the two is madness. It's just an
internal implementation issue that has no meaning to the user.
Linus
"Stephen C. Tweedie" <[email protected]> wrote:
>
> Hi,
>
> I've been looking at a discrepancy between msync() behaviour on 2.4.9
> and newer 2.4 kernels, and it looks like things changed again in
> 2.5.68. From the ChangeLog:
>
> ChangeSet 1.971.76.156 2003/04/09 11:31:36 [email protected]
> [PATCH] Make msync(MS_ASYNC) no longer start the I/O
>
> MS_ASYNC will currently wait on previously-submitted I/O, then start new I/O
> and not wait on it. This can cause undesirable blocking if msync is called
> rapidly against the same memory.
>
> So instead, change msync(MS_ASYNC) to not start any IO at all. Just flush
> the pte dirty bits into the pageframe and leave it at that.
>
> The IO _will_ happen within a kupdate period. And the application can use
> fsync() or fadvise(FADV_DONTNEED) if it actually wants to schedule the IO
> immediately.
>
> Unfortunately, this seems to contradict SingleUnix requirements, which
> state:
>
> When MS_ASYNC is specified, msync() shall return immediately
> once all the write operations are initiated or queued for
> servicing
>
> although I can't find an unambiguous definition of "queued for service"
> in the online standard. I'm reading it as requiring that the I/O has
> reached the block device layer, not simply that it has been marked dirty
> for some future writeback pass to catch; Uli agrees with that
> interpretation.
I don't think I agree with that. If "queued for service" means we've
started the I/O, then what does "initiated" mean, and why did they specify
"initiated" separately?
What triggered all this was a dinky little test app which Linus wrote to
time some aspect of P4 tlb writeback latency. It sits in a loop dirtying a
page then msyncing it with MS_ASYNC. It ran very poorly, because MS_ASYNC
ended up waiting on the previously-submitted I/O before starting new I/O.
One approach to improving that would be for MS_ASYNC to say "if the page is
already under writeout then just skip the I/O". But that's worthless,
really - it makes the MS_ASYNC semantics too vague.
As you point out, Linus's app should have used the "flags=0" linux
extension. Didn't think of that.
Your reversion patch would mean that current applications which use
MS_ASYNC will again suffer large latencies if the pages are under writeout.
Sure, users could switch apps to using flags=0 to avoid that, but people
don't know to do that.
So given that SUS is ambiguous about this, I'd suggest that we be able to
demonstrate some real-world reason why this matters. Why are you concerned
about this?
> The 2.5.68 changeset also includes the comment:
>
> (This has triggered an ext3 bug - the page's buffers get dirtied so fast
> that kjournald keeps writing the buffers over and over for 10-20 seconds
> before deciding to give up for some reason)
>
> Was that ever resolved? If it's still there, I should have a look at it
> if we're restoring the old trigger.
(These changelog thingies are useful, aren't they?)
I don't recall checking since that time. I expect that Linus's test app
will still livelock kjournald in the current -linus tree - kjournald sits
there trying to write out the dirty buffers but the dang things just keep
on getting dirtied.
If so, I'm sure this patch (queued for 2.6.6) will fix it:
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.5-rc3/2.6.5-rc3-mm3/broken-out/jbd-move-locked-buffers.patch
Hi,
On Wed, 2004-03-31 at 23:53, Andrew Morton wrote:
> "Stephen C. Tweedie" <[email protected]> wrote:
> > Unfortunately, this seems to contradict SingleUnix requirements, which
> > state:
> > When MS_ASYNC is specified, msync() shall return immediately
> > once all the write operations are initiated or queued for
> > servicing
> > although I can't find an unambiguous definition of "queued for service"
> > in the online standard. I'm reading it as requiring that the I/O has
> > reached the block device layer
> I don't think I agree with that. If "queued for service" means we've
> started the I/O, then what does "initiated" mean, and why did they specify
> "initiated" separately?
I'd interpret "initiated" as having reached hardware. "Queued for
service" is much more open to interpretation: Uli came up with "the data
must be actively put in a stage where I/O is initiated", which still
doesn't really address what sort of queueing is allowed.
> What triggered all this was a dinky little test app which Linus wrote to
> time some aspect of P4 tlb writeback latency. It sits in a loop dirtying a
> page then msyncing it with MS_ASYNC. It ran very poorly, because MS_ASYNC
> ended up waiting on the previously-submitted I/O before starting new I/O.
Sure. There are lots of ways an interface can be misused, though: you
only know if one use is valid or not once you've determined what the
_correct_ use is. I'm much more concerned about getting a correct
interpretation of the spec than of making IO fast for the sake of a
memory benchmark. :-)
> One approach to improving that would be for MS_ASYNC to say "if the page is
> already under writeout then just skip the I/O". But that's worthless,
> really - it makes the MS_ASYNC semantics too vague.
Agreed.
> Your reversion patch would mean that current applications which use
> MS_ASYNC will again suffer large latencies if the pages are under writeout.
Well, this whole issue came up precisely because somebody was seeing
exactly such a latency hit going from 2.4.9 to a later kernel. We've
not really been consistent about it in the past.
> Sure, users could switch apps to using flags=0 to avoid that, but people
> don't know to do that.
Exactly why we need documentation for that combination, whatever
happens.
> So given that SUS is ambiguous about this, I'd suggest that we be able to
> demonstrate some real-world reason why this matters. Why are you concerned
> about this?
Just for the reason you mentioned --- a real-world app (in-house, so
flags==0 is actually a valid solution for them) which was seeing
performance degradation when the "MS_ASYNC submits IO" was introduced in
the first place. But it was internally written, so I've no idea at all
whether or not the app was assuming one behaviour or the other on other
Unixen.
--Stephen
Hi,
On Wed, 2004-03-31 at 23:37, Linus Torvalds wrote:
> On Wed, 31 Mar 2004, Stephen C. Tweedie wrote:
> >
> > although I can't find an unambiguous definition of "queued for service"
> > in the online standard. I'm reading it as requiring that the I/O has
> > reached the block device layer, not simply that it has been marked dirty
> > for some future writeback pass to catch; Uli agrees with that
> > interpretation.
>
> That interpretation makes pretty much zero sense.
>
> If you care about the data hitting the disk, you have to use fsync() or
> similar _anyway_, and pretending anything else is just bogus.
You can make the same argument for either implementation of MS_ASYNC.
And there's at least one way in which the "submit IO now" version can be
used meaningfully --- if you've got several specific areas of data in
one or more mappings that need flushed to disk, you'd be able to
initiate IO with multiple MS_ASYNC calls and then wait for completion
with either MS_SYNC or fsync(). That gives you an interface that
corresponds somewhat with the region-based filemap_sync();
filemap_fdatawrite(); filemap_fdatawait() that the kernel itself uses.
> Having the requirement that it is on some sw-only request queue is
> nonsensical, since such a queue is totally invisible from a user
> perspective.
It's very much visible, just from a performance perspective, if you want
to support "kick off this IO, I'm going to wait for the completion
shortly." If that's the interpretation of MS_ASYNC, then the app is
basically saying it doesn't want the writeback mechanism to be idle
until the writes have completed, regardless of whether it's a block
device or an NFS file or whatever underneath.
But whether that's a legal use of MS_ASYNC really depends on what the
standard is requiring. I could be persuaded either way. Uli?
Does anyone know what other Unixen do here?
--Stephen
On Wed, 1 Apr 2004, Stephen C. Tweedie wrote:
>
> On Wed, 2004-03-31 at 23:37, Linus Torvalds wrote:
>
> > If you care about the data hitting the disk, you have to use fsync() or
> > similar _anyway_, and pretending anything else is just bogus.
>
> You can make the same argument for either implementation of MS_ASYNC.
Exactly.
Which is why I say that the implementation cannot matter, because user
space would be _buggy_ if it depended on some timing issue.
> And there's at least one way in which the "submit IO now" version can be
> used meaningfully --- if you've got several specific areas of data in
> one or more mappings that need flushed to disk, you'd be able to
> initiate IO with multiple MS_ASYNC calls and then wait for completion
> with either MS_SYNC or fsync().
Why wouldn't you be able to do that with the current one?
The advantage of the current MS_ASYNC is absolutely astoundingly HUGE:
because we don't wait for in-progress IO, it can be used to efficiently
synchronize multiple different areas, and then after that waiting for them
with _one_ single fsync().
In contrast, the "wait for queued IO" approach can't sanely do that,
exactly because it will wait in the middle, depending on other activity at
the same time. It will always have the worry that it happens to do the
msync() at the wrong time, and then wait synchronously when it shouldn't.
More importantly, the current behaviour makes certain patterns _possible_
that your suggested semantics simply cannot do efficiently. If we have
data records smaller than a page, and want to mark them dirty as they
happen, the current msync() allows that - it doesn't matter that another
datum was marked dirty just a moment ago. Then, you do one fsync() only
when you actually want to _commit_ a series of updates before you change
the index.
But if we want to have another flag, with MS_HALF_ASYNC, that's certainly
ok by me. I'm all for choice. It's just that I most definitely want the
choice of doing it the way we do it now, since I consider that to be the
_sane_ way.
> It's very much visible, just from a performance perspective, if you want
> to support "kick off this IO, I'm going to wait for the completion
> shortly."
That may well be worth a call of its own. It has nothing to do with memory
mapping, though - what you're really looking for is fasync().
And yes, I agree that _that_ would make sense. Having some primitives to
start writeout of an area of a file would likely be a good thing.
I'd be perfectly happy with a set of file cache control operations,
including
- start writeback in [a,b]
- wait for [a,b] stable
- and maybe "punch hole in [a,b]"
Then you could use these for write() in addition to mmap(), and you can
first mark multiple regions dirty, and then do a single wait (which is
clearly more efficient than synchronously waiting for multiple regions).
But none of these have anything to do with what SuS or any other standard
says about MS_ASYNC.
> But whether that's a legal use of MS_ASYNC really depends on what the
> standard is requiring. I could be persuaded either way. Uli?
My argument was that a standard CANNOT say anything one way or the other,
because the behaviour IS NOT USER-VISIBLE! A program fundamentally cannot
care, since the only issue is a pure implementation issue of "which queue"
the data got queued onto.
Bringing in a standards body is irrelevant. It's like trying to use the
bible to determine whether protons have a positive charge.
Linus
Linus Torvalds <[email protected]> wrote:
>
> I'd be perfectly happy with a set of file cache control operations,
> including
>
> - start writeback in [a,b]
> - wait for [a,b] stable
> - and maybe "punch hole in [a,b]"
Yup, there are a number of linux-specific fadvise() extensions we
can/should be adding, including "start writeback on this byte range for
flush" and "start writeback on this byte range for data integrity" and
"wait on writeback of this byte range".
Some of these are needed internally for the fs-AIO implementation, and also
for an O_SYNC which only writes the pages which the writer wrote. It's
pretty simple, and it'll be happening.
One wrinkle is that we'd need to add the start/end loff_t pair to the
a_ops->writepages() prototype. But instead I intend to put the start/end
info into struct writeback_control and pass it that way. It seems sleazy
at first but when you think about it, it isn't. It provides forward and
backward compatability, it recognises that it's just a hint and that
filesystems can legitimately sync the whole file and it produces
smaller+faster code.
We might need a wait_on_page_writeback_range() a_op though.
Hi,
On Thu, 2004-04-01 at 01:08, Linus Torvalds wrote:
> > You can make the same argument for either implementation of MS_ASYNC.
> Exactly.
> Which is why I say that the implementation cannot matter, because user
> space would be _buggy_ if it depended on some timing issue.
I see it purely as a performance issue. That's the context in which we
saw the initial complaint about the 2.4 behaviour change.
> > And there's at least one way in which the "submit IO now" version can be
> > used meaningfully --- if you've got several specific areas of data in
> > one or more mappings that need flushed to disk, you'd be able to
> > initiate IO with multiple MS_ASYNC calls and then wait for completion
> > with either MS_SYNC or fsync().
>
> Why wouldn't you be able to do that with the current one?
You can, but only across one fd.
A = mmap(..., a);
B = mmap(..., b);
msync(A, ..., MS_ASYNC);
msync(B, ..., MS_ASYNC);
fsync(a);
fsync(b);
has rather different performance characteristics according to which way
you go. Do deferred writeback and the two fsync()s do serialised IO,
with the fs idle in between. Submit the background IO immediately and
you avoid that.
Anyway, I just tried on a Solaris-2.8 box, and the results are rather
interesting. Doing a simple (touch-one-char, msync one page) loop on a
mmap'ed file on a local scsi disk, MS_ASYNC gives ~15000 msyncs per
second; MS_SYNC gives ~900. [A null getpid() loop gives about 250,000
loops a second.]
However, the "iostat" shows *exactly* the same disk throughput in each
case. MS_ASYNC is causing immediate IO kickoff, but shows only ~900 ios
per second, the same as the ios-per-second for MS_SYNC and the same as
the MS_SYNC loop frequency.
So it appears that on Solaris, MS_ASYNC is kicking off instant IO, but
is not waiting for existing IO to complete first. So if we have an IO
already in progress, then many msync calls end up queuing the *same*
subsequent IO, and once one new IO is queued, further MS_ASYNC msyncs
don't bother scheduling a new one (on the basis that the
already-scheduled one hasn't started yet so the new data is already
guaranteed to hit disk.)
So Solaris behaviour is indeed to begin IO as soon as possible on
MS_ASYNC, but they are doing it far more efficiently than our current
msync code can do.
> The advantage of the current MS_ASYNC is absolutely astoundingly HUGE:
> because we don't wait for in-progress IO, it can be used to efficiently
> synchronize multiple different areas, and then after that waiting for them
> with _one_ single fsync().
The Solaris one manages to preserve those properties while still
scheduling the IO "soon". I'm not sure how we could do that in the
current VFS, short of having a background thread scheduling deferred
writepage()s as soon as the existing page becomes unlocked.
> More importantly, the current behaviour makes certain patterns _possible_
> that your suggested semantics simply cannot do efficiently. If we have
> data records smaller than a page, and want to mark them dirty as they
> happen, the current msync() allows that - it doesn't matter that another
> datum was marked dirty just a moment ago. Then, you do one fsync() only
> when you actually want to _commit_ a series of updates before you change
> the index.
> But if we want to have another flag, with MS_HALF_ASYNC, that's certainly
> ok by me. I'm all for choice.
Yes, but we _used_ to have that choice --- call msync() with flags == 0,
and you'd get the deferred kupdated writeback; call it with MS_ASYNC and
you'd get instant IO kickoff; call it with MS_SYNC and you'd get
synchronous completion. But now we've lost the instant kickoff, async
completion option, and MS_ASYNC behaves just like flags==0.
So I'm all for adding the choice back, and *documenting* it so that
people know exactly what to expect in all three cases. Whether the
choice comes from an fadvise option or an msync() doesn't bother me that
much.
In that case, the decision about which version of the behaviour MS_ASYNC
should give is (as it should be) a matter of obeying the standard
correctly, and the other useful behaviours are preserved elsewhere.
Which brings us back to trying to interpret the vague standard. Both
Uli's interpretation and the Solaris implementation suggest that we need
to start the writepage sooner rather than later.
> > It's very much visible, just from a performance perspective, if you want
> > to support "kick off this IO, I'm going to wait for the completion
> > shortly."
>
> That may well be worth a call of its own. It has nothing to do with memory
> mapping, though - what you're really looking for is fasync().
Indeed. And msync(flags==0) remains as a way of synchronising mmaps
with the inode-dirty-list fasync writeback.
> And yes, I agree that _that_ would make sense. Having some primitives to
> start writeout of an area of a file would likely be a good thing.
>
> I'd be perfectly happy with a set of file cache control operations,
> including
>
> - start writeback in [a,b]
posix_fadvise() seems to do something a little like this already: the
FADV_DONTNEED handler tries
if (!bdi_write_congested(mapping->backing_dev_info))
filemap_flush(mapping);
before going into the invalidate_mapping_pages() call. Having that (a)
limited to the specific file range passed into the fadvise(), and (b)
available as a separate function independent of the DONTNEED page
invalidator, would seem like an entirely sensible extension.
The obvious implementations would be somewhat inefficient in some cases,
though --- currently __filemap_fdatawrite simply list_splice()s the
inode dirty list into the io list. Walking a long dirty list to flush
just a few pages from a narrow range could get slow, and walking the
radix tree would be inefficient if there are only a few dirty pages
hidden in a large cache of clean pages.
> My argument was that a standard CANNOT say anything one way or the other,
> because the behaviour IS NOT USER-VISIBLE!
Worse, it doesn't seem to be implemented consistently either. I've been
trying on a few other Unixen while writing this. First on a Tru64 box,
and it is _not_ kicking off any IO at all for MS_ASYNC, except for the
30-second regular sync. The same appears to be true on FreeBSD. And on
HP-UX, things go in the other direction: the performance of MS_ASYNC is
identical to MS_SYNC, both in terms of observed disk IO during the sync
and the overall rate of the msync loop.
So it appears we've got Unix precedent for pretty-much any reasonable
interpretation of MS_ASYNC that we want. Patch withdrawn!
--Stephen
On Thu, 1 Apr 2004, Stephen C. Tweedie wrote:
>
> So it appears that on Solaris, MS_ASYNC is kicking off instant IO, but
> is not waiting for existing IO to complete first.
A much more likely scenario is that Solaris is really doing the same
thing we are, but it _also_ ends up opportunistically trying to put the
resultant pages on the IO queues if possible (ie do a "write-ahead": start
writeout if that doesn't imply blocking).
We could probably do that too, it seems easy enough. A
"TestSetPageLocked()" along with setting the BIO_RW_AHEAD flag. The only
problem is that I don't think we really have support for doing write-ahead
(ie we clear the page "dirty" bit too early, so if the write gets
cancelled due to the IO queues being full, the dirty bit gets lost).
So we don't want to go there for now, but it's something to keep in mind,
perhaps.
> Worse, it doesn't seem to be implemented consistently either. I've been
> trying on a few other Unixen while writing this. First on a Tru64 box,
> and it is _not_ kicking off any IO at all for MS_ASYNC, except for the
> 30-second regular sync. The same appears to be true on FreeBSD. And on
> HP-UX, things go in the other direction: the performance of MS_ASYNC is
> identical to MS_SYNC, both in terms of observed disk IO during the sync
> and the overall rate of the msync loop.
If you check HP-UX, make sure it's a recent one. HPUX has historically
been just too broken for words when it comes to mmap() (ie some _really_
strange semantics, like not being able to unmap partial mappings etc).
Linus
Stephen C. Tweedie wrote:
> Yes, but we _used_ to have that choice --- call msync() with flags == 0,
> and you'd get the deferred kupdated writeback;
Is that not equivalent to MS_INVALIDATE? It seems to be equivalent in
2.6.4.
The code in 2.6.4 ignores MS_INVALIDATE except for trivial error
checks, so msync() with flags == MS_INVALIDATE has the same effect as
msync() with flags == 0.
Some documentation I'm looking at says MS_INVALIDATE updates the
mapped page to contain the current contents of the file. 2.6.4 seems
to do the reverse: update the file to contain the current content of
the mapped page. "man msync" agrees with the latter. (I can't
look at SUS right now).
On systems where the CPU caches are fully coherent, the only
difference is that the former is a no-op and the latter does the same
as the new behaviour of MS_ASYNC.
On systems where the CPU caches aren't coherent, some cache
synchronising or flushing operations are implied.
On either type of system, MS_INVALIDATE doesn't seem to be doing what
the documentation I'm looking at says it should do.
-- Jamie
Hi,
On Thu, 2004-04-01 at 17:02, Linus Torvalds wrote:
> > Worse, it doesn't seem to be implemented consistently either. I've been
> > trying on a few other Unixen while writing this. First on a Tru64 box,
> > and it is _not_ kicking off any IO at all for MS_ASYNC, except for the
> > 30-second regular sync. The same appears to be true on FreeBSD. And on
> > HP-UX, things go in the other direction: the performance of MS_ASYNC is
> > identical to MS_SYNC, both in terms of observed disk IO during the sync
> > and the overall rate of the msync loop.
>
> If you check HP-UX, make sure it's a recent one. HPUX has historically
> been just too broken for words when it comes to mmap() (ie some _really_
> strange semantics, like not being able to unmap partial mappings etc).
I'm not sure what counts as "recent" for that, but this was on HP-UX
11. That's the most recent I've got access to.
--Stephen
Hi,
On Thu, 2004-04-01 at 17:19, Jamie Lokier wrote:
> Some documentation I'm looking at says MS_INVALIDATE updates the
> mapped page to contain the current contents of the file. 2.6.4 seems
> to do the reverse: update the file to contain the current content of
> the mapped page. "man msync" agrees with the the latter. (I can't
> look at SUS right now).
btw, just looking at the filemap_sync_pte() code for MS_INVALIDATE, I
noticed
if (!PageReserved(page) &&
(ptep_clear_flush_dirty(vma, address, ptep) ||
page_test_and_clear_dirty(page)))
set_page_dirty(page);
I just happened to follow the function and noticed that on s390,
page_test_and_clear_dirty() has the comment:
* Test and clear dirty bit in storage key.
* We can't clear the changed bit atomically. This is a potential
* race against modification of the referenced bit. This function
* should therefore only be called if it is not mapped in any
* address space.
but in this case the page is clearly mapped in the caller's address
space, else we wouldn't have reached this.
Is this a problem?
--Stephen
Hi,
On Thu, 2004-04-01 at 17:19, Jamie Lokier wrote:
> Stephen C. Tweedie wrote:
> > Yes, but we _used_ to have that choice --- call msync() with flags == 0,
> > and you'd get the deferred kupdated writeback;
>
> Is that not equivalent to MS_INVALIDATE? It seems to be equivalent in
> 2.6.4.
It is in all the kernels I've looked at, but that's mainly because we
seem to ignore MS_INVALIDATE.
> Some documentation I'm looking at says MS_INVALIDATE updates the
> mapped page to contain the current contents of the file. 2.6.4 seems
> to do the reverse: update the file to contain the current content of
> the mapped page. "man msync" agrees with the latter. (I can't
> look at SUS right now).
SUSv3 says
When MS_INVALIDATE is specified, msync() shall invalidate all
cached copies of mapped data that are inconsistent with the
permanent storage locations such that subsequent references
shall obtain data that was consistent with the permanent storage
locations sometime between the call to msync() and the first
subsequent memory reference to the data.
which seems to imply that dirty ptes should simply be cleared, rather
than propagated to the page dirty bits.
That's easy enough --- we already propagate the flags down to
filemap_sync_pte, where the page and pte dirty bits are modified. Does
anyone know any reason why we don't do MS_INVALIDATE there already?
--Stephen
"Stephen C. Tweedie" <[email protected]> wrote:
>
> > The advantage of the current MS_ASYNC is absolutely astoundingly HUGE:
> > because we don't wait for in-progress IO, it can be used to efficiently
> > synchronize multiple different areas, and then after that waiting for them
> > with _one_ single fsync().
>
> The Solaris one manages to preserve those properties while still
> scheduling the IO "soon". I'm not sure how we could do that in the
> current VFS, short of having a background thread scheduling deferred
> writepage()s as soon as the existing page becomes unlocked.
filemap_flush() will do exactly this. So if you want the Solaris
semantics, calling filemap_flush() instead of filemap_fdatawrite() should do
it.
> posix_fadvise() seems to do something a little like this already: the
> FADV_DONTNEED handler tries
>
> if (!bdi_write_congested(mapping->backing_dev_info))
> filemap_flush(mapping);
>
> before going into the invalidate_mapping_pages() call. Having that (a)
> limited to the specific file range passed into the fadvise(), and (b)
> available as a separate function independent of the DONTNEED page
> invalidator, would seem like an entirely sensible extension.
>
> The obvious implementations would be somewhat inefficient in some cases,
> though --- currently __filemap_fdatawrite simply list_splice()s the
> inode dirty list into the io list. Walking a long dirty list to flush
> just a few pages from a narrow range could get slow, and walking the
> radix tree would be inefficient if there are only a few dirty pages
> hidden in a large cache of clean pages.
The patches I have queued in -mm allow us to do this. We use
find_get_pages_tag() to iterate over only the dirty pages in the tree.
That still has the efficiency problem that when searching for dirty pages
we also visit pages which are both dirty and under writeback (we're not
interested in those pages if it is a non-blocking flush), although I've
only observed that to be a problem when the queue size was bumped up to
10,000 requests and I fixed that up for the common cases by other means.
Stephen C. Tweedie wrote:
> I've been looking at a discrepancy between msync() behaviour on 2.4.9
> and newer 2.4 kernels, and it looks like things changed again in
> 2.5.68.
When you say a discrepancy between 2.4.9 and newer 2.4 kernels, do you
mean that the msync() behaviour changed during the 2.4 series?
If so, what was the change?
Thanks,
-- Jamie
Hi,
On Fri, 2004-04-16 at 23:35, Jamie Lokier wrote:
> Stephen C. Tweedie wrote:
> > I've been looking at a discrepancy between msync() behaviour on 2.4.9
> > and newer 2.4 kernels, and it looks like things changed again in
> > 2.5.68.
>
> When you say a discrepancy between 2.4.9 and newer 2.4 kernels, do you
> mean that the msync() behaviour changed during the 2.4 series?
Yes.
> If so, what was the change?
2.4.9 behaved like current 2.6 --- on MS_ASYNC, it did a
set_page_dirty() which means the page will get picked up by the next
5-second bdflush pass. But later 2.4 kernels were changed so that they
started MS_ASYNC IO immediately with filemap_fdatasync() (which is
asynchronous regarding the new IO, but which blocks synchronously if
there is already old IO in flight on the page.)
That was reverted back to the earlier, 2.4.9 behaviour in the 2.5
series.
Cheers,
Stephen
Stephen C. Tweedie wrote:
> > If so, what was the change?
>
> 2.4.9 behaved like current 2.6 --- on MS_ASYNC, it did a
> set_page_dirty() which means the page will get picked up by the next
> 5-second bdflush pass. But later 2.4 kernels were changed so that they
> started MS_ASYNC IO immediately with filemap_fdatasync() (which is
> asynchronous regarding the new IO, but which blocks synchronously if
> there is already old IO in flight on the page.)
>
> That was reverted back to the earlier, 2.4.9 behaviour in the 2.5
> series.
It was 2.5.68.
Thanks, that's very helpful.
msync(0) has always had behaviour consistent with the <=2.4.9 and
>=2.5.68 MS_ASYNC behaviour, is that right?
If so, programs may as well "#define MS_ASYNC 0" on Linux, to get well
defined and consistent behaviour. It would be nice to change the
definition in libc to zero, but I don't think it's possible because
msync(MS_SYNC|MS_ASYNC) needs to fail.
-- Jamie
Hi,
On Wed, 2004-04-21 at 03:10, Jamie Lokier wrote:
> msync(0) has always had behaviour consistent with the <=2.4.9 and
> >=2.5.68 MS_ASYNC behaviour, is that right?
Not sure about "always", but it looks like it recently at least. 2.2
msync was implemented very differently but seems, from the source, to
have the same property --- do_write_page() calls f_op->write() on msync,
and MS_SYNC forces an fsync after the writes. But 2.4 and 2.6 share
much more similar code to each other. So all since 2.2 seem to do the
fully-async, deferred writeback behaviour for flags==0.
--Stephen
Sorry to bring this up only after a 2-year hiatus, but I'm trying to
port an application from Solaris and Linux 2.4 to 2.6 and finding amazing
performance regression due to this. (For reference, as of 2.5.something,
msync(MS_ASYNC) just puts the pages on a dirty list but doesn't actually
start the I/O until some fairly coarse timer in the VM system fires.)
It uses msync(MS_ASYNC) and msync(MS_SYNC) as a poor man's portable
async IO. It's appending data to an on-disk log. When a page is full
or a transaction complete, the page will not be modified any further and
it uses MS_ASYNC to start the I/O as early as possible. (When compiled
with debugging, it also remaps the page read-only.)
Then it accomplishes as much as it can without needing the transaction
committed (typically 25-100 ms), and when it's blocked until the
transaction is known to be durable, it calls msync(MS_SYNC). Which
should, if everything went right, return immediately, because the page
is clean.
Reading the spec, this seems like exactly what msync() is designed for.
But looking at the 2.6 code, I see that it doesn't actually start the
write promptly, and makes the kooky and unusable suggestions to either
use fdatasync(fd), which will block on all the *following* transactions,
or fadvise(FADV_DONTNEED), which is emphatically lying to the kernel.
The data *is* needed in future (by the database *readers*), so discarding
it from memory is a stupid idea. The only thing we don't need to do to
the page any more is write to it.
Now, I know I could research when async-IO support became reasonably
stable and have the code do async I/O when on a recent enough kernel,
but I really wonder who the genius was who managed to misunderstand
"initiated or queued for servicing" enough to think it involves sleeping
with an idle disk.
Yes, it's just timing, but I hadn't noticed Linux developers being
ivory-tower academic types who only care about correctness or big-O
performance measures. In fact, "no, we can't prove it's starvation-free,
but damn it's fast on benchmarks!" is more of the attitude I've come
to expect.
Anyway, for future reference, Linux's current non-implementation of
msync(MS_ASYNC) is an outright BUG. It "computes the correct result",
but totally buggers performance.
(Deep breath) Now that I've finished complaining, I need to ask for
help. I may be inspired to fix the kernel, but first I have to fix my
application, which has to run on existing Linux kernels.
Can anyone advise me on the best way to perform this sort of split-
transaction disk write on extant 2.6 kernels? Preferably without using
glibc's pthread-based aio emulation? Will O_DIRECT and !O_SYNC writes
do what I want?  Or will that interact badly with mmap()?
For my application, all transactions are completed in-order, so there is
never any question of which order to wait for completion. My current
guess is that I'm going to have to call io_submit directly; is there
any documentation with more detail than the man pages but less than the
source code?  The former is silent on the semantics of the various
IOCB_CMD_* opcodes, while the latter doesn't distinguish clearly between
the promises the interface is intended to keep and the properties of
the current implementation.
Thanks for any suggestions.
[email protected] wrote:
>
> Sorry to bring this up only after a 2-year hiatus, but I'm trying to
> port an application from Solaris and Linux 2.4 to 2.6 and finding amazing
> performance regression due to this.  (For reference, as of 2.5.something,
> msync(MS_ASYNC) just puts the pages on a dirty list but doesn't actually
> start the I/O until some fairly coarse timer in the VM system fires.)
>
> It uses msync(MS_ASYNC) and msync(MS_SYNC) as a poor man's portable
> async IO. It's appending data to an on-disk log. When a page is full
> or a transaction complete, the page will not be modified any further and
> it uses MS_ASYNC to start the I/O as early as possible. (When compiled
> with debugging, it also uses remaps the page read-only.)
>
> Then it accomplishes as much as it can without needing the transaction
> committed (typically 25-100 ms), and when it's blocked until the
> transaction is known to be durable, it calls msync(MS_SYNC). Which
> should, if everything went right, return immediately, because the page
> is clean.
2.4:
MS_ASYNC: dirty the pagecache pages, start I/O
MS_SYNC: dirty the pagecache pages, start I/O, wait on I/O
2.6:
MS_ASYNC: dirty the pagecache pages
MS_SYNC: dirty the pagecache pages, start I/O, wait on I/O.
So you're saying that doing the I/O in that 25-100msec window allowed your
app to do more pipelining.
I think for most scenarios, what we have in 2.6 is better: it gives the app
more control over when the I/O should be started. But not for you, because
you have this handy 25-100ms window in which to do other stuff, which
eliminates the need to create a new thread to do the I/O.
Something like this? (Needs a triple-check).
Add two new linux-specific fadvise extensions():
LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file
offsets `offset' and `offset+len'.
LINUX_FADV_SYNC_WRITE: start and wait upon writeout of any dirty pages between
file offsets `offset' and `offset+len'.
The patch also regularises the filemap_write_and_wait_range() API. Make it
look like the __filemap_fdatawrite_range() one: the `end' argument points at
the first byte beyond the range being written.
Signed-off-by: Andrew Morton <[email protected]>
---
fs/direct-io.c | 2 +-
include/linux/fadvise.h | 6 ++++++
include/linux/fs.h | 3 +++
mm/fadvise.c | 13 +++++++++++--
mm/filemap.c | 18 +++++++++++-------
5 files changed, 32 insertions(+), 10 deletions(-)
diff -puN mm/fadvise.c~fadvise-async-write-commands mm/fadvise.c
--- devel/mm/fadvise.c~fadvise-async-write-commands 2006-02-08 23:55:42.000000000 -0800
+++ devel-akpm/mm/fadvise.c 2006-02-09 00:16:58.000000000 -0800
@@ -15,6 +15,7 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
#include <linux/fadvise.h>
+#include <linux/writeback.h>
#include <linux/syscalls.h>
#include <asm/unistd.h>
@@ -96,11 +97,19 @@ asmlinkage long sys_fadvise64_64(int fd,
filemap_flush(mapping);
/* First and last FULL page! */
- start_index = (offset + (PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
+ start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
end_index = (endbyte >> PAGE_CACHE_SHIFT);
if (end_index > start_index)
- invalidate_mapping_pages(mapping, start_index, end_index-1);
+ invalidate_mapping_pages(mapping, start_index,
+ end_index - 1);
+ break;
+ case LINUX_FADV_ASYNC_WRITE:
+ ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
+ WB_SYNC_NONE);
+ break;
+ case LINUX_FADV_SYNC_WRITE:
+ ret = filemap_write_and_wait_range(mapping, offset, endbyte);
break;
default:
ret = -EINVAL;
diff -puN include/linux/fadvise.h~fadvise-async-write-commands include/linux/fadvise.h
--- devel/include/linux/fadvise.h~fadvise-async-write-commands 2006-02-08 23:55:42.000000000 -0800
+++ devel-akpm/include/linux/fadvise.h 2006-02-08 23:56:55.000000000 -0800
@@ -18,4 +18,10 @@
#define POSIX_FADV_NOREUSE 5 /* Data will be accessed once. */
#endif
+/*
+ * Linux-specific fadvise() extensions:
+ */
+#define LINUX_FADV_ASYNC_WRITE 32 /* Start writeout on range */
+#define LINUX_FADV_SYNC_WRITE 33 /* Write out and wait upon range */
+
#endif /* FADVISE_H_INCLUDED */
diff -puN mm/filemap.c~fadvise-async-write-commands mm/filemap.c
--- devel/mm/filemap.c~fadvise-async-write-commands 2006-02-08 23:59:01.000000000 -0800
+++ devel-akpm/mm/filemap.c 2006-02-09 00:10:40.000000000 -0800
@@ -174,7 +174,8 @@ static int sync_page(void *word)
* dirty pages that lie within the byte offsets <start, end>
* @mapping: address space structure to write
* @start: offset in bytes where the range starts
- * @end: offset in bytes where the range ends
+ * @end: offset in bytes where the range ends (+1: we write end-start
+ * bytes)
* @sync_mode: enable synchronous operation
*
* If sync_mode is WB_SYNC_ALL then this is a "data integrity" operation, as
@@ -182,8 +183,8 @@ static int sync_page(void *word)
* these two operations is that if a dirty page/buffer is encountered, it must
* be waited upon, and not just skipped over.
*/
-static int __filemap_fdatawrite_range(struct address_space *mapping,
- loff_t start, loff_t end, int sync_mode)
+int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
+ loff_t end, int sync_mode)
{
int ret;
struct writeback_control wbc = {
@@ -212,8 +213,8 @@ int filemap_fdatawrite(struct address_sp
}
EXPORT_SYMBOL(filemap_fdatawrite);
-static int filemap_fdatawrite_range(struct address_space *mapping,
- loff_t start, loff_t end)
+static int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
+ loff_t end)
{
return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
}
@@ -367,19 +368,22 @@ int filemap_write_and_wait(struct addres
}
EXPORT_SYMBOL(filemap_write_and_wait);
+/*
+ * Write out and wait upon all the bytes between lstart and (lend-1)
+ */
int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend)
{
int err = 0;
- if (mapping->nrpages) {
+ if (mapping->nrpages && lend > lstart) {
err = __filemap_fdatawrite_range(mapping, lstart, lend,
WB_SYNC_ALL);
/* See comment of filemap_write_and_wait() */
if (err != -EIO) {
int err2 = wait_on_page_writeback_range(mapping,
lstart >> PAGE_CACHE_SHIFT,
- lend >> PAGE_CACHE_SHIFT);
+ (lend - 1) >> PAGE_CACHE_SHIFT);
if (!err)
err = err2;
}
diff -puN include/linux/fs.h~fadvise-async-write-commands include/linux/fs.h
--- devel/include/linux/fs.h~fadvise-async-write-commands 2006-02-08 23:59:24.000000000 -0800
+++ devel-akpm/include/linux/fs.h 2006-02-09 00:03:22.000000000 -0800
@@ -1476,6 +1476,9 @@ extern int filemap_fdatawait(struct addr
extern int filemap_write_and_wait(struct address_space *mapping);
extern int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend);
+extern int __filemap_fdatawrite_range(struct address_space *mapping,
+ loff_t start, loff_t end, int sync_mode);
+
extern void sync_supers(void);
extern void sync_filesystems(int wait);
extern void emergency_sync(void);
diff -puN fs/direct-io.c~fadvise-async-write-commands fs/direct-io.c
--- devel/fs/direct-io.c~fadvise-async-write-commands 2006-02-09 00:09:54.000000000 -0800
+++ devel-akpm/fs/direct-io.c 2006-02-09 00:10:06.000000000 -0800
@@ -1240,7 +1240,7 @@ __blockdev_direct_IO(int rw, struct kioc
}
retval = filemap_write_and_wait_range(mapping, offset,
- end - 1);
+ end);
if (retval) {
kfree(dio);
goto out;
_
Andrew Morton wrote:
>
> 2.4:
>
> MS_ASYNC: dirty the pagecache pages, start I/O
> MS_SYNC: dirty the pagecache pages, start I/O, wait on I/O
>
> 2.6:
>
> MS_ASYNC: dirty the pagecache pages
> MS_SYNC: dirty the pagecache pages, start I/O, wait on I/O.
>
> So you're saying that doing the I/O in that 25-100msec window allowed your
> app to do more pipelining.
>
> I think for most scenarios, what we have in 2.6 is better: it gives the app
> more control over when the I/O should be started.
How so?
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Nick Piggin <[email protected]> wrote:
>
> Andrew Morton wrote:
>
> >
> > 2.4:
> >
> > MS_ASYNC: dirty the pagecache pages, start I/O
> > MS_SYNC: dirty the pagecache pages, start I/O, wait on I/O
> >
> > 2.6:
> >
> > MS_ASYNC: dirty the pagecache pages
> > MS_SYNC: dirty the pagecache pages, start I/O, wait on I/O.
> >
> > So you're saying that doing the I/O in that 25-100msec window allowed your
> > app to do more pipelining.
> >
> > I think for most scenarios, what we have in 2.6 is better: it gives the app
> > more control over when the I/O should be started.
>
> How so?
>
Well, for example you might want to msync a number of disjoint parts of the
mapping, then write them all out in one hit.
Or you may not actually _want_ to start the I/O now - you just want pdflush
to write things back in a reasonable time period, so you don't have unsynced
data floating about in memory for eight hours. That's a quite reasonable
application of msync(MS_ASYNC).
> So you're saying that doing the I/O in that 25-100msec window allowed your
> app to do more pipelining.
Specifically, it allowed it to never block and have fast response times.
That's just the nature of two-phase commit: there's a period during
which your application doesn't know if the commit is durable or not.
But once your code supports having that time window, you can do a full
sliding window and avoid blocking on the completion of a transaction
that's not a prerequisite to the current one.
> I think for most scenarios, what we have in 2.6 is better: it gives the app
> more control over when the I/O should be started. But not for you, because
> you have this handy 25-100ms window in which to do other stuff, which
> eliminates the need to create a new thread to do the I/O.
Er... I fail to see how "push the dirty bit from internal level 1 to
internal level 2" gives the app any control. If the system is paging at
all, the page replacement algorithm will eventially notice a dirty page
that isn't being dirtied any more and clean it. So the 2.6 behaviour
changes one unknown and indefinite timeout to another unknown and
indefinite timeout.  "Start (or, if the disk is busy, queue) the I/O"
means it will be written out ASAP, basically as fast as a synchronous
write would do it, but without blocking. That's somewhat definite.
(I say "basically" because a scheduler that gives priority to synchronous
I/O is not unreasonable. But any delay should be due to the disk being
busy getting useful I/O done.)
I don't quite understand your point about the thread. Yes, the work to do
is not strictly serialized, so some of it can be started before knowing
the result of the most recently committed transaction. That's why I
want to do a split-transaction write: start writing at t1, do everything
that does not depend on the write, then wait for completion from t2..t3.
The idea is that an adequate t2-t1 will result in a very short t3-t2,
because the I/O latency is t3-t1.
It's the existence of msync(MS_ASYNC) that eliminates the need to create
a new thread to do the I/O, not the nature of the work. It's the nature
of the work that provides the opportunity to take advantage of overlapped
I/O, which wants a thread or some other form of synchronous I/O.
> Something like this? (Needs a triple-check).
Um, yes, thanks for the patch, except that I happen to think they should
be called msync(buf, len, MS_ASYNC) and msync(buf, len, MS_SYNC).
The current, not terribly useful, behaviour is adequately covered by
msync(buf, len, 0). That can be documented as "propagate the dirty
bits from the process' virtual address space to the file system buffer,
where it will be treated just like write(2) data: it will be written out
by the usual timer or can be written out by functions such as fsync().
Without using msync(), an fsync() call is not guaranteed to notice the
change." I don't know if there's an implicit msync(buf, len, 0) when
an address space is destroyed, but that would be good to document, too.
And, while I certainly don't mean to discourage kernel improvements,
my immediate problem is to find a solution (a "workaround", at least)
that works on 2.6.x, where x <= 15.
I thought with msync(), I had found something that was both efficient
and portable. Wishful thinking, it seems...
Anyway, thanks for the response!
Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>Andrew Morton wrote:
>>
>>
>>>2.4:
>>>
>>> MS_ASYNC: dirty the pagecache pages, start I/O
>>> MS_SYNC: dirty the pagecache pages, start I/O, wait on I/O
>>>
>>>2.6:
>>>
>>> MS_ASYNC: dirty the pagecache pages
>>> MS_SYNC: dirty the pagecache pages, start I/O, wait on I/O.
>>>
>>>So you're saying that doing the I/O in that 25-100msec window allowed your
>>>app to do more pipelining.
>>>
>>>I think for most scenarios, what we have in 2.6 is better: it gives the app
>>>more control over when the I/O should be started.
>>
>>How so?
>>
>
>
> Well, for example you might want to msync a number of disjoint parts of the
> mapping, then write them all out in one hit.
>
That should still be pretty efficient with 2.4 like behaviour? pdflush
does write them out in file offset order doesn't it?
> Or you may not actually _want_ to start the I/O now - you just want pdflush
> to write things back in a reasonable time period, so you don't have unsynced
> data floating about in memory for eight hours. That's a quite reasonable
> application of msync(MS_ASYNC).
>
I think data integrity requirements should be handled by MS_SYNC.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>Andrew Morton wrote:
>>
>>
>>>2.4:
>>>
>>> MS_ASYNC: dirty the pagecache pages, start I/O
>>> MS_SYNC: dirty the pagecache pages, start I/O, wait on I/O
>>>
>>>2.6:
>>>
>>> MS_ASYNC: dirty the pagecache pages
>>> MS_SYNC: dirty the pagecache pages, start I/O, wait on I/O.
>>>
>>>So you're saying that doing the I/O in that 25-100msec window allowed your
>>>app to do more pipelining.
>>>
>>>I think for most scenarios, what we have in 2.6 is better: it gives the app
>>>more control over when the I/O should be started.
>>
>>How so?
>>
>
>
> Well, for example you might want to msync a number of disjoint parts of the
> mapping, then write them all out in one hit.
>
That should still be pretty efficient with 2.4 like behaviour? pdflush
does write them out in file offset order doesn't it?
> Or you may not actually _want_ to start the I/O now - you just want pdflush
> to write things back in a reasonable time period, so you don't have unsynced
> data floating about in memory for eight hours. That's a quite reasonable
> application of msync(MS_ASYNC).
>
I think data integrity requirements should be handled by MS_SYNC.
What the app does lose some control of is when IO actually should get started,
(MS_SYNC still allows it to control when IO *finishes*).
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Nick Piggin <[email protected]> wrote:
>
> Andrew Morton wrote:
> > Nick Piggin <[email protected]> wrote:
> >
> >>Andrew Morton wrote:
> >>
> >>
> >>>2.4:
> >>>
> >>> MS_ASYNC: dirty the pagecache pages, start I/O
> >>> MS_SYNC: dirty the pagecache pages, start I/O, wait on I/O
> >>>
> >>>2.6:
> >>>
> >>> MS_ASYNC: dirty the pagecache pages
> >>> MS_SYNC: dirty the pagecache pages, start I/O, wait on I/O.
> >>>
> >>>So you're saying that doing the I/O in that 25-100msec window allowed your
> >>>app to do more pipelining.
> >>>
> >>>I think for most scenarios, what we have in 2.6 is better: it gives the app
> >>>more control over when the I/O should be started.
> >>
> >>How so?
> >>
> >
> >
> > Well, for example you might want to msync a number of disjoint parts of the
> > mapping, then write them all out in one hit.
> >
>
> That should still be pretty efficient with 2.4 like behaviour?
It's a bit of a disaster if you happen to msync(MS_ASYNC) the same page at
any sort of frequency - we have to wait for the previous I/O to complete
before new I/O can be started. That was the main problem which caused this
change to be made. You can see that it'd make 100x or 1000x speed improvements
with some sane access patterns.
> pdflush
> does write them out in file offset order doesn't it?
pdflush does, but an msync(MS_ASYNC) which starts I/O puts the IO order
into the application's control.
> > Or you may not actually _want_ to start the I/O now - you just want pdflush
> > to write things back in a reasonable time period, so you don't have unsynced
> > data floating about in memory for eight hours. That's a quite reasonable
> > application of msync(MS_ASYNC).
> >
>
> I think data integrity requirements should be handled by MS_SYNC.
Well that's always been the case. MS_ASYNC doesn't write metadata.
Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>Andrew Morton wrote:
>>
>>>
>>>Well, for example you might want to msync a number of disjoint parts of the
>>>mapping, then write them all out in one hit.
>>>
>>
>>That should still be pretty efficient with 2.4 like behaviour?
>
>
> It's a bit of a disaster if you happen to msync(MS_ASYNC) the same page at
> any sort of frequency - we have to wait for the previous I/O to complete
> before new I/O can be started. That was the main problem which caused this
> change to be made. You can see that it'd make 100x or 1000x speed improvements
> with some sane access patterns.
>
I'm not sure you'd have to do that, would you? Just move the dirty bit
from the pte and skip the page if it is found locked or writeback.
>
>>pdflush
>>does write them out in file offset order doesn't it?
>
>
> pdflush does, but an msync(MS_ASYNC) which starts I/O puts the IO order
> into the application's control.
>
I don't see a problem with that. There are plenty of ways to shoot oneself
in the foot.
>
>>>Or you may not actually _want_ to start the I/O now - you just want pdflush
>>>to write things back in a reasonable time period, so you don't have unsynced
>>>data floating about in memory for eight hours. That's a quite reasonable
>>>application of msync(MS_ASYNC).
>>>
>>
>>I think data integrity requirements should be handled by MS_SYNC.
>
>
> Well that's always been the case. MS_ASYNC doesn't write metadata.
>
>
So I don't understand your argument for using MS_ASYNC in that case.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Nick Piggin <[email protected]> wrote:
>
> > It's a bit of a disaster if you happen to msync(MS_ASYNC) the same page at
> > any sort of frequency - we have to wait for the previous I/O to complete
> > before new I/O can be started. That was the main problem which caused this
> > change to be made. You can see that it'd make 100x or 1000x speed improvements
> > with some sane access patterns.
> >
>
> I'm not sure you'd have to do that, would you? Just move the dirty bit
> from the pte and skip the page if it is found locked or writeback.
That would make MS_ASYNC mean "start I/O now, unless there's I/O in
progress, in which case start I/O in 30 seconds".  That's not good.
If we're going to change the kernel, better off using fadvise()
enhancements, which are also useful for post-write() operations.
Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>> > It's a bit of a disaster if you happen to msync(MS_ASYNC) the same page at
>> > any sort of frequency - we have to wait for the previous I/O to complete
>> > before new I/O can be started. That was the main problem which caused this
>> > change to be made. You can see that it'd make 100x or 1000x speed improvements
>> > with some sane access patterns.
>> >
>>
>> I'm not sure you'd have to do that, would you? Just move the dirty bit
>> from the pte and skip the page if it is found locked or writeback.
>
>
> That would make MS_ASYNC mean "start I/O now, unless there's I/O in
> progress, in which case start I/O in 30 seconds".  That's not good.
>
Yes, that change would make MS_ASYNC asynchronously start as much
IO as possible, as soon as possible. Which is good for the problem
reporter, who uses it to pipeline IOs and seems to have fairly good
control of when IO starts and finishes.
I don't think anyone would use MS_ASYNC for anything other than
performance improvement, so it is not like we need super well
defined behaviour... the earlier it will start IO AFAIKS the better.
> If we're going to change the kernel, better off using fadvise()
> enhancements, which are also useful for post-write() operations.
>
I don't think there is any downside to changing MS_ASYNC either,
though.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Nick Piggin <[email protected]> wrote:
>
> I don't think anyone would use MS_ASYNC for anything other than
> performance improvement, so it is not like we need super well
> defined behaviour... the earlier it will start IO AFAIKS the better.
Well, no. Consider a continuously-running application which modifies its
data store via MAP_SHARED+msync(MS_ASYNC). If the msync() immediately
started I/O, the disk would be seeking all over the place all the time. The
queue merging and timer-based unplugging would help here, but it won't be
as good as a big, infrequent ascending-file-offset pdflush pass.
Secondly, consider the behaviour of the above application if it is modifying
the same page relatively frequently (quite likely). If MS_ASYNC starts I/O
immediately, that page will get written 10, 100 or 1000 times per second.
If MS_ASYNC leaves it to pdflush, that page gets written once per 30
seconds, so we do far much less I/O.
We just don't know. It's better to leave it up to the application designer
rather than lumping too many operations into the one syscall.
Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>> I don't think anyone would use MS_ASYNC for anything other than
>> performance improvement, so it is not like we need super well
>> defined behaviour... the earlier it will start IO AFAIKS the better.
>
>
> Well, no. Consider a continuously-running application which modifies its
> data store via MAP_SHARED+msync(MS_ASYNC). If the msync() immediately
> started I/O, the disk would be seeking all over the place all the time. The
> queue merging and timer-based unplugging would help here, but it won't be
> as good as a big, infrequent ascending-file-offset pdflush pass.
>
Sure you can shoot yourself in the foot.
"msync flushes changes made to the in-core copy of a file that
was mapped into memory using mmap(2) back to disk. "
We usually don't cater to foot shooters at the expense of valid users.
AFAIKS, basically the only valid use for MS_ASYNC is for the app to tell
the kernel that it isn't going to write here for a long time, so writeback
may as well be scheduled; or to pipeline other work with an upcoming data
integrity point which will need to be guaranteed by a second call to MS_SYNC.
> Secondly, consider the behaviour of the above application if it is modifying
> the same page relatively frequently (quite likely). If MS_ASYNC starts I/O
> immediately, that page will get written 10, 100 or 1000 times per second.
> If MS_ASYNC leaves it to pdflush, that page gets written once per 30
> seconds, so we do far much less I/O.
>
> We just don't know. It's better to leave it up to the application designer
> rather than lumping too many operations into the one syscall.
Well it remains the same conceptual operation (asynchronously "schedule"
dirty pages for writeout). However it simply becomes more useful to start
the writeout immediately, given that's the (pretty explicit) hint that is
given to us.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Nick Piggin <[email protected]> wrote:
>
> > Secondly, consider the behaviour of the above application if it is modifying
> > the same page relatively frequently (quite likely). If MS_ASYNC starts I/O
> > immediately, that page will get written 10, 100 or 1000 times per second.
> > If MS_ASYNC leaves it to pdflush, that page gets written once per 30
> > seconds, so we do far much less I/O.
> >
> > We just don't know. It's better to leave it up to the application designer
> > rather than lumping too many operations into the one syscall.
>
> Well it remains the same conceptual operation (asynchronously "schedule"
> dirty pages for writeout). However it simply becomes more useful to start
> the writeout immediately, given that's the (pretty explicit) hint that is
> given to us.
If you want to start the I/O now, fine, start the I/O now.
If you don't want to start I/O now, fine, don't start I/O now.
If msync() were to unconditionally start I/O, you don't get that option.
It's pretty simple, isn't it?
Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>>Secondly, consider the behaviour of the above application if it is modifying
>>
>> > the same page relatively frequently (quite likely). If MS_ASYNC starts I/O
>> > immediately, that page will get written 10, 100 or 1000 times per second.
>> > If MS_ASYNC leaves it to pdflush, that page gets written once per 30
>> > seconds, so we do far much less I/O.
>> >
>> > We just don't know. It's better to leave it up to the application designer
>> > rather than lumping too many operations into the one syscall.
>>
>> Well it remains the same conceptual operation (asynchronously "schedule"
>> dirty pages for writeout). However it simply becomes more useful to start
>> the writeout immediately, given that's the (pretty explicit) hint that is
>> given to us.
>
>
> If you want to start the I/O now, fine, start the I/O now.
>
> If you don't want to start I/O now, fine, don't start I/O now.
>
> If msync() were to unconditionally start I/O, you don't get that option.
>
Huh? Sure you do.
If you want to start the IO *now* without waiting on it, call msync(MS_ASYNC)
If you don't want to start the IO now, that's really easy, do nothing.
If you want to start the IO now and also wait for it to finish, call msync(MS_SYNC)
Presently, the first option is unavailable.
> It's pretty simple, isn't it?
>
Yes.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Nick Piggin <[email protected]> wrote:
>
> If you want to start the IO *now* without waiting on it, call msync(MS_ASYNC)
> If you don't want to start the IO now, that's really easy, do nothing.
> If you want to start the IO now and also wait for it to finish, call msync(MS_SYNC)
I've already explained the problems with the start-io-in-MS_ASYNC approach.
> Presently, the first option is unavailable.
We need to patch the kernel either way. There's no point in going back to
either the known-problematic approach or to something half-assed.
Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>If you want to start the IO *now* without waiting on it, call msync(MS_ASYNC)
>> If you don't want to start the IO now, that's really easy, do nothing.
>> If you want to start the IO now and also wait for it to finish, call msync(MS_SYNC)
>
>
> I've already explained the problems with the start-io-in-MS_ASYNC approach.
>
But I've explained that they only matter for people using it in stupid ways.
fsync also poses a performance problem for programs that call it after every
write(2).
>
>> Presently, the first option is unavailable.
>
>
> We need to patch the kernel either way. There's no point in going back to
> either the known-problematic approach or to something half-assed.
>
The system call indicates to the kernel that IO submission should be started.
The earlier the kernel does that, the better (because it is likely that an
MS_SYNC is coming soon).
I think the current way of just moving the dirty bits is half-assed.
Is a more efficient implementation known-problematic? What applications did
you observe problems with, can you remember? Because the current behaviour
is also known-problematic for [email protected] (who are you anyway?)
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Nick Piggin <[email protected]> wrote:
>
> Andrew Morton wrote:
> > Nick Piggin <[email protected]> wrote:
> >
> >>If you want to start the IO *now* without waiting on it, call msync(MS_ASYNC)
> >> If you don't want to start the IO now, that's really easy, do nothing.
> >> If you want to start the IO now and also wait for it to finish, call msync(MS_SYNC)
> >
> >
> > I've already explained the problems with the start-io-in-MS_ASYNC approach.
> >
>
> But I've explained that they only matter for people using it in stupid ways.
> fsync also poses a performance problem for programs that call it after every
> write(2).
There's absolutely nothing stupid about
*p = <expr>
msync(p, sizeof(*p), MS_ASYNC);
> >
> >> Presently, the first option is unavailable.
> >
> >
> > We need to patch the kernel either way. There's no point in going back to
> > either the known-problematic approach or to something half-assed.
> >
>
> The system call indicates to the kernel that IO submission should be started.
> The earlier the kernel does that, the better (because it is likely that an
> MS_SYNC is coming soon).
>
> I think the current way of just moving the dirty bits is half-assed.
>
> Is a more efficient implementation known-problematic?
It's less efficient for some things. A lot.
> What applications did
> you observe problems with, can you remember?
Linus has some application which was doing the above. It ran extremely
slowly, so we changed MS_ASYNC (ie: made it "more efficient"...)
Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>>But I've explained that they only matter for people using it in stupid ways.
>>fsync also poses a performance problem for programs that call it after every
>>write(2).
>
>
> There's absolutely nothing stupid about
>
> *p = <expr>
> msync(p, sizeof(*p), MS_ASYNC);
>
There really is if you're expecting a short time later to do
*p = <expr2>
and had no need for a MS_SYNC anywhere in the meantime.
If you did have the need for MS_SYNC, then kicking off the IO
ASAP is going to be more efficient.
>>
>>Is a more efficient implementation known-problematic?
>
>
> It's less efficient for some things. A lot.
>
But only for stupid things, right?
>
>>What applications did
>>you observe problems with, can you remember?
>
>
> Linus has some application which was doing the above. It ran extremely
> slowly, so we changed MS_ASYNC (ie: made it "more efficient"...)
Can he remember what it is? It sounds like it is broken.
OTOH, it could have been blocking on pages already under writeout
but a smarter implementation could ignore those (at the cost of
worse IO efficiency in these rare cases).
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Nick Piggin <[email protected]> wrote:
>
> Andrew Morton wrote:
> > Nick Piggin <[email protected]> wrote:
>
> >>But I've explained that they only matter for people using it in stupid ways.
> >>fsync also poses a performance problem for programs that call it after every
> >>write(2).
> >
> >
> > There's absolutely nothing stupid about
> >
> > *p = <expr>
> > msync(p, sizeof(*p), MS_ASYNC);
> >
>
> There really is if you're expecting a short time later to do
>
> *p = <expr2>
>
> and had no need for a MS_SYNC anywhere in the meantime.
> If you did have the need for MS_SYNC, then kicking off the IO
> ASAP is going to be more efficient.
Of course these sorts of applications don't know what they'll be doing in
the future. Often the location of the next update is driven by something
which came across the network.
> >>
> >>Is a more efficient implementation known-problematic?
> >
> >
> > It's less efficient for some things. A lot.
> >
>
> But only for stupid things, right?
No.
> >
> >>What applications did
> >>you observe problems with, can you remember?
> >
> >
> > Linus has some application which was doing the above. It ran extremely
> > slowly, so we changed MS_ASYNC (ie: made it "more efficient"...)
>
> Can he remember what it is? It sounds like it is broken.
>
> OTOH, it could have been blocking on pages already under writeout
> but a smarter implementation could ignore those (at the cost of
> worse IO efficiency in these rare cases).
There's no need to do that. Look:
msync(MS_ASYNC): propagate pte dirty flags into pagecache
LINUX_FADV_ASYNC_WRITE: start writeback on all pages in region which are
dirty and which aren't presently under writeback.
LINUX_FADV_WRITE_WAIT: wait on writeback of all pages in range.
I think that covers all conceivable scenarios. One thing per operation,
leave the decisions and tuning up to the application. And it gives us two
operations which are also useful in association with regular write().
Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>>and had no need for a MS_SYNC anywhere in the meantime.
>>If you did have the need for MS_SYNC, then kicking off the IO
>>ASAP is going to be more efficient.
>
>
> Of course these sorts of applications don't know what they'll be doing in
> the future. Often the location of the next update is driven by something
> which came across the network.
>
If there is no actual need for the application to start a write (eg
for data integrity) then why would it ever do that?
>
> There's no need to do that. Look:
>
> msync(MS_ASYNC): propagate pte dirty flags into pagecache
>
> LINUX_FADV_ASYNC_WRITE: start writeback on all pages in region which are
> dirty and which aren't presently under writeback.
>
> LINUX_FADV_WRITE_WAIT: wait on writeback of all pages in range.
>
> I think that covers all conceivable scenarios. One thing per operation,
> leave the decisions and tuning up to the application. And it gives us two
> operations which are also useful in association with regular write().
>
Oh yeah it is easy if you want to define some more APIs and do
it in a Linux specific way.
But the main function of msync(MS_ASYNC) AFAIK is to *start* IO.
Why do we care so much if some application goes stupid with it?
Why not introduce a linux specific MS_flag to propagate pte dirty
bits?
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Nick Piggin <[email protected]> wrote:
>
> Andrew Morton wrote:
> > Nick Piggin <[email protected]> wrote:
>
> >>and had no need for a MS_SYNC anywhere in the meantime.
> >>If you did have the need for MS_SYNC, then kicking off the IO
> >>ASAP is going to be more efficient.
> >
> >
> > Of course these sorts of applications don't know what they'll be doing in
> > the future. Often the location of the next update is driven by something
> > which came across the network.
> >
>
> If there is no actual need for the application to start a write (eg
> for data integrity) then why would it ever do that?
To get the data sent to disk in a reasonable amount of time - don't leave it
floating about in memory for hours or days.
> >
> > There's no need to do that. Look:
> >
> > msync(MS_ASYNC): propagate pte dirty flags into pagecache
> >
> > LINUX_FADV_ASYNC_WRITE: start writeback on all pages in region which are
> > dirty and which aren't presently under writeback.
> >
> > LINUX_FADV_WRITE_WAIT: wait on writeback of all pages in range.
> >
> > I think that covers all conceivable scenarios. One thing per operation,
> > leave the decisions and tuning up to the application. And it gives us two
> > operations which are also useful in association with regular write().
> >
>
> Oh yeah it is easy if you want to define some more APIs and do
> it in a Linux specific way.
>
> But the main function of msync(MS_ASYNC) AFAIK is to *start* IO.
> Why do we care so much if some application goes stupid with it?
Because delaying the writeback to permit combining is a good optimisation.
The alternative of not starting new writeout of a dirty page if that page
happens to be under writeout at the time is neither one nor the other.
With that proposal, if the application really wants IO started right now,
then it's going to have to use msync(MS_SYNC).
> Why not introduce a linux specific MS_flag to propagate pte dirty
> bits?
That's what MS_ASYNC already does. We're agreed that something needs to
change and we're just discussing what that is. I'm proposing something
which is complete and flexible.
Another point here is that msync(MS_SYNC) starts writeout of _all_ dirty
pages in the file (as MS_ASYNC used to do) and it waits upon writeback of
the whole file. That's quite inefficient for an app which has lots of
threads writing to and msync()ing the same MAP_SHARED file.
We could easily enough convert msync() to only operate on the affected
region of the (non-linearly-mapped) file. But I don't think we can do that
now, because people might be relying upon the side-effects.
The fadvise() extensions allow us to fix this. And we've needed them for
some time for regular write()s anyway.
Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>>If there is no actual need for the application to start a write (eg
>>for data integrity) then why would it ever do that?
>
>
> To get the data sent to disk in a reasonable amount of time - don't leave it
> floating about in memory for hours or days.
>
This is a Linux implementation detail. As such it would make sense to
introduce a new Linux specific MS_ flag for this.
>>Oh yeah it is easy if you want to define some more APIs and do
>>it in a Linux specific way.
>>
>>But the main function of msync(MS_ASYNC) AFAIK is to *start* IO.
>>Why do we care so much if some application goes stupid with it?
>
>
> Because delaying the writeback to permit combining is a good optimisation.
>
Definitely. And when the app gives us a hint that it really wants the
data on the disk, starting it as early as possible is also a good
optimisation.
>
>>Why not introduce a linux specific MS_flag to propagate pte dirty
>>bits?
>
>
> That's what MS_ASYNC already does. We're agreed that something needs to
> change and we're just discussing what that is. I'm proposing something
> which is complete and flexible.
>
I don't think there's anything wrong with your fadvise additions.
I'd rather see MS_ASYNC start IO immediately and add another MS_
flag for Linux to propagate bits.
MS_ASYNC behaviour would also somewhat match your proposed FADV_ASYNC
behaviour.
>
>
> Another point here is that msync(MS_SYNC) starts writeout of _all_ dirty
> pages in the file (as MS_ASYNC used to do) and it waits upon writeback of
> the whole file. That's quite inefficient for an app which has lots of
> threads writing to and msync()ing the same MAP_SHARED file.
>
> We could easily enough convert msync() to only operate on the affected
> region of the (non-linearly-mapped) file. But I don't think we can do that
> now, because people might be relying upon the side-effects.
>
I think if the interface was always documented correctly then we should
be able to. If the app breaks it was buggy anyway.
> The fadvise() extensions allow us to fix this. And we've needed them for
> some time for regular write()s anyway.
>
Yes they'd be nice.
Instead of
LINUX_FADV_ASYNC_WRITE
LINUX_FADV_WRITE_WAIT
can we have something more consistent? Perhaps
FADV_WRITE_ASYNC
FADV_WRITE_SYNC
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Nick Piggin <[email protected]> wrote:
>
> Instead of
> LINUX_FADV_ASYNC_WRITE
> LINUX_FADV_WRITE_WAIT
>
> can we have something more consistent? Perhaps
> FADV_WRITE_ASYNC
> FADV_WRITE_SYNC
Nope, I had a bit of a think about this and decided that the two operations
which we need are:
From: Andrew Morton <[email protected]>
Add two new linux-specific fadvise extensions():
LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file
offsets `offset' and `offset+len'. Any pages which are currently under
writeout are skipped, whether or not they are dirty.
LINUX_FADV_WRITE_WAIT: wait upon writeout of any dirty pages between file
offsets `offset' and `offset+len'.
By combining these two operations the application may do several things:
LINUX_FADV_ASYNC_WRITE: push some or all of the dirty pages at the disk.
LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE: push all of the currently dirty
pages at the disk.
LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE, LINUX_FADV_WRITE_WAIT: push all
of the currently dirty pages at the disk, wait until they have been written.
It should be noted that none of these operations write out the file's
metadata. So unless the application is strictly performing overwrites of
already-instantiated disk blocks, there are no guarantees here that the data
will be available after a crash.
To complete this suite of operations I guess we should have a "sync file
metadata only" operation. This gives applications access to all the building
blocks needed for all sorts of sync operations. But sync-metadata doesn't fit
well with the fadvise() interface. Probably it should be a new syscall:
sys_fmetadatasync().
The patch also diddles with the meaning of `endbyte' in sys_fadvise64_64().
It is made to represent that last affected byte in the file (ie: it is
inclusive). Generally, all these byterange and pagerange functions are
inclusive so we can easily represent EOF with -1.
Signed-off-by: Andrew Morton <[email protected]>
---
include/linux/fadvise.h | 6 ++++
include/linux/fs.h | 5 ++++
mm/fadvise.c | 46 +++++++++++++++++++++++++++++++++-----
mm/filemap.c | 10 ++++----
4 files changed, 57 insertions(+), 10 deletions(-)
diff -puN include/linux/fadvise.h~fadvise-async-write-commands include/linux/fadvise.h
--- devel/include/linux/fadvise.h~fadvise-async-write-commands 2006-02-09 22:29:36.000000000 -0800
+++ devel-akpm/include/linux/fadvise.h 2006-02-09 22:29:36.000000000 -0800
@@ -18,4 +18,10 @@
#define POSIX_FADV_NOREUSE 5 /* Data will be accessed once. */
#endif
+/*
+ * Linux-specific fadvise() extensions:
+ */
+#define LINUX_FADV_ASYNC_WRITE 32 /* Start writeout on range */
+#define LINUX_FADV_WRITE_WAIT 33 /* Wait upon writeout to range */
+
#endif /* FADVISE_H_INCLUDED */
diff -puN include/linux/fs.h~fadvise-async-write-commands include/linux/fs.h
--- devel/include/linux/fs.h~fadvise-async-write-commands 2006-02-09 22:29:36.000000000 -0800
+++ devel-akpm/include/linux/fs.h 2006-02-09 23:06:03.000000000 -0800
@@ -1473,6 +1473,11 @@ extern int filemap_fdatawait(struct addr
extern int filemap_write_and_wait(struct address_space *mapping);
extern int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend);
+extern int wait_on_page_writeback_range(struct address_space *mapping,
+ pgoff_t start, pgoff_t end);
+extern int __filemap_fdatawrite_range(struct address_space *mapping,
+ loff_t start, loff_t end, int sync_mode);
+
extern void sync_supers(void);
extern void sync_filesystems(int wait);
extern void emergency_sync(void);
diff -puN mm/fadvise.c~fadvise-async-write-commands mm/fadvise.c
--- devel/mm/fadvise.c~fadvise-async-write-commands 2006-02-09 22:29:36.000000000 -0800
+++ devel-akpm/mm/fadvise.c 2006-02-09 23:12:22.000000000 -0800
@@ -15,6 +15,7 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
#include <linux/fadvise.h>
+#include <linux/writeback.h>
#include <linux/syscalls.h>
#include <asm/unistd.h>
@@ -22,13 +23,36 @@
/*
* POSIX_FADV_WILLNEED could set PG_Referenced, and POSIX_FADV_NOREUSE could
* deactivate the pages and clear PG_Referenced.
+ *
+ * LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file
+ * offsets `offset' and `offset+len' inclusive. Any pages which are currently
+ * under writeout are skipped, whether or not they are dirty.
+ *
+ * LINUX_FADV_WRITE_WAIT: wait upon writeout of any dirty pages between file
+ * offsets `offset' and `offset+len'.
+ *
+ * By combining these two operations the application may do several things:
+ *
+ * LINUX_FADV_ASYNC_WRITE: push some or all of the dirty pages at the disk.
+ *
+ * LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE: push all of the currently
+ * dirty pages at the disk.
+ *
+ * LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE, LINUX_FADV_WRITE_WAIT: push
+ * all of the currently dirty pages at the disk, wait until they have been
+ * written.
+ *
+ * It should be noted that none of these operations write out the file's
+ * metadata. So unless the application is strictly performing overwrites of
+ * already-instantiated disk blocks, there are no guarantees here that the data
+ * will be available after a crash.
*/
asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
{
struct file *file = fget(fd);
struct address_space *mapping;
struct backing_dev_info *bdi;
- loff_t endbyte;
+ loff_t endbyte; /* inclusive */
pgoff_t start_index;
pgoff_t end_index;
unsigned long nrpages;
@@ -56,6 +80,8 @@ asmlinkage long sys_fadvise64_64(int fd,
endbyte = offset + len;
if (!len || endbyte < len)
endbyte = -1;
+ else
+ endbyte--; /* inclusive */
bdi = mapping->backing_dev_info;
@@ -78,7 +104,7 @@ asmlinkage long sys_fadvise64_64(int fd,
/* First and last PARTIAL page! */
start_index = offset >> PAGE_CACHE_SHIFT;
- end_index = (endbyte-1) >> PAGE_CACHE_SHIFT;
+ end_index = endbyte >> PAGE_CACHE_SHIFT;
/* Careful about overflow on the "+1" */
nrpages = end_index - start_index + 1;
@@ -96,11 +122,21 @@ asmlinkage long sys_fadvise64_64(int fd,
filemap_flush(mapping);
/* First and last FULL page! */
- start_index = (offset + (PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
+ start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
end_index = (endbyte >> PAGE_CACHE_SHIFT);
- if (end_index > start_index)
- invalidate_mapping_pages(mapping, start_index, end_index-1);
+ if (end_index >= start_index)
+ invalidate_mapping_pages(mapping, start_index,
+ end_index);
+ break;
+ case LINUX_FADV_ASYNC_WRITE:
+ ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
+ WB_SYNC_NONE);
+ break;
+ case LINUX_FADV_WRITE_WAIT:
+ ret = wait_on_page_writeback_range(mapping,
+ offset >> PAGE_CACHE_SHIFT,
+ endbyte >> PAGE_CACHE_SHIFT);
break;
default:
ret = -EINVAL;
diff -puN mm/filemap.c~fadvise-async-write-commands mm/filemap.c
--- devel/mm/filemap.c~fadvise-async-write-commands 2006-02-09 22:29:36.000000000 -0800
+++ devel-akpm/mm/filemap.c 2006-02-09 23:05:56.000000000 -0800
@@ -181,8 +181,8 @@ static int sync_page(void *word)
* these two operations is that if a dirty page/buffer is encountered, it must
* be waited upon, and not just skipped over.
*/
-static int __filemap_fdatawrite_range(struct address_space *mapping,
- loff_t start, loff_t end, int sync_mode)
+int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
+ loff_t end, int sync_mode)
{
int ret;
struct writeback_control wbc = {
@@ -211,8 +211,8 @@ int filemap_fdatawrite(struct address_sp
}
EXPORT_SYMBOL(filemap_fdatawrite);
-static int filemap_fdatawrite_range(struct address_space *mapping,
- loff_t start, loff_t end)
+static int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
+ loff_t end)
{
return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
}
@@ -231,7 +231,7 @@ EXPORT_SYMBOL(filemap_flush);
* Wait for writeback to complete against pages indexed by start->end
* inclusive
*/
-static int wait_on_page_writeback_range(struct address_space *mapping,
+int wait_on_page_writeback_range(struct address_space *mapping,
pgoff_t start, pgoff_t end)
{
struct pagevec pvec;
_
> Well, no. Consider a continuously-running application which modifies its
> data store via MAP_SHARED+msync(MS_ASYNC). If the msync() immediately
> started I/O, the disk would be seeking all over the place all the time. The
> queue merging and timer-based unplugging would help here, but it won't be
> as good as a big, infrequent ascending-file-offset pdflush pass.
>
> Secondly, consider the behaviour of the above application if it is modifying
> the same page relatively frequently (quite likely). If MS_ASYNC starts I/O
> immediately, that page will get written 10, 100 or 1000 times per second.
> If MS_ASYNC leaves it to pdflush, that page gets written once per 30
> seconds, so we do far much less I/O.
You're assuming a brain-dead application. Which can already thrash the
disk very nicely with O_SYNC. Yes, if you ask for control and then do
something stupid, you can send performance into the toilet.
That's not a reason to not do what the application asks unless it's
a serious DoS attack.
(For example, in my application, I'm using a raw device as a circular
buffer, so I'm already delivering perfectly sequential block numbers.
And it's a flash memory disk anyway.)
> We just don't know. It's better to leave it up to the application designer
> rather than lumping too many operations into the one syscall.
I know the operating system doesn't know. If it did, there wouldn't
be any need for the application to tell it by making a system call.
So do what the application asks for, which is what the SuS says
msync(MS_ASYNC) means, which is start the write immediately.
(I'd call it "I/O", but it's only "O".)
As I said, I'm actively looking for a way, on Linux 2.6.x, x <= 15,
to start disk writes on part of an mmapped file without either blocking
(yet) or writing other dirty pages that aren't complete yet.
[email protected] wrote:
>
> > Well, no. Consider a continuously-running application which modifies its
> > data store via MAP_SHARED+msync(MS_ASYNC). If the msync() immediately
> > started I/O, the disk would be seeking all over the place all the time. The
> > queue merging and timer-based unplugging would help here, but it won't be
> > as good as a big, infrequent ascending-file-offset pdflush pass.
> >
> > Secondly, consider the behaviour of the above application if it is modifying
> > the same page relatively frequently (quite likely). If MS_ASYNC starts I/O
> > immediately, that page will get written 10, 100 or 1000 times per second.
> > If MS_ASYNC leaves it to pdflush, that page gets written once per 30
> > seconds, so we do far much less I/O.
>
> You're assuming a brain-dead application.
We've covered this. Handing pte-dirty pages over to pdflush for prompt
writeback is a perfectly valid, sensible and fast thing to do.
It efficiently solves the single biggest problem with using MAP_SHARED
instead of write().
> As I said, I'm actively looking for a way, on Linux 2.6.x, x <= 15,
> to start disk writes on part of an mmapped file without either blocking
> (yet)
I cannot think of a way, sorry.
> or writing other dirty pages that aren't complete yet.
msync() will write all of the file's dirty pages and it has always has done
that.
>> But the main function of msync(MS_ASYNC) AFAIK is to *start* IO.
>> Why do we care so much if some application goes stupid with it?
>
> Because delaying the writeback to permit combining is a good optimisation.
In *some* cases. The application may very well know that there won't
be any following writes to combine with.
> The alternative of not starting new writeout of a dirty page if that page
> happens to be under writeout at the time is neither one nor the other.
It's a sub-optimal kludge, but it's something. As everyone is perfectly
aware, msync(MS_ASYNC) is *only* a performance optimization; you cannot
rely on it for correctness because the time to do the write is not
bounded. So if the OS screws up occasionally, not a disaster.
So Linux has a limitation that it can't start a second write on a
particular page that's already being written. (It seems like a simple
flag, tested on completion of the first writeback, would solve that
problem.)
But msync() means nothing unless people are writing to a file, and
concurrent writers have to cooperate anyway, so I don't see this as
being a big problem in practice. MS_ASYNC is a performance optimization,
so it only has to work most of the time.
Thus, this is a perfectly acceptable solution.
For example, my application only calls msync(MS_ASYNC) on a particular
page once, ever, as soon as it knows there will be no more writes to
that page. Thus, the problem would never occur. It might be nice to
extend Linux to cope gracefully with the case where I start the write
when I'm 99% sure there will be no more data (but just might be wrong),
but I don't think that's done too commonly.
>> Why not introduce a linux specific MS_flag to propagate pte dirty
>> bits?
> That's what MS_ASYNC already does.
Yes, in violation of the SuS spec. That's what msync(0) already does,
too, so the linux-specific extension already exists.
The standard description of MS_INVALIDATE is very confusing and poorly
worded, but I think it's designed for a model where mmap() copies rather
than playing page table tricks, and the OS has to copy the dirty pages
back and forth between the buffer cache "by hand". Looked at that way,
the MS_INVALIDATE wording seems to be intended as something of a "commit
memory writes back to the file system level" operation.
Which could also be expected to cause the traditional 30-second sync
timeout to start applying to the written data. In the current Linux
code, the only effect of MS_INVALIDATE over msync(0) is an extra
validity check that I'm not clear on the purpose of.
> Another point here is that msync(MS_SYNC) starts writeout of _all_ dirty
> pages in the file (as MS_ASYNC used to do) and it waits upon writeback of
> the whole file. That's quite inefficient for an app which has lots of
> threads writing to and msync()ing the same MAP_SHARED file.
Ick.
> We could easily enough convert msync() to only operate on the affected
> region of the (non-linearly-mapped) file. But I don't think we can do that
> now, because people might be relying upon the side-effects.
Um, they shouldn't be. It certainly hasn't been documented. If someone
wants that, they can use fdatasync(). Do you have any reason to believe
that there exist applications that rely on such non-portable behaviour
for correctness? I'd think someone writing such careful code would
carefully follow the guarantees.
> The fadvise() extensions allow us to fix this. And we've needed them for
> some time for regular write()s anyway.
I'm not objecting to them, just to the fact that they're non-portable
extensions needed to make the portable system calls behave in the
standard-defined way.
Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>Instead of
>> LINUX_FADV_ASYNC_WRITE
>> LINUX_FADV_WRITE_WAIT
>>
>> can we have something more consistent? Perhaps
>> FADV_WRITE_ASYNC
>> FADV_WRITE_SYNC
>
>
> Nope, I had a bit of a think about this and decided that the two operations
> which we need are:
>
>
Do you need to introduce a completely new concept 'wait upon writeout'
though? Not to say they can't solve the problem but I don't think they
are any more expressive and they definitely depart from the norm which
has always been sync / async AFAIK.
It may be a very useful operation in kernel, but I think userspace either
wants to definitely know the data is on disk (WRITE_SYNC), or give a hint
to start writing (WRITE_ASYNC).
From a kernel implementation point of view, WRITE_SYNC may be doing
several things (start writeout, wait writeout), but from userspace it is
just a single logical operation.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
[email protected] wrote:
>>That's what MS_ASYNC already does.
>
>
> Yes, in violation of the SuS spec. That's what msync(0) already does,
> too, so the linux-specific extension already exists.
>
> The standard description of MS_INVALIDATE is very confusing and poorly
> worded, but I think it's designed for a model where mmap() copies rather
> than playing page table tricks, and the OS has to copy the dirty pages
> back and forth between the buffer cache "by hand". Looked at that way,
> the MS_INVALIDATE wording seems to be intended as something of a "commit
> memory writes back to the file system level" operation.
>
> Which could also be expected to cause the traditional 30-second sync
> timeout to start applying to the written data. In the current Linux
Yes as we already have something that does the pte->page work (I'd agree
with your interpretation of MS_INVALIDATE), then we definitely have room
to make MS_ASYNC more efficient for applications like yours that use it
properly.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
On Fri, 10 Feb 2006, Nick Piggin wrote:
>
> This is a Linux implementation detail. As such it would make sense to
> introduce a new Linux specific MS_ flag for this.
> ..
> Definitely. And when the app gives us a hint that it really wants the
> data on the disk, starting it as early as possible is also a good
> optimisation.
But that's what MS_SYNC is. MS_SYNC says "I need this data written now".
MS_ASYNC moves it into the page cache. That makes 100% sense. Then it will
be written by the regular dirty page writeout. That makes 100% sense.
> I don't think there's anything wrong with your fadvise additions.
> I'd rather see MS_ASYNC start IO immediately and add another MS_
>>flag for Linux to propagate bits.
Why? I miss the _reason_ you want to do this.
The current MS_ASYNC behaviour is the sane one. It's the one that doesn't
cause the harddisk to start ticking senselessly. It's the one that allows
a person on a laptop to say "don't write dirty data every 5 seconds - do
it just every hour".
In contrast, _your_ proposal is just inflexible and inconvenient.
If somebody really really wants to "start flushing data now", then he can
do so, but that actually has absolutely zero to do with "msync()" any
more. A person who wants the flushing to start "now" might want to flush
any random dirty buffers.
Your suggestion is no different from saying "we should make every
'write()' call start the IO". Which is obviously crap.
Linus
On Fri, 10 Feb 2006, Nick Piggin wrote:
>
> It may be a very useful operation in kernel, but I think userspace either
> wants to definitely know the data is on disk (WRITE_SYNC), or give a hint
> to start writing (WRITE_ASYNC).
Only from a _stupid_ user standpoint.
The fact is, "start writing and wait for the result" is fundamentally a
totally broken operation.
Why?
Because a smart user actually would want to do
- start writing this
- start writing that
- start writing that-other-thing
- wait for them all.
The reason synchronous write performance is absolutely disgusting is
exactly that people think "start writing" should be paired up with "wait
for it".
So the kernel internally separates "start writing" and "wait for it" for
very good reasons. Reasons that in no way go away just because you move to
user space.
And yes, there very much is a third operation too: "mark dirty". That's
the _common_ one. That's the fundamental one. That's the one that we use
every single day, without even realizing. The "start writing" and "wait
for it" operations are actually the rare ones.
Linus
Linus Torvalds wrote:
>
> On Fri, 10 Feb 2006, Nick Piggin wrote:
>
>>This is a Linux implementation detail. As such it would make sense to
>>introduce a new Linux specific MS_ flag for this.
>>..
>>Definitely. And when the app gives us a hint that it really wants the
>>data on the disk, starting it as early as possible is also a good
>>optimisation.
>
>
> But that's what MS_SYNC is. MS_SYNC says "I need this data written now".
>
Yes but it is synchronous.
> MS_ASYNC moves it into the page cache. That makes 100% sense. Then it will
> be written by the regular dirty page writeout. That makes 100% sense.
>
MS_INVALIDATE does that (in Linux), the spec is poorly worded but the
intention seems to be that it would push dirty state back into pagecache for
implementations such as ours.
>
>>I don't think there's anything wrong with your fadvise additions.
>>I'd rather see MS_ASYNC start IO immediately and add another MS_
>>flag for Linux to propagate bits.
>
>
> Why? I miss the _reason_ you want to do this.
>
[email protected] has an application (database or logging I think), which
uses MS_SYNC to provide integrity guarantees, however it is possible to do
useful work between the last write to memory and the commit point. MS_ASYNC
is used to start the IO and pipeline work.
> The current MS_ASYNC behaviour is the sane one. It's the one that doesn't
> cause the harddisk to start ticking senselessly. It's the one that allows
> a person on a laptop to say "don't write dirty data every 5 seconds - do
> it just every hour".
>
MS_INVALIDATE
> In contrast, _your_ proposal is just inflexible and inconvenient.
>
Currently MS_ASYNC does the same as MS_INVALIDATE. But it used to start
IO (before 2.5.something), and apparently it does in Solaris as well.
> If somebody really really wants to "start flushing data now", then he can
> do so, but that actually has absolutely zero to do with "msync()" any
> more. A person who wants the flushing to start "now" might want to flush
> any random dirty buffers.
>
I didn't quite understand what you're saying here.
> Your suggestion is no different from saying "we should make every
> 'write()' call start the IO". Which is obviously crap.
>
I think it is quite a bit different. Obviously what you're saying is crap,
but I think there are good arguments for changing MS_ASYNC so it is not
quite so obvious.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Linus Torvalds wrote:
>
> On Fri, 10 Feb 2006, Nick Piggin wrote:
>
>>It may be a very useful operation in kernel, but I think userspace either
>>wants to definitely know the data is on disk (WRITE_SYNC), or give a hint
>>to start writing (WRITE_ASYNC).
>
>
> Only from a _stupid_ user standpoint.
>
> The fact is, "start writing and wait for the result" is fundamentally a
> totally broken operation.
>
No. Userspace has (almost) a transparent pagecache to backing store,
the only time they care about it is data integrity points in which
case they want to know that it is flushed; or performance hints which
might tell the kernel to write them sooner, or later (or other hints).
Wait until writeout has finished is like an implementation detail that
I can't see how it would be ever useful on its own.
> Why?
>
> Because a smart user actually would want to do
>
> - start writing this
> - start writing that
> - start writing that-other-thing
> - wait for them all.
>
No, you are thinking about what the kernel does. Subtle difference. A
smart user wants to:
- start writing this
- start writing that
- start writing that-other-thing
- make sure this that and the other have reached backing store
OK so in effect it is the same thing, but it is better to export the
interface that reflects how the user interacts with pagecache.
WRITE_SYNC obviously does the "wait for them all" (aka ensure they
hit backing store) thing too, right? It performs exactly the same
role that WRITE_WAIT would do in the above example.
> The reason synchronous write performance is absolutely disgusting is
> exactly that people think "start writing" should be paired up with "wait
> for it".
>
> So the kernel internally separates "start writing" and "wait for it" for
> very good reasons. Reasons that in no way go away just because you use to
> user space.
>
They don't go away but they take different forms. "start writing" is
a performance hint. "wait for it" is only ever a part of "send to
backing store" operation.
My proposal isn't really different to Andrew's in terms of functionality
(unless I've missed something), but it is more consistent because it
does not introduce this completely new concept to our userspace API but
rather uses the SYNC/ASYNC distinction like everything else.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
On Sat, 11 Feb 2006, Nick Piggin wrote:
>
> MS_INVALIDATE does that (in Linux),
I don't actually think it does.
In _current_ linux it does. In some other versions, it will have thrown
the dirty data away. Also, it will make subsequent accesses much much more
expensive - and it doesn't work on locked areas.
> the spec is poorly worded but the
> intention seems to be that it would push dirty state back into pagecache for
> implementations such as ours.
As an application writer, you'd be absolutely crazy to depend on that.
Using "msync( .. 0)" _may_ actually work reliably under any Linux version,
but I wouldn't bet on it, and it's quite possible that it does strange
things on other systems. Again, an application writer that uses it would
have to be deranged (or very much a kernel person - I could imagine doing
it myself, but I could _not_ imagine doing it as a non-kernel developer).
> [email protected] has an application (database or logging I think), which
> uses MS_SYNC to provide integrity guarantees, however it is possible to do
> useful work between the last write to memory and the commit point. MS_ASYNC
> is used to start the IO and pipeline work.
So you're saying that there is one application that knows it could use
different semantics?
Now, please enumerate all the applications that use MS_ASYNC and prefer
the current semantics.
When you know that, you have an argument.
In the meantime, you have an example of an application that wants _new_
semantics.
> > The current MS_ASYNC behaviour is the sane one. It's the one that doesn't
> > cause the harddisk to start ticking senselessly. It's the one that allows a
> > person on a laptop to say "don't write dirty data every 5 seconds - do it
> > just every hour".
>
> MS_INVALIDATE
Repeating something doesn't make it so.
> > In contrast, _your_ proposal is just inflexible and inconvenient.
>
> Currently MS_ASYNC does the same as MS_INVALIDATE. But it used to start
> IO (before 2.5.something), and apparently it does in Solaris as well.
Actually, it did _not_ use to start IO.
Then, somebody made it do so, and people eventually screamed, and it was
reverted again.
Go check Linux-2.0 or something. You'll also see the "MS_INVALIDATE means
throw the dirty bit away" behaviour.
The _sane_ semantics are that if you say "MS_INVALIDATE" the dirty bit is
just thrown away. If you say "MS_INVALIDATE | MS_ASYNC", the dirty bit is
saved in the page cache and then the page is unmapped. And MS_SYNC
obviously does the same thing, except it also waits for it.
Those are the _logically consistent_ semantics. And it's what Linux
historically did. The fact that we now think "MS_INVALIDATE" on its own
should mean "save the dirty state" is because some other broken operating
system does it, and it's sadly the _safer_ thing to do, even if it's
clearly logically not sane. If you invalidate a mapping, you throw it
away, you don't save it.
Gaah.
I took the time to actually unpack 2.0.40. And yes, it does exactly what I
remember it doing. If you pass in MS_INVALIDATE (with no *SYNC flags) it
does:
pte_clear(ptep);
...
if (!pte_dirty(pte) || flags == MS_INVALIDATE) {
free_page(page);
return 0;
}
without ever marking anything dirty.
> > If somebody really really wants to "start flushing data now", then he can do
> > so, but that actually has absolutely zero to do with "msync()" any more. A
> > person who wants the flushing to start "now" might want to flush any random
> > dirty buffers.
>
> I didn't quite understand what you're saying here.
I'm saying that "start flushing now" has _zero_ to do with an mmap.
It's a perfectly valid operation after a _write_ call too - even if you
never mmaped the area at all.
So if somebody wants to start background IO, what has that got to do with
msync()?
Linus
On Sat, 11 Feb 2006, Nick Piggin wrote:
>
> No, you are thinking about what the kernel does. Subtle difference. A
> smart user wants to:
>
> - start writing this
> - start writing that
> - start writing that-other-thing
> - make sure this that and the other have reached backing store
>
> OK so in effect it is the same thing, but it is better to export the
> interface that reflects how the user interacts with pagecache.
>
> WRITE_SYNC obviously does the "wait for them all" (aka ensure they
> hit backing store) thing too, right? It performs exactly the same
> role that WRITE_WAIT would do in the above example.
NOOOOOO!
Think about it for a second. Think about the usage case you yourself were
quoting.
The "magic" in IO is "overlapping IO". If you don't get overlapping IO,
your interfaces are broken. End of story.
And WRITE_SYNC _cannot_ do overlapping IO.
It's entirely possible that somebody else (or that very same program) has
dirtied the same pages that you started write-out on earlier. And that is
when "wait for writes to finish" and "WRITE_SYNC" _differ_.
If you want synchronous writes, use synchronous writes. But if you want
asynchronous writes, you do _not_ implement them as "start writes now" and
"write synchronously". You implement them as "start writes now" and "wait
for the writes to have finished".
There's another very specific and important difference: "wait for the
writes" is fundamentally an interruptible and pollable operation, which
means that it's a lot easier to integrate into any system that has to do
other things too. In contrast, WRITE_SYNC is _neither_ easily
interruptible nor pollable.
So WRITE_SYNC has clearly different behaviour. There's a good reason the
kernel internally has "start write" + "wait for write", and I'll repeat:
none of those reasons go away just because you move to user space.
> My proposal isn't really different to Andrew's in terms of functionality
> (unless I've missed something), but it is more consistent because it
> does not introduce this completely new concept to our userspace API but
> rather uses the SYNC/ASYNC distinction like everything else.
Your proposal has two _huge_ downsides:
- it changes semantics, and you have absolutely _no_ idea of who depends
on the performance semantics of the old behaviour. In contrast, I can
tell you that we did it once before, and we reverted it.
- it's not at all consistent. The _current_ behaviour is consistent, and
matches 100% the current behaviour of sync vs async write().
I really don't see the point.
Linus
Arrgh. You're being thick. So I'm going to try to be very clear.
> But that's what MS_SYNC is. MS_SYNC says "I need this data written now".
>
> MS_ASYNC moves it into the page cache. That makes 100% sense. Then it will
> be written by the regular dirty page writeout. That makes 100% sense.
No. MS_ASYNC says "I need the data written now.". MS_SYNC says
"I need the data to have been written." Notice the difference:
one is in the future tense and one is in the past tense.
One is "get to work" and the other is "are you done yet?"
>> I don't think there's anything wrong with your fadvise additions.
>> I'd rather see MS_ASYNC start IO immediately and add another MS_
>> flag for Linux to propogate bits.
> Why? I miss the _reason_ you want to do this.
I believe we all agree that MS_ASYNC is at most a performance hint.
It doesn't give any firm guarantee about when the write will happen, just
"soon". Deleting every msync(MS_ASYNC) from a program cannot make it
buggier, just possibly slower.
Further, among the various discussions, we have identified two possible
cases where someone might want to give a hint to the operating system
that it would be a Good Idea to copy some dirty data to backing storage:
1) The application is done writing the data, and it's just a hint to
the VM system that there's no sense procrastinating. This is purely
as a kindness to the VM system; the application never intends to check
up on the write with MS_SYNC. The data is still wanted for read,
so MADV_DONTNEED would be inappropriate.
2) The application is going to invoke MS_SYNC some time in the
future and it would appreciate it if the job were already started.
(If you can think of a third, please mention it.)
Moving the data to the page cache addresses #1.
I want to address #2.
As you yourself have pointed out, there are zero promises unless you
follow up with MS_SYNC or equivalent. If you don't, all you're doing
is offering a clue to the VM system. That's a very optional hint;
It'll get around to cleaning the page itself if it needs the space.
But now suppose we *do* follow up with MS_SYNC. In this case,
the hint given by MS_ASYNC is a little more pointed: I am going
to need this write completed soon, so don't delay.
The only thing we can argue about here is the granularity. Is buffering
the hinted data for 5 to 30 seconds to do a bulk write appropriate?
Will it be at least that long before the MS_SYNC request arrives? Or,
as in my application, are times well under 1 second more common?
My opinion is that people don't like waiting 5 seconds for computers
to do their stuff. Not a lot of applications take that long to
generate all the data they're going to.
Now, I'm not saying that both of these can't be useful, and for the
first, just marking the page cache dirty isn't good enough.
But if you read the standard definition of MS_ASYNC, it seems absolutely
crystal "anybody who can't see this is an illiterate moron" clear
that MS_ASYNC is described as useful for use case #2.
If you want to add support for case #1 with a longer timeout and big
batches, then you'll have to add another option. I might point out
that msync() with flags = 0 has done that on Linux for a while.
But if you're providing separate support for both use cases, then
please RTFS and notice which one is closer to the documented
behaviour of MS_ASYNC and thereby deserves the standard flag name.
Here's a quote to help you, from IEEE STS 1003.1-2001:
# When MS_ASYNC is specified, msync() shall return immediately once all
# the write operations are initiated or queued for servicing
You can language lawyer if you like, but when I tell you to "buy a
ticket or join the queue", I expect you to be waiting in the queue to
buy tickets, not some other, slower queue. And I expect you to know
that unless you're deliberately being difficult.
> The current MS_ASYNC behaviour is the sane one. It's the one that doesn't
> cause the harddisk to start ticking senselessly. It's the one that allows
> a person on a laptop to say "don't write dirty data every 5 seconds - do
> it just every hour".
It's not sane, it's just useless. What application is going to wait
even 5 seconds to follow up with MS_SYNC? Software timeouts are
a bit shorter than the snooze button on your alarm clock.
> In contrast, _your_ proposal is just inflexible and inconvenient.
I can't comment on flexibility, but it's very convenient for a
clearly defined set of applications (which I happen to be maintaining
one of), and has the advantage of being specified in the relevant
Unix standards.
> If somebody really really wants to "start flushing data now", then he can
> do so, but that actually has absolutely zero to do with "msync()" any
> more. A person who wants the flushing to start "now" might want to flush
> any random dirty buffers.
No, they want to flush just the data that they're going to wait on the
completion of.
As I said, it's a poor man's asynchronous I/O. Full async I/O is
probably more flexible, but isn't widely deployed yet.
> Your suggestion is no different from saying "we should make every
> 'write()' call start the IO". Which is obviously crap.
NO, dammit! That would be the equivalent of saying that every memory write
to an mmapped page should start the I/O. Which is, indeed, obviously
crap. If you don't have any particular schedule for performing the
write-back, then don't do anything at all! The VM system will clean the
page when it needs the RAM for something else.
The only reason for calling msync(MS_ASYNC) is because I have a deadline
in mind, and I think that for an OS to assume that it doesn't need to take
action on that advance warning for 5 seconds or so is grossly overestimating
the time scales at which computers work these days.
Used as a basic async I/O primitive, MS_ASYNC lets you start multiple
writes, and then you can wait for completion with MS_SYNC without
forcing an execution order on the OS. If the data hasn't been dirtied
in between, MS_SYNC is just waiting for the in-progress I/O to
complete. (You can use MADV_WILLNEED similarly for reads.)
MS_ASYNC is all about performance. That's its only possible use.
Sticking a 5-second delay into a performance hint is the "obviously
crap" in this discussion.
Sheesh!
On Fri, 10 Feb 2006, Linus Torvalds wrote:
>
> So WRITE_SYNC has clearly different behaviour. There's a good reason the
> kernel internally has "start write" + "wait for write", and I'll repeat:
> none of those reasons go away just because you move to user space.
Btw, just to clarify: there _are_ things that do change when you go from
user space to kernel space. It's true that you lose some visibility, and
it's also true that the kernel has more than just "start write" semantics.
So the kernel actually has "start write, but don't wait for stuff that
has IO already pending", and "start write, and if writeback was active on
a re-dirtied page, wait for and re-start it".
I don't know if user space wants quite -that- much choice. The "start
write but ignore busy areas" doesn't actually make sense together with
"wait for it", since you don't know what (if any) you're really waiting
for.
So it's really three operations
- try to start flushing, so that you'll have less work pending later
- start flushing
- wait for any pending flush
[ From a pure "correctness" angle, we could say that "start flushing"
is the same as "wait for pending" + "try to start". However, the "IO
should overlap as much as possible" argument says that that is the
wrong thing to do, since we can start flushing non-pending IO before we
wait for the old pending one ]
Now, most user programs probably don't care one whit.
But I think Andrew's patch makes sense. It exposes the internal kernel
working in a logical fashion for people who do care. Yes, it's
Linux-specific, but hey, so is arguing about the exact semantics of
MS_INVALIDATE (which is version-specific).
Linus
Linus Torvalds wrote:
>
> On Sat, 11 Feb 2006, Nick Piggin wrote:
>
>>MS_INVALIDATE does that (in Linux),
>
>
> I don't actually think it does.
>
> In _current_ linux it does. In some other versions, it will have thrown
> the dirty data away. Also, it will make subsequent accesses much much more
> expensive - and it doesn't work on locked areas.
>
>
>> the spec is poorly worded but the
>>intention seems to be that it would push dirty state back into pagecache for
>>implementations such as ours.
>
>
> As an application writer, you'd be absolutely crazy to depend on that.
>
Either the older versions of Linux are totally broken WRT the spec, or
the spec totally broke compatibility. Either way I guess you would be
crazy to depend on that :(
>>[email protected] has an application (database or logging I think), which
>>uses MS_SYNC to provide integrity guarantees, however it is possible to do
>>useful work between the last write to memory and the commit point. MS_ASYNC
>>is used to start the IO and pipeline work.
>
>
> So you're saying that there is one application that knows it could use
> different semantics?
>
> Now, please enumerate all the applications that use MS_ASYNC and prefer
> the current semantics.
>
> When you know that, you have an argument.
>
I must have missed the post where you enumerated all said applications
when changing from 2.4 and 2.5.67 behaviour to current.
> In the meantime, you have an example of an application that wants _new_
> semantics.
>
2.4 semantics, actually. I have an example of a _regression_.
>
>>>The current MS_ASYNC behaviour is the sane one. It's the one that doesn't
>>>cause the harddisk to start ticking senselessly. It's the one that allows a
>>>person on a laptop to say "don't write dirty data every 5 seconds - do it
>>>just every hour".
>>
>>MS_INVALIDATE
>
>
> Repeating something doesn't make it so.
>
But it is so. Why did you change 2.0 semantics so much? Obviously because
it was broken WRT the spec - I can tell you right now there could have been
a whole lot of applications that preferred the semantics of just throwing
out the data because it is faster, so it wasn't that.
If you want to prove me wrong by quoting buggy behaviour from a 7 year old
kernel.... how am I supposed to argue with that?
>
>>>In contrast, _your_ proposal is just inflexible and inconvenient.
>>
>>Currently MS_ASYNC does the same as MS_INVALIDATE. But it used to start
>>IO (before 2.5.something), and apparently it does in Solaris as well.
>
>
> Actually, it did _not_ use to start IO.
>
> Then, somebody made it do so, and people eventually screamed, and it was
> reverted again.
>
> Go check Linux-2.0 or something. You'll also see the "MS_INVALIDATE means
> throw the dirty bit away" behaviour.
>
Sounds like someone else must have screamed in 2.0 because it was buggy
and the behaviour was changed to match standards for 2.4 and AFAIKS 2.2 does
the same (although I'm not so good at reading 2.2 source).
So those people who didn't like it must have been screaming for a long long
time until it was finally changed in 2.5.68. Unfortunately we have someone
else screaming now (and two years ago) because of the most recent change.
> The _sane_ semantics are that if you say "MS_INVALIDATE" the dirty bit is
> just thrown away. If you say "MS_INVALIDATE | MS_ASYNC", the dirty bit is
> saved in the page cache and then the page is unmapped. And MS_SYNC
> obviously does the same thing, except it also waits for it.
>
They may sound sane to you but if you go throwing away the dirty bit
against the standards then it is very broken.
>>>If somebody really really wants to "start flushing data now", then he can do
>>>so, but that actually has absolutely zero to do with "msync()" any more. A
>>>person who wants the flushing to start "now" might want to flush any random
>>>dirty buffers.
>>
>>I didn't quite understand what you're saying here.
>
>
> I'm saying that "start flushing now" has _zero_ to do with an mmap.
>
> It's a perfectly valid operation after a _write_ call too - even if you
> never mmaped the area at all.
>
> So if somebody wants to start background IO, what has that got to do with
> msync()?
>
It seems very obvious to me that it is a hint. If you were expecting
to call msync(MS_SYNC) at some point, then you could hope that hinting
with msync(MS_ASYNC) at some point earlier might improve its efficiency.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
On Fri, 10 Feb 2006, [email protected] wrote:
>
> No. MS_ASYNC says "I need the data written now.".
Says you.
I say (and I have a decade of Linux historical behaviour to back it up)
that is says "I'm done, start flushing this out asynchronously like all
the other data I have written".
And yes, there are performance implications. But your claim that "start IO
now" performs better is bogus. It _sometimes_ performs better, but
sometimes performs much worse.
Take an example. You have a 200MB dirty area in a 1GB machine. You do
MS_ASYNC. What do you want to happen?
Do you want IO to be started on all of it? That's going to take quite a
while, and be really nasty for the system. Or do you want it to be
gracefully buffered out, the way we do all normal background writes?
"Performance" is very much not just about how fast it hits the platter.
Linus
Linus Torvalds wrote:
>
> On Sat, 11 Feb 2006, Nick Piggin wrote:
>
>>No, you are thinking about what the kernel does. Subtle difference. A
>>smart user wants to:
>>
>>- start writing this
>>- start writing that
>>- start writing that-other-thing
>>- make sure this that and the other have reached backing store
>>
>>OK so in effect it is the same thing, but it is better to export the
>>interface that reflects how the user interacts with pagecache.
>>
>>WRITE_SYNC obviously does the "wait for them all" (aka ensure they
>>hit backing store) thing too, right? It performs exactly the same
>>role that WRITE_WAIT would do in the above example.
>
>
> NOOOOOO!
>
> Think about it for a second. Think about the usage case you yourself were
> quoting.
>
> The "magic" in IO is "overlapping IO". If you don't get overlapping IO,
> your interfaces are broken. End of story.
>
> And WRITE_SYNC _cannot_ do overlapping IO.
>
What do you mean by overlapping?
fadvise(fd, 100, 200, FADV_WRITE_ASYNC);
fadvise(fd, 300, 400, FADV_WRITE_ASYNC);
fadvise(fd, 100, 200, FADV_WRITE_SYNC);
fadvise(fd, 300, 400, FADV_WRITE_SYNC);
Will do exactly the same as Andrew's
fadvise(fd, 100, 200, FADV_ASYNC_WRITE);
fadvise(fd, 300, 400, FADV_ASYNC_WRITE);
fadvise(fd, 100, 200, FADV_WRITE_WAIT);
fadvise(fd, 300, 400, FADV_WRITE_WAIT);
> It's entirely possible that somebody else (or that very same program) has
> dirtied the same pages that you started write-out on earlier. And that is
> when "wait for writes to finish" and "WRITE_SYNC" _differ_.
>
Yeah they do differ but if you are using sync writes then you obviously
have some data integrity requirements and you _know_ who is writing to
your file and when. That's my point. You're thinking kernel mode. The
userspace requirement for sync writes is "this has reached backing store".
> If you want synchronous writes, use synchronous writes. But if you want
> asynchronous writes, you do _not_ implement them as "start writes now" and
> "write synchronously". You implement them as "start writes now" and "wait
> for the writes to have finished".
>
_You_ do, yes. You are a kernel hacker. You implement synchronous writes.
Implementing synchronous writes is what you do.
Userspace does not care. They use synchronous writes to guarantee it has
hit backing store. They've managed quite nicely up until now without having
your implementation details exposed to them (when is a page dirty? when is
it "under writeout"? who cares? I just want to know if it is on backing
store or not).
> There's another very specific and important difference: "wait for the
> writes" is fundamentally an interruptible and pollable operation, which
> means that it's a lot easier to integrate into any system that has to do
> other things too. In contrast, WRITE_SYNC is _neither_ easily
> interruptible nor pollable.
>
It is just as easy as WRITE_WAIT to do both. In the pollable case you
just need another flag to say you don't want to block, same as would
be required for WRITE_WAIT.
Seems like you're clutching for straws here.
> So WRITE_SYNC has clearly different behaviour. There's a good reason the
> kernel internally has "start write" + "wait for write", and I'll repeat:
> none of those reasons go away just because you move to user space.
>
>
>>My proposal isn't really different to Andrew's in terms of functionality
>>(unless I've missed something), but it is more consistent because it
>>does not introduce this completely new concept to our userspace API but
>>rather uses the SYNC/ASYNC distinction like everything else.
>
>
> Your proposal has two _huge_ downsides:
>
I was still talking about new additions to fadvise here, not the msync stuff.
> - it changes semantics, and you have absolutely _no_ idea of who depends
> on the performance semantics of the old behaviour. In contrast, I can
> tell you that we did it once before, and we reverted it.
>
> - it's not at all consistent. The _current_ behaviour is consistent, and
> matches 100% the current behaviour of sync vs async write().
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
On Sat, 11 Feb 2006, Nick Piggin wrote:
>
> It seems very obvious to me that it is a hint. If you were expecting
> to call msync(MS_SYNC) at some point, then you could hope that hinting
> with msync(MS_ASYNC) at some point earlier might improve its efficiency.
And it will. MS_ASYNC tells the system about dirty pages. It _should_
actually initiate writeback if the system decides that it has lots of
dirty pages. Of course, if the system doesn't have a lot of dirty pages,
the kernel will decide that no writeback is necessary.
If you (as an application) know that you will wait for the IO later (which
is _not_ what MS_ASYNC talks about), why don't you just start it?
ie what's wrong with Andrew's patch which is what I also encourage?
I contend that "mmap + MS_ASYNC" should work as "write()". That's just
_sensible_.
Btw, you can equally well make the argument that "write()" is a hint that
we should start IO, so that if we do fdatasync() later, it will finish
more quickly. It's _true_. It just isn't the whole truth. It makes things
_slower_ if you don't do fdatasync(), the same way you can do MS_ASYNC
without doing MS_SYNC afterwards.
Now, if your argument is more general, aka "we should do better at
writeback in general", I actually wouldn't disagree. We probably _should_
do better at write-back. The "sync every five seconds" causes pulses of
(efficient) IO, but it also allows for lots of dirty stuff to have
collected for no good reason, and causes bad IO latency for reads when it
happens.
So if you were to argue _in_general_ for smoother write-back, I wouldn't
actually object at all. I think it would potentially make much sense to
make both "write()" _and_ things like msync(MS_ASYNC) perhaps see if the
IO queue has been idle for a second, and if so, start trickling writes
out.
I bet that would be lovely. I hate how un-tarring a big tree tends to have
these big hickups, and "vmstat 1" shows that the disk isn't even writing
all the time until half-way through the "untar".
IOW, I think you could re-phrase your argument in a more generic way, and
I might well _agree_ with it. I just don't think it has anything to do
with MS_ASYNC _in_particular_.
Linus
Linus Torvalds wrote:
>
> On Sat, 11 Feb 2006, Nick Piggin wrote:
>
>>It seems very obvious to me that it is a hint. If you were expecting
>>to call msync(MS_SYNC) at some point, then you could hope that hinting
>>with msync(MS_ASYNC) at some point earlier might improve its efficiency.
>
>
> And it will. MS_ASYNC tells the system about dirty pages. It _should_
> actually initiate writeback if the system decides that it has lots of
> dirty pages. Of course, if the system doesn't have a lot of dirty pages,
> the kernel will decide that no writeback is necessary.
>
> If you (as an application) know that you will wait for the IO later (which
> is _not_ what MS_ASYNC talks about), why don't you just start it?
>
It depends how you interpret the standards and what you think sensible
behaviour would be, I guess (obviously our current MS_ASYNC is not
technically buggy, we're arguing about whether or not it is suboptimal).
But given that there is an MS_INVALIDATE (I interpret mmap + MS_INVALIDATE
should work as write()), and that one would _expect_ MS_ASYNC to closely
match MS_SYNC, I think MS_ASYNC should start writeout straight away.
The fact that we've historically had a buggy MS_INVALIDATE implementation
is a non argument when it comes to the interpretation of the standards.
> ie what's wrong with Andrew's patch which is what I also encourage?
>
> I contend that "mmap + MS_ASYNC" should work as "write()". That's just
> _sensible_.
>
> Btw, you can equally well make the argument that "write()" is a hint that
> we should start IO, so that if we do fdatasync() later, it will finish
> more quickly. It's _true_. It just isn't the whole truth. It makes things
> _slower_ if you don't do fdatasync(), the same way you can do MS_ASYNC
> without doing MS_SYNC afterwards.
>
I wouldn't argue that because I don't agree with your contention. I
argue that MS_ASYNC should do as much of the work of MS_SYNC as possible,
without blocking.
From the standard (msync):
Description
The msync() function shall write all modified data to permanent storage
locations...
When MS_ASYNC is specified, msync() shall return immediately once all
the write operations are initiated or queued for servicing;
It is talking about write operations, not dirtying. Actually the only
difference with MS_SYNC is that it waits for said write operations (of the
type queued up by MS_ASYNC) to complete.
So our current MS_ASYNC behaviour might technically not violate a standard
(depending on what you consider initiating / queueing writes), but it would
be akin to having MS_SYNC waiting for pages to become clean without actually
starting the writeout either (which is likewise inefficient but technically
correct).
[snip smooth writeback]
That would be a nice thing yes, but again I don't agree that MS_ASYNC
is semantically equivalent to write()
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
On Sat, 11 Feb 2006, Nick Piggin wrote:
>
> What do you mean by overlapping?
I'm just talking about the "same area gets re-dirtied while it's already
busy being written". Depending on the _program_, this:
- never happens in practice
- is very common
- should just leave the page dirty
- should always start a new IO (waiting for the old one first, or use a
barrier if you want to be fancy).
Let's do a more hands-on example, just to make it less abstract.
- let's say that you have some kind of file-backed storage, and you
basically want to let the kernel know about your modifications, so that
it can DTRT (and let's ignore what the "right thing" is for a moment)
- The "dirty" bit very fundamentally is obviously at a page granularity,
but your data may well be at a much finer granularity. In particular,
your data may be a log that keeps growing.
- So let's say that you append to the log, and choose (for some reason,
never mind) to let the kernel know. So you effectively do something
like
memcpy(logptr, newentry, newentrysize);
logptr = logptr + newentrysize;
if (time_to_msync) {
msync(msyncptr, logptr - msyncptr, MS_ASYNC);
msyncptr = logptr;
}
Ok?
Now, the question is, what do we want to happen at the MS_ASYNC.
In particular, what happens if the _previous_ MS_ASYNC had started the IO
(either directly, like in your world, or by bdflush just picking it up,
it really doesn't matter) on the _previous_ old end of the log area, so
the partial page at the old "msyncptr" point may actually be under IO
still.
We have multiple choices:
- we ignore the issue (which is what the current behaviour for MS_ASYNC
is, since it just marks things dirty in the page cache)
- we mark the page dirty, but we don't start IO on it, since it's busy
(and since it's dirty, it will _eventually_ get written out)
- we actually wait for the old IO, in order to start IO on it again.
Now, I don't think that the third option is sane for MS_ASYNC (ie I don't
think even you want -that- behaviour), but in general, all these three
choices are actually sane. Notice how none of them actually involve
waiting for the new _result_. It's only a question about whether to wait
for an old write when we start a new one, or leave the new one entirely
_unstarted_.
> fadvise(fd, 100, 200, FADV_ASYNC_WRITE);
> fadvise(fd, 300, 400, FADV_ASYNC_WRITE);
> fadvise(fd, 100, 200, FADV_WRITE_WAIT);
> fadvise(fd, 300, 400, FADV_WRITE_WAIT);
I'm saying that a valid pattern is
.. dirty offset 100-200 ..
fadvice(fd, 100, 200, FADV_WRITE_START_TRY);
.. dirty offset 200-300 ..
fadvice(fd, 200, 300, FADV_WRITE_START_TRY);
.. dirty offset 300-400 ..
fadvice(fd, 300, 400, FADV_WRITE_START_TRY);
is a valid thing to do ("try to start IO, but don't guarantee it") as a
way to get things going. But that would never pair up with a "wait for
IO", because there's no guarantee that the IO got started (for example, we
may have started the IO when only bytes 100-200 were dirty, then we
dirtied the other bytes, but we didn't re-start the IO for them because
the previous IO to the same page was still pending, so the bytes never hit
storage and they aren't even outstanding).
But the "FADV_WRITE_START_TRY" is actually the best thing if what you are
trying to do is to keep changes _minimal_ so that when you later actually
finish the whole thing, you can do
fadvice(fd, 100, 400, FADV_WRITE_WAIT);
which is your "write and wait".
So far so good, and we don't actually care. The unconditional "write and
wait" at the end means that it's irrelevant whether the "START_TRY" thing
actually started the IO or not - the START_TRY thing _can_ be a no-op if
you want to.
These sound like the semantics you want. No?
And yes, I'm perfectly happy with them. I think this is what people would
do. I just wanted to make sure that we're AWARE of the fact that it
implies that the ASYNC thing wouldn't necessarily always even start IO.
And the reason I wanted to make sure of that is that the whole thread
started from you complaining about MS_ASYNC not starting the IO. I'm
saying that if you _require_ starting of IO, then the FADV_WRITE_WAIT
actually sensibly has different semantics, which can be a lot cheaper to
do in the presence of other writers (ie then the write-wait would only
need to wait for any outstanding IO, not start writing out stuff that
somebody else had written).
And the reason I wanted to take up the semantic difference is because
there _are_ semantic differences.
If you only "commit" things when you have nothing dangling, you'll see the
above patterns. But it's a valid thing to commit things after you've made
"further" log changes (that you're _not_ ready to commit). For example,
say that your log is really dirtying all the time, but you synchronize it
at certain points and write the pointer to the synchronized state
somewhere else. What would you do?
Your pattern would actually be
.. dirty offset 100-200 ..
fadvice(fd, 100, 200, FADV_WRITE_START);
.. dirty offset 200-300 ..
fadvice(fd, 200, 300, FADV_WRITE_START);
.. dirty offset 300-400 ..
fadvice(fd, 300, 400, FADV_WRITE_START);
.. dirty offset 400-415 .. (for the next transaction)
fadvice(fd, 100, 400, FADV_JUST_WAIT); (for the previous one)
and here is where the semantics differ. The "always start IO, and just
wait for IO" won't be waiting for the partial stuff (that doesn't matter).
While the "write and wait" would synchronously write stuff that we just
don't care about (just because they happen to be on the same "IO
granularity" block).
This "unconditional write start" + "unconditional wait only" pattern in
theory allows you to optimize all the IO patterns by hand, and have less
synchronous waits, because it wouldn't wait for state that is dirty, but
that doesn't matter.
But as long as people are _aware_ of this issue, I don't much care.
Linus
Linus Torvalds wrote:
>
> On Fri, 10 Feb 2006, [email protected] wrote:
>
>>No. MS_ASYNC says "I need the data written now.".
>
>
> Says you.
>
> I say (and I have a decade of Linux historical behaviour to back it up)
> that is says "I'm done, start flushing this out asynchronously like all
> the other data I have written".
>
> And yes, there are performance implications. But your claim that "start IO
> now" performs better is bogus. It _sometimes_ performs better, but
> sometimes performs much worse.
>
> Take an example. You have a 200MB dirty area in a 1GB machine. You do
> MS_ASYNC. What do you want to happen?
>
It quite obviously depends on the context in which one is using it,
which will depend on what one expects it to do (unless one is an idiot).
If [email protected]'s[1] database has dirtied 200MB of data and
knows it will not dirty it again and has several hundred ms of useful
work to do before it must call MS_SYNC, then...
> Do you want IO to be started on all of it?
... yes.
[1] Come on, linux, can you at least make up a name for me, or are
you really called Linux? (in which case you'd better make up a
new name anyway when arguing with Linus about Linux, for the
sake of everyone's sanity)
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
On Sat, 11 Feb 2006, Nick Piggin wrote:
>
> When MS_ASYNC is specified, msync() shall return immediately once all
> the write operations are initiated or queued for servicing;
>
> It is talking about write operations, not dirtying. Actually the only
> difference with MS_SYNC is that it waits for said write operations (of the
> type queued up by MS_ASYNC) to complete.
Right. And it's what we do. We queue them by moving the pages to the dirty
lists (yeah, it's just a tag on the page index thing, whatever).
And yes, you argue that we should move the queue closer to the actual
disk, but I have used at least one app that really hated the "start IO
now" approach. I can't talk about that app in any detail, but I can say
that it was an in-memory checkpoint thing with the checkpoints easily
being in the hundred-meg range.
And moving a hundred megs to the IO layer is insane. It also makes the
system pretty unusable.
So we may have different expectations, because we've seen different
patterns. Me, I've seen the "events are huge, and you stagger them", so
that the previous event has time to flow out to disk while you generate
the next one. There, MS_ASYNC starting IO is _wrong_, because the scale of
the event is just huge, so trying to push it through the IO subsystem asap
just makes everything suck.
In contrast, you seem to be coming at it from a standpoint of "only one
event ever outstanding at any particular time, and it's either small or
it's the only thing the whole system is doing". In which case pushing it
out to IO buffers is probably the right thing to do.
The reason I like the current MS_ASYNC is that it _allows_ both. Once you
push it to the page cache, you can choose to push it closer to the IO path
if you want to. In contrast, if MS_ASYNC pushes it directly into the IO
queues, you're screwed. You can't take it back. You don't have any choice.
Linus
Linus Torvalds wrote:
>
> On Sat, 11 Feb 2006, Nick Piggin wrote:
>
>>What do you mean by overlapping?
>
>
> I'm just talking about the "same area gets re-dirtied while it's already
> busy being written". Depending on the _program_, this:
OK, that's what I thought.
> - So let's say that you append to the log, and choose (for some reason,
> never mind) to let the kernel know. So you effectively do something
> like
>
> memcpy(logptr, newentry, newentrysize);
> logptr = logptr + newentrysize;
> if (time_to_msync) {
> msync(msyncptr, logptr - msyncptr, MS_ASYNC);
> msyncptr = logptr;
> }
>
> Ok?
>
> Now, the question is, what do we want to happen at the MS_ASYNC.
>
Well it is all very well to just make up this case but I don't
see what it proves (eg. some application may actually work better
if creat is implemented with unlink to be extreme). But I'll try
to humour you :)
Being a logging program it would appear to have some data integrity
requirements, and as such I would guess it is going to use MS_SYNC
in the very near future before writing another entry to the log
(in case a crash happens while generating the next entry).
However the fact the MS_ASYNC is even called in the first place
indicates to me that there must be some window before the MS_SYNC
point (for whatever reason). So I would really want MS_ASYNC to
actually send the page to backing store asap in order to get some
pipelining going.
> In particular, what happens if the _previous_ MS_ASYNC had started the IO
> (either directly, like in your world, or by bdflush just picking it up,
> it really doesn't matter) on the _previous_ old end of the log area, so
> the partial page at the old "msyncptr" point may actually be under IO
> still.
>
I wouldn't expect any IO there at all because there would be no
"dangling" MS_ASYNC, and there would be no random clowns writing
to our very important log. But just on the off chance that there
was some IO going on:
> We have multiple choices:
> - we ignore the issue (which is what the current behaviour for MS_ASYNC
> is, since it just marks things dirty in the page cache)
> - we mark the page dirty, but we don't start IO on it, since it's busy
> (and since it's dirty, it will _eventually_ get written out)
> - we actually wait for the old IO, in order to start IO on it again.
>
> Now, I don't think that the third option is sane for MS_ASYNC (ie I don't
> think even you want -that- behaviour), but in general, all these three
> choices are actually sane. Notice how none of them actually involve
> waiting for the new _result_. It's only a question about whether to wait
> for an old write when we start a new one, or leave the new one entirely
> _unstarted_.
>
3 is obviously wrong because it blocks. 1 is what we have now which
I'm arguing against (on efficiency grounds). So that leaves us with 2,
which is an acceptable compromise for a situation which isn't likely
to come up much with a well coded app.
>
>>fadvise(fd, 100, 200, FADV_ASYNC_WRITE);
>>fadvise(fd, 300, 400, FADV_ASYNC_WRITE);
>>fadvise(fd, 100, 200, FADV_WRITE_WAIT);
>>fadvise(fd, 300, 400, FADV_WRITE_WAIT);
>
>
> I'm saying that a valid pattern is
>
> .. dirty offset 100-200 ..
> fadvice(fd, 100, 200, FADV_WRITE_START_TRY);
>
> .. dirty offset 200-300 ..
> fadvice(fd, 200, 300, FADV_WRITE_START_TRY);
>
> .. dirty offset 300-400 ..
> fadvice(fd, 300, 400, FADV_WRITE_START_TRY);
>
> is a valid thing to do ("try to start IO, but don't guarantee it") as a
> way to get things going. But that would never pair up with a "wait for
> IO", because there's no guarantee that the IO got started (for example, we
> may have started the IO when only bytes 100-200 were dirty, then we
> dirtied the other bytes, but we didn't re-start the IO for them because
> the previous IO to the same page was still pending, so the bytes never hit
> storage and they aren't even outstanding).
>
> But the "FADV_WRITE_START_TRY" is actually the best thing if what you are
> trying to do is to keep changes _minimal_ so that when you later actually
> finish the whole thing, you can do
>
> fadvice(fd, 100, 400, FADV_WRITE_WAIT);
>
> which is your "write and wait".
>
[argh! that was actually Andrew's "wait for writeout", but OK ;)]
> So far so good, and we don't actually care. The unconditional "write and
> wait" at the end means that it's irrelevant whether the "START_TRY" thing
> actually started the IO or not - the START_TRY thing _can_ be a no-op if
> you want to.
>
> These sound like the semantics you want. No?
>
Yes (and I'd likewise argue that an efficient FADV_WRITE_START_TRY
implementation should really try to get IO going. Ie. exactly what
I'm arguing for MS_ASYNC).
> And yes, I'm perfectly happy with them. I think this is what people would
> do. I just wanted to make sure that we're AWARE of the fact that it
> implies that the ASYNC thing wouldn't necessarily always even start IO.
>
True. I believe our MS_ASYNC is technically within the standards.
I think it is suboptimal for sane users and against the spirit
of the spec.
> And the reason I wanted to make sure of that is that the whole thread
> started from you complaining about MS_ASYNC not starting the IO. I'm
> saying that if you _require_ starting of IO, then the FADV_WRITE_WAIT
> actually sensibly has different semantics, which can be a lot cheaper to
> do in the presence of other writers (ie then the write-wait would only
> need to wait for any outstanding IO, not start writing out stuff that
> somebody else had written).
>
> And the reason I wanted to take up the semantic difference is because
> there _are_ semantic differences.
>
> If you only "commit" things when you have nothing dangling, you'll see the
> above patterns. But it's a valid thing to commit things after you've made
> "further" log changes (that you're _not_ ready to commit). For example,
I don't think so in general because userspace can't guarantee something
*is not* sent to backing store, only that it *is*.
> say that your log is really dirtying all the time, but you synchronize it
> at certain points and write the pointer to the synchronized state
> somewhere else. What would you do?
>
> Your pattern would actually be
>
> .. dirty offset 100-200 ..
> fadvice(fd, 100, 200, FADV_WRITE_START);
>
> .. dirty offset 200-300 ..
> fadvice(fd, 200, 300, FADV_WRITE_START);
>
> .. dirty offset 300-400 ..
> fadvice(fd, 300, 400, FADV_WRITE_START);
>
> .. dirty offset 400-415 .. (for the next transaction)
>
- IOW if the app or OS crashed here it would be possible to see 400-415 on
the disk and none of the previous transactions (assuming we don't know
the page size).
- If you are saying that the app does know the page size, then it is
obvious that it is by no stretch of the imagination hand optimising
IO, because it will have started 3 different IOs for the same page.
- Or (final option) only the first fadvise started IO, then any or all
of the subsequent transactions might not be synched after FADV_JUST_WAIT
(depending on what the DMA to disk saw).
> fadvice(fd, 100, 400, FADV_JUST_WAIT); (for the previous one)
>
> and here is where the semantics differ. The "always start IO, and just
> wait for IO" won't be waiting for the partial stuff (that doesn't matter).
> While the "write and wait" would synchronously write stuff that we just
> don't care about (just because they happen to be on the same "IO
> granularity" block).
>
> This "unconditional write start" + "unconditional wait only" pattern in
> theory allows you to optimize all the IO patterns by hand, and have less
> synchronous waits, because it wouldn't wait for state that is dirty, but
> that doesn't matter.
>
I'm not convinced. Your above example was bogus.
> But as long as people are _aware_ of this issue, I don't much care.
>
> Linus
>
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Am Freitag, 10. Februar 2006 20:05 schrieb Linus Torvalds:
> So we may have different expectations, because we've seen different
> patterns. Me, I've seen the "events are huge, and you stagger them", so
> that the previous event has time to flow out to disk while you generate
> the next one. There, MS_ASYNC starting IO is _wrong_, because the scale of
> the event is just huge, so trying to push it through the IO subsystem asap
> just makes everything suck.
Isn't the benefit of starting writing immediately greater the smaller
the area in question? If so, couldn't a heuristic be found to decide whether
to initiate IO at once?
Oliver
On Sat, 11 Feb 2006, Nick Piggin wrote:
> >
> > Your pattern would actually be
> >
> > .. dirty offset 100-200 ..
> > fadvice(fd, 100, 200, FADV_WRITE_START);
> >
> > .. dirty offset 200-300 ..
> > fadvice(fd, 200, 300, FADV_WRITE_START);
> >
> > .. dirty offset 300-400 ..
> > fadvice(fd, 300, 400, FADV_WRITE_START);
> >
> > .. dirty offset 400-415 .. (for the next transaction)
> >
>
> - IOW if the app or OS crashed here it would be possible to see 400-415 on
> the disk and none of the previous transactions (assuming we don't know
> the page size).
If the app/OS crashed here, nothing would matter. We haven't committed
anything at all yet. We've just started the IO. What is at 400-415 simply
doesn't matter, because nobody would have any reason to look at it.
(Besides, it's not at all clear that 400-415 would or would not be on
disk. It depends entirely on timing and buffering of the IO system at
that point - the fact that it's dirty in memory doesn't mean that it ever
made it into the IO buffer that was started).
> > fadvice(fd, 100, 400, FADV_JUST_WAIT); (for the previous one)
This is the one that waits for it to finish, so _now_ we can update the
pointers (elsewhere) to that log (and if the app/OS crashes before that,
nobody will even know about it).
See?
> I'm not convinced. Your above example was bogus.
No, your understanding was incomplete. I'm talking about just parts of a
much bigger transaction.
A single write on its own is almost never a transaction unless your system
is _purely_ log-based (which it could be, of course. Not in my example).
Linus
Linus Torvalds wrote:
>
> On Sat, 11 Feb 2006, Nick Piggin wrote:
>
>>>Your pattern would actually be
>>>
>>> .. dirty offset 100-200 ..
>>> fadvice(fd, 100, 200, FADV_WRITE_START);
>>>
>>> .. dirty offset 200-300 ..
>>> fadvice(fd, 200, 300, FADV_WRITE_START);
>>>
>>> .. dirty offset 300-400 ..
>>> fadvice(fd, 300, 400, FADV_WRITE_START);
>>>
>>> .. dirty offset 400-415 .. (for the next transaction)
>>>
>>
>>- IOW if the app or OS crashed here it would be possible to see 400-415 on
>>the disk and none of the previous transactions (assuming we don't know
>>the page size).
>
>
> If the app/OS crashed here, nothing would matter. We haven't committed
> anything at all yet. We've just started the IO. What is at 400-415 simply
> doesn't matter, because nobody would have any reason to look at it.
>
> (Besides, it's not at all clear that 400-415 would or would not be on
> disk. It depends entirely on timing and buffering of the IO system at
> that point - the fact that it's dirty in memory doesn't mean that it ever
> made it into the IO buffer that was started).
>
>
>>> fadvice(fd, 100, 400, FADV_JUST_WAIT); (for the previous one)
>
>
> This is the one that waits for it to finish, so _now_ we can update the
> pointers (elsewhere) to that log (and if the app/OS crashes before that,
> nobody will even know about it).
>
> See?
>
Well in that case in your argument your FADV_WRITE_START is of
the "waits for writeout then starts writeout if dirty" type.
In which case you've just made 3 consecutive write+wait cycles
to the same page, so it is hardly an optimal IO pattern.
>
>>I'm not convinced. Your above example was bogus.
>
>
> No, your understanding was incomplete. I'm talking about just parts of a
> much bigger transaction.
>
> A single write on its own is almost never a transaction unless your system
> is _purely_ log-based (which it could be, of course. Not in my example).
>
You were saying that your above sequence would be more efficient
if implemented with "always start IO, and just wait for IO", because
"write and wait" would do 2 write+wait cycles.
However "always start IO, and just wait for IO" does 3 write+wait cycles.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
On Fri, 10 Feb 2006, Oliver Neukum wrote:
>
> Am Freitag, 10. Februar 2006 20:05 schrieb Linus Torvalds:
> > So we may have different expectations, because we've seen different
> > patterns. Me, I've seen the "events are huge, and you stagger them", so
> > that the previous event has time to flow out to disk while you generate
> > the next one. There, MS_ASYNC starting IO is _wrong_, because the scale of
> > the event is just huge, so trying to push it through the IO subsystem asap
> > just makes everything suck.
>
> Isn't the benefit of starting writing immediately greater the smaller
> the area in question? If so, couldn't a heuristic be found to decide whether
> to initiate IO at once?
Quite possibly. I suspect you could/should take other issues into account
too (like whether the queue to the device is busy or bdflush is already
working).
I wouldn't object to that.
Linus
On Sat, 11 Feb 2006, Nick Piggin wrote:
>
> Well in that case in your argument your FADV_WRITE_START is of
> the "waits for writeout then starts writeout if dirty" type.
>
> In which case you've just made 3 consecutive write+wait cycles
> to the same page, so it is hardly an optimal IO pattern.
The point is, this is the interface that an app would want to use if they
want _perfect_ IO patterns.
Obviously, such an app wouldn't do writes every 100 bytes (or would do
them only if it knows that enough time has passed that the previous IO
will be done - but it can't _risk_ dropping an IO if something strange
happens).
The point being the ".. it might have dirtied the page since its last
WRITE_START" thing. That's where it can very validly basically say "ok, I
now need for my last write to have finished, but I don't care about the
fact that I've made other changes _since_ in that same page". See?
Linus
Linus Torvalds wrote:
>
> On Sat, 11 Feb 2006, Nick Piggin wrote:
>
>> When MS_ASYNC is specified, msync() shall return immediately once all
>> the write operations are initiated or queued for servicing;
>>
>>It is talking about write operations, not dirtying. Actually the only
>>difference with MS_SYNC is that it waits for said write operations (of the
>>type queued up by MS_ASYNC) to complete.
>
>
> Right. And it's what we do. We queue them by moving the pages to the dirty
> lists (yeah, it's just a tag on the page index thing, whatever).
>
> And yes, you argue that we should move the queue closer to the actual
> disk, but I have used at least one app that really hated the "start IO
> now" approach. I can't talk about that app in any detail, but I can say
> that it was an in-memory checkpoint thing with the checkpoints easily
> being in the hundred-meg range.
>
Hey fix your damn broken proprietary app (nah just kidding)
> And moving a hundred megs to the IO layer is insane. It also makes the
> system pretty unusable.
>
> So we may have different expectations, because we've seen different
> patterns. Me, I've seen the "events are huge, and you stagger them", so
> that the previous event has time to flow out to disk while you generate
> the next one. There, MS_ASYNC starting IO is _wrong_, because the scale of
> the event is just huge, so trying to push it through the IO subsystem asap
> just makes everything suck.
>
> In contrast, you seem to be coming at it from a standpoint of "only one
> event ever outstanding at any particular time, and it's either small or
> it's the only thing the whole system is doing". In which case pushing it
> out to IO buffers is probably the right thing to do.
>
The way I see it, it stems from simply a different expectation of
MS_ASYNC semantics, rather than exactly what the app is doing.
If there are no data integrity requirements, then the writing should
be left up to the VM. If there are, then there will be a MS_SYNC,
which *will* move those hundred megs to the IO layer so there is no
reason for MS_ASYNC *not* to get it started earlier (and it will
be more efficient if it does).
The semantics your app wants, in my interpretation, are provided
by MS_INVALIDATE. Which kind of says "bring mmap data into coherence
with system cache", which would presumably transfer dirty bits if
modified (though as an implementation detail, we are never actually
incoherent as far as the data goes, only dirty bits).
At this point the best I can do is agree to disagree if you are
still not convinced and I'll leave it to Linux to keep debating it.
We reached something of an agreement on the fadvise thing at least.
Thanks,
Nick
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Linus Torvalds <[email protected]> wrote:
>
>
>
> On Fri, 10 Feb 2006, Oliver Neukum wrote:
> >
> > Am Freitag, 10. Februar 2006 20:05 schrieb Linus Torvalds:
> > > So we may have different expectations, because we've seen different
> > > patterns. Me, I've seen the "events are huge, and you stagger them", so
> > > that the previous event has time to flow out to disk while you generate
> > > the next one. There, MS_ASYNC starting IO is _wrong_, because the scale of
> > > the event is just huge, so trying to push it through the IO subsystem asap
> > > just makes everything suck.
> >
> > Isn't the benefit of starting writing immediately greater the smaller
> > the area in question? If so, couldn't a heuristic be found to decide whether
> > to initiate IO at once?
>
> Quite possibly. I suspect you could/should take other issues into account
> too (like whether the queue to the device is busy or bdflush is already
> working).
>
Yes, it would make sense to run balance_dirty_pages_ratelimited() inside
msync_pte_range(). So pdflush will get poked if we hit
background_dirty_ratio threshold, or we go into caller-initiated writeout
if we hit dirty_ratio.
But it's not completely trivial, because I don't think we want to be doing
blocking writeback with mmap_sem held.
The code under balance_dirty_pages() does pay attention to queue congestion
states, already-under-writeback pages and such things, but it could be
better, I guess. Starting some writeback earlier if the queue is deemed to
be idle could work.
(Hi, Stephen)
On Sat, 11 Feb 2006, Nick Piggin wrote:
>
> The way I see it, it stems from simply a different expectation of
> MS_ASYNC semantics, rather than exactly what the app is doing.
>
> If there are no data integrity requirements, then the writing should
> be left up to the VM. If there are, then there will be a MS_SYNC,
> which *will* move those hundred megs to the IO layer so there is no
> reason for MS_ASYNC *not* to get it started earlier (and it will
> be more efficient if it does).
Yes, largely.
> The semantics your app wants, in my interpretation, are provided
> by MS_INVALIDATE. Which kind of says "bring mmap data into coherence
> with system cache", which would presumably transfer dirty bits if
> modified (though as an implementation detail, we are never actually
> incoherent as far as the data goes, only dirty bits).
The historical meaning, as far as I can tell, is that MS_INVALIDATE really
_forgets_ the old mmap'ped contents in a non-coherent system.
Quoting from a UNIX man-page (as found by google):
...
If flags is MS_INVALIDATE, the function synchronizes the
contents of the memory region to match the current file
contents.
o All writes to the mapped portion of the file made
prior to the call are visible by subsequent read
references to the mapped memory region.
o All write references prior to the call, by any pro-
cess, to memory regions mapped to the same portion of
the file using MAP_SHARED, are visible by read refer-
ences to the region.
...
now, it's confusing, but I read that as meaning that the mmap'ed region is
literally thrown away, and that anybody who has done a "write()" call will
have their recently written data show up. That's also what the naming
("invalidate") suggests.
In a non-coherent system (and remember, that's what old UNIX was, when
MS_INVALIDATE came to be), you -cannot- reasonably synchronize your caches
any other way than by throwing away your own cached copy.
(Think non-coherent CPU caches in the old non-coherent NUMA machines that
happily nobody makes any more - same exact deal. The cache ops are either
"writeback" or "throw away" or a combination of the two.)
So I don't think MS_INVALIDATE has ever really meant what you say it
means: it certainly hasn't meant it in Linux, and it cannot really have
meant it in old UNIX either because the kind of op that you imply of a
two-way coherency simply wasn't _possible_ in original unix..
Now, the "msync(0)" case _could_ very sanely mean "just synchronize with
the page cache".
Linus
On Fri, 10 Feb 2006, Andrew Morton wrote:
>
> Yes, it would make sense to run balance_dirty_pages_ratelimited() inside
> msync_pte_range(). So pdflush will get poked if we hit
> background_dirty_ratio threshold, or we go into caller-initiated writeout
> if we hit dirty_ratio.
>
> But it's not completely trivial, because I don't think we want to be doing
> blocking writeback with mmap_sem held.
Why not just do it once, at the end?
Linus
Linus Torvalds <[email protected]> wrote:
>
>
>
> On Fri, 10 Feb 2006, Andrew Morton wrote:
> >
> > Yes, it would make sense to run balance_dirty_pages_ratelimited() inside
> > msync_pte_range(). So pdflush will get poked if we hit
> > background_dirty_ratio threshold, or we go into caller-initiated writeout
> > if we hit dirty_ratio.
> >
> > But it's not completely trivial, because I don't think we want to be doing
> > blocking writeback with mmap_sem held.
>
> Why not just do it once, at the end?
>
We could, sort-of.
balance_dirty_pages() is quite CPU-intensive (hence the presence of
balance_dirty_pages_ratelimited()).
balance_dirty_pages_ratelimited() expects to be called once per
page-dirtying.
- We can't use balance_dirty_pages() because workloads which do lots of
teeny msyncs would chew lots of CPU.
- We can't use balance_dirty_pages_ratelimited() because it thinks only a
single page was dirtied.
So the thing to do is to change msync to keep track of how many pages were
dirtied, then at the end call
balance_dirty_pages_ratelimited_new_improved_api(mapping, nr_pages_dirtied).
Except an msync can cover multiple mappings, so we'd need to pop the lock
in the top-level loop, run the above for each VMA. Not rocket-science, I
guess.
On Fri, 2006-02-10 at 13:10 -0800, Linus Torvalds wrote:
> This historical meaning as far as I can tell, for MS_INVALIDATE really
> _forgets_ the old mmap'ped contents in a non-coherent system.
>
> Quoting from a UNIX man-page (as found by google):
>
> ...
>
> If flags is MS_INVALIDATE, the function synchronizes the
> contents of the memory region to match the current file
> contents.
>
> o All writes to the mapped portion of the file made
> prior to the call are visible by subsequent read
> references to the mapped memory region.
>
> o All write references prior to the call, by any pro-
> cess, to memory regions mapped to the same portion of
> the file using MAP_SHARED, are visible by read refer-
> ences to the region.
The Single Unix Spec appears to have a very different interpretation.
See http://www.opengroup.org/onlinepubs/009695399/toc.htm
When MS_ASYNC is specified, msync() shall return immediately
once all the write operations are initiated or queued for
servicing; when MS_SYNC is specified, msync() shall not return
until all write operations are completed as defined for
synchronized I/O data integrity completion. Either MS_ASYNC or
MS_SYNC is specified, but not both.
When MS_INVALIDATE is specified, msync() shall invalidate all
cached copies of mapped data that are inconsistent with the
permanent storage locations such that subsequent references
shall obtain data that was consistent with the permanent storage
locations sometime between the call to msync() and the first
subsequent memory reference to the data.
If msync() causes any write to a file, the file's st_ctime and
st_mtime fields shall be marked for update.
Cheers,
Trond
On Fri, 10 Feb 2006, Trond Myklebust wrote:
>
> The Single Unix Spec appears to have a very different interpretation.
Hmm. Very different wording, but same meaning, I think.
> When MS_INVALIDATE is specified, msync() shall invalidate all
> cached copies of mapped data that are inconsistent with the
> permanent storage locations such that subsequent references
> shall obtain data that was consistent with the permanent storage
> locations sometime between the call to msync() and the first
> subsequent memory reference to the data.
Again, this says that the _mapping_ is invalidated, and should match
persistent storage.
Any dirty bits in the mapping (ie anything that hasn't been msync'ed)
should be made persistent with permanent storage. Again, that is entirely
consistent with just throwing the mmap'ed page away (dirty state and all)
in a non-coherent environment.
I don't think we really have any modern Unixes with non-coherent mmap's
(although HP-UX used to be that way for a _loong_ time). But in the
timeframe that was written, it was probably still an issue.
Now, in a _coherent_ environment (like Linux) it should probably be a
no-op, since the mapping is always consistent with storage (where
"storage" doesn't actually mean "disk", but the virtual file underneath
the mapping).
If the page is dirty in the page tables, we've modified the page contents
in the backing store (since we share it). But it would be consistent with
the standard wrt MS_INVALIDATE (but totally insane) to throw the dirty
state - and the page cache page - away if the page cache is clean.
The point being that a truly portable app can't really know what the hell
it does - it has to know whether mmap is coherent or not, and if mmap is
coherent, then MS_INVALIDATE should _probably_ be a no-op.
(Which it is under modern Linux - MS_INVALIDATE is effectively a no-op,
except we still have the old check that you can't invalidate a locked
area. It _used_ to actually clear the page tables)
Linus
On Fri, 2006-02-10 at 14:46 -0800, Linus Torvalds wrote:
>
> On Fri, 10 Feb 2006, Trond Myklebust wrote:
> >
> > The Single Unix Spec appears to have a very different interpretation.
>
> Hmm. Very different wording, but same meaning, I think.
>
> > When MS_INVALIDATE is specified, msync() shall invalidate all
> > cached copies of mapped data that are inconsistent with the
> > permanent storage locations such that subsequent references
> > shall obtain data that was consistent with the permanent storage
> > locations sometime between the call to msync() and the first
> > subsequent memory reference to the data.
>
> Again, this says that the _mapping_ is invalidated, and should match
> persistent storage.
>
> Any dirty bits in the mapping (ie anything that hasn't been msync'ed)
> should be made persistent with permanent storage. Again, that is entirely
> consistent with just throwing the mmap'ed page away (dirty state and all)
> in a non-coherent environment.
>
> I don't think we really have any modern Unixes with non-coherent mmap's
> (although HP-UX used to be that way for a _loong_ time). But in the
> timeframe that was written, it was probably still an issue.
>
> Now, in a _coherent_ environment (like Linux) it should probably be a
> no-op, since the mapping is always consistent with storage (where
> "storage" doesn't actually mean "disk", but the virtual file underneath
> the mapping).
Hmmm.... When talking about syncing to _permanent_ storage one usually
is talking about what is actually on the disk. In any case, we do have
non-coherent mmapped environments in Linux (need I mention NFS,
CIFS, ... ;-)?).
IIRC msync(MS_INVALIDATE) on Solaris was actually often used by some
applications to resync the client page cache to the server when using
odd locking schemes, so I believe this interpretation is a valid one.
Cheers,
Trond
On Fri, 10 Feb 2006, Trond Myklebust wrote:
> >
> > Now, in a _coherent_ environment (like Linux) it should probably be a
> > no-op, since the mapping is always consistent with storage (where
> > "storage" doesn't actually mean "disk", but the virtual file underneath
> > the mapping).
>
> Hmmm.... When talking about syncing to _permanent_ storage one usually
> is talking about what is actually on the disk.
Ok, in that case Linux has never done what MS_INVALIDATE says, and I doubt
anybody else has either. It's neither sane nor even really doable (you'd
have to yank out everybody _elses_ caches too, not just your own).
So I think that within this context, the "permanent storage" really means
the "file" that is mapped, and doesn't care about whether it has actually
hit the disk yet.
> In any case, we do have non-coherent mmapped environments in Linux (need
> I mention NFS, CIFS, ... ;-)?).
Good point. That's an argument for actually dropping the local page cache
entirely on such a filesystem, since such a filesystem really isn't
fundamentally coherent.
However, that would be some really _nasty_ semantics, because it would
mean that something like NFS would behave very fundamentally differently
than a local filesystem, even if the user only actually uses it on the
local machine and there are no other writers ("..but there _could_ be
other writers that we don't know about").
So I'd have to veto that just on the grounds of trying to keep users sane.
> IIRC msync(MS_INVALIDATE) on Solaris was actually often used by some
> applications to resync the client page cache to the server when using
> odd locking schemes, so I believe this interpretation is a valid one.
I think you're right. Although I would also guess that 99% of the time,
you'd only do that for read-only mappings. Doing the same in the presence
of also doing writes is just asking for getting shot.
Even for read-only mappings, it's actually quite hard to globally flush a
page cache page if somebody else happens to be using it for a read() or
something at exactly the same time.
Linus
Linus Torvalds wrote:
>
> On Sat, 11 Feb 2006, Nick Piggin wrote:
>
>>Well in that case in your argument your FADV_WRITE_START is of
>>the "waits for writeout then starts writeout if dirty" type.
>>
>>In which case you've just made 3 consecutive write+wait cycles
>>to the same page, so it is hardly an optimal IO pattern.
>
>
> The point is, this is the interface that an app would want to use if they
> want _perfect_ IO patterns.
>
I'll be annoying and take you up on this again.
It is possible that my FADV_WRITE_SYNC will do an extra write of a page
if it has since become dirty again, however that would seem to be rare
for such a thing to happen (ie. because the app has asked for some previous
copy of data to be on disk).
I'm not saying it would never happen, your sub-page sized example is one
probably valid case - however in that case Andrew's wait-for-write doesn't
always do the right thing either.
But I will grant that start-writeout + wait-for-write must be at least as
expressive as write-and-wait.
*however*, it still isn't perfect and it still does things worse than my
proposal. For example, if the kernel itself actually decides to start writeout
before you call fadvise(FADV_START_WRITEOUT) then it is now going to block
and wait for the io to finish.
Anyway if we agree they are both much of a muchness, then I hope we can
go for FADV_WRITE_ASYNC, FADV_WRITE_SYNC because it is consistent with the
rest of the userspace API we expose.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
On Fri, 2006-02-10 at 15:15 -0800, Linus Torvalds wrote:
> > IIRC msync(MS_INVALIDATE) on Solaris was actually often used by some
> > applications to resync the client page cache to the server when using
> > odd locking schemes, so I believe this interpretation is a valid one.
>
> I think you're right. Although I would also guess that 99% of the time,
> you'd only do that for read-only mappings. Doing the same in the presence
> of also doing writes is just asking for getting shot.
>
> Even for read-only mappings, it's actually quite hard to globally flush a
> page cache page if somebody else happens to be using it for a read() or
> something at exactly the same time.
I'm thinking specifically of the case where the application is using
some fancy user space locking scheme of its own to guarantee safe read
access to a part of the file that is known to have changed on the
server.
We do have fadvise(POSIX_FADV_DONTNEED), which gets you most of the way.
However that calls invalidate_mapping_pages(), which only clears
unlocked pages. This again means that kernel activities like readahead
or VM scanning can cause pages the user would otherwise like to eject to
be preserved.
The other alternative is to use O_DIRECT file access, but that forces
the application to handle caching and readahead in user space too.
Cheers,
Trond