2006-10-03 13:41:07

by Stephane Doyon

Subject: Re: several messages

Sorry for insisting, but it seems to me there's still a problem in need of
fixing: when writing a 5GB file over NFS to an XFS file system and hitting
ENOSPC, it takes on the order of 22 hours before my application gets an
error, whereas it would normally take about 2 minutes if the file system
did not become full.

Perhaps I was being a bit too "constructive" and drowned my point in
explanations and proposed workarounds... You are telling me that neither
NFS nor XFS is doing anything wrong, and I can understand your points of
view, but surely that behavior isn't considered acceptable?

On Tue, 26 Sep 2006, Trond Myklebust wrote:

> On Tue, 2006-09-26 at 16:05 -0400, Stephane Doyon wrote:
>> I suppose it's not technically wrong to try to flush all the pages of the
> file, but if the server file system is full then it will be at its worst.
> Also if you happen to be on a slower link and have a big cache to flush,
>> you're waiting around for very little gain.
>
> That all assumes that nobody fixes the problem on the server. If
> somebody notices, and actually removes an unused file, then you may be
> happy that the kernel preserved the last 80% of the apache log file that
> was being written out.
>
> ENOSPC is a transient error: that is why the current behaviour exists.

On Tue, 3 Oct 2006, David Chinner wrote:

> This deep in the XFS allocation functions, we cannot tell if we hold
> the i_mutex or not, and it plays no part in determining if we have
> space or not. Hence we don't touch it here.


> I doubt it's a good idea for an NFS server, either.
[...]
> Remember that XFS, like most filesystems, trades off speed for
> correctness as we approach ENOSPC. Many parts of XFS slow down as we
> approach ENOSPC, and this is just one example of where we need to be
> correct, not fast.
[...]
> IMO, this is a non-problem. You're talking about optimising a
> relatively rare corner case where correctness is more important than
> speed and your test case is highly artificial. AFAIC, if you are
> running at ENOSPC then you get what performance is appropriate for
> correctness and if you are continually running at ENOSPC, then buy
> some more disks.....

My recipe to reproduce the problem locally is admittedly somewhat
artificial, but the problematic usage definitely isn't: simply an app on
an NFS client that happens to fill up a file system. There must be some
way to handle this better.

Thanks




2006-10-03 16:41:15

by Trond Myklebust

Subject: Re: several messages

On Tue, 2006-10-03 at 09:39 -0400, Stephane Doyon wrote:
> Sorry for insisting, but it seems to me there's still a problem in need of
> fixing: when writing a 5GB file over NFS to an XFS file system and hitting
> ENOSPC, it takes on the order of 22 hours before my application gets an
> error, whereas it would normally take about 2 minutes if the file system
> did not become full.
>
> Perhaps I was being a bit too "constructive" and drowned my point in
> explanations and proposed workarounds... You are telling me that neither
> NFS nor XFS is doing anything wrong, and I can understand your points of
> view, but surely that behavior isn't considered acceptable?

Sure it is. You are allowing the kernel to cache 5GB, and that means you
only get the error message when close() completes.

If you want faster error reporting, there are modes like O_SYNC and
O_DIRECT that will attempt to flush the data more quickly. In addition,
you can force flushing using fsync(). Finally, you can tweak the VM into
flushing more often using /proc/sys/vm.
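
For example, something along these lines (a rough, untested sketch; the
path and sizes are placeholders) would surface the error within one chunk
of data instead of only at close():

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Write 5GB in 64KB chunks, forcing the cache out every 64MB so a
     * server-side ENOSPC is reported by fsync() rather than deferred to
     * close().  Path and sizes are placeholders; opening with O_SYNC has
     * a similar effect at the cost of making every write() synchronous. */
    char buf[64 * 1024];
    long long total;
    int fd;

    memset(buf, 'x', sizeof(buf));
    fd = open("/mnt/nfs/bigfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (total = 0; total < 5LL * 1024 * 1024 * 1024; total += sizeof(buf)) {
        if (write(fd, buf, sizeof(buf)) < 0) {
            perror("write");
            return 1;
        }
        /* flush every 64MB so the error comes back here, not 5GB later */
        if (total % (64LL * 1024 * 1024) == 0 && fsync(fd) < 0) {
            perror("fsync");    /* "No space left on device" */
            return 1;
        }
    }
    if (close(fd) < 0) {
        perror("close");
        return 1;
    }
    return 0;
}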

Cheers,
Trond



2006-10-05 08:30:53

by David Chinner

Subject: Re: several messages

On Tue, Oct 03, 2006 at 09:39:55AM -0400, Stephane Doyon wrote:
> Sorry for insisting, but it seems to me there's still a problem in need of
> fixing: when writing a 5GB file over NFS to an XFS file system and hitting
> ENOSPC, it takes on the order of 22 hours before my application gets an
> error, whereas it would normally take about 2 minutes if the file system
> did not become full.
>
> Perhaps I was being a bit too "constructive" and drowned my point in
> explanations and proposed workarounds... You are telling me that neither
> NFS nor XFS is doing anything wrong, and I can understand your points of
> view, but surely that behavior isn't considered acceptable?

I agree that this is a little extreme and I can't recall seeing
anything like this before, but I can see how that may happen if the
NFS client continues to try to write every dirty page after getting
an ENOSPC and each one of those writes has to wait for 500ms.

However, you did not mention what kernel version you are running.
One recent bug (introduced by a fix for deadlocks at ENOSPC) could
allow oversubscription of free space to occur in XFS, resulting in
the write being allowed to proceed (i.e. sufficient space for the
data blocks) but then failing the allocation because there weren't
enough blocks put aside for potential btree splits that occur during
allocation. If the linux client is using sync writes on retry, then
this would trigger a 500ms sleep on every write. That's the right
sort of ballpark for the slowness you were seeing - 5GB / 32k * 0.5s
= ~22 hours....
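
(To spell that estimate out: 5GB / 32KB per write is 163,840 writes; at
0.5s each that's 81,920 seconds, or roughly 22.7 hours.)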

This got fixed in 2.6.18-rc6 - can you retry with a 2.6.18 server
and see if your problem goes away?

Cheers,

Dave.

--
Dave Chinner
Principal Engineer
SGI Australian Software Group


2006-10-05 15:41:18

by Stephane Doyon

Subject: Re: several messages

On Tue, 3 Oct 2006, Trond Myklebust wrote:

> On Tue, 2006-10-03 at 09:39 -0400, Stephane Doyon wrote:
>> Sorry for insisting, but it seems to me there's still a problem in need of
>> fixing: when writing a 5GB file over NFS to an XFS file system and hitting
>> ENOSPC, it takes on the order of 22 hours before my application gets an
>> error, whereas it would normally take about 2 minutes if the file system
>> did not become full.
>>
>> Perhaps I was being a bit too "constructive" and drowned my point in
>> explanations and proposed workarounds... You are telling me that neither
>> NFS nor XFS is doing anything wrong, and I can understand your points of
>> view, but surely that behavior isn't considered acceptable?
>
> Sure it is.

If you say so :-).

> You are allowing the kernel to cache 5GB, and that means you
> only get the error message when close() completes.

But it's not actually caching the entire 5GB at once... I guess you're
saying that doesn't matter...?

> If you want faster error reporting, there are modes like O_SYNC and
> O_DIRECT that will attempt to flush the data more quickly. In addition,
> you can force flushing using fsync().

What if the program is a standard utility like cp?

> Finally, you can tweak the VM into
> flushing more often using /proc/sys/vm.

It doesn't look to me like just a question of how early to flush.
Actually my client can't possibly be caching all of the 5GB; it doesn't have
the RAM or swap for that. Tracing it more carefully, it appears dirty data
starts being flushed after a few hundred MBs. No error is returned on the
subsequent writes, only on the final close(). I see some of the write()
calls are delayed, presumably when the machine reaches the dirty
threshold. So I don't see how the vm settings can help in this case.

I hadn't realized that the issue isn't just with the final flush on
close(). It's actually been flushing all along, delaying some of the
subsequent write()s, getting ENOSPC errors but not reporting them until the
end.
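
Roughly, the pattern I'm seeing boils down to this (minimal sketch; the
path is a placeholder):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Every write() keeps succeeding even after the server file system
     * fills up; the accumulated ENOSPC only comes back from close(). */
    char buf[32 * 1024];
    long long total;
    int fd;

    memset(buf, 0, sizeof(buf));
    fd = open("/mnt/nfs/fill", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (total = 0; total < 5LL * 1024 * 1024 * 1024; total += sizeof(buf)) {
        if (write(fd, buf, sizeof(buf)) < 0) {  /* never triggers here */
            perror("write");
            return 1;
        }
    }
    if (close(fd) < 0)
        perror("close");        /* only here do I finally get ENOSPC */
    return 0;
}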

I understand that since my application did not request any syncing, the
system cannot guarantee to report errors until cached data has been
flushed. But some data has indeed been flushed with an error; can't this
be reported earlier than on close?

Would it be incorrect for a subsequent write to return the error that
occurred while flushing data from previous writes? Then the app could
decide whether to continue and retry or not. But I guess I can see how
that might get convoluted.

Thanks for your patience,


2006-10-05 16:34:17

by Stephane Doyon

Subject: Re: several messages

On Thu, 5 Oct 2006, David Chinner wrote:

> On Tue, Oct 03, 2006 at 09:39:55AM -0400, Stephane Doyon wrote:
>> Sorry for insisting, but it seems to me there's still a problem in need of
>> fixing: when writing a 5GB file over NFS to an XFS file system and hitting
>> ENOSPC, it takes on the order of 22 hours before my application gets an
>> error, whereas it would normally take about 2 minutes if the file system
>> did not become full.
>>
>> Perhaps I was being a bit too "constructive" and drowned my point in
>> explanations and proposed workarounds... You are telling me that neither
>> NFS nor XFS is doing anything wrong, and I can understand your points of
>> view, but surely that behavior isn't considered acceptable?
>
> I agree that this is a little extreme and I can't recall seeing
> anything like this before, but I can see how that may happen if the
> NFS client continues to try to write every dirty page after getting
> an ENOSPC and each one of those writes has to wait for 500ms.
>
> However, you did not mention what kernel version you are running.
> One recent bug (introduced by a fix for deadlocks at ENOSPC) could
> allow oversubscription of free space to occur in XFS, resulting in

I do have that fix in my kernel. (I'm the one who pointed you to the patch
that introduced that particular problem.)

> the write being allowed to proceed (i.e. sufficient space for the
> data blocks) but then failing the allocation because there weren't
> enough blocks put aside for potential btree splits that occur during
> allocation. If the linux client is using sync writes on retry, then

The writes from nfsd shouldn't be sync. Technically it's not even
retrying, just plowing on...

> this would trigger a 500ms sleep on every write. That's the right
> sort of ballpark for the slowness you were seeing - 5GB / 32k * 0.5s
> = ~22 hours....
>
> This got fixed in 2.6.18-rc6 -

You mean commit 4be536debe3f7b0c right? (Actually -rc7 I believe...) I do
have that one in my kernel. My kernel is 2.6.17 plus assorted XFS fixes.

> can you retry with a 2.6.18 server
> and see if your problem goes away?

Unfortunately it will be several days before I have a chance to do that.

The backtrace looked like this:

... nfsd_write nfsd_vfs_write vfs_writev do_readv_writev xfs_file_writev
xfs_write generic_file_buffered_write xfs_get_blocks __xfs_get_blocks
xfs_bmap xfs_iomap xfs_iomap_write_delay xfs_flush_space xfs_flush_device
schedule_timeout_uninterruptible.

with a 500ms sleep in xfs_flush_device().

Thanks


2006-10-05 23:30:08

by David Chinner

Subject: Re: several messages

On Thu, Oct 05, 2006 at 12:33:05PM -0400, Stephane Doyon wrote:
> retrying, just plowing on...
>
> >this would trigger a 500ms sleep on every write. That's the right
> >sort of ballpark for the slowness you were seeing - 5GB / 32k * 0.5s
> >= ~22 hours....
> >
> >This got fixed in 2.6.18-rc6 -
>
> You mean commit 4be536debe3f7b0c right? (Actually -rc7 I believe...) I do
> have that one in my kernel. My kernel is 2.6.17 plus assorted XFS fixes.
>
> >can you retry with a 2.6.18 server
> >and see if your problem goes away?
>
> Unfortunately it will be several days before I have a chance to do that.
>
> The backtrace looked like this:
>
> ... nfsd_write nfsd_vfs_write vfs_writev do_readv_writev xfs_file_writev
> xfs_write generic_file_buffered_write xfs_get_blocks __xfs_get_blocks
> xfs_bmap xfs_iomap xfs_iomap_write_delay xfs_flush_space xfs_flush_device
> schedule_timeout_uninterruptible.

Ahhh, this gets hit on the ->prepare_write path (xfs_iomap_write_delay()),
not the allocate path (xfs_iomap_write_allocate()). Sorry - I got myself
(and probably everyone else) confused there, which is why I suspected sync
writes - they trigger the allocate path in the write call. I don't think
2.6.18 will change anything.

FWIW, I don't think we can avoid this sleep when we first hit ENOSPC
conditions, but perhaps once we are certain of the ENOSPC status
we can tag the filesystem with this state (say an xfs_mount flag)
and only clear that tag when something is freed. We could then
use the tag to avoid continually trying extremely hard to allocate
space when we know there is none available....
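
Roughly what I have in mind - just a sketch, the flag name and call sites
are made up, not real code:

/* Hypothetical mount flag: "we already know this filesystem is full". */
#define XFS_MOUNT_NOSPACE       (1 << 30)

/* In the xfs_iomap_write_delay()/xfs_flush_space() path: fail fast
 * instead of flushing and sleeping for 500ms when the flag is set. */
        if (mp->m_flags & XFS_MOUNT_NOSPACE)
                return ENOSPC;

/* After a flush-and-retry that still finds no free blocks, latch it: */
        mp->m_flags |= XFS_MOUNT_NOSPACE;

/* And wherever blocks are returned to the free pool (extent or inode
 * freed, filesystem grown), clear it so we start trying hard again: */
        mp->m_flags &= ~XFS_MOUNT_NOSPACE;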

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group


2006-10-06 00:34:07

by David Chinner

Subject: Re: several messages

On Thu, Oct 05, 2006 at 11:39:45AM -0400, Stephane Doyon wrote:
>
> I hadn't realized that the issue isn't just with the final flush on
> close(). It's actually been flushing all along, delaying some of the
> subsequent write()s, getting ENOSPC errors but not reporting them until the
> end.

Other NFS clients will report an ENOSPC on the next write() or close()
if the error is reported during async writeback. The clients that typically
do this throw away any unwritten data as well on the basis that the
application was returned an error ASAP and it is now Somebody Else's
Problem (i.e. the application needs to handle it from there).

> I understand that since my application did not request any syncing, the
> system cannot guarantee to report errors until cached data has been
> flushed. But some data has indeed been flushed with an error; can't this
> be reported earlier than on close?

It could, but...

> Would it be incorrect for a subsequent write to return the error that
> occurred while flushing data from previous writes? Then the app could
> decide whether to continue and retry or not. But I guess I can see how
> that might get convoluted.

.... there are many entertaining hoops to jump through to do this
reliably.

FWIW, these are simply two different approaches to handling ENOSPC
(and other server) errors. Mostly it comes down to how the people who
implemented the NFS client think it's best to handle the errors in
the scenarios that they most care about.

For example: when you have large amounts of cached data, expedient
error reporting and tossing unwritten data leads to much faster
error recovery than trying to write every piece of data (hence the
Irix use of this method).

OTOH, when you really want as much of the data as possible to get to the
server before reporting an error, regardless of whether you lose some
along the way (e.g. log files), then you try to write every bit of data
before telling the application.

There's no clear right or wrong approach here - both have their
advantages and disadvantages for different workloads. If it
weren't for the sub-optimal behaviour of XFS in this case, you
probably wouldn't have even cared about this....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group


2006-10-06 13:04:22

by Stephane Doyon

Subject: Re: several messages

On Fri, 6 Oct 2006, David Chinner wrote:

>> The backtrace looked like this:
>>
>> ... nfsd_write nfsd_vfs_write vfs_writev do_readv_writev xfs_file_writev
>> xfs_write generic_file_buffered_write xfs_get_blocks __xfs_get_blocks
>> xfs_bmap xfs_iomap xfs_iomap_write_delay xfs_flush_space xfs_flush_device
>> schedule_timeout_uninterruptible.
>
> Ahhh, this gets hit on the ->prepare_write path (xfs_iomap_write_delay()),

Yes.

> not the allocate path (xfs_iomap_write_allocate()). Sorry - I got myself
> (and probably everyone else) confused there, which is why I suspected sync
> writes - they trigger the allocate path in the write call. I don't think
> 2.6.18 will change anything.
>
> FWIW, I don't think we can avoid this sleep when we first hit ENOSPC
> conditions, but perhaps once we are certain of the ENOSPC status
> we can tag the filesystem with this state (say an xfs_mount flag)
> and only clear that tag when something is freed. We could then
> use the tag to avoid continually trying extremely hard to allocate
> space when we know there is none available....

Yes! That's what I was trying to suggest :-). Thank you.

Is that hard to do?


2006-10-06 13:26:32

by Stephane Doyon

Subject: Re: several messages

On Fri, 6 Oct 2006, David Chinner wrote:

> On Thu, Oct 05, 2006 at 11:39:45AM -0400, Stephane Doyon wrote:
>>
>> I hadn't realized that the issue isn't just with the final flush on
>> close(). It's actually been flushing all along, delaying some of the
>> subsequent write()s, getting ENOSPC errors but not reporting them until the
>> end.
>
> Other NFS clients will report an ENOSPC on the next write() or close()
> if the error is reported during async writeback. The clients that typically
> do this throw away any unwritten data as well on the basis that the
> application was returned an error ASAP and it is now Somebody Else's
> Problem (i.e. the application needs to handle it from there).

Well the client wouldn't necessarily have to throw away cached data. It
could conceivably be made to return ENOSPC on some subsequent write. It
would need to throw away the data for that write, but not necessarily
destroy its cache. It could then clear the error condition and allow the
application to keep trying if it wants to...
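
Something along these lines is what I was imagining - a very rough
sketch, assuming the AS_EIO/AS_ENOSPC bits the VFS already uses to
remember async writeback errors on the address_space:

#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/pagemap.h>

/* Hand a pending writeback error back to the application on its next
 * write() instead of sitting on it until close()/fsync().  Only the data
 * for that one write is lost; the rest of the cache stays intact. */
static int nfs_check_async_write_error(struct address_space *mapping)
{
        if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
                return -ENOSPC;
        if (test_and_clear_bit(AS_EIO, &mapping->flags))
                return -EIO;
        return 0;
}

/* ...and something near the top of the client's write path would do:
 *
 *      err = nfs_check_async_write_error(inode->i_mapping);
 *      if (err)
 *              return err;
 */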

>> Would it be incorrect for a subsequent write to return the error that
>> occurred while flushing data from previous writes? Then the app could
>> decide whether to continue and retry or not. But I guess I can see how
>> that might get convoluted.
>
> .... there are many entertaining hoops to jump through to do this
> reliably.

I imagine there would be...

> For example: when you have large amounts of cached data, expedient
> error reporting and tossing unwritten data leads to much faster
> error recovery than trying to write every piece of data (hence the
> Irix use of this method).

In my case, I didn't think I was caching that much data though, only a few
hundred MBs, and I wouldn't have minded so much if an error had been
returned after that much. The way it's implemented though, I can write an
unbounded amount of data through that cache and not be told of the problem
until I close or fsync. It may not be technically wrong, but given the
outrageous delay I saw in my particular situation, it felt pretty
suboptimal.

> There's no clear right or wrong approach here - both have their
> advantages and disadvantages for different workloads. If it
> weren't for the sub-optimal behaviour of XFS in this case, you
> probably wouldn't have even cared about this....

Indeed not! In fact, changing the client is not practical for me; what I
need is a fix for the XFS behavior. I just thought it was also worth
reporting what I perceived to be an issue with the NFS client.

Thanks
