2006-09-26 18:54:33

by Stephane Doyon

Subject: Long sleep with i_mutex in xfs_flush_device(), affects NFS service

Hi,

I'm seeing an unpleasant behavior when an XFS file system becomes full,
particularly when accessed over NFS. Both XFS and the linux NFS client
appear to be contributing to the problem.

When the file system becomes nearly full, we eventually call down to
xfs_flush_device(), which sleeps for 0.5 seconds, waiting for xfssyncd to
do some work.

xfs_flush_space() does
    xfs_iunlock(ip, XFS_ILOCK_EXCL);
before calling xfs_flush_device(), but i_mutex is still held, at least
when we're being called from under xfs_write(). It seems like a fairly
long time to hold a mutex. And I wonder whether it's really necessary to
keep going through that again and again for every new request after we've
hit ENOSPC.
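
The call path in question looks roughly like this (simplified from the
2.6 source):

    xfs_write()                  takes i_mutex
      xfs_iomap_write_delay()    allocation fails, wants to retry
        xfs_flush_space()        drops XFS_ILOCK_EXCL, keeps i_mutex
          xfs_flush_device()     queues work for xfssyncd, sleeps 500 ms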

In particular this can cause a pileup when several threads are writing
concurrently to the same file. Some specialized apps might do that, and
nfsd threads do it all the time.

To reproduce locally, on a full file system:
#!/bin/sh
for i in `seq 30`; do
    dd if=/dev/zero of=f bs=1 count=1 &
done
wait
Time that, and it takes almost exactly 15 s (30 writers x 0.5 s each,
serialized on i_mutex).

The linux NFS client typically sends bunches of 16 requests, and so if the
client is writing a single file, some NFS requests are therefore delayed
by up to 8 seconds, which is kind of long for NFS.
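(The batch serializes behind i_mutex on the server, so the last of the
16 requests waits about 16 x 0.5 s = 8 s.)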

What's worse, when my linux NFS client writes out a file's pages, it does
not react immediately on receiving an ENOSPC error. It will remember and
report the error later on close(), but it still tries to issue write
requests for each page of the file. So even if there isn't a pileup on the
i_mutex on the server, the NFS client still waits 0.5s for each 32K
(typically) request. So on an NFS client on a gigabit network, on an
already full filesystem, if I open and write a 10M file and close() it, it
takes 2m40.083s for it to issue all the requests, get an ENOSPC for each,
and finally have my close() call return ENOSPC. That can stretch to
several hours for gigabyte-sized files, which is how I noticed the
problem.
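(The numbers check out: 10 MB / 32 KB = 320 write requests, and
320 x 0.5 s = 160 s, i.e. 2m40s.)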

I'm not too familiar with the NFS client code, but would it not be
possible for it to give up when it encounters ENOSPC? Or is there some
reason why this wouldn't be desirable?

The rough workaround I have come up with for the problem is to have
xfs_flush_space() skip calling xfs_flush_device() if we are within 2 seconds
of having returned ENOSPC. I have verified that this workaround is
effective, but I imagine there might be a cleaner solution.
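
For illustration, here is a minimal sketch of that workaround; the
m_last_enospc field is something I made up for the example, it is not in
the real XFS source:

    #include <linux/jiffies.h>

    /*
     * True if this mount returned ENOSPC less than 2 seconds ago, in
     * which case xfs_flush_space() would skip xfs_flush_device() and
     * fail fast instead of making every writer pay the 500 ms sleep.
     */
    static inline int
    xfs_enospc_holdoff(xfs_mount_t *mp)
    {
            return time_before(jiffies, mp->m_last_enospc + 2 * HZ);
    }

The ENOSPC return path would then record mp->m_last_enospc = jiffies.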

Thanks



2006-09-26 19:06:48

by Trond Myklebust

Subject: Re: Long sleep with i_mutex in xfs_flush_device(), affects NFS service

On Tue, 2006-09-26 at 14:51 -0400, Stephane Doyon wrote:
> Hi,
>
> I'm seeing an unpleasant behavior when an XFS file system becomes full,
> particularly when accessed over NFS. Both XFS and the linux NFS client
> appear to be contributing to the problem.
>
> When the file system becomes nearly full, we eventually call down to
> xfs_flush_device(), which sleeps for 0.5 seconds, waiting for xfssyncd to
> do some work.
>
> xfs_flush_space() does
> xfs_iunlock(ip, XFS_ILOCK_EXCL);
> before calling xfs_flush_device(), but i_mutex is still held, at least
> when we're being called from under xfs_write(). It seems like a fairly
> long time to hold a mutex. And I wonder whether it's really necessary to
> keep going through that again and again for every new request after we've
> hit ENOSPC.
>
> In particular this can cause a pileup when several threads are writing
> concurrently to the same file. Some specialized apps might do that, and
> nfsd threads do it all the time.
>
> To reproduce locally, on a full file system:
> #!/bin/sh
> for i in `seq 30`; do
>     dd if=/dev/zero of=f bs=1 count=1 &
> done
> wait
> Time that, and it takes almost exactly 15 s (30 writers x 0.5 s each,
> serialized on i_mutex).
>
> The linux NFS client typically sends bunches of 16 requests, and so if the
> client is writing a single file, some NFS requests are therefore delayed
> by up to 8 seconds, which is kind of long for NFS.

Why? The file is still open, and so the standard close-to-open rules
state that you are not guaranteed that the cache will be flushed unless
the VM happens to want to reclaim memory.

> What's worse, when my linux NFS client writes out a file's pages, it does
> not react immediately on receiving a NOSPC error. It will remember and
> report the error later on close(), but it still tries and issues write
> requests for each page of the file. So even if there isn't a pileup on the
> i_mutex on the server, the NFS client still waits 0.5s for each 32K
> (typically) request. So on an NFS client on a gigabit network, on an
> already full filesystem, if I open and write a 10M file and close() it, it
> takes 2m40.083s for it to issue all the requests, get an NOSPC for each,
> and finally have my close() call return ENOSPC. That can stretch to
> several hours for gigabyte-sized files, which is how I noticed the
> problem.
>
> I'm not too familiar with the NFS client code, but would it not be
> possible for it to give up when it encounters NOSPC? Or is there some
> reason why this wouldn't be desirable?

How would it then detect that you have fixed the problem on the server?

Cheers,
Trond



2006-09-26 20:08:25

by Stephane Doyon

Subject: Re: Long sleep with i_mutex in xfs_flush_device(), affects NFS service

On Tue, 26 Sep 2006, Trond Myklebust wrote:

[...]
>> When the file system becomes nearly full, we eventually call down to
>> xfs_flush_device(), which sleeps for 0.5 seconds, waiting for xfssyncd to
>> do some work.
>>
>> xfs_flush_space() does
>> xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> before calling xfs_flush_device(), but i_mutex is still held, at least
>> when we're being called from under xfs_write(). It seems like a fairly
>> long time to hold a mutex. And I wonder whether it's really necessary to
>> keep going through that again and again for every new request after we've
>> hit ENOSPC.
>>
>> In particular this can cause a pileup when several threads are writing
>> concurrently to the same file. Some specialized apps might do that, and
>> nfsd threads do it all the time.
[...]
>> The linux NFS client typically sends bunches of 16 requests, and so if the
>> client is writing a single file, some NFS requests are therefore delayed
>> by up to 8 seconds, which is kind of long for NFS.
>
> Why? The file is still open, and so the standard close-to-open rules
> state that you are not guaranteed that the cache will be flushed unless
> the VM happens to want to reclaim memory.

I mean there will be a delay on the server in responding to the requests.
Sorry for the confusion.

When the NFS client does flush its cache, each request will take an extra
0.5 s to execute on the server, and the i_mutex will prevent them from
executing in parallel.

>> What's worse, when my linux NFS client writes out a file's pages, it does
>> not react immediately on receiving an ENOSPC error. It will remember and
>> report the error later on close(), but it still tries to issue write
>> requests for each page of the file. So even if there isn't a pileup on the
>> i_mutex on the server, the NFS client still waits 0.5s for each 32K
>> (typically) request. So on an NFS client on a gigabit network, on an
>> already full filesystem, if I open and write a 10M file and close() it, it
>> takes 2m40.083s for it to issue all the requests, get an ENOSPC for each,
>> and finally have my close() call return ENOSPC. That can stretch to
>> several hours for gigabyte-sized files, which is how I noticed the
>> problem.
>>
>> I'm not too familiar with the NFS client code, but would it not be
>> possible for it to give up when it encounters ENOSPC? Or is there some
>> reason why this wouldn't be desirable?
>
> How would it then detect that you have fixed the problem on the server?

I suppose it has to try again at some point. Yet when flushing a file, if
even one write request gets an error response like ENOSPC, we know some
part of the data has not been written on the server, and close() will
return the appropriate error to the program on the client. If a single
write error is enough to cause close() to return an error, why bother
sending all the other write requests for that file? If we get an error
while flushing, couldn't that one flushing operation bail out early? As I
said I'm not too familiar with the code, but AFAICT nfs_wb_all() will keep
flushing everything, and afterwards nfs_file_flush() will check ctx->error.
Perhaps ctx->error could be checked at some lower level, maybe in
nfs_sync_inode_wait...
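
Hypothetically, the check could look something like this; the function
and helper names here are made up for illustration, they are not the
real client internals:

    #include <linux/nfs_fs.h>

    /* Sketch: consult the open context's saved error before queuing
     * each page, instead of only discovering it at close() time. */
    static int
    nfs_flush_page(struct nfs_open_context *ctx, struct page *page)
    {
            if (ctx->error == -ENOSPC)   /* a previous WRITE failed */
                    return ctx->error;   /* bail out of the flush early */
            return nfs_queue_write(ctx, page);  /* hypothetical helper */
    }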

I suppose it's not technically wrong to try to flush all the pages of the
file, but if the server file system is full then it will be at its worst.
Also, if you happen to be on a slower link and have a big cache to flush,
you're waiting around for very little gain.


2006-09-26 20:30:17

by Trond Myklebust

Subject: Re: Long sleep with i_mutex in xfs_flush_device(), affects NFS service

On Tue, 2006-09-26 at 16:05 -0400, Stephane Doyon wrote:
> I suppose it's not technically wrong to try to flush all the pages of the
> file, but if the server file system is full then it will be at its worst.
> Also, if you happen to be on a slower link and have a big cache to flush,
> you're waiting around for very little gain.

That all assumes that nobody fixes the problem on the server. If
somebody notices, and actually removes an unused file, then you may be
happy that the kernel preserved the last 80% of the apache log file that
was being written out.

ENOSPC is a transient error: that is why the current behaviour exists.

Cheers,
Trond



2006-09-27 11:36:43

by Shailendra Tripathi

Subject: Re: Long sleep with i_mutex in xfs_flush_device(), affects NFS service

Hi Stephane,
> When the file system becomes nearly full, we eventually call down to
> xfs_flush_device(), which sleeps for 0.5 seconds, waiting for xfssyncd to
> do some work.
> xfs_flush_space() does
> xfs_iunlock(ip, XFS_ILOCK_EXCL);
> before calling xfs_flush_device(), but i_mutex is still held, at least
> when we're being called from under xfs_write().

1. I agree that the delay of 500 ms is not a deterministic wait.

2. xfs_flush_device() is a big operation. It has to flush all the dirty
pages that may be cached for the device. Depending upon the device, that
might take a significant amount of time. In that light, 500 ms is not
that unreasonable. Also, you would probably never want more than one
device-flush request to be queued at a time.
3. The hope is that after one big flush operation, it will be able to
free up resources that are in a transient state (over-reservation of
blocks, delalloc, pending removes, ...). The whole operation is intended
to make sure that ENOSPC is not returned unless really required.

4. This wait could be made deterministic by waiting for the syncer
thread to complete when the device flush is triggered.

> It seems like a fairly long time to hold a mutex. And I wonder whether it's really

It might not be that good even if it didn't hold the mutex for that long:
that could return premature ENOSPC, or it could queue many
xfs_flush_device requests (which would make your system dead slow anyway).

> necessary to keep going through that again and again for every new request after
> we've hit ENOSPC.
>
> In particular this can cause a pileup when several threads are writing
> concurrently to the same file. Some specialized apps might do that, and
> nfsd threads do it all the time.
>
> To reproduce locally, on a full file system:
> #!/bin/sh
> for i in `seq 30`; do
>     dd if=/dev/zero of=f bs=1 count=1 &
> done
> wait
> Time that, and it takes almost exactly 15 s (30 writers x 0.5 s each,
> serialized on i_mutex).
>
> The linux NFS client typically sends bunches of 16 requests, and so if
> the client is writing a single file, some NFS requests are therefore
> delayed by up to 8 seconds, which is kind of long for NFS.
>
> What's worse, when my linux NFS client writes out a file's pages, it
> does not react immediately on receiving an ENOSPC error. It will remember
> and report the error later on close(), but it still tries to issue
> write requests for each page of the file. So even if there isn't a
> pileup on the i_mutex on the server, the NFS client still waits 0.5s for
> each 32K (typically) request. So on an NFS client on a gigabit network,
> on an already full filesystem, if I open and write a 10M file and
> close() it, it takes 2m40.083s for it to issue all the requests, get an
> ENOSPC for each, and finally have my close() call return ENOSPC. That can
> stretch to several hours for gigabyte-sized files, which is how I
> noticed the problem.
>
> I'm not too familiar with the NFS client code, but would it not be
> possible for it to give up when it encounters ENOSPC? Or is there some
> reason why this wouldn't be desirable?
>
> The rough workaround I have come up with for the problem is to have
> xfs_flush_space() skip calling xfs_flush_device() if we are within 2 seconds
> of having returned ENOSPC. I have verified that this workaround is
> effective, but I imagine there might be a cleaner solution.

The fix would not be a good idea for standalone use of XFS.

    if (nimaps == 0) {
            if (xfs_flush_space(ip, &fsynced, &ioflag))
                    return XFS_ERROR(ENOSPC);

            error = 0;
            goto retry;
    }

and in xfs_flush_space():

            case 2:
                    xfs_iunlock(ip, XFS_ILOCK_EXCL);
                    xfs_flush_device(ip);
                    xfs_ilock(ip, XFS_ILOCK_EXCL);
                    *fsynced = 3;
                    return 0;
            }
            return 1;

Let's say you don't enqueue it for another 2 seconds. Then on the next
retry it would return 1 and, hence, the outer if condition would return
ENOSPC. Please note that for standalone XFS, the application or client
mostly doesn't retry and, hence, it might get a premature ENOSPC.
You didn't notice this because, as you said, the NFS client will retry in
case of ENOSPC.

Assuming you don't return *fsynced = 3 (but *fsynced = 2 instead), the
code path will loop (because of the retry) and the CPU would just stay
busy doing no useful work.

You might experiment with adding a deterministic wait, as sketched below.
When you enqueue a flush, set a flag. All others who come in in the
meantime just get enqueued. Once the device flush is over, wake everyone
up. If the flush could free enough resources, the threads will proceed
and return. Otherwise, another flush would be enqueued to flush whatever
might have come in since the last flush.
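
A rough sketch of that scheme (invented names, not real XFS code):

    #include <linux/wait.h>
    #include <linux/bitops.h>

    static DECLARE_WAIT_QUEUE_HEAD(xfs_flush_wq);
    static unsigned long xfs_flush_state;   /* bit 0: flush in progress */

    void
    xfs_flush_device_sync(xfs_inode_t *ip)
    {
            if (!test_and_set_bit(0, &xfs_flush_state)) {
                    /* We won the race: do the flush ourselves. */
                    xfs_flush_device(ip);
                    clear_bit(0, &xfs_flush_state);
                    wake_up_all(&xfs_flush_wq);
            } else {
                    /* A flush is already running: wait for it. */
                    wait_event(xfs_flush_wq,
                               !test_bit(0, &xfs_flush_state));
            }
    }

Threads that arrive while a flush is in progress then wait exactly as
long as the flush takes, instead of a fixed 500 ms, and retry their
allocation when woken.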

> Thanks
>
>
