Date: Tue, 13 Jan 2009 23:28:58 -0500 (EST)
From: Mikulas Patocka <mpatocka@redhat.com>
To: Dave Chinner <david@fromorbit.com>
cc: xfs@oss.sgi.com, linux-kernel@vger.kernel.org
Subject: Re: spurious -ENOSPC on XFS
In-Reply-To: <20090113214949.GN8071@disturbed>
Message-ID: <Pine.LNX.4.64.0901132324070.16396@hs20-bc2-1.build.redhat.com>
References: <Pine.LNX.4.64.0901120509550.11089@hs20-bc2-1.build.redhat.com>
 <20090113214949.GN8071@disturbed>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2166
Lines: 56

> > This misbehavior is apparently caused by delayed allocation: delayed
> > allocation does not know exactly how much space the data will occupy,
> > so it makes an upper-bound guess.
> 
> No, we know *exactly* how much space is consumed by the data. What
> we don't know is how much space will be required for additional
> *metadata* to do the allocation so we reserve the worst case need so
> that we should never get an ENOSPC during async writeback when we
> can't report the error to anyone.  Worst case is 4 metadata blocks
> per allocation (delalloc extent, really).
> 
> If we ENOSPC in the delalloc path, we have two choices:
> 
> 	1. potentially lock the system up due to OOM and being
> 	   unable to flush pages
> 	2. throw away user data without being able to report an
> 	   error to the application that wrote it originally.
> 
> Personally, I don't like either option, so premature ENOSPC at
> write() time is fine by me....
> 
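
To illustrate the reservation described above, here is a toy sketch, not XFS code. The function and macro names are made up for illustration; only the figure of 4 worst-case metadata blocks per delalloc extent comes from the quoted text. It shows how a write can fail with -ENOSPC even though the data alone would fit:

```c
#include <errno.h>

/* Worst-case metadata blocks per delalloc extent (figure quoted above). */
#define TOY_WORST_CASE_META_BLOCKS 4

/*
 * Hypothetical helper: reserve the data blocks plus the worst-case
 * metadata up front, so that writeback can never run out of space later.
 * Returns 0 on success, -ENOSPC if the full reservation does not fit.
 */
static int toy_delalloc_reserve(long *free_blocks, long data_blocks)
{
	long need = data_blocks + TOY_WORST_CASE_META_BLOCKS;

	if (*free_blocks < need)
		return -ENOSPC;	/* premature: the data alone might fit */

	*free_blocks -= need;
	return 0;
}
```

With 10 free blocks, a request for 8 data blocks is refused in this model because 8 + 4 > 10, while a request for 6 succeeds and consumes the whole reservation.
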
> > Because the free space count is only a guess, not the actual space
> > being consumed, XFS should not return -ENOSPC on the basis of it. When
> > the free space runs out, XFS should sync itself, retry the allocation
> > and only return -ENOSPC if it fails the second time, after the sync.
> 
> It does, by graduated response (see xfs_iomap_write_delay() and
> xfs_flush_space()):
> 
> 	1. trigger async flush of the inode and retry
> 	2. retry again
> 	3. start a filesystem wide flush, wait 500ms and try again

The result must not depend on magic timer values. If it does, you end up 
with undebuggable nondeterministic failures.

Why don't you change that 500ms wait to "wait until the flush finishes"? 
That would be correct.
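
A toy model of the difference, as a sketch only: none of this is XFS code, every name below is hypothetical, and time is simulated in abstract ticks instead of milliseconds. With a fixed wait, the outcome depends on how long the flush happens to take; waiting for the flush to finish gives the same answer every time.

```c
#include <assert.h>
#include <errno.h>

/* Toy filesystem state; all names are made up for illustration. */
struct toy_fs {
	long free_blocks;	/* blocks available for new reservations */
	long flush_pending;	/* blocks a running flush will hand back */
	int  flush_ticks_left;	/* simulated ticks until the flush completes */
};

/* Advance simulated time by one tick; complete the flush when it is due. */
static void toy_tick(struct toy_fs *fs)
{
	if (fs->flush_ticks_left > 0 && --fs->flush_ticks_left == 0) {
		fs->free_blocks += fs->flush_pending;
		fs->flush_pending = 0;
	}
}

/* Try to reserve 'want' blocks; 0 on success, -ENOSPC otherwise. */
static int toy_reserve(struct toy_fs *fs, long want)
{
	assert(want >= 0);
	if (fs->free_blocks < want)
		return -ENOSPC;
	fs->free_blocks -= want;
	return 0;
}

/* Policy A: on failure, wait a fixed number of ticks, then retry once. */
static int reserve_fixed_wait(struct toy_fs *fs, long want, int wait_ticks)
{
	if (toy_reserve(fs, want) == 0)
		return 0;
	for (int i = 0; i < wait_ticks; i++)
		toy_tick(fs);
	return toy_reserve(fs, want);	/* result depends on flush speed */
}

/* Policy B: on failure, wait until the flush has finished, then retry. */
static int reserve_wait_for_flush(struct toy_fs *fs, long want)
{
	if (toy_reserve(fs, want) == 0)
		return 0;
	while (fs->flush_ticks_left > 0)
		toy_tick(fs);
	return toy_reserve(fs, want);	/* result depends only on space */
}
```

Starting from 2 free blocks with a flush that will return 10 blocks, a request for 5 blocks fails under policy A with a 5-tick wait if the flush needs 8 ticks, but succeeds if the flush needs 3 ticks. Under policy B it succeeds in both cases.
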

> 	4. really ENOSPC now.
> 
> It could probably be improved but, quite frankly, XFS wasn't designed
> for small filesystems so I don't think this is worth investing any
> major effort in changing/fixing.
> 
> Cheers,
> 
> Dave.

Mikulas