Hi
I discovered a bug in XFS in delayed allocation.
When you take a small partition (52MB in my case) and copy many small
files onto it (source code) that barely fit there, you get -ENOSPC. Then
you sync the partition, some free space pops up, you click "retry" in MC
and the copy continues. Then you get -ENOSPC again, you must sync, click
"retry" and go on. And so on a few times until the source code finally
fits on the XFS partition.
This misbehavior is apparently caused by delayed allocation: delayed
allocation does not know exactly how much space the data will occupy,
so it makes an upper-bound guess. Because the free space count is only
a guess, not the actual space consumed by the data, XFS should not
return -ENOSPC based on it. When the free space appears to run out, XFS
should sync itself, retry the allocation and only return -ENOSPC if it
fails the second time, after the sync.
Mikulas
On Mon, Jan 12, 2009 at 06:14:36AM -0500, Mikulas Patocka wrote:
> Hi
>
> I discovered a bug in XFS in delayed allocation.
>
> When you take a small partition (52MB in my case) and copy many small
> files onto it (source code) that barely fit there, you get -ENOSPC. Then
> you sync the partition, some free space pops up, you click "retry" in MC
> and the copy continues. Then you get -ENOSPC again, you must sync, click
> "retry" and go on. And so on a few times until the source code finally
> fits on the XFS partition.
>
> This misbehavior is apparently caused by delayed allocation: delayed
> allocation does not know exactly how much space the data will occupy,
> so it makes an upper-bound guess. Because the free space count is only
> a guess, not the actual space consumed by the data, XFS should not
> return -ENOSPC based on it. When the free space appears to run out, XFS
> should sync itself, retry the allocation and only return -ENOSPC if it
> fails the second time, after the sync.
This looks a lot like: http://oss.sgi.com/bugzilla/show_bug.cgi?id=724
It's on my short-term todo list to turn the testcase in that entry
into a proper xfsqa testcase and follow up on the investigation by
Dave and Eric.
Christoph Hellwig wrote:
> On Mon, Jan 12, 2009 at 06:14:36AM -0500, Mikulas Patocka wrote:
>> Hi
>>
>> I discovered a bug in XFS in delayed allocation.
>>
>> When you take a small partition (52MB in my case) and copy many small
>> files onto it (source code) that barely fit there, you get -ENOSPC. Then
>> you sync the partition, some free space pops up, you click "retry" in MC
>> and the copy continues. Then you get -ENOSPC again, you must sync, click
>> "retry" and go on. And so on a few times until the source code finally
>> fits on the XFS partition.
>>
>> This misbehavior is apparently caused by delayed allocation: delayed
>> allocation does not know exactly how much space the data will occupy,
>> so it makes an upper-bound guess. Because the free space count is only
>> a guess, not the actual space consumed by the data, XFS should not
>> return -ENOSPC based on it. When the free space appears to run out, XFS
>> should sync itself, retry the allocation and only return -ENOSPC if it
>> fails the second time, after the sync.
This sounds like a problem with speculative allocation - delayed allocations
beyond eof. Even if we write a small file, say 4k, a 64k chunk of delayed
allocation will be credited to the file. If we extend the file we use more
of that chunk instead of allocating more. When the file is closed and the
last inode reference is dropped the excess delayed allocations are released
back to freespace - this must be happening during the sync.
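In effect a write reserves out to a chunk boundary past EOF, something
like the sketch below (the constant and helper name are invented for
illustration; the 64k figure is just the example above, not the real
XFS policy):

/*
 * Illustration only -- not the real XFS preallocation policy.
 * A small write (say 4k) reserves delayed allocation out to a larger
 * chunk boundary (say 64k) past EOF, so that later appends reuse the
 * reservation instead of allocating again.
 */
#define SPEC_PREALLOC_SIZE	(64 * 1024)	/* example chunk size */

static loff_t
delalloc_reservation_end(loff_t write_end)
{
	return roundup(write_end, SPEC_PREALLOC_SIZE);
}

Until that excess is released, free space looks smaller than what the
data actually written needs.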
>
> This looks a lot like: http://oss.sgi.com/bugzilla/show_bug.cgi?id=724
> It's on my short-term todo list to turn the testcase in that entry
> into a proper xfsqa testcase and follow up on the investigation by
> Dave and Eric.
>
On Mon, Jan 12, 2009 at 06:14:36AM -0500, Mikulas Patocka wrote:
> Hi
>
> I discovered a bug in XFS in delayed allocation.
>
> When you take a small partition (52MB in my case) and copy many small
> files onto it (source code) that barely fit there, you get -ENOSPC. Then
> you sync the partition, some free space pops up, you click "retry" in MC
> and the copy continues. Then you get -ENOSPC again, you must sync, click
> "retry" and go on. And so on a few times until the source code finally
> fits on the XFS partition.
Not a Bug. This is by design.
> This misbehavior is apparently caused by delayed allocation: delayed
> allocation does not know exactly how much space the data will occupy,
> so it makes an upper-bound guess.
No, we know *exactly* how much space is consumed by the data. What
we don't know is how much space will be required for additional
*metadata* to do the allocation so we reserve the worst case need so
that we should never get an ENOSPC during async writeback when we
can't report the error to anyone. Worst case is 4 metadata blocks
per allocation (delalloc extent, really).
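As a back-of-the-envelope sketch of that accounting (the names below
are invented for illustration; only the figure of 4 metadata blocks per
delalloc extent comes from the above):

/*
 * Illustration only: for each delalloc reservation, charge the data
 * blocks plus a worst-case number of metadata blocks, so that async
 * writeback can never hit ENOSPC at a point where the error cannot
 * be reported to the writer.
 */
#define DELALLOC_WORST_CASE_METADATA_BLOCKS	4

static unsigned long
delalloc_blocks_to_reserve(unsigned long data_blocks)
{
	return data_blocks + DELALLOC_WORST_CASE_METADATA_BLOCKS;
}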
If we ENOSPC in the delalloc path, we have two choices:
1. potentially lock the system up due to OOM and being
unable to flush pages
2. throw away user data without being able to report an
error to the application that wrote it originally.
Personally, I don't like either option, so premature ENOSPC at
write() time is fine by me....
> Because the free space count is only a guess, not the actual space
> consumed by the data, XFS should not return -ENOSPC based on it. When
> the free space appears to run out, XFS should sync itself, retry the
> allocation and only return -ENOSPC if it fails the second time, after
> the sync.
It does, by a graduated response (see xfs_iomap_write_delay() and
xfs_flush_space()):
1. trigger async flush of the inode and retry
2. retry again
3. start a filesystem wide flush, wait 500ms and try again
4. really ENOSPC now.
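In outline, that sequence looks something like this sketch (simplified;
reserve_delalloc_space() and write_delay_with_retries() are stand-in
names, and the real logic lives in xfs_iomap_write_delay() and
xfs_flush_space()):

/* Simplified sketch of the graduated ENOSPC response, not the literal code. */
STATIC int
write_delay_with_retries(
	struct xfs_inode	*ip,
	xfs_off_t		offset,
	size_t			count)
{
	int			retries;
	int			error;

	for (retries = 0; ; retries++) {
		error = reserve_delalloc_space(ip, offset, count); /* stand-in */
		if (error != -ENOSPC)
			return error;

		switch (retries) {
		case 0:		/* 1. async flush of this inode, then retry */
			xfs_flush_inode(ip);
			break;
		case 1:		/* 2. retry again, without EOF preallocation */
			break;
		case 2:		/* 3. filesystem-wide flush, wait 500ms, retry */
			xfs_flush_device(ip);
			delay(msecs_to_jiffies(500));
			break;
		default:	/* 4. really ENOSPC now */
			return -ENOSPC;
		}
	}
}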
It could probably be improved but, quite frankly, XFS wasn't designed
for small filesystems so I don't think this is worth investing any
major effort in changing/fixing.
Cheers,
Dave.
--
Dave Chinner
[email protected]
> > This misbehavior is apparently caused by delayed allocation: delayed
> > allocation does not know exactly how much space the data will occupy,
> > so it makes an upper-bound guess.
>
> No, we know *exactly* how much space is consumed by the data. What
> we don't know is how much space will be required for additional
> *metadata* to do the allocation so we reserve the worst case need so
> that we should never get an ENOSPC during async writeback when we
> can't report the error to anyone. Worst case is 4 metadata blocks
> per allocation (delalloc extent, really).
>
> If we ENOSPC in the delalloc path, we have two choices:
>
> 1. potentially lock the system up due to OOM and being
> unable to flush pages
> 2. throw away user data without being able to report an
> error to the application that wrote it originally.
>
> Personally, I don't like either option, so premature ENOSPC at
> write() time is fine by me....
>
> > Because the free space count is only a guess, not the actual space
> > consumed by the data, XFS should not return -ENOSPC based on it. When
> > the free space appears to run out, XFS should sync itself, retry the
> > allocation and only return -ENOSPC if it fails the second time, after
> > the sync.
>
> It does, by a graduated response (see xfs_iomap_write_delay() and
> xfs_flush_space()):
>
> 1. trigger async flush of the inode and retry
> 2. retry again
> 3. start a filesystem wide flush, wait 500ms and try again
The result must not depend on magic timer values. If it does, you end up
with undebuggable, nondeterministic failures.
Why don't you change that 500ms wait to "wait until the flush finishes"?
That would be correct.
> 4. really ENOSPC now.
>
> It could probably be improved but, quite frankly, XFS wasn't designed
> for small filesystems so I don't think this is worth investing any
> major effort in changing/fixing.
>
> Cheers,
>
> Dave.
Mikulas
On Tue, Jan 13, 2009 at 04:58:01PM +1100, Lachlan McIlroy wrote:
> Christoph Hellwig wrote:
>> On Mon, Jan 12, 2009 at 06:14:36AM -0500, Mikulas Patocka wrote:
>>> Hi
>>>
>>> I discovered a bug in XFS in delayed allocation.
>>>
>>> When you take a small partition (52MB in my case) and copy many small
>>> files onto it (source code) that barely fit there, you get -ENOSPC.
>>> Then you sync the partition, some free space pops up, you click "retry"
>>> in MC and the copy continues. Then you get -ENOSPC again, you must
>>> sync, click "retry" and go on. And so on a few times until the source
>>> code finally fits on the XFS partition.
>>>
>>> This misbehavior is apparently caused by delayed allocation: delayed
>>> allocation does not know exactly how much space the data will occupy,
>>> so it makes an upper-bound guess. Because the free space count is only
>>> a guess, not the actual space consumed by the data, XFS should not
>>> return -ENOSPC based on it. When the free space appears to run out,
>>> XFS should sync itself, retry the allocation and only return -ENOSPC
>>> if it fails the second time, after the sync.
> This sounds like a problem with speculative allocation - delayed allocations
> beyond eof. Even if we write a small file, say 4k, a 64k chunk of delayed
> allocation will be credited to the file.
The second retry occurs without speculative EOF allocation. That's
what the BMAPI_SYNC flag does....
That being said, it can't truncate away pre-existing speculative
allocations on other files, which is why there is a global flush
and wait before the third retry.....
Cheers,
Dave.
--
Dave Chinner
[email protected]
Dave Chinner wrote:
> On Tue, Jan 13, 2009 at 04:58:01PM +1100, Lachlan McIlroy wrote:
>> Christoph Hellwig wrote:
>>> On Mon, Jan 12, 2009 at 06:14:36AM -0500, Mikulas Patocka wrote:
>>>> Hi
>>>>
>>>> I discovered a bug in XFS in delayed allocation.
>>>>
>>>> When you take a small partition (52MB in my case) and copy many small
>>>> files onto it (source code) that barely fit there, you get -ENOSPC.
>>>> Then you sync the partition, some free space pops up, you click "retry"
>>>> in MC and the copy continues. Then you get -ENOSPC again, you must
>>>> sync, click "retry" and go on. And so on a few times until the source
>>>> code finally fits on the XFS partition.
>>>>
>>>> This misbehavior is apparently caused by delayed allocation: delayed
>>>> allocation does not know exactly how much space the data will occupy,
>>>> so it makes an upper-bound guess. Because the free space count is only
>>>> a guess, not the actual space consumed by the data, XFS should not
>>>> return -ENOSPC based on it. When the free space appears to run out,
>>>> XFS should sync itself, retry the allocation and only return -ENOSPC
>>>> if it fails the second time, after the sync.
>> This sounds like a problem with speculative allocation - delayed allocations
>> beyond eof. Even if we write a small file, say 4k, a 64k chunk of delayed
>> allocation will be credited to the file.
>
> The second retry occurs without speculative EOF allocation. That's
> what the BMAPI_SYNC flag does....
By then it's too late. There could already be many files with delayed
allocations beyond eof that are unnecessarily consuming space. I suspect
that when those files are flushed some are not able to convert the full
delayed allocation in one extent and only convert what is needed to write
out data. The remaining unused delayed allocation is released and that's
why the freespace is going up and down.
>
> That being said, it can't truncate away pre-existing speculative
> allocations on other files, which is why there is a global flush
> and wait before the third retry.....
>
> Cheers,
>
> Dave.
On Thu, Jan 15, 2009 at 11:57:08AM +1100, Lachlan McIlroy wrote:
> Dave Chinner wrote:
> > On Tue, Jan 13, 2009 at 04:58:01PM +1100, Lachlan McIlroy wrote:
> >> Christoph Hellwig wrote:
> >>> On Mon, Jan 12, 2009 at 06:14:36AM -0500, Mikulas Patocka wrote:
> >>>> Hi
> >>>>
> >>>> I discovered a bug in XFS in delayed allocation.
> >>>>
> >>>> When you take a small partition (52MB in my case) and copy many small
> >>>> files onto it (source code) that barely fit there, you get -ENOSPC.
> >>>> Then you sync the partition, some free space pops up, you click "retry"
> >>>> in MC and the copy continues. Then you get -ENOSPC again, you must
> >>>> sync, click "retry" and go on. And so on a few times until the source
> >>>> code finally fits on the XFS partition.
> >>>>
> >>>> This misbehavior is apparently caused by delayed allocation: delayed
> >>>> allocation does not know exactly how much space the data will occupy,
> >>>> so it makes an upper-bound guess. Because the free space count is only
> >>>> a guess, not the actual space consumed by the data, XFS should not
> >>>> return -ENOSPC based on it. When the free space appears to run out,
> >>>> XFS should sync itself, retry the allocation and only return -ENOSPC
> >>>> if it fails the second time, after the sync.
> >> This sounds like a problem with speculative allocation - delayed allocations
> >> beyond eof. Even if we write a small file, say 4k, a 64k chunk of delayed
> >> allocation will be credited to the file.
> >
> > The second retry occurs without speculative EOF allocation. That's
> > what the BMAPI_SYNC flag does....
> By then it's too late. There could already be many files with delayed
> allocations beyond eof that are unnecessarily consuming space.
Yes:
> > That being said, it can't truncate away pre-existing speculative
> > allocations on other files, which is why there is a global flush
> > and wait before the third retry.....
> I suspect
> that when those files are flushed some are not able to convert the full
> delayed allocation in one extent and only convert what is needed to write
> out data. The remaining unused delayed allocation is released and that's
> why the freespace is going up and down.
It takes time to flush several thousand files. The 500ms
sleep hack is probably not a long enough delay when there are lots
of small files to flush that haven't yet had their speculative
allocations truncated.
I suspect that xfs_release() might not be truncating the speculative
EOF allocation on file close when the inode has no extents yet:
1190 if (ip->i_d.di_nlink != 0) {
1191 if ((((ip->i_d.di_mode & S_IFMT) == S_IFREG) &&
1192 ((ip->i_size > 0) || (VN_CACHED(VFS_I(ip)) > 0 ||
1193 ip->i_delayed_blks > 0)) &&
1194 >>>>>>>>> (ip->i_df.if_flags & XFS_IFEXTENTS)) &&
1195 (!(ip->i_d.di_flags &
1196 (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND)))) {
1197 error = xfs_free_eofblocks(mp, ip, XFS_FREE_EOF_LOCK);
1198 if (error)
1199 return error;
1200 }
1201 }
I think that check will prevent truncation of the speculative
allocation on new files that haven't been flushed or had allocation
done on them yet. That'll be the cause of the buildup, I think...
It's interesting that the only consideration for EOF truncation on
file close is when there are extents in the inode itself, not when it
is in btree format....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Jan 13, 2009 at 11:28:58PM -0500, Mikulas Patocka wrote:
> The result must not depend on magic timer values. If it does, you end up
> with undebuggable, nondeterministic failures.
>
> Why don't you change that 500ms wait to "wait until the flush finishes"?
> That would be correct.
Yes, this would probably be better. Could I motivate you to come up with
a patch for that?
On Sun, 18 Jan 2009, Christoph Hellwig wrote:
> On Tue, Jan 13, 2009 at 11:28:58PM -0500, Mikulas Patocka wrote:
> > The result must not depend on magic timer values. If it does, you end up
> > with undebuggable, nondeterministic failures.
> >
> > Why don't you change that 500ms wait to "wait until the flush finishes"?
> > That would be correct.
>
> Yes, this would probably be better. Could I motivate you to come up with
> a patch for that?
>
Hi
I looked at the source and found out that it uses sync_blockdev for
syncing --- but sync_blockdev writes only metadata buffers, it doesn't
touch inodes and pages and doesn't resolve delayed allocations. So it
really doesn't sync anything.
I replaced it with correct syncing of all inodes. With this patch it
passes my testcase (no more spurious -ENOSPCs), but it still isn't
correct: there is that 500ms delay --- if the machine was so overloaded
that it couldn't sync within 500ms, you still get spurious -ENOSPC.
There are concerns about possible deadlocks (the syncer may lock against
the process that is waiting for the sync to finish); that's why removing
that 500ms delay isn't as easy as it seems. I don't have the XFS knowledge
to check for the deadlocks; that should be done by XFS developers. Also,
when you resolve the deadlocks and drop the timeout, replace WB_SYNC_NONE
with WB_SYNC_ALL in this patch.
Mikulas
---
Properly sync when the filesystem runs out of space because of delayed
allocation overaccounting.
Note that sync_blockdev writes just dirty metadata buffers; it doesn't touch
inodes or data buffers --- so it had no effect.
WB_SYNC_ALL should be used instead of WB_SYNC_NONE, but WB_SYNC_ALL gets worse
results --- likely because the WB_SYNC_ALL sync locks against the process
doing the write, stalling the whole sync. This needs to be further investigated
by XFS developers. Also, further investigation is needed to remove that
delay(msecs_to_jiffies(500)).
Signed-off-by: Mikulas Patocka <[email protected]>
---
fs/xfs/linux-2.6/xfs_sync.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
Index: linux-2.6.29-rc1-devel/fs/xfs/linux-2.6/xfs_sync.c
===================================================================
--- linux-2.6.29-rc1-devel.orig/fs/xfs/linux-2.6/xfs_sync.c 2009-01-20 20:13:32.000000000 +0100
+++ linux-2.6.29-rc1-devel/fs/xfs/linux-2.6/xfs_sync.c 2009-01-20 20:24:31.000000000 +0100
@@ -446,7 +446,13 @@ xfs_flush_device_work(
void *arg)
{
struct inode *inode = arg;
- sync_blockdev(mp->m_super->s_bdev);
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ .range_start = 0,
+ .range_end = LLONG_MAX,
+ .nr_to_write = LONG_MAX,
+ };
+ generic_sync_sb_inodes(mp->m_super, &wbc);
iput(inode);
}
On Tue, Jan 20, 2009 at 02:38:27PM -0500, Mikulas Patocka wrote:
>
>
> On Sun, 18 Jan 2009, Christoph Hellwig wrote:
>
> > On Tue, Jan 13, 2009 at 11:28:58PM -0500, Mikulas Patocka wrote:
> > > The result must not depend on magic timer values. If it does, you end up
> > > with undebuggable, nondeterministic failures.
> > >
> > > Why don't you change that 500ms wait to "wait until the flush finishes"?
> > > That would be correct.
> >
> > Yes, this would probably be better. Could I motivate you to come up with
> > a patch for that?
> >
>
> Hi
>
> I looked at the source and found out that it uses sync_blockdev for
> syncing --- but sync_blockdev writes only metadata buffers, it doesn't
> touch inodes and pages and doesn't resolve delayed allocations. So it
> really doesn't sync anything.
Ah, bugger. Thanks for finding this.
> I replaced it with correct syncing of all inodes. With this patch it
> passes my testcase (no more spurious -ENOSPCs), but it still isn't
> correct: there is that 500ms delay --- if the machine was so overloaded
> that it couldn't sync within 500ms, you still get spurious -ENOSPC.
That's VFS level data syncing - there may be other XFS level stuff
that can be done as well (e.g. cleanup/truncate of unlinked inodes)
that will release space.
> There are concerns about possible deadlocks (the syncer may lock against
> the process that is waiting for the sync to finish); that's why removing
> that 500ms delay isn't as easy as it seems. I don't have the XFS knowledge
> to check for the deadlocks; that should be done by XFS developers. Also,
> when you resolve the deadlocks and drop the timeout, replace WB_SYNC_NONE
> with WB_SYNC_ALL in this patch.
Right, so you need to use internal xfs sync functions that don't
have these problems. That is:
error = xfs_sync_inodes(ip->i_mount, SYNC_DELWRI|SYNC_WAIT);
will do a blocking flush of all the inodes without deadlocks occurring.
Then you can remove the 500ms wait.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Jan 21, 2009 at 10:24:22AM +1100, Dave Chinner wrote:
> Right, so you need to use internal xfs sync functions that don't
> have these problems. That is:
>
> error = xfs_sync_inodes(ip->i_mount, SYNC_DELWRI|SYNC_WAIT);
>
> will do a blocking flush of all the inodes without deadlocks occurring.
> Then you can remove the 500ms wait.
I've given this a try with Eric's testcase from #724 in the oss bugzilla,
but it's not enough yet. I think that's because SYNC_WAIT is rather
meaningless for data writeout, and we need SYNC_IOWAIT instead. The
patch below gets the testcase working for me:
Index: linux-2.6/fs/xfs/linux-2.6/xfs_sync.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_sync.c 2009-01-22 06:29:30.646141628 +0100
+++ linux-2.6/fs/xfs/linux-2.6/xfs_sync.c 2009-01-22 21:57:53.073864570 +0100
@@ -446,7 +446,9 @@ xfs_flush_device_work(
void *arg)
{
struct inode *inode = arg;
- sync_blockdev(mp->m_super->s_bdev);
+
+ xfs_sync_inodes(mp, SYNC_DELWRI);
+ xfs_sync_inodes(mp, SYNC_DELWRI | SYNC_IOWAIT);
iput(inode);
}
On Thu, Jan 22, 2009 at 03:59:13PM -0500, Christoph Hellwig wrote:
> On Wed, Jan 21, 2009 at 10:24:22AM +1100, Dave Chinner wrote:
> > Right, so you need to use internal xfs sync functions that don't
> > have these problems. That is:
> >
> > error = xfs_sync_inodes(ip->i_mount, SYNC_DELWRI|SYNC_WAIT);
> >
> > will do a blocking flush of all the inodes without deadlocks occurring.
> > Then you can remove the 500ms wait.
>
> I've given this a try with Eric's testcase from #724 in the oss bugzilla,
> but it's not enough yet. I think that's because SYNC_WAIT is rather
> meaningless for data writeout, and we need SYNC_IOWAIT instead. The
> patch below gets the testcase working for me:
Actually I still see failures happening sometimes. I guess that's because
our flush is still asynchronous due to the schedule_work..
On Thu, 22 Jan 2009, Christoph Hellwig wrote:
> On Thu, Jan 22, 2009 at 03:59:13PM -0500, Christoph Hellwig wrote:
> > On Wed, Jan 21, 2009 at 10:24:22AM +1100, Dave Chinner wrote:
> > > Right, so you need to use internal xfs sync functions that don't
> > > have these problems. That is:
> > >
> > > error = xfs_sync_inodes(ip->i_mount, SYNC_DELWRI|SYNC_WAIT);
> > >
> > > will do a blocking flush of all the inodes without deadlocks occurring.
> > > Then you can remove the 500ms wait.
> >
> > I've given this a try with Eric's testcase from #724 in the oss bugzilla,
> > but it's not enough yet. I think that's because SYNC_WAIT is rather
> > meaningless for data writeout, and we need SYNC_IOWAIT instead. The
> > patch below gets the testcase working for me:
>
> Actually I still see failures happening sometimes. I guess that's because
> our flush is still asynchronous due to the schedule_work..
If I placed
xfs_sync_inodes(ip->i_mount, SYNC_DELWRI);
xfs_sync_inodes(ip->i_mount, SYNC_DELWRI | SYNC_IOWAIT);
directly into xfs_flush_device, I got a lock dependency warning (though not
a real deadlock).
So synchronous flushing definitely needs some thinking and lock
rearchitecting. If you sync it asynchronously, a deadlock possibility will
actually turn into a spurious -ENOSPC (because the flushing thread will
wait until the main thread sleeps for 500ms, drops the lock and returns
ENOSPC).
Mikulas
=============================================
[ INFO: possible recursive locking detected ]
2.6.29-rc1-devel #2
---------------------------------------------
mc/336 is trying to acquire lock:
(&(&ip->i_iolock)->mr_lock){----}, at: [<0000000010186b88>]
xfs_ilock+0x68/0xa0 [xfs]
but task is already holding lock:
(&(&ip->i_iolock)->mr_lock){----}, at: [<0000000010186bb0>]
xfs_ilock+0x90/0xa0 [xfs]
other info that might help us debug this:
2 locks held by mc/336:
#0: (&sb->s_type->i_mutex_key#8){--..}, at: [<00000000101b1560>]
xfs_write+0x340/0x6a0 [xfs]
#1: (&(&ip->i_iolock)->mr_lock){----}, at: [<0000000010186bb0>]
xfs_ilock+0x90/0xa0 [xfs]
stack backtrace:
Call Trace:
[000000000047e924] validate_chain+0x1184/0x1460
[000000000047ee88] __lock_acquire+0x288/0xae0
[000000000047f734] lock_acquire+0x54/0x80
[0000000000470b20] down_read_nested+0x40/0x60
[0000000010186b88] xfs_ilock+0x68/0xa0 [xfs]
[00000000101b4f78] xfs_sync_inodes_ag+0x218/0x320 [xfs]
[00000000101b5100] xfs_sync_inodes+0x80/0x100 [xfs]
[00000000101b51a0] xfs_flush_device+0x20/0x60 [xfs]
[000000001018e654] xfs_flush_space+0x94/0x100 [xfs]
[000000001018e95c] xfs_iomap_write_delay+0x13c/0x240 [xfs]
[000000001018f100] xfs_iomap+0x280/0x300 [xfs]
[00000000101a84d4] __xfs_get_blocks+0x74/0x260 [xfs]
[00000000101a8724] xfs_get_blocks+0x24/0x40 [xfs]
[00000000004e6800] __block_prepare_write+0x220/0x4e0
[00000000004e6b68] block_write_begin+0x48/0x100
[00000000101a785c] xfs_vm_write_begin+0x3c/0x60 [xfs]
On Fri, Jan 23, 2009 at 03:14:30PM -0500, Mikulas Patocka wrote:
>
>
> On Thu, 22 Jan 2009, Christoph Hellwig wrote:
>
> > On Thu, Jan 22, 2009 at 03:59:13PM -0500, Christoph Hellwig wrote:
> > > On Wed, Jan 21, 2009 at 10:24:22AM +1100, Dave Chinner wrote:
> > > > Right, so you need to use internal xfs sync functions that don't
> > > > have these problems. That is:
> > > >
> > > > error = xfs_sync_inodes(ip->i_mount, SYNC_DELWRI|SYNC_WAIT);
> > > >
> > > > will do a blocking flush of all the inodes without deadlocks occurring.
> > > > Then you can remove the 500ms wait.
> > >
> > > I've given this a try with Eric's testcase from #724 in the oss bugzilla,
> > > but it's not enough yet. I think that's because SYNC_WAIT is rather
> > > meaningless for data writeout, and we need SYNC_IOWAIT instead. The
> > > patch below gets the testcase working for me:
> >
> > Actually I still see failures happening sometimes. I guess that's because
> > our flush is still asynchronous due to the schedule_work..
>
> If I placed
> xfs_sync_inodes(ip->i_mount, SYNC_DELWRI);
> xfs_sync_inodes(ip->i_mount, SYNC_DELWRI | SYNC_IOWAIT);
> directly into xfs_flush_device, I got a lock dependency warning (though not
> a real deadlock).
Same reason memory reclaim gives lockdep warnings on XFS - we're
recursing into operations that take inode locks while we currently
hold an inode lock. However, it shouldn't deadlock because
we should never try to take the iolock on the inode that we currently
hold it on.
> So synchronous flushing definitely needs some thinking and lock
> rearchitecting.
No, not at all. At most the grabbing of the iolock in
xfs_sync_inodes_ag() needs to become a trylock....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Sat, 24 Jan 2009, Dave Chinner wrote:
> On Fri, Jan 23, 2009 at 03:14:30PM -0500, Mikulas Patocka wrote:
> >
> >
> > On Thu, 22 Jan 2009, Christoph Hellwig wrote:
> >
> > > On Thu, Jan 22, 2009 at 03:59:13PM -0500, Christoph Hellwig wrote:
> > > > On Wed, Jan 21, 2009 at 10:24:22AM +1100, Dave Chinner wrote:
> > > > > Right, so you need to use internal xfs sync functions that don't
> > > > > have these problems. That is:
> > > > >
> > > > > error = xfs_sync_inodes(ip->i_mount, SYNC_DELWRI|SYNC_WAIT);
> > > > >
> > > > > will do a blocking flush of all the inodes without deadlocks occurring.
> > > > > Then you can remove the 500ms wait.
> > > >
> > > > I've given this a try with Eric's testcase from #724 in the oss bugzilla,
> > > > but it's not enough yet. I think that's because SYNC_WAIT is rather
> > > > meaningless for data writeout, and we need SYNC_IOWAIT instead. The
> > > > patch below gets the testcase working for me:
> > >
> > > Actually I still see failures happening sometimes. I guess that's because
> > > our flush is still asynchronous due to the schedule_work..
> >
> > If I placed
> > xfs_sync_inodes(ip->i_mount, SYNC_DELWRI);
> > xfs_sync_inodes(ip->i_mount, SYNC_DELWRI | SYNC_IOWAIT);
> > directly into xfs_flush_device, I got a lock dependency warning (though not
> > a real deadlock).
>
> Same reason memory reclaim gives lockdep warnings on XFS - we're
> recursing into operations that take inode locks while we currently
> hold an inode lock. However, it shouldn't deadlock because
> we should never try to take the iolock on the inode that we currently
> hold it on.
>
> > So synchronous flushing definitely needs some thinking and lock
> > rearchitecting.
>
> No, not at all. At most the grabbing of the iolock in
> xfs_sync_inodes_ag() needs to become a trylock....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
You are wrong, the comments in the code are right. It really deadlocks if
it is modified to use synchronous wait for the end of the flush and if
the flush is done with xfs_sync_inodes:
xfssyncd D 0000000000606808 0 4819 2
Call Trace:
[0000000000606788] rwsem_down_failed_common+0x1ac/0x1d8
[0000000000606808] rwsem_down_read_failed+0x24/0x34
[0000000000606848] __down_read+0x30/0x40
[0000000000468a64] down_read_nested+0x48/0x58
[00000000100e6cc8] xfs_ilock+0x48/0x8c [xfs]
[000000001011018c] xfs_sync_inodes_ag+0x17c/0x2dc [xfs]
[000000001011034c] xfs_sync_inodes+0x60/0xc4 [xfs]
[00000000101103c4] xfs_flush_device_work+0x14/0x2c [xfs]
[000000001010ff1c] xfssyncd+0x110/0x174 [xfs]
[000000000046556c] kthread+0x54/0x88
[000000000042b2a0] kernel_thread+0x3c/0x54
[00000000004653b8] kthreadd+0xac/0x160
dd D 00000000006045c4 0 4829 2500
Call Trace:
[00000000006047d8] schedule_timeout+0x24/0xb8
[00000000006045c4] wait_for_common+0xe4/0x14c
[0000000000604700] wait_for_completion+0x1c/0x2c
[00000000101106b0] xfs_flush_device+0x68/0x88 [xfs]
[00000000100edba4] xfs_flush_space+0xa8/0xd0 [xfs]
[00000000100edec0] xfs_iomap_write_delay+0x1bc/0x228 [xfs]
[00000000100ee4b8] xfs_iomap+0x1c4/0x2e0 [xfs]
[0000000010104f04] __xfs_get_blocks+0x74/0x240 [xfs]
[000000001010512c] xfs_get_blocks+0x24/0x38 [xfs]
[00000000004d05f0] __block_prepare_write+0x184/0x41c
[00000000004d095c] block_write_begin+0x84/0xe8
[000000001010447c] xfs_vm_write_begin+0x3c/0x50 [xfs]
[0000000000485258] generic_file_buffered_write+0xe8/0x28c
[000000001010cec4] xfs_write+0x40c/0x604 [xfs]
[0000000010108d3c] xfs_file_aio_write+0x74/0x84 [xfs]
[00000000004ae70c] do_sync_write+0x8c/0xdc
Mikulas
> > > So synchronous flushing definitely needs some thinking and lock
> > > rearchitecting.
> >
> > No, not at all. At most the grabbing of the iolock in
> > xfs_sync_inodes_ag() needs to become a trylock....
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > [email protected]
>
> You are wrong, the comments in the code are right. It really deadlocks if
> it is modified to use synchronous wait for the end of the flush and if
> the flush is done with xfs_sync_inodes:
>
> xfssyncd D 0000000000606808 0 4819 2
> Call Trace:
> [0000000000606788] rwsem_down_failed_common+0x1ac/0x1d8
> [0000000000606808] rwsem_down_read_failed+0x24/0x34
> [0000000000606848] __down_read+0x30/0x40
> [0000000000468a64] down_read_nested+0x48/0x58
> [00000000100e6cc8] xfs_ilock+0x48/0x8c [xfs]
> [000000001011018c] xfs_sync_inodes_ag+0x17c/0x2dc [xfs]
> [000000001011034c] xfs_sync_inodes+0x60/0xc4 [xfs]
> [00000000101103c4] xfs_flush_device_work+0x14/0x2c [xfs]
> [000000001010ff1c] xfssyncd+0x110/0x174 [xfs]
> [000000000046556c] kthread+0x54/0x88
> [000000000042b2a0] kernel_thread+0x3c/0x54
> [00000000004653b8] kthreadd+0xac/0x160
> dd D 00000000006045c4 0 4829 2500
> Call Trace:
> [00000000006047d8] schedule_timeout+0x24/0xb8
> [00000000006045c4] wait_for_common+0xe4/0x14c
> [0000000000604700] wait_for_completion+0x1c/0x2c
> [00000000101106b0] xfs_flush_device+0x68/0x88 [xfs]
> [00000000100edba4] xfs_flush_space+0xa8/0xd0 [xfs]
> [00000000100edec0] xfs_iomap_write_delay+0x1bc/0x228 [xfs]
> [00000000100ee4b8] xfs_iomap+0x1c4/0x2e0 [xfs]
> [0000000010104f04] __xfs_get_blocks+0x74/0x240 [xfs]
> [000000001010512c] xfs_get_blocks+0x24/0x38 [xfs]
> [00000000004d05f0] __block_prepare_write+0x184/0x41c
> [00000000004d095c] block_write_begin+0x84/0xe8
> [000000001010447c] xfs_vm_write_begin+0x3c/0x50 [xfs]
> [0000000000485258] generic_file_buffered_write+0xe8/0x28c
> [000000001010cec4] xfs_write+0x40c/0x604 [xfs]
> [0000000010108d3c] xfs_file_aio_write+0x74/0x84 [xfs]
> [00000000004ae70c] do_sync_write+0x8c/0xdc
>
> Mikulas
BTW, I modified it to use a completion to wait for the end of the sync work
(instead of that fixed 500ms wait) and got this deadlock quite quickly.
filemap_flush used to deadlock; I disabled it and used only
xfs_sync_inodes, but xfs_sync_inodes also deadlocks.
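Roughly, the modification that produced the deadlock above looked like
this (simplified sketch; inode reference handling and the queueing of
the work item to xfssyncd are left as in the existing code):

/* Signal a completion from the flush work and wait for it, instead of
 * sleeping for a fixed 500ms. */
struct flush_completion {
	struct completion	done;
};

STATIC void
xfs_flush_device_work(
	struct xfs_mount	*mp,
	void			*arg)
{
	struct flush_completion	*fc = arg;

	xfs_sync_inodes(mp, SYNC_DELWRI);
	xfs_sync_inodes(mp, SYNC_DELWRI | SYNC_IOWAIT);
	complete(&fc->done);
}

void
xfs_flush_device(
	xfs_inode_t		*ip)
{
	struct flush_completion	fc;

	init_completion(&fc.done);
	/* queue xfs_flush_device_work to xfssyncd as before, passing &fc */
	wait_for_completion(&fc.done);
}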
Mikulas
On Thu, Jan 29, 2009 at 11:39:00AM -0500, Mikulas Patocka wrote:
> On Sat, 24 Jan 2009, Dave Chinner wrote:
> > On Fri, Jan 23, 2009 at 03:14:30PM -0500, Mikulas Patocka wrote:
> > > If I placed
> > > xfs_sync_inodes(ip->i_mount, SYNC_DELWRI);
> > > xfs_sync_inodes(ip->i_mount, SYNC_DELWRI | SYNC_IOWAIT);
> > > directly into xfs_flush_device, I got a lock dependency warning (though not
> > > a real deadlock).
> >
> > Same reason memory reclaim gives lockdep warnings on XFS - we're
> > recursing into operations that take inode locks while we currently
> > hold an inode lock. However, it shouldn't deadlock because
> > we should never try to take the iolock on the inode that we currently
> > hold it on.
> >
> > > So synchronous flushing definitely needs some thinking and lock
> > > rearchitecting.
> >
> > No, not at all. At most the grabbing of the iolock in
> > xfs_sync_inodes_ag() needs to become a trylock....
>
> You are wrong, the comments in the code are right. It really
> deadlocks if it is modified to use synchronous wait for the end of
> the flush and if the flush is done with xfs_sync_inodes:
>
> xfssyncd D 0000000000606808 0 4819 2
> Call Trace:
> [0000000000606788] rwsem_down_failed_common+0x1ac/0x1d8
> [0000000000606808] rwsem_down_read_failed+0x24/0x34
> [0000000000606848] __down_read+0x30/0x40
> [0000000000468a64] down_read_nested+0x48/0x58
> [00000000100e6cc8] xfs_ilock+0x48/0x8c [xfs]
> [000000001011018c] xfs_sync_inodes_ag+0x17c/0x2dc [xfs]
> [000000001011034c] xfs_sync_inodes+0x60/0xc4 [xfs]
> [00000000101103c4] xfs_flush_device_work+0x14/0x2c [xfs]
> [000000001010ff1c] xfssyncd+0x110/0x174 [xfs]
> [000000000046556c] kthread+0x54/0x88
> [000000000042b2a0] kernel_thread+0x3c/0x54
> [00000000004653b8] kthreadd+0xac/0x160
So it is stuck:
127 /*
128 * If we have to flush data or wait for I/O completion
129 * we need to hold the iolock.
130 */
131 if ((flags & SYNC_DELWRI) && VN_DIRTY(inode)) {
132 >>>>>>>> xfs_ilock(ip, XFS_IOLOCK_SHARED);
133 lock_flags |= XFS_IOLOCK_SHARED;
134 error = xfs_flush_pages(ip, 0, -1, fflag, FI_NONE);
135 if (flags & SYNC_IOWAIT)
136 xfs_ioend_wait(ip);
137 }
Given that we are stuck on the iolock in xfs_sync_inodes_ag(), I
suspect you should re-read my comments above about "lock
re-architecting" ;).
If you make the xfs_ilock() there an xfs_ilock_nowait() and avoid data
writeback if we don't get the lock, the deadlock goes away, right?
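Something like this in xfs_sync_inodes_ag(), i.e. the excerpt above with
only the lock call changed (untested):

		/*
		 * If we have to flush data or wait for I/O completion
		 * we need to hold the iolock -- but only try for it,
		 * so a caller already holding the iolock can't
		 * deadlock us.
		 */
		if ((flags & SYNC_DELWRI) && VN_DIRTY(inode)) {
			if (xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
				lock_flags |= XFS_IOLOCK_SHARED;
				error = xfs_flush_pages(ip, 0, -1, fflag, FI_NONE);
				if (flags & SYNC_IOWAIT)
					xfs_ioend_wait(ip);
			}
		}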
BTW, can you post the patch you are working on?
Cheers,
Dave.
--
Dave Chinner
[email protected]