2008-08-20 05:43:47

by Aneesh Kumar K.V

[permalink] [raw]
Subject: ENOSPC returned during writepages

Hi,

I am getting this even with the latest patch queue. The test program is
a modified fsstress with fallocate support.

mpage_da_map_blocks block allocation failed for inode 377954 at logical
offset 313 with max blocks 4 with error -28
mpage_da_map_blocks block allocation failed for inode 336367 at logical
offset 74 with max blocks 9 with error -28
mpage_da_map_blocks block allocation failed for inode 345560 at logical
offset 542 with max blocks 7 with error -28
This should not happen.!! Data will be lost
mpage_da_map_blocks block allocation failed for inode 355317 at logical
offset 152 with max blocks 10 with error -28
This should not happen.!! Data will be lost
mpage_da_map_blocks block allocation failed for inode 395261 at logical
offset 462 with max blocks 1 with error -28
This should not happen.!! Data will be lost
mpage_da_map_blocks block allocation failed for inode 323784 at logical
offset 313 with max blocks 11 with error -28
This should not happen.!! Data will be lost



2008-08-20 10:46:51

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages

On Wed, Aug 20, 2008 at 11:13:39AM +0530, Aneesh Kumar K.V wrote:
> Hi,
>
> I am getting this even with the latest patch queue. The test program is
> a modified fsstress with fallocate support.
>
> mpage_da_map_blocks block allocation failed for inode 377954 at logical
> offset 313 with max blocks 4 with error -28
> mpage_da_map_blocks block allocation failed for inode 336367 at logical
> offset 74 with max blocks 9 with error -28
> mpage_da_map_blocks block allocation failed for inode 345560 at logical
> offset 542 with max blocks 7 with error -28
> This should not happen.!! Data will be lost
> mpage_da_map_blocks block allocation failed for inode 355317 at logical
> offset 152 with max blocks 10 with error -28
> This should not happen.!! Data will be lost
> mpage_da_map_blocks block allocation failed for inode 395261 at logical
> offset 462 with max blocks 1 with error -28
> This should not happen.!! Data will be lost
> mpage_da_map_blocks block allocation failed for inode 323784 at logical
> offset 313 with max blocks 11 with error -28
> This should not happen.!! Data will be lost
>

I tried this patch. There are still multiple ways we can get wrong free
block count. The patch reduced the number of errors. So we are doing
better with patch. But I guess we can't use the percpu_counter based
free block accounting with delalloc. Without delalloc it is ok even if
we find some wrong free blocks count . The actual block allocation will fail in
that case and we handle it perfectly fine. With delalloc we cannot
afford to fail the block allocation. Should we look at a free block
accounting rewrite using simple ext4_fsblk_t and and a spin lock ?

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index dfe2d4f..00934b1 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -1624,7 +1624,7 @@ ext4_fsblk_t ext4_has_free_blocks(struct ext4_sb_info *sbi,
(sbi->s_resgid == 0 || !in_group_p(sbi->s_resgid)))
root_blocks = ext4_r_blocks_count(sbi->s_es);
#ifdef CONFIG_SMP
- if (free_blocks - root_blocks < FBC_BATCH)
+ if (free_blocks - (nblocks + root_blocks) < FBC_BATCH)
free_blocks =
percpu_counter_sum_and_set(&sbi->s_freeblocks_counter);
#endif
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1c289c1..f825470 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1543,7 +1543,14 @@ static int ext4_da_reserve_space(struct inode *inode, int nrblocks)
}
/* reduce fs free blocks counter */
percpu_counter_sub(&sbi->s_freeblocks_counter, total);

2008-08-20 11:53:35

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages

On Wed, Aug 20, 2008 at 04:16:44PM +0530, Aneesh Kumar K.V wrote:
> > mpage_da_map_blocks block allocation failed for inode 323784 at logical
> > offset 313 with max blocks 11 with error -28
> > This should not happen.!! Data will be lost

We don't actually lose the data if free blocks are subsequently made
available, correct?

> I tried this patch. There are still multiple ways we can get wrong free
> block count. The patch reduced the number of errors. So we are doing
> better with patch. But I guess we can't use the percpu_counter based
> free block accounting with delalloc. Without delalloc it is ok even if
> we find some wrong free blocks count . The actual block allocation will fail in
> that case and we handle it perfectly fine. With delalloc we cannot
> afford to fail the block allocation. Should we look at a free block
> accounting rewrite using simple ext4_fsblk_t and and a spin lock ?

It would be a shame if we did given that the whole point of the percpu
counter was to avoid a scalability bottleneck. Perhaps we could take
a filesystem-level spinlock only when the number of free blocks as
reported by the percpu_counter falls below some critical level?

> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1543,7 +1543,14 @@ static int ext4_da_reserve_space(struct inode *inode, int nrblocks)
> }
> /* reduce fs free blocks counter */
> percpu_counter_sub(&sbi->s_freeblocks_counter, total);
> -
> + /*
> + * Now check whether the block count has gone negative.
> + * Some other CPU could have reserved blocks in between
> + */
> + if (percpu_counter_read(&sbi->s_freeblocks_counter) < 0) {
> + spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> + return -ENOSPC;
> + }


I think you want to do the check before calling percpu_counter_sub();
otherwise when you return ENOSPC the free blocks counter ends up
getting reduced (and gets left negative).

Also, this is one of the places where it might help if we did
something like:

freeblocks = percpu_counter_read(&sbi->s_freeblocks_counter);
if (freeblocks < NR_CPUS*4)
freeblocks = percpu_counter_sum(&sbi->s_freeblocks_counter);

if (freeblocks < total) {
spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
return -ENOSPC;
}

BTW, I was looking at the percpu_counter interface, and I'm confused
why we have percpu_counter_sum_and_set() and percpu_counter_sum(). If
we're taking the fbc->lock to calculate the precise value of the
counter, why not simply set fbc->count?

Also, it is singularly unfortunate that certain interfaces, such as
percpu_counter_sum_and_set() only exist for CONFIG_SMP. This is
definitely post-2.6.27, but it seems to me that we probably want
something like percpu_counter_compare_lt() which does something like this:

static inline int percpu_counter_compare_lt(struct percpu_counter *fbc,
s64 amount)
{
#ifdef CONFIG_SMP
if ((fbc->count - amount) < FBC_BATCH)
percpu_counter_sum_and_set(fbc);
#endif
return (fbc->count < amount);
}

... which we would then use in ext4_has_free_blocks() and
ext4_da_reserve_space().

- Ted

2008-08-20 18:28:06

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages

On Wed, Aug 20, 2008 at 07:53:31AM -0400, Theodore Tso wrote:
> On Wed, Aug 20, 2008 at 04:16:44PM +0530, Aneesh Kumar K.V wrote:
> > > mpage_da_map_blocks block allocation failed for inode 323784 at logical
> > > offset 313 with max blocks 11 with error -28
> > > This should not happen.!! Data will be lost
>
> We don't actually lose the data if free blocks are subsequently made
> available, correct?
>
> > I tried this patch. There are still multiple ways we can get wrong free
> > block count. The patch reduced the number of errors. So we are doing
> > better with patch. But I guess we can't use the percpu_counter based
> > free block accounting with delalloc. Without delalloc it is ok even if
> > we find some wrong free blocks count . The actual block allocation will fail in
> > that case and we handle it perfectly fine. With delalloc we cannot
> > afford to fail the block allocation. Should we look at a free block
> > accounting rewrite using simple ext4_fsblk_t and and a spin lock ?
>
> It would be a shame if we did given that the whole point of the percpu
> counter was to avoid a scalability bottleneck. Perhaps we could take
> a filesystem-level spinlock only when the number of free blocks as
> reported by the percpu_counter falls below some critical level?
>
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -1543,7 +1543,14 @@ static int ext4_da_reserve_space(struct inode *inode, int nrblocks)
> > }
> > /* reduce fs free blocks counter */
> > percpu_counter_sub(&sbi->s_freeblocks_counter, total);
> > -
> > + /*
> > + * Now check whether the block count has gone negative.
> > + * Some other CPU could have reserved blocks in between
> > + */
> > + if (percpu_counter_read(&sbi->s_freeblocks_counter) < 0) {
> > + spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> > + return -ENOSPC;
> > + }
>
>
> I think you want to do the check before calling percpu_counter_sub();
> otherwise when you return ENOSPC the free blocks counter ends up
> getting reduced (and gets left negative).
>
> Also, this is one of the places where it might help if we did
> something like:
>
> freeblocks = percpu_counter_read(&sbi->s_freeblocks_counter);
> if (freeblocks < NR_CPUS*4)
> freeblocks = percpu_counter_sum(&sbi->s_freeblocks_counter);
>
> if (freeblocks < total) {
> spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> return -ENOSPC;
> }
>
> BTW, I was looking at the percpu_counter interface, and I'm confused
> why we have percpu_counter_sum_and_set() and percpu_counter_sum(). If
> we're taking the fbc->lock to calculate the precise value of the
> counter, why not simply set fbc->count?
>
> Also, it is singularly unfortunate that certain interfaces, such as
> percpu_counter_sum_and_set() only exist for CONFIG_SMP. This is
> definitely post-2.6.27, but it seems to me that we probably want
> something like percpu_counter_compare_lt() which does something like this:
>
> static inline int percpu_counter_compare_lt(struct percpu_counter *fbc,
> s64 amount)
> {
> #ifdef CONFIG_SMP
> if ((fbc->count - amount) < FBC_BATCH)
> percpu_counter_sum_and_set(fbc);
> #endif
> return (fbc->count < amount);
> }
>
> ... which we would then use in ext4_has_free_blocks() and
> ext4_da_reserve_space().
>

Let's say FBC_BATCH = 64 and fbc->count = 100 and we have four cpus and
each cpu request for 30 blocks. each CPU does

in ext4_has_free_blocks:
free_blocks - nblocks = 100 - 30 = 70 and is > FBC_BATCH So we don't do
percpu_counter_sum_and_set
That means ext4_has_free_blocks return success

Now while claiming blocks we do
__percpu_counter_add(fbc, 30, 64)

here 30 < 64. That means we don't do fbc->count += count.
so fbc->count remains as 100 and we have 4 cpu successfully
allocating 30 blocks which means we have to satisfy 120 blocks.

-aneesh


2008-08-20 19:25:49

by Andreas Dilger

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages

On Aug 20, 2008 07:53 -0400, Theodore Ts'o wrote:
> Also, this is one of the places where it might help if we did
> something like:
>
> freeblocks = percpu_counter_read(&sbi->s_freeblocks_counter);
> if (freeblocks < NR_CPUS*4)
> freeblocks = percpu_counter_sum(&sbi->s_freeblocks_counter);
>
> if (freeblocks < total) {
> spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> return -ENOSPC;
> }

This is definitely a start. Lustre does the freeblocks granting to clients
in a manner that the amount of grant is some fraction of the remaining free
space (up to a maximum), and the clients only have to block and do sync IO
when the free space is very low.

The per-CPU allocation is very similar to having multiple clients...
I don't think NR_CPUS*4 is big enough to avoid the races though. It
needs to be something like 4 * max(FBC_BATCH, total) * NR_CPUS.

What I think makes sense, however, is that if freeblocks < $threshold that
a global spinlock is taken and the percpu_counter_sum() is done under the
lock before deciding if enough space is left. Since it is impossible that
the other CPUs get below -FBC_BATCH away from the correct free space they
should all get the spinlock at the same time when we get too low.

> BTW, I was looking at the percpu_counter interface, and I'm confused
> why we have percpu_counter_sum_and_set() and percpu_counter_sum(). If
> we're taking the fbc->lock to calculate the precise value of the
> counter, why not simply set fbc->count?
>
> Also, it is singularly unfortunate that certain interfaces, such as
> percpu_counter_sum_and_set() only exist for CONFIG_SMP. This is
> definitely post-2.6.27, but it seems to me that we probably want
> something like percpu_counter_compare_lt() which does something like this:
>
> static inline int percpu_counter_compare_lt(struct percpu_counter *fbc,
> s64 amount)
> {
> #ifdef CONFIG_SMP
> if ((fbc->count - amount) < FBC_BATCH)
> percpu_counter_sum_and_set(fbc);
> #endif
> return (fbc->count < amount);
> }
>
> ... which we would then use in ext4_has_free_blocks() and
> ext4_da_reserve_space().
>
> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-08-20 19:34:09

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages

On Wed, Aug 20, 2008 at 01:25:36PM -0600, Andreas Dilger wrote:
> What I think makes sense, however, is that if freeblocks < $threshold that
> a global spinlock is taken and the percpu_counter_sum() is done under the
> lock before deciding if enough space is left. Since it is impossible that
> the other CPUs get below -FBC_BATCH away from the correct free space they
> should all get the spinlock at the same time when we get too low.

Yep, I agree. I suggested something very similar as my first suggestion.

I do think though that we need to rationalize the percpu_counter
interface, though; those are two separable issues, and both IMHO need
fixing... The fact that we have a #ifdef CONFIG_SMP in
fs/ext4/inode.c and we are doing wierd things with percpu_counter_sum
and percpu_counter_sum_and_set is ugly. And it would be good if most
of the time we can avoid taking the filesystem-level spinlock, and
rely on the percpu_counter except when we start getting low on space.

- Ted

2008-08-20 20:56:50

by Mingming Cao

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages


在 2008-08-20三的 07:53 -0400,Theodore Tso写道:
> On Wed, Aug 20, 2008 at 04:16:44PM +0530, Aneesh Kumar K.V wrote:
> > > mpage_da_map_blocks block allocation failed for inode 323784 at logical
> > > offset 313 with max blocks 11 with error -28
> > > This should not happen.!! Data will be lost
>
> We don't actually lose the data if free blocks are subsequently made
> available, correct?
>
> > I tried this patch. There are still multiple ways we can get wrong free
> > block count. The patch reduced the number of errors. So we are doing
> > better with patch. But I guess we can't use the percpu_counter based
> > free block accounting with delalloc. Without delalloc it is ok even if
> > we find some wrong free blocks count . The actual block allocation will fail in
> > that case and we handle it perfectly fine. With delalloc we cannot
> > afford to fail the block allocation. Should we look at a free block
> > accounting rewrite using simple ext4_fsblk_t and and a spin lock ?
>
> It would be a shame if we did given that the whole point of the percpu
> counter was to avoid a scalability bottleneck. Perhaps we could take
> a filesystem-level spinlock only when the number of free blocks as
> reported by the percpu_counter falls below some critical level?
>

Agree, and perhaps we should fall back to non-delalloc mode if the fs
free blocks below some critical level?

> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -1543,7 +1543,14 @@ static int ext4_da_reserve_space(struct inode *inode, int nrblocks)
> > }
> > /* reduce fs free blocks counter */
> > percpu_counter_sub(&sbi->s_freeblocks_counter, total);
> > -
> > + /*
> > + * Now check whether the block count has gone negative.
> > + * Some other CPU could have reserved blocks in between
> > + */
> > + if (percpu_counter_read(&sbi->s_freeblocks_counter) < 0) {
> > + spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> > + return -ENOSPC;
> > + }
>
>
> I think you want to do the check before calling percpu_counter_sub();
> otherwise when you return ENOSPC the free blocks counter ends up
> getting reduced (and gets left negative).
>

The check is done in ext4_has_free_blocks(), which does do the
percpu_counter_read_positive() first, then do a percpu_counter_sum_set
() if the free blocks are below the threshhold.

There is always a window between ext4_has_free_blocks() and
percpu_counter_sub(&sbi->s_freeblocks_counter, total), so perhaps we
should introduce a new interface in percpu counter, which does the sum
and sub together, protected by the global percpu lock.
}
> Also, this is one of the places where it might help if we did
> something like:
>
> freeblocks = percpu_counter_read(&sbi->s_freeblocks_counter);
> if (freeblocks < NR_CPUS*4)
> freeblocks = percpu_counter_sum(&sbi->s_freeblocks_counter);
>
> if (freeblocks < total) {
> spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> return -ENOSPC;
> }
>

This is what currently ext4_has_free_blocks() is trying to do...

> BTW, I was looking at the percpu_counter interface, and I'm confused
> why we have percpu_counter_sum_and_set() and percpu_counter_sum(). If
> we're taking the fbc->lock to calculate the precise value of the
> counter, why not simply set fbc->count?
>

I added the percpu_count_sum_and _set() interface, when addingdelalloc
block reservation. I agree it make sense to clean up current all the
user of percpu_counter_sum() and replace with
percpu_counter_sum_and_set(), just hasn't get chance to clean up yet.

> Also, it is singularly unfortunate that certain interfaces, such as
> percpu_counter_sum_and_set() only exist for CONFIG_SMP. This is
> definitely post-2.6.27, but it seems to me that we probably want
> something like percpu_counter_compare_lt() which does something like this:
>
> static inline int percpu_counter_compare_lt(struct percpu_counter *fbc,
> s64 amount)
> {
> #ifdef CONFIG_SMP
> if ((fbc->count - amount) < FBC_BATCH)
> percpu_counter_sum_and_set(fbc);
> #endif
> return (fbc->count < amount);
> }
>
Looks better.

> ... which we would then use in ext4_has_free_blocks() and
> ext4_da_reserve_space().
>
> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2008-08-20 21:35:38

by Mingming Cao

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages


在 2008-08-20三的 23:57 +0530,Aneesh Kumar K.V写道:
> On Wed, Aug 20, 2008 at 07:53:31AM -0400, Theodore Tso wrote:
> > On Wed, Aug 20, 2008 at 04:16:44PM +0530, Aneesh Kumar K.V wrote:
> > > > mpage_da_map_blocks block allocation failed for inode 323784 at logical
> > > > offset 313 with max blocks 11 with error -28
> > > > This should not happen.!! Data will be lost
> >
> > We don't actually lose the data if free blocks are subsequently made
> > available, correct?
> >
> > > I tried this patch. There are still multiple ways we can get wrong free
> > > block count. The patch reduced the number of errors. So we are doing
> > > better with patch. But I guess we can't use the percpu_counter based
> > > free block accounting with delalloc. Without delalloc it is ok even if
> > > we find some wrong free blocks count . The actual block allocation will fail in
> > > that case and we handle it perfectly fine. With delalloc we cannot
> > > afford to fail the block allocation. Should we look at a free block
> > > accounting rewrite using simple ext4_fsblk_t and and a spin lock ?
> >
> > It would be a shame if we did given that the whole point of the percpu
> > counter was to avoid a scalability bottleneck. Perhaps we could take
> > a filesystem-level spinlock only when the number of free blocks as
> > reported by the percpu_counter falls below some critical level?
> >
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -1543,7 +1543,14 @@ static int ext4_da_reserve_space(struct inode *inode, int nrblocks)
> > > }
> > > /* reduce fs free blocks counter */
> > > percpu_counter_sub(&sbi->s_freeblocks_counter, total);
> > > -
> > > + /*
> > > + * Now check whether the block count has gone negative.
> > > + * Some other CPU could have reserved blocks in between
> > > + */
> > > + if (percpu_counter_read(&sbi->s_freeblocks_counter) < 0) {
> > > + spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> > > + return -ENOSPC;
> > > + }
> >
> >
> > I think you want to do the check before calling percpu_counter_sub();
> > otherwise when you return ENOSPC the free blocks counter ends up
> > getting reduced (and gets left negative).
> >
> > Also, this is one of the places where it might help if we did
> > something like:
> >
> > freeblocks = percpu_counter_read(&sbi->s_freeblocks_counter);
> > if (freeblocks < NR_CPUS*4)
> > freeblocks = percpu_counter_sum(&sbi->s_freeblocks_counter);
> >
> > if (freeblocks < total) {
> > spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> > return -ENOSPC;
> > }
> >
> > BTW, I was looking at the percpu_counter interface, and I'm confused
> > why we have percpu_counter_sum_and_set() and percpu_counter_sum(). If
> > we're taking the fbc->lock to calculate the precise value of the
> > counter, why not simply set fbc->count?
> >
> > Also, it is singularly unfortunate that certain interfaces, such as
> > percpu_counter_sum_and_set() only exist for CONFIG_SMP. This is
> > definitely post-2.6.27, but it seems to me that we probably want
> > something like percpu_counter_compare_lt() which does something like this:
> >
> > static inline int percpu_counter_compare_lt(struct percpu_counter *fbc,
> > s64 amount)
> > {
> > #ifdef CONFIG_SMP
> > if ((fbc->count - amount) < FBC_BATCH)
> > percpu_counter_sum_and_set(fbc);
> > #endif
> > return (fbc->count < amount);
> > }
> >
> > ... which we would then use in ext4_has_free_blocks() and
> > ext4_da_reserve_space().
> >
>
> Let's say FBC_BATCH = 64 and fbc->count = 100 and we have four cpus and
> each cpu request for 30 blocks. each CPU does
>

But, ext4_da_reserve_space() is called at the prepare_write/write_begin
time for each page to write, so at most per cpu would request 1 block at
a time, it is not possible to request reserve 30 blocks at a time.

> in ext4_has_free_blocks:
> free_blocks - nblocks = 100 - 30 = 70 and is > FBC_BATCH So we don't do

free_blocks is not necessary 100,

free_blocks is percpu_counter_read_positive(), which reads the local cpu
counter. In your example, if the global counter is 100, but the local
cpu counter is 0, then you will get free_blocks = 0 here. nblocks = 1,
then you will get

free_blocks - nblocks = 0-1 =-1, which will call
percpu_counter_sum_and_set() to get more accurate value.

> percpu_counter_sum_and_set
> That means ext4_has_free_blocks return success
>
> Now while claiming blocks we do
> __percpu_counter_add(fbc, 30, 64)
>
> here 30 < 64. That means we don't do fbc->count += count.
> so fbc->count remains as 100 and we have 4 cpu successfully
> allocating 30 blocks which means we have to satisfy 120 blocks.
>

The situation you described here could happen, but really rare and
should happen at the case fs is really full. The total number of global
free blocks have to be less than total number of CPU, and there are
multiple threads write/allocate on each cpu.

Mingming
> -aneesh
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2008-08-20 21:55:41

by Mingming Cao

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages


在 2008-08-20三的 07:53 -0400,Theodore Tso写道:
> On Wed, Aug 20, 2008 at 04:16:44PM +0530, Aneesh Kumar K.V wrote:
> > > mpage_da_map_blocks block allocation failed for inode 323784 at logical
> > > offset 313 with max blocks 11 with error -28
> > > This should not happen.!! Data will be lost
>
> We don't actually lose the data if free blocks are subsequently made
> available, correct?
>

Well, I thought with Aneesh's new ext4_da_invalidate patch in the patch
queue, the dirty page get invalidate if ext4_da_writepages() could not
successfully map/allocate blocks. That means we lost data:(

I have a feeling that we did not try very hard before invalidate the
dirty page which fail to map to disks. Perhaps we should try a few more
times before give up. Also in that case, perhaps we should turn off
delalloc fs wide, so the new writers won't take the subsequently made
avaible free blocks away from this unlucky delalloc da writepages.

> > I tried this patch. There are still multiple ways we can get wrong free
> > block count. The patch reduced the number of errors. So we are doing
> > better with patch. But I guess we can't use the percpu_counter based
> > free block accounting with delalloc. Without delalloc it is ok even if
> > we find some wrong free blocks count . The actual block allocation will fail in
> > that case and we handle it perfectly fine. With delalloc we cannot
> > afford to fail the block allocation. Should we look at a free block
> > accounting rewrite using simple ext4_fsblk_t and and a spin lock ?
>
> It would be a shame if we did given that the whole point of the percpu
> counter was to avoid a scalability bottleneck. Perhaps we could take
> a filesystem-level spinlock only when the number of free blocks as
> reported by the percpu_counter falls below some critical level?

Perhaps the thresh hold should b higher, but other than that, the
current ext4_has_free_blocks() code, does 1) get the freeblocks counter
2) if the counter < FBC_BATCH , it will call
percpu_counter_sum_and_set(), which will take the per-cpu-counter lock,
and do accurate accounting.

So after think again, I could not see what suggested above diffrent from
what current ext4_has_free_blocks() does?


Right now the ext4_has_free_blocks() uses the

#define FBC_BATCH (NR_CPUS*4)

as the thresh hold. I thought that was good enough as
ext4_da_reserve_space() only request 1 block at a time (called at
write_begin time), but maybe I am wrong...

Mingming

2008-08-20 21:58:34

by Mingming Cao

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages


在 2008-08-20三的 11:13 +0530,Aneesh Kumar K.V写道:
> Hi,
>
> I am getting this even with the latest patch queue. The test program is
> a modified fsstress with fallocate support.
>

Hi Aneesh,

Is this a regression with the latest patch queue? I wonder when do we
start to seeing this problem?

Also, are the failure inodes the ones with fallocated blocks?

Mingming
> mpage_da_map_blocks block allocation failed for inode 377954 at logical
> offset 313 with max blocks 4 with error -28
> mpage_da_map_blocks block allocation failed for inode 336367 at logical
> offset 74 with max blocks 9 with error -28
> mpage_da_map_blocks block allocation failed for inode 345560 at logical
> offset 542 with max blocks 7 with error -28
> This should not happen.!! Data will be lost
> mpage_da_map_blocks block allocation failed for inode 355317 at logical
> offset 152 with max blocks 10 with error -28
> This should not happen.!! Data will be lost
> mpage_da_map_blocks block allocation failed for inode 395261 at logical
> offset 462 with max blocks 1 with error -28
> This should not happen.!! Data will be lost
> mpage_da_map_blocks block allocation failed for inode 323784 at logical
> offset 313 with max blocks 11 with error -28
> This should not happen.!! Data will be lost
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2008-08-20 21:59:06

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages

On Wed, Aug 20, 2008 at 01:56:48PM -0700, Mingming Cao wrote:
> > BTW, I was looking at the percpu_counter interface, and I'm confused
> > why we have percpu_counter_sum_and_set() and percpu_counter_sum(). If
> > we're taking the fbc->lock to calculate the precise value of the
> > counter, why not simply set fbc->count?
>
> I added the percpu_count_sum_and _set() interface, when addingdelalloc
> block reservation. I agree it make sense to clean up current all the
> user of percpu_counter_sum() and replace with
> percpu_counter_sum_and_set(), just hasn't get chance to clean up yet.

Why not make percpu_counter_sum() always do sum_and_set, and change
the ext4 calls to use percpu_counter_sum()? In fact, I'm wondering
why you didn't do that in the first place? Was that your trying to be
as conservative as possible with respect to not changing things?

- Ted

2008-08-20 22:03:30

by Mingming Cao

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages


在 2008-08-20三的 17:55 -0400,Theodore Tso写道:
> On Wed, Aug 20, 2008 at 01:56:48PM -0700, Mingming Cao wrote:
> > > BTW, I was looking at the percpu_counter interface, and I'm confused
> > > why we have percpu_counter_sum_and_set() and percpu_counter_sum(). If
> > > we're taking the fbc->lock to calculate the precise value of the
> > > counter, why not simply set fbc->count?
> >
> > I added the percpu_count_sum_and _set() interface, when addingdelalloc
> > block reservation. I agree it make sense to clean up current all the
> > user of percpu_counter_sum() and replace with
> > percpu_counter_sum_and_set(), just hasn't get chance to clean up yet.
>
> Why not make percpu_counter_sum() always do sum_and_set, and change
> the ext4 calls to use percpu_counter_sum()? In fact, I'm wondering

I will clean this up.
> why you didn't do that in the first place? Was that your trying to be
> as conservative as possible with respect to not changing things?

Yeah, I think I am being over cautious, try not to impact to other user
of percpu_counter_sum()... no more excuse:)


Mingming
>
> - Ted

2008-08-20 23:22:19

by Mingming Cao

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages


在 2008-08-20三的 13:56 -0700,Mingming Cao写道:
> 在 2008-08-20三的 07:53 -0400,Theodore Tso写道:
> > On Wed, Aug 20, 2008 at 04:16:44PM +0530, Aneesh Kumar K.V wrote:
> > > > mpage_da_map_blocks block allocation failed for inode 323784 at logical
> > > > offset 313 with max blocks 11 with error -28
> > > > This should not happen.!! Data will be lost
> >
> > We don't actually lose the data if free blocks are subsequently made
> > available, correct?
> >
> > > I tried this patch. There are still multiple ways we can get wrong free
> > > block count. The patch reduced the number of errors. So we are doing
> > > better with patch. But I guess we can't use the percpu_counter based
> > > free block accounting with delalloc. Without delalloc it is ok even if
> > > we find some wrong free blocks count . The actual block allocation will fail in
> > > that case and we handle it perfectly fine. With delalloc we cannot
> > > afford to fail the block allocation. Should we look at a free block
> > > accounting rewrite using simple ext4_fsblk_t and and a spin lock ?
> >
> > It would be a shame if we did given that the whole point of the percpu
> > counter was to avoid a scalability bottleneck. Perhaps we could take
> > a filesystem-level spinlock only when the number of free blocks as
> > reported by the percpu_counter falls below some critical level?
> >
>
> Agree, and perhaps we should fall back to non-delalloc mode if the fs
> free blocks below some critical level?

How about this?

ext4: fall back to non delalloc mode if filesystem is almost full
From: Mingming Cao <[email protected]>

In the case of filesystem is close to full (free blocks is below
the watermark NRCPUS *4) and there is not enough to reserve blocks for
delayed allocation, instead of return user back with ENOSPC error, with
this patch, it tries to fall back to non delayed allocation mode.

Signed-off-by: Mingming Cao <[email protected]>
---
fs/ext4/ext4.h | 2 -
fs/ext4/inode.c | 61 ++++++++++++++++++++++++++++++++++++++++++++------------
fs/ext4/namei.c | 4 +--
3 files changed, 51 insertions(+), 16 deletions(-)

Index: linux-2.6.27-rc3/fs/ext4/inode.c
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/inode.c 2008-08-20 15:20:10.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/inode.c 2008-08-20 16:13:48.000000000 -0700
@@ -2391,6 +2391,25 @@
return ret;
}

+/*
+ * In case of filesystem is almost full and delalloc could not
+ * get enough free blocks to reserve to prevent later ENOSPC,
+ * let's fall back to the nondelalloc mode
+ */
+static int ext4_write_begin_nondelalloc(struct file *file,
+ struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata)
+{
+ struct inode *inode = mapping->host;
+
+ /* turn off delalloc for this inode*/
+ ext4_set_aops(inode, 0);
+
+ return mapping->a_ops->write_begin(file, mapping, pos, len,
+ flags, pagep, fsdata);
+}
+
static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
@@ -2435,8 +2454,14 @@
page_cache_release(page);
}

- if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
- goto retry;
+ if (ret == -ENOSPC) {
+ if (ext4_should_retry_alloc(inode->i_sb, &retries))
+ goto retry;
+ else
+ ret= ext4_write_begin_nondelalloc(file,mapping,pos,
+ len, flags, pagep,
+ fsdata);
+ }
out:
return ret;
}
@@ -3008,16 +3033,26 @@
.is_partially_uptodate = block_is_partially_uptodate,
};

-void ext4_set_aops(struct inode *inode)
+#define EXT4_MIN_FREE_BLOCKS (NR_CPUS*4)
+
+void ext4_set_aops(struct inode *inode, int delalloc)
{
- if (ext4_should_order_data(inode) &&
- test_opt(inode->i_sb, DELALLOC))
- inode->i_mapping->a_ops = &ext4_da_aops;
- else if (ext4_should_order_data(inode))
+ if (test_opt(inode->i_sb, DELALLOC)) {
+ if (ext4_has_free_blocks(EXT4_SB(inode->i_sb),
+ EXT4_MIN_FREE_BLOCKS) > EXT4_MIN_FREE_BLOCKS)
+ delalloc = 0;
+
+ if (delalloc) {
+ inode->i_mapping->a_ops = &ext4_da_aops;
+ return;
+ } else
+ printk(KERN_INFO "filesystem is close to full, "
+ "delayed allocation is turned off for "
+ " inode %lu\n", inode->i_ino);
+ }
+
+ if (ext4_should_order_data(inode))
inode->i_mapping->a_ops = &ext4_ordered_aops;
- else if (ext4_should_writeback_data(inode) &&
- test_opt(inode->i_sb, DELALLOC))
- inode->i_mapping->a_ops = &ext4_da_aops;
else if (ext4_should_writeback_data(inode))
inode->i_mapping->a_ops = &ext4_writeback_aops;
else
@@ -4011,7 +4046,7 @@
if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext4_file_inode_operations;
inode->i_fop = &ext4_file_operations;
- ext4_set_aops(inode);
+ ext4_set_aops(inode, 1);
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = &ext4_dir_inode_operations;
inode->i_fop = &ext4_dir_operations;
@@ -4020,7 +4055,7 @@
inode->i_op = &ext4_fast_symlink_inode_operations;
else {
inode->i_op = &ext4_symlink_inode_operations;
- ext4_set_aops(inode);
+ ext4_set_aops(inode, 1);
}
} else {
inode->i_op = &ext4_special_inode_operations;
@@ -4783,7 +4818,7 @@
EXT4_I(inode)->i_flags |= EXT4_JOURNAL_DATA_FL;
else
EXT4_I(inode)->i_flags &= ~EXT4_JOURNAL_DATA_FL;
- ext4_set_aops(inode);
+ ext4_set_aops(inode, 1);

jbd2_journal_unlock_updates(journal);

Index: linux-2.6.27-rc3/fs/ext4/ext4.h
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/ext4.h 2008-08-20 15:41:36.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/ext4.h 2008-08-20 15:41:56.000000000 -0700
@@ -1070,7 +1070,7 @@
extern void ext4_truncate (struct inode *);
extern void ext4_set_inode_flags(struct inode *);
extern void ext4_get_inode_flags(struct ext4_inode_info *);
-extern void ext4_set_aops(struct inode *inode);
+extern void ext4_set_aops(struct inode *inode, int delalloc);
extern int ext4_writepage_trans_blocks(struct inode *);
extern int ext4_meta_trans_blocks(struct inode *, int nrblocks, int idxblocks);
extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
Index: linux-2.6.27-rc3/fs/ext4/namei.c
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/namei.c 2008-08-20 15:42:13.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/namei.c 2008-08-20 15:42:41.000000000 -0700
@@ -1738,7 +1738,7 @@
if (!IS_ERR(inode)) {
inode->i_op = &ext4_file_inode_operations;
inode->i_fop = &ext4_file_operations;
- ext4_set_aops(inode);
+ ext4_set_aops(inode, 1);
err = ext4_add_nondir(handle, dentry, inode);
}
ext4_journal_stop(handle);
@@ -2210,7 +2210,7 @@

if (l > sizeof (EXT4_I(inode)->i_data)) {
inode->i_op = &ext4_symlink_inode_operations;
- ext4_set_aops(inode);
+ ext4_set_aops(inode, 1);
/*
* page_symlink() calls into ext4_prepare/commit_write.
* We have a transaction open. All is sweetness. It also sets

2008-08-20 23:42:22

by Andreas Dilger

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages

On Aug 20, 2008 16:22 -0700, Mingming Cao wrote:
> ext4: fall back to non delalloc mode if filesystem is almost full
> From: Mingming Cao <[email protected]>
>
> In the case of filesystem is close to full (free blocks is below
> the watermark NRCPUS *4) and there is not enough to reserve blocks for
> delayed allocation, instead of return user back with ENOSPC error, with
> this patch, it tries to fall back to non delayed allocation mode.

I don't think that making a low watermark of only 4 blocks is enough,
because each of the per-CPU counters could be off by as much as FBC_BATCH.
I think dropping delalloc support earlier is safer, something like
(FBC_BATCH * NR_CPUS).

> +static int ext4_write_begin_nondelalloc(struct file *file,
> + struct address_space *mapping,
> + loff_t pos, unsigned len, unsigned flags,
> + struct page **pagep, void **fsdata)
> +{
> + struct inode *inode = mapping->host;
> +
> + /* turn off delalloc for this inode*/
> + ext4_set_aops(inode, 0);
> +
> + return mapping->a_ops->write_begin(file, mapping, pos, len,
> + flags, pagep, fsdata);
> +}

Hmm, I don't understand this - isn't delalloc already off here, because
this is "ext4_write_begin_nondelalloc()"?

> +void ext4_set_aops(struct inode *inode, int delalloc)
> {
> + if (test_opt(inode->i_sb, DELALLOC)) {
> + if (ext4_has_free_blocks(EXT4_SB(inode->i_sb),
> + EXT4_MIN_FREE_BLOCKS) > EXT4_MIN_FREE_BLOCKS)
> + delalloc = 0;
> +
> + if (delalloc) {
> + inode->i_mapping->a_ops = &ext4_da_aops;
> + return;
> + } else
> + printk(KERN_INFO "filesystem is close to full, "
> + "delayed allocation is turned off for "
> + " inode %lu\n", inode->i_ino);
> + }

Also, if you are doing this by changing the aops on the inode, isn't
it possible that a large write starts outside the EXT4_MIN_FREE_BLOCKS
boundary and then still runs out of space without changing the aops?

Instead it is maybe better to do the check at the start of
ext4_da_write_begin() and if it fails then call the non-delalloc
write_begin from there?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-08-20 23:58:49

by Mingming Cao

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages


在 2008-08-20三的 17:42 -0600,Andreas Dilger写道:
> On Aug 20, 2008 16:22 -0700, Mingming Cao wrote:
> > ext4: fall back to non delalloc mode if filesystem is almost full
> > From: Mingming Cao <[email protected]>
> >
> > In the case of filesystem is close to full (free blocks is below
> > the watermark NRCPUS *4) and there is not enough to reserve blocks for
> > delayed allocation, instead of return user back with ENOSPC error, with
> > this patch, it tries to fall back to non delayed allocation mode.
>
> I don't think that making a low watermark of only 4 blocks is enough,
> because each of the per-CPU counters could be off by as much as FBC_BATCH.
> I think dropping delalloc support earlier is safer, something like
> (FBC_BATCH * NR_CPUS).
>
Okay, make sense.

> > +static int ext4_write_begin_nondelalloc(struct file *file,
> > + struct address_space *mapping,
> > + loff_t pos, unsigned len, unsigned flags,
> > + struct page **pagep, void **fsdata)
> > +{
> > + struct inode *inode = mapping->host;
> > +
> > + /* turn off delalloc for this inode*/
> > + ext4_set_aops(inode, 0);
> > +
> > + return mapping->a_ops->write_begin(file, mapping, pos, len,
> > + flags, pagep, fsdata);
> > +}
>
> Hmm, I don't understand this - isn't delalloc already off here, because
> this is "ext4_write_begin_nondelalloc()"?
>

This function probably should be called
ext4_wb_fall_back_to_nondelalloc(). it is called when we detect ENOSPC
and trying to fall back to non delalloc.

This function eventually will call nondelalloc write_begin function
ext4_write_begin().

> > +void ext4_set_aops(struct inode *inode, int delalloc)
> > {
> > + if (test_opt(inode->i_sb, DELALLOC)) {
> > + if (ext4_has_free_blocks(EXT4_SB(inode->i_sb),
> > + EXT4_MIN_FREE_BLOCKS) > EXT4_MIN_FREE_BLOCKS)
> > + delalloc = 0;
> > +
> > + if (delalloc) {
> > + inode->i_mapping->a_ops = &ext4_da_aops;
> > + return;
> > + } else
> > + printk(KERN_INFO "filesystem is close to full, "
> > + "delayed allocation is turned off for "
> > + " inode %lu\n", inode->i_ino);
> > + }
>
> Also, if you are doing this by changing the aops on the inode, isn't
> it possible that a large write starts outside the EXT4_MIN_FREE_BLOCKS
> boundary and then still runs out of space without changing the aops?
>

> Instead it is maybe better to do the check at the start of
> ext4_da_write_begin() and if it fails then call the non-delalloc
> write_begin from there?
>

Yeah that's better.

But I realize a problem. Actually now I think we can't fall back to
nondelalloc mode if the inode has any dirty pages in the page cache, as
those pages need delalloc aops ->ext4_da_writepages() to handle delayed
allocation writeout..


> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>

2008-08-21 01:44:30

by Andreas Dilger

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages

On Aug 20, 2008 16:58 -0700, Mingming Cao wrote:
> > Also, if you are doing this by changing the aops on the inode, isn't
> > it possible that a large write starts outside the EXT4_MIN_FREE_BLOCKS
> > boundary and then still runs out of space without changing the aops?
>
> > Instead it is maybe better to do the check at the start of
> > ext4_da_write_begin() and if it fails then call the non-delalloc
> > write_begin from there?
>
> Yeah that's better.
>
> But I realize a problem. Actually now I think we can't fall back to
> nondelalloc mode if the inode has any dirty pages in the page cache, as
> those pages need delalloc aops ->ext4_da_writepages() to handle delayed
> allocation writeout..

That is only if you are changing the aops, which I already don't think
is a good idea. Instead, as I suggest there should be a check in the
ext4_da_write_begin() that makes the decision to delay the allocation
or to actually do it, based on the current free blocks.

Instead of doing this check repeatedly (if we think it is expensive, maybe
it isn't), it might make sense to set a superblock flag that indicates
"no more delalloc, too close to ENOSPC" and this is cleared when blocks
are freed beyond the low watermark.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-08-21 05:11:53

by Eric Sandeen

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages

Aneesh Kumar K.V wrote:
> Hi,
>
> I am getting this even with the latest patch queue. The test program is
> a modified fsstress with fallocate support.
>
> mpage_da_map_blocks block allocation failed for inode 377954 at logical
> offset 313 with max blocks 4 with error -28
> mpage_da_map_blocks block allocation failed for inode 336367 at logical
> offset 74 with max blocks 9 with error -28
> mpage_da_map_blocks block allocation failed for inode 345560 at logical
> offset 542 with max blocks 7 with error -28
> This should not happen.!! Data will be lost
> mpage_da_map_blocks block allocation failed for inode 355317 at logical
> offset 152 with max blocks 10 with error -28
> This should not happen.!! Data will be lost
> mpage_da_map_blocks block allocation failed for inode 395261 at logical
> offset 462 with max blocks 1 with error -28
> This should not happen.!! Data will be lost
> mpage_da_map_blocks block allocation failed for inode 323784 at logical
> offset 313 with max blocks 11 with error -28
> This should not happen.!! Data will be lost

So, I really need to un-bury myself and read this thread more carefully,
but: It may be interesting to look at what XFS did for this sort of
problem a couple of years ago.

-Eric


From: David Chinner <[email protected]>
Date: Tue, 14 Mar 2006 02:13:09 +0000 (+1100)
Subject: [XFS] On machines with more than 8 cpus, when running parallel I/O
X-Git-Tag: v2.6.17-rc1~1000^2~69
X-Git-Url: X-Git-Url:
http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=8d280b98cfe3c0b69c37d355218975c1c0279bb0

[XFS] On machines with more than 8 cpus, when running parallel I/O
threads, the incore superblock lock becomes the limiting factor for
buffered write throughput. Make the contended fields in the incore
superblock use per-cpu counters so that there is no global lock to limit
scalability.

SGI-PV: 946630
SGI-Modid: xfs-linux-melb:xfs-kern:25106a

Signed-off-by: David Chinner <[email protected]>
Signed-off-by: Nathan Scott <[email protected]>

-Eric

2008-08-21 15:10:18

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages

On Wed, Aug 20, 2008 at 02:58:17PM -0700, Mingming Cao wrote:
>
> 在 2008-08-20三的 11:13 +0530,Aneesh Kumar K.V写道:
> > Hi,
> >
> > I am getting this even with the latest patch queue. The test program is
> > a modified fsstress with fallocate support.
> >
>
> Hi Aneesh,
>
> Is this a regression with the latest patch queue? I wonder when do we
> start to seeing this problem?

We started logging the error recently. So the problem is not recently
introduced.

>
> Also, are the failure inodes the ones with fallocated blocks?

No i get the problem even without fallocate

-aneesh

2008-08-21 15:13:26

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages

On Wed, Aug 20, 2008 at 07:53:31AM -0400, Theodore Tso wrote:
> On Wed, Aug 20, 2008 at 04:16:44PM +0530, Aneesh Kumar K.V wrote:
> > > mpage_da_map_blocks block allocation failed for inode 323784 at logical
> > > offset 313 with max blocks 11 with error -28
> > > This should not happen.!! Data will be lost
>
> We don't actually lose the data if free blocks are subsequently made
> available, correct?
>
> > I tried this patch. There are still multiple ways we can get wrong free
> > block count. The patch reduced the number of errors. So we are doing
> > better with patch. But I guess we can't use the percpu_counter based
> > free block accounting with delalloc. Without delalloc it is ok even if
> > we find some wrong free blocks count . The actual block allocation will fail in
> > that case and we handle it perfectly fine. With delalloc we cannot
> > afford to fail the block allocation. Should we look at a free block
> > accounting rewrite using simple ext4_fsblk_t and and a spin lock ?
>
> It would be a shame if we did given that the whole point of the percpu
> counter was to avoid a scalability bottleneck. Perhaps we could take
> a filesystem-level spinlock only when the number of free blocks as
> reported by the percpu_counter falls below some critical level?
>
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -1543,7 +1543,14 @@ static int ext4_da_reserve_space(struct inode *inode, int nrblocks)
> > }
> > /* reduce fs free blocks counter */
> > percpu_counter_sub(&sbi->s_freeblocks_counter, total);
> > -
> > + /*
> > + * Now check whether the block count has gone negative.
> > + * Some other CPU could have reserved blocks in between
> > + */
> > + if (percpu_counter_read(&sbi->s_freeblocks_counter) < 0) {
> > + spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> > + return -ENOSPC;
> > + }
>
>
> I think you want to do the check before calling percpu_counter_sub();
> otherwise when you return ENOSPC the free blocks counter ends up
> getting reduced (and gets left negative).
>
> Also, this is one of the places where it might help if we did
> something like:
>
> freeblocks = percpu_counter_read(&sbi->s_freeblocks_counter);
> if (freeblocks < NR_CPUS*4)
> freeblocks = percpu_counter_sum(&sbi->s_freeblocks_counter);
>
> if (freeblocks < total) {
> spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> return -ENOSPC;
> }
>
> BTW, I was looking at the percpu_counter interface, and I'm confused
> why we have percpu_counter_sum_and_set() and percpu_counter_sum(). If
> we're taking the fbc->lock to calculate the precise value of the
> counter, why not simply set fbc->count?
>
> Also, it is singularly unfortunate that certain interfaces, such as
> percpu_counter_sum_and_set() only exist for CONFIG_SMP. This is
> definitely post-2.6.27, but it seems to me that we probably want
> something like percpu_counter_compare_lt() which does something like this:
>
> static inline int percpu_counter_compare_lt(struct percpu_counter *fbc,
> s64 amount)
> {
> #ifdef CONFIG_SMP
> if ((fbc->count - amount) < FBC_BATCH)
> percpu_counter_sum_and_set(fbc);
> #endif
> return (fbc->count < amount);
> }
>
> ... which we would then use in ext4_has_free_blocks() and
> ext4_da_reserve_space().
>

This is what i am testing now. It is much better. I am getting few
errors now. But I guess that may be due to our meta-data block
reservation going wrong. Still debugging.

-aneesh

commit be0a76f17f45b1009c40b2adb7d95f93dfdbb95a
Author: Aneesh Kumar K.V <[email protected]>
Date: Thu Aug 21 15:56:19 2008 +0530

ENOSPC handling

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index dfe2d4f..5d0a676 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -1602,6 +1602,64 @@ ext4_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
return ret;
}

+int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
+ ext4_fsblk_t nblocks)
+{
+ int cpu;
+ s64 free_blocks;
+ ext4_fsblk_t root_blocks = 0;
+ struct percpu_counter *fbc = &sbi->s_freeblocks_counter;
+
+ free_blocks = percpu_counter_read(fbc);
+
+ if (!capable(CAP_SYS_RESOURCE) &&
+ sbi->s_resuid != current->fsuid &&
+ (sbi->s_resgid == 0 || !in_group_p(sbi->s_resgid)))
+ root_blocks = ext4_r_blocks_count(sbi->s_es);
+#ifdef CONFIG_SMP
+ /* Each CPU can accumulate FBC_BATCH blocks in their local
+ * counters. So we need to make sure we have free blocks more
+ * than FBC_BATCH * nr_cpu_ids. Also add a window of 4 times.
+ */
+ if (free_blocks - (nblocks + root_blocks) <
+ (4 * (FBC_BATCH * nr_cpu_ids))) {
+ /*
+ * We need to sum and claim under lock
+ * This is the slow patch which will be
+ * taken when we are very low on free blocks
+ */
+ spin_lock(&fbc->lock);
+ free_blocks = fbc->count;
+ for_each_online_cpu(cpu) {
+ s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+ free_blocks += *pcount;
+ *pcount = 0;
+ }
+ fbc->count = free_blocks;
+ if (free_blocks <= root_blocks) {
+ /* we don't have free space */
+ spin_unlock(&fbc->lock);
+ return -ENOSPC;
+ }
+ if (free_blocks - root_blocks < nblocks) {
+ spin_unlock(&fbc->lock);
+ return -ENOSPC;
+ }
+ fbc->count -= nblocks;
+ spin_unlock(&fbc->lock);
+ return 0;
+ }
+#endif
+ if (free_blocks <= root_blocks)
+ /* we don't have free space */
+ return -ENOSPC;
+ if (free_blocks - root_blocks < nblocks)
+ return -ENOSPC;
+ /* reduce fs free blocks counter */
+ percpu_counter_sub(fbc, nblocks);
+ return 0;
+}
+
/**
* ext4_has_free_blocks()
* @sbi: in-core super block structure.
@@ -1624,9 +1682,15 @@ ext4_fsblk_t ext4_has_free_blocks(struct ext4_sb_info *sbi,
(sbi->s_resgid == 0 || !in_group_p(sbi->s_resgid)))
root_blocks = ext4_r_blocks_count(sbi->s_es);
#ifdef CONFIG_SMP
- if (free_blocks - root_blocks < FBC_BATCH)
+ /* Each CPU can accumulate FBC_BATCH blocks in their local
+ * counters. So we need to make sure we have free blocks more
+ * than FBC_BATCH * nr_cpu_ids. Also add a window of 4 times.
+ */
+ if (free_blocks - (nblocks + root_blocks) <
+ (4 * (FBC_BATCH * nr_cpu_ids))) {
free_blocks =
percpu_counter_sum_and_set(&sbi->s_freeblocks_counter);
+ }
#endif
if (free_blocks <= root_blocks)
/* we don't have free space */
@@ -1634,7 +1698,7 @@ ext4_fsblk_t ext4_has_free_blocks(struct ext4_sb_info *sbi,
if (free_blocks - root_blocks < nblocks)
return free_blocks - root_blocks;
return nblocks;
- }
+}


/**
@@ -1713,14 +1777,11 @@ ext4_fsblk_t ext4_old_new_blocks(handle_t *handle, struct inode *inode,
/*
* With delalloc we already reserved the blocks
*/
- *count = ext4_has_free_blocks(sbi, *count);
- }
- if (*count == 0) {
- *errp = -ENOSPC;
- return 0; /*return with ENOSPC error */
+ if (ext4_claim_free_blocks(sbi, *count)) {
+ *errp = -ENOSPC;
+ return 0; /*return with ENOSPC error */
+ }
}
- num = *count;
-
/*
* Check quota for allocation of this block.
*/
@@ -1915,9 +1976,13 @@ ext4_fsblk_t ext4_old_new_blocks(handle_t *handle, struct inode *inode,
le16_add_cpu(&gdp->bg_free_blocks_count, -num);
gdp->bg_checksum = ext4_group_desc_csum(sbi, group_no, gdp);
spin_unlock(sb_bgl_lock(sbi, group_no));
- if (!EXT4_I(inode)->i_delalloc_reserved_flag)
- percpu_counter_sub(&sbi->s_freeblocks_counter, num);
-
+ if (!EXT4_I(inode)->i_delalloc_reserved_flag && (*count != num)) {
+ /*
+ * we allocated less blocks than we
+ * claimed. Add the difference back.
+ */
+ percpu_counter_add(&sbi->s_freeblocks_counter, *count - num);
+ }
if (sbi->s_log_groups_per_flex) {
ext4_group_t flex_group = ext4_flex_group(sbi, group_no);
spin_lock(sb_bgl_lock(sbi, flex_group));
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7f11b25..3738039 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1047,6 +1047,8 @@ extern ext4_fsblk_t ext4_new_blocks(handle_t *handle, struct inode *inode,
unsigned long *count, int *errp);
extern ext4_fsblk_t ext4_old_new_blocks(handle_t *handle, struct inode *inode,
ext4_fsblk_t goal, unsigned long *count, int *errp);
+extern int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
+ ext4_fsblk_t nblocks);
extern ext4_fsblk_t ext4_has_free_blocks(struct ext4_sb_info *sbi,
ext4_fsblk_t nblocks);
extern void ext4_free_blocks (handle_t *handle, struct inode *inode,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1c289c1..d965a05 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1537,13 +1537,10 @@ static int ext4_da_reserve_space(struct inode *inode, int nrblocks)
md_needed = mdblocks - EXT4_I(inode)->i_reserved_meta_blocks;
total = md_needed + nrblocks;

- if (ext4_has_free_blocks(sbi, total) < total) {
+ if (ext4_claim_free_blocks(sbi, total)) {
spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
return -ENOSPC;
}
- /* reduce fs free blocks counter */
- percpu_counter_sub(&sbi->s_freeblocks_counter, total);
-
EXT4_I(inode)->i_reserved_data_blocks += nrblocks;
EXT4_I(inode)->i_reserved_meta_blocks = mdblocks;

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5267efc..7d94119 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2977,9 +2977,15 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
* at write_begin() time for delayed allocation
* do not double accounting
*/
- if (!(ac->ac_flags & EXT4_MB_DELALLOC_RESERVED))
- percpu_counter_sub(&sbi->s_freeblocks_counter,
- ac->ac_b_ex.fe_len);
+ if (!(ac->ac_flags & EXT4_MB_DELALLOC_RESERVED) &&
+ ac->ac_o_ex.fe_len != ac->ac_b_ex.fe_len) {
+ /*
+ * we allocated less blocks than we calimed
+ * Add the difference back
+ */
+ percpu_counter_add(&sbi->s_freeblocks_counter,
+ ac->ac_o_ex.fe_len -ac->ac_b_ex.fe_len);
+ }

if (sbi->s_log_groups_per_flex) {
ext4_group_t flex_group = ext4_flex_group(sbi,
@@ -4392,14 +4398,11 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
/*
* With delalloc we already reserved the blocks
*/
- ar->len = ext4_has_free_blocks(sbi, ar->len);
- }
-
- if (ar->len == 0) {
- *errp = -ENOSPC;
- return 0;
+ if (ext4_claim_free_blocks(sbi, ar->len)) {
+ *errp = -ENOSPC;
+ return 0;
+ }
}

2008-08-21 15:15:59

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages

On Wed, Aug 20, 2008 at 02:35:35PM -0700, Mingming Cao wrote:
>
> 在 2008-08-20三的 23:57 +0530,Aneesh Kumar K.V写道:
> > On Wed, Aug 20, 2008 at 07:53:31AM -0400, Theodore Tso wrote:
> > > On Wed, Aug 20, 2008 at 04:16:44PM +0530, Aneesh Kumar K.V wrote:
> > > > > mpage_da_map_blocks block allocation failed for inode 323784 at logical
> > > > > offset 313 with max blocks 11 with error -28
> > > > > This should not happen.!! Data will be lost
> > >
> > > We don't actually lose the data if free blocks are subsequently made
> > > available, correct?
> > >
> > > > I tried this patch. There are still multiple ways we can get wrong free
> > > > block count. The patch reduced the number of errors. So we are doing
> > > > better with patch. But I guess we can't use the percpu_counter based
> > > > free block accounting with delalloc. Without delalloc it is ok even if
> > > > we find some wrong free blocks count . The actual block allocation will fail in
> > > > that case and we handle it perfectly fine. With delalloc we cannot
> > > > afford to fail the block allocation. Should we look at a free block
> > > > accounting rewrite using simple ext4_fsblk_t and and a spin lock ?
> > >
> > > It would be a shame if we did given that the whole point of the percpu
> > > counter was to avoid a scalability bottleneck. Perhaps we could take
> > > a filesystem-level spinlock only when the number of free blocks as
> > > reported by the percpu_counter falls below some critical level?
> > >
> > > > --- a/fs/ext4/inode.c
> > > > +++ b/fs/ext4/inode.c
> > > > @@ -1543,7 +1543,14 @@ static int ext4_da_reserve_space(struct inode *inode, int nrblocks)
> > > > }
> > > > /* reduce fs free blocks counter */
> > > > percpu_counter_sub(&sbi->s_freeblocks_counter, total);
> > > > -
> > > > + /*
> > > > + * Now check whether the block count has gone negative.
> > > > + * Some other CPU could have reserved blocks in between
> > > > + */
> > > > + if (percpu_counter_read(&sbi->s_freeblocks_counter) < 0) {
> > > > + spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> > > > + return -ENOSPC;
> > > > + }
> > >
> > >
> > > I think you want to do the check before calling percpu_counter_sub();
> > > otherwise when you return ENOSPC the free blocks counter ends up
> > > getting reduced (and gets left negative).
> > >
> > > Also, this is one of the places where it might help if we did
> > > something like:
> > >
> > > freeblocks = percpu_counter_read(&sbi->s_freeblocks_counter);
> > > if (freeblocks < NR_CPUS*4)
> > > freeblocks = percpu_counter_sum(&sbi->s_freeblocks_counter);
> > >
> > > if (freeblocks < total) {
> > > spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> > > return -ENOSPC;
> > > }
> > >
> > > BTW, I was looking at the percpu_counter interface, and I'm confused
> > > why we have percpu_counter_sum_and_set() and percpu_counter_sum(). If
> > > we're taking the fbc->lock to calculate the precise value of the
> > > counter, why not simply set fbc->count?
> > >
> > > Also, it is singularly unfortunate that certain interfaces, such as
> > > percpu_counter_sum_and_set() only exist for CONFIG_SMP. This is
> > > definitely post-2.6.27, but it seems to me that we probably want
> > > something like percpu_counter_compare_lt() which does something like this:
> > >
> > > static inline int percpu_counter_compare_lt(struct percpu_counter *fbc,
> > > s64 amount)
> > > {
> > > #ifdef CONFIG_SMP
> > > if ((fbc->count - amount) < FBC_BATCH)
> > > percpu_counter_sum_and_set(fbc);
> > > #endif
> > > return (fbc->count < amount);
> > > }
> > >
> > > ... which we would then use in ext4_has_free_blocks() and
> > > ext4_da_reserve_space().
> > >
> >
> > Let's say FBC_BATCH = 64 and fbc->count = 100 and we have four cpus and
> > each cpu request for 30 blocks. each CPU does
> >
>
> But, ext4_da_reserve_space() is called at the prepare_write/write_begin
> time for each page to write, so at most per cpu would request 1 block at
> a time, it is not possible to request reserve 30 blocks at a time.
>
> > in ext4_has_free_blocks:
> > free_blocks - nblocks = 100 - 30 = 70 and is > FBC_BATCH So we don't do
>
> free_blocks is not necessary 100,
>
> free_blocks is percpu_counter_read_positive(), which reads the local cpu
> counter. In your example, if the global counter is 100, but the local
> cpu counter is 0, then you will get free_blocks = 0 here. nblocks = 1,
> then you will get
>
> free_blocks - nblocks = 0-1 =-1, which will call
> percpu_counter_sum_and_set() to get more accurate value.
>
> > percpu_counter_sum_and_set
> > That means ext4_has_free_blocks return success
> >
> > Now while claiming blocks we do
> > __percpu_counter_add(fbc, 30, 64)
> >
> > here 30 < 64. That means we don't do fbc->count += count.
> > so fbc->count remains as 100 and we have 4 cpu successfully
> > allocating 30 blocks which means we have to satisfy 120 blocks.
> >
>
> The situation you described here could happen, but really rare and
> should happen at the case fs is really full. The total number of global
> free blocks have to be less than total number of CPU, and there are
> multiple threads write/allocate on each cpu.
>

Current code also get it wrong with a parallel directIO and fallocate.

-aneesh

2008-08-21 15:18:30

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages

On Wed, Aug 20, 2008 at 02:55:25PM -0700, Mingming Cao wrote:
>
> 在 2008-08-20三的 07:53 -0400,Theodore Tso写道:
> > On Wed, Aug 20, 2008 at 04:16:44PM +0530, Aneesh Kumar K.V wrote:
> > > > mpage_da_map_blocks block allocation failed for inode 323784 at logical
> > > > offset 313 with max blocks 11 with error -28
> > > > This should not happen.!! Data will be lost
> >
> > We don't actually lose the data if free blocks are subsequently made
> > available, correct?
> >
>
> Well, I thought with Aneesh's new ext4_da_invalidate patch in the patch
> queue, the dirty page get invalidate if ext4_da_writepages() could not
> successfully map/allocate blocks. That means we lost data:(
>
> I have a feeling that we did not try very hard before invalidate the
> dirty page which fail to map to disks. Perhaps we should try a few more
> times before give up. Also in that case, perhaps we should turn off
> delalloc fs wide, so the new writers won't take the subsequently made
> avaible free blocks away from this unlucky delalloc da writepages.

How do we try hard ? The mballoc already try had to allocate blocks. So I
am not sure what do we achieve by requesting for block allocation again.


>
> > > I tried this patch. There are still multiple ways we can get wrong free
> > > block count. The patch reduced the number of errors. So we are doing
> > > better with patch. But I guess we can't use the percpu_counter based
> > > free block accounting with delalloc. Without delalloc it is ok even if
> > > we find some wrong free blocks count . The actual block allocation will fail in
> > > that case and we handle it perfectly fine. With delalloc we cannot
> > > afford to fail the block allocation. Should we look at a free block
> > > accounting rewrite using simple ext4_fsblk_t and and a spin lock ?
> >
> > It would be a shame if we did given that the whole point of the percpu
> > counter was to avoid a scalability bottleneck. Perhaps we could take
> > a filesystem-level spinlock only when the number of free blocks as
> > reported by the percpu_counter falls below some critical level?
>
> Perhaps the thresh hold should b higher, but other than that, the
> current ext4_has_free_blocks() code, does 1) get the freeblocks counter
> 2) if the counter < FBC_BATCH , it will call
> percpu_counter_sum_and_set(), which will take the per-cpu-counter lock,
> and do accurate accounting.
>
> So after think again, I could not see what suggested above diffrent from
> what current ext4_has_free_blocks() does?
>
>
> Right now the ext4_has_free_blocks() uses the
>
> #define FBC_BATCH (NR_CPUS*4)
>
> as the thresh hold. I thought that was good enough as
> ext4_da_reserve_space() only request 1 block at a time (called at
> write_begin time), but maybe I am wrong...
>

I have right now threshold check as below.

+ /* Each CPU can accumulate FBC_BATCH blocks in their local
+ * counters. So we need to make sure we have free blocks more
+ * than FBC_BATCH * nr_cpu_ids. Also add a window of 4 times.
+ */
+ if (free_blocks - (nblocks + root_blocks) <
+ (4 * (FBC_BATCH * nr_cpu_ids)))
{

-aneesh

2008-08-21 15:36:29

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages

On Thu, Aug 21, 2008 at 08:48:15PM +0530, Aneesh Kumar K.V wrote:
> > I have a feeling that we did not try very hard before invalidate the
> > dirty page which fail to map to disks. Perhaps we should try a few more
> > times before give up. Also in that case, perhaps we should turn off
> > delalloc fs wide, so the new writers won't take the subsequently made
> > avaible free blocks away from this unlucky delalloc da writepages.
>
> How do we try hard ? The mballoc already try had to allocate blocks. So I
> am not sure what do we achieve by requesting for block allocation again.

So here's the problem that we face. If we have a situation where the
disk fills temporarily, but then subsequently space gets freed up, it
would be preferable if the dirty page isn't invalidated, and so
periodically (or perhaps via "there's-free-space-now notifier") we
retry the delayed allocation so we don't lose data during a transient
disk full situation. But at the same time, we don't want an fsync()
on the entire filesystem, or a umount on the filesystem, to hang
forever.

- Ted

2008-08-21 16:46:01

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages

On Wed, Aug 20, 2008 at 11:13:39AM +0530, Aneesh Kumar K.V wrote:
> Hi,
>
> I am getting this even with the latest patch queue. The test program is
> a modified fsstress with fallocate support.
>
> mpage_da_map_blocks block allocation failed for inode 377954 at logical
> offset 313 with max blocks 4 with error -28
> mpage_da_map_blocks block allocation failed for inode 336367 at logical
> offset 74 with max blocks 9 with error -28
> mpage_da_map_blocks block allocation failed for inode 345560 at logical
> offset 542 with max blocks 7 with error -28
> This should not happen.!! Data will be lost
> mpage_da_map_blocks block allocation failed for inode 355317 at logical
> offset 152 with max blocks 10 with error -28
> This should not happen.!! Data will be lost
> mpage_da_map_blocks block allocation failed for inode 395261 at logical
> offset 462 with max blocks 1 with error -28
> This should not happen.!! Data will be lost
> mpage_da_map_blocks block allocation failed for inode 323784 at logical
> offset 313 with max blocks 11 with error -28
> This should not happen.!! Data will be lost
>

With this patch i am not seeing error. It does the below

a) use ext4_claim_free_blocks that also update the free blocks count
b) Later after block allocation update the free blocks count if we
allocated less with non-delayed mode
c) Switch to non delay mode if we are low on free blocks.

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index dfe2d4f..5d0a676 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -1602,6 +1602,64 @@ ext4_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
return ret;
}

+int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
+ ext4_fsblk_t nblocks)
+{
+ int cpu;
+ s64 free_blocks;
+ ext4_fsblk_t root_blocks = 0;
+ struct percpu_counter *fbc = &sbi->s_freeblocks_counter;
+
+ free_blocks = percpu_counter_read(fbc);
+
+ if (!capable(CAP_SYS_RESOURCE) &&
+ sbi->s_resuid != current->fsuid &&
+ (sbi->s_resgid == 0 || !in_group_p(sbi->s_resgid)))
+ root_blocks = ext4_r_blocks_count(sbi->s_es);
+#ifdef CONFIG_SMP
+ /* Each CPU can accumulate FBC_BATCH blocks in their local
+ * counters. So we need to make sure we have free blocks more
+ * than FBC_BATCH * nr_cpu_ids. Also add a window of 4 times.
+ */
+ if (free_blocks - (nblocks + root_blocks) <
+ (4 * (FBC_BATCH * nr_cpu_ids))) {
+ /*
+ * We need to sum and claim under lock
+ * This is the slow patch which will be
+ * taken when we are very low on free blocks
+ */
+ spin_lock(&fbc->lock);
+ free_blocks = fbc->count;
+ for_each_online_cpu(cpu) {
+ s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+ free_blocks += *pcount;
+ *pcount = 0;
+ }
+ fbc->count = free_blocks;
+ if (free_blocks <= root_blocks) {
+ /* we don't have free space */
+ spin_unlock(&fbc->lock);
+ return -ENOSPC;
+ }
+ if (free_blocks - root_blocks < nblocks) {
+ spin_unlock(&fbc->lock);
+ return -ENOSPC;
+ }
+ fbc->count -= nblocks;
+ spin_unlock(&fbc->lock);
+ return 0;
+ }
+#endif
+ if (free_blocks <= root_blocks)
+ /* we don't have free space */
+ return -ENOSPC;
+ if (free_blocks - root_blocks < nblocks)
+ return -ENOSPC;
+ /* reduce fs free blocks counter */
+ percpu_counter_sub(fbc, nblocks);
+ return 0;
+}
+
/**
* ext4_has_free_blocks()
* @sbi: in-core super block structure.
@@ -1624,9 +1682,15 @@ ext4_fsblk_t ext4_has_free_blocks(struct ext4_sb_info *sbi,
(sbi->s_resgid == 0 || !in_group_p(sbi->s_resgid)))
root_blocks = ext4_r_blocks_count(sbi->s_es);
#ifdef CONFIG_SMP
- if (free_blocks - root_blocks < FBC_BATCH)
+ /* Each CPU can accumulate FBC_BATCH blocks in their local
+ * counters. So we need to make sure we have free blocks more
+ * than FBC_BATCH * nr_cpu_ids. Also add a window of 4 times.
+ */
+ if (free_blocks - (nblocks + root_blocks) <
+ (4 * (FBC_BATCH * nr_cpu_ids))) {
free_blocks =
percpu_counter_sum_and_set(&sbi->s_freeblocks_counter);
+ }
#endif
if (free_blocks <= root_blocks)
/* we don't have free space */
@@ -1634,7 +1698,7 @@ ext4_fsblk_t ext4_has_free_blocks(struct ext4_sb_info *sbi,
if (free_blocks - root_blocks < nblocks)
return free_blocks - root_blocks;
return nblocks;
- }
+}


/**
@@ -1713,14 +1777,11 @@ ext4_fsblk_t ext4_old_new_blocks(handle_t *handle, struct inode *inode,
/*
* With delalloc we already reserved the blocks
*/
- *count = ext4_has_free_blocks(sbi, *count);
- }
- if (*count == 0) {
- *errp = -ENOSPC;
- return 0; /*return with ENOSPC error */
+ if (ext4_claim_free_blocks(sbi, *count)) {
+ *errp = -ENOSPC;
+ return 0; /*return with ENOSPC error */
+ }
}
- num = *count;
-
/*
* Check quota for allocation of this block.
*/
@@ -1915,9 +1976,13 @@ ext4_fsblk_t ext4_old_new_blocks(handle_t *handle, struct inode *inode,
le16_add_cpu(&gdp->bg_free_blocks_count, -num);
gdp->bg_checksum = ext4_group_desc_csum(sbi, group_no, gdp);
spin_unlock(sb_bgl_lock(sbi, group_no));
- if (!EXT4_I(inode)->i_delalloc_reserved_flag)
- percpu_counter_sub(&sbi->s_freeblocks_counter, num);
-
+ if (!EXT4_I(inode)->i_delalloc_reserved_flag && (*count != num)) {
+ /*
+ * we allocated less blocks than we
+ * claimed. Add the difference back.
+ */
+ percpu_counter_add(&sbi->s_freeblocks_counter, *count - num);
+ }
if (sbi->s_log_groups_per_flex) {
ext4_group_t flex_group = ext4_flex_group(sbi, group_no);
spin_lock(sb_bgl_lock(sbi, flex_group));
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7f11b25..3738039 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1047,6 +1047,8 @@ extern ext4_fsblk_t ext4_new_blocks(handle_t *handle, struct inode *inode,
unsigned long *count, int *errp);
extern ext4_fsblk_t ext4_old_new_blocks(handle_t *handle, struct inode *inode,
ext4_fsblk_t goal, unsigned long *count, int *errp);
+extern int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
+ ext4_fsblk_t nblocks);
extern ext4_fsblk_t ext4_has_free_blocks(struct ext4_sb_info *sbi,
ext4_fsblk_t nblocks);
extern void ext4_free_blocks (handle_t *handle, struct inode *inode,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index bf612a7..52902cc 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2815,6 +2815,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
path, iblock,
max_blocks);
if (ret <= 0) {
+ printk(KERN_CRIT "fallocate conversion failed %d\n", ret);
err = ret;
goto out2;
} else
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1c289c1..087abca 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1030,11 +1030,17 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used)
BUG_ON(mdb > EXT4_I(inode)->i_reserved_meta_blocks);
mdb_free = EXT4_I(inode)->i_reserved_meta_blocks - mdb;

- /* Account for allocated meta_blocks */
- mdb_free -= EXT4_I(inode)->i_allocated_meta_blocks;
+ if (mdb_free) {
+ /* Account for allocated meta_blocks */
+ mdb_free -= EXT4_I(inode)->i_allocated_meta_blocks;

- /* update fs free blocks counter for truncate case */
- percpu_counter_add(&sbi->s_freeblocks_counter, mdb_free);
+ /*
+ * We have reserved more blocks.
+ * Now free the extra blocks reserved
+ */
+ percpu_counter_add(&sbi->s_freeblocks_counter, mdb_free);
+ EXT4_I(inode)->i_allocated_meta_blocks = 0;
+ }

/* update per-inode reservations */
BUG_ON(used > EXT4_I(inode)->i_reserved_data_blocks);
@@ -1042,7 +1048,6 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used)

BUG_ON(mdb > EXT4_I(inode)->i_reserved_meta_blocks);
EXT4_I(inode)->i_reserved_meta_blocks = mdb;
- EXT4_I(inode)->i_allocated_meta_blocks = 0;
spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
}

@@ -1537,13 +1542,10 @@ static int ext4_da_reserve_space(struct inode *inode, int nrblocks)
md_needed = mdblocks - EXT4_I(inode)->i_reserved_meta_blocks;
total = md_needed + nrblocks;

- if (ext4_has_free_blocks(sbi, total) < total) {
+ if (ext4_claim_free_blocks(sbi, total)) {
spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
return -ENOSPC;
}
- /* reduce fs free blocks counter */
- percpu_counter_sub(&sbi->s_freeblocks_counter, total);
-
EXT4_I(inode)->i_reserved_data_blocks += nrblocks;
EXT4_I(inode)->i_reserved_meta_blocks = mdblocks;

@@ -2462,11 +2464,21 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
unsigned from, to;
struct inode *inode = mapping->host;
handle_t *handle;
+ s64 free_blocks;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);

index = pos >> PAGE_CACHE_SHIFT;
from = pos & (PAGE_CACHE_SIZE - 1);
to = from + len;

+ free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
+ if (free_blocks < (4 * (FBC_BATCH * nr_cpu_ids))) {
+ /* switch to non delalloc mode */
+ *fsdata = (void *)1;
+ return ext4_write_begin(file, mapping, pos,
+ len, flags, pagep, fsdata);
+ }
+ *fsdata = (void *)0;
retry:
/*
* With delayed allocation, we don't log the i_disksize update
@@ -2535,6 +2547,19 @@ static int ext4_da_write_end(struct file *file,
handle_t *handle = ext4_journal_current_handle();
loff_t new_i_size;
unsigned long start, end;
+ int low_free_blocks = (int)fsdata;
+
+ if (low_free_blocks) {
+ if (ext4_should_order_data(inode)) {
+ return ext4_ordered_write_end(file, mapping, pos,
+ len, copied, page, fsdata);
+ } else if (ext4_should_writeback_data(inode)) {
+ return ext4_writeback_write_end(file, mapping, pos,
+ len, copied, page, fsdata);
+ } else {
+ BUG();
+ }
+ }

start = pos & (PAGE_CACHE_SIZE - 1);
end = start + copied -1;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 96319b2..7d94119 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2977,9 +2977,15 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
* at write_begin() time for delayed allocation
* do not double accounting
*/
- if (!(ac->ac_flags & EXT4_MB_DELALLOC_RESERVED))
- percpu_counter_sub(&sbi->s_freeblocks_counter,
- ac->ac_b_ex.fe_len);
+ if (!(ac->ac_flags & EXT4_MB_DELALLOC_RESERVED) &&
+ ac->ac_o_ex.fe_len != ac->ac_b_ex.fe_len) {
+ /*
+ * we allocated less blocks than we calimed
+ * Add the difference back
+ */
+ percpu_counter_add(&sbi->s_freeblocks_counter,
+ ac->ac_o_ex.fe_len -ac->ac_b_ex.fe_len);
+ }

if (sbi->s_log_groups_per_flex) {
ext4_group_t flex_group = ext4_flex_group(sbi,
@@ -4097,7 +4103,8 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
* per cpu locality group is to reduce the contention between block
* request from multiple CPUs.
*/
- ac->ac_lg = per_cpu_ptr(sbi->s_locality_groups, smp_processor_id());
+ ac->ac_lg = per_cpu_ptr(sbi->s_locality_groups, get_cpu());
+ put_cpu();

/* we're going to use group allocation */
ac->ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;
@@ -4391,14 +4398,11 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
/*
* With delalloc we already reserved the blocks
*/
- ar->len = ext4_has_free_blocks(sbi, ar->len);
- }
-
- if (ar->len == 0) {
- *errp = -ENOSPC;
- return 0;
+ if (ext4_claim_free_blocks(sbi, ar->len)) {
+ *errp = -ENOSPC;
+ return 0;
+ }
}

2008-08-21 16:56:39

by Mingming Cao

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages


在 2008-08-21四的 20:42 +0530,Aneesh Kumar K.V写道:

> This is what i am testing now. It is much better. I am getting few
> errors now. But I guess that may be due to our meta-data block
> reservation going wrong. Still debugging.
>
> -aneesh
>
> commit be0a76f17f45b1009c40b2adb7d95f93dfdbb95a
> Author: Aneesh Kumar K.V <[email protected]>
> Date: Thu Aug 21 15:56:19 2008 +0530
>
> ENOSPC handling
>
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index dfe2d4f..5d0a676 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -1602,6 +1602,64 @@ ext4_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
> return ret;
> }
>
> +int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
> + ext4_fsblk_t nblocks)
> +{
> + int cpu;
> + s64 free_blocks;
> + ext4_fsblk_t root_blocks = 0;
> + struct percpu_counter *fbc = &sbi->s_freeblocks_counter;
> +
> + free_blocks = percpu_counter_read(fbc);
> +
> + if (!capable(CAP_SYS_RESOURCE) &&
> + sbi->s_resuid != current->fsuid &&
> + (sbi->s_resgid == 0 || !in_group_p(sbi->s_resgid)))
> + root_blocks = ext4_r_blocks_count(sbi->s_es);
> +#ifdef CONFIG_SMP
> + /* Each CPU can accumulate FBC_BATCH blocks in their local
> + * counters. So we need to make sure we have free blocks more
> + * than FBC_BATCH * nr_cpu_ids. Also add a window of 4 times.
> + */
> + if (free_blocks - (nblocks + root_blocks) <
> + (4 * (FBC_BATCH * nr_cpu_ids))) {
> + /*
> + * We need to sum and claim under lock
> + * This is the slow patch which will be
> + * taken when we are very low on free blocks
> + */
> + spin_lock(&fbc->lock);
> + free_blocks = fbc->count;
> + for_each_online_cpu(cpu) {
> + s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
> + free_blocks += *pcount;
> + *pcount = 0;
> + }
> + fbc->count = free_blocks;
> + if (free_blocks <= root_blocks) {
> + /* we don't have free space */
> + spin_unlock(&fbc->lock);
> + return -ENOSPC;
> + }
> + if (free_blocks - root_blocks < nblocks) {
> + spin_unlock(&fbc->lock);
> + return -ENOSPC;
> + }
> + fbc->count -= nblocks;
> + spin_unlock(&fbc->lock);
> + return 0;
> + }
> +#endif
> + if (free_blocks <= root_blocks)
> + /* we don't have free space */
> + return -ENOSPC;
> + if (free_blocks - root_blocks < nblocks)
> + return -ENOSPC;
> + /* reduce fs free blocks counter */
> + percpu_counter_sub(fbc, nblocks);
> + return 0;
> +}
> +

Like I proposed earlier, I would prefer create a new percpu counter API
to do the percpu_counter_sum_and_sub(), hide the CONFIG_SMP there,
rather than expose the percpu counter implementation details in ext4 fs
code here.

It also be nice and sane to define

#define EXT4_FREEBLOCKS_LOW_WATERMARK 4* nr_cpu_ids * nr_cpu_ids


Other than this, patch looks good.


Mingming
> /**
> * ext4_has_free_blocks()
> * @sbi: in-core super block structure.
> @@ -1624,9 +1682,15 @@ ext4_fsblk_t ext4_has_free_blocks(struct ext4_sb_info *sbi,
> (sbi->s_resgid == 0 || !in_group_p(sbi->s_resgid)))
> root_blocks = ext4_r_blocks_count(sbi->s_es);
> #ifdef CONFIG_SMP
> - if (free_blocks - root_blocks < FBC_BATCH)
> + /* Each CPU can accumulate FBC_BATCH blocks in their local
> + * counters. So we need to make sure we have free blocks more
> + * than FBC_BATCH * nr_cpu_ids. Also add a window of 4 times.
> + */
> + if (free_blocks - (nblocks + root_blocks) <
> + (4 * (FBC_BATCH * nr_cpu_ids))) {
> free_blocks =
> percpu_counter_sum_and_set(&sbi->s_freeblocks_counter);
> + }
> #endif
> if (free_blocks <= root_blocks)
> /* we don't have free space */
> @@ -1634,7 +1698,7 @@ ext4_fsblk_t ext4_has_free_blocks(struct ext4_sb_info *sbi,
> if (free_blocks - root_blocks < nblocks)
> return free_blocks - root_blocks;
> return nblocks;
> - }
> +}
>
>
> /**
> @@ -1713,14 +1777,11 @@ ext4_fsblk_t ext4_old_new_blocks(handle_t *handle, struct inode *inode,
> /*
> * With delalloc we already reserved the blocks
> */
> - *count = ext4_has_free_blocks(sbi, *count);
> - }
> - if (*count == 0) {
> - *errp = -ENOSPC;
> - return 0; /*return with ENOSPC error */
> + if (ext4_claim_free_blocks(sbi, *count)) {
> + *errp = -ENOSPC;
> + return 0; /*return with ENOSPC error */
> + }
> }
> - num = *count;
> -
> /*
> * Check quota for allocation of this block.
> */
> @@ -1915,9 +1976,13 @@ ext4_fsblk_t ext4_old_new_blocks(handle_t *handle, struct inode *inode,
> le16_add_cpu(&gdp->bg_free_blocks_count, -num);
> gdp->bg_checksum = ext4_group_desc_csum(sbi, group_no, gdp);
> spin_unlock(sb_bgl_lock(sbi, group_no));
> - if (!EXT4_I(inode)->i_delalloc_reserved_flag)
> - percpu_counter_sub(&sbi->s_freeblocks_counter, num);
> -
> + if (!EXT4_I(inode)->i_delalloc_reserved_flag && (*count != num)) {
> + /*
> + * we allocated less blocks than we
> + * claimed. Add the difference back.
> + */
> + percpu_counter_add(&sbi->s_freeblocks_counter, *count - num);
> + }
> if (sbi->s_log_groups_per_flex) {
> ext4_group_t flex_group = ext4_flex_group(sbi, group_no);
> spin_lock(sb_bgl_lock(sbi, flex_group));
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 7f11b25..3738039 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1047,6 +1047,8 @@ extern ext4_fsblk_t ext4_new_blocks(handle_t *handle, struct inode *inode,
> unsigned long *count, int *errp);
> extern ext4_fsblk_t ext4_old_new_blocks(handle_t *handle, struct inode *inode,
> ext4_fsblk_t goal, unsigned long *count, int *errp);
> +extern int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
> + ext4_fsblk_t nblocks);
> extern ext4_fsblk_t ext4_has_free_blocks(struct ext4_sb_info *sbi,
> ext4_fsblk_t nblocks);
> extern void ext4_free_blocks (handle_t *handle, struct inode *inode,
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1c289c1..d965a05 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1537,13 +1537,10 @@ static int ext4_da_reserve_space(struct inode *inode, int nrblocks)
> md_needed = mdblocks - EXT4_I(inode)->i_reserved_meta_blocks;
> total = md_needed + nrblocks;
>
> - if (ext4_has_free_blocks(sbi, total) < total) {
> + if (ext4_claim_free_blocks(sbi, total)) {
> spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> return -ENOSPC;
> }
> - /* reduce fs free blocks counter */
> - percpu_counter_sub(&sbi->s_freeblocks_counter, total);
> -
> EXT4_I(inode)->i_reserved_data_blocks += nrblocks;
> EXT4_I(inode)->i_reserved_meta_blocks = mdblocks;
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 5267efc..7d94119 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2977,9 +2977,15 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
> * at write_begin() time for delayed allocation
> * do not double accounting
> */
> - if (!(ac->ac_flags & EXT4_MB_DELALLOC_RESERVED))
> - percpu_counter_sub(&sbi->s_freeblocks_counter,
> - ac->ac_b_ex.fe_len);
> + if (!(ac->ac_flags & EXT4_MB_DELALLOC_RESERVED) &&
> + ac->ac_o_ex.fe_len != ac->ac_b_ex.fe_len) {
> + /*
> + * we allocated less blocks than we calimed
> + * Add the difference back
> + */
> + percpu_counter_add(&sbi->s_freeblocks_counter,
> + ac->ac_o_ex.fe_len -ac->ac_b_ex.fe_len);
> + }
>
> if (sbi->s_log_groups_per_flex) {
> ext4_group_t flex_group = ext4_flex_group(sbi,
> @@ -4392,14 +4398,11 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
> /*
> * With delalloc we already reserved the blocks
> */
> - ar->len = ext4_has_free_blocks(sbi, ar->len);
> - }
> -
> - if (ar->len == 0) {
> - *errp = -ENOSPC;
> - return 0;
> + if (ext4_claim_free_blocks(sbi, ar->len)) {
> + *errp = -ENOSPC;
> + return 0;
> + }
> }
> -
> while (ar->len && DQUOT_ALLOC_BLOCK(ar->inode, ar->len)) {
> ar->flags |= EXT4_MB_HINT_NOPREALLOC;
> ar->len--;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2008-08-21 17:07:54

by Mingming Cao

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages


在 2008-08-21四的 22:15 +0530,Aneesh Kumar K.V写道:
> On Wed, Aug 20, 2008 at 11:13:39AM +0530, Aneesh Kumar K.V wrote:
> > Hi,
> >
> > I am getting this even with the latest patch queue. The test program is
> > a modified fsstress with fallocate support.
> >
> > mpage_da_map_blocks block allocation failed for inode 377954 at logical
> > offset 313 with max blocks 4 with error -28
> > mpage_da_map_blocks block allocation failed for inode 336367 at logical
> > offset 74 with max blocks 9 with error -28
> > mpage_da_map_blocks block allocation failed for inode 345560 at logical
> > offset 542 with max blocks 7 with error -28
> > This should not happen.!! Data will be lost
> > mpage_da_map_blocks block allocation failed for inode 355317 at logical
> > offset 152 with max blocks 10 with error -28
> > This should not happen.!! Data will be lost
> > mpage_da_map_blocks block allocation failed for inode 395261 at logical
> > offset 462 with max blocks 1 with error -28
> > This should not happen.!! Data will be lost
> > mpage_da_map_blocks block allocation failed for inode 323784 at logical
> > offset 313 with max blocks 11 with error -28
> > This should not happen.!! Data will be lost
> >
>
> With this patch i am not seeing error. It does the below
>
> a) use ext4_claim_free_blocks that also update the free blocks count
> b) Later after block allocation update the free blocks count if we
> allocated less with non-delayed mode
> c) Switch to non delay mode if we are low on free blocks.
>
I had sent a patch to do c) yesterday, I noticed that we can't switch to
non delayed mode if the inode already have some delalloc dirty pages.


> @@ -2462,11 +2464,21 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
> unsigned from, to;
> struct inode *inode = mapping->host;
> handle_t *handle;
> + s64 free_blocks;
> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>
> index = pos >> PAGE_CACHE_SHIFT;
> from = pos & (PAGE_CACHE_SIZE - 1);
> to = from + len;
>
> + free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
> + if (free_blocks < (4 * (FBC_BATCH * nr_cpu_ids))) {
> + /* switch to non delalloc mode */
> + *fsdata = (void *)1;
> + return ext4_write_begin(file, mapping, pos,
> + len, flags, pagep, fsdata);
> + }
> + *fsdata = (void *)0;

No, calling ext4_write_begin() directly won't work, as it start a
handle, do the block allocation , and leave the handle there. It
expect later the write_end aops to file the inode to the transaction
list, and close that handle.

With your change, the aops write_end still points to the
ext4_da_write_end(), which doesn't match the ext4_write_begin. We need
to switch the aop write_begin/write_end function pointers all together.


I updated my patch, I have a few place I want to polish, since you are
looking at this , comments pls.


ext4: fall back to non delalloc mode if filesystem is almost full
From: Mingming Cao <[email protected]>

In the case of filesystem is close to full (free blocks is below
the watermark FBC_BATCH * NRCPUS *4) and there is no delalloc dirty
pages in the page cache for the given inode, fall back to non delalloc
mode for the inode to avoid possible later ENOSPC.

Signed-off-by: Mingming Cao <[email protected]>
---
fs/ext4/ext4.h | 2 -
fs/ext4/inode.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++----------
fs/ext4/namei.c | 4 +-
3 files changed, 65 insertions(+), 16 deletions(-)

Index: linux-2.6.27-rc3/fs/ext4/inode.c
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/inode.c 2008-08-20 15:20:10.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/inode.c 2008-08-20 17:59:20.000000000 -0700
@@ -2391,6 +2391,25 @@
return ret;
}

+/*
+ * In case of filesystem is almost full and delalloc could not
+ * get enough free blocks to reserve to prevent later ENOSPC,
+ * let's fall back to the nondelalloc mode
+ */
+static int ext4_fall_back_to_nondelalloc(struct file *file,
+ struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata)
+{
+ struct inode *inode = mapping->host;
+
+ /* turn off delalloc for this inode*/
+ ext4_set_aops(inode, 0);
+
+ return mapping->a_ops->write_begin(file, mapping, pos, len,
+ flags, pagep, fsdata);
+}
+
static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
@@ -2402,6 +2421,15 @@
struct inode *inode = mapping->host;
handle_t *handle;

+ /*
+ * Fall back to nondelalloc mode if fs is about run out of free
+ * blocks. We only do this if there is no delalloc dirty pages, those
+ * dirty pages still need delalloc writepages aops to flush to disk.
+ */
+ if (!ext4_has_delalloc_pages(inode) && ext4_free_blocks_low(inode))
+ ret= ext4_fall_back_to_nondelalloc(file,mapping,pos,
+ len, flags, pagep,fsdata);
+
index = pos >> PAGE_CACHE_SHIFT;
from = pos & (PAGE_CACHE_SIZE - 1);
to = from + len;
@@ -2435,8 +2463,8 @@
page_cache_release(page);
}

- if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
- goto retry;
+ if (ret == -ENOSPC && (ext4_should_retry_alloc(inode->i_sb, &retries))
+ goto retry;
out:
return ret;
}
@@ -3008,16 +3036,37 @@
.is_partially_uptodate = block_is_partially_uptodate,
};

-void ext4_set_aops(struct inode *inode)
+#define EXT4_MIN_FREE_BLOCKS (NR_CPUS * 4 * NR_CPUS*4)
+
+static int ext4_free_blocks_low(struct inode* inode)
+{
+ return (ext4_has_free_blocks(EXT4_SB(inode->i_sb),
+ EXT4_MIN_FREE_BLOCKS) > EXT4_MIN_FREE_BLOCKS);
+}
+
+static int ext4_has_delalloc_pages(struct inode* inode)
+{
+ return EXT4_I(inode)->i_reserved_data_blocks;
+}
+
+void ext4_set_aops(struct inode *inode, int delalloc)
{
- if (ext4_should_order_data(inode) &&
- test_opt(inode->i_sb, DELALLOC))
- inode->i_mapping->a_ops = &ext4_da_aops;
- else if (ext4_should_order_data(inode))
+ if (test_opt(inode->i_sb, DELALLOC)) {
+ if (delalloc && !ext4_has_delalloc_pages(inode)
+ && ext4_free_blocks_low(inode))
+ delalloc = 0;
+
+ if (delalloc) {
+ inode->i_mapping->a_ops = &ext4_da_aops;
+ return;
+ } else
+ printk(KERN_INFO "filesystem is close to full, "
+ "delayed allocation is turned off for "
+ " inode %lu\n", inode->i_ino);
+ }
+
+ if (ext4_should_order_data(inode))
inode->i_mapping->a_ops = &ext4_ordered_aops;
- else if (ext4_should_writeback_data(inode) &&
- test_opt(inode->i_sb, DELALLOC))
- inode->i_mapping->a_ops = &ext4_da_aops;
else if (ext4_should_writeback_data(inode))
inode->i_mapping->a_ops = &ext4_writeback_aops;
else
@@ -4011,7 +4060,7 @@
if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext4_file_inode_operations;
inode->i_fop = &ext4_file_operations;
- ext4_set_aops(inode);
+ ext4_set_aops(inode, 1);
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = &ext4_dir_inode_operations;
inode->i_fop = &ext4_dir_operations;
@@ -4020,7 +4069,7 @@
inode->i_op = &ext4_fast_symlink_inode_operations;
else {
inode->i_op = &ext4_symlink_inode_operations;
- ext4_set_aops(inode);
+ ext4_set_aops(inode, 1);
}
} else {
inode->i_op = &ext4_special_inode_operations;
@@ -4783,7 +4832,7 @@
EXT4_I(inode)->i_flags |= EXT4_JOURNAL_DATA_FL;
else
EXT4_I(inode)->i_flags &= ~EXT4_JOURNAL_DATA_FL;
- ext4_set_aops(inode);
+ ext4_set_aops(inode, 1);

jbd2_journal_unlock_updates(journal);

Index: linux-2.6.27-rc3/fs/ext4/ext4.h
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/ext4.h 2008-08-20 15:41:36.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/ext4.h 2008-08-20 15:41:56.000000000 -0700
@@ -1070,7 +1070,7 @@
extern void ext4_truncate (struct inode *);
extern void ext4_set_inode_flags(struct inode *);
extern void ext4_get_inode_flags(struct ext4_inode_info *);
-extern void ext4_set_aops(struct inode *inode);
+extern void ext4_set_aops(struct inode *inode, int delalloc);
extern int ext4_writepage_trans_blocks(struct inode *);
extern int ext4_meta_trans_blocks(struct inode *, int nrblocks, int idxblocks);
extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
Index: linux-2.6.27-rc3/fs/ext4/namei.c
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/namei.c 2008-08-20 15:42:13.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/namei.c 2008-08-20 15:42:41.000000000 -0700
@@ -1738,7 +1738,7 @@
if (!IS_ERR(inode)) {
inode->i_op = &ext4_file_inode_operations;
inode->i_fop = &ext4_file_operations;
- ext4_set_aops(inode);
+ ext4_set_aops(inode, 1);
err = ext4_add_nondir(handle, dentry, inode);
}
ext4_journal_stop(handle);
@@ -2210,7 +2210,7 @@

if (l > sizeof (EXT4_I(inode)->i_data)) {
inode->i_op = &ext4_symlink_inode_operations;
- ext4_set_aops(inode);
+ ext4_set_aops(inode, 1);
/*
* page_symlink() calls into ext4_prepare/commit_write.
* We have a transaction open. All is sweetness. It also sets

2008-08-21 17:17:26

by Mingming Cao

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages


在 2008-08-21四的 11:35 -0400,Theodore Tso写道:
> On Thu, Aug 21, 2008 at 08:48:15PM +0530, Aneesh Kumar K.V wrote:
> > > I have a feeling that we did not try very hard before invalidate the
> > > dirty page which fail to map to disks. Perhaps we should try a few more
> > > times before give up. Also in that case, perhaps we should turn off
> > > delalloc fs wide, so the new writers won't take the subsequently made
> > > avaible free blocks away from this unlucky delalloc da writepages.
> >
> > How do we try hard ? The mballoc already try had to allocate blocks. So I
> > am not sure what do we achieve by requesting for block allocation again.
>
> So here's the problem that we face. If we have a situation where the
> disk fills temporarily, but then subsequently space gets freed up, it
> would be preferable if the dirty page isn't invalidated, and so
> periodically (or perhaps via "there's-free-space-now notifier") we
> retry the delayed allocation so we don't lose data during a transient
> disk full situation. But at the same time, we don't want an fsync()
> on the entire filesystem, or a umount on the filesystem, to hang
> forever.
>

This situation (disk fills temporarily) is not new, we handled this in
ext4_write_begin() (and other places) by retry allocation three times
in case of ENOSPC error. we could do the same retry in
ext4_da_writepages().

Mingming
> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2008-08-21 17:31:43

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages

On Thu, Aug 21, 2008 at 10:07:37AM -0700, Mingming Cao wrote:
>
> 在 2008-08-21四的 22:15 +0530,Aneesh Kumar K.V写道:
> > On Wed, Aug 20, 2008 at 11:13:39AM +0530, Aneesh Kumar K.V wrote:
> > > Hi,
> > >
> > > I am getting this even with the latest patch queue. The test program is
> > > a modified fsstress with fallocate support.
> > >
> > > mpage_da_map_blocks block allocation failed for inode 377954 at logical
> > > offset 313 with max blocks 4 with error -28
> > > mpage_da_map_blocks block allocation failed for inode 336367 at logical
> > > offset 74 with max blocks 9 with error -28
> > > mpage_da_map_blocks block allocation failed for inode 345560 at logical
> > > offset 542 with max blocks 7 with error -28
> > > This should not happen.!! Data will be lost
> > > mpage_da_map_blocks block allocation failed for inode 355317 at logical
> > > offset 152 with max blocks 10 with error -28
> > > This should not happen.!! Data will be lost
> > > mpage_da_map_blocks block allocation failed for inode 395261 at logical
> > > offset 462 with max blocks 1 with error -28
> > > This should not happen.!! Data will be lost
> > > mpage_da_map_blocks block allocation failed for inode 323784 at logical
> > > offset 313 with max blocks 11 with error -28
> > > This should not happen.!! Data will be lost
> > >
> >
> > With this patch i am not seeing error. It does the below
> >
> > a) use ext4_claim_free_blocks that also update the free blocks count
> > b) Later after block allocation update the free blocks count if we
> > allocated less with non-delayed mode
> > c) Switch to non delay mode if we are low on free blocks.
> >
> I had sent a patch to do c) yesterday, I noticed that we can't switch to
> non delayed mode if the inode already have some delalloc dirty pages.
>
>
> > @@ -2462,11 +2464,21 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
> > unsigned from, to;
> > struct inode *inode = mapping->host;
> > handle_t *handle;
> > + s64 free_blocks;
> > + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> >
> > index = pos >> PAGE_CACHE_SHIFT;
> > from = pos & (PAGE_CACHE_SIZE - 1);
> > to = from + len;
> >
> > + free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
> > + if (free_blocks < (4 * (FBC_BATCH * nr_cpu_ids))) {
> > + /* switch to non delalloc mode */
> > + *fsdata = (void *)1;
> > + return ext4_write_begin(file, mapping, pos,
> > + len, flags, pagep, fsdata);
> > + }
> > + *fsdata = (void *)0;
>
> No, calling ext4_write_begin() directly won't work, as it start a
> handle, do the block allocation , and leave the handle there. It
> expect later the write_end aops to file the inode to the transaction
> list, and close that handle.
>
> With your change, the aops write_end still points to the
> ext4_da_write_end(), which doesn't match the ext4_write_begin. We need
> to switch the aop write_begin/write_end function pointers all together.
>

My patch does it in a simple way. I am attaching only switch to non
delalloc patch below. It is still being tested.

commit 490b69bd47b9ea27b1bb86bbdfb85a2911047149
Author: Aneesh Kumar K.V <[email protected]>
Date: Thu Aug 21 22:10:50 2008 +0530

switch to non-delalloc

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d965a05..087abca 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1030,11 +1030,17 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used)
BUG_ON(mdb > EXT4_I(inode)->i_reserved_meta_blocks);
mdb_free = EXT4_I(inode)->i_reserved_meta_blocks - mdb;

- /* Account for allocated meta_blocks */
- mdb_free -= EXT4_I(inode)->i_allocated_meta_blocks;
+ if (mdb_free) {
+ /* Account for allocated meta_blocks */
+ mdb_free -= EXT4_I(inode)->i_allocated_meta_blocks;

- /* update fs free blocks counter for truncate case */
- percpu_counter_add(&sbi->s_freeblocks_counter, mdb_free);
+ /*
+ * We have reserved more blocks.
+ * Now free the extra blocks reserved
+ */
+ percpu_counter_add(&sbi->s_freeblocks_counter, mdb_free);
+ EXT4_I(inode)->i_allocated_meta_blocks = 0;
+ }

/* update per-inode reservations */
BUG_ON(used > EXT4_I(inode)->i_reserved_data_blocks);
@@ -1042,7 +1048,6 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used)

BUG_ON(mdb > EXT4_I(inode)->i_reserved_meta_blocks);
EXT4_I(inode)->i_reserved_meta_blocks = mdb;
- EXT4_I(inode)->i_allocated_meta_blocks = 0;
spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
}

@@ -2459,11 +2464,21 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
unsigned from, to;
struct inode *inode = mapping->host;
handle_t *handle;
+ s64 free_blocks;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);

index = pos >> PAGE_CACHE_SHIFT;
from = pos & (PAGE_CACHE_SIZE - 1);
to = from + len;

+ free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
+ if (free_blocks < (4 * (FBC_BATCH * nr_cpu_ids))) {
+ /* switch to non delalloc mode */
+ *fsdata = (void *)1;
+ return ext4_write_begin(file, mapping, pos,
+ len, flags, pagep, fsdata);
+ }
+ *fsdata = (void *)0;
retry:
/*
* With delayed allocation, we don't log the i_disksize update
@@ -2532,6 +2547,19 @@ static int ext4_da_write_end(struct file *file,
handle_t *handle = ext4_journal_current_handle();
loff_t new_i_size;
unsigned long start, end;
+ int low_free_blocks = (int)fsdata;
+
+ if (low_free_blocks) {
+ if (ext4_should_order_data(inode)) {
+ return ext4_ordered_write_end(file, mapping, pos,
+ len, copied, page, fsdata);
+ } else if (ext4_should_writeback_data(inode)) {
+ return ext4_writeback_write_end(file, mapping, pos,
+ len, copied, page, fsdata);
+ } else {
+ BUG();
+ }
+ }

start = pos & (PAGE_CACHE_SIZE - 1);
end = start + copied -1;

2008-08-21 18:06:19

by Mingming Cao

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages


在 2008-08-21四的 23:01 +0530,Aneesh Kumar K.V写道:
> On Thu, Aug 21, 2008 at 10:07:37AM -0700, Mingming Cao wrote:
> >
> > 在 2008-08-21四的 22:15 +0530,Aneesh Kumar K.V写道:
> > > On Wed, Aug 20, 2008 at 11:13:39AM +0530, Aneesh Kumar K.V wrote:
> > > > Hi,
> > > >
> > > > I am getting this even with the latest patch queue. The test program is
> > > > a modified fsstress with fallocate support.
> > > >
> > > > mpage_da_map_blocks block allocation failed for inode 377954 at logical
> > > > offset 313 with max blocks 4 with error -28
> > > > mpage_da_map_blocks block allocation failed for inode 336367 at logical
> > > > offset 74 with max blocks 9 with error -28
> > > > mpage_da_map_blocks block allocation failed for inode 345560 at logical
> > > > offset 542 with max blocks 7 with error -28
> > > > This should not happen.!! Data will be lost
> > > > mpage_da_map_blocks block allocation failed for inode 355317 at logical
> > > > offset 152 with max blocks 10 with error -28
> > > > This should not happen.!! Data will be lost
> > > > mpage_da_map_blocks block allocation failed for inode 395261 at logical
> > > > offset 462 with max blocks 1 with error -28
> > > > This should not happen.!! Data will be lost
> > > > mpage_da_map_blocks block allocation failed for inode 323784 at logical
> > > > offset 313 with max blocks 11 with error -28
> > > > This should not happen.!! Data will be lost
> > > >
> > >
> > > With this patch i am not seeing error. It does the below
> > >
> > > a) use ext4_claim_free_blocks that also update the free blocks count
> > > b) Later after block allocation update the free blocks count if we
> > > allocated less with non-delayed mode
> > > c) Switch to non delay mode if we are low on free blocks.
> > >
> > I had sent a patch to do c) yesterday, I noticed that we can't switch to
> > non delayed mode if the inode already have some delalloc dirty pages.
> >
> >
> > > @@ -2462,11 +2464,21 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
> > > unsigned from, to;
> > > struct inode *inode = mapping->host;
> > > handle_t *handle;
> > > + s64 free_blocks;
> > > + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> > >
> > > index = pos >> PAGE_CACHE_SHIFT;
> > > from = pos & (PAGE_CACHE_SIZE - 1);
> > > to = from + len;
> > >
> > > + free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
> > > + if (free_blocks < (4 * (FBC_BATCH * nr_cpu_ids))) {
> > > + /* switch to non delalloc mode */
> > > + *fsdata = (void *)1;
> > > + return ext4_write_begin(file, mapping, pos,
> > > + len, flags, pagep, fsdata);
> > > + }
> > > + *fsdata = (void *)0;
> >
> > No, calling ext4_write_begin() directly won't work, as it start a
> > handle, do the block allocation , and leave the handle there. It
> > expect later the write_end aops to file the inode to the transaction
> > list, and close that handle.
> >
> > With your change, the aops write_end still points to the
> > ext4_da_write_end(), which doesn't match the ext4_write_begin. We need
> > to switch the aop write_begin/write_end function pointers all together.
> >
>
> My patch does it in a simple way. I am attaching only switch to non
> delalloc patch below. It is still being tested.
>

For the people on the mailing list, after chatted with Aneesh on IRC I
realize how it is handle it now:) He passed a special flag to
da_write_end() to indicating fall back to normal write_end() in the case
the ext4_da_write_begin() falls back to normal write_begin. It's not
very obvious so I missed it in the first place.

Mingming
> commit 490b69bd47b9ea27b1bb86bbdfb85a2911047149
> Author: Aneesh Kumar K.V <[email protected]>
> Date: Thu Aug 21 22:10:50 2008 +0530
>
> switch to non-delalloc
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d965a05..087abca 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1030,11 +1030,17 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used)
> BUG_ON(mdb > EXT4_I(inode)->i_reserved_meta_blocks);
> mdb_free = EXT4_I(inode)->i_reserved_meta_blocks - mdb;
>
> - /* Account for allocated meta_blocks */
> - mdb_free -= EXT4_I(inode)->i_allocated_meta_blocks;
> + if (mdb_free) {
> + /* Account for allocated meta_blocks */
> + mdb_free -= EXT4_I(inode)->i_allocated_meta_blocks;
>
> - /* update fs free blocks counter for truncate case */
> - percpu_counter_add(&sbi->s_freeblocks_counter, mdb_free);
> + /*
> + * We have reserved more blocks.
> + * Now free the extra blocks reserved
> + */
> + percpu_counter_add(&sbi->s_freeblocks_counter, mdb_free);
> + EXT4_I(inode)->i_allocated_meta_blocks = 0;
> + }
>
> /* update per-inode reservations */
> BUG_ON(used > EXT4_I(inode)->i_reserved_data_blocks);
> @@ -1042,7 +1048,6 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used)
>
> BUG_ON(mdb > EXT4_I(inode)->i_reserved_meta_blocks);
> EXT4_I(inode)->i_reserved_meta_blocks = mdb;
> - EXT4_I(inode)->i_allocated_meta_blocks = 0;
> spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> }
>
> @@ -2459,11 +2464,21 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
> unsigned from, to;
> struct inode *inode = mapping->host;
> handle_t *handle;
> + s64 free_blocks;
> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>
> index = pos >> PAGE_CACHE_SHIFT;
> from = pos & (PAGE_CACHE_SIZE - 1);
> to = from + len;
>
> + free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
> + if (free_blocks < (4 * (FBC_BATCH * nr_cpu_ids))) {
> + /* switch to non delalloc mode */
> + *fsdata = (void *)1;
> + return ext4_write_begin(file, mapping, pos,
> + len, flags, pagep, fsdata);
> + }
> + *fsdata = (void *)0;
> retry:
> /*
> * With delayed allocation, we don't log the i_disksize update
> @@ -2532,6 +2547,19 @@ static int ext4_da_write_end(struct file *file,
> handle_t *handle = ext4_journal_current_handle();
> loff_t new_i_size;
> unsigned long start, end;
> + int low_free_blocks = (int)fsdata;
> +
> + if (low_free_blocks) {
> + if (ext4_should_order_data(inode)) {
> + return ext4_ordered_write_end(file, mapping, pos,
> + len, copied, page, fsdata);
> + } else if (ext4_should_writeback_data(inode)) {
> + return ext4_writeback_write_end(file, mapping, pos,
> + len, copied, page, fsdata);
> + } else {
> + BUG();
> + }
> + }
>
> start = pos & (PAGE_CACHE_SIZE - 1);
> end = start + copied -1;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2008-08-23 11:12:39

by Andreas Dilger

[permalink] [raw]
Subject: Re: ENOSPC returned during writepages

On Aug 21, 2008 20:48 +0530, Aneesh Kumar wrote:
> On Wed, Aug 20, 2008 at 02:55:25PM -0700, Mingming Cao wrote:
> > I have a feeling that we did not try very hard before invalidate the
> > dirty page which fail to map to disks. Perhaps we should try a few more
> > times before give up. Also in that case, perhaps we should turn off
> > delalloc fs wide, so the new writers won't take the subsequently made
> > avaible free blocks away from this unlucky delalloc da writepages.
>
> How do we try hard ? The mballoc already try had to allocate blocks. So I
> am not sure what do we achieve by requesting for block allocation again.

One reason is that if blocks are being released then they cannot be
reallocated until the transaction has committed.

Running something like iozone in a loop on a filesystem that has only
as much free space as the size of the output file can run out of space
after a few loops because of this, so definitely retrying the allocation
can help.

Also, keeping the pages in RAM may allow them to be written again if
the user detects the ENOSPC and deletes some files...

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.