2012-02-01 03:18:04

by Robin Dong

[permalink] [raw]
Subject: [PATCH] ext4: s_dirtyclusters_counter should tranform to unit of cluster before assigning to "dirty_clusters" in ext4_has_free_clusters()

From: Robin Dong <[email protected]>

Creating 4-byte files until ENOSPC in a delay-allocation and bigalloc ext4 fs and then sync it, the dmseg will report like:

[ 482.154538] EXT4-fs (sdb6): delayed block allocation failed for inode 1664 at logical offset 0 with max blocks 1 with error -28
[ 482.154540] EXT4-fs (sdb6): This should not happen!! Data will be lost

The reason is ext4_has_free_clusters reporting wrong result. Actually, the unit of sbi->s_dirtyclusters_counter is block, so we should tranform it to cluster for argument "dirty_clusters", just like "free_clusters".

Cc: "Theodore Ts'o" <[email protected]>
Reported-by: Eric Sandeen <[email protected]>
Signed-off-by: Robin Dong <[email protected]>
---
fs/ext4/balloc.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index f9e2cd8..8ff10c4 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -427,7 +427,7 @@ static int ext4_has_free_clusters(struct ext4_sb_info *sbi,
if (free_clusters - (nclusters + root_clusters + dirty_clusters) <
EXT4_FREECLUSTERS_WATERMARK) {
free_clusters = EXT4_C2B(sbi, percpu_counter_sum_positive(fcc));
- dirty_clusters = percpu_counter_sum_positive(dcc);
+ dirty_clusters = EXT4_C2B(sbi, percpu_counter_sum_positive(dcc));
}
/* Check whether we have space after accounting for current
* dirty clusters & root reserved clusters.
--
1.7.4.1



2012-02-02 07:48:50

by Robin Dong

[permalink] [raw]
Subject: Re: [PATCH] ext4: s_dirtyclusters_counter should tranform to unit of cluster before assigning to "dirty_clusters" in ext4_has_free_clusters()

2012/2/1 Robin Dong <[email protected]>:
> From: Robin Dong <[email protected]>
>
> Creating 4-byte files until ENOSPC in a delay-allocation and bigalloc ext4 fs and then sync it, the dmseg will report like:
>
> ? ? ? ?[ ?482.154538] EXT4-fs (sdb6): delayed block allocation failed for inode 1664 at logical offset 0 with max blocks 1 with error -28
> ? ? ? ?[ ?482.154540] EXT4-fs (sdb6): This should not happen!! Data will be lost
>
> The reason is ext4_has_free_clusters reporting wrong result. Actually, the unit of sbi->s_dirtyclusters_counter is block, so we should tranform it to cluster for argument "dirty_clusters", just like "free_clusters".

Sorry, the description is in confusion, the correct version is:

The reason is ext4_has_free_clusters reporting wrong result. Actually,
the unit of argument "dirty_clusters" is block, so we should tranform
the
sbi->s_dirtyclusters_counter to block , just like "free_clusters".

>
> Cc: "Theodore Ts'o" <[email protected]>
> Reported-by: Eric Sandeen <[email protected]>
> Signed-off-by: Robin Dong <[email protected]>
> ---
> ?fs/ext4/balloc.c | ? ?2 +-
> ?1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index f9e2cd8..8ff10c4 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -427,7 +427,7 @@ static int ext4_has_free_clusters(struct ext4_sb_info *sbi,
> ? ? ? ?if (free_clusters - (nclusters + root_clusters + dirty_clusters) <
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?EXT4_FREECLUSTERS_WATERMARK) {
> ? ? ? ? ? ? ? ?free_clusters ?= EXT4_C2B(sbi, percpu_counter_sum_positive(fcc));
> - ? ? ? ? ? ? ? dirty_clusters = percpu_counter_sum_positive(dcc);
> + ? ? ? ? ? ? ? dirty_clusters = EXT4_C2B(sbi, percpu_counter_sum_positive(dcc));
> ? ? ? ?}
> ? ? ? ?/* Check whether we have space after accounting for current
> ? ? ? ? * dirty clusters & root reserved clusters.
> --
> 1.7.4.1
>



--
--
Best Regard
Robin Dong

2012-02-10 02:55:29

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: s_dirtyclusters_counter should tranform to unit of cluster before assigning to "dirty_clusters" in ext4_has_free_clusters()

On Thu, Feb 02, 2012 at 03:48:49PM +0800, Robin Dong wrote:
>
> The reason is ext4_has_free_clusters reporting wrong
> result. Actually, the unit of argument "dirty_clusters" is block, so
> we should tranform the sbi->s_dirtyclusters_counter to block , just
> like "free_clusters".

I've been looking at the the delayed allocation reservation code, and
I think it's a lot more confused that this. Some places we are
playing with fields as if they are clusters, and some times as if they
are blocks, and I've definitely found places where we're not handling
quota correctly with bigalloc (ext4_map_blocks is calling
ext4_da_update_reserve_space() with a value which is clearly in units
of blocks, but we are treating that vale as if it is clusters, etc.)

So I think the problem is a lot larger than what you pointed out, and
we shouldn't fix it just by throwing in an EXT4_C2B call. If
s_dirtyclusters_counter should be denominated in blocks, then we need
to rename it. And there there's this comment which made me wince:

* Note that in case of bigalloc, i_reserved_meta_blocks,
* i_reserved_data_blocks, etc. refer to number of clusters.

Argh, no. If something is denominated in clusters, then it should be
appropriately named as such.

Actually, I think the problem is a lot bigger than this, and it's a
conceptual one. When we try writing to a block which does not have a
mapping, and bigalloc is enabled, there are three cases:

1) This is the first time the cluster has ever been written to; hence,
even though we are dirtying a block, we need to reserve room for the
entire cluster.

2) While the block has not been mapped (and so the data on disk is
uninitialized) the cluster has indeed been allocated, so we don't need
to reserve any additional room for the block.

3) One or more blocks in the cluster have already been written to via
delayed allocation, but the cluster itself has not been selected
(allocated) yet. But since the space for the entire cluster was
reserved when the first block in the cluster was dirtied (case #1), we
don't need to reserve any more space.

We aren't handling these cases correctly today, since we don't have
this logic here. These cases don't matter if you're only using
fallocate() and direct I/O, but if you are using buffered writes and
delayed allocation, they obviously matter a huge amount.

And I don't think we're handling it correctly. Argh...

- Ted

2012-02-14 08:01:37

by Robin Dong

[permalink] [raw]
Subject: Re: [PATCH] ext4: s_dirtyclusters_counter should tranform to unit of cluster before assigning to "dirty_clusters" in ext4_has_free_clusters()

?? 2012??2??10?? ????10:55??Ted Ts'o <[email protected]> д????
> On Thu, Feb 02, 2012 at 03:48:49PM +0800, Robin Dong wrote:
>>
>> The reason is ext4_has_free_clusters reporting wrong
>> result. Actually, the unit of argument "dirty_clusters" is block, so
>> we should tranform the sbi->s_dirtyclusters_counter to block , just
>> like "free_clusters".
>
> I've been looking at the the delayed allocation reservation code, and
> I think it's a lot more confused that this. Some places we are
> playing with fields as if they are clusters, and some times as if they
> are blocks, and I've definitely found places where we're not handling
> quota correctly with bigalloc (ext4_map_blocks is calling
> ext4_da_update_reserve_space() with a value which is clearly in units
> of blocks, but we are treating that vale as if it is clusters, etc.)
>
> So I think the problem is a lot larger than what you pointed out, and
> we shouldn't fix it just by throwing in an EXT4_C2B call. If
> s_dirtyclusters_counter should be denominated in blocks, then we need
> to rename it. And there there's this comment which made me wince:
>
> * Note that in case of bigalloc, i_reserved_meta_blocks,
> * i_reserved_data_blocks, etc. refer to number of clusters.
>
> Argh, no. If something is denominated in clusters, then it should be
> appropriately named as such.
>
> Actually, I think the problem is a lot bigger than this, and it's a
> conceptual one. When we try writing to a block which does not have a
> mapping, and bigalloc is enabled, there are three cases:
>
> 1) This is the first time the cluster has ever been written to; hence,
> even though we are dirtying a block, we need to reserve room for the
> entire cluster.
>
> 2) While the block has not been mapped (and so the data on disk is
> uninitialized) the cluster has indeed been allocated, so we don't need
> to reserve any additional room for the block.
>
> 3) One or more blocks in the cluster have already been written to via
> delayed allocation, but the cluster itself has not been selected
> (allocated) yet. But since the space for the entire cluster was
> reserved when the first block in the cluster was dirtied (case #1), we
> don't need to reserve any more space.
>
> We aren't handling these cases correctly today, since we don't have
> this logic here. These cases don't matter if you're only using
> fallocate() and direct I/O, but if you are using buffered writes and
> delayed allocation, they obviously matter a huge amount.
>
> And I don't think we're handling it correctly. Argh...
>
> - Ted

Hi, Ted and Aditya

I am trying to fix this "bigalloc quota reservation in
delay-allocation" problem recently, and I feel confusion about these
code below in ext4_ext_map_blocks():

if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
unsigned int reserved_clusters;
/*
* Check how many clusters we had reserved this allocated range
*/
reserved_clusters = get_reserved_cluster_alloc(inode,
map->m_lblk, allocated);
if (map->m_flags & EXT4_MAP_FROM_CLUSTER) {
if (reserved_clusters) {
......
ext4_da_update_reserve_space(inode,
reserved_clusters, 0);
}
} else {
BUG_ON(allocated_clusters < reserved_clusters);
/* We will claim quota for all newly allocated blocks.*/
ext4_da_update_reserve_space(inode, allocated_clusters,
1);
if (reserved_clusters < allocated_clusters) {
struct ext4_inode_info *ei = EXT4_I(inode);
int reservation = allocated_clusters -
reserved_clusters;
dquot_reserve_block(inode,
EXT4_C2B(sbi, reservation));
spin_lock(&ei->i_block_reservation_lock);
ei->i_reserved_data_blocks += reservation;
spin_unlock(&ei->i_block_reservation_lock);
}
}
}

if we have allocated 3 clusters, why should we add "reservation" back
to i_reservd_data_blocks ? (Aditya Kali 's long comment dose not make
sense to me) Maybe we don't need to do ext4_da_update_reserve_space
when the new block is allocated from "implied_cluster"? Why not just
change them to:

if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
if ( !(map->m_flags & EXT4_MAP_FROM_CLUSTER) )
/* We will claim quota for all newly allocated blocks.*/
ext4_da_update_reserve_space(inode, allocated_clusters,
1);
}

--
--
Best Regard
Robin Dong

2012-02-14 21:25:19

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCH] ext4: s_dirtyclusters_counter should tranform to unit of cluster before assigning to "dirty_clusters" in ext4_has_free_clusters()

Hi Robin,


On Tue, Feb 14, 2012 at 12:01 AM, Robin Dong <[email protected]> wrote:
>
> 在 2012年2月10日 上午10:55,Ted Ts'o <[email protected]> 写道:
> > On Thu, Feb 02, 2012 at 03:48:49PM +0800, Robin Dong wrote:
> >>
> >> The reason is ext4_has_free_clusters reporting wrong
> >> result. Actually, the unit of argument "dirty_clusters" is block, so
> >> we should tranform the sbi->s_dirtyclusters_counter to block , just
> >> like "free_clusters".
> >
> > I've been looking at the the delayed allocation reservation code, and
> > I think it's a lot more confused that this.  Some places we are
> > playing with fields as if they are clusters, and some times as if they
> > are blocks, and I've definitely found places where we're not handling
> > quota correctly with bigalloc (ext4_map_blocks is calling
> > ext4_da_update_reserve_space() with a value which is clearly in units
> > of blocks, but we are treating that vale as if it is clusters, etc.)
> >
> > So I think the problem is a lot larger than what you pointed out, and
> > we shouldn't fix it just by throwing in an EXT4_C2B call.  If
> > s_dirtyclusters_counter should be denominated in blocks, then we need
> > to rename it.  And there there's this comment which made me wince:
> >
> >           * Note that in case of bigalloc, i_reserved_meta_blocks,
> >           * i_reserved_data_blocks, etc. refer to number of clusters.
> >
> > Argh, no.  If something is denominated in clusters, then it should be
> > appropriately named as such.
> >
> > Actually, I think the problem is a lot bigger than this, and it's a
> > conceptual one.  When we try writing to a block which does not have a
> > mapping, and bigalloc is enabled, there are three cases:
> >
> > 1) This is the first time the cluster has ever been written to; hence,
> > even though we are dirtying a block, we need to reserve room for the
> > entire cluster.
> >
> > 2) While the block has not been mapped (and so the data on disk is
> > uninitialized) the cluster has indeed been allocated, so we don't need
> > to reserve any additional room for the block.
> >
> > 3) One or more blocks in the cluster have already been written to via
> > delayed allocation, but the cluster itself has not been selected
> > (allocated) yet.  But since the space for the entire cluster was
> > reserved when the first block in the cluster was dirtied (case #1), we
> > don't need to reserve any more space.
> >
> > We aren't handling these cases correctly today, since we don't have
> > this logic here.  These cases don't matter if you're only using
> > fallocate() and direct I/O, but if you are using buffered writes and
> > delayed allocation, they obviously matter a huge amount.
> >
>
> > And I don't think we're handling it correctly.   Argh...
> >

I thought we do handle these cases correctly after the 'ext4: attempt
to fix race in bigalloc code path' change. May be I missed something
or the code changed somewhere. If there is a testcase that fails, I
can take a look.

> >                                                 - Ted
>
> Hi, Ted and Aditya
>
> I am trying to fix this "bigalloc quota reservation in
> delay-allocation" problem recently, and I feel confusion about these
> code below in ext4_ext_map_blocks():
>
>        if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
>                unsigned int reserved_clusters;
>                /*
>                 * Check how many clusters we had reserved this allocated range
>                 */
>                reserved_clusters = get_reserved_cluster_alloc(inode,
>                                                map->m_lblk, allocated);
>                if (map->m_flags & EXT4_MAP_FROM_CLUSTER) {
>                        if (reserved_clusters) {
>                                ......
>                                ext4_da_update_reserve_space(inode,
>                                                reserved_clusters, 0);
>                        }
>                } else {
>                        BUG_ON(allocated_clusters < reserved_clusters);
>                        /* We will claim quota for all newly allocated blocks.*/
>                        ext4_da_update_reserve_space(inode, allocated_clusters,
>                                                        1);
>                        if (reserved_clusters < allocated_clusters) {
>                                struct ext4_inode_info *ei = EXT4_I(inode);
>                                int reservation = allocated_clusters -
>                                                  reserved_clusters;
>                                dquot_reserve_block(inode,
>                                                EXT4_C2B(sbi, reservation));
>                                spin_lock(&ei->i_block_reservation_lock);
>                                ei->i_reserved_data_blocks += reservation;
>                                spin_unlock(&ei->i_block_reservation_lock);
>                        }
>                }
>        }
>
> if we have allocated 3 clusters, why should we add "reservation" back
> to i_reservd_data_blocks ?
Because we will end up here again next time when we write blocks
[10-11] and try to claim space for them. At this time, there will be
no more reserved blocks left (unless you give them back) and you will
hit the warning in ext4_da_update_reserve_space():

if (unlikely(used > ei->i_reserved_data_blocks)) {
ext4_msg(inode->i_sb, KERN_NOTICE, "%s: ino %lu, used %d "
"with only %d reserved data blocks\n",
__func__, inode->i_ino, used,
ei->i_reserved_data_blocks);
WARN_ON(1);
used = ei->i_reserved_data_blocks;
}


> (Aditya Kali 's long comment dose not make
> sense to me) Maybe we don't need to do ext4_da_update_reserve_space
> when the new block is allocated from "implied_cluster"? Why not just
> change them to:
>
>        if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
>                if ( !(map->m_flags & EXT4_MAP_FROM_CLUSTER) )
>                        /* We will claim quota for all newly allocated blocks.*/
>                        ext4_da_update_reserve_space(inode, allocated_clusters,
>                                                        1);
>        }
>

This won't work in all cases, like the one given in the long comment.
With above change, ext4_da_update_reserve_space() will get called
again the next time for some delayed allocated blocks in the same
cluster, but there isn't any reserved blocks left to claim.
The flag 'EXT4_MAP_FROM_CLUSTER' is not a sufficient check to
determine 'how many' clusters we need to claim. This is because
get_implied_cluster_alloc() only checks within already allocated
extents (and not in the page cache). So, I had to add the function
get_reserved_cluster_alloc() to check if there are any delayed
allocated blocks still in the page-cache.
You will also miss the other case (when EXT4_MAP_FROM_CLUSTER is set)
where we had reserved the cluster, but ended up not allocating it.
Since this cluster was not allocated, we need to update the reserved
space without claiming quota (ext4_da_update_reserve_space(inode,
reserved_clusters, 0) --- notice the last argument is '0').

I know its kinda messy, but I think it works. Also, since we don't
keep track of delayed extents, its harder to check in one go.

As for your proposed fix earlier, the correct solution will be to not
convert 'free_clusters' value into blocks.

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index f9e2cd8..2c518f8 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -426,7 +426,7 @@ static int ext4_has_free_clusters(struct ext4_sb_info *sbi,

        if (free_clusters - (nclusters + root_clusters + dirty_clusters) <
                                        EXT4_FREECLUSTERS_WATERMARK) {
-               free_clusters  = EXT4_C2B(sbi,
percpu_counter_sum_positive(fcc));
+               free_clusters  = percpu_counter_sum_positive(fcc);
                dirty_clusters = percpu_counter_sum_positive(dcc);
        }
        /* Check whether we have space after accounting for current


We were incorrectly converting clusters to blocks and so estimating
that we have more free clusters available than what we actually do.
Hence the dmesg errors at write time.

>
> --
> --
> Best Regard
> Robin Dong


Thanks,
Aditya

2012-02-15 03:19:33

by Robin Dong

[permalink] [raw]
Subject: Re: [PATCH] ext4: s_dirtyclusters_counter should tranform to unit of cluster before assigning to "dirty_clusters" in ext4_has_free_clusters()

?? 2012??2??15?? ????5:24??Aditya Kali <[email protected]> д????
> Hi Robin,
>
>
> On Tue, Feb 14, 2012 at 12:01 AM, Robin Dong <[email protected]> wrote:
>>
>> ?? 2012??2??10?? ????10:55??Ted Ts'o <[email protected]> д????
>> > On Thu, Feb 02, 2012 at 03:48:49PM +0800, Robin Dong wrote:
>> >>
>> >> The reason is ext4_has_free_clusters reporting wrong
>> >> result. Actually, the unit of argument "dirty_clusters" is block, so
>> >> we should tranform the sbi->s_dirtyclusters_counter to block , just
>> >> like "free_clusters".
>> >
>> > I've been looking at the the delayed allocation reservation code, and
>> > I think it's a lot more confused that this. Some places we are
>> > playing with fields as if they are clusters, and some times as if they
>> > are blocks, and I've definitely found places where we're not handling
>> > quota correctly with bigalloc (ext4_map_blocks is calling
>> > ext4_da_update_reserve_space() with a value which is clearly in units
>> > of blocks, but we are treating that vale as if it is clusters, etc.)
>> >
>> > So I think the problem is a lot larger than what you pointed out, and
>> > we shouldn't fix it just by throwing in an EXT4_C2B call. If
>> > s_dirtyclusters_counter should be denominated in blocks, then we need
>> > to rename it. And there there's this comment which made me wince:
>> >
>> > * Note that in case of bigalloc, i_reserved_meta_blocks,
>> > * i_reserved_data_blocks, etc. refer to number of clusters.
>> >
>> > Argh, no. If something is denominated in clusters, then it should be
>> > appropriately named as such.
>> >
>> > Actually, I think the problem is a lot bigger than this, and it's a
>> > conceptual one. When we try writing to a block which does not have a
>> > mapping, and bigalloc is enabled, there are three cases:
>> >
>> > 1) This is the first time the cluster has ever been written to; hence,
>> > even though we are dirtying a block, we need to reserve room for the
>> > entire cluster.
>> >
>> > 2) While the block has not been mapped (and so the data on disk is
>> > uninitialized) the cluster has indeed been allocated, so we don't need
>> > to reserve any additional room for the block.
>> >
>> > 3) One or more blocks in the cluster have already been written to via
>> > delayed allocation, but the cluster itself has not been selected
>> > (allocated) yet. But since the space for the entire cluster was
>> > reserved when the first block in the cluster was dirtied (case #1), we
>> > don't need to reserve any more space.
>> >
>> > We aren't handling these cases correctly today, since we don't have
>> > this logic here. These cases don't matter if you're only using
>> > fallocate() and direct I/O, but if you are using buffered writes and
>> > delayed allocation, they obviously matter a huge amount.
>> >
>>
>> > And I don't think we're handling it correctly. Argh...
>> >
>
> I thought we do handle these cases correctly after the 'ext4: attempt
> to fix race in bigalloc code path' change. May be I missed something
> or the code changed somewhere. If there is a testcase that fails, I
> can take a look.
>
>> > - Ted
>>
>> Hi, Ted and Aditya
>>
>> I am trying to fix this "bigalloc quota reservation in
>> delay-allocation" problem recently, and I feel confusion about these
>> code below in ext4_ext_map_blocks():
>>
>> if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
>> unsigned int reserved_clusters;
>> /*
>> * Check how many clusters we had reserved this allocated range
>> */
>> reserved_clusters = get_reserved_cluster_alloc(inode,
>> map->m_lblk, allocated);
>> if (map->m_flags & EXT4_MAP_FROM_CLUSTER) {
>> if (reserved_clusters) {
>> ......
>> ext4_da_update_reserve_space(inode,
>> reserved_clusters, 0);
>> }
>> } else {
>> BUG_ON(allocated_clusters < reserved_clusters);
>> /* We will claim quota for all newly allocated blocks.*/
>> ext4_da_update_reserve_space(inode, allocated_clusters,
>> 1);
>> if (reserved_clusters < allocated_clusters) {
>> struct ext4_inode_info *ei = EXT4_I(inode);
>> int reservation = allocated_clusters -
>> reserved_clusters;
>> dquot_reserve_block(inode,
>> EXT4_C2B(sbi, reservation));
>> spin_lock(&ei->i_block_reservation_lock);
>> ei->i_reserved_data_blocks += reservation;
>> spin_unlock(&ei->i_block_reservation_lock);
>> }
>> }
>> }
>>
>> if we have allocated 3 clusters, why should we add "reservation" back
>> to i_reservd_data_blocks ?
> Because we will end up here again next time when we write blocks
> [10-11] and try to claim space for them. At this time, there will be
> no more reserved blocks left (unless you give them back) and you will
> hit the warning in ext4_da_update_reserve_space():
>
> if (unlikely(used > ei->i_reserved_data_blocks)) {
> ext4_msg(inode->i_sb, KERN_NOTICE, "%s: ino %lu, used %d "
> "with only %d reserved data blocks\n",
> __func__, inode->i_ino, used,
> ei->i_reserved_data_blocks);
> WARN_ON(1);
> used = ei->i_reserved_data_blocks;
> }
>
>
>> (Aditya Kali 's long comment dose not make
>> sense to me) Maybe we don't need to do ext4_da_update_reserve_space
>> when the new block is allocated from "implied_cluster"? Why not just
>> change them to:
>>
>> if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
>> if ( !(map->m_flags & EXT4_MAP_FROM_CLUSTER) )
>> /* We will claim quota for all newly allocated blocks.*/
>> ext4_da_update_reserve_space(inode, allocated_clusters,
>> 1);
>> }
>>
>
> This won't work in all cases, like the one given in the long comment.
> With above change, ext4_da_update_reserve_space() will get called
> again the next time for some delayed allocated blocks in the same
> cluster, but there isn't any reserved blocks left to claim.
> The flag 'EXT4_MAP_FROM_CLUSTER' is not a sufficient check to
> determine 'how many' clusters we need to claim. This is because
> get_implied_cluster_alloc() only checks within already allocated
> extents (and not in the page cache). So, I had to add the function
> get_reserved_cluster_alloc() to check if there are any delayed
> allocated blocks still in the page-cache.

But I found the code in ext4_ext_map_blocks():

if ((sbi->s_cluster_ratio > 1) &&
ext4_find_delalloc_cluster(inode, map->m_lblk, 0))
map->m_flags |= EXT4_MAP_FROM_CLUSTER;

thus, even the delayed allocated block in the page-cache (have not be
allocated) already have been checked. Therefore in my opinion, the
EXT4_MAP_FROM_CLUSTER is a sufficient check.

Follow the case of your long comment :) , process will go like this
in MY CODE ABOVE :

[0-3], [4-7], [8-11]

1. delay-allocation write blocks 10&11
we reserve 1 cluster, the i_reserved_data_blocks is 1 now
2. delay-allocation write blocks 3 to 8
we reserve other 2 clusters, so 3 clusters totally, the
i_reserved_data_blocks is 3 now
3. writeout the blocks 3 to 8
claim all 3 clusters, i_reserved_data_blocks is 0 now
At this moment, we really have allocated 3 clusters, and the
remain 10&11 block would never occupy another cluster (it would
definitely go into the [8-11] cluster), so, why need we reserve one
more cluster quota ?
4. writeout the 10&11 blocks
the 10&11 blocks will be set EXT4_MAP_FROM_CLUSTER flag ( by
get_implied_cluster_alloc as the 8~9 block have been allocated), and
it will not call ext4_da_update_reserve_space any more

As this, no warning, no reserve-and-release ping pong....at least in my imaging

Could you please point out what I missed ?

Thanks

> You will also miss the other case (when EXT4_MAP_FROM_CLUSTER is set)
> where we had reserved the cluster, but ended up not allocating it.
> Since this cluster was not allocated, we need to update the reserved
> space without claiming quota (ext4_da_update_reserve_space(inode,
> reserved_clusters, 0) --- notice the last argument is '0').
>
> I know its kinda messy, but I think it works. Also, since we don't
> keep track of delayed extents, its harder to check in one go.
>
> As for your proposed fix earlier, the correct solution will be to not
> convert 'free_clusters' value into blocks.
>
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index f9e2cd8..2c518f8 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -426,7 +426,7 @@ static int ext4_has_free_clusters(struct ext4_sb_info *sbi,
>
> if (free_clusters - (nclusters + root_clusters + dirty_clusters) <
> EXT4_FREECLUSTERS_WATERMARK) {
> - free_clusters = EXT4_C2B(sbi,
> percpu_counter_sum_positive(fcc));
> + free_clusters = percpu_counter_sum_positive(fcc);
> dirty_clusters = percpu_counter_sum_positive(dcc);
> }
> /* Check whether we have space after accounting for current
>
>
> We were incorrectly converting clusters to blocks and so estimating
> that we have more free clusters available than what we actually do.
> Hence the dmesg errors at write time.
>
>>
>> --
>> --
>> Best Regard
>> Robin Dong
>
>
> Thanks,
> Aditya



--
--
Best Regard
Robin Dong

2012-02-16 20:12:49

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCH] ext4: s_dirtyclusters_counter should tranform to unit of cluster before assigning to "dirty_clusters" in ext4_has_free_clusters()

On Tue, Feb 14, 2012 at 7:19 PM, Robin Dong <[email protected]> wrote:
> But I found the code in ext4_ext_map_blocks():
>
>        if ((sbi->s_cluster_ratio > 1) &&
>            ext4_find_delalloc_cluster(inode, map->m_lblk, 0))
>                map->m_flags |= EXT4_MAP_FROM_CLUSTER;
>
> thus, even the delayed allocated block in the page-cache (have not be
> allocated) already have been checked. Therefore in my opinion, the
> EXT4_MAP_FROM_CLUSTER is a sufficient check.
>
> Follow the case of your long comment :) ,  process will go like this
> in MY CODE ABOVE :
>
> [0-3], [4-7], [8-11]
>
> 1. delay-allocation write blocks 10&11
>    we reserve 1 cluster, the i_reserved_data_blocks is 1 now
> 2. delay-allocation write blocks 3 to 8
>    we reserve other 2 clusters, so 3 clusters totally, the
> i_reserved_data_blocks is 3 now
> 3. writeout the blocks 3 to 8
>    claim all 3 clusters, i_reserved_data_blocks is 0 now
>    At this moment, we really have allocated 3 clusters, and the
> remain 10&11 block would never occupy another cluster (it would
> definitely go into the [8-11] cluster), so, why need we reserve one
> more cluster quota ?
> 4.  writeout the 10&11 blocks
>    the 10&11 blocks will be set EXT4_MAP_FROM_CLUSTER flag ( by
> get_implied_cluster_alloc as the 8~9 block have been allocated), and
> it will not call ext4_da_update_reserve_space any more
>
> As this, no warning, no reserve-and-release ping pong....at least in my imaging
>
I couldn't look into this in detail, but your logic seems totally
fine. But when I wrote this code, I remember being able to reproduce
the quota warnings by running xfstests (some corner case that needed
this code).

> Could you please point out what I missed ?
>
> Thanks
>

--
Aditya