Subject: Re: [PATCH v3] ext4: Fix dead loop in ext4_mb_new_blocks
To: Ye Bin
Cc: Jan Kara, tytso@mit.edu, adilger.kernel@dilger.ca, jack@suse.com, linux-ext4@vger.kernel.org
References: <20200910091252.525346-1-yebin10@huawei.com> <20200910161753.GB17362@quack2.suse.cz>
From: Ritesh Harjani
Message-ID: <0e48b9bb-f839-4b3b-dbce-45755618df97@linux.ibm.com>
Date: Fri, 11 Sep 2020 18:50:22 +0530
In-Reply-To: <20200910161753.GB17362@quack2.suse.cz>

Hello Ye,

Please excuse me if there is something horribly wrong with my email
formatting. I received this laptop only yesterday and am still setting up
a few things.

On 9/10/20 9:47 PM, Jan Kara wrote:
> On Thu 10-09-20 17:12:52, Ye Bin wrote:
>> While testing disk offline/online with fsstress running, we found that
>> the fsstress process stays stuck in the running state:
>>
>> kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114
>> ....
>> kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114
>>
>> ext4_mb_new_blocks
>> repeat:
>>   ext4_mb_discard_preallocations_should_retry(sb, ac, &seq)
>>     freed = ext4_mb_discard_preallocations
>>       ext4_mb_discard_group_preallocations
>>         this_cpu_inc(discard_pa_seq);
>>         ---> freed == 0
>>     seq_retry = ext4_get_discard_pa_seq_sum
>>       for_each_possible_cpu(__cpu)
>>         __seq += per_cpu(discard_pa_seq, __cpu);
>>     if (seq_retry != *seq) {
>>       *seq = seq_retry;
>>       ret = true;
>>     }
>>
>> seq_retry is the sum of discard_pa_seq over every cpu. Even when
>> ext4_mb_discard_group_preallocations() returns zero, it has still
>> incremented discard_pa_seq on this cpu by one, so the condition
>> "seq_retry != *seq" is always met and the caller retries forever.
>>
>> To fix this problem, make ext4_mb_discard_group_preallocations()
>> increment discard_pa_seq only when it actually finds a preallocation to
>> discard.
>>
>> Fixes: 07b5b8e1ac40 ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling")
>> Signed-off-by: Ye Bin
>
> Thanks for the patch. One comment below.
>
>> ---
>>  fs/ext4/mballoc.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index f386fe62727d..fd55264dc3fe 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -4191,7 +4191,6 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>>  	INIT_LIST_HEAD(&list);
>>  repeat:
>>  	ext4_lock_group(sb, group);
>> -	this_cpu_inc(discard_pa_seq);
>>  	list_for_each_entry_safe(pa, tmp,
>>  				&grp->bb_prealloc_list, pa_group_list) {
>>  		spin_lock(&pa->pa_lock);
>> @@ -4233,6 +4232,9 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>>  		goto out;
>>  	}
>>
>> +	/* only increment when we find a preallocation to discard */
>> +	this_cpu_inc(discard_pa_seq);
>> +
>
> This is a good place to increment the counter but I think you also need to
> handle the case:
>
> 	if (free < needed && busy) {
> 		busy = 0;
> 		ext4_unlock_group(sb, group);
> 		cond_resched();
> 		goto repeat;
> 	}
>
> We can unlock the group here after removing some preallocations and thus
> other processes checking discard_pa_seq could miss that we did this. In
> fact I think the code is somewhat buggy here and we should also discard
> extents accumulated on "list" so far before unlocking the group. Ritesh?
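Before replying, a tiny userspace model of the loop described in the commit
message might help to visualize the livelock. This is only a sketch of the
logic from the quoted trace (per-cpu counters modelled as a plain array,
function names shortened for readability), not the actual mballoc code:

```c
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

/* stand-in for the per-cpu discard_pa_seq counters */
static unsigned long long discard_pa_seq[NR_CPUS];

/* stand-in for ext4_get_discard_pa_seq_sum(): sum over all possible cpus */
static unsigned long long get_discard_pa_seq_sum(void)
{
	unsigned long long seq = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		seq += discard_pa_seq[cpu];
	return seq;
}

/*
 * Stand-in for the buggy ext4_mb_discard_group_preallocations(): the counter
 * is bumped unconditionally, even on a call that ends up discarding nothing
 * and therefore returns 0.
 */
static int discard_group_preallocations(int cpu)
{
	discard_pa_seq[cpu]++;	/* bumped before we know whether anything was freed */
	return 0;		/* nothing was actually discarded */
}

/* stand-in for ext4_mb_discard_preallocations_should_retry() */
static bool should_retry(int cpu, unsigned long long *seq)
{
	int freed = discard_group_preallocations(cpu);
	unsigned long long seq_retry;

	if (freed)
		return true;

	seq_retry = get_discard_pa_seq_sum();
	if (seq_retry != *seq) {	/* always true: our own increment changed the sum */
		*seq = seq_retry;
		return true;
	}
	return false;
}

int main(void)
{
	unsigned long long seq = get_discard_pa_seq_sum();
	int loops = 0;

	/* models the "repeat:" loop in ext4_mb_new_blocks(); capped so it terminates */
	while (should_retry(0, &seq) && ++loops < 10)
		;

	printf("retried %d times without ever freeing a block\n", loops);
	return 0;
}
```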

mmm, so even though this code does not discard those buffers before
unlocking, it has still removed them from grp->bb_prealloc_list and added
them to the local list. And since it will at some point get scheduled again
and continue from the repeat: label, functionality-wise this should be ok.
Am I missing anything? Although I agree that discarding at least the current
pa's before unlocking the group may be a good idea, we should also check why
it was done like this in the first place.

I agree with Jan that we should increment discard_pa_seq only once we
actually have something to discard. I should have written a comment here to
explain why we did it this way, but I think the commit msg should have all
the history (since I have a habit of writing long commit msgs ;)

But IIRC, it was done because, if there is a parallel thread discarding all
the preallocations, the current thread may return 0: it checks
list_empty(&grp->bb_prealloc_list) in ext4_mb_discard_group_preallocations()
and returns 0 directly. Incrementing discard_pa_seq at other places may not
help, since we remove the pa nodes from grp->bb_prealloc_list onto a local
list and then start operating on that. So meanwhile one thread may come
along, see that the group list is empty and return 0, while another thread
is still discarding from its local list.

So I guess the main problem is that in the current code we remove the pa
from grp->bb_prealloc_list and add it to a local list, and if someone else
then sees that grp->bb_prealloc_list is empty it will directly return 0.

So, maybe we could do something like this then?

repeat:
	ext4_lock_group(sb, group);
-	this_cpu_inc(discard_pa_seq);
	list_for_each_entry_safe(pa, tmp,
				&grp->bb_prealloc_list, pa_group_list) {
		<...>
+		if (!free)
+			this_cpu_inc(discard_pa_seq); /* do this before the list_del(&pa->pa_group_list) below */
		/* we can trust pa_free ... */
		free += pa->pa_free;
		spin_unlock(&pa->pa_lock);

		list_del(&pa->pa_group_list);
		list_add(&pa->u.pa_tmp_list, &list);
	}

I have some test cases around this to exercise the cases which were failing,
since in a world of parallelism you can't be 100% certain about some corner
case (like the one you just reported). But I don't have my other box right
now where I kept all of those, due to some technical issues. I think I
should be able to get to those by next week; if not, I will anyway set up my
current machine for testing this.

-ritesh
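To spell out the interleaving behind the placement sketched above, here is
the earlier toy model extended with it: the counter is bumped only once a pa
is actually found, and before it is unlinked from the group list. This is a
single-threaded sketch only; the "threads" are simulated by calling the
steps in order, the counter is a single global instead of per-cpu, and all
names are made up for the illustration:

```c
#include <stdbool.h>
#include <stdio.h>

/* the group's preallocation list, modelled as a count of pa entries on it */
static int grp_prealloc_count = 3;

/* single global counter for brevity; the real code sums per-cpu counters */
static unsigned long long discard_pa_seq;

/*
 * "Thread A", the discarder, with the placement sketched above: the counter
 * is bumped only once at least one pa has been found, and before that pa is
 * unlinked, i.e. before the group list can ever look empty to anyone else.
 */
static int thread_a_discard_group(void)
{
	int free = 0;

	while (grp_prealloc_count > 0) {
		if (!free)
			discard_pa_seq++;
		grp_prealloc_count--;	/* models list_del() onto the local list */
		free++;
	}
	return free;	/* the blocks get released from the local list shortly after */
}

/* "Thread B", the ENOSPC waiter running the should_retry() check */
static bool thread_b_should_retry(unsigned long long *seq)
{
	/* B finds the group list already empty, so it frees nothing itself */
	int freed = 0;

	if (freed)
		return true;
	if (discard_pa_seq != *seq) {	/* ...but it does see A's increment */
		*seq = discard_pa_seq;
		return true;
	}
	return false;
}

int main(void)
{
	/* B samples the seq sum before A has discarded anything */
	unsigned long long seq = discard_pa_seq;

	/* A empties the group list onto its local list (under the group lock) */
	thread_a_discard_group();

	/* B retries once, because the seq changed while A was discarding... */
	printf("B retries? %s\n", thread_b_should_retry(&seq) ? "yes" : "no");
	/* ...but does not livelock: nothing changed since, so no further retry */
	printf("B retries again? %s\n", thread_b_should_retry(&seq) ? "yes" : "no");
	return 0;
}
```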