2009-02-06

by Bernd Schubert

Subject: possible endless loop in ext4_mb_discard_group_preallocations()


I actually ran into an endless loop in Lustre's ldiskfs, but the code is very
similar, so it might also be present in ext4. For those who are interested in
reading the Lustre bug reports, please see here:

Unfortunately I have never been able to reproduce this directly on ldiskfs or
ext4; so far the reproducer ('stress') only works on a real Lustre

In ext4_mb_discard_group_preallocations() there is a repeat loop:

	ext4_lock_group(sb, group);
	list_for_each_entry_safe(pa, tmp,
				&grp->bb_prealloc_list, pa_group_list) {
		if (atomic_read(&pa->pa_count)) {
			busy = 1;
	[...]

	/* if we still need more blocks and some PAs were used, try again */
	if (free < needed && busy) {
		busy = 0;
		ext4_unlock_group(sb, group);
		/*
		 * Yield the CPU here so that we don't get soft lockup
		 * in non preempt case.
		 */
	[...]
		goto repeat;
	}

Isn't a repeat loop like this the style of about 30 years ago? Anyway, the
I/O pattern of Lustre somehow manages never to leave this loop.
Unfortunately, after several thousand repeats it then locks up entirely at one
of the spin_locks, which is nearly impossible to debug with Lustre, since it
doesn't allow running lockdep.
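
To make the retry condition concrete, here is a small user-space model of the
loop. It is only a sketch and not the kernel code: struct pa_model,
discard_pass(), discard_with_retries() and the max_passes cap are hypothetical
names invented for illustration, and all locking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a preallocation (PA): only what the retry
 * logic looks at. Not the real struct ext4_prealloc_space. */
struct pa_model {
	int pa_count;    /* current users; the PA is busy while > 0 */
	int pa_free;     /* blocks released if this PA is discarded */
	bool discarded;  /* already released in an earlier pass */
};

/* One walk over the list: discard every idle PA and report whether
 * at least one busy PA had to be skipped. */
static int discard_pass(struct pa_model *pas, size_t n, bool *busy)
{
	int freed = 0;

	*busy = false;
	for (size_t i = 0; i < n; i++) {
		if (pas[i].discarded)
			continue;
		if (pas[i].pa_count > 0) {
			*busy = true;   /* skipped, may be retried later */
			continue;
		}
		freed += pas[i].pa_free;
		pas[i].discarded = true;
	}
	return freed;
}

/* Models the "goto repeat" logic, plus an assumed cap (max_passes) so
 * that the model itself always terminates. Returns the number of passes. */
static int discard_with_retries(struct pa_model *pas, size_t n, int needed,
				int max_passes, int *total_free)
{
	int free_blocks = 0;
	int passes = 0;
	bool busy;

	assert(max_passes > 0);
	do {
		free_blocks += discard_pass(pas, n, &busy);
		passes++;
	} while (free_blocks < needed && busy && passes < max_passes);

	*total_free = free_blocks;
	return passes;
}
```

If one PA stays busy and the idle ones do not cover "needed", every pass after
the first frees nothing and the model only stops at the cap. Calling it with
max_passes = 1 corresponds to a single pass without any retry.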

The calling chain is (reverse order as in traces)


Now in ext4_mb_new_blocks() there is another repeat loop, which will run if
ext4_mb_discard_group_preallocations() could free a sufficient number of
blocks. So I simply removed the loop in
ext4_mb_discard_group_preallocations() and all our problems were gone. This
now leaves two questions:

1) Is it safe to remove this second repeat loop? As far as I can see, it
seems so.

2) Wouldn't it make sense to run ext4_mb_discard_group_preallocations() in an
extra thread, which would free unused preallocations after, let's say, 60s? In
our Lustre case it seems to me that at about 60% disk usage, the other 40%
had been taken by preallocations. At least, at about 60% the filesystem started
to run ext4_mb_discard_group_preallocations().
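
The idea in question 2 could look roughly like the following user-space
sketch. It is an assumption-laden model, not a patch: struct aged_pa, its
last_used timestamp and discard_aged() are hypothetical, and in the kernel
the periodic call would have to come from something like a kernel thread or
delayed work, with proper locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical PA model with an assumed "last used" timestamp. */
struct aged_pa {
	int pa_count;     /* current users; busy while > 0 */
	int pa_free;      /* blocks released if discarded */
	long last_used;   /* seconds; when the PA was last used */
	bool discarded;
};

/* Discard every PA that is idle and has not been used for at least
 * max_idle seconds. Busy PAs are skipped and never retried here; the
 * next periodic run simply looks at them again. */
static int discard_aged(struct aged_pa *pas, size_t n, long now,
			long max_idle)
{
	int freed = 0;

	assert(max_idle >= 0);
	for (size_t i = 0; i < n; i++) {
		if (pas[i].discarded || pas[i].pa_count > 0)
			continue;
		if (now - pas[i].last_used < max_idle)
			continue;   /* used too recently, keep it */
		freed += pas[i].pa_free;
		pas[i].discarded = true;
	}
	return freed;
}
```

A periodic caller would invoke discard_aged() with max_idle = 60 to match the
60s mentioned above. Because a busy PA is just left for the next run, this
variant needs no retry loop at all.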


PS: I think removing the 2nd loop in ext4_mb_discard_group_preallocations()
does not solve the spin_lock problem; it just reduces the probability. But
until all the lockdep problems in Lustre are solved, I don't see how to
debug this (at least not without a huge number of printks).

Bernd Schubert
