2009-02-06 01:25:28

by Bernd Schubert

Subject: possible endless loop in ext4_mb_discard_group_preallocations()

Hello,

I actually ran into an endless loop in Lustre's ldiskfs, but the code is very
similar, so it might also be present in ext4. For those who are interested in
reading Lustre bug reports, please see here:
https://bugzilla.lustre.org/show_bug.cgi?id=16984

Unfortunately, I have never been able to reproduce this directly on ldiskfs or
ext4. So far the reproducer ('stress') only works on a real Lustre
filesystem.

In ext4_mb_discard_group_preallocations() there is a repeat loop:

repeat:
	ext4_lock_group(sb, group);
	list_for_each_entry_safe(pa, tmp,
				&grp->bb_prealloc_list, pa_group_list) {
		spin_lock(&pa->pa_lock);
		if (atomic_read(&pa->pa_count)) {
			spin_unlock(&pa->pa_lock);
			busy = 1;
			continue;
		}
		...
	}

	/* if we still need more blocks and some PAs were used, try again */
	if (free < needed && busy) {
		busy = 0;
		ext4_unlock_group(sb, group);
		/*
		 * Yield the CPU here so that we don't get soft lockup
		 * in non preempt case.
		 */
		yield();
		goto repeat;
	}

Aren't repeat loops the style of about 30 years ago? Anyway, the I/O pattern
of Lustre somehow manages to never leave this loop. Unfortunately, after
several thousand repeats it then locks up entirely at one of the spin_locks,
which is nearly impossible to debug with Lustre, since Lustre doesn't allow
running lockdep.
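
To make the termination condition concrete, here is a small user-space model
of the retry logic. This is only a sketch, not kernel code: struct model_pa,
struct discard_result, model_discard_group() and the max_passes cap are all
hypothetical names invented for this illustration, and the model ignores
locking entirely. It only shows that with (free < needed && busy) as the sole
retry condition, a PA that stays in use keeps the loop going until something
else stops it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for a preallocation (PA). */
struct model_pa {
	int  users;   /* models pa_count: non-zero means the PA is in use */
	int  blocks;  /* blocks that discarding this PA would release */
	bool freed;   /* set once the PA has been discarded */
};

struct discard_result {
	int  freed;         /* total blocks released */
	int  passes;        /* scans over the PA array actually performed */
	bool gave_up_busy;  /* true if the pass cap was hit with busy PAs left */
};

/*
 * Scan the PA array up to max_passes times, skipping busy PAs.
 * ASSUMPTION: the cap is not in the original code. The original repeats
 * for as long as (free < needed && busy) holds.
 */
struct discard_result
model_discard_group(struct model_pa *pas, size_t n, int needed, int max_passes)
{
	struct discard_result res = { 0, 0, false };

	assert(pas != NULL || n == 0);
	assert(max_passes >= 1);

	while (res.passes < max_passes) {
		bool busy = false;

		res.passes++;
		for (size_t i = 0; i < n; i++) {
			if (pas[i].freed)
				continue;
			if (pas[i].users > 0) {
				/* same test as atomic_read(&pa->pa_count) */
				busy = true;
				continue;
			}
			pas[i].freed = true;
			res.freed += pas[i].blocks;
		}

		/* Exit as the original does: enough freed, or nothing busy. */
		if (res.freed >= needed || !busy) {
			res.gave_up_busy = false;
			return res;
		}
		res.gave_up_busy = true;
	}
	return res;
}
```

With one PA permanently in use and fewer free blocks than needed, every pass
in this model ends in the retry branch, so only the cap ends the scan. A cap
of one pass models a version with no inner retry at all, where any further
attempt is left to the caller.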

The calling chain is (in reverse order, as in traces):

ext4_mb_discard_group_preallocations()
ext4_mb_discard_preallocations()
ext4_mb_new_blocks()

Now, in ext4_mb_new_blocks() there is another repeat loop, which will run
again if ext4_mb_discard_group_preallocations() could free a sufficient number
of blocks. So I simply removed the loop in
ext4_mb_discard_group_preallocations() and all our problems were gone. This
now leaves two questions:

1) Is it safe to remove this second repeat loop? As far as I can see, it
seems so.

2) Wouldn't it make sense to run ext4_mb_discard_group_preallocations() in an
extra thread, which would free unused preallocations after, let's say, 60s? In
our Lustre case it seems to me that at about 60% disk usage, the other 40%
had been taken by preallocations. At least, at about 60% the filesystem started
to run ext4_mb_discard_group_preallocations().
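
To illustrate what I mean in 2), here is a rough user-space sketch of the scan
such a background thread might run periodically. Everything in it is an
assumption made for illustration: struct aged_pa, its last_used_sec field and
reclaim_idle_pas() do not exist in ext4 or ldiskfs, and the real code would
need proper locking and some way to record when a PA was last used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical PA model that also remembers when it was last used. */
struct aged_pa {
	int  users;          /* models pa_count: non-zero means in use */
	int  blocks;         /* blocks released when this PA is discarded */
	long last_used_sec;  /* ASSUMPTION: timestamp of last use, in seconds */
	bool freed;          /* set once the PA has been discarded */
};

/*
 * Discard every PA that is unused and has been idle for at least
 * max_idle_sec seconds. Busy PAs are skipped, never waited for, so a
 * single call always terminates. Returns the number of blocks released.
 */
int reclaim_idle_pas(struct aged_pa *pas, size_t n, long now_sec,
		     long max_idle_sec)
{
	int released = 0;

	assert(pas != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (pas[i].freed || pas[i].users > 0)
			continue;
		if (now_sec - pas[i].last_used_sec < max_idle_sec)
			continue;
		pas[i].freed = true;
		released += pas[i].blocks;
	}
	return released;
}
```

A worker would call something like reclaim_idle_pas(pas, n, now, 60) at
regular intervals, so that idle preallocations are returned before the
allocator runs short of free blocks.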


Thanks,
Bernd


PS: I think removing the 2nd loop in ext4_mb_discard_group_preallocations()
does not solve the spin_lock problem, it just reduces the probability. But
until all the lockdep problems in Lustre are solved, I don't see how to
debug this (at least not without a huge number of printks).


--
Bernd Schubert
Q-Leap Networks GmbH