LinuxLists.cc - unionfs_copy_attr

2008-02-01 20:57:07

Subject: unionfs_copy_attr_times oopses

Hi Erez,

Aside from the occasional "unionfs: new lower inode mtime" messages
on directories (which I've got into the habit of ignoring now), the
only problem I'm still suffering with unionfs over tmpfs (not tested
any other fs's below it recently) is oops in unionfs_copy_attr_times.

I believe I'm working with your latest: 2.6.24-rc8-mm1 plus the four
patches you posted to lkml on 26 Jan. But this problem has been around
for a while before that: I'd been hoping to debug it myself, but taken
too long to make too little progress, so now handing over to you.

The oops occurs while doing repeated "make -j20" kernel builds in a
unionfs mount of a tmpfs (though I doubt tmpfs is relevant): most of
my testing was while swapping, but today I find that's irrelevant,
and it should happen much quicker without. SMP kernels (4 cpus),
I haven't tried UP; happens with or without PREEMPT, may just be
coincidence that it happens quicker on the machines with PREEMPT.

Most commonly it's unionfs_copy_attr_times called from unionfs_create,
but that's probably just the most common route in this workload:
I've seen it also when called from unionfs_rename or unionfs_open or
unionfs_unlink. It looks like there's a locking or refcounting bug,
hence a race: the unionfs_inode_info which unionfs_copy_attr_times
is working on gets changed underneath it, so it oopses on NULL
lower_inodes. It took me far too long to realize that ibstart
(and ibend) are -1 when it oopses, yet the function returns
immediately if ibstart is negative on entry. I've not a clue
what it is that's changing it.

What I did make progress with yesterday was a workaround patch, which
additionally makes the problem more manifest: by WARNing in much more
common cases which were invisible (didn't go so far as to oops) before.
The oops had typically taken 12 hours to happen, but I'm getting like
one warning per hour (varies from machine to machine) with the patch
while swapping, and now much more frequently giving it more memory.

I've two patches. The first isn't interesting, so I just attached
it. It moves unionfs_copy_attr_times (and unionfs_copy_attr_all)
from inline in fanout.h to out-of-line in copyup.c: the function's
too big and too widely used to be suitable for inlining, this reduces
kernel text by about 6k. But since making that patch, I've seen that
copyup.c is really about copying up the data, and you may well prefer
to move the functions to subr.c or elsewhere.

The patch to go on top of that is appended below: makes unionfs_copy
_attr_times more paranoid about the fields it's accessing; but this
can only be a temporary workaround - so long as there's such a race,
it could be accessing freed/reused memory with unpredictable results.

Most of the warnings I see with this patch are the first, on bend < 0:
the old "bindex <= ibend(upper)" condition would often have terminated
the loop in that case, without noticing the discrepancy that ibend
had gone negative. Occasionally I'm seeing the second warning, on
lower_inodes, which would correspond to the oops: here's a trace
(cutting out the ? lines, just stale addresses left on the stack):

WARNING: at fs/unionfs/copyup.c:906 unionfs_copy_attr_times+0x5f/0xcc()
Modules linked in: acpi_cpufreq snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device thermal button processor
Pid: 13791, comm: cc1 Not tainted 2.6.24-rc8-mm1 #19
[<c0124ce6>] warn_on_slowpath+0x3e/0x51
[<c01d5f63>] unionfs_copy_attr_times+0x5f/0xcc
[<c01d169d>] unionfs_create+0x160/0x1e7
[<c0173b8f>] vfs_create+0x62/0xa8
[<c0173d85>] __open_namei_create+0x44/0x8e
[<c0173f55>] open_pathname+0x14d/0x55c
[<c016b7d9>] do_sys_open+0x41/0xbd
[<c016b868>] sys_open+0x13/0x15
[<c01040ee>] sysenter_past_esp+0x5f/0x85

--- a/fs/unionfs/copyup.c 2008-01-29 16:59:13.000000000 +0000
+++ b/fs/unionfs/copyup.c 2008-01-31 12:24:58.000000000 +0000
@@ -887,13 +887,25 @@ void unionfs_postcopyup_release(struct d
/* copy a/m/ctime from the lower branch with the newest times */
void unionfs_copy_attr_times(struct inode *upper)
{
- int bindex;
+ int bindex, bend;
struct inode *lower;
+ struct inode **lower_inodes;

- if (!upper || ibstart(upper) < 0)
+ if (!upper)
return;
- for (bindex = ibstart(upper); bindex <= ibend(upper); bindex++) {
- lower = unionfs_lower_inode_idx(upper, bindex);
+ bindex = ibstart(upper);
+ if (bindex < 0)
+ return;
+ while (1) {
+ bend = ibend(upper);
+ if (WARN_ON(bend < 0))
+ break;
+ if (bindex > bend)
+ break;
+ lower_inodes = UNIONFS_I(upper)->lower_inodes;
+ if (WARN_ON(!lower_inodes))
+ break;
+ lower = lower_inodes[bindex++];
if (!lower)
continue; /* not all lower dir objects may exist */
if (unlikely(timespec_compare(&upper->i_mtime,

Attachments:

copyup1.patch (3.72 kB)
out-of-lining

2008-02-01 22:35:25

by Erez Zadok

[permalink] [raw]

Subject: Re: unionfs_copy_attr_times oopses

2008-02-17 03:07:28

by Erez Zadok

[permalink] [raw]

Subject: Re: unionfs_copy_attr_times oopses

In message <[email protected]>, Hugh Dickins writes:

> Hi Erez,
>
> Aside from the occasional "unionfs: new lower inode mtime" messages
> on directories (which I've got into the habit of ignoring now), the
> only problem I'm still suffering with unionfs over tmpfs (not tested
> any other fs's below it recently) is oops in unionfs_copy_attr_times.
>
> I believe I'm working with your latest: 2.6.24-rc8-mm1 plus the four
> patches you posted to lkml on 26 Jan. But this problem has been around
> for a while before that: I'd been hoping to debug it myself, but taken
> too long to make too little progress, so now handing over to you.
>
> The oops occurs while doing repeated "make -j20" kernel builds in a
> unionfs mount of a tmpfs (though I doubt tmpfs is relevant): most of
> my testing was while swapping, but today I find that's irrelevant,
> and it should happen much quicker without. SMP kernels (4 cpus),
> I haven't tried UP; happens with or without PREEMPT, may just be
> coincidence that it happens quicker on the machines with PREEMPT.
>
> Most commonly it's unionfs_copy_attr_times called from unionfs_create,
> but that's probably just the most common route in this workload:
> I've seen it also when called from unionfs_rename or unionfs_open or
> unionfs_unlink. It looks like there's a locking or refcounting bug,
> hence a race: the unionfs_inode_info which unionfs_copy_attr_times
> is working on gets changed underneath it, so it oopses on NULL
> lower_inodes.
[...]

Hugh,

Check out my latest set of patches (which correspond to release 2.2.4 of
Unionfs). Thanks to your info and the patch, I was able to trigger several
races more frequently, and fix them. I've tested my code with make -j N
(for N=4 and N=20), on a 4 cpu machine a well as a 2 cpu machine (w/
different amounts of memory and CPU speeds, also 32-bit vs 64-bit); I ran a
kernel compile for ~10-12 hours. With the patches I just posted, I wasn't
able to trigger any of the WARN_ON's in unionfs_copy_attr_times. I also
tried it while flushing caches via /proc, and/or performing branch-mgmt
commands in unionfs.

Give it a good shake and let me know what you find.

Thanks,
Erez.

2008-02-20 00:14:31

by Hugh Dickins

[permalink] [raw]

Subject: Re: unionfs_copy_attr_times oopses

On Sat, 16 Feb 2008, Erez Zadok wrote:
>
> Check out my latest set of patches (which correspond to release 2.2.4 of
> Unionfs). Thanks to your info and the patch, I was able to trigger several
> races more frequently, and fix them. I've tested my code with make -j N
> (for N=4 and N=20), on a 4 cpu machine a well as a 2 cpu machine (w/
> different amounts of memory and CPU speeds, also 32-bit vs 64-bit); I ran a
> kernel compile for ~10-12 hours. With the patches I just posted, I wasn't
> able to trigger any of the WARN_ON's in unionfs_copy_attr_times. I also
> tried it while flushing caches via /proc, and/or performing branch-mgmt
> commands in unionfs.
>
> Give it a good shake and let me know what you find.

I've now shaken it for 26 hours each on three machines, while running
a few other tests (including LTP, with CONFIG_LOCKDEP=y) on another;
used 2.6.25-rc2-mm1 plus your patches plus undo dependence on BROKEN.

I took the precaution of running with my WARN_ONs
reinstated in unionfs_copy_attr_times, i.e.

int bindex, bend;
struct inode *lower;
struct inode **lower_inodes;

if (!upper)
return;
bindex = ibstart(upper);
if (bindex < 0)
return;
while (1) {
bend = ibend(upper);
if (WARN_ON(bend < 0))
break;
if (bindex > bend)
break;
lower_inodes = UNIONFS_I(upper)->lower_inodes;
if (WARN_ON(!lower_inodes))
break;
lower = lower_inodes[bindex++];
if (!lower)
continue; /* not all lower dir objects may exist */
...

Things look much better with this version than they did, but in that
26 hours one of the machines did issue one of those warnings, the
WARN_ON(bend < 0): that does indicate still some raciness,
doesn't it, since bindex >= 0 before entering the loop?
Here's the accompanying stacktrace:

------------[ cut here ]------------
WARNING: at fs/unionfs/subr.c:258 unionfs_copy_attr_times+0x70/0x11c()
Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device acpi_cpufreq processor button
Pid: 10438, comm: pdflush Not tainted 2.6.25-rc2-mm1 #3

Call Trace:
[<ffffffff8023812f>] warn_on_slowpath+0x62/0x8d
[<ffffffff80509dd9>] ? preempt_schedule_irq+0x63/0x7c
[<ffffffff8032a7a4>] ? unionfs_copy_attr_times+0x85/0x11c
[<ffffffff8032a78f>] unionfs_copy_attr_times+0x70/0x11c
[<ffffffff80273369>] ? __writepage+0x0/0x45
[<ffffffff803379c4>] unionfs_writepages+0x5c/0x69
[<ffffffff80273422>] do_writepages+0x36/0x51
[<ffffffff802c13a8>] __sync_single_inode+0x71/0x1a1
[<ffffffff802c1609>] __writeback_single_inode+0x131/0x13e
[<ffffffff80509d5c>] ? preempt_schedule+0x5d/0x77
[<ffffffff802b67bf>] ? iput+0x55/0x8f
[<ffffffff802c176d>] generic_sync_sb_inodes+0x157/0x257
[<ffffffff802c18a8>] sync_sb_inodes+0x3b/0x3d
[<ffffffff802c1917>] writeback_inodes+0x6d/0xc7
[<ffffffff80272c63>] ? background_writeout+0x35/0xe9
[<ffffffff80272cd4>] background_writeout+0xa6/0xe9
[<ffffffff80273cf0>] __pdflush+0x148/0x1f5
[<ffffffff80273d9d>] ? pdflush+0x0/0x50
[<ffffffff80273deb>] pdflush+0x4e/0x50
[<ffffffff80272c2e>] ? background_writeout+0x0/0xe9
[<ffffffff8024d2ae>] kthread+0x56/0x86
[<ffffffff8022eb51>] ? schedule_tail+0x36/0x72
[<ffffffff8020c449>] child_rip+0xb/0x12
[<ffffffff8024d258>] ? kthread+0x0/0x86
[<ffffffff8020c43e>] ? child_rip+0x0/0x12

---[ end trace 1b6402a1105c37cb ]---

Another of the machines, with CONFIG_UNION_FS_DEBUG=y,
occasionally issued one of your debug warnings, six in all:

PC:fs/unionfs/rename.c:unionfs_rename:536
Ci7: inode/linode=f08e9bec:00000000 bindex=0 istart/end=0:0
...
PC:fs/unionfs/dentry.c:unionfs_d_revalidate:481
CI8: bindex=0 mtime/lmtime=1203370546.467901272/1203370546.467901272 ctime/lctime=1203370546.467901272/1203370546.467901272
...
PC:fs/unionfs/rename.c:unionfs_rename:536
Ci7: inode/linode=efabb31c:00000000 bindex=0 istart/end=0:0
...
PC:fs/unionfs/rename.c:unionfs_rename:536
Ci7: inode/linode=ee06dbec:00000000 bindex=0 istart/end=0:0
...
PC:fs/unionfs/rename.c:unionfs_rename:536
Ci4: no lower_inodes e1442494
...
PC:fs/unionfs/rename.c:unionfs_rename:536
Ci7: inode/linode=d71731a4:00000000 bindex=0 istart/end=0:0

The third (like the first, without UNION_FS_DEBUG) had a clean run.

The LTP run generally went fine, but one anomaly I happened to
notice this time, probably been there for months: when it's testing
swapon on unionfs over tmpfs, the kernel's "Adding 32k swap" message
said "across:4k" each time, which is odd since it's hard to fit 32k
into 4k - unless you're compressing, which is someone else's project!

I think that's because bmap() is not very well defined, and actually,
you need to be sure to return sector 0 whenever there's an error
(including the lower level not supporting bmap, as in the tmpfs case),
not -EINVAL. I think that's how generic_block_bmap tries to play it.

Previously I'd felt indulgent about you "supporting" swapon of a
unionfs file even when the lower level didn't support it; but now
I think it's probably dangerously wrong (though limited to root).
I didn't dare try what happens when you actually get to swapping to
such a beast; and in fact, I didn't even try to wrap my head around
even your "good" bmap support in unionfs - it's not obvious to me
that it makes any sense at all, but I've hardly tried to think.

The last thing I noticed, there seems to be a regression (since when?
long time since I last tried it) with fsx on unionfs while swapping:
very soon collapses with fsx reporting inconsistency. I've not had
time to investigate (nor shall this week), but after finding it with
tmpfs, did verify that it's not a tmpfs issue, same happens with ext2
as the lower filesystem; but doesn't happen without memory pressure.
Probably a truncation issue like last time.

Hugh