While running a stress test on a machine for more than 72 hours, the
following error message was observed.
0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c00000007ce2f7f0]
pc: c000000000060d90: .dup_fd+0x240/0x39c
lr: c000000000060d6c: .dup_fd+0x21c/0x39c
sp: c00000007ce2fa70
msr: 800000000000b032
dar: ffffffff00000028
dsisr: 40000000
current = 0xc000000074950980
paca = 0xc000000000454500
pid = 27330, comm = bash
0:mon> t
[c00000007ce2fa70] c000000000060d28 .dup_fd+0x1d8/0x39c (unreliable)
[c00000007ce2fb30] c000000000060f48 .copy_files+0x5c/0x88
[c00000007ce2fbd0] c000000000061f5c .copy_process+0x574/0x1520
[c00000007ce2fcd0] c000000000062f88 .do_fork+0x80/0x1c4
[c00000007ce2fdc0] c000000000011790 .sys_clone+0x5c/0x74
[c00000007ce2fe30] c000000000008950 .ppc_clone+0x8/0xc
--- Exception: c00 (System Call) at 000000000fee9c60
SP (fcb2e770) is in userspace
---------------------------
The problem is caused by a race window: when the if (expand) block in
dup_fd() is executed, the unlocking of oldf->file_lock opens a window
during which the fdtable in oldf can be modified, so the actual number
of open files in oldf may no longer match the open_files variable.
This is the debug patch to fix the problem:

--- kernel/fork.c.orig	2006-11-10 14:42:02.000000000 +0530
+++ kernel/fork.c	2006-11-10 14:42:30.000000000 +0530
@@ -687,6 +687,7 @@ static struct files_struct *dup_fd(struc
 	 * the latest pointer.
 	 */
 	spin_lock(&oldf->file_lock);
+	open_files = count_open_files(old_fdt);
 	old_fdt = files_fdtable(oldf);
 }

Please let me know your opinion. The patch was generated against
2.6.19-rc1.
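For context, the region of dup_fd() in question looks roughly like this
(paraphrased from 2.6.19-rc1 kernel/fork.c; quoting from memory, so
details may differ slightly):

	spin_lock(&oldf->file_lock);
	old_fdt = files_fdtable(oldf);
	open_files = count_open_files(old_fdt);
	...
	if (expand) {
		spin_unlock(&oldf->file_lock);	/* <-- race window opens */
		spin_lock(&newf->file_lock);
		error = expand_files(newf, open_files - 1);
		spin_unlock(&newf->file_lock);
		...
		/*
		 * Reacquire the oldf lock and a pointer to its fd table
		 * who knows it may have a new bigger fd table. We need
		 * the latest pointer.
		 */
		spin_lock(&oldf->file_lock);	/* <-- race window closes */
		old_fdt = files_fdtable(oldf);
	}

open_files is computed before the window and never refreshed afterwards,
which is the mismatch described above; the patch recounts it after the
lock is retaken.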
On Fri, 10 Nov 2006 15:02:01 +0530 Sharyathi Nagesh wrote:
> While running a stress test on a machine for more than 72 hours, the
> following error message was observed.
Was your stress test using threads (or other nonstandard clone() calls)?
> 0:mon> e
> cpu 0x0: Vector: 300 (Data Access) at [c00000007ce2f7f0]
> pc: c000000000060d90: .dup_fd+0x240/0x39c
> lr: c000000000060d6c: .dup_fd+0x21c/0x39c
> sp: c00000007ce2fa70
> msr: 800000000000b032
> dar: ffffffff00000028
> dsisr: 40000000
> current = 0xc000000074950980
> paca = 0xc000000000454500
> pid = 27330, comm = bash
>
> 0:mon> t
> [c00000007ce2fa70] c000000000060d28 .dup_fd+0x1d8/0x39c (unreliable)
> [c00000007ce2fb30] c000000000060f48 .copy_files+0x5c/0x88
> [c00000007ce2fbd0] c000000000061f5c .copy_process+0x574/0x1520
> [c00000007ce2fcd0] c000000000062f88 .do_fork+0x80/0x1c4
> [c00000007ce2fdc0] c000000000011790 .sys_clone+0x5c/0x74
> [c00000007ce2fe30] c000000000008950 .ppc_clone+0x8/0xc
> --- Exception: c00 (System Call) at 000000000fee9c60
> SP (fcb2e770) is in userspace
Did you find the exact instruction which faulted, and the place in
dup_fd() C code which it corresponds to? Was the oops due to a NULL
dereference or an invalid pointer?
> The problem is caused by a race window: when the if (expand) block in
> dup_fd() is executed, the unlocking of oldf->file_lock opens a window
> during which the fdtable in oldf can be modified, so the actual number
> of open files in oldf may no longer match the open_files variable.
> This is the debug patch to fix the problem:
> --- kernel/fork.c.orig 2006-11-10 14:42:02.000000000 +0530
> +++ kernel/fork.c 2006-11-10 14:42:30.000000000 +0530
> @@ -687,6 +687,7 @@ static struct files_struct *dup_fd(struc
> * the latest pointer.
> */
> spin_lock(&oldf->file_lock);
> + open_files = count_open_files(old_fdt);
> old_fdt = files_fdtable(oldf);
> }
I don't see immediately how the wrong open_files value could cause an
oops in this function. If the stale open_files value was too big (some
files were closed while we dropped the lock), it should still be safe
(AFAIK file tables can only grow, but never shrink, so access to entries
which were valid before should not access invalid memory). If the stale
open_files value was too small (some more files were opened), the copy
would miss some files, which should be OK (except that memcpy() calls
which copy fd_sets will copy bits for some of those missed files which
happened to be in the last word - this would cause some fd's to be
permanently busy, and potentially could cause problems later, but an
oops in dup_fd() due to this problem does not look likely). Seeing the
exact place in the code which oopsed would be interesting.
However, the new code does not look safe in all cases. If some other
task has opened more files while dup_fd() released oldf->file_lock, the
new code will update open_files to the new larger value. But newf was
allocated with the old smaller value of open_files, therefore subsequent
accesses to newf may try to write into unallocated memory.
Hmm, and actually the patch seems to be wrong for yet another reason -
the old_fdt pointer which you pass to count_open_files() may be stale
(the comment above even warns about that, and the following line updates
old_fdt). The count_open_files() call must be after the "old_fdt =
files_fdtable(oldf)" line.
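In other words, if open_files is to be recomputed at all, the hunk would
at minimum need to read like this (just a sketch, and note that the
allocation-size problem described above would still remain):

	spin_lock(&oldf->file_lock);
	old_fdt = files_fdtable(oldf);		/* refresh the pointer first */
	open_files = count_open_files(old_fdt);	/* only then recount */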
On Tue, 2006-11-14 at 18:16 +0300, Sergey Vlasov wrote:
> On Fri, 10 Nov 2006 15:02:01 +0530 Sharyathi Nagesh wrote:
> > --- kernel/fork.c.orig 2006-11-10 14:42:02.000000000 +0530
> > +++ kernel/fork.c 2006-11-10 14:42:30.000000000 +0530
> > @@ -687,6 +687,7 @@ static struct files_struct *dup_fd(struc
> > * the latest pointer.
> > */
> > spin_lock(&oldf->file_lock);
> > + open_files = count_open_files(old_fdt);
> > old_fdt = files_fdtable(oldf);
> > }
Looks like your analysis of the proposed patch's side-effects agrees
with mine (call it independent verification, if you will :) ); I was
expressing the very same concerns about it introducing a race condition
on the mm-commits@ and stable@ lists. The only concern is that, although
this patch is not correct, it does fix "something" -- it would be good
to identify what exactly that "something" is.
> [...] If the stale
> open_files value was too small (some more files were opened), the copy
> would miss some files, which should be OK (except that memcpy() calls
> which copy fd_sets will copy bits for some of those missed files which
> happened to be in the last word - this would cause some fd's to be
> permanently busy, and potentially could cause problems later [...]
Nope, this logic also looks fine. The open_files value that
count_open_files() returns will always be a multiple of BITS_PER_LONG,
so no extraneous bits will ever be copied. It's a tad confusing since
count_open_files() does something a bit different than what its name
suggests.
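For reference, count_open_files() reads roughly as follows (quoted from
memory of the 2.6.18-era fs code, so double-check against your tree):

static int count_open_files(struct fdtable *fdt)
{
	int size = fdt->max_fdset;
	int i;

	/* Find the last open fd */
	for (i = size / (8 * sizeof(long)); i > 0; ) {
		if (fdt->open_fds->fds_bits[--i])
			break;
	}
	i = (i + 1) * 8 * sizeof(long);
	return i;
}

The return value is rounded up to a whole word of the open_fds bitmap --
hence always a multiple of BITS_PER_LONG.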
Speaking of: does a maintainer currently exist for this part of the
kernel? Who's familiar with all the related code? :)
Also, here's some extra information from the other email thread
regarding this patch that might aid in debugging. I'm merely
copy-pasting it here for reference:
0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c00000007ce2f7f0]
pc: c000000000060d90: .dup_fd+0x240/0x39c
lr: c000000000060d6c: .dup_fd+0x21c/0x39c
sp: c00000007ce2fa70
msr: 800000000000b032
dar: ffffffff00000028
dsisr: 40000000
current = 0xc000000074950980
paca = 0xc000000000454500
pid = 27330, comm = bash
0:mon> t
[c00000007ce2fa70] c000000000060d28 .dup_fd+0x1d8/0x39c (unreliable)
[c00000007ce2fb30] c000000000060f48 .copy_files+0x5c/0x88
[c00000007ce2fbd0] c000000000061f5c .copy_process+0x574/0x1520
[c00000007ce2fcd0] c000000000062f88 .do_fork+0x80/0x1c4
[c00000007ce2fdc0] c000000000011790 .sys_clone+0x5c/0x74
[c00000007ce2fe30] c000000000008950 .ppc_clone+0x8/0xc
The PC translates to:
for (i = open_files; i != 0; i--) {
struct file *f = *old_fds++;
if (f) {
get_file(f); <-- Data access error
} else {
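If memory serves, get_file() is just a macro wrapping an atomic
increment of the file's reference count (include/linux/fs.h):

	#define get_file(x)	atomic_inc(&(x)->f_count)

so the fault happens while dereferencing f to bump f->f_count.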
And more info still:
0:mon> r
R00 = ffffffff00000028 R16 = 00000000100e0000
R01 = c00000007ce2fa70 R17 = 000000000fff1d38
R02 = c00000000056cd20 R18 = 0000000000000000
R03 = c000000029f40a58 R19 = 0000000001200011
R04 = c000000029f442d8 R20 = c0000000a544a2a0
R05 = 0000000000000001 R21 = 0000000000000000
R06 = 0000000000000024 R22 = 0000000000000100
R07 = 0000001000000000 R23 = c00000008635f5e8
R08 = 0000000000000000 R24 = c0000000919c5448
R09 = 0000000000000024 R25 = 0000000000000100
R10 = 00000000000000dc R26 = c000000086359c30
R11 = ffffffff00000000 R27 = c000000089e5e230
R12 = 0000000006bbd9e9 R28 = c00000000c8d3d80
R13 = c000000000454500 R29 = 0000000000000020
R14 = c00000007ce2fea0 R30 = c000000000491fc8
R15 = 00000000fcb2e770 R31 = c0000000b8369b08
pc = c000000000060d90 .dup_fd+0x240/0x39c
lr = c000000000060d6c .dup_fd+0x21c/0x39c
msr = 800000000000b032 cr = 24242428
ctr = 0000000000000000 xer = 0000000000000000 trap = 300
dar = ffffffff00000028 dsisr = 40000000
-----------------------
0:mon> di c000000000060d90 <==PC
c000000000060d90 7d200028 lwarx r9,r0,r0
c000000000060d94 31290001 addic r9,r9,1
c000000000060d98 7d20012d stwcx. r9,r0,r0
c000000000060d9c 40a2fff4 bne c000000000060d90 # .dup_fd+0x240/0x39c
c000000000060da0 48000014 b c000000000060db4 # .dup_fd+0x264/0x39c
c000000000060da4 e93b0018 ld r9,24(r27)
c000000000060da8 7c08482a ldx r0,r8,r9
c000000000060dac 7c003878 andc r0,r0,r7
c000000000060db0 7c08492a stdx r0,r8,r9
c000000000060db4 3b180008 addi r24,r24,8
c000000000060db8 7c0006ac eieio
c000000000060dbc 380affff addi r0,r10,-1
c000000000060dc0 f97c0000 std r11,0(r28)
c000000000060dc4 38c60001 addi r6,r6,1
c000000000060dc8 3b9c0008 addi r28,r28,8
c000000000060dcc 7c0a07b4 extsw r10,r0
Thanks.
-- Vadim Lobanov
On Tue, Nov 14, 2006 at 10:49:14AM -0800, Vadim Lobanov wrote:
> On Tue, 2006-11-14 at 18:16 +0300, Sergey Vlasov wrote:
> > On Fri, 10 Nov 2006 15:02:01 +0530 Sharyathi Nagesh wrote:
> > > --- kernel/fork.c.orig 2006-11-10 14:42:02.000000000 +0530
> > > +++ kernel/fork.c 2006-11-10 14:42:30.000000000 +0530
> > > @@ -687,6 +687,7 @@ static struct files_struct *dup_fd(struc
> > > * the latest pointer.
> > > */
> > > spin_lock(&oldf->file_lock);
> > > + open_files = count_open_files(old_fdt);
> > > old_fdt = files_fdtable(oldf);
> > > }
>
> Looks like your analysis of the proposed patch's side-effects agrees
> with mine (call it independent verification, if you will :) ); I was
> expressing the very same concerns about it introducing a race condition
> on the mm-commits@ and stable@ lists. The only concern is that, although
> this patch is not correct, it does fix "something" -- it would be good
> to identify what exactly that "something" is.
Yes, very interesting (although if the problem appeared only after 72
hours of testing, it is hard to be sure that the bug is really fixed).
> [...] The open_files value that
> count_open_files() returns will always be a multiple of BITS_PER_LONG,
> so no extraneous bits will ever be copied. It's a tad confusing since
> count_open_files() does something a bit different than what its name
> suggests.
Yes, then the logic looks fine. (The comment in count_open_files()
says "Find the last open fd", which is _almost_ what it does.)
There is also some unused code and a slightly incorrect comment in
dup_fd():
size = old_fdt->max_fdset;
... here "size" is not used ....
/* if the old fdset gets grown now, we'll only copy up to "size" fds */
... here "size" is not used either ....
size = (new_fdt->max_fds - open_files) * sizeof(struct file *);
The result of the first assignment to "size" is not used anywhere, even
though it is mentioned in the comment. However, the intent to keep the
old size of the fdset is noted again.
> Also, here's some extra information from the other email thread
> regarding this patch that might aid in debugging. I'm merely
> copy-pasting it here for reference:
Thanks - I cannot find this discussion anywhere in the archives...
> 0:mon> e
> cpu 0x0: Vector: 300 (Data Access) at [c00000007ce2f7f0]
> pc: c000000000060d90: .dup_fd+0x240/0x39c
> lr: c000000000060d6c: .dup_fd+0x21c/0x39c
> sp: c00000007ce2fa70
> msr: 800000000000b032
> dar: ffffffff00000028
> dsisr: 40000000
> current = 0xc000000074950980
> paca = 0xc000000000454500
> pid = 27330, comm = bash
>
> 0:mon> t
> [c00000007ce2fa70] c000000000060d28 .dup_fd+0x1d8/0x39c (unreliable)
> [c00000007ce2fb30] c000000000060f48 .copy_files+0x5c/0x88
> [c00000007ce2fbd0] c000000000061f5c .copy_process+0x574/0x1520
> [c00000007ce2fcd0] c000000000062f88 .do_fork+0x80/0x1c4
> [c00000007ce2fdc0] c000000000011790 .sys_clone+0x5c/0x74
> [c00000007ce2fe30] c000000000008950 .ppc_clone+0x8/0xc
>
> The PC translates to:
> for (i = open_files; i != 0; i--) {
> struct file *f = *old_fds++;
> if (f) {
> get_file(f); <-- Data access error
So we probably got a bogus "struct file" pointer...
> } else {
>
> And more info still:
> 0:mon> r
> R00 = ffffffff00000028 R16 = 00000000100e0000
> R01 = c00000007ce2fa70 R17 = 000000000fff1d38
> R02 = c00000000056cd20 R18 = 0000000000000000
> R03 = c000000029f40a58 R19 = 0000000001200011
> R04 = c000000029f442d8 R20 = c0000000a544a2a0
> R05 = 0000000000000001 R21 = 0000000000000000
> R06 = 0000000000000024 R22 = 0000000000000100
> R07 = 0000001000000000 R23 = c00000008635f5e8
> R08 = 0000000000000000 R24 = c0000000919c5448
> R09 = 0000000000000024 R25 = 0000000000000100
> R10 = 00000000000000dc R26 = c000000086359c30
> R11 = ffffffff00000000 R27 = c000000089e5e230
> R12 = 0000000006bbd9e9 R28 = c00000000c8d3d80
> R13 = c000000000454500 R29 = 0000000000000020
> R14 = c00000007ce2fea0 R30 = c000000000491fc8
> R15 = 00000000fcb2e770 R31 = c0000000b8369b08
> pc = c000000000060d90 .dup_fd+0x240/0x39c
> lr = c000000000060d6c .dup_fd+0x21c/0x39c
> msr = 800000000000b032 cr = 24242428
> ctr = 0000000000000000 xer = 0000000000000000 trap = 300
> dar = ffffffff00000028 dsisr = 40000000
> -----------------------
> 0:mon> di c000000000060d90 <==PC
> c000000000060d90 7d200028 lwarx r9,r0,r0
> c000000000060d94 31290001 addic r9,r9,1
> c000000000060d98 7d20012d stwcx. r9,r0,r0
> c000000000060d9c 40a2fff4 bne c000000000060d90 # .dup_fd+0x240/0x39c
From what little I know about PowerPC, this looks like an atomic
increment of the memory word pointed to by r0, which contains
0xffffffff00000028 - definitely looks like a bogus address. The
offset of file->f_count should be 0x28 on a 64-bit architecture, so
apparently we got f == 0xffffffff00000000 from *old_fds. Something
scribbled over that memory?
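Just to sanity-check that arithmetic (a throwaway userspace snippet, not
kernel code; the 0x28 offset is the assumption stated above):

	#include <stdio.h>

	int main(void)
	{
		unsigned long f = 0xffffffff00000000UL; /* apparent *old_fds value */
		unsigned long off = 0x28; /* assumed offsetof(struct file, f_count) */

		/* get_file(f) would atomically increment *(f + 0x28) */
		printf("0x%lx\n", f + off); /* prints 0xffffffff00000028 == DAR */
		return 0;
	}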
> c000000000060da0 48000014 b c000000000060db4 # .dup_fd+0x264/0x39c
> c000000000060da4 e93b0018 ld r9,24(r27)
> c000000000060da8 7c08482a ldx r0,r8,r9
> c000000000060dac 7c003878 andc r0,r0,r7
> c000000000060db0 7c08492a stdx r0,r8,r9
> c000000000060db4 3b180008 addi r24,r24,8
> c000000000060db8 7c0006ac eieio
> c000000000060dbc 380affff addi r0,r10,-1
> c000000000060dc0 f97c0000 std r11,0(r28)
> c000000000060dc4 38c60001 addi r6,r6,1
> c000000000060dc8 3b9c0008 addi r28,r28,8
> c000000000060dcc 7c0a07b4 extsw r10,r0
On Tue, 2006-11-14 at 23:42 +0300, Sergey Vlasov wrote:
> Yes, very interesting (although if the problem appeared only after 72
> hours of testing, it is hard to be sure that the bug is really fixed).
Yep. Could be random memory corruption from some other piece of the
code.
> > [...] The open_files value that
> > count_open_files() returns will always be a multiple of BITS_PER_LONG,
> > so no extraneous bits will ever be copied. It's a tad confusing since
> > count_open_files() does something a bit different than what its name
> > suggests.
>
> Yes, then the logic looks fine. (The comment in count_open_files()
> says "Find the last open fd", which is _almost_ what it does.)
It's always the details that getcha. :) There are other implicit tricks
that the code pulls, such as implicit requirements for minimum size
granularity of certain data structures. It basically comes down to
trying to read and understand all the surrounding code.
> There is also some unused code and a slightly incorrect comment in
> dup_fd():
>
> size = old_fdt->max_fdset;
>
> ... here "size" is not used ....
>
> /* if the old fdset gets grown now, we'll only copy up to "size" fds */
>
> ... here "size" is not used either ....
>
> size = (new_fdt->max_fds - open_files) * sizeof(struct file *);
>
> The result of the first assignment to "size" is not used anywhere, even
> though it is mentioned in the comment. However, the intent to keep the
> old size of the fdset is noted again.
I read that comment in my mind like this: s/"size"/"open_files"/g. The
wording can definitely be improved, though. Any takers?
I also sent a patch to Andrew a while ago to clean up that unused
assignment to "size", as part of a bigger fdtable-rework patchset. That
code is currently getting its testing in -mm.
> > 0:mon> e
> > cpu 0x0: Vector: 300 (Data Access) at [c00000007ce2f7f0]
> > pc: c000000000060d90: .dup_fd+0x240/0x39c
> > lr: c000000000060d6c: .dup_fd+0x21c/0x39c
> > sp: c00000007ce2fa70
> > msr: 800000000000b032
> > dar: ffffffff00000028
> > dsisr: 40000000
> > current = 0xc000000074950980
> > paca = 0xc000000000454500
> > pid = 27330, comm = bash
> >
> > 0:mon> t
> > [c00000007ce2fa70] c000000000060d28 .dup_fd+0x1d8/0x39c (unreliable)
> > [c00000007ce2fb30] c000000000060f48 .copy_files+0x5c/0x88
> > [c00000007ce2fbd0] c000000000061f5c .copy_process+0x574/0x1520
> > [c00000007ce2fcd0] c000000000062f88 .do_fork+0x80/0x1c4
> > [c00000007ce2fdc0] c000000000011790 .sys_clone+0x5c/0x74
> > [c00000007ce2fe30] c000000000008950 .ppc_clone+0x8/0xc
> >
> > The PC translates to:
> > for (i = open_files; i != 0; i--) {
> > struct file *f = *old_fds++;
> > if (f) {
> > get_file(f); <-- Data access error
>
> So we probably got a bogus "struct file" pointer...
>
> > } else {
> >
> > And more info still:
> > 0:mon> r
> > R00 = ffffffff00000028 R16 = 00000000100e0000
> > R01 = c00000007ce2fa70 R17 = 000000000fff1d38
> > R02 = c00000000056cd20 R18 = 0000000000000000
> > R03 = c000000029f40a58 R19 = 0000000001200011
> > R04 = c000000029f442d8 R20 = c0000000a544a2a0
> > R05 = 0000000000000001 R21 = 0000000000000000
> > R06 = 0000000000000024 R22 = 0000000000000100
> > R07 = 0000001000000000 R23 = c00000008635f5e8
> > R08 = 0000000000000000 R24 = c0000000919c5448
> > R09 = 0000000000000024 R25 = 0000000000000100
> > R10 = 00000000000000dc R26 = c000000086359c30
> > R11 = ffffffff00000000 R27 = c000000089e5e230
> > R12 = 0000000006bbd9e9 R28 = c00000000c8d3d80
> > R13 = c000000000454500 R29 = 0000000000000020
> > R14 = c00000007ce2fea0 R30 = c000000000491fc8
> > R15 = 00000000fcb2e770 R31 = c0000000b8369b08
> > pc = c000000000060d90 .dup_fd+0x240/0x39c
> > lr = c000000000060d6c .dup_fd+0x21c/0x39c
> > msr = 800000000000b032 cr = 24242428
> > ctr = 0000000000000000 xer = 0000000000000000 trap = 300
> > dar = ffffffff00000028 dsisr = 40000000
> > -----------------------
> > 0:mon> di c000000000060d90 <==PC
> > c000000000060d90 7d200028 lwarx r9,r0,r0
> > c000000000060d94 31290001 addic r9,r9,1
> > c000000000060d98 7d20012d stwcx. r9,r0,r0
> > c000000000060d9c 40a2fff4 bne c000000000060d90 # .dup_fd+0x240/0x39c
>
> From what little I know about PowerPC, this looks like an atomic
> increment of the memory word pointed to by r0, which contains
> 0xffffffff00000028 - definitely looks like a bogus address. The
> offset of file->f_count should be 0x28 on a 64-bit architecture, so
> apparently we got f == 0xffffffff00000000 from *old_fds. Something
> scribbled over that memory?
Looks very likely. It would be ironic if the fdsets somehow scribbled
there -- a series of all ones in the open_fds. :)
I sent a request asking which version of the kernel produced this.
2.6.19-mm1 without hotfixes could crash like this, due to a bug in the
new fdtable code. Aside from that, and if no one can recognize some sort
of magical constant value in those registers above, it might be
bisection time.
-- Vadim Lobanov
Lobanov/Sergey
On Tue, 2006-11-14 at 13:35 -0800, Vadim Lobanov wrote:
> On Tue, 2006-11-14 at 23:42 +0300, Sergey Vlasov wrote:
> > Yes, very interesting (although if the problem appeared only after 72
> > hours of testing, it is hard to be sure that the bug is really fixed).
>
> Yep. Could be random memory corruption from some other piece of the
> code.
>
> > > [...] The open_files value that
> > > count_open_files() returns will always be a multiple of BITS_PER_LONG,
> > > so no extraneous bits will ever be copied. It's a tad confusing since
> > > count_open_files() does something a bit different than what its name
> > > suggests.
> >
> > Yes, then the logic looks fine. (The comment in count_open_files()
> > says "Find the last open fd", which is _almost_ what it does.)
>
> It's always the details that getcha. :) There are other implicit tricks
> that the code pulls, such as implicit requirements for minimum size
> granularity of certain data structures. It basically comes down to
> trying to read and understand all the surrounding code.
>
> > There is also some unused code and a slightly incorrect comment in
> > dup_fd():
> >
> > size = old_fdt->max_fdset;
> >
> > ... here "size" is not used ....
> >
> > /* if the old fdset gets grown now, we'll only copy up to "size" fds */
> >
> > ... here "size" is not used either ....
> >
> > size = (new_fdt->max_fds - open_files) * sizeof(struct file *);
> >
> > The result of the first assignment to "size" is not used anywhere, even
> > though it is mentioned in the comment. However, the intent to keep the
> > old size of the fdset is noted again.
>
> I read that comment in my mind like this: s/"size"/"open_files"/g. The
> wording can definitely be improved, though. Any takers?
>
> I also sent a patch to Andrew a while ago to clean up that unused
> assignment to "size", as part of a bigger fdtable-rework patchset. That
> code is currently getting its testing in -mm.
>
> > > 0:mon> e
> > > cpu 0x0: Vector: 300 (Data Access) at [c00000007ce2f7f0]
> > > pc: c000000000060d90: .dup_fd+0x240/0x39c
> > > lr: c000000000060d6c: .dup_fd+0x21c/0x39c
> > > sp: c00000007ce2fa70
> > > msr: 800000000000b032
> > > dar: ffffffff00000028
> > > dsisr: 40000000
> > > current = 0xc000000074950980
> > > paca = 0xc000000000454500
> > > pid = 27330, comm = bash
> > >
> > > 0:mon> t
> > > [c00000007ce2fa70] c000000000060d28 .dup_fd+0x1d8/0x39c (unreliable)
> > > [c00000007ce2fb30] c000000000060f48 .copy_files+0x5c/0x88
> > > [c00000007ce2fbd0] c000000000061f5c .copy_process+0x574/0x1520
> > > [c00000007ce2fcd0] c000000000062f88 .do_fork+0x80/0x1c4
> > > [c00000007ce2fdc0] c000000000011790 .sys_clone+0x5c/0x74
> > > [c00000007ce2fe30] c000000000008950 .ppc_clone+0x8/0xc
> > >
> > > The PC translates to:
> > > for (i = open_files; i != 0; i--) {
> > > struct file *f = *old_fds++;
> > > if (f) {
> > > get_file(f); <-- Data access error
> >
> > So we probably got a bogus "struct file" pointer...
> >
> > > } else {
> > >
> > > And more info still:
> > > 0:mon> r
> > > R00 = ffffffff00000028 R16 = 00000000100e0000
> > > R01 = c00000007ce2fa70 R17 = 000000000fff1d38
> > > R02 = c00000000056cd20 R18 = 0000000000000000
> > > R03 = c000000029f40a58 R19 = 0000000001200011
> > > R04 = c000000029f442d8 R20 = c0000000a544a2a0
> > > R05 = 0000000000000001 R21 = 0000000000000000
> > > R06 = 0000000000000024 R22 = 0000000000000100
> > > R07 = 0000001000000000 R23 = c00000008635f5e8
> > > R08 = 0000000000000000 R24 = c0000000919c5448
> > > R09 = 0000000000000024 R25 = 0000000000000100
> > > R10 = 00000000000000dc R26 = c000000086359c30
> > > R11 = ffffffff00000000 R27 = c000000089e5e230
> > > R12 = 0000000006bbd9e9 R28 = c00000000c8d3d80
> > > R13 = c000000000454500 R29 = 0000000000000020
> > > R14 = c00000007ce2fea0 R30 = c000000000491fc8
> > > R15 = 00000000fcb2e770 R31 = c0000000b8369b08
> > > pc = c000000000060d90 .dup_fd+0x240/0x39c
> > > lr = c000000000060d6c .dup_fd+0x21c/0x39c
> > > msr = 800000000000b032 cr = 24242428
> > > ctr = 0000000000000000 xer = 0000000000000000 trap = 300
> > > dar = ffffffff00000028 dsisr = 40000000
> > > -----------------------
> > > 0:mon> di c000000000060d90 <==PC
> > > c000000000060d90 7d200028 lwarx r9,r0,r0
> > > c000000000060d94 31290001 addic r9,r9,1
> > > c000000000060d98 7d20012d stwcx. r9,r0,r0
> > > c000000000060d9c 40a2fff4 bne c000000000060d90 # .dup_fd+0x240/0x39c
> >
> > From what little I know about PowerPC, this looks like an atomic
> > increment of the memory word pointed to by r0, which contains
> > 0xffffffff00000028 - definitely looks like a bogus address. The
> > offset of file->f_count should be 0x28 on a 64-bit architecture, so
> > apparently we got f == 0xffffffff00000000 from *old_fds. Something
> > scribbled over that memory?
This is very interesting: after reading it through, I feel there is a
high chance this could be a memory corruption issue. But if memory is
getting corrupted, what could the possible reasons be?
I had observed random slab corruption issues on this machine; could
those have resulted in the corruption? We may be opening up larger
issues here, about which I am not much aware.
> Looks very likely. It would be ironic if the fdsets somehow scribbled
> there -- a series of all ones in the open_fds. :)
>
> I sent a request asking which version of the kernel produced this.
> 2.6.19-mm1 without hotfixes could crash like this, due to a bug in the
> new fdtable code. Aside from that, and if no one can recognize some sort
> of magical constant value in those registers above, it might be
> bisection time.
>
Lobanov
Excuse me for the late response.
The kernel version on which it was tested is 2.6.18-1 (a distro
variant).
Test suite:
The TCP/IO stress tests are from the LTP test suite; you can read about
them here: http://ltp.sourceforge.net/testPlan.php#6.0
On Wed, 2006-11-15 at 13:08 +0530, Sharyathi Nagesh wrote:
> This is very interesting: after reading it through, I feel there is a
> high chance this could be a memory corruption issue. But if memory is
> getting corrupted, what could the possible reasons be?
> I had observed random slab corruption issues on this machine; could
> those have resulted in the corruption? We may be opening up larger
> issues here, about which I am not much aware.
I'm guessing that you've already tried this, but it never hurts to be
sure: does this machine pass memtest? :)
> The kernel version on which it was tested is 2.6.18-1 (a distro
> variant).
Unless someone recognizes special magic values from the register dumps
to point at any particular part of the kernel, the corruption could be
coming from almost anywhere. If no one has any better guesses, then
narrowing down the problem might be worthwhile: grab a vanilla
non-distro 2.6.18 kernel (from kernel.org) and see if you can
reproduce the problem with that, and then try to find the previous
release where the problem disappears. Or use git instead, which folks
say can do this bisection process rather well. :)
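The bisection would go roughly like this (the good/bad revisions below
are placeholders -- substitute whatever your testing actually shows):

	git bisect start
	git bisect bad v2.6.18		# release known to crash
	git bisect good v2.6.17		# adjust to a release known to be good
	# build and stress-test the kernel git checks out, then repeat:
	git bisect good			# or: git bisect bad
	git bisect reset		# when finished

Each round halves the range of suspect commits.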
Thanks,
-- Vadim Lobanov
On Wed, 2006-11-15 at 00:15 -0800, Vadim Lobanov wrote:
> On Wed, 2006-11-15 at 13:08 +0530, Sharyathi Nagesh wrote:
> > This is very interesting: after reading it through, I feel there is a
> > high chance this could be a memory corruption issue. But if memory is
> > getting corrupted, what could the possible reasons be?
> > I had observed random slab corruption issues on this machine; could
> > those have resulted in the corruption? We may be opening up larger
> > issues here, about which I am not much aware.
>
> I'm guessing that you've already tried this, but it never hurts to be
> sure: does this machine pass memtest? :)
Haven't run memtest; I will run it and check whether the problem gets
recreated.
> > The kernel version on which it was tested is 2.6.18-1 (a distro
> > variant).
>
> Unless someone recognizes special magic values from the register dumps
> to point at any particular part of the kernel, the corruption could be
> coming from almost anywhere. If no one has any better guesses, then
> narrowing down the problem might be worthwhile: grab a vanilla
> non-distro 2.6.18 kernel (from kernel.org) and see if you can
> reproduce the problem with that, and then try to find the previous
> release where the problem disappears. Or use git instead, which folks
> say can do this bisection process rather well. :)
>
> Thanks,
> -- Vadim Lobanov
>
I looked into a few memtests that were run on a similar machine. There
were a few slab corruption issues, but not while running memtest, and no
other issues.
The problem seems difficult to replicate.
On Wed, 2006-11-15 at 00:15 -0800, Vadim Lobanov wrote:
> On Wed, 2006-11-15 at 13:08 +0530, Sharyathi Nagesh wrote:
> > This is very interesting: after reading it through, I feel there is a
> > high chance this could be a memory corruption issue. But if memory is
> > getting corrupted, what could the possible reasons be?
> > I had observed random slab corruption issues on this machine; could
> > those have resulted in the corruption? We may be opening up larger
> > issues here, about which I am not much aware.
>
> I'm guessing that you've already tried this, but it never hurts to be
> sure: does this machine pass memtest? :)
>
> > The kernel version on which it was tested is 2.6.18-1 (a distro
> > variant).
>
> Unless someone recognizes special magic values from the register dumps
> to point at any particular part of the kernel, the corruption could be
> coming from almost anywhere. If no one has any better guesses, then
> narrowing down the problem might be worthwhile: grab a vanilla
> non-distro 2.6.18 kernel (from kernel.org) and see if you can
> reproduce the problem with that, and then try to find the previous
> release where the problem disappears. Or use git instead, which folks
> say can do this bisection process rather well. :)
>
> Thanks,
> -- Vadim Lobanov
>
On Fri, 2006-11-17 at 18:38 +0530, Sharyathi Nagesh wrote:
> I looked into a few memtests that were run on a similar machine. There
> were a few slab corruption issues, but not while running memtest, and no
> other issues.
> The problem seems difficult to replicate.
Bummer.
Bisection with vanilla (non-distro) kernels may be your best bet, then.
I have tried to replicate this issue in my spare time, but so far have
been unsuccessful.
-- Vadim Lobanov
Excuse me for the late response.
On Fri, 2006-11-17 at 18:38 +0530, Sharyathi Nagesh wrote:
> > I'm guessing that you've already tried this, but it never hurts to be
> > sure: does this machine pass memtest? :)
The machine passes memtest without failure, even after running for many
hours.
Since the problem looks genuine to me and may hit again in the future,
do let me know how I should improve the patch. Please let us know your
opinion...
---------------------------------------------------------------------
To restate the problem:
A data access error was observed after running IO and TCP stress tests
for more than 72 hours on a ppc64 machine.
[c00000007ce2fa70] c000000000060d28 .dup_fd+0x1d8/0x39c (unreliable)
[c00000007ce2fb30] c000000000060f48 .copy_files+0x5c/0x88
[c00000007ce2fbd0] c000000000061f5c .copy_process+0x574/0x1520
[c00000007ce2fcd0] c000000000062f88 .do_fork+0x80/0x1c4
[c00000007ce2fdc0] c000000000011790 .sys_clone+0x5c/0x74
[c00000007ce2fe30] c000000000008950 .ppc_clone+0x8/0xc
--- Exception: c00 (System Call) at 000000000fee9c60
The location of the problem was:
-----------
for (i = open_files; i != 0; i--) {
struct file *f = *old_fds++;
if (f) {
get_file(f); <== get_file() increments the f_count of
struct file
} else {
-----------
in kernel/fork.c. Accessing f_count gives a data access error because
the 'struct file' pointer f points to an invalid location.
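As a possible direction for improving the patch, a debug-only sanity
check just before get_file() might catch scribbled entries early and
dump some state. A rough sketch (the pointer test is ppc64-specific and
purely illustrative: a real kernel pointer here starts with 0xc...,
while the faulting value 0xffffffff00000000 does not):

	for (i = open_files; i != 0; i--) {
		struct file *f = *old_fds++;
		if (f) {
			if (unlikely(((unsigned long)f >> 60) != 0xc)) {
				printk(KERN_ERR "dup_fd: bogus file ptr %p "
				       "(open_files=%d, i=%d)\n",
				       f, open_files, i);
				BUG();
			}
			get_file(f);
		} else {
			...
		}
	}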
Regards
Sharyathi Nagesh