2006-09-21 09:54:51

by Anton Altaparmakov

Subject: Re: Fw: 2.6.17 oops, possibly ntfs/mmap related

Hi,

On Tue, 2006-09-12 at 20:56 -0700, Andrew Morton wrote:

Andrew, thanks for forwarding me the message...

> Begin forwarded message:
>
> Date: Wed, 13 Sep 2006 10:05:42 +0930 (CST)
> From: Jonathan Woithe <[email protected]>
> To: [email protected]
> Cc: [email protected] (Jonathan Woithe)
> Subject: 2.6.17 oops, possibly ntfs/mmap related
>
>
> We have a machine which is currently making heavy use of a usb hard disc
> formatted with ntfs. There have been two occasions where the kernel has
> oopsed while this disc was being accessed heavily. Before adding this HDD
> the machine in question was rock solid which leads me to think that it
> might be related to ntfs. USB drives formatted with other filesystems do
> not appear to suffer from this problem.
>
> Unfortunately bogofilter considers the oops reports as spam so I cannot post
> them to the list. I have instead put the full text of my original post
> regarding this topic on the web at
>
> http://www.atrad.com.au/~jwoithe/kernel/oopses-20060913.txt
>
> I'm happy to try things to narrow down the cause if it will help.
>
> Please CC me on reply.

These are the oopses from the text file at the above URL:

> The first oops caused the machine to totally lock up:
>
> BUG: unable to handle kernel paging request at virtual address e4004de0
> printing eip:
> c012de0c
> *pde = 00000000
> Oops: 0000 [#1]
> Modules linked in: ntfs 8139too via_agp agpgart usb_storage ehci_hcd uhci_hcd usbcore
> CPU: 0
> EIP: 0060:[<c012de0c>] Not tainted VLI
> EFLAGS: 00010082 (2.6.17 #2)
> EIP is at find_get_page+0x11/0x22
> eax: e4004de0 ebx: c02f01a8 ecx: e4004de0 edx: e4004de0
> esi: 00000000 edi: 00000066 ebp: cfc20574 esp: c770bee8
> ds: 007b es: 007b ss: 0068
> Process sh (pid: 10467, threadinfo=c770a000 task=cfa495c0)
> Stack: c012ea09 00000002 00000000 cfc204d8 cff882a4 cff88260 c770bf30 c5962544
> c02f01a8 00000000 ceec10a0 080aef10 c01385d1 00000000 cfc20574 080aef10
> c5962544 ceec10a0 00000002 c7aeb080 080aef10 ceec10a0 080aef10 c013886a
> Call Trace:
> <c012ea09> filemap_nopage+0x98/0x2b2 <c01385d1> do_no_page+0x6d/0x1e1
> <c013886a> __handle_mm_fault+0xc4/0x162 <c0112190> do_page_fault+0x23e/0x56b
> <c01c43c1> copy_to_user+0x41/0x49 <c0111f52> do_page_fault+0x0/0x56b
> <c010342f> error_code+0x4f/0x54
> Code: a0 fe ff ff 89 ea b9 e2 d7 12 c0 6a 02 e8 5f ec 15 00 83 c4 44 5b 5e 5f 5d c3 fa 83 c0 04 e8 2c 3f 09 00 85 c0 89 c1 74 0f 89 c2 <8b> 00 f6 c4 40 74 03 8b 51 0c ff 42 04 fb 89 c8 c3 fa 83 c0 04
>
>
> In the case of the second oops the machine was still partially usable and a
> clean shutdown was possible. However, services such as sshd were no longer
> responding.
>
> BUG: unable to handle kernel paging request at virtual address 0010c744
> printing eip:
> c013be50
> *pde = 00000000
> Oops: 0002 [#1]
> Modules linked in: ntfs 8139too via_agp agpgart usb_storage ehci_hcd uhci_hcd usbcore
> CPU: 0
> EIP: 0060:[<c013be50>] Tainted: G M VLI
> EFLAGS: 00010282 (2.6.17 #2)
> EIP is at anon_vma_unlink+0x16/0x3c
> eax: 0010c740 ebx: cf1070cc ecx: cf107104 edx: cf8bc740
> esi: cf8bc740 edi: b7e82000 ebp: 00000000 esp: cdad7f58
> ds: 007b es: 007b ss: 0068
> Process sh (pid: 20272, threadinfo=cdad6000 task=c0d8d580)
> Stack: cf1070cc cf61f3e4 c0136b5f cdad7f80 c4084b74 cf8b5860 00000001 00000000
> c013ab92 00000000 c0371b7c 000000b9 cf8b5860 c0d8d580 c01145dd cdad6000
> c0118187 cdad6000 00000000 00000000 cdad6000 c0118380 00000000 b7f9968c
> Call Trace:
> <c0136b5f> free_pgtables+0x41/0x82 <c013ab92> exit_mmap+0x6a/0xb8
> <c01145dd> mmput+0x1b/0x5e <c0118187> do_exit+0x14e/0x2d1
> <c0118380> sys_exit_group+0x0/0xd <c010299b> syscall_call+0x7/0xb
> Code: c9 74 10 8b 11 8d 40 38 89 42 04 89 53 38 89 48 04 89 01 5b c3 56 53 8b 70 40 89 c3 85 f6 74 2e 8d 48 38 8b 40 38 8b 51 04 89 02 <89> 50 04 c7 43 38 00 01 10 00 39 36 c7 41 04 00 02 20 00 75 0e
> EIP: [<c013be50>] anon_vma_unlink+0x16/0x3c SS:ESP 0068:cdad7f58
> <1>Fixing recursive fault but reboot is needed!
>
> I'm not entirely sure why the kernel considered itself tainted in the second
> oops and not in the first - the setup hadn't changed and precisely the same
> kernel modules were loaded. This machine does not have any external (ie:
> out-of-tree) modules installed.

Weird. The traces do not include NTFS at all in them so I have no idea
if NTFS has anything to do with this or not...

Anyone have any idea what these oopses mean?!?
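
As a partial answer, the number printed after "Oops:" is the low bits of the i386 page-fault error code, and it can be decoded mechanically. The following is a minimal sketch, assuming the usual i386 meaning of those bits (bit 0 = protection violation rather than page not present, bit 1 = write, bit 2 = user mode). The struct and function names are made up for illustration and are not kernel code.

```c
#include <stdbool.h>
#include <stdio.h>

/* Low three bits of the i386 page-fault error code shown after "Oops:". */
struct oops_code {
        bool protection; /* bit 0: 1 = protection violation, 0 = page not present */
        bool write;      /* bit 1: 1 = write access, 0 = read access */
        bool user;       /* bit 2: 1 = fault taken in user mode, 0 = kernel mode */
};

/* Hypothetical helper: split an error code into its three flags. */
static struct oops_code decode_oops_code(unsigned int code)
{
        struct oops_code d = {
                .protection = (code & 0x1) != 0,
                .write      = (code & 0x2) != 0,
                .user       = (code & 0x4) != 0,
        };
        return d;
}

/* Hypothetical helper: print a one-line summary of an error code. */
static void print_oops_code(unsigned int code)
{
        struct oops_code d = decode_oops_code(code);

        printf("Oops: %04x -> %s, %s, %s mode\n", code,
               d.protection ? "protection violation" : "page not present",
               d.write ? "write" : "read",
               d.user ? "user" : "kernel");
}
```

Calling print_oops_code(0x0000) and print_oops_code(0x0002) from a main() describes the two reports above: the first is a kernel-mode read of an unmapped address, the second a kernel-mode write to one.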

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://www.linux-ntfs.org/ & http://www-stu.christs.cam.ac.uk/~aia21/


2006-09-21 14:41:44

by Anton Altaparmakov

Subject: Re: Fw: 2.6.17 oops, possibly ntfs/mmap related

Hi,

On Thu, 2006-09-21 at 10:54 +0100, Anton Altaparmakov wrote:
> On Tue, 2006-09-12 at 20:56 -0700, Andrew Morton wrote:
> Andrew, thanks for forwarding me the message...
> > Begin forwarded message:
> >
> > We have a machine which is currently making heavy use of a usb hard disc
> > formatted with ntfs. There have been two occasions where the kernel has
> > oopsed while this disc was being accessed heavily. Before adding this HDD
> > the machine in question was rock solid which leads me to think that it
> > might be related to ntfs. USB drives formatted with other filesystems do
> > not appear to suffer from this problem.

I have now seen such an oops too with 2.6.18 kernel. Note no NTFS file
systems were mounted at the time (but I had an NTFS file system mounted
earlier in the day).

The oops is caused by kswapd0 kernel thread, the stack trace is:

Call Trace:
[<c10470a3>] shrink_inactive_list+0x46b/0x790
[<c104747c>] shrink_zone+0xb4/0xd3
[<c104797d>] kswapd+0x2de/0x3cf
[<c102c18e>] kthread+0xc2/0xf0
[<c1000bf1>] kernel_thread_helper+0x5/0xb
DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb
Leftover inexact backtrace:
[<c1003e6c>] show_stack_log_lvl+0x8c/0x97
[<c1003fc8>] show_registers+0x151/0x1c6
[<c10041af>] die+0x172/0x27b
[<c145f22c>] do_page_fault+0x42c/0x4f9
[<c10037dd>] error_code+0x39/0x40
[<c10470a3>] shrink_inactive_list+0x46b/0x790
[<c104747c>] shrink_zone+0xb4/0xd3
[<c104797d>] kswapd+0x2de/0x3cf
[<c102c18e>] kthread+0xc2/0xf0
[<c1000bf1>] kernel_thread_helper+0x5/0xb

And the EIP is at fs/buffer.c::try_to_release_page() the code of which
is here:

int try_to_release_page(struct page *page, gfp_t gfp_mask)
{
        struct address_space * const mapping = page->mapping;

        BUG_ON(!PageLocked(page));
        if (PageWriteback(page))
                return 0;

        if (mapping && mapping->a_ops->releasepage)

^^^ bug happens here when the value of mapping->a_ops is used to obtain
mapping->a_ops->releasepage

                return mapping->a_ops->releasepage(page, gfp_mask);
        return try_to_free_buffers(page);
}

This bug seems to suggest that the kernel is trying to release the
private data of a page which has page->mapping set to a valid value but
page->mapping->a_ops set to an invalid value, so that dereferencing
page->mapping->a_ops->releasepage causes an oops with the kernel saying:

BUG: unable to handle kernel paging request at virtual address 020030d2

The values of the relevant variables from the oops are:

page = 0xc2248fa0
page->mapping = 0xe3a79eac
page->mapping->a_ops = 0x020030aa

Note that 0x020030aa+0x28 = 020030d2 which is the oops causing address
and 0x28 is the offset of the releasepage function pointer in the
address space operations structure...
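
The arithmetic above can be checked with a few lines of C. This is only a sketch: the 4-byte pointer size is an assumption based on this being a 32-bit i386 kernel (eax/ebx registers, 32-bit addresses), and the helper names are invented for the example.

```c
#include <stdint.h>
#include <stdio.h>

#define I386_POINTER_SIZE 4u /* assumption: 32-bit kernel, 4-byte pointers */

/* Address of a member located `offset` bytes into a structure at `base`. */
static uint32_t member_address(uint32_t base, uint32_t offset)
{
        return base + offset;
}

/* Zero-based index of the pointer-sized slot at `offset`. */
static uint32_t pointer_slot(uint32_t offset)
{
        return offset / I386_POINTER_SIZE;
}

/* Non-zero if `ptr` is aligned to the pointer size. */
static int is_pointer_aligned(uint32_t ptr)
{
        return (ptr % I386_POINTER_SIZE) == 0;
}

/* Print the values relevant to the reported oops. */
static void print_releasepage_fault(void)
{
        const uint32_t a_ops = 0x020030aau;           /* from the oops */
        const uint32_t releasepage_offset = 0x28u;    /* from the oops */

        printf("fault address: %08x\n",
               (unsigned int)member_address(a_ops, releasepage_offset));
        printf("pointer slot:  %u\n",
               (unsigned int)pointer_slot(releasepage_offset));
        printf("a_ops aligned: %s\n",
               is_pointer_aligned(a_ops) ? "yes" : "no");
}
```

Besides reproducing 020030d2, this shows that 0x020030aa is not even 4-byte aligned, which is one more hint that the a_ops value is garbage rather than a pointer to a real operations table.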

This oops is not identical to the oopses pointed out by Jonathan at:

http://www.atrad.com.au/~jwoithe/kernel/oopses-20060913.txt

But those oopses have to do with pages also so could be related...

Anyone have any ideas how a page can end up in such a weird state?

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://www.linux-ntfs.org/ & http://www-stu.christs.cam.ac.uk/~aia21/

2006-09-21 17:52:52

by Andrew Morton

Subject: Re: Fw: 2.6.17 oops, possibly ntfs/mmap related

On Thu, 21 Sep 2006 15:41:36 +0100
Anton Altaparmakov <[email protected]> wrote:

> Hi,
>
> On Thu, 2006-09-21 at 10:54 +0100, Anton Altaparmakov wrote:
> > On Tue, 2006-09-12 at 20:56 -0700, Andrew Morton wrote:
> > Andrew, thanks for forwarding me the message...
> > > Begin forwarded message:
> > >
> > > We have a machine which is currently making heavy use of a usb hard disc
> > > formatted with ntfs. There have been two occasions where the kernel has
> > > oopsed while this disc was being accessed heavily. Before adding this HDD
> > > the machine in question was rock solid which leads me to think that it
> > > might be related to ntfs. USB drives formatted with other filesystems do
> > > not appear to suffer from this problem.
>
> I have now seen such an oops too with 2.6.18 kernel.

I assume it is a once-off?

> Note no NTFS file
> systems were mounted at the time (but I had an NTFS file system mounted
> earlier in the day).
>
> The oops is caused by kswapd0 kernel thread, the stack trace is:
>
> Call Trace:
> [<c10470a3>] shrink_inactive_list+0x46b/0x790
> [<c104747c>] shrink_zone+0xb4/0xd3
> [<c104797d>] kswapd+0x2de/0x3cf
> [<c102c18e>] kthread+0xc2/0xf0
> [<c1000bf1>] kernel_thread_helper+0x5/0xb
> DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb
> Leftover inexact backtrace:
> [<c1003e6c>] show_stack_log_lvl+0x8c/0x97
> [<c1003fc8>] show_registers+0x151/0x1c6
> [<c10041af>] die+0x172/0x27b
> [<c145f22c>] do_page_fault+0x42c/0x4f9
> [<c10037dd>] error_code+0x39/0x40
> [<c10470a3>] shrink_inactive_list+0x46b/0x790
> [<c104747c>] shrink_zone+0xb4/0xd3
> [<c104797d>] kswapd+0x2de/0x3cf
> [<c102c18e>] kthread+0xc2/0xf0
> [<c1000bf1>] kernel_thread_helper+0x5/0xb
>
> And the EIP is at fs/buffer.c::try_to_release_page() the code of which
> is here:
>
> int try_to_release_page(struct page *page, gfp_t gfp_mask)
> {
>         struct address_space * const mapping = page->mapping;
>
>         BUG_ON(!PageLocked(page));
>         if (PageWriteback(page))
>                 return 0;
>
>         if (mapping && mapping->a_ops->releasepage)
>
> ^^^ bug happens here when the value of mapping->a_ops is used to obtain
> mapping->a_ops->releasepage
>
>                 return mapping->a_ops->releasepage(page, gfp_mask);
>         return try_to_free_buffers(page);
> }
>
> This bug seems to suggest that the kernel is trying to release the
> private data of a page which has page->mapping set to a valid value but
> page->mapping->a_ops set to an invalid value, so that dereferencing
> page->mapping->a_ops->releasepage causes an oops with the kernel saying:
>
> BUG: unable to handle kernel paging request at virtual address 020030d2
>
> The values of the relevant variables from the oops are:
>
> page = 0xc2248fa0
> page->mapping = 0xe3a79eac
> page->mapping->a_ops = 0x020030aa

I wonder if page->mapping really wanted to be 0xc3a79eac, only something
set bit 29.
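
The single-bit-flip idea is easy to confirm by XORing the observed value with the suspected one. A small sketch, with helper names chosen only for this example:

```c
#include <stdint.h>
#include <stdio.h>

/* Bits that differ between two 32-bit values. */
static uint32_t differing_bits(uint32_t a, uint32_t b)
{
        return a ^ b;
}

/* Non-zero if exactly one bit is set in `x`. */
static int is_single_bit(uint32_t x)
{
        return x != 0 && (x & (x - 1)) == 0;
}

/* Index of the highest set bit; `x` must be non-zero. */
static int bit_index(uint32_t x)
{
        int i = 0;

        while (x >>= 1)
                i++;
        return i;
}

/* Compare the observed page->mapping with the suspected intended value. */
static void print_mapping_diff(void)
{
        const uint32_t observed  = 0xe3a79eacu; /* from the oops */
        const uint32_t suspected = 0xc3a79eacu; /* the guess above */
        const uint32_t diff = differing_bits(observed, suspected);

        printf("diff = %08x, single bit: %s, bit %d\n",
               (unsigned int)diff,
               is_single_bit(diff) ? "yes" : "no",
               bit_index(diff));
}
```

The two values differ only in 0x20000000, which is bit 29 counting from zero. That is consistent with a flipped bit but does not prove one.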

> Note that 0x020030aa+0x28 = 020030d2 which is the oops causing address
> and 0x28 is the offset of the releasepage function pointer in the
> address space operations structure...
>
> This oops is not identical to the oopses pointed out by Jonathan at:
>
> http://www.atrad.com.au/~jwoithe/kernel/oopses-20060913.txt
>
> But those oopses have to do with pages also so could be related...

Looks a bit different - Jonathan appears to have pulled a bad page* out
of the radix tree whereas you got your page off the LRU.

> Anyone have any ideas how a page can end up in such a weird state?

Nope.

2006-09-21 19:05:19

by Hugh Dickins

Subject: Re: Fw: 2.6.17 oops, possibly ntfs/mmap related

On Thu, 21 Sep 2006, Andrew Morton wrote:
> On Thu, 21 Sep 2006 15:41:36 +0100
> Anton Altaparmakov <[email protected]> wrote:
> >
> > BUG: unable to handle kernel paging request at virtual address 020030d2
> >
> > The values of the relevant variables from the oops are:
> >
> > page = 0xc2248fa0
> > page->mapping = 0xe3a79eac
> > page->mapping->a_ops = 0x020030aa
>
> I wonder if page->mapping really wanted to be 0xc3a79eac, only something
> set bit 29.

Perhaps, but I'm more suspicious of that 0x0200 top half of the a_ops ptr.

>
> > Note that 0x020030aa+0x28 = 020030d2 which is the oops causing address
> > and 0x28 is the offset of the releasepage function pointer in the
> > address space operations structure...
> >
> > This oops is not identical to the oopses pointed out by Jonathan at:
> >
> > http://www.atrad.com.au/~jwoithe/kernel/oopses-20060913.txt
> >
> > But those oopses have to do with pages also so could be related...
>
> Looks a bit different - Jonathan appears to have pulled a bad page* out
> of the radix tree whereas you got your page off the LRU.

Jonathan does show a second oops, from a later boot:

BUG: unable to handle kernel paging request at virtual address 0010c744
printing eip:
c013be50
*pde = 00000000
Oops: 0002 [#1]
Modules linked in: ntfs 8139too via_agp agpgart usb_storage ehci_hcd uhci_hcd usbcore
CPU: 0
EIP: 0060:[<c013be50>] Tainted: G M VLI
EFLAGS: 00010282 (2.6.17 #2)
EIP is at anon_vma_unlink+0x16/0x3c
eax: 0010c740 ebx: cf1070cc ecx: cf107104 edx: cf8bc740
esi: cf8bc740 edi: b7e82000 ebp: 00000000 esp: cdad7f58

I haven't worked out the disassembly in detail to support the idea
(though certainly anon_vma_unlink would be trying to list_del around
here), but that eax and esi do suggest a corrupted list: somehow the
top half of a pointer overwritten by the top half of LIST_POISON1.

And in Anton's case, the top half of a pointer overwritten by the
bottom half of LIST_POISON2.

Maybe just coincidence, and I've nothing more illuminating to add;
but just a hint of a list_del going very wrong somewhere?
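
To make the half-word comparison concrete, here is a minimal sketch. It takes LIST_POISON1 and LIST_POISON2 to be 0x00100100 and 0x00200200, the values in the 2.6-era include/linux/list.h; the same constants appear as little-endian immediates ("00 01 10 00" and "00 02 20 00") in the Code: line of Jonathan's second oops. The helper names are invented for the example.

```c
#include <stdint.h>
#include <stdio.h>

#define LIST_POISON1 0x00100100u /* written to ->next by list_del() */
#define LIST_POISON2 0x00200200u /* written to ->prev by list_del() */

/* Upper 16 bits of a 32-bit value. */
static uint16_t top_half(uint32_t v)
{
        return (uint16_t)(v >> 16);
}

/* Lower 16 bits of a 32-bit value. */
static uint16_t bottom_half(uint32_t v)
{
        return (uint16_t)(v & 0xffffu);
}

/* Print the half-word matches discussed in the thread. */
static void print_poison_matches(void)
{
        const uint32_t eax   = 0x0010c740u; /* Jonathan's second oops */
        const uint32_t esi   = 0xcf8bc740u; /* Jonathan's second oops */
        const uint32_t a_ops = 0x020030aau; /* Anton's oops */

        printf("eax top == POISON1 top:      %d\n",
               top_half(eax) == top_half(LIST_POISON1));
        printf("eax bottom == esi bottom:    %d\n",
               bottom_half(eax) == bottom_half(esi));
        printf("a_ops top == POISON2 bottom: %d\n",
               top_half(a_ops) == bottom_half(LIST_POISON2));
        printf("eax + 4 = %08x\n", (unsigned int)(eax + 4u));
}
```

All three comparisons come out true, and eax + 4 is 0010c744, the faulting address in that oops. The sketch only confirms that the numbers line up as described; it says nothing about what did the overwriting.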

Hugh

2006-09-21 19:25:39

by Dave Jones

Subject: Re: Fw: 2.6.17 oops, possibly ntfs/mmap related

On Thu, Sep 21, 2006 at 08:04:49PM +0100, Hugh Dickins wrote:

> BUG: unable to handle kernel paging request at virtual address 0010c744
> printing eip:
> c013be50
> *pde = 00000000
> Oops: 0002 [#1]
> Modules linked in: ntfs 8139too via_agp agpgart usb_storage ehci_hcd uhci_hcd usbcore
> CPU: 0
> EIP: 0060:[<c013be50>] Tainted: G M VLI
> EFLAGS: 00010282 (2.6.17 #2)
> EIP is at anon_vma_unlink+0x16/0x3c
> eax: 0010c740 ebx: cf1070cc ecx: cf107104 edx: cf8bc740
> esi: cf8bc740 edi: b7e82000 ebp: 00000000 esp: cdad7f58
>
> I haven't worked out the disassembly in detail to support the idea
> (though certainly anon_vma_unlink would be trying to list_del around
> here), but that eax and esi do suggest a corrupted list: somehow the
> top half of a pointer overwritten by the top half of LIST_POISON1.
>
> And in Anton's case, the top half of a pointer overwritten by the
> bottom half of LIST_POISON2.
>
> Maybe just coincidence, and I've nothing more illuminating to add;
> but just a hint of a list_del going very wrong somewhere?

Given a machine check happened, the state of the machine in general
is questionable. I'd recommend a run of memtest86+
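
For reference, the machine check shows up as the 'M' in the "Tainted: G M" line of that oops. The sketch below decodes the flag letters; the table reflects my understanding of the letters used by kernels of this era and should be treated as an assumption, and the function names are invented for the example.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct taint_flag {
        char letter;
        const char *meaning;
};

/* Assumed flag letters for 2.6.17-era kernels; not taken from the thread. */
static const struct taint_flag taint_flags[] = {
        { 'P', "proprietary module loaded" },
        { 'G', "only GPL-compatible modules loaded" },
        { 'F', "module force-loaded" },
        { 'S', "SMP-unsafe CPU combination" },
        { 'R', "module force-unloaded" },
        { 'M', "machine check exception occurred" },
        { 'B', "bad page state seen" },
};

/* Meaning of one flag letter, or NULL if it is not in the table. */
static const char *taint_meaning(char letter)
{
        size_t i;

        for (i = 0; i < sizeof(taint_flags) / sizeof(taint_flags[0]); i++)
                if (taint_flags[i].letter == letter)
                        return taint_flags[i].meaning;
        return NULL;
}

/* Non-zero if `letter` appears in the taint string from the oops. */
static int taint_has(const char *flags, char letter)
{
        return letter != '\0' && strchr(flags, letter) != NULL;
}

/* Print every flag letter found in the taint string. */
static void print_taint(const char *flags)
{
        for (; *flags != '\0'; flags++) {
                const char *meaning;

                if (*flags == ' ')
                        continue;
                meaning = taint_meaning(*flags);
                printf("%c: %s\n", *flags, meaning ? meaning : "unknown flag");
        }
}
```

Under that assumption, print_taint("G M") reports no proprietary modules plus a machine check, which would explain a tainted kernel on a machine with no out-of-tree modules.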

Dave

2006-09-22 07:00:00

by Jonathan Woithe

Subject: Re: Fw: 2.6.17 oops, possibly ntfs/mmap related

> On Thu, Sep 21, 2006 at 08:04:49PM +0100, Hugh Dickins wrote:
>
> > BUG: unable to handle kernel paging request at virtual address 0010c744
> > printing eip:
> > c013be50
> > *pde = 00000000
> > Oops: 0002 [#1]
> > Modules linked in: ntfs 8139too via_agp agpgart usb_storage ehci_hcd uhci_hcd usbcore
> > CPU: 0
> > EIP: 0060:[<c013be50>] Tainted: G M VLI
> > EFLAGS: 00010282 (2.6.17 #2)
> > EIP is at anon_vma_unlink+0x16/0x3c
> > eax: 0010c740 ebx: cf1070cc ecx: cf107104 edx: cf8bc740
> > esi: cf8bc740 edi: b7e82000 ebp: 00000000 esp: cdad7f58
> >
> > I haven't worked out the disassembly in detail to support the idea
> > (though certainly anon_vma_unlink would be trying to list_del around
> > here), but that eax and esi do suggest a corrupted list: somehow the
> > top half of a pointer overwritten by the top half of LIST_POISON1.
> >
> > And in Anton's case, the top half of a pointer overwritten by the
> > bottom half of LIST_POISON2.
> >
> > Maybe just coincidence, and I've nothing more illuminating to add;
> > but just a hint of a list_del going very wrong somewhere?
>
> Given a machine check happened, the state of the machine in general
> is questionable. I'd recommend a run of memtest86+

That was already done. No memory errors were reported over 10 passes.

Secondly, the machine check indication was only present on one of the two
oopses we saw. Furthermore, there was no indication in any log files
that a machine check had occurred in the case of the second oops.
Then again, perhaps machine checks don't get logged which would make this
observation irrelevant.

Could we be looking at a dying CPU?

Regards
jonathan

2006-09-22 06:59:49

by Anton Altaparmakov

Subject: Re: Fw: 2.6.17 oops, possibly ntfs/mmap related

On Thu, 2006-09-21 at 10:52 -0700, Andrew Morton wrote:
> On Thu, 21 Sep 2006 15:41:36 +0100
> Anton Altaparmakov <[email protected]> wrote:
> > On Thu, 2006-09-21 at 10:54 +0100, Anton Altaparmakov wrote:
> > > On Tue, 2006-09-12 at 20:56 -0700, Andrew Morton wrote:
> > > Andrew, thanks for forwarding me the message...
> > > > Begin forwarded message:
> > > >
> > > > We have a machine which is currently making heavy use of a usb hard disc
> > > > formatted with ntfs. There have been two occasions where the kernel has
> > > > oopsed while this disc was being accessed heavily. Before adding this HDD
> > > > the machine in question was rock solid which leads me to think that it
> > > > might be related to ntfs. USB drives formatted with other filesystems do
> > > > not appear to suffer from this problem.
> >
> > I have now seen such an oops too with 2.6.18 kernel.
>
> I assume it is a once-off?

So far yes. I now have seen a recursive locking thing reported by the
new lock analyzer but that looks like it has to do with NFS (my home
directory is on NFS) so I don't think it is in any way related.

> > Note no NTFS file
> > systems were mounted at the time (but I had an NTFS file system mounted
> > earlier in the day).
> >
> > The oops is caused by kswapd0 kernel thread, the stack trace is:
> >
> > Call Trace:
> > [<c10470a3>] shrink_inactive_list+0x46b/0x790
> > [<c104747c>] shrink_zone+0xb4/0xd3
> > [<c104797d>] kswapd+0x2de/0x3cf
> > [<c102c18e>] kthread+0xc2/0xf0
> > [<c1000bf1>] kernel_thread_helper+0x5/0xb
> > DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb
> > Leftover inexact backtrace:
> > [<c1003e6c>] show_stack_log_lvl+0x8c/0x97
> > [<c1003fc8>] show_registers+0x151/0x1c6
> > [<c10041af>] die+0x172/0x27b
> > [<c145f22c>] do_page_fault+0x42c/0x4f9
> > [<c10037dd>] error_code+0x39/0x40
> > [<c10470a3>] shrink_inactive_list+0x46b/0x790
> > [<c104747c>] shrink_zone+0xb4/0xd3
> > [<c104797d>] kswapd+0x2de/0x3cf
> > [<c102c18e>] kthread+0xc2/0xf0
> > [<c1000bf1>] kernel_thread_helper+0x5/0xb
> >
> > And the EIP is at fs/buffer.c::try_to_release_page() the code of which
> > is here:
> >
> > int try_to_release_page(struct page *page, gfp_t gfp_mask)
> > {
> >         struct address_space * const mapping = page->mapping;
> >
> >         BUG_ON(!PageLocked(page));
> >         if (PageWriteback(page))
> >                 return 0;
> >
> >         if (mapping && mapping->a_ops->releasepage)
> >
> > ^^^ bug happens here when the value of mapping->a_ops is used to obtain
> > mapping->a_ops->releasepage
> >
> >                 return mapping->a_ops->releasepage(page, gfp_mask);
> >         return try_to_free_buffers(page);
> > }
> >
> > This bug seems to suggest that the kernel is trying to release the
> > private data of a page which has page->mapping set to a valid value but
> > page->mapping->a_ops set to an invalid value, so that dereferencing
> > page->mapping->a_ops->releasepage causes an oops with the kernel saying:
> >
> > BUG: unable to handle kernel paging request at virtual address 020030d2
> >
> > The values of the relevant variables from the oops are:
> >
> > page = 0xc2248fa0
> > page->mapping = 0xe3a79eac
> > page->mapping->a_ops = 0x020030aa
>
> I wonder if page->mapping really wanted to be 0xc3a79eac, only something
> set bit 29.

I don't know, it could be but the machine is totally stable so I would
be surprised if it is bad ram...

> > Note that 0x020030aa+0x28 = 020030d2 which is the oops causing address
> > and 0x28 is the offset of the releasepage function pointer in the
> > address space operations structure...
> >
> > This oops is not identical to the oopses pointed out by Jonathan at:
> >
> > http://www.atrad.com.au/~jwoithe/kernel/oopses-20060913.txt
> >
> > But those oopses have to do with pages also so could be related...
>
> Looks a bit different - Jonathan appears to have pulled a bad page* out
> of the radix tree whereas you got your page off the LRU.
>
> > Anyone have any ideas how a page can end up in such a weird state?
>
> Nope.

)-:

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://www.linux-ntfs.org/ & http://www-stu.christs.cam.ac.uk/~aia21/

2006-09-22 07:05:37

by Jonathan Woithe

Subject: Re: Fw: 2.6.17 oops, possibly ntfs/mmap related

> > > > > We have a machine which is currently making heavy use of a usb hard disc
> > > > > formatted with ntfs. There have been two occasions where the kernel has
> > > > > oopsed while this disc was being accessed heavily. Before adding this HDD
> > > > > the machine in question was rock solid which leads me to think that it
> > > > > might be related to ntfs. USB drives formatted with other filesystems do
> > > > > not appear to suffer from this problem.
> > >
> > > I have now seen such an oops too with 2.6.18 kernel.
> >
> > I assume it is a once-off?
>
> So far yes. I now have seen a recursive locking thing reported by the
> new lock analyzer but that looks like it has to do with NFS (my home
> directory is on NFS) so I don't think it is in any way related.

Our setup also has user home directories on NFS, so that much at least is
common to our configurations. I don't know if the USB/NTFS user was writing
to their home directory as part of their work at the time of the oops
though.

Regards
jonathan

2006-09-22 15:40:55

by Dave Jones

Subject: Re: Fw: 2.6.17 oops, possibly ntfs/mmap related

On Fri, Sep 22, 2006 at 04:47:00PM +0930, Jonathan Woithe wrote:

> > Given a machine check happened, the state of the machine in general
> > is questionable. I'd recommend a run of memtest86+
>
> That was already done. No memory errors were reported over 10 passes.
>
> Secondly, the machine check indication was only present on one of the two
> oopses we saw. Furthermore, there was no indication in any log files
> that a machine check had occurred in the case of the second oops.
> Then again, perhaps machine checks don't get logged which would make this
> observation irrelevant.
>
> Could we be looking at a dying CPU?

Maybe. Or some other hardware problem. Insufficient cooling/power for eg.

Dave

2006-09-24 23:32:24

by Jonathan Woithe

Subject: Re: Fw: 2.6.17 oops, possibly ntfs/mmap related

> > > Given a machine check happened, the state of the machine in general
> > > is questionable. I'd recommend a run of memtest86+
> >
> > That was already done. No memory errors were reported over 10 passes.
> >
> > Secondly, the machine check indication was only present on one of the two
> > oopses we saw. Furthermore, there was no indication in any log files
> > that a machine check had occurred in the case of the second oops.
> > Then again, perhaps machine checks don't get logged which would make this
> > observation irrelevant.
> >
> > Could we be looking at a dying CPU?
>
> Maybe. Or some other hardware problem. Insufficient cooling/power for eg.

Power and cooling should be fine, and I've checked fans etc for correct
functioning - all is ok. The other thing worth noting is that the problems
with this machine only started once the USB/NTFS HDD started being used.
Before this the machine has been rock solid for 2+ years, and the usage
patterns haven't changed. Anyway, I'll keep an eye on it and post any
subsequent information as it becomes available.

Regards
jonathan