LinuxLists.cc - Is this Oops due to messed up memory? - And how to protect fs during driver development?

2005-10-26 17:04:08

Subject: Is this Oops due to messed up memory? - And how to protect fs during driver development?

Hi There!

During my DMA driver development on an embedded MPC8540 (e500) PPC processor, which
is work in progress and definitely dangerous and unstable, I received the oopses shown
below... not really during a DMA transfer but later, during recompilation of my driver.
I guess that I corrupted my memory very badly (misprogrammed DMA).
The kernel is still 2.6.13-rc7.

How can I avoid crashing my filesystem during driver development as much as possible?
Well, I do backups, but is there something like a temporary "remount it physically
read-only for the next 10 seconds" thingy?

Thanks,

Clemens

Oct 26 18:32:32 ecam kernel: Oops: kernel access of bad area, sig: 11 [#1]
Oct 26 18:32:32 ecam kernel: NIP: C0048070 LR: C004818C SP: C05CBEC0 REGS: c05cbe10 TRAP: 0300 Not tainted
Oct 26 18:32:32 ecam kernel: MSR: 00021000 EE: 0 PR: 0 FP: 0 ME: 1 IR/DR: 00
Oct 26 18:32:32 ecam kernel: DAR: 00000000, DSISR: 00800000
Oct 26 18:32:32 ecam kernel: TASK = c05bf100[3] 'events/0' THREAD: c05ca000
Oct 26 18:32:32 ecam kernel: Last syscall: -1
Oct 26 18:32:32 ecam kernel: GPR00: 00200200 C05CBEC0 C05BF100 C0559340 C055E930 00000001 C0566C14 CA5E1F80
Oct 26 18:32:32 ecam kernel: GPR08: 00000000 00000001 C7AD4000 00100100 00000000 00000000 10000000 00000000
Oct 26 18:32:32 ecam kernel: GPR16: 00000001 00000001 FFFFFFFF 007FFF00 0FFFA600 00000000 00000002 0FFAF3B0
Oct 26 18:32:32 ecam kernel: GPR24: C02D0000 C055935C C02D0000 C055E930 00000001 C055934C 00000000 C0559340
Oct 26 18:32:32 ecam kernel: NIP [c0048070] free_block+0xb0/0x140
Oct 26 18:32:32 ecam kernel: LR [c004818c] drain_array_locked+0x8c/0xd4
Oct 26 18:32:32 ecam kernel: Call trace:
Oct 26 18:32:32 ecam kernel: [c004818c] drain_array_locked+0x8c/0xd4
Oct 26 18:32:32 ecam kernel: [c0048fec] cache_reap+0x84/0x1e4
Oct 26 18:32:32 ecam kernel: [c0030640] worker_thread+0x174/0x218
Oct 26 18:32:32 ecam kernel: [c003574c] kthread+0xec/0x128
Oct 26 18:32:32 ecam kernel: [c00050f0] kernel_thread+0x44/0x60
Oct 26 18:34:53 ecam kernel: Oops: kernel access of bad area, sig: 11 [#2]
Oct 26 18:34:53 ecam kernel: NIP: C00CAA28 LR: C00CAB48 SP: C07C3CE0 REGS: c07c3c30 TRAP: 0300 Not tainted
Oct 26 18:34:53 ecam kernel: MSR: 00029000 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 00
Oct 26 18:34:53 ecam kernel: DAR: 2F6C643A, DSISR: 00000000
Oct 26 18:34:53 ecam kernel: TASK = c07c1980[43] 'pdflush' THREAD: c07c2000
Oct 26 18:34:53 ecam kernel: Last syscall: -1
Oct 26 18:34:53 ecam kernel: GPR00: 2F6C643A C07C3CE0 C07C1980 00000000 C19439F0 D10FF0CC C00CA5F4 00000000
Oct 26 18:34:53 ecam kernel: GPR08: CA5E16E0 C07C2000 CA5E16E8 00000000 00000001 00000000 00000000 D1107104
Oct 26 18:34:53 ecam kernel: GPR16: 067D4870 D1124F78 C31B1000 C07C3E70 00B3812B C7C36000 00000012 00000000
Oct 26 18:34:53 ecam kernel: GPR24: 00001AB0 0000010D 00000001 C07C3D80 00000000 D10FF0CC CCA90F74 2F6C642E
Oct 26 18:34:53 ecam kernel: NIP [c00caa28] write_ordered_buffers+0x54/0x248
Oct 26 18:34:53 ecam kernel: LR [c00cab48] write_ordered_buffers+0x174/0x248
Oct 26 18:34:55 ecam kernel: Call trace:
Oct 26 18:34:55 ecam kernel: [c00caedc] flush_commit_list+0x168/0x5b0
Oct 26 18:34:55 ecam kernel: [c00d0e6c] do_journal_end+0xa64/0xa94
Oct 26 18:34:55 ecam kernel: [c00cf7d4] journal_end_sync+0x8c/0xa0
Oct 26 18:34:55 ecam kernel: [c00b57dc] reiserfs_sync_fs+0x4c/0x88
Oct 26 18:34:55 ecam kernel: [c00b582c] reiserfs_write_super+0x14/0x24
Oct 26 18:34:55 ecam kernel: [c0067038] sync_supers+0x1a8/0x1ac
Oct 26 18:34:55 ecam kernel: [c0045a24] wb_kupdate+0x5c/0x168
Oct 26 18:34:55 ecam kernel: [c004683c] pdflush+0x120/0x1e0
Oct 26 18:34:55 ecam kernel: [c003574c] kthread+0xec/0x128
Oct 26 18:34:55 ecam kernel: [c00050f0] kernel_thread+0x44/0x60
Oct 26 18:34:55 ecam kernel: Badness in do_exit at kernel/exit.c:787
Oct 26 18:34:55 ecam kernel: Call trace:
Oct 26 18:34:55 ecam kernel: [c0003514] check_bug_trap+0x98/0xdc
Oct 26 18:34:55 ecam kernel: [c00037b4] ProgramCheckException+0x25c/0x4c8
Oct 26 18:34:55 ecam kernel: [c0002b40] ret_from_except_full+0x0/0x4c
Oct 26 18:34:55 ecam kernel: [c0020278] do_exit+0x24/0xad0
Oct 26 18:34:55 ecam kernel: [c0003130] _exception+0x0/0xa8
Oct 26 18:34:55 ecam kernel: [c000b0b8] bad_page_fault+0x58/0x5c
Oct 26 18:34:55 ecam kernel: [c00029d4] handle_page_fault+0x7c/0x80
Oct 26 18:34:55 ecam kernel: [c00cab48] write_ordered_buffers+0x174/0x248
Oct 26 18:34:55 ecam kernel: [c00caedc] flush_commit_list+0x168/0x5b0
Oct 26 18:34:55 ecam kernel: [c00d0e6c] do_journal_end+0xa64/0xa94
Oct 26 18:34:55 ecam kernel: [c00cf7d4] journal_end_sync+0x8c/0xa0
Oct 26 18:34:55 ecam kernel: [c00b57dc] reiserfs_sync_fs+0x4c/0x88
Oct 26 18:34:55 ecam kernel: [c00b582c] reiserfs_write_super+0x14/0x24
Oct 26 18:34:55 ecam kernel: [c0067038] sync_supers+0x1a8/0x1ac
Oct 26 18:34:55 ecam kernel: [c0045a24] wb_kupdate+0x5c/0x168

--
Clemens Koller
_______________________________
R&D Imaging Devices
Anagramm GmbH
Rupert-Mayer-Str. 45/1
81379 Muenchen
Germany

http://www.anagramm.de
Phone: +49-89-741518-50
Fax: +49-89-741518-19

2005-10-26 17:28:29

by linux-os (Dick Johnson)

Subject: Re: Is this Oops due to messed up memory? - And how to protect fs during driver development?

On Wed, 26 Oct 2005, Clemens Koller wrote:

> Hi There!
>
> During my DMA driver development on an embedded MPC8540 (e500) PPC processor, which
> is work in progress and definitely dangerous and unstable, I received the oopses shown
> below... not really during a DMA transfer but later, during recompilation of my driver.
> I guess that I corrupted my memory very badly (misprogrammed DMA).
> The kernel is still 2.6.13-rc7.
>
> How can I avoid crashing my filesystem during driver development as much as possible?
> Well, I do backups, but is there something like a temporary "remount it physically
> read-only for the next 10 seconds" thingy?
>
> Thanks,
>
> Clemens
>
[SNIPPED crash]

DMA can write ANYWHERE! It doesn't know anything about CPU page
protection, etc. It doesn't use the CPU! So, if you trash some
buffer that eventually gets written to your file-system, you can
trash the file-system and destroy all the work you ever did
on the system.
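
One cheap sanity check on the "messed up memory" guess: in the second oops of the original report the faulting address (DAR) is 2F6C643A, and GPR31 holds 2F6C642E. Kernel pointers on that box all start with C0..., so a "pointer" made entirely of printable bytes is suspicious. The sketch below decodes a register value as four big-endian bytes (the MPC8540 is big-endian). The helper names `reg_to_ascii` and `looks_like_text` are made up for illustration, and a text-like value is only a hint that data landed on top of a pointer, not proof that the DMA engine did it.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a 32-bit register value as four big-endian bytes.
 * Non-printable bytes are shown as '.'. `out` must hold 5 chars. */
void reg_to_ascii(uint32_t value, char out[5])
{
    assert(out != NULL);
    for (int i = 0; i < 4; i++) {
        unsigned char byte = (value >> (24 - 8 * i)) & 0xFF;
        out[i] = isprint(byte) ? (char)byte : '.';
    }
    out[4] = '\0';
}

/* Return 1 if all four bytes of the value are printable ASCII, else 0. */
int looks_like_text(uint32_t value)
{
    for (int i = 0; i < 4; i++) {
        if (!isprint((int)((value >> (24 - 8 * i)) & 0xFF)))
            return 0;
    }
    return 1;
}
```

Feeding it 0x2F6C643A gives the four characters `/ld:`, which looks a lot more like a fragment of text (plausible during a recompile) than like an address, whereas a real kernel address such as the NIP value C00CAA28 does not decode to printable text.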

There are several solutions:

(1) Do initial testing on a "throw-away" system that only
has some tools and the kernel installed (like a fresh
copy of your favorite distribution) -- nothing else.
Use the network to copy over your completed work. Do
NOT mount your remote system via NFS.

(2) Boot on a throw-away partition then mount your work
partition for design/development. Unmount that partition
before you insert or try to test your module.

(3) Don't ever write DMA code.

You need to remember that simple "off-by-one" errors that are hard to
find in user-mode code will still be hard to find in kernel code
-- and harder when you have to install everything from scratch each
time you try to insert your module!

A simple question: with the old PC/AT DMA controller, do you
program it for a byte-count or a word-count? Is the count
exactly the required transfer count or is it different?

Trick questions, but they MUST be answered before you try
to run any code using the device(s).
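
For what it's worth, here is a sketch of how those answers are usually given for the 8237 as wired in the PC/AT, to the best of my understanding: channels 0-3 count bytes, channels 5-7 count 16-bit words, channel 4 is the cascade, and the count register is loaded with the number of transfers minus one. Check this against the datasheet for whatever you really have. The function name `isa_dma_count_reg` is hypothetical and the code only computes the register value; it touches no hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Compute the 16-bit value for an 8237-style count register.
 * Returns 0 on success, -1 if the request cannot be expressed.
 * Assumption: the controller performs (register value + 1) transfers. */
int isa_dma_count_reg(unsigned channel, uint32_t nbytes, uint16_t *reg)
{
    uint32_t transfers;

    if (reg == NULL || nbytes == 0 || channel > 7 || channel == 4)
        return -1;                /* channel 4 is the cascade channel */

    if (channel <= 3) {
        transfers = nbytes;       /* 8-bit channels count bytes */
    } else {
        if (nbytes & 1)
            return -1;            /* 16-bit channels move whole words */
        transfers = nbytes / 2;   /* 16-bit channels count words */
    }

    if (transfers > 65536)
        return -1;                /* 16-bit counter: at most 65536 transfers */

    *reg = (uint16_t)(transfers - 1);
    return 0;
}
```

So under these assumptions a 512-byte buffer is programmed as 511 on channel 1 but as 255 on channel 5, and a zero-length request is refused outright rather than handed to the controller.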

There are similar off-by-zillions problems with many bus-mastering
DMA controllers. Some can't take a count of 0 because the count
is decremented AFTER each transfer, so a count of 0 wraps around
instead of stopping!!! That could trash 32 bits' worth of
address space! So, if you have scatter-lists that are
dynamically built, you need to check out the corner cases long
before you throw the code off-the-cliff and hope it will fly!
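
Checking those corner cases can be as simple as refusing to hand the controller any scatter-list entry that is zero-length, too long, or outside the memory you meant it to touch. The sketch below does that in plain C, so it can be exercised in user space before it goes anywhere near a module. The `sg_entry` and `dma_window` structures and the function `sg_list_is_safe` are hypothetical; real descriptor layouts and limits come from your controller's manual.

```c
#include <stddef.h>
#include <stdint.h>

/* One hypothetical scatter-list entry: bus address and length in bytes. */
struct sg_entry {
    uint32_t addr;
    uint32_t len;
};

/* The only region the device is supposed to touch, plus the largest
 * length one descriptor may carry (both assumed, controller-specific). */
struct dma_window {
    uint32_t base;
    uint32_t size;
    uint32_t max_len;
};

/* Return 1 if every entry is non-empty, within max_len and entirely
 * inside the window; return 0 otherwise. */
int sg_list_is_safe(const struct sg_entry *sg, size_t n,
                    const struct dma_window *win)
{
    if (sg == NULL || win == NULL || n == 0)
        return 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t len = sg[i].len;
        uint32_t off;

        if (len == 0)              /* a 0 count may wrap to 0xFFFFFFFF */
            return 0;
        if (len > win->max_len)
            return 0;
        if (sg[i].addr < win->base)
            return 0;

        off = sg[i].addr - win->base;
        /* Written so that addr + len can never overflow 32 bits. */
        if (off >= win->size || len > win->size - off)
            return 0;
    }
    return 1;
}
```

The idea is to call something like this right after the list is built and to bail out with an error instead of starting the transfer when it returns 0. It will not catch a wrong window, but it does catch the zero-count and run-off-the-end cases.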

What I do with DMA code, and I'm supposed to be experienced, is
ftp it to a "target" that I can trash. Even code that has
been tested and "known to work" gets tested on a trash target.

Trash targets are cheap. Don't test new drivers on your development
machine. If you do, sooner or later you WILL destroy at least
a day's work, maybe more.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.55 BogoMips).
Warning : 98.36% of all statistics are fiction.
.
