Greetings,
I'm encountering a rather annoying error on one of our AMD Opteron
boxes. I recompiled the working 2.6.17 kernel to add some extra SCSI
support and booted to find something similar to:
usbcore: registered new interface driver usb-storage
USB Mass Storage support registered.
usbcore: registered new interface driver hiddev
usbcore: registered new interface driver usbhid
drivers/usb/input/hid-core.c: v2.6:USB HID core driver
PNP: PS/2 Controller [PNP0303:KBD,PNP0f0e:PS2M] at 0x60,0x64 irq 1,12
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
mice: PS/2 mouse device common for all mice
device-mapper: ioctl: 4.10.0-ioctl (2006-09-14) initialised: [email protected]
device-mapper: multipath: version 1.0.5 loaded
device-mapper: multipath round-robin: version 1.0.0 loaded
device-mapper: multipath emc: version 0.0.3 loaded
TCP cubic registered
NET: Registered protocol family 1
NET: Registered protocol family 17
Freeing unused kernel memory: 324k freed
request_module: runaway loop modprobe binfmt-0000
request_module: runaway loop modprobe binfmt-0000
request_module: runaway loop modprobe binfmt-0000
request_module: runaway loop modprobe binfmt-0000
request_module: runaway loop modprobe binfmt-0000
... and then a hung kernel.
I figured I'd enabled/disabled something I shouldn't have and went
through the kernel config many times, recompiling with different
options (including ELF and misc emulation support), but to no avail.
As time ticked closer to 5pm on a Friday my interest decreased and so
I tried upgrading to 2.6.18 and then 2.6.19, with a tonne of different
configurations. Still nothing. The best I could achieve was on a fresh
2.6.19.1 I managed to pass the modprobe and load some input drivers,
but then it hung directly after that.
...
request_module: runaway loop modprobe binfmt-0000
request_module: runaway loop modprobe binfmt-0000
request_module: runaway loop modprobe binfmt-0000
input: AT Translated Set 2 keyboard as /class/input/input0
input: PS/2 Generic Mouse as /class/input/input1
Any advice on debugging this would be appreciated. How can I discover
what is trying to load binfmt-0000 and why is it looping?
Regards
--
Tom Lanyon
On Sat, 6 Jan 2007 09:43:14 +1030
"Tom Lanyon" <[email protected]> wrote:
> How can I discover
> what is trying to load binfmt-0000 and why is it looping?
Start with this, I guess..
--- a/kernel/kmod.c~a
+++ a/kernel/kmod.c
@@ -98,10 +98,12 @@ int request_module(const char *fmt, ...)
atomic_inc(&kmod_concurrent);
if (atomic_read(&kmod_concurrent) > max_modprobes) {
/* We may be blaming an innocent here, but unlikely */
- if (kmod_loop_msg++ < 5)
+ if (kmod_loop_msg++ < 5) {
printk(KERN_ERR
"request_module: runaway loop modprobe %s\n",
module_name);
+ dump_stack();
+ }
atomic_dec(&kmod_concurrent);
return -ENOMEM;
}
_
On 1/6/07, Andrew Morton <[email protected]> wrote:
> On Sat, 6 Jan 2007 09:43:14 +1030
> "Tom Lanyon" <[email protected]> wrote:
>
> > How can I discover
> > what is trying to load binfmt-0000 and why is it looping?
>
> Start with this, I guess..
>
> --- a/kernel/kmod.c~a
> +++ a/kernel/kmod.c
> @@ -98,10 +98,12 @@ int request_module(const char *fmt, ...)
> atomic_inc(&kmod_concurrent);
> if (atomic_read(&kmod_concurrent) > max_modprobes) {
> /* We may be blaming an innocent here, but unlikely */
> - if (kmod_loop_msg++ < 5)
> + if (kmod_loop_msg++ < 5) {
> printk(KERN_ERR
> "request_module: runaway loop modprobe %s\n",
> module_name);
> + dump_stack();
> + }
> atomic_dec(&kmod_concurrent);
> return -ENOMEM;
> }
> _
>
>
Thanks for the reply, Andrew.
How interesting... added that to kmod.c, rebuilt without change to
config, reboot.... machine booted perfectly!
I'm going to leave it for now, but I'll leave the dump_stack() call in
there in case further issues arise.
Regards
--
Tom Lanyon
Hi Andrew,
Thanks for the email back.
Andrew Morton wrote:
>On Thu, 03 Jan 2008 12:18:19 -0600 Tom Gall <[email protected]> wrote:
>
>
>>This is an interesting one as I just hit this with a 2.6.23.12 kernel
>>... granted running on systemsim (the powerpc simulator) ... but as I'm
>>hitting a wall I dropped in the dump_stack call and here is the output:
>>
>>
>
>I don't remember what any of this is about and there is no record on the
>mailing list - presumably because you're ccing [email protected] and not
>linux-kernel@...
>
>
Ever written the previous year on your checks? Well when I was googling
about for folks with similar woes, the posts I had thought were from
this January were actually from Jan 2007. O Well! Not to mention it's
been awhile since I've posted to lkml ....
>I suggest that we start again with this bug report from scratch.
>
>
In this case, not needed. I spent friday taking a tour of the codepath
and basically narrowed it down to a device driver passing bum data.
So if this ever happens in the future and someone somewhere seems the
message about trying to load binfmt-xxxx (where xxxx is 0000 or some
other "random" number * (hold that thought I'll come back to this) )
here are some things to consider as you're debugging your problem.
In my case, the device driver was just good enough that it could access
the device, but was poor enough that the bits it was loading were
complete crap ... So the kernel was happily running along booting,
gets to the point where it's time to exec init, loads the first bits of
the binary, does a compare to make sure it's an elf binary, fails the
compare (because the data is junk) ... the kernel figures that ooo ..
this data must be good it's just in a different format ... I'll run
through the list of known binary file formats.. hmm not that one .. or
that one .... then running out of options trys to load a module for
what must be this amazing new file format, can't find the module and
ker-blew-ee.
* So this number that is part of the module name binfmt-xxxx is from
what I see the id of the file format ... thankfully no one has choosen
0000 as their id ...
>Unless it's a potential denial-of-service attack for unprivileged users,
>in which case a cc to [email protected] migt be more appropriate.
>
>
As this is a boot time issue it's certainly not a DOS. Now there might
be value is thinking about this error path in the initial boot up of the
kernel. The message is a bit obscure for what the real issue is. Still
.. should one go back to the device and comfirm that the data received
was actually reasonable? No.
So I think we have two cases here:
Bootup:
I could see a change where the system should probably panic with some
sort of message like : "Not able to exec %s, unrecognized file format"
if this were to happen at boot up. I could put together a patch.
Running System:
If this happens after the system is up and running, I'd think one is
going to have some data in dmeg probably to the tune of "media error"
... so in my mind, who cares about that case.
Regards,
Tom