I have been trying to figure this out for a while now, with printk calls all over my kernel as well as adding kdb and tracing the int3 events.
I have tried various 2.6 kernels, and so far every one I have tried does this.
My current tests are on 2.6.22.10.
I have a simple, statically compiled test binary that serves as my init and is loaded into an init file system. I am not using cpio, but that did not seem to matter.
---- begin testinit.c
#include <stdio.h>
#include <unistd.h>
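int main(int argc, char *argv[])
{
    /* Print a marker so a successful exec of init is visible on the console. */
    printf("Hello world!\n");
    /* Sleep for a very long time so that init does not exit. */
    sleep(999999999);
    return 0;
}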
---- end
I am using syslinux to start my kernel and appending the following startup command. Most of this is specific to my real init script, but again I'm using a "hello world" program to debug this for now.
append debug kidb=early console=ttyS0,384008n initrd=ufoinit.img init=/testinit rw var_size=12M tmp_size=MAX log_size=16M root_size=64M root=/dev/ram0 boot=/dev/hda1,msdos rw pkgpath=/dev/hda1:msdos rw verbose DELAY=0 TEST=0 DEBUG=0 VERBOSE=0 UFO=root,etc,modules
On an Intel CA810EEA 800MHz board or QEMU this runs fine, but on the VIA boards it dies right after "Freeing unused kernel memory: 132k freed".
On the 500MHz board it dies every time; on the 800MHz board it is random.
I have noticed that the ELF binary loads into user space with some page faults, and then I get blasted with do_notify_resume with 0x04 (TIF_SIGPENDING) over and over, as if it is in an infinite loop.
This begins shortly after load_elf_binary -> clear_user, I think right after a page_fault during the clear_user. I don't even know why that signal is being sent; on other hardware it never happens.
I am not even sure what to try next.
I have been trying to get my distro onto 2.6 and away from 2.4 for a while now, but I'm currently stuck here.
Any and all suggestions and help are welcome.
My kernel config is here:
http://pastebin.cross-lfs.org/4561
Thanks in advance.
Regards
Sean Mathews
struct SoftwareProfessional {
double salary;
long lunches;
float jobs;
char unstable;
void work;
short tempers;
};
On Sat, Jan 05, 2008 at 05:14:08PM -0800, mathewss wrote:
> I have been trying to figure this out for a while now, with printk calls all over my kernel as well as adding kdb and tracing the int3 events.
>
> I have tried various 2.6 kernels, and so far every one I have tried does this.
>
> My current tests are on 2.6.22.10.
>
> I have a simple, statically compiled test binary that serves as my init and is loaded into an init file system. I am not using cpio, but that did not seem to matter.
> ---- begin testinit.c
> #include <stdio.h>
> #include <unistd.h>
> int main(int argc, char *argv[])
> {
> printf("Hello world!\n");
> sleep(999999999);
> }
> ---- end
>
> I am using syslinux to start my kernel and appending the following startup command. Most of this is specific to my real init script, but again I'm using a "hello world" program to debug this for now.
>
> append debug kidb=early console=ttyS0,384008n initrd=ufoinit.img init=/testinit rw var_size=12M tmp_size=MAX log_size=16M root_size=64M root=/dev/ram0 boot=/dev/hda1,msdos rw pkgpath=/dev/hda1:msdos rw verbose DELAY=0 TEST=0 DEBUG=0 VERBOSE=0 UFO=root,etc,modules
>
> On an Intel CA810EEA 800MHz board or QEMU this runs fine, but on the VIA boards it dies right after "Freeing unused kernel memory: 132k freed".
>
That will be where it invokes the init program, I think, so the
kernel is probably not to blame.
> On the 500MHz board it dies every time; on the 800MHz board it is random.
>
For the 500MHz, this sounds like the "i686 implies cmov" problem -
gcc thinks that all i686 CPUs understand a particular instruction
('cmov', if my brain cells haven't totally given up), but early via
processors didn't. Haven't seen too many references to this
recently, so perhaps recent versions of gcc have fixed this, or
perhaps people know of a workaround.
I suggest that your userspace (glibc and gcc, I suppose) is built
for i686 and uses the instruction that your CPU doesn't understand.
The 800MHz might be different; I thought those did provide the
instruction. Have you checked the memory with memtest86? For the
cases where it doesn't die, perhaps you should give it an init which
is going to do something, and see if it actually manages to boot any
of the time. If so, that would confirm that the two CPUs are not
identical in their capabilities. It wouldn't explain the less than
100% success, of course, so the usual suspects (crap hardware,
failing memory, dodgy power supplies) would need to be investigated.
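One way to compare what the two boards actually advertise is to look for "cmov" in the flags line of /proc/cpuinfo on each. The following is only a minimal sketch in C, assuming an x86 Linux system with /proc mounted; the helper names has_cpu_flag and cpuinfo_has_flag are invented for this illustration and do not come from any library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole word in a whitespace-separated
 * flags string (such as the "flags" line of /proc/cpuinfo), else 0. */
static int has_cpu_flag(const char *flags, const char *flag)
{
    size_t len;
    const char *p;

    assert(flags != NULL && flag != NULL);
    len = strlen(flag);
    if (len == 0)
        return 0;

    p = flags;
    while ((p = strstr(p, flag)) != NULL) {
        /* Reject partial matches such as "fcmov" when looking for "cmov". */
        int starts_word = (p == flags) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        int ends_word = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';
        if (starts_word && ends_word)
            return 1;
        p += len;
    }
    return 0;
}

/* Test the first "flags" line of /proc/cpuinfo for `flag`.
 * Returns 1 if present, 0 if absent, -1 if the file or line is unavailable.
 * Assumption: the flags line fits in the 4096-byte buffer. */
static int cpuinfo_has_flag(const char *flag)
{
    char line[4096];
    int result = -1;
    FILE *f = fopen("/proc/cpuinfo", "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "flags", 5) == 0) {
            const char *colon = strchr(line, ':');
            if (colon != NULL)
                result = has_cpu_flag(colon + 1, flag);
            break;
        }
    }
    fclose(f);
    return result;
}

#ifdef STANDALONE
/* Build with -DSTANDALONE to get a small command-line check. */
int main(void)
{
    int r = cpuinfo_has_flag("cmov");
    printf("cmov: %s\n", r == 1 ? "yes" : r == 0 ? "no" : "unknown");
    return 0;
}
#endif
```

If the 500MHz board reports no cmov and the 800MHz board does, that would be consistent with the two CPUs differing in capabilities. It would not by itself explain the random failures on the 800MHz board.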
As always, this is intended to be helpful, but treat it with a
grain of salt, I could well be talking out of a different orifice
than my mouth. My last experience with a via processor was a 1.2GHz
beastie which certainly understood all i686 instructions, but
managed to make snails look fast, and wasn't as power-frugal as
expected, so I might be prejudiced.
> I have noticed that the ELF binary loads into user space with some page faults, and then I get blasted with do_notify_resume with 0x04 (TIF_SIGPENDING) over and over, as if it is in an infinite loop.
>
> This begins shortly after load_elf_binary -> clear_user, I think right after a page_fault during the clear_user. I don't even know why that signal is being sent; on other hardware it never happens.
>
> I am not even sure what to try next.
>
Find a toolchain built for i586? (Or preferably i486; I think I
remember comments that early via CPUs run better when optimised for
i486).
If you think your own toolchain is compiled for i586, you could try
downloading one of the distros which definitely is built for i586 or
i486 - if that works, it's a userspace compile problem. Or, perhaps,
the kernel actually needs to be built for i486 - I doubt that, but I
don't have the hardware.
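As a quick check of what a given compiler invocation targets, a test init can report the x86 level from GCC's predefined macros. This is a rough sketch only: the function name compiled_target is made up here, it assumes GCC on 32-bit x86 (other compilers or architectures fall through to "unknown"), and it describes only this source file, not a static libc linked into the binary, which may have been built for a different level.

```c
#include <stdio.h>

/* Report the x86 level this file was compiled for, using macros that
 * GCC predefines according to -march. Only the generic 32-bit levels
 * are distinguished; anything else is reported as "unknown". */
static const char *compiled_target(void)
{
#if defined(__i686__) || defined(__pentiumpro__)
    return "i686";
#elif defined(__i586__) || defined(__pentium__)
    return "i586";
#elif defined(__i486__)
    return "i486";
#elif defined(__i386__)
    return "i386";
#else
    return "unknown";
#endif
}

#ifdef STANDALONE
/* Build with -DSTANDALONE to print the result. */
int main(void)
{
    printf("compiled for: %s\n", compiled_target());
    return 0;
}
#endif
```

Building the same file once with `gcc -march=i486 -static` and once with `gcc -march=i686 -static` and booting each would be one way to see whether the target level alone changes the behaviour.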
Ken
--
das eine Mal als Tragödie, das andere Mal als Farce
Ken, thanks for your insight.
You were correct that I was chasing multiple problems.
I still ended up crashing on boot after dealing with the processor issue.
As it turns out, a bug in bash 3.2 was the cause. As bash initializes, before
loading any script, it calls a built-in version of getcwd() containing a memcpy()
that reads out of bounds and may read across the stack and touch kernel
memory, resulting in a fault. In my case patch bash32-11 triggered the bug,
as this patch causes my build to force the use of this version of getcwd()
and not the one built into libc.
The handling of kernel memory faults for process ID 1 needs some work, IMHO.
As process ID 1 is neither kernel nor user, some special conditions have been
made in the kernel to deal with situations like this. The case of a segfault
into kernel memory for process ID 1 is not handled, so you end up with an
infinite loop in the kernel trying to deal with the fault, and the boot process
hangs with no visual indication as to what is wrong.
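For a test init, one possible way to get at least some indication is to install a SIGSEGV handler that writes a message to the console. The sketch below rests on assumptions: that the kernel delivers the signal to PID 1 once a handler is installed, and that file descriptor 2 is open on the console. The names report_fault, install_fault_reporter and fault_seen are hypothetical and exist only for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler so the rest of the program can tell a fault occurred. */
static volatile sig_atomic_t fault_seen = 0;

/* Handler restricted to async-signal-safe operations: set a flag and
 * write a fixed message with write(2). */
static void report_fault(int signo)
{
    static const char msg[] = "init: caught SIGSEGV\n";
    ssize_t n;

    (void)signo;
    fault_seen = 1;
    n = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)n;
}

/* Install the handler for SIGSEGV. SA_RESETHAND restores the default
 * disposition after the first delivery, so the message is printed once
 * rather than on every repeated fault. Returns 0 on success, -1 on error. */
static int install_fault_reporter(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = report_fault;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESETHAND;
    return sigaction(SIGSEGV, &sa, NULL);
}

#ifdef STANDALONE
/* Build with -DSTANDALONE for a test init that reports a segfault. */
int main(void)
{
    if (install_fault_reporter() != 0)
        perror("sigaction");
    printf("Hello world!\n");
    for (;;)
        sleep(999999999);
}
#endif
```

This does not fix anything in the kernel or in bash; it only makes a fault in the test program visible on the console once.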
Regards
Sean Mathews
struct SoftwareProfessional {
   double salary;
   long   lunches;
   float  jobs;
   char   unstable;
   void   work;
   short  tempers;
};
On Jan 5, 7:30 pm, Ken Moffat <[email protected]> wrote:
>
> As always, this is intended to be helpful, but treat it with a
> grain of salt, I could well be talking out of a different orifice
> than my mouth. My last experience with a via processor was a 1.2GHz
> beastie which certainly understood all i686 instructions, but
> managed to make snails look fast, and wasn't as power-frugal as
> expected, so I might be prejudiced.
>
> If you think your own toolchain is compiled for i586, you could try
> downloading one of the distros which definitely is built for i586 or
> i486 - if that works, it's a userspace compile problem. Or, perhaps,
> the kernel actually needs to be built for i486 - I doubt that, but I
> don't have the hardware.
>
> Ken
> --
> das eine Mal als Tragödie, das andere Mal als Farce