I am working on a port for the bcm33xx platform, which includes a MIPS CPU:
http://wiki.openwrt.org/BroadcomBCM33xxPort
However, I seem to be hitting problems with memory allocation, as
load_elf_binary is segfaulting inside padzero/__bzero.
This is my first attempt at real Linux kernel development, so if anyone could
throw me pointers on how I would approach debugging and fixing this problem,
I'd really appreciate it. (boot log with early printk is attached)
Thanks,
Luke
On Sat, 2008-06-07 at 20:19 -0500, Luke -Jr wrote:
> I am working on a port for the bcm33xx platform, which includes a MIPS CPU:
> http://wiki.openwrt.org/BroadcomBCM33xxPort
>
> However, I seem to be hitting problems with memory allocation, as
> load_elf_binary is segfaulting inside padzero/__bzero.
Seems to have issues way before that ;-)
>
> This is my first attempt at real Linux kernel development, so if anyone could
> throw me pointers on how I would approach debugging and fixing this problem,
> I'd really appreciate it. (boot log with early printk is attached)
I'm not too up on MIPS but there're a few things in the log which stand
out to me:
Determined physical RAM map:
memory: 00fa0000 @ 00000000 (usable)
User-defined physical RAM map:
memory: 007a1200 @ 00000000 (usable)
Can you confirm these sizes and locations for RAM? Does anything change
if you don't force the size constraint?
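For reference, the two sizes in the RAM map convert as follows: 0x00fa0000 is 16,384,000 bytes (15.625 MiB) and 0x007a1200 is exactly 8,000,000 bytes, which is less than 8 MiB (8,388,608 bytes). A minimal sketch of that conversion; the helper names are illustrative and not taken from the kernel:

```c
#include <stdint.h>
#include <stdio.h>

#define MIB (1024u * 1024u)

/* Returns 1 if the size is an exact multiple of 1 MiB, else 0. */
static int is_whole_mib(uint32_t size_bytes)
{
    return size_bytes % MIB == 0;
}

/* Print a RAM-map size in hex, decimal bytes and MiB. */
static void describe_ram_size(const char *label, uint32_t size_bytes)
{
    printf("%s: 0x%08x = %u bytes = %.3f MiB%s\n",
           label, (unsigned)size_bytes, (unsigned)size_bytes,
           (double)size_bytes / MIB,
           is_whole_mib(size_bytes) ? "" : " (not a whole number of MiB)");
}
```

Calling describe_ram_size("user-defined", 0x007a1200) makes it easy to see that the forced size is a decimal "8000000" rather than a power-of-two boundary.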
CPU frequency 32.00 MHz
Really? Is your bootloader setting the CPU up correctly before handing
control to Linux?
irq 8: nobody cared (try booting with the "irqpoll" option)
What's on IRQ8, should anyone care at this early stage? Did the
bootloader enable this (should it)?
Reserved instruction in kernel code[#1]:
You're compiling with an appropriate -march switch?
Good luck :-)
--Ben.
On Saturday 07 June 2008, you wrote:
> On Sat, 2008-06-07 at 20:19 -0500, Luke -Jr wrote:
> > I am working on a port for the bcm33xx platform, which includes a MIPS
> > CPU: http://wiki.openwrt.org/BroadcomBCM33xxPort
> >
> > However, I seem to be hitting problems with memory allocation, as
> > load_elf_binary is segfaulting inside padzero/__bzero.
>
> Seems to have issues way before that ;-)
No doubt, but those don't appear to be getting in my way yet ;)
> > This is my first attempt at real Linux kernel development, so if anyone
> > could throw me pointers on how I would approach debugging and fixing this
> > problem, I'd really appreciate it. (boot log with early printk is
> > attached)
>
> I'm not too up on MIPS but there're a few things in the log which stand
> out to me:
>
> Determined physical RAM map:
> memory: 00fa0000 @ 00000000 (usable)
> User-defined physical RAM map:
> memory: 007a1200 @ 00000000 (usable)
>
> Can you confirm these sizes and locations for RAM? Does anything change
> if you don't force the size constraint?
According to http://research.msrg.utoronto.ca/ece344/2007s/os161/mips.html ,
MIPS has a pretty odd memory layout, and I'm honestly not sure how Linux
usually handles it. I don't feel competent to try and summarize the details
on that page here.
> CPU frequency 32.00 MHz
>
> Really? Is your bootloader setting the CPU up correctly before handing
> control to Linux?
The CPU is 200 MHz, I believe. The bootloader is just a part of VxWorks, not
really meant to boot anything else.
> irq 8: nobody cared (try booting with the "irqpoll" option)
>
> What's on IRQ8, should anyone care at this early stage? Did the
> bootloader enable this (should it)?
No idea, sorry.
> Reserved instruction in kernel code[#1]:
>
> You're compiling with an appropriate -march switch?
I believe so... It appears to be a "reserved instruction" only because of the
memory area it tries to access. The instruction in question is "store word",
nothing complex.
Thanks,
Luke
On Sat, 7 Jun 2008, Luke -Jr wrote:
> > I'm not too up on MIPS but there're a few things in the log which stand
> > out to me:
> >
> > Determined physical RAM map:
> > memory: 00fa0000 @ 00000000 (usable)
> > User-defined physical RAM map:
> > memory: 007a1200 @ 00000000 (usable)
> >
> > Can you confirm these sizes and locations for RAM? Does anything change
> > if you don't force the size constraint?
>
> According to http://research.msrg.utoronto.ca/ece344/2007s/os161/mips.html ,
> MIPS has a pretty odd memory layout, and I'm honestly not sure how Linux
> usually handles it. I don't feel competent to try and summarize the details
> on that page here.
Nothing odd about the memory layout I would say unless you want to go
beyond 512MB with a 32-bit system which is not the case here.
> > CPU frequency 32.00 MHz
> >
> > Really? Is your bootloader setting the CPU up correctly before handing
> > control to Linux?
>
> The CPU is 200 MHz, I believe. The bootloader is just a part of VxWorks, not
> really meant to boot anything else.
CFE is pretty much standard for Broadcom platforms and far from being
specific to VxWorks.
The clock frequency of the CPU is calculated by Linux based on the rate
of the internal timer calibrated against the real-time clock or possibly
another reference of a known rate. That's assuming board setup has got it
right, of course.
I'd be more concerned about:
Calibrating delay loop (skipped)... 0.00 BogoMIPS preset
> > irq 8: nobody cared (try booting with the "irqpoll" option)
> >
> > What's on IRQ8, should anyone care at this early stage? Did the
> > bootloader enable this (should it)?
>
> No idea, sorry.
What the bootloader does should not matter -- whatever piece of code
initializes the interrupt controller used for IRQ 8 should mask all the
sources off till a handler is installed anyway.
> > Reserved instruction in kernel code[#1]:
> >
> > You're compiling with an appropriate -march switch?
>
> I believe so... It appears to be a "reserved instruction" only because of the
> memory area it tries to access. The instruction in question is "store word",
> nothing complex.
You have got something seriously broken -- __bzero traps exceptions on
stores for graceful recovery as user addresses may be accessed as is the
case here. If the reserved instruction exception handler is reached, then
clearly the store instruction is not the immediate cause.
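The "graceful recovery" mentioned above relies on an exception table: each protected instruction has an entry naming the address to resume at if it faults. The following is a simplified userspace model of that lookup, not the kernel's actual code; the struct layout, function names and all addresses used with it are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified model of one exception table entry. */
struct ex_entry {
    uint32_t insn;   /* address of an instruction that may fault */
    uint32_t fixup;  /* address to resume at if it does */
};

/* Binary search for the faulting PC; the table must be sorted by insn. */
static const struct ex_entry *find_fixup(const struct ex_entry *tbl,
                                         size_t n, uint32_t epc)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tbl[mid].insn == epc)
            return &tbl[mid];
        if (tbl[mid].insn < epc)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}

/* Returns the PC to resume at, or 0 if the fault has no fixup
 * (the point at which a real kernel would report an oops). */
static uint32_t resume_pc(const struct ex_entry *tbl, size_t n, uint32_t epc)
{
    const struct ex_entry *e = find_fixup(tbl, n, epc);

    return e ? e->fixup : 0;
}
```

In this model a faulting store inside __bzero would find its entry and resume at the fixup address, so ending up in a different exception handler suggests something went wrong before or during that lookup.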
You might be better off asking questions at <[email protected]>.
Maciej
On Saturday 07 June 2008, Maciej W. Rozycki wrote:
> On Sat, 7 Jun 2008, Luke -Jr wrote:
> > > I'm not too up on MIPS but there're a few things in the log which stand
> > > out to me:
> > >
> > > Determined physical RAM map:
> > > memory: 00fa0000 @ 00000000 (usable)
> > > User-defined physical RAM map:
> > > memory: 007a1200 @ 00000000 (usable)
> > >
> > > Can you confirm these sizes and locations for RAM? Does anything
> > > change if you don't force the size constraint?
> >
> > According to
> > http://research.msrg.utoronto.ca/ece344/2007s/os161/mips.html , MIPS has
> > a pretty odd memory layout, and I'm honestly not sure how Linux usually
> > handles it. I don't feel competent to try and summarize the details on
> > that page here.
>
> Nothing odd about the memory layout I would say unless you want to go
> beyond 512MB with a 32-bit system which is not the case here.
Well, I always imagined memory layout as being a simple flat range from 0 to
all_memory_in_system, but this is my first experience with it at such a low
level, so I guess I don't know what's "odd" or "normal".
> > > CPU frequency 32.00 MHz
> > >
> > > Really? Is your bootloader setting the CPU up correctly before handing
> > > control to Linux?
> >
> > The CPU is 200 MHz, I believe. The bootloader is just a part of VxWorks,
> > not really meant to boot anything else.
>
> CFE is pretty much standard for Broadcom platforms and far from being
> specific to VxWorks.
VxWorks, including the boot loader, is not CFE as far as I am aware. If you're
referring to the "CFEv2" in the log, that appears to be the default of a
switch (eg, if Linux doesn't detect anything else).
> I'd be more concerned about:
>
> Calibrating delay loop (skipped)... 0.00 BogoMIPS preset
The calibration code was crashing, so I set it to a fixed 1 value.
Worst case, some code won't delay as long as it wants to, right?
> > > Reserved instruction in kernel code[#1]:
> > >
> > > You're compiling with an appropriate -march switch?
> >
> > I believe so... It appears to be a "reserved instruction" only because of
> > the memory area it tries to access. The instruction in question is "store
> > word", nothing complex.
>
> You have got something seriously broken -- __bzero traps exceptions on
> stores for graceful recovery as user addresses may be accessed as is the
> case here. If the reserved instruction exception handler is reached, then
> clearly the store instruction is not the immediate cause.
What else could it be?
On Sat, 7 Jun 2008, Luke -Jr wrote:
> Well, I always imagined memory layout as being a simple flat range from 0 to
> all_memory_in_system, but this is my first experience with it at such a low
> level, so I guess I don't know what's "odd" or "normal".
You mean the layout of virtual memory? Well, have a look at what the
Alpha defines as sparse memory for something certainly less
straightforward than what MIPS segments are. Anyway, what's reported here
is physical memory and there is nothing special about it.
> VxWorks, including the boot loader, is not CFE as far as I am aware. If you're
> referring to the "CFEv2" in the log, that appears to be the default of a
> switch (eg, if Linux doesn't detect anything else).
That message is not included in the standard kernel -- how can I know it
is meaningless? As I wrote CFE is standard Broadcom firmware.
> The calibration code was crashing, so I set it to a fixed 1 value.
> Worst case, some code won't delay as long as it wants to, right?
That's grossly wrong. If you need to preset it for the time being till
you debug calibration, then for a MIPS processor assume one instruction
per clock tick and two instructions per loop -- that may not be entirely
correct, but is a good approximation. Otherwise you risk peripheral
devices not being driven correctly, with all sorts of nasty results.
> > You have got something seriously broken -- __bzero traps exceptions on
> > stores for graceful recovery as user addresses may be accessed as is the
> > case here. If the reserved instruction exception handler is reached, then
> > clearly the store instruction is not the immediate cause.
>
> What else could it be?
Well, you've got the system and I have no crystal ball. You have means
to debug it. See how control is passed to the RI exception. Find which
of the TLB exceptions happens and how it proceeds. Etc...
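One way to tell which exception was taken is to decode the ExcCode field (bits 6..2) of the CP0 Cause register. A small sketch using the standard MIPS32 exception codes; the function names are made up for illustration, and any Cause value fed to it would have to come from your own register dump:

```c
#include <stdint.h>

/* ExcCode occupies bits 6..2 of the CP0 Cause register. */
static unsigned cause_exccode(uint32_t cause)
{
    return (cause >> 2) & 0x1f;
}

/* Short names for the exception codes most relevant here. */
static const char *exccode_name(unsigned code)
{
    switch (code) {
    case 0:  return "Int";   /* interrupt */
    case 1:  return "Mod";   /* TLB modified (store to clean page) */
    case 2:  return "TLBL";  /* TLB miss/invalid on load or fetch */
    case 3:  return "TLBS";  /* TLB miss/invalid on store */
    case 4:  return "AdEL";  /* address error on load or fetch */
    case 5:  return "AdES";  /* address error on store */
    case 10: return "RI";    /* reserved instruction */
    default: return "other";
    }
}
```

A store to an unmapped user address would normally be expected to show up as TLBS or Mod, not RI.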
Maciej
On Sunday 08 June 2008, Maciej W. Rozycki wrote:
> On Sat, 7 Jun 2008, Luke -Jr wrote:
> > VxWorks, including the boot loader, is not CFE as far as I am aware. If
> > you're referring to the "CFEv2" in the log, that appears to be the
> > default of a switch (eg, if Linux doesn't detect anything else).
>
> That message is not included in the standard kernel -- how can I know it
> is meaningless? As I wrote CFE is standard Broadcom firmware.
It's not? Guess it came from the bcm63xx patches OpenWrt has that I'm using as
a base for this... Either way, it seems unlikely something claiming to
be "VxWorks System Boot" is a standard firmware.
> > The calibration code was crashing, so I set it to a fixed 1 value.
> > Worst case, some code won't delay as long as it wants to, right?
>
> That's grossly wrong. If you need to preset it for the time being till
> you debug calibration, then for a MIPS processor assume one instruction
> per clock tick and two instructions per loop -- that may not be entirely
> correct, but is a good approximation. Otherwise you risk peripheral
> devices not being driven correctly, with all sorts of nasty results.
Meaning this?
preset_lpj = loops_per_jiffy = 2;
> > > You have got something seriously broken -- __bzero traps exceptions on
> > > stores for graceful recovery as user addresses may be accessed as is
> > > the case here. If the reserved instruction exception handler is
> > > reached, then clearly the store instruction is not the immediate cause.
> >
> > What else could it be?
>
> Well, you've got the system and I have no crystal ball. You have means
> to debug it. See how control is passed to the RI exception. Find which
> of the TLB exceptions happens and how it proceeds. Etc...
Unfortunately, I don't understand how to "see how control is passed" or
how to find TLB exceptions... Could you point me in the right direction to learn
about this?
On Sunday 08 June 2008, Kevin D. Kissell wrote:
> The universe of possible failures is large. The two most likely categories
> are (a) configuring the build for a variant of the architecture (64-bit,
> MIPS32R2) that your hardware doesn't support - this is what Maciej was
> referring to,
CONFIG_CPU_MIPS32_R1=y
> and (b) control being transferred to a block of memory that isn't actually
> code, as can happen if exception vectors or global pointers-to-functions
> aren't set up correctly, or if the kernel stack is being corrupted. When
> you say "the instruction in question is a store word", how do you know that?
The RI error spits out a bunch of info, including epc which presumably points
to the instruction causing the problem: ac85ffc0; this is 'sw a1,-64(a0)'
Luke
On Sun, 8 Jun 2008, Luke -Jr wrote:
> It's not? Guess it came from the bcm63xx patches OpenWrt has that I'm using as
> a base for this... Either way, it seems unlikely something claiming to
> be "VxWorks System Boot" is a standard firmware.
It would be best if the patches you are referring to got merged with the
mainline. Otherwise whoever uses them is essentially on their own --
people lack the resources needed to chase random changes out there in
general.
> > That's grossly wrong. If you need to preset it for the time being till
> > you debug calibration, then for a MIPS processor assume one instruction
> > per clock tick and two instructions per loop -- that may not be entirely
> > correct, but is a good approximation. Otherwise you risk peripheral
> > devices not being driven correctly, with all sorts of nasty results.
>
> Meaning this?
> preset_lpj = loops_per_jiffy = 2;
Not exactly. Try harder -- this is simple arithmetic and you've got all
the data given above already. :)
> > Well, you've got the system and I have no crystal ball. You have means
> > to debug it. See how control is passed to the RI exception. Find which
> > of the TLB exceptions happens and how it proceeds. Etc...
>
> Unfortunately, I don't understand how to "see how control is passed" or
> how to find TLB exceptions... Could you point me in the right direction to learn
> about this?
You can check how the return address is set at the function's entry point
to see how it's called.
As to the TLB exceptions -- well, read the MIPS architecture spec first.
Then -- well, referring you to arch/mips/mm/tlbex.c would be pure cruelty
;) -- but have a look at do_page_fault(), which is where all the
processing important here is done -- the machine code generated from
tlbex.c handles the success paths only.
> > and (b) control being transferred to a block of memory that isn't actually
> > code, as can happen if exception vectors or global pointers-to-functions
> > aren't set up correctly, or if the kernel stack is being corrupted. When
> > you say "the instruction in question is a store word", how do you know that?
>
> The RI error spits out a bunch of info, including epc which presumably points
> to the instruction causing the problem: ac85ffc0; this is 'sw a1,-64(a0)'
I have seen that already and wrote these stores in __bzero are protected.
Perhaps the fixup fails for some reason, but you need to investigate it
and this is why I suggested to see how the RI handler is reached. Since
this is a known point the failure leads to, you should be able to work
backwards from there quite easily.
Maciej
On Sun, 8 Jun 2008, Kevin D. Kissell wrote:
> > The RI error spits out a bunch of info, including epc which presumably points
> > to the instruction causing the problem: ac85ffc0; this is 'sw a1,-64(a0)'
> >
> But unless the processor itself is actually defective, there is no way that
> a SW instruction can cause an RI exception. Sometimes a kernel crash
> is so violent that the kernel stack frame cannot be reliably decoded by
> the crash dump code, and this would appear to be one of those cases.
> I find the address of 0xac85ffc0 to be a bit suspicious, myself. That's
> a kseg1 (non-cacheable identity map) address for physical address
> 0x0c85ffc0, which would be legitimate (though suspicious) if you had
> 256MB of RAM, but the boot log quote you posted earlier suggests
> that you've only got 16M. Is there really memory of some kind at
> that address? Are you calling routines in a boot ROM from Linux?
Well, 0xac85ffc0 is the instruction word corresponding to 'sw a1,-64(a0)'.
:) The actual address of the failure is apparently 0x004e010c, which is
pretty much a standard location somewhere within a user executable proper.
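The decoding of the instruction word can be checked by hand with the MIPS I-type field layout (6-bit opcode, 5-bit rs, 5-bit rt, 16-bit signed immediate). A short sketch; the struct and function names are illustrative only:

```c
#include <stdint.h>

/* Fields of a MIPS I-type instruction. */
struct mips_itype {
    unsigned opcode;  /* bits 31..26 */
    unsigned rs;      /* bits 25..21, base register */
    unsigned rt;      /* bits 20..16, source/target register */
    int16_t  imm;     /* bits 15..0, sign-extended offset */
};

/* Split a 32-bit instruction word into its I-type fields. */
static struct mips_itype decode_itype(uint32_t word)
{
    struct mips_itype d;

    d.opcode = word >> 26;
    d.rs = (word >> 21) & 0x1f;
    d.rt = (word >> 16) & 0x1f;
    d.imm = (int16_t)(word & 0xffff);
    return d;
}
```

For 0xac85ffc0 this gives opcode 0x2b (sw), rs = 4 ($a0), rt = 5 ($a1) and an offset of -64, i.e. 'sw a1,-64(a0)'.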
Maciej
Luke -Jr wrote:
> On Sunday 08 June 2008, Kevin D. Kissell wrote:
>
>> and (b) control being transferred to a block of memory that isn't actually
>> code, as can happen if exception vectors or global pointers-to-functions
>> aren't set up correctly, or if the kernel stack is being corrupted. When
>> you say "the instruction in question is a store word", how do you know that?
>>
>
> The RI error spits out a bunch of info, including epc which presumably points
> to the instruction causing the problem: ac85ffc0; this is 'sw a1,-64(a0)'
>
But unless the processor itself is actually defective, there is no way that
a SW instruction can cause an RI exception. Sometimes a kernel crash
is so violent that the kernel stack frame cannot be reliably decoded by
the crash dump code, and this would appear to be one of those cases.
I find the address of 0xac85ffc0 to be a bit suspicious, myself. That's
a kseg1 (non-cacheable identity map) address for physical address
0x0c85ffc0, which would be legitimate (though suspicious) if you had
256MB of RAM, but the boot log quote you posted earlier suggests
that you've only got 16M. Is there really memory of some kind at
that address? Are you calling routines in a boot ROM from Linux?
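The segment arithmetic used in the paragraph above can be sketched as follows. This is a simplified model of the 32-bit MIPS virtual address map (everything from 0xc0000000 up is lumped together as kseg2 here), with illustrative names:

```c
#include <stdint.h>

enum mips_seg { KUSEG, KSEG0, KSEG1, KSEG2 };

/* Classify a 32-bit virtual address by segment. */
static enum mips_seg mips32_segment(uint32_t va)
{
    if (va < 0x80000000u)
        return KUSEG;   /* user space, mapped through the TLB */
    if (va < 0xa0000000u)
        return KSEG0;   /* unmapped, cached */
    if (va < 0xc0000000u)
        return KSEG1;   /* unmapped, uncached */
    return KSEG2;       /* kernel space, mapped through the TLB */
}

/* Physical address for a kseg0/kseg1 virtual address only:
 * both segments are fixed windows onto the low 512MB. */
static uint32_t unmapped_to_phys(uint32_t va)
{
    return va & 0x1fffffffu;
}
```

Read as an address, 0xac85ffc0 falls in kseg1 and maps to physical 0x0c85ffc0, while kernel text addresses of the form 0x800xxxxx fall in kseg0.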
Debugging Linux kernel crashes is probably not the best way to learn
the MIPS privileged resource architecture. I'd strongly recommend
http://www.amazon.com/See-MIPS-Second-Dominic-Sweetman/dp/0120884216/
Regards,
Kevin K.
On Sunday 08 June 2008, Kevin D. Kissell wrote:
> Luke -Jr wrote:
> > On Sunday 08 June 2008, Kevin D. Kissell wrote:
> >> and (b) control being transferred to a block of memory that isn't
> >> actually code, as can happen if exception vectors or global
> >> pointers-to-functions aren't set up correctly, or if the kernel stack is
> >> being corrupted. When you say "the instruction in question is a store
> >> word", how do you know that?
> >
> > The RI error spits out a bunch of info, including epc which presumably
> > points to the instruction causing the problem: ac85ffc0; this is 'sw
> > a1,-64(a0)'
>
> But unless the processor itself is actually defective, there is no way that
> a SW instruction can cause an RI exception. Sometimes a kernel crash
> is so violent that the kernel stack frame cannot be reliably decoded by
> the crash dump code, and this would appear to be one of those cases.
In that case, wouldn't the "kernel stack" appear to be complete nonsense?
Yet the stack in this case is quite logical and consistent. Furthermore, if I
skip the bzero stuff (by commenting out the call), it will crash shortly
thereafter when the ELF loader attempts to write to it in another way.
Is it very unlikely that the bcm3345 is simply raising the wrong exception (or
perhaps Linux is misinterpreting the exception)?
> I find the address of 0xac85ffc0 to be a bit suspicious, myself. That's
> a kseg1 (non-cacheable identity map) address for physical address
> 0x0c85ffc0, which would be legitimate (though suspicious) if you had
> 256MB of RAM, but the boot log quote you posted earlier suggests
> that you've only got 16M. Is there really memory of some kind at
> that address? Are you calling routines in a boot ROM from Linux?
ac85ffc0 is the instruction for 'sw a1,-64(a0)', not an address.
The board has only 8 MB RAM, as best I can tell from looking up the RAM
chip (hynix KOREA HY57V641620HG 0229A T-7).
> Debugging Linux kernel crashes is probably not the best way to learn
> the MIPS privileged resource architecture. I'd strongly recommend
> http://www.amazon.com/See-MIPS-Second-Dominic-Sweetman/dp/0120884216/
Can you recommend any gratis materials to read? I don't have room in my budget
to spend money on this hobby right now..
Luke
On Sunday 08 June 2008, Maciej W. Rozycki wrote:
> On Sun, 8 Jun 2008, Luke -Jr wrote:
> > the bcm63xx patches OpenWrt has that I'm using as a base for this...
>
> It would be best if the patches you are referring to got merged with the
> mainline. Otherwise whoever uses them is essentially on their own --
> people lack the resources needed to chase random changes out there in
> general.
Is merging with mainline something I can help with, being a beginner in this
area generally and not having any part in writing them?
> > > That's grossly wrong. If you need to preset it for the time being
> > > till you debug calibration, then for a MIPS processor assume one
> > > instruction per clock tick and two instructions per loop -- that may
> > > not be entirely correct, but is a good approximation. Otherwise you
> > > risk peripheral devices not being driven correctly, with all sorts
> > > of nasty results.
> >
> > Meaning this?
> > preset_lpj = loops_per_jiffy = 2;
>
> Not exactly. Try harder -- this is simple arithmetic and you've got all
> the data given above already. :)
200 / 2? I'm not really sure what a 'jiffy' is..
> > > and (b) control being transferred to a block of memory that isn't
> > > actually code, as can happen if exception vectors or global
> > > pointers-to-functions aren't set up correctly, or if the kernel stack
> > > is being corrupted. When you say "the instruction in question is a
> > > store word", how do you know that?
> >
> > The RI error spits out a bunch of info, including epc which presumably
> > points to the instruction causing the problem: ac85ffc0; this is 'sw
> > a1,-64(a0)'
>
> I have seen that already and wrote these stores in __bzero are protected.
> Perhaps the fixup fails for some reason, but you need to investigate it
> and this is why I suggested to see how the RI handler is reached. Since
> this is a known point the failure leads to, you should be able to work
> backwards from there quite easily.
Ah, so what you're saying is that perhaps the 'sw' is triggering a TLB
exception, and the handler for *that* is causing the RI problem?
Thanks,
Luke
On Sun, 8 Jun 2008, Luke -Jr wrote:
> Is merging with mainline something I can help with, being a beginner in this
> area generally and not having any part in writing them?
Well, you can certainly serve as a messenger telling them if they want
people to get proper support from upstream maintainers they better merge
sooner rather than later. Otherwise it is them who should really be
bothered with cases like yours.
The general principle is: "merge as soon as you can, even if code is
incomplete" as you get more attention and perhaps developers involved as a
result, some free support (e.g. with bulk changes done automatically to
all the relevant bits in the tree) and avoid duplicated work; also when at
the time of the merge you are told to rewrite your code differently.
> > Not exactly. Try harder -- this is simple arithmetic and you've got all
> > the data given above already. :)
>
> 200 / 2? I'm not really sure what a 'jiffy' is..
Hmm, I have thought it can be inferred from the code involved or failing
that -- Google... Well, anyway, a jiffy is a tick of the kernel timer or,
specifically in this context and to be more precise, the interval between
such two consecutive ticks or, in other words, 1/HZ.
> > I have seen that already and wrote these stores in __bzero are protected.
> > Perhaps the fixup fails for some reason, but you need to investigate it
> > and this is why I suggested to see how the RI handler is reached. Since
> > this is a known point the failure leads to, you should be able to work
> > backwards from there quite easily.
>
> Ah, so what you're saying is that perhaps the 'sw' is triggering a TLB
> exception, and the handler for *that* is causing the RI problem?
This is almost certainly what happens here. The pointer involved is a
valid (user) address and is correctly aligned, so you cannot get an
address error exception. A TLB exception is next on the list to check.
Of course you cannot rule out I-cache corruption or suchlike, but if I
were you, I would start with simple assumptions first.
Maciej
On Sunday 08 June 2008, Maciej W. Rozycki wrote:
> On Sun, 8 Jun 2008, Luke -Jr wrote:
> > Is merging with mainline something I can help with, being a beginner in
> > this area generally and not having any part in writing them?
>
> Well, you can certainly serve as a messenger telling them if they want
> people to get proper support from upstream maintainers they better merge
> sooner rather than later.
Apparently the lack of a merge is due to missing (proprietary?)
drivers for DSL, Ethernet, and WiFi on the bcm63xx platform. I'll pass on
the "incomplete is ok" message, though, and hopefully that will help :)
> The general principle is: "merge as soon as you can, even if code is
> incomplete" as you get more attention and perhaps developers involved as a
> result, some free support (e.g. with bulk changes done automatically to
> all the relevant bits in the tree) and avoid duplicated work; also when at
> the time of the merge you are told to rewrite your code differently.
Does this apply even to my trivial/barely begun attempts so far? When bcm63xx
gets merged, should I be planning to merge my stuff even before it boots?
> > > Not exactly. Try harder -- this is simple arithmetic and you've got
> > > all the data given above already. :)
> >
> > 200 / 2? I'm not really sure what a 'jiffy' is..
>
> Hmm, I have thought it can be inferred from the code involved or failing
> that -- Google... Well, anyway, a jiffy is a tick of the kernel timer or,
> specifically in this context and to be more precise, the interval between
> such two consecutive ticks or, in other words, 1/HZ.
jiffy = 1 / 200000 HZ = 0.000005 sec/tick
loop = 200000 instructions / 2 instructions per loop = 100000 loops/sec
So 0.00000000005 loops per jiffy? But it can't be, since loops_per_jiffy isn't
floating point... :/
> > > I have seen that already and wrote these stores in __bzero are
> > > protected. Perhaps the fixup fails for some reason, but you need to
> > > investigate it and this is why I suggested to see how the RI handler is
> > > reached. Since this is a known point the failure leads to, you should
> > > be able to work backwards from there quite easily.
> >
> > Ah, so what you're saying is that perhaps the 'sw' is triggering a TLB
> > exception, and the handler for *that* is causing the RI problem?
>
> This is almost certainly what happens here. The pointer involved is a
> valid (user) address and is correctly aligned, so you cannot get an
> address error exception. A TLB exception is next on the list to check.
Is there an easy way to printk out a complete trace of the exception stack?
On Sunday 08 June 2008, Maciej W. Rozycki wrote:
> On Sun, 8 Jun 2008, Luke -Jr wrote:
> > > I have seen that already and wrote these stores in __bzero are
> > > protected. Perhaps the fixup fails for some reason, but you need to
> > > investigate it and this is why I suggested to see how the RI handler is
> > > reached. Since this is a known point the failure leads to, you should
> > > be able to work backwards from there quite easily.
> >
> > Ah, so what you're saying is that perhaps the 'sw' is triggering a TLB
> > exception, and the handler for *that* is causing the RI problem?
>
> This is almost certainly what happens here. The pointer involved is a
> valid (user) address and is correctly aligned, so you cannot get an
> address error exception. A TLB exception is next on the list to check.
I added some code to do_ri:
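	if (unlikely(!user_mode(regs))) {
		unsigned long sp;

		/* Read the current stack pointer ($sp); this is not the EPC. */
		asm("move %0, $sp" : "=r"(sp));
		printk("----- LJR -------\n");
		/* Print a raw backtrace starting from that stack pointer. */
		show_raw_backtrace(sp);
		printk("----- LJRx-------\n");
	}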
Which gave me some potentially useful info:
----- LJR -------
Call Trace:
[<80011460>] ret_from_exception+0x0/0x24
[<80069de4>] vma_link+0x48/0x114
[<8001b1f0>] blast_icache16+0x0/0xec
[<800aa27c>] padzero+0x5c/0x74
[<800c6774>] __bzero+0x38/0x164
[<800ab04c>] load_elf_binary+0x948/0x145c
[<800aac6c>] load_elf_binary+0x568/0x145c
[<80083b80>] __path_lookup_intent_open+0x60/0xe4
[<80083b50>] __path_lookup_intent_open+0x30/0xe4
[<80080044>] permission+0x10c/0x148
[<8007bfd4>] search_binary_handler+0x78/0x18c
[<800aa15c>] load_script+0x25c/0x270
[<800aa148>] load_script+0x248/0x270
[<800aa7b4>] load_elf_binary+0xb0/0x145c
[<8007c204>] get_arg_page+0x4c/0xc4
[<8001cab4>] r4k_flush_cache_page+0x1c/0x28
[<8007bfd4>] search_binary_handler+0x78/0x18c
[<8007e004>] do_execve+0x18c/0x258
[<8007dfe4>] do_execve+0x16c/0x258
[<80081074>] getname+0x24/0x118
[<8001570c>] sys_execve+0x4c/0x78
[<80030610>] release_console_sem+0x114/0x358
[<80018410>] stack_done+0x20/0x3c
[<80031038>] vprintk+0x368/0x448
[<8007554c>] get_unused_fd_flags+0x60/0x184
[<80081074>] getname+0x24/0x118
[<80010478>] init_post+0x60/0xe8
[<80015584>] kernel_execve+0x8/0x20
[<800136cc>] kernel_thread_helper+0x10/0x18
[<800136bc>] kernel_thread_helper+0x0/0x18
----- LJRx-------
Too tired to debug further tonight, but hopefully this stack will stand out to
someone :)
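A note on reading the trace above: as far as I understand it, a raw backtrace does not unwind stack frames. It walks the stack and reports every word that looks like a kernel text address, so stale values left by earlier calls can appear alongside the real call chain. A simplified userspace model of that scan, with hypothetical names and a made-up text range:

```c
#include <stddef.h>
#include <stdint.h>

/* Does this stack word fall inside the (assumed) kernel text range? */
static int looks_like_kernel_text(uint32_t w, uint32_t text_start,
                                  uint32_t text_end)
{
    return w >= text_start && w < text_end;
}

/* Copy every stack word that looks like a text address into out[].
 * Returns the number of entries stored (at most out_cap). */
static size_t scan_stack(const uint32_t *stack, size_t nwords,
                         uint32_t text_start, uint32_t text_end,
                         uint32_t *out, size_t out_cap)
{
    size_t found = 0;

    for (size_t i = 0; i < nwords && found < out_cap; i++) {
        if (looks_like_kernel_text(stack[i], text_start, text_end))
            out[found++] = stack[i];
    }
    return found;
}
```

This would explain why entries such as blast_icache16 or vma_link can show up between frames that do not obviously call them.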
Luke
On Sun, 8 Jun 2008, Luke -Jr wrote:
> On Sunday 08 June 2008, Maciej W. Rozycki wrote:
> > On Sun, 8 Jun 2008, Luke -Jr wrote:
> > > > Not exactly. Try harder -- this is simple arithmetic and you've got
> > > > all the data given above already. :)
> > >
> > > 200 / 2? I'm not really sure what a 'jiffy' is..
> >
> > Hmm, I have thought it can be inferred from the code involved or failing
> > that -- Google... Well, anyway, a jiffy is a tick of the kernel timer or,
> > specifically in this context and to be more precise, the interval between
> > such two consecutive ticks or, in other words, 1/HZ.
Look at CONFIG_HZ (the HZ in "1/HZ" above), which is probably 100, 250, or 1000.
> jiffy = 1 / 200000 HZ = 0.000005 sec/tick
> loop = 200000 instructions / 2 instructions per loop = 100000 loops/sec
>
> So 0.00000000005 loops per jiffy? But it can't be, since loops_per_jiffy isn't
> floating point... :/
So loops_per_jiffy is approx. CPU clock frequency / CONFIG_HZ.
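Putting the numbers from this thread together, the arithmetic can be sketched like this. The function name is hypothetical, and the inputs (a 200 MHz clock, one instruction per clock, two instructions per delay loop, and a guessed CONFIG_HZ) are the assumptions stated earlier in the thread, not measured values:

```c
/* Rough loops_per_jiffy estimate: loops per second divided by
 * timer ticks per second. Assumes one instruction per clock. */
static unsigned long estimate_lpj(unsigned long cpu_hz,
                                  unsigned long insns_per_loop,
                                  unsigned long hz)
{
    return cpu_hz / insns_per_loop / hz;
}
```

With a 200 MHz CPU and two instructions per loop this gives 1000000 for HZ=100, 400000 for HZ=250 and 100000 for HZ=1000; passing 1 for insns_per_loop reproduces the simpler "clock frequency / CONFIG_HZ" approximation.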
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds