2002-06-12 05:20:46

by Anjali Kulkarni

Subject: scheduler problems


Hi,

I am getting a problem in the schedule() function.

I am running an in-kernel proxy on linux 2.2.16 and I get a problem in
sched.c at line 384. It is due to the fact that the schedule() function
does not find the 'current' process in the runqueue. (A detailed
explanation of the OOPS message which comes when run without serial
line debugging is given below).

With serial line debugging I got the following backtrace:

0# schedule() at sched.c:384
1# schedule_timeout(timeout=-806527036) at sched.c:653
2# kupdate (unused=0x0) at buffer.c:1921
3# kernel_thread(fn=0xb, arg=0xbffff86c, flags=0) at process.c:496
4# system_call at process.c:812

Note that the parameters to schedule_timeout() (a negative timeout) and
kernel_thread() look incorrect.
When I booted the kernel, I set breakpoints in init/main.c where
kupdate is created, and there the call chain
kernel_thread->kupdate->schedule_timeout->schedule shows every function
called with correct parameters.
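
As a side note on the negative timeout in frame 1: read as a raw 32-bit pattern, -806527036 is 0xCFED5FC4. Assuming the default i386 address split, where kernel virtual addresses start at 0xC0000000 (an assumption, not something stated in this thread), that looks more like a kernel address than a timeout. This would be consistent with the frame arguments being misread by the debugger or overwritten, but the trace alone cannot tell which. The sketch below only performs the conversion; all helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: default i386 3G/1G split, where kernel virtual
 * addresses start at 0xC0000000. */
#define ASSUMED_PAGE_OFFSET 0xC0000000u

/* Reinterpret a signed 32-bit value (as printed by the debugger)
 * as the raw 32-bit pattern held in the argument slot. */
uint32_t raw_bits(int32_t v)
{
    return (uint32_t)v;
}

/* Hypothetical helper: does the bit pattern fall in the kernel part
 * of the address space under the assumption above? */
int looks_like_kernel_address(int32_t v)
{
    return raw_bits(v) >= ASSUMED_PAGE_OFFSET;
}

/* Print the value both ways so the two readings can be compared. */
void show_value(int32_t v)
{
    printf("%ld = 0x%08lx%s\n", (long)v, (unsigned long)raw_bits(v),
           looks_like_kernel_address(v) ? " (kernel-range address?)" : "");
}

/* Self-check using the timeout value from the backtrace. */
void demo_timeout(void)
{
    assert(raw_bits(-806527036) == 0xCFED5FC4u);
    show_value(-806527036);
}
```

The same check applied to arg=0xbffff86c would report a value below the assumed kernel range, so the two suspicious arguments do not point at the same kind of memory.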

Can anyone tell me what's happening here? My kernel module is no way
the cause of any of this. A detailed explanation is given below...

Thanks!
Anjali

/*--------------------In more detail-------------------------*/
I am running an in-kernel proxy on linux 2.2.16, which places high
demand on the network activity of the linux machine. I am repeatedly
getting an OOPS message at a particular place in the schedule()
function. I am trying to analyse the call trace, which looks something
like the following:

schedule()
schedule_timeout()
process_timeout()
do_poll()
sys_poll()

After looking at the address where the OOPS reports a problem in
schedule() and at the objdump of sched.o, I found that the problem
arises when schedule() calls del_from_runqueue() and finds that the
current process is *not* on the runqueue. Further, this current process
*IS* present in the list of task structs, in TASK_INTERRUPTIBLE state.
It is generally a process like inetd, or some other process doing
network activity.

Now, if I kill all the processes on my linux machine (just to check),
the problem becomes less frequent, but it still appears, and now the
process missing from the runqueue is something like init (pid 1) or
kupdate (pid 3). These are the processes which could not be killed.

From this, I concluded that what probably happens is that some process
called sys_poll(), which called do_poll(). In do_poll(), a
process_timeout occurred (I assume this is a soft interrupt), which will
try to wake up the process which caused it, i.e. put the process on the
runqueue. I don't know who calls schedule_timeout(), though. Is it the
process which wakes up from the call to schedule_timeout() after a
context switch occurred? I am probably not aware of exactly how to
interpret a call trace...
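
The failure described above can be modelled in userspace. This is a minimal sketch, assuming the runqueue is a circular doubly linked list threaded through a pair of per-task links and that a task which is off the queue has NULL links. The struct, the field names next_run/prev_run and the _checked helper are illustrative assumptions and are not copied from sched.c.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a task: only the two runqueue links
 * matter for this model. */
struct task {
    struct task *next_run;
    struct task *prev_run;
    int pid;
};

/* The runqueue is modelled as a ring that starts and ends at a head node. */
void runqueue_init(struct task *head)
{
    head->next_run = head;
    head->prev_run = head;
}

/* Link p in right after the head. */
void add_to_runqueue(struct task *head, struct task *p)
{
    assert(head->next_run != NULL);
    p->next_run = head->next_run;
    p->prev_run = head;
    head->next_run->prev_run = p;
    head->next_run = p;
}

/* In this model a task that is off the runqueue has NULL links. */
int on_runqueue(const struct task *p)
{
    return p->next_run != NULL;
}

/* Guarded unlink: returns 0 on success, -1 if the task was not queued.
 * Without the guard, the two pointer updates below would go through
 * NULL pointers for a task that is not on the ring. */
int del_from_runqueue_checked(struct task *p)
{
    if (!on_runqueue(p))
        return -1;
    p->next_run->prev_run = p->prev_run;
    p->prev_run->next_run = p->next_run;
    p->next_run = NULL;
    p->prev_run = NULL;
    return 0;
}
```

In this model, unlinking the same task twice is reported as an error instead of faulting, which is the situation the OOPS address appears to point at: an unlink attempted on a task whose links say it is not queued.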

Thanks again.

/*----------------------------------------------------------------*/


Anjali Kulkarni
Software Engineer
Indra Networks

~Living Well is the best Revenge~


2002-06-12 06:23:23

by Ingo Molnar

Subject: Re: scheduler problems


On Tue, 11 Jun 2002, Anjali Kulkarni wrote:

> I am getting a problem in the schedule() function....
>
> I am running an in-kernel proxy on linux 2.2.16 and I get a problem in
> sched.c at line 384. [...]

(given that the current 2.2 kernel is 2.2.21, the first thing would be to
test it there too.)

> [...] It is due to the fact that the schedule() function does not find
> the 'current' process in the runqueue. [...]

a crash in line 384 means that the runqueue got corrupted by something,
most likely caused by buggy kernel code outside of the scheduler.

> Can anyone tell me what's happening here? My kernel module is no way the
> cause of any of this. [...]

does it happen if you do not run your kernel module after bootup, ever?

Ingo

2002-06-12 07:14:11

by Anjali Kulkarni

Subject: Re: scheduler problems


> (given that the current 2.2 kernel is 2.2.21, the first thing would be to
> test it there too.)
>

Thanks, I'll do that.

> > [...] It is due to the fact that the schedule() function does not find
> > the 'current' process in the runqueue. [...]
>
> a crash in line 384 means that the runqueue got corrupted by something,
> most likely caused by buggy kernel code outside of the scheduler.

Right, I thought of that, but how is it that it gets corrupted at exactly
the same offset in the task_struct of that process, and every time with
a different process? (I have run it at least 20-30 times.) And why does
it just not occur if I kill the process in question? (I couldn't kill
kupdate, and hence it occurs anyway.) And I have checked the task_struct
of that process; the next_task, prev_task and other fields are not
corrupted.
Of course, it's still possible, for example if memory allocated and freed
by my code is then used by the scheduler for allocating a task_struct,
and is then accessed again by mistake by my code at the same offset.
But do you feel sure it's a runqueue corruption problem, and not
anything else? If so, is there any particular way to debug this?
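
One possible way to narrow down this kind of corruption is a consistency walk over the queue, called at points of interest (for example before and after the module's own allocation and free paths). The following is only a userspace sketch under stated assumptions: the queue is modelled as a circular doubly linked ring, and the struct, field names, function name and return codes are all hypothetical. In a real kernel such a walk would also have to run with the appropriate runqueue locking held.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of the links being checked (assumed layout). */
struct task {
    struct task *next_run;
    struct task *prev_run;
};

enum rq_check {
    RQ_OK = 0,        /* ring is consistent */
    RQ_NULL_LINK = 1, /* a node on the ring has a NULL link */
    RQ_MISMATCH = 2,  /* node->next->prev does not point back to node */
    RQ_TOO_LONG = 3   /* walked more nodes than expected */
};

/* Walk the ring from head and verify every forward/backward link pair.
 * max_tasks bounds the walk so a broken ring cannot loop forever. */
enum rq_check check_runqueue(const struct task *head, int max_tasks)
{
    const struct task *p = head;
    int steps = 0;

    assert(head != NULL);
    do {
        if (p->next_run == NULL || p->prev_run == NULL)
            return RQ_NULL_LINK;
        if (p->next_run->prev_run != p)
            return RQ_MISMATCH;
        p = p->next_run;
        if (++steps > max_tasks + 1)   /* +1 accounts for the head node */
            return RQ_TOO_LONG;
    } while (p != head);

    return RQ_OK;
}
```

A check like this does not say who corrupted the links, but running it at several call sites can bracket the window in which the ring stops being consistent.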

> > Can anyone tell me what's happening here? My kernel module is no way the
> > cause of any of this. [...]
>
> does it happen if you do not run your kernel module after bootup, ever?

No, it does not :(

Thanks,
Anjali


2002-06-12 20:51:48

by Richard Zidlicky

Subject: Re: scheduler problems

On Wed, Jun 12, 2002 at 12:14:09AM -0700, Anjali Kulkarni wrote:
>
> > (given that the current 2.2 kernel is 2.2.21, the first thing would be to
> > test it there too.)
> >
>
> Thanks, I 'll do that.
>
> > > [...] It is due to the fact that the schedule() function does not find
> > > the 'current' process in the runqueue. [...]
> >
> > a crash in line 384 means that the runqueue got corrupted by something,
> > most likely caused by buggy kernel code outside of the scheduler.
>
> Right, I thought of that, but how is it that it gets corrupt at exactly
> the same offset in task_struct of that process and every time with
> different processes? (I have run it atleast 20-30 times). And it just
> doesnt come if I kill the process in question?

I've had similar problems when some code invalidated CPU cache
and an interrupt came in at the wrong time.

Richard

2002-06-13 05:38:29

by Anjali Kulkarni

Subject: Re: scheduler problems



> > > > [...] It is due to the fact that the schedule() function does not find
> > > > the 'current' process in the runqueue. [...]
> > >
> > > a crash in line 384 means that the runqueue got corrupted by something,
> > > most likely caused by buggy kernel code outside of the scheduler.
> >
> > Right, I thought of that, but how is it that it gets corrupt at exactly
> > the same offset in task_struct of that process and every time with
> > different processes? (I have run it atleast 20-30 times). And it just
> > doesnt come if I kill the process in question?
>
> I've had similar problems when some code invalidated CPU cache
> and an interrupt came in at the wrong time.
>

Hi!

I am not very clear on what you mean. Can you explain in more detail?

Thanks,
Anjali
