LinuxLists.cc - [BUG -rt] -rt16 hang w/ realtime thread test

2006-02-11 02:58:00

Subject: [BUG -rt] -rt16 hang w/ realtime thread test

Hey Ingo,
I've been hunting a report that lower priority realtime threads are not
preempting higher priority realtime threads. However, in generating test
cases, I found I was locking the system quite frequently.

The attached test runs to completion on 2.6.15, but with 2.6.15-rt16, it
hangs the box. It could very well be a test issue, but I'm not sure I
see where the problem is.

thanks
-john

Attachments:

hangbox.c (1.91 kB)

2006-02-11 15:34:37

by Steven Rostedt

[permalink] [raw]

Subject: Re: [BUG -rt] -rt16 hang w/ realtime thread test

On Fri, 10 Feb 2006, john stultz wrote:

> Hey Ingo,
> I've been hunting a report that lower priority realtime threads are not
> preempting higher priority realtime threads. However, in generating test
> cases, I found I was locking the system quite frequently.
>
> The attached test runs to completion on 2.6.15, but with 2.6.15-rt16, it
> hangs the box. It could very well be a test issue, but I'm not sure I
> see where the problem is.
>

Hi John,

Have you turned on nmi_watchdog and softlockup detect? Just so we can see
where it is hung.

Thanks,

-- Steve

2006-02-14 01:14:11

by john stultz

[permalink] [raw]

Subject: Re: [BUG -rt] -rt16 hang w/ realtime thread test

On Sat, 2006-02-11 at 10:34 -0500, Steven Rostedt wrote:
> On Fri, 10 Feb 2006, john stultz wrote:
>
> > Hey Ingo,
> > I've been hunting a report that lower priority realtime threads are not
> > preempting higher priority realtime threads. However, in generating test
> > cases, I found I was locking the system quite frequently.
> >
> > The attached test runs to completion on 2.6.15, but with 2.6.15-rt16, it
> > hangs the box. It could very well be a test issue, but I'm not sure I
> > see where the problem is.
>
> Have you turned on nmi_watchdog and softlockup detect? Just so we can see
> where it is hung.

With both of those enabled, I get nothing.

The only thing I could trigger was with Ingo's lock debugging, I got the
following at bootup, although I doubt its the problem.

sleeping function called from invalid context modprobe(2917) at
mm/slab.c:2134
in_atomic():1 [00000002], irqs_disabled():0
[<c011af7a>] __might_sleep+0xcf/0xda (8)
[<c015da2e>] kmem_cache_alloc+0x1c/0xaf (40)
[<c02b8b5b>] __alloc_skb+0x29/0x11f (20)
[<f8a7d6d8>] inet6_rt_notify+0x52/0x105 [ipv6] (28)
[<f8a7decb>] fib6_add_rt2node+0x11d/0x145 [ipv6] (24)
[<f8a7df52>] fib6_add+0x3f/0xb0 [ipv6] (36)
[<f8a7b380>] ip6_ins_rt+0x26/0x3a [ipv6] (24)
[<f8a79cdd>] __ipv6_ifa_notify+0x3f/0x11e [ipv6] (24)
[<f8a79de4>] ipv6_ifa_notify+0x28/0x34 [ipv6] (16)
[<f8a77d49>] init_loopback+0x93/0xaa [ipv6] (12)
[<f8a78027>] addrconf_notify+0xb1/0x196 [ipv6] (12)
[<c02bd3da>] register_netdevice_notifier+0x42/0x52 (20)
[<f881a2ed>] addrconf_init+0x60/0xb2 [ipv6] (16)
[<f881a197>] inet6_init+0x131/0x1d3 [ipv6] (8)
[<c013cb1c>] sys_init_module+0xa9/0x1d7 (8)
[<c01029ef>] sysenter_past_esp+0x54/0x75 (12)
---------------------------
| preempt count: 00000002 ]
| 2-level deep critical section nesting:
----------------------------------------
.. [<c03166d5>] .... _raw_read_lock+0xd/0x1c
.....[<00000000>] .. ( <= _stext+0x3feffd68/0x3e)
.. [<c0316838>] .... _raw_write_lock+0xd/0x1f
.....[<00000000>] .. ( <= _stext+0x3feffd68/0x3e)

------------------------------
| showing all locks held by: | (modprobe/2917 [f798f7d0, 117]):
------------------------------

#001: [c03bba44] {rtnl_sem.lock}
... acquired at: register_netdevice_notifier+0xa/0x52

thanks
-john

2006-02-14 01:54:00

by john stultz

[permalink] [raw]

Subject: Re: [BUG -rt] -rt16 hang w/ realtime thread test

2006-02-24 01:10:56

by john stultz

[permalink] [raw]

Subject: Re: [BUG -rt] -rt16 hang w/ realtime thread test

On Mon, 2006-02-13 at 17:53 -0800, john stultz wrote:
> On Sat, 2006-02-11 at 10:34 -0500, Steven Rostedt wrote:
> > On Fri, 10 Feb 2006, john stultz wrote:
> >
> > > Hey Ingo,
> > > I've been hunting a report that lower priority realtime threads are not
> > > preempting higher priority realtime threads. However, in generating test
> > > cases, I found I was locking the system quite frequently.
> > >
> > > The attached test runs to completion on 2.6.15, but with 2.6.15-rt16, it
> > > hangs the box. It could very well be a test issue, but I'm not sure I
> > > see where the problem is.
> > >
> >
> > Hi John,
> >
> > Have you turned on nmi_watchdog and softlockup detect? Just so we can see
> > where it is hung.
>
> Ugh. Ok, I think I've found the issue.
>
> The systems I'm testing w/ all use the ACPI PM timer for the
> clocksource. On a whim I forced the TSC to be used and the hang went
> away.
>
> It appears the issue is that the ACPI PM wraps every ~5 seconds. Since
> the test takes longer the 5 seconds to run, the
> timeofday_periodic_hook() function gets starved and we never accumulate
> time. Then as the counter wraps, time starts wrapping thus timers do not
> expire and the test never completes, effectively hanging the box.
>
> I believe this issue was hit before back when cycle_t was 32bits long,
> thus causing the TSC to wrap every ~4seconds on a 1Ghz box.
>
> So whats the solution here? Do we need to do something to
> timeofday_periodic_hook is guaranteed to run with some frequency? Or is
> the test just bunk because realtime threads must give up the cpu in
> order for the kernel to function?

So without any suggestions, I created a terrible hack that tries to
avoid this issue. Maybe it will spur some discussion. :)

Basically, we set a flag value every time timeofday_periodic_hook() is
run, then every HZ ticks (1 second), we call
timeofday_ensure_correctness() which checks that flag variable to make
sure the periodic_hook has been called in that second. If it has been
called clear the flag and wait another second, otherwise directly call
timeofday_periodic_hook().

This has worked around the starvation in my tests, but I really doubt
its the right solution.

Any other ideas or thoughts?

thanks
-john

diff -ru 2.6-rt/kernel/time/clockevents.c rttest/kernel/time/clockevents.c
--- 2.6-rt/kernel/time/clockevents.c 2006-02-22 17:31:03.000000000 -0600
+++ rttest/kernel/time/clockevents.c 2006-02-23 19:24:25.000000000 -0600
@@ -94,8 +94,15 @@
/*
* Handle tick
*/
+extern void timeofday_ensure_correctness(void);
+
static void handle_tick(struct pt_regs *regs)
{
+ static atomic_t tick_ctr;
+ atomic_inc(&tick_ctr);
+ if(!(atomic_read(&tick_ctr)%HZ))
+ timeofday_ensure_correctness();
+
write_seqlock(&xtime_lock);
do_timer(regs);
write_sequnlock(&xtime_lock);
diff -ru 2.6-rt/kernel/time/timeofday.c rttest/kernel/time/timeofday.c
--- 2.6-rt/kernel/time/timeofday.c 2006-02-22 17:31:03.000000000 -0600
+++ rttest/kernel/time/timeofday.c 2006-02-23 19:50:16.000000000 -0600
@@ -487,6 +487,9 @@

device_initcall(timeofday_init_device);

+/* hack for periodic hook starvation */
+unsigned long recently_run;
+
/**
* timeofday_periodic_hook - Does periodic update of timekeeping values.
* @unused: unused value
@@ -514,6 +517,8 @@
struct clocksource old_clock;
static nsec_t second_check;

+ set_bit(0,&recently_run);
+
write_seqlock_irqsave(&system_time_lock, flags);

/* read time source & calc time since last call: */
@@ -626,6 +631,17 @@
jiffies + 1 + msecs_to_jiffies(PERIODIC_INTERVAL_MS));
}

+/* If timeofday_periodic_hook might be starved, call this at least
+ * once a second from interrupt context to ensure things run properly.
+ */
+void timeofday_ensure_correctness(void)
+{
+ if (!test_bit(0,&recently_run))
+ timeofday_periodic_hook(0);
+ else
+ clear_bit(0, &recently_run);
+}
+
/**
* timeofday_is_continuous - check to see if timekeeping is free running
*/