Subject: [ANNOUNCE] 3.8.10-rt6

Dear RT Folks,

I'm pleased to announce the 3.8.10-rt6 release.

changes since v3.8.10-rt5:
- the i915 compiles again after I broke it in the last release. A patch
was sent by Carsten Emde.

Known issues:

- SLxB is broken on PowerPC.
- suspend / resume seems to program the timer wrong and wait
ages until it continues.

The delta patch against v3.8.10-rt5 is appended below and can be found here:

https://www.kernel.org/pub/linux/kernel/projects/rt/3.8/incr/patch-3.8.10-rt5-rt6.patch.xz

The RT patch against 3.8.10 can be found here:

https://www.kernel.org/pub/linux/kernel/projects/rt/3.8/patch-3.8.10-rt6.patch.xz

The split quilt queue is available at:

https://www.kernel.org/pub/linux/kernel/projects/rt/3.8/patches-3.8.10-rt6.tar.xz

Sebastian

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 81125de..eabd3dd 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -814,6 +814,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	struct intel_ring_buffer *ring;
 	u32 ctx_id = i915_execbuffer2_get_context_id(*args);
 	u32 exec_start, exec_len;
+	u32 seqno;
 	u32 mask;
 	u32 flags;
 	int ret, mode, i;
@@ -1068,7 +1069,8 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 			goto err;
 	}
 
-	trace_i915_gem_ring_dispatch(ring, intel_ring_get_seqno(ring), flags);
+	seqno = intel_ring_get_seqno(ring);
+	trace_i915_gem_ring_dispatch(ring, seqno, flags);
 	i915_trace_irq_get(ring, seqno);
 
 	i915_gem_execbuffer_move_to_active(&objects, ring);
diff --git a/localversion-rt b/localversion-rt
index 0efe7ba..8fc605d 100644
--- a/localversion-rt
+++ b/localversion-rt
@@ -1 +1 @@
--rt5
+-rt6


2013-04-29 21:19:43

by Clark Williams

Subject: Re: [ANNOUNCE] 3.8.10-rt6

On Mon, 29 Apr 2013 22:12:02 +0200
Sebastian Andrzej Siewior <[email protected]> wrote:
> - suspend / resume seems to program the timer wrong and wait
> ages until it continues.
>

It has to be something we're doing when we apply RT to v3.8.x, since
v3.8.x suspends/resumes with no issues and I was able to suspend and
resume fine with the 3.6-rt series.

I'm looking at a git diff between 3.6.11-rt30 and 3.8.9-rt4,
specifically in kernel/time* and arch/x86/kernel but so far I'm not
seeing much that's RT specific.

Clark


Subject: Re: [ANNOUNCE] 3.8.10-rt6

On 04/29/2013 10:46 PM, Bernhard Schiffner wrote:
>> Known issues:
>>
>> - SLxB is broken on PowerPC.
>> - suspend / resume seems to program the timer wrong and wait
>> ages until it continues.
>
> Yes, it's an annoying problem here too.
> How can I help to solve it?

Are you referring to the PowerPC issue or suspend / resume?

> Bernhard

Sebastian

2013-04-30 08:47:26

by John Kacur

Subject: Re: [ANNOUNCE] 3.8.10-rt6



On Mon, 29 Apr 2013, Clark Williams wrote:

> On Mon, 29 Apr 2013 22:12:02 +0200
> Sebastian Andrzej Siewior <[email protected]> wrote:
> > - suspend / resume seems to program the timer wrong and wait
> > ages until it continues.
> >
>
> It has to be something we're doing when we apply RT to v3.8.x, since
> v3.8.x suspends/resumes with no issues and I was able to suspend and
> resume fine with the 3.6-rt series.

Our v3.8.x series is currently no different from "vanilla" rt: it is a
quilt-import on top of v3.8.10, with no RH patches. So, I know you said
that just as a polite way to say, "maybe we messed up, but...", however
I'm confident we didn't.

Also, that must be a typo; you meant to say that v3.8.x does have issues
with suspend / resume, right?

>
> I'm looking at a git diff between 3.6.11-rt30 and 3.8.9-rt4,
> specifically in kernel/time* and arch/x86/kernel but so far I'm not
> seeing much that's RT specific.
>
> Clark
>

Subject: Re: [ANNOUNCE] 3.8.10-rt6

* Clark Williams | 2013-04-29 16:19:25 [-0500]:

>On Mon, 29 Apr 2013 22:12:02 +0200
>Sebastian Andrzej Siewior <[email protected]> wrote:
>> - suspend / resume seems to program the timer wrong and wait
>> ages until it continues.
>>
>
>It has to be something we're doing when we apply RT to v3.8.x, since
>v3.8.x suspends/resumes with no issues and I was able to suspend and
>resume fine with the 3.6-rt series.

Are your problems gone with:

diff --git a/kernel/printk.c b/kernel/printk.c
index 6d52c34..8783ea5 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -1583,6 +1583,8 @@ asmlinkage int vprintk_emit(int facility, int level,
 	 */
 	if (unlikely(forced_early_printk(fmt, args)))
 		return 1;
+	if (in_nmi())
+		return 1;
 
 	boot_delay_msec(level);
 	printk_delay();

>Clark

Sebastian

Subject: Suspend resume problem (WAS Re: [ANNOUNCE] 3.8.10-rt6)

* Clark Williams | 2013-04-29 16:19:25 [-0500]:

>On Mon, 29 Apr 2013 22:12:02 +0200
>Sebastian Andrzej Siewior <[email protected]> wrote:
>> - suspend / resume seems to program the timer wrong and wait
>> ages until it continues.
>
>It has to be something we're doing when we apply RT to v3.8.x, since
>v3.8.x suspends/resumes with no issues and I was able to suspend and
>resume fine with the 3.6-rt series.

I think I figured out what is going on, or at least I think I did.

This log snippet is from the resume path (from suspend to mem):

[ 15.052115] Enabling non-boot CPUs ...
[ 15.052115] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 14.841378] Initializing CPU#1
[ 42.840017] [sched_delayed] sched: RT throttling activated
[ 42.842144] CPU1 is up
[ 42.842536] smpboot: Booting Node 0 Processor 2 APIC 0x2

Two things happen here:
- the time goes backwards from 15.X to 14.X. This is okay because the
14.X is the timestamp from the secondary CPU, which is not yet
synchronized with the boot CPU
- the printk with "CPU1 is up" is coming from the boot CPU and
according to the timestamp about 28 secs passed by. But this did not
really happen as the whole procedure took less time.

The next thing that happens is that RCU assumes nobody is making any
progress (for almost 28 secs) and triggers NMIs & printks to get some
attention. I have a trace where
- CPU0: arch_trigger_all_cpu_backtrace_handler() => printk()
has "lock" and is spinning for logbuf_lock

- CPU1: print_cpu_stall() => printk() (spinning for the lock) => NMI =>
arch_trigger_all_cpu_backtrace_handler()
it may have logbuf_lock and is spinning for "lock"

I can't tell if CPU1 got the logbuf_lock at this time but it seemed that
it made no progress until I ended it.
This NMI related deadlock is a problem which should also trigger in
mainline, right?
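
The circular wait described above can be modelled in a few lines of userspace C. This is only a sketch under assumptions, not kernel code: the struct, the lock ids and the function name are made up for illustration, "lock" from the trace is called BACKTRACE_LOCK here, and the deadlock case assumes that CPU1 really did hold logbuf_lock, which the trace does not confirm.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative lock ids; "lock" in the trace is BACKTRACE_LOCK here. */
enum model_lock_id { LOCK_NONE = -1, BACKTRACE_LOCK = 0, LOGBUF_LOCK = 1 };

/* Hypothetical per-CPU state: one lock held, one lock being spun on. */
struct model_cpu {
	int holds;
	int wants;
};

/*
 * Follow "wanted lock -> CPU holding it" starting at CPU `start`.
 * Returns true if the chain leads back to `start`, i.e. that CPU
 * is part of a circular wait and cannot make progress.
 */
static bool model_waits_in_cycle(const struct model_cpu *cpus, int nr_cpus,
				 int start)
{
	int cur = start;

	assert(nr_cpus > 0 && start >= 0 && start < nr_cpus);
	for (int step = 0; step < nr_cpus; step++) {
		int wanted = cpus[cur].wants;
		int owner = -1;

		if (wanted == LOCK_NONE)
			return false;	/* not waiting for anything */
		for (int i = 0; i < nr_cpus; i++)
			if (cpus[i].holds == wanted)
				owner = i;
		if (owner < 0)
			return false;	/* the wanted lock is free */
		if (owner == start)
			return true;	/* chain closed: circular wait */
		cur = owner;
	}
	return false;
}
```

With CPU0 holding BACKTRACE_LOCK and wanting LOGBUF_LOCK, and CPU1 holding LOGBUF_LOCK and wanting BACKTRACE_LOCK, the function reports a cycle; if CPU1 holds nothing, it does not.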

Now, the time jump on the other hand is the real issue here and is
RT-only. It looks like we get a big number of timer updates via
tick_do_update_jiffies64() because according to ktime_get() that much
time really passed by.

The solution seems as simple as

From c27eb2e0ab0b5acd96a4b62288976f1b72789b3e Mon Sep 17 00:00:00 2001
From: Sebastian Andrzej Siewior <[email protected]>
Date: Tue, 30 Apr 2013 18:53:55 +0200
Subject: [PATCH] time/timekeeping: shadow tk->cycle_last together with
clock->cycle_last

Commit ("timekeeping: Store cycle_last value in timekeeper struct as
well") introduced a tk-> based cycle_last value which needs to be reset
on the resume path as well, or else ktime_get() will think that time
increased a lot.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
kernel/time/timekeeping.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 99f943b..688817f 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -777,6 +777,7 @@ static void timekeeping_resume(void)
 	}
 	/* re-base the last cycle value */
 	tk->clock->cycle_last = tk->clock->read(tk->clock);
+	tk->cycle_last = tk->clock->cycle_last;
 	tk->ntp_error = 0;
 	timekeeping_suspended = 0;
 	timekeeping_update(tk, false, true);
--
1.7.10.4

So Clark, does this patch fix your problem?

>Clark

Sebastian
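
To illustrate why a stale shadow copy of cycle_last makes a reader see a large time jump after resume, here is a minimal userspace sketch. It is not the kernel implementation: the struct layouts, the function names, the `fixed` switch and the counter values are all assumptions made up for illustration, and it assumes the counter kept counting while the machine was suspended.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the clocksource; not the real layout. */
struct model_clock {
	uint64_t cycle_last;	/* re-based on resume */
	uint64_t freq_hz;	/* counter frequency, assumed constant */
};

/* Hypothetical stand-in for the timekeeper with its shadow copy. */
struct model_timekeeper {
	struct model_clock *clock;
	uint64_t cycle_last;	/* shadow copy that readers consult */
};

/* Elapsed nanoseconds as a reader of the shadow copy computes them. */
static uint64_t model_elapsed_ns(const struct model_timekeeper *tk,
				 uint64_t now)
{
	uint64_t delta = now - tk->cycle_last;

	assert(tk->clock->freq_hz != 0);
	return delta * 1000000000ULL / tk->clock->freq_hz;
}

/* Resume path; `fixed` selects whether the shadow copy is re-based too. */
static void model_resume(struct model_timekeeper *tk, uint64_t now, int fixed)
{
	tk->clock->cycle_last = now;
	if (fixed)
		tk->cycle_last = tk->clock->cycle_last;
}
```

With a 1 MHz counter, a shadow copy left at 1000 cycles and a resume at 28001000 cycles, a read 1000 cycles after resume yields roughly 28 seconds without the re-base and 1 ms with it.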

2013-04-30 18:08:19

by Steven Rostedt

Subject: Re: Suspend resume problem (WAS Re: [ANNOUNCE] 3.8.10-rt6)

On Tue, 2013-04-30 at 19:09 +0200, Sebastian Andrzej Siewior wrote:

> The next thing that happens is that RCU assumes nobody is making any
> progress (for almost 28 secs) and triggers NMIs & printks to get some
> attention. I have a trace where
> - CPU0: arch_trigger_all_cpu_backtrace_handler() => printk()
> has "lock" and is spinning for logbuf_lock
>
> - CPU1: print_cpu_stall() => printk() (spinning for the lock) => NMI =>
> arch_trigger_all_cpu_backtrace_handler()
> it may have logbuf_lock and is spinning for "lock"
>
> I can't tell if CPU1 got the logbuf_lock at this time but it seemed that
> it made no progress until I ended it.
> This NMI related deadlock is a problem which should also trigger in
> mainline, right?

Well, yeah, as sending out a NMI stack dump is sorta the last resort,
and it is dangerous to do printks from NMI context.

>
> Now, the time jump on the other hand is the real issue here and is
> RT-only. It looks like we get a big number of timer updates via
> tick_do_update_jiffies64() because according to ktime_get() that much
> time really passed by.

As the NMI dump only happens because of the time jump, which as you
said, is -rt only, I wouldn't say that the NMI deadlock is a mainline
bug.

-- Steve

2013-04-30 19:18:40

by Clark Williams

Subject: Re: Suspend resume problem (WAS Re: [ANNOUNCE] 3.8.10-rt6)

On Tue, 30 Apr 2013 19:09:48 +0200
Sebastian Andrzej Siewior <[email protected]> wrote:

> So Clark, does this patch fix your problem?
>

It does seem to! I've got both patches applied right now (your patch to
vprintk_emit() and the above patch) and it fixes the long delay on my
lab box. When I get done today (or have a break in the action) I'll try
it on my laptop to verify.

Thanks Sebastian,
Clark



2013-04-30 21:55:18

by Clark Williams

Subject: Re: Suspend resume problem (WAS Re: [ANNOUNCE] 3.8.10-rt6)

On Tue, 30 Apr 2013 14:18:24 -0500
Clark Williams <[email protected]> wrote:

> It does seem to! I've got both patches applied right now (your patch to
> vprintk_emit() and the above patch) and it fixes the long delay on my
> lab box. When I get done today (or have a break in the action) I'll try
> it on my laptop to verify.
>
> Thanks Sebastian,
> Clark

Tested on my laptop which now resumes.

Many thanks.

Clark



2013-04-30 22:31:19

by Borislav Petkov

Subject: Re: Suspend resume problem (WAS Re: [ANNOUNCE] 3.8.10-rt6)

On Tue, Apr 30, 2013 at 07:09:48PM +0200, Sebastian Andrzej Siewior wrote:
> Now, the time jump on the other hand is the real issue here and is
> RT-only. It looks like we get a big number of timer updates via
> tick_do_update_jiffies64() because according to ktime_get() that much
> time really passed by.
>
> The solution seems as simple as
>
> From c27eb2e0ab0b5acd96a4b62288976f1b72789b3e Mon Sep 17 00:00:00 2001
> From: Sebastian Andrzej Siewior <[email protected]>
> Date: Tue, 30 Apr 2013 18:53:55 +0200
> Subject: [PATCH] time/timekeeping: shadow tk->cycle_last together with
> clock->cycle_last
>
> Commit ("timekeeping: Store cycle_last value in timekeeper struct as
> well") introduced a tk-> based cycle_last value which needs to be reset
> on the resume path as well, or else ktime_get() will think that time
> increased a lot.
>
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> ---
> kernel/time/timekeeping.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 99f943b..688817f 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -777,6 +777,7 @@ static void timekeeping_resume(void)
> }
> /* re-base the last cycle value */
> tk->clock->cycle_last = tk->clock->read(tk->clock);
> + tk->cycle_last = tk->clock->cycle_last;
> tk->ntp_error = 0;
> timekeeping_suspended = 0;
> timekeeping_update(tk, false, true);

Didn't tglx fix a similar issue upstream already?

77c675ba18836.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Subject: Re: Suspend resume problem (WAS Re: [ANNOUNCE] 3.8.10-rt6)

On 05/01/2013 12:31 AM, Borislav Petkov wrote:
> Didn't tglx fix a similar issue upstream already?
>
> 77c675ba18836.

He did, it seems.

Sebastian

2013-05-03 04:40:39

by Jain Priyanka-B32167

Subject: RE: [ANNOUNCE] 3.8.10-rt6

Hello Sebastian,

It is mentioned below that SLxB is broken.
I assume it means both SLUB and SLAB are broken?
Can you please share the error-details/logs/scenario/steps-to-reproduce.

Regards
Priyanka

> -----Original Message-----
> From: [email protected] [mailto:linux-rt-users-
> [email protected]] On Behalf Of Sebastian Andrzej Siewior
> Sent: Tuesday, April 30, 2013 1:42 AM
> To: linux-rt-users
> Cc: LKML; Thomas Gleixner; [email protected]
> Subject: [ANNOUNCE] 3.8.10-rt6
>
> Dear RT Folks,
>
> I'm pleased to announce the 3.8.10-rt6 release.
>
> changes since v3.8.10-rt5:
> - the i915 compiles again after I broke it in the last release. A patch
> was sent by Carsten Emde.
>
> Known issues:
>
> - SLxB is broken on PowerPC.
> - suspend / resume seems to program the timer wrong and wait
> ages until it continues.
>

Subject: Re: [ANNOUNCE] 3.8.10-rt6

* Jain Priyanka-B32167 | 2013-05-03 04:40:33 [+0000]:

>Hello Sebastian,
Hello Jain,

>It is mentioned below that SLxB is broken.
>I assume it means both SLUB and SLAB are broken?

Yes. It looks like this is limited to Book-E / e500. I have an
MPC8572DS here which shows this:
|[27173.423355] ------------[ cut here ]------------
|[27173.423360] kernel BUG at mm/slab.c:3227!
|[27173.423364] Oops: Exception in kernel mode, sig: 5 [#1]
|[27173.423367] PREEMPT SMP NR_CPUS=2 MPC8572 DS
|[27173.423370] NIP: 800b236c LR: 800b2290 CTR: 802bd168
|[27173.423373] REGS: ba557b90 TRAP: 0700 Not tainted (3.8.9-rt4-dirty)
|[27173.423378] MSR: 00029000 <CE,EE,ME> CR: 24002444 XER: 00000000
|[27173.423402] TASK = ba101290[31018] 'hackbench' THREAD: ba556000 CPU: 0
|[27173.423402] GPR00: 800b2bb8 ba557c40 ba101290 b7374200 000106d0 00000000 00000000 00000200
|[27173.423402] GPR08: 00000001 00000008 00000008 b7a0fc60 24002462 1001a810 00000000 803c0000
|[27173.423402] GPR16: 00000001 bf002490 bf002488 803c0000 00100100 00200200 bf002480 803c32f0
|[27173.423402] GPR24: 00000000 bf0024a4 000106d0 ba556000 bf000540 bf00f200 00000003 81eb5ae0
|[27173.423412] NIP [800b236c] cache_alloc_refill+0x16c/0x7e8
|[27173.423414] LR [800b2290] cache_alloc_refill+0x90/0x7e8
|[27173.423415] Call Trace:
|[27173.423422] [ba557c40] [802d0c54] rt_spin_lock_slowlock+0x58/0x288 (unreliable)
|[27173.423426] [ba557c90] [800b2bb8] __kmalloc+0x1d0/0x204
|[27173.423432] [ba557cc0] [80236ee4] __kmalloc_reserve+0x28/0x84
|[27173.423435] [ba557ce0] [80236fc4] __alloc_skb+0x84/0x18c
|[27173.423439] [ba557d20] [802338ec] sock_alloc_send_pskb+0x1d8/0x36c
|[27173.423444] [ba557d80] [802bd414] unix_stream_sendmsg+0x2ac/0x3ec
|[27173.423453] [ba557de0] [8022e4c4] sock_aio_write+0x110/0x148
|[27173.423458] [ba557e40] [800b7030] do_sync_write+0x94/0x108
|[27173.423462] [ba557ef0] [800b7204] vfs_write+0x160/0x170
|[27173.423465] [ba557f10] [800b7308] sys_write+0x4c/0xa8
|[27173.423471] [ba557f40] [8000d3c0] ret_from_syscall+0x0/0x3c
|[27173.423473] --- Exception: c01 at 0xffad0ec
|[27173.423473] LR = 0x100011c8
|[27173.423474] Instruction dump:
|[27173.423480] 3de0803c 62940100 62b50200 3a560008 83f60000 7f16f800 419a019c 813f0010
|[27173.423486] 815c0018 7d0a4810 39000000 7d084114 <0f080000> 7f0a4840 40990084 3bdeffff
|[27173.604492] ---[ end trace 0000000000000002 ]---

after (according to the timestamp) 7:32 hours of runtime. It was
running in one shell
| cyclictest -m -n -S -p 80 -d 0 -i 500
and the other
|while ((1)); do hackbench; done

This was done with SLAB; the backtrace is different with SLUB. I
tried with one CPU but it is the same thing.

I tried an MPC5200b based board and it did not do anything stupid for over
two days while doing exactly the same thing.
The obvious difference is the different MMU implementation of those two.
The other difference is ~400 MHz CPU vs 1.5 GHz.

>Can you please share the error-details/logs/scenario/steps-to-reproduce.

As I wrote above, cyclictest + hackbench. My MPC8572 boots from hard
disk into an e500 based root file system (that means it uses its FPU for
floating point instead of SW emulation).

>Regards
>Priyanka

Sebastian

Subject: Re: Suspend resume problem (WAS Re: [ANNOUNCE] 3.8.10-rt6)

On 04/30/2013 08:08 PM, Steven Rostedt wrote:
>> This NMI related deadlock is a problem which should also trigger in
>> mainline, right?
>
> Well, yeah, as sending out a NMI stack dump is sorta the last resort,
> and it is dangerous to do printks from NMI context.

So we did bad and we upgrade to bad and dangerous.

>>
>> Now, the time jump on the other hand is the real issue here and is
>> RT-only. It looks like we get a big number of timer updates via
>> tick_do_update_jiffies64() because according to ktime_get() that much
>> time really passed by.
>
> As the NMI dump only happens because of the time jump, which as you
> said, is -rt only, I wouldn't say that the NMI deadlock is a mainline
> bug.

The reason for the NMI was a bug in the -RT tree but if something else
triggers that NMI we have a good chance to deadlock.

What about a try_lock() and leave after 50 usecs of trying and not
getting it in the in_nmi() case?
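
A rough userspace sketch of that idea: keep trying the lock and give up once a time budget is used up. Everything here is an assumption for illustration. The lock type, the helper names and the use of clock_gettime() are made up, and real NMI-context code could not simply be written this way.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical userspace lock standing in for logbuf_lock. */
struct model_lock {
	atomic_flag held;
};

/* Monotonic time in nanoseconds. */
static uint64_t model_now_ns(void)
{
	struct timespec ts;
	int ret = clock_gettime(CLOCK_MONOTONIC, &ts);

	assert(ret == 0);
	(void)ret;
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Single attempt; returns true if the lock was taken. */
static bool model_trylock(struct model_lock *l)
{
	return !atomic_flag_test_and_set_explicit(&l->held,
						  memory_order_acquire);
}

static void model_unlock(struct model_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Spin on trylock until it succeeds or budget_ns has elapsed. */
static bool model_lock_timeout(struct model_lock *l, uint64_t budget_ns)
{
	uint64_t deadline = model_now_ns() + budget_ns;

	do {
		if (model_trylock(l))
			return true;
	} while (model_now_ns() < deadline);
	return false;
}
```

A caller in the in_nmi() case would pass a budget of 50000 ns and drop the message when the function returns false, instead of spinning forever on a lock whose owner it may have interrupted.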

> -- Steve

Sebastian

2013-05-03 15:31:06

by Steven Rostedt

[permalink] [raw]
Subject: Re: Suspend resume problem (WAS Re: [ANNOUNCE] 3.8.10-rt6)

On Fri, 2013-05-03 at 11:59 +0200, Sebastian Andrzej Siewior wrote:
>
> > As the NMI dump only happens because of the time jump, which as you
> > said, is -rt only, I wouldn't say that the NMI deadlock is a mainline
> > bug.
>
> The reason for the NMI was a bug in the -RT tree but if something else
> triggers that NMI we have a good chance to deadlock.

But only if the NMI does a printk(). The only reason NMIs do printks is
when a bug is detected. But usually oops_in_progress() is called and
also zap_locks() is supposed to help prevent these problems. But that
doesn't always work.

>
> What about a try_lock() and leave after 50 usecs of trying and not
> getting it in the in_nmi() case?

I wouldn't try too hard to fix printks for NMIs. There are many things
that can go wrong with NMIs doing a printk while another printk is
active.

-- Steve