2007-09-15 02:29:37

by Michal Piotrowski

[permalink] [raw]
Subject: [1/3] 2.6.23-rc6: known regressions v2

Hi all,

Here is a list of some known regressions in 2.6.23-rc6.

Feel free to add new regressions/remove fixed etc.
http://kernelnewbies.org/known_regressions

List of Aces

Name Regressions fixed since 21-Jun-2007
Adrian Bunk 10
Andi Kleen 7
Linus Torvalds 6
Alan Stern 5
Hugh Dickins 5
Trond Myklebust 5
Andrew Morton 4
David S. Miller 4
Al Viro 3
Alexey Starikovskiy 3
Cornelia Huck 3
Jens Axboe 3
Stephen Hemminger 3
Tejun Heo 3



Unclassified

Subject : crashme fault
References : http://lkml.org/lkml/2007/9/13/19
Last known good : ?
Submitter : Randy Dunlap <[email protected]>
Caused-By : ?
Handled-By : ?
Status : unknown

Subject : cpu hotplug support broken in 2.6.23-rc3/highres timers break cpu hotplug in 2.6.23-rc5
References : http://lkml.org/lkml/2007/8/27/58
http://lkml.org/lkml/2007/9/3/65
Last known good : ?
Submitter : Pavel Machek <[email protected]>
Caused-By : ?
Handled-By : Thomas Gleixner <[email protected]>
Status : problem is being debugged

Subject : console is messed up after resume from s2ram or switching to console from X
References : http://lkml.org/lkml/2007/8/4/6
http://lkml.org/lkml/2007/9/13/227
Last known good : ?
Submitter : Jeff Chua <[email protected]>
Caused-By : ?
Handled-By : H. Peter Anvin <[email protected]>
Antonino A. Daplas <[email protected]>
Workaround : "s2ram --force --acpi_sleep 1 --vbe_mode"
Status : problem is being debugged



CPUFREQ

Subject : ide problems: 2.6.22-git17 working, 2.6.23-rc1* is not
References : http://lkml.org/lkml/2007/7/27/298
http://lkml.org/lkml/2007/7/29/371
Last known good : ?
Submitter : dth <[email protected]>
Caused-By : Len Brown <[email protected]>
commit f79e3185dd0f8650022518d7624c876d8929061b
Handled-By : Len Brown <[email protected]>
Status : problem is being debugged



FireWire

Subject : empty suspend stopped working around 2.6.23-rc4
References : http://lkml.org/lkml/2007/9/11/326
Last known good : ?
Submitter : Pavel Machek <[email protected]>
Caused-By : ?
Handled-By : ?
Status : unknown



Farewell!
Michal

--
LOGOUT
http://www.stardust.webpages.pl/


2007-09-16 09:35:47

by Stefan Richter

[permalink] [raw]
Subject: Re: [1/3] 2.6.23-rc6: known regressions v2 --- ieee1394, empty suspend stopped working

Michal Piotrowski wrote:
> FireWire

Could you title it "IEEE1394" instead? (According to the information so
far, it's in drivers/ieee1394, not drivers/firewire.)

> Subject : empty suspend stopped working around 2.6.23-rc4
> References : http://lkml.org/lkml/2007/9/11/326
> Last known good : ?
> Submitter : Pavel Machek <[email protected]>
> Caused-By : ?
> Handled-By : ?
> Status : unknown

I did a quick test on my main test machine, a Mac mini running x86-64
Linux. Regardless whether swap is on or off and whether ieee1394 and
ohci1394 are loaded or not, it always behaves the same. It does
something, then ends up with power LED off but a non-blinking cursor,
i.e. _, still shown on the text console. Is this good or bad? Note, I
didn't try to resume and won't, because I'm out of spare time.

Perhaps I resurrect two older PCs I have here and set them up for
hibernation sometime during the coming weeks. But although the words
regression and FireWire meet here, I am so far not inclined to make this
my problem.

Note, the only post-2.6.22 patch to the ieee1394 subsystem which is
directly related to power management went in before 2.6.23-rc1:
"Freezer: make kernel threads nonfreezable by default"
http://me.in-berlin.de/~s5r6/linux1394/merged/in_2.6.23/freezer-make-kernel-threads-nonfreezable-by-default.patch

(Apart from using git, all 2.6.23 patches to ieee1394 and firewire can
be seen at http://me.in-berlin.de/~s5r6/linux1394/merged/in_2.6.23/.)
--
Stefan Richter
-=====-=-=== =--= =----
http://arcgraph.de/sr/

2007-09-16 09:43:17

by Andrew Morton

[permalink] [raw]
Subject: Re: [1/3] 2.6.23-rc6: known regressions v2 --- ieee1394, empty suspend stopped working

On Sun, 16 Sep 2007 11:35:04 +0200 Stefan Richter <[email protected]> wrote:

> Michal Piotrowski wrote:
> > FireWire
>
> Could you title it "IEEE1394" instead? (According to the information so
> far, it's in drivers/ieee1394, not drivers/firewire.)
>
> > Subject : empty suspend stopped working around 2.6.23-rc4
> > References : http://lkml.org/lkml/2007/9/11/326
> > Last known good : ?
> > Submitter : Pavel Machek <[email protected]>
> > Caused-By : ?
> > Handled-By : ?
> > Status : unknown
>
> I did a quick test on my main test machine, a Mac mini running x86-64
> Linux. Regardless whether swap is on or off and whether ieee1394 and
> ohci1394 are loaded or not, it always behaves the same. It does
> something, then ends up with power LED off but a non-blinking cursor,
> i.e. _, still shown on the text console. Is this good or bad? Note, I
> didn't try to resume and won't, because I'm out of spare time.

When it is in this state, try hitting a key lots of times. It Works For Me ;)

Try Thomas's hrt tree, below. It Also Works For Me.


GIT f5ea61b91472bcc0782aa857fd4cf24db45ce732 git+ssh://master.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt.git#for-2.6.23

commit f5ea61b91472bcc0782aa857fd4cf24db45ce732
Author: Thomas Gleixner <[email protected]>
Date: Sat Sep 15 11:45:13 2007 +0200

clockevents: prevent stale tick update on offline cpu

Taking a cpu offline removes the cpu from the online mask before the
CPU_DEAD notification is done. The clock events layer does the cleanup
of the dead CPU from the CPU_DEAD notifier chain. tick_do_timer_cpu is
used to avoid xtime lock contention by assigning the task of jiffies
xtime updates to one CPU. If a CPU is taken offline, then this
assignment becomes stale. This went unnoticed because most of the time
the offline CPU went dead before the online CPU reached __cpu_die(),
where the CPU_DEAD state is checked. In the case that the offline CPU did
not reach the DEAD state before we reach __cpu_die(), the code in there
goes to sleep for 100ms. Due to the stale time update assignment, the
system is stuck forever.

Take the assignment away when a cpu is not longer in the cpu_online_mask.
We do this in the last call to tick_nohz_stop_sched_tick() when the offline
CPU is on the way to the final play_dead() idle entry.

Signed-off-by: Thomas Gleixner <[email protected]>

commit 259df5c5f73b39bea8e62baa6c85e7b0ee61dff3
Author: Thomas Gleixner <[email protected]>
Date: Sat Sep 15 11:45:13 2007 +0200

clockevents: do not shutdown the oneshot broadcast device

When a cpu goes offline it is removed from the broadcast masks. If the
mask becomes empty the code shuts down the broadcast device. This is
wrong, because the broadcast device needs to be ready for the online
cpu going idle (into a c-state, which stops the local apic timer).

Signed-off-by: Thomas Gleixner <[email protected]>

commit 63ffda310fe82212aac2bb85a9105b87430995f7
Author: Thomas Gleixner <[email protected]>
Date: Sat Sep 15 11:45:13 2007 +0200

clockevents: Enforce oneshot broadcast when broadcast mask is set on resume

The jinxed VAIO refuses to resume without hitting keys on the keyboard
when this is not enforced. It is unclear why the cpu ends up in a lower
C State without notifying the clock events layer, but enforcing the
oneshot broadcast here is safe.

Signed-off-by: Thomas Gleixner <[email protected]>

commit 112044df762a5e6e8946a3e5773312345f0154cf
Author: Thomas Gleixner <[email protected]>
Date: Sat Sep 15 11:45:13 2007 +0200

ACPI: Reevaluate C/P/T states when a cpu becomes online

Reevaluate C/P/T states when a cpu becomes online. This avoids
the caching of the broadcast information in the clockevents layer.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Len Brown <[email protected]>

commit e07fd18c03e3e8a03454074d46e052e622a0fdff
Author: Thomas Gleixner <[email protected]>
Date: Sat Sep 15 11:45:12 2007 +0200

timekeeping: Prevent time going backwards on resume

Timekeeping resume adjusts xtime by adding the slept time in seconds and
resets the reference value of the clock source (clock->cycle_last).
clock->cycle last is used to calculate the delta between the last xtime
update and the readout of the clock source in __get_nsec_offset(). xtime
plus the offset is the current time. The resume code ignores the delta
which had already elapsed between the last xtime update and the actual
time of suspend. If the suspend time is short, then we can see time
going backwards on resume.

Suspend:
offs_s = clock->read() - clock->cycle_last;
now = xtime + offs_s;
timekeeping_suspend_time = read_rtc();

Resume:
sleep_time = read_rtc() - timekeeping_suspend_time;
xtime.tv_sec += sleep_time;
clock->cycle_last = clock->read();
offs_r = clock->read() - clock->cycle_last;
now = xtime + offs_r;

if sleep_time_seconds == 0 and offs_r < offs_s, then time goes
backwards.

Fix this by storing the offset from the last xtime update and add it to
xtime during resume, when we reset clock->cycle_last:

sleep_time = read_rtc() - timekeeping_suspend_time;
xtime.tv_sec += sleep_time;
xtime += offs_s; /* Fixup xtime offset at suspend time */
clock->cycle_last = clock->read();
offs_r = clock->read() - clock->cycle_last;
now = xtime + offs_r;

Thanks to Marcelo for tracking this down on the OLPC and providing the
necessary details to analyze the root cause.

Signed-off-by: Thomas Gleixner <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Tosatti <[email protected]>

commit 1d4a40ed14684b198f873e16be1f84835afade9b
Author: Thomas Gleixner <[email protected]>
Date: Sat Sep 15 11:45:12 2007 +0200

timekeeping: access rtc outside of xtime lock

Lockdep complains about the access of rtc in timekeeping_suspend
inside the interrupt disabled region of the write locked xtime lock.
Move the access outside.

Signed-off-by: Thomas Gleixner <[email protected]>
Cc: John Stultz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

drivers/acpi/processor_core.c | 21 +++++++++++++++++++++
kernel/time/tick-broadcast.c | 24 ++++++++++++++++--------
kernel/time/tick-sched.c | 12 ++++++++++++
kernel/time/timekeeping.c | 10 +++++++++-
4 files changed, 58 insertions(+), 9 deletions(-)

diff -puN drivers/acpi/processor_core.c~git-hrt drivers/acpi/processor_core.c
--- a/drivers/acpi/processor_core.c~git-hrt
+++ a/drivers/acpi/processor_core.c
@@ -725,6 +725,25 @@ static void acpi_processor_notify(acpi_h
return;
}

+static int acpi_cpu_soft_notify(struct notifier_block *nfb,
+ unsigned long action, void *hcpu)
+{
+ unsigned int cpu = (unsigned long)hcpu;
+ struct acpi_processor *pr = processors[cpu];
+
+ if (action == CPU_ONLINE && pr) {
+ acpi_processor_ppc_has_changed(pr);
+ acpi_processor_cst_has_changed(pr);
+ acpi_processor_tstate_has_changed(pr);
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block acpi_cpu_notifier =
+{
+ .notifier_call = acpi_cpu_soft_notify,
+};
+
static int acpi_processor_add(struct acpi_device *device)
{
struct acpi_processor *pr = NULL;
@@ -988,6 +1007,7 @@ void acpi_processor_install_hotplug_noti
ACPI_UINT32_MAX,
processor_walk_namespace_cb, &action, NULL);
#endif
+ register_hotcpu_notifier(&acpi_cpu_notifier);
}

static
@@ -1000,6 +1020,7 @@ void acpi_processor_uninstall_hotplug_no
ACPI_UINT32_MAX,
processor_walk_namespace_cb, &action, NULL);
#endif
+ unregister_hotcpu_notifier(&acpi_cpu_notifier);
}

/*
diff -puN kernel/time/tick-broadcast.c~git-hrt kernel/time/tick-broadcast.c
--- a/kernel/time/tick-broadcast.c~git-hrt
+++ a/kernel/time/tick-broadcast.c
@@ -382,12 +382,23 @@ static int tick_broadcast_set_event(ktim

int tick_resume_broadcast_oneshot(struct clock_event_device *bc)
{
+ int cpu = smp_processor_id();
+
+ /*
+ * If the CPU is marked for broadcast, enforce oneshot
+ * broadcast mode. The jinxed VAIO does not resume otherwise.
+ * No idea why it ends up in a lower C State during resume
+ * without notifying the clock events layer.
+ */
+ if (cpu_isset(cpu, tick_broadcast_mask))
+ cpu_set(cpu, tick_broadcast_oneshot_mask);
+
clockevents_set_mode(bc, CLOCK_EVT_MODE_ONESHOT);

if(!cpus_empty(tick_broadcast_oneshot_mask))
tick_broadcast_set_event(ktime_get(), 1);

- return cpu_isset(smp_processor_id(), tick_broadcast_oneshot_mask);
+ return cpu_isset(cpu, tick_broadcast_oneshot_mask);
}

/*
@@ -549,20 +560,17 @@ void tick_broadcast_switch_to_oneshot(vo
*/
void tick_shutdown_broadcast_oneshot(unsigned int *cpup)
{
- struct clock_event_device *bc;
unsigned long flags;
unsigned int cpu = *cpup;

spin_lock_irqsave(&tick_broadcast_lock, flags);

- bc = tick_broadcast_device.evtdev;
+ /*
+ * Clear the broadcast mask flag for the dead cpu, but do not
+ * stop the broadcast device!
+ */
cpu_clear(cpu, tick_broadcast_oneshot_mask);

- if (tick_broadcast_device.mode == TICKDEV_MODE_ONESHOT) {
- if (bc && cpus_empty(tick_broadcast_oneshot_mask))
- clockevents_set_mode(bc, CLOCK_EVT_MODE_SHUTDOWN);
- }
-
spin_unlock_irqrestore(&tick_broadcast_lock, flags);
}

diff -puN kernel/time/tick-sched.c~git-hrt kernel/time/tick-sched.c
--- a/kernel/time/tick-sched.c~git-hrt
+++ a/kernel/time/tick-sched.c
@@ -161,6 +161,18 @@ void tick_nohz_stop_sched_tick(void)
cpu = smp_processor_id();
ts = &per_cpu(tick_cpu_sched, cpu);

+ /*
+ * If this cpu is offline and it is the one which updates
+ * jiffies, then give up the assignment and let it be taken by
+ * the cpu which runs the tick timer next. If we don't drop
+ * this here the jiffies might be stale and do_timer() never
+ * invoked.
+ */
+ if (unlikely(!cpu_online(cpu))) {
+ if (cpu == tick_do_timer_cpu)
+ tick_do_timer_cpu = -1;
+ }
+
if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
goto end;

diff -puN kernel/time/timekeeping.c~git-hrt kernel/time/timekeeping.c
--- a/kernel/time/timekeeping.c~git-hrt
+++ a/kernel/time/timekeeping.c
@@ -217,6 +217,7 @@ static void change_clocksource(void)
}
#else
static inline void change_clocksource(void) { }
+static inline s64 __get_nsec_offset(void) { return 0; }
#endif

/**
@@ -280,6 +281,8 @@ void __init timekeeping_init(void)
static int timekeeping_suspended;
/* time in seconds when suspend began */
static unsigned long timekeeping_suspend_time;
+/* xtime offset when we went into suspend */
+static s64 timekeeping_suspend_nsecs;

/**
* timekeeping_resume - Resumes the generic timekeeping subsystem.
@@ -305,6 +308,8 @@ static int timekeeping_resume(struct sys
wall_to_monotonic.tv_sec -= sleep_length;
total_sleep_time += sleep_length;
}
+ /* Make sure that we have the correct xtime reference */
+ timespec_add_ns(&xtime, timekeeping_suspend_nsecs);
/* re-base the last cycle value */
clock->cycle_last = clocksource_read(clock);
clock->error = 0;
@@ -325,9 +330,12 @@ static int timekeeping_suspend(struct sy
{
unsigned long flags;

+ timekeeping_suspend_time = read_persistent_clock();
+
write_seqlock_irqsave(&xtime_lock, flags);
+ /* Get the current xtime offset */
+ timekeeping_suspend_nsecs = __get_nsec_offset();
timekeeping_suspended = 1;
- timekeeping_suspend_time = read_persistent_clock();
write_sequnlock_irqrestore(&xtime_lock, flags);

clockevents_notify(CLOCK_EVT_NOTIFY_SUSPEND, NULL);
_

2007-09-16 10:24:19

by Stefan Richter

[permalink] [raw]
Subject: Re: [1/3] 2.6.23-rc6: known regressions v2 --- ieee1394, empty suspend stopped working

Andrew Morton wrote:
> On Sun, 16 Sep 2007 11:35:04 +0200 Stefan Richter <[email protected]> wrote:
>> Regardless whether swap is on or off and whether ieee1394 and
>> ohci1394 are loaded or not, it always behaves the same. It does
>> something, then ends up with power LED off but a non-blinking cursor,
>> i.e. _, still shown on the text console.
...
> When it is in this state, try hitting a key lots of times. It Works For Me ;)

Doesn't work here.

> Try Thomas's hrt tree, below. It Also Works For Me.

No change with this either.
--
Stefan Richter
-=====-=-=== =--= =----
http://arcgraph.de/sr/

2007-09-16 11:57:39

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [1/3] 2.6.23-rc6: known regressions v2 --- ieee1394, empty suspend stopped working

On Sunday, 16 September 2007 11:35, Stefan Richter wrote:
> Michal Piotrowski wrote:
> > FireWire
>
> Could you title it "IEEE1394" instead? (According to the information so
> far, it's in drivers/ieee1394, not drivers/firewire.)
>
> > Subject : empty suspend stopped working around 2.6.23-rc4
> > References : http://lkml.org/lkml/2007/9/11/326
> > Last known good : ?
> > Submitter : Pavel Machek <[email protected]>
> > Caused-By : ?
> > Handled-By : ?
> > Status : unknown
>
> I did a quick test on my main test machine, a Mac mini running x86-64
> Linux. Regardless whether swap is on or off and whether ieee1394 and
> ohci1394 are loaded or not, it always behaves the same. It does
> something, then ends up with power LED off but a non-blinking cursor,
> i.e. _, still shown on the text console. Is this good or bad?

Bad. It's supposed to power off the machine.

Can you please try "echo test > /sys/power/disk" before the hibernation?

After that "echo disk > /sys/power/state" should only "suspend" the box
for several seconds and come back to the command prompt.

If that fails, please try

# echo testproc > /sys/power/disk
# echo disk > /sys/power/state

and see if this returns to the command prompt.

Greetings,
Rafael

2007-09-16 13:25:38

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [1/3] 2.6.23-rc6: known regressions v2 --- ieee1394, empty suspend stopped working

On Sun, 2007-09-16 at 02:41 -0700, Andrew Morton wrote:
> > I did a quick test on my main test machine, a Mac mini running x86-64
> > Linux. Regardless whether swap is on or off and whether ieee1394 and
> > ohci1394 are loaded or not, it always behaves the same. It does
> > something, then ends up with power LED off but a non-blinking cursor,
> > i.e. _, still shown on the text console. Is this good or bad? Note, I
> > didn't try to resume and won't, because I'm out of spare time.
>
> When it is in this state, try hitting a key lots of times. It Works For Me ;)
>
> Try Thomas's hrt tree, below. It Also Works For Me.

Will not help much. He is running x86-64 which is not affected by
clockevents & Co. (yet).

tglx


2007-09-30 16:45:49

by Stefan Richter

[permalink] [raw]
Subject: Re: [1/3] 2.6.23-rc6: known regressions v2 --- ieee1394, empty suspend stopped working

Rafael J. Wysocki wrote on 2007-09-16:
[me having trouble to set something up to reproduce the regression]
> Can you please try "echo test > /sys/power/disk" before the hibernation?
>
> After that "echo disk > /sys/power/state" should only "suspend" the box
> for several seconds and come back to the command prompt.
>
> If that fails, please try
>
> # echo testproc > /sys/power/disk
> # echo disk > /sys/power/state
>
> and see if this returns to the command prompt.

Sorry for keeping silence, but there is too much else going on here at
the moment. If I find time to follow up reasonably soon, I'll do.
Thanks for the advice.
--
Stefan Richter
-=====-=-=== =--= ====-
http://arcgraph.de/sr/