2007-06-28 14:01:15

by Thomas Sattler

[permalink] [raw]
Subject: 2.6.22-rc6 spurious hangs

Hi there ...

I'm observing seldom hangs with linux 2.6. I can't tell when exactly it
happened the first time, I think somewhere around 2.6.16 or 2.6.17. I
see it about once or twice a month. With absolutely nothing in the logs.
So far I asked for help:

- in the -ck list

Mon Sep 4 10:22:06 EST 2006, [ck] ck-patches seem to break DVB-T drivers
(see http://bhhdoa.org.au/pipermail/ck/2006-September/thread.html#6385)

- in the linux-dvb list

Wed Sep 6 19:02:29 CEST 2006, [linux-dvb] driver problems when using
ck-patchset
(http://www.linuxtv.org/pipermail/linux-dvb/2006-September/thread.html#12649)

- in the DaLUG (german, currently no archive) 14.09.2006


But nobody could help me so far.

Here is what I do:

I was running different kernels with different patchsets. It happened in
the past on -ck kernels (staircase), vanilla scheduler and cfs. As far
as I can remember the following patches were allways applied: squashfs
and vesa-tng.

Currently I'm running 2.6.22rc6 with cfs-v18, vesa-tng and an
XFS-lockdep patch:

http://people.redhat.com/mingo/cfs-scheduler/sched-cfs-v2.6.22-rc6-v18.patch
http://dev.gentoo.org/~spock/projects/vesafb-tng/archive/vesafb-tng-1.0-rc2-2.6.20-rc2.patch
see http://marc.info/?l=linux-kernel&m=118286232709378&w=2

I also installed these kernel modules via gentoo portage:

ati-drivers-8.37.6-r1
fuse-2.6.4-r1
kqemu-1.3.0_pre11
truecrypt-4.3

kqemu and truecrypt weren't loaded, but ati-drivers and fuse were.

The box I talk about is an IBM T41p with 1.7GHz Pentium M and 512MB RAM.
The distribution in use is gentoo, quite up to date. Attached to the box
is an USB2.0 DVB-T receiver (Cinergy T?, Terratec).

In rare cases the keyboard stops working when the T? stops streaming DVB
to the box. It happens when I record the stream to disk as well as when
I stream it to mplayer.

If end of streaming is caused by a keypress, 'q' or 'enter' on mplayer,
that key gets stuck. It's repeated until I reboot the box.

If the recording was scheduled and stops by itself no more keys are
recognized. The keyboard is dead. The laptop's own and the attached
USB-Keyboard. Magic-Sys-Keys are still working.

I can still use the mouse to move windows around, start new xterms via
icewm's panel or copy and past single characters from an xterm to other
xterms.

I can also close most of the open windows, for example firefox and most
xterms. I cannot close an xterm which is started as 'xterm -e top' by
icewm or a vncviewer. Both windows stay open but lose their content.

If a root shell is open I can enter 'reboot' or 'halt' but most of the
time this doesn't reboot or halt. I get the message for an upcoming
shotdown in all xterms but the box doesn't come down.

The systemload continously increases but there is nothing to see in top why.

Ingo Molnar told me to enable CONFIG_PROVE_LOCKING but xfs triggers it
long before the box hangs. I tested the patch mentioned above but it was
triggered by xfs again, see [1] and I didn't reboot between this and the
last hung. [1] http://marc.info/?l=linux-kernel&m=118295294529681&w=2

As Ingo told me I run 'echo t > /proc/sysrq-trigger' this time. The
corresponding part of my syslogs is attached, as well as my kernel config.


Another thing I observed with the T? is that it doesn't work if it's
already connected when the laptop boots up. I need to power off,
disconnect and boot. If I connect the T? after bootup it works. I can
also rmmod it's driver when it's not in use.

If I boot the box with the T? connected I cannot use it, the blue led in
the T? is always off and I cannot rmmod the driver. (I don't know
whether I ever tired to rmmod the driver before I tried to use the T?.)


Please CC me as I'm not subscribed to the list.

Thomas

--
keep mailinglists in english, feel free to send PM in german


Attachments:
messages.gz (4.10 kB)
config.gz (11.97 kB)
Download all attachments

2007-06-28 14:47:31

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 2.6.22-rc6 spurious hangs

On 06/28, Thomas Sattler wrote:
>
> As Ingo told me I run 'echo t > /proc/sysrq-trigger' this time. The
> corresponding part of my syslogs is attached, as well as my kernel config.

xs_connect() and release_dev() are blocked on flush_workqueue(). Perhaps
this is OK, but may be not.

Could you try the patch below? It dumps some info when flush_workqueue()
hangs.

Oleg.

--- OLD/kernel/sched.c~TST 2007-04-05 12:20:35.000000000 +0400
+++ OLD/kernel/sched.c 2007-06-02 15:41:53.000000000 +0400
@@ -4177,6 +4177,20 @@ struct task_struct *idle_task(int cpu)
return cpu_rq(cpu)->idle;
}

+struct task_struct *get_cpu_curr(int cpu)
+{
+ unsigned long flags;
+ struct task_struct *curr;
+ struct rq *rq = cpu_rq(cpu);
+
+ spin_lock_irqsave(&rq->lock, flags);
+ curr = rq->curr;
+ get_task_struct(curr);
+ spin_unlock_irqrestore(&rq->lock, flags);
+
+ return curr;
+}
+
/**
* find_process_by_pid - find a process with a matching PID value.
* @pid: the pid in question.
--- OLD/kernel/workqueue.c~TST 2007-06-02 13:34:57.000000000 +0400
+++ OLD/kernel/workqueue.c 2007-06-03 11:28:54.000000000 +0400
@@ -49,6 +49,7 @@ struct cpu_workqueue_struct {
struct task_struct *thread;

int run_depth; /* Detect run_workqueue() recursion depth */
+ int jobs;
} ____cacheline_aligned;

/*
@@ -253,6 +254,7 @@ static void run_workqueue(struct cpu_wor

cwq->current_work = work;
list_del_init(cwq->worklist.next);
+ cwq->jobs++;
spin_unlock_irq(&cwq->lock);

BUG_ON(get_wq_data(work) != cwq);
@@ -328,6 +330,47 @@ static void insert_wq_barrier(struct cpu
insert_work(cwq, &barr->work, tail);
}

+extern struct task_struct *get_cpu_curr(int cpu);
+
+static void flush_wait(struct cpu_workqueue_struct *cwq, struct completion *done)
+{
+ const int cpu = task_cpu(cwq->thread);
+ struct task_struct *curr;
+ struct work_struct *work;
+ int old_pid, state, jobs;
+
+again:
+ state = cwq->thread->state;
+ work = cwq->current_work;
+ jobs = cwq->jobs;
+
+ curr = get_cpu_curr(cpu);
+ old_pid = curr->pid;
+ put_task_struct(curr);
+
+ if (wait_for_completion_timeout(done, HZ * 30))
+ return;
+
+ printk(KERN_ERR "ERR!! %s flush hang: %p %p %d %d %d %d\n", cwq->thread->comm,
+ work, cwq->current_work, jobs, cwq->jobs,
+ state, (int)cwq->thread->state);
+
+ curr = get_cpu_curr(cpu);
+ printk(KERN_ERR "CURR: %d %d %s %ld %ld\n", old_pid, curr->pid,
+ curr->comm, curr->nivcsw, curr->nvcsw);
+ put_task_struct(curr);
+
+ spin_lock_irq(&cwq->lock);
+ list_for_each_entry(work, &cwq->worklist, entry)
+ print_symbol(" %s\n", (unsigned long) work->func);
+ printk(" ----\n");
+ if (cwq->current_work)
+ print_symbol(" %s\n", (unsigned long) cwq->current_work->func);
+ spin_unlock_irq(&cwq->lock);
+
+ goto again;
+}
+
static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
{
int active;
@@ -351,7 +394,7 @@ static int flush_cpu_workqueue(struct cp
spin_unlock_irq(&cwq->lock);

if (active)
- wait_for_completion(&barr.done);
+ flush_wait(cwq, &barr.done);
}

return active;

2007-06-28 14:58:28

by Thomas Sattler

[permalink] [raw]
Subject: Re: 2.6.22-rc6 spurious hangs

>> As Ingo told me I run 'echo t > /proc/sysrq-trigger' this time. The
>> corresponding part of my syslogs is attached, as well as my kernel config.
>
> Could you try the patch below? It dumps some info when flush_workqueue()
> hangs.

I'm compiling a patched kernel right now. As I wrote in my former mail the
whole thing not easy to trigger. So it can take some time to get the info.

Thanks so far,
Thomas

--
keep mailinglists in english, feel free to send PM in german

2007-06-28 15:01:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.6.22-rc6 spurious hangs


* Thomas Sattler <[email protected]> wrote:

> >> As Ingo told me I run 'echo t > /proc/sysrq-trigger' this time. The
> >> corresponding part of my syslogs is attached, as well as my kernel config.
> >
> > Could you try the patch below? It dumps some info when flush_workqueue()
> > hangs.
>
> I'm compiling a patched kernel right now. As I wrote in my former mail
> the whole thing not easy to trigger. So it can take some time to get
> the info.

i'd suggest to increase CONFIG_LOG_BUF_SHIFT=17 to =20 in your .config -
to make sure the full SysRq-t dump is captured when you do it.

Ingo

2007-06-28 17:35:17

by Thomas Sattler

[permalink] [raw]
Subject: Re: 2.6.22-rc6 spurious hangs

>>>> As Ingo told me I run 'echo t > /proc/sysrq-trigger' this time. The
>>>> corresponding part of my syslogs is attached, as well as my kernel config.
>>> Could you try the patch below? It dumps some info when flush_workqueue()
>>> hangs.
>> I'm compiling a patched kernel right now. As I wrote in my former mail the
>> whole thing not easy to trigger. So it can take some time to get the info.
>
> Forgot to say, if you manage to trigger the hang, please wait a couple of
> minutes to collect more info from flush_wait().

Seems today is my lucky day: I triggered it in just a few minutes.

The logfile is attached.

Thomas

--
keep mailinglists in english, feel free to send PM in german

2007-06-28 17:49:17

by Thomas Sattler

[permalink] [raw]
Subject: Re: 2.6.22-rc6 spurious hangs

Here is the logfile.

Thomas

--
keep mailinglists in english, feel free to send PM in german


Attachments:
messages.gz (3.98 kB)

2007-06-28 18:10:44

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 2.6.22-rc6 spurious hangs

On 06/28, Thomas Sattler wrote:
>
> Here is the logfile.

> Jun 28 18:26:03 pearl ERR!! events/0 flush hang: f573759c f573759c 5782 5782 0 1
> Jun 28 18:26:03 pearl CURR: 5557 5557 dvbd 0 129024
> Jun 28 18:26:03 pearl wq_barrier_func+0x0/0xd
> Jun 28 18:26:03 pearl cache_reap+0x0/0xe3
> Jun 28 18:26:03 pearl ----
> Jun 28 18:26:03 pearl cinergyt2_query_rc+0x0/0x2e9 [cinergyT2]
> ...
> Jun 28 19:23:03 pearl ERR!! events/0 flush hang: f573759c f573759c 5782 5782 1 1
> Jun 28 19:23:03 pearl CURR: 5557 5557 dvbd 0 129138
> Jun 28 19:23:03 pearl wq_barrier_func+0x0/0xd
> Jun 28 19:23:03 pearl cache_reap+0x0/0xe3
> Jun 28 19:23:03 pearl console_callback+0x0/0xca
> Jun 28 19:23:03 pearl flush_to_ldisc+0x0/0xf7
> Jun 28 19:23:03 pearl rekey_seq_generator+0x0/0x4e
> Jun 28 19:23:03 pearl ----
> Jun 28 19:23:03 pearl cinergyt2_query_rc+0x0/0x2e9 [cinergyT2]

cinergyt2_query_rc() hangs. I'll try to look tomorrov, but I know nothing
about drivers/media/dvb/.

This log is strange, flush_work() should report every 30 seconds, but it
looks like wait_for_completion_timeout(HZ * 30) returns a lot earlier.

Could you also show the result of sysrq-T ?

Thanks!

Oleg.

2007-06-28 21:31:19

by Thomas Sattler

[permalink] [raw]
Subject: Re: 2.6.22-rc6 spurious hangs

> Could you also show the result of sysrq-T ?
I was so happy that I could trigger it that fast ...
... that I forgot to press Alt-Sysrq-t before reboot.
:-(

But, I could trigger it again. :-)

This time I can offer:

- Debug output from Oleg's patch (11x, every 30s)
- Alt-Sysrq-t (3x, about 30s between them)

There is no lockdep stuff but lockdep must have
been running. It's enabled and did not fire
before the the bug was triggered.

The logfile is attached.
(yes it is, I checked twice)

Thomas


Attachments:
messages.gz (27.97 kB)

2007-06-29 07:14:15

by Thomas Sattler

[permalink] [raw]
Subject: Re: 2.6.22-rc6 spurious hangs

>> Jun 28 19:23:03 pearl cinergyt2_query_rc+0x0/0x2e9 [cinergyT2]
>
> cinergyt2_query_rc() hangs. I'll try to look tomorrov, but I know nothing
> about drivers/media/dvb/.

Does this mean the problem is in the cinergyt2 driver? I'm having similar
problems with another box but with different hardware. While my laptop is
used as a test system the other one is used as a 'productive' TV-recorder.
I hoped we could trace the bug on the test system and fix the productive
one at the same time. :-/

The other box ("silver") is a desktop, which has two Hauppauge Nova-T DVB-T
PCI cards and one (analog) Hauppauge WinTV PVR-350. Silver only hangs if
the (digital) recording process has to much priority: (silver is running
2.6.21.5-cfs-v17 +squashfs +ivtv)

As I wanted to give as much priority to the recording process as possible
I firstly run dvbd as SCHED_RR. This hung the box quite often, sometimes
after an uptime of several minutes, sometimes after two weeks.

I switched to -ck and run dvbd as SCHED_ISO which worked without *any*
problem for about 18 months. As -ck is discontinued I switched to CFS and
the box hung again (twice until I understood why) when dvbd was running as
nice -15.

ATM dvbd runs with nice -12 but yesterday, during a rsync-transfer of
several >4G files, a recording was broken. 29 seconds of the recorded
stream are lost because the system load was at 5 for about three hours.

Perhaps the 29 missing seconds are caused not by to less CPU time but by
the havy IO of rsync. But on the other hand dvbd is also running at IO
realtime prio 4 (ionice) while rsync run as IO normal.

Any hints?
Thomas

--
keep mailinglists in english, feel free to send PM in german

2007-06-29 11:27:09

by Tomi Orava

[permalink] [raw]
Subject: Re: 2.6.22-rc6 spurious hangs


Hello,

> I'm observing seldom hangs with linux 2.6. I can't tell when exactly it
> happened the first time, I think somewhere around 2.6.16 or 2.6.17. I
> see it about once or twice a month. With absolutely nothing in the logs.
> So far I asked for help:

<snip>

> The box I talk about is an IBM T41p with 1.7GHz Pentium M and 512MB RAM.
> The distribution in use is gentoo, quite up to date. Attached to the box
> is an USB2.0 DVB-T receiver (Cinergy T², Terratec).
>
> In rare cases the keyboard stops working when the T² stops streaming DVB
> to the box. It happens when I record the stream to disk as well as when
> I stream it to mplayer.

I just wanted you to know that you are not alone with this problem.
I've seen the same symptoms on my parents machine during the last 1-2 months
--> ie. keyboard hangs completely, mouse works and the host either hangs
completely sooner or later with only the following usb debug messages in
the logs (pasted below).

I believe that there is certainly some sort of bug in cinergyT2-driver
as I've not seen the hang happen whenever the cinergyT2 usb-receiver
has been disconnected from the machine.

Host configuration is:
2.6.22-rc4-git4 kernel + latest DVB drivers from v4l-dvb tree
(previously also tested with FC6 kernels with same symptoms)
MythTV-software (from devel SVN)

Asus P4P800-E motherboard (ich5 & PDC20378 sata-controllers)
1GB ram
Intel P4 2.8MHz
Terratec Cinergy T2 dvb-t receiver

Regards,
Tomi Orava


--------------------------------------------------
ehci_hcd 0000:00:1d.7: devpath 1 ep2in 3strikes
hub 1-0:1.0: state 7 ports 8 chg 0000 evt 0002
ehci_hcd 0000:00:1d.7: GetStatus port 1 status 001002 POWER sig=se0 CSC
hub 1-0:1.0: port 1, status 0100, change 0001, 12 Mb/s
usb 1-1: USB disconnect, address 2
usb 1-1: unregistering device
usb 1-1: usb_disable_device nuking all URBs
ehci_hcd 0000:00:1d.7: shutdown urb f6c9a3c0 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9a340 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9a2c0 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9a240 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f71a6c40 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f7e7f0c0 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c37dc0 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c37e40 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f71a6d40 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9af40 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f71a61c0 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9ae40 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9adc0 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9ad40 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9acc0 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9ac40 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9abc0 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9ab40 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9aac0 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9aa40 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9a9c0 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9a940 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9a8c0 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9a840 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9a7c0 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9a740 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9a6c0 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9a640 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9a5c0 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9a540 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9a4c0 pipe c0010280 ep2in-bulk
ehci_hcd 0000:00:1d.7: shutdown urb f6c9a440 pipe c0010280 ep2in-bulk
cinergyT2: cinergyt2_submit_stream_urb(): urb submission failed (err = -19)!
last message repeated 31 times
---------------------------------------------------------------------
--


2007-06-29 13:09:08

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 2.6.22-rc6 spurious hangs

On 06/29, Thomas Sattler wrote:
>
> >> Jun 28 19:23:03 pearl cinergyt2_query_rc+0x0/0x2e9 [cinergyT2]
> >
> > cinergyt2_query_rc() hangs. I'll try to look tomorrov, but I know nothing
> > about drivers/media/dvb/.
>
> Does this mean the problem is in the cinergyt2 driver? I'm having similar
> problems with another box but with different hardware. While my laptop is
> used as a test system the other one is used as a 'productive' TV-recorder.
> I hoped we could trace the bug on the test system and fix the productive
> one at the same time. :-/

Yes, I think cinergyt2 is buggy.

> Jun 28 23:02:42 pearl events/0 S 00000280 0 5 2 (L-TLB)
> Jun 28 23:02:42 pearl c1d4ded0 00000096 55136d42 00000280 c0302fa1 c1d4ad10 55136d42 00000280
> Jun 28 23:02:42 pearl c1d5c170 f737cd40 c1d4deec f56eca50 00000246 c1d4df0c c0303b81 00000000
> Jun 28 23:02:42 pearl 00000002 c0303c94 00000280 c1d4ab80 f56eca7c f56eca7c c1d4ab80 f56eca50
> Jun 28 23:02:42 pearl Call Trace:
> Jun 28 23:02:42 pearl [<c0303b81>] __mutex_lock_interruptible_slowpath+0x192/0x293
> Jun 28 23:02:42 pearl [<c0303c94>] mutex_lock_interruptible+0x12/0x15
> Jun 28 23:02:42 pearl [<f96ae120>] cinergyt2_query_rc+0x2d/0x2e9 [cinergyT2]
> Jun 28 23:02:42 pearl [<c012a79b>] run_workqueue+0x7f/0x130
> Jun 28 23:02:42 pearl [<c012b036>] worker_thread+0xb7/0xc5
> Jun 28 23:02:42 pearl [<c012d577>] kthread+0x39/0x5e
> Jun 28 23:02:42 pearl [<c010485b>] kernel_thread_helper+0x7/0x10


> Jun 28 23:02:42 pearl dvbd D 00000281 0 5643 1 (NOTLB)
> Jun 28 23:02:42 pearl f6fd1e28 00000086 0a067524 00000281 c0543800 c1d5c300 0a067524 00000281
> Jun 28 23:02:42 pearl 00000000 f737cd40 f6fd1e38 00c3fba6 f6fd1e90 f6fd1e70 c0303704 f6fd1f00
> Jun 28 23:02:42 pearl 000458ee f702cf28 c05442a8 00c3fba6 c0124e58 c1d5c170 c0543800 c03036ff
> Jun 28 23:02:42 pearl Call Trace:
> Jun 28 23:02:42 pearl [<c0303704>] schedule_timeout+0x6e/0x8c
> Jun 28 23:02:42 pearl [<c03032b7>] wait_for_completion_timeout+0x72/0xcf
> Jun 28 23:02:42 pearl [<c012a9a4>] flush_cpu_workqueue+0x9d/0x19a
> Jun 28 23:02:42 pearl [<c012ad6e>] flush_workqueue+0x12/0x15
> Jun 28 23:02:42 pearl [<c012ad7e>] flush_scheduled_work+0xd/0xf
> Jun 28 23:02:42 pearl [<f96ae011>] cinergyt2_release+0x4c/0xa4 [cinergyT2]
> Jun 28 23:02:42 pearl [<c0160159>] __fput+0xaf/0x14b
> Jun 28 23:02:42 pearl [<c016020b>] fput+0x16/0x18
> Jun 28 23:02:42 pearl [<c015ddf3>] filp_close+0x54/0x5c
> Jun 28 23:02:42 pearl [<c015edcb>] sys_close+0x6e/0xa6
> Jun 28 23:02:42 pearl [<c0103c4e>] sysenter_past_esp+0x5f/0x99

cinergyt2_release() does flush_scheduled_work() under cinergyt2->sem.
flush_scheduled_work() hangs because cinergyt2_query_rc() waits for
the same cinergyt2->sem.

->disconnect_pending is used without any locks/barriers, perhaps this
is the reason.

I'll try to look further tomorrow. In any case, cinergyT2 should not
use flush_scheduled_work() at all.

Oleg.

2007-06-29 13:16:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.6.22-rc6 spurious hangs


* Oleg Nesterov <[email protected]> wrote:

> Yes, I think cinergyt2 is buggy.

> cinergyt2_release() does flush_scheduled_work() under cinergyt2->sem.
> flush_scheduled_work() hangs because cinergyt2_query_rc() waits for
> the same cinergyt2->sem.
>
> ->disconnect_pending is used without any locks/barriers, perhaps this
> is the reason.
>
> I'll try to look further tomorrow. In any case, cinergyT2 should not
> use flush_scheduled_work() at all.

would the hack below be worth trying, to see whether there are any
further problems?

Ingo

Index: linux/drivers/media/dvb/cinergyT2/cinergyT2.c
===================================================================
--- linux.orig/drivers/media/dvb/cinergyT2/cinergyT2.c
+++ linux/drivers/media/dvb/cinergyT2/cinergyT2.c
@@ -523,7 +523,6 @@ static int cinergyt2_release (struct ino

if (!cinergyt2->disconnect_pending && (file->f_flags & O_ACCMODE) != O_RDONLY) {
cancel_delayed_work(&cinergyt2->query_work);
- flush_scheduled_work();
cinergyt2_sleep(cinergyt2, 1);
}

@@ -974,8 +973,6 @@ static void cinergyt2_disconnect (struct
{
struct cinergyt2 *cinergyt2 = usb_get_intfdata (intf);

- flush_scheduled_work();
-
cinergyt2_unregister_rc(cinergyt2);

cancel_delayed_work(&cinergyt2->query_work);
@@ -1002,7 +999,6 @@ static int cinergyt2_suspend (struct usb
cancel_delayed_work(&cinergyt2->query_work);
if (cinergyt2->streaming)
cinergyt2_stop_stream_xfer(cinergyt2);
- flush_scheduled_work();
cinergyt2_sleep(cinergyt2, 1);
}

2007-06-29 13:57:55

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 2.6.22-rc6 spurious hangs

On 06/29, Ingo Molnar wrote:
> * Oleg Nesterov <[email protected]> wrote:
>
> > Yes, I think cinergyt2 is buggy.
>
> > cinergyt2_release() does flush_scheduled_work() under cinergyt2->sem.
> > flush_scheduled_work() hangs because cinergyt2_query_rc() waits for
> > the same cinergyt2->sem.
> >
> > ->disconnect_pending is used without any locks/barriers, perhaps this
> > is the reason.

I misread cinergyt2_release, it checks !->disconnect_pending, so it is
very clear why cinergyt2_query_rc() tries to take the mutex.

> > I'll try to look further tomorrow. In any case, cinergyT2 should not
> > use flush_scheduled_work() at all.
>
> would the hack below be worth trying, to see whether there are any
> further problems?
>
> Ingo
>
> Index: linux/drivers/media/dvb/cinergyT2/cinergyT2.c
> ===================================================================
> --- linux.orig/drivers/media/dvb/cinergyT2/cinergyT2.c
> +++ linux/drivers/media/dvb/cinergyT2/cinergyT2.c
> @@ -523,7 +523,6 @@ static int cinergyt2_release (struct ino
>
> if (!cinergyt2->disconnect_pending && (file->f_flags & O_ACCMODE) != O_RDONLY) {
> cancel_delayed_work(&cinergyt2->query_work);
> - flush_scheduled_work();
> cinergyt2_sleep(cinergyt2, 1);
> }

I don't think we can just kill flush_scheduled_work(). We can use
cancel_rearming_delayed_work() instead of
cancel_delayed_work()+flush_scheduled_work()

Still we can't do this under cinergyt2->sem, because cinergyt2_query()
takes it too. This all looks very wrong to me, I hope maintaners can
explain.

I think cinergyt2_query() and cinergyt2_query_rc() should not use
->disconnect_pending at all. cinergyt2_disconnect() should set
->disconnect_pending = 1 and cancel both delayed_works.

cinergyt2_release() checks !->disconnect_pending and does the cancel
without mutex.

Oleg.

2007-06-29 14:11:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.6.22-rc6 spurious hangs


* Oleg Nesterov <[email protected]> wrote:

> > > ->disconnect_pending is used without any locks/barriers, perhaps
> > > this is the reason.
>
> I misread cinergyt2_release, it checks !->disconnect_pending, so it is
> very clear why cinergyt2_query_rc() tries to take the mutex.
>
> > > I'll try to look further tomorrow. In any case, cinergyT2 should not
> > > use flush_scheduled_work() at all.
> >
> > would the hack below be worth trying, to see whether there are any
> > further problems?
[...]
> I don't think we can just kill flush_scheduled_work(). We can use
> cancel_rearming_delayed_work() instead of
> cancel_delayed_work()+flush_scheduled_work()
>
> Still we can't do this under cinergyt2->sem, because cinergyt2_query()
> takes it too. This all looks very wrong to me, I hope maintaners can
> explain.

i've Cc:-ed the maintainers.

Ingo

2007-06-29 16:59:57

by Dmitry Torokhov

[permalink] [raw]
Subject: Re: 2.6.22-rc6 spurious hangs

On 6/29/07, Ingo Molnar <[email protected]> wrote:
>
> * Oleg Nesterov <[email protected]> wrote:
>
> > > > ->disconnect_pending is used without any locks/barriers, perhaps
> > > > this is the reason.
> >
> > I misread cinergyt2_release, it checks !->disconnect_pending, so it is
> > very clear why cinergyt2_query_rc() tries to take the mutex.
> >
> > > > I'll try to look further tomorrow. In any case, cinergyT2 should not
> > > > use flush_scheduled_work() at all.
> > >
> > > would the hack below be worth trying, to see whether there are any
> > > further problems?
> [...]
> > I don't think we can just kill flush_scheduled_work(). We can use
> > cancel_rearming_delayed_work() instead of
> > cancel_delayed_work()+flush_scheduled_work()
> >
> > Still we can't do this under cinergyt2->sem, because cinergyt2_query()
> > takes it too. This all looks very wrong to me, I hope maintaners can
> > explain.
>
> i've Cc:-ed the maintainers.
>

Well, not really maintainer but I think the short term soluton (at
least for the RC part) is to alter cinergyt2_query_rc to take
cinergyt2->sem only around cinergyt2_command(). Ther rest of the
polling function need not be protected as it does nto tun concurently
with itself.

The longer trem solution is to convert cinergyt2 to use polled input
device framework (as in attached patch - untested). Unfortunately it
depends on adding suspend/resume support to polled devices now that
workqueues are not being freezed again.

Overall the driver makes me cringe...

- Why does it take cinergyt2->sem in its cinergyt2_poll() function
before calling poll_wait()? Most of the places that try to wake up
polling process (such as FE_SET_FRONTEND) attempt to take the same
mutex and will promptly deadlock unless I am missing something.

- cinergyt2_query() wakes up pollers before altering poll condition.

- cinergyt2->iunuse is racy...

--
Dmitry


Attachments:
(No filename) (1.84 kB)
cinergyt-to-polldev.patch (11.70 kB)
Download all attachments

2007-06-29 17:22:52

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 2.6.22-rc6 spurious hangs

On 06/29, Dmitry Torokhov wrote:
>
> Well, not really maintainer but I think the short term soluton (at
> least for the RC part) is to alter cinergyt2_query_rc to take
> cinergyt2->sem only around cinergyt2_command(). Ther rest of the
> polling function need not be protected as it does nto tun concurently
> with itself.

Can't comment, because I know nothing about this stuff.

But, unless I misread this patch, it doesn't solve the problem.
cinergyt2_release() calls flush_scheduled_work() under ->sem,
but ->query_work takes it too, no?

Quoting myself,
>
> I think cinergyt2_query() and cinergyt2_query_rc() should not use
> ->disconnect_pending at all. cinergyt2_disconnect() should set
> ->disconnect_pending = 1 and cancel both delayed_works.
>
> cinergyt2_release() checks !->disconnect_pending and does the cancel
> without mutex.

Possible?

Oleg.

2007-06-29 21:22:19

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: 2.6.22-rc6 spurious hangs

> Still we can't do this under cinergyt2->sem, because cinergyt2_query()
> takes it too. This all looks very wrong to me, I hope maintaners can
> explain.

AFAIK, the driver authors are not working anymore with CinergyT2. The
last patch we have on development tree from Holger is dated as Dec, 3
2004. Since them, only internal kernel API changes and a few sparse
fixes from other contributors were applied.

Also, none of the current DVB maintainers seem to have any hardware for
testing.


Cheers,
Mauro

2007-06-29 21:34:55

by Markus Rechberger

[permalink] [raw]
Subject: Re: 2.6.22-rc6 spurious hangs

On 6/29/07, Mauro Carvalho Chehab <[email protected]> wrote:
> > Still we can't do this under cinergyt2->sem, because cinergyt2_query()
> > takes it too. This all looks very wrong to me, I hope maintaners can
> > explain.
>
> AFAIK, the driver authors are not working anymore with CinergyT2. The
> last patch we have on development tree from Holger is dated as Dec, 3
> 2004. Since them, only internal kernel API changes and a few sparse
> fixes from other contributors were applied.
>
> Also, none of the current DVB maintainers seem to have any hardware for
> testing.
>

I received a Mail a while ago that this driver is open to the
community, it duplicates some code because the developers wanted to
use this driver for testing another DVB API which never took off.
Best would be to remove the duplicated code in that driver and make it
look like all other DVB drivers.

Markus

2007-06-30 14:18:42

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 2.6.22-rc6 spurious hangs

On 06/29, Markus Rechberger wrote:
>
> On 6/29/07, Mauro Carvalho Chehab <[email protected]> wrote:
> >> Still we can't do this under cinergyt2->sem, because cinergyt2_query()
> >> takes it too. This all looks very wrong to me, I hope maintaners can
> >> explain.
> >
> >AFAIK, the driver authors are not working anymore with CinergyT2. The
> >last patch we have on development tree from Holger is dated as Dec, 3
> >2004. Since them, only internal kernel API changes and a few sparse
> >fixes from other contributors were applied.
> >
> >Also, none of the current DVB maintainers seem to have any hardware for
> >testing.
> >
>
> I received a Mail a while ago that this driver is open to the
> community, it duplicates some code because the developers wanted to
> use this driver for testing another DVB API which never took off.
> Best would be to remove the duplicated code in that driver and make it
> look like all other DVB drivers.

OK.

Thomas, any chance you could try the patch below? It is very, very stupid,
it was done without any understanding of this code, and of course it is
completely untested. I doubt very much it is correct, and even if it is
correct it is definitely not good. It would be great if Dmitry can take a look.

This patch shifts cancel_rearming_delayed_work() out of ->sem. Another mutex,
->wq_sem, was added to protect against the concurrent open/resume.

Oleg.

--- t/drivers/media/dvb/cinergyT2/cinergyT2.c~cinergy 2007-06-19 17:09:15.000000000 +0400
+++ t/drivers/media/dvb/cinergyT2/cinergyT2.c 2007-06-30 18:01:43.000000000 +0400
@@ -118,6 +118,7 @@ struct cinergyt2 {
struct dvb_demux demux;
struct usb_device *udev;
struct mutex sem;
+ struct mutex wq_sem;
struct dvb_adapter adapter;
struct dvb_device *fedev;
struct dmxdev dmxdev;
@@ -482,14 +483,14 @@ static int cinergyt2_open (struct inode
struct cinergyt2 *cinergyt2 = dvbdev->priv;
int err = -ERESTARTSYS;

- if (cinergyt2->disconnect_pending || mutex_lock_interruptible(&cinergyt2->sem))
- return -ERESTARTSYS;
+ if (cinergyt2->disconnect_pending || mutex_lock_interruptible(&cinergyt2->wq_sem))
+ goto out;

- if ((err = dvb_generic_open(inode, file))) {
- mutex_unlock(&cinergyt2->sem);
- return err;
- }
+ if (mutex_lock_interruptible(&cinergyt2->sem))
+ goto out_unlock1;

+ if ((err = dvb_generic_open(inode, file)))
+ goto out_unlock2;

if ((file->f_flags & O_ACCMODE) != O_RDONLY) {
cinergyt2_sleep(cinergyt2, 0);
@@ -498,8 +499,12 @@ static int cinergyt2_open (struct inode

atomic_inc(&cinergyt2->inuse);

+out_unlock2:
mutex_unlock(&cinergyt2->sem);
- return 0;
+out_unlock1:
+ mutex_unlock(&cinergyt2->wq_sem);
+out:
+ return err;
}

static void cinergyt2_unregister(struct cinergyt2 *cinergyt2)
@@ -519,15 +524,17 @@ static int cinergyt2_release (struct ino
struct dvb_device *dvbdev = file->private_data;
struct cinergyt2 *cinergyt2 = dvbdev->priv;

- mutex_lock(&cinergyt2->sem);
+ mutex_lock(&cinergyt2->wq_sem);

if (!cinergyt2->disconnect_pending && (file->f_flags & O_ACCMODE) != O_RDONLY) {
- cancel_delayed_work(&cinergyt2->query_work);
- flush_scheduled_work();
+ cancel_rearming_delayed_work(&cinergyt2->query_work);
+
+ mutex_lock(&cinergyt2->sem);
cinergyt2_sleep(cinergyt2, 1);
+ mutex_unlock(&cinergyt2->sem);
}

- mutex_unlock(&cinergyt2->sem);
+ mutex_unlock(&cinergyt2->wq_sem);

if (atomic_dec_and_test(&cinergyt2->inuse) && cinergyt2->disconnect_pending) {
warn("delayed unregister in release");
@@ -838,13 +845,13 @@ static int cinergyt2_register_rc(struct

static void cinergyt2_unregister_rc(struct cinergyt2 *cinergyt2)
{
- cancel_delayed_work(&cinergyt2->rc_query_work);
+ cancel_rearming_delayed_work(&cinergyt2->rc_query_work);
input_unregister_device(cinergyt2->rc_input_dev);
}

static inline void cinergyt2_suspend_rc(struct cinergyt2 *cinergyt2)
{
- cancel_delayed_work(&cinergyt2->rc_query_work);
+ cancel_rearming_delayed_work(&cinergyt2->rc_query_work);
}

static inline void cinergyt2_resume_rc(struct cinergyt2 *cinergyt2)
@@ -907,6 +914,7 @@ static int cinergyt2_probe (struct usb_i
usb_set_intfdata (intf, (void *) cinergyt2);

mutex_init(&cinergyt2->sem);
+ mutex_init(&cinergyt2->wq_sem);
init_waitqueue_head (&cinergyt2->poll_wq);
INIT_DELAYED_WORK(&cinergyt2->query_work, cinergyt2_query);

@@ -974,11 +982,8 @@ static void cinergyt2_disconnect (struct
{
struct cinergyt2 *cinergyt2 = usb_get_intfdata (intf);

- flush_scheduled_work();
-
cinergyt2_unregister_rc(cinergyt2);
-
- cancel_delayed_work(&cinergyt2->query_work);
+ cancel_rearming_delayed_work(&cinergyt2->query_work);
wake_up_interruptible(&cinergyt2->poll_wq);

cinergyt2->demux.dmx.close(&cinergyt2->demux.dmx);
@@ -992,21 +997,22 @@ static int cinergyt2_suspend (struct usb
{
struct cinergyt2 *cinergyt2 = usb_get_intfdata (intf);

- if (cinergyt2->disconnect_pending || mutex_lock_interruptible(&cinergyt2->sem))
+ if (cinergyt2->disconnect_pending || mutex_lock_interruptible(&cinergyt2->wq_sem))
return -ERESTARTSYS;

if (1) {
- struct cinergyt2 *cinergyt2 = usb_get_intfdata (intf);

cinergyt2_suspend_rc(cinergyt2);
- cancel_delayed_work(&cinergyt2->query_work);
+ cancel_rearming_delayed_work(&cinergyt2->query_work);
+
+ mutex_lock(&cinergyt2->sem);
if (cinergyt2->streaming)
cinergyt2_stop_stream_xfer(cinergyt2);
- flush_scheduled_work();
cinergyt2_sleep(cinergyt2, 1);
+ mutex_unlock(&cinergyt2->sem);
}

- mutex_unlock(&cinergyt2->sem);
+ mutex_unlock(&cinergyt2->wq_sem);
return 0;
}

@@ -1014,9 +1020,15 @@ static int cinergyt2_resume (struct usb_
{
struct cinergyt2 *cinergyt2 = usb_get_intfdata (intf);
struct dvbt_set_parameters_msg *param = &cinergyt2->param;
+ int err = -ERESTARTSYS;

- if (cinergyt2->disconnect_pending || mutex_lock_interruptible(&cinergyt2->sem))
- return -ERESTARTSYS;
+ if (cinergyt2->disconnect_pending || mutex_lock_interruptible(&cinergyt2->wq_sem))
+ goto out;
+
+ if (mutex_lock_interruptible(&cinergyt2->sem))
+ goto out_unlock1;
+
+ err = 0;

if (!cinergyt2->sleeping) {
cinergyt2_sleep(cinergyt2, 0);
@@ -1029,7 +1041,10 @@ static int cinergyt2_resume (struct usb_
cinergyt2_resume_rc(cinergyt2);

mutex_unlock(&cinergyt2->sem);
- return 0;
+out_unlock1:
+ mutex_unlock(&cinergyt2->wq_sem);
+out:
+ return err;
}

static const struct usb_device_id cinergyt2_table [] __devinitdata = {