2007-01-14 22:57:05

by Florin Iucha

Subject: Re: [linux-usb-devel] 2.6.20-rc4: known unfixed regressions (v2)

On Wed, Jan 10, 2007 at 10:54:34AM -0500, Alan Stern wrote:
> It's still possible that this is hardware related; perhaps some component
> just began to wear out. If you return to an earlier kernel, does the
> problem go away?

As reported in my original e-mail and verified just minutes ago, the
copy succeeds with 2.6.19 (kernel.org vanilla, compiled with the same
config as 2.6.20-rcX). I will begin bisecting between .19 and .20-rc1
after re-reading Jiri's messages.

florin

--
Bruce Schneier expects the Spanish Inquisition.
http://geekz.co.uk/schneierfacts/fact/163



2007-01-14 23:58:19

by Florin Iucha

Subject: heavy nfs[4]] causes fs badness Was: 2.6.20-rc4: known unfixed regressions (v2)

On Sun, Jan 14, 2007 at 04:57:01PM -0600, Florin Iucha wrote:
> On Wed, Jan 10, 2007 at 10:54:34AM -0500, Alan Stern wrote:
> > It's still possible that this is hardware related; perhaps some component
> > just began to wear out. If you return to an earlier kernel, does the
> > problem go away?
>
> As reported in my original e-mail and verified just minutes ago, the
> copy succeeds with 2.6.19 (kernel.org vanilla, compiled with the same
> config as 2.6.20-rcX). I will begin bisecting between .19 and .20-rc1
> after re-reading Jiri's messages.

All the testing was done via a ssh into the workstation. The console
was left as booted into, with the gdm running. The remote nfs4
directory was mounted on "/mnt".

After copying the 60+ GB and testing that the keyboard was still
functioning, I did not reboot but stayed in the same kernel and pulled
the latest git then started bisecting. After recompiling, I moved
over to the workstation to reboot it, but the keyboard was not
functioning ;(

I ran "lsusb" and it displayed all the devices. "dmesg" did not show
any oops, or anything else for that matter. I unplugged the keyboard and
ran "lsusb" again, but it hung. I ran "ls /mnt" and it hung as well.
Stracing "lsusb" showed it hung (entered the kernel) when opening the device
that used to be the keyboard. Stracing "ls /mnt" showed that it
hung at "stat(/mnt)". Both processes were in "D" state. "ls /root"
worked without problem, so it appears that crossing mountpoints causes
some hang in the kernel.
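
The "D" (uninterruptible sleep) state mentioned above is reported in the
state field of /proc/<pid>/stat. A minimal sketch, assuming a Linux /proc
filesystem, of listing every task currently in that state; the function
names are illustrative and do not come from any tool used in this thread:

```python
import os

def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of on whitespace.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_str), comm, state

def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for every process whose state is "D"."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were scanning
        if state == "D":
            stuck.append((pid, comm))
    return stuck
```

Calling uninterruptible_tasks() over ssh while the hang is in progress
would be expected to list the stuck "lsusb" and "ls" processes, without
touching the hung mountpoint itself.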

Based on this info, I think we can rule out any USB. I will try
testing with NFS3 to see if the problem persists. Unfortunately there
is no oops or anything in "dmesg".

florin

--
Bruce Schneier expects the Spanish Inquisition.
http://geekz.co.uk/schneierfacts/fact/163



2007-01-15 00:14:55

by Trond Myklebust

Subject: Re: heavy nfs[4]] causes fs badness Was: 2.6.20-rc4: known unfixed regressions (v2)

On Sun, 2007-01-14 at 17:58 -0600, Florin Iucha wrote:
> All the testing was done via a ssh into the workstation. The console
> was left as booted into, with the gdm running. The remote nfs4
> directory was mounted on "/mnt".
>
> After copying the 60+ GB and testing that the keyboard was still
> functioning, I did not reboot but stayed in the same kernel and pulled
> the latest git then started bisecting. After recompiling, I moved
> over to the workstation to reboot it, but the keyboard was not
> functioning ;(
>
> I ran "lsusb" and it displayed all the devices. "dmesg" did not show
> any oops, or anything else for that matter. I unplugged the keyboard and
> ran "lsusb" again, but it hung. I ran "ls /mnt" and it hung as well.
> Stracing "lsusb" showed it hung (entered the kernel) when opening the device
> that used to be the keyboard. Stracing "ls /mnt" showed that it
> hung at "stat(/mnt)". Both processes were in "D" state. "ls /root"
> worked without problem, so it appears that crossing mountpoints causes
> some hang in the kernel.
>
> Based on this info, I think we can rule out any USB. I will try
> testing with NFS3 to see if the problem persists. Unfortunately there
> is no oops or anything in "dmesg".

Did you try an 'echo t > /proc/sysrq-trigger' in order to find out where
the stat process is hanging?

Trond

2007-01-15 00:15:41

by Jiri Kosina

Subject: Re: heavy nfs[4]] causes fs badness Was: 2.6.20-rc4: known unfixed regressions (v2)

On Sun, 14 Jan 2007, Florin Iucha wrote:

> All the testing was done via a ssh into the workstation. The console
> was left as booted into, with the gdm running. The remote nfs4
> directory was mounted on "/mnt". After copying the 60+ GB and testing
> that the keyboard was still functioning, I did not reboot but stayed in
> the same kernel and pulled the latest git then started bisecting.

Hi Florin,

thanks a lot for the testing. Just to verify - what kernel is 'the same
kernel' mentioned above? (just to isolate whether the problem is really
somewhere between 2.6.19 and 2.6.20-rc2, as you stated in previous posts,
or the situation has changed).

> After recompiling, I moved over to the workstation to reboot it, but the
> keyboard was not functioning ;(

So this time the hang occurred when the system was idle, not during the
transfers, right?

> I ran "lsusb" and it displayed all the devices. "dmesg" did not show
> any oops, or anything else for that matter. I unplugged the keyboard and
> ran "lsusb" again, but it hung. I ran "ls /mnt" and it hung as well.
> Stracing "lsusb" showed it hung (entered the kernel) when opening the device
> that used to be the keyboard. Stracing "ls /mnt" showed that it
> hung at "stat(/mnt)". Both processes were in "D" state. "ls /root"
> worked without problem, so it appears that crossing mountpoints causes
> some hang in the kernel.

Could you please do alt-sysrq-t (or "echo t > /proc/sysrq-trigger" via
ssh, when your keyboard is dead) to see the calltraces of the processes
which are stuck inside kernel?

You will probably get a lot of output after the sysrq, so please either
put it somewhere on the web if possible, or just extract the interesting
processes out of it (mainly the ones which are stuck).
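
A minimal sketch of issuing the same SysRq command from a script over ssh,
assuming root privileges; the helper name trigger_sysrq is hypothetical.
Writing a command character to /proc/sysrq-trigger has the same effect as
the keyboard combination, and the resulting task dump goes to the kernel
log, where "dmesg" can read it:

```python
def trigger_sysrq(key="t", path="/proc/sysrq-trigger"):
    """Write a single SysRq command character to the trigger file.

    "t" asks the kernel to dump the state and call trace of every task.
    The path parameter exists only so the helper can be exercised against
    an ordinary file; on a real system it needs root and the default path.
    """
    if len(key) != 1:
        raise ValueError("SysRq commands are a single character")
    with open(path, "w") as f:
        f.write(key)
```

On current kernels the "w" command dumps only tasks in uninterruptible
(blocked) state, which gives a much shorter log than "t" when the goal is
to find the stuck processes; whether it is available on the kernels
discussed here is not confirmed by this thread.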

Thanks,

--
Jiri Kosina

2007-01-15 02:02:23

by Florin Iucha

Subject: Re: heavy nfs[4]] causes fs badness Was: 2.6.20-rc4: known unfixed regressions (v2)

Jiri and Trond,

On Mon, Jan 15, 2007 at 01:14:09AM +0100, Jiri Kosina wrote:
> On Sun, 14 Jan 2007, Florin Iucha wrote:
>
> > All the testing was done via a ssh into the workstation. The console
> > was left as booted into, with the gdm running. The remote nfs4
> > directory was mounted on "/mnt". After copying the 60+ GB and testing
> > that the keyboard was still functioning, I did not reboot but stayed in
> > the same kernel and pulled the latest git then started bisecting.
>
> Hi Florin,
>
> thanks a lot for the testing. Just to verify - what kernel is 'the same
> kernel' mentioned above? (just to isolate whether the problem is really
> somewhere between 2.6.19 and 2.6.20-rc2, as you stated in previous posts,
> or the situation has changed).

This happened with 2.6.19. It worked last time, but I wanted to test
again, to make sure. This time, it bombed, but half an hour after the
transfer finished.

> > After recompiling, I moved over to the workstation to reboot it, but the
> > keyboard was not functioning ;(
>
> So this time the hang occurred when the system was idle, not during the
> transfers, right?

Yes it was idle. Immediately after the transfer finished, the keyboard was
still functioning. It "hang" minutes later, after the first bisected kernel
was compiled and installed.

> > I ran "lsusb" and it displayed all the devices. "dmesg" did not show
> > any oops, or anything else for that matter. I unplugged the keyboard and
> > ran "lsusb" again, but it hung. I ran "ls /mnt" and it hung as well.
> > Stracing "lsusb" showed it hung (entered the kernel) when opening the device
> > that used to be the keyboard. Stracing "ls /mnt" showed that it
> > hung at "stat(/mnt)". Both processes were in "D" state. "ls /root"
> > worked without problem, so it appears that crossing mountpoints causes
> > some hang in the kernel.
>
> Could you please do alt-sysrq-t (or "echo t > /proc/sysrq-trigger" via
> ssh, when your keyboard is dead) to see the calltraces of the processes
> which are stuck inside kernel?
>
> You will probably get a lot of output after the sysrq, so please either
> put it somewhere on the web if possible, or just extract the interesting
> processes out of it (mainly the ones which are stuck).

Will do.

florin

--
Bruce Schneier expects the Spanish Inquisition.
http://geekz.co.uk/schneierfacts/fact/163



2007-01-15 12:06:41

by Horst H. von Brand

Subject: Re: heavy nfs[4]] causes fs badness Was: 2.6.20-rc4: known unfixed regressions (v2)

Florin Iucha wrote:

[...]

> Based on this info, I think we can rule out any USB. I will try
> testing with NFS3 to see if the problem persists. Unfortunately there
> is no oops or anything in "dmesg".

Take a look at bz #7796, a NFS bug + fix. But my feeling is that this is
older.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 2654431
Universidad Tecnica Federico Santa Maria +56 32 2654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 2797513

2007-01-15 15:46:51

by Florin Iucha

Subject: Re: heavy nfs[4]] causes fs badness Was: 2.6.20-rc4: known unfixed regressions (v2)

On Sun, Jan 14, 2007 at 11:11:13PM -0300, Horst H. von Brand wrote:
> Florin Iucha wrote:
>
> [...]
>
> > Based on this info, I think we can rule out any USB. I will try
> > testing with NFS3 to see if the problem persists. Unfortunately there
> > is no oops or anything in "dmesg".
>
> Take a look at bz #7796, a NFS bug + fix. But my feeling is that this is
> older.

The reporter had an oops? Luxury! I get nothing ;)

I am testing again, this time on 2.6.20-rc5 compiled with extra debugging,
and I got a couple dozen of:

"eth0: too many iterations (6) in nv_nic_irq."

in the kernel log.

florin

--
Bruce Schneier expects the Spanish Inquisition.
http://geekz.co.uk/schneierfacts/fact/163



2007-01-15 15:58:32

by Alan Stern

Subject: Re: heavy nfs[4]] causes fs badness Was: 2.6.20-rc4: known unfixed regressions (v2)

On Sun, 14 Jan 2007, Florin Iucha wrote:

> Jiri and Trond,
>
> On Mon, Jan 15, 2007 at 01:14:09AM +0100, Jiri Kosina wrote:
> > On Sun, 14 Jan 2007, Florin Iucha wrote:
> >
> > > All the testing was done via a ssh into the workstation. The console
> > > was left as booted into, with the gdm running. The remote nfs4
> > > directory was mounted on "/mnt". After copying the 60+ GB and testing
> > > that the keyboard was still functioning, I did not reboot but stayed in
> > > the same kernel and pulled the latest git then started bisecting.
> >
> > Hi Florin,
> >
> > thanks a lot for the testing. Just to verify - what kernel is 'the same
> > kernel' mentioned above? (just to isolate whether the problem is really
> > somewhere between 2.6.19 and 2.6.20-rc2, as you stated in previous posts,
> > or the situation has changed).
>
> This happened with 2.6.19. It worked last time, but I wanted to test
> again, to make sure. This time, it bombed, but half an hour after the
> transfer finished.
>
> > > After recompiling, I moved over to the workstation to reboot it, but the
> > > keyboard was not functioning ;(
> >
> > So this time the hang occurred when the system was idle, not during the
> > transfers, right?
>
> Yes it was idle. Immediately after the transfer finished, the keyboard was
> still functioning. It "hang" minutes later, after the first bisected kernel
> was compiled and installed.
>
> > > I ran "lsusb" and it displayed all the devices. "dmesg" did not show
> > > any oops, or anything else for that matter. I unplugged the keyboard and
> > > ran "lsusb" again, but it hung. I ran "ls /mnt" and it hung as well.
> > > Stracing "lsusb" showed it hung (entered the kernel) when opening the device
> > > that used to be the keyboard. Stracing "ls /mnt" showed that it
> > > hung at "stat(/mnt)". Both processes were in "D" state. "ls /root"
> > > worked without problem, so it appears that crossing mountpoints causes
> > > some hang in the kernel.
> >
> > Could you please do alt-sysrq-t (or "echo t > /proc/sysrq-trigger" via
> > ssh, when your keyboard is dead) to see the calltraces of the processes
> > which are stuck inside kernel?
> >
> > You will probably get a lot of output after the sysrq, so please either
> > put it somewhere on the web if possible, or just extract the interesting
> > processes out of it (mainly the ones which are stuck).
>
> Will do.

It would be nice to learn exactly why the keyboard stopped working. Try
using the usbmon facility (instructions in Documentation/usb/usbmon.txt)
to see what happens when you type on the dead keyboard. Be sure to turn
on CONFIG_USB_DEBUG as well. And also check /proc/interrupts; each time
you hit a key the USB controller should get an interrupt.
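
The /proc/interrupts check can be sketched as a before/after comparison:
take one snapshot, press a few keys, take another, and see whether the USB
host controller's counter moved. This is a rough illustration only; it
assumes the usual /proc/interrupts layout (a header row of CPU names, then
one row per IRQ) and that the controller rows mention a driver name
containing "hci_hcd" (ehci_hcd, ohci_hcd, uhci_hcd). The function names are
made up for this example:

```python
def parse_interrupts(text):
    """Map each IRQ label to (total count across CPUs, description)."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = []
        for field in fields[:ncpus]:
            if not field.isdigit():
                break  # rows such as ERR carry fewer counters
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        table[label.strip()] = (sum(counts), description)
    return table

def usb_irq_deltas(before_text, after_text):
    """Return {irq label: interrupts received between the two snapshots}
    for rows whose description names a USB host controller driver."""
    before = parse_interrupts(before_text)
    after = parse_interrupts(after_text)
    deltas = {}
    for label, (count, description) in after.items():
        if "hci_hcd" in description:
            deltas[label] = count - before.get(label, (0, ""))[0]
    return deltas
```

Each snapshot would be the text read from /proc/interrupts. A delta of zero
on the controller the keyboard hangs off, after typing, would suggest the
interrupts never arrive; a non-zero delta points further up the stack.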

Alan Stern

2007-01-24 03:04:51

by Florin Iucha

Subject: Re: heavy nfs[4]] causes fs badness Was: 2.6.20-rc4: known unfixed regressions (v2)

On Mon, Jan 15, 2007 at 10:58:29AM -0500, Alan Stern wrote:
> On Sun, 14 Jan 2007, Florin Iucha wrote:
>
> > Jiri and Trond,
> >
> > On Mon, Jan 15, 2007 at 01:14:09AM +0100, Jiri Kosina wrote:
> > > On Sun, 14 Jan 2007, Florin Iucha wrote:
> > >
> > > > All the testing was done via a ssh into the workstation. The console
> > > > was left as booted into, with the gdm running. The remote nfs4
> > > > directory was mounted on "/mnt". After copying the 60+ GB and testing
> > > > that the keyboard was still functioning, I did not reboot but stayed in
> > > > the same kernel and pulled the latest git then started bisecting.
> > >
> > > Hi Florin,
> > >
> > > thanks a lot for the testing. Just to verify - what kernel is 'the same
> > > kernel' mentioned above? (just to isolate whether the problem is really
> > > somewhere between 2.6.19 and 2.6.20-rc2, as you stated in previous posts,
> > > or the situation has changed).
> >
> > This happened with 2.6.19. It worked last time, but I wanted to test
> > again, to make sure. This time, it bombed, but half an hour after the
> > transfer finished.
> >
> > > > After recompiling, I moved over to the workstation to reboot it, but the
> > > > keyboard was not functioning ;(
> > >
> > > So this time the hang occurred when the system was idle, not during the
> > > transfers, right?
> >
> > Yes it was idle. Immediately after the transfer finished, the keyboard was
> > still functioning. It "hang" minutes later, after the first bisected kernel
> > was compiled and installed.
> >
> > > > I ran "lsusb" and it displayed all the devices. "dmesg" did not show
> > > > any oops, or anything else for that matter. I unplugged the keyboard and
> > > > ran "lsusb" again, but it hung. I ran "ls /mnt" and it hung as well.
> > > > Stracing "lsusb" showed it hung (entered the kernel) when opening the device
> > > > that used to be the keyboard. Stracing "ls /mnt" showed that it
> > > > hung at "stat(/mnt)". Both processes were in "D" state. "ls /root"
> > > > worked without problem, so it appears that crossing mountpoints causes
> > > > some hang in the kernel.
> > >
> > > Could you please do alt-sysrq-t (or "echo t > /proc/sysrq-trigger" via
> > > ssh, when your keyboard is dead) to see the calltraces of the processes
> > > which are stuck inside kernel?
> > >
> > > You will probably get a lot of output after the sysrq, so please either
> > > put it somewhere on the web if possible, or just extract the interesting
> > > processes out of it (mainly the ones which are stuck).
> >
> > Will do.
>
> It would be nice to learn exactly why the keyboard stopped working. Try
> using the usbmon facility (instructions in Documentation/usb/usbmon.txt)
> to see what happens when you type on the dead keyboard. Be sure to turn
> on CONFIG_USB_DEBUG as well. And also check /proc/interrupts; each time
> you hit a key the USB controller should get an interrupt.

Attached is the output from usbmon. Unfortunately, this kernel did not
have CONFIG_USB_DEBUG set. This is kernel 2.6.20-rc5.

So, the bus sees some traffic when the keyboard is used, but gdm does
not receive any keystrokes.

florin

--
Bruce Schneier expects the Spanish Inquisition.
http://geekz.co.uk/schneierfacts/fact/163



2007-01-24 19:07:24

by Alan Stern

Subject: Re: heavy nfs[4]] causes fs badness Was: 2.6.20-rc4: known unfixed regressions (v2)

On Tue, 23 Jan 2007, Florin Iucha wrote:

> > It would be nice to learn exactly why the keyboard stopped working. Try
> > using the usbmon facility (instructions in Documentation/usb/usbmon.txt)
> > to see what happens when you type on the dead keyboard. Be sure to turn
> > on CONFIG_USB_DEBUG as well. And also check /proc/interrupts; each time
> > you hit a key the USB controller should get an interrupt.
>
> Attached is the output from usbmon. Unfortunately, this kernel did not
> have CONFIG_USB_DEBUG set. This is kernel 2.6.20-rc5.
>
> So, the bus sees some traffic when the keyboard is used, but gdm does
> not receive any keystrokes.

So it's possible that the USB drivers are working correctly but the
keystrokes are getting lost somewhere in the X server. Can you switch to
a VT (or kill the X server entirely) and see if the keyboard works then?

Alan Stern