Hi,
I just encountered a nasty symptom for the second time that has started to
occur after updating my home server from vanilla 2.6.27.7 to .8 (same
config).
A while after disconnecting a samba client, the smbd samba server
process goes crazy and consumes 100% CPU. From that time on it is
unkillable (kill -9 returns but the process continues to run). The only
recourse is reboot, which works without problem (i.e. unmounting the
served filesystems is apparently possible?). I tried to attach to the
process with gdb but that just hung.
The system is a generic old single-core P4 box with a single SATA drive,
Gentoo userland and Samba is 3.0.33 (in async mode). The kernel has no
patches or binary drivers. It has been rock solid before the update and
shows no other signs of weirdness in logs or otherwise. I downgraded to .7
for now and will see what happens, but since it worked before I am certain
that this is a regression in the .8 release.
The only commonality is a log entry by samba that seems to correlate with
both occurrences:
[2008/12/08 01:02:52, 0] lib/util_sock.c:read_data(534)
read_data: read failure for 4 bytes to client 192.168.100.128. Error = No route to host
.128 is the Windows client machine (connected via a stable GigE link),
which I shut down pretty much exactly 30 minutes before that (any 30
minute timeouts in the kernel/network stack?). Both instances of these log
entries correlate with the CPU spikes which I noticed in my MRTG graphs.
Any suspects or ideas?
thanks
Holger
On Monday, 8 of December 2008, Holger Hoffstaette wrote:
>
> Hi,
>
> I just encountered a nasty symptom for the second time that has started to
> occur after updating my home server from vanilla 2.6.27.7 to .8 (same
> config).
>
> A while after disconnecting a samba client, the smbd samba server
> process goes crazy and consumes 100% CPU. From that time on it is
> unkillable (kill -9 returns but the process continues to run). The only
> recourse is reboot, which works without problem (i.e. unmounting the
> served filesystems is apparently possible?). I tried to attach to the
> process with gdb but that just hung.
>
> The system is a generic old single-core P4 box with a single SATA drive,
> Gentoo userland and Samba is 3.0.33 (in async mode). The kernel has no
> patches or binary drivers. It has been rock solid before the update and
> shows no other signs of weirdness in logs or otherwise. I downgraded to .7
> for now and will see what happens, but since it worked before I am certain
> that this is a regression in the .8 release.
>
> The only commonality is a log entry by samba that seems to correlate with
> both occurrences:
>
> [2008/12/08 01:02:52, 0] lib/util_sock.c:read_data(534)
> read_data: read failure for 4 bytes to client 192.168.100.128. Error = No route to host
>
> .128 is the Windows client machine (connected via a stable GigE link),
> which I shut down pretty much exactly 30 minutes before that (any 30
> minute timeouts in the kernel/network stack?). Both instances of these log
> entries correlate with the CPU spikes which I noticed in my MRTG graphs.
>
> Any suspects or ideas?
Please bisect.
Thanks,
Rafael
On Mon, 08 Dec 2008 08:34:22 +0100, Rafael J. Wysocki wrote:
> On Monday, 8 of December 2008, Holger Hoffstaette wrote:
>>
>> A while after disconnecting a samba client, the smbd samba server
>> process goes crazy and consumes 100% CPU. From that time on it is
>> unkillable (kill -9 returns but the process continues to run). The only
>> recourse is reboot, which works without problem (i.e. unmounting the
>> served filesystems is apparently possible?). I tried to attach to the
>> process with gdb but that just hung.
>> [..]
>
> Please bisect.
I would love to try, but this is my "production server" (i.e. I need it
for real work) and I'll be traveling the next few days. I will try to
bisect after that (if nobody else has any ideas) but will have to make
sure the bug is actually reproducible after the timeout - for now I only
observed it by accident (via mrtg).
In the meantime maybe someone else will observe it as well.
thanks
Holger
Rafael J. Wysocki wrote:
> Please bisect.
Why should he bisect before the developers who added networking related
patches to .8 attempted to reproduce the bug, let alone looked at the
report?
Actually he "bisected" it already to the diff of .7->.8.
(Maybe it is not a networking bug, but that's where it makes most sense
to start to look.)
--
Stefan Richter
-=====-==--- ==-- -=---
http://arcgraph.de/sr/
Holger Hoffstaette wrote at LKML:
> On Mon, 08 Dec 2008 08:34:22 +0100, Rafael J. Wysocki wrote:
>
>> On Monday, 8 of December 2008, Holger Hoffstaette wrote:
>>> Hi,
>>>
>>> I just encountered a nasty symptom for the second time that has started to
>>> occur after updating my home server from vanilla 2.6.27.7 to .8 (same
>>> config).
>>>
>>> A while after disconnecting a samba client, the smbd samba server
>>> process goes crazy and consumes 100% CPU. From that time on it is
>>> unkillable (kill -9 returns but the process continues to run). The only
>>> recourse is reboot, which works without problem (i.e. unmounting the
>>> served filesystems is apparently possible?). I tried to attach to the
>>> process with gdb but that just hung.
>>>
>>> The system is a generic old single-core P4 box with a single SATA drive,
>>> Gentoo userland and Samba is 3.0.33 (in async mode). The kernel has no
>>> patches or binary drivers. It has been rock solid before the update and
>>> shows no other signs of weirdness in logs or otherwise. I downgraded to .7
>>> for now and will see what happens, but since it worked before I am certain
>>> that this is a regression in the .8 release.
>>>
>>> The only commonality is a log entry by samba that seems to correlate with
>>> both occurrences:
>>>
>>> [2008/12/08 01:02:52, 0] lib/util_sock.c:read_data(534)
>>> read_data: read failure for 4 bytes to client 192.168.100.128. Error = No route to host
>>>
>>> .128 is the Windows client machine (connected via a stable GigE link),
>>> which I shut down pretty much exactly 30 minutes before that (any 30
>>> minute timeouts in the kernel/network stack?). Both instances of these log
>>> entries correlate with the CPU spikes which I noticed in my MRTG graphs.
>>>
>>> Any suspects or ideas?
>>>
>>> thanks
>>> Holger
>>
>> Please bisect.
>
> I would love to try, but this is my "production server" (i.e. I need it
> for real work) and I'll be traveling the next few days. I will try to
> bisect after that (if nobody else has any ideas) but will have to make
> sure the bug is actually reproducible after the timeout - for now I only
> observed it by accident (via mrtg).
> In the meantime maybe someone else will observe it as well.
>
> thanks
> Holger
>
Added Cc: netdev, readded all other Cc's, quoted in full for netdev.
Good luck,
--
Stefan Richter
On Monday, 8 of December 2008, Stefan Richter wrote:
> Rafael J. Wysocki wrote:
> > Please bisect.
>
> Why should he bisect before the developers who added networking related
> patches to .8 attempted to reproduce the bug, let alone looked at the
> report?
Because I think that's the fastest way to turn the attention of the appropriate
people to the problem.
Thanks,
Rafael
Rafael J. Wysocki wrote:
> On Monday, 8 of December 2008, Stefan Richter wrote:
>> Rafael J. Wysocki wrote:
>>> Please bisect.
>> Why should he bisect before the developers who added networking related
>> patches to .8 attempted to reproduce the bug, let alone looked at the
>> report?
>
> Because I think that's the fastest way to turn the attention of the appropriate
> people to the problem.
All the contributors to 2.6.27.8 are known by name & address, and they
hopefully remember what they put into .8 and can quickly tell whether
there is a chance that it could have something to do with Samba going
into an unkillable busy loop.
So, in case of a bug report which even includes a potential way to
reproduce the issue on generic hardware with common tools, "Could you
bisect? Meanwhile, let's Cc netdev." sounds better to me than "Please
bisect."
--
Stefan Richter
>>> On Monday, 8 of December 2008, Holger Hoffstaette wrote:
>>>> The system is a generic old single-core P4 box with a single SATA drive,
>>>> Gentoo userland and Samba is 3.0.33 (in async mode). The kernel has no
>>>> patches or binary drivers.
Holger, it may be unrelated to the issue, but to be sure: Which network
card driver do you use?
--
Stefan Richter
On Mon, 08 Dec 2008 20:19:37 +0100, Stefan Richter wrote:
>>>> On Monday, 8 of December 2008, Holger Hoffstaette wrote:
>>>>> The system is a generic old single-core P4 box with a single SATA
>>>>> drive, Gentoo userland and Samba is 3.0.33 (in async mode). The
>>>>> kernel has no patches or binary drivers.
>
> Holger, it may be unrelated to the issue, but to be sure: Which network
> card driver do you use?
e1000 with the older PCI/PCI-X 82545GM rev.04 card in a PCI slot.
thanks,
Holger
On Mon, 08 Dec 2008, Stefan Richter wrote:
> >>> On Monday, 8 of December 2008, Holger Hoffstaette wrote:
> >>>> The system is a generic old single-core P4 box with a single SATA drive,
> >>>> Gentoo userland and Samba is 3.0.33 (in async mode). The kernel has no
> >>>> patches or binary drivers.
>
> Holger, it may be unrelated to the issue, but to be sure: Which network
> card driver do you use?
I think you can safely rule out NIC, I'm also seeing this behaviour on a
brand new server with imap hanging in some busy-loop.
Network card in my case:
Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
What I observed was one CPU doing 100% system work, and the number of
timer interrupts went from 1k per second to 4k (for the whole system).
I didn't report it because I thought it's one of patches I have to blame.
Oh, and, unfortunately, I can't bisect, I'm seeing this only on one machine
that has to be running.
Jan
--
Jan Rekorajski | ALL SUSPECTS ARE GUILTY. PERIOD!
baggins<at>mimuw.edu.pl | OTHERWISE THEY WOULDN'T BE SUSPECTS, WOULD THEY?
BOFH, MANIAC | -- TROOPS by Kevin Rubio
On Mon, 8 Dec 2008 23:22:46 +0100
Jan Rekorajski <[email protected]> wrote:
> On Mon, 08 Dec 2008, Stefan Richter wrote:
>
> > >>> On Monday, 8 of December 2008, Holger Hoffstaette wrote:
> > >>>> The system is a generic old single-core P4 box with a single SATA drive,
> > >>>> Gentoo userland and Samba is 3.0.33 (in async mode). The kernel has no
> > >>>> patches or binary drivers.
> >
> > Holger, it may be unrelated to the issue, but to be sure: Which network
> > card driver do you use?
>
> I think you can safely rule out NIC, I'm also seeing this behaviour on a
> brand new server with imap hanging in some busy-loop.
> Network card in my case:
> Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
>
> What I observed was one CPU doing 100% system work, and the number of
> timer interrupts went from 1k per second to 4k (for the whole system).
>
Try reverting the idr patch that went into 2.6.27.8. It broke DRM in the
Fedora kernel at least.
http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git;a=blob_plain;f=releases/2.6.27.8/lib-idr.c-fix-rcu-related-race-with-idr_find.patch;h=b1145766fb9460a0c0285350b49216355c5b4ad8
On Tue, 09 Dec 2008 20:16:34 +0100
Manfred Spraul <[email protected]> wrote:
> Chuck Ebbert wrote:
> > Try reverting the idr patch that went into 2.6.27.8. It broke DRM in the
> > Fedora kernel at least.
> >
> >
> What happens?
> Does it oops, does one of the BUG() statements trigger?
>
It fails in strange ways, e.g. trying to open a DRM device causes it to
disappear. (And DRM is a heavy user of idr.)
Chuck Ebbert wrote:
> Try reverting the idr patch that went into 2.6.27.8. It broke DRM in the
> Fedora kernel at least.
>
>
What happens?
Does it oops, does one of the BUG() statements trigger?
--
Manfred
From ae060e0b7bc071bd73dd5319b93c3344d9e10212 Mon Sep 17 00:00:00 2001
From: Manfred Spraul <[email protected]>
To: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Bcc: [email protected]
Date: Wed, 10 Dec 2008 18:17:06 +0100
Subject: [PATCH] lib/idr.c: Fix bug introduced by RCU fix
The last patch to lib/idr.c caused a bug if idr_get_new_above() was
called on an empty idr:
Usually, nodes stay on the same layer. New layers are added to the top
of the tree.
The exception is idr_get_new_above() on an empty tree: In this case,
the new root node is first added on layer 0, then moved upwards.
p->layer was not updated.
As usual: You shall never rely on the source code comments, they
will only mislead you.
Signed-off-by: Manfred Spraul <[email protected]>
---
 lib/idr.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/lib/idr.c b/lib/idr.c
index 7a785a0..1c4f928 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -220,8 +220,14 @@ build_up:
 	 */
 	while ((layers < (MAX_LEVEL - 1)) && (id >= (1 << (layers*IDR_BITS)))) {
 		layers++;
-		if (!p->count)
+		if (!p->count) {
+			/* special case: if the tree is currently empty,
+			 * then we grow the tree by moving the top node
+			 * upwards.
+			 */
+			p->layer++;
 			continue;
+		}
 		if (!(new = get_from_free_list(idp))) {
 			/*
 			 * The allocation failed. If we built part of
--
1.5.6.5
Dear all -
Thanks for your efforts.
Manfred Spraul wrote:
> Chuck Ebbert wrote:
>> On Mon, 8 Dec 2008 23:22:46 +0100
>> Jan Rekorajski <[email protected]> wrote:
>>
>>
>>> On Mon, 08 Dec 2008, Stefan Richter wrote:
>>>
>>>
>>>>>>> On Monday, 8 of December 2008, Holger Hoffstaette wrote:
>>>>>>>
>>>>>>>> The system is a generic old single-core P4 box with a single
>>>>>>>> SATA drive,
>>>>>>>> Gentoo userland and Samba is 3.0.33 (in async mode). The kernel
>>>>>>>> has no
>>>>>>>> patches or binary drivers.
>>>>>>>>
>>>> Holger, it may be unrelated to the issue, but to be sure: Which
>>>> network
>>>> card driver do you use?
>>>>
>>> I think you can safely rule out NIC, I'm also seeing this behaviour on a
>>> brand new server with imap hanging in some busy-loop.
>>> Network card in my case:
>>> Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
>>>
>>> What I observed was one CPU doing 100% system work, and the number of
>>> timer interrupts went from 1k per second to 4k (for the whole system).
>>>
> Could you try the attached patch?
> It should fix the bug.
I just built 2.6.27.9-rc1 and disconnected the Windowz box several times.
For now smbd does not seem to go into a death spin any more, even though
as far as I can tell .9-rc1 does not contain Manfred's latest patch. Not
sure what that means, if anything.
I'll keep running stable.9-rc1 and see what happens..
thanks all
Holger
On Thu, Dec 11, 2008 at 11:54:12PM +0100, Holger Hoffstätte wrote:
>
> Dear all -
>
> Thanks for your efforts.
>
> Manfred Spraul wrote:
> > Chuck Ebbert wrote:
> >> On Mon, 8 Dec 2008 23:22:46 +0100
> >> Jan Rekorajski <[email protected]> wrote:
> >>
> >>
> >>> On Mon, 08 Dec 2008, Stefan Richter wrote:
> >>>
> >>>
> >>>>>>> On Monday, 8 of December 2008, Holger Hoffstaette wrote:
> >>>>>>>
> >>>>>>>> The system is a generic old single-core P4 box with a single
> >>>>>>>> SATA drive,
> >>>>>>>> Gentoo userland and Samba is 3.0.33 (in async mode). The kernel
> >>>>>>>> has no
> >>>>>>>> patches or binary drivers.
> >>>>>>>>
> >>>> Holger, it may be unrelated to the issue, but to be sure: Which
> >>>> network
> >>>> card driver do you use?
> >>>>
> >>> I think you can safely rule out NIC, I'm also seeing this behaviour on a
> >>> brand new server with imap hanging in some busy-loop.
> >>> Network card in my case:
> >>> Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
> >>>
> >>> What I observed was one CPU doing 100% system work, and the number of
> >>> timer interrupts went from 1k per second to 4k (for the whole system).
> >>>
> > Could you try the attached patch?
> > It should fix the bug.
>
> I just built 2.6.27.9-rc1 and disconnected the Windowz box several times.
> For now smbd does not seem to go into a death spin any more, even though
> as far as I can tell .9-rc1 does not contain Manfred's latest patch. Not
> sure what that means, if anything.
.9-rc1 does contain a cifs patch, so perhaps that resolved the issue for
you.
thanks,
greg k-h
On Wed, 10 Dec 2008, Manfred Spraul wrote:
>> On Mon, 8 Dec 2008 23:22:46 +0100
>> Jan Rekorajski <[email protected]> wrote:
>>
>>> I think you can safely rule out NIC, I'm also seeing this behaviour on a
>>> brand new server with imap hanging in some busy-loop.
>>> Network card in my case:
>>> Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
>>>
>>> What I observed was one CPU doing 100% system work, and the number of
>>> timer interrupts went from 1k per second to 4k (for the whole system).
>>>
>>>
> Could you try the attached patch?
> It should fix the bug.
Thank you, I'm currently running 2.6.27.8 with your patch, I'll report
after 12-24 hours.
Jan
Greg KH wrote:
> On Thu, Dec 11, 2008 at 11:54:12PM +0100, Holger Hoffstätte wrote:
> [samba's smbd going into spin of death after client disconnect]
>
> .9-rc1 does contain a cifs patch, so perhaps that resolved the issue for
> you.
I spoke too soon: no, it didn't help as it happened again last night. I
don't see how CIFS could have helped, as userlevel smbd has AFAIK nothing
to do with the CIFS kernel module?
Apparently only pulling the client cable does NOT provoke the bug (tried
several times), whereas putting the client box into sleep mode does
(though that worked at least once as well). This time it also didn't
reboot properly without hard power-off. :/
Now running with Manfred's patch to idr.c; will report back if it happens
again.
Holger
Manfred Spraul wrote:
> Could you try the attached patch?
> It should fix the bug.
After applying Manfred's patch to .9-rc1, it *seems* that the problem is
gone. I have put the Windows client to sleep several times and after the
20-something minutes timeout smbd reports the error ("No route to host")
but does not go into a spinloop any more.
I'll continue to test this, but as far as I'm concerned this should go
into stable.9 as well (not sure if it's already in rc2).
thanks!
Holger
On Fri, 12 Dec 2008, Jan Rekorajski wrote:
> On Wed, 10 Dec 2008, Manfred Spraul wrote:
>
> >> On Mon, 8 Dec 2008 23:22:46 +0100
> >> Jan Rekorajski <[email protected]> wrote:
> >>
> >>> I think you can safely rule out NIC, I'm also seeing this behaviour on a
> >>> brand new server with imap hanging in some busy-loop.
> >>> Network card in my case:
> >>> Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
> >>>
> >>> What I observed was one CPU doing 100% system work, and the number of
> >>> timer interrupts went from 1k per second to 4k (for the whole system).
> >>>
> >>>
> > Could you try the attached patch?
> > It should fix the bug.
>
> Thank you, I'm currently running 2.6.27.8 with your patch, I'll report
> after 12-24 hours.
top - 19:20:59 up 17:22, 34 users, load average: 0.34, 0.41, 0.33
So, it seems that your patch cured my problem, as that server couldn't
survive more than 8 hours previously (2-3 was norm).
Jan