2014-02-27 00:52:50

by Ken Moffat

[permalink] [raw]
Subject: 3.13.5 : rm -rf running forever, one cpu at approx 100%

Hi,

Short summary : on 3.13.5, rm -rf of an application source
directory on an ext4 filesystem sometimes takes forever (probably
isn't going anywhere), with one CPU pegged at all-but 100% utilization.

I've nearly finished building a new system from source, to check
various desktop packages in linuxfromscratch. On this build, much of
it is things I don't normally use and I needed to upgrade my
buildscripts, so most of it was built in chroot using 3.10.32. But
late last night I booted the new system using 3.13.5 to finish the
build. This morning I discovered that rm -rf for the icedtea source
directory was still running, and had taken over 5 hours of CPU time
(one CPU seemd to be running at close to 100%, the others had dropped
to their slowest frequency). That script was running as root (yeah,
but it's a new system) and it looks as if /etc/passwd~ had got
trashed, because I could no longer su or login. Not sure if that is
related, at this stage it might just be a side-effect of my scripts.

Booted another system, chrooted, fixed up passwords. Started
again after commenting out icedtea - I hadn't intended to build
what was an old version, I'd just forgotten it was in this script -
that's why I do things in userspace, not the kernel :-(

Continued with remaining packages, but a couple of hours later I
saw a similar "one CPU at 100%, rm -rf GConf source taking forever"
problem. Dumped all the processes with Alt-SysRQ-T [ huge log ] but
at that point 'rm' was merely 'ready' so I doubt there is anything
useful to see in the log.

Built 3.13.4, booted to that. So far, everything looks good - but
I'm now building the _current_ version of icedtea, so if this isn't
a new 3.13.5 problem I guess I'm fairly likely to see it tomorrow.

Meanwhile, any suggestions about how I can debug this if I hit it
again, please ?

ĸen
--
das eine Mal als Tragödie, dieses Mal als Farce


2014-02-27 01:28:33

by gene heskett

[permalink] [raw]
Subject: Re: 3.13.5 : rm -rf running forever, one cpu at approx 100%

On Wednesday 26 February 2014, Ken Moffat wrote:
>Hi,
>
> Short summary : on 3.13.5, rm -rf of an application source
>directory on an ext4 filesystem sometimes takes forever (probably
>isn't going anywhere), with one CPU pegged at all-but 100% utilization.
>
> I've nearly finished building a new system from source, to check
>various desktop packages in linuxfromscratch. On this build, much of
>it is things I don't normally use and I needed to upgrade my
>buildscripts, so most of it was built in chroot using 3.10.32. But
>late last night I booted the new system using 3.13.5 to finish the
>build. This morning I discovered that rm -rf for the icedtea source
>directory was still running, and had taken over 5 hours of CPU time
>(one CPU seemd to be running at close to 100%, the others had dropped
>to their slowest frequency). That script was running as root (yeah,
>but it's a new system) and it looks as if /etc/passwd~ had got
>trashed, because I could no longer su or login. Not sure if that is
>related, at this stage it might just be a side-effect of my scripts.
>
> Booted another system, chrooted, fixed up passwords. Started
>again after commenting out icedtea - I hadn't intended to build
>what was an old version, I'd just forgotten it was in this script -
>that's why I do things in userspace, not the kernel :-(
>
> Continued with remaining packages, but a couple of hours later I
>saw a similar "one CPU at 100%, rm -rf GConf source taking forever"
>problem. Dumped all the processes with Alt-SysRQ-T [ huge log ] but
>at that point 'rm' was merely 'ready' so I doubt there is anything
>useful to see in the log.
>
> Built 3.13.4, booted to that. So far, everything looks good - but
>I'm now building the _current_ version of icedtea, so if this isn't
>a new 3.13.5 problem I guess I'm fairly likely to see it tomorrow.
>
> Meanwhile, any suggestions about how I can debug this if I hit it
>again, please ?
>
>ĸen

I don't have any help to offer Ken, but this walks and quacks much like a
duck I'm encountering in 3.13.5, with the backup program amanda, which uses
gnu tar. To facilitate intelligent guesses as to the size of the various
levels of backup, amanda does a dummy collection using tar, sent to
/dev/null, using only the size it reports on the first pass. Version 1.22,
quite old, works on 3.12.9, but not on 3.13.5. I have now pulled in, built
and installed tar-1.27, and rebuilt amanda to let it know that the tar its
using is not in /usr/bin, but in /usr/local/bin. Next run at 1:30AM

This freeze, using 100% of a core, but causing no visible disk activity has
killed my backups 3 nights running. At this point I've no clue as to the
cause, but I will be watching this thread closely, it has a similar
description.

Cheers, Gene
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>

NOTICE: Will pay 100 USD for an HP-4815A defective but
complete probe assembly.

2014-02-27 01:47:47

by Ken Moffat

[permalink] [raw]
Subject: Re: 3.13.5 : rm -rf running forever, one cpu at approx 100%

On Wed, Feb 26, 2014 at 08:28:29PM -0500, Gene Heskett wrote:
> On Wednesday 26 February 2014, Ken Moffat wrote:
>
> I don't have any help to offer Ken, but this walks and quacks much like a
> duck I'm encountering in 3.13.5, with the backup program amanda, which uses
> gnu tar. To facilitate intelligent guesses as to the size of the various
> levels of backup, amanda does a dummy collection using tar, sent to
> /dev/null, using only the size it reports on the first pass. Version 1.22,
> quite old, works on 3.12.9, but not on 3.13.5. I have now pulled in, built
> and installed tar-1.27, and rebuilt amanda to let it know that the tar its
> using is not in /usr/bin, but in /usr/local/bin. Next run at 1:30AM
>
> This freeze, using 100% of a core, but causing no visible disk activity has
> killed my backups 3 nights running. At this point I've no clue as to the
> cause, but I will be watching this thread closely, it has a similar
> description.
>
> Cheers, Gene

My box where I see this is a phenom (my intel is still on 3.13.4),
and from past threads I suspect yours is as well ?

ĸen
--
das eine Mal als Tragödie, dieses Mal als Farce

2014-02-27 01:57:21

by gene heskett

[permalink] [raw]
Subject: Re: 3.13.5 : rm -rf running forever, one cpu at approx 100%

On Wednesday 26 February 2014, Ken Moffat wrote:
>On Wed, Feb 26, 2014 at 08:28:29PM -0500, Gene Heskett wrote:
>> On Wednesday 26 February 2014, Ken Moffat wrote:
>>
>> I don't have any help to offer Ken, but this walks and quacks much like
>> a duck I'm encountering in 3.13.5, with the backup program amanda,
>> which uses gnu tar. To facilitate intelligent guesses as to the size
>> of the various levels of backup, amanda does a dummy collection using
>> tar, sent to /dev/null, using only the size it reports on the first
>> pass. Version 1.22, quite old, works on 3.12.9, but not on 3.13.5. I
>> have now pulled in, built and installed tar-1.27, and rebuilt amanda
>> to let it know that the tar its using is not in /usr/bin, but in
>> /usr/local/bin. Next run at 1:30AM
>>
>> This freeze, using 100% of a core, but causing no visible disk activity
>> has killed my backups 3 nights running. At this point I've no clue as
>> to the cause, but I will be watching this thread closely, it has a
>> similar description.
>>
>> Cheers, Gene
>
> My box where I see this is a phenom (my intel is still on 3.13.4),
>and from past threads I suspect yours is as well ?
>
>ĸen

Yes, it is Ken, an old slow 2.1 Ghz 9550 phenom, with 8Gb of ram. 32 bit
PAE build. I wonder if that is also connected... But my crystal ball is
broke, in addition to the wet ram starting to rust out. Can't be the
years, I've only 40 years practice at being 39. ;-)

Cheers, Gene
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>

NOTICE: Will pay 100 USD for an HP-4815A defective but
complete probe assembly.

2014-02-27 03:26:42

by Mike Galbraith

[permalink] [raw]
Subject: Re: 3.13.5 : rm -rf running forever, one cpu at approx 100%

On Thu, 2014-02-27 at 00:52 +0000, Ken Moffat wrote:
> Hi,
>
> Short summary : on 3.13.5, rm -rf of an application source
> directory on an ext4 filesystem sometimes takes forever (probably
> isn't going anywhere), with one CPU pegged at all-but 100% utilization.
>
> I've nearly finished building a new system from source, to check
> various desktop packages in linuxfromscratch. On this build, much of
> it is things I don't normally use and I needed to upgrade my
> buildscripts, so most of it was built in chroot using 3.10.32. But
> late last night I booted the new system using 3.13.5 to finish the
> build. This morning I discovered that rm -rf for the icedtea source
> directory was still running, and had taken over 5 hours of CPU time
> (one CPU seemd to be running at close to 100%, the others had dropped
> to their slowest frequency). That script was running as root (yeah,
> but it's a new system) and it looks as if /etc/passwd~ had got
> trashed, because I could no longer su or login. Not sure if that is
> related, at this stage it might just be a side-effect of my scripts.
>
> Booted another system, chrooted, fixed up passwords. Started
> again after commenting out icedtea - I hadn't intended to build
> what was an old version, I'd just forgotten it was in this script -
> that's why I do things in userspace, not the kernel :-(
>
> Continued with remaining packages, but a couple of hours later I
> saw a similar "one CPU at 100%, rm -rf GConf source taking forever"
> problem. Dumped all the processes with Alt-SysRQ-T [ huge log ] but
> at that point 'rm' was merely 'ready' so I doubt there is anything
> useful to see in the log.
>
> Built 3.13.4, booted to that. So far, everything looks good - but
> I'm now building the _current_ version of icedtea, so if this isn't
> a new 3.13.5 problem I guess I'm fairly likely to see it tomorrow.
>
> Meanwhile, any suggestions about how I can debug this if I hit it
> again, please ?

I would start with strace to see if a task is looping in userspace, then
move on to perf top -g -p <pid> (or perf record/report) to peek at what
it's up to in the kernel. Once you have the where, trace_printk() is
the best thing since sliced bread (which ranks just below printk()).

-Mike

2014-02-27 03:45:38

by Ken Moffat

[permalink] [raw]
Subject: Re: 3.13.5 : rm -rf running forever, one cpu at approx 100%

On Thu, Feb 27, 2014 at 04:26:35AM +0100, Mike Galbraith wrote:
>
> I would start with strace to see if a task is looping in userspace, then
> move on to perf top -g -p <pid> (or perf record/report) to peek at what
> it's up to in the kernel. Once you have the where, trace_printk() is
> the best thing since sliced bread (which ranks just below printk()).
>
> -Mike
Thanks. I'll need to build perf.

ĸen
--
das eine Mal als Tragödie, dieses Mal als Farce

2014-02-27 04:19:35

by gene heskett

[permalink] [raw]
Subject: Re: 3.13.5 : rm -rf running forever, one cpu at approx 100%

On Wednesday 26 February 2014, Ken Moffat wrote:
>On Thu, Feb 27, 2014 at 04:26:35AM +0100, Mike Galbraith wrote:
>> I would start with strace to see if a task is looping in userspace,
>> then move on to perf top -g -p <pid> (or perf record/report) to peek
>> at what it's up to in the kernel. Once you have the where,
>> trace_printk() is the best thing since sliced bread (which ranks just
>> below printk()).
>>
>> -Mike
>
> Thanks. I'll need to build perf.
>
>ĸen
I probably will too, but I don't have a huge amount of tracing turned on in
this kernel. We'll see what happens tonight & go from there.
FWIW, about all I see in htop is the command line that launched it.


Cheers, Gene
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>

NOTICE: Will pay 100 USD for an HP-4815A defective but
complete probe assembly.

2014-02-27 04:52:48

by Mike Galbraith

[permalink] [raw]
Subject: Re: 3.13.5 : rm -rf running forever, one cpu at approx 100%

On Thu, 2014-02-27 at 03:45 +0000, Ken Moffat wrote:
> On Thu, Feb 27, 2014 at 04:26:35AM +0100, Mike Galbraith wrote:
> >
> > I would start with strace to see if a task is looping in userspace, then
> > move on to perf top -g -p <pid> (or perf record/report) to peek at what
> > it's up to in the kernel. Once you have the where, trace_printk() is
> > the best thing since sliced bread (which ranks just below printk()).
> >
> > -Mike
> Thanks. I'll need to build perf.

You may want to build the kernel with frame-pointers too, for easy gdb
list *0x(hexnum) of *func()+0x(hexoffset) use. Crash is also pretty
handy both for rummaging live via crash vmlinux /proc/kcore, and for
leisurely postmortem analysis if you set the box up to crashdump in
advance, and force a dump (poke sysrq-c or echo c > /proc/sysrq-trigger)
when you see the bad thing happen. Crash has all kinds of goodies,
including invocation of gdb.

-Mike

2014-02-27 11:30:09

by gene heskett

[permalink] [raw]
Subject: Re: 3.13.5 : rm -rf running forever, one cpu at approx 100%

On Wednesday 26 February 2014, Mike Galbraith wrote:
>On Thu, 2014-02-27 at 03:45 +0000, Ken Moffat wrote:
>> On Thu, Feb 27, 2014 at 04:26:35AM +0100, Mike Galbraith wrote:
>> > I would start with strace to see if a task is looping in userspace,
>> > then move on to perf top -g -p <pid> (or perf record/report) to peek
>> > at what it's up to in the kernel. Once you have the where,
>> > trace_printk() is the best thing since sliced bread (which ranks
>> > just below printk()).
>> >
>> > -Mike
>>
>> Thanks. I'll need to build perf.
>
>You may want to build the kernel with frame-pointers too, for easy gdb
>list *0x(hexnum) of *func()+0x(hexoffset) use. Crash is also pretty
>handy both for rummaging live via crash vmlinux /proc/kcore, and for
>leisurely postmortem analysis if you set the box up to crashdump in
>advance, and force a dump (poke sysrq-c or echo c > /proc/sysrq-trigger)
>when you see the bad thing happen. Crash has all kinds of goodies,
>including invocation of gdb.
>
>-Mike

It doesn't look as if I'll have to guys. Amanda is about 1/2 done and
running normally by switching from tar-1.22 to tar-1.27. So I'm going to
toddle off in the general direction of a bed and let amanda send me an
email when its done.

And it did, including some "strange" reports I have yet to decipher. Diffs
in the tar's probably. But it did not hang.

Cheers, Gene
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>

NOTICE: Will pay 100 USD for an HP-4815A defective but
complete probe assembly.