I'm running into a problem in 2.6.38 where the kernel is not doing what
I'm expecting it to do. I'm guessing that some things have changed and
that is what is going on.
First, the tuning at boot:
f.open("/proc/sys/vm/panic_on_oom", std::ios::out);
f << "1";
f.close();
f.open("/proc/sys/kernel/panic", std::ios::out);
f << "10";
f.close();
I want the kernel to panic on out of memory. I then want it to wait 10s
before doing a reboot.
This program will consume all memory and make the box unresponsive:
#!/usr/bin/perl
my @mem = ();
while (1) {
    push @mem, "########################";
}
It does not take long to fill up 1G of space. There is NO swap on this
device and never will be. I did notice that after a long period of time
(I've not timed it) I finally do see a panic and I do see "rebooting in
10 seconds...". It does not reboot.
I'm guessing that there are some tweaks or new behavior I just need to
be aware of.
Thanks,
Chris
On Tue, 14 Jun 2011, Chris Fowler wrote:
> I'm running into a problem in 2.6.38 where the kernel is not doing what
> I'm expecting it to do. I'm guessing that some things have changed and
> that is what is going on.
>
> First, the tuning at boot:
>
> f.open("/proc/sys/vm/panic_on_oom", std::ios::out);
> f << "1";
> f.close();
>
> f.open("/proc/sys/kernel/panic", std::ios::out);
> f << "10";
> f.close();
>
Hmm, you don't check that the writes to the sysctls actually succeed?
Using /proc/sys/vm/panic_on_oom also won't panic the machine if you happen
to use a cpuset or mempolicy. You'll want to write '2' instead if you
want to panic in all possible oom conditions.
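A minimal sketch of what checking those writes could look like, assuming C++ as in the snippet quoted above. The helper names write_sysctl, read_sysctl and tune_panic_on_oom are hypothetical, and the value "2" simply follows the suggestion above:

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Write `value` to the file at `path` and report whether the write
// went through. write_sysctl is a hypothetical helper name.
bool write_sysctl(const std::string& path, const std::string& value) {
    std::ofstream f(path);
    if (!f.is_open()) {
        std::cerr << "cannot open " << path << '\n';
        return false;
    }
    f << value << '\n';
    f.flush();  // push the data out now so a failed write is visible here
    if (!f.good()) {
        std::cerr << "write to " << path << " failed\n";
        return false;
    }
    return true;
}

// Read the first whitespace-delimited token back to confirm the setting.
std::string read_sysctl(const std::string& path) {
    std::ifstream f(path);
    std::string token;
    f >> token;
    return token;
}

// Apply both settings discussed in this thread; true only if both
// writes succeeded and the values read back as expected.
bool tune_panic_on_oom() {
    const bool oom_ok = write_sysctl("/proc/sys/vm/panic_on_oom", "2") &&
                        read_sysctl("/proc/sys/vm/panic_on_oom") == "2";
    const bool panic_ok = write_sysctl("/proc/sys/kernel/panic", "10") &&
                          read_sysctl("/proc/sys/kernel/panic") == "10";
    return oom_ok && panic_ok;
}
```

Calling tune_panic_on_oom() at boot and logging a false return would at least rule out a silently failed write (for example, when not running as root).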
> I want the kernel to panic on out of memory. I then want it to wait 10s
> before doing a reboot.
>
> This program will consume all memory and make the box unresponsive
>
> #!/usr/bin/perl
>
> my @mem = ();
> while(1) {
> push @mem, "########################";
> }
>
> It does not take long to fill up 1G of space. There is NO swap on this
> device and never will be. I did notice that after a long period of time
> (I've not timed it) I finally do see a panic and I do see "rebooting in
> 10 seconds..." . It does not reboot.
>
Ok, it seems like the oom killer is being called correctly and respecting
your panic_on_oom setting because it is a system-wide oom condition (your
perl script wasn't bound to any cpuset or mempolicy).
So that leaves the panic() not rebooting properly when the timeout is set.
You would only see the "Rebooting in 10 seconds..." message if the write
to /proc/sys/kernel/panic succeeded, and there's this little comment in
kernel/panic.c:
    /*
     * This will not be a clean reboot, with everything
     * shutting down. But if there is a chance of
     * rebooting the system it will be rebooted.
     */
with a call to emergency_restart(). You didn't specify your architecture,
but assuming you're using x86 without a hypervisor and didn't specify a
reboot= parameter on the command line, it should succeed although there are
some hardware dependencies. Does using
reboot=force
on the command line help?
Either way, could you send your /proc/cpuinfo and .config?
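To confirm whether a reboot= parameter is already in effect on the running kernel, something like the following could be used. This is only a sketch: find_reboot_param and read_cmdline are hypothetical helper names, and the parser just returns the first reboot= value it finds on the line.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Return the value of the first "reboot=" parameter found in a kernel
// command line, or an empty string if there is none.
// find_reboot_param is a hypothetical helper name.
std::string find_reboot_param(const std::string& cmdline) {
    std::istringstream in(cmdline);
    std::string word;
    const std::string key = "reboot=";
    while (in >> word) {
        // Match only words that start with "reboot=".
        if (word.compare(0, key.size(), key) == 0)
            return word.substr(key.size());
    }
    return "";
}

// Read the running kernel's command line from /proc/cmdline.
std::string read_cmdline() {
    std::ifstream f("/proc/cmdline");
    std::string line;
    std::getline(f, line);
    return line;
}
```

For example, find_reboot_param(read_cmdline()) would return "force" after booting with reboot=force, and an empty string if no reboot= parameter was given.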
On Tue, 14 Jun 2011 09:31:16 -0400
Chris Fowler wrote:
> I'm running into a problem in 2.6.38 where the kernel is not doing what
> I'm expecting it to do. I'm guessing that some things have changed and
> that is what is going on.
>
> First, the tuning at boot:
>
> f.open("/proc/sys/vm/panic_on_oom", std::ios::out);
> f << "1";
> f.close();
>
> f.open("/proc/sys/kernel/panic", std::ios::out);
> f << "10";
> f.close();
>
> I want the kernel to panic on out of memory. I then want it to wait 10s
> before doing a reboot.
>
> This program will consume all memory and make the box unresponsive
>
> #!/usr/bin/perl
>
> my @mem = ();
> while(1) {
> push @mem, "########################";
> }
>
Hmm, so the OOM killer wasn't invoked?
> It does not take long to fill up 1G of space. There is NO swap on this
> device and never will be. I did notice that after a long period of time
> (I've not timed it) I finally do see a panic and I do see "rebooting in
> 10 seconds..." . It does not reboot.
>
In the months since 2.6.38 there has been some discussion on the
linux-mm list about the "oom-killer doesn't work well enough, or LRU
scan is very slow" problem (and some improvements have been made).
So if you can post your test case with a precise description of the
machine setup, we'd be glad.
>
> I'm guessing that there are some tweaks or new behavior I just need to
> be aware of.
>
What kernel version did you use in your previous setup?
Thanks,
-Kame
Hi
(2011/06/14 22:31), Chris Fowler wrote:
> I'm running into a problem in 2.6.38 where the kernel is not doing what
> I'm expecting it to do. I'm guessing that some things have changed and
> that is what is going on.
My guess is that you need a kernel upgrade. You may need the following commit.
Author: KOSAKI Motohiro
Date: Thu Apr 14 15:22:12 2011 -0700
vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
And
% git name-rev --tags 929bea7c714220fc76ce3f75bef9056477c28e74
929bea7c714220fc76ce3f75bef9056477c28e74 tags/v2.6.39-rc4~36
Previous version is 2.6.33.
This box is a small device and I've configured it so that it will reboot
on OOM or panic. A device that is locked or out of memory is useless
and if it reboots it will pick up where it left off. This has worked
well for me.
-------- [ cpuinfo]----------------------------------------------------
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 28
model name : Intel(R) Atom(TM) CPU N270 @ 1.60GHz
stepping : 2
cpu MHz : 1600.245
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc up
arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3
xtpr pdcm movbe lahf_lm dts
bogomips : 3200.49
clflush size : 64
cache_alignment : 64
address sizes : 32 bits physical, 32 bits virtual
power management:
-------- [ cpuinfo]----------------------------------------------------
config attached
I've not tried reboot=force. I will do that next.
On Tue, 2011-06-14 at 17:13 -0700, David Rientjes wrote:
> Using /proc/sys/vm/panic_on_oom also won't panic the machine if you
> happen to use a cpuset or mempolicy. You'll want to write '2' instead
> if you want to panic in all possible oom conditions.
>
>
2 did it. Thank you.
perl -e 'my @mem = (); while(1) { push @mem, "XXXXXXXXXXXXXXXX"; }'
I lost connection and it came back after about 30s. Reboot worked.
In the past I've had OOM conditions "lock" a device, so to keep from
having to call someone to reboot it I started using this method instead.
Out of memory conditions are rare and would only be caused by memory
leaks. I've found all the memory leaks that I could find, and the first OOM
condition was caused by the program doing exactly what I told it to
do. :)
On the device that is running 2.6.38 this is the first time I'm planning
on using some Perl code on the device. I am a bit concerned about
possible memory leaks taking down the device so I wanted to be sure this
works. This box does not have any swap space and never will. The 1G of
memory will be all that is available.
Thanks,
Chris
On Tue, 14 Jun 2011, Chris Fowler wrote:
> Previous version is 2.6.33.
>
> This box is a small device and I've configured it so that it on OOM or
> panic will reboot. A device that is locked or out of memory is useless
> and if it reboots it will pick up where it left off. This has worked
> well for me.
>
Not sure why you're responding to Kame when you're really replying to my
email. The problem is not the oom killer; Chris says that the kernel
reports that it will reboot in 10 seconds, because of his
/proc/sys/kernel/panic setting, yet that never happens. He is able to
induce a panic easily with the oom killer, but the problem being reported
here has nothing to do with the VM.
> -------- [ cpuinfo]----------------------------------------------------
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 28
> model name : Intel(R) Atom(TM) CPU N270 @ 1.60GHz
> stepping : 2
> cpu MHz : 1600.245
> cache size : 512 KB
> fdiv_bug : no
> hlt_bug : no
> f00f_bug : no
> coma_bug : no
> fpu : yes
> fpu_exception : yes
> cpuid level : 10
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc up
> arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3
> xtpr pdcm movbe lahf_lm dts
> bogomips : 3200.49
> clflush size : 64
> cache_alignment : 64
> address sizes : 32 bits physical, 32 bits virtual
> power management:
> -------- [ cpuinfo]----------------------------------------------------
>
> config attached
>
>
> I've not tried reboot=force. I will do that next
>
If that works, then we'll know that the forced machine restart is stalling
or looping. Using "reboot=force" supposedly will ensure that no such
condition may ever exist for x86. Do you know the last kernel version
that actually worked correctly?
On Tue, 14 Jun 2011, Chris Fowler wrote:
> > Using /proc/sys/vm/panic_on_oom also won't panic the machine if you
> > happen to use a cpuset or mempolicy. You'll want to write '2' instead
> > if you want to panic in all possible oom conditions.
> >
> >
>
> 2 did it. Thank you.
>
> perl -e 'my @mem = (); while(1) { push @mem, "XXXXXXXXXXXXXXXX"; }'
>
> I lost connection and it came back after about 30s. Reboot worked.
>
That wasn't meant as a fix for the problem but rather just a suggestion
based on how you're using the device.
It's just a coincidence that it worked that time, because
/proc/sys/vm/panic_on_oom == 2 is the exact same as
/proc/sys/vm/panic_on_oom == 1 with the .config you posted (since you
don't have CONFIG_NUMA, constrained_alloc() always returns CONSTRAINT_NONE
in the oom killer).
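To illustrate why the two settings are equivalent here, a simplified model of the decision (not the kernel's actual code) might look like this. The names OomConstraint, oom_should_panic and constrained_alloc_model are hypothetical; the behaviour of values 1 and 2 is taken from the description earlier in this thread.

```cpp
// Simplified model of the panic_on_oom decision, NOT kernel code.
// OomConstraint stands in for the oom killer's constraint types.
enum class OomConstraint { None, Cpuset, MemoryPolicy };

// panic_on_oom == 0: never panic on oom.
// panic_on_oom == 1: panic only for system-wide (unconstrained) ooms.
// panic_on_oom == 2: panic in all oom conditions.
bool oom_should_panic(int panic_on_oom, OomConstraint constraint) {
    if (panic_on_oom == 0)
        return false;
    if (panic_on_oom == 2)
        return true;
    return constraint == OomConstraint::None;
}

// Without CONFIG_NUMA the constraint is always "none", so the
// caller-supplied NUMA result is ignored in that case.
OomConstraint constrained_alloc_model(bool config_numa,
                                      OomConstraint numa_result) {
    return config_numa ? numa_result : OomConstraint::None;
}
```

In this model, with config_numa set to false, oom_should_panic() gives the same answer for 1 and 2 regardless of the constraint passed in, which is the point being made above.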
More supporting evidence is that in the initial report you said you
had seen the panic and "Rebooting in 10 seconds..." message, yet no
reboot. That indicates the oom killer is working fine in both conditions.
So it's definitely the reboot code that is causing the issue: it either
hangs or takes excessively long, and that only happens sporadically on
your machine. The only differences in this code between v2.6.33 and
v2.6.38 are in how reboots are handled for Dell Precision T7400, VersaLogic
Menlow based, and Apple iMac9,1.
On Tue, 2011-06-14 at 19:32 -0700, David Rientjes wrote:
> More supporting evidence is that in the initial report that you said
> you had seen the panic and "Rebooting in 10 seconds..."
There was a long period between when the box became unresponsive and the
panic happened. I did not time it but I left it alone in the afternoon
and later that night I saw the panic. I know the panic was not
immediate.
On Tue, 2011-06-14 at 19:32 -0700, David Rientjes wrote:
> On Tue, 14 Jun 2011, Chris Fowler wrote:
>
> > > Using /proc/sys/vm/panic_on_oom also won't panic the machine if you
> > > happen to use a cpuset or mempolicy. You'll want to write '2' instead
> > > if you want to panic in all possible oom conditions.
> > >
> > >
> >
> > 2 did it. Thank you.
> >
> > perl -e 'my @mem = (); while(1) { push @mem, "XXXXXXXXXXXXXXXX"; }'
> >
> > I lost connection and it came back after about 30s. Reboot worked.
> >
>
> That wasn't meant as a fix for the problem but rather just a suggestion
> based on how you're using the device.
>
> It's just a coincidence that it worked that time, because
> /proc/sys/vm/panic_on_oom == 2 is the exact same as
> /proc/sys/vm/panic_on_oom == 1 with the .config you posted (since you
> don't have CONFIG_NUMA, constrained_alloc() always returns CONSTRAINT_NONE
> in the oom killer).
Yeah, you're right. I just did it again and it does not reboot. It goes
into a state where it no longer responds. The keyboard is not
responsive, screen blanking no longer works, etc.
Right now I'm just waiting for it to panic.
On Tue, 14 Jun 2011, Chris Fowler wrote:
> > More supporting evidence is that in the initial report that you said
> > you had seen the panic and "Rebooting in 10 seconds..."
>
> There was a long period between when the box became unresponsive and the
> panic happened.
Yes, that may indicate a VM issue that changed between v2.6.33 and v2.6.38
that caused the oom killer not to get triggered right away.
> I did not time it but I left it alone in the afternoon
> and later that night I saw the panic. I know the panic was not
> immediate.
>
I was attempting to diagnose an issue where the machine failed to actually
reboot after the "Rebooting in 10 seconds..." message. So the issue is
that the machine becomes unresponsive for an indefinite period of time
between invoking your perl script and seeing the panic, or that the
machine fails to reboot after "Rebooting in 10 seconds..."?
On Tue, 2011-06-14 at 19:51 -0700, David Rientjes wrote:
> So the issue is that the machine becomes unresponsive for an
> indefinite period of time between invoking your perl script and seeing
> the panic, or that the machine fails to reboot after "Rebooting in 10
> seconds..."?
Both. It still has not panicked. I'm going to let it sit all night
and I'll see in the morning whether it has finally panicked.
On Tue, 2011-06-14 at 19:51 -0700, David Rientjes wrote:
> I was attempting to diagnose an issue where the machine failed to
> actually reboot after the "Rebooting in 10 seconds..." message. So
> the issue is that the machine becomes unresponsive for an indefinite
> period of time between invoking your perl script and seeing the panic,
> or that the machine fails to reboot after "Rebooting in 10
> seconds..."?
It has been almost 12 hours and no panic yet. Externally the box seems
unresponsive. No response to network activity. No keyboard response,
etc. On the screen I'm still seeing printk's from the hda-intel driver
complaining about a spurious interrupt. I'm not too concerned about that
but it does tell me the kernel seems to be alive.
Chris
On Wed, 15 Jun 2011, Chris Fowler wrote:
> It has been almost 12 hours and no panic yet. Externally the box seems
> unresponsive. No response to network activity. No keyboard response,
> etc. On the screen I'm still seeing printk's from the hda-intel driver
> complaining about a spurious interrupt. I'm not too concerned about that
> but it does tell me the kernel seems to be alive.
>
You've tried 2.6.38.8?
On Wed, 2011-06-15 at 13:07 -0700, David Rientjes wrote:
> On Wed, 15 Jun 2011, Chris Fowler wrote:
>
> > It has been almost 12 hours and no panic yet. Externally the box seems
> > unresponsive. No response to network activity. No keyboard response,
> > etc. On the screen I'm still seeing printk's from the hda-intel driver
> > complaining about a spurious interrupt. I'm not too concerned about that
> > but it does tell me the kernel seems to be alive.
> >
>
> You've tried 2.6.38.8?
I have not, but I can work on that tonight.
Chris