2004-01-01 09:35:56

by Joonas Kortesalmi

[permalink] [raw]
Subject: swapper: page allocation failure. order:3, mode:0x20

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

After running 2.6.0 on a server for a few days, I met an interesting and
annoying problem. I was playing with NFS over gigabit ethernet (e1000) and
it was a bit slow. I tried to find out why by running top and I saw syslog-ng
eating almost 10% of the 1,3GHz Duron. Looked at the log and there was a huge
flood of these messages:

swapper: page allocation failure. order:3, mode:0x20
irssi: page allocation failure. order:3, mode:0x20
swapper: page allocation failure. order:3, mode:0x20
vim: page allocation failure. order:3, mode:0x20
swapper: page allocation failure. order:3, mode:0x20

most of the messages starting with swapper.

After a quick google search it looks like this kind of a problem existed in
a bit older 2.5/2.6 serie kernels. It looks like it's still with us.

Linux hamsu.org 2.6.0 #1 Mon Dec 29 21:07:28 EET 2003 i686 AMD Duron(tm) processor AuthenticAMD GNU/Linux

Shortly about the hardware of the machine in case it would matter:
AMD Duron 1,3GHz CPU, 768MB of RAM, Intel PRO/1000MT Desktop and Realtek 8139C
NICs.

Pre-empt was disabled in the kernel config and the kernel didn't have any
extra patches. A pure vanilla kernel.

I'm not a subscriber so please CC me a copy of messages related to the subject.
I'm not sure if I can help much looking at the insides of the kernel, but I will
try my best answering any questions about this.
- --
Joonas OH8GDV (ham call sign) OpenPGP: 0x5F72BE43
( 0B37 05E0 0FB4 EB2E 161C DCE6 7F7B C645 5F72 BE43 )
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQE/8+lzf3vGRV9yvkMRAgkIAJ4tjyWtfeDmXCaM0s6w9+ofYLKktACgt+I5
MhYxd+KeJS8QFOjPbtOmykk=
=75d0
-----END PGP SIGNATURE-----


2004-01-01 10:18:20

by Anton Blanchard

[permalink] [raw]
Subject: Re: swapper: page allocation failure. order:3, mode:0x20


Hi,

> After running 2.6.0 on a server for a few days, I met an interesting and
> annoying problem. I was playing with NFS over gigabit ethernet (e1000) and
> it was a bit slow. I tried to find out why by running top and I saw syslog-ng
> eating almost 10% of the 1,3GHz Duron. Looked at the log and there was a huge
> flood of these messages:
>
> swapper: page allocation failure. order:3, mode:0x20
> irssi: page allocation failure. order:3, mode:0x20
> swapper: page allocation failure. order:3, mode:0x20
> vim: page allocation failure. order:3, mode:0x20
> swapper: page allocation failure. order:3, mode:0x20

Its sounds like you are using either a large MTU (9k?) or TSO. TSO
causes the networking stack to think it has a massive MTU and the e1000
card busts it up into proper MTU sized packets. The problem is that
it places much more stress on the allocator by asking for these large
chunks of memory in interrupt context.

Now e1000 uses TSO (and can regularly ask for 32kB+ kmallocs in
interrupt context) perhaps we should look moving the rx buffer refill code
into a context that can sleep. Then again its not like we can tolerate
much latency in this code path, your rx ring will run out quite quickly :)

BTW We have found increasing /proc/sys/vm/min_free_kbytes can help the
situation a bit. Bumping the slab limits for the larger kmallocs (via
echo X Y Z > /proc/slab) might be useful too.

We should probably rate limit that printk. Andrew: I was thinking of
stealing net_ratelimit and calling it core_ratelimit or whatever. Then
wrap these non critical things with it. Overkill?

Anton

2004-01-01 10:25:25

by Andrew Morton

[permalink] [raw]
Subject: Re: swapper: page allocation failure. order:3, mode:0x20

Anton Blanchard <[email protected]> wrote:
>
> > swapper: page allocation failure. order:3, mode:0x20
>
> ...
> We should probably rate limit that printk. Andrew: I was thinking of
> stealing net_ratelimit and calling it core_ratelimit or whatever. Then
> wrap these non critical things with it. Overkill?

Actually my intent was just to remove it (and __GFP_NOWARN) - it's just
development-time debug. But it is handy on occasion.

So sure, ratelimit it, make it KERN_INFO and maybe add a dump_stack()?

(printk_ratelimit() may be a suitable name)

2004-01-01 10:28:47

by Anton Blanchard

[permalink] [raw]
Subject: Re: swapper: page allocation failure. order:3, mode:0x20


> Now e1000 uses TSO (and can regularly ask for 32kB+ kmallocs in
> interrupt context) perhaps we should look moving the rx buffer refill code
> into a context that can sleep. Then again its not like we can tolerate
> much latency in this code path, your rx ring will run out quite quickly :)

I hate to argue with myself, but thats crap. TSO only affects the TX
path and its buffers are created outside interrupt context. So it must be
a large MTU causing the failures, regardless it still makes sense to explore
rx skb refill outside of interrupt context idea.

Anton

2004-01-01 10:42:53

by Arjan van de Ven

[permalink] [raw]
Subject: Re: swapper: page allocation failure. order:3, mode:0x20

On Thu, 2004-01-01 at 11:27, Anton Blanchard wrote:
> regardless it still makes sense to explore
> rx skb refill outside of interrupt context idea.

could be done in 2 steps; eg have a "normal path" that starts refilling
from process context if, say, half the skbs are used but an emergency
fallback to do it from irq context if for whatever reason the process
context didn't get around to it....


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2004-01-01 13:07:25

by Anton Blanchard

[permalink] [raw]
Subject: Re: swapper: page allocation failure. order:3, mode:0x20


> So sure, ratelimit it, make it KERN_INFO and maybe add a dump_stack()?

Sounds good, I always end up adding a dump_stack there when debugging
these problems anyway.

> (printk_ratelimit() may be a suitable name)

How does this look?

Anton

===== Documentation/sysctl/kernel.txt 1.7 vs edited =====
--- 1.7/Documentation/sysctl/kernel.txt Wed Apr 9 18:01:38 2003
+++ edited/Documentation/sysctl/kernel.txt Thu Jan 1 23:52:10 2004
@@ -251,6 +251,25 @@

==============================================================

+printk_ratelimit:
+
+Some warning messages are rate limited. printk_ratelimit specifies
+the minimum length of time between these messages, by default we
+allow one every 5 seconds.
+
+A value of 0 will disable rate limiting.
+
+==============================================================
+
+printk_ratelimit_burst:
+
+While long term we enforce one message per printk_ratelimit
+seconds, we do allow a burst of messages to pass through.
+printk_ratelimit_burst specifies the number of messages we can
+send before ratelimiting kicks in.
+
+==============================================================
+
reboot-cmd: (Sparc only)

??? This seems to be a way to give an argument to the Sparc
===== include/linux/kernel.h 1.43 vs edited =====
--- 1.43/include/linux/kernel.h Tue Dec 30 08:37:26 2003
+++ edited/include/linux/kernel.h Thu Jan 1 23:52:10 2004
@@ -89,6 +89,8 @@

unsigned long int_sqrt(unsigned long);

+extern int printk_ratelimit(void);
+
static inline void console_silent(void)
{
console_loglevel = 0;
===== include/linux/sysctl.h 1.54 vs edited =====
--- 1.54/include/linux/sysctl.h Thu Dec 25 14:32:23 2003
+++ edited/include/linux/sysctl.h Thu Jan 1 23:53:11 2004
@@ -127,6 +127,8 @@
KERN_PANIC_ON_OOPS=57, /* int: whether we will panic on an oops */
KERN_HPPA_PWRSW=58, /* int: hppa soft-power enable */
KERN_HPPA_UNALIGNED=59, /* int: hppa unaligned-trap enable */
+ KERN_PRINTK_RATELIMIT=60, /* int: tune printk ratelimiting */
+ KERN_PRINTK_RATELIMIT_BURST=61, /* int: tune printk ratelimiting */
};


===== kernel/printk.c 1.29 vs edited =====
--- 1.29/kernel/printk.c Tue Dec 30 12:01:55 2003
+++ edited/kernel/printk.c Thu Jan 1 23:53:51 2004
@@ -762,3 +762,45 @@
tty->driver->write(tty, 0, msg, strlen(msg));
return;
}
+
+/* minimum time in jiffies between messages */
+int printk_ratelimit_jiffies = 5*HZ;
+
+/* number of messages we send before ratelimiting */
+int printk_ratelimit_burst = 10;
+
+/*
+ * printk rate limiting, lifted from the networking subsystem.
+ *
+ * This enforces a rate limit: not more than one kernel message
+ * every printk_ratelimit_jiffies to make a denial-of-service
+ * attack impossible.
+ */
+int printk_ratelimit(void)
+{
+ static spinlock_t ratelimit_lock = SPIN_LOCK_UNLOCKED;
+ static unsigned long toks = 10*5*HZ;
+ static unsigned long last_msg;
+ static int missed;
+ unsigned long flags;
+ unsigned long now = jiffies;
+
+ spin_lock_irqsave(&ratelimit_lock, flags);
+ toks += now - last_msg;
+ last_msg = now;
+ if (toks > (printk_ratelimit_burst * printk_ratelimit_jiffies))
+ toks = printk_ratelimit_burst * printk_ratelimit_jiffies;
+ if (toks >= printk_ratelimit_jiffies) {
+ int lost = missed;
+ missed = 0;
+ toks -= printk_ratelimit_jiffies;
+ spin_unlock_irqrestore(&ratelimit_lock, flags);
+ if (lost)
+ printk(KERN_WARNING "printk: %d messages suppressed.\n", lost);
+ return 1;
+ }
+ missed++;
+ spin_unlock_irqrestore(&ratelimit_lock, flags);
+ return 0;
+}
+EXPORT_SYMBOL(printk_ratelimit);
===== kernel/sysctl.c 1.55 vs edited =====
--- 1.55/kernel/sysctl.c Thu Oct 2 17:12:07 2003
+++ edited/kernel/sysctl.c Thu Jan 1 23:52:11 2004
@@ -60,6 +60,8 @@
extern int pid_max;
extern int sysctl_lower_zone_protection;
extern int min_free_kbytes;
+extern int printk_ratelimit_jiffies;
+extern int printk_ratelimit_burst;

/* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
static int maxolduid = 65535;
@@ -575,6 +577,22 @@
.ctl_name = KERN_PANIC_ON_OOPS,
.procname = "panic_on_oops",
.data = &panic_on_oops,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+ {
+ .ctl_name = KERN_PRINTK_RATELIMIT,
+ .procname = "printk_ratelimit",
+ .data = &printk_ratelimit_jiffies,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_jiffies,
+ },
+ {
+ .ctl_name = KERN_PRINTK_RATELIMIT_BURST,
+ .procname = "printk_ratelimit_burst",
+ .data = &printk_ratelimit_burst,
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = &proc_dointvec,
===== mm/page_alloc.c 1.176 vs edited =====
--- 1.176/mm/page_alloc.c Tue Dec 30 08:38:05 2003
+++ edited/mm/page_alloc.c Thu Jan 1 23:52:11 2004
@@ -670,10 +670,11 @@
}

nopage:
- if (!(gfp_mask & __GFP_NOWARN)) {
- printk("%s: page allocation failure."
+ if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
+ printk(KERN_WARNING "%s: page allocation failure."
" order:%d, mode:0x%x\n",
p->comm, order, gfp_mask);
+ dump_stack();
}
return NULL;
got_pg:

2004-01-01 20:48:03

by Andrew Morton

[permalink] [raw]
Subject: Re: swapper: page allocation failure. order:3, mode:0x20

Anton Blanchard <[email protected]> wrote:
>
>
> > So sure, ratelimit it, make it KERN_INFO and maybe add a dump_stack()?
>
> Sounds good, I always end up adding a dump_stack there when debugging
> these problems anyway.
>
> > (printk_ratelimit() may be a suitable name)
>
> How does this look?

Good. I guess we need to make net_ratelimit() use this sometime.

> +/*
> + * printk rate limiting, lifted from the networking subsystem.
> + *
> + * This enforces a rate limit: not more than one kernel message
> + * every printk_ratelimit_jiffies to make a denial-of-service
> + * attack impossible.
> + */
> +int printk_ratelimit(void)
> +{
> + static spinlock_t ratelimit_lock = SPIN_LOCK_UNLOCKED;
> + static unsigned long toks = 10*5*HZ;
> + static unsigned long last_msg;
> + static int missed;
> + unsigned long flags;
> + unsigned long now = jiffies;
> +
> + spin_lock_irqsave(&ratelimit_lock, flags);
> + toks += now - last_msg;
> + last_msg = now;
> + if (toks > (printk_ratelimit_burst * printk_ratelimit_jiffies))
> + toks = printk_ratelimit_burst * printk_ratelimit_jiffies;
> + if (toks >= printk_ratelimit_jiffies) {
> + int lost = missed;
> + missed = 0;
> + toks -= printk_ratelimit_jiffies;
> + spin_unlock_irqrestore(&ratelimit_lock, flags);
> + if (lost)
> + printk(KERN_WARNING "printk: %d messages suppressed.\n", lost);
> + return 1;
> + }
> + missed++;
> + spin_unlock_irqrestore(&ratelimit_lock, flags);
> + return 0;
> +}

This seems a bit odd. It means that the further apart the message bursts
are, the longer they are allowed to be. Or something.

Wouldn't it be better to say "after each greater-than-five second window,
allow up to ten printk's as long as they happen in the next five
milliseconds"?

2004-01-01 22:14:52

by Christophe Saout

[permalink] [raw]
Subject: Re: swapper: page allocation failure. order:3, mode:0x20

Am Do, den 01.01.2004 schrieb Anton Blanchard um 14:01:

> How does this look?
> [...]
> + static unsigned long toks = 10*5*HZ;
> + static unsigned long last_msg;
> + static int missed;

This would mean that all users of printk_ratelimit share this. If
printk_ratelimit is bombed by one user other perhaps important messages
are also suppressed.


2004-01-01 22:54:33

by Anton Blanchard

[permalink] [raw]
Subject: Re: swapper: page allocation failure. order:3, mode:0x20


> > How does this look?
> > [...]
> > + static unsigned long toks = 10*5*HZ;
> > + static unsigned long last_msg;
> > + static int missed;
>
> This would mean that all users of printk_ratelimit share this. If
> printk_ratelimit is bombed by one user other perhaps important messages
> are also suppressed.

printk_ratelimit is only to be used for things which we can afford to
lose (eg our VM debugging messages). Don't use it on anything important :)

Anton

2004-01-01 22:54:30

by Anton Blanchard

[permalink] [raw]
Subject: Re: swapper: page allocation failure. order:3, mode:0x20


> This seems a bit odd. It means that the further apart the message bursts
> are, the longer they are allowed to be. Or something.
>
> Wouldn't it be better to say "after each greater-than-five second window,
> allow up to ten printk's as long as they happen in the next five
> milliseconds"?

Its 100% copied from the networking code. I would have thought we really
want that behaviour, if Im on a serial console I can only tolerate a
decent sized backtrace every few seconds. If I allow 10 bursts every 5
seconds then im screwed.

Instead we allow the burst of 10 once but then ratelimit it to one
per 5 seconds until we have had a long enough period of silence. The
main thing is that over the long term we have an average of one message
per 5 seconds.

Anton

2004-01-01 23:13:09

by Christophe Saout

[permalink] [raw]
Subject: Re: swapper: page allocation failure. order:3, mode:0x20

Am Do, den 01.01.2004 schrieb Anton Blanchard um 23:45:

> > This would mean that all users of printk_ratelimit share this. If
> > printk_ratelimit is bombed by one user other perhaps important messages
> > are also suppressed.
>
> printk_ratelimit is only to be used for things which we can afford to
> lose (eg our VM debugging messages). Don't use it on anything important :)

Other implementations would probably be overkill, so I'll agree with
you. :)

If printk_ratelimit is called too often there is probably something
wrong going anyway that is worth fixing.