2018-12-25 15:45:11

by Konstantin Khorenko

[permalink] [raw]
Subject: [RFC PATCH 0/1] mm: add a warning about high order allocations

Q: Why do we need to bother at all?
A: If a node is highly loaded and its memory is significantly fragmented
(unfortunately almost any node with serious load has highly fragmented memory)
then any high order memory allocation can trigger massive memory shrink and
result in quite a big allocation latency. And the node becomes less responsive
and users don't like it.
The ultimate solution here is to get rid of large allocations, but we need an
instrument to detect them.

Q: Why warning? Use tracepoints!
A: Well, this is a matter of magic defaults.
Yes, you can use tracepoints to catch large allocations, but you need to do this
on purpose and regularly and this is to be done by every developer which is
quite unreal.
On the other hand if you develop something and get a warning, you'll have to
think about the reason and either succeed with reworking the code to use
smaller allocation sizes (and thus decrease allocation latency!) or just use
kvmalloc() if you don't really need physically continuos chunk or come to the
conclusion you definitely need physically continuos memory and shut up the
warning.

Q: Why compile time config option?
A: In order not to decrease the performance even a bit in case someone does not
want to hunt for large allocations.
In an ideal life i'd prefer this check/warning is enabled by default and may be
even without a config option so it works on every node. Once we find and rework
or mark all large allocations that would be good by default. Until that though
it will be noisy.

Another option is to rework the patch via static keys (having the warning
disabled by default surely). That makes it possible to turn on the feature
without recompiling the kernel - during testing period for example.

If you prefer this way, i would be happy to rework the patch via static keys.

Konstantin Khorenko (1):
mm/page_alloc: add warning about high order allocations

kernel/sysctl.c | 15 +++++++++++++++
mm/Kconfig | 18 ++++++++++++++++++
mm/page_alloc.c | 25 +++++++++++++++++++++++++
3 files changed, 58 insertions(+)

--
2.15.1



2018-12-25 15:45:11

by Konstantin Khorenko

[permalink] [raw]
Subject: [PATCH 1/1] mm/page_alloc: add a warning about high order allocations

Adds sysctl "vm.warn_high_order". If set it will warn about
allocations with order >= vm.warn_high_order.
Prints only 32 warnings at most and skips all __GFP_NOWARN allocations.

The code is under config option, disabled by default.
If enabled, default vm.warn_high_order is 3 (PAGE_ALLOC_COSTLY_ORDER).

Sysctl max value is set to 100 which should be enough to exceed maximum
order (normally MAX_ORDER == 11).

Signed-off-by: Konstantin Khorenko <[email protected]>
---
kernel/sysctl.c | 15 +++++++++++++++
mm/Kconfig | 18 ++++++++++++++++++
mm/page_alloc.c | 25 +++++++++++++++++++++++++
3 files changed, 58 insertions(+)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 5fc724e4e454..28a8ebfa7a1d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -176,6 +176,10 @@ extern int unaligned_dump_stack;
extern int no_unaligned_warning;
#endif

+#ifdef CONFIG_WARN_HIGH_ORDER
+extern int warn_order;
+#endif
+
#ifdef CONFIG_PROC_SYSCTL

/**
@@ -1675,6 +1679,17 @@ static struct ctl_table vm_table[] = {
.extra1 = (void *)&mmap_rnd_compat_bits_min,
.extra2 = (void *)&mmap_rnd_compat_bits_max,
},
+#endif
+#ifdef CONFIG_WARN_HIGH_ORDER
+ {
+ .procname = "warn_high_order",
+ .data = &warn_order,
+ .maxlen = sizeof(warn_order),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one_hundred,
+ },
#endif
{ }
};
diff --git a/mm/Kconfig b/mm/Kconfig
index d85e39da47ae..4e2d613d52f1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -336,6 +336,24 @@ config MEMORY_FAILURE
even when some of its memory has uncorrected errors. This requires
special hardware support and typically ECC memory.

+config WARN_HIGH_ORDER
+ bool "Enable complains about high order memory allocations"
+ depends on !LOCKDEP
+ default n
+ help
+ Enables warnings on high order memory allocations. This allows to
+ determine users of large memory chunks and rework them to decrease
+ allocation latency. Note, some debug options make kernel structures
+ fat.
+
+config WARN_HIGH_ORDER_LEVEL
+ int "Define page order level considered as too high"
+ depends on WARN_HIGH_ORDER
+ default 3
+ help
+ Defines page order starting which the system to complain about.
+ Default is current PAGE_ALLOC_COSTLY_ORDER.
+
config HWPOISON_INJECT
tristate "HWPoison pages injector"
depends on MEMORY_FAILURE && DEBUG_KERNEL && PROC_FS
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e95b5b7c9c3d..258892adb861 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4341,6 +4341,30 @@ static inline void finalise_ac(gfp_t gfp_mask, struct alloc_context *ac)
ac->high_zoneidx, ac->nodemask);
}

+#ifdef CONFIG_WARN_HIGH_ORDER
+int warn_order = CONFIG_WARN_HIGH_ORDER_LEVEL;
+
+/*
+ * Complain if we allocate a high order page unless there is a __GFP_NOWARN
+ * flag provided.
+ *
+ * Shuts up after 32 complains.
+ */
+static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
+{
+ static atomic_t warn_count = ATOMIC_INIT(32);
+
+ if (order >= warn_order && !(gfp_mask & __GFP_NOWARN))
+ WARN(atomic_dec_if_positive(&warn_count) >= 0,
+ "order %d >= %d, gfp 0x%x\n",
+ order, warn_order, gfp_mask);
+}
+#else
+static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
+{
+}
+#endif
+
/*
* This is the 'heart' of the zoned buddy allocator.
*/
@@ -4361,6 +4385,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
return NULL;
}
+ warn_high_order(order, gfp_mask);

gfp_mask &= gfp_allowed_mask;
alloc_mask = gfp_mask;
--
2.15.1


2018-12-26 08:42:12

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 0/1] mm: add a warning about high order allocations

On Tue 25-12-18 18:39:26, Konstantin Khorenko wrote:
> Q: Why do we need to bother at all?
> A: If a node is highly loaded and its memory is significantly fragmented
> (unfortunately almost any node with serious load has highly fragmented memory)
> then any high order memory allocation can trigger massive memory shrink and
> result in quite a big allocation latency. And the node becomes less responsive
> and users don't like it.
> The ultimate solution here is to get rid of large allocations, but we need an
> instrument to detect them.

Can you point to an example of the problem you are referring here? At
least for costly orders we do bail out early and try to not cause
massive reclaim. So what is the order that you are concerned about?

> Q: Why warning? Use tracepoints!
> A: Well, this is a matter of magic defaults.
> Yes, you can use tracepoints to catch large allocations, but you need to do this
> on purpose and regularly and this is to be done by every developer which is
> quite unreal.
> On the other hand if you develop something and get a warning, you'll have to
> think about the reason and either succeed with reworking the code to use
> smaller allocation sizes (and thus decrease allocation latency!) or just use
> kvmalloc() if you don't really need physically continuos chunk or come to the
> conclusion you definitely need physically continuos memory and shut up the
> warning.

Well, not really. For one thing, there are systems to panic on warning
and you really do not want to blow up just because somebody is doing a
large order allocation.

> Q: Why compile time config option?
> A: In order not to decrease the performance even a bit in case someone does not
> want to hunt for large allocations.
> In an ideal life i'd prefer this check/warning is enabled by default and may be
> even without a config option so it works on every node. Once we find and rework
> or mark all large allocations that would be good by default. Until that though
> it will be noisy.

So who is going to enable this option?

> Another option is to rework the patch via static keys (having the warning
> disabled by default surely). That makes it possible to turn on the feature
> without recompiling the kernel - during testing period for example.
>
> If you prefer this way, i would be happy to rework the patch via static keys.

I would rather go and chase the underlying issue. So can we get an
actual data please?

--
Michal Hocko
SUSE Labs

2018-12-26 08:51:49

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm/page_alloc: add a warning about high order allocations

Appart from general comments as a reply to the cover (btw. this all
should be in the changelog because this is the _why_ part of the
justification which should be _always_ part of the changelog).

On Tue 25-12-18 18:39:27, Konstantin Khorenko wrote:
[...]
> +config WARN_HIGH_ORDER
> + bool "Enable complains about high order memory allocations"
> + depends on !LOCKDEP

Why?

> + default n
> + help
> + Enables warnings on high order memory allocations. This allows to
> + determine users of large memory chunks and rework them to decrease
> + allocation latency. Note, some debug options make kernel structures
> + fat.
> +
> +config WARN_HIGH_ORDER_LEVEL
> + int "Define page order level considered as too high"
> + depends on WARN_HIGH_ORDER
> + default 3
> + help
> + Defines page order starting which the system to complain about.
> + Default is current PAGE_ALLOC_COSTLY_ORDER.
> +
> config HWPOISON_INJECT
> tristate "HWPoison pages injector"
> depends on MEMORY_FAILURE && DEBUG_KERNEL && PROC_FS
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e95b5b7c9c3d..258892adb861 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4341,6 +4341,30 @@ static inline void finalise_ac(gfp_t gfp_mask, struct alloc_context *ac)
> ac->high_zoneidx, ac->nodemask);
> }
>
> +#ifdef CONFIG_WARN_HIGH_ORDER
> +int warn_order = CONFIG_WARN_HIGH_ORDER_LEVEL;
> +
> +/*
> + * Complain if we allocate a high order page unless there is a __GFP_NOWARN
> + * flag provided.
> + *
> + * Shuts up after 32 complains.
> + */
> +static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
> +{
> + static atomic_t warn_count = ATOMIC_INIT(32);
> +
> + if (order >= warn_order && !(gfp_mask & __GFP_NOWARN))
> + WARN(atomic_dec_if_positive(&warn_count) >= 0,
> + "order %d >= %d, gfp 0x%x\n",
> + order, warn_order, gfp_mask);
> +}

We do have ratelimit functionality, so why cannot you use it?

> +#else
> +static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
> +{
> +}
> +#endif
> +
> /*
> * This is the 'heart' of the zoned buddy allocator.
> */
> @@ -4361,6 +4385,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
> return NULL;
> }
> + warn_high_order(order, gfp_mask);
>
> gfp_mask &= gfp_allowed_mask;
> alloc_mask = gfp_mask;

Why do you warn about all allocations in the hot path? I thought you
want to catch expensive allocations so I would assume that you would
stick that into a slow path after we are not able to allocate anything
after the first round of compaction.

Also do you want to warn about opportunistic GFP_NOWAIT allocations that
have a reasonable fallback?
--
Michal Hocko
SUSE Labs

2018-12-26 11:56:20

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm/page_alloc: add a warning about high order allocations

On Wed 26-12-18 09:40:51, Michal Hocko wrote:
> Appart from general comments as a reply to the cover (btw. this all
> should be in the changelog because this is the _why_ part of the
> justification which should be _always_ part of the changelog).
>
> On Tue 25-12-18 18:39:27, Konstantin Khorenko wrote:
> [...]
> > +config WARN_HIGH_ORDER
> > + bool "Enable complains about high order memory allocations"
> > + depends on !LOCKDEP
>
> Why?
>
> > + default n
> > + help
> > + Enables warnings on high order memory allocations. This allows to
> > + determine users of large memory chunks and rework them to decrease
> > + allocation latency. Note, some debug options make kernel structures
> > + fat.
> > +
> > +config WARN_HIGH_ORDER_LEVEL
> > + int "Define page order level considered as too high"
> > + depends on WARN_HIGH_ORDER
> > + default 3
> > + help
> > + Defines page order starting which the system to complain about.
> > + Default is current PAGE_ALLOC_COSTLY_ORDER.
> > +
> > config HWPOISON_INJECT
> > tristate "HWPoison pages injector"
> > depends on MEMORY_FAILURE && DEBUG_KERNEL && PROC_FS
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index e95b5b7c9c3d..258892adb861 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4341,6 +4341,30 @@ static inline void finalise_ac(gfp_t gfp_mask, struct alloc_context *ac)
> > ac->high_zoneidx, ac->nodemask);
> > }
> >
> > +#ifdef CONFIG_WARN_HIGH_ORDER
> > +int warn_order = CONFIG_WARN_HIGH_ORDER_LEVEL;
> > +
> > +/*
> > + * Complain if we allocate a high order page unless there is a __GFP_NOWARN
> > + * flag provided.
> > + *
> > + * Shuts up after 32 complains.
> > + */
> > +static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
> > +{
> > + static atomic_t warn_count = ATOMIC_INIT(32);
> > +
> > + if (order >= warn_order && !(gfp_mask & __GFP_NOWARN))
> > + WARN(atomic_dec_if_positive(&warn_count) >= 0,
> > + "order %d >= %d, gfp 0x%x\n",
> > + order, warn_order, gfp_mask);
> > +}
>
> We do have ratelimit functionality, so why cannot you use it?
>
> > +#else
> > +static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
> > +{
> > +}
> > +#endif
> > +
> > /*
> > * This is the 'heart' of the zoned buddy allocator.
> > */
> > @@ -4361,6 +4385,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> > WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
> > return NULL;
> > }
> > + warn_high_order(order, gfp_mask);
> >
> > gfp_mask &= gfp_allowed_mask;
> > alloc_mask = gfp_mask;
>
> Why do you warn about all allocations in the hot path? I thought you
> want to catch expensive allocations so I would assume that you would
> stick that into a slow path after we are not able to allocate anything
> after the first round of compaction.
>
> Also do you want to warn about opportunistic GFP_NOWAIT allocations that
> have a reasonable fallback?

And forgot to mention other opportunistic allocations like THP of
course.
--
Michal Hocko
SUSE Labs

2018-12-28 01:36:31

by Konstantin Khorenko

[permalink] [raw]
Subject: Re: [RFC PATCH 0/1] mm: add a warning about high order allocations

Hi Michal,

thank you very much for your questions, please see my notes below.

On 12/26/2018 11:35 AM, Michal Hocko wrote:
> On Tue 25-12-18 18:39:26, Konstantin Khorenko wrote:
>> Q: Why do we need to bother at all?
>> A: If a node is highly loaded and its memory is significantly fragmented
>> (unfortunately almost any node with serious load has highly fragmented memory)
>> then any high order memory allocation can trigger massive memory shrink and
>> result in quite a big allocation latency. And the node becomes less responsive
>> and users don't like it.
>> The ultimate solution here is to get rid of large allocations, but we need an
>> instrument to detect them.
>
> Can you point to an example of the problem you are referring here? At
> least for costly orders we do bail out early and try to not cause
> massive reclaim. So what is the order that you are concerned about?

Well, this is the most difficult question to answer.
Unfortunately i don't have a reproducer for that, usually we get into situation
when someone experiences significant node slowdown, nodes most often have a lot of RAM,
we check what is going on there and see the node is busy with reclaim.
And almost every time the reason was - fragmented memory and high order allocations.
Mostly of 2nd and 3rd (which is still considered not costly) order.

Recent related issues we faced were about FUSE dev pipe:
d6d931adce11 ("fuse: use kvmalloc to allocate array of pipe_buffer structs.")

and about bnx driver + mtu 9000 which for each packet required page of 2nd order
(and it even failed sometimes, though it was not the root cause):
kswapd0: page allocation failure: order:2, mode:0x4020
Call Trace:
dump_stack+0x19/0x1b
warn_alloc_failed+0x110/0x180
__alloc_pages_nodemask+0x7bf/0xc60
alloc_pages_current+0x98/0x110
kmalloc_order+0x18/0x40
kmalloc_order_trace+0x26/0xa0
__kmalloc+0x279/0x290
bnx2x_frag_alloc.isra.61+0x2a/0x40 [bnx2x]
bnx2x_rx_int+0x227/0x17c0 [bnx2x]
bnx2x_poll+0x1dd/0x260 [bnx2x]
net_rx_action+0x179/0x390
__do_softirq+0x10f/0x2aa
call_softirq+0x1c/0x30
do_softirq+0x65/0xa0
irq_exit+0x105/0x110
do_IRQ+0x56/0xe0
common_interrupt+0x6d/0x6d

And as both places were called very often - the system latency was high.

This warning can be also used to catch allocation of 4th order and higher which may
easily fail. Those places which are ready to get allocation errors and have
fallbacks are marked with __GFP_NOWARN.

>> Q: Why warning? Use tracepoints!
>> A: Well, this is a matter of magic defaults.
>> Yes, you can use tracepoints to catch large allocations, but you need to do this
>> on purpose and regularly and this is to be done by every developer which is
>> quite unreal.
>> On the other hand if you develop something and get a warning, you'll have to
>> think about the reason and either succeed with reworking the code to use
>> smaller allocation sizes (and thus decrease allocation latency!) or just use
>> kvmalloc() if you don't really need physically continuos chunk or come to the
>> conclusion you definitely need physically continuos memory and shut up the
>> warning.
>
> Well, not really. For one thing, there are systems to panic on warning
> and you really do not want to blow up just because somebody is doing a
> large order allocation.

Well, on one hand - yes, i agree with you. That's why i don't suggest to enable
the warning by default right now - until all (most) of large allocations are marked
properly.
But after it's done and there are no (almost) unmarked high order allocations -
why not? This will reveal new cases of high order allocations soon.
i think people who run systems with "kernel.panic_on_warn" enabled do care
about reporting issues.

>> Q: Why compile time config option?
>> A: In order not to decrease the performance even a bit in case someone does not
>> want to hunt for large allocations.
>> In an ideal life i'd prefer this check/warning is enabled by default and may be
>> even without a config option so it works on every node. Once we find and rework
>> or mark all large allocations that would be good by default. Until that though
>> it will be noisy.
>
> So who is going to enable this option?

At the beginning - people who want to debug kernel and verify their fallbacks on
memory allocations failures in the code or just speed up their code on nodes
with fragmented memory - for 2nd and 3rd orders.

mm performance issues are tough, you know, and this is just another way to
gain more performance. It won't avoid the necessity of digging mm for sure,
but might decrease the pressure level.

Later (i hope) i could be enabled by default so all big new allocations
are verified sooner and either reworked or marked with __GFP_NOWARN if the code is
ready.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

>> Another option is to rework the patch via static keys (having the warning
>> disabled by default surely). That makes it possible to turn on the feature
>> without recompiling the kernel - during testing period for example.
>>
>> If you prefer this way, i would be happy to rework the patch via static keys.
>
> I would rather go and chase the underlying issue. So can we get an
> actual data please?

2018-12-28 01:45:53

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm/page_alloc: add a warning about high order allocations

On Thu 27-12-18 16:05:18, Konstantin Khorenko wrote:
> On 12/26/2018 11:40 AM, Michal Hocko wrote:
> > Appart from general comments as a reply to the cover (btw. this all
> > should be in the changelog because this is the _why_ part of the
> > justification which should be _always_ part of the changelog).
>
> Thank you, will add in the next version of the patch alltogether
> with other changes if any.
>
> > On Tue 25-12-18 18:39:27, Konstantin Khorenko wrote:
> > [...]
> >> +config WARN_HIGH_ORDER
> >> + bool "Enable complains about high order memory allocations"
> >> + depends on !LOCKDEP
> >
> > Why?
>
> LOCKDEP makes structures big, so if we see a high order allocation warning
> on a debug kernel with lockdep, it does not give us a lot - lockdep enabled
> kernel performance is not our target.
> i can remove !LOCKDEP dependence here, but then need to adjust default
> warning level i think, or logs will be spammed.

OK, I see but this just points to how this is not really a suitable
solution for the problem you are looking for.

> >> +static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
> >> +{
> >> + static atomic_t warn_count = ATOMIC_INIT(32);
> >> +
> >> + if (order >= warn_order && !(gfp_mask & __GFP_NOWARN))
> >> + WARN(atomic_dec_if_positive(&warn_count) >= 0,
> >> + "order %d >= %d, gfp 0x%x\n",
> >> + order, warn_order, gfp_mask);
> >> +}
> >
> > We do have ratelimit functionality, so why cannot you use it?
>
> Well, my idea was to really shut up the warning after some number of messages
> (if a node is in production and its uptime, say, a year, i don't want to see
> many warnings in logs, first several is enough - let's fix them first).

OK, but it is quite likely that the system is perfectly healthy and
unfragmented after fresh boot when doing a large order allocations is
perfectly fine. Note that it is smaller order allocations that generate
fragmentation in general.
--
Michal Hocko
SUSE Labs

2018-12-28 02:25:17

by Konstantin Khorenko

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm/page_alloc: add a warning about high order allocations

On 12/26/2018 11:40 AM, Michal Hocko wrote:
> Appart from general comments as a reply to the cover (btw. this all
> should be in the changelog because this is the _why_ part of the
> justification which should be _always_ part of the changelog).

Thank you, will add in the next version of the patch alltogether
with other changes if any.

> On Tue 25-12-18 18:39:27, Konstantin Khorenko wrote:
> [...]
>> +config WARN_HIGH_ORDER
>> + bool "Enable complains about high order memory allocations"
>> + depends on !LOCKDEP
>
> Why?

LOCKDEP makes structures big, so if we see a high order allocation warning
on a debug kernel with lockdep, it does not give us a lot - lockdep enabled
kernel performance is not our target.
i can remove !LOCKDEP dependence here, but then need to adjust default
warning level i think, or logs will be spammed.

>> + default n
>> + help
>> + Enables warnings on high order memory allocations. This allows to
>> + determine users of large memory chunks and rework them to decrease
>> + allocation latency. Note, some debug options make kernel structures
>> + fat.
>> +
>> +config WARN_HIGH_ORDER_LEVEL
>> + int "Define page order level considered as too high"
>> + depends on WARN_HIGH_ORDER
>> + default 3
>> + help
>> + Defines page order starting which the system to complain about.
>> + Default is current PAGE_ALLOC_COSTLY_ORDER.
>> +
>> config HWPOISON_INJECT
>> tristate "HWPoison pages injector"
>> depends on MEMORY_FAILURE && DEBUG_KERNEL && PROC_FS
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index e95b5b7c9c3d..258892adb861 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4341,6 +4341,30 @@ static inline void finalise_ac(gfp_t gfp_mask, struct alloc_context *ac)
>> ac->high_zoneidx, ac->nodemask);
>> }
>>
>> +#ifdef CONFIG_WARN_HIGH_ORDER
>> +int warn_order = CONFIG_WARN_HIGH_ORDER_LEVEL;
>> +
>> +/*
>> + * Complain if we allocate a high order page unless there is a __GFP_NOWARN
>> + * flag provided.
>> + *
>> + * Shuts up after 32 complains.
>> + */
>> +static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
>> +{
>> + static atomic_t warn_count = ATOMIC_INIT(32);
>> +
>> + if (order >= warn_order && !(gfp_mask & __GFP_NOWARN))
>> + WARN(atomic_dec_if_positive(&warn_count) >= 0,
>> + "order %d >= %d, gfp 0x%x\n",
>> + order, warn_order, gfp_mask);
>> +}
>
> We do have ratelimit functionality, so why cannot you use it?

Well, my idea was to really shut up the warning after some number of messages
(if a node is in production and its uptime, say, a year, i don't want to see
many warnings in logs, first several is enough - let's fix them first).

If i use printk_ratelimited() i could get 24 days delay at most AFAIK,
but okay, i can switch to printk_ratelimited() if you prefer.

struct ratelimit_state {
int interval;

(gdb) p ((1LL<<31) -1)/1000/60/60/24
$11 = 24

>> +#else
>> +static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
>> +{
>> +}
>> +#endif
>> +
>> /*
>> * This is the 'heart' of the zoned buddy allocator.
>> */
>> @@ -4361,6 +4385,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
>> WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
>> return NULL;
>> }
>> + warn_high_order(order, gfp_mask);
>>
>> gfp_mask &= gfp_allowed_mask;
>> alloc_mask = gfp_mask;
>
> Why do you warn about all allocations in the hot path? I thought you
> want to catch expensive allocations so I would assume that you would
> stick that into a slow path after we are not able to allocate anything
> after the first round of compaction.

The idea is to catch big allocations soon and preferably during testing,
not on production nodes under high load, that's why i've chosen hot path.

And if we switch to the slow path, we'll have to run all tests under
additional serious load - to generate memory fragmentation.
Not so convenient.

> Also do you want to warn about opportunistic GFP_NOWAIT allocations that
> have a reasonable fallback?

Yes, i would like to. Sometimes allocation flags come from upper level and
it's not always evident if there is GFP_NOWAIT flag or not, so let's we
are noticed about such a case and verify if it's legal and not called often.
If yes - we'll just mark it with NO_WARN.


> And forgot to mention other opportunistic allocations like THP of
> course.

hugepages are allocated with NOWARN already, AFAIK.
alloc_fresh_huge_page_node(), __hugetlb_alloc_buddy_huge_page()


Thank you once again, Michal.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

2018-12-28 02:46:34

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 0/1] mm: add a warning about high order allocations

On Thu 27-12-18 15:18:54, Konstantin Khorenko wrote:
> Hi Michal,
>
> thank you very much for your questions, please see my notes below.
>
> On 12/26/2018 11:35 AM, Michal Hocko wrote:
> > On Tue 25-12-18 18:39:26, Konstantin Khorenko wrote:
> >> Q: Why do we need to bother at all?
> >> A: If a node is highly loaded and its memory is significantly fragmented
> >> (unfortunately almost any node with serious load has highly fragmented memory)
> >> then any high order memory allocation can trigger massive memory shrink and
> >> result in quite a big allocation latency. And the node becomes less responsive
> >> and users don't like it.
> >> The ultimate solution here is to get rid of large allocations, but we need an
> >> instrument to detect them.
> >
> > Can you point to an example of the problem you are referring here? At
> > least for costly orders we do bail out early and try to not cause
> > massive reclaim. So what is the order that you are concerned about?
>
> Well, this is the most difficult question to answer.
> Unfortunately i don't have a reproducer for that, usually we get into situation
> when someone experiences significant node slowdown, nodes most often have a lot of RAM,
> we check what is going on there and see the node is busy with reclaim.
> And almost every time the reason was - fragmented memory and high order allocations.
> Mostly of 2nd and 3rd (which is still considered not costly) order.
>
> Recent related issues we faced were about FUSE dev pipe:
> d6d931adce11 ("fuse: use kvmalloc to allocate array of pipe_buffer structs.")
>
> and about bnx driver + mtu 9000 which for each packet required page of 2nd order
> (and it even failed sometimes, though it was not the root cause):
> kswapd0: page allocation failure: order:2, mode:0x4020
> Call Trace:
> dump_stack+0x19/0x1b
> warn_alloc_failed+0x110/0x180
> __alloc_pages_nodemask+0x7bf/0xc60
> alloc_pages_current+0x98/0x110
> kmalloc_order+0x18/0x40
> kmalloc_order_trace+0x26/0xa0
> __kmalloc+0x279/0x290
> bnx2x_frag_alloc.isra.61+0x2a/0x40 [bnx2x]
> bnx2x_rx_int+0x227/0x17c0 [bnx2x]
> bnx2x_poll+0x1dd/0x260 [bnx2x]
> net_rx_action+0x179/0x390
> __do_softirq+0x10f/0x2aa
> call_softirq+0x1c/0x30
> do_softirq+0x65/0xa0
> irq_exit+0x105/0x110
> do_IRQ+0x56/0xe0
> common_interrupt+0x6d/0x6d
>
> And as both places were called very often - the system latency was high.
>
> This warning can be also used to catch allocation of 4th order and higher which may
> easily fail. Those places which are ready to get allocation errors and have
> fallbacks are marked with __GFP_NOWARN.

This is not true in general, though.

[...]
> But after it's done and there are no (almost) unmarked high order allocations -
> why not? This will reveal new cases of high order allocations soon.

There will always be legitimate high order allocations. I believe that
for your particular use case it is much better to simply enable reclaim
and page allocator tracepoints which will give you not only the source
of the allocation but also a much better picture

> i think people who run systems with "kernel.panic_on_warn" enabled do care
> about reporting issues.

You surely do not want to put the system down just because of the high
order allocation though, right?

> >> Q: Why compile time config option?
> >> A: In order not to decrease the performance even a bit in case someone does not
> >> want to hunt for large allocations.
> >> In an ideal life i'd prefer this check/warning is enabled by default and may be
> >> even without a config option so it works on every node. Once we find and rework
> >> or mark all large allocations that would be good by default. Until that though
> >> it will be noisy.
> >
> > So who is going to enable this option?
>
> At the beginning - people who want to debug kernel and verify their fallbacks on
> memory allocations failures in the code or just speed up their code on nodes
> with fragmented memory - for 2nd and 3rd orders.
>
> mm performance issues are tough, you know, and this is just another way to
> gain more performance. It won't avoid the necessity of digging mm for sure,
> but might decrease the pressure level.

But the warning alone will not give us useful information I am afraid.
It will only give us, there are warnings but not whether those are
actually a problem or not. So I really believe that using existing
tracepoints or add some that will fill missing gaps will be much more
better long term. And we do not have to add another config and touch the
code as a bonus.
--
Michal Hocko
SUSE Labs

2018-12-29 01:32:20

by Konstantin Khorenko

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm/page_alloc: add a warning about high order allocations

On 12/27/2018 07:50 PM, Michal Hocko wrote:
> On Thu 27-12-18 16:05:18, Konstantin Khorenko wrote:
>> On 12/26/2018 11:40 AM, Michal Hocko wrote:
>>> Appart from general comments as a reply to the cover (btw. this all
>>> should be in the changelog because this is the _why_ part of the
>>> justification which should be _always_ part of the changelog).
>>
>> Thank you, will add in the next version of the patch alltogether
>> with other changes if any.
>>
>>> On Tue 25-12-18 18:39:27, Konstantin Khorenko wrote:
>>> [...]
>>>> +config WARN_HIGH_ORDER
>>>> + bool "Enable complains about high order memory allocations"
>>>> + depends on !LOCKDEP
>>>
>>> Why?
>>
>> LOCKDEP makes structures big, so if we see a high order allocation warning
>> on a debug kernel with lockdep, it does not give us a lot - lockdep enabled
>> kernel performance is not our target.
>> i can remove !LOCKDEP dependence here, but then need to adjust default
>> warning level i think, or logs will be spammed.
>
> OK, I see but this just points to how this is not really a suitable
> solution for the problem you are looking for.

i have to admit, yes, it is inconvenient to have such a dependency.
i would be really glad to hear alternatives...

Just to mention: tracepoints are for issues investigation (when we already have any),
and i'm thinking about issues prevention. Gathering places which might
lead to problem and rework to prevent possible issues.

>>>> +static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
>>>> +{
>>>> + static atomic_t warn_count = ATOMIC_INIT(32);
>>>> +
>>>> + if (order >= warn_order && !(gfp_mask & __GFP_NOWARN))
>>>> + WARN(atomic_dec_if_positive(&warn_count) >= 0,
>>>> + "order %d >= %d, gfp 0x%x\n",
>>>> + order, warn_order, gfp_mask);
>>>> +}
>>>
>>> We do have ratelimit functionality, so why cannot you use it?
>>
>> Well, my idea was to really shut up the warning after some number of messages
>> (if a node is in production and its uptime, say, a year, i don't want to see
>> many warnings in logs, first several is enough - let's fix them first).
>
> OK, but it is quite likely that the system is perfectly healthy and
> unfragmented after fresh boot when doing a large order allocations is
> perfectly fine.

You are right again, Michal. Thus just switch those early allocations to
kvmalloc() and on boot on an unfragmented system same kmalloc() will be used.
And no new warning for that place. And no any chance this place ever trigger
compaction (in case we did not notice some way this allocation might also
happen later, not on a fresh node).

> Note that it is smaller order allocations that generate
> fragmentation in general.

And again - yes, that's true. Still i don't know how to avoid memory fragmentation,
but can avoid (ok, significantly decrease the number of) big allocations which
might trigger compaction because of that fragmentation.


--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

2018-12-29 01:35:52

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 0/1] mm: add a warning about high order allocations

On Fri 28-12-18 14:23:29, Konstantin Khorenko wrote:
> On 12/27/2018 07:46 PM, Michal Hocko wrote:
> > On Thu 27-12-18 15:18:54, Konstantin Khorenko wrote:
> >> Hi Michal,
> >>
> >> thank you very much for your questions, please see my notes below.
> >>
> >> On 12/26/2018 11:35 AM, Michal Hocko wrote:
> >>> On Tue 25-12-18 18:39:26, Konstantin Khorenko wrote:
> >>>> Q: Why do we need to bother at all?
> >>>> A: If a node is highly loaded and its memory is significantly fragmented
> >>>> (unfortunately almost any node with serious load has highly fragmented memory)
> >>>> then any high order memory allocation can trigger massive memory shrink and
> >>>> result in quite a big allocation latency. And the node becomes less responsive
> >>>> and users don't like it.
> >>>> The ultimate solution here is to get rid of large allocations, but we need an
> >>>> instrument to detect them.
> >>>
> >>> Can you point to an example of the problem you are referring here? At
> >>> least for costly orders we do bail out early and try to not cause
> >>> massive reclaim. So what is the order that you are concerned about?
> >>
> >> Well, this is the most difficult question to answer.
> >> Unfortunately i don't have a reproducer for that, usually we get into situation
> >> when someone experiences significant node slowdown, nodes most often have a lot of RAM,
> >> we check what is going on there and see the node is busy with reclaim.
> >> And almost every time the reason was - fragmented memory and high order allocations.
> >> Mostly of 2nd and 3rd (which is still considered not costly) order.
> >>
> >> Recent related issues we faced were about FUSE dev pipe:
> >> d6d931adce11 ("fuse: use kvmalloc to allocate array of pipe_buffer structs.")
> >>
> >> and about bnx driver + mtu 9000 which for each packet required page of 2nd order
> >> (and it even failed sometimes, though it was not the root cause):
> >> kswapd0: page allocation failure: order:2, mode:0x4020
> >> Call Trace:
> >> dump_stack+0x19/0x1b
> >> warn_alloc_failed+0x110/0x180
> >> __alloc_pages_nodemask+0x7bf/0xc60
> >> alloc_pages_current+0x98/0x110
> >> kmalloc_order+0x18/0x40
> >> kmalloc_order_trace+0x26/0xa0
> >> __kmalloc+0x279/0x290
> >> bnx2x_frag_alloc.isra.61+0x2a/0x40 [bnx2x]
> >> bnx2x_rx_int+0x227/0x17c0 [bnx2x]
> >> bnx2x_poll+0x1dd/0x260 [bnx2x]
> >> net_rx_action+0x179/0x390
> >> __do_softirq+0x10f/0x2aa
> >> call_softirq+0x1c/0x30
> >> do_softirq+0x65/0xa0
> >> irq_exit+0x105/0x110
> >> do_IRQ+0x56/0xe0
> >> common_interrupt+0x6d/0x6d
> >>
> >> And as both places were called very often - the system latency was high.
> >>
> >> This warning can be also used to catch allocation of 4th order and higher which may
> >> easily fail. Those places which are ready to get allocation errors and have
> >> fallbacks are marked with __GFP_NOWARN.
> >
> > This is not true in general, though.
>
> Right now - yep, this is not true. Sorry, i was not clear enough here.
> i meant after we catch all high order allocations, we review them and either
> switch to kvmalloc() or mark with NOWARN if we are ready to handle allocation errors
> in that particular place. So this is an ideal situation when we reviewed all the things.

I do not think that making all high order allocations NOWARN is
reasonable or even correct. There allocations which really require
physically contiguous memory range of a larger size. And we migh want to
warn about those and they might be perfectly fine.

[...]
> > I believe that
> > for your particular use case it is much better to simply enable reclaim
> > and page allocator tracepoints which will give you not only the source
> > of the allocation but also a much better picture
>
> Tracepoints are much better for issues investigation, right. And we do so.
>
> And warnings are intended not for investigation but for prevention of possible future issues.
> If we spot a big allocation, we can review it in advance, before we face any problem,
> and in most cases just switch it to use kvmalloc() in 90% cases and we never ever have
> a problem with unexpected reclaim due to this allocation. Ever.
> With any reclaim algorithm - the compaction just won't be triggered.

I do not think this is realistic, to be honest. As mentioned before
there are simply valid high order allocations.

[...]

> > But the warning alone will not give us useful information I am afraid.
> > It will only give us, there are warnings but not whether those are
> > actually a problem or not.
>
> Yes. And even more - a lot of high order allocations which cannot be
> switched to kvmalloc() are in drivers - for DMA zones - so they are very
> rare and most probably won't ever cause a problem.
>
> But some of them can potentially cause a problem some day. And warning
> does not provide info how to distinguish "bad" and "good" ones.
> But if we switch both "bad" and "good" big allocations to kvmalloc(),
> that won't hurt, right?

Ohh, this is not that simple. Vmalloc doesn't scale. We might work on
changing underlying allocator but there are fundamental perfomance
roadblocks that cannot be removed. You are manipulating page tables
which require TLB flushes for example. E.g. networking code would rather
see higher order allocations and use the vmalloc only as a fallback.
That is the reason why we are supporting __GFP_RETRY_MAYFAIL for
kvmalloc btw.

I do not even want to mention 32b and the restricted vmalloc space
because 32b is on decline.

So really, I think you are right there are many places which could and
should be changed to be less demanding on high order allocations. But I
think that your warning behind a config is not a right direction to go
there. Each performance issue should be debugged and the underlying
culprit found. In many cases it might be just an ineffective or outright
buggy implementation of the page allocator itself. We want to fix those.
Warning about "something's broken" will help us much less than a set of
tracepoints with a better picture.

Thanks!
--
Michal Hocko
SUSE Labs

2018-12-29 02:58:54

by Konstantin Khorenko

[permalink] [raw]
Subject: Re: [RFC PATCH 0/1] mm: add a warning about high order allocations

On 12/27/2018 07:46 PM, Michal Hocko wrote:
> On Thu 27-12-18 15:18:54, Konstantin Khorenko wrote:
>> Hi Michal,
>>
>> thank you very much for your questions, please see my notes below.
>>
>> On 12/26/2018 11:35 AM, Michal Hocko wrote:
>>> On Tue 25-12-18 18:39:26, Konstantin Khorenko wrote:
>>>> Q: Why do we need to bother at all?
>>>> A: If a node is highly loaded and its memory is significantly fragmented
>>>> (unfortunately almost any node with serious load has highly fragmented memory)
>>>> then any high order memory allocation can trigger massive memory shrink and
>>>> result in quite a big allocation latency. And the node becomes less responsive
>>>> and users don't like it.
>>>> The ultimate solution here is to get rid of large allocations, but we need an
>>>> instrument to detect them.
>>>
>>> Can you point to an example of the problem you are referring here? At
>>> least for costly orders we do bail out early and try to not cause
>>> massive reclaim. So what is the order that you are concerned about?
>>
>> Well, this is the most difficult question to answer.
>> Unfortunately i don't have a reproducer for that, usually we get into situation
>> when someone experiences significant node slowdown, nodes most often have a lot of RAM,
>> we check what is going on there and see the node is busy with reclaim.
>> And almost every time the reason was - fragmented memory and high order allocations.
>> Mostly of 2nd and 3rd (which is still considered not costly) order.
>>
>> Recent related issues we faced were about FUSE dev pipe:
>> d6d931adce11 ("fuse: use kvmalloc to allocate array of pipe_buffer structs.")
>>
>> and about bnx driver + mtu 9000 which for each packet required page of 2nd order
>> (and it even failed sometimes, though it was not the root cause):
>> kswapd0: page allocation failure: order:2, mode:0x4020
>> Call Trace:
>> dump_stack+0x19/0x1b
>> warn_alloc_failed+0x110/0x180
>> __alloc_pages_nodemask+0x7bf/0xc60
>> alloc_pages_current+0x98/0x110
>> kmalloc_order+0x18/0x40
>> kmalloc_order_trace+0x26/0xa0
>> __kmalloc+0x279/0x290
>> bnx2x_frag_alloc.isra.61+0x2a/0x40 [bnx2x]
>> bnx2x_rx_int+0x227/0x17c0 [bnx2x]
>> bnx2x_poll+0x1dd/0x260 [bnx2x]
>> net_rx_action+0x179/0x390
>> __do_softirq+0x10f/0x2aa
>> call_softirq+0x1c/0x30
>> do_softirq+0x65/0xa0
>> irq_exit+0x105/0x110
>> do_IRQ+0x56/0xe0
>> common_interrupt+0x6d/0x6d
>>
>> And as both places were called very often - the system latency was high.
>>
>> This warning can be also used to catch allocation of 4th order and higher which may
>> easily fail. Those places which are ready to get allocation errors and have
>> fallbacks are marked with __GFP_NOWARN.
>
> This is not true in general, though.

Right now - yep, this is not true. Sorry, i was not clear enough here.
i meant after we catch all high order allocations, we review them and either
switch to kvmalloc() or mark with NOWARN if we are ready to handle allocation errors
in that particular place. So this is an ideal situation when we reviewed all the things.

> [...]
>> But after it's done and there are no (almost) unmarked high order allocations -
>> why not? This will reveal new cases of high order allocations soon.
>
> There will always be legitimate high order allocations.

Sure. But after we review them we either switch them to kvmalloc() or mark them with
NOWARN. In both cases we won't get new warnings about that places.

> I believe that
> for your particular use case it is much better to simply enable reclaim
> and page allocator tracepoints which will give you not only the source
> of the allocation but also a much better picture

Tracepoints are much better for issues investigation, right. And we do so.

And warnings are intended not for investigation but for prevention of possible future issues.
If we spot a big allocation, we can review it in advance, before we face any problem,
and in most cases just switch it to use kvmalloc() in 90% cases and we never ever have
a problem with unexpected reclaim due to this allocation. Ever.
With any reclaim algorithm - the compaction just won't be triggered.

>> i think people who run systems with "kernel.panic_on_warn" enabled do care
>> about reporting issues.
>
> You surely do not want to put the system down just because of the high
> order allocation though, right?

Right, i do not. (And i also don't want to run a node with "kernel.panic_on_warn"
enabled in production :) )
But people who do run nodes with "kernel.panic_on_warn" enabled in production
may disable high allocation warning by increasing warning order level higher than
MAX_ORDER. Or just not enable kernel config option.

i do understand the warning will be noisy at the beginning thus i surely don't even
suggest to make it enable by default now.

>>>> Q: Why compile time config option?
>>>> A: In order not to decrease the performance even a bit in case someone does not
>>>> want to hunt for large allocations.
>>>> In an ideal life i'd prefer this check/warning is enabled by default and may be
>>>> even without a config option so it works on every node. Once we find and rework
>>>> or mark all large allocations that would be good by default. Until that though
>>>> it will be noisy.
>>>
>>> So who is going to enable this option?
>>
>> At the beginning - people who want to debug kernel and verify their fallbacks on
>> memory allocations failures in the code or just speed up their code on nodes
>> with fragmented memory - for 2nd and 3rd orders.
>>
>> mm performance issues are tough, you know, and this is just another way to
>> gain more performance. It won't avoid the necessity of digging mm for sure,
>> but might decrease the pressure level.
>
> But the warning alone will not give us useful information I am afraid.
> It will only give us, there are warnings but not whether those are
> actually a problem or not.

Yes. And even more - a lot of high order allocations which cannot be
switched to kvmalloc() are in drivers - for DMA zones - so they are very
rare and most probably won't ever cause a problem.

But some of them can potentially cause a problem some day. And warning
does not provide info how to distinguish "bad" and "good" ones.
But if we switch both "bad" and "good" big allocations to kvmalloc(),
that won't hurt, right? But that way we ensure we won't get any problems
from "bad" cases, even if we don't know exactly which of them are potentially "bad".

> So I really believe that using existing
> tracepoints or add some that will fill missing gaps will be much more
> better long term. And we do not have to add another config and touch the
> code as a bonus.


And Michal, thank you very much once again for this conversation.
i appreciate it.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team