It's unnecessary to excessively spam the kernel log anytime the BTS buffer
cannot be allocated, so make this allocation __GFP_NOWARN.
The user probably will want to at least find some artifact that the
allocation has failed in the past, probably due to fragmentation because
of its large size, when it's not allocated at bootstrap. Thus, add a
WARN_ONCE() so something is left behind for them to understand why perf
commnads that require PEBS is not working properly.
Signed-off-by: David Rientjes <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -311,9 +311,11 @@ static int alloc_bts_buffer(int cpu)
if (!x86_pmu.bts)
return 0;
- buffer = kzalloc_node(BTS_BUFFER_SIZE, GFP_KERNEL, node);
- if (unlikely(!buffer))
+ buffer = kzalloc_node(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
+ if (unlikely(!buffer)) {
+ WARN_ONCE(1, "%s: BTS buffer allocation failure\n", __func__);
return -ENOMEM;
+ }
max = BTS_BUFFER_SIZE / BTS_RECORD_SIZE;
thresh = max / 16;
On Mon, Jun 30, 2014 at 04:04:08PM -0700, David Rientjes wrote:
> It's unnecessary to excessively spam the kernel log anytime the BTS buffer
> cannot be allocated, so make this allocation __GFP_NOWARN.
>
> The user probably will want to at least find some artifact that the
> allocation has failed in the past, probably due to fragmentation because
> of its large size, when it's not allocated at bootstrap. Thus, add a
> WARN_ONCE() so something is left behind for them to understand why perf
> commnads that require PEBS is not working properly.
Can you elaborate a bit under which conditions this triggered? Typically
we should be doing fairly well allocating such buffers with GFP_KERNEL,
that should allow things like compaction to run and create higher order
pages.
And the BTS (branch trace store) isn't _that_ large.
That said, the patch is reasonable; although arguably we should maybe do
the same to alloc_pebs_buffer().
On Tue, Jul 1, 2014 at 11:34 AM, Peter Zijlstra <[email protected]> wrote:
> On Mon, Jun 30, 2014 at 04:04:08PM -0700, David Rientjes wrote:
>> It's unnecessary to excessively spam the kernel log anytime the BTS buffer
>> cannot be allocated, so make this allocation __GFP_NOWARN.
>>
>> The user probably will want to at least find some artifact that the
>> allocation has failed in the past, probably due to fragmentation because
>> of its large size, when it's not allocated at bootstrap. Thus, add a
>> WARN_ONCE() so something is left behind for them to understand why perf
>> commnads that require PEBS is not working properly.
>
> Can you elaborate a bit under which conditions this triggered? Typically
> we should be doing fairly well allocating such buffers with GFP_KERNEL,
> that should allow things like compaction to run and create higher order
> pages.
>
I think this triggers when you have fragmented memory and you have
perf_events active and inactive (i.e., 0 users = no nmi watchdog) frequently.
Each first user invokes the reserve_ds() function to reserve DS, PEBS, BTS.
The reason for BTS rather then PEBS is the size of the allocation.
PEBS allocates one page, i.e., less likely to get a failure than BTS
which allocates 4 pages, I think.
David and I discussed this. He can probably add more background
info, if needed.
> And the BTS (branch trace store) isn't _that_ large.
>
> That said, the patch is reasonable; although arguably we should maybe do
> the same to alloc_pebs_buffer().
On Wed, Jul 02, 2014 at 03:16:40PM +0200, Stephane Eranian wrote:
> On Tue, Jul 1, 2014 at 11:34 AM, Peter Zijlstra <[email protected]> wrote:
> > On Mon, Jun 30, 2014 at 04:04:08PM -0700, David Rientjes wrote:
> >> It's unnecessary to excessively spam the kernel log anytime the BTS buffer
> >> cannot be allocated, so make this allocation __GFP_NOWARN.
> >>
> >> The user probably will want to at least find some artifact that the
> >> allocation has failed in the past, probably due to fragmentation because
> >> of its large size, when it's not allocated at bootstrap. Thus, add a
> >> WARN_ONCE() so something is left behind for them to understand why perf
> >> commnads that require PEBS is not working properly.
> >
> > Can you elaborate a bit under which conditions this triggered? Typically
> > we should be doing fairly well allocating such buffers with GFP_KERNEL,
> > that should allow things like compaction to run and create higher order
> > pages.
> >
> I think this triggers when you have fragmented memory and you have
> perf_events active and inactive (i.e., 0 users = no nmi watchdog) frequently.
> Each first user invokes the reserve_ds() function to reserve DS, PEBS, BTS.
Right, that'd suck. I suppose we could also change that to allocate the
DS resources on first demand and never free them again.
So only allocate the PEBS buffer when we create the first PEBS event,
and idem for the BTS muck.
> The reason for BTS rather then PEBS is the size of the allocation.
> PEBS allocates one page, i.e., less likely to get a failure than BTS
> which allocates 4 pages, I think.
Sure..
> David and I discussed this. He can probably add more background
> info, if needed.
It would still be good to see why compaction etc is failing.
On Wed, Jul 2, 2014 at 3:31 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Jul 02, 2014 at 03:16:40PM +0200, Stephane Eranian wrote:
>> On Tue, Jul 1, 2014 at 11:34 AM, Peter Zijlstra <[email protected]> wrote:
>> > On Mon, Jun 30, 2014 at 04:04:08PM -0700, David Rientjes wrote:
>> >> It's unnecessary to excessively spam the kernel log anytime the BTS buffer
>> >> cannot be allocated, so make this allocation __GFP_NOWARN.
>> >>
>> >> The user probably will want to at least find some artifact that the
>> >> allocation has failed in the past, probably due to fragmentation because
>> >> of its large size, when it's not allocated at bootstrap. Thus, add a
>> >> WARN_ONCE() so something is left behind for them to understand why perf
>> >> commnads that require PEBS is not working properly.
>> >
>> > Can you elaborate a bit under which conditions this triggered? Typically
>> > we should be doing fairly well allocating such buffers with GFP_KERNEL,
>> > that should allow things like compaction to run and create higher order
>> > pages.
>> >
>> I think this triggers when you have fragmented memory and you have
>> perf_events active and inactive (i.e., 0 users = no nmi watchdog) frequently.
>> Each first user invokes the reserve_ds() function to reserve DS, PEBS, BTS.
>
> Right, that'd suck. I suppose we could also change that to allocate the
> DS resources on first demand and never free them again.
>
Some may argue that if you never use perf_event again, you are wasting
(1 + 1 + 4) pages per CPU. That may not be okay on some systems.
But yes, it would avoid this problem and also take the penalty for the allocs
only once.
> So only allocate the PEBS buffer when we create the first PEBS event,
> and idem for the BTS muck.
>
>> The reason for BTS rather then PEBS is the size of the allocation.
>> PEBS allocates one page, i.e., less likely to get a failure than BTS
>> which allocates 4 pages, I think.
>
> Sure..
>
>> David and I discussed this. He can probably add more background
>> info, if needed.
>
> It would still be good to see why compaction etc is failing.
On Wed, Jul 02, 2014 at 03:38:50PM +0200, Stephane Eranian wrote:
> On Wed, Jul 2, 2014 at 3:31 PM, Peter Zijlstra <[email protected]> wrote:
> > Right, that'd suck. I suppose we could also change that to allocate the
> > DS resources on first demand and never free them again.
> >
> Some may argue that if you never use perf_event again, you are wasting
> (1 + 1 + 4) pages per CPU. That may not be okay on some systems.
>
> But yes, it would avoid this problem and also take the penalty for the allocs
> only once.
We could of course over engineer this and put a timer on them to free
after 5 minutes to avoid the alloc/free cycle for workloads that
create/destroy events a lot.
On Wed, 2 Jul 2014, Peter Zijlstra wrote:
> > David and I discussed this. He can probably add more background
> > info, if needed.
>
> It would still be good to see why compaction etc is failing.
>
"Why compaction is failing" has been the story of my life for the past few
weeks, unfortunately. One person ran into this and here is the breakdown
at the time of the page allocation failure:
Node 0 DMA32: 12478*4kB 7595*8kB 2087*16kB 26*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 144896kB
Node 0 Normal: 55037*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 222196kB
Node 1 Normal: 165860*4kB 18*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 663600kB
So we really don't have any order-4 or higher memory and if compaction and
reclaim fails, then the allocation failure is printed every time. This is
what the patch is trying to address and we can't guarantee order-4 is
always allocatable even with GFP_KERNEL. This allocation is just outside
the PAGE_ALLOC_COSTLY_ORDER == 3 threshold that would have oom killed
something and retried; the oom killer is deferred for this case because
there's no guarantee killing a process would have resulted in order-4
memory being available (and nobody wants killing when they are 902MB above
the per-zone min watermarks like this poor guy).
On Mon, 30 Jun 2014, David Rientjes wrote:
> It's unnecessary to excessively spam the kernel log anytime the BTS buffer
> cannot be allocated, so make this allocation __GFP_NOWARN.
>
> The user probably will want to at least find some artifact that the
> allocation has failed in the past, probably due to fragmentation because
> of its large size, when it's not allocated at bootstrap. Thus, add a
> WARN_ONCE() so something is left behind for them to understand why perf
> commnads that require PEBS is not working properly.
>
> Signed-off-by: David Rientjes <[email protected]>
Peter, would you like to ack this? Trying to get this merged in time for
the next merge window.
> ---
> arch/x86/kernel/cpu/perf_event_intel_ds.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> @@ -311,9 +311,11 @@ static int alloc_bts_buffer(int cpu)
> if (!x86_pmu.bts)
> return 0;
>
> - buffer = kzalloc_node(BTS_BUFFER_SIZE, GFP_KERNEL, node);
> - if (unlikely(!buffer))
> + buffer = kzalloc_node(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
> + if (unlikely(!buffer)) {
> + WARN_ONCE(1, "%s: BTS buffer allocation failure\n", __func__);
> return -ENOMEM;
> + }
>
> max = BTS_BUFFER_SIZE / BTS_RECORD_SIZE;
> thresh = max / 16;
On Mon, Jul 14, 2014 at 04:33:47PM -0700, David Rientjes wrote:
> On Mon, 30 Jun 2014, David Rientjes wrote:
>
> > It's unnecessary to excessively spam the kernel log anytime the BTS buffer
> > cannot be allocated, so make this allocation __GFP_NOWARN.
> >
> > The user probably will want to at least find some artifact that the
> > allocation has failed in the past, probably due to fragmentation because
> > of its large size, when it's not allocated at bootstrap. Thus, add a
> > WARN_ONCE() so something is left behind for them to understand why perf
> > commnads that require PEBS is not working properly.
> >
> > Signed-off-by: David Rientjes <[email protected]>
>
> Peter, would you like to ack this? Trying to get this merged in time for
> the next merge window.
I've queued it; still don't like it though.
Commit-ID: 4485154138f6ffa5b252cb490aba3e8eb30124e4
Gitweb: http://git.kernel.org/tip/4485154138f6ffa5b252cb490aba3e8eb30124e4
Author: David Rientjes <[email protected]>
AuthorDate: Mon, 30 Jun 2014 16:04:08 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 16 Jul 2014 13:31:30 +0200
perf/x86/intel: Avoid spamming kernel log for BTS buffer failure
It's unnecessary to excessively spam the kernel log anytime the BTS buffer
cannot be allocated, so make this allocation __GFP_NOWARN.
The user probably will want to at least find some artifact that the
allocation has failed in the past, probably due to fragmentation because
of its large size, when it's not allocated at bootstrap. Thus, add a
WARN_ONCE() so something is left behind for them to understand why perf
commnads that require PEBS is not working properly.
Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 980970c..696ade3 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -311,9 +311,11 @@ static int alloc_bts_buffer(int cpu)
if (!x86_pmu.bts)
return 0;
- buffer = kzalloc_node(BTS_BUFFER_SIZE, GFP_KERNEL, node);
- if (unlikely(!buffer))
+ buffer = kzalloc_node(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
+ if (unlikely(!buffer)) {
+ WARN_ONCE(1, "%s: BTS buffer allocation failure\n", __func__);
return -ENOMEM;
+ }
max = BTS_BUFFER_SIZE / BTS_RECORD_SIZE;
thresh = max / 16;