LinuxLists.cc - [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

>
> 2. Uncore pmu NMI handling
>
> All the 4 cores are programmed to receive uncore counter overflow
> interrupt. The NMI handler(running on 1 of the 4 cores) handle all
> counters enabled by all 4 cores.

Really for uncore monitoring there is no need to use an NMI handler.
You can't profile a core anyways, so you can just delay the reporting
a little bit. It may simplify the code to not use one here
and just use an ordinary handler.

In general since there is already much trouble with overloaded
NMI events avoiding new NMIs is a good idea.

> +
> +static struct node_hw_events *uncore_events[MAX_NUMNODES];

Don't declare static arrays with MAX_NUMNODES, that number can be
very large and cause unnecessary bloat. Better use per CPU data or similar
(e.g. with alloc_percpu)

> + /*
> + * The hw event starts counting from this event offset,
> + * mark it to be able to extra future deltas:
> + */
> + local64_set(&hwc->prev_count, (u64)-left);

Your use of local* seems dubious. That is only valid if it's really
all on the same CPU. Is that really true?

> +static int uncore_pmu_add(struct perf_event *event, int flags)
> +{
> + int node = numa_node_id();

this should be still package id

> + /* Check CPUID signatures: 06_1AH, 06_1EH, 06_1FH */
> + model = eax.split.model | (eax.split.ext_model << 4);
> + if (eax.split.family != 6 || (model != 0x1A && model != 0x1E && model !=
> 0x1F))
> + return;

You can just get that from boot_cpu_data, no need to call cpuid

> +#include <linux/perf_event.h>
> +#include <linux/capability.h>
> +#include <linux/notifier.h>
> +#include <linux/hardirq.h>
> +#include <linux/kprobes.h>
> +#include <linux/module.h>
> +#include <linux/kdebug.h>
> +#include <linux/sched.h>
> +#include <linux/uaccess.h>
> +#include <linux/slab.h>
> +#include <linux/highmem.h>
> +#include <linux/cpu.h>
> +#include <linux/bitops.h>

Do you really need all these includes?

-Andi

2010-11-21 14:04:29

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Sun, 2010-11-21 at 20:46 +0800, Andi Kleen wrote:
> >
> > 2. Uncore pmu NMI handling
> >
> > All the 4 cores are programmed to receive uncore counter overflow
> > interrupt. The NMI handler(running on 1 of the 4 cores) handle all
> > counters enabled by all 4 cores.
>
> Really for uncore monitoring there is no need to use an NMI handler.
> You can't profile a core anyways, so you can just delay the reporting
> a little bit. It may simplify the code to not use one here
> and just use an ordinary handler.

OK, I can use on ordinary interrupt handler here.

>
> In general since there is already much trouble with overloaded
> NMI events avoiding new NMIs is a good idea.
>
>
>
> > +
> > +static struct node_hw_events *uncore_events[MAX_NUMNODES];
>
> Don't declare static arrays with MAX_NUMNODES, that number can be
> very large and cause unnecessary bloat. Better use per CPU data or similar
> (e.g. with alloc_percpu)

I really need is a per physical cpu data here, is alloc_percpu enough?

>
> > + /*
> > + * The hw event starts counting from this event offset,
> > + * mark it to be able to extra future deltas:
> > + */
> > + local64_set(&hwc->prev_count, (u64)-left);
>
> Your use of local* seems dubious. That is only valid if it's really
> all on the same CPU. Is that really true?

Good catch! That is not true.

The interrupt handler is running on one core and the
data(hwc->prev_count) maybe on another core.

Any idea to set this cross-core data?

>
> > +static int uncore_pmu_add(struct perf_event *event, int flags)
> > +{
> > + int node = numa_node_id();
>
> this should be still package id

Understand, this is in my TODO.

>
> > + /* Check CPUID signatures: 06_1AH, 06_1EH, 06_1FH */
> > + model = eax.split.model | (eax.split.ext_model << 4);
> > + if (eax.split.family != 6 || (model != 0x1A && model != 0x1E && model !=
> > 0x1F))
> > + return;
>
> You can just get that from boot_cpu_data, no need to call cpuid

Nice, will use it.

>
> > +#include <linux/perf_event.h>
> > +#include <linux/capability.h>
> > +#include <linux/notifier.h>
> > +#include <linux/hardirq.h>
> > +#include <linux/kprobes.h>
> > +#include <linux/module.h>
> > +#include <linux/kdebug.h>
> > +#include <linux/sched.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/slab.h>
> > +#include <linux/highmem.h>
> > +#include <linux/cpu.h>
> > +#include <linux/bitops.h>
>
> Do you really need all these includes?

Only

#include <linux/perf_event.h>
#include <linux/kprobes.h>
#include <linux/hardirq.h>
#include <linux/slab.h>

are needed.

Thanks for the comments.
Lin Ming

2010-11-21 17:00:55

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

BTW another thing I noticed that if you ever add opcode/address
matching you'll need to add the new parameters for the address
at least to the input perf_event structure. opcode could
in theory be encoded in the upper 32bits like offcore does,
but address needs to be extra. It's only a small
incremental step, but may make this more useful.

>> Really for uncore monitoring there is no need to use an NMI handler.
>> You can't profile a core anyways, so you can just delay the reporting
>> a little bit. It may simplify the code to not use one here
>> and just use an ordinary handler.
>
> OK, I can use on ordinary interrupt handler here.

You'll need to allocate a vector, it shouldn't be too difficult.

>>
>> In general since there is already much trouble with overloaded
>> NMI events avoiding new NMIs is a good idea.
>>
>>
>>
>> > +
>> > +static struct node_hw_events *uncore_events[MAX_NUMNODES];
>>
>> Don't declare static arrays with MAX_NUMNODES, that number can be
>> very large and cause unnecessary bloat. Better use per CPU data or
>> similar
>> (e.g. with alloc_percpu)
>
> I really need is a per physical cpu data here, is alloc_percpu enough?

If you use a per cpu array then each CPU can carry a pointer
to its per socket data structure.

This could use a similar scheme as the per core data I submitted
recently.

>
> Any idea to set this cross-core data?

s/local/atomic/

But if it's just stores/loads without read-modify-write you
can just use normal stores.

>
>>
>> > +static int uncore_pmu_add(struct perf_event *event, int flags)
>> > +{
>> > + int node = numa_node_id();
>>
>> this should be still package id
>
> Understand, this is in my TODO.

With the per cpu pointer scheme you likely don't even need it,
just check the topology at set up time (similar as in my patch,
just using the package)

-Andi

2010-11-21 17:44:26

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Sun, 2010-11-21 at 22:04 +0800, Lin Ming wrote:
> On Sun, 2010-11-21 at 20:46 +0800, Andi Kleen wrote:
> > >
> > > 2. Uncore pmu NMI handling
> > >
> > > All the 4 cores are programmed to receive uncore counter overflow
> > > interrupt. The NMI handler(running on 1 of the 4 cores) handle all
> > > counters enabled by all 4 cores.
> >
> > Really for uncore monitoring there is no need to use an NMI handler.
> > You can't profile a core anyways, so you can just delay the reporting
> > a little bit. It may simplify the code to not use one here
> > and just use an ordinary handler.
>
> OK, I can use on ordinary interrupt handler here.

Does the hardware actually allow using a different interrupt source?

> >
> > In general since there is already much trouble with overloaded
> > NMI events avoiding new NMIs is a good idea.
> >
> >
> >
> > > +
> > > +static struct node_hw_events *uncore_events[MAX_NUMNODES];
> >
> > Don't declare static arrays with MAX_NUMNODES, that number can be
> > very large and cause unnecessary bloat. Better use per CPU data or similar
> > (e.g. with alloc_percpu)
>
> I really need is a per physical cpu data here, is alloc_percpu enough?

Nah, simply manually allocate bits using kmalloc_node(), that's
something I still need to fix in Andi's patches as well.

> > > + /*
> > > + * The hw event starts counting from this event offset,
> > > + * mark it to be able to extra future deltas:
> > > + */
> > > + local64_set(&hwc->prev_count, (u64)-left);
> >
> > Your use of local* seems dubious. That is only valid if it's really
> > all on the same CPU. Is that really true?
>
> Good catch! That is not true.
>
> The interrupt handler is running on one core and the
> data(hwc->prev_count) maybe on another core.
>
> Any idea to set this cross-core data?

IIRC you can steer the uncore interrupts (it has a mask somewhere)
simply steer everything to the first cpu in the nodemask?

2010-11-23 10:00:51

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Sun, Nov 21, 2010 at 6:44 PM, Peter Zijlstra <[email protected]> wrote:
> On Sun, 2010-11-21 at 22:04 +0800, Lin Ming wrote:
>> On Sun, 2010-11-21 at 20:46 +0800, Andi Kleen wrote:
>> > >
>> > > 2. Uncore pmu NMI handling
>> > >
>> > > All the 4 cores are programmed to receive uncore counter overflow
>> > > interrupt. The NMI handler(running on 1 of the 4 cores) handle all
>> > > counters enabled by all 4 cores.
>> >
>> > Really for uncore monitoring there is no need to use an NMI handler.
>> > You can't profile a core anyways, so you can just delay the reporting
>> > a little bit. It may simplify the code to not use one here
>> > and just use an ordinary handler.
>>
>> OK, I can use on ordinary interrupt handler here.
>
> Does the hardware actually allow using a different interrupt source?
>
It does not. It's using whatever you've programmed into the APIC
LVT vector, AFAIK. Uncore interrupt mode is enabled via
IA32_DEBUGCTL. Regarless of sampling or not, you need the interrupt
to virtualize the counters to 64 bits.

2010-11-23 10:17:14

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

Lin,

On Sun, Nov 21, 2010 at 1:01 PM, Lin Ming <[email protected]> wrote:
> +static void uncore_pmu_enable_all(void)
> +{
> + u64 ctrl;
> +
> + /*
> + * (0xFULL << 48): 1 of the 4 cores can receive NMI each time
> + * but we don't know which core will receive the NMI when overflow happens
> + */

That does not sound right. If you set bit 48-51 to 1, then all 4 cores
will receive EVERY
interrupt, i.e., it's a broadcast. That seems to contradict your
comment: 1 of the 4. Unless
you meant, they all get the interrupt and one will handle it, the
other will find nothing to
process. But I don't see the atomic op that would make this true in
uncore_handle_irq().

I also think that if you want all processors to receive the
interrupts, then the mask should
be 0xff when HT is on. The manual is rather obscure on this, but it
does make sense.

> + ctrl = ((1 << UNCORE_NUM_GENERAL_COUNTERS) - 1) | (0xFULL << 48);
> + ctrl |= MSR_UNCORE_PERF_GLOBAL_CTRL_EN_FC0;
> +
> + /*
> + * Freeze the uncore pmu on overflow of any uncore counter.
> + * This makes unocre NMI handling easier.
> + */
> + ctrl |= MSR_UNCORE_PERF_GLOBAL_CTRL_PMI_FRZ;
> +
> + wrmsrl(MSR_UNCORE_PERF_GLOBAL_CTRL, ctrl);
> +}
> +

2010-11-24 01:32:34

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Tue, 2010-11-23 at 18:17 +0800, Stephane Eranian wrote:
> Lin,
>
> On Sun, Nov 21, 2010 at 1:01 PM, Lin Ming <[email protected]> wrote:
> > +static void uncore_pmu_enable_all(void)
> > +{
> > + u64 ctrl;
> > +
> > + /*
> > + * (0xFULL << 48): 1 of the 4 cores can receive NMI each time
> > + * but we don't know which core will receive the NMI when overflow happens
> > + */
>
> That does not sound right. If you set bit 48-51 to 1, then all 4 cores
> will receive EVERY
> interrupt, i.e., it's a broadcast. That seems to contradict your
> comment: 1 of the 4. Unless
> you meant, they all get the interrupt and one will handle it, the
> other will find nothing to
> process. But I don't see the atomic op that would make this true in
> uncore_handle_irq().

I thought it's a broadcast too in the v1 patches, let me double check
it.

>
> I also think that if you want all processors to receive the
> interrupts, then the mask should
> be 0xff when HT is on. The manual is rather obscure on this, but it
> does make sense.

Kernel panics if 0xff is set, but it maybe bugs of my code.

Anyway, is it told in some errata that the mask should be 0xff when HT
is on?

Thanks,
Lin Ming

>
>
> > + ctrl = ((1 << UNCORE_NUM_GENERAL_COUNTERS) - 1) | (0xFULL << 48);
> > + ctrl |= MSR_UNCORE_PERF_GLOBAL_CTRL_EN_FC0;
> > +
> > + /*
> > + * Freeze the uncore pmu on overflow of any uncore counter.
> > + * This makes unocre NMI handling easier.
> > + */
> > + ctrl |= MSR_UNCORE_PERF_GLOBAL_CTRL_PMI_FRZ;
> > +
> > + wrmsrl(MSR_UNCORE_PERF_GLOBAL_CTRL, ctrl);
> > +}
> > +

2010-11-24 09:54:00

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Mon, 2010-11-22 at 01:44 +0800, Peter Zijlstra wrote:
> On Sun, 2010-11-21 at 22:04 +0800, Lin Ming wrote:
> > On Sun, 2010-11-21 at 20:46 +0800, Andi Kleen wrote:
> > > >
> > > > 2. Uncore pmu NMI handling
> > > >
> > > > All the 4 cores are programmed to receive uncore counter overflow
> > > > interrupt. The NMI handler(running on 1 of the 4 cores) handle all
> > > > counters enabled by all 4 cores.
> > >
> > > Really for uncore monitoring there is no need to use an NMI handler.
> > > You can't profile a core anyways, so you can just delay the reporting
> > > a little bit. It may simplify the code to not use one here
> > > and just use an ordinary handler.
> >
> > OK, I can use on ordinary interrupt handler here.
>
> Does the hardware actually allow using a different interrupt source?
>
> > >
> > > In general since there is already much trouble with overloaded
> > > NMI events avoiding new NMIs is a good idea.
> > >
> > >
> > >
> > > > +
> > > > +static struct node_hw_events *uncore_events[MAX_NUMNODES];
> > >
> > > Don't declare static arrays with MAX_NUMNODES, that number can be
> > > very large and cause unnecessary bloat. Better use per CPU data or similar
> > > (e.g. with alloc_percpu)
> >
> > I really need is a per physical cpu data here, is alloc_percpu enough?
>
> Nah, simply manually allocate bits using kmalloc_node(), that's
> something I still need to fix in Andi's patches as well.

I'm writing this like AMD NB events allocation.

Thanks,
Lin Ming

>
> > > > + /*
> > > > + * The hw event starts counting from this event offset,
> > > > + * mark it to be able to extra future deltas:
> > > > + */
> > > > + local64_set(&hwc->prev_count, (u64)-left);
> > >
> > > Your use of local* seems dubious. That is only valid if it's really
> > > all on the same CPU. Is that really true?
> >
> > Good catch! That is not true.
> >
> > The interrupt handler is running on one core and the
> > data(hwc->prev_count) maybe on another core.
> >
> > Any idea to set this cross-core data?
>
> IIRC you can steer the uncore interrupts (it has a mask somewhere)
> simply steer everything to the first cpu in the nodemask?
>
>
>

2010-11-25 00:23:17

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Tue, 2010-11-23 at 18:00 +0800, Stephane Eranian wrote:
> On Sun, Nov 21, 2010 at 6:44 PM, Peter Zijlstra <[email protected]> wrote:
> > On Sun, 2010-11-21 at 22:04 +0800, Lin Ming wrote:
> >> On Sun, 2010-11-21 at 20:46 +0800, Andi Kleen wrote:
> >> > >
> >> > > 2. Uncore pmu NMI handling
> >> > >
> >> > > All the 4 cores are programmed to receive uncore counter overflow
> >> > > interrupt. The NMI handler(running on 1 of the 4 cores) handle all
> >> > > counters enabled by all 4 cores.
> >> >
> >> > Really for uncore monitoring there is no need to use an NMI handler.
> >> > You can't profile a core anyways, so you can just delay the reporting
> >> > a little bit. It may simplify the code to not use one here
> >> > and just use an ordinary handler.
> >>
> >> OK, I can use on ordinary interrupt handler here.
> >
> > Does the hardware actually allow using a different interrupt source?
> >
> It does not. It's using whatever you've programmed into the APIC
> LVT vector, AFAIK. Uncore interrupt mode is enabled via
> IA32_DEBUGCTL. Regarless of sampling or not, you need the interrupt
> to virtualize the counters to 64 bits.

If only counting(perf stat) makes sense for uncore events, do we still
need an interrupt handler?

48 bits counter is not that easy to overflow in practice.

2010-11-25 06:08:58

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Thu, 2010-11-25 at 08:24 +0800, Lin Ming wrote:
> On Tue, 2010-11-23 at 18:00 +0800, Stephane Eranian wrote:
> > On Sun, Nov 21, 2010 at 6:44 PM, Peter Zijlstra <[email protected]> wrote:
> > > On Sun, 2010-11-21 at 22:04 +0800, Lin Ming wrote:
> > >> On Sun, 2010-11-21 at 20:46 +0800, Andi Kleen wrote:
> > >> > >
> > >> > > 2. Uncore pmu NMI handling
> > >> > >
> > >> > > All the 4 cores are programmed to receive uncore counter overflow
> > >> > > interrupt. The NMI handler(running on 1 of the 4 cores) handle all
> > >> > > counters enabled by all 4 cores.
> > >> >
> > >> > Really for uncore monitoring there is no need to use an NMI handler.
> > >> > You can't profile a core anyways, so you can just delay the reporting
> > >> > a little bit. It may simplify the code to not use one here
> > >> > and just use an ordinary handler.
> > >>
> > >> OK, I can use on ordinary interrupt handler here.
> > >
> > > Does the hardware actually allow using a different interrupt source?
> > >
> > It does not. It's using whatever you've programmed into the APIC
> > LVT vector, AFAIK. Uncore interrupt mode is enabled via
> > IA32_DEBUGCTL. Regarless of sampling or not, you need the interrupt
> > to virtualize the counters to 64 bits.
>
> If only counting(perf stat) makes sense for uncore events, do we still
> need an interrupt handler?

Yep, I see no reason to dis-allow sampling. Sure its hard to make sense
of it, but since there are people who offline all but one cpu of a
package, I bet there are people who will run just one task on a package
as well.

Just because it doesn't make sense in general doesn't mean there isn't
anybody who'd want to do it and actually knows wth he's doing.

> 48 bits counter is not that easy to overflow in practice.

Still..

2010-11-25 06:26:30

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Thu, 2010-11-25 at 14:09 +0800, Peter Zijlstra wrote:
> On Thu, 2010-11-25 at 08:24 +0800, Lin Ming wrote:
> > On Tue, 2010-11-23 at 18:00 +0800, Stephane Eranian wrote:
> > > On Sun, Nov 21, 2010 at 6:44 PM, Peter Zijlstra <[email protected]> wrote:
> > > > On Sun, 2010-11-21 at 22:04 +0800, Lin Ming wrote:
> > > >> On Sun, 2010-11-21 at 20:46 +0800, Andi Kleen wrote:
> > > >> > >
> > > >> > > 2. Uncore pmu NMI handling
> > > >> > >
> > > >> > > All the 4 cores are programmed to receive uncore counter overflow
> > > >> > > interrupt. The NMI handler(running on 1 of the 4 cores) handle all
> > > >> > > counters enabled by all 4 cores.
> > > >> >
> > > >> > Really for uncore monitoring there is no need to use an NMI handler.
> > > >> > You can't profile a core anyways, so you can just delay the reporting
> > > >> > a little bit. It may simplify the code to not use one here
> > > >> > and just use an ordinary handler.
> > > >>
> > > >> OK, I can use on ordinary interrupt handler here.
> > > >
> > > > Does the hardware actually allow using a different interrupt source?
> > > >
> > > It does not. It's using whatever you've programmed into the APIC
> > > LVT vector, AFAIK. Uncore interrupt mode is enabled via
> > > IA32_DEBUGCTL. Regarless of sampling or not, you need the interrupt
> > > to virtualize the counters to 64 bits.
> >
> > If only counting(perf stat) makes sense for uncore events, do we still
> > need an interrupt handler?
>
> Yep, I see no reason to dis-allow sampling. Sure its hard to make sense
> of it, but since there are people who offline all but one cpu of a
> package, I bet there are people who will run just one task on a package
> as well.
>
> Just because it doesn't make sense in general doesn't mean there isn't
> anybody who'd want to do it and actually knows wth he's doing.
>
> > 48 bits counter is not that easy to overflow in practice.
>
> Still..

OK, will do more tests, then send out a new version.

Thanks.

2010-11-25 08:48:50

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Thu, Nov 25, 2010 at 7:09 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2010-11-25 at 08:24 +0800, Lin Ming wrote:
>> On Tue, 2010-11-23 at 18:00 +0800, Stephane Eranian wrote:
>> > On Sun, Nov 21, 2010 at 6:44 PM, Peter Zijlstra <[email protected]> wrote:
>> > > On Sun, 2010-11-21 at 22:04 +0800, Lin Ming wrote:
>> > >> On Sun, 2010-11-21 at 20:46 +0800, Andi Kleen wrote:
>> > >> > >
>> > >> > > 2. Uncore pmu NMI handling
>> > >> > >
>> > >> > > All the 4 cores are programmed to receive uncore counter overflow
>> > >> > > interrupt. The NMI handler(running on 1 of the 4 cores) handle all
>> > >> > > counters enabled by all 4 cores.
>> > >> >
>> > >> > Really for uncore monitoring there is no need to use an NMI handler.
>> > >> > You can't profile a core anyways, so you can just delay the reporting
>> > >> > a little bit. It may simplify the code to not use one here
>> > >> > and just use an ordinary handler.
>> > >>
>> > >> OK, I can use on ordinary interrupt handler here.
>> > >
>> > > Does the hardware actually allow using a different interrupt source?
>> > >
>> > It does not. It's using whatever you've programmed into the APIC
>> > LVT vector, AFAIK. Uncore interrupt mode is enabled via
>> > IA32_DEBUGCTL. Regarless of sampling or not, you need the interrupt
>> > to virtualize the counters to 64 bits.
>>
>> If only counting(perf stat) makes sense for uncore events, do we still
>> need an interrupt handler?
>
> Yep, I see no reason to dis-allow sampling. Sure its hard to make sense
> of it, but since there are people who offline all but one cpu of a
> package, I bet there are people who will run just one task on a package
> as well.
>
> Just because it doesn't make sense in general doesn't mean there isn't
> anybody who'd want to do it and actually knows wth he's doing.
>
>> 48 bits counter is not that easy to overflow in practice.
>
> Still..
>
Agreed.

2010-11-25 18:21:03

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

> Yep, I see no reason to dis-allow sampling. Sure its hard to make sense
> of it, but since there are people who offline all but one cpu of a
> package,

Assuming they don't have any active PCI devices either.

> I bet there are people who will run just one task on a package
> as well.

In that case the sampling has a 1/NUM-CPU-THREADS-IN-PACKAGE chance
to report the right task (or actually somewhat less because the measurement
skew for uncore is much higher than for normal events)

Really for per core measurements using the OFFCORE events is much better.

-Andi
--
[email protected] -- Speaking for myself only.

2010-11-25 21:10:09

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Thu, Nov 25, 2010 at 7:20 PM, Andi Kleen <[email protected]> wrote:
>> Yep, I see no reason to dis-allow sampling. Sure its hard to make sense
>> of it, but since there are people who offline all but one cpu of a
>> package,
>
> Assuming they don't have any active PCI devices either.
>
Good point.

>> I bet there are people who will run just one task on a package
>> as well.
>
> In that case the sampling has a 1/NUM-CPU-THREADS-IN-PACKAGE chance
> to report the right task (or actually somewhat less because the measurement
> skew for uncore is much higher than for normal events)
>
> Really for per core measurements using the OFFCORE events is much better.
>
yes, OFFCORE_RESPONSE is much more useful.

2010-11-26 05:15:19

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Tue, Nov 23, 2010 at 6:17 PM, Stephane Eranian <[email protected]> wrote:
> Lin,
>
> On Sun, Nov 21, 2010 at 1:01 PM, Lin Ming <[email protected]> wrote:
>> +static void uncore_pmu_enable_all(void)
>> +{
>> + ? ? ? u64 ctrl;
>> +
>> + ? ? ? /*
>> + ? ? ? ?* (0xFULL << 48): 1 of the 4 cores can receive NMI each time
>> + ? ? ? ?* but we don't know which core will receive the NMI when overflow happens
>> + ? ? ? ?*/
>
> That does not sound right. If you set bit 48-51 to 1, then all 4 cores
> will receive EVERY
> interrupt, i.e., it's a broadcast. That seems to contradict your
> comment: 1 of the 4. Unless
> you meant, they all get the interrupt and one will handle it, the
> other will find nothing to
> process. But I don't see the atomic op that would make this true in
> uncore_handle_irq().

Stephane,

The interrupt model is strange, it behaves differently when HT on/off.

If HT is off, all 4 cores will receive every interrupt, i.e., it's a broadcast.

If HT is on, only 1 of the 4 cores will receive the interrupt(both
Threads in that core receive the interrupt),
and it can't be determined which core will receive the interrupt.

Did you ever observe this?

>
> I also think that if you want all processors to receive the
> interrupts, then the mask should
> be 0xff when HT is on. The manual is rather obscure on this, but it
> does make sense.

I tried to set the mask 0xff when HT is on, but kernel panics, because
the reserve bits are set.

Thanks,
Lin Ming

>
>
>> + ? ? ? ctrl = ((1 << UNCORE_NUM_GENERAL_COUNTERS) - 1) | (0xFULL << 48);
>> + ? ? ? ctrl |= MSR_UNCORE_PERF_GLOBAL_CTRL_EN_FC0;
>> +
>> + ? ? ? /*
>> + ? ? ? ?* Freeze the uncore pmu on overflow of any uncore counter.
>> + ? ? ? ?* This makes unocre NMI handling easier.
>> + ? ? ? ?*/
>> + ? ? ? ctrl |= MSR_UNCORE_PERF_GLOBAL_CTRL_PMI_FRZ;
>> +
>> + ? ? ? wrmsrl(MSR_UNCORE_PERF_GLOBAL_CTRL, ctrl);
>> +}
>> +
> --

2010-11-26 08:18:09

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Fri, Nov 26, 2010 at 6:15 AM, Lin Ming <[email protected]> wrote:
> On Tue, Nov 23, 2010 at 6:17 PM, Stephane Eranian <[email protected]> wrote:
>> Lin,
>>
>> On Sun, Nov 21, 2010 at 1:01 PM, Lin Ming <[email protected]> wrote:
>>> +static void uncore_pmu_enable_all(void)
>>> +{
>>> + u64 ctrl;
>>> +
>>> + /*
>>> + * (0xFULL << 48): 1 of the 4 cores can receive NMI each time
>>> + * but we don't know which core will receive the NMI when overflow happens
>>> + */
>>
>> That does not sound right. If you set bit 48-51 to 1, then all 4 cores
>> will receive EVERY
>> interrupt, i.e., it's a broadcast. That seems to contradict your
>> comment: 1 of the 4. Unless
>> you meant, they all get the interrupt and one will handle it, the
>> other will find nothing to
>> process. But I don't see the atomic op that would make this true in
>> uncore_handle_irq().
>
> Stephane,
>
> The interrupt model is strange, it behaves differently when HT on/off.
>
> If HT is off, all 4 cores will receive every interrupt, i.e., it's a broadcast.
>
That's if yo set the mask to 0xf, right?

In the perf_event model, given that any one of the 4 cores can be used
to program uncore events, you have no choice but to broadcast to all
4 cores. Each has to demultiplex and figure out which of its counters
have overflowed.

> If HT is on, only 1 of the 4 cores will receive the interrupt(both
> Threads in that core receive the interrupt),
> and it can't be determined which core will receive the interrupt.
>
> Did you ever observe this?
>
No because I never set more than one bit in the mask.

> I tried to set the mask 0xff when HT is on, but kernel panics, because
> the reserve bits are set.

Let me check on this. It would seem to imply that in HT mode, both threads
necessarily receive the interrupts.

Was that on Nehalem or Westmere?

2010-11-26 08:29:44

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Fri, Nov 26, 2010 at 4:18 PM, Stephane Eranian <[email protected]> wrote:
> On Fri, Nov 26, 2010 at 6:15 AM, Lin Ming <[email protected]> wrote:
>> On Tue, Nov 23, 2010 at 6:17 PM, Stephane Eranian <[email protected]> wrote:
>>> Lin,
>>>
>>> On Sun, Nov 21, 2010 at 1:01 PM, Lin Ming <[email protected]> wrote:
>>>> +static void uncore_pmu_enable_all(void)
>>>> +{
>>>> + ? ? ? u64 ctrl;
>>>> +
>>>> + ? ? ? /*
>>>> + ? ? ? ?* (0xFULL << 48): 1 of the 4 cores can receive NMI each time
>>>> + ? ? ? ?* but we don't know which core will receive the NMI when overflow happens
>>>> + ? ? ? ?*/
>>>
>>> That does not sound right. If you set bit 48-51 to 1, then all 4 cores
>>> will receive EVERY
>>> interrupt, i.e., it's a broadcast. That seems to contradict your
>>> comment: 1 of the 4. Unless
>>> you meant, they all get the interrupt and one will handle it, the
>>> other will find nothing to
>>> process. But I don't see the atomic op that would make this true in
>>> uncore_handle_irq().
>>
>> Stephane,
>>
>> The interrupt model is strange, it behaves differently when HT on/off.
>>
>> If HT is off, all 4 cores will receive every interrupt, i.e., it's a broadcast.
>>
> That's if yo set the mask to 0xf, right?

Right.

>
> In the perf_event model, given that any one of the 4 cores can be used
> to program uncore events, you have no choice but to broadcast to all
> 4 cores. Each has to demultiplex and figure out which of its counters
> have overflowed.

This is what my upcoming v3 patches doing.

>
>> If HT is on, only 1 of the 4 cores will receive the interrupt(both
>> Threads in that core receive the interrupt),
>> and it can't be determined which core will receive the interrupt.
>>
>> Did you ever observe this?
>>
> No because I never set more than one bit in the mask.
>
>> I tried to set the mask 0xff when HT is on, but kernel panics, because
>> the reserve bits are set.
>
> Let me check on this. It would seem to imply that in HT mode, both threads
> necessarily receive the interrupts.
>
> Was that on Nehalem or Westmere?

Nehalem.

Thanks,
Lin Ming

2010-11-26 08:33:46

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

Lin,

Looked at the perfmon code, and it seems the mask is actual
cores, not threads:
rdmsrl(MSR_NHM_UNC_GLOBAL_CTRL, val);
val |= 1ULL << (48 + cpu_data(smp_processor_id()).cpu_core_id);
wrmsrl(MSR_NHM_UNC_GLOBAL_CTRL, val);

That seems to imply both threads will get the interrupt.

In the the overflowed event was programmed from on of the two threads, that
means one will process the overflow, the other will get spurious.

On the cores where no uncore was programmed, then both threads will have
a spurious interrupt.

That brings up back to the 'spurious interrupt' issue and the 'NMI
Dazed' message
that Don tried to eliminate. Now we have a new situation where we will
get interrupt
with no work to do, so the perf_event will pass the interrupt onto the
next subsystem
and eventually we will get the 'dazed' message. I am just guessing here....

On Fri, Nov 26, 2010 at 9:18 AM, Stephane Eranian <[email protected]> wrote:
> On Fri, Nov 26, 2010 at 6:15 AM, Lin Ming <[email protected]> wrote:
>> On Tue, Nov 23, 2010 at 6:17 PM, Stephane Eranian <[email protected]> wrote:
>>> Lin,
>>>
>>> On Sun, Nov 21, 2010 at 1:01 PM, Lin Ming <[email protected]> wrote:
>>>> +static void uncore_pmu_enable_all(void)
>>>> +{
>>>> + u64 ctrl;
>>>> +
>>>> + /*
>>>> + * (0xFULL << 48): 1 of the 4 cores can receive NMI each time
>>>> + * but we don't know which core will receive the NMI when overflow happens
>>>> + */
>>>
>>> That does not sound right. If you set bit 48-51 to 1, then all 4 cores
>>> will receive EVERY
>>> interrupt, i.e., it's a broadcast. That seems to contradict your
>>> comment: 1 of the 4. Unless
>>> you meant, they all get the interrupt and one will handle it, the
>>> other will find nothing to
>>> process. But I don't see the atomic op that would make this true in
>>> uncore_handle_irq().
>>
>> Stephane,
>>
>> The interrupt model is strange, it behaves differently when HT on/off.
>>
>> If HT is off, all 4 cores will receive every interrupt, i.e., it's a broadcast.
>>
> That's if yo set the mask to 0xf, right?
>
> In the perf_event model, given that any one of the 4 cores can be used
> to program uncore events, you have no choice but to broadcast to all
> 4 cores. Each has to demultiplex and figure out which of its counters
> have overflowed.
>
>> If HT is on, only 1 of the 4 cores will receive the interrupt(both
>> Threads in that core receive the interrupt),
>> and it can't be determined which core will receive the interrupt.
>>
>> Did you ever observe this?
>>
> No because I never set more than one bit in the mask.
>
>> I tried to set the mask 0xff when HT is on, but kernel panics, because
>> the reserve bits are set.
>
> Let me check on this. It would seem to imply that in HT mode, both threads
> necessarily receive the interrupts.
>
> Was that on Nehalem or Westmere?
>

2010-11-26 09:00:13

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Fri, Nov 26, 2010 at 4:33 PM, Stephane Eranian <[email protected]> wrote:
> Lin,
>
> Looked at the perfmon code, and it seems the mask is actual
> cores, not threads:
> ? ? ? ? ? ? ? ?rdmsrl(MSR_NHM_UNC_GLOBAL_CTRL, val);
> ? ? ? ? ? ? ? ?val |= 1ULL << (48 + cpu_data(smp_processor_id()).cpu_core_id);
> ? ? ? ? ? ? ? ?wrmsrl(MSR_NHM_UNC_GLOBAL_CTRL, val);
>
> That seems to imply both threads will get the interrupt.
>
> In the the overflowed event was programmed from on of the two threads, that
> means one will process the overflow, the other will get spurious.
>
> On the cores where no uncore was programmed, then both threads will have
> a spurious interrupt.

But in my test, if HT is on, only the 2 theads in one of the four cores
will receive the interrupt. Even worse, we don't know which core will
receive the interrupt
when overflow happens.

I'll do more tests to verify this.

>
> That brings up back to the 'spurious interrupt' issue and the 'NMI
> Dazed' message
> that Don tried to eliminate. Now we have a new situation where we will
> get interrupt
> with no work to do, so the perf_event will pass the interrupt onto the
> next subsystem
> and eventually we will get the 'dazed' message. I am just guessing here....

Add Don.

Thanks,
Lin Ming

>
>
> On Fri, Nov 26, 2010 at 9:18 AM, Stephane Eranian <[email protected]> wrote:
>> On Fri, Nov 26, 2010 at 6:15 AM, Lin Ming <[email protected]> wrote:
>>> On Tue, Nov 23, 2010 at 6:17 PM, Stephane Eranian <[email protected]> wrote:
>>>> Lin,
>>>>
>>>> On Sun, Nov 21, 2010 at 1:01 PM, Lin Ming <[email protected]> wrote:
>>>>> +static void uncore_pmu_enable_all(void)
>>>>> +{
>>>>> + ? ? ? u64 ctrl;
>>>>> +
>>>>> + ? ? ? /*
>>>>> + ? ? ? ?* (0xFULL << 48): 1 of the 4 cores can receive NMI each time
>>>>> + ? ? ? ?* but we don't know which core will receive the NMI when overflow happens
>>>>> + ? ? ? ?*/
>>>>
>>>> That does not sound right. If you set bit 48-51 to 1, then all 4 cores
>>>> will receive EVERY
>>>> interrupt, i.e., it's a broadcast. That seems to contradict your
>>>> comment: 1 of the 4. Unless
>>>> you meant, they all get the interrupt and one will handle it, the
>>>> other will find nothing to
>>>> process. But I don't see the atomic op that would make this true in
>>>> uncore_handle_irq().
>>>
>>> Stephane,
>>>
>>> The interrupt model is strange, it behaves differently when HT on/off.
>>>
>>> If HT is off, all 4 cores will receive every interrupt, i.e., it's a broadcast.
>>>
>> That's if yo set the mask to 0xf, right?
>>
>> In the perf_event model, given that any one of the 4 cores can be used
>> to program uncore events, you have no choice but to broadcast to all
>> 4 cores. Each has to demultiplex and figure out which of its counters
>> have overflowed.
>>
>>> If HT is on, only 1 of the 4 cores will receive the interrupt(both
>>> Threads in that core receive the interrupt),
>>> and it can't be determined which core will receive the interrupt.
>>>
>>> Did you ever observe this?
>>>
>> No because I never set more than one bit in the mask.
>>
>>> I tried to set the mask 0xff when HT is on, but kernel panics, because
>>> the reserve bits are set.
>>
>> Let me check on this. It would seem to imply that in HT mode, both threads
>> necessarily receive the interrupts.
>>
>> Was that on Nehalem or Westmere?
>>
>

2010-11-26 10:06:48

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Fri, Nov 26, 2010 at 10:00 AM, Lin Ming <[email protected]> wrote:
> On Fri, Nov 26, 2010 at 4:33 PM, Stephane Eranian <[email protected]> wrote:
>> Lin,
>>
>> Looked at the perfmon code, and it seems the mask is actual
>> cores, not threads:
>> rdmsrl(MSR_NHM_UNC_GLOBAL_CTRL, val);
>> val |= 1ULL << (48 + cpu_data(smp_processor_id()).cpu_core_id);
>> wrmsrl(MSR_NHM_UNC_GLOBAL_CTRL, val);
>>
>> That seems to imply both threads will get the interrupt.
>>
>> In the the overflowed event was programmed from on of the two threads, that
>> means one will process the overflow, the other will get spurious.
>>
>> On the cores where no uncore was programmed, then both threads will have
>> a spurious interrupt.
>
> But in my test, if HT is on, only the 2 theads in one of the four cores
> will receive the interrupt. Even worse, we don't know which core will
> receive the interrupt
> when overflow happens.
>
The MSR_NHM_UNC_GLOBAL_CTRL is per socket not per core.

> I'll do more tests to verify this.

In your tests, are your programming the same uncore event
across all CPUs? If so then you may have a race condition
setting the MSR because it read-modify-write.

What about you program only one uncore event from one CPU?

>
>>
>> That brings up back to the 'spurious interrupt' issue and the 'NMI
>> Dazed' message
>> that Don tried to eliminate. Now we have a new situation where we will
>> get interrupt
>> with no work to do, so the perf_event will pass the interrupt onto the
>> next subsystem
>> and eventually we will get the 'dazed' message. I am just guessing here....
>
> Add Don.
>
> Thanks,
> Lin Ming
>
>>
>>
>> On Fri, Nov 26, 2010 at 9:18 AM, Stephane Eranian <[email protected]> wrote:
>>> On Fri, Nov 26, 2010 at 6:15 AM, Lin Ming <[email protected]> wrote:
>>>> On Tue, Nov 23, 2010 at 6:17 PM, Stephane Eranian <[email protected]> wrote:
>>>>> Lin,
>>>>>
>>>>> On Sun, Nov 21, 2010 at 1:01 PM, Lin Ming <[email protected]> wrote:
>>>>>> +static void uncore_pmu_enable_all(void)
>>>>>> +{
>>>>>> + u64 ctrl;
>>>>>> +
>>>>>> + /*
>>>>>> + * (0xFULL << 48): 1 of the 4 cores can receive NMI each time
>>>>>> + * but we don't know which core will receive the NMI when overflow happens
>>>>>> + */
>>>>>
>>>>> That does not sound right. If you set bit 48-51 to 1, then all 4 cores
>>>>> will receive EVERY
>>>>> interrupt, i.e., it's a broadcast. That seems to contradict your
>>>>> comment: 1 of the 4. Unless
>>>>> you meant, they all get the interrupt and one will handle it, the
>>>>> other will find nothing to
>>>>> process. But I don't see the atomic op that would make this true in
>>>>> uncore_handle_irq().
>>>>
>>>> Stephane,
>>>>
>>>> The interrupt model is strange, it behaves differently when HT on/off.
>>>>
>>>> If HT is off, all 4 cores will receive every interrupt, i.e., it's a broadcast.
>>>>
>>> That's if yo set the mask to 0xf, right?
>>>
>>> In the perf_event model, given that any one of the 4 cores can be used
>>> to program uncore events, you have no choice but to broadcast to all
>>> 4 cores. Each has to demultiplex and figure out which of its counters
>>> have overflowed.
>>>
>>>> If HT is on, only 1 of the 4 cores will receive the interrupt(both
>>>> Threads in that core receive the interrupt),
>>>> and it can't be determined which core will receive the interrupt.
>>>>
>>>> Did you ever observe this?
>>>>
>>> No because I never set more than one bit in the mask.
>>>
>>>> I tried to set the mask 0xff when HT is on, but kernel panics, because
>>>> the reserve bits are set.
>>>
>>> Let me check on this. It would seem to imply that in HT mode, both threads
>>> necessarily receive the interrupts.
>>>
>>> Was that on Nehalem or Westmere?
>>>
>>
>

2010-11-26 11:24:08

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Fri, 2010-11-26 at 09:18 +0100, Stephane Eranian wrote:

> In the perf_event model, given that any one of the 4 cores can be used
> to program uncore events, you have no choice but to broadcast to all
> 4 cores. Each has to demultiplex and figure out which of its counters
> have overflowed.

Not really, you can redirect all these events to the first online cpu of
the node.

You can re-write event->cpu in pmu::event_init(), and register cpu
hotplug notifiers to migrate the state around.

2010-11-26 11:25:56

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Fri, Nov 26, 2010 at 12:24 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2010-11-26 at 09:18 +0100, Stephane Eranian wrote:
>
>> In the perf_event model, given that any one of the 4 cores can be used
>> to program uncore events, you have no choice but to broadcast to all
>> 4 cores. Each has to demultiplex and figure out which of its counters
>> have overflowed.
>
> Not really, you can redirect all these events to the first online cpu of
> the node.
>
> You can re-write event->cpu in pmu::event_init(), and register cpu
> hotplug notifiers to migrate the state around.
>
I am sure you could. But then the user thinks the event is controlled
from CPUx when it's actually from CPUz. I am sure it can work but
that's confusing, especially interrupt-wise.

2010-11-26 11:36:51

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Fri, 2010-11-26 at 12:25 +0100, Stephane Eranian wrote:
> On Fri, Nov 26, 2010 at 12:24 PM, Peter Zijlstra <[email protected]> wrote:
> > On Fri, 2010-11-26 at 09:18 +0100, Stephane Eranian wrote:
> >
> >> In the perf_event model, given that any one of the 4 cores can be used
> >> to program uncore events, you have no choice but to broadcast to all
> >> 4 cores. Each has to demultiplex and figure out which of its counters
> >> have overflowed.
> >
> > Not really, you can redirect all these events to the first online cpu of
> > the node.
> >
> > You can re-write event->cpu in pmu::event_init(), and register cpu
> > hotplug notifiers to migrate the state around.
> >
> I am sure you could. But then the user thinks the event is controlled
> from CPUx when it's actually from CPUz. I am sure it can work but
> that's confusing, especially interrupt-wise.

Well, its either that or keeping a node wide state like we do for AMD
and serialize everything from there.

And I'm not sure what's most expensive, steering the interrupt to one
core only, or broadcasting every interrupt, I'd favour the first
approach.

The whole thing is a node-wide resource, so the user needs to think in
nodes anyway, we already do a cpu->node mapping for identifying the
thing.

2010-11-26 11:41:37

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Fri, Nov 26, 2010 at 12:36 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2010-11-26 at 12:25 +0100, Stephane Eranian wrote:
>> On Fri, Nov 26, 2010 at 12:24 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Fri, 2010-11-26 at 09:18 +0100, Stephane Eranian wrote:
>> >
>> >> In the perf_event model, given that any one of the 4 cores can be used
>> >> to program uncore events, you have no choice but to broadcast to all
>> >> 4 cores. Each has to demultiplex and figure out which of its counters
>> >> have overflowed.
>> >
>> > Not really, you can redirect all these events to the first online cpu of
>> > the node.
>> >
>> > You can re-write event->cpu in pmu::event_init(), and register cpu
>> > hotplug notifiers to migrate the state around.
>> >
>> I am sure you could. But then the user thinks the event is controlled
>> from CPUx when it's actually from CPUz. I am sure it can work but
>> that's confusing, especially interrupt-wise.
>
> Well, its either that or keeping a node wide state like we do for AMD
> and serialize everything from there.
>
> And I'm not sure what's most expensive, steering the interrupt to one
> core only, or broadcasting every interrupt, I'd favour the first
> approach.

I think the one core-only approach will limit the spurious interrupt aspect.
In perfmon, that's how I had it setup. The first CPU where uncore is
accessed owns the uncore PMU for the socket, thus all interrupts are
routed there. What you are proposing is the same. Now you can chose
you hardcode which is the default core to handle this, or (better) you
use the first core that accesses uncore.

>
> The whole thing is a node-wide resource, so the user needs to think in
> nodes anyway, we already do a cpu->node mapping for identifying the
> thing.
>
Agreed.

2010-11-26 16:25:43

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Fri, Nov 26, 2010 at 7:41 PM, Stephane Eranian <[email protected]> wrote:
> On Fri, Nov 26, 2010 at 12:36 PM, Peter Zijlstra <[email protected]> wrote:
>> On Fri, 2010-11-26 at 12:25 +0100, Stephane Eranian wrote:
>>> On Fri, Nov 26, 2010 at 12:24 PM, Peter Zijlstra <[email protected]> wrote:
>>> > On Fri, 2010-11-26 at 09:18 +0100, Stephane Eranian wrote:
>>> >
>>> >> In the perf_event model, given that any one of the 4 cores can be used
>>> >> to program uncore events, you have no choice but to broadcast to all
>>> >> 4 cores. Each has to demultiplex and figure out which of its counters
>>> >> have overflowed.
>>> >
>>> > Not really, you can redirect all these events to the first online cpu of
>>> > the node.
>>> >
>>> > You can re-write event->cpu in pmu::event_init(), and register cpu
>>> > hotplug notifiers to migrate the state around.
>>> >
>>> I am sure you could. But then the user thinks the event is controlled
>>> from CPUx when it's actually from CPUz. I am sure it can work but
>>> that's confusing, especially interrupt-wise.
>>
>> Well, its either that or keeping a node wide state like we do for AMD
>> and serialize everything from there.
>>
>> And I'm not sure what's most expensive, steering the interrupt to one
>> core only, or broadcasting every interrupt, I'd favour the first
>> approach.
>
> I think the one core-only approach will limit the spurious interrupt aspect.
> In perfmon, that's how I had it setup. The first CPU where uncore is
> accessed owns the uncore PMU for the socket, thus all interrupts are
> routed there. What you are proposing is the same. Now you can chose
> you hardcode which is the default core to handle this, or (better) you
> use the first core that accesses uncore.
>
>>
>> The whole thing is a node-wide resource, so the user needs to think in
>> nodes anyway, we already do a cpu->node mapping for identifying the
>> thing.
>>
> Agreed.
>

Hi, all

Thanks for all the comments.
I'm on travel Nov 27 to Nov 30.

I'll address the comments when I'm back.

Thanks,
Lin Ming

2010-12-01 03:19:27

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Fri, 2010-11-26 at 18:06 +0800, Stephane Eranian wrote:
> On Fri, Nov 26, 2010 at 10:00 AM, Lin Ming <[email protected]> wrote:
> > On Fri, Nov 26, 2010 at 4:33 PM, Stephane Eranian <[email protected]> wrote:
> >> Lin,
> >>
> >> Looked at the perfmon code, and it seems the mask is actual
> >> cores, not threads:
> >> rdmsrl(MSR_NHM_UNC_GLOBAL_CTRL, val);
> >> val |= 1ULL << (48 + cpu_data(smp_processor_id()).cpu_core_id);
> >> wrmsrl(MSR_NHM_UNC_GLOBAL_CTRL, val);
> >>
> >> That seems to imply both threads will get the interrupt.
> >>
> >> In the the overflowed event was programmed from on of the two threads, that
> >> means one will process the overflow, the other will get spurious.
> >>
> >> On the cores where no uncore was programmed, then both threads will have
> >> a spurious interrupt.
> >
> > But in my test, if HT is on, only the 2 theads in one of the four cores
> > will receive the interrupt. Even worse, we don't know which core will
> > receive the interrupt
> > when overflow happens.
> >
> The MSR_NHM_UNC_GLOBAL_CTRL is per socket not per core.

Understood.

>
> > I'll do more tests to verify this.
>
> In your tests, are your programming the same uncore event
> across all CPUs? If so then you may have a race condition
> setting the MSR because it read-modify-write.
>
> What about you program only one uncore event from one CPU?

This is what I tested, programming only one uncore event from one CPU.
When HT is off, all four cores in the socket receive the interrupt.
When HT is on, only the 2 threads in one of the four cores receive the
interrupt.

2010-12-01 03:26:18

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Fri, 2010-11-26 at 19:36 +0800, Peter Zijlstra wrote:
> On Fri, 2010-11-26 at 12:25 +0100, Stephane Eranian wrote:
> > On Fri, Nov 26, 2010 at 12:24 PM, Peter Zijlstra <[email protected]> wrote:
> > > On Fri, 2010-11-26 at 09:18 +0100, Stephane Eranian wrote:
> > >
> > >> In the perf_event model, given that any one of the 4 cores can be used
> > >> to program uncore events, you have no choice but to broadcast to all
> > >> 4 cores. Each has to demultiplex and figure out which of its counters
> > >> have overflowed.
> > >
> > > Not really, you can redirect all these events to the first online cpu of
> > > the node.
> > >
> > > You can re-write event->cpu in pmu::event_init(), and register cpu
> > > hotplug notifiers to migrate the state around.
> > >
> > I am sure you could. But then the user thinks the event is controlled
> > from CPUx when it's actually from CPUz. I am sure it can work but
> > that's confusing, especially interrupt-wise.
>
> Well, its either that or keeping a node wide state like we do for AMD
> and serialize everything from there.
>
> And I'm not sure what's most expensive, steering the interrupt to one
> core only, or broadcasting every interrupt, I'd favour the first
> approach.
>
> The whole thing is a node-wide resource, so the user needs to think in
> nodes anyway, we already do a cpu->node mapping for identifying the
> thing.

How about a new sub-command for node-wide events statistics?

perf node -n <node> -e <event>?

2010-12-01 11:37:52

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Wed, 2010-12-01 at 11:28 +0800, Lin Ming wrote:

> How about a new sub-command for node-wide events statistics?
>
> perf node -n <node> -e <event>?

Maybe as a very slim wrapper around perf stat, but personally I wouldn't
care.

2010-12-01 13:04:49

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Wed, Dec 1, 2010 at 4:21 AM, Lin Ming <[email protected]> wrote:
>
> On Fri, 2010-11-26 at 18:06 +0800, Stephane Eranian wrote:
> > On Fri, Nov 26, 2010 at 10:00 AM, Lin Ming <[email protected]> wrote:
> > > On Fri, Nov 26, 2010 at 4:33 PM, Stephane Eranian <[email protected]> wrote:
> > >> Lin,
> > >>
> > >> Looked at the perfmon code, and it seems the mask is actual
> > >> cores, not threads:
> > >> rdmsrl(MSR_NHM_UNC_GLOBAL_CTRL, val);
> > >> val |= 1ULL << (48 + cpu_data(smp_processor_id()).cpu_core_id);
> > >> wrmsrl(MSR_NHM_UNC_GLOBAL_CTRL, val);
> > >>
> > >> That seems to imply both threads will get the interrupt.
> > >>
> > >> In the the overflowed event was programmed from on of the two threads, that
> > >> means one will process the overflow, the other will get spurious.
> > >>
> > >> On the cores where no uncore was programmed, then both threads will have
> > >> a spurious interrupt.
> > >
> > > But in my test, if HT is on, only the 2 theads in one of the four cores
> > > will receive the interrupt. Even worse, we don't know which core will
> > > receive the interrupt
> > > when overflow happens.
> > >
> > The MSR_NHM_UNC_GLOBAL_CTRL is per socket not per core.
>
> Understood.
>
> >
> > > I'll do more tests to verify this.
> >
> > In your tests, are your programming the same uncore event
> > across all CPUs? If so then you may have a race condition
> > setting the MSR because it read-modify-write.
> >
> > What about you program only one uncore event from one CPU?
>
> This is what I tested, programming only one uncore event from one CPU.

> When HT is off, all four cores in the socket receive the interrupt.

If the value of the MSR is 0xf << 48?

> When HT is on, only the 2 threads in one of the four cores receive the
> interrupt.
Something is not right here. Next week, I may be able to run some tests
on a Nehalem using perfmon to compare. Could you also send me your
latest uncore patch against tip-x86?
Thanks.

2010-12-01 14:08:22

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

> How about a new sub-command for node-wide events statistics?
>
> perf node -n <node> -e <event>?

Seems like the best option to me (and not allowing the uncore events
for the other commands)

But I don't like "node" because in a non NUMA kernel you won't have nodes,
but this is still useful. Or with NUMA emulation you may have nodes
that doesn't match the sockets.

Maybe perf package or perf socket ?

-Andi
--
[email protected] -- Speaking for myself only.

2010-12-01 14:18:44

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3 v2] perf: Implement Nehalem uncore pmu

On Wed, 2010-12-01 at 15:08 +0100, Andi Kleen wrote:
> Seems like the best option to me (and not allowing the uncore events
> for the other commands)
>
Andi, I don't care what you think, that's simply not going to happen.

2010-12-02 05:24:03