2007-11-07 00:45:25

by Greg KH

Subject: [PATCH] fix up perfmon to build on -mm

Here's a patch against my current tree that gets the perfmon code
building and hopefully working.

Note, it needs the kobject_create_and_register() patch which is in my
tree, but I do not think it made it to -mm yet. The next -mm cycle
should have it.

Also, the sysfs usage in the perfmon code is quite strange and not
documented at all. Yes, there is a little bit in the documentation
about what a few of the files do, but there are _way_ more files and
even directories being created under /sys/kernel/perfmon/ that are not
documented at all here.

If you document this stuff, I think I can clean up your sysfs code a
lot, making things simpler, easier to extend, and easier to understand.
But as it is, I don't want to break anything as it's totally unknown how
this stuff is supposed to work...

Hint: use the Documentation/ABI directory to document your sysfs
interfaces; that is what it is there for...
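
(For reference, ABI entries are short plain-text stanzas with What:, Date:,
KernelVersion:, Contact: and Description: fields. A minimal sketch, with the
perfmon file name, version and description chosen purely for illustration:)

What:		/sys/kernel/perfmon/version
Date:		November 2007
KernelVersion:	2.6.24
Contact:	[email protected]
Description:
		Read-only. Interface version of the perfmon2 subsystem,
		reported as "major.minor".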

thanks,

greg k-h

---------------
From: Greg Kroah-Hartman <[email protected]>
Subject: perfmon: fix up some static kobject usages

This gets the perfmon code to build properly on the latest -mm tree, as
well as removing some static kobjects.

A lot of future kobject cleanups can be done on this code, but the
documentation for the perfmon sysfs interface is very limited and does
not describe all of the different files and subdirectories at all.

Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
perfmon/perfmon_sysfs.c | 37 +++++++++++++++----------------------
1 file changed, 15 insertions(+), 22 deletions(-)

--- a/perfmon/perfmon_sysfs.c
+++ b/perfmon/perfmon_sysfs.c
@@ -76,7 +76,8 @@ EXPORT_SYMBOL(pfm_controls);

DECLARE_PER_CPU(struct pfm_stats, pfm_stats);

-static struct kobject pfm_kernel_kobj, pfm_kernel_fmt_kobj;
+static struct kobject *pfm_kernel_kobj;
+static struct kobject *pfm_kernel_fmt_kobj;

static void pfm_reset_stats(int cpu)
{
@@ -402,31 +403,23 @@ static struct attribute_group pfm_kernel

int __init pfm_init_sysfs(void)
{
- int ret;
+ int ret = -ENOMEM;
int i, cpu = -1;

- kobject_init(&pfm_kernel_kobj);
- kobject_init(&pfm_kernel_fmt_kobj);
-
- pfm_kernel_kobj.parent = &kernel_subsys.kobj;
- kobject_set_name(&pfm_kernel_kobj, "perfmon");
-
- pfm_kernel_fmt_kobj.parent = &pfm_kernel_kobj;
- kobject_set_name(&pfm_kernel_fmt_kobj, "formats");
-
- ret = kobject_add(&pfm_kernel_kobj);
- if (ret) {
- PFM_INFO("cannot add kernel object: %d", ret);
+ pfm_kernel_kobj = kobject_create_and_register("perfmon", kernel_kobj);
+ if (!pfm_kernel_kobj) {
+ PFM_INFO("cannot create perfmon kernel object");
goto error;
}

- ret = kobject_add(&pfm_kernel_fmt_kobj);
- if (ret) {
- PFM_INFO("cannot add fmt object: %d", ret);
+ pfm_kernel_fmt_kobj = kobject_create_and_register("formats",
+ pfm_kernel_kobj);
+ if (!pfm_kernel_fmt_kobj) {
+ PFM_INFO("cannot add fmt object");
goto error_fmt;
}

- ret = sysfs_create_group(&pfm_kernel_kobj, &pfm_kernel_attr_group);
+ ret = sysfs_create_group(pfm_kernel_kobj, &pfm_kernel_attr_group);
if (ret) {
PFM_INFO("cannot create kernel group");
goto error_group;
@@ -449,9 +442,9 @@ int __init pfm_init_sysfs(void)
return 0;

error_group:
- kobject_del(&pfm_kernel_fmt_kobj);
+ kobject_unregister(pfm_kernel_fmt_kobj);
error_fmt:
- kobject_del(&pfm_kernel_kobj);
+ kobject_unregister(pfm_kernel_kobj);

for (i=0; i < cpu; i++)
pfm_sysfs_del_cpu(i);
@@ -683,7 +676,7 @@ int pfm_sysfs_add_fmt(struct pfm_smpl_fm

kobject_set_name(&fmt->kobj, fmt->fmt_name);
//kobj_set_kset_s(fmt, pfm_fmt_subsys);
- fmt->kobj.parent = &pfm_kernel_fmt_kobj;
+ fmt->kobj.parent = pfm_kernel_fmt_kobj;

ret = kobject_add(&fmt->kobj);
if (ret)
@@ -861,7 +854,7 @@ int pfm_sysfs_add_pmu(struct pfm_pmu_con
kobject_init(&pmu->kobj);
kobject_set_name(&pmu->kobj, "pmu_desc");
//kobj_set_kset_s(pmu, pfm_pmu_subsys);
- pmu->kobj.parent = &pfm_kernel_kobj;
+ pmu->kobj.parent = pfm_kernel_kobj;

ret = kobject_add(&pmu->kobj);
if (ret)


2007-11-07 10:35:20

by Stephane Eranian

Subject: Re: [PATCH] fix up perfmon to build on -mm

Greg,

On Tue, Nov 06, 2007 at 04:34:54PM -0800, Greg KH wrote:
> Here's a patch against my current tree that gets the perfmon code
> building and hopefully working.
>
Thanks for your quick help.

> Note, it needs the kobject_create_and_register() patch which is in my
> tree, but I do not think it made it to -mm yet. The next -mm cycle
> should have it.
>
> Also, the sysfs usage in the perfmon code is quite strange and not
> documented at all. Yes, there is a little bit in the documentation
> about what a few of the files do, but there are _way_ more files and
> even directories being created under /sys/kernel/perfmon/ that are not
> documented at all here.
>
The full documentation for /sys/kernel/perfmon is in Documentation/perfmon2.txt

> If you document this stuff, I think I can clean up your sysfs code a
> lot, making things simpler, easier to extend, and easier to understand.
> But as it is, I don't want to break anything as it's totally unknown how
> this stuff is supposed to work...
>
I certainly welcome your help.

> Hint, use the Documentation/ABI directory to document your sysfs
> interfaces, that is what it is there for...
>
I will move the description from perfmon2.txt to its own file in
ABI/testing.

--
-Stephane

2007-11-07 13:43:19

by Stephane Eranian

Subject: Re: [PATCH] fix up perfmon to build on -mm

Greg,

The perfmon sysfs documentation has been updated following your advice.
You can check out the following commit in my perfmon tree:

e83278f879e52ecee025effe9ad509fd51e4a516

Thanks.

On Tue, Nov 06, 2007 at 04:34:54PM -0800, Greg KH wrote:
> Here's a patch against my current tree that gets the perfmon code
> building and hopefully working.
>
> Note, it needs the kobject_create_and_register() patch which is in my
> tree, but I do not think it made it to -mm yet. The next -mm cycle
> should have it.
>
> Also, the sysfs usage in the perfmon code is quite strange and not
> documented at all. Yes, there is a little bit in the documentation
> about what a few of the files do, but there are _way_ more files and
> even directories being created under /sys/kernel/perfmon/ that are not
> documented at all here.
>
> If you document this stuff, I think I can clean up your sysfs code a
> lot, making things simpler, easier to extend, and easier to understand.
> But as it is, I don't want to break anything as it's totally unknown how
> this stuff is supposed to work...
>
> Hint, use the Documentation/ABI directory to document your sysfs
> interfaces, that is what it is there for...
>
> thanks,
>
> greg k-h
>
> ---------------
> From: Greg Kroah-Hartman <[email protected]>
> Subject: perfmon: fix up some static kobject usages
>
> This gets the perfmon code to build properly on the latest -mm tree, as
> well as removing some static kobjects.
>
> A lot of future kobject cleanups can be done on this code, but the
> documentation for the perfmon sysfs interface is very limited and does
> not describe all of the different files and subdirectories at all.
>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>
> ---
> perfmon/perfmon_sysfs.c | 37 +++++++++++++++----------------------
> 1 file changed, 15 insertions(+), 22 deletions(-)
>
> --- a/perfmon/perfmon_sysfs.c
> +++ b/perfmon/perfmon_sysfs.c
> @@ -76,7 +76,8 @@ EXPORT_SYMBOL(pfm_controls);
>
> DECLARE_PER_CPU(struct pfm_stats, pfm_stats);
>
> -static struct kobject pfm_kernel_kobj, pfm_kernel_fmt_kobj;
> +static struct kobject *pfm_kernel_kobj;
> +static struct kobject *pfm_kernel_fmt_kobj;
>
> static void pfm_reset_stats(int cpu)
> {
> @@ -402,31 +403,23 @@ static struct attribute_group pfm_kernel
>
> int __init pfm_init_sysfs(void)
> {
> - int ret;
> + int ret = -ENOMEM;
> int i, cpu = -1;
>
> - kobject_init(&pfm_kernel_kobj);
> - kobject_init(&pfm_kernel_fmt_kobj);
> -
> - pfm_kernel_kobj.parent = &kernel_subsys.kobj;
> - kobject_set_name(&pfm_kernel_kobj, "perfmon");
> -
> - pfm_kernel_fmt_kobj.parent = &pfm_kernel_kobj;
> - kobject_set_name(&pfm_kernel_fmt_kobj, "formats");
> -
> - ret = kobject_add(&pfm_kernel_kobj);
> - if (ret) {
> - PFM_INFO("cannot add kernel object: %d", ret);
> + pfm_kernel_kobj = kobject_create_and_register("perfmon", kernel_kobj);
> + if (!pfm_kernel_kobj) {
> + PFM_INFO("cannot create perfmon kernel object");
> goto error;
> }
>
> - ret = kobject_add(&pfm_kernel_fmt_kobj);
> - if (ret) {
> - PFM_INFO("cannot add fmt object: %d", ret);
> + pfm_kernel_fmt_kobj = kobject_create_and_register("formats",
> + pfm_kernel_kobj);
> + if (!pfm_kernel_fmt_kobj) {
> + PFM_INFO("cannot add fmt object");
> goto error_fmt;
> }
>
> - ret = sysfs_create_group(&pfm_kernel_kobj, &pfm_kernel_attr_group);
> + ret = sysfs_create_group(pfm_kernel_kobj, &pfm_kernel_attr_group);
> if (ret) {
> PFM_INFO("cannot create kernel group");
> goto error_group;
> @@ -449,9 +442,9 @@ int __init pfm_init_sysfs(void)
> return 0;
>
> error_group:
> - kobject_del(&pfm_kernel_fmt_kobj);
> + kobject_unregister(pfm_kernel_fmt_kobj);
> error_fmt:
> - kobject_del(&pfm_kernel_kobj);
> + kobject_unregister(pfm_kernel_kobj);
>
> for (i=0; i < cpu; i++)
> pfm_sysfs_del_cpu(i);
> @@ -683,7 +676,7 @@ int pfm_sysfs_add_fmt(struct pfm_smpl_fm
>
> kobject_set_name(&fmt->kobj, fmt->fmt_name);
> //kobj_set_kset_s(fmt, pfm_fmt_subsys);
> - fmt->kobj.parent = &pfm_kernel_fmt_kobj;
> + fmt->kobj.parent = pfm_kernel_fmt_kobj;
>
> ret = kobject_add(&fmt->kobj);
> if (ret)
> @@ -861,7 +854,7 @@ int pfm_sysfs_add_pmu(struct pfm_pmu_con
> kobject_init(&pmu->kobj);
> kobject_set_name(&pmu->kobj, "pmu_desc");
> //kobj_set_kset_s(pmu, pfm_pmu_subsys);
> - pmu->kobj.parent = &pfm_kernel_kobj;
> + pmu->kobj.parent = pfm_kernel_kobj;
>
> ret = kobject_add(&pmu->kobj);
> if (ret)

--

-Stephane

2007-11-07 17:15:44

by Greg KH

Subject: Re: [PATCH] fix up perfmon to build on -mm

On Wed, Nov 07, 2007 at 02:34:49AM -0800, Stephane Eranian wrote:
> Greg,
>
> On Tue, Nov 06, 2007 at 04:34:54PM -0800, Greg KH wrote:
> > Here's a patch against my current tree that gets the perfmon code
> > building and hopefully working.
> >
> Thanks for your quick help.
>
> > Note, it needs the kobject_create_and_register() patch which is in my
> > tree, but I do not think it made it to -mm yet. The next -mm cycle
> > should have it.
> >
> > Also, the sysfs usage in the perfmon code is quite strange and not
> > documented at all. Yes, there is a little bit in the documentation
> > about what a few of the files do, but there are _way_ more files and
> > even directories being created under /sys/kernel/perfmon/ that are not
> > documented at all here.
> >
> The full documentation for /sys/kernel/perfmon is in Documentation/perfmon2.txt

That is what I was referring to; that file does not come close to
describing all of the sysfs files in /sys/kernel/perfmon.

> > If you document this stuff, I think I can clean up your sysfs code a
> > lot, making things simpler, easier to extend, and easier to understand.
> > But as it is, I don't want to break anything as it's totally unknown how
> > this stuff is supposed to work...
> >
> I certainly welcome your help.
>
> > Hint, use the Documentation/ABI directory to document your sysfs
> > interfaces, that is what it is there for...
> >
> I will move the description from perfmon2.txt to its own file in
> ABI/testing.

That would be great to have, thanks.

greg k-h

2007-11-07 17:15:59

by Greg KH

Subject: Re: [PATCH] fix up perfmon to build on -mm

On Wed, Nov 07, 2007 at 05:42:55AM -0800, Stephane Eranian wrote:
> Greg,
>
> Perfmon sysfs document has been updated following your adivce.
> you can check out in my perfmon tree the following commit:
>
> e83278f879e52ecee025effe9ad509fd51e4a516

Where is this git tree located? On git.kernel.org somewhere?

thanks,

greg k-h

2007-11-07 17:33:51

by Andrew Morton

Subject: Re: [PATCH] fix up perfmon to build on -mm

> On Wed, 7 Nov 2007 09:08:20 -0800 Greg KH <[email protected]> wrote:
> On Wed, Nov 07, 2007 at 05:42:55AM -0800, Stephane Eranian wrote:
> > Greg,
> >
> > Perfmon sysfs document has been updated following your adivce.
> > you can check out in my perfmon tree the following commit:
> >
> > e83278f879e52ecee025effe9ad509fd51e4a516
>
> Where is this git tree located? On git.kernel.org somewhere?
>


I get mine from git+ssh://master.kernel.org/pub/scm/linux/kernel/git/eranian/linux-2.6.git

2007-11-07 17:47:04

by Greg KH

Subject: Re: [PATCH] fix up perfmon to build on -mm

On Wed, Nov 07, 2007 at 09:33:13AM -0800, Andrew Morton wrote:
> > On Wed, 7 Nov 2007 09:08:20 -0800 Greg KH <[email protected]> wrote:
> > On Wed, Nov 07, 2007 at 05:42:55AM -0800, Stephane Eranian wrote:
> > > Greg,
> > >
> > > Perfmon sysfs document has been updated following your adivce.
> > > you can check out in my perfmon tree the following commit:
> > >
> > > e83278f879e52ecee025effe9ad509fd51e4a516
> >
> > Where is this git tree located? On git.kernel.org somewhere?
> >
>
>
> I get mine from git+ssh://master.kernel.org/pub/scm/linux/kernel/git/eranian/linux-2.6.git

Thanks, that worked, let me go read the new documentation...

2007-11-07 17:50:39

by Stephane Eranian

Subject: Re: [PATCH] fix up perfmon to build on -mm

On Wed, Nov 07, 2007 at 09:08:20AM -0800, Greg KH wrote:
> On Wed, Nov 07, 2007 at 05:42:55AM -0800, Stephane Eranian wrote:
> > Greg,
> >
> > Perfmon sysfs document has been updated following your adivce.
> > you can check out in my perfmon tree the following commit:
> >
> > e83278f879e52ecee025effe9ad509fd51e4a516
>
> Where is this git tree located? On git.kernel.org somewhere?
>
http://git.kernel.org/?p=linux/kernel/git/eranian/linux-2.6.git

--
-Stephane

2007-11-07 17:51:56

by Greg KH

Subject: Re: [PATCH] fix up perfmon to build on -mm

On Wed, Nov 07, 2007 at 05:42:55AM -0800, Stephane Eranian wrote:
> Greg,
>
> Perfmon sysfs document has been updated following your adivce.
> you can check out in my perfmon tree the following commit:
>
> e83278f879e52ecee025effe9ad509fd51e4a516

Thanks, that looks a lot better.

Do you want me to send you patches based on this tree to help clean up
the sysfs usage now that it's documented?

Also, a lot of your per-cpu sysfs files should probably move to debugfs
as they are for debugging only, right? No need to clutter up sysfs with
them when only the very few perfmon developers would be needing access
to them.

thanks,

greg k-h

2007-11-07 17:57:58

by Stephane Eranian

Subject: Re: [PATCH] fix up perfmon to build on -mm

Greg,

On Wed, Nov 07, 2007 at 09:47:47AM -0800, Greg KH wrote:
> On Wed, Nov 07, 2007 at 05:42:55AM -0800, Stephane Eranian wrote:
> > Greg,
> >
> > Perfmon sysfs document has been updated following your adivce.
> > you can check out in my perfmon tree the following commit:
> >
> > e83278f879e52ecee025effe9ad509fd51e4a516
>
> Thanks, that looks a lot better.
>
> Do you want me to send you patches based on this tree to help clean up
> the sysfs usage now that it's documented?
>
Yes, send me the patches. But from what you were saying earlier, it seems
I would need an extra sysfs patch to make this compile. Is that particular
patch already in Linus's tree?


> Also, a lot of your per-cpu sysfs files should probably move to debugfs
> as they are for debugging only, right? No need to clutter up sysfs with
> them when only the very few perfmon developers would be needing access
> to them.
>
Yes, this is mostly debugging. If debugfs is meant for this, then I'll
be happy to move this stuff over there. Is there some good example of how
I could do that based on my current sysfs code?

Thanks.

--
-Stephane

2007-11-07 19:59:40

by Greg KH

Subject: Re: [PATCH] fix up perfmon to build on -mm

On Wed, Nov 07, 2007 at 09:57:42AM -0800, Stephane Eranian wrote:
> Greg,
>
> On Wed, Nov 07, 2007 at 09:47:47AM -0800, Greg KH wrote:
> > On Wed, Nov 07, 2007 at 05:42:55AM -0800, Stephane Eranian wrote:
> > > Greg,
> > >
> > > Perfmon sysfs document has been updated following your adivce.
> > > you can check out in my perfmon tree the following commit:
> > >
> > > e83278f879e52ecee025effe9ad509fd51e4a516
> >
> > Thanks, that looks a lot better.
> >
> > Do you want me to send you patches based on this tree to help clean up
> > the sysfs usage now that it's documented?
> >
> Yes, send me the patches. But from what you were saying earlier it seems
> I would need an extra sysfs patches to make this compile. Is that particular
> patch already in Linus's tree?

No, it's in my tree, and will be in the next -mm. You will need a few
patches to get this to work, not just a single patch.

> > Also, a lot of your per-cpu sysfs files should probably move to debugfs
> > as they are for debugging only, right? No need to clutter up sysfs with
> > them when only the very few perfmon developers would be needing access
> > to them.
> >
> Yes, this is mostly debugging. If debugfs is meant for this, then I'll
> be happy to move this stuff over there. Is there some good example of how
> I could do that based on my current sysfs code?

There is documentation for debugfs in the kernel api document :)

And, there are many in-kernel users of debugfs, a grep for
"debugfs_create_" should show you some examples of how to use this. If
you have any questions, please let me know.
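
As a rough sketch (assuming debugfs_create_u64() is available in the target
tree, and using a made-up ovfl_intr_count field rather than the real
pfm_stats layout), the per-cpu statistics could be exported like this:

#include <linux/debugfs.h>
#include <linux/percpu.h>

static struct dentry *pfm_debugfs_dir;

static int __init pfm_debugfs_init(void)
{
	struct dentry *cpu_dir;

	/* everything lives under /sys/kernel/debug/perfmon/ */
	pfm_debugfs_dir = debugfs_create_dir("perfmon", NULL);
	if (!pfm_debugfs_dir)
		return -ENOMEM;

	/* one directory per CPU (only cpu0 shown), one file per counter;
	 * pfm_stats is the existing per-CPU stats structure from
	 * perfmon_sysfs.c, ovfl_intr_count a placeholder field name */
	cpu_dir = debugfs_create_dir("cpu0", pfm_debugfs_dir);
	debugfs_create_u64("ovfl_intr_count", 0444, cpu_dir,
			   &per_cpu(pfm_stats, 0).ovfl_intr_count);
	return 0;
}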

thanks,

greg k-h

2007-11-07 20:39:40

by Stephane Eranian

Subject: Re: [PATCH] fix up perfmon to build on -mm

Greg,

On Wed, Nov 07, 2007 at 11:53:20AM -0800, Greg KH wrote:
> > >
> > > Do you want me to send you patches based on this tree to help clean up
> > > the sysfs usage now that it's documented?
> > >
> > Yes, send me the patches. But from what you were saying earlier it seems
> > I would need an extra sysfs patches to make this compile. Is that particular
> > patch already in Linus's tree?
>
> No, it's in my tree, and will be in the next -mm. You will need a few
> patches to get this to work, not just a single patch.
>
Could you send them to me? If they are not too intrusive I could add them
to my tree. Yet I don't want something too distant from Linus's tree, which
I pull from. My goal is to ensure that my tree still compiles and works.

> > > Also, a lot of your per-cpu sysfs files should probably move to debugfs
> > > as they are for debugging only, right? No need to clutter up sysfs with
> > > them when only the very few perfmon developers would be needing access
> > > to them.
> > >
> > Yes, this is mostly debugging. If debugfs is meant for this, then I'll
> > be happy to move this stuff over there. Is there some good example of how
> > I could do that based on my current sysfs code?
>
> There is documentation for debugfs in the kernel api document :)
>
> And, there are many in-kernel users of debugfs, a grep for
> "debugfs_create_" should show you some examples of how to use this. If
> you have any questions, please let me know.
>
Ok, I'll look at that next.

Thanks,

--

-Stephane

2007-11-08 15:29:23

by Stephane Eranian

Subject: Re: [PATCH] fix up perfmon to build on -mm

Greg,

On Wed, Nov 07, 2007 at 11:53:20AM -0800, Greg KH wrote:
> > > Also, a lot of your per-cpu sysfs files should probably move to debugfs
> > > as they are for debugging only, right? No need to clutter up sysfs with
> > > them when only the very few perfmon developers would be needing access
> > > to them.
> > >
> > Yes, this is mostly debugging. If debugfs is meant for this, then I'll
> > be happy to move this stuff over there. Is there some good example of how
> > I could do that based on my current sysfs code?
>
> There is documentation for debugfs in the kernel api document :)
>
> And, there are many in-kernel users of debugfs, a grep for
> "debugfs_create_" should show you some examples of how to use this. If
> you have any questions, please let me know.

I have now removed all the perfmon2 statistics from sysfs and moved them
to debugfs. I must admit, I like it better this way. Debugfs is also so
much easier to program.

Patch has been pushed into my tree. Let me know if you think I can improve
the sysfs code some more.

Thanks.

--

-Stephane

2007-11-09 20:07:01

by Andrew Morton

Subject: Re: [PATCH] fix up perfmon to build on -mm

On Tue, 6 Nov 2007 16:34:54 -0800
Greg KH <[email protected]> wrote:

> Here's a patch against my current tree that gets the perfmon code
> building and hopefully working.

Unfortunately I still haven't merged perfmon due to recently-occurring
minor conflicts with Tony's ia64 tree and more major recently-occurring
conflicts with the x86 tree.

There's not really a lot which Stephane can practically do about this -
normally I'll just get down and fix stuff like this up. But the impression
I get from various people is that the perfmon tree in its present form
would not be a popular merge.

The impression which people have (and I admit to sharing it) is that
there's just too much stuff in there and it might not all be justifiable.
But I suspect that people have largely forgotten what is in there, and why
it is in there.

We really need to get this ball rolling, and that will require a sustained
effort from more people than just Stephane. I suppose as a starting point
we could yet again review the existing patches, please. People will mainly
concentrate upon the changelogging to understand which features are being
proposed and why, so that submission should describe these things pretty
carefully: what are the features and why do we need each of them.

tia.

2007-11-09 22:15:40

by Greg KH

Subject: Re: [PATCH] fix up perfmon to build on -mm

On Fri, Nov 09, 2007 at 12:06:27PM -0800, Andrew Morton wrote:
> On Tue, 6 Nov 2007 16:34:54 -0800
> Greg KH <[email protected]> wrote:
>
> > Here's a patch against my current tree that gets the perfmon code
> > building and hopefully working.
>
> Unfortunately I still haven't merged perfmon due to recently-occurring
> minor conflicts with Tony's ia64 tree and more major recently-occurring
> conflicts with the x86 tree.
>
> There's not really a lot which Stephane can practically do about this -
> normally I'll just get down and fix stuff like this up. But the impression
> I get from various people is that the perfmon tree in its present form
> would not be a popular merge.
>
> The impression which people have (and I admit to sharing it) is that
> there's just too much stuff in there and it might not all be justifiable.
> But I suspect that people have largely forgotten what is in there, and why
> it is in there.
>
> We really need to get this ball rolling, and that will require a sustained
> effort from more people than just Stephane. I suppose as a starting point
> we could yet again review the existing patches, please. People will mainly
> concentrate upon the changelogging to understand which features are being
> proposed and why, so that submission should describe these things pretty
> carefully: what are the features and why do we need each of them.

Is there some way to rebase these patches/this git tree to be a bit easier
to review? Right now there are over 75 patches in the tree, and many (if
not most) could be folded into previous patches.

If someone could break this stuff down into reviewable pieces, it would
go a very long way toward making it acceptable.

Is there any way to just provide a basic framework that everyone can
agree on and then add more stuff as time goes on? Do we have to start
with support for every different processor/arch?

thanks,

greg k-h

2007-11-10 20:32:48

by Andi Kleen

Subject: Re: [PATCH] fix up perfmon to build on -mm

Greg KH <[email protected]> writes:

[dropped perfmon list because gmane messed it up and it's apparently
closed anyways]

> Is there any way to just provide a basic framework that everyone can
> agree on and then add on more stuff as time goes on? Do we have to have
> every different processor/arch with support to start with?

I think the real problem is not the architectures (the processor
adaptation layer is usually relatively straightforward IIRC), but the
excessive functionality implemented by the user interface.

It would be really good to extract a core perfmon and start with
that and then add stuff as it makes sense.

e.g. core perfmon could be something simple, like just support for
context-switching the PMU state and initializing counters in a basic way,
and perhaps getting counter numbers for RDPMC in ring 3 on x86 [1]

Next step could be basic event on overflow/underflow support.

Then more features as they make sense, with clear rationale
what they're good for and proper step by step patches.

-Andi

[1] On x86 we urgently need a replacement to RDTSC for counting
cycles.
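
(For illustration, the user-level side of that is tiny; a sketch, assuming
the kernel has set CR4.PCE and has told the application which counter index
to read:)

static inline unsigned long long read_pmc(unsigned int idx)
{
	unsigned int lo, hi;

	/* RDPMC: counter index goes in ECX, result comes back in EDX:EAX */
	asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (idx));
	return ((unsigned long long)hi << 32) | lo;
}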

by Robert Richter

Subject: perfmon2 merge news

On 10.11.07 21:32:39, Andi Kleen wrote:
> It would be really good to extract a core perfmon and start with
> that and then add stuff as it makes sense.
>
> e.g. core perfmon could be something simple like just support
> to context switch state and initialize counters in a basic way
> and perhaps get counter numbers for RDPMC in ring3 on x86[1]

Perhaps a core could also provide enough functionality so that
Perfmon can be used with an *unpatched* kernel via loadable modules?
One drawback with today's Perfmon is that it cannot be used with a
vanilla kernel. But maybe such a core is far too complex for a
first merge.

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
email: [email protected]


2007-11-13 15:35:38

by William Cohen

Subject: Re: [perfmon2] perfmon2 merge news

Robert Richter wrote:
> On 10.11.07 21:32:39, Andi Kleen wrote:
>> It would be really good to extract a core perfmon and start with
>> that and then add stuff as it makes sense.
>>
>> e.g. core perfmon could be something simple like just support
>> to context switch state and initialize counters in a basic way
>> and perhaps get counter numbers for RDPMC in ring3 on x86[1]
>
> Perhaps a core could provide also as much functionality so that
> Perfmon can be used with an *unpatched* kernel using loadable modules?
> One drawback with today's Perfmon is that it can not be used with a
> vanilla kernel. But maybe such a core is by far too complex for a
> first merge.
>
> -Robert
>

Hi Robert,

In the past I suggested that it might be useful to have a version of perfmon2
that only sets up perfmon on a global basis. That would allow the patches for
the context-switch support to be added as a separate step, splitting the patch
up into a smaller set of patches.

Perfmon2 uses a set of system calls to control the performance monitoring
hardware. This would make it difficult to use an unpatched kernel unless
perfmon changed the mechanism used to control that hardware.

-Will

2007-11-13 17:58:08

by Stephane Eranian

Subject: Re: [perfmon2] perfmon2 merge news

Hello,

On Tue, Nov 13, 2007 at 10:35:11AM -0500, William Cohen wrote:
> Robert Richter wrote:
> > On 10.11.07 21:32:39, Andi Kleen wrote:
> >> It would be really good to extract a core perfmon and start with
> >> that and then add stuff as it makes sense.
> >>
> >> e.g. core perfmon could be something simple like just support
> >> to context switch state and initialize counters in a basic way
> >> and perhaps get counter numbers for RDPMC in ring3 on x86[1]
> >
> > Perhaps a core could provide also as much functionality so that
> > Perfmon can be used with an *unpatched* kernel using loadable modules?
> > One drawback with today's Perfmon is that it can not be used with a
> > vanilla kernel. But maybe such a core is by far too complex for a
> > first merge.
> >
> > -Robert
> >
>
> Hi Robert,
>
> In the past I suggested that it might be useful to have a version of perfmon2
> that only set up the perfmon on a global basis. That would allow the patches for
> context switches to be added as a separate step, splitting up the patch into
> smaller set of patches.
>
> Perfmon2 uses a set of system calls to control the performance monitoring
> hardware. This would make it difficult to use an unpatch kernel unless perfmon
> changed the mechanism used to control the performance monitoring hardware.
>
Yes, that would be a possibility, but as you pointed out there are some problems:

- perfmon2 uses system calls. So unless you can dynamically patch the
syscall table, we would have to go back to the ioctl() and driver model.
I was under the impression that people did not much like multiplexing
syscalls such as ioctl(). I also prefer the multi-syscall approach.

- perfmon2 needs to install a PMU interrupt handler. On x86, this is not
just an external device interrupt: there needs to be some APIC and
interrupt gate setup (see the sketch just after this list). There may be
other constraints on other architectures as well. I am not sure all the
functions/structures necessary for this are available to modules.

- we could not support per-thread mode with the kernel module approach
because of the hook into the context switch code. I do believe per-thread
is a key value-add for performance monitoring.
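
As a rough sketch of the second point, the x86 wiring is along these lines
(the vector and handler names are made up, not the actual perfmon symbols),
and set_intr_gate(), for one, is not exported to modules:

/* IDT entry for the PMU interrupt -- boot-time setup, not module territory */
set_intr_gate(LOCAL_PERFMON_VECTOR, pmu_interrupt);

/* per CPU: route PMU overflow interrupts through the local APIC */
apic_write(APIC_LVTPC, LOCAL_PERFMON_VECTOR);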

--
-Stephane

2007-11-13 18:33:39

by Stephane Eranian

Subject: Re: perfmon2 merge news

Hello,

On Tue, Nov 13, 2007 at 04:17:18PM +0100, Robert Richter wrote:
> On 10.11.07 21:32:39, Andi Kleen wrote:
> > It would be really good to extract a core perfmon and start with
> > that and then add stuff as it makes sense.
> >
> > e.g. core perfmon could be something simple like just support
> > to context switch state and initialize counters in a basic way
> > and perhaps get counter numbers for RDPMC in ring3 on x86[1]
>
> Perhaps a core could provide also as much functionality so that
> Perfmon can be used with an *unpatched* kernel using loadable modules?
> One drawback with today's Perfmon is that it can not be used with a
> vanilla kernel. But maybe such a core is by far too complex for a
> first merge.
>
Note that I am not against the gradual approach such as:
- system-wide only counting
- per-thread counting
- user-level sampling support
- in-kernel sampling buffer support
- in-kernel customizable sampling buffer formats via modules
- event set multiplexing
- PMU description modules

It would obviously cause a lot of trouble for existing perfmon libraries and
applications (e.g. PAPI). It would also be fairly tricky to do because you'd
have to make sure that, in the beginning, you leave enough flexibility that
you can add the rest while maintaining total backward compatibility. But given
that we already have the full solution, it could just be a matter of dropping
features without disrupting the user-level API. Of course there would be a
bigger burden on the maintainer because he would have two trees to maintain,
but I think that is already commonplace in many kernel-related projects.

Let's take a simple example. The set of syscalls necessary to control a system-wide
monitoring session is exactly the same as for a per-thread session. The difference is
just a flag when the session is created. Thus, we could keep the same set of syscalls,
but only accept system-wide sessions. Later on, when we add per-thread, we would just
have to expose the per-thread session flag.
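
Concretely, the only difference is at context creation. A sketch using the
perfmon2 userland API (as in the larger example later in this thread), where
PFM_FL_SYSTEM_WIDE is the context flag that requests a system-wide session:

pfarg_ctx_t ctx;

memset(&ctx, 0, sizeof(ctx));
ctx.ctx_flags = PFM_FL_SYSTEM_WIDE;	/* leave 0 for a per-thread session */
ctx_fd = pfm_create_context(&ctx, NULL, NULL, 0);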

Having said that does not mean that this is necessarily what we will do. I am
just trying to present my understanding of the comments from Andrew, Andi and
others.

I think that going with a kernel module will not address the 'complexity/bloat'
perception that some people have. There is a logic to it: I did not just wake up
one day saying 'wouldn't it be cool to add set multiplexing?'. There was a true
need expressed by users or developers, and it was justified by what the hardware
offered then. That unfortunately still stands today. I admit that the
justification is not necessarily spelled out clearly in the code. So I understand
most of those worries, and I am trying to figure out how we could best address them.

--
-Stephane

2007-11-13 18:41:55

by William Cohen

Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Stephane Eranian wrote:
> Hello,
>
> On Tue, Nov 13, 2007 at 10:35:11AM -0500, William Cohen wrote:
>> Robert Richter wrote:
>>> On 10.11.07 21:32:39, Andi Kleen wrote:
>>>> It would be really good to extract a core perfmon and start with
>>>> that and then add stuff as it makes sense.
>>>>
>>>> e.g. core perfmon could be something simple like just support
>>>> to context switch state and initialize counters in a basic way
>>>> and perhaps get counter numbers for RDPMC in ring3 on x86[1]
>>> Perhaps a core could provide also as much functionality so that
>>> Perfmon can be used with an *unpatched* kernel using loadable modules?
>>> One drawback with today's Perfmon is that it can not be used with a
>>> vanilla kernel. But maybe such a core is by far too complex for a
>>> first merge.
>>>
>>> -Robert
>>>
>> Hi Robert,
>>
>> In the past I suggested that it might be useful to have a version of perfmon2
>> that only set up the perfmon on a global basis. That would allow the patches for
>> context switches to be added as a separate step, splitting up the patch into
>> smaller set of patches.
>>
>> Perfmon2 uses a set of system calls to control the performance monitoring
>> hardware. This would make it difficult to use an unpatch kernel unless perfmon
>> changed the mechanism used to control the performance monitoring hardware.
>>
> Yes, that would be a possibility but as you pointed out there are some problems:
>
> - perfmon2 uses system calls. So unless you can dynamically patch the
> syscall table we would have to go back to the ioctl() and driver model.
> I was under the impression that people did not quite like multiplexing
> syscalls such as ioctl(). I also do prefer the multi syscall approach.
>
> - perfmon2 needs to install a PMU interrupt handler. On X86, this is not just
> an external device interrupts. There needs to be some APIC and interrupt
> gate setup. There maybe other constraints on other architectures as well.
> Not sure if all functions/structures necessary for this are available to
> modules.

The oprofile module can set up a handler for PMU interrupts. This is done in
arch/x86/oprofile/nmi_int.c:nmi_cpu_setup(). Other modules could do the same.
However, it bumps whatever was using the NMI/PMU off, then restores it when
oprofile is shut down. Maybe the PMU/NMI resource reservation mechanism
should be another self-contained patch.

> - we could not support per-thread mode with the kernel module approach due to
> link to the context switch code. I do believe per-thread is a key value-add
> for performance monitoring.

The per-thread monitoring is useful to a number of people and many people want
it. The thought was how to break the large perfmon patch into set of smaller
incremental patches. So it isn't whether to have per-thread pmu virtualization,
but rather when/how to get it in.

-Will

2007-11-13 18:59:55

by Greg KH

Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Tue, Nov 13, 2007 at 10:47:45AM -0800, Philip Mucci wrote:
> Hi folks,
>
> Well, I can say the mood here at supercomputing'07 is pretty somber in
> regards to the latest exchange of messages regarding the perfmon patches.

"somber"?

Why?

We (a number of the kernel developers) want to see the perfmon code make
it into the kernel tree; unfortunately, in its current state, that's not
going to happen.

Andi specified a way that this can happen: just refactor your patches
into smaller bits that can be reviewed and applied.

If you, or anyone else, has any questions about this, please let us know.
So far, I have not seen any response to his message, so I'm guessing
that the perfmon developers are either off working on this, or don't
care.

And if they don't care, then yes, I agree with your "somber" feeling...

thanks,

greg k-h

2007-11-13 19:18:30

by Philip Mucci

Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Hi folks,

Well, I can say the mood here at Supercomputing '07 is pretty somber
with regard to the latest exchange of messages about the perfmon
patches. Our community has been the largest user of both the PerfCtr
and the Perfmon patches, the former being regularly installed by
vendors and integrators on clusters at install time, and the latter
now being adopted into vendor kernels by IBM, Cray, AMD, SiCortex and
others. Of course, adoption by a vendor does not a good kernel patch
make. However, it should be viewed as a strong data point on demand
for such functionality. We are a community focused on performance and
we have long had a need for these tools.

A solution that does not provide 64-bit virtualized per-thread counts
is not a solution at all. It would need to be ripped out by all of
us using this functionality so we could get something that actually
does what the community needs, not what you folks think we need.
Device-level access and/or root access to the counters is not
acceptable for machines in production. If it were, oprofile would
have satisfied everyone and we wouldn't be sucking up your
bandwidth. Please understand that people outside of your community
are desperate for adoption of any form of 'per-thread' PMU
functionality into the kernel. For those of you who are (still) not
convinced of this, I can arrange for your inboxes to be spammed by
thousands of HPC geeks, managers, vendors, etc. My point is, let's
start somewhere that the community finds useful. Otherwise we run the
risk of developing an interface that nobody is comfortable with and
no one uses. Hardly a productive exercise.

So please, do consider a set of core functionality that provides for
(at least) the following:

- per-CPU and per-thread 64 bit virtualized counts
- third person operation (attach/ptrace)
- dispatch of signal upon interrupt on overflow if requested
- 'buffered' interrupts into a buffer that can be mmap'd into user space
- support for a variety of the major processor platforms

Regards,


On Nov 13, 2007, at 9:55 AM, Stephane Eranian wrote:

> Hello,
>
> On Tue, Nov 13, 2007 at 10:35:11AM -0500, William Cohen wrote:
>> Robert Richter wrote:
>>> On 10.11.07 21:32:39, Andi Kleen wrote:
>>>> It would be really good to extract a core perfmon and start with
>>>> that and then add stuff as it makes sense.
>>>>
>>>> e.g. core perfmon could be something simple like just support
>>>> to context switch state and initialize counters in a basic way
>>>> and perhaps get counter numbers for RDPMC in ring3 on x86[1]
>>>
>>> Perhaps a core could provide also as much functionality so that
>>> Perfmon can be used with an *unpatched* kernel using loadable
>>> modules?
>>> One drawback with today's Perfmon is that it can not be used with a
>>> vanilla kernel. But maybe such a core is by far too complex for a
>>> first merge.
>>>
>>> -Robert
>>>
>>
>> Hi Robert,
>>
>> In the past I suggested that it might be useful to have a version
>> of perfmon2
>> that only set up the perfmon on a global basis. That would allow
>> the patches for
>> context switches to be added as a separate step, splitting up the
>> patch into
>> smaller set of patches.
>>
>> Perfmon2 uses a set of system calls to control the performance
>> monitoring
>> hardware. This would make it difficult to use an unpatch kernel
>> unless perfmon
>> changed the mechanism used to control the performance monitoring
>> hardware.
>>
> Yes, that would be a possibility but as you pointed out there are
> some problems:
>
> - perfmon2 uses system calls. So unless you can dynamically patch the
> syscall table we would have to go back to the ioctl() and driver
> model.
> I was under the impression that people did not quite like
> multiplexing
> syscalls such as ioctl(). I also do prefer the multi syscall
> approach.
>
> - perfmon2 needs to install a PMU interrupt handler. On X86, this
> is not just
> an external device interrupts. There needs to be some APIC and
> interrupt
> gate setup. There maybe other constraints on other architectures
> as well.
> Not sure if all functions/structures necessary for this are
> available to
> modules.
>
> - we could not support per-thread mode with the kernel module
> approach due to
> link to the context switch code. I do believe per-thread is a
> key value-add
> for performance monitoring.
>
> --
> -Stephane
> _______________________________________________
> perfmon mailing list
> [email protected]
> http://www.hpl.hp.com/hosted/linux/mail-archives/perfmon/

2007-11-13 20:09:20

by Andrew Morton

Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Tue, 13 Nov 2007 10:59:24 -0800 Greg KH <[email protected]> wrote:

> On Tue, Nov 13, 2007 at 10:47:45AM -0800, Philip Mucci wrote:
> > Hi folks,
> >
> > Well, I can say the mood here at supercomputing'07 is pretty somber in
> > regards to the latest exchange of messages regarding the perfmon patches.
>
> "somber"?
>
> Why?
>
> We (a number of the kernel developers) want to see the perfmon code make
> it into the kernel tree, unfortunatly, in the current state it is in,
> that's not going to happen.
>
> Andi specified a way that this can happen, just refactor your patches
> into smaller bits that can be reviewed and applied.
>
> If you, or anyone else has any questions about this, please let us know.
> So far, I have not seen any response to his message, so I'm guessing
> that the perfmon developers either are off working on this, or don't
> care.
>
> And if they don't care, then yes, I agree with your "somber" feeling...
>

Well... Philip is (I assume) a numerical-computing guy and not a
kernel-developing guy (probably a wise choice).

He speaks for quite a few people - they have serious need for this feature
but they've had to scruff around with out-of-tree patches for years to get
it, and still there are problems.

I was hoping that after the round of release-and-review which Stephane,
Andi and I did about twelve months ago that we were on track to merge the
perfmon codebase as-offered. But now it turns out that the sentiment is
that the code simply has too many bells-and-whistles to be acceptable.

My problem with that sentiment is that it is quite likely the case that
those bells-n-whistles are actually useful and needed features. Perfmon
has been out there for quite a few years and the code which is in there
_should_ be in response to real-world in-the-field experience. Such
requirements never go away.


So. If what I am saying is correct then the best course of action would be
for Stephane to help us all to understand what these features are and why
we need them. The ideal way in which to do this is

[patch] perfmon: core
[patch] perfmon: whizzy feature #1
[patch] perfmon: whizzy feature #2
[patch] perfmon: whizzy feature #3

etc. Where the changelog in each whizzy-feature-n explains what it does,
why it does it and why our users need it.

Whatever happens, perfmon is so big and so old and has been out-of-tree for
so long that it's going to take a pile of work from lots of people to get
any of it landed.

2007-11-13 20:20:42

by Greg KH

Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Tue, Nov 13, 2007 at 12:07:28PM -0800, Andrew Morton wrote:
>
> So. If what I am saying is correct then the best course of action would be
> for Stephane to help us all to understand what these features are and why
> we need them. The ideal way in which to do this is
>
> [patch] perfmon: core
> [patch] perfmon: whizzy feature #1
> [patch] perfmon: whizzy feature #2
> [patch] perfmon: whizzy feature #3
>
> etc. Where the changelog in each whizzy-feature-n explains what it does,
> why it does it and why our users need it.

I agree. Right now their git tree has over 80 patches in it, without
descriptions like this to help those of us who want to review and help
out, it is quite difficult.

thanks,

greg k-h

2007-11-13 20:37:25

by Andi Kleen

Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

> He speaks for quite a few people - they have serious need for this feature

Most likely they have serious need for a very small subset of perfmon2.
The point of my proposal was to get this very small subset in quickly.

Phil, how many of the command line options of pfmon do you
actually use? How many do the people at your conference use? Or what
functions, what performance counters etc. in PAPI or whatever
library you use?

Help us understand the use cases better; that would already help a lot
with merging, by concentrating on what people actually need.

-Andi

2007-11-13 20:42:20

by Andi Kleen

Subject: Re: [perfmon2] perfmon2 merge news

> In the past I suggested that it might be useful to have a version of
> perfmon2 that only set up the perfmon on a global basis. That would allow

Context switch is imho the main differentiating feature of perfmon
over oprofile. Not sure it makes sense to take that one out.

I don't think the complexity of the patches comes from the context
switch anyway; it comes from all the other things perfmon does.

-Andi

2007-11-13 21:14:27

by Stephane Eranian

Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Will,

On Tue, Nov 13, 2007 at 01:33:55PM -0500, William Cohen wrote:
>
> The oprofile module can setup a handler for PMU interrupts. This is done in
> archi/x86/oprofile/nmi_int:nmi_cpu_setup(). Other modules could do the
> same. However, it bumps what ever was using the nmi/pmu off, then restores
> nmi/pmu when oprofile is shut down. Maybe the pmu/nmi resource reservation
> mechanism should be another self-contained patch.
>

Oprofile does not set up the PMU interrupt. It builds on top of the NMI
watchdog setup. It uses the register_die() mechanism, if I recall. The
low-level APIC and gate setup is done elsewhere. Perfmon does not use NMI
unless forced to because of the NMI watchdog.


> > - we could not support per-thread mode with the kernel module
> > approach due to
> > link to the context switch code. I do believe per-thread is a key
> > value-add
> > for performance monitoring.
>
Per-thread monitoring is useful to a number of people and many people want
it. The thought was how to break the large perfmon patch into a set of smaller
incremental patches. So it isn't whether to have per-thread PMU virtualization,
but rather when/how to get it in.

I think we all agree on this.

--

-Stephane

2007-11-13 21:29:18

by Andi Kleen

Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Tue, Nov 13, 2007 at 01:13:45PM -0800, Stephane Eranian wrote:
> Oprofile does not setup the PMU interrupt. It builds on top of the NMI watchdog
> setup.

Oprofile works without the NMI watchdog too, but it just happens to be another
NMI user.

> It uses the register_die() mechanism,

Not correct.

> if I recall. The low level APIC
> and gate is setup elsewhere. Perfmon does not use NMI, unless forced to because
> of the NMI watchdog.

It could handle it the same way as oprofile if it wanted to. But given that
NMIs make everything more complicated, it might not be worth it.

-Andi

2007-11-13 21:34:47

by Stephane Eranian

Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Greg,

On Tue, Nov 13, 2007 at 10:59:24AM -0800, Greg KH wrote:
> On Tue, Nov 13, 2007 at 10:47:45AM -0800, Philip Mucci wrote:
> > Hi folks,
> >
> > Well, I can say the mood here at supercomputing'07 is pretty somber in
> > regards to the latest exchange of messages regarding the perfmon patches.
>
> "somber"?
>

I am the core developer of this and I am not as pessimistic as Phil. Yet I admit
Phil has been asking for this kind of kernel interface for a very very long time ;-<

> Why?
>
> We (a number of the kernel developers) want to see the perfmon code make
> it into the kernel tree, unfortunatly, in the current state it is in,
> that's not going to happen.
>
> Andi specified a way that this can happen, just refactor your patches
> into smaller bits that can be reviewed and applied.
>

I think I understand your concerns, and I will work on this. I think it is
possible to refactor. It will certainly be painful (for me), but I think it can
be done within some reasonable delay. Of course, it would help if you could
better qualify what you mean by 'smaller'.

> If you, or anyone else has any questions about this, please let us know.
> So far, I have not seen any response to his message, so I'm guessing
> that the perfmon developers either are off working on this, or don't
> care.
>

I will start working on this once I fix the tickless/hrtimer issues.

> And if they don't care, then yes, I agree with your "somber" feeling...
>

I do care a lot, actually. Believe me, I spend a lot of effort and energy
on this project every day, like many others around the world, and I intend for
it to succeed. We have reached a point in the development of processor hardware
where this kind of feature is crucial, and it is not just for HPC folks anymore.

--
-Stephane

2007-11-13 21:45:47

by Greg KH

Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Tue, Nov 13, 2007 at 01:33:13PM -0800, Stephane Eranian wrote:
> I think I understand your concerns. I will work on this. I think it is possible to
> refactor. It will certainly be painful (for me), but I think it can be done within
> some reasonable delay. Of course, it would be help if you could better qualify what
> you mean by 'smaller'.

I think Andrew already spelled this out. If after reading his message,
you still have questions, please let me know and I'll be glad to work
with you to address them.

thanks,

greg k-h

2007-11-13 21:48:37

by Stephane Eranian

Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Andi.

On Tue, Nov 13, 2007 at 10:29:02PM +0100, Andi Kleen wrote:
> On Tue, Nov 13, 2007 at 01:13:45PM -0800, Stephane Eranian wrote:
> > Oprofile does not setup the PMU interrupt. It builds on top of the NMI watchdog
> > setup.
>
> Oprofile works without the NMI watchdog too, but it just happens to be another
> NMI user.
>
I have no doubt it can work with a "regular" interrupt.

> > It uses the register_die() mechanism,
>
> Not correct.
>
I meant the register_die_notifier() mechanism, which allows you to
chain a handler onto NMI interrupts. At least that's my understanding
from reading the code:

static int nmi_setup(void)
{
	int err = 0;
	int cpu;

	if (!allocate_msrs())
		return -ENOMEM;

	if ((err = register_die_notifier(&profile_exceptions_nb))) {
		free_msrs();
		pfm_release_allcpus();
		return err;
	}
	...


> > if I recall. The low level APIC
> > and gate is setup elsewhere. Perfmon does not use NMI, unless forced to because
> > of the NMI watchdog.
>
> It could handle it in the same way as oprofile if it wanted. But given
> NMIs make everything more complicated and it might not be worth it.
>
Yes, horribly more complicated, because of locking issues within perfmon.
As soon as you expose a file descriptor, you need some locking to prevent
multiple user threads (malicious or not) from competing for access to the PMU
state. I think the value-add of NMI can be achieved just as well with advanced
PMU features such as Intel Core 2 PEBS.

--
-Stephane

2007-11-13 21:51:23

by Andi Kleen

Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

> Yes, horribly more complicated because of locking issues within perfmon.
> As soon as you expose a file descriptor, you need some locking to prevent
> multiple user threads (malicious or not) to compete to access the PMU state.

Why do you need the file descriptor?

One of the main problems with perfmon is the complicated user interface.

Naively I would assume just some thread global state should be sufficient.

> I think the value add of NMI can be as well achieved with advanced PMU features
> such as Intel Core 2 PEBS.

Probably true, although only on CPUs that support PEBS. Dropping features
for old CPUs is unfortunately quite difficult in Linux, and in this case
probably not an option because there are so many of them (e.g. all AMD CPUs
other than Fam10h).

-Andi

2007-11-13 22:23:22

by Stephane Eranian

Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Andi,
On Tue, Nov 13, 2007 at 10:50:56PM +0100, Andi Kleen wrote:
> > Yes, horribly more complicated because of locking issues within perfmon.
> > As soon as you expose a file descriptor, you need some locking to prevent
> > multiple user threads (malicious or not) to compete to access the PMU state.
>
> Why do you need the file descriptor?
>

To identify your monitoring session, be it system-wide (i.e., per-CPU) or
per-thread. The file descriptor allows you to use close(), read(), select()
and poll(), and you leverage the existing file descriptor sharing/inheritance
semantics. At the kernel level, a descriptor provides all the callbacks
necessary to make sure the perfmon session state is cleaned up on exit.


> One of the main problems with perfmon is the complicated user interface.
>
> Naively I would assume just some thread global state should be sufficient.
>
> > I think the value add of NMI can be as well achieved with advanced PMU features
> > such as Intel Core 2 PEBS.
>
> True probably, although only on CPUs that support PEBS. Dropping features
> for old CPUs is unfortunately quite difficult in Linux, and in this case
> probably not an option because there are so many of them (e.g. all of AMD
> not Fam10h)
>

Yes, I know that. Also note that, unfortunately, the AMD Fam10h IBS feature
does not allow you to capture more than one sample in critical sections. It is
still interrupt-based sampling with a one-entry-deep buffer: one interrupt =
one sample. Perfmon does support NMI, though it is much more expensive to use.

--
-Stephane

2007-11-13 22:25:45

by Andi Kleen

Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Tue, Nov 13, 2007 at 02:22:34PM -0800, Stephane Eranian wrote:
> Andi,
> On Tue, Nov 13, 2007 at 10:50:56PM +0100, Andi Kleen wrote:
> > > Yes, horribly more complicated because of locking issues within perfmon.
> > > As soon as you expose a file descriptor, you need some locking to prevent
> > > multiple user threads (malicious or not) to compete to access the PMU state.
> >
> > Why do you need the file descriptor?
> >
>
> To identify your monitoring session be it system-wide (i.e., per-cpu) or per-thread.
> file descriptor allows you to use close, read, select, poll and you leverage the

Surely that could be done with a flag for each call too? Keeping file descriptors
to pass essentially a boolean seems overkill.

> existing file descriptor sharing/inheritance sematics. At the kernel level, a
> descriptor provides all the callback necessary to make sure you clean up the perfmon
> session state on exit.

Didn't you already have a thread destructor for it?

-Andi

2007-11-13 22:27:41

by Christoph Hellwig

Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

<stupid bullshitting snipped>

What about investing some effort in doing a proper performance counter
infrastructure, or turning the mess perfmon is into one, instead of this
useless rant? The code is not getting any better from your complaint CCing
gazillions of useless lists.

2007-11-13 22:30:01

by Christoph Hellwig

Subject: Re: perfmon2 merge news

On Tue, Nov 13, 2007 at 10:32:39AM -0800, Stephane Eranian wrote:
> It would obvisouly cause a lot of troubles to existing perfmon libraries and
> applications (e.g. PAPI). It would also be fairly tricky to do because you'd
> have to make sure that in the beginning, you leave enough flexiblity such that
> you can add the rest while maintaining total backward compatibility. But given
> that we already have the full solution, it could just be a matter of dropping
> features without disrupting the user level API.

There's no way we'll keep this completely idiotic userland API. If people
start to use out-of-tree APIs they can pretty much expect that they're not
going to stay around. And in this case they most certainly won't.

2007-11-13 22:59:23

by Stephane Eranian

Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Andi,

On Tue, Nov 13, 2007 at 11:25:34PM +0100, Andi Kleen wrote:
> On Tue, Nov 13, 2007 at 02:22:34PM -0800, Stephane Eranian wrote:
> > Andi,
> > On Tue, Nov 13, 2007 at 10:50:56PM +0100, Andi Kleen wrote:
> > > > Yes, horribly more complicated because of locking issues within perfmon.
> > > > As soon as you expose a file descriptor, you need some locking to prevent
> > > > multiple user threads (malicious or not) to compete to access the PMU state.
> > >
> > > Why do you need the file descriptor?
> > >
> >
> > To identify your monitoring session be it system-wide (i.e., per-cpu) or per-thread.
> > file descriptor allows you to use close, read, select, poll and you leverage the
>
> Surely that could be done with a flag for each call too? Keeping file descriptors
> to pass essentially a boolean seems overkill.
>

I don't understand this.

Let's take the simplest possible example (self-monitoring per-thread)
counting one event in one data register.

int
main(int argc, char **argv)
{
int ctx_fd;
pfarg_pmd_t pd[1];
pfarg_pmc_t pc[1];
pfarg_ctx_t ctx;
pfarg_load_t load_args;

memset(&ctx, 0, sizeof(ctx));
memset(pc, 0, sizeof(pc));
memset(pd, 0, sizeof(pd));

/* create session (context) and get file descriptor back (identifier) */
ctx_fd = pfm_create_context(&ctx, NULL, NULL, 0);

/* setup one config register (PMC0) */
pc[0].reg_num = 0;
pc[0].reg_value = 0x1234;

/* setup one data register (PMD0) */
pd[0].reg_num = 0;
pd[0].reg_value = 0;

/* program the registers */
pfm_write_pmcs(ctx_fd, pc, 1);
pfm_write_pmds(ctx_fd, pd, 1);

/* attach the context to self */
load_args.load_pid = getpid();
pfm_load_context(ctx_fd, &load_args);

/* activate monitoring */
pfm_start(ctx_fd, NULL);

/*
* run code to measure
*/

/* stop monitoring */
pfm_stop(ctx_fd);

/* read data register */
pfm_read_pmds(ctx_fd, pd, 1);

printf("PMD0 %llu\n", pd[0].reg_value);

/* destroy session */
close(ctx_fd);

return 0;
}

--

-Stephane

2007-11-14 00:29:24

by Philip Mucci

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Hi Andi,

pfmon is a single tool and fairly low level; the HPC folks don't use
it much because it isn't parallel-aware and is meant for power
users. It is not representative of the tools used in HPC at all. Our
community mostly uses tools built on the infrastructure provided by
libpfm and PAPI.

I know you don't want to hear this, but we actually use all of the
features of perfmon, because a) we wanted to use the best methods
available, and b) areas where user-level solutions could be made (like
multiplexing) introduced too much noise and overhead to be of use.
For years we relied on PerfCtr, which did 'just enough' for us. But
when Perfmon2 became available, we adopted the technology where it meant
a significant increase in accuracy for the resulting measurements;
specifically, for us that meant kernel multiplexing and sample buffers.
Note that PAPI is just middleware. The tools built upon it are what
people use...some of those are commercial tools like Vampir, but most
are Open Source. These tools are cross-platform; as such they run on
nearly everything, although intel/amd/ppc systems dominate the HPC
market.

The usage cases are always the same and can be broken down into
simple counting and sampling:

- providing virtualized 64-bit counters per-thread
- providing notification (buffered or non) on interrupt/overflow of
the above.

If you'd like to outline further what you'd like to hear from the
community, I can arrange that. I seem to remember going through this
once before, but I'd be happy to do it again. For reference, here's a
quick list from memory of some of the tools in active use and built
on this infrastructure. These are used heavily around the globe.
You'll see that each basically follows one of the 2 usage models above.

- HPCToolkit (Rice)
- PerfSuite (NCSA)
- Vampir (Dresden)
- Kojak (Juelich)
- TAU (UOregon)
- PAPIEX (me)
- GPTL (NCAR)
- HPM-Linux (IBM)
- Paraver (Barcelona)

Time to go give a talk here at a tools session at SC'07 about this
very subject.

Phil

On Nov 13, 2007, at 12:36 PM, Andi Kleen wrote:

>> He speaks for quite a few people - they have serious need for this
>> feature
>
> Most likely they have serious need for a very small subset of
> perfmon2.
> The point of my proposal was to get this very small subset in quickly.
>
> Phil, how many of the command line options of pfmon do you
> actually use? How many do the people at your conference use? Or what
> functions, what performance counters etc. in PAPI or whatever
> library you use?
>
> Make us understand the use cases better; that would already help a
> lot
> in merging by concentrating on what people actually really need.
>
> -Andi
>

2007-11-14 01:52:35

by Andi Kleen

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Tue, Nov 13, 2007 at 04:28:52PM -0800, Philip Mucci wrote:
> I know you don't want to hear this, but we actually use all of the
> features of perfmon, because a) we wanted to use the best methods

That is hard to believe.

But let's go with it temporarily for the sake of argument.

Can you instead prioritize the features? What is most essential, what is
important, what is just nice to have, and what is rarely used?

> Note that PAPI is just middleware. The tools built upon it are what

Surely the tools on top cannot use more than the middleware provides.

> - providing virtualized 64-bit counters per-thread
> - providing notification (buffered or non) on interrupt/overflow of
> the above.

Ok, that makes sense and should be possible with a reasonably simple
interface.

> If you'd like to outline further what you'd like to hear from the
> community, I can arrange that. I seem to remember going through this
> once before, but I'd be happy to do it again. For reference, here's a
> quick list from memory of some of the tools in active use and built
> on this infrastructure. These are used heavily around the globe.

Please list concrete features; throwing around random names is not useful.

-Andi

2007-11-14 02:11:34

by Andi Kleen

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news


[dropped all these bouncing email lists. Adding closed lists to public
cc lists is just a bad idea]

> int
> main(int argc, char **argv)
> {
> int ctx_fd;
> pfarg_pmd_t pd[1];
> pfarg_pmc_t pc[1];
> pfarg_ctx_t ctx;
> pfarg_load_t load_args;
>
> memset(&ctx, 0, sizeof(ctx));
> memset(pc, 0, sizeof(pc));
> memset(pd, 0, sizeof(pd));
>
> /* create session (context) and get file descriptor back (identifier) */
> ctx_fd = pfm_create_context(&ctx, NULL, NULL, 0);

There's nothing in your example that makes the file descriptor needed.

>
> /* setup one config register (PMC0) */
> pc[0].reg_num = 0
> pc[0].reg_value = 0x1234;

That would be nicer if it was just two arguments.

>
> /* setup one data register (PMD0) */
> pd[0].reg_num = 0;
> pd[0].reg_value = 0;

Why do you need to set the data register? Wouldn't it make
more sense to let the kernel handle that and just return one?

>
> /* program the registers */
> pfm_write_pmcs(ctx_fd, pc, 1);
> pfm_write_pmds(ctx_fd, pd, 1);
>
> /* attach the context to self */
> load_args.load_pid = getpid();
> pfm_load_context(ctx_fd, &load_args);

My replacement would be to just add a flags argument to write_pmcs,
with one flag bit meaning "GLOBAL CONTEXT" versus "MY CONTEXT".
>
> /* activate monitoring */
> pfm_start(ctx_fd, NULL);

Why can't that be done by the call setting up the register?

Or if someone needs to do it for a specific region they can read
the register before and then afterwards.

>
> /*
> * run code to measure
> */
>
> /* stop monitoring */
> pfm_stop(ctx_fd);
>
> /* read data register */
> pfm_read_pmds(ctx_fd, pd, 1);

On x86 I think it would be much simpler to just let the set/alloc
register call return a number and then use RDPMC directly. That would
actually be faster and much simpler too.

I suppose most architectures have similar facilities; if not, a call could be
added for them, but it's not really essential. The call might also be needed
for event multiplexing, but frankly I would just leave that out for now.

e.g. here is one use case I would personally see as useful. We need
a replacement for simple cycle counting since RDTSC doesn't do that anymore
on modern x86 CPUs. It could be something like:

/* 0 is the initial value */

/* could be either library or syscall */
event = get_event(COUNTER_CYCLES);
if (event < 0)
/* CPU has no cycle counter */

reg = setup_perfctr(event, 0 /* value */, LOCAL_EVENT); /* syscall */

rdpmc(reg, start);
.... some code to run ...
rdpmc(reg, end);

free_perfctr(reg); /* syscall */

On other architectures rdpmc would be different of course, but
the rest could probably be similar.
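
For reference, the x86 rdpmc() helper itself is trivial; here is a rough,
untested sketch (counter index in ECX, 64-bit result in EDX:EAX), written
as returning the value rather than taking an output argument like above:

	static inline unsigned long long rdpmc(unsigned int counter)
	{
		unsigned int lo, hi;

		/* RDPMC: counter index in ECX, result returned in EDX:EAX */
		asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (counter));
		return ((unsigned long long)hi << 32) | lo;
	}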

-Andi

2007-11-14 07:25:18

by Paul Mackerras

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Andrew Morton writes:

> I was hoping that after the round of release-and-review which Stephane,
> Andi and I did about twelve months ago that we were on track to merge the
> perfmon codebase as-offered. But now it turns out that the sentiment is
> that the code simply has too many bells-and-whistles to be acceptable.

Whose sentiment?

I've had a bit of a look at it today together with David Gibson. Our
impression is that the latest version is a lot cleaner and simpler
than it used to be. I'm also reading Stephane's technical report
which describes the interface, and whilst I'm only part-way through
it, I haven't seen anything yet which strikes me as unnecessary or
overly complicated.

Paul.

2007-11-14 07:41:38

by Andrew Morton

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Wed, 14 Nov 2007 18:24:36 +1100 Paul Mackerras <[email protected]> wrote:

> Andrew Morton writes:
>
> > I was hoping that after the round of release-and-review which Stephane,
> > Andi and I did about twelve months ago that we were on track to merge the
> > perfmon codebase as-offered. But now it turns out that the sentiment is
> > that the code simply has too many bells-and-whistles to be acceptable.
>
> Whose sentiment?

Andi and hch, maybe others I've forgotten about.

> I've had a bit of a look at it today together with David Gibson. Our
> impression is that the latest version is a lot cleaner and simpler
> than it used to be. I'm also reading Stephane's technical report
> which describes the interface, and whilst I'm only part-way through
> it, I haven't seen anything yet which strikes me as unnecessary or
> overly complicated.

Yes, that's quite possible. I don't know how up-to-date people's
knowledge is. I know I haven't looked seriously at the code in around
twelve months.

Let's get it on the wires as outlined and take a look at it all.

2007-11-14 10:38:28

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Wed, Nov 14, 2007 at 06:24:36PM +1100, Paul Mackerras wrote:
> Whose sentiment?

Mine for example. The whole userspace interface is just on crack,
and the code is full of complexities as well.

2007-11-14 10:43:36

by Paul Mackerras

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Christoph Hellwig writes:

> Mine for example. The whole userspace interface is just on crack,
> and the code is full of complexities aswell.

Could you give some _technical_ details of what you don't like?

Paul.

2007-11-14 11:00:29

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Wed, Nov 14, 2007 at 09:43:02PM +1100, Paul Mackerras wrote:
> Christoph Hellwig writes:
>
> > Mine for example. The whole userspace interface is just on crack,
> > and the code is full of complexities aswell.
>
> Could you give some _technical_ details of what you don't like?

I've done this a gazillion times before, so maybe instead of being a lazy
bastard you could look up the mailing list archives. It's not like this is the
first discussion of perfmon. But to get started, look at the system calls;
many of them are beasts like:

int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)

This is basically a read(2) (or, for other syscalls, a write) on something
other than the file descriptor provided to the system call. The right thing
to do is obviously to have pmds and pmcs files in procfs for the thread being
monitored instead of these special-case files, with another set for global
tracing. Similarly, I'm pretty sure we can get a much better interface
if we introduce matching files in procfs for the other calls.
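
(Purely as an illustration of that direction -- none of these files exist
today, all names made up -- a tool would then end up doing something like:

	/* hypothetical: per-thread counter file under /proc */
	char path[64];
	u64 pmds[8];
	int fd;

	/* pid = thread being monitored */
	snprintf(path, sizeof(path), "/proc/%d/pmds", pid);
	fd = open(path, O_RDONLY);
	read(fd, pmds, sizeof(pmds));
	close(fd);

with the pmcs file handled the same way on the write side.)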

2007-11-14 11:12:28

by David Miller

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

From: Christoph Hellwig <[email protected]>
Date: Wed, 14 Nov 2007 11:00:09 +0000

> I've done this a gazillion times before, so maybe instead of beeing a lazy
> bastard you could look up mailinglist archive. It's not like this is the
> first discussion of perfmon. But to get start look at the systems calls,
> many of them are beasts like:
>
> int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
>
> This is basically a read(2) (or for other syscalls a write) on something
> else than the file descriptor provided to the system call. The right thing
> to do is obviously have a pmds and pmcs file in procfs for the thread beeing
> monitored instead of these special-case files, with another set for global
> tracing. Similarly I'm pretty sure we can get a much better interface
> if we introduce marching files in procfs for the other calls.

This is my impression too: all of the things being done with
a slew of system calls would be better served by real special
files and appropriate fops. Whether the thing is some kind
of misc device or procfs is less important than simply getting
away from these system calls.

2007-11-14 11:14:50

by David Miller

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news


Ok, I just got 4 freakin' bounces from all of these subscriber only
perfmon etc. mailing lists.

Please remove those lists from the CC: as it's pointless for those of
us not on the lists to participate if those lists can't even see the
feedback we are giving.

2007-11-14 11:45:37

by Paul Mackerras

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

David Miller writes:

> This is my impression too, all of the things being done with
> a slew of system calls would be better served by real special
> files and appropriate fops.

Special files and fops really only work well if you can coerce the
interface into one where data flows predominantly one way. I don't
think they work so well for something that is more like an RPC across
the user/kernel barrier. For that a system call is better.

For instance, if you have something that kind-of looks like

read_pmds(int n, int *pmd_numbers, u64 *pmd_values);

where the caller supplies an array of PMD numbers and the function
returns their values (and you want that reading to be done atomically
in some sense), how would you do that using special files and fops?

> Whether the thing is some kind
> of misc device or procfs is less important than simply getting
> away from these system calls.

Why? What's inherently offensive about system calls?

Paul.

2007-11-14 11:45:51

by Paul Mackerras

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Christoph Hellwig writes:

> int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
>
> This is basically a read(2) (or for other syscalls a write) on something
> else than the file descriptor provided to the system call.

No it's not basically a read(). It's more like a request/reply
interface, which a read()/write() interface doesn't handle very well.
The request in this case is "tell me about this particular collection
of PMDs" and the reply is the values.

It seems to me that an important part of this is to be able to collect
values from several PMDs at a single point in time, or at least an
approximation to a single point in time. So that means that you don't
want a file per PMD either.

Basically we don't have a good abstraction for a request/reply (or
command/response) type of interface, and this is a case where we need
one. Having a syscall that takes a struct containing the request and
reply is as good a way as any, particularly for something that needs
to be quick.

Paul.

2007-11-14 11:52:21

by Nick Piggin

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Wednesday 14 November 2007 22:44, Paul Mackerras wrote:
> David Miller writes:
> > This is my impression too, all of the things being done with
> > a slew of system calls would be better served by real special
> > files and appropriate fops.
>
> Special files and fops really only work well if you can coerce the
> interface into one where data flows predominantly one way. I don't
> think they work so well for something that is more like an RPC across
> the user/kernel barrier. For that a system call is better.
>
> For instance, if you have something that kind-of looks like
>
> read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
>
> where the caller supplies an array of PMD numbers and the function
> returns their values (and you want that reading to be done atomically
> in some sense), how would you do that using special files and fops?

Could you implement it with readv()?

2007-11-14 11:52:50

by David Miller

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

From: Paul Mackerras <[email protected]>
Date: Wed, 14 Nov 2007 22:44:56 +1100

> For instance, if you have something that kind-of looks like
>
> read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
>
> where the caller supplies an array of PMD numbers and the function
> returns their values (and you want that reading to be done atomically
> in some sense), how would you do that using special files and fops?

The same way we handle some of the multicast "getsockopt()"
calls. The parameters passed in are both inputs and outputs.

For the above example:

struct pmd_info {
int *pmd_numbers;
u64 *pmd_values;
int n;
} *p;

buffer_size = N;
p = malloc(buffer_size);
p->pmd_numbers = p + foo;
p->pmd_values = p + bar;
p->n = whatever(N);
err = read(fd, p, N);

It's definitely doable, use your imagination.

You can encode all kinds of operation types into the
header as well.
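
A fixed-size variant of the same idea, just to make the layout explicit
(all names made up):

	/* the buffer doubles as the request header (illustrative only) */
	struct pmd_read_req {
		int n;              /* in:  how many counters      */
		int op;             /* in:  operation type         */
		int pmd_numbers[8]; /* in:  which PMDs to read     */
		u64 pmd_values[8];  /* out: filled in by the read  */
	};

	struct pmd_read_req req = { .n = 2, .pmd_numbers = { 0, 1 } };

	err = read(fd, &req, sizeof(req));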

Another alternative is to use generic netlink.

2007-11-14 11:53:19

by David Miller

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

From: Paul Mackerras <[email protected]>
Date: Wed, 14 Nov 2007 22:39:24 +1100

> No it's not basically a read(). It's more like a request/reply
> interface, which a read()/write() interface doesn't handle very well.

Yes it can, see my other reply.

2007-11-14 11:59:08

by David Miller

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

From: Nick Piggin <[email protected]>
Date: Wed, 14 Nov 2007 10:49:48 +1100

> On Wednesday 14 November 2007 22:44, Paul Mackerras wrote:
> > David Miller writes:
> > > This is my impression too, all of the things being done with
> > > a slew of system calls would be better served by real special
> > > files and appropriate fops.
> >
> > Special files and fops really only work well if you can coerce the
> > interface into one where data flows predominantly one way. I don't
> > think they work so well for something that is more like an RPC across
> > the user/kernel barrier. For that a system call is better.
> >
> > For instance, if you have something that kind-of looks like
> >
> > read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
> >
> > where the caller supplies an array of PMD numbers and the function
> > returns their values (and you want that reading to be done atomically
> > in some sense), how would you do that using special files and fops?
>
> Could you implement it with readv()?

Sure, why not? Just cook up an iovec. pmd_numbers goes to offset
X and pmd_values goes to offset Y, with some helpers like what
we have in the networking already for recvmsg.

But why would you want readv() for this? The syscall thing
Paul asked me to translate into a read() doesn't provide
iovec-like behavior so I don't see why readv() is necessary
at all.

2007-11-14 12:04:17

by Paul Mackerras

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

David Miller writes:

> The same way we handle some of the multicast "getsockopt()"
> calls. The parameters passed in are both inputs and outputs.

For a read??!!!

> For the above example:
>
> struct pmd_info {
> int *pmd_numbers;
> u64 *pmd_values;
> int n;
> } *p;
>
> buffer_size = N;
> p = malloc(buffer_size);
> p->pmd_numbers = p + foo;
> p->pmd_values = p + bar;
> p->n = whatever(N);
> err = read(fd, p, N);

You're suggesting that the behaviour of a read() should depend on what
was in the buffer before the read? Gack! Surely you have better
taste than that?

Or are you saying that a read (or write) has a side-effect of altering
some other area of memory besides the buffer you give to read()? That
seems even worse to me.

> Another alternative is to use generic netlink.

Then you end up with two system calls to get the data rather than one
(one to send the request and another to read the reply). For
something that needs to be quick that is a suboptimal interface.

Paul.

2007-11-14 12:08:17

by David Miller

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

From: Paul Mackerras <[email protected]>
Date: Wed, 14 Nov 2007 23:03:24 +1100

> You're suggesting that the behaviour of a read() should depend on what
> was in the buffer before the read? Gack! Surely you have better
> taste than that?

Absolutely that's what I mean, it's atomic and gives you exactly what
you need.

I see nothing wrong or gross with these semantics. Nothing in the
"book of UNIX" specifies that for a device or special file the passed
in buffer cannot contain input control data.

> > Another alternative is to use generic netlink.
>
> Then you end up with two system calls to get the data rather than one
> (one to send the request and another to read the reply). For
> something that needs to be quick that is a suboptimal interface.

Not necessarily, consider the possibility of using recvmsg() control
message data. With that it could be done in one go.

This also suggests that it could be implemented as its own protocol
family.

2007-11-14 12:30:31

by Nick Piggin

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Wednesday 14 November 2007 22:58, David Miller wrote:
> From: Nick Piggin <[email protected]>
> Date: Wed, 14 Nov 2007 10:49:48 +1100
>
> > On Wednesday 14 November 2007 22:44, Paul Mackerras wrote:
> > > David Miller writes:
> > > > This is my impression too, all of the things being done with
> > > > a slew of system calls would be better served by real special
> > > > files and appropriate fops.
> > >
> > > Special files and fops really only work well if you can coerce the
> > > interface into one where data flows predominantly one way. I don't
> > > think they work so well for something that is more like an RPC across
> > > the user/kernel barrier. For that a system call is better.
> > >
> > > For instance, if you have something that kind-of looks like
> > >
> > > read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
> > >
> > > where the caller supplies an array of PMD numbers and the function
> > > returns their values (and you want that reading to be done atomically
> > > in some sense), how would you do that using special files and fops?
> >
> > Could you implement it with readv()?
>
> Sure, why not? Just cook up an iovec. pmd_numbers goes to offset
> X and pmd_values goes to offset Y, with some helpers like what
> we have in the networking already for recvmsg.
>
> But why would you want readv() for this? The syscall thing
> Paul asked me to translate into a read() doesn't provide
> iovec-like behavior so I don't see why readv() is necessary
> at all.

Ah sorry, that's what I get for typing before I think: of course
readv doesn't vectorise the right part of the equation.

What I really mean is a readv-like syscall, but one that also
vectorises the file offset. Maybe this is useful enough as a generic
syscall that also helps Paul's example...

Of course, I guess this all depends on whether the atomicity is an
important requirement. If not, you can obviously just do it with
multiple read syscalls...

2007-11-14 12:34:45

by Nick Piggin

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Wednesday 14 November 2007 23:07, David Miller wrote:
> From: Paul Mackerras <[email protected]>
> Date: Wed, 14 Nov 2007 23:03:24 +1100
>
> > You're suggesting that the behaviour of a read() should depend on what
> > was in the buffer before the read? Gack! Surely you have better
> > taste than that?
>
> Absolutely that's what I mean, it's atomic and gives you exactly what
> you need.
>
> I see nothing wrong or gross with these semantics. Nothing in the
> "book of UNIX" specifies that for a device or special file the passed
> in buffer cannot contain input control data.

True, but is it now any different from an ioctl?

2007-11-14 12:38:50

by Andi Kleen

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Christoph Hellwig <[email protected]> writes:
>
> I've done this a gazillion times before, so maybe instead of beeing a lazy
> bastard you could look up mailinglist archive. It's not like this is the
> first discussion of perfmon. But to get start look at the systems calls,
> many of them are beasts like:
>
> int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
>
> This is basically a read(2) (or for other syscalls a write) on something

At least for x86, and I suspect some other architectures, we don't
initially need a syscall at all for this. There is an instruction,
RDPMC, which can read a performance counter just fine. It is also much
faster and generally preferable for the case where a process measures
events about itself. In fact it is essential for one of the use cases
I would like to see perfmon used for (replacement of RDTSC for cycle
counting).

Later a syscall might be needed for event multiplexing, but that seems
more like a far-away, non-essential feature.

> else than the file descriptor provided to the system call. The right thing

I don't like read/write for this too much. I think it's better to
have individual syscalls. After all that is CPU state and having
syscalls for that does seem reasonable.

-Andi

2007-11-14 13:11:35

by Stephane Eranian

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Andi,

On Wed, Nov 14, 2007 at 03:07:02AM +0100, Andi Kleen wrote:
>
> [dropped all these bouncing email lists. Adding closed lists to public
> cc lists is just a bad idea]
>

Just want to make sure perfmon2 users participate in this discussion.

> > int
> > main(int argc, char **argv)
> > {
> > int ctx_fd;
> > pfarg_pmd_t pd[1];
> > pfarg_pmc_t pc[1];
> > pfarg_ctx_t ctx;
> > pfarg_load_t load_args;
> >
> > memset(&ctx, 0, sizeof(ctx));
> > memset(pc, 0, sizeof(pc));
> > memset(pd, 0, sizeof(pd));
> >
> > /* create session (context) and get file descriptor back (identifier) */
> > ctx_fd = pfm_create_context(&ctx, NULL, NULL, 0);
>
> There's nothing in your example that makes the file descriptor needed.
>

Partially true. The file descriptor becomes really useful when you sample.
You leverage the file descriptor to receive notifications of counter overflows
and of a full sampling buffer. You extract notification messages via read(), and
you can use SIGIO or select/poll.
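
For instance, a sampling tool typically ends up with a loop along these lines
(simplified sketch; I am calling the notification message type pfm_msg_t here
just for illustration):

	/* wait for overflow / buffer-full notifications on the session fd */
	struct pollfd pfd = { .fd = ctx_fd, .events = POLLIN };

	for (;;) {
		pfm_msg_t msg;

		if (poll(&pfd, 1, -1) <= 0)
			break;
		if (read(ctx_fd, &msg, sizeof(msg)) != sizeof(msg))
			break;
		/* process the samples, then restart monitoring */
	}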

The example shows how you can leverage existing mechanisms to destroy the session, i.e.,
free the associated kernel resources. For that, you use close() instead of adding yet
another syscall. It also provides a resource limitation mechanism to control consumption
of kernel memory, i.e., you can only create as many sessions as you can have open files.

> >
> > /* setup one config register (PMC0) */
> > pc[0].reg_num = 0
> > pc[0].reg_value = 0x1234;
>
> That would be nicer if it was just two arguments.
>
Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)?

That would be quite expensive when you have lots of registers to set up: one
syscall per register. The perfmon syscalls to read/write registers accept vectors
of arguments to amortize the cost of the syscall over multiple registers
(similar to poll(2)).

With many tools, registers are not just set up once. During certain measurements,
data registers may be read multiple times. When you sample or multiplex at
the user level, you do need to reprogram the PMU state, and that is on the critical
path.

You do not want a call that programs the entire PMU state all at once either. Many times,
you only want to modify a small subset. Exposing the full state also causes some portability
problems.


> >
> > /* setup one data register (PMD0) */
> > pd[0].reg_num = 0;
> > pd[0].reg_value = 0;
>
> Why do you need to set the data register? Wouldn't it make
> more sense to let the kernel handle that and just return one.
>
It depends on what you are doing. Here, this was not really necessary. It was
meant to show how you can program the data registers as well. Perfmon2 provides
default values for all data registers. For counters, the value is guaranteed to
be zero.

But it is important to note that not all data registers are counters. That is the
case on Itanium 2, where some are just buffers. With AMD Barcelona IBS, several are buffers as
well, and some may need to be initialized to a non-zero value, e.g., the IBS sampling
period.

With event-based sampling, the period is expressed as the number of occurrences
of an event. For instance, you can say: "take a sample every 2000 L2 cache misses".
The way you express this with perfmon2 is that you program a counter to measure
L2 cache misses, and then you initialize the corresponding data register (counter)
to overflow after 2000 occurrences. Given that the interface guarantees all counters
are 64-bit regardless of the hardware, you simply have to program the counter to -2000.
Thus you see that you need a call to actually program the data registers.
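
In terms of the earlier example, that is roughly (sketch only; the reset values
used for subsequent samples are left out):

	/* sample every 2000 occurrences of the event programmed in PMC0 */
	pd[0].reg_num   = 0;
	pd[0].reg_value = (unsigned long long)-2000; /* overflows after 2000 increments */

	pfm_write_pmds(ctx_fd, pd, 1);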

> >
> > /* program the registers */
> > pfm_write_pmcs(ctx_fd, pc, 1);
> > pfm_write_pmds(ctx_fd, pd, 1);
> >
> > /* attach the context to self */
> > load_args.load_pid = getpid();
> > pfm_load_context(ctx_fd, &load_args);
>
> My replacement would be to just add a flags argument to write_pmcs
> with one flag bit meaning "GLOBAL CONTEXT" versus "MY CONTEXT"
> >

You are mixing PMU programming with the type of measurement you want to do.

Perfmon2 decouples the two operations. In fact, no PMU hardware is actually touched
before you attach to either a CPU or a thread. This way, you can prepare your measurement
and then attach-and-go. Thus it is possible to create batches of ready-to-go sessions.
That is useful, for instance, when you are trying to measure across fork or pthread_create,
which you can catch on the fly.

Take the per-thread case: you can set up your session before you fork/exec the program
you want to measure.

Note also that perfmon2 supports attaching to an already running thread. So there is
more than "GLOBAL CONTEXT" versus "MY CONTEXT".


> > /* activate monitoring */
> > pfm_start(ctx_fd, NULL);
>
> Why can't that be done by the call setting up the register?
>

Good question. If you do what you say, you assume that the start/stop bit lives in the
config (or data) registers of the PMU. This is not true on all hardware. On Itanium,
for instance, the start/stop bit is part of the Processor Status Register (psr).
That is not a PMU register.

On X86, you set the enable bit in PERFEVTSEL, but nothing really happens until you issue
pfm_start(), i.e., the PERFEVTSEL registers are not touched until then.

> Or if someone needs to do it for a specific region they can read
> the register before and then afterwards.
>
> >
> > /*
> > * run code to measure
> > */
> >
> > /* stop monitoring */
> > pfm_stop(ctx_fd);
> >
> > /* read data register */
> > pfm_read_pmds(ctx_fd, pd, 1);
>
> On x86 i think it would be much simpler to just let the set/alloc
> register call return a number and then use RDPMC directly. That would
> be actually faster and be much simpler too.
>
One approach does not prevent the other. Assuming you allow cr4.pce, then nothing prevents
a self-monitoring thread from reading the counters directly. You'll just get the
lower 32 bits of it. So if you read frequently enough, you should not have a problem.

But keep in mind that we do want a uniform interface across all hardware and all types
of sessions (self-monitoring, CPU-wide, monitoring of another thread). You don't want
an interface that says on x86 you have to use rdpmc, on Itanium pfm_read_pmds(), and so
on. You want an interface that guarantees that with pfm_read_pmds() you'll be able to
read on any hardware platform; then on some you may be able to use a more efficient
method, e.g., rdpmc on X86.

Reducing performance monitoring to self-monitoring is not what we want. In fact, there
are only a few domains where you can actually do this and HPC is one of them. But in
many other situations, you cannot and don't want to have to instrument applications
or libraries to collect performance data. It is quite handy to be able to do:
$ pfmon /bin/ls
or
$ pfmon --attach-task=`pidof sshd` -timeout=10s


Also note that there is no guarantee that RDPMC allows you to access all data registers
on a PMU. For instance, on AMD Barcelona, it seems you cannot read the IBS register using
RDPMC.


> I suppose most architectures have similar facilities, if not a call could be
> added for them but it's not really essential. The call might be also needed
> for event multiplexing, but frankly I would just leave that out for now.
>
Itanium does allow user level read of data registers. It also allows start/stop.
Perfmon2 allows this only for self-monitoring per-thread sessions.

I think restricting per-thread mode to only self-monitoring is just too limiting
even for a start.


> e.g. here is one use case I would personally see as useful. We need
> a replacement for simple cycle counting since RDTSC doesn't do that anymore
> on modern x86 CPUs. It could be something like:
>
You can do exactly this with the perfmon2 interface as it exists today.
Your example is perfectly fine; the interface works in your case.

But you are driving the design of the interface from your very specific need
and you are ignoring all the other usage models. This has been a problem with so
many other interfaces and that explains the current situation. You have to
take a broader view, look at what the hardware (across the board) provides and
build from there. We do not need yet another interface to support one tool or one
type of measurement, we need a true programming interface with a uniform set
of calls. So sure, several calls may look overkill for basic measurements, but
they become necessary with others.

> /* 0 is the initial value */
>
> /* could be either library or syscall */
> event = get_event(COUNTER_CYCLES);
> if (event < 0)
> /* CPU has no cycle counter */
>
> reg = setup_perfctr(event, 0 /* value */, LOCAL_EVENT); /* syscall */
>
> rdpmc(reg, start);
> .... some code to run ...
> rdpmc(reg, end);
>
> free_perfctr(reg); /* syscall */
>
--
-Stephane

2007-11-14 13:54:30

by Stephane Eranian

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Hello,

On Wed, Nov 14, 2007 at 10:39:24PM +1100, Paul Mackerras wrote:
> Christoph Hellwig writes:
>
> > int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
> >
> > This is basically a read(2) (or for other syscalls a write) on something
> > else than the file descriptor provided to the system call.
>
> No it's not basically a read(). It's more like a request/reply
> interface, which a read()/write() interface doesn't handle very well.
> The request in this case is "tell me about this particular collection
> of PMDs" and the reply is the values.
>

Exactly. This is not a brute force read()! On input you pass the list
of registers you want to read. Upon return, you get the list of values.

Now, I think the current call could be optimized even more by making
the structure smaller. Today, the structure passed to read and write
PMD registers is the same. On write, we pass other information such as
the reset values (sampling periods), randomization parameters and some
flags. They are not needed on read.

> It seems to me that an important part of this is to be able to collect
> values from several PMDs at a single point in time, or at least an
> approximation to a single point in time. So that means that you don't
> want a file per PMD either.
>

Yes, we want to be able to read one or many registers in one call.
The number of PMU counters is not going to shrink, so having a file
descriptor per register looks overkill to me.

> Basically we don't have a good abstraction for a request/reply (or
> command/response) type of interface, and this is a case where we need
> one. Having a syscall that takes a struct containing the request and
> reply is as good a way as any, particularly for something that needs
> to be quick.
>

--
-Stephane

2007-11-14 13:58:19

by Stephane Eranian

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news


On Wed, Nov 14, 2007 at 10:44:56PM +1100, Paul Mackerras wrote:
> David Miller writes:
>
> > This is my impression too, all of the things being done with
> > a slew of system calls would be better served by real special
> > files and appropriate fops.
>
> Special files and fops really only work well if you can coerce the
> interface into one where data flows predominantly one way. I don't
> think they work so well for something that is more like an RPC across
> the user/kernel barrier. For that a system call is better.
>
> For instance, if you have something that kind-of looks like
>
> read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
>
> where the caller supplies an array of PMD numbers and the function
> returns their values (and you want that reading to be done atomically
> in some sense), how would you do that using special files and fops?
>
Yes, the read call could be simplified to the level proposed above by Paul.

--
-Stephane

2007-11-14 14:20:41

by Stephane Eranian

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Andi,

On Wed, Nov 14, 2007 at 01:38:38PM +0100, Andi Kleen wrote:
> Christoph Hellwig <[email protected]> writes:
> >
> > I've done this a gazillion times before, so maybe instead of beeing a lazy
> > bastard you could look up mailinglist archive. It's not like this is the
> > first discussion of perfmon. But to get start look at the systems calls,
> > many of them are beasts like:
> >
> > int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
> >
> > This is basically a read(2) (or for other syscalls a write) on something
>
> At least for x86 and I suspect some 1other architectures we don't
> initially need a syscall at all for this. There is an instruction
> RDPMC who can read a performance counter just fine. It is also much
> faster and generally preferable for the case where a process measures
> events about itself. In fact it is essential for one of the use cases
> I would like to see perfmon used (replacement of RDTSC for cycle
> counting)
>

This only works when counting (not sampling) and only for self-monitoring.

> Later a syscall might be needed with event multiplexing, but that seems
> more like a far away non essential feature.
>
On a machine with only two generic counters, such as MIPS or Intel Core 2 Duo,
multiplexing offers some advantages. If the NMI watchdog is enabled, then you drop
to one generic counter on Core 2.

> > else than the file descriptor provided to the system call. The right thing
>
> I don't like read/write for this too much. I think it's better to
> have individual syscalls. After all that is CPU state and having
> syscalls for that does seem reasonable.

As I said earlier, we do use read(), not for reading counters but to extract overflow
notification messages when we are sampling. It makes more sense for this usage because
this is where you want to leverage some key mechanisms such as:

- asynchronous notification via SIGIO. This is how you can implement self-sampling,
for instance.

- select/poll to allow monitoring tools to wait for notifications coming from
multiple sessions in one call. This is useful when monitoring across fork or
pthread_create.

--
-Stephane

2007-11-14 14:24:27

by Andi Kleen

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Wed, Nov 14, 2007 at 05:09:09AM -0800, Stephane Eranian wrote:
>
> Partially true. The file descriptor becomes really useful when you sample.
> You leverage the file descriptor to receive notifications of counter overflows
> and full sampling buffer. You extract notification messages via read() and you can
> use SIGIO, select/poll.

Hmm, ok, for the event notification we would need a nice interface. I still
have my doubts that a file descriptor is the best way to do this, though.

> Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)?

See my example below.
>
> That would be quite expensive when you have lots of registers to setup: one
> syscall per register. The perfmon syscalls to read/write registers accept vector
> of arguments to amortize the cost of the syscall over multiple registers
> (similar to poll(2)).


First, system calls are not that slow on Linux. Measure it.


>
> With many tools, registers are not just setup once. During certain measurements,
> data registers may be read multiple times. When you sample or multiplex at

I think you are optimizing the wrong thing here.

There are basically two cases I see:

- Global measurement of lots of things:
Things are slow anyway, with large context-switch overheads. Doing one
or more system calls probably does not matter much. Most important is
a clean interface.

- Exact measurement of the current process. For that you need very
low latencies. Any system call is too slow. That is why CPUs have
instructions like RDPMC that allow reading those registers with
minimal latency in user space. The interface should support those.

Also for this case programming time does not matter too much. You
just program once, then do RDPMC before and after the code to measure
and take the difference. The actual counter setup is out
of the latency-critical path.


> It depends on what you are doing. Here, this was not really necessary. It was
> meant to show how you can program the data registers as well. Perfmon2 provides
> default values for all data registers. For counters, the value is guaranteed to
> be zero.
>
> But it is important to note that not all data registers are counters. That is the
> case of Itanium 2, some are just buffers. On AMD Barcelona IBS several are buffers as
> well, and some may need to be initialized to non zero value, i.e., the IBS sampling
> period.

Setting the period should be a separate call. Mixing the two together into one
does not look like a nice interface.

>
> With event-based sampling, the period is expressed as the number of occurrences
> of an event. For instance, you can say: " take a sample every 2000 L2 cache misses".
> The way you express this with perfmon2 is that you program a counter to measure
> L2 cache misses, and then you initialize the corresponding data register (counter)
> to overflow after 2000 occurrences. Given that the interface guarantees all counters
> are 64-bit regardless of the hardware, you simply have to program the counter to -2000.
> Thus you see that you need a call to actual program the data registers.

I didn't object to providing the initial value -- my example had that.
Just having a separate concept of data registers seems too complicated to me.
You should just pass event types and values and the kernel gives you
a register number.


> Perfmon2 decouples the two operations. In fact, no PMU hardware is actually touched
> before you attach to either a CPU or a thread. This way, you can prepare your measurement
> and then attach-and-go. Thus is is possible to create batches of ready-to-go sessions.
> That is useful, for instance, when you are trying to measure across fork, pthread_create
> which you can catch on-the-fly.
>
> Take the per-thread example, you can setup your session before you fork/exec the program
> you want to measure.

And? You didn't say what the advantage of that is.

All the approaches add context-switch latencies. It is not clear that the separate
session setup helps all that much.

>
> Note also that perfmon2 supports attaching to an already running thread. So there is
> more than "GLOBAL CONTEXT" versus "MY CONTEXT".

What is the use case of this? Do users use that?

>
>
> > > /* activate monitoring */
> > > pfm_start(ctx_fd, NULL);
> >
> > Why can't that be done by the call setting up the register?
> >
>
> Good question. If you do what say, you assume that the start/stop bit lives in the
> config (or data) registers of the PMU. This is not true on all hardware. On Itanium
> for instance, the start/stop bit is part of the Processor Status Register (psr).
> That is not a PMU register.


Well the system call layer can manage that transparently with a little software state
(counter). No need to expose it.

> One approach does not prevent the other. Assuming you allow cr4.pce, then nothing prevents
> a self-monitoring thread from reading the counters directly. You'll just get the
> lower 32-bit of it. So if you read frequently enough, you should not have a problem.

Hmm? RDPMC is 64bit.
>
> But keep in mind that we do want a uniform interface across all hardware and all type
> of sessions (self-monitoring, CPU-wide, monitoring of another thread). You don't want
> an interface that says on x86 you have to use rdpmc, on Itanium pfm_read_pmds() and so

I disagree. Using RDPMC is essential for at least some of the things I would like
to do with perfmon2. If the interface does not provide it, it is useless to me at least.
System calls are far too slow for cycle measurements.

And when RDPMC is already supported, it should be as widely used as possible.

Regarding the portable code problem: of course you would have some header in user space
that hides the details in a hopefully portable macro.
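
E.g. something along these lines (illustrative only, names made up, not a
proposal for an actual API):

	/* per-arch counter read hidden behind one userspace helper */
	static inline unsigned long long read_counter(int ctx_fd, int reg)
	{
	#ifdef __x86_64__
		unsigned int lo, hi;
		asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (reg));
		return ((unsigned long long)hi << 32) | lo;
	#else
		pfarg_pmd_t pd = { .reg_num = reg };
		pfm_read_pmds(ctx_fd, &pd, 1);	/* syscall fallback */
		return pd.reg_value;
	#endif
	}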

> on. You want an interface that guarantees that with pfm_read_pmds() you'll be able to
> read on any hardware platforms, then on some you may be able to use a more efficient
> method, e.g., rdpmc on X86.
>
> Reducing performance monitoring to self-monitoring is not what we want. In fact, there
> are only a few domains where you can actually do this and HPC is one of them. But in
> many other situations, you cannot and don't want to have to instrument applications
> or libraries to collect performance data. It is quite handy to be able to do:
> $ pfmon /bin/ls
> or
> $ pfmon --attach-task=`pidof sshd` -timeout=10s

I think only supporting global and self-monitoring as a first step is totally fine.
All the bells'n'whistles can be added later if users really want them.

>
>
> Also note that there is no guarantee that RDPMC allows you to access all data registers
> on a PMU. For instance, on AMD Barcelona, it seems you cannot read the IBS register using
> RDPMC.

Sure, at some point a system call for the more complex cases (like multiplexing) would
be needed. But I don't think we need it as a first step. The goal would be to define a
simple subset that is actually mergeable.

> But you are driving the design of the interface from your very specific need
> and you are ignoring all the other usage models. This has been a problem with so

I asked your noisy user base to specify more concrete use cases, but so far
they have not provided anything except rather vacuous complaints. Short of that I'll stick
with what I know currently.

> many other interfaces and that explains the current situation. You have to
> take a broader view, look at what the hardware (across the board) provides and
> build from there. We do not need yet another interface to support one tool or one


Well your "broad view" resulted in a incredible mess of interface moloch to be honest.
I really think we need a fresh start examining many of the underlying assumptions.

Regarding itanium: I suppose it could provide a RDPMC replacement using your
fast priviledged vsyscalls.

-Andi

2007-11-14 14:26:42

by Andi Kleen

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Wed, Nov 14, 2007 at 06:13:42AM -0800, Stephane Eranian wrote:
> > At least for x86 and I suspect some 1other architectures we don't
> > initially need a syscall at all for this. There is an instruction
> > RDPMC who can read a performance counter just fine. It is also much
> > faster and generally preferable for the case where a process measures
> > events about itself. In fact it is essential for one of the use cases
> > I would like to see perfmon used (replacement of RDTSC for cycle
> > counting)
> >
>
> This only works when counting (not sampling) and only for self-monitoring.

It works for global monitoring too.

>
> > Later a syscall might be needed with event multiplexing, but that seems
> > more like a far away non essential feature.
> >
> On a machine with only two generic counters such as MIPS or Intel Core 2 Duo,
> multiplexing offers some advantages. If NMI watchdog is enabled, then you drop
> to one generic counter on on Core 2.

The NMI watchdog is off by default now.

Yes, longer term we might need multiplexing, but definitely not as a first step.

-Andi

2007-11-14 15:48:48

by William Cohen

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Andi Kleen wrote:

>> One approach does not prevent the other. Assuming you allow cr4.pce, then nothing prevents
>> a self-monitoring thread from reading the counters directly. You'll just get the
>> lower 32-bit of it. So if you read frequently enough, you should not have a problem.
>
> Hmm? RDPMC is 64bit.

There are a number of processors that have 32-bit counters, such as the IBM POWER
processors. On many x86 processors the upper bits of the counter are sign
extended from the lower 32 bits. Thus, one can only assume the lower 32 bits are
available. Rollover of values is quite possible (<2 seconds of cycle count), so
additional work needs to be done to obtain a valid value.
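
(For deltas over short intervals the usual fix is to do the subtraction in the
counter's own width, e.g.:

	/* wraparound-safe delta for a counter that is only 32 bits wide */
	static unsigned long long counter_delta32(unsigned int start, unsigned int end)
	{
		/* unsigned subtraction wraps, so this stays correct across one rollover */
		return end - start;
	}

but that only helps if the interval cannot span more than one rollover.)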

>> But keep in mind that we do want a uniform interface across all hardware and all type
>> of sessions (self-monitoring, CPU-wide, monitoring of another thread). You don't want
>> an interface that says on x86 you have to use rdpmc, on Itanium pfm_read_pmds() and so
>
> I disagree. Using RDPMC is essential for at least some of the things I would like
> to do with perfmon2. If the interface does not provide it it is useless to me at least.
> System calls are far too slow for cycle measurements.

What range of cycles are you interested in measuring? 100's of cycles? A couple
thousand? Are you just looking at cycle counts or other events?

-Will

2007-11-14 16:14:18

by Stephane Eranian

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news


On Wed, Nov 14, 2007 at 10:44:20AM -0500, William Cohen wrote:
> Andi Kleen wrote:
>
> >>One approach does not prevent the other. Assuming you allow cr4.pce, then
> >>nothing prevents
> >>a self-monitoring thread from reading the counters directly. You'll just
> >>get the
> >>lower 32-bit of it. So if you read frequently enough, you should not have
> >>a problem.
> >
> >Hmm? RDPMC is 64bit.
>
> There are a number of processors that have 32-bit counters such as the IBM
> power processors. On many x86 processors the upper bits of the counter are
> sign extended from the lower 32 bits. Thus, one can only assume the lower
> 32-bit are available. Roll over of values is quite possible (<2 seconds of
> cycle count), so additional work needs to be done to obtain a valid value.
>

Exactly: on Intel only the bottom 32 bits are actually usable, the rest is
sign extension. That's why it is okay for measuring small sections of code,
but that's it. On AMD, I think it is better. On Itanium you get the full 47 bits.
Don't know about Power or Cell.

--
-Stephane

2007-11-14 18:59:35

by Philippe Elie

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Wed, 14 Nov 2007 at 10:44 +0000, Will Cohen wrote:

> Andi Kleen wrote:
>
> >>One approach does not prevent the other. Assuming you allow cr4.pce, then
> >>nothing prevents
> >>a self-monitoring thread from reading the counters directly. You'll just
> >>get the
> >>lower 32-bit of it. So if you read frequently enough, you should not have
> >>a problem.
> >
> >Hmm? RDPMC is 64bit.
>
> There are a number of processors that have 32-bit counters such as the IBM
> power processors. On many x86 processors the upper bits of the counter are
> sign extended from the lower 32 bits. Thus, one can only assume the lower
> 32-bit are available. Roll over of values is quite possible (<2 seconds of
> cycle count), so additional work needs to be done to obtain a valid value.

On x86 they are sign-extended only on write; on read they are 40 bits wide
for Intel, 48 bits for AMD.

BTW, isn't rdpmc only enabled for ring 0 on Linux? I remember a patch
to disable it, dunno if it has been applied.

--
Phe

2007-11-14 19:16:14

by Andi Kleen

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

> BTW, isn't rdpmc only enable for ring 0 on linux ? I remember a patch
> to disable it, dunno if it has been applied.

Obviously -- without a system call to set up performance counters it
would be fairly useless. But of course once such system calls are in
they should be able to trigger the bit for each process.

-Andi

2007-11-14 19:48:38

by David Miller

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

From: Andi Kleen <[email protected]>
Date: Wed, 14 Nov 2007 13:38:38 +0100

> At least for x86 and I suspect some 1other architectures we don't
> initially need a syscall at all for this. There is an instruction
> RDPMC who can read a performance counter just fine. It is also much
> faster and generally preferable for the case where a process measures
> events about itself. In fact it is essential for one of the use cases
> I would like to see perfmon used (replacement of RDTSC for cycle
> counting)

I wouldn't even want to use a syscall for something like
that on Sparc; I'd rather give this a dedicated software
trap so that I can code it completely in assembler.

2007-11-14 21:50:59

by Paul Mackerras

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

David Miller writes:

> > You're suggesting that the behaviour of a read() should depend on what
> > was in the buffer before the read? Gack! Surely you have better
> > taste than that?
>
> Absolutely that's what I mean, it's atomic and gives you exactly what
> you need.
>
> I see nothing wrong or gross with these semantics. Nothing in the
> "book of UNIX" specifies that for a device or special file the passed
> in buffer cannot contain input control data.

Ohhhhh.... kayyyyy.... *shudders*

It really violates the abstract model of "read" pretty badly. "Read"
is "fill in the buffer with data from the device", not "do some
arbitrary stuff with this area of memory".

I'd prefer to have a transaction() system call like I suggested to
Nick rather than overloading read() like this.

> > Then you end up with two system calls to get the data rather than one
> > (one to send the request and another to read the reply). For
> > something that needs to be quick that is a suboptimal interface.
>
> Not necessarily, consider the possibility of using recvmsg() control
> message data. With that it could be done in one go.
>
> This also suggests that it could be implemented as it's own protocol
> family.

There are all sorts of possible ways it could be implemented. On
the one hand we have an actual proposed implementation, and on the
other we have various people saying "oh but it could be implemented
this other way" without providing any actual code.

Now if those people can show that their way of doing it is
significantly simpler and better than the existing implementation,
then that's useful. I really don't think that doing a whole new
net protocol family is a simpler and better way of doing a performance
monitor interface, though.

Paul.

2007-11-14 21:51:24

by Paul Mackerras

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Nick Piggin writes:

> What I really mean is a readv-like syscall, but one that also
> vectorises the file offset. Maybe this is useful enough as a generic
> syscall that also helps Paul's example...

I've sometimes thought it would be useful to have a "transaction"
system call that is like a write + read combined into one:

int transaction(int fd, char *req, size_t req_nb,
char *reply, size_t reply_nb);

as a way to provide a general request/reply interface for special
files.
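
A read_pmds-style request would then map onto it roughly like this (sketch
only, layout made up):

	/* read_pmds layered on a transaction() call */
	struct {
		int n;
		int pmd_numbers[4];
	} req = { 2, { 0, 1 } };
	u64 values[4];

	transaction(fd, (char *)&req, sizeof(req),
		    (char *)values, sizeof(values));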

> Of course, I guess this all depends on whether the atomicity is an
> important requirement. If not, you can obviously just do it with
> multiple read syscalls...

That would take N system calls instead of one, which could have a
performance impact if you need to read the counters frequently (which
I believe you do in some performance monitoring situations).

Paul.

2007-11-14 22:42:42

by Nick Piggin

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Thursday 15 November 2007 08:30, Paul Mackerras wrote:
> Nick Piggin writes:
> > What I really mean is a readv-like syscall, but one that also
> > vectorises the file offset. Maybe this is useful enough as a generic
> > syscall that also helps Paul's example...
>
> I've sometimes thought it would be useful to have a "transaction"
> system call that is like a write + read combined into one:
>
> int transaction(int fd, char *req, size_t req_nb,
> char *reply, size_t reply_nb);
>
> as a way to provide a general request/reply interface for special
> files.

Maybe not a bad idea, though I'm not the one to ask about taste ;)
In this case, it is enough for your requests to be a set of scalars
(eg. file offsets), so it _could_ be handled with vectorised offsets...

But in general, for special files, I guess the response is usually
some structured data (that is not visible at the syscall layer).
So I don't see a big problem to have a similarly arbitrarily
structured request.


> > Of course, I guess this all depends on whether the atomicity is an
> > important requirement. If not, you can obviously just do it with
> > multiple read syscalls...
>
> That would take N system calls instead of one, which could have a
> performance impact if you need to read the counters frequently (which
> I believe you do in some performance monitoring situations).

That's true too.

2007-11-14 23:02:43

by Chuck Ebbert

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On 11/14/2007 05:17 AM, Nick Piggin wrote:
>
> But in general, for special files, I guess the response is usually
> some structured data (that is not visible at the syscall layer).
> So I don't see a big problem to have a similarly arbitrarily
> structured request.
>
>

IOW, an ioctl.

2007-11-14 23:03:20

by David Miller

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

From: Paul Mackerras <[email protected]>
Date: Thu, 15 Nov 2007 08:50:22 +1100

> I'd prefer to have a transaction() system call like I suggested to
> Nick rather than overloading read() like this.

So much for getting rid of the extra system calls...

2007-11-14 23:19:29

by Paul Mackerras

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

David Miller writes:

> From: Paul Mackerras <[email protected]>
> Date: Thu, 15 Nov 2007 08:50:22 +1100
>
> > I'd prefer to have a transaction() system call like I suggested to
> > Nick rather than overloading read() like this.
>
> So much for getting rid of the extra system calls...

*I* never had a problem with a few extra system calls. I don't
understand why you (apparently) do.

Paul.

2007-11-14 23:21:44

by David Miller

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

From: Paul Mackerras <[email protected]>
Date: Thu, 15 Nov 2007 10:12:22 +1100

> *I* never had a problem with a few extra system calls. I don't
> understand why you (apparently) do.

We're stuck with them forever, they are hard to version and extend
cleanly.

Those are my main objections.

2007-11-14 23:28:19

by Nick Piggin

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Thursday 15 November 2007 09:56, Chuck Ebbert wrote:
> On 11/14/2007 05:17 AM, Nick Piggin wrote:
> > But in general, for special files, I guess the response is usually
> > some structured data (that is not visible at the syscall layer).
> > So I don't see a big problem to have a similarly arbitrarily
> > structured request.
>
> IOW, an ioctl.

In the same way a read of structured data from a special file
"is an" ioctl, yeah. You could implement either with an ioctl.

The main difference is that they have more explicitly typed interfaces.
Whether that's enough of an argument (and whether Paul's proposal is widely
usable enough) is another question, which I won't try to answer.

2007-11-15 00:10:21

by Stephane Eranian

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Andi,

On Wed, Nov 14, 2007 at 03:24:11PM +0100, Andi Kleen wrote:
> On Wed, Nov 14, 2007 at 05:09:09AM -0800, Stephane Eranian wrote:
> >
> > Partially true. The file descriptor becomes really useful when you sample.
> > You leverage the file descriptor to receive notifications of counter overflows
> > and full sampling buffer. You extract notification messages via read() and you can
> > use SIGIO, select/poll.
>
> Hmm, ok for the event notification we would need a nice interface. Still
> have my doubts a file descriptor is the best way to do this though.
>

Why do you think the existing interfaces are not a good fit for this?
Is this just because of your problem with file descriptors?

From my experience, read(), select(), and SIGIO are fine. I know many tools use them.

As for the file descriptor, you would need to replace that with another identifier of
some sort. As I pointed out in another message on this thread, you don't want to use
a pid-based identifier. This is not usable when you monitor other threads and you
want to read out the results after their death.


> > Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)?
>
> See my example below.
> >
> > That would be quite expensive when you have lots of registers to setup: one
> > syscall per register. The perfmon syscalls to read/write registers accept vector
> > of arguments to amortize the cost of the syscall over multiple registers
> > (similar to poll(2)).
>
>
> First system calls are not that slow on Linux. Measure it.
>
If people do not like vector arguments, then I think I can live with N system calls
to program N registers. Now you have two choices for passing the arguments:

- a pointer to a struct:

    struct pfarg_pmc {
        uint64_t reg_value;
        uint16_t reg_num;
    } pmc0;

    pmc0.reg_num = 0; pmc0.reg_value = 0x1234;
    pfm_write_pmcs(fd, &pmc0);

- explicitly passing every field:

    pfm_write_pmcs(fd, 0x0, 0x1234);

Given that event sets and multiplexing would not be in initially, we would want
to allow for them to be added later without having to create yet another
system call, right?

Of course the same approach would work for the data registers at least for counting.
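
For what it's worth, a sketch of what such a per-register argument structure
could look like if room is reserved up front (field names are illustrative,
not a quote of the actual perfmon2 ABI):

    struct pfarg_pmc {
        uint16_t reg_num;          /* which config register        */
        uint16_t reg_set;          /* event set; 0 until sets exist */
        uint32_t reg_flags;        /* must be zero for now          */
        uint64_t reg_value;        /* value to write                */
        uint64_t reg_reserved[4];  /* must be zero; room to grow    */
    };

Requiring the reserved fields to be zero means the kernel can later give them
a meaning without breaking old binaries or adding a new syscall.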

> > With many tools, registers are not just setup once. During certain measurements,
> > data registers may be read multiple times. When you sample or multiplex at
>
> I think you optimize the wrong thing here.
>
> There are basically two cases I see:
>
> - Global measurement of lots of things:

I am not sure I understand what you mean by 'lots of things'.
Are you still talking about per-thread and self-monitoring?


> Things are slow anyways with large context switch overheads. The
> overheads are large anyways. Doing one or more system calls probably
> does not matter much. Most important is a clean interface.
>
> - Exact measurement of the current process. For that you need very
> low latencies. Any system call is too slow. That is why CPUs have
> instructions like RDPMC that allow those registers to be read with
> minimal latency in user space. The interface should support those.
>

I don't have a problem with that. And in fact, I already support that
at least on Itanium. I had that in there for X86 but I dropped it after
you said that you would enable cr4.pce globally. I don't have a problem
adding it back for self-monitoring sessions.


> Also for this case programming time does not matter too much. You
> just program once and then do RDPMC before code to measure and then
> afterwards and take the difference. The actual counter setup is out
> of the latency critical path.
>
Agreed.

>
> > It depends on what you are doing. Here, this was not really necessary. It was
> > meant to show how you can program the data registers as well. Perfmon2 provides
> > default values for all data registers. For counters, the value is guaranteed to
> > be zero.
> >
> > But it is important to note that not all data registers are counters. That is the
> > case of Itanium 2, some are just buffers. On AMD Barcelona IBS several are buffers as
> > well, and some may need to be initialized to non zero value, i.e., the IBS sampling
> > period.
>
> Setting period should be a separate call. Mixing the two together into one
> does not look like a nice interface.
>
Periods are set up via data registers. Given that there is already a call to program
the data registers, why add another one? You don't need to treat the sampling period
differently from the register value. It is just a value that will cause the register
to overflow after a given number of occurrences.


> > With event-based sampling, the period is expressed as the number of occurrences
> > of an event. For instance, you can say: " take a sample every 2000 L2 cache misses".
> > The way you express this with perfmon2 is that you program a counter to measure
> > L2 cache misses, and then you initialize the corresponding data register (counter)
> > to overflow after 2000 occurrences. Given that the interface guarantees all counters
> > are 64-bit regardless of the hardware, you simply have to program the counter to -2000.
> > Thus you see that you need a call to actually program the data registers.
>
> I didn't object to providing the initial value -- my example had that.

Should you support a kernel-level sampling buffer (like Oprofile), you'd also want
to specify the reset value on overflow. And you would not necessarily want it to
be identical to the initial value (period). So you'd need a way to specify that
one as well.
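
To make the two values concrete, a first sample after 5000 events followed by
one every 2000 events might be programmed roughly like this (pseudo-C; the
field names are illustrative and simply mirror the pfm_write_pmds() style
used later in this thread):

    /* first sample after 5000 events, then every 2000 events */
    pmd2.reg_num   = 2;
    pmd2.reg_value = (uint64_t)-5000;   /* initial period               */
    pmd2.reg_reset = (uint64_t)-2000;   /* reloaded after each overflow */
    pfm_write_pmds(fd, &pmd2);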

> Just having a separate concept of data registers seems too complicated to me.

I am not against providing a flat namespace. But I think it is nice to separate config
from data.

> You should just pass event types and values and the kernel gives you
> a register number.

Absolutely not, you don't want the kernel to know about events. This has to
remain at the user level. The event -> register problem is best solved in a user
library (such as libpfm). You don't want to bloat the kernel with event tables.
Many PMU models have over 200 events. And it gets worse: in many PMU models
you have tons of constraints as to what each counter can measure, and it can become
very complicated; Itanium, Power and Pentium 4 are good examples. It is difficult to
get right, and vendors are constantly correcting their specs, so maintenance is a pain.

The kernel interface must just deal with PMU registers and not events.


>
>
> > Perfmon2 decouples the two operations. In fact, no PMU hardware is actually touched
> > before you attach to either a CPU or a thread. This way, you can prepare your measurement
> > and then attach-and-go. Thus it is possible to create batches of ready-to-go sessions.
> > That is useful, for instance, when you are trying to measure across fork, pthread_create
> > which you can catch on-the-fly.
> >
> > Take the per-thread example, you can setup your session before you fork/exec the program
> > you want to measure.
>
> And? You didn't say what the advantage of that is?
>
You pass to the kernel all the register values (config, data), you set up the kernel sampling
buffer and its mapping. Then it is just a matter of attaching and starting. The value of this
is that it lets you create a pool of ready-to-go sessions, and when you are monitoring across
fork/pthread_create, each time you receive a notification from ptrace you simply have to
attach, start and go, i.e., you minimize the overhead on the application you are measuring.

> All the approaches add context switch latencies. It is not clear that the separate
> session setup helps it all that much.
>
This is a different issue. Sure, the more PMU registers you use, the more expensive the
context switch gets. Yet the current perfmon2 implementation tries to mitigate this by
using a lazy restore scheme, similar to the one used for FP registers.

> >
> > Note also that perfmon2 supports attaching to an already running thread. So there is
> > more than "GLOBAL CONTEXT" versus "MY CONTEXT".
>
> What is the use case of this? Do users use that?
>
I think this is often even the first approach when you get code to measure. You want to try
to characterize the workload without having to instrument and recompile. Furthermore, there
are certain workloads which take a long time to restart or that cannot be stopped and restarted
easily, yet you may want to attach for several seconds. You may also want to use this approach
to avoid monitoring the initialization phase of an application. Sometimes you may not even
have all the sources needed to instrument (e.g. 3rd party libraries).


> >
> >
> > > > /* activate monitoring */
> > > > pfm_start(ctx_fd, NULL);
> > >
> > > Why can't that be done by the call setting up the register?
> > >
> >
> > Good question. If you do what you say, you assume that the start/stop bit lives in the
> > config (or data) registers of the PMU. This is not true on all hardware. On Itanium
> > for instance, the start/stop bit is part of the Processor Status Register (psr).
> > That is not a PMU register.
>
>
> Well the system call layer can manage that transparently with a little software state
> (counter). No need to expose it.
>
Are you suggesting virtual PMU registers that map to other resources, e.g., Itanium's PSR?

>
> I disagree. Using RDPMC is essential for at least some of the things I would like
> to do with perfmon2. If the interface does not provide it it is useless to me at least.
> System calls are far too slow for cycle measurements.
>
> And when RDPMC is already supported it should be as widely used as possible.
>
I am perfectly fine with RDPMC for self-monitoring and simple counting. I need to check and
see if this could work for self-sampling. But I also want to provide an interface
that would work for: non self-monitoring, self-monitoring, and architectures without an RDPMC equivalent.
This is important for people who want to write portable tools. The syscall would
return the full 64-bit value of the counter without the sign-extension.

> >
> > Reducing performance monitoring to self-monitoring is not what we want. In fact, there
> > are only a few domains where you can actually do this and HPC is one of them. But in
> > many other situations, you cannot and don't want to have to instrument applications
> > or libraries to collect performance data. It is quite handy to be able to do:
> > $ pfmon /bin/ls
> > or
> > $ pfmon --attach-task=`pidof sshd` -timeout=10s
>
> I think only supporting global and self monitoring as first step is totally fine.

I assume by 'global' you mean system-wide, i.e., measuring all threads running on
a CPU.

> All the bells'n'whistles can be added later if users really want them.
>
They do, because it provides such simplicity of use. On production systems, it is not
uncommon to have no compilers installed, yet you may want to diagnose performance
problems by simply running a performance tool for a while.

> >
> > Also note that there is no guarantee that RDPMC allows you to access all data registers
> > on a PMU. For instance, on AMD Barcelona, it seems you cannot read the IBS register using
> > RDPMC.
>
> Sure at some point a system call for the more complex cases (also like multiplexing) would
> be needed. But I don't think we need it as first step. The goal would be to define a
> simple subset that is actually mergeable.
>
> > But you are driving the design of the interface from your very specific need
> > and you are ignoring all the other usage models. This has been a problem with so
>
> I asked your noisy user base to specify more concrete use cases, but so far
> they have not provided anything except rather vacuous complaints. Short of that I'll stick
> with what I know currently.
>
I think they will respond but Phil is busy at Supercomputing right now. They'll be able
to provide lots of use cases based on their experience with the popular PAPI toolkit.

> > many other interfaces and that explains the current situation. You have to
> > take a broader view, look at what the hardware (across the board) provides and
> > build from there. We do not need yet another interface to support one tool or one
>
>
> Well your "broad view" resulted in an incredible mess of interface moloch to be honest.

That is your opinion. I am not trying to say perfmon2 is perfect or that I don't want to make changes.
I have proved in the past, and still today, that I am willing to make changes. See my comments about
pfm_write_pmcs() above.

But what I also know now is that people have managed to port this interface to all major hardware
platforms: X86, Itanium, Cray, Power*, Cell and derivatives such as the Sony Playstation 3. They were
able to do so while providing access to all the advanced features (PEBS, IBS, DEAR, IPEAR, opcode
matchers, range restriction) and not just counters. They have never had to make changes to the
user-level API to make their hardware work.

I am just trying to say that you need to consider the arguments of people who have been involved with
performance monitoring and the development of monitoring tools for a long time and on different architectures.
What you want to do with it is perfectly fine, but it only represents a tiny fraction of what you can do
with the hardware and of what many people already want to do today. I would not want to have one interface
to do self-monitoring very well, then another one to do sampling, and another one for multiplexing.

> I really think we need a fresh start examining many of the underlying assumptions.
>
I am happy to go over every design choice with you and others.

> Regarding Itanium: I suppose it could provide an RDPMC replacement using your
> fast privileged vsyscalls.
>

We don't need that. Itanium allows reading of PMD registers directly from user space with
a single instruction once we clear the protection mechanism (similar to cr4.pce). And this
is already done for self-monitoring per-thread sessions today.

--
-Stephane

2007-11-15 00:24:17

by Paul Mackerras

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Andi Kleen writes:

> > This only works when counting (not sampling) and only for self-monitoring.
>
> It works for global monitoring too.

How would you provide access to the counters of another process?
Through an extension to ptrace perhaps?

Paul.

2007-11-15 01:11:29

by Paul Mackerras

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

David Miller writes:

> From: Paul Mackerras <[email protected]>
> Date: Thu, 15 Nov 2007 10:12:22 +1100
>
> > *I* never had a problem with a few extra system calls. I don't
> > understand why you (apparently) do.
>
> We're stuck with them forever, they are hard to version and extend
> cleanly.
>
> Those are my main objections.

The first is valid (for suitable values of "forever") but applies to
any user/kernel interface, not just system calls.

As for the second (hard to version) I don't see why it applies to
syscalls specifically more than to other interfaces. It's just a
matter of designing it correctly in the first place. For example, the
sys_swapcontext system call we have on powerpc takes an argument which
is the size of the ucontext_t that userland is using, which allows us
to extend it in future if necessary. (Note that I'm not saying that
the current perfmon2 interfaces are well-designed in this respect.)
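
A minimal kernel-side sketch of that pattern (everything here is hypothetical,
not perfmon2 or powerpc code): the caller passes the size of the structure it
knows about, and the kernel zero-fills anything newer.

    struct example_arg {
        u64 value;
        u16 num;
        u16 pad[3];
        /* new fields can only be appended here in later versions */
    };

    asmlinkage long sys_example_ctl(void __user *uarg, size_t usize)
    {
        struct example_arg karg = { 0 };   /* newest layout, zero-filled */

        if (usize > sizeof(karg))
            return -EINVAL;                /* user space newer than the kernel */
        if (copy_from_user(&karg, uarg, usize))
            return -EFAULT;
        /* fields beyond 'usize' keep their zero defaults, so an old
         * user space keeps working after the structure grows */
        return 0;
    }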

The third (hard to extend cleanly) is a good point, and is a valid
criticism of the current set of perfmon2 system calls, I think.
However, the goal of being able to extend the interface tends to be in
opposition to the goal of having strong typing of the interface.
Things like a multiplexed syscall or an ioctl are much easier to
extend but that is at the expense of losing strong typing. Something
like my transaction() (or your weird kind of read() :) also provides
extensibility but loses type safety to some degree.

Also, as Andi says, this is core CPU state that we are dealing with,
not some I/O device, so treating the whole of perfmon2 (or any
performance monitoring infrastructure) as a driver doesn't fit very
well, and in fact system calls are appropriate. Just like we don't
try to make access to debugging facilities fit into a driver, we
shouldn't make performance monitoring fit into a driver either.

Paul.

2007-11-15 01:27:32

by David Miller

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

From: Paul Mackerras <[email protected]>
Date: Thu, 15 Nov 2007 12:11:10 +1100

> The third (hard to extend cleanly) is a good point, and is a valid
> criticism of the current set of perfmon2 system calls, I think.
> However, the goal of being able to extend the interface tends to be in
> opposition to the goal of having strong typing of the interface.
> Things like a multiplexed syscall or an ioctl are much easier to
> extend but that is at the expense of losing strong typing.

I disagree.

With netlink we can just add new attributes when a new need arises for
a particular interface. The attribute code describes the type
precisely, so there is no loss of strong typing at all.

2007-11-15 02:34:37

by Paul Mackerras

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

David Miller writes:

> From: Paul Mackerras <[email protected]>
> Date: Thu, 15 Nov 2007 12:11:10 +1100
>
> > The third (hard to extend cleanly) is a good point, and is a valid
> > criticism of the current set of perfmon2 system calls, I think.
> > However, the goal of being able to extend the interface tends to be in
> > opposition to the goal of having strong typing of the interface.
> > Things like a multiplexed syscall or an ioctl are much easier to
> > extend but that is at the expense of losing strong typing.
>
> I disagree.
>
> With netlink we can just add new attributes when a new need arises for
> a particular interface. The attribute code describes the type
> precisely, so there is no loss of strong typing at all.

Well you must mean something different by "strong typing" from the
rest of us. Strong typing means that the compiler can check that you
have passed in the correct types of arguments, but the compiler
doesn't have any visibility into what structures are valid in netlink
messages.

In any case, I think that adding a structure size argument to the
current perfmon2 system calls where appropriate would mean that we
could extend them cleanly later on if necessary. It would mean that
we could add fields at the end, and that the kernel could know what
version of the structures that userspace was using.

Paul.

2007-11-15 04:20:32

by dean gaudet

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Wed, 14 Nov 2007, Andi Kleen wrote:

> Later a syscall might be needed with event multiplexing, but that seems
> more like a far away non essential feature.

actually multiplexing is the main feature i am in need of. there are an
insufficient number of counters (even on k8 with 4 counters) to do
complete stall accounting or to get a general overview of L1d/L1i/L2 cache
hit rates, average miss latency, time spent in various stalls, and the
memory system utilization (or HT bus utilization). this runs out to
something like 30 events which are interesting... and re-running a
benchmark over and over just to get around the lack of multiplexing is a
royal pain in the ass.

it's not a "far away non-essential feature" to me. it's something i would
use daily if i had all the pieces together now (and i'm constrained
because i cannot add an out-of-tree patch which adds unofficial syscalls
to the kernel i use).

-dean

2007-11-15 04:48:16

by Paul Mackerras

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

dean gaudet writes:

> actually multiplexing is the main feature i am in need of. there are an
> insufficient number of counters (even on k8 with 4 counters) to do
> complete stall accounting or to get a general overview of L1d/L1i/L2 cache
> hit rates, average miss latency, time spent in various stalls, and the
> memory system utilization (or HT bus utilization). this runs out to
> something like 30 events which are interesting... and re-running a
> benchmark over and over just to get around the lack of multiplexing is a
> royal pain in the ass.

So by "multiplexing" do you mean the ability to have multiple event
sets associated with a context and have the kernel switch between them
automatically?

Paul.

2007-11-15 05:14:44

by dean gaudet

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

On Thu, 15 Nov 2007, Paul Mackerras wrote:

> dean gaudet writes:
>
> > actually multiplexing is the main feature i am in need of. there are an
> > insufficient number of counters (even on k8 with 4 counters) to do
> > complete stall accounting or to get a general overview of L1d/L1i/L2 cache
> > hit rates, average miss latency, time spent in various stalls, and the
> > memory system utilization (or HT bus utilization). this runs out to
> > something like 30 events which are interesting... and re-running a
> > benchmark over and over just to get around the lack of multiplexing is a
> > royal pain in the ass.
>
> So by "multiplexing" do you mean the ability to have multiple event
> sets associated with a context and have the kernel switch between them
> automatically?

yep.

-dean

2007-11-15 07:49:43

by Herbert Xu

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Paul Mackerras <[email protected]> wrote:
>
> Well you must mean something different by "strong typing" from the
> rest of us. Strong typing means that the compiler can check that you
> have passed in the correct types of arguments, but the compiler
> doesn't have any visibility into what structures are valid in netlink
> messages.

That's strong static typing. Netlink is 90% strong static
typing plus 10% strong dynamic typing. That is, it'll tell
you at run-time if you give it the wrong netlink attribute.

The types within each netlink attribute are checked at compile
time.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2007-11-15 08:19:26

by Andi Kleen

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Herbert Xu <[email protected]> writes:

> That's strong static typing. Netlink is 90% strong static
> typing plus 10% strong dynamic typing. That is, it'll tell
> you at run-time if you give it the wrong netlink attribute.

Well it tells you EINVAL no matter what is wrong.

That's roughly similar to a compiler whose only error message
is 'WRONG'. Or the ed school of error reporting.

That makes any checking it does barely useful.

-Andi

2007-11-15 08:29:37

by Stephane Eranian

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Hi,

On Thu, Nov 15, 2007 at 12:11:10PM +1100, Paul Mackerras wrote:
> David Miller writes:
>
> > From: Paul Mackerras <[email protected]>
> > Date: Thu, 15 Nov 2007 10:12:22 +1100
> >
> > > *I* never had a problem with a few extra system calls. I don't
> > > understand why you (apparently) do.
> >
> > We're stuck with them forever, they are hard to version and extend
> > cleanly.
> >
> > Those are my main objections.
>
> The first is valid (for suitable values of "forever") but applies to
> any user/kernel interface, not just system calls.
>
Agreed.

> As for the second (hard to version) I don't see why it applies to
> syscalls specifically more than to other interfaces. It's just a
> matter of designing it correctly in the first place. For example, the
> sys_swapcontext system call we have on powerpc takes an argument which
> is the size of the ucontext_t that userland is using, which allows us
> to extend it in future if necessary. (Note that I'm not saying that
> the current perfmon2 interfaces are well-designed in this respect.)
>
> The third (hard to extend cleanly) is a good point, and is a valid
> criticism of the current set of perfmon2 system calls, I think.
> However, the goal of being able to extend the interface tends to be in
> opposition to the goal of having strong typing of the interface.
> Things like a multiplexed syscall or an ioctl are much easier to
> extend but that is at the expense of losing strong typing. Something
> like my transaction() (or your weird kind of read() :) also provides
> extensibility but loses type safety to some degree.
>
In the initial design there was only one perfmon syscall, perfmonctl(),
and it was a multiplexing call. People objected to it and thus I split it
up into multiple system calls. I like the strong typing, but I agree that
it is harder to extend without creating new syscalls. In the current
state, all perfmon syscalls take pointers to structs which have reserved
fields for future extensions. If you specify that the reserved fields must be
zeroed, that leaves you *some* flexibility for extending the structs.

Another alternative, similar to your ucontext, would be to pass the size
of the structure. If we assume we drop the vector arguments, we could do:

    pfm_write_pmcs(fd, &pmc, sizeof(pmc));

instead of

    pfm_write_pmcs(fd, &pmc);

Should the sizeof(pmc) need to change, we could demultiplex inside the
kernel. Another, probably cleaner, possibility is to version the structures
that are passed:

    union pfarg_pmc {
        int version;
        struct {
            int version;
            int reg_num;
            u64 reg_value;
        };
    };

But that seems overkill. I think the version could be passed when the session
is created instead of at every call:

    fd = pfm_create_session(version, &ctx, ....);
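
As a rough sketch of that last idea (purely illustrative, not the current
perfmon2 code), the version recorded at session creation would be kept in the
per-session context and consulted when later calls decode their arguments:

    /* illustrative only */
    struct pfm_context {
        int version;             /* ABI version requested at creation */
        /* ... other per-session state ... */
    };

    /* inside e.g. pfm_write_pmcs() */
    static void decode_pmc_arg(struct pfm_context *ctx, union pfarg_pmc *arg)
    {
        if (ctx->version < 2) {
            /* older user space: ignore fields introduced in v2 */
        }
    }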


> Also, as Andi says, this is core CPU state that we are dealing with,
> not some I/O device, so treating the whole of perfmon2 (or any
> performance monitoring infrastructure) as a driver doesn't fit very
> well, and in fact system calls are appropriate. Just like we don't
> try to make access to debugging facilities fit into a driver, we
> shouldn't make performance monitoring fit into a driver either.
>

Agreed 100%. This is especially true because we support per-thread
monitoring.

--
-Stephane

2007-11-15 09:00:56

by Stephane Eranian

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Hello,

On Wed, Nov 14, 2007 at 08:20:22PM -0800, dean gaudet wrote:
> On Wed, 14 Nov 2007, Andi Kleen wrote:
>
> > Later a syscall might be needed with event multiplexing, but that seems
> > more like a far away non essential feature.
>
> actually multiplexing is the main feature i am in need of. there are an
> insufficient number of counters (even on k8 with 4 counters) to do
> complete stall accounting or to get a general overview of L1d/L1i/L2 cache
> hit rates, average miss latency, time spent in various stalls, and the
> memory system utilization (or HT bus utilization). this runs out to
> something like 30 events which are interesting... and re-running a
> benchmark over and over just to get around the lack of multiplexing is a
> royal pain in the ass.
>
> it's not a "far away non-essential feature" to me. it's something i would
> use daily if i had all the pieces together now (and i'm constrained
> because i cannot add an out-of-tree patch which adds unofficial syscalls
> to the kernel i use).
>

Multiplexing in the context of perfmon2 means that you can measure more events
than there are counters. To make this work, we create the notion of an event set
or more precisely a register set. Each set encapsulates the full PMU state. Then
the kernel multiplexes the sets onto the actual PMU hardware.

Why do we need this?

As Dean pointed out, there are many important metrics which do require more events
than there are counters. Making multiple runs can be difficult with some workloads.

But there are also other, less known, reasons why you'd want to do this. It is
not because you have lots of counters that you can necessarily measure lots of
related events simultaneously. Take the Pentium 4 for instance: it has 18 counters, but
for most interesting metrics you cannot measure all the events at once. Why? Because
there are important hardware constraints which translate into event combination
constraints. It is not uncommon to have constraints such as:
- event A and B cannot be measured together
- event A can only be measured by counter X
- if event A is measured, then only events B, C, D can be measured

This is not just on Itanium. Power has limitations, Intel Core 2 has limitations,
AMD Opterons also have limitations.

When you combine a limited number of counters with strong constraints, it can quickly
become difficult to make measurements in one run.

Multiplexing is, of course, not as good as measuring all events continuously, but
if you run for long enough and with a reasonable switching period, the *estimates*
you get by scaling the obtained counts can be very close to what they would have
been had you measured all events all the time. You have to balance precision with
overhead.
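
The scaling itself is simple arithmetic: a set that was live for only part of
the run has its raw count extrapolated to the full run (a sketch that ignores
rounding and the risk of overflow for very large counts):

    /* estimated_count = raw_count * total_time / time_set_was_active */
    static uint64_t scale_count(uint64_t raw, uint64_t t_total, uint64_t t_active)
    {
        if (t_active == 0)
            return 0;
        return raw * t_total / t_active;
    }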

Why do this in the kernel?

One might argue that there is nothing preventing tools from multiplexing at the user
level. That's true, and we do support this as well. You have to (see the sketch after
this list):
- stop monitoring
- read out the current counters
- reprogram the config and data registers
- restart monitoring
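
A rough user-level sketch of that cycle, switching between two prepared
register sets; the pfm_* calls follow the vector-style calls discussed in this
thread, but their exact signatures and the meas_set structure are assumptions:

    struct meas_set {
        struct pfarg_pmc *pmcs; int npmcs;   /* config registers */
        struct pfarg_pmd *pmds; int npmds;   /* data registers   */
    };

    static void switch_set(int fd, struct meas_set *cur, struct meas_set *next)
    {
        pfm_stop(fd);                                /* stop monitoring       */
        pfm_read_pmds(fd, cur->pmds, cur->npmds);    /* read current counters */
        pfm_write_pmcs(fd, next->pmcs, next->npmcs); /* reprogram config...   */
        pfm_write_pmds(fd, next->pmds, next->npmds); /* ...and data registers */
        pfm_start(fd, NULL);                         /* restart monitoring    */
    }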

But there are some important benefits to doing this in the kernel, especially for
per-thread monitoring. When you are not self-monitoring, you would need to stop the
other thread first, then issue a minimum of 4 system calls and incur a couple of
context switches. By doing it in the kernel, you are guaranteed that switching always
occurs in the context of the monitored thread.

Furthermore, it can be integrated with kernel-level sampling. Adding the notion
of event sets is fairly pervasive, and you need to make sure that it fits well with
the other parts of the interface.

--
-Stephane

2007-11-15 17:45:29

by Dan Terpstra

[permalink] [raw]
Subject: RE: [perfmon2] [perfmon] Re: perfmon2 merge news

We've provided multiplexing in PAPI for years. The lack of kernel support forced
it to the user level, which wasn't pretty, or very statistically accurate.
We've been eagerly anticipating the improvements provided by in-kernel
multiplexing in perfmon2. We and our user base don't consider this a "far
away non-essential feature", but a deficiency that has needed addressing for a
long time.
- d

> -----Original Message-----
> From: [email protected] [mailto:perfmon2-devel-
> [email protected]] On Behalf Of dean gaudet
> Sent: Wednesday, November 14, 2007 11:20 PM
> To: Andi Kleen
> Cc: papi list; OSPAT devel; Greg KH; Perfmon; linux-
> [email protected]; Christoph Hellwig; Paul Mackerras; Andrew Morton;
> [email protected]; Philip Mucci
> Subject: Re: [perfmon2] [perfmon] Re: perfmon2 merge news
>
> On Wed, 14 Nov 2007, Andi Kleen wrote:
>
> > Later a syscall might be needed with event multiplexing, but that seems
> > more like a far away non essential feature.
>
> actually multiplexing is the main feature i am in need of. there are an
> insufficient number of counters (even on k8 with 4 counters) to do
> complete stall accounting or to get a general overview of L1d/L1i/L2 cache
> hit rates, average miss latency, time spent in various stalls, and the
> memory system utilization (or HT bus utilization). this runs out to
> something like 30 events which are interesting... and re-running a
> benchmark over and over just to get around the lack of multiplexing is a
> royal pain in the ass.
>
> it's not a "far away non-essential feature" to me. it's something i would
> use daily if i had all the pieces together now (and i'm constrained
> because i cannot add an out-of-tree patch which adds unofficial syscalls
> to the kernel i use).
>
> -dean

2007-11-16 09:19:00

by Philip Mucci

[permalink] [raw]
Subject: Re: perfmon2 merge news

Just getting back to this now that SC07 is finally over...

On Nov 13, 2007, at 5:52 PM, Andi Kleen wrote:

> On Tue, Nov 13, 2007 at 04:28:52PM -0800, Philip Mucci wrote:
>> I know you don't want to hear this, but we actually use all of the
>> features of perfmon, because a) we wanted to use the best methods
>
> That is hard to believe.
>

You are welcome to download the code and some of the tools and verify
the functionality yourself. It might be a good exercise.

> But let's go for it temporarily for the argument.
>
> Can you instead prioritize features. What is most essential, what is
> important, what is just nice to have, what is rarely used?

Yes, although this has been done before. You've got the list below and in the
previous emails, which should be considered the absolute minimum.

- A feature which was dropped earlier by Stephane (only to satiate LKML) we
consider very important: allowing one to map the kernel's view of the PMDs,
giving user-space access to full 64-bit counts if the architecture supports
a user-level read instruction. Getting the counts in a couple of dozen
cycles is ALWAYS a win for us. This is because the HPC community is mainly
interested in self-monitoring, not third-party monitoring, because the
former can be easily associated with context in the app through
instrumentation in various forms.

- Kernel multiplexing is very nice to have; it saves you tremendous overhead
at user level. PAPI has an implementation in user space for the platforms
that don't support this. The flexibility of the current implementation is
not fully exploited; here I'm referring to the concept of eventsets. Having
multiplexing is important. Being able to allocate/reallocate eventsets and
adjust the threshold of individual eventsets is just nice to have.

- Custom sample formats would be considered not often used in our community,
largely because the tools run on all HPC/Linux architectures. PAPI uses the
default sample format, which has been sufficient for our needs. However,
the lack of custom sample formats would preclude the development of the
specialized tools that access the sampling hardware found on IA64, PPC64,
Barcelona and the SiCortex node chip. pfmon exports this functionality
quite well, and it does get used.


>> - providing virtualized 64-bit counters per-thread
>> - providing notification (buffered or non) on interrupt/overflow of
>> the above.
>
> Ok that makes sense and should be possible with a reasonable simple
> interface.

Well that's good news. The above is what we have used via the PerfCtr
set of
patches for a long time. It wasn't quite enough, but it got the job
done.

>> If you'd like to outline further what you'd like to hear from the
>> community, I can arrange that. I seem to remember going through this
>> once before, but I'd be happy to do it again. For reference, here's a
>> quick list from memory of some of the tools in active use and built
>> on this infrastructure. These are used heavily around the globe.
>
> Please list concrete features, throwing around random names is not
> useful.
>

This is the kind of comment that makes the Linux/HPC folks 'somber'. What
isn't useful is being dismissive of an entire community that moves a
heck of a lot of Linux DVDs. >80% of the top500 list is Linux these
days (compared to <10% just a few years back), and so is the bulk of
the HPC clusters in the marketplace, large and small (ref those
expensive IDC reports). These are tools used daily in HPC centers and
industry around the globe, doing real work for folks that buy a lot
of hardware and actually pay for Linux distributions. These tools
seem random to you because you haven't spent any time educating
yourself about this community since we first talked about this >3
years back when considering PerfCtr. Really, there are dozens of
HPC/Linux events held every year around the world of varying sizes;
should you ever attend one, this all might not seem so 'random'. And
yes, you would be warmly welcomed.

-Phil

2007-11-16 15:16:16

by Andi Kleen

[permalink] [raw]
Subject: Re: perfmon2 merge news

Philip Mucci <[email protected]> writes:
>
> Yes, although this has been done before. You've got the list below in
> the previous
> emails which should be considered the absolute minimum.

I didn't see a clear list.

My impression so far is that you're not quite sure what you want,
otherwise you would be more concrete.

> - A feature which was dropped earlier by Stephane (only to satiate LKML)
> we consider very important: allowing one to map the kernel's view of the
> PMDs, giving user-space access to full 64-bit counts if the architecture
> supports a user-level read instruction.

You mean returning the register number for RDPMC or equivalent
and a way to enable it for ring 3 access?

I'm considering that an essential feature too. I wasn't aware
it was dropped.

> Getting the counts in a
> couple of dozen cycles
> is ALWAYS a win for us.

Yes it is for everybody. I've been rather questioning if the slow
ways (complicated syscalls) to get the counter information are really
needed.

> referring to the concept of eventsets. Having multiplexing is
> important.

Why is it important?

> - Custom sample formats would be considered not often used in our
> community, largely
> because the tools run on all HPC/Linux architectures. PAPI uses the
> default sample
> format which has been sufficient for our needs. However, the lack of
> custom sample
> formats preclude the dev of the specialized tools that access the
> sampling
> hardware as found on the IA64, PPC64, the Barcelona and the SiCortex
> node chip.
> pfmon exports this functionality quite well, and it does get used.

What do you mean by custom sample formats exactly? What information
do you want in there? And why?

E.g. PEBS and so on pretty much fix the in-memory sample format in hardware,
so the only way to get a custom format would be to use a separate buffer.

I can think of one reason why the kernel should add more information
in a separate buffer (log the instruction bytes so that they can
be disassembled and an address histogram generated using the PEBS
register values), but it is a relatively obscure one and definitely
not an essential feature. Unfortunately it is also hard to implement completely
race-free.

> This is kind of comment that makes the Linux/HPC folks 'somber'. What
> isn't useful, is being dismissive of an entire community that moves a
> heck of a lot of Linux DVD's.

Sorry, but these kinds of non-technical BS arguments will just get
you ignored in mainline Linux land. They might work if you pay
a lot of money to specific Linux companies (do you?), but here
on linux-kernel you have to convince with purely technical arguments.

-Andi

2007-11-16 16:01:41

by Stephane Eranian

[permalink] [raw]
Subject: Re: perfmon2 merge news

Andi,

On Fri, Nov 16, 2007 at 04:15:56PM +0100, Andi Kleen wrote:
> My impression so far is that you're not quite sure what you want,
> otherwise you would be more concrete.
>
> > - A feature which was dropped earlier by Stephane (only to satiate LKML)
> > we consider very important: allowing one to map the kernel's view of the
> > PMDs, giving user-space access to full 64-bit counts if the architecture
> > supports a user-level read instruction.
>
> You mean returning the register number for RDPMC or equivalent
> and a way to enable it for ring 3 access?
>
No, he is talking about something similar to what was in perfctr.
The kernel emulates 64-bit counters in software and that is what you
get back when you read the counters. If you read via RDPMC, you
get 40 bits. To reconstruct the full 64-bit value from user land
you need the upper bits. One approach is for the kernel to allow
you to remap a page that has the 64-bit (software) counters. With
that and a bit of mask/shifting you can reconstruct the full value.
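
A rough user-space sketch of that reconstruction (the 40-bit hardware width
and the layout of the remapped page are assumptions for illustration; a real
version also has to cope with the counter wrapping between the two reads):

    #define HW_BITS 40
    #define HW_MASK ((1ULL << HW_BITS) - 1)

    static inline uint64_t rdpmc(uint32_t idx)
    {
        uint32_t lo, hi;
        asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (idx));
        return ((uint64_t)hi << 32) | lo;
    }

    /* sw_count: the 64-bit software count the kernel exposes in the
     * remapped page; hw_idx: hardware counter index for RDPMC */
    static uint64_t read_counter(const volatile uint64_t *sw_count, uint32_t hw_idx)
    {
        uint64_t upper = *sw_count & ~HW_MASK;     /* upper bits from the kernel */
        uint64_t lower = rdpmc(hw_idx) & HW_MASK;  /* lower 40 bits from the PMU */
        return upper | lower;
    }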

> I'm considering that an essential feature too. I wasn't aware
> it was dropped.
>
What I dropped is the enabling of cr4.pce for self-monitoring sessions.

> Yes it is for everybody. I've been rather questioning if the slow
> ways (complicated syscalls) to get the counter information are really
> needed.
>
> > referring to the concept of eventsets. Having multiplexing is
> > important.
>
> Why is it important?
>
Read my follow-up message to Dean's message.

> > - Custom sample formats would be considered not often used in our
> > community, largely
> > because the tools run on all HPC/Linux architectures. PAPI uses the
> > default sample
> > format which has been sufficient for our needs. However, the lack of
> > custom sample
> > formats preclude the dev of the specialized tools that access the
> > sampling
> > hardware as found on the IA64, PPC64, the Barcelona and the SiCortex
> > node chip.
> > pfmon exports this functionality quite well, and it does get used.
>
> What do you mean with custom sample formats exactly? What information
> do you want in there? And why?
>
Perfmon2 allows you to have an in-kernel sampling buffer. The idea is
not new, Oprofile has this as well. The problem here is that if the
buffer is in the kernel the format of the samples is fixed, and it
should not have to be. Tools may want to record samples in different
formats and, as you said, some may need extra information gathered in the
kernel. Some may want to aggregate samples in the kernel (Oprofile used to
do that), some may want to use a double-buffer approach to minimize
blind spots, others may simply use the counter overflow mechanism to
record something that is non-PMU related, e.g., the kernel call stack.
I have built such a module and it was quite interesting to collect
the call stack when you hit a last-level cache miss.

The idea behind customizable sampling formats is simple: extract the
format from the perfmon core and put it into a kernel module. The
core provides a simple registration mechanism and the two communicate
via a set of callbacks.

Perfmon2 comes with a basic default format which works on all
platforms. But it is possible to develop others without having to
patch the kernel, recompile, or reboot. At its core, each format provides
a handler routine which is called on counter overflow. The handler routine
controls what is recorded, how it is recorded, how it is exported to
userland, and whether overflow notifications need to be sent.
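
Schematically, a format module then boils down to a small set of callbacks
registered with the core (structure and function names here are illustrative,
not the actual perfmon2 interface):

    /* illustrative only */
    struct smpl_fmt {
        const char *name;
        int  (*init)(void **buf, size_t *buf_size);      /* set up the buffer  */
        int  (*handler)(void *buf, int ovfl_pmd, u64 ip); /* called per overflow */
        void (*exit)(void *buf);                          /* tear down          */
    };

    static int my_handler(void *buf, int ovfl_pmd, u64 ip)
    {
        /* record whatever this format wants for each sample */
        return 0;
    }

    static struct smpl_fmt my_fmt = {
        .name    = "my-format",
        .handler = my_handler,
    };
    /* registered with something like smpl_fmt_register(&my_fmt) at module init */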

Using this mechanism, for instance, we were able to connect the
Oprofile kernel code to perfmon2 on Itanium with about 100 lines of
code. The exact same approach would also work for X86 Oprofile.

> e.g. PEBS and so on pretty much fix the in memory sample format in hardware,
> so they only way to get a custom format would be to use a separate buffer.
>

This is also how we support PEBS because, as you said, the format of the
samples is not under your control. If you want zero-copy PEBS support,
you have to follow the PEBS format.

I am sure other processors have and will have hardware buffers as well.

> I can think of one reason why the kernel should add more information
> in a separate buffer (log the instruction bytes so that it can
> be disassembled and a address histogram be generated using the PEBS
> register values), but it is a relatively obscure one and definitely
> not a essential feature. Unfortunately it is also hard to implement
> completely race-free.
>
Yes, you could do that without changing the core implementation of
perfmon2.

--
-Stephane

2007-11-16 16:28:24

by Andi Kleen

[permalink] [raw]
Subject: Re: perfmon2 merge news

On Fri, Nov 16, 2007 at 08:00:56AM -0800, Stephane Eranian wrote:
> No, he is talking about something similar to what was in perfctr.
> The kernel emulates 64-bit counters in software and that is you
> get back when you read the counters. If you read via RDPMC, you
> get 40 bits. To reconstruct the full 64-bit value from user land
> you need the upper bits. One approach is for the kernel to allow
> you to remap a page that has the 64-bit (software) counters. With
> that and a bit of mask/shifting you can reconstruct the full value.

You mean the page contains the upper [40;63] bits?

Sounds reasonable, although I don't remember seeing that when I looked
at the perfmon code last.

>
> > I'm considering that an essential feature too. I wasn't aware
> > it was dropped.
> >
> What I dropped is the cr4.pce enabled for self-monitoring sessions.

That sounds bad.

> Perfmon2 allows you to have an in-kernel sampling buffer. The idea is

... you also didn't say *why* that is needed.

Can you give a concrete use case for something that cannot be done
without custom buffer formats?

> Using this mechanism, for instance, we were able to connect the
> Oprofile kernel code to perfmon2 on Itanium with a 100 lines of
> code. The exact same approach would also work on X86 Oprofile as well.

The existing oprofile code works already fine on x86, no real
need for another one.

> > e.g. PEBS and so on pretty much fix the in memory sample format in hardware,
> > so they only way to get a custom format would be to use a separate buffer.
> >
>
> This is also how we support PEBS because, as you said, the format of the
> samples is not under your control. if you want zero-copy PEBS support,
> you have to follow the PEBS format.

Exactly, that makes the support for random custom buffers questionable.

E.g. as far as I can see, the main advantage of perfmon over existing setups
is that it supports PEBS etc., but with your custom buffer formats, which
are by definition incompatible with PEBS, you would negate that advantage
again.

OK, IBS will probably need some special handling.

> Yes, you could do that without changing the core implementation of
> perfmon2.

Why this insistence against changing anything?

-Andi

2007-11-16 17:18:03

by William Cohen

[permalink] [raw]
Subject: Re: perfmon2 merge news

Andi Kleen wrote:
> On Fri, Nov 16, 2007 at 08:00:56AM -0800, Stephane Eranian wrote:
>> No, he is talking about something similar to what was in perfctr.
>> The kernel emulates 64-bit counters in software and that is you
>> get back when you read the counters. If you read via RDPMC, you
>> get 40 bits. To reconstruct the full 64-bit value from user land
>> you need the upper bits. One approach is for the kernel to allow
>> you to remap a page that has the 64-bit (software) counters. With
>> that and a bit of mask/shifting you can reconstruct the full value.
>
> You mean the page contains the upper [40;63] bits?
>
> Sounds reasonable, although I don't remember seeing that when I looked
> at the perfmon code last.

The upper 32 bits ([32:63]). On many implementations only the lower 32 bits are
available in the register. Bits [32:40] in several implementations of
x86 processors cannot be set to anything outside the sign extension of bit 32. On
other processor implementations the event counters are only 32 bits wide.

>
>>> I'm considering that an essential feature too. I wasn't aware
>>> it was dropped.
>>>
>> What I dropped is the cr4.pce enabled for self-monitoring sessions.
>
> That sounds bad.
>
>> Perfmon2 allows you to have an in-kernel sampling buffer. The idea is
>
> ... you also didn't say *why* that is needed.
>
> Can you give a concrete use case for something that cannot be done
> without custom buffer formats?
>
>> Using this mechanism, for instance, we were able to connect the
>> Oprofile kernel code to perfmon2 on Itanium with a 100 lines of
>> code. The exact same approach would also work on X86 Oprofile as well.
>
> The existing oprofile code works already fine on x86, no real
> need for another one.

OProfile is very useful in many cases, but it only performs sampling. If one wants
to look at the number of events a specific section of code causes, one can't
really do that with oprofile. The counters are running system-wide, not per
thread. For some experiments developers really like to have per-thread counters.

The rewrite of oprofile to use the perfmon code was meant to consolidate code using
the performance monitoring hardware: use one interface for accessing the
performance monitoring hardware rather than have one for sampling and another
for virtualizing the counters on a per-thread basis.

>>> e.g. PEBS and so on pretty much fix the in memory sample format in hardware,
>>> so they only way to get a custom format would be to use a separate buffer.
>>>
>> This is also how we support PEBS because, as you said, the format of the
>> samples is not under your control. if you want zero-copy PEBS support,
>> you have to follow the PEBS format.
>
> Exactly that makes the support for random custom buffers questionable.
>
> e.g. as I can see the main advantage of perfmon over existing setups
> is that it support PEBS etc., but with your custom buffer formats which
> are by definition incompatible with PEBS you would negate that advantage
> again.
>
> Ok IBS will probably need some special handling.
>
>> Yes, you could do that without changing the core implementation of
>> perfmon2.
>
> Why this insistence against changing anything?
>
> -Andi

So the alternative approach is to write a new device driver for each of the new
performance monitoring mechanisms, e.g. one for PEBS and another for IBS?

One of the reasons for the custom sample buffers was to avoid having an expensive
user-space signal for a process to record some simple pieces of data each time
the data becomes available. For the oprofile port to the perfmon2 custom buffer
mechanism the instruction pointer and the counter that overflowed are
recorded. The buffer can be processed in one large chunk by userspace, reducing
overhead. In essence the current implementation of OProfile in the mainline
kernels has a custom buffer mechanism.

-Will

2007-11-16 17:36:54

by Stephane Eranian

[permalink] [raw]
Subject: Re: perfmon2 merge news

Andi,
On Fri, Nov 16, 2007 at 05:28:13PM +0100, Andi Kleen wrote:
> On Fri, Nov 16, 2007 at 08:00:56AM -0800, Stephane Eranian wrote:
> > No, he is talking about something similar to what was in perfctr.
> > The kernel emulates 64-bit counters in software and that is you
> > get back when you read the counters. If you read via RDPMC, you
> > get 40 bits. To reconstruct the full 64-bit value from user land
> > you need the upper bits. One approach is for the kernel to allow
> > you to remap a page that has the 64-bit (software) counters. With
> > that and a bit of mask/shifting you can reconstruct the full value.
>
> You mean the page contains the upper [40;63] bits?
>
> Sounds reasonable, although I don't remember seeing that when I looked
> at the perfmon code last.
>
I dropped that quite some time ago.

> >
> > > I'm considering that an essential feature too. I wasn't aware
> > > it was dropped.
> > >
> > What I dropped is the cr4.pce enabled for self-monitoring sessions.
>
> That sounds bad.

That's because you said you were going to enable it system-wide by default.

>
> > Perfmon2 allows you to have an in-kernel sampling buffer. The idea is
>
> ... you also didn't say *why* that is needed.
>
Do you question why Oprofile has one ;->

But I am happy to explain.

With sampling, you want to record information about the execution of a
thread at some interval. The interval could be expressed as time or as a
number of occurrences of a PMU event.

Typically you get a notification. Then you need to collect certain
information about the execution. Typically you record the instruction
pointer (e.g. Oprofile), but you may want to record the value of other
counters, PMU registers or other HW/SW resources. While you're doing
this monitoring is typically stopped so you get a consitent view. After
you're done recording you need to re-arm the sampling period. If you
use event-based sampling, you need to reprogram the counter(s). Then
you resume monitoring. You have to repeat this process for each sample
regardless of whether you are self-monitoring, monitoring another thread,
or monitoring a CPU.
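
Put as a user-level sketch, one iteration of that per-sample sequence might
look like this (the notification message layout and the pfm_* signatures are
assumed, not quoted from the implementation):

    for (;;) {
        struct pfm_msg msg;                    /* overflow notification, layout assumed */

        if (read(fd, &msg, sizeof(msg)) <= 0)  /* wait for a counter overflow  */
            break;
        pfm_stop(fd);                          /* freeze for a consistent view */
        record_sample(&msg);                   /* tool-specific: IP, registers */
        pfm_write_pmds(fd, &pmd, 1);           /* 'pmd' re-arms the period     */
        pfm_start(fd, NULL);                   /* resume monitoring            */
    }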

Such a sequence of operations is quite expensive, especially in the case
where you are monitoring another thread, because it incurs at least
a couple of context switches per sample in addition to the various
register manipulations and syscalls.

The idea with the kernel sampling buffer is that you amortize the
cost of the notification to userland over LOTS of samples. On counter
overflow, the kernel records the samples on your behalf. There is
no context switch; samples are always recorded in the context of
the monitored thread.

Now, you need a bit more information for this to work correctly,
because the kernel records on *your behalf*; thus you need to express:
- what you want to see recorded
- the value to reload into the overflowed counter(s),
  so the kernel can re-arm the next period

Because you have multiple counters, you may use several of them as
sampling periods, i.e., overlap sampling measurements. That is done
very frequently.

For instance, the q-syscollect tool that D. Mosberger wrote overlaps
elapsed-cycles and branch trace buffer (BTB) sampling to collect, in
*one* run, a flat profile and a statistical call graph.

Depending on which counter overflowed, you may want to record
different things. For instance, the flat profile requires
just the instruction pointer. But for the BTB, the buffer
is implemented by PMU registers, so you need to record
them (16 total). You don't want to record all possible registers
in each sample: reading PMU registers is costly and you
want to maximize buffer space usage.

As you can see, you need to express per counter:
- what other resources to record when it overflows
- the value to reload into the counter after overflow

In perfmon2 this information is passed via the pfm_write_pmds()
call. You can say:
  PMD2.value = -5000;     /* initial period */
  PMD2.reset = -2000;     /* repeat period */
  PMD2.smpl_pmds = 0xf0;  /* record PMD4-7 on overflow */
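For concreteness, the example above might look roughly like this through
the pfm_write_pmds() syscall. The pfarg_pmd field names used here
(reg_num, reg_value, reg_long_reset, reg_smpl_pmds) are recalled from the
perfmon2 headers of that era and may not match them exactly; assumes the
perfmon2 user headers plus <string.h> and <stdio.h>.

struct pfarg_pmd pd;

memset(&pd, 0, sizeof(pd));
pd.reg_num          = 2;          /* PMD2 drives the sampling period     */
pd.reg_value        = -5000ULL;   /* initial period: overflow after 5000 */
pd.reg_long_reset   = -2000ULL;   /* value reloaded after each overflow  */
pd.reg_smpl_pmds[0] = 0xf0;       /* record PMD4-PMD7 in each sample     */

if (pfm_write_pmds(ctx_fd, &pd, 1))  /* ctx_fd from pfm_create_context() */
	perror("pfm_write_pmds");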

Now, it is important to note that it is not just on Itanium
that we need this kind of flexibility. Given that you mentioned
IBS, I will use it as a non-Itanium example. IBS is implemented
using PMU registers, 10 to be precise. There is no need for a
custom sampling format to support it; the default format is
sufficient.

The default sampling format records more than the instruction
pointer. Each sample has a fixed-size header including the instruction
pointer but also PID/TID/CPU. It also has a variable-size body
where the kernel stores the other registers you want recorded
in each sample, based on which counter overflowed. So for IBS, it
would store the 10 data registers.
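As an illustration only (this is not the actual perfmon2 default-format
layout), a sample in such a buffer can be pictured as a fixed header
followed by a variable-size tail; needs <stdint.h>.

/* illustrative layout, not the real perfmon2 struct */
struct smpl_entry_hdr {
	uint64_t ip;         /* interrupted instruction pointer       */
	uint32_t pid, tid;   /* monitored thread                      */
	uint32_t cpu;        /* CPU on which the overflow occurred    */
	uint64_t ovfl_pmds;  /* bitmask of overflowed counter(s)      */
	uint16_t nvals;      /* number of 64-bit values that follow   */
};
/* the header is followed in the buffer by nvals x uint64_t PMD values,
 * e.g. the 10 IBS data registers in the IBS case described above */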

> Can you give a concrete use case for something that cannot be done
> without custom buffer formats?
>

PEBS is one. You would have to special-case it. PEBS includes
the instruction pointer plus the values of all registers. You'd have
to devise a scheme to allocate the PEBS buffer, and then on a
PEBS interrupt you'd have to copy the data into the other
buffer. Not to mention that PEBS differs between the P4 and Intel
Core 2, and that this is an Intel x86-only feature.
I think this is better isolated into x86-specific code and
into a kernel module, because it does not work on all models.


> > Using this mechanism, for instance, we were able to connect the
> > Oprofile kernel code to perfmon2 on Itanium with a 100 lines of
> > code. The exact same approach would also work on X86 Oprofile as well.
>
> The existing oprofile code works already fine on x86, no real
> need for another one.
>
Can you support advanced monitoring like I just described above?

> > > e.g. PEBS and so on pretty much fix the in memory sample format in hardware,
> > > so the only way to get a custom format would be to use a separate buffer.
> > >
> >
> > This is also how we support PEBS because, as you said, the format of the
> > samples is not under your control. If you want zero-copy PEBS support,
> > you have to follow the PEBS format.
>
> Exactly that makes the support for random custom buffers questionable.
>
Quite the contrary: without the custom buffers we would need horrible
hacks to support PEBS.

> e.g. as I can see the main advantage of perfmon over existing setups
> is that it support PEBS etc., but with your custom buffer formats which
> are by definition incompatible with PEBS you would negate that advantage
> again.
>

I think you are confused about the terms here. The custom sampling
format support is a kernel-level interface for plugging in kernel
modules which implement custom sampling formats. PEBS requires a custom
format because you do not control what is recorded. Thus what
you do is *create* a format whose sample layout *maps* the PEBS
format exactly. And that format is *different* from the one used
by the default sampling format.


> Ok IBS will probably need some special handling.
>
No, it does not. No custom sampling format, no extra tricks.

> > Yes, you could do that without changing the core implementation of
> > perfmon2.
>
> Why this insistence against changing anything?
>

Because hardware is very diverse and changing rapidly.
Changing the kernel is difficult, and it takes a very long time
for new features to reach end-users. You are well aware
that most users do not get their production kernels from
kernel.org. Monitoring is not reserved for core developers;
it is also very useful on production systems to diagnose
performance problems.

--
-Stephane

2007-11-16 17:51:19

by dean gaudet

[permalink] [raw]
Subject: Re: perfmon2 merge news

On Fri, 16 Nov 2007, Andi Kleen wrote:

> I didn't see a clear list.

- cross platform extensible API for configuring perf counters
- support for multiplexed counters
- support for virtualized 64-bit counters
- support for PC and call graph sampling at specific intervals
- support for reading counters not necessarily with sampling
- taskswitch support for counters
- API available from userland
- ability to self-monitor: need select/poll/etc interface
- support for PEBS, IBS and whatever other new perf monitoring
infrastructure the vendors throw at us in the future
- low overhead: must minimize the "probe effect" of monitoring
- low noise in measurements: cannot achieve this in userland

perfmon2 has all of this and more i've probably neglected...

-dean

2007-11-16 18:30:25

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: PMC core internal API design

* Stephane Eranian ([email protected]) wrote:
> Hello,
>
> On Tue, Nov 13, 2007 at 04:17:18PM +0100, Robert Richter wrote:
> > On 10.11.07 21:32:39, Andi Kleen wrote:
> > > It would be really good to extract a core perfmon and start with
> > > that and then add stuff as it makes sense.
> > >
> > > e.g. core perfmon could be something simple like just support
> > > to context switch state and initialize counters in a basic way
> > > and perhaps get counter numbers for RDPMC in ring3 on x86[1]
> >
> > Perhaps a core could provide also as much functionality so that
> > Perfmon can be used with an *unpatched* kernel using loadable modules?
> > One drawback with today's Perfmon is that it can not be used with a
> > vanilla kernel. But maybe such a core is by far too complex for a
> > first merge.
> >
> Note that I am not against the gradual approach such as:
> - system-wide only counting

(jumping in late in the game)

Linux Trace Toolkit Next Generation would _happily_ use global PMC
counters, but I would prefer to interact with an internal kernel API
rather than being required to start/stop counters from user-space. There
is a big precision loss involved in having to start things from
userspace.

Ideally, this API would manage access to the available PMCs and even use
the same counters for system-wide tracing/profiling done at the same
time as user-space profiling. This would however involve having a
wrapper around both user-space and kernel-space performance counter
reads, which is fine with me. I would suggest that user-space still go
through a system call for this, since it is available at early boot,
before the filesystem is mounted.

This API could offer an in-kernel, architecture-_independent_ PMC control
interface to:
- list the available PMCs
  - that would involve mapping the common PMCs to some generic
    identifier
- attach to these PMCs, with a certain priority

We could call a single connection to a PMC a "virtual PMC". All PMC
accesses should then be done through this internally managed structure
(providing callbacks to be called after a certain count, reads, stop...).
Virtual PMCs could be either system-wide or per-thread.

As a starting point, we could limit ourselves to one virtual PMC attached
to a physical PMC at a given time. Later, we could add support for multiple
virtual PMCs connected to a single physical PMC. The priorities could be
used to kick out the PMC users with lower priorities (which implies that
a PMC read could fail!).

Then, to get interrupts or signals upon PMC overflow, we could manage
each physical PMC like a timer, using the lowest requested value for the
next time we are to be awakened. Some logic would have to be added to
the PMC read operation to get the "real" expected value, but this is
nothing difficult.
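To make the proposal more tangible, here is a hypothetical kernel-side
sketch of what such an in-kernel "virtual PMC" interface could look like;
every name and signature below is illustrative only, not an existing API.

/* hypothetical in-kernel API sketch */
struct vpmc;    /* one attachment ("virtual PMC") to a physical PMC */

typedef void (*vpmc_overflow_cb)(struct vpmc *v, u64 count, void *data);

/* enumerate physical PMCs via generic identifiers */
int vpmc_list(unsigned int *generic_ids, unsigned int max);

/* attach system-wide (task == NULL) or per-thread; a higher-priority
 * user may later evict a lower-priority one */
struct vpmc *vpmc_attach(unsigned int generic_id, int prio,
			 struct task_struct *task,
			 vpmc_overflow_cb cb, void *data);

int  vpmc_start(struct vpmc *v, u64 overflow_threshold);
int  vpmc_read(struct vpmc *v, u64 *value);  /* may fail if evicted */
void vpmc_stop(struct vpmc *v);
void vpmc_detach(struct vpmc *v);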

Those were the ideas I had at the last OLS, after hearing the talk about
perfmon2. I hope they can be useful. If things need to be clarified, I
will gladly discuss them further.

Mathieu

P.S. : the rest of the feature list _should_ be easy to implement on top
of this internal architecture.

> - per-thread counting
> - user-level sampling support
> - in-kernel sampling buffer support
> - in-kernel customizable sampling buffer formats via modules
> - event set multiplexing
> - PMU description modules
>
> It would obviously cause a lot of trouble for existing perfmon libraries and
> applications (e.g. PAPI). It would also be fairly tricky to do because you'd
> have to make sure that in the beginning, you leave enough flexibility such that
> you can add the rest while maintaining total backward compatibility. But given
> that we already have the full solution, it could just be a matter of dropping
> features without disrupting the user level API. Of course there would be a bigger
> burden on the maintainer because he would have two trees to maintain but I think
> that is already commonplace in many of the kernel-related projects.
>
> Let's take a simple example. The set of syscalls necessary to control a system-wide
> monitoring session is exactly the same as for a per-thread session. The difference is
> just a flag when the session is created. Thus, we could keep the same set of syscalls,
> but only accept system-wide sessions. Later on, when we add per-thread, we would just
> have to expose the per-thread session flag.
>
> Having said that does not mean that this is necessarily what we will do. I am just
> trying to present my understanding of the comments from Andrew, Andi and others.
>
> I think that going with a kernel module will not address the 'complexity/bloat' perception
> that some people have. There is a logic to that; I did not just wake up one day saying
> 'wouldn't it be cool to add set multiplexing?'. There was a true need expressed by users or
> developers and it was justified by what the hardware offered then. This unfortunately still
> stands today. I admit that justification is not necessarily spelled out clearly in the code. So
> I understand most of those worries and I am trying to figure out how we could best address them.
>
> --
> -Stephane
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-11-16 20:36:40

by Philip Mucci

[permalink] [raw]
Subject: Re: perfmon2 merge news


> Yes it is for everybody. I've been rather questioning if the slow
> ways (complicated syscalls) to get the counter information are really
> needed.

I suppose by complicated here you're referring to the gather semantics
of the pfm_read/write_pmds/pmcs calls. Many processors may have hundreds
of registers (IA64, BG/P, SiCortex), some of which have different access
times. So a naive syscall of 'give me all the registers you've got' isn't
going to cut it. However, any additional simplicity (performance) we can
squeeze out of this particular primitive is a huge win, as it sits in the
critical path of the user tools (unless one is sampling).
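As an illustration of those gather semantics (with the same caveat that the
exact perfmon2 types are assumed here; needs the perfmon2 user headers and
<stdio.h>), a tool only pays for the registers it actually asks for:

struct pfarg_pmd pds[3] = {
	{ .reg_num = 0 },     /* only the PMDs this tool needs ... */
	{ .reg_num = 5 },
	{ .reg_num = 17 },    /* ... out of possibly hundreds      */
};

if (pfm_read_pmds(ctx_fd, pds, 3) == 0)
	printf("PMD0=%llu PMD5=%llu PMD17=%llu\n",
	       (unsigned long long)pds[0].reg_value,
	       (unsigned long long)pds[1].reg_value,
	       (unsigned long long)pds[2].reg_value);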

>> referring to the concept of eventsets. Having multiplexing is
>> important.
>
> Why is it important?
>

Performance and noise. See the earlier message about our user-land
implementation versus kernel-mode implementations. At any useful
granularity, you begin to seriously affect the counts with noise as
well as dilate the run-time. But let's punt on this one until after
we get the basics in. It's a non-essential feature at this point.

>> - Custom sample formats would be considered not often used in our
>> community, largely because the tools run on all HPC/Linux architectures.
>> PAPI uses the default sample format, which has been sufficient for our
>> needs. However, the lack of custom sample formats precludes the
>> development of the specialized tools that access the sampling hardware
>> as found on the IA64, PPC64, the Barcelona and the SiCortex node chip.
>> pfmon exports this functionality quite well, and it does get used.
>
> What do you mean with custom sample formats exactly? What information
> do you want in there? And why?

By custom here, I mean the ability to have the kernel take samples
containing more than just the IP, the PID, and a bitmask of which
registers overflowed. Myself and others have worked hard to get
effective-address sampling into the hardware (there are registers that
contain the EAs of misses as well as branch-mispredict data on the PPC,
IA64, Barcelona and SiCortex), and this is handled through the use of a
format that gathers up that information at interrupt time for deposit
into the sample buffer. We are not wedded to perfmon2's implementation
of these formats; we are, however, wedded to having this information
collected at interrupt time, as the data may change by the time you get
back to user-mode. This hardware is not obscure any more, it's the norm,
as we've learned that simple aggregate counters, even those with precise
interrupt abilities, are not sufficient to satisfy all of our needs.

> e.g. PEBS and so on pretty much fix the in-memory sample format in
> hardware, so the only way to get a custom format would be to use a
> separate buffer.
>
> I can think of one reason why the kernel should add more information
> in a separate buffer (log the instruction bytes so that they can
> be disassembled and an address histogram be generated using the PEBS
> register values), but it is a relatively obscure one and definitely
> not an essential feature. Unfortunately it is also hard to implement
> completely race-free.
>

>> This is the kind of comment that makes the Linux/HPC folks 'somber'. What
>> isn't useful is being dismissive of an entire community that moves a
>> heck of a lot of Linux DVDs.
>
> Sorry, but these kind of non technical BS arguments will just make
> you be ignored in mainline Linux lands. They might work if you pay
> a lot of money to specific Linux companies (do you?), but here
> on linux-kernel you have to convince with purely technical arguments.

I love it when kernel folks refer to their own revenue streams
(and yes, we do, ask your VP of sales) and the needs of a user
community as "BS non-technical arguments".

But let's get back to basics here. We can sort that out over a beer
sometime. At this point, let's try to agree on the minimum set of
functionality acceptable for a first round of patches.

- per-CPU (system-wide) and per-thread 64-bit virtualized counters
- dispatch of interrupt on overflow via a signal
- first- (self) and third-party (attach) semantics
- extensible to new processor lines within an architecture without
  repatching (by this I mean that, through the use of modules that
  contain PMU description tables, patches don't have to be issued for
  every new rev of HW that Intel releases; see the sketch after these
  lists)

To be considered later:
- Sample buffers and formats
- Multiplexing (by event threshold and time slicing)
- fast-read support if the hardware supports it (mmap + user rdpmc)
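The kind of PMU description table meant in the "extensible without
repatching" item could, purely as a sketch (hypothetical names, not
perfmon2's actual structures), look like this:

struct pmu_event_desc {
	const char   *name;       /* e.g. "UNHALTED_CORE_CYCLES"           */
	unsigned int  code;       /* event select value                    */
	unsigned int  umask;      /* unit mask, if any                     */
	unsigned int  counters;   /* bitmask of counters that can count it */
};

struct pmu_desc {
	const char                  *pmu_name;   /* e.g. "Intel Core 2"    */
	unsigned int                 num_counters;
	const struct pmu_event_desc *events;
	unsigned int                 num_events;
};

A loadable module would register one such table per processor revision, so
new hardware would only need a new table, not a core patch.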

I think(?) we are all clear now on why oprofile is not sufficient:
simultaneous usage by non-root users, each with different counter
configurations, lack of read/write access, etc. Oprofile is, however,
a very important tool, and any initial set of functionality should allow
for a very simple port to each version of the infrastructure along the
way. I'd happily port incremental versions of PAPI to the patches, so
the performance tools can be accessible to the LKML community while
testing/benchmarking the patchset on a variety of architectures. If we
can agree on the starting point, we can move the discussion of the API
to the perfmon2 mailing list and, with your input, finally 'get it
right' in terms of acceptance.

If there's anything we do have in common at the moment, it's momentum.
We (speaking for the HPC community/vendors again) are not in favor of
useless bloat; we care about every TLB slot, mispredict, miss, timer tick
and pipeline bubble, kernel space or not. It's precisely why this type
of infrastructure has become so vital to us over the years.

-Phil

2007-11-16 21:57:21

by Stephane Eranian

[permalink] [raw]
Subject: Re: perfmon2 merge news

Will,

On Fri, Nov 16, 2007 at 12:13:07PM -0500, William Cohen wrote:
> Andi Kleen wrote:
> >On Fri, Nov 16, 2007 at 08:00:56AM -0800, Stephane Eranian wrote:
> >>No, he is talking about something similar to what was in perfctr.
> >>The kernel emulates 64-bit counters in software and that is what you
> >>get back when you read the counters. If you read via RDPMC, you
> >>get 40 bits. To reconstruct the full 64-bit value from user land
> >>you need the upper bits. One approach is for the kernel to allow
> >>you to remap a page that has the 64-bit (software) counters. With
> >>that and a bit of mask/shifting you can reconstruct the full value.
> >
> >You mean the page contains the upper [40;63] bits?
> >
> >Sounds reasonable, although I don't remember seeing that when I looked
> >at the perfmon code last.
>
> Upper 32 bits ([32:63]). On many implementations only the lower 32 bits are
> available in the register; bits 32:40 on several x86 processor
> implementations cannot be set to anything other than the sign extension
> of bit 31. On other processor implementations the event counters
> are only 32 bits wide.
>
That is quite true on Intel processors. Perfmon2 only considers the bottom
31 bits as true counter bits; the rest are forced to 1. This is true even
on Intel Core 2.

--
-Stephane

2007-11-17 00:15:57

by David Miller

[permalink] [raw]
Subject: Re: perfmon2 merge news

From: Andi Kleen <[email protected]>
Date: Fri, 16 Nov 2007 16:15:56 +0100

> Philip Mucci <[email protected]> writes:
> > - A feature which was dropped earlier by Stephane (only to satiate
> > LKML), we consider very important: allowing one to map the kernel's
> > view of the PMDs, allowing user-space access to full 64-bit counts,
> > if the architecture supports a user-level read instruction.
>
> You mean returning the register number for RDPMC or equivalent
> and a way to enable it for ring 3 access?
>
> I'm considering that an essential feature too. I wasn't aware
> it was dropped.
>
> > Getting the counts in a
> > couple of dozen cycles
> > is ALWAYS a win for us.
>
> Yes it is for everybody. I've been rather questioning if the slow
> ways (complicated syscalls) to get the counter information are really
> needed.

I would like to add sparc64 support to perfmon2, and therefore I've
also been considering this angle of the API issues.

The counters on sparc64 can be configured to be readable by userspace,
so for the self-monitoring cases I really would like to make sure the
perfmon2 library interface could use direct reads for sampling instead
of system calls or specialized traps.

If I get some spare time I'll look at the current perfmon2 patches
and see if I can toss together sparc64 support to get a feel for
how things stand currently.

2007-11-17 00:29:20

by David Miller

[permalink] [raw]
Subject: Re: perfmon2 merge news

From: dean gaudet <[email protected]>
Date: Fri, 16 Nov 2007 09:51:08 -0800 (PST)

> On Fri, 16 Nov 2007, Andi Kleen wrote:
>
> > I didn't see a clear list.
>
> - cross platform extensible API for configuring perf counters
> - support for multiplexed counters
> - support for virtualized 64-bit counters
> - support for PC and call graph sampling at specific intervals
> - support for reading counters not necessarily with sampling
> - taskswitch support for counters
> - API available from userland
> - ability to self-monitor: need select/poll/etc interface
> - support for PEBS, IBS and whatever other new perf monitoring
> infrastructure the vendors throw at us in the future
> - low overhead: must minimize the "probe effect" of monitoring
> - low noise in measurements: cannot achieve this in userland
>
> perfmon2 has all of this and more i've probably neglected...

I want to state that even though I've been a stickler on the system
call stuff, in general I want to see perfmon2 go into the tree, and I
agree with how most of the infrastructure is implemented and the
features it provides.

2007-11-17 01:09:13

by Greg KH

[permalink] [raw]
Subject: Re: perfmon2 merge news

On Fri, Nov 16, 2007 at 04:29:05PM -0800, David Miller wrote:
> From: dean gaudet <[email protected]>
> Date: Fri, 16 Nov 2007 09:51:08 -0800 (PST)
>
> > On Fri, 16 Nov 2007, Andi Kleen wrote:
> >
> > > I didn't see a clear list.
> >
> > - cross platform extensible API for configuring perf counters
> > - support for multiplexed counters
> > - support for virtualized 64-bit counters
> > - support for PC and call graph sampling at specific intervals
> > - support for reading counters not necessarily with sampling
> > - taskswitch support for counters
> > - API available from userland
> > - ability to self-monitor: need select/poll/etc interface
> > - support for PEBS, IBS and whatever other new perf monitoring
> > infrastructure the vendors throw at us in the future
> > - low overhead: must minimize the "probe effect" of monitoring
> > - low noise in measurements: cannot achieve this in userland
> >
> > perfmon2 has all of this and more i've probably neglected...
>
> I want to state that even though I've been a stickler on the system
> call stuff, in general I want to see perfmon2 go into tree and I agree
> with how most of the infrastructure is implemented and the features it
> provides.

Now if we only had a series of patches that we could actually review and
apply to the -mm tree so that people can try them out... :)

thanks,

greg k-h

2007-11-17 01:26:38

by Greg KH

[permalink] [raw]
Subject: Re: perfmon2 merge news

On Sat, Nov 17, 2007 at 02:13:13AM +0100, Patrick DEMICHEL wrote:
> Yet another noisy linux HPC user
>
> I hope to convince you, lkml developers, to pay more attention to our HPC
> performance problems.

We do pay attention, and we want to help out; we just need either bug
reports of problems that we can work to address, or patches in a
reviewable state that we can review, work with, and apply to our trees.

Please do not think we are ignoring you at all. We are glad to work with
anyone who uses the Linux kernel on whatever platform, as we well know
this allows us to create a kernel that works even better for everyone.

thanks,

greg k-h

2007-11-17 02:15:17

by Greg KH

[permalink] [raw]
Subject: Re: perfmon2 merge news

On Sat, Nov 17, 2007 at 02:48:45AM +0100, Patrick DEMICHEL wrote:
> Thanks Greg,
>
> but to outside people it seems there are a lot of people with opposite
> opinions; for sure some are valid and they may be focused on different
> things. But this critical topic, for example, seems not at all under
> control, and we don't like that.
> At least it is not under the control of Stephane; whatever the effort he
> could put in, we have the feeling we will never get something serious
> for us.
> Also, after long exchanges I never see a clear statement of what the
> accepted final common view on a topic like this is.
> Maybe there is never a definitive position, but a summary would be
> useful as a reference point.
>
> What I would like to see is:

<snip>

Heh, no, code is our currency here, it's the center of everything that
we do and work with. Agreements, deadlines and plans are just not
relevant at all here, sorry.

So again, post the code, in reviewable patches, and then let's talk. A
number of developers have expressed a concrete interest in getting this
kind of feature into the kernel tree, so show us the code so that we can
move forward.

thanks,

greg k-h

2007-11-17 17:19:36

by Patrick DEMICHEL

[permalink] [raw]
Subject: Re: perfmon2 merge news

Yet another noisy linux HPC user

I hope to convince you, lkml developers, to pay more attention to our
HPC performance problems.

I will not try to convince you that our problems are also the problems
of many other users; I hope they will do that directly.

Imagine my company bought an expensive, complex multi-node, multi-socket,
multi-core machine.

This is cheap today, around $10M.
My company made the strange decision to go for Linux; in fact we had
no choice: OOPS.
This machine will be used to solve many fundamental problems in fields
like meteorology, life sciences, nanotechnology, engineering, mathematics,
climatology, ...
Many of our scientists and developers will try to exploit the potential
of this machine to do some radically new science and make breakthroughs
in their domains. Some of those results could have a major impact on
everybody's life.

So you see, this is not just the problem of a bunch of desperate HPC users.

Moore's Law gave us the opportunity to solve many fundamental problems
by offering tons of cheap transistors,

but we all have one major problem: how do we optimize our codes in this
context of massive parallelism?

Any idea what "massive" means?

Maybe you are starting to be familiar with tuning 4 cores.

We shortly target tuning millions of heterogeneous cores.

Good news for you: this is not the only problem we need to solve, but
this one is very serious.
And we know this is just an intermediate step towards something
continuously more complex and challenging.
There are tons of papers on the web written by many talented and
motivated people.
You need to be motivated to stay in this business :-)

Developing the complete software stack required to manage and use such
machines will require that a large number of different actors succeed
in going in the same direction and share the burden.
Nobody, and no single company, can sustain all the required developments
at a reasonable cost.
No company has the time and the complete expertise to do it alone.
I hope collectively we can do it. Even that is not certain, as far as I
can see today.

Following your logic, you could claim "why such useless hardware
complexity? Do something simpler."
Here we have a problem: we cannot change the constants and laws of
physics, so we face the inevitable choice of massive parallelism,
complex memory hierarchies, complex micro-architectures, complex
interconnects, variable elements, failing elements, ...
Quite some fun ahead, in fact.

And I can promise you, the hardware designers are not lazy or short of
inspiration, and they also have a growing infinitude of challenges.

Some people argue that some magic tools will decompose and tune the
programs automatically; then why would you need performance tools at all?
First, this will be done at the price of losing an enormous part of the
potential; secondly, the compilers will probably require extensive
support from the hardware counters to be even somewhat effective. Most
of us target reasonable scalability and cannot afford to reach only 20%
of the peak of anything.

That remains a dream without some breakthrough on the tools side.

This is where we need advanced performance tools: tools that allow the
largest number of developers to understand how the architectures really
work. Not how we naively think they should work, but how they really work.

Theory and reality are not good friends; it's rare to meet them together.
We cannot afford a situation where only a few rare specialists can do an
always-partial tuning; I am sure they also have limits, at least in time.
I am sure that as soon as advanced tools expose the real problems to
developers in the right form, they will find innovative solutions.
Can you imagine modern medicine without scanners, X-rays, and all the
other sources of information on your body?

For us this is exactly the same thing: we desperately need advanced
performance tools, not one but many, to attack the problems from
different angles.
And the tools should be easy to use, reliable, flexible, predictable,
ready to use when I need them, standard, installed everywhere, and in
particular available on new platforms as soon as they appear, ...
The tools need to hide the complexity when I want that, and expose it
when required.
The tools will always lag behind the requirements, but I hope not by
too far.
I prefer tools that adapt to me rather than the opposite.


But I am realistic: I don't need perfect tools, I need tools I can
invest in, learning and progressing with them over a long time.

That is also why, given the cost of development, many developers
ended up writing their own performance tools.
This is my case: I spend more time developing the tools I need than
using them. I have no choice today, but this is unsustainable now.
The current state of what is available is not what is required; it is
not even close to the minimum of what I needed 5 years ago.


What Stephane is developing is layer 1 of what we need: something
that hides most of the complexity of the hardware counters, and it is
not Stephane's fault that this is very complex. It is not even the
fault of the hardware designers; in fact we ask them for more
complexity, to at least better support virtualization and multi-user,
multi-usage environments and so on.
Naturally, complexity should be managed and controlled, and I am an
ardent defender of simplification, but not to the point of suppressing
functionality. Even if it targets only some rare advanced users today,
tomorrow it could be the common case.
The people developing layers 2, 3, 4, ... of the performance tools all
expect the layers underneath to be simple, flexible, stable, robust,
and to offer what they require.
Simple does not mean trivial: whatever the complexity of the perfmon2
API, it is 1 or 2 orders of magnitude simpler than if we had to program
those hardware counters directly in our programs. And stop claiming
counting or RDTSC is sufficient.

In fact most users will never see the perfmon2 API, just some tool
developers or advanced users pushing the limits of the technology.
The people who understand the intricacies of the hardware counters will
find the perfmon2 API trivially simple.

The performance tools need to offer, to a large variety of users with a
large variety of expertise levels in a large variety of situations, the
best chance of approaching some reasonable optimum at a cost they decide
is justified.
There is room for a lot of tools, as long as they have some common ground
and preferably interoperability.

So please consider helping Stephane to deliver a viable, supported
perfmon2 in all standard Linux releases.

This is not only important for performance; it is also important for
many other reasons:
1: power consumption optimization; the companies you work for are
probably interested in this.
2: debugging and diagnosing these enormous machines.
In fact those 2 problems become even more important than pure code
tuning in some cases.
The hardware counters and tools also offer good opportunities to make
progress on those topics.

Now what we need:

I definitely don't want to instrument my codes. I am sure I will still
do it sometimes, but I hope it will remain a rare, justified case.

You will say I am lazy; maybe, but I have an obligation of productivity.
If with sampling I can obtain information in minutes, why would I need
to instrument, recompile, and rerun my code?
By the way, sometimes you need to tune codes you don't know, or for
which you have no access to the source code.
Yes, some people work with codes without the sources and can do
important tuning that brings significant productivity gains.
How many iterations, and how much time, do you need for that?

With sampling, if in one run I can attach/detach and/or use system-wide
mode, then I can capture and correlate a lot of sets of counters.
Sometimes they can be measured independently; sometimes you need
high-frequency multiplexing; sometimes I am obliged to tune and monitor
the system for a code I cannot instrument; sometimes the run is so long
that I have only one chance to run the program;

sometimes I need to do system-wide and per-process monitoring at the
same time.
And I can probably give tens of very different scenarios where counting
is ridiculously insufficient and instrumentation is impossible.
And whatever perfmon2 offers, it will remain a hard problem for me to
correlate all the sources of measurements and extract the useful and
valid information from that ocean of numbers. And I promise, I like
numbers, but graphics would be better.
Without perfmon2 things are much simpler: I can do essentially nothing,
because I am not an expert in crystal-ball analysis.


We could spend hours and hours convincing you this is required, but I
think you need to look at the problem differently.
Imagine I spent hours and hundreds of mails discussing why you have
implemented such a complex Linux OS, why you are using so many different
structures in place of one, ... blah blah ...
I imagine you would comment that I am not a Linux kernel expert, and you
would be right.
And even though I have worked for 25 years on many other kernels, you
would still be right.
My obsolete expertise is useless to you and to lkml, and even that does
not qualify me, because I have not spent the hundreds of hours required
to even start to be taken seriously on lkml.

But as Linux kernel experts, you have no idea what it means to tune a
massive SMP machine, what it means to program and tune an MPI job of
1000+ nodes, to program and tune a PS3 processor, or to program and tune
an MPI cluster of OpenMP nodes using accelerators to achieve 1 Pflops
sustained.
All those machines run, or will run, Linux; so why don't you care?

Because you don't have one at home?
Buy a PS3 and try to do a 200 Mflops FFT on it.
Can you explain to me what qualifies you to claim you can understand
what is useful and not useful for us?
How will you judge?
The number of M$? The number of desperate developers? Whether it sounds
useful or fun?




Let me turn the question around: can you do it or not?

If not, what would be required to do it rapidly?

Some funding, collaboration, training, coordination, ...?

Be open, and we will help you; but implement the thing rapidly, this is
so crucial to us.


Stephane is exposed to many of the people confronted with the
challenges I just described above.
He has had the exceptional chance to meet many of the best performance
experts in most of the major companies, covering probably all the
businesses.
He has had the chance, and the pain, of being obliged to work with many
very different processors and architectures.
He has had the talent, the multi-year patience and the perseverance to
work and fight for us, not for himself, even if he probably likes his
baby.
I can assure you, he is not lazy, as some of you imprudently claimed.

I can assure you this domain is horribly large and subtle, far beyond
what you might think.


I probably underestimate the challenges your group faces, exactly as
you underestimate ours.
But I will never underestimate your efforts and exceptional talents.
And I should warmly thank your group for having offered us this
opportunity on the road to Eflops machines.

Clearly, without Linux we would not be there.
You are a key contributor to that vision; this is more important for
humanity than you might understand.

Is it not important to predict storms, hurricanes, tsunamis, ...?
Is it not important to discover many new drugs to prepare to fight H5N1?

Is it not important to have Google?

Is it not important to have automatic real-time translation, so that
one day we can all speak together?
Is it not important to understand global warming?
If you think you can do that with a PC, you are welcome to show us.
No, for some problems you need 1 Pflops, 1 Eflops or more.
No, you need hundreds of thousands of the best scientists on the
planet trying to push the limits; there are thousands of problems to
solve.

Never heard about ITER, SKA, CERN, ...?
In all cases you need tuned machines, because you cannot afford to
waste the electricity, or to wait 10 more years.


So your role today is to offer us a way to continue our progression;
please don't stall us.

This will be very hard for you, I have no doubt, and for us, and for
the many people involved, from near to far.
But many of the key problems of our society can only be solved with
those HPC machines, sorry.


LOOK AT THE BIG PICTURE

I encourage all people who depend on performance tools and suffer from
the lack of generally available high-performance profilers to express
their desire for better Linux support for performance tools.

Please encourage Stephane to continue the hard work, and help him to
succeed for us in integrating perfmon2.
Stephane is very open to major rewriting if required for any valid
reason, but stop asking for extreme truncation of the functionality.
The state of the code is not an accident; it is the result of a long
and progressive maturation shared with many people.

To be a little productive in this mail, I want to give an example of
something that would be interesting to have.
This is naturally a trivial example; in fact I will not explain what
the counters mean in detail or how to interpret them, as that is not
useful here.
Just understand that this is the result of sampling and multiplexing:
I report 6 different counters, each sampled at a different frequency.
Why different sampling frequencies? Because some events are very
frequent and others rare.
Some rare events can have a huge impact on efficiency; I need to
quantify their impact and so get optimal resolution.
I need to capture the maximum of information in one run, because I
cannot repeat the experiment.
I want to understand how my loops behave, one by one, because I will
tune them one by one, from the most promising to the least.
I will naturally collect the clock, caches, TLB, stall rate, bus
activity, flops, exceptions, ... the maximum I can in one complete or
partial run.
I want to take a one-hour sample at different moments of the life of
the program, to study how the counters evolve over time.
I need to be a normal user, and my administrator needs some counters to
monitor the global system, also done by multiplexing.
By the way, the nodes need careful intrusion control and synchronization
to minimize the pollution of my MPI-sensitive code.
This is not exact counting, for sure, because of sampling and the low
frequency I use to reduce the pollution, but it turns out to be
extremely productive in capturing all performance problems and offering
a clear understanding of very subtle problems that were previously
undetectable.
I can expose the percentage of time spent in a loop, the instruction
rate and the stall rate of the loop, and the correlation with other
counters will suggest a large variety of situations/improvements.

If I tune a loop that takes 20% of the time in the middle of something
very large, I can precisely measure the benefit or degradation of any
transformation.
The key to success: multiplexing the largest number of valid counters.
Valid means proven, by experience, to be sometimes useful to a wide
range of users.
For sure I will then try to reduce the set of counters and increase the
sampling frequency while maintaining the level of intrusion, but
generally the first pass is sufficient in most situations, because it
is more important to have lots of counters to correlate than a few
counters at ultra-high frequency; I said generally.
In fact what I would like to obtain is some systematization of that
mechanism for regression analysis.
Imagine I have something consuming less than 1 percent and reporting
tens of counters for any run of my large programs.
I would be delighted to have some tool to collect and post-analyze all
the variations.
If one day my code is 20% slower, it is clear some of those columns
will expose precisely why.
Observe that I coded no lines to achieve that dream.
This is critical because I will not be obliged to do 20 runs to
reproduce it again.
I will not fix 100% of all problems, but my success rate is improved by
an enormous factor, and that whatever the code is doing.
Imagine you have hundreds of nodes; do you think they are all identical
mathematical objects? NO.
The temporal and spatial variations could be crucial information to
collect and exploit.
You may observe that one or two nodes are systematically 5% slower; it
is probably interesting to fix that.
And you think this interests ONLY the HPC guys?
I am sure this will help you anticipate many hardware problems; most
problems have subtle performance symptoms that it would be interesting
to capture in order to avoid the crash.
We need some software infrastructure, and this can only be built on top
of some rich and standard low-level API that captures the best of the
counters, whatever they are or will become; we will push the hardware
designers to add more, much more. But why would they spend energy
adding something that nobody uses?
The sooner we have it, the better. There is a lot of work to add on top
of perfmon2 or PAPI or other layers, and on the processors, chipsets, ...

---------------------------------------------------------------------------------------------------------------

One example of a capture with a few counters to expose the concept:
column 1 is the address
column 2 is cycles
column 3 is cycles: hardwired in core2
column 4 is the instruction rate
column 5 is memory bus cycles
column 6 is the cache miss rate
column 7 is stall rate information
--------------------------------------------------------------------------------------------------------------

_ 423d54  14979  19282  20482  158   9187   282  mov   (%r9,%r15,1),%r10d
_ 423d58  23124 109815 114371  265  54809  3259  mov   0x4(%r9,%r15,1),%r11d
_ 423d5d   7135   8379   8888   40   4151   116  movss (%rdi,%r14,1),%xmm1
_ 423d63  15501  40062  41924   78  19540   977  mulss %xmm0,%xmm1
_ 423d67  49814  48065  50873  330  21451   509  movss %xmm1,(%rdi,%rax,1)
_ 423d6c  47365  58046  60958  159  29862  1036  mov   %r10d,(%r9,%rdx,1)
_ 423d70   3765   5132   5386   15   3052   103  mov   (%rdi,%r13,1),%r10d
_ 423d74  75911 130227 136796  318  62672  2538  mov   %r11d,0x4(%r9,%rdx,1)
_ 423d79   6623   9732  10444   32   5203   148  add   $0x8,%r9
_ 423d7d  10575   8214   8663   36   3681    88  mov   %r10d,0x67a500(%rdi)
_ 423d84  16072  23604  24715   42  11284   371  mov   (%rdi,%r12,1),%r10d
_ 423d88   5343  33126  34584  185  16226   938  mov   %r10d,0x684140(%rdi)
_ 423d8f  14227  22018  23249   80  10950   370  mov   (%rdi,%rbx,1),%r10d
_ 423d93  10917  34384  35815  168  17132   924  mov   %r10d,0x68dd80(%rdi)
_ 423d9a   9991  13744  14676   96   7077   202  add   $0x4,%rdi
_ 423d9e      0      8      9    0      0     0  cmp   %rcx,%rdi
_ 423da1    951   2611   2677   24     14    11  jl    423d54

--------------------------------------------------------------------------------------------------------------

Then please help Stephane to integrate the complete perfmon2 feature
set into the standard Linux kernels.

Patrick DEMICHEL

2007-11-18 00:35:44

by David Miller

[permalink] [raw]
Subject: Re: perfmon2 merge news

From: "Patrick DEMICHEL" <[email protected]>
Date: Sat, 17 Nov 2007 18:19:25 +0100

> Yet another noisy linux HPC user

Nobody on this list is interested in discussing this.

Really, the on-topic discussion here is the code and
the technical issues. And we will work on those to
get perfmon2 into shape in due time.

I guarantee you that 99% of the kernel developers didn't wade through
your description at all, myself included.

2007-11-19 13:08:54

by David Miller

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news


Instead of blabbering further about this topic, I decided to put my
code where my mouth is and spent the weekend porting the perfmon2
kernel bits, and the user bits (libpfm and pfmon) to sparc64.

As a result I've found that perfmon2 is quite nice and allows
incredibly useful and powerful tools to be written. The syscalls
aren't that bad, and really I see no reason to block its inclusion.

I rescind all of my earlier objections, let's merge this soon :-)

2007-11-19 21:00:40

by Stephane Eranian

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

David,

On Mon, Nov 19, 2007 at 05:08:43AM -0800, David Miller wrote:
>
> Instead of blabbering further about this topic, I decided to put my
> code where my mouth is and spent the weekend porting the perfmon2
> kernel bits, and the user bits (libpfm and pfmon) to sparc64.
>

I appreciate your effort. I am glad to see that the interface
and implementation survived yet another architecture. I think at this
point ARM is the only major architecture missing. In any case, I would
be happy to integrate your sparc64 patches.

> As a result I've found that perfmon2 is quite nice and allows
> incredibly useful and powerful tools to be written. The syscalls
> aren't that bad and really I see not reason to block it's inclusion.
>

As I said earlier, I am not opposed to changing the syscalls. I have
proposed a few schemes to address the issue of versioning. If vector
arguments are problematic, we can go with a single register per call.

I think there are other areas where perfmon2 could benefit from the
help of the LKML developers. I will post a list shortly.

> I rescind all of my earlier objections, let's merge this soon :-)

Thanks.

--
-Stephane

2007-11-19 21:44:47

by Paul Mackerras

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

David Miller writes:

> As a result I've found that perfmon2 is quite nice and allows
> incredibly useful and powerful tools to be written. The syscalls
> aren't that bad, and really I see no reason to block its inclusion.
>
> I rescind all of my earlier objections, let's merge this soon :-)

Strongly agree. However, I think we need to add structure size
arguments to most of the syscalls so we can extend them later.

Also, something I've been meaning to mention to Stephane is that the
use of the cast_ulp() macro in perfmon is bogus and won't work on
32-bit big-endian platforms such as ppc32 and sparc32. On such
platforms you can't take a pointer to an array of u64, cast it to
unsigned long * and expect the kernel bitmap operations to work
correctly on it. At the least you also need to XOR the bit numbers
with 32 on those platforms. Another alternative is to define the
bitmaps as arrays of bytes instead, which eliminates all byte ordering
and wordsize problems (but makes it more tricky to use the kernel
bitmap functions directly).
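A small illustration of the problem (not perfmon code, just a hedged
example of the bogus cast): suppose a bitmap is declared as an array of
u64 and bit 0 is set.

	u64 bitmap = 1ULL;                            /* bit 0 set      */
	unsigned long *p = (unsigned long *)&bitmap;  /* the bogus cast */

	/* On a 32-bit little-endian host: p[0] == 0x1, so test_bit(0, p)
	 * is true.
	 * On a 32-bit big-endian host: p[0] == 0x0 (the high half of the
	 * u64); the set bit lives in p[1], i.e. at bit number 32 as far
	 * as the generic bitmap helpers are concerned, hence the
	 * "XOR with 32" mentioned above. */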

Paul.

2007-11-19 22:56:05

by Stephane Eranian

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Paul,

On Tue, Nov 20, 2007 at 08:43:32AM +1100, Paul Mackerras wrote:
> David Miller writes:
>
> > As a result I've found that perfmon2 is quite nice and allows
> > incredibly useful and powerful tools to be written. The syscalls
> > aren't that bad, and really I see no reason to block its inclusion.
> >
> > I rescind all of my earlier objections, let's merge this soon :-)
>
> Strongly agree. However, I think we need to add structure size
> arguments to most of the syscalls so we can extend them later.
>
Yes, that is one way. It works well if you only extend structures at the end.
Given that you need to obtain the file descriptor first via a pfm_create_context
call, an alternative could be that you pass a version number to that call to
identify the version the application is requesting.
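As a minimal sketch of the size-argument idea (a hypothetical structure,
only to show the mechanics; needs <stdint.h>): user space records the size
it was compiled against, and the kernel can then accept older, shorter
layouts as long as new fields are only ever appended.

struct pfarg_example {
	uint32_t size;      /* = sizeof(struct pfarg_example) at build time */
	uint32_t flags;
	uint64_t value;
	/* new fields may only be appended here */
};

The kernel would copy in min(user-supplied size, its own struct size) bytes
and zero-fill the remainder.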

> Also, something I've been meaning to mention to Stephane is that the
> use of the cast_ulp() macro in perfmon is bogus and won't work on
> 32-bit big-endian platforms such as ppc32 and sparc32. On such

I don't like those cast_ulp() macros. They were put there to avoid compiler
warnings on some architectures. Clearly, with the big-endian issue, we need
to find something else. The bitmap*() helpers take unsigned long *.

The interface uses fixed-size types to ensure ABI compatibility between
32-bit and 64-bit modes. This way there is no need to marshal syscall
arguments for a 32-bit app running on a 64-bit host.

It looks like we will have to use bytes (u8) instead. This may have some
performance impact as well; several bitmaps are used in the context/interrupt
routines. Even with u8, there is still a problem with the bitmap*() helpers.
Now, only a small subset of them is used, so it may be okay to duplicate
them for u8.

What do you think?

> platforms you can't take a pointer to an array of u64, cast it to
> unsigned long * and expect the kernel bitmap operations to work
> correctly on it. At the least you also need to XOR the bit numbers
> with 32 on those platforms. Another alternative is to define the
> bitmaps as arrays of bytes instead, which eliminates all byte ordering
> and wordsize problems (but makes it more tricky to use the kernel
> bitmap functions directly).
>

--

-Stephane

2007-11-20 00:53:24

by David Miller

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

From: Stephane Eranian <[email protected]>
Date: Mon, 19 Nov 2007 14:48:46 -0800

> Looks like we will have to use bytes (u8) instead. This may have some
> performance impact as well. Several bitmaps are used in the context/interrupt
> routines. Even with u8, there is still a problem with the bitmap*() macros.
> Now, only a small subset of the bitmap() macros are used, so it may be okay
> to duplicate them for u8.

I think it would be fine to just create a set of bitop interfaces that
operate on u32 objects instead of "unsigned long".

Currently perfmon2 does not need the atomic variants at all, and those
could thus be provided entirely under include/asm-generic/bitops/
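A sketch of what such fixed-width, non-atomic helpers could look like
(names are hypothetical; u32 keeps the in-memory layout identical between
32-bit and 64-bit kernels of the same endianness):

static inline void pfm_u32_set_bit(unsigned int nr, u32 *map)
{
	map[nr >> 5] |= 1U << (nr & 31);
}

static inline int pfm_u32_test_bit(unsigned int nr, const u32 *map)
{
	return (map[nr >> 5] >> (nr & 31)) & 1;
}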

2007-11-20 00:55:32

by David Miller

[permalink] [raw]
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

From: Stephane Eranian <[email protected]>
Date: Mon, 19 Nov 2007 12:53:30 -0800

> In any case, I would be happy to integrate your sparc64 patches.

I sent these to Philip Mucci late last night, but in the meantime
I finished implementing breakpoint support as well for pfmon.

Let me clean up my diffs and I'll send it all out to you in a
few hours.

2007-12-13 16:01:23

by Stephane Eranian

[permalink] [raw]
Subject: Re: [perfmon2] perfmon2 merge news

Hello,

A few weeks back, I mentioned that I would post some
interesting problems that I have encountered while
implementing perfmon and for which I am still looking
for better solutions.

Here is one that I would like to solve right now and
for which I am interested in your comments.

One of the perfmon syscalls (pfm_restart()) is used to resume
monitoring after a user-level notification. When operating in
per-thread, non-self-monitoring mode, the syscall needs to operate
on the machine state of the monitored thread. So you get into this
situation:


    Thread T0                              Thread T1
        |                                      |
   pfm_restart()                               |
        |                                      |
   spin_lock_irqsave()                         |
        |                                      |
   <modify T1's machine state>---------------->|
        |                                      |
   spin_unlock_irqrestore()                    |
        |                                      |
        v                                      v

Thread T1 may be running at the time T0 needs to modify its state.
The current solution is to set a TIF flag in T1. That TIF flag will
cause T1 (on kernel exit) to go into a perfmon function that will
then modify the state, i.e., state is self-modified. That works okay
but there are a few race conditions. For self-monitoring sessions
(e.g., system-wide or per-thread), it is easy because we operate in
the correct thread.

But there is a big difference between self-monitoring and
non-self-monitoring: the pfm_restart() syscall does not provide the
same guarantee in both cases.

In self-monitoring modes, the interface guarantees that by the time you
return from the call, the effects of the call are visible. Whereas when
monitoring another thread, the call currently does not provide such a
guarantee, i.e., it does not wait until T1 has seen the TIF flag and
completed the state modification before returning. We could add a semaphore
to enforce that guarantee, but it gets difficult with corner cases and
cleanups in case of unexpected termination.

AFAIK, there is no single call to stop T1 and wait until it is completely
off the CPU, unless we go through the (internal) ptrace interface.

Would you have anything better to suggest?

Thanks.

--
-Stephane

2007-12-14 19:19:18

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [perfmon2] perfmon2 merge news


Stephane Eranian <[email protected]> writes:

> [...] AFAIK, there is no single call to stop T1 and wait until it
> is completely off the CPU, unless we go through the (internal)
> ptrace interface.

The utrace code supports this style of thread manipulation better
than ptrace.

- FChE

2007-12-14 21:08:41

by Stephane Eranian

[permalink] [raw]
Subject: Re: [perfmon2] perfmon2 merge news

Charles,

On Fri, Dec 14, 2007 at 02:12:17PM -0500, Frank Ch. Eigler wrote:
>
> Stephane Eranian <[email protected]> writes:
>
> > [...] AFAIK, there is no single call to stop T1 and wait until it
> > is completely off the CPU, unless we go through the (internal)
> > ptrace interface.
>
> The utrace code supports this style of thread manipulation better
> than ptrace.

Are you saying that utrace provides a utrace_thread_stop(tid) call
that returns only when the thread tid is off the CPU, and then a
utrace_thread_resume(tid) call? If that's the case, then that is
what I need.

Where do we stand with regard to utrace integration?

Thanks.

--
-Stephane

2007-12-15 15:59:27

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [perfmon2] perfmon2 merge news

Stephane Eranian <[email protected]> writes:

> [...]
>> > [...] AFAIK, there is no single call to stop T1 and wait until it
>> > is completely off the CPU, unless we go through the (internal)
>> > ptrace interface.
>>
>> The utrace code supports this style of thread manipulation better
>> than ptrace.
>
> Are you saying that utrace provides a utrace_thread_stop(tid) call
> that returns only when the thread tid is off the CPU. And then there
> is a utrace_thread_resume(tid) call. If that's the case then that is
> what I need.

While I see no single call, it can be synthesized from a sequence of
them: utrace_attach, utrace_set_flags (... UTRACE_ACTION_QUIESCE ...),
then waiting for a callback. Roland, is there a more compact way?

> How are we with regards to utrace integration?

Roland McGrath is working on breaking the patches down.

- FChE