by Namhyung Kim

[permalink] [raw]

Subject: Re: [PATCH v3 1/2] perf/core: Share an event with multiple cgroups

Hi Peter,

On Wed, Apr 21, 2021 at 12:37 PM Namhyung Kim <[email protected]> wrote:
>
> On Tue, Apr 20, 2021 at 8:29 PM Peter Zijlstra <[email protected]> wrote:
> >
> > On Tue, Apr 20, 2021 at 01:34:40AM -0700, Stephane Eranian wrote:
> > > This does not scale for us:
> > > - run against the fd limit, but also memory consumption in the
> > > kernel per struct file, struct inode, struct perf_event ....
> > > - number of events per-cpu is still also large
> > > - require event scheduling on cgroup switches, even with RB-tree
> > > improvements, still heavy
> > > - require event scheduling even if measuring the same events across
> > > all cgroups
> > >
> > > One factor in that equation above needs to disappear. The one counter
> > > per file descriptor is respected with Nahmyung's patch because he is
> > > operating a plain per-cpu mode. What changes is just how and where the
> > > count is accumulated in perf_events. The resulting programming on the
> > > hardware is the same as before.
> >
> > Yes, you're aggregating differently. And that's exactly the problem. The
> > aggregation is a variable one with fairly poor semantics. Suppose you
> > create a new cgroup, then you have to tear down and recreate the whole
> > thing, which is pretty crap.
>
> Yep, but I think cgroup aggregation is an important use case and
> we'd better support it efficiently.
>
> Tracking all cgroups (including new one) can be difficult, that's why
> I suggested passing a list of interested cgroups and counting them
> only. I can change it to allow adding new cgroups without tearing
> down the existing list. Is that ok to you?

Trying to move it forward.. I'll post v4 if you don't object to adding
new cgroup nodes while keeping the existing ones.

Thanks,
Namhyung

>
> >
> > Ftrace had a similar issue; where people wanted aggregation, and that
> > resulted in the event histogram, which, quite frankla,y is a scary
> > monster that I've no intention of duplicating. That's half a programming
> > language implemented.
>
> The ftrace event histogram supports generic aggregation. IOW users
> can specify which key and data field to aggregate. That surely would
> complicate the things.
>
> >
> > > As you point out, the difficulty is how to express the cgroups of
> > > interest and how to read the counts back. I agree that the ioctl() is
> > > not ideal for the latter. For the former, if you do not want ioctl()
> > > then you would have to overload perf_event_open() with a vector of
> > > cgroup fd, for instance. As for the read, you could, as you suggest,
> > > use the read syscall if you want to read all the cgroups at once using
> > > a new read_format. I don't have a problem with that. As for cgroup-id
> > > vs. cgroup-fd, I think you make a fair point about consistency with
> > > the existing approach. I don't have a problem with that either
> >
> > So that is a problem of aggregation; which is basically a
> > programmability problem. You're asking for a variadic-fixed-function
> > now, but tomorrow someone else will come and want another one.
>
> Well.. maybe we can add more stuff later if it's really needed.
> But BPF also can handle many aggregations these days. :)
>
> Thanks,
> Namhyung

2021-05-09 07:23:47

by Namhyung Kim

[permalink] [raw]

Subject: Re: [PATCH v3 1/2] perf/core: Share an event with multiple cgroups

Hi Peter,

Thinking about the interface a bit more...

On Fri, Apr 16, 2021 at 4:59 AM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Apr 16, 2021 at 08:22:38PM +0900, Namhyung Kim wrote:
> > On Fri, Apr 16, 2021 at 7:28 PM Peter Zijlstra <[email protected]> wrote:
> > >
> > > On Fri, Apr 16, 2021 at 11:29:30AM +0200, Peter Zijlstra wrote:
> > >
> > > > > So I think we've had proposals for being able to close fds in the past;
> > > > > while preserving groups etc. We've always pushed back on that because of
> > > > > the resource limit issue. By having each counter be a filedesc we get a
> > > > > natural limit on the amount of resources you can consume. And in that
> > > > > respect, having to use 400k fds is things working as designed.
> > > > >
> > > > > Anyway, there might be a way around this..
> > >
> > > So how about we flip the whole thing sideways, instead of doing one
> > > event for multiple cgroups, do an event for multiple-cpus.
> > >
> > > Basically, allow:
> > >
> > > perf_event_open(.pid=fd, cpu=-1, .flag=PID_CGROUP);
> > >
> > > Which would have the kernel create nr_cpus events [the corrolary is that
> > > we'd probably also allow: (.pid=-1, cpu=-1) ].
> >
> > Do you mean it'd have separate perf_events per cpu internally?
> > From a cpu's perspective, there's nothing changed, right?
> > Then it will have the same performance problem as of now.
>
> Yes, but we'll not end up in ioctl() hell. The interface is sooo much
> better. The performance thing just means we need to think harder.

So I'd like to have vector support for cgroups but it could be
extended later. So open with a flag that it'd accept a vector

fd = perf_event_open(.pid=-1, .cpu=N, .flag=VECTOR);

Then it'd still need an additional interface (probably ioctl) to
set (or append) the vector.

ioctl(fd, ADD_VECTOR, { .type = VEC_CGROUP, .nr = N, ... });

Maybe we also need to add FORMAT_VECTOR and use read(v)
or friends to read the contents for each entry. It'd be nice
if it can have a vector-specific info like cgroup-id in this case.

What do you think?

Thanks,
Namhyung