Hello,
This work makes perf stat more scalable when counting events for a lot
of cgroups. Currently we need to open a separate perf_event to count
an event in a cgroup. On a big machine this requires a huge number of
events, e.g.
256 cpus x 8 events x 200 cgroups = 409600 events
This is very wasteful and not scalable. In this case perf stat
actually counts exactly the same events for each cgroup, so we can
just use a single event to measure all cgroups running on that cpu.
So I added new ioctl commands to add per-cgroup counters to an
existing perf_event and to read the per-cgroup counters from the
event. The per-cgroup counters are updated during the context switch
when the tasks' cgroups differ (no change to the HW PMU is needed).
The counters are kept in a hash table keyed by cgroup id.
With this change, the average processing time of my internal test
workload, which runs tasks in different cgroups communicating over
pipes, dropped from 11.3 usec to 5.8 usec.
Thanks,
Namhyung
Namhyung Kim (2):
perf/core: Share an event with multiple cgroups
perf/core: Support reading group events with shared cgroups
include/linux/perf_event.h | 22 ++
include/uapi/linux/perf_event.h | 2 +
kernel/events/core.c | 588 ++++++++++++++++++++++++++++++--
3 files changed, 585 insertions(+), 27 deletions(-)
--
2.31.0.rc2.261.g7f71774620-goog
This enables reading an event group's counter values together with the
PERF_EVENT_IOC_READ_CGROUP command, like we do in the regular read().
Users should pass a buffer large enough to hold all the values to be read.
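For example, for a group of two events opened with PERF_FORMAT_GROUP |
PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING, a
hypothetical user-space reader (illustrative only, not part of this
patch; it assumes <stdint.h>, <sys/ioctl.h> and the updated
<linux/perf_event.h> from this series) could look like:

  static int read_cgroup_group(int leader_fd, uint64_t cgrp_id,
                               uint64_t counts[2])
  {
          uint64_t buf[7] = { 0 };        /* 2 header + 5 result u64s */

          buf[0] = sizeof(buf);           /* size of the array in bytes */
          buf[1] = cgrp_id;               /* cgroup id to read */

          if (ioctl(leader_fd, PERF_EVENT_IOC_READ_CGROUP, buf) < 0)
                  return -1;

          /*
           * buf[2] = number of events (2 here),
           * buf[3] = time_enabled, buf[4] = time_running for this cgroup,
           * buf[5] = leader's count, buf[6] = sibling's count.
           */
          counts[0] = buf[5];
          counts[1] = buf[6];
          return 0;
  }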
Signed-off-by: Namhyung Kim <[email protected]>
---
kernel/events/core.c | 119 +++++++++++++++++++++++++++++++++++++++++--
1 file changed, 116 insertions(+), 3 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 38c26a23418a..3225177e54d5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2232,13 +2232,24 @@ static void perf_add_cgrp_node_list(struct perf_event *event,
{
struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
+ struct perf_event *sibling;
bool is_first;
lockdep_assert_irqs_disabled();
lockdep_assert_held(&ctx->lock);
+ /* only group leader can be added directly */
+ if (event->group_leader != event)
+ return;
+
+ if (!event_has_cgroup_node(event))
+ return;
+
is_first = list_empty(&ctx->cgrp_node_list);
+
list_add_tail(&event->cgrp_node_entry, &ctx->cgrp_node_list);
+ for_each_sibling_event(sibling, event)
+ list_add_tail(&sibling->cgrp_node_entry, &ctx->cgrp_node_list);
if (is_first)
list_add_tail(&ctx->cgrp_ctx_entry, cgrp_ctx_list);
@@ -2250,15 +2261,25 @@ static void perf_del_cgrp_node_list(struct perf_event *event,
struct perf_event_context *ctx)
{
struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
+ struct perf_event *sibling;
lockdep_assert_irqs_disabled();
lockdep_assert_held(&ctx->lock);
+ /* only group leader can be deleted directly */
+ if (event->group_leader != event)
+ return;
+
+ if (!event_has_cgroup_node(event))
+ return;
+
update_cgroup_node(event, cgrp->css.cgroup);
/* to refresh delta when it's enabled */
event->cgrp_node_count = 0;
list_del(&event->cgrp_node_entry);
+ for_each_sibling_event(sibling, event)
+ list_del(&sibling->cgrp_node_entry);
if (list_empty(&ctx->cgrp_node_list))
list_del(&ctx->cgrp_ctx_entry);
@@ -2333,7 +2354,7 @@ static int perf_event_attach_cgroup_node(struct perf_event *event, u64 nr_cgrps,
raw_spin_unlock_irqrestore(&ctx->lock, flags);
- if (is_first && enabled)
+ if (is_first && enabled && event->group_leader == event)
event_function_call(event, perf_attach_cgroup_node, NULL);
return 0;
@@ -2370,8 +2391,8 @@ static void __perf_read_cgroup_node(struct perf_event *event)
}
}
-static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
- u64 cgrp_id, char __user *buf)
+static int perf_event_read_cgrp_node_one(struct perf_event *event, u64 cgrp_id,
+ char __user *buf)
{
struct perf_cgroup_node *cgrp;
struct perf_event_context *ctx = event->ctx;
@@ -2406,6 +2427,91 @@ static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
return n * sizeof(u64);
}
+
+static int perf_event_read_cgrp_node_sibling(struct perf_event *event,
+ u64 read_format, u64 cgrp_id,
+ u64 *values)
+{
+ struct perf_cgroup_node *cgrp;
+ int n = 0;
+
+ cgrp = find_cgroup_node(event, cgrp_id);
+ if (cgrp == NULL)
+ return (read_format & PERF_FORMAT_ID) ? 2 : 1;
+
+ values[n++] = cgrp->count;
+ if (read_format & PERF_FORMAT_ID)
+ values[n++] = primary_event_id(event);
+ return n;
+}
+
+static int perf_event_read_cgrp_node_group(struct perf_event *event, u64 cgrp_id,
+ char __user *buf)
+{
+ struct perf_cgroup_node *cgrp;
+ struct perf_event_context *ctx = event->ctx;
+ struct perf_event *sibling;
+ u64 read_format = event->attr.read_format;
+ unsigned long flags;
+ u64 *values;
+ int n = 1;
+ int ret;
+
+ values = kzalloc(event->read_size, GFP_KERNEL);
+ if (!values)
+ return -ENOMEM;
+
+ values[0] = 1 + event->nr_siblings;
+
+ /* update event count and times (possibly run on other cpu) */
+ (void)perf_event_read(event, true);
+
+ raw_spin_lock_irqsave(&ctx->lock, flags);
+
+ cgrp = find_cgroup_node(event, cgrp_id);
+ if (cgrp == NULL) {
+ raw_spin_unlock_irqrestore(&ctx->lock, flags);
+ kfree(values);
+ return -ENOENT;
+ }
+
+ if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
+ values[n++] = cgrp->time_enabled;
+ if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
+ values[n++] = cgrp->time_running;
+
+ values[n++] = cgrp->count;
+ if (read_format & PERF_FORMAT_ID)
+ values[n++] = primary_event_id(event);
+
+ for_each_sibling_event(sibling, event) {
+ n += perf_event_read_cgrp_node_sibling(sibling, read_format,
+ cgrp_id, &values[n]);
+ }
+
+ raw_spin_unlock_irqrestore(&ctx->lock, flags);
+
+ ret = copy_to_user(buf, values, n * sizeof(u64));
+ kfree(values);
+ if (ret)
+ return -EFAULT;
+
+ return n * sizeof(u64);
+}
+
+static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
+ u64 cgrp_id, char __user *buf)
+{
+ u64 read_format = event->attr.read_format;
+
+ if (read_size < event->read_size + 2 * sizeof(u64))
+ return -EINVAL;
+
+ if (read_format & PERF_FORMAT_GROUP)
+ return perf_event_read_cgrp_node_group(event, cgrp_id, buf);
+
+ return perf_event_read_cgrp_node_one(event, cgrp_id, buf);
+}
#else /* !CONFIG_CGROUP_PERF */
static inline bool event_can_attach_cgroup(struct perf_event *event)
{
@@ -2511,6 +2617,7 @@ static void perf_group_detach(struct perf_event *event)
if (sibling->state == PERF_EVENT_STATE_ACTIVE)
list_add_tail(&sibling->active_list, get_event_list(sibling));
}
+ perf_add_cgrp_node_list(sibling, event->ctx);
WARN_ON_ONCE(sibling->ctx != event->ctx);
}
@@ -2654,6 +2761,9 @@ __perf_remove_from_context(struct perf_event *event,
perf_group_detach(event);
list_del_event(event, ctx);
+ if (event->state > PERF_EVENT_STATE_OFF)
+ perf_del_cgrp_node_list(event, ctx);
+
if (!ctx->nr_events && ctx->is_active) {
ctx->is_active = 0;
ctx->rotate_necessary = 0;
@@ -3112,6 +3222,9 @@ static int __perf_install_in_context(void *info)
reprogram = cgroup_is_descendant(cgrp->css.cgroup,
event->cgrp->css.cgroup);
}
+
+ if (event->state > PERF_EVENT_STATE_OFF)
+ perf_add_cgrp_node_list(event, ctx);
#endif
if (reprogram) {
--
2.31.0.rc2.261.g7f71774620-goog
As we can run many jobs (in containers) on a big machine, we want to
measure each job's performance during the run. To do that, a
perf_event can be associated with a cgroup to measure only that cgroup.
However, such cgroup events need to be opened separately, which causes
significant overhead in event multiplexing during the context switch
as well as resource consumption in file descriptors and memory
footprint.
As a cgroup event is basically a cpu event, we can share a single cpu
event for multiple cgroups. All we need is a separate counter (and
two timing variables) for each cgroup. I added a hash table to map
from cgroup id to the attached cgroups.
With this change, the cpu event needs to calculate a delta of the
event counter values when the cgroups of the current and the next task
are different, and it attributes the delta to the current task's cgroup.
This patch adds two new ioctl commands to perf_event for light-weight
cgroup event counting (i.e. perf stat).
* PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consisting of a
64-bit array to attach the given cgroups. The first element is the
number of cgroups in the buffer, and the rest is a list of cgroup
ids whose cgroup info is added to the given event.
* PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consisting of a 64-bit
array to get the event counter values. The first element is the size
of the array in bytes, and the second element is the cgroup id to
read. The rest is used to return the counter value and timings
(see the sketch below).
This attaches all cgroups in a single syscall; I deliberately didn't
add a DETACH command, to keep the implementation simple. The attached
cgroup nodes are deleted when the file descriptor of the perf_event is
closed.
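A minimal user-space sketch of the intended usage might look like the
following. This is illustrative only and not part of the patch; it
assumes the updated uapi header from this series, that 'fd' is a
counting (non-sampling) cpu event opened with perf_event_open(), and
that the cgroup ids are the 64-bit ids used by the kernel (e.g. as
obtained via name_to_handle_at() on the cgroup directory):

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/perf_event.h>

  /* attach two cgroups to a single event fd in one ioctl */
  static int attach_cgroups(int fd, uint64_t cgrp_id_a, uint64_t cgrp_id_b)
  {
          uint64_t buf[3];

          buf[0] = 2;             /* number of cgroup ids that follow */
          buf[1] = cgrp_id_a;
          buf[2] = cgrp_id_b;

          return ioctl(fd, PERF_EVENT_IOC_ATTACH_CGROUP, buf);
  }

  /* read one cgroup's counter value back from the shared event */
  static int read_cgroup(int fd, uint64_t cgrp_id, uint64_t *count)
  {
          uint64_t buf[6] = { 0 };

          buf[0] = sizeof(buf);   /* size of the whole array in bytes */
          buf[1] = cgrp_id;       /* cgroup id to read */

          if (ioctl(fd, PERF_EVENT_IOC_READ_CGROUP, buf) < 0)
                  return -1;

          *count = buf[2];        /* then times/id, depending on read_format */
          return 0;
  }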
Cc: Tejun Heo <[email protected]>
Signed-off-by: Namhyung Kim <[email protected]>
---
include/linux/perf_event.h | 22 ++
include/uapi/linux/perf_event.h | 2 +
kernel/events/core.c | 474 ++++++++++++++++++++++++++++++--
3 files changed, 471 insertions(+), 27 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3f7f89ea5e51..2760f3b07534 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -771,6 +771,18 @@ struct perf_event {
#ifdef CONFIG_CGROUP_PERF
struct perf_cgroup *cgrp; /* cgroup event is attach to */
+
+ /* to share an event for multiple cgroups */
+ struct hlist_head *cgrp_node_hash;
+ struct perf_cgroup_node *cgrp_node_entries;
+ int nr_cgrp_nodes;
+ int cgrp_node_hash_bits;
+
+ struct list_head cgrp_node_entry;
+
+ u64 cgrp_node_count;
+ u64 cgrp_node_time_enabled;
+ u64 cgrp_node_time_running;
#endif
#ifdef CONFIG_SECURITY
@@ -780,6 +792,14 @@ struct perf_event {
#endif /* CONFIG_PERF_EVENTS */
};
+struct perf_cgroup_node {
+ struct hlist_node node;
+ u64 id;
+ u64 count;
+ u64 time_enabled;
+ u64 time_running;
+ u64 padding[2];
+};
struct perf_event_groups {
struct rb_root tree;
@@ -843,6 +863,8 @@ struct perf_event_context {
int pin_count;
#ifdef CONFIG_CGROUP_PERF
int nr_cgroups; /* cgroup evts */
+ struct list_head cgrp_node_list;
+ struct list_head cgrp_ctx_entry;
#endif
void *task_ctx_data; /* pmu specific data */
struct rcu_head rcu_head;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index ad15e40d7f5d..06bc7ab13616 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -479,6 +479,8 @@ struct perf_event_query_bpf {
#define PERF_EVENT_IOC_PAUSE_OUTPUT _IOW('$', 9, __u32)
#define PERF_EVENT_IOC_QUERY_BPF _IOWR('$', 10, struct perf_event_query_bpf *)
#define PERF_EVENT_IOC_MODIFY_ATTRIBUTES _IOW('$', 11, struct perf_event_attr *)
+#define PERF_EVENT_IOC_ATTACH_CGROUP _IOW('$', 12, __u64 *)
+#define PERF_EVENT_IOC_READ_CGROUP _IOWR('$', 13, __u64 *)
enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f07943183041..38c26a23418a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -379,6 +379,7 @@ enum event_type_t {
* perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
*/
+static void perf_sched_enable(void);
static void perf_sched_delayed(struct work_struct *work);
DEFINE_STATIC_KEY_FALSE(perf_sched_events);
static DECLARE_DELAYED_WORK(perf_sched_work, perf_sched_delayed);
@@ -2124,6 +2125,322 @@ static int perf_get_aux_event(struct perf_event *event,
return 1;
}
+#ifdef CONFIG_CGROUP_PERF
+static DEFINE_PER_CPU(struct list_head, cgroup_ctx_list);
+
+static bool event_can_attach_cgroup(struct perf_event *event)
+{
+ if (is_sampling_event(event))
+ return false;
+ if (event->attach_state & PERF_ATTACH_TASK)
+ return false;
+ if (is_cgroup_event(event))
+ return false;
+
+ return true;
+}
+
+static bool event_has_cgroup_node(struct perf_event *event)
+{
+ return event->nr_cgrp_nodes > 0;
+}
+
+static struct perf_cgroup_node *
+find_cgroup_node(struct perf_event *event, u64 cgrp_id)
+{
+ struct perf_cgroup_node *cgrp_node;
+ int key = hash_64(cgrp_id, event->cgrp_node_hash_bits);
+
+ hlist_for_each_entry(cgrp_node, &event->cgrp_node_hash[key], node) {
+ if (cgrp_node->id == cgrp_id)
+ return cgrp_node;
+ }
+
+ return NULL;
+}
+
+static void perf_update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
+{
+ u64 delta_count, delta_time_enabled, delta_time_running;
+ int i;
+
+ if (event->cgrp_node_count == 0)
+ goto out;
+
+ delta_count = local64_read(&event->count) - event->cgrp_node_count;
+ delta_time_enabled = event->total_time_enabled - event->cgrp_node_time_enabled;
+ delta_time_running = event->total_time_running - event->cgrp_node_time_running;
+
+ /* account delta to all ancestor cgroups */
+ for (i = 0; i <= cgrp->level; i++) {
+ struct perf_cgroup_node *node;
+
+ node = find_cgroup_node(event, cgrp->ancestor_ids[i]);
+ if (node) {
+ node->count += delta_count;
+ node->time_enabled += delta_time_enabled;
+ node->time_running += delta_time_running;
+ }
+ }
+
+out:
+ event->cgrp_node_count = local64_read(&event->count);
+ event->cgrp_node_time_enabled = event->total_time_enabled;
+ event->cgrp_node_time_running = event->total_time_running;
+}
+
+static void update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
+{
+ if (event->state == PERF_EVENT_STATE_ACTIVE)
+ event->pmu->read(event);
+
+ perf_event_update_time(event);
+ perf_update_cgroup_node(event, cgrp);
+}
+
+/* this is called from context switch */
+static void update_cgroup_node_events(struct perf_event_context *ctx,
+ struct cgroup *cgrp)
+{
+ struct perf_event *event;
+
+ lockdep_assert_held(&ctx->lock);
+
+ if (ctx->is_active & EVENT_TIME)
+ update_context_time(ctx);
+
+ list_for_each_entry(event, &ctx->cgrp_node_list, cgrp_node_entry)
+ update_cgroup_node(event, cgrp);
+}
+
+static void cgroup_node_sched_out(struct task_struct *task)
+{
+ struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
+ struct perf_cgroup *cgrp = perf_cgroup_from_task(task, NULL);
+ struct perf_event_context *ctx;
+
+ list_for_each_entry(ctx, cgrp_ctx_list, cgrp_ctx_entry) {
+ raw_spin_lock(&ctx->lock);
+ update_cgroup_node_events(ctx, cgrp->css.cgroup);
+ raw_spin_unlock(&ctx->lock);
+ }
+}
+
+/* this is called from the when event is enabled/disabled */
+static void perf_add_cgrp_node_list(struct perf_event *event,
+ struct perf_event_context *ctx)
+{
+ struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
+ struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
+ bool is_first;
+
+ lockdep_assert_irqs_disabled();
+ lockdep_assert_held(&ctx->lock);
+
+ is_first = list_empty(&ctx->cgrp_node_list);
+ list_add_tail(&event->cgrp_node_entry, &ctx->cgrp_node_list);
+
+ if (is_first)
+ list_add_tail(&ctx->cgrp_ctx_entry, cgrp_ctx_list);
+
+ update_cgroup_node(event, cgrp->css.cgroup);
+}
+
+static void perf_del_cgrp_node_list(struct perf_event *event,
+ struct perf_event_context *ctx)
+{
+ struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
+
+ lockdep_assert_irqs_disabled();
+ lockdep_assert_held(&ctx->lock);
+
+ update_cgroup_node(event, cgrp->css.cgroup);
+ /* to refresh delta when it's enabled */
+ event->cgrp_node_count = 0;
+
+ list_del(&event->cgrp_node_entry);
+
+ if (list_empty(&ctx->cgrp_node_list))
+ list_del(&ctx->cgrp_ctx_entry);
+}
+
+static void perf_attach_cgroup_node(struct perf_event *event,
+ struct perf_cpu_context *cpuctx,
+ struct perf_event_context *ctx,
+ void *data)
+{
+ if (ctx->is_active & EVENT_TIME)
+ update_context_time(ctx);
+
+ perf_add_cgrp_node_list(event, ctx);
+}
+
+#define MIN_CGRP_NODE_HASH 4
+#define MAX_CGRP_NODE_HASH (4 * 1024)
+
+/* this is called from ioctl */
+static int perf_event_attach_cgroup_node(struct perf_event *event, u64 nr_cgrps,
+ u64 *cgroup_ids)
+{
+ struct perf_cgroup_node *cgrp_node;
+ struct perf_event_context *ctx = event->ctx;
+ struct hlist_head *cgrp_node_hash;
+ int node = (event->cpu >= 0) ? cpu_to_node(event->cpu) : -1;
+ unsigned long flags;
+ bool is_first = true;
+ bool enabled;
+ int i, nr_hash;
+ int hash_bits;
+
+ if (nr_cgrps < MIN_CGRP_NODE_HASH)
+ nr_hash = MIN_CGRP_NODE_HASH;
+ else
+ nr_hash = roundup_pow_of_two(nr_cgrps);
+ hash_bits = ilog2(nr_hash);
+
+ cgrp_node_hash = kcalloc_node(nr_hash, sizeof(*cgrp_node_hash),
+ GFP_KERNEL, node);
+ if (cgrp_node_hash == NULL)
+ return -ENOMEM;
+
+ cgrp_node = kcalloc_node(nr_cgrps, sizeof(*cgrp_node), GFP_KERNEL, node);
+ if (cgrp_node == NULL) {
+ kfree(cgrp_node_hash);
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < (int)nr_cgrps; i++) {
+ int key = hash_64(cgroup_ids[i], hash_bits);
+
+ cgrp_node[i].id = cgroup_ids[i];
+ hlist_add_head(&cgrp_node[i].node, &cgrp_node_hash[key]);
+ }
+
+ raw_spin_lock_irqsave(&ctx->lock, flags);
+
+ enabled = event->state >= PERF_EVENT_STATE_INACTIVE;
+
+ if (event->nr_cgrp_nodes != 0) {
+ kfree(event->cgrp_node_hash);
+ kfree(event->cgrp_node_entries);
+ is_first = false;
+ }
+
+ event->cgrp_node_hash = cgrp_node_hash;
+ event->cgrp_node_entries = cgrp_node;
+ event->cgrp_node_hash_bits = hash_bits;
+ event->nr_cgrp_nodes = nr_cgrps;
+
+ raw_spin_unlock_irqrestore(&ctx->lock, flags);
+
+ if (is_first && enabled)
+ event_function_call(event, perf_attach_cgroup_node, NULL);
+
+ return 0;
+}
+
+static void perf_event_destroy_cgroup_nodes(struct perf_event *event)
+{
+ struct perf_event_context *ctx = event->ctx;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&ctx->lock, flags);
+
+ if (event_has_cgroup_node(event)) {
+ if (!atomic_add_unless(&perf_sched_count, -1, 1))
+ schedule_delayed_work(&perf_sched_work, HZ);
+ }
+
+ kfree(event->cgrp_node_hash);
+ kfree(event->cgrp_node_entries);
+ event->nr_cgrp_nodes = 0;
+
+ raw_spin_unlock_irqrestore(&ctx->lock, flags);
+}
+
+static int perf_event_read(struct perf_event *event, bool group);
+
+static void __perf_read_cgroup_node(struct perf_event *event)
+{
+ struct perf_cgroup *cgrp;
+
+ if (event_has_cgroup_node(event)) {
+ cgrp = perf_cgroup_from_task(current, NULL);
+ perf_update_cgroup_node(event, cgrp->css.cgroup);
+ }
+}
+
+static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
+ u64 cgrp_id, char __user *buf)
+{
+ struct perf_cgroup_node *cgrp;
+ struct perf_event_context *ctx = event->ctx;
+ unsigned long flags;
+ u64 read_format = event->attr.read_format;
+ u64 values[4];
+ int n = 0;
+
+ /* update event count and times (possibly run on other cpu) */
+ (void)perf_event_read(event, false);
+
+ raw_spin_lock_irqsave(&ctx->lock, flags);
+
+ cgrp = find_cgroup_node(event, cgrp_id);
+ if (cgrp == NULL) {
+ raw_spin_unlock_irqrestore(&ctx->lock, flags);
+ return -ENOENT;
+ }
+
+ values[n++] = cgrp->count;
+ if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
+ values[n++] = cgrp->time_enabled;
+ if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
+ values[n++] = cgrp->time_running;
+ if (read_format & PERF_FORMAT_ID)
+ values[n++] = primary_event_id(event);
+
+ raw_spin_unlock_irqrestore(&ctx->lock, flags);
+
+ if (copy_to_user(buf, values, n * sizeof(u64)))
+ return -EFAULT;
+
+ return n * sizeof(u64);
+}
+#else /* !CONFIG_CGROUP_PERF */
+static inline bool event_can_attach_cgroup(struct perf_event *event)
+{
+ return false;
+}
+
+static inline bool event_has_cgroup_node(struct perf_event *event)
+{
+ return false;
+}
+
+static void cgroup_node_sched_out(struct task_struct *task) {}
+
+static inline void perf_add_cgrp_node_list(struct perf_event *event,
+ struct perf_event_context *ctx) {}
+static inline void perf_del_cgrp_node_list(struct perf_event *event,
+ struct perf_event_context *ctx) {}
+
+#define MAX_CGRP_NODE_HASH 1
+static inline int perf_event_attach_cgroup_node(struct perf_event *event,
+ u64 nr_cgrps, u64 *cgrp_ids)
+{
+ return -ENODEV;
+}
+
+static inline void perf_event_destroy_cgroup_nodes(struct perf_event *event) {}
+static inline void __perf_read_cgroup_node(struct perf_event *event) {}
+
+static inline int perf_event_read_cgroup_node(struct perf_event *event,
+ u64 cgrp_id, char __user *buf)
+{
+ return -EINVAL;
+}
+#endif /* CONFIG_CGROUP_PERF */
+
static inline struct list_head *get_event_list(struct perf_event *event)
{
struct perf_event_context *ctx = event->ctx;
@@ -2407,6 +2724,7 @@ static void __perf_event_disable(struct perf_event *event,
perf_event_set_state(event, PERF_EVENT_STATE_OFF);
perf_cgroup_event_disable(event, ctx);
+ perf_del_cgrp_node_list(event, ctx);
}
/*
@@ -2946,6 +3264,7 @@ static void __perf_event_enable(struct perf_event *event,
perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
perf_cgroup_event_enable(event, ctx);
+ perf_add_cgrp_node_list(event, ctx);
if (!ctx->is_active)
return;
@@ -3568,6 +3887,11 @@ void __perf_event_task_sched_out(struct task_struct *task,
*/
if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
perf_cgroup_sched_out(task, next);
+
+ if (!list_empty(this_cpu_ptr(&cgroup_ctx_list)) &&
+ perf_cgroup_from_task(task, NULL) !=
+ perf_cgroup_from_task(next, NULL))
+ cgroup_node_sched_out(task);
}
/*
@@ -4268,6 +4592,7 @@ static void __perf_event_read(void *info)
if (!data->group) {
pmu->read(event);
+ __perf_read_cgroup_node(event);
data->ret = 0;
goto unlock;
}
@@ -4283,6 +4608,7 @@ static void __perf_event_read(void *info)
* sibling could be on different (eg: software) PMU.
*/
sub->pmu->read(sub);
+ __perf_read_cgroup_node(sub);
}
}
@@ -4461,6 +4787,8 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
INIT_LIST_HEAD(&ctx->event_list);
INIT_LIST_HEAD(&ctx->pinned_active);
INIT_LIST_HEAD(&ctx->flexible_active);
+ INIT_LIST_HEAD(&ctx->cgrp_ctx_entry);
+ INIT_LIST_HEAD(&ctx->cgrp_node_list);
refcount_set(&ctx->refcount, 1);
}
@@ -4851,6 +5179,8 @@ static void _free_event(struct perf_event *event)
if (is_cgroup_event(event))
perf_detach_cgroup(event);
+ perf_event_destroy_cgroup_nodes(event);
+
if (!event->parent) {
if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
put_callchain_buffers();
@@ -5571,6 +5901,58 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
return perf_event_modify_attr(event, &new_attr);
}
+
+ case PERF_EVENT_IOC_ATTACH_CGROUP: {
+ u64 nr_cgrps;
+ u64 *cgrp_buf;
+ size_t cgrp_bufsz;
+ int ret;
+
+ if (!event_can_attach_cgroup(event))
+ return -EINVAL;
+
+ if (copy_from_user(&nr_cgrps, (u64 __user *)arg,
+ sizeof(nr_cgrps)))
+ return -EFAULT;
+
+ if (nr_cgrps == 0 || nr_cgrps > MAX_CGRP_NODE_HASH)
+ return -EINVAL;
+
+ cgrp_bufsz = nr_cgrps * sizeof(*cgrp_buf);
+
+ cgrp_buf = kmalloc(cgrp_bufsz, GFP_KERNEL);
+ if (cgrp_buf == NULL)
+ return -ENOMEM;
+
+ if (copy_from_user(cgrp_buf, (u64 __user *)(arg + 8),
+ cgrp_bufsz)) {
+ kfree(cgrp_buf);
+ return -EFAULT;
+ }
+
+ ret = perf_event_attach_cgroup_node(event, nr_cgrps, cgrp_buf);
+
+ kfree(cgrp_buf);
+ return ret;
+ }
+
+ case PERF_EVENT_IOC_READ_CGROUP: {
+ u64 read_size, cgrp_id;
+
+ if (!event_can_attach_cgroup(event))
+ return -EINVAL;
+
+ if (copy_from_user(&read_size, (u64 __user *)arg,
+ sizeof(read_size)))
+ return -EFAULT;
+ if (copy_from_user(&cgrp_id, (u64 __user *)(arg + 8),
+ sizeof(cgrp_id)))
+ return -EFAULT;
+
+ return perf_event_read_cgroup_node(event, read_size, cgrp_id,
+ (char __user *)(arg + 16));
+ }
+
default:
return -ENOTTY;
}
@@ -5583,10 +5965,39 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
return 0;
}
+static void perf_sched_enable(void)
+{
+ /*
+ * We need the mutex here because static_branch_enable()
+ * must complete *before* the perf_sched_count increment
+ * becomes visible.
+ */
+ if (atomic_inc_not_zero(&perf_sched_count))
+ return;
+
+ mutex_lock(&perf_sched_mutex);
+ if (!atomic_read(&perf_sched_count)) {
+ static_branch_enable(&perf_sched_events);
+ /*
+ * Guarantee that all CPUs observe they key change and
+ * call the perf scheduling hooks before proceeding to
+ * install events that need them.
+ */
+ synchronize_rcu();
+ }
+ /*
+ * Now that we have waited for the sync_sched(), allow further
+ * increments to by-pass the mutex.
+ */
+ atomic_inc(&perf_sched_count);
+ mutex_unlock(&perf_sched_mutex);
+}
+
static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
struct perf_event *event = file->private_data;
struct perf_event_context *ctx;
+ bool do_sched_enable = false;
long ret;
/* Treat ioctl like writes as it is likely a mutating operation. */
@@ -5595,9 +6006,19 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
return ret;
ctx = perf_event_ctx_lock(event);
+ /* ATTACH_CGROUP requires context switch callback */
+ if (cmd == PERF_EVENT_IOC_ATTACH_CGROUP && !event_has_cgroup_node(event))
+ do_sched_enable = true;
ret = _perf_ioctl(event, cmd, arg);
perf_event_ctx_unlock(event, ctx);
+ /*
+ * Due to the circular lock dependency, it cannot call
+ * static_branch_enable() under the ctx->mutex.
+ */
+ if (do_sched_enable && ret >= 0)
+ perf_sched_enable();
+
return ret;
}
@@ -11240,33 +11661,8 @@ static void account_event(struct perf_event *event)
if (event->attr.text_poke)
atomic_inc(&nr_text_poke_events);
- if (inc) {
- /*
- * We need the mutex here because static_branch_enable()
- * must complete *before* the perf_sched_count increment
- * becomes visible.
- */
- if (atomic_inc_not_zero(&perf_sched_count))
- goto enabled;
-
- mutex_lock(&perf_sched_mutex);
- if (!atomic_read(&perf_sched_count)) {
- static_branch_enable(&perf_sched_events);
- /*
- * Guarantee that all CPUs observe they key change and
- * call the perf scheduling hooks before proceeding to
- * install events that need them.
- */
- synchronize_rcu();
- }
- /*
- * Now that we have waited for the sync_sched(), allow further
- * increments to by-pass the mutex.
- */
- atomic_inc(&perf_sched_count);
- mutex_unlock(&perf_sched_mutex);
- }
-enabled:
+ if (inc)
+ perf_sched_enable();
account_event_cpu(event, event->cpu);
@@ -13008,6 +13404,7 @@ static void __init perf_event_init_all_cpus(void)
#ifdef CONFIG_CGROUP_PERF
INIT_LIST_HEAD(&per_cpu(cgrp_cpuctx_list, cpu));
+ INIT_LIST_HEAD(&per_cpu(cgroup_ctx_list, cpu));
#endif
INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
}
@@ -13218,6 +13615,28 @@ static int perf_cgroup_css_online(struct cgroup_subsys_state *css)
return 0;
}
+static int __perf_cgroup_update_node(void *info)
+{
+ struct task_struct *task = info;
+
+ rcu_read_lock();
+ cgroup_node_sched_out(task);
+ rcu_read_unlock();
+
+ return 0;
+}
+
+static int perf_cgroup_can_attach(struct cgroup_taskset *tset)
+{
+ struct task_struct *task;
+ struct cgroup_subsys_state *css;
+
+ cgroup_taskset_for_each(task, css, tset)
+ task_function_call(task, __perf_cgroup_update_node, task);
+
+ return 0;
+}
+
static int __perf_cgroup_move(void *info)
{
struct task_struct *task = info;
@@ -13240,6 +13659,7 @@ struct cgroup_subsys perf_event_cgrp_subsys = {
.css_alloc = perf_cgroup_css_alloc,
.css_free = perf_cgroup_css_free,
.css_online = perf_cgroup_css_online,
+ .can_attach = perf_cgroup_can_attach,
.attach = perf_cgroup_attach,
/*
* Implicitly enable on dfl hierarchy so that perf events can
--
2.31.0.rc2.261.g7f71774620-goog
> On Mar 23, 2021, at 9:21 AM, Namhyung Kim <[email protected]> wrote:
>
> As we can run many jobs (in container) on a big machine, we want to
> measure each job's performance during the run. To do that, the
> perf_event can be associated to a cgroup to measure it only.
>
> However such cgroup events need to be opened separately and it causes
> significant overhead in event multiplexing during the context switch
> as well as resource consumption like in file descriptors and memory
> footprint.
>
> As a cgroup event is basically a cpu event, we can share a single cpu
> event for multiple cgroups. All we need is a separate counter (and
> two timing variables) for each cgroup. I added a hash table to map
> from cgroup id to the attached cgroups.
>
> With this change, the cpu event needs to calculate a delta of event
> counter values when the cgroups of current and the next task are
> different. And it attributes the delta to the current task's cgroup.
>
> This patch adds two new ioctl commands to perf_event for light-weight
> cgroup event counting (i.e. perf stat).
>
> * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> 64-bit array to attach given cgroups. The first element is a
> number of cgroups in the buffer, and the rest is a list of cgroup
> ids to add a cgroup info to the given event.
>
> * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> array to get the event counter values. The first element is size
> of the array in byte, and the second element is a cgroup id to
> read. The rest is to save the counter value and timings.
>
> This attaches all cgroups in a single syscall and I didn't add the
> DETACH command deliberately to make the implementation simple. The
> attached cgroup nodes would be deleted when the file descriptor of the
> perf_event is closed.
This is a very interesting idea!
Could you please add some description of the relationship between
perf_events and contexts? The code is a little confusing. For example,
why do we need cgroup_ctx_list?
Thanks,
Song
[...]
Hi Song,
On Wed, Mar 24, 2021 at 9:30 AM Song Liu <[email protected]> wrote:
>
>
>
> > On Mar 23, 2021, at 9:21 AM, Namhyung Kim <[email protected]> wrote:
> >
> > As we can run many jobs (in container) on a big machine, we want to
> > measure each job's performance during the run. To do that, the
> > perf_event can be associated to a cgroup to measure it only.
> >
> > However such cgroup events need to be opened separately and it causes
> > significant overhead in event multiplexing during the context switch
> > as well as resource consumption like in file descriptors and memory
> > footprint.
> >
> > As a cgroup event is basically a cpu event, we can share a single cpu
> > event for multiple cgroups. All we need is a separate counter (and
> > two timing variables) for each cgroup. I added a hash table to map
> > from cgroup id to the attached cgroups.
> >
> > With this change, the cpu event needs to calculate a delta of event
> > counter values when the cgroups of current and the next task are
> > different. And it attributes the delta to the current task's cgroup.
> >
> > This patch adds two new ioctl commands to perf_event for light-weight
> > cgroup event counting (i.e. perf stat).
> >
> > * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> > 64-bit array to attach given cgroups. The first element is a
> > number of cgroups in the buffer, and the rest is a list of cgroup
> > ids to add a cgroup info to the given event.
> >
> > * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> > array to get the event counter values. The first element is size
> > of the array in byte, and the second element is a cgroup id to
> > read. The rest is to save the counter value and timings.
> >
> > This attaches all cgroups in a single syscall and I didn't add the
> > DETACH command deliberately to make the implementation simple. The
> > attached cgroup nodes would be deleted when the file descriptor of the
> > perf_event is closed.
>
> This is very interesting idea!
Thanks!
>
> Could you please add some description of the relationship among
> perf_event and contexts? The code is a little confusing. For example,
> why do we need cgroup_ctx_list?
Sure, a perf_event belongs to an event context (hw or sw, mostly) which
takes care of multiplexing, timing, locking and so on. So many of the
fields in the perf_event are protected by the context lock. A context has
a list of perf_events, and there are per-cpu contexts and per-task contexts.
The cgroup_ctx_list is used to traverse only the contexts (on that cpu)
that have perf_events with attached cgroups.
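In other words, the links added by the patch look roughly like this:

  per-cpu cgroup_ctx_list
    -> perf_event_context        (linked via ctx->cgrp_ctx_entry; only
                                  contexts that have such events)
      -> ctx->cgrp_node_list
        -> perf_event            (linked via event->cgrp_node_entry)
          -> event->cgrp_node_hash
            -> struct perf_cgroup_node   (one per attached cgroup id:
                                          count, time_enabled, time_running)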
Hope this makes it clear. Please let me know if you need more. :)
Thanks,
Namhyung
> On Mar 23, 2021, at 9:21 AM, Namhyung Kim <[email protected]> wrote:
>
> As we can run many jobs (in container) on a big machine, we want to
> measure each job's performance during the run. To do that, the
> perf_event can be associated to a cgroup to measure it only.
>
> However such cgroup events need to be opened separately and it causes
> significant overhead in event multiplexing during the context switch
> as well as resource consumption like in file descriptors and memory
> footprint.
>
> As a cgroup event is basically a cpu event, we can share a single cpu
> event for multiple cgroups. All we need is a separate counter (and
> two timing variables) for each cgroup. I added a hash table to map
> from cgroup id to the attached cgroups.
>
> With this change, the cpu event needs to calculate a delta of event
> counter values when the cgroups of current and the next task are
> different. And it attributes the delta to the current task's cgroup.
>
> This patch adds two new ioctl commands to perf_event for light-weight
> cgroup event counting (i.e. perf stat).
>
> * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> 64-bit array to attach given cgroups. The first element is a
> number of cgroups in the buffer, and the rest is a list of cgroup
> ids to add a cgroup info to the given event.
>
> * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> array to get the event counter values. The first element is size
> of the array in byte, and the second element is a cgroup id to
> read. The rest is to save the counter value and timings.
>
> This attaches all cgroups in a single syscall and I didn't add the
> DETACH command deliberately to make the implementation simple. The
> attached cgroup nodes would be deleted when the file descriptor of the
> perf_event is closed.
>
> Cc: Tejun Heo <[email protected]>
> Signed-off-by: Namhyung Kim <[email protected]>
> ---
> include/linux/perf_event.h | 22 ++
> include/uapi/linux/perf_event.h | 2 +
> kernel/events/core.c | 474 ++++++++++++++++++++++++++++++--
> 3 files changed, 471 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 3f7f89ea5e51..2760f3b07534 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -771,6 +771,18 @@ struct perf_event {
>
> #ifdef CONFIG_CGROUP_PERF
> struct perf_cgroup *cgrp; /* cgroup event is attach to */
> +
> + /* to share an event for multiple cgroups */
> + struct hlist_head *cgrp_node_hash;
> + struct perf_cgroup_node *cgrp_node_entries;
> + int nr_cgrp_nodes;
> + int cgrp_node_hash_bits;
> +
> + struct list_head cgrp_node_entry;
> +
> + u64 cgrp_node_count;
> + u64 cgrp_node_time_enabled;
> + u64 cgrp_node_time_running;
A comment saying the above values are from previous reading would be helpful.
> #endif
>
> #ifdef CONFIG_SECURITY
> @@ -780,6 +792,14 @@ struct perf_event {
> #endif /* CONFIG_PERF_EVENTS */
> };
>
> +struct perf_cgroup_node {
> + struct hlist_node node;
> + u64 id;
> + u64 count;
> + u64 time_enabled;
> + u64 time_running;
> + u64 padding[2];
Do we really need the padding? For cache line alignment?
> +};
>
> struct perf_event_groups {
> struct rb_root tree;
> @@ -843,6 +863,8 @@ struct perf_event_context {
> int pin_count;
> #ifdef CONFIG_CGROUP_PERF
> int nr_cgroups; /* cgroup evts */
> + struct list_head cgrp_node_list;
> + struct list_head cgrp_ctx_entry;
> #endif
> void *task_ctx_data; /* pmu specific data */
> struct rcu_head rcu_head;
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index ad15e40d7f5d..06bc7ab13616 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -479,6 +479,8 @@ struct perf_event_query_bpf {
> #define PERF_EVENT_IOC_PAUSE_OUTPUT _IOW('$', 9, __u32)
> #define PERF_EVENT_IOC_QUERY_BPF _IOWR('$', 10, struct perf_event_query_bpf *)
> #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES _IOW('$', 11, struct perf_event_attr *)
> +#define PERF_EVENT_IOC_ATTACH_CGROUP _IOW('$', 12, __u64 *)
> +#define PERF_EVENT_IOC_READ_CGROUP _IOWR('$', 13, __u64 *)
>
> enum perf_event_ioc_flags {
> PERF_IOC_FLAG_GROUP = 1U << 0,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index f07943183041..38c26a23418a 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -379,6 +379,7 @@ enum event_type_t {
> * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
> */
>
> +static void perf_sched_enable(void);
> static void perf_sched_delayed(struct work_struct *work);
> DEFINE_STATIC_KEY_FALSE(perf_sched_events);
> static DECLARE_DELAYED_WORK(perf_sched_work, perf_sched_delayed);
> @@ -2124,6 +2125,322 @@ static int perf_get_aux_event(struct perf_event *event,
> return 1;
> }
>
> +#ifdef CONFIG_CGROUP_PERF
> +static DEFINE_PER_CPU(struct list_head, cgroup_ctx_list);
> +
> +static bool event_can_attach_cgroup(struct perf_event *event)
> +{
> + if (is_sampling_event(event))
> + return false;
> + if (event->attach_state & PERF_ATTACH_TASK)
> + return false;
> + if (is_cgroup_event(event))
> + return false;
> +
> + return true;
> +}
> +
> +static bool event_has_cgroup_node(struct perf_event *event)
> +{
> + return event->nr_cgrp_nodes > 0;
> +}
> +
> +static struct perf_cgroup_node *
> +find_cgroup_node(struct perf_event *event, u64 cgrp_id)
> +{
> + struct perf_cgroup_node *cgrp_node;
> + int key = hash_64(cgrp_id, event->cgrp_node_hash_bits);
> +
> + hlist_for_each_entry(cgrp_node, &event->cgrp_node_hash[key], node) {
> + if (cgrp_node->id == cgrp_id)
> + return cgrp_node;
> + }
> +
> + return NULL;
> +}
> +
> +static void perf_update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
> +{
> + u64 delta_count, delta_time_enabled, delta_time_running;
> + int i;
> +
> + if (event->cgrp_node_count == 0)
Do you mean to use nr_cgrp_nodes above?
> + goto out;
> +
> + delta_count = local64_read(&event->count) - event->cgrp_node_count;
> + delta_time_enabled = event->total_time_enabled - event->cgrp_node_time_enabled;
> + delta_time_running = event->total_time_running - event->cgrp_node_time_running;
> +
> + /* account delta to all ancestor cgroups */
> + for (i = 0; i <= cgrp->level; i++) {
> + struct perf_cgroup_node *node;
> +
> + node = find_cgroup_node(event, cgrp->ancestor_ids[i]);
> + if (node) {
> + node->count += delta_count;
> + node->time_enabled += delta_time_enabled;
> + node->time_running += delta_time_running;
> + }
> + }
> +
> +out:
> + event->cgrp_node_count = local64_read(&event->count);
> + event->cgrp_node_time_enabled = event->total_time_enabled;
> + event->cgrp_node_time_running = event->total_time_running;
> +}
> +
> +static void update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
> +{
> + if (event->state == PERF_EVENT_STATE_ACTIVE)
> + event->pmu->read(event);
> +
> + perf_event_update_time(event);
> + perf_update_cgroup_node(event, cgrp);
> +}
> +
> +/* this is called from context switch */
> +static void update_cgroup_node_events(struct perf_event_context *ctx,
> + struct cgroup *cgrp)
> +{
> + struct perf_event *event;
> +
> + lockdep_assert_held(&ctx->lock);
> +
> + if (ctx->is_active & EVENT_TIME)
> + update_context_time(ctx);
> +
> + list_for_each_entry(event, &ctx->cgrp_node_list, cgrp_node_entry)
> + update_cgroup_node(event, cgrp);
> +}
> +
> +static void cgroup_node_sched_out(struct task_struct *task)
> +{
> + struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
> + struct perf_cgroup *cgrp = perf_cgroup_from_task(task, NULL);
> + struct perf_event_context *ctx;
> +
> + list_for_each_entry(ctx, cgrp_ctx_list, cgrp_ctx_entry) {
> + raw_spin_lock(&ctx->lock);
> + update_cgroup_node_events(ctx, cgrp->css.cgroup);
> + raw_spin_unlock(&ctx->lock);
> + }
> +}
> +
> +/* this is called from the when event is enabled/disabled */
I don't think we call this when the event is disabled.
> +static void perf_add_cgrp_node_list(struct perf_event *event,
> + struct perf_event_context *ctx)
> +{
> + struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
> + struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
> + bool is_first;
> +
> + lockdep_assert_irqs_disabled();
> + lockdep_assert_held(&ctx->lock);
> +
> + is_first = list_empty(&ctx->cgrp_node_list);
> + list_add_tail(&event->cgrp_node_entry, &ctx->cgrp_node_list);
> +
> + if (is_first)
> + list_add_tail(&ctx->cgrp_ctx_entry, cgrp_ctx_list);
> +
> + update_cgroup_node(event, cgrp->css.cgroup);
Will this add some readings before PERF_EVENT_IOC_ATTACH_CGROUP to the counters?
> +
> }
> +
> +static void perf_del_cgrp_node_list(struct perf_event *event,
> + struct perf_event_context *ctx)
> +{
> + struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
> +
> + lockdep_assert_irqs_disabled();
> + lockdep_assert_held(&ctx->lock);
> +
> + update_cgroup_node(event, cgrp->css.cgroup);
> + /* to refresh delta when it's enabled */
> + event->cgrp_node_count = 0;
> +
> + list_del(&event->cgrp_node_entry);
> +
> + if (list_empty(&ctx->cgrp_node_list))
> + list_del(&ctx->cgrp_ctx_entry);
> +}
> +
[...]
> +
> +/* this is called from ioctl */
> +static int perf_event_attach_cgroup_node(struct perf_event *event, u64 nr_cgrps,
> + u64 *cgroup_ids)
> +{
> + struct perf_cgroup_node *cgrp_node;
> + struct perf_event_context *ctx = event->ctx;
> + struct hlist_head *cgrp_node_hash;
> + int node = (event->cpu >= 0) ? cpu_to_node(event->cpu) : -1;
> + unsigned long flags;
> + bool is_first = true;
> + bool enabled;
> + int i, nr_hash;
> + int hash_bits;
> +
> + if (nr_cgrps < MIN_CGRP_NODE_HASH)
> + nr_hash = MIN_CGRP_NODE_HASH;
> + else
> + nr_hash = roundup_pow_of_two(nr_cgrps);
> + hash_bits = ilog2(nr_hash);
> +
> + cgrp_node_hash = kcalloc_node(nr_hash, sizeof(*cgrp_node_hash),
> + GFP_KERNEL, node);
> + if (cgrp_node_hash == NULL)
> + return -ENOMEM;
> +
> + cgrp_node = kcalloc_node(nr_cgrps, sizeof(*cgrp_node), GFP_KERNEL, node);
> + if (cgrp_node == NULL) {
> + kfree(cgrp_node_hash);
> + return -ENOMEM;
> + }
> +
> + for (i = 0; i < (int)nr_cgrps; i++) {
> + int key = hash_64(cgroup_ids[i], hash_bits);
> +
> + cgrp_node[i].id = cgroup_ids[i];
> + hlist_add_head(&cgrp_node[i].node, &cgrp_node_hash[key]);
> + }
> +
> + raw_spin_lock_irqsave(&ctx->lock, flags);
> +
> + enabled = event->state >= PERF_EVENT_STATE_INACTIVE;
> +
> + if (event->nr_cgrp_nodes != 0) {
> + kfree(event->cgrp_node_hash);
> + kfree(event->cgrp_node_entries);
> + is_first = false;
> + }
To add another cgroup to the list, we use PERF_EVENT_IOC_ATTACH_CGROUP to
pass the whole list again. So we may lose some readings during this, right?
> +
> + event->cgrp_node_hash = cgrp_node_hash;
> + event->cgrp_node_entries = cgrp_node;
> + event->cgrp_node_hash_bits = hash_bits;
> + event->nr_cgrp_nodes = nr_cgrps;
> +
> + raw_spin_unlock_irqrestore(&ctx->lock, flags);
> +
> + if (is_first && enabled)
> + event_function_call(event, perf_attach_cgroup_node, NULL);
> +
> + return 0;
> +}
[...]
>
Hi Song,
Thanks for your review!
On Thu, Mar 25, 2021 at 9:56 AM Song Liu <[email protected]> wrote:
> > On Mar 23, 2021, at 9:21 AM, Namhyung Kim <[email protected]> wrote:
> >
> > As we can run many jobs (in container) on a big machine, we want to
> > measure each job's performance during the run. To do that, the
> > perf_event can be associated to a cgroup to measure it only.
> >
> > However such cgroup events need to be opened separately and it causes
> > significant overhead in event multiplexing during the context switch
> > as well as resource consumption like in file descriptors and memory
> > footprint.
> >
> > As a cgroup event is basically a cpu event, we can share a single cpu
> > event for multiple cgroups. All we need is a separate counter (and
> > two timing variables) for each cgroup. I added a hash table to map
> > from cgroup id to the attached cgroups.
> >
> > With this change, the cpu event needs to calculate a delta of event
> > counter values when the cgroups of current and the next task are
> > different. And it attributes the delta to the current task's cgroup.
> >
> > This patch adds two new ioctl commands to perf_event for light-weight
> > cgroup event counting (i.e. perf stat).
> >
> > * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> > 64-bit array to attach given cgroups. The first element is a
> > number of cgroups in the buffer, and the rest is a list of cgroup
> > ids to add a cgroup info to the given event.
> >
> > * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> > array to get the event counter values. The first element is size
> > of the array in byte, and the second element is a cgroup id to
> > read. The rest is to save the counter value and timings.
> >
> > This attaches all cgroups in a single syscall and I didn't add the
> > DETACH command deliberately to make the implementation simple. The
> > attached cgroup nodes would be deleted when the file descriptor of the
> > perf_event is closed.
> >
> > Cc: Tejun Heo <[email protected]>
> > Signed-off-by: Namhyung Kim <[email protected]>
> > ---
> > include/linux/perf_event.h | 22 ++
> > include/uapi/linux/perf_event.h | 2 +
> > kernel/events/core.c | 474 ++++++++++++++++++++++++++++++--
> > 3 files changed, 471 insertions(+), 27 deletions(-)
> >
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index 3f7f89ea5e51..2760f3b07534 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -771,6 +771,18 @@ struct perf_event {
> >
> > #ifdef CONFIG_CGROUP_PERF
> > struct perf_cgroup *cgrp; /* cgroup event is attach to */
> > +
> > + /* to share an event for multiple cgroups */
> > + struct hlist_head *cgrp_node_hash;
> > + struct perf_cgroup_node *cgrp_node_entries;
> > + int nr_cgrp_nodes;
> > + int cgrp_node_hash_bits;
> > +
> > + struct list_head cgrp_node_entry;
> > +
> > + u64 cgrp_node_count;
> > + u64 cgrp_node_time_enabled;
> > + u64 cgrp_node_time_running;
>
> A comment saying the above values are from previous reading would be helpful.
Sure, will add.
>
> > #endif
> >
> > #ifdef CONFIG_SECURITY
> > @@ -780,6 +792,14 @@ struct perf_event {
> > #endif /* CONFIG_PERF_EVENTS */
> > };
> >
> > +struct perf_cgroup_node {
> > + struct hlist_node node;
> > + u64 id;
> > + u64 count;
> > + u64 time_enabled;
> > + u64 time_running;
> > + u64 padding[2];
>
> Do we really need the padding? For cache line alignment?
Yeah I was thinking about it. It seems I need to use the
____cacheline_aligned macro instead.
>
> > +};
> >
> > struct perf_event_groups {
> > struct rb_root tree;
> > @@ -843,6 +863,8 @@ struct perf_event_context {
> > int pin_count;
> > #ifdef CONFIG_CGROUP_PERF
> > int nr_cgroups; /* cgroup evts */
> > + struct list_head cgrp_node_list;
> > + struct list_head cgrp_ctx_entry;
> > #endif
> > void *task_ctx_data; /* pmu specific data */
> > struct rcu_head rcu_head;
> > diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> > index ad15e40d7f5d..06bc7ab13616 100644
> > --- a/include/uapi/linux/perf_event.h
> > +++ b/include/uapi/linux/perf_event.h
> > @@ -479,6 +479,8 @@ struct perf_event_query_bpf {
> > #define PERF_EVENT_IOC_PAUSE_OUTPUT _IOW('$', 9, __u32)
> > #define PERF_EVENT_IOC_QUERY_BPF _IOWR('$', 10, struct perf_event_query_bpf *)
> > #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES _IOW('$', 11, struct perf_event_attr *)
> > +#define PERF_EVENT_IOC_ATTACH_CGROUP _IOW('$', 12, __u64 *)
> > +#define PERF_EVENT_IOC_READ_CGROUP _IOWR('$', 13, __u64 *)
> >
> > enum perf_event_ioc_flags {
> > PERF_IOC_FLAG_GROUP = 1U << 0,
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index f07943183041..38c26a23418a 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -379,6 +379,7 @@ enum event_type_t {
> > * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
> > */
> >
> > +static void perf_sched_enable(void);
> > static void perf_sched_delayed(struct work_struct *work);
> > DEFINE_STATIC_KEY_FALSE(perf_sched_events);
> > static DECLARE_DELAYED_WORK(perf_sched_work, perf_sched_delayed);
> > @@ -2124,6 +2125,322 @@ static int perf_get_aux_event(struct perf_event *event,
> > return 1;
> > }
> >
> > +#ifdef CONFIG_CGROUP_PERF
> > +static DEFINE_PER_CPU(struct list_head, cgroup_ctx_list);
> > +
> > +static bool event_can_attach_cgroup(struct perf_event *event)
> > +{
> > + if (is_sampling_event(event))
> > + return false;
> > + if (event->attach_state & PERF_ATTACH_TASK)
> > + return false;
> > + if (is_cgroup_event(event))
> > + return false;
> > +
> > + return true;
> > +}
> > +
> > +static bool event_has_cgroup_node(struct perf_event *event)
> > +{
> > + return event->nr_cgrp_nodes > 0;
> > +}
> > +
> > +static struct perf_cgroup_node *
> > +find_cgroup_node(struct perf_event *event, u64 cgrp_id)
> > +{
> > + struct perf_cgroup_node *cgrp_node;
> > + int key = hash_64(cgrp_id, event->cgrp_node_hash_bits);
> > +
> > + hlist_for_each_entry(cgrp_node, &event->cgrp_node_hash[key], node) {
> > + if (cgrp_node->id == cgrp_id)
> > + return cgrp_node;
> > + }
> > +
> > + return NULL;
> > +}
> > +
> > +static void perf_update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
> > +{
> > + u64 delta_count, delta_time_enabled, delta_time_running;
> > + int i;
> > +
> > + if (event->cgrp_node_count == 0)
>
> Do you mean to use nr_cgrp_nodes above?
No, this is to calculate the delta, so it needs to be set first.
If it's the first call, it just updates the count and time and
skips the delta accounting.
>
> > + goto out;
> > +
> > + delta_count = local64_read(&event->count) - event->cgrp_node_count;
> > + delta_time_enabled = event->total_time_enabled - event->cgrp_node_time_enabled;
> > + delta_time_running = event->total_time_running - event->cgrp_node_time_running;
> > +
> > + /* account delta to all ancestor cgroups */
> > + for (i = 0; i <= cgrp->level; i++) {
> > + struct perf_cgroup_node *node;
> > +
> > + node = find_cgroup_node(event, cgrp->ancestor_ids[i]);
> > + if (node) {
> > + node->count += delta_count;
> > + node->time_enabled += delta_time_enabled;
> > + node->time_running += delta_time_running;
> > + }
> > + }
> > +
> > +out:
> > + event->cgrp_node_count = local64_read(&event->count);
> > + event->cgrp_node_time_enabled = event->total_time_enabled;
> > + event->cgrp_node_time_running = event->total_time_running;
> > +}
> > +
> > +static void update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
> > +{
> > + if (event->state == PERF_EVENT_STATE_ACTIVE)
> > + event->pmu->read(event);
> > +
> > + perf_event_update_time(event);
> > + perf_update_cgroup_node(event, cgrp);
> > +}
> > +
> > +/* this is called from context switch */
> > +static void update_cgroup_node_events(struct perf_event_context *ctx,
> > + struct cgroup *cgrp)
> > +{
> > + struct perf_event *event;
> > +
> > + lockdep_assert_held(&ctx->lock);
> > +
> > + if (ctx->is_active & EVENT_TIME)
> > + update_context_time(ctx);
> > +
> > + list_for_each_entry(event, &ctx->cgrp_node_list, cgrp_node_entry)
> > + update_cgroup_node(event, cgrp);
> > +}
> > +
> > +static void cgroup_node_sched_out(struct task_struct *task)
> > +{
> > + struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
> > + struct perf_cgroup *cgrp = perf_cgroup_from_task(task, NULL);
> > + struct perf_event_context *ctx;
> > +
> > + list_for_each_entry(ctx, cgrp_ctx_list, cgrp_ctx_entry) {
> > + raw_spin_lock(&ctx->lock);
> > + update_cgroup_node_events(ctx, cgrp->css.cgroup);
> > + raw_spin_unlock(&ctx->lock);
> > + }
> > +}
> > +
> > +/* this is called from the when event is enabled/disabled */
>
> I don't think we call this when the event is disabled.
Oh, sorry. I meant 'add' for enable, 'del' for disable.
Maybe I can change it to 'these are called from ...'.
>
> > +static void perf_add_cgrp_node_list(struct perf_event *event,
> > + struct perf_event_context *ctx)
> > +{
> > + struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
> > + struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
> > + bool is_first;
> > +
> > + lockdep_assert_irqs_disabled();
> > + lockdep_assert_held(&ctx->lock);
> > +
> > + is_first = list_empty(&ctx->cgrp_node_list);
> > + list_add_tail(&event->cgrp_node_entry, &ctx->cgrp_node_list);
> > +
> > + if (is_first)
> > + list_add_tail(&ctx->cgrp_ctx_entry, cgrp_ctx_list);
> > +
> > + update_cgroup_node(event, cgrp->css.cgroup);
>
> Will this add some readings before PERF_EVENT_IOC_ATTACH_CGROUP to the counters?
At this moment, the event is just enabled so the cgrp_node_count
is 0 like I said above. So it'll update the timestamp and count in
the event but won't update the cgroup nodes.
>
> > +
> > }
> > +
> > +static void perf_del_cgrp_node_list(struct perf_event *event,
> > + struct perf_event_context *ctx)
> > +{
> > + struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
> > +
> > + lockdep_assert_irqs_disabled();
> > + lockdep_assert_held(&ctx->lock);
> > +
> > + update_cgroup_node(event, cgrp->css.cgroup);
> > + /* to refresh delta when it's enabled */
> > + event->cgrp_node_count = 0;
> > +
> > + list_del(&event->cgrp_node_entry);
> > +
> > + if (list_empty(&ctx->cgrp_node_list))
> > + list_del(&ctx->cgrp_ctx_entry);
> > +}
> > +
> [...]
> > +
> > +/* this is called from ioctl */
> > +static int perf_event_attach_cgroup_node(struct perf_event *event, u64 nr_cgrps,
> > + u64 *cgroup_ids)
> > +{
> > + struct perf_cgroup_node *cgrp_node;
> > + struct perf_event_context *ctx = event->ctx;
> > + struct hlist_head *cgrp_node_hash;
> > + int node = (event->cpu >= 0) ? cpu_to_node(event->cpu) : -1;
> > + unsigned long flags;
> > + bool is_first = true;
> > + bool enabled;
> > + int i, nr_hash;
> > + int hash_bits;
> > +
> > + if (nr_cgrps < MIN_CGRP_NODE_HASH)
> > + nr_hash = MIN_CGRP_NODE_HASH;
> > + else
> > + nr_hash = roundup_pow_of_two(nr_cgrps);
> > + hash_bits = ilog2(nr_hash);
> > +
> > + cgrp_node_hash = kcalloc_node(nr_hash, sizeof(*cgrp_node_hash),
> > + GFP_KERNEL, node);
> > + if (cgrp_node_hash == NULL)
> > + return -ENOMEM;
> > +
> > + cgrp_node = kcalloc_node(nr_cgrps, sizeof(*cgrp_node), GFP_KERNEL, node);
> > + if (cgrp_node == NULL) {
> > + kfree(cgrp_node_hash);
> > + return -ENOMEM;
> > + }
> > +
> > + for (i = 0; i < (int)nr_cgrps; i++) {
> > + int key = hash_64(cgroup_ids[i], hash_bits);
> > +
> > + cgrp_node[i].id = cgroup_ids[i];
> > + hlist_add_head(&cgrp_node[i].node, &cgrp_node_hash[key]);
> > + }
> > +
> > + raw_spin_lock_irqsave(&ctx->lock, flags);
> > +
> > + enabled = event->state >= PERF_EVENT_STATE_INACTIVE;
> > +
> > + if (event->nr_cgrp_nodes != 0) {
> > + kfree(event->cgrp_node_hash);
> > + kfree(event->cgrp_node_entries);
> > + is_first = false;
> > + }
>
> To add another cgroup to the list, we use PERF_EVENT_IOC_ATTACH_CGROUP to
> do the whole list. So we may lost some readings during this, right?
So the basic use case is perf stat, which can have a list of all the
cgroups to measure when it calls the ioctl. Then it creates all the
nodes in the table at once and sets it.
If someone wants to measure more cgroups, [s]he can call the ioctl
again with the updated cgroup list (original + new).
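For example (a hypothetical sketch, the ids are made up): if id_a and
id_b are already attached and id_c should be added, the whole list is
passed again:

  uint64_t buf[4] = { 3, id_a, id_b, id_c };  /* count + full id list */

  ioctl(fd, PERF_EVENT_IOC_ATTACH_CGROUP, buf);

The kernel then rebuilds the node table from this list, as in the hunk
quoted above.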
Thanks,
Namhyung
>
> > +
> > + event->cgrp_node_hash = cgrp_node_hash;
> > + event->cgrp_node_entries = cgrp_node;
> > + event->cgrp_node_hash_bits = hash_bits;
> > + event->nr_cgrp_nodes = nr_cgrps;
> > +
> > + raw_spin_unlock_irqrestore(&ctx->lock, flags);
> > +
> > + if (is_first && enabled)
> > + event_function_call(event, perf_attach_cgroup_node, NULL);
> > +
> > + return 0;
> > +}
>
> [...]
> >
>
On Thu, Mar 25, 2021 at 12:55:50AM +0000, Song Liu wrote:
> > On Mar 23, 2021, at 9:21 AM, Namhyung Kim <[email protected]> wrote:
> > #ifdef CONFIG_SECURITY
> > @@ -780,6 +792,14 @@ struct perf_event {
> > #endif /* CONFIG_PERF_EVENTS */
> > };
> > +struct perf_cgroup_node {
> > + struct hlist_node node;
> > + u64 id;
> > + u64 count;
> > + u64 time_enabled;
> > + u64 time_running;
> > + u64 padding[2];
>
> Do we really need the padding? For cache line alignment?
I guess so, to get it to 64 bytes, then having it as:
struct perf_cgroup_node {
struct hlist_node node;
u64 id;
u64 count;
u64 time_enabled;
u64 time_running;
} ____cacheline_aligned;
Seems better :-)
Testing:
[acme@five c]$ cat cacheline_aligned.c
#ifndef ____cacheline_aligned
#define ____cacheline_aligned __attribute__((__aligned__(SMP_CACHE_BYTES)))
#endif
// from ../build/v5.12.0-rc4+/include/generated/autoconf.h
#define CONFIG_X86_L1_CACHE_SHIFT 6
#define L1_CACHE_SHIFT (CONFIG_X86_L1_CACHE_SHIFT)
#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
#ifndef SMP_CACHE_BYTES
#define SMP_CACHE_BYTES L1_CACHE_BYTES
#endif
typedef long long unsigned int u64;
struct hlist_node {
struct hlist_node * next; /* 0 8 */
struct hlist_node * * pprev; /* 8 8 */
/* size: 16, cachelines: 1, members: 2 */
/* last cacheline: 16 bytes */
};
struct perf_cgroup_node {
struct hlist_node node;
u64 id;
u64 count;
u64 time_enabled;
u64 time_running;
} ____cacheline_aligned foo;
[acme@five c]$ cc -g -c -o cacheline_aligned.o cacheline_aligned.c
[acme@five c]$ pahole cacheline_aligned.o
struct hlist_node {
struct hlist_node * next; /* 0 8 */
struct hlist_node * * pprev; /* 8 8 */
/* size: 16, cachelines: 1, members: 2 */
/* last cacheline: 16 bytes */
};
struct perf_cgroup_node {
struct hlist_node node; /* 0 16 */
u64 id; /* 16 8 */
u64 count; /* 24 8 */
u64 time_enabled; /* 32 8 */
u64 time_running; /* 40 8 */
/* size: 64, cachelines: 1, members: 5 */
/* padding: 16 */
} __attribute__((__aligned__(64)));
[acme@five c]$
- Arnaldo
> On Mar 23, 2021, at 9:21 AM, Namhyung Kim <[email protected]> wrote:
>
> As we can run many jobs (in container) on a big machine, we want to
> measure each job's performance during the run. To do that, the
> perf_event can be associated to a cgroup to measure it only.
>
> However such cgroup events need to be opened separately and it causes
> significant overhead in event multiplexing during the context switch
> as well as resource consumption like in file descriptors and memory
> footprint.
>
> As a cgroup event is basically a cpu event, we can share a single cpu
> event for multiple cgroups. All we need is a separate counter (and
> two timing variables) for each cgroup. I added a hash table to map
> from cgroup id to the attached cgroups.
>
> With this change, the cpu event needs to calculate a delta of event
> counter values when the cgroups of current and the next task are
> different. And it attributes the delta to the current task's cgroup.
>
> This patch adds two new ioctl commands to perf_event for light-weight
> cgroup event counting (i.e. perf stat).
>
> * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> 64-bit array to attach given cgroups. The first element is a
> number of cgroups in the buffer, and the rest is a list of cgroup
> ids to add a cgroup info to the given event.
>
> * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> array to get the event counter values. The first element is size
> of the array in byte, and the second element is a cgroup id to
> read. The rest is to save the counter value and timings.
>
> This attaches all cgroups in a single syscall and I didn't add the
> DETACH command deliberately to make the implementation simple. The
> attached cgroup nodes would be deleted when the file descriptor of the
> perf_event is closed.
>
> Cc: Tejun Heo <[email protected]>
> Signed-off-by: Namhyung Kim <[email protected]>
> ---
> include/linux/perf_event.h | 22 ++
> include/uapi/linux/perf_event.h | 2 +
> kernel/events/core.c | 474 ++++++++++++++++++++++++++++++--
> 3 files changed, 471 insertions(+), 27 deletions(-)
[...]
> @@ -4461,6 +4787,8 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
> INIT_LIST_HEAD(&ctx->event_list);
> INIT_LIST_HEAD(&ctx->pinned_active);
> INIT_LIST_HEAD(&ctx->flexible_active);
> + INIT_LIST_HEAD(&ctx->cgrp_ctx_entry);
> + INIT_LIST_HEAD(&ctx->cgrp_node_list);
I guess we need #ifdef CONFIG_CGROUP_PERF here?
> refcount_set(&ctx->refcount, 1);
> }
>
> @@ -4851,6 +5179,8 @@ static void _free_event(struct perf_event *event)
> if (is_cgroup_event(event))
> perf_detach_cgroup(event);
>
> + perf_event_destroy_cgroup_nodes(event);
> +
> if (!event->parent) {
> if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
> put_callchain_buffers();
[...]
> +static void perf_sched_enable(void)
> +{
> + /*
> + * We need the mutex here because static_branch_enable()
> + * must complete *before* the perf_sched_count increment
> + * becomes visible.
> + */
> + if (atomic_inc_not_zero(&perf_sched_count))
> + return;
Why don't we use perf_cgroup_events for the new use case?
> +
> + mutex_lock(&perf_sched_mutex);
> + if (!atomic_read(&perf_sched_count)) {
> + static_branch_enable(&perf_sched_events);
> + /*
> + * Guarantee that all CPUs observe they key change and
> + * call the perf scheduling hooks before proceeding to
> + * install events that need them.
> + */
> + synchronize_rcu();
> + }
> + /*
> + * Now that we have waited for the sync_sched(), allow further
> + * increments to by-pass the mutex.
> + */
> + atomic_inc(&perf_sched_count);
> + mutex_unlock(&perf_sched_mutex);
> +}
> +
> static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> {
> struct perf_event *event = file->private_data;
> struct perf_event_context *ctx;
> + bool do_sched_enable = false;
> long ret;
>
> /* Treat ioctl like writes as it is likely a mutating operation. */
> @@ -5595,9 +6006,19 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> return ret;
>
> ctx = perf_event_ctx_lock(event);
> + /* ATTACH_CGROUP requires context switch callback */
> + if (cmd == PERF_EVENT_IOC_ATTACH_CGROUP && !event_has_cgroup_node(event))
> + do_sched_enable = true;
> ret = _perf_ioctl(event, cmd, arg);
> perf_event_ctx_unlock(event, ctx);
>
> + /*
> + * Due to the circular lock dependency, it cannot call
> + * static_branch_enable() under the ctx->mutex.
> + */
> + if (do_sched_enable && ret >= 0)
> + perf_sched_enable();
> +
> return ret;
> }
>
> @@ -11240,33 +11661,8 @@ static void account_event(struct perf_event *event)
> if (event->attr.text_poke)
> atomic_inc(&nr_text_poke_events);
>
> - if (inc) {
> - /*
> - * We need the mutex here because static_branch_enable()
> - * must complete *before* the perf_sched_count increment
> - * becomes visible.
> - */
> - if (atomic_inc_not_zero(&perf_sched_count))
> - goto enabled;
> -
> - mutex_lock(&perf_sched_mutex);
> - if (!atomic_read(&perf_sched_count)) {
> - static_branch_enable(&perf_sched_events);
> - /*
> - * Guarantee that all CPUs observe they key change and
> - * call the perf scheduling hooks before proceeding to
> - * install events that need them.
> - */
> - synchronize_rcu();
> - }
> - /*
> - * Now that we have waited for the sync_sched(), allow further
> - * increments to by-pass the mutex.
> - */
> - atomic_inc(&perf_sched_count);
> - mutex_unlock(&perf_sched_mutex);
> - }
> -enabled:
> + if (inc)
> + perf_sched_enable();
>
> account_event_cpu(event, event->cpu);
>
> @@ -13008,6 +13404,7 @@ static void __init perf_event_init_all_cpus(void)
>
> #ifdef CONFIG_CGROUP_PERF
> INIT_LIST_HEAD(&per_cpu(cgrp_cpuctx_list, cpu));
> + INIT_LIST_HEAD(&per_cpu(cgroup_ctx_list, cpu));
> #endif
> INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
> }
> @@ -13218,6 +13615,28 @@ static int perf_cgroup_css_online(struct cgroup_subsys_state *css)
> return 0;
> }
>
> +static int __perf_cgroup_update_node(void *info)
> +{
> + struct task_struct *task = info;
> +
> + rcu_read_lock();
> + cgroup_node_sched_out(task);
> + rcu_read_unlock();
> +
> + return 0;
> +}
> +
> +static int perf_cgroup_can_attach(struct cgroup_taskset *tset)
> +{
> + struct task_struct *task;
> + struct cgroup_subsys_state *css;
> +
> + cgroup_taskset_for_each(task, css, tset)
> + task_function_call(task, __perf_cgroup_update_node, task);
> +
> + return 0;
> +}
Could you please explain why we need this logic in can_attach?
Thanks,
Song
> On Mar 23, 2021, at 9:21 AM, Namhyung Kim <[email protected]> wrote:
>
> This enables reading event group's counter values together with a
> PERF_EVENT_IOC_READ_CGROUP command like we do in the regular read().
> Users should give a correct size of buffer to be read.
>
> Signed-off-by: Namhyung Kim <[email protected]>
> ---
> kernel/events/core.c | 119 +++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 116 insertions(+), 3 deletions(-)
>
[...]
> +}
> +
> +static int perf_event_read_cgrp_node_group(struct perf_event *event, u64 cgrp_id,
> + char __user *buf)
> +{
> + struct perf_cgroup_node *cgrp;
> + struct perf_event_context *ctx = event->ctx;
> + struct perf_event *sibling;
> + u64 read_format = event->attr.read_format;
> + unsigned long flags;
> + u64 *values;
> + int n = 1;
> + int ret;
> +
> + values = kzalloc(event->read_size, GFP_KERNEL);
> + if (!values)
> + return -ENOMEM;
> +
> + values[0] = 1 + event->nr_siblings;
> +
> + /* update event count and times (possibly run on other cpu) */
> + (void)perf_event_read(event, true);
> +
> + raw_spin_lock_irqsave(&ctx->lock, flags);
> +
> + cgrp = find_cgroup_node(event, cgrp_id);
> + if (cgrp == NULL) {
> + raw_spin_unlock_irqrestore(&ctx->lock, flags);
> + kfree(values);
> + return -ENOENT;
> + }
> +
> + if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
> + values[n++] = cgrp->time_enabled;
> + if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
> + values[n++] = cgrp->time_running;
> +
> + values[n++] = cgrp->count;
> + if (read_format & PERF_FORMAT_ID)
> + values[n++] = primary_event_id(event);
> +
> + for_each_sibling_event(sibling, event) {
> + n += perf_event_read_cgrp_node_sibling(sibling, read_format,
> + cgrp_id, &values[n]);
> + }
> +
> + raw_spin_unlock_irqrestore(&ctx->lock, flags);
> +
> + ret = copy_to_user(buf, values, n * sizeof(u64));
> + kfree(values);
> + if (ret)
> + return -EFAULT;
> +
> + return n * sizeof(u64);
> +}
> +
> +static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
> + u64 cgrp_id, char __user *buf)
> +{
> + u64 read_format = event->attr.read_format;
> +
> + if (read_size < event->read_size + 2 * sizeof(u64))
Why do we need read_size + 2 u64 here?
Thanks,
Song
[...]
On Mon, Mar 29, 2021 at 2:17 AM Song Liu <[email protected]> wrote:
> > On Mar 23, 2021, at 9:21 AM, Namhyung Kim <[email protected]> wrote:
> >
> > As we can run many jobs (in container) on a big machine, we want to
> > measure each job's performance during the run. To do that, the
> > perf_event can be associated to a cgroup to measure it only.
> >
> > However such cgroup events need to be opened separately and it causes
> > significant overhead in event multiplexing during the context switch
> > as well as resource consumption like in file descriptors and memory
> > footprint.
> >
> > As a cgroup event is basically a cpu event, we can share a single cpu
> > event for multiple cgroups. All we need is a separate counter (and
> > two timing variables) for each cgroup. I added a hash table to map
> > from cgroup id to the attached cgroups.
> >
> > With this change, the cpu event needs to calculate a delta of event
> > counter values when the cgroups of current and the next task are
> > different. And it attributes the delta to the current task's cgroup.
> >
> > This patch adds two new ioctl commands to perf_event for light-weight
> > cgroup event counting (i.e. perf stat).
> >
> > * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> > 64-bit array to attach given cgroups. The first element is a
> > number of cgroups in the buffer, and the rest is a list of cgroup
> > ids to add a cgroup info to the given event.
> >
> > * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> > array to get the event counter values. The first element is size
> > of the array in byte, and the second element is a cgroup id to
> > read. The rest is to save the counter value and timings.
> >
> > This attaches all cgroups in a single syscall and I didn't add the
> > DETACH command deliberately to make the implementation simple. The
> > attached cgroup nodes would be deleted when the file descriptor of the
> > perf_event is closed.
> >
> > Cc: Tejun Heo <[email protected]>
> > Signed-off-by: Namhyung Kim <[email protected]>
> > ---
> > include/linux/perf_event.h | 22 ++
> > include/uapi/linux/perf_event.h | 2 +
> > kernel/events/core.c | 474 ++++++++++++++++++++++++++++++--
> > 3 files changed, 471 insertions(+), 27 deletions(-)
>
> [...]
>
> > @@ -4461,6 +4787,8 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
> > INIT_LIST_HEAD(&ctx->event_list);
> > INIT_LIST_HEAD(&ctx->pinned_active);
> > INIT_LIST_HEAD(&ctx->flexible_active);
> > + INIT_LIST_HEAD(&ctx->cgrp_ctx_entry);
> > + INIT_LIST_HEAD(&ctx->cgrp_node_list);
>
> I guess we need ifdef CONFIG_CGROUP_PERF here?
Correct. Thanks for pointing that out.
>
> > refcount_set(&ctx->refcount, 1);
> > }
> >
> > @@ -4851,6 +5179,8 @@ static void _free_event(struct perf_event *event)
> > if (is_cgroup_event(event))
> > perf_detach_cgroup(event);
> >
> > + perf_event_destroy_cgroup_nodes(event);
> > +
> > if (!event->parent) {
> > if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
> > put_callchain_buffers();
>
> [...]
>
> > +static void perf_sched_enable(void)
> > +{
> > + /*
> > + * We need the mutex here because static_branch_enable()
> > + * must complete *before* the perf_sched_count increment
> > + * becomes visible.
> > + */
> > + if (atomic_inc_not_zero(&perf_sched_count))
> > + return;
>
> Why don't we use perf_cgroup_events for the new use case?
Maybe. The two methods are mutually exclusive, and I think
this one will be preferred in the future due to its lower overhead.
Also, I'd like to keep it separate from the existing code to avoid
possible confusion.
As for perf_sched_enable(), the difference between the existing
cgroup events and this approach is when the function above gets
called. Usually it is called from account_event(), which is part of
event initialization. But this approach calls the function after the
event has been created. That's why I have the do_sched_enable
variable in perf_ioctl() below, to ensure it's called exactly once
for each event.
>
> > +
> > + mutex_lock(&perf_sched_mutex);
> > + if (!atomic_read(&perf_sched_count)) {
> > + static_branch_enable(&perf_sched_events);
> > + /*
> > + * Guarantee that all CPUs observe they key change and
> > + * call the perf scheduling hooks before proceeding to
> > + * install events that need them.
> > + */
> > + synchronize_rcu();
> > + }
> > + /*
> > + * Now that we have waited for the sync_sched(), allow further
> > + * increments to by-pass the mutex.
> > + */
> > + atomic_inc(&perf_sched_count);
> > + mutex_unlock(&perf_sched_mutex);
> > +}
> > +
> > static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> > {
> > struct perf_event *event = file->private_data;
> > struct perf_event_context *ctx;
> > + bool do_sched_enable = false;
> > long ret;
> >
> > /* Treat ioctl like writes as it is likely a mutating operation. */
> > @@ -5595,9 +6006,19 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> > return ret;
> >
> > ctx = perf_event_ctx_lock(event);
> > + /* ATTACH_CGROUP requires context switch callback */
> > + if (cmd == PERF_EVENT_IOC_ATTACH_CGROUP && !event_has_cgroup_node(event))
> > + do_sched_enable = true;
> > ret = _perf_ioctl(event, cmd, arg);
> > perf_event_ctx_unlock(event, ctx);
> >
> > + /*
> > + * Due to the circular lock dependency, it cannot call
> > + * static_branch_enable() under the ctx->mutex.
> > + */
> > + if (do_sched_enable && ret >= 0)
> > + perf_sched_enable();
> > +
> > return ret;
> > }
> >
> > @@ -11240,33 +11661,8 @@ static void account_event(struct perf_event *event)
> > if (event->attr.text_poke)
> > atomic_inc(&nr_text_poke_events);
> >
> > - if (inc) {
> > - /*
> > - * We need the mutex here because static_branch_enable()
> > - * must complete *before* the perf_sched_count increment
> > - * becomes visible.
> > - */
> > - if (atomic_inc_not_zero(&perf_sched_count))
> > - goto enabled;
> > -
> > - mutex_lock(&perf_sched_mutex);
> > - if (!atomic_read(&perf_sched_count)) {
> > - static_branch_enable(&perf_sched_events);
> > - /*
> > - * Guarantee that all CPUs observe they key change and
> > - * call the perf scheduling hooks before proceeding to
> > - * install events that need them.
> > - */
> > - synchronize_rcu();
> > - }
> > - /*
> > - * Now that we have waited for the sync_sched(), allow further
> > - * increments to by-pass the mutex.
> > - */
> > - atomic_inc(&perf_sched_count);
> > - mutex_unlock(&perf_sched_mutex);
> > - }
> > -enabled:
> > + if (inc)
> > + perf_sched_enable();
> >
> > account_event_cpu(event, event->cpu);
> >
> > @@ -13008,6 +13404,7 @@ static void __init perf_event_init_all_cpus(void)
> >
> > #ifdef CONFIG_CGROUP_PERF
> > INIT_LIST_HEAD(&per_cpu(cgrp_cpuctx_list, cpu));
> > + INIT_LIST_HEAD(&per_cpu(cgroup_ctx_list, cpu));
> > #endif
> > INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
> > }
> > @@ -13218,6 +13615,28 @@ static int perf_cgroup_css_online(struct cgroup_subsys_state *css)
> > return 0;
> > }
> >
> > +static int __perf_cgroup_update_node(void *info)
> > +{
> > + struct task_struct *task = info;
> > +
> > + rcu_read_lock();
> > + cgroup_node_sched_out(task);
> > + rcu_read_unlock();
> > +
> > + return 0;
> > +}
> > +
> > +static int perf_cgroup_can_attach(struct cgroup_taskset *tset)
> > +{
> > + struct task_struct *task;
> > + struct cgroup_subsys_state *css;
> > +
> > + cgroup_taskset_for_each(task, css, tset)
> > + task_function_call(task, __perf_cgroup_update_node, task);
> > +
> > + return 0;
> > +}
>
> Could you please explain why we need this logic in can_attach?
IIUC, ss->attach() is called after a task's cgroup membership
has changed. But we want to collect the performance numbers for
the old cgroup just before the change. As the logic merely checks
the current task's cgroup, it should be done in can_attach(),
which is called before the cgroup change.
Thanks,
Namhyung
On Mon, Mar 29, 2021 at 2:32 AM Song Liu <[email protected]> wrote:
> > On Mar 23, 2021, at 9:21 AM, Namhyung Kim <[email protected]> wrote:
> >
> > This enables reading event group's counter values together with a
> > PERF_EVENT_IOC_READ_CGROUP command like we do in the regular read().
> > Users should give a correct size of buffer to be read.
> >
> > Signed-off-by: Namhyung Kim <[email protected]>
> > ---
> > kernel/events/core.c | 119 +++++++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 116 insertions(+), 3 deletions(-)
> >
>
> [...]
>
> > +}
> > +
> > +static int perf_event_read_cgrp_node_group(struct perf_event *event, u64 cgrp_id,
> > + char __user *buf)
> > +{
> > + struct perf_cgroup_node *cgrp;
> > + struct perf_event_context *ctx = event->ctx;
> > + struct perf_event *sibling;
> > + u64 read_format = event->attr.read_format;
> > + unsigned long flags;
> > + u64 *values;
> > + int n = 1;
> > + int ret;
> > +
> > + values = kzalloc(event->read_size, GFP_KERNEL);
> > + if (!values)
> > + return -ENOMEM;
> > +
> > + values[0] = 1 + event->nr_siblings;
> > +
> > + /* update event count and times (possibly run on other cpu) */
> > + (void)perf_event_read(event, true);
> > +
> > + raw_spin_lock_irqsave(&ctx->lock, flags);
> > +
> > + cgrp = find_cgroup_node(event, cgrp_id);
> > + if (cgrp == NULL) {
> > + raw_spin_unlock_irqrestore(&ctx->lock, flags);
> > + kfree(values);
> > + return -ENOENT;
> > + }
> > +
> > + if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
> > + values[n++] = cgrp->time_enabled;
> > + if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
> > + values[n++] = cgrp->time_running;
> > +
> > + values[n++] = cgrp->count;
> > + if (read_format & PERF_FORMAT_ID)
> > + values[n++] = primary_event_id(event);
> > +
> > + for_each_sibling_event(sibling, event) {
> > + n += perf_event_read_cgrp_node_sibling(sibling, read_format,
> > + cgrp_id, &values[n]);
> > + }
> > +
> > + raw_spin_unlock_irqrestore(&ctx->lock, flags);
> > +
> > + ret = copy_to_user(buf, values, n * sizeof(u64));
> > + kfree(values);
> > + if (ret)
> > + return -EFAULT;
> > +
> > + return n * sizeof(u64);
> > +}
> > +
> > +static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
> > + u64 cgrp_id, char __user *buf)
> > +{
> > + u64 read_format = event->attr.read_format;
> > +
> > + if (read_size < event->read_size + 2 * sizeof(u64))
>
> Why do we need read_size + 2 u64 here?
I should've repeated the following description from patch 1:
* PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consisting of a 64-bit
array to get the event counter values. The first element is the size
of the array in bytes, and the second element is the cgroup id to
read. The rest is used to save the counter values and timings.
Thanks,
Namhyung
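(For illustration, a minimal userspace sketch of that read-side layout
for a plain, non-group event with read_format = 0. The macro name and
the [size, cgroup id, values...] layout come from the description
above; the assumption that the trailing values mirror the regular
read() ordering is mine.)

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>	/* assumes the updated uapi header from this series */

/* Read one cgroup's count from a shared event fd (read_format == 0). */
static int read_cgroup_count(int event_fd, uint64_t cgrp_id, uint64_t *count)
{
	uint64_t buf[3];

	buf[0] = sizeof(buf);	/* size of the buffer in bytes */
	buf[1] = cgrp_id;	/* cgroup id to read */

	if (ioctl(event_fd, PERF_EVENT_IOC_READ_CGROUP, buf) < 0)
		return -1;

	*count = buf[2];	/* counter value (timings would follow per read_format) */
	return 0;
}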
> On Mar 29, 2021, at 4:33 AM, Namhyung Kim <[email protected]> wrote:
>
> On Mon, Mar 29, 2021 at 2:17 AM Song Liu <[email protected]> wrote:
>>> On Mar 23, 2021, at 9:21 AM, Namhyung Kim <[email protected]> wrote:
>>>
>>> As we can run many jobs (in container) on a big machine, we want to
>>> measure each job's performance during the run. To do that, the
>>> perf_event can be associated to a cgroup to measure it only.
>>>
[...]
>>> + return 0;
>>> +}
>>
>> Could you please explain why we need this logic in can_attach?
>
> IIUC the ss->attach() is called after a task's cgroup membership
> is changed. But we want to collect the performance numbers for
> the old cgroup just before the change. As the logic merely checks
> the current task's cgroup, it should be done in the can_attach()
> which is called before the cgroup change.
Thanks for the explanations.
Overall, I really like the core idea, especially that the overhead on
context switch is bounded (by the depth of cgroup tree).
Is it possible to make PERF_EVENT_IOC_ATTACH_CGROUP more flexible?
Specifically, if we can have
PERF_EVENT_IOC_ADD_CGROUP add a cgroup to the list
PERF_EVENT_IOC_DEL_CGROUP delete a cgroup from the list
we can probably share these events among multiple processes, and
these processes wouldn't need to know each other's cgroup lists. I think
this will be useful for users building customized monitoring in
their own containers.
Does this make sense?
Thanks,
Song
On Tue, Mar 30, 2021 at 3:33 PM Song Liu <[email protected]> wrote:
> > On Mar 29, 2021, at 4:33 AM, Namhyung Kim <[email protected]> wrote:
> >
> > On Mon, Mar 29, 2021 at 2:17 AM Song Liu <[email protected]> wrote:
> >>> On Mar 23, 2021, at 9:21 AM, Namhyung Kim <[email protected]> wrote:
> >>>
> >>> As we can run many jobs (in container) on a big machine, we want to
> >>> measure each job's performance during the run. To do that, the
> >>> perf_event can be associated to a cgroup to measure it only.
> >>>
>
> [...]
>
> >>> + return 0;
> >>> +}
> >>
> >> Could you please explain why we need this logic in can_attach?
> >
> > IIUC the ss->attach() is called after a task's cgroup membership
> > is changed. But we want to collect the performance numbers for
> > the old cgroup just before the change. As the logic merely checks
> > the current task's cgroup, it should be done in the can_attach()
> > which is called before the cgroup change.
>
> Thanks for the explanations.
>
> Overall, I really like the core idea, especially that the overhead on
> context switch is bounded (by the depth of cgroup tree).
Thanks!
>
> Is it possible to make PERF_EVENT_IOC_ATTACH_CGROUP more flexible?
> Specifically, if we can have
>
> PERF_EVENT_IOC_ADD_CGROUP add a cgroup to the list
> PERF_EVENT_IOC_EL_CGROUP delete a cgroup from the list
>
> we can probably share these events among multiple processes, and
> these processes don't need to know others' cgroup list. I think
> this will be useful for users to build customized monitoring in
> its own container.
>
> Does this make sense?
Maybe we can add an ADD/DEL interface for more flexible monitoring,
but I'm not sure in which use cases it would actually be used.
For your multi-process sharing case, the original events' file
descriptors would have to be shared first. Also, adding and deleting
(or just reading) arbitrary cgroups from inside a container can be a
security concern IMHO.
So I just focused on the single-process, multi-cgroup case, which is
already used (perf stat --for-each-cgroup) and very important in my
company's setup. In this case we have the list of interesting cgroups
from the beginning, so it's more efficient to create a properly sized
hash table and all the nodes at once.
Thanks,
Namhyung
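(As an aside, the 64-bit cgroup ids that a tool passes to the ATTACH
ioctl can be obtained on cgroup v2 from the filesystem via
name_to_handle_at(), whose kernfs file handle encodes the id on recent
kernels; a sketch along those lines, with the cgroup path being only an
example:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <string.h>

/* Return the 64-bit cgroup id for a cgroup v2 directory,
 * e.g. "/sys/fs/cgroup/myjob". */
static int read_cgroup_id(const char *path, uint64_t *id)
{
	struct {
		struct file_handle fh;
		uint64_t cgroup_id;	/* room for the 8-byte kernfs handle */
	} handle;
	int mount_id;

	handle.fh.handle_bytes = sizeof(uint64_t);
	if (name_to_handle_at(AT_FDCWD, path, &handle.fh, &mount_id, 0) < 0)
		return -1;

	memcpy(id, handle.fh.f_handle, sizeof(*id));
	return 0;
}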
Hi,
I would like to re-emphasize why this patch is important. As Namhyung
outlined in his cover message, cgroup monitoring builds on top of
per-cpu monitoring and offers maximum flexibility by allowing each
event to be attached to a single cgroup. Although this was fine when
machines were much smaller and the number of simultaneous cgroups was
also small, it does not work anymore with today's machines, and even
less with future machines. Over the last couple of years, we have
tried to make cgroup monitoring more scalable. Ian Rogers' patch
series addressed the RB-tree handling of events to avoid walking the
whole tree to find the events for the cgroup being scheduled in. This
helped reduce some of the overhead we are seeing, which causes serious
problems for our end users, forcing them to tone down monitoring and
slice cgroup collection over time, which is far from ideal.
Namhyung's series goes a lot further by addressing two key overhead factors:
1- the file descriptor consumption explosion
2- the context switch overhead
Again, this is a major cause of problems for us and needed to be
addressed in a way that maintains backward compatibility.
We are interested in the case where the same events are measured
across all cgroups, and I believe this is a common usage model.
1/ File descriptor issue
With the current interface, if you want to monitor 10 events on a
112-CPU server across 200 cgroups, you need:
num_fds = num_events x num_cpus x num_cgroups = 10 x 112 x 200 = 224,000 descriptors
A typical Linux distribution allows around 1024 file descriptors per
process by default. Although you could raise the limit as root, this
has other impacts on the system: the kernel memory footprint needed to
back these file descriptors and struct perf_event instances is large
(see our presentation at LPC 2019).
2/ Context switch overhead
Each time you have a cgroup switch, i.e., a context switch where you
switch cgroups, you incur a PMU event reschedule. A cgroup sched-in
is more expensive than a per-process sched-in because you have to find
the events which are relevant to the next cgroup, i.e., you may have to
walk more entries in the RB-tree. If the events are identical across
cgroups, you may end up paying that cost just to reinstall the same
events (ignoring multiplexing).
Furthermore, event scheduling is an expensive operation because of the
memory and PMU register accesses involved. It is always best avoided
if possible.
From our experience, this has caused significant overhead on our
systems, to the point where we have to reduce the interval at which we
collect the data and the number of cgroups we can monitor at once.
3/ Namhyung's solution
I personally like Namhyung's solution to the problem because it fits
within the existing interface and does not break the existing
per-cgroup mode. The implementation is fairly simple and non-invasive.
It provides a very significant reduction in BOTH file descriptor
pressure and context switch overhead. It matches perfectly with the
common usage model of monitoring the same events across multiple
cgroups simultaneously. The patch does not disrupt the existing
perf_event_open() or read()/close() syscalls. Everything is handled
via a pair of new ioctl() commands.
It eliminates the file descriptor overhead as follows, using the same
example as before:
Before:
num_fds = num_events x num_cpus x num_cgroups = 10 x 112 x 200 = 224,000 descriptors
After:
num_fds = num_events x num_cpus = 10 x 112 = 1120 descriptors
(a 200x reduction in fds, and the memory savings that go with that in
the kernel)
In other words, it reduces the file descriptor consumption to what is
necessary for plain system-wide monitoring.
On context switch, the kernel computes the event delta and stores it in
a hash table, i.e., a single PMU register access instead of a full
PMU reschedule.
The delta is propagated up the cgroup hierarchy as needed.
The change is generic and benefits ALL processor architectures in the
same manner.
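(To make that concrete, a simplified sketch of the cgroup-switch
bookkeeping described above, in the kernel/events/core.c context of
this series. find_cgroup_node() and cgrp_node_count are names from the
series; the rest paraphrases the mechanism and is not the literal
patch code.)

/*
 * Invoked when the outgoing and incoming tasks belong to different
 * cgroups.  The PMU counter has already been folded into event->count
 * by event->pmu->read(), so this is pure bookkeeping -- no PMU
 * reprogramming takes place.
 */
static void perf_update_cgrp_node(struct perf_event *event,
				  struct cgroup *prev_cgrp)
{
	u64 now = local64_read(&event->count);
	u64 delta = now - event->cgrp_node_count;
	struct cgroup *cgrp;

	event->cgrp_node_count = now;

	/* charge the delta to the previous cgroup and its ancestors */
	for (cgrp = prev_cgrp; cgrp; cgrp = cgroup_parent(cgrp)) {
		struct perf_cgroup_node *node;

		node = find_cgroup_node(event, cgroup_id(cgrp));
		if (node)
			node->count += delta;
	}
}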
We have tested the patch on our servers with large configurations and
it has delivered significant savings and enabled monitoring of more
cgroups simultaneously, instead of monitoring in batches, which never
yielded a consistent view of the system.
Furthermore, the patches could be extended to add, as Song Liu
suggested, the possibility of deleting cgroups attached to an event to
allow continuous monitoring without having to restart the monitoring
tool. I believe the extension can be further improved by allowing a
vectored read of the counts as well. That would eliminate a
significant number of ioctl(READ) syscalls.
Overall, I think this patch series delivers significant value-add to
the perf_events interface and should be committed ASAP.
Thanks.
On Tue, Mar 30, 2021 at 8:11 AM Namhyung Kim <[email protected]> wrote:
>
> On Tue, Mar 30, 2021 at 3:33 PM Song Liu <[email protected]> wrote:
> > > On Mar 29, 2021, at 4:33 AM, Namhyung Kim <[email protected]> wrote:
> > >
> > > On Mon, Mar 29, 2021 at 2:17 AM Song Liu <[email protected]> wrote:
> > >>> On Mar 23, 2021, at 9:21 AM, Namhyung Kim <[email protected]> wrote:
> > >>>
> > >>> As we can run many jobs (in container) on a big machine, we want to
> > >>> measure each job's performance during the run. To do that, the
> > >>> perf_event can be associated to a cgroup to measure it only.
> > >>>
> >
> > [...]
> >
> > >>> + return 0;
> > >>> +}
> > >>
> > >> Could you please explain why we need this logic in can_attach?
> > >
> > > IIUC the ss->attach() is called after a task's cgroup membership
> > > is changed. But we want to collect the performance numbers for
> > > the old cgroup just before the change. As the logic merely checks
> > > the current task's cgroup, it should be done in the can_attach()
> > > which is called before the cgroup change.
> >
> > Thanks for the explanations.
> >
> > Overall, I really like the core idea, especially that the overhead on
> > context switch is bounded (by the depth of cgroup tree).
>
> Thanks!
>
> >
> > Is it possible to make PERF_EVENT_IOC_ATTACH_CGROUP more flexible?
> > Specifically, if we can have
> >
> > PERF_EVENT_IOC_ADD_CGROUP add a cgroup to the list
> > PERF_EVENT_IOC_EL_CGROUP delete a cgroup from the list
> >
> > we can probably share these events among multiple processes, and
> > these processes don't need to know others' cgroup list. I think
> > this will be useful for users to build customized monitoring in
> > its own container.
> >
> > Does this make sense?
>
> Maybe we can add ADD/DEL interface for more flexible monitoring
> but I'm not sure which use cases it'll be used actually.
>
> For your multi-process sharing case, the original events' file
> descriptors should be shared first. Also adding and deleting
> (or just reading) arbitrary cgroups from a container can be a
> security concern IMHO.
>
> So I just focused on the single-process multi-cgroup case which is
> already used (perf stat --for-each-cgroup) and very important in my
> company's setup. In this case we have a list of interested cgroups
> from the beginning so it's more efficient to create a properly sized
> hash table and all the nodes at once.
>
> Thanks,
> Namhyung
> On Mar 30, 2021, at 8:11 AM, Namhyung Kim <[email protected]> wrote:
>
> On Tue, Mar 30, 2021 at 3:33 PM Song Liu <[email protected]> wrote:
>>> On Mar 29, 2021, at 4:33 AM, Namhyung Kim <[email protected]> wrote:
>>>
>>> On Mon, Mar 29, 2021 at 2:17 AM Song Liu <[email protected]> wrote:
>>>>> On Mar 23, 2021, at 9:21 AM, Namhyung Kim <[email protected]> wrote:
>>>>>
>>>>> As we can run many jobs (in container) on a big machine, we want to
>>>>> measure each job's performance during the run. To do that, the
>>>>> perf_event can be associated to a cgroup to measure it only.
>>>>>
>>
>> [...]
>>
>>>>> + return 0;
>>>>> +}
>>>>
>>>> Could you please explain why we need this logic in can_attach?
>>>
>>> IIUC the ss->attach() is called after a task's cgroup membership
>>> is changed. But we want to collect the performance numbers for
>>> the old cgroup just before the change. As the logic merely checks
>>> the current task's cgroup, it should be done in the can_attach()
>>> which is called before the cgroup change.
>>
>> Thanks for the explanations.
>>
>> Overall, I really like the core idea, especially that the overhead on
>> context switch is bounded (by the depth of cgroup tree).
>
> Thanks!
>
>>
>> Is it possible to make PERF_EVENT_IOC_ATTACH_CGROUP more flexible?
>> Specifically, if we can have
>>
>> PERF_EVENT_IOC_ADD_CGROUP add a cgroup to the list
>> PERF_EVENT_IOC_EL_CGROUP delete a cgroup from the list
>>
>> we can probably share these events among multiple processes, and
>> these processes don't need to know others' cgroup list. I think
>> this will be useful for users to build customized monitoring in
>> its own container.
>>
>> Does this make sense?
>
> Maybe we can add ADD/DEL interface for more flexible monitoring
> but I'm not sure which use cases it'll be used actually.
>
> For your multi-process sharing case, the original events' file
> descriptors should be shared first.
Yes, we will need some other work to make the ADD/DEL interface
work properly. Let's worry about that later.
For both patches:
Acked-by: Song Liu <[email protected]>
Thanks,
Song
[...]