From: Kan Liang <[email protected]>
For perf record frequency mode, the initial sample_period is 1. That's
because perf doesn't know what period should be set. It uses the minimum
period 1 as the first period. It will trigger an interrupt soon. Then
there will be enough data to calculate the period for the given
frequency. But too many very short period like 1 may cause various
problems and increase the overhead. It's better to limit the 1 period to
just the first several period setting.
However, for some workload, 1 period is frequently set. For example,
perf record a busy loop for 10 seconds.
perf record ./finity_busy_loop.sh 10
while [ "A" != "B" ]
do
date > /dev/null
done
Period was changed 150503 times in 10 seconds. 22.5% (33861 times) of
the period is set to 1. That's because, in the inherit_event, the period
for child event is inherit from parent's parent's event, which is
usually the default sample_period 1. Each child event has to recaculate
the period from 1 everytime. That brings high overhead.
This patch keeps the sample period average in original parent event.
Each new child event can use it as its initial sample period.
Adding a ori_parent in struct perf_event to help child event access the
original parent. For each new child event, the parent event refcount++.
Parent will not go away until all children go away. So the stored
pointer is safe to be accessed.
After applying this patch, the 1 period rate reduces to 0.1%.
Signed-off-by: Kan Liang <[email protected]>
---
include/linux/perf_event.h | 4 ++++
kernel/events/core.c | 22 ++++++++++++++++++++--
2 files changed, 24 insertions(+), 2 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 486e84c..b328617 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -403,6 +403,10 @@ struct perf_event {
struct list_head child_list;
struct perf_event *parent;
+ /* Average Sample period in the original parent event */
+ struct perf_event *ori_parent;
+ local64_t avg_sample_period;
+
int oncpu;
int cpu;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index af0a5ba..a8be6d3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2795,7 +2795,8 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
{
struct hw_perf_event *hwc = &event->hw;
s64 period, sample_period;
- s64 delta;
+ s64 delta, avg_period;
+ struct perf_event *head_event = event->ori_parent;
period = perf_calculate_period(event, nsec, count);
@@ -2809,6 +2810,9 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
hwc->sample_period = sample_period;
+ avg_period = (local64_read(&head_event->avg_sample_period) + sample_period) / 2;
+ local64_set(&head_event->avg_sample_period, avg_period);
+
if (local64_read(&hwc->period_left) > 8*sample_period) {
if (disable)
event->pmu->stop(event, PERF_EF_UPDATE);
@@ -6996,6 +7000,10 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
event->oncpu = -1;
event->parent = parent_event;
+ if (parent_event)
+ event->ori_parent = parent_event->ori_parent;
+ else
+ event->ori_parent = event;
event->ns = get_pid_ns(task_active_pid_ns(current));
event->id = atomic64_inc_return(&perf_event_id);
@@ -7030,8 +7038,16 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
hwc = &event->hw;
hwc->sample_period = attr->sample_period;
- if (attr->freq && attr->sample_freq)
+ if (attr->freq && attr->sample_freq) {
hwc->sample_period = 1;
+ if (parent_event) {
+ struct perf_event *head_event = event->ori_parent;
+
+ hwc->sample_period = local64_read(&head_event->avg_sample_period);
+ } else {
+ local64_set(&event->avg_sample_period, hwc->sample_period);
+ }
+ }
hwc->last_period = hwc->sample_period;
local64_set(&hwc->period_left, hwc->sample_period);
@@ -7904,7 +7920,9 @@ inherit_event(struct perf_event *parent_event,
if (parent_event->attr.freq) {
u64 sample_period = parent_event->hw.sample_period;
struct hw_perf_event *hwc = &child_event->hw;
+ struct perf_event *head_event = child_event->ori_parent;
+ sample_period = local64_read(&head_event->avg_sample_period);
hwc->sample_period = sample_period;
hwc->last_period = sample_period;
--
1.8.3.2
On Fri, Dec 12, 2014 at 10:10:35AM -0500, [email protected] wrote:
> That's because, in the inherit_event, the period
> for child event is inherit from parent's parent's event, which is
> usually the default sample_period 1. Each child event has to recaculate
> the period from 1 everytime. That brings high overhead.
>
> This patch keeps the sample period average in original parent event.
That's going to make an even worse nightmare of the refcounting :/
> Each new child event can use it as its initial sample period.
> Adding a ori_parent in struct perf_event to help child event access the
> original parent. For each new child event, the parent event refcount++.
> Parent will not go away until all children go away. So the stored
> pointer is safe to be accessed.
So going by the above you only use this once, during fork, which already
guarantees the existence of the parent.
Now looking at the code you actually use it again in
perf_adjust_period() and push period adjustments into the parent.
This doesn't seem to make any kind of sense, and its weirdly
implemented.
So why would you push anything to the original parent? Your description
states that the parent event usually has 1, and then you argue about
fixing that by using the orig parent, but then you need to update the
orig parent. Did you go in circles and confuse yourself? Why not push
things into the regular parent event if you're going to push things up.
Also, since you can have multiple child events, on many CPUs local64_t
is the wrong data type, furthermore its going to be a scalability issue
on big hardware.
So please, try again. And try and explain what you're actually doing.
> On Fri, Dec 12, 2014 at 10:10:35AM -0500, [email protected] wrote:
>
> > That's because, in the inherit_event, the period for child event is
> > inherit from parent's parent's event, which is usually the default
> > sample_period 1. Each child event has to recaculate the period from 1
> > everytime. That brings high overhead.
> >
> > This patch keeps the sample period average in original parent event.
>
> That's going to make an even worse nightmare of the refcounting :/
It will not change the refcount. Current mechanism has already
guarantee the original parent event doesn't go away until all children
events go away, doesn't it?
>
> > Each new child event can use it as its initial sample period.
> > Adding a ori_parent in struct perf_event to help child event access
> > the original parent. For each new child event, the parent event
> refcount++.
> > Parent will not go away until all children go away. So the stored
> > pointer is safe to be accessed.
>
> So going by the above you only use this once, during fork, which already
> guarantees the existence of the parent.
>
> Now looking at the code you actually use it again in
> perf_adjust_period() and push period adjustments into the parent.
The original parent is the last event to be released. It is safe to keep
children's average period in original parent struct.
Yes, there are two places to access the average sample period.
In inherit_event, the stored average sample period in ori parent will pass
to child event. In perf_adjust_period, the average sample period will be
recalculated and stored in ori parent event.
I can add the description about updating period avg in perf_adjust_period
>
> This doesn't seem to make any kind of sense, and its weirdly implemented.
>
> So why would you push anything to the original parent? Your description
> states that the parent event usually has 1, and then you argue about fixing
> that by using the orig parent, but then you need to update the orig parent.
> Did you go in circles and confuse yourself? Why not push things into the
> regular parent event if you're going to push things up.
My thought is that the original parent is the root of the tree. If there is an
average sample period for nodes, it should be kept in the root node, since
it's the only node everyone knows.
Yes, it's OK to put the average sample period in the regular parent event.
Because what we need is a reference for initial sample period of new child.
No matter it's from all event or just sibling/direct parent event average, the
average sample period is more larger than 1.
So I think either is fine. I can push things to the regular parent event.
>
> Also, since you can have multiple child events, on many CPUs local64_t is
> the wrong data type, furthermore its going to be a scalability issue on big
> hardware.
I'd like to have avg_sample_period for each CPU. The similar usage is
period_left in hw_perf_event.
We don't need to share the avg_sample_period between CPUs, after all
it's only a reference.
Thanks,
Kan
>
> So please, try again. And try and explain what you're actually doing.
On Mon, Dec 15, 2014 at 09:17:33PM +0000, Liang, Kan wrote:
> > This doesn't seem to make any kind of sense, and its weirdly implemented.
> >
> > So why would you push anything to the original parent? Your description
> > states that the parent event usually has 1, and then you argue about fixing
> > that by using the orig parent, but then you need to update the orig parent.
> > Did you go in circles and confuse yourself? Why not push things into the
> > regular parent event if you're going to push things up.
>
> My thought is that the original parent is the root of the tree.
No parent is the root; I thought your orig parent thing was the event
you forked from, but now I see its not.
See inherit_event(), event->parent is the root event.
> If there is an
> average sample period for nodes, it should be kept in the root node, since
> it's the only node everyone knows.
Right, but, that's also contention central..
> > Also, since you can have multiple child events, on many CPUs local64_t is
> > the wrong data type, furthermore its going to be a scalability issue on big
> > hardware.
>
> I'd like to have avg_sample_period for each CPU. The similar usage is
> period_left in hw_perf_event.
Well, some events are per cpu, some are per task. The per task events do
not have per-cpu storage and their parent can be on whatever cpu.
> We don't need to share the avg_sample_period between CPUs, after all
> it's only a reference.
Right, some smarts are needed to avoid the worst contention there. Maybe
a jiffy timestamp and don't update more than once every HZ jiffies or
so.
>
> On Mon, Dec 15, 2014 at 09:17:33PM +0000, Liang, Kan wrote:
> > > This doesn't seem to make any kind of sense, and its weirdly
> implemented.
> > >
> > > So why would you push anything to the original parent? Your
> > > description states that the parent event usually has 1, and then you
> > > argue about fixing that by using the orig parent, but then you need to
> update the orig parent.
> > > Did you go in circles and confuse yourself? Why not push things into
> > > the regular parent event if you're going to push things up.
> >
> > My thought is that the original parent is the root of the tree.
>
> No parent is the root; I thought your orig parent thing was the event you
> forked from, but now I see its not.
>
> See inherit_event(), event->parent is the root event.
>
I c. I will keep the information in regular parent event.
> > If there is an
> > average sample period for nodes, it should be kept in the root node,
> > since it's the only node everyone knows.
>
> Right, but, that's also contention central..
>
> > > Also, since you can have multiple child events, on many CPUs
> > > local64_t is the wrong data type, furthermore its going to be a
> > > scalability issue on big hardware.
> >
> > I'd like to have avg_sample_period for each CPU. The similar usage is
> > period_left in hw_perf_event.
>
> Well, some events are per cpu, some are per task. The per task events do
> not have per-cpu storage and their parent can be on whatever cpu.
>
I will use atomic64_t to replace local64_t.
> > We don't need to share the avg_sample_period between CPUs, after all
> > it's only a reference.
>
> Right, some smarts are needed to avoid the worst contention there.
> Maybe a jiffy timestamp and don't update more than once every HZ jiffies
> or so.
perf_adjust_freq_unthr_context will be called each tick. I think we may
only update the avg_sample_period in this function.
Also, I think we'd better to update the avg_sample_period in first period
adjust. We need to get rid of 1 period as earlier as possible.
I will change the update part as below.
@@ -2810,8 +2833,10 @@ static void perf_adjust_period(struct perf_event
*event, u64 nsec, u64 count, bo
hwc->sample_period = sample_period;
+ if (!disable || (atomic64_read(&head_event->avg_sample_period)
== 1)) {
+ avg_period = (atomic64_read(&head_event->avg_sample_period
) + sample_period) / 2;
+ atomic64_set(&head_event->avg_sample_period, avg_period);
+ }
if (local64_read(&hwc->period_left) > 8*sample_period) {
if (disable)
Thanks,
Kan