2010-02-07 11:31:17

by Stijn Devriendt

Subject: [RFC][PATCH] PERF_COUNT_SW_RUNNABLE_TASKS: measure and act upon parallelism


Hi all,

Here's an initial RFC patch for the parallelism
events for perf_events.

It works with a low number of tasks, but runs into a
couple of issues:
- something goes wrong when a thread exits; apparently
the count is decremented twice. I'm still figuring this
one out; perhaps sync_stat is to blame.
This causes things to break with more than 3 threads,
if I recall correctly.
- the count is stored with the parent event only, so the
child-list lock does not need to be taken as often.
- adding threshold support to poll() breaks use with mmap:
poll() returns as soon as count < threshold, even though
no new entries may have been added to the memory map.
- the min threshold is actually useless, so it should be dropped.

There's definitely more wrong with this early patchset,
and I'm doubting whether perf_events is actually fit for
this purpose.
Currently, by design, it's not meant to support counters
that may also decrement. Also, the foreseen use of poll()
to block execution of the thread doesn't seem to work
well together with the mmap() use.

Some hints, pointers and remarks are definitely welcome
at this stage.

Regards,
Stijn
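
A minimal userspace sketch of the intended usage may help frame the
questions above; it assumes the threshold/max_threshold attribute
fields and the PERF_COUNT_SW_RUNNABLE_TASKS id proposed in the patches
below, so none of this is a stable ABI:

/*
 * Hedged sketch only: relies on the threshold/max_threshold attr fields
 * and the PERF_COUNT_SW_RUNNABLE_TASKS id proposed in this series.
 */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>

static int open_runnable_tasks_event(unsigned long max_parallel)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size   = sizeof(attr);
	attr.type   = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_RUNNABLE_TASKS;	/* patch 6 */
	attr.inherit = 1;				/* follow children too */
	attr.threshold     = 1;				/* patch 2 */
	attr.max_threshold = max_parallel;		/* unions with sample_period */

	/* monitor the calling task (and inherited children) on any CPU */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

A dispatcher would then poll() this fd and only hand out new work while
POLLIN is reported, i.e. while the runnable-task count is within the
configured threshold.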


2010-02-07 11:31:21

by Stijn Devriendt

Subject: [PATCH 1/6] Move poll into perf_event and allow for wakeups when fd has not been mmap'd. This is useful when using read() to read out the current counter value.

From: Stijn Devriendt <[email protected]>

---
include/linux/perf_event.h | 2 +-
kernel/perf_event.c | 30 +++++++++++++++---------------
2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c66b34f..827a221 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -538,7 +538,6 @@ struct perf_mmap_data {
int writable; /* are we writable */
int nr_locked; /* nr pages mlocked */

- atomic_t poll; /* POLL_ for wakeups */
atomic_t events; /* event_id limit */

atomic_long_t head; /* write position */
@@ -639,6 +638,7 @@ struct perf_event {
struct perf_mmap_data *data;

/* poll related */
+ atomic_t poll; /* POLL_ for wakeups */
wait_queue_head_t waitq;
struct fasync_struct *fasync;

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index e0eb4a2..42dc18d 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1919,14 +1919,7 @@ perf_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
static unsigned int perf_poll(struct file *file, poll_table *wait)
{
struct perf_event *event = file->private_data;
- struct perf_mmap_data *data;
- unsigned int events = POLL_HUP;
-
- rcu_read_lock();
- data = rcu_dereference(event->data);
- if (data)
- events = atomic_xchg(&data->poll, 0);
- rcu_read_unlock();
+ unsigned int events = atomic_xchg(&event->poll, 0);

poll_wait(file, &event->waitq, wait);

@@ -2680,16 +2673,20 @@ static bool perf_output_space(struct perf_mmap_data *data, unsigned long tail,
return true;
}

-static void perf_output_wakeup(struct perf_output_handle *handle)
+static void __perf_output_wakeup(struct perf_event* event, int nmi)
{
- atomic_set(&handle->data->poll, POLL_IN);
+ atomic_set(&event->poll, POLLIN);

- if (handle->nmi) {
- handle->event->pending_wakeup = 1;
- perf_pending_queue(&handle->event->pending,
- perf_pending_event);
+ if (nmi) {
+ event->pending_wakeup = 1;
+ perf_pending_queue(&event->pending, perf_pending_event);
} else
- perf_event_wakeup(handle->event);
+ perf_event_wakeup(event);
+}
+
+static void perf_output_wakeup(struct perf_output_handle *handle)
+{
+ __perf_output_wakeup(handle->event, handle->nmi);
}

/*
@@ -3171,6 +3168,9 @@ static void perf_event_output(struct perf_event *event, int nmi,
struct perf_output_handle handle;
struct perf_event_header header;

+ if (!event->data)
+ return __perf_output_wakeup(event, nmi);
+
perf_prepare_sample(&header, data, event, regs);

if (perf_output_begin(&handle, event, header.size, nmi, 1))
--
1.6.6
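
As a usage note on what this enables (a sketch, assuming the event was
opened with read_format == 0 so that read() returns a single u64): a
consumer that never mmap()s the fd can simply block in poll() and then
read() the current counter value.

#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Sketch: block until the event fd signals POLLIN, then read the
 * current counter value. */
static uint64_t wait_and_read_count(int perf_fd)
{
	struct pollfd pfd = { .fd = perf_fd, .events = POLLIN };
	uint64_t count = 0;

	poll(&pfd, 1, -1);			/* woken via perf_event_wakeup() */
	if (read(perf_fd, &count, sizeof(count)) != sizeof(count))
		return 0;			/* sketch: no real error handling */

	return count;
}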

2010-02-07 11:31:35

by Stijn Devriendt

Subject: [PATCH 5/6] Allow decrementing counters

From: Stijn Devriendt <[email protected]>

---
include/linux/perf_event.h | 6 +++---
kernel/perf_event.c | 16 ++++++++--------
2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4f7d318..084f322 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -515,7 +515,7 @@ struct pmu {
void (*disable) (struct perf_event *event);
void (*update) (struct perf_event *event);
void (*unthrottle) (struct perf_event *event);
- u64 (*add) (struct perf_event *event, u64 count);
+ u64 (*add) (struct perf_event *event, s64 count);
int (*reset) (struct perf_event *event);
void (*wakeup) (struct perf_event *event);
u64 (*read) (struct perf_event *event);
@@ -826,10 +826,10 @@ static inline int is_software_event(struct perf_event *event)

extern atomic_t perf_swevent_enabled[PERF_COUNT_SW_MAX];

-extern void __perf_sw_event(u32, u64, int, struct pt_regs *, u64);
+extern void __perf_sw_event(u32, s64, int, struct pt_regs *, u64);

static inline void
-perf_sw_event(u32 event_id, u64 nr, int nmi, struct pt_regs *regs, u64 addr)
+perf_sw_event(u32 event_id, s64 nr, int nmi, struct pt_regs *regs, u64 addr)
{
if (atomic_read(&perf_swevent_enabled[event_id]))
__perf_sw_event(event_id, nr, nmi, regs, addr);
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 724aafd..08885d0 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -3770,12 +3770,12 @@ static void perf_event_wakeup_one(struct perf_event *event)
wake_up(&event->waitq);
}

-static u64 __perf_event_add(struct perf_event *event, u64 count)
+static u64 __perf_event_add(struct perf_event *event, s64 count)
{
return atomic64_add_return(count, &event->count);
}

-static u64 perf_event_add(struct perf_event *event, u64 count)
+static u64 perf_event_add(struct perf_event *event, s64 count)
{
if (event->pmu->add)
return event->pmu->add(event, count);
@@ -3856,7 +3856,7 @@ static void perf_swevent_unthrottle(struct perf_event *event)
*/
}

-static void perf_swevent_add(struct perf_event *event, u64 nr,
+static void perf_swevent_add(struct perf_event *event, s64 nr,
int nmi, struct perf_sample_data *data,
struct pt_regs *regs)
{
@@ -3870,10 +3870,10 @@ static void perf_swevent_add(struct perf_event *event, u64 nr,
if (!hwc->sample_period)
return;

- if (nr == 1 && hwc->sample_period == 1 && !event->attr.freq)
+ if (abs(nr) == 1 && hwc->sample_period == 1 && !event->attr.freq)
return perf_swevent_overflow(event, 1, nmi, data, regs);

- if (atomic64_add_negative(nr, &hwc->period_left))
+ if (atomic64_add_negative(abs(nr), &hwc->period_left))
return;

perf_swevent_overflow(event, 0, nmi, data, regs);
@@ -3956,7 +3956,7 @@ static int perf_swevent_match(struct perf_event *event,

static void perf_swevent_ctx_event(struct perf_event_context *ctx,
enum perf_type_id type,
- u32 event_id, u64 nr, int nmi,
+ u32 event_id, s64 nr, int nmi,
struct perf_sample_data *data,
struct pt_regs *regs)
{
@@ -4004,7 +4004,7 @@ void perf_swevent_put_recursion_context(int rctx)
EXPORT_SYMBOL_GPL(perf_swevent_put_recursion_context);

static void do_perf_sw_event(enum perf_type_id type, u32 event_id,
- u64 nr, int nmi,
+ s64 nr, int nmi,
struct perf_sample_data *data,
struct pt_regs *regs)
{
@@ -4025,7 +4025,7 @@ static void do_perf_sw_event(enum perf_type_id type, u32 event_id,
rcu_read_unlock();
}

-void __perf_sw_event(u32 event_id, u64 nr, int nmi,
+void __perf_sw_event(u32 event_id, s64 nr, int nmi,
struct pt_regs *regs, u64 addr)
{
struct perf_sample_data data;
--
1.6.6

2010-02-07 11:31:27

by Stijn Devriendt

Subject: [PATCH 4/6] Fix HW breakpoint handlers

From: Stijn Devriendt <[email protected]>

---
arch/x86/include/asm/hw_breakpoint.h | 2 +-
arch/x86/kernel/hw_breakpoint.c | 2 +-
kernel/hw_breakpoint.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/hw_breakpoint.h b/arch/x86/include/asm/hw_breakpoint.h
index 0675a7c..d060a5c 100644
--- a/arch/x86/include/asm/hw_breakpoint.h
+++ b/arch/x86/include/asm/hw_breakpoint.h
@@ -54,7 +54,7 @@ extern int hw_breakpoint_exceptions_notify(struct notifier_block *unused,

int arch_install_hw_breakpoint(struct perf_event *bp);
void arch_uninstall_hw_breakpoint(struct perf_event *bp);
-void hw_breakpoint_pmu_read(struct perf_event *bp);
+void hw_breakpoint_pmu_update(struct perf_event *bp);
void hw_breakpoint_pmu_unthrottle(struct perf_event *bp);

extern void
diff --git a/arch/x86/kernel/hw_breakpoint.c b/arch/x86/kernel/hw_breakpoint.c
index 05d5fec..55469e0 100644
--- a/arch/x86/kernel/hw_breakpoint.c
+++ b/arch/x86/kernel/hw_breakpoint.c
@@ -543,7 +543,7 @@ int __kprobes hw_breakpoint_exceptions_notify(
return hw_breakpoint_handler(data);
}

-void hw_breakpoint_pmu_read(struct perf_event *bp)
+void hw_breakpoint_pmu_update(struct perf_event *bp)
{
/* TODO */
}
diff --git a/kernel/hw_breakpoint.c b/kernel/hw_breakpoint.c
index dbcbf6a..16b3a9b 100644
--- a/kernel/hw_breakpoint.c
+++ b/kernel/hw_breakpoint.c
@@ -448,6 +448,6 @@ core_initcall(init_hw_breakpoint);
struct pmu perf_ops_bp = {
.enable = arch_install_hw_breakpoint,
.disable = arch_uninstall_hw_breakpoint,
- .read = hw_breakpoint_pmu_read,
+ .update = hw_breakpoint_pmu_update,
.unthrottle = hw_breakpoint_pmu_unthrottle
};
--
1.6.6

2010-02-07 11:31:32

by Stijn Devriendt

Subject: [PATCH 3/6] Add handlers for adding counter events, reading counter events, counter resets, wakeups. Renamed old read handler to update.

From: Stijn Devriendt <[email protected]>

---
arch/x86/kernel/cpu/perf_event.c | 4 +-
include/linux/perf_event.h | 8 ++-
kernel/perf_event.c | 120 +++++++++++++++++++++++++++-----------
3 files changed, 94 insertions(+), 38 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index c223b7e..9738f22 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2222,7 +2222,7 @@ void __init init_hw_perf_events(void)
pr_info("... event mask: %016Lx\n", perf_event_mask);
}

-static inline void x86_pmu_read(struct perf_event *event)
+static inline void x86_pmu_update(struct perf_event *event)
{
x86_perf_event_update(event, &event->hw, event->hw.idx);
}
@@ -2230,7 +2230,7 @@ static inline void x86_pmu_read(struct perf_event *event)
static const struct pmu pmu = {
.enable = x86_pmu_enable,
.disable = x86_pmu_disable,
- .read = x86_pmu_read,
+ .update = x86_pmu_update,
.unthrottle = x86_pmu_unthrottle,
};

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0fa235e..4f7d318 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -513,8 +513,12 @@ struct perf_event;
struct pmu {
int (*enable) (struct perf_event *event);
void (*disable) (struct perf_event *event);
- void (*read) (struct perf_event *event);
+ void (*update) (struct perf_event *event);
void (*unthrottle) (struct perf_event *event);
+ u64 (*add) (struct perf_event *event, u64 count);
+ int (*reset) (struct perf_event *event);
+ void (*wakeup) (struct perf_event *event);
+ u64 (*read) (struct perf_event *event);
};

/**
@@ -639,7 +643,7 @@ struct perf_event {
struct perf_mmap_data *data;

/* poll related */
- atomic_t poll; /* POLL_ for wakeups */
+ atomic_t poll; /* POLLX for wakeups */
wait_queue_head_t waitq;
struct fasync_struct *fasync;

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 70ca6e1..724aafd 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1092,7 +1092,7 @@ static void __perf_event_sync_stat(struct perf_event *event,
return;

/*
- * Update the event value, we cannot use perf_event_read()
+ * Update the event value, we cannot use perf_event_update()
* because we're in the middle of a context switch and have IRQs
* disabled, which upsets smp_call_function_single(), however
* we know the event must be on the current CPU, therefore we
@@ -1100,7 +1100,7 @@ static void __perf_event_sync_stat(struct perf_event *event,
*/
switch (event->state) {
case PERF_EVENT_STATE_ACTIVE:
- event->pmu->read(event);
+ event->pmu->update(event);
/* fall-through */

case PERF_EVENT_STATE_INACTIVE:
@@ -1534,10 +1534,23 @@ static void perf_event_enable_on_exec(struct task_struct *task)
local_irq_restore(flags);
}

+static u64 __perf_event_read(struct perf_event *event)
+{
+ return atomic64_read(&event->count);
+}
+
+static u64 perf_event_read(struct perf_event *event)
+{
+ if (event->pmu->read)
+ return event->pmu->read(event);
+ else
+ return __perf_event_read(event);
+}
+
/*
* Cross CPU call to read the hardware event
*/
-static void __perf_event_read(void *info)
+static void __perf_event_update(void *info)
{
struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
struct perf_event *event = info;
@@ -1558,10 +1571,10 @@ static void __perf_event_read(void *info)
update_event_times(event);
raw_spin_unlock(&ctx->lock);

- event->pmu->read(event);
+ event->pmu->update(event);
}

-static u64 perf_event_read(struct perf_event *event)
+static u64 perf_event_update(struct perf_event *event)
{
/*
* If event is enabled and currently active on a CPU, update the
@@ -1569,7 +1582,7 @@ static u64 perf_event_read(struct perf_event *event)
*/
if (event->state == PERF_EVENT_STATE_ACTIVE) {
smp_call_function_single(event->oncpu,
- __perf_event_read, event, 1);
+ __perf_event_update, event, 1);
} else if (event->state == PERF_EVENT_STATE_INACTIVE) {
struct perf_event_context *ctx = event->ctx;
unsigned long flags;
@@ -1580,7 +1593,7 @@ static u64 perf_event_read(struct perf_event *event)
raw_spin_unlock_irqrestore(&ctx->lock, flags);
}

- return atomic64_read(&event->count);
+ return perf_event_read(event);
}

/*
@@ -1793,14 +1806,14 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
*running = 0;

mutex_lock(&event->child_mutex);
- total += perf_event_read(event);
+ total += perf_event_update(event);
*enabled += event->total_time_enabled +
atomic64_read(&event->child_total_time_enabled);
*running += event->total_time_running +
atomic64_read(&event->child_total_time_running);

list_for_each_entry(child, &event->child_list, child_list) {
- total += perf_event_read(child);
+ total += perf_event_update(child);
*enabled += child->total_time_enabled;
*running += child->total_time_running;
}
@@ -1925,7 +1938,7 @@ static unsigned int perf_poll(struct file *file, poll_table *wait)

if (event->attr.threshold)
{
- u64 count = atomic64_read(&event->count);
+ u64 count = perf_event_read(event);
if (count < event->attr.min_threshold)
events |= POLLIN;
else if (count > event->attr.max_threshold)
@@ -1937,13 +1950,25 @@ static unsigned int perf_poll(struct file *file, poll_table *wait)
return events;
}

-static void perf_event_reset(struct perf_event *event)
+static void perf_event_reset_noop(struct perf_event *event)
+{
+}
+
+static void __perf_event_reset(struct perf_event *event)
{
- (void)perf_event_read(event);
+ (void)perf_event_update(event);
atomic64_set(&event->count, 0);
perf_event_update_userpage(event);
}

+static void perf_event_reset(struct perf_event *event)
+{
+ if (event->pmu->reset)
+ event->pmu->reset(event);
+ else
+ __perf_event_reset(event);
+}
+
/*
* Holding the top-level event's child_mutex means that any
* descendant process that has inherited this event will block
@@ -2123,7 +2148,7 @@ void perf_event_update_userpage(struct perf_event *event)
++userpg->lock;
barrier();
userpg->index = perf_event_index(event);
- userpg->offset = atomic64_read(&event->count);
+ userpg->offset = perf_event_read(event);
if (event->state == PERF_EVENT_STATE_ACTIVE)
userpg->offset -= atomic64_read(&event->hw.prev_count);

@@ -2540,7 +2565,10 @@ static const struct file_operations perf_fops = {

void perf_event_wakeup(struct perf_event *event)
{
- wake_up_all(&event->waitq);
+ if (event->pmu->wakeup)
+ event->pmu->wakeup(event);
+ else
+ wake_up_all(&event->waitq);

if (event->pending_kill) {
kill_fasync(&event->fasync, SIGIO, event->pending_kill);
@@ -2689,7 +2717,7 @@ static bool perf_output_space(struct perf_mmap_data *data, unsigned long tail,

static void __perf_output_wakeup(struct perf_event* event, int nmi)
{
- if (event->attr.threshold && atomic64_read(&event->count) > event->attr.max_threshold)
+ if (event->attr.threshold && perf_event_read(event) > event->attr.max_threshold)
return;

atomic_set(&event->poll, POLLIN);
@@ -2912,7 +2940,7 @@ void perf_output_end(struct perf_output_handle *handle)
struct perf_event *event = handle->event;
struct perf_mmap_data *data = handle->data;

- int wakeup_events = event->attr.thresold ? 1 : event->attr.wakeup_events;
+ int wakeup_events = event->attr.threshold ? 1 : event->attr.wakeup_events;

if (handle->sample && wakeup_events) {
int events = atomic_inc_return(&data->events);
@@ -2955,7 +2983,7 @@ static void perf_output_read_one(struct perf_output_handle *handle,
u64 values[4];
int n = 0;

- values[n++] = atomic64_read(&event->count);
+ values[n++] = perf_event_read(event);
if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) {
values[n++] = event->total_time_enabled +
atomic64_read(&event->child_total_time_enabled);
@@ -2990,7 +3018,7 @@ static void perf_output_read_group(struct perf_output_handle *handle,
values[n++] = leader->total_time_running;

if (leader != event)
- leader->pmu->read(leader);
+ leader->pmu->update(leader);

values[n++] = atomic64_read(&leader->count);
if (read_format & PERF_FORMAT_ID)
@@ -3002,7 +3030,7 @@ static void perf_output_read_group(struct perf_output_handle *handle,
n = 0;

if (sub != event)
- sub->pmu->read(sub);
+ sub->pmu->update(sub);

values[n++] = atomic64_read(&sub->count);
if (read_format & PERF_FORMAT_ID)
@@ -3737,6 +3765,29 @@ int perf_event_overflow(struct perf_event *event, int nmi,
return __perf_event_overflow(event, nmi, 1, data, regs);
}

+static void perf_event_wakeup_one(struct perf_event *event)
+{
+ wake_up(&event->waitq);
+}
+
+static u64 __perf_event_add(struct perf_event *event, u64 count)
+{
+ return atomic64_add_return(count, &event->count);
+}
+
+static u64 perf_event_add(struct perf_event *event, u64 count)
+{
+ if (event->pmu->add)
+ return event->pmu->add(event, count);
+ else
+ return __perf_event_add(event, count);
+}
+
+static u64 perf_event_add_parent(struct perf_event *event, u64 count)
+{
+ return event->parent ? __perf_event_add(event->parent, count) : __perf_event_add(event, count);
+}
+
/*
* Generic software event infrastructure
*/
@@ -3811,7 +3862,7 @@ static void perf_swevent_add(struct perf_event *event, u64 nr,
{
struct hw_perf_event *hwc = &event->hw;

- atomic64_add(nr, &event->count);
+ perf_event_add(event, nr);

if (!regs)
return;
@@ -4011,10 +4062,11 @@ static void perf_swevent_disable(struct perf_event *event)
{
}

+
static const struct pmu perf_ops_generic = {
.enable = perf_swevent_enable,
.disable = perf_swevent_disable,
- .read = perf_swevent_read,
+ .update = perf_swevent_read,
.unthrottle = perf_swevent_unthrottle,
};

@@ -4031,7 +4083,7 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
u64 period;

event = container_of(hrtimer, struct perf_event, hw.hrtimer);
- event->pmu->read(event);
+ event->pmu->update(event);

data.addr = 0;
data.raw = NULL;
@@ -4097,7 +4149,7 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
* Software event: cpu wall time clock
*/

-static void cpu_clock_perf_event_update(struct perf_event *event)
+static void __cpu_clock_perf_event_update(struct perf_event *event)
{
int cpu = raw_smp_processor_id();
s64 prev;
@@ -4105,7 +4157,7 @@ static void cpu_clock_perf_event_update(struct perf_event *event)

now = cpu_clock(cpu);
prev = atomic64_xchg(&event->hw.prev_count, now);
- atomic64_add(now - prev, &event->count);
+ perf_event_add(event, now - prev);
}

static int cpu_clock_perf_event_enable(struct perf_event *event)
@@ -4122,32 +4174,32 @@ static int cpu_clock_perf_event_enable(struct perf_event *event)
static void cpu_clock_perf_event_disable(struct perf_event *event)
{
perf_swevent_cancel_hrtimer(event);
- cpu_clock_perf_event_update(event);
+ __cpu_clock_perf_event_update(event);
}

-static void cpu_clock_perf_event_read(struct perf_event *event)
+static void cpu_clock_perf_event_update(struct perf_event *event)
{
- cpu_clock_perf_event_update(event);
+ __cpu_clock_perf_event_update(event);
}

static const struct pmu perf_ops_cpu_clock = {
.enable = cpu_clock_perf_event_enable,
.disable = cpu_clock_perf_event_disable,
- .read = cpu_clock_perf_event_read,
+ .update = cpu_clock_perf_event_update,
};

/*
* Software event: task time clock
*/

-static void task_clock_perf_event_update(struct perf_event *event, u64 now)
+static void __task_clock_perf_event_update(struct perf_event *event, u64 now)
{
u64 prev;
s64 delta;

prev = atomic64_xchg(&event->hw.prev_count, now);
delta = now - prev;
- atomic64_add(delta, &event->count);
+ perf_event_add(event, delta);
}

static int task_clock_perf_event_enable(struct perf_event *event)
@@ -4167,11 +4219,11 @@ static int task_clock_perf_event_enable(struct perf_event *event)
static void task_clock_perf_event_disable(struct perf_event *event)
{
perf_swevent_cancel_hrtimer(event);
- task_clock_perf_event_update(event, event->ctx->time);
+ __task_clock_perf_event_update(event, event->ctx->time);

}

-static void task_clock_perf_event_read(struct perf_event *event)
+static void task_clock_perf_event_update(struct perf_event *event)
{
u64 time;

@@ -4184,13 +4236,13 @@ static void task_clock_perf_event_read(struct perf_event *event)
time = event->ctx->time + delta;
}

- task_clock_perf_event_update(event, time);
+ __task_clock_perf_event_update(event, time);
}

static const struct pmu perf_ops_task_clock = {
.enable = task_clock_perf_event_enable,
.disable = task_clock_perf_event_disable,
- .read = task_clock_perf_event_read,
+ .update = task_clock_perf_event_update,
};

#ifdef CONFIG_EVENT_PROFILE
--
1.6.6

2010-02-07 11:31:56

by Stijn Devriendt

Subject: [PATCH 6/6] Add PERF_COUNT_SW_RUNNABLE_TASKS

From: Stijn Devriendt <[email protected]>

---
include/linux/perf_event.h | 17 ++++-
include/linux/sched.h | 1 +
kernel/perf_event.c | 180 ++++++++++++++++++++++++++++++++++++++------
kernel/sched.c | 7 ++
4 files changed, 178 insertions(+), 27 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 084f322..10e56f2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -105,6 +105,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_PAGE_FAULTS_MAJ = 6,
PERF_COUNT_SW_ALIGNMENT_FAULTS = 7,
PERF_COUNT_SW_EMULATION_FAULTS = 8,
+ PERF_COUNT_SW_RUNNABLE_TASKS = 9,

PERF_COUNT_SW_MAX, /* non-ABI */
};
@@ -456,6 +457,7 @@ enum perf_callchain_context {
#include <linux/pid_namespace.h>
#include <linux/workqueue.h>
#include <asm/atomic.h>
+#include <linux/poll.h>

#define PERF_MAX_STACK_DEPTH 255

@@ -519,6 +521,8 @@ struct pmu {
int (*reset) (struct perf_event *event);
void (*wakeup) (struct perf_event *event);
u64 (*read) (struct perf_event *event);
+ void (*init) (struct perf_event *event);
+ unsigned int (*poll) (struct perf_event *event, struct file* file, poll_table *wait);
};

/**
@@ -826,13 +830,20 @@ static inline int is_software_event(struct perf_event *event)

extern atomic_t perf_swevent_enabled[PERF_COUNT_SW_MAX];

-extern void __perf_sw_event(u32, s64, int, struct pt_regs *, u64);
+extern void __perf_sw_event(u32, s64, int, struct pt_regs *, u64,
+ struct task_struct* task, int cpu);
+static inline void
+perf_sw_event_target(u32 event_id, s64 nr, int nmi, struct pt_regs *regs,
+ u64 addr, struct task_struct* task, int cpu)
+{
+ if (atomic_read(&perf_swevent_enabled[event_id]))
+ __perf_sw_event(event_id, nr, nmi, regs, addr, task, cpu);
+}

static inline void
perf_sw_event(u32 event_id, s64 nr, int nmi, struct pt_regs *regs, u64 addr)
{
- if (atomic_read(&perf_swevent_enabled[event_id]))
- __perf_sw_event(event_id, nr, nmi, regs, addr);
+ perf_sw_event_target(event_id, nr, nmi, regs, addr, current, smp_processor_id());
}

extern void __perf_event_mmap(struct vm_area_struct *vma);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f2f842d..dce2213 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -138,6 +138,7 @@ extern int nr_threads;
DECLARE_PER_CPU(unsigned long, process_counts);
extern int nr_processes(void);
extern unsigned long nr_running(void);
+extern unsigned long nr_running_cpu(int cpu);
extern unsigned long nr_uninterruptible(void);
extern unsigned long nr_iowait(void);
extern unsigned long nr_iowait_cpu(void);
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 08885d0..5f4f23d 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -743,6 +743,18 @@ static void add_event_to_ctx(struct perf_event *event,
event->tstamp_stopped = ctx->time;
}

+static void __perf_event_init_event(struct perf_event* event)
+{
+}
+
+static void perf_event_init_event(struct perf_event* event)
+{
+ if (event->pmu->init)
+ event->pmu->init(event);
+ else
+ __perf_event_init_event(event);
+}
+
/*
* Cross CPU call to install and enable a performance event
*
@@ -782,6 +794,8 @@ static void __perf_install_in_context(void *info)

add_event_to_ctx(event, ctx);

+ perf_event_init_event(event);
+
if (event->cpu != -1 && event->cpu != smp_processor_id())
goto unlock;

@@ -1593,7 +1607,7 @@ static u64 perf_event_update(struct perf_event *event)
raw_spin_unlock_irqrestore(&ctx->lock, flags);
}

- return perf_event_read(event);
+ return __perf_event_read(event);
}

/*
@@ -1931,18 +1945,26 @@ perf_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
return perf_read_hw(event, buf, count);
}

-static unsigned int perf_poll(struct file *file, poll_table *wait)
+static unsigned int __perf_poll(struct perf_event *event, struct file *file, poll_table *wait)
{
- struct perf_event *event = file->private_data;
unsigned int events = atomic_xchg(&event->poll, 0);

+ /*if (events)
+ printk("Events: POLLIN=%u\n", events&POLLIN);*/
+
if (event->attr.threshold)
{
u64 count = perf_event_read(event);
- if (count < event->attr.min_threshold)
+ if (count <= event->attr.max_threshold)
+ {
events |= POLLIN;
- else if (count > event->attr.max_threshold)
+ //printk(KERN_CONT "+");
+ }
+ else //if (count > event->attr.max_threshold)
+ {
events &= ~POLLIN;
+ //printk(KERN_CONT "-");
+ }
}

poll_wait(file, &event->waitq, wait);
@@ -1950,8 +1972,23 @@ static unsigned int perf_poll(struct file *file, poll_table *wait)
return events;
}

-static void perf_event_reset_noop(struct perf_event *event)
+static unsigned int perf_rt_poll(struct perf_event *event, struct file *file, poll_table *wait)
+{
+ return __perf_poll((event->parent ? event->parent : event), file, wait);
+}
+
+static unsigned int perf_poll(struct file* file, poll_table *wait)
+{
+ struct perf_event *event = file->private_data;
+ if (event->pmu->poll)
+ return event->pmu->poll(event, file, wait);
+ else
+ return __perf_poll(event, file, wait);
+}
+
+static int perf_event_reset_noop(struct perf_event *event)
{
+ return 0;
}

static void __perf_event_reset(struct perf_event *event)
@@ -2568,7 +2605,10 @@ void perf_event_wakeup(struct perf_event *event)
if (event->pmu->wakeup)
event->pmu->wakeup(event);
else
+ {
+ atomic_set(&event->poll, POLLIN);
wake_up_all(&event->waitq);
+ }

if (event->pending_kill) {
kill_fasync(&event->fasync, SIGIO, event->pending_kill);
@@ -2719,8 +2759,6 @@ static void __perf_output_wakeup(struct perf_event* event, int nmi)
{
if (event->attr.threshold && perf_event_read(event) > event->attr.max_threshold)
return;
-
- atomic_set(&event->poll, POLLIN);

if (nmi) {
event->pending_wakeup = 1;
@@ -3767,7 +3805,18 @@ int perf_event_overflow(struct perf_event *event, int nmi,

static void perf_event_wakeup_one(struct perf_event *event)
{
- wake_up(&event->waitq);
+ struct perf_event *wakeup_event = event->parent ? event->parent : event;
+ s64 wakeup_count = event->attr.max_threshold - __perf_event_read(wakeup_event);
+
+ if (wakeup_count < 1)
+ wakeup_count = 1;
+
+ atomic_set(&wakeup_event->poll, POLLIN);
+
+ if (event->attr.threshold && wakeup_count == 1)
+ wake_up(&wakeup_event->waitq);
+ else
+ wake_up_all(&wakeup_event->waitq);
}

static u64 __perf_event_add(struct perf_event *event, s64 count)
@@ -3783,7 +3832,7 @@ static u64 perf_event_add(struct perf_event *event, s64 count)
return __perf_event_add(event, count);
}

-static u64 perf_event_add_parent(struct perf_event *event, u64 count)
+static u64 perf_event_add_parent(struct perf_event *event, s64 count)
{
return event->parent ? __perf_event_add(event->parent, count) : __perf_event_add(event, count);
}
@@ -3864,6 +3913,22 @@ static void perf_swevent_add(struct perf_event *event, s64 nr,

perf_event_add(event, nr);

+ BUG_ON(perf_event_read(event) == (u64)-1);
+
+ if (event->attr.config == PERF_COUNT_SW_RUNNABLE_TASKS) {
+ if (event->ctx->task)
+ {
+ }
+ else
+ {
+ if (atomic64_read(&event->count) != nr_running_cpu(event->cpu))
+ {
+ printk("count = %lu <-> nr_running_cpu = %lu", atomic64_read(&event->count), nr_running_cpu(event->cpu));
+ BUG();
+ }
+ }
+ }
+
if (!regs)
return;

@@ -3932,7 +3997,7 @@ static int perf_swevent_match(struct perf_event *event,
struct perf_sample_data *data,
struct pt_regs *regs)
{
- if (event->cpu != -1 && event->cpu != smp_processor_id())
+ if (event->cpu != -1 && event->cpu != smp_processor_id() && event_id != PERF_COUNT_SW_RUNNABLE_TASKS)
return 0;

if (!perf_swevent_is_counting(event))
@@ -4006,27 +4071,27 @@ EXPORT_SYMBOL_GPL(perf_swevent_put_recursion_context);
static void do_perf_sw_event(enum perf_type_id type, u32 event_id,
s64 nr, int nmi,
struct perf_sample_data *data,
- struct pt_regs *regs)
+ struct pt_regs *regs,
+ struct task_struct* task,
+ int cpu)
{
struct perf_cpu_context *cpuctx;
struct perf_event_context *ctx;

- cpuctx = &__get_cpu_var(perf_cpu_context);
+ cpuctx = &per_cpu(perf_cpu_context, cpu);
rcu_read_lock();
perf_swevent_ctx_event(&cpuctx->ctx, type, event_id,
nr, nmi, data, regs);
- /*
- * doesn't really matter which of the child contexts the
- * events ends up in.
- */
- ctx = rcu_dereference(current->perf_event_ctxp);
+
+ ctx = rcu_dereference(task->perf_event_ctxp);
if (ctx)
perf_swevent_ctx_event(ctx, type, event_id, nr, nmi, data, regs);
rcu_read_unlock();
}

void __perf_sw_event(u32 event_id, s64 nr, int nmi,
- struct pt_regs *regs, u64 addr)
+ struct pt_regs *regs, u64 addr,
+ struct task_struct* task, int cpu)
{
struct perf_sample_data data;
int rctx;
@@ -4038,12 +4103,12 @@ void __perf_sw_event(u32 event_id, s64 nr, int nmi,
data.addr = addr;
data.raw = NULL;

- do_perf_sw_event(PERF_TYPE_SOFTWARE, event_id, nr, nmi, &data, regs);
+ do_perf_sw_event(PERF_TYPE_SOFTWARE, event_id, nr, nmi, &data, regs, task, cpu);

perf_swevent_put_recursion_context(rctx);
}

-static void perf_swevent_read(struct perf_event *event)
+static void perf_swevent_update(struct perf_event *event)
{
}

@@ -4066,10 +4131,61 @@ static void perf_swevent_disable(struct perf_event *event)
static const struct pmu perf_ops_generic = {
.enable = perf_swevent_enable,
.disable = perf_swevent_disable,
- .update = perf_swevent_read,
+ .update = perf_swevent_update,
.unthrottle = perf_swevent_unthrottle,
};

+static int perf_rt_enable(struct perf_event* event)
+{
+ return 0;
+}
+
+static void perf_rt_init_event(struct perf_event* event)
+{
+ if (event->ctx->task)
+ {
+ perf_event_add(event, event->ctx->task->state == 0);
+ }
+ else
+ atomic64_set(&event->count, nr_running_cpu(smp_processor_id()));
+}
+
+static void perf_rt_disable(struct perf_event* event)
+{
+ /* Nothing to do */
+}
+
+static void perf_rt_unthrottle(struct perf_event* event)
+{
+ /* Nothing to do */
+}
+
+static void perf_rt_update(struct perf_event* event)
+{
+ /* Nothing to do */
+}
+
+static u64 perf_event_read_parent(struct perf_event* event)
+{
+ if (event->parent)
+ return __perf_event_read(event->parent);
+ else
+ return __perf_event_read(event);
+}
+
+static const struct pmu perf_ops_runnable_tasks = {
+ .enable = perf_rt_enable,
+ .disable = perf_rt_disable,
+ .update = perf_rt_update,
+ .unthrottle = perf_rt_unthrottle,
+ .read = perf_event_read_parent,
+ .add = perf_event_add_parent,
+ .reset = perf_event_reset_noop,
+ .wakeup = perf_event_wakeup_one,
+ .init = perf_rt_init_event,
+ .poll = perf_rt_poll,
+};
+
/*
* hrtimer based swevent callback
*/
@@ -4267,7 +4383,7 @@ void perf_tp_event(int event_id, u64 addr, u64 count, void *record,

/* Trace events already protected against recursion */
do_perf_sw_event(PERF_TYPE_TRACEPOINT, event_id, count, 1,
- &data, regs);
+ &data, regs, current, smp_processor_id());
}
EXPORT_SYMBOL_GPL(perf_tp_event);

@@ -4404,6 +4520,13 @@ static void sw_perf_event_destroy(struct perf_event *event)
atomic_dec(&perf_swevent_enabled[event_id]);
}

+static void sw_rt_perf_event_destroy(struct perf_event *event)
+{
+ BUG_ON(event->parent && __perf_event_read(event) != (u64)0);
+ sw_perf_event_destroy(event);
+}
+
+
static const struct pmu *sw_perf_event_init(struct perf_event *event)
{
const struct pmu *pmu = NULL;
@@ -4445,6 +4568,13 @@ static const struct pmu *sw_perf_event_init(struct perf_event *event)
}
pmu = &perf_ops_generic;
break;
+ case PERF_COUNT_SW_RUNNABLE_TASKS:
+ if (!event->parent) {
+ atomic_inc(&perf_swevent_enabled[event_id]);
+ event->destroy = sw_rt_perf_event_destroy;
+ }
+ pmu = &perf_ops_runnable_tasks;
+ break;
}

return pmu;
@@ -4743,7 +4873,7 @@ SYSCALL_DEFINE5(perf_event_open,
return -EACCES;
}

- if (attr.threshold && (attr.freq || attr.watermark))
+ if (attr.threshold && (attr.freq || attr.watermark || attr.min_threshold > attr.max_threshold))
return -EINVAL;

if (attr.freq) {
@@ -4944,6 +5074,8 @@ inherit_event(struct perf_event *parent_event,
*/
add_event_to_ctx(child_event, child_ctx);

+ perf_event_init_event(child_event);
+
/*
* Get a reference to the parent filp - we will fput it
* when the child event exits. This is safe to do because
diff --git a/kernel/sched.c b/kernel/sched.c
index 87f1f47..53c679c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1967,6 +1967,7 @@ static void activate_task(struct rq *rq, struct task_struct *p, int wakeup)

enqueue_task(rq, p, wakeup);
inc_nr_running(rq);
+ perf_sw_event_target(PERF_COUNT_SW_RUNNABLE_TASKS, 1, 1, task_pt_regs(p), 0, p, cpu_of(rq));
}

/*
@@ -1979,6 +1980,7 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep)

dequeue_task(rq, p, sleep);
dec_nr_running(rq);
+ perf_sw_event_target(PERF_COUNT_SW_RUNNABLE_TASKS, -1, 1, task_pt_regs(p), 0, p, cpu_of(rq));
}

/**
@@ -2932,6 +2934,11 @@ unsigned long nr_running(void)
return sum;
}

+unsigned long nr_running_cpu(int cpu)
+{
+ return cpu_rq(cpu)->nr_running;
+}
+
unsigned long nr_uninterruptible(void)
{
unsigned long i, sum = 0;
--
1.6.6
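
To illustrate the "act upon parallelism" use case from the RFC mail, a
rough dispatch loop might look as follows; dequeue_work() and
spawn_worker() are hypothetical helpers, and rt_fd is assumed to be a
PERF_COUNT_SW_RUNNABLE_TASKS fd opened with attr.threshold = 1 and
attr.max_threshold set to the desired parallelism.

#include <poll.h>
#include <stddef.h>

struct work_item;			/* hypothetical work descriptor */
struct work_item *dequeue_work(void);	/* hypothetical queue helper */
void spawn_worker(struct work_item *item);

static void dispatch_loop(int rt_fd)
{
	struct work_item *item;

	while ((item = dequeue_work()) != NULL) {
		struct pollfd pfd = { .fd = rt_fd, .events = POLLIN };

		/* perf_rt_poll() reports POLLIN while the runnable-task
		 * count is at or below max_threshold */
		poll(&pfd, 1, -1);
		spawn_worker(item);
	}
}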

2010-02-07 11:32:20

by Stijn Devriendt

Subject: [PATCH 2/6] Allow min/max thresholds for wakeups

From: Stijn Devriendt <[email protected]>

---
include/linux/perf_event.h | 5 +++--
kernel/perf_event.c | 28 ++++++++++++++++++++++++----
2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 827a221..0fa235e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -183,6 +183,7 @@ struct perf_event_attr {
union {
__u64 sample_period;
__u64 sample_freq;
+ __u64 max_threshold; /* maximum threshold */
};

__u64 sample_type;
@@ -203,16 +204,16 @@ struct perf_event_attr {
enable_on_exec : 1, /* next exec enables */
task : 1, /* trace fork/exit */
watermark : 1, /* wakeup_watermark */
threshold : 1, /* thresholds */

__reserved_1 : 49;

union {
__u32 wakeup_events; /* wakeup every n events */
__u32 wakeup_watermark; /* bytes before wakeup */
+ __u64 min_threshold; /* minimum threshold */
};

- __u32 __reserved_2;
-
__u64 bp_addr;
__u32 bp_type;
__u32 bp_len;
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 42dc18d..70ca6e1 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1401,6 +1401,8 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx)
if (!event->attr.freq || !event->attr.sample_freq)
continue;

+ if (event->attr.threshold)
+ continue;
/*
* if the specified freq < HZ then we need to skip ticks
*/
@@ -1921,6 +1923,15 @@ static unsigned int perf_poll(struct file *file, poll_table *wait)
struct perf_event *event = file->private_data;
unsigned int events = atomic_xchg(&event->poll, 0);

+ if (event->attr.threshold)
+ {
+ u64 count = atomic64_read(&event->count);
+ if (count < event->attr.min_threshold)
+ events |= POLLIN;
+ else if (count > event->attr.max_threshold)
+ events &= ~POLLIN;
+ }
+
poll_wait(file, &event->waitq, wait);

return events;
@@ -1979,6 +1990,9 @@ static int perf_event_period(struct perf_event *event, u64 __user *arg)
if (!event->attr.sample_period)
return -EINVAL;

+ if (event->attr.threshold)
+ return -EINVAL;
+
size = copy_from_user(&value, arg, sizeof(value));
if (size != sizeof(value))
return -EFAULT;
@@ -2675,6 +2689,9 @@ static bool perf_output_space(struct perf_mmap_data *data, unsigned long tail,

static void __perf_output_wakeup(struct perf_event* event, int nmi)
{
+ if (event->attr.threshold && atomic64_read(&event->count) > event->attr.max_threshold)
+ return;
+
atomic_set(&event->poll, POLLIN);

if (nmi) {
@@ -2895,7 +2912,7 @@ void perf_output_end(struct perf_output_handle *handle)
struct perf_event *event = handle->event;
struct perf_mmap_data *data = handle->data;

- int wakeup_events = event->attr.wakeup_events;
+ int wakeup_events = event->attr.thresold ? 1 : event->attr.wakeup_events;

if (handle->sample && wakeup_events) {
int events = atomic_inc_return(&data->events);
@@ -4445,7 +4462,7 @@ perf_event_alloc(struct perf_event_attr *attr,

hwc = &event->hw;
hwc->sample_period = attr->sample_period;
- if (attr->freq && attr->sample_freq)
+ if (attr->threshold || (attr->freq && attr->sample_freq))
hwc->sample_period = 1;
hwc->last_period = hwc->sample_period;

@@ -4571,7 +4588,7 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
if (attr->type >= PERF_TYPE_MAX)
return -EINVAL;

- if (attr->__reserved_1 || attr->__reserved_2)
+ if (attr->__reserved_1)
return -EINVAL;

if (attr->sample_type & ~(PERF_SAMPLE_MAX-1))
@@ -4674,6 +4691,9 @@ SYSCALL_DEFINE5(perf_event_open,
return -EACCES;
}

+ if (attr.threshold && (attr.freq || attr.watermark))
+ return -EINVAL;
+
if (attr.freq) {
if (attr.sample_freq > sysctl_perf_event_sample_rate)
return -EINVAL;
@@ -4862,7 +4882,7 @@ inherit_event(struct perf_event *parent_event,
else
child_event->state = PERF_EVENT_STATE_OFF;

- if (parent_event->attr.freq)
+ if (parent_event->attr.freq || parent_event->attr.threshold)
child_event->hw.sample_period = parent_event->hw.sample_period;

child_event->overflow_handler = parent_event->overflow_handler;
--
1.6.6

2010-02-08 10:11:43

by Peter Zijlstra

Subject: Re: [RFC][PATCH] PERF_COUNT_SW_RUNNABLE_TASKS: measure and act upon parallelism

On Sun, 2010-02-07 at 12:30 +0100, [email protected] wrote:

> Here's an initial RFC patch for the parallelism
> events for perf_events.

OK, so you managed to rub me totally the wrong way with posting this
yesterday:
- you sent me each patch twice
- you used the horrible git send-email default of --chain-reply-to
(some day I'll write a script that will detect and auto-bounce
series sent to me like that)
- you failed to provide a changelog for any of the patches
- some subjects were long enough to be a changelog

Please don't do that again ;-)

Anyway, it did get me thinking, how about something like the below?

(compile-tested only; we probably want a different name than CLONE_SEM,
but I failed to come up with anything better. CLONE_FRED?)

---
include/linux/sched.h | 11 ++++++++++
kernel/exit.c | 5 ++++
kernel/fork.c | 24 ++++++++++++++++++++++
kernel/sched.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 92 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b1b8d84..580c623 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -9,6 +9,7 @@
#define CLONE_FS 0x00000200 /* set if fs info shared between processes */
#define CLONE_FILES 0x00000400 /* set if open files shared between processes */
#define CLONE_SIGHAND 0x00000800 /* set if signal handlers and blocked signals shared */
+#define CLONE_SEM 0x00001000 /* set if concurrency semaphore shared between processes */
#define CLONE_PTRACE 0x00002000 /* set if we want to let tracing continue on the child too */
#define CLONE_VFORK 0x00004000 /* set if the parent wants the child to wake it up on mm_release */
#define CLONE_PARENT 0x00008000 /* set if we want to have the same parent as the cloner */
@@ -1214,6 +1215,13 @@ struct sched_rt_entity {

struct rcu_node;

+struct task_sem {
+ raw_spinlock_t lock;
+ unsigned int count;
+ struct list_head wait_list;
+ atomic_t ref;
+};
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1235,6 +1243,9 @@ struct task_struct {
struct sched_entity se;
struct sched_rt_entity rt;

+ struct task_sem *sem;
+ struct list_head sem_waiter;
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* list of struct preempt_notifier: */
struct hlist_head preempt_notifiers;
diff --git a/kernel/exit.c b/kernel/exit.c
index 546774a..f8b9ab3 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -991,6 +991,11 @@ NORET_TYPE void do_exit(long code)
*/
perf_event_exit_task(tsk);

+ if (unlikely(tsk->sem) && atomic_dec_and_test(&tsk->sem->ref)) {
+ kfree(tsk->sem);
+ tsk->sem = NULL;
+ }
+
exit_notify(tsk, group_dead);
#ifdef CONFIG_NUMA
mpol_put(tsk->mempolicy);
diff --git a/kernel/fork.c b/kernel/fork.c
index f88bd98..cea102c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -989,6 +989,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);

+ if ((clone_flags & (CLONE_VFORK|CLONE_SEM)) == (CLONE_VFORK|CLONE_SEM))
+ return ERR_PTR(-EINVAL);
+
/*
* Thread groups must share signals as well, and detached threads
* can only be started up within the thread group.
@@ -1023,6 +1026,27 @@ static struct task_struct *copy_process(unsigned long clone_flags,
if (!p)
goto fork_out;

+ if (clone_flags & CLONE_SEM) {
+ INIT_LIST_HEAD(&p->sem_waiter);
+ if (!current->sem) {
+ struct task_sem *sem =
+ kmalloc(sizeof(struct task_sem), GFP_KERNEL);
+
+ if (!sem)
+ goto bad_fork_free;
+
+ raw_spin_lock_init(&sem->lock);
+ sem->count = 0; /* current is running */
+ INIT_LIST_HEAD(&sem->wait_list);
+ atomic_set(&sem->ref, 2);
+
+ current->sem = sem;
+ p->sem = sem;
+ } else
+ atomic_inc(&current->sem->ref);
+ } else if (current->sem)
+ p->sem = NULL;
+
ftrace_graph_init_task(p);

rt_mutex_init_task(p);
diff --git a/kernel/sched.c b/kernel/sched.c
index de9f9d4..9cd6144 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2247,6 +2247,48 @@ void task_oncpu_function_call(struct task_struct *p,
preempt_enable();
}

+static void task_up(struct rq *rq, struct task_struct *p)
+{
+ struct task_struct *waiter = NULL;
+ struct task_sem *sem = p->sem;
+
+ raw_spin_lock(&sem->lock);
+ sem->count++;
+ if (sem->count > 0 && !list_empty(&sem->wait_list)) {
+ waiter = list_first_entry(&sem->wait_list,
+ struct task_struct, sem_waiter);
+
+ list_del_init(&waiter->sem_waiter);
+ }
+ raw_spin_unlock(&sem->lock);
+
+ if (waiter) {
+ raw_spin_unlock(&rq->lock);
+ wake_up_process(waiter);
+ raw_spin_lock(&rq->lock);
+ }
+}
+
+static int task_down(struct task_struct *p)
+{
+ struct task_sem *sem = p->sem;
+ int ret = 0;
+
+ raw_spin_lock(&sem->lock);
+ if (sem->count > 0) {
+ sem->count--;
+ } else {
+ WARN_ON_ONCE(!list_empty(&p->sem_waiter));
+
+ list_add_tail(&p->sem_waiter, &sem->wait_list);
+ __set_task_state(p, TASK_UNINTERRUPTIBLE);
+ ret = 1;
+ }
+ raw_spin_unlock(&sem->lock);
+
+ return ret;
+}
+
#ifdef CONFIG_SMP
static int select_fallback_rq(int cpu, struct task_struct *p)
{
@@ -2357,7 +2399,12 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
#ifdef CONFIG_SMP
if (unlikely(task_running(rq, p)))
goto out_activate;
+#endif

+ if (unlikely(p->sem) && task_down(p))
+ goto out;
+
+#ifdef CONFIG_SMP
/*
* In order to handle concurrent wakeups and release the rq->lock
* we put the task in TASK_WAKING state.
@@ -3671,8 +3718,12 @@ need_resched_nonpreemptible:
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
if (unlikely(signal_pending_state(prev->state, prev)))
prev->state = TASK_RUNNING;
- else
+ else {
deactivate_task(rq, prev, 1);
+
+ if (unlikely(prev->sem))
+ task_up(rq, prev);
+ }
switch_count = &prev->nvcsw;
}
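
For comparison with the perf-based approach, here is a rough sketch of
how the proposed flag might be used from userspace; the CLONE_SEM value
is taken from the patch above, everything else (flag combination, stack
handling) is purely illustrative and not a released ABI.

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>

#ifndef CLONE_SEM
#define CLONE_SEM	0x00001000	/* experimental flag from the patch above */
#endif

#define WORKER_STACK_SIZE	(64 * 1024)

static int worker(void *arg)
{
	/* do work; when this task blocks, task_up() may wake a sibling */
	return 0;
}

static int spawn_capped_worker(void *arg)
{
	int flags = CLONE_VM | CLONE_FS | CLONE_FILES |
		    CLONE_SIGHAND | CLONE_THREAD | CLONE_SEM;
	char *stack = malloc(WORKER_STACK_SIZE);

	if (!stack)
		return -1;

	/* wakeups of tasks sharing the sem are gated by task_down(),
	 * capping how many of them are runnable at once */
	return clone(worker, stack + WORKER_STACK_SIZE, flags, arg);
}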


2010-02-08 11:52:17

by Stijn Devriendt

Subject: Re: [RFC][PATCH] PERF_COUNT_SW_RUNNABLE_TASKS: measure and act upon parallelism

On Mon, Feb 8, 2010 at 11:00 AM, Peter Zijlstra <[email protected]> wrote:
> On Sun, 2010-02-07 at 12:30 +0100, [email protected] wrote:
>
>> Here's an initial RFC patch for the parallelism
>> events for perf_events.
>
> OK, so you managed to rub me totally the wrong way with posting this
> yesterday:
> - you sent me each patch twice
> - you used the horrible git send-email default of --chain-reply-to
>   (some day I'll write a script that will detect and auto-bounce
>    series sent to me like that)
> - you failed to provide a changelog for any of the patches
> - some subjects were long enough to be a changelog
>
> Please don't do that again ;-)

Sorry for that. I'm still getting used to the thing called 'git'. Apart
from obviously sharing its name, we don't have
much in common so far... My VCS experience currently
doesn't go far beyond SVN and the horrid CVS.
I just used the reply list from the initial series on this subject;
perhaps you were listed twice in that one as well.

Wouldn't making --chain-reply-to no longer the default be an option?

>
> Anyway, it did get me thinking, how about something like the below?
> <SNIP>

That pretty much looks like what I need at first sight. I'll definitely
give it a very close look and test it out.
The perf_event approach would be nice, but the implementations
just seem to collide.

Thanks for taking the time anyway,
Stijn

2010-02-08 12:00:21

by Peter Zijlstra

Subject: Re: [RFC][PATCH] PERF_COUNT_SW_RUNNABLE_TASKS: measure and act upon parallelism

On Mon, 2010-02-08 at 12:52 +0100, Stijn Devriendt wrote:
> Isn't no longer making --chain-reply-to the default an option?

That will hopefully happen on the next major git release (1.7 iirc).