2021-08-20 10:56:29

by Riccardo Mancini

Subject: [RFC PATCH v3 00/15] perf: add workqueue library and use it in synthetic-events

Changes in v3:
- improved separation of threadpool and threadpool_entry method
- replaced shared workqueue with per-thread workqueue. This should
improve the performance on big machines (Jiri noticed in his
experiments a significant performance degradation after 15 threads
with the shared queue).
- improved error reporting in both threadpool and workqueue
- added lazy spinup of threads in workqueue [9/15]
- added global workqueue [10/15]
- setup global workqueue in perf record, top and synthesize bench
[12-14/15] and used it in synthetic events

v2: https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/4f0cd6c8e77c0b4f4d4b8d553a7032757b976e61.1627657061.git.rickyman7@gmail.com/
(something went wrong when sending it and the cover letter has a wrong
Message-Id)

Changes in v2:
- rename threadpool_struct and its functions to adhere to naming style
- use ERR_PTR instead of returning NULL
- add *__strerror functions, removing pr_err from library code
- wait for threads after all of them have been created, instead of
waiting after each creation
- use intention-revealing macros in test code instead of 0 and -1
- use readn/writen functions

v1: https://lkml.kernel.org/lkml/[email protected]/

This patchset introduces a new utility library inside perf/util that
provides a work queue abstraction, following the Kernel workqueue API.

The workqueue abstraction is made up of two components:
- threadpool: takes care of managing a pool of threads. It is
inspired by the prototype for threaded trace in perf-record from Alexey:
https://lore.kernel.org/lkml/[email protected]/
- workqueue: manages the workers in the threadpool and assigns the work
items to the thread-local queues (see the sketch below).
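
For illustration, a sketch of the intended usage, under the assumption
that queue_work and flush_workqueue (patch 8) keep the kernel-style
signatures; my_work_fn is hypothetical:

        static void my_work_fn(struct work_struct *work)
        {
                /* one unit of work */
        }

        struct work_struct work = { .func = my_work_fn };

        queue_work(wq, &work);  /* assign to a worker's thread-local queue */
        flush_workqueue(wq);    /* wait until all queued work is done */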

On top of the workqueue, a simple parallel-for utility is implemented,
which is then showcased in synthetic-events.c, replacing the previous
manually-created pthreads.

Through some experiments with perf bench, I can see that the new
workqueue has a slightly higher overhead compared to manual creation of
threads, but is able to partition work among threads more effectively,
yielding better results overall (see the last patch for benchmark
details on my machine).
Since I'm not able to test it on bigger machines, it would be helpful if
someone could also test it and report back their results (thanks to Jiri
and Arnaldo, who have already helped me by doing some tests).

Furthermore, the overhead could be reduced by changing the
`work_size` (currently 1), i.e. the number of dirents that a thread
processes before grabbing a lock to get a new work item.
I experimented with different sizes: while bigger sizes reduce overhead
as expected, they do not scale as well to more threads.

Soon I will also send another patchset applying the workqueue to evlist
operations (open, enable, disable, close).

Thanks,
Riccardo

Riccardo Mancini (15):
perf workqueue: threadpool creation and destruction
perf tests: add test for workqueue
perf workqueue: add threadpool start and stop functions
perf workqueue: add threadpool execute and wait functions
tools: add sparse context/locking annotations in compiler_types.h
perf workqueue: introduce workqueue struct
perf workqueue: implement worker thread and management
perf workqueue: add queue_work and flush_workqueue functions
perf workqueue: spinup threads when needed
perf workqueue: create global workqueue
perf workqueue: add utility to execute a for loop in parallel
perf record: setup global workqueue
perf top: setup global workqueue
perf test/synthesis: setup global workqueue
perf synthetic-events: use workqueue parallel_for

tools/include/linux/compiler_types.h | 18 +
tools/perf/bench/synthesize.c | 28 +-
tools/perf/builtin-kvm.c | 2 +-
tools/perf/builtin-record.c | 18 +-
tools/perf/builtin-top.c | 19 +-
tools/perf/builtin-trace.c | 3 +-
tools/perf/tests/Build | 1 +
tools/perf/tests/builtin-test.c | 9 +
tools/perf/tests/mmap-thread-lookup.c | 2 +-
tools/perf/tests/tests.h | 3 +
tools/perf/tests/workqueue.c | 453 +++++++++++++
tools/perf/util/Build | 1 +
tools/perf/util/synthetic-events.c | 135 ++--
tools/perf/util/synthetic-events.h | 8 +-
tools/perf/util/workqueue/Build | 2 +
tools/perf/util/workqueue/threadpool.c | 619 +++++++++++++++++
tools/perf/util/workqueue/threadpool.h | 41 ++
tools/perf/util/workqueue/workqueue.c | 901 +++++++++++++++++++++++++
tools/perf/util/workqueue/workqueue.h | 104 +++
19 files changed, 2253 insertions(+), 114 deletions(-)
create mode 100644 tools/perf/tests/workqueue.c
create mode 100644 tools/perf/util/workqueue/Build
create mode 100644 tools/perf/util/workqueue/threadpool.c
create mode 100644 tools/perf/util/workqueue/threadpool.h
create mode 100644 tools/perf/util/workqueue/workqueue.c
create mode 100644 tools/perf/util/workqueue/workqueue.h

--
2.31.1


2021-08-20 10:56:29

by Riccardo Mancini

Subject: [RFC PATCH v3 01/15] perf workqueue: threadpool creation and destruction

The workqueue library is made up of two components:
- threadpool: handles the lifetime of the threads
- workqueue: handles work distribution among the threads

This first patch introduces the threadpool, starting from its creation
and destruction functions.
Thread management is based on the prototype from Alexey:
https://lore.kernel.org/lkml/[email protected]/

Each thread in the threadpool executes the same function (aka task)
with a different argument, tidx.
Threads use a pair of pipes to communicate with the main process.
The threadpool is static (all threads will be spawned at the same time).
Future work could include making it resizable and adding affinity support
(as in Alexey's prototype).
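
For illustration, a minimal usage sketch of the API added here (error
handling shortened; starting threads and executing tasks come in later
patches):

        char buf[THREADPOOL_STRERR_BUFSIZE];
        struct threadpool *pool = threadpool__new(4);

        if (IS_ERR(pool)) {
                threadpool__new_strerror(pool, buf, sizeof(buf));
                pr_err("%s", buf);
                return PTR_ERR(pool);
        }
        pr_debug("pool has %d threads\n", threadpool__size(pool));
        threadpool__delete(pool);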

Suggested-by: Alexey Bayduraev <[email protected]>
Signed-off-by: Riccardo Mancini <[email protected]>
---
tools/perf/util/Build | 1 +
tools/perf/util/workqueue/Build | 1 +
tools/perf/util/workqueue/threadpool.c | 205 +++++++++++++++++++++++++
tools/perf/util/workqueue/threadpool.h | 24 +++
4 files changed, 231 insertions(+)
create mode 100644 tools/perf/util/workqueue/Build
create mode 100644 tools/perf/util/workqueue/threadpool.c
create mode 100644 tools/perf/util/workqueue/threadpool.h

diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index 2d4fa13041789cd6..c7b09701661c869d 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -180,6 +180,7 @@ perf-$(CONFIG_LIBBABELTRACE) += data-convert-bt.o
perf-y += data-convert-json.o

perf-y += scripting-engines/
+perf-y += workqueue/

perf-$(CONFIG_ZLIB) += zlib.o
perf-$(CONFIG_LZMA) += lzma.o
diff --git a/tools/perf/util/workqueue/Build b/tools/perf/util/workqueue/Build
new file mode 100644
index 0000000000000000..8b72a6cd4e2cba0d
--- /dev/null
+++ b/tools/perf/util/workqueue/Build
@@ -0,0 +1 @@
+perf-y += threadpool.o
diff --git a/tools/perf/util/workqueue/threadpool.c b/tools/perf/util/workqueue/threadpool.c
new file mode 100644
index 0000000000000000..17672cb089afcf1d
--- /dev/null
+++ b/tools/perf/util/workqueue/threadpool.c
@@ -0,0 +1,205 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+#include "debug.h"
+#include <asm/bug.h>
+#include <linux/zalloc.h>
+#include <linux/string.h>
+#include <linux/err.h>
+#include <linux/kernel.h>
+#include <pthread.h>
+#include "threadpool.h"
+
+struct threadpool {
+ int nr_threads; /* number of threads in the pool */
+ struct threadpool_entry *threads; /* array of threads in the pool */
+ struct task_struct *current_task; /* current executing function */
+};
+
+struct threadpool_entry {
+ int idx; /* idx of thread in pool->threads */
+ pid_t tid; /* tid of thread */
+ pthread_t ptid; /* pthread id */
+ struct threadpool *pool; /* parent threadpool */
+ struct {
+ int ack[2]; /* messages from thread (acks) */
+ int cmd[2]; /* messages to thread (commands) */
+ } pipes;
+ bool running; /* has this thread been started? */
+};
+
+/**
+ * threadpool_entry__init_pipes - initialize all pipes of @thread
+ */
+static void threadpool_entry__init_pipes(struct threadpool_entry *thread)
+{
+ thread->pipes.ack[0] = -1;
+ thread->pipes.ack[1] = -1;
+ thread->pipes.cmd[0] = -1;
+ thread->pipes.cmd[1] = -1;
+}
+
+/**
+ * threadpool_entry__open_pipes - open all pipes of @thread
+ *
+ * Caller should perform cleanup on error.
+ */
+static int threadpool_entry__open_pipes(struct threadpool_entry *thread)
+{
+ char sbuf[STRERR_BUFSIZE];
+
+ if (pipe(thread->pipes.ack)) {
+ pr_debug2("threadpool: failed to create comm pipe 'from': %s\n",
+ str_error_r(errno, sbuf, sizeof(sbuf)));
+ return -ENOMEM;
+ }
+
+ if (pipe(thread->pipes.cmd)) {
+ pr_debug2("threadpool: failed to create comm pipe 'to': %s\n",
+ str_error_r(errno, sbuf, sizeof(sbuf)));
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+/**
+ * threadpool_entry__close_pipes - close all communication pipes of @thread
+ */
+static void threadpool_entry__close_pipes(struct threadpool_entry *thread)
+{
+ if (thread->pipes.ack[0] != -1) {
+ close(thread->pipes.ack[0]);
+ thread->pipes.ack[0] = -1;
+ }
+ if (thread->pipes.ack[1] != -1) {
+ close(thread->pipes.ack[1]);
+ thread->pipes.ack[1] = -1;
+ }
+ if (thread->pipes.cmd[0] != -1) {
+ close(thread->pipes.cmd[0]);
+ thread->pipes.cmd[0] = -1;
+ }
+ if (thread->pipes.cmd[1] != -1) {
+ close(thread->pipes.cmd[1]);
+ thread->pipes.cmd[1] = -1;
+ }
+}
+
+/**
+ * threadpool__new - create a fixed threadpool with @n_threads threads
+ */
+struct threadpool *threadpool__new(int n_threads)
+{
+ int ret, err, t;
+ char sbuf[STRERR_BUFSIZE];
+ struct threadpool *pool = malloc(sizeof(*pool));
+
+ if (!pool) {
+ pr_debug2("threadpool: cannot allocate pool: %s\n",
+ str_error_r(errno, sbuf, sizeof(sbuf)));
+ err = -ENOMEM;
+ goto out_return;
+ }
+
+ if (n_threads <= 0) {
+ pr_debug2("threadpool: invalid number of threads: %d\n",
+ n_threads);
+ err = -EINVAL;
+ goto out_free_pool;
+ }
+
+ pool->nr_threads = n_threads;
+ pool->current_task = NULL;
+
+ pool->threads = calloc(n_threads, sizeof(*pool->threads));
+ if (!pool->threads) {
+ pr_debug2("threadpool: cannot allocate threads: %s\n",
+ str_error_r(errno, sbuf, sizeof(sbuf)));
+ err = -ENOMEM;
+ goto out_free_pool;
+ }
+
+ for (t = 0; t < n_threads; t++) {
+ pool->threads[t].idx = t;
+ pool->threads[t].tid = -1;
+ pool->threads[t].ptid = 0;
+ pool->threads[t].pool = pool;
+ pool->threads[t].running = false;
+ threadpool_entry__init_pipes(&pool->threads[t]);
+ }
+
+ for (t = 0; t < n_threads; t++) {
+ ret = threadpool_entry__open_pipes(&pool->threads[t]);
+ if (ret) {
+ err = ret;
+ goto out_close_pipes;
+ }
+ }
+
+ return pool;
+
+out_close_pipes:
+ for (t = 0; t < n_threads; t++)
+ threadpool_entry__close_pipes(&pool->threads[t]);
+
+ zfree(&pool->threads);
+out_free_pool:
+ free(pool);
+out_return:
+ return ERR_PTR(err);
+}
+
+/**
+ * threadpool__strerror - print message regarding given @err in @pool
+ *
+ * Buffer size should be at least THREADPOOL_STRERR_BUFSIZE bytes.
+ */
+int threadpool__strerror(struct threadpool *pool __maybe_unused, int err, char *buf, size_t size)
+{
+ char sbuf[STRERR_BUFSIZE], *emsg;
+
+ emsg = str_error_r(err, sbuf, sizeof(sbuf));
+ return scnprintf(buf, size, "Error: %s.\n", emsg);
+}
+
+/**
+ * threadpool__new_strerror - print message regarding @err_ptr
+ *
+ * Buffer size should be at least THREADPOOL_STRERR_BUFSIZE bytes.
+ */
+int threadpool__new_strerror(struct threadpool *err_ptr, char *buf, size_t size)
+{
+ return threadpool__strerror(err_ptr, PTR_ERR(err_ptr), buf, size);
+}
+
+/**
+ * threadpool__delete - free the @pool and all its resources
+ */
+void threadpool__delete(struct threadpool *pool)
+{
+ struct threadpool_entry *thread;
+ int t;
+
+ if (IS_ERR_OR_NULL(pool))
+ return;
+
+ for (t = 0; t < pool->nr_threads; t++) {
+ thread = &pool->threads[t];
+ threadpool_entry__close_pipes(thread);
+ }
+
+ zfree(&pool->threads);
+ free(pool);
+}
+
+/**
+ * threadpool__size - get number of threads in the threadpool
+ */
+int threadpool__size(struct threadpool *pool)
+{
+ return pool->nr_threads;
+}
diff --git a/tools/perf/util/workqueue/threadpool.h b/tools/perf/util/workqueue/threadpool.h
new file mode 100644
index 0000000000000000..55146eb141d4c380
--- /dev/null
+++ b/tools/perf/util/workqueue/threadpool.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __WORKQUEUE_THREADPOOL_H
+#define __WORKQUEUE_THREADPOOL_H
+
+struct threadpool;
+struct task_struct;
+
+typedef void (*task_func_t)(int tidx, struct task_struct *task);
+
+struct task_struct {
+ task_func_t fn;
+};
+
+extern struct threadpool *threadpool__new(int n_threads);
+extern void threadpool__delete(struct threadpool *pool);
+
+extern int threadpool__size(struct threadpool *pool);
+
+/* Error management */
+#define THREADPOOL_STRERR_BUFSIZE (128+STRERR_BUFSIZE)
+extern int threadpool__strerror(struct threadpool *pool, int err, char *buf, size_t size);
+extern int threadpool__new_strerror(struct threadpool *err_ptr, char *buf, size_t size);
+
+#endif /* __WORKQUEUE_THREADPOOL_H */
--
2.31.1

2021-08-20 10:56:29

by Riccardo Mancini

Subject: [RFC PATCH v3 03/15] perf workqueue: add threadpool start and stop functions

This patch adds the start and stop functions, alongside the thread
function.
Each thread will run until a stop signal is received.
Furthermore, start and stop are added to the test.

Thread management is based on the prototype from Alexey:
https://lore.kernel.org/lkml/[email protected]/
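
For illustration, the intended lifecycle, sketched (pool created with
threadpool__new as in the previous patches):

        int err = threadpool__start(pool); /* spawn threads, wait for acks */

        if (!err) {
                /* threadpool__is_running(pool) is true from here on */
                err = threadpool__stop(pool); /* send stop cmds, wait for acks */
        }
        threadpool__delete(pool);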

Suggested-by: Alexey Bayduraev <[email protected]>
Signed-off-by: Riccardo Mancini <[email protected]>
---
tools/perf/tests/workqueue.c | 12 +
tools/perf/util/workqueue/threadpool.c | 324 ++++++++++++++++++++++++-
tools/perf/util/workqueue/threadpool.h | 13 +
3 files changed, 347 insertions(+), 2 deletions(-)

diff --git a/tools/perf/tests/workqueue.c b/tools/perf/tests/workqueue.c
index 469b154d7522f132..01f05b066d9fbc70 100644
--- a/tools/perf/tests/workqueue.c
+++ b/tools/perf/tests/workqueue.c
@@ -12,16 +12,28 @@ struct threadpool_test_args_t {

static int __threadpool__prepare(struct threadpool **pool, int pool_size)
{
+ int ret;
+
*pool = threadpool__new(pool_size);
TEST_ASSERT_VAL("threadpool creation failure", !IS_ERR(*pool));
TEST_ASSERT_VAL("threadpool size is wrong",
threadpool__size(*pool) == pool_size);

+ ret = threadpool__start(*pool);
+ TEST_ASSERT_VAL("threadpool start failure", ret == 0);
+ TEST_ASSERT_VAL("threadpool is not ready", threadpool__is_running(*pool));
+
return TEST_OK;
}

static int __threadpool__teardown(struct threadpool *pool)
{
+ int ret = threadpool__stop(pool);
+
+ TEST_ASSERT_VAL("threadpool stop failure", ret == 0);
+ TEST_ASSERT_VAL("stopped threadpool is ready",
+ !threadpool__is_running(pool));
+
threadpool__delete(pool);

return TEST_OK;
diff --git a/tools/perf/util/workqueue/threadpool.c b/tools/perf/util/workqueue/threadpool.c
index 17672cb089afcf1d..861a20231558e618 100644
--- a/tools/perf/util/workqueue/threadpool.c
+++ b/tools/perf/util/workqueue/threadpool.c
@@ -4,6 +4,8 @@
#include <unistd.h>
#include <errno.h>
#include <string.h>
+#include <signal.h>
+#include <syscall.h>
#include "debug.h"
#include <asm/bug.h>
#include <linux/zalloc.h>
@@ -11,8 +13,16 @@
#include <linux/err.h>
#include <linux/kernel.h>
#include <pthread.h>
+#include <internal/lib.h>
#include "threadpool.h"

+#ifndef HAVE_GETTID
+static inline pid_t gettid(void)
+{
+ return (pid_t)syscall(__NR_gettid);
+}
+#endif
+
struct threadpool {
int nr_threads; /* number of threads in the pool */
struct threadpool_entry *threads; /* array of threads in the pool */
@@ -31,6 +41,28 @@ struct threadpool_entry {
bool running; /* has this thread been started? */
};

+enum threadpool_msg {
+ THREADPOOL_MSG__UNDEFINED = 0,
+ THREADPOOL_MSG__ACK, /* from th: create and exit ack */
+ THREADPOOL_MSG__WAKE, /* to th: wake up */
+ THREADPOOL_MSG__STOP, /* to th: exit */
+ THREADPOOL_MSG__MAX
+};
+
+static const char * const threadpool_msg_tags[] = {
+ "undefined",
+ "ack",
+ "wake",
+ "stop"
+};
+
+static const char * const threadpool_errno_str[] = {
+ "Error calling sigprocmask",
+ "Error receiving message from thread",
+ "Error sending message to thread",
+ "Thread sent unexpected message"
+};
+
/**
* threadpool_entry__init_pipes - initialize all pipes of @thread
*/
@@ -89,6 +121,164 @@ static void threadpool_entry__close_pipes(struct threadpool_entry *thread)
}
}

+/**
+ * threadpool__send_cmd - send @cmd to @thread
+ */
+static int threadpool__send_cmd(struct threadpool *pool, int tidx, enum threadpool_msg cmd)
+{
+ struct threadpool_entry *thread = &pool->threads[tidx];
+ char sbuf[STRERR_BUFSIZE];
+ int res = writen(thread->pipes.cmd[1], &cmd, sizeof(cmd));
+
+ if (res < 0) {
+ pr_debug2("threadpool: error sending %s msg to tid=%d: %s\n",
+ threadpool_msg_tags[cmd], thread->tid,
+ str_error_r(errno, sbuf, sizeof(sbuf)));
+ return -THREADPOOL_ERROR__WRITEPIPE;
+ }
+
+ pr_debug2("threadpool: sent %s msg to tid=%d\n", threadpool_msg_tags[cmd], thread->tid);
+ return 0;
+}
+
+/**
+ * threadpool__wait_thread - receive ack from thread
+ *
+ * NB: call only from main thread!
+ */
+static int threadpool__wait_thread(struct threadpool *pool, int tidx)
+{
+ int res;
+ char sbuf[STRERR_BUFSIZE];
+ struct threadpool_entry *thread = &pool->threads[tidx];
+ enum threadpool_msg msg = THREADPOOL_MSG__UNDEFINED;
+
+ res = readn(thread->pipes.ack[0], &msg, sizeof(msg));
+ if (res < 0) {
+ pr_debug2("threadpool: failed to recv msg from tid=%d: %s\n",
+ thread->tid, str_error_r(errno, sbuf, sizeof(sbuf)));
+ return -THREADPOOL_ERROR__READPIPE;
+ }
+ if (msg != THREADPOOL_MSG__ACK) {
+ pr_debug2("threadpool: received unexpected msg from tid=%d: %s\n",
+ thread->tid, threadpool_msg_tags[msg]);
+ return -THREADPOOL_ERROR__INVALIDMSG;
+ }
+
+ pr_debug2("threadpool: received ack from tid=%d\n", thread->tid);
+
+ return 0;
+}
+
+/**
+ * threadpool__terminate_thread - send stop signal to thread and wait for ack
+ *
+ * NB: call only from main thread!
+ */
+static int threadpool__terminate_thread(struct threadpool *pool, int tidx)
+{
+ struct threadpool_entry *thread = &pool->threads[tidx];
+ int err;
+
+ if (!thread->running)
+ return 0;
+
+ err = threadpool__send_cmd(pool, tidx, THREADPOOL_MSG__STOP);
+ if (err)
+ goto out_cancel;
+
+ err = threadpool__wait_thread(pool, tidx);
+ if (err)
+ goto out_cancel;
+
+ thread->running = false;
+out:
+ return err;
+
+out_cancel:
+ pthread_cancel(thread->ptid);
+ goto out;
+}
+
+/**
+ * threadpool_entry__send_ack - send ack to main thread
+ */
+static int threadpool_entry__send_ack(struct threadpool_entry *thread)
+{
+ enum threadpool_msg msg = THREADPOOL_MSG__ACK;
+ char sbuf[STRERR_BUFSIZE];
+ int ret = writen(thread->pipes.ack[1], &msg, sizeof(msg));
+
+ if (ret < 0) {
+ pr_debug("threadpool[%d]: failed to send ack: %s\n",
+ thread->tid, str_error_r(errno, sbuf, sizeof(sbuf)));
+ return -THREADPOOL_ERROR__WRITEPIPE;
+ }
+
+ return 0;
+}
+
+/**
+ * threadpool_entry__recv_cmd - receive command from main thread
+ */
+static int threadpool_entry__recv_cmd(struct threadpool_entry *thread,
+ enum threadpool_msg *cmd)
+{
+ char sbuf[STRERR_BUFSIZE];
+ int ret;
+
+ *cmd = THREADPOOL_MSG__UNDEFINED;
+ ret = readn(thread->pipes.cmd[0], cmd, sizeof(*cmd));
+ if (ret < 0) {
+ pr_debug("threadpool[%d]: error receiving command: %s\n",
+ thread->tid, str_error_r(errno, sbuf, sizeof(sbuf)));
+ return -THREADPOOL_ERROR__READPIPE;
+ }
+
+ if (*cmd != THREADPOOL_MSG__WAKE && *cmd != THREADPOOL_MSG__STOP) {
+ pr_debug("threadpool[%d]: received unexpected command: %s\n",
+ thread->tid, threadpool_msg_tags[*cmd]);
+ return -THREADPOOL_ERROR__INVALIDMSG;
+ }
+
+ return 0;
+}
+
+/**
+ * threadpool_entry__function - function running on thread
+ *
+ * This function waits for a signal from main thread to start executing
+ * a task.
+ * On completion, it will go back to sleep, waiting for another signal.
+ * Signals are delivered through pipes.
+ */
+static void *threadpool_entry__function(void *args)
+{
+ struct threadpool_entry *thread = (struct threadpool_entry *) args;
+ enum threadpool_msg cmd;
+
+ thread->tid = gettid();
+
+ pr_debug2("threadpool[%d]: started\n", thread->tid);
+
+ for (;;) {
+ if (threadpool_entry__send_ack(thread))
+ break;
+
+ if (threadpool_entry__recv_cmd(thread, &cmd))
+ break;
+
+ if (cmd == THREADPOOL_MSG__STOP)
+ break;
+ }
+
+ pr_debug2("threadpool[%d]: exit\n", thread->tid);
+
+ threadpool_entry__send_ack(thread);
+
+ return NULL;
+}
+
/**
* threadpool__new - create a fixed threadpool with @n_threads threads
*/
@@ -161,9 +351,23 @@ struct threadpool *threadpool__new(int n_threads)
int threadpool__strerror(struct threadpool *pool __maybe_unused, int err, char *buf, size_t size)
{
char sbuf[STRERR_BUFSIZE], *emsg;
+ const char *errno_str;
+ int err_idx = -err-THREADPOOL_ERROR__OFFSET;

- emsg = str_error_r(err, sbuf, sizeof(sbuf));
- return scnprintf(buf, size, "Error: %s.\n", emsg);
+ switch (err) {
+ case -THREADPOOL_ERROR__SIGPROCMASK:
+ case -THREADPOOL_ERROR__READPIPE:
+ case -THREADPOOL_ERROR__WRITEPIPE:
+ emsg = str_error_r(errno, sbuf, sizeof(sbuf));
+ errno_str = threadpool_errno_str[err_idx];
+ return scnprintf(buf, size, "%s: %s.\n", errno_str, emsg);
+ case -THREADPOOL_ERROR__INVALIDMSG:
+ errno_str = threadpool_errno_str[err_idx];
+ return scnprintf(buf, size, "%s.\n", errno_str);
+ default:
+ emsg = str_error_r(err, sbuf, sizeof(sbuf));
+ return scnprintf(buf, size, "Error: %s", emsg);
+ }
}

/**
@@ -203,3 +407,119 @@ int threadpool__size(struct threadpool *pool)
{
return pool->nr_threads;
}
+
+/**
+ * threadpool__start_thread - start thread @tidx of the pool
+ *
+ * The function blocks until the thread is up and running.
+ * This function can also be called if the threadpool is already executing.
+ */
+int threadpool__start_thread(struct threadpool *pool, int tidx)
+{
+ char sbuf[STRERR_BUFSIZE];
+ int ret, err = 0;
+ sigset_t full, mask;
+ pthread_attr_t attrs;
+ struct threadpool_entry *thread = &pool->threads[tidx];
+
+ if (thread->running)
+ return -EBUSY;
+
+ sigfillset(&full);
+ if (sigprocmask(SIG_SETMASK, &full, &mask)) {
+ pr_debug2("Failed to block signals on threads start: %s\n",
+ str_error_r(errno, sbuf, sizeof(sbuf)));
+ return -THREADPOOL_ERROR__SIGPROCMASK;
+ }
+
+ pthread_attr_init(&attrs);
+ pthread_attr_setdetachstate(&attrs, PTHREAD_CREATE_DETACHED);
+
+ ret = pthread_create(&thread->ptid, &attrs, threadpool_entry__function, thread);
+ if (ret) {
+ err = -ret;
+ pr_debug2("Failed to start threads: %s\n", str_error_r(ret, sbuf, sizeof(sbuf)));
+ goto out;
+ }
+
+ err = threadpool__wait_thread(pool, tidx);
+ if (err)
+ goto out_cancel;
+
+ thread->running = true;
+
+out:
+ pthread_attr_destroy(&attrs);
+
+ if (sigprocmask(SIG_SETMASK, &mask, NULL)) {
+ pr_debug2("Failed to unblock signals on threads start: %s\n",
+ str_error_r(errno, sbuf, sizeof(sbuf)));
+ err = -THREADPOOL_ERROR__SIGPROCMASK;
+ }
+
+ return err;
+
+out_cancel:
+ pthread_cancel(thread->ptid);
+ goto out;
+}
+
+/**
+ * threadpool__start - start all threads in the pool.
+ *
+ * The function blocks until all threads are up and running.
+ */
+int threadpool__start(struct threadpool *pool)
+{
+ int t, tt, err = 0, nr_threads = pool->nr_threads;
+
+ for (t = 0; t < nr_threads; t++) {
+ err = threadpool__start_thread(pool, t);
+ if (err)
+ goto out_terminate;
+ }
+
+out:
+ return err;
+
+out_terminate:
+ for (tt = 0; tt < t; tt++)
+ threadpool__terminate_thread(pool, tt);
+ goto out;
+}
+
+
+/**
+ * threadpool__stop - stop all threads in the pool.
+ *
+ * This function blocks waiting for ack from all threads.
+ */
+int threadpool__stop(struct threadpool *pool)
+{
+ int t, ret, err = 0;
+
+ for (t = 0; t < pool->nr_threads; t++) {
+ /**
+ * Even if a termination fails, we should continue to terminate
+ * all other threads.
+ */
+ ret = threadpool__terminate_thread(pool, t);
+ if (ret)
+ err = ret;
+ }
+
+ return err;
+}
+
+/**
+ * threadpool__is_running - return true if any of the threads is running
+ */
+bool threadpool__is_running(struct threadpool *pool)
+{
+ int t;
+
+ for (t = 0; t < pool->nr_threads; t++)
+ if (pool->threads[t].running)
+ return true;
+ return false;
+}
diff --git a/tools/perf/util/workqueue/threadpool.h b/tools/perf/util/workqueue/threadpool.h
index 55146eb141d4c380..0e03fdd377627e79 100644
--- a/tools/perf/util/workqueue/threadpool.h
+++ b/tools/perf/util/workqueue/threadpool.h
@@ -14,10 +14,23 @@ struct task_struct {
extern struct threadpool *threadpool__new(int n_threads);
extern void threadpool__delete(struct threadpool *pool);

+extern int threadpool__start_thread(struct threadpool *pool, int tidx);
+extern int threadpool__start(struct threadpool *pool);
+extern int threadpool__stop(struct threadpool *pool);
+
extern int threadpool__size(struct threadpool *pool);
+extern bool threadpool__is_running(struct threadpool *pool);

/* Error management */
#define THREADPOOL_STRERR_BUFSIZE (128+STRERR_BUFSIZE)
+#define THREADPOOL_ERROR__OFFSET 512
+enum {
+ THREADPOOL_ERROR__SIGPROCMASK = THREADPOOL_ERROR__OFFSET,
+ THREADPOOL_ERROR__READPIPE,
+ THREADPOOL_ERROR__WRITEPIPE,
+ THREADPOOL_ERROR__INVALIDMSG,
+ THREADPOOL_ERROR__NOTALLOWED
+};
extern int threadpool__strerror(struct threadpool *pool, int err, char *buf, size_t size);
extern int threadpool__new_strerror(struct threadpool *err_ptr, char *buf, size_t size);

--
2.31.1

2021-08-20 10:56:32

by Riccardo Mancini

Subject: [RFC PATCH v3 05/15] tools: add sparse context/locking annotations in compiler_types.h

This patch copies the sparse context/locking annotations from
include/linux/compiler_types.h to tools/include/linux/compiler_types.h.
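
For illustration, a sketch of how the annotations are used (struct foo
is hypothetical; the workqueue patches later in this series wrap pthread
mutexes in the same way):

        static inline int lock_foo(struct foo *f)
        __acquires(&f->lock)
        {
                __acquire(&f->lock);
                return pthread_mutex_lock(&f->lock);
        }

        static inline int unlock_foo(struct foo *f)
        __releases(&f->lock)
        {
                __release(&f->lock);
                return pthread_mutex_unlock(&f->lock);
        }

        static int foo_peek(struct foo *f)
        __must_hold(&f->lock)
        {
                return f->val;  /* sparse warns if f->lock is not held */
        }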

Signed-off-by: Riccardo Mancini <[email protected]>
---
tools/include/linux/compiler_types.h | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/tools/include/linux/compiler_types.h b/tools/include/linux/compiler_types.h
index feea09029f610120..24ae3054f304f274 100644
--- a/tools/include/linux/compiler_types.h
+++ b/tools/include/linux/compiler_types.h
@@ -13,6 +13,24 @@
#define __has_builtin(x) (0)
#endif

+#ifdef __CHECKER__
+/* context/locking */
+# define __must_hold(x) __attribute__((context(x,1,1)))
+# define __acquires(x) __attribute__((context(x,0,1)))
+# define __releases(x) __attribute__((context(x,1,0)))
+# define __acquire(x) __context__(x,1)
+# define __release(x) __context__(x,-1)
+# define __cond_lock(x,c) ((c) ? ({ __acquire(x); 1; }) : 0)
+#else /* __CHECKER__ */
+/* context/locking */
+# define __must_hold(x)
+# define __acquires(x)
+# define __releases(x)
+# define __acquire(x) (void)0
+# define __release(x) (void)0
+# define __cond_lock(x,c) (c)
+#endif /* __CHECKER__ */
+
/* Compiler specific macros. */
#ifdef __GNUC__
#include <linux/compiler-gcc.h>
--
2.31.1

2021-08-20 10:56:44

by Riccardo Mancini

Subject: [RFC PATCH v3 04/15] perf workqueue: add threadpool execute and wait functions

This patch adds:
- threadpool__execute: assigns a task to the threads to execute
asynchronously.
- threadpool__wait: waits for the task to complete on all threads.
Furthermore, testing for these new functions is added.
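
For illustration, a sketch mirroring the test below (my_task, my_fn and
results are hypothetical; tasks embed a task_struct and recover their
context with container_of):

        struct my_task {
                struct task_struct task;
                int *results;
        };

        static void my_fn(int tidx, struct task_struct *task)
        {
                struct my_task *mtask = container_of(task, struct my_task, task);

                mtask->results[tidx] = tidx;    /* any per-thread work */
        }

        struct my_task mtask = { .task = { .fn = my_fn }, .results = results };
        int err = threadpool__execute(pool, &mtask.task); /* async */

        /* ... the main thread is free to do other work here ... */
        err = threadpool__wait(pool);   /* block until all threads ack */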

Signed-off-by: Riccardo Mancini <[email protected]>
---
tools/perf/tests/workqueue.c | 86 ++++++++++++++++++++++-
tools/perf/util/workqueue/threadpool.c | 94 ++++++++++++++++++++++++++
tools/perf/util/workqueue/threadpool.h | 4 ++
3 files changed, 183 insertions(+), 1 deletion(-)

diff --git a/tools/perf/tests/workqueue.c b/tools/perf/tests/workqueue.c
index 01f05b066d9fbc70..b145a5155089497f 100644
--- a/tools/perf/tests/workqueue.c
+++ b/tools/perf/tests/workqueue.c
@@ -1,15 +1,61 @@
// SPDX-License-Identifier: GPL-2.0
#include <unistd.h>
+#include <stdlib.h>
#include <linux/kernel.h>
#include <linux/err.h>
+#include <linux/zalloc.h>
#include "tests.h"
#include "util/debug.h"
#include "util/workqueue/threadpool.h"

+#define DUMMY_FACTOR 100000
+#define N_DUMMY_WORK_SIZES 7
+
struct threadpool_test_args_t {
int pool_size;
};

+struct test_task {
+ struct task_struct task;
+ int n_threads;
+ int *array;
+};
+
+/**
+ * dummy_work - calculates DUMMY_FACTOR * (idx % N_DUMMY_WORK_SIZES) inefficiently
+ *
+ * This function uses modulus to create work items of different sizes.
+ */
+static void dummy_work(int idx)
+{
+ volatile int prod = 0; /* prevent possible compiler optimizations */
+ int k = idx % N_DUMMY_WORK_SIZES;
+ int i, j;
+
+ for (i = 0; i < DUMMY_FACTOR; i++)
+ for (j = 0; j < k; j++)
+ prod++;
+
+ pr_debug3("dummy: %d * %d = %d\n", DUMMY_FACTOR, k, prod);
+}
+
+static void test_task_fn1(int tidx, struct task_struct *task)
+{
+ struct test_task *mtask = container_of(task, struct test_task, task);
+
+ dummy_work(tidx);
+ mtask->array[tidx] = tidx+1;
+}
+
+static void test_task_fn2(int tidx, struct task_struct *task)
+{
+ struct test_task *mtask = container_of(task, struct test_task, task);
+
+ dummy_work(tidx);
+ mtask->array[tidx] = tidx*2;
+}
+
+
static int __threadpool__prepare(struct threadpool **pool, int pool_size)
{
int ret;
@@ -39,21 +85,59 @@ static int __threadpool__teardown(struct threadpool *pool)
return TEST_OK;
}

+static int __threadpool__exec_wait(struct threadpool *pool,
+ struct task_struct *task)
+{
+ int ret = threadpool__execute(pool, task);
+
+ TEST_ASSERT_VAL("threadpool execute failure", ret == 0);
+ TEST_ASSERT_VAL("threadpool is not executing", threadpool__is_busy(pool));
+
+ ret = threadpool__wait(pool);
+ TEST_ASSERT_VAL("threadpool wait failure", ret == 0);
+ TEST_ASSERT_VAL("waited threadpool is not running", threadpool__is_running(pool));
+
+ return TEST_OK;
+}
+
static int __test__threadpool(void *_args)
{
struct threadpool_test_args_t *args = _args;
struct threadpool *pool;
+ struct test_task task = { .array = NULL };
int pool_size = args->pool_size ?: sysconf(_SC_NPROCESSORS_ONLN);
- int ret = __threadpool__prepare(&pool, pool_size);
+ int i, ret = __threadpool__prepare(&pool, pool_size);

if (ret)
goto out;

+ task.task.fn = test_task_fn1;
+ task.n_threads = pool_size;
+ task.array = calloc(pool_size, sizeof(*task.array));
+ TEST_ASSERT_VAL("calloc failure", task.array);
+
+ ret = __threadpool__exec_wait(pool, &task.task);
+ if (ret)
+ goto out;
+
+ for (i = 0; i < pool_size; i++)
+ TEST_ASSERT_VAL("failed array check (1)", task.array[i] == i+1);
+
+ task.task.fn = test_task_fn2;
+
+ ret = __threadpool__exec_wait(pool, &task.task);
+ if (ret)
+ goto out;
+
+ for (i = 0; i < pool_size; i++)
+ TEST_ASSERT_VAL("failed array check (2)", task.array[i] == 2*i);
+
ret = __threadpool__teardown(pool);
if (ret)
goto out;

out:
+ free(task.array);
return ret;
}

diff --git a/tools/perf/util/workqueue/threadpool.c b/tools/perf/util/workqueue/threadpool.c
index 861a20231558e618..44bcbe4fa3d2d026 100644
--- a/tools/perf/util/workqueue/threadpool.c
+++ b/tools/perf/util/workqueue/threadpool.c
@@ -200,6 +200,17 @@ static int threadpool__terminate_thread(struct threadpool *pool, int tidx)
goto out;
}

+/**
+ * threadpool__wake_thread - send wake msg to @thread
+ *
+ * This function does not wait for the thread to actually wake
+ * NB: call only from main thread!
+ */
+static int threadpool__wake_thread(struct threadpool *pool, int tidx)
+{
+ return threadpool__send_cmd(pool, tidx, THREADPOOL_MSG__WAKE);
+}
+
/**
* threadpool_entry__send_ack - send ack to main thread
*/
@@ -270,6 +281,15 @@ static void *threadpool_entry__function(void *args)

if (cmd == THREADPOOL_MSG__STOP)
break;
+
+ if (!thread->pool->current_task) {
+ pr_debug("threadpool[%d]: received wake without task\n",
+ thread->tid);
+ break;
+ }
+
+ pr_debug("threadpool[%d]: executing task\n", thread->tid);
+ thread->pool->current_task->fn(thread->idx, thread->pool->current_task);
}

pr_debug2("threadpool[%d]: exit\n", thread->tid);
@@ -448,6 +468,12 @@ int threadpool__start_thread(struct threadpool *pool, int tidx)

thread->running = true;

+ if (pool->current_task) {
+ err = threadpool__wake_thread(pool, tidx);
+ if (err)
+ goto out_cancel;
+ }
+
out:
pthread_attr_destroy(&attrs);

@@ -498,6 +524,10 @@ int threadpool__stop(struct threadpool *pool)
{
int t, ret, err = 0;

+ err = threadpool__wait(pool);
+ if (err)
+ return err;
+
for (t = 0; t < pool->nr_threads; t++) {
/**
* Even if a termination fails, we should continue to terminate
@@ -523,3 +553,67 @@ bool threadpool__is_running(struct threadpool *pool)
return true;
return false;
}
+
+/**
+ * threadpool__execute - set threadpool @task
+ *
+ * The task will be immediately executed on all started threads. If a thread
+ * is not running, it will start executing this task once started.
+ * The task will run asynchronously wrt the main thread.
+ * The task can be waited for with threadpool__wait. Since no queueing is
+ * performed, you need to wait for the pool before submitting a new task.
+ */
+int threadpool__execute(struct threadpool *pool, struct task_struct *task)
+{
+ int t, ret;
+
+ if (pool->current_task)
+ return -EBUSY;
+
+ pool->current_task = task;
+
+ for (t = 0; t < pool->nr_threads; t++) {
+ if (!pool->threads[t].running)
+ continue;
+ ret = threadpool__wake_thread(pool, t);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+/**
+ * threadpool__wait - wait until all threads in @pool are done
+ *
+ * This function will wait for all threads to finish execution and send their
+ * ack message.
+ *
+ * NB: call only from main thread!
+ */
+int threadpool__wait(struct threadpool *pool)
+{
+ int t, err = 0, ret;
+
+ if (!pool->current_task)
+ return 0;
+
+ for (t = 0; t < pool->nr_threads; t++) {
+ if (!pool->threads[t].running)
+ continue;
+ ret = threadpool__wait_thread(pool, t);
+ if (ret)
+ err = ret;
+ }
+
+ pool->current_task = NULL;
+ return err;
+}
+
+/**
+ * threadpool__is_busy - check if the pool has work to do
+ */
+bool threadpool__is_busy(struct threadpool *pool)
+{
+ return pool->current_task;
+}
diff --git a/tools/perf/util/workqueue/threadpool.h b/tools/perf/util/workqueue/threadpool.h
index 0e03fdd377627e79..9a6081cef8af95e0 100644
--- a/tools/perf/util/workqueue/threadpool.h
+++ b/tools/perf/util/workqueue/threadpool.h
@@ -18,8 +18,12 @@ extern int threadpool__start_thread(struct threadpool *pool, int tidx);
extern int threadpool__start(struct threadpool *pool);
extern int threadpool__stop(struct threadpool *pool);

+extern int threadpool__wait(struct threadpool *pool);
+extern int threadpool__execute(struct threadpool *pool, struct task_struct *task);
+
extern int threadpool__size(struct threadpool *pool);
extern bool threadpool__is_running(struct threadpool *pool);
+extern bool threadpool__is_busy(struct threadpool *pool);

/* Error management */
#define THREADPOOL_STRERR_BUFSIZE (128+STRERR_BUFSIZE)
--
2.31.1

2021-08-20 10:56:52

by Riccardo Mancini

Subject: [RFC PATCH v3 06/15] perf workqueue: introduce workqueue struct

This patch adds the workqueue definition, along with simple creation and
destruction functions.
Furthermore, a simple subtest is added.

A workqueue is attached to a pool, on which it executes its workers.
The next patches will introduce the workers.
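
For illustration, a minimal creation/destruction sketch using the
error-reporting helpers added here:

        char buf[WORKQUEUE_STRERR_BUFSIZE];
        struct workqueue_struct *wq = create_workqueue(nr_threads);

        if (IS_ERR(wq)) {
                create_workqueue_strerror(wq, buf, sizeof(buf));
                pr_err("%s", buf);
                return PTR_ERR(wq);
        }
        /* ... queue work (following patches) ... */
        err = destroy_workqueue(wq);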

Signed-off-by: Riccardo Mancini <[email protected]>
---
tools/perf/tests/workqueue.c | 81 +++++++++++
tools/perf/util/workqueue/Build | 1 +
tools/perf/util/workqueue/workqueue.c | 196 ++++++++++++++++++++++++++
tools/perf/util/workqueue/workqueue.h | 39 +++++
4 files changed, 317 insertions(+)
create mode 100644 tools/perf/util/workqueue/workqueue.c
create mode 100644 tools/perf/util/workqueue/workqueue.h

diff --git a/tools/perf/tests/workqueue.c b/tools/perf/tests/workqueue.c
index b145a5155089497f..1aa6ee788b0b1c32 100644
--- a/tools/perf/tests/workqueue.c
+++ b/tools/perf/tests/workqueue.c
@@ -7,6 +7,7 @@
#include "tests.h"
#include "util/debug.h"
#include "util/workqueue/threadpool.h"
+#include "util/workqueue/workqueue.h"

#define DUMMY_FACTOR 100000
#define N_DUMMY_WORK_SIZES 7
@@ -15,6 +16,11 @@ struct threadpool_test_args_t {
int pool_size;
};

+struct workqueue_test_args_t {
+ int pool_size;
+ int n_work_items;
+};
+
struct test_task {
struct task_struct task;
int n_threads;
@@ -141,6 +147,43 @@ static int __test__threadpool(void *_args)
return ret;
}

+static int __workqueue__prepare(struct workqueue_struct **wq,
+ int pool_size)
+{
+ *wq = create_workqueue(pool_size);
+ TEST_ASSERT_VAL("workqueue creation failure", !IS_ERR(*wq));
+ TEST_ASSERT_VAL("workqueue wrong size", workqueue_nr_threads(*wq) == pool_size);
+
+ return TEST_OK;
+}
+
+static int __workqueue__teardown(struct workqueue_struct *wq)
+{
+ int ret = destroy_workqueue(wq);
+
+ TEST_ASSERT_VAL("workqueue detruction failure", ret == 0);
+
+ return TEST_OK;
+}
+
+static int __test__workqueue(void *_args)
+{
+ struct workqueue_test_args_t *args = _args;
+ struct workqueue_struct *wq;
+ int pool_size = args->pool_size ?: sysconf(_SC_NPROCESSORS_ONLN);
+ int ret = __workqueue__prepare(&wq, pool_size);
+
+ if (ret)
+ goto out;
+
+ ret = __workqueue__teardown(wq);
+ if (ret)
+ goto out;
+
+out:
+ return ret;
+}
+
static const struct threadpool_test_args_t threadpool_test_args[] = {
{
.pool_size = 1
@@ -162,6 +205,37 @@ static const struct threadpool_test_args_t threadpool_test_args[] = {
}
};

+static const struct workqueue_test_args_t workqueue_test_args[] = {
+ {
+ .pool_size = 1,
+ .n_work_items = 1
+ },
+ {
+ .pool_size = 1,
+ .n_work_items = 10
+ },
+ {
+ .pool_size = 2,
+ .n_work_items = 1
+ },
+ {
+ .pool_size = 2,
+ .n_work_items = 100
+ },
+ {
+ .pool_size = 16,
+ .n_work_items = 7
+ },
+ {
+ .pool_size = 16,
+ .n_work_items = 2789
+ },
+ {
+ .pool_size = 0, // sysconf(_SC_NPROCESSORS_ONLN)
+ .n_work_items = 8191
+ }
+};
+
struct test_case {
const char *desc;
int (*func)(void *args);
@@ -177,6 +251,13 @@ static struct test_case workqueue_testcase_table[] = {
.args = (void *) threadpool_test_args,
.n_args = (int)ARRAY_SIZE(threadpool_test_args),
.arg_size = sizeof(struct threadpool_test_args_t)
+ },
+ {
+ .desc = "Workqueue",
+ .func = __test__workqueue,
+ .args = (void *) workqueue_test_args,
+ .n_args = (int)ARRAY_SIZE(workqueue_test_args),
+ .arg_size = sizeof(struct workqueue_test_args_t)
}
};

diff --git a/tools/perf/util/workqueue/Build b/tools/perf/util/workqueue/Build
index 8b72a6cd4e2cba0d..4af721345c0a6bb7 100644
--- a/tools/perf/util/workqueue/Build
+++ b/tools/perf/util/workqueue/Build
@@ -1 +1,2 @@
perf-y += threadpool.o
+perf-y += workqueue.o
diff --git a/tools/perf/util/workqueue/workqueue.c b/tools/perf/util/workqueue/workqueue.c
new file mode 100644
index 0000000000000000..053aac43e038f0b7
--- /dev/null
+++ b/tools/perf/util/workqueue/workqueue.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+#include <pthread.h>
+#include <linux/list.h>
+#include <linux/err.h>
+#include <linux/string.h>
+#include <linux/zalloc.h>
+#include "debug.h"
+#include <internal/lib.h>
+#include "workqueue.h"
+
+struct workqueue_struct {
+ pthread_mutex_t lock; /* locking of the workqueue */
+ pthread_cond_t idle_cond; /* all workers are idle cond */
+ struct threadpool *pool; /* underlying pool */
+ int pool_errno; /* latest pool error */
+ struct task_struct task; /* threadpool task */
+ struct list_head busy_list; /* busy workers */
+ struct list_head idle_list; /* idle workers */
+ int msg_pipe[2]; /* main thread comm pipes */
+};
+
+static const char * const workqueue_errno_str[] = {
+ "Error creating threadpool",
+ "Error executing function in threadpool",
+ "Error stopping the threadpool",
+ "Error starting thread in the threadpool",
+ "Error sending message to worker",
+ "Error receiving message from worker",
+ "Received unexpected message from worker",
+};
+
+/**
+ * create_workqueue - create a workqueue with @nr_threads threads
+ *
+ * The workqueue will create a threadpool on which to execute.
+ */
+struct workqueue_struct *create_workqueue(int nr_threads)
+{
+ int ret, err = 0;
+ struct workqueue_struct *wq = zalloc(sizeof(struct workqueue_struct));
+
+ if (!wq) {
+ err = -ENOMEM;
+ goto out_return;
+ }
+
+ wq->pool = threadpool__new(nr_threads);
+ if (IS_ERR(wq->pool)) {
+ err = -WORKQUEUE_ERROR__POOLNEW;
+ wq->pool_errno = PTR_ERR(wq->pool);
+ goto out_free_wq;
+ }
+
+ ret = pthread_mutex_init(&wq->lock, NULL);
+ if (ret) {
+ err = -ret;
+ goto out_delete_pool;
+ }
+
+ ret = pthread_cond_init(&wq->idle_cond, NULL);
+ if (ret) {
+ err = -ret;
+ goto out_destroy_mutex;
+ }
+
+ INIT_LIST_HEAD(&wq->busy_list);
+ INIT_LIST_HEAD(&wq->idle_list);
+
+ ret = pipe(wq->msg_pipe);
+ if (ret) {
+ err = -ENOMEM;
+ goto out_destroy_cond;
+ }
+
+ return wq;
+
+out_destroy_cond:
+ pthread_cond_destroy(&wq->idle_cond);
+out_destroy_mutex:
+ pthread_mutex_destroy(&wq->lock);
+out_delete_pool:
+ threadpool__delete(wq->pool);
+out_free_wq:
+ free(wq);
+out_return:
+ return ERR_PTR(err);
+}
+
+/**
+ * destroy_workqueue - stop @wq workers and destroy @wq
+ */
+int destroy_workqueue(struct workqueue_struct *wq)
+{
+ int err = 0, ret;
+ char sbuf[STRERR_BUFSIZE];
+
+ if (IS_ERR_OR_NULL(wq))
+ return 0;
+
+ threadpool__delete(wq->pool);
+ wq->pool = NULL;
+
+ ret = pthread_mutex_destroy(&wq->lock);
+ if (ret) {
+ err = -ret;
+ pr_debug2("workqueue: error pthread_mutex_destroy: %s\n",
+ str_error_r(ret, sbuf, sizeof(sbuf)));
+ }
+
+ ret = pthread_cond_destroy(&wq->idle_cond);
+ if (ret) {
+ err = -ret;
+ pr_debug2("workqueue: error pthread_cond_destroy: %s\n",
+ str_error_r(ret, sbuf, sizeof(sbuf)));
+ }
+
+ close(wq->msg_pipe[0]);
+ wq->msg_pipe[0] = -1;
+
+ close(wq->msg_pipe[1]);
+ wq->msg_pipe[1] = -1;
+
+ free(wq);
+ return err;
+}
+
+/**
+ * workqueue_strerror - print message regarding latest error in @wq
+ *
+ * Buffer size should be at least WORKQUEUE_STRERR_BUFSIZE bytes.
+ */
+int workqueue_strerror(struct workqueue_struct *wq, int err, char *buf, size_t size)
+{
+ int ret;
+ char sbuf[THREADPOOL_STRERR_BUFSIZE], *emsg;
+ const char *errno_str;
+
+ errno_str = workqueue_errno_str[-err-WORKQUEUE_ERROR__OFFSET];
+
+ switch (err) {
+ case -WORKQUEUE_ERROR__POOLNEW:
+ case -WORKQUEUE_ERROR__POOLEXE:
+ case -WORKQUEUE_ERROR__POOLSTOP:
+ case -WORKQUEUE_ERROR__POOLSTARTTHREAD:
+ if (IS_ERR_OR_NULL(wq))
+ return scnprintf(buf, size, "%s: unknown.\n",
+ errno_str);
+
+ ret = threadpool__strerror(wq->pool, wq->pool_errno, sbuf, sizeof(sbuf));
+ if (ret < 0)
+ return ret;
+ return scnprintf(buf, size, "%s: %s.\n", errno_str, sbuf);
+ case -WORKQUEUE_ERROR__WRITEPIPE:
+ case -WORKQUEUE_ERROR__READPIPE:
+ emsg = str_error_r(errno, sbuf, sizeof(sbuf));
+ return scnprintf(buf, size, "%s: %s.\n", errno_str, emsg);
+ case -WORKQUEUE_ERROR__INVALIDMSG:
+ return scnprintf(buf, size, "%s.\n", errno_str);
+ default:
+ emsg = str_error_r(err, sbuf, sizeof(sbuf));
+ return scnprintf(buf, size, "Error: %s", emsg);
+ }
+}
+
+/**
+ * create_workqueue_strerror - print message regarding @err_ptr
+ *
+ * Buffer size should be at least WORKQUEUE_STRERR_BUFSIZE bytes.
+ */
+int create_workqueue_strerror(struct workqueue_struct *err_ptr, char *buf, size_t size)
+{
+ return workqueue_strerror(err_ptr, PTR_ERR(err_ptr), buf, size);
+}
+
+/**
+ * destroy_workqueue_strerror - print message regarding @err
+ *
+ * Buffer size should be at least WORKQUEUE_STRERR_BUFSIZE bytes.
+ */
+int destroy_workqueue_strerror(int err, char *buf, size_t size)
+{
+ return workqueue_strerror(NULL, err, buf, size);
+}
+
+/**
+ * workqueue_nr_threads - get size of threadpool underlying @wq
+ */
+int workqueue_nr_threads(struct workqueue_struct *wq)
+{
+ return threadpool__size(wq->pool);
+}
diff --git a/tools/perf/util/workqueue/workqueue.h b/tools/perf/util/workqueue/workqueue.h
new file mode 100644
index 0000000000000000..100841cc035fde1d
--- /dev/null
+++ b/tools/perf/util/workqueue/workqueue.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __WORKQUEUE_WORKQUEUE_H
+#define __WORKQUEUE_WORKQUEUE_H
+
+#include <stdlib.h>
+#include <sys/types.h>
+#include <linux/list.h>
+#include "threadpool.h"
+
+struct work_struct;
+typedef void (*work_func_t)(struct work_struct *work);
+
+struct work_struct {
+ struct list_head entry;
+ work_func_t func;
+};
+
+struct workqueue_struct;
+
+extern struct workqueue_struct *create_workqueue(int nr_threads);
+extern int destroy_workqueue(struct workqueue_struct *wq);
+
+extern int workqueue_nr_threads(struct workqueue_struct *wq);
+
+#define WORKQUEUE_STRERR_BUFSIZE (128+THREADPOOL_STRERR_BUFSIZE)
+#define WORKQUEUE_ERROR__OFFSET 512
+enum {
+ WORKQUEUE_ERROR__POOLNEW = WORKQUEUE_ERROR__OFFSET,
+ WORKQUEUE_ERROR__POOLEXE,
+ WORKQUEUE_ERROR__POOLSTOP,
+ WORKQUEUE_ERROR__POOLSTARTTHREAD,
+ WORKQUEUE_ERROR__WRITEPIPE,
+ WORKQUEUE_ERROR__READPIPE,
+ WORKQUEUE_ERROR__INVALIDMSG,
+};
+extern int workqueue_strerror(struct workqueue_struct *wq, int err, char *buf, size_t size);
+extern int create_workqueue_strerror(struct workqueue_struct *err_ptr, char *buf, size_t size);
+extern int destroy_workqueue_strerror(int err, char *buf, size_t size);
+#endif /* __WORKQUEUE_WORKQUEUE_H */
--
2.31.1

2021-08-20 10:57:08

by Riccardo Mancini

Subject: [RFC PATCH v3 07/15] perf workqueue: implement worker thread and management

This patch adds the implementation of the worker thread that is executed
in the threadpool, and all management-related functions.

At startup, a worker registers itself with the workqueue by adding itself
to the idle_list, then it sends an ack back to the main thread. When
creating workers, the main thread will wait for the related acks.
Once there is work to do, threads are woken up to perform the work.
Threads will try to dequeue new pending work before going to sleep.

This registering mechanism has been implemented to enable lazy spin-up
of worker threads in the following patches.
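
Condensed, the worker-side protocol implemented below behaves roughly
like this (pseudocode, not the literal implementation):

        /* startup: register and ack */
        register_worker(wq, self);              /* join wq->idle_list */
        send(wq->msg_pipe, WORKER_MSG__READY);  /* main thread waits for this */

        for (;;) {
                msg = recv(self->msg_pipe);     /* sleep until WAKE or STOP */
                if (msg == WORKER_MSG__STOP)
                        break;
                /* drain own queue, then move back to idle_list */
                while ((work = dequeue_or_sleep(self, wq)))
                        work->func(work);
        }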

Signed-off-by: Riccardo Mancini <[email protected]>
---
tools/perf/util/workqueue/workqueue.c | 376 +++++++++++++++++++++++++-
1 file changed, 374 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/workqueue/workqueue.c b/tools/perf/util/workqueue/workqueue.c
index 053aac43e038f0b7..a2747fcc004ab0d1 100644
--- a/tools/perf/util/workqueue/workqueue.c
+++ b/tools/perf/util/workqueue/workqueue.c
@@ -13,6 +13,21 @@
#include <internal/lib.h>
#include "workqueue.h"

+enum worker_msg {
+ WORKER_MSG__UNDEFINED,
+ WORKER_MSG__READY, /* from worker: ack */
+ WORKER_MSG__WAKE, /* to worker: wake up */
+ WORKER_MSG__STOP, /* to worker: exit */
+ WORKER_MSG__ERROR,
+ WORKER_MSG__MAX
+};
+
+enum worker_status {
+ WORKER_STATUS__IDLE, /* worker is sleeping, waiting for signal */
+ WORKER_STATUS__BUSY, /* worker is executing */
+ WORKER_STATUS__MAX
+};
+
struct workqueue_struct {
pthread_mutex_t lock; /* locking of the workqueue */
pthread_cond_t idle_cond; /* all workers are idle cond */
@@ -22,6 +37,7 @@ struct workqueue_struct {
struct list_head busy_list; /* busy workers */
struct list_head idle_list; /* idle workers */
int msg_pipe[2]; /* main thread comm pipes */
+ struct worker **workers; /* array of all workers */
};

static const char * const workqueue_errno_str[] = {
@@ -34,6 +50,310 @@ static const char * const workqueue_errno_str[] = {
"Received unexpected message from worker",
};

+struct worker {
+ pthread_mutex_t lock; /* locking of the worker */
+ int tidx; /* idx of thread in pool */
+ struct list_head entry; /* in idle or busy list */
+ struct work_struct *current_work; /* work being processed */
+ int msg_pipe[2]; /* main thread comm pipes*/
+ struct list_head queue; /* pending work items */
+ enum worker_status status; /* worker status */
+};
+
+#define for_each_busy_worker(wq, m_worker) \
+ list_for_each_entry(m_worker, &wq->busy_list, entry)
+
+#define for_each_idle_worker(wq, m_worker) \
+ list_for_each_entry(m_worker, &wq->idle_list, entry)
+
+static inline int lock_workqueue(struct workqueue_struct *wq)
+__acquires(&wq->lock)
+{
+ __acquire(&wq->lock);
+ return pthread_mutex_lock(&wq->lock);
+}
+
+static inline int unlock_workqueue(struct workqueue_struct *wq)
+__releases(&wq->lock)
+{
+ __release(&wq->lock);
+ return pthread_mutex_unlock(&wq->lock);
+}
+
+static inline int lock_worker(struct worker *worker)
+__acquires(&worker->lock)
+{
+ __acquire(&worker->lock);
+ return pthread_mutex_lock(&worker->lock);
+}
+
+static inline int unlock_worker(struct worker *worker)
+__releases(&worker->lock)
+{
+ __release(&worker->lock);
+ return pthread_mutex_unlock(&worker->lock);
+}
+
+/**
+ * available_work - check if worker @worker has work to do
+ */
+static int available_work(struct worker *worker)
+__must_hold(&worker->lock)
+{
+ return !list_empty(&worker->queue);
+}
+
+/**
+ * dequeue_work - retrieve the next work in worker @worker's queue
+ *
+ * Called inside worker.
+ */
+static struct work_struct *dequeue_work(struct worker *worker)
+__must_hold(&worker->lock)
+{
+ struct work_struct *work = list_first_entry(&worker->queue, struct work_struct, entry);
+
+ list_del_init(&work->entry);
+ return work;
+}
+
+/**
+ * spinup_worker - start worker underlying thread and wait for it
+ *
+ * This function MUST NOT hold any lock and can be called only from main thread.
+ */
+static int spinup_worker(struct workqueue_struct *wq, int tidx)
+{
+ int ret;
+ enum worker_msg msg = WORKER_MSG__UNDEFINED;
+ char sbuf[STRERR_BUFSIZE];
+
+ wq->pool_errno = threadpool__start_thread(wq->pool, tidx);
+ if (wq->pool_errno)
+ return -WORKQUEUE_ERROR__POOLSTARTTHREAD;
+
+ ret = readn(wq->msg_pipe[0], &msg, sizeof(msg));
+ if (ret < 0) {
+ pr_debug("workqueue: error receiving ack: %s\n",
+ str_error_r(errno, sbuf, sizeof(sbuf)));
+ return -WORKQUEUE_ERROR__READPIPE;
+ }
+ if (msg != WORKER_MSG__READY) {
+ pr_debug2("workqueue: received error\n");
+ return -WORKQUEUE_ERROR__INVALIDMSG;
+ }
+
+ pr_debug("workqueue: spinup worker %d\n", tidx);
+
+ return 0;
+}
+
+/**
+ * sleep_worker - worker @worker of workqueue @wq goes to sleep
+ *
+ * Called inside worker.
+ * If this was the last idle thread, signal it to the main thread, in case it
+ * was flushing the workqueue.
+ */
+static void sleep_worker(struct workqueue_struct *wq, struct worker *worker)
+__must_hold(&wq->lock)
+{
+ worker->status = WORKER_STATUS__IDLE;
+ list_move(&worker->entry, &wq->idle_list);
+ if (list_empty(&wq->busy_list))
+ pthread_cond_signal(&wq->idle_cond);
+}
+
+/**
+ * dequeue_or_sleep - check if work is available and dequeue or go to sleep
+ *
+ * Called inside worker.
+ */
+static void dequeue_or_sleep(struct worker *worker, struct workqueue_struct *wq)
+__must_hold(&worker->lock)
+{
+ if (available_work(worker)) {
+ worker->current_work = dequeue_work(worker);
+ pr_debug2("worker[%d]: dequeued work\n", worker->tidx);
+ } else {
+ unlock_worker(worker);
+
+ lock_workqueue(wq);
+ lock_worker(worker);
+
+ // Check if I've been assigned new work in the
+ // meantime
+ if (available_work(worker)) {
+ // yep, no need to sleep
+ worker->current_work = dequeue_work(worker);
+ } else {
+ // nope, I gotta sleep
+ worker->current_work = NULL;
+ sleep_worker(wq, worker);
+ pr_debug2("worker[%d]: going to sleep\n", worker->tidx);
+ }
+ unlock_workqueue(wq);
+ }
+}
+
+
+/**
+ * stop_worker - stop worker @worker
+ *
+ * Called from main thread.
+ * Send stop message to worker @worker.
+ */
+static int stop_worker(struct worker *worker)
+{
+ int ret;
+ enum worker_msg msg;
+ char sbuf[STRERR_BUFSIZE];
+
+ msg = WORKER_MSG__STOP;
+ ret = writen(worker->msg_pipe[1], &msg, sizeof(msg));
+ if (ret < 0) {
+ pr_debug2("workqueue: error sending stop msg: %s\n",
+ str_error_r(errno, sbuf, sizeof(sbuf)));
+ return -WORKQUEUE_ERROR__WRITEPIPE;
+ }
+
+ return 0;
+}
+
+/**
+ * init_worker - init @worker struct
+ * @worker: the struct to init
+ * @tidx: index of the executing thread inside the threadpool
+ */
+static int init_worker(struct worker *worker, int tidx)
+{
+ int ret;
+ char sbuf[STRERR_BUFSIZE];
+
+ if (pipe(worker->msg_pipe)) {
+ pr_debug2("worker[%d]: error opening pipe: %s\n",
+ tidx, str_error_r(errno, sbuf, sizeof(sbuf)));
+ return -ENOMEM;
+ }
+
+ worker->tidx = tidx;
+ worker->current_work = NULL;
+ worker->status = WORKER_STATUS__IDLE;
+ INIT_LIST_HEAD(&worker->entry);
+ INIT_LIST_HEAD(&worker->queue);
+
+ ret = pthread_mutex_init(&worker->lock, NULL);
+ if (ret)
+ return -ret;
+
+ return 0;
+}
+
+/**
+ * fini_worker - deallocate resources used by @worker struct
+ */
+static void fini_worker(struct worker *worker)
+{
+ close(worker->msg_pipe[0]);
+ worker->msg_pipe[0] = -1;
+ close(worker->msg_pipe[1]);
+ worker->msg_pipe[1] = -1;
+ pthread_mutex_destroy(&worker->lock);
+}
+
+/**
+ * register_worker - add worker to @wq->idle_list
+ */
+static void register_worker(struct workqueue_struct *wq, struct worker *worker)
+__must_hold(&wq->lock)
+{
+ list_move(&worker->entry, &wq->idle_list);
+ wq->workers[worker->tidx] = worker;
+}
+
+/**
+ * unregister_worker - remove worker from @wq->idle_list
+ */
+static void unregister_worker(struct workqueue_struct *wq __maybe_unused,
+ struct worker *worker)
+__must_hold(&wq->lock)
+{
+ list_del_init(&worker->entry);
+ wq->workers[worker->tidx] = NULL;
+}
+
+/**
+ * worker_thread - worker function executed on threadpool
+ */
+static void worker_thread(int tidx, struct task_struct *task)
+{
+ struct workqueue_struct *wq = container_of(task, struct workqueue_struct, task);
+ char sbuf[STRERR_BUFSIZE];
+ struct worker this_worker;
+ enum worker_msg msg;
+ int ret, init_err = init_worker(&this_worker, tidx);
+
+ if (init_err) {
+ // send error message to main thread
+ msg = WORKER_MSG__ERROR;
+ } else {
+ lock_workqueue(wq);
+ register_worker(wq, &this_worker);
+ unlock_workqueue(wq);
+
+ // ack worker creation
+ msg = WORKER_MSG__READY;
+ }
+
+ ret = writen(wq->msg_pipe[1], &msg, sizeof(msg));
+ if (ret < 0) {
+ pr_debug("worker[%d]: error sending msg: %s\n",
+ tidx, str_error_r(errno, sbuf, sizeof(sbuf)));
+
+ if (init_err)
+ return;
+ goto out;
+ }
+
+ // stop if there have been errors in init
+ if (init_err)
+ return;
+
+ for (;;) {
+ msg = WORKER_MSG__UNDEFINED;
+ ret = readn(this_worker.msg_pipe[0], &msg, sizeof(msg));
+ if (ret < 0 || (msg != WORKER_MSG__WAKE && msg != WORKER_MSG__STOP)) {
+ pr_debug("worker[%d]: error receiving msg: %s\n",
+ tidx, str_error_r(errno, sbuf, sizeof(sbuf)));
+ break;
+ }
+
+ if (msg == WORKER_MSG__STOP)
+ break;
+
+ // main thread takes care of moving to busy list and appending
+ // work to list
+
+ for (;;) {
+ lock_worker(&this_worker);
+ dequeue_or_sleep(&this_worker, wq);
+ unlock_worker(&this_worker);
+
+ if (!this_worker.current_work)
+ break;
+
+ this_worker.current_work->func(this_worker.current_work);
+ };
+ }
+
+out:
+ lock_workqueue(wq);
+ unregister_worker(wq, &this_worker);
+ unlock_workqueue(wq);
+
+ fini_worker(&this_worker);
+}
+
/**
* create_workqueue - create a workqueue with @nr_threads threads
*
@@ -41,7 +361,8 @@ static const char * const workqueue_errno_str[] = {
*/
struct workqueue_struct *create_workqueue(int nr_threads)
{
- int ret, err = 0;
+ int ret, err = 0, t;
+ struct worker *worker;
struct workqueue_struct *wq = zalloc(sizeof(struct workqueue_struct));

if (!wq) {
@@ -56,10 +377,16 @@ struct workqueue_struct *create_workqueue(int nr_threads)
goto out_free_wq;
}

+ wq->workers = calloc(nr_threads, sizeof(*wq->workers));
+ if (!wq->workers) {
+ err = -ENOMEM;
+ goto out_delete_pool;
+ }
+
ret = pthread_mutex_init(&wq->lock, NULL);
if (ret) {
err = -ret;
- goto out_delete_pool;
+ goto out_free_workers;
}

ret = pthread_cond_init(&wq->idle_cond, NULL);
@@ -77,12 +404,41 @@ struct workqueue_struct *create_workqueue(int nr_threads)
goto out_destroy_cond;
}

+ wq->task.fn = worker_thread;
+
+ wq->pool_errno = threadpool__execute(wq->pool, &wq->task);
+ if (wq->pool_errno) {
+ err = -WORKQUEUE_ERROR__POOLEXE;
+ goto out_close_pipe;
+ }
+
+ for (t = 0; t < nr_threads; t++) {
+ err = spinup_worker(wq, t);
+ if (err)
+ goto out_stop_pool;
+ }
+
return wq;

+out_stop_pool:
+ lock_workqueue(wq);
+ for_each_idle_worker(wq, worker) {
+ ret = stop_worker(worker);
+ if (ret)
+ err = ret;
+ }
+ unlock_workqueue(wq);
+out_close_pipe:
+ close(wq->msg_pipe[0]);
+ wq->msg_pipe[0] = -1;
+ close(wq->msg_pipe[1]);
+ wq->msg_pipe[1] = -1;
out_destroy_cond:
pthread_cond_destroy(&wq->idle_cond);
out_destroy_mutex:
pthread_mutex_destroy(&wq->lock);
+out_free_workers:
+ free(wq->workers);
out_delete_pool:
threadpool__delete(wq->pool);
out_free_wq:
@@ -96,12 +452,27 @@ struct workqueue_struct *create_workqueue(int nr_threads)
*/
int destroy_workqueue(struct workqueue_struct *wq)
{
+ struct worker *worker;
int err = 0, ret;
char sbuf[STRERR_BUFSIZE];

if (IS_ERR_OR_NULL(wq))
return 0;

+ lock_workqueue(wq);
+ for_each_idle_worker(wq, worker) {
+ ret = stop_worker(worker);
+ if (ret)
+ err = ret;
+ }
+ unlock_workqueue(wq);
+
+ wq->pool_errno = threadpool__stop(wq->pool);
+ if (wq->pool_errno) {
+ pr_debug2("workqueue: error stopping threadpool\n");
+ err = -WORKQUEUE_ERROR__POOLSTOP;
+ }
+
threadpool__delete(wq->pool);
wq->pool = NULL;

@@ -125,6 +496,7 @@ int destroy_workqueue(struct workqueue_struct *wq)
close(wq->msg_pipe[1]);
wq->msg_pipe[1] = -1;

+ zfree(&wq->workers);
free(wq);
return err;
}
--
2.31.1

2021-08-20 10:57:37

by Riccardo Mancini

Subject: [RFC PATCH v3 11/15] perf workqueue: add utility to execute a for loop in parallel

This patch adds parallel_for, which executes a given function inside
the workqueue, taking care of creating, submitting, and waiting for
the work items.
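
For reference, a minimal usage sketch (hypothetical caller; the IS_ERR
check on create_workqueue and other error handling are omitted):

	static void fill_fn(int i, void *args)
	{
		int *array = args;

		array[i] = i + 1;
	}

	static int example(void)
	{
		struct workqueue_struct *wq = create_workqueue(4);
		int array[128] = { 0 };

		/* run fill_fn(0..127) over the pool, ~8 indexes per item */
		parallel_for(wq, 0, 128, 8, fill_fn, array);
		return destroy_workqueue(wq);
	}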

Signed-off-by: Riccardo Mancini <[email protected]>
---
tools/perf/tests/workqueue.c | 89 +++++++++++++++++
tools/perf/util/workqueue/workqueue.c | 135 ++++++++++++++++++++++++++
tools/perf/util/workqueue/workqueue.h | 7 ++
3 files changed, 231 insertions(+)

diff --git a/tools/perf/tests/workqueue.c b/tools/perf/tests/workqueue.c
index 194bab2f3f668ce9..4eb14a75b6c0a3aa 100644
--- a/tools/perf/tests/workqueue.c
+++ b/tools/perf/tests/workqueue.c
@@ -21,6 +21,12 @@ struct workqueue_test_args_t {
int n_work_items;
};

+struct parallel_for_test_args_t {
+ int pool_size;
+ int n_work_items;
+ int work_size;
+};
+
struct test_task {
struct task_struct task;
int n_threads;
@@ -253,6 +259,44 @@ static int __test__workqueue(void *_args)
return ret;
}

+static void test_pfw_fn(int i, void *args)
+{
+ int *array = args;
+
+ dummy_work(i);
+ array[i] = i+1;
+}
+
+static int __test__parallel_for(void *_args)
+{
+ struct parallel_for_test_args_t *args = _args;
+ struct workqueue_struct *wq;
+ int ret, i, pool_size = args->pool_size ?: sysconf(_SC_NPROCESSORS_ONLN);
+ int *array = calloc(args->n_work_items, sizeof(*array));
+
+ TEST_ASSERT_VAL("calloc array failure", array);
+
+ ret = __workqueue__prepare(&wq, pool_size);
+ if (ret)
+ goto out;
+
+ ret = parallel_for(wq, 0, args->n_work_items, args->work_size,
+ test_pfw_fn, array);
+ TEST_ASSERT_VAL("parallel_for failure", ret == 0);
+
+ for (i = 0; i < args->n_work_items; i++)
+ TEST_ASSERT_VAL("failed array check", array[i] == i+1);
+
+ ret = __workqueue__teardown(wq);
+ if (ret)
+ goto out;
+
+out:
+ free(array);
+
+ return ret;
+}
+
static const struct threadpool_test_args_t threadpool_test_args[] = {
{
.pool_size = 1
@@ -305,6 +349,44 @@ static const struct workqueue_test_args_t workqueue_test_args[] = {
}
};

+static const struct parallel_for_test_args_t parallel_for_test_args[] = {
+ {
+ .pool_size = 1,
+ .n_work_items = 1,
+ .work_size = 1
+ },
+ {
+ .pool_size = 1,
+ .n_work_items = 10,
+ .work_size = 3
+ },
+ {
+ .pool_size = 2,
+ .n_work_items = 1,
+ .work_size = 1
+ },
+ {
+ .pool_size = 2,
+ .n_work_items = 100,
+ .work_size = 10
+ },
+ {
+ .pool_size = 16,
+ .n_work_items = 7,
+ .work_size = 2
+ },
+ {
+ .pool_size = 16,
+ .n_work_items = 2789,
+ .work_size = 16
+ },
+ {
+ .pool_size = 0, // sysconf(_SC_NPROCESSORS_ONLN)
+ .n_work_items = 8191,
+ .work_size = 17
+ }
+};
+
struct test_case {
const char *desc;
int (*func)(void *args);
@@ -327,6 +409,13 @@ static struct test_case workqueue_testcase_table[] = {
.args = (void *) workqueue_test_args,
.n_args = (int)ARRAY_SIZE(workqueue_test_args),
.arg_size = sizeof(struct workqueue_test_args_t)
+ },
+ {
+ .desc = "Workqueue parallel-for",
+ .func = __test__parallel_for,
+ .args = (void *) parallel_for_test_args,
+ .n_args = (int)ARRAY_SIZE(parallel_for_test_args),
+ .arg_size = sizeof(struct parallel_for_test_args_t)
}
};

diff --git a/tools/perf/util/workqueue/workqueue.c b/tools/perf/util/workqueue/workqueue.c
index a89370e68bd720c8..7daac65abb5d57d1 100644
--- a/tools/perf/util/workqueue/workqueue.c
+++ b/tools/perf/util/workqueue/workqueue.c
@@ -9,6 +9,7 @@
#include <linux/err.h>
#include <linux/string.h>
#include <linux/zalloc.h>
+#include <linux/kernel.h>
#include "debug.h"
#include <internal/lib.h>
#include "workqueue.h"
@@ -764,3 +765,137 @@ void init_work(struct work_struct *work)
{
INIT_LIST_HEAD(&work->entry);
}
+
+/* Parallel-for utility */
+
+struct parallel_for_work {
+ struct work_struct work; /* work item that is queued */
+ parallel_for_func_t func; /* function to execute for each item */
+ void *args; /* additional args to pass to func */
+ int start; /* first item to execute */
+ int num; /* number of items to execute */
+};
+
+/**
+ * parallel_for_work_fn - execute parallel_for_work.func in parallel
+ *
+ * This function will be executed by workqueue's workers.
+ */
+static void parallel_for_work_fn(struct work_struct *work)
+{
+ struct parallel_for_work *pfw = container_of(work, struct parallel_for_work, work);
+ int i;
+
+ for (i = 0; i < pfw->num; i++)
+ pfw->func(pfw->start+i, pfw->args);
+}
+
+static inline void init_parallel_for_work(struct parallel_for_work *pfw,
+ parallel_for_func_t func, void *args,
+ int start, int num)
+{
+ init_work(&pfw->work);
+ pfw->work.func = parallel_for_work_fn;
+ pfw->func = func;
+ pfw->args = args;
+ pfw->start = start;
+ pfw->num = num;
+
+ pr_debug2("pfw: start=%d, num=%d\n", start, num);
+}
+
+/**
+ * parallel_for - execute @func in parallel over indexes between @from and @to
+ * @wq: workqueue that will run @func in parallel
+ * @from: first index
+ * @to: last index (excluded)
+ * @work_size: number of indexes to handle on the same work item.
+ * ceil((to-from)/work_size) work items will be added to @wq
+ * NB: this is only a hint. The function will reduce the size of
+ * the work items to fill all workers.
+ * @func: function to execute in parallel
+ * @args: additional arguments to @func
+ *
+ * This function is equivalent to:
+ * for (i = from; i < to; i++) {
+ * // parallel
+ * func(i, args);
+ * }
+ * // sync
+ *
+ * This function takes care of:
+ * - creating balanced work items to submit to workqueue
+ * - submitting the work items to the workqueue
+ * - waiting for completion of the work items
+ * - cleanup of the work items
+ */
+int parallel_for(struct workqueue_struct *wq, int from, int to, int work_size,
+ parallel_for_func_t func, void *args)
+{
+ int n = to-from;
+ int n_work_items;
+ int nr_threads = workqueue_nr_threads(wq);
+ int i, j, start, num, m, base, num_per_item;
+ struct parallel_for_work *pfw_array;
+ int ret, err = 0;
+
+ if (work_size <= 0) {
+ pr_debug("workqueue parallel-for: work_size must be >0\n");
+ return -EINVAL;
+ }
+
+ if (to < from) {
+ pr_debug("workqueue parallel-for: to must be >= from\n");
+ return -EINVAL;
+ } else if (to == from) {
+ pr_debug2("workqueue parallel-for: skip since from == to\n");
+ return 0;
+ }
+
+ n_work_items = DIV_ROUND_UP(n, work_size);
+ if (n_work_items < nr_threads)
+ n_work_items = min(n, nr_threads);
+
+ pfw_array = calloc(n_work_items, sizeof(*pfw_array));
+
+ if (!pfw_array) {
+ pr_debug2("%s: error allocating pfw_array\n", __func__);
+ return -ENOMEM;
+ }
+
+ num_per_item = n / n_work_items;
+ m = n % n_work_items;
+
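+ /*
+ * Split the remainder: the first m work items process num_per_item+1
+ * indexes each, the remaining ones just num_per_item.
+ */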
+ for (i = 0; i < m; i++) {
+ num = num_per_item + 1;
+ start = i * num;
+ init_parallel_for_work(&pfw_array[i], func, args, start, num);
+ ret = queue_work(wq, &pfw_array[i].work);
+ if (ret) {
+ err = ret;
+ goto out;
+ }
+ }
+ if (i != 0)
+ base = pfw_array[i-1].start + pfw_array[i-1].num;
+ else
+ base = 0;
+ for (j = i; j < n_work_items; j++) {
+ num = num_per_item;
+ start = base + (j - i) * num;
+ init_parallel_for_work(&pfw_array[j], func, args, start, num);
+ ret = queue_work(wq, &pfw_array[j].work);
+ if (ret) {
+ err = ret;
+ goto out;
+ }
+ }
+
+out:
+ ret = flush_workqueue(wq);
+ if (ret)
+ err = ret;
+
+ free(pfw_array);
+ return err;
+}
diff --git a/tools/perf/util/workqueue/workqueue.h b/tools/perf/util/workqueue/workqueue.h
index fc6166757f0e1d0d..7a0eda923df25d85 100644
--- a/tools/perf/util/workqueue/workqueue.h
+++ b/tools/perf/util/workqueue/workqueue.h
@@ -30,6 +30,13 @@ extern int flush_workqueue(struct workqueue_struct *wq);

extern void init_work(struct work_struct *work);

+/* parallel_for utility */
+
+typedef void (*parallel_for_func_t)(int i, void *args);
+
+extern int parallel_for(struct workqueue_struct *wq, int from, int to, int work_size,
+ parallel_for_func_t func, void *args);
+
/* Global workqueue */

extern struct workqueue_struct *global_wq;
--
2.31.1

2021-08-20 10:57:51

by Riccardo Mancini

Subject: [RFC PATCH v3 12/15] perf record: setup global workqueue

This patch initializes the global workqueue in perf-record when more
than one synthesis thread is requested (nr_threads_synthesize > 1).

This patch is a preparation for using the global workqueue in
synthesis.
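
This and the companion patches 13-14/15 follow the same pattern
(sketch only; run() stands in for __cmd_record, __cmd_top, ...):

	if (nr_threads > 1)
		setup_global_workqueue(nr_threads);

	err = run();

	if (nr_threads > 1)
		teardown_global_workqueue();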

Signed-off-by: Riccardo Mancini <[email protected]>
---
tools/perf/builtin-record.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 548c1dbde6c52ed6..4d7b610b1d0bb9af 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -49,6 +49,7 @@
#include "util/clockid.h"
#include "util/pmu-hybrid.h"
#include "util/evlist-hybrid.h"
+#include "util/workqueue/workqueue.h"
#include "asm/bug.h"
#include "perf.h"

@@ -2894,7 +2895,21 @@ int cmd_record(int argc, const char **argv)
rec->opts.comp_level = comp_level_max;
pr_debug("comp level: %d\n", rec->opts.comp_level);

+ if (rec->opts.nr_threads_synthesize == UINT_MAX)
+ rec->opts.nr_threads_synthesize = sysconf(_SC_NPROCESSORS_ONLN);
+ if (rec->opts.nr_threads_synthesize > 1) {
+ err = setup_global_workqueue(rec->opts.nr_threads_synthesize);
+ if (err) {
+ create_workqueue_strerror(global_wq, errbuf, sizeof(errbuf));
+ pr_err("setup_global_workqueue: %s\n", errbuf);
+ goto out;
+ }
+ }
+
err = __cmd_record(&record, argc, argv);
+
+ if (rec->opts.nr_threads_synthesize > 1)
+ teardown_global_workqueue();
out:
bitmap_free(rec->affinity_mask.bits);
evlist__delete(rec->evlist);
--
2.31.1

2021-08-20 10:57:51

by Riccardo Mancini

Subject: [RFC PATCH v3 13/15] perf top: setup global workqueue

This patch initializes the global workqueue in perf-top if
nr_threads_synthesize is set.

This patch is a preparation for using the global_workqueue in
synthesize.

Signed-off-by: Riccardo Mancini <[email protected]>
---
tools/perf/builtin-top.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index a3ae9176a83e2453..9b4f220920780a95 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -49,6 +49,7 @@
#include "util/term.h"
#include "util/intlist.h"
#include "util/parse-branch-options.h"
+#include "util/workqueue/workqueue.h"
#include "arch/common.h"
#include "ui/ui.h"

@@ -1767,8 +1768,23 @@ int cmd_top(int argc, const char **argv)
opts->no_bpf_event = true;
}

+ if (top.nr_threads_synthesize == UINT_MAX)
+ top.nr_threads_synthesize = sysconf(_SC_NPROCESSORS_ONLN);
+ if (top.nr_threads_synthesize > 1) {
+ status = setup_global_workqueue(top.nr_threads_synthesize);
+ if (status) {
+ create_workqueue_strerror(global_wq, errbuf, sizeof(errbuf));
+ pr_err("setup_global_workqueue: %s\n", errbuf);
+ goto out_stop_sb_th;
+ }
+ }
+
status = __cmd_top(&top);

+ if (top.nr_threads_synthesize > 1)
+ teardown_global_workqueue();
+
+out_stop_sb_th:
if (!opts->no_bpf_event)
evlist__stop_sb_thread(top.sb_evlist);

--
2.31.1

2021-08-20 10:57:51

by Riccardo Mancini

Subject: [RFC PATCH v3 08/15] perf workqueue: add queue_work and flush_workqueue functions

This patch adds functions to queue work_structs and wait for their
completion, plus related tests.

When a new work item is added, the workqueue first checks whether
there is an idle thread to wake up. If so, that thread is woken up
with the given work item; otherwise, the next thread in round-robin
order is picked and the work item is appended to its queue. A thread
that empties its queue goes back to sleep.

The round-robin mechanism is implemented through the next_worker
attribute, which points to the next worker to be chosen for queueing.
When work is assigned to that worker or when the worker goes to sleep,
the pointer is moved to the next worker in the busy_list, if any.
When a worker is woken up, it is added to the busy list just before
next_worker, so that it will be chosen last (it has just been assigned
a work item).
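
As a usage sketch (the my_work wrapper, results array, and example()
are hypothetical; error handling omitted):

	struct my_work {
		struct work_struct work;
		int id;
	};

	static int results[64];

	static void my_work_fn(struct work_struct *work)
	{
		struct my_work *mw = container_of(work, struct my_work, work);

		results[mw->id] = mw->id;
	}

	static int example(struct workqueue_struct *wq)
	{
		struct my_work w = { .id = 42 };

		init_work(&w.work);
		w.work.func = my_work_fn;
		queue_work(wq, &w.work);    /* wake an idle worker or queue RR */
		return flush_workqueue(wq); /* block until all workers are idle */
	}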

Signed-off-by: Riccardo Mancini <[email protected]>
---
tools/perf/tests/workqueue.c | 71 ++++++++++-
tools/perf/util/workqueue/workqueue.c | 176 +++++++++++++++++++++++++-
tools/perf/util/workqueue/workqueue.h | 9 ++
3 files changed, 254 insertions(+), 2 deletions(-)

diff --git a/tools/perf/tests/workqueue.c b/tools/perf/tests/workqueue.c
index 1aa6ee788b0b1c32..194bab2f3f668ce9 100644
--- a/tools/perf/tests/workqueue.c
+++ b/tools/perf/tests/workqueue.c
@@ -147,6 +147,28 @@ static int __test__threadpool(void *_args)
return ret;
}

+struct test_work {
+ struct work_struct work;
+ int i;
+ int *array;
+};
+
+static void test_work_fn1(struct work_struct *work)
+{
+ struct test_work *mwork = container_of(work, struct test_work, work);
+
+ dummy_work(mwork->i);
+ mwork->array[mwork->i] = mwork->i+1;
+}
+
+static void test_work_fn2(struct work_struct *work)
+{
+ struct test_work *mwork = container_of(work, struct test_work, work);
+
+ dummy_work(mwork->i);
+ mwork->array[mwork->i] = mwork->i*2;
+}
+
static int __workqueue__prepare(struct workqueue_struct **wq,
int pool_size)
{
@@ -166,21 +188,68 @@ static int __workqueue__teardown(struct workqueue_struct *wq)
return 0;
}

+static int __workqueue__exec_wait(struct workqueue_struct *wq,
+ int *array, struct test_work *works,
+ work_func_t func, int n_work_items)
+{
+ int ret, i;
+
+ for (i = 0; i < n_work_items; i++) {
+ works[i].array = array;
+ works[i].i = i;
+
+ init_work(&works[i].work);
+ works[i].work.func = func;
+ queue_work(wq, &works[i].work);
+ }
+
+ ret = flush_workqueue(wq);
+ TEST_ASSERT_VAL("workqueue flush failure", ret == 0);
+
+ return TEST_OK;
+}
+
+
static int __test__workqueue(void *_args)
{
struct workqueue_test_args_t *args = _args;
struct workqueue_struct *wq;
+ struct test_work *works;
+ int *array;
int pool_size = args->pool_size ?: sysconf(_SC_NPROCESSORS_ONLN);
- int ret = __workqueue__prepare(&wq, pool_size);
+ int i, ret = __workqueue__prepare(&wq, pool_size);

+ if (ret)
+ return ret;
+
+ array = calloc(args->n_work_items, sizeof(*array));
+ TEST_ASSERT_VAL("failed array calloc", array);
+ works = calloc(args->n_work_items, sizeof(*works));
+ TEST_ASSERT_VAL("failed works calloc", works);
+
+ ret = __workqueue__exec_wait(wq, array, works, test_work_fn1,
+ args->n_work_items);
if (ret)
goto out;

+ for (i = 0; i < args->n_work_items; i++)
+ TEST_ASSERT_VAL("failed array check (1)", array[i] == i+1);
+
+ ret = __workqueue__exec_wait(wq, array, works, test_work_fn2,
+ args->n_work_items);
+ if (ret)
+ goto out;
+
+ for (i = 0; i < args->n_work_items; i++)
+ TEST_ASSERT_VAL("failed array check (2)", array[i] == 2*i);
+
ret = __workqueue__teardown(wq);
if (ret)
goto out;

out:
+ free(array);
+ free(works);
return ret;
}

diff --git a/tools/perf/util/workqueue/workqueue.c b/tools/perf/util/workqueue/workqueue.c
index a2747fcc004ab0d1..1092ece9ad39d6d2 100644
--- a/tools/perf/util/workqueue/workqueue.c
+++ b/tools/perf/util/workqueue/workqueue.c
@@ -38,6 +38,7 @@ struct workqueue_struct {
struct list_head idle_list; /* idle workers */
int msg_pipe[2]; /* main thread comm pipes */
struct worker **workers; /* array of all workers */
+ struct worker *next_worker; /* next worker to choose (round robin) */
};

static const char * const workqueue_errno_str[] = {
@@ -48,6 +49,8 @@ static const char * const workqueue_errno_str[] = {
"Error sending message to worker",
"Error receiving message from worker",
"Received unexpected message from worker",
+ "Worker is not ready",
+ "Worker is in an unrecognized status",
};

struct worker {
@@ -94,6 +97,15 @@ __releases(&worker->lock)
return pthread_mutex_unlock(&worker->lock);
}

+static void advance_next_worker(struct workqueue_struct *wq)
+__must_hold(&wq->lock)
+{
+ if (list_is_last(&wq->next_worker->entry, &wq->busy_list))
+ wq->next_worker = list_first_entry(&wq->busy_list, struct worker, entry);
+ else
+ wq->next_worker = list_next_entry(wq->next_worker, entry);
+}
+
/**
* available_work - check if worker @worker has work to do
*/
@@ -159,9 +171,13 @@ static void sleep_worker(struct workqueue_struct *wq, struct worker *worker)
__must_hold(&wq->lock)
{
worker->status = WORKER_STATUS__IDLE;
+ if (wq->next_worker == worker)
+ advance_next_worker(wq);
list_move(&worker->entry, &wq->idle_list);
- if (list_empty(&wq->busy_list))
+ if (list_empty(&wq->busy_list)) {
+ wq->next_worker = NULL;
pthread_cond_signal(&wq->idle_cond);
+ }
}

/**
@@ -196,6 +212,52 @@ __must_hold(&worker->lock)
}
}

+/**
+ * prepare_wake_worker - prepare to wake worker @worker of workqueue @wq, assigning @work to do
+ *
+ * Called from main thread.
+ * Moves worker from idle to busy list and assigns @work to it.
+ * Must call wake_worker outside critical section afterwards.
+ */
+static int prepare_wake_worker(struct workqueue_struct *wq, struct worker *worker,
+ struct work_struct *work)
+__must_hold(&wq->lock)
+__must_hold(&worker->lock)
+{
+ if (wq->next_worker)
+ list_move_tail(&worker->entry, &wq->next_worker->entry);
+ else
+ list_move(&worker->entry, &wq->busy_list);
+ wq->next_worker = worker;
+
+ list_add_tail(&work->entry, &worker->queue);
+ worker->status = WORKER_STATUS__BUSY;
+
+ return 0;
+}
+
+/**
+ * wake_worker - send wake message to worker @worker of workqueue @wq
+ *
+ * Called from main thread.
+ * Must be called after prepare_wake_worker and outside critical section to
+ * reduce time spent inside it
+ */
+static int wake_worker(struct worker *worker)
+{
+ enum worker_msg msg = WORKER_MSG__WAKE;
+ int ret;
+ char sbuf[STRERR_BUFSIZE];
+
+ ret = writen(worker->msg_pipe[1], &msg, sizeof(msg));
+ if (ret < 0) {
+ pr_debug2("wake worker %d: error seding msg: %s\n",
+ worker->tidx, str_error_r(errno, sbuf, sizeof(sbuf)));
+ return -WORKQUEUE_ERROR__WRITEPIPE;
+ }
+
+ return 0;
+}

/**
* stop_worker - stop worker @worker
@@ -418,6 +480,8 @@ struct workqueue_struct *create_workqueue(int nr_threads)
goto out_stop_pool;
}

+ wq->next_worker = NULL;
+
return wq;

out_stop_pool:
@@ -532,6 +596,8 @@ int workqueue_strerror(struct workqueue_struct *wq, int err, char *buf, size_t s
emsg = str_error_r(errno, sbuf, sizeof(sbuf));
return scnprintf(buf, size, "%s: %s.\n", errno_str, emsg);
case -WORKQUEUE_ERROR__INVALIDMSG:
+ case -WORKQUEUE_ERROR__INVALIDWORKERSTATUS:
+ case -WORKQUEUE_ERROR__NOTREADY:
return scnprintf(buf, size, "%s.\n", errno_str);
default:
emsg = str_error_r(err, sbuf, sizeof(sbuf));
@@ -566,3 +632,111 @@ int workqueue_nr_threads(struct workqueue_struct *wq)
{
return threadpool__size(wq->pool);
}
+
+/**
+ * __queue_work_on_worker - add @work to the internal queue of worker @worker
+ *
+ * NB: this function releases the locks so that the notification to the
+ * worker thread can be sent outside the critical section.
+ */
+static int __queue_work_on_worker(struct workqueue_struct *wq __maybe_unused,
+ struct worker *worker, struct work_struct *work)
+__must_hold(&wq->lock)
+__must_hold(&worker->lock)
+__releases(&wq->lock)
+__releases(&worker->lock)
+{
+ int ret;
+
+ switch (worker->status) {
+ case WORKER_STATUS__BUSY:
+ list_add_tail(&work->entry, &worker->queue);
+
+ unlock_worker(worker);
+ unlock_workqueue(wq);
+ pr_debug("workqueue: queued new work item\n");
+ return 0;
+ case WORKER_STATUS__IDLE:
+ ret = prepare_wake_worker(wq, worker, work);
+ unlock_worker(worker);
+ unlock_workqueue(wq);
+ if (ret)
+ return ret;
+ ret = wake_worker(worker);
+ if (!ret)
+ pr_debug("workqueue: woke worker %d\n", worker->tidx);
+ return ret;
+ default:
+ case WORKER_STATUS__MAX:
+ unlock_worker(worker);
+ unlock_workqueue(wq);
+ pr_debug2("workqueue: worker is in unrecognized status %d\n",
+ worker->status);
+ return -WORKQUEUE_ERROR__INVALIDWORKERSTATUS;
+ }
+
+ return 0;
+}
+
+/**
+ * queue_work - add @work to @wq internal queue
+ *
+ * If there are idle threads, one of these will be woken up.
+ * Otherwise, the work is added to the pending list.
+ */
+int queue_work(struct workqueue_struct *wq, struct work_struct *work)
+{
+ struct worker *worker;
+
+ lock_workqueue(wq);
+ if (list_empty(&wq->idle_list)) {
+ worker = wq->next_worker;
+ advance_next_worker(wq);
+ } else {
+ worker = list_first_entry(&wq->idle_list, struct worker, entry);
+ }
+ lock_worker(worker);
+
+ return __queue_work_on_worker(wq, worker, work);
+}
+
+/**
+ * queue_work_on_worker - add @work to worker @tidx internal queue
+ */
+int queue_work_on_worker(int tidx, struct workqueue_struct *wq, struct work_struct *work)
+{
+ lock_workqueue(wq);
+ lock_worker(wq->workers[tidx]);
+ return __queue_work_on_worker(wq, wq->workers[tidx], work);
+}
+
+/**
+ * flush_workqueue - wait for all currently executing and pending work to finish
+ *
+ * This function blocks until all threads become idle.
+ */
+int flush_workqueue(struct workqueue_struct *wq)
+{
+ int err = 0, ret;
+
+ lock_workqueue(wq);
+ while (!list_empty(&wq->busy_list)) {
+ ret = pthread_cond_wait(&wq->idle_cond, &wq->lock);
+ if (ret) {
+ pr_debug2("%s: error in pthread_cond_wait\n", __func__);
+ err = -ret;
+ break;
+ }
+ }
+ unlock_workqueue(wq);
+
+ return err;
+}
+
+/**
+ * init_work - initialize the @work struct
+ */
+void init_work(struct work_struct *work)
+{
+ INIT_LIST_HEAD(&work->entry);
+}
diff --git a/tools/perf/util/workqueue/workqueue.h b/tools/perf/util/workqueue/workqueue.h
index 100841cc035fde1d..37ef84fc9c6ed4b6 100644
--- a/tools/perf/util/workqueue/workqueue.h
+++ b/tools/perf/util/workqueue/workqueue.h
@@ -22,6 +22,13 @@ extern int destroy_workqueue(struct workqueue_struct *wq);

extern int workqueue_nr_threads(struct workqueue_struct *wq);

+extern int queue_work(struct workqueue_struct *wq, struct work_struct *work);
+extern int queue_work_on_worker(int tidx, struct workqueue_struct *wq, struct work_struct *work);
+
+extern int flush_workqueue(struct workqueue_struct *wq);
+
+extern void init_work(struct work_struct *work);
+
#define WORKQUEUE_STRERR_BUFSIZE (128+THREADPOOL_STRERR_BUFSIZE)
#define WORKQUEUE_ERROR__OFFSET 512
enum {
@@ -32,6 +39,8 @@ enum {
WORKQUEUE_ERROR__WRITEPIPE,
WORKQUEUE_ERROR__READPIPE,
WORKQUEUE_ERROR__INVALIDMSG,
+ WORKQUEUE_ERROR__NOTREADY,
+ WORKQUEUE_ERROR__INVALIDWORKERSTATUS,
};
extern int workqueue_strerror(struct workqueue_struct *wq, int err, char *buf, size_t size);
extern int create_workqueue_strerror(struct workqueue_struct *err_ptr, char *buf, size_t size);
--
2.31.1

2021-08-20 10:58:09

by Riccardo Mancini

Subject: [RFC PATCH v3 15/15] perf synthetic-events: use workqueue parallel_for

To generate synthetic events, perf has the option to use multiple
threads. These threads are currently created manually using
pthread_create.

This patch replaces the manual pthread_create with a workqueue,
using the parallel_for utility.

Experimental results show that the workqueue has a slightly higher
overhead, but this is repaid by the improved work balancing among
threads.

Results of perf bench before and after are reported below:
Command: sudo ./perf bench internals synthesize -t
Average synthesis time in usec is reported.

Laptop (2 cores 4 threads i7), avg num events ~21500:
N pthread (before) workqueue (after)
1 121475.200 +- 2227.757 118882.900 +- 1389.398
2 72834.100 +- 1860.677 67668.600 +- 2847.693
3 70650.200 +- 540.096 55694.200 +- 496.155
4 55554.300 +- 259.968 50901.400 +- 434.327

VM (16 vCPU over 16 cores 32 threads Xeon), avg num events ~2920:
N pthread (before) workqueue (after)
1 35182.400 +- 3561.189 37528.300 +- 2972.887
2 29188.400 +- 2191.767 28250.300 +- 1694.575
3 22172.200 +- 788.659 19062.400 +- 611.201
4 21600.700 +- 728.941 16812.900 +- 1085.359
5 19395.800 +- 1070.617 14764.600 +- 1339.113
6 18553.000 +- 1272.486 12814.200 +- 408.462
7 14691.400 +- 485.105 12382.200 +- 464.964
8 16036.400 +- 842.728 15015.000 +- 1648.844
9 15606.800 +- 470.100 13230.800 +- 1288.246
10 15527.000 +- 822.317 12661.800 +- 873.199
11 13097.400 +- 513.870 13082.700 +- 974.378
12 14053.700 +- 592.427 13123.400 +- 1054.939
13 15446.400 +- 765.850 12837.200 +- 770.646
14 14979.400 +- 1056.955 13695.400 +- 1066.302
15 12578.000 +- 846.142 15053.600 +- 992.118
16 12394.800 +- 602.295 13683.700 +- 911.517

Signed-off-by: Riccardo Mancini <[email protected]>
---
tools/perf/bench/synthesize.c | 12 +--
tools/perf/builtin-kvm.c | 2 +-
tools/perf/builtin-record.c | 3 +-
tools/perf/builtin-top.c | 3 +-
tools/perf/builtin-trace.c | 3 +-
tools/perf/tests/mmap-thread-lookup.c | 2 +-
tools/perf/util/synthetic-events.c | 135 +++++++++-----------------
tools/perf/util/synthetic-events.h | 8 +-
8 files changed, 56 insertions(+), 112 deletions(-)

diff --git a/tools/perf/bench/synthesize.c b/tools/perf/bench/synthesize.c
index 738821113a005a6c..f1880116f4375c46 100644
--- a/tools/perf/bench/synthesize.c
+++ b/tools/perf/bench/synthesize.c
@@ -63,7 +63,6 @@ static int do_run_single_threaded(struct perf_session *session,
struct perf_thread_map *threads,
struct target *target, bool data_mmap)
{
- const unsigned int nr_threads_synthesize = 1;
struct timeval start, end, diff;
u64 runtime_us;
unsigned int i;
@@ -81,8 +80,7 @@ static int do_run_single_threaded(struct perf_session *session,
NULL,
target, threads,
process_synthesized_event,
- data_mmap,
- nr_threads_synthesize);
+ data_mmap);
if (err)
return err;

@@ -148,8 +146,7 @@ static int run_single_threaded(void)
return err;
}

-static int do_run_multi_threaded(struct target *target,
- unsigned int nr_threads_synthesize)
+static int do_run_multi_threaded(struct target *target)
{
struct timeval start, end, diff;
u64 runtime_us;
@@ -172,8 +169,7 @@ static int do_run_multi_threaded(struct target *target,
NULL,
target, NULL,
process_synthesized_event,
- false,
- nr_threads_synthesize);
+ false);
if (err) {
perf_session__delete(session);
return err;
@@ -236,7 +232,7 @@ static int run_multi_threaded(void)
printf(" Number of synthesis threads: %u\n",
nr_threads_synthesize);

- err = do_run_multi_threaded(&target, nr_threads_synthesize);
+ err = do_run_multi_threaded(&target);
if (err)
return err;

diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
index aa1b127ffb5be047..7afa1a41a627f353 100644
--- a/tools/perf/builtin-kvm.c
+++ b/tools/perf/builtin-kvm.c
@@ -1456,7 +1456,7 @@ static int kvm_events_live(struct perf_kvm_stat *kvm,
perf_session__set_id_hdr_size(kvm->session);
ordered_events__set_copy_on_queue(&kvm->session->ordered_events, true);
machine__synthesize_threads(&kvm->session->machines.host, &kvm->opts.target,
- kvm->evlist->core.threads, false, 1);
+ kvm->evlist->core.threads, false);
err = kvm_live_open_events(kvm);
if (err)
goto out;
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 4d7b610b1d0bb9af..cccc2d0f9977d5b3 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1481,8 +1481,7 @@ static int record__synthesize(struct record *rec, bool tail)
}

err = __machine__synthesize_threads(machine, tool, &opts->target, rec->evlist->core.threads,
- f, opts->sample_address,
- rec->opts.nr_threads_synthesize);
+ f, opts->sample_address);

if (rec->opts.nr_threads_synthesize > 1)
perf_set_singlethreaded();
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 9b4f220920780a95..36cd1294d9b4ebd3 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -1272,8 +1272,7 @@ static int __cmd_top(struct perf_top *top)
pr_debug("Couldn't synthesize cgroup events.\n");

machine__synthesize_threads(&top->session->machines.host, &opts->target,
- top->evlist->core.threads, false,
- top->nr_threads_synthesize);
+ top->evlist->core.threads, false);

if (top->nr_threads_synthesize > 1)
perf_set_singlethreaded();
diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 2bf21194c7b3959e..e2b50ba55a5ea93d 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1628,8 +1628,7 @@ static int trace__symbols_init(struct trace *trace, struct evlist *evlist)
goto out;

err = __machine__synthesize_threads(trace->host, &trace->tool, &trace->opts.target,
- evlist->core.threads, trace__tool_process, false,
- 1);
+ evlist->core.threads, trace__tool_process, false);
out:
if (err)
symbol__exit();
diff --git a/tools/perf/tests/mmap-thread-lookup.c b/tools/perf/tests/mmap-thread-lookup.c
index 8d9d4cbff76d17d5..809be9510e849d1b 100644
--- a/tools/perf/tests/mmap-thread-lookup.c
+++ b/tools/perf/tests/mmap-thread-lookup.c
@@ -135,7 +135,7 @@ static int synth_all(struct machine *machine)
{
return perf_event__synthesize_threads(NULL,
perf_event__process,
- machine, 0, 1);
+ machine, 0);
}

static int synth_process(struct machine *machine)
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index a7e981b2d7decd3b..5f41e2f9579e3f77 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -23,6 +23,7 @@
#include <linux/string.h>
#include <linux/zalloc.h>
#include <linux/perf_event.h>
+#include <linux/err.h>
#include <asm/bug.h>
#include <perf/evsel.h>
#include <perf/cpumap.h>
@@ -42,6 +43,8 @@
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
+#include "util/workqueue/workqueue.h"
+#include "util/util.h"

#define DEFAULT_PROC_MAP_PARSE_TIMEOUT 500

@@ -883,16 +886,13 @@ static int __perf_event__synthesize_threads(struct perf_tool *tool,
perf_event__handler_t process,
struct machine *machine,
bool mmap_data,
- struct dirent **dirent,
- int start,
- int num)
+ char *d_name)
{
union perf_event *comm_event, *mmap_event, *fork_event;
union perf_event *namespaces_event;
int err = -1;
char *end;
pid_t pid;
- int i;

comm_event = malloc(sizeof(comm_event->comm) + machine->id_hdr_size);
if (comm_event == NULL)
@@ -912,24 +912,22 @@ static int __perf_event__synthesize_threads(struct perf_tool *tool,
if (namespaces_event == NULL)
goto out_free_fork;

- for (i = start; i < start + num; i++) {
- if (!isdigit(dirent[i]->d_name[0]))
- continue;
+ if (!isdigit(d_name[0]))
+ goto out_free_namespaces;

- pid = (pid_t)strtol(dirent[i]->d_name, &end, 10);
- /* only interested in proper numerical dirents */
- if (*end)
- continue;
- /*
- * We may race with exiting thread, so don't stop just because
- * one thread couldn't be synthesized.
- */
- __event__synthesize_thread(comm_event, mmap_event, fork_event,
- namespaces_event, pid, 1, process,
- tool, machine, mmap_data);
- }
+ pid = (pid_t)strtol(d_name, &end, 10);
+ /* only interested in proper numerical dirents */
+ if (*end)
+ goto out_free_namespaces;
+ /*
+ * We may race with exiting thread, so don't stop just because
+ * one thread couldn't be synthesized.
+ */
+ __event__synthesize_thread(comm_event, mmap_event, fork_event,
+ namespaces_event, pid, 1, process,
+ tool, machine, mmap_data);
err = 0;
-
+out_free_namespaces:
free(namespaces_event);
out_free_fork:
free(fork_event);
@@ -947,36 +945,28 @@ struct synthesize_threads_arg {
struct machine *machine;
bool mmap_data;
struct dirent **dirent;
- int num;
- int start;
};

-static void *synthesize_threads_worker(void *arg)
+static void synthesize_threads_worker(int i, void *arg)
{
struct synthesize_threads_arg *args = arg;

__perf_event__synthesize_threads(args->tool, args->process,
args->machine, args->mmap_data,
- args->dirent,
- args->start, args->num);
- return NULL;
+ args->dirent[i]->d_name);
}

int perf_event__synthesize_threads(struct perf_tool *tool,
perf_event__handler_t process,
struct machine *machine,
- bool mmap_data,
- unsigned int nr_threads_synthesize)
+ bool mmap_data)
{
- struct synthesize_threads_arg *args = NULL;
- pthread_t *synthesize_threads = NULL;
+ struct synthesize_threads_arg args;
char proc_path[PATH_MAX];
struct dirent **dirent;
- int num_per_thread;
- int m, n, i, j;
- int thread_nr;
- int base = 0;
- int err = -1;
+ int n, i;
+ int err = -1, ret;
+ char err_buf[WORKQUEUE_STRERR_BUFSIZE];


if (machine__is_default_guest(machine))
@@ -987,60 +977,27 @@ int perf_event__synthesize_threads(struct perf_tool *tool,
if (n < 0)
return err;

- if (nr_threads_synthesize == UINT_MAX)
- thread_nr = sysconf(_SC_NPROCESSORS_ONLN);
- else
- thread_nr = nr_threads_synthesize;
-
- if (thread_nr <= 1) {
- err = __perf_event__synthesize_threads(tool, process,
- machine, mmap_data,
- dirent, base, n);
+ if (perf_singlethreaded) {
+ for (i = 0; i < n; i++)
+ err = __perf_event__synthesize_threads(tool, process,
+ machine, mmap_data,
+ dirent[i]->d_name);
goto free_dirent;
}
- if (thread_nr > n)
- thread_nr = n;
-
- synthesize_threads = calloc(sizeof(pthread_t), thread_nr);
- if (synthesize_threads == NULL)
- goto free_dirent;

- args = calloc(sizeof(*args), thread_nr);
- if (args == NULL)
- goto free_threads;
+ args.tool = tool;
+ args.process = process;
+ args.machine = machine;
+ args.mmap_data = mmap_data;
+ args.dirent = dirent;

- num_per_thread = n / thread_nr;
- m = n % thread_nr;
- for (i = 0; i < thread_nr; i++) {
- args[i].tool = tool;
- args[i].process = process;
- args[i].machine = machine;
- args[i].mmap_data = mmap_data;
- args[i].dirent = dirent;
- }
- for (i = 0; i < m; i++) {
- args[i].num = num_per_thread + 1;
- args[i].start = i * args[i].num;
- }
- if (i != 0)
- base = args[i-1].start + args[i-1].num;
- for (j = i; j < thread_nr; j++) {
- args[j].num = num_per_thread;
- args[j].start = base + (j - i) * args[i].num;
+ err = parallel_for(global_wq, 0, n, 1, synthesize_threads_worker, &args);
+ if (err) {
+ ret = workqueue_strerror(global_wq, err, err_buf, sizeof(err_buf));
+ pr_err("parallel_for: %s\n",
+ ret <= 0 ? "Error generating error msg" : err_buf);
}

- for (i = 0; i < thread_nr; i++) {
- if (pthread_create(&synthesize_threads[i], NULL,
- synthesize_threads_worker, &args[i]))
- goto out_join;
- }
- err = 0;
-out_join:
- for (i = 0; i < thread_nr; i++)
- pthread_join(synthesize_threads[i], NULL);
- free(args);
-free_threads:
- free(synthesize_threads);
free_dirent:
for (i = 0; i < n; i++)
zfree(&dirent[i]);
@@ -1775,26 +1732,22 @@ int perf_event__synthesize_id_index(struct perf_tool *tool, perf_event__handler_

int __machine__synthesize_threads(struct machine *machine, struct perf_tool *tool,
struct target *target, struct perf_thread_map *threads,
- perf_event__handler_t process, bool data_mmap,
- unsigned int nr_threads_synthesize)
+ perf_event__handler_t process, bool data_mmap)
{
if (target__has_task(target))
return perf_event__synthesize_thread_map(tool, threads, process, machine, data_mmap);
else if (target__has_cpu(target))
return perf_event__synthesize_threads(tool, process,
- machine, data_mmap,
- nr_threads_synthesize);
+ machine, data_mmap);
/* command specified */
return 0;
}

int machine__synthesize_threads(struct machine *machine, struct target *target,
- struct perf_thread_map *threads, bool data_mmap,
- unsigned int nr_threads_synthesize)
+ struct perf_thread_map *threads, bool data_mmap)
{
return __machine__synthesize_threads(machine, NULL, target, threads,
- perf_event__process, data_mmap,
- nr_threads_synthesize);
+ perf_event__process, data_mmap);
}

static struct perf_record_event_update *event_update_event__new(size_t size, u64 type, u64 id)
diff --git a/tools/perf/util/synthetic-events.h b/tools/perf/util/synthetic-events.h
index c845e2b9b444df57..cd3c43451a237563 100644
--- a/tools/perf/util/synthetic-events.h
+++ b/tools/perf/util/synthetic-events.h
@@ -54,7 +54,7 @@ int perf_event__synthesize_stat_round(struct perf_tool *tool, u64 time, u64 type
int perf_event__synthesize_stat(struct perf_tool *tool, u32 cpu, u32 thread, u64 id, struct perf_counts_values *count, perf_event__handler_t process, struct machine *machine);
int perf_event__synthesize_thread_map2(struct perf_tool *tool, struct perf_thread_map *threads, perf_event__handler_t process, struct machine *machine);
int perf_event__synthesize_thread_map(struct perf_tool *tool, struct perf_thread_map *threads, perf_event__handler_t process, struct machine *machine, bool mmap_data);
-int perf_event__synthesize_threads(struct perf_tool *tool, perf_event__handler_t process, struct machine *machine, bool mmap_data, unsigned int nr_threads_synthesize);
+int perf_event__synthesize_threads(struct perf_tool *tool, perf_event__handler_t process, struct machine *machine, bool mmap_data);
int perf_event__synthesize_tracing_data(struct perf_tool *tool, int fd, struct evlist *evlist, perf_event__handler_t process);
int perf_event__synth_time_conv(const struct perf_event_mmap_page *pc, struct perf_tool *tool, perf_event__handler_t process, struct machine *machine);
pid_t perf_event__synthesize_comm(struct perf_tool *tool, union perf_event *event, pid_t pid, perf_event__handler_t process, struct machine *machine);
@@ -65,11 +65,9 @@ size_t perf_event__sample_event_size(const struct perf_sample *sample, u64 type,

int __machine__synthesize_threads(struct machine *machine, struct perf_tool *tool,
struct target *target, struct perf_thread_map *threads,
- perf_event__handler_t process, bool data_mmap,
- unsigned int nr_threads_synthesize);
+ perf_event__handler_t process, bool data_mmap);
int machine__synthesize_threads(struct machine *machine, struct target *target,
- struct perf_thread_map *threads, bool data_mmap,
- unsigned int nr_threads_synthesize);
+ struct perf_thread_map *threads, bool data_mmap);

#ifdef HAVE_AUXTRACE_SUPPORT
int perf_event__synthesize_auxtrace_info(struct auxtrace_record *itr, struct perf_tool *tool,
--
2.31.1

2021-08-20 10:58:58

by Riccardo Mancini

Subject: [RFC PATCH v3 14/15] perf test/synthesis: setup global workqueue

This patch sets up the global workqueue in the synthesize benchmark.
The next patch will use it for synthesis.

Signed-off-by: Riccardo Mancini <[email protected]>
---
tools/perf/bench/synthesize.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/tools/perf/bench/synthesize.c b/tools/perf/bench/synthesize.c
index 05f7c923c745b4e8..738821113a005a6c 100644
--- a/tools/perf/bench/synthesize.c
+++ b/tools/perf/bench/synthesize.c
@@ -16,6 +16,7 @@
#include "../util/thread_map.h"
#include "../util/tool.h"
#include "../util/util.h"
+#include "../util/workqueue/workqueue.h"
#include <linux/atomic.h>
#include <linux/err.h>
#include <linux/time64.h>
@@ -208,6 +209,7 @@ static int run_multi_threaded(void)
};
unsigned int nr_threads_synthesize;
int err;
+ char errbuf[BUFSIZ];

if (max_threads == UINT_MAX)
max_threads = sysconf(_SC_NPROCESSORS_ONLN);
@@ -219,10 +221,17 @@ static int run_multi_threaded(void)
for (nr_threads_synthesize = min_threads;
nr_threads_synthesize <= max_threads;
nr_threads_synthesize++) {
- if (nr_threads_synthesize == 1)
+ if (nr_threads_synthesize == 1) {
perf_set_singlethreaded();
- else
+ } else {
+ err = setup_global_workqueue(nr_threads_synthesize);
+ if (err) {
+ create_workqueue_strerror(global_wq, errbuf, sizeof(errbuf));
+ pr_err("setup_global_workqueue: %s\n", errbuf);
+ return err;
+ }
perf_set_multithreaded();
+ }

printf(" Number of synthesis threads: %u\n",
nr_threads_synthesize);
@@ -230,6 +239,9 @@ static int run_multi_threaded(void)
err = do_run_multi_threaded(&target, nr_threads_synthesize);
if (err)
return err;
+
+ if (nr_threads_synthesize > 1)
+ teardown_global_workqueue();
}
perf_set_singlethreaded();
return 0;
--
2.31.1

2021-08-20 10:59:16

by Riccardo Mancini

Subject: [RFC PATCH v3 09/15] perf workqueue: spinup threads when needed

This patch adds lazy thread creation to the workqueue.
When a new work item is submitted, an idle worker is searched for
first. If one is found, it is selected for execution. Otherwise, a
thread that has not yet been spawned is searched for; if found, it is
spun up and selected. If neither is found, one of the busy threads is
chosen using a round-robin policy.
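
As a standalone model of the selection order (illustration only, not
the patch code; the real implementation works on the idle_list,
busy_list and first_stopped_worker fields):

	#include <stdbool.h>

	/* return the index of the worker that should get the next item */
	static int pick_worker(const bool *spawned, const bool *idle,
			       int n, int *next_rr)
	{
		int i;

		for (i = 0; i < n; i++)
			if (spawned[i] && idle[i])
				return i;         /* 1: prefer an idle worker */
		for (i = 0; i < n; i++)
			if (!spawned[i])
				return i;         /* 2: else spin up a new one */
		i = *next_rr;                     /* 3: else round-robin */
		*next_rr = (*next_rr + 1) % n;
		return i;
	}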

Signed-off-by: Riccardo Mancini <[email protected]>
---
tools/perf/util/workqueue/workqueue.c | 54 +++++++++++++++++++--------
1 file changed, 38 insertions(+), 16 deletions(-)

diff --git a/tools/perf/util/workqueue/workqueue.c b/tools/perf/util/workqueue/workqueue.c
index 1092ece9ad39d6d2..305a9cda39810b84 100644
--- a/tools/perf/util/workqueue/workqueue.c
+++ b/tools/perf/util/workqueue/workqueue.c
@@ -39,6 +39,7 @@ struct workqueue_struct {
int msg_pipe[2]; /* main thread comm pipes */
struct worker **workers; /* array of all workers */
struct worker *next_worker; /* next worker to choose (round robin) */
+ int first_stopped_worker; /* next worker to start if needed */
};

static const char * const workqueue_errno_str[] = {
@@ -423,8 +424,7 @@ static void worker_thread(int tidx, struct task_struct *task)
*/
struct workqueue_struct *create_workqueue(int nr_threads)
{
- int ret, err = 0, t;
- struct worker *worker;
+ int ret, err = 0;
struct workqueue_struct *wq = zalloc(sizeof(struct workqueue_struct));

if (!wq) {
@@ -474,24 +474,11 @@ struct workqueue_struct *create_workqueue(int nr_threads)
goto out_close_pipe;
}

- for (t = 0; t < nr_threads; t++) {
- err = spinup_worker(wq, t);
- if (err)
- goto out_stop_pool;
- }
-
wq->next_worker = NULL;
+ wq->first_stopped_worker = 0;

return wq;

-out_stop_pool:
- lock_workqueue(wq);
- for_each_idle_worker(wq, worker) {
- ret = stop_worker(worker);
- if (ret)
- err = ret;
- }
- unlock_workqueue(wq);
out_close_pipe:
close(wq->msg_pipe[0]);
wq->msg_pipe[0] = -1;
@@ -686,10 +673,28 @@ __releases(&worker->lock)
*/
int queue_work(struct workqueue_struct *wq, struct work_struct *work)
{
+ int ret;
struct worker *worker;

+repeat:
lock_workqueue(wq);
if (list_empty(&wq->idle_list)) {
+ // find a worker to spin up
+ while (wq->first_stopped_worker < threadpool__size(wq->pool)
+ && wq->workers[wq->first_stopped_worker])
+ wq->first_stopped_worker++;
+
+ // found one
+ if (wq->first_stopped_worker < threadpool__size(wq->pool)) {
+ // spinup_worker does not hold the lock, so that the new thread can register itself
+ unlock_workqueue(wq);
+ ret = spinup_worker(wq, wq->first_stopped_worker);
+ if (ret)
+ return ret;
+ // worker is now in idle_list
+ goto repeat;
+ }
+
worker = wq->next_worker;
advance_next_worker(wq);
} else {
@@ -705,7 +710,24 @@ int queue_work(struct workqueue_struct *wq, struct work_struct *work)
*/
int queue_work_on_worker(int tidx, struct workqueue_struct *wq, struct work_struct *work)
{
+ int ret;
+
lock_workqueue(wq);
+ if (!wq->workers[tidx]) {
+ // spinup_worker does not hold the lock, so that the new thread can register itself
+ unlock_workqueue(wq);
+ ret = spinup_worker(wq, tidx);
+ if (ret)
+ return ret;
+
+ // now recheck if worker is available
+ lock_workqueue(wq);
+ if (!wq->workers[tidx]) {
+ unlock_workqueue(wq);
+ return -WORKQUEUE_ERROR__NOTREADY;
+ }
+ }
+
lock_worker(wq->workers[tidx]);
return __queue_work_on_worker(wq, wq->workers[tidx], work);
}
--
2.31.1

2021-08-20 10:59:28

by Riccardo Mancini

Subject: [RFC PATCH v3 10/15] perf workqueue: create global workqueue

This patch adds a global static workqueue, exposing the same API as
the kernel one (schedule_work, flush_scheduled_work, and so on).

Before use, the global workqueue must be set up with
setup_global_workqueue and, when no longer needed, destroyed with
teardown_global_workqueue.
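
A sketch of the intended usage (reusing the hypothetical my_work
wrapper and my_work_fn from the queue_work example; error handling
mostly omitted):

	static int example(void)
	{
		struct my_work w = { .id = 0 };
		int err = setup_global_workqueue(sysconf(_SC_NPROCESSORS_ONLN));

		if (err)
			return err;

		init_work(&w.work);
		w.work.func = my_work_fn;
		schedule_work(&w.work);      /* queue on global_wq */
		flush_scheduled_work();      /* wait for completion */

		return teardown_global_workqueue();
	}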

Signed-off-by: Riccardo Mancini <[email protected]>
---
tools/perf/util/workqueue/workqueue.c | 2 ++
tools/perf/util/workqueue/workqueue.h | 49 +++++++++++++++++++++++++++
2 files changed, 51 insertions(+)

diff --git a/tools/perf/util/workqueue/workqueue.c b/tools/perf/util/workqueue/workqueue.c
index 305a9cda39810b84..a89370e68bd720c8 100644
--- a/tools/perf/util/workqueue/workqueue.c
+++ b/tools/perf/util/workqueue/workqueue.c
@@ -13,6 +13,8 @@
#include <internal/lib.h>
#include "workqueue.h"

+struct workqueue_struct *global_wq;
+
enum worker_msg {
WORKER_MSG__UNDEFINED,
WORKER_MSG__READY, /* from worker: ack */
diff --git a/tools/perf/util/workqueue/workqueue.h b/tools/perf/util/workqueue/workqueue.h
index 37ef84fc9c6ed4b6..fc6166757f0e1d0d 100644
--- a/tools/perf/util/workqueue/workqueue.h
+++ b/tools/perf/util/workqueue/workqueue.h
@@ -5,6 +5,7 @@
#include <stdlib.h>
#include <sys/types.h>
#include <linux/list.h>
+#include <linux/err.h>
#include "threadpool.h"

struct work_struct;
@@ -29,6 +30,54 @@ extern int flush_workqueue(struct workqueue_struct *wq);

extern void init_work(struct work_struct *work);

+/* Global workqueue */
+
+extern struct workqueue_struct *global_wq;
+
+/**
+ * setup_global_workqueue - create the global_wq
+ */
+static inline int setup_global_workqueue(int nr_threads)
+{
+ global_wq = create_workqueue(nr_threads);
+ return IS_ERR(global_wq) ? PTR_ERR(global_wq) : 0;
+}
+
+/**
+ * teardown_global_workqueue - destroy the global_wq
+ */
+static inline int teardown_global_workqueue(void)
+{
+ int ret = destroy_workqueue(global_wq);
+
+ global_wq = NULL;
+ return ret;
+}
+
+/**
+ * schedule_work - queue @work on the global_wq
+ */
+static inline int schedule_work(struct work_struct *work)
+{
+ return queue_work(global_wq, work);
+}
+
+/**
+ * schedule_work_on_worker - queue @work on thread @tidx of global_wq
+ */
+static inline int schedule_work_on_worker(int tidx, struct work_struct *work)
+{
+ return queue_work_on_worker(tidx, global_wq, work);
+}
+
+/**
+ * flush_scheduled_work - ensure that any scheduled work in global_wq has run to completion
+ */
+static inline int flush_scheduled_work(void)
+{
+ return flush_workqueue(global_wq);
+}
+
#define WORKQUEUE_STRERR_BUFSIZE (128+THREADPOOL_STRERR_BUFSIZE)
#define WORKQUEUE_ERROR__OFFSET 512
enum {
--
2.31.1

2021-08-24 19:28:45

by Namhyung Kim

Subject: Re: [RFC PATCH v3 06/15] perf workqueue: introduce workqueue struct

Hi Riccardo,

On Fri, Aug 20, 2021 at 3:54 AM Riccardo Mancini <[email protected]> wrote:
> +/**
> + * workqueue_strerror - print message regarding lastest error in @wq
> + *
> + * Buffer size should be at least WORKQUEUE_STRERR_BUFSIZE bytes.
> + */
> +int workqueue_strerror(struct workqueue_struct *wq, int err, char *buf, size_t size)
> +{
> + int ret;
> + char sbuf[THREADPOOL_STRERR_BUFSIZE], *emsg;
> + const char *errno_str;
> +
> + errno_str = workqueue_errno_str[-err-WORKQUEUE_ERROR__OFFSET];

It seems easy to crash with an invalid err argument.

> +
> + switch (err) {
> + case -WORKQUEUE_ERROR__POOLNEW:
> + case -WORKQUEUE_ERROR__POOLEXE:
> + case -WORKQUEUE_ERROR__POOLSTOP:
> + case -WORKQUEUE_ERROR__POOLSTARTTHREAD:
> + if (IS_ERR_OR_NULL(wq))
> + return scnprintf(buf, size, "%s: unknown.\n",
> + errno_str);
> +
> + ret = threadpool__strerror(wq->pool, wq->pool_errno, sbuf, sizeof(sbuf));
> + if (ret < 0)
> + return ret;
> + return scnprintf(buf, size, "%s: %s.\n", errno_str, sbuf);
> + case -WORKQUEUE_ERROR__WRITEPIPE:
> + case -WORKQUEUE_ERROR__READPIPE:
> + emsg = str_error_r(errno, sbuf, sizeof(sbuf));

This means the errno should be kept before calling this, right?

> + return scnprintf(buf, size, "%s: %s.\n", errno_str, emsg);
> + case -WORKQUEUE_ERROR__INVALIDMSG:
> + return scnprintf(buf, size, "%s.\n", errno_str);
> + default:
> + emsg = str_error_r(err, sbuf, sizeof(sbuf));
> + return scnprintf(buf, size, "Error: %s", emsg);

Newline at the end?

Thanks,
Namhyung


> + }
> +}
> +

2021-08-24 19:43:31

by Namhyung Kim

Subject: Re: [RFC PATCH v3 08/15] perf workqueue: add queue_work and flush_workqueue functions

On Fri, Aug 20, 2021 at 3:54 AM Riccardo Mancini <[email protected]> wrote:
>
> This patch adds functions to queue and wait work_structs, and
> related tests.
>
> When a new work item is added, the workqueue first checks if there
> are threads to wake up. If so, it wakes it up with the given work item,
> otherwise it will pick the next round-robin thread and queue the work
> item to its queue. A thread which completes its queue will go to sleep.
>
> The round-robin mechanism is implemented through the next_worker
> attibute which will point to the next worker to be chosen for queueing.
> When work is assigned to that worker or when the worker goes to sleep,
> the pointer is moved to the next worker in the busy_list, if any.
> When a worker is woken up, it is added in the busy list just before the
> next_worker, so that it will be chosen as last (it's just been assigned
> a work item).

Do we really need this? I think some of the complexity comes
from it. Can we simply put the works in a list in wq
and have the workers take them out with a lock? Then the kernel will
distribute the work among the threads for us.

Maybe we can get rid of worker->lock too..

Thanks,
Namhyung

2021-08-29 22:00:39

by Jiri Olsa

Subject: Re: [RFC PATCH v3 00/15] perf: add workqueue library and use it in synthetic-events

On Fri, Aug 20, 2021 at 12:53:46PM +0200, Riccardo Mancini wrote:
> Changes in v3:
> - improved separation of threadpool and threadpool_entry method
> - replaced shared workqueue with per-thread workqueue. This should
> improve the performance on big machines (Jiri noticed in his
> experiments a significant performance degradation after 15 threads
> with the shared queue).
> - improved error reporting in both threadpool and workqueue
> - added lazy spinup of threads in workqueue [9/15]
> - added global workqueue [10/15]
> - setup global workqueue in perf record, top and synthesize bench
> [12-14/15] and used in in synthetic events


hi,
I ran the test again and there's still the slowdown,
adding the stats below

I'm doing the review and I noticed a few strange things,
but so far nothing that would explain that

like I can see for 40 threads only 35 threads spawned,
need to check on that more

also I'll try run some tests for parallel_for > 1 to cut
down some of the workqueue code.. any tests on that?

jirka


---
new: old:
[root@dell-r440-01 perf]# ./perf bench internals synthesize -t [root@dell-r440-01 perf]# ./perf bench internals synthesize -t
# Running 'internals/synthesize' benchmark: # Running 'internals/synthesize' benchmark:
Computing performance of multi threaded perf event synthesis by Computing performance of multi threaded perf event synthesis by
synthesizing events on CPU 0: synthesizing events on CPU 0:
Number of synthesis threads: 1 Number of synthesis threads: 1
Average synthesis took: 13970.400 usec (+- 339.216 usec) Average synthesis took: 13563.700 usec (+- 348.354 usec)
Average num. events: 2349.000 (+- 0.000) Average num. events: 2317.000 (+- 0.000)
Average time per event 5.947 usec Average time per event 5.854 usec
Number of synthesis threads: 2 Number of synthesis threads: 2
Average synthesis took: 15651.800 usec (+- 1612.798 usec) Average synthesis took: 8433.600 usec (+- 83.725 usec)
Average num. events: 2353.000 (+- 0.000) Average num. events: 2321.600 (+- 0.306)
Average time per event 6.652 usec Average time per event 3.633 usec
Number of synthesis threads: 3 Number of synthesis threads: 3
Average synthesis took: 12114.100 usec (+- 1208.208 usec) Average synthesis took: 6716.200 usec (+- 16.889 usec)
Average num. events: 2355.000 (+- 0.000) Average num. events: 2325.000 (+- 0.000)
Average time per event 5.144 usec Average time per event 2.889 usec
Number of synthesis threads: 4 Number of synthesis threads: 4
Average synthesis took: 9812.500 usec (+- 951.284 usec) Average synthesis took: 5981.400 usec (+- 11.102 usec)
Average num. events: 2357.000 (+- 0.000) Average num. events: 2323.000 (+- 0.000)
Average time per event 4.163 usec Average time per event 2.575 usec
Number of synthesis threads: 5 Number of synthesis threads: 5
Average synthesis took: 7338.300 usec (+- 661.620 usec) Average synthesis took: 5538.800 usec (+- 12.990 usec)
Average num. events: 2359.000 (+- 0.000) Average num. events: 2329.000 (+- 0.000)
Average time per event 3.111 usec Average time per event 2.378 usec
Number of synthesis threads: 6 Number of synthesis threads: 6
Average synthesis took: 7256.800 usec (+- 680.312 usec) Average synthesis took: 5255.700 usec (+- 7.454 usec)
Average num. events: 2361.000 (+- 0.000) Average num. events: 2331.000 (+- 0.000)
Average time per event 3.074 usec Average time per event 2.255 usec
Number of synthesis threads: 7 Number of synthesis threads: 7
Average synthesis took: 6119.600 usec (+- 479.409 usec) Average synthesis took: 4836.200 usec (+- 8.132 usec)
Average num. events: 2363.000 (+- 0.000) Average num. events: 2323.000 (+- 0.000)
Average time per event 2.590 usec Average time per event 2.082 usec
Number of synthesis threads: 8 Number of synthesis threads: 8
Average synthesis took: 5899.600 usec (+- 506.285 usec) Average synthesis took: 4643.000 usec (+- 4.913 usec)
Average num. events: 2365.000 (+- 0.000) Average num. events: 2335.000 (+- 0.000)
Average time per event 2.495 usec Average time per event 1.988 usec
Number of synthesis threads: 9 Number of synthesis threads: 9
Average synthesis took: 5459.100 usec (+- 431.725 usec) Average synthesis took: 4526.600 usec (+- 5.207 usec)
Average num. events: 2367.000 (+- 0.000) Average num. events: 2337.000 (+- 0.000)
Average time per event 2.306 usec Average time per event 1.937 usec
Number of synthesis threads: 10 Number of synthesis threads: 10
Average synthesis took: 4977.100 usec (+- 251.378 usec) Average synthesis took: 4128.700 usec (+- 5.911 usec)
Average num. events: 2369.000 (+- 0.000) Average num. events: 2327.800 (+- 0.533)
Average time per event 2.101 usec Average time per event 1.774 usec
Number of synthesis threads: 11 Number of synthesis threads: 11
Average synthesis took: 5428.700 usec (+- 513.409 usec) Average synthesis took: 3890.800 usec (+- 15.051 usec)
Average num. events: 2371.000 (+- 0.000) Average num. events: 2323.000 (+- 0.000)
Average time per event 2.290 usec Average time per event 1.675 usec
Number of synthesis threads: 12 Number of synthesis threads: 12
Average synthesis took: 5517.800 usec (+- 508.171 usec) Average synthesis took: 3367.800 usec (+- 14.261 usec)
Average num. events: 2373.000 (+- 0.000) Average num. events: 2343.000 (+- 0.000)
Average time per event 2.325 usec Average time per event 1.437 usec
Number of synthesis threads: 13 Number of synthesis threads: 13
Average synthesis took: 5279.500 usec (+- 432.819 usec) Average synthesis took: 3974.300 usec (+- 12.437 usec)
Average num. events: 2375.000 (+- 0.000) Average num. events: 2328.200 (+- 1.405)
Average time per event 2.223 usec Average time per event 1.707 usec
Number of synthesis threads: 14 Number of synthesis threads: 14
Average synthesis took: 4993.100 usec (+- 392.485 usec) Average synthesis took: 4157.100 usec (+- 163.268 usec)
Average num. events: 2377.000 (+- 0.000) Average num. events: 2319.800 (+- 0.533)
Average time per event 2.101 usec Average time per event 1.792 usec
Number of synthesis threads: 15 Number of synthesis threads: 15
Average synthesis took: 5584.700 usec (+- 379.862 usec) Average synthesis took: 4065.700 usec (+- 25.656 usec)
Average num. events: 2379.000 (+- 0.000) Average num. events: 2322.800 (+- 0.467)
Average time per event 2.347 usec Average time per event 1.750 usec
Number of synthesis threads: 16 Number of synthesis threads: 16
Average synthesis took: 5009.800 usec (+- 381.018 usec) Average synthesis took: 4580.600 usec (+- 129.218 usec)
Average num. events: 2381.000 (+- 0.000) Average num. events: 2324.800 (+- 0.200)
Average time per event 2.104 usec Average time per event 1.970 usec
Number of synthesis threads: 17 Number of synthesis threads: 17
Average synthesis took: 5543.300 usec (+- 376.064 usec) Average synthesis took: 4089.700 usec (+- 54.096 usec)
Average num. events: 2383.000 (+- 0.000) Average num. events: 2320.200 (+- 0.611)
Average time per event 2.326 usec Average time per event 1.763 usec
Number of synthesis threads: 18 Number of synthesis threads: 18
Average synthesis took: 5191.800 usec (+- 342.317 usec) Average synthesis took: 4219.000 usec (+- 61.395 usec)
Average num. events: 2385.000 (+- 0.000) Average num. events: 2323.000 (+- 0.516)
Average time per event 2.177 usec Average time per event 1.816 usec
Number of synthesis threads: 19 Number of synthesis threads: 19
Average synthesis took: 4647.000 usec (+- 273.303 usec) Average synthesis took: 3998.800 usec (+- 49.221 usec)
Average num. events: 2387.000 (+- 0.000) Average num. events: 2325.200 (+- 0.200)
Average time per event 1.947 usec Average time per event 1.720 usec
Number of synthesis threads: 20 Number of synthesis threads: 20
Average synthesis took: 4710.600 usec (+- 179.874 usec) Average synthesis took: 3930.300 usec (+- 67.725 usec)
Average num. events: 2389.000 (+- 0.000) Average num. events: 2319.000 (+- 0.000)
Average time per event 1.972 usec Average time per event 1.695 usec
Number of synthesis threads: 21 Number of synthesis threads: 21
Average synthesis took: 4959.100 usec (+- 318.519 usec) Average synthesis took: 3696.400 usec (+- 30.953 usec)
Average num. events: 2390.800 (+- 0.200) Average num. events: 2319.800 (+- 0.533)
Average time per event 2.074 usec Average time per event 1.593 usec
Number of synthesis threads: 22 Number of synthesis threads: 22
Average synthesis took: 4422.300 usec (+- 236.998 usec) Average synthesis took: 3394.000 usec (+- 63.254 usec)
Average num. events: 2392.800 (+- 0.200) Average num. events: 2319.000 (+- 0.000)
Average time per event 1.848 usec Average time per event 1.464 usec
Number of synthesis threads: 23 Number of synthesis threads: 23
Average synthesis took: 4640.800 usec (+- 245.604 usec) Average synthesis took: 4091.100 usec (+- 134.320 usec)
Average num. events: 2394.400 (+- 0.600) Average num. events: 2323.400 (+- 0.267)
Average time per event 1.938 usec Average time per event 1.761 usec
Number of synthesis threads: 24 Number of synthesis threads: 24
Average synthesis took: 4554.900 usec (+- 201.121 usec) Average synthesis took: 3346.600 usec (+- 78.846 usec)
Average num. events: 2395.800 (+- 0.854) Average num. events: 2321.000 (+- 0.667)
Average time per event 1.901 usec Average time per event 1.442 usec
Number of synthesis threads: 25 Number of synthesis threads: 25
Average synthesis took: 4668.300 usec (+- 248.254 usec) Average synthesis took: 3794.300 usec (+- 191.158 usec)
Average num. events: 2398.000 (+- 0.803) Average num. events: 2317.900 (+- 6.248)
Average time per event 1.947 usec Average time per event 1.637 usec
Number of synthesis threads: 26 Number of synthesis threads: 26
Average synthesis took: 4683.300 usec (+- 226.836 usec) Average synthesis took: 3285.700 usec (+- 18.785 usec)
Average num. events: 2399.000 (+- 1.265) Average num. events: 2317.100 (+- 6.198)
Average time per event 1.952 usec Average time per event 1.418 usec
Number of synthesis threads: 27 Number of synthesis threads: 27
Average synthesis took: 4590.300 usec (+- 158.000 usec) Average synthesis took: 3604.600 usec (+- 35.487 usec)
Average num. events: 2400.200 (+- 1.497) Average num. events: 2319.800 (+- 0.533)
Average time per event 1.912 usec Average time per event 1.554 usec
Number of synthesis threads: 28 Number of synthesis threads: 28
Average synthesis took: 4683.500 usec (+- 233.543 usec) Average synthesis took: 3594.700 usec (+- 21.267 usec)
Average num. events: 2402.400 (+- 1.688) Average num. events: 2319.200 (+- 0.200)
Average time per event 1.950 usec Average time per event 1.550 usec
Number of synthesis threads: 29 Number of synthesis threads: 29
Average synthesis took: 4830.700 usec (+- 235.730 usec) Average synthesis took: 3531.700 usec (+- 15.935 usec)
Average num. events: 2405.000 (+- 2.530) Average num. events: 2322.200 (+- 0.800)
Average time per event 2.009 usec Average time per event 1.521 usec
Number of synthesis threads: 30 Number of synthesis threads: 30
Average synthesis took: 4684.500 usec (+- 210.137 usec) Average synthesis took: 3505.700 usec (+- 58.332 usec)
Average num. events: 2407.600 (+- 2.495) Average num. events: 2315.100 (+- 5.900)
Average time per event 1.946 usec Average time per event 1.514 usec
Number of synthesis threads: 31 Number of synthesis threads: 31
Average synthesis took: 4823.300 usec (+- 213.480 usec) Average synthesis took: 3431.100 usec (+- 42.022 usec)
Average num. events: 2407.400 (+- 2.647) Average num. events: 2319.000 (+- 0.000)
Average time per event 2.004 usec Average time per event 1.480 usec
Number of synthesis threads: 32 Number of synthesis threads: 32
Average synthesis took: 4400.800 usec (+- 224.134 usec) Average synthesis took: 3684.900 usec (+- 253.077 usec)
Average num. events: 2407.400 (+- 2.544) Average num. events: 2319.200 (+- 0.200)
Average time per event 1.828 usec Average time per event 1.589 usec
Number of synthesis threads: 33 Number of synthesis threads: 33
Average synthesis took: 4452.600 usec (+- 231.034 usec) Average synthesis took: 3233.000 usec (+- 24.035 usec)
Average num. events: 2409.300 (+- 3.190) Average num. events: 2316.500 (+- 6.069)
Average time per event 1.848 usec Average time per event 1.396 usec
Number of synthesis threads: 34 Number of synthesis threads: 34
Average synthesis took: 4770.900 usec (+- 182.325 usec) Average synthesis took: 3016.300 usec (+- 13.343 usec)
Average num. events: 2411.200 (+- 3.032) Average num. events: 2322.800 (+- 0.200)
Average time per event 1.979 usec Average time per event 1.299 usec
Number of synthesis threads: 35 Number of synthesis threads: 35
Average synthesis took: 4442.800 usec (+- 248.017 usec) Average synthesis took: 3246.700 usec (+- 71.765 usec)
Average num. events: 2412.000 (+- 3.296) Average num. events: 2321.800 (+- 0.611)
Average time per event 1.842 usec Average time per event 1.398 usec
Number of synthesis threads: 36 Number of synthesis threads: 36
Average synthesis took: 5005.200 usec (+- 235.823 usec) Average synthesis took: 3329.000 usec (+- 122.028 usec)
Average num. events: 2410.400 (+- 2.750) Average num. events: 2310.800 (+- 8.133)
Average time per event 2.077 usec Average time per event 1.441 usec
Number of synthesis threads: 37 Number of synthesis threads: 37
Average synthesis took: 4654.000 usec (+- 208.838 usec) Average synthesis took: 3011.600 usec (+- 46.026 usec)
Average num. events: 2409.400 (+- 2.473) Average num. events: 2322.200 (+- 0.533)
Average time per event 1.932 usec Average time per event 1.297 usec
Number of synthesis threads: 38 Number of synthesis threads: 38
Average synthesis took: 4763.700 usec (+- 197.409 usec) Average synthesis took: 3163.500 usec (+- 36.589 usec)
Average num. events: 2406.200 (+- 2.462) Average num. events: 2319.000 (+- 0.000)
Average time per event 1.980 usec Average time per event 1.364 usec
Number of synthesis threads: 39 Number of synthesis threads: 39
Average synthesis took: 4333.100 usec (+- 194.456 usec) Average synthesis took: 3170.900 usec (+- 30.538 usec)
Average num. events: 2408.600 (+- 3.124) Average num. events: 2319.000 (+- 0.000)
Average time per event 1.799 usec Average time per event 1.367 usec
Number of synthesis threads: 40 Number of synthesis threads: 40
Average synthesis took: 4520.200 usec (+- 188.901 usec) Average synthesis took: 3111.900 usec (+- 24.287 usec)
Average num. events: 2409.600 (+- 3.184) Average num. events: 2307.600 (+- 7.600)
Average time per event 1.876 usec Average time per event 1.349 usec

2021-08-30 07:24:48

by Jiri Olsa

[permalink] [raw]
Subject: Re: [RFC PATCH v3 07/15] perf workqueue: implement worker thread and management

On Fri, Aug 20, 2021 at 12:53:53PM +0200, Riccardo Mancini wrote:

SNIP

> /**
> * create_workqueue - create a workqueue associated to @pool
> *
> @@ -41,7 +361,8 @@ static const char * const workqueue_errno_str[] = {
> */
> struct workqueue_struct *create_workqueue(int nr_threads)
> {
> - int ret, err = 0;
> + int ret, err = 0, t;
> + struct worker *worker;
> struct workqueue_struct *wq = zalloc(sizeof(struct workqueue_struct));
>
> if (!wq) {
> @@ -56,10 +377,16 @@ struct workqueue_struct *create_workqueue(int nr_threads)
> goto out_free_wq;
> }
>
> + wq->workers = calloc(nr_threads, sizeof(*wq->workers));
> + if (!wq->workers) {
> + err = -ENOMEM;
> + goto out_delete_pool;
> + }
> +
> ret = pthread_mutex_init(&wq->lock, NULL);
> if (ret) {
> err = -ret;
> - goto out_delete_pool;
> + goto out_free_workers;
> }
>
> ret = pthread_cond_init(&wq->idle_cond, NULL);
> @@ -77,12 +404,41 @@ struct workqueue_struct *create_workqueue(int nr_threads)
> goto out_destroy_cond;
> }
>
> + wq->task.fn = worker_thread;
> +
> + wq->pool_errno = threadpool__execute(wq->pool, &wq->task);
> + if (wq->pool_errno) {
> + err = -WORKQUEUE_ERROR__POOLEXE;
> + goto out_close_pipe;
> + }

hum, why the threadpool__execute in here? threads are not running at this
point, so nothing will happen right?

jirka

> +
> + for (t = 0; t < nr_threads; t++) {
> + err = spinup_worker(wq, t);
> + if (err)
> + goto out_stop_pool;
> + }
> +
> return wq;
>
> +out_stop_pool:
> + lock_workqueue(wq);
> + for_each_idle_worker(wq, worker) {
> + ret = stop_worker(worker);
> + if (ret)
> + err = ret;
> + }
> + unlock_workqueue(wq);
> +out_close_pipe:
> + close(wq->msg_pipe[0]);
> + wq->msg_pipe[0] = -1;
> + close(wq->msg_pipe[1]);
> + wq->msg_pipe[1] = -1;
> out_destroy_cond:
> pthread_cond_destroy(&wq->idle_cond);
> out_destroy_mutex:
> pthread_mutex_destroy(&wq->lock);
> +out_free_workers:
> + free(wq->workers);
> out_delete_pool:
> threadpool__delete(wq->pool);
> out_free_wq:
> @@ -96,12 +452,27 @@ struct workqueue_struct *create_workqueue(int nr_threads)
> */

SNIP

2021-08-31 16:26:02

by Riccardo Mancini

[permalink] [raw]
Subject: Re: [RFC PATCH v3 08/15] perf workqueue: add queue_work and flush_workqueue functions

Hi,

On Tue, 2021-08-24 at 12:40 -0700, Namhyung Kim wrote:
> On Fri, Aug 20, 2021 at 3:54 AM Riccardo Mancini <[email protected]> wrote:
> >
> > This patch adds functions to queue and wait work_structs, and
> > related tests.
> >
> > When a new work item is added, the workqueue first checks if there
> > are threads to wake up. If so, it wakes it up with the given work item,
> > otherwise it will pick the next round-robin thread and queue the work
> > item to its queue. A thread which completes its queue will go to sleep.
> >
> > The round-robin mechanism is implemented through the next_worker
> > attribute which will point to the next worker to be chosen for queueing.
> > When work is assigned to that worker or when the worker goes to sleep,
> > the pointer is moved to the next worker in the busy_list, if any.
> > When a worker is woken up, it is added to the busy list just before the
> > next_worker, so that it will be chosen last (it has just been assigned
> > a work item).
>
> Do we really need this?  I think some of the complexity comes
> because of this.  Can we simply put the works in a list in wq
> and workers take it out with a lock?  Then the kernel will
> distribute the works among the threads for us.
>
> Maybe we can get rid of worker->lock too..

Having a per-thread queue has some benefits which are very useful in our case:
- it should be able to scale to bigger machines than a shared queue (looking at
both tests from Jiri, this version seems somewhat better than v2, but they were
run under different conditions, so more tests comparing the two versions on big
machines would be useful).
- it is possible to choose the worker to execute work on, which is used in the
evlist patchset (where threads can be pinned to a cpu and evlist operations can
be done on them).

Of course, it adds some complexity over the shared queue, for example:
- the next_worker pointer to implement the round-robin policy, for which maybe
there's a cleaner way to do it.
- the thread "self-registration", which I think can be dropped in favor of an
array inside the workqueue (the max number of threads is limited, so having a
self-registration does not really add much flexibility and it adds contention on
the workqueue lock when threads are spun up). Getting rid of it could reduce
workqueue spinup and stop time.
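
To make the mechanism above concrete, here is a toy, self-contained sketch of
the dispatch path. All names are made up for illustration, and the real patch
also keeps per-worker locks and a busy_list, which this collapses into a single
lock and an array; it is not the patch's code:

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_work {
	struct toy_work *next;
	void (*fn)(struct toy_work *work);
};

struct toy_worker {
	struct toy_work *head, *tail;	/* thread-local queue */
	bool idle;
	pthread_cond_t wake;
};

struct toy_wq {
	struct toy_worker *workers;
	int nr_workers;
	int next;			/* round-robin cursor */
	pthread_mutex_t lock;
};

/* Queue @work: wake an idle worker if there is one, otherwise append it
 * to the next round-robin worker's local queue. */
static void toy_queue_work(struct toy_wq *wq, struct toy_work *work)
{
	struct toy_worker *w = NULL;
	int i;

	pthread_mutex_lock(&wq->lock);
	for (i = 0; i < wq->nr_workers; i++) {
		if (wq->workers[i].idle) {
			w = &wq->workers[i];	/* idle worker found: wake it */
			w->idle = false;
			break;
		}
	}
	if (!w) {				/* all busy: pick round-robin */
		w = &wq->workers[wq->next];
		wq->next = (wq->next + 1) % wq->nr_workers;
	}
	work->next = NULL;			/* append to w's local queue */
	if (w->tail)
		w->tail->next = work;
	else
		w->head = work;
	w->tail = work;
	pthread_cond_signal(&w->wake);
	pthread_mutex_unlock(&wq->lock);
}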

Thanks,
Riccardo

>
> Thanks,
> Namhyung


2021-08-31 16:56:00

by Jiri Olsa

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/15] perf: add workqueue library and use it in synthetic-events

On Sun, Aug 29, 2021 at 11:59:41PM +0200, Jiri Olsa wrote:
> On Fri, Aug 20, 2021 at 12:53:46PM +0200, Riccardo Mancini wrote:
> > Changes in v3:
> > - improved separation of threadpool and threadpool_entry method
> > - replaced shared workqueue with per-thread workqueue. This should
> > improve the performance on big machines (Jiri noticed in his
> > experiments a significant performance degradation after 15 threads
> > with the shared queue).
> > - improved error reporting in both threadpool and workqueue
> > - added lazy spinup of threads in workqueue [9/15]
> > - added global workqueue [10/15]
> > - setup global workqueue in perf record, top and synthesize bench
> > [12-14/15] and used it in synthetic events
>
>
> hi,
> I ran the test again and there's still the slowdown,
> adding the stats below
>
> I'm doing the review and I noticed a few strange things,
> but so far nothing that would explain that

I used Trace Compass to show the flow and it shows a lot of
extra scheduling in the new code, please check the attached
screenshots

the current code takes the quickest approach and distributes
'equal' load to each thread

while the lazy thread spinup in the new code is nice, I think
we should have a way to instruct the new code to do the same
thing as the old one, because it's faster in this case

I think the work_size setup could help with that

>
> like I can see for 40 threads only 35 threads spawned,
> need to check on that more
>
> also I'll try to run some tests for parallel_for > 1 to cut

ugh.. should have been s/parallel_for/work_size/ sorry

jirka

> down some of the workqueue code.. any tests on that?
>
> jirka
>
>
> ---
SNIP (full benchmark table, identical to the results shown above)


Attachments:
tc-new.png (148.38 kB)
tc-old.png (167.91 kB)

2021-08-31 17:07:23

by Riccardo Mancini

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/15] perf: add workqueue library and use it in synthetic-events

Hi Jiri,
thanks again for testing it!

On Tue, 2021-08-31 at 17:46 +0200, Jiri Olsa wrote:
> On Sun, Aug 29, 2021 at 11:59:41PM +0200, Jiri Olsa wrote:
> > On Fri, Aug 20, 2021 at 12:53:46PM +0200, Riccardo Mancini wrote:
> > > Changes in v3:
> > >  - improved separation of threadpool and threadpool_entry method
> > >  - replaced shared workqueue with per-thread workqueue. This should
> > >    improve the performance on big machines (Jiri noticed in his
> > >    experiments a significant performance degradation after 15 threads
> > >    with the shared queue).
> > >  - improved error reporting in both threadpool and workqueue
> > >  - added lazy spinup of threads in workqueue [9/15]
> > >  - added global workqueue [10/15]
> > >  - setup global workqueue in perf record, top and synthesize bench
> > >    [12-14/15] and used it in synthetic events
> >
> >
> > hi,
> > I ran the test again and there's still the slowdown,
> > adding the stats below

Looking at these experiments and the v2 ones, it looks like they are synthesizing
a different number of events, so it's difficult to compare them:

old (v2 tests):
  Number of synthesis threads: 32
    Average synthesis took: 2295.300 usec (+- 41.538 usec)
    Average num. events: 954.000 (+- 0.000)
    Average time per event 2.406 usec

new (v2 tests):
  Number of synthesis threads: 32
    Average synthesis took: 4191.700 usec (+- 149.780 usec)
    Average num. events: 1002.000 (+- 0.000)
    Average time per event 4.183 usec

old (v3 tests):
  Number of synthesis threads: 32
    Average synthesis took: 3684.900 usec (+- 253.077 usec)
    Average num. events: 2319.200 (+- 0.200)
    Average time per event 1.589 usec

new (v3 tests):
  Number of synthesis threads: 32
    Average synthesis took: 4400.800 usec (+- 224.134 usec)
    Average num. events: 2407.400 (+- 2.544)
    Average time per event 1.828 usec

Anyway, looking at the overhead of the workqueue, computed as the difference
between old and new times, v3 looks better than v2: it has an ~800 usec overhead,
compared to ~1800 usec for v2, even with double the number of events.
In any case, still not good enough.

> >
> > I'm doing the review and I noticed a few strange things,
> > but so far nothing that would explain that
>
> I used Trace Compass to show the flow and it shows a lot of
> extra scheduling in the new code, please check the attached
> screenshots
>
> the current code takes the quickest approach and distributes
> 'equal' load to each thread
>
> while the lazy thread spinup in the new code is nice, I think
> we should have a way to instruct the new code to do the same
> thing as the old one, because it's faster in this case

From the figures, we can see that in the older code the spinup took ~1 usec,
while with this patchset it takes around ~2 usec, so there's definitely room for
improvement.
Regarding the lazy spawn, I think much of the overhead comes from the
registration mechanism of the workers, which I'm already planning to remove to
get rid of that additional contention on the workqueue lock, which makes it
necessary to wait for the thread to spawn.
By removing it, I think the remaining overhead should not be very big.
In any case, I could add a way to pre-spin the workers, which makes sense in
this scenario, since we'd like to use all of them anyways.
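
Such a pre-spin knob could be a thin wrapper over the spinup_worker() already
used in the quoted create_workqueue(); the function below is only an assumed
shape, not part of the patchset:

/* Hypothetical helper to eagerly spin up @nr_threads workers. */
int workqueue_prespin_workers(struct workqueue_struct *wq, int nr_threads)
{
	int t, err;

	for (t = 0; t < nr_threads; t++) {
		err = spinup_worker(wq, t);	/* from the quoted patch */
		if (err)
			return err;
	}
	return 0;
}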

>
> I think the work_size setup could help with that
>
> >
> > like I can see for 40 threads only 35 threads spawned,
> > need to check on that more

It looks like, by the time 35 work items are dispatched, the first worker
becomes ready again, so no more than 35 threads are ever spawned.

> >
> > also I'll try to run some tests for parallel_for > 1 to cut
>
> ugh.. should have been s/parallel_for/work_size/ sorry
>
> > down some of the workqueue code.. any tests on that?

Yes, I think we need to increase it.
On my machines the differences were not significant, so I opted for the lowest
number. On a faster machine, dispatching a single work item per pid might be
too low.
In fact, all that extra scheduling is happening due to many threads going idle
and being woken up, since they did not have enough work.
We could try increasing it to 10 and see what happens. 
The worst case behaviour should be the "old" one (static partitioning).
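
For illustration, here is a self-contained sketch of how a work_size parameter
trades locking overhead against load balance in a parallel-for (names are made
up; this is not the patch's code):

#include <pthread.h>

struct toy_pfor {
	int next, end;			/* shared cursor over [0, end) */
	int work_size;			/* items grabbed per lock round-trip */
	pthread_mutex_t lock;
	void (*body)(int idx, void *arg);
	void *arg;
};

static void *toy_pfor_worker(void *p)
{
	struct toy_pfor *pf = p;

	for (;;) {
		int i, start, stop;

		/* grab a chunk of work_size items under the lock */
		pthread_mutex_lock(&pf->lock);
		start = pf->next;
		pf->next += pf->work_size;
		pthread_mutex_unlock(&pf->lock);

		if (start >= pf->end)
			return NULL;		/* no work left */
		stop = start + pf->work_size;
		if (stop > pf->end)
			stop = pf->end;
		for (i = start; i < stop; i++)
			pf->body(i, pf->arg);	/* e.g. one dirent per index */
	}
}

With work_size = 1 every item costs a lock round-trip (maximum balance); with
work_size = nitems / nr_threads each thread takes the lock once and the
behaviour degenerates to the old static partitioning.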

Thanks,
Riccardo

> >
> > jirka
> >
> >
> > ---
SNIP (benchmark table quoted again from the earlier message)



2021-08-31 18:03:38

by Riccardo Mancini

[permalink] [raw]
Subject: Re: [RFC PATCH v3 06/15] perf workqueue: introduce workqueue struct

Hi Namhyung,
thanks again for taking your time to review.

On Tue, 2021-08-24 at 12:27 -0700, Namhyung Kim wrote:
> Hi Riccardo,
>
> On Fri, Aug 20, 2021 at 3:54 AM Riccardo Mancini <[email protected]> wrote:
> > +/**
> > + * workqueue_strerror - print message regarding latest error in @wq
> > + *
> > + * Buffer size should be at least WORKQUEUE_STRERR_BUFSIZE bytes.
> > + */
> > +int workqueue_strerror(struct workqueue_struct *wq, int err, char *buf,
> > size_t size)
> > +{
> > +       int ret;
> > +       char sbuf[THREADPOOL_STRERR_BUFSIZE], *emsg;
> > +       const char *errno_str;
> > +
> > +       errno_str = workqueue_errno_str[-err-WORKQUEUE_ERROR__OFFSET];
>
> It seems easy to crash with an invalid err argument.

Yup, I should add a check in the next version.
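
For example, the check could look something like this (just a sketch against
the names in the quoted patch; ARRAY_SIZE is the usual tools macro):

	int idx = -err - WORKQUEUE_ERROR__OFFSET;

	/* guard against err values outside the workqueue error range */
	if (idx < 0 || idx >= (int)ARRAY_SIZE(workqueue_errno_str))
		return scnprintf(buf, size, "Unknown workqueue error %d.\n", err);
	errno_str = workqueue_errno_str[idx];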

>
> > +
> > +       switch (err) {
> > +       case -WORKQUEUE_ERROR__POOLNEW:
> > +       case -WORKQUEUE_ERROR__POOLEXE:
> > +       case -WORKQUEUE_ERROR__POOLSTOP:
> > +       case -WORKQUEUE_ERROR__POOLSTARTTHREAD:
> > +               if (IS_ERR_OR_NULL(wq))
> > +                       return scnprintf(buf, size, "%s: unknown.\n",
> > +                               errno_str);
> > +
> > +               ret = threadpool__strerror(wq->pool, wq->pool_errno, sbuf,
> > sizeof(sbuf));
> > +               if (ret < 0)
> > +                       return ret;
> > +               return scnprintf(buf, size, "%s: %s.\n", errno_str, sbuf);
> > +       case -WORKQUEUE_ERROR__WRITEPIPE:
> > +       case -WORKQUEUE_ERROR__READPIPE:
> > +               emsg = str_error_r(errno, sbuf, sizeof(sbuf));
>
>
> This means the errno should be kept before calling this, right?

Yeah, I should make sure to preserve it, I think it's done this way in some
other functions.
Otherwise, I can save it in a struct attribute, which is maybe simpler.
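
A sketch of the struct-attribute option: record errno right where the syscall
fails, so workqueue_strerror() no longer depends on errno surviving until it is
called (the sys_errno field and the msg variable here are hypothetical):

	ret = writen(wq->msg_pipe[1], &msg, sizeof(msg));
	if (ret < 0) {
		/* save errno for later reporting; sys_errno is a hypothetical field */
		wq->sys_errno = errno;
		return -WORKQUEUE_ERROR__WRITEPIPE;
	}

and then the strerror path would call str_error_r(wq->sys_errno, ...) instead
of str_error_r(errno, ...).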

>
> > +               return scnprintf(buf, size, "%s: %s.\n", errno_str, emsg);
> > +       case -WORKQUEUE_ERROR__INVALIDMSG:
> > +               return scnprintf(buf, size, "%s.\n", errno_str);
> > +       default:
> > +               emsg = str_error_r(err, sbuf, sizeof(sbuf));
> > +               return scnprintf(buf, size, "Error: %s", emsg);
>
> Newline at the end?

Forgot it, thanks!

Riccardo

>
> Thanks,
> Namhyung
>
>
> > +       }
> > +}
> > +