2023-09-28 04:32:35

by Xiaobing Li

Subject: [PATCH 3/3] IO_URING: Statistics of the true utilization of sq threads.

Since the sq thread runs in a while(1) loop, it can spend long stretches
not processing any IO while still within the idle timeout; during that
time the sqpoll thread keeps running and keeps occupying the CPU, and
that CPU time is simply wasted. Our goal is to account the portion of
time the sqpoll thread actually spends processing IO, so as to reflect
the share of CPU it really uses for IO, which can later help improve
actual CPU utilization.

Signed-off-by: Xiaobing Li <[email protected]>
---
io_uring/sqpoll.c | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index bd6c2c7959a5..2c5fc4d95fa8 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -11,6 +11,7 @@
#include <linux/audit.h>
#include <linux/security.h>
#include <linux/io_uring.h>
+#include <linux/time.h>

#include <uapi/linux/io_uring.h>

@@ -235,6 +236,10 @@ static int io_sq_thread(void *data)
set_cpus_allowed_ptr(current, cpu_online_mask);

mutex_lock(&sqd->lock);
+ bool first = true;
+ struct timespec64 ts_start, ts_end;
+ struct timespec64 ts_delta;
+ struct sched_entity *se = &sqd->thread->se;
while (1) {
bool cap_entries, sqt_spin = false;

@@ -243,7 +248,16 @@ static int io_sq_thread(void *data)
break;
timeout = jiffies + sqd->sq_thread_idle;
}
-
+ ktime_get_boottime_ts64(&ts_start);
+ ts_delta = timespec64_sub(ts_start, ts_end);
+ unsigned long long now = ts_delta.tv_sec * NSEC_PER_SEC + ts_delta.tv_nsec +
+ se->sq_avg.last_update_time;
+
+ if (first) {
+ now = 0;
+ first = false;
+ }
+ __update_sq_avg_block(now, se);
cap_entries = !list_is_singular(&sqd->ctx_list);
list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
int ret = __io_sq_thread(ctx, cap_entries);
@@ -251,6 +265,16 @@ static int io_sq_thread(void *data)
if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
sqt_spin = true;
}
+
+ ktime_get_boottime_ts64(&ts_end);
+ ts_delta = timespec64_sub(ts_end, ts_start);
+ now = ts_delta.tv_sec * NSEC_PER_SEC + ts_delta.tv_nsec +
+ se->sq_avg.last_update_time;
+
+ if (sqt_spin)
+ __update_sq_avg(now, se);
+ else
+ __update_sq_avg_block(now, se);
if (io_run_task_work())
sqt_spin = true;

--
2.34.1


2023-09-28 14:42:14

by Jens Axboe

Subject: Re: [PATCH 3/3] IO_URING: Statistics of the true utilization of sq threads.

On 9/28/23 2:01 AM, Peter Zijlstra wrote:
> All of this is quite insane, but the above is actually broken. You're
> using wall-time to measure runtime of a preemptible thread.

Would have to agree with that... wall-time specifics aside, this whole
thing seems to attempt to solve an issue quite the wrong way around.

> Now, I see what you're trying to do, but who actually uses this data?

This is the important question - and if you need this data, then why not
just account it in sqpoll itself and have some way to export it? Let's
please not implicate the core kernel bits for that.
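
Totally untested, and every name below is made up rather than anything
that exists today, but I'd imagine something like a pair of nanosecond
counters on io_sq_data, bumped from the sqpoll loop itself:

	/* hypothetical fields on struct io_sq_data */
	u64 work_time;	/* ns spent doing actual submission/completion work */
	u64 total_time;	/* ns spent in the sqpoll loop overall */

and in io_sq_thread(), around the per-ctx submission pass:

	u64 start = ktime_get_ns();

	/* ... the existing __io_sq_thread() loop over sqd->ctx_list ... */

	u64 delta = ktime_get_ns() - start;
	sqd->total_time += delta;
	if (sqt_spin)
		sqd->work_time += delta;

Export the two counters however io_uring likes (fdinfo, say) and let the
application compute the busy ratio itself.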

--
Jens Axboe

2023-09-28 15:32:51

by Peter Zijlstra

Subject: Re: [PATCH 3/3] IO_URING: Statistics of the true utilization of sq threads.

On Thu, Sep 28, 2023 at 10:22:28AM +0800, Xiaobing Li wrote:
> Since the sq thread runs in a while(1) loop, it can spend long stretches
> not processing any IO while still within the idle timeout; during that
> time the sqpoll thread keeps running and keeps occupying the CPU, and
> that CPU time is simply wasted. Our goal is to account the portion of
> time the sqpoll thread actually spends processing IO, so as to reflect
> the share of CPU it really uses for IO, which can later help improve
> actual CPU utilization.
>
> Signed-off-by: Xiaobing Li <[email protected]>
> ---
> io_uring/sqpoll.c | 26 +++++++++++++++++++++++++-
> 1 file changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
> index bd6c2c7959a5..2c5fc4d95fa8 100644
> --- a/io_uring/sqpoll.c
> +++ b/io_uring/sqpoll.c
> @@ -11,6 +11,7 @@
> #include <linux/audit.h>
> #include <linux/security.h>
> #include <linux/io_uring.h>
> +#include <linux/time.h>
>
> #include <uapi/linux/io_uring.h>
>
> @@ -235,6 +236,10 @@ static int io_sq_thread(void *data)
> set_cpus_allowed_ptr(current, cpu_online_mask);
>
> mutex_lock(&sqd->lock);
> + bool first = true;
> + struct timespec64 ts_start, ts_end;
> + struct timespec64 ts_delta;
> + struct sched_entity *se = &sqd->thread->se;
> while (1) {
> bool cap_entries, sqt_spin = false;
>
> @@ -243,7 +248,16 @@ static int io_sq_thread(void *data)
> break;
> timeout = jiffies + sqd->sq_thread_idle;
> }
> -
> + ktime_get_boottime_ts64(&ts_start);
> + ts_delta = timespec64_sub(ts_start, ts_end);
> + unsigned long long now = ts_delta.tv_sec * NSEC_PER_SEC + ts_delta.tv_nsec +
> + se->sq_avg.last_update_time;
> +
> + if (first) {
> + now = 0;
> + first = false;
> + }
> + __update_sq_avg_block(now, se);
> cap_entries = !list_is_singular(&sqd->ctx_list);
> list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
> int ret = __io_sq_thread(ctx, cap_entries);
> @@ -251,6 +265,16 @@ static int io_sq_thread(void *data)
> if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
> sqt_spin = true;
> }
> +
> + ktime_get_boottime_ts64(&ts_end);
> + ts_delta = timespec64_sub(ts_end, ts_start);
> + now = ts_delta.tv_sec * NSEC_PER_SEC + ts_delta.tv_nsec +
> + se->sq_avg.last_update_time;
> +
> + if (sqt_spin)
> + __update_sq_avg(now, se);
> + else
> + __update_sq_avg_block(now, se);
> if (io_run_task_work())
> sqt_spin = true;
>

All of this is quite insane, but the above is actually broken. You're
using wall-time to measure runtime of a preemptible thread.

On top of that, for extra insanity, you're using the frigging insane
timespec interface, because clearly the _ns() interfaces are too
complicated or something?

And that whole 'first' thing is daft too; pull 'now' out of the loop and
set it before, then all that goes away.
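
Untested, and I've no idea what your sq_avg machinery actually wants fed
into it, but the shape would be something like:

	/*
	 * One u64 nanosecond timestamp, set once before the loop: no
	 * timespec64, no NSEC_PER_SEC arithmetic, no 'first' special
	 * case. (This still isn't task runtime, so the preemption
	 * problem above remains; it only fixes the interface silliness.)
	 */
	u64 now, last = ktime_get_ns();

	while (1) {
		...
		now = ktime_get_ns();
		/* time since the previous update is simply now - last */
		last = now;
		...
	}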

Now, I see what you're trying to do, but who actually uses this data?

Finally, please don't scream in the subject :/

2023-09-28 17:55:21

by Jens Axboe

Subject: Re: [PATCH 3/3] IO_URING: Statistics of the true utilization of sq threads.

On 9/28/23 2:37 AM, Matthew Wilcox wrote:
> On Thu, Sep 28, 2023 at 10:01:14AM +0200, Peter Zijlstra wrote:
>> Now, I see what you're trying to do, but who actually uses this data?
>
> I ... don't. There seems to be the notion that since we're polling, that
> shouldn't count against the runtime of the thread. But the thread has
> chosen to poll! It is doing something! For one thing, it's preventing
> the CPU from entering an idle state. It seems absolutely fair to
> account this poll time to the runtime of the thread. Clearly I'm
> missing something.

For sure, it should be accounted as CPU time, as it is exactly that. You
could argue that if we needed to preempt this task for something else we
would do that (and the code does check that on every loop), but it's
still using CPU.

I can see maybe wanting to know how much of the total time the thread
spent doing ACTUAL work rather than just polling for new work, but
that's not really something the scheduler should be involved in and
should be purely an io_uring sqpoll stat of some sort if that is truly
interesting for an application.
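
Purely as illustration (none of this exists today; the names are
invented): if sqpoll exported two monotonically increasing counters, say
work_time and total_time, an application could sample them twice and
compute the duty cycle itself:

	/* samples of (work, total) taken at t1 and t2 */
	double duty = (double)(work2 - work1) / (double)(total2 - total1);

i.e. the fraction of the loop's time spent on actual work, with zero
scheduler involvement.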

--
Jens Axboe

2023-09-28 21:14:33

by Matthew Wilcox

Subject: Re: [PATCH 3/3] IO_URING: Statistics of the true utilization of sq threads.

On Thu, Sep 28, 2023 at 10:01:14AM +0200, Peter Zijlstra wrote:
> Now, I see what you're trying to do, but who actually uses this data?

I ... don't. There seems to be the notion that since we're polling, that
shouldn't count against the runtime of the thread. But the thread has
chosen to poll! It is doing something! For one thing, it's preventing
the CPU from entering an idle state. It seems absolutely fair to
account this poll time to the runtime of the thread. Clearly I'm
missing something.

2023-09-28 22:56:51

by kernel test robot

Subject: Re: [PATCH 3/3] IO_URING: Statistics of the true utilization of sq threads.

Hi Xiaobing,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on linus/master v6.6-rc3 next-20230928]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Xiaobing-Li/SCHEDULER-Add-an-interface-for-counting-real-utilization/20230928-103219
base: tip/sched/core
patch link: https://lore.kernel.org/r/20230928022228.15770-4-xiaobing.li%40samsung.com
patch subject: [PATCH 3/3] IO_URING: Statistics of the true utilization of sq threads.
config: arm-mmp2_defconfig (https://download.01.org/0day-ci/archive/20230929/[email protected]/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230929/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

io_uring/sqpoll.c:254:7: error: no member named 'sq_avg' in 'struct sched_entity'
se->sq_avg.last_update_time;
~~ ^
>> io_uring/sqpoll.c:260:3: error: call to undeclared function '__update_sq_avg_block'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
__update_sq_avg_block(now, se);
^
io_uring/sqpoll.c:272:7: error: no member named 'sq_avg' in 'struct sched_entity'
se->sq_avg.last_update_time;
~~ ^
>> io_uring/sqpoll.c:275:4: error: call to undeclared function '__update_sq_avg'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
__update_sq_avg(now, se);
^
4 errors generated.


vim +/__update_sq_avg_block +260 io_uring/sqpoll.c

221
222 static int io_sq_thread(void *data)
223 {
224 struct io_sq_data *sqd = data;
225 struct io_ring_ctx *ctx;
226 unsigned long timeout = 0;
227 char buf[TASK_COMM_LEN];
228 DEFINE_WAIT(wait);
229
230 snprintf(buf, sizeof(buf), "iou-sqp-%d", sqd->task_pid);
231 set_task_comm(current, buf);
232
233 if (sqd->sq_cpu != -1)
234 set_cpus_allowed_ptr(current, cpumask_of(sqd->sq_cpu));
235 else
236 set_cpus_allowed_ptr(current, cpu_online_mask);
237
238 mutex_lock(&sqd->lock);
239 bool first = true;
240 struct timespec64 ts_start, ts_end;
241 struct timespec64 ts_delta;
242 struct sched_entity *se = &sqd->thread->se;
243 while (1) {
244 bool cap_entries, sqt_spin = false;
245
246 if (io_sqd_events_pending(sqd) || signal_pending(current)) {
247 if (io_sqd_handle_event(sqd))
248 break;
249 timeout = jiffies + sqd->sq_thread_idle;
250 }
251 ktime_get_boottime_ts64(&ts_start);
252 ts_delta = timespec64_sub(ts_start, ts_end);
253 unsigned long long now = ts_delta.tv_sec * NSEC_PER_SEC + ts_delta.tv_nsec +
254 se->sq_avg.last_update_time;
255
256 if (first) {
257 now = 0;
258 first = false;
259 }
> 260 __update_sq_avg_block(now, se);
261 cap_entries = !list_is_singular(&sqd->ctx_list);
262 list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
263 int ret = __io_sq_thread(ctx, cap_entries);
264
265 if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
266 sqt_spin = true;
267 }
268
269 ktime_get_boottime_ts64(&ts_end);
270 ts_delta = timespec64_sub(ts_end, ts_start);
271 now = ts_delta.tv_sec * NSEC_PER_SEC + ts_delta.tv_nsec +
272 se->sq_avg.last_update_time;
273
274 if (sqt_spin)
> 275 __update_sq_avg(now, se);
276 else
277 __update_sq_avg_block(now, se);
278 if (io_run_task_work())
279 sqt_spin = true;
280
281 if (sqt_spin || !time_after(jiffies, timeout)) {
282 if (sqt_spin)
283 timeout = jiffies + sqd->sq_thread_idle;
284 if (unlikely(need_resched())) {
285 mutex_unlock(&sqd->lock);
286 cond_resched();
287 mutex_lock(&sqd->lock);
288 }
289 continue;
290 }
291
292 prepare_to_wait(&sqd->wait, &wait, TASK_INTERRUPTIBLE);
293 if (!io_sqd_events_pending(sqd) && !task_work_pending(current)) {
294 bool needs_sched = true;
295
296 list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
297 atomic_or(IORING_SQ_NEED_WAKEUP,
298 &ctx->rings->sq_flags);
299 if ((ctx->flags & IORING_SETUP_IOPOLL) &&
300 !wq_list_empty(&ctx->iopoll_list)) {
301 needs_sched = false;
302 break;
303 }
304
305 /*
306 * Ensure the store of the wakeup flag is not
307 * reordered with the load of the SQ tail
308 */
309 smp_mb__after_atomic();
310
311 if (io_sqring_entries(ctx)) {
312 needs_sched = false;
313 break;
314 }
315 }
316
317 if (needs_sched) {
318 mutex_unlock(&sqd->lock);
319 schedule();
320 mutex_lock(&sqd->lock);
321 }
322 list_for_each_entry(ctx, &sqd->ctx_list, sqd_list)
323 atomic_andnot(IORING_SQ_NEED_WAKEUP,
324 &ctx->rings->sq_flags);
325 }
326
327 finish_wait(&sqd->wait, &wait);
328 timeout = jiffies + sqd->sq_thread_idle;
329 }
330
331 io_uring_cancel_generic(true, sqd);
332 sqd->thread = NULL;
333 list_for_each_entry(ctx, &sqd->ctx_list, sqd_list)
334 atomic_or(IORING_SQ_NEED_WAKEUP, &ctx->rings->sq_flags);
335 io_run_task_work();
336 mutex_unlock(&sqd->lock);
337
338 complete(&sqd->exited);
339 do_exit(0);
340 }
341

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki