Date: Tue, 01 Mar 2022 08:47:31 -0500
From: Olivier Langlois <olivier@trillion01.com>
To: Jens Axboe, Pavel Begunkov
Cc: Hao Xu, io-uring, linux-kernel
Subject: [PATCH v4 2/2] io_uring: Add support for napi_busy_poll
The sqpoll thread can be used to perform the napi busy poll in a similar
way to how it does io polling for file systems that support direct
access, bypassing the page cache.

The other way io_uring can perform the napi busy poll is when
io_uring_enter() is called to wait for events. If the user specifies a
timeout value, it is split between busy polling and sleeping according
to the system-wide setting /proc/sys/net/core/busy_poll.

The changes have been tested with this program:
https://github.com/lano1106/io_uring_udp_ping
with the following results:

Without sqpoll:
NAPI busy loop disabled:
rtt min/avg/max/mdev = 40.631/42.050/58.667/1.547 us
NAPI busy loop enabled:
rtt min/avg/max/mdev = 30.619/31.753/61.433/1.456 us

With sqpoll:
NAPI busy loop disabled:
rtt min/avg/max/mdev = 42.087/44.438/59.508/1.533 us
NAPI busy loop enabled:
rtt min/avg/max/mdev = 35.779/37.347/52.201/0.924 us

Co-developed-by: Hao Xu
Signed-off-by: Hao Xu
Signed-off-by: Olivier Langlois
---
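Note for reviewers (not part of the patch): below is a minimal userspace
sketch of how an application could exercise the io_uring_enter() path once
/proc/sys/net/core/busy_poll is non-zero. It is written against liburing;
the UDP port, buffer size and the 1 ms timeout are arbitrary placeholders
and error handling is omitted.

/*
 * Hypothetical example: receive on a UDP socket with a timed wait.  With
 * this patch applied and busy_poll enabled, the wait spends part of its
 * timeout busy polling the socket's NAPI context before sleeping.
 */
#include <liburing.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct __kernel_timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 };
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(9999),	/* placeholder port */
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	char buf[2048];
	int fd;

	io_uring_queue_init(8, &ring, 0);

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	bind(fd, (struct sockaddr *)&addr, sizeof(addr));

	/*
	 * Arm a receive.  If no datagram is ready, the request goes through
	 * the poll path and io_add_napi() records the socket's NAPI ID.
	 */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_recv(sqe, fd, buf, sizeof(buf), 0);
	io_uring_submit(&ring);

	/*
	 * Wait up to 1 ms.  With busy_poll=50, the kernel busy polls the
	 * recorded NAPI context for up to 50 usec and sleeps for the rest.
	 */
	if (io_uring_wait_cqe_timeout(&ring, &cqe, &ts) == 0) {
		printf("received %d bytes\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}

With IORING_SETUP_SQPOLL passed to io_uring_queue_init() instead, the
sqpoll thread performs the same busy polling on the ring's napi_list
without the application having to call io_uring_enter() at all.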
 fs/io_uring.c | 230 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 229 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index f7b8df79a02b..37c065786e4b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -63,6 +63,7 @@
 #include
 #include
 #include
+#include <net/busy_poll.h>
 #include
 #include
 #include
@@ -395,6 +396,11 @@ struct io_ring_ctx {
 		struct list_head	sqd_list;

 		unsigned long		check_cq_overflow;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		/* used to track busy poll napi_id */
+		struct list_head	napi_list;
+		spinlock_t		napi_lock;	/* napi_list lock */
+#endif

 		struct {
 			unsigned		cached_cq_tail;
@@ -1464,6 +1470,10 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_WQ_LIST(&ctx->locked_free_list);
 	INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func);
 	INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	INIT_LIST_HEAD(&ctx->napi_list);
+	spin_lock_init(&ctx->napi_lock);
+#endif
 	return ctx;
 err:
 	kfree(ctx->dummy_ubuf);
@@ -5399,6 +5409,108 @@ IO_NETOP_FN(send);
 IO_NETOP_FN(recv);
 #endif /* CONFIG_NET */

+#ifdef CONFIG_NET_RX_BUSY_POLL
+
+#define NAPI_TIMEOUT		(60 * SEC_CONVERSION)
+
+struct napi_entry {
+	struct list_head	list;
+	unsigned int		napi_id;
+	unsigned long		timeout;
+};
+
+/*
+ * Add busy poll NAPI ID from sk.
+ */
+static void io_add_napi(struct file *file, struct io_ring_ctx *ctx)
+{
+	unsigned int napi_id;
+	struct socket *sock;
+	struct sock *sk;
+	struct napi_entry *ne;
+
+	if (!net_busy_loop_on())
+		return;
+
+	sock = sock_from_file(file);
+	if (!sock)
+		return;
+
+	sk = sock->sk;
+	if (!sk)
+		return;
+
+	napi_id = READ_ONCE(sk->sk_napi_id);
+
+	/* Non-NAPI IDs can be rejected */
+	if (napi_id < MIN_NAPI_ID)
+		return;
+
+	spin_lock(&ctx->napi_lock);
+	list_for_each_entry(ne, &ctx->napi_list, list) {
+		if (ne->napi_id == napi_id) {
+			ne->timeout = jiffies + NAPI_TIMEOUT;
+			goto out;
+		}
+	}
+
+	ne = kmalloc(sizeof(*ne), GFP_NOWAIT);
+	if (!ne)
+		goto out;
+
+	ne->napi_id = napi_id;
+	ne->timeout = jiffies + NAPI_TIMEOUT;
+	list_add_tail(&ne->list, &ctx->napi_list);
+out:
+	spin_unlock(&ctx->napi_lock);
+}
+
+static inline void io_check_napi_entry_timeout(struct napi_entry *ne)
+{
+	if (time_after(jiffies, ne->timeout)) {
+		list_del(&ne->list);
+		kfree(ne);
+	}
+}
+
+/*
+ * Busy poll if globally on and supporting sockets found
+ */
+static bool io_napi_busy_loop(struct list_head *napi_list)
+{
+	struct napi_entry *ne, *n;
+
+	list_for_each_entry_safe(ne, n, napi_list, list) {
+		napi_busy_loop(ne->napi_id, NULL, NULL, true,
+			       BUSY_POLL_BUDGET);
+		io_check_napi_entry_timeout(ne);
+	}
+	return !list_empty(napi_list);
+}
+
+static void io_free_napi_list(struct io_ring_ctx *ctx)
+{
+	spin_lock(&ctx->napi_lock);
+	while (!list_empty(&ctx->napi_list)) {
+		struct napi_entry *ne =
+			list_first_entry(&ctx->napi_list, struct napi_entry,
+					 list);
+
+		list_del(&ne->list);
+		kfree(ne);
+	}
+	spin_unlock(&ctx->napi_lock);
+}
+#else
+static inline void io_add_napi(struct file *file, struct io_ring_ctx *ctx)
+{
+}
+
+static inline void io_free_napi_list(struct io_ring_ctx *ctx)
+{
+}
+#endif /* CONFIG_NET_RX_BUSY_POLL */
+
 struct io_poll_table {
 	struct poll_table_struct pt;
 	struct io_kiocb *req;
@@ -5545,6 +5657,7 @@ static int io_poll_check_events(struct io_kiocb *req)
 			if (unlikely(!filled))
 				return -ECANCELED;
 			io_cqring_ev_posted(ctx);
+			io_add_napi(req->file, ctx);
 		} else if (req->result) {
 			return 0;
 		}
@@ -5777,6 +5890,7 @@ static int __io_arm_poll_handler(struct io_kiocb *req,
 		__io_poll_execute(req, mask);
 		return 0;
 	}
+	io_add_napi(req->file, req->ctx);

 	/*
 	 * Release ownership. If someone tried to queue a tw while it was
@@ -7519,7 +7633,11 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
 		    !(ctx->flags & IORING_SETUP_R_DISABLED))
 			ret = io_submit_sqes(ctx, to_submit);
 		mutex_unlock(&ctx->uring_lock);
-
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		if (!list_empty(&ctx->napi_list) &&
+		    io_napi_busy_loop(&ctx->napi_list))
+			++ret;
+#endif
 		if (to_submit && wq_has_sleeper(&ctx->sqo_sq_wait))
 			wake_up(&ctx->sqo_sq_wait);
 		if (creds)
@@ -7650,6 +7768,9 @@ struct io_wait_queue {
 	struct io_ring_ctx *ctx;
 	unsigned cq_tail;
 	unsigned nr_timeouts;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	unsigned busy_poll_to;
+#endif
 };

 static inline bool io_should_wake(struct io_wait_queue *iowq)
@@ -7711,6 +7832,87 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 	return 1;
 }

+#ifdef CONFIG_NET_RX_BUSY_POLL
+static void io_adjust_busy_loop_timeout(struct timespec64 *ts,
+					struct io_wait_queue *iowq)
+{
+	unsigned busy_poll_to = READ_ONCE(sysctl_net_busy_poll);
+	struct timespec64 pollto = ns_to_timespec64(1000 * (s64)busy_poll_to);
+
+	if (timespec64_compare(ts, &pollto) > 0) {
+		*ts = timespec64_sub(*ts, pollto);
+		iowq->busy_poll_to = busy_poll_to;
+	} else {
+		u64 to = timespec64_to_ns(ts);
+
+		do_div(to, 1000);
+		iowq->busy_poll_to = to;
+		ts->tv_sec = 0;
+		ts->tv_nsec = 0;
+	}
+}
+
+static inline bool io_busy_loop_timeout(unsigned long start_time,
+					unsigned long bp_usec)
+{
+	if (bp_usec) {
+		unsigned long end_time = start_time + bp_usec;
+		unsigned long now = busy_loop_current_time();
+
+		return time_after(now, end_time);
+	}
+	return true;
+}
+
+static bool io_busy_loop_end(void *p, unsigned long start_time)
+{
+	struct io_wait_queue *iowq = p;
+
+	return signal_pending(current) ||
+	       io_should_wake(iowq) ||
+	       io_busy_loop_timeout(start_time, iowq->busy_poll_to);
+}
+
+static void io_blocking_napi_busy_loop(struct list_head *napi_list,
+				       struct io_wait_queue *iowq)
+{
+	unsigned long start_time =
+		list_is_singular(napi_list) ? 0 :
+		busy_loop_current_time();
+
+	do {
+		if (list_is_singular(napi_list)) {
+			struct napi_entry *ne =
+				list_first_entry(napi_list,
+						 struct napi_entry, list);
+
+			napi_busy_loop(ne->napi_id, io_busy_loop_end, iowq,
+				       true, BUSY_POLL_BUDGET);
+			io_check_napi_entry_timeout(ne);
+			break;
+		}
+	} while (io_napi_busy_loop(napi_list) &&
+		 !io_busy_loop_end(iowq, start_time));
+}
+
+static void io_putback_napi_list(struct io_ring_ctx *ctx,
+				 struct list_head *napi_list)
+{
+	struct napi_entry *cne, *lne;
+
+	spin_lock(&ctx->napi_lock);
+	list_for_each_entry(cne, &ctx->napi_list, list)
+		list_for_each_entry(lne, napi_list, list)
+			if (cne->napi_id == lne->napi_id) {
+				list_del(&lne->list);
+				kfree(lne);
+				break;
+			}
+	list_splice(napi_list, &ctx->napi_list);
+	spin_unlock(&ctx->napi_lock);
+}
+#endif /* CONFIG_NET_RX_BUSY_POLL */
+
 /*
  * Wait until events become available, if we don't already have some. The
  * application must reap them itself, as they reside on the shared cq ring.
@@ -7723,6 +7925,9 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	struct io_rings *rings = ctx->rings;
 	ktime_t timeout = KTIME_MAX;
 	int ret;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	LIST_HEAD(local_napi_list);
+#endif

 	do {
 		io_cqring_overflow_flush(ctx);
@@ -7745,13 +7950,29 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 		return ret;
 	}

+#ifdef CONFIG_NET_RX_BUSY_POLL
+	iowq.busy_poll_to = 0;
+	if (!(ctx->flags & IORING_SETUP_SQPOLL)) {
+		spin_lock(&ctx->napi_lock);
+		list_splice_init(&ctx->napi_list, &local_napi_list);
+		spin_unlock(&ctx->napi_lock);
+	}
+#endif
 	if (uts) {
 		struct timespec64 ts;

 		if (get_timespec64(&ts, uts))
 			return -EFAULT;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		if (!list_empty(&local_napi_list))
+			io_adjust_busy_loop_timeout(&ts, &iowq);
+#endif
 		timeout = ktime_add_ns(timespec64_to_ktime(ts), ktime_get_ns());
 	}
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	else if (!list_empty(&local_napi_list))
+		iowq.busy_poll_to = READ_ONCE(sysctl_net_busy_poll);
+#endif

 	init_waitqueue_func_entry(&iowq.wq, io_wake_function);
 	iowq.wq.private = current;
@@ -7761,6 +7982,12 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	iowq.cq_tail = READ_ONCE(ctx->rings->cq.head) + min_events;
 	trace_io_uring_cqring_wait(ctx, min_events);

+#ifdef CONFIG_NET_RX_BUSY_POLL
+	if (iowq.busy_poll_to)
+		io_blocking_napi_busy_loop(&local_napi_list, &iowq);
+	if (!list_empty(&local_napi_list))
+		io_putback_napi_list(ctx, &local_napi_list);
+#endif
 	do {
 		/* if we can't even flush overflow, don't wait for more */
 		if (!io_cqring_overflow_flush(ctx)) {
@@ -9483,6 +9710,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	io_req_caches_free(ctx);
 	if (ctx->hash_map)
 		io_wq_put_hash(ctx->hash_map);
+	io_free_napi_list(ctx);
 	kfree(ctx->cancel_hash);
 	kfree(ctx->dummy_ubuf);
 	kfree(ctx);
-- 
2.35.1
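As an aside (also not part of the patch), the split performed by
io_adjust_busy_loop_timeout() above can be illustrated with a small
standalone userspace program. The 50 usec value stands in for
/proc/sys/net/core/busy_poll and the 1 ms wait is arbitrary; the point is
that the busy-poll budget is carved out of the caller's timeout rather
than added to it, so the total wait time is unchanged.

/*
 * Standalone illustration of the timeout split done by
 * io_adjust_busy_loop_timeout() in this patch (not kernel code).
 */
#include <stdio.h>
#include <stdint.h>

struct split {
	uint64_t busy_poll_usec;	/* time to spend busy polling */
	uint64_t sleep_nsec;		/* remainder spent sleeping */
};

static struct split adjust_timeout(uint64_t user_timeout_nsec,
				   uint64_t busy_poll_usec)
{
	struct split s;
	uint64_t poll_nsec = busy_poll_usec * 1000;

	if (user_timeout_nsec > poll_nsec) {
		/* Busy poll for the full sysctl budget, sleep for the rest. */
		s.busy_poll_usec = busy_poll_usec;
		s.sleep_nsec = user_timeout_nsec - poll_nsec;
	} else {
		/* Timeout shorter than the budget: busy poll only. */
		s.busy_poll_usec = user_timeout_nsec / 1000;
		s.sleep_nsec = 0;
	}
	return s;
}

int main(void)
{
	/* a 1 ms wait with busy_poll=50 -> 50 usec polling, 950 usec sleeping */
	struct split s = adjust_timeout(1000000, 50);

	printf("busy poll %llu usec, then sleep %llu nsec\n",
	       (unsigned long long)s.busy_poll_usec,
	       (unsigned long long)s.sleep_nsec);
	return 0;
}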