From: David Howells
To: netdev@vger.kernel.org
Cc: David Howells, Alexander Duyck, "David S. Miller", Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Willem de Bruijn, David Ahern,
    Matthew Wilcox, Jens Axboe, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, Bernard Metzler, Tom Talpey,
    linux-rdma@vger.kernel.org
Subject: [PATCH net-next 04/17] siw: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage to transmit
Date: Fri, 16 Jun 2023 17:12:47 +0100
Message-ID: <20230616161301.622169-5-dhowells@redhat.com>
In-Reply-To: <20230616161301.622169-1-dhowells@redhat.com>
References: <20230616161301.622169-1-dhowells@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

When transmitting data, call down into TCP using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
performing several sendmsg and sendpage calls to transmit header, data
pages and trailer.

To make this work, the data is assembled in a bio_vec array and attached to
a BVEC-type iterator.  The header and trailer (if present) are copied into
page fragments that can be freed with put_page().

Signed-off-by: David Howells
cc: Bernard Metzler
cc: Tom Talpey
cc: "David S. Miller"
cc: Eric Dumazet
cc: Jakub Kicinski
cc: Paolo Abeni
cc: Jens Axboe
cc: Matthew Wilcox
cc: linux-rdma@vger.kernel.org
cc: netdev@vger.kernel.org
---
 drivers/infiniband/sw/siw/siw_qp_tx.c | 230 ++++----------------------
 1 file changed, 35 insertions(+), 195 deletions(-)

diff --git a/drivers/infiniband/sw/siw/siw_qp_tx.c b/drivers/infiniband/sw/siw/siw_qp_tx.c
index ffb16beb6c30..a15b3fedf484 100644
--- a/drivers/infiniband/sw/siw/siw_qp_tx.c
+++ b/drivers/infiniband/sw/siw/siw_qp_tx.c
@@ -311,114 +311,8 @@ static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
 	return rv;
 }
 
-/*
- * 0copy TCP transmit interface: Use MSG_SPLICE_PAGES.
- *
- * Using sendpage to push page by page appears to be less efficient
- * than using sendmsg, even if data are copied.
- *
- * A general performance limitation might be the extra four bytes
- * trailer checksum segment to be pushed after user data.
- */
-static int siw_tcp_sendpages(struct socket *s, struct page **page, int offset,
-			     size_t size)
-{
-	struct bio_vec bvec;
-	struct msghdr msg = {
-		.msg_flags = (MSG_MORE | MSG_DONTWAIT | MSG_SENDPAGE_NOTLAST |
-			      MSG_SPLICE_PAGES),
-	};
-	struct sock *sk = s->sk;
-	int i = 0, rv = 0, sent = 0;
-
-	while (size) {
-		size_t bytes = min_t(size_t, PAGE_SIZE - offset, size);
-
-		if (size + offset <= PAGE_SIZE)
-			msg.msg_flags &= ~MSG_SENDPAGE_NOTLAST;
-
-		tcp_rate_check_app_limited(sk);
-		bvec_set_page(&bvec, page[i], bytes, offset);
-		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-
-try_page_again:
-		lock_sock(sk);
-		rv = tcp_sendmsg_locked(sk, &msg, size);
-		release_sock(sk);
-
-		if (rv > 0) {
-			size -= rv;
-			sent += rv;
-			if (rv != bytes) {
-				offset += rv;
-				bytes -= rv;
-				goto try_page_again;
-			}
-			offset = 0;
-		} else {
-			if (rv == -EAGAIN || rv == 0)
-				break;
-			return rv;
-		}
-		i++;
-	}
-	return sent;
-}
-
-/*
- * siw_0copy_tx()
- *
- * Pushes list of pages to TCP socket. If pages from multiple
- * SGE's, all referenced pages of each SGE are pushed in one
- * shot.
- */
-static int siw_0copy_tx(struct socket *s, struct page **page,
-			struct siw_sge *sge, unsigned int offset,
-			unsigned int size)
-{
-	int i = 0, sent = 0, rv;
-	int sge_bytes = min(sge->length - offset, size);
-
-	offset = (sge->laddr + offset) & ~PAGE_MASK;
-
-	while (sent != size) {
-		rv = siw_tcp_sendpages(s, &page[i], offset, sge_bytes);
-		if (rv >= 0) {
-			sent += rv;
-			if (size == sent || sge_bytes > rv)
-				break;
-
-			i += PAGE_ALIGN(sge_bytes + offset) >> PAGE_SHIFT;
-			sge++;
-			sge_bytes = min(sge->length, size - sent);
-			offset = sge->laddr & ~PAGE_MASK;
-		} else {
-			sent = rv;
-			break;
-		}
-	}
-	return sent;
-}
-
 #define MAX_TRAILER (MPA_CRC_SIZE + 4)
 
-static void siw_unmap_pages(struct kvec *iov, unsigned long kmap_mask, int len)
-{
-	int i;
-
-	/*
-	 * Work backwards through the array to honor the kmap_local_page()
-	 * ordering requirements.
-	 */
-	for (i = (len-1); i >= 0; i--) {
-		if (kmap_mask & BIT(i)) {
-			unsigned long addr = (unsigned long)iov[i].iov_base;
-
-			kunmap_local((void *)(addr & PAGE_MASK));
-		}
-	}
-}
-
 /*
  * siw_tx_hdt() tries to push a complete packet to TCP where all
  * packet fragments are referenced by the elements of one iovec.
@@ -438,30 +332,21 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 {
 	struct siw_wqe *wqe = &c_tx->wqe_active;
 	struct siw_sge *sge = &wqe->sqe.sge[c_tx->sge_idx];
-	struct kvec iov[MAX_ARRAY];
-	struct page *page_array[MAX_ARRAY];
+	struct bio_vec bvec[MAX_ARRAY];
 	struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_EOR };
+	void *trl;
 
 	int seg = 0, do_crc = c_tx->do_crc, is_kva = 0, rv;
 	unsigned int data_len = c_tx->bytes_unsent, hdr_len = 0, trl_len = 0,
 		     sge_off = c_tx->sge_off, sge_idx = c_tx->sge_idx,
 		     pbl_idx = c_tx->pbl_idx;
-	unsigned long kmap_mask = 0L;
 
 	if (c_tx->state == SIW_SEND_HDR) {
-		if (c_tx->use_sendpage) {
-			rv = siw_tx_ctrl(c_tx, s, MSG_DONTWAIT | MSG_MORE);
-			if (rv)
-				goto done;
+		void *hdr = &c_tx->pkt.ctrl + c_tx->ctrl_sent;
 
-			c_tx->state = SIW_SEND_DATA;
-		} else {
-			iov[0].iov_base =
-				(char *)&c_tx->pkt.ctrl + c_tx->ctrl_sent;
-			iov[0].iov_len = hdr_len =
-				c_tx->ctrl_len - c_tx->ctrl_sent;
-			seg = 1;
-		}
+		hdr_len = c_tx->ctrl_len - c_tx->ctrl_sent;
+		bvec_set_virt(&bvec[0], hdr, hdr_len);
+		seg = 1;
 	}
 
 	wqe->processed += data_len;
@@ -477,28 +362,9 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 		} else {
 			is_kva = 1;
 		}
-		if (is_kva && !c_tx->use_sendpage) {
-			/*
-			 * tx from kernel virtual address: either inline data
-			 * or memory region with assigned kernel buffer
-			 */
-			iov[seg].iov_base =
-				ib_virt_dma_to_ptr(sge->laddr + sge_off);
-			iov[seg].iov_len = sge_len;
-
-			if (do_crc)
-				crypto_shash_update(c_tx->mpa_crc_hd,
-						    iov[seg].iov_base,
-						    sge_len);
-			sge_off += sge_len;
-			data_len -= sge_len;
-			seg++;
-			goto sge_done;
-		}
 
 		while (sge_len) {
 			size_t plen = min((int)PAGE_SIZE - fp_off, sge_len);
-			void *kaddr;
 
 			if (!is_kva) {
 				struct page *p;
@@ -511,33 +377,12 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 					p = siw_get_upage(mem->umem,
 							  sge->laddr + sge_off);
 				if (unlikely(!p)) {
-					siw_unmap_pages(iov, kmap_mask, seg);
 					wqe->processed -= c_tx->bytes_unsent;
 					rv = -EFAULT;
 					goto done_crc;
 				}
-				page_array[seg] = p;
-
-				if (!c_tx->use_sendpage) {
-					void *kaddr = kmap_local_page(p);
-
-					/* Remember for later kunmap() */
-					kmap_mask |= BIT(seg);
-					iov[seg].iov_base = kaddr + fp_off;
-					iov[seg].iov_len = plen;
-
-					if (do_crc)
-						crypto_shash_update(
-							c_tx->mpa_crc_hd,
-							iov[seg].iov_base,
-							plen);
-				} else if (do_crc) {
-					kaddr = kmap_local_page(p);
-					crypto_shash_update(c_tx->mpa_crc_hd,
-							    kaddr + fp_off,
-							    plen);
-					kunmap_local(kaddr);
-				}
+
+				bvec_set_page(&bvec[seg], p, plen, fp_off);
 			} else {
 				/*
 				 * Cast to an uintptr_t to preserve all 64 bits
@@ -545,12 +390,16 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 				 */
 				u64 va = sge->laddr + sge_off;
 
-				page_array[seg] = ib_virt_dma_to_page(va);
-				if (do_crc)
-					crypto_shash_update(
-						c_tx->mpa_crc_hd,
-						ib_virt_dma_to_ptr(va),
-						plen);
+				bvec_set_virt(&bvec[seg], ib_virt_dma_to_ptr(va), plen);
+			}
+
+			if (do_crc) {
+				void *kaddr =
+					kmap_local_page(bvec[seg].bv_page);
+				crypto_shash_update(c_tx->mpa_crc_hd,
+						    kaddr + bvec[seg].bv_offset,
+						    bvec[seg].bv_len);
+				kunmap_local(kaddr);
 			}
 
 			sge_len -= plen;
@@ -560,13 +409,12 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 
 			if (++seg >= (int)MAX_ARRAY) {
 				siw_dbg_qp(tx_qp(c_tx), "to many fragments\n");
-				siw_unmap_pages(iov, kmap_mask, seg-1);
 				wqe->processed -= c_tx->bytes_unsent;
 				rv = -EMSGSIZE;
 				goto done_crc;
 			}
 		}
-sge_done:
+
 		/* Update SGE variables at end of SGE */
 		if (sge_off == sge->length &&
 		    (data_len != 0 || wqe->processed < wqe->bytes)) {
@@ -575,15 +423,8 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 			sge_off = 0;
 		}
 	}
-	/* trailer */
-	if (likely(c_tx->state != SIW_SEND_TRAILER)) {
-		iov[seg].iov_base = &c_tx->trailer.pad[4 - c_tx->pad];
-		iov[seg].iov_len = trl_len = MAX_TRAILER - (4 - c_tx->pad);
-	} else {
-		iov[seg].iov_base = &c_tx->trailer.pad[c_tx->ctrl_sent];
-		iov[seg].iov_len = trl_len = MAX_TRAILER - c_tx->ctrl_sent;
-	}
 
+	/* Set the CRC in the trailer */
 	if (c_tx->pad) {
 		*(u32 *)c_tx->trailer.pad = 0;
 		if (do_crc)
@@ -596,23 +437,23 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 	else if (do_crc)
 		crypto_shash_final(c_tx->mpa_crc_hd, (u8 *)&c_tx->trailer.crc);
 
-	data_len = c_tx->bytes_unsent;
-
-	if (c_tx->use_sendpage) {
-		rv = siw_0copy_tx(s, page_array, &wqe->sqe.sge[c_tx->sge_idx],
-				  c_tx->sge_off, data_len);
-		if (rv == data_len) {
-			rv = kernel_sendmsg(s, &msg, &iov[seg], 1, trl_len);
-			if (rv > 0)
-				rv += data_len;
-			else
-				rv = data_len;
-		}
+	/* Copy the trailer and add it to the output list */
+	if (likely(c_tx->state != SIW_SEND_TRAILER)) {
+		trl = &c_tx->trailer.pad[4 - c_tx->pad];
+		trl_len = MAX_TRAILER - (4 - c_tx->pad);
 	} else {
-		rv = kernel_sendmsg(s, &msg, iov, seg + 1,
-				    hdr_len + data_len + trl_len);
-		siw_unmap_pages(iov, kmap_mask, seg);
+		trl = &c_tx->trailer.pad[c_tx->ctrl_sent];
+		trl_len = MAX_TRAILER - c_tx->ctrl_sent;
 	}
+	bvec_set_virt(&bvec[seg], trl, trl_len);
+
+	data_len = c_tx->bytes_unsent;
+
+	if (c_tx->use_sendpage)
+		msg.msg_flags |= MSG_SPLICE_PAGES;
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, seg + 1,
+		      hdr_len + data_len + trl_len);
+	rv = sock_sendmsg(s, &msg);
 	if (rv < (int)hdr_len) {
 		/* Not even complete hdr pushed or negative rv */
 		wqe->processed -= data_len;
@@ -673,7 +514,6 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 	}
 done_crc:
 	c_tx->do_crc = 0;
-done:
 	return rv;
 }
 