Message-ID: <9f43b689-1c9a-6132-b9c5-2e19d249c4d0@sberdevices.ru>
Date: Tue, 25 Jul 2023 16:10:40 +0300
Subject: Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
To: "Michael S. Tsirkin"
CC: Stefan Hajnoczi, Stefano Garzarella, "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang, Bobby Eshleman
References: <20230720214245.457298-1-AVKrasnov@sberdevices.ru> <20230720214245.457298-5-AVKrasnov@sberdevices.ru> <20230725042452-mutt-send-email-mst@kernel.org> <20230725075413-mutt-send-email-mst@kernel.org>
From: Arseniy Krasnov
In-Reply-To: <20230725075413-mutt-send-email-mst@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 25.07.2023 14:59, Michael S. Tsirkin wrote:
> On Tue, Jul 25, 2023 at 11:39:22AM +0300, Arseniy Krasnov wrote:
>>
>>
>> On 25.07.2023 11:25, Michael S. Tsirkin wrote:
>>> On Fri, Jul 21, 2023 at 12:42:45AM +0300, Arseniy Krasnov wrote:
>>>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>>>> flag is set and zerocopy transmission is possible (enabled in socket
>>>> options and transport allows zerocopy), then non-linear skb will be
>>>> created and filled with the pages of user's buffer. Pages of user's
>>>> buffer are locked in memory by 'get_user_pages()'. Second thing that
>>>> this patch does is replace type of skb owning: instead of calling
>>>> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
>>>> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
>>>> of socket, so to decrease this field correctly proper skb destructor is
>>>> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
>>>>
>>>> Signed-off-by: Arseniy Krasnov
>>>> ---
>>>>  Changelog:
>>>>  v5(big patchset) -> v1:
>>>>   * Refactorings of 'if' conditions.
>>>>   * Remove extra blank line.
>>>>   * Remove 'frag_off' field unneeded init.
>>>>   * Add function 'virtio_transport_fill_skb()' which fills both linear
>>>>     and non-linear skb with provided data.
>>>>  v1 -> v2:
>>>>   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
>>>>  v2 -> v3:
>>>>   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
>>>>     provided 'iov_iter' with data could be sent in a zerocopy mode.
>>>>     If this callback is not set in transport - transport allows to send
>>>>     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
>>>>     then zerocopy is allowed. Reason of this callback is that in case of
>>>>     G2H transmission we insert whole skb to the tx virtio queue and such
>>>>     skb must fit to the size of the virtio queue to be sent in a single
>>>>     iteration (may be tx logic in 'virtio_transport.c' could be reworked
>>>>     as in vhost to support partial send of current skb). This callback
>>>>     will be enabled only for G2H path. For details pls see comment
>>>>     'Check that tx queue...' below.
>>>>
>>>>  include/net/af_vsock.h                  |   3 +
>>>>  net/vmw_vsock/virtio_transport.c        |  39 ++++
>>>>  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
>>>>  3 files changed, 241 insertions(+), 58 deletions(-)
>>>>
>>>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>>>> index 0e7504a42925..a6b346eeeb8e 100644
>>>> --- a/include/net/af_vsock.h
>>>> +++ b/include/net/af_vsock.h
>>>> @@ -177,6 +177,9 @@ struct vsock_transport {
>>>>
>>>>  	/* Read a single skb */
>>>>  	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>>>> +
>>>> +	/* Zero-copy. */
>>>> +	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
>>>>  };
>>>>
>>>>  /**** CORE ****/
>>>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>>>> index 7bbcc8093e51..23cb8ed638c4 100644
>>>> --- a/net/vmw_vsock/virtio_transport.c
>>>> +++ b/net/vmw_vsock/virtio_transport.c
>>>> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>>>>  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>>>>  }
>>>>
>>>> +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
>>>> +{
>>>> +	struct virtio_vsock *vsock;
>>>> +	bool res = false;
>>>> +
>>>> +	rcu_read_lock();
>>>> +
>>>> +	vsock = rcu_dereference(the_virtio_vsock);
>>>> +	if (vsock) {
>>>> +		struct virtqueue *vq;
>>>> +		int iov_pages;
>>>> +
>>>> +		vq = vsock->vqs[VSOCK_VQ_TX];
>>>> +
>>>> +		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
>>>> +
>>>> +		/* Check that tx queue is large enough to keep whole
>>>> +		 * data to send. This is needed, because when there is
>>>> +		 * not enough free space in the queue, current skb to
>>>> +		 * send will be reinserted to the head of tx list of
>>>> +		 * the socket to retry transmission later, so if skb
>>>> +		 * is bigger than whole queue, it will be reinserted
>>>> +		 * again and again, thus blocking other skbs to be sent.
>>>> +		 * Each page of the user provided buffer will be added
>>>> +		 * as a single buffer to the tx virtqueue, so compare
>>>> +		 * number of pages against maximum capacity of the queue.
>>>> +		 * +1 means buffer for the packet header.
>>>> +		 */
>>>> +		if (iov_pages + 1 <= vq->num_max)
>>>> +			res = true;
>>>
>>>
>>> Yes but can't there already be buffers in the queue?
>>> Then you can't stick num_max there.
>>
>> I think, that it is not critical, because vhost part always tries to process all
>> incoming buffers (yes, 'vhost_exceeds_weight()' breaks at some moment, but it will
>> reschedule tx kick ('vhost_vsock_handle_tx_kick()') work again), so current "too
>> big" skb will wait until there will be enough space in queue and as it is requeued
>> to the head of tx list it will be inserted to tx queue first.
>>
>> But anyway, I agree that comparing to 'num_free' may be more effective to the whole
>> system performance...
>>
>> Thanks, Arseniy
>
> Oh I see. It makes sense then - instead of copying just so we can
> stick it in the queue, wait a bit and send later.
> Also - for stream transports can't the message be split
> and sent chunk by chunk? Better than copying ...

Technically yes. We could also split the message for non-stream sockets (as vhost
does when it copies data to the rx buffers of the guest), but that requires
reworking the current implementation so that buffers are added to the tx queue
one by one. I think it was not implemented here because, until MSG_ZEROCOPY,
every skb required one buffer (if it is a control message) or two (with payload),
so there was little point in processing at most two buffers in "one-by-one" mode:
we can just wait for space.

Maybe I can add this logic for non-linear skbs here:

if (skb->len > vq->num_max)
	add buffers "one-by-one", incrementing internal offset in skb,
	if (new skb insertion fails)
		requeue skb, wait for space.

In the TX done callback I'll call consume skb only when the above-mentioned
internal offset == skb->len. I think this approach would let us get rid of the
new 'check_iov' callback from this patch.

Stefano, what do you think?
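
To make the idea above a bit more concrete, here is a rough userspace sketch in C. It is not kernel code: 'struct model_txq', 'struct model_skb' and both functions are hypothetical names made up for this mail, and a real implementation would use the virtqueue API in 'virtio_transport.c' instead. It only models the bookkeeping: an internal offset in the skb that advances as buffers are added, a "requeue and wait" result when the queue is full, and consuming the skb only after every buffer has completed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a tx virtqueue: only slot accounting. */
struct model_txq {
	size_t num_max;   /* total number of slots in the queue */
	size_t num_used;  /* slots currently occupied */
};

/* Hypothetical model of a non-linear skb split into buffers. */
struct model_skb {
	size_t nbufs;     /* header buffer + one buffer per payload page */
	size_t queued;    /* internal offset: buffers already added to the queue */
	size_t completed; /* buffers already returned by the other side */
};

/* Add remaining buffers one by one while there is free space.
 * Returns true when the whole skb is in the queue. Returns false when
 * the queue is full: the caller requeues the skb to the head of the
 * tx list and retries later from the saved offset.
 */
static bool model_send_partial(struct model_txq *q, struct model_skb *skb)
{
	while (skb->queued < skb->nbufs) {
		if (q->num_used == q->num_max)
			return false;
		q->num_used++;
		skb->queued++;
	}
	return true;
}

/* TX done for 'n' buffers of this skb.
 * Returns true only when all buffers of the skb have completed,
 * i.e. when it is safe to consume the skb.
 */
static bool model_tx_done(struct model_txq *q, struct model_skb *skb, size_t n)
{
	size_t outstanding = skb->queued - skb->completed;

	if (n > outstanding)
		n = outstanding;
	q->num_used -= n;
	skb->completed += n;
	return skb->completed == skb->nbufs;
}
```

With a queue of 4 slots and an skb of 6 buffers, the first 'model_send_partial()' call queues 4 buffers and asks for a requeue; after TX done for those 4, the second call queues the remaining 2, and the skb is consumed only after they complete too. So an skb bigger than the whole queue no longer blocks forever, which is why the 'check_iov' callback would not be needed in this scheme.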
Thanks, Arseniy

>
>
>>>
>>>
>>>> +	}
>>>> +
>>>> +	rcu_read_unlock();
>>>> +
>>>> +	return res;
>>>> +}
>>>> +
>>>>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>>>>
>>>>  static struct virtio_transport virtio_transport = {
>>>> @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
>>>>  		.seqpacket_allow = virtio_transport_seqpacket_allow,
>>>>  		.seqpacket_has_data = virtio_transport_seqpacket_has_data,
>>>>
>>>> +		.msgzerocopy_check_iov = virtio_transport_msgzerocopy_check_iov,
>>>> +
>>>>  		.notify_poll_in = virtio_transport_notify_poll_in,
>>>>  		.notify_poll_out = virtio_transport_notify_poll_out,
>>>>  		.notify_recv_init = virtio_transport_notify_recv_init,
>>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>>> index 26a4d10da205..e4e3d541aff4 100644
>>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>>> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>>>>  	return container_of(t, struct virtio_transport, transport);
>>>>  }
>>>>
>>>> -/* Returns a new packet on success, otherwise returns NULL.
>>>> - *
>>>> - * If NULL is returned, errp is set to a negative errno.
>>>> - */
>>>> -static struct sk_buff *
>>>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>>>> -			   size_t len,
>>>> -			   u32 src_cid,
>>>> -			   u32 src_port,
>>>> -			   u32 dst_cid,
>>>> -			   u32 dst_port)
>>>> -{
>>>> -	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>>>> -	struct virtio_vsock_hdr *hdr;
>>>> -	struct sk_buff *skb;
>>>> -	void *payload;
>>>> -	int err;
>>>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>>>> +				       size_t max_to_send)
>>>> +{
>>>> +	const struct vsock_transport *t;
>>>> +	struct iov_iter *iov_iter;
>>>>
>>>> -	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>> -	if (!skb)
>>>> -		return NULL;
>>>> +	if (!info->msg)
>>>> +		return false;
>>>>
>>>> -	hdr = virtio_vsock_hdr(skb);
>>>> -	hdr->type = cpu_to_le16(info->type);
>>>> -	hdr->op = cpu_to_le16(info->op);
>>>> -	hdr->src_cid = cpu_to_le64(src_cid);
>>>> -	hdr->dst_cid = cpu_to_le64(dst_cid);
>>>> -	hdr->src_port = cpu_to_le32(src_port);
>>>> -	hdr->dst_port = cpu_to_le32(dst_port);
>>>> -	hdr->flags = cpu_to_le32(info->flags);
>>>> -	hdr->len = cpu_to_le32(len);
>>>> +	iov_iter = &info->msg->msg_iter;
>>>>
>>>> -	if (info->msg && len > 0) {
>>>> -		payload = skb_put(skb, len);
>>>> -		err = memcpy_from_msg(payload, info->msg, len);
>>>> -		if (err)
>>>> -			goto out;
>>>> +	t = vsock_core_get_transport(info->vsk);
>>>>
>>>> -		if (msg_data_left(info->msg) == 0 &&
>>>> -		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>> -			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>> +	if (t->msgzerocopy_check_iov &&
>>>> +	    !t->msgzerocopy_check_iov(iov_iter))
>>>> +		return false;
>>>>
>>>> -			if (info->msg->msg_flags & MSG_EOR)
>>>> -				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>> -		}
>>>> +	/* Data is simple buffer. */
>>>> +	if (iter_is_ubuf(iov_iter))
>>>> +		return true;
>>>> +
>>>> +	if (!iter_is_iovec(iov_iter))
>>>> +		return false;
>>>> +
>>>> +	if (iov_iter->iov_offset)
>>>> +		return false;
>>>> +
>>>> +	/* We can't send whole iov. */
>>>> +	if (iov_iter->count > max_to_send)
>>>> +		return false;
>>>> +
>>>> +	return true;
>>>> +}
>>>> +
>>>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>>>> +					   struct sk_buff *skb,
>>>> +					   struct msghdr *msg,
>>>> +					   bool zerocopy)
>>>> +{
>>>> +	struct ubuf_info *uarg;
>>>> +
>>>> +	if (msg->msg_ubuf) {
>>>> +		uarg = msg->msg_ubuf;
>>>> +		net_zcopy_get(uarg);
>>>> +	} else {
>>>> +		struct iov_iter *iter = &msg->msg_iter;
>>>> +		struct ubuf_info_msgzc *uarg_zc;
>>>> +		int len;
>>>> +
>>>> +		/* Only ITER_IOVEC or ITER_UBUF are allowed and
>>>> +		 * checked before.
>>>> +		 */
>>>> +		if (iter_is_iovec(iter))
>>>> +			len = iov_length(iter->__iov, iter->nr_segs);
>>>> +		else
>>>> +			len = iter->count;
>>>> +
>>>> +		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>>> +					    len,
>>>> +					    NULL);
>>>> +		if (!uarg)
>>>> +			return -1;
>>>> +
>>>> +		uarg_zc = uarg_to_msgzc(uarg);
>>>> +		uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>>>  	}
>>>>
>>>> -	if (info->reply)
>>>> -		virtio_vsock_skb_set_reply(skb);
>>>> +	skb_zcopy_init(skb, uarg);
>>>>
>>>> -	trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>> -					 dst_cid, dst_port,
>>>> -					 len,
>>>> -					 info->type,
>>>> -					 info->op,
>>>> -					 info->flags);
>>>> +	return 0;
>>>> +}
>>>>
>>>> -	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>>>> -		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>>>> -		goto out;
>>>> +static int virtio_transport_fill_skb(struct sk_buff *skb,
>>>> +				     struct virtio_vsock_pkt_info *info,
>>>> +				     size_t len,
>>>> +				     bool zcopy)
>>>> +{
>>>> +	if (zcopy) {
>>>> +		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
>>>> +					      &info->msg->msg_iter,
>>>> +					      len);
>>>> +	} else {
>>>> +		void *payload;
>>>> +		int err;
>>>> +
>>>> +		payload = skb_put(skb, len);
>>>> +		err = memcpy_from_msg(payload, info->msg, len);
>>>> +		if (err)
>>>> +			return -1;
>>>> +
>>>> +		if (msg_data_left(info->msg))
>>>> +			return 0;
>>>> +
>>>> +		return 0;
>>>>  	}
>>>> +}
>>>>
>>>> -	return skb;
>>>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
>>>> +				      struct virtio_vsock_pkt_info *info,
>>>> +				      u32 src_cid,
>>>> +				      u32 src_port,
>>>> +				      u32 dst_cid,
>>>> +				      u32 dst_port,
>>>> +				      size_t len)
>>>> +{
>>>> +	struct virtio_vsock_hdr *hdr;
>>>>
>>>> -out:
>>>> -	kfree_skb(skb);
>>>> -	return NULL;
>>>> +	hdr = virtio_vsock_hdr(skb);
>>>> +	hdr->type = cpu_to_le16(info->type);
>>>> +	hdr->op = cpu_to_le16(info->op);
>>>> +	hdr->src_cid = cpu_to_le64(src_cid);
>>>> +	hdr->dst_cid = cpu_to_le64(dst_cid);
>>>> +	hdr->src_port = cpu_to_le32(src_port);
>>>> +	hdr->dst_port = cpu_to_le32(dst_port);
>>>> +	hdr->flags = cpu_to_le32(info->flags);
>>>> +	hdr->len = cpu_to_le32(len);
>>>>  }
>>>>
>>>>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
>>>> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
>>>>  }
>>>>
>>>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>>>> +						  struct virtio_vsock_pkt_info *info,
>>>> +						  size_t payload_len,
>>>> +						  bool zcopy,
>>>> +						  u32 src_cid,
>>>> +						  u32 src_port,
>>>> +						  u32 dst_cid,
>>>> +						  u32 dst_port)
>>>> +{
>>>> +	struct sk_buff *skb;
>>>> +	size_t skb_len;
>>>> +
>>>> +	skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>>>> +
>>>> +	if (!zcopy)
>>>> +		skb_len += payload_len;
>>>> +
>>>> +	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>> +	if (!skb)
>>>> +		return NULL;
>>>> +
>>>> +	virtio_transport_init_hdr(skb, info, src_cid, src_port,
>>>> +				  dst_cid, dst_port,
>>>> +				  payload_len);
>>>> +
>>>> +	/* Set owner here, because '__zerocopy_sg_from_iter()' uses
>>>> +	 * owner of skb without check to update 'sk_wmem_alloc'.
>>>> +	 */
>>>> +	if (vsk)
>>>> +		skb_set_owner_w(skb, sk_vsock(vsk));
>>>> +
>>>> +	if (info->msg && payload_len > 0) {
>>>> +		int err;
>>>> +
>>>> +		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>>>> +		if (err)
>>>> +			goto out;
>>>> +
>>>> +		if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>> +			struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
>>>> +
>>>> +			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>> +
>>>> +			if (info->msg->msg_flags & MSG_EOR)
>>>> +				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>> +		}
>>>> +	}
>>>> +
>>>> +	if (info->reply)
>>>> +		virtio_vsock_skb_set_reply(skb);
>>>> +
>>>> +	trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>> +					 dst_cid, dst_port,
>>>> +					 payload_len,
>>>> +					 info->type,
>>>> +					 info->op,
>>>> +					 info->flags);
>>>> +
>>>> +	return skb;
>>>> +out:
>>>> +	kfree_skb(skb);
>>>> +	return NULL;
>>>> +}
>>>> +
>>>>  /* This function can only be used on connecting/connected sockets,
>>>>   * since a socket assigned to a transport is required.
>>>>   *
>>>> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>>  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>  					  struct virtio_vsock_pkt_info *info)
>>>>  {
>>>> +	u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>>>>  	u32 src_cid, src_port, dst_cid, dst_port;
>>>>  	const struct virtio_transport *t_ops;
>>>>  	struct virtio_vsock_sock *vvs;
>>>>  	u32 pkt_len = info->pkt_len;
>>>> +	bool can_zcopy = false;
>>>>  	u32 rest_len;
>>>>  	int ret;
>>>>
>>>> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>>>  		return pkt_len;
>>>>
>>>> +	if (info->msg) {
>>>> +		/* If zerocopy is not enabled by 'setsockopt()', we behave as
>>>> +		 * there is no MSG_ZEROCOPY flag set.
>>>> +		 */
>>>> +		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>>>> +			info->msg->msg_flags &= ~MSG_ZEROCOPY;
>>>> +
>>>> +		if (info->msg->msg_flags & MSG_ZEROCOPY)
>>>> +			can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>>>> +
>>>> +		if (can_zcopy)
>>>> +			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>>> +					    (MAX_SKB_FRAGS * PAGE_SIZE));
>>>> +	}
>>>> +
>>>>  	rest_len = pkt_len;
>>>>
>>>>  	do {
>>>>  		struct sk_buff *skb;
>>>>  		size_t skb_len;
>>>>
>>>> -		skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>>>> +		skb_len = min(max_skb_len, rest_len);
>>>>
>>>> -		skb = virtio_transport_alloc_skb(info, skb_len,
>>>> +		skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>>>>  						 src_cid, src_port,
>>>>  						 dst_cid, dst_port);
>>>>  		if (!skb) {
>>>> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>  			break;
>>>>  		}
>>>>
>>>> +		/* This is last skb to send this portion of data. */
>>>> +		if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
>>>> +		    skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
>>>> +			if (virtio_transport_init_zcopy_skb(vsk, skb,
>>>> +							    info->msg,
>>>> +							    can_zcopy)) {
>>>> +				ret = -ENOMEM;
>>>> +				break;
>>>> +			}
>>>> +		}
>>>> +
>>>>  		virtio_transport_inc_tx_pkt(vvs, skb);
>>>>
>>>>  		ret = t_ops->send_pkt(skb);
>>>> @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>>>>  	if (!t)
>>>>  		return -ENOTCONN;
>>>>
>>>> -	reply = virtio_transport_alloc_skb(&info, 0,
>>>> +	reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
>>>>  					   le64_to_cpu(hdr->dst_cid),
>>>>  					   le32_to_cpu(hdr->dst_port),
>>>>  					   le64_to_cpu(hdr->src_cid),
>>>> --
>>>> 2.25.1
>>>
>