Received: by 10.192.165.148 with SMTP id m20csp881857imm; Fri, 27 Apr 2018 09:01:11 -0700 (PDT) X-Google-Smtp-Source: AB8JxZq92Ls2loCsEnYtJcyqLPsYOCxBjjneDNbH9JnyBILwuLjpdd34G2nMg9xM99srSNkKbih8 X-Received: by 10.98.60.16 with SMTP id j16mr2696048pfa.7.1524844871846; Fri, 27 Apr 2018 09:01:11 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1524844871; cv=none; d=google.com; s=arc-20160816; b=h3oH7Kz9s7rCGzc67/74L/oX2IFQlzkZJPHIfugBttOtZegIlUZ9QaBw4ynQsQPrxg 4sobf0XAUwvIgmVbpxYp4lm2Q9shGWykOdIrWvlDLS8GjULqwV414gj1Zp8GHKVkHRzc jJisMlISeaaQXTAKovfy2Sdjd6MAkvNJ/1Rp+dSxOfn/X66tx8N9sAzJK8cHdOGs4Mw4 NqRNkniV3tf/uztTwLf8+kHO6vaQSNg/7EalzuOFMXY++526b/PMnof96woy6o+wC6Aa LnIUYPbFls4EoQ1p3aoBSMcoTvmVXPJAwiLFthOVu4GWaMZtynZo7JK5DzirnAFHgeQG otDg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:references:in-reply-to:message-id:date :subject:cc:to:from:dkim-signature:arc-authentication-results; bh=XO5swcQaeSstknl6fz7b2PToKBkcbqaLcu+XbHf+uv0=; b=b3OBMq0CQVdkDTz0vEtLizgkiV97BQ5fplRdj9XX/QfS1Gp7S5wqHm/H1+HtRL3ZZJ slb9JZLgpHUfhT6kBJANqNQPMHWhTMCqHR6gcgAmDh8t/0JpWrK5gy3Cpg7TBCNMS9Eh o3bHm2Ejyx4Fw6u8qxFxjg/vKDixWvwbHO6joJeZCOU1IQBlIPp3lWfLyGcSQ6Q/GUaX gtrlWoNBIYBrgi07kHG5Je5/WMgS9HWReS49qp693lw7n0cid3ZDGfCr2i4QXpgsCWKX 2Ax17eQd2+uxY2vS9za2AGoUhQqJRpXPzpuGMdXxdy8RuS6i5iNfVBRC1zXYPPhW83El sNZw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=dK7/8Vg9; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id a63si1524881pfc.131.2018.04.27.09.00.56; Fri, 27 Apr 2018 09:01:11 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=dK7/8Vg9; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932834AbeD0P6a (ORCPT + 99 others); Fri, 27 Apr 2018 11:58:30 -0400 Received: from mail-wm0-f53.google.com ([74.125.82.53]:53769 "EHLO mail-wm0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758610AbeD0P6Y (ORCPT ); Fri, 27 Apr 2018 11:58:24 -0400 Received: by mail-wm0-f53.google.com with SMTP id 66so3417469wmd.3 for ; Fri, 27 Apr 2018 08:58:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=XO5swcQaeSstknl6fz7b2PToKBkcbqaLcu+XbHf+uv0=; b=dK7/8Vg9cpwZUW+yJBRGza4Bvh9Mljl0h9DvgJZWBrIQ05nGQd+KQqtgUUdtloM+cn wUZNQHqPzWBOxpj1b9YMLZ8gUSz5Emu/RgYjAQjyWb1aNvaYFLMc9aJfsUj6MwPpc8Fq eWLgt+fPeeSGQ06e2lsV90/rlKjPJwhOevnHJ5QkOJLXtn2RNsYGEm/kXbUZnK8+j9yy u5bcNyfwzfXsC+sxSjzdMi+eToNGhK/ZayKW5Wg9ZgF+DxrgKMrmowNqY6j7vNuy9LPU Ec0IFB4I2cClTMq+UGK8rIvrqy6GDY/9vgsTWYb6rzb7U7JCImyIYl1lC++9PUisxngA 6wYg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=XO5swcQaeSstknl6fz7b2PToKBkcbqaLcu+XbHf+uv0=; b=dfCxkQADzfOtS3x7hFZENntJ09IeqwFBl1xiIE7nH6k1YXJ6SxLP3iOtSI5cSYSeRo +Ec0lyrKqxLjNH4RHjeVzVPPXM64Iz/eQHm7n7x6Ls6xrprdhkWcODsQ+DYOYHLD3pEn FbXQ44/Dhc+LlRC3pCYX8uyrtKptluJxILJ8z3PgV+JN5o9GlcYkncRRZJDCngx/LNm/ ZFpFbIPX/qWg+2ox6kiTIWBXxxZN5OVL7eU9z6JkAkOCPodone6s81EYmJi8klrut/XQ /mRUfA3macD9fOU5X4s8IDuVtMHiVi8NyDv6e0JctSHagsk5npw+N1h/OzWoL5u502Dt jn0Q== X-Gm-Message-State: ALQs6tAkmJK5poa0D3oU4iS55oRIYSNphK5KeBT8ELAZpL4Lh4LkDY1Q z/fjiaI7ueqyNyyej0emh8oPk3FfWBU= X-Received: by 10.28.160.26 with SMTP id j26mr1658839wme.74.1524844702169; Fri, 27 Apr 2018 08:58:22 -0700 (PDT) Received: from localhost ([2620:15c:2c4:201:f5a:7eca:440a:3ead]) by smtp.gmail.com with ESMTPSA id a129sm1322798wme.34.2018.04.27.08.58.20 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Fri, 27 Apr 2018 08:58:21 -0700 (PDT) From: Eric Dumazet To: "David S . Miller" Cc: netdev , Andy Lutomirski , linux-kernel , linux-mm , Ka-Cheong Poon , Eric Dumazet , Eric Dumazet Subject: [PATCH v4 net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive Date: Fri, 27 Apr 2018 08:58:08 -0700 Message-Id: <20180427155809.79094-2-edumazet@google.com> X-Mailer: git-send-email 2.17.0.441.gb46fe60e1d-goog In-Reply-To: <20180427155809.79094-1-edumazet@google.com> References: <20180427155809.79094-1-edumazet@google.com> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org When adding tcp mmap() implementation, I forgot that socket lock had to be taken before current->mm->mmap_sem. syzbot eventually caught the bug. Since we can not lock the socket in tcp mmap() handler we have to split the operation in two phases. 1) mmap() on a tcp socket simply reserves VMA space, and nothing else. This operation does not involve any TCP locking. 2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements the transfert of pages from skbs to one VMA. This operation only uses down_read(¤t->mm->mmap_sem) after holding TCP lock, thus solving the lockdep issue. This new implementation was suggested by Andy Lutomirski with great details. Benefits are : - Better scalability, in case multiple threads reuse VMAS (without mmap()/munmap() calls) since mmap_sem wont be write locked. - Better error recovery. The previous mmap() model had to provide the expected size of the mapping. If for some reason one part could not be mapped (partial MSS), the whole operation had to be aborted. With the tcp_zerocopy_receive struct, kernel can report how many bytes were successfuly mapped, and how many bytes should be read to skip the problematic sequence. - No more memory allocation to hold an array of page pointers. 16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/ - skbs are freed while mmap_sem has been released Following patch makes the change in tcp_mmap tool to demonstrate one possible use of mmap() and setsockopt(... TCP_ZEROCOPY_RECEIVE ...) Note that memcg might require additional changes. Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive") Signed-off-by: Eric Dumazet Reported-by: syzbot Suggested-by: Andy Lutomirski Cc: linux-mm@kvack.org Acked-by: Soheil Hassas Yeganeh --- include/uapi/linux/tcp.h | 8 ++ net/ipv4/af_inet.c | 2 + net/ipv4/tcp.c | 196 +++++++++++++++++++++------------------ net/ipv6/af_inet6.c | 2 + 4 files changed, 117 insertions(+), 91 deletions(-) diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h index 379b08700a542d49bbce9b4b49b17879d00b69bb..e9e8373b34b9ddc735329341b91f455bf5c0b17c 100644 --- a/include/uapi/linux/tcp.h +++ b/include/uapi/linux/tcp.h @@ -122,6 +122,7 @@ enum { #define TCP_MD5SIG_EXT 32 /* TCP MD5 Signature with extensions */ #define TCP_FASTOPEN_KEY 33 /* Set the key for Fast Open (cookie) */ #define TCP_FASTOPEN_NO_COOKIE 34 /* Enable TFO without a TFO cookie */ +#define TCP_ZEROCOPY_RECEIVE 35 struct tcp_repair_opt { __u32 opt_code; @@ -276,4 +277,11 @@ struct tcp_diag_md5sig { __u8 tcpm_key[TCP_MD5SIG_MAXKEYLEN]; }; +/* setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) */ + +struct tcp_zerocopy_receive { + __u64 address; /* in: address of mapping */ + __u32 length; /* in/out: number of bytes to map/mapped */ + __u32 recv_skip_hint; /* out: amount of bytes to skip */ +}; #endif /* _UAPI_LINUX_TCP_H */ diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 3ebf599cebaea4926decc1aad7274b12ec7e1566..b403499fdabea7367f65c588d957a30f5a6572b5 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -994,7 +994,9 @@ const struct proto_ops inet_stream_ops = { .getsockopt = sock_common_getsockopt, .sendmsg = inet_sendmsg, .recvmsg = inet_recvmsg, +#ifdef CONFIG_MMU .mmap = tcp_mmap, +#endif .sendpage = inet_sendpage, .splice_read = tcp_splice_read, .read_sock = tcp_read_sock, diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index dfd090ea54ad47112fc23c61180b5bf8edd2c736..4028ddd14dd5a0c5d66da3766e66197a03575bb9 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1726,118 +1726,113 @@ int tcp_set_rcvlowat(struct sock *sk, int val) } EXPORT_SYMBOL(tcp_set_rcvlowat); -/* When user wants to mmap X pages, we first need to perform the mapping - * before freeing any skbs in receive queue, otherwise user would be unable - * to fallback to standard recvmsg(). This happens if some data in the - * requested block is not exactly fitting in a page. - * - * We only support order-0 pages for the moment. - * mmap() on TCP is very strict, there is no point - * trying to accommodate with pathological layouts. - */ +#ifdef CONFIG_MMU +static const struct vm_operations_struct tcp_vm_ops = { +}; + int tcp_mmap(struct file *file, struct socket *sock, struct vm_area_struct *vma) { - unsigned long size = vma->vm_end - vma->vm_start; - unsigned int nr_pages = size >> PAGE_SHIFT; - struct page **pages_array = NULL; - u32 seq, len, offset, nr = 0; - struct sock *sk = sock->sk; - const skb_frag_t *frags; + if (vma->vm_flags & (VM_WRITE | VM_EXEC)) + return -EPERM; + vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC); + + /* Instruct vm_insert_page() to not down_read(mmap_sem) */ + vma->vm_flags |= VM_MIXEDMAP; + + vma->vm_ops = &tcp_vm_ops; + return 0; +} +EXPORT_SYMBOL(tcp_mmap); + +static int tcp_zerocopy_receive(struct sock *sk, + struct tcp_zerocopy_receive *zc) +{ + unsigned long address = (unsigned long)zc->address; + const skb_frag_t *frags = NULL; + u32 length = 0, seq, offset; + struct vm_area_struct *vma; + struct sk_buff *skb = NULL; struct tcp_sock *tp; - struct sk_buff *skb; int ret; - if (vma->vm_pgoff || !nr_pages) + if (address & (PAGE_SIZE - 1) || address != zc->address) return -EINVAL; - if (vma->vm_flags & VM_WRITE) - return -EPERM; - /* TODO: Maybe the following is not needed if pages are COW */ - vma->vm_flags &= ~VM_MAYWRITE; - - lock_sock(sk); - - ret = -ENOTCONN; if (sk->sk_state == TCP_LISTEN) - goto out; + return -ENOTCONN; sock_rps_record_flow(sk); - if (tcp_inq(sk) < size) { - ret = sock_flag(sk, SOCK_DONE) ? -EIO : -EAGAIN; + down_read(¤t->mm->mmap_sem); + + ret = -EINVAL; + vma = find_vma(current->mm, address); + if (!vma || vma->vm_start > address || vma->vm_ops != &tcp_vm_ops) goto out; - } + zc->length = min_t(unsigned long, zc->length, vma->vm_end - address); + tp = tcp_sk(sk); seq = tp->copied_seq; - /* Abort if urgent data is in the area */ - if (unlikely(tp->urg_data)) { - u32 urg_offset = tp->urg_seq - seq; + zc->length = min_t(u32, zc->length, tcp_inq(sk)); + zc->length &= ~(PAGE_SIZE - 1); - ret = -EINVAL; - if (urg_offset < size) - goto out; - } - ret = -ENOMEM; - pages_array = kvmalloc_array(nr_pages, sizeof(struct page *), - GFP_KERNEL); - if (!pages_array) - goto out; - skb = tcp_recv_skb(sk, seq, &offset); - ret = -EINVAL; -skb_start: - /* We do not support anything not in page frags */ - offset -= skb_headlen(skb); - if ((int)offset < 0) - goto out; - if (skb_has_frag_list(skb)) - goto out; - len = skb->data_len - offset; - frags = skb_shinfo(skb)->frags; - while (offset) { - if (frags->size > offset) - goto out; - offset -= frags->size; - frags++; - } - while (nr < nr_pages) { - if (len) { - if (len < PAGE_SIZE) - goto out; - if (frags->size != PAGE_SIZE || frags->page_offset) - goto out; - pages_array[nr++] = skb_frag_page(frags); - frags++; - len -= PAGE_SIZE; - seq += PAGE_SIZE; - continue; - } - skb = skb->next; - offset = seq - TCP_SKB_CB(skb)->seq; - goto skb_start; - } - /* OK, we have a full set of pages ready to be inserted into vma */ - for (nr = 0; nr < nr_pages; nr++) { - ret = vm_insert_page(vma, vma->vm_start + (nr << PAGE_SHIFT), - pages_array[nr]); - if (ret) - goto out; - } - /* operation is complete, we can 'consume' all skbs */ - tp->copied_seq = seq; - tcp_rcv_space_adjust(sk); - - /* Clean up data we have read: This will do ACK frames. */ - tcp_recv_skb(sk, seq, &offset); - tcp_cleanup_rbuf(sk, size); + zap_page_range(vma, address, zc->length); + zc->recv_skip_hint = 0; ret = 0; + while (length + PAGE_SIZE <= zc->length) { + if (zc->recv_skip_hint < PAGE_SIZE) { + if (skb) { + skb = skb->next; + offset = seq - TCP_SKB_CB(skb)->seq; + } else { + skb = tcp_recv_skb(sk, seq, &offset); + } + + zc->recv_skip_hint = skb->len - offset; + offset -= skb_headlen(skb); + if ((int)offset < 0 || skb_has_frag_list(skb)) + break; + frags = skb_shinfo(skb)->frags; + while (offset) { + if (frags->size > offset) + goto out; + offset -= frags->size; + frags++; + } + } + if (frags->size != PAGE_SIZE || frags->page_offset) + break; + ret = vm_insert_page(vma, address + length, + skb_frag_page(frags)); + if (ret) + break; + length += PAGE_SIZE; + seq += PAGE_SIZE; + zc->recv_skip_hint -= PAGE_SIZE; + frags++; + } out: - release_sock(sk); - kvfree(pages_array); + up_read(¤t->mm->mmap_sem); + if (length) { + tp->copied_seq = seq; + tcp_rcv_space_adjust(sk); + + /* Clean up data we have read: This will do ACK frames. */ + tcp_recv_skb(sk, seq, &offset); + tcp_cleanup_rbuf(sk, length); + ret = 0; + if (length == zc->length) + zc->recv_skip_hint = 0; + } else { + if (!zc->recv_skip_hint && sock_flag(sk, SOCK_DONE)) + ret = -EIO; + } + zc->length = length; return ret; } -EXPORT_SYMBOL(tcp_mmap); +#endif static void tcp_update_recv_tstamps(struct sk_buff *skb, struct scm_timestamping *tss) @@ -3472,6 +3467,25 @@ static int do_tcp_getsockopt(struct sock *sk, int level, } return 0; } +#ifdef CONFIG_MMU + case TCP_ZEROCOPY_RECEIVE: { + struct tcp_zerocopy_receive zc; + int err; + + if (get_user(len, optlen)) + return -EFAULT; + if (len != sizeof(zc)) + return -EINVAL; + if (copy_from_user(&zc, optval, len)) + return -EFAULT; + lock_sock(sk); + err = tcp_zerocopy_receive(sk, &zc); + release_sock(sk); + if (!err && copy_to_user(optval, &zc, len)) + err = -EFAULT; + return err; + } +#endif default: return -ENOPROTOOPT; } diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c index 36d622c477b1ed3c5d2b753938444526344a6109..d0af96e0d1096e83132dfc5599eb6292db39750a 100644 --- a/net/ipv6/af_inet6.c +++ b/net/ipv6/af_inet6.c @@ -578,7 +578,9 @@ const struct proto_ops inet6_stream_ops = { .getsockopt = sock_common_getsockopt, /* ok */ .sendmsg = inet_sendmsg, /* ok */ .recvmsg = inet_recvmsg, /* ok */ +#ifdef CONFIG_MMU .mmap = tcp_mmap, +#endif .sendpage = inet_sendpage, .sendmsg_locked = tcp_sendmsg_locked, .sendpage_locked = tcp_sendpage_locked, -- 2.17.0.441.gb46fe60e1d-goog