From: Soheil Hassas Yeganeh
Date: Thu, 26 Apr 2018 10:58:27 -0400
Subject: Re: [PATCH v3 net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
To: Eric Dumazet
Cc: "David S. Miller", netdev, Andy Lutomirski, linux-kernel, linux-mm, Ka-Cheong Poon, Eric Dumazet
In-Reply-To: <20180426145056.220325-2-edumazet@google.com>
References: <20180426145056.220325-1-edumazet@google.com> <20180426145056.220325-2-edumazet@google.com>

On Thu, Apr 26, 2018 at 10:50 AM, Eric Dumazet wrote:
> When adding the tcp mmap() implementation, I forgot that the socket lock
> had to be taken before current->mm->mmap_sem. syzbot eventually caught
> the bug.
>
> Since we cannot lock the socket in the tcp mmap() handler, we have to
> split the operation into two phases.
>
> 1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
>    This operation does not involve any TCP locking.
>
> 2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
>    the transfer of pages from skbs to one VMA.
>    This operation only uses down_read(&current->mm->mmap_sem) after
>    holding the TCP lock, thus solving the lockdep issue.
>
> This new implementation was suggested by Andy Lutomirski in great detail.
>
> Benefits are:
>
> - Better scalability, in case multiple threads reuse VMAs
>   (without mmap()/munmap() calls) since mmap_sem won't be write locked.
>
> - Better error recovery.
>   The previous mmap() model had to provide the expected size of the
>   mapping. If for some reason one part could not be mapped (partial MSS),
>   the whole operation had to be aborted.
>   With the tcp_zerocopy_receive struct, the kernel can report how
>   many bytes were successfully mapped, and how many bytes should
>   be read to skip the problematic sequence.
>
> - No more memory allocation to hold an array of page pointers.
>   16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/
>
> - skbs are freed while mmap_sem has been released.
>
> The following patch changes the tcp_mmap tool to demonstrate
> one possible use of mmap() and getsockopt(... TCP_ZEROCOPY_RECEIVE ...).
>
> Note that memcg might require additional changes.
>
> Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive")
> Signed-off-by: Eric Dumazet
> Reported-by: syzbot
> Suggested-by: Andy Lutomirski
> Cc: linux-mm@kvack.org
> Cc: Soheil Hassas Yeganeh

Acked-by: Soheil Hassas Yeganeh

getsockopt() is indeed a better choice. Thanks!

> ---
>  include/uapi/linux/tcp.h |   8 ++
>  net/ipv4/tcp.c           | 192 ++++++++++++++++++++-------------------
>  2 files changed, 109 insertions(+), 91 deletions(-)
>
> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index 379b08700a542d49bbce9b4b49b17879d00b69bb..e9e8373b34b9ddc735329341b91f455bf5c0b17c 100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -122,6 +122,7 @@ enum {
>  #define TCP_MD5SIG_EXT          32  /* TCP MD5 Signature with extensions */
>  #define TCP_FASTOPEN_KEY        33  /* Set the key for Fast Open (cookie) */
>  #define TCP_FASTOPEN_NO_COOKIE  34  /* Enable TFO without a TFO cookie */
> +#define TCP_ZEROCOPY_RECEIVE    35
>
>  struct tcp_repair_opt {
>          __u32   opt_code;
> @@ -276,4 +277,11 @@ struct tcp_diag_md5sig {
>          __u8    tcpm_key[TCP_MD5SIG_MAXKEYLEN];
>  };
>
> +/* setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) */
> +
> +struct tcp_zerocopy_receive {
> +        __u64 address;          /* in: address of mapping */
> +        __u32 length;           /* in/out: number of bytes to map/mapped */
> +        __u32 recv_skip_hint;   /* out: amount of bytes to skip */
> +};
>  #endif /* _UAPI_LINUX_TCP_H */
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index dfd090ea54ad47112fc23c61180b5bf8edd2c736..c10c4a41ad39d6f8ae472882b243c2b70c915546 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1726,118 +1726,111 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
>  }
>  EXPORT_SYMBOL(tcp_set_rcvlowat);
>
> -/* When user wants to mmap X pages, we first need to perform the mapping
> - * before freeing any skbs in receive queue, otherwise user would be unable
> - * to fallback to standard recvmsg(). This happens if some data in the
> - * requested block is not exactly fitting in a page.
> - *
> - * We only support order-0 pages for the moment.
> - * mmap() on TCP is very strict, there is no point
> - * trying to accommodate with pathological layouts.
> - */
> +static const struct vm_operations_struct tcp_vm_ops = {
> +};
> +
>  int tcp_mmap(struct file *file, struct socket *sock,
>               struct vm_area_struct *vma)
>  {
> -        unsigned long size = vma->vm_end - vma->vm_start;
> -        unsigned int nr_pages = size >> PAGE_SHIFT;
> -        struct page **pages_array = NULL;
> -        u32 seq, len, offset, nr = 0;
> -        struct sock *sk = sock->sk;
> -        const skb_frag_t *frags;
> +        if (vma->vm_flags & (VM_WRITE | VM_EXEC))
> +                return -EPERM;
> +        vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
> +
> +        /* Instruct vm_insert_page() to not down_read(mmap_sem) */
> +        vma->vm_flags |= VM_MIXEDMAP;
> +
> +        vma->vm_ops = &tcp_vm_ops;
> +        return 0;
> +}
> +EXPORT_SYMBOL(tcp_mmap);
> +
> +static int tcp_zerocopy_receive(struct sock *sk,
> +                                struct tcp_zerocopy_receive *zc)
> +{
> +        unsigned long address = (unsigned long)zc->address;
> +        const skb_frag_t *frags = NULL;
> +        u32 length = 0, seq, offset;
> +        struct vm_area_struct *vma;
> +        struct sk_buff *skb = NULL;
>          struct tcp_sock *tp;
> -        struct sk_buff *skb;
>          int ret;
>
> -        if (vma->vm_pgoff || !nr_pages)
> +        if (address & (PAGE_SIZE - 1) || address != zc->address)
>                  return -EINVAL;
>
> -        if (vma->vm_flags & VM_WRITE)
> -                return -EPERM;
> -        /* TODO: Maybe the following is not needed if pages are COW */
> -        vma->vm_flags &= ~VM_MAYWRITE;
> -
> -        lock_sock(sk);
> -
> -        ret = -ENOTCONN;
>          if (sk->sk_state == TCP_LISTEN)
> -                goto out;
> +                return -ENOTCONN;
>
>          sock_rps_record_flow(sk);
>
> -        if (tcp_inq(sk) < size) {
> -                ret = sock_flag(sk, SOCK_DONE) ? -EIO : -EAGAIN;
> +        down_read(&current->mm->mmap_sem);
> +
> +        ret = -EINVAL;
> +        vma = find_vma(current->mm, address);
> +        if (!vma || vma->vm_start > address || vma->vm_ops != &tcp_vm_ops)
>                  goto out;
> -        }
> +        zc->length = min_t(unsigned long, zc->length, vma->vm_end - address);
> +
>          tp = tcp_sk(sk);
>          seq = tp->copied_seq;
> -        /* Abort if urgent data is in the area */
> -        if (unlikely(tp->urg_data)) {
> -                u32 urg_offset = tp->urg_seq - seq;
> +        zc->length = min_t(u32, zc->length, tcp_inq(sk));
> +        zc->length &= ~(PAGE_SIZE - 1);
>
> -                ret = -EINVAL;
> -                if (urg_offset < size)
> -                        goto out;
> -        }
> -        ret = -ENOMEM;
> -        pages_array = kvmalloc_array(nr_pages, sizeof(struct page *),
> -                                     GFP_KERNEL);
> -        if (!pages_array)
> -                goto out;
> -        skb = tcp_recv_skb(sk, seq, &offset);
> -        ret = -EINVAL;
> -skb_start:
> -        /* We do not support anything not in page frags */
> -        offset -= skb_headlen(skb);
> -        if ((int)offset < 0)
> -                goto out;
> -        if (skb_has_frag_list(skb))
> -                goto out;
> -        len = skb->data_len - offset;
> -        frags = skb_shinfo(skb)->frags;
> -        while (offset) {
> -                if (frags->size > offset)
> -                        goto out;
> -                offset -= frags->size;
> -                frags++;
> -        }
> -        while (nr < nr_pages) {
> -                if (len) {
> -                        if (len < PAGE_SIZE)
> -                                goto out;
> -                        if (frags->size != PAGE_SIZE || frags->page_offset)
> -                                goto out;
> -                        pages_array[nr++] = skb_frag_page(frags);
> -                        frags++;
> -                        len -= PAGE_SIZE;
> -                        seq += PAGE_SIZE;
> -                        continue;
> -                }
> -                skb = skb->next;
> -                offset = seq - TCP_SKB_CB(skb)->seq;
> -                goto skb_start;
> -        }
> -        /* OK, we have a full set of pages ready to be inserted into vma */
> -        for (nr = 0; nr < nr_pages; nr++) {
> -                ret = vm_insert_page(vma, vma->vm_start + (nr << PAGE_SHIFT),
> -                                     pages_array[nr]);
> -                if (ret)
> -                        goto out;
> -        }
> -        /* operation is complete, we can 'consume' all skbs */
> -        tp->copied_seq = seq;
> -        tcp_rcv_space_adjust(sk);
> -
> -        /* Clean up data we have read: This will do ACK frames. */
> -        tcp_recv_skb(sk, seq, &offset);
> -        tcp_cleanup_rbuf(sk, size);
> +        zap_page_range(vma, address, zc->length);
>
> +        zc->recv_skip_hint = 0;
>          ret = 0;
> +        while (length + PAGE_SIZE <= zc->length) {
> +                if (zc->recv_skip_hint < PAGE_SIZE) {
> +                        if (skb) {
> +                                skb = skb->next;
> +                                offset = seq - TCP_SKB_CB(skb)->seq;
> +                        } else {
> +                                skb = tcp_recv_skb(sk, seq, &offset);
> +                        }
> +
> +                        zc->recv_skip_hint = skb->len - offset;
> +                        offset -= skb_headlen(skb);
> +                        if ((int)offset < 0 || skb_has_frag_list(skb))
> +                                break;
> +                        frags = skb_shinfo(skb)->frags;
> +                        while (offset) {
> +                                if (frags->size > offset)
> +                                        goto out;
> +                                offset -= frags->size;
> +                                frags++;
> +                        }
> +                }
> +                if (frags->size != PAGE_SIZE || frags->page_offset)
> +                        break;
> +                ret = vm_insert_page(vma, address + length,
> +                                     skb_frag_page(frags));
> +                if (ret)
> +                        break;
> +                length += PAGE_SIZE;
> +                seq += PAGE_SIZE;
> +                zc->recv_skip_hint -= PAGE_SIZE;
> +                frags++;
> +        }
>  out:
> -        release_sock(sk);
> -        kvfree(pages_array);
> +        up_read(&current->mm->mmap_sem);
> +        if (length) {
> +                tp->copied_seq = seq;
> +                tcp_rcv_space_adjust(sk);
> +
> +                /* Clean up data we have read: This will do ACK frames. */
> +                tcp_recv_skb(sk, seq, &offset);
> +                tcp_cleanup_rbuf(sk, length);
> +                ret = 0;
> +                if (length == zc->length)
> +                        zc->recv_skip_hint = 0;
> +        } else {
> +                if (!zc->recv_skip_hint && sock_flag(sk, SOCK_DONE))
> +                        ret = -EIO;
> +        }
> +        zc->length = length;
>          return ret;
>  }
> -EXPORT_SYMBOL(tcp_mmap);
>
>  static void tcp_update_recv_tstamps(struct sk_buff *skb,
>                                      struct scm_timestamping *tss)
> @@ -3472,6 +3465,23 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
>                  }
>                  return 0;
>          }
> +        case TCP_ZEROCOPY_RECEIVE: {
> +                struct tcp_zerocopy_receive zc;
> +                int err;
> +
> +                if (get_user(len, optlen))
> +                        return -EFAULT;
> +                if (len != sizeof(zc))
> +                        return -EINVAL;
> +                if (copy_from_user(&zc, optval, len))
> +                        return -EFAULT;
> +                lock_sock(sk);
> +                err = tcp_zerocopy_receive(sk, &zc);
> +                release_sock(sk);
> +                if (!err && copy_to_user(optval, &zc, len))
> +                        err = -EFAULT;
> +                return err;
> +        }
>          default:
>                  return -ENOPROTOOPT;
>          }
> --
> 2.17.0.484.g0c8726318c-goog
>
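
For readers who want to see the new API from user space: the two-phase flow in
the commit message is, roughly, mmap() to reserve a read-only VMA on the TCP
socket, then getsockopt(TCP_ZEROCOPY_RECEIVE) to map received pages into it,
with the number of bytes actually mapped returned in length and the bytes that
must be read conventionally reported in recv_skip_hint. The C sketch below is
illustrative only and is not the tcp_mmap selftest that patch 2/2 updates; fd,
CHUNK, zerocopy_read() and the minimal error handling are assumptions, and the
option and struct definitions simply mirror the uapi additions quoted above.

#include <linux/types.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Mirrors the include/uapi/linux/tcp.h additions quoted above; a real
 * program would get these from <linux/tcp.h> on a new enough kernel.
 */
#define TCP_ZEROCOPY_RECEIVE 35

struct tcp_zerocopy_receive {
        __u64 address;        /* in: address of mapping */
        __u32 length;         /* in/out: number of bytes to map/mapped */
        __u32 recv_skip_hint; /* out: amount of bytes to skip */
};

#define CHUNK (512 * 1024)    /* illustrative, page-aligned mapping size */

/* Receive up to CHUNK bytes: map what we can, recv() the rest. */
static ssize_t zerocopy_read(int fd, char *copybuf, size_t copybuf_len)
{
        struct tcp_zerocopy_receive zc;
        socklen_t zc_len = sizeof(zc);
        ssize_t total = -1;
        void *addr;

        /* Phase 1: mmap() only reserves read-only VMA space, no TCP lock.
         * A long-lived receiver would keep this mapping and reuse it.
         */
        addr = mmap(NULL, CHUNK, PROT_READ, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
                return -1;

        /* Phase 2: ask the kernel to insert received pages into the VMA. */
        memset(&zc, 0, sizeof(zc));
        zc.address = (__u64)(unsigned long)addr;
        zc.length = CHUNK;

        if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE,
                       &zc, &zc_len) == 0) {
                total = zc.length;    /* bytes now readable at addr */

                /* Bytes that could not be mapped (headers, partial pages)
                 * are reported in recv_skip_hint and must be read the
                 * conventional way before the next zerocopy round.
                 */
                if (zc.recv_skip_hint) {
                        size_t want = zc.recv_skip_hint < copybuf_len ?
                                      zc.recv_skip_hint : copybuf_len;
                        ssize_t rcvd = recv(fd, copybuf, want, 0);

                        if (rcvd > 0)
                                total += rcvd;
                }
                /* ... consume the zerocopy data at addr here ... */
        }
        munmap(addr, CHUNK);
        return total;
}

In a long-lived receiver the mmap()/munmap() pair would be done once and the
getsockopt() call repeated against the same VMA; that is where the scalability
benefit described in the commit message comes from, since mmap_sem is only
taken for read.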