Here's the first tranche of patches towards providing a MSG_SPLICE_PAGES
internal sendmsg flag that is intended to replace the ->sendpage() op with
calls to sendmsg(). MSG_SPLICE_PAGES is a hint that tells the protocol
that it should splice the pages supplied if it can and copy them if not.
This will allow splice to pass multiple pages in a single call and allow
certain parts of higher protocols (e.g. sunrpc, iwarp) to pass an entire
message in one go rather than having to send them piecemeal. This should
also make it easier to handle the splicing of multipage folios.
A helper, skb_splice_from_iter() is provided to do the work of splicing or
copying data from an iterator. If a page is determined to be unspliceable
(such as being in the slab), then the helper will give an error.
Note that this facility is not made available to userspace and does not
provide any sort of callback.
This set consists of the following parts:
(1) Define the MSG_SPLICE_PAGES flag and prevent sys_sendmsg() from being
able to set it.
(2) Add an extra argument to skb_append_pagefrags() so that something
other than MAX_SKB_FRAGS can be used (sysctl_max_skb_frags for
example).
(3) Add the skb_splice_from_iter() helper to handle splicing pages into
skbuffs for MSG_SPLICE_PAGES that can be shared by TCP, IP/UDP and
AF_UNIX.
(4) Implement MSG_SPLICE_PAGES support in TCP.
(5) Make do_tcp_sendpages() just wrap sendmsg() and then fold it in to its
various callers.
(6) Implement MSG_SPLICE_PAGES support in IP and make udp_sendpage() just
a wrapper around sendmsg().
(7) Implement MSG_SPLICE_PAGES support in IP6/UDP6.
(8) Implement MSG_SPLICE_PAGES support in AF_UNIX.
(9) Make AF_UNIX copy unspliceable pages.
I've pushed the patches here also:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=sendpage-1
The follow-on patches are on branch iov-sendpage on the same tree.
David
Changes
=======
ver #7)
- Rebase after merge window.
- In ____sys_sendmsg(), clear internal flags before setting msg_flags.
- Clear internal flags in uring io_send{,_zc}().
- Export skb_splice_from_iter().
- Missed changing a "zc = 1" in tcp_sendmsg_locked().
- Remove now-unused csum_page() from UDP.
- Add a patch to make AF_UNIX sendpage() just a wrapper around sendmsg().
- Return an error if !sendpage_ok() rather than copying for now.
- Drop the page frag allocator patches for the moment.
ver #6)
- Removed a couple of leftover page pointer declarations.
- In TCP, set zc to 0/MSG_ZEROCOPY/MSG_SPLICE_PAGES rather than 0/1/2.
- Add a max-frags argument to skb_append_pagefrags().
- Extract the AF_UNIX helper out into a common helper and use it for
IP/UDP and TCP too.
- udp_sendpage() shouldn't lock the socket around udp_sendmsg().
- udp_sendpage() should only set MSG_MORE if MSG_SENDPAGE_NOTLAST is set.
- In siw, don't clear MSG_SPLICE_PAGES on the last page.
ver #5)
- Dropped the samples patch as it causes lots of failures in the patchwork
32-bit builds due to apparent libc userspace header issues.
- Made the pagefrag alloc patches alter the Google gve driver too.
- Rearranged the patches to put the support in IP before altering UDP.
ver #4)
- Added some sample socket-I/O programs into samples/net/.
- Fix a missing page-get in AF_KCM.
- Init the sgtable and mark the end in AF_ALG when calling
netfs_extract_iter_to_sg().
- Add a destructor func for page frag caches prior to generalising it and
making it per-cpu.
ver #3)
- Dropped the iterator-of-iterators patch.
- Only expunge MSG_SPLICE_PAGES in sys_send[m]msg, not sys_recv[m]msg.
- Split MSG_SPLICE_PAGES code in __ip_append_data() out into helper
functions.
- Implement MSG_SPLICE_PAGES support in __ip6_append_data() using the
above helper functions.
- Rename 'xlength' to 'initial_length'.
- Minimise the changes to sunrpc for the moment.
- Don't give -EOPNOTSUPP if NETIF_F_SG not available, just copy instead.
- Implemented MSG_SPLICE_PAGES support in the TLS, Chelsio-TLS and AF_KCM
code.
ver #2)
- Overhauled the page_frag_alloc() allocator: large folios and per-cpu.
- Got rid of my own zerocopy allocator.
- Use iov_iter_extract_pages() rather poking in iter->bvec.
- Made page splicing fall back to page copying on a page-by-page basis.
- Made splice_to_socket() pass 16 pipe buffers at a time.
- Made AF_ALG/hash use finup/digest where possible in sendmsg.
- Added an iterator-of-iterators, ITER_ITERLIST.
- Made sunrpc use the iterator-of-iterators.
- Converted more drivers.
Link: https://lore.kernel.org/r/[email protected]/ # v1
Link: https://lore.kernel.org/r/[email protected]/ # v2
Link: https://lore.kernel.org/r/[email protected]/ # v3
Link: https://lore.kernel.org/r/[email protected]/ # v4
David Howells (16):
net: Declare MSG_SPLICE_PAGES internal sendmsg() flag
net: Pass max frags into skb_append_pagefrags()
net: Add a function to splice pages into an skbuff for
MSG_SPLICE_PAGES
tcp: Support MSG_SPLICE_PAGES
tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES
tcp_bpf: Inline do_tcp_sendpages as it's now a wrapper around
tcp_sendmsg
espintcp: Inline do_tcp_sendpages()
tls: Inline do_tcp_sendpages()
siw: Inline do_tcp_sendpages()
tcp: Fold do_tcp_sendpages() into tcp_sendpage_locked()
ip, udp: Support MSG_SPLICE_PAGES
ip6, udp6: Support MSG_SPLICE_PAGES
udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
ip: Remove ip_append_page()
af_unix: Support MSG_SPLICE_PAGES
unix: Convert udp_sendpage() to use MSG_SPLICE_PAGES
drivers/infiniband/sw/siw/siw_qp_tx.c | 17 +-
include/linux/skbuff.h | 5 +-
include/linux/socket.h | 3 +
include/net/ip.h | 2 -
include/net/tcp.h | 2 -
include/net/tls.h | 2 +-
io_uring/net.c | 2 +
net/core/skbuff.c | 99 +++++++++++-
net/ipv4/ip_output.c | 164 +++-----------------
net/ipv4/tcp.c | 214 ++++++--------------------
net/ipv4/tcp_bpf.c | 20 ++-
net/ipv4/udp.c | 51 +-----
net/ipv6/ip6_output.c | 17 ++
net/socket.c | 2 +
net/tls/tls_main.c | 24 +--
net/unix/af_unix.c | 183 +++++-----------------
net/xfrm/espintcp.c | 10 +-
17 files changed, 285 insertions(+), 532 deletions(-)
Make AF_UNIX sendmsg() support MSG_SPLICE_PAGES, splicing in pages from the
source iterator if possible and copying the data in otherwise.
This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
Signed-off-by: David Howells <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Kuniyuki Iwashima <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
Notes:
ver #6)
- Use common helper.
net/unix/af_unix.c | 49 +++++++++++++++++++++++++++++++---------------
1 file changed, 33 insertions(+), 16 deletions(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index dd55506b4632..976bc1c5e11b 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2200,19 +2200,25 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
while (sent < len) {
size = len - sent;
- /* Keep two messages in the pipe so it schedules better */
- size = min_t(int, size, (sk->sk_sndbuf >> 1) - 64);
+ if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES)) {
+ skb = sock_alloc_send_pskb(sk, 0, 0,
+ msg->msg_flags & MSG_DONTWAIT,
+ &err, 0);
+ } else {
+ /* Keep two messages in the pipe so it schedules better */
+ size = min_t(int, size, (sk->sk_sndbuf >> 1) - 64);
- /* allow fallback to order-0 allocations */
- size = min_t(int, size, SKB_MAX_HEAD(0) + UNIX_SKB_FRAGS_SZ);
+ /* allow fallback to order-0 allocations */
+ size = min_t(int, size, SKB_MAX_HEAD(0) + UNIX_SKB_FRAGS_SZ);
- data_len = max_t(int, 0, size - SKB_MAX_HEAD(0));
+ data_len = max_t(int, 0, size - SKB_MAX_HEAD(0));
- data_len = min_t(size_t, size, PAGE_ALIGN(data_len));
+ data_len = min_t(size_t, size, PAGE_ALIGN(data_len));
- skb = sock_alloc_send_pskb(sk, size - data_len, data_len,
- msg->msg_flags & MSG_DONTWAIT, &err,
- get_order(UNIX_SKB_FRAGS_SZ));
+ skb = sock_alloc_send_pskb(sk, size - data_len, data_len,
+ msg->msg_flags & MSG_DONTWAIT, &err,
+ get_order(UNIX_SKB_FRAGS_SZ));
+ }
if (!skb)
goto out_err;
@@ -2224,13 +2230,24 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
}
fds_sent = true;
- skb_put(skb, size - data_len);
- skb->data_len = data_len;
- skb->len = size;
- err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, size);
- if (err) {
- kfree_skb(skb);
- goto out_err;
+ if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES)) {
+ err = skb_splice_from_iter(skb, &msg->msg_iter, size,
+ sk->sk_allocation);
+ if (err < 0) {
+ kfree_skb(skb);
+ goto out_err;
+ }
+ size = err;
+ refcount_add(size, &sk->sk_wmem_alloc);
+ } else {
+ skb_put(skb, size - data_len);
+ skb->data_len = data_len;
+ skb->len = size;
+ err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, size);
+ if (err) {
+ kfree_skb(skb);
+ goto out_err;
+ }
}
unix_state_lock(other);
Pass the maximum number of fragments into skb_append_pagefrags() rather
than using MAX_SKB_FRAGS so that it can be used from code that wants to
specify sysctl_max_skb_frags.
Signed-off-by: David Howells <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: David Ahern <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
include/linux/skbuff.h | 2 +-
net/core/skbuff.c | 4 ++--
net/ipv4/ip_output.c | 3 ++-
net/unix/af_unix.c | 2 +-
4 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 00e8c435fa1a..4c0ad48e38ca 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1383,7 +1383,7 @@ static inline int skb_pad(struct sk_buff *skb, int pad)
#define dev_kfree_skb(a) consume_skb(a)
int skb_append_pagefrags(struct sk_buff *skb, struct page *page,
- int offset, size_t size);
+ int offset, size_t size, size_t max_frags);
struct skb_seq_state {
__u32 lower_offset;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6724a84ebb09..7f53dcb26ad3 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4188,13 +4188,13 @@ unsigned int skb_find_text(struct sk_buff *skb, unsigned int from,
EXPORT_SYMBOL(skb_find_text);
int skb_append_pagefrags(struct sk_buff *skb, struct page *page,
- int offset, size_t size)
+ int offset, size_t size, size_t max_frags)
{
int i = skb_shinfo(skb)->nr_frags;
if (skb_can_coalesce(skb, i, page, offset)) {
skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], size);
- } else if (i < MAX_SKB_FRAGS) {
+ } else if (i < max_frags) {
skb_zcopy_downgrade_managed(skb);
get_page(page);
skb_fill_page_desc_noacc(skb, i, page, offset, size);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 61892268e8a6..52fc840898d8 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1450,7 +1450,8 @@ ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
if (len > size)
len = size;
- if (skb_append_pagefrags(skb, page, offset, len)) {
+ if (skb_append_pagefrags(skb, page, offset, len,
+ MAX_SKB_FRAGS)) {
err = -EMSGSIZE;
goto error;
}
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index cc695c9f09ec..dd55506b4632 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2349,7 +2349,7 @@ static ssize_t unix_stream_sendpage(struct socket *socket, struct page *page,
newskb = NULL;
}
- if (skb_append_pagefrags(skb, page, offset, size)) {
+ if (skb_append_pagefrags(skb, page, offset, size, MAX_SKB_FRAGS)) {
tail = skb;
goto alloc_skb;
}
Convert do_tcp_sendpages() to use sendmsg() with MSG_SPLICE_PAGES rather
than directly splicing in the pages itself. do_tcp_sendpages() can then be
inlined in subsequent patches into its callers.
This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
Signed-off-by: David Howells <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: David Ahern <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/ipv4/tcp.c | 158 +++----------------------------------------------
1 file changed, 7 insertions(+), 151 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 026f483f42e3..28231e503af9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -974,163 +974,19 @@ static int tcp_wmem_schedule(struct sock *sk, int copy)
return min(copy, sk->sk_forward_alloc);
}
-static struct sk_buff *tcp_build_frag(struct sock *sk, int size_goal, int flags,
- struct page *page, int offset, size_t *size)
-{
- struct sk_buff *skb = tcp_write_queue_tail(sk);
- struct tcp_sock *tp = tcp_sk(sk);
- bool can_coalesce;
- int copy, i;
-
- if (!skb || (copy = size_goal - skb->len) <= 0 ||
- !tcp_skb_can_collapse_to(skb)) {
-new_segment:
- if (!sk_stream_memory_free(sk))
- return NULL;
-
- skb = tcp_stream_alloc_skb(sk, 0, sk->sk_allocation,
- tcp_rtx_and_write_queues_empty(sk));
- if (!skb)
- return NULL;
-
-#ifdef CONFIG_TLS_DEVICE
- skb->decrypted = !!(flags & MSG_SENDPAGE_DECRYPTED);
-#endif
- tcp_skb_entail(sk, skb);
- copy = size_goal;
- }
-
- if (copy > *size)
- copy = *size;
-
- i = skb_shinfo(skb)->nr_frags;
- can_coalesce = skb_can_coalesce(skb, i, page, offset);
- if (!can_coalesce && i >= READ_ONCE(sysctl_max_skb_frags)) {
- tcp_mark_push(tp, skb);
- goto new_segment;
- }
- if (tcp_downgrade_zcopy_pure(sk, skb))
- return NULL;
-
- copy = tcp_wmem_schedule(sk, copy);
- if (!copy)
- return NULL;
-
- if (can_coalesce) {
- skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
- } else {
- get_page(page);
- skb_fill_page_desc_noacc(skb, i, page, offset, copy);
- }
-
- if (!(flags & MSG_NO_SHARED_FRAGS))
- skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
-
- skb->len += copy;
- skb->data_len += copy;
- skb->truesize += copy;
- sk_wmem_queued_add(sk, copy);
- sk_mem_charge(sk, copy);
- WRITE_ONCE(tp->write_seq, tp->write_seq + copy);
- TCP_SKB_CB(skb)->end_seq += copy;
- tcp_skb_pcount_set(skb, 0);
-
- *size = copy;
- return skb;
-}
-
ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
size_t size, int flags)
{
- struct tcp_sock *tp = tcp_sk(sk);
- int mss_now, size_goal;
- int err;
- ssize_t copied;
- long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
-
- if (IS_ENABLED(CONFIG_DEBUG_VM) &&
- WARN_ONCE(!sendpage_ok(page),
- "page must not be a Slab one and have page_count > 0"))
- return -EINVAL;
-
- /* Wait for a connection to finish. One exception is TCP Fast Open
- * (passive side) where data is allowed to be sent before a connection
- * is fully established.
- */
- if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
- !tcp_passive_fastopen(sk)) {
- err = sk_stream_wait_connect(sk, &timeo);
- if (err != 0)
- goto out_err;
- }
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };
- sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
+ bvec_set_page(&bvec, page, size, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
- mss_now = tcp_send_mss(sk, &size_goal, flags);
- copied = 0;
+ if (flags & MSG_SENDPAGE_NOTLAST)
+ msg.msg_flags |= MSG_MORE;
- err = -EPIPE;
- if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
- goto out_err;
-
- while (size > 0) {
- struct sk_buff *skb;
- size_t copy = size;
-
- skb = tcp_build_frag(sk, size_goal, flags, page, offset, ©);
- if (!skb)
- goto wait_for_space;
-
- if (!copied)
- TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
-
- copied += copy;
- offset += copy;
- size -= copy;
- if (!size)
- goto out;
-
- if (skb->len < size_goal || (flags & MSG_OOB))
- continue;
-
- if (forced_push(tp)) {
- tcp_mark_push(tp, skb);
- __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
- } else if (skb == tcp_send_head(sk))
- tcp_push_one(sk, mss_now);
- continue;
-
-wait_for_space:
- set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
- tcp_push(sk, flags & ~MSG_MORE, mss_now,
- TCP_NAGLE_PUSH, size_goal);
-
- err = sk_stream_wait_memory(sk, &timeo);
- if (err != 0)
- goto do_error;
-
- mss_now = tcp_send_mss(sk, &size_goal, flags);
- }
-
-out:
- if (copied) {
- tcp_tx_timestamp(sk, sk->sk_tsflags);
- if (!(flags & MSG_SENDPAGE_NOTLAST))
- tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
- }
- return copied;
-
-do_error:
- tcp_remove_empty_skb(sk);
- if (copied)
- goto out;
-out_err:
- /* make sure we wake any epoll edge trigger waiter */
- if (unlikely(tcp_rtx_and_write_queues_empty(sk) && err == -EAGAIN)) {
- sk->sk_write_space(sk);
- tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED);
- }
- return sk_stream_error(sk, flags, err);
+ return tcp_sendmsg_locked(sk, &msg, size);
}
EXPORT_SYMBOL_GPL(do_tcp_sendpages);
ip_append_page() is no longer used with the removal of udp_sendpage(), so
remove it.
Signed-off-by: David Howells <[email protected]>
cc: Willem de Bruijn <[email protected]>
cc: David Ahern <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
Notes:
ver #7)
- Remove now-unused csum_page().
include/net/ip.h | 2 -
net/ipv4/ip_output.c | 148 ++-----------------------------------------
2 files changed, 4 insertions(+), 146 deletions(-)
diff --git a/include/net/ip.h b/include/net/ip.h
index c3fffaa92d6e..7627a4df893b 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -220,8 +220,6 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,
unsigned int flags);
int ip_generic_getfrag(void *from, char *to, int offset, int len, int odd,
struct sk_buff *skb);
-ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
- int offset, size_t size, int flags);
struct sk_buff *__ip_make_skb(struct sock *sk, struct flowi4 *fl4,
struct sk_buff_head *queue,
struct inet_cork *cork);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c7db973b5d29..553c740a6bfb 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -946,17 +946,6 @@ ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk
}
EXPORT_SYMBOL(ip_generic_getfrag);
-static inline __wsum
-csum_page(struct page *page, int offset, int copy)
-{
- char *kaddr;
- __wsum csum;
- kaddr = kmap(page);
- csum = csum_partial(kaddr + offset, copy, 0);
- kunmap(page);
- return csum;
-}
-
static int __ip_append_data(struct sock *sk,
struct flowi4 *fl4,
struct sk_buff_head *queue,
@@ -1327,10 +1316,10 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
}
/*
- * ip_append_data() and ip_append_page() can make one large IP datagram
- * from many pieces of data. Each pieces will be holded on the socket
- * until ip_push_pending_frames() is called. Each piece can be a page
- * or non-page data.
+ * ip_append_data() can make one large IP datagram from many pieces of
+ * data. Each piece will be held on the socket until
+ * ip_push_pending_frames() is called. Each piece can be a page or
+ * non-page data.
*
* Not only UDP, other transport protocols - e.g. raw sockets - can use
* this interface potentially.
@@ -1363,135 +1352,6 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,
from, length, transhdrlen, flags);
}
-ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
- int offset, size_t size, int flags)
-{
- struct inet_sock *inet = inet_sk(sk);
- struct sk_buff *skb;
- struct rtable *rt;
- struct ip_options *opt = NULL;
- struct inet_cork *cork;
- int hh_len;
- int mtu;
- int len;
- int err;
- unsigned int maxfraglen, fragheaderlen, fraggap, maxnonfragsize;
-
- if (inet->hdrincl)
- return -EPERM;
-
- if (flags&MSG_PROBE)
- return 0;
-
- if (skb_queue_empty(&sk->sk_write_queue))
- return -EINVAL;
-
- cork = &inet->cork.base;
- rt = (struct rtable *)cork->dst;
- if (cork->flags & IPCORK_OPT)
- opt = cork->opt;
-
- if (!(rt->dst.dev->features & NETIF_F_SG))
- return -EOPNOTSUPP;
-
- hh_len = LL_RESERVED_SPACE(rt->dst.dev);
- mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize;
-
- fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
- maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
- maxnonfragsize = ip_sk_ignore_df(sk) ? 0xFFFF : mtu;
-
- if (cork->length + size > maxnonfragsize - fragheaderlen) {
- ip_local_error(sk, EMSGSIZE, fl4->daddr, inet->inet_dport,
- mtu - (opt ? opt->optlen : 0));
- return -EMSGSIZE;
- }
-
- skb = skb_peek_tail(&sk->sk_write_queue);
- if (!skb)
- return -EINVAL;
-
- cork->length += size;
-
- while (size > 0) {
- /* Check if the remaining data fits into current packet. */
- len = mtu - skb->len;
- if (len < size)
- len = maxfraglen - skb->len;
-
- if (len <= 0) {
- struct sk_buff *skb_prev;
- int alloclen;
-
- skb_prev = skb;
- fraggap = skb_prev->len - maxfraglen;
-
- alloclen = fragheaderlen + hh_len + fraggap + 15;
- skb = sock_wmalloc(sk, alloclen, 1, sk->sk_allocation);
- if (unlikely(!skb)) {
- err = -ENOBUFS;
- goto error;
- }
-
- /*
- * Fill in the control structures
- */
- skb->ip_summed = CHECKSUM_NONE;
- skb->csum = 0;
- skb_reserve(skb, hh_len);
-
- /*
- * Find where to start putting bytes.
- */
- skb_put(skb, fragheaderlen + fraggap);
- skb_reset_network_header(skb);
- skb->transport_header = (skb->network_header +
- fragheaderlen);
- if (fraggap) {
- skb->csum = skb_copy_and_csum_bits(skb_prev,
- maxfraglen,
- skb_transport_header(skb),
- fraggap);
- skb_prev->csum = csum_sub(skb_prev->csum,
- skb->csum);
- pskb_trim_unique(skb_prev, maxfraglen);
- }
-
- /*
- * Put the packet on the pending queue.
- */
- __skb_queue_tail(&sk->sk_write_queue, skb);
- continue;
- }
-
- if (len > size)
- len = size;
-
- if (skb_append_pagefrags(skb, page, offset, len,
- MAX_SKB_FRAGS)) {
- err = -EMSGSIZE;
- goto error;
- }
-
- if (skb->ip_summed == CHECKSUM_NONE) {
- __wsum csum;
- csum = csum_page(page, offset, len);
- skb->csum = csum_block_add(skb->csum, csum, skb->len);
- }
-
- skb_len_add(skb, len);
- refcount_add(len, &sk->sk_wmem_alloc);
- offset += len;
- size -= len;
- }
- return 0;
-
-error:
- cork->length -= size;
- IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
- return err;
-}
-
static void ip_cork_release(struct inet_cork *cork)
{
cork->flags &= ~IPCORK_OPT;