2018-04-20 16:02:24

by Eric Dumazet

Subject: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue

This patch series provides a new mmap_hook for filesystems willing to grab
a mutex before mm->mmap_sem is taken, to ensure lockdep sanity.

This hook allows us to shorten tcp_mmap() execution time (while mmap_sem
is held), and improve multi-threading scalability.
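
For context, a rough userspace sketch of the zero-copy receive loop this series
optimizes, assuming the mmap() interface added in commit 93ab6cc69162
("tcp: implement mmap() for zero copy receive"); the chunk size, socket setup
and error handling below are illustrative assumptions, not part of this series:

#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Illustrative only: receive one chunk through the TCP mmap() zero copy
 * path. 'fd' is a connected TCP socket; 'chunk' is assumed to match the
 * alignment/availability constraints of tcp_mmap().
 */
static ssize_t zerocopy_recv_chunk(int fd, size_t chunk,
				   void (*consume)(const void *, size_t))
{
	void *addr;

	/* tcp_mmap() runs here, with mm->mmap_sem held for writing:
	 * the shorter it is, the less other threads of the process
	 * are blocked on mmap_sem.
	 */
	addr = mmap(NULL, chunk, PROT_READ, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		return -1;

	consume(addr, chunk);	/* read the received payload in place */

	munmap(addr, chunk);
	return chunk;
}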

Eric Dumazet (4):
mm: provide a mmap_hook infrastructure
net: implement sock_mmap_hook()
tcp: provide tcp_mmap_hook()
tcp: mmap: move the skb cleanup to tcp_mmap_hook()

include/linux/fs.h | 6 ++++++
include/linux/net.h | 1 +
include/net/tcp.h | 1 +
mm/util.c | 19 ++++++++++++++++++-
net/ipv4/af_inet.c | 1 +
net/ipv4/tcp.c | 39 ++++++++++++++++++++++++++++++---------
net/ipv6/af_inet6.c | 1 +
net/socket.c | 9 +++++++++
8 files changed, 67 insertions(+), 10 deletions(-)

--
2.17.0.484.g0c8726318c-goog



2018-04-20 15:57:52

by Eric Dumazet

Subject: [PATCH net-next 1/4] mm: provide a mmap_hook infrastructure

When adding the tcp mmap() implementation, I forgot that the socket lock
had to be taken before current->mm->mmap_sem. syzbot eventually caught
the bug.

This patch provides a new mmap_hook() method in struct file_operations
that can be implemented by filesystems needing finer control over what
is done before and after do_mmap_pgoff() and/or the mm->mmap_sem
acquire/release.

This is used in the following patches by the networking and TCP stacks
to solve the lockdep issue, and also allows some preparation
and cleanup work to be done before/after mmap_sem is held,
allowing better scalability in multi-threaded programs.
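
As an illustration of the calling convention (hypothetical, not part of this
series), a driver that must take its own mutex before mm->mmap_sem could pair
that mutex with the hook like this; foo_lock, foo_mmap and foo_fops are
made-up names:

#include <linux/fs.h>
#include <linux/mutex.h>

/* Hypothetical user of the new hook: take a driver mutex in
 * MMAP_HOOK_PREPARE (before vm_mmap_pgoff() acquires mmap_sem) and drop
 * it again on MMAP_HOOK_COMMIT or MMAP_HOOK_ROLLBACK.
 */
static DEFINE_MUTEX(foo_lock);

static int foo_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* foo_lock is already held here, thanks to MMAP_HOOK_PREPARE */
	return 0;
}

static int foo_mmap_hook(struct file *file, enum mmap_hook mode)
{
	if (mode == MMAP_HOOK_PREPARE)
		return mutex_lock_interruptible(&foo_lock);

	/* MMAP_HOOK_COMMIT or MMAP_HOOK_ROLLBACK */
	mutex_unlock(&foo_lock);
	return 0;
}

static const struct file_operations foo_fops = {
	.mmap		= foo_mmap,
	.mmap_hook	= foo_mmap_hook,
};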

Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: syzbot <[email protected]>
---
include/linux/fs.h | 6 ++++++
mm/util.c | 19 ++++++++++++++++++-
2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 92efaf1f89775f7b017477617dd983c10e0dc4d2..ef3526f84686585678861fc585efea974a69ca55 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1698,6 +1698,11 @@ struct block_device_operations;
#define NOMMU_VMFLAGS \
(NOMMU_MAP_READ | NOMMU_MAP_WRITE | NOMMU_MAP_EXEC)

+enum mmap_hook {
+ MMAP_HOOK_PREPARE,
+ MMAP_HOOK_ROLLBACK,
+ MMAP_HOOK_COMMIT,
+};

struct iov_iter;

@@ -1714,6 +1719,7 @@ struct file_operations {
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
+ int (*mmap_hook) (struct file *, enum mmap_hook);
unsigned long mmap_supported_flags;
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
diff --git a/mm/util.c b/mm/util.c
index 1fc4fa7576f762bbbf341f056ca6d0be803a423f..3ddb18ab367f069d5884083e992e999546ccd995 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -350,11 +350,28 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,

ret = security_mmap_file(file, prot, flag);
if (!ret) {
- if (down_write_killable(&mm->mmap_sem))
+ int (*mmap_hook)(struct file *, enum mmap_hook) = NULL;
+
+ if (file) {
+ mmap_hook = file->f_op->mmap_hook;
+
+ if (mmap_hook) {
+ ret = mmap_hook(file, MMAP_HOOK_PREPARE);
+ if (ret)
+ return ret;
+ }
+ }
+ if (down_write_killable(&mm->mmap_sem)) {
+ if (mmap_hook)
+ mmap_hook(file, MMAP_HOOK_ROLLBACK);
return -EINTR;
+ }
ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
&populate, &uf);
up_write(&mm->mmap_sem);
+ if (mmap_hook)
+ mmap_hook(file, IS_ERR(ret) ? MMAP_HOOK_ROLLBACK :
+ MMAP_HOOK_COMMIT);
userfaultfd_unmap_complete(mm, &uf);
if (populate)
mm_populate(ret, populate);
--
2.17.0.484.g0c8726318c-goog


2018-04-20 15:57:53

by Eric Dumazet

Subject: [PATCH net-next 4/4] tcp: mmap: move the skb cleanup to tcp_mmap_hook()

Freeing all skbs and sending ACK frames is time consuming.

This is currently done in tcp_mmap(), while both current->mm->mmap_sem
and the socket lock are held.

Thanks to the mmap_hook infrastructure, we can perform the cleanup
after current->mm->mmap_sem has been released, thus allowing
other threads to perform mm operations without delay.

Note that the preparation work (building the array of page
pointers) could also be done from tcp_mmap_hook() before
mmap_sem is taken, but that is left to another, independent change.

Signed-off-by: Eric Dumazet <[email protected]>
---
net/ipv4/tcp.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e913b2dd5df321f2789e8d5f233ede9c2f1d5624..82f7c3e47253cecac6ea1819fbb7a0712058ec55 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1740,9 +1740,16 @@ int tcp_mmap_hook(struct socket *sock, enum mmap_hook mode)
*/
return 0;
}
- /* TODO: Move here the stuff that can be done after
- * current->mm->mmap_sem has been released.
- */
+ if (mode == MMAP_HOOK_COMMIT) {
+ u32 offset;
+
+ tcp_rcv_space_adjust(sk);
+
+ /* Clean up data we have read: This will do ACK frames. */
+ tcp_recv_skb(sk, tcp_sk(sk)->copied_seq, &offset);
+
+ tcp_cleanup_rbuf(sk, PAGE_SIZE);
+ }
release_sock(sk);
return 0;
}
@@ -1843,13 +1850,8 @@ int tcp_mmap(struct file *file, struct socket *sock,
if (ret)
goto out;
}
- /* operation is complete, we can 'consume' all skbs */
+ /* operation is complete, skbs will be freed from tcp_mmap_hook() */
tp->copied_seq = seq;
- tcp_rcv_space_adjust(sk);
-
- /* Clean up data we have read: This will do ACK frames. */
- tcp_recv_skb(sk, seq, &offset);
- tcp_cleanup_rbuf(sk, size);

ret = 0;
out:
--
2.17.0.484.g0c8726318c-goog


2018-04-20 15:58:31

by Eric Dumazet

Subject: [PATCH net-next 3/4] tcp: provide tcp_mmap_hook()

Many socket operations can copy data between user and kernel space
while the socket lock is held. This means mm->mmap_sem can be taken
after the socket lock.

When implementing tcp mmap(), I forgot this, and syzbot was kind enough
to bring it to my attention.

This patch adds tcp_mmap_hook(), allowing us to grab the socket lock
before vm_mmap_pgoff() grabs mm->mmap_sem.

This same hook is responsible for releasing the socket lock when
vm_mmap_pgoff() has released mm->mmap_sem (or failed to acquire it).

Note that follow-up patches can transfer code from tcp_mmap()
to tcp_mmap_hook() to shorten tcp_mmap() execution time
and thus increase mmap() performance in multi-threaded programs.
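
To make the lock inversion concrete, here is a minimal userspace sketch
(illustrative only, not the actual syzbot reproducer) of the kind of
interleaving that exposes it: one thread reads into a not-yet-faulted buffer
(socket lock, then mmap_sem via the page fault), while another thread mmap()s
the same socket (mmap_sem, then, before this patch, the socket lock in
tcp_mmap()).

#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

static int tcp_fd;	/* assumed: a connected TCP socket */

/* Thread A: recvmsg()/read() takes the socket lock, then faults the
 * not-yet-populated destination buffer, which takes mm->mmap_sem inside
 * the page fault: socket lock -> mmap_sem.
 */
static void *reader(void *arg)
{
	size_t len = 1 << 20;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf != MAP_FAILED)
		read(tcp_fd, buf, len);
	return NULL;
}

/* Thread B: mmap() on the socket takes mmap_sem first, then (before this
 * patch) tcp_mmap() took the socket lock: mmap_sem -> socket lock, the
 * reverse order of thread A.
 */
static void *mapper(void *arg)
{
	size_t len = 1 << 20;
	void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, tcp_fd, 0);

	if (p != MAP_FAILED)
		munmap(p, len);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	/* tcp_fd setup omitted for brevity */
	pthread_create(&a, NULL, reader, NULL);
	pthread_create(&b, NULL, mapper, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}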

Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: syzbot <[email protected]>
---
include/net/tcp.h | 1 +
net/ipv4/af_inet.c | 1 +
net/ipv4/tcp.c | 25 ++++++++++++++++++++++---
net/ipv6/af_inet6.c | 1 +
4 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 833154e3df173ea41aa16dd1ec739a175c679c5c..f68c8e8957840cacdbdd3d02bd149fce33ae324f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -404,6 +404,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
int flags, int *addr_len);
int tcp_set_rcvlowat(struct sock *sk, int val);
void tcp_data_ready(struct sock *sk);
+int tcp_mmap_hook(struct socket *sock, enum mmap_hook mode);
int tcp_mmap(struct file *file, struct socket *sock,
struct vm_area_struct *vma);
void tcp_parse_options(const struct net *net, const struct sk_buff *skb,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 3ebf599cebaea4926decc1aad7274b12ec7e1566..af597440ff59c049b7fd02f7d7f79c23b9e195bb 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -995,6 +995,7 @@ const struct proto_ops inet_stream_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = inet_recvmsg,
.mmap = tcp_mmap,
+ .mmap_hook = tcp_mmap_hook,
.sendpage = inet_sendpage,
.splice_read = tcp_splice_read,
.read_sock = tcp_read_sock,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4022073b0aeea9d07af0fa825b640a00512908a3..e913b2dd5df321f2789e8d5f233ede9c2f1d5624 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1726,6 +1726,28 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
}
EXPORT_SYMBOL(tcp_set_rcvlowat);

+/* mmap() on TCP needs to grab socket lock before current->mm->mmap_sem
+ * is taken in vm_mmap_pgoff() to avoid possible dead locks.
+ */
+int tcp_mmap_hook(struct socket *sock, enum mmap_hook mode)
+{
+ struct sock *sk = sock->sk;
+
+ if (mode == MMAP_HOOK_PREPARE) {
+ lock_sock(sk);
+ /* TODO: Move here all the preparation work that can be done
+ * before having to grab current->mm->mmap_sem.
+ */
+ return 0;
+ }
+ /* TODO: Move here the stuff that can be done after
+ * current->mm->mmap_sem has been released.
+ */
+ release_sock(sk);
+ return 0;
+}
+EXPORT_SYMBOL(tcp_mmap_hook);
+
/* When user wants to mmap X pages, we first need to perform the mapping
* before freeing any skbs in receive queue, otherwise user would be unable
* to fallback to standard recvmsg(). This happens if some data in the
@@ -1756,8 +1778,6 @@ int tcp_mmap(struct file *file, struct socket *sock,
/* TODO: Maybe the following is not needed if pages are COW */
vma->vm_flags &= ~VM_MAYWRITE;

- lock_sock(sk);
-
ret = -ENOTCONN;
if (sk->sk_state == TCP_LISTEN)
goto out;
@@ -1833,7 +1853,6 @@ int tcp_mmap(struct file *file, struct socket *sock,

ret = 0;
out:
- release_sock(sk);
kvfree(pages_array);
return ret;
}
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 36d622c477b1ed3c5d2b753938444526344a6109..31ce68c001c223d3351f73453273ae517a051816 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -579,6 +579,7 @@ const struct proto_ops inet6_stream_ops = {
.sendmsg = inet_sendmsg, /* ok */
.recvmsg = inet_recvmsg, /* ok */
.mmap = tcp_mmap,
+ .mmap_hook = tcp_mmap_hook,
.sendpage = inet_sendpage,
.sendmsg_locked = tcp_sendmsg_locked,
.sendpage_locked = tcp_sendpage_locked,
--
2.17.0.484.g0c8726318c-goog


2018-04-20 15:58:42

by Eric Dumazet

Subject: [PATCH net-next 2/4] net: implement sock_mmap_hook()

sock_mmap_hook() is the mmap_hook handler provided for socket_file_ops.

The following patch will provide tcp_mmap_hook() for the TCP protocol.

Signed-off-by: Eric Dumazet <[email protected]>
---
include/linux/net.h | 1 +
net/socket.c | 9 +++++++++
2 files changed, 10 insertions(+)

diff --git a/include/linux/net.h b/include/linux/net.h
index 6554d3ba4396b3df49acac934ad16eeb71a695f4..5192bf502b11e42c3d9eb342ce67361916149bfa 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -181,6 +181,7 @@ struct proto_ops {
size_t total_len, int flags);
int (*mmap) (struct file *file, struct socket *sock,
struct vm_area_struct * vma);
+ int (*mmap_hook) (struct socket *sock, enum mmap_hook);
ssize_t (*sendpage) (struct socket *sock, struct page *page,
int offset, size_t size, int flags);
ssize_t (*splice_read)(struct socket *sock, loff_t *ppos,
diff --git a/net/socket.c b/net/socket.c
index f10f1d947c78c193b49379b0ec641d81367fb4cf..75a5c2ebe57e0621dae17c6c9e1a796ee818b107 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -131,6 +131,14 @@ static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
unsigned int flags);

+static int sock_mmap_hook(struct file *file, enum mmap_hook mode)
+{
+ struct socket *sock = file->private_data;
+
+ if (!sock->ops->mmap_hook)
+ return 0;
+ return sock->ops->mmap_hook(sock, mode);
+}
/*
* Socket files have a set of 'special' operations as well as the generic file ones. These don't appear
* in the operation structures but are done directly via the socketcall() multiplexor.
@@ -147,6 +155,7 @@ static const struct file_operations socket_file_ops = {
.compat_ioctl = compat_sock_ioctl,
#endif
.mmap = sock_mmap,
+ .mmap_hook = sock_mmap_hook,
.release = sock_close,
.fasync = sock_fasync,
.sendpage = sock_sendpage,
--
2.17.0.484.g0c8726318c-goog


2018-04-21 09:10:51

by Christoph Hellwig

Subject: Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue

On Fri, Apr 20, 2018 at 08:55:38AM -0700, Eric Dumazet wrote:
> This patch series provide a new mmap_hook to fs willing to grab
> a mutex before mm->mmap_sem is taken, to ensure lockdep sanity.
>
> This hook allows us to shorten tcp_mmap() execution time (while mmap_sem
> is held), and improve multi-threading scalability.

Missing CC to linux-fsdevel and linux-mm, which will have to decide.

We've rejected this approach multiple times before, so you better
make a really good argument for it.

Introducing a multiplexer that overloads a single method certainly
doesn't help make that case.

2018-04-21 16:57:38

by Eric Dumazet

Subject: Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue



On 04/21/2018 02:07 AM, Christoph Hellwig wrote:
> On Fri, Apr 20, 2018 at 08:55:38AM -0700, Eric Dumazet wrote:
>> This patch series provide a new mmap_hook to fs willing to grab
>> a mutex before mm->mmap_sem is taken, to ensure lockdep sanity.
>>
>> This hook allows us to shorten tcp_mmap() execution time (while mmap_sem
>> is held), and improve multi-threading scalability.
>
> Missing CC to linu-fsdevel and linux-mm that will have to decide.
>
> We've rejected this approach multiple times before, so you better
> make a really good argument for it.
>

Well, the TCP code needs to hold the socket lock before mm->mmap_sem, so the current
mmap hook cannot fit. Otherwise we would need to revisit all code doing copyin/copyout while
holding a socket lock. (Not really feasible.)


> introducing a multiplexer that overloads a single method certainly
> doesn't help making that case.

Well, if you refer to multiple hooks instead of a single one, I basically
thought that since only TCP needs this hook at the moment,
it was not worth adding extra 8-byte loads for all other mmap() users.

I have no issue adding more hooks and more memory pressure if this is the blocking factor.

We need two actions at the moment (to lock the socket or release it),
and a third one would allow us to build the array of pages
before grabbing mmap_sem (as I mentioned in the last patch changelog).
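
To make the trade-off concrete, the non-multiplexed variant being discussed
could look roughly like this (method names are hypothetical; each extra method
is one more 8-byte pointer in every struct file_operations):

/* Hypothetical, for discussion only: one file_operations method per
 * action instead of the single mmap_hook(file, mode) multiplexer.
 */
struct file_operations {
	/* ... existing methods ... */
	int (*mmap) (struct file *, struct vm_area_struct *);
	int (*mmap_prepare) (struct file *);	/* before mmap_sem is taken */
	int (*mmap_commit) (struct file *);	/* after a successful mmap() */
	int (*mmap_rollback) (struct file *);	/* mmap() failed or -EINTR */
	/* ... */
};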


2018-04-23 21:17:16

by Andy Lutomirski

Subject: Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue

On 04/20/2018 08:55 AM, Eric Dumazet wrote:
> This patch series provide a new mmap_hook to fs willing to grab
> a mutex before mm->mmap_sem is taken, to ensure lockdep sanity.
>
> This hook allows us to shorten tcp_mmap() execution time (while mmap_sem
> is held), and improve multi-threading scalability.
>

I think that the right solution is to rework mmap() on TCP sockets a
bit. The current approach in net-next is very strange for a few reasons:

1. It uses mmap() as an operation that has side effects besides just
creating a mapping. If nothing else, it's surprising, since mmap()
doesn't usually do that. But it's also causing problems like what
you're seeing.

2. The performance is worse than it needs to be. mmap() is slow, and I
doubt you'll find many mm developers who consider this particular abuse
of mmap() to be a valid thing to optimize for.

3. I'm not at all convinced the accounting is sane. As far as I can
tell, you're allowing unprivileged users to increment the count on
network-owned pages, limited only by available virtual memory, without
obviously charging it to the socket buffer limits. It looks like a
program that simply forgot to call munmap() would cause the system to
run out of memory, and I see no reason to expect the OOM killer to have
any real chance of killing the right task.

4. Error handling sucks. If I try to mmap() a large range (which is the
whole point -- using a small range will kill performance) and not quite
all of it can be mapped, then I waste a bunch of time in the kernel and
get *none* of the range mapped.

I would suggest that you rework the interface a bit. First a user would
call mmap() on a TCP socket, which would create an empty VMA. (It would
set vm_ops to point to tcp_vm_ops or similar so that the TCP code could
recognize it, but it would have no effect whatsoever on the TCP state
machine. Reading the VMA would get SIGBUS.) Then a user would call a
new ioctl() or setsockopt() function and pass something like:

struct tcp_zerocopy_receive {
	void *address;
	size_t length;
};

The kernel would verify that [address, address+length) is entirely
inside a single TCP VMA and then would do the vm_insert_range magic. On
success, length is changed to the length that actually got mapped. The
kernel could do this while holding mmap_sem for *read*, and it could get
the lock ordering right. If and when mm range locks ever get merged, it
could switch to using a range lock.

Then you could use MADV_DONTNEED or another ioctl/setsockopt to zap the
part of the mapping that you're done with.

Does this seem reasonable? It should involve very little code change,
it will run faster, it will scale better, and it is much less weird IMO.
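
Something like this, from the user's point of view (a sketch only: the
TCP_ZEROCOPY_RECEIVE name and value are made up, and I've used getsockopt()
so the kernel can write the mapped length back, though an ioctl() would work
just as well):

#include <stddef.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <unistd.h>

/* As proposed above; not an existing uapi struct. */
struct tcp_zerocopy_receive {
	void *address;
	size_t length;
};

#define TCP_ZEROCOPY_RECEIVE 60	/* made-up option number */

/* Map up to map_len bytes of received data into the socket VMA at base,
 * consume them, then zap the part of the mapping we are done with.
 */
static ssize_t zerocopy_receive_once(int fd, void *base, size_t map_len)
{
	struct tcp_zerocopy_receive zc = {
		.address = base,
		.length  = map_len,
	};
	socklen_t zc_len = sizeof(zc);

	if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len))
		return -1;

	/* ... read zc.length bytes at zc.address ... */

	madvise(zc.address, zc.length, MADV_DONTNEED);
	return zc.length;
}

int main(void)
{
	size_t map_len = 16 << 20;
	int fd = -1;	/* assumed: a connected TCP socket */

	/* One-time setup: the empty socket VMA described above. */
	void *base = mmap(NULL, map_len, PROT_READ, MAP_SHARED, fd, 0);

	if (base == MAP_FAILED)
		return 1;
	zerocopy_receive_once(fd, base, map_len);
	munmap(base, map_len);
	return 0;
}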

2018-04-23 21:41:38

by Eric Dumazet

Subject: Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue

Hi Andy

On 04/23/2018 02:14 PM, Andy Lutomirski wrote:
> On 04/20/2018 08:55 AM, Eric Dumazet wrote:
>> This patch series provide a new mmap_hook to fs willing to grab
>> a mutex before mm->mmap_sem is taken, to ensure lockdep sanity.
>>
>> This hook allows us to shorten tcp_mmap() execution time (while mmap_sem
>> is held), and improve multi-threading scalability.
>>
>
> I think that the right solution is to rework mmap() on TCP sockets a bit.  The current approach in net-next is very strange for a few reasons:
>
> 1. It uses mmap() as an operation that has side effects besides just creating a mapping.  If nothing else, it's surprising, since mmap() doesn't usually do that.  But it's also causing problems like what you're seeing.
>
> 2. The performance is worse than it needs to be.  mmap() is slow, and I doubt you'll find many mm developers who consider this particular abuse of mmap() to be a valid thing to optimize for.
>
> 3. I'm not at all convinced the accounting is sane.  As far as I can tell, you're allowing unprivileged users to increment the count on network-owned pages, limited only by available virtual memory, without obviously charging it to the socket buffer limits.  It looks like a program that simply forgot to call munmap() would cause the system to run out of memory, and I see no reason to expect the OOM killer to have any real chance of killing the right task.

>
> 4. Error handling sucks.  If I try to mmap() a large range (which is the whole point -- using a small range will kill performance) and not quite all of it can be mapped, then I waste a bunch of time in the kernel and get *none* of the range mapped.
>
> I would suggest that you rework the interface a bit.  First a user would call mmap() on a TCP socket, which would create an empty VMA.  (It would set vm_ops to point to tcp_vm_ops or similar so that the TCP code could recognize it, but it would have no effect whatsoever on the TCP state machine.  Reading the VMA would get SIGBUS.)  Then a user would call a new ioctl() or setsockopt() function and pass something like:


>
> struct tcp_zerocopy_receive {
>   void *address;
>   size_t length;
> };
>
> The kernel would verify that [address, address+length) is entirely inside a single TCP VMA and then would do the vm_insert_range magic.

I have no idea what the proper API for that would be.
Where would the TCP VMA(s) be stored?
In the TCP socket, or the MM layer?


And I am not sure why the error handling would be better (point 4), unless we can return a smaller @length than requested, maybe?

Also, how would the VMA space be accounted for (point 3) when creating an empty VMA (no pages in there yet)?

> On success, length is changed to the length that actually got mapped.  The kernel could do this while holding mmap_sem for *read*, and it could get the lock ordering right.  If and when mm range locks ever get merged, it could switch to using a range lock.
>
> Then you could use MADV_DONTNEED or another ioctl/setsockopt to zap the part of the mapping that you're done with.
>
> Does this seem reasonable?  It should involve very little code change, it will run faster, it will scale better, and it is much less weird IMO.

Maybe, although I do not see the 'little code change' yet.

But at least this seems like a pretty nice idea, especially if it could allow us to fill the mmap()ed area later, when packets are received.


2018-04-24 02:07:34

by Andy Lutomirski

Subject: Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue

On Mon, Apr 23, 2018 at 2:38 PM, Eric Dumazet <[email protected]> wrote:
> Hi Andy
>
> On 04/23/2018 02:14 PM, Andy Lutomirski wrote:

>> I would suggest that you rework the interface a bit. First a user would call mmap() on a TCP socket, which would create an empty VMA. (It would set vm_ops to point to tcp_vm_ops or similar so that the TCP code could recognize it, but it would have no effect whatsoever on the TCP state machine. Reading the VMA would get SIGBUS.) Then a user would call a new ioctl() or setsockopt() function and pass something like:
>
>
>>
>> struct tcp_zerocopy_receive {
>> void *address;
>> size_t length;
>> };
>>
>> The kernel would verify that [address, address+length) is entirely inside a single TCP VMA and then would do the vm_insert_range magic.
>
> I have no idea what is the proper API for that.
> Where the TCP VMA(s) would be stored ?
> In TCP socket, or MM layer ?

MM layer. I haven't tested this at all, and the error handling is
totally wrong, but I think you'd do something like:

len = get_user(...);

down_read(&current->mm->mmap_sem);

vma = find_vma(mm, start);
if (!vma || vma->vm_start > start)
	return -EFAULT;

/* This is buggy.  You also need to check that the file is a socket.
 * This is probably trivial. */
if (vma->vm_file->private_data != sock)
	return -EINVAL;

if (len > vma->vm_end - start)
	return -EFAULT;	/* too big a request. */

and now you'd do the vm_insert_page() dance, except that you don't
have to abort the whole procedure if you discover that something isn't
aligned right. Instead you'd just stop and tell the caller that you
didn't map the full requested size. You might also need to add some
code to charge the caller for the pages that get pinned, but that's an
orthogonal issue.

You also need to provide some way for user programs to signal that
they're done with the page in question. MADV_DONTNEED might be
sufficient.

In the mmap() helper, you might want to restrict the mapped size to
something reasonable. And it might be nice to hook mremap() to
prevent user code from causing too much trouble.

With my x86-writer-of-TLB-code hat on, I expect the major performance
costs to be the generic costs of mmap() and munmap() (which only
happen once per socket instead of once per read if you like my idea),
the cost of a TLB miss when the data gets read (really not so bad on
modern hardware), and the cost of the TLB invalidation when user code
is done with the buffers. The latter is awful, especially in
multithreaded programs. In fact, it's so bad that it might be worth
mentioning in the documentation for this code that it just shouldn't
be used in multithreaded processes. (Also, on non-PCID hardware,
there's an annoying situation in which a recently-migrated thread that
removes a mapping sends an IPI to the CPU that the thread used to be
on. I thought I had a clever idea to get rid of that IPI once, but it
turned out to be wrong.)

Architectures like ARM that have superior TLB handling primitives will
not be hurt as badly if this is used by a multithreaded program.

>
>
> And I am not sure why the error handling would be better (point 4), unless we can return smaller @length than requested maybe ?

Exactly. If I request 10MB mapped and only the first 9MB are aligned
right, I still want the first 9MB.

>
> Also how the VMA space would be accounted (point 3) when creating an empty VMA (no pages in there yet)

There's nothing to account. It's the same as mapping /dev/null or
similar -- the mm core should take care of it for you.

2018-04-24 04:32:09

by Eric Dumazet

Subject: Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue



On 04/23/2018 07:04 PM, Andy Lutomirski wrote:
> On Mon, Apr 23, 2018 at 2:38 PM, Eric Dumazet <[email protected]> wrote:
>> Hi Andy
>>
>> On 04/23/2018 02:14 PM, Andy Lutomirski wrote:
>
>>> I would suggest that you rework the interface a bit. First a user would call mmap() on a TCP socket, which would create an empty VMA. (It would set vm_ops to point to tcp_vm_ops or similar so that the TCP code could recognize it, but it would have no effect whatsoever on the TCP state machine. Reading the VMA would get SIGBUS.) Then a user would call a new ioctl() or setsockopt() function and pass something like:
>>
>>
>>>
>>> struct tcp_zerocopy_receive {
>>> void *address;
>>> size_t length;
>>> };
>>>
>>> The kernel would verify that [address, address+length) is entirely inside a single TCP VMA and then would do the vm_insert_range magic.
>>
>> I have no idea what is the proper API for that.
>> Where the TCP VMA(s) would be stored ?
>> In TCP socket, or MM layer ?
>
> MM layer. I haven't tested this at all, and the error handling is
> totally wrong, but I think you'd do something like:
>
> len = get_user(...);
>
> down_read(&current->mm->mmap_sem);
>
> vma = find_vma(mm, start);
> if (!vma || vma->vm_start > start)
> return -EFAULT;
>
> /* This is buggy. You also need to check that the file is a socket.
> This is probably trivial. */
> if (vma->vm_file->private_data != sock)
> return -EINVAL;
>
> if (len > vma->vm_end - start)
> return -EFAULT; /* too big a request. */
>
> and now you'd do the vm_insert_page() dance, except that you don't
> have to abort the whole procedure if you discover that something isn't
> aligned right. Instead you'd just stop and tell the caller that you
> didn't map the full requested size. You might also need to add some
> code to charge the caller for the pages that get pinned, but that's an
> orthogonal issue.
>
> You also need to provide some way for user programs to signal that
> they're done with the page in question. MADV_DONTNEED might be
> sufficient.
>
> In the mmap() helper, you might want to restrict the mapped size to
> something reasonable. And it might be nice to hook mremap() to
> prevent user code from causing too much trouble.
>
> With my x86-writer-of-TLB-code hat on, I expect the major performance
> costs to be the generic costs of mmap() and munmap() (which only
> happen once per socket instead of once per read if you like my idea),
> the cost of a TLB miss when the data gets read (really not so bad on
> modern hardware), and the cost of the TLB invalidation when user code
> is done with the buffers. The latter is awful, especially in
> multithreaded programs. In fact, it's so bad that it might be worth
> mentioning in the documentation for this code that it just shouldn't
> be used in multithreaded processes. (Also, on non-PCID hardware,
> there's an annoying situation in which a recently-migrated thread that
> removes a mapping sends an IPI to the CPU that the thread used to be
> on. I thought I had a clever idea to get rid of that IPI once, but it
> turned out to be wrong.)
>
> Architectures like ARM that have superior TLB handling primitives will
> not be hurt as badly if this is used my a multithreaded program.
>
>>
>>
>> And I am not sure why the error handling would be better (point 4), unless we can return smaller @length than requested maybe ?
>
> Exactly. If I request 10MB mapped and only the first 9MB are aligned
> right, I still want the first 9 MB.
>
>>
>> Also how the VMA space would be accounted (point 3) when creating an empty VMA (no pages in there yet)
>
> There's nothing to account. It's the same as mapping /dev/null or
> similar -- the mm core should take care of it for you.
>

Thanks Andy, I am working on all this, and the initial patch looks sane enough.

include/uapi/linux/tcp.h | 7 +
net/ipv4/tcp.c | 175 +++++++++++++++++++++++------------------------
2 files changed, 93 insertions(+), 89 deletions(-)


I will test all this before sending it for review asap.

(I have not done the compat code yet; this can be done later, I guess.)
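
For orientation, the 7 uapi lines above might look something like the
following, modeled on the struct you proposed; the exact names, types and
option value are still up in the air (fixed-width types with the 32/64-bit
compat question in mind):

/* Tentative include/uapi/linux/tcp.h addition, not final: modeled on the
 * struct proposed earlier in this thread. Option value and field types
 * are placeholders.
 */
#define TCP_ZEROCOPY_RECEIVE	35	/* placeholder value */

struct tcp_zerocopy_receive {
	__u64	address;	/* in: address inside the socket VMA */
	__u32	length;		/* in/out: bytes requested / actually mapped */
};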