The following patch (for 2.4.20 -- should work with all kernels
above 2.4.17) implements TCP Zero Copy for normal (writing)
socket operations on memory mapped files.
This is a major speedup for the TCP/IP stack (depending on the size
of the file, throughput more than doubles) and makes sendfile(2)
nearly useless.
BTW: When I ran a (loopback) benchmark with my very own HTTP
server, it outperformed TUX by roughly 6%, and with logging disabled
by roughly 20%.
Please CC any replies to me, as I'm not subscribed to this list.
How about putting this into a different function? It's a lot to add
inline for a special case.
> int tcp_sendmsg(struct sock *sk, struct msghdr *msg, int size)
> {
> 	struct iovec *iov;
> @@ -1015,6 +1051,7 @@
> 	int mss_now;
> 	int err, copied;
> 	long timeo;
> +	int has_sendpage = sk->socket->file->f_op->sendpage != NULL;
>
> 	tp = &(sk->tp_pinfo.af_tcp);
>
> @@ -1049,6 +1086,44 @@
>
> 		iov++;
>
> +		if (seglen >= PAGE_SIZE && has_sendpage) {
> +			struct vm_area_struct *vma =
> +				find_vma (current->mm, (long) from);
> +			struct file *filp;
> +
> +			if (vma && (filp = vma->vm_file)) {
> +				read_descriptor_t desc;
> +				struct inode *in, *out;
> +				loff_t pos = (long) from - vma->vm_start;
> +
> +				in = filp->f_dentry->d_inode;
> +				out = sk->socket->file->f_dentry->d_inode;
> +
> +				if (locks_verify_area (FLOCK_VERIFY_READ, in,
> +						filp, filp->f_pos, seglen))
> +					goto out_no_zero_copy;
> +
> +				if (locks_verify_area (FLOCK_VERIFY_WRITE, out,
> +						sk->socket->file, 0, seglen))
> +					goto out_no_zero_copy;
> +
> +				desc.written = 0;
> +				desc.count = seglen;
> +				desc.buf = (char *) sk;
> +				desc.error = 0;
> +
> +				do_generic_file_read (filp, &pos, &desc,
> +						file_send_actor);
> +
> +				if (!desc.written) {
> +					err = desc.error;
> +					goto do_error;
> +				}
> +				copied += desc.written;
> +				continue;
> +			}
> +		}
> +out_no_zero_copy:
> 		while (seglen > 0) {
> 			int copy;
>
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
On Sun, 2002-12-29 at 17:29, Larry McVoy wrote:
> How about putting this into a different function? It's a lot to add
> inline for a special case.
This patch also has a ton of other problems:
1) Does not handle writes that straddle multiple VMAs
2) We do not want to encourage people to use this mmap
scheme anyways. The mmap way consumes precious VM
space, whereas the sendfile scheme does not.
3) Finally, I'm very dubious about the "this is faster than
TUX" claim. Firstly because you've not provided your
self-made HTTP server so that others can try to reproduce
your benchmark. And secondly because you haven't indicated
if your self-made HTTP server is as full featured as TUX or
not. And thirdly you haven't indicated what happens if
parallel clients ask to be served more files than you could
fit, mmapped, into the HTTP server process's address space (i.e. see
#2)
So I think this patch stinks :)
On Wed, Jan 01, 2003 at 10:37:01PM -0800, David S. Miller wrote:
> On Sun, 2002-12-29 at 17:29, Larry McVoy wrote:
> > How about putting this into a different function? It's a lot to add
> > inline for a special case.
All right.
> 1) Does not handle writes that straddle multiple VMAs
What exactly do you mean? In my test, files larger than a
page were handled perfectly, as well.
> 2) We do not want to encourage people to use this mmap
> scheme anyways. The mmap way consumes precious VM
> space, whereas the sendfile scheme does not.
Is that the answer to my "sendfile is now obsolete"?
Sure we cannot remove sendfile now, as some applications
depend on it, but that's not what I wanted.
I made this patch, so that _portable_ applications (and sendfile
is miles away from being portable - even if the target has a
sendfile system call, it's highly unlikely that it has the same
semantics as Linux' sendfile) are sped up.
However, I didn't like the VM waste either, but I believe there
is no other way.
> 3) Finally, I'm very dubious about the "this is faster than
> TUX" claim. Firstly because you've not provided your
> self-made HTTP server so that others can try to reproduce
> your benchmark. And secondly because you haven't indicated
> if your self-made HTTP server is as full featured as TUX or
> not. And thirdly you haven't indicated what happens if
> parallel clients ask to be served more files than you could
> fit, mmapped, into the HTTP server process's address space (i.e. see
> #2)
Hehe. In fact that wasn't a really serious claim. My tests
were (as explicitly stated by me) done over the loopback
interface. And as far as I know TUX can handle interrupts
from the network card directly, which probably makes it
far faster.
As I have neither the time nor the infrastructure to do a real
test, I cannot really say whether TUX or my (currently unreleased)
web server is faster.
BTW: My web server maps files only once, so there shouldn't be
a problem with parallel transfers.
> So I think this patch stinks :)
But it worked? If I didn't misunderstand #1 then I don't see a
problem for integrating it into the current kernel.
> > 1) Does not handle writes that straddle multiple VMAs
>
> What exactly do you mean? In my test, files larger than a
> page were handled perfectly, as well.
    mmap(file1) at location [a,b)
    mmap(file2) at location [b,c)
    write(sock, a, (size_t)(c - a));
> However, I didn't like the VM waste either, but I believe there
> is no other way.
The VM cost hurts. Badly. Imagine that the network costs ZERO. Then
the map/unmap/vm ops become the dominating term. That's why it is a
fruitless approach, it still has a practical limit which is too low.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
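The straddling case in the three-line example above can be reproduced from user space. What follows is only a sketch, assuming Linux, a writable /tmp, and a socketpair standing in for the TCP socket; make_file() and straddle_demo() are hypothetical names and not part of the patch. It maps two different files back to back and then hands a single write() a buffer that covers both mappings, which is exactly the input a single find_vma() lookup cannot describe.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create an already-unlinked temp file of len bytes of fill. */
static int make_file(char fill, size_t len)
{
    char path[] = "/tmp/straddleXXXXXX";
    char *buf = malloc(len);
    int fd = -1;

    if (!buf)
        return -1;
    memset(buf, fill, len);
    fd = mkstemp(path);
    if (fd >= 0) {
        unlink(path);
        if (write(fd, buf, len) != (ssize_t)len) {
            close(fd);
            fd = -1;
        }
    }
    free(buf);
    return fd;
}

/*
 * Map file1 at [a,b) and file2 at [b,c), then do one write() of c - a bytes.
 * Returns 0 if the peer receives file1's bytes followed by file2's bytes.
 */
int straddle_demo(void)
{
    size_t pg = (size_t)sysconf(_SC_PAGESIZE), have = 0;
    int f1 = make_file('A', pg), f2 = make_file('B', pg);
    int sv[2] = { -1, -1 }, rc = -1;
    char *a = MAP_FAILED, *got = malloc(2 * pg);

    if (f1 < 0 || f2 < 0 || !got)
        goto out;
    /* Reserve [a,c) first so the two file mappings can sit back to back. */
    a = mmap(NULL, 2 * pg, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (a == MAP_FAILED)
        goto out;
    if (mmap(a, pg, PROT_READ, MAP_SHARED | MAP_FIXED, f1, 0) == MAP_FAILED ||
        mmap(a + pg, pg, PROT_READ, MAP_SHARED | MAP_FIXED, f2, 0) == MAP_FAILED)
        goto out;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        goto out;
    /* One call whose buffer spans both VMAs (two pages fit the socket buffer). */
    if (write(sv[0], a, 2 * pg) != (ssize_t)(2 * pg))
        goto out;
    while (have < 2 * pg) {
        ssize_t n = read(sv[1], got + have, 2 * pg - have);
        if (n <= 0)
            goto out;
        have += (size_t)n;
    }
    if (got[0] == 'A' && got[pg - 1] == 'A' &&
        got[pg] == 'B' && got[2 * pg - 1] == 'B')
        rc = 0;
out:
    if (a != MAP_FAILED)
        munmap(a, 2 * pg);
    if (sv[0] >= 0) close(sv[0]);
    if (sv[1] >= 0) close(sv[1]);
    if (f1 >= 0) close(f1);
    if (f2 >= 0) close(f2);
    free(got);
    return rc;
}
```

On an unpatched kernel the ordinary copy path handles this correctly, so straddle_demo() returns 0; any zero-copy shortcut would have to produce the same byte stream for this input.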
On Thu, 2003-01-02 at 22:28, Larry McVoy wrote:
> The VM cost hurts. Badly. Imagine that the network costs ZERO. Then
> the map/unmap/vm ops become the dominating term. That's why it is a
> fruitless approach, it still has a practical limit which is too low.
It depends how predictable your content is. With a 64-bit box and a porn
server it's probably quite tidy
From: Thomas Ogrisegg <[email protected]>
Date: Thu, 2 Jan 2003 23:12:11 +0100
> > 1) Does not handle writes that straddle multiple VMAs
> What exactly do you mean?
If I mmap two areas, one right after another, and then do a write
comprising those two areas, your code will only look up
one of the VMAs.
It's a bug.
> > 2) We do not want to encourage people to use this mmap
> > scheme anyways. The mmap way consumes precious VM
> > space, whereas the sendfile scheme does not.
> Is that the answer to my "sendfile is now obsolete"?
It is a "this patch is unacceptable because" comment.
> Sure we cannot remove sendfile now, as some applications
> depend on it, but that's not what I wanted.
That's not what I'm talking about. I'm saying, making this
mmap thing available makes no sense at all.
> I made this patch, so that _portable_ applications (and sendfile
> is miles away from being portable - even if the target has a
> sendfile system call, it's highly unlikely that it has the same
> semantics as Linux' sendfile) are sped up.
This isn't a priority for us. People who want the best possible
performance can code their apps up to take advantage of sendfile()
on systems that have it. (and really, show me how many systems
lack a sendfile mechanism these days).
> However, I didn't like the VM waste either, but I believe there
> is no other way.
There is a way, convert to sendfile.
> Hehe. In fact that wasn't a really serious claim.
Then don't make such claims.
> > So I think this patch stinks :)
> But it worked? If I didn't misunderstand #1 then I don't see a
> problem for integrating it into the current kernel.
I think you need to rethink the multiple VMA case in #1, and
also understand why I don't want this facility in the tree
at all anyways. Apps can convert to sendfile(), and as a result
they'll get improved performance on ALL linux kernels, not just
the ones with your special patch applied.
From: Alan Cox <[email protected]>
Date: 02 Jan 2003 23:20:44 +0000
> On Thu, 2003-01-02 at 22:28, Larry McVoy wrote:
> > The VM cost hurts. Badly. Imagine that the network costs ZERO. Then
> > the map/unmap/vm ops become the dominating term. That's why it is a
> > fruitless approach, it still has a practical limit which is too low.
> It depends how predictable your content is. With a 64-bit box and a porn
> server it's probably quite tidy
Let's say you have infinite VM (which is what 64-bit almost is :) then
the cost is setting up all of these useless VMAs for each and every
file (which is a 1 time cost, ok), and also the VMA lookup each
write() call.
With sendfile() all of this goes straight to the page cache directly
without a VMA lookup.
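For comparison, the sendfile() path needs no mapping in the sender at all: the application passes a file descriptor and a byte range, and the kernel takes the pages from the page cache. A minimal sketch, assuming the Linux sendfile(2) signature from <sys/sendfile.h>; send_whole_file() and sendfile_demo() are hypothetical names, and the demo uses a socketpair in place of a TCP connection, which assumes a kernel recent enough to accept that as the output descriptor.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Send the first count bytes of in_fd to out_fd without mapping the file.
 * Returns 0 on success, -1 on error or if the file ends early.
 */
int send_whole_file(int out_fd, int in_fd, size_t count)
{
    off_t off = 0;

    while (count > 0) {
        /* sendfile() may send fewer bytes than asked; it advances off. */
        ssize_t n = sendfile(out_fd, in_fd, &off, count);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;
        count -= (size_t)n;
    }
    return 0;
}

/* Hypothetical self-check: push a small temp file through a socketpair. */
int sendfile_demo(void)
{
    static const char msg[] = "hello from the page cache\n";
    char path[] = "/tmp/sfdemoXXXXXX", got[sizeof(msg)] = { 0 };
    size_t len = sizeof(msg) - 1, have = 0;
    int sv[2] = { -1, -1 }, rc = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (write(fd, msg, len) != (ssize_t)len ||
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        goto out;
    if (send_whole_file(sv[0], fd, len) < 0)
        goto out;
    while (have < len) {
        ssize_t n = read(sv[1], got + have, len - have);
        if (n <= 0)
            goto out;
        have += (size_t)n;
    }
    if (memcmp(got, msg, len) == 0)
        rc = 0;
out:
    if (sv[0] >= 0) close(sv[0]);
    if (sv[1] >= 0) close(sv[1]);
    close(fd);
    return rc;
}
```

Because the offset is passed explicitly, the loop does not depend on the file position of in_fd, so several connections can share one open descriptor.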
On Thu, 2003-01-02 at 23:16, David S. Miller wrote:
> > It depends how predictable your content is. With a 64-bit box and a porn
> > server it's probably quite tidy
>
> Let's say you have infinite VM (which is what 64-bit almost is :) then
> the cost is setting up all of these useless VMAs for each and every
> file (which is a 1 time cost, ok), and also the VMA lookup each
> write() call.
>
> With sendfile() all of this goes straight to the page cache directly
> without a VMA lookup.
With a nasty unpleasant splat the moment you do modification on the
content at all. For static objects sendfile is certainly superior,
On Thu, Jan 02, 2003 at 03:13:46PM -0800, David S. Miller wrote:
> From: Thomas Ogrisegg <[email protected]>
> Date: Thu, 2 Jan 2003 23:12:11 +0100
>
> It's a bug.
I see. Ok, that can be fixed easily.
> > Sure we cannot remove sendfile now, as some applications
> > depend on it, but that's not what I wanted.
>
> That's not what I'm talking about. I'm saying, making this
> mmap thing available makes no sense at all.
No. For portable applications it makes great sense.
> > I made this patch, so that _portable_ applications (and sendfile
> > is miles away from being portable - even if the target has a
> > sendfile system call, it's highly unlikely that it has the same
> > semantics as Linux' sendfile) are sped up.
>
> This isn't a priority for us. People who want the best possible
> performance can code their apps up to take advantage of sendfile()
> on systems that have it.
So you want to chain people to your "proprietary solution"?
> (and really, show me how many systems
> lack a sendfile mechanism these days).
What kind of systems are you talking about? Operating systems?
Nearly all.
> > However, I didn't like the VM waste either, but I believe there
> > is no other way.
>
> There is a way, convert to sendfile.
It might be a bit difficult to convert all applications to
sendfile. Especially those for which you don't have the
source code.
> > But it worked? If I didn't misunderstand #1 then I don't see a
> > problem for integrating it into the current kernel.
>
> I think you need to rethink the multiple VMA case in #1, and
> also understand why I don't want this facility in the tree
> at all anyways. Apps can convert to sendfile(), and as a result
> they'll get improved performance on ALL linux kernels, not just
> the ones with your special patch applied.
I don't see your point. Applications which really need the
performance will switch to sendfile anyway because of the
problems with mmap you mentioned.
My patch is very simple and takes less than 1KB of code but
will speed up many applications and doesn't have a real
drawback (except when sending "normal" data which is larger
than a page - but that shouldn't happen very often).
Yet another advantage of my version is that you can use it
in conjunction with writev.
Unfortunately the linux-sendfile is not as good as the HP-UX
one. Under HP-UX you can define a "struct iovec" header to
be sent before the file is sent.
> It might be a bit difficult to convert all applications to
> sendfile. Especially those for which you don't have the
> source code.
And the list of applications which do
    sock = socket(...);
    map = mmap(...);
    write(sock, map, bytes);
are? There are not very many that I know of and if you look carefully
at the bandwidth graphs in LMbench you'll see why. There is a
crossover point where mmap becomes cheaper but it used to be around 16-64K.
I don't know what it is now, I doubt it's moved much. I can check if
you really want.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
On Fri, 2003-01-03 at 00:45, Thomas Ogrisegg wrote:
> Unfortunately the linux-sendfile is not as good as the HP-UX
> one. Under HP-UX you can define a "struct iovec" header to
> be sent before the file is sent.
That's a design decision. With TCP_CORK and sensible syscall performance,
those kinds of web-specific hacks are not appropriate.
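The TCP_CORK alternative to a header-carrying sendfile can be sketched as follows: cork the socket, write() the header, sendfile() the body, then uncork so the queued data goes out together. This is only an illustration, assuming Linux's TCP_CORK from <netinet/tcp.h>; send_with_header(), tcp_pair() and cork_demo() are hypothetical names, and the demo assumes a usable loopback interface.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <unistd.h>

/* Cork, send header bytes, sendfile() the body, uncork. Returns 0 or -1. */
int send_with_header(int sock, const char *hdr, size_t hlen, int fd, size_t flen)
{
    int on = 1, off = 0, rc = -1;
    off_t pos = 0;

    if (setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on)) < 0)
        return -1;
    while (hlen > 0) {
        ssize_t n = write(sock, hdr, hlen);
        if (n <= 0)
            goto uncork;
        hdr += n;
        hlen -= (size_t)n;
    }
    while (flen > 0) {
        ssize_t n = sendfile(sock, fd, &pos, flen);
        if (n <= 0)
            goto uncork;
        flen -= (size_t)n;
    }
    rc = 0;
uncork:
    /* Clearing the option flushes whatever is still queued. */
    if (setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof(off)) < 0)
        rc = -1;
    return rc;
}

/* Hypothetical helper: connected TCP pair over 127.0.0.1. */
static int tcp_pair(int *cli, int *srv)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof(sa);
    int lfd = socket(AF_INET, SOCK_STREAM, 0);

    *cli = *srv = -1;
    if (lfd < 0)
        return -1;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(lfd, (struct sockaddr *)&sa, sizeof(sa)) == 0 &&
        listen(lfd, 1) == 0 &&
        getsockname(lfd, (struct sockaddr *)&sa, &len) == 0 &&
        (*cli = socket(AF_INET, SOCK_STREAM, 0)) >= 0 &&
        connect(*cli, (struct sockaddr *)&sa, sizeof(sa)) == 0)
        *srv = accept(lfd, NULL, NULL);
    close(lfd);
    if (*srv < 0 && *cli >= 0) {
        close(*cli);
        *cli = -1;
    }
    return *srv >= 0 ? 0 : -1;
}

/* Hypothetical self-check: the peer must see header then body, in order. */
int cork_demo(void)
{
    char path[] = "/tmp/corkXXXXXX", got[16] = { 0 };
    size_t have = 0;
    int cli, srv, rc = -1, fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (write(fd, "BODY", 4) != 4 || tcp_pair(&cli, &srv) < 0) {
        close(fd);
        return -1;
    }
    if (send_with_header(cli, "HDR:", 4, fd, 4) == 0) {
        shutdown(cli, SHUT_WR);
        for (;;) {
            ssize_t n = read(srv, got + have, sizeof(got) - 1 - have);
            if (n <= 0)
                break;
            have += (size_t)n;
        }
        if (have == 8 && memcmp(got, "HDR:BODY", 8) == 0)
            rc = 0;
    }
    close(cli);
    close(srv);
    close(fd);
    return rc;
}
```

This costs three extra system calls per response compared with a combined call, which is the trade Alan's remark about sensible syscall performance refers to.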
On Fri, 2003-01-03 at 01:01, Larry McVoy wrote:
> And the list of applications which do
>
> sock = socket(...);
> map = mmap(...);
> write(sock, map, bytes);
>
> are? There are not very many that I know of and if you look carefully
> at the bandwidth graphs in LMbench you'll see why. There is a cross
> over point where mmap becomes cheaper but it used to be around 16-64K.
> I don't know what it is now, I doubt it's moved much. I can check if
> you really want.
You may not be doing an mmap and a send; it's more likely to look like
    page = hash(url);
    memcpy(current_time, page->clock, TIMESIZE);
    write(sock, page->data, page->len);
That changes the breakeven point a lot.
Alan
On Fri, Jan 03, 2003 at 01:56:27AM +0000, Alan Cox wrote:
> On Fri, 2003-01-03 at 00:45, Thomas Ogrisegg wrote:
> > Unfortunately the linux-sendfile is not as good as the HP-UX
> > one. Under HP-UX you can define a "struct iovec" header to
> > be sent before the file is sent.
>
> Thats a design decision. With TCP_CORK and sensible syscall performance
> those kind of web specific hacks are not appropriate
Indeed. In case Alan's message wasn't clear: if your syscall overhead
is zero then many "optimizations" become superfluous. In fact, those
optimizations, one cache miss at a time, tend to be a big part of what
makes the syscall layer so heavyweight.
Linux is amazing in that it is basically the only real operating system
I know of that has stayed so focussed on making the syscall layer be
almost invisible. It's worth a "rah rah" because you can use the
operating system like it was libc, there is basically very little
cost in crossing in/out.
Here's the LMbench context switch benchmark running on a 1.6GHz Athlon:
load free cach swap pgin pgou dk0 dk1 dk2 dk3 ipkt opkt int  ctx usr sys idl
0.67  73M 577M  25M    0    0   0   0   0   0  4.0  2.0 107 548K  23  77   0
0.67  73M 577M  25M    0    0   0   0   0   0  2.0  2.0 105 549K  19  81   0
0.67  73M 577M  25M    0    0   0   0   0   0  4.0  2.0 107 549K  27  73   0
0.70  73M 577M  25M    0    0   0   0   0   0  2.0  2.0 105 548K  23  77   0
Yeah, that's more than half a million context switches/second and each
of those includes 2 system calls. So Linux is doing 2 system calls and
a context switch in 1.8 microseconds (1 second / 548K switches).
When you can get in and out of the kernel that fast, your thinking should
change. You get to use the kernel more freely. And you certainly don't
want to do anything to screw that up. My hat is off to Linus and team
for working so hard to make these numbers be so good (and keep on working,
see the recent syscall discussion).
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
In article <[email protected]>,
Alan Cox <[email protected]> wrote:
>On Thu, 2003-01-02 at 23:16, David S. Miller wrote:
>>
>> With sendfile() all of this goes straight to the page cache directly
>> without a VMA lookup.
>
>With a nasty unpleasant splat the moment you do modification on the
>content at all. For static objects sendfile is certainly superior,
Oh, the "unpleasant splat" happens with the mmap approach too, there's
no avoiding it. It can happen with a regular "read()" loop too (if the
read happens at the wrong time).
Both mmap and sendfile have the issue that the "splat" can happen every
time, while a read() into a private area means that the splat can only
happen the first time the web server caches the content. But the read
into a private area is also obviously the worst one from a performance
standpoint.
There are two ways to avoid the splat:
- lock the file some way before reading/writing to it.
- do all updates to a temp-file, and move the temp-file to the new location.
Those two approaches will fix the "splat" problem _regardless_ of what
IO mechanism you use. With that in mind, sendfile() is clearly the one
that performs best by far, so..
Linus
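The second way of avoiding the splat, updating through a temp-file and then moving it into place, can be sketched in user space as below. It is an illustration only, assuming POSIX rename(2) semantics within a single filesystem; replace_file() and replace_demo() are hypothetical names. Note that mkstemp() creates the new file with mode 0600, so a real server would also have to set the permissions it wants.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write data to a temp file next to path, then rename() it over path. */
int replace_file(const char *path, const void *data, size_t len)
{
    char tmp[4096];
    const char *p = data;
    int n = snprintf(tmp, sizeof(tmp), "%s.XXXXXX", path);
    int fd;

    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;
    fd = mkstemp(tmp);
    if (fd < 0)
        return -1;
    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += w;
        len -= (size_t)w;
    }
    if (fsync(fd) < 0)
        goto fail;
    if (close(fd) < 0) {
        fd = -1;
        goto fail;
    }
    fd = -1;
    /* Readers see either the old file or the new one, never a mix. */
    if (rename(tmp, path) < 0)
        goto fail;
    return 0;
fail:
    if (fd >= 0)
        close(fd);
    unlink(tmp);
    return -1;
}

/* Hypothetical self-check: an fd opened before the update keeps the old data. */
int replace_demo(void)
{
    char dir[] = "/tmp/replXXXXXX", path[64], buf[4] = { 0 };
    int rc = -1, old_fd = -1, new_fd = -1;

    if (!mkdtemp(dir))
        return -1;
    snprintf(path, sizeof(path), "%s/page", dir);
    if (replace_file(path, "old", 3) < 0)
        goto out;
    old_fd = open(path, O_RDONLY);
    if (old_fd < 0 || replace_file(path, "new", 3) < 0)
        goto out;
    if (read(old_fd, buf, 3) != 3 || memcmp(buf, "old", 3) != 0)
        goto out;
    new_fd = open(path, O_RDONLY);
    if (new_fd < 0 || read(new_fd, buf, 3) != 3 || memcmp(buf, "new", 3) != 0)
        goto out;
    rc = 0;
out:
    if (old_fd >= 0) close(old_fd);
    if (new_fd >= 0) close(new_fd);
    unlink(path);
    rmdir(dir);
    return rc;
}
```

A transfer that already holds the old descriptor, whether it uses read(), mmap or sendfile(), keeps sending the old contents to completion, while new requests open the new file.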
From: Alan Cox <[email protected]>
Date: 03 Jan 2003 00:56:59 +0000
> On Thu, 2003-01-02 at 23:16, David S. Miller wrote:
> > With sendfile() all of this goes straight to the page cache directly
> > without a VMA lookup.
> With a nasty unpleasant splat the moment you do modification on the
> content at all. For static objects sendfile is certainly superior,
Sendfile does not protect against changes to the
file contents. We don't lock the pages, we merely grab
references to them for the network I/O.
From: Thomas Ogrisegg <[email protected]>
Date: Fri, 3 Jan 2003 01:45:43 +0100
> > This isn't a priority for us. People who want the best possible
> > performance can code their apps up to take advantage of sendfile()
> > on systems that have it.
> So you want to chain people to your "proprietary solution"?
I don't hide my APIs.
> > (and really, show me how many systems
> > lack a sendfile mechanism these days).
> What kind of systems are you talking about? Operating systems?
> Nearly all.
HPUX has it, Solaris has it, Microsoft has something very similar,
FreeBSD has it, as does, I believe, NetBSD. Show me the exceptions.
> It might be a bit difficult to convert all applications to
> sendfile. Especially those for which you don't have the
> source code.
If the performance really must be top notch, someone will invest
the time for a given application. Otherwise, if it's not
important enough to port, why should it be important enough to put
a hack into the OS for it?
> I don't see your point. Applications which really need the
> performance will switch to sendfile anyway because of the
> problems with mmap you mentioned.
Right, so why bother with your patch?
> My patch is very simple and takes less than 1KB of code but
> will speed up many applications and doesn't have a real
> drawback (except when sending "normal" data which is larger
> than a page - but that shouldn't happen very often).
What about the extra checks you are placing in a fast path?
On Fri, 2003-01-03 at 01:59, Alan Cox wrote:
> You may not be doing an mmap a send, its more likely to look like
>
> page = hash(url);
> memcpy(current_time, page->clock, TIMESIZE);
> write(sock, page->data, page->len);
If your web data rarely changes, it could also be all the files stored
in a hashfile database covered by one large mmap, eliminating filesystem
overhead (and vma overhead).
--
// Gianni Tedesco (gianni at scaramanga dot co dot uk)
lynx --source http://www.scaramanga.co.uk/gianni-at-ecsc.asc | gpg --import
8646BE7D: 6D9F 2287 870E A2C9 8F60 3A3C 91B5 7669 8646 BE7D
From: Gianni Tedesco <[email protected]>
Date: 06 Jan 2003 14:36:19 +0000
> If your web data rarely changes, it could also be all the files stored
> in a hashfile database covered by one large mmap, eliminating filesystem
> overhead (and vma overhead).
You would still eat a VMA lookup on each and every send.