2006-01-20 21:53:45

by [email protected]

[permalink] [raw]
Subject: sendfile() with 100 simultaneous 100MB files

I was reading this blog post about the lighttpd web server.
http://blog.lighttpd.net/articles/2005/11/11/optimizing-lighty-for-high-concurrent-large-file-downloads
It describes problems they are having downloading 100 simultaneous 100MB files.

In this post they complain about sendfile() getting into seek storms and
ending up in 72% IO wait. As a result they built a user space
mechanism to work around the problems.

I tried looking at how the kernel implements sendfile(). I have only a
minimal understanding of how the fs code works, but it looks to me like
sendfile() works a page at a time. I was looking for code that
does something like this...

1) Compute an adaptive window size and read ahead the appropriate
number of pages. A larger window would minimize disk seeks.

2) Something along the lines of: as soon as a page is sent, age it
down into the middle of the page-age range. That would still favor
files that are repeatedly sent, but reduce thrashing from files that
are sent infrequently and shouldn't stay in the page cache.

Any other ideas why sendfile() would get into a seek storm?

--
Jon Smirl
[email protected]


2006-01-21 02:22:49

by Matti Aarnio

Subject: Re: sendfile() with 100 simultaneous 100MB files

On Fri, Jan 20, 2006 at 04:53:44PM -0500, Jon Smirl wrote:
> I was reading this blog post about the lighttpd web server.
> http://blog.lighttpd.net/articles/2005/11/11/optimizing-lighty-for-high-concurrent-large-file-downloads
> It describes problems they are having downloading 100 simultaneous 100MB files.

"more than 100 files of each more than 100 MB"

> In this post they complain about sendfile() getting into seek storms and
> ending up in 72% IO wait. As a result they built a user space
> mechanism to work around the problems.
>
> I tried looking at how the kernel implements sendfile(), I have
> minimal understanding of how the fs code works but it looks to me like
> sendfile() is working a page at a time. I was looking for code that
> does something like this...
>
> 1) Compute an adaptive window size and read ahead the appropriate
> number of pages. A larger window would minimize disk seeks.

Or maybe not.. larger main memory would help more. But there is
another issue...

> 2) Something along the lines of as soon as a page is sent age the page
> down in to the middle of page ages. That would allow for files that
> are repeatedly sent, but also reduce thrashing from files that are not
> sent frequently and shouldn't stay in the page cache.
>
> Any other ideas why sendfile() would get into a seek storm?


Deep inside do_generic_mapping_read() there is a loop that
reads the source file with read-ahead processing, handles it
one page at a time, calls the actor (which sends the page), and
releases the page-cache reference for that page -- with convoluted
handling when the page isn't in the page cache, etc..


		/*
		 * Ok, we have the page, and it's up-to-date, so
		 * now we can copy it to user space...
		 *
		 * The actor routine returns how many bytes were actually used..
		 * NOTE! This may not be the same as how much of a user buffer
		 * we filled up (we may be padding etc), so we can only update
		 * "pos" here (the actor routine has to update the user buffer
		 * pointers and the remaining count).
		 */
		ret = actor(desc, page, offset, nr);
		offset += ret;
		index += offset >> PAGE_CACHE_SHIFT;
		offset &= ~PAGE_CACHE_MASK;

		page_cache_release(page);
		if (ret == nr && desc->count)
			continue;


That is, if machine memory is so limited (file pages + network
tcp buffers!) that source-file pages get constantly purged out,
there is not much that one can do.

The described workaround is essentially to read the file into server
process memory through a half-MB sliding window, and then writev()
from there to the socket. Most importantly, it does the reading in
_large_ chunks.

The read-ahead in sendfile is done by page_cache_readahead(), and
via fairly complicated circumstances it ends up using

	bdi = mapping->backing_dev_info;

	switch (advice) {
	case POSIX_FADV_NORMAL:
		file->f_ra.ra_pages = bdi->ra_pages;
		break;
	case POSIX_FADV_RANDOM:
		file->f_ra.ra_pages = 0;
		break;
	case POSIX_FADV_SEQUENTIAL:
		file->f_ra.ra_pages = bdi->ra_pages * 2;
		break;
	....


The default value of ra_pages is the equivalent of 128 kB, which
should be enough...

Why does it go into seek thrashing? Because the read-ahead buffer
memory is consumed in very small fragments, and the sendpage-to-socket
writing logic pauses frequently; during those pauses the read-ahead
buffers get recycled...

In the writev() solution, pauses on the socket-sending side do not
show up so heavily on the source-file reading side, as things get
buffered in the non-discardable memory of the userspace process.

> --
> Jon Smirl
> [email protected]

/Matti Aarnio

2006-01-21 03:43:46

by [email protected]

Subject: Re: sendfile() with 100 simultaneous 100MB files

On 1/20/06, Matti Aarnio <[email protected]> wrote:
> On Fri, Jan 20, 2006 at 04:53:44PM -0500, Jon Smirl wrote:
> > I was reading this blog post about the lighttpd web server.
> > http://blog.lighttpd.net/articles/2005/11/11/optimizing-lighty-for-high-concurrent-large-file-downloads
> > It describes problems they are having downloading 100 simultaneous 100MB files.
>
> "more than 100 files of each more than 100 MB"
>
> > In this post they complain about sendfile() getting into seek storms and
> > ending up in 72% IO wait. As a result they built a user space
> > mechanism to work around the problems.
> >
> > I tried looking at how the kernel implements sendfile(), I have
> > minimal understanding of how the fs code works but it looks to me like
> > sendfile() is working a page at a time. I was looking for code that
> > does something like this...
> >
> > 1) Compute an adaptive window size and read ahead the appropriate
> > number of pages. A larger window would minimize disk seeks.
>
> Or maybe not.. larger main memory would help more. But there is
> another issue...
>
> > 2) Something along the lines of as soon as a page is sent age the page
> > down in to the middle of page ages. That would allow for files that
> > are repeatedly sent, but also reduce thrashing from files that are not
> > sent frequently and shouldn't stay in the page cache.
> >
> > Any other ideas why sendfile() would get into a seek storm?
>
>

Thanks for pointing me in the right direction in the source.
Is there a write-up anywhere on how sendfile() works?


> Deep inside the do_generic_mapping_read() there is a loop that
> reads the source file with read-ahead processing, processes it
> one page at the time, calls actor (which sends the file) and
> releases the page cache of that page. -- with convoluted things
> done when page isn't in page cache, etc..
>
>
> /*
> * Ok, we have the page, and it's up-to-date, so
> * now we can copy it to user space...
> *
> * The actor routine returns how many bytes were actually used..
> * NOTE! This may not be the same as how much of a user buffer
> * we filled up (we may be padding etc), so we can only update
> * "pos" here (the actor routine has to update the user buffer
> * pointers and the remaining count).
> */
> ret = actor(desc, page, offset, nr);
> offset += ret;
> index += offset >> PAGE_CACHE_SHIFT;
> offset &= ~PAGE_CACHE_MASK;
>
> page_cache_release(page);
> if (ret == nr && desc->count)
> continue;
>
>
> That is, if machine memory is so limited (file pages + network
> tcp buffers!) that source file pages gets constantly purged out,
> there is not much that one can do.
>
> That described workaround is essentially to read the file to server
> process memory with half an MB sliding window, and then writev()
> from there to socket. Most importantly it does the reading in _large_
> chunks.

100 users at 500K each is 50MB of read-ahead; that's not a huge amount of memory.

>
> The read-ahead in sendfile is done by page_cache_readahead(), and
> via fairly complicated circumstances it ends up using
>
> bdi = mapping->backing_dev_info;
>
> switch (advice) {
> case POSIX_FADV_NORMAL:
> file->f_ra.ra_pages = bdi->ra_pages;
> break;
> case POSIX_FADV_RANDOM:
> file->f_ra.ra_pages = 0;
> break;
> case POSIX_FADV_SEQUENTIAL:
> file->f_ra.ra_pages = bdi->ra_pages * 2;
> break;
> ....
>
>
> Default value for ra_pages is equivalent of 128 kB, which
> should be enough...

Does using sendfile() set MADV_SEQUENTIAL and MADV_DONTNEED implicitly?
If not would setting these help?

> Why it goes to seek trashing ? Because read-ahead buffer memory
> space is being processed in very small fragments, and the sendpage
> to socket writing logic pauses frequently, during which read-ahead
> buffers become recycled...

I was following you until this part. I thought sendfile() worked
using mmap'd files and that readahead was done into the global page
cache.

But this makes me think that readahead is instead going into another
pool. How large is this pool? The user-space scheme uses 50MB of
readahead cache; will the kernel do that much readahead if needed?

> In writev() solution the pausing in socket sending side does
> not appear so heavily in source file reading side, as things
> get buffered in non-discardable memory space of userspace process.

Does this scenario illustrate a problem with the current sendfile()
implementation? I thought the goal of sendfile() was to always be the
best way to send complete files. This is a case where user space is
clearly beating sendfile().

--
Jon Smirl
[email protected]

2006-01-21 03:52:36

by Phillip Susi

Subject: Re: sendfile() with 100 simultaneous 100MB files

I took a look at that article and, well, it looks a bit off to me. I
looked at the code it referred to, and it mmap's the file and optionally
copies from the map to a private buffer before writing to the socket.

The double buffering that is enabled by LOCAL_BUFFERING is a complete
and total waste of both cpu and ram. There is no reason to allocate
more ram and spend more cpu cycles making a second copy of the data
before passing it to the network layer. The mmap and madvise, though,
are a good idea, and I imagine they cause the kernel to perform
large-block readahead.

If you really want to be able to simultaneously push hundreds of
streams efficiently, though, you want to use zero-copy aio, which can
bring tremendous benefits in throughput and cpu usage. Unfortunately, I
believe the current kernel does not support O_DIRECT on sockets.

I last looked at the kernel implementation of sendfile about 6 years
ago, but I remember it not looking very good. I believe it WAS only
transferring a single page at a time, and it was still making a copy
from fs cache to socket buffers, so it wasn't really doing zero-copy
IO (though it was one less copy than doing a read and a write).

About that time I was writing an ftp server on the NT kernel and
discovered zero-copy async IO. I ended up using a small thread pool and
an IO completion port to service the async IO requests. The files were
mmapped in 64 KB chunks, three at a time, and queued asynchronously to
the socket, which was set to use no kernel buffering. This allowed a
PII-233 machine to push 11,820 KB/s (that's real KB, not salesman's)
over a single session on a 100Base-T network, and to saturate dual
network interfaces with multiple connections, all using less than 1% of
the cpu, because the NICs were able to perform scatter/gather DMA
directly on the filesystem cache pages.

I'm hopeful that the Linux kernel will be able to do this soon as well,
once the network stack supports O_DIRECT on sockets.

Jon Smirl wrote:
> I was reading this blog post about the lighttpd web server.
> http://blog.lighttpd.net/articles/2005/11/11/optimizing-lighty-for-high-concurrent-large-file-downloads
> It describes problems they are having downloading 100 simultaneous 100MB files.
>
> In this post they complain about sendfile() getting into seek storms and
> ending up in 72% IO wait. As a result they built a user space
> mechanism to work around the problems.
>
> I tried looking at how the kernel implements sendfile(), I have
> minimal understanding of how the fs code works but it looks to me like
> sendfile() is working a page at a time. I was looking for code that
> does something like this...
>
> 1) Compute an adaptive window size and read ahead the appropriate
> number of pages. A larger window would minimize disk seeks.
>
> 2) Something along the lines of as soon as a page is sent age the page
> down in to the middle of page ages. That would allow for files that
> are repeatedly sent, but also reduce thrashing from files that are not
> sent frequently and shouldn't stay in the page cache.
>
> Any other ideas why sendfile() would get into a seek storm?
>
> --
> Jon Smirl
> [email protected]

2006-01-22 03:50:52

by Benjamin LaHaise

Subject: Re: sendfile() with 100 simultaneous 100MB files

On Fri, Jan 20, 2006 at 10:43:44PM -0500, Jon Smirl wrote:
> 100 users at 500K each is 50MB of read ahead, that's not a huge amount of
> memory

The system might be overrunning the number of requests the disk elevator
has, which would result in the sort of disk seek storm you're seeing.
Also, what filesystem is being used? XFS would likely do substantially
better than ext3 because of its use of extents vs indirect blocks.

> Does using sendfile() set MADV_SEQUENTIAL and MADV_DONTNEED implicitly?
> If not would setting these help?

No. Readahead should be doing the right thing. Rik van Riel did some
work on drop-behind for exactly this sort of case.

> I was following with you until this part. I thought sendfile() worked
> using mmap'd files and that readahead was done into the global page
> cache.

sendfile() uses the page cache directly, so it's like an mmap(), but it
does not carry the overhead associated with tlb manipulation.

> But this makes me think that read ahead is instead going into another
> pool. How large is this pool? The user space scheme is using 50MB of
> readahead cache, will the kernel do that much readahead if needed?

The kernel performs readahead using the system memory pool, which means
the VM gets involved and performs page reclaim to free up previously
cached pages.

> Does this scenario illustrate a problem with the current sendfile()
> implementation? I thought the goal of sendfile() was to always be the
> best way to send complete files. This is a case where user space is
> clearly beating sendfile().

Yes, this would be called a bug. =-)

-ben
--
"You know, I've seen some crystals do some pretty trippy shit, man."
Don't Email: <[email protected]>.

2006-01-22 14:24:07

by Jim Nance

Subject: Re: sendfile() with 100 simultaneous 100MB files

On Fri, Jan 20, 2006 at 04:53:44PM -0500, Jon Smirl wrote:

> Any other ideas why sendfile() would get into a seek storm?

I can't really comment on the quality of the Linux sendfile()
implementation; I've never looked at the code. However, a couple of
general observations.

The seek storm happens because Linux is trying to be "fair," where fair
means no one process gets to starve another for I/O bandwidth.

The fastest way to transfer 100 100MB files would be to send them one
at a time. The 99th person in line would of course perceive this as a
very poor implementation. The current sendfile implementation seems to
live at the other end of the extreme.

It is possible to come up with a compromise by limiting the number of
concurrent sendfiles running and the maximum size each is allowed to
send in one squirt.

Thanks,

Jim

--
[email protected]
SDF Public Access UNIX System - http://sdf.lonestar.org

2006-01-22 17:31:12

by [email protected]

Subject: Re: sendfile() with 100 simultaneous 100MB files

On 1/22/06, Jim Nance <[email protected]> wrote:
> On Fri, Jan 20, 2006 at 04:53:44PM -0500, Jon Smirl wrote:
>
> > Any other ideas why sendfile() would get into a seek storm?
>
> I can't really comment on the quality of the linux sendfile() implementation,
> I've never looked at the code. However, a couple of general observations.
>
> The seek storm happens because linux is trying to be "fair," where fair
> means no one process get to starve another for I/O bandwidth.

I think there is something more going on. The user-space processes
submitted requests for the same IO in 500K chunks and didn't get into
a seek storm. If it were a disk-fairness problem, the user-space
implementation would have gotten into trouble too.

There seems to be some difference between the way sendfile() submits
requests to the disk system and the way the 500K requests from user
space are handled. I believe both tests were using the same disk
scheduler algorithm, so the data points to differences in how the
requests are submitted to the disk system. The sendfile() submission
pattern triggers a storm and the user-space one doesn't.

I've asked the lighttpd people for more data but I haven't gotten
anything back yet. Things like RAM, network speed, disk scheduler
algorithm, etc.

>
> The fastest way to transfer 100 100M files would be to send them one at a
> time. The 99th person in line of course would percieve this as a very poor
> implementation. The current sendfile implementation seems to live at the
> other end of the extream.

One at a time may not be the fastest. When the network transmission
window is full, you will stop transmitting on that socket, but you can
probably still transmit on the others. Packet loss is another reason
for sockets to block.

>
> It is possible to come up with a compromise behavior by limiting the
> number of concurrent sendfiles running, and the maximum size they are
> allowed to send in one squirt.
>
> Thanks,
>
> Jim
>
> --
> [email protected]
> SDF Public Access UNIX System - http://sdf.lonestar.org
>


--
Jon Smirl
[email protected]

2006-01-23 15:22:20

by [email protected]

Subject: Re: sendfile() with 100 simultaneous 100MB files

On 1/22/06, Jon Smirl <[email protected]> wrote:
> I've asked the lighttpd people for more data but I haven't gotten
> anything back yet. Things like RAM, network speed, disk scheduler
> algorithm, etc.

The developer is using this hardware:
82541GI/PI Gigabit ethernet
1.3Ghz Duron
7200RPM IDE disk
768MB RAM

Kernel:
2.6.13-1.1526_FC4
CFQ disk scheduler

The customer is seeing the same problem on high-end hardware.

--
Jon Smirl
[email protected]

2006-01-23 16:50:56

by Jerome Lacoste

Subject: Re: sendfile() with 100 simultaneous 100MB files

On 1/22/06, Jim Nance <[email protected]> wrote:
> On Fri, Jan 20, 2006 at 04:53:44PM -0500, Jon Smirl wrote:
[...]
> The fastest way to transfer 100 100M files would be to send them one at a
> time.

... assuming the bottleneck is not each end user's network bandwidth,
an assumption which, in the case of a big file server with many
clients over the Internet, almost never holds.

J

2006-01-24 16:30:42

by [email protected]

Subject: Re: sendfile() with 100 simultaneous 100MB files

I've filed a kernel bug summarizing the issue:
http://bugzilla.kernel.org/show_bug.cgi?id=5949

The lighttpd author is willing to provide more info if anyone is interested.

--
Jon Smirl
[email protected]