Hi everyone,
I'm trying to optimize a box for samba file serving (just contiguous block
I/O for the moment), and I've now got both CPUs maxed out with system
load.
(For background info, the system is a 2x933 Intel, 1gb system memory,
133mhz FSB, 1gbit 64bit/66mhz FC card, 2x 1gbit 64/66 etherexpress boards
in etherchannel bond, running linux-2.4.1+smptimers+zero-copy+lowlatency)
CPU states typically look something like this:
CPU states: 3.6% user, 94.5% system, 0.0% nice, 1.9% idle
.. with the 3 smbd processes each drawing around 50-75% (according to
top).
When reading the profiler results, the largest consumers in the kernel (calls?)
are file_read_actor and csum_partial_copy_generic, by a long shot (about
70% and 20% respectively).
Presumably, the csum_partial_copy_generic should be eliminated (or at
least reduced) by David Miller's zerocopy patch, right? Or am I
misunderstanding this completely? :)
Regards,
- Gord R. Lamb ([email protected])
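Some context on those two symbols: file_read_actor is where read() copies
page-cache data out to a user buffer, and csum_partial_copy_generic is the
combined copy-and-checksum into socket buffers (here, on the send side), so
the profile above is what the classic read()+send() serving loop looks like.
A rough illustration of that pattern, hypothetical code rather than smbd's
actual read path:

    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Hypothetical illustration, not smbd code: the two copies a kernel
     * profiler sees as file_read_actor (the read) and
     * csum_partial_copy_generic (the send). */
    static int serve_file(int sock_fd, int file_fd)
    {
            char buf[65536];
            ssize_t n;

            while ((n = read(file_fd, buf, sizeof(buf))) > 0) {
                    /* ignoring partial sends for brevity */
                    if (send(sock_fd, buf, (size_t)n, 0) != n)
                            return -1;
            }
            return n < 0 ? -1 : 0;
    }

Every byte served gets copied twice and checksummed once by the CPU, which is
presumably where most of that 94.5% system time goes.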
> When reading the profiler results, the largest consumers in the kernel (calls?)
> are file_read_actor and csum_partial_copy_generic, by a long shot (about
> 70% and 20% respectively).
>
> Presumably, the csum_partial_copy_generic should be eliminated (or at
> least reduced) by David Miller's zerocopy patch, right? Or am I
> misunderstanding this completely? :)
To an extent, provided you are using the samba sendfile() patches. SMB can't
make great use of zero-copy file I/O because it's not really designed so much
as mutated over time, and it isn't oriented for speed.
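The win from sendfile() specifically is that the file data never has to visit
user space at all: the kernel can hand page-cache pages straight to the
network stack. A minimal sketch of that call pattern, using a hypothetical
serve_sendfile() helper rather than the actual Samba patch (one shows up
later in this thread):

    #include <sys/types.h>
    #include <sys/sendfile.h>

    /* Hypothetical helper: push 'count' bytes of file_fd, starting at
     * 'offset', straight from the page cache to the socket.  No user-space
     * buffer, so no file_read_actor copy; if the NIC also checksums in
     * hardware, the CPU never touches the payload. */
    static int serve_sendfile(int sock_fd, int file_fd, off_t offset, size_t count)
    {
            while (count > 0) {
                    ssize_t sent = sendfile(sock_fd, file_fd, &offset, count);
                    if (sent == -1)
                            return -1;
                    if (sent == 0)
                            break;  /* EOF: file was shorter than expected */
                    count -= (size_t)sent;
            }
            return 0;
    }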
"Gord R. Lamb" wrote:
> Hi everyone,
>
> I'm trying to optimize a box for samba file serving (just contiguous block
> I/O for the moment), and I've now got both CPUs maxed out with system
> load.
>
> (For background info, the system is a 2x933 Intel, 1gb system memory,
> 133mhz FSB, 1gbit 64bit/66mhz FC card, 2x 1gbit 64/66 etherexpress boards
> in etherchannel bond, running linux-2.4.1+smptimers+zero-copy+lowlatency)
>
> CPU states typically look something like this:
>
> CPU states: 3.6% user, 94.5% system, 0.0% nice, 1.9% idle
>
> .. with the 3 smbd processes each drawing around 50-75% (according to
> top).
>
> When reading the profiler results, the largest consumers in the kernel (calls?)
> are file_read_actor and csum_partial_copy_generic, by a long shot (about
> 70% and 20% respectively).
>
> Presumably, the csum_partial_copy_generic should be eliminated (or at
> least reduced) by David Miller's zerocopy patch, right? Or am I
> misunderstanding this completely? :)
I only know enough to be dangerous here, but I believe you will need to
be using one of the network cards whose driver actually uses the
zero-copy patches, and/or which can do the TCP checksum in hardware on
the card itself.
On Wed, 14 Feb 2001, Jeremy Jackson wrote:
> "Gord R. Lamb" wrote:
>
> > Hi everyone,
> >
> > I'm trying to optimize a box for samba file serving (just contiguous block
> > I/O for the moment), and I've now got both CPUs maxed out with system
> > load.
> >
> > (For background info, the system is a 2x933 Intel, 1gb system memory,
> > 133mhz FSB, 1gbit 64bit/66mhz FC card, 2x 1gbit 64/66 etherexpress boards
> > in etherchannel bond, running linux-2.4.1+smptimers+zero-copy+lowlatency)
> >
> > CPU states typically look something like this:
> >
> > CPU states: 3.6% user, 94.5% system, 0.0% nice, 1.9% idle
> >
> > .. with the 3 smbd processes each drawing around 50-75% (according to
> > top).
> >
> > When reading the profiler results, the largest consumers in the kernel (calls?)
> > are file_read_actor and csum_partial_copy_generic, by a long shot (about
> > 70% and 20% respectively).
> >
> > Presumably, the csum_partial_copy_generic should be eliminated (or at
> > least reduced) by David Miller's zerocopy patch, right? Or am I
> > misunderstanding this completely? :)
>
> I only know enough to be dangerous here, but I believe you will need to
> be using one of the network cards whose driver actually uses the
> zero-copy patches, and/or which can do the TCP checksum in hardware on
> the card itself.
Hmm. Yeah, I think that may be one of the problems (Intel's card isn't
supported afaik; if I have to I'll switch to 3com, or hopelessly try to
implement support). I'm looking for a patch to implement sendfile in
Samba, as Alan suggested. That seems like a good first step.
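For what it's worth, "implementing support" in a 2.4 driver mostly comes down
to advertising what the hardware can do and flagging packets the card has
already checksummed. A from-memory sketch of the relevant bits of the 2.4
interface (the foo_* function names are made up, and this is nowhere near a
working driver):

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* In the driver's probe/init routine: advertise scatter-gather and
     * hardware TX checksumming, which is what lets the zerocopy sendfile
     * path hand page-cache pages straight to the card. */
    static void foo_init_features(struct net_device *dev)
    {
            dev->features |= NETIF_F_SG | NETIF_F_IP_CSUM;
    }

    /* In the RX path: if the card verified the checksum, say so, and the
     * stack skips its own checksum pass over the data. */
    static void foo_mark_rx_checksum(struct sk_buff *skb)
    {
            skb->ip_summed = CHECKSUM_UNNECESSARY;
    }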
Quoting "Gord R. Lamb" <[email protected]>:
> On Wed, 14 Feb 2001, Jeremy Jackson wrote:
>
> > "Gord R. Lamb" wrote:
> > > in etherchannel bond, running
> linux-2.4.1+smptimers+zero-copy+lowlatency)
Not related to network, but why would you have lowlatency patches on this box?
My testing showed that the lowlatency patches absolutely destroy a system's
throughput under heavy disk IO. Sure, the box stays nice and responsive, but
something has to give. On a file server I'll trade console responsiveness for
IO performance any day (might choose the opposite on my laptop).
My testing wasn't very complete, but heavy dbench and multiple simultaneous file
copies both showed significantly lower performance with lowlatency enabled, and
returned to normal when disabled.
Of course you may have had lowlatency disabled via sysctl, but I was mainly
curious whether your results were different.
Later,
Tom
On Wed, 14 Feb 2001, Tom Sightler wrote:
> Quoting "Gord R. Lamb" <[email protected]>:
>
> > On Wed, 14 Feb 2001, Jeremy Jackson wrote:
> >
> > > "Gord R. Lamb" wrote:
> > > > in etherchannel bond, running
> > linux-2.4.1+smptimers+zero-copy+lowlatency)
>
> Not related to network, but why would you have lowlatency patches on
> this box?
Well, I figured it might reduce deadweight time between the different
operations (disk reads, cache operations, network I/O) at the expense of a
little throughput. It was just a hunch and I don't fully understand the
internals (of any of this, really). Since I wasn't saturating the disk or
network controller, I thought the gain from quicker response time (for
packet acknowledgement, etc.) would outweigh the loss of individual
throughputs. Again, I could be misunderstanding this completely. :)
> My testing showed that the lowlatency patches absolutely destroy a
> system's throughput under heavy disk IO. Sure, the box stays nice and
> responsive, but something has to give. On a file server I'll trade
> console responsiveness for IO performance any day (might choose the
> opposite on my laptop).
Well, I backed out that particular patch, and it didn't seem to make much
of a difference either way. I'll look at it in more detail tomorrow
though.
Cya.
> My testing wasn't very complete, but heavy dbench and multiple
> simultaneous file copies both showed significantly lower performance
> with lowlatency enabled, and returned to normal when disabled.
>
> Of course you may have had lowlatency disabled via sysctl, but I was
> mainly curious whether your results were different.
>
> Later,
> Tom
>
Tom Sightler wrote:
>
> My testing showed that the lowlatency patches absolutely destroy a system's
> throughput under heavy disk IO.
I'm surprised - I've been keeping an eye on that.
Here are the results of a bunch of back-to-back `dbench 12' runs
on UP, alternating with and without LL:
With:
58.725 total
52.217 total
51.935 total
53.624 total
39.815 total
Without:
1:16.85 total
52.525 total
57.602 total
41.623 total
58.848 total
Results on reiserfs are similar.
-
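Averaging those runs: (58.725 + 52.217 + 51.935 + 53.624 + 39.815) / 5 is
about 51.3 seconds with LL, and (76.85 + 52.525 + 57.602 + 41.623 + 58.848) / 5
is about 57.5 seconds without, so at this load low-latency is, if anything,
slightly ahead, although the run-to-run spread is large.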
> > My testing showed that the lowlatency patches absolutely destroy a system's
> > throughput under heavy disk IO.
>
> I'm surprised - I've been keeping an eye on that.
>
> Here's the result of a bunch of back-to-back `dbench 12' runs
> on UP, alternating with and without LL:
It's interesting that your results seem to show an improvement in
performance, while mine show a consistent drop. My tests were not very
scientific, and I was running much higher dbench process counts, 'dbench 64'
or 'dbench 128', and at those levels performance with lowlatency enabled fell
through the floor on my setup.
My system is a PIII 700Mhz, Adaptec 7892 Ultra-160, software RAID1,
reiserfs, 256MB RAM.
Under lower loads, like 'dbench 12', lowlatency showed only a few percent
loss, but once I got up to around 50 processes things really went downhill.
I might try to do a more complete test; maybe there's something else in my
config that makes this a problem, but it was definitely quite noticeable.
Later,
Tom
> Hmm. Yeah, I think that may be one of the problems (Intel's card isn't
> supported afaik; if I have to I'll switch to 3com, or hopelessly try to
> implement support). I'm looking for a patch to implement sendfile in
> Samba, as Alan suggested. That seems like a good first step.
As Alan said, the smb protocol is pretty ugly. This patch makes samba use
sendfile for unchained read_and_X replies. I could hook this into some of
the other *read* replies but this is the one smbtorture uses so it served my
purpose. Of course this is against the current CVS head, not some wimpy
stable branch. :)
I still need to write some code to make this safe (things will break badly
if multiple clients hit the same file and one of them truncates at just the
right time).
In my tests this only improved things by a couple of percent, because we do
so many things other than just serving files in real life (well, if you can
call netbench land real life).
Anton
diff -u -u -r1.195 includes.h
--- source/include/includes.h 2000/12/06 00:05:14 1.195
+++ source/include/includes.h 2001/01/26 05:38:51
@@ -871,7 +871,8 @@
/* default socket options. Dave Miller thinks we should default to TCP_NODELAY
given the socket IO pattern that Samba uses */
-#ifdef TCP_NODELAY
+
+#if 0
#define DEFAULT_SOCKET_OPTIONS "TCP_NODELAY"
#else
#define DEFAULT_SOCKET_OPTIONS ""
diff -u -u -r1.257 reply.c
--- source/smbd/reply.c 2001/01/24 19:34:53 1.257
+++ source/smbd/reply.c 2001/01/26 05:38:53
@@ -2383,6 +2391,51 @@
END_PROFILE(SMBreadX);
return(ERROR(ERRDOS,ERRlock));
}
+
+#if 1
+ /* We can use sendfile if it is not chained */
+ if (CVAL(inbuf,smb_vwv0) == 0xFF) {
+ off_t tmpoffset;
+ struct stat buf;
+ int flags = 0;
+
+ nread = smb_maxcnt;
+
+ fstat(fsp->fd, &buf);
+ if (startpos > buf.st_size)
+ return(UNIXERROR(ERRDOS,ERRnoaccess));
+ if (nread > (buf.st_size - startpos))
+ nread = (buf.st_size - startpos);
+
+ SSVAL(outbuf,smb_vwv5,nread);
+ SSVAL(outbuf,smb_vwv6,smb_offset(data,outbuf));
+ SSVAL(smb_buf(outbuf),-2,nread);
+ CVAL(outbuf,smb_vwv0) = 0xFF;
+ set_message(outbuf,12,nread,False);
+
+#define MSG_MORE 0x8000
+ if (nread > 0)
+ flags = MSG_MORE;
+ if (send(smbd_server_fd(), outbuf, data - outbuf, flags) == -1)
+ DEBUG(0,("reply_read_and_X: send ERROR!\n"));
+
+ tmpoffset = startpos;
+ while(nread) {
+ int nwritten;
+ nwritten = sendfile(smbd_server_fd(), fsp->fd, &tmpoffset, nread);
+ if (nwritten == -1)
+ DEBUG(0,("reply_read_and_X: sendfile ERROR!\n"));
+
+ if (!nwritten)
+ break;
+
+ nread -= nwritten;
+ }
+
+ return -1;
+ }
+#endif
+
nread = read_file(fsp,data,startpos,smb_maxcnt);
if (nread < 0) {
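One way the truncate race above could be closed up at the byte-stream level,
sketched here rather than taken from the patch: the header has already
promised nread bytes, so if sendfile() comes up short because another client
truncated the file, the remainder has to be made up somehow or the client
loses sync. A hypothetical pad_short_sendfile() helper that pads with zeros:

    #include <sys/types.h>
    #include <sys/socket.h>

    /* Hypothetical helper, not part of the patch: pad the reply out to the
     * length already promised in the SMB header when sendfile() stopped
     * early (e.g. another client truncated the file underneath us). */
    static void pad_short_sendfile(int sock_fd, size_t remaining)
    {
            static const char zeros[4096];

            while (remaining > 0) {
                    size_t chunk = remaining < sizeof(zeros) ? remaining : sizeof(zeros);
                    ssize_t sent = send(sock_fd, zeros, chunk, 0);
                    if (sent <= 0)
                            break;  /* the connection is in trouble anyway */
                    remaining -= (size_t)sent;
            }
    }

Whether clients are happy with zero padding is a separate question; it only
keeps the stream in step with the advertised length. Note also that in the
sendfile loop in the patch, a -1 return falls through to "nread -= nwritten",
so the error case probably wants a break as well.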
On Sat, 17 Feb 2001, Anton Blanchard wrote:
>
> > Hmm. Yeah, I think that may be one of the problems (Intel's card isn't
> > supported afaik; if I have to I'll switch to 3com, or hopelessly try to
> > implement support). I'm looking for a patch to implement sendfile in
> > Samba, as Alan suggested. That seems like a good first step.
>
> As Alan said, the smb protocol is pretty ugly. This patch makes samba use
> sendfile for unchained read_and_X replies. I could hook this into some of
> the other *read* replies but this is the one smbtorture uses so it served my
> purpose. Of course this is against the current CVS head, not some wimpy
> stable branch. :)
>
> I still need to write some code to make this safe (things will break badly
> if multiple clients hit the same file and one of them truncates at just the
> right time).
>
> In my tests this only improved things by a couple of percent, because we do
> so many things other than just serving files in real life (well, if you can
> call netbench land real life).
Well, it made a big difference for me. I grabbed an extra 10-20mb/sec!
Now I just need to work on coalescing some of the ethernet interrupts.
Thanks, Anton!
> Anton
>
> [patch snipped]