Hello all,
I've been testing the NAS performance of ext3/Openfiler 2.2 against
NTFS/WinXP and have found that NTFS significantly outperforms ext3 for
video workloads. The Windows CIFS client will attempt a poor-man's
pre-allocation of the file on the server by sending 1-byte writes at
128K-byte strides, breaking block allocation on ext3 and leading to
fragmentation and poor performance. This will happen for many
applications (including iTunes) as the CIFS client issues these
pre-allocates under the application layer.
I've posted a brief paper on Intel's OSS website
(http://softwarecommunity.intel.com/articles/eng/1259.htm). Please give
it a read and let me know what you think. In particular, I'd like to
arrive at the right place to fix this problem: is it in the filesystem,
VFS, or Samba?
thanks,
Mason
(please CC responses to mason dot b dot cabot at intel dot com)
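For readers who want to see the write pattern described above, here is a minimal sketch. It assumes the probe byte lands on the last offset of each 128K stride; the message does not say exactly which offset the client uses. The file name, the helper names and the 8-stride size are made up for illustration.

```python
import os
import tempfile

STRIDE = 128 * 1024  # 128K-byte stride, as described in the message


def strided_prealloc(fd, total_size, stride=STRIDE):
    """Write one zero byte at the last offset of each stride (assumed layout)."""
    offset = stride - 1
    count = 0
    while offset < total_size:
        os.pwrite(fd, b"\0", offset)
        offset += stride
        count += 1
    return count


def demo(total_size=8 * STRIDE):
    """Return (number of writes, apparent size, bytes actually allocated)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "video.bin")  # hypothetical file name
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            writes = strided_prealloc(fd, total_size)
            st = os.fstat(fd)
            # st_blocks is counted in 512-byte units on Linux.
            return writes, st.st_size, st.st_blocks * 512
        finally:
            os.close(fd)


if __name__ == "__main__":
    writes, size, allocated = demo()
    print(f"{writes} one-byte writes, size={size}, allocated={allocated}")
```

On a filesystem that supports sparse files, the allocated figure will typically be far smaller than the apparent size, which is the sparse layout the rest of the thread is about.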
On Tue, 1 May 2007 13:43:18 -0700
"Cabot, Mason B" <[email protected]> wrote:
> Hello all,
>
> I've been testing the NAS performance of ext3/Openfiler 2.2 against
> NTFS/WinXP and have found that NTFS significantly outperforms ext3 for
> video workloads. The Windows CIFS client will attempt a poor-man's
> pre-allocation of the file on the server by sending 1-byte writes at
> 128K-byte strides, breaking block allocation on ext3 and leading to
> fragmentation and poor performance. This will happen for many
> applications (including iTunes) as the CIFS client issues these
> pre-allocates under the application layer.
Oh my gawd, what a stupid hack. Now we know what the MS interoperability
lab has been working on.
> I've posted a brief paper on Intel's OSS website
> (http://softwarecommunity.intel.com/articles/eng/1259.htm). Please give
> it a read and let me know what you think. In particular, I'd like to
> arrive at the right place to fix this problem: is it in the filesystem,
> VFS, or Samba?
Conceivably we could address this in the filesystem without mucking other
things up. But I'd have thought the simplest damage-control would be to
detect this pattern in samba and to then use glibc's fallocate().
At present glibc will emulate fallocate() by writing zeroes. There are
patches floating about to implement fallocate in-kernel and if/when that
turns up and is supported in glibc, the modified samba will automatically
start to use it.
Are you sure there isn't some registry setting to prevent the CIFS client
from doing the client-side preallocation?
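A rough sketch of the "detect the pattern, then preallocate" idea follows. It is not Samba code: the StrideDetector class, the server_write() helper and the threshold of three probes are all hypothetical. Python's os.posix_fallocate() stands in for the C library call, and the sketch assumes that silently falling back to a plain write is acceptable when preallocation fails.

```python
import os
import tempfile

STRIDE = 128 * 1024
TRIGGER = 3  # hypothetical threshold: strided 1-byte writes seen before acting


class StrideDetector:
    """Tracks writes on one file and spots 1-byte writes at a fixed stride."""

    def __init__(self, stride=STRIDE, trigger=TRIGGER):
        self.stride = stride
        self.trigger = trigger
        self.last_offset = None
        self.run = 0

    def observe(self, offset, length):
        """Return True once `trigger` consecutive strided 1-byte writes are seen."""
        if length != 1:
            self.run = 0
            self.last_offset = None
            return False
        if self.last_offset is None or offset - self.last_offset == self.stride:
            self.run += 1
        else:
            self.run = 1  # a 1-byte write, but not at the expected stride
        self.last_offset = offset
        return self.run >= self.trigger


def server_write(fd, detector, data, offset):
    """Hypothetical server-side write path: preallocate when the pattern shows."""
    triggered = detector.observe(offset, len(data))
    if triggered:
        try:
            # Reserve everything up to the end of this write in one call.
            os.posix_fallocate(fd, 0, offset + len(data))
        except OSError:
            pass  # assumed policy: fall back to the plain write below
    os.pwrite(fd, data, offset)
    return triggered


def demo():
    with tempfile.TemporaryDirectory() as d:
        fd = os.open(os.path.join(d, "video.bin"), os.O_RDWR | os.O_CREAT, 0o600)
        try:
            detector = StrideDetector()
            fired = [server_write(fd, detector, b"\0", k * STRIDE - 1)
                     for k in range(1, 5)]
            return fired, os.fstat(fd).st_size
        finally:
            os.close(fd)


if __name__ == "__main__":
    print(demo())
```

The detector keeps per-file state, so a real server would need one instance per open file.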
On Tue, 1 May 2007, Cabot, Mason B wrote:
> Hello all,
>
> I've been testing the NAS performance of ext3/Openfiler 2.2 against
> NTFS/WinXP and have found that NTFS significantly outperforms ext3 for
> video workloads. The Windows CIFS client will attempt a poor-man's
> pre-allocation of the file on the server by sending 1-byte writes at
> 128K-byte strides, breaking block allocation on ext3 and leading to
> fragmentation and poor performance. This will happen for many
> applications (including iTunes) as the CIFS client issues these
> pre-allocates under the application layer.
>
> I've posted a brief paper on Intel's OSS website
> (http://softwarecommunity.intel.com/articles/eng/1259.htm). Please give
> it a read and let me know what you think. In particular, I'd like to
> arrive at the right place to fix this problem: is it in the filesystem,
> VFS, or Samba?
>
> thanks,
> Mason
>
Just out of curiosity do other filesystems(reiser, xfs) take the same
performance hit?
Gerhard
--
Gerhard Mack
[email protected]
<>< As a computer I find your faith in technology amusing.
Andrew Morton <[email protected]> writes:
>
> Conceivably we could address this in the filesystem without mucking other
> things up. But I'd have thought the simplest damage-control would be to
> detect this pattern in samba and to then use glibc's fallocate().
The advantage of detecting it in kernel would be that it would handle
Linux applications that do this (I suspect there are some) too.
-Andi
On Tue, May 01, 2007 at 01:43:18PM -0700, Cabot, Mason B wrote:
> Hello all,
>
> I've been testing the NAS performance of ext3/Openfiler 2.2 against
> NTFS/WinXP and have found that NTFS significantly outperforms ext3 for
> video workloads. The Windows CIFS client will attempt a poor-man's
> pre-allocation of the file on the server by sending 1-byte writes at
> 128K-byte strides, breaking block allocation on ext3 and leading to
> fragmentation and poor performance. This will happen for many
> applications (including iTunes) as the CIFS client issues these
> pre-allocates under the application layer.
>
> I've posted a brief paper on Intel's OSS website
> (http://softwarecommunity.intel.com/articles/eng/1259.htm). Please give
> it a read and let me know what you think. In particular, I'd like to
> arrive at the right place to fix this problem: is it in the filesystem,
> VFS, or Samba?
As I commented on IRC to Val Henson - the XFS performance indicates
that it is not a VFS or Samba problem.
I'd say it's probably delayed allocation that is making the
difference here - no allocation occurs on the single byte writes, it
occurs when the larger data writes are flushed to disk. Hence no
adverse fragmentation will occur and there will be no extra
allocations being done.
Hence I think it's probably a filesystem problem - it would be
interesting to see how ext4 performs on this workload....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
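The delayed-allocation argument above can be illustrated with a toy model. This is only a sketch under heavy assumptions: it is not how ext3 or XFS actually allocate (no reservation windows, no block groups), it assumes 4K blocks, and it assumes all of the 1-byte probes arrive before the real data. The function names are invented for the example.

```python
BLOCK = 4096
STRIDE = 128 * 1024


def blocks_touched(offset, length):
    """Logical block numbers covered by a write of `length` bytes at `offset`."""
    return range(offset // BLOCK, (offset + length - 1) // BLOCK + 1)


def count_extents(mapping):
    """Count runs where consecutive logical blocks are also adjacent on disk."""
    extents = 0
    prev_logical = prev_physical = None
    for logical in sorted(mapping):
        physical = mapping[logical]
        if (prev_logical is None or logical != prev_logical + 1
                or physical != prev_physical + 1):
            extents += 1
        prev_logical, prev_physical = logical, physical
    return extents


def immediate_allocation(writes):
    """Toy model: hand out the next free block at the moment a write arrives."""
    mapping = {}
    next_free = 0
    for offset, length in writes:
        for logical in blocks_touched(offset, length):
            if logical not in mapping:
                mapping[logical] = next_free
                next_free += 1
    return mapping


def delayed_allocation(writes):
    """Toy model: only mark blocks dirty on write, allocate them all at flush."""
    dirty = set()
    for offset, length in writes:
        dirty.update(blocks_touched(offset, length))
    # Flush: allocate in logical order, so the file lands in a single run.
    return {logical: physical for physical, logical in enumerate(sorted(dirty))}


def workload(total=4 * STRIDE):
    """1-byte probes at the end of each stride, then the sequential data."""
    probes = [(off, 1) for off in range(STRIDE - 1, total, STRIDE)]
    data = [(off, STRIDE) for off in range(0, total, STRIDE)]
    return probes + data


if __name__ == "__main__":
    w = workload()
    print("immediate:", count_extents(immediate_allocation(w)), "extents")
    print("delayed:  ", count_extents(delayed_allocation(w)), "extents")
```

In this model the immediate allocator gives the four probe blocks the first four disk blocks, so each one interrupts the data run it belongs to, while the delayed allocator never sees the probes as separate allocations.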
On Tue, May 01, 2007 at 11:54:04PM -0400, Gerhard Mack wrote:
> On Tue, 1 May 2007, Cabot, Mason B wrote:
>
> > Hello all,
> >
> > I've been testing the NAS performance of ext3/Openfiler 2.2 against
> > NTFS/WinXP and have found that NTFS significantly outperforms ext3 for
> > video workloads. The Windows CIFS client will attempt a poor-man's
> > pre-allocation of the file on the server by sending 1-byte writes at
> > 128K-byte strides, breaking block allocation on ext3 and leading to
> > fragmentation and poor performance. This will happen for many
> > applications (including iTunes) as the CIFS client issues these
> > pre-allocates under the application layer.
> >
> > I've posted a brief paper on Intel's OSS website
> > (http://softwarecommunity.intel.com/articles/eng/1259.htm). Please give
> > it a read and let me know what you think. In particular, I'd like to
> > arrive at the right place to fix this problem: is it in the filesystem,
> > VFS, or Samba?
> >
> > thanks,
> > Mason
> >
>
> Just out of curiosity do other filesystems(reiser, xfs) take the same
> performance hit?
XFS was also tested - it is as fast as the Windows NTFS based
server.....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Wed, May 02, 2007 at 02:21:40PM +0200, Andi Kleen wrote:
> Andrew Morton <[email protected]> writes:
> >
> > Conceivably we could address this in the filesystem without mucking other
> > things up. But I'd have thought the simplest damage-control would be to
> > detect this pattern in samba and to then use glibc's fallocate().
>
> The advantage of detecting it in kernel would be that it would handle
> Linux applications that do this (I suspect there are some) too.
Um, which applications do you suspect? So we can hunt down those user
space applications programmers and slap them silly? Or rather,
unsilly, since there's no good reason to ever suspect that
writing a byte every 128k would result in a good allocation layout on disk?
- Ted
On Tue, May 01, 2007 at 02:23:25PM -0700, Andrew Morton wrote:
> On Tue, 1 May 2007 13:43:18 -0700
> "Cabot, Mason B" <[email protected]> wrote:
>
> > I've been testing the NAS performance of ext3/Openfiler 2.2 against
> > NTFS/WinXP and have found that NTFS significantly outperforms ext3 for
> > video workloads. The Windows CIFS client will attempt a poor-man's
> > pre-allocation of the file on the server by sending 1-byte writes at
> > 128K-byte strides, breaking block allocation on ext3 and leading to
> > fragmentation and poor performance. This will happen for many
> > applications (including iTunes) as the CIFS client issues these
> > pre-allocates under the application layer.
>
> Oh my gawd, what a stupid hack. Now we know what the MS interoperability
> lab has been working on.
I wonder if they patented this technique as well, as one of the
dozen or so patents they are filing every day? "A Method of
Screwing Over Samba's Performance So that Windows Longhorn Can Compete
On Performance" coming soon, to a patent database near you! :-)
> > I've posted a brief paper on Intel's OSS website
> > (http://softwarecommunity.intel.com/articles/eng/1259.htm). Please give
> > it a read and let me know what you think. In particular, I'd like to
> > arrive at the right place to fix this problem: is it in the filesystem,
> > VFS, or Samba?
The right place is clearly Samba. I can't think of any other program
or filesystem protocol where writing a 1 byte write at 128k strides
would be used to signal a desire to do preallocation. In fact, it's
hard to think of a worse way of doing things.
- Ted
Theodore Tso <[email protected]> writes:
> On Wed, May 02, 2007 at 02:21:40PM +0200, Andi Kleen wrote:
> > Andrew Morton <[email protected]> writes:
> > >
> > > Conceivably we could address this in the filesystem without mucking other
> > > things up. But I'd have thought the simplest damage-control would be to
> > > detect this pattern in samba and to then use glibc's fallocate().
> >
> > The advantage of detecting it in kernel would be that it would handle
> > Linux applications that do this (I suspect there are some) too.
>
> Um, which applications do you suspect? So we can hunt down those user
> space applications programmers and slap them silly? Or rather,
> unsilly, since there's no good reason to ever suspect that
> writing a byte every 128k would result in a good allocation layout on disk?
Anything that uses glibc fallocate() ?
-Andi
On Wed, May 02, 2007 at 12:16:38PM -0400, Theodore Tso wrote:
> On Tue, May 01, 2007 at 02:23:25PM -0700, Andrew Morton wrote:
> > On Tue, 1 May 2007 13:43:18 -0700
> > "Cabot, Mason B" <[email protected]> wrote:
> >
> > > I've been testing the NAS performance of ext3/Openfiler 2.2 against
> > > NTFS/WinXP and have found that NTFS significantly outperforms ext3 for
> > > video workloads. The Windows CIFS client will attempt a poor-man's
> > > pre-allocation of the file on the server by sending 1-byte writes at
> > > 128K-byte strides, breaking block allocation on ext3 and leading to
> > > fragmentation and poor performance. This will happen for many
> > > applications (including iTunes) as the CIFS client issues these
> > > pre-allocates under the application layer.
> >
> > Oh my gawd, what a stupid hack. Now we know what the MS interoperability
> > lab has been working on.
>
> I wonder if they patented this technique as well, as one of the
> dozen or so patents they are filing every day? "A Method of
> Screwing Over Samba's Performance So that Windows Longhorn Can Compete
> On Performance" coming soon, to a patent database near you! :-)
>
> > > I've posted a brief paper on Intel's OSS website
> > > (http://softwarecommunity.intel.com/articles/eng/1259.htm). Please give
> > > it a read and let me know what you think. In particular, I'd like to
> > > arrive at the right place to fix this problem: is it in the filesystem,
> > > VFS, or Samba?
>
> The right place is clearly Samba. I can't think of any other program
> or filesystem protocol where writing a 1 byte write at 128k strides
> would be used to signal a desire to do preallocation. In fact, it's
> hard to think of a worse way of doing things.
In fact they don't need to do this - there's an explicit CIFS
set file allocation call they could use to pre-allocate the size.
There's a specific Samba VFS module that has XFS specific calls
to do this - vfs_prealloc - but this won't work on ext3.
Jeremy.
On Wed, May 02, 2007 at 08:40:35PM +0200, Andi Kleen wrote:
> Theodore Tso <[email protected]> writes:
>
> > On Wed, May 02, 2007 at 02:21:40PM +0200, Andi Kleen wrote:
> > > Andrew Morton <[email protected]> writes:
> > > >
> > > > Conceivably we could address this in the filesystem without mucking other
> > > > things up. But I'd have thought the simplest damage-control would be to
> > > > detect this pattern in samba and to then use glibc's fallocate().
> > >
> > > The advantage of detecting it in kernel would be that it would handle
> > > Linux applications that do this (I suspect there are some) too.
> >
> > Um, which applications do you suspect? So we can hunt down those user
> > space applications programmers and slap them silly? Or rather,
> > unsilly, since there's no good reason to ever suspect that
> > writing a byte every 128k would result in a good allocation layout on disk?
>
> Anything that uses glibc fallocate() ?
Glibc's fallocate currently writes all zeros, not 1 byte every
128kbytes. And once we wire up the new sys_fallocate() support, we'll
have the right preallocation support in ext4.
- Ted
On Wed, May 02, 2007 at 11:08:10AM -0700, Jeremy Allison wrote:
> > The right place is clearly Samba. I can't think of any other program
> > or filesystem protocol where writing a 1 byte write at 128k strides
> > would be used to signal a desire to do preallocation. In fact, it's
> > hard to think of a worse way of doing things.
>
> In fact they don't need to do this - there's an explicit CIFS
> set file allocation call they could use to pre-allocate the size.
>
> There's a specific Samba VFS module that has XFS specific calls
> to do this - vfs_prealloc - but this won't work on ext3.
Jeremy,
FYI, we are currently closing on a new system call so that
glibc's fallocate() will be able to call into the appropriate
per-filesystem routines in a portable way, since ext4 will have
persistent preallocation support.
I think we mostly have consensus on a calling convention which
works for all of the architectures (s390, power, arm, ia64, etc.); of
course then we will need to get glibc to support the new system call.
- Ted
On Thu, May 03, 2007 at 01:44:14AM +1000, David Chinner wrote:
> On Tue, May 01, 2007 at 01:43:18PM -0700, Cabot, Mason B wrote:
> > Hello all,
> >
> > I've been testing the NAS performance of ext3/Openfiler 2.2 against
> > NTFS/WinXP and have found that NTFS significantly outperforms ext3 for
> > video workloads. The Windows CIFS client will attempt a poor-man's
> > pre-allocation of the file on the server by sending 1-byte writes at
> > 128K-byte strides, breaking block allocation on ext3 and leading to
> > fragmentation and poor performance. This will happen for many
> > applications (including iTunes) as the CIFS client issues these
> > pre-allocates under the application layer.
> >
> > I've posted a brief paper on Intel's OSS website
> > (http://softwarecommunity.intel.com/articles/eng/1259.htm). Please give
> > it a read and let me know what you think. In particular, I'd like to
> > arrive at the right place to fix this problem: is it in the filesystem,
> > VFS, or Samba?
>
> As I commented on IRC to Val Henson - the XFS performance indicates
> that it is not a VFS or Samba problem.
>
> I'd say it's probably delayed allocation that is making the
> difference here - no allocation occurs on the single byte writes, it
> occurs when the larger data writes are flushed to disk. Hence no
> adverse fragmentation will occur and there will be no extra
> allocations being done.
>
> Hence I think it's probably a filesystem problem - it would be
> interesting to see how ext4 performs on this workload....
If we rely on delalloc for this, what happens if another proc on the
same fs is doing synchronous writes to other files? (say for mail
delivery). Will random FS commits force delayed allocations to become
real?
Also, I'd expect a sufficiently loaded server to break down eventually
as load/users increase. The cost of a bad delalloc decision gets much
higher if we're using it as a crutch for this kind of bad userland
coding.
-chris
Theodore Tso wrote:
> FYI, we are currently closing on a new system call so that
> glibc's fallocate() will be able to call into the appropriate
> per-filesystem routines in a portable way, since ext4 will have
> persistent preallocation support.
Yep.
> I think we mostly have consensus on a calling convention which
> works for all of the architectures (s390, power, arm, ia64, etc.); of
> course then we will need to get glibc to support the new system call.
glibc has had support for a while, in emulated form:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0004.1/1153.html
So when kernel support arrives, it should be easy and (hopefully)
seamless to plug in the new syscall.
Jeff
On Wed, May 02, 2007 at 04:38:55PM -0400, Jeff Garzik wrote:
> > I think we mostly have consensus on a calling convention which
> > works for all of the architectures (s390, power, arm, ia64, etc.); of
> > course then we will need to get glibc to support the new system call.
>
> glibc has had support for a while, in emulated form:
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0004.1/1153.html
>
> So when kernel support arrives, it should be easy and (hopefully)
> seamless to plug in the new syscall.
Yep. Although unfortunately given where we are in distro release
cycles (and I'm not sure where glibc is in its release cycle), it'll
probably be a year or so before most users will see the benefits. So
it would be nice if we can get samba using the fallocate() support
now, in the hopes that we can get all of the pieces aligned in time
for the next major enterprise distro releases.
- Ted
On Wed, May 02, 2007 at 03:46:21PM -0400, Chris Mason wrote:
> On Thu, May 03, 2007 at 01:44:14AM +1000, David Chinner wrote:
> > On Tue, May 01, 2007 at 01:43:18PM -0700, Cabot, Mason B wrote:
> > > Hello all,
> > >
> > > I've been testing the NAS performance of ext3/Openfiler 2.2 against
> > > NTFS/WinXP and have found that NTFS significantly outperforms ext3 for
> > > video workloads. The Windows CIFS client will attempt a poor-man's
> > > pre-allocation of the file on the server by sending 1-byte writes at
> > > 128K-byte strides, breaking block allocation on ext3 and leading to
> > > fragmentation and poor performance. This will happen for many
> > > applications (including iTunes) as the CIFS client issues these
> > > pre-allocates under the application layer.
> > >
> > > I've posted a brief paper on Intel's OSS website
> > > (http://softwarecommunity.intel.com/articles/eng/1259.htm). Please give
> > > it a read and let me know what you think. In particular, I'd like to
> > > arrive at the right place to fix this problem: is it in the filesystem,
> > > VFS, or Samba?
> >
> > As I commented on IRC to Val Henson - the XFS performance indicates
> > that it is not a VFS or Samba problem.
> >
> > I'd say it's probably delayed allocation that is making the
> > difference here - no allocation occurs on the single byte writes, it
> > occurs when the larger data writes are flushed to disk. Hence no
> > adverse fragmentation will occur and there will be no extra
> > allocations being done.
> >
> > Hence I think it's probably a filesystem problem - it would be
> > interesting to see how ext4 performs on this workload....
>
> If we rely on delalloc for this, what happens if another proc on the
> same fs is doing synchronous writes to other files? (say for mail
> delivery). Will random FS commits force delayed allocations to become
> real?
Not on XFS.
> Also, I'd expect a sufficiently loaded server to break down eventually
> as load/users increase. The cost of a bad delalloc decision gets much
> higher if we're using it as a crutch for this kind of bad userland
> coding.
This only becomes a problem if the system has enough pages dirty to
be triggering throttling so that the 1-byte writes are converted before
the data actually hits the server.
Even then, if you are on an XFS filesystem with a sunit/swidth set,
the allocation alignments and speculative allocations will go a long
way to preventing fragmentation.
If that doesn't work, then set the extent allocation size hint on the
XFS inode to 128k or 256k to set the minimum allocation size for the
file to span the distance between the 1-byte writes. This attribute
can be inherited from the parent directory on create, so it's a
set and forget type of thing...
i.e. XFS has lots of ways to prevent performance from degrading
on these sorts of issues.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
David Chinner wrote:
> On Tue, May 01, 2007 at 01:43:18PM -0700, Cabot, Mason B wrote:
> > I've been testing the NAS performance of ext3/Openfiler 2.2 against
> > NTFS/WinXP and have found that NTFS significantly outperforms ext3 for
> > video workloads. The Windows CIFS client will attempt a poor-man's
> > pre-allocation of the file on the server by sending 1-byte writes at
> > 128K-byte strides, breaking block allocation on ext3 and leading to
> > fragmentation and poor performance. This will happen for many
> > applications (including iTunes) as the CIFS client issues these
> > pre-allocates under the application layer.
> >
> > I've posted a brief paper on Intel's OSS website
> > (http://softwarecommunity.intel.com/articles/eng/1259.htm). Please give
> > it a read and let me know what you think. In particular, I'd like to
> > arrive at the right place to fix this problem: is it in the filesystem,
> > VFS, or Samba?
It's a Samba problem. Samba doesn't do async writes, which v3.0 should have
fixed. Did you try that?
> As I commented on IRC to Val Henson - the XFS performance indicates
> that it is not a VFS or Samba problem.
XFS somewhat hides the Samba problem, by efficiently syncing to disk.
Thanks!
--
Al
On Thu, May 03, 2007 at 10:15:11AM +1000, David Chinner wrote:
[ bad fragmentation from a funky write one byte every 128k system ]
>
> This only becomes a problem if the system has enough pages dirty to
> be triggering throttling so that the 1-byte writes are converted before
> the data actually hits the server.
>
> Even then, if you are on an XFS filesystem with a sunit/swidth set,
> the allocation alignments and speculative allocations will go a long
> way to preventing fragmentation.
>
> If that doesn't work, then set the extent allocation size hint on the
> XFS inode to 128k or 256k to set the minimum allocation size for the
> file to span the distance between the 1-byte writes. This attribute
> can be inherited from the parent directory on create, so it's a
> set and forget type of thing...
>
> i.e. XFS has lots of ways to prevent performance from degrading
> on these sorts of issues.
I'm not surprised that XFS would fare the best in this workload,
but this sounds like a lot of magic that shouldn't be required. The
fact that it is good to have the allocation knobs and delalloc in
general doesn't mean that samba shouldn't do the right thing and
preallocate the space in a sensible fashion.
-chris
On Thu, May 03, 2007 at 01:44:14AM +1000, David Chinner wrote:
> On Tue, May 01, 2007 at 01:43:18PM -0700, Cabot, Mason B wrote:
> > Hello all,
> >
> > I've been testing the NAS performance of ext3/Openfiler 2.2 against
> > NTFS/WinXP and have found that NTFS significantly outperforms ext3 for
> > video workloads. The Windows CIFS client will attempt a poor-man's
> > pre-allocation of the file on the server by sending 1-byte writes at
> > 128K-byte strides, breaking block allocation on ext3 and leading to
> > fragmentation and poor performance. This will happen for many
> > applications (including iTunes) as the CIFS client issues these
> > pre-allocates under the application layer.
> >
> > I've posted a brief paper on Intel's OSS website
> > (http://softwarecommunity.intel.com/articles/eng/1259.htm). Please give
> > it a read and let me know what you think. In particular, I'd like to
> > arrive at the right place to fix this problem: is it in the filesystem,
> > VFS, or Samba?
>
> As I commented on IRC to Val Henson - the XFS performance indicates
> that it is not a VFS or Samba problem.
In terms of what piece of code we can swap out and get good
performance, the problem is indeed in ext3 - it's clear that the cause
of the bad performance is the 1-byte writes resulting in ext3
fragmenting the on-disk layout of the file, and replacing it with XFS
results in nice, clean, unfragmented files.
But in terms of what we should do to fix it, there is the possibility
of some debate. In general, I think there is a lot of code stuck down
in individual file systems - especially in XFS - that could be
usefully hoisted up to a higher level as generic helper functions.
For example, we've got at least two implementations of reservations,
one in XFS and one in ext3/4. At least some of the code could be
generic - both file systems want to reserve long contiguous extents -
with the actual mechanics of looking up and reserving free blocks
implemented in per-fs code.
I'd really like to see a generic VFS-level detection of
read()/write()/creat()/mkdir()/etc. patterns which could detect things
like "Oh, this file is likely to be deleted immediately, wait and see
if it goes away and don't bother sending it on to the FS immediately"
or "Looks like this file will grow pretty big, let's go pre-allocate
some space for it." This is probably best done as a set of helper
functions in the usual way.
For this particular case, Ted is probably right and the only place
we'll ever see this insane poor man's pre-allocate pattern is from the
Windows CIFS client, in which case fixing this in Samba makes sense -
although I'm a bit horrified by the idea of writing 128K of zeroes to
pre-allocate... oh well, it's temporary, and what we care about here
is the read performance, more than the write performance.
-VAL
In article <20070503211450.GA3869@nifty> you wrote:
> For this particular case, Ted is probably right and the only place
> we'll ever see this insane poor man's pre-allocate pattern is from the
> Windows CIFS client, in which case fixing this in Samba makes sense -
> although I'm a bit horrified by the idea of writing 128K of zeroes to
> pre-allocate... oh well, it's temporary, and what we care about here
> is the read performance, more than the write performance.
What about an ioctl or advice to avoid holes? Which could be issued by
samba? Is that related to SetFileValidData and SetEndOfFile win32 functions?
What is the windows client calling, and what command is transmitted by smb?
Gruss
Bernd
On 3 May 2007, at 23:40, Bernd Eckenfels wrote:
> In article <20070503211450.GA3869@nifty> you wrote:
>> For this particular case, Ted is probably right and the only place
>> we'll ever see this insane poor man's pre-allocate pattern is from the
>> Windows CIFS client, in which case fixing this in Samba makes sense -
>> although I'm a bit horrified by the idea of writing 128K of zeroes to
>> pre-allocate... oh well, it's temporary, and what we care about here
>> is the read performance, more than the write performance.
>
> What about an ioctl or advice to avoid holes? Which could be issued by
> samba? Is that related to SetFileValidData and SetEndOfFile win32 functions?
> What is the windows client calling, and what command is transmitted by smb?
Nothing to do with win32 functions. Windows does NOT create sparse
files therefore it never can have an issue like ext3 does in this
scenario. Windows will cause nice allocations to happen because of
this and the 1-byte writes are perfectly sensible in this regard.
(Although a little odd as Windows has a proper API for doing
preallocation so I don't get why it is not using that instead...)
As far as I know the only time Windows will create sparse files is if
you specifically mark a file as sparse using the FSCTL_SET_SPARSE
ioctl and then create a sparse region using the FSCTL_SET_ZERO_DATA
ioctl.
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
On Fri, May 04, 2007 at 09:12:31AM +0100, Anton Altaparmakov wrote:
> Nothing to do with win32 functions. Windows does NOT create sparse
> files therefore it never can have an issue like ext3 does in this
> scenario. Windows will cause nice allocations to happen because of
> this and the 1-byte writes are perfectly sensible in this regard.
> (Although a little odd as Windows has a proper API for doing
> preallocation so I don't get why it is not using that instead...)
Which means the right place to fix this is samba. Samba just needs
to intercept lseek and pread/pwrite to never allocate sparse files
but do the right thing instead. Now the right thing would probably
be a preallocate instead of writing zeroes, and we need to provide the
infrastructure for them to do it, which is in progress currently.
(And in fact samba already does the right thing for XFS if you use
the prealloc samba vfs module, which AFAIK is not the default)
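As a sketch of what "never allocate sparse files" could look like at the write call, the wrapper below fills any gap between the current end of file and the write offset before writing. It is a hypothetical helper, not Samba's VFS interface. It assumes os.posix_fallocate() is available and that writing zeroes is an acceptable fallback when it is not.

```python
import os
import tempfile

STRIDE = 128 * 1024


def write_without_holes(fd, data, offset):
    """Hypothetical pwrite wrapper: never leave a hole between EOF and offset."""
    size = os.fstat(fd).st_size
    if offset > size:
        try:
            # Ask for real blocks covering the gap instead of creating a hole.
            os.posix_fallocate(fd, size, offset - size)
        except OSError:
            # Assumed fallback where preallocation is unsupported: write zeroes.
            os.pwrite(fd, b"\0" * (offset - size), size)
    return os.pwrite(fd, data, offset)


def demo():
    """Return (apparent size, bytes allocated, first bytes of the filled gap)."""
    with tempfile.TemporaryDirectory() as d:
        fd = os.open(os.path.join(d, "video.bin"), os.O_RDWR | os.O_CREAT, 0o600)
        try:
            write_without_holes(fd, b"\0", STRIDE - 1)
            st = os.fstat(fd)
            return st.st_size, st.st_blocks * 512, os.pread(fd, 16, 0)
        finally:
            os.close(fd)


if __name__ == "__main__":
    size, allocated, head = demo()
    print(f"size={size}, allocated={allocated}")
```

Note that the fstat() check and the write are separate calls, so the sketch is not safe against another writer extending the file in between.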
On Thu, May 03, 2007 at 02:14:52PM -0700, Valerie Henson wrote:
> But in terms of what we should do to fix it, there is the possibility
> of some debate. In general, I think there is a lot of code stuck down
> in individual file systems - especially in XFS - that could be
> usefully hoisted up to a higher level as generic helper functions.
> For example, we've got at least two implementations of reservations,
> one in XFS and one in ext3/4. At least some of the code could be
> generic - both file systems want to reserve long contiguous extents -
> with the actual mechanics of looking up and reserving free blocks
> implemented in per-fs code.
I'm not so sure.
Most of the block allocation (and pre-allocation) code is actually of
necessity going to be filesystem specific. There are patches
currently in the ext4 patch queue which would provide a
filesystem-generic preallocate system call, and that makes sense. And
delayed allocation could be done more in the VM --- but the actual
reservation code? It's not at all clear it makes sense to try to
generalize it, since filesystems like XFS which look up free blocks
via extents have fundamentally different abstractions which would be
more efficient for them.
> I'd really like to see a generic VFS-level detection of
> read()/write()/creat()/mkdir()/etc. patterns which could detect things
> like "Oh, this file is likely to be deleted immediately, wait and see
> if it goes away and don't bother sending it on to the FS immediately"
> or "Looks like this file will grow pretty big, let's go pre-allocate
> some space for it." This is probably best done as a set of helper
> functions in the usual way.
What patterns do you think mean things like "this file is likely to
be deleted immediately", or "this file will grow pretty big"? I don't
think there are any that would be generally valid.
The only things which I think make sense are delayed allocation,
part of which, as I said, could be done in the VM/VFS layer, and
an explicit API for large files that need persistent preallocation.
- Ted
On 4 May 2007, at 10:46, Christoph Hellwig wrote:
> On Fri, May 04, 2007 at 09:12:31AM +0100, Anton Altaparmakov wrote:
>> Nothing to do with win32 functions. Windows does NOT create sparse
>> files therefore it never can have an issue like ext3 does in this
>> scenario. Windows will cause nice allocations to happen because of
>> this and the 1-byte writes are perfectly sensible in this regard.
>> (Although a little odd as Windows has a proper API for doing
>> preallocation so I don't get why it is not using that instead...)
>
> Which means the right place to fix this is samba.
Absolutely, agreed.
> Samba just needs
> to intercept lseek and pread/pwrite to never allocate sparse files
> but do the right thing instead. Now the right thing would probably
> be a preallocate instead of writing zeroes, and we need to provide the
> infrastructure for them to do it, which is in progress currently.
> (And in fact samba already does the right thing for XFS if you use
> the prealloc samba vfs module, which AFAIK is not the default)
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
Christoph Hellwig wrote:
> On Fri, May 04, 2007 at 09:12:31AM +0100, Anton Altaparmakov wrote:
>> Nothing to do with win32 functions. Windows does NOT create sparse
>> files therefore it never can have an issue like ext3 does in this
>> scenario. Windows will cause nice allocations to happen because of
>> this and the 1-byte writes are perfectly sensible in this regard.
>> (Although a little odd as Windows has a proper API for doing
>> preallocation so I don't get why it is not using that instead...)
>
> Which means the right place to fix this is samba. Samba just need
> to intersept lseek and pread/pwrite to never allocate sparse files
> but do the right thing instead. Now what the right thing would probably
> be a preallocate instead of writing zeroes, and we need to provide the
> infrastructure for them to do it, which is in progress currently.
> (And in fact samba already does the right thing for XFS if you use
> the prealloc samba vfs module, which AFAIK is not the default)
Hmm.
How about providing a way to stop the kernel (or filesystem) from making
gaps in files instead? Like some ioctl(fd, FS_NOGAPS, 1) -- pretty much
like 'doze has, just the opposite (on Windows, this flag is "on" by
default).
Fixing this issue in Samba means that Samba has to keep/track more state
data than it currently does. Detecting such seek+write has some costs.
It's even worse: imagine Samba transforms this into write(zeros) (as
preallocate isn't available yet), and at the same time, another process
is writing there... That is perfectly valid in the current case, but
it will go wrong (overwriting just-written data with zeros) in this
new scenario.
But the main point is that Samba has to keep track of things which it
doesn't track now, and those things become... interesting (difficult,
if at all possible, to track) in a multi-user/concurrent-writes
environment.
/mjt
Cabot, Mason B wrote:
> I've been testing the NAS performance of ext3/Openfiler 2.2 against
> NTFS/WinXP and have found that NTFS significantly outperforms ext3 for
> video workloads. The Windows CIFS client will attempt a poor-man's
> pre-allocation of the file on the server by sending 1-byte writes at
> 128K-byte strides, breaking block allocation on ext3 and leading to
> fragmentation and poor performance. This will happen for many
> applications (including iTunes) as the CIFS client issues these
> pre-allocates under the application layer.
This is rather hard to believe, so I think some more information is in
order. Specifically, how do you know that it is the Windows kernel that
is issuing these writes and not the application? Under what application
access patterns does it do this?
It is hard to believe because, iirc, the CIFS protocol has commands to
extend the file size properly rather than with this hack, and unless it
is asked to by the application, the CIFS client should not be trying to
extend files.
On Fri, May 04, 2007 at 08:23:08AM -0400, Theodore Tso wrote:
> On Thu, May 03, 2007 at 02:14:52PM -0700, Valerie Henson wrote:
>
> > I'd really like to see a generic VFS-level detection of
> > read()/write()/creat()/mkdir()/etc. patterns which could detect things
> > like "Oh, this file is likely to be deleted immediately, wait and see
> > if it goes away and don't bother sending it on to the FS immediately"
> > or "Looks like this file will grow pretty big, let's go pre-allocate
> > some space for it." This is probably best done as a set of helper
> > functions in the usual way.
>
> What patterns do you think means things like "this file is likely to
> be deleted immediate", or "this file will grow pretty big"? I don't
> think there are any that would be generally valid.
I wouldn't have guessed that either, but it turns out there are:
http://www.eecs.harvard.edu/~ellard/pubs/able-usenix04.pdf
We present evidence that attributes that are known to
the file system when a file is created, such as its name,
permission mode, and owner, are often strongly related
to future properties of the file such as its ultimate size,
lifespan, and access pattern. More importantly, we show
that we can exploit these relationships to automatically
generate predictive models for these properties, and that
these predictions are sufficiently accurate to enable
optimizations.
For example, lock files have predictable names and permissions, and
live for a fraction of a second in most cases. Files which are appended
to a few hundred bytes at a time are probably log files and will
continue to grow in this manner. Some of their predictions were 98%
accurate!
In any case, any predictive algorithms we already do at the file
system level can be done at the VFS level, and shared between file
systems, instead of being reimplemented over and over again. Just
food for thought.
-VAL
> Cabot, Mason B wrote:
> > I've been testing the NAS performance of ext3/Openfiler 2.2 against
> > NTFS/WinXP and have found that NTFS significantly outperforms ext3 for
> > video workloads. The Windows CIFS client will attempt a poor-man's
> > pre-allocation of the file on the server by sending 1-byte writes at
> > 128K-byte strides, breaking block allocation on ext3 and leading to
> > fragmentation and poor performance. This will happen for many
> > applications (including iTunes) as the CIFS client issues these
> > pre-allocates under the application layer.
>
> This is rather hard to believe so I think some more information is in
> order. Specifically, how do you know that it is the windows kernel that
> is issuing these writes and not the application? Under what application
> access patterns does it do this?
>
> This is just rather hard to believe seeing as how, iirc, the CIFS
> protocol has commands to extend the file size properly rather than with
> this hack, and unless it is asked to by the application, the cifs client
> should not be trying to extend files.
Philip:
The best response I can offer is that we have traced the application's
file system accesses and seen no such one-byte writes occurring at that
level. They are generated somewhere below the application. Additionally,
while we have observed iTunes on Windows issuing these one-byte writes,
Ethereal traces for iTunes on Mac OS X show no such behavior. Because of
these observations I think it is reasonable to conclude that the Windows
CIFS client is generating the one-byte writes.
thanks,
Mason
On Fri, May 04, 2007 at 07:49:13PM +0400, Michael Tokarev wrote:
>
> How about providing a way to stop kernel (or filesystem) to make gaps
> in files instead? Like some ioctl(fd, FS_NOGAPS, 1) -- pretty much
> like 'doze has, just the opposite (on windows, this flag is "on" by
> default).
This is being worked on already. XFS has a per-filesystem ioctl, but
we want to create a filesystem-independent system call,
sys_fallocate(), that would be wired into the already existing
posix_fallocate() function exported by glibc.
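As an illustration of the explicit preallocation interface mentioned here, the following is a minimal sketch using Python's os.posix_fallocate(), which wraps the posix_fallocate() function. The helper name preallocate() and the file name are made up for the example, and how the reservation is satisfied (a real block reservation or a fallback that writes the range) depends on the C library and filesystem underneath.

```python
import os
import tempfile


def preallocate(path, size):
    """Reserve `size` bytes for `path` without sending data from the caller."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # posix_fallocate(fd, offset, len): on success the range is backed by
        # storage and the file size is at least offset + len.
        os.posix_fallocate(fd, 0, size)
        return os.fstat(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        st = preallocate(os.path.join(d, "video.bin"), 8 * 1024 * 1024)
        # st_blocks is counted in 512-byte units on Linux.
        print("size:", st.st_size, "allocated bytes:", st.st_blocks * 512)
```

Comparing the reported size with the allocated byte count is a quick way to see whether a file is sparse or fully backed.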
> It's even worse: imagine samba transforms this into write(zeros) (as
> preallocate isn't available yet), and at the same time, another process
> is writing there... Which will be perfectly valid in current case, but
> will go wrong way (overwriting just-written data with zeros) in this
> new scenario.
Samba can just use the posix_fallocate() function. Note that if
you have two processes writing to the same file without proper
locking, you're probably going to run into problems anyway.
What if one process is writing whole blockfuls of data, while some
brain-damaged Windows client is writing a byte of zero every 128k, and
thus subtly corrupting the data written by the first process? We
can't fix brain-damaged applications that aren't doing proper
application-level locking....
(Aside, of course, from convincing people to switch away from Vista to
Linux. :-)
- Ted
On Tue, 1 May 2007 13:43:18 -0700
"Cabot, Mason B" <[email protected]> wrote:
> Hello all,
> I've been testing the NAS performance of ext3/Openfiler 2.2 against
> NTFS/WinXP and have found that NTFS significantly outperforms ext3 for
> video workloads. The Windows CIFS client will attempt a poor-man's
> pre-allocation of the file on the server by sending 1-byte writes at
> 128K-byte strides, breaking block allocation on ext3 and leading to
> fragmentation and poor performance. This will happen for many
> applications (including iTunes) as the CIFS client issues these
> pre-allocates under the application layer.
On 5 May, 10:20, Theodore Tso <[email protected]> wrote:
>
> This is being worked on already. XFS has a per-filesystem ioctl, but
> we want to create a filesystem-independent system call,
> sys_fallocate(), that would wired into the already existing
> posix_fallocate() function exported by glibc.
The moral of the story: an application has to take the filesystem into
account. ext3 is good at aaa but not at bbb; XFS is good at ccc but not
at ddd; reiserfs is good at eee but not at fff...
For this scenario, XFS is good at dealing with fragmentation while ext3
is not.
On Sat, May 05, 2007 at 11:13:36AM +0800, Xu CanHao wrote:
> On 5 May, 10:20, Theodore Tso <[email protected]> wrote:
> >
> >This is being worked on already. XFS has a per-filesystem ioctl, but
> >we want to create a filesystem-independent system call,
> >sys_fallocate(), that would wired into the already existing
> >posix_fallocate() function exported by glibc.
>
> The story told us: an application must look to the file-systems, ext3
> is good at aaa, is not good at bbb; XFS is good at ccc, is not good at
> ddd; reiserfs is good at eee, is not good at fff........
>
> For this scenario, XFS is good at dealing with fragmentation while ext3 not.
That's true. XFS has the ability to do delayed allocations, so that
the blocks don't get allocated until they are written out. Hence, a
workload that writes a pattern which uses random access writes in
strides of 128k, and then goes back to fill them in, will result in
fragmentation given ext3's current block reservation allocation
algorithm --- but, as long as the system isn't under high memory
pressure, XFS will do better in this particular scenario.
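The workload described here can be approximated with a short script. This is only a sketch: the exact offsets and ordering used by the Windows client are not given in the thread, so placing each 1-byte probe at a multiple of 128K and doing all probes before the fill-in pass are assumptions, and the names probe_offsets() and replay() are made up.

```python
import os
import tempfile

STRIDE = 128 * 1024  # stride reported in the original post


def probe_offsets(file_size, stride=STRIDE):
    """Assumed offsets of the 1-byte probe writes (one per stride)."""
    return list(range(stride, file_size, stride))


def replay(path, file_size, stride=STRIDE):
    """Write the probes first, then go back and fill in the real data."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for off in probe_offsets(file_size, stride):
            os.pwrite(fd, b"\0", off)  # each probe lands beyond the old EOF
        blocks_after_probes = os.fstat(fd).st_blocks
        data = b"\xaa" * stride
        for off in range(0, file_size, stride):
            os.pwrite(fd, data[: min(stride, file_size - off)], off)
        st = os.fstat(fd)
        return blocks_after_probes, st.st_blocks, st.st_size
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(replay(os.path.join(d, "stride.bin"), 4 * 1024 * 1024))
```

Running it on the filesystem under test and then inspecting the extent layout with a tool such as filefrag is one way to compare allocators.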
Actually, ext3 does have a block reservation system, which will
prevent this scenario if the random access writes are within a range
of 32k or so --- which is enough to protect against the bad effects of
more common random access write patterns, such as those used when
writing out ELF object files, for example. Increasing
EXT3_DEFAULT_RESERVE_BLOCKS by a factor of 4 would adapt the ext3
block reservation system to this pathological workload, and we could
easily add a tunable mount option to change the reservation size used
by ext3. Unfortunately, this could make fragmentation worse for other
workloads. So adding delayed allocation to ext4 is a better solution.
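The arithmetic behind the "factor of 4" can be spelled out as follows. The sketch assumes a 4 KiB block size and a default reservation window of 8 blocks; those two values are assumptions chosen because they reproduce the 32k figure quoted above, so check them against the kernel source and the filesystem in question.

```python
BLOCK_SIZE = 4096                # assumed ext3 block size, in bytes
EXT3_DEFAULT_RESERVE_BLOCKS = 8  # assumed default reservation window, in blocks
STRIDE = 128 * 1024              # stride of the 1-byte writes


def window_bytes(reserve_blocks, block_size=BLOCK_SIZE):
    """Size of the reservation window in bytes."""
    return reserve_blocks * block_size


def covers_stride(reserve_blocks, stride=STRIDE, block_size=BLOCK_SIZE):
    """True if one reservation window spans a full stride."""
    return window_bytes(reserve_blocks, block_size) >= stride


if __name__ == "__main__":
    for factor in (1, 4):
        blocks = EXT3_DEFAULT_RESERVE_BLOCKS * factor
        print(factor, window_bytes(blocks), covers_stride(blocks))
```

Under these assumptions the default window is 32 KiB, and four times that is exactly the 128 KiB stride.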
But as has already been discussed on this thread, in situations where
the fileserver is under high memory pressure, any filesystem (XFS or
ext4) would still end up allocating blocks out of order, resulting in
fragmentation. Explicit preallocation, as opposed to delayed
allocation, is really the best long-term solution. In order to do
that, Samba needs to detect this scenario and translate it into an
explicit preallocation request. As has been noted, there appears to
be no good reason for the Windows CIFS client (or any other
application) to be doing this, other than perhaps to deliberately
trigger a worst-case allocation pattern in ext3.
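A rough sketch of what "detect this scenario and translate it" could look like in a userspace file server follows. It is not Samba code: is_prealloc_probe() and handle_write() are hypothetical names, and the heuristic (a single zero byte at or beyond the current end of file) is an assumption based on the pattern described in this thread.

```python
import os
import tempfile


def is_prealloc_probe(offset, data, eof):
    """Heuristic: exactly one zero byte written at or beyond the current EOF."""
    return data == b"\0" and offset >= eof


def handle_write(fd, offset, data):
    """Turn a suspected probe into a preallocation; pass other writes through."""
    # Note: fstat-then-act is not atomic, so another writer can move EOF
    # between the check and the action.
    eof = os.fstat(fd).st_size
    if is_prealloc_probe(offset, data, eof):
        # Reserve everything from the old EOF through the probed byte.
        os.posix_fallocate(fd, eof, offset + 1 - eof)
        return "preallocated"
    os.pwrite(fd, data, offset)
    return "written"


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        print(handle_write(f.fileno(), 128 * 1024, b"\0"))
        print(handle_write(f.fileno(), 0, b"real data"))
        print(os.fstat(f.fileno()).st_size)
```

The resulting file size is the same as if the probe byte had been written, so the client should see no difference; only the on-disk allocation changes.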
Regards,
- Ted
On Fri, May 04, 2007 at 07:49:13PM +0400, Michael Tokarev wrote:
> How about providing a way to stop kernel (or filesystem) to make gaps
> in files instead? Like some ioctl(fd, FS_NOGAPS, 1) -- pretty much
> like 'doze has, just the opposite (on windows, this flag is "on" by
> default).
Giving filesystems non-hole semantics is non-trivial. Not allowing
for holes creates a lot of complications in unix-like filesystems.
> But the main point is that samba has to keep track of things which it
> doesn't do now, and those things becomes.. interesting (difficult if
> at all possible to track) in multi-user/concurrent-writes environment.
Samba is there to deal with a braindead protocol and braindead clients,
so let it continue to do that. No need to push this into the kernel.
Theodore Tso <[email protected]> wrote:
> But as has already been discussed on this thread, in situations where
> the fileserver is under high memory pressure, any filesystem (XFS or
> ext4) would still end up allocating blocks out of order, resulting in
> fragmentation. Explicit preallocation, as opposed to delayed
> allocation, is really the best long-term solution; and in order to do
> that, Samba needs to detect this scenario --- which as has been noted,
> there appears to be no good reason for the Windows CIFS client (or any
> other application)to be doing this, other than perhaps to deliberate
> trigger a worst case allocation pattern in ext3 --- and translate it
> into a explicit preallocation request.
There is an interface to tell the kernel about the way the file will be
accessed. IMO this interface should be used to do the preallocation, too.
The other question is: How to tell the poor-bill's preallocation from a
very clever application that communicates with another application and
which is supposed to zero out that exact byte from the data the other
application sent. I was tempted to say "just let samba cache these calls",
but it would be wrong. You'll need magic in the kernel to DTRT.
There are four correct ways of handling these one-zerobyte-writes after EOF:
1) Extend the file like truncate
2) Extend the file like write() (current behaviour)
3) Preallocate these blocks (to be implemented)
4) Write all zeroes (current behaviour for FAT)
(2) will cause bad allocations; it's obviously worse than (1). (3) would be
better than (1) and (2), but only xfs(?) and ext4 will support this in the
near future. (4) should double the write time, but give the best possible
read speed. According to [1], the expected read speed is about as high as (1)
gives, "playback performance improves to expected levels". If preallocation
does not seem to make a big difference, I don't think we should do (4) as
a replacement until the filesystem does support real preallocations.
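The four options listed above can be tried side by side from user space. The sketch below is only an approximation: extend() and its strategy names are invented for the example, and the block counts it prints depend on the filesystem it runs on.

```python
import os
import tempfile


def extend(fd, new_size, how):
    """Grow the file to `new_size` bytes using one of the four strategies."""
    old = os.fstat(fd).st_size
    if how == "truncate":        # (1) size changes, nothing is allocated
        os.ftruncate(fd, new_size)
    elif how == "write":         # (2) one zero byte at the new last offset
        os.pwrite(fd, b"\0", new_size - 1)
    elif how == "preallocate":   # (3) explicit reservation of the new range
        os.posix_fallocate(fd, old, new_size - old)
    elif how == "zeroes":        # (4) fill the whole new range with zeroes
        zeros = bytes(1 << 20)
        pos = old
        while pos < new_size:
            pos += os.pwrite(fd, zeros[: new_size - pos], pos)
    else:
        raise ValueError("unknown strategy: " + how)
    return os.fstat(fd)


if __name__ == "__main__":
    for how in ("truncate", "write", "preallocate", "zeroes"):
        with tempfile.TemporaryFile() as f:
            st = extend(f.fileno(), 4 * 1024 * 1024, how)
            print(how, st.st_size, st.st_blocks)
```

All four leave the file at the same size; they differ in how many blocks are allocated afterwards and in how much data has to be written to get there.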
I suggest:
1) Make samba use fadvise(MIGHT_PREALLOCATE)
2) Make the kernel turn these 1-byte-writes-after-EOF into truncates
on MIGHT_PREALLOCATE, and possibly turn off MIGHT_PREALLOCATE on
other read/writes
3) Make the kernel fadvise(PREALLOCATE, $filesize)
on MIGHT_PREALLOCATE + lseek(0), turning off the MIGHT_PREALLOCATE
Possibly it might also turn on FADV_SEQUENTIAL.
4) Make the filesystems optionally preallocate the desired area, or
ignore fadvise(PREALLOCATE, $filesize) instead.
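For reference, the existing interface this proposal would extend is posix_fadvise(). The sketch below only shows the advice values that exist today, as exposed by Python's os module; MIGHT_PREALLOCATE and PREALLOCATE are proposals from this message and are not part of that interface. The helper name advise_sequential() is made up.

```python
import os
import tempfile

# Advice values defined by POSIX; they describe access patterns and none of
# them asks the filesystem to allocate blocks.
EXISTING_ADVICE = (
    "POSIX_FADV_NORMAL",
    "POSIX_FADV_SEQUENTIAL",
    "POSIX_FADV_RANDOM",
    "POSIX_FADV_NOREUSE",
    "POSIX_FADV_WILLNEED",
    "POSIX_FADV_DONTNEED",
)


def advise_sequential(fd):
    """Hint that the whole file will be accessed front to back."""
    # Offset 0 with length 0 means "from the start to the end of the file".
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        advise_sequential(f.fileno())
    print([name for name in EXISTING_ADVICE if hasattr(os, name)])
```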
[1] http://softwarecommunity.intel.com/articles/eng/1259.htm
--
It is still called paranoia when they really are out to get you.
Andrew Morton writes:
> "Cabot, Mason B" <[email protected]> wrote:
>> I've been testing the NAS performance of ext3/Openfiler 2.2 against
>> NTFS/WinXP and have found that NTFS significantly outperforms ext3 for
>> video workloads. The Windows CIFS client will attempt a poor-man's
>> pre-allocation of the file on the server by sending 1-byte writes at
>> 128K-byte strides, breaking block allocation on ext3 and leading to
>> fragmentation and poor performance. This will happen for many
>> applications (including iTunes) as the CIFS client issues these
>> pre-allocates under the application layer.
>
> Oh my gawd, what a stupid hack. Now we know what the
> MS interoperability lab has been working on.
Stupid or not, this is their protocol. The cifs filesystem
driver needs a patch to do this. Probably that'll help get
better performance when Linux is writing to a Windows server.
2007/5/6, Bodo Eggert <[email protected]>:
> Theodore Tso <[email protected]> wrote:
>
> > But as has already been discussed on this thread, in situations where
> > the fileserver is under high memory pressure, any filesystem (XFS or
> > ext4) would still end up allocating blocks out of order, resulting in
> > fragmentation. Explicit preallocation, as opposed to delayed
> > allocation, is really the best long-term solution; and in order to do
> > that, Samba needs to detect this scenario --- which as has been noted,
> > there appears to be no good reason for the Windows CIFS client (or any
> > other application)to be doing this, other than perhaps to deliberate
> > trigger a worst case allocation pattern in ext3 --- and translate it
> > into a explicit preallocation request.
>
> There is an interface to tell the kernel about the way the file will be
> accessed. IMO this interface should be used to do the preallocation, too.
>
> The other question is: How to tell the poor-bill's preallocation from a
> very clever application that communicates with another application and
> which is supposed to zero out that exact byte from the data the other
> application sent. I was tempted to say "just let samba cache these calls",
> but it would be wrong. You'll need magic in the kernel to DTRT.
>
> There are three correct ways of handling these one-zerobyte-writes after EOF:
>
> 1) Extend the file like truncate
> 2) Extend the file like write() (current behaviour)
> 3) Preallocate these blocks (to be implemented)
> 4) Write all zeroes (current behaviour for FAT)
>
> (2) will cause bad allocations, it's obviously worse than (1). (3) would be
> better than (1) and (2), but only xfs(?) and ext4 will support this in the
> near future. (4) should double the write time, but give the best possible
> read speed. According to [1], the expected read speed is about as high as (1)
> gives, "playback performance improves to expected levels". If preallocation
> does not seem to make a big difference, I don't think we should do (4) as
> a replacement untill the filesystem does support real preallocations.
>
>
> I suggest:
>
> 1) Make samba use fadvise(MIGHT_PREALLOCATE)
> 2) Make the kernel turn these 1-byte-writes-after-EOF into truncates
> on MIGHT_PREALLOCATE, and possibly turn off MIGHT_PREALLOCATE on
> other read/writes
> 3) Make the kernel fadvise(PREALLOCATE, $filesize)
> on MIGHT_PREALLOCATE + lseek(0), turning off the MIGHT_PREALLOCATE
> Possibly it might also turn on FADV_SEQUENTIAL.
> 4) Make the filesystems optionally preallocate the desired area, or
> ignore fadvise(PREALLOCATE, $filesize) instead.
>
>
> [1] http://softwarecommunity.intel.com/articles/eng/1259.htm
> --
> It is still called paranoia when they really are out to get you.
>
>
So it is possible that "Explicit Preallocation" + "Delayed
Allocation" + (some other technology) would minimize filesystem
fragmentation. Furthermore, the heavy fragmentation of large downloads
could perhaps be solved by "Explicit Preallocation" too.
On Fri, 4 May 2007 10:46:10 +0100, Christoph Hellwig wrote:
>
> Which means the right place to fix this is samba. Samba just need
> to intersept lseek and pread/pwrite to never allocate sparse files
> but do the right thing instead. Now what the right thing would probably
> be a preallocate instead of writing zeroes, and we need to provide the
> infrastructure for them to do it, which is in progress currently.
Why do preallocate and not just truncate the file? If the write is a
single 0x00 somewhere beyond EOF, as appears to be the pattern, truncate
will do just as well if not better. And it is available now.
Jörn
--
Joern's library part 6:
http://www.gzip.org/zlib/feldspar.html
Jörn Engel <[email protected]> wrote:
> On Fri, 4 May 2007 10:46:10 +0100, Christoph Hellwig wrote:
>> Which means the right place to fix this is samba. Samba just need
>> to intersept lseek and pread/pwrite to never allocate sparse files
>> but do the right thing instead. Now what the right thing would probably
>> be a preallocate instead of writing zeroes, and we need to provide the
>> infrastructure for them to do it, which is in progress currently.
>
> Why do preallocate and not just truncate the file?
If it's done by samba, it's racy. Only the kernel can reliably tell a
write-beyond-eof from a write-before-eof. Either it should unconditionally
turn these preallocation-writes into truncates, or have a flag which will
turn this feature on and which can be used to turn the lseek into a real
preallocation call.
I don't think unconditionally turning these writes into truncate would be
good, it would change the behaviour of dd bs=1 count=$(($n*$BLOCKSIZE+1)).
--
Top 100 things you don't want the sysadmin to say:
17. dd if=/dev/null of=/vmunix
Cabot, Mason B wrote:
> Philip:
>
> the best response I can offer is that we have traced the application's
> file system accesses and seen no such one-byte writes occuring at that
> level. They are generated somewhere below the application. Additionally,
> while we have observed iTunes on Windows issuing these one-byte writes,
> ethereal traces for iTunes on Mac OSX show no such behavior. Because of
> these observations I think it is reasonable to conclude that the Windows
> CIFS client is generating the one-byte writes.
Can you duplicate this behavior with a very simple test program, rather
than iTunes? Will something as simple as open() and write() with a 32
KB buffer of random data in a loop cause this behavior?
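A test program of the kind asked for here could be as small as the sketch below. The function name write_in_chunks() and the default sizes are illustrative; the idea is to run it on the Windows client with the output path on the Samba share while capturing the traffic, and then look for 1-byte writes that this program never issues itself.

```python
import os
import sys


def write_in_chunks(path, total_bytes, chunk_size=32 * 1024):
    """Sequentially write random data to `path` in fixed-size chunks."""
    written = 0
    # buffering=0 so each write() call reaches the OS as its own request.
    with open(path, "wb", buffering=0) as f:
        while written < total_bytes:
            n = min(chunk_size, total_bytes - written)
            written += f.write(os.urandom(n))
    return written


if __name__ == "__main__":
    # Usage: python writer.py <path-on-share> [size-in-MiB]
    target = sys.argv[1]
    size_mib = int(sys.argv[2]) if len(sys.argv) > 2 else 64
    print(write_in_chunks(target, size_mib * 1024 * 1024), "bytes written")
```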
> Subject: Ext3 vs NTFS performance
>
> Hello all,
>
> I've been testing the NAS performance of ext3/Openfiler 2.2 against
> NTFS/WinXP and have found that NTFS significantly outperforms ext3 for
> video workloads. The Windows CIFS client will attempt a poor-man's
> pre-allocation of the file on the server by sending 1-byte writes at
> 128K-byte strides, breaking block allocation on ext3 and leading to
> fragmentation and poor performance. This will happen for many
> applications (including iTunes) as the CIFS client issues these
> pre-allocates under the application layer.
>
> I've posted a brief paper on Intel's OSS website
> (http://softwarecommunity.intel.com/articles/eng/1259.htm). Please give
> it a read and let me know what you think. In particular, I'd like to
> arrive at the right place to fix this problem: is it in the filesystem,
> VFS, or Samba?
>
> thanks,
> Mason
>
> (please CC responses to mason dot b dot cabot at intel dot com)
>
Folks:
Thanks for the comments on the initial posting of this note. We've
looked further into the problem and found that Samba 3.0.20 or greater
closes the performance gap for ext3: the "strict allocate" flag now
zero-fills the file, forcing allocation in the underlying filesystem
and avoiding fragmentation.
An update to the original whitepaper will be posted soon to the same
location on Intel's OSS website.
thanks,
Mason
(please CC responses to mason dot b dot cabot at intel dot com)