On Fri, Jul 18, 2008 at 09:07:23AM +1000, Shehjar Tikoo wrote:
> J. Bruce Fields wrote:
>> On Wed, Jul 16, 2008 at 11:44:13AM +1000, Shehjar Tikoo wrote:
>>> Please see the attached patches for adding
>>> pre-allocation support into nfsd writes. Comments follow.
>>> a. 01_vfs_fallocate.patch
>>> Adds vfs_fallocate, which encapsulates the call to
>>> inode->i_op->fallocate. That call is currently made directly from
>>> sys_fallocate, which takes a file descriptor as its argument, but
>>> nfsd needs to operate on struct file pointers.
>>> b. 02_init_file_prealloc_limit.patch
>>> Adds a new member to struct file, to keep track of how much has been
>>> preallocated for this file. For now, adding to struct file seemed an
>>> easy way to keep per-file state about preallocation but this can be
>>> changed to use a nfsd-specific hash table that maps (dev, ino) to
>>> per-file pre-allocation state.
>>> c. 03_nfsd_fallocate.patch
>>> Wires in the call to vfs_fallocate into nfsd_vfs_write.
>>> For now, the function nfsd_get_prealloc_len uses a very simple
>>> method to determine when and how much to pre-allocate. This can change
>>> if needed.
>>> This patch also adds two module_params that control pre-allocation:
>>> 1. /sys/module/nfsd/parameters/nfsd_prealloc
>>> Determine whether to pre-allocate.
>>> 2. /sys/module/nfsd/parameters/nfsd_prealloc_len
>>> How much to pre-allocate. Default is 5Megs.
>> So, if I understand the algorithm right:
>> - Initialize f_prealloc_len to 0.
>> - Ignore any write(offset, cnt) contained entirely in the range
>> (0, f_prealloc_len).
>> - For any write outside that range, extend f_prealloc_len to
>> offset + 5MB and call vfs_fallocate(., ., offset, 5MB)
>> (where the 5MB is actually the configurable nfsd_prealloc_len parameter).
> Yes. However, it doesn't handle all the ways in which write requests
> can come in at the server, but the aim was to test for sequential
> writes as a proof of concept first.
>>> The patches are based against 220.127.116.11.
>>> See the following two plots for read and write performance, with and
>>> without pre-allocation support. Tests were run using iozone. The
>>> filesystem was ext4 with extents enabled. The testbed used two Itanium
>>> machines as client and server, connected through a Gbit network with
>>> jumbo frames enabled. The filesystem was aged with various iozone and
>>> kernel compilation workloads that consumed 45G of a 64G disk.
>>> Server side mount options:
>>> Client side mount options:
>>> 1. Read test
>> Sorry, I don't understand exactly what iozone is doing in this test (and
>> the below). Is it just doing sequential 64k reads (or, below, writes)
>> through a 2G file?
> Yes, write tests involve sequential writes with and without
> pre-allocation. The read tests read back the same file sequentially.
> So if we set nfsd_prealloc_len to 5MB, then the sequential writes
> will go to preallocated blocks of 5MB each. Once nfsd realizes
> that we've written into the previously pre-allocated block, it
> pre-allocates another 5MB block. The corresponding read test reads
> back the same file to determine the effect of 5MB preallocation on
> read throughput.
>>> Read throughput clearly benefits due to the contiguity of disk blocks.
>>> In the best case, i.e. with pre-allocation of 4 and 5 Mb during the
>>> writing of the test file, throughput, during read of the same
>>> file, more than doubles.
>>> 2. Write test
>>> Going just by read performance, pre-allocation would be a nice thing
>>> to have *but* note that write throughput also decreases drastically,
>>> by almost 10 Mb/sec with just 1Mb of pre-allocation.
>> So I guess it's not surprising--you're doing extra work at write time in
>> order to make the reads go faster.
> True. With ext4 it looks like the pre-allocation algorithm is not
> fast enough to help nfsd maintain the same throughput as the no
> pre-allocation case. XFS, with its B-tree oriented approach, might
> help, but this patch remains to be tested on XFS.
>> A general question: since this preallocation isn't already being done by
>> the filesystem, there must be some reason you think it's appropriate for
>> nfsd but not for other users. What makes nfsd special?
> Nothing special about nfsd. I've been looking at NFS performance, so
> that's what I focus on with this patch. As I said in an earlier email,
> the ideal way would be to incorporate pre-allocation into VFS for
> writes which need O_SYNC. The motivation to do that is not so high
> because both ext4 and XFS now do delayed allocation for buffered
> writes.
OK, fair enough. By the way, if you have code you want to merge at some
point, watch the style:
>>> + if(file->f_prealloc_limit > (offset + cnt))
We normally put a space after the "if" there.
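[That is, the line as it would appear after the kernel coding-style fix:

```c
if (file->f_prealloc_limit > (offset + cnt))
```
]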