Date: Mon, 11 Mar 2002 22:00:57 -0800
From: Andrew Morton
To: lkml
Subject: [CFT] delayed allocation and multipage I/O patches for 2.5.6.

[ Does anyone know what "CFT" means?  It means "call for testers".
  It doesn't mean "woo-hoo, it'll be neat when that's merged".  It
  means "help, help - there's no point in just one guy testing this"
  (thanks Randy). ]

This is an update of the delayed-allocation and multipage pagecache
I/O patches.  I'm calling this a beta, because it all works, and I
have other stuff to do for a while.

Of the thirteen patches, seven (dallocbase-* and tuning-*) are
applicable to the base 2.5.6 kernel.  You need to mount an ext2
filesystem with the `-o delalloc' mount option to turn on most of
the functionality.

The rolled-up patch is at

  http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6/everything.patch.gz

These patches do a ton of stuff.  Generally, the CPU overhead for
filesystem operations is decreased by about 40%.  Note "overhead":
this is after factoring out the constant copy_*_user overhead.  This
translates to a 15-25% reduction in CPU use for most workloads.

All benchmarks improve, to varying degrees.  The best case is two
instances of `dbench 64' against different disks, which went from 7
to 25 megabytes/sec.  This is due to better write layout patterns,
avoidance of synchronous reads in the writeback path, better memory
management and better management of writeback threads.

The patch breakdown is:

dallocbase-10-readahead

  Unifies the current three readahead functions (mmap reads, read(2)
  and sys_readahead) into a single implementation.

  More aggressive in building up the readahead windows.  More
  conservative in tearing them down.  Special start-of-file
  heuristics.

  Preallocates the readahead pages, to avoid the (never demonstrated,
  but potentially catastrophic) scenario where allocation of
  readahead pages causes the allocator to perform VM writeout.

  (Hidden agenda: gets all the readahead pages gathered together in
  one spot, so they can be marshalled into big BIOs.)

  Reinstates the readahead tuning ioctls, so hdparm(8) and
  blockdev(8) are working again.  The readahead settings are now
  per-request-queue, and the drivers never have to know about it.
  Big code cleanup.

  Identifies readahead thrashing.  Currently, it just shrinks the
  readahead window when thrashing occurs.  This greatly reduces the
  amount of pointless I/O which we perform, and will reduce the CPU
  load.  The idea is that the readahead window dynamically adjusts to
  a sustainable size.  Experimentally this improves things, but not
  hugely.  We really need drop-behind for read and write streams.  Or
  O_STREAMING, indeed.
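  As an aside, the reinstated readahead ioctls are the ones which
  blockdev(8) and hdparm(8) drive.  A minimal userspace sketch of
  poking them looks like this - just an illustration, not part of
  the patches, and /dev/hda is only an example device:

	/* Query and set the per-queue readahead window via
	 * the BLKRAGET/BLKRASET ioctls (roughly what
	 * `blockdev --getra/--setra' does). */
	#include <stdio.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	int main(void)
	{
		long ra;
		int fd = open("/dev/hda", O_RDONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		if (ioctl(fd, BLKRAGET, &ra) == 0)
			printf("current readahead: %ld sectors\n", ra);
		/* ask for a 256-sector (128k) window; needs CAP_SYS_ADMIN */
		if (ioctl(fd, BLKRASET, 256UL) < 0)
			perror("BLKRASET");
		close(fd);
		return 0;
	}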
dallocbase-15-pageprivate

  page->buffers is a bit of a layering violation.  Not all
  address_spaces have pages which are backed by buffers.  The
  exclusive use of page->buffers for buffers means that a piece of
  prime real estate in struct page is unavailable to other forms of
  address_space.

  This patch turns page->buffers into `unsigned long page->private'
  and sets in place all the infrastructure which is needed to allow
  other address_spaces to use this storage.

  With this change in place, the multipage-bio no-buffer_head code
  can use page->private to cache the results of an earlier
  get_block(), so repeated calls into the filesystem are not needed
  in the case of file overwriting.

dallocbase-20-page_accounting

  This patch provides global accounting of locked and dirty pages.
  It does this via lightweight per-CPU data structures.  The
  page_cache_size accounting has been changed to use this facility as
  well.

  Locked and dirty page accounting is needed for making writeback and
  throttling decisions in the delayed-allocation code.

dallocbase-30-pdflush

  This patch creates an adaptively-sized pool of writeback threads,
  called `pdflush'.  A simple algorithm is used to determine when new
  threads are needed, and when excess threads should be reaped.

  The kupdate and bdflush kernel threads are removed - the pdflush
  pool is used instead.  The (ab)use of keventd for writing back
  unused inodes has been removed - the pdflush pool is now used for
  that operation.

dalloc-10-core

  The core delayed allocation code.  There's a description in the
  dalloc-10-core.patch file (all the patches have descriptions).

dalloc-20-ext2

  Implements delayed allocation for ext2.

dalloc-30-ratcache

  The radix-tree pagecache patch.

mpage-10-biobits

  Small API extensions in the BIO layer which were needed for
  building the pagecache BIOs.

mpage-20-core

  The core multipage I/O layer.  This now implements multipage BIO
  reads into the pagecache.  Also caching of get_block() results at
  page->private.

  The get_block() result caching currently only applies if all of a
  page's blocks are laid out contiguously on disk.  Caching a
  discontiguous list of blocks at page->private is easy enough to do,
  but would require a memory allocation, and the requirement is so
  rare that I didn't bother.

mpage-30-ext2

  Implements multipage I/O for ext2.

tuning-10-request

  get_request() fairness for 2.5.x.  Avoids the situation where a
  thread sleeps for ages on the request queue while other threads
  whizz in and steal requests which they didn't wait for.

tuning-20-ext2-preread-inode

  When we create a new inode, preread its backing block.  Without
  this patch, many-inode writeout gets seriously stalled by having to
  read many individual inode table blocks.

tuning-30-read_latency

  read-latency2, ported from 2.4.  Intelligently promotes reads ahead
  of writes on the request queue, to prevent reads from being stalled
  for very long periods of time.

  Also reinstates the BLKELVGET and BLKELVSET ioctls, so `elvtune'
  may be used in 2.5.

  Also increases the size of the request queues, which allows better
  request merging.  This is acceptable now that reads are not heavily
  penalised by a large queue.
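  To give a feel for what read-latency2 does, here is a toy sketch
  of the idea - not the actual elevator code, and the queue array and
  MAX_JUMP bound are made up for illustration.  A newly queued read
  may jump ahead of a bounded number of already-queued writes, so a
  long write stream cannot starve it indefinitely:

	/* Toy illustration of read promotion with a bounded jump. */
	#include <stdio.h>

	#define QSIZE    32
	#define MAX_JUMP  4	/* how many writes a read may bypass */

	struct request { char rw; int sector; };

	static struct request queue[QSIZE];
	static int nr_requests;

	static void add_request(struct request rq)
	{
		int pos = nr_requests;

		if (rq.rw == 'R') {
			/* walk back past at most MAX_JUMP trailing writes */
			int jumped = 0;
			while (pos > 0 && queue[pos - 1].rw == 'W' &&
			       jumped < MAX_JUMP) {
				pos--;
				jumped++;
			}
		}
		/* shift the bypassed writes down and insert the request */
		for (int i = nr_requests; i > pos; i--)
			queue[i] = queue[i - 1];
		queue[pos] = rq;
		nr_requests++;
	}

	int main(void)
	{
		add_request((struct request){'W', 100});
		add_request((struct request){'W', 200});
		add_request((struct request){'R', 50});	/* promoted */

		for (int i = 0; i < nr_requests; i++)
			printf("%c %d\n", queue[i].rw, queue[i].sector);
		return 0;
	}

  The real code also respects a latency bound per bypassed request,
  but the bounded-jump idea is the essence of why reads stop getting
  stuck behind long write streams.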