From: Jesse Barnes
To: linux-kernel@vger.kernel.org
Cc: akpm@osdl.org
Subject: Re: Buffered I/O slowness
Date: Fri, 29 Oct 2004 10:46:48 -0700
Message-Id: <200410291046.48469.jbarnes@engr.sgi.com>
In-Reply-To: <200410251814.23273.jbarnes@engr.sgi.com>

On Monday, October 25, 2004 6:14 pm, Jesse Barnes wrote:
> I've been doing some simple disk I/O benchmarking with an eye towards
> improving large, striped volume bandwidth.  I ran some tests on individual
> disks and filesystems to establish a baseline and found that things
> generally scale quite well:
>
> o one thread/disk using O_DIRECT on the block device
>     read avg:  2784.81 MB/s
>     write avg: 2585.60 MB/s
>
> o one thread/disk using O_DIRECT + filesystem
>     read avg:  2635.98 MB/s
>     write avg: 2573.39 MB/s
>
> o one thread/disk using buffered I/O + filesystem
>     read w/default (128) block/*/queue/read_ahead_kb avg: 2626.25 MB/s
>     read w/max (4096) block/*/queue/read_ahead_kb avg:    2652.62 MB/s
>     write avg: 1394.99 MB/s
>
> Configuration:
> o 8p sn2 ia64 box
> o 8GB memory
> o 58 disks across 16 controllers
>   (4 disks for 10 of them and 3 for the other 6)
> o aggregate I/O bw available is about 2.8GB/s
>
> Test:
> o one I/O thread per disk, round-robined across the 8 CPUs
> o each thread did ~450MB of I/O depending on the test (ran for 10s)
>   Note: the total was > 8GB, so in the buffered read case not everything
>   could be cached

More results here.  I've run some tests on a large dm striped volume
formatted with XFS.  It had 64 disks with a 64k stripe unit (XFS was made
aware of this at format time), and I explicitly set the readahead using
blockdev to 524288 blocks.  The results aren't as bad as my previous runs,
but they're still much slower than I think they ought to be, given the
direct I/O results above.  This is after a fresh mount, so the pagecache
was empty when I started the tests.

o one thread on one large volume using buffered I/O + filesystem
    read  (1 thread, one volume, 131072 blocks/request) avg: ~931 MB/s
    write (1 thread, one volume, 131072 blocks/request) avg: ~908 MB/s

I'm intentionally issuing very large reads and writes here to take
advantage of the striping, but it looks like both the readahead and regular
buffered I/O code will split the I/O into page-sized chunks?  The call
chain is pretty long, but it looks to me like do_generic_mapping_read()
will split the reads up by page and issue them independently to the lower
levels.  In the direct I/O case, up to 64 pages are issued at a time, which
seems like it would help throughput quite a bit.  The profile seems to
confirm this.
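
For reference, the per-volume test is basically just a loop of very large
sequential reads.  Something like this minimal sketch (not the actual
harness I ran, just the shape of it; 131072 512-byte blocks = 64MB per
request, roughly a 10 second run, optional O_DIRECT for comparison):

/*
 * Simplified sketch of the single-threaded read test: sequential 64MB
 * reads against one device or file, reporting MB/s.  Not the actual
 * benchmark code; pass "direct" as the second argument to use O_DIRECT.
 */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define REQ_SIZE (131072UL * 512)       /* 64MB per read() */
#define RUN_SECS 10

int main(int argc, char **argv)
{
        int use_direct, fd;
        void *buf;
        unsigned long long total = 0;
        double elapsed = 0;
        struct timeval start, now;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <dev-or-file> [direct]\n", argv[0]);
                return 1;
        }
        use_direct = (argc > 2 && strcmp(argv[2], "direct") == 0);
        fd = open(argv[1], O_RDONLY | (use_direct ? O_DIRECT : 0));
        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* O_DIRECT wants an aligned buffer; align generously. */
        if (posix_memalign(&buf, 16384, REQ_SIZE)) {
                perror("posix_memalign");
                return 1;
        }

        gettimeofday(&start, NULL);
        while (elapsed < RUN_SECS) {
                ssize_t n = read(fd, buf, REQ_SIZE);
                if (n <= 0)
                        break;          /* EOF or error ends the run */
                total += n;
                gettimeofday(&now, NULL);
                elapsed = (now.tv_sec - start.tv_sec) +
                          (now.tv_usec - start.tv_usec) / 1e6;
        }

        if (elapsed > 0)
                printf("%llu bytes in %.2fs: %.2f MB/s\n", total, elapsed,
                       total / elapsed / (1024.0 * 1024.0));
        close(fd);
        free(buf);
        return 0;
}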
Unfortunately I didn't save the vmstat output for this run (and now the fc
switch is misbehaving, so I have to fix that before I run again), but iirc
the system time was pretty high given that only one thread was issuing I/O.

So maybe a few things need to be done:

o set readahead to larger values by default for dm volumes at setup time
  (the default was very small) -- a rough ioctl sketch follows the attached
  profile
o maybe bypass readahead for very large requests?  If the process is doing
  a huge request, chances are that readahead won't benefit it as much as it
  would a process doing small requests
o not sure about writes yet, I haven't looked at that call chain much

Does any of this sound reasonable at all?  What else could be done to make
the buffered I/O layer friendlier to large requests?

Thanks,
Jesse

[attachment: vol-buffered-read-profile.txt]

115383 total                          0.0203
 49642 ia64_pal_call_static          258.5521
 42065 default_idle                   93.8951
  7348 __copy_user                     3.1030
  5865 ia64_save_scratch_fpregs       91.6406
  5766 ia64_load_scratch_fpregs       90.0938
  1944 _spin_unlock_irq               30.3750
   352 _spin_unlock_irqrestore         3.6667
   231 buffered_rmqueue                0.1245
   225 kmem_cache_free                 0.5859
   151 mpage_end_io_read               0.2776
   147 __end_that_request_first        0.1242
   133 bio_alloc                       0.0967
   122 smp_call_function               0.1089
   102 shrink_list                     0.0224
    99 unlock_page                     0.4420
    86 free_hot_cold_page              0.0840
    82 kmem_cache_alloc                0.3203
    65 __alloc_pages                   0.0271
    53 do_mpage_readpage               0.0224
    53 bio_clone                       0.1380
    49 __might_sleep                   0.0806
    44 mpage_readpages                 0.0598
    43 generic_make_request            0.0345
    42 sn_pci_unmap_sg                 0.1010
    42 sn_dma_flush                    0.0597
    41 clear_page                      0.2562
    40 file_read_actor                 0.0431
    34 mark_page_accessed              0.0966
    32 __bio_add_page                  0.0278
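
P.S.  On the first suggestion above: the dm/lvm setup tools could presumably
bump the per-volume default themselves with the same BLKRASET ioctl that
blockdev --setra uses (units are 512-byte sectors).  A rough, illustrative
sketch, not tested against any particular dm setup:

/*
 * Rough sketch: set a block device's readahead programmatically via
 * BLKRASET (what blockdev --setra does), then read it back with BLKRAGET.
 * Units are 512-byte sectors.
 */
#include <fcntl.h>
#include <linux/fs.h>           /* BLKRASET, BLKRAGET */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        long ra = 0;
        int fd;

        if (argc < 3) {
                fprintf(stderr, "usage: %s <dev> <readahead-sectors>\n",
                        argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* BLKRASET takes the value directly, not a pointer. */
        if (ioctl(fd, BLKRASET, strtoul(argv[2], NULL, 0)) < 0) {
                perror("BLKRASET");
                return 1;
        }
        /* Read it back to confirm. */
        if (ioctl(fd, BLKRAGET, &ra) == 0)
                printf("%s: readahead is now %ld sectors\n", argv[1], ra);
        close(fd);
        return 0;
}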