Message-ID: <3E8889B4.FB716506@us.ibm.com>
Date: Mon, 31 Mar 2003 10:32:20 -0800
From: Janet Morgan <janetmor@us.ibm.com>
Reply-To: janetmor@us.ibm.com
MIME-Version: 1.0
To: akpm@digeo.com
CC: suparna@in.ibm.com, bcrl@redhat.com, linux-aio@kvack.org,
       linux-kernel@vger.kernel.org
Subject: Re: [Patch 2/2] Retry based aio read - filesystem read changes
References: <20030305144754.A1600@in.ibm.com> <20030305150026.B1627@in.ibm.com> <20030305024254.7f154afc.akpm@digeo.com> <20030305174452.A1882@in.ibm.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 15145
Lines: 369

> On Wed, Mar 05, 2003 at 02:42:54AM -0800, Andrew Morton wrote:
>  > Suparna Bhattacharya <suparna@in.ibm.com> wrote:
>  >
>  > +extern int FASTCALL(aio_wait_on_page_bit(struct page *page, int bit_nr));
>  > +static inline int aio_wait_on_page_locked(struct page *page)
>
>  Oh boy.
>
>  There are soooo many more places where we can block:
>
>  - write() -> down(&inode->i_sem)
>
>  - read()/write() -> read indirect block -> wait_on_buffer()
>
>  - read()/write() -> read bitmap block -> wait_on_buffer()
>
>  - write() -> allocate block -> mark_buffer_dirty() ->
>     balance_dirty_pages() -> writer throttling
>
>  - write() -> allocate block -> journal_get_write_access()
>
>  - read()/write() -> update_a/b/ctime() -> journal_get_write_access()
>
>  - ditto for other journalling filesystems
>
>  - read()/write() -> anything -> get_request_wait()
>    (This one can be avoided by polling the backing_dev_info congestion
>     flags)
>
>  - read()/write() -> page allocator -> blk_congestion_wait()
>
>  - write() -> balance_dirty_pages() -> writer throttling
>
>  - probably others.
>
>  Now, I assume that what you're looking for here is an 80% solution, but it
>  seems that a lot more changes will be needed to get even that far.

I'm trying to identify significant blocking points in the filesystem read
path.  I'm not sure what the best approach is, so figured I'd just track
callers of schedule() under a heavy dd workload.

I collected data while running 1000 processes, where each process was
using dd to sequentially read from a 1GB file.  I used 10 target files
in all, so 100 processes read from file1, 100 processes read from file2,
etc.  All these numbers were pretty much arbitrary.  I used stock 2.5.62
to test.

The top 3 callers of schedule based on my dd read workload are listed
below.  Together they accounted for 92% of all calls to schedule.
Suparna's filesystem aio patch already modifies 2 of these 3 callpoints
to be "retryable".  So the remaining candidate is the call to cond_resched
from do_generic_mapping_read.  Question is whether this qualifies as the
sort of thing that should cause a retry, i.e., is cond_resched a kind of
voluntary yield/preemption point which may not even result in a context
switch if there is nothing more preferable to run?   And even if the call to
cond_resched is not cause for retry, the profile data seems to indicate that
Suparna's patch is roughly an 80% solution (unless I'm missing something here,
which is entirely possible ;-).

So here are the call statistics:

  70% of all calls to schedule were from __lock_page:
        Based on the profile data, almost all calls to _lock_page were
        from do_generic_mapping_read (see filemap.c/Line 101 below).
        Suparna's patch already retries here.

  15% of all calls to schedule were from __cond_resched:
        do_generic_mapping_read -> cond_resched -> __cond_resched
        (see filemap.c/Line 41 below).
        Should this be made retryable ???

   7% of all calls to schedule were from wait_on_page_bit:
        Based on the profile data, almost all calls to wait_on_page_bit were
        from do_generic_mapping_read->wait_on_page_locked -> wait_on_page_bit
        (see filemap.c/Line 123 below).  Suparna's patch covers this
        callpoint, too.


Here's some detail....

Callers of schedule()

     7768210  Total calls to schedule
   ------------
     5440195  __lock_page+176
     1143862  do_generic_mapping_read+d6
      569139  wait_on_page_bit+17e
      488273  work_resched+5
       43414  __wait_on_buffer+14a
       23594  blk_congestion_wait+99
       21171  worker_thread+14f
        9838  cpu_idle+6d
        6308  do_select+29b
        5178  schedule_timeout+b8
        4136  sys_sched_yield+d9
        2096  kswapd+122
        1796  sleep_on+55
        1767  sys_nanosleep+f8
        1655  do_exit+4eb
        1643  wait_for_completion+d2
        1365  interruptible_sleep_on+55
        1034  pipe_wait+98
        1004  balanced_irq+6d
         975  ksoftirqd+108
         971  journal_stop+107
         884  sys_wait4+2ff
         780
         658  unix_wait_for_peer+d6
         576  read_chan+52c
         580  __pdflush+d2
         105  do_poll+ec
         102  __down+9e
          71  tty_wait_until_sent+f2
          59  ksoftirqd+b5
          55  __down_interruptible+d8
          41
          16  migration_thread+cc
          10  uart_wait_until_sent+d6
           9  unix_stream_data_wait+109
           8  tcp_data_wait+16c
           4  wait_for_packet+140
           3  wait_til_done+110
           2  write_chan+280
           2  generic_file_aio_write_nolock+bf4
           2  journal_commit_transaction+97a
           1  usb_hub_thread+c0
           1  jfs_lazycommit+1c6
           1  serio_thread+e0
           1  jfs_sync+259
           1  jfsIOWait+15d
           1  _lock_fdc+12f
           1  acpi_ex_system_do_suspend+3d
           1  init_pcmcia_ds+25e


readprofile:

10058989 total                                  2.9520
5982060 __copy_to_user_ll                       41542.0833
928000 do_generic_mapping_read                  698.7952
447968 poll_idle                                4666.3333
301263 file_read_actor                          1176.8086
265660 radix_tree_lookup                        1844.8611
209991 page_cache_readahead                     410.1387
129655 vfs_read                                 311.6707
117666 kmap_atomic                              919.2656
111874 fget                                     1165.3542
110996 __generic_file_aio_read                  182.5592
109402 kunmap_atomic                            3418.8125
 95092 vfs_write                                228.5865
 76119 mark_page_accessed                       679.6339
 69564 current_kernel_time                      724.6250
 52569 fput                                     1095.1875
 51824 update_atime                             231.3571
 51375 activate_page                            267.5781
 46577 scsi_request_fn                           61.9375
 41731 generic_file_read                        237.1080
 39300 shrink_cache                              38.9881
 37658 schedule                                  26.1514
 37501 refill_inactive_zone                      22.3220
 37325 scsi_dispatch_cmd                         70.6913
 32762 unlock_page                              292.5179
 31924 buffered_rmqueue                          86.7500
 30454 __make_request                            21.3862
 28532 do_page_cache_readahead                   54.0379
 27783 delay_tsc                                434.1094
 27723 ext2_get_branch                           69.3075
 25726 shrink_list                               13.1793
 25246 __find_get_block                          63.1150
 24443 mpage_end_io_read                        169.7431
 23800 add_to_page_cache                         87.5000
 20669 do_softirq                                71.7674
 19895 system_call                              452.1591
 18906 do_mpage_readpage                         14.4101
 18783 page_referenced                           73.3711
 18137 free_hot_cold_page                        59.6612
 18015 __wake_up                                225.1875
 13790 radix_tree_insert                         53.8672
 11227 radix_tree_delete                         31.8949
 11188 __alloc_pages                             11.4631
 10885 __brelse                                 136.0625
 10699 write_null                               668.6875
 10299 __pagevec_lru_add                         35.7604
 10100 page_address                              42.0833
  9968 ext2_get_block                             7.5976


2.5.62 filemap.c/do_generic_mapping_read:
     1  /*
     2   * This is a generic file read routine, and uses the
     3   * inode->i_op->readpage() function for the actual low-level
     4   * stuff.
     5   *
     6   * This is really ugly. But the goto's actually try to clarify some
     7   * of the logic when it comes to error handling etc.
     8   * - note the struct file * is only passed for the use of readpage
     9   */
    10  void do_generic_mapping_read(struct address_space *mapping,
    11                               struct file_ra_state *ra,
    12                               struct file * filp,
    13                               loff_t *ppos,
    14                               read_descriptor_t * desc,
    15                               read_actor_t actor)
    16  {
    17          struct inode *inode = mapping->host;
    18          unsigned long index, offset;
    19          struct page *cached_page;
    20          int error;
    21
    22          cached_page = NULL;
    23          index = *ppos >> PAGE_CACHE_SHIFT;
    24          offset = *ppos & ~PAGE_CACHE_MASK;
    25
    26          for (;;) {
    27                  struct page *page;
    28                  unsigned long end_index, nr, ret;
    29
    30                  end_index = inode->i_size >> PAGE_CACHE_SHIFT;
    31
    32                  if (index > end_index)
    33                          break;
    34                  nr = PAGE_CACHE_SIZE;
    35                  if (index == end_index) {
    36                          nr = inode->i_size & ~PAGE_CACHE_MASK;
    37                          if (nr <= offset)
    38                                  break;
    39                  }
    40
    41                  cond_resched();
    42                  page_cache_readahead(mapping, ra, filp, index);
    43
    44                  nr = nr - offset;
    45
    46                  /*
    47                   * Try to find the data in the page cache..
    48                   */
    49  find_page:
    50                  read_lock(&mapping->page_lock);
    51                  page = radix_tree_lookup(&mapping->page_tree, index);
    52                  if (!page) {
    53                          read_unlock(&mapping->page_lock);
    54                          handle_ra_miss(mapping,ra);
    55                          goto no_cached_page;
    56                  }
    57                  page_cache_get(page);
    58                  read_unlock(&mapping->page_lock);
    59
    60                  if (!PageUptodate(page))
    61                          goto page_not_up_to_date;
    62  page_ok:
    63                  /* If users can be writing to this page using arbitrary
    64                   * virtual addresses, take care about potential aliasing
    65                   * before reading the page on the kernel side.
    66                   */
    67                  if (!list_empty(&mapping->i_mmap_shared))
    68                          flush_dcache_page(page);
    69
    70                  /*
    71                   * Mark the page accessed if we read the beginning.
    72                   */
    73                  if (!offset)
    74                          mark_page_accessed(page);
    75
    76                  /*
    77                   * Ok, we have the page, and it's up-to-date, so
    78                   * now we can copy it to user space...
    79                   *
    80                   * The actor routine returns how many bytes were actually
used..
    81                   * NOTE! This may not be the same as how much of a user
buffer
    82                   * we filled up (we may be padding etc), so we can only
update
    83                   * "pos" here (the actor routine has to update the user
buffer
    84                   * pointers and the remaining count).
    85                   */  86                  ret = actor(desc, page, offset,
nr);
    87                  offset += ret;
    88                  index += offset >> PAGE_CACHE_SHIFT;
    89                  offset &= ~PAGE_CACHE_MASK;
    90
    91                  page_cache_release(page);
    92                  if (ret == nr && desc->count)
    93                          continue;
    94                  break;
    95
    96  page_not_up_to_date:
    97                  if (PageUptodate(page))
    98                          goto page_ok;
    99
   100                  /* Get exclusive access to the page ... */
   101                  lock_page(page);
   102
   103                  /* Did it get unhashed before we got the lock? */
   104                  if (!page->mapping) {
   105                          unlock_page(page);
   106                          page_cache_release(page);
   107                          continue;
   108                  }
   109
   110                  /* Did somebody else fill it already? */
   111                  if (PageUptodate(page)) {
   112                          unlock_page(page);
   113                          goto page_ok;
   114                  }
   115
   116  readpage:
   117                  /* ... and start the actual read. The read will unlock the
page. */
   118                  error = mapping->a_ops->readpage(filp, page);
   119
   120                  if (!error) {
   121                          if (PageUptodate(page))
   122                                  goto page_ok;
   123                          wait_on_page_locked(page);
   124                          if (PageUptodate(page))
   125                                  goto page_ok;
   126                          error = -EIO;
   127                  }
   128
   129                  /* UHHUH! A synchronous read error occurred. Report it */
   130                  desc->error = error;
   131                  page_cache_release(page);
   132                  break;
   133
   134  no_cached_page:
   135                  /*
   136                   * Ok, it wasn't cached, so we need to create a new
   137                   * page..
   138                   */
   139                  if (!cached_page) {
   140                          cached_page = page_cache_alloc_cold(mapping);
   141                          if (!cached_page) {
   142                                  desc->error = -ENOMEM;
   143                                  break;
   144                          }
   145                  }
   146                  error = add_to_page_cache_lru(cached_page, mapping,
   147                                                  index, GFP_KERNEL);
   148                  if (error) {
   149                          if (error == -EEXIST)
   150                                  goto find_page;
   151                          desc->error = error;
   152                          break;
   153                  }
   154                  page = cached_page;
   155                  cached_page = NULL;
   156                  goto readpage;
   157          }
   158
   159          *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
   160          if (cached_page)
   161                  page_cache_release(cached_page);
   162          UPDATE_ATIME(inode);
   163  }


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/