2006-05-29 09:34:39

by Nick Piggin

[permalink] [raw]
Subject: [rfc][patch] remove racy sync_page?

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c 2006-05-19 12:49:07.000000000 +1000
+++ linux-2.6/mm/filemap.c 2006-05-29 18:56:07.000000000 +1000
@@ -134,42 +134,6 @@ void remove_from_page_cache(struct page
write_unlock_irq(&mapping->tree_lock);
}

-static int sync_page(void *word)
-{
- struct address_space *mapping;
- struct page *page;
-
- page = container_of((unsigned long *)word, struct page, flags);
-
- /*
- * page_mapping() is being called without PG_locked held.
- * Some knowledge of the state and use of the page is used to
- * reduce the requirements down to a memory barrier.
- * The danger here is of a stale page_mapping() return value
- * indicating a struct address_space different from the one it's
- * associated with when it is associated with one.
- * After smp_mb(), it's either the correct page_mapping() for
- * the page, or an old page_mapping() and the page's own
- * page_mapping() has gone NULL.
- * The ->sync_page() address_space operation must tolerate
- * page_mapping() going NULL. By an amazing coincidence,
- * this comes about because none of the users of the page
- * in the ->sync_page() methods make essential use of the
- * page_mapping(), merely passing the page down to the backing
- * device's unplug functions when it's non-NULL, which in turn
- * ignore it for all cases but swap, where only page_private(page) is
- * of interest. When page_mapping() does go NULL, the entire
- * call stack gracefully ignores the page and returns.
- * -- wli
- */
- smp_mb();
- mapping = page_mapping(page);
- if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
- mapping->a_ops->sync_page(page);
- io_schedule();
- return 0;
-}
-
/**
* filemap_fdatawrite_range - start writeback against all of a mapping's
* dirty pages that lie within the byte offsets <start, end>
@@ -456,6 +420,12 @@ struct page *page_cache_alloc_cold(struc
EXPORT_SYMBOL(page_cache_alloc_cold);
#endif

+static int __sleep_on_page_bit(void *word)
+{
+ io_schedule();
+ return 0;
+}
+
/*
* In order to wait for pages to become available there must be
* waitqueues associated with pages. By using a hash table of
@@ -483,7 +453,7 @@ void fastcall wait_on_page_bit(struct pa
DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);

if (test_bit(bit_nr, &page->flags))
- __wait_on_bit(page_waitqueue(page), &wait, sync_page,
+ __wait_on_bit(page_waitqueue(page), &wait, __sleep_on_page_bit,
TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL(wait_on_page_bit);
@@ -529,17 +499,12 @@ EXPORT_SYMBOL(end_page_writeback);

/*
* Get a lock on the page, assuming we need to sleep to get it.
- *
- * Ugly: running sync_page() in state TASK_UNINTERRUPTIBLE is scary. If some
- * random driver's requestfn sets TASK_RUNNING, we could busywait. However
- * chances are that on the second loop, the block layer's plug list is empty,
- * so sync_page() will then return in state TASK_UNINTERRUPTIBLE.
*/
void fastcall __lock_page(struct page *page)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

- __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
+ __wait_on_bit_lock(page_waitqueue(page), &wait, __sleep_on_page_bit,
TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL(__lock_page);
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c 2006-05-19 12:49:03.000000000 +1000
+++ linux-2.6/fs/block_dev.c 2006-05-29 18:51:56.000000000 +1000
@@ -1080,7 +1080,6 @@ static long block_ioctl(struct file *fil
struct address_space_operations def_blk_aops = {
.readpage = blkdev_readpage,
.writepage = blkdev_writepage,
- .sync_page = block_sync_page,
.prepare_write = blkdev_prepare_write,
.commit_write = blkdev_commit_write,
.writepages = generic_writepages,
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c 2006-04-15 20:19:49.000000000 +1000
+++ linux-2.6/fs/buffer.c 2006-05-29 18:52:04.000000000 +1000
@@ -3011,16 +3011,6 @@ out:
}
EXPORT_SYMBOL(try_to_free_buffers);

-void block_sync_page(struct page *page)
-{
- struct address_space *mapping;
-
- smp_mb();
- mapping = page_mapping(page);
- if (mapping)
- blk_run_backing_dev(mapping->backing_dev_info, page);
-}
-
/*
* There are no bdflush tunables left. But distributions are
* still running obsolete flush daemons, so we terminate them here.
@@ -3164,7 +3154,6 @@ EXPORT_SYMBOL(__wait_on_buffer);
EXPORT_SYMBOL(block_commit_write);
EXPORT_SYMBOL(block_prepare_write);
EXPORT_SYMBOL(block_read_full_page);
-EXPORT_SYMBOL(block_sync_page);
EXPORT_SYMBOL(block_truncate_page);
EXPORT_SYMBOL(block_write_full_page);
EXPORT_SYMBOL(cont_prepare_write);
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h 2006-04-15 20:19:55.000000000 +1000
+++ linux-2.6/include/linux/buffer_head.h 2006-05-29 18:52:10.000000000 +1000
@@ -203,7 +203,6 @@ int cont_prepare_write(struct page*, uns
int generic_cont_expand(struct inode *inode, loff_t size);
int generic_cont_expand_simple(struct inode *inode, loff_t size);
int block_commit_write(struct page *page, unsigned from, unsigned to);
-void block_sync_page(struct page *);
sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *);
int generic_commit_write(struct file *, struct page *, unsigned, unsigned);
int block_truncate_page(struct address_space *, loff_t, get_block_t *);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h 2006-05-29 18:42:11.000000000 +1000
+++ linux-2.6/include/linux/fs.h 2006-05-29 18:52:16.000000000 +1000
@@ -351,7 +351,6 @@ struct writeback_control;
struct address_space_operations {
int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*readpage)(struct file *, struct page *);
- void (*sync_page)(struct page *);

/* Write back some dirty pages from this mapping. */
int (*writepages)(struct address_space *, struct writeback_control *);
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c 2006-04-15 20:19:59.000000000 +1000
+++ linux-2.6/mm/swap_state.c 2006-05-29 18:52:56.000000000 +1000
@@ -21,12 +21,11 @@

/*
* swapper_space is a fiction, retained to simplify the path through
- * vmscan's shrink_list, to make sync_page look nicer, and to allow
- * future use of radix_tree tags in the swap cache.
+ * vmscan's shrink_list and to allow future use of radix_tree tags in
+ * the swap cache.
*/
static struct address_space_operations swap_aops = {
.writepage = swap_writepage,
- .sync_page = block_sync_page,
.set_page_dirty = __set_page_dirty_nobuffers,
.migratepage = migrate_page,
};
Index: linux-2.6/fs/adfs/inode.c
===================================================================
--- linux-2.6.orig/fs/adfs/inode.c 2005-03-02 19:38:15.000000000 +1100
+++ linux-2.6/fs/adfs/inode.c 2006-05-29 18:57:32.000000000 +1000
@@ -75,7 +75,6 @@ static sector_t _adfs_bmap(struct addres
static struct address_space_operations adfs_aops = {
.readpage = adfs_readpage,
.writepage = adfs_writepage,
- .sync_page = block_sync_page,
.prepare_write = adfs_prepare_write,
.commit_write = generic_commit_write,
.bmap = _adfs_bmap
Index: linux-2.6/fs/affs/file.c
===================================================================
--- linux-2.6.orig/fs/affs/file.c 2006-04-15 20:19:49.000000000 +1000
+++ linux-2.6/fs/affs/file.c 2006-05-29 18:57:55.000000000 +1000
@@ -409,7 +409,6 @@ static sector_t _affs_bmap(struct addres
struct address_space_operations affs_aops = {
.readpage = affs_readpage,
.writepage = affs_writepage,
- .sync_page = block_sync_page,
.prepare_write = affs_prepare_write,
.commit_write = generic_commit_write,
.bmap = _affs_bmap
@@ -762,7 +761,6 @@ out:
struct address_space_operations affs_aops_ofs = {
.readpage = affs_readpage_ofs,
//.writepage = affs_writepage_ofs,
- //.sync_page = affs_sync_page_ofs,
.prepare_write = affs_prepare_write_ofs,
.commit_write = affs_commit_write_ofs
};
Index: linux-2.6/fs/afs/file.c
===================================================================
--- linux-2.6.orig/fs/afs/file.c 2006-04-15 20:19:49.000000000 +1000
+++ linux-2.6/fs/afs/file.c 2006-05-29 18:58:16.000000000 +1000
@@ -37,7 +37,6 @@ struct inode_operations afs_file_inode_o

struct address_space_operations afs_fs_aops = {
.readpage = afs_file_readpage,
- .sync_page = block_sync_page,
.set_page_dirty = __set_page_dirty_nobuffers,
.releasepage = afs_file_releasepage,
.invalidatepage = afs_file_invalidatepage,
Index: linux-2.6/fs/befs/linuxvfs.c
===================================================================
--- linux-2.6.orig/fs/befs/linuxvfs.c 2006-04-15 20:19:49.000000000 +1000
+++ linux-2.6/fs/befs/linuxvfs.c 2006-05-29 18:58:01.000000000 +1000
@@ -75,7 +75,6 @@ static struct inode_operations befs_dir_

static struct address_space_operations befs_aops = {
.readpage = befs_readpage,
- .sync_page = block_sync_page,
.bmap = befs_bmap,
};

Index: linux-2.6/fs/bfs/file.c
===================================================================
--- linux-2.6.orig/fs/bfs/file.c 2006-04-15 20:19:49.000000000 +1000
+++ linux-2.6/fs/bfs/file.c 2006-05-29 18:57:33.000000000 +1000
@@ -156,7 +156,6 @@ static sector_t bfs_bmap(struct address_
struct address_space_operations bfs_aops = {
.readpage = bfs_readpage,
.writepage = bfs_writepage,
- .sync_page = block_sync_page,
.prepare_write = bfs_prepare_write,
.commit_write = generic_commit_write,
.bmap = bfs_bmap,
Index: linux-2.6/fs/cifs/file.c
===================================================================
--- linux-2.6.orig/fs/cifs/file.c 2006-04-30 19:23:29.000000000 +1000
+++ linux-2.6/fs/cifs/file.c 2006-05-29 18:58:18.000000000 +1000
@@ -1383,34 +1383,6 @@ int cifs_fsync(struct file *file, struct
return rc;
}

-/* static void cifs_sync_page(struct page *page)
-{
- struct address_space *mapping;
- struct inode *inode;
- unsigned long index = page->index;
- unsigned int rpages = 0;
- int rc = 0;
-
- cFYI(1, ("sync page %p",page));
- mapping = page->mapping;
- if (!mapping)
- return 0;
- inode = mapping->host;
- if (!inode)
- return; */
-
-/* fill in rpages then
- result = cifs_pagein_inode(inode, index, rpages); */ /* BB finish */
-
-/* cFYI(1, ("rpages is %d for sync page of Index %ld ", rpages, index));
-
-#if 0
- if (rc < 0)
- return rc;
- return 0;
-#endif
-} */
-
/*
* As file closes, flush all cached write data for this inode checking
* for write behind errors.
@@ -1952,6 +1924,5 @@ struct address_space_operations cifs_add
.prepare_write = cifs_prepare_write,
.commit_write = cifs_commit_write,
.set_page_dirty = __set_page_dirty_nobuffers,
- /* .sync_page = cifs_sync_page, */
/* .direct_IO = */
};
Index: linux-2.6/fs/efs/inode.c
===================================================================
--- linux-2.6.orig/fs/efs/inode.c 2004-10-19 17:20:32.000000000 +1000
+++ linux-2.6/fs/efs/inode.c 2006-05-29 18:57:40.000000000 +1000
@@ -23,7 +23,6 @@ static sector_t _efs_bmap(struct address
}
static struct address_space_operations efs_aops = {
.readpage = efs_readpage,
- .sync_page = block_sync_page,
.bmap = _efs_bmap
};

Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c 2006-04-15 20:19:49.000000000 +1000
+++ linux-2.6/fs/ext2/inode.c 2006-05-29 18:57:25.000000000 +1000
@@ -688,7 +688,6 @@ struct address_space_operations ext2_aop
.readpage = ext2_readpage,
.readpages = ext2_readpages,
.writepage = ext2_writepage,
- .sync_page = block_sync_page,
.prepare_write = ext2_prepare_write,
.commit_write = generic_commit_write,
.bmap = ext2_bmap,
@@ -706,7 +705,6 @@ struct address_space_operations ext2_nob
.readpage = ext2_readpage,
.readpages = ext2_readpages,
.writepage = ext2_nobh_writepage,
- .sync_page = block_sync_page,
.prepare_write = ext2_nobh_prepare_write,
.commit_write = nobh_commit_write,
.bmap = ext2_bmap,
Index: linux-2.6/fs/ext3/inode.c
===================================================================
--- linux-2.6.orig/fs/ext3/inode.c 2006-05-19 12:49:03.000000000 +1000
+++ linux-2.6/fs/ext3/inode.c 2006-05-29 18:57:51.000000000 +1000
@@ -1703,7 +1703,6 @@ static struct address_space_operations e
.readpage = ext3_readpage,
.readpages = ext3_readpages,
.writepage = ext3_ordered_writepage,
- .sync_page = block_sync_page,
.prepare_write = ext3_prepare_write,
.commit_write = ext3_ordered_commit_write,
.bmap = ext3_bmap,
@@ -1717,7 +1716,6 @@ static struct address_space_operations e
.readpage = ext3_readpage,
.readpages = ext3_readpages,
.writepage = ext3_writeback_writepage,
- .sync_page = block_sync_page,
.prepare_write = ext3_prepare_write,
.commit_write = ext3_writeback_commit_write,
.bmap = ext3_bmap,
@@ -1731,7 +1729,6 @@ static struct address_space_operations e
.readpage = ext3_readpage,
.readpages = ext3_readpages,
.writepage = ext3_journalled_writepage,
- .sync_page = block_sync_page,
.prepare_write = ext3_prepare_write,
.commit_write = ext3_journalled_commit_write,
.set_page_dirty = ext3_journalled_set_page_dirty,
Index: linux-2.6/fs/fat/inode.c
===================================================================
--- linux-2.6.orig/fs/fat/inode.c 2006-04-15 20:19:49.000000000 +1000
+++ linux-2.6/fs/fat/inode.c 2006-05-29 18:57:44.000000000 +1000
@@ -201,7 +201,6 @@ static struct address_space_operations f
.readpages = fat_readpages,
.writepage = fat_writepage,
.writepages = fat_writepages,
- .sync_page = block_sync_page,
.prepare_write = fat_prepare_write,
.commit_write = fat_commit_write,
.direct_IO = fat_direct_IO,
Index: linux-2.6/fs/freevxfs/vxfs_subr.c
===================================================================
--- linux-2.6.orig/fs/freevxfs/vxfs_subr.c 2005-09-02 17:01:44.000000000 +1000
+++ linux-2.6/fs/freevxfs/vxfs_subr.c 2006-05-29 18:57:46.000000000 +1000
@@ -45,7 +45,6 @@ static sector_t vxfs_bmap(struct addres
struct address_space_operations vxfs_aops = {
.readpage = vxfs_readpage,
.bmap = vxfs_bmap,
- .sync_page = block_sync_page,
};

inline void
Index: linux-2.6/fs/hfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hfs/inode.c 2006-04-15 20:19:49.000000000 +1000
+++ linux-2.6/fs/hfs/inode.c 2006-05-29 18:58:48.000000000 +1000
@@ -117,7 +117,6 @@ static int hfs_writepages(struct address
struct address_space_operations hfs_btree_aops = {
.readpage = hfs_readpage,
.writepage = hfs_writepage,
- .sync_page = block_sync_page,
.prepare_write = hfs_prepare_write,
.commit_write = generic_commit_write,
.bmap = hfs_bmap,
@@ -127,7 +126,6 @@ struct address_space_operations hfs_btre
struct address_space_operations hfs_aops = {
.readpage = hfs_readpage,
.writepage = hfs_writepage,
- .sync_page = block_sync_page,
.prepare_write = hfs_prepare_write,
.commit_write = generic_commit_write,
.bmap = hfs_bmap,
Index: linux-2.6/fs/hfsplus/inode.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/inode.c 2006-04-15 20:19:49.000000000 +1000
+++ linux-2.6/fs/hfsplus/inode.c 2006-05-29 18:58:23.000000000 +1000
@@ -112,7 +112,6 @@ static int hfsplus_writepages(struct add
struct address_space_operations hfsplus_btree_aops = {
.readpage = hfsplus_readpage,
.writepage = hfsplus_writepage,
- .sync_page = block_sync_page,
.prepare_write = hfsplus_prepare_write,
.commit_write = generic_commit_write,
.bmap = hfsplus_bmap,
@@ -122,7 +121,6 @@ struct address_space_operations hfsplus_
struct address_space_operations hfsplus_aops = {
.readpage = hfsplus_readpage,
.writepage = hfsplus_writepage,
- .sync_page = block_sync_page,
.prepare_write = hfsplus_prepare_write,
.commit_write = generic_commit_write,
.bmap = hfsplus_bmap,
Index: linux-2.6/fs/hpfs/file.c
===================================================================
--- linux-2.6.orig/fs/hpfs/file.c 2006-04-15 20:19:49.000000000 +1000
+++ linux-2.6/fs/hpfs/file.c 2006-05-29 18:57:38.000000000 +1000
@@ -102,7 +102,6 @@ static sector_t _hpfs_bmap(struct addres
struct address_space_operations hpfs_aops = {
.readpage = hpfs_readpage,
.writepage = hpfs_writepage,
- .sync_page = block_sync_page,
.prepare_write = hpfs_prepare_write,
.commit_write = generic_commit_write,
.bmap = _hpfs_bmap
Index: linux-2.6/fs/isofs/compress.c
===================================================================
--- linux-2.6.orig/fs/isofs/compress.c 2005-09-02 17:01:46.000000000 +1000
+++ linux-2.6/fs/isofs/compress.c 2006-05-29 18:57:00.000000000 +1000
@@ -314,7 +314,6 @@ eio:

struct address_space_operations zisofs_aops = {
.readpage = zisofs_readpage,
- /* No sync_page operation supported? */
/* No bmap operation supported */
};

Index: linux-2.6/fs/isofs/inode.c
===================================================================
--- linux-2.6.orig/fs/isofs/inode.c 2006-04-15 20:19:49.000000000 +1000
+++ linux-2.6/fs/isofs/inode.c 2006-05-29 18:57:01.000000000 +1000
@@ -1054,7 +1054,6 @@ static sector_t _isofs_bmap(struct addre

static struct address_space_operations isofs_aops = {
.readpage = isofs_readpage,
- .sync_page = block_sync_page,
.bmap = _isofs_bmap
};

Index: linux-2.6/fs/jfs/inode.c
===================================================================
--- linux-2.6.orig/fs/jfs/inode.c 2006-04-15 20:19:50.000000000 +1000
+++ linux-2.6/fs/jfs/inode.c 2006-05-29 18:56:52.000000000 +1000
@@ -310,7 +310,6 @@ struct address_space_operations jfs_aops
.readpages = jfs_readpages,
.writepage = jfs_writepage,
.writepages = jfs_writepages,
- .sync_page = block_sync_page,
.prepare_write = jfs_prepare_write,
.commit_write = nobh_commit_write,
.bmap = jfs_bmap,
Index: linux-2.6/fs/jfs/jfs_metapage.c
===================================================================
--- linux-2.6.orig/fs/jfs/jfs_metapage.c 2006-05-29 18:42:11.000000000 +1000
+++ linux-2.6/fs/jfs/jfs_metapage.c 2006-05-29 18:56:53.000000000 +1000
@@ -580,7 +580,6 @@ static void metapage_invalidatepage(stru
struct address_space_operations jfs_metapage_aops = {
.readpage = metapage_readpage,
.writepage = metapage_writepage,
- .sync_page = block_sync_page,
.releasepage = metapage_releasepage,
.invalidatepage = metapage_invalidatepage,
.set_page_dirty = __set_page_dirty_nobuffers,
Index: linux-2.6/fs/minix/inode.c
===================================================================
--- linux-2.6.orig/fs/minix/inode.c 2006-04-15 20:19:50.000000000 +1000
+++ linux-2.6/fs/minix/inode.c 2006-05-29 18:57:57.000000000 +1000
@@ -338,7 +338,6 @@ static sector_t minix_bmap(struct addres
static struct address_space_operations minix_aops = {
.readpage = minix_readpage,
.writepage = minix_writepage,
- .sync_page = block_sync_page,
.prepare_write = minix_prepare_write,
.commit_write = generic_commit_write,
.bmap = minix_bmap
Index: linux-2.6/fs/ntfs/aops.c
===================================================================
--- linux-2.6.orig/fs/ntfs/aops.c 2006-04-15 20:19:51.000000000 +1000
+++ linux-2.6/fs/ntfs/aops.c 2006-05-29 18:57:21.000000000 +1000
@@ -1546,8 +1546,6 @@ err_out:
*/
struct address_space_operations ntfs_aops = {
.readpage = ntfs_readpage, /* Fill page with data. */
- .sync_page = block_sync_page, /* Currently, just unplugs the
- disk request queue. */
#ifdef NTFS_RW
.writepage = ntfs_writepage, /* Write dirty page to disk. */
#endif /* NTFS_RW */
@@ -1562,8 +1560,6 @@ struct address_space_operations ntfs_aop
*/
struct address_space_operations ntfs_mst_aops = {
.readpage = ntfs_readpage, /* Fill page with data. */
- .sync_page = block_sync_page, /* Currently, just unplugs the
- disk request queue. */
#ifdef NTFS_RW
.writepage = ntfs_writepage, /* Write dirty page to disk. */
.set_page_dirty = __set_page_dirty_nobuffers, /* Set the page dirty
Index: linux-2.6/fs/ntfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ntfs/inode.c 2006-04-15 20:19:51.000000000 +1000
+++ linux-2.6/fs/ntfs/inode.c 2006-05-29 18:57:12.000000000 +1000
@@ -1830,7 +1830,7 @@ int ntfs_read_inode_mount(struct inode *
/* Need this to sanity check attribute list references to $MFT. */
vi->i_generation = ni->seq_no = le16_to_cpu(m->sequence_number);

- /* Provides readpage() and sync_page() for map_mft_record(). */
+ /* Provides readpage() for map_mft_record(). */
vi->i_mapping->a_ops = &ntfs_mst_aops;

ctx = ntfs_attr_get_search_ctx(ni, m);
Index: linux-2.6/fs/ocfs2/aops.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/aops.c 2006-05-29 18:42:11.000000000 +1000
+++ linux-2.6/fs/ocfs2/aops.c 2006-05-29 18:58:25.000000000 +1000
@@ -672,6 +672,5 @@ struct address_space_operations ocfs2_ao
.prepare_write = ocfs2_prepare_write,
.commit_write = ocfs2_commit_write,
.bmap = ocfs2_bmap,
- .sync_page = block_sync_page,
.direct_IO = ocfs2_direct_IO
};
Index: linux-2.6/fs/qnx4/inode.c
===================================================================
--- linux-2.6.orig/fs/qnx4/inode.c 2006-04-15 20:19:51.000000000 +1000
+++ linux-2.6/fs/qnx4/inode.c 2006-05-29 18:57:59.000000000 +1000
@@ -451,7 +451,6 @@ static sector_t qnx4_bmap(struct address
static struct address_space_operations qnx4_aops = {
.readpage = qnx4_readpage,
.writepage = qnx4_writepage,
- .sync_page = block_sync_page,
.prepare_write = qnx4_prepare_write,
.commit_write = generic_commit_write,
.bmap = qnx4_bmap
Index: linux-2.6/fs/reiserfs/inode.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/inode.c 2006-04-15 20:19:51.000000000 +1000
+++ linux-2.6/fs/reiserfs/inode.c 2006-05-29 18:57:35.000000000 +1000
@@ -3002,7 +3002,6 @@ struct address_space_operations reiserfs
.readpages = reiserfs_readpages,
.releasepage = reiserfs_releasepage,
.invalidatepage = reiserfs_invalidatepage,
- .sync_page = block_sync_page,
.prepare_write = reiserfs_prepare_write,
.commit_write = reiserfs_commit_write,
.bmap = reiserfs_aop_bmap,
Index: linux-2.6/fs/sysv/itree.c
===================================================================
--- linux-2.6.orig/fs/sysv/itree.c 2005-03-02 19:38:21.000000000 +1100
+++ linux-2.6/fs/sysv/itree.c 2006-05-29 18:57:36.000000000 +1000
@@ -468,7 +468,6 @@ static sector_t sysv_bmap(struct address
struct address_space_operations sysv_aops = {
.readpage = sysv_readpage,
.writepage = sysv_writepage,
- .sync_page = block_sync_page,
.prepare_write = sysv_prepare_write,
.commit_write = generic_commit_write,
.bmap = sysv_bmap
Index: linux-2.6/fs/udf/file.c
===================================================================
--- linux-2.6.orig/fs/udf/file.c 2006-04-15 20:19:52.000000000 +1000
+++ linux-2.6/fs/udf/file.c 2006-05-29 18:57:27.000000000 +1000
@@ -98,7 +98,6 @@ static int udf_adinicb_commit_write(stru
struct address_space_operations udf_adinicb_aops = {
.readpage = udf_adinicb_readpage,
.writepage = udf_adinicb_writepage,
- .sync_page = block_sync_page,
.prepare_write = udf_adinicb_prepare_write,
.commit_write = udf_adinicb_commit_write,
};
Index: linux-2.6/fs/udf/inode.c
===================================================================
--- linux-2.6.orig/fs/udf/inode.c 2006-04-15 20:19:52.000000000 +1000
+++ linux-2.6/fs/udf/inode.c 2006-05-29 18:57:30.000000000 +1000
@@ -135,7 +135,6 @@ static sector_t udf_bmap(struct address_
struct address_space_operations udf_aops = {
.readpage = udf_readpage,
.writepage = udf_writepage,
- .sync_page = block_sync_page,
.prepare_write = udf_prepare_write,
.commit_write = generic_commit_write,
.bmap = udf_bmap,
Index: linux-2.6/fs/ufs/inode.c
===================================================================
--- linux-2.6.orig/fs/ufs/inode.c 2006-02-26 03:01:30.000000000 +1100
+++ linux-2.6/fs/ufs/inode.c 2006-05-29 18:56:57.000000000 +1000
@@ -534,7 +534,6 @@ static sector_t ufs_bmap(struct address_
struct address_space_operations ufs_aops = {
.readpage = ufs_readpage,
.writepage = ufs_writepage,
- .sync_page = block_sync_page,
.prepare_write = ufs_prepare_write,
.commit_write = generic_commit_write,
.bmap = ufs_bmap
Index: linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_aops.c 2006-04-20 18:55:03.000000000 +1000
+++ linux-2.6/fs/xfs/linux-2.6/xfs_aops.c 2006-05-29 18:58:03.000000000 +1000
@@ -1452,7 +1452,6 @@ struct address_space_operations xfs_addr
.readpage = xfs_vm_readpage,
.readpages = xfs_vm_readpages,
.writepage = xfs_vm_writepage,
- .sync_page = block_sync_page,
.releasepage = xfs_vm_releasepage,
.invalidatepage = xfs_vm_invalidatepage,
.prepare_write = xfs_vm_prepare_write,
Index: linux-2.6/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_buf.c 2006-04-20 18:55:03.000000000 +1000
+++ linux-2.6/fs/xfs/linux-2.6/xfs_buf.c 2006-05-29 18:58:05.000000000 +1000
@@ -1521,7 +1521,6 @@ xfs_mapping_buftarg(
struct inode *inode;
struct address_space *mapping;
static struct address_space_operations mapping_aops = {
- .sync_page = block_sync_page,
.migratepage = fail_migrate_page,
};


Attachments:
fs-remove-sync-page.patch (25.22 kB)

2006-05-29 19:20:20

by Andrew Morton

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Mon, 29 May 2006 19:34:09 +1000
Nick Piggin <[email protected]> wrote:

> I'm not completely sure whether this is the bug or not,

"the bug". Are we suposed to know what yo're referring to here?

> nor what would be
> the performance consequences of my attached fix (wrt the block layer). So
> you're probably cc'ed because I've found similar threads with your names
> on them.
>

The performance risk is that someone will do lock_page() against a page
whose IO is queued-but-not-yet-kicked-off. We'll go to sleep with no IO
submitted until kblockd or someone else kicks off the IO for us.

Try disabling kblockd completely, see what effect that has on performance.

>
> lock_page (and other waiters on page flags bits) use sync_page when sleeping
> on a bit. sync_page, however, requires that the page's mapping be pinned
> (which is what we're sometimes trying to lock the page for).
>
> Blatant offender is set_page_dirty_lock, which falls to the race it purports
> to fix. Perhaps all callers could be fixed, however it seems that the pinned
> mapping lock_page precondition is counter-intuitive and I'd bet other callers
> to lock_page or wait_on_page_bit have got it wrong too.
>
> Also: splice can change a page's mapping, so it would have been possible to
> use the wrong mapping to "sync" a page.
>
> Can we get rid of the whole thing, confusing memory barriers and all? Nobody
> uses anything but the default sync_page, and if block rq plugging is terribly
> bad for performance, perhaps it should be reworked anyway? It shouldn't be a
> correctness thing, right?

What this means is that it is not legal to run lock_page() against a
pagecache page if you don't have a ref on the inode.

iirc the main (only?) offender here is direct-io reads into MAP_SHARED
pagecache. (And similar things, like infiniband and nfs-direct).

2006-05-30 00:08:17

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Andrew Morton wrote:
> On Mon, 29 May 2006 19:34:09 +1000
> Nick Piggin <[email protected]> wrote:
>
>
>>I'm not completely sure whether this is the bug or not,
>
>
> "the bug". Are we suposed to know what yo're referring to here?

You were supposed to know, if I hadn't made a typo :) "a bug".

>
>
>>nor what would be
>>the performance consequences of my attached fix (wrt the block layer). So
>>you're probably cc'ed because I've found similar threads with your names
>>on them.
>>
>
>
> The performance risk is that someone will do lock_page() against a page
> whose IO is queued-but-not-yet-kicked-off. We'll go to sleep with no IO
> submitted until kblockd or someone else kicks off the IO for us.

Yes.

>
> Try disabling kblockd completely, see what effect that has on performance.

Which is what I want to know. I don't exactly have an interesting
disk setup.

>>Can we get rid of the whole thing, confusing memory barriers and all? Nobody
>>uses anything but the default sync_page, and if block rq plugging is terribly
>>bad for performance, perhaps it should be reworked anyway? It shouldn't be a
>>correctness thing, right?
>
>
> What this means is that it is not legal to run lock_page() against a
> pagecache page if you don't have a ref on the inode.

Yes. So set_page_dirty_lock is broken, right?
And the wait_on_page_stuff needs an inode ref.
Also splice seems to have broken sync_page.

>
> iirc the main (only?) offender here is direct-io reads into MAP_SHARED
> pagecache. (And similar things, like infiniband and nfs-direct).

Well yes, writing to a page would be the main reason to set it dirty.
Is splice broken as well? I'm not sure that it always has a ref on the
inode when stealing a page.

It sounds like you think fixing the set_page_dirty_lock callers wouldn't
be too difficult? I wouldn't know (although the ptrace one should be
able to be turned into a set_page_dirty, because we're holding mmap_sem).

You're sure about all other lock_page()rs? I'm not, given that
set_page_dirty_lock got it so wrong. But you'd have a better idea than
me.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-30 01:34:54

by Andrew Morton

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Tue, 30 May 2006 10:08:06 +1000
Nick Piggin <[email protected]> wrote:

> >
> > Try disabling kblockd completely, see what effect that has on performance.
>
> Which is what I want to know. I don't exactly have an interesting
> disk setup.

You don't need one - just a single disk should show up such problems. I
forget which workloads though. Perhaps just a linear read (readahead
queues the I/O but doesn't unplug, subsequent lock_page() sulks).

> >>Can we get rid of the whole thing, confusing memory barriers and all? Nobody
> >>uses anything but the default sync_page, and if block rq plugging is terribly
> >>bad for performance, perhaps it should be reworked anyway? It shouldn't be a
> >>correctness thing, right?
> >
> >
> > What this means is that it is not legal to run lock_page() against a
> > pagecache page if you don't have a ref on the inode.
>
> Yes. So set_page_dirty_lock is broken, right?

yup.

> And the wait_on_page_stuff needs an inode ref.
> Also splice seems to have broken sync_page.

Please describe the splice() problem which you've observed.

> >
> > iirc the main (only?) offender here is direct-io reads into MAP_SHARED
> > pagecache. (And similar things, like infiniband and nfs-direct).
>
> Well yes, writing to a page would be the main reason to set it dirty.
> Is splice broken as well? I'm not sure that it always has a ref on the
> inode when stealing a page.

Whereabouts?

> It sounds like you think fixing the set_page_dirty_lock callers wouldn't
> be too difficult? I wouldn't know (although the ptrace one should be
> able to be turned into a set_page_dirty, because we're holding mmap_sem).

No, I think it's damn impossible ;)

get_user_pages() has gotten us a random pagecache page. How do we
non-racily get at the address_space prior to locking that page?

I don't think we can.

> You're sure about all other lock_page()rs? I'm not, given that
> set_page_dirty_lock got it so wrong. But you'd have a better idea than
> me.

No, I'm not sure.

However it is rare for the kernel to play with pagecache pages against
which the caller doesn't have an inode ref. Think: how did the caller look
up that page in the first place if not from the address_space in the first
place?

- get_user_pages(): the current problem

- page LRU: OK, uses trylock first.

- pagetable walk??

2006-05-30 02:55:04

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Andrew Morton wrote:

>On Tue, 30 May 2006 10:08:06 +1000
>Nick Piggin <[email protected]> wrote:
>
>
>>Which is what I want to know. I don't exactly have an interesting
>>disk setup.
>>
>
>You don't need one - just a single disk should show up such problems. I
>forget which workloads though. Perhaps just a linear read (readahead
>queues the I/O but doesn't unplug, subsequent lock_page() sulks).
>

I guess so. Is plugging still needed now that the IO layer should
get larger requests? Disabling it might result in a small initial
request (although even that may be good for pipelining)...

Otherwise, we could make set_page_dirty_lock use a weird non-unplugging
variant (although maybe that will upset Ken), but I'd rather look
at simplification first ;)

>
>>Yes. So set_page_dirty_lock is broken, right?
>>
>
>yup.
>
>
>>And the wait_on_page_stuff needs an inode ref.
>>Also splice seems to have broken sync_page.
>>
>
>Please describe the splice() problem which you've observed.
>

sync_page wants to get either the current mapping, or a NULL one.
The sync_page methods must then be able to handle running into a
NULL mapping.

With splice, the mapping can change, so you can have the wrong
sync_page callback run against the page.

>
>>Well yes, writing to a page would be the main reason to set it dirty.
>>Is splice broken as well? I'm not sure that it always has a ref on the
>>inode when stealing a page.
>>
>
>Whereabouts?
>

The ->pin() calls in pipe_to_file and pipe_to_sendpage?

>
>>It sounds like you think fixing the set_page_dirty_lock callers wouldn't
>>be too difficult? I wouldn't know (although the ptrace one should be
>>able to be turned into a set_page_dirty, because we're holding mmap_sem).
>>
>
>No, I think it's damn impossible ;)
>
>get_user_pages() has gotten us a random pagecache page. How do we
>non-racily get at the address_space prior to locking that page?
>
>I don't think we can.
>

But the vma isn't going to disappear because mmap_sem is held; and the
vma should hold a ref on the inode I think?

>
>>You're sure about all other lock_page()rs? I'm not, given that
>>set_page_dirty_lock got it so wrong. But you'd have a better idea than
>>me.
>>
>
>No, I'm not sure.
>
>However it is rare for the kernel to play with pagecache pages against
>which the caller doesn't have an inode ref. Think: how did the caller look
>up that page in the first place if not from the address_space in the first
>place?
>
>- get_user_pages(): the current problem
>
>- page LRU: OK, uses trylock first.
>
>- pagetable walk??
>

Am I wrong about mmap_sem?

Anyway, it is possible that most of the problems could be solved by locking
the page at the time of lookup, and unlocking it on completion/dirtying...
it's just that that would be a bit of a task. Can we somehow add BUG_ONs to
lock_page to ensure we've got an inode ref?

--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-30 03:13:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Tue, 30 May 2006 12:54:53 +1000
Nick Piggin <[email protected]> wrote:

> Andrew Morton wrote:
>
> >On Tue, 30 May 2006 10:08:06 +1000
> >Nick Piggin <[email protected]> wrote:
> >
> >
> >>Which is what I want to know. I don't exactly have an interesting
> >>disk setup.
> >>
> >
> >You don't need one - just a single disk should show up such problems. I
> >forget which workloads though. Perhaps just a linear read (readahead
> >queues the I/O but doesn't unplug, subsequent lock_page() sulks).
> >
>
> I guess so. Is plugging still needed now that the IO layer should
> get larger requests? Disabling it might result in a small initial
> request (although even that may be good for pipelining)...

Mysterious question, that. A few years ago I think Jens tried pulling unplugging
out, but some devices still want it (magneto-optical storage iirc). And I
think we did try removing it, and it caused hurt.

> Otherwise, we could make set_page_dirty_lock use a weird non-unplugging
> variant

Yes, that would work. In fact the number of times when direct-io actually
calls set_page_dirty_lock() is infinitesimal - I had to jump through hoops
to even test that code. The speculative
set-the-page-dirty-before-we-do-the-IO thing is very effective. So the
performace impact of making such a change would be nil.

That's for the direct-io case. Other cases might be hurt more.

Also, perhaps we could poke kblockd to do it for us.

> sync_page wants to get either the current mapping, or a NULL one.
> The sync_page methods must then be able to handle running into a
> NULL mapping.
>
> With splice, the mapping can change, so you can have the wrong
> sync_page callback run against the page.

Oh.

> >
> >>Well yes, writing to a page would be the main reason to set it dirty.
> >>Is splice broken as well? I'm not sure that it always has a ref on the
> >>inode when stealing a page.
> >>
> >
> >Whereabouts?
> >
>
> The ->pin() calls in pipe_to_file and pipe_to_sendpage?

One for Jens...

> >
> >>It sounds like you think fixing the set_page_dirty_lock callers wouldn't
> >>be too difficult? I wouldn't know (although the ptrace one should be
> >>able to be turned into a set_page_dirty, because we're holding mmap_sem).
> >>
> >
> >No, I think it's damn impossible ;)
> >
> >get_user_pages() has gotten us a random pagecache page. How do we
> >non-racily get at the address_space prior to locking that page?
> >
> >I don't think we can.
> >
>
> But the vma isn't going to disappear because mmap_sem is held; and the
> vma should hold a ref on the inode I think?

That's true during the get_user_pages() call. Be we run
set_page_dirty_lock() much later, after IO completion.

> >
> >>You're sure about all other lock_page()rs? I'm not, given that
> >>set_page_dirty_lock got it so wrong. But you'd have a better idea than
> >>me.
> >>
> >
> >No, I'm not sure.
> >
> >However it is rare for the kernel to play with pagecache pages against
> >which the caller doesn't have an inode ref. Think: how did the caller look
> >up that page in the first place if not from the address_space in the first
> >place?
> >
> >- get_user_pages(): the current problem
> >
> >- page LRU: OK, uses trylock first.
> >
> >- pagetable walk??
> >
>
> Am I wrong about mmap_sem?
>
> Anyway, it is possible that most of the problems could be solved by locking
> the page at the time of lookup, and unlocking it on completion/dirtying...
> it's just that that would be a bit of a task.

But lock_page() requires access to the address_space. To kick the IO so we
don't wait for ever.

> Can we somehow add BUG_ONs to
> lock_page to ensure we've got an inode ref?

WARN_ONs.

2006-05-30 04:13:12

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Andrew Morton wrote:

>On Tue, 30 May 2006 12:54:53 +1000
>Nick Piggin <[email protected]> wrote:
>
>
>>I guess so. Is plugging still needed now that the IO layer should
>>get larger requests? Disabling it might result in a small initial
>>request (although even that may be good for pipelining)...
>>
>
>Mysterious question, that. A few years ago I think Jens tried pulling unplugging
>out, but some devices still want it (magneto-optical storage iirc). And I
>think we did try removing it, and it caused hurt.
>

Would be nice to get rid of it. I guess that's out of the question
for several releases, even if there was no performance problems.

>
>>Otherwise, we could make set_page_dirty_lock use a weird non-unplugging
>>variant
>>
>
>Yes, that would work. In fact the number of times when direct-io actually
>calls set_page_dirty_lock() is infinitesimal - I had to jump through hoops
>to even test that code. The speculative
>set-the-page-dirty-before-we-do-the-IO thing is very effective. So the
>performace impact of making such a change would be nil.
>

OK, that'll probably be the way to go for upcoming releases. I'll post
a patch soon.

>
>That's for the direct-io case. Other cases might be hurt more.
>
>Also, perhaps we could poke kblockd to do it for us.
>

Hard, I think, to get back to ->mapping. Well not hard if we just use
a non-syncing lock_page, but in that case we don't need kblockd.

>
>>sync_page wants to get either the current mapping, or a NULL one.
>>The sync_page methods must then be able to handle running into a
>>NULL mapping.
>>
>>With splice, the mapping can change, so you can have the wrong
>>sync_page callback run against the page.
>>
>
>Oh.
>

Don't know what we can do about that off the top of my head. Within
block_sync_page there shouldn't be any problems (at worst, we'd unplug
the wrong dev). Maybe the whole sync_page callback scheme can be removed
so nobody tries to do anything funny with it? Call blk_run_backing_dev
directly.

>
>>>>Well yes, writing to a page would be the main reason to set it dirty.
>>>>Is splice broken as well? I'm not sure that it always has a ref on the
>>>>inode when stealing a page.
>>>>
>>>>
>>>Whereabouts?
>>>
>>>
>>The ->pin() calls in pipe_to_file and pipe_to_sendpage?
>>
>
>One for Jens...
>
>
>>>>It sounds like you think fixing the set_page_dirty_lock callers wouldn't
>>>>be too difficult? I wouldn't know (although the ptrace one should be
>>>>able to be turned into a set_page_dirty, because we're holding mmap_sem).
>>>>
>>>>
>>>No, I think it's damn impossible ;)
>>>
>>>get_user_pages() has gotten us a random pagecache page. How do we
>>>non-racily get at the address_space prior to locking that page?
>>>
>>>I don't think we can.
>>>
>>>
>>But the vma isn't going to disappear because mmap_sem is held; and the
>>vma should hold a ref on the inode I think?
>>
>
>That's true during the get_user_pages() call. Be we run
>set_page_dirty_lock() much later, after IO completion.
>

Oh, I thought you specifically meant the ptrace case. Yes, in
general it is much harder.

>>Anyway, it is possible that most of the problems could be solved by locking
>>the page at the time of lookup, and unlocking it on completion/dirtying...
>>it's just that that would be a bit of a task.
>>
>
>But lock_page() requires access to the address_space. To kick the IO so we
>don't wait for ever.
>

It shouldn't wait for ever, because the unplug timer will go off
and kblockd will do it eventually. And I was imagining that it would
have a pin on the address space at the point it is looked up... But
reworking all callers is just a pipe dream anyway, so nevermind ;)

Where we have synchronous IO, we do want the queue to be unplugged,
however in most set_page_dirty_lock case this is normally not the
case and a locked page should be rare. So hmm yes that would make
sense to have it use a special lock_page...

>
>>Can we somehow add BUG_ONs to
>>lock_page to ensure we've got an inode ref?
>>
>
>WARN_ONs.
>

But is it practical? Or I suspect the warning would only ever trigger
in the really rare racy cases anyway, when dentry, inode, etc have
been reclaimed.

Anyway, I'll come up with a less intrusive patch shortly...

--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-30 04:21:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?



On Tue, 30 May 2006, Nick Piggin wrote:
>
> I guess so. Is plugging still needed now that the IO layer should
> get larger requests?

Why do you think the IO layer should get larger requests?

I really don't understand why people dislike plugging. It's obviously
superior to non-plugged variants, exactly because it starts the IO only
when _needed_, not at some random "IO request feeding point" boundary.

In other words, plugging works _correctly_ regardless of any random
up-stream patterns. That's the kind of algorithms we want, we do NOT want
to have the IO layer depend on upstream always doing the "Right
Thing(tm)".

So exactly why would you want to remove it?

Linus

2006-05-30 05:07:52

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Linus Torvalds wrote:
>
> On Tue, 30 May 2006, Nick Piggin wrote:
>
>>I guess so. Is plugging still needed now that the IO layer should
>>get larger requests?
>
>
> Why do you think the IO layer should get larger requests?

For workloads where plugging helps (ie. lots of smaller, contiguous
requests going into the IO layer), should be pretty good these days
due to multiple readahead and writeback.

>
> I really don't understand why people dislike plugging. It's obviously
> superior to non-plugged variants, exactly because it starts the IO only
> when _needed_, not at some random "IO request feeding point" boundary.

If you submit IO, it will be needed at some point in time. The benefit
of plugging is to hold it there in anticipation of more IO to merge
with this request.

Not sure what you mean by IO request feeding point...?

>
> In other words, plugging works _correctly_ regardless of any random
> up-stream patterns. That's the kind of algorithms we want, we do NOT want
> to have the IO layer depend on upstream always doing the "Right
> Thing(tm)".

It is a heuristic. I don't see anything obviously correct or incorrect
about it. If you were to submit some asynch requests and _knew_ that
no other requests would be merged with them, plugging would be the
wrong thing to do because the disks would be needlessly idle.

>
> So exactly why would you want to remove it?

Because of this bug ;)

I'm not saying it would never be of any benefit; I was hoping that it
might be unneeded and getting rid of it would fix this bug and
simplify a lot too.

Anyway, clearly a plug friendly bugfix needs to be implemented
regardless.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-30 05:21:08

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Nick Piggin wrote:
> Linus Torvalds wrote:
>
>>
>> Why do you think the IO layer should get larger requests?
>
>
> For workloads where plugging helps (ie. lots of smaller, contiguous
> requests going into the IO layer), should be pretty good these days
> due to multiple readahead and writeback.

Let me try again.

For workloads where plugging helps (ie. lots of smaller, contiguous
requests going into the IO layer), the request pattern should be
pretty good without plugging these days, due to multiple page
readahead and writeback.


>
>>
>> I really don't understand why people dislike plugging. It's obviously
>> superior to non-plugged variants, exactly because it starts the IO
>> only when _needed_,

Taken to its logical conclusion, you are saying readahead / dirty
page writeout throttling is obviously inferior, aren't you?

Non-rhetorically: Obviously there can be regressions in plugging,
because you are holding the disk idle when you know there is work to
be done.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-30 05:36:20

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

lock_page needs the caller to have a reference on the page->mapping inode
due to sync_page. So set_page_dirty_lock is obviously buggy according to
its comments.

Solve it by introducing a new lock_page_nosync which does not do a sync_page.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h 2006-05-19 12:49:06.000000000 +1000
+++ linux-2.6/include/linux/pagemap.h 2006-05-30 15:12:54.000000000 +1000
@@ -168,6 +168,7 @@ static inline pgoff_t linear_page_index(
}

extern void FASTCALL(__lock_page(struct page *page));
+extern void FASTCALL(__lock_page_nosync(struct page *page));
extern void FASTCALL(unlock_page(struct page *page));

static inline void lock_page(struct page *page)
@@ -176,6 +177,13 @@ static inline void lock_page(struct page
if (TestSetPageLocked(page))
__lock_page(page);
}
+
+static inline void lock_page_nosync(struct page *page)
+{
+ might_sleep();
+ if (TestSetPageLocked(page))
+ __lock_page_nosync(page);
+}

/*
* This is exported only for wait_on_page_locked/wait_on_page_writeback.
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c 2006-05-30 15:07:58.000000000 +1000
+++ linux-2.6/mm/filemap.c 2006-05-30 15:31:46.000000000 +1000
@@ -544,6 +544,23 @@ void fastcall __lock_page(struct page *p
}
EXPORT_SYMBOL(__lock_page);

+static int __sleep_on_page_lock(void *word)
+{
+ io_schedule();
+ return 0;
+}
+
+/*
+ * Variant of lock_page that does not require the caller to hold a reference
+ * on the page's mapping.
+ */
+void fastcall __lock_page_nosync(struct page *page)
+{
+ DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+ __wait_on_bit_lock(page_waitqueue(page), &wait, __sleep_on_page_lock,
+ TASK_UNINTERRUPTIBLE);
+}
+
/*
* a rather lightweight function, finding and getting a reference to a
* hashed page atomically.
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c 2006-05-19 12:48:01.000000000 +1000
+++ linux-2.6/mm/page-writeback.c 2006-05-30 15:13:59.000000000 +1000
@@ -702,7 +702,7 @@ int set_page_dirty_lock(struct page *pag
{
int ret;

- lock_page(page);
+ lock_page_nosync(page);
ret = set_page_dirty(page);
unlock_page(page);
return ret;


Attachments:
mm-non-syncing-lock_page.patch (2.40 kB)

2006-05-30 05:52:41

by Josef Sipek

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Mon, May 29, 2006 at 07:34:09PM +1000, Nick Piggin wrote:
...
> Can we get rid of the whole thing, confusing memory barriers and all?
> Nobody uses anything but the default sync_page

I feel like I must say this: there are some file systems that live
outside the kernel (at least for now) that do _NOT_ use the default
sync_page. All the stackable file systems that are based on FiST [1],
such as Unionfs [2] and eCryptfs (currently in -mm) [3] (respective
authors CC'd). As an example, Unionfs must decide which lower file
system page to sync (since it may have several to chose from).

Josef "Jeff" Sipek.

[1] http://www.filesystems.org
[2] http://unionfs.filesystems.org
[3] http://ecryptfs.sourceforge.net


2006-05-30 06:12:30

by NeilBrown

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Tuesday May 30, [email protected] wrote:
> Nick Piggin wrote:
> > Linus Torvalds wrote:
> >
> >>
> >> Why do you think the IO layer should get larger requests?
> >
> >
> > For workloads where plugging helps (ie. lots of smaller, contiguous
> > requests going into the IO layer), should be pretty good these days
> > due to multiple readahead and writeback.
>
> Let me try again.
>
> For workloads where plugging helps (ie. lots of smaller, contiguous
> requests going into the IO layer), the request pattern should be
> pretty good without plugging these days, due to multiple page
> readahead and writeback.

Can I please put in a vote for not thinking that every device is disk
drive?

I find plugging fairly important for raid5, particularly for write.

The more whole-stripe writes I can get, the better throughput I get.
So I tend to keep a raid5 array plugged while any requests are
arriving, and interpret 'plugged' to mean that incomplete stripes
don't get processed while full stripes (needing no pre-reading) do get
processed.

The only way "large requests" are going to replace plugging is they
are perfectly aligned, which I don't expect to ever see.

As for your original problem.... I wonder if PG_locked is protecting
too much? It protects against IO and it also protects against ->mapping
changes. So if you want to ensure that ->mapping won't change, you
need to wait for any pending read request to finish, which seems a bit
dumb.
Maybe we need a new bit: PG_maplocked. You are only allowed to change
->mapping or ->index of you hold PG_locked and PG_maplocked, you are
not allowed to wait for PG_locked while holding PG_maplocked, and
you can read ->mapping or ->index while PG_locked or PG_maplocked are
held.
Think of PG_locked like a mutex and PG_maplocked like a spinlock (and
probably use bit_spinlock to get it).

Then set_page_dirty_lock would use PG_maplocked to get access to
->mapping, and then hold a reference on the address_space while
calling into balance_dirty_pages ... I wonder how you hold a reference
on an address space...

There are presumably few pieces of code that change ->mapping. Once
they all take PG_maplocked as well as PG_locked, you can start freeing
up other code to take PG_maplocked instead of PG_locked....

Does that make sense at all? Do we have any spare page bits?

NeilBrown

2006-05-30 06:45:05

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Josef Sipek wrote:
> On Mon, May 29, 2006 at 07:34:09PM +1000, Nick Piggin wrote:
> ...
>
>>Can we get rid of the whole thing, confusing memory barriers and all?
>>Nobody uses anything but the default sync_page
>
>
> I feel like I must say this: there are some file systems that live
> outside the kernel (at least for now) that do _NOT_ use the default
> sync_page. All the stackable file systems that are based on FiST [1],
> such as Unionfs [2] and eCryptfs (currently in -mm) [3] (respective
> authors CC'd). As an example, Unionfs must decide which lower file
> system page to sync (since it may have several to chose from).

OK, noted. Thanks. Luckily for them it looks like sync_page might
stay for other reasons (eg. raid) ;)

Any good reasons they are not in the tree?

>
> Josef "Jeff" Sipek.
>
> [1] http://www.filesystems.org
> [2] http://unionfs.filesystems.org
> [3] http://ecryptfs.sourceforge.net



--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-30 06:50:32

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Nick Piggin wrote:
> Josef Sipek wrote:
>
>> On Mon, May 29, 2006 at 07:34:09PM +1000, Nick Piggin wrote:
>> ...
>>
>>> Can we get rid of the whole thing, confusing memory barriers and all?
>>> Nobody uses anything but the default sync_page
>>
>>
>>
>> I feel like I must say this: there are some file systems that live
>> outside the kernel (at least for now) that do _NOT_ use the default
>> sync_page. All the stackable file systems that are based on FiST [1],
>> such as Unionfs [2] and eCryptfs (currently in -mm) [3] (respective
>> authors CC'd). As an example, Unionfs must decide which lower file
>> system page to sync (since it may have several to chose from).
>
>
> OK, noted. Thanks. Luckily for them it looks like sync_page might
> stay for other reasons (eg. raid) ;)

Ah, no: that's the backing dev's unplug_fn. sync_page is always the
same. And it has definite bugs now that splice has been merged.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-30 07:10:15

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Neil Brown wrote:
> On Tuesday May 30, [email protected] wrote:
>
>>Nick Piggin wrote:
>>

>>For workloads where plugging helps (ie. lots of smaller, contiguous
>>requests going into the IO layer), the request pattern should be
>>pretty good without plugging these days, due to multiple page
>>readahead and writeback.
>
>
> Can I please put in a vote for not thinking that every device is disk
> drive?
>
> I find plugging fairly important for raid5, particularly for write.
>
> The more whole-stripe writes I can get, the better throughput I get.
> So I tend to keep a raid5 array plugged while any requests are
> arriving, and interpret 'plugged' to mean that incomplete stripes
> don't get processed while full stripes (needing no pre-reading) do get
> processed.
>
> The only way "large requests" are going to replace plugging is they
> are perfectly aligned, which I don't expect to ever see.

Fair enough, thanks for the input. I was more imagining that IO tends
to come down in decent chunks, but obviously that's still not sufficient
for some. OK.

>
> As for your original problem.... I wonder if PG_locked is protecting
> too much? It protects against IO and it also protects against ->mapping
> changes. So if you want to ensure that ->mapping won't change, you
> need to wait for any pending read request to finish, which seems a bit
> dumb.

I don't think that is the problem. set_page_dirty_lock is really
unlikely to get held up on read IO: that'd mean there were two things
writing into that page at the same time.

>
> Maybe we need a new bit: PG_maplocked. You are only allowed to change
> ->mapping or ->index of you hold PG_locked and PG_maplocked, you are
> not allowed to wait for PG_locked while holding PG_maplocked, and
> you can read ->mapping or ->index while PG_locked or PG_maplocked are
> held.
> Think of PG_locked like a mutex and PG_maplocked like a spinlock (and
> probably use bit_spinlock to get it).

Well the original problem is fixed by not doing the sync_page thing in
set_page_dirty_lock. Is there any advantage to having another bit?
Considering a) it will be very unlikely that a page is locked at the
same time one would like to dirty it; and b) that would seem to imply
adding extra atomic ops and barriers to reclaim and truncate (maybe
others).

>
> Then set_page_dirty_lock would use PG_maplocked to get access to
> ->mapping, and then hold a reference on the address_space while
> calling into balance_dirty_pages ... I wonder how you hold a reference
> on an address space...

inode. Presumably PG_maplocked would pin it? I don't understand
why you've brought balance_dirty_pages into it, though.

>
> There are presumably few pieces of code that change ->mapping. Once
> they all take PG_maplocked as well as PG_locked, you can start freeing
> up other code to take PG_maplocked instead of PG_locked....
>
> Does that make sense at all? Do we have any spare page bits?

I'm sure it could be made to work, but I don't really see the point.
If someone really wanted to do it, I guess the right way to go is have
a PG_readin counterpart to PG_writeback (or even extend PG_writeback
to PG_io)...

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-30 08:29:03

by Nikita Danilov

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Nick Piggin writes:
> Nick Piggin wrote:
> > Linus Torvalds wrote:
> >
> >>
> >> Why do you think the IO layer should get larger requests?
> >
> >
> > For workloads where plugging helps (ie. lots of smaller, contiguous
> > requests going into the IO layer), should be pretty good these days
> > due to multiple readahead and writeback.
>
> Let me try again.
>
> For workloads where plugging helps (ie. lots of smaller, contiguous
> requests going into the IO layer), the request pattern should be
> pretty good without plugging these days, due to multiple page
> readahead and writeback.

Pageout by VM scanner doesn't benefit from those, and it is still quite
important in some workloads (e.g., mmap intensive).

Nikita.

2006-05-30 09:07:12

by Jens Axboe

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Mon, May 29 2006, Andrew Morton wrote:
> On Tue, 30 May 2006 12:54:53 +1000
> Nick Piggin <[email protected]> wrote:
>
> > Andrew Morton wrote:
> >
> > >On Tue, 30 May 2006 10:08:06 +1000
> > >Nick Piggin <[email protected]> wrote:
> > >
> > >
> > >>Which is what I want to know. I don't exactly have an interesting
> > >>disk setup.
> > >>
> > >
> > >You don't need one - just a single disk should show up such problems. I
> > >forget which workloads though. Perhaps just a linear read (readahead
> > >queues the I/O but doesn't unplug, subsequent lock_page() sulks).
> > >
> >
> > I guess so. Is plugging still needed now that the IO layer should
> > get larger requests? Disabling it might result in a small initial
> > request (although even that may be good for pipelining)...
>
> Mysterious question, that. A few years ago I think Jens tried pulling
> unplugging out, but some devices still want it (magneto-optical
> storage iirc). And I think we did try removing it, and it caused
> hurt.

I did, back when we had problems due to the blk_plug_lock being a global
one. I first wanted to investigate if plugging still made a difference,
otherwise we could've just ripped it out back than and the problem would
be solved. But it did get us about a 10% boost on normal SCSI drives
(don't think I tested MO drives at all), so it was fixed up.

> > sync_page wants to get either the current mapping, or a NULL one.
> > The sync_page methods must then be able to handle running into a
> > NULL mapping.
> >
> > With splice, the mapping can change, so you can have the wrong
> > sync_page callback run against the page.
>
> Oh.

Maybe I'm being dense, but I don't see a problem there. You _should_
call the new mapping sync page if it has been migrated.

> > >>Well yes, writing to a page would be the main reason to set it dirty.
> > >>Is splice broken as well? I'm not sure that it always has a ref on the
> > >>inode when stealing a page.
> > >>
> > >
> > >Whereabouts?
> > >
> >
> > The ->pin() calls in pipe_to_file and pipe_to_sendpage?
>
> One for Jens...

splice never incs/decs any inode related reference counts, so if it
needs to then yes it's broken. Any references to kernel code that deals
with that?

--
Jens Axboe

2006-05-30 13:13:09

by Josef Sipek

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Tue, May 30, 2006 at 04:44:57PM +1000, Nick Piggin wrote:
> Josef Sipek wrote:
> >On Mon, May 29, 2006 at 07:34:09PM +1000, Nick Piggin wrote:
> >...
> >
> >>Can we get rid of the whole thing, confusing memory barriers and all?
> >>Nobody uses anything but the default sync_page
> >
> >
> >I feel like I must say this: there are some file systems that live
> >outside the kernel (at least for now) that do _NOT_ use the default
> >sync_page. All the stackable file systems that are based on FiST [1],
> >such as Unionfs [2] and eCryptfs (currently in -mm) [3] (respective
> >authors CC'd). As an example, Unionfs must decide which lower file
> >system page to sync (since it may have several to chose from).
>
> OK, noted. Thanks. Luckily for them it looks like sync_page might
> stay for other reasons (eg. raid) ;)
>
> Any good reasons they are not in the tree?

eCryptfs got recently into -mm, my guess would be that Mike Halcrow will
try to get it into vanilla in the coming months.

Unionfs is into process of getting ready for a review by the fs-devel &
linux-kernel crowd.

Josef "Jeff" Sipek.

2006-05-30 17:58:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?



On Tue, 30 May 2006, Nick Piggin wrote:
>
> For workloads where plugging helps (ie. lots of smaller, contiguous
> requests going into the IO layer), the request pattern should be
> pretty good without plugging these days, due to multiple page
> readahead and writeback.

No.

That's fundamentally wrong.

The fact is, plugging is not about read-ahead and writeback. It's very
fundamentally about the _boundaries_ between multiple requests, and in
particular the time when the queue starts out empty so that we can build
up things for devices that wand big requests, but even more so for devices
where _seeking_ is very expensive.

Those boundaries haven't gone anywhere. The fact that we do read-ahead and
write-back in chunks doesn't change anything: yes, we often have the "big
requests" thing handled, but (a) not always and (b) upper layers
fundamentally don't fix the seek issues.

I want to know that the block layer could - if we wanted to - do things
like read-ahead for many distinct files, and for metadata. We don't
currently do much of that yet, but the point is, plugging _allows_ us to.
Exactly because it doesn't depend on upper layers feeding everything in
one go.

Look at "sys_readahead()", and realize that it can be used to start IO for
read ahead _across_many_small_files_. Last I tried it, it was hugely
faster at populating the page cache than reading individual files (I used
to do it with BK to bring everything into cache so that the regular ops
would be fster - now git doesn't much need it).

And maybe it was just my imagination, but the disk seemed quieter too. It
should be able to do better seek patterns at the beginning due to plugging
(ie we won't start IO after the first file, but after the request queue
fills up or something else needs to wait and we do an unplug event).

THAT is what plugging is good for. Our read-ahead does well for large
requests, and that's important for some disk controllers in particular.
But plugging is about avoiding startign the IO too early.

Think about the TCP plugging (which is actually newer, but perhaps easier
to explain): it's useful not for the big file case (just use large reads
and writes), but for the "different sources" case - for handling the gap
between a header and the actual file contents. Exactly because it plugs in
_between_ events.

Linus

2006-05-30 18:32:01

by Hugh Dickins

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Tue, 30 May 2006, Nick Piggin wrote:
>
> But for 2.6.17, how's this?

It was a great emperor's-clothes-like discovery. But we've survived
for so many years without noticing, does it have to be fixed right
now for 2.6.17? (I bet I'd be insisting yes if I'd found it.)

The thing I don't like about your lock_page_nosync (reasonable as
it is) is that the one case you're using it, set_page_dirty_nolock,
would be so much happier not to have to lock the page in the first
place - it's only doing _that_ to stabilize page->mapping, and the
lock_page forbids it from being called from anywhere that can't
sleep, which is often just where we want to call it from. Neil's
suggestion, using a spin_lock against the mapping changing, would
help there; but seems like more work than I'd want to get into.

So, although I think lock_page_nosync fixes the bug (at least in
that one place we've identified there's likely to be such a bug),
it seems to be aiming at the wrong target. I'm pacing and thinking,
doubt I'll come up with anything better, please don't hold breath.

Hugh

2006-05-31 00:22:04

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Hugh Dickins wrote:
> On Tue, 30 May 2006, Nick Piggin wrote:
>
>>But for 2.6.17, how's this?
>
>
> It was a great emperor's-clothes-like discovery. But we've survived
> for so many years without noticing, does it have to be fixed right
> now for 2.6.17? (I bet I'd be insisting yes if I'd found it.)

It's up to Linus and Andrew I guess. I don't see why not, but I
don't much care one way or the other. But thanks having a quick
look at it, we may want it for the Suse kernel.

>
> The thing I don't like about your lock_page_nosync (reasonable as
> it is) is that the one case you're using it, set_page_dirty_nolock,
> would be so much happier not to have to lock the page in the first
> place - it's only doing _that_ to stabilize page->mapping, and the
> lock_page forbids it from being called from anywhere that can't
> sleep, which is often just where we want to call it from. Neil's
> suggestion, using a spin_lock against the mapping changing, would
> help there; but seems like more work than I'd want to get into.

But making PG_lock a spinning lock is completely unrelated to the
bug at hand.

>
> So, although I think lock_page_nosync fixes the bug (at least in
> that one place we've identified there's likely to be such a bug),
> it seems to be aiming at the wrong target. I'm pacing and thinking,
> doubt I'll come up with anything better, please don't hold breath.

It is the correct target. I know all about your set_page_dirty_lock
problems, but they aren't what I'm trying to fix.

AFAIKS, you could also make set_page_dirty_lock non sleeping quite
easily by making inode slabs RCU freed.

What places want to use set_page_dirty_lock without sleeping?
The only place in drivers/ apart from sg/st that SetPageDirty are
rd.c and via_dmablit.c, both of which look OK, if a bit crufty.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-31 00:33:06

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Linus Torvalds wrote:
>
> On Tue, 30 May 2006, Nick Piggin wrote:
>
>>For workloads where plugging helps (ie. lots of smaller, contiguous
>>requests going into the IO layer), the request pattern should be
>>pretty good without plugging these days, due to multiple page
>>readahead and writeback.
>
>
> No.
>
> That's fundamentally wrong.
>
> The fact is, plugging is not about read-ahead and writeback. It's very
> fundamentally about the _boundaries_ between multiple requests, and in
> particular the time when the queue starts out empty so that we can build
> up things for devices that wand big requests, but even more so for devices
> where _seeking_ is very expensive.
>
> Those boundaries haven't gone anywhere. The fact that we do read-ahead and
> write-back in chunks doesn't change anything: yes, we often have the "big
> requests" thing handled, but (a) not always and (b) upper layers
> fundamentally don't fix the seek issues.

The requests can only get merged if contiguous requests from the upper
layers come down, right?

So in a random IO workload, plugging is unlikely to help at all. In a
contiguous IO workload, mpage should take *some* of the burden off
plugging. But OK, it turns out not always, I accept that.



>
> I want to know that the block layer could - if we wanted to - do things
> like read-ahead for many distinct files, and for metadata. We don't
> currently do much of that yet, but the point is, plugging _allows_ us to.
> Exactly because it doesn't depend on upper layers feeding everything in
> one go.
>
> Look at "sys_readahead()", and realize that it can be used to start IO for
> read ahead _across_many_small_files_. Last I tried it, it was hugely
> faster at populating the page cache than reading individual files (I used
> to do it with BK to bring everything into cache so that the regular ops
> would be fster - now git doesn't much need it).
>
> And maybe it was just my imagination, but the disk seemed quieter too. It
> should be able to do better seek patterns at the beginning due to plugging
> (ie we won't start IO after the first file, but after the request queue
> fills up or something else needs to wait and we do an unplug event).
>
> THAT is what plugging is good for. Our read-ahead does well for large
> requests, and that's important for some disk controllers in particular.
> But plugging is about avoiding startign the IO too early.

Why would plugging help if the requests can't get merged, though?

>
> Think about the TCP plugging (which is actually newer, but perhaps easier
> to explain): it's useful not for the big file case (just use large reads
> and writes), but for the "different sources" case - for handling the gap
> between a header and the actual file contents. Exactly because it plugs in
> _between_ events.

TCP plugging is a bit different because there is no page cache between
the application and the device; and it is stream based so everything can
be merged (within a single socket).

The same high level concept I agree, but I never said the concept was
wrong; just hoped that as a heuristic, the block plugging was no longer
useful. I've been set straight about that though ;)

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-31 00:57:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?



On Wed, 31 May 2006, Nick Piggin wrote:
>
> The requests can only get merged if contiguous requests from the upper
> layers come down, right?

It has nothing to do with merging. It has to do with IO patterns.

Seeking.

Seeking is damn expensive - much more so than command issue. People forget
that sometimes.

If you can sort the requests so that you don't have to seek back and
forth, that's often a HUGE win.

Yes, the requests will still be small, and yes, the IO might happen in 4kB
chunks, but it happens a lot faster if you do it in a good elevator
ordering and if you hit the track cache than if you seek back and forth.

And part of that is that you have to submit multiple requests when you
start, and allow the elevator to work on it.

Now, of course, if you have tons of reqeusts already in flight, you don't
care (you already have lots of work for the elevator), but at least in
desktop loads the "starting from idle" case is pretty common. Getting just
a few requests to start up with is good.

(Yes, tagged queueing makes it less of an issue, of course. I know, I
know. But I _think_ a lot of disks will start seeking for an incoming
command the moment they see it, just to get the best latency, rather than
wait a millisecond or two to see if they get another request. So even
with tagged queuing, the elevator can help, _especially_ for the initial
request).

> Why would plugging help if the requests can't get merged, though?

Why do you think we _have_ an elevator in the first place?

And just how well do you think it works if you submit one entry at a time
(regardless of how _big_ it is) and start IO on it immediately? Vs trying
to get several IO's out there, so that we can say "do this one first".

Sometimes I think harddisks have gotten too quiet - people no longer hear
it when access patters are horrible. But the big issue with plugging was
only partially about request coalescing, and was always about trying to
get the _order_ right when you start to actually submit the requests to
the hardware.

And yes, I realize that modern disks do remapping, and that we will never
do a "perfect" job. But it's still true that the block number has _some_
(fairly big, in fact) relationship to the actual disk layout, and that
avoiding seeking is a big deal.

Rotational latency is often an even bigger issue, of course, but we can't
do much about that. We really can't estimate where the head is, like
people used to try to do three decades ago. _That_ time is long past, but
we can try to avoid long seeks, and it's still true that you can get
blocks that are _close_ faster (if only because they may end up being on
the same cylinder and not need a seek).

Even better than "same cylinder" is sometimes "same cache block" - disks
often do track caching, and they aren't necessarily all that smart about
it, so even if you don't read one huge contiguous block, it's much better
to read an area _close_ to another than seek back and forth, because
you're more likely to hit the disks own track cache.

And I know, disks aren't as sensitive to long seeks as they used to be (a
short seek is almost as expensive as a long one, and a lot of it is the
head settling time), but as another example - I think for CD-ROMs you can
still have things like the motor spinning faster or slower depending on
where the read head is, for example, meaning that short seeks are cheaper
than long ones.

(Maybe constant angular velocity is what people use, though. I dunno).

Linus

2006-05-31 01:33:09

by Mark Lord

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Linus wrote:
> (Yes, tagged queueing makes it less of an issue, of course. I know,

My observations with (S)ATA tagged/native queuing, is that it doesn't make
nearly the difference under Linux that it does under other OSs.
Probably because our block layer is so good at ordering requests,
either from plugging or simply from clever disk scheduling.

> I know. But I _think_ a lot of disks will start seeking for an incoming
> command the moment they see it, just to get the best latency, rather than
> wait a millisecond or two to see if they get another request. So even
> with tagged queuing, the elevator can help, _especially_ for the initial
> request).

Yup. Agreed!

2006-05-31 03:06:30

by Hugh Dickins

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Wed, 31 May 2006, Nick Piggin wrote:
> Hugh Dickins wrote:
> > lock_page forbids it from being called from anywhere that can't
> > sleep, which is often just where we want to call it from. Neil's
> > suggestion, using a spin_lock against the mapping changing, would
> > help there; but seems like more work than I'd want to get into.
>
> But making PG_lock a spinning lock is completely unrelated to the
> bug at hand.

Neil wasn't suggesting making PG_lock a spinning lock, he was
suggesting a further bit used that way. And I thought his suggestion
was relevant to the bug at hand. But it's not the way I'd like to go.

> > So, although I think lock_page_nosync fixes the bug (at least in
> > that one place we've identified there's likely to be such a bug),
> > it seems to be aiming at the wrong target. I'm pacing and thinking,
> > doubt I'll come up with anything better, please don't hold breath.
>
> It is the correct target. I know all about your set_page_dirty_lock
> problems, but they aren't what I'm trying to fix.

Yes, I had noticed yours is a different issue. I'm saying that if we
can "fix" set_page_dirty_nolock not to sleep, then your issue is fixed
(as least as it affects set_page_dirty_lock, which is all your patch
is dealing with, and we hope all it needs to deal with). Because your
issue is with the sync_page in the lock_page of set_page_dirty_nolock,
and it's that particular lock_page which I'm trying to be rid of.

I now think it can be done: in cases where TestSetPageLocked finds
the page already locked, then I believe we can fall back to inode_lock
to stabilize. But I do need to consider the possibilities some more.

> AFAIKS, you could also make set_page_dirty_lock non sleeping quite
> easily by making inode slabs RCU freed.

Which would equally deal with your issue. Yes, but it's always
seemed to me too great a risk, to add an RCU delay into such
significant slab shrinking - I don't want to handle the fallout.

> What places want to use set_page_dirty_lock without sleeping?
> The only place in drivers/ apart from sg/st that SetPageDirty are
> rd.c and via_dmablit.c, both of which look OK, if a bit crufty.

I've not looked recently, but bio.c sticks in the mind as one which
is pushed to the full contortions to allow for sleeping there.

Hugh

2006-05-31 04:34:56

by NeilBrown

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Tuesday May 30, [email protected] wrote:
> Neil Brown wrote:
> > On Tuesday May 30, [email protected] wrote:
> >
> > As for your original problem.... I wonder if PG_locked is protecting
> > too much? It protects against IO and it also protects against ->mapping
> > changes. So if you want to ensure that ->mapping won't change, you
> > need to wait for any pending read request to finish, which seems a bit
> > dumb.
>
> I don't think that is the problem. set_page_dirty_lock is really
> unlikely to get held up on read IO: that'd mean there were two things
> writing into that page at the same time.

That's exactly my point - though I expand a bit more further down.
i.e. set_page_dirty_lock is unlikely to get held up on a read IO, however
it has to call into lock_page, and lock_page has no idea that it cannot be
waiting for IO, and so it has to call sync_page just in case.

>
> >
> > Maybe we need a new bit: PG_maplocked. You are only allowed to change
> > ->mapping or ->index of you hold PG_locked and PG_maplocked, you are
> > not allowed to wait for PG_locked while holding PG_maplocked, and
> > you can read ->mapping or ->index while PG_locked or PG_maplocked are
> > held.
> > Think of PG_locked like a mutex and PG_maplocked like a spinlock (and
> > probably use bit_spinlock to get it).
>
> Well the original problem is fixed by not doing the sync_page thing in
> set_page_dirty_lock. Is there any advantage to having another bit?
> Considering a) it will be very unlikely that a page is locked at the
> same time one would like to dirty it; and b) that would seem to imply
> adding extra atomic ops and barriers to reclaim and truncate (maybe
> others).

While I agree that avoiding the sync_page in this particular case is
likely to solve the problem, it seems rather fragile. Sometimes you
need to call ->sync_page when waiting for a lock, sometimes you
don't. Why the difference? Because there really are two different
locks here masquerading as one.

Yes there would be extra atomic ops at reclaim and truncate. Is that
a problem? I have no idea, but you sometimes need to break eggs to fix
an omelette. I suspect the extra bit would be very lightly
contended.


>
> >
> > Then set_page_dirty_lock would use PG_maplocked to get access to
> > ->mapping, and then hold a reference on the address_space while
> > calling into balance_dirty_pages ... I wonder how you hold a reference
> > on an address space...
>
> inode. Presumably PG_maplocked would pin it? I don't understand
> why you've brought balance_dirty_pages into it, though.
>

hmmmm.... no idea, sorry.
I was trying to trace through set_page_dirty to see if it was safe to
call it under a spinlock (or a bit_spinlock in this case). I must
have taken a wrong turn somewhere...

> >
> > There are presumably few pieces of code that change ->mapping. Once
> > they all take PG_maplocked as well as PG_locked, you can start freeing
> > up other code to take PG_maplocked instead of PG_locked....
> >
> > Does that make sense at all? Do we have any spare page bits?
>
> I'm sure it could be made to work, but I don't really see the point.
> If someone really wanted to do it, I guess the right way to go is have
> a PG_readin counterpart to PG_writeback (or even extend PG_writeback
> to PG_io)...

Yes, PG_readin or PG_io would be better names, but might be hard to
have a graceful transition to that sort of naming.

And the point is that once you separated that function from PG_locked,
lock_page would not need to wait for IO, and so would not need to call
sync_page, and so your problem would evaporate.

NeilBrown

2006-05-31 06:12:26

by Jens Axboe

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Tue, May 30 2006, Mark Lord wrote:
> Linus wrote:
> >(Yes, tagged queueing makes it less of an issue, of course. I know,
>
> My observations with (S)ATA tagged/native queuing, is that it doesn't make
> nearly the difference under Linux that it does under other OSs.
> Probably because our block layer is so good at ordering requests,
> either from plugging or simply from clever disk scheduling.

Hmm well, I have seen 30% performance increase for a random read work
load with NCQ, I'd say that is pretty nice. And of course there's the
whole write cache issue, with NCQ you _could_ get away with playing more
safe and disabling write back caching.

NCQ helps us with something we can never fix in software - the
rotational latency. Ordering is only a small part of the picture.

Plus I think that more recent drives have a better NCQ implementation,
the first models I tried were pure and utter crap. Lets just say it
didn't instill a lot of confidence in firmware engineers at various
unnamed drive companies.

--
Jens Axboe

2006-05-31 12:34:15

by Helge Hafting

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Linus Torvalds wrote:
> On Wed, 31 May 2006, Nick Piggin wrote:
>
>> The requests can only get merged if contiguous requests from the upper
>> layers come down, right?
>>
>
> It has nothing to do with merging. It has to do with IO patterns.
>
> Seeking.
>
> Seeking is damn expensive - much more so than command issue. People forget
> that sometimes.
>
> If you can sort the requests so that you don't have to seek back and
> forth, that's often a HUGE win.
>
> Yes, the requests will still be small, and yes, the IO might happen in 4kB
> chunks, but it happens a lot faster if you do it in a good elevator
> ordering and if you hit the track cache than if you seek back and forth.
>
This is correct, but doesn't really explain why plugging might be good.

If requests go to disk immediately and the disk is able to keep
up with the seeks, then plugging doesn't help. This is a low-bandwith
case of course, but servicing each request immediately will
keep the latency lower. The fact that the disk gets busier doesn't
matter unless you worry about power consumption for the access arm.

If lots of requests come in and we don't do plugging, then
lots of the requests will be nicely sorted while waiting for
the first seek. And when this sorted lot goes to disk, further
requests will be sorted. I.e. congestion itself can work as dynamic
plugging, giving longer and longer sorted queues until the disk keeps
up with us.

This should get us just enough sorting for the disk to keep up
with the bandwidth we demand, and therefore minimal latency
for that bandwith.

This is theory, and perhaps it doesn't quite reflect reaility.
I cannot see why it shouldn't, though - it'd be interesting
to know if I missed something obvious here.

Helge Hafting

2006-05-31 12:36:34

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Wed, 2006-05-31 at 14:31 +0200, Helge Hafting wrote:
> Linus Torvalds wrote:
> > On Wed, 31 May 2006, Nick Piggin wrote:
> >
> >> The requests can only get merged if contiguous requests from the upper
> >> layers come down, right?
> >>
> >
> > It has nothing to do with merging. It has to do with IO patterns.
> >
> > Seeking.
> >
> > Seeking is damn expensive - much more so than command issue. People forget
> > that sometimes.
> >
> > If you can sort the requests so that you don't have to seek back and
> > forth, that's often a HUGE win.
> >
> > Yes, the requests will still be small, and yes, the IO might happen in 4kB
> > chunks, but it happens a lot faster if you do it in a good elevator
> > ordering and if you hit the track cache than if you seek back and forth.
> >
> This is correct, but doesn't really explain why plugging might be good.
>
> If requests go to disk immediately and the disk is able to keep
> up with the seeks, then plugging doesn't help. This is a low-bandwith
> case of course, but servicing each request immediately will
> keep the latency lower.

only for writes. afaik for reads and other synchronous beasts we already
unplug immediately anyway.

In addition unplugging happens after like 3 or 4 requests and after a
few miliseconds, whatever comes first ... so you never wait really long
(on a "what a seek costs" scale) anyway

2006-05-31 12:55:25

by Mark Lord

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Jens Axboe wrote:
>
> NCQ helps us with something we can never fix in software - the
> rotational latency. Ordering is only a small part of the picture.

Yup. And it also helps reduce the command-to-command latencies.

I'm all for it, and have implemented tagged queuing for a variety
of device drivers over the past five years (TCQ & NCQ). In every
case people say.. wow, I expected more of a difference than that,
while still noting the end result was faster under Linux than MS$.

Of course with artificial benchmarks, and the right firmware in
the right drives, it's easier to create and see a difference.
But I'm talking more life-like loads than just a multi-threaded
random seek generator.

Cheers

2006-05-31 13:00:24

by Jens Axboe

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Wed, May 31 2006, Mark Lord wrote:
> Jens Axboe wrote:
> >
> >NCQ helps us with something we can never fix in software - the
> >rotational latency. Ordering is only a small part of the picture.
>
> Yup. And it also helps reduce the command-to-command latencies.

That too, however that's a much much smaller part.

> I'm all for it, and have implemented tagged queuing for a variety
> of device drivers over the past five years (TCQ & NCQ). In every
> case people say.. wow, I expected more of a difference than that,
> while still noting the end result was faster under Linux than MS$.
>
> Of course with artificial benchmarks, and the right firmware in
> the right drives, it's easier to create and see a difference.
> But I'm talking more life-like loads than just a multi-threaded
> random seek generator.

Definitely, the random pseudo read work load is about as good as it gets
for NCQ, and it was tailored to keep the queue full at all times. So I
agree, real life will see less of a benefit. But it's still there.

--
Jens Axboe

2006-05-31 13:29:48

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Linus Torvalds wrote:
>
> On Wed, 31 May 2006, Nick Piggin wrote:
>
>>The requests can only get merged if contiguous requests from the upper
>>layers come down, right?
>
>
> It has nothing to do with merging. It has to do with IO patterns.
>
> Seeking.
>
> Seeking is damn expensive - much more so than command issue. People forget
> that sometimes.

OK, I didn't forget that it is expensive, but I didn't make the
connection that this is what you were arguing for.

I would be surprised if plugging made a big difference to seeking:
1. the queue will be plugged only if there are no other requests (so,
nothing else to seek to).
2. if the queue was plugged and we submitted one small request,
another request to somewhere else on the drive that comes in can
itself unplug the queue and cause seeking.

as-iosched is good at cutting down seeks because it doesn't "unplug"
in situation #2.

Now having a mechanism for a task to batch up requests might be a
good idea. Eg.

plug();
submit reads
unplug();
wait for page

I'd think this would give us the benefits of corse grained (per-queue)
plugging and more (e.g. it works when the request queue isn't empty).
And it would be simpler because the unplug point is explicit and doesn't
need to be kicked by lock_page or wait_on_page

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-31 13:39:22

by Jens Axboe

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Wed, May 31 2006, Nick Piggin wrote:
> Now having a mechanism for a task to batch up requests might be a
> good idea. Eg.
>
> plug();
> submit reads
> unplug();
> wait for page

How's this different from what we have now? The plugging will happen
implicitly, if we need to. If the queue is already running, chances are
that there are requests there so you won't get to your first read first
anyways.

The unplug(); wait_for_page(); is already required unless you want to
wait for the plugging to time out (unlikely, since you are now waiting
for io completion on one of them).

> I'd think this would give us the benefits of corse grained (per-queue)
> plugging and more (e.g. it works when the request queue isn't empty).
> And it would be simpler because the unplug point is explicit and doesn't
> need to be kicked by lock_page or wait_on_page

I kind of like having the implicit unplug, for several reasons. One is
that people forget to unplug. We had all sorts of hangs there in 2.4 and
earlier because of that. Making the plugging implicit should help that
though. The other is that I don't see what the explicit unplug gains
you. Once you start waiting for one of the pages submitted, that is
exactly the point where you want to unplug in the first place.

--
Jens Axboe

2006-05-31 13:44:11

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Hi Jens,
Sorry, I don't think I gave you any reply to this...

Jens Axboe wrote:
> On Mon, May 29 2006, Andrew Morton wrote:
>
>>
>>Mysterious question, that. A few years ago I think Jens tried pulling
>>unplugging out, but some devices still want it (magneto-optical
>>storage iirc). And I think we did try removing it, and it caused
>>hurt.
>
>
> I did, back when we had problems due to the blk_plug_lock being a global
> one. I first wanted to investigate if plugging still made a difference,
> otherwise we could've just ripped it out back than and the problem would
> be solved. But it did get us about a 10% boost on normal SCSI drives
> (don't think I tested MO drives at all), so it was fixed up.

Interesting. I'd like to know where from. I wonder if my idea of a
process context plug/unplug would solve it...

>
>
>>>With splice, the mapping can change, so you can have the wrong
>>>sync_page callback run against the page.
>>
>>Oh.
>
>
> Maybe I'm being dense, but I don't see a problem there. You _should_
> call the new mapping sync page if it has been migrated.

But can some other thread calling lock_page first find the old mapping,
and then run its ->sync_page which finds the new mapping? While it may
not matter for anyone in-tree, it does break the API so it would be
better to either fix it or rip it out than be silently buggy.

>>>
>>>The ->pin() calls in pipe_to_file and pipe_to_sendpage?
>>
>>One for Jens...
>
>
> splice never incs/decs any inode related reference counts, so if it
> needs to then yes it's broken. Any references to kernel code that deals
> with that?

Most code in the VM that has an inode/mapping gets called from the VFS,
which already does its thing somehow (I guess something like the file
pins the dentry which pins the inode). An iget might solve it. Or you
could use the lock_page_nosync() if/when the patch goes in (although I
don't want that to spread too far just yet).

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-31 13:54:34

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Jens Axboe wrote:
> On Wed, May 31 2006, Nick Piggin wrote:
>
>>Now having a mechanism for a task to batch up requests might be a
>>good idea. Eg.
>>
>>plug();
>>submit reads
>>unplug();
>>wait for page
>
>
> How's this different from what we have now? The plugging will happen
> implicitly, if we need to. If the queue is already running, chances are
> that there are requests there so you won't get to your first read first
> anyways.
>
> The unplug(); wait_for_page(); is already required unless you want to
> wait for the plugging to time out (unlikely, since you are now waiting
> for io completion on one of them).
>
>
>>I'd think this would give us the benefits of corse grained (per-queue)
>>plugging and more (e.g. it works when the request queue isn't empty).
>>And it would be simpler because the unplug point is explicit and doesn't
>>need to be kicked by lock_page or wait_on_page
>
>
> I kind of like having the implicit unplug, for several reasons. One is
> that people forget to unplug. We had all sorts of hangs there in 2.4 and
> earlier because of that. Making the plugging implicit should help that
> though. The other is that I don't see what the explicit unplug gains
> you. Once you start waiting for one of the pages submitted, that is
> exactly the point where you want to unplug in the first place.

OK I wasn't aware it was explicit in 2.4 and earlier.

Two upsides I see to explicit: firstly, it works on non-empty queues. Less
of a problem perhaps, but only because it is statistically less likely to
be the next submitted, so there will still be some improvement.

Second, for async work (aio, readahead, writeback, writeout for page reclaim),
the point where you wait is probably not the best place to unplug.

Also, it would allow lock_page to be untangled.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-31 14:30:51

by Hugh Dickins

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Wed, 31 May 2006, Hugh Dickins wrote:
>
> Yes, I had noticed yours is a different issue. I'm saying that if we
> can "fix" set_page_dirty_nolock not to sleep, then your issue is fixed
> (as least as it affects set_page_dirty_lock, which is all your patch
> is dealing with, and we hope all it needs to deal with). Because your
> issue is with the sync_page in the lock_page of set_page_dirty_nolock,
> and it's that particular lock_page which I'm trying to be rid of.
>
> I now think it can be done: in cases where TestSetPageLocked finds
> the page already locked, then I believe we can fall back to inode_lock
> to stabilize. But I do need to consider the possibilities some more.

No, I'm wrong, and have been all along in thinking set_page_dirty_lock
could be done better avoiding the lock_page. inode_lock gives the hint:
it's not irq safe, nor is mapping->private_lock, and both may be taken
by set_page_dirty. The lock_page in set_page_dirty_lock was the obvious
reason it couldn't be used at interrupt time, but not the only reason.

So your lock_page_nosync does look the best way forward to me now.

Hugh

2006-05-31 14:47:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?



On Wed, 31 May 2006, Nick Piggin wrote:
>
> Now having a mechanism for a task to batch up requests might be a
> good idea. Eg.
>
> plug();
> submit reads
> unplug();
> wait for page

What do you think we're _talking_ about?

What do you think my example of sys_readahead() was all about?

WE DO HAVE EXACTLY THAT MECHANISM. IT'S CALLED PLUGGING!

> I'd think this would give us the benefits of corse grained (per-queue)
> plugging and more (e.g. it works when the request queue isn't empty).
> And it would be simpler because the unplug point is explicit and doesn't
> need to be kicked by lock_page or wait_on_page

What do you think plugging IS?

It's _exactly_ what you're talking about. And yes, we used to have
explicit unplugging (a long long long time ago), and IT SUCKED. People
would forget, but even more importantly, people would do it even when not
needed because they didn't have a good place to do it because the waiter
was in a totally different path.

The reason it's kicked by wait_on_page() is that is when it's needed.

Linus

2006-05-31 14:57:49

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Linus Torvalds wrote:
>
> On Wed, 31 May 2006, Nick Piggin wrote:
>
>>Now having a mechanism for a task to batch up requests might be a
>>good idea. Eg.
>>
>>plug();
>>submit reads
>>unplug();
>>wait for page
>
>
> What do you think we're _talking_ about?

Plugging. Emphasis on task.

>
> What do you think my example of sys_readahead() was all about?
>
> WE DO HAVE EXACTLY THAT MECHANISM. IT'S CALLED PLUGGING!

It isn't exactly that. I'm thinking per-task rather than per-queue might
be a better idea because a) you don't know what else is putting requests
on the queue, and b) the point where you wait is not always the best place
to unplug.

>
>
>>I'd think this would give us the benefits of corse grained (per-queue)
>>plugging and more (e.g. it works when the request queue isn't empty).
>>And it would be simpler because the unplug point is explicit and doesn't
>>need to be kicked by lock_page or wait_on_page
>
>
> What do you think plugging IS?
>
> It's _exactly_ what you're talking about.

Yes, and I'm talking about per-task vs per-queue plugging (because I want
to get rid of the implicit unplug, because I want to make lock_page & co
nicer).

> And yes, we used to have
> explicit unplugging (a long long long time ago), and IT SUCKED. People
> would forget, but even more importantly, people would do it even when not

I don't see what the problem is. Locks also suck if you forget to unlock
them.

> needed because they didn't have a good place to do it because the waiter
> was in a totally different path.

Example?

>
> The reason it's kicked by wait_on_page() is that is when it's needed.

Yes, you already said that, and I showed cases where that isn't optimal.

I don't know why you think this way of doing plugging is fundamentally
right and anything else must be wrong... it is always heuristic, isn't
it?

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-31 15:09:16

by Hugh Dickins

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Wed, 31 May 2006, Nick Piggin wrote:
> Jens Axboe wrote:
> >
> > Maybe I'm being dense, but I don't see a problem there. You _should_
> > call the new mapping sync page if it has been migrated.
>
> But can some other thread calling lock_page first find the old mapping,
> and then run its ->sync_page which finds the new mapping? While it may
> not matter for anyone in-tree, it does break the API so it would be
> better to either fix it or rip it out than be silently buggy.

Splicing a page from one mapping to another is rather worrying/exciting,
but it does look safely done to me. remove_mapping checks page_count
while page lock and old mapping->tree_lock are held, and gives up if
anyone else has an interest in the page. And we already know it's
unsafe to lock_page without holding a reference to the page, don't we?

Hugh

2006-05-31 15:12:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?



On Wed, 31 May 2006, Linus Torvalds wrote:
>
> The reason it's kicked by wait_on_page() is that is when it's needed.

Btw, that's not how it has always been done.

For the longest time, it was actually triggered by scheduler activity, in
particular, plugging used to be a workqueue event that was triggered by
the scheduler (or any explicit points when you wanted it to be triggered
earlier).

So whenever you scheduled, _all_ plugs would be unplugged.

It was specialized to wait_for_page() in order to avoid unnecessary
overhead in scheduling (making it more directed), and to allow you to
leave the request around for further merging/sorting even if a process had
to wait for something unrelated.

But in particular, the old non-directed unplug didn't work well in SMP
environments (because _one_ CPU re-scheduling obviously doesn't mean
anything for the _other_ CPU that is actually working on setting up the
request queue).

The point being that we could certainly do it somewhere else. Doing it in
wait_for_page() (and at least historically, in waiting for bh's) is really
nothing more than trying to have as few points as possible where it's
done, and at the same time not missing any.

And yes, I'd _love_ to have better interfaces to let people take advantage
of this than sys_readahead(). sys_readahead() was a 5-minute hack that
actually does generate wonderful IO patterns, but it is also not all that
useful (too specialized, and non-portable).

I tried at one point to make us do directory inode read-ahead (ie put the
inodes on a read-ahead queue when doing a directory listing), but that
failed miserably. All the low-level filesystems are very much designed to
have inode reading be synchronous, and it would have implied major surgery
to do (and, sadly, my preliminary numbers also seemed to say that it might
be a huge time waster, with enough users just wanting the filenames and
not the inodes).

The thing is, right now we have very bad IO patterns for things that
traverse whole directory trees (like doign a "tar" or a "diff" of a tree
that is cold in the cache) because we have way too many serialization
points. We do a good job of prefetching within a file, but if you have
source trees etc, the median size for files is often smaller than a single
page, so the prefetching ends up being a non-issue most of the time, and
we do _zero_ prefetching between files ;/

Now, the reason we don't do it is that it seems to be damn hard to do. No
question about that. Especially since it's only worth doing (obviously) on
the cold-cache case, and that's also when we likely have very little
information about what the access patterns might be.. Oh, well.

Even with sys_readahead(), my simple "pre-read a whole tree" often ended
up waiting for inode IO (although at least the fact that several inodes
fit in one block gets _some_ of that).

Linus

2006-05-31 15:17:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?



On Thu, 1 Jun 2006, Nick Piggin wrote:
>
> > And yes, we used to have explicit unplugging (a long long long time ago),
> > and IT SUCKED. People would forget, but even more importantly, people would
> > do it even when not
>
> I don't see what the problem is. Locks also suck if you forget to unlock
> them.

Locks are simple, and in fact are _made_ simple on purpose. We try very
hard to unlock in the same function that we lock, for example. Because if
we don't, bugs happen.

That's simply not _practical_ for IO. Think about it. Quite often, the
waiting is done somewhere else than the actual submission.

> > needed because they didn't have a good place to do it because the waiter was
> > in a totally different path.
>
> Example?

Pretty much all of them.

Where do you wait for IO?

Would you perhaps say "wait_on_page()"?

In other words, we really _do_ exactly what you think we should do.

> I don't know why you think this way of doing plugging is fundamentally
> right and anything else must be wrong... it is always heuristic, isn't
> it?

A _particular_ way of doing plugging is not "fundamentally right". I'm
perfectly happy with chaning the place we unplug, if people want that.
We've done it several times.

But plugging as a _concept_ is definitely fundamentally right, exactly
because it allows us to have the notion of "plug + n*<random submit by
different paths> + unplug".

And you were not suggesting moving unplugging around. You were suggesting
removing the feature. Which is when I said "no f*cking way!".

Linus

2006-05-31 15:22:28

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Hugh Dickins wrote:
> On Wed, 31 May 2006, Nick Piggin wrote:
>
>>Jens Axboe wrote:
>>
>>>Maybe I'm being dense, but I don't see a problem there. You _should_
>>>call the new mapping sync page if it has been migrated.
>>
>>But can some other thread calling lock_page first find the old mapping,
>>and then run its ->sync_page which finds the new mapping? While it may
>>not matter for anyone in-tree, it does break the API so it would be
>>better to either fix it or rip it out than be silently buggy.
>
>
> Splicing a page from one mapping to another is rather worrying/exciting,
> but it does look safely done to me. remove_mapping checks page_count
> while page lock and old mapping->tree_lock are held, and gives up if
> anyone else has an interest in the page. And we already know it's
> unsafe to lock_page without holding a reference to the page, don't we?

Oh, that's true. I had thought that splice allows stealing pages with
an elevated refcount, which Jens was thinking about at one stage. But
I see that code isn't in mainline. AFAIKS it would allow other
->pin()ers to attempt to lock the page while it was being stolen.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-31 15:34:07

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

[trimming CC list]

Linus Torvalds wrote:
>
> On Thu, 1 Jun 2006, Nick Piggin wrote:

>>
>>Example?
>
>
> Pretty much all of them.
>
> Where do you wait for IO?
>
> Would you perhaps say "wait_on_page()"?
>
> In other words, we really _do_ exactly what you think we should do.

I still think the submitter should plug before they start a set of
requests (the submitter currently does not plug, the queue does when
it empties), and unplug when it has finished submission (not when a
process next waits, because that is suboptimal for asynch IO).

When you do this, the plug/unplug should become simple like locks too.

You're really keen on unplugging at the point of waiting. I don't get
it.

>
>
>>I don't know why you think this way of doing plugging is fundamentally
>>right and anything else must be wrong... it is always heuristic, isn't
>>it?
>
>
> A _particular_ way of doing plugging is not "fundamentally right". I'm
> perfectly happy with chaning the place we unplug, if people want that.
> We've done it several times.

OK.

>
> But plugging as a _concept_ is definitely fundamentally right, exactly
> because it allows us to have the notion of "plug + n*<random submit by
> different paths> + unplug".

Yeah the concept isn't bad. I think the queue based implementation, and
unplug at wait time isn't great.

>
> And you were not suggesting moving unplugging around. You were suggesting
> removing the feature. Which is when I said "no f*cking way!".

It may be a good concept but if it doesn't help, then removing it is a good
idea too ;) Now I've been told it does help, so instead of removing it I
suggest changing it. Just exploring ideas.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-31 16:00:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?



On Thu, 1 Jun 2006, Nick Piggin wrote:
>
> You're really keen on unplugging at the point of waiting. I don't get
> it.

Actually, no. I'm _not_ really keen on unplugging at the point of waiting.

I'm keen on unplugging at a point that makes sense, and is safe.

The problem is, you're not even exploring alternatives. Where the hell
_would_ you unplug it?

You can't unplug it in the place where we submit IO. That's _insane_,
because it basically means never plugging at all.

And most of the callers don't even _know_ whether we submit IO or not. For
example, let's pick a real and relevant example (they don't get much more
relevant than this): "do_generic_mapping_read()".

Tell me WHERE you can unplug in that sequence. I will tell you where you
can NOT unplug:

- you can NOT unplug _anywhere_ inside of the read-ahead logic, because
we obviously don't want it there (it would break the whole concept of
read-ahead, not to mention real code like sys_readahead()).

- you can NOT unplug in "->readpage()", for similar reasons (readahead
again, and again, the whole point of unplugging is that we want to do
several readpages and then unplug later)

- you can NOT just unplug in the path _after_ "readpage()", because the
IO may have been started by SOMEBODY ELSE that just did read-ahead, and
didn't unplug (on _purpose_ - the whole point of doing read-ahead is to
allow concurrent IO execution, so a read-aheader that unplugs is broken
by definition)

Those three are not just my "personal ideas". They are fundamental to how
unplugging works, and more importantly, to the whole _point_ of plugging.

Agreed?

Now, look at where we _currently_ unplug. I claim that there are exactly
_two_ places where we have to unplug ((1) we find a page that is not
up-to-date and (2) we've started a read on a page ourselves), and I claim
that those two places are _exactly_ the two places where we currently do
"lock_page()".

Again, this is not a "what if" schenario, or something where my "opinions"
are at stake. This is hard, cold, fact. We could do the unplugging outside
of the lock-page, but we'd do it in exactly that situation.

So what is your alternative? Put the explicit unplug at every possible
occurrence of lock_page() (and keep in mind that some of them don't want
it: we only want it when the lock-page will block, which is not always
true. Some people lock the page not because it's under IO, and they're
waiting for it to be unlocked, but because it's dirty and they're going to
start IO on it - the lock_page() generally won't block on that, and if
it doesn't, we don't want to kick previous IO).

In other words, we actually want to unplug at lock_page(), BUT ONLY IF IT
NEEDS OT WAIT, which is by no means all of them. So it's more than just
"add an unplug at lock_page()" it's literally "add an unplug at any
lock-page that blocks".

Still wondering why it ends up being _inside_ lock_page()?

Linus

2006-05-31 16:12:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?



On Wed, 31 May 2006, Linus Torvalds wrote:
>
> In other words, we actually want to unplug at lock_page(), BUT ONLY IF IT
> NEEDS OT WAIT, which is by no means all of them. So it's more than just
> "add an unplug at lock_page()" it's literally "add an unplug at any
> lock-page that blocks".

Btw, don't get me wrong. In the read case, every time we do a lock_page(),
we might as well unplug unconditionally, because we effectively know we're
going to wait (we just checked that it's not up-to-date). Sure, there's a
race window, but we don't care - the window is very small, and in the end,
unplugging isn't a correctness issue as long as you do it at least as
often as required (ie unplugging too much is ok and at worst just makes
for bad performance - so a very unlikely race that causes _extra_
unplugging is fine as long as it's unlikely. forgetting to unplug is bad).

In the write case, lock_page() may or may not need an unplug. In those
cases, it needs unplugging if it was locked before, but obviously not if
it wasn't.

In the "random case" where we use the page lock not because we want to
start IO on it, but because we need to make sure that page->mapping
doesn't change, we don't really care about the IO, but we do need to
unplug just to make sure that the IO will complete.

And I suspect your objection to unplugging is not really about unplugging
itself. It's literally about the fact that we use the same page lock for
IO and for the ->mapping thing, isn't it?

IOW, you don't actually dislike plugging itself, you dislike it due to the
effects of a totally unrelated locking issue, namely that we use the same
lock for two totally independent things. If the ->mapping thing were to
use a PG_map_lock that didn't affect plugging one way or the other, you
wouldn't have any issues with unplugging, would you?

And I think _that_ is what really gets us to the problem.

Linus

2006-05-31 16:20:06

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Linus Torvalds wrote:

> Tell me WHERE you can unplug in that sequence. I will tell you where you
> can NOT unplug:

...

> - you can NOT just unplug in the path _after_ "readpage()", because the
> IO may have been started by SOMEBODY ELSE that just did read-ahead, and
> didn't unplug (on _purpose_ - the whole point of doing read-ahead is to
> allow concurrent IO execution, so a read-aheader that unplugs is broken
> by definition)

Umm, this happens with the current lock_page() after readpage. And
with per-task plugs, you do not unplug anybody else.

Sure it isn't perfect (you can still concoct a sequence of concurrent
tasks doing funny things that result in a slightly suboptimal request
pattern), but to me it sounds as good or better in some cases than
what we have now.

> So what is your alternative? Put the explicit unplug at every possible
> occurrence of lock_page() (and keep in mind that some of them don't want
> it: we only want it when the lock-page will block, which is not always
> true. Some people lock the page not because it's under IO, and they're
> waiting for it to be unlocked, but because it's dirty and they're going to
> start IO on it - the lock_page() generally won't block on that, and if
> it doesn't, we don't want to kick previous IO).

I keep telling you. Put the unplug after submission of IO. Not before
waiting for IO.

I'll give you one example of how it could be better (feel free to ask
for others too). Your sys_readahead(). Someone asks to readahead 1MB
of data, currently if the queue is empty, that's going to sit around
for 4ms before starting the read. 4ms later, the app comes back hoping
for the request to have finished. Unfortunately it takes another 4ms.
Throughput is halved.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-31 16:22:43

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Nick Piggin wrote:
> Linus Torvalds wrote:
>
>> Tell me WHERE you can unplug in that sequence. I will tell you where
>> you can NOT unplug:
>
>
> ...
>
>> - you can NOT just unplug in the path _after_ "readpage()", because
>> the IO may have been started by SOMEBODY ELSE that just did
>> read-ahead, and didn't unplug (on _purpose_ - the whole point of
>> doing read-ahead is to allow concurrent IO execution, so a
>> read-aheader that unplugs is broken by definition)
>
>
> Umm, this happens with the current lock_page() after readpage. And
> with per-task plugs, you do not unplug anybody else.

If this wasn't clear: I don't mean per-task plugs as in "the task
explicitly plugs and unplugs the block device"[*]; I mean really
per-task plugs.

[*] That would be crazy anyway because that would imply some random
task can plug an filled request queue that is going full tilt.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-31 16:26:57

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

Linus Torvalds wrote:

> And I suspect your objection to unplugging is not really about unplugging
> itself. It's literally about the fact that we use the same page lock for
> IO and for the ->mapping thing, isn't it?

Nearly, but not quite that far: it's that we sync_page in lock_page.

I don't think using the single lock for both is too bad (in many
ways they are related eg. you don't want the page to be truncated
while IO is in progress).

>
> IOW, you don't actually dislike plugging itself, you dislike it due to the
> effects of a totally unrelated locking issue, namely that we use the same
> lock for two totally independent things. If the ->mapping thing were to
> use a PG_map_lock that didn't affect plugging one way or the other, you
> wouldn't have any issues with unplugging, would you?
>
> And I think _that_ is what really gets us to the problem.

No I don't dislike plugging at all ;) Just this tangle as you say.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-31 16:39:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?



On Thu, 1 Jun 2006, Nick Piggin wrote:
>
> I keep telling you. Put the unplug after submission of IO. Not before
> waiting for IO.

And that's exactly where we have the lock_page().

And you ignored the list of _requirements_ I had, so you just missed the
other place you _have_ to have the unplug, namely in the "found a page
that was not yet up-to-date" case (look for the other lock_page()).
Because the person who started the IO might be off doing something else,
and may not be unplugging now (or ever, in the case of readahead).

In other words, when you start arguing, at least read my emails. Your
suggestion would have introduced a bug by not waiting on that other place.

Linus

2006-05-31 16:42:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?



On Thu, 1 Jun 2006, Nick Piggin wrote:
>
> If this wasn't clear: I don't mean per-task plugs as in "the task
> explicitly plugs and unplugs the block device"[*]; I mean really
> per-task plugs.

That would be insane. It would mean that you'd have to unplug whether you
wanted to or not. Ie you've now made "sys_readahead()" impossible to do
well, and doing read-ahead across multiple files.

You're ignoring all the _reasons_ for plugging in the first place.

Linus

2006-05-31 17:48:08

by Jens Axboe

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Wed, May 31 2006, Nick Piggin wrote:
> Hi Jens,
> Sorry, I don't think I gave you any reply to this...

No worries :)

> Jens Axboe wrote:
> >On Mon, May 29 2006, Andrew Morton wrote:
> >
> >>
> >>Mysterious question, that. A few years ago I think Jens tried pulling
> >>unplugging out, but some devices still want it (magneto-optical
> >>storage iirc). And I think we did try removing it, and it caused
> >>hurt.
> >
> >
> >I did, back when we had problems due to the blk_plug_lock being a global
> >one. I first wanted to investigate if plugging still made a difference,
> >otherwise we could've just ripped it out back than and the problem would
> >be solved. But it did get us about a 10% boost on normal SCSI drives
> >(don't think I tested MO drives at all), so it was fixed up.
>
> Interesting. I'd like to know where from. I wonder if my idea of a
> process context plug/unplug would solve it...

As far as I recall, just doing a simple diff between source directories
on the same drive was a pretty good example of where the plugging gained
you some. A little on the benchmarks as well, iirc. I would have _loved_
to rip plugging out at that point, so while I may not remember the
details yet, don't think I did that work if I could have avoided it :-)

> >>>With splice, the mapping can change, so you can have the wrong
> >>>sync_page callback run against the page.
> >>
> >>Oh.
> >
> >
> >Maybe I'm being dense, but I don't see a problem there. You _should_
> >call the new mapping sync page if it has been migrated.
>
> But can some other thread calling lock_page first find the old mapping,
> and then run its ->sync_page which finds the new mapping? While it may
> not matter for anyone in-tree, it does break the API so it would be
> better to either fix it or rip it out than be silently buggy.

It looks ok to me, you can inspect fs/splice.c:pipe_to_file() and see if
you can spot anything.

> >>>The ->pin() calls in pipe_to_file and pipe_to_sendpage?
> >>
> >>One for Jens...
> >
> >
> >splice never incs/decs any inode related reference counts, so if it
> >needs to then yes it's broken. Any references to kernel code that deals
> >with that?
>
> Most code in the VM that has an inode/mapping gets called from the VFS,
> which already does its thing somehow (I guess something like the file
> pins the dentry which pins the inode). An iget might solve it. Or you
> could use the lock_page_nosync() if/when the patch goes in (although I
> don't want that to spread too far just yet).

I'll be sure to look over that, thanks!

--
Jens Axboe

2006-05-31 17:52:18

by Jens Axboe

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Thu, Jun 01 2006, Nick Piggin wrote:
> Hugh Dickins wrote:
> >On Wed, 31 May 2006, Nick Piggin wrote:
> >
> >>Jens Axboe wrote:
> >>
> >>>Maybe I'm being dense, but I don't see a problem there. You _should_
> >>>call the new mapping sync page if it has been migrated.
> >>
> >>But can some other thread calling lock_page first find the old mapping,
> >>and then run its ->sync_page which finds the new mapping? While it may
> >>not matter for anyone in-tree, it does break the API so it would be
> >>better to either fix it or rip it out than be silently buggy.
> >
> >
> >Splicing a page from one mapping to another is rather worrying/exciting,
> >but it does look safely done to me. remove_mapping checks page_count
> >while page lock and old mapping->tree_lock are held, and gives up if
> >anyone else has an interest in the page. And we already know it's
> >unsafe to lock_page without holding a reference to the page, don't we?
>
> Oh, that's true. I had thought that splice allows stealing pages with
> an elevated refcount, which Jens was thinking about at one stage. But
> I see that code isn't in mainline. AFAIKS it would allow other
> ->pin()ers to attempt to lock the page while it was being stolen.

It got me in trouble, and we were already way too late into 2.6.17 to
take on more risky stuff (splice already had a few!). So right now the
page is definitely pruned and single while being stolen.

--
Jens Axboe

2006-05-31 17:54:42

by Jens Axboe

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Wed, May 31 2006, Hugh Dickins wrote:
> > What places want to use set_page_dirty_lock without sleeping?
> > The only place in drivers/ apart from sg/st that SetPageDirty are
> > rd.c and via_dmablit.c, both of which look OK, if a bit crufty.
>
> I've not looked recently, but bio.c sticks in the mind as one which
> is pushed to the full contortions to allow for sleeping there.

Indeed, there's a whole bunch of functions and punting to workqueues
just to work around that :-)

--
Jens Axboe

2006-05-31 18:11:14

by Jens Axboe

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Wed, May 31 2006, Linus Torvalds wrote:
>
>
> On Wed, 31 May 2006, Linus Torvalds wrote:
> >
> > The reason it's kicked by wait_on_page() is that is when it's needed.
>
> Btw, that's not how it has always been done.
>
> For the longest time, it was actually triggered by scheduler activity, in
> particular, plugging used to be a workqueue event that was triggered by
> the scheduler (or any explicit points when you wanted it to be triggered
> earlier).

Now it's time for me to give Linus a history lesson on plugging,
apparently.

Plugging used to be done by the issuer and with immediate unplugging
when you were done issuing the blocks. Both of these actions happened in
ll_rw_block() if the caller was submitting more than on buffer_head, and
it happened without the caller knowing about plugging. He never had to
unplug.

1.2 then expanded that to be able to plug more than one device at the
time. It didn't really do much except allow the array of buffers passed
in being on separate devices. The plugging was still hidden from the
caller.

1.3 and on introduced a more generalised infrastructure for this, moving
the plugging to a task queue (tq_disk). This meant that we could finally
separate the plugging and unplugging from the direct IO issue. So
whenever someone wanted to the a wait_on_buffer/lock_page() equiv for
something that might to be issued, it would have to do a
run_task_queue(&tq_disk) first which would then unplug all the queues
that were plugged.

tq_disk was then removed and moved to a block list during the 2.5
massive io/bio changes. The functionality remained the same, though -
you had to kick all queues to force the unplug of the page you wanted.
This infrastructure lasted all up to the point where silly people with
lots of CPU's started complaining about lock contention for 32-way
systems with thousands of disks. This is the point where I reevaluated
the benefits of plugging, found it good, and decided to fix it up.
Plugging then became a simple state bit in the queue, and you would have
to pass in eg the page you wanted when asking to unplug. This would kick
just the specific queue you needed. It also got a timer tied to it, so
that we could unplug after foo msecs if we wanted. Additionally, it will
also self-unplug once a certain plug depth has been reached (like 4
requests).

Anyway, the point I wanted to make is that this was never driven by
scheduler activity. So there!

--
Jens Axboe

2006-05-31 18:26:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?



On Wed, 31 May 2006, Jens Axboe wrote:
>
> Anyway, the point I wanted to make is that this was never driven by
> scheduler activity. So there!

Heh. I confused tq_disk and tq_scheduler, methinks.

And yes, "run_task_queue(&tq_disk)" was in lock_page(), not the scheduler.

Linus

2006-05-31 23:59:32

by NeilBrown

[permalink] [raw]
Subject: Re: [rfc][patch] remove racy sync_page?

On Thursday June 1, [email protected] wrote:
>
> I keep telling you. Put the unplug after submission of IO. Not before
> waiting for IO.
>
> I'll give you one example of how it could be better (feel free to ask
> for others too). Your sys_readahead(). Someone asks to readahead 1MB
> of data, currently if the queue is empty, that's going to sit around
> for 4ms before starting the read. 4ms later, the app comes back hoping
> for the request to have finished. Unfortunately it takes another 4ms.
> Throughput is halved.

I think this is all a question of heuristics. There is a trade off
between throughput and latency, and making the right decisions require
prescience, and I don't think either Intel or AMD have implemented
yet.

One could argue that the best you can get would involve making
decisions at the lowest level - where device characteristics are
known - and using hints from upper levels like "I'm going to submit
lots of requests now" and "Ok, I'm done" and "I want this page NOW".
These should be hints and the device can treat them as it likes.

Maybe a disk device could estimate the time it would take to process
the current plugged queue, and if the first request were older than
half that time, unplug the queue, unless there was a genuine
expectation of more requests very soon...

Currently the decisions are made at a reasonably low level, but the
only hint is "I want some page somewhere on this device now, or maybe
I just want to stop anyone changing ->mapping", which is a fairly
broad and inaccurate hint....

But the real point is that as this is an heuristic, it is very hard to
argue in the abstract about what is best. We really need measurements
to show that changes to have a measurable effect.

I have an interesting issue with plugging and raid5 at the moment. As
I mentioned earlier, plugging only affects incomplete stripes being
written. So if a very long, totally sequential write is being issued
(dd of=/dev/md0) then we never really need to unplug (except may at
the end). But unplugging does happen (3msec timeout at least) and so
we sometimes end up pre-reading for partial stripes even though there
is really no need.

What I would really like is for ->unplug_fn to know something about
why it is being called.
- if it is for read, I'll unplug component devices, but not
incomplete writes.
- if it is for write, I'll unplug writes, but I'd really rather
know which address was required
- if it is a timeout, I'll unplug writes that have been pending
for more than 2msec.

I suspect there is at least one workload that this would help, but I
would need to measure to be sure.

... maybe I'll try adding a 'how' option to ->unplug_fn and see what I
can come up with...

NeilBrown