If md is asked to store a bitmap in a file, it tries to hold onto the
page cache pages for that file, manipulate them directly, and call a
cocktail of operations to write the file out. I don't believe this is
a supportable approach.
This patch changes md to use the same approach as swap files.
i.e. bmap is used to enumerate all the block addresses of parts of the file
and we write directly to those blocks of the device.
swapfile only uses parts of the file that provide full pages at
contiguous addresses. We don't have that luxury, so we have to cope
with pages that are non-contiguous in storage. To handle this
we attach buffers to each page, and store the addresses in those buffers.
With this approach the pagecache may contain data which is inconsistent with
what is on disk. To alleviate the problems this can cause, md invalidates
the pagecache when releasing the file. If the file is to be examined
while the array is active (a non-critical but occasionally useful function),
O_DIRECT io must be used. And a new version of mdadm will have support for this.
This approach simplifies a lot of code:
- we no longer need to keep a list of pages which we need to wait for,
as the b_endio function can keep track of how many outstanding
writes there are. This saves a mempool.
- -EAGAIN returns from write_page are no longer possible (not sure if
  they ever actually were).
Signed-off-by: Neil Brown <[email protected]>
### Diffstat output
./drivers/md/bitmap.c | 283 ++++++++++++++++++++----------------------
./include/linux/raid/bitmap.h | 7 -
2 files changed, 139 insertions(+), 151 deletions(-)
diff ./drivers/md/bitmap.c~current~ ./drivers/md/bitmap.c
--- ./drivers/md/bitmap.c~current~ 2006-05-12 16:00:03.000000000 +1000
+++ ./drivers/md/bitmap.c 2006-05-12 16:01:06.000000000 +1000
@@ -67,7 +67,6 @@ static inline char * bmname(struct bitma
return bitmap->mddev ? mdname(bitmap->mddev) : "mdX";
}
-#define WRITE_POOL_SIZE 256
/*
* just a placeholder - calls kmalloc for bitmap pages
@@ -279,75 +278,137 @@ static int write_sb_page(mddev_t *mddev,
*/
static int write_page(struct bitmap *bitmap, struct page *page, int wait)
{
- int ret = -ENOMEM;
+ struct buffer_head *bh;
if (bitmap->file == NULL)
return write_sb_page(bitmap->mddev, bitmap->offset, page, wait);
- flush_dcache_page(page); /* make sure visible to anyone reading the file */
+ bh = page_buffers(page);
- if (wait)
- lock_page(page);
- else {
- if (TestSetPageLocked(page))
- return -EAGAIN; /* already locked */
- if (PageWriteback(page)) {
- unlock_page(page);
- return -EAGAIN;
- }
+ while (bh && bh->b_blocknr) {
+ atomic_inc(&bitmap->pending_writes);
+ set_buffer_locked(bh);
+ set_buffer_mapped(bh);
+ submit_bh(WRITE, bh);
+ bh = bh->b_this_page;
+ }
+
+ if (wait) {
+ wait_event(bitmap->write_wait,
+ atomic_read(&bitmap->pending_writes)==0);
+ return (bitmap->flags & BITMAP_WRITE_ERROR) ? -EIO : 0;
}
+ return 0;
+}
+
+static void end_bitmap_write(struct buffer_head *bh, int uptodate)
+{
+ struct bitmap *bitmap = bh->b_private;
+ unsigned long flags;
- ret = page->mapping->a_ops->prepare_write(bitmap->file, page, 0, PAGE_SIZE);
- if (!ret)
- ret = page->mapping->a_ops->commit_write(bitmap->file, page, 0,
- PAGE_SIZE);
- if (ret) {
- unlock_page(page);
- return ret;
+ if (!uptodate) {
+ spin_lock_irqsave(&bitmap->lock, flags);
+ bitmap->flags |= BITMAP_WRITE_ERROR;
+ spin_unlock_irqrestore(&bitmap->lock, flags);
}
+ if (atomic_dec_and_test(&bitmap->pending_writes))
+ wake_up(&bitmap->write_wait);
+}
- set_page_dirty(page); /* force it to be written out */
+/* copied from buffer.c */
+static void
+__clear_page_buffers(struct page *page)
+{
+ ClearPagePrivate(page);
+ set_page_private(page, 0);
+ page_cache_release(page);
+}
+static void free_buffers(struct page *page)
+{
+ struct buffer_head *bh = page_buffers(page);
- if (!wait) {
- /* add to list to be waited for */
- struct page_list *item = mempool_alloc(bitmap->write_pool, GFP_NOIO);
- item->page = page;
- spin_lock(&bitmap->write_lock);
- list_add(&item->list, &bitmap->complete_pages);
- spin_unlock(&bitmap->write_lock);
+ while (bh) {
+ struct buffer_head *next = bh->b_this_page;
+ free_buffer_head(bh);
+ bh = next;
}
- return write_one_page(page, wait);
+ __clear_page_buffers(page);
+ put_page(page);
}
-/* read a page from a file, pinning it into cache, and return bytes_read */
+/* read a page from a file.
+ * We both read the page, and attach buffers to the page to record the
+ * address of each block (using bmap). These addresses will be used
+ * to write the block later, completely bypassing the filesystem.
+ * This usage is similar to how swap files are handled, and allows us
+ * to write to a file with no concerns of memory allocation failing.
+ */
static struct page *read_page(struct file *file, unsigned long index,
- unsigned long *bytes_read)
+ struct bitmap *bitmap,
+ unsigned long count)
{
- struct inode *inode = file->f_mapping->host;
struct page *page = NULL;
- loff_t isize = i_size_read(inode);
- unsigned long end_index = isize >> PAGE_SHIFT;
+ struct inode *inode = file->f_dentry->d_inode;
+ loff_t pos = index << PAGE_SHIFT;
+ int ret;
+ struct buffer_head *bh;
+ sector_t block;
+ mm_segment_t oldfs;
PRINTK("read bitmap file (%dB @ %Lu)\n", (int)PAGE_SIZE,
(unsigned long long)index << PAGE_SHIFT);
- page = read_cache_page(inode->i_mapping, index,
- (filler_t *)inode->i_mapping->a_ops->readpage, file);
+ page = alloc_page(GFP_KERNEL);
+ if (!page)
+ page = ERR_PTR(-ENOMEM);
if (IS_ERR(page))
goto out;
- wait_on_page_locked(page);
- if (!PageUptodate(page) || PageError(page)) {
+
+ oldfs = get_fs();
+ set_fs(KERNEL_DS);
+ ret = vfs_read(file, (char __user*) page_address(page), count, &pos);
+ set_fs(oldfs);
+
+ if (ret >= 0 && ret != count)
+ ret = -EIO;
+ if (ret < 0) {
put_page(page);
- page = ERR_PTR(-EIO);
+ page = ERR_PTR(ret);
goto out;
}
+ bh = alloc_page_buffers(page, 1<<inode->i_blkbits, 0);
+ if (!bh) {
+ put_page(page);
+ page = ERR_PTR(-ENOMEM);
+ goto out;
+ }
+ attach_page_buffers(page, bh);
+ block = index << (PAGE_SHIFT - inode->i_blkbits);
+ while (bh) {
+ if (count == 0)
+ bh->b_blocknr = 0;
+ else {
+ bh->b_blocknr = bmap(inode, block);
+ if (bh->b_blocknr == 0) {
+ /* Cannot use this file! */
+ free_buffers(page);
+ page = ERR_PTR(-EINVAL);
+ goto out;
+ }
+ bh->b_bdev = inode->i_sb->s_bdev;
+ if (count < (1<<inode->i_blkbits))
+ count = 0;
+ else
+ count -= (1<<inode->i_blkbits);
- if (index > end_index) /* we have read beyond EOF */
- *bytes_read = 0;
- else if (index == end_index) /* possible short read */
- *bytes_read = isize & ~PAGE_MASK;
- else
- *bytes_read = PAGE_SIZE; /* got a full page */
+ bh->b_end_io = end_bitmap_write;
+ bh->b_private = bitmap;
+ }
+ block++;
+ bh = bh->b_this_page;
+ }
+
+ page->index = index;
out:
if (IS_ERR(page))
printk(KERN_ALERT "md: bitmap read error: (%dB @ %Lu): %ld\n",
@@ -418,16 +479,14 @@ static int bitmap_read_sb(struct bitmap
char *reason = NULL;
bitmap_super_t *sb;
unsigned long chunksize, daemon_sleep, write_behind;
- unsigned long bytes_read;
unsigned long long events;
int err = -EINVAL;
/* page 0 is the superblock, read it... */
if (bitmap->file)
- bitmap->sb_page = read_page(bitmap->file, 0, &bytes_read);
+ bitmap->sb_page = read_page(bitmap->file, 0, bitmap, PAGE_SIZE);
else {
bitmap->sb_page = read_sb_page(bitmap->mddev, bitmap->offset, 0);
- bytes_read = PAGE_SIZE;
}
if (IS_ERR(bitmap->sb_page)) {
err = PTR_ERR(bitmap->sb_page);
@@ -437,13 +496,6 @@ static int bitmap_read_sb(struct bitmap
sb = (bitmap_super_t *)kmap_atomic(bitmap->sb_page, KM_USER0);
- if (bytes_read < sizeof(*sb)) { /* short read */
- printk(KERN_INFO "%s: bitmap file superblock truncated\n",
- bmname(bitmap));
- err = -ENOSPC;
- goto out;
- }
-
chunksize = le32_to_cpu(sb->chunksize);
daemon_sleep = le32_to_cpu(sb->daemon_sleep);
write_behind = le32_to_cpu(sb->write_behind);
@@ -589,37 +641,12 @@ static void bitmap_file_unmap(struct bit
while (pages--)
if (map[pages]->index != 0) /* 0 is sb_page, release it below */
- put_page(map[pages]);
+ free_buffers(map[pages]);
kfree(map);
kfree(attr);
- safe_put_page(sb_page);
-}
-
-/* dequeue the next item in a page list -- don't call from irq context */
-static struct page_list *dequeue_page(struct bitmap *bitmap)
-{
- struct page_list *item = NULL;
- struct list_head *head = &bitmap->complete_pages;
-
- spin_lock(&bitmap->write_lock);
- if (list_empty(head))
- goto out;
- item = list_entry(head->prev, struct page_list, list);
- list_del(head->prev);
-out:
- spin_unlock(&bitmap->write_lock);
- return item;
-}
-
-static void drain_write_queues(struct bitmap *bitmap)
-{
- struct page_list *item;
-
- while ((item = dequeue_page(bitmap))) {
- /* don't bother to wait */
- mempool_free(item, bitmap->write_pool);
- }
+ if (sb_page)
+ free_buffers(sb_page);
}
static void bitmap_file_put(struct bitmap *bitmap)
@@ -632,12 +659,16 @@ static void bitmap_file_put(struct bitma
bitmap->file = NULL;
spin_unlock_irqrestore(&bitmap->lock, flags);
- drain_write_queues(bitmap);
-
+ if (file)
+ wait_event(bitmap->write_wait,
+ atomic_read(&bitmap->pending_writes)==0);
bitmap_file_unmap(bitmap);
- if (file)
+ if (file) {
+ struct inode *inode = file->f_dentry->d_inode;
+ invalidate_inode_pages(inode->i_mapping);
fput(file);
+ }
}
@@ -728,8 +759,6 @@ static void bitmap_file_set_bit(struct b
}
-static void bitmap_writeback(struct bitmap *bitmap);
-
/* this gets called when the md device is ready to unplug its underlying
* (slave) device queues -- before we let any writes go down, we need to
* sync the dirty pages of the bitmap file to disk */
@@ -761,24 +790,18 @@ int bitmap_unplug(struct bitmap *bitmap)
wait = 1;
spin_unlock_irqrestore(&bitmap->lock, flags);
- if (dirty | need_write) {
+ if (dirty | need_write)
err = write_page(bitmap, page, 0);
- if (err == -EAGAIN) {
- if (dirty)
- err = write_page(bitmap, page, 1);
- else
- err = 0;
- }
- if (err)
- return 1;
- }
}
if (wait) { /* if any writes were performed, we need to wait on them */
if (bitmap->file)
- bitmap_writeback(bitmap);
+ wait_event(bitmap->write_wait,
+ atomic_read(&bitmap->pending_writes)==0);
else
md_super_wait(bitmap->mddev);
}
+ if (bitmap->flags & BITMAP_WRITE_ERROR)
+ bitmap_file_kick(bitmap);
return 0;
}
@@ -800,7 +823,7 @@ static int bitmap_init_from_disk(struct
struct page *page = NULL, *oldpage = NULL;
unsigned long num_pages, bit_cnt = 0;
struct file *file;
- unsigned long bytes, offset, dummy;
+ unsigned long bytes, offset;
int outofdate;
int ret = -ENOSPC;
void *paddr;
@@ -853,7 +876,12 @@ static int bitmap_init_from_disk(struct
index = file_page_index(i);
bit = file_page_offset(i);
if (index != oldindex) { /* this is a new page, read it in */
+ int count;
/* unmap the old page, we're done with it */
+ if (index == num_pages-1)
+ count = bytes - index * PAGE_SIZE;
+ else
+ count = PAGE_SIZE;
if (index == 0) {
/*
* if we're here then the superblock page
@@ -863,7 +891,7 @@ static int bitmap_init_from_disk(struct
page = bitmap->sb_page;
offset = sizeof(bitmap_super_t);
} else if (file) {
- page = read_page(file, index, &dummy);
+ page = read_page(file, index, bitmap, count);
offset = 0;
} else {
page = read_sb_page(bitmap->mddev, bitmap->offset, index);
@@ -999,9 +1027,6 @@ int bitmap_daemon_work(struct bitmap *bi
spin_unlock_irqrestore(&bitmap->lock, flags);
if (need_write) {
switch (write_page(bitmap, page, 0)) {
- case -EAGAIN:
- set_page_attr(bitmap, page, BITMAP_PAGE_NEEDWRITE);
- break;
case 0:
break;
default:
@@ -1017,10 +1042,6 @@ int bitmap_daemon_work(struct bitmap *bi
clear_page_attr(bitmap, lastpage, BITMAP_PAGE_NEEDWRITE);
spin_unlock_irqrestore(&bitmap->lock, flags);
err = write_page(bitmap, lastpage, 0);
- if (err == -EAGAIN) {
- err = 0;
- set_page_attr(bitmap, lastpage, BITMAP_PAGE_NEEDWRITE);
- }
} else {
set_page_attr(bitmap, lastpage, BITMAP_PAGE_NEEDWRITE);
spin_unlock_irqrestore(&bitmap->lock, flags);
@@ -1070,10 +1091,6 @@ int bitmap_daemon_work(struct bitmap *bi
clear_page_attr(bitmap, lastpage, BITMAP_PAGE_NEEDWRITE);
spin_unlock_irqrestore(&bitmap->lock, flags);
err = write_page(bitmap, lastpage, 0);
- if (err == -EAGAIN) {
- set_page_attr(bitmap, lastpage, BITMAP_PAGE_NEEDWRITE);
- err = 0;
- }
} else {
set_page_attr(bitmap, lastpage, BITMAP_PAGE_NEEDWRITE);
spin_unlock_irqrestore(&bitmap->lock, flags);
@@ -1083,32 +1100,6 @@ int bitmap_daemon_work(struct bitmap *bi
return err;
}
-static void bitmap_writeback(struct bitmap *bitmap)
-{
- struct page *page;
- struct page_list *item;
- int err = 0;
-
- PRINTK("%s: bitmap writeback daemon woke up...\n", bmname(bitmap));
- /* wait on bitmap page writebacks */
- while ((item = dequeue_page(bitmap))) {
- page = item->page;
- mempool_free(item, bitmap->write_pool);
- PRINTK("wait on page writeback: %p\n", page);
- wait_on_page_writeback(page);
- PRINTK("finished page writeback: %p\n", page);
-
- err = PageError(page);
- if (err) {
- printk(KERN_WARNING "%s: bitmap file writeback "
- "failed (page %lu): %d\n",
- bmname(bitmap), page->index, err);
- bitmap_file_kick(bitmap);
- break;
- }
- }
-}
-
static bitmap_counter_t *bitmap_get_counter(struct bitmap *bitmap,
sector_t offset, int *blocks,
int create)
@@ -1377,8 +1368,6 @@ static void bitmap_free(struct bitmap *b
/* free all allocated memory */
- mempool_destroy(bitmap->write_pool);
-
if (bp) /* deallocate the page memory */
for (k = 0; k < pages; k++)
if (bp[k].map && !bp[k].hijacked)
@@ -1428,17 +1417,13 @@ int bitmap_create(mddev_t *mddev)
spin_lock_init(&bitmap->lock);
bitmap->mddev = mddev;
- spin_lock_init(&bitmap->write_lock);
- INIT_LIST_HEAD(&bitmap->complete_pages);
- bitmap->write_pool = mempool_create_kmalloc_pool(WRITE_POOL_SIZE,
- sizeof(struct page_list));
- err = -ENOMEM;
- if (!bitmap->write_pool)
- goto error;
-
bitmap->file = file;
bitmap->offset = mddev->bitmap_offset;
if (file) get_file(file);
+
+ /* Ensure we read fresh data */
+ invalidate_inode_pages(file->f_dentry->d_inode->i_mapping);
+
/* read superblock from bitmap file (this sets bitmap->chunksize) */
err = bitmap_read_sb(bitmap);
if (err)
@@ -1461,6 +1446,9 @@ int bitmap_create(mddev_t *mddev)
bitmap->syncchunk = ~0UL;
+ atomic_set(&bitmap->pending_writes, 0);
+ init_waitqueue_head(&bitmap->write_wait);
+
#ifdef INJECT_FATAL_FAULT_1
bitmap->bp = NULL;
#else
@@ -1503,4 +1491,3 @@ EXPORT_SYMBOL(bitmap_start_sync);
EXPORT_SYMBOL(bitmap_end_sync);
EXPORT_SYMBOL(bitmap_unplug);
EXPORT_SYMBOL(bitmap_close_sync);
-EXPORT_SYMBOL(bitmap_daemon_work);
diff ./include/linux/raid/bitmap.h~current~ ./include/linux/raid/bitmap.h
--- ./include/linux/raid/bitmap.h~current~ 2006-05-12 15:55:37.000000000 +1000
+++ ./include/linux/raid/bitmap.h 2006-05-12 16:01:06.000000000 +1000
@@ -140,6 +140,7 @@ typedef __u16 bitmap_counter_t;
enum bitmap_state {
BITMAP_ACTIVE = 0x001, /* the bitmap is in use */
BITMAP_STALE = 0x002, /* the bitmap file is out of date or had -EIO */
+ BITMAP_WRITE_ERROR = 0x004, /* A write error has occurred */
BITMAP_HOSTENDIAN = 0x8000,
};
@@ -244,9 +245,9 @@ struct bitmap {
unsigned long daemon_lastrun; /* jiffies of last run */
unsigned long daemon_sleep; /* how many seconds between updates? */
- spinlock_t write_lock;
- struct list_head complete_pages;
- mempool_t *write_pool;
+ atomic_t pending_writes; /* pending writes to the bitmap file */
+ wait_queue_head_t write_wait;
+
};
/* the bitmap API */
NeilBrown <[email protected]> wrote:
>
> If md is asked to store a bitmap in a file, it tries to hold onto the
> page cache pages for that file, manipulate them directly, and call a
> cocktail of operations to write the file out. I don't believe this is
> a supportable approach.
erk. I think it's better than...
> This patch changes md to use the same approach as swap files.
> i.e. bmap is used to enumerate all the block addresses of parts of the file
> and we write directly to those blocks of the device.
That's going in at a much lower level. Even swapfiles don't assume
buffer_heads.
When playing with bmap() one needs to be careful that nobody has truncated
the file on you, else you end up writing to disk blocks which aren't part
of the file any more.
All this (and a set_fs(KERNEL_DS), ug) looks like a step backwards to me.
Operating at the pagecache a_ops level looked better, and more
filesystem-independent.
I haven't looked at this patch at all closely yet. Do I really need to?
On Friday May 12, [email protected] wrote:
> NeilBrown <[email protected]> wrote:
> >
> > If md is asked to store a bitmap in a file, it tries to hold onto the
> > page cache pages for that file, manipulate them directly, and call a
> > cocktail of operations to write the file out. I don't believe this is
> > a supportable approach.
>
> erk. I think it's better than...
>
> > This patch changes md to use the same approach as swap files.
> > i.e. bmap is used to enumerate all the block addresses of parts of the file
> > and we write directly to those blocks of the device.
>
> That's going in at a much lower level. Even swapfiles don't assume
> buffer_heads.
I'm not "assuming" buffer_heads. I'm creating buffer heads and using
them for my own purposes. These are my pages and my buffer heads.
None of them belong to the filesystem.
The buffer_heads are simply a convenient data-structure to record the
several block addresses for each page. I could have equally created
an array storing all the addresses, and built the required bios by
hand at write time. But buffer_heads did most of the work for me, so
I used them.
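For illustration, that hand-rolled alternative might have looked
something like this (hypothetical names, not code from this patch;
pa, page, bdev and blocksize would come from the surrounding code):

    /* One table of block addresses per bitmap page. */
    struct bitmap_page_addrs {
            int nr_blocks;                  /* blocks covering this page */
            sector_t addr[PAGE_SIZE >> 9];  /* worst case: 512-byte blocks */
    };

    /* At write time, build the bios by hand: */
    for (i = 0; i < pa->nr_blocks; i++) {
            struct bio *bio = bio_alloc(GFP_NOIO, 1);

            bio->bi_sector = pa->addr[i] * (blocksize >> 9);
            bio->bi_bdev = bdev;
            bio_add_page(bio, page, blocksize, i * blocksize);
            bio->bi_end_io = bitmap_end_io;  /* hypothetical completion fn */
            bio->bi_private = bitmap;
            submit_bio(WRITE, bio);
    }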
Yes, it is a lower level, but
1/ I am certain that there will be no kmalloc problems and
2/ Because it is exactly the level used by swapfile, I know that it
is sufficiently 'well defined' that no-one is going to break it.
>
> When playing with bmap() one needs to be careful that nobody has truncated
> the file on you, else you end up writing to disk blocks which aren't part
> of the file any more.
Well, we currently play games with i_writecount to ensure that no-one
else has the file open for write. And if no-one else can get write
access, then it cannot be truncated.
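A sketch of the game, using the standard helpers rather than
open-coding the i_writecount manipulation:

    err = deny_write_access(file);  /* fails with -ETXTBSY if the file
                                     * is already open for write */
    if (err)
            return err;
    /* ... the file is pinned read-only, so it cannot be truncated ... */
    allow_write_access(file);       /* when releasing the file */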
I did think about adding the S_SWAPFILE flag, but decided to leave
that for a separate patch and review different approaches to
preventing write access first (e.g. can I use a lease?).
>
> All this (and a set_fs(KERNEL_DS), ug) looks like a step backwards to me.
> Operating at the pagecache a_ops level looked better, and more
> filesystem-independent.
If you really want filesystem independence, you need to use vfs_read
and vfs_write to read/write the file. I have a patch which did that,
but decided that the possibility of kmalloc failure at awkward times
would make that not suitable.
So I now use vfs_read to read in the file (just like knfsd) and
bmap/submit_bh to write out the file (just like swapfile).
I don't think a_ops really provides an interface that I can use, partly
because, as I said in a previous email, it isn't really a public
interface to a filesystem.
>
> I haven't looked at this patch at all closely yet. Do I really need to?
I assume you are asking that because you hope I will retract the
patch. While I'm always open to being educated, I am not yet
convinced that there is any better way, or even any other usable way,
to do what needs to be done, so I am not inclined to retract the
patch.
I'd like to say that you don't need to read it because it is perfect,
but unfortunately history suggests that is unlikely to be true.
Whether you look more closely is of course up to you, but I'm convinced
that patch is in the right direction, and your review and comments are
always very valuable.
NeilBrown
Neil Brown <[email protected]> wrote:
>
> On Friday May 12, [email protected] wrote:
> > NeilBrown <[email protected]> wrote:
> > >
> > > If md is asked to store a bitmap in a file, it tries to hold onto the
> > > page cache pages for that file, manipulate them directly, and call a
> > > cocktail of operations to write the file out. I don't believe this is
> > > a supportable approach.
> >
> > erk. I think it's better than...
> >
> > > This patch changes md to use the same approach as swap files.
> > > i.e. bmap is used to enumerate all the block addresses of parts of the file
> > > and we write directly to those blocks of the device.
> >
> > That's going in at a much lower level. Even swapfiles don't assume
> > buffer_heads.
>
> I'm not "assuming" buffer_heads. I'm creating buffer heads and using
> them for my own purposes. These are my pages and my buffer heads.
> None of them belong to the filesystem.
Right, so it's incoherent with pagecache and userspace can no longer
usefully read this file.
> The buffer_heads are simply a convenient data-structure to record the
> several block addresses for each page. I could have equally created
> an array storing all the addresses, and built the required bios by
> hand at write time. But buffer_heads did most of the work for me, so
> I used them.
OK.
> Yes, it is a lower level, but
> 1/ I am certain that there will be no kmalloc problems and
> 2/ Because it is exactly the level used by swapfile, I know that it
> is sufficiently 'well defined' that no-one is going to break it.
It would be nicer of course to actually use the mm/page_io.c code. That
would involve implementing swap_bmap() and reimplementing the
get_swap_bio() stuff in terms of a_ops->bmap().
But the swap code can afford to skip blockruns which aren't page-sized and
it uses that capability nicely. You cannot do that.
> >
> > All this (and a set_fs(KERNEL_DS), ug) looks like a step backwards to me.
> > Operating at the pagecache a_ops level looked better, and more
> > filesystem-independent.
>
> If you really want filesystem independence, you need to use vfs_read
> and vfs_write to read/write the file.
yup.
> I have a patch which did that,
> but decided that the possibility of kmalloc failure at awkward times
> would make that not suitable.
submit_bh() can and will allocate memory, although most decent device
drivers should be OK.
There are tricks we can do with writepage. If the backing filesystem uses
buffer_heads and if you hold a ref on the page then we know that there
won't be any buffer_head allocations nor any disk reads in the writepage
path. It'll go direct into bio_alloc+submit_bio, just as you're doing now.
IOW: no gain.
> So I now use vfs_read to read in the file (just like knfsd) and
> bmap/submit_bh to write out the file (just like swapfile).
>
> I don't think a_ops really provides an interface that I can use, partly
> because, as I said in a previous email, it isn't really a public
> interface to a filesystem.
It's publicer than bmap+submit_bh!
> >
> > I haven't looked at this patch at all closely yet. Do I really need to?
>
> I assume you are asking that because you hope I will retract the
> patch.
Was kinda hoping that would be the outcome. It's rather gruesome, using
set_fs()+vfs_read() on one side and submit_bh() on the other.
Are you sure the handling at EOF for a non-multiple-of-PAGE_SIZE file
is OK?
The loss of pagecache coherency seems sad. I assume there's never a
requirement for userspace to read this file.
invalidate_inode_pages() is best-effort. If someone else has the page
locked or if the page is mmapped then the attempt to take down the
pagecache will fail. That's relatively-OK, because it'll just lead to
userspace seeing wrong stuff, and we've already conceded that.
But if the pagecache is dirty then invalidate_inode_pages() will leave it
dirty and the VM will later write it back, corrupting your bitmap file.
You should get i_writecount, fsync the file and then run
invalidate_inode_pages().
Or not run invalidate_inode_pages() - it doesn't gain anything and will
just reduce the observability of bugs. Better off leaving the pagecache
there all the time so that any rarely-occurring bugs become all-the-time
bugs.
You might as well use kernel_read() too, if you insist on being oddball ;)
Andrew Morton wrote:
> The loss of pagecache coherency seems sad. I assume there's never a
> requirement for userspace to read this file.
Actually, there is. mdadm reads the bitmap file, so that would be
broken. Also, it's just useful for a user to be able to read the bitmap
(od -x, or similar) to figure out approximately how much more he's got
to resync to get an array in-sync. Other than reading the bitmap file, I
don't know of any way to determine that.
--
Paul
Paul Clements <[email protected]> wrote:
>
> Andrew Morton wrote:
>
> > The loss of pagecache coherency seems sad. I assume there's never a
> > requirement for userspace to read this file.
>
> Actually, there is. mdadm reads the bitmap file, so that would be
> broken. Also, it's just useful for a user to be able to read the bitmap
> (od -x, or similar) to figure out approximately how much more he's got
> to resync to get an array in-sync. Other than reading the bitmap file, I
> don't know of any way to determine that.
Read it with O_DIRECT :(
On Saturday May 13, [email protected] wrote:
> Paul Clements <[email protected]> wrote:
> >
> > Andrew Morton wrote:
> >
> > > The loss of pagecache coherency seems sad. I assume there's never a
> > > requirement for userspace to read this file.
> >
> > Actually, there is. mdadm reads the bitmap file, so that would be
> > broken. Also, it's just useful for a user to be able to read the bitmap
> > (od -x, or similar) to figure out approximately how much more he's got
> > to resync to get an array in-sync. Other than reading the bitmap file, I
> > don't know of any way to determine that.
>
> Read it with O_DIRECT :(
Which is exactly what the next release of mdadm does.
As the patch comment said:
: With this approach the pagecache may contain data which is inconsistent with
: what is on disk. To alleviate the problems this can cause, md invalidates
: the pagecache when releasing the file. If the file is to be examined
: while the array is active (a non-critical but occasionally useful function),
: O_DIRECT io must be used. And a new version of mdadm will have support for this.
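Roughly, the mdadm side looks like this (a userspace sketch, not
mdadm's actual code):

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Read the first 4K of the bitmap file straight from disk,
     * bypassing any stale pagecache. */
    static void *read_bitmap_direct(const char *path)
    {
            void *buf;
            int fd = open(path, O_RDONLY | O_DIRECT);

            if (fd < 0)
                    return NULL;
            if (posix_memalign(&buf, 4096, 4096)) { /* O_DIRECT needs an
                                                     * aligned buffer */
                    close(fd);
                    return NULL;
            }
            if (pread(fd, buf, 4096, 0) != 4096) {
                    free(buf);
                    close(fd);
                    return NULL;
            }
            close(fd);
            return buf;
    }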
Neil Brown <[email protected]> wrote:
>
> On Saturday May 13, [email protected] wrote:
> > Paul Clements <[email protected]> wrote:
> > >
> > > Andrew Morton wrote:
> > >
> > > > The loss of pagecache coherency seems sad. I assume there's never a
> > > > requirement for userspace to read this file.
> > >
> > > Actually, there is. mdadm reads the bitmap file, so that would be
> > > broken. Also, it's just useful for a user to be able to read the bitmap
> > > (od -x, or similar) to figure out approximately how much more he's got
> > > to resync to get an array in-sync. Other than reading the bitmap file, I
> > > don't know of any way to determine that.
> >
> > Read it with O_DIRECT :(
>
> Which is exactly what the next release of mdadm does.
> As the patch comment said:
>
> : With this approach the pagecache may contain data which is inconsistent with
> : what is on disk. To alleviate the problems this can cause, md invalidates
> : the pagecache when releasing the file. If the file is to be examined
> : while the array is active (a non-critical but occasionally useful function),
> : O_DIRECT io must be used. And a new version of mdadm will have support for this.
Which doesn't help `od -x' and is going to cause older mdadm userspace to
mysteriously and subtly fail. Or does the user<->kernel interface have
versioning which will prevent this?
(replying to bits of several emails)
On Friday May 12, [email protected] wrote:
> Neil Brown <[email protected]> wrote:
> > However some IO requests cannot complete until the filesystem I/O
> > completes, so we need to be sure that the filesystem I/O won't block
> > waiting for memory, or fail with -ENOMEM.
>
> That sounds like a complex deadlock. Suppose the bitmap writeout requires
> some writeback to happen before it can get enough memory to proceed.
>
Exactly. Bitmap writeout must not block on fs-writeback. It can block
on device writeout (e.g. queue congestion or mempool exhaustion) but
it must complete without waiting in the fs layer or above, and without
the possibility of any error other than -EIO. Otherwise we can get
deadlocked writing to the raid array. submit_bh (or submit_bio) is
certain to be safe in this respect. I'm not so confident about
anything at the fs level.
> > > Read it with O_DIRECT :(
> >
> > Which is exactly what the next release of mdadm does.
> > As the patch comment said:
> >
> > : With this approach the pagecache may contain data which is inconsistent with
> > : what is on disk. To alleviate the problems this can cause, md invalidates
> > : the pagecache when releasing the file. If the file is to be examined
> > : while the array is active (a non-critical but occasionally useful function),
> > : O_DIRECT io must be used. And a new version of mdadm will have support for this.
>
> Which doesn't help `od -x' and is going to cause older mdadm userspace to
> mysteriously and subtly fail. Or does the user<->kernel interface have
> versioning which will prevent this?
>
As I said: 'non-critical'. Nothing important breaks if reading the
file gets old data. Reading the file while the array is active is
purely a curiosity thing. There is information in /proc/mdstat which
gives a fairly coarse view of the same data. It could lead to some
confusion, but if a compliant mdadm comes out before this gets into a
mainline kernel, I doubt there will be any significant issue.
Read/writing the bitmap needs to work reliably when the array is not
active, but suitable sync/invalidate calls in the kernel should make
that work perfectly.
I know this is technically a regression in user-space interface, and
you don't like such regression with good reason.... Maybe I could call
invalidate_inode_pages every few seconds or whenever the atime
changes, just to be on the safe side :-)
> > I have a patch which did that,
> > but decided that the possibility of kmalloc failure at awkward times
> > would make that not suitable.
>
> submit_bh() can and will allocate memory, although most decent device
> drivers should be OK.
>
submit_bh (like all decent device drivers) uses a mempool for memory
allocation so we can be sure that the delay in getting memory is
bounded by the delay for a few IO requests to complete, and we can be
sure the allocation won't fail. This is perfectly fine.
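The guarantee, sketched with illustrative numbers (not the md code):

    /* A pool with a reserve of 16 elements.  With a blocking mask like
     * GFP_NOIO, mempool_alloc() never returns NULL: if the reserve is
     * exhausted it waits for a mempool_free() instead of failing. */
    mempool_t *pool = mempool_create_kmalloc_pool(16, 256);
    void *elem = mempool_alloc(pool, GFP_NOIO);  /* bounded wait, no NULL */
    /* ... use elem to service the IO; when it completes: */
    mempool_free(elem, pool);                    /* may wake a waiter */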
> >
> > I don't think a_ops really provides an interface that I can use, partly
> > because, as I said in a previous email, it isn't really a public
> > interface to a filesystem.
>
> It's publicer than bmap+submit_bh!
>
I don't know how you can say that.
bmap is so public that it is exported to userspace through an IOCTL
and is used by lilo (admittedly only for reading, not writing). More
significantly it is used by swapfile which is a completely independent
subsystem from the filesystem. Contrast this with a_ops. The primary
users of a_ops are routines like generic_file_{read,write} and
friends. These are tools - library routines - that are used by
filesystems to implement their 'file_operations' which are the real
public interface. As far as these uses go, it is not a public
interface. Where a filesystem doesn't use some library routines, it
does not need to implement the matching functionality in the a_op
interface.
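For concreteness, the interface being argued about, abridged from the
2.6 include/linux/fs.h:

    struct address_space_operations {
            int (*writepage)(struct page *page, struct writeback_control *wbc);
            int (*readpage)(struct file *, struct page *);
            int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
            int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
            sector_t (*bmap)(struct address_space *, sector_t);
            /* ... sync_page, writepages, releasepage, direct_IO, ... */
    };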
The other main user is the 'VM' which might try to flush out or
invalidate pages. However the VM isn't using this interface to
interact with files, but only to interact with pages, and it doesn't
much care what is done with the pages providing they get clean, or get
released, or whatever.
The way I perceive Linux design/development, active usage is far more
significant than documented design. If some feature of an interface
isn't being actively used - by in-kernel code - then you cannot be
sure that feature will be uniformly implemented, or that it won't
change subtly next week.
So when I went looking for the best way to get md/bitmap to write to a
file, I didn't just look at the interface specs (which are pretty
poorly documented anyway), I looked at existing code.
I can find 3 different parts of the kernel that write to a file.
They are
swap-file
loop
nfsd
nfsd uses vfs_read/vfs_write which have too many failure/delay modes
for me to safely use.
loop uses prepare_write/commit_write (if available) or f_op->write
(not vfs_write - I wonder why) which is not much better than what
nfsd uses. And as far as I can tell loop never actually flushes data to
storage (hope no-one thinks a journalling filesystem on loop is a
good idea) so it isn't a good model to follow.
[[If loop was a good model to follow, I would have rejected the
code for writing to a file in the first place, only accepted code
for writing to a device, and insisted on using loop to close the gap]]
That leaves swap-file as the only in-kernel user of a filesystem that
accesses the file in a way similar to what I need. Is it any surprise
that I chose to copy it?
> > >
> > > I haven't looked at this patch at all closely yet. Do I really need to?
> >
> > I assume you are asking that because you hope I will retract the
> > patch.
>
> Was kinda hoping that would be the outcome. It's rather gruesome, using
> set_fs()+vfs_read() on one side and submit_bh() on the other.
>
More gruesome than a_op->readpage on one side and bmap/submit_bio
on the other side like swapfile does? However if it makes you happy I
can change the read side to anything that you suggest - submit_bh
would be the 'obvious' choice I guess.
> Are you sure the handling at EOF for a non-multiple-of-PAGE_SIZE file
> is OK?
I think so yes, but I will review it.
I assume that if bmap gave me a non-zero block number for the last
(partial) block, then it is safe to write that whole block. I guess
that is a subtlety in semantics that no other user would be using.
However if a filesystem were packing the tail, then it would not be
able to return an integral block number for some fragments, so
hopefully it wouldn't for any.
>
> But if the pagecache is dirty then invalidate_inode_pages() will leave it
> dirty and the VM will later write it back, corrupting your bitmap file.
> You should get i_writecount, fsync the file and then run
> invalidate_inode_pages().
Well, as I am currently reading the file through the pagecache.... but
yes, I should put an fsync in there (and the invalidate before read is
fairly pointless. It is the invalidate after close that is important).
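Something like this, I assume (2.6-era helpers):

    /* Flush any dirty pagecache for the bitmap file, wait for that
     * writeback to finish, then drop the now-clean pages. */
    filemap_fdatawrite(inode->i_mapping);
    filemap_fdatawait(inode->i_mapping);
    invalidate_inode_pages(inode->i_mapping);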
>
> You might as well use kernel_read() too, if you insist on being oddball ;)
I didn't know about that one..
If you would feel more comfortable with kernel_read I would be more
than happy to use it.
In fact, if you can assure me that if I have an inode, and I have
confirmed that
bdi_cap_writeback_dirty(inode->i_mapping->backing_dev_info)
&& inode->i_mapping->a_ops->readpage != NULL
and have disabled truncation, either via i_writecount or otherwise,
then
- read_cache_page( .... ..a_ops->readpage ....)
will return a page that will, as long as I hold a counted reference,
remain in the page cache for that file at the appropriate address
(even though I don't hold it locked)
- set_page_dirty and write_one_page
will not fail, except with -EIO, and will not block waiting for
any writeback except for writeback of that file (e.g. won't do
any unprotected kmallocs)
then I'll be happy to go back to using that interface, though I guess I'd
have to figure out what to do about redirty_page_for_writepage calls...
But if you can assure me of the above, then I'd be curious as to why
swapfile doesn't use the readpage/writepage interface....
NeilBrown
Neil Brown <[email protected]> wrote:
>
> ...
>
> > > I have a patch which did that,
> > > but decided that the possibility of kmalloc failure at awkward times
> > > would make that not suitable.
> >
> > submit_bh() can and will allocate memory, although most decent device
> > drivers should be OK.
> >
>
> submit_bh (like all decent device drivers) uses a mempool for memory
> allocation so we can be sure that the delay in getting memory is
> bounded by the delay for a few IO requests to complete, and we can be
> sure the allocation won't fail. This is perfectly fine.
My point is that some device drivers will apparently call into the memory
allocator (should be GFP_NOIO) on the request submission path. I don't
recall which drivers - but not mainstream ones.
> > >
> > > I don't think a_ops really provides an interface that I can use, partly
> > > because, as I said in a previous email, it isn't really a public
> > > interface to a filesystem.
> >
> > It's publicer than bmap+submit_bh!
> >
>
> I don't know how you can say that.
bmap() is a userspace API, not really a kernel one. And it assumes a
block-backed filesystem.
> bmap is so public that it is exported to userspace through an IOCTL
> and is used by lilo (admittedly only for reading, not writing). More
> significantly it is used by swapfile which is a completely independent
> subsystem from the filesystem. Contrast this with a_ops. The primary
> users of a_ops are routines like generic_file_{read,write} and
> friends. These are tools - library routines - that are used by
> filesystems to implement their 'file_operations' which are the real
> public interface. As far as these uses go, it is not a public
> interface. Where a filesystem doesn't use some library routines, it
> does not need to implement the matching functionality in the a_op
> interface.
Well yeah. If one wants to do filesystem I/O in-kernel then one should use
the filesystem I/O entry points: read() and write() (and get flamed ;)).
If one wants to cut in at a lower level than that then we get into a pickle
because different types of filesytems do quite different things to
implement read() and write().
> The other main user is the 'VM' which might try to flush out or
> invalidate pages. However the VM isn't using this interface to
> interact with files, but only to interact with pages, and it doesn't
> much care what is done with the pages providing they get clean, or get
> released, or whatever.
>
> The way I perceive Linux design/development, active usage is far more
> significant than documented design. If some feature of an interface
> isn't being actively used - by in-kernel code - then you cannot be
> sure that feature will be uniformly implemented, or that it won't
> change subtly next week.
>
> So when I went looking for the best way to get md/bitmap to write to a
> file, I didn't just look at the interface specs (which are pretty
> poorly documented anyway), I looked at existing code.
> I can find 3 different parts of the kernel that write to a file.
> They are
> swap-file
> loop
> nfsd
>
> nfsd uses vfs_read/vfs_write which have too many failure/delay modes
> for me to safely use.
> loop uses prepare_write/commit_write (if available) or f_op->write
> (not vfs_write - I wonder why) which is not much better than what
> nfsd uses. And as far as I can tell loop never actually flushes data to
> storage (hope no-one thinks a journalling filesystem on loop is a
> good idea) so it isn't a good model to follow.
> [[If loop was a good model to follow, I would have rejected the
> code for writing to a file in the first place, only accepted code
> for writing to a device, and insisted on using loop to close the gap]]
>
> That leaves swap-file as the only in-kernel user of a filesystem that
> accesses the file in a way similar to what I need. Is it any surprise
> that I chose to copy it?
swapfile doesn't use vfs_read() for swapin...
>
> >
> > But if the pagecache is dirty then invalidate_inode_pages() will leave it
> > dirty and the VM will later write it back, corrupting your bitmap file.
> > You should get i_writecount, fsync the file and then run
> > invalidate_inode_pages().
>
> Well, as I am currently reading the file through the pagecache.... but
> yes, I should put an fsync in there (and the invalidate before read is
> fairly pointless. It is the invalidate after close that is important).
umm, the code in its present form _will_ corrupt MD bitmap files.
invalidate_inode_pages() will not invalidate dirty pagecache, and that
dirty pagecache will later get written back, undoing any of your
intervening direct-io writes. Most of the time, MD will do another
direct-io write _after_ the VM has done that writeback and things will get
fixed up. But there is a window. So that fsync() is required.
> >
> > You might as well use kernel_read() too, if you insist on being oddball ;)
>
> I didn't know about that one..
>
> If you would feel more comfortable with kernel_read I would be more
> than happy to use it.
We might as well save some code.
> In fact, if you can assure me that if I have an inode, and I have
> confirmed that
> bdi_cap_writeback_dirty(inode->i_mapping->backing_dev_info)
> && inode->i_mapping->a_ops->readpage != NULL
>
> and have disabled truncation, either via i_writecount or otherwise,
> then
> - read_cache_page( .... ..a_ops->readpage ....)
> will return a page that will, as long as I hold a counted reference,
> remain in the page cache for that file at the appropriate address
> (even though I don't hold it locked)
yup, as long as the page is pinned with get_page().
> - set_page_dirty and write_one_page
> will not fail, except with -EIO, and will not block waiting for
> any writeback except for writeback of that file (e.g. won't do
> any unprotected kmallocs)
Mostly. There is a risk that ->writepage() will need to read filesystem
metadata to locate the page's blocks. Most of the time that won't happen
because the page still has buffers attached. You'd have to keep the pages
locked to prevent that. But writepage() will unlock the page so there's
still a teeny window where the page could get its buffers taken away. Plus
we don't know that the filesystem caches the disk mapping in the page's
buffers anyway (ext2 -o nobh, for example).
> then I'll be happy to go back to using that interface, though I guess I'd
> have to figure out what to do about redirty_page_for_writepage calls...
>
> But if you can assure me of the above, then I'd be curious as to why
> swapfile doesn't use the readpage/writepage interface....
If we did that we'd need to memcpy the data between the swapfile pagecache
and swapcache whenever doing swapin and swapout. Plus swap goes
direct-to-BIO to minimise the risk of doing memory allocations on the
swapout path. And the swap code treats swapdevs and swapfiles identically.
And the swap code cannot pin all of the swapfile's pagecache in memory
(would make it rather pointless), whereas the MD bitmap code can afford to
do this.
Ho hum, I give up. I don't think, in practice, this code fixes any
demonstrable bug though.
On Monday May 15, [email protected] wrote:
>
> Ho hum, I give up.
Thank you :-) I found our debate very valuable - it helped me clarify
my understanding of some areas of linux filesystem semantics (and as I
am trying to write a filesystem in my 'spare time', that will turn out
to be very useful). It also revealed some problems in the code!
> I don't think, in practice, this code fixes any
> demonstrable bug though.
I thought it was our job to kill the bugs *before* they were
demonstrated :-)
I'm still convinced that the previous code could lead to deadlocks or
worse under sufficiently sustained high memory pressure and fs
activity.
I'll send a patch shortly that fixes the known problems and
awkwardnesses in the new code.
Thanks again,
NeilBrown