This patch series, based on 3.18-rc5, implements support for swap files on
BTRFS.
The standard swap file implementation uses the filesystem's implementation of
bmap() to get a list of physical blocks on disk, which the swap file code then
does I/O on directly without going through the filesystem. This doesn't work
for BTRFS, which is copy-on-write and therefore moves disk blocks around (COW
isn't the only thing that can shuffle around disk blocks: consider
defragmentation, balancing, etc.).
Swap-over-NFS introduced an interface through which a filesystem can arbitrate
swap I/O through address space operations:
- swap_activate() is called by swapon() and informs the address space that the
given file is going to be used for swap, so it should take adequate measures
like reserving space on disk and pinning block lookup information in memory
- swap_deactivate() is used to clean up on swapoff()
- readpage() is used to page in (read a page from disk)
- direct_IO() is used to page out (write a page out to disk)
Version 2 modifies this interface in the first part of the patch series to use
direct_IO for both reads and writes, which makes things much cleaner.
The second part of the patch series implements support for the interface on
BTRFS, which just means implementing swap_{,de}activate and adding some chattr
checks, which raises the following considerations:
- We can't do direct I/O on compressed or inline extents, so we can't use files
with either for swap.
- Supporting COW swapfiles might also come with some weird edge cases?
This functionality is tenuously tested in a virtual machine with some
artificial workloads. Comment away.
Omar Sandoval (5):
direct-io: don't dirty ITER_BVEC pages on read
nfs: don't dirty ITER_BVEC pages read through direct I/O
swap: use direct I/O for SWP_FILE swap_readpage
btrfs: don't allow -C or +c chattrs on a swap file
btrfs: enable swap file support
v2: use direct_IO for swap_readpage
fs/btrfs/inode.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/ioctl.c | 50 ++++++++++++++++++++++++---------------
fs/direct-io.c | 8 ++++---
fs/nfs/direct.c | 5 +++-
mm/page_io.c | 32 +++++++++++++++++++++----
5 files changed, 139 insertions(+), 27 deletions(-)
--
2.1.3
As with the generic blockdev code, kernel pages shouldn't be dirtied by the
direct I/O path.
Signed-off-by: Omar Sandoval <[email protected]>
---
fs/nfs/direct.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 10bf072..a67fa2c 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -88,6 +88,7 @@ struct nfs_direct_req {
struct pnfs_ds_commit_info ds_cinfo; /* Storage for cinfo */
struct work_struct work;
int flags;
+ int should_dirty; /* should we mark read pages dirty? */
#define NFS_ODIRECT_DO_COMMIT (1) /* an unstable reply was received */
#define NFS_ODIRECT_RESCHED_WRITES (2) /* write verification failed */
struct nfs_writeverf verf; /* unstable write verifier */
@@ -370,7 +371,8 @@ static void nfs_direct_read_completion(struct nfs_pgio_header *hdr)
struct nfs_page *req = nfs_list_entry(hdr->pages.next);
struct page *page = req->wb_page;
- if (!PageCompound(page) && bytes < hdr->good_bytes)
+ if (!PageCompound(page) && bytes < hdr->good_bytes &&
+ dreq->should_dirty)
set_page_dirty(page);
bytes += req->wb_bytes;
nfs_list_remove_request(req);
@@ -542,6 +544,7 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
dreq->inode = inode;
dreq->bytes_left = count;
dreq->ctx = get_nfs_open_context(nfs_file_open_context(iocb->ki_filp));
+ dreq->should_dirty = !(iter->type & ITER_BVEC);
l_ctx = nfs_get_lock_context(dreq->ctx);
if (IS_ERR(l_ctx)) {
result = PTR_ERR(l_ctx);
--
2.1.3
swap_activate will check for a compressed or copy-on-write file; we shouldn't
allow it to become either once it has already been activated.
Signed-off-by: Omar Sandoval <[email protected]>
---
fs/btrfs/ioctl.c | 50 +++++++++++++++++++++++++++++++-------------------
1 file changed, 31 insertions(+), 19 deletions(-)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4399f0c..f022dce 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -293,14 +293,21 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
}
} else {
/*
- * Revert back under same assuptions as above
+ * swap_activate checks that we don't swapon a copy-on-write
+ * file, but we must also make sure that it doesn't become
+ * copy-on-write.
*/
- if (S_ISREG(mode)) {
- if (inode->i_size == 0)
- ip->flags &= ~(BTRFS_INODE_NODATACOW
- | BTRFS_INODE_NODATASUM);
- } else {
- ip->flags &= ~BTRFS_INODE_NODATACOW;
+ if (!IS_SWAPFILE(inode)) {
+ /*
+ * Revert back under same assumptions as above
+ */
+ if (S_ISREG(mode)) {
+ if (inode->i_size == 0)
+ ip->flags &= ~(BTRFS_INODE_NODATACOW |
+ BTRFS_INODE_NODATASUM);
+ } else {
+ ip->flags &= ~BTRFS_INODE_NODATACOW;
+ }
}
}
@@ -317,20 +324,25 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
if (ret && ret != -ENODATA)
goto out_drop;
} else if (flags & FS_COMPR_FL) {
- const char *comp;
-
- ip->flags |= BTRFS_INODE_COMPRESS;
- ip->flags &= ~BTRFS_INODE_NOCOMPRESS;
+ /*
+ * Like nodatacow, swap_activate checks that we don't swapon a
+ * compressed file, so we shouldn't let it become compressed.
+ */
+ if (!IS_SWAPFILE(inode)) {
+ const char *comp;
- if (root->fs_info->compress_type == BTRFS_COMPRESS_LZO)
- comp = "lzo";
- else
- comp = "zlib";
- ret = btrfs_set_prop(inode, "btrfs.compression",
- comp, strlen(comp), 0);
- if (ret)
- goto out_drop;
+ ip->flags |= BTRFS_INODE_COMPRESS;
+ ip->flags &= ~BTRFS_INODE_NOCOMPRESS;
+ if (root->fs_info->compress_type == BTRFS_COMPRESS_LZO)
+ comp = "lzo";
+ else
+ comp = "zlib";
+ ret = btrfs_set_prop(inode, "btrfs.compression",
+ comp, strlen(comp), 0);
+ if (ret)
+ goto out_drop;
+ }
} else {
ret = btrfs_set_prop(inode, "btrfs.compression", NULL, 0, 0);
if (ret && ret != -ENODATA)
--
2.1.3
Implement the swap file a_ops on btrfs. Activation simply checks for a usable
swap file: it must be fully allocated (no holes), support direct I/O (so no
compressed or inline extents) and should be nocow (I'm not sure about that last
one).
Signed-off-by: Omar Sandoval <[email protected]>
---
fs/btrfs/inode.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 71 insertions(+)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d23362f..b8fd36b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9442,6 +9442,75 @@ out_inode:
}
+static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
+ sector_t *span)
+{
+ struct inode *inode = file_inode(file);
+ struct btrfs_inode *ip = BTRFS_I(inode);
+ int ret = 0;
+ u64 isize = inode->i_size;
+ struct extent_state *cached_state = NULL;
+ struct extent_map *em;
+ u64 start, len;
+
+ if (ip->flags & BTRFS_INODE_COMPRESS) {
+ /* Can't do direct I/O on a compressed file. */
+ pr_err("BTRFS: swapfile is compressed");
+ return -EINVAL;
+ }
+ if (!(ip->flags & BTRFS_INODE_NODATACOW)) {
+ /* The swap file can't be copy-on-write. */
+ pr_err("BTRFS: swapfile is copy-on-write");
+ return -EINVAL;
+ }
+
+ lock_extent_bits(&ip->io_tree, 0, isize - 1, 0, &cached_state);
+
+ /*
+ * All of the extents must be allocated and support direct I/O. Inline
+ * extents and compressed extents fall back to buffered I/O, so those
+ * are no good.
+ */
+ start = 0;
+ while (start < isize) {
+ len = isize - start;
+ em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
+ if (IS_ERR(em)) {
+ ret = PTR_ERR(em);
+ goto out;
+ }
+
+ if (test_bit(EXTENT_FLAG_VACANCY, &em->flags) ||
+ em->block_start == EXTENT_MAP_HOLE) {
+ pr_err("BTRFS: swapfile has holes");
+ ret = -EINVAL;
+ goto out;
+ }
+ if (em->block_start == EXTENT_MAP_INLINE) {
+ pr_err("BTRFS: swapfile is inline");
+ ret = -EINVAL;
+ goto out;
+ }
+ if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
+ pr_err("BTRFS: swapfile is compresed");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ start = extent_map_end(em);
+ free_extent_map(em);
+ }
+
+out:
+ unlock_extent_cached(&ip->io_tree, 0, isize - 1, &cached_state,
+ GFP_NOFS);
+ return ret;
+}
+
+static void btrfs_swap_deactivate(struct file *file)
+{
+}
+
static const struct inode_operations btrfs_dir_inode_operations = {
.getattr = btrfs_getattr,
.lookup = btrfs_lookup,
@@ -9519,6 +9588,8 @@ static const struct address_space_operations btrfs_aops = {
.releasepage = btrfs_releasepage,
.set_page_dirty = btrfs_set_page_dirty,
.error_remove_page = generic_error_remove_page,
+ .swap_activate = btrfs_swap_activate,
+ .swap_deactivate = btrfs_swap_deactivate,
};
static const struct address_space_operations btrfs_symlink_aops = {
--
2.1.3
Sorry for the noise, looks like Christoph got back to me on the previous RFC
just before I sent this out -- disregard this for now.
--
Omar
Reads through the iov_iter infrastructure for kernel pages shouldn't be dirtied
by the direct I/O code.
This is based on Dave Kleikamp's and Ming Lei's previously posted patches.
Cc: Dave Kleikamp <[email protected]>
Cc: Ming Lei <[email protected]>
Signed-off-by: Omar Sandoval <[email protected]>
---
fs/direct-io.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/fs/direct-io.c b/fs/direct-io.c
index e181b6b..e542ce4 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -120,6 +120,7 @@ struct dio {
spinlock_t bio_lock; /* protects BIO fields below */
int page_errors; /* errno from get_user_pages() */
int is_async; /* is IO async ? */
+ int should_dirty; /* should we mark read pages dirty? */
bool defer_completion; /* defer AIO completion to workqueue? */
int io_error; /* IO error in completion path */
unsigned long refcount; /* direct_io_worker() and bios */
@@ -392,7 +393,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
dio->refcount++;
spin_unlock_irqrestore(&dio->bio_lock, flags);
- if (dio->is_async && dio->rw == READ)
+ if (dio->is_async && dio->rw == READ && dio->should_dirty)
bio_set_pages_dirty(bio);
if (sdio->submit_io)
@@ -463,13 +464,13 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
if (!uptodate)
dio->io_error = -EIO;
- if (dio->is_async && dio->rw == READ) {
+ if (dio->is_async && dio->rw == READ && dio->should_dirty) {
bio_check_pages_dirty(bio); /* transfers ownership */
} else {
bio_for_each_segment_all(bvec, bio, i) {
struct page *page = bvec->bv_page;
- if (dio->rw == READ && !PageCompound(page))
+ if (dio->rw == READ && !PageCompound(page) && dio->should_dirty)
set_page_dirty_lock(page);
page_cache_release(page);
}
@@ -1177,6 +1178,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
dio->inode = inode;
dio->rw = rw;
+ dio->should_dirty = !(iter->type & ITER_BVEC);
/*
* For AIO O_(D)SYNC writes we need to defer completions to a workqueue
--
2.1.3
On Mon, Nov 17, 2014 at 07:48:17AM -0800, Christoph Hellwig wrote:
> With the new iov_iter infrastructure that supprots direct I/O to kernel
> pages please get rid of the ->readpage hack first. I'm still utterly
> disapoined that this crap ever got merged.
Cc: Mel Gorman <[email protected]>
Signed-off-by: Omar Sandoval <[email protected]>
---
mm/page_io.c | 32 ++++++++++++++++++++++++++++----
1 file changed, 28 insertions(+), 4 deletions(-)
diff --git a/mm/page_io.c b/mm/page_io.c
index 955db8b..10715e0 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -283,8 +283,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
set_page_writeback(page);
unlock_page(page);
- ret = mapping->a_ops->direct_IO(ITER_BVEC | WRITE,
- &kiocb, &from,
+ ret = mapping->a_ops->direct_IO(WRITE, &kiocb, &from,
kiocb.ki_pos);
if (ret == PAGE_SIZE) {
count_vm_event(PSWPOUT);
@@ -348,12 +347,37 @@ int swap_readpage(struct page *page)
}
if (sis->flags & SWP_FILE) {
+ struct kiocb kiocb;
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
+ struct bio_vec bv = {
+ .bv_page = page,
+ .bv_len = PAGE_SIZE,
+ .bv_offset = 0,
+ };
+ struct iov_iter to = {
+ .type = ITER_BVEC | READ,
+ .count = PAGE_SIZE,
+ .iov_offset = 0,
+ .nr_segs = 1,
+ };
+ to.bvec = &bv; /* older gcc versions are broken */
+
+ init_sync_kiocb(&kiocb, swap_file);
+ kiocb.ki_pos = page_file_offset(page);
+ kiocb.ki_nbytes = PAGE_SIZE;
- ret = mapping->a_ops->readpage(swap_file, page);
- if (!ret)
+ ret = mapping->a_ops->direct_IO(READ, &kiocb, &to,
+ kiocb.ki_pos);
+ if (ret == PAGE_SIZE) {
+ SetPageUptodate(page);
count_vm_event(PSWPIN);
+ ret = 0;
+ } else {
+ ClearPageUptodate(page);
+ SetPageError(page);
+ }
+ unlock_page(page);
return ret;
}
--
2.1.3
On Fri, Nov 21, 2014 at 02:15:31AM -0800, Omar Sandoval wrote:
> Sorry for the noise, looks like Christoph got back to me on the previous RFC
> just before I sent this out -- disregard this for now.
If the NFS people are fine with this version I'd certainly welcome it as
a first step. Additional improvements are of course always welcome.
On 11/21/2014 04:08 AM, Omar Sandoval wrote:
> Reads through the iov_iter infrastructure for kernel pages shouldn't be dirtied
> by the direct I/O code.
>
> This is based on Dave Kleikamp's and Ming Lei's previously posted patches.
Acked-by: Dave Kleikamp <[email protected]>
> Cc: Ming Lei <[email protected]>
> Signed-off-by: Omar Sandoval <[email protected]>
> ---
> fs/direct-io.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index e181b6b..e542ce4 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -120,6 +120,7 @@ struct dio {
> spinlock_t bio_lock; /* protects BIO fields below */
> int page_errors; /* errno from get_user_pages() */
> int is_async; /* is IO async ? */
> + int should_dirty; /* should we mark read pages dirty? */
> bool defer_completion; /* defer AIO completion to workqueue? */
> int io_error; /* IO error in completion path */
> unsigned long refcount; /* direct_io_worker() and bios */
> @@ -392,7 +393,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
> dio->refcount++;
> spin_unlock_irqrestore(&dio->bio_lock, flags);
>
> - if (dio->is_async && dio->rw == READ)
> + if (dio->is_async && dio->rw == READ && dio->should_dirty)
> bio_set_pages_dirty(bio);
>
> if (sdio->submit_io)
> @@ -463,13 +464,13 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
> if (!uptodate)
> dio->io_error = -EIO;
>
> - if (dio->is_async && dio->rw == READ) {
> + if (dio->is_async && dio->rw == READ && dio->should_dirty) {
> bio_check_pages_dirty(bio); /* transfers ownership */
> } else {
> bio_for_each_segment_all(bvec, bio, i) {
> struct page *page = bvec->bv_page;
>
> - if (dio->rw == READ && !PageCompound(page))
> + if (dio->rw == READ && !PageCompound(page) && dio->should_dirty)
> set_page_dirty_lock(page);
> page_cache_release(page);
> }
> @@ -1177,6 +1178,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>
> dio->inode = inode;
> dio->rw = rw;
> + dio->should_dirty = !(iter->type & ITER_BVEC);
>
> /*
> * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
>
On Fri, Nov 21, 2014 at 02:08:30AM -0800, Omar Sandoval wrote:
> @@ -293,14 +293,21 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
> }
> } else {
You can put the condition here, instead of shifting the nested block.
} else if (!IS_SWAPFILE(inode)) {
> /*
> - * Revert back under same assuptions as above
> + * swap_activate checks that we don't swapon a copy-on-write
> + * file, but we must also make sure that it doesn't become
> + * copy-on-write.
> */
> - if (S_ISREG(mode)) {
> - if (inode->i_size == 0)
> - ip->flags &= ~(BTRFS_INODE_NODATACOW
> - | BTRFS_INODE_NODATASUM);
> - } else {
> - ip->flags &= ~BTRFS_INODE_NODATACOW;
> + if (!IS_SWAPFILE(inode)) {
> + /*
> + * Revert back under same assumptions as above
> + */
> + if (S_ISREG(mode)) {
> + if (inode->i_size == 0)
> + ip->flags &= ~(BTRFS_INODE_NODATACOW |
> + BTRFS_INODE_NODATASUM);
> + } else {
> + ip->flags &= ~BTRFS_INODE_NODATACOW;
> + }
> }
> }
On Fri, Nov 21, 2014 at 02:08:31AM -0800, Omar Sandoval wrote:
> Implement the swap file a_ops on btrfs. Activation simply checks for a usable
> swap file: it must be fully allocated (no holes), support direct I/O (so no
> compressed or inline extents) and should be nocow (I'm not sure about that last
> one).
>
> Signed-off-by: Omar Sandoval <[email protected]>
> ---
> fs/btrfs/inode.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 71 insertions(+)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d23362f..b8fd36b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9442,6 +9442,75 @@ out_inode:
>
> }
>
> +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
> + sector_t *span)
> +{
> + struct inode *inode = file_inode(file);
> + struct btrfs_inode *ip = BTRFS_I(inode);
'ip' looks strange in context of a filesystem, please pick a different
name (eg. 'inode' or whatever).
> + int ret = 0;
> + u64 isize = inode->i_size;
> + struct extent_state *cached_state = NULL;
> + struct extent_map *em;
> + u64 start, len;
> +
> + if (ip->flags & BTRFS_INODE_COMPRESS) {
> + /* Can't do direct I/O on a compressed file. */
> + pr_err("BTRFS: swapfile is compressed");
Please use the btrfs_err macros everywhere.
> + return -EINVAL;
> + }
> + if (!(ip->flags & BTRFS_INODE_NODATACOW)) {
> + /* The swap file can't be copy-on-write. */
> + pr_err("BTRFS: swapfile is copy-on-write");
> + return -EINVAL;
> + }
> +
> + lock_extent_bits(&ip->io_tree, 0, isize - 1, 0, &cached_state);
> +
> + /*
> + * All of the extents must be allocated and support direct I/O. Inline
> + * extents and compressed extents fall back to buffered I/O, so those
> + * are no good.
> + */
> + start = 0;
> + while (start < isize) {
> + len = isize - start;
> + em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
> + if (IS_ERR(em)) {
> + ret = PTR_ERR(em);
> + goto out;
> + }
> +
> + if (test_bit(EXTENT_FLAG_VACANCY, &em->flags) ||
> + em->block_start == EXTENT_MAP_HOLE) {
If the no-holes feature is enabled on the filesystem, there won't be any
such extent representing the hole. You have to check that each two
consecutive extents are adjacent.
> + pr_err("BTRFS: swapfile has holes");
> + ret = -EINVAL;
> + goto out;
> + }
> + if (em->block_start == EXTENT_MAP_INLINE) {
> + pr_err("BTRFS: swapfile is inline");
While the test is valid, this would mean that the file is smaller than
the inline limit, which is now one page. I think the generic swap code
would refuse such a small file anyway.
> + ret = -EINVAL;
> + goto out;
> + }
> + if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> + pr_err("BTRFS: swapfile is compresed");
> + ret = -EINVAL;
> + goto out;
> + }
I think the preallocated extents should be refused as well. This means
the filesystem has enough space to hold the data but it would still have
to go through the allocation and could in turn stress the memory
management code that triggered the swapping activity in the first place.
Though it's probably still possible to reach such corner case even with
fully allocated nodatacow file, this should be reviewed anyway.
> +
> + start = extent_map_end(em);
> + free_extent_map(em);
> + }
> +
> +out:
> + unlock_extent_cached(&ip->io_tree, 0, isize - 1, &cached_state,
> + GFP_NOFS);
> + return ret;
> +}
On Fri, Nov 21, 2014 at 6:00 PM, David Sterba <[email protected]> wrote:
> On Fri, Nov 21, 2014 at 02:08:31AM -0800, Omar Sandoval wrote:
>> Implement the swap file a_ops on btrfs. Activation simply checks for a usable
>> swap file: it must be fully allocated (no holes), support direct I/O (so no
>> compressed or inline extents) and should be nocow (I'm not sure about that last
>> one).
>>
>> Signed-off-by: Omar Sandoval <[email protected]>
>> ---
>> fs/btrfs/inode.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 71 insertions(+)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index d23362f..b8fd36b 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -9442,6 +9442,75 @@ out_inode:
>>
>> }
>>
>> +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
>> + sector_t *span)
>> +{
>> + struct inode *inode = file_inode(file);
>> + struct btrfs_inode *ip = BTRFS_I(inode);
>
> 'ip' looks strange in context of a filesystem, please pick a different
> name (eg. 'inode' or whatever).
>
>> + int ret = 0;
>> + u64 isize = inode->i_size;
>> + struct extent_state *cached_state = NULL;
>> + struct extent_map *em;
>> + u64 start, len;
>> +
>> + if (ip->flags & BTRFS_INODE_COMPRESS) {
>> + /* Can't do direct I/O on a compressed file. */
>> + pr_err("BTRFS: swapfile is compressed");
>
> Please use the btrfs_err macros everywhere.
>
>> + return -EINVAL;
>> + }
>> + if (!(ip->flags & BTRFS_INODE_NODATACOW)) {
>> + /* The swap file can't be copy-on-write. */
>> + pr_err("BTRFS: swapfile is copy-on-write");
>> + return -EINVAL;
>> + }
>> +
>> + lock_extent_bits(&ip->io_tree, 0, isize - 1, 0, &cached_state);
>> +
>> + /*
>> + * All of the extents must be allocated and support direct I/O. Inline
>> + * extents and compressed extents fall back to buffered I/O, so those
>> + * are no good.
>> + */
>> + start = 0;
>> + while (start < isize) {
>> + len = isize - start;
>> + em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
>> + if (IS_ERR(em)) {
>> + ret = PTR_ERR(em);
>> + goto out;
>> + }
>> +
>> + if (test_bit(EXTENT_FLAG_VACANCY, &em->flags) ||
>> + em->block_start == EXTENT_MAP_HOLE) {
>
> If the no-holes feature is enabled on the filesystem, there won't be any
> such extent representing the hole. You have to check that each two
> consecutive extents are adjacent.
If no-holes is enabled it means file extent items aren't used to
represent holes. But extent maps are still used to represent holes in
memory, and that's what the code is looking for and therefore it's
correct.
>
>> + pr_err("BTRFS: swapfile has holes");
>> + ret = -EINVAL;
>> + goto out;
>> + }
>> + if (em->block_start == EXTENT_MAP_INLINE) {
>> + pr_err("BTRFS: swapfile is inline");
>
> While the test is valid, this would mean that the file is smaller than
> the inline limit, which is now one page. I think the generic swap code
> would refuse such a small file anyway.
>
>> + ret = -EINVAL;
>> + goto out;
>> + }
>> + if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
>> + pr_err("BTRFS: swapfile is compresed");
>> + ret = -EINVAL;
>> + goto out;
>> + }
>
> I think the preallocated extents should be refused as well. This means
> the filesystem has enough space to hold the data but it would still have
> to go through the allocation and could in turn stress the memory
> management code that triggered the swapping activity in the first place.
>
> Though it's probably still possible to reach such corner case even with
> fully allocated nodatacow file, this should be reviewed anyway.
>
>> +
>> + start = extent_map_end(em);
>> + free_extent_map(em);
>> + }
>> +
>> +out:
>> + unlock_extent_cached(&ip->io_tree, 0, isize - 1, &cached_state,
>> + GFP_NOFS);
>> + return ret;
>> +}
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Filipe David Manana,
"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
On Fri, Nov 21, 2014 at 07:00:45PM +0100, David Sterba wrote:
> > + pr_err("BTRFS: swapfile has holes");
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > + if (em->block_start == EXTENT_MAP_INLINE) {
> > + pr_err("BTRFS: swapfile is inline");
>
> While the test is valid, this would mean that the file is smaller than
> the inline limit, which is now one page. I think the generic swap code
> would refuse such a small file anyway.
>
Sure. This test doesn't really cost us anything, so I think I'd feel a little
better just leaving it in. I'll add a comment for the next close reader.
Besides that and Filipe's response, I'll address everything you mentioned here
and in your other email in the next version, thanks.
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > + if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> > + pr_err("BTRFS: swapfile is compresed");
> > + ret = -EINVAL;
> > + goto out;
> > + }
>
> I think the preallocated extents should be refused as well. This means
> the filesystem has enough space to hold the data but it would still have
> to go through the allocation and could in turn stress the memory
> management code that triggered the swapping activity in the first place.
>
> Though it's probably still possible to reach such corner case even with
> fully allocated nodatacow file, this should be reviewed anyway.
>
I'll definitely take a closer look at this. In particular,
btrfs_get_blocks_direct and btrfs_get_extent do allocations in some cases which
I'll look into.
--
Omar
On Sat, Nov 22, 2014 at 12:03:57PM -0800, Omar Sandoval wrote:
> On Fri, Nov 21, 2014 at 07:00:45PM +0100, David Sterba wrote:
> > > + ret = -EINVAL;
> > > + goto out;
> > > + }
> > > + if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> > > + pr_err("BTRFS: swapfile is compresed");
> > > + ret = -EINVAL;
> > > + goto out;
> > > + }
> >
> > I think the preallocated extents should be refused as well. This means
> > the filesystem has enough space to hold the data but it would still have
> > to go through the allocation and could in turn stress the memory
> > management code that triggered the swapping activity in the first place.
> >
> > Though it's probably still possible to reach such corner case even with
> > fully allocated nodatacow file, this should be reviewed anyway.
> >
> I'll definitely take a closer look at this. In particular,
> btrfs_get_blocks_direct and btrfs_get_extent do allocations in some cases which
> I'll look into.
>
Alright, I took a look at this. My understanding is that a PREALLOC extent
represents a region on disk that has already been allocated but isn't in use
yet, but please correct me if I'm wrong. Judging by this comment in
btrfs_get_blocks_direct, we don't have to worry about PREALLOC extents in
general:
/*
* We don't allocate a new extent in the following cases
*
* 1) The inode is marked as NODATACOW. In this case we'll just use the
* existing extent.
* 2) The extent is marked as PREALLOC. We're good to go here and can
* just use the extent.
*
*/
A couple of other considerations that cropped up:
- btrfs_get_extent does a bunch of extra work if the extent is not cached in
the extent map tree that would be nice to avoid when swapping
- We might still have to do a COW if the swap file is in a snapshot
We can avoid the btrfs_get_extent by pinning the extents in memory one way or
another in btrfs_swap_activate.
The snapshot issue is a little tricker to resolve. I see a few options:
1. Just do the COW and hope for the best
2. As part of btrfs_swap_activate, COW any shared extents. If a snapshot
happens while a swap file is active, we'll fall back to 1.
3. Clobber any swap file extents which are in a snapshot, i.e., always use the
existing extent.
I'm partial to 3, as it's the simplest approach, and I don't think it makes
much sense for a swap file to be in a snapshot anyways. I'd appreciate any
comments that anyone might have.
--
Omar
On Fri, Nov 21, 2014 at 02:19:14AM -0800, Christoph Hellwig wrote:
> On Fri, Nov 21, 2014 at 02:15:31AM -0800, Omar Sandoval wrote:
> > Sorry for the noise, looks like Christoph got back to me on the previous RFC
> > just before I sent this out -- disregard this for now.
>
> If the NFS people are fine with this version I'd certainly welcome it as
> a first step. Additional improvements are of course always welcome.
>
To follow up with the NFS people, there's some more review going on on the
BTRFS side, but I'd like to have the infrastructure squared away here.
Additional improvements should be mostly on the VFS side.
Thanks!
--
Omar
On 2014/11/25 00:03, Omar Sandoval wrote:
> [snip]
>
> The snapshot issue is a little tricker to resolve. I see a few options:
>
> 1. Just do the COW and hope for the best
> 2. As part of btrfs_swap_activate, COW any shared extents. If a snapshot
> happens while a swap file is active, we'll fall back to 1.
> 3. Clobber any swap file extents which are in a snapshot, i.e., always use the
> existing extent.
>
> I'm partial to 3, as it's the simplest approach, and I don't think it makes
> much sense for a swap file to be in a snapshot anyways. I'd appreciate any
> comments that anyone might have.
>
Personally, 3 seems pragmatic - but not necessarily "correct". :-/
--
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
On Mon, Nov 24, 2014 at 02:03:02PM -0800, Omar Sandoval wrote:
> Alright, I took a look at this. My understanding is that a PREALLOC extent
> represents a region on disk that has already been allocated but isn't in use
> yet, but please correct me if I'm wrong. Judging by this comment in
> btrfs_get_blocks_direct, we don't have to worry about PREALLOC extents in
> general:
>
> /*
> * We don't allocate a new extent in the following cases
> *
> * 1) The inode is marked as NODATACOW. In this case we'll just use the
> * existing extent.
> * 2) The extent is marked as PREALLOC. We're good to go here and can
> * just use the extent.
> *
> */
Ok, thanks for checking.
> A couple of other considerations that cropped up:
>
> - btrfs_get_extent does a bunch of extra work if the extent is not cached in
> the extent map tree that would be nice to avoid when swapping
> - We might still have to do a COW if the swap file is in a snapshot
>
> We can avoid the btrfs_get_extent by pinning the extents in memory one way or
> another in btrfs_swap_activate.
That's preferrable, the overhead should be small.
> The snapshot issue is a little tricker to resolve. I see a few options:
>
> 1. Just do the COW and hope for the best
Snapshotting of NODATACOW files work, the extents are COWed on the first
write, the original owner sees no change.
> 2. As part of btrfs_swap_activate, COW any shared extents. If a snapshot
> happens while a swap file is active, we'll fall back to 1.
So we should make sure that any write to the swapfile will not lead to
the first-COW, this would require to scan all the extents and see if
we're the owner and COW eventually.
Doing that automatically is IMO better from the user perspective,
compared to an erroring out and requesting a manual "dd" over the file.
Possibly, the implied COW could create preallocated extents so we do not
have to rewrite the data.
> 3. Clobber any swap file extents which are in a snapshot, i.e., always use the
> existing extent.
Easy to implement but could lead to bad suprises.
More thoughts:
There are some guards in the generic code to prevent unwanted
modifications to the swapfiles (eg. no truncate, no fallocate, no
deletion). Further we should audit btrfs ioctls that may interfere with
swap and forbid any action (notably the cloning and deduplication ones).