2013-05-20 15:15:20

by Zhi Yong Wu

Subject: [RFC PATCH v1 0/5] BTRFS hot relocation support

From: Zhi Yong Wu <[email protected]>

This patchset is sent out as an RFC mainly to see whether its
design is going in the right development direction.

While working on this feature, I have tried to change as little
of the existing btrfs code as possible. After v0 was sent out,
I carefully reconsidered the speed-profile approach in the
patchset and no longer think it is meaningful for BTRFS hot
relocation; instead, I think that introducing one new block
group for nonrotating disks is a simple and effective way to
differentiate whether block space is reserved from a rotating
or a nonrotating disk. It would be much appreciated if the
developers could double-check whether this design is
appropriate for BTRFS hot relocation.

The patchset introduces hot relocation support for BTRFS. In a
hybrid storage environment, when data on a rotating disk gets
hot, it is relocated to a nonrotating disk automatically by the
BTRFS hot relocation support; conversely, if the usage ratio of
the nonrotating disk exceeds its upper threshold, data that has
gone cold is looked up and relocated back to the rotating disk
first, to make more space on the nonrotating disk, and then the
data that has become hot is relocated to the nonrotating disk
automatically.

BTRFS hot relocation works by first reserving block space on the
nonrotating disk, then loading the hot data into the page cache
from the rotating disk, allocating block space on the nonrotating
disk, and finally writing the data out to it.
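
To make that sequence concrete, below is a minimal userspace sketch of
the control flow only; every function in it is a hypothetical stand-in
for the real kernel steps (space reservation against the nonrot
space_info, page-cache reads, and COW writeout), not code from the
patchset:

#include <stdio.h>

/* Hypothetical stand-ins; none of these helpers exist in the patchset. */
static int reserve_nonrot_space(long bytes)
{
        printf("reserve %ld bytes on the nonrotating disk\n", bytes);
        return 0;       /* 0 on success, an -ENOSPC-style error otherwise */
}

static int read_into_page_cache(long start, long len)
{
        printf("read hot range [%ld, %ld) from the rotating disk\n",
               start, start + len);
        return 0;
}

static int cow_write_nonrot(long start, long len)
{
        printf("COW-write range [%ld, %ld) to the nonrotating disk\n",
               start, start + len);
        return 0;
}

int main(void)
{
        long start = 0, len = 1L << 20; /* one hot 1 MiB file range */

        /* Order matters: reserve first so the final write cannot ENOSPC. */
        if (reserve_nonrot_space(len))
                return 1;
        if (read_into_page_cache(start, len))
                return 1;
        return cow_write_nonrot(start, len) ? 1 : 0;
}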

If you'd like to play with it, please pull the patchset from
my git tree on GitHub:
https://github.com/wuzhy/kernel.git hot_reloc

For usage, please refer to the example below:

root@debian-i386:~# echo 0 > /sys/block/vdc/queue/rotational
^^^ The above command makes /dev/vdc report itself as an SSD (nonrotating) disk
root@debian-i386:~# echo 999999 > /proc/sys/fs/hot-age-interval
root@debian-i386:~# echo 10 > /proc/sys/fs/hot-update-interval
root@debian-i386:~# echo 10 > /proc/sys/fs/hot-reloc-interval
root@debian-i386:~# mkfs.btrfs -d single -m single -h /dev/vdb /dev/vdc -f

WARNING! - Btrfs v0.20-rc1-254-gb0136aa-dirty IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

[ 140.279011] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 1 transid 16 /dev/vdb
[ 140.283650] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
[ 140.550759] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb
[ 140.552473] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
adding device /dev/vdc id 2
[ 140.636215] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 2 transid 3 /dev/vdc
fs created label (null) on /dev/vdb
nodesize 4096 leafsize 4096 sectorsize 4096 size 14.65GB
Btrfs v0.20-rc1-254-gb0136aa-dirty
root@debian-i386:~# mount -o hot_move /dev/vdb /data2
[ 144.855471] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 6 /dev/vdb
[ 144.870444] btrfs: disk space caching is enabled
[ 144.904214] VFS: Turning on hot data tracking
root@debian-i386:~# dd if=/dev/zero of=/data2/test1 bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 23.4948 s, 91.4 MB/s
root@debian-i386:~# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 16G 13G 2.2G 86% /
tmpfs 4.8G 0 4.8G 0% /lib/init/rw
udev 10M 176K 9.9M 2% /dev
tmpfs 4.8G 0 4.8G 0% /dev/shm
/dev/vdb 15G 2.0G 13G 14% /data2
root@debian-i386:~# btrfs fi df /data2
Data: total=3.01GB, used=2.00GB
System: total=4.00MB, used=4.00KB
Metadata: total=8.00MB, used=2.19MB
Data_SSD: total=8.00MB, used=0.00
root@debian-i386:~# echo 108 > /proc/sys/fs/hot-reloc-threshold
^^^ The above command triggers HOT RELOCATION, because the data temperature is currently 109
root@debian-i386:~# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 16G 13G 2.2G 86% /
tmpfs 4.8G 0 4.8G 0% /lib/init/rw
udev 10M 176K 9.9M 2% /dev
tmpfs 4.8G 0 4.8G 0% /dev/shm
/dev/vdb 15G 2.1G 13G 14% /data2
root@debian-i386:~# btrfs fi df /data2
Data: total=3.01GB, used=6.25MB
System: total=4.00MB, used=4.00KB
Metadata: total=8.00MB, used=2.26MB
Data_SSD: total=2.01GB, used=2.00GB
root@debian-i386:~#

Changelog from v0:
1.) Refactored the introduction of the new block group.

Zhi Yong Wu (5):
BTRFS hot reloc, vfs: add one list_head field
BTRFS hot reloc: add one new block group
BTRFS hot reloc: add one hot reloc thread
BTRFS hot reloc, procfs: add three proc interfaces
BTRFS hot reloc: add hot relocation support

fs/btrfs/Makefile | 3 +-
fs/btrfs/ctree.h | 35 ++-
fs/btrfs/extent-tree.c | 99 ++++--
fs/btrfs/extent_io.c | 59 +++-
fs/btrfs/extent_io.h | 7 +
fs/btrfs/file.c | 24 +-
fs/btrfs/free-space-cache.c | 2 +-
fs/btrfs/hot_relocate.c | 721 +++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/hot_relocate.h | 38 +++
fs/btrfs/inode-map.c | 7 +-
fs/btrfs/inode.c | 94 +++++-
fs/btrfs/ioctl.c | 17 +-
fs/btrfs/relocation.c | 6 +-
fs/btrfs/super.c | 30 +-
fs/btrfs/volumes.c | 29 +-
fs/hot_tracking.c | 1 +
include/linux/btrfs.h | 4 +
include/linux/hot_tracking.h | 1 +
kernel/sysctl.c | 22 ++
19 files changed, 1130 insertions(+), 69 deletions(-)
create mode 100644 fs/btrfs/hot_relocate.c
create mode 100644 fs/btrfs/hot_relocate.h

--
1.7.11.7


2013-05-20 15:10:46

by Zhi Yong Wu

Subject: [RFC PATCH v1 5/5] BTRFS hot reloc: add hot relocation support

From: Zhi Yong Wu <[email protected]>

Add one new mount option '-o hot_move' for hot
relocation support. When hot relocation is enabled,
hot tracking will be enabled automatically.
Its usage looks like:
mount -o hot_move
mount -o nouser,hot_move
mount -o nouser,hot_move,loop
mount -o hot_move,nouser

Signed-off-by: Zhi Yong Wu <[email protected]>
---
fs/btrfs/super.c | 26 +++++++++++++++++++++++---
1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index c10477b..1377551 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -309,8 +309,13 @@ static void btrfs_put_super(struct super_block *sb)
* process... Whom would you report that to?
*/

+ /* Hot data relocation */
+ if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_MOVE))
+ hot_relocate_exit(btrfs_sb(sb));
+
/* Hot data tracking */
- if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
+ if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_MOVE)
+ || btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
hot_track_exit(sb);
}

@@ -325,7 +330,7 @@ enum {
Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
Opt_check_integrity, Opt_check_integrity_including_extent_data,
Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_hot_track,
- Opt_err,
+ Opt_hot_move, Opt_err,
};

static match_table_t tokens = {
@@ -366,6 +371,7 @@ static match_table_t tokens = {
{Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
{Opt_fatal_errors, "fatal_errors=%s"},
{Opt_hot_track, "hot_track"},
+ {Opt_hot_move, "hot_move"},
{Opt_err, NULL},
};

@@ -634,6 +640,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
case Opt_hot_track:
btrfs_set_opt(info->mount_opt, HOT_TRACK);
break;
+ case Opt_hot_move:
+ btrfs_set_opt(info->mount_opt, HOT_MOVE);
+ break;
case Opt_err:
printk(KERN_INFO "btrfs: unrecognized mount option "
"'%s'\n", p);
@@ -853,17 +862,26 @@ static int btrfs_fill_super(struct super_block *sb,
goto fail_close;
}

- if (btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
+ if (btrfs_test_opt(fs_info->tree_root, HOT_MOVE)
+ || btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
err = hot_track_init(sb);
if (err)
goto fail_hot;
}

+ if (btrfs_test_opt(fs_info->tree_root, HOT_MOVE)) {
+ err = hot_relocate_init(fs_info);
+ if (err)
+ goto fail_reloc;
+ }
+
save_mount_options(sb, data);
cleancache_init_fs(sb);
sb->s_flags |= MS_ACTIVE;
return 0;

+fail_reloc:
+ hot_track_exit(sb);
fail_hot:
dput(sb->s_root);
sb->s_root = NULL;
@@ -964,6 +982,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
seq_puts(seq, ",fatal_errors=panic");
if (btrfs_test_opt(root, HOT_TRACK))
seq_puts(seq, ",hot_track");
+ if (btrfs_test_opt(root, HOT_MOVE))
+ seq_puts(seq, ",hot_move");
return 0;
}

--
1.7.11.7

2013-05-20 15:10:58

by Zhi Yong Wu

Subject: [RFC PATCH v1 2/5] BTRFS hot reloc: add one new block group

From: Zhi Yong Wu <[email protected]>

Introduce one new block group type, BTRFS_BLOCK_GROUP_DATA_NONROT,
which is used to differentiate whether block space is reserved
and allocated from a rotating or a nonrotating disk.
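
As a quick illustration, the new type simply takes the next free bit
after the RAID profile bits, and the extended type mask keeps it
separate from the profile bits. A tiny standalone C check follows; the
flag values are copied from the hunks below, and the DATA/SYSTEM/
METADATA values match the existing ctree.h:

#include <stdint.h>
#include <stdio.h>

#define BTRFS_BLOCK_GROUP_DATA        (1ULL << 0)
#define BTRFS_BLOCK_GROUP_SYSTEM      (1ULL << 1)
#define BTRFS_BLOCK_GROUP_METADATA    (1ULL << 2)
#define BTRFS_BLOCK_GROUP_RAID0       (1ULL << 3)
#define BTRFS_BLOCK_GROUP_DATA_NONROT (1ULL << 9)  /* new in this patch */

#define BTRFS_BLOCK_GROUP_TYPE_MASK (BTRFS_BLOCK_GROUP_DATA |     \
                                     BTRFS_BLOCK_GROUP_SYSTEM |   \
                                     BTRFS_BLOCK_GROUP_METADATA | \
                                     BTRFS_BLOCK_GROUP_DATA_NONROT)

int main(void)
{
        /* a hot-data chunk: the nonrot type bit plus a RAID0 profile bit */
        uint64_t flags = BTRFS_BLOCK_GROUP_DATA_NONROT | BTRFS_BLOCK_GROUP_RAID0;

        printf("type bits:   0x%llx\n",
               (unsigned long long)(flags & BTRFS_BLOCK_GROUP_TYPE_MASK));
        printf("is nonrot:   %d\n", !!(flags & BTRFS_BLOCK_GROUP_DATA_NONROT));
        printf("is rot data: %d\n", !!(flags & BTRFS_BLOCK_GROUP_DATA));
        return 0;
}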

Signed-off-by: Zhi Yong Wu <[email protected]>
---
fs/btrfs/ctree.h | 33 ++++++++++++---
fs/btrfs/extent-tree.c | 99 ++++++++++++++++++++++++++++++++++++---------
fs/btrfs/extent_io.c | 59 ++++++++++++++++++++++++++-
fs/btrfs/extent_io.h | 7 ++++
fs/btrfs/file.c | 24 +++++++----
fs/btrfs/free-space-cache.c | 2 +-
fs/btrfs/inode-map.c | 7 ++--
fs/btrfs/inode.c | 94 ++++++++++++++++++++++++++++++++++--------
fs/btrfs/ioctl.c | 17 +++++---
fs/btrfs/relocation.c | 6 ++-
fs/btrfs/super.c | 4 +-
fs/btrfs/volumes.c | 29 ++++++++++++-
12 files changed, 316 insertions(+), 65 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 133a6ed..f7a3170 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -963,6 +963,12 @@ struct btrfs_dev_replace_item {
#define BTRFS_BLOCK_GROUP_RAID10 (1ULL << 6)
#define BTRFS_BLOCK_GROUP_RAID5 (1 << 7)
#define BTRFS_BLOCK_GROUP_RAID6 (1 << 8)
+/*
+ * New block groups for use with BTRFS hot relocation feature.
+ * When BTRFS hot relocation is enabled, *_NONROT block group is
+ * forced to nonrotating drives.
+ */
+#define BTRFS_BLOCK_GROUP_DATA_NONROT (1ULL << 9)
#define BTRFS_BLOCK_GROUP_RESERVED BTRFS_AVAIL_ALLOC_BIT_SINGLE

enum btrfs_raid_types {
@@ -978,7 +984,8 @@ enum btrfs_raid_types {

#define BTRFS_BLOCK_GROUP_TYPE_MASK (BTRFS_BLOCK_GROUP_DATA | \
BTRFS_BLOCK_GROUP_SYSTEM | \
- BTRFS_BLOCK_GROUP_METADATA)
+ BTRFS_BLOCK_GROUP_METADATA | \
+ BTRFS_BLOCK_GROUP_DATA_NONROT)

#define BTRFS_BLOCK_GROUP_PROFILE_MASK (BTRFS_BLOCK_GROUP_RAID0 | \
BTRFS_BLOCK_GROUP_RAID1 | \
@@ -1521,6 +1528,7 @@ struct btrfs_fs_info {
struct list_head space_info;

struct btrfs_space_info *data_sinfo;
+ struct btrfs_space_info *nonrot_data_sinfo;

struct reloc_control *reloc_ctl;

@@ -1545,6 +1553,7 @@ struct btrfs_fs_info {
u64 avail_data_alloc_bits;
u64 avail_metadata_alloc_bits;
u64 avail_system_alloc_bits;
+ u64 avail_data_nonrot_alloc_bits;

/* restriper state */
spinlock_t balance_lock;
@@ -1557,6 +1566,7 @@ struct btrfs_fs_info {

unsigned data_chunk_allocations;
unsigned metadata_ratio;
+ unsigned data_nonrot_chunk_allocations;

void *bdev_holder;

@@ -1928,6 +1938,7 @@ struct btrfs_ioctl_defrag_range_args {
#define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
#define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR (1 << 22)
#define BTRFS_MOUNT_HOT_TRACK (1 << 23)
+#define BTRFS_MOUNT_HOT_MOVE (1 << 24)

#define btrfs_clear_opt(o, opt) ((o) &= ~BTRFS_MOUNT_##opt)
#define btrfs_set_opt(o, opt) ((o) |= BTRFS_MOUNT_##opt)
@@ -3043,6 +3054,8 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_root *root,
int btrfs_cross_ref_exist(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
u64 objectid, u64 offset, u64 bytenr);
+struct btrfs_block_group_cache *btrfs_lookup_first_block_group(
+ struct btrfs_fs_info *info, u64 bytenr);
struct btrfs_block_group_cache *btrfs_lookup_block_group(
struct btrfs_fs_info *info,
u64 bytenr);
@@ -3093,6 +3106,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
u64 bytenr, u64 num_bytes, u64 parent,
u64 root_objectid, u64 owner, u64 offset, int for_cow);
+struct btrfs_block_group_cache *next_block_group(struct btrfs_root *root,
+ struct btrfs_block_group_cache *cache);

int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
@@ -3122,8 +3137,14 @@ enum btrfs_reserve_flush_enum {
BTRFS_RESERVE_FLUSH_ALL,
};

-int btrfs_check_data_free_space(struct inode *inode, u64 bytes);
-void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes);
+enum {
+ TYPE_ROT, /* rot -> rotating */
+ TYPE_NONROT, /* nonrot -> nonrotating */
+ MAX_RELOC_TYPES,
+};
+
+int btrfs_check_data_free_space(struct inode *inode, u64 bytes, int *flag);
+void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes, int flag);
void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans,
@@ -3138,8 +3159,8 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root,
u64 qgroup_reserved);
int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
-int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes);
-void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes);
+int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes, int *flag);
+void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes, int flag);
void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root,
unsigned short type);
@@ -3612,7 +3633,7 @@ int btrfs_release_file(struct inode *inode, struct file *file);
int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
struct page **pages, size_t num_pages,
loff_t pos, size_t write_bytes,
- struct extent_state **cached);
+ struct extent_state **cached, int flag);

/* tree-defrag.c */
int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2305b5c..afc9f77 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -628,7 +628,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
/*
* return the block group that starts at or after bytenr
*/
-static struct btrfs_block_group_cache *
+struct btrfs_block_group_cache *
btrfs_lookup_first_block_group(struct btrfs_fs_info *info, u64 bytenr)
{
struct btrfs_block_group_cache *cache;
@@ -3030,7 +3030,7 @@ fail:

}

-static struct btrfs_block_group_cache *
+struct btrfs_block_group_cache *
next_block_group(struct btrfs_root *root,
struct btrfs_block_group_cache *cache)
{
@@ -3059,6 +3059,7 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group,
int num_pages = 0;
int retries = 0;
int ret = 0;
+ int flag = TYPE_ROT;

/*
* If this block group is smaller than 100 megs don't bother caching the
@@ -3142,7 +3143,7 @@ again:
num_pages *= 16;
num_pages *= PAGE_CACHE_SIZE;

- ret = btrfs_check_data_free_space(inode, num_pages);
+ ret = btrfs_check_data_free_space(inode, num_pages, &flag);
if (ret)
goto out_put;

@@ -3151,7 +3152,8 @@ again:
&alloc_hint);
if (!ret)
dcs = BTRFS_DC_SETUP;
- btrfs_free_reserved_data_space(inode, num_pages);
+
+ btrfs_free_reserved_data_space(inode, num_pages, flag);

out_put:
iput(inode);
@@ -3353,6 +3355,8 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags,
list_add_rcu(&found->list, &info->space_info);
if (flags & BTRFS_BLOCK_GROUP_DATA)
info->data_sinfo = found;
+ else if (flags & BTRFS_BLOCK_GROUP_DATA_NONROT)
+ info->nonrot_data_sinfo = found;
return 0;
}

@@ -3368,6 +3372,8 @@ static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
fs_info->avail_metadata_alloc_bits |= extra_flags;
if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
fs_info->avail_system_alloc_bits |= extra_flags;
+ if (flags & BTRFS_BLOCK_GROUP_DATA_NONROT)
+ fs_info->avail_data_nonrot_alloc_bits |= extra_flags;
write_sequnlock(&fs_info->profiles_lock);
}

@@ -3474,18 +3480,27 @@ static u64 get_alloc_profile(struct btrfs_root *root, u64 flags)
flags |= root->fs_info->avail_system_alloc_bits;
else if (flags & BTRFS_BLOCK_GROUP_METADATA)
flags |= root->fs_info->avail_metadata_alloc_bits;
+ else if (flags & BTRFS_BLOCK_GROUP_DATA_NONROT)
+ flags |= root->fs_info->avail_data_nonrot_alloc_bits;
} while (read_seqretry(&root->fs_info->profiles_lock, seq));

return btrfs_reduce_alloc_profile(root, flags);
}

+/*
+ * Turns a chunk_type integer into a set of block group flags (a profile).
+ * Hot relocation code adds chunk_type 2 for hot data specific block
+ * group type.
+ */
u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data)
{
u64 flags;
u64 ret;

- if (data)
+ if (data == 1)
flags = BTRFS_BLOCK_GROUP_DATA;
+ else if (data == 2)
+ flags = BTRFS_BLOCK_GROUP_DATA_NONROT;
else if (root == root->fs_info->chunk_root)
flags = BTRFS_BLOCK_GROUP_SYSTEM;
else
@@ -3499,13 +3514,14 @@ u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data)
* This will check the space that the inode allocates from to make sure we have
* enough space for bytes.
*/
-int btrfs_check_data_free_space(struct inode *inode, u64 bytes)
+int btrfs_check_data_free_space(struct inode *inode, u64 bytes, int *flag)
{
struct btrfs_space_info *data_sinfo;
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_fs_info *fs_info = root->fs_info;
u64 used;
int ret = 0, committed = 0, alloc_chunk = 1;
+ int data, tried = 0;

/* make sure bytes are sectorsize aligned */
bytes = ALIGN(bytes, root->sectorsize);
@@ -3516,7 +3532,15 @@ int btrfs_check_data_free_space(struct inode *inode, u64 bytes)
committed = 1;
}

- data_sinfo = fs_info->data_sinfo;
+ if (*flag == TYPE_NONROT) {
+try_nonrot:
+ data = 2;
+ data_sinfo = fs_info->nonrot_data_sinfo;
+ } else {
+ data = 1;
+ data_sinfo = fs_info->data_sinfo;
+ }
+
if (!data_sinfo)
goto alloc;

@@ -3534,13 +3558,22 @@ again:
* if we don't have enough free bytes in this space then we need
* to alloc a new chunk.
*/
- if (!data_sinfo->full && alloc_chunk) {
+ if (alloc_chunk) {
u64 alloc_target;

+ if (data_sinfo->full) {
+ if (!tried) {
+ tried = 1;
+ spin_unlock(&data_sinfo->lock);
+ goto try_nonrot;
+ } else
+ goto non_alloc;
+ }
+
data_sinfo->force_alloc = CHUNK_ALLOC_FORCE;
spin_unlock(&data_sinfo->lock);
alloc:
- alloc_target = btrfs_get_alloc_profile(root, 1);
+ alloc_target = btrfs_get_alloc_profile(root, data);
trans = btrfs_join_transaction(root);
if (IS_ERR(trans))
return PTR_ERR(trans);
@@ -3557,11 +3590,13 @@ alloc:
}

if (!data_sinfo)
- data_sinfo = fs_info->data_sinfo;
+ data_sinfo = (data == 1) ? fs_info->data_sinfo :
+ fs_info->nonrot_data_sinfo;

goto again;
}

+non_alloc:
/*
* If we have less pinned bytes than we want to allocate then
* don't bother committing the transaction, it won't help us.
@@ -3572,7 +3607,7 @@ alloc:

/* commit the current transaction and try again */
commit_trans:
- if (!committed &&
+ if (!committed && data_sinfo &&
!atomic_read(&root->fs_info->open_ioctl_trans)) {
committed = 1;
trans = btrfs_join_transaction(root);
@@ -3586,6 +3621,10 @@ commit_trans:

return -ENOSPC;
}
+
+ if (tried)
+ *flag = TYPE_NONROT;
+
data_sinfo->bytes_may_use += bytes;
trace_btrfs_space_reservation(root->fs_info, "space_info",
data_sinfo->flags, bytes, 1);
@@ -3597,7 +3636,7 @@ commit_trans:
/*
* Called if we need to clear a data reservation for this inode.
*/
-void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes)
+void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes, int flag)
{
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_space_info *data_sinfo;
@@ -3605,7 +3644,10 @@ void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes)
/* make sure bytes are sectorsize aligned */
bytes = ALIGN(bytes, root->sectorsize);

- data_sinfo = root->fs_info->data_sinfo;
+ if (flag == TYPE_NONROT)
+ data_sinfo = root->fs_info->nonrot_data_sinfo;
+ else
+ data_sinfo = root->fs_info->data_sinfo;
spin_lock(&data_sinfo->lock);
data_sinfo->bytes_may_use -= bytes;
trace_btrfs_space_reservation(root->fs_info, "space_info",
@@ -3789,6 +3831,13 @@ again:
force_metadata_allocation(fs_info);
}

+ if (flags & BTRFS_BLOCK_GROUP_DATA_NONROT && fs_info->metadata_ratio) {
+ fs_info->data_nonrot_chunk_allocations++;
+ if (!(fs_info->data_nonrot_chunk_allocations %
+ fs_info->metadata_ratio))
+ force_metadata_allocation(fs_info);
+ }
+
/*
* Check if we have enough space in SYSTEM chunk because we may need
* to update devices.
@@ -4495,6 +4544,13 @@ static u64 calc_global_metadata_size(struct btrfs_fs_info *fs_info)
meta_used = sinfo->bytes_used;
spin_unlock(&sinfo->lock);

+ sinfo = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_DATA_NONROT);
+ if (sinfo) {
+ spin_lock(&sinfo->lock);
+ data_used += sinfo->bytes_used;
+ spin_unlock(&sinfo->lock);
+ }
+
num_bytes = (data_used >> fs_info->sb->s_blocksize_bits) *
csum_size * 2;
num_bytes += div64_u64(data_used + meta_used, 50);
@@ -4968,6 +5024,7 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes)
* btrfs_delalloc_reserve_space - reserve data and metadata space for delalloc
* @inode: inode we're writing to
* @num_bytes: the number of bytes we want to allocate
+ * @flag: indicate if block space is reserved from rotating disk or not
*
* This will do the following things
*
@@ -4979,17 +5036,17 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes)
*
* This will return 0 for success and -ENOSPC if there is no space left.
*/
-int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes)
+int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes, int *flag)
{
int ret;

- ret = btrfs_check_data_free_space(inode, num_bytes);
+ ret = btrfs_check_data_free_space(inode, num_bytes, flag);
if (ret)
return ret;

ret = btrfs_delalloc_reserve_metadata(inode, num_bytes);
if (ret) {
- btrfs_free_reserved_data_space(inode, num_bytes);
+ btrfs_free_reserved_data_space(inode, num_bytes, *flag);
return ret;
}

@@ -5000,6 +5057,7 @@ int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes)
* btrfs_delalloc_release_space - release data and metadata space for delalloc
* @inode: inode we're releasing space for
* @num_bytes: the number of bytes we want to free up
+ * @flag: indicate if block space is freed from rotating disk or not
*
* This must be matched with a call to btrfs_delalloc_reserve_space. This is
* called in the case that we don't need the metadata AND data reservations
@@ -5009,10 +5067,10 @@ int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes)
* decrement ->delalloc_bytes and remove it from the fs_info delalloc_inodes
* list if there are no delalloc bytes left.
*/
-void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes)
+void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes, int flag)
{
btrfs_delalloc_release_metadata(inode, num_bytes);
- btrfs_free_reserved_data_space(inode, num_bytes);
+ btrfs_free_reserved_data_space(inode, num_bytes, flag);
}

static int update_block_group(struct btrfs_root *root,
@@ -5888,7 +5946,8 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
struct btrfs_space_info *space_info;
int loop = 0;
int index = __get_raid_index(flags);
- int alloc_type = (flags & BTRFS_BLOCK_GROUP_DATA) ?
+ int alloc_type = ((flags & BTRFS_BLOCK_GROUP_DATA)
+ || (flags & BTRFS_BLOCK_GROUP_DATA_NONROT)) ?
RESERVE_ALLOC_NO_ACCOUNT : RESERVE_ALLOC;
bool found_uncached_bg = false;
bool failed_cluster_refill = false;
@@ -8360,6 +8419,8 @@ static void clear_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
fs_info->avail_metadata_alloc_bits &= ~extra_flags;
if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
fs_info->avail_system_alloc_bits &= ~extra_flags;
+ if (flags & BTRFS_BLOCK_GROUP_DATA_NONROT)
+ fs_info->avail_data_nonrot_alloc_bits &= ~extra_flags;
write_sequnlock(&fs_info->profiles_lock);
}

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 32d67a8..2b1f132 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1216,6 +1216,34 @@ int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
cached_state, mask);
}

+void set_extent_hot(struct inode *inode, u64 start, u64 end,
+ struct extent_state **cached_state,
+ int type, int flag)
+{
+ int set_bits = 0, clear_bits = 0;
+
+ if (flag) {
+ set_bits = EXTENT_DELALLOC | EXTENT_UPTODATE;
+ clear_bits = EXTENT_DIRTY | EXTENT_DELALLOC |
+ EXTENT_DO_ACCOUNTING;
+ }
+
+ if (type == TYPE_NONROT) {
+ set_bits |= EXTENT_HOT;
+ clear_bits |= EXTENT_COLD;
+ } else {
+ set_bits |= EXTENT_COLD;
+ clear_bits |= EXTENT_HOT;
+ }
+
+ clear_extent_bit(&BTRFS_I(inode)->io_tree,
+ start, end, clear_bits,
+ 0, 0, cached_state, GFP_NOFS);
+ set_extent_bit(&BTRFS_I(inode)->io_tree, start,
+ end, set_bits, NULL,
+ cached_state, GFP_NOFS);
+}
+
/*
* either insert or lock state struct between start and end use mask to tell
* us if waiting is desired.
@@ -1417,9 +1445,11 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
{
struct rb_node *node;
struct extent_state *state;
+ struct btrfs_root *root;
u64 cur_start = *start;
u64 found = 0;
u64 total_bytes = 0;
+ int flag = EXTENT_DELALLOC;

spin_lock(&tree->lock);

@@ -1434,13 +1464,27 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
goto out;
}

+ root = BTRFS_I(tree->mapping->host)->root;
while (1) {
state = rb_entry(node, struct extent_state, rb_node);
if (found && (state->start != cur_start ||
(state->state & EXTENT_BOUNDARY))) {
goto out;
}
- if (!(state->state & EXTENT_DELALLOC)) {
+ if (btrfs_test_opt(root, HOT_MOVE)) {
+ if (!(state->state & EXTENT_DELALLOC) ||
+ (!(state->state & EXTENT_HOT) &&
+ !(state->state & EXTENT_COLD))) {
+ if (!found)
+ *end = state->end;
+ goto out;
+ } else {
+ if (!found)
+ flag = (state->state & EXTENT_HOT) ?
+ EXTENT_HOT : EXTENT_COLD;
+ }
+ }
+ if (!(state->state & flag)) {
if (!found)
*end = state->end;
goto out;
@@ -1627,7 +1671,13 @@ again:
lock_extent_bits(tree, delalloc_start, delalloc_end, 0, &cached_state);

/* then test to make sure it is all still delalloc */
- ret = test_range_bit(tree, delalloc_start, delalloc_end,
+ if (btrfs_test_opt(BTRFS_I(inode)->root, HOT_MOVE)) {
+ ret = test_range_bit(tree, delalloc_start, delalloc_end,
+ EXTENT_DELALLOC | EXTENT_HOT, 1, cached_state);
+ ret |= test_range_bit(tree, delalloc_start, delalloc_end,
+ EXTENT_DELALLOC | EXTENT_COLD, 1, cached_state);
+ } else
+ ret = test_range_bit(tree, delalloc_start, delalloc_end,
EXTENT_DELALLOC, 1, cached_state);
if (!ret) {
unlock_extent_cached(tree, delalloc_start, delalloc_end,
@@ -1665,6 +1715,11 @@ int extent_clear_unlock_delalloc(struct inode *inode,
if (op & EXTENT_CLEAR_DELALLOC)
clear_bits |= EXTENT_DELALLOC;

+ if (op & EXTENT_CLEAR_HOT)
+ clear_bits |= EXTENT_HOT;
+ if (op & EXTENT_CLEAR_COLD)
+ clear_bits |= EXTENT_COLD;
+
clear_extent_bit(tree, start, end, clear_bits, 1, 0, NULL, GFP_NOFS);
if (!(op & (EXTENT_CLEAR_UNLOCK_PAGE | EXTENT_CLEAR_DIRTY |
EXTENT_SET_WRITEBACK | EXTENT_END_WRITEBACK |
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index a2c03a1..a3bfc9d 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -19,6 +19,8 @@
#define EXTENT_FIRST_DELALLOC (1 << 12)
#define EXTENT_NEED_WAIT (1 << 13)
#define EXTENT_DAMAGED (1 << 14)
+#define EXTENT_HOT (1 << 15)
+#define EXTENT_COLD (1 << 16)
#define EXTENT_IOBITS (EXTENT_LOCKED | EXTENT_WRITEBACK)
#define EXTENT_CTLBITS (EXTENT_DO_ACCOUNTING | EXTENT_FIRST_DELALLOC)

@@ -51,6 +53,8 @@
#define EXTENT_END_WRITEBACK 0x20
#define EXTENT_SET_PRIVATE2 0x40
#define EXTENT_CLEAR_ACCOUNTING 0x80
+#define EXTENT_CLEAR_HOT 0x100
+#define EXTENT_CLEAR_COLD 0x200

/*
* page->private values. Every page that is controlled by the extent
@@ -237,6 +241,9 @@ int set_extent_delalloc(struct extent_io_tree *tree, u64 start, u64 end,
struct extent_state **cached_state, gfp_t mask);
int set_extent_defrag(struct extent_io_tree *tree, u64 start, u64 end,
struct extent_state **cached_state, gfp_t mask);
+void set_extent_hot(struct inode *inode, u64 start, u64 end,
+ struct extent_state **cached_state,
+ int type, int flag);
int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
u64 *start_ret, u64 *end_ret, unsigned long bits,
struct extent_state **cached_state);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 4205ba7..4cbf236 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -41,6 +41,7 @@
#include "locking.h"
#include "compat.h"
#include "volumes.h"
+#include "hot_relocate.h"

static struct kmem_cache *btrfs_inode_defrag_cachep;
/*
@@ -500,7 +501,7 @@ static void btrfs_drop_pages(struct page **pages, size_t num_pages)
int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
struct page **pages, size_t num_pages,
loff_t pos, size_t write_bytes,
- struct extent_state **cached)
+ struct extent_state **cached, int flag)
{
int err = 0;
int i;
@@ -514,6 +515,11 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
num_bytes = ALIGN(write_bytes + pos - start_pos, root->sectorsize);

end_of_last_block = start_pos + num_bytes - 1;
+
+ if (btrfs_test_opt(root, HOT_MOVE))
+ set_extent_hot(inode, start_pos, end_of_last_block,
+ cached, flag, 0);
+
err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
cached);
if (err)
@@ -1350,6 +1356,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
size_t dirty_pages;
size_t copied;
+ int flag = TYPE_ROT;

WARN_ON(num_pages > nrptrs);

@@ -1363,7 +1370,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
}

ret = btrfs_delalloc_reserve_space(inode,
- num_pages << PAGE_CACHE_SHIFT);
+ num_pages << PAGE_CACHE_SHIFT, &flag);
if (ret)
break;

@@ -1377,7 +1384,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
force_page_uptodate);
if (ret) {
btrfs_delalloc_release_space(inode,
- num_pages << PAGE_CACHE_SHIFT);
+ num_pages << PAGE_CACHE_SHIFT, flag);
break;
}

@@ -1416,16 +1423,16 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
}
btrfs_delalloc_release_space(inode,
(num_pages - dirty_pages) <<
- PAGE_CACHE_SHIFT);
+ PAGE_CACHE_SHIFT, flag);
}

if (copied > 0) {
ret = btrfs_dirty_pages(root, inode, pages,
dirty_pages, pos, copied,
- NULL);
+ NULL, flag);
if (ret) {
btrfs_delalloc_release_space(inode,
- dirty_pages << PAGE_CACHE_SHIFT);
+ dirty_pages << PAGE_CACHE_SHIFT, flag);
btrfs_drop_pages(pages, num_pages);
break;
}
@@ -2150,6 +2157,7 @@ static long btrfs_fallocate(struct file *file, int mode,
u64 locked_end;
struct extent_map *em;
int blocksize = BTRFS_I(inode)->root->sectorsize;
+ int flag = TYPE_ROT;
int ret;

alloc_start = round_down(offset, blocksize);
@@ -2166,7 +2174,7 @@ static long btrfs_fallocate(struct file *file, int mode,
* Make sure we have enough space before we do the
* allocation.
*/
- ret = btrfs_check_data_free_space(inode, alloc_end - alloc_start);
+ ret = btrfs_check_data_free_space(inode, alloc_end - alloc_start, &flag);
if (ret)
return ret;
if (root->fs_info->quota_enabled) {
@@ -2281,7 +2289,7 @@ out:
btrfs_qgroup_free(root, alloc_end - alloc_start);
out_reserve_fail:
/* Let go of our reservation. */
- btrfs_free_reserved_data_space(inode, alloc_end - alloc_start);
+ btrfs_free_reserved_data_space(inode, alloc_end - alloc_start, flag);
return ret;
}

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index ecca6c7..58a1cc3 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1007,7 +1007,7 @@ static int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode,
io_ctl_zero_remaining_pages(&io_ctl);

ret = btrfs_dirty_pages(root, inode, io_ctl.pages, io_ctl.num_pages,
- 0, i_size_read(inode), &cached_state);
+ 0, i_size_read(inode), &cached_state, TYPE_ROT);
io_ctl_drop_pages(&io_ctl);
unlock_extent_cached(&BTRFS_I(inode)->io_tree, 0,
i_size_read(inode) - 1, &cached_state, GFP_NOFS);
diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
index d26f67a..ef0c79d 100644
--- a/fs/btrfs/inode-map.c
+++ b/fs/btrfs/inode-map.c
@@ -403,6 +403,7 @@ int btrfs_save_ino_cache(struct btrfs_root *root,
u64 alloc_hint = 0;
int ret;
int prealloc;
+ int flag = TYPE_ROT;
bool retry = false;

/* only fs tree and subvol/snap needs ino cache */
@@ -490,17 +491,17 @@ again:
/* Just to make sure we have enough space */
prealloc += 8 * PAGE_CACHE_SIZE;

- ret = btrfs_delalloc_reserve_space(inode, prealloc);
+ ret = btrfs_delalloc_reserve_space(inode, prealloc, &flag);
if (ret)
goto out_put;

ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, prealloc,
prealloc, prealloc, &alloc_hint);
if (ret) {
- btrfs_delalloc_release_space(inode, prealloc);
+ btrfs_delalloc_release_space(inode, prealloc, flag);
goto out_put;
}
- btrfs_free_reserved_data_space(inode, prealloc);
+ btrfs_free_reserved_data_space(inode, prealloc, flag);

ret = btrfs_write_out_ino_cache(root, trans, path);
out_put:
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9b31b3b..096f97f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -57,6 +57,7 @@
#include "free-space-cache.h"
#include "inode-map.h"
#include "backref.h"
+#include "hot_relocate.h"

struct btrfs_iget_args {
u64 ino;
@@ -106,6 +107,27 @@ static struct extent_map *create_pinned_em(struct inode *inode, u64 start,

static int btrfs_dirty_inode(struct inode *inode);

+static int get_chunk_type(struct inode *inode, u64 start, u64 end)
+{
+ int hot, cold, ret = 1;
+
+ hot = test_range_bit(&BTRFS_I(inode)->io_tree,
+ start, end, EXTENT_HOT, 1, NULL);
+ cold = test_range_bit(&BTRFS_I(inode)->io_tree,
+ start, end, EXTENT_COLD, 1, NULL);
+
+ WARN_ON(hot && cold);
+
+ if (hot)
+ ret = 2;
+ else if (cold)
+ ret = 1;
+ else
+ WARN_ON(1);
+
+ return ret;
+}
+
static int btrfs_init_inode_security(struct btrfs_trans_handle *trans,
struct inode *inode, struct inode *dir,
const struct qstr *qstr)
@@ -859,13 +881,14 @@ static noinline int __cow_file_range(struct btrfs_trans_handle *trans,
{
u64 alloc_hint = 0;
u64 num_bytes;
- unsigned long ram_size;
+ unsigned long ram_size, hot_flag = 0;
u64 disk_num_bytes;
u64 cur_alloc_size;
u64 blocksize = root->sectorsize;
struct btrfs_key ins;
struct extent_map *em;
struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+ int chunk_type = 1;
int ret = 0;

BUG_ON(btrfs_is_free_space_inode(inode));
@@ -873,6 +896,7 @@ static noinline int __cow_file_range(struct btrfs_trans_handle *trans,
num_bytes = ALIGN(end - start + 1, blocksize);
num_bytes = max(blocksize, num_bytes);
disk_num_bytes = num_bytes;
+ ret = 0;

/* if this is a small write inside eof, kick off defrag */
if (num_bytes < 64 * 1024 &&
@@ -892,7 +916,8 @@ static noinline int __cow_file_range(struct btrfs_trans_handle *trans,
EXTENT_CLEAR_DELALLOC |
EXTENT_CLEAR_DIRTY |
EXTENT_SET_WRITEBACK |
- EXTENT_END_WRITEBACK);
+ EXTENT_END_WRITEBACK |
+ hot_flag);

*nr_written = *nr_written +
(end - start + PAGE_CACHE_SIZE) / PAGE_CACHE_SIZE;
@@ -914,9 +939,25 @@ static noinline int __cow_file_range(struct btrfs_trans_handle *trans,
unsigned long op;

cur_alloc_size = disk_num_bytes;
+
+ /*
+ * Use COW operations to move hot data to SSD and cold data
+ * back to rotating disk. Sets chunk_type to 1 to indicate
+ * to write to BTRFS_BLOCK_GROUP_DATA or 2 to indicate
+ * BTRFS_BLOCK_GROUP_DATA_NONROT.
+ */
+ if (btrfs_test_opt(root, HOT_MOVE)) {
+ chunk_type = get_chunk_type(inode, start,
+ start + cur_alloc_size - 1);
+ if (chunk_type == 2)
+ hot_flag = EXTENT_CLEAR_HOT;
+ else
+ hot_flag = EXTENT_CLEAR_COLD;
+ }
+
ret = btrfs_reserve_extent(trans, root, cur_alloc_size,
root->sectorsize, 0, alloc_hint,
- &ins, 1);
+ &ins, chunk_type);
if (ret < 0) {
btrfs_abort_transaction(trans, root, ret);
goto out_unlock;
@@ -982,7 +1023,7 @@ static noinline int __cow_file_range(struct btrfs_trans_handle *trans,
*/
op = unlock ? EXTENT_CLEAR_UNLOCK_PAGE : 0;
op |= EXTENT_CLEAR_UNLOCK | EXTENT_CLEAR_DELALLOC |
- EXTENT_SET_PRIVATE2;
+ EXTENT_SET_PRIVATE2 | hot_flag;

extent_clear_unlock_delalloc(inode, &BTRFS_I(inode)->io_tree,
start, start + ram_size - 1,
@@ -1006,7 +1047,8 @@ out_unlock:
EXTENT_CLEAR_DELALLOC |
EXTENT_CLEAR_DIRTY |
EXTENT_SET_WRITEBACK |
- EXTENT_END_WRITEBACK);
+ EXTENT_END_WRITEBACK |
+ hot_flag);

goto out;
}
@@ -1600,8 +1642,12 @@ static void btrfs_clear_bit_hook(struct inode *inode,
btrfs_delalloc_release_metadata(inode, len);

if (root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID
- && do_list)
- btrfs_free_reserved_data_space(inode, len);
+ && do_list) {
+ int flag = TYPE_ROT;
+ if ((state->state & EXTENT_HOT) && (*bits & EXTENT_HOT))
+ flag = TYPE_NONROT;
+ btrfs_free_reserved_data_space(inode, len, flag);
+ }

__percpu_counter_add(&root->fs_info->delalloc_bytes, -len,
root->fs_info->delalloc_batch);
@@ -1796,6 +1842,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
u64 page_start;
u64 page_end;
int ret;
+ int flag = TYPE_ROT;

fixup = container_of(work, struct btrfs_writepage_fixup, work);
page = fixup->page;
@@ -1827,7 +1874,7 @@ again:
goto again;
}

- ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
+ ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE, &flag);
if (ret) {
mapping_set_error(page->mapping, ret);
end_extent_writepage(page, ret, page_start, page_end);
@@ -1835,6 +1882,10 @@ again:
goto out;
}

+ if (btrfs_test_opt(BTRFS_I(inode)->root, HOT_MOVE))
+ set_extent_hot(inode, page_start, page_end,
+ &cached_state, flag, 0);
+
btrfs_set_extent_delalloc(inode, page_start, page_end, &cached_state);
ClearPageChecked(page);
set_page_dirty(page);
@@ -4282,20 +4333,21 @@ int btrfs_truncate_page(struct inode *inode, loff_t from, loff_t len,
struct page *page;
gfp_t mask = btrfs_alloc_write_mask(mapping);
int ret = 0;
+ int flag = TYPE_ROT;
u64 page_start;
u64 page_end;

if ((offset & (blocksize - 1)) == 0 &&
(!len || ((len & (blocksize - 1)) == 0)))
goto out;
- ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
+ ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE, &flag);
if (ret)
goto out;

again:
page = find_or_create_page(mapping, index, mask);
if (!page) {
- btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+ btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE, flag);
ret = -ENOMEM;
goto out;
}
@@ -4337,6 +4389,10 @@ again:
EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
0, 0, &cached_state, GFP_NOFS);

+ if (btrfs_test_opt(root, HOT_MOVE))
+ set_extent_hot(inode, page_start, page_end,
+ &cached_state, flag, 0);
+
ret = btrfs_set_extent_delalloc(inode, page_start, page_end,
&cached_state);
if (ret) {
@@ -4363,7 +4419,7 @@ again:

out_unlock:
if (ret)
- btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+ btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE, flag);
unlock_page(page);
page_cache_release(page);
out:
@@ -7353,6 +7409,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
struct inode *inode = file->f_mapping->host;
size_t count = 0;
int flags = 0;
+ int flag = TYPE_ROT;
bool wakeup = true;
bool relock = false;
ssize_t ret;
@@ -7375,7 +7432,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
mutex_unlock(&inode->i_mutex);
relock = true;
}
- ret = btrfs_delalloc_reserve_space(inode, count);
+ ret = btrfs_delalloc_reserve_space(inode, count, &flag);
if (ret)
goto out;
} else if (unlikely(test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
@@ -7391,10 +7448,10 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
btrfs_submit_direct, flags);
if (rw & WRITE) {
if (ret < 0 && ret != -EIOCBQUEUED)
- btrfs_delalloc_release_space(inode, count);
+ btrfs_delalloc_release_space(inode, count, flag);
else if (ret >= 0 && (size_t)ret < count)
btrfs_delalloc_release_space(inode,
- count - (size_t)ret);
+ count - (size_t)ret, flag);
else
btrfs_delalloc_release_metadata(inode, 0);
}
@@ -7573,11 +7630,12 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
loff_t size;
int ret;
int reserved = 0;
+ int flag = TYPE_ROT;
u64 page_start;
u64 page_end;

sb_start_pagefault(inode->i_sb);
- ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
+ ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE, &flag);
if (!ret) {
ret = file_update_time(vma->vm_file);
reserved = 1;
@@ -7635,6 +7693,10 @@ again:
EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
0, 0, &cached_state, GFP_NOFS);

+ if (btrfs_test_opt(root, HOT_MOVE))
+ set_extent_hot(inode, page_start, page_end,
+ &cached_state, flag, 0);
+
ret = btrfs_set_extent_delalloc(inode, page_start, page_end,
&cached_state);
if (ret) {
@@ -7674,7 +7736,7 @@ out_unlock:
}
unlock_page(page);
out:
- btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+ btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE, flag);
out_noreserve:
sb_end_pagefault(inode->i_sb);
return ret;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0de4a2f..91da5ae 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -56,6 +56,7 @@
#include "rcu-string.h"
#include "send.h"
#include "dev-replace.h"
+#include "hot_relocate.h"

/* Mask out flags that are inappropriate for the given type of inode. */
static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
@@ -1001,6 +1002,7 @@ static int cluster_pages_for_defrag(struct inode *inode,
int ret;
int i;
int i_done;
+ int flag = TYPE_ROT;
struct btrfs_ordered_extent *ordered;
struct extent_state *cached_state = NULL;
struct extent_io_tree *tree;
@@ -1013,7 +1015,7 @@ static int cluster_pages_for_defrag(struct inode *inode,
page_cnt = min_t(u64, (u64)num_pages, (u64)file_end - start_index + 1);

ret = btrfs_delalloc_reserve_space(inode,
- page_cnt << PAGE_CACHE_SHIFT);
+ page_cnt << PAGE_CACHE_SHIFT, &flag);
if (ret)
return ret;
i_done = 0;
@@ -1101,9 +1103,12 @@ again:
BTRFS_I(inode)->outstanding_extents++;
spin_unlock(&BTRFS_I(inode)->lock);
btrfs_delalloc_release_space(inode,
- (page_cnt - i_done) << PAGE_CACHE_SHIFT);
+ (page_cnt - i_done) << PAGE_CACHE_SHIFT, flag);
}

+ if (btrfs_test_opt(BTRFS_I(inode)->root, HOT_MOVE))
+ set_extent_hot(inode, page_start, page_end - 1,
+ &cached_state, flag, 0);

set_extent_defrag(&BTRFS_I(inode)->io_tree, page_start, page_end - 1,
&cached_state, GFP_NOFS);
@@ -1126,7 +1131,8 @@ out:
unlock_page(pages[i]);
page_cache_release(pages[i]);
}
- btrfs_delalloc_release_space(inode, page_cnt << PAGE_CACHE_SHIFT);
+ btrfs_delalloc_release_space(inode,
+ page_cnt << PAGE_CACHE_SHIFT, flag);
return ret;

}
@@ -3021,8 +3027,9 @@ static long btrfs_ioctl_space_info(struct btrfs_root *root, void __user *arg)
u64 types[] = {BTRFS_BLOCK_GROUP_DATA,
BTRFS_BLOCK_GROUP_SYSTEM,
BTRFS_BLOCK_GROUP_METADATA,
- BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA};
- int num_types = 4;
+ BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA,
+ BTRFS_BLOCK_GROUP_DATA_NONROT};
+ int num_types = 5;
int alloc_size;
int ret = 0;
u64 slot_count = 0;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 704a1b8..62c5897 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -31,6 +31,7 @@
#include "async-thread.h"
#include "free-space-cache.h"
#include "inode-map.h"
+#include "hot_relocate.h"

/*
* backref_node, mapping_node and tree_block start with this
@@ -2938,12 +2939,13 @@ int prealloc_file_extent_cluster(struct inode *inode,
u64 num_bytes;
int nr = 0;
int ret = 0;
+ int flag = TYPE_ROT;

BUG_ON(cluster->start != cluster->boundary[0]);
mutex_lock(&inode->i_mutex);

ret = btrfs_check_data_free_space(inode, cluster->end +
- 1 - cluster->start);
+ 1 - cluster->start, &flag);
if (ret)
goto out;

@@ -2965,7 +2967,7 @@ int prealloc_file_extent_cluster(struct inode *inode,
nr++;
}
btrfs_free_reserved_data_space(inode, cluster->end +
- 1 - cluster->start);
+ 1 - cluster->start, flag);
out:
mutex_unlock(&inode->i_mutex);
return ret;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 09fb9d2..c10477b 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -58,6 +58,7 @@
#include "rcu-string.h"
#include "dev-replace.h"
#include "free-space-cache.h"
+#include "hot_relocate.h"

#define CREATE_TRACE_POINTS
#include <trace/events/btrfs.h>
@@ -1520,7 +1521,8 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
mutex_lock(&fs_info->chunk_mutex);
rcu_read_lock();
list_for_each_entry_rcu(found, head, list) {
- if (found->flags & BTRFS_BLOCK_GROUP_DATA) {
+ if ((found->flags & BTRFS_BLOCK_GROUP_DATA) ||
+ (found->flags & BTRFS_BLOCK_GROUP_DATA_NONROT)) {
total_free_data += found->disk_total - found->disk_used;
total_free_data -=
btrfs_account_ro_block_groups_free_space(found);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0e925ce..29e416d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1451,6 +1451,9 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
all_avail = root->fs_info->avail_data_alloc_bits |
root->fs_info->avail_system_alloc_bits |
root->fs_info->avail_metadata_alloc_bits;
+ if (btrfs_test_opt(root, HOT_MOVE))
+ all_avail |=
+ root->fs_info->avail_data_nonrot_alloc_bits;
} while (read_seqretry(&root->fs_info->profiles_lock, seq));

num_devices = root->fs_info->fs_devices->num_devices;
@@ -3729,7 +3732,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
devs_increment = btrfs_raid_array[index].devs_increment;
ncopies = btrfs_raid_array[index].ncopies;

- if (type & BTRFS_BLOCK_GROUP_DATA) {
+ if (type & BTRFS_BLOCK_GROUP_DATA ||
+ type & BTRFS_BLOCK_GROUP_DATA_NONROT) {
max_stripe_size = 1024 * 1024 * 1024;
max_chunk_size = 10 * max_stripe_size;
} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
@@ -3768,9 +3772,30 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
struct btrfs_device *device;
u64 max_avail;
u64 dev_offset;
+ int dev_rot;
+ int skip = 0;

device = list_entry(cur, struct btrfs_device, dev_alloc_list);

+ /*
+ * If HOT_MOVE is set, the chunk type being allocated
+ * determines which disks the data may be allocated on.
+ * This can cause problems if, for example, the data alloc
+ * profile is RAID0 and there are only two devices, 1 SSD +
+ * 1 HDD. All allocations to BTRFS_BLOCK_GROUP_DATA_NONROT
+ * in this config will return -ENOSPC as the allocation code
+ * can't find allowable space for the second stripe.
+ */
+ dev_rot = !blk_queue_nonrot(bdev_get_queue(device->bdev));
+ if (btrfs_test_opt(extent_root, HOT_MOVE)) {
+ int ret1 = type & (BTRFS_BLOCK_GROUP_DATA |
+ BTRFS_BLOCK_GROUP_METADATA |
+ BTRFS_BLOCK_GROUP_SYSTEM) && !dev_rot;
+ int ret2 = type & BTRFS_BLOCK_GROUP_DATA_NONROT && dev_rot;
+ if (ret1 || ret2)
+ skip = 1;
+ }
+
cur = cur->next;

if (!device->writeable) {
@@ -3779,7 +3804,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
continue;
}

- if (!device->in_fs_metadata ||
+ if (skip || !device->in_fs_metadata ||
device->is_tgtdev_for_dev_replace)
continue;

--
1.7.11.7

2013-05-20 15:10:55

by Zhi Yong Wu

Subject: [RFC PATCH v1 1/5] BTRFS hot reloc, vfs: add one list_head field

From: Zhi Yong Wu <[email protected]>

Add one list_head field 'reloc_list' to accommodate
hot relocation support.
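
For context, the new field is purely a queueing hook: it lets the
relocation code link a hot_comm_item onto a private work queue without
any extra allocation. A minimal kernel-context sketch of the intended
usage follows (it compiles only in-kernel); hot_comm_item and
reloc_list are real, while relocq and both functions are illustrative
only:

#include <linux/list.h>
#include <linux/hot_tracking.h>

static LIST_HEAD(relocq);

/* Illustrative: batch a hot item for a later relocation pass. */
static void queue_for_reloc(struct hot_comm_item *ci)
{
        /* self-linked (list_del_init'd) means "not yet queued" */
        if (list_empty(&ci->reloc_list))
                list_add_tail(&ci->reloc_list, &relocq);
}

/* Illustrative: drain the queue, relocating each item. */
static void drain_relocq(void)
{
        struct hot_comm_item *ci, *next;

        list_for_each_entry_safe(ci, next, &relocq, reloc_list) {
                list_del_init(&ci->reloc_list);
                /* ... relocate the range described by ci ... */
        }
}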

Signed-off-by: Zhi Yong Wu <[email protected]>
---
fs/hot_tracking.c | 1 +
include/linux/hot_tracking.h | 1 +
2 files changed, 2 insertions(+)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 46d2f7d..2a59b09 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -41,6 +41,7 @@ static void hot_comm_item_init(struct hot_comm_item *ci, int type)
clear_bit(HOT_IN_LIST, &ci->delete_flag);
clear_bit(HOT_DELETING, &ci->delete_flag);
INIT_LIST_HEAD(&ci->track_list);
+ INIT_LIST_HEAD(&ci->reloc_list);
memset(&ci->hot_freq_data, 0, sizeof(struct hot_freq_data));
ci->hot_freq_data.avg_delta_reads = (u64) -1;
ci->hot_freq_data.avg_delta_writes = (u64) -1;
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 008a5c1..faf1acc 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -74,6 +74,7 @@ struct hot_comm_item {
unsigned long delete_flag;
struct rcu_head c_rcu;
struct list_head track_list; /* link to *_map[] */
+ struct list_head reloc_list; /* used in hot relocation*/
};

/* An item representing an inode and its access frequency */
--
1.7.11.7

2013-05-20 15:11:48

by Zhi Yong Wu

Subject: [RFC PATCH v1 4/5] BTRFS hot reloc, procfs: add three proc interfaces

From: Zhi Yong Wu <[email protected]>

Add three proc interfaces, hot-reloc-interval, hot-reloc-threshold,
and hot-reloc-max-items, under /proc/sys/fs/ in order to make
HOT_RELOC_INTERVAL, HOT_RELOC_THRESHOLD, and HOT_RELOC_MAX_ITEMS
tunable.
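
For completeness, a minimal userspace sketch of driving these knobs
from C is shown below; it assumes a kernel with this patchset applied,
and the values written are just the defaults set in hot_relocate.c:

#include <stdio.h>

static int write_knob(const char *path, int val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fprintf(f, "%d\n", val);
        return fclose(f);
}

int main(void)
{
        /* values shown are the patchset's defaults */
        write_knob("/proc/sys/fs/hot-reloc-threshold", 150);
        write_knob("/proc/sys/fs/hot-reloc-interval", 120);
        write_knob("/proc/sys/fs/hot-reloc-max-items", 250);
        return 0;
}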

Signed-off-by: Zhi Yong Wu <[email protected]>
---
fs/btrfs/hot_relocate.c | 26 +++++++++++++++++---------
fs/btrfs/hot_relocate.h | 5 -----
include/linux/btrfs.h | 4 ++++
kernel/sysctl.c | 22 ++++++++++++++++++++++
4 files changed, 43 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/hot_relocate.c b/fs/btrfs/hot_relocate.c
index ae28b86..3a18555 100644
--- a/fs/btrfs/hot_relocate.c
+++ b/fs/btrfs/hot_relocate.c
@@ -25,7 +25,7 @@
* The relocation code below operates on the heat map lists to identify
* hot or cold data logical file ranges that are candidates for relocation.
* The triggering mechanism for relocation is controlled by a global heat
- * threshold integer value (HOT_RELOC_THRESHOLD). Ranges are
+ * threshold integer value (sysctl_hot_reloc_threshold). Ranges are
* queued for relocation by the periodically executing relocate kthread,
* which updates the global heat threshold and responds to space pressure
* on the nonrotating disks.
@@ -53,6 +53,15 @@
* (assuming, critically, the HOT_MOVE option is set at mount time).
*/

+int sysctl_hot_reloc_threshold = 150;
+EXPORT_SYMBOL_GPL(sysctl_hot_reloc_threshold);
+
+int sysctl_hot_reloc_interval __read_mostly = 120;
+EXPORT_SYMBOL_GPL(sysctl_hot_reloc_interval);
+
+int sysctl_hot_reloc_max_items __read_mostly = 250;
+EXPORT_SYMBOL_GPL(sysctl_hot_reloc_max_items);
+
/*
* Returns the ratio of nonrotating disks that are full.
* If no nonrotating disk is found, returns THRESH_MAX_VALUE + 1.
@@ -103,7 +112,7 @@ static int hot_calc_nonrot_ratio(struct hot_reloc *hot_reloc)
static int hot_update_threshold(struct hot_reloc *hot_reloc,
int update)
{
- int thresh = hot_reloc->thresh;
+ int thresh = sysctl_hot_reloc_threshold;
int ratio = hot_calc_nonrot_ratio(hot_reloc);

/* Sometimes update global threshold, others not */
@@ -127,7 +136,7 @@ static int hot_update_threshold(struct hot_reloc *hot_reloc,
thresh = 0;
}

- hot_reloc->thresh = thresh;
+ sysctl_hot_reloc_threshold = thresh;
return ratio;
}

@@ -215,7 +224,7 @@ static int hot_queue_extent(struct hot_reloc *hot_reloc,
*counter = *counter + 1;
}

- if (*counter >= HOT_RELOC_MAX_ITEMS)
+ if (*counter >= sysctl_hot_reloc_max_items)
break;

if (kthread_should_stop()) {
@@ -293,7 +302,7 @@ again:
while (1) {
lock_extent(tree, page_start, page_end);
ordered = btrfs_lookup_ordered_extent(inode,
- page_start);
+ page_start);
unlock_extent(tree, page_start, page_end);
if (!ordered)
break;
@@ -559,7 +568,7 @@ void hot_do_relocate(struct hot_reloc *hot_reloc)

run++;
ratio = hot_update_threshold(hot_reloc, !(run % 15));
- thresh = hot_reloc->thresh;
+ thresh = sysctl_hot_reloc_threshold;

INIT_LIST_HEAD(&hot_reloc->hot_relocq[TYPE_NONROT]);

@@ -569,7 +578,7 @@ void hot_do_relocate(struct hot_reloc *hot_reloc)
if (count_to_hot == 0)
return;

- count_to_cold = HOT_RELOC_MAX_ITEMS;
+ count_to_cold = sysctl_hot_reloc_max_items;

/* Don't move cold data to HDD unless there's space pressure */
if (ratio < HIGH_WATER_LEVEL)
@@ -653,7 +662,7 @@ static int hot_relocate_kthread(void *arg)
unsigned long delay;

do {
- delay = HZ * HOT_RELOC_INTERVAL;
+ delay = HZ * sysctl_hot_reloc_interval;
if (mutex_trylock(&hot_reloc->hot_reloc_mutex)) {
hot_do_relocate(hot_reloc);
mutex_unlock(&hot_reloc->hot_reloc_mutex);
@@ -685,7 +694,6 @@ int hot_relocate_init(struct btrfs_fs_info *fs_info)

fs_info->hot_reloc = hot_reloc;
hot_reloc->fs_info = fs_info;
- hot_reloc->thresh = HOT_RELOC_THRESHOLD;
for (i = 0; i < MAX_RELOC_TYPES; i++)
INIT_LIST_HEAD(&hot_reloc->hot_relocq[i]);
mutex_init(&hot_reloc->hot_reloc_mutex);
diff --git a/fs/btrfs/hot_relocate.h b/fs/btrfs/hot_relocate.h
index 1b1cfb5..94defe6 100644
--- a/fs/btrfs/hot_relocate.h
+++ b/fs/btrfs/hot_relocate.h
@@ -18,10 +18,6 @@
#include "btrfs_inode.h"
#include "volumes.h"

-#define HOT_RELOC_INTERVAL 120
-#define HOT_RELOC_THRESHOLD 150
-#define HOT_RELOC_MAX_ITEMS 250
-
#define HEAT_MAX_VALUE (MAP_SIZE - 1)
#define HIGH_WATER_LEVEL 75 /* when to raise the threshold */
#define LOW_WATER_LEVEL 50 /* when to lower the threshold */
@@ -32,7 +28,6 @@
struct hot_reloc {
struct btrfs_fs_info *fs_info;
struct list_head hot_relocq[MAX_RELOC_TYPES];
- int thresh;
struct task_struct *hot_reloc_kthread;
struct mutex hot_reloc_mutex;
};
diff --git a/include/linux/btrfs.h b/include/linux/btrfs.h
index 22d7991..7179819 100644
--- a/include/linux/btrfs.h
+++ b/include/linux/btrfs.h
@@ -3,4 +3,8 @@

#include <uapi/linux/btrfs.h>

+extern int sysctl_hot_reloc_threshold;
+extern int sysctl_hot_reloc_interval;
+extern int sysctl_hot_reloc_max_items;
+
#endif /* _LINUX_BTRFS_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6ee4338..3ab1a68 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -62,6 +62,7 @@
#include <linux/capability.h>
#include <linux/binfmts.h>
#include <linux/sched/sysctl.h>
+#include <linux/btrfs.h>

#include <asm/uaccess.h>
#include <asm/processor.h>
@@ -1630,6 +1631,27 @@ static struct ctl_table fs_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+ {
+ .procname = "hot-reloc-threshold",
+ .data = &sysctl_hot_reloc_threshold,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "hot-reloc-interval",
+ .data = &sysctl_hot_reloc_interval,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "hot-reloc-max-items",
+ .data = &sysctl_hot_reloc_max_items,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
{ }
};

--
1.7.11.7

2013-05-20 15:21:01

by Zhi Yong Wu

Subject: [RFC PATCH v1 3/5] BTRFS hot reloc: add one hot reloc thread

From: Zhi Yong Wu <[email protected]>

Add one private thread for hot relocation. It first checks
whether there are any extents hotter than the threshold and
queues them; if there are none, it returns and waits for its
next turn. Otherwise, it checks whether the nonrotating disk
usage ratio is beyond its threshold; if not, it directly
relocates the hot extents from the rotating disk to the
nonrotating disk. Otherwise, it first finds the extents with
low temperature, queues them, and relocates them back to the
rotating disk, and finally relocates the hot extents from the
rotating disk to the nonrotating disk.
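
The thread also adapts the global heat threshold to how full the
nonrotating disks are. A runnable userspace simulation of that
water-level logic from hot_update_threshold() is sketched below;
HIGH_WATER_LEVEL and LOW_WATER_LEVEL come from hot_relocate.h in this
patch, while the values chosen here for THRESH_MAX_VALUE,
HEAT_MAX_VALUE, and the UP/DOWN speeds are assumptions for
illustration only:

#include <stdio.h>

#define HIGH_WATER_LEVEL  75    /* from hot_relocate.h */
#define LOW_WATER_LEVEL   50    /* from hot_relocate.h */
#define THRESH_MAX_VALUE  100   /* assumed */
#define HEAT_MAX_VALUE    255   /* assumed: MAP_SIZE - 1 */
#define THRESH_UP_SPEED   10    /* assumed */
#define THRESH_DOWN_SPEED 10    /* assumed */

static int update_threshold(int thresh, int ratio)
{
        if (ratio > THRESH_MAX_VALUE)           /* no nonrot disk found */
                return HEAT_MAX_VALUE + 1;      /* disable relocation */
        if (ratio >= HIGH_WATER_LEVEL)
                thresh += THRESH_UP_SPEED;      /* SSD under pressure */
        else if (ratio <= LOW_WATER_LEVEL)
                thresh -= THRESH_DOWN_SPEED;    /* SSD has room */
        if (thresh > HEAT_MAX_VALUE)
                thresh = HEAT_MAX_VALUE + 1;
        else if (thresh < 0)
                thresh = 0;
        return thresh;
}

int main(void)
{
        int thresh = 150;
        int ratios[] = { 80, 90, 40, 30 };

        for (int i = 0; i < 4; i++) {
                thresh = update_threshold(thresh, ratios[i]);
                printf("ratio=%d -> thresh=%d\n", ratios[i], thresh);
        }
        return 0;
}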

Signed-off-by: Zhi Yong Wu <[email protected]>
---
fs/btrfs/Makefile | 3 +-
fs/btrfs/ctree.h | 2 +
fs/btrfs/hot_relocate.c | 713 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/hot_relocate.h | 43 +++
4 files changed, 760 insertions(+), 1 deletion(-)
create mode 100644 fs/btrfs/hot_relocate.c
create mode 100644 fs/btrfs/hot_relocate.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 3932224..94f1ea5 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -8,7 +8,8 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
export.o tree-log.o free-space-cache.o zlib.o lzo.o \
compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
- reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o
+ reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
+ hot_relocate.o

btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f7a3170..6c547ca 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1636,6 +1636,8 @@ struct btrfs_fs_info {
struct btrfs_dev_replace dev_replace;

atomic_t mutually_exclusive_operation_running;
+
+ void *hot_reloc;
};

/*
diff --git a/fs/btrfs/hot_relocate.c b/fs/btrfs/hot_relocate.c
new file mode 100644
index 0000000..ae28b86
--- /dev/null
+++ b/fs/btrfs/hot_relocate.c
@@ -0,0 +1,713 @@
+/*
+ * fs/btrfs/hot_relocate.c
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <[email protected]>
+ * Ben Chociej <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/writeback.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+#include <linux/module.h>
+#include "hot_relocate.h"
+
+/*
+ * Hot relocation strategy:
+ *
+ * The relocation code below operates on the heat map lists to identify
+ * hot or cold data logical file ranges that are candidates for relocation.
+ * The triggering mechanism for relocation is controlled by a global heat
+ * threshold integer value (HOT_RELOC_THRESHOLD). Ranges are
+ * queued for relocation by the periodically executing relocate kthread,
+ * which updates the global heat threshold and responds to space pressure
+ * on the nonrotating disks.
+ *
+ * The heat map lists index logical ranges by heat and provide a constant-time
+ * access path to hot or cold range items. The relocation kthread uses this
+ * path to find hot or cold items to move to/from nonrotating disks. To ensure
+ * that the relocation kthread has a chance to sleep, and to prevent thrashing
+ * between nonrotating disks and HDD, there is a configurable limit to how many
+ * ranges are moved per iteration of the kthread. This limit may be overrun in
+ * the case where space pressure requires that items be aggressively moved from
+ * nonrotating disks back to HDD.
+ *
+ * This still needs more resistance to thrashing and stronger (read: actual)
+ * guarantees that relocation operations won't -ENOSPC.
+ *
+ * The relocation code has introduced one new btrfs block group type:
+ * BTRFS_BLOCK_GROUP_DATA_NONROT.
+ *
+ * When mkfs'ing a volume with the hot data relocation option, initial block
+ * groups are allocated to the proper disks. Runtime block group allocation
+ * only allocates BTRFS_BLOCK_GROUP_DATA, BTRFS_BLOCK_GROUP_METADATA and
+ * BTRFS_BLOCK_GROUP_SYSTEM to HDD, and likewise only allocates
+ * BTRFS_BLOCK_GROUP_DATA_NONROT to nonrotating disks.
+ * (assuming, critically, the HOT_MOVE option is set at mount time).
+ */
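+
+/*
+ * A sketch of that allocation rule (the allocator change itself lives
+ * in another patch of this series; the pseudocode below is only an
+ * illustration, not the literal code):
+ *
+ * if (btrfs_test_opt(root, HOT_MOVE) &&
+ *     (flags & BTRFS_BLOCK_GROUP_DATA_NONROT))
+ *         allocate the chunk on nonrotating devices;
+ * else
+ *         allocate the chunk on rotating devices;
+ */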
+
+/*
+ * Returns the percentage of nonrotating disk space that is used.
+ * If no nonrotating disk is found, returns THRESH_MAX_VALUE + 1.
+ */
+static int hot_calc_nonrot_ratio(struct hot_reloc *hot_reloc)
+{
+ struct btrfs_space_info *info;
+ struct btrfs_device *device, *next;
+ struct btrfs_fs_info *fs_info = hot_reloc->fs_info;
+ u64 total_bytes = 0, bytes_used = 0;
+
+ /*
+ * Iterate through the devices; if they're nonrot,
+ * add their bytes to the total_bytes.
+ */
+ mutex_lock(&fs_info->fs_devices->device_list_mutex);
+ list_for_each_entry_safe(device, next,
+ &fs_info->fs_devices->devices, dev_list) {
+ if (blk_queue_nonrot(bdev_get_queue(device->bdev)))
+ total_bytes += device->total_bytes;
+ }
+ mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+
+ if (total_bytes == 0)
+ return THRESH_MAX_VALUE + 1;
+
+ /*
+ * Iterate through space_info. If a nonrot data block group
+ * is found, add the bytes used by that group to bytes_used.
+ */
+ rcu_read_lock();
+ list_for_each_entry_rcu(info, &fs_info->space_info, list) {
+ if (info->flags & BTRFS_BLOCK_GROUP_DATA_NONROT)
+ bytes_used += info->bytes_used;
+ }
+ rcu_read_unlock();
+
+ /* Finish up, return ratio of nonrotating disks filled. */
+ BUG_ON(bytes_used > total_bytes);
+
+ return (int) div64_u64(bytes_used * 100, total_bytes);
+}
+
+/*
+ * Update heat threshold for hot relocation
+ * based on how full nonrotating disks are.
+ */
+static int hot_update_threshold(struct hot_reloc *hot_reloc,
+ int update)
+{
+ int thresh = hot_reloc->thresh;
+ int ratio = hot_calc_nonrot_ratio(hot_reloc);
+
+ /* Only update the global threshold when forced or under space pressure */
+ if (!update && ratio < HIGH_WATER_LEVEL)
+ return ratio;
+
+ if (unlikely(ratio > THRESH_MAX_VALUE))
+ thresh = HEAT_MAX_VALUE + 1;
+ else {
+ WARN_ON(HIGH_WATER_LEVEL > THRESH_MAX_VALUE
+ || LOW_WATER_LEVEL < 0);
+
+ if (ratio >= HIGH_WATER_LEVEL)
+ thresh += THRESH_UP_SPEED;
+ else if (ratio <= LOW_WATER_LEVEL)
+ thresh -= THRESH_DOWN_SPEED;
+
+ if (thresh > HEAT_MAX_VALUE)
+ thresh = HEAT_MAX_VALUE + 1;
+ else if (thresh < 0)
+ thresh = 0;
+ }
+
+ hot_reloc->thresh = thresh;
+ return ratio;
+}
+
+static bool hot_can_relocate(struct inode *inode, u64 start,
+ u64 len, u64 *skip, u64 *end)
+{
+ struct extent_map *em = NULL;
+ struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+ struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+ bool ret = true;
+
+ /*
+ * Make sure that once we start relocating an extent,
+ * we keep on relocating it
+ */
+ if (start < *end)
+ return true;
+
+ *skip = 0;
+
+ /*
+ * Hopefully we have this extent in the tree already,
+ * try without the full extent lock
+ */
+ read_lock(&em_tree->lock);
+ em = lookup_extent_mapping(em_tree, start, len);
+ read_unlock(&em_tree->lock);
+ if (!em) {
+ /* Get the big lock and read metadata off disk */
+ lock_extent(io_tree, start, start + len - 1);
+ em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
+ unlock_extent(io_tree, start, start + len - 1);
+ if (IS_ERR(em))
+ return false;
+ }
+
+ /* This will cover holes, and inline extents */
+ if (em->block_start >= EXTENT_MAP_LAST_BYTE)
+ ret = false;
+
+ if (ret) {
+ *end = extent_map_end(em);
+ } else {
+ *skip = extent_map_end(em);
+ *end = 0;
+ }
+
+ free_extent_map(em);
+ return ret;
+}
+
+static void hot_cleanup_relocq(struct list_head *bucket)
+{
+ struct hot_range_item *hr;
+ struct hot_comm_item *ci, *ci_next;
+
+ list_for_each_entry_safe(ci, ci_next, bucket, reloc_list) {
+ hr = container_of(ci, struct hot_range_item, hot_range);
+ list_del_init(&hr->hot_range.reloc_list);
+ hot_comm_item_put(ci);
+ }
+}
+
+static int hot_queue_extent(struct hot_reloc *hot_reloc,
+ struct list_head *bucket,
+ u64 *counter, int storage_type)
+{
+ struct hot_comm_item *ci;
+ struct hot_range_item *hr;
+ int st, ret = 0;
+
+ /* Queue hot_ranges */
+ list_for_each_entry_rcu(ci, bucket, track_list) {
+ if (test_bit(HOT_DELETING, &ci->delete_flag))
+ continue;
+
+ /* Queue up on relocate list */
+ hr = container_of(ci, struct hot_range_item, hot_range);
+ st = hr->storage_type;
+ if (st != storage_type) {
+ list_del_init(&ci->reloc_list);
+ list_add_tail(&ci->reloc_list,
+ &hot_reloc->hot_relocq[storage_type]);
+ hot_comm_item_get(ci);
+ *counter = *counter + 1;
+ }
+
+ if (*counter >= HOT_RELOC_MAX_ITEMS)
+ break;
+
+ if (kthread_should_stop()) {
+ ret = 1;
+ break;
+ }
+ }
+
+ return ret;
+}
+
+static u64 hot_search_extent(struct hot_reloc *hot_reloc,
+ int thresh, int storage_type)
+{
+ struct hot_info *root;
+ u64 counter = 0;
+ int i, ret = 0;
+
+ root = hot_reloc->fs_info->sb->s_hot_root;
+ for (i = HEAT_MAX_VALUE; i >= thresh; i--) {
+ rcu_read_lock();
+ if (!list_empty(&root->hot_map[TYPE_RANGE][i]))
+ ret = hot_queue_extent(hot_reloc,
+ &root->hot_map[TYPE_RANGE][i],
+ &counter, storage_type);
+ rcu_read_unlock();
+ if (ret) {
+ counter = 0;
+ break;
+ }
+ }
+
+ if (ret)
+ hot_cleanup_relocq(&hot_reloc->hot_relocq[storage_type]);
+
+ return counter;
+}
+
+static int hot_load_file_extent(struct inode *inode,
+ struct page **pages,
+ unsigned long start_index,
+ int num_pages, int storage_type)
+{
+ unsigned long file_end;
+ int ret, i, i_done;
+ u64 isize = i_size_read(inode), page_start, page_end, page_cnt;
+ struct btrfs_ordered_extent *ordered;
+ struct extent_state *cached_state = NULL;
+ struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+ gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
+
+ file_end = (isize - 1) >> PAGE_CACHE_SHIFT;
+ if (!isize || start_index > file_end)
+ return 0;
+
+ page_cnt = min_t(u64, (u64)num_pages, (u64)file_end - start_index + 1);
+
+ ret = btrfs_delalloc_reserve_space(inode,
+ page_cnt << PAGE_CACHE_SHIFT, &storage_type);
+ if (ret)
+ return ret;
+
+ i_done = 0;
+ /* step one, lock all the pages */
+ for (i = 0; i < page_cnt; i++) {
+ struct page *page;
+again:
+ page = find_or_create_page(inode->i_mapping,
+ start_index + i, mask);
+ if (!page)
+ break;
+
+ page_start = page_offset(page);
+ page_end = page_start + PAGE_CACHE_SIZE - 1;
+ while (1) {
+ lock_extent(tree, page_start, page_end);
+ ordered = btrfs_lookup_ordered_extent(inode,
+ page_start);
+ unlock_extent(tree, page_start, page_end);
+ if (!ordered)
+ break;
+
+ unlock_page(page);
+ btrfs_start_ordered_extent(inode, ordered, 1);
+ btrfs_put_ordered_extent(ordered);
+ lock_page(page);
+ /*
+ * we unlocked the page above, so we need to check if
+ * it was released or not.
+ */
+ if (page->mapping != inode->i_mapping) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto again;
+ }
+ }
+
+ if (!PageUptodate(page)) {
+ btrfs_readpage(NULL, page);
+ lock_page(page);
+ if (!PageUptodate(page)) {
+ unlock_page(page);
+ page_cache_release(page);
+ ret = -EIO;
+ break;
+ }
+ }
+
+ if (page->mapping != inode->i_mapping) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto again;
+ }
+
+ pages[i] = page;
+ i_done++;
+ }
+ if (!i_done || ret)
+ goto out;
+
+ if (!(inode->i_sb->s_flags & MS_ACTIVE))
+ goto out;
+
+ page_start = page_offset(pages[0]);
+ page_end = page_offset(pages[i_done - 1]) + PAGE_CACHE_SIZE - 1;
+
+ lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
+
+ if (i_done != page_cnt) {
+ spin_lock(&BTRFS_I(inode)->lock);
+ BTRFS_I(inode)->outstanding_extents++;
+ spin_unlock(&BTRFS_I(inode)->lock);
+
+ btrfs_delalloc_release_space(inode,
+ (page_cnt - i_done) << PAGE_CACHE_SHIFT,
+ storage_type);
+ }
+
+ set_extent_hot(inode, page_start, page_end,
+ &cached_state, storage_type, 1);
+ unlock_extent_cached(tree, page_start, page_end,
+ &cached_state, GFP_NOFS);
+
+ for (i = 0; i < i_done; i++) {
+ clear_page_dirty_for_io(pages[i]);
+ ClearPageChecked(pages[i]);
+ set_page_extent_mapped(pages[i]);
+ set_page_dirty(pages[i]);
+ unlock_page(pages[i]);
+ page_cache_release(pages[i]);
+ }
+
+ /*
+ * so now we have a nice long stream of locked
+ * and up-to-date pages, let's wait on them
+ */
+ for (i = 0; i < i_done; i++)
+ wait_on_page_writeback(pages[i]);
+
+ return i_done;
+out:
+ for (i = 0; i < i_done; i++) {
+ unlock_page(pages[i]);
+ page_cache_release(pages[i]);
+ }
+
+ btrfs_delalloc_release_space(inode,
+ page_cnt << PAGE_CACHE_SHIFT,
+ storage_type);
+
+ return ret;
+}
+
+/*
+ * Relocate data to SSD or spinning drive based on its past location:
+ * load the file range into the page cache and mark the pages dirty.
+ *
+ * Based on the defrag ioctl.
+ */
+static int hot_relocate_extent(struct hot_range_item *hr,
+ struct hot_reloc *hot_reloc,
+ int storage_type)
+{
+ struct btrfs_root *root = hot_reloc->fs_info->fs_root;
+ struct inode *inode;
+ struct file_ra_state *ra = NULL;
+ struct btrfs_key key;
+ u64 isize, last_len = 0, skip = 0, end = 0;
+ unsigned long i, last, ra_index = 0;
+ int ret = -ENOENT, count = 0, new = 0;
+ int max_cluster = (256 * 1024) >> PAGE_CACHE_SHIFT;
+ int cluster = max_cluster;
+ struct page **pages = NULL;
+
+ key.objectid = hr->hot_inode->i_ino;
+ key.type = BTRFS_INODE_ITEM_KEY;
+ key.offset = 0;
+ inode = btrfs_iget(root->fs_info->sb, &key, root, &new);
+ if (IS_ERR(inode))
+ goto out;
+ else if (is_bad_inode(inode))
+ goto out_inode;
+
+ isize = i_size_read(inode);
+ if (isize == 0) {
+ ret = 0;
+ goto out_inode;
+ }
+
+ ra = kzalloc(sizeof(*ra), GFP_NOFS);
+ if (!ra) {
+ ret = -ENOMEM;
+ goto out_inode;
+ } else {
+ file_ra_state_init(ra, inode->i_mapping);
+ }
+
+ pages = kmalloc(sizeof(struct page *) * max_cluster,
+ GFP_NOFS);
+ if (!pages) {
+ ret = -ENOMEM;
+ goto out_ra;
+ }
+
+ /* find the last page */
+ if (hr->start + hr->len > hr->start) {
+ last = min_t(u64, isize - 1,
+ hr->start + hr->len - 1) >> PAGE_CACHE_SHIFT;
+ } else {
+ last = (isize - 1) >> PAGE_CACHE_SHIFT;
+ }
+
+ i = hr->start >> PAGE_CACHE_SHIFT;
+
+ /*
+ * make writeback start from i, so the range can be
+ * written sequentially.
+ */
+ if (i < inode->i_mapping->writeback_index)
+ inode->i_mapping->writeback_index = i;
+
+ while (i <= last && count < last + 1 &&
+ (i < (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+ PAGE_CACHE_SHIFT)) {
+ /*
+ * make sure we stop running if someone unmounts
+ * the FS
+ */
+ if (!(inode->i_sb->s_flags & MS_ACTIVE))
+ break;
+
+ if (signal_pending(current)) {
+ printk(KERN_DEBUG "btrfs: hot relocation cancelled\n");
+ break;
+ }
+
+ if (!hot_can_relocate(inode, (u64)i << PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE, &skip, &end)) {
+ unsigned long next;
+ /*
+ * the function tells us how much to skip;
+ * bump our counter by the suggested amount
+ */
+ next = (skip + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+ i = max(i + 1, next);
+ continue;
+ }
+
+ cluster = (PAGE_CACHE_ALIGN(end) >> PAGE_CACHE_SHIFT) - i;
+ cluster = min(cluster, max_cluster);
+
+ if (i + cluster > ra_index) {
+ ra_index = max(i, ra_index);
+ btrfs_force_ra(inode->i_mapping, ra, NULL, ra_index,
+ cluster);
+ ra_index += max_cluster;
+ }
+
+ mutex_lock(&inode->i_mutex);
+ ret = hot_load_file_extent(inode, pages,
+ i, cluster, storage_type);
+ if (ret < 0) {
+ mutex_unlock(&inode->i_mutex);
+ goto out_ra;
+ }
+
+ count += ret;
+ balance_dirty_pages_ratelimited(inode->i_mapping);
+ mutex_unlock(&inode->i_mutex);
+
+ if (ret > 0) {
+ i += ret;
+ last_len += ret << PAGE_CACHE_SHIFT;
+ } else {
+ i++;
+ last_len = 0;
+ }
+ }
+
+ ret = count;
+ if (ret > 0)
+ hr->storage_type = storage_type;
+
+out_ra:
+ kfree(ra);
+ kfree(pages);
+out_inode:
+ iput(inode);
+out:
+ list_del_init(&hr->hot_range.reloc_list);
+
+ hot_comm_item_put(&hr->hot_range);
+
+ return ret;
+}
+
+/*
+ * Main function iterates through the heat map table and
+ * finds hot and cold data to move based on SSD pressure.
+ *
+ * It first iterates through cold items below the heat
+ * threshold; if an item is on SSD and is now cold,
+ * it is queued up for relocation back to spinning disk.
+ * After scanning these items, we call the relocation code
+ * on all ranges that have been queued up for moving
+ * to HDD.
+ *
+ * We then iterate through items above the heat threshold
+ * and if they are on HDD we queue them up to be moved to
+ * SSD. We then iterate through the queue and move hot ranges
+ * to SSD if they are not already there.
+ */
+void hot_do_relocate(struct hot_reloc *hot_reloc)
+{
+ struct hot_info *root;
+ struct hot_range_item *hr;
+ struct hot_comm_item *ci, *ci_next;
+ int i, ret = 0, thresh, ratio = 0;
+ u64 count, count_to_cold, count_to_hot;
+ static u32 run = 1;
+
+ run++;
+ ratio = hot_update_threshold(hot_reloc, !(run % 15));
+ thresh = hot_reloc->thresh;
+
+ INIT_LIST_HEAD(&hot_reloc->hot_relocq[TYPE_NONROT]);
+
+ /* Check and queue hot extents */
+ count_to_hot = hot_search_extent(hot_reloc,
+ thresh, TYPE_NONROT);
+ if (count_to_hot == 0)
+ return;
+
+ count_to_cold = HOT_RELOC_MAX_ITEMS;
+
+ /* Don't move cold data to HDD unless there's space pressure */
+ if (ratio < HIGH_WATER_LEVEL)
+ goto do_hot_reloc;
+
+ INIT_LIST_HEAD(&hot_reloc->hot_relocq[TYPE_ROT]);
+
+ /*
+ * Move up to HOT_RELOC_MAX_ITEMS cold ranges back to spinning
+ * disk. First, queue up items to move on the hot_relocq[TYPE_ROT].
+ */
+ root = hot_reloc->fs_info->sb->s_hot_root;
+ for (count = 0, count_to_cold = 0; (count < thresh) &&
+ (count_to_cold < count_to_hot); count++) {
+ rcu_read_lock();
+ if (!list_empty(&root->hot_map[TYPE_RANGE][count]))
+ ret = hot_queue_extent(hot_reloc,
+ &root->hot_map[TYPE_RANGE][count],
+ &count_to_cold, TYPE_ROT);
+ rcu_read_unlock();
+ if (ret)
+ goto relocq_clean;
+ }
+
+ /* Do the hot -> cold relocation */
+ count_to_cold = 0;
+ list_for_each_entry_safe(ci, ci_next,
+ &hot_reloc->hot_relocq[TYPE_ROT], reloc_list) {
+ hr = container_of(ci, struct hot_range_item, hot_range);
+ ret = hot_relocate_extent(hr, hot_reloc, TYPE_ROT);
+ if ((ret == -ENOSPC) || (ret == -ENOMEM) ||
+ kthread_should_stop())
+ goto relocq_clean;
+ else if (ret > 0)
+ count_to_cold++;
+ }
+
+ /*
+ * Move up to HOT_RELOC_MAX_ITEMS ranges to SSD. Periodically check
+ * for space pressure on SSD and directly return if we've exceeded
+ * the SSD capacity high water mark.
+ * First, queue up items to move on hot_relocq[TYPE_NONROT].
+ */
+do_hot_reloc:
+ /* Do the cold -> hot relocation */
+ count_to_hot = 0;
+ list_for_each_entry_safe(ci, ci_next,
+ &hot_reloc->hot_relocq[TYPE_NONROT], reloc_list) {
+ if (count_to_hot >= count_to_cold)
+ goto relocq_clean;
+ hr = container_of(ci, struct hot_range_item, hot_range);
+ ret = hot_relocate_extent(hr, hot_reloc, TYPE_NONROT);
+ if ((ret == -ENOSPC) || (ret == -ENOMEM) ||
+ kthread_should_stop())
+ goto relocq_clean;
+ else if (ret > 0)
+ count_to_hot++;
+
+ /*
+ * If we've exceeded the SSD capacity high water mark,
+ * directly return.
+ */
+ if ((count_to_hot != 0) && count_to_hot % 30 == 0) {
+ ratio = hot_update_threshold(hot_reloc, 1);
+ if (ratio >= HIGH_WATER_LEVEL)
+ goto relocq_clean;
+ }
+ }
+
+ return;
+
+relocq_clean:
+ for (i = 0; i < MAX_RELOC_TYPES; i++)
+ hot_cleanup_relocq(&hot_reloc->hot_relocq[i]);
+}
+
+/* Main loop for the relocation thread */
+static int hot_relocate_kthread(void *arg)
+{
+ struct hot_reloc *hot_reloc = arg;
+ unsigned long delay;
+
+ do {
+ delay = HZ * HOT_RELOC_INTERVAL;
+ if (mutex_trylock(&hot_reloc->hot_reloc_mutex)) {
+ hot_do_relocate(hot_reloc);
+ mutex_unlock(&hot_reloc->hot_reloc_mutex);
+ }
+
+ if (!try_to_freeze()) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (!kthread_should_stop())
+ schedule_timeout(delay);
+ __set_current_state(TASK_RUNNING);
+ }
+ } while (!kthread_should_stop());
+
+ return 0;
+}
+
+/* Kick off the relocation kthread */
+int hot_relocate_init(struct btrfs_fs_info *fs_info)
+{
+ int i, ret = 0;
+ struct hot_reloc *hot_reloc;
+
+ hot_reloc = kzalloc(sizeof(*hot_reloc), GFP_NOFS);
+ if (!hot_reloc) {
+ printk(KERN_ERR "%s: Failed to allocate memory for "
+ "hot_reloc\n", __func__);
+ return -ENOMEM;
+ }
+
+ fs_info->hot_reloc = hot_reloc;
+ hot_reloc->fs_info = fs_info;
+ hot_reloc->thresh = HOT_RELOC_THRESHOLD;
+ for (i = 0; i < MAX_RELOC_TYPES; i++)
+ INIT_LIST_HEAD(&hot_reloc->hot_relocq[i]);
+ mutex_init(&hot_reloc->hot_reloc_mutex);
+
+ hot_reloc->hot_reloc_kthread = kthread_run(hot_relocate_kthread,
+ hot_reloc, "hot_relocate_kthread");
+ if (IS_ERR(hot_reloc->hot_reloc_kthread)) {
+ ret = PTR_ERR(hot_reloc->hot_reloc_kthread);
+ fs_info->hot_reloc = NULL;
+ kfree(hot_reloc);
+ }
+
+ return ret;
+}
+
+void hot_relocate_exit(struct btrfs_fs_info *fs_info)
+{
+ struct hot_reloc *hot_reloc = fs_info->hot_reloc;
+
+ if (hot_reloc->hot_reloc_kthread)
+ kthread_stop(hot_reloc->hot_reloc_kthread);
+
+ kfree(hot_reloc);
+ fs_info->hot_reloc = NULL;
+}
diff --git a/fs/btrfs/hot_relocate.h b/fs/btrfs/hot_relocate.h
new file mode 100644
index 0000000..1b1cfb5
--- /dev/null
+++ b/fs/btrfs/hot_relocate.h
@@ -0,0 +1,43 @@
+/*
+ * fs/btrfs/hot_relocate.h
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <[email protected]>
+ * Ben Chociej <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#ifndef __HOT_RELOCATE__
+#define __HOT_RELOCATE__
+
+#include <linux/hot_tracking.h>
+#include "ctree.h"
+#include "btrfs_inode.h"
+#include "volumes.h"
+
+#define HOT_RELOC_INTERVAL 120
+#define HOT_RELOC_THRESHOLD 150
+#define HOT_RELOC_MAX_ITEMS 250
+
+#define HEAT_MAX_VALUE (MAP_SIZE - 1)
+#define HIGH_WATER_LEVEL 75 /* when to raise the threshold */
+#define LOW_WATER_LEVEL 50 /* when to lower the threshold */
+#define THRESH_UP_SPEED 10 /* how much to raise it by */
+#define THRESH_DOWN_SPEED 1 /* how much to lower it by */
+#define THRESH_MAX_VALUE 100
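+
+/*
+ * Worked example with the values above: whenever the nonrot usage
+ * ratio is at or above HIGH_WATER_LEVEL (75%), each update raises
+ * thresh by THRESH_UP_SPEED (10); once usage falls to
+ * LOW_WATER_LEVEL (50%) or below, the periodic updates lower thresh
+ * by THRESH_DOWN_SPEED (1) each, so the threshold backs off slowly.
+ */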
+
+struct hot_reloc {
+ struct btrfs_fs_info *fs_info;
+ struct list_head hot_relocq[MAX_RELOC_TYPES];
+ int thresh;
+ struct task_struct *hot_reloc_kthread;
+ struct mutex hot_reloc_mutex;
+};
+
+int hot_relocate_init(struct btrfs_fs_info *fs_info);
+void hot_relocate_exit(struct btrfs_fs_info *fs_info);
+
+#endif /* __HOT_RELOCATE__ */
--
1.7.11.7

2013-05-21 02:25:07

by Duncan

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/5] BTRFS hot relocation support

zwu.kernel posted on Mon, 20 May 2013 23:11:22 +0800 as excerpted:

> The patchset is trying to introduce hot relocation support
> for BTRFS. In hybrid storage environment, when the data in rotating disk
> get hot, it can be relocated to nonrotating disk by BTRFS hot relocation
> support automatically; also, if nonrotating disk ratio exceed its upper
> threshold, the data which get cold can be looked up and relocated to
> rotating disk to make more space in nonrotating disk at first, and then
> the data which get hot will be relocated to nonrotating disk
> automatically.

One advantage of a filesystem implementation, as opposed to bcache or
dmcache, is arguably a corner-case, but it's /my/ corner-case, so...

I run an initr*-less (I guess technically, empty initramfs) monolithic-
kernel boot, using the kernel commandline root= and (formerly) md= and
related logic to choose/assemble/mount root directly from the kernel
command line via bootloader (grub2). Thus, any user-space-required-to-
mount-root is out, since I don't have an initr* and thus no early
userspace. That means both lvm2 and dmcache (AFAIK) are out. I'm not
sure about bcache, but it has other negatives, particularly against btrfs-
raid-1 and I'd guess md/raid-1 as well.

Much like md before it, btrfs, while normally requiring the user-space-
required device-scan to properly handle multiple devices, has kernel-
command-line options that allow direct kernel multi-device assembly
without the help of early-userspace/initr*.
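
For example, something along these lines (with placeholder device
names; whether this patchset's hot_move mount option can be passed
thru rootflags as well is my assumption, not something I've tested):

root=/dev/sda2 rootfstype=btrfs rootflags=device=/dev/sda2,device=/dev/sdb2,hot_move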

So in-btrfs hot-relocation support building on the existing kernel-
command-line multi-device assembly options would definitely be welcomed
by all us no-initr* folks looking at SSDs but not able to afford them
for /everything/ just yet. =:^)

(That said, even if accepted, your solution's a bit late for my own
current needs, but there's surely going to be others hitting the same
issue in a few kernel cycles when your patches could be mainline btrfs,
and having the option at my next upgrade cycle say a couple years out
would be very nice, indeed. =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

2013-05-29 00:38:20

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/5] BTRFS hot relocation support

On Tue, May 21, 2013 at 02:22:34AM +0000, Duncan wrote:
> zwu.kernel posted on Mon, 20 May 2013 23:11:22 +0800 as excerpted:
>
> > The patchset is trying to introduce hot relocation support
> > for BTRFS. In hybrid storage environment, when the data in rotating disk
> > get hot, it can be relocated to nonrotating disk by BTRFS hot relocation
> > support automatically; also, if nonrotating disk ratio exceed its upper
> > threshold, the data which get cold can be looked up and relocated to
> > rotating disk to make more space in nonrotating disk at first, and then
> > the data which get hot will be relocated to nonrotating disk
> > automatically.
>
> One advantage of a filesystem implementation, as opposed to bcache or
> dmcache, is arguably a corner-case, but it's /my/ corner-case, so...
>
> I run an intr*-less (I guess technically, empty initramfs) monolithic-
> kernel boot, using the kernel commandline root= and (formerly) md= and
> related logic to choose/assemble/mount root directly from the kernel
> command line via bootloader (grub2). Thus, any user-space-required-to-
> mount-root is out, since I don't have an initr* and thus no early
> userspace. That means both lvm2 and dmcache (AFAIK) are out. I'm not
> sure about bcache, but it has other negatives, particularly against btrfs-
> raid-1 and I'd guess md/raid-1 as well.
>
> Much like md before it, btrfs, while normally requiring the user-space-
> required device-scan to properly handle multiple devices, has kernel-
> command-line options that allow direct kernel multi-device assembly
> without the help of early-userspace/initr*.

I wouldn't be averse to adding such functionality to bcache, provided it
could be done reasonably cleanly/sensibly. It's not high on my list but
I'd accept patches :)

2013-05-29 01:42:52

by Duncan

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/5] BTRFS hot relocation support

Kent Overstreet posted on Tue, 28 May 2013 17:38:15 -0700 as excerpted:

> On Tue, May 21, 2013 at 02:22:34AM +0000, Duncan wrote:
>> zwu.kernel posted on Mon, 20 May 2013 23:11:22 +0800 as excerpted:
>>
>> > The patchset is trying to introduce hot relocation support for BTRFS.
>>
>> One advantage of a filesystem implementation, as opposed to bcache or
>> dmcache, is arguably a corner-case, but it's /my/ corner-case, so...
>>
>> I run an initr*-less (I guess technically, empty initramfs) monolithic-
>> kernel boot [so] any user-space-required-to- mount-root is out [...]
>>
>> Much like md before it, btrfs, while normally requiring the user-space-
>> required device-scan to properly handle multiple devices, has kernel-
>> command-line options that allow direct kernel multi-device assembly
>> without the help of early-userspace/initr*.

Just a note that while btrfs is /supposed/ to have that functionality,
I'm actually trying to make it work now, and failing (I can get it to
work only with rootflags=degraded, which then of course screws the sync
between the devices, losing the point of multi-device mirroring). As
(about a year ago when I brought it up then) someone else (btrfs dev)
mentioned not being able to get rootflags=dev=... to work on the kernel
command line with btrfs as well, I assume that functionality is broken
due to some as yet un-addressed bug, hopefully to be fixed by the time
btrfs is finally declared stable. However, that exchange from a year ago
suggests it's not a particularly high priority...

Meanwhile, I'm working on switching to a very minimal (dracut-based but
cut WAY down) initramfs now. Basically only enough to mount the btrfs
multi-device raid1 mirror since the kernel commandline rootflags method
appears to be broken, but still with a monolithic kernel, etc, so REALLY
quite minimal, indeed. (Once I get the semi-automated dracut host-only
no-kernel with most-default-modules-omitted version working to give me a
base pattern to work with, I may well switch to a hand assembled and
kernel-built initramfs, trimming it down even further.) Hopefully
someday the btrfs or rootflags kernel command-line bug will be fixed and
I can go initr*less again.
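
(Concretely, the cut-down generation looks something like the
following -- flags from memory, so treat it as illustrative rather
than a tested recipe:

dracut --force --host-only --no-kernel --omit "plymouth lvm mdraid" /boot/initramfs.img

-- with the dracut btrfs module and btrfs-progs supplying the
multi-device assembly bits.)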

So in terms of bcache, for me personally for now, I could in theory add
it to the minimal initr*. But there's certainly others running initr*-
less as well, and I'd prefer to be in that class myself once again at
some point. (When gentoo devs suggested forcing initr* for the separate /
usr case, users raised QUITE a ruckus, so initr*less may be a minority
case, but there's still quite a few out there, systemd's universe-
engulfing gray-goo to the contrary or not.)

So there's certainly people out there running initr*less who could make
use of some sort of hot-data-cache-device functionality, if it's
available to them.

> I wouldn't be averse to adding such functionality to bcache, provided it
> could be done reasonably cleanly/sensibly. It's not high on my list but
> I'd accept patches :)

Unfortunately I'm more an admin type than coder. I know my way around a
Linux system well enough to confidently troubleshoot and trace bugs, but
for anything other than shell-script, only in the trivial case can I
actually file an appropriate bugfix patch and feature-patching is right
out. So unfortunately, while I'm interested, such a patch can't come
from me. =:^(

But should anyone else with interest AND the ability be reading... =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman