2010-07-27 22:01:20

by Ben Chociej

Subject: [RFC PATCH 0/5] Btrfs: Add hot data tracking functionality

INTRODUCTION:

This patch series adds experimental support for tracking data
temperature in Btrfs. Essentially, this means maintaining some key
stats (like number of reads/writes, last read/write time, frequency of
reads/writes), then distilling those numbers down to a single
"temperature" value that reflects what data is "hot."

The long-term goal of these patches, as discussed in the Motivation
section at the end of this message, is to enable Btrfs to perform
automagic relocation of hot data to fast media like SSD. This goal has
been motivated by the Project Ideas page on the Btrfs wiki.

Of course, users are warned not to run this code outside of development
environments. These patches are EXPERIMENTAL, and as such they might
eat your data and/or memory.


MOTIVATION:

The overall goal of enabling hot data relocation to SSD has been
motivated by the Project Ideas page on the Btrfs wiki at
https://btrfs.wiki.kernel.org/index.php/Project_ideas. It is hoped that
this initial patchset will eventually mature into a usable hybrid
storage feature set for Btrfs.

This is essentially the traditional cache argument: SSD is fast and
expensive; HDD is cheap but slow. ZFS, for example, can already take
advantage of SSD caching. Btrfs should also be able to take advantage
of hybrid storage without any broad, sweeping changes to existing code.

With Btrfs's COW approach, an external cache (where data is *moved* to
SSD, rather than just cached there) makes a lot of sense. Though these
patches don't enable any relocation yet, they do lay an essential
foundation for enabling that functionality in the near future. We plan
to roll out an additional patchset introducing some of the automatic
migration functionality in the next few weeks.


SUMMARY:

- Hooks in existing Btrfs functions to track data access frequency
(btrfs_direct_IO, btrfs_readpages, and extent_write_cache_pages)

- New rbtrees for tracking access frequency of inodes and sub-file
ranges (hotdata_map.c)

- A hash list for indexing data by its temperature (hotdata_hash.c)

- A debugfs interface for dumping data from the rbtrees (debugfs.c)

- A foundation for relocating data to faster media based on temperature
(future patchset)

- Mount options for enabling temperature tracking (-o hotdatatrack,
-o hotdatamove; move implies track; both default to disabled)

- An ioctl to retrieve the frequency information collected for a certain
file

- Ioctls to enable/disable frequency tracking per inode.


DIFFSTAT:

fs/btrfs/Makefile | 5 +-
fs/btrfs/ctree.h | 42 +++
fs/btrfs/debugfs.c | 500 +++++++++++++++++++++++++++++++++++
fs/btrfs/debugfs.h | 57 ++++
fs/btrfs/disk-io.c | 29 ++
fs/btrfs/extent_io.c | 18 ++
fs/btrfs/hotdata_hash.c | 111 ++++++++
fs/btrfs/hotdata_hash.h | 89 +++++++
fs/btrfs/hotdata_map.c | 660 +++++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/hotdata_map.h | 118 +++++++++
fs/btrfs/inode.c | 29 ++-
fs/btrfs/ioctl.c | 146 +++++++++++-
fs/btrfs/ioctl.h | 21 ++
fs/btrfs/super.c | 48 ++++-
14 files changed, 1867 insertions(+), 6 deletions(-)


IMPLEMENTATION (in a nutshell):

Hooks have been added to various functions (btrfs_writepage(s),
btrfs_readpages, btrfs_direct_IO, and extent_write_cache_pages) in
order to track data access patterns. Each of these hooks calls a new
function, btrfs_update_freqs, which records each access to an inode,
possibly including some sub-file-level information as well. A data
structure containing various frequency metrics gets updated with
the latest access information.

From there, a hash list takes over the job of figuring out a total
"temperature" value for the data and indexing that temperature for fast
lookup in the future. The function that does the temperature
distillation is rather sensitive and can be tuned/tweaked by altering
various #defined values in hotdata_hash.h.

Aside from the core functionality, there is a debugfs interface to spit
out some of the data that is collected, and ioctls are also introduced
to manipulate the new functionality on a per-inode basis.


Signed-off-by: Ben Chociej <[email protected]>
Signed-off-by: Matt Lupfer <[email protected]>
Signed-off-by: Conor Scott <[email protected]>
Reviewed-by: Mingming Cao <[email protected]>
Reviewed-by: Steve French <[email protected]>


2010-07-27 22:01:24

by Ben Chociej

Subject: [RFC PATCH 1/5] Btrfs: Add experimental hot data hash list index

From: Ben Chociej <[email protected]>

Adds a hash table structure to efficiently look up the data temperature
of a file. Also adds a function to calculate that temperature based on
some metrics kept in custom frequency data structs.

Signed-off-by: Ben Chociej <[email protected]>
Signed-off-by: Matt Lupfer <[email protected]>
Signed-off-by: Conor Scott <[email protected]>
Reviewed-by: Mingming Cao <[email protected]>
Reviewed-by: Steve French <[email protected]>
---
fs/btrfs/hotdata_hash.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/hotdata_hash.h | 89 +++++++++++++++++++++++++++++++++++++
2 files changed, 200 insertions(+), 0 deletions(-)
create mode 100644 fs/btrfs/hotdata_hash.c
create mode 100644 fs/btrfs/hotdata_hash.h

diff --git a/fs/btrfs/hotdata_hash.c b/fs/btrfs/hotdata_hash.c
new file mode 100644
index 0000000..a0de853
--- /dev/null
+++ b/fs/btrfs/hotdata_hash.c
@@ -0,0 +1,111 @@
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/hash.h>
+#include "hotdata_map.h"
+#include "hotdata_hash.h"
+#include "async-thread.h"
+#include "ctree.h"
+
+/* set thread to update temperatures every 5 minutes */
+#define HEAT_UPDATE_DELAY (HZ * 60 * 5)
+
+struct heat_hashlist_node *alloc_heat_hashlist_node(gfp_t mask)
+{
+ struct heat_hashlist_node *node;
+
+ node = kmalloc(sizeof(struct heat_hashlist_node), mask);
+ if (!node)
+ return NULL;
+ INIT_HLIST_NODE(&node->hashnode);
+ node->freq_data = NULL;
+ node->hlist = NULL;
+
+ return node;
+}
+
+void free_heat_hashlists(struct btrfs_root *root)
+{
+ int i;
+
+ /* Free node/range heat hash lists */
+ for (i = 0; i < HEAT_HASH_SIZE; i++) {
+ struct hlist_node *pos = NULL, *pos2 = NULL;
+ struct heat_hashlist_node *heatnode = NULL;
+
+ hlist_for_each_safe(pos, pos2,
+ &root->heat_inode_hl[i].hashhead) {
+ heatnode = hlist_entry(pos, struct heat_hashlist_node,
+ hashnode);
+ hlist_del(pos);
+ kfree(heatnode);
+ }
+
+ hlist_for_each_safe(pos, pos2,
+ &root->heat_range_hl[i].hashhead) {
+ heatnode = hlist_entry(pos, struct heat_hashlist_node,
+ hashnode);
+ hlist_del(pos);
+ kfree(heatnode);
+ }
+ }
+}
+
+/*
+ * Function that converts btrfs_freq_data structs to integer temperature
+ * values, determined by some constants in .h.
+ *
+ * This is not very calibrated, though we've gotten it in the ballpark.
+ */
+int btrfs_get_temp(struct btrfs_freq_data *fdata)
+{
+ u32 result = 0;
+
+ struct timespec ckt = current_kernel_time();
+ u64 cur_time = timespec_to_ns(&ckt);
+
+ u32 nrr_heat = fdata->nr_reads << NRR_MULTIPLIER_POWER;
+ u32 nrw_heat = fdata->nr_writes << NRW_MULTIPLIER_POWER;
+
+ u64 ltr_heat = (cur_time - timespec_to_ns(&fdata->last_read_time))
+ >> LTR_DIVIDER_POWER;
+ u64 ltw_heat = (cur_time - timespec_to_ns(&fdata->last_write_time))
+ >> LTW_DIVIDER_POWER;
+
+ u64 avr_heat = (((u64) -1) - fdata->avg_delta_reads)
+ >> AVR_DIVIDER_POWER;
+ u64 avw_heat = (((u64) -1) - fdata->avg_delta_writes)
+ >> AVW_DIVIDER_POWER;
+
+ if (ltr_heat >= ((u64) 1 << 32))
+ ltr_heat = 0;
+ else
+ ltr_heat = ((u64) 1 << 32) - ltr_heat;
+ /* ltr_heat is now guaranteed to be u32 safe */
+
+ if (ltw_heat >= ((u64) 1 << 32))
+ ltw_heat = 0;
+ else
+ ltw_heat = ((u64) 1 << 32) - ltw_heat;
+ /* ltw_heat is now guaranteed to be u32 safe */
+
+ if (avr_heat >= ((u64) 1 << 32))
+ avr_heat = (u32) -1;
+ /* avr_heat is now guaranteed to be u32 safe */
+
+ if (avw_heat >= ((u64) 1 << 32))
+ avw_heat = (u32) -1;
+ /* avw_heat is now guaranteed to be u32 safe */
+
+ nrr_heat = nrr_heat >> (3 - NRR_COEFF_POWER);
+ nrw_heat = nrw_heat >> (3 - NRW_COEFF_POWER);
+ ltr_heat = ltr_heat >> (3 - LTR_COEFF_POWER);
+ ltw_heat = ltw_heat >> (3 - LTW_COEFF_POWER);
+ avr_heat = avr_heat >> (3 - AVR_COEFF_POWER);
+ avw_heat = avw_heat >> (3 - AVW_COEFF_POWER);
+
+ result = nrr_heat + nrw_heat + (u32) ltr_heat +
+ (u32) ltw_heat + (u32) avr_heat + (u32) avw_heat;
+
+ return result >> (32 - HEAT_HASH_BITS);
+}
diff --git a/fs/btrfs/hotdata_hash.h b/fs/btrfs/hotdata_hash.h
new file mode 100644
index 0000000..46bf61e
--- /dev/null
+++ b/fs/btrfs/hotdata_hash.h
@@ -0,0 +1,89 @@
+#ifndef __HOTDATAHASH__
+#define __HOTDATAHASH__
+
+#include <linux/list.h>
+#include <linux/hash.h>
+
+#define HEAT_HASH_BITS 8
+#define HEAT_HASH_SIZE (1 << HEAT_HASH_BITS)
+#define HEAT_HASH_MASK (HEAT_HASH_SIZE - 1)
+#define HEAT_MIN_VALUE 0
+#define HEAT_MAX_VALUE (HEAT_HASH_SIZE - 1)
+#define HEAT_HOT_MIN (HEAT_HASH_SIZE - 50)
+
+/*
+ * The following comments explain what exactly comprises a unit of heat.
+ *
+ * Each of six values of heat are calculated and combined in order to form an
+ * overall temperature for the data:
+ *
+ * NRR - number of reads since mount
+ * NRW - number of writes since mount
+ * LTR - time elapsed since last read (ns)
+ * LTW - time elapsed since last write (ns)
+ * AVR - average delta between recent reads (ns)
+ * AVW - average delta between recent writes (ns)
+ *
+ * These values are divided (right-shifted) according to the *_DIVIDER_POWER
+ * values defined below to bring the numbers into a reasonable range. You can
+ * modify these values to fit your needs. However, each heat unit is a u32 and
+ * thus maxes out at 2^32 - 1. Therefore, you must choose your dividers quite
+ * carefully or else they could max out or be stuck at zero quite easily.
+ *
+ * (E.g., with LTR_DIVIDER_POWER = 0, any read more than ~4.3s (2^32 ns)
+ * in the past would contribute zero heat, since the term would saturate.)
+ *
+ * Finally, each value is added to the overall temperature between 0 and 8
+ * times, depending on its *_COEFF_POWER value. Note that the coefficients are
+ * also actually implemented with shifts, so take care to treat these values
+ * as powers of 2. (I.e., 0 means we'll add it to the temp once; 1 = 2x, etc.)
+ */
+
+#define NRR_MULTIPLIER_POWER 23
+#define NRR_COEFF_POWER 0
+#define NRW_MULTIPLIER_POWER 23
+#define NRW_COEFF_POWER 0
+#define LTR_DIVIDER_POWER 30
+#define LTR_COEFF_POWER 1
+#define LTW_DIVIDER_POWER 30
+#define LTW_COEFF_POWER 1
+#define AVR_DIVIDER_POWER 40
+#define AVR_COEFF_POWER 0
+#define AVW_DIVIDER_POWER 40
+#define AVW_COEFF_POWER 0
+
+/* TODO a kmem cache for entry structs */
+
+struct btrfs_root;
+
+/* Hash list heads for heat hash table */
+struct heat_hashlist_entry {
+ struct hlist_head hashhead;
+ rwlock_t rwlock;
+ u32 temperature;
+};
+
+/* Nodes stored in each hash list of hash table */
+struct heat_hashlist_node {
+ struct hlist_node hashnode;
+ struct btrfs_freq_data *freq_data;
+ struct heat_hashlist_entry *hlist;
+};
+
+struct heat_hashlist_node *alloc_heat_hashlist_node(gfp_t mask);
+void free_heat_hashlists(struct btrfs_root *root);
+
+/*
+ * Returns a value from 0 to HEAT_MAX_VALUE indicating the temperature of the
+ * file (and consequently its bucket number in hashlist)
+ */
+int btrfs_get_temp(struct btrfs_freq_data *fdata);
+
+/*
+ * recalculates temperatures for inode or range
+ * and moves around in heat hash table based on temp
+ */
+void btrfs_update_heat_index(struct btrfs_freq_data *fdata,
+ struct btrfs_root *root);
+
+#endif /* __HOTDATAHASH__ */
--
1.7.1

2010-07-27 22:01:36

by Ben Chociej

Subject: [RFC PATCH 4/5] Btrfs: Add debugfs interface for hot data stats

From: Ben Chociej <[email protected]>

Adds a ./btrfs_data/<device_name>/ directory in the debugfs directory
for each volume. The directory contains two files. The first,
`inode_data', contains the heat information for inodes that have been
brought into the hot data map structures. The second, `range_data',
contains similar information about subfile ranges.

Signed-off-by: Ben Chociej <[email protected]>
Signed-off-by: Matt Lupfer <[email protected]>
Signed-off-by: Conor Scott <[email protected]>
Reviewed-by: Mingming Cao <[email protected]>
Reviewed-by: Steve French <[email protected]>
---
fs/btrfs/debugfs.c | 500 ++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/debugfs.h | 57 ++++++
2 files changed, 557 insertions(+), 0 deletions(-)
create mode 100644 fs/btrfs/debugfs.c
create mode 100644 fs/btrfs/debugfs.h

diff --git a/fs/btrfs/debugfs.c b/fs/btrfs/debugfs.c
new file mode 100644
index 0000000..a0e7bb7
--- /dev/null
+++ b/fs/btrfs/debugfs.c
@@ -0,0 +1,500 @@
+#include <linux/debugfs.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/vmalloc.h>
+#include <linux/limits.h>
+#include "ctree.h"
+#include "hotdata_map.h"
+#include "hotdata_hash.h"
+#include "debugfs.h"
+
+/*
+ * debugfs.c contains the code to interface with the btrfs debugfs.
+ * The debugfs outputs range- and file-level access frequency
+ * statistics for each mounted volume.
+ */
+
+static int copy_msg_to_log(struct debugfs_vol_data *data, char *msg, int len)
+{
+ struct lstring *debugfs_log = data->debugfs_log;
+ uint new_log_alloc_size;
+ char *new_log;
+
+ if (len >= data->log_alloc_size - debugfs_log->len) {
+ /* Not enough room in the log buffer for the new message. */
+ /* Allocate a bigger buffer. */
+ new_log_alloc_size = data->log_alloc_size + LOG_PAGE_SIZE;
+ new_log = vmalloc(new_log_alloc_size);
+
+ if (new_log) {
+ memcpy(new_log, debugfs_log->str,
+ debugfs_log->len);
+ memset(new_log + debugfs_log->len, 0,
+ new_log_alloc_size - debugfs_log->len);
+ vfree(debugfs_log->str);
+ debugfs_log->str = new_log;
+ data->log_alloc_size = new_log_alloc_size;
+ } else {
+ WARN_ON(1);
+ if (data->log_alloc_size - debugfs_log->len) {
+ #define err_msg "No more memory!\n"
+ strlcpy(debugfs_log->str +
+ debugfs_log->len,
+ err_msg, data->log_alloc_size -
+ debugfs_log->len);
+ debugfs_log->len +=
+ min((typeof(debugfs_log->len))
+ sizeof(err_msg),
+ ((typeof(debugfs_log->len))
+ data->log_alloc_size -
+ debugfs_log->len));
+ }
+ return 0;
+ }
+ }
+
+ memcpy(debugfs_log->str + debugfs_log->len,
+ data->log_work_buff, len);
+ debugfs_log->len += (unsigned long) len;
+
+ return len;
+}
+
+/* Returns the number of bytes written to the log. */
+static int debugfs_log(struct debugfs_vol_data *data, const char *fmt, ...)
+{
+ struct lstring *debugfs_log = data->debugfs_log;
+ va_list args;
+ int len;
+
+ if (debugfs_log->str == NULL)
+ return -1;
+
+ spin_lock(&data->log_lock);
+
+ va_start(args, fmt);
+ len = vsnprintf(data->log_work_buff, sizeof(data->log_work_buff), fmt,
+ args);
+ va_end(args);
+
+ if (len >= sizeof(data->log_work_buff)) {
+ #define truncate_msg "The next message has been truncated.\n"
+ copy_msg_to_log(data, truncate_msg, sizeof(truncate_msg));
+ }
+
+ len = copy_msg_to_log(data, data->log_work_buff, len);
+ spin_unlock(&data->log_lock);
+
+ return len;
+}
+
+/* initialize a log corresponding to a btrfs volume */
+static int debugfs_log_init(struct debugfs_vol_data *data)
+{
+ int err = 0;
+ struct lstring *debugfs_log = data->debugfs_log;
+
+ spin_lock(&data->log_lock);
+ debugfs_log->str = vmalloc(INIT_LOG_ALLOC_SIZE);
+
+ if (debugfs_log->str) {
+ memset(debugfs_log->str, 0, INIT_LOG_ALLOC_SIZE);
+ data->log_alloc_size = INIT_LOG_ALLOC_SIZE;
+ } else {
+ err = -ENOMEM;
+ }
+
+ spin_unlock(&data->log_lock);
+ return err;
+}
+
+/* free a log corresponding to a btrfs volume */
+static void debugfs_log_exit(struct debugfs_vol_data *data)
+{
+ struct lstring *debugfs_log = data->debugfs_log;
+ spin_lock(&data->log_lock);
+ vfree(debugfs_log->str);
+ debugfs_log->str = NULL;
+ debugfs_log->len = 0;
+ spin_unlock(&data->log_lock);
+}
+
+/* fops to override for printing range data */
+static const struct file_operations btrfs_debugfs_range_fops = {
+ .read = __btrfs_debugfs_range_read,
+ .open = __btrfs_debugfs_open,
+};
+
+/* fops to override for printing inode data */
+static const struct file_operations btrfs_debugfs_inode_fops = {
+ .read = __btrfs_debugfs_inode_read,
+ .open = __btrfs_debugfs_open,
+};
+
+/* initialize debugfs for btrfs at module init */
+int btrfs_init_debugfs(void)
+{
+ debugfs_root_dentry = debugfs_create_dir(DEBUGFS_ROOT_NAME, NULL);
+ /* init the list of per-volume debugfs data */
+ INIT_LIST_HEAD(&debugfs_vol_data_list);
+ /* init the lock that protects that list */
+ spin_lock_init(&data_list_lock);
+ if (!debugfs_root_dentry)
+ goto debugfs_error;
+ return 0;
+
+debugfs_error:
+ return -EIO;
+}
+
+/*
+ * on each volume mount, initialize the debugfs dentries and associated
+ * structures (debugfs_vol_data and debugfs_log)
+ */
+int btrfs_init_debugfs_volume(const char *uuid, struct super_block *sb)
+{
+ struct dentry *debugfs_volume_entry = NULL;
+ struct dentry *debugfs_range_entry = NULL;
+ struct dentry *debugfs_inode_entry = NULL;
+ struct debugfs_vol_data *range_data = NULL;
+ struct debugfs_vol_data *inode_data = NULL;
+ size_t dev_name_length = strlen(uuid);
+ char dev[NAME_MAX];
+
+ if (!debugfs_root_dentry)
+ goto debugfs_error;
+
+ /* create debugfs folder for this volume by mounted dev name */
+ memcpy(dev, uuid + DEV_NAME_CHOP, dev_name_length -
+ DEV_NAME_CHOP + 1);
+ debugfs_volume_entry = debugfs_create_dir(dev, debugfs_root_dentry);
+
+ if (!debugfs_volume_entry)
+ goto debugfs_error;
+
+ /* malloc and initialize debugfs_vol_data for range_data */
+ range_data = kmalloc(sizeof(struct debugfs_vol_data),
+ GFP_KERNEL | GFP_NOFS);
+ memset(range_data, 0, sizeof(struct debugfs_vol_data));
+ range_data->debugfs_log = NULL;
+ range_data->sb = sb;
+ spin_lock_init(&range_data->log_lock);
+ range_data->log_alloc_size = 0;
+
+ /* malloc and initialize debugfs_vol_data for inode_data */
+ inode_data = kmalloc(sizeof(struct debugfs_vol_data),
+ GFP_KERNEL | GFP_NOFS);
+ memset(inode_data, 0, sizeof(struct debugfs_vol_data));
+ inode_data->debugfs_log = NULL;
+ inode_data->sb = sb;
+ spin_lock_init(&inode_data->log_lock);
+ inode_data->log_alloc_size = 0;
+
+ /* add debugfs_vol_data for inode data and range data for
+ * volume to list */
+ range_data->de = debugfs_volume_entry;
+ inode_data->de = debugfs_volume_entry;
+ spin_lock(&data_list_lock);
+ list_add(&range_data->node, &debugfs_vol_data_list);
+ list_add(&inode_data->node, &debugfs_vol_data_list);
+ spin_unlock(&data_list_lock);
+
+ /* create debugfs range_data file */
+ debugfs_range_entry = debugfs_create_file("range_data",
+ S_IFREG | S_IRUSR | S_IWUSR |
+ S_IRUGO,
+ debugfs_volume_entry,
+ (void *) range_data,
+ &btrfs_debugfs_range_fops);
+ if (!debugfs_range_entry)
+ goto debugfs_error;
+
+ /* create debugfs inode_data file */
+ debugfs_inode_entry = debugfs_create_file("inode_data",
+ S_IFREG | S_IRUSR | S_IWUSR |
+ S_IRUGO,
+ debugfs_volume_entry,
+ (void *) inode_data,
+ &btrfs_debugfs_inode_fops);
+
+ if (!debugfs_inode_entry)
+ goto debugfs_error;
+
+ return 0;
+
+debugfs_error:
+
+ kfree(range_data);
+ kfree(inode_data);
+
+ return -EIO;
+}
+
+/* find volume mounted (match by superblock) and remove
+ * debugfs dentry
+ */
+void btrfs_exit_debugfs_volume(struct super_block *sb)
+{
+ struct list_head *head;
+ struct list_head *pos;
+ struct debugfs_vol_data *data;
+ spin_lock(&data_list_lock);
+ head = &debugfs_vol_data_list;
+ /* must clean up memory associated with superblock */
+ list_for_each(pos, head)
+ {
+ data = list_entry(pos, struct debugfs_vol_data, node);
+ if (data->sb == sb) {
+ list_del(pos);
+ debugfs_remove_recursive(data->de);
+ kfree(data);
+ data = NULL;
+ break;
+ }
+ }
+ spin_unlock(&data_list_lock);
+}
+
+/* clean up memory and remove dentries for debugfs */
+void btrfs_exit_debugfs(void)
+{
+ /* first iterate through debugfs_vol_data_list and free memory */
+ struct list_head *head;
+ struct list_head *pos;
+ struct list_head *cur;
+ struct debugfs_vol_data *data;
+
+ spin_lock(&data_list_lock);
+ head = &debugfs_vol_data_list;
+ list_for_each_safe(pos, cur, head) {
+ data = list_entry(pos, struct debugfs_vol_data, node);
+ if (data && pos != head)
+ kfree(data);
+ }
+ spin_unlock(&data_list_lock);
+
+ /* remove all debugfs entries recursively from the root */
+ debugfs_remove_recursive(debugfs_root_dentry);
+}
+
+/* debugfs open file override from fops table */
+int __btrfs_debugfs_open(struct inode *inode, struct file *file)
+{
+ if (inode->i_private)
+ file->private_data = inode->i_private;
+
+ return 0;
+}
+
+/* debugfs read file override from fops table */
+ssize_t __btrfs_debugfs_range_read(struct file *file, char __user *user,
+ size_t count, loff_t *ppos)
+{
+ int err = 0;
+ struct super_block *sb;
+ struct btrfs_root *root;
+ struct btrfs_root *fs_root;
+ struct hot_inode_item *current_hot_inode;
+ struct debugfs_vol_data *data;
+ struct lstring *debugfs_log;
+
+ data = (struct debugfs_vol_data *) file->private_data;
+ sb = data->sb;
+ root = btrfs_sb(sb);
+ fs_root = (struct btrfs_root *) root->fs_info->fs_root;
+
+ if (!data->debugfs_log) {
+ /* initialize debugfs log corresponding to this volume*/
+ debugfs_log = kmalloc(sizeof(struct lstring),
+ GFP_KERNEL | GFP_NOFS);
+ debugfs_log->str = NULL;
+ debugfs_log->len = 0;
+ data->debugfs_log = debugfs_log;
+ debugfs_log_init(data);
+ }
+
+ if ((unsigned long) *ppos > 0) {
+ /* caller is continuing a previous read, don't walk tree */
+ if ((unsigned long) *ppos >= data->debugfs_log->len)
+ goto clean_up;
+
+ goto print_to_user;
+ }
+
+ /* walk the inode tree */
+
+ current_hot_inode = find_next_hot_inode(fs_root, 0);
+
+ while (current_hot_inode) {
+ /* walk ranges, print data to debugfs log */
+ __walk_range_tree(current_hot_inode, data);
+
+ free_hot_inode_item(current_hot_inode);
+ current_hot_inode = find_next_hot_inode(fs_root,
+ (u64) current_hot_inode->i_ino + 1);
+ }
+
+print_to_user:
+
+ if (data->debugfs_log->len) {
+ err = simple_read_from_buffer(user, count, ppos,
+ data->debugfs_log->str,
+ data->debugfs_log->len);
+ }
+
+ return err;
+
+clean_up:
+
+ /* reader has finished the file */
+ /* clean up */
+
+ debugfs_log_exit(data);
+ kfree(data->debugfs_log);
+ data->debugfs_log = NULL;
+
+ return 0;
+}
+
+/* debugfs read file override from fops table */
+ssize_t __btrfs_debugfs_inode_read(struct file *file, char __user *user,
+ size_t count, loff_t *ppos)
+{
+ int err = 0;
+ struct super_block *sb;
+ struct btrfs_root *root;
+ struct btrfs_root *fs_root;
+ struct hot_inode_item *current_hot_inode;
+ struct debugfs_vol_data *data;
+ struct lstring *debugfs_log;
+
+ data = (struct debugfs_vol_data *) file->private_data;
+ sb = data->sb;
+ root = btrfs_sb(sb);
+ fs_root = (struct btrfs_root *) root->fs_info->fs_root;
+
+ if (!data->debugfs_log) {
+ /* initialize debugfs log corresponding to this volume */
+ debugfs_log = kmalloc(sizeof(struct lstring),
+ GFP_KERNEL | GFP_NOFS);
+ debugfs_log->str = NULL;
+ debugfs_log->len = 0;
+ data->debugfs_log = debugfs_log;
+ debugfs_log_init(data);
+ }
+
+ if ((unsigned long) *ppos > 0) {
+ /* caller is continuing a previous read, don't walk tree */
+ if ((unsigned long) *ppos >= data->debugfs_log->len)
+ goto clean_up;
+
+ goto print_to_user;
+ }
+
+ /* walk the inode tree */
+
+ current_hot_inode = find_next_hot_inode(fs_root, 0);
+
+ while (current_hot_inode) {
+ /* walk ranges, print data to debugfs log */
+ __print_inode_freq_data(current_hot_inode, data);
+
+ free_hot_inode_item(current_hot_inode);
+ current_hot_inode = find_next_hot_inode(fs_root,
+ (u64) current_hot_inode->i_ino + 1);
+ }
+
+print_to_user:
+
+ if (data->debugfs_log->len) {
+ err = simple_read_from_buffer(user, count, ppos,
+ data->debugfs_log->str,
+ data->debugfs_log->len);
+ }
+
+ return err;
+
+clean_up:
+
+ /* reader has finished the file */
+ /* clean up */
+ debugfs_log_exit(data);
+ kfree(data->debugfs_log);
+ data->debugfs_log = NULL;
+
+ return 0;
+}
+
+/*
+ * Take the inode, find ranges associated with inode
+ * and print each range data struct
+ */
+void __walk_range_tree(struct hot_inode_item *hot_inode,
+ struct debugfs_vol_data *data)
+{
+ struct hot_range_tree *inode_range_tree;
+ struct rb_node *node;
+ struct hot_range_item *current_range;
+
+ inode_range_tree = &hot_inode->hot_range_tree;
+ read_lock(&inode_range_tree->lock);
+ node = rb_first(&inode_range_tree->map);
+
+ /* Walk the hot_range_tree for inode */
+ while (node) {
+ current_range = rb_entry(node, struct hot_range_item, rb_node);
+ __print_range_freq_data(hot_inode, current_range, data);
+ node = rb_next(node);
+ }
+ read_unlock(&inode_range_tree->lock);
+}
+
+/* Print frequency data for each range to log */
+void __print_range_freq_data(struct hot_inode_item *hot_inode,
+ struct hot_range_item *hot_range,
+ struct debugfs_vol_data *data)
+{
+ struct btrfs_freq_data *freq_data;
+ int temp;
+ freq_data = &hot_range->freq_data;
+ read_lock(&hot_range->heat_node->hlist->rwlock);
+ temp = hot_range->heat_node->hlist->temperature;
+ read_unlock(&hot_range->heat_node->hlist->rwlock);
+
+ /* Always lock hot_inode_item first */
+ spin_lock(&hot_inode->lock);
+ spin_lock(&hot_range->lock);
+ debugfs_log(data, "inode #%lu, range start "
+ "%llu (range len %llu) reads %u, writes %u, temp %u\n",
+ hot_inode->i_ino,
+ hot_range->start,
+ hot_range->len,
+ freq_data->nr_reads,
+ freq_data->nr_writes,
+ temp);
+ spin_unlock(&hot_range->lock);
+ spin_unlock(&hot_inode->lock);
+}
+
+/* Print frequency data for each freq data to log */
+void __print_inode_freq_data(struct hot_inode_item *hot_inode,
+ struct debugfs_vol_data *data)
+{
+ struct btrfs_freq_data *freq_data;
+ int temp;
+ freq_data = &hot_inode->freq_data;
+
+ read_lock(&hot_inode->heat_node->hlist->rwlock);
+ temp = hot_inode->heat_node->hlist->temperature;
+ read_unlock(&hot_inode->heat_node->hlist->rwlock);
+
+ spin_lock(&hot_inode->lock);
+ debugfs_log(data, "inode #%lu, reads %u, writes %u, temp %u\n",
+ hot_inode->i_ino,
+ freq_data->nr_reads,
+ freq_data->nr_writes,
+ temp);
+ spin_unlock(&hot_inode->lock);
+}
+
diff --git a/fs/btrfs/debugfs.h b/fs/btrfs/debugfs.h
new file mode 100644
index 0000000..bdd4938
--- /dev/null
+++ b/fs/btrfs/debugfs.h
@@ -0,0 +1,57 @@
+#ifndef __BTRFS_DEBUGFS__
+#define __BTRFS_DEBUGFS__
+
+/* size of log to vmalloc */
+#define INIT_LOG_ALLOC_SIZE (PAGE_SIZE * 10)
+#define LOG_PAGE_SIZE (PAGE_SIZE * 10)
+
+/* number of chars to chop off the device name when making the debugfs folder,
+ * e.g. /dev/sda -> sda */
+#define DEV_NAME_CHOP 5
+
+/* list to keep track of each mounted volumes debugfs_vol_data */
+static struct list_head debugfs_vol_data_list;
+/* lock for debugfs_vol_data_list */
+static spinlock_t data_list_lock;
+
+/*
+ * Name for BTRFS data in debugfs directory
+ * e.g. /sys/kernel/debug/btrfs_data
+ */
+#define DEBUGFS_ROOT_NAME "btrfs_data"
+/* pointer to top level debugfs dentry */
+static struct dentry *debugfs_root_dentry;
+
+/* log to output to userspace in debugfs files */
+struct lstring {
+ char *str;
+ unsigned long len;
+};
+
+/*
+ * debugfs_vol_data is a struct of items that is passed to the debugfs
+ */
+struct debugfs_vol_data {
+ struct list_head node; /* protected by data_list_lock */
+ struct lstring *debugfs_log;
+ struct super_block *sb;
+ struct dentry *de;
+ spinlock_t log_lock; /* protects debugfs_log */
+ char log_work_buff[1024];
+ uint log_alloc_size;
+};
+
+ssize_t __btrfs_debugfs_range_read(struct file *file, char __user *user,
+ size_t size, loff_t *len);
+ssize_t __btrfs_debugfs_inode_read(struct file *file, char __user *user,
+ size_t size, loff_t *len);
+int __btrfs_debugfs_open(struct inode *inode, struct file *file);
+void __walk_range_tree(struct hot_inode_item *hot_inode,
+ struct debugfs_vol_data *data);
+void __print_range_freq_data(struct hot_inode_item *hot_inode,
+ struct hot_range_item *hot_range,
+ struct debugfs_vol_data *data);
+void __print_inode_freq_data(struct hot_inode_item *hot_inode,
+ struct debugfs_vol_data *data);
+
+#endif
--
1.7.1

2010-07-27 22:01:38

by Ben Chociej

Subject: [RFC PATCH 3/5] Btrfs: 3 new ioctls related to hot data features

From: Ben Chociej <[email protected]>

BTRFS_IOC_GET_HEAT_INFO: return a struct containing the various
metrics collected in btrfs_freq_data structs, and also return a
calculated data temperature based on those metrics. Optionally, retrieve
the temperature from the hot data hash list instead of recalculating it.

BTRFS_IOC_GET_HEAT_OPTS: return an integer representing the current
state of hot data tracking and migration:
0 = do nothing
1 = track frequency of access
2 = migrate data to fast media based on temperature
(not implemented)

BTRFS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and
migration, as described above.

Signed-off-by: Ben Chociej <[email protected]>
Signed-off-by: Matt Lupfer <[email protected]>
Signed-off-by: Conor Scott <[email protected]>
Reviewed-by: Mingming Cao <[email protected]>
Reviewed-by: Steve French <[email protected]>
---
fs/btrfs/ioctl.c | 146 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/ioctl.h | 21 ++++++++
2 files changed, 166 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4dbaf89..be7aba2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -49,6 +49,8 @@
#include "print-tree.h"
#include "volumes.h"
#include "locking.h"
+#include "hotdata_map.h"
+#include "hotdata_hash.h"

/* Mask out flags that are inappropriate for the given type of inode. */
static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
@@ -1869,7 +1871,7 @@ static long btrfs_ioctl_default_subvol(struct file *file, void __user *argp)
return 0;
}

-long btrfs_ioctl_space_info(struct btrfs_root *root, void __user *arg)
+static long btrfs_ioctl_space_info(struct btrfs_root *root, void __user *arg)
{
struct btrfs_ioctl_space_args space_args;
struct btrfs_ioctl_space_info space;
@@ -1974,6 +1976,142 @@ long btrfs_ioctl_trans_end(struct file *file)
return 0;
}

+/*
+ * Retrieve information about access frequency for the given file. Return it in
+ * a userspace-friendly struct for btrfsctl (or another tool) to parse.
+ *
+ * The temperature that is returned can be "live" -- that is, recalculated when
+ * the ioctl is called -- or it can be returned from the hashtable, reflecting
+ * the (possibly old) value that the system will use when considering files
+ * for migration. This behavior is determined by heat_info->live.
+ */
+static long btrfs_ioctl_heat_info(struct file *file, void __user *argp)
+{
+ struct inode *mnt_inode = fdentry(file)->d_inode;
+ struct inode *file_inode;
+ struct file *file_filp;
+ struct btrfs_root *root = BTRFS_I(mnt_inode)->root;
+ struct btrfs_ioctl_heat_info *heat_info;
+ struct hot_inode_tree *hitree;
+ struct hot_inode_item *he;
+ int ret;
+
+ heat_info = kmalloc(sizeof(struct btrfs_ioctl_heat_info),
+ GFP_KERNEL | GFP_NOFS);
+
+ if (copy_from_user((void *) heat_info,
+ argp,
+ sizeof(struct btrfs_ioctl_heat_info)) != 0) {
+ ret = -EFAULT;
+ goto err;
+ }
+
+ file_filp = filp_open(heat_info->filename, O_RDONLY, 0);
+
+ if (IS_ERR(file_filp)) {
+ ret = PTR_ERR(file_filp);
+ goto err;
+ }
+
+ file_inode = file_filp->f_dentry->d_inode;
+
+ hitree = &root->hot_inode_tree;
+ read_lock(&hitree->lock);
+ he = lookup_hot_inode_item(hitree, file_inode->i_ino);
+ read_unlock(&hitree->lock);
+
+ if (!he || IS_ERR(he)) {
+ /* we don't have any info on this file yet */
+ ret = -ENODATA;
+ goto err;
+ }
+
+ spin_lock(&he->lock);
+
+ heat_info->avg_delta_reads =
+ (__u64) he->freq_data.avg_delta_reads;
+ heat_info->avg_delta_writes =
+ (__u64) he->freq_data.avg_delta_writes;
+ heat_info->last_read_time =
+ (__u64) timespec_to_ns(&he->freq_data.last_read_time);
+ heat_info->last_write_time =
+ (__u64) timespec_to_ns(&he->freq_data.last_write_time);
+ heat_info->num_reads =
+ (__u32) he->freq_data.nr_reads;
+ heat_info->num_writes =
+ (__u32) he->freq_data.nr_writes;
+
+ if (heat_info->live > 0) {
+ /* got a request for live temperature,
+ * call btrfs_get_temp to recalculate */
+ heat_info->temperature = btrfs_get_temp(&he->freq_data);
+ } else {
+ /* not live temperature, get it from the hashlist */
+ read_lock(&he->heat_node->hlist->rwlock);
+ heat_info->temperature = he->heat_node->hlist->temperature;
+ read_unlock(&he->heat_node->hlist->rwlock);
+ }
+
+ spin_unlock(&he->lock);
+ free_hot_inode_item(he);
+
+ if (copy_to_user(argp, (void *) heat_info,
+ sizeof(struct btrfs_ioctl_heat_info))) {
+ ret = -EFAULT;
+ goto err;
+ }
+
+ kfree(heat_info);
+ return 0;
+
+err:
+ kfree(heat_info);
+ return ret;
+}
+
+static long btrfs_ioctl_heat_opts(struct file *file, void __user *argp, int set)
+{
+ struct inode *inode = fdentry(file)->d_inode;
+ int arg, ret = 0;
+
+ if (!set) {
+ arg = ((BTRFS_I(inode)->flags & BTRFS_INODE_NO_HOTDATA_TRACK)
+ ? 0 : 1) +
+ ((BTRFS_I(inode)->flags & BTRFS_INODE_NO_HOTDATA_MOVE)
+ ? 0 : 1);
+
+ if (copy_to_user(argp, (void *) &arg, sizeof(int)) != 0)
+ ret = -EFAULT;
+ } else if (copy_from_user((void *) &arg, argp, sizeof(int)) != 0)
+ ret = -EFAULT;
+ else
+ switch (arg) {
+ case 0: /* track nothing, move nothing */
+ /* set both flags */
+ BTRFS_I(inode)->flags |=
+ BTRFS_INODE_NO_HOTDATA_TRACK |
+ BTRFS_INODE_NO_HOTDATA_MOVE;
+ break;
+ case 1: /* do tracking, don't move anything */
+ /* clear NO_HOTDATA_TRACK, set NO_HOTDATA_MOVE */
+ BTRFS_I(inode)->flags &=
+ ~BTRFS_INODE_NO_HOTDATA_TRACK;
+ BTRFS_I(inode)->flags |=
+ BTRFS_INODE_NO_HOTDATA_MOVE;
+ break;
+ case 2: /* track and move */
+ /* clear both flags */
+ BTRFS_I(inode)->flags &=
+ ~(BTRFS_INODE_NO_HOTDATA_TRACK |
+ BTRFS_INODE_NO_HOTDATA_MOVE);
+ break;
+ default:
+ ret = -EINVAL;
+ }
+
+ return ret;
+}
+
long btrfs_ioctl(struct file *file, unsigned int
cmd, unsigned long arg)
{
@@ -2021,6 +2159,12 @@ long btrfs_ioctl(struct file *file, unsigned int
return btrfs_ioctl_ino_lookup(file, argp);
case BTRFS_IOC_SPACE_INFO:
return btrfs_ioctl_space_info(root, argp);
+ case BTRFS_IOC_GET_HEAT_INFO:
+ return btrfs_ioctl_heat_info(file, argp);
+ case BTRFS_IOC_GET_HEAT_OPTS:
+ return btrfs_ioctl_heat_opts(file, argp, 0);
+ case BTRFS_IOC_SET_HEAT_OPTS:
+ return btrfs_ioctl_heat_opts(file, argp, 1);
case BTRFS_IOC_SYNC:
btrfs_sync_fs(file->f_dentry->d_sb, 1);
return 0;
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 424694a..8ba775e 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -138,6 +138,18 @@ struct btrfs_ioctl_space_args {
struct btrfs_ioctl_space_info spaces[0];
};

+struct btrfs_ioctl_heat_info {
+ __u64 avg_delta_reads;
+ __u64 avg_delta_writes;
+ __u64 last_read_time;
+ __u64 last_write_time;
+ __u32 num_reads;
+ __u32 num_writes;
+ char filename[BTRFS_PATH_NAME_MAX + 1];
+ int temperature;
+ __u8 live;
+};
+
#define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
struct btrfs_ioctl_vol_args)
#define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
@@ -178,4 +190,13 @@ struct btrfs_ioctl_space_args {
#define BTRFS_IOC_DEFAULT_SUBVOL _IOW(BTRFS_IOCTL_MAGIC, 19, u64)
#define BTRFS_IOC_SPACE_INFO _IOWR(BTRFS_IOCTL_MAGIC, 20, \
struct btrfs_ioctl_space_args)
+
+/*
+ * Hot data tracking ioctls:
+ */
+#define BTRFS_IOC_GET_HEAT_INFO _IOWR(BTRFS_IOCTL_MAGIC, 21, \
+ struct btrfs_ioctl_heat_info)
+#define BTRFS_IOC_SET_HEAT_OPTS _IOW(BTRFS_IOCTL_MAGIC, 22, int)
+#define BTRFS_IOC_GET_HEAT_OPTS _IOR(BTRFS_IOCTL_MAGIC, 23, int)
+
#endif
--
1.7.1
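
For what it's worth, here is a hedged sketch of how a userspace tool (btrfsctl or another utility) might drive the new BTRFS_IOC_GET_HEAT_INFO ioctl. The struct mirrors the definition added to ioctl.h above; the BTRFS_PATH_NAME_MAX value and the helper names here are our own assumptions, and error handling is abbreviated:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define BTRFS_IOCTL_MAGIC 0x94
#define BTRFS_PATH_NAME_MAX 4087	/* assumed, per btrfs ioctl.h */

/* Userspace mirror of the struct added to fs/btrfs/ioctl.h above */
struct btrfs_ioctl_heat_info {
	uint64_t avg_delta_reads;
	uint64_t avg_delta_writes;
	uint64_t last_read_time;
	uint64_t last_write_time;
	uint32_t num_reads;
	uint32_t num_writes;
	char filename[BTRFS_PATH_NAME_MAX + 1];
	int temperature;
	uint8_t live;
};

#define BTRFS_IOC_GET_HEAT_INFO _IOWR(BTRFS_IOCTL_MAGIC, 21, \
				      struct btrfs_ioctl_heat_info)

/* Fill a request: target path plus live (recalculate) vs cached temp */
static void make_heat_request(struct btrfs_ioctl_heat_info *req,
			      const char *path, int live)
{
	memset(req, 0, sizeof(*req));
	strncpy(req->filename, path, BTRFS_PATH_NAME_MAX);
	req->live = live ? 1 : 0;
}

/* Issue the ioctl against any fd on the mounted fs; -1 on error */
static int query_heat(int mnt_fd, const char *path, int live)
{
	struct btrfs_ioctl_heat_info req;

	make_heat_request(&req, path, live);
	if (ioctl(mnt_fd, BTRFS_IOC_GET_HEAT_INFO, &req) != 0)
		return -1;
	return req.temperature;
}
```

A tool would open the mount point (or any file on it), call query_heat(fd, "/mnt/foo", 1) for a freshly computed temperature, or pass live = 0 to read the cached value from the hashlists.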

2010-07-27 22:02:04

by Ben Chociej

[permalink] [raw]
Subject: [RFC PATCH 5/5] Btrfs: Add hooks to enable hot data tracking

From: Ben Chociej <[email protected]>

Adds miscellaneous hooks and options that enable the hot data tracking
features, open the door for future hot data migration to faster media,
and generally make the hot data functions a bit more friendly.

ctree.h: Adds the root hot_inode_tree and heat hashlists. Defines some
mount options and inode flags for turning all of the hot data
functionality on and off, globally and per file. Defines some guard
macros that enforce the mount options and inode flags.

disk-io.c: Initialization and freeing of various structures.

extent_io.c: Adds a hook into extent_write_cache_pages to enable hot
data tracking. The actual IO tracking is done here (and in inode.c).

inode.c: Adds hooks into btrfs_direct_IO and btrfs_readpages to enable
hot data tracking. The actual IO tracking is done here (and in
extent_io.c).

super.c: Implements the aforementioned mount options and does various
initialization and freeing.
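
As a usage sketch (assuming the option names land as spelled in the parser below), tracking and migration would be enabled at mount time:

```shell
# Track access frequency only; no relocation
mount -t btrfs -o hotdatatrack /dev/sdb /mnt/btrfs

# Also allow (future) hot data relocation; implies hotdatatrack
# and disables ssd_spread
mount -t btrfs -o hotdatamove /dev/sdb /mnt/btrfs
```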

Signed-off-by: Ben Chociej <[email protected]>
Signed-off-by: Matt Lupfer <[email protected]>
Signed-off-by: Conor Scott <[email protected]>
Reviewed-by: Mingming Cao <[email protected]>
Reviewed-by: Steve French <[email protected]>
---
fs/btrfs/Makefile | 5 ++++-
fs/btrfs/ctree.h | 42 ++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/disk-io.c | 29 +++++++++++++++++++++++++++++
fs/btrfs/extent_io.c | 18 ++++++++++++++++++
fs/btrfs/inode.c | 27 +++++++++++++++++++++++++++
fs/btrfs/super.c | 48 +++++++++++++++++++++++++++++++++++++++++++++---
6 files changed, 165 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index a35eb36..8bc70ba 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -7,4 +7,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
export.o tree-log.o acl.o free-space-cache.o zlib.o \
- compression.o delayed-ref.o relocation.o
+ compression.o delayed-ref.o relocation.o hotdata_map.o \
+ hotdata_hash.o
+
+btrfs-$(CONFIG_DEBUG_FS) += debugfs.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e9bf864..7284cb5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -31,6 +31,8 @@
#include "extent_io.h"
#include "extent_map.h"
#include "async-thread.h"
+#include "hotdata_map.h"
+#include "hotdata_hash.h"

struct btrfs_trans_handle;
struct btrfs_transaction;
@@ -877,6 +879,7 @@ struct btrfs_fs_info {
struct mutex cleaner_mutex;
struct mutex chunk_mutex;
struct mutex volume_mutex;
+
/*
* this protects the ordered operations list only while we are
* processing all of the entries on it. This way we make
@@ -950,6 +953,7 @@ struct btrfs_fs_info {
struct btrfs_workers endio_meta_write_workers;
struct btrfs_workers endio_write_workers;
struct btrfs_workers submit_workers;
+
/*
* fixup workers take dirty pages that didn't properly go through
* the cow mechanism and make them safe to write. It happens
@@ -958,6 +962,7 @@ struct btrfs_fs_info {
struct btrfs_workers fixup_workers;
struct task_struct *transaction_kthread;
struct task_struct *cleaner_kthread;
+
int thread_pool_size;

struct kobject super_kobj;
@@ -1092,6 +1097,15 @@ struct btrfs_root {
/* red-black tree that keeps track of in-memory inodes */
struct rb_root inode_tree;

+ /* red-black tree that keeps track of fs-wide hot data */
+ struct hot_inode_tree hot_inode_tree;
+
+ /* hash map of inode temperature */
+ struct heat_hashlist_entry heat_inode_hl[HEAT_HASH_SIZE];
+
+ /* hash map of range temperature */
+ struct heat_hashlist_entry heat_range_hl[HEAT_HASH_SIZE];
+
/*
* right now this just gets used so that a root has its own devid
* for stat. It may be used for more later
@@ -1192,6 +1206,8 @@ struct btrfs_root {
#define BTRFS_MOUNT_NOSSD (1 << 9)
#define BTRFS_MOUNT_DISCARD (1 << 10)
#define BTRFS_MOUNT_FORCE_COMPRESS (1 << 11)
+#define BTRFS_MOUNT_HOTDATA_TRACK (1 << 12)
+#define BTRFS_MOUNT_HOTDATA_MOVE (1 << 13)

#define btrfs_clear_opt(o, opt) ((o) &= ~BTRFS_MOUNT_##opt)
#define btrfs_set_opt(o, opt) ((o) |= BTRFS_MOUNT_##opt)
@@ -1211,6 +1227,24 @@ struct btrfs_root {
#define BTRFS_INODE_NODUMP (1 << 8)
#define BTRFS_INODE_NOATIME (1 << 9)
#define BTRFS_INODE_DIRSYNC (1 << 10)
+#define BTRFS_INODE_NO_HOTDATA_TRACK (1 << 11)
+#define BTRFS_INODE_NO_HOTDATA_MOVE (1 << 12)
+
+/* Hot data tracking -- guard macros */
+#define BTRFS_TRACKING_HOT_DATA(btrfs_root) \
+(btrfs_test_opt(btrfs_root, HOTDATA_TRACK))
+
+#define BTRFS_MOVING_HOT_DATA(btrfs_root) \
+((btrfs_test_opt(btrfs_root, HOTDATA_TRACK)) && \
+!(btrfs_root->fs_info->sb->s_flags & MS_RDONLY))
+
+#define BTRFS_TRACK_THIS_INODE(btrfs_inode) \
+((BTRFS_TRACKING_HOT_DATA(btrfs_inode->root)) && \
+!(btrfs_inode->flags & BTRFS_INODE_NO_HOTDATA_TRACK))
+
+#define BTRFS_MOVE_THIS_INODE(btrfs_inode) \
+((BTRFS_MOVING_HOT_DATA(btrfs_inode->root)) && \
+!(btrfs_inode->flags & BTRFS_INODE_NO_HOTDATA_MOVE))

/* some macros to generate set/get funcs for the struct fields. This
* assumes there is a lefoo_to_cpu for every type, so lets make a simple
@@ -2457,6 +2491,14 @@ int btrfs_sysfs_add_root(struct btrfs_root *root);
void btrfs_sysfs_del_root(struct btrfs_root *root);
void btrfs_sysfs_del_super(struct btrfs_fs_info *root);

+#ifdef CONFIG_DEBUG_FS
+/* debugfs.c */
+int btrfs_init_debugfs(void);
+void btrfs_exit_debugfs(void);
+int btrfs_init_debugfs_volume(const char *, struct super_block *);
+void btrfs_exit_debugfs_volume(struct super_block *);
+#endif
+
/* xattr.c */
ssize_t btrfs_listxattr(struct dentry *dentry, char *buffer, size_t size);

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 34f7c37..8f9c866 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -39,6 +39,7 @@
#include "locking.h"
#include "tree-log.h"
#include "free-space-cache.h"
+#include "hotdata_hash.h"

static struct extent_io_ops btree_extent_io_ops;
static void end_workqueue_fn(struct btrfs_work *work);
@@ -893,11 +894,32 @@ int clean_tree_block(struct btrfs_trans_handle *trans, struct btrfs_root *root,
return 0;
}

+static inline void __setup_hotdata(struct btrfs_root *root)
+{
+ int i;
+
+ hot_inode_tree_init(&root->hot_inode_tree);
+
+ memset(&root->heat_inode_hl, 0, sizeof(root->heat_inode_hl));
+ memset(&root->heat_range_hl, 0, sizeof(root->heat_range_hl));
+ for (i = 0; i < HEAT_HASH_SIZE; i++) {
+ INIT_HLIST_HEAD(&root->heat_inode_hl[i].hashhead);
+ INIT_HLIST_HEAD(&root->heat_range_hl[i].hashhead);
+
+ rwlock_init(&root->heat_inode_hl[i].rwlock);
+ rwlock_init(&root->heat_range_hl[i].rwlock);
+
+ root->heat_inode_hl[i].temperature = i;
+ root->heat_range_hl[i].temperature = i;
+ }
+}
+
static int __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
u32 stripesize, struct btrfs_root *root,
struct btrfs_fs_info *fs_info,
u64 objectid)
{
+
root->node = NULL;
root->commit_root = NULL;
root->sectorsize = sectorsize;
@@ -945,6 +967,10 @@ static int __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
memset(&root->root_item, 0, sizeof(root->root_item));
memset(&root->defrag_progress, 0, sizeof(root->defrag_progress));
memset(&root->root_kobj, 0, sizeof(root->root_kobj));
+
+ if (BTRFS_TRACKING_HOT_DATA(root))
+ __setup_hotdata(root);
+
root->defrag_trans_start = fs_info->generation;
init_completion(&root->kobj_unregister);
root->defrag_running = 0;
@@ -2324,6 +2350,9 @@ static void free_fs_root(struct btrfs_root *root)
down_write(&root->anon_super.s_umount);
kill_anon_super(&root->anon_super);
}
+
+ free_heat_hashlists(root);
+ free_hot_inode_tree(root);
free_extent_buffer(root->node);
free_extent_buffer(root->commit_root);
kfree(root->name);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a4080c2..8fa2820 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2468,8 +2468,10 @@ static int extent_write_cache_pages(struct extent_io_tree *tree,
int ret = 0;
int done = 0;
int nr_to_write_done = 0;
+ int nr_written = 0;
struct pagevec pvec;
int nr_pages;
+ u64 start;
pgoff_t index;
pgoff_t end; /* Inclusive */
int scanned = 0;
@@ -2486,6 +2488,7 @@ static int extent_write_cache_pages(struct extent_io_tree *tree,
range_whole = 1;
scanned = 1;
}
+ start = (u64)index << PAGE_CACHE_SHIFT;
retry:
while (!done && !nr_to_write_done && (index <= end) &&
(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
@@ -2547,6 +2550,7 @@ retry:
* at any time
*/
nr_to_write_done = wbc->nr_to_write <= 0;
+ nr_written += 1;
}
pagevec_release(&pvec);
cond_resched();
@@ -2560,6 +2564,20 @@ retry:
index = 0;
goto retry;
}
+
+ /*
+ * i_ino = 1 appears to come from metadata operations, ignore
+ * those writes
+ */
+ if (BTRFS_TRACK_THIS_INODE(BTRFS_I(mapping->host)) &&
+ nr_written > 0 && mapping->host->i_ino > 1) {
+ printk(KERN_DEBUG "btrfs recorded a write %lu, %llu, %lu\n",
+ mapping->host->i_ino, (unsigned long long)start,
+ nr_written * PAGE_CACHE_SIZE);
+ btrfs_update_freqs(mapping->host, start,
+ nr_written * PAGE_CACHE_SIZE, 1);
+ }
+
return ret;
}

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f08427c..010eb29 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -37,6 +37,7 @@
#include <linux/posix_acl.h>
#include <linux/falloc.h>
#include <linux/slab.h>
+#include <linux/pagevec.h>
#include "compat.h"
#include "ctree.h"
#include "disk-io.h"
@@ -50,6 +51,7 @@
#include "tree-log.h"
#include "compression.h"
#include "locking.h"
+#include "hotdata_map.h"

struct btrfs_iget_args {
u64 ino;
@@ -4515,6 +4517,10 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
BTRFS_I(inode)->flags |= BTRFS_INODE_NODATASUM;
if (btrfs_test_opt(root, NODATACOW))
BTRFS_I(inode)->flags |= BTRFS_INODE_NODATACOW;
+ if (!btrfs_test_opt(root, HOTDATA_TRACK))
+ BTRFS_I(inode)->flags |= BTRFS_INODE_NO_HOTDATA_TRACK;
+ if (!btrfs_test_opt(root, HOTDATA_MOVE))
+ BTRFS_I(inode)->flags |= BTRFS_INODE_NO_HOTDATA_MOVE;
}

insert_inode_hash(inode);
@@ -5781,6 +5787,10 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
lockstart = offset;
lockend = offset + count - 1;

+ if (BTRFS_TRACK_THIS_INODE(BTRFS_I(inode)) && count > 0)
+ btrfs_update_freqs(inode, lockstart, (u64) count,
+ writing);
+
if (writing) {
ret = btrfs_delalloc_reserve_space(inode, count);
if (ret)
@@ -5860,7 +5870,15 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
int btrfs_readpage(struct file *file, struct page *page)
{
struct extent_io_tree *tree;
+ u64 start;
+
tree = &BTRFS_I(page->mapping->host)->io_tree;
+ start = (u64) page->index << PAGE_CACHE_SHIFT;
+
+ if (BTRFS_TRACK_THIS_INODE(BTRFS_I(page->mapping->host)))
+ btrfs_update_freqs(page->mapping->host, start,
+ PAGE_CACHE_SIZE, 0);
+
return extent_read_full_page(tree, page, btrfs_get_extent);
}

@@ -5892,7 +5910,16 @@ btrfs_readpages(struct file *file, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages)
{
struct extent_io_tree *tree;
+ u64 start, len;
+
tree = &BTRFS_I(mapping->host)->io_tree;
+ start = (u64) (list_entry(pages->prev, struct page, lru)->index)
+ << PAGE_CACHE_SHIFT;
+ len = nr_pages * PAGE_CACHE_SIZE;
+
+ if (len > 0 && BTRFS_TRACK_THIS_INODE(BTRFS_I(mapping->host)))
+ btrfs_update_freqs(mapping->host, start, len, 0);
+
return extent_readpages(tree, mapping, pages, nr_pages,
btrfs_get_extent);
}
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 859ddaa..db91b38 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -51,6 +51,8 @@
#include "version.h"
#include "export.h"
#include "compression.h"
+#include "hotdata_map.h"
+#include "hotdata_hash.h"

static const struct super_operations btrfs_super_ops;

@@ -59,6 +61,9 @@ static void btrfs_put_super(struct super_block *sb)
struct btrfs_root *root = btrfs_sb(sb);
int ret;

+ if (BTRFS_TRACKING_HOT_DATA(root))
+ btrfs_exit_debugfs_volume(sb);
+
ret = close_ctree(root);
sb->s_fs_info = NULL;
}
@@ -68,7 +73,7 @@ enum {
Opt_nodatacow, Opt_max_inline, Opt_alloc_start, Opt_nobarrier, Opt_ssd,
Opt_nossd, Opt_ssd_spread, Opt_thread_pool, Opt_noacl, Opt_compress,
Opt_compress_force, Opt_notreelog, Opt_ratio, Opt_flushoncommit,
- Opt_discard, Opt_err,
+ Opt_discard, Opt_hotdatatrack, Opt_hotdatamove, Opt_err,
};

static match_table_t tokens = {
@@ -92,6 +97,8 @@ static match_table_t tokens = {
{Opt_flushoncommit, "flushoncommit"},
{Opt_ratio, "metadata_ratio=%d"},
{Opt_discard, "discard"},
+ {Opt_hotdatatrack, "hotdatatrack"},
+ {Opt_hotdatamove, "hotdatamove"},
{Opt_err, NULL},
};

@@ -235,6 +242,18 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
case Opt_discard:
btrfs_set_opt(info->mount_opt, DISCARD);
break;
+ case Opt_hotdatamove:
+ printk(KERN_INFO "btrfs: turning on hot data "
+ "migration\n");
+ printk(KERN_INFO " (implies hotdatatrack, "
+ "no ssd_spread)\n");
+ btrfs_set_opt(info->mount_opt, HOTDATA_MOVE);
+ btrfs_clear_opt(info->mount_opt, SSD_SPREAD);
+ /* fall through: hotdatamove implies hotdatatrack */
+ case Opt_hotdatatrack:
+ printk(KERN_INFO "btrfs: turning on hot data"
+ " tracking\n");
+ btrfs_set_opt(info->mount_opt, HOTDATA_TRACK);
+ break;
case Opt_err:
printk(KERN_INFO "btrfs: unrecognized mount option "
"'%s'\n", p);
@@ -457,6 +476,7 @@ static int btrfs_fill_super(struct super_block *sb,
printk("btrfs: open_ctree failed\n");
return PTR_ERR(tree_root);
}
+
sb->s_fs_info = tree_root;
disk_super = &tree_root->fs_info->super_copy;

@@ -659,6 +679,9 @@ static int btrfs_get_sb(struct file_system_type *fs_type, int flags,
mnt->mnt_sb = s;
mnt->mnt_root = root;

+ if (btrfs_test_opt(btrfs_sb(s), HOTDATA_TRACK))
+ btrfs_init_debugfs_volume(dev_name, s);
+
kfree(subvol_name);
return 0;

@@ -846,18 +869,30 @@ static int __init init_btrfs_fs(void)
if (err)
goto free_sysfs;

- err = extent_io_init();
+ err = btrfs_init_debugfs();
if (err)
goto free_cachep;

+ err = extent_io_init();
+ if (err)
+ goto free_debugfs;
+
err = extent_map_init();
if (err)
goto free_extent_io;

- err = btrfs_interface_init();
+ err = hot_inode_item_init();
if (err)
goto free_extent_map;

+ err = hot_range_item_init();
+ if (err)
+ goto free_hot_inode_item;
+
+ err = btrfs_interface_init();
+ if (err)
+ goto free_hot_range_item;
+
err = register_filesystem(&btrfs_fs_type);
if (err)
goto unregister_ioctl;
@@ -867,10 +902,16 @@ static int __init init_btrfs_fs(void)

unregister_ioctl:
btrfs_interface_exit();
+free_hot_range_item:
+ hot_range_item_exit();
+free_hot_inode_item:
+ hot_inode_item_exit();
free_extent_map:
extent_map_exit();
free_extent_io:
extent_io_exit();
+free_debugfs:
+ btrfs_exit_debugfs();
free_cachep:
btrfs_destroy_cachep();
free_sysfs:
@@ -886,6 +927,7 @@ static void __exit exit_btrfs_fs(void)
btrfs_interface_exit();
unregister_filesystem(&btrfs_fs_type);
btrfs_exit_sysfs();
+ btrfs_exit_debugfs();
btrfs_cleanup_fs_uuids();
btrfs_zlib_exit();
}
--
1.7.1

2010-07-27 22:02:25

by Ben Chociej

[permalink] [raw]
Subject: [RFC PATCH 2/5] Btrfs: Add data structures for hot data tracking

From: Ben Chociej <[email protected]>

Adds hot_inode_tree and hot_range_tree structs to keep track of
frequently accessed files and ranges within files. The trees contain
hot_{inode,range}_items representing those files and ranges, each of
which contains a btrfs_freq_data struct with its frequency-of-access
metrics (number of {reads,writes}, last {read,write} time, and
frequency of {reads,writes}).

Having these trees means that Btrfs can quickly determine the
temperature of some data by doing some calculations on the
btrfs_freq_data struct that hangs off of the tree item.
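
Ranges are tracked at a fixed RANGE_SIZE granularity: offsets are masked down to a range boundary, and an access that spans several ranges bumps each of them (see btrfs_update_range_freq below). A sketch of that bucketing arithmetic, with an illustrative 1 MiB RANGE_SIZE standing in for the value defined in hotdata_map.h:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative granularity; the real RANGE_SIZE is in hotdata_map.h */
#define RANGE_SIZE	(1ULL << 20)
#define RANGE_SIZE_MASK	(~(RANGE_SIZE - 1))

/* How many fixed-size ranges does the access [off, off + len) touch? */
static uint64_t ranges_touched(uint64_t off, uint64_t len)
{
	uint64_t start_off, end_off;

	if (len == 0)		/* guard: off + len - 1 would underflow */
		return 0;
	start_off = off & RANGE_SIZE_MASK;
	end_off = (off + len - 1) & RANGE_SIZE_MASK;
	return (end_off - start_off) / RANGE_SIZE + 1;
}
```

An 8 KiB write straddling a 1 MiB boundary therefore updates two hot_range_items, which is why the update loop steps cur from start_off to end_off in RANGE_SIZE increments.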

Also, since it isn't entirely obvious: the "frequency" of reads or
writes is determined by taking a kind of generalized average of the
last few (2^N for some tunable N) reads or writes.
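
Concretely, that generalized average is a power-of-two exponential decay: each new inter-access delta replaces 1/2^N of the old average. A self-contained sketch of the arithmetic (mirroring btrfs_update_freq in this patch, with BTRFS_FREQ_POWER = 4; the helper name is ours):

```c
#include <assert.h>
#include <stdint.h>

#define BTRFS_FREQ_POWER 4

/*
 * Fold one new inter-access delta into the running average:
 * new_avg = ((avg << P) - avg + delta) >> P, i.e. (15*avg + delta)/16
 * for P = 4. btrfs_update_freq feeds in the timespec delta already
 * right-shifted by BTRFS_FREQ_POWER.
 */
static uint64_t fold_delta(uint64_t avg, uint64_t shifted_delta)
{
	avg = (avg << BTRFS_FREQ_POWER) - avg + shifted_delta;
	return avg >> BTRFS_FREQ_POWER;
}
```

A steady stream of identical deltas leaves the average fixed at that delta, while a burst of fast accesses pulls it down by 1/16 per access, so recent behavior dominates without the cost of storing a history.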

Signed-off-by: Ben Chociej <[email protected]>
Signed-off-by: Matt Lupfer <[email protected]>
Signed-off-by: Conor Scott <[email protected]>
Reviewed-by: Mingming Cao <[email protected]>
Reviewed-by: Steve French <[email protected]>
---
fs/btrfs/hotdata_map.c | 660 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/hotdata_map.h | 118 +++++++++
2 files changed, 778 insertions(+), 0 deletions(-)
create mode 100644 fs/btrfs/hotdata_map.c
create mode 100644 fs/btrfs/hotdata_map.h

diff --git a/fs/btrfs/hotdata_map.c b/fs/btrfs/hotdata_map.c
new file mode 100644
index 0000000..77a560e
--- /dev/null
+++ b/fs/btrfs/hotdata_map.c
@@ -0,0 +1,660 @@
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/hardirq.h>
+#include "ctree.h"
+#include "hotdata_map.h"
+#include "hotdata_hash.h"
+#include "btrfs_inode.h"
+
+/* kmem_cache pointers for slab caches */
+static struct kmem_cache *hot_inode_item_cache;
+static struct kmem_cache *hot_range_item_cache;
+
+struct hot_inode_item *btrfs_update_inode_freq(struct btrfs_inode *inode,
+ int create);
+struct hot_range_item *btrfs_update_range_freq(struct hot_inode_item *he,
+ u64 off, u64 len, int create,
+ struct btrfs_root *root);
+/* init hot_inode_item kmem cache */
+int __init hot_inode_item_init(void)
+{
+ hot_inode_item_cache = kmem_cache_create("hot_inode_item",
+ sizeof(struct hot_inode_item), 0,
+ SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
+ if (!hot_inode_item_cache)
+ return -ENOMEM;
+ return 0;
+}
+
+/* init hot_range_item kmem cache */
+int __init hot_range_item_init(void)
+{
+ hot_range_item_cache = kmem_cache_create("hot_range_item",
+ sizeof(struct hot_range_item), 0,
+ SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
+ if (!hot_range_item_cache)
+ return -ENOMEM;
+ return 0;
+}
+
+void hot_inode_item_exit(void)
+{
+ if (hot_inode_item_cache)
+ kmem_cache_destroy(hot_inode_item_cache);
+}
+
+void hot_range_item_exit(void)
+{
+ if (hot_range_item_cache)
+ kmem_cache_destroy(hot_range_item_cache);
+}
+
+
+/* Initialize the inode tree */
+void hot_inode_tree_init(struct hot_inode_tree *tree)
+{
+ tree->map = RB_ROOT;
+ rwlock_init(&tree->lock);
+}
+
+/* Initialize the hot range tree */
+void hot_range_tree_init(struct hot_range_tree *tree)
+{
+ tree->map = RB_ROOT;
+ rwlock_init(&tree->lock);
+}
+
+/* Allocate a new hot_inode_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using free_hot_inode_item() */
+struct hot_inode_item *alloc_hot_inode_item(unsigned long ino)
+{
+ struct hot_inode_item *he;
+ he = kmem_cache_alloc(hot_inode_item_cache, GFP_NOFS);
+ if (!he)
+ return NULL;
+
+ atomic_set(&he->refs, 1);
+ he->in_tree = 0;
+ he->i_ino = ino;
+ he->heat_node = alloc_heat_hashlist_node(GFP_NOFS);
+ he->heat_node->freq_data = &he->freq_data;
+ he->freq_data.avg_delta_reads = (u64) -1;
+ he->freq_data.avg_delta_writes = (u64) -1;
+ he->freq_data.nr_reads = 0;
+ he->freq_data.nr_writes = 0;
+ he->freq_data.flags = FREQ_DATA_TYPE_INODE;
+ hot_range_tree_init(&he->hot_range_tree);
+
+ spin_lock_init(&he->lock);
+
+ return he;
+}
+
+/* Allocate a new hot_range_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using free_hot_range_item() */
+struct hot_range_item *alloc_hot_range_item(u64 start, u64 len)
+{
+ struct hot_range_item *hr;
+ hr = kmem_cache_alloc(hot_range_item_cache, GFP_NOFS);
+ if (!hr)
+ return NULL;
+ atomic_set(&hr->refs, 1);
+ hr->in_tree = 0;
+ hr->start = start & RANGE_SIZE_MASK;
+ hr->len = len;
+ hr->heat_node = alloc_heat_hashlist_node(GFP_NOFS);
+ hr->heat_node->freq_data = &hr->freq_data;
+ hr->freq_data.avg_delta_reads = (u64) -1;
+ hr->freq_data.avg_delta_writes = (u64) -1;
+ hr->freq_data.nr_reads = 0;
+ hr->freq_data.nr_writes = 0;
+ hr->freq_data.flags = FREQ_DATA_TYPE_RANGE;
+
+ spin_lock_init(&hr->lock);
+
+ return hr;
+}
+
+/* Drop the reference count on a hot_inode_item by one and free the
+ * structure if the count hits zero */
+void free_hot_inode_item(struct hot_inode_item *he)
+{
+ if (!he)
+ return;
+ if (atomic_dec_and_test(&he->refs)) {
+ WARN_ON(he->in_tree);
+ kmem_cache_free(hot_inode_item_cache, he);
+ }
+}
+
+/* Drop the reference count on a hot_range_item by one and free the
+ * structure if the count hits zero */
+void free_hot_range_item(struct hot_range_item *hr)
+{
+ if (!hr)
+ return;
+ if (atomic_dec_and_test(&hr->refs)) {
+ WARN_ON(hr->in_tree);
+ kmem_cache_free(hot_range_item_cache, hr);
+ }
+}
+
+/* Frees the entire hot_inode_tree. Called by free_fs_root */
+void free_hot_inode_tree(struct btrfs_root *root)
+{
+ struct rb_node *node, *node2;
+ struct hot_inode_item *he;
+ struct hot_range_item *hr;
+
+ /* Free hot inode and range trees on fs root */
+ node = rb_first(&root->hot_inode_tree.map);
+
+ while (node) {
+ he = rb_entry(node, struct hot_inode_item,
+ rb_node);
+
+ node2 = rb_first(&he->hot_range_tree.map);
+
+ while (node2) {
+ hr = rb_entry(node2, struct hot_range_item,
+ rb_node);
+ remove_hot_range_item(&he->hot_range_tree, hr);
+ free_hot_range_item(hr);
+ node2 = rb_first(&he->hot_range_tree.map);
+ }
+
+ remove_hot_inode_item(&root->hot_inode_tree, he);
+ free_hot_inode_item(he);
+ node = rb_first(&root->hot_inode_tree.map);
+ }
+}
+
+static struct rb_node *tree_insert_inode_item(struct rb_root *root,
+ unsigned long inode_num,
+ struct rb_node *node)
+{
+ struct rb_node **p = &root->rb_node;
+ struct rb_node *parent = NULL;
+ struct hot_inode_item *entry;
+
+ /* walk tree to find insertion point */
+ while (*p) {
+ parent = *p;
+ entry = rb_entry(parent, struct hot_inode_item, rb_node);
+
+ if (inode_num < entry->i_ino)
+ p = &(*p)->rb_left;
+ else if (inode_num > entry->i_ino)
+ p = &(*p)->rb_right;
+ else
+ return parent;
+ }
+
+ entry = rb_entry(node, struct hot_inode_item, rb_node);
+ entry->in_tree = 1;
+ rb_link_node(node, parent, p);
+ rb_insert_color(node, root);
+ return NULL;
+}
+
+static u64 range_map_end(struct hot_range_item *hr)
+{
+ if (hr->start + hr->len < hr->start)
+ return (u64)-1;
+ return hr->start + hr->len;
+}
+
+static struct rb_node *tree_insert_range_item(struct rb_root *root,
+ u64 start,
+ struct rb_node *node)
+{
+ struct rb_node **p = &root->rb_node;
+ struct rb_node *parent = NULL;
+ struct hot_range_item *entry;
+
+
+ /* walk tree to find insertion point */
+ while (*p) {
+ parent = *p;
+ entry = rb_entry(parent, struct hot_range_item, rb_node);
+
+ if (start < entry->start)
+ p = &(*p)->rb_left;
+ else if (start >= range_map_end(entry))
+ p = &(*p)->rb_right;
+ else
+ return parent;
+ }
+
+ entry = rb_entry(node, struct hot_range_item, rb_node);
+ entry->in_tree = 1;
+ rb_link_node(node, parent, p);
+ rb_insert_color(node, root);
+ return NULL;
+}
+
+/* Add a hot_inode_item to a hot_inode_tree. If the tree already contains
+ * an item with the index given, return -EEXIST */
+int add_hot_inode_item(struct hot_inode_tree *tree,
+ struct hot_inode_item *he)
+{
+ int ret = 0;
+ struct rb_node *rb;
+ struct hot_inode_item *exist;
+
+ exist = lookup_hot_inode_item(tree, he->i_ino);
+ if (exist) {
+ free_hot_inode_item(exist);
+ ret = -EEXIST;
+ goto out;
+ }
+ rb = tree_insert_inode_item(&tree->map, he->i_ino, &he->rb_node);
+ if (rb) {
+ ret = -EEXIST;
+ goto out;
+ }
+ atomic_inc(&he->refs);
+out:
+ return ret;
+}
+
+/* Add a hot_range_item to a hot_range_tree. If the tree already contains
+ * an item with the index given, return -EEXIST
+ * Also optionally aggressively merge ranges (currently disabled) */
+int add_hot_range_item(struct hot_range_tree *tree,
+ struct hot_range_item *hr)
+{
+ int ret = 0;
+ struct rb_node *rb;
+ struct hot_range_item *exist;
+ /* struct hot_range_item *merge = NULL; */
+
+ exist = lookup_hot_range_item(tree, hr->start);
+ if (exist) {
+ free_hot_range_item(exist);
+ ret = -EEXIST;
+ goto out;
+ }
+ rb = tree_insert_range_item(&tree->map, hr->start, &hr->rb_node);
+ if (rb) {
+ ret = -EEXIST;
+ goto out;
+ }
+
+ atomic_inc(&hr->refs);
+
+out:
+ return ret;
+}
+
+/* Lookup a hot_inode_item in the hot_inode_tree with the given index
+ * (inode_num) */
+struct hot_inode_item *lookup_hot_inode_item(struct hot_inode_tree *tree,
+ unsigned long inode_num)
+{
+ struct rb_node **p = &(tree->map.rb_node);
+ struct rb_node *parent = NULL;
+ struct hot_inode_item *entry;
+
+ while (*p) {
+ parent = *p;
+ entry = rb_entry(parent, struct hot_inode_item, rb_node);
+
+ if (inode_num < entry->i_ino)
+ p = &(*p)->rb_left;
+ else if (inode_num > entry->i_ino)
+ p = &(*p)->rb_right;
+ else {
+ atomic_inc(&entry->refs);
+ return entry;
+ }
+ }
+
+ return NULL;
+}
+
+/* Lookup a hot_range_item in a hot_range_tree with the given
+ * start offset */
+struct hot_range_item *lookup_hot_range_item(struct hot_range_tree *tree,
+ u64 start)
+{
+ struct rb_node **p = &(tree->map.rb_node);
+ struct rb_node *parent = NULL;
+ struct hot_range_item *entry;
+
+ /* ensure start is on a range boundary */
+ start = start & RANGE_SIZE_MASK;
+
+ while (*p) {
+ parent = *p;
+ entry = rb_entry(parent, struct hot_range_item, rb_node);
+
+ if (start < entry->start)
+ p = &(*p)->rb_left;
+ else if (start >= range_map_end(entry))
+ p = &(*p)->rb_right;
+ else {
+ atomic_inc(&entry->refs);
+ return entry;
+ }
+ }
+ return NULL;
+}
+
+int remove_hot_inode_item(struct hot_inode_tree *tree,
+ struct hot_inode_item *he)
+{
+ int ret = 0;
+ rb_erase(&he->rb_node, &tree->map);
+ he->in_tree = 0;
+ return ret;
+}
+
+int remove_hot_range_item(struct hot_range_tree *tree,
+ struct hot_range_item *hr)
+{
+ int ret = 0;
+ rb_erase(&hr->rb_node, &tree->map);
+ hr->in_tree = 0;
+ return ret;
+}
+
+/* main function to update access frequency from read/writepage(s) hooks */
+inline void btrfs_update_freqs(struct inode *inode, u64 start,
+ u64 len, int create)
+{
+ struct hot_inode_item *he;
+ struct hot_range_item *hr;
+ struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
+
+ he = btrfs_update_inode_freq(btrfs_inode, create);
+
+ WARN_ON(!he || IS_ERR(he));
+
+ if (he && !IS_ERR(he)) {
+ hr = btrfs_update_range_freq(he, start, len,
+ create, btrfs_inode->root);
+ WARN_ON(!hr || IS_ERR(hr));
+
+
+ /*
+ * drop refcounts on inode/range items:
+ */
+
+ free_hot_inode_item(he);
+
+ if (hr && !IS_ERR(hr))
+ free_hot_range_item(hr);
+ }
+
+}
+
+/* Update inode frequency struct */
+struct hot_inode_item *btrfs_update_inode_freq(struct btrfs_inode *inode,
+ int create)
+{
+ struct hot_inode_tree *hitree = &inode->root->hot_inode_tree;
+ struct hot_inode_item *he;
+ struct btrfs_root *root = inode->root;
+
+ read_lock(&hitree->lock);
+ he = lookup_hot_inode_item(hitree, inode->vfs_inode.i_ino);
+ read_unlock(&hitree->lock);
+
+ if (!he) {
+ he = alloc_hot_inode_item(inode->vfs_inode.i_ino);
+
+ if (!he || IS_ERR(he))
+ goto out;
+
+ write_lock(&hitree->lock);
+ add_hot_inode_item(hitree, he);
+ write_unlock(&hitree->lock);
+ }
+
+ spin_lock(&he->lock);
+ btrfs_update_freq(&he->freq_data, create);
+ /*
+ * printk(KERN_DEBUG "btrfs_update_inode_freq avd_r: %llu,"
+ * " avd_w: %llu\n",
+ * he->freq_data.avg_delta_reads,
+ * he->freq_data.avg_delta_writes);
+ */
+ spin_unlock(&he->lock);
+
+ /* will get its own lock(s) */
+ btrfs_update_heat_index(&he->freq_data, root);
+
+out:
+ return he;
+}
+
+/* Update range frequency struct */
+struct hot_range_item *btrfs_update_range_freq(struct hot_inode_item *he,
+ u64 off, u64 len, int create,
+ struct btrfs_root *root)
+{
+ struct hot_range_tree *hrtree = &he->hot_range_tree;
+ struct hot_range_item *hr = NULL;
+ u64 start_off = off & RANGE_SIZE_MASK;
+ u64 end_off = (off + len - 1) & RANGE_SIZE_MASK;
+ u64 cur;
+
+ /*
+ * Align ranges on RANGE_SIZE boundary to prevent proliferation
+ * of range structs
+ */
+ for (cur = start_off; cur <= end_off; cur += RANGE_SIZE) {
+ read_lock(&hrtree->lock);
+ hr = lookup_hot_range_item(hrtree, cur);
+ read_unlock(&hrtree->lock);
+
+ if (!hr) {
+ hr = alloc_hot_range_item(cur, RANGE_SIZE);
+
+ if (!hr || IS_ERR(hr))
+ goto out;
+
+ write_lock(&hrtree->lock);
+ add_hot_range_item(hrtree, hr);
+ write_unlock(&hrtree->lock);
+ }
+
+ spin_lock(&hr->lock);
+ btrfs_update_freq(&hr->freq_data, create);
+ /*
+ * printk(KERN_DEBUG "btrfs_update_range_freq avd_r: %llu,"
+ * " avd_w: %llu\n",
+ * he->freq_data.avg_delta_reads,
+ * he->freq_data.avg_delta_writes);
+ */
+ spin_unlock(&hr->lock);
+
+
+ /* will get its own locks */
+ btrfs_update_heat_index(&hr->freq_data, root);
+ }
+out:
+ return hr;
+}
+
+/*
+ * This function does the actual work of updating the frequency numbers,
+ * whatever they turn out to be. BTRFS_FREQ_POWER determines how many atime
+ * deltas we keep track of (as a power of 2). So, setting it to anything above
+ * 16ish is probably overkill. Also, the higher the power, the more bits get
+ * right shifted out of the timestamp, reducing precision, so take note of that
+ * as well.
+ *
+ * The caller (which is probably btrfs_update_freq) should have already locked
+ * fdata's parent's spinlock.
+ */
+#define BTRFS_FREQ_POWER 4
+void btrfs_update_freq(struct btrfs_freq_data *fdata, int create)
+{
+ struct timespec old_atime;
+ struct timespec current_time;
+ struct timespec delta_ts;
+ u64 new_avg;
+ u64 new_delta;
+
+ if (unlikely(create)) {
+ old_atime = fdata->last_write_time;
+ fdata->nr_writes += 1;
+ new_avg = fdata->avg_delta_writes;
+ } else {
+ old_atime = fdata->last_read_time;
+ fdata->nr_reads += 1;
+ new_avg = fdata->avg_delta_reads;
+ }
+
+ current_time = current_kernel_time();
+ delta_ts = timespec_sub(current_time, old_atime);
+ new_delta = timespec_to_ns(&delta_ts) >> BTRFS_FREQ_POWER;
+
+ new_avg = (new_avg << BTRFS_FREQ_POWER) - new_avg + new_delta;
+ new_avg = new_avg >> BTRFS_FREQ_POWER;
+
+ if (unlikely(create)) {
+ fdata->last_write_time = current_time;
+ fdata->avg_delta_writes = new_avg;
+ } else {
+ fdata->last_read_time = current_time;
+ fdata->avg_delta_reads = new_avg;
+ }
+}
+
+/*
+ * Get a new temperature and, if necessary, move the heat_node corresponding
+ * to this inode or range to the proper hashlist with the new temperature
+ */
+void btrfs_update_heat_index(struct btrfs_freq_data *fdata,
+ struct btrfs_root *root)
+{
+ int temp = 0;
+ int moved = 0;
+ struct heat_hashlist_entry *buckets, *current_bucket = NULL;
+ struct hot_inode_item *he;
+ struct hot_range_item *hr;
+
+ if (fdata->flags & FREQ_DATA_TYPE_INODE) {
+ he = freq_data_get_he(fdata);
+ buckets = root->heat_inode_hl;
+
+ if (he == NULL)
+ return;
+
+ spin_lock(&he->lock);
+ temp = btrfs_get_temp(fdata);
+ spin_unlock(&he->lock);
+
+ if (he->heat_node->hlist == NULL) {
+ current_bucket = buckets + temp;
+ moved = 1;
+ } else {
+ current_bucket = he->heat_node->hlist;
+ if (current_bucket->temperature != temp) {
+ write_lock(&current_bucket->rwlock);
+ hlist_del(&he->heat_node->hashnode);
+ write_unlock(&current_bucket->rwlock);
+ current_bucket = buckets + temp;
+ moved = 1;
+ }
+ }
+
+ if (moved) {
+ write_lock(&current_bucket->rwlock);
+ hlist_add_head(&he->heat_node->hashnode,
+ &current_bucket->hashhead);
+ he->heat_node->hlist = current_bucket;
+ write_unlock(&current_bucket->rwlock);
+ }
+
+ } else if (fdata->flags & FREQ_DATA_TYPE_RANGE) {
+ hr = freq_data_get_hr(fdata);
+ buckets = root->heat_range_hl;
+
+ if (hr == NULL)
+ return;
+
+ spin_lock(&hr->lock);
+ temp = btrfs_get_temp(fdata);
+ spin_unlock(&hr->lock);
+
+ if (hr->heat_node->hlist == NULL) {
+ current_bucket = buckets + temp;
+ moved = 1;
+ } else {
+ current_bucket = hr->heat_node->hlist;
+ if (current_bucket->temperature != temp) {
+ write_lock(&current_bucket->rwlock);
+ hlist_del(&hr->heat_node->hashnode);
+ write_unlock(&current_bucket->rwlock);
+ current_bucket = buckets + temp;
+ moved = 1;
+ }
+ }
+
+ if (moved) {
+ write_lock(&current_bucket->rwlock);
+ hlist_add_head(&hr->heat_node->hashnode,
+ &current_bucket->hashhead);
+ hr->heat_node->hlist = current_bucket;
+ write_unlock(&current_bucket->rwlock);
+ }
+ }
+}
+
+/* Walk the hot_inode_tree, locking as necessary */
+struct hot_inode_item *find_next_hot_inode(struct btrfs_root *root,
+ u64 objectid)
+{
+ struct rb_node *node;
+ struct rb_node *prev;
+ struct hot_inode_item *entry;
+
+ read_lock(&root->hot_inode_tree.lock);
+
+ node = root->hot_inode_tree.map.rb_node;
+ prev = NULL;
+ while (node) {
+ prev = node;
+ entry = rb_entry(node, struct hot_inode_item, rb_node);
+
+ if (objectid < entry->i_ino)
+ node = node->rb_left;
+ else if (objectid > entry->i_ino)
+ node = node->rb_right;
+ else
+ break;
+ }
+ if (!node) {
+ while (prev) {
+ entry = rb_entry(prev, struct hot_inode_item, rb_node);
+ if (objectid <= entry->i_ino) {
+ node = prev;
+ break;
+ }
+ prev = rb_next(prev);
+ }
+ }
+ if (node) {
+ entry = rb_entry(node, struct hot_inode_item, rb_node);
+ /*
+ * Increase reference count to prevent pruning while the
+ * caller is using the hot_inode_item.
+ */
+ atomic_inc(&entry->refs);
+
+ read_unlock(&root->hot_inode_tree.lock);
+ return entry;
+ }
+
+ read_unlock(&root->hot_inode_tree.lock);
+ return NULL;
+}
diff --git a/fs/btrfs/hotdata_map.h b/fs/btrfs/hotdata_map.h
new file mode 100644
index 0000000..46ae1d6
--- /dev/null
+++ b/fs/btrfs/hotdata_map.h
@@ -0,0 +1,118 @@
+#ifndef __HOTDATAMAP__
+#define __HOTDATAMAP__
+
+#include <linux/rbtree.h>
+
+/* values for btrfs_freq_data flags */
+#define FREQ_DATA_TYPE_INODE 1 /* freq data struct is for an inode */
+#define FREQ_DATA_TYPE_RANGE (1 << 1) /* freq data struct is for a range */
+#define FREQ_DATA_HEAT_HOT (1 << 2) /* freq data struct is for hot data */
+ /* (not implemented) */
+#define RANGE_SIZE (1 << 12)
+#define RANGE_SIZE_MASK (~((u64)(RANGE_SIZE - 1)))
+
+/* macros to wrap container_of()'s for hot data structs */
+#define freq_data_get_he(x) (struct hot_inode_item *) container_of(x, \
+ struct hot_inode_item, freq_data)
+#define freq_data_get_hr(x) (struct hot_range_item *) container_of(x, \
+ struct hot_range_item, freq_data)
+#define rb_node_get_he(x) (struct hot_inode_item *) container_of(x, \
+ struct hot_inode_item, rb_node)
+#define rb_node_get_hr(x) (struct hot_range_item *) container_of(x, \
+ struct hot_range_item, rb_node)
+
+/* A frequency data struct holds values that are used to
+ * determine temperature of files and file ranges. These structs
+ * are members of hot_inode_item and hot_range_item */
+struct btrfs_freq_data {
+ struct timespec last_read_time;
+ struct timespec last_write_time;
+ u32 nr_reads;
+ u32 nr_writes;
+ u64 avg_delta_reads;
+ u64 avg_delta_writes;
+ u8 flags;
+};
+
+/* A tree that sits on the fs_root */
+struct hot_inode_tree {
+ struct rb_root map;
+ rwlock_t lock;
+};
+
+/* A tree of ranges for each inode in the hot_inode_tree */
+struct hot_range_tree {
+ struct rb_root map;
+ rwlock_t lock;
+};
+
+/* An item representing an inode and its access frequency */
+struct hot_inode_item {
+ struct rb_node rb_node; /* node for hot_inode_tree rb_tree */
+ unsigned long i_ino; /* inode number, copied from vfs_inode */
+ struct hot_range_tree hot_range_tree; /* tree of ranges in this
+ inode */
+ struct btrfs_freq_data freq_data; /* frequency data for this inode */
+ spinlock_t lock; /* protects freq_data, i_ino, in_tree */
+ atomic_t refs; /* prevents kfree */
+ u8 in_tree; /* used to check for errors in ref counting */
+ struct heat_hashlist_node *heat_node; /* hashlist node for this
+ inode */
+};
+
+/* An item representing a range inside of an inode whose frequency
+ * is being tracked */
+struct hot_range_item {
+ struct rb_node rb_node; /* node for hot_range_tree rb_tree */
+ u64 start; /* starting offset of this range */
+ u64 len; /* length of this range */
+ struct btrfs_freq_data freq_data; /* frequency data for this range */
+ spinlock_t lock; /* protects freq_data, start, len, in_tree */
+ atomic_t refs; /* prevents kfree */
+ u8 in_tree; /* used to check for errors in ref counting */
+ struct heat_hashlist_node *heat_node; /* hashlist node for this
+ range */
+};
+
+struct btrfs_root;
+struct inode;
+
+void hot_inode_tree_init(struct hot_inode_tree *tree);
+void hot_range_tree_init(struct hot_range_tree *tree);
+
+struct hot_range_item *lookup_hot_range_item(struct hot_range_tree *tree,
+ u64 start);
+
+struct hot_inode_item *lookup_hot_inode_item(struct hot_inode_tree *tree,
+ unsigned long inode_num);
+
+int add_hot_inode_item(struct hot_inode_tree *tree,
+ struct hot_inode_item *he);
+int add_hot_range_item(struct hot_range_tree *tree,
+ struct hot_range_item *hr);
+
+int remove_hot_inode_item(struct hot_inode_tree *tree,
+ struct hot_inode_item *he);
+int remove_hot_range_item(struct hot_range_tree *tree,
+ struct hot_range_item *hr);
+
+struct hot_inode_item *alloc_hot_inode_item(unsigned long ino);
+struct hot_range_item *alloc_hot_range_item(u64 start, u64 len);
+
+void free_hot_inode_item(struct hot_inode_item *he);
+void free_hot_range_item(struct hot_range_item *hr);
+void free_hot_inode_tree(struct btrfs_root *root);
+
+int __init hot_inode_item_init(void);
+int __init hot_range_item_init(void);
+
+void hot_inode_item_exit(void);
+void hot_range_item_exit(void);
+
+struct hot_inode_item *find_next_hot_inode(struct btrfs_root *root,
+ u64 objectid);
+void btrfs_update_freq(struct btrfs_freq_data *fdata, int create);
+void btrfs_update_freqs(struct inode *inode, u64 start, u64 len,
+ int create);
+
+#endif
--
1.7.1

2010-07-27 22:50:51

by Tracy R Reed

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Btrfs: Add hot data tracking functionality

On Tue, Jul 27, 2010 at 05:00:18PM -0500, [email protected] spake thusly:
> The long-term goal of these patches, as discussed in the Motivation
> section at the end of this message, is to enable Btrfs to perform
> automagic relocation of hot data to fast media like SSD. This goal has
> been motivated by the Project Ideas page on the Btrfs wiki.

With disks being so highly virtualized away these days, is there any
way for btrfs to know which are the fast outer tracks vs. the slower
inner tracks of a physical disk? If so, not only could this benefit SSD
owners, but it could also benefit the many more spinning platters out
there. If not (which wouldn't be surprising), then disregard. Even just
having that sort of functionality for SSD would be excellent. If I
understand correctly, not only would this work for SSD, but if I have a
SAN full of many large 7200rpm disks and a few 15k SAS disks, I could
effectively utilize those disks by allowing btrfs to place hot data on
the 15k SAS. I understand Compellent does this as well.

--
Tracy Reed
http://tracyreed.org



2010-07-27 23:11:06

by Diego Calleja

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Btrfs: Add hot data tracking functionality

On Wednesday, 28 July 2010 00:00:18, [email protected] wrote:
> With Btrfs's COW approach, an external cache (where data is moved to
> SSD, rather than just cached there) makes a lot of sense. Though these

As I understand it, what your project intends to do is to move "hot"
data to an SSD which would be part of a Btrfs pool, and not do any
kind of SSD caching, as bcache (http://lwn.net/Articles/394672/) does?

2010-07-27 23:18:53

by Ben Chociej

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Btrfs: Add hot data tracking functionality

On Tue, Jul 27, 2010 at 6:10 PM, Diego Calleja <[email protected]> wrote:
> On Wednesday, 28 July 2010 00:00:18, [email protected] wrote:
>> With Btrfs's COW approach, an external cache (where data is moved to
>> SSD, rather than just cached there) makes a lot of sense. Though these
>
> As I understand it, what your proyect intends to do is to move "hot"
> data to a SSD which would be part of a Btrfs pool, and not do any
> kind of SSD caching, as bcache (http://lwn.net/Articles/394672/) does?
>

Yes, that's correct. It's likely not going to be a cache in the
traditional sense, since the entire capacity of both HDD and SSD would
be available.

BC

2010-07-27 23:34:40

by Christian Stroetmann

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Btrfs: Add hot data tracking functionality

On 28.07.2010 00:00, Ben Chociej wrote:
> INTRODUCTION:
>
> This patch series adds experimental support for tracking data
> temperature in Btrfs. Essentially, this means maintaining some key
> stats (like number of reads/writes, last read/write time, frequency of
> reads/writes), then distilling those numbers down to a single
> "temperature" value that reflects what data is "hot."
>
> The long-term goal of these patches, as discussed in the Motivation
> section at the end of this message, is to enable Btrfs to perform
> automagic relocation of hot data to fast media like SSD. This goal has
> been motivated by the Project Ideas page on the Btrfs wiki.
>
> Of course, users are warned not to run this code outside of development
> environments. These patches are EXPERIMENTAL, and as such they might
> eat your data and/or memory.
>
>
> MOTIVATION:
>
> The overall goal of enabling hot data relocation to SSD has been
> motivated by the Project Ideas page on the Btrfs wiki at
> https://btrfs.wiki.kernel.org/index.php/Project_ideas. It is hoped that
> this initial patchset will eventually mature into a usable hybrid
> storage feature set for Btrfs.
>
> This is essentially the traditional cache argument: SSD is fast and
> expensive; HDD is cheap but slow. ZFS, for example, can already take
> advantage of SSD caching. Btrfs should also be able to take advantage
> of hybrid storage without any broad, sweeping changes to existing code.
>

Wouldn't this feature be useful for other file systems as well, so that
a more general solution, not one tied only to Btrfs, would be preferable?

> With Btrfs's COW approach, an external cache (where data is *moved* to
> SSD, rather than just cached there) makes a lot of sense. Though these
> patches don't enable any relocation yet, they do lay an essential
> foundation for enabling that functionality in the near future. We plan
> to roll out an additional patchset introducing some of the automatic
> migration functionality in the next few weeks.
>
>

With all the best
Christian Stroetmann

2010-07-28 12:36:12

by Chris Samuel

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Btrfs: Add hot data tracking functionality

On Wed, 28 Jul 2010 09:18:23 am Ben Chociej wrote:

> Yes, that's correct. It's likely not going to be a cache in the
> traditional sense, since the entire capacity of both HDD and SSD would
> be available.

To me that sounds like an HSM-type arrangement, with the most frequently
used data on the highest-performing media and less frequently touched data
getting shunted down the chain to SAS, SATA and then tape and/or MAID type devices.

Certainly interesting from my HPC point of view in that I can see it being
useful to parallel filesystems like Ceph if this "just happens".

I guess real HSM devotees would want policies for migration downstream.. ;-)

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP



2010-07-28 21:23:19

by Mingming Cao

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Btrfs: Add hot data tracking functionality

On Tue, 2010-07-27 at 15:29 -0700, Tracy Reed wrote:
> On Tue, Jul 27, 2010 at 05:00:18PM -0500, [email protected] spake thusly:
> > The long-term goal of these patches, as discussed in the Motivation
> > section at the end of this message, is to enable Btrfs to perform
> > automagic relocation of hot data to fast media like SSD. This goal has
> > been motivated by the Project Ideas page on the Btrfs wiki.
>
> With disks being so highly virtualized away these days is there any
> way for btrfs to know which are the fast outer-tracks vs the slower
> inner-tracks of a physical disk? If so not only could this benefit SSD
> owners but it could also benefit the many more spinning platters out
> there. If not (which wouldn't be surprising) then disregard. Even just
> having that sort of functionality for SSD would be excellent. If I
> understand correctly not only would this work for SSD but if I have a
> SAN full of many large 7200rpm disks and a few 15k SAS disks I could
> effectively utilize that disk by allowing btrfs to place hot data on
> the 15k SAS. I understand Compellent does this as well.
>

This is certainly possible. The disk that stores hot data does not have
to be an SSD; the current implementation detects a "fast" device by
checking the SSD rotation flag. This could easily be extended if btrfs
were able to detect relatively fast devices in general.

2010-07-28 22:00:52

by Mingming Cao

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Btrfs: Add hot data tracking functionality

On Wed, 2010-07-28 at 01:38 +0200, Christian Stroetmann wrote:
> At the 28.07.2010 00:00, Ben Chociej wrote:
> > INTRODUCTION:
> >
> > This patch series adds experimental support for tracking data
> > temperature in Btrfs. Essentially, this means maintaining some key
> > stats (like number of reads/writes, last read/write time, frequency of
> > reads/writes), then distilling those numbers down to a single
> > "temperature" value that reflects what data is "hot."
> >
> > The long-term goal of these patches, as discussed in the Motivation
> > section at the end of this message, is to enable Btrfs to perform
> > automagic relocation of hot data to fast media like SSD. This goal has
> > been motivated by the Project Ideas page on the Btrfs wiki.
> >
> > Of course, users are warned not to run this code outside of development
> > environments. These patches are EXPERIMENTAL, and as such they might
> > eat your data and/or memory.
> >
> >
> > MOTIVATION:
> >
> > The overall goal of enabling hot data relocation to SSD has been
> > motivated by the Project Ideas page on the Btrfs wiki at
> > https://btrfs.wiki.kernel.org/index.php/Project_ideas. It is hoped that
> > this initial patchset will eventually mature into a usable hybrid
> > storage feature set for Btrfs.
> >
> > This is essentially the traditional cache argument: SSD is fast and
> > expensive; HDD is cheap but slow. ZFS, for example, can already take
> > advantage of SSD caching. Btrfs should also be able to take advantage
> > of hybrid storage without any broad, sweeping changes to existing code.
> >
>
> Wouldn't this feature be useful for other file systems as well, so that
> a more general and not an only Btrfs related solution is preferable?
>

It would certainly be nice to add this feature to every filesystem, but
right now btrfs is the only fs which has multiple-device support built in.

Mingming
> > With Btrfs's COW approach, an external cache (where data is *moved* to
> > SSD, rather than just cached there) makes a lot of sense. Though these
> > patches don't enable any relocation yet, they do lay an essential
> > foundation for enabling that functionality in the near future. We plan
> > to roll out an additional patchset introducing some of the automatic
> > migration functionality in the next few weeks.
> >
> >
>
> With all the best
> Christian Stroetmann

2010-07-29 12:17:34

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Btrfs: Add hot data tracking functionality

On Wed, Jul 28, 2010 at 03:00:48PM -0700, Mingming Cao wrote:
> On Wed, 2010-07-28 at 01:38 +0200, Christian Stroetmann wrote:
> > At the 28.07.2010 00:00, Ben Chociej wrote:
> > Wouldn't this feature be useful for other file systems as well, so that
> > a more general and not an only Btrfs related solution is preferable?
> >
>
> Would certainly nice to add this feature to all filesystem, but right
> now btrfs is the only fs which have multiple device support in itself.

Why does it even need multiple devices in the filesystem? All the
filesystem needs to know is the relative speed of regions of its
block address space, and to be provided allocation hints. Everything
else is just movement of data. You could keep the speed information
in the device mapper table and add an interface for filesystems to
query it, and then you've got infrastructure that all filesystems
could hook into.

The tracking features don't appear to have anything btrfs-specific
in them, so it seems wrong to implement it there just because you're
only looking at btrfs' method of tracking multiple block devices
and moving blocks....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-07-29 13:13:43

by Christian Stroetmann

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Btrfs: Add hot data tracking functionality

Aloha Mingming, Aloha Dave;

On 29.07.2010 14:17, Dave Chinner wrote:
> On Wed, Jul 28, 2010 at 03:00:48PM -0700, Mingming Cao wrote:
>
>> On Wed, 2010-07-28 at 01:38 +0200, Christian Stroetmann wrote:
>>
>>> At the 28.07.2010 00:00, Ben Chociej wrote:
>>> Wouldn't this feature be useful for other file systems as well, so that
>>> a more general and not an only Btrfs related solution is preferable?
>>>
>>>
>> Would certainly nice to add this feature to all filesystem, but right
>> now btrfs is the only fs which have multiple device support in itself.
>>
>

Thanks for your explanation, Mingming. I had further questions on this
point, but didn't know exactly how to formulate them briefly. Luckily,
Dave has a possible solution that relates to my questions about the
multiple device feature of Btrfs and hot data handling.

> Why does it even need multiple devices in the filesystem?

Yes, that was the point I asked myself after reading about this in the
Btrfs wiki
(https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices)
and how ZFS does it.

> All the
> filesystem needs to know is the relative speed of regions of it's
> block address space and to be provided allocation hints. everything
> else is just movement of data. You could keep the speed information
> in the device mapper table and add an interface for filesystems to
> query it, and then you've got infrastructure that all filesystems
> could hook into.
>

Yes indeed, something like this general solution that doesn't need the
multiple device feature at all.

> The tracking features dont' appear to have anything btrfs specific
> in them, so t iseems wrong to implement it there just because you're
> only looking at btrfs' method of tracking multiple block devices
> and moving blocks....
>
> Cheers,
>
> Dave.
>

Thank you very much for being creative. :D

Cheerio
Christian *<:o) O>-< -(D)>-<

2010-08-04 17:40:31

by Mingming Cao

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Btrfs: Add hot data tracking functionality

On Thu, 2010-07-29 at 22:17 +1000, Dave Chinner wrote:
> On Wed, Jul 28, 2010 at 03:00:48PM -0700, Mingming Cao wrote:
> > On Wed, 2010-07-28 at 01:38 +0200, Christian Stroetmann wrote:
> > > At the 28.07.2010 00:00, Ben Chociej wrote:
> > > Wouldn't this feature be useful for other file systems as well, so that
> > > a more general and not an only Btrfs related solution is preferable?
> > >
> >
> > Would certainly nice to add this feature to all filesystem, but right
> > now btrfs is the only fs which have multiple device support in itself.
>
> Why does it even need multiple devices in the filesystem? All the
> filesystem needs to know is the relative speed of regions of it's
> block address space and to be provided allocation hints. everything
> else is just movement of data. You could keep the speed information
> in the device mapper table and add an interface for filesystems to
> query it, and then you've got infrastructure that all filesystems
> could hook into.
>
> The tracking features dont' appear to have anything btrfs specific
> in them, so t iseems wrong to implement it there just because you're
> only looking at btrfs' method of tracking multiple block devices
> and moving blocks....
>


I agree hot data tracking could be done at the vfs layer. The current hot
data temperature calculation and indexing code is very self-contained,
and could be reused by other filesystems or moved up to the vfs. We could
define a common interface to export the hot data temperature. The
relocation eventually has to be filesystem-specific. Btrfs does COW and
knows directly where the data is on or off the SSD, which makes
relocation in either direction very straightforward.

Mingming

2010-08-04 18:40:11

by Christian Stroetmann

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Btrfs: Add hot data tracking functionality

Hola Everybody;
On 04.08.2010 19:40, Mingming Cao wrote:
> On Thu, 2010-07-29 at 22:17 +1000, Dave Chinner wrote:
>
>> On Wed, Jul 28, 2010 at 03:00:48PM -0700, Mingming Cao wrote:
>>
>>> On Wed, 2010-07-28 at 01:38 +0200, Christian Stroetmann wrote:
>>>
>>>> At the 28.07.2010 00:00, Ben Chociej wrote:
>>>> Wouldn't this feature be useful for other file systems as well, so that
>>>> a more general and not an only Btrfs related solution is preferable?
>>>>
>>>>
>>> Would certainly nice to add this feature to all filesystem, but right
>>> now btrfs is the only fs which have multiple device support in itself.
>>>
>> Why does it even need multiple devices in the filesystem? All the
>> filesystem needs to know is the relative speed of regions of it's
>> block address space and to be provided allocation hints. everything
>> else is just movement of data. You could keep the speed information
>> in the device mapper table and add an interface for filesystems to
>> query it, and then you've got infrastructure that all filesystems
>> could hook into.
>>
>> The tracking features dont' appear to have anything btrfs specific
>> in them, so t iseems wrong to implement it there just because you're
>> only looking at btrfs' method of tracking multiple block devices
>> and moving blocks....
>>
>>
>
> I agree hot data tracking could be done at vfs layer. The current hot
> data temperature calculation and indexing code is very self-contained,
> and could be reuse to other fs or move up to vfs. We could define a
> common interface to export to hot data tempreture out. The relocation
> eventually has to be filesystem specific. btrfs does cow and knows where
> is the data on/off SSD directly makes the relocation to and from very
> straightforward.
>
>

Thanks for the thumbs up. But maybe other interested people should also
give it a thumbs up before the effort of implementing it is invested. I
mentioned this feature because it is a trend that has gained momentum in
the area of (storage) virtualization, and thereby in the fields of cloud
computing and service computing as well.

And: My both thumbs are already up. :D

> Mingming
>
>

Christian *<:o) O>-< -(D)>-<