2013-09-16 22:20:26

by Zhi Yong Wu

Subject: [PATCH v5 00/10] VFS hot tracking

From: Zhi Yong Wu <[email protected]>

This patchset introduces hot tracking in the VFS layer, which
keeps track of real disk I/O in memory. With it, you can easily
see more detail about disk I/O and detect where the disk I/O hot
spots are. A specific filesystem can also make use of it to do
accurate defragmentation, hot relocation, and so on.

This is v5, sent out for external review. Any comments or ideas
are appreciated, thanks.

NOTE:

The patchset can be obtained via my kernel dev git on github:
git://github.com/wuzhy/kernel.git hot_tracking
If you're interested, you can also review them via
https://github.com/wuzhy/kernel/commits/hot_tracking

For usage, further information and the performance report,
please see hot_tracking.txt in Documentation and the following
links:
1.) http://lwn.net/Articles/525651/
2.) https://lkml.org/lkml/2012/12/20/199

The scalability and performance of this patchset have been tested
with fs_mark, ffsb and compilebench.

The perf testing was done on Linux 3.11.0+ with Intel(R) Core(TM)
i7-3770 CPU @ 3.40GHz with 8 CPUs, 16G ram and 260G disk.

Below is the perf testing report:

1. fs_mark test

w/ : with hot tracking
w/o: without hot tracking

Count Size FSUse% (w/, w/o) Files/sec (w/, w/o) App Overhead (w/, w/o)
800000 1 5 5 5606.9 40486.6 7773339 8575934
1600000 1 5 5 1244.8 1194.8 8262292 8253933
2400000 1 6 6 1155.7 997.2 7640679 7854540
3200000 1 7 8 1079.7 1124.0 7373659 8121016
4000000 1 9 9 1169.4 1324.8 7961605 9598549
4800000 1 10 10 1259.8 1331.7 8992159 8743297
5600000 1 11 11 1337.7 1339.3 8675246 8029501
6400000 1 13 13 1346.7 1365.5 8613958 10018455
7200000 1 14 14 1339.8 1423.1 7885932 8466961
8000000 1 15 15 1353.0 1368.6 13543947 9727348
8800000 1 16 17 1460.7 1396.4 8744351 8034638
9600000 1 18 18 1462.9 1415.4 11678864 8557992
10400000 1 19 19 1503.8 1457.6 8984918 9696330
11200000 1 20 20 1521.9 1491.4 8732741 8307835
12000000 1 21 22 1617.7 1556.0 12948158 8776620
12800000 1 23 23 1518.0 1572.3 8470307 8652605
13600000 1 24 24 1595.8 1570.5 11476909 8622940
14400000 1 25 26 1651.8 1722.1 11864599 9646962
15200000 1 26 27 1696.8 1619. 10679127 8472579
16000000 1 28 28 1567.4 1652.3 8756616 8713324
16800000 1 29 29 1599.9 1683.7 10982360 9084005
17600000 1 31 30 1671.3 1699.6 9559853 8388523
18400000 1 32 32 1567.3 1666.7 10576088 11717888
19200000 1 33 33 1668.4 1606.0 8657168 9063387
20000000 1 34 34 1654.1 1521.5 11115008 8384464
20800000 1 36 36 1637.6 1666.2 9964151 8176858
21600000 1 37 37 1598.7 1677.0 8648364 8190571
22400000 1 38 38 1688.8 1674.0 8881927 12847479
23200000 1 39 39 1627.0 1648.2 8707422 9350644
24000000 1 41 41 1704.7 1718.9 9525011 8437322
24800000 1 42 42 1628.2 1649.7 8445795 9195963
25600000 1 43 43 1690.4 1647.3 10444544 10808578
26400000 1 44 44 1597.4 1582.4 8956981 12286644
27200000 1 46 46 1677.7 1710.4 8244101 9492204
28000000 1 47 47 1664.9 1640.9 8860491 8683678
28800000 1 48 48 1608.7 1670.8 8381652 12105478
29600000 1 50 50 1682.0 1652.4 13991121 8630876
30400000 1 51 51 1672.6 1743.2 8853590 10377349
31200000 1 52 52 1648.5 1691.3 11290708 8407930
32000000 1 53 53 1649.5 1708.1 11647884 10120780
32800000 1 55 55 1725.2 1663.4 9641226 10092158
33600000 1 56 56 1662.2 1668.9 12228440 8579953
34400000 1 57 57 1629.7 1688.0 8232209 8290118
35200000 1 59 59 1711.5 1733.5 8175308 9081545
36000000 1 60 60 1670.6 1742.4 9884533 8554858
36800000 1 61 61 1663.0 1654.8 13227858 9112083
37600000 1 62 62 1692.4 1663.0 8590629 8884916
38400000 1 64 64 1691.6 1617.1 9437834 11534400
39200000 1 65 65 1763.5 1646.3 10385440 9854624
40000000 1 66 66 1686.8 1643.8 8860676 9939637
40800000 1 67 67 1542.9 1652.9 9280078 17640321
41600000 1 68 69 1696.2 1655.4 8972165 9473507
42400000 1 70 70 1637.8 1685.2 8294407 8767330
43200000 1 71 71 1712.8 1739.8 14135589 9175591
44000000 1 72 73 1692.4 1632.2 10287428 9130585
44800000 1 73 74 1794.9 1685.0 10727955 9486110
45600000 1 75 75 1438.1 1624.3 8476478 9232791
46400000 1 76 76 1761.2 1768.7 8644609 15745264
47200000 1 77 77 1684.2 1505.7 10269613 12412119
48000000 1 79 79 1647.0 1713.2 8287281 15352189
48800000 1 80 80 1665.7 1675.0 17468300 9012407
49600000 1 81 81 1632.5 1692.5 8178082 8865803
50400000 1 83 83 1584.5 1752.1 12857867 11970443

2. FFSB test

w/ hot tracking w/o hot tracking ratio
v1 v2 (v1-v2)/v2
large_file_create
1 thread
- Trans/sec 28091.76 28126.31 -0.12%
- Throughput 110MB/sec 110MB/sec +0.00%
- %CPU 10.7% 11.2% -4.47%
- Trans/%CPU 2625.4 2511.28 +4.54%

8 threads
- Trans/sec 27980.47 28140.34 -0.57%
- Throughput 109MB/sec 110MB/sec -0.91%
- %CPU 12.3% 12.8% -3.90%
- Trans/%CPU 2274.83 2198.46 +3.47%

16 threads
- Trans/sec 27764.36 27940.96 -0.63%
- Throughput 108MB/sec 109MB/sec -0.92%
- %CPU 12.8% 13.7% -6.57%
- Trans/%CPU 2169.09 2039.49 +6.35%

32 threads
- Trans/sec 27461.82 27624.48 -0.59%
- Throughput 107MB/sec 108MB/sec -0.93%
- %CPU 13.7% 14.4% -4.86%
- Trans/%CPU 2004.51 1918.37 +4.49%

large_file_seq_read
1 thread
- Trans/sec 34121.46 34838.65 -2.06%
- Throughput 133MB/sec 136MB/sec -2.21%
- %CPU 8.8% 8.8% +0.00%
- Trans/%CPU 3877.44 3958.94 -2.06%

8 threads
- Trans/sec 10883.15 11679.40 -6.82%
- Throughput 42.5MB/sec 45.6MB/sec -6.80%
- %CPU 3.3% 3.4% -2.94%
- Trans/%CPU 3297.92 3435.12 -3.99%

16 threads
- Trans/sec 5760.16 6193.20 -6.99%
- Throughput 22.5MB/sec 24.2MB/sec -7.02%
- %CPU 1.8% 1.9% -5.26%
- Trans/%CPU 3200.09 3259.58 -1.83%

32 threads
- Trans/sec 5470.50 5490.12 -0.36%
- Throughput 21.4MB/sec 21.4MB/sec +0.00%
- %CPU 1.7% 1.7% +0.00%
- Trans/%CPU 3217.94 3229.48 -0.36%

random_write
1 thread
- Trans/sec 1611.99 1582.57 +1.86%
- Throughput 220MB/sec 216MB/sec +1.85%
- %CPU 0.6% 0.6% +0.00%
- Trans/%CPU 2686.65 2637.62 +1.86%

8 threads
- Trans/sec 2215.59 2292.57 -3.36%
- Throughput 303MB/sec 313MB/sec -3.39%
- %CPU 1.4% 1.5% -6.67%
- Trans/%CPU 1582.56 1528.38 +3.35%

16 threads
- Trans/sec 2068.52 1935.96 +6.85%
- Throughput 283MB/sec 265MB/sec +6.79%
- %CPU 1.3% 1.3% +0.00%
- Trans/%CPU 1591.17 1464.8 +8.63%

32 threads
- Trans/sec 1764.28 1875.23 -5.92%
- Throughput 241MB/sec 256MB/sec -5.86%
- %CPU 1.2% 1.3% -7.69%
- Trans/%CPU 1470.23 1442.48 +1.92%

random_read
1 thread
- Trans/sec 222.84 224.28 -0.64%
- Throughput 891KB/sec 897KB/sec -0.67%
- %CPU 1.1% 1.0% +10.0%
- Trans/%CPU 202.58 224.28 -9.68%

8 threads
- Trans/sec 143.30 136.47 +5.01%
- Throughput 573KB/sec 546KB/sec +4.95%
- %CPU 0.5% 0.5% +0.00%
- Trans/%CPU 286.60 272.94 +5.01%

16 threads
- Trans/sec 105.17 103.75 +1.37%
- Throughput 421KB/sec 415KB/sec +1.45%
- %CPU 0.5% 0.5% +0.00%
- Trans/%CPU 210.34 207.5 +1.37%

32 threads
- Trans/sec 105.78 103.39 +2.31%
- Throughput 423KB/sec 414KB/sec +2.17%
- %CPU 0.5% 0.5% +0.00%
- Trans/%CPU 211.56 206.78 +2.31%

mail_server
1 thread
- Trans/sec [read] 433.23 446.68 -3.01%
- Throughput [read] 1.7MB/sec 1.75MB/sec -2.86%
- Trans/sec [write] 224.06 213.84 +4.78%
- Throughput [write] 889KB/sec 848KB/sec +4.83%
- %CPU 0.8% 0.8% +0.00%
- Trans/%CPU [read] 541.54 558.35 -3.01%
- Trans/%CPU [write] 280.08 267.3 +4.78%

8 threads
- Trans/sec [read] 430.47 435.84 -1.23%
- Throughput [read] 1.69MB/sec 1.71MB/sec -1.17%
- Trans/sec [write] 198.18 207.61 -4.54%
- Throughput [write] 786KB/sec 823KB/sec -4.50%
- %CPU 0.9% 0.9% +0.00%
- Trans/%CPU [read] 478.3 484.27 -1.23%
- Trans/%CPU [write] 220.2 230.68 -4.54%

16 threads
- Trans/sec [read] 326.05 347.85 -6.27%
- Throughput [read] 1.28MB/sec 1.37MB/sec -6.57%
- Trans/sec [write] 187.69 177.59 +5.69%
- Throughput [write] 744KB/sec 705KB/sec +5.53%
- %CPU 0.9% 0.9% +0.00%
- Trans/%CPU [read] 362.28 386.5 -6.27%
- Trans/%CPU [write] 208.54 197.2 +5.75%

32 threads
- Trans/sec [read] 388.04 419.52 -7.50%
- Throughput [read] 1.53MB/sec 1.65MB/sec -7.27%
- Trans/sec [write] 204.70 207.50 -1.35%
- Throughput [write] 811KB/sec 823KB/sec -1.46%
- %CPU 1.2% 1.2% +0.00%
- Trans/%CPU [read] 323.37 349.6 -7.50%
- Trans/%CPU [write] 170.58 172.92 -1.35%

3. Compilebench test

w/ hot tracking w/o hot tracking ratio
v1 v2 (v1-v2)/v2
initial create 59.33 MB/s 63.25 MB/s -6.20%

create 91.81 MB/s 81.12 MB/s +13.18%

patch 12.39 MB/s 14.94 MB/s -17.07%

compile 470.24 MB/s 442.08 MB/s +6.37%

clean 2205.16 MB/s 1992.06 MB/s +10.70%

read tree 136.77 MB/s 142.41 MB/s -3.96%

read compiled tree 46.83 MB/s 50.08 MB/s -6.49%

delete tree 3.48 seconds 3.02 seconds +15.23%

delete compiled tree 3.94 seconds 3.98 seconds -1.01%

stat tree 1.45 seconds 1.66 seconds -12.65%

stat compiled tree 0.71 seconds 0.86 seconds -17.44%
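
The ratio column in the FFSB and Compilebench tables is (v1-v2)/v2, with v1 measured with hot tracking and v2 without. A minimal C sketch of that arithmetic follows; the function name ratio_percent is only illustrative and is not part of the patchset.

```c
#include <assert.h>

/*
 * Relative change of v1 (with hot tracking) against v2 (without),
 * expressed in percent: (v1 - v2) / v2 * 100.
 * ratio_percent is a hypothetical helper name used for illustration.
 */
double ratio_percent(double v1, double v2)
{
	return (v1 - v2) / v2 * 100.0;
}
```

For the "create" row, ratio_percent(91.81, 81.12) gives roughly +13.18. Note that for rows measured in seconds (delete tree, stat tree) a positive ratio means the run with hot tracking took longer, whereas for throughput rows a positive ratio means it was faster.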

Changelog from v4:
- Added performance testing reports (fs_mark, FFSB, compilebench) [viro]
- Covered mmap() now [viro]
- Removed list_sort() in hot_update_worker() to avoid locking contention
and cacheline bouncing [viro]
- Removed a /proc interface to control low memory usage [Chandra]
- Adjusted shrinker support due to the change of public shrinker APIs [zwu]
- Fixed the missing locking when hot_inode_item_put() is called
in ioctl_heat_info() [viro]
- Fixed some locking contention issues [zwu]

v4:
- Removed debugfs support, but left it on the TODO list [viro, Chandra]
- Killed the HOT_DELETING and HOT_IN_LIST flags [viro]
- Fixed unlink issues [viro]
- Fixed lookups (both for inode and range) leaking
on race with unlink [viro]
- Killed hot_comm_item and split the functions which take it [viro]
- Fixed some other issues [zwu, Chandra]

v3:
- Added memory capping function for hot items [Zhiyong]
- Cleaned up aging function [Zhiyong]

v2:
- Refactored to be under RCU [Chandra Seetharaman]
- Merged some code changes [Chandra Seetharaman]
- Fixed some issues [Chandra Seetharaman]

v1:
- Solved 64 bits inode number issue. [David Sterba]
- Embed struct hot_type in struct file_system_type [Darrick J. Wong]
- Cleaned up some issues [David Sterba]
- Use a static hot debugfs root [Greg KH]

rfcv4:
- Introduce hot func registering framework [Zhiyong]
- Remove global variable for hot tracking [Zhiyong]
- Add btrfs hot tracking support [Zhiyong]

rfcv3:
1.) Rewrote debugfs support based on seq_file operations. [Dave Chinner]
2.) Refactored workqueue support. [Dave Chinner]
3.) Turned some macros into tunables [Zhiyong, Liu Zheng]
TIME_TO_KICK, and HEAT_UPDATE_DELAY
4.) Cleaned up a lot of other issues [Dave Chinner]


rfcv2:
1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner]
2.) Added memory shrinker [Dave Chinner]
3.) Converted to one workqueue to update map info periodically [Dave Chinner]
4.) Cleaned up a lot of other issues [Dave Chinner]

rfcv1:
1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
2.) The first three patches can probably just be flattened into one.
[Marco Stornelli , Dave Chinner]

Dave Chinner (1):
VFS hot tracking, xfs: Add hot tracking support

Zhi Yong Wu (9):
VFS hot tracking: Define basic data structures and functions
VFS hot tracking: Track IO and record heat information
VFS hot tracking: Add a workqueue to move items between hot maps
VFS hot tracking: Add shrinker functionality to curtail memory usage
VFS hot tracking: Add an ioctl to get hot tracking information
VFS hot tracking: Add a /proc interface to make the interval tunable
VFS hot tracking: Add a /proc interface to control memory usage
VFS hot tracking: Add documentation
VFS hot tracking, btrfs: Add hot tracking support

Documentation/filesystems/00-INDEX | 2 +
Documentation/filesystems/hot_tracking.txt | 207 ++++++++
fs/Makefile | 2 +-
fs/btrfs/ctree.h | 1 +
fs/btrfs/super.c | 22 +-
fs/compat_ioctl.c | 5 +
fs/dcache.c | 2 +
fs/direct-io.c | 5 +
fs/hot_tracking.c | 811 +++++++++++++++++++++++++++++
fs/hot_tracking.h | 66 +++
fs/ioctl.c | 71 +++
fs/namei.c | 3 +
fs/xfs/xfs_mount.h | 1 +
fs/xfs/xfs_super.c | 18 +
include/linux/fs.h | 4 +
include/linux/hot_tracking.h | 146 ++++++
kernel/sysctl.c | 14 +
mm/filemap.c | 19 +-
mm/page-writeback.c | 13 +
mm/readahead.c | 6 +
20 files changed, 1414 insertions(+), 4 deletions(-)
create mode 100644 Documentation/filesystems/hot_tracking.txt
create mode 100644 fs/hot_tracking.c
create mode 100644 fs/hot_tracking.h
create mode 100644 include/linux/hot_tracking.h

--
1.7.11.7


2013-09-16 22:20:35

by Zhi Yong Wu

Subject: [PATCH v5 01/10] VFS hot tracking: Define basic data structures and functions

From: Zhi Yong Wu <[email protected]>

This patch includes the basic data structure and functions needed for
VFS hot tracking.

It adds the hot_inode_tree, which keeps track of frequently accessed
files and is keyed by {inode, offset}. The trees contain hot_inode_items
representing those files and hot_range_items representing ranges within
each file.

It defines a data structure hot_info, which is associated with a mounted
filesystem, and will be used to store the inode tree and range tree for
hot items pertaining to that filesystem.
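
As an illustration of how a file is divided into ranges, here is a minimal userspace C sketch. It reuses RANGE_BITS (20) from fs/hot_tracking.h and the length computation from hot_range_item_init(); the struct range_extent type, the range_for_offset() name, and the step that derives a range index from a byte offset are assumptions made for this example and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as RANGE_BITS in fs/hot_tracking.h: each range covers 1 MiB. */
#define RANGE_BITS 20

/* Hypothetical stand-in for the extent fields of struct hot_range_item. */
struct range_extent {
	uint64_t start; /* offset in bytes, aligned to the range size */
	uint64_t len;   /* length in bytes */
};

/*
 * Map a byte offset within a file to the fixed-size range containing it.
 * Assumption: the range index is the offset shifted right by RANGE_BITS.
 */
struct range_extent range_for_offset(uint64_t offset)
{
	struct range_extent r;
	uint64_t index = offset >> RANGE_BITS; /* which range */

	r.start = index << RANGE_BITS;       /* back to a byte offset */
	r.len = (uint64_t)1 << RANGE_BITS;   /* as in hot_range_item_init() */
	return r;
}
```

Under these assumptions, every offset from 0 to 1048575 falls into the range starting at 0, and offset 1048576 starts the next range.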

Signed-off-by: Chandra Seetharaman <[email protected]>
Signed-off-by: Zhi Yong Wu <[email protected]>
---
fs/Makefile | 2 +-
fs/dcache.c | 2 +
fs/hot_tracking.c | 230 +++++++++++++++++++++++++++++++++++++++++++
fs/hot_tracking.h | 20 ++++
include/linux/fs.h | 4 +
include/linux/hot_tracking.h | 84 ++++++++++++++++
6 files changed, 341 insertions(+), 1 deletion(-)
create mode 100644 fs/hot_tracking.c
create mode 100644 fs/hot_tracking.h
create mode 100644 include/linux/hot_tracking.h

diff --git a/fs/Makefile b/fs/Makefile
index 4fe6df3..5f9b8f1 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y := open.o read_write.o file_table.o super.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o splice.o sync.o utimes.o \
- stack.o fs_struct.o statfs.o
+ stack.o fs_struct.o statfs.o hot_tracking.o

ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/dcache.c b/fs/dcache.c
index 1bd4614..cd73bb9 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -38,6 +38,7 @@
#include <linux/prefetch.h>
#include <linux/ratelimit.h>
#include <linux/list_lru.h>
+#include <linux/hot_tracking.h>
#include "internal.h"
#include "mount.h"

@@ -3363,4 +3364,5 @@ void __init vfs_caches_init(unsigned long mempages)
mnt_init();
bdev_cache_init();
chrdev_init();
+ hot_cache_init();
}
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
new file mode 100644
index 0000000..bb82a8d
--- /dev/null
+++ b/fs/hot_tracking.c
@@ -0,0 +1,230 @@
+/*
+ * fs/hot_tracking.c
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#include <linux/list.h>
+#include <linux/err.h>
+#include <linux/spinlock.h>
+#include "hot_tracking.h"
+
+/* kmem_cache pointers for slab caches */
+static struct kmem_cache *hot_inode_item_cachep __read_mostly;
+static struct kmem_cache *hot_range_item_cachep __read_mostly;
+
+static void hot_range_item_init(struct hot_range_item *hr,
+ struct hot_inode_item *he, loff_t start)
+{
+ kref_init(&hr->refs);
+ hr->start = start;
+ hr->len = hot_bit_shift(1, RANGE_BITS, true);
+ hr->hot_inode = he;
+}
+
+static void hot_range_item_free_cb(struct rcu_head *head)
+{
+ struct hot_range_item *hr = container_of(head,
+ struct hot_range_item, rcu);
+
+ kmem_cache_free(hot_range_item_cachep, hr);
+}
+
+static void hot_range_item_free(struct kref *kref)
+{
+ struct hot_range_item *hr = container_of(kref,
+ struct hot_range_item, refs);
+
+ rb_erase(&hr->rb_node, &hr->hot_inode->hot_range_tree);
+
+ call_rcu(&hr->rcu, hot_range_item_free_cb);
+}
+
+void hot_range_item_get(struct hot_range_item *hr)
+{
+ kref_get(&hr->refs);
+}
+EXPORT_SYMBOL_GPL(hot_range_item_get);
+
+/*
+ * Drop one reference on the hot_range_item
+ * and free the structure if the reference count hits zero
+ */
+void hot_range_item_put(struct hot_range_item *hr)
+{
+ kref_put(&hr->refs, hot_range_item_free);
+}
+EXPORT_SYMBOL_GPL(hot_range_item_put);
+
+/*
+ * Free the entire hot_range_tree.
+ */
+static void hot_range_tree_free(struct hot_inode_item *he)
+{
+ struct rb_node *node;
+ struct hot_range_item *hr;
+
+ /* Free hot inode and range trees on fs root */
+ spin_lock(&he->i_lock);
+ node = rb_first(&he->hot_range_tree);
+ while (node) {
+ hr = rb_entry(node, struct hot_range_item, rb_node);
+ node = rb_next(node);
+ hot_range_item_put(hr);
+ }
+ spin_unlock(&he->i_lock);
+}
+
+static void hot_inode_item_init(struct hot_inode_item *he,
+ struct hot_info *root, u64 ino)
+{
+ kref_init(&he->refs);
+ he->ino = ino;
+ he->hot_root = root;
+ spin_lock_init(&he->i_lock);
+}
+
+static void hot_inode_item_free_cb(struct rcu_head *head)
+{
+ struct hot_inode_item *he = container_of(head,
+ struct hot_inode_item, rcu);
+
+ kmem_cache_free(hot_inode_item_cachep, he);
+}
+
+static void hot_inode_item_free(struct kref *kref)
+{
+ struct hot_inode_item *he = container_of(kref,
+ struct hot_inode_item, refs);
+
+ rb_erase(&he->rb_node, &he->hot_root->hot_inode_tree);
+ hot_range_tree_free(he);
+
+ call_rcu(&he->rcu, hot_inode_item_free_cb);
+}
+
+void hot_inode_item_get(struct hot_inode_item *he)
+{
+ kref_get(&he->refs);
+}
+EXPORT_SYMBOL_GPL(hot_inode_item_get);
+
+/*
+ * Drop one reference on the hot_inode_item
+ * and free the structure if the reference count hits zero
+ */
+void hot_inode_item_put(struct hot_inode_item *he)
+{
+ kref_put(&he->refs, hot_inode_item_free);
+}
+EXPORT_SYMBOL_GPL(hot_inode_item_put);
+
+/*
+ * Initialize kmem cache for hot_inode_item and hot_range_item.
+ */
+void __init hot_cache_init(void)
+{
+ hot_inode_item_cachep = KMEM_CACHE(hot_inode_item,
+ SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD);
+ if (!hot_inode_item_cachep)
+ return;
+
+ hot_range_item_cachep = KMEM_CACHE(hot_range_item,
+ SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD);
+ if (!hot_range_item_cachep)
+ kmem_cache_destroy(hot_inode_item_cachep);
+}
+EXPORT_SYMBOL_GPL(hot_cache_init);
+
+static struct hot_info *hot_tree_init(struct super_block *sb)
+{
+ struct hot_info *root;
+ int i, j;
+
+ root = kzalloc(sizeof(struct hot_info), GFP_NOFS);
+ if (!root) {
+ printk(KERN_ERR "%s: Failed to malloc memory for "
+ "hot_info\n", __func__);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ root->hot_inode_tree = RB_ROOT;
+ spin_lock_init(&root->t_lock);
+
+ return root;
+}
+
+/*
+ * Frees the entire hot tree.
+ */
+static void hot_tree_exit(struct hot_info *root)
+{
+ struct hot_inode_item *he;
+ struct rb_node *node;
+
+ spin_lock(&root->t_lock);
+ node = rb_first(&root->hot_inode_tree);
+ while (node) {
+ he = rb_entry(node, struct hot_inode_item, rb_node);
+ node = rb_next(node);
+ hot_inode_item_put(he);
+ }
+ spin_unlock(&root->t_lock);
+}
+
+/*
+ * Initialize the data structures for hot tracking.
+ * This function will be called by *_fill_super()
+ * when filesystem is mounted.
+ */
+int hot_track_init(struct super_block *sb)
+{
+ struct hot_info *root;
+ int ret = 0;
+
+ if (!hot_inode_item_cachep || !hot_range_item_cachep) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ root = hot_tree_init(sb);
+ if (IS_ERR(root)) {
+ ret = PTR_ERR(root);
+ goto err;
+ }
+
+ sb->s_hot_root = root;
+
+ printk(KERN_INFO "VFS: Turning on hot tracking\n");
+
+ return ret;
+
+err:
+ sb->s_hot_root = NULL;
+
+ printk(KERN_ERR "VFS: Fail to turn on hot tracking\n");
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(hot_track_init);
+
+/*
+ * This function will be called by *_put_super()
+ * when filesystem is unmounted, or also by *_fill_super()
+ * in some exceptional cases.
+ */
+void hot_track_exit(struct super_block *sb)
+{
+ struct hot_info *root = sb->s_hot_root;
+
+ sb->s_hot_root = NULL;
+ hot_tree_exit(root);
+ rcu_barrier();
+ kfree(root);
+}
+EXPORT_SYMBOL_GPL(hot_track_exit);
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
new file mode 100644
index 0000000..2776092
--- /dev/null
+++ b/fs/hot_tracking.h
@@ -0,0 +1,20 @@
+/*
+ * fs/hot_tracking.h
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#ifndef __HOT_TRACKING__
+#define __HOT_TRACKING__
+
+#include <linux/hot_tracking.h>
+
+/* size of sub-file ranges */
+#define RANGE_BITS 20
+
+#endif /* __HOT_TRACKING__ */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a4acd3c..c0e0581 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -29,6 +29,7 @@
#include <linux/lockdep.h>
#include <linux/percpu-rwsem.h>
#include <linux/blk_types.h>
+#include <linux/hot_tracking.h>

#include <asm/byteorder.h>
#include <uapi/linux/fs.h>
@@ -1324,6 +1325,9 @@ struct super_block {
/* AIO completions deferred from interrupt context */
struct workqueue_struct *s_dio_done_wq;

+ /* Hot data tracking */
+ struct hot_info *s_hot_root;
+
/*
* Keep the lru lists last in the structure so they always sit on their
* own individual cachelines.
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
new file mode 100644
index 0000000..4112af2
--- /dev/null
+++ b/include/linux/hot_tracking.h
@@ -0,0 +1,84 @@
+/*
+ * include/linux/hot_tracking.h
+ *
+ * This file has definitions for VFS hot tracking
+ * structures etc.
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#ifndef _LINUX_HOTTRACK_H
+#define _LINUX_HOTTRACK_H
+
+#include <linux/types.h>
+#include <linux/slab.h>
+
+#ifdef __KERNEL__
+
+#include <linux/rbtree.h>
+#include <linux/kref.h>
+#include <linux/fs.h>
+
+#define MAP_BITS 8
+#define MAP_SIZE (1 << MAP_BITS)
+
+/* values for hot_freq flags */
+enum {
+ TYPE_INODE = 0,
+ TYPE_RANGE,
+ MAX_TYPES,
+};
+
+/* An item representing an inode and its access frequency */
+struct hot_inode_item {
+ struct kref refs;
+ struct rb_node rb_node; /* rbtree index */
+ struct rcu_head rcu;
+ struct rb_root hot_range_tree; /* tree of ranges */
+ spinlock_t i_lock; /* protect above tree */
+ struct hot_info *hot_root; /* associated hot_info */
+ u64 ino; /* inode number from inode */
+};
+
+/*
+ * An item representing a range inside of
+ * an inode whose frequency is being tracked
+ */
+struct hot_range_item {
+ struct kref refs;
+ struct rb_node rb_node; /* rbtree index */
+ struct rcu_head rcu;
+ struct hot_inode_item *hot_inode; /* associated hot_inode_item */
+ loff_t start; /* offset in bytes */
+ size_t len; /* length in bytes */
+};
+
+struct hot_info {
+ struct rb_root hot_inode_tree;
+ spinlock_t t_lock; /* protect above tree */
+};
+
+extern void __init hot_cache_init(void);
+extern int hot_track_init(struct super_block *sb);
+extern void hot_track_exit(struct super_block *sb);
+extern void hot_range_item_put(struct hot_range_item *hr);
+extern void hot_inode_item_put(struct hot_inode_item *he);
+extern void hot_range_item_get(struct hot_range_item *hr);
+extern void hot_inode_item_get(struct hot_inode_item *he);
+
+static inline u64 hot_bit_shift(u64 counter, u32 bits, bool dir)
+{
+ if (dir)
+ return counter << bits;
+ else
+ return counter >> bits;
+}
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_HOTTRACK_H */
--
1.7.11.7

2013-09-16 22:20:41

by Zhi Yong Wu

Subject: [PATCH v5 03/10] VFS hot tracking: Add a workqueue to move items between hot maps

From: Zhi Yong Wu <[email protected]>

Add a workqueue per superblock and a delayed_work
to periodically update the map info on each superblock.

Two arrays of map lists are defined: one for hot inode
items and the other for hot range items.

Each hot item in the RB-tree is first distilled into a
single temperature in the range [0, 255]. It is then
linked into the map list whose index in the corresponding
array equals that temperature.
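
The mapping from a temperature to a map list can be sketched in userspace C as follows. The shift by (32 - MAP_BITS) mirrors what hot_range_map_update() and hot_inode_map_update() do with hot_bit_shift(); the function names temp_to_map_index() and needs_move() are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Same values as in include/linux/hot_tracking.h. */
#define MAP_BITS 8
#define MAP_SIZE (1u << MAP_BITS)

/*
 * Reduce a 32-bit temperature to a map index in [0, MAP_SIZE - 1]
 * by keeping only its top MAP_BITS bits.
 */
unsigned int temp_to_map_index(uint32_t temp)
{
	return temp >> (32 - MAP_BITS);
}

/*
 * An item only has to move to another list when its index changes,
 * which corresponds to the temp_cur != temp_prev check in the patch.
 */
int needs_move(uint32_t last_temp, uint32_t new_temp)
{
	return temp_to_map_index(last_temp) != temp_to_map_index(new_temp);
}
```

With MAP_BITS set to 8, a temperature of 0x00FFFFFF still maps to index 0 and 0x01000000 is the first value that maps to index 1, so small temperature changes usually leave an item on its current list.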

Signed-off-by: Chandra Seetharaman <[email protected]>
Signed-off-by: Zhi Yong Wu <[email protected]>
---
fs/hot_tracking.c | 218 +++++++++++++++++++++++++++++++++++++++++++
fs/hot_tracking.h | 24 +++++
include/linux/hot_tracking.h | 9 +-
3 files changed, 250 insertions(+), 1 deletion(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index a6cf1a5..cea88f2 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -12,6 +12,7 @@
#include <linux/list.h>
#include <linux/err.h>
#include <linux/spinlock.h>
+#include <linux/sched.h>
#include "hot_tracking.h"

/* kmem_cache pointers for slab caches */
@@ -22,6 +23,7 @@ static void hot_range_item_init(struct hot_range_item *hr,
struct hot_inode_item *he, loff_t start)
{
kref_init(&hr->refs);
+ INIT_LIST_HEAD(&hr->track_list);
hr->freq.avg_delta_reads = (u64) -1;
hr->freq.avg_delta_writes = (u64) -1;
hr->start = start;
@@ -41,8 +43,13 @@ static void hot_range_item_free(struct kref *kref)
{
struct hot_range_item *hr = container_of(kref,
struct hot_range_item, refs);
+ struct hot_info *root = hr->hot_inode->hot_root;

rb_erase(&hr->rb_node, &hr->hot_inode->hot_range_tree);
+ spin_lock(&root->m_lock);
+ if (!list_empty(&hr->track_list))
+ list_del_init(&hr->track_list);
+ spin_unlock(&root->m_lock);

call_rcu(&hr->rcu, hot_range_item_free_cb);
}
@@ -69,6 +76,8 @@ struct hot_range_item
struct rb_node **p;
struct rb_node *parent = NULL;
struct hot_range_item *hr, *hr_new = NULL;
+ u32 temp;
+ u8 temp_cur;

start = hot_bit_shift(start, RANGE_BITS, true);

@@ -102,6 +111,12 @@ redo:
if (hr_new) {
rb_link_node(&hr_new->rb_node, parent, p);
rb_insert_color(&hr_new->rb_node, &he->hot_range_tree);
+ temp = hot_temp_calc(&hr_new->freq);
+ temp_cur = (u8)hot_bit_shift((u64)temp, (32 - MAP_BITS), false);
+ spin_lock(&he->hot_root->m_lock);
+ list_add_tail(&hr_new->track_list,
+ &he->hot_root->hot_map[TYPE_RANGE][temp_cur]);
+ spin_unlock(&he->hot_root->m_lock);
hot_range_item_get(hr_new); /* For the caller */
spin_unlock(&he->i_lock);
return hr_new;
@@ -142,10 +157,50 @@ static void hot_range_tree_free(struct hot_inode_item *he)
spin_unlock(&he->i_lock);
}

+static void hot_range_map_update(struct hot_info *root,
+ struct hot_range_item *hr)
+{
+ u32 temp = hot_temp_calc(&hr->freq);
+ u8 temp_cur = (u8)hot_bit_shift((u64)temp, (32 - MAP_BITS), false);
+ u8 temp_prev = (u8)hot_bit_shift((u64)hr->freq.last_temp,
+ (32 - MAP_BITS), false);
+
+ spin_lock(&root->m_lock);
+ if (!list_empty(&hr->track_list)
+ && (temp_cur != temp_prev)) {
+ hr->freq.last_temp = temp;
+ list_del_init(&hr->track_list);
+ list_add_tail(&hr->track_list,
+ &root->hot_map[TYPE_RANGE][temp_cur]);
+ }
+ spin_unlock(&root->m_lock);
+}
+
+/*
+ * Update temperatures for each range item for aging purposes.
+ * If one hot range item is old, it will be aged out.
+ */
+static void hot_range_tree_update(struct hot_inode_item *he,
+ struct hot_info *root)
+{
+ struct rb_node *node;
+ struct hot_range_item *hr;
+
+ rcu_read_lock();
+ node = rb_first(&he->hot_range_tree);
+ while (node) {
+ hr = rb_entry(node, struct hot_range_item, rb_node);
+ node = rb_next(node);
+ hot_range_map_update(root, hr);
+ }
+ rcu_read_unlock();
+}
+
static void hot_inode_item_init(struct hot_inode_item *he,
struct hot_info *root, u64 ino)
{
kref_init(&he->refs);
+ INIT_LIST_HEAD(&he->track_list);
he->freq.avg_delta_reads = (u64) -1;
he->freq.avg_delta_writes = (u64) -1;
he->ino = ino;
@@ -167,6 +222,8 @@ static void hot_inode_item_free(struct kref *kref)
struct hot_inode_item, refs);

rb_erase(&he->rb_node, &he->hot_root->hot_inode_tree);
+ if (!list_empty(&he->track_list))
+ list_del_init(&he->track_list);
hot_range_tree_free(he);

call_rcu(&he->rcu, hot_inode_item_free_cb);
@@ -194,6 +251,8 @@ struct hot_inode_item
struct rb_node **p;
struct rb_node *parent = NULL;
struct hot_inode_item *he, *he_new = NULL;
+ u32 temp;
+ u8 temp_cur;

/* walk tree to find insertion point */
redo:
@@ -225,6 +284,10 @@ redo:
if (he_new) {
rb_link_node(&he_new->rb_node, parent, p);
rb_insert_color(&he_new->rb_node, &root->hot_inode_tree);
+ temp = hot_temp_calc(&he_new->freq);
+ temp_cur = (u8)hot_bit_shift((u64)temp, (32 - MAP_BITS), false);
+ list_add_tail(&he_new->track_list,
+ &root->hot_map[TYPE_INODE][temp_cur]);
hot_inode_item_get(he_new); /* For the caller */
spin_unlock(&root->t_lock);
return he_new;
@@ -266,6 +329,30 @@ void hot_inode_item_unlink(struct inode *inode)
EXPORT_SYMBOL_GPL(hot_inode_item_unlink);

/*
+ * Calculate a new temperature and, if necessary,
+ * move the list_head corresponding to this inode or range
+ * to the proper list with the new temperature.
+ */
+static void hot_inode_map_update(struct hot_info *root,
+ struct hot_inode_item *he)
+{
+ u32 temp = hot_temp_calc(&he->freq);
+ u8 temp_cur = (u8)hot_bit_shift((u64)temp, (32 - MAP_BITS), false);
+ u8 temp_prev = (u8)hot_bit_shift((u64)he->freq.last_temp,
+ (32 - MAP_BITS), false);
+
+ spin_lock(&root->t_lock);
+ if (!list_empty(&he->track_list)
+ && (temp_cur != temp_prev)) {
+ he->freq.last_temp = temp;
+ list_del_init(&he->track_list);
+ list_add_tail(&he->track_list,
+ &root->hot_map[TYPE_INODE][temp_cur]);
+ }
+ spin_unlock(&root->t_lock);
+}
+
+/*
* This function does the actual work of updating
* the frequency numbers.
*
@@ -311,6 +398,114 @@ static void hot_freq_update(struct hot_info *root,
}

/*
+ * hot_temp_calc() is responsible for distilling the six heat
+ * criteria down into a single temperature value for the data,
+ * which is an integer between 0 and HEAT_MAX_VALUE.
+ *
+ * With the six values, we first do some very rudimentary
+ * "normalizations" to each metric such that they affect the
+ * final temperature calculation exactly the right way. It's
+ * important to note that we still weren't really sure that
+ * these six adjustments were exactly right.
+ * They could definitely use more tweaking and adjustment,
+ * especially in terms of the memory footprint they consume.
+ *
+ * Next, we take the adjusted values and shift them down to
+ * a manageable size, whereafter they are weighted using the
+ * *_COEFF_POWER values and combined to a single temperature
+ * value.
+ */
+u32 hot_temp_calc(struct hot_freq *freq)
+{
+ u32 result = 0;
+
+ struct timespec ckt = current_kernel_time();
+ u64 cur_time = timespec_to_ns(&ckt);
+ u32 nrr_heat, nrw_heat;
+ u64 ltr_heat, ltw_heat, avr_heat, avw_heat;
+
+ nrr_heat = (u32)hot_bit_shift((u64)freq->nr_reads,
+ NRR_MULTIPLIER_POWER, true);
+ nrw_heat = (u32)hot_bit_shift((u64)freq->nr_writes,
+ NRW_MULTIPLIER_POWER, true);
+
+ ltr_heat =
+ hot_bit_shift((cur_time - timespec_to_ns(&freq->last_read_time)),
+ LTR_DIVIDER_POWER, false);
+ ltw_heat =
+ hot_bit_shift((cur_time - timespec_to_ns(&freq->last_write_time)),
+ LTW_DIVIDER_POWER, false);
+
+ avr_heat =
+ hot_bit_shift((((u64) -1) - freq->avg_delta_reads),
+ AVR_DIVIDER_POWER, false);
+ avw_heat =
+ hot_bit_shift((((u64) -1) - freq->avg_delta_writes),
+ AVW_DIVIDER_POWER, false);
+
+ /* ltr_heat is now guaranteed to be u32 safe */
+ if (ltr_heat >= hot_bit_shift((u64) 1, 32, true))
+ ltr_heat = 0;
+ else
+ ltr_heat = hot_bit_shift((u64) 1, 32, true) - ltr_heat;
+
+ /* ltw_heat is now guaranteed to be u32 safe */
+ if (ltw_heat >= hot_bit_shift((u64) 1, 32, true))
+ ltw_heat = 0;
+ else
+ ltw_heat = hot_bit_shift((u64) 1, 32, true) - ltw_heat;
+
+ /* avr_heat is now guaranteed to be u32 safe */
+ if (avr_heat >= hot_bit_shift((u64) 1, 32, true))
+ avr_heat = (u32) -1;
+
+ /* avw_heat is now guaranteed to be u32 safe */
+ if (avw_heat >= hot_bit_shift((u64) 1, 32, true))
+ avw_heat = (u32) -1;
+
+ nrr_heat = (u32)hot_bit_shift((u64)nrr_heat,
+ (3 - NRR_COEFF_POWER), false);
+ nrw_heat = (u32)hot_bit_shift((u64)nrw_heat,
+ (3 - NRW_COEFF_POWER), false);
+ ltr_heat = hot_bit_shift(ltr_heat, (3 - LTR_COEFF_POWER), false);
+ ltw_heat = hot_bit_shift(ltw_heat, (3 - LTW_COEFF_POWER), false);
+ avr_heat = hot_bit_shift(avr_heat, (3 - AVR_COEFF_POWER), false);
+ avw_heat = hot_bit_shift(avw_heat, (3 - AVW_COEFF_POWER), false);
+
+ result = nrr_heat + nrw_heat + (u32) ltr_heat +
+ (u32) ltw_heat + (u32) avr_heat + (u32) avw_heat;
+
+ return result;
+}
+
+/*
+ * Every sync period we update temperatures for
+ * each hot inode item and hot range item for aging
+ * purposes.
+ */
+static void hot_update_worker(struct work_struct *work)
+{
+ struct hot_info *root = container_of(to_delayed_work(work),
+ struct hot_info, update_work);
+ struct hot_inode_item *he;
+ struct rb_node *node;
+
+ rcu_read_lock();
+ node = root->hot_inode_tree.rb_node;
+ while (node) {
+ he = rb_entry(node, struct hot_inode_item, rb_node);
+ node = rb_next(node);
+ hot_inode_map_update(root, he);
+ hot_range_tree_update(he, root);
+ }
+ rcu_read_unlock();
+
+ /* Insert next delayed work */
+ queue_delayed_work(root->update_wq, &root->update_work,
+ msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
+}
+
+/*
* Initialize kmem cache for hot_inode_item and hot_range_item.
*/
void __init hot_cache_init(void)
@@ -393,6 +588,26 @@ static struct hot_info *hot_tree_init(struct super_block *sb)

root->hot_inode_tree = RB_ROOT;
spin_lock_init(&root->t_lock);
+ spin_lock_init(&root->m_lock);
+
+ for (i = 0; i < MAP_SIZE; i++) {
+ for (j = 0; j < MAX_TYPES; j++)
+ INIT_LIST_HEAD(&root->hot_map[j][i]);
+ }
+
+ root->update_wq = alloc_workqueue(
+ "hot_update_wq", WQ_NON_REENTRANT, 0);
+ if (!root->update_wq) {
+ printk(KERN_ERR "%s: Failed to create "
+ "hot update workqueue\n", __func__);
+ kfree(root);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* Initialize hot tracking wq and arm one delayed work */
+ INIT_DELAYED_WORK(&root->update_work, hot_update_worker);
+ queue_delayed_work(root->update_wq, &root->update_work,
+ msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));

return root;
}
@@ -405,6 +620,9 @@ static void hot_tree_exit(struct hot_info *root)
struct hot_inode_item *he;
struct rb_node *node;

+ cancel_delayed_work_sync(&root->update_work);
+ destroy_workqueue(root->update_wq);
+
spin_lock(&root->t_lock);
node = rb_first(&root->hot_inode_tree);
while (node) {
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index bb4cb16..0be7621 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -12,10 +12,34 @@
#ifndef __HOT_TRACKING__
#define __HOT_TRACKING__

+#include <linux/workqueue.h>
#include <linux/hot_tracking.h>

+#define HOT_UPDATE_INTERVAL 150
+
/* size of sub-file ranges */
#define RANGE_BITS 20
#define FREQ_POWER 4

+/* NRR/NRW heat unit = 2^X accesses */
+#define NRR_MULTIPLIER_POWER 20 /* NRR - number of reads since mount */
+#define NRR_COEFF_POWER 0
+#define NRW_MULTIPLIER_POWER 20 /* NRW - number of writes since mount */
+#define NRW_COEFF_POWER 0
+
+/* LTR/LTW heat unit = 2^X ns of age */
+#define LTR_DIVIDER_POWER 30 /* LTR - time elapsed since last read(ns) */
+#define LTR_COEFF_POWER 1
+#define LTW_DIVIDER_POWER 30 /* LTW - time elapsed since last write(ns) */
+#define LTW_COEFF_POWER 1
+
+/*
+ * AVR/AVW cold unit = 2^X ns of average delta
+ * AVR/AVW heat unit = HEAT_MAX_VALUE - cold unit
+ */
+#define AVR_DIVIDER_POWER 40 /* AVR - average delta between recent reads(ns) */
+#define AVR_COEFF_POWER 0
+#define AVW_DIVIDER_POWER 40 /* AVW - average delta between recent writes(ns) */
+#define AVW_COEFF_POWER 0
+
#endif /* __HOT_TRACKING__ */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index f93db02..f5fb1ce 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -55,6 +55,7 @@ struct hot_inode_item {
struct kref refs;
struct rb_node rb_node; /* rbtree index */
struct rcu_head rcu;
+ struct list_head track_list; /* link to *_map[] */
struct rb_root hot_range_tree; /* tree of ranges */
spinlock_t i_lock; /* protect above tree */
struct hot_info *hot_root; /* associated hot_info */
@@ -70,6 +71,7 @@ struct hot_range_item {
struct kref refs;
struct rb_node rb_node; /* rbtree index */
struct rcu_head rcu;
+ struct list_head track_list; /* link to *_map[] */
struct hot_inode_item *hot_inode; /* associated hot_inode_item */
loff_t start; /* offset in bytes */
size_t len; /* length in bytes */
@@ -77,7 +79,11 @@ struct hot_range_item {

struct hot_info {
struct rb_root hot_inode_tree;
- spinlock_t t_lock; /* protect above tree */
+ struct list_head hot_map[MAX_TYPES][MAP_SIZE]; /* map of inode temp */
+ spinlock_t t_lock; /* protect tree and map for inode item */
+ spinlock_t m_lock; /* protect map for range item */
+ struct workqueue_struct *update_wq;
+ struct delayed_work update_work;
};

extern void __init hot_cache_init(void);
@@ -94,6 +100,7 @@ extern struct hot_inode_item
*hot_inode_item_lookup(struct hot_info *root,
u64 ino, int alloc);
extern void hot_inode_item_unlink(struct inode *inode);
+extern u32 hot_temp_calc(struct hot_freq *freq);
extern void hot_freqs_update(struct inode *inode, loff_t start,
size_t len, int rw);

--
1.7.11.7
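
For readers following the arithmetic of hot_temp_calc() in the patch
above, here is a minimal userspace sketch of the same steps, using the
*_POWER constants from fs/hot_tracking.h. It is illustrative only and not
kernel code: temp_sketch(), age_heat() and shift() are hypothetical names,
the ages are passed in directly instead of being derived from
current_kernel_time(), and shift() assumes that hot_bit_shift(v, bits, true)
is a left shift, as its uses in this patch suggest.

```c
#include <stdint.h>

/* Constants copied from fs/hot_tracking.h in this patch. */
#define NRR_MULTIPLIER_POWER 20
#define NRR_COEFF_POWER 0
#define NRW_MULTIPLIER_POWER 20
#define NRW_COEFF_POWER 0
#define LTR_DIVIDER_POWER 30
#define LTR_COEFF_POWER 1
#define LTW_DIVIDER_POWER 30
#define LTW_COEFF_POWER 1
#define AVR_DIVIDER_POWER 40
#define AVR_COEFF_POWER 0
#define AVW_DIVIDER_POWER 40
#define AVW_COEFF_POWER 0

/* Assumed behaviour of hot_bit_shift(): left shift if 'left' is set. */
static uint64_t shift(uint64_t v, unsigned int bits, int left)
{
	return left ? v << bits : v >> bits;
}

/* Age heat: younger is hotter; ages of 2^32 units or more are cold. */
static uint64_t age_heat(uint64_t age_ns, unsigned int divider_power)
{
	uint64_t units = shift(age_ns, divider_power, 0);
	uint64_t limit = shift(1, 32, 1);

	return units >= limit ? 0 : limit - units;
}

/* Hypothetical helper mirroring the steps of hot_temp_calc(). */
static uint32_t temp_sketch(uint32_t nr_reads, uint32_t nr_writes,
			    uint64_t ns_since_read, uint64_t ns_since_write,
			    uint64_t avg_delta_reads, uint64_t avg_delta_writes)
{
	/* The (u32) casts truncate, exactly as in the patch. */
	uint32_t nrr = (uint32_t)shift(nr_reads, NRR_MULTIPLIER_POWER, 1);
	uint32_t nrw = (uint32_t)shift(nr_writes, NRW_MULTIPLIER_POWER, 1);
	uint64_t ltr = age_heat(ns_since_read, LTR_DIVIDER_POWER);
	uint64_t ltw = age_heat(ns_since_write, LTW_DIVIDER_POWER);
	/* A small average delta (frequent access) gives a large value. */
	uint64_t avr = shift(UINT64_MAX - avg_delta_reads, AVR_DIVIDER_POWER, 0);
	uint64_t avw = shift(UINT64_MAX - avg_delta_writes, AVW_DIVIDER_POWER, 0);

	/* Clamp as the patch does; with a divider power of 40 the
	 * value never exceeds 2^24 - 1, so this cannot trigger. */
	if (avr >= shift(1, 32, 1))
		avr = UINT32_MAX;
	if (avw >= shift(1, 32, 1))
		avw = UINT32_MAX;

	/* Weight each term by shifting right by (3 - *_COEFF_POWER). */
	nrr >>= 3 - NRR_COEFF_POWER;
	nrw >>= 3 - NRW_COEFF_POWER;
	ltr >>= 3 - LTR_COEFF_POWER;
	ltw >>= 3 - LTW_COEFF_POWER;
	avr >>= 3 - AVR_COEFF_POWER;
	avw >>= 3 - AVW_COEFF_POWER;

	/* The sum is taken in 32 bits and may wrap, as in the patch. */
	return nrr + nrw + (uint32_t)ltr + (uint32_t)ltw +
	       (uint32_t)avr + (uint32_t)avw;
}
```

For example, temp_sketch(0, 0, 0, 0, UINT64_MAX, UINT64_MAX) yields 2^31:
each of the two age terms contributes 2^32 >> 2, and the other four terms
are zero.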

2013-09-16 22:20:47

by Zhi Yong Wu

[permalink] [raw]
Subject: [PATCH v5 05/10] VFS hot tracking: Add an ioctl to get hot tracking information

From: Zhi Yong Wu <[email protected]>

FS_IOC_GET_HEAT_INFO: return a struct containing the various
metrics collected in hot_freq structs, and also return a
calculated data temperature based on those metrics.

Optionally, retrieve the temperature from the hot data hash list
instead of recalculating it.

Signed-off-by: Chandra Seetharaman <[email protected]>
Signed-off-by: Zhi Yong Wu <[email protected]>
---
fs/compat_ioctl.c | 5 ++++
fs/ioctl.c | 71 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/hot_tracking.h | 21 +++++++++++++
3 files changed, 97 insertions(+)
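
A minimal userspace sketch of how the new ioctl might be called, for
illustration only. It assumes a kernel built with this patchset; since no
installed uapi header is assumed to carry the definitions, the struct and
the ioctl number are mirrored by hand from this patch, and get_heat_info()
is a hypothetical helper name.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/ioctl.h>

/* Userspace mirror of struct hot_heat_info from this patch. */
struct hot_heat_info {
	uint8_t  live;
	uint8_t  resv[3];
	uint32_t temp;
	uint64_t avg_delta_reads;
	uint64_t avg_delta_writes;
	uint64_t last_read_time;
	uint64_t last_write_time;
	uint32_t num_reads;
	uint32_t num_writes;
	uint64_t future[4];
};

/* Same encoding as the definition added to linux/hot_tracking.h. */
#define FS_IOC_GET_HEAT_INFO _IOR('f', 17, struct hot_heat_info)

/*
 * Hypothetical helper: fill *info for 'path'. Returns 0 on success and
 * -1 on failure, with errno set by open() or ioctl().
 */
static int get_heat_info(const char *path, int live,
			 struct hot_heat_info *info)
{
	int fd, ret, saved_errno;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	memset(info, 0, sizeof(*info));
	/* 1: recalculate the temperature now, 0: return the stored one. */
	info->live = live ? 1 : 0;

	ret = ioctl(fd, FS_IOC_GET_HEAT_INFO, info);
	saved_errno = errno;
	close(fd);
	errno = saved_errno;

	return ret < 0 ? -1 : 0;
}
```

Per ioctl_heat_info() below, a file that has no tracking data yet is
reported with ENODATA, so a caller would typically treat that errno as
"not hot, nothing recorded" rather than as a hard error.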

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index 5d19acf..9026b8a 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -57,6 +57,7 @@
#include <linux/i2c-dev.h>
#include <linux/atalk.h>
#include <linux/gfp.h>
+#include <linux/hot_tracking.h>

#include <net/bluetooth/bluetooth.h>
#include <net/bluetooth/hci.h>
@@ -1399,6 +1400,9 @@ COMPATIBLE_IOCTL(TIOCSTART)
COMPATIBLE_IOCTL(TIOCSTOP)
#endif

+/*Hot data tracking*/
+COMPATIBLE_IOCTL(FS_IOC_GET_HEAT_INFO)
+
/* fat 'r' ioctls. These are handled by fat with ->compat_ioctl,
but we don't want warnings on other file systems. So declare
them as compatible here. */
@@ -1578,6 +1582,7 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd,
case FIBMAP:
case FIGETBSZ:
case FIONREAD:
+ case FS_IOC_GET_HEAT_INFO:
if (S_ISREG(file_inode(f.file)->i_mode))
break;
/*FALL THROUGH*/
diff --git a/fs/ioctl.c b/fs/ioctl.c
index fd507fb..fd2d8ec 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -15,6 +15,7 @@
#include <linux/writeback.h>
#include <linux/buffer_head.h>
#include <linux/falloc.h>
+#include <linux/hot_tracking.h>

#include <asm/ioctls.h>

@@ -537,6 +538,73 @@ static int ioctl_fsthaw(struct file *filp)
}

/*
+ * Retrieve information about access frequency for the given inode.
+ *
+ * The temperature that is returned can be "live" -- that is, recalculated when
+ * the ioctl is called -- or it can be returned from the map list, reflecting
+ * the (possibly old) value that the system will use when considering files
+ * for migration. This behavior is determined by hot_heat_info->live.
+ */
+static int ioctl_heat_info(struct file *file, void __user *argp)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+ struct hot_info *root = inode->i_sb->s_hot_root;
+ struct hot_heat_info heat_info;
+ struct hot_inode_item *he;
+ int ret = 0;
+
+ /* The 'live' field needs to be read from user space */
+ if (copy_from_user((void *)&heat_info,
+ argp,
+ sizeof(struct hot_heat_info)) != 0) {
+ ret = -EFAULT;
+ goto err;
+ }
+
+ he = hot_inode_item_lookup(root, inode->i_ino, 0);
+ if (IS_ERR(he)) {
+ /* we don't have any info on this file yet */
+ ret = -ENODATA;
+ goto err;
+ }
+
+ heat_info.avg_delta_reads =
+ (__u64) he->freq.avg_delta_reads;
+ heat_info.avg_delta_writes =
+ (__u64) he->freq.avg_delta_writes;
+ heat_info.last_read_time =
+ (__u64) timespec_to_ns(&he->freq.last_read_time);
+ heat_info.last_write_time =
+ (__u64) timespec_to_ns(&he->freq.last_write_time);
+ heat_info.num_reads = (__u32) he->freq.nr_reads;
+ heat_info.num_writes = (__u32) he->freq.nr_writes;
+
+ if (heat_info.live > 0) {
+ /*
+ * got a request for live temperature,
+ * call hot_temp_calc() to recalculate
+ */
+ heat_info.temp = hot_temp_calc(&he->freq);
+ } else {
+ /* not live temperature, get it from the map list */
+ heat_info.temp = he->freq.last_temp;
+ }
+
+ spin_lock(&root->t_lock);
+ hot_inode_item_put(he);
+ spin_unlock(&root->t_lock);
+
+ if (copy_to_user(argp, (void *)&heat_info,
+ sizeof(struct hot_heat_info))) {
+ ret = -EFAULT;
+ goto err;
+ }
+
+err:
+ return ret;
+}
+
+/*
* When you add any new common ioctls to the switches above and below
* please update compat_sys_ioctl() too.
*
@@ -591,6 +659,9 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
case FIGETBSZ:
return put_user(inode->i_sb->s_blocksize, argp);

+ case FS_IOC_GET_HEAT_INFO:
+ return ioctl_heat_info(filp, argp);
+
default:
if (S_ISREG(inode->i_mode))
error = file_ioctl(filp, cmd, arg);
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 455bfe8..4f80e72 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -18,6 +18,19 @@
#include <linux/types.h>
#include <linux/slab.h>

+struct hot_heat_info {
+ __u8 live;
+ __u8 resv[3];
+ __u32 temp;
+ __u64 avg_delta_reads;
+ __u64 avg_delta_writes;
+ __u64 last_read_time;
+ __u64 last_write_time;
+ __u32 num_reads;
+ __u32 num_writes;
+ __u64 future[4]; /* For future expansions */
+};
+
#ifdef __KERNEL__

#include <linux/rbtree.h>
@@ -88,6 +101,14 @@ struct hot_info {
struct shrinker hot_shrink;
};

+/*
+ * Hot data tracking ioctls:
+ *
+ * FS_IOC_GET_HEAT_INFO - retrieve info on frequency of access
+ */
+#define FS_IOC_GET_HEAT_INFO _IOR('f', 17, \
+ struct hot_heat_info)
+
extern void __init hot_cache_init(void);
extern int hot_track_init(struct super_block *sb);
extern void hot_track_exit(struct super_block *sb);
--
1.7.11.7

2013-09-16 22:20:59

by Zhi Yong Wu

[permalink] [raw]
Subject: [PATCH v5 07/10] VFS hot tracking: Add a /proc interface to control memory usage

From: Zhi Yong Wu <[email protected]>

Introduce a /proc interface, hot-mem-high-thresh, to cap the
memory consumed by hot_inode_item and hot_range_item. The
threshold is given in units of 1M bytes; the default of 0
disables the cap.

Signed-off-by: Chandra Seetharaman <[email protected]>
Signed-off-by: Zhi Yong Wu <[email protected]>
---
fs/hot_tracking.c | 31 +++++++++++++++++++++++++++++++
fs/hot_tracking.h | 23 +++++++++++++++++++++++
include/linux/hot_tracking.h | 3 +++
kernel/sysctl.c | 7 +++++++
4 files changed, 64 insertions(+)
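
To make the threshold arithmetic in hot_mem_evict() concrete, here is a
small userspace sketch of the "how many bytes are we over the cap"
computation. hot_mem_excess() is a hypothetical name, and treating
negative values as "disabled" is an assumption of the sketch (the patch
itself only tests for 0). The sketch multiplies in a 64-bit type; the
patch multiplies in unsigned long, which would wrap for thresholds of
4096 or more wherever that type is 32 bits wide.

```c
#include <stdint.h>

/*
 * Hypothetical helper: bytes to evict for a usage of 'used_bytes' and a
 * threshold of 'thresh_mb' megabytes. A threshold of 0 disables the cap.
 */
static uint64_t hot_mem_excess(uint64_t used_bytes, int thresh_mb)
{
	uint64_t thresh;

	if (thresh_mb <= 0)
		return 0;

	/* The sysctl value is in units of 1M bytes. */
	thresh = (uint64_t)thresh_mb * 1024 * 1024;

	return used_bytes > thresh ? used_bytes - thresh : 0;
}
```

With a threshold of 4 and 5 MB in use, the result is 1048576, which
corresponds to the "sum - thresh" value handed to hot_item_evict().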

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 8c7e403..0047252 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -15,6 +15,9 @@
#include <linux/sched.h>
#include "hot_tracking.h"

+int sysctl_hot_mem_high_thresh __read_mostly = 0;
+EXPORT_SYMBOL_GPL(sysctl_hot_mem_high_thresh);
+
int sysctl_hot_update_interval __read_mostly = 150;
EXPORT_SYMBOL_GPL(sysctl_hot_update_interval);

@@ -33,6 +36,7 @@ static void hot_range_item_init(struct hot_range_item *hr,
hr->len = hot_bit_shift(1, RANGE_BITS, true);
hr->hot_inode = he;
atomic_long_inc(&he->hot_root->hot_cnt);
+ hot_mem_limit_add(he->hot_root, sizeof(struct hot_range_item));
}

static void hot_range_item_free_cb(struct rcu_head *head)
@@ -56,6 +60,7 @@ static void hot_range_item_free(struct kref *kref)
spin_unlock(&root->m_lock);

atomic_long_dec(&root->hot_cnt);
+ hot_mem_limit_sub(root, sizeof(struct hot_range_item));
call_rcu(&hr->rcu, hot_range_item_free_cb);
}

@@ -106,6 +111,8 @@ redo:
* newly allocated item.
*/
atomic_long_dec(&he->hot_root->hot_cnt);
+ hot_mem_limit_sub(he->hot_root,
+ sizeof(struct hot_range_item));
kmem_cache_free(hot_range_item_cachep, hr_new);
}
spin_unlock(&he->i_lock);
@@ -213,6 +220,7 @@ static void hot_inode_item_init(struct hot_inode_item *he,
he->hot_root = root;
spin_lock_init(&he->i_lock);
atomic_long_inc(&root->hot_cnt);
+ hot_mem_limit_add(root, sizeof(struct hot_inode_item));
}

static void hot_inode_item_free_cb(struct rcu_head *head)
@@ -234,6 +242,7 @@ static void hot_inode_item_free(struct kref *kref)
hot_range_tree_free(he);

atomic_long_dec(&he->hot_root->hot_cnt);
+ hot_mem_limit_sub(he->hot_root, sizeof(struct hot_inode_item));
call_rcu(&he->rcu, hot_inode_item_free_cb);
}

@@ -282,6 +291,8 @@ redo:
* newly allocated item.
*/
atomic_long_dec(&root->hot_cnt);
+ hot_mem_limit_sub(root,
+ sizeof(struct hot_inode_item));
kmem_cache_free(hot_inode_item_cachep, he_new);
}
spin_unlock(&root->t_lock);
@@ -528,6 +539,23 @@ static unsigned long hot_item_evict(struct hot_info *root, unsigned long work,
return freed;
}

+static void hot_mem_evict(struct hot_info *root)
+{
+ unsigned long sum, thresh;
+
+ if (sysctl_hot_mem_high_thresh == 0)
+ return;
+
+ sum = hot_mem_limit_sum(root);
+ /* Note: sysctl_** is in the unit of 1M bytes */
+ thresh = sysctl_hot_mem_high_thresh;
+ thresh *= 1024 * 1024;
+ if (sum <= thresh)
+ return;
+
+ hot_item_evict(root, sum - thresh, hot_mem_limit_sum);
+}
+
/*
* Every sync period we update temperatures for
* each hot inode item and hot range item for aging
@@ -540,6 +568,8 @@ static void hot_update_worker(struct work_struct *work)
struct hot_inode_item *he;
struct rb_node *node;

+ hot_mem_evict(root);
+
rcu_read_lock();
node = root->hot_inode_tree.rb_node;
while (node) {
@@ -748,6 +778,7 @@ int hot_track_init(struct super_block *sb)
goto err;
}

+ hot_mem_limit_init(root);
sb->s_hot_root = root;

printk(KERN_INFO "VFS: Turning on hot tracking\n");
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 23b1339..c9efa5b 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -40,4 +40,27 @@
#define AVW_DIVIDER_POWER 40 /* AVW - average delta between recent writes(ns) */
#define AVW_COEFF_POWER 0

+/* Memory Tracking Functions. */
+static inline unsigned long hot_mem_limit_sum(struct hot_info *root)
+{
+ return atomic_long_read(&root->mem);
+}
+
+static inline void hot_mem_limit_sub(struct hot_info *root,
+ unsigned long count)
+{
+ atomic_long_sub(count, &root->mem);
+}
+
+static inline void hot_mem_limit_add(struct hot_info *root,
+ unsigned long count)
+{
+ atomic_long_add(count, &root->mem);
+}
+
+static inline void hot_mem_limit_init(struct hot_info *root)
+{
+ atomic_long_set(&root->mem, 0);
+}
+
#endif /* __HOT_TRACKING__ */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 6923771..3f50c39 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -99,10 +99,13 @@ struct hot_info {
struct workqueue_struct *update_wq;
struct delayed_work update_work;
struct shrinker hot_shrink;
+ atomic_long_t mem;
};

/* set how often to update temperatures (seconds) */
extern int sysctl_hot_update_interval;
+/* note: sysctl_** is in the unit of 1M bytes */
+extern int sysctl_hot_mem_high_thresh;

/*
* Hot data tracking ioctls:
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e0b062a..fde8bc2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1632,6 +1632,13 @@ static struct ctl_table fs_table[] = {
.extra1 = &pipe_min_size,
},
{
+ .procname = "hot-mem-high-thresh",
+ .data = &sysctl_hot_mem_high_thresh,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "hot-update-interval",
.data = &sysctl_hot_update_interval,
.maxlen = sizeof(int),
--
1.7.11.7

2013-09-16 22:20:50

by Zhi Yong Wu

[permalink] [raw]
Subject: [PATCH v5 06/10] VFS hot tracking: Add a /proc interface to make the interval tunable

From: Zhi Yong Wu <[email protected]>

Add a proc interface hot-update-interval under the dir
/proc/sys/fs/ in order to turn HOT_UPDATE_INTERVAL into
a tunable parameter.

Signed-off-by: Chandra Seetharaman <[email protected]>
Signed-off-by: Zhi Yong Wu <[email protected]>
---
fs/hot_tracking.c | 7 +++++--
fs/hot_tracking.h | 2 --
include/linux/hot_tracking.h | 3 +++
kernel/sysctl.c | 7 +++++++
4 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 953dbc9..8c7e403 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -15,6 +15,9 @@
#include <linux/sched.h>
#include "hot_tracking.h"

+int sysctl_hot_update_interval __read_mostly = 150;
+EXPORT_SYMBOL_GPL(sysctl_hot_update_interval);
+
/* kmem_cache pointers for slab caches */
static struct kmem_cache *hot_inode_item_cachep __read_mostly;
static struct kmem_cache *hot_range_item_cachep __read_mostly;
@@ -549,7 +552,7 @@ static void hot_update_worker(struct work_struct *work)

/* Insert next delayed work */
queue_delayed_work(root->update_wq, &root->update_work,
- msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
+ msecs_to_jiffies(sysctl_hot_update_interval * MSEC_PER_SEC));
}

/*
@@ -690,7 +693,7 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
/* Initialize hot tracking wq and arm one delayed work */
INIT_DELAYED_WORK(&root->update_work, hot_update_worker);
queue_delayed_work(root->update_wq, &root->update_work,
- msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
+ msecs_to_jiffies(sysctl_hot_update_interval * MSEC_PER_SEC));

/* Register a shrinker callback */
root->hot_shrink.count_objects = hot_track_shrink_count;
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 0be7621..23b1339 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -15,8 +15,6 @@
#include <linux/workqueue.h>
#include <linux/hot_tracking.h>

-#define HOT_UPDATE_INTERVAL 150
-
/* size of sub-file ranges */
#define RANGE_BITS 20
#define FREQ_POWER 4
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 4f80e72..6923771 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -101,6 +101,9 @@ struct hot_info {
struct shrinker hot_shrink;
};

+/* set how often to update temperatures (seconds) */
+extern int sysctl_hot_update_interval;
+
/*
* Hot data tracking ioctls:
*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b2f06f3..e0b062a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1631,6 +1631,13 @@ static struct ctl_table fs_table[] = {
.proc_handler = &pipe_proc_fn,
.extra1 = &pipe_min_size,
},
+ {
+ .procname = "hot-update-interval",
+ .data = &sysctl_hot_update_interval,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
{ }
};

--
1.7.11.7

2013-09-16 22:21:08

by Zhi Yong Wu

[permalink] [raw]
Subject: [PATCH v5 09/10] VFS hot tracking, btrfs: Add hot tracking support

From: Zhi Yong Wu <[email protected]>

Introduce one new mount option '-o hot_track',
and add its parsing support.

Its usage looks like:
mount -o hot_track
mount -o nouser,hot_track
mount -o nouser,hot_track,loop
mount -o hot_track,nouser

Reviewed-by: David Sterba <[email protected]>
Signed-off-by: Chandra Seetharaman <[email protected]>
Signed-off-by: Zhi Yong Wu <[email protected]>
---
fs/btrfs/ctree.h | 1 +
fs/btrfs/super.c | 22 +++++++++++++++++++++-
2 files changed, 22 insertions(+), 1 deletion(-)
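
The usage lines above all pass hot_track as one token in a
comma-separated option string. As a rough userspace illustration of
whole-token matching (so that a longer word merely starting with
"hot_track" does not count), consider the sketch below. has_mount_opt()
is a hypothetical helper and not what btrfs does internally; the kernel
side goes through the tokens table shown in the diff below.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if 'name' appears as a complete token in 'opts'. */
static bool has_mount_opt(const char *opts, const char *name)
{
	size_t name_len = strlen(name);
	const char *p = opts;

	while (*p) {
		const char *end = strchr(p, ',');
		size_t len = end ? (size_t)(end - p) : strlen(p);

		if (len == name_len && strncmp(p, name, len) == 0)
			return true;
		if (!end)
			break;
		p = end + 1;	/* skip the comma */
	}
	return false;
}
```

For instance, "nouser,hot_track,loop" matches, while "nouser,hot_tracking"
does not.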

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3c1da6f..17fd7c8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1999,6 +1999,7 @@ struct btrfs_ioctl_defrag_range_args {
#define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
#define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR (1 << 22)
#define BTRFS_MOUNT_RESCAN_UUID_TREE (1 << 23)
+#define BTRFS_MOUNT_HOT_TRACK (1 << 24)

#define BTRFS_DEFAULT_COMMIT_INTERVAL (30)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 3aab10c..949362d 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -42,6 +42,7 @@
#include <linux/cleancache.h>
#include <linux/ratelimit.h>
#include <linux/btrfs.h>
+#include <linux/hot_tracking.h>
#include "compat.h"
#include "delayed-inode.h"
#include "ctree.h"
@@ -310,6 +311,10 @@ static void btrfs_put_super(struct super_block *sb)
* last process that kept it busy. Or segfault in the aforementioned
* process... Whom would you report that to?
*/
+
+ /* Hot data tracking */
+ if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
+ hot_track_exit(sb);
}

enum {
@@ -323,7 +328,7 @@ enum {
Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
Opt_check_integrity, Opt_check_integrity_including_extent_data,
Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_rescan_uuid_tree,
- Opt_commit_interval,
+ Opt_commit_interval, Opt_hot_track,
Opt_err,
};

@@ -366,6 +371,7 @@ static match_table_t tokens = {
{Opt_rescan_uuid_tree, "rescan_uuid_tree"},
{Opt_fatal_errors, "fatal_errors=%s"},
{Opt_commit_interval, "commit=%d"},
+ {Opt_hot_track, "hot_track"},
{Opt_err, NULL},
};

@@ -676,6 +682,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
info->commit_interval = BTRFS_DEFAULT_COMMIT_INTERVAL;
}
break;
+ case Opt_hot_track:
+ btrfs_set_opt(info->mount_opt, HOT_TRACK);
+ break;
case Opt_err:
printk(KERN_INFO "btrfs: unrecognized mount option "
"'%s'\n", p);
@@ -898,11 +907,20 @@ static int btrfs_fill_super(struct super_block *sb,
goto fail_close;
}

+ if (btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
+ err = hot_track_init(sb);
+ if (err)
+ goto fail_hot;
+ }
+
save_mount_options(sb, data);
cleancache_init_fs(sb);
sb->s_flags |= MS_ACTIVE;
return 0;

+fail_hot:
+ dput(sb->s_root);
+ sb->s_root = NULL;
fail_close:
close_ctree(fs_info->tree_root);
return err;
@@ -1014,6 +1032,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
seq_puts(seq, ",fatal_errors=panic");
if (info->commit_interval != BTRFS_DEFAULT_COMMIT_INTERVAL)
seq_printf(seq, ",commit=%d", info->commit_interval);
+ if (btrfs_test_opt(root, HOT_TRACK))
+ seq_puts(seq, ",hot_track");
return 0;
}

--
1.7.11.7

2013-09-16 22:21:12

by Zhi Yong Wu

[permalink] [raw]
Subject: [PATCH v5 10/10] VFS hot tracking, xfs: Add hot tracking support

From: Dave Chinner <[email protected]>

Connect up the VFS hot tracking support so that the XFS filesystem
can make use of it.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Zhi Yong Wu <[email protected]>
---
fs/xfs/xfs_mount.h | 1 +
fs/xfs/xfs_super.c | 18 ++++++++++++++++++
2 files changed, 19 insertions(+)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 1fa0584..c6bbf31 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -184,6 +184,7 @@ typedef struct xfs_mount {
#define XFS_MOUNT_WSYNC (1ULL << 0) /* for nfs - all metadata ops
must be synchronous except
for space allocations */
+#define XFS_MOUNT_HOTTRACK (1ULL << 1) /* hot tracking */
#define XFS_MOUNT_WAS_CLEAN (1ULL << 3)
#define XFS_MOUNT_FS_SHUTDOWN (1ULL << 4) /* atomic stop of all filesystem
operations, typically for
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 15188cc..a2667f9 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -62,6 +62,7 @@
#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/parser.h>
+#include <linux/hot_tracking.h>

static const struct super_operations xfs_super_operations;
static kmem_zone_t *xfs_ioend_zone;
@@ -115,6 +116,7 @@ mempool_t *xfs_ioend_pool;
#define MNTOPT_NODELAYLOG "nodelaylog" /* Delayed logging disabled */
#define MNTOPT_DISCARD "discard" /* Discard unused blocks */
#define MNTOPT_NODISCARD "nodiscard" /* Do not discard unused blocks */
+#define MNTOPT_HOTTRACK "hot_track" /* hot tracking */

/*
* Table driven mount option parser.
@@ -381,6 +383,8 @@ xfs_parseargs(
mp->m_flags |= XFS_MOUNT_DISCARD;
} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
mp->m_flags &= ~XFS_MOUNT_DISCARD;
+ } else if (!strcmp(this_char, MNTOPT_HOTTRACK)) {
+ mp->m_flags |= XFS_MOUNT_HOTTRACK;
} else if (!strcmp(this_char, "ihashsize")) {
xfs_warn(mp,
"ihashsize no longer used, option is deprecated.");
@@ -504,6 +508,7 @@ xfs_showargs(
{ XFS_MOUNT_GRPID, "," MNTOPT_GRPID },
{ XFS_MOUNT_DISCARD, "," MNTOPT_DISCARD },
{ XFS_MOUNT_SMALL_INUMS, "," MNTOPT_32BITINODE },
+ { XFS_MOUNT_HOTTRACK, "," MNTOPT_HOTTRACK },
{ 0, NULL }
};
static struct proc_xfs_info xfs_info_unset[] = {
@@ -1046,6 +1051,9 @@ xfs_fs_put_super(
{
struct xfs_mount *mp = XFS_M(sb);

+ if (mp->m_flags & XFS_MOUNT_HOTTRACK)
+ hot_track_exit(sb);
+
xfs_filestream_unmount(mp);
xfs_unmountfs(mp);

@@ -1501,8 +1509,18 @@ xfs_fs_fill_super(
goto out_unmount;
}

+ if (mp->m_flags & XFS_MOUNT_HOTTRACK) {
+ error = hot_track_init(sb);
+ if (error)
+ goto out_free_root;
+ }
+
return 0;

+ out_free_root:
+ dput(sb->s_root);
+ sb->s_root = NULL;
+
out_filestream_unmount:
xfs_filestream_unmount(mp);
out_free_sb:
--
1.7.11.7

2013-09-16 22:21:48

by Zhi Yong Wu

[permalink] [raw]
Subject: [PATCH v5 08/10] VFS hot tracking: Add documentation

From: Zhi Yong Wu <[email protected]>

Add documentation for the VFS hot tracking feature.

Signed-off-by: Chandra Seetharaman <[email protected]>
Signed-off-by: Zhi Yong Wu <[email protected]>
---
Documentation/filesystems/00-INDEX | 2 +
Documentation/filesystems/hot_tracking.txt | 207 +++++++++++++++++++++++++++++
2 files changed, 209 insertions(+)
create mode 100644 Documentation/filesystems/hot_tracking.txt
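
Section 4 of the new document describes FREQ_POWER as the weight of a
shift-based running average, where a value of 4 gives the newest access
1/16th of the weight. A generic sketch of such an average follows. It is
a textbook exponential moving average written with shifts, offered as an
assumption about the general technique and not as a copy of
hot_freq_calc(); avg_delta_update() is a hypothetical name.

```c
#include <stdint.h>

#define FREQ_POWER 4	/* same value as in fs/hot_tracking.h */

/*
 * Blend a new access delta into the running average:
 *   new_avg = old_avg * (1 - 2^-FREQ_POWER) + new_delta * 2^-FREQ_POWER
 * Shifting each term separately avoids overflow, at the cost of losing
 * the low FREQ_POWER bits of each term.
 */
static uint64_t avg_delta_update(uint64_t old_avg, uint64_t new_delta_ns)
{
	return old_avg - (old_avg >> FREQ_POWER) +
	       (new_delta_ns >> FREQ_POWER);
}
```

Starting from an average of 1600 ns, a new delta of 0 ns moves the
average only to 1500 ns, which shows how slowly a single access shifts
the value at FREQ_POWER = 4.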

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 8042050..46b2f6f 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -122,3 +122,5 @@ xfs.txt
- info and mount options for the XFS filesystem.
xip.txt
- info on execute-in-place for file mappings.
+hot_tracking.txt
+ - info on hot tracking in VFS layer
diff --git a/Documentation/filesystems/hot_tracking.txt b/Documentation/filesystems/hot_tracking.txt
new file mode 100644
index 0000000..df184b9
--- /dev/null
+++ b/Documentation/filesystems/hot_tracking.txt
@@ -0,0 +1,207 @@
+Hot Data Tracking
+
+April, 2013 Zhi Yong Wu <[email protected]>
+
+CONTENTS
+
+1. Introduction
+2. Motivation
+3. The Design
+4. How to Calc Frequency of Reads/Writes & Temperature
+5. Git Development Tree
+6. Usage Example
+
+
+1. Introduction
+
+ This feature adds support for tracking data temperature
+information in the VFS layer. Essentially, this means maintaining some key
+stats (number of reads/writes, last read/write time, frequency of
+reads/writes), then distilling those numbers down to a single
+"temperature" value that reflects which data is "hot". A filesystem
+can use this information to move hot data from slow devices to fast
+devices.
+
+ The long-term goal of the feature is to allow some filesystems,
+e.g. Btrfs, to intelligently utilize SSDs in a heterogeneous volume.
+Incidentally, this project has been motivated by
+the Project Ideas page on the Btrfs wiki.
+
+
+2. Motivation
+
+ This is essentially the traditional cache argument: SSD is fast and
+expensive; HDD is cheap but slow. ZFS, for example, can already take
+advantage of SSD caching. Btrfs should also be able to take advantage of
+hybrid storage without many broad, sweeping changes to existing code.
+
+ The overall goal of enabling hot data relocation to SSD has been
+motivated by the Project Ideas page on the Btrfs wiki at
+<https://btrfs.wiki.kernel.org/index.php/Project_ideas>.
+The work divides into two parts: the VFS provides the hot data tracking
+function, while a specific FS provides the hot data relocation function.
+As the first step toward this goal, this feature provides the first part
+of the functionality.
+
+
+3. The Design
+
+The design includes the following parts:
+
+ * Hooks in existing vfs functions to track data access frequency
+
+ * New rb-trees for tracking access frequency of inodes and sub-file
+ranges
+ The relationship between super_block and the rb-trees is as follows:
+hot_info.hot_inode_tree
+ Each FS instance can find its hot tracking info via s_hot_root.
+ hot_info has a hot_inode_tree, which holds each inode's hot information;
+each inode item has a hot_range_tree, which holds its ranges' hot information.
+
+ * Lists of hot inodes and hot ranges, indexed by temperature
+
+ * A work queue for updating inode heat info
+
+ * A mount option for enabling temperature tracking (-o hot_track;
+disabled by default)
+ * An ioctl to retrieve the frequency information collected for a certain
+inode
+
+Their relationships are as follows:
+
+ * hot_info.hot_inode_tree indexes hot_inode_items, one per inode
+
+ * hot_inode_item contains access frequency data for that inode
+
+ * hot_inode_item holds a track list node to link the access frequency
+data for that inode
+
+ * hot_inode_item.hot_range_tree indexes hot_range_items for that inode
+
+ * hot_range_item contains access frequency data for that range
+
+ * hot_range_item holds a track list node to link the access frequency
+data for that range
+
+ * hot_info.hot_map[TYPE_INODE] indexes per-inode track list nodes
+
+ * hot_info.hot_map[TYPE_RANGE] indexes per-range track list nodes
+
+ How about some ascii art? :) Just looking at the hot inode item case
+(the range item case is the same pattern, though), we have:
+
+ super_block
+ |
+ V
+ hot_info
+ |
+ +-------------------------+----------------------------------------+
+ | | |
+ | | |
+ V V V
+heat_inode_map hot_inode_tree heat_range_map
+ | | |
+ | V hot_inode_item |
+ | +----------list_head---------+ |
+ | | frequency data | |
++---+ | | |
+| V hot_inode_item V hot_inode_item |
+|....<-----list-head--->... ...<----list_head---->... |
+ frequency data frequency data |
+ hot_range_tree hot_range_tree |
+ | |
+ V hot_range_item |
+ +---------list_head----------+ |
+ | frequency data | |
+ | ^ | +---+
+ hot_range_item V | | Vhot_range_item|
+ <--list_head-->... | | ...<--list_head-->....... |
+ frequency data frequency data
+
+
+4. How to Calc Frequency of Reads/Writes & Temperature
+
+1.) hot_freq_calc()
+
+ This function does the actual work of updating the frequency numbers.
+FREQ_POWER determines how many atime deltas we keep track of (as a power of 2).
+So, setting it to anything above 16ish is probably overkill. Also,
+the higher the power, the more bits get right shifted out of the timestamp,
+reducing precision, so take note of that as well.
+
+ FREQ_POWER, defined in fs/hot_tracking.h, determines how heavily to weight
+the current frequency numbers against the newest access. For example, a value
+of 4 means that the new access information will be weighted 1/16th (i.e. 2^-4)
+as heavily as the existing frequency info. In essence, this is a kludged-
+together version of a weighted average, since we can't afford to keep all of
+the information that it would take to get a _real_ weighted average.
+
+2.) hot_temp_calc()
+
+ The following comments explain what exactly comprises a unit of heat.
+Each of six values of heat are calculated and combined in order to form an
+overall temperature for the data:
+
+ * NRR - number of reads since mount
+ * NRW - number of writes since mount
+ * LTR - time elapsed since last read (ns)
+ * LTW - time elapsed since last write (ns)
+ * AVR - average delta between recent reads (ns)
+ * AVW - average delta between recent writes (ns)
+
+ These values are divided (right-shifted) according to the *_DIVIDER_POWER
+values defined below to bring the numbers into a reasonable range. You can
+modify these values to fit your needs. However, each heat unit is a u32 and
+thus maxes out at 2^32 - 1. Therefore, you must choose your dividers quite
+carefully or else they could max out or be stuck at zero quite easily.
+(E.g., if you chose AVR_DIVIDER_POWER = 0, nothing less than 4s of atime
+delta would bring the temperature above zero, ever.)
+
+ Finally, each value is added to the overall temperature between 0 and 8
+times, depending on its *_COEFF_POWER value. Note that the coefficients are
+also actually implemented with shifts, so take care to treat these values
+as powers of 2. (I.e., 0 means we'll add it to the temp once; 1 = 2x, etc.)
+
+ * AVR/AVW cold unit = 2^X ns of average delta
+ * AVR/AVW heat unit = HEAT_MAX_VALUE - cold unit
+
+ E.g., data with an average delta between 0 and 2^X ns will have a cold
+value of 0, which means a heat value equal to HEAT_MAX_VALUE.
+
+ This function is responsible for distilling the six heat
+criteria (which are described in detail in hot_tracking.h) down into a single
+temperature value for the data, which is an integer between 0
+and HEAT_MAX_VALUE.
+
+ To accomplish this, the raw values from the hot_freq_data structure
+are shifted in order to make the temperature calculation more
+or less sensitive to each value.
+
+ Once this calibration has happened, we do some additional normalization and
+make sure that everything fits nicely in a u32. From there, we take a very
+rudimentary kind of "average" of each of the values, where the *_COEFF_POWER
+values act as weights for the average.
+
+ Finally, we use the MAP_BITS value, which determines the size of the
+heat list array, to normalize the temperature to the proper granularity.
+
+
+5. Git Development Tree
+
+ This feature is still under development and review, so if you're interested,
+you can pull from the git repository at the following location:
+
+ https://github.com/wuzhy/kernel.git hot_tracking
+ git://github.com/wuzhy/kernel.git hot_tracking
+
+
+6. Usage Example
+
+1.) To use hot tracking, you should mount like this:
+
+$ mount -o hot_track /dev/sdb /mnt
+[ 1505.894078] device label test devid 1 transid 29 /dev/sdb
+[ 1505.952977] btrfs: disk space caching is enabled
+[ 1506.069678] VFS: Turning on hot tracking
+
+2.) Retrieve hot tracking info for some specific file by ioctl().
--
1.7.11.7
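
The AVR/AVW rule in the documentation above (cold unit = 2^X ns of average delta, heat unit = HEAT_MAX_VALUE - cold unit) can be sketched in userspace C. This is only an illustration, not the code from the patch: the function name avg_delta_to_heat() and its divider_power parameter are made up here, and HEAT_MAX_VALUE is assumed to be 2^32 - 1 because the text says each heat unit is a u32.

```c
#include <stdint.h>

/* Assumption: heat units are u32 and saturate at 2^32 - 1, as the text says. */
#define HEAT_MAX_VALUE UINT32_MAX

/*
 * Convert an average inter-access delta (in ns) into a heat unit.
 * divider_power stands in for AVR_DIVIDER_POWER or AVW_DIVIDER_POWER
 * and must be smaller than 64.
 */
static uint32_t avg_delta_to_heat(uint64_t avg_delta_ns, unsigned int divider_power)
{
	/* One cold unit per 2^divider_power ns of average delta. */
	uint64_t cold = avg_delta_ns >> divider_power;

	/* Clamp so that the cold value fits in a u32. */
	if (cold > HEAT_MAX_VALUE)
		cold = HEAT_MAX_VALUE;

	/* Short deltas give few cold units, hence a high heat value. */
	return HEAT_MAX_VALUE - (uint32_t)cold;
}
```

With divider_power = 0, any average delta of 2^32 ns (about 4.3 s) or more clamps to zero heat, which matches the warning about AVR_DIVIDER_POWER = 0 in the text.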

2013-09-16 22:22:31

by Zhi Yong Wu

Subject: [PATCH v5 04/10] VFS hot tracking: Add shrinker functionality to curtail memory usage

From: Zhi Yong Wu <[email protected]>

Register a shrinker to control the amount of memory that
is used in tracking hot regions. If we are throwing inodes
out of memory due to memory pressure, we most definitely are
going to need to reduce the amount of memory the tracking
code is using, even if it means losing useful information.

Signed-off-by: Chandra Seetharaman <[email protected]>
Signed-off-by: Zhi Yong Wu <[email protected]>
---
fs/hot_tracking.c | 91 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/hot_tracking.h | 2 +
2 files changed, 93 insertions(+)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index cea88f2..953dbc9 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -29,6 +29,7 @@ static void hot_range_item_init(struct hot_range_item *hr,
hr->start = start;
hr->len = hot_bit_shift(1, RANGE_BITS, true);
hr->hot_inode = he;
+ atomic_long_inc(&he->hot_root->hot_cnt);
}

static void hot_range_item_free_cb(struct rcu_head *head)
@@ -51,6 +52,7 @@ static void hot_range_item_free(struct kref *kref)
list_del_init(&hr->track_list);
spin_unlock(&root->m_lock);

+ atomic_long_dec(&root->hot_cnt);
call_rcu(&hr->rcu, hot_range_item_free_cb);
}

@@ -100,6 +102,7 @@ redo:
* the item for the range. Free the
* newly allocated item.
*/
+ atomic_long_dec(&he->hot_root->hot_cnt);
kmem_cache_free(hot_range_item_cachep, hr_new);
}
spin_unlock(&he->i_lock);
@@ -206,6 +209,7 @@ static void hot_inode_item_init(struct hot_inode_item *he,
he->ino = ino;
he->hot_root = root;
spin_lock_init(&he->i_lock);
+ atomic_long_inc(&root->hot_cnt);
}

static void hot_inode_item_free_cb(struct rcu_head *head)
@@ -226,6 +230,7 @@ static void hot_inode_item_free(struct kref *kref)
list_del_init(&he->track_list);
hot_range_tree_free(he);

+ atomic_long_dec(&he->hot_root->hot_cnt);
call_rcu(&he->rcu, hot_inode_item_free_cb);
}

@@ -273,6 +278,7 @@ redo:
* the item for the inode. Free the
* newly allocated item.
*/
+ atomic_long_dec(&root->hot_cnt);
kmem_cache_free(hot_inode_item_cachep, he_new);
}
spin_unlock(&root->t_lock);
@@ -478,6 +484,47 @@ u32 hot_temp_calc(struct hot_freq *freq)
return result;
}

+static unsigned long hot_item_evict(struct hot_info *root, unsigned long work,
+ unsigned long (*work_get)(struct hot_info *root))
+{
+ long budget = work;
+ unsigned long freed = 0;
+ int i;
+
+ for (i = 0; i < MAP_SIZE; i++) {
+ struct hot_inode_item *he, *next;
+
+ spin_lock(&root->t_lock);
+ if (list_empty(&root->hot_map[TYPE_INODE][i])) {
+ spin_unlock(&root->t_lock);
+ continue;
+ }
+
+ list_for_each_entry_safe(he, next,
+ &root->hot_map[TYPE_INODE][i], track_list) {
+ long work_prev, delta;
+
+ if (atomic_read(&he->refs.refcount) > 1)
+ continue;
+ work_prev = work_get(root);
+ hot_inode_item_put(he);
+ delta = work_prev - work_get(root);
+ budget -= delta;
+ freed += delta;
+ if (unlikely(budget <= 0))
+ break;
+ }
+ spin_unlock(&root->t_lock);
+
+ if (unlikely(budget <= 0))
+ break;
+
+ cond_resched();
+ }
+
+ return freed;
+}
+
/*
* Every sync period we update temperatures for
* each hot inode item and hot range item for aging
@@ -522,6 +569,41 @@ void __init hot_cache_init(void)
}
EXPORT_SYMBOL_GPL(hot_cache_init);

+static unsigned long hot_track_shrink_count(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ struct hot_info *root =
+ container_of(shrink, struct hot_info, hot_shrink);
+
+ return (unsigned long)atomic_long_read(&root->hot_cnt);
+}
+
+static inline unsigned long hot_cnt_get(struct hot_info *root)
+{
+ return (unsigned long)atomic_long_read(&root->hot_cnt);
+}
+
+static unsigned long hot_prune_map(struct hot_info *root, unsigned long nr)
+{
+ return hot_item_evict(root, nr, hot_cnt_get);
+}
+
+/* The shrinker callback function */
+static unsigned long hot_track_shrink_scan(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ struct hot_info *root =
+ container_of(shrink, struct hot_info, hot_shrink);
+ unsigned long freed;
+
+ if (!(sc->gfp_mask & __GFP_FS))
+ return SHRINK_STOP;
+
+ freed = hot_prune_map(root, sc->nr_to_scan);
+
+ return freed;
+}
+
/*
* Main function to update i/o access frequencies, and it will be called
* from read/writepages() hooks, which are read_pages(), do_writepages(),
@@ -589,6 +671,7 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
root->hot_inode_tree = RB_ROOT;
spin_lock_init(&root->t_lock);
spin_lock_init(&root->m_lock);
+ atomic_long_set(&root->hot_cnt, 0);

for (i = 0; i < MAP_SIZE; i++) {
for (j = 0; j < MAX_TYPES; j++)
@@ -609,6 +692,13 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
queue_delayed_work(root->update_wq, &root->update_work,
msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));

+ /* Register a shrinker callback */
+ root->hot_shrink.count_objects = hot_track_shrink_count;
+ root->hot_shrink.scan_objects = hot_track_shrink_scan;
+ root->hot_shrink.seeks = DEFAULT_SEEKS;
+ root->hot_shrink.flags = SHRINKER_NUMA_AWARE;
+ register_shrinker(&root->hot_shrink);
+
return root;
}

@@ -620,6 +710,7 @@ static void hot_tree_exit(struct hot_info *root)
struct hot_inode_item *he;
struct rb_node *node;

+ unregister_shrinker(&root->hot_shrink);
cancel_delayed_work_sync(&root->update_work);
destroy_workqueue(root->update_wq);

diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index f5fb1ce..455bfe8 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -82,8 +82,10 @@ struct hot_info {
struct list_head hot_map[MAX_TYPES][MAP_SIZE]; /* map of inode temp */
spinlock_t t_lock; /* protect tree and map for inode item */
spinlock_t m_lock; /* protect map for range item */
+ atomic_long_t hot_cnt;
struct workqueue_struct *update_wq;
struct delayed_work update_work;
+ struct shrinker hot_shrink;
};

extern void __init hot_cache_init(void);
--
1.7.11.7
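
The budget accounting in hot_item_evict() above can be modelled in userspace as follows. This is a simplified sketch under stated assumptions: the struct and function names are hypothetical, a flat array replaces the hot_map lists, and locking, RCU and cond_resched() are left out. What it keeps is the idea that the number of freed objects is measured as the drop in a shared counter, because releasing an inode item also releases its range items.

```c
#include <stddef.h>

/* Hypothetical stand-in for a tracked inode item. */
struct item {
	int refs;	/* 1 means only the tracking structure holds it */
	int nr_ranges;	/* range items released together with this item */
	int live;	/* non-zero while the item is still tracked */
};

struct tracker {
	struct item *items;
	size_t nr_items;
	long cnt;	/* models root->hot_cnt: inode items plus range items */
};

/* Drop one reference; on the last one, the item and its ranges go away. */
static void item_put(struct tracker *t, struct item *it)
{
	if (--it->refs == 0) {
		it->live = 0;
		t->cnt -= 1 + it->nr_ranges;
	}
}

/* Evict unreferenced items until the budget is spent; return objects freed. */
static unsigned long evict(struct tracker *t, unsigned long nr_to_scan)
{
	long budget = (long)nr_to_scan;
	unsigned long freed = 0;
	size_t i;

	for (i = 0; i < t->nr_items && budget > 0; i++) {
		struct item *it = &t->items[i];
		long before;

		if (!it->live || it->refs > 1)
			continue;	/* still used elsewhere, skip it */

		before = t->cnt;
		item_put(t, it);
		budget -= before - t->cnt;
		freed += (unsigned long)(before - t->cnt);
	}

	return freed;
}
```

Because the delta is taken from the counter, a single eviction can consume more than one unit of budget when the item carries many ranges.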

2013-09-16 22:22:56

by Zhi Yong Wu

Subject: [PATCH v5 02/10] VFS hot tracking: Track IO and record heat information

From: Zhi Yong Wu <[email protected]>

This patch hooks the read/write code paths, namely read_pages(),
do_writepages(), do_generic_file_read() and __blockdev_direct_IO(),
to record heat information.

When real disk i/o for an inode is done, its own hot_inode_item will
be created or updated in the RB tree for the filesystem, and the i/o freq for
all of its extents will also be created/updated in the RB-tree per inode.

Each of the two structures hot_inode_item and hot_range_item
contains a hot_freq_data struct with its frequency of access metrics
(number of {reads, writes}, last {read,write} time, frequency of
{reads,writes}).

Each hot_inode_item contains one hot_range_tree struct which is keyed by
{inode, offset, length} and used to keep track of all the ranges in this file.

Signed-off-by: Chandra Seetharaman <[email protected]>
Signed-off-by: Zhi Yong Wu <[email protected]>
---
fs/direct-io.c | 5 +
fs/hot_tracking.c | 238 +++++++++++++++++++++++++++++++++++++++++++
fs/hot_tracking.h | 1 +
fs/namei.c | 3 +
include/linux/hot_tracking.h | 26 +++++
mm/filemap.c | 19 +++-
mm/page-writeback.c | 13 +++
mm/readahead.c | 6 ++
8 files changed, 309 insertions(+), 2 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 0e04142..db59aa3 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -38,6 +38,7 @@
#include <linux/atomic.h>
#include <linux/prefetch.h>
#include <linux/aio.h>
+#include "hot_tracking.h"

/*
* How many user pages to map in one call to get_user_pages(). This determines
@@ -1376,6 +1377,10 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
prefetch(bdev->bd_queue);
prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);

+ /* Hot tracking */
+ hot_freqs_update(inode, offset,
+ iov_length(iov, nr_segs), rw & WRITE);
+
return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
nr_segs, get_block, end_io,
submit_io, flags);
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index bb82a8d..a6cf1a5 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -22,6 +22,8 @@ static void hot_range_item_init(struct hot_range_item *hr,
struct hot_inode_item *he, loff_t start)
{
kref_init(&hr->refs);
+ hr->freq.avg_delta_reads = (u64) -1;
+ hr->freq.avg_delta_writes = (u64) -1;
hr->start = start;
hr->len = hot_bit_shift(1, RANGE_BITS, true);
hr->hot_inode = he;
@@ -61,6 +63,66 @@ void hot_range_item_put(struct hot_range_item *hr)
}
EXPORT_SYMBOL_GPL(hot_range_item_put);

+struct hot_range_item
+*hot_range_item_lookup(struct hot_inode_item *he, loff_t start, int alloc)
+{
+ struct rb_node **p;
+ struct rb_node *parent = NULL;
+ struct hot_range_item *hr, *hr_new = NULL;
+
+ start = hot_bit_shift(start, RANGE_BITS, true);
+
+ /* walk tree to find insertion point */
+redo:
+ spin_lock(&he->i_lock);
+ p = &he->hot_range_tree.rb_node;
+ while (*p) {
+ parent = *p;
+ hr = rb_entry(parent, struct hot_range_item, rb_node);
+ if (start < hr->start)
+ p = &(*p)->rb_left;
+ else if (start > (hr->start + hr->len - 1))
+ p = &(*p)->rb_right;
+ else {
+ hot_range_item_get(hr);
+ if (hr_new) {
+ /*
+ * Lost the race. Somebody else inserted
+ * the item for the range. Free the
+ * newly allocated item.
+ */
+ kmem_cache_free(hot_range_item_cachep, hr_new);
+ }
+ spin_unlock(&he->i_lock);
+
+ return hr;
+ }
+ }
+
+ if (hr_new) {
+ rb_link_node(&hr_new->rb_node, parent, p);
+ rb_insert_color(&hr_new->rb_node, &he->hot_range_tree);
+ hot_range_item_get(hr_new); /* For the caller */
+ spin_unlock(&he->i_lock);
+ return hr_new;
+ }
+ spin_unlock(&he->i_lock);
+
+ if (!alloc)
+ return ERR_PTR(-ENOENT);
+
+ hr_new = kmem_cache_zalloc(hot_range_item_cachep, GFP_NOFS);
+ if (!hr_new)
+ return ERR_PTR(-ENOMEM);
+
+ hot_range_item_init(hr_new, he, start);
+
+ cond_resched();
+
+ goto redo;
+}
+EXPORT_SYMBOL_GPL(hot_range_item_lookup);
+
/*
* Free the entire hot_range_tree.
*/
@@ -84,6 +146,8 @@ static void hot_inode_item_init(struct hot_inode_item *he,
struct hot_info *root, u64 ino)
{
kref_init(&he->refs);
+ he->freq.avg_delta_reads = (u64) -1;
+ he->freq.avg_delta_writes = (u64) -1;
he->ino = ino;
he->hot_root = root;
spin_lock_init(&he->i_lock);
@@ -124,6 +188,128 @@ void hot_inode_item_put(struct hot_inode_item *he)
}
EXPORT_SYMBOL_GPL(hot_inode_item_put);

+struct hot_inode_item
+*hot_inode_item_lookup(struct hot_info *root, u64 ino, int alloc)
+{
+ struct rb_node **p;
+ struct rb_node *parent = NULL;
+ struct hot_inode_item *he, *he_new = NULL;
+
+ /* walk tree to find insertion point */
+redo:
+ spin_lock(&root->t_lock);
+ p = &root->hot_inode_tree.rb_node;
+ while (*p) {
+ parent = *p;
+ he = rb_entry(parent, struct hot_inode_item, rb_node);
+ if (ino < he->ino)
+ p = &(*p)->rb_left;
+ else if (ino > he->ino)
+ p = &(*p)->rb_right;
+ else {
+ hot_inode_item_get(he);
+ if (he_new) {
+ /*
+ * Lost the race. Somebody else inserted
+ * the item for the inode. Free the
+ * newly allocated item.
+ */
+ kmem_cache_free(hot_inode_item_cachep, he_new);
+ }
+ spin_unlock(&root->t_lock);
+
+ return he;
+ }
+ }
+
+ if (he_new) {
+ rb_link_node(&he_new->rb_node, parent, p);
+ rb_insert_color(&he_new->rb_node, &root->hot_inode_tree);
+ hot_inode_item_get(he_new); /* For the caller */
+ spin_unlock(&root->t_lock);
+ return he_new;
+ }
+ spin_unlock(&root->t_lock);
+
+ if (!alloc)
+ return ERR_PTR(-ENOENT);
+
+ he_new = kmem_cache_zalloc(hot_inode_item_cachep, GFP_NOFS);
+ if (!he_new)
+ return ERR_PTR(-ENOMEM);
+
+ hot_inode_item_init(he_new, root, ino);
+
+ cond_resched();
+
+ goto redo;
+}
+EXPORT_SYMBOL_GPL(hot_inode_item_lookup);
+
+void hot_inode_item_unlink(struct inode *inode)
+{
+ struct hot_info *root = inode->i_sb->s_hot_root;
+ struct hot_inode_item *he;
+
+ if (!root || !S_ISREG(inode->i_mode))
+ return;
+
+ he = hot_inode_item_lookup(root, inode->i_ino, 0);
+ if (IS_ERR(he))
+ return;
+
+ spin_lock(&root->t_lock);
+ hot_inode_item_put(he);
+ hot_inode_item_put(he); /* For the caller */
+ spin_unlock(&root->t_lock);
+}
+EXPORT_SYMBOL_GPL(hot_inode_item_unlink);
+
+/*
+ * This function does the actual work of updating
+ * the frequency numbers.
+ *
+ * avg_delta_{reads,writes} are indeed a kind of simple moving
+ * average of the time difference between each of the last
+ * 2^(FREQ_POWER) reads/writes. If there have not yet been that
+ * many reads or writes, it's likely that the values will be very
+ * large; they are initialized to the largest possible value for the
+ * data type. Simply put, we don't want a few fast accesses to a file to
+ * automatically make it appear very hot.
+ */
+static void hot_freq_calc(struct timespec old_atime,
+ struct timespec cur_time, u64 *avg)
+{
+ struct timespec delta_ts;
+ u64 new_delta;
+
+ delta_ts = timespec_sub(cur_time, old_atime);
+ new_delta = timespec_to_ns(&delta_ts) >> FREQ_POWER;
+
+ *avg = (*avg << FREQ_POWER) - *avg + new_delta;
+ *avg = *avg >> FREQ_POWER;
+}
+
+static void hot_freq_update(struct hot_info *root,
+ struct hot_freq *freq, bool write)
+{
+ struct timespec cur_time = current_kernel_time();
+
+ if (write) {
+ freq->nr_writes += 1;
+ hot_freq_calc(freq->last_write_time,
+ cur_time,
+ &freq->avg_delta_writes);
+ freq->last_write_time = cur_time;
+ } else {
+ freq->nr_reads += 1;
+ hot_freq_calc(freq->last_read_time,
+ cur_time,
+ &freq->avg_delta_reads);
+ freq->last_read_time = cur_time;
+ }
+}
+
/*
* Initialize kmem cache for hot_inode_item and hot_range_item.
*/
@@ -141,6 +327,58 @@ void __init hot_cache_init(void)
}
EXPORT_SYMBOL_GPL(hot_cache_init);

+/*
+ * Main function to update i/o access frequencies, and it will be called
+ * from read/writepages() hooks, which are read_pages(), do_writepages(),
+ * do_generic_file_read(), and __blockdev_direct_IO().
+ */
+void hot_freqs_update(struct inode *inode, loff_t start,
+ size_t len, int rw)
+{
+ struct hot_info *root = inode->i_sb->s_hot_root;
+ struct hot_inode_item *he;
+ struct hot_range_item *hr;
+ u64 range_size;
+ loff_t cur, end;
+
+ if (!root || (len == 0) || !S_ISREG(inode->i_mode))
+ return;
+
+ he = hot_inode_item_lookup(root, inode->i_ino, 1);
+ if (IS_ERR(he))
+ return;
+
+ hot_freq_update(root, &he->freq, rw);
+
+ /*
+ * Align ranges on range size boundary
+ * to prevent proliferation of range structs
+ */
+ range_size = hot_bit_shift(1, RANGE_BITS, true);
+ end = hot_bit_shift((start + len + range_size - 1),
+ RANGE_BITS, false);
+ cur = hot_bit_shift(start, RANGE_BITS, false);
+ for (; cur < end; cur++) {
+ hr = hot_range_item_lookup(he, cur, 1);
+ if (IS_ERR(hr)) {
+ WARN(1, "hot_range_item_lookup returns %ld\n",
+ PTR_ERR(hr));
+ return;
+ }
+
+ hot_freq_update(root, &hr->freq, rw);
+
+ spin_lock(&he->i_lock);
+ hot_range_item_put(hr);
+ spin_unlock(&he->i_lock);
+ }
+
+ spin_lock(&root->t_lock);
+ hot_inode_item_put(he);
+ spin_unlock(&root->t_lock);
+}
+EXPORT_SYMBOL_GPL(hot_freqs_update);
+
static struct hot_info *hot_tree_init(struct super_block *sb)
{
struct hot_info *root;
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 2776092..bb4cb16 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -16,5 +16,6 @@

/* size of sub-file ranges */
#define RANGE_BITS 20
+#define FREQ_POWER 4

#endif /* __HOT_TRACKING__ */
diff --git a/fs/namei.c b/fs/namei.c
index 0dc4cbf..e6ec3c3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3659,6 +3659,9 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry)
}
mutex_unlock(&dentry->d_inode->i_mutex);

+ if (!error && !dentry->d_inode->i_nlink)
+ hot_inode_item_unlink(dentry->d_inode);
+
/* We don't d_delete() NFS sillyrenamed files--they still exist. */
if (!error && !(dentry->d_flags & DCACHE_NFSFS_RENAMED)) {
fsnotify_link_count(dentry->d_inode);
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 4112af2..f93db02 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -34,8 +34,24 @@ enum {
MAX_TYPES,
};

+/*
+ * A frequency data struct holds values that are used to
+ * determine temperature of files and file ranges. These structs
+ * are members of hot_inode_item and hot_range_item
+ */
+struct hot_freq {
+ struct timespec last_read_time;
+ struct timespec last_write_time;
+ u32 nr_reads;
+ u32 nr_writes;
+ u64 avg_delta_reads;
+ u64 avg_delta_writes;
+ u32 last_temp;
+};
+
/* An item representing an inode and its access frequency */
struct hot_inode_item {
+ struct hot_freq freq; /* frequency data */
struct kref refs;
struct rb_node rb_node; /* rbtree index */
struct rcu_head rcu;
@@ -50,6 +66,7 @@ struct hot_inode_item {
* an inode whose frequency is being tracked
*/
struct hot_range_item {
+ struct hot_freq freq; /* frequency data */
struct kref refs;
struct rb_node rb_node; /* rbtree index */
struct rcu_head rcu;
@@ -70,6 +87,15 @@ extern void hot_range_item_put(struct hot_range_item *hr);
extern void hot_inode_item_put(struct hot_inode_item *he);
extern void hot_range_item_get(struct hot_range_item *hr);
extern void hot_inode_item_get(struct hot_inode_item *he);
+extern struct hot_range_item
+*hot_range_item_lookup(struct hot_inode_item *he,
+ loff_t start, int alloc);
+extern struct hot_inode_item
+*hot_inode_item_lookup(struct hot_info *root,
+ u64 ino, int alloc);
+extern void hot_inode_item_unlink(struct inode *inode);
+extern void hot_freqs_update(struct inode *inode, loff_t start,
+ size_t len, int rw);

static inline u64 hot_bit_shift(u64 counter, u32 bits, bool dir)
{
diff --git a/mm/filemap.c b/mm/filemap.c
index 1e6aec4..d1fed16 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
#include <linux/cleancache.h>
+#include <linux/hot_tracking.h>
#include "internal.h"

#define CREATE_TRACE_POINTS
@@ -1244,6 +1245,11 @@ readpage:
* PG_error will be set again if readpage fails.
*/
ClearPageError(page);
+
+ /* Hot tracking */
+ hot_freqs_update(inode, page->index << PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE, 0);
+
/* Start the actual read. The read will unlock the page. */
error = mapping->a_ops->readpage(filp, page);

@@ -1514,9 +1520,13 @@ static int page_cache_read(struct file *file, pgoff_t offset)
return -ENOMEM;

ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
- if (ret == 0)
+ if (ret == 0) {
+ /* Hot tracking */
+ hot_freqs_update(mapping->host,
+ page->index << PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE, 0);
ret = mapping->a_ops->readpage(file, page);
- else if (ret == -EEXIST)
+ } else if (ret == -EEXIST)
ret = 0; /* losing race to add is OK */

page_cache_release(page);
@@ -1720,6 +1730,11 @@ page_not_uptodate:
* and we need to check for errors.
*/
ClearPageError(page);
+
+ /* Hot tracking */
+ hot_freqs_update(inode, page->index << PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE, 0);
+
error = mapping->a_ops->readpage(file, page);
if (!error) {
wait_on_page_locked(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f5236f8..8d79af0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -37,7 +37,9 @@
#include <linux/timer.h>
#include <linux/sched/rt.h>
#include <linux/mm_inline.h>
+#include <linux/hot_tracking.h>
#include <trace/events/writeback.h>
+#include <linux/hot_tracking.h>

#include "internal.h"

@@ -2062,13 +2064,24 @@ EXPORT_SYMBOL(generic_writepages);
int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
int ret;
+ loff_t start = 0;
+ size_t count = 0;

if (wbc->nr_to_write <= 0)
return 0;
+
+ start = mapping->writeback_index << PAGE_CACHE_SHIFT;
+ count = wbc->nr_to_write;
+
if (mapping->a_ops->writepages)
ret = mapping->a_ops->writepages(mapping, wbc);
else
ret = generic_writepages(mapping, wbc);
+
+ /* Hot tracking */
+ hot_freqs_update(mapping->host, start,
+ (count - wbc->nr_to_write) * PAGE_CACHE_SIZE, 1);
+
return ret;
}

diff --git a/mm/readahead.c b/mm/readahead.c
index e4ed041..51f0e88 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -19,6 +19,7 @@
#include <linux/pagemap.h>
#include <linux/syscalls.h>
#include <linux/file.h>
+#include <linux/hot_tracking.h>

/*
* Initialise a struct file's readahead state. Assumes that the caller has
@@ -115,6 +116,11 @@ static int read_pages(struct address_space *mapping, struct file *filp,
unsigned page_idx;
int ret;

+ /* Hot tracking */
+ hot_freqs_update(mapping->host,
+ list_to_page(pages)->index << PAGE_CACHE_SHIFT,
+ (size_t)nr_pages * PAGE_CACHE_SIZE, 0);
+
blk_start_plug(&plug);

if (mapping->a_ops->readpages) {
--
1.7.11.7
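
To see how the shift-based average in hot_freq_calc() behaves, the arithmetic can be lifted into a small userspace function. The sketch below mirrors the three arithmetic lines of the patch with FREQ_POWER = 4; the function name freq_avg_update() and the plain uint64_t interface are assumptions made for illustration, and the timespec handling is omitted.

```c
#include <stdint.h>

#define FREQ_POWER 4	/* value defined in fs/hot_tracking.h by the patch */

/*
 * One update step of the running average, following the arithmetic
 * of hot_freq_calc(): the sample is shifted down once, mixed into
 * (2^FREQ_POWER - 1) parts of the old average, and the sum is shifted
 * down again. All operations are unsigned and wrap modulo 2^64.
 */
static uint64_t freq_avg_update(uint64_t avg, uint64_t delta_ns)
{
	uint64_t new_delta = delta_ns >> FREQ_POWER;

	avg = (avg << FREQ_POWER) - avg + new_delta;
	return avg >> FREQ_POWER;
}
```

Under this arithmetic a constant delta d is a fixed point at d >> FREQ_POWER (for example, a stored value of 100 stays at 100 for a steady 1600 ns delta), so the stored number is a scaled average rather than the delta itself. The initial value (u64)-1 wraps in the left shift and becomes 0x0FFFFFFFFFFFFFFF after the first update with a zero delta.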

2013-09-24 00:20:47

by Al Viro

Subject: Re: [PATCH v5 00/10] VFS hot tracking

On Tue, Sep 17, 2013 at 06:17:45AM +0800, [email protected] wrote:
> From: Zhi Yong Wu <[email protected]>
>
> The patchset is trying to introduce hot tracking function in
> VFS layer, which will keep track of real disk I/O in memory.
> By it, you will easily know more details about disk I/O, and
> then detect where disk I/O hot spots are. Also, specific FS
> can take use of it to do accurate defragment, and hot relocation
> support, etc.
>
> Now it's time to send out its V5 for external review, and
> any comments or ideas are appreciated, thanks.

FWIW, the most fundamental problem I see with this is that the data
you are collecting is very sensitive to VM pressure details. The
hotspots wrt accesses (i.e. the stuff accessed all the time) will
not generate a lot of IO - they'll just sit in cache and look
very cold for your code. The stuff that is accessed very rarely
will also look cold. The hot ones will be those that get in and
out of cache often; IOW, the ones that are borderline - a bit less
memory pressure and they would've stayed in cache. I would expect
that to vary very much from run to run, which would make its use
for decisions like SSD vs. HD rather dubious...

Do you have that data collected on some real tasks under varying
memory pressure conditions? How stable the results are?

Another question: do you have perf profiles for system with that
stuff enabled, on boxen with many CPUs? You are using a lot of
spinlocks there; how much contention and cacheline ping-pong are
you getting on root->t_lock, for example? Ditto for cacheline
ping-pong on root->hot_cnt, while we are at it...

Comments re code:

* don't export the stuff until it's used by a module. And as
a general policy, please, do not use EXPORT_SYMBOL_GPL in fs/*.
Either don't export at all, or pick a sane API that would not
expose the guts of your code (== wouldn't require you to look
at the users to decide what will and will not break on changes
in your code) and export that. As far as I'm concerned,
all variants of EXPORT_SYMBOL are greatly overused and
EXPORT_SYMBOL_GPL is an exercise in masturbation...

* hot_inode_item_lookup() is a couple of functions smashed together;
split it, please, and lose the "alloc" argument.

* hot_inode_item_unlink() is used when the last link is killed
by unlink(2), but not when it's killed by successful rename(2).
Why?

* what happens when one opens a file, unlinks it and starts doing
IO? hot_freqs_update() will be called, re-creating the inode item
unlink has killed...

* for pity sake, use inlines - the same hot_freqs_update() on a filesystem
that doesn't have this stuff enabled will *still* branch pretty far
out of line, only to return after checking that ->s_hot_root is NULL.
Incidentally, we still have about twenty spare bits in inode flags;
use one (S_TEMP_COLLECTED, or something like that) instead of that
check. Checking it is considerably cheaper than checking ->i_sb->s_hot_root.

* hot_bit_shift() is bloody annoying. Why does true mean "up", false -
"down" and why is it easier to memorize that than just use explicit <<
and >>?

* file->f_dentry->d_inode is spelled file_inode(file), TYVM (so does
file->f_path.dentry->d_inode, actually).

* #ifdef __KERNEL__ in include/linux/* is wrong; use include/uapi/linux/*
for the bits userland needs to see.

* checks for ->i_nlink belong under ->i_mutex. As it is, two unlink(2)
killing two last links to the same file can very well _both_ call
hot_inode_item_unlink(), with obvious unpleasant results.
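
As an illustration of the hot_bit_shift() comment above, the range arithmetic from hot_freqs_update() can be written with explicit shifts. This is a userspace sketch, not a proposed patch: the helper names are hypothetical, and RANGE_BITS = 20 is taken from fs/hot_tracking.h in the series.

```c
#include <stddef.h>
#include <stdint.h>

#define RANGE_BITS 20	/* from the patch: ranges of 2^20 bytes */

/* Index of the first range touched by [start, start + len). */
static uint64_t range_first(uint64_t start)
{
	return start >> RANGE_BITS;
}

/* One past the index of the last range touched, following the patch's rounding. */
static uint64_t range_end(uint64_t start, size_t len)
{
	uint64_t range_size = 1ULL << RANGE_BITS;

	return (start + len + range_size - 1) >> RANGE_BITS;
}

/* Byte offset at which range number idx begins. */
static uint64_t range_start_offset(uint64_t idx)
{
	return idx << RANGE_BITS;
}
```

A caller would loop over idx from range_first(start) up to, but not including, range_end(start, len), which is the same iteration the patch expresses through hot_bit_shift() with a boolean direction.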

2013-09-25 03:38:41

by Zhi Yong Wu

Subject: Re: [PATCH v5 00/10] VFS hot tracking

On Tue, Sep 24, 2013 at 8:20 AM, Al Viro <[email protected]> wrote:
> On Tue, Sep 17, 2013 at 06:17:45AM +0800, [email protected] wrote:
>> From: Zhi Yong Wu <[email protected]>
>>
>> The patchset is trying to introduce hot tracking function in
>> VFS layer, which will keep track of real disk I/O in memory.
>> By it, you will easily know more details about disk I/O, and
>> then detect where disk I/O hot spots are. Also, specific FS
>> can take use of it to do accurate defragment, and hot relocation
>> support, etc.
>>
>> Now it's time to send out its V5 for external review, and
>> any comments or ideas are appreciated, thanks.
>
> FWIW, the most fundamental problem I see with this is that the data
> you are collecting is very sensitive to VM pressure details. The
> hotspots wrt accesses (i.e. the stuff accessed all the time) will
> not generate a lot of IO - they'll just sit in cache and look
> very cold for your code. The stuff that is accessed very rarely
> will also look cold. The hot ones will be those that get in and
> out of cache often; IOW, the ones that are borderline - a bit less
> memory pressure and they would've stayed in cache. I would expect
> that to vary very much from run to run, which would make its use
> for decisions like SSD vs. HD rather dubious...
Are you suggesting that the hot info should also be collected when the
data is accessed from cache?
In any case, I will run perf tests in some scenarios where VFS hot
tracking takes effect.
>
> Do you have that data collected on some real tasks under varying
> memory pressure conditions? How stable the results are?
Can you describe in more detail what you mean by real tasks? What kind of
tests are you suggesting, and what results do you expect to see?
>
> Another question: do you have perf profiles for system with that
> stuff enabled, on boxen with many CPUs? You are using a lot of
No, I will try to do that and let you know the results.
> spinlocks there; how much contention and cacheline ping-pong are
> you getting on root->t_lock, for example? Ditto for cacheline
> ping-pong on root->hot_cnt, while we are at it...
Sorry, what kind of tests are you suggesting, and what results do you
expect to see? I am a newbie to VFS; can you explain in more detail
how to run this test?
>
> Comments re code:
>
> * don't export the stuff until it's used by a module. And as
> a general policy, please, do not use EXPORT_SYMBOL_GPL in fs/*.
> Either don't export at all, or pick a sane API that would not
> expose the guts of your code (== wouldn't require you to look
> at the users to decide what will and will not break on changes
> in your code) and export that. As far as I'm concerned,
> all variants of EXPORT_SYMBOL are greatly overused and
> EXPORT_SYMBOL_GPL is an exercise in masturbation...
OK, I will make the appropriate changes based on your comments, thanks.
>
> * hot_inode_item_lookup() is a couple of functions smashed together;
> split it, please, and lose the "alloc" argument.
Do you mean that it should be split into two functions, an "alloc"
function and a "lookup" function?

>
> * hot_inode_item_unlink() is used when the last link is killed
> by unlink(2), but not when it's killed by successful rename(2).
> Why?
Since we use the inode to collect the hot info, rename(2)
doesn't destroy that information, as the inode is kept the same.
>
> * what happens when one opens a file, unlinks it and starts doing
> IO? hot_freqs_update() will be called, re-creating the inode item
> unlink has killed...
Since the file won't be opened or used by anybody else anymore, we
didn't bother about it.
But I will improve it based on your comments: hot_freqs_update() will
return directly and not re-create the inode item when the file has
been unlinked.
>
> * for pity sake, use inlines - the same hot_freqs_update() on a filesystem
> that doesn't have this stuff enabled will *still* branch pretty far
> out of line, only to return after checking that ->s_hot_root is NULL.
> Incidentally, we still have about twenty spare bits in inode flags;
> use one (S_TEMP_COLLECTED, or something like that) instead of that
> check. Checking it is considerably cheaper than checking ->i_sb->s_hot_root.
OK, will use an inline and a flag bit in the next post, thanks.
>
> * hot_bit_shift() is bloody annoying. Why does true mean "up", false -
> "down" and why is it easier to memorize that than just use explicit <<
> and >>?
OK, will make the appropriate changes based on your comments, thanks.
>
> * file->f_dentry->d_inode is spelled file_inode(file), TYVM (so does
> file->f_path.dentry->d_inode, actually).
Ditto.
>
> * #ifdef __KERNEL__ in include/linux/* is wrong; use include/uapi/linux/*
> for the bits userland needs to see.
Ditto.
>
> * checks for ->i_nlink belong under ->i_mutex. As it is, two unlink(2)
> killing two last links to the same file can very well _both_ call
> hot_inode_item_unlink(), with obvious unpleasant results.
Ditto.



--
Regards,

Zhi Yong Wu
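
The inline-plus-flag suggestion discussed in this reply could look roughly like the userspace mock below. Everything here is hypothetical: struct mock_inode stands in for struct inode, the bit chosen for S_TEMP_COLLECTED is arbitrary (the thread only proposes a name of that kind), and the slow path merely counts calls instead of updating frequency data.

```c
/* Hypothetical flag bit; the real value would come from the inode flags. */
#define S_TEMP_COLLECTED (1u << 0)

/* Minimal stand-in for struct inode. */
struct mock_inode {
	unsigned int i_flags;
	unsigned long updates;	/* counts slow-path calls, for illustration */
};

/* Out-of-line slow path: stands in for the real tracking work. */
static void hot_freqs_update_slow(struct mock_inode *inode)
{
	inode->updates++;
}

/*
 * Inline wrapper: inodes that are not tracked pay only for a flag test
 * and never branch out of line.
 */
static inline void hot_freqs_update_fast(struct mock_inode *inode)
{
	if (!(inode->i_flags & S_TEMP_COLLECTED))
		return;

	hot_freqs_update_slow(inode);
}
```

The design point is that the check reads a field of the inode already in hand, instead of following ->i_sb to ->s_hot_root inside a function call.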