2009-11-19 17:20:37

by David Howells

Subject: [PATCH 00/28] Fixes for FS-Cache and CacheFiles

The associated patches provide three sets of changes:

(1) More ways to get debugging information from the slow-work, FS-Cache and
CacheFiles facilities.

(2) Fixes for all facilities.

(3) Some extensions to slow-work for users other than FS-Cache/CacheFiles.
Those users aren't included in this series, but other patch series have
been built on top of these extensions.

The patches can also be found in:

http://people.redhat.com/~dhowells/fscache/patches/fscache-fixes.tar.bz2

The bugs fixed were found by running a simple stress tester:

http://people.redhat.com/~dhowells/slurp.c

Build it and point it at a bunch of NFS (or AFS or whatever) directory trees,
and it forks and execs a copy of itself to handle each directory it is given.
Whilst scanning a directory, a new copy is forked and exec'd for each subdir
of that directory, and then each file in that directory is read from end to
end.
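
For illustration only, the scanning strategy is roughly the following (this is
a simplified sketch of the fork-per-subdirectory pattern, not the actual
slurp.c):

	#include <dirent.h>
	#include <fcntl.h>
	#include <limits.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/stat.h>
	#include <sys/wait.h>

	int main(int argc, char *argv[])
	{
		char buf[65536], path[PATH_MAX];
		struct dirent *de;
		struct stat st;
		DIR *dir;
		int fd;

		if (argc != 2)
			exit(2);
		dir = opendir(argv[1]);
		if (!dir)
			exit(1);

		while ((de = readdir(dir))) {
			if (de->d_name[0] == '.')
				continue;
			snprintf(path, sizeof(path), "%s/%s",
				 argv[1], de->d_name);
			if (stat(path, &st) < 0)
				continue;
			if (S_ISDIR(st.st_mode)) {
				/* fork and exec a new copy per subdirectory */
				if (fork() == 0) {
					execl(argv[0], argv[0], path,
					      (char *)NULL);
					_exit(1);
				}
			} else if (S_ISREG(st.st_mode)) {
				/* read each regular file from end to end */
				fd = open(path, O_RDONLY);
				if (fd >= 0) {
					while (read(fd, buf, sizeof(buf)) > 0)
						;
					close(fd);
				}
			}
		}

		closedir(dir);
		while (wait(NULL) > 0)
			;
		return 0;
	}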

This results in a large number of processes reading a large number of files in
parallel. If the data set is large enough, it applies pressure to both RAM
space and cache space, can cause OOMs to occur, and can cause cachefilesd to
run in parallel trying to cull the cache as well.

With these patches applied, more counters can be seen in /proc/fs/fscache/stats,
and new sources of information exist too:

(*) /proc/slow_work_rq

The items currently being processed by, or queued for processing by, the
slow-work facility.

(*) /proc/fs/fscache/objects

A list of current fscache objects and their states. This can be
configured by loading a key into your session keyring before viewing
the file, for instance:

keyctl add user fscache:objlist KB @s

This is described in Documentation/filesystems/caching/fscache.txt

David
---

David Howells (25):
CacheFiles: Don't log lookup/create failing with ENOBUFS
CacheFiles: Catch an overly long wait for an old active object
CacheFiles: Better showing of debugging information in active object problems
CacheFiles: Mark parent directory locks as I_MUTEX_PARENT to keep lockdep happy
CacheFiles: Handle truncate unlocking the page we're reading
CacheFiles: Don't write a full page if there's only a partial page to cache
FS-Cache: Actually requeue an object when requested
FS-Cache: Start processing an object's operations on that object's death
FS-Cache: Make sure FSCACHE_COOKIE_LOOKING_UP cleared on lookup failure
FS-Cache: Add a retirement stat counter
FS-Cache: Handle pages pending storage that get evicted under OOM conditions
FS-Cache: Handle read request vs lookup, creation or other cache failure
FS-Cache: Don't delete pending pages from the page-store tracking tree
FS-Cache: Fix lock misorder in fscache_write_op()
FS-Cache: The object-available state can't rely on the cookie to be available
FS-Cache: Permit cache retrieval ops to be interrupted in the initial wait phase
FS-Cache: Use radix tree preload correctly in tracking of pages to be stored
FS-Cache: Clear netfs pointers in cookie after detaching object, not before
FS-Cache: Add counters for entry/exit to/from cache operation functions
FS-Cache: Allow the current state of all objects to be dumped
FS-Cache: Annotate slow-work runqueue proc lines for FS-Cache work items
SLOW_WORK: Allow a requeueable work item to sleep till the thread is needed
SLOW_WORK: Allow the owner of a work item to determine if it is queued or not
SLOW_WORK: Allow the work items to be viewed through a /proc file
SLOW_WORK: Wait for outstanding work items belonging to a module to clear

Jens Axboe (3):
SLOW_WORK: Add delayed_slow_work support
SLOW_WORK: Add support for cancellation of slow work
SLOW_WORK: Make slow_work_ops ->get_ref/->put_ref optional


Documentation/filesystems/caching/fscache.txt | 110 +++++
Documentation/filesystems/caching/netfs-api.txt | 21 +
Documentation/slow-work.txt | 160 +++++++
fs/9p/cache.c | 14 -
fs/afs/file.c | 15 -
fs/cachefiles/interface.c | 32 +
fs/cachefiles/namei.c | 187 +++++++--
fs/cachefiles/rdwr.c | 128 +++++-
fs/fscache/Kconfig | 7
fs/fscache/Makefile | 1
fs/fscache/cache.c | 5
fs/fscache/cookie.c | 26 +
fs/fscache/internal.h | 55 +++
fs/fscache/main.c | 6
fs/fscache/object-list.c | 432 ++++++++++++++++++++
fs/fscache/object.c | 104 ++++-
fs/fscache/operation.c | 120 ++++--
fs/fscache/page.c | 273 ++++++++++---
fs/fscache/proc.c | 13 +
fs/fscache/stats.c | 94 ++++
fs/gfs2/main.c | 4
fs/gfs2/recovery.c | 1
fs/nfs/fscache.c | 10
include/linux/fscache-cache.h | 40 ++
include/linux/fscache.h | 27 +
include/linux/slow-work.h | 72 +++
init/Kconfig | 10
kernel/Makefile | 1
kernel/slow-work-proc.c | 227 +++++++++++
kernel/slow-work.c | 494 +++++++++++++++++++++--
kernel/slow-work.h | 72 +++
lib/radix-tree.c | 5
32 files changed, 2502 insertions(+), 264 deletions(-)
create mode 100644 fs/fscache/object-list.c
create mode 100644 kernel/slow-work-proc.c
create mode 100644 kernel/slow-work.h


2009-11-19 17:20:45

by David Howells

Subject: [PATCH 01/28] SLOW_WORK: Wait for outstanding work items belonging to a module to clear

Wait for outstanding slow work items belonging to a module to clear when
unregistering that module as a user of the facility. This prevents the put_ref
code of a work item from being taken away before it returns.
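
As a rough illustration of the changed API (all names below are made up, not
taken from the patch; note that at this point in the series ->get_ref() and
->put_ref() are still required):

	#include <linux/module.h>
	#include <linux/slow-work.h>

	static struct slow_work example_item;

	static int example_get_ref(struct slow_work *work)
	{
		return 0;	/* example_item is static; nothing to pin */
	}

	static void example_put_ref(struct slow_work *work)
	{
	}

	static void example_execute(struct slow_work *work)
	{
		/* long-running, sleepable work goes here */
	}

	static const struct slow_work_ops example_ops = {
		.owner   = THIS_MODULE,	/* new: identifies the owning module */
		.get_ref = example_get_ref,
		.put_ref = example_put_ref,
		.execute = example_execute,
	};

	static int __init example_init(void)
	{
		int ret;

		ret = slow_work_register_user(THIS_MODULE); /* new argument */
		if (ret < 0)
			return ret;
		slow_work_init(&example_item, &example_ops);
		ret = slow_work_enqueue(&example_item);
		if (ret < 0)
			slow_work_unregister_user(THIS_MODULE);
		return ret;
	}

	static void __exit example_exit(void)
	{
		/* now waits for this module's outstanding items to clear */
		slow_work_unregister_user(THIS_MODULE);
	}

	module_init(example_init);
	module_exit(example_exit);
	MODULE_LICENSE("GPL");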

Signed-off-by: David Howells <[email protected]>
---

Documentation/slow-work.txt | 13 +++-
fs/fscache/main.c | 6 +-
fs/fscache/object.c | 1
fs/fscache/operation.c | 1
fs/gfs2/main.c | 4 +
fs/gfs2/recovery.c | 1
include/linux/slow-work.h | 8 ++-
kernel/slow-work.c | 132 +++++++++++++++++++++++++++++++++++++++++--
8 files changed, 150 insertions(+), 16 deletions(-)

diff --git a/Documentation/slow-work.txt b/Documentation/slow-work.txt
index ebc50f8..f12fda3 100644
--- a/Documentation/slow-work.txt
+++ b/Documentation/slow-work.txt
@@ -64,9 +64,11 @@ USING SLOW WORK ITEMS
Firstly, a module or subsystem wanting to make use of slow work items must
register its interest:

- int ret = slow_work_register_user();
+ int ret = slow_work_register_user(struct module *module);

-This will return 0 if successful, or a -ve error upon failure.
+This will return 0 if successful, or a -ve error upon failure. The module
+pointer should be the module interested in using this facility (almost
+certainly THIS_MODULE).


Slow work items may then be set up by:
@@ -110,7 +112,12 @@ operation. When all a module's slow work items have been processed, and the
module has no further interest in the facility, it should unregister its
interest:

- slow_work_unregister_user();
+ slow_work_unregister_user(struct module *module);
+
+The module pointer is used to wait for all outstanding work items for that
+module before completing the unregistration. This prevents the put_ref() code
+from being taken away before it completes. module should almost certainly be
+THIS_MODULE.


===============
diff --git a/fs/fscache/main.c b/fs/fscache/main.c
index 4de41b5..add6bdb 100644
--- a/fs/fscache/main.c
+++ b/fs/fscache/main.c
@@ -48,7 +48,7 @@ static int __init fscache_init(void)
{
int ret;

- ret = slow_work_register_user();
+ ret = slow_work_register_user(THIS_MODULE);
if (ret < 0)
goto error_slow_work;

@@ -80,7 +80,7 @@ error_kobj:
error_cookie_jar:
fscache_proc_cleanup();
error_proc:
- slow_work_unregister_user();
+ slow_work_unregister_user(THIS_MODULE);
error_slow_work:
return ret;
}
@@ -97,7 +97,7 @@ static void __exit fscache_exit(void)
kobject_put(fscache_root);
kmem_cache_destroy(fscache_cookie_jar);
fscache_proc_cleanup();
- slow_work_unregister_user();
+ slow_work_unregister_user(THIS_MODULE);
printk(KERN_NOTICE "FS-Cache: Unloaded\n");
}

diff --git a/fs/fscache/object.c b/fs/fscache/object.c
index 392a41b..d236eb1 100644
--- a/fs/fscache/object.c
+++ b/fs/fscache/object.c
@@ -45,6 +45,7 @@ static void fscache_enqueue_dependents(struct fscache_object *);
static void fscache_dequeue_object(struct fscache_object *);

const struct slow_work_ops fscache_object_slow_work_ops = {
+ .owner = THIS_MODULE,
.get_ref = fscache_object_slow_work_get_ref,
.put_ref = fscache_object_slow_work_put_ref,
.execute = fscache_object_slow_work_execute,
diff --git a/fs/fscache/operation.c b/fs/fscache/operation.c
index e7f8d53..f1a2857 100644
--- a/fs/fscache/operation.c
+++ b/fs/fscache/operation.c
@@ -453,6 +453,7 @@ static void fscache_op_execute(struct slow_work *work)
}

const struct slow_work_ops fscache_op_slow_work_ops = {
+ .owner = THIS_MODULE,
.get_ref = fscache_op_get_ref,
.put_ref = fscache_op_put_ref,
.execute = fscache_op_execute,
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index eacd78a..5b31f77 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -114,7 +114,7 @@ static int __init init_gfs2_fs(void)
if (error)
goto fail_unregister;

- error = slow_work_register_user();
+ error = slow_work_register_user(THIS_MODULE);
if (error)
goto fail_slow;

@@ -163,7 +163,7 @@ static void __exit exit_gfs2_fs(void)
gfs2_unregister_debugfs();
unregister_filesystem(&gfs2_fs_type);
unregister_filesystem(&gfs2meta_fs_type);
- slow_work_unregister_user();
+ slow_work_unregister_user(THIS_MODULE);

kmem_cache_destroy(gfs2_quotad_cachep);
kmem_cache_destroy(gfs2_rgrpd_cachep);
diff --git a/fs/gfs2/recovery.c b/fs/gfs2/recovery.c
index 59d2695..b2bb779 100644
--- a/fs/gfs2/recovery.c
+++ b/fs/gfs2/recovery.c
@@ -593,6 +593,7 @@ fail:
}

struct slow_work_ops gfs2_recover_ops = {
+ .owner = THIS_MODULE,
.get_ref = gfs2_recover_get_ref,
.put_ref = gfs2_recover_put_ref,
.execute = gfs2_recover_work,
diff --git a/include/linux/slow-work.h b/include/linux/slow-work.h
index b65c888..9adb2b3 100644
--- a/include/linux/slow-work.h
+++ b/include/linux/slow-work.h
@@ -24,6 +24,9 @@ struct slow_work;
* The operations used to support slow work items
*/
struct slow_work_ops {
+ /* owner */
+ struct module *owner;
+
/* get a ref on a work item
* - return 0 if successful, -ve if not
*/
@@ -42,6 +45,7 @@ struct slow_work_ops {
* queued
*/
struct slow_work {
+ struct module *owner; /* the owning module */
unsigned long flags;
#define SLOW_WORK_PENDING 0 /* item pending (further) execution */
#define SLOW_WORK_EXECUTING 1 /* item currently executing */
@@ -84,8 +88,8 @@ static inline void vslow_work_init(struct slow_work *work,
}

extern int slow_work_enqueue(struct slow_work *work);
-extern int slow_work_register_user(void);
-extern void slow_work_unregister_user(void);
+extern int slow_work_register_user(struct module *owner);
+extern void slow_work_unregister_user(struct module *owner);

#ifdef CONFIG_SYSCTL
extern ctl_table slow_work_sysctls[];
diff --git a/kernel/slow-work.c b/kernel/slow-work.c
index 0d31135..dd08f37 100644
--- a/kernel/slow-work.c
+++ b/kernel/slow-work.c
@@ -22,6 +22,8 @@
#define SLOW_WORK_OOM_TIMEOUT (5 * HZ) /* can't start new threads for 5s after
* OOM */

+#define SLOW_WORK_THREAD_LIMIT 255 /* abs maximum number of slow-work threads */
+
static void slow_work_cull_timeout(unsigned long);
static void slow_work_oom_timeout(unsigned long);

@@ -46,7 +48,7 @@ static unsigned vslow_work_proportion = 50; /* % of threads that may process

#ifdef CONFIG_SYSCTL
static const int slow_work_min_min_threads = 2;
-static int slow_work_max_max_threads = 255;
+static int slow_work_max_max_threads = SLOW_WORK_THREAD_LIMIT;
static const int slow_work_min_vslow = 1;
static const int slow_work_max_vslow = 99;

@@ -98,6 +100,23 @@ static DEFINE_TIMER(slow_work_oom_timer, slow_work_oom_timeout, 0, 0);
static struct slow_work slow_work_new_thread; /* new thread starter */

/*
+ * slow work ID allocation (use slow_work_queue_lock)
+ */
+static DECLARE_BITMAP(slow_work_ids, SLOW_WORK_THREAD_LIMIT);
+
+/*
+ * Unregistration tracking to prevent put_ref() from disappearing during module
+ * unload
+ */
+#ifdef CONFIG_MODULES
+static struct module *slow_work_thread_processing[SLOW_WORK_THREAD_LIMIT];
+static struct module *slow_work_unreg_module;
+static struct slow_work *slow_work_unreg_work_item;
+static DECLARE_WAIT_QUEUE_HEAD(slow_work_unreg_wq);
+static DEFINE_MUTEX(slow_work_unreg_sync_lock);
+#endif
+
+/*
* The queues of work items and the lock governing access to them. These are
* shared between all the CPUs. It doesn't make sense to have per-CPU queues
* as the number of threads bears no relation to the number of CPUs.
@@ -149,8 +168,11 @@ static unsigned slow_work_calc_vsmax(void)
* Attempt to execute stuff queued on a slow thread. Return true if we managed
* it, false if there was nothing to do.
*/
-static bool slow_work_execute(void)
+static bool slow_work_execute(int id)
{
+#ifdef CONFIG_MODULES
+ struct module *module;
+#endif
struct slow_work *work = NULL;
unsigned vsmax;
bool very_slow;
@@ -186,6 +208,12 @@ static bool slow_work_execute(void)
} else {
very_slow = false; /* avoid the compiler warning */
}
+
+#ifdef CONFIG_MODULES
+ if (work)
+ slow_work_thread_processing[id] = work->owner;
+#endif
+
spin_unlock_irq(&slow_work_queue_lock);

if (!work)
@@ -219,7 +247,18 @@ static bool slow_work_execute(void)
spin_unlock_irq(&slow_work_queue_lock);
}

+ /* sort out the race between module unloading and put_ref() */
work->ops->put_ref(work);
+
+#ifdef CONFIG_MODULES
+ module = slow_work_thread_processing[id];
+ slow_work_thread_processing[id] = NULL;
+ smp_mb();
+ if (slow_work_unreg_work_item == work ||
+ slow_work_unreg_module == module)
+ wake_up_all(&slow_work_unreg_wq);
+#endif
+
return true;

auto_requeue:
@@ -232,6 +271,7 @@ auto_requeue:
else
list_add_tail(&work->link, &slow_work_queue);
spin_unlock_irq(&slow_work_queue_lock);
+ slow_work_thread_processing[id] = NULL;
return true;
}

@@ -368,13 +408,22 @@ static inline bool slow_work_available(int vsmax)
*/
static int slow_work_thread(void *_data)
{
- int vsmax;
+ int vsmax, id;

DEFINE_WAIT(wait);

set_freezable();
set_user_nice(current, -5);

+ /* allocate ourselves an ID */
+ spin_lock_irq(&slow_work_queue_lock);
+ id = find_first_zero_bit(slow_work_ids, SLOW_WORK_THREAD_LIMIT);
+ BUG_ON(id < 0 || id >= SLOW_WORK_THREAD_LIMIT);
+ __set_bit(id, slow_work_ids);
+ spin_unlock_irq(&slow_work_queue_lock);
+
+ sprintf(current->comm, "kslowd%03u", id);
+
for (;;) {
vsmax = vslow_work_proportion;
vsmax *= atomic_read(&slow_work_thread_count);
@@ -395,7 +444,7 @@ static int slow_work_thread(void *_data)
vsmax *= atomic_read(&slow_work_thread_count);
vsmax /= 100;

- if (slow_work_available(vsmax) && slow_work_execute()) {
+ if (slow_work_available(vsmax) && slow_work_execute(id)) {
cond_resched();
if (list_empty(&slow_work_queue) &&
list_empty(&vslow_work_queue) &&
@@ -412,6 +461,10 @@ static int slow_work_thread(void *_data)
break;
}

+ spin_lock_irq(&slow_work_queue_lock);
+ __clear_bit(id, slow_work_ids);
+ spin_unlock_irq(&slow_work_queue_lock);
+
if (atomic_dec_and_test(&slow_work_thread_count))
complete_and_exit(&slow_work_last_thread_exited, 0);
return 0;
@@ -475,6 +528,7 @@ static void slow_work_new_thread_execute(struct slow_work *work)
}

static const struct slow_work_ops slow_work_new_thread_ops = {
+ .owner = THIS_MODULE,
.get_ref = slow_work_new_thread_get_ref,
.put_ref = slow_work_new_thread_put_ref,
.execute = slow_work_new_thread_execute,
@@ -546,12 +600,13 @@ static int slow_work_max_threads_sysctl(struct ctl_table *table, int write,

/**
* slow_work_register_user - Register a user of the facility
+ * @module: The module about to make use of the facility
*
* Register a user of the facility, starting up the initial threads if there
* aren't any other users at this point. This will return 0 if successful, or
* an error if not.
*/
-int slow_work_register_user(void)
+int slow_work_register_user(struct module *module)
{
struct task_struct *p;
int loop;
@@ -598,14 +653,79 @@ error:
}
EXPORT_SYMBOL(slow_work_register_user);

+/*
+ * wait for all outstanding items from the calling module to complete
+ * - note that more items may be queued whilst we're waiting
+ */
+static void slow_work_wait_for_items(struct module *module)
+{
+ DECLARE_WAITQUEUE(myself, current);
+ struct slow_work *work;
+ int loop;
+
+ mutex_lock(&slow_work_unreg_sync_lock);
+ add_wait_queue(&slow_work_unreg_wq, &myself);
+
+ for (;;) {
+ spin_lock_irq(&slow_work_queue_lock);
+
+ /* first of all, we wait for the last queued item in each list
+ * to be processed */
+ list_for_each_entry_reverse(work, &vslow_work_queue, link) {
+ if (work->owner == module) {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ slow_work_unreg_work_item = work;
+ goto do_wait;
+ }
+ }
+ list_for_each_entry_reverse(work, &slow_work_queue, link) {
+ if (work->owner == module) {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ slow_work_unreg_work_item = work;
+ goto do_wait;
+ }
+ }
+
+ /* then we wait for the items being processed to finish */
+ slow_work_unreg_module = module;
+ smp_mb();
+ for (loop = 0; loop < SLOW_WORK_THREAD_LIMIT; loop++) {
+ if (slow_work_thread_processing[loop] == module)
+ goto do_wait;
+ }
+ spin_unlock_irq(&slow_work_queue_lock);
+ break; /* okay, we're done */
+
+ do_wait:
+ spin_unlock_irq(&slow_work_queue_lock);
+ schedule();
+ slow_work_unreg_work_item = NULL;
+ slow_work_unreg_module = NULL;
+ }
+
+ remove_wait_queue(&slow_work_unreg_wq, &myself);
+ mutex_unlock(&slow_work_unreg_sync_lock);
+}
+
/**
* slow_work_unregister_user - Unregister a user of the facility
+ * @module: The module whose items should be cleared
*
* Unregister a user of the facility, killing all the threads if this was the
* last one.
+ *
+ * This waits for all the work items belonging to the nominated module to go
+ * away before proceeding.
*/
-void slow_work_unregister_user(void)
+void slow_work_unregister_user(struct module *module)
{
+ /* first of all, wait for all outstanding items from the calling module
+ * to complete */
+ if (module)
+ slow_work_wait_for_items(module);
+
+ /* then we can actually go about shutting down the facility if need
+ * be */
mutex_lock(&slow_work_user_lock);

BUG_ON(slow_work_user_count <= 0);

2009-11-19 17:20:48

by David Howells

Subject: [PATCH 02/28] SLOW_WORK: Make slow_work_ops ->get_ref/->put_ref optional

From: Jens Axboe <[email protected]>

Make the ability of the slow-work facility to take references on a work item
optional, as not all users require it.

Even slow-work's own internal work items merely stubbed these operations out,
so those stubs can be removed too.
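
A user that needs no reference counting can now supply an ops table like this
rough sketch (names made up):

	static void example_execute(struct slow_work *work)
	{
		/* the containing object is pinned by other means, e.g. it is
		 * only freed once the work is known to have completed, so no
		 * get_ref/put_ref pair is needed */
	}

	static const struct slow_work_ops example_ops = {
		.owner   = THIS_MODULE,
		.execute = example_execute,
		/* .get_ref/.put_ref omitted: slow-work supplies no-op
		 * behaviour internally */
	};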

Signed-off-by: Jens Axboe <[email protected]>
Signed-off-by: David Howells <[email protected]>
---

Documentation/slow-work.txt | 2 +-
kernel/slow-work.c | 36 ++++++++++++++++--------------------
2 files changed, 17 insertions(+), 21 deletions(-)

diff --git a/Documentation/slow-work.txt b/Documentation/slow-work.txt
index f12fda3..c655c51 100644
--- a/Documentation/slow-work.txt
+++ b/Documentation/slow-work.txt
@@ -125,7 +125,7 @@ ITEM OPERATIONS
===============

Each work item requires a table of operations of type struct slow_work_ops.
-All members are required:
+Only ->execute() is required, getting and putting of a reference are optional.

(*) Get a reference on an item:

diff --git a/kernel/slow-work.c b/kernel/slow-work.c
index dd08f37..fccf421 100644
--- a/kernel/slow-work.c
+++ b/kernel/slow-work.c
@@ -145,6 +145,20 @@ static DECLARE_COMPLETION(slow_work_last_thread_exited);
static int slow_work_user_count;
static DEFINE_MUTEX(slow_work_user_lock);

+static inline int slow_work_get_ref(struct slow_work *work)
+{
+ if (work->ops->get_ref)
+ return work->ops->get_ref(work);
+
+ return 0;
+}
+
+static inline void slow_work_put_ref(struct slow_work *work)
+{
+ if (work->ops->put_ref)
+ work->ops->put_ref(work);
+}
+
/*
* Calculate the maximum number of active threads in the pool that are
* permitted to process very slow work items.
@@ -248,7 +262,7 @@ static bool slow_work_execute(int id)
}

/* sort out the race between module unloading and put_ref() */
- work->ops->put_ref(work);
+ slow_work_put_ref(work);

#ifdef CONFIG_MODULES
module = slow_work_thread_processing[id];
@@ -309,7 +323,6 @@ int slow_work_enqueue(struct slow_work *work)
BUG_ON(slow_work_user_count <= 0);
BUG_ON(!work);
BUG_ON(!work->ops);
- BUG_ON(!work->ops->get_ref);

/* when honouring an enqueue request, we only promise that we will run
* the work function in the future; we do not promise to run it once
@@ -339,7 +352,7 @@ int slow_work_enqueue(struct slow_work *work)
if (test_bit(SLOW_WORK_EXECUTING, &work->flags)) {
set_bit(SLOW_WORK_ENQ_DEFERRED, &work->flags);
} else {
- if (work->ops->get_ref(work) < 0)
+ if (slow_work_get_ref(work) < 0)
goto cant_get_ref;
if (test_bit(SLOW_WORK_VERY_SLOW, &work->flags))
list_add_tail(&work->link, &vslow_work_queue);
@@ -480,21 +493,6 @@ static void slow_work_cull_timeout(unsigned long data)
}

/*
- * Get a reference on slow work thread starter
- */
-static int slow_work_new_thread_get_ref(struct slow_work *work)
-{
- return 0;
-}
-
-/*
- * Drop a reference on slow work thread starter
- */
-static void slow_work_new_thread_put_ref(struct slow_work *work)
-{
-}
-
-/*
* Start a new slow work thread
*/
static void slow_work_new_thread_execute(struct slow_work *work)
@@ -529,8 +527,6 @@ static void slow_work_new_thread_execute(struct slow_work *work)

static const struct slow_work_ops slow_work_new_thread_ops = {
.owner = THIS_MODULE,
- .get_ref = slow_work_new_thread_get_ref,
- .put_ref = slow_work_new_thread_put_ref,
.execute = slow_work_new_thread_execute,
};

2009-11-19 18:04:52

by David Howells

Subject: [PATCH 03/28] SLOW_WORK: Add support for cancellation of slow work

From: Jens Axboe <[email protected]>

Add support for cancellation of queued slow work and delayed slow work items.
The cancellation functions will wait for items that are pending or undergoing
execution to be discarded by the slow-work facility.

Attempting to enqueue a work item that is in the process of being cancelled
will result in -ECANCELED.
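
As a rough illustration of the intended usage (names below are made up):

	#include <linux/slab.h>
	#include <linux/slow-work.h>

	struct example_object {
		struct slow_work work;
		/* ... payload ... */
	};

	static void example_destroy(struct example_object *obj)
	{
		/* Either removes the item from the queue or, if it is already
		 * executing (possibly with a requeue deferred), waits for
		 * that execution to finish; afterwards the item will not run
		 * again and the containing object can safely be freed. */
		slow_work_cancel(&obj->work);
		kfree(obj);
	}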

Signed-off-by: Jens Axboe <[email protected]>
Signed-off-by: David Howells <[email protected]>
---

Documentation/slow-work.txt | 12 ++++++
include/linux/slow-work.h | 2 +
kernel/slow-work.c | 81 ++++++++++++++++++++++++++++++++++++++++---
3 files changed, 88 insertions(+), 7 deletions(-)

diff --git a/Documentation/slow-work.txt b/Documentation/slow-work.txt
index c655c51..2e384bd 100644
--- a/Documentation/slow-work.txt
+++ b/Documentation/slow-work.txt
@@ -108,7 +108,17 @@ on the item, 0 otherwise.


The items are reference counted, so there ought to be no need for a flush
-operation. When all a module's slow work items have been processed, and the
+operation. But as the reference counting is optional, means to cancel
+existing work items are also included:
+
+ cancel_slow_work(&myitem);
+
+can be used to cancel pending work. The above cancel function waits for
+existing work to have been executed (or prevent execution of them, depending
+on timing).
+
+
+When all a module's slow work items have been processed, and the
module has no further interest in the facility, it should unregister its
interest:

diff --git a/include/linux/slow-work.h b/include/linux/slow-work.h
index 9adb2b3..eef2018 100644
--- a/include/linux/slow-work.h
+++ b/include/linux/slow-work.h
@@ -51,6 +51,7 @@ struct slow_work {
#define SLOW_WORK_EXECUTING 1 /* item currently executing */
#define SLOW_WORK_ENQ_DEFERRED 2 /* item enqueue deferred */
#define SLOW_WORK_VERY_SLOW 3 /* item is very slow */
+#define SLOW_WORK_CANCELLING 4 /* item is being cancelled, don't enqueue */
const struct slow_work_ops *ops; /* operations table for this item */
struct list_head link; /* link in queue */
};
@@ -88,6 +89,7 @@ static inline void vslow_work_init(struct slow_work *work,
}

extern int slow_work_enqueue(struct slow_work *work);
+extern void slow_work_cancel(struct slow_work *work);
extern int slow_work_register_user(struct module *owner);
extern void slow_work_unregister_user(struct module *owner);

diff --git a/kernel/slow-work.c b/kernel/slow-work.c
index fccf421..671cc43 100644
--- a/kernel/slow-work.c
+++ b/kernel/slow-work.c
@@ -236,12 +236,17 @@ static bool slow_work_execute(int id)
if (!test_and_clear_bit(SLOW_WORK_PENDING, &work->flags))
BUG();

- work->ops->execute(work);
+ /* don't execute if the work is in the process of being cancelled */
+ if (!test_bit(SLOW_WORK_CANCELLING, &work->flags))
+ work->ops->execute(work);

if (very_slow)
atomic_dec(&vslow_work_executing_count);
clear_bit_unlock(SLOW_WORK_EXECUTING, &work->flags);

+ /* wake up anyone waiting for this work to be complete */
+ wake_up_bit(&work->flags, SLOW_WORK_EXECUTING);
+
/* if someone tried to enqueue the item whilst we were executing it,
* then it'll be left unenqueued to avoid multiple threads trying to
* execute it simultaneously
@@ -314,11 +319,16 @@ auto_requeue:
* allowed to pick items to execute. This ensures that very slow items won't
* overly block ones that are just ordinarily slow.
*
- * Returns 0 if successful, -EAGAIN if not.
+ * Returns 0 if successful, -EAGAIN if not (or -ECANCELED if cancelled work is
+ * attempted queued)
*/
int slow_work_enqueue(struct slow_work *work)
{
unsigned long flags;
+ int ret;
+
+ if (test_bit(SLOW_WORK_CANCELLING, &work->flags))
+ return -ECANCELED;

BUG_ON(slow_work_user_count <= 0);
BUG_ON(!work);
@@ -335,6 +345,9 @@ int slow_work_enqueue(struct slow_work *work)
if (!test_and_set_bit_lock(SLOW_WORK_PENDING, &work->flags)) {
spin_lock_irqsave(&slow_work_queue_lock, flags);

+ if (unlikely(test_bit(SLOW_WORK_CANCELLING, &work->flags)))
+ goto cancelled;
+
/* we promise that we will not attempt to execute the work
* function in more than one thread simultaneously
*
@@ -352,8 +365,9 @@ int slow_work_enqueue(struct slow_work *work)
if (test_bit(SLOW_WORK_EXECUTING, &work->flags)) {
set_bit(SLOW_WORK_ENQ_DEFERRED, &work->flags);
} else {
- if (slow_work_get_ref(work) < 0)
- goto cant_get_ref;
+ ret = slow_work_get_ref(work);
+ if (ret < 0)
+ goto failed;
if (test_bit(SLOW_WORK_VERY_SLOW, &work->flags))
list_add_tail(&work->link, &vslow_work_queue);
else
@@ -365,12 +379,67 @@ int slow_work_enqueue(struct slow_work *work)
}
return 0;

-cant_get_ref:
+cancelled:
+ ret = -ECANCELED;
+failed:
spin_unlock_irqrestore(&slow_work_queue_lock, flags);
- return -EAGAIN;
+ return ret;
}
EXPORT_SYMBOL(slow_work_enqueue);

+static int slow_work_wait(void *word)
+{
+ schedule();
+ return 0;
+}
+
+/**
+ * slow_work_cancel - Cancel a slow work item
+ * @work: The work item to cancel
+ *
+ * This function will cancel a previously enqueued work item. If we cannot
+ * cancel the work item, it is guaranteed to have run when this function
+ * returns.
+ */
+void slow_work_cancel(struct slow_work *work)
+{
+ bool wait = true, put = false;
+
+ set_bit(SLOW_WORK_CANCELLING, &work->flags);
+
+ spin_lock_irq(&slow_work_queue_lock);
+
+ if (test_bit(SLOW_WORK_PENDING, &work->flags) &&
+ !list_empty(&work->link)) {
+ /* the link in the pending queue holds a reference on the item
+ * that we will need to release */
+ list_del_init(&work->link);
+ wait = false;
+ put = true;
+ clear_bit(SLOW_WORK_PENDING, &work->flags);
+
+ } else if (test_and_clear_bit(SLOW_WORK_ENQ_DEFERRED, &work->flags)) {
+ /* the executor is holding our only reference on the item, so
+ * we merely need to wait for it to finish executing */
+ clear_bit(SLOW_WORK_PENDING, &work->flags);
+ }
+
+ spin_unlock_irq(&slow_work_queue_lock);
+
+ /* the EXECUTING flag is set by the executor whilst the spinlock is set
+ * and before the item is dequeued - so assuming the above doesn't
+ * actually dequeue it, simply waiting for the EXECUTING flag to be
+ * released here should be sufficient */
+ if (wait)
+ wait_on_bit(&work->flags, SLOW_WORK_EXECUTING, slow_work_wait,
+ TASK_UNINTERRUPTIBLE);
+
+ clear_bit(SLOW_WORK_CANCELLING, &work->flags);
+ if (put)
+ slow_work_put_ref(work);
+}
+EXPORT_SYMBOL(slow_work_cancel);
+
/*
* Schedule a cull of the thread pool at some time in the near future
*/

2009-11-19 17:20:57

by David Howells

Subject: [PATCH 04/28] SLOW_WORK: Add delayed_slow_work support

From: Jens Axboe <[email protected]>

This adds support for starting slow work with a delay, similar
to the functionality we have for workqueues.
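
As a rough illustration (names below are made up; get_ref/put_ref stubs are
kept since the timer takes a reference on the item whilst it is pending):

	#include <linux/jiffies.h>
	#include <linux/module.h>
	#include <linux/slow-work.h>

	static void example_execute(struct slow_work *work)
	{
		/* runs roughly five seconds after example_start() */
	}

	static int example_get_ref(struct slow_work *work)
	{
		return 0;
	}

	static void example_put_ref(struct slow_work *work)
	{
	}

	static const struct slow_work_ops example_ops = {
		.owner   = THIS_MODULE,
		.get_ref = example_get_ref,
		.put_ref = example_put_ref,
		.execute = example_execute,
	};

	static struct delayed_slow_work example_dwork;

	static int example_start(void)
	{
		delayed_slow_work_init(&example_dwork, &example_ops);

		/* queue the item, but defer execution for about 5 seconds */
		return delayed_slow_work_enqueue(&example_dwork, 5 * HZ);
	}

	static void example_stop(void)
	{
		/* kills the pending timer and/or cancels the queued item */
		delayed_slow_work_cancel(&example_dwork);
	}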

Signed-off-by: Jens Axboe <[email protected]>
Signed-off-by: David Howells <[email protected]>
---

Documentation/slow-work.txt | 16 +++++
include/linux/slow-work.h | 29 ++++++++++
kernel/slow-work.c | 129 ++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 171 insertions(+), 3 deletions(-)

diff --git a/Documentation/slow-work.txt b/Documentation/slow-work.txt
index 2e384bd..a9d1b0f 100644
--- a/Documentation/slow-work.txt
+++ b/Documentation/slow-work.txt
@@ -41,6 +41,13 @@ expand files, provided the time taken to do so isn't too long.
Operations of both types may sleep during execution, thus tying up the thread
loaned to it.

+A further class of work item is available, based on the slow work item class:
+
+ (*) Delayed slow work items.
+
+These are slow work items that have a timer to defer queueing of the item for
+a while.
+

THREAD-TO-CLASS ALLOCATION
--------------------------
@@ -95,6 +102,10 @@ Slow work items may then be set up by:

or:

+ delayed_slow_work_init(&myitem, &myitem_ops);
+
+ or:
+
vslow_work_init(&myitem, &myitem_ops);

depending on its class.
@@ -104,7 +115,9 @@ A suitably set up work item can then be enqueued for processing:
int ret = slow_work_enqueue(&myitem);

This will return a -ve error if the thread pool is unable to gain a reference
-on the item, 0 otherwise.
+on the item, 0 otherwise, or (for delayed work):
+
+ int ret = delayed_slow_work_enqueue(&myitem, my_jiffy_delay);


The items are reference counted, so there ought to be no need for a flush
@@ -112,6 +125,7 @@ operation. But as the reference counting is optional, means to cancel
existing work items are also included:

cancel_slow_work(&myitem);
+ cancel_delayed_slow_work(&myitem);

can be used to cancel pending work. The above cancel function waits for
existing work to have been executed (or prevent execution of them, depending
diff --git a/include/linux/slow-work.h b/include/linux/slow-work.h
index eef2018..b245b9a 100644
--- a/include/linux/slow-work.h
+++ b/include/linux/slow-work.h
@@ -17,6 +17,7 @@
#ifdef CONFIG_SLOW_WORK

#include <linux/sysctl.h>
+#include <linux/timer.h>

struct slow_work;

@@ -52,10 +53,16 @@ struct slow_work {
#define SLOW_WORK_ENQ_DEFERRED 2 /* item enqueue deferred */
#define SLOW_WORK_VERY_SLOW 3 /* item is very slow */
#define SLOW_WORK_CANCELLING 4 /* item is being cancelled, don't enqueue */
+#define SLOW_WORK_DELAYED 5 /* item is struct delayed_slow_work with active timer */
const struct slow_work_ops *ops; /* operations table for this item */
struct list_head link; /* link in queue */
};

+struct delayed_slow_work {
+ struct slow_work work;
+ struct timer_list timer;
+};
+
/**
* slow_work_init - Initialise a slow work item
* @work: The work item to initialise
@@ -72,6 +79,20 @@ static inline void slow_work_init(struct slow_work *work,
}

/**
+ * slow_work_init - Initialise a delayed slow work item
+ * @work: The work item to initialise
+ * @ops: The operations to use to handle the slow work item
+ *
+ * Initialise a delayed slow work item.
+ */
+static inline void delayed_slow_work_init(struct delayed_slow_work *dwork,
+ const struct slow_work_ops *ops)
+{
+ init_timer(&dwork->timer);
+ slow_work_init(&dwork->work, ops);
+}
+
+/**
* vslow_work_init - Initialise a very slow work item
* @work: The work item to initialise
* @ops: The operations to use to handle the slow work item
@@ -93,6 +114,14 @@ extern void slow_work_cancel(struct slow_work *work);
extern int slow_work_register_user(struct module *owner);
extern void slow_work_unregister_user(struct module *owner);

+extern int delayed_slow_work_enqueue(struct delayed_slow_work *dwork,
+ unsigned long delay);
+
+static inline void delayed_slow_work_cancel(struct delayed_slow_work *dwork)
+{
+ slow_work_cancel(&dwork->work);
+}
+
#ifdef CONFIG_SYSCTL
extern ctl_table slow_work_sysctls[];
#endif
diff --git a/kernel/slow-work.c b/kernel/slow-work.c
index 671cc43..f67e1da 100644
--- a/kernel/slow-work.c
+++ b/kernel/slow-work.c
@@ -406,11 +406,40 @@ void slow_work_cancel(struct slow_work *work)
bool wait = true, put = false;

set_bit(SLOW_WORK_CANCELLING, &work->flags);
+ smp_mb();
+
+ /* if the work item is a delayed work item with an active timer, we
+ * need to wait for the timer to finish _before_ getting the spinlock,
+ * lest we deadlock against the timer routine
+ *
+ * the timer routine will leave DELAYED set if it notices the
+ * CANCELLING flag in time
+ */
+ if (test_bit(SLOW_WORK_DELAYED, &work->flags)) {
+ struct delayed_slow_work *dwork =
+ container_of(work, struct delayed_slow_work, work);
+ del_timer_sync(&dwork->timer);
+ }

spin_lock_irq(&slow_work_queue_lock);

- if (test_bit(SLOW_WORK_PENDING, &work->flags) &&
- !list_empty(&work->link)) {
+ if (test_bit(SLOW_WORK_DELAYED, &work->flags)) {
+ /* the timer routine aborted or never happened, so we are left
+ * holding the timer's reference on the item and should just
+ * drop the pending flag and wait for any ongoing execution to
+ * finish */
+ struct delayed_slow_work *dwork =
+ container_of(work, struct delayed_slow_work, work);
+
+ BUG_ON(timer_pending(&dwork->timer));
+ BUG_ON(!list_empty(&work->link));
+
+ clear_bit(SLOW_WORK_DELAYED, &work->flags);
+ put = true;
+ clear_bit(SLOW_WORK_PENDING, &work->flags);
+
+ } else if (test_bit(SLOW_WORK_PENDING, &work->flags) &&
+ !list_empty(&work->link)) {
/* the link in the pending queue holds a reference on the item
* that we will need to release */
list_del_init(&work->link);
@@ -441,6 +470,102 @@ void slow_work_cancel(struct slow_work *work)
EXPORT_SYMBOL(slow_work_cancel);

/*
+ * Handle expiry of the delay timer, indicating that a delayed slow work item
+ * should now be queued if not cancelled
+ */
+static void delayed_slow_work_timer(unsigned long data)
+{
+ struct slow_work *work = (struct slow_work *) data;
+ unsigned long flags;
+ bool queued = false, put = false;
+
+ spin_lock_irqsave(&slow_work_queue_lock, flags);
+ if (likely(!test_bit(SLOW_WORK_CANCELLING, &work->flags))) {
+ clear_bit(SLOW_WORK_DELAYED, &work->flags);
+
+ if (test_bit(SLOW_WORK_EXECUTING, &work->flags)) {
+ /* we discard the reference the timer was holding in
+ * favour of the one the executor holds */
+ set_bit(SLOW_WORK_ENQ_DEFERRED, &work->flags);
+ put = true;
+ } else {
+ if (test_bit(SLOW_WORK_VERY_SLOW, &work->flags))
+ list_add_tail(&work->link, &vslow_work_queue);
+ else
+ list_add_tail(&work->link, &slow_work_queue);
+ queued = true;
+ }
+ }
+
+ spin_unlock_irqrestore(&slow_work_queue_lock, flags);
+ if (put)
+ slow_work_put_ref(work);
+ if (queued)
+ wake_up(&slow_work_thread_wq);
+}
+
+/**
+ * delayed_slow_work_enqueue - Schedule a delayed slow work item for processing
+ * @dwork: The delayed work item to queue
+ * @delay: When to start executing the work, in jiffies from now
+ *
+ * This is similar to slow_work_enqueue(), but it adds a delay before the work
+ * is actually queued for processing.
+ *
+ * The item can have delayed processing requested on it whilst it is being
+ * executed. The delay will begin immediately, and if it expires before the
+ * item finishes executing, the item will be placed back on the queue when it
+ * has done executing.
+ */
+int delayed_slow_work_enqueue(struct delayed_slow_work *dwork,
+ unsigned long delay)
+{
+ struct slow_work *work = &dwork->work;
+ unsigned long flags;
+ int ret;
+
+ if (delay == 0)
+ return slow_work_enqueue(&dwork->work);
+
+ BUG_ON(slow_work_user_count <= 0);
+ BUG_ON(!work);
+ BUG_ON(!work->ops);
+
+ if (test_bit(SLOW_WORK_CANCELLING, &work->flags))
+ return -ECANCELED;
+
+ if (!test_and_set_bit_lock(SLOW_WORK_PENDING, &work->flags)) {
+ spin_lock_irqsave(&slow_work_queue_lock, flags);
+
+ if (test_bit(SLOW_WORK_CANCELLING, &work->flags))
+ goto cancelled;
+
+ /* the timer holds a reference whilst it is pending */
+ ret = work->ops->get_ref(work);
+ if (ret < 0)
+ goto cant_get_ref;
+
+ if (test_and_set_bit(SLOW_WORK_DELAYED, &work->flags))
+ BUG();
+ dwork->timer.expires = jiffies + delay;
+ dwork->timer.data = (unsigned long) work;
+ dwork->timer.function = delayed_slow_work_timer;
+ add_timer(&dwork->timer);
+
+ spin_unlock_irqrestore(&slow_work_queue_lock, flags);
+ }
+
+ return 0;
+
+cancelled:
+ ret = -ECANCELED;
+cant_get_ref:
+ spin_unlock_irqrestore(&slow_work_queue_lock, flags);
+ return ret;
+}
+EXPORT_SYMBOL(delayed_slow_work_enqueue);
+
+/*
* Schedule a cull of the thread pool at some time in the near future
*/
static void slow_work_schedule_cull(void)

2009-11-19 17:21:04

by David Howells

Subject: [PATCH 05/28] SLOW_WORK: Allow the work items to be viewed through a /proc file

Allow the executing and queued work items to be viewed through a /proc file
for debugging purposes. The contents look something like the following:

THR PID ITEM ADDR FL MARK DESC
=== ===== ================ == ===== ==========
0 3005 ffff880023f52348 a 952ms FSC: OBJ17d3: LOOK
1 3006 ffff880024e33668 2 160ms FSC: OBJ17e5 OP60d3b: Write1/Store fl=2
2 3165 ffff8800296dd180 a 424ms FSC: OBJ17e4: LOOK
3 4089 ffff8800262c8d78 a 212ms FSC: OBJ17ea: CRTN
4 4090 ffff88002792bed8 2 388ms FSC: OBJ17e8 OP60d36: Write1/Store fl=2
5 4092 ffff88002a0ef308 2 388ms FSC: OBJ17e7 OP60d2e: Write1/Store fl=2
6 4094 ffff88002abaf4b8 2 132ms FSC: OBJ17e2 OP60d4e: Write1/Store fl=2
7 4095 ffff88002bb188e0 a 388ms FSC: OBJ17e9: CRTN
vsq - ffff880023d99668 1 308ms FSC: OBJ17e0 OP60f91: Write1/EnQ fl=2
vsq - ffff8800295d1740 1 212ms FSC: OBJ16be OP4d4b6: Write1/EnQ fl=2
vsq - ffff880025ba3308 1 160ms FSC: OBJ179a OP58dec: Write1/EnQ fl=2
vsq - ffff880024ec83e0 1 160ms FSC: OBJ17ae OP599f2: Write1/EnQ fl=2
vsq - ffff880026618e00 1 160ms FSC: OBJ17e6 OP60d33: Write1/EnQ fl=2
vsq - ffff880025a2a4b8 1 132ms FSC: OBJ16a2 OP4d583: Write1/EnQ fl=2
vsq - ffff880023cbe6d8 9 212ms FSC: OBJ17eb: LOOK
vsq - ffff880024d37590 9 212ms FSC: OBJ17ec: LOOK
vsq - ffff880027746cb0 9 212ms FSC: OBJ17ed: LOOK
vsq - ffff880024d37ae8 9 212ms FSC: OBJ17ee: LOOK
vsq - ffff880024d37cb0 9 212ms FSC: OBJ17ef: LOOK
vsq - ffff880025036550 9 212ms FSC: OBJ17f0: LOOK
vsq - ffff8800250368e0 9 212ms FSC: OBJ17f1: LOOK
vsq - ffff880025036aa8 9 212ms FSC: OBJ17f2: LOOK

In the 'THR' column, executing items show the thread they're occupying and
queued items indicate which queue they're on. 'PID' shows the process ID of
the slow-work thread that's executing something. 'FL' shows the work item
flags. 'MARK' indicates how long since an item was queued or began executing.
Lastly, the 'DESC' column permits the owner of an item to give some
identifying information.
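
Item owners hook into the 'DESC' column by supplying the new, optional
->desc() operation. As a rough illustration (all names below are made up;
FS-Cache's real annotations come later in this series):

	#include <linux/kernel.h>
	#include <linux/module.h>
	#include <linux/seq_file.h>
	#include <linux/slow-work.h>

	struct example_object {
		struct slow_work work;
		unsigned	 debug_id;
	};

	static void example_execute(struct slow_work *work)
	{
		/* ... */
	}

	#ifdef CONFIG_SLOW_WORK_PROC
	static void example_desc(struct slow_work *work, struct seq_file *m)
	{
		struct example_object *obj =
			container_of(work, struct example_object, work);

		/* keep it short (about 40 chars), no newline; this is what
		 * fills the DESC column */
		seq_printf(m, "EXAMPLE: obj%x", obj->debug_id);
	}
	#endif

	static const struct slow_work_ops example_ops = {
		.owner   = THIS_MODULE,
		.execute = example_execute,
	#ifdef CONFIG_SLOW_WORK_PROC
		.desc    = example_desc,
	#endif
	};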

Signed-off-by: David Howells <[email protected]>
---

Documentation/slow-work.txt | 60 +++++++++++
include/linux/slow-work.h | 11 ++
init/Kconfig | 10 ++
kernel/Makefile | 1
kernel/slow-work-proc.c | 227 +++++++++++++++++++++++++++++++++++++++++++
kernel/slow-work.c | 44 ++++++--
kernel/slow-work.h | 72 ++++++++++++++
7 files changed, 413 insertions(+), 12 deletions(-)
create mode 100644 kernel/slow-work-proc.c
create mode 100644 kernel/slow-work.h

diff --git a/Documentation/slow-work.txt b/Documentation/slow-work.txt
index a9d1b0f..f120238 100644
--- a/Documentation/slow-work.txt
+++ b/Documentation/slow-work.txt
@@ -149,7 +149,8 @@ ITEM OPERATIONS
===============

Each work item requires a table of operations of type struct slow_work_ops.
-Only ->execute() is required, getting and putting of a reference are optional.
+Only ->execute() is required; the getting and putting of a reference and the
+describing of an item are all optional.

(*) Get a reference on an item:

@@ -179,6 +180,16 @@ Only ->execute() is required, getting and putting of a reference are optional.
This should perform the work required of the item. It may sleep, it may
perform disk I/O and it may wait for locks.

+ (*) View an item through /proc:
+
+ void (*desc)(struct slow_work *work, struct seq_file *m);
+
+ If supplied, this should print to 'm' a small string describing the work
+ the item is to do. This should be no more than about 40 characters, and
+ shouldn't include a newline character.
+
+ See the 'Viewing executing and queued items' section below.
+

==================
POOL CONFIGURATION
@@ -203,3 +214,50 @@ The slow-work thread pool has a number of configurables:
is bounded to between 1 and one fewer than the number of active threads.
This ensures there is always at least one thread that can process very
slow work items, and always at least one thread that won't.
+
+
+==================================
+VIEWING EXECUTING AND QUEUED ITEMS
+==================================
+
+If CONFIG_SLOW_WORK_PROC is enabled, a proc file is made available:
+
+ /proc/slow_work_rq
+
+through which the list of work items being executed and the queues of items to
+be executed may be viewed. The owner of a work item is given the chance to
+add some information of its own.
+
+The contents look something like the following:
+
+ THR PID ITEM ADDR FL MARK DESC
+ === ===== ================ == ===== ==========
+ 0 3005 ffff880023f52348 a 952ms FSC: OBJ17d3: LOOK
+ 1 3006 ffff880024e33668 2 160ms FSC: OBJ17e5 OP60d3b: Write1/Store fl=2
+ 2 3165 ffff8800296dd180 a 424ms FSC: OBJ17e4: LOOK
+ 3 4089 ffff8800262c8d78 a 212ms FSC: OBJ17ea: CRTN
+ 4 4090 ffff88002792bed8 2 388ms FSC: OBJ17e8 OP60d36: Write1/Store fl=2
+ 5 4092 ffff88002a0ef308 2 388ms FSC: OBJ17e7 OP60d2e: Write1/Store fl=2
+ 6 4094 ffff88002abaf4b8 2 132ms FSC: OBJ17e2 OP60d4e: Write1/Store fl=2
+ 7 4095 ffff88002bb188e0 a 388ms FSC: OBJ17e9: CRTN
+ vsq - ffff880023d99668 1 308ms FSC: OBJ17e0 OP60f91: Write1/EnQ fl=2
+ vsq - ffff8800295d1740 1 212ms FSC: OBJ16be OP4d4b6: Write1/EnQ fl=2
+ vsq - ffff880025ba3308 1 160ms FSC: OBJ179a OP58dec: Write1/EnQ fl=2
+ vsq - ffff880024ec83e0 1 160ms FSC: OBJ17ae OP599f2: Write1/EnQ fl=2
+ vsq - ffff880026618e00 1 160ms FSC: OBJ17e6 OP60d33: Write1/EnQ fl=2
+ vsq - ffff880025a2a4b8 1 132ms FSC: OBJ16a2 OP4d583: Write1/EnQ fl=2
+ vsq - ffff880023cbe6d8 9 212ms FSC: OBJ17eb: LOOK
+ vsq - ffff880024d37590 9 212ms FSC: OBJ17ec: LOOK
+ vsq - ffff880027746cb0 9 212ms FSC: OBJ17ed: LOOK
+ vsq - ffff880024d37ae8 9 212ms FSC: OBJ17ee: LOOK
+ vsq - ffff880024d37cb0 9 212ms FSC: OBJ17ef: LOOK
+ vsq - ffff880025036550 9 212ms FSC: OBJ17f0: LOOK
+ vsq - ffff8800250368e0 9 212ms FSC: OBJ17f1: LOOK
+ vsq - ffff880025036aa8 9 212ms FSC: OBJ17f2: LOOK
+
+In the 'THR' column, executing items show the thread they're occupying and
+queued threads indicate which queue they're on. 'PID' shows the process ID of
+a slow-work thread that's executing something. 'FL' shows the work item flags.
+'MARK' indicates how long since an item was queued or began executing. Lastly,
+the 'DESC' column permits the owner of an item to give some information.
+
diff --git a/include/linux/slow-work.h b/include/linux/slow-work.h
index b245b9a..f414851 100644
--- a/include/linux/slow-work.h
+++ b/include/linux/slow-work.h
@@ -20,6 +20,9 @@
#include <linux/timer.h>

struct slow_work;
+#ifdef CONFIG_SLOW_WORK_PROC
+struct seq_file;
+#endif

/*
* The operations used to support slow work items
@@ -38,6 +41,11 @@ struct slow_work_ops {

/* execute a work item */
void (*execute)(struct slow_work *work);
+
+#ifdef CONFIG_SLOW_WORK_PROC
+ /* describe a work item for /proc */
+ void (*desc)(struct slow_work *work, struct seq_file *m);
+#endif
};

/*
@@ -56,6 +64,9 @@ struct slow_work {
#define SLOW_WORK_DELAYED 5 /* item is struct delayed_slow_work with active timer */
const struct slow_work_ops *ops; /* operations table for this item */
struct list_head link; /* link in queue */
+#ifdef CONFIG_SLOW_WORK_PROC
+ struct timespec mark; /* jiffies at which queued or exec begun */
+#endif
};

struct delayed_slow_work {
diff --git a/init/Kconfig b/init/Kconfig
index 9e03ef8..ab5c648 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1098,6 +1098,16 @@ config SLOW_WORK

See Documentation/slow-work.txt.

+config SLOW_WORK_PROC
+ bool "Slow work debugging through /proc"
+ default n
+ depends on SLOW_WORK && PROC_FS
+ help
+ Display the contents of the slow work run queue through /proc,
+ including items currently executing.
+
+ See Documentation/slow-work.txt.
+
endmenu # General setup

config HAVE_GENERIC_DMA_COHERENT
diff --git a/kernel/Makefile b/kernel/Makefile
index b8d4cd8..776ffed 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -94,6 +94,7 @@ obj-$(CONFIG_X86_DS) += trace/
obj-$(CONFIG_RING_BUFFER) += trace/
obj-$(CONFIG_SMP) += sched_cpupri.o
obj-$(CONFIG_SLOW_WORK) += slow-work.o
+obj-$(CONFIG_SLOW_WORK_PROC) += slow-work-proc.o
obj-$(CONFIG_PERF_EVENTS) += perf_event.o

ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
diff --git a/kernel/slow-work-proc.c b/kernel/slow-work-proc.c
new file mode 100644
index 0000000..3988032
--- /dev/null
+++ b/kernel/slow-work-proc.c
@@ -0,0 +1,227 @@
+/* Slow work debugging
+ *
+ * Copyright (C) 2009 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/slow-work.h>
+#include <linux/fs.h>
+#include <linux/time.h>
+#include <linux/seq_file.h>
+#include "slow-work.h"
+
+#define ITERATOR_SHIFT (BITS_PER_LONG - 4)
+#define ITERATOR_SELECTOR (0xfUL << ITERATOR_SHIFT)
+#define ITERATOR_COUNTER (~ITERATOR_SELECTOR)
+
+void slow_work_new_thread_desc(struct slow_work *work, struct seq_file *m)
+{
+ seq_puts(m, "Slow-work: New thread");
+}
+
+/*
+ * Render the time mark field on a work item into a 5-char time with units plus
+ * a space
+ */
+static void slow_work_print_mark(struct seq_file *m, struct slow_work *work)
+{
+ struct timespec now, diff;
+
+ now = CURRENT_TIME;
+ diff = timespec_sub(now, work->mark);
+
+ if (diff.tv_sec < 0)
+ seq_puts(m, " -ve ");
+ else if (diff.tv_sec == 0 && diff.tv_nsec < 1000)
+ seq_printf(m, "%3luns ", diff.tv_nsec);
+ else if (diff.tv_sec == 0 && diff.tv_nsec < 1000000)
+ seq_printf(m, "%3luus ", diff.tv_nsec / 1000);
+ else if (diff.tv_sec == 0 && diff.tv_nsec < 1000000000)
+ seq_printf(m, "%3lums ", diff.tv_nsec / 1000000);
+ else if (diff.tv_sec <= 1)
+ seq_puts(m, " 1s ");
+ else if (diff.tv_sec < 60)
+ seq_printf(m, "%4lus ", diff.tv_sec);
+ else if (diff.tv_sec < 60 * 60)
+ seq_printf(m, "%4lum ", diff.tv_sec / 60);
+ else if (diff.tv_sec < 60 * 60 * 24)
+ seq_printf(m, "%4luh ", diff.tv_sec / 3600);
+ else
+ seq_puts(m, "exces ");
+}
+
+/*
+ * Describe a slow work item for /proc
+ */
+static int slow_work_runqueue_show(struct seq_file *m, void *v)
+{
+ struct slow_work *work;
+ struct list_head *p = v;
+ unsigned long id;
+
+ switch ((unsigned long) v) {
+ case 1:
+ seq_puts(m, "THR PID ITEM ADDR FL MARK DESC\n");
+ return 0;
+ case 2:
+ seq_puts(m, "=== ===== ================ == ===== ==========\n");
+ return 0;
+
+ case 3 ... 3 + SLOW_WORK_THREAD_LIMIT - 1:
+ id = (unsigned long) v - 3;
+
+ read_lock(&slow_work_execs_lock);
+ work = slow_work_execs[id];
+ if (work) {
+ smp_read_barrier_depends();
+
+ seq_printf(m, "%3lu %5d %16p %2lx ",
+ id, slow_work_pids[id], work, work->flags);
+ slow_work_print_mark(m, work);
+
+ if (work->ops->desc)
+ work->ops->desc(work, m);
+ seq_putc(m, '\n');
+ }
+ read_unlock(&slow_work_execs_lock);
+ return 0;
+
+ default:
+ work = list_entry(p, struct slow_work, link);
+ seq_printf(m, "%3s - %16p %2lx ",
+ work->flags & SLOW_WORK_VERY_SLOW ? "vsq" : "sq",
+ work, work->flags);
+ slow_work_print_mark(m, work);
+
+ if (work->ops->desc)
+ work->ops->desc(work, m);
+ seq_putc(m, '\n');
+ return 0;
+ }
+}
+
+/*
+ * map the iterator to a work item
+ */
+static void *slow_work_runqueue_index(struct seq_file *m, loff_t *_pos)
+{
+ struct list_head *p;
+ unsigned long count, id;
+
+ switch (*_pos >> ITERATOR_SHIFT) {
+ case 0x0:
+ if (*_pos == 0)
+ *_pos = 1;
+ if (*_pos < 3)
+ return (void *)(unsigned long) *_pos;
+ if (*_pos < 3 + SLOW_WORK_THREAD_LIMIT)
+ for (id = *_pos - 3;
+ id < SLOW_WORK_THREAD_LIMIT;
+ id++, (*_pos)++)
+ if (slow_work_execs[id])
+ return (void *)(unsigned long) *_pos;
+ *_pos = 0x1UL << ITERATOR_SHIFT;
+
+ case 0x1:
+ count = *_pos & ITERATOR_COUNTER;
+ list_for_each(p, &slow_work_queue) {
+ if (count == 0)
+ return p;
+ count--;
+ }
+ *_pos = 0x2UL << ITERATOR_SHIFT;
+
+ case 0x2:
+ count = *_pos & ITERATOR_COUNTER;
+ list_for_each(p, &vslow_work_queue) {
+ if (count == 0)
+ return p;
+ count--;
+ }
+ *_pos = 0x3UL << ITERATOR_SHIFT;
+
+ default:
+ return NULL;
+ }
+}
+
+/*
+ * set up the iterator to start reading from the first line
+ */
+static void *slow_work_runqueue_start(struct seq_file *m, loff_t *_pos)
+{
+ spin_lock_irq(&slow_work_queue_lock);
+ return slow_work_runqueue_index(m, _pos);
+}
+
+/*
+ * move to the next line
+ */
+static void *slow_work_runqueue_next(struct seq_file *m, void *v, loff_t *_pos)
+{
+ struct list_head *p = v;
+ unsigned long selector = *_pos >> ITERATOR_SHIFT;
+
+ (*_pos)++;
+ switch (selector) {
+ case 0x0:
+ return slow_work_runqueue_index(m, _pos);
+
+ case 0x1:
+ if (*_pos >> ITERATOR_SHIFT == 0x1) {
+ p = p->next;
+ if (p != &slow_work_queue)
+ return p;
+ }
+ *_pos = 0x2UL << ITERATOR_SHIFT;
+ p = &vslow_work_queue;
+
+ case 0x2:
+ if (*_pos >> ITERATOR_SHIFT == 0x2) {
+ p = p->next;
+ if (p != &vslow_work_queue)
+ return p;
+ }
+ *_pos = 0x3UL << ITERATOR_SHIFT;
+
+ default:
+ return NULL;
+ }
+}
+
+/*
+ * clean up after reading
+ */
+static void slow_work_runqueue_stop(struct seq_file *m, void *v)
+{
+ spin_unlock_irq(&slow_work_queue_lock);
+}
+
+static const struct seq_operations slow_work_runqueue_ops = {
+ .start = slow_work_runqueue_start,
+ .stop = slow_work_runqueue_stop,
+ .next = slow_work_runqueue_next,
+ .show = slow_work_runqueue_show,
+};
+
+/*
+ * open "/proc/slow_work_rq" to list queue contents
+ */
+static int slow_work_runqueue_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &slow_work_runqueue_ops);
+}
+
+const struct file_operations slow_work_runqueue_fops = {
+ .owner = THIS_MODULE,
+ .open = slow_work_runqueue_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
diff --git a/kernel/slow-work.c b/kernel/slow-work.c
index f67e1da..b763bc2 100644
--- a/kernel/slow-work.c
+++ b/kernel/slow-work.c
@@ -16,13 +16,8 @@
#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/wait.h>
-
-#define SLOW_WORK_CULL_TIMEOUT (5 * HZ) /* cull threads 5s after running out of
- * things to do */
-#define SLOW_WORK_OOM_TIMEOUT (5 * HZ) /* can't start new threads for 5s after
- * OOM */
-
-#define SLOW_WORK_THREAD_LIMIT 255 /* abs maximum number of slow-work threads */
+#include <linux/proc_fs.h>
+#include "slow-work.h"

static void slow_work_cull_timeout(unsigned long);
static void slow_work_oom_timeout(unsigned long);
@@ -117,6 +112,15 @@ static DEFINE_MUTEX(slow_work_unreg_sync_lock);
#endif

/*
+ * Data for tracking currently executing items for indication through /proc
+ */
+#ifdef CONFIG_SLOW_WORK_PROC
+struct slow_work *slow_work_execs[SLOW_WORK_THREAD_LIMIT];
+pid_t slow_work_pids[SLOW_WORK_THREAD_LIMIT];
+DEFINE_RWLOCK(slow_work_execs_lock);
+#endif
+
+/*
* The queues of work items and the lock governing access to them. These are
* shared between all the CPUs. It doesn't make sense to have per-CPU queues
* as the number of threads bears no relation to the number of CPUs.
@@ -124,9 +128,9 @@ static DEFINE_MUTEX(slow_work_unreg_sync_lock);
* There are two queues of work items: one for slow work items, and one for
* very slow work items.
*/
-static LIST_HEAD(slow_work_queue);
-static LIST_HEAD(vslow_work_queue);
-static DEFINE_SPINLOCK(slow_work_queue_lock);
+LIST_HEAD(slow_work_queue);
+LIST_HEAD(vslow_work_queue);
+DEFINE_SPINLOCK(slow_work_queue_lock);

/*
* The thread controls. A variable used to signal to the threads that they
@@ -182,7 +186,7 @@ static unsigned slow_work_calc_vsmax(void)
* Attempt to execute stuff queued on a slow thread. Return true if we managed
* it, false if there was nothing to do.
*/
-static bool slow_work_execute(int id)
+static noinline bool slow_work_execute(int id)
{
#ifdef CONFIG_MODULES
struct module *module;
@@ -227,6 +231,10 @@ static bool slow_work_execute(int id)
if (work)
slow_work_thread_processing[id] = work->owner;
#endif
+ if (work) {
+ slow_work_mark_time(work);
+ slow_work_begin_exec(id, work);
+ }

spin_unlock_irq(&slow_work_queue_lock);

@@ -247,6 +255,8 @@ static bool slow_work_execute(int id)
/* wake up anyone waiting for this work to be complete */
wake_up_bit(&work->flags, SLOW_WORK_EXECUTING);

+ slow_work_end_exec(id, work);
+
/* if someone tried to enqueue the item whilst we were executing it,
* then it'll be left unenqueued to avoid multiple threads trying to
* execute it simultaneously
@@ -285,6 +295,7 @@ auto_requeue:
* - we transfer our ref on the item back to the appropriate queue
* - don't wake another thread up as we're awake already
*/
+ slow_work_mark_time(work);
if (test_bit(SLOW_WORK_VERY_SLOW, &work->flags))
list_add_tail(&work->link, &vslow_work_queue);
else
@@ -368,6 +379,7 @@ int slow_work_enqueue(struct slow_work *work)
ret = slow_work_get_ref(work);
if (ret < 0)
goto failed;
+ slow_work_mark_time(work);
if (test_bit(SLOW_WORK_VERY_SLOW, &work->flags))
list_add_tail(&work->link, &vslow_work_queue);
else
@@ -489,6 +501,7 @@ static void delayed_slow_work_timer(unsigned long data)
set_bit(SLOW_WORK_ENQ_DEFERRED, &work->flags);
put = true;
} else {
+ slow_work_mark_time(work);
if (test_bit(SLOW_WORK_VERY_SLOW, &work->flags))
list_add_tail(&work->link, &vslow_work_queue);
else
@@ -627,6 +640,7 @@ static int slow_work_thread(void *_data)
id = find_first_zero_bit(slow_work_ids, SLOW_WORK_THREAD_LIMIT);
BUG_ON(id < 0 || id >= SLOW_WORK_THREAD_LIMIT);
__set_bit(id, slow_work_ids);
+ slow_work_set_thread_pid(id, current->pid);
spin_unlock_irq(&slow_work_queue_lock);

sprintf(current->comm, "kslowd%03u", id);
@@ -669,6 +683,7 @@ static int slow_work_thread(void *_data)
}

spin_lock_irq(&slow_work_queue_lock);
+ slow_work_set_thread_pid(id, 0);
__clear_bit(id, slow_work_ids);
spin_unlock_irq(&slow_work_queue_lock);

@@ -722,6 +737,9 @@ static void slow_work_new_thread_execute(struct slow_work *work)
static const struct slow_work_ops slow_work_new_thread_ops = {
.owner = THIS_MODULE,
.execute = slow_work_new_thread_execute,
+#ifdef CONFIG_SLOW_WORK_PROC
+ .desc = slow_work_new_thread_desc,
+#endif
};

/*
@@ -949,6 +967,10 @@ static int __init init_slow_work(void)
if (slow_work_max_max_threads < nr_cpus * 2)
slow_work_max_max_threads = nr_cpus * 2;
#endif
+#ifdef CONFIG_SLOW_WORK_PROC
+ proc_create("slow_work_rq", S_IFREG | 0400, NULL,
+ &slow_work_runqueue_fops);
+#endif
return 0;
}

diff --git a/kernel/slow-work.h b/kernel/slow-work.h
new file mode 100644
index 0000000..3c2f007
--- /dev/null
+++ b/kernel/slow-work.h
@@ -0,0 +1,72 @@
+/* Slow work private definitions
+ *
+ * Copyright (C) 2009 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define SLOW_WORK_CULL_TIMEOUT (5 * HZ) /* cull threads 5s after running out of
+ * things to do */
+#define SLOW_WORK_OOM_TIMEOUT (5 * HZ) /* can't start new threads for 5s after
+ * OOM */
+
+#define SLOW_WORK_THREAD_LIMIT 255 /* abs maximum number of slow-work threads */
+
+/*
+ * slow-work.c
+ */
+#ifdef CONFIG_SLOW_WORK_PROC
+extern struct slow_work *slow_work_execs[];
+extern pid_t slow_work_pids[];
+extern rwlock_t slow_work_execs_lock;
+#endif
+
+extern struct list_head slow_work_queue;
+extern struct list_head vslow_work_queue;
+extern spinlock_t slow_work_queue_lock;
+
+/*
+ * slow-work-proc.c
+ */
+#ifdef CONFIG_SLOW_WORK_PROC
+extern const struct file_operations slow_work_runqueue_fops;
+
+extern void slow_work_new_thread_desc(struct slow_work *, struct seq_file *);
+#endif
+
+/*
+ * Helper functions
+ */
+static inline void slow_work_set_thread_pid(int id, pid_t pid)
+{
+#ifdef CONFIG_SLOW_WORK_PROC
+ slow_work_pids[id] = pid;
+#endif
+}
+
+static inline void slow_work_mark_time(struct slow_work *work)
+{
+#ifdef CONFIG_SLOW_WORK_PROC
+ work->mark = CURRENT_TIME;
+#endif
+}
+
+static inline void slow_work_begin_exec(int id, struct slow_work *work)
+{
+#ifdef CONFIG_SLOW_WORK_PROC
+ slow_work_execs[id] = work;
+#endif
+}
+
+static inline void slow_work_end_exec(int id, struct slow_work *work)
+{
+#ifdef CONFIG_SLOW_WORK_PROC
+ write_lock(&slow_work_execs_lock);
+ slow_work_execs[id] = NULL;
+ write_unlock(&slow_work_execs_lock);
+#endif
+}

2009-11-19 17:21:22

by David Howells

[permalink] [raw]
Subject: [PATCH 06/28] SLOW_WORK: Allow the owner of a work item to determine if it is queued or not

Add a function (slow_work_is_queued()) to permit the owner of a work item to
determine if the item is queued or not.

The work item is counted as being queued if it is actually on the queue, not
just if it is pending. If it is executing and pending, then it is not on the
queue, but will rather be put back on the queue when execution finishes.

This permits a caller to quickly work out if it may be able to put another,
dependent work item on the queue behind it, or whether it will have to wait
till that is finished.

This can be used by CacheFiles to work out whether the creation of a new
object can be immediately deferred when it has to wait for an old object to be
deleted, or whether a wait must take place. This matters because, if the
creation simply blocks in its slow-work thread to wait, it ties up that thread
and can prevent the deletion from taking place.
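
For illustration only (nothing below is taken from the CacheFiles code; the
my_blocker and my_dependent items are hypothetical), a caller owning two work
items might use the new function to make the queue-behind-or-wait decision
like this:

    #include <linux/slow-work.h>

    /* Assumes both items were set up with slow_work_init() and the user
     * registered with slow_work_register_user(). */
    static struct slow_work my_blocker;
    static struct slow_work my_dependent;

    static void my_queue_dependent(void)
    {
            if (slow_work_is_queued(&my_blocker)) {
                    /* The blocker really is on the queue, so the dependent
                     * item can be queued behind it. */
                    slow_work_enqueue(&my_dependent);
            } else {
                    /* The blocker is idle, or is executing (possibly with a
                     * requeue pending); wait instead, for example on its
                     * SLOW_WORK_EXECUTING bit. */
            }
    }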

Signed-off-by: David Howells <[email protected]>
---

Documentation/slow-work.txt | 15 +++++++++++++++
include/linux/slow-work.h | 19 +++++++++++++++++++
2 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/Documentation/slow-work.txt b/Documentation/slow-work.txt
index f120238..0169c9d 100644
--- a/Documentation/slow-work.txt
+++ b/Documentation/slow-work.txt
@@ -144,6 +144,21 @@ from being taken away before it completes. module should almost certainly be
THIS_MODULE.


+================
+HELPER FUNCTIONS
+================
+
+The slow-work facility provides a function by which it can be determined
+whether or not an item is queued for later execution:
+
+ bool queued = slow_work_is_queued(struct slow_work *work);
+
+If it returns false, then the item is not on the queue (it may be executing
+with a requeue pending). This can be used to work out whether an item on which
+another depends is on the queue, thus allowing a dependent item to be queued
+after it.
+
+
===============
ITEM OPERATIONS
===============
diff --git a/include/linux/slow-work.h b/include/linux/slow-work.h
index f414851..bfd3ab4 100644
--- a/include/linux/slow-work.h
+++ b/include/linux/slow-work.h
@@ -120,6 +120,25 @@ static inline void vslow_work_init(struct slow_work *work,
INIT_LIST_HEAD(&work->link);
}

+/**
+ * slow_work_is_queued - Determine if a slow work item is on the work queue
+ * work: The work item to test
+ *
+ * Determine if the specified slow-work item is on the work queue. This
+ * returns true if it is actually on the queue.
+ *
+ * If the item is executing and has been marked for requeue when execution
+ * finishes, then false will be returned.
+ *
+ * Anyone wishing to wait for completion of execution can wait on the
+ * SLOW_WORK_EXECUTING bit.
+ */
+static inline bool slow_work_is_queued(struct slow_work *work)
+{
+ unsigned long flags = work->flags;
+ return flags & SLOW_WORK_PENDING && !(flags & SLOW_WORK_EXECUTING);
+}
+
extern int slow_work_enqueue(struct slow_work *work);
extern void slow_work_cancel(struct slow_work *work);
extern int slow_work_register_user(struct module *owner);

2009-11-19 17:21:21

by David Howells

[permalink] [raw]
Subject: [PATCH 07/28] SLOW_WORK: Allow a requeueable work item to sleep till the thread is needed

Add a function to allow a requeueable work item to sleep till the thread
processing it is needed by the slow-work facility to perform other work.

Sometimes a work item can't progress immediately, but must wait for the
completion of another work item that's currently being processed by another
slow-work thread.

In some circumstances, the waiting item could instead - theoretically - put
itself back on the queue and yield its thread back to the slow-work facility,
thus waiting till it gets processing time again before attempting to progress.
This would allow other work items to get processing time on that thread.

However, this only works if there is something on the queue for it to queue
behind - otherwise it will just get a thread again immediately, and will end
up cycling between the queue and the thread, eating up valuable CPU time.

So, slow_work_sleep_till_thread_needed() is provided such that an item can put
itself on a wait queue that will wake it up when the event it is actually
interested in occurs, then call this function in lieu of calling schedule().

This function will then sleep until either the item's event occurs or another
work item appears on the queue. If another work item is queued, but the
item's event hasn't occurred, then the work item should requeue itself and
yield the thread back to the slow-work facility by returning.

This can be used by CacheFiles for an object that is being created on one
thread to wait for an object being deleted on another thread where there is
nothing on the queue for the creation to go and wait behind. As soon as an
item appears on the queue that could be given thread time instead, CacheFiles
can stick the creating object back on the queue and return to the slow-work
facility - assuming the object deletion didn't also complete.
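
As a rough sketch of the resulting pattern (the real CacheFiles usage comes
later in this series; my_flags and MY_BIT below are purely illustrative), a
work item's execute function might wait and requeue itself like this:

    #include <linux/sched.h>
    #include <linux/slow-work.h>
    #include <linux/wait.h>

    static unsigned long my_flags;  /* MY_BIT set while the dependency is outstanding */
    #define MY_BIT 0

    static void my_work_execute(struct slow_work *work)
    {
            wait_queue_head_t *wq = bit_waitqueue(&my_flags, MY_BIT);
            signed long timeout = 60 * HZ;
            bool requeue = false;
            DEFINE_WAIT(wait);

            do {
                    prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
                    if (!test_bit(MY_BIT, &my_flags))
                            break;
                    requeue = slow_work_sleep_till_thread_needed(work, &timeout);
            } while (timeout > 0 && !requeue);
            finish_wait(wq, &wait);

            if (!test_bit(MY_BIT, &my_flags)) {
                    /* dependency cleared: do the real work here */
                    return;
            }
            if (requeue) {
                    /* another item wants this thread: go back on the queue
                     * and yield the thread to slow-work */
                    slow_work_enqueue(work);
                    return;
            }
            /* timed out: give up or retry as appropriate */
    }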

Signed-off-by: David Howells <[email protected]>
---

Documentation/slow-work.txt | 44 ++++++++++++++++++++
include/linux/slow-work.h | 3 +
kernel/slow-work.c | 94 +++++++++++++++++++++++++++++++++++++++----
3 files changed, 132 insertions(+), 9 deletions(-)

diff --git a/Documentation/slow-work.txt b/Documentation/slow-work.txt
index 0169c9d..52bc314 100644
--- a/Documentation/slow-work.txt
+++ b/Documentation/slow-work.txt
@@ -158,6 +158,50 @@ with a requeue pending). This can be used to work out whether an item on which
another depends is on the queue, thus allowing a dependent item to be queued
after it.

+If the above shows an item on which another depends not to be queued, then the
+owner of the dependent item might need to wait. However, to avoid locking up
+the threads unnecessarily by sleeping in them, it can make sense under some
+circumstances to return the work item to the queue, thus deferring it until
+some other items have had a chance to make use of the yielded thread.
+
+To yield a thread and defer an item, the work function should simply enqueue
+the work item again and return. However, this doesn't work if there's nothing
+actually on the queue, as the thread just vacated will jump straight back into
+the item's work function, thus busy waiting on a CPU.
+
+Instead, the item should use the thread to wait for the dependency to go away,
+but rather than using schedule() or schedule_timeout() to sleep, it should use
+the following function:
+
+ bool requeue = slow_work_sleep_till_thread_needed(
+ struct slow_work *work,
+ signed long *_timeout);
+
+This will add a second wait and then sleep, such that it will be woken up if
+either something appears on the queue that could usefully make use of the
+thread - and behind which this item can be queued, or if the event the caller
+set up to wait for happens. True will be returned if something else appeared
+on the queue and this work function should perhaps return, or false if
+something else woke it up. The timeout is as for schedule_timeout().
+
+For example:
+
+ wq = bit_waitqueue(&my_flags, MY_BIT);
+ init_wait(&wait);
+ requeue = false;
+ do {
+ prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
+ if (!test_bit(MY_BIT, &my_flags))
+ break;
+ requeue = slow_work_sleep_till_thread_needed(&my_work,
+ &timeout);
+ } while (timeout > 0 && !requeue);
+ finish_wait(wq, &wait);
+ if (!test_bit(MY_BIT, &my_flags))
+ goto do_my_thing;
+ if (requeue)
+ return; // to slow_work
+

===============
ITEM OPERATIONS
diff --git a/include/linux/slow-work.h b/include/linux/slow-work.h
index bfd3ab4..5035a26 100644
--- a/include/linux/slow-work.h
+++ b/include/linux/slow-work.h
@@ -152,6 +152,9 @@ static inline void delayed_slow_work_cancel(struct delayed_slow_work *dwork)
slow_work_cancel(&dwork->work);
}

+extern bool slow_work_sleep_till_thread_needed(struct slow_work *work,
+ signed long *_timeout);
+
#ifdef CONFIG_SYSCTL
extern ctl_table slow_work_sysctls[];
#endif
diff --git a/kernel/slow-work.c b/kernel/slow-work.c
index b763bc2..da94f3c 100644
--- a/kernel/slow-work.c
+++ b/kernel/slow-work.c
@@ -133,6 +133,15 @@ LIST_HEAD(vslow_work_queue);
DEFINE_SPINLOCK(slow_work_queue_lock);

/*
+ * The following are two wait queues that get pinged when a work item is placed
+ * on an empty queue. These allow work items that are hogging a thread by
+ * sleeping in a way that could be deferred to yield their thread and enqueue
+ * themselves.
+ */
+static DECLARE_WAIT_QUEUE_HEAD(slow_work_queue_waits_for_occupation);
+static DECLARE_WAIT_QUEUE_HEAD(vslow_work_queue_waits_for_occupation);
+
+/*
* The thread controls. A variable used to signal to the threads that they
* should exit when the queue is empty, a waitqueue used by the threads to wait
* for signals, and a completion set by the last thread to exit.
@@ -306,6 +315,50 @@ auto_requeue:
}

/**
+ * slow_work_sleep_till_thread_needed - Sleep till thread needed by other work
+ * work: The work item under execution that wants to sleep
+ * _timeout: Scheduler sleep timeout
+ *
+ * Allow a requeueable work item to sleep on a slow-work processor thread until
+ * that thread is needed to do some other work or the sleep is interrupted by
+ * some other event.
+ *
+ * The caller must set up a wake up event before calling this and must have set
+ * the appropriate sleep mode (such as TASK_UNINTERRUPTIBLE) and tested its own
+ * condition before calling this function as no test is made here.
+ *
+ * False is returned if there is nothing on the queue; true is returned if the
+ * work item should be requeued
+ */
+bool slow_work_sleep_till_thread_needed(struct slow_work *work,
+ signed long *_timeout)
+{
+ wait_queue_head_t *wfo_wq;
+ struct list_head *queue;
+
+ DEFINE_WAIT(wait);
+
+ if (test_bit(SLOW_WORK_VERY_SLOW, &work->flags)) {
+ wfo_wq = &vslow_work_queue_waits_for_occupation;
+ queue = &vslow_work_queue;
+ } else {
+ wfo_wq = &slow_work_queue_waits_for_occupation;
+ queue = &slow_work_queue;
+ }
+
+ if (!list_empty(queue))
+ return true;
+
+ add_wait_queue_exclusive(wfo_wq, &wait);
+ if (list_empty(queue))
+ *_timeout = schedule_timeout(*_timeout);
+ finish_wait(wfo_wq, &wait);
+
+ return !list_empty(queue);
+}
+EXPORT_SYMBOL(slow_work_sleep_till_thread_needed);
+
+/**
* slow_work_enqueue - Schedule a slow work item for processing
* @work: The work item to queue
*
@@ -335,6 +388,8 @@ auto_requeue:
*/
int slow_work_enqueue(struct slow_work *work)
{
+ wait_queue_head_t *wfo_wq;
+ struct list_head *queue;
unsigned long flags;
int ret;

@@ -354,6 +409,14 @@ int slow_work_enqueue(struct slow_work *work)
* maintaining our promise
*/
if (!test_and_set_bit_lock(SLOW_WORK_PENDING, &work->flags)) {
+ if (test_bit(SLOW_WORK_VERY_SLOW, &work->flags)) {
+ wfo_wq = &vslow_work_queue_waits_for_occupation;
+ queue = &vslow_work_queue;
+ } else {
+ wfo_wq = &slow_work_queue_waits_for_occupation;
+ queue = &slow_work_queue;
+ }
+
spin_lock_irqsave(&slow_work_queue_lock, flags);

if (unlikely(test_bit(SLOW_WORK_CANCELLING, &work->flags)))
@@ -380,11 +443,13 @@ int slow_work_enqueue(struct slow_work *work)
if (ret < 0)
goto failed;
slow_work_mark_time(work);
- if (test_bit(SLOW_WORK_VERY_SLOW, &work->flags))
- list_add_tail(&work->link, &vslow_work_queue);
- else
- list_add_tail(&work->link, &slow_work_queue);
+ list_add_tail(&work->link, queue);
wake_up(&slow_work_thread_wq);
+
+ /* if someone who could be requeued is sleeping on a
+ * thread, then ask them to yield their thread */
+ if (work->link.prev == queue)
+ wake_up(wfo_wq);
}

spin_unlock_irqrestore(&slow_work_queue_lock, flags);
@@ -487,9 +552,19 @@ EXPORT_SYMBOL(slow_work_cancel);
*/
static void delayed_slow_work_timer(unsigned long data)
{
+ wait_queue_head_t *wfo_wq;
+ struct list_head *queue;
struct slow_work *work = (struct slow_work *) data;
unsigned long flags;
- bool queued = false, put = false;
+ bool queued = false, put = false, first = false;
+
+ if (test_bit(SLOW_WORK_VERY_SLOW, &work->flags)) {
+ wfo_wq = &vslow_work_queue_waits_for_occupation;
+ queue = &vslow_work_queue;
+ } else {
+ wfo_wq = &slow_work_queue_waits_for_occupation;
+ queue = &slow_work_queue;
+ }

spin_lock_irqsave(&slow_work_queue_lock, flags);
if (likely(!test_bit(SLOW_WORK_CANCELLING, &work->flags))) {
@@ -502,17 +577,18 @@ static void delayed_slow_work_timer(unsigned long data)
put = true;
} else {
slow_work_mark_time(work);
- if (test_bit(SLOW_WORK_VERY_SLOW, &work->flags))
- list_add_tail(&work->link, &vslow_work_queue);
- else
- list_add_tail(&work->link, &slow_work_queue);
+ list_add_tail(&work->link, queue);
queued = true;
+ if (work->link.prev == queue)
+ first = true;
}
}

spin_unlock_irqrestore(&slow_work_queue_lock, flags);
if (put)
slow_work_put_ref(work);
+ if (first)
+ wake_up(wfo_wq);
if (queued)
wake_up(&slow_work_thread_wq);
}

2009-11-19 17:25:57

by David Howells

[permalink] [raw]
Subject: [PATCH 08/28] FS-Cache: Annotate slow-work runqueue proc lines for FS-Cache work items

Annotate slow-work runqueue proc lines for FS-Cache work items. Object
descriptions include the object ID and state; operation descriptions include
the object ID, the operation ID, and the operation type and state.
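
With the format strings added below, the descriptions appended to the
/proc/slow_work_rq lines look something like the following (the IDs, names,
states and flag values here are made up for illustration):

    FSC: OBJ1693a: LOOK
    FSC: OBJ1693a OP804d: Retr/Run fl=0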

Signed-off-by: David Howells <[email protected]>
---

fs/fscache/object.c | 41 ++++++++++++++++++++++++++++++++++++++++-
fs/fscache/operation.c | 29 +++++++++++++++++++++++++++++
fs/fscache/page.c | 27 ++++++++++++++++++++++-----
include/linux/fscache-cache.h | 12 ++++++++++++
4 files changed, 103 insertions(+), 6 deletions(-)

diff --git a/fs/fscache/object.c b/fs/fscache/object.c
index d236eb1..615b63d 100644
--- a/fs/fscache/object.c
+++ b/fs/fscache/object.c
@@ -14,9 +14,10 @@

#define FSCACHE_DEBUG_LEVEL COOKIE
#include <linux/module.h>
+#include <linux/seq_file.h>
#include "internal.h"

-const char *fscache_object_states[] = {
+const char *fscache_object_states[FSCACHE_OBJECT__NSTATES] = {
[FSCACHE_OBJECT_INIT] = "OBJECT_INIT",
[FSCACHE_OBJECT_LOOKING_UP] = "OBJECT_LOOKING_UP",
[FSCACHE_OBJECT_CREATING] = "OBJECT_CREATING",
@@ -33,9 +34,28 @@ const char *fscache_object_states[] = {
};
EXPORT_SYMBOL(fscache_object_states);

+static const char fscache_object_states_short[FSCACHE_OBJECT__NSTATES][5] = {
+ [FSCACHE_OBJECT_INIT] = "INIT",
+ [FSCACHE_OBJECT_LOOKING_UP] = "LOOK",
+ [FSCACHE_OBJECT_CREATING] = "CRTN",
+ [FSCACHE_OBJECT_AVAILABLE] = "AVBL",
+ [FSCACHE_OBJECT_ACTIVE] = "ACTV",
+ [FSCACHE_OBJECT_UPDATING] = "UPDT",
+ [FSCACHE_OBJECT_DYING] = "DYNG",
+ [FSCACHE_OBJECT_LC_DYING] = "LCDY",
+ [FSCACHE_OBJECT_ABORT_INIT] = "ABTI",
+ [FSCACHE_OBJECT_RELEASING] = "RELS",
+ [FSCACHE_OBJECT_RECYCLING] = "RCYC",
+ [FSCACHE_OBJECT_WITHDRAWING] = "WTHD",
+ [FSCACHE_OBJECT_DEAD] = "DEAD",
+};
+
static void fscache_object_slow_work_put_ref(struct slow_work *);
static int fscache_object_slow_work_get_ref(struct slow_work *);
static void fscache_object_slow_work_execute(struct slow_work *);
+#ifdef CONFIG_SLOW_WORK_PROC
+static void fscache_object_slow_work_desc(struct slow_work *, struct seq_file *);
+#endif
static void fscache_initialise_object(struct fscache_object *);
static void fscache_lookup_object(struct fscache_object *);
static void fscache_object_available(struct fscache_object *);
@@ -49,6 +69,9 @@ const struct slow_work_ops fscache_object_slow_work_ops = {
.get_ref = fscache_object_slow_work_get_ref,
.put_ref = fscache_object_slow_work_put_ref,
.execute = fscache_object_slow_work_execute,
+#ifdef CONFIG_SLOW_WORK_PROC
+ .desc = fscache_object_slow_work_desc,
+#endif
};
EXPORT_SYMBOL(fscache_object_slow_work_ops);

@@ -327,6 +350,22 @@ static void fscache_object_slow_work_execute(struct slow_work *work)
}

/*
+ * describe an object for slow-work debugging
+ */
+#ifdef CONFIG_SLOW_WORK_PROC
+static void fscache_object_slow_work_desc(struct slow_work *work,
+ struct seq_file *m)
+{
+ struct fscache_object *object =
+ container_of(work, struct fscache_object, work);
+
+ seq_printf(m, "FSC: OBJ%x: %s",
+ object->debug_id,
+ fscache_object_states_short[object->state]);
+}
+#endif
+
+/*
* initialise an object
* - check the specified object's parent to see if we can make use of it
* immediately to do a creation
diff --git a/fs/fscache/operation.c b/fs/fscache/operation.c
index f1a2857..91bbe6f 100644
--- a/fs/fscache/operation.c
+++ b/fs/fscache/operation.c
@@ -13,6 +13,7 @@

#define FSCACHE_DEBUG_LEVEL OPERATION
#include <linux/module.h>
+#include <linux/seq_file.h>
#include "internal.h"

atomic_t fscache_op_debug_id;
@@ -31,6 +32,8 @@ void fscache_enqueue_operation(struct fscache_operation *op)
_enter("{OBJ%x OP%x,%u}",
op->object->debug_id, op->debug_id, atomic_read(&op->usage));

+ fscache_set_op_state(op, "EnQ");
+
ASSERT(op->processor != NULL);
ASSERTCMP(op->object->state, >=, FSCACHE_OBJECT_AVAILABLE);
ASSERTCMP(atomic_read(&op->usage), >, 0);
@@ -67,6 +70,8 @@ EXPORT_SYMBOL(fscache_enqueue_operation);
static void fscache_run_op(struct fscache_object *object,
struct fscache_operation *op)
{
+ fscache_set_op_state(op, "Run");
+
object->n_in_progress++;
if (test_and_clear_bit(FSCACHE_OP_WAITING, &op->flags))
wake_up_bit(&op->flags, FSCACHE_OP_WAITING);
@@ -87,6 +92,8 @@ int fscache_submit_exclusive_op(struct fscache_object *object,

_enter("{OBJ%x OP%x},", object->debug_id, op->debug_id);

+ fscache_set_op_state(op, "SubmitX");
+
spin_lock(&object->lock);
ASSERTCMP(object->n_ops, >=, object->n_in_progress);
ASSERTCMP(object->n_ops, >=, object->n_exclusive);
@@ -190,6 +197,8 @@ int fscache_submit_op(struct fscache_object *object,

ASSERTCMP(atomic_read(&op->usage), >, 0);

+ fscache_set_op_state(op, "Submit");
+
spin_lock(&object->lock);
ASSERTCMP(object->n_ops, >=, object->n_in_progress);
ASSERTCMP(object->n_ops, >=, object->n_exclusive);
@@ -298,6 +307,8 @@ void fscache_put_operation(struct fscache_operation *op)
if (!atomic_dec_and_test(&op->usage))
return;

+ fscache_set_op_state(op, "Put");
+
_debug("PUT OP");
if (test_and_set_bit(FSCACHE_OP_DEAD, &op->flags))
BUG();
@@ -452,9 +463,27 @@ static void fscache_op_execute(struct slow_work *work)
_leave("");
}

+/*
+ * describe an operation for slow-work debugging
+ */
+#ifdef CONFIG_SLOW_WORK_PROC
+static void fscache_op_desc(struct slow_work *work, struct seq_file *m)
+{
+ struct fscache_operation *op =
+ container_of(work, struct fscache_operation, slow_work);
+
+ seq_printf(m, "FSC: OBJ%x OP%x: %s/%s fl=%lx",
+ op->object->debug_id, op->debug_id,
+ op->name, op->state, op->flags);
+}
+#endif
+
const struct slow_work_ops fscache_op_slow_work_ops = {
.owner = THIS_MODULE,
.get_ref = fscache_op_get_ref,
.put_ref = fscache_op_put_ref,
.execute = fscache_op_execute,
+#ifdef CONFIG_SLOW_WORK_PROC
+ .desc = fscache_op_desc,
+#endif
};
diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index 2568e0e..e8bbc39 100644
--- a/fs/fscache/page.c
+++ b/fs/fscache/page.c
@@ -63,14 +63,19 @@ static void fscache_end_page_write(struct fscache_cookie *cookie, struct page *p
static void fscache_attr_changed_op(struct fscache_operation *op)
{
struct fscache_object *object = op->object;
+ int ret;

_enter("{OBJ%x OP%x}", object->debug_id, op->debug_id);

fscache_stat(&fscache_n_attr_changed_calls);

- if (fscache_object_is_active(object) &&
- object->cache->ops->attr_changed(object) < 0)
- fscache_abort_object(object);
+ if (fscache_object_is_active(object)) {
+ fscache_set_op_state(op, "CallFS");
+ ret = object->cache->ops->attr_changed(object);
+ fscache_set_op_state(op, "Done");
+ if (ret < 0)
+ fscache_abort_object(object);
+ }

_leave("");
}
@@ -99,6 +104,7 @@ int __fscache_attr_changed(struct fscache_cookie *cookie)
fscache_operation_init(op, NULL);
fscache_operation_init_slow(op, fscache_attr_changed_op);
op->flags = FSCACHE_OP_SLOW | (1 << FSCACHE_OP_EXCLUSIVE);
+ fscache_set_op_name(op, "Attr");

spin_lock(&cookie->lock);

@@ -184,6 +190,7 @@ static struct fscache_retrieval *fscache_alloc_retrieval(
op->start_time = jiffies;
INIT_WORK(&op->op.fast_work, fscache_retrieval_work);
INIT_LIST_HEAD(&op->to_do);
+ fscache_set_op_name(&op->op, "Retr");
return op;
}

@@ -257,6 +264,7 @@ int __fscache_read_or_alloc_page(struct fscache_cookie *cookie,
_leave(" = -ENOMEM");
return -ENOMEM;
}
+ fscache_set_op_name(&op->op, "RetrRA1");

spin_lock(&cookie->lock);

@@ -369,6 +377,7 @@ int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
op = fscache_alloc_retrieval(mapping, end_io_func, context);
if (!op)
return -ENOMEM;
+ fscache_set_op_name(&op->op, "RetrRAN");

spin_lock(&cookie->lock);

@@ -461,6 +470,7 @@ int __fscache_alloc_page(struct fscache_cookie *cookie,
op = fscache_alloc_retrieval(page->mapping, NULL, NULL);
if (!op)
return -ENOMEM;
+ fscache_set_op_name(&op->op, "RetrAL1");

spin_lock(&cookie->lock);

@@ -529,6 +539,8 @@ static void fscache_write_op(struct fscache_operation *_op)

_enter("{OP%x,%d}", op->op.debug_id, atomic_read(&op->op.usage));

+ fscache_set_op_state(&op->op, "GetPage");
+
spin_lock(&cookie->lock);
spin_lock(&object->lock);

@@ -559,13 +571,17 @@ static void fscache_write_op(struct fscache_operation *_op)
spin_unlock(&cookie->lock);

if (page) {
+ fscache_set_op_state(&op->op, "Store");
ret = object->cache->ops->write_page(op, page);
+ fscache_set_op_state(&op->op, "EndWrite");
fscache_end_page_write(cookie, page);
page_cache_release(page);
- if (ret < 0)
+ if (ret < 0) {
+ fscache_set_op_state(&op->op, "Abort");
fscache_abort_object(object);
- else
+ } else {
fscache_enqueue_operation(&op->op);
+ }
}

_leave("");
@@ -634,6 +650,7 @@ int __fscache_write_page(struct fscache_cookie *cookie,
fscache_operation_init(&op->op, fscache_release_write_op);
fscache_operation_init_slow(&op->op, fscache_write_op);
op->op.flags = FSCACHE_OP_SLOW | (1 << FSCACHE_OP_WAITING);
+ fscache_set_op_name(&op->op, "Write1");

ret = radix_tree_preload(gfp & ~__GFP_HIGHMEM);
if (ret < 0)
diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
index 84d3532..7a9847c 100644
--- a/include/linux/fscache-cache.h
+++ b/include/linux/fscache-cache.h
@@ -102,6 +102,16 @@ struct fscache_operation {

/* operation releaser */
fscache_operation_release_t release;
+
+#ifdef CONFIG_SLOW_WORK_PROC
+ const char *name; /* operation name */
+ const char *state; /* operation state */
+#define fscache_set_op_name(OP, N) do { (OP)->name = (N); } while(0)
+#define fscache_set_op_state(OP, S) do { (OP)->state = (S); } while(0)
+#else
+#define fscache_set_op_name(OP, N) do { } while(0)
+#define fscache_set_op_state(OP, S) do { } while(0)
+#endif
};

extern atomic_t fscache_op_debug_id;
@@ -125,6 +135,7 @@ static inline void fscache_operation_init(struct fscache_operation *op,
op->debug_id = atomic_inc_return(&fscache_op_debug_id);
op->release = release;
INIT_LIST_HEAD(&op->pend_link);
+ fscache_set_op_state(op, "Init");
}

/**
@@ -337,6 +348,7 @@ struct fscache_object {
FSCACHE_OBJECT_RECYCLING, /* retiring object */
FSCACHE_OBJECT_WITHDRAWING, /* withdrawing object */
FSCACHE_OBJECT_DEAD, /* object is now dead */
+ FSCACHE_OBJECT__NSTATES
} state;

int debug_id; /* debugging ID */

2009-11-19 17:21:34

by David Howells

[permalink] [raw]
Subject: [PATCH 09/28] FS-Cache: Allow the current state of all objects to be dumped

Allow the current state of all fscache objects to be dumped by doing:

cat /proc/fs/fscache/objects

By default, all objects and all fields will be shown. This can be restricted
by adding a suitable key to one of the caller's keyrings (such as the session
keyring):

keyctl add user fscache:objlist "<restrictions>" @s

The <restrictions> are:

K Show hexdump of object key (don't show if not given)
A Show hexdump of object aux data (don't show if not given)

And paired restrictions:

C Show objects that have a cookie
c Show objects that don't have a cookie
B Show objects that are busy
b Show objects that aren't busy
W Show objects that have pending writes
w Show objects that don't have pending writes
R Show objects that have outstanding reads
r Show objects that don't have outstanding reads
S Show objects that have slow work queued
s Show objects that don't have slow work queued

If neither side of a restriction pair is given, then both are implied. For
example:

keyctl add user fscache:objlist KB @s

shows objects that are busy, and lists their object keys, but does not dump
their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is
not implied.

Signed-off-by: David Howells <[email protected]>
---

Documentation/filesystems/caching/fscache.txt | 81 +++++
fs/cachefiles/interface.c | 1
fs/cachefiles/rdwr.c | 6
fs/fscache/Kconfig | 7
fs/fscache/Makefile | 1
fs/fscache/cache.c | 1
fs/fscache/cookie.c | 2
fs/fscache/internal.h | 13 +
fs/fscache/object-list.c | 432 +++++++++++++++++++++++++
fs/fscache/object.c | 2
fs/fscache/operation.c | 3
fs/fscache/page.c | 6
fs/fscache/proc.c | 13 +
include/linux/fscache-cache.h | 13 +
14 files changed, 578 insertions(+), 3 deletions(-)
create mode 100644 fs/fscache/object-list.c

diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt
index 9e94b94..cf13cd3 100644
--- a/Documentation/filesystems/caching/fscache.txt
+++ b/Documentation/filesystems/caching/fscache.txt
@@ -299,6 +299,87 @@ proc files.
jiffy range covered, and the SECS field the equivalent number of seconds.


+===========
+OBJECT LIST
+===========
+
+If CONFIG_FSCACHE_OBJECT_LIST is enabled, the FS-Cache facility will maintain a
+list of all the objects currently allocated and allow them to be viewed
+through:
+
+ /proc/fs/fscache/objects
+
+This will look something like:
+
+ [root@andromeda ~]# head /proc/fs/fscache/objects
+ OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS EM EV F S | NETFS_COOKIE_DEF TY FL NETFS_DATA OBJECT_KEY, AUX_DATA
+ ======== ======== ==== ===== === === === == ===== == == = = | ================ == == ================ ================
+ 17e4b 2 ACTV 0 0 0 0 0 0 7b 4 0 8 | NFS.fh DT 0 ffff88001dd82820 010006017edcf8bbc93b43298fdfbe71e50b57b13a172c0117f38472, e567634700000000000000000000000063f2404a000000000000000000000000c9030000000000000000000063f2404a
+ 1693a 2 ACTV 0 0 0 0 0 0 7b 4 0 8 | NFS.fh DT 0 ffff88002db23380 010006017edcf8bbc93b43298fdfbe71e50b57b1e0162c01a2df0ea6, 420ebc4a000000000000000000000000420ebc4a0000000000000000000000000e1801000000000000000000420ebc4a
+
+where the first set of columns before the '|' describe the object:
+
+ COLUMN DESCRIPTION
+ ======= ===============================================================
+ OBJECT Object debugging ID (appears as OBJ%x in some debug messages)
+ PARENT Debugging ID of parent object
+ STAT Object state
+ CHLDN Number of child objects of this object
+ OPS Number of outstanding operations on this object
+ OOP Number of outstanding child object management operations
+ IPR Number of in-progress operations on this object
+ EX Number of outstanding exclusive operations
+ READS Number of outstanding read operations
+ EM Object's event mask
+ EV Events raised on this object
+ F Object flags
+ S Object slow-work work item flags
+
+and the second set of columns describe the object's cookie, if present:
+
+ COLUMN DESCRIPTION
+ =============== =======================================================
+ NETFS_COOKIE_DEF Name of netfs cookie definition
+ TY Cookie type (IX - index, DT - data, hex - special)
+ FL Cookie flags
+ NETFS_DATA Netfs private data stored in the cookie
+ OBJECT_KEY Object key } 1 column, with separating comma
+ AUX_DATA Object aux data } presence may be configured
+
+The data shown may be filtered by attaching a key to an appropriate keyring
+before viewing the file. Something like:
+
+ keyctl add user fscache:objlist <restrictions> @s
+
+where <restrictions> are a selection of the following letters:
+
+ K Show hexdump of object key (don't show if not given)
+ A Show hexdump of object aux data (don't show if not given)
+
+and the following paired letters:
+
+ C Show objects that have a cookie
+ c Show objects that don't have a cookie
+ B Show objects that are busy
+ b Show objects that aren't busy
+ W Show objects that have pending writes
+ w Show objects that don't have pending writes
+ R Show objects that have outstanding reads
+ r Show objects that don't have outstanding reads
+ S Show objects that have slow work queued
+ s Show objects that don't have slow work queued
+
+If neither side of a letter pair is given, then both are implied. For example:
+
+ keyctl add user fscache:objlist KB @s
+
+shows objects that are busy, and lists their object keys, but does not dump
+their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is
+not implied.
+
+By default all objects and all fields will be shown.
+
+
=========
DEBUGGING
=========
diff --git a/fs/cachefiles/interface.c b/fs/cachefiles/interface.c
index 431accd..dd7f852 100644
--- a/fs/cachefiles/interface.c
+++ b/fs/cachefiles/interface.c
@@ -331,6 +331,7 @@ static void cachefiles_put_object(struct fscache_object *_object)
}

cache = object->fscache.cache;
+ fscache_object_destroy(&object->fscache);
kmem_cache_free(cachefiles_object_jar, object);
fscache_object_destroyed(cache);
}
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index a69787e..3304646 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -333,7 +333,8 @@ int cachefiles_read_or_alloc_page(struct fscache_retrieval *op,

shift = PAGE_SHIFT - inode->i_sb->s_blocksize_bits;

- op->op.flags = FSCACHE_OP_FAST;
+ op->op.flags &= FSCACHE_OP_KEEP_FLAGS;
+ op->op.flags |= FSCACHE_OP_FAST;
op->op.processor = cachefiles_read_copier;

pagevec_init(&pagevec, 0);
@@ -639,7 +640,8 @@ int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op,

pagevec_init(&pagevec, 0);

- op->op.flags = FSCACHE_OP_FAST;
+ op->op.flags &= FSCACHE_OP_KEEP_FLAGS;
+ op->op.flags |= FSCACHE_OP_FAST;
op->op.processor = cachefiles_read_copier;

INIT_LIST_HEAD(&backpages);
diff --git a/fs/fscache/Kconfig b/fs/fscache/Kconfig
index 9bbb8ce..864dac2 100644
--- a/fs/fscache/Kconfig
+++ b/fs/fscache/Kconfig
@@ -54,3 +54,10 @@ config FSCACHE_DEBUG
enabled by setting bits in /sys/modules/fscache/parameter/debug.

See Documentation/filesystems/caching/fscache.txt for more information.
+
+config FSCACHE_OBJECT_LIST
+ bool "Maintain global object list for debugging purposes"
+ depends on FSCACHE && PROC_FS
+ help
+ Maintain a global list of active fscache objects that can be
+ retrieved through /proc/fs/fscache/objects for debugging purposes
diff --git a/fs/fscache/Makefile b/fs/fscache/Makefile
index 91571b9..6d56153 100644
--- a/fs/fscache/Makefile
+++ b/fs/fscache/Makefile
@@ -15,5 +15,6 @@ fscache-y := \
fscache-$(CONFIG_PROC_FS) += proc.o
fscache-$(CONFIG_FSCACHE_STATS) += stats.o
fscache-$(CONFIG_FSCACHE_HISTOGRAM) += histogram.o
+fscache-$(CONFIG_FSCACHE_OBJECT_LIST) += object-list.o

obj-$(CONFIG_FSCACHE) := fscache.o
diff --git a/fs/fscache/cache.c b/fs/fscache/cache.c
index e21985b..724384e 100644
--- a/fs/fscache/cache.c
+++ b/fs/fscache/cache.c
@@ -263,6 +263,7 @@ int fscache_add_cache(struct fscache_cache *cache,
spin_lock(&cache->object_list_lock);
list_add_tail(&ifsdef->cache_link, &cache->object_list);
spin_unlock(&cache->object_list_lock);
+ fscache_objlist_add(ifsdef);

/* add the cache's netfs definition index object to the top level index
* cookie as a known backing object */
diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
index 72fd18f..9b51873 100644
--- a/fs/fscache/cookie.c
+++ b/fs/fscache/cookie.c
@@ -349,6 +349,8 @@ static int fscache_attach_object(struct fscache_cookie *cookie,
object->cookie = cookie;
atomic_inc(&cookie->usage);
hlist_add_head(&object->cookie_link, &cookie->backing_objects);
+
+ fscache_objlist_add(object);
ret = 0;

cant_attach_object:
diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
index 1c34130..fe02973 100644
--- a/fs/fscache/internal.h
+++ b/fs/fscache/internal.h
@@ -88,11 +88,24 @@ extern int fscache_wait_bit_interruptible(void *);
/*
* object.c
*/
+extern const char fscache_object_states_short[FSCACHE_OBJECT__NSTATES][5];
+
extern void fscache_withdrawing_object(struct fscache_cache *,
struct fscache_object *);
extern void fscache_enqueue_object(struct fscache_object *);

/*
+ * object-list.c
+ */
+#ifdef CONFIG_FSCACHE_OBJECT_LIST
+extern const struct file_operations fscache_objlist_fops;
+
+extern void fscache_objlist_add(struct fscache_object *);
+#else
+#define fscache_objlist_add(object) do {} while(0)
+#endif
+
+/*
* operation.c
*/
extern int fscache_submit_exclusive_op(struct fscache_object *,
diff --git a/fs/fscache/object-list.c b/fs/fscache/object-list.c
new file mode 100644
index 0000000..e590242
--- /dev/null
+++ b/fs/fscache/object-list.c
@@ -0,0 +1,432 @@
+/* Global fscache object list maintainer and viewer
+ *
+ * Copyright (C) 2009 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL COOKIE
+#include <linux/module.h>
+#include <linux/seq_file.h>
+#include <linux/key.h>
+#include <keys/user-type.h>
+#include "internal.h"
+
+static struct rb_root fscache_object_list;
+static DEFINE_RWLOCK(fscache_object_list_lock);
+
+struct fscache_objlist_data {
+ unsigned long config; /* display configuration */
+#define FSCACHE_OBJLIST_CONFIG_KEY 0x00000001 /* show object keys */
+#define FSCACHE_OBJLIST_CONFIG_AUX 0x00000002 /* show object auxdata */
+#define FSCACHE_OBJLIST_CONFIG_COOKIE 0x00000004 /* show objects with cookies */
+#define FSCACHE_OBJLIST_CONFIG_NOCOOKIE 0x00000008 /* show objects without cookies */
+#define FSCACHE_OBJLIST_CONFIG_BUSY 0x00000010 /* show busy objects */
+#define FSCACHE_OBJLIST_CONFIG_IDLE 0x00000020 /* show idle objects */
+#define FSCACHE_OBJLIST_CONFIG_PENDWR 0x00000040 /* show objects with pending writes */
+#define FSCACHE_OBJLIST_CONFIG_NOPENDWR 0x00000080 /* show objects without pending writes */
+#define FSCACHE_OBJLIST_CONFIG_READS 0x00000100 /* show objects with active reads */
+#define FSCACHE_OBJLIST_CONFIG_NOREADS 0x00000200 /* show objects without active reads */
+#define FSCACHE_OBJLIST_CONFIG_EVENTS 0x00000400 /* show objects with events */
+#define FSCACHE_OBJLIST_CONFIG_NOEVENTS 0x00000800 /* show objects with no events */
+#define FSCACHE_OBJLIST_CONFIG_WORK 0x00001000 /* show objects with slow work */
+#define FSCACHE_OBJLIST_CONFIG_NOWORK 0x00002000 /* show objects without slow work */
+
+ u8 buf[512]; /* key and aux data buffer */
+};
+
+/*
+ * Add an object to the object list
+ * - we use the address of the fscache_object structure as the key into the
+ * tree
+ */
+void fscache_objlist_add(struct fscache_object *obj)
+{
+ struct fscache_object *xobj;
+ struct rb_node **p = &fscache_object_list.rb_node, *parent = NULL;
+
+ write_lock(&fscache_object_list_lock);
+
+ while (*p) {
+ parent = *p;
+ xobj = rb_entry(parent, struct fscache_object, objlist_link);
+
+ if (obj < xobj)
+ p = &(*p)->rb_left;
+ else if (obj > xobj)
+ p = &(*p)->rb_right;
+ else
+ BUG();
+ }
+
+ rb_link_node(&obj->objlist_link, parent, p);
+ rb_insert_color(&obj->objlist_link, &fscache_object_list);
+
+ write_unlock(&fscache_object_list_lock);
+}
+
+/**
+ * fscache_object_destroy - Note that a cache object is about to be destroyed
+ * @object: The object to be destroyed
+ *
+ * Note the imminent destruction and deallocation of a cache object record.
+ */
+void fscache_object_destroy(struct fscache_object *obj)
+{
+ write_lock(&fscache_object_list_lock);
+
+ BUG_ON(RB_EMPTY_ROOT(&fscache_object_list));
+ rb_erase(&obj->objlist_link, &fscache_object_list);
+
+ write_unlock(&fscache_object_list_lock);
+}
+EXPORT_SYMBOL(fscache_object_destroy);
+
+/*
+ * find the object in the tree on or after the specified index
+ */
+static struct fscache_object *fscache_objlist_lookup(loff_t *_pos)
+{
+ struct fscache_object *pobj, *obj, *minobj = NULL;
+ struct rb_node *p;
+ unsigned long pos;
+
+ if (*_pos >= (unsigned long) ERR_PTR(-ENOENT))
+ return NULL;
+ pos = *_pos;
+
+ /* banners (can't represent line 0 by pos 0 as that would involve
+ * returning a NULL pointer) */
+ if (pos == 0)
+ return (struct fscache_object *) ++(*_pos);
+ if (pos < 3)
+ return (struct fscache_object *)pos;
+
+ pobj = (struct fscache_object *)pos;
+ p = fscache_object_list.rb_node;
+ while (p) {
+ obj = rb_entry(p, struct fscache_object, objlist_link);
+ if (pobj < obj) {
+ if (!minobj || minobj > obj)
+ minobj = obj;
+ p = p->rb_left;
+ } else if (pobj > obj) {
+ p = p->rb_right;
+ } else {
+ minobj = obj;
+ break;
+ }
+ obj = NULL;
+ }
+
+ if (!minobj)
+ *_pos = (unsigned long) ERR_PTR(-ENOENT);
+ else if (minobj != obj)
+ *_pos = (unsigned long) minobj;
+ return minobj;
+}
+
+/*
+ * set up the iterator to start reading from the first line
+ */
+static void *fscache_objlist_start(struct seq_file *m, loff_t *_pos)
+ __acquires(&fscache_object_list_lock)
+{
+ read_lock(&fscache_object_list_lock);
+ return fscache_objlist_lookup(_pos);
+}
+
+/*
+ * move to the next line
+ */
+static void *fscache_objlist_next(struct seq_file *m, void *v, loff_t *_pos)
+{
+ (*_pos)++;
+ return fscache_objlist_lookup(_pos);
+}
+
+/*
+ * clean up after reading
+ */
+static void fscache_objlist_stop(struct seq_file *m, void *v)
+ __releases(&fscache_object_list_lock)
+{
+ read_unlock(&fscache_object_list_lock);
+}
+
+/*
+ * display an object
+ */
+static int fscache_objlist_show(struct seq_file *m, void *v)
+{
+ struct fscache_objlist_data *data = m->private;
+ struct fscache_object *obj = v;
+ unsigned long config = data->config;
+ uint16_t keylen, auxlen;
+ char _type[3], *type;
+ bool no_cookie;
+ u8 *buf = data->buf, *p;
+
+ if ((unsigned long) v == 1) {
+ seq_puts(m, "OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS"
+ " EM EV F S"
+ " | NETFS_COOKIE_DEF TY FL NETFS_DATA");
+ if (config & (FSCACHE_OBJLIST_CONFIG_KEY |
+ FSCACHE_OBJLIST_CONFIG_AUX))
+ seq_puts(m, " ");
+ if (config & FSCACHE_OBJLIST_CONFIG_KEY)
+ seq_puts(m, "OBJECT_KEY");
+ if ((config & (FSCACHE_OBJLIST_CONFIG_KEY |
+ FSCACHE_OBJLIST_CONFIG_AUX)) ==
+ (FSCACHE_OBJLIST_CONFIG_KEY | FSCACHE_OBJLIST_CONFIG_AUX))
+ seq_puts(m, ", ");
+ if (config & FSCACHE_OBJLIST_CONFIG_AUX)
+ seq_puts(m, "AUX_DATA");
+ seq_puts(m, "\n");
+ return 0;
+ }
+
+ if ((unsigned long) v == 2) {
+ seq_puts(m, "======== ======== ==== ===== === === === == ====="
+ " == == = ="
+ " | ================ == == ================");
+ if (config & (FSCACHE_OBJLIST_CONFIG_KEY |
+ FSCACHE_OBJLIST_CONFIG_AUX))
+ seq_puts(m, " ================");
+ seq_puts(m, "\n");
+ return 0;
+ }
+
+ /* filter out any unwanted objects */
+#define FILTER(criterion, _yes, _no) \
+ do { \
+ unsigned long yes = FSCACHE_OBJLIST_CONFIG_##_yes; \
+ unsigned long no = FSCACHE_OBJLIST_CONFIG_##_no; \
+ if (criterion) { \
+ if (!(config & yes)) \
+ return 0; \
+ } else { \
+ if (!(config & no)) \
+ return 0; \
+ } \
+ } while(0)
+
+ if (~config) {
+ FILTER(obj->cookie,
+ COOKIE, NOCOOKIE);
+ FILTER(obj->state != FSCACHE_OBJECT_ACTIVE ||
+ obj->n_ops != 0 ||
+ obj->n_obj_ops != 0 ||
+ obj->flags ||
+ !list_empty(&obj->dependents),
+ BUSY, IDLE);
+ FILTER(test_bit(FSCACHE_OBJECT_PENDING_WRITE, &obj->flags),
+ PENDWR, NOPENDWR);
+ FILTER(atomic_read(&obj->n_reads),
+ READS, NOREADS);
+ FILTER(obj->events & obj->event_mask,
+ EVENTS, NOEVENTS);
+ FILTER(obj->work.flags & ~(1UL << SLOW_WORK_VERY_SLOW),
+ WORK, NOWORK);
+ }
+
+ seq_printf(m,
+ "%8x %8x %s %5u %3u %3u %3u %2u %5u %2lx %2lx %1lx %1lx | ",
+ obj->debug_id,
+ obj->parent ? obj->parent->debug_id : -1,
+ fscache_object_states_short[obj->state],
+ obj->n_children,
+ obj->n_ops,
+ obj->n_obj_ops,
+ obj->n_in_progress,
+ obj->n_exclusive,
+ atomic_read(&obj->n_reads),
+ obj->event_mask & FSCACHE_OBJECT_EVENTS_MASK,
+ obj->events,
+ obj->flags,
+ obj->work.flags);
+
+ no_cookie = true;
+ keylen = auxlen = 0;
+ if (obj->cookie) {
+ spin_lock(&obj->lock);
+ if (obj->cookie) {
+ switch (obj->cookie->def->type) {
+ case 0:
+ type = "IX";
+ break;
+ case 1:
+ type = "DT";
+ break;
+ default:
+ sprintf(_type, "%02u",
+ obj->cookie->def->type);
+ type = _type;
+ break;
+ }
+
+ seq_printf(m, "%-16s %s %2lx %16p",
+ obj->cookie->def->name,
+ type,
+ obj->cookie->flags,
+ obj->cookie->netfs_data);
+
+ if (obj->cookie->def->get_key &&
+ config & FSCACHE_OBJLIST_CONFIG_KEY)
+ keylen = obj->cookie->def->get_key(
+ obj->cookie->netfs_data,
+ buf, 400);
+
+ if (obj->cookie->def->get_aux &&
+ config & FSCACHE_OBJLIST_CONFIG_AUX)
+ auxlen = obj->cookie->def->get_aux(
+ obj->cookie->netfs_data,
+ buf + keylen, 512 - keylen);
+
+ no_cookie = false;
+ }
+ spin_unlock(&obj->lock);
+
+ if (!no_cookie && (keylen > 0 || auxlen > 0)) {
+ seq_printf(m, " ");
+ for (p = buf; keylen > 0; keylen--)
+ seq_printf(m, "%02x", *p++);
+ if (auxlen > 0) {
+ if (config & FSCACHE_OBJLIST_CONFIG_KEY)
+ seq_printf(m, ", ");
+ for (; auxlen > 0; auxlen--)
+ seq_printf(m, "%02x", *p++);
+ }
+ }
+ }
+
+ if (no_cookie)
+ seq_printf(m, "<no_cookie>\n");
+ else
+ seq_printf(m, "\n");
+ return 0;
+}
+
+static const struct seq_operations fscache_objlist_ops = {
+ .start = fscache_objlist_start,
+ .stop = fscache_objlist_stop,
+ .next = fscache_objlist_next,
+ .show = fscache_objlist_show,
+};
+
+/*
+ * get the configuration for filtering the list
+ */
+static void fscache_objlist_config(struct fscache_objlist_data *data)
+{
+#ifdef CONFIG_KEYS
+ struct user_key_payload *confkey;
+ unsigned long config;
+ struct key *key;
+ const char *buf;
+ int len;
+
+ key = request_key(&key_type_user, "fscache:objlist", NULL);
+ if (IS_ERR(key))
+ goto no_config;
+
+ config = 0;
+ rcu_read_lock();
+
+ confkey = key->payload.data;
+ buf = confkey->data;
+
+ for (len = confkey->datalen - 1; len >= 0; len--) {
+ switch (buf[len]) {
+ case 'K': config |= FSCACHE_OBJLIST_CONFIG_KEY; break;
+ case 'A': config |= FSCACHE_OBJLIST_CONFIG_AUX; break;
+ case 'C': config |= FSCACHE_OBJLIST_CONFIG_COOKIE; break;
+ case 'c': config |= FSCACHE_OBJLIST_CONFIG_NOCOOKIE; break;
+ case 'B': config |= FSCACHE_OBJLIST_CONFIG_BUSY; break;
+ case 'b': config |= FSCACHE_OBJLIST_CONFIG_IDLE; break;
+ case 'W': config |= FSCACHE_OBJLIST_CONFIG_PENDWR; break;
+ case 'w': config |= FSCACHE_OBJLIST_CONFIG_NOPENDWR; break;
+ case 'R': config |= FSCACHE_OBJLIST_CONFIG_READS; break;
+ case 'r': config |= FSCACHE_OBJLIST_CONFIG_NOREADS; break;
+ case 'S': config |= FSCACHE_OBJLIST_CONFIG_WORK; break;
+ case 's': config |= FSCACHE_OBJLIST_CONFIG_NOWORK; break;
+ }
+ }
+
+ rcu_read_unlock();
+ key_put(key);
+
+ if (!(config & (FSCACHE_OBJLIST_CONFIG_COOKIE | FSCACHE_OBJLIST_CONFIG_NOCOOKIE)))
+ config |= FSCACHE_OBJLIST_CONFIG_COOKIE | FSCACHE_OBJLIST_CONFIG_NOCOOKIE;
+ if (!(config & (FSCACHE_OBJLIST_CONFIG_BUSY | FSCACHE_OBJLIST_CONFIG_IDLE)))
+ config |= FSCACHE_OBJLIST_CONFIG_BUSY | FSCACHE_OBJLIST_CONFIG_IDLE;
+ if (!(config & (FSCACHE_OBJLIST_CONFIG_PENDWR | FSCACHE_OBJLIST_CONFIG_NOPENDWR)))
+ config |= FSCACHE_OBJLIST_CONFIG_PENDWR | FSCACHE_OBJLIST_CONFIG_NOPENDWR;
+ if (!(config & (FSCACHE_OBJLIST_CONFIG_READS | FSCACHE_OBJLIST_CONFIG_NOREADS)))
+ config |= FSCACHE_OBJLIST_CONFIG_READS | FSCACHE_OBJLIST_CONFIG_NOREADS;
+ if (!(config & (FSCACHE_OBJLIST_CONFIG_EVENTS | FSCACHE_OBJLIST_CONFIG_NOEVENTS)))
+ config |= FSCACHE_OBJLIST_CONFIG_EVENTS | FSCACHE_OBJLIST_CONFIG_NOEVENTS;
+ if (!(config & (FSCACHE_OBJLIST_CONFIG_WORK | FSCACHE_OBJLIST_CONFIG_NOWORK)))
+ config |= FSCACHE_OBJLIST_CONFIG_WORK | FSCACHE_OBJLIST_CONFIG_NOWORK;
+
+ data->config = config;
+ return;
+
+no_config:
+#endif
+ data->config = ULONG_MAX;
+}
+
+/*
+ * open "/proc/fs/fscache/objects" to provide a list of active objects
+ * - can be configured by a user-defined key added to the caller's keyrings
+ */
+static int fscache_objlist_open(struct inode *inode, struct file *file)
+{
+ struct fscache_objlist_data *data;
+ struct seq_file *m;
+ int ret;
+
+ ret = seq_open(file, &fscache_objlist_ops);
+ if (ret < 0)
+ return ret;
+
+ m = file->private_data;
+
+ /* buffer for key extraction */
+ data = kmalloc(sizeof(struct fscache_objlist_data), GFP_KERNEL);
+ if (!data) {
+ seq_release(inode, file);
+ return -ENOMEM;
+ }
+
+ /* get the configuration key */
+ fscache_objlist_config(data);
+
+ m->private = data;
+ return 0;
+}
+
+/*
+ * clean up on close
+ */
+static int fscache_objlist_release(struct inode *inode, struct file *file)
+{
+ struct seq_file *m = file->private_data;
+
+ kfree(m->private);
+ m->private = NULL;
+ return seq_release(inode, file);
+}
+
+const struct file_operations fscache_objlist_fops = {
+ .owner = THIS_MODULE,
+ .open = fscache_objlist_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = fscache_objlist_release,
+};
diff --git a/fs/fscache/object.c b/fs/fscache/object.c
index 615b63d..ad1644f 100644
--- a/fs/fscache/object.c
+++ b/fs/fscache/object.c
@@ -34,7 +34,7 @@ const char *fscache_object_states[FSCACHE_OBJECT__NSTATES] = {
};
EXPORT_SYMBOL(fscache_object_states);

-static const char fscache_object_states_short[FSCACHE_OBJECT__NSTATES][5] = {
+const char fscache_object_states_short[FSCACHE_OBJECT__NSTATES][5] = {
[FSCACHE_OBJECT_INIT] = "INIT",
[FSCACHE_OBJECT_LOOKING_UP] = "LOOK",
[FSCACHE_OBJECT_CREATING] = "CRTN",
diff --git a/fs/fscache/operation.c b/fs/fscache/operation.c
index 91bbe6f..09e43b6 100644
--- a/fs/fscache/operation.c
+++ b/fs/fscache/operation.c
@@ -322,6 +322,9 @@ void fscache_put_operation(struct fscache_operation *op)

object = op->object;

+ if (test_bit(FSCACHE_OP_DEC_READ_CNT, &op->flags))
+ atomic_dec(&object->n_reads);
+
/* now... we may get called with the object spinlock held, so we
* complete the cleanup here only if we can immediately acquire the
* lock, and defer it otherwise */
diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index e8bbc39..c5973e3 100644
--- a/fs/fscache/page.c
+++ b/fs/fscache/page.c
@@ -275,6 +275,9 @@ int __fscache_read_or_alloc_page(struct fscache_cookie *cookie,

ASSERTCMP(object->state, >, FSCACHE_OBJECT_LOOKING_UP);

+ atomic_inc(&object->n_reads);
+ set_bit(FSCACHE_OP_DEC_READ_CNT, &op->op.flags);
+
if (fscache_submit_op(object, &op->op) < 0)
goto nobufs_unlock;
spin_unlock(&cookie->lock);
@@ -386,6 +389,9 @@ int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
object = hlist_entry(cookie->backing_objects.first,
struct fscache_object, cookie_link);

+ atomic_inc(&object->n_reads);
+ set_bit(FSCACHE_OP_DEC_READ_CNT, &op->op.flags);
+
if (fscache_submit_op(object, &op->op) < 0)
goto nobufs_unlock;
spin_unlock(&cookie->lock);
diff --git a/fs/fscache/proc.c b/fs/fscache/proc.c
index beeab44..1d9e495 100644
--- a/fs/fscache/proc.c
+++ b/fs/fscache/proc.c
@@ -37,10 +37,20 @@ int __init fscache_proc_init(void)
goto error_histogram;
#endif

+#ifdef CONFIG_FSCACHE_OBJECT_LIST
+ if (!proc_create("fs/fscache/objects", S_IFREG | 0444, NULL,
+ &fscache_objlist_fops))
+ goto error_objects;
+#endif
+
_leave(" = 0");
return 0;

+#ifdef CONFIG_FSCACHE_OBJECT_LIST
+error_objects:
+#endif
#ifdef CONFIG_FSCACHE_HISTOGRAM
+ remove_proc_entry("fs/fscache/histogram", NULL);
error_histogram:
#endif
#ifdef CONFIG_FSCACHE_STATS
@@ -58,6 +68,9 @@ error_dir:
*/
void fscache_proc_cleanup(void)
{
+#ifdef CONFIG_FSCACHE_OBJECT_LIST
+ remove_proc_entry("fs/fscache/objects", NULL);
+#endif
#ifdef CONFIG_FSCACHE_HISTOGRAM
remove_proc_entry("fs/fscache/histogram", NULL);
#endif
diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
index 7a9847c..184cbdf 100644
--- a/include/linux/fscache-cache.h
+++ b/include/linux/fscache-cache.h
@@ -91,6 +91,8 @@ struct fscache_operation {
#define FSCACHE_OP_WAITING 4 /* cleared when op is woken */
#define FSCACHE_OP_EXCLUSIVE 5 /* exclusive op, other ops must wait */
#define FSCACHE_OP_DEAD 6 /* op is now dead */
+#define FSCACHE_OP_DEC_READ_CNT 7 /* decrement object->n_reads on destruction */
+#define FSCACHE_OP_KEEP_FLAGS 0xc0 /* flags to keep when repurposing an op */

atomic_t usage;
unsigned debug_id; /* debugging ID */
@@ -357,6 +359,7 @@ struct fscache_object {
int n_obj_ops; /* number of object ops outstanding on object */
int n_in_progress; /* number of ops in progress */
int n_exclusive; /* number of exclusive ops queued */
+ atomic_t n_reads; /* number of read ops in progress */
spinlock_t lock; /* state and operations lock */

unsigned long lookup_jif; /* time at which lookup started */
@@ -370,6 +373,7 @@ struct fscache_object {
#define FSCACHE_OBJECT_EV_RELEASE 4 /* T if netfs requested object release */
#define FSCACHE_OBJECT_EV_RETIRE 5 /* T if netfs requested object retirement */
#define FSCACHE_OBJECT_EV_WITHDRAW 6 /* T if cache requested object withdrawal */
+#define FSCACHE_OBJECT_EVENTS_MASK 0x7f /* mask of all events*/

unsigned long flags;
#define FSCACHE_OBJECT_LOCK 0 /* T if object is busy being processed */
@@ -385,6 +389,9 @@ struct fscache_object {
struct list_head dependents; /* FIFO of dependent objects */
struct list_head dep_link; /* link in parent's dependents list */
struct list_head pending_ops; /* unstarted operations on this object */
+#ifdef CONFIG_FSCACHE_OBJECT_LIST
+ struct rb_node objlist_link; /* link in global object list */
+#endif
pgoff_t store_limit; /* current storage limit */
};

@@ -434,6 +441,12 @@ void fscache_object_init(struct fscache_object *object,
extern void fscache_object_lookup_negative(struct fscache_object *object);
extern void fscache_obtained_object(struct fscache_object *object);

+#ifdef CONFIG_FSCACHE_OBJECT_LIST
+extern void fscache_object_destroy(struct fscache_object *object);
+#else
+#define fscache_object_destroy(object) do {} while(0)
+#endif
+
/**
* fscache_object_destroyed - Note destruction of an object in a cache
* @cache: The cache from which the object came

2009-11-19 17:21:35

by David Howells

[permalink] [raw]
Subject: [PATCH 10/28] FS-Cache: Add counters for entry/exit to/from cache operation functions

Count entries to and exits from cache operation table functions. Maintain
each of these as a single counter that is incremented on entry and decremented
on exit, so its value reflects the number of calls currently in progress.
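
Each counter is incremented immediately before the call into the cache backend
and decremented again on return; for example, around attr_changed() (as in the
fs/fscache/page.c hunk below):

    fscache_stat(&fscache_n_cop_attr_changed);
    ret = object->cache->ops->attr_changed(object);
    fscache_stat_d(&fscache_n_cop_attr_changed);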

Signed-off-by: David Howells <[email protected]>
---

Documentation/filesystems/caching/fscache.txt | 16 +++++++++++
fs/fscache/cache.c | 4 +++
fs/fscache/cookie.c | 9 +++++-
fs/fscache/internal.h | 22 +++++++++++++++
fs/fscache/object.c | 26 ++++++++++++++++--
fs/fscache/page.c | 29 ++++++++++++++++----
fs/fscache/stats.c | 37 +++++++++++++++++++++++++
7 files changed, 134 insertions(+), 9 deletions(-)

diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt
index cf13cd3..472398f 100644
--- a/Documentation/filesystems/caching/fscache.txt
+++ b/Documentation/filesystems/caching/fscache.txt
@@ -274,6 +274,22 @@ proc files.
dfr=N Number of async ops queued for deferred release
rel=N Number of async ops released
gc=N Number of deferred-release async ops garbage collected
+ CacheOp alo=N Number of in-progress alloc_object() cache ops
+ luo=N Number of in-progress lookup_object() cache ops
+ luc=N Number of in-progress lookup_complete() cache ops
+ gro=N Number of in-progress grab_object() cache ops
+ upo=N Number of in-progress update_object() cache ops
+ dro=N Number of in-progress drop_object() cache ops
+ pto=N Number of in-progress put_object() cache ops
+ syn=N Number of in-progress sync_cache() cache ops
+ atc=N Number of in-progress attr_changed() cache ops
+ rap=N Number of in-progress read_or_alloc_page() cache ops
+ ras=N Number of in-progress read_or_alloc_pages() cache ops
+ alp=N Number of in-progress allocate_page() cache ops
+ als=N Number of in-progress allocate_pages() cache ops
+ wrp=N Number of in-progress write_page() cache ops
+ ucp=N Number of in-progress uncache_page() cache ops
+ dsp=N Number of in-progress dissociate_pages() cache ops


(*) /proc/fs/fscache/histogram
diff --git a/fs/fscache/cache.c b/fs/fscache/cache.c
index 724384e..6a3c48a 100644
--- a/fs/fscache/cache.c
+++ b/fs/fscache/cache.c
@@ -381,11 +381,15 @@ void fscache_withdraw_cache(struct fscache_cache *cache)

/* make sure all pages pinned by operations on behalf of the netfs are
* written to disk */
+ fscache_stat(&fscache_n_cop_sync_cache);
cache->ops->sync_cache(cache);
+ fscache_stat_d(&fscache_n_cop_sync_cache);

/* dissociate all the netfs pages backed by this cache from the block
* mappings in the cache */
+ fscache_stat(&fscache_n_cop_dissociate_pages);
cache->ops->dissociate_pages(cache);
+ fscache_stat_d(&fscache_n_cop_dissociate_pages);

/* we now have to destroy all the active objects pertaining to this
* cache - which we do by passing them off to thread pool to be
diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
index 9b51873..432482e 100644
--- a/fs/fscache/cookie.c
+++ b/fs/fscache/cookie.c
@@ -249,7 +249,9 @@ static int fscache_alloc_object(struct fscache_cache *cache,

/* ask the cache to allocate an object (we may end up with duplicate
* objects at this stage, but we sort that out later) */
+ fscache_stat(&fscache_n_cop_alloc_object);
object = cache->ops->alloc_object(cache, cookie);
+ fscache_stat_d(&fscache_n_cop_alloc_object);
if (IS_ERR(object)) {
fscache_stat(&fscache_n_object_no_alloc);
ret = PTR_ERR(object);
@@ -270,8 +272,11 @@ static int fscache_alloc_object(struct fscache_cache *cache,
/* only attach if we managed to allocate all we needed, otherwise
* discard the object we just allocated and instead use the one
* attached to the cookie */
- if (fscache_attach_object(cookie, object) < 0)
+ if (fscache_attach_object(cookie, object) < 0) {
+ fscache_stat(&fscache_n_cop_put_object);
cache->ops->put_object(object);
+ fscache_stat_d(&fscache_n_cop_put_object);
+ }

_leave(" = 0");
return 0;
@@ -287,7 +292,9 @@ object_already_extant:
return 0;

error_put:
+ fscache_stat(&fscache_n_cop_put_object);
cache->ops->put_object(object);
+ fscache_stat_d(&fscache_n_cop_put_object);
error:
_leave(" = %d", ret);
return ret;
diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
index fe02973..b85cc89 100644
--- a/fs/fscache/internal.h
+++ b/fs/fscache/internal.h
@@ -208,11 +208,33 @@ extern atomic_t fscache_n_checkaux_okay;
extern atomic_t fscache_n_checkaux_update;
extern atomic_t fscache_n_checkaux_obsolete;

+extern atomic_t fscache_n_cop_alloc_object;
+extern atomic_t fscache_n_cop_lookup_object;
+extern atomic_t fscache_n_cop_lookup_complete;
+extern atomic_t fscache_n_cop_grab_object;
+extern atomic_t fscache_n_cop_update_object;
+extern atomic_t fscache_n_cop_drop_object;
+extern atomic_t fscache_n_cop_put_object;
+extern atomic_t fscache_n_cop_sync_cache;
+extern atomic_t fscache_n_cop_attr_changed;
+extern atomic_t fscache_n_cop_read_or_alloc_page;
+extern atomic_t fscache_n_cop_read_or_alloc_pages;
+extern atomic_t fscache_n_cop_allocate_page;
+extern atomic_t fscache_n_cop_allocate_pages;
+extern atomic_t fscache_n_cop_write_page;
+extern atomic_t fscache_n_cop_uncache_page;
+extern atomic_t fscache_n_cop_dissociate_pages;
+
static inline void fscache_stat(atomic_t *stat)
{
atomic_inc(stat);
}

+static inline void fscache_stat_d(atomic_t *stat)
+{
+ atomic_dec(stat);
+}
+
extern const struct file_operations fscache_stats_fops;
#else

diff --git a/fs/fscache/object.c b/fs/fscache/object.c
index ad1644f..0d65c0c 100644
--- a/fs/fscache/object.c
+++ b/fs/fscache/object.c
@@ -144,13 +144,17 @@ static void fscache_object_state_machine(struct fscache_object *object)
case FSCACHE_OBJECT_UPDATING:
clear_bit(FSCACHE_OBJECT_EV_UPDATE, &object->events);
fscache_stat(&fscache_n_updates_run);
+ fscache_stat(&fscache_n_cop_update_object);
object->cache->ops->update_object(object);
+ fscache_stat_d(&fscache_n_cop_update_object);
goto active_transit;

/* handle an object dying during lookup or creation */
case FSCACHE_OBJECT_LC_DYING:
object->event_mask &= ~(1 << FSCACHE_OBJECT_EV_UPDATE);
+ fscache_stat(&fscache_n_cop_lookup_complete);
object->cache->ops->lookup_complete(object);
+ fscache_stat_d(&fscache_n_cop_lookup_complete);

spin_lock(&object->lock);
object->state = FSCACHE_OBJECT_DYING;
@@ -416,7 +420,9 @@ static void fscache_initialise_object(struct fscache_object *object)
* binding on to us, so we need to make sure we don't
* add ourself to the list multiple times */
if (list_empty(&object->dep_link)) {
+ fscache_stat(&fscache_n_cop_grab_object);
object->cache->ops->grab_object(object);
+ fscache_stat_d(&fscache_n_cop_grab_object);
list_add(&object->dep_link,
&parent->dependents);

@@ -478,7 +484,9 @@ static void fscache_lookup_object(struct fscache_object *object)
object->cache->tag->name);

fscache_stat(&fscache_n_object_lookups);
+ fscache_stat(&fscache_n_cop_lookup_object);
object->cache->ops->lookup_object(object);
+ fscache_stat_d(&fscache_n_cop_lookup_object);

if (test_bit(FSCACHE_OBJECT_EV_ERROR, &object->events))
set_bit(FSCACHE_COOKIE_UNAVAILABLE, &cookie->flags);
@@ -602,7 +610,9 @@ static void fscache_object_available(struct fscache_object *object)
}
spin_unlock(&object->lock);

+ fscache_stat(&fscache_n_cop_lookup_complete);
object->cache->ops->lookup_complete(object);
+ fscache_stat_d(&fscache_n_cop_lookup_complete);
fscache_enqueue_dependents(object);

fscache_hist(fscache_obj_instantiate_histogram, object->lookup_jif);
@@ -625,7 +635,9 @@ static void fscache_drop_object(struct fscache_object *object)
list_del_init(&object->cache_link);
spin_unlock(&cache->object_list_lock);

+ fscache_stat(&fscache_n_cop_drop_object);
cache->ops->drop_object(object);
+ fscache_stat_d(&fscache_n_cop_drop_object);

if (parent) {
_debug("release parent OBJ%x {%d}",
@@ -640,7 +652,9 @@ static void fscache_drop_object(struct fscache_object *object)
}

/* this just shifts the object release to the slow work processor */
+ fscache_stat(&fscache_n_cop_put_object);
object->cache->ops->put_object(object);
+ fscache_stat_d(&fscache_n_cop_put_object);

_leave("");
}
@@ -730,8 +744,12 @@ static int fscache_object_slow_work_get_ref(struct slow_work *work)
{
struct fscache_object *object =
container_of(work, struct fscache_object, work);
+ int ret;

- return object->cache->ops->grab_object(object) ? 0 : -EAGAIN;
+ fscache_stat(&fscache_n_cop_grab_object);
+ ret = object->cache->ops->grab_object(object) ? 0 : -EAGAIN;
+ fscache_stat_d(&fscache_n_cop_grab_object);
+ return ret;
}

/*
@@ -742,7 +760,9 @@ static void fscache_object_slow_work_put_ref(struct slow_work *work)
struct fscache_object *object =
container_of(work, struct fscache_object, work);

- return object->cache->ops->put_object(object);
+ fscache_stat(&fscache_n_cop_put_object);
+ object->cache->ops->put_object(object);
+ fscache_stat_d(&fscache_n_cop_put_object);
}

/*
@@ -779,7 +799,9 @@ static void fscache_enqueue_dependents(struct fscache_object *object)

/* sort onto appropriate lists */
fscache_enqueue_object(dep);
+ fscache_stat(&fscache_n_cop_put_object);
dep->cache->ops->put_object(dep);
+ fscache_stat_d(&fscache_n_cop_put_object);

if (!list_empty(&object->dependents))
cond_resched_lock(&object->lock);
diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index c5973e3..250dfd3 100644
--- a/fs/fscache/page.c
+++ b/fs/fscache/page.c
@@ -71,7 +71,9 @@ static void fscache_attr_changed_op(struct fscache_operation *op)

if (fscache_object_is_active(object)) {
fscache_set_op_state(op, "CallFS");
+ fscache_stat(&fscache_n_cop_attr_changed);
ret = object->cache->ops->attr_changed(object);
+ fscache_stat_d(&fscache_n_cop_attr_changed);
fscache_set_op_state(op, "Done");
if (ret < 0)
fscache_abort_object(object);
@@ -300,11 +302,15 @@ int __fscache_read_or_alloc_page(struct fscache_cookie *cookie,

/* ask the cache to honour the operation */
if (test_bit(FSCACHE_COOKIE_NO_DATA_YET, &object->cookie->flags)) {
+ fscache_stat(&fscache_n_cop_allocate_page);
ret = object->cache->ops->allocate_page(op, page, gfp);
+ fscache_stat_d(&fscache_n_cop_allocate_page);
if (ret == 0)
ret = -ENODATA;
} else {
+ fscache_stat(&fscache_n_cop_read_or_alloc_page);
ret = object->cache->ops->read_or_alloc_page(op, page, gfp);
+ fscache_stat_d(&fscache_n_cop_read_or_alloc_page);
}

if (ret == -ENOMEM)
@@ -358,7 +364,6 @@ int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
void *context,
gfp_t gfp)
{
- fscache_pages_retrieval_func_t func;
struct fscache_retrieval *op;
struct fscache_object *object;
int ret;
@@ -413,11 +418,17 @@ int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
}

/* ask the cache to honour the operation */
- if (test_bit(FSCACHE_COOKIE_NO_DATA_YET, &object->cookie->flags))
- func = object->cache->ops->allocate_pages;
- else
- func = object->cache->ops->read_or_alloc_pages;
- ret = func(op, pages, nr_pages, gfp);
+ if (test_bit(FSCACHE_COOKIE_NO_DATA_YET, &object->cookie->flags)) {
+ fscache_stat(&fscache_n_cop_allocate_pages);
+ ret = object->cache->ops->allocate_pages(
+ op, pages, nr_pages, gfp);
+ fscache_stat_d(&fscache_n_cop_allocate_pages);
+ } else {
+ fscache_stat(&fscache_n_cop_read_or_alloc_pages);
+ ret = object->cache->ops->read_or_alloc_pages(
+ op, pages, nr_pages, gfp);
+ fscache_stat_d(&fscache_n_cop_read_or_alloc_pages);
+ }

if (ret == -ENOMEM)
fscache_stat(&fscache_n_retrievals_nomem);
@@ -500,7 +511,9 @@ int __fscache_alloc_page(struct fscache_cookie *cookie,
}

/* ask the cache to honour the operation */
+ fscache_stat(&fscache_n_cop_allocate_page);
ret = object->cache->ops->allocate_page(op, page, gfp);
+ fscache_stat_d(&fscache_n_cop_allocate_page);

if (ret < 0)
fscache_stat(&fscache_n_allocs_nobufs);
@@ -578,7 +591,9 @@ static void fscache_write_op(struct fscache_operation *_op)

if (page) {
fscache_set_op_state(&op->op, "Store");
+ fscache_stat(&fscache_n_cop_write_page);
ret = object->cache->ops->write_page(op, page);
+ fscache_stat_d(&fscache_n_cop_write_page);
fscache_set_op_state(&op->op, "EndWrite");
fscache_end_page_write(cookie, page);
page_cache_release(page);
@@ -786,7 +801,9 @@ void __fscache_uncache_page(struct fscache_cookie *cookie, struct page *page)
if (TestClearPageFsCache(page) &&
object->cache->ops->uncache_page) {
/* the cache backend releases the cookie lock */
+ fscache_stat(&fscache_n_cop_uncache_page);
object->cache->ops->uncache_page(object, page);
+ fscache_stat_d(&fscache_n_cop_uncache_page);
goto done;
}

diff --git a/fs/fscache/stats.c b/fs/fscache/stats.c
index 65deb99..20233fb 100644
--- a/fs/fscache/stats.c
+++ b/fs/fscache/stats.c
@@ -93,6 +93,23 @@ atomic_t fscache_n_checkaux_okay;
atomic_t fscache_n_checkaux_update;
atomic_t fscache_n_checkaux_obsolete;

+atomic_t fscache_n_cop_alloc_object;
+atomic_t fscache_n_cop_lookup_object;
+atomic_t fscache_n_cop_lookup_complete;
+atomic_t fscache_n_cop_grab_object;
+atomic_t fscache_n_cop_update_object;
+atomic_t fscache_n_cop_drop_object;
+atomic_t fscache_n_cop_put_object;
+atomic_t fscache_n_cop_sync_cache;
+atomic_t fscache_n_cop_attr_changed;
+atomic_t fscache_n_cop_read_or_alloc_page;
+atomic_t fscache_n_cop_read_or_alloc_pages;
+atomic_t fscache_n_cop_allocate_page;
+atomic_t fscache_n_cop_allocate_pages;
+atomic_t fscache_n_cop_write_page;
+atomic_t fscache_n_cop_uncache_page;
+atomic_t fscache_n_cop_dissociate_pages;
+
/*
* display the general statistics
*/
@@ -192,6 +209,26 @@ static int fscache_stats_show(struct seq_file *m, void *v)
atomic_read(&fscache_n_op_deferred_release),
atomic_read(&fscache_n_op_release),
atomic_read(&fscache_n_op_gc));
+
+ seq_printf(m, "CacheOp: alo=%d luo=%d luc=%d gro=%d\n",
+ atomic_read(&fscache_n_cop_alloc_object),
+ atomic_read(&fscache_n_cop_lookup_object),
+ atomic_read(&fscache_n_cop_lookup_complete),
+ atomic_read(&fscache_n_cop_grab_object));
+ seq_printf(m, "CacheOp: upo=%d dro=%d pto=%d atc=%d syn=%d\n",
+ atomic_read(&fscache_n_cop_update_object),
+ atomic_read(&fscache_n_cop_drop_object),
+ atomic_read(&fscache_n_cop_put_object),
+ atomic_read(&fscache_n_cop_attr_changed),
+ atomic_read(&fscache_n_cop_sync_cache));
+ seq_printf(m, "CacheOp: rap=%d ras=%d alp=%d als=%d wrp=%d ucp=%d dsp=%d\n",
+ atomic_read(&fscache_n_cop_read_or_alloc_page),
+ atomic_read(&fscache_n_cop_read_or_alloc_pages),
+ atomic_read(&fscache_n_cop_allocate_page),
+ atomic_read(&fscache_n_cop_allocate_pages),
+ atomic_read(&fscache_n_cop_write_page),
+ atomic_read(&fscache_n_cop_uncache_page),
+ atomic_read(&fscache_n_cop_dissociate_pages));
return 0;
}

2009-11-19 17:21:39

by David Howells

[permalink] [raw]
Subject: [PATCH 11/28] FS-Cache: Clear netfs pointers in cookie after detaching object, not before

Clear the pointers from the fscache_cookie struct to netfs private data after
clearing the pointer to the cookie from the fscache_object struct and
releasing the object lock, rather than before.

This allows the netfs private data pointers to be relied on simply by holding
the object lock, rather than having to hold the cookie lock. This makes
things simpler, as the cookie lock has to be taken before the object lock, but
sometimes the object pointer is all that the code has.
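
For illustration only (not part of the patch), object-side code can now do
something like the following, relying on just the object lock:

	/* Illustrative sketch: with this change, netfs_data remains valid
	 * for as long as object->cookie is non-NULL and object->lock is
	 * held. */
	spin_lock(&object->lock);
	if (object->cookie) {
		void *netfs_data = object->cookie->netfs_data;

		/* ... use netfs_data whilst still under object->lock ... */
	}
	spin_unlock(&object->lock);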

Signed-off-by: David Howells <[email protected]>
---

fs/fscache/cookie.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
index 432482e..b1870a6 100644
--- a/fs/fscache/cookie.c
+++ b/fs/fscache/cookie.c
@@ -437,12 +437,8 @@ void __fscache_relinquish_cookie(struct fscache_cookie *cookie, int retire)

event = retire ? FSCACHE_OBJECT_EV_RETIRE : FSCACHE_OBJECT_EV_RELEASE;

- /* detach pointers back to the netfs */
spin_lock(&cookie->lock);

- cookie->netfs_data = NULL;
- cookie->def = NULL;
-
/* break links with all the active objects */
while (!hlist_empty(&cookie->backing_objects)) {
object = hlist_entry(cookie->backing_objects.first,
@@ -465,6 +461,10 @@ void __fscache_relinquish_cookie(struct fscache_cookie *cookie, int retire)
BUG();
}

+ /* detach pointers back to the netfs */
+ cookie->netfs_data = NULL;
+ cookie->def = NULL;
+
spin_unlock(&cookie->lock);

if (cookie->parent) {

2009-11-19 17:21:46

by David Howells

[permalink] [raw]
Subject: [PATCH 12/28] FS-Cache: Use radix tree preload correctly in tracking of pages to be stored

__fscache_write_page() attempts to load the radix tree preallocation pool for
the CPU it is on before calling radix_tree_insert(), as the insertion must be
done inside a pair of spinlocks.

Use of the preallocation pool, however, is contingent on the radix tree being
initialised without __GFP_WAIT specified. __fscache_acquire_cookie() was
passing GFP_NOFS to INIT_RADIX_TREE() - but that includes __GFP_WAIT.

The solution is to AND out __GFP_WAIT.

Additionally, the banner comment to radix_tree_preload() is altered to make
note of this prerequisite. Possibly there should be a WARN_ON() too.
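
For reference, the preload/insert pattern that __fscache_write_page() depends
on looks roughly like this (simplified sketch, not a literal extract):

	/* Fill the per-CPU preallocation pool; this may sleep.  The later
	 * insertion then cannot fail for want of memory - but only if the
	 * tree was initialised without __GFP_WAIT in its gfp mask. */
	ret = radix_tree_preload(gfp & ~__GFP_HIGHMEM);
	if (ret < 0)
		return ret;

	spin_lock(&cookie->lock);
	ret = radix_tree_insert(&cookie->stores, page->index, page);
	spin_unlock(&cookie->lock);

	radix_tree_preload_end();	/* re-enables preemption */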

Without this fix, I have seen the following recursive deadlock caused by
radix_tree_insert() attempting to allocate memory inside the spinlocked
region, which resulted in FS-Cache being called back into to release memory -
which required the spinlock that was already held.

=============================================
[ INFO: possible recursive locking detected ]
2.6.32-rc6-cachefs #24
---------------------------------------------
nfsiod/7916 is trying to acquire lock:
(&cookie->lock){+.+.-.}, at: [<ffffffffa0076872>] __fscache_uncache_page+0xdb/0x160 [fscache]

but task is already holding lock:
(&cookie->lock){+.+.-.}, at: [<ffffffffa0076acc>] __fscache_write_page+0x15c/0x3f3 [fscache]

other info that might help us debug this:
5 locks held by nfsiod/7916:
#0: (nfsiod){+.+.+.}, at: [<ffffffff81048290>] worker_thread+0x19a/0x2e2
#1: (&task->u.tk_work#2){+.+.+.}, at: [<ffffffff81048290>] worker_thread+0x19a/0x2e2
#2: (&cookie->lock){+.+.-.}, at: [<ffffffffa0076acc>] __fscache_write_page+0x15c/0x3f3 [fscache]
#3: (&object->lock#2){+.+.-.}, at: [<ffffffffa0076b07>] __fscache_write_page+0x197/0x3f3 [fscache]
#4: (&cookie->stores_lock){+.+...}, at: [<ffffffffa0076b0f>] __fscache_write_page+0x19f/0x3f3 [fscache]

stack backtrace:
Pid: 7916, comm: nfsiod Not tainted 2.6.32-rc6-cachefs #24
Call Trace:
[<ffffffff8105ac7f>] __lock_acquire+0x1649/0x16e3
[<ffffffff81059ded>] ? __lock_acquire+0x7b7/0x16e3
[<ffffffff8100e27d>] ? dump_trace+0x248/0x257
[<ffffffff8105ad70>] lock_acquire+0x57/0x6d
[<ffffffffa0076872>] ? __fscache_uncache_page+0xdb/0x160 [fscache]
[<ffffffff8135467c>] _spin_lock+0x2c/0x3b
[<ffffffffa0076872>] ? __fscache_uncache_page+0xdb/0x160 [fscache]
[<ffffffffa0076872>] __fscache_uncache_page+0xdb/0x160 [fscache]
[<ffffffffa0077eb7>] ? __fscache_check_page_write+0x0/0x71 [fscache]
[<ffffffffa00b4755>] nfs_fscache_release_page+0x86/0xc4 [nfs]
[<ffffffffa00907f0>] nfs_release_page+0x3c/0x41 [nfs]
[<ffffffff81087ffb>] try_to_release_page+0x32/0x3b
[<ffffffff81092c2b>] shrink_page_list+0x316/0x4ac
[<ffffffff81058a9b>] ? mark_held_locks+0x52/0x70
[<ffffffff8135451b>] ? _spin_unlock_irq+0x2b/0x31
[<ffffffff81093153>] shrink_inactive_list+0x392/0x67c
[<ffffffff81058a9b>] ? mark_held_locks+0x52/0x70
[<ffffffff810934ca>] shrink_list+0x8d/0x8f
[<ffffffff81093744>] shrink_zone+0x278/0x33c
[<ffffffff81052c70>] ? ktime_get_ts+0xad/0xba
[<ffffffff8109453b>] try_to_free_pages+0x22e/0x392
[<ffffffff8109184c>] ? isolate_pages_global+0x0/0x212
[<ffffffff8108e16b>] __alloc_pages_nodemask+0x3dc/0x5cf
[<ffffffff810ae24a>] cache_alloc_refill+0x34d/0x6c1
[<ffffffff811bcf74>] ? radix_tree_node_alloc+0x52/0x5c
[<ffffffff810ae929>] kmem_cache_alloc+0xb2/0x118
[<ffffffff811bcf74>] radix_tree_node_alloc+0x52/0x5c
[<ffffffff811bcfd5>] radix_tree_insert+0x57/0x19c
[<ffffffffa0076b53>] __fscache_write_page+0x1e3/0x3f3 [fscache]
[<ffffffffa00b4248>] __nfs_readpage_to_fscache+0x58/0x11e [nfs]
[<ffffffffa009bb77>] nfs_readpage_release+0x34/0x9b [nfs]
[<ffffffffa009c0d9>] nfs_readpage_release_full+0x32/0x4b [nfs]
[<ffffffffa0006cff>] rpc_release_calldata+0x12/0x14 [sunrpc]
[<ffffffffa0006e2d>] rpc_free_task+0x59/0x61 [sunrpc]
[<ffffffffa0006f03>] rpc_async_release+0x10/0x12 [sunrpc]
[<ffffffff810482e5>] worker_thread+0x1ef/0x2e2
[<ffffffff81048290>] ? worker_thread+0x19a/0x2e2
[<ffffffff81352433>] ? thread_return+0x3e/0x101
[<ffffffffa0006ef3>] ? rpc_async_release+0x0/0x12 [sunrpc]
[<ffffffff8104bff5>] ? autoremove_wake_function+0x0/0x34
[<ffffffff81058d25>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff810480f6>] ? worker_thread+0x0/0x2e2
[<ffffffff8104bd21>] kthread+0x7a/0x82
[<ffffffff8100beda>] child_rip+0xa/0x20
[<ffffffff8100b87c>] ? restore_args+0x0/0x30
[<ffffffff8104c2b9>] ? add_wait_queue+0x15/0x44
[<ffffffff8104bca7>] ? kthread+0x0/0x82
[<ffffffff8100bed0>] ? child_rip+0x0/0x20


Signed-off-by: David Howells <[email protected]>
---

fs/fscache/cookie.c | 4 +++-
lib/radix-tree.c | 3 +++
2 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
index b1870a6..e6854f5 100644
--- a/fs/fscache/cookie.c
+++ b/fs/fscache/cookie.c
@@ -102,7 +102,9 @@ struct fscache_cookie *__fscache_acquire_cookie(
cookie->netfs_data = netfs_data;
cookie->flags = 0;

- INIT_RADIX_TREE(&cookie->stores, GFP_NOFS);
+ /* radix tree insertion won't use the preallocation pool unless it's
+ * told it may not wait */
+ INIT_RADIX_TREE(&cookie->stores, GFP_NOFS & ~__GFP_WAIT);

switch (cookie->def->type) {
case FSCACHE_COOKIE_TYPE_INDEX:
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 23abbd9..ae61068 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -200,6 +200,9 @@ radix_tree_node_free(struct radix_tree_node *node)
* ensure that the addition of a single element in the tree cannot fail. On
* success, return zero, with preemption disabled. On error, return -ENOMEM
* with preemption not disabled.
+ *
+ * To make use of this facility, the radix tree must be initialised without
+ * __GFP_WAIT being passed to INIT_RADIX_TREE().
*/
int radix_tree_preload(gfp_t gfp_mask)
{

2009-11-19 17:21:53

by David Howells

[permalink] [raw]
Subject: [PATCH 13/28] FS-Cache: Permit cache retrieval ops to be interrupted in the initial wait phase

Permit operations that retrieve data from the cache or allocate space in the
cache for future writes to be interrupted whilst they're waiting for permission
to proceed. Typically this wait occurs whilst the cache object is being looked
up on disk in the background.

If an interruption occurs, and the operation has not yet been given the
go-ahead to run, the operation is dequeued and cancelled, and control returns
to the netfs's read routine with none of the requested pages having been read
or in any way marked as known by the cache.

This means that the initial wait is done interruptibly rather than
uninterruptibly.

In addition, extra stats values are made available to show the number of ops
cancelled and the number of cache space allocations interrupted.
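
The wait-and-cancel pattern used at each of the affected call sites boils down
to the following (condensed sketch of the common logic):

	/* Wait interruptibly for permission to run; if interrupted, try to
	 * cancel the still-pending op.  If someone else already dequeued it,
	 * it will run shortly, so fall back to an uninterruptible wait. */
	if (wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
			fscache_wait_bit_interruptible,
			TASK_INTERRUPTIBLE) < 0) {
		if (fscache_cancel_op(&op->op) == 0) {
			ret = -ERESTARTSYS;
			goto error;
		}
		wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
			    fscache_wait_bit, TASK_UNINTERRUPTIBLE);
	}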

Signed-off-by: David Howells <[email protected]>
---

Documentation/filesystems/caching/fscache.txt | 2 +
fs/fscache/internal.h | 3 +
fs/fscache/operation.c | 82 ++++++++++++++++---------
fs/fscache/page.c | 55 +++++++++++++++--
fs/fscache/stats.c | 12 ++--
5 files changed, 115 insertions(+), 39 deletions(-)

diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt
index 472398f..21d20bf 100644
--- a/Documentation/filesystems/caching/fscache.txt
+++ b/Documentation/filesystems/caching/fscache.txt
@@ -250,6 +250,7 @@ proc files.
ok=N Number of successful alloc reqs
wt=N Number of alloc reqs that waited on lookup completion
nbf=N Number of alloc reqs rejected -ENOBUFS
+ int=N Number of alloc reqs aborted -ERESTARTSYS
ops=N Number of alloc reqs submitted
owt=N Number of alloc reqs waited for CPU time
Retrvls n=N Number of retrieval (read) requests seen
@@ -271,6 +272,7 @@ proc files.
Ops pend=N Number of times async ops added to pending queues
run=N Number of times async ops given CPU time
enq=N Number of times async ops queued for processing
+ can=N Number of async ops cancelled
dfr=N Number of async ops queued for deferred release
rel=N Number of async ops released
gc=N Number of deferred-release async ops garbage collected
diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
index b85cc89..50324ad 100644
--- a/fs/fscache/internal.h
+++ b/fs/fscache/internal.h
@@ -112,6 +112,7 @@ extern int fscache_submit_exclusive_op(struct fscache_object *,
struct fscache_operation *);
extern int fscache_submit_op(struct fscache_object *,
struct fscache_operation *);
+extern int fscache_cancel_op(struct fscache_operation *);
extern void fscache_abort_object(struct fscache_object *);
extern void fscache_start_operations(struct fscache_object *);
extern void fscache_operation_gc(struct work_struct *);
@@ -140,6 +141,7 @@ extern atomic_t fscache_n_op_enqueue;
extern atomic_t fscache_n_op_deferred_release;
extern atomic_t fscache_n_op_release;
extern atomic_t fscache_n_op_gc;
+extern atomic_t fscache_n_op_cancelled;

extern atomic_t fscache_n_attr_changed;
extern atomic_t fscache_n_attr_changed_ok;
@@ -151,6 +153,7 @@ extern atomic_t fscache_n_allocs;
extern atomic_t fscache_n_allocs_ok;
extern atomic_t fscache_n_allocs_wait;
extern atomic_t fscache_n_allocs_nobufs;
+extern atomic_t fscache_n_allocs_intr;
extern atomic_t fscache_n_alloc_ops;
extern atomic_t fscache_n_alloc_op_waits;

diff --git a/fs/fscache/operation.c b/fs/fscache/operation.c
index 09e43b6..296492e 100644
--- a/fs/fscache/operation.c
+++ b/fs/fscache/operation.c
@@ -34,32 +34,31 @@ void fscache_enqueue_operation(struct fscache_operation *op)

fscache_set_op_state(op, "EnQ");

+ ASSERT(list_empty(&op->pend_link));
ASSERT(op->processor != NULL);
ASSERTCMP(op->object->state, >=, FSCACHE_OBJECT_AVAILABLE);
ASSERTCMP(atomic_read(&op->usage), >, 0);

- if (list_empty(&op->pend_link)) {
- switch (op->flags & FSCACHE_OP_TYPE) {
- case FSCACHE_OP_FAST:
- _debug("queue fast");
- atomic_inc(&op->usage);
- if (!schedule_work(&op->fast_work))
- fscache_put_operation(op);
- break;
- case FSCACHE_OP_SLOW:
- _debug("queue slow");
- slow_work_enqueue(&op->slow_work);
- break;
- case FSCACHE_OP_MYTHREAD:
- _debug("queue for caller's attention");
- break;
- default:
- printk(KERN_ERR "FS-Cache: Unexpected op type %lx",
- op->flags);
- BUG();
- break;
- }
- fscache_stat(&fscache_n_op_enqueue);
+ fscache_stat(&fscache_n_op_enqueue);
+ switch (op->flags & FSCACHE_OP_TYPE) {
+ case FSCACHE_OP_FAST:
+ _debug("queue fast");
+ atomic_inc(&op->usage);
+ if (!schedule_work(&op->fast_work))
+ fscache_put_operation(op);
+ break;
+ case FSCACHE_OP_SLOW:
+ _debug("queue slow");
+ slow_work_enqueue(&op->slow_work);
+ break;
+ case FSCACHE_OP_MYTHREAD:
+ _debug("queue for caller's attention");
+ break;
+ default:
+ printk(KERN_ERR "FS-Cache: Unexpected op type %lx",
+ op->flags);
+ BUG();
+ break;
}
}
EXPORT_SYMBOL(fscache_enqueue_operation);
@@ -97,6 +96,7 @@ int fscache_submit_exclusive_op(struct fscache_object *object,
spin_lock(&object->lock);
ASSERTCMP(object->n_ops, >=, object->n_in_progress);
ASSERTCMP(object->n_ops, >=, object->n_exclusive);
+ ASSERT(list_empty(&op->pend_link));

ret = -ENOBUFS;
if (fscache_object_is_active(object)) {
@@ -202,6 +202,7 @@ int fscache_submit_op(struct fscache_object *object,
spin_lock(&object->lock);
ASSERTCMP(object->n_ops, >=, object->n_in_progress);
ASSERTCMP(object->n_ops, >=, object->n_exclusive);
+ ASSERT(list_empty(&op->pend_link));

ostate = object->state;
smp_rmb();
@@ -273,12 +274,7 @@ void fscache_start_operations(struct fscache_object *object)
stop = true;
}
list_del_init(&op->pend_link);
- object->n_in_progress++;
-
- if (test_and_clear_bit(FSCACHE_OP_WAITING, &op->flags))
- wake_up_bit(&op->flags, FSCACHE_OP_WAITING);
- if (op->processor)
- fscache_enqueue_operation(op);
+ fscache_run_op(object, op);

/* the pending queue was holding a ref on the object */
fscache_put_operation(op);
@@ -291,6 +287,36 @@ void fscache_start_operations(struct fscache_object *object)
}

/*
+ * cancel an operation that's pending on an object
+ */
+int fscache_cancel_op(struct fscache_operation *op)
+{
+ struct fscache_object *object = op->object;
+ int ret;
+
+ _enter("OBJ%x OP%x}", op->object->debug_id, op->debug_id);
+
+ spin_lock(&object->lock);
+
+ ret = -EBUSY;
+ if (!list_empty(&op->pend_link)) {
+ fscache_stat(&fscache_n_op_cancelled);
+ list_del_init(&op->pend_link);
+ object->n_ops--;
+ if (test_bit(FSCACHE_OP_EXCLUSIVE, &op->flags))
+ object->n_exclusive--;
+ if (test_and_clear_bit(FSCACHE_OP_WAITING, &op->flags))
+ wake_up_bit(&op->flags, FSCACHE_OP_WAITING);
+ fscache_put_operation(op);
+ ret = 0;
+ }
+
+ spin_unlock(&object->lock);
+ _leave(" = %d", ret);
+ return ret;
+}
+
+/*
* release an operation
* - queues pending ops if this is the last in-progress op
*/
diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index 250dfd3..e6f2e61 100644
--- a/fs/fscache/page.c
+++ b/fs/fscache/page.c
@@ -295,8 +295,20 @@ int __fscache_read_or_alloc_page(struct fscache_cookie *cookie,
if (test_bit(FSCACHE_OP_WAITING, &op->op.flags)) {
_debug(">>> WT");
fscache_stat(&fscache_n_retrieval_op_waits);
- wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
- fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+ if (wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
+ fscache_wait_bit_interruptible,
+ TASK_INTERRUPTIBLE) < 0) {
+ ret = fscache_cancel_op(&op->op);
+ if (ret == 0) {
+ ret = -ERESTARTSYS;
+ goto error;
+ }
+
+ /* it's been removed from the pending queue by another
+ * party, so we should get to run shortly */
+ wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
+ fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+ }
_debug("<<< GO");
}

@@ -313,6 +325,7 @@ int __fscache_read_or_alloc_page(struct fscache_cookie *cookie,
fscache_stat_d(&fscache_n_cop_read_or_alloc_page);
}

+error:
if (ret == -ENOMEM)
fscache_stat(&fscache_n_retrievals_nomem);
else if (ret == -ERESTARTSYS)
@@ -412,8 +425,20 @@ int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
if (test_bit(FSCACHE_OP_WAITING, &op->op.flags)) {
_debug(">>> WT");
fscache_stat(&fscache_n_retrieval_op_waits);
- wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
- fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+ if (wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
+ fscache_wait_bit_interruptible,
+ TASK_INTERRUPTIBLE) < 0) {
+ ret = fscache_cancel_op(&op->op);
+ if (ret == 0) {
+ ret = -ERESTARTSYS;
+ goto error;
+ }
+
+ /* it's been removed from the pending queue by another
+ * party, so we should get to run shortly */
+ wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
+ fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+ }
_debug("<<< GO");
}

@@ -430,6 +455,7 @@ int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
fscache_stat_d(&fscache_n_cop_read_or_alloc_pages);
}

+error:
if (ret == -ENOMEM)
fscache_stat(&fscache_n_retrievals_nomem);
else if (ret == -ERESTARTSYS)
@@ -505,8 +531,20 @@ int __fscache_alloc_page(struct fscache_cookie *cookie,
if (test_bit(FSCACHE_OP_WAITING, &op->op.flags)) {
_debug(">>> WT");
fscache_stat(&fscache_n_alloc_op_waits);
- wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
- fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+ if (wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
+ fscache_wait_bit_interruptible,
+ TASK_INTERRUPTIBLE) < 0) {
+ ret = fscache_cancel_op(&op->op);
+ if (ret == 0) {
+ ret = -ERESTARTSYS;
+ goto error;
+ }
+
+ /* it's been removed from the pending queue by another
+ * party, so we should get to run shortly */
+ wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
+ fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+ }
_debug("<<< GO");
}

@@ -515,7 +553,10 @@ int __fscache_alloc_page(struct fscache_cookie *cookie,
ret = object->cache->ops->allocate_page(op, page, gfp);
fscache_stat_d(&fscache_n_cop_allocate_page);

- if (ret < 0)
+error:
+ if (ret == -ERESTARTSYS)
+ fscache_stat(&fscache_n_allocs_intr);
+ else if (ret < 0)
fscache_stat(&fscache_n_allocs_nobufs);
else
fscache_stat(&fscache_n_allocs_ok);
diff --git a/fs/fscache/stats.c b/fs/fscache/stats.c
index 20233fb..4c07439 100644
--- a/fs/fscache/stats.c
+++ b/fs/fscache/stats.c
@@ -25,6 +25,7 @@ atomic_t fscache_n_op_requeue;
atomic_t fscache_n_op_deferred_release;
atomic_t fscache_n_op_release;
atomic_t fscache_n_op_gc;
+atomic_t fscache_n_op_cancelled;

atomic_t fscache_n_attr_changed;
atomic_t fscache_n_attr_changed_ok;
@@ -36,6 +37,7 @@ atomic_t fscache_n_allocs;
atomic_t fscache_n_allocs_ok;
atomic_t fscache_n_allocs_wait;
atomic_t fscache_n_allocs_nobufs;
+atomic_t fscache_n_allocs_intr;
atomic_t fscache_n_alloc_ops;
atomic_t fscache_n_alloc_op_waits;

@@ -169,11 +171,12 @@ static int fscache_stats_show(struct seq_file *m, void *v)
atomic_read(&fscache_n_attr_changed_nomem),
atomic_read(&fscache_n_attr_changed_calls));

- seq_printf(m, "Allocs : n=%u ok=%u wt=%u nbf=%u\n",
+ seq_printf(m, "Allocs : n=%u ok=%u wt=%u nbf=%u int=%u\n",
atomic_read(&fscache_n_allocs),
atomic_read(&fscache_n_allocs_ok),
atomic_read(&fscache_n_allocs_wait),
- atomic_read(&fscache_n_allocs_nobufs));
+ atomic_read(&fscache_n_allocs_nobufs),
+ atomic_read(&fscache_n_allocs_intr));
seq_printf(m, "Allocs : ops=%u owt=%u\n",
atomic_read(&fscache_n_alloc_ops),
atomic_read(&fscache_n_alloc_op_waits));
@@ -201,10 +204,11 @@ static int fscache_stats_show(struct seq_file *m, void *v)
atomic_read(&fscache_n_store_ops),
atomic_read(&fscache_n_store_calls));

- seq_printf(m, "Ops : pend=%u run=%u enq=%u\n",
+ seq_printf(m, "Ops : pend=%u run=%u enq=%u can=%u\n",
atomic_read(&fscache_n_op_pend),
atomic_read(&fscache_n_op_run),
- atomic_read(&fscache_n_op_enqueue));
+ atomic_read(&fscache_n_op_enqueue),
+ atomic_read(&fscache_n_op_cancelled));
seq_printf(m, "Ops : dfr=%u rel=%u gc=%u\n",
atomic_read(&fscache_n_op_deferred_release),
atomic_read(&fscache_n_op_release),

2009-11-19 17:24:54

by David Howells

[permalink] [raw]
Subject: [PATCH 14/28] FS-Cache: The object-available state can't rely on the cookie to be available

The object-available state in the object processing state machine (as
processed by fscache_object_available()) can't rely on the cookie to be
available because the FSCACHE_COOKIE_CREATING bit may have been cleared by
fscache_obtained_object() prior to the object being put into the
FSCACHE_OBJECT_AVAILABLE state.

Clearing the FSCACHE_COOKIE_CREATING bit on a cookie permits
__fscache_relinquish_cookie() to proceed and detach the cookie from the
object.

To deal with this, we don't dereference object->cookie in
fscache_object_available() if the object has already been detached.

In addition, a couple of assertions are added into fscache_drop_object() to
make sure the object is unbound from the cookie before it gets there.

Signed-off-by: David Howells <[email protected]>
---

fs/fscache/object.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/fscache/object.c b/fs/fscache/object.c
index 0d65c0c..1a1afa8 100644
--- a/fs/fscache/object.c
+++ b/fs/fscache/object.c
@@ -158,7 +158,8 @@ static void fscache_object_state_machine(struct fscache_object *object)

spin_lock(&object->lock);
object->state = FSCACHE_OBJECT_DYING;
- if (test_and_clear_bit(FSCACHE_COOKIE_CREATING,
+ if (object->cookie &&
+ test_and_clear_bit(FSCACHE_COOKIE_CREATING,
&object->cookie->flags))
wake_up_bit(&object->cookie->flags,
FSCACHE_COOKIE_CREATING);
@@ -594,7 +595,8 @@ static void fscache_object_available(struct fscache_object *object)

spin_lock(&object->lock);

- if (test_and_clear_bit(FSCACHE_COOKIE_CREATING, &object->cookie->flags))
+ if (object->cookie &&
+ test_and_clear_bit(FSCACHE_COOKIE_CREATING, &object->cookie->flags))
wake_up_bit(&object->cookie->flags, FSCACHE_COOKIE_CREATING);

fscache_done_parent_op(object);
@@ -631,6 +633,9 @@ static void fscache_drop_object(struct fscache_object *object)

_enter("{OBJ%x,%d}", object->debug_id, object->n_children);

+ ASSERTCMP(object->cookie, ==, NULL);
+ ASSERT(hlist_unhashed(&object->cookie_link));
+
spin_lock(&cache->object_list_lock);
list_del_init(&object->cache_link);
spin_unlock(&cache->object_list_lock);

2009-11-19 17:21:59

by David Howells

[permalink] [raw]
Subject: [PATCH 15/28] FS-Cache: Fix lock misorder in fscache_write_op()

FS-Cache has two structs for keeping track of the internal state of a cached
file: the fscache_cookie struct, which represents the netfs's state, and the
fscache_object struct, which represents the cache's state. Each has a pointer
that points to the other (when both are in existence), and each has a spinlock
for pointer maintenance.

Since netfs operations approach these structures from the cookie side, they get
the cookie lock first, then the object lock. Cache operations, on the other
hand, approach from the object side, and get the object lock first. It is not
then permitted for a cache operation to get the cookie lock whilst it is
holding the object lock lest deadlock occur; instead, it must do one of two
things:

(1) increment the cookie usage counter, drop the object lock and then get both
locks in order, or

(2) simply hold the object lock as certain parts of the cookie may not be
altered whilst the object lock is held.

It is also not permitted to follow either pointer without holding the lock at
the end you start with. To break the pointers between the cookie and the
object, both locks must be held.


fscache_write_op(), however, violates the locking rules: it attempts to get the
cookie lock without (a) checking that the cookie pointer is valid, and
(b) holding the object lock to protect the cookie pointer whilst it follows it.
This is so that it can access the pending page-store tree without interference
from __fscache_write_page().

This is fixed by splitting the cookie lock, such that the page store tracking
tree is protected by its own lock, and checking that the cookie pointer is
non-NULL before we attempt to follow it whilst holding the object lock.

The new lock is subordinate to both the cookie lock and the object lock, and so
should be taken after those.
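
In sketch form (illustrative only, not a literal extract), the nesting when
all three locks are needed from the netfs side is:

	spin_lock(&cookie->lock);
	spin_lock(&object->lock);
	spin_lock(&cookie->stores_lock);
	/* ... manipulate the page-store tracking tree ... */
	spin_unlock(&cookie->stores_lock);
	spin_unlock(&object->lock);
	spin_unlock(&cookie->lock);

whilst fscache_write_op(), coming from the cache side, now only takes:

	spin_lock(&object->lock);
	cookie = object->cookie;
	if (cookie) {
		spin_lock(&cookie->stores_lock);
		/* ... */
		spin_unlock(&cookie->stores_lock);
	}
	spin_unlock(&object->lock);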

Signed-off-by: David Howells <[email protected]>
---

Documentation/filesystems/caching/fscache.txt | 3 +
fs/fscache/cookie.c | 1
fs/fscache/internal.h | 4 ++
fs/fscache/page.c | 52 +++++++++++++++++--------
fs/fscache/stats.c | 10 ++++-
include/linux/fscache-cache.h | 1
6 files changed, 52 insertions(+), 19 deletions(-)

diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt
index 21d20bf..cd2b4c9 100644
--- a/Documentation/filesystems/caching/fscache.txt
+++ b/Documentation/filesystems/caching/fscache.txt
@@ -269,6 +269,9 @@ proc files.
oom=N Number of store reqs failed -ENOMEM
ops=N Number of store reqs submitted
run=N Number of store reqs granted CPU time
+ pgs=N Number of pages given store req processing time
+ rxd=N Number of store reqs deleted from tracking tree
+ olm=N Number of store reqs over store limit
Ops pend=N Number of times async ops added to pending queues
run=N Number of times async ops given CPU time
enq=N Number of times async ops queued for processing
diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
index e6854f5..f979659 100644
--- a/fs/fscache/cookie.c
+++ b/fs/fscache/cookie.c
@@ -36,6 +36,7 @@ void fscache_cookie_init_once(void *_cookie)

memset(cookie, 0, sizeof(*cookie));
spin_lock_init(&cookie->lock);
+ spin_lock_init(&cookie->stores_lock);
INIT_HLIST_HEAD(&cookie->backing_objects);
}

diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
index 50324ad..ba1853f 100644
--- a/fs/fscache/internal.h
+++ b/fs/fscache/internal.h
@@ -17,6 +17,7 @@
* - cache->object_list_lock
* - object->lock
* - object->parent->lock
+ * - cookie->stores_lock
* - fscache_thread_lock
*
*/
@@ -174,6 +175,9 @@ extern atomic_t fscache_n_stores_nobufs;
extern atomic_t fscache_n_stores_oom;
extern atomic_t fscache_n_store_ops;
extern atomic_t fscache_n_store_calls;
+extern atomic_t fscache_n_store_pages;
+extern atomic_t fscache_n_store_radix_deletes;
+extern atomic_t fscache_n_store_pages_over_limit;

extern atomic_t fscache_n_marks;
extern atomic_t fscache_n_uncaches;
diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index e6f2e61..3ea8897 100644
--- a/fs/fscache/page.c
+++ b/fs/fscache/page.c
@@ -45,16 +45,26 @@ EXPORT_SYMBOL(__fscache_wait_on_page_write);
/*
* note that a page has finished being written to the cache
*/
-static void fscache_end_page_write(struct fscache_cookie *cookie, struct page *page)
+static void fscache_end_page_write(struct fscache_object *object,
+ struct page *page)
{
- struct page *xpage;
+ struct fscache_cookie *cookie;
+ struct page *xpage = NULL;

- spin_lock(&cookie->lock);
- xpage = radix_tree_delete(&cookie->stores, page->index);
- spin_unlock(&cookie->lock);
- ASSERT(xpage != NULL);
-
- wake_up_bit(&cookie->flags, 0);
+ spin_lock(&object->lock);
+ cookie = object->cookie;
+ if (cookie) {
+ /* delete the page from the tree if it is now no longer
+ * pending */
+ spin_lock(&cookie->stores_lock);
+ fscache_stat(&fscache_n_store_radix_deletes);
+ xpage = radix_tree_delete(&cookie->stores, page->index);
+ spin_unlock(&cookie->stores_lock);
+ wake_up_bit(&cookie->flags, 0);
+ }
+ spin_unlock(&object->lock);
+ if (xpage)
+ page_cache_release(xpage);
}

/*
@@ -591,7 +601,7 @@ static void fscache_write_op(struct fscache_operation *_op)
struct fscache_storage *op =
container_of(_op, struct fscache_storage, op);
struct fscache_object *object = op->op.object;
- struct fscache_cookie *cookie = object->cookie;
+ struct fscache_cookie *cookie;
struct page *page;
unsigned n;
void *results[1];
@@ -601,16 +611,17 @@ static void fscache_write_op(struct fscache_operation *_op)

fscache_set_op_state(&op->op, "GetPage");

- spin_lock(&cookie->lock);
spin_lock(&object->lock);
+ cookie = object->cookie;

- if (!fscache_object_is_active(object)) {
+ if (!fscache_object_is_active(object) || !cookie) {
spin_unlock(&object->lock);
- spin_unlock(&cookie->lock);
_leave("");
return;
}

+ spin_lock(&cookie->stores_lock);
+
fscache_stat(&fscache_n_store_calls);

/* find a page to store */
@@ -621,23 +632,25 @@ static void fscache_write_op(struct fscache_operation *_op)
goto superseded;
page = results[0];
_debug("gang %d [%lx]", n, page->index);
- if (page->index > op->store_limit)
+ if (page->index > op->store_limit) {
+ fscache_stat(&fscache_n_store_pages_over_limit);
goto superseded;
+ }

radix_tree_tag_clear(&cookie->stores, page->index,
FSCACHE_COOKIE_PENDING_TAG);

+ spin_unlock(&cookie->stores_lock);
spin_unlock(&object->lock);
- spin_unlock(&cookie->lock);

if (page) {
fscache_set_op_state(&op->op, "Store");
+ fscache_stat(&fscache_n_store_pages);
fscache_stat(&fscache_n_cop_write_page);
ret = object->cache->ops->write_page(op, page);
fscache_stat_d(&fscache_n_cop_write_page);
fscache_set_op_state(&op->op, "EndWrite");
- fscache_end_page_write(cookie, page);
- page_cache_release(page);
+ fscache_end_page_write(object, page);
if (ret < 0) {
fscache_set_op_state(&op->op, "Abort");
fscache_abort_object(object);
@@ -653,9 +666,9 @@ superseded:
/* this writer is going away and there aren't any more things to
* write */
_debug("cease");
+ spin_unlock(&cookie->stores_lock);
clear_bit(FSCACHE_OBJECT_PENDING_WRITE, &object->flags);
spin_unlock(&object->lock);
- spin_unlock(&cookie->lock);
_leave("");
}

@@ -731,6 +744,7 @@ int __fscache_write_page(struct fscache_cookie *cookie,
/* add the page to the pending-storage radix tree on the backing
* object */
spin_lock(&object->lock);
+ spin_lock(&cookie->stores_lock);

_debug("store limit %llx", (unsigned long long) object->store_limit);

@@ -751,6 +765,7 @@ int __fscache_write_page(struct fscache_cookie *cookie,
if (test_and_set_bit(FSCACHE_OBJECT_PENDING_WRITE, &object->flags))
goto already_pending;

+ spin_unlock(&cookie->stores_lock);
spin_unlock(&object->lock);

op->op.debug_id = atomic_inc_return(&fscache_op_debug_id);
@@ -772,6 +787,7 @@ int __fscache_write_page(struct fscache_cookie *cookie,
already_queued:
fscache_stat(&fscache_n_stores_again);
already_pending:
+ spin_unlock(&cookie->stores_lock);
spin_unlock(&object->lock);
spin_unlock(&cookie->lock);
radix_tree_preload_end();
@@ -781,7 +797,9 @@ already_pending:
return 0;

submit_failed:
+ spin_lock(&cookie->stores_lock);
radix_tree_delete(&cookie->stores, page->index);
+ spin_unlock(&cookie->stores_lock);
page_cache_release(page);
ret = -ENOBUFS;
goto nobufs;
diff --git a/fs/fscache/stats.c b/fs/fscache/stats.c
index 4c07439..1d53ea6 100644
--- a/fs/fscache/stats.c
+++ b/fs/fscache/stats.c
@@ -58,6 +58,9 @@ atomic_t fscache_n_stores_nobufs;
atomic_t fscache_n_stores_oom;
atomic_t fscache_n_store_ops;
atomic_t fscache_n_store_calls;
+atomic_t fscache_n_store_pages;
+atomic_t fscache_n_store_radix_deletes;
+atomic_t fscache_n_store_pages_over_limit;

atomic_t fscache_n_marks;
atomic_t fscache_n_uncaches;
@@ -200,9 +203,12 @@ static int fscache_stats_show(struct seq_file *m, void *v)
atomic_read(&fscache_n_stores_again),
atomic_read(&fscache_n_stores_nobufs),
atomic_read(&fscache_n_stores_oom));
- seq_printf(m, "Stores : ops=%u run=%u\n",
+ seq_printf(m, "Stores : ops=%u run=%u pgs=%u rxd=%u olm=%u\n",
atomic_read(&fscache_n_store_ops),
- atomic_read(&fscache_n_store_calls));
+ atomic_read(&fscache_n_store_calls),
+ atomic_read(&fscache_n_store_pages),
+ atomic_read(&fscache_n_store_radix_deletes),
+ atomic_read(&fscache_n_store_pages_over_limit));

seq_printf(m, "Ops : pend=%u run=%u enq=%u can=%u\n",
atomic_read(&fscache_n_op_pend),
diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
index 184cbdf..f3aa4bd 100644
--- a/include/linux/fscache-cache.h
+++ b/include/linux/fscache-cache.h
@@ -310,6 +310,7 @@ struct fscache_cookie {
atomic_t usage; /* number of users of this cookie */
atomic_t n_children; /* number of children of this cookie */
spinlock_t lock;
+ spinlock_t stores_lock; /* lock on page store tree */
struct hlist_head backing_objects; /* object(s) backing this file/index */
const struct fscache_cookie_def *def; /* definition */
struct fscache_cookie *parent; /* parent of this entry */

2009-11-19 17:22:11

by David Howells

[permalink] [raw]
Subject: [PATCH 16/28] FS-Cache: Don't delete pending pages from the page-store tracking tree

Don't delete pending pages from the page-store tracking tree, but rather send
them for another write as they've presumably been updated.

Signed-off-by: David Howells <[email protected]>
---

fs/fscache/page.c | 7 +++++--
lib/radix-tree.c | 2 --
2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index 3ea8897..022a5da 100644
--- a/fs/fscache/page.c
+++ b/fs/fscache/page.c
@@ -57,8 +57,11 @@ static void fscache_end_page_write(struct fscache_object *object,
/* delete the page from the tree if it is now no longer
* pending */
spin_lock(&cookie->stores_lock);
- fscache_stat(&fscache_n_store_radix_deletes);
- xpage = radix_tree_delete(&cookie->stores, page->index);
+ if (!radix_tree_tag_get(&cookie->stores, page->index,
+ FSCACHE_COOKIE_PENDING_TAG)) {
+ fscache_stat(&fscache_n_store_radix_deletes);
+ xpage = radix_tree_delete(&cookie->stores, page->index);
+ }
spin_unlock(&cookie->stores_lock);
wake_up_bit(&cookie->flags, 0);
}
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index ae61068..92cdd99 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -546,7 +546,6 @@ out:
}
EXPORT_SYMBOL(radix_tree_tag_clear);

-#ifndef __KERNEL__ /* Only the test harness uses this at present */
/**
* radix_tree_tag_get - get a tag on a radix tree node
* @root: radix tree root
@@ -609,7 +608,6 @@ int radix_tree_tag_get(struct radix_tree_root *root,
}
}
EXPORT_SYMBOL(radix_tree_tag_get);
-#endif

/**
* radix_tree_next_hole - find the next hole (not-present entry)

2009-11-19 17:22:10

by David Howells

[permalink] [raw]
Subject: [PATCH 17/28] FS-Cache: Handle read request vs lookup, creation or other cache failure

FS-Cache doesn't correctly handle the netfs requesting a read from the cache
on an object that failed or was withdrawn by the cache. A trace similar to
the following might be seen:

CacheFiles: Lookup failed error -105
[exe ] unexpected submission OP165afe [OBJ6cac OBJECT_LC_DYING]
[exe ] objstate=OBJECT_LC_DYING [OBJECT_LC_DYING]
[exe ] objflags=0
[exe ] objevent=9 [fffffffffffffffb]
[exe ] ops=0 inp=0 exc=0
Pid: 6970, comm: exe Not tainted 2.6.32-rc6-cachefs #50
Call Trace:
[<ffffffffa0076477>] fscache_submit_op+0x3ff/0x45a [fscache]
[<ffffffffa0077997>] __fscache_read_or_alloc_pages+0x187/0x3c4 [fscache]
[<ffffffffa00b6480>] ? nfs_readpage_from_fscache_complete+0x0/0x66 [nfs]
[<ffffffffa00b6388>] __nfs_readpages_from_fscache+0x7e/0x176 [nfs]
[<ffffffff8108e483>] ? __alloc_pages_nodemask+0x11c/0x5cf
[<ffffffffa009d796>] nfs_readpages+0x114/0x1d7 [nfs]
[<ffffffff81090314>] __do_page_cache_readahead+0x15f/0x1ec
[<ffffffff81090228>] ? __do_page_cache_readahead+0x73/0x1ec
[<ffffffff810903bd>] ra_submit+0x1c/0x20
[<ffffffff810906bb>] ondemand_readahead+0x227/0x23a
[<ffffffff81090762>] page_cache_sync_readahead+0x17/0x19
[<ffffffff8108a99e>] generic_file_aio_read+0x236/0x5a0
[<ffffffffa00937bd>] nfs_file_read+0xe4/0xf3 [nfs]
[<ffffffff810b2fa2>] do_sync_read+0xe3/0x120
[<ffffffff81354cc3>] ? _spin_unlock_irq+0x2b/0x31
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff811848e5>] ? selinux_file_permission+0x5d/0x10f
[<ffffffff81352bdb>] ? thread_return+0x3e/0x101
[<ffffffff8117d7b0>] ? security_file_permission+0x11/0x13
[<ffffffff810b3b06>] vfs_read+0xaa/0x16f
[<ffffffff81058df0>] ? trace_hardirqs_on_caller+0x10c/0x130
[<ffffffff810b3c84>] sys_read+0x45/0x6c
[<ffffffff8100ae2b>] system_call_fastpath+0x16/0x1b

The object state might also be OBJECT_DYING or OBJECT_WITHDRAWING.

This should be handled by simply rejecting the new operation with ENOBUFS.
There's no need to log an error for it. Events of this type now appear in the
stats file under Ops:rej.

Signed-off-by: David Howells <[email protected]>
---

Documentation/filesystems/caching/fscache.txt | 1 +
fs/fscache/internal.h | 1 +
fs/fscache/operation.c | 5 +++++
fs/fscache/stats.c | 6 ++++--
4 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt
index cd2b4c9..8889d6a 100644
--- a/Documentation/filesystems/caching/fscache.txt
+++ b/Documentation/filesystems/caching/fscache.txt
@@ -276,6 +276,7 @@ proc files.
run=N Number of times async ops given CPU time
enq=N Number of times async ops queued for processing
can=N Number of async ops cancelled
+ rej=N Number of async ops rejected due to object lookup/create failure
dfr=N Number of async ops queued for deferred release
rel=N Number of async ops released
gc=N Number of deferred-release async ops garbage collected
diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
index ba1853f..a076987 100644
--- a/fs/fscache/internal.h
+++ b/fs/fscache/internal.h
@@ -143,6 +143,7 @@ extern atomic_t fscache_n_op_deferred_release;
extern atomic_t fscache_n_op_release;
extern atomic_t fscache_n_op_gc;
extern atomic_t fscache_n_op_cancelled;
+extern atomic_t fscache_n_op_rejected;

extern atomic_t fscache_n_attr_changed;
extern atomic_t fscache_n_attr_changed_ok;
diff --git a/fs/fscache/operation.c b/fs/fscache/operation.c
index 296492e..313e79a 100644
--- a/fs/fscache/operation.c
+++ b/fs/fscache/operation.c
@@ -232,6 +232,11 @@ int fscache_submit_op(struct fscache_object *object,
list_add_tail(&op->pend_link, &object->pending_ops);
fscache_stat(&fscache_n_op_pend);
ret = 0;
+ } else if (object->state == FSCACHE_OBJECT_DYING ||
+ object->state == FSCACHE_OBJECT_LC_DYING ||
+ object->state == FSCACHE_OBJECT_WITHDRAWING) {
+ fscache_stat(&fscache_n_op_rejected);
+ ret = -ENOBUFS;
} else if (!test_bit(FSCACHE_IOERROR, &object->cache->flags)) {
fscache_report_unexpected_submission(object, op, ostate);
ASSERT(!fscache_object_is_active(object));
diff --git a/fs/fscache/stats.c b/fs/fscache/stats.c
index 1d53ea6..045ba39 100644
--- a/fs/fscache/stats.c
+++ b/fs/fscache/stats.c
@@ -26,6 +26,7 @@ atomic_t fscache_n_op_deferred_release;
atomic_t fscache_n_op_release;
atomic_t fscache_n_op_gc;
atomic_t fscache_n_op_cancelled;
+atomic_t fscache_n_op_rejected;

atomic_t fscache_n_attr_changed;
atomic_t fscache_n_attr_changed_ok;
@@ -210,11 +211,12 @@ static int fscache_stats_show(struct seq_file *m, void *v)
atomic_read(&fscache_n_store_radix_deletes),
atomic_read(&fscache_n_store_pages_over_limit));

- seq_printf(m, "Ops : pend=%u run=%u enq=%u can=%u\n",
+ seq_printf(m, "Ops : pend=%u run=%u enq=%u can=%u rej=%u\n",
atomic_read(&fscache_n_op_pend),
atomic_read(&fscache_n_op_run),
atomic_read(&fscache_n_op_enqueue),
- atomic_read(&fscache_n_op_cancelled));
+ atomic_read(&fscache_n_op_cancelled),
+ atomic_read(&fscache_n_op_rejected));
seq_printf(m, "Ops : dfr=%u rel=%u gc=%u\n",
atomic_read(&fscache_n_op_deferred_release),
atomic_read(&fscache_n_op_release),

2009-11-19 17:22:19

by David Howells

[permalink] [raw]
Subject: [PATCH 18/28] FS-Cache: Handle pages pending storage that get evicted under OOM conditions

Handle netfs pages that the vmscan algorithm wants to evict from the pagecache
under OOM conditions, but that are still waiting to be written to the cache.
Under these conditions, vmscan calls the releasepage() function of the netfs,
asking if a page can be discarded.

The problem is typified by the following trace of a stuck process:

kslowd005 D 0000000000000000 0 4253 2 0x00000080
ffff88001b14f370 0000000000000046 ffff880020d0d000 0000000000000007
0000000000000006 0000000000000001 ffff88001b14ffd8 ffff880020d0d2a8
000000000000ddf0 00000000000118c0 00000000000118c0 ffff880020d0d2a8
Call Trace:
[<ffffffffa00782d8>] __fscache_wait_on_page_write+0x8b/0xa7 [fscache]
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffffa0078240>] ? __fscache_check_page_write+0x63/0x70 [fscache]
[<ffffffffa00b671d>] nfs_fscache_release_page+0x4e/0xc4 [nfs]
[<ffffffffa00927f0>] nfs_release_page+0x3c/0x41 [nfs]
[<ffffffff810885d3>] try_to_release_page+0x32/0x3b
[<ffffffff81093203>] shrink_page_list+0x316/0x4ac
[<ffffffff8109372b>] shrink_inactive_list+0x392/0x67c
[<ffffffff813532fa>] ? __mutex_unlock_slowpath+0x100/0x10b
[<ffffffff81058df0>] ? trace_hardirqs_on_caller+0x10c/0x130
[<ffffffff8135330e>] ? mutex_unlock+0x9/0xb
[<ffffffff81093aa2>] shrink_list+0x8d/0x8f
[<ffffffff81093d1c>] shrink_zone+0x278/0x33c
[<ffffffff81052d6c>] ? ktime_get_ts+0xad/0xba
[<ffffffff81094b13>] try_to_free_pages+0x22e/0x392
[<ffffffff81091e24>] ? isolate_pages_global+0x0/0x212
[<ffffffff8108e743>] __alloc_pages_nodemask+0x3dc/0x5cf
[<ffffffff81089529>] grab_cache_page_write_begin+0x65/0xaa
[<ffffffff8110f8c0>] ext3_write_begin+0x78/0x1eb
[<ffffffff81089ec5>] generic_file_buffered_write+0x109/0x28c
[<ffffffff8103cb69>] ? current_fs_time+0x22/0x29
[<ffffffff8108a509>] __generic_file_aio_write+0x350/0x385
[<ffffffff8108a588>] ? generic_file_aio_write+0x4a/0xae
[<ffffffff8108a59e>] generic_file_aio_write+0x60/0xae
[<ffffffff810b2e82>] do_sync_write+0xe3/0x120
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff810b18e1>] ? __dentry_open+0x1a5/0x2b8
[<ffffffff810b1a76>] ? dentry_open+0x82/0x89
[<ffffffffa00e693c>] cachefiles_write_page+0x298/0x335 [cachefiles]
[<ffffffffa0077147>] fscache_write_op+0x178/0x2c2 [fscache]
[<ffffffffa0075656>] fscache_op_execute+0x7a/0xd1 [fscache]
[<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
[<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
[<ffffffff8104be91>] kthread+0x7a/0x82
[<ffffffff8100beda>] child_rip+0xa/0x20
[<ffffffff8100b87c>] ? restore_args+0x0/0x30
[<ffffffff8102ef83>] ? tg_shares_up+0x171/0x227
[<ffffffff8104be17>] ? kthread+0x0/0x82
[<ffffffff8100bed0>] ? child_rip+0x0/0x20

In the above backtrace, the following is happening:

(1) A page storage operation is being executed by a slow-work thread
(fscache_write_op()).

(2) FS-Cache farms the operation out to the cache to perform
(cachefiles_write_page()).

(3) CacheFiles is then calling Ext3 to perform the actual write, using Ext3's
standard write (do_sync_write()) under KERNEL_DS directly from the netfs
page.

(4) However, for Ext3 to perform the write, it must allocate some memory, in
particular, it must allocate at least one page cache page into which it
can copy the data from the netfs page.

(5) Under OOM conditions, the memory allocator can't immediately come up with
a page, so it uses vmscan to find something to discard
(try_to_free_pages()).

(6) vmscan finds a clean netfs page it might be able to discard (possibly the
one it's trying to write out).

(7) The netfs is called to throw the page away (nfs_release_page()) - but it's
called with __GFP_WAIT, so the netfs decides to wait for the store to
complete (__fscache_wait_on_page_write()).

(8) This blocks a slow-work processing thread - possibly against itself.

The system ends up stuck because it can't write out any netfs pages to the
cache without allocating more memory.

To avoid this, we make FS-Cache cancel some writes that aren't in the middle of
actually being performed. This means that some data won't make it into the
cache this time. To support this, a new FS-Cache function,
fscache_maybe_release_page(), is added to replace what the netfs releasepage()
functions used to do with respect to the cache.

The decisions fscache_maybe_release_page() makes are counted and displayed
through /proc/fs/fscache/stats on a line labelled "VmScan". There are four
counters provided: "nos=N" - pages that weren't pending storage; "gon=N" -
pages that were pending storage when we first looked, but weren't by the time
we got the object lock; "bsy=N" - pages that we ignored as they were actively
being written when we looked; and "can=N" - pages that we cancelled the storage
of.
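
A netfs releasepage() implementation then reduces to something like the
following (sketch modelled on the AFS conversion below; the vnode layout and
cookie field are netfs-specific):

	static int example_releasepage(struct page *page, gfp_t gfp_flags)
	{
		struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);

		/* may wait for or cancel a pending store, or refuse */
		if (!fscache_maybe_release_page(vnode->cache, page, gfp_flags))
			return 0;	/* cache is still busy with the page */

		/* ... any other netfs-specific release checks ... */
		return 1;
	}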


What I'd really like to do is alter the behaviour of the cancellation
heuristics, depending on how necessary it is to expel pages. If there are
plenty of other pages that aren't waiting to be written to the cache that
could be ejected first, then it would be nice to hold up on immediate
cancellation of cache writes - but I don't see a way of doing that.

Signed-off-by: David Howells <[email protected]>
---

Documentation/filesystems/caching/fscache.txt | 4 +
Documentation/filesystems/caching/netfs-api.txt | 21 ++++++
fs/9p/cache.c | 14 ----
fs/afs/file.c | 15 +---
fs/fscache/internal.h | 5 +
fs/fscache/page.c | 79 ++++++++++++++++++++++-
fs/fscache/stats.c | 11 +++
fs/nfs/fscache.c | 10 +--
include/linux/fscache-cache.h | 1
include/linux/fscache.h | 27 ++++++++
10 files changed, 152 insertions(+), 35 deletions(-)

diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt
index 8889d6a..655569e 100644
--- a/Documentation/filesystems/caching/fscache.txt
+++ b/Documentation/filesystems/caching/fscache.txt
@@ -272,6 +272,10 @@ proc files.
pgs=N Number of pages given store req processing time
rxd=N Number of store reqs deleted from tracking tree
olm=N Number of store reqs over store limit
+ VmScan nos=N Number of release reqs against pages with no pending store
+ gon=N Number of release reqs against pages stored by time lock granted
+ bsy=N Number of release reqs ignored due to in-progress store
+ can=N Number of page stores cancelled due to release req
Ops pend=N Number of times async ops added to pending queues
run=N Number of times async ops given CPU time
enq=N Number of times async ops queued for processing
diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt
index 2666b1e..1902c57 100644
--- a/Documentation/filesystems/caching/netfs-api.txt
+++ b/Documentation/filesystems/caching/netfs-api.txt
@@ -641,7 +641,7 @@ data file must be retired (see the relinquish cookie function below).

Furthermore, note that this does not cancel the asynchronous read or write
operation started by the read/alloc and write functions, so the page
-invalidation and release functions must use:
+invalidation functions must use:

bool fscache_check_page_write(struct fscache_cookie *cookie,
struct page *page);
@@ -654,6 +654,25 @@ to see if a page is being written to the cache, and:
to wait for it to finish if it is.


+When releasepage() is being implemented, a special FS-Cache function exists to
+manage the heuristics of coping with vmscan trying to eject pages, which may
+conflict with the cache trying to write pages to the cache (which may itself
+need to allocate memory):
+
+ bool fscache_maybe_release_page(struct fscache_cookie *cookie,
+ struct page *page,
+ gfp_t gfp);
+
+This takes the netfs cookie, and the page and gfp arguments as supplied to
+releasepage(). It will return false if the page cannot be released yet for
+some reason and if it returns true, the page has been uncached and can now be
+released.
+
+To make a page available for release, this function may wait for an outstanding
+storage request to complete, or it may attempt to cancel the storage request -
+in which case the page will not be stored in the cache this time.
+
+
==========================
INDEX AND DATA FILE UPDATE
==========================
diff --git a/fs/9p/cache.c b/fs/9p/cache.c
index 51c94e2..bcc5357 100644
--- a/fs/9p/cache.c
+++ b/fs/9p/cache.c
@@ -343,18 +343,7 @@ int __v9fs_fscache_release_page(struct page *page, gfp_t gfp)

BUG_ON(!vcookie->fscache);

- if (PageFsCache(page)) {
- if (fscache_check_page_write(vcookie->fscache, page)) {
- if (!(gfp & __GFP_WAIT))
- return 0;
- fscache_wait_on_page_write(vcookie->fscache, page);
- }
-
- fscache_uncache_page(vcookie->fscache, page);
- ClearPageFsCache(page);
- }
-
- return 1;
+ return fscache_maybe_release_page(vnode->cache, page, gfp);
}

void __v9fs_fscache_invalidate_page(struct page *page)
@@ -368,7 +357,6 @@ void __v9fs_fscache_invalidate_page(struct page *page)
fscache_wait_on_page_write(vcookie->fscache, page);
BUG_ON(!PageLocked(page));
fscache_uncache_page(vcookie->fscache, page);
- ClearPageFsCache(page);
}
}

diff --git a/fs/afs/file.c b/fs/afs/file.c
index 681c2a7..39b3016 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -315,7 +315,6 @@ static void afs_invalidatepage(struct page *page, unsigned long offset)
struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
fscache_wait_on_page_write(vnode->cache, page);
fscache_uncache_page(vnode->cache, page);
- ClearPageFsCache(page);
}
#endif

@@ -349,17 +348,9 @@ static int afs_releasepage(struct page *page, gfp_t gfp_flags)
/* deny if page is being written to the cache and the caller hasn't
* elected to wait */
#ifdef CONFIG_AFS_FSCACHE
- if (PageFsCache(page)) {
- if (fscache_check_page_write(vnode->cache, page)) {
- if (!(gfp_flags & __GFP_WAIT)) {
- _leave(" = F [cache busy]");
- return 0;
- }
- fscache_wait_on_page_write(vnode->cache, page);
- }
-
- fscache_uncache_page(vnode->cache, page);
- ClearPageFsCache(page);
+ if (!fscache_maybe_release_page(vnode->cache, page, gfp_flags)) {
+ _leave(" = F [cache busy]");
+ return 0;
}
#endif

diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
index a076987..e504651 100644
--- a/fs/fscache/internal.h
+++ b/fs/fscache/internal.h
@@ -180,6 +180,11 @@ extern atomic_t fscache_n_store_pages;
extern atomic_t fscache_n_store_radix_deletes;
extern atomic_t fscache_n_store_pages_over_limit;

+extern atomic_t fscache_n_store_vmscan_not_storing;
+extern atomic_t fscache_n_store_vmscan_gone;
+extern atomic_t fscache_n_store_vmscan_busy;
+extern atomic_t fscache_n_store_vmscan_cancelled;
+
extern atomic_t fscache_n_marks;
extern atomic_t fscache_n_uncaches;

diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index 022a5da..fc76798 100644
--- a/fs/fscache/page.c
+++ b/fs/fscache/page.c
@@ -43,6 +43,75 @@ void __fscache_wait_on_page_write(struct fscache_cookie *cookie, struct page *pa
EXPORT_SYMBOL(__fscache_wait_on_page_write);

/*
+ * decide whether a page can be released, possibly by cancelling a store to it
+ * - we're allowed to sleep if __GFP_WAIT is flagged
+ */
+bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
+ struct page *page,
+ gfp_t gfp)
+{
+ struct page *xpage;
+ void *val;
+
+ _enter("%p,%p,%x", cookie, page, gfp);
+
+ rcu_read_lock();
+ val = radix_tree_lookup(&cookie->stores, page->index);
+ if (!val) {
+ rcu_read_unlock();
+ fscache_stat(&fscache_n_store_vmscan_not_storing);
+ __fscache_uncache_page(cookie, page);
+ return true;
+ }
+
+ /* see if the page is actually undergoing storage - if so we can't get
+ * rid of it till the cache has finished with it */
+ if (radix_tree_tag_get(&cookie->stores, page->index,
+ FSCACHE_COOKIE_STORING_TAG)) {
+ rcu_read_unlock();
+ goto page_busy;
+ }
+
+ /* the page is pending storage, so we attempt to cancel the store and
+ * discard the store request so that the page can be reclaimed */
+ spin_lock(&cookie->stores_lock);
+ rcu_read_unlock();
+
+ if (radix_tree_tag_get(&cookie->stores, page->index,
+ FSCACHE_COOKIE_STORING_TAG)) {
+ /* the page started to undergo storage whilst we were looking,
+ * so now we can only wait or return */
+ spin_unlock(&cookie->stores_lock);
+ goto page_busy;
+ }
+
+ xpage = radix_tree_delete(&cookie->stores, page->index);
+ spin_unlock(&cookie->stores_lock);
+
+ if (xpage) {
+ fscache_stat(&fscache_n_store_vmscan_cancelled);
+ fscache_stat(&fscache_n_store_radix_deletes);
+ ASSERTCMP(xpage, ==, page);
+ } else {
+ fscache_stat(&fscache_n_store_vmscan_gone);
+ }
+
+ wake_up_bit(&cookie->flags, 0);
+ if (xpage)
+ page_cache_release(xpage);
+ __fscache_uncache_page(cookie, page);
+ return true;
+
+page_busy:
+ /* we might want to wait here, but that could deadlock the allocator as
+ * the slow-work threads writing to the cache may all end up sleeping
+ * on memory allocation */
+ fscache_stat(&fscache_n_store_vmscan_busy);
+ return false;
+}
+EXPORT_SYMBOL(__fscache_maybe_release_page);
+
+/*
* note that a page has finished being written to the cache
*/
static void fscache_end_page_write(struct fscache_object *object,
@@ -57,6 +126,8 @@ static void fscache_end_page_write(struct fscache_object *object,
/* delete the page from the tree if it is now no longer
* pending */
spin_lock(&cookie->stores_lock);
+ radix_tree_tag_clear(&cookie->stores, page->index,
+ FSCACHE_COOKIE_STORING_TAG);
if (!radix_tree_tag_get(&cookie->stores, page->index,
FSCACHE_COOKIE_PENDING_TAG)) {
fscache_stat(&fscache_n_store_radix_deletes);
@@ -640,8 +711,12 @@ static void fscache_write_op(struct fscache_operation *_op)
goto superseded;
}

- radix_tree_tag_clear(&cookie->stores, page->index,
- FSCACHE_COOKIE_PENDING_TAG);
+ if (page) {
+ radix_tree_tag_set(&cookie->stores, page->index,
+ FSCACHE_COOKIE_STORING_TAG);
+ radix_tree_tag_clear(&cookie->stores, page->index,
+ FSCACHE_COOKIE_PENDING_TAG);
+ }

spin_unlock(&cookie->stores_lock);
spin_unlock(&object->lock);
diff --git a/fs/fscache/stats.c b/fs/fscache/stats.c
index 045ba39..cda6999 100644
--- a/fs/fscache/stats.c
+++ b/fs/fscache/stats.c
@@ -63,6 +63,11 @@ atomic_t fscache_n_store_pages;
atomic_t fscache_n_store_radix_deletes;
atomic_t fscache_n_store_pages_over_limit;

+atomic_t fscache_n_store_vmscan_not_storing;
+atomic_t fscache_n_store_vmscan_gone;
+atomic_t fscache_n_store_vmscan_busy;
+atomic_t fscache_n_store_vmscan_cancelled;
+
atomic_t fscache_n_marks;
atomic_t fscache_n_uncaches;

@@ -211,6 +216,12 @@ static int fscache_stats_show(struct seq_file *m, void *v)
atomic_read(&fscache_n_store_radix_deletes),
atomic_read(&fscache_n_store_pages_over_limit));

+ seq_printf(m, "VmScan : nos=%u gon=%u bsy=%u can=%u\n",
+ atomic_read(&fscache_n_store_vmscan_not_storing),
+ atomic_read(&fscache_n_store_vmscan_gone),
+ atomic_read(&fscache_n_store_vmscan_busy),
+ atomic_read(&fscache_n_store_vmscan_cancelled));
+
seq_printf(m, "Ops : pend=%u run=%u enq=%u can=%u rej=%u\n",
atomic_read(&fscache_n_op_pend),
atomic_read(&fscache_n_op_run),
diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index 70fad69..fa58800 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -359,17 +359,13 @@ int nfs_fscache_release_page(struct page *page, gfp_t gfp)

BUG_ON(!cookie);

- if (fscache_check_page_write(cookie, page)) {
- if (!(gfp & __GFP_WAIT))
- return 0;
- fscache_wait_on_page_write(cookie, page);
- }
-
if (PageFsCache(page)) {
dfprintk(FSCACHE, "NFS: fscache releasepage (0x%p/0x%p/0x%p)\n",
cookie, page, nfsi);

- fscache_uncache_page(cookie, page);
+ if (!fscache_maybe_release_page(cookie, page, gfp))
+ return 0;
+
nfs_add_fscache_stats(page->mapping->host,
NFSIOS_FSCACHE_PAGES_UNCACHED, 1);
}
diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
index f3aa4bd..4750d5f 100644
--- a/include/linux/fscache-cache.h
+++ b/include/linux/fscache-cache.h
@@ -317,6 +317,7 @@ struct fscache_cookie {
void *netfs_data; /* back pointer to netfs */
struct radix_tree_root stores; /* pages to be stored on this cookie */
#define FSCACHE_COOKIE_PENDING_TAG 0 /* pages tag: pending write to cache */
+#define FSCACHE_COOKIE_STORING_TAG 1 /* pages tag: writing to cache */

unsigned long flags;
#define FSCACHE_COOKIE_LOOKING_UP 0 /* T if non-index cookie being looked up still */
diff --git a/include/linux/fscache.h b/include/linux/fscache.h
index 6d8ee46..595ce49 100644
--- a/include/linux/fscache.h
+++ b/include/linux/fscache.h
@@ -202,6 +202,8 @@ extern int __fscache_write_page(struct fscache_cookie *, struct page *, gfp_t);
extern void __fscache_uncache_page(struct fscache_cookie *, struct page *);
extern bool __fscache_check_page_write(struct fscache_cookie *, struct page *);
extern void __fscache_wait_on_page_write(struct fscache_cookie *, struct page *);
+extern bool __fscache_maybe_release_page(struct fscache_cookie *, struct page *,
+ gfp_t);

/**
* fscache_register_netfs - Register a filesystem as desiring caching services
@@ -615,4 +617,29 @@ void fscache_wait_on_page_write(struct fscache_cookie *cookie,
__fscache_wait_on_page_write(cookie, page);
}

+/**
+ * fscache_maybe_release_page - Consider releasing a page, cancelling a store
+ * @cookie: The cookie representing the cache object
+ * @page: The netfs page that is being cached.
+ * @gfp: The gfp flags passed to releasepage()
+ *
+ * Consider releasing a page for the vmscan algorithm, on behalf of the netfs's
+ * releasepage() call. A storage request on the page may be cancelled if it is
+ * not currently being processed.
+ *
+ * The function returns true if the page no longer has a storage request on it,
+ * and false if a storage request is left in place. If true is returned, the
+ * page will have been passed to fscache_uncache_page(). If false is returned
+ * the page cannot be freed yet.
+ */
+static inline
+bool fscache_maybe_release_page(struct fscache_cookie *cookie,
+ struct page *page,
+ gfp_t gfp)
+{
+ if (fscache_cookie_valid(cookie) && PageFsCache(page))
+ return __fscache_maybe_release_page(cookie, page, gfp);
+ return false;
+}
+
#endif /* _LINUX_FSCACHE_H */

2009-11-19 17:22:26

by David Howells

[permalink] [raw]
Subject: [PATCH 19/28] FS-Cache: Add a retirement stat counter

Add a stat counter to count retirement events rather than ordinary release
events (the retire argument to fscache_relinquish_cookie()).
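
For illustration only (this sketch is not part of the patch): the counter is
driven by the existing netfs-facing API. A netfs that wants the cached data
discarded passes a non-zero retire argument, which with this patch also bumps
the new "rtr=" field in /proc/fs/fscache/stats. "my_inode" is a hypothetical
netfs-private structure used purely for the example:

#include <linux/fscache.h>

struct my_inode {
        struct fscache_cookie *fscache; /* cookie for this file's cache object */
        /* ... other netfs-private fields ... */
};

static void my_netfs_evict_inode(struct my_inode *mi, bool discard_data)
{
        /* retire != 0: withdraw the object and retire (discard) its data */
        fscache_relinquish_cookie(mi->fscache, discard_data ? 1 : 0);
        mi->fscache = NULL;
}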

Signed-off-by: David Howells <[email protected]>
---

fs/fscache/cookie.c | 2 ++
fs/fscache/internal.h | 1 +
fs/fscache/stats.c | 6 ++++--
3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
index f979659..9905350 100644
--- a/fs/fscache/cookie.c
+++ b/fs/fscache/cookie.c
@@ -415,6 +415,8 @@ void __fscache_relinquish_cookie(struct fscache_cookie *cookie, int retire)
unsigned long event;

fscache_stat(&fscache_n_relinquishes);
+ if (retire)
+ fscache_stat(&fscache_n_relinquishes_retire);

if (!cookie) {
fscache_stat(&fscache_n_relinquishes_null);
diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
index e504651..2bf463d 100644
--- a/fs/fscache/internal.h
+++ b/fs/fscache/internal.h
@@ -202,6 +202,7 @@ extern atomic_t fscache_n_updates_run;
extern atomic_t fscache_n_relinquishes;
extern atomic_t fscache_n_relinquishes_null;
extern atomic_t fscache_n_relinquishes_waitcrt;
+extern atomic_t fscache_n_relinquishes_retire;

extern atomic_t fscache_n_cookie_index;
extern atomic_t fscache_n_cookie_data;
diff --git a/fs/fscache/stats.c b/fs/fscache/stats.c
index cda6999..9e15289 100644
--- a/fs/fscache/stats.c
+++ b/fs/fscache/stats.c
@@ -85,6 +85,7 @@ atomic_t fscache_n_updates_run;
atomic_t fscache_n_relinquishes;
atomic_t fscache_n_relinquishes_null;
atomic_t fscache_n_relinquishes_waitcrt;
+atomic_t fscache_n_relinquishes_retire;

atomic_t fscache_n_cookie_index;
atomic_t fscache_n_cookie_data;
@@ -168,10 +169,11 @@ static int fscache_stats_show(struct seq_file *m, void *v)
atomic_read(&fscache_n_updates_null),
atomic_read(&fscache_n_updates_run));

- seq_printf(m, "Relinqs: n=%u nul=%u wcr=%u\n",
+ seq_printf(m, "Relinqs: n=%u nul=%u wcr=%u rtr=%u\n",
atomic_read(&fscache_n_relinquishes),
atomic_read(&fscache_n_relinquishes_null),
- atomic_read(&fscache_n_relinquishes_waitcrt));
+ atomic_read(&fscache_n_relinquishes_waitcrt),
+ atomic_read(&fscache_n_relinquishes_retire));

seq_printf(m, "AttrChg: n=%u ok=%u nbf=%u oom=%u run=%u\n",
atomic_read(&fscache_n_attr_changed),

2009-11-19 17:22:34

by David Howells

[permalink] [raw]
Subject: [PATCH 20/28] FS-Cache: Make sure FSCACHE_COOKIE_LOOKING_UP cleared on lookup failure

We must make sure that FSCACHE_COOKIE_LOOKING_UP is cleared on lookup failure
(if an object reaches the LC_DYING state), and we should clear it before
clearing FSCACHE_COOKIE_CREATING.

If this doesn't happen then fscache_wait_for_deferred_lookup() may hold
allocation and retrieval operations indefinitely until they're interrupted by
signals - which in turn pins the dying object until they go away.
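
As a rough illustration of the pairing involved (these are invented example
functions, not the actual fscache code), a task sleeping on a flag bit is only
released if that bit is both cleared and explicitly woken, which is why the
failure path must now handle LOOKING_UP as well as CREATING:

#include <linux/fscache-cache.h>
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/bitops.h>

static int example_wait_bit_action(void *word)
{
        schedule();
        return 0;
}

/* waiter side, e.g. what fscache_wait_for_deferred_lookup() relies on */
static int example_wait_for_lookup(unsigned long *flags)
{
        return wait_on_bit(flags, FSCACHE_COOKIE_LOOKING_UP,
                           example_wait_bit_action, TASK_UNINTERRUPTIBLE);
}

/* waker side: the lookup-failure path must clear and wake each bit */
static void example_lookup_failed(unsigned long *flags)
{
        if (test_and_clear_bit(FSCACHE_COOKIE_LOOKING_UP, flags))
                wake_up_bit(flags, FSCACHE_COOKIE_LOOKING_UP);
        if (test_and_clear_bit(FSCACHE_COOKIE_CREATING, flags))
                wake_up_bit(flags, FSCACHE_COOKIE_CREATING);
}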

Signed-off-by: David Howells <[email protected]>
---

fs/fscache/object.c | 17 ++++++++++++-----
1 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/fscache/object.c b/fs/fscache/object.c
index 1a1afa8..74bc562 100644
--- a/fs/fscache/object.c
+++ b/fs/fscache/object.c
@@ -105,6 +105,7 @@ static inline void fscache_done_parent_op(struct fscache_object *object)
static void fscache_object_state_machine(struct fscache_object *object)
{
enum fscache_object_state new_state;
+ struct fscache_cookie *cookie;

ASSERT(object != NULL);

@@ -158,11 +159,17 @@ static void fscache_object_state_machine(struct fscache_object *object)

spin_lock(&object->lock);
object->state = FSCACHE_OBJECT_DYING;
- if (object->cookie &&
- test_and_clear_bit(FSCACHE_COOKIE_CREATING,
- &object->cookie->flags))
- wake_up_bit(&object->cookie->flags,
- FSCACHE_COOKIE_CREATING);
+ cookie = object->cookie;
+ if (cookie) {
+ if (test_and_clear_bit(FSCACHE_COOKIE_LOOKING_UP,
+ &cookie->flags))
+ wake_up_bit(&cookie->flags,
+ FSCACHE_COOKIE_LOOKING_UP);
+ if (test_and_clear_bit(FSCACHE_COOKIE_CREATING,
+ &cookie->flags))
+ wake_up_bit(&cookie->flags,
+ FSCACHE_COOKIE_CREATING);
+ }
spin_unlock(&object->lock);

fscache_done_parent_op(object);

2009-11-19 17:22:33

by David Howells

[permalink] [raw]
Subject: [PATCH 21/28] FS-Cache: Start processing an object's operations on that object's death

Start processing an object's operations when that object moves into the DYING
state as the object cannot be destroyed until all its outstanding operations
have completed.

Furthermore, make sure that read and allocation operations handle being woken
up on a dead object. Such events are recorded in the Allocs.abt and
Retrvls.abt statistics, viewable through /proc/fs/fscache/stats.

The code that waits for object activation on behalf of the read and allocation
operations is also extracted into its own function, as it is much the same in
all cases, differing only in the statistics incremented.

Signed-off-by: David Howells <[email protected]>
---

Documentation/filesystems/caching/fscache.txt | 2
fs/fscache/internal.h | 5 +
fs/fscache/object.c | 1
fs/fscache/page.c | 112 ++++++++++++-------------
fs/fscache/stats.c | 12 ++-
include/linux/fscache-cache.h | 4 +
6 files changed, 75 insertions(+), 61 deletions(-)

diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt
index 655569e..692ffef 100644
--- a/Documentation/filesystems/caching/fscache.txt
+++ b/Documentation/filesystems/caching/fscache.txt
@@ -253,6 +253,7 @@ proc files.
int=N Number of alloc reqs aborted -ERESTARTSYS
ops=N Number of alloc reqs submitted
owt=N Number of alloc reqs waited for CPU time
+ abt=N Number of alloc reqs aborted due to object death
Retrvls n=N Number of retrieval (read) requests seen
ok=N Number of successful retr reqs
wt=N Number of retr reqs that waited on lookup completion
@@ -262,6 +263,7 @@ proc files.
oom=N Number of retr reqs failed -ENOMEM
ops=N Number of retr reqs submitted
owt=N Number of retr reqs waited for CPU time
+ abt=N Number of retr reqs aborted due to object death
Stores n=N Number of storage (write) requests seen
ok=N Number of successful store reqs
agn=N Number of store reqs on a page already pending storage
diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
index 2bf463d..5b49a37 100644
--- a/fs/fscache/internal.h
+++ b/fs/fscache/internal.h
@@ -156,6 +156,7 @@ extern atomic_t fscache_n_allocs_ok;
extern atomic_t fscache_n_allocs_wait;
extern atomic_t fscache_n_allocs_nobufs;
extern atomic_t fscache_n_allocs_intr;
+extern atomic_t fscache_n_allocs_object_dead;
extern atomic_t fscache_n_alloc_ops;
extern atomic_t fscache_n_alloc_op_waits;

@@ -166,6 +167,7 @@ extern atomic_t fscache_n_retrievals_nodata;
extern atomic_t fscache_n_retrievals_nobufs;
extern atomic_t fscache_n_retrievals_intr;
extern atomic_t fscache_n_retrievals_nomem;
+extern atomic_t fscache_n_retrievals_object_dead;
extern atomic_t fscache_n_retrieval_ops;
extern atomic_t fscache_n_retrieval_op_waits;

@@ -249,9 +251,12 @@ static inline void fscache_stat_d(atomic_t *stat)
atomic_dec(stat);
}

+#define __fscache_stat(stat) (stat)
+
extern const struct file_operations fscache_stats_fops;
#else

+#define __fscache_stat(stat) (NULL)
#define fscache_stat(stat) do {} while (0)
#endif

diff --git a/fs/fscache/object.c b/fs/fscache/object.c
index 74bc562..c85c9f5 100644
--- a/fs/fscache/object.c
+++ b/fs/fscache/object.c
@@ -201,6 +201,7 @@ static void fscache_object_state_machine(struct fscache_object *object)
}
spin_unlock(&object->lock);
fscache_enqueue_dependents(object);
+ fscache_start_operations(object);
goto terminal_transit;

/* handle an abort during initialisation */
diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index fc76798..c598ea4 100644
--- a/fs/fscache/page.c
+++ b/fs/fscache/page.c
@@ -314,6 +314,43 @@ static int fscache_wait_for_deferred_lookup(struct fscache_cookie *cookie)
}

/*
+ * wait for an object to become active (or dead)
+ */
+static int fscache_wait_for_retrieval_activation(struct fscache_object *object,
+ struct fscache_retrieval *op,
+ atomic_t *stat_op_waits,
+ atomic_t *stat_object_dead)
+{
+ int ret;
+
+ if (!test_bit(FSCACHE_OP_WAITING, &op->op.flags))
+ goto check_if_dead;
+
+ _debug(">>> WT");
+ fscache_stat(stat_op_waits);
+ if (wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
+ fscache_wait_bit_interruptible,
+ TASK_INTERRUPTIBLE) < 0) {
+ ret = fscache_cancel_op(&op->op);
+ if (ret == 0)
+ return -ERESTARTSYS;
+
+ /* it's been removed from the pending queue by another party,
+ * so we should get to run shortly */
+ wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
+ fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+ }
+ _debug("<<< GO");
+
+check_if_dead:
+ if (unlikely(fscache_object_is_dead(object))) {
+ fscache_stat(stat_object_dead);
+ return -ENOBUFS;
+ }
+ return 0;
+}
+
+/*
* read a page from the cache or allocate a block in which to store it
* - we return:
* -ENOMEM - out of memory, nothing done
@@ -376,25 +413,12 @@ int __fscache_read_or_alloc_page(struct fscache_cookie *cookie,

/* we wait for the operation to become active, and then process it
* *here*, in this thread, and not in the thread pool */
- if (test_bit(FSCACHE_OP_WAITING, &op->op.flags)) {
- _debug(">>> WT");
- fscache_stat(&fscache_n_retrieval_op_waits);
- if (wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
- fscache_wait_bit_interruptible,
- TASK_INTERRUPTIBLE) < 0) {
- ret = fscache_cancel_op(&op->op);
- if (ret == 0) {
- ret = -ERESTARTSYS;
- goto error;
- }
-
- /* it's been removed from the pending queue by another
- * party, so we should get to run shortly */
- wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
- fscache_wait_bit, TASK_UNINTERRUPTIBLE);
- }
- _debug("<<< GO");
- }
+ ret = fscache_wait_for_retrieval_activation(
+ object, op,
+ __fscache_stat(&fscache_n_retrieval_op_waits),
+ __fscache_stat(&fscache_n_retrievals_object_dead));
+ if (ret < 0)
+ goto error;

/* ask the cache to honour the operation */
if (test_bit(FSCACHE_COOKIE_NO_DATA_YET, &object->cookie->flags)) {
@@ -506,25 +530,12 @@ int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,

/* we wait for the operation to become active, and then process it
* *here*, in this thread, and not in the thread pool */
- if (test_bit(FSCACHE_OP_WAITING, &op->op.flags)) {
- _debug(">>> WT");
- fscache_stat(&fscache_n_retrieval_op_waits);
- if (wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
- fscache_wait_bit_interruptible,
- TASK_INTERRUPTIBLE) < 0) {
- ret = fscache_cancel_op(&op->op);
- if (ret == 0) {
- ret = -ERESTARTSYS;
- goto error;
- }
-
- /* it's been removed from the pending queue by another
- * party, so we should get to run shortly */
- wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
- fscache_wait_bit, TASK_UNINTERRUPTIBLE);
- }
- _debug("<<< GO");
- }
+ ret = fscache_wait_for_retrieval_activation(
+ object, op,
+ __fscache_stat(&fscache_n_retrieval_op_waits),
+ __fscache_stat(&fscache_n_retrievals_object_dead));
+ if (ret < 0)
+ goto error;

/* ask the cache to honour the operation */
if (test_bit(FSCACHE_COOKIE_NO_DATA_YET, &object->cookie->flags)) {
@@ -612,25 +623,12 @@ int __fscache_alloc_page(struct fscache_cookie *cookie,

fscache_stat(&fscache_n_alloc_ops);

- if (test_bit(FSCACHE_OP_WAITING, &op->op.flags)) {
- _debug(">>> WT");
- fscache_stat(&fscache_n_alloc_op_waits);
- if (wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
- fscache_wait_bit_interruptible,
- TASK_INTERRUPTIBLE) < 0) {
- ret = fscache_cancel_op(&op->op);
- if (ret == 0) {
- ret = -ERESTARTSYS;
- goto error;
- }
-
- /* it's been removed from the pending queue by another
- * party, so we should get to run shortly */
- wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
- fscache_wait_bit, TASK_UNINTERRUPTIBLE);
- }
- _debug("<<< GO");
- }
+ ret = fscache_wait_for_retrieval_activation(
+ object, op,
+ __fscache_stat(&fscache_n_alloc_op_waits),
+ __fscache_stat(&fscache_n_allocs_object_dead));
+ if (ret < 0)
+ goto error;

/* ask the cache to honour the operation */
fscache_stat(&fscache_n_cop_allocate_page);
diff --git a/fs/fscache/stats.c b/fs/fscache/stats.c
index 9e15289..05f77ca 100644
--- a/fs/fscache/stats.c
+++ b/fs/fscache/stats.c
@@ -39,6 +39,7 @@ atomic_t fscache_n_allocs_ok;
atomic_t fscache_n_allocs_wait;
atomic_t fscache_n_allocs_nobufs;
atomic_t fscache_n_allocs_intr;
+atomic_t fscache_n_allocs_object_dead;
atomic_t fscache_n_alloc_ops;
atomic_t fscache_n_alloc_op_waits;

@@ -49,6 +50,7 @@ atomic_t fscache_n_retrievals_nodata;
atomic_t fscache_n_retrievals_nobufs;
atomic_t fscache_n_retrievals_intr;
atomic_t fscache_n_retrievals_nomem;
+atomic_t fscache_n_retrievals_object_dead;
atomic_t fscache_n_retrieval_ops;
atomic_t fscache_n_retrieval_op_waits;

@@ -188,9 +190,10 @@ static int fscache_stats_show(struct seq_file *m, void *v)
atomic_read(&fscache_n_allocs_wait),
atomic_read(&fscache_n_allocs_nobufs),
atomic_read(&fscache_n_allocs_intr));
- seq_printf(m, "Allocs : ops=%u owt=%u\n",
+ seq_printf(m, "Allocs : ops=%u owt=%u abt=%u\n",
atomic_read(&fscache_n_alloc_ops),
- atomic_read(&fscache_n_alloc_op_waits));
+ atomic_read(&fscache_n_alloc_op_waits),
+ atomic_read(&fscache_n_allocs_object_dead));

seq_printf(m, "Retrvls: n=%u ok=%u wt=%u nod=%u nbf=%u"
" int=%u oom=%u\n",
@@ -201,9 +204,10 @@ static int fscache_stats_show(struct seq_file *m, void *v)
atomic_read(&fscache_n_retrievals_nobufs),
atomic_read(&fscache_n_retrievals_intr),
atomic_read(&fscache_n_retrievals_nomem));
- seq_printf(m, "Retrvls: ops=%u owt=%u\n",
+ seq_printf(m, "Retrvls: ops=%u owt=%u abt=%u\n",
atomic_read(&fscache_n_retrieval_ops),
- atomic_read(&fscache_n_retrieval_op_waits));
+ atomic_read(&fscache_n_retrieval_op_waits),
+ atomic_read(&fscache_n_retrievals_object_dead));

seq_printf(m, "Stores : n=%u ok=%u agn=%u nbf=%u oom=%u\n",
atomic_read(&fscache_n_stores),
diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
index 4750d5f..907bb56 100644
--- a/include/linux/fscache-cache.h
+++ b/include/linux/fscache-cache.h
@@ -404,6 +404,10 @@ extern const char *fscache_object_states[];
(obj)->state >= FSCACHE_OBJECT_AVAILABLE && \
(obj)->state < FSCACHE_OBJECT_DYING)

+#define fscache_object_is_dead(obj) \
+ (test_bit(FSCACHE_IOERROR, &(obj)->cache->flags) && \
+ (obj)->state >= FSCACHE_OBJECT_DYING)
+
extern const struct slow_work_ops fscache_object_slow_work_ops;

/**

2009-11-19 17:24:12

by David Howells

[permalink] [raw]
Subject: [PATCH 22/28] FS-Cache: Actually requeue an object when requested

FS-Cache objects have an FSCACHE_OBJECT_EV_REQUEUE event that can theoretically
be raised to ask the state machine to requeue the object for further processing
before the work function returns to the slow-work facility.

However, fscache_object_slow_work_execute() was clearing that bit before checking
the event mask to see whether the object has any pending events that require it
to be requeued immediately.

Instead, the bit should be cleared after the check and enqueue.

Signed-off-by: David Howells <[email protected]>
---

fs/fscache/object.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/fs/fscache/object.c b/fs/fscache/object.c
index c85c9f5..f3f952c 100644
--- a/fs/fscache/object.c
+++ b/fs/fscache/object.c
@@ -353,13 +353,12 @@ static void fscache_object_slow_work_execute(struct slow_work *work)

_enter("{OBJ%x}", object->debug_id);

- clear_bit(FSCACHE_OBJECT_EV_REQUEUE, &object->events);
-
start = jiffies;
fscache_object_state_machine(object);
fscache_hist(fscache_objs_histogram, start);
if (object->events & object->event_mask)
fscache_enqueue_object(object);
+ clear_bit(FSCACHE_OBJECT_EV_REQUEUE, &object->events);
}

/*

2009-11-19 17:22:41

by David Howells

[permalink] [raw]
Subject: [PATCH 23/28] CacheFiles: Don't write a full page if there's only a partial page to cache

cachefiles_write_page() writes a full page to the backing file for the last
page of the netfs file, even if the netfs file's last page is only a partial
page.

This causes the EOF on the backing file to be extended beyond the EOF of the
netfs file, and thus the backing file will be truncated by
cachefiles_attr_changed() called from cachefiles_lookup_object().

So we need to limit the write we make to the backing file on that last page
such that it doesn't push the EOF too far.


Also, if a backing file that has a partial page at the end is expanded, we
discard the partial page and refetch it on the basis that we then have a hole
in the file with invalid data, and should the power go out... A better way to
deal with this could be to record a note that the partial page contains invalid
data until the correct data is written into it.

This isn't a problem for netfs's that discard the whole backing file if the
file size changes (such as NFS).
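
The length clamp being added amounts to something like the following (an
illustrative condensation of the rdwr.c hunk below, not a separate helper in
the patch; ASSERTCMP() is the cachefiles-internal assertion macro):

#include <linux/mm.h>

/* pos: file position of the page being written to the backing file
 * eof: the object's store limit in bytes (object->fscache.store_limit_l) */
static size_t example_write_len(loff_t pos, loff_t eof)
{
        size_t len = PAGE_SIZE;

        if (eof & ~PAGE_MASK) {                 /* store limit isn't page-aligned */
                ASSERTCMP(pos, <, eof);
                if (eof - pos < PAGE_SIZE)
                        len = eof - pos;        /* partial last page */
        }
        return len;
}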

Signed-off-by: David Howells <[email protected]>
---

fs/cachefiles/interface.c | 20 +++++++++++++++++---
fs/cachefiles/rdwr.c | 23 +++++++++++++++++++----
include/linux/fscache-cache.h | 3 +++
3 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/fs/cachefiles/interface.c b/fs/cachefiles/interface.c
index dd7f852..8e67abf 100644
--- a/fs/cachefiles/interface.c
+++ b/fs/cachefiles/interface.c
@@ -404,12 +404,26 @@ static int cachefiles_attr_changed(struct fscache_object *_object)
if (oi_size == ni_size)
return 0;

- newattrs.ia_size = ni_size;
- newattrs.ia_valid = ATTR_SIZE;
-
cachefiles_begin_secure(cache, &saved_cred);
mutex_lock(&object->backer->d_inode->i_mutex);
+
+ /* if there's an extension to a partial page at the end of the backing
+ * file, we need to discard the partial page so that we pick up new
+ * data after it */
+ if (oi_size & ~PAGE_MASK && ni_size > oi_size) {
+ _debug("discard tail %llx", oi_size);
+ newattrs.ia_valid = ATTR_SIZE;
+ newattrs.ia_size = oi_size & PAGE_MASK;
+ ret = notify_change(object->backer, &newattrs);
+ if (ret < 0)
+ goto truncate_failed;
+ }
+
+ newattrs.ia_valid = ATTR_SIZE;
+ newattrs.ia_size = ni_size;
ret = notify_change(object->backer, &newattrs);
+
+truncate_failed:
mutex_unlock(&object->backer->d_inode->i_mutex);
cachefiles_end_secure(cache, saved_cred);

diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 3304646..5a84fd7 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -803,7 +803,8 @@ int cachefiles_write_page(struct fscache_storage *op, struct page *page)
struct cachefiles_cache *cache;
mm_segment_t old_fs;
struct file *file;
- loff_t pos;
+ loff_t pos, eof;
+ size_t len;
void *data;
int ret;

@@ -837,15 +838,29 @@ int cachefiles_write_page(struct fscache_storage *op, struct page *page)
ret = -EIO;
if (file->f_op->write) {
pos = (loff_t) page->index << PAGE_SHIFT;
+
+ /* we mustn't write more data than we have, so we have
+ * to beware of a partial page at EOF */
+ eof = object->fscache.store_limit_l;
+ len = PAGE_SIZE;
+ if (eof & ~PAGE_MASK) {
+ ASSERTCMP(pos, <, eof);
+ if (eof - pos < PAGE_SIZE) {
+ _debug("cut short %llx to %llx",
+ pos, eof);
+ len = eof - pos;
+ ASSERTCMP(pos + len, ==, eof);
+ }
+ }
+
data = kmap(page);
old_fs = get_fs();
set_fs(KERNEL_DS);
ret = file->f_op->write(
- file, (const void __user *) data, PAGE_SIZE,
- &pos);
+ file, (const void __user *) data, len, &pos);
set_fs(old_fs);
kunmap(page);
- if (ret != PAGE_SIZE)
+ if (ret != len)
ret = -EIO;
}
fput(file);
diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
index 907bb56..5db5000 100644
--- a/include/linux/fscache-cache.h
+++ b/include/linux/fscache-cache.h
@@ -395,6 +395,7 @@ struct fscache_object {
struct rb_node objlist_link; /* link in global object list */
#endif
pgoff_t store_limit; /* current storage limit */
+ loff_t store_limit_l; /* current storage limit in bytes */
};

extern const char *fscache_object_states[];
@@ -439,6 +440,7 @@ void fscache_object_init(struct fscache_object *object,
object->events = object->event_mask = 0;
object->flags = 0;
object->store_limit = 0;
+ object->store_limit_l = 0;
object->cache = cache;
object->cookie = cookie;
object->parent = NULL;
@@ -491,6 +493,7 @@ static inline void fscache_object_lookup_error(struct fscache_object *object)
static inline
void fscache_set_store_limit(struct fscache_object *object, loff_t i_size)
{
+ object->store_limit_l = i_size;
object->store_limit = i_size >> PAGE_SHIFT;
if (i_size & ~PAGE_MASK)
object->store_limit++;

2009-11-19 17:22:46

by David Howells

[permalink] [raw]
Subject: [PATCH 24/28] CacheFiles: Handle truncate unlocking the page we're reading

Handle truncate unlocking the page we're attempting to read from the backing
device before the read has completed.

This was causing reports like the following to occur:

Pid: 4765, comm: kslowd Not tainted 2.6.30.1 #1
Call Trace:
[<ffffffffa0331d7a>] ? cachefiles_read_waiter+0xd9/0x147 [cachefiles]
[<ffffffff804b74bd>] ? __wait_on_bit+0x60/0x6f
[<ffffffff8022bbbb>] ? __wake_up_common+0x3f/0x71
[<ffffffff8022cc32>] ? __wake_up+0x30/0x44
[<ffffffff8024a41f>] ? __wake_up_bit+0x28/0x2d
[<ffffffffa003a793>] ? ext3_truncate+0x4d7/0x8ed [ext3]
[<ffffffff80281f90>] ? pagevec_lookup+0x17/0x1f
[<ffffffff8028c2ff>] ? unmap_mapping_range+0x59/0x1ff
[<ffffffff8022cc32>] ? __wake_up+0x30/0x44
[<ffffffff8028e286>] ? vmtruncate+0xc2/0xe2
[<ffffffff802b82cf>] ? inode_setattr+0x22/0x10a
[<ffffffffa003baa5>] ? ext3_setattr+0x17b/0x1e6 [ext3]
[<ffffffff802b853d>] ? notify_change+0x186/0x2c9
[<ffffffffa032d9de>] ? cachefiles_attr_changed+0x133/0x1cd [cachefiles]
[<ffffffffa032df7f>] ? cachefiles_lookup_object+0xcf/0x12a [cachefiles]
[<ffffffffa0318165>] ? fscache_lookup_object+0x110/0x122 [fscache]
[<ffffffffa03188c3>] ? fscache_object_slow_work_execute+0x590/0x6bc
[fscache]
[<ffffffff80278f82>] ? slow_work_thread+0x285/0x43a
[<ffffffff8024a446>] ? autoremove_wake_function+0x0/0x2e
[<ffffffff80278cfd>] ? slow_work_thread+0x0/0x43a
[<ffffffff8024a317>] ? kthread+0x54/0x81
[<ffffffff8020c93a>] ? child_rip+0xa/0x20
[<ffffffff8024a2c3>] ? kthread+0x0/0x81
[<ffffffff8020c930>] ? child_rip+0x0/0x20
CacheFiles: I/O Error: Readpage failed on backing file 200000000000810
FS-Cache: Cache cachefiles stopped due to I/O error

Reported-by: Christian Kujau <[email protected]>
Reported-by: Takashi Iwai <[email protected]>
Reported-by: Duc Le Minh <[email protected]>
Signed-off-by: David Howells <[email protected]>
---

fs/cachefiles/rdwr.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 93 insertions(+), 6 deletions(-)

diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 5a84fd7..1d83325 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -40,8 +40,10 @@ static int cachefiles_read_waiter(wait_queue_t *wait, unsigned mode,

_debug("--- monitor %p %lx ---", page, page->flags);

- if (!PageUptodate(page) && !PageError(page))
- dump_stack();
+ if (!PageUptodate(page) && !PageError(page)) {
+ /* unlocked, not uptodate and not erroneous? */
+ _debug("page probably truncated");
+ }

/* remove from the waitqueue */
list_del(&wait->task_list);
@@ -61,6 +63,84 @@ static int cachefiles_read_waiter(wait_queue_t *wait, unsigned mode,
}

/*
+ * handle a probably truncated page
+ * - check to see if the page is still relevant and reissue the read if
+ * possible
+ * - return -EIO on error, -ENODATA if the page is gone, -EINPROGRESS if we
+ * must wait again and 0 if successful
+ */
+static int cachefiles_read_reissue(struct cachefiles_object *object,
+ struct cachefiles_one_read *monitor)
+{
+ struct address_space *bmapping = object->backer->d_inode->i_mapping;
+ struct page *backpage = monitor->back_page, *backpage2;
+ int ret;
+
+ kenter("{ino=%lx},{%lx,%lx}",
+ object->backer->d_inode->i_ino,
+ backpage->index, backpage->flags);
+
+ /* skip if the page was truncated away completely */
+ if (backpage->mapping != bmapping) {
+ kleave(" = -ENODATA [mapping]");
+ return -ENODATA;
+ }
+
+ backpage2 = find_get_page(bmapping, backpage->index);
+ if (!backpage2) {
+ kleave(" = -ENODATA [gone]");
+ return -ENODATA;
+ }
+
+ if (backpage != backpage2) {
+ put_page(backpage2);
+ kleave(" = -ENODATA [different]");
+ return -ENODATA;
+ }
+
+ /* the page is still there and we already have a ref on it, so we don't
+ * need a second */
+ put_page(backpage2);
+
+ INIT_LIST_HEAD(&monitor->op_link);
+ add_page_wait_queue(backpage, &monitor->monitor);
+
+ if (trylock_page(backpage)) {
+ ret = -EIO;
+ if (PageError(backpage))
+ goto unlock_discard;
+ ret = 0;
+ if (PageUptodate(backpage))
+ goto unlock_discard;
+
+ kdebug("reissue read");
+ ret = bmapping->a_ops->readpage(NULL, backpage);
+ if (ret < 0)
+ goto unlock_discard;
+ }
+
+ /* but the page may have been read before the monitor was installed, so
+ * the monitor may miss the event - so we have to ensure that we do get
+ * one in such a case */
+ if (trylock_page(backpage)) {
+ _debug("jumpstart %p {%lx}", backpage, backpage->flags);
+ unlock_page(backpage);
+ }
+
+ /* it'll reappear on the todo list */
+ kleave(" = -EINPROGRESS");
+ return -EINPROGRESS;
+
+unlock_discard:
+ unlock_page(backpage);
+ spin_lock_irq(&object->work_lock);
+ list_del(&monitor->op_link);
+ spin_unlock_irq(&object->work_lock);
+ kleave(" = %d", ret);
+ return ret;
+}
+
+/*
* copy data from backing pages to netfs pages to complete a read operation
* - driven by FS-Cache's thread pool
*/
@@ -92,20 +172,26 @@ static void cachefiles_read_copier(struct fscache_operation *_op)

_debug("- copy {%lu}", monitor->back_page->index);

- error = -EIO;
+ recheck:
if (PageUptodate(monitor->back_page)) {
copy_highpage(monitor->netfs_page, monitor->back_page);

pagevec_add(&pagevec, monitor->netfs_page);
fscache_mark_pages_cached(monitor->op, &pagevec);
error = 0;
- }
-
- if (error)
+ } else if (!PageError(monitor->back_page)) {
+ /* the page has probably been truncated */
+ error = cachefiles_read_reissue(object, monitor);
+ if (error == -EINPROGRESS)
+ goto next;
+ goto recheck;
+ } else {
cachefiles_io_error_obj(
object,
"Readpage failed on backing file %lx",
(unsigned long) monitor->back_page->flags);
+ error = -EIO;
+ }

page_cache_release(monitor->back_page);

@@ -114,6 +200,7 @@ static void cachefiles_read_copier(struct fscache_operation *_op)
fscache_put_retrieval(op);
kfree(monitor);

+ next:
/* let the thread pool have some air occasionally */
max--;
if (max < 0 || need_resched()) {

2009-11-19 17:22:53

by David Howells

[permalink] [raw]
Subject: [PATCH 25/28] CacheFiles: Mark parent directory locks as I_MUTEX_PARENT to keep lockdep happy

Mark parent directory locks as I_MUTEX_PARENT in the callers of
cachefiles_bury_object() so that lockdep doesn't complain when that invokes
vfs_unlink():

=============================================
[ INFO: possible recursive locking detected ]
2.6.32-rc6-cachefs #47
---------------------------------------------
kslowd002/3089 is trying to acquire lock:
(&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<ffffffff810bbf72>] vfs_unlink+0x8b/0x128

but task is already holding lock:
(&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<ffffffffa00e4e61>] cachefiles_walk_to_object+0x1b0/0x831 [cachefiles]

other info that might help us debug this:
1 lock held by kslowd002/3089:
#0: (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<ffffffffa00e4e61>] cachefiles_walk_to_object+0x1b0/0x831 [cachefiles]

stack backtrace:
Pid: 3089, comm: kslowd002 Not tainted 2.6.32-rc6-cachefs #47
Call Trace:
[<ffffffff8105ad7b>] __lock_acquire+0x1649/0x16e3
[<ffffffff8118170e>] ? inode_has_perm+0x5f/0x61
[<ffffffff8105ae6c>] lock_acquire+0x57/0x6d
[<ffffffff810bbf72>] ? vfs_unlink+0x8b/0x128
[<ffffffff81353ac3>] mutex_lock_nested+0x54/0x292
[<ffffffff810bbf72>] ? vfs_unlink+0x8b/0x128
[<ffffffff8118179e>] ? selinux_inode_permission+0x8e/0x90
[<ffffffff8117e271>] ? security_inode_permission+0x1c/0x1e
[<ffffffff810bb4fb>] ? inode_permission+0x99/0xa5
[<ffffffff810bbf72>] vfs_unlink+0x8b/0x128
[<ffffffff810adb19>] ? kfree+0xed/0xf9
[<ffffffffa00e3f00>] cachefiles_bury_object+0xb6/0x420 [cachefiles]
[<ffffffff81058e21>] ? trace_hardirqs_on+0xd/0xf
[<ffffffffa00e7e24>] ? cachefiles_check_object_xattr+0x233/0x293 [cachefiles]
[<ffffffffa00e51b0>] cachefiles_walk_to_object+0x4ff/0x831 [cachefiles]
[<ffffffff81032238>] ? finish_task_switch+0x0/0xb2
[<ffffffffa00e3429>] cachefiles_lookup_object+0xac/0x12a [cachefiles]
[<ffffffffa00741e9>] fscache_lookup_object+0x1c7/0x214 [fscache]
[<ffffffffa0074fc5>] fscache_object_state_machine+0xa5/0x52d [fscache]
[<ffffffffa00754ac>] fscache_object_slow_work_execute+0x5f/0xa0 [fscache]
[<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
[<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
[<ffffffff8104be91>] kthread+0x7a/0x82
[<ffffffff8100beda>] child_rip+0xa/0x20
[<ffffffff8100b87c>] ? restore_args+0x0/0x30
[<ffffffff8104be17>] ? kthread+0x0/0x82
[<ffffffff8100bed0>] ? child_rip+0x0/0x20
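
For reference, the locking shape being annotated is roughly this (an
illustrative sketch, not code from the patch; in CacheFiles the vfs_unlink()
call actually happens inside cachefiles_bury_object()):

#include <linux/fs.h>
#include <linux/mutex.h>

static int example_unlink_under_parent(struct dentry *dir, struct dentry *victim)
{
        int ret;

        /* parent directory lock: the outer lock of the parent/child pair,
         * so give lockdep the I_MUTEX_PARENT subclass */
        mutex_lock_nested(&dir->d_inode->i_mutex, I_MUTEX_PARENT);

        /* vfs_unlink() takes the victim's i_mutex, which belongs to the same
         * lock class; without the annotation above lockdep reports this as
         * possible recursive locking */
        ret = vfs_unlink(dir->d_inode, victim);

        mutex_unlock(&dir->d_inode->i_mutex);
        return ret;
}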

Signed-off-by: David Howells <[email protected]>
---

fs/cachefiles/namei.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index 4ce818a..3df8695 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -254,7 +254,7 @@ int cachefiles_delete_object(struct cachefiles_cache *cache,

dir = dget_parent(object->dentry);

- mutex_lock(&dir->d_inode->i_mutex);
+ mutex_lock_nested(&dir->d_inode->i_mutex, I_MUTEX_PARENT);
ret = cachefiles_bury_object(cache, dir, object->dentry);

dput(dir);
@@ -307,7 +307,7 @@ lookup_again:
/* search the current directory for the element name */
_debug("lookup '%s'", name);

- mutex_lock(&dir->d_inode->i_mutex);
+ mutex_lock_nested(&dir->d_inode->i_mutex, I_MUTEX_PARENT);

start = jiffies;
next = lookup_one_len(name, dir, nlen);

2009-11-19 17:22:57

by David Howells

[permalink] [raw]
Subject: [PATCH 26/28] CacheFiles: Better showing of debugging information in active object problems

Show more debugging information if cachefiles_mark_object_active() is asked to
activate an active object.

This may happen, for instance, if the netfs tries to register an object with
the same key multiple times.

The code is changed to (a) get the appropriate object lock to protect the
cookie pointer whilst we dereference it, and (b) get and display the cookie key
if available.

Signed-off-by: David Howells <[email protected]>
---

fs/cachefiles/namei.c | 102 ++++++++++++++++++++++++++++++++++++-------------
1 files changed, 75 insertions(+), 27 deletions(-)

diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index 3df8695..00a0cda 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -27,6 +27,76 @@ static int cachefiles_wait_bit(void *flags)
return 0;
}

+#define CACHEFILES_KEYBUF_SIZE 512
+
+/*
+ * dump debugging info about an object
+ */
+static noinline
+void __cachefiles_printk_object(struct cachefiles_object *object,
+ const char *prefix,
+ u8 *keybuf)
+{
+ struct fscache_cookie *cookie;
+ unsigned keylen, loop;
+
+ printk(KERN_ERR "%sobject: OBJ%x\n",
+ prefix, object->fscache.debug_id);
+ printk(KERN_ERR "%sobjstate=%s fl=%lx swfl=%lx ev=%lx[%lx]\n",
+ prefix, fscache_object_states[object->fscache.state],
+ object->fscache.flags, object->fscache.work.flags,
+ object->fscache.events,
+ object->fscache.event_mask & FSCACHE_OBJECT_EVENTS_MASK);
+ printk(KERN_ERR "%sops=%u inp=%u exc=%u\n",
+ prefix, object->fscache.n_ops, object->fscache.n_in_progress,
+ object->fscache.n_exclusive);
+ printk(KERN_ERR "%sparent=%p\n",
+ prefix, object->fscache.parent);
+
+ spin_lock(&object->fscache.lock);
+ cookie = object->fscache.cookie;
+ if (cookie) {
+ printk(KERN_ERR "%scookie=%p [pr=%p nd=%p fl=%lx]\n",
+ prefix,
+ object->fscache.cookie,
+ object->fscache.cookie->parent,
+ object->fscache.cookie->netfs_data,
+ object->fscache.cookie->flags);
+ if (keybuf)
+ keylen = cookie->def->get_key(cookie->netfs_data, keybuf,
+ CACHEFILES_KEYBUF_SIZE);
+ else
+ keylen = 0;
+ } else {
+ printk(KERN_ERR "%scookie=NULL\n", prefix);
+ keylen = 0;
+ }
+ spin_unlock(&object->fscache.lock);
+
+ if (keylen) {
+ printk(KERN_ERR "%skey=[%u] '", prefix, keylen);
+ for (loop = 0; loop < keylen; loop++)
+ printk("%02x", keybuf[loop]);
+ printk("'\n");
+ }
+}
+
+/*
+ * dump debugging info about a pair of objects
+ */
+static noinline void cachefiles_printk_object(struct cachefiles_object *object,
+ struct cachefiles_object *xobject)
+{
+ u8 *keybuf;
+
+ keybuf = kmalloc(CACHEFILES_KEYBUF_SIZE, GFP_NOIO);
+ if (object)
+ __cachefiles_printk_object(object, "", keybuf);
+ if (xobject)
+ __cachefiles_printk_object(xobject, "x", keybuf);
+ kfree(keybuf);
+}
+
/*
* record the fact that an object is now active
*/
@@ -42,8 +112,11 @@ static void cachefiles_mark_object_active(struct cachefiles_cache *cache,
try_again:
write_lock(&cache->active_lock);

- if (test_and_set_bit(CACHEFILES_OBJECT_ACTIVE, &object->flags))
+ if (test_and_set_bit(CACHEFILES_OBJECT_ACTIVE, &object->flags)) {
+ printk(KERN_ERR "CacheFiles: Error: Object already active\n");
+ cachefiles_printk_object(object, NULL);
BUG();
+ }

dentry = object->dentry;
_p = &cache->active_nodes.rb_node;
@@ -76,32 +149,7 @@ wait_for_old_object:
printk(KERN_ERR "\n");
printk(KERN_ERR "CacheFiles: Error:"
" Unexpected object collision\n");
- printk(KERN_ERR "xobject: OBJ%x\n",
- xobject->fscache.debug_id);
- printk(KERN_ERR "xobjstate=%s\n",
- fscache_object_states[xobject->fscache.state]);
- printk(KERN_ERR "xobjflags=%lx\n", xobject->fscache.flags);
- printk(KERN_ERR "xobjevent=%lx [%lx]\n",
- xobject->fscache.events, xobject->fscache.event_mask);
- printk(KERN_ERR "xops=%u inp=%u exc=%u\n",
- xobject->fscache.n_ops, xobject->fscache.n_in_progress,
- xobject->fscache.n_exclusive);
- printk(KERN_ERR "xcookie=%p [pr=%p nd=%p fl=%lx]\n",
- xobject->fscache.cookie,
- xobject->fscache.cookie->parent,
- xobject->fscache.cookie->netfs_data,
- xobject->fscache.cookie->flags);
- printk(KERN_ERR "xparent=%p\n",
- xobject->fscache.parent);
- printk(KERN_ERR "object: OBJ%x\n",
- object->fscache.debug_id);
- printk(KERN_ERR "cookie=%p [pr=%p nd=%p fl=%lx]\n",
- object->fscache.cookie,
- object->fscache.cookie->parent,
- object->fscache.cookie->netfs_data,
- object->fscache.cookie->flags);
- printk(KERN_ERR "parent=%p\n",
- object->fscache.parent);
+ cachefiles_printk_object(object, xobject);
BUG();
}
atomic_inc(&xobject->usage);

2009-11-19 17:23:08

by David Howells

[permalink] [raw]
Subject: [PATCH 27/28] CacheFiles: Catch an overly long wait for an old active object

Catch an overly long wait for an old, dying active object when we want to
replace it with a new one. The probability is that all the slow-work threads
are hogged, and the delete can't get a look in.

What we do instead is:

(1) if there's nothing in the slow work queue, we sleep until either the dying
object has finished dying or there is something in the slow work queue
behind which we can queue our object.

(2) if there is something in the slow work queue, we return ETIMEDOUT to
fscache_lookup_object(), which then puts us back on the slow work queue,
presumably behind the deletion that we're blocked by. We are then
deferred for a while until we work our way back through the queue -
without blocking a slow-work thread unnecessarily.
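
Condensed, the decision in (1) and (2) comes down to something like the sketch
below; the authoritative version is cachefiles_mark_object_active() in the
namei.c hunk of this patch, which additionally reports overlong waits and
manages the object usage counts:

/* as it would sit in fs/cachefiles/namei.c, which already pulls in the
 * needed headers via internal.h; returns -ETIMEDOUT if the new object
 * should be requeued behind the old one, 0 once the old object has gone */
static int example_wait_or_requeue(struct cachefiles_object *object,
                                   struct cachefiles_object *xobject)
{
        signed long timeout = 60 * HZ;
        wait_queue_head_t *wq;
        DEFINE_WAIT(wait);
        bool requeue = false;

        /* (2) the dying object is already queued for processing: just get
         * in line behind it rather than tying up this thread */
        if (slow_work_is_queued(&xobject->fscache.work))
                return -ETIMEDOUT;

        /* (1) nothing queued: sleep until the old object finishes dying or
         * the slow-work facility wants the thread back for other work */
        wq = bit_waitqueue(&xobject->flags, CACHEFILES_OBJECT_ACTIVE);
        do {
                prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
                if (!test_bit(CACHEFILES_OBJECT_ACTIVE, &xobject->flags))
                        break;
                requeue = slow_work_sleep_till_thread_needed(
                        &object->fscache.work, &timeout);
        } while (timeout > 0 && !requeue);
        finish_wait(wq, &wait);

        if (timeout <= 0 ||
            (requeue && test_bit(CACHEFILES_OBJECT_ACTIVE, &xobject->flags)))
                return -ETIMEDOUT;

        return 0;
}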

A backtrace similar to the following may appear in the log without this patch:

INFO: task kslowd004:5711 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kslowd004 D 0000000000000000 0 5711 2 0x00000080
ffff88000340bb80 0000000000000046 ffff88002550d000 0000000000000000
ffff88002550d000 0000000000000007 ffff88000340bfd8 ffff88002550d2a8
000000000000ddf0 00000000000118c0 00000000000118c0 ffff88002550d2a8
Call Trace:
[<ffffffff81058e21>] ? trace_hardirqs_on+0xd/0xf
[<ffffffffa011c4d8>] ? cachefiles_wait_bit+0x0/0xd [cachefiles]
[<ffffffffa011c4e1>] cachefiles_wait_bit+0x9/0xd [cachefiles]
[<ffffffff81353153>] __wait_on_bit+0x43/0x76
[<ffffffff8111ae39>] ? ext3_xattr_get+0x1ec/0x270
[<ffffffff813531ef>] out_of_line_wait_on_bit+0x69/0x74
[<ffffffffa011c4d8>] ? cachefiles_wait_bit+0x0/0xd [cachefiles]
[<ffffffff8104c125>] ? wake_bit_function+0x0/0x2e
[<ffffffffa011bc79>] cachefiles_mark_object_active+0x203/0x23b [cachefiles]
[<ffffffffa011c209>] cachefiles_walk_to_object+0x558/0x827 [cachefiles]
[<ffffffffa011a429>] cachefiles_lookup_object+0xac/0x12a [cachefiles]
[<ffffffffa00aa1e9>] fscache_lookup_object+0x1c7/0x214 [fscache]
[<ffffffffa00aafc5>] fscache_object_state_machine+0xa5/0x52d [fscache]
[<ffffffffa00ab4ac>] fscache_object_slow_work_execute+0x5f/0xa0 [fscache]
[<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
[<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
[<ffffffff8104be91>] kthread+0x7a/0x82
[<ffffffff8100beda>] child_rip+0xa/0x20
[<ffffffff8100b87c>] ? restore_args+0x0/0x30
[<ffffffff8104be17>] ? kthread+0x0/0x82
[<ffffffff8100bed0>] ? child_rip+0x0/0x20
1 lock held by kslowd004/5711:
#0: (&sb->s_type->i_mutex_key#7/1){+.+.+.}, at: [<ffffffffa011be64>] cachefiles_walk_to_object+0x1b3/0x827 [cachefiles]


Signed-off-by: David Howells <[email protected]>
---

Documentation/filesystems/caching/fscache.txt | 1
fs/cachefiles/interface.c | 6 +-
fs/cachefiles/namei.c | 87 ++++++++++++++++++++-----
fs/fscache/internal.h | 1
fs/fscache/object.c | 10 +++
fs/fscache/stats.c | 4 +
include/linux/fscache-cache.h | 6 +-
7 files changed, 90 insertions(+), 25 deletions(-)

diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt
index 692ffef..571e989 100644
--- a/Documentation/filesystems/caching/fscache.txt
+++ b/Documentation/filesystems/caching/fscache.txt
@@ -235,6 +235,7 @@ proc files.
neg=N Number of negative lookups made
pos=N Number of positive lookups made
crt=N Number of objects created by lookup
+ tmo=N Number of lookups timed out and requeued
Updates n=N Number of update cookie requests seen
nul=N Number of upd reqs given a NULL parent
run=N Number of upd reqs granted CPU time
diff --git a/fs/cachefiles/interface.c b/fs/cachefiles/interface.c
index 8e67abf..9d3c426 100644
--- a/fs/cachefiles/interface.c
+++ b/fs/cachefiles/interface.c
@@ -114,8 +114,9 @@ nomem_lookup_data:

/*
* attempt to look up the nominated node in this cache
+ * - return -ETIMEDOUT to be scheduled again
*/
-static void cachefiles_lookup_object(struct fscache_object *_object)
+static int cachefiles_lookup_object(struct fscache_object *_object)
{
struct cachefiles_lookup_data *lookup_data;
struct cachefiles_object *parent, *object;
@@ -145,13 +146,14 @@ static void cachefiles_lookup_object(struct fscache_object *_object)
object->fscache.cookie->def->type != FSCACHE_COOKIE_TYPE_INDEX)
cachefiles_attr_changed(&object->fscache);

- if (ret < 0) {
+ if (ret < 0 && ret != -ETIMEDOUT) {
printk(KERN_WARNING "CacheFiles: Lookup failed error %d\n",
ret);
fscache_object_lookup_error(&object->fscache);
}

_leave(" [%d]", ret);
+ return ret;
}

/*
diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index 00a0cda..14ac480 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -21,12 +21,6 @@
#include <linux/security.h>
#include "internal.h"

-static int cachefiles_wait_bit(void *flags)
-{
- schedule();
- return 0;
-}
-
#define CACHEFILES_KEYBUF_SIZE 512

/*
@@ -100,8 +94,8 @@ static noinline void cachefiles_printk_object(struct cachefiles_object *object,
/*
* record the fact that an object is now active
*/
-static void cachefiles_mark_object_active(struct cachefiles_cache *cache,
- struct cachefiles_object *object)
+static int cachefiles_mark_object_active(struct cachefiles_cache *cache,
+ struct cachefiles_object *object)
{
struct cachefiles_object *xobject;
struct rb_node **_p, *_parent = NULL;
@@ -139,8 +133,8 @@ try_again:
rb_insert_color(&object->active_node, &cache->active_nodes);

write_unlock(&cache->active_lock);
- _leave("");
- return;
+ _leave(" = 0");
+ return 0;

/* an old object from a previous incarnation is hogging the slot - we
* need to wait for it to be destroyed */
@@ -155,13 +149,64 @@ wait_for_old_object:
atomic_inc(&xobject->usage);
write_unlock(&cache->active_lock);

- _debug(">>> wait");
- wait_on_bit(&xobject->flags, CACHEFILES_OBJECT_ACTIVE,
- cachefiles_wait_bit, TASK_UNINTERRUPTIBLE);
- _debug("<<< waited");
+ if (test_bit(CACHEFILES_OBJECT_ACTIVE, &xobject->flags)) {
+ wait_queue_head_t *wq;
+
+ signed long timeout = 60 * HZ;
+ wait_queue_t wait;
+ bool requeue;
+
+ /* if the object we're waiting for is queued for processing,
+ * then just put ourselves on the queue behind it */
+ if (slow_work_is_queued(&xobject->fscache.work)) {
+ _debug("queue OBJ%x behind OBJ%x immediately",
+ object->fscache.debug_id,
+ xobject->fscache.debug_id);
+ goto requeue;
+ }
+
+ /* otherwise we sleep until either the object we're waiting for
+ * is done, or the slow-work facility wants the thread back to
+ * do other work */
+ wq = bit_waitqueue(&xobject->flags, CACHEFILES_OBJECT_ACTIVE);
+ init_wait(&wait);
+ requeue = false;
+ do {
+ prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
+ if (!test_bit(CACHEFILES_OBJECT_ACTIVE, &xobject->flags))
+ break;
+ requeue = slow_work_sleep_till_thread_needed(
+ &object->fscache.work, &timeout);
+ } while (timeout > 0 && !requeue);
+ finish_wait(wq, &wait);
+
+ if (requeue &&
+ test_bit(CACHEFILES_OBJECT_ACTIVE, &xobject->flags)) {
+ _debug("queue OBJ%x behind OBJ%x after wait",
+ object->fscache.debug_id,
+ xobject->fscache.debug_id);
+ goto requeue;
+ }
+
+ if (timeout <= 0) {
+ printk(KERN_ERR "\n");
+ printk(KERN_ERR "CacheFiles: Error: Overlong"
+ " wait for old active object to go away\n");
+ cachefiles_printk_object(object, xobject);
+ goto requeue;
+ }
+ }
+
+ ASSERT(!test_bit(CACHEFILES_OBJECT_ACTIVE, &xobject->flags));

cache->cache.ops->put_object(&xobject->fscache);
goto try_again;
+
+requeue:
+ clear_bit(CACHEFILES_OBJECT_ACTIVE, &object->flags);
+ cache->cache.ops->put_object(&xobject->fscache);
+ _leave(" = -ETIMEDOUT");
+ return -ETIMEDOUT;
}

/*
@@ -466,12 +511,15 @@ lookup_again:
}

/* note that we're now using this object */
- cachefiles_mark_object_active(cache, object);
+ ret = cachefiles_mark_object_active(cache, object);

mutex_unlock(&dir->d_inode->i_mutex);
dput(dir);
dir = NULL;

+ if (ret == -ETIMEDOUT)
+ goto mark_active_timed_out;
+
_debug("=== OBTAINED_OBJECT ===");

if (object->new) {
@@ -515,6 +563,10 @@ create_error:
cachefiles_io_error(cache, "Create/mkdir failed");
goto error;

+mark_active_timed_out:
+ _debug("mark active timed out");
+ goto release_dentry;
+
check_error:
_debug("check error %d", ret);
write_lock(&cache->active_lock);
@@ -522,7 +574,7 @@ check_error:
clear_bit(CACHEFILES_OBJECT_ACTIVE, &object->flags);
wake_up_bit(&object->flags, CACHEFILES_OBJECT_ACTIVE);
write_unlock(&cache->active_lock);
-
+release_dentry:
dput(object->dentry);
object->dentry = NULL;
goto error_out;
@@ -543,9 +595,6 @@ error:
error_out2:
dput(dir);
error_out:
- if (ret == -ENOSPC)
- ret = -ENOBUFS;
-
_leave(" = error %d", -ret);
return ret;
}
diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
index 5b49a37..0ca2566 100644
--- a/fs/fscache/internal.h
+++ b/fs/fscache/internal.h
@@ -215,6 +215,7 @@ extern atomic_t fscache_n_object_no_alloc;
extern atomic_t fscache_n_object_lookups;
extern atomic_t fscache_n_object_lookups_negative;
extern atomic_t fscache_n_object_lookups_positive;
+extern atomic_t fscache_n_object_lookups_timed_out;
extern atomic_t fscache_n_object_created;
extern atomic_t fscache_n_object_avail;
extern atomic_t fscache_n_object_dead;
diff --git a/fs/fscache/object.c b/fs/fscache/object.c
index f3f952c..e513ac5 100644
--- a/fs/fscache/object.c
+++ b/fs/fscache/object.c
@@ -468,6 +468,7 @@ static void fscache_lookup_object(struct fscache_object *object)
{
struct fscache_cookie *cookie = object->cookie;
struct fscache_object *parent;
+ int ret;

_enter("");

@@ -493,12 +494,19 @@ static void fscache_lookup_object(struct fscache_object *object)

fscache_stat(&fscache_n_object_lookups);
fscache_stat(&fscache_n_cop_lookup_object);
- object->cache->ops->lookup_object(object);
+ ret = object->cache->ops->lookup_object(object);
fscache_stat_d(&fscache_n_cop_lookup_object);

if (test_bit(FSCACHE_OBJECT_EV_ERROR, &object->events))
set_bit(FSCACHE_COOKIE_UNAVAILABLE, &cookie->flags);

+ if (ret == -ETIMEDOUT) {
+ /* probably stuck behind another object, so move this one to
+ * the back of the queue */
+ fscache_stat(&fscache_n_object_lookups_timed_out);
+ set_bit(FSCACHE_OBJECT_EV_REQUEUE, &object->events);
+ }
+
_leave("");
}

diff --git a/fs/fscache/stats.c b/fs/fscache/stats.c
index 05f77ca..46435f3 100644
--- a/fs/fscache/stats.c
+++ b/fs/fscache/stats.c
@@ -98,6 +98,7 @@ atomic_t fscache_n_object_no_alloc;
atomic_t fscache_n_object_lookups;
atomic_t fscache_n_object_lookups_negative;
atomic_t fscache_n_object_lookups_positive;
+atomic_t fscache_n_object_lookups_timed_out;
atomic_t fscache_n_object_created;
atomic_t fscache_n_object_avail;
atomic_t fscache_n_object_dead;
@@ -160,10 +161,11 @@ static int fscache_stats_show(struct seq_file *m, void *v)
atomic_read(&fscache_n_acquires_nobufs),
atomic_read(&fscache_n_acquires_oom));

- seq_printf(m, "Lookups: n=%u neg=%u pos=%u crt=%u\n",
+ seq_printf(m, "Lookups: n=%u neg=%u pos=%u crt=%u tmo=%u\n",
atomic_read(&fscache_n_object_lookups),
atomic_read(&fscache_n_object_lookups_negative),
atomic_read(&fscache_n_object_lookups_positive),
+ atomic_read(&fscache_n_object_lookups_timed_out),
atomic_read(&fscache_n_object_created));

seq_printf(m, "Updates: n=%u nul=%u run=%u\n",
diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
index 5db5000..7be0c6f 100644
--- a/include/linux/fscache-cache.h
+++ b/include/linux/fscache-cache.h
@@ -234,8 +234,10 @@ struct fscache_cache_ops {
struct fscache_object *(*alloc_object)(struct fscache_cache *cache,
struct fscache_cookie *cookie);

- /* look up the object for a cookie */
- void (*lookup_object)(struct fscache_object *object);
+ /* look up the object for a cookie
+ * - return -ETIMEDOUT to be requeued
+ */
+ int (*lookup_object)(struct fscache_object *object);

/* finished looking up */
void (*lookup_complete)(struct fscache_object *object);

2009-11-19 17:23:23

by David Howells

[permalink] [raw]
Subject: [PATCH 28/28] CacheFiles: Don't log lookup/create failing with ENOBUFS

Don't log the CacheFiles lookup/create object routines failing with ENOBUFS as
under high memory load or high cache load they can do this quite a lot. This
error simply means that the requested object cannot be created on disk due to
lack of space, or due to failure of the backing filesystem to find sufficient
resources.

Signed-off-by: David Howells <[email protected]>
---

fs/cachefiles/interface.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/cachefiles/interface.c b/fs/cachefiles/interface.c
index 9d3c426..2708931 100644
--- a/fs/cachefiles/interface.c
+++ b/fs/cachefiles/interface.c
@@ -147,8 +147,9 @@ static int cachefiles_lookup_object(struct fscache_object *_object)
cachefiles_attr_changed(&object->fscache);

if (ret < 0 && ret != -ETIMEDOUT) {
- printk(KERN_WARNING "CacheFiles: Lookup failed error %d\n",
- ret);
+ if (ret != -ENOBUFS)
+ printk(KERN_WARNING
+ "CacheFiles: Lookup failed error %d\n", ret);
fscache_object_lookup_error(&object->fscache);
}

2009-11-20 11:16:43

by Steven Whitehouse

[permalink] [raw]
Subject: Re: [PATCH 01/28] SLOW_WORK: Wait for outstanding work items belonging to a module to clear

Hi,

On Thu, Nov 19, 2009 at 05:20:39PM +0000, David Howells wrote:
> Wait for outstanding slow work items belonging to a module to clear when
> unregistering that module as a user of the facility. This prevents the put_ref
> code of a work item from being taken away before it returns.
>
> Signed-off-by: David Howells <[email protected]>
> ---
>
> Documentation/slow-work.txt | 13 +++-
> fs/fscache/main.c | 6 +-
> fs/fscache/object.c | 1
> fs/fscache/operation.c | 1
> fs/gfs2/main.c | 4 +
> fs/gfs2/recovery.c | 1
> include/linux/slow-work.h | 8 ++-
> kernel/slow-work.c | 132 +++++++++++++++++++++++++++++++++++++++++--
> 8 files changed, 150 insertions(+), 16 deletions(-)
>
GFS2 bits Acked-By: Steven Whitehouse <[email protected]>

Steve.

2009-11-20 21:54:39

by David Howells

[permalink] [raw]
Subject: [PATCH 0/3] Fixes for FS-Cache and CacheFiles

The associated patches provide three sets of changes:

(1) More ways to get debugging information from the slow-work, FS-Cache and
CacheFiles facilities.

(2) Fixes for all facilities.

(3) Some extensions for slow-work for users other than FS-Cache/CacheFiles
that aren't included in this series, but upon which other patches have
been built.

The patches can also be found in:

http://people.redhat.com/~dhowells/fscache/patches/fscache-fixes.tar.bz2

The bugs fixed were found by running a simple stress tester:

http://people.redhat.com/~dhowells/slurp.c

Build it and point it at a bunch of NFS (or AFS or whatever) directory trees,
and it forks and execs a copy of itself to handle each directory that it is
given. Whilst scanning a directory, for each subdir of that directory a new
copy is forked and exec'd, and then each file in that directory is read from
end to end.

This results in a large number of processes reading in parallel a large number
of files. If the data set is large enough, it applies pressure to both RAM
space and cache space, and can cause OOMs to occur, and can cause cachefilesd
to run in parallel trying to cull the cache as well.

With these patches applied, you can see more counters in /proc/fs/fscache/stats
and new sources of information exist too:

(*) /proc/slow_work_rq

The items currently being processed by or currently queued for
processing by the slow-work facility.

(*) /proc/fs/fscache/objects

A list of current fscache objects and their states. This can be
configured by loading a key into your session keyring before viewing
the file, for instance:

keyctl add user fscache:objlist KB @s

This is described in Documentation/filesystems/caching/fscache.txt

David
---

David Howells (3):
FS-Cache: Provide nop fscache_stat_d() if CONFIG_FSCACHE_STATS=n
SLOW_WORK: Fix GFS2 to #include <linux/module.h> before using THIS_MODULE
SLOW_WORK: Fix CIFS to pass THIS_MODULE to slow_work_register_user()


fs/cifs/cifsfs.c | 2 +-
fs/fscache/internal.h | 1 +
fs/gfs2/recovery.c | 1 +
3 files changed, 3 insertions(+), 1 deletions(-)

2009-11-20 21:54:47

by David Howells

[permalink] [raw]
Subject: [PATCH 1/3] SLOW_WORK: Fix CIFS to pass THIS_MODULE to slow_work_register_user()

As of the patch:

SLOW_WORK: Wait for outstanding work items belonging to a module to clear

Wait for outstanding slow work items belonging to a module to clear
when unregistering that module as a user of the facility. This
prevents the put_ref code of a work item from being taken away before
it returns.

slow_work_register_user() takes a module pointer as an argument. CIFS must now
pass THIS_MODULE as that argument, lest the following error be observed:

fs/cifs/cifsfs.c: In function 'init_cifs':
fs/cifs/cifsfs.c:1040: error: too few arguments to function 'slow_work_register_user'

Signed-off-by: David Howells <[email protected]>
---

fs/cifs/cifsfs.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 9a5e4f5..29f1da7 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -1037,7 +1037,7 @@ init_cifs(void)
 	if (rc)
 		goto out_unregister_key_type;
 #endif
-	rc = slow_work_register_user();
+	rc = slow_work_register_user(THIS_MODULE);
 	if (rc)
 		goto out_unregister_resolver_key;

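The same applies to any other slow-work user.  A minimal sketch of how a
module registers and unregisters as a user after this series (everything
apart from the slow-work calls themselves is made up for illustration):

#include <linux/module.h>
#include <linux/slow-work.h>

static int __init example_init(void)
{
	int ret;

	/* Passing THIS_MODULE lets slow-work pin the module and wait for its
	 * outstanding work items when the user unregisters. */
	ret = slow_work_register_user(THIS_MODULE);
	if (ret < 0)
		return ret;

	/* ... set up and enqueue slow work items here ... */
	return 0;
}

static void __exit example_exit(void)
{
	/* Waits for this module's queued/executing items to clear. */
	slow_work_unregister_user(THIS_MODULE);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");
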
2009-11-20 21:54:51

by David Howells

[permalink] [raw]
Subject: [PATCH 2/3] SLOW_WORK: Fix GFS2 to #include <linux/module.h> before using THIS_MODULE

GFS2 has been altered to pass THIS_MODULE to slow_work_register_user(), but
hasn't been altered to #include <linux/module.h> to provide it, resulting in
the following error:

fs/gfs2/recovery.c:596: error: 'THIS_MODULE' undeclared here (not in a function)

Add the missing #include.

Signed-off-by: David Howells <[email protected]>
---

fs/gfs2/recovery.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/gfs2/recovery.c b/fs/gfs2/recovery.c
index b2bb779..09fa319 100644
--- a/fs/gfs2/recovery.c
+++ b/fs/gfs2/recovery.c
@@ -7,6 +7,7 @@
  * of the GNU General Public License version 2.
  */
 
+#include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/completion.h>

2009-11-20 21:54:55

by David Howells

[permalink] [raw]
Subject: [PATCH 3/3] FS-Cache: Provide nop fscache_stat_d() if CONFIG_FSCACHE_STATS=n

Provide nop fscache_stat_d() macro if CONFIG_FSCACHE_STATS=n lest errors like
the following occur:

fs/fscache/cache.c: In function 'fscache_withdraw_cache':
fs/fscache/cache.c:386: error: implicit declaration of function 'fscache_stat_d'
fs/fscache/cache.c:386: error: 'fscache_n_cop_sync_cache' undeclared (first use in this function)
fs/fscache/cache.c:386: error: (Each undeclared identifier is reported only once
fs/fscache/cache.c:386: error: for each function it appears in.)
fs/fscache/cache.c:392: error: 'fscache_n_cop_dissociate_pages' undeclared (first use in this function)
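
For context, fscache_stat() and fscache_stat_d() bracket calls into the cache
backend so that the entry/exit counters added earlier in the series can show
how many operations are currently in flight.  Roughly how the call sites named
in the errors above use the pair (a sketch of fscache_withdraw_cache(), not
the exact fs/fscache/cache.c code):

	/* with CONFIG_FSCACHE_STATS=y these count entry to and exit from the
	 * backend operations; with it off, both must compile away to nothing,
	 * hence the nop fscache_stat_d() added below */
	fscache_stat(&fscache_n_cop_sync_cache);
	cache->ops->sync_cache(cache);
	fscache_stat_d(&fscache_n_cop_sync_cache);

	fscache_stat(&fscache_n_cop_dissociate_pages);
	cache->ops->dissociate_pages(cache);
	fscache_stat_d(&fscache_n_cop_dissociate_pages);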

Signed-off-by: David Howells <[email protected]>
---

fs/fscache/internal.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
index 0ca2566..edd7434 100644
--- a/fs/fscache/internal.h
+++ b/fs/fscache/internal.h
@@ -259,6 +259,7 @@ extern const struct file_operations fscache_stats_fops;

 #define __fscache_stat(stat) (NULL)
 #define fscache_stat(stat) do {} while (0)
+#define fscache_stat_d(stat) do {} while (0)
 #endif

/*