2015-02-22 18:32:18

by Vladimir Davydov

Subject: [PATCH 0/4] cleancache: remove limit on the number of cleancache enabled filesystems

Hi,

Currently, the maximum number of cleancache-enabled filesystems is 32,
which is insufficient nowadays, because a Linux host can have hundreds
of containers on board, each of which might want its own filesystem.
This patch set aims to remove this limitation - see patch 4 for
more details. Patches 1-3 prepare the code for this change.

Thanks,

Vladimir Davydov (4):
ocfs2: copy fs uuid to superblock
cleancache: zap uuid arg of cleancache_init_shared_fs
cleancache: forbid overriding cleancache_ops
cleancache: remove limit on the number of cleancache enabled
filesystems

Documentation/vm/cleancache.txt | 4 +-
drivers/xen/tmem.c | 16 ++-
fs/ocfs2/super.c | 4 +-
fs/super.c | 2 +-
include/linux/cleancache.h | 13 +-
mm/cleancache.c | 270 +++++++++++----------------------------
6 files changed, 94 insertions(+), 215 deletions(-)

--
1.7.10.4


2015-02-22 18:32:16

by Vladimir Davydov

Subject: [PATCH 1/4] ocfs2: copy fs uuid to superblock

This will allow us to remove the uuid argument from
cleancache_init_shared_fs.

Signed-off-by: Vladimir Davydov <[email protected]>
---
fs/ocfs2/super.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 26675185b886..43f5a9e71b35 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -2069,6 +2069,8 @@ static int ocfs2_initialize_super(struct super_block *sb,
cbits = le32_to_cpu(di->id2.i_super.s_clustersize_bits);
bbits = le32_to_cpu(di->id2.i_super.s_blocksize_bits);
sb->s_maxbytes = ocfs2_max_file_offset(bbits, cbits);
+ memcpy(sb->s_uuid, di->id2.i_super.s_uuid,
+ sizeof(di->id2.i_super.s_uuid));

osb->osb_dx_mask = (1 << (cbits - bbits)) - 1;

--
1.7.10.4

2015-02-22 18:32:15

by Vladimir Davydov

Subject: [PATCH 2/4] cleancache: zap uuid arg of cleancache_init_shared_fs

Use super_block->s_uuid instead. Every shared filesystem using
cleancache must now initialize super_block->s_uuid before calling
cleancache_init_shared_fs. The only one in the tree, ocfs2, already
meets this requirement.

Signed-off-by: Vladimir Davydov <[email protected]>
---
fs/ocfs2/super.c | 2 +-
include/linux/cleancache.h | 6 +++---
mm/cleancache.c | 6 +++---
3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 43f5a9e71b35..18f830a9df50 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -2335,7 +2335,7 @@ static int ocfs2_initialize_super(struct super_block *sb,
mlog_errno(status);
goto bail;
}
- cleancache_init_shared_fs((char *)&di->id2.i_super.s_uuid, sb);
+ cleancache_init_shared_fs(sb);

bail:
return status;
diff --git a/include/linux/cleancache.h b/include/linux/cleancache.h
index 4ce9056b31a8..29657d1c83fb 100644
--- a/include/linux/cleancache.h
+++ b/include/linux/cleancache.h
@@ -36,7 +36,7 @@ struct cleancache_ops {
extern struct cleancache_ops *
cleancache_register_ops(struct cleancache_ops *ops);
extern void __cleancache_init_fs(struct super_block *);
-extern void __cleancache_init_shared_fs(char *, struct super_block *);
+extern void __cleancache_init_shared_fs(struct super_block *);
extern int __cleancache_get_page(struct page *);
extern void __cleancache_put_page(struct page *);
extern void __cleancache_invalidate_page(struct address_space *, struct page *);
@@ -78,10 +78,10 @@ static inline void cleancache_init_fs(struct super_block *sb)
__cleancache_init_fs(sb);
}

-static inline void cleancache_init_shared_fs(char *uuid, struct super_block *sb)
+static inline void cleancache_init_shared_fs(struct super_block *sb)
{
if (cleancache_enabled)
- __cleancache_init_shared_fs(uuid, sb);
+ __cleancache_init_shared_fs(sb);
}

static inline int cleancache_get_page(struct page *page)
diff --git a/mm/cleancache.c b/mm/cleancache.c
index 053bcd8f12fb..532495f2e4f4 100644
--- a/mm/cleancache.c
+++ b/mm/cleancache.c
@@ -155,7 +155,7 @@ void __cleancache_init_fs(struct super_block *sb)
EXPORT_SYMBOL(__cleancache_init_fs);

/* Called by a cleancache-enabled clustered filesystem at time of mount */
-void __cleancache_init_shared_fs(char *uuid, struct super_block *sb)
+void __cleancache_init_shared_fs(struct super_block *sb)
{
int i;

@@ -163,10 +163,10 @@ void __cleancache_init_shared_fs(char *uuid, struct super_block *sb)
for (i = 0; i < MAX_INITIALIZABLE_FS; i++) {
if (shared_fs_poolid_map[i] == FS_UNKNOWN) {
sb->cleancache_poolid = i + FAKE_SHARED_FS_POOLID_OFFSET;
- uuids[i] = uuid;
+ uuids[i] = sb->s_uuid;
if (cleancache_ops)
shared_fs_poolid_map[i] = cleancache_ops->init_shared_fs
- (uuid, PAGE_SIZE);
+ (sb->s_uuid, PAGE_SIZE);
else
shared_fs_poolid_map[i] = FS_NO_BACKEND;
break;
--
1.7.10.4

2015-02-22 18:32:58

by Vladimir Davydov

Subject: [PATCH 3/4] cleancache: forbid overriding cleancache_ops

Currently, cleancache_register_ops returns the previous value of
cleancache_ops to allow chaining. However, chaining, as it is
implemented now, is extremely dangerous due to possible pool id
collisions. Suppose a new cleancache driver is registered after the
previous one assigned an id to a super block. If the new driver assigns
the same id to another super block, which is perfectly possible, we will
have two different filesystems using the same id. No matter if the new
driver implements chaining or not, we are likely to get data corruption
with such a configuration eventually.

This patch therefore disables the ability to override cleancache_ops
altogether as potentially dangerous. If a cleancache driver is already
registered, all further calls to cleancache_register_ops will
return -EBUSY. Since no user of cleancache implements chaining, we only
need to make minor changes to the code outside the cleancache core.
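
The collision described above can be reproduced in a small user-space
sketch. It is only an illustration, not kernel code: struct demo_ops,
the two backends, register_override and register_once are hypothetical
names, and locking is left out because the demo is single-threaded.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cleancache_ops: pool creation only. */
struct demo_ops {
	int (*init_fs)(void);
};

/* Each backend hands out pool ids from its own counter, starting at 0. */
static int a_next, b_next;
static int backend_a_init_fs(void) { return a_next++; }
static int backend_b_init_fs(void) { return b_next++; }

static struct demo_ops backend_a = { .init_fs = backend_a_init_fs };
static struct demo_ops backend_b = { .init_fs = backend_b_init_fs };

static struct demo_ops *current_ops;

/* Old behaviour: replace the backend and return the previous one. */
static struct demo_ops *register_override(struct demo_ops *ops)
{
	struct demo_ops *old = current_ops;

	current_ops = ops;
	return old;
}

/* New behaviour: the first backend wins, later callers get -EBUSY. */
static int register_once(struct demo_ops *ops)
{
	assert(ops != NULL);
	if (current_ops)
		return -EBUSY;
	current_ops = ops;
	return 0;
}
```

With register_override, a filesystem mounted under backend_a and one
mounted after backend_b took over both end up with pool id 0. With
register_once the second registration is rejected, so ids always come
from a single backend.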

Signed-off-by: Vladimir Davydov <[email protected]>
---
Documentation/vm/cleancache.txt | 4 +---
drivers/xen/tmem.c | 16 +++++++++-------
include/linux/cleancache.h | 3 +--
mm/cleancache.c | 12 +++++++-----
4 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/Documentation/vm/cleancache.txt b/Documentation/vm/cleancache.txt
index 01d76282444e..e4b49df7a048 100644
--- a/Documentation/vm/cleancache.txt
+++ b/Documentation/vm/cleancache.txt
@@ -28,9 +28,7 @@ IMPLEMENTATION OVERVIEW
A cleancache "backend" that provides transcendent memory registers itself
to the kernel's cleancache "frontend" by calling cleancache_register_ops,
passing a pointer to a cleancache_ops structure with funcs set appropriately.
-Note that cleancache_register_ops returns the previous settings so that
-chaining can be performed if desired. The functions provided must conform to
-certain semantics as follows:
+The functions provided must conform to certain semantics as follows:

Most important, cleancache is "ephemeral". Pages which are copied into
cleancache have an indefinite lifetime which is completely unknowable
diff --git a/drivers/xen/tmem.c b/drivers/xen/tmem.c
index 8a65423bc696..8529e535459e 100644
--- a/drivers/xen/tmem.c
+++ b/drivers/xen/tmem.c
@@ -397,13 +397,15 @@ static int __init xen_tmem_init(void)
#ifdef CONFIG_CLEANCACHE
BUG_ON(sizeof(struct cleancache_filekey) != sizeof(struct tmem_oid));
if (tmem_enabled && cleancache) {
- char *s = "";
- struct cleancache_ops *old_ops =
- cleancache_register_ops(&tmem_cleancache_ops);
- if (old_ops)
- s = " (WARNING: cleancache_ops overridden)";
- pr_info("cleancache enabled, RAM provided by Xen Transcendent Memory%s\n",
- s);
+ int err;
+
+ err = cleancache_register_ops(&tmem_cleancache_ops);
+ if (err)
+ pr_warn("xen-tmem: failed to enable cleancache: %d\n",
+ err);
+ else
+ pr_info("cleancache enabled, RAM provided by "
+ "Xen Transcendent Memory\n");
}
#endif
#ifdef CONFIG_XEN_SELFBALLOONING
diff --git a/include/linux/cleancache.h b/include/linux/cleancache.h
index 29657d1c83fb..b23611f43cfb 100644
--- a/include/linux/cleancache.h
+++ b/include/linux/cleancache.h
@@ -33,8 +33,7 @@ struct cleancache_ops {
void (*invalidate_fs)(int);
};

-extern struct cleancache_ops *
- cleancache_register_ops(struct cleancache_ops *ops);
+extern int cleancache_register_ops(struct cleancache_ops *ops);
extern void __cleancache_init_fs(struct super_block *);
extern void __cleancache_init_shared_fs(struct super_block *);
extern int __cleancache_get_page(struct page *);
diff --git a/mm/cleancache.c b/mm/cleancache.c
index 532495f2e4f4..aa10f9a3bc88 100644
--- a/mm/cleancache.c
+++ b/mm/cleancache.c
@@ -106,15 +106,17 @@ static DEFINE_MUTEX(poolid_mutex);
*/

/*
- * Register operations for cleancache, returning previous thus allowing
- * detection of multiple backends and possible nesting.
+ * Register operations for cleancache. Returns 0 on success.
*/
-struct cleancache_ops *cleancache_register_ops(struct cleancache_ops *ops)
+int cleancache_register_ops(struct cleancache_ops *ops)
{
- struct cleancache_ops *old = cleancache_ops;
int i;

mutex_lock(&poolid_mutex);
+ if (cleancache_ops) {
+ mutex_unlock(&poolid_mutex);
+ return -EBUSY;
+ }
for (i = 0; i < MAX_INITIALIZABLE_FS; i++) {
if (fs_poolid_map[i] == FS_NO_BACKEND)
fs_poolid_map[i] = ops->init_fs(PAGE_SIZE);
@@ -130,7 +132,7 @@ struct cleancache_ops *cleancache_register_ops(struct cleancache_ops *ops)
barrier();
cleancache_ops = ops;
mutex_unlock(&poolid_mutex);
- return old;
+ return 0;
}
EXPORT_SYMBOL(cleancache_register_ops);

--
1.7.10.4

2015-02-22 18:32:23

by Vladimir Davydov

Subject: [PATCH 4/4] cleancache: remove limit on the number of cleancache enabled filesystems

The limit equals 32 and is imposed by the number of entries in the
fs_poolid_map and shared_fs_poolid_map. Nowadays it is insufficient,
because with containers on board a Linux host can have hundreds of
active fs mounts.

These maps were introduced by commit 49a9ab815acb8 ("mm: cleancache:
lazy initialization to allow tmem backends to build/run as modules") in
order to allow compiling cleancache drivers as modules. Real pool ids
are stored in these maps while super_block->cleancache_poolid points to
an entry in the map, so that on cleancache registration we can walk over
all (if there are <= 32 of them, of course) cleancache-enabled super
blocks and assign real pool ids.
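
To make the origin of the limit concrete, here is a simplified
user-space sketch of the fake pool id scheme. The identifiers
(DEMO_MAX_FS, demo_mount and so on) are made up for the illustration
and only one of the two maps is modelled; the removed kernel code is
shown in the diff below.

```c
#include <assert.h>

#define DEMO_MAX_FS		32	/* size of the map, hence the limit */
#define DEMO_FAKE_OFFSET	1000	/* fake id = offset + map index */
#define DEMO_SLOT_NO_BACKEND	(-1)	/* slot taken, no real pool yet */
#define DEMO_SLOT_FREE		(-2)	/* slot unused */

static int demo_poolid_map[DEMO_MAX_FS];

static void demo_map_reset(void)
{
	for (int i = 0; i < DEMO_MAX_FS; i++)
		demo_poolid_map[i] = DEMO_SLOT_FREE;
}

/* Mount: claim a free slot and return its fake id, or -1 if none is left. */
static int demo_mount(void)
{
	for (int i = 0; i < DEMO_MAX_FS; i++) {
		if (demo_poolid_map[i] == DEMO_SLOT_FREE) {
			demo_poolid_map[i] = DEMO_SLOT_NO_BACKEND;
			return i + DEMO_FAKE_OFFSET;
		}
	}
	return -1;
}

/* Translate a fake id stored in the super block into the map entry. */
static int demo_real_poolid(int fake_id)
{
	assert(fake_id >= DEMO_FAKE_OFFSET &&
	       fake_id < DEMO_FAKE_OFFSET + DEMO_MAX_FS);
	return demo_poolid_map[fake_id - DEMO_FAKE_OFFSET];
}
```

In this model the 33rd call to demo_mount finds no free slot and
returns -1, i.e. that filesystem silently runs without cleancache.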

Actually, there is absolutely no need for these maps, because we can
iterate over all super blocks immediately using iterate_supers. This is
not racy, because cleancache_init_fs is called from mount_fs with
super_block->s_umount held for writing, while iterate_supers takes this
semaphore for reading, so if we call iterate_supers after setting
cleancache_ops, all super blocks that had been created before
cleancache_register_ops was called will be assigned pool ids by the
action function of iterate_supers while all newer super blocks will
receive them in cleancache_init_fs.
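
The idea can be sketched in user space as follows. This is a
single-threaded illustration under stated assumptions: struct demo_sb,
demo_init_fs and demo_register are hypothetical names, a plain array
stands in for the list walked by iterate_supers, s_umount locking is
not modelled, and only the non-shared case is covered.

```c
#include <assert.h>
#include <stddef.h>

/* Same values as the sentinels this patch adds to cleancache.h. */
#define DEMO_NO_POOL	(-1)	/* cleancache not used by this fs */
#define DEMO_NO_BACKEND	(-2)	/* fs wants a pool, no backend yet */

struct demo_sb {
	int poolid;
};

typedef int (*demo_init_fn)(void);

static demo_init_fn demo_backend_init;	/* NULL until a backend registers */

/* Toy backend: hands out pool ids 0, 1, 2, ... */
static int demo_next_pool;
static int demo_backend(void) { return demo_next_pool++; }

/* Called at mount time: get a real id now or leave a marker for later. */
static void demo_init_fs(struct demo_sb *sb)
{
	int id = DEMO_NO_BACKEND;

	if (demo_backend_init) {
		id = demo_backend_init();
		if (id < 0)
			id = DEMO_NO_POOL;
	}
	sb->poolid = id;
}

/* Called at registration: visit every super block, fix up marked ones. */
static void demo_register(demo_init_fn fn, struct demo_sb *sbs, size_t n)
{
	assert(fn != NULL);
	demo_backend_init = fn;
	for (size_t i = 0; i < n; i++)
		if (sbs[i].poolid == DEMO_NO_BACKEND)
			demo_init_fs(&sbs[i]);
}
```

Super blocks mounted before registration carry DEMO_NO_BACKEND until
demo_register visits them; super blocks mounted afterwards get a real
id directly in demo_init_fs. No fixed-size table is involved, so the
number of filesystems is not bounded.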

This patch therefore removes the maps and hence the artificial limit on
the number of cleancache enabled filesystems.

Signed-off-by: Vladimir Davydov <[email protected]>
---
fs/super.c | 2 +-
include/linux/cleancache.h | 4 +
mm/cleancache.c | 260 +++++++++++---------------------------------
3 files changed, 71 insertions(+), 195 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 65a53efc1cf4..ed5a9b9c3206 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -224,7 +224,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
s->s_maxbytes = MAX_NON_LFS;
s->s_op = &default_op;
s->s_time_gran = 1000000000;
- s->cleancache_poolid = -1;
+ s->cleancache_poolid = CLEANCACHE_NO_POOL;

s->s_shrink.seeks = DEFAULT_SEEKS;
s->s_shrink.scan_objects = super_cache_scan;
diff --git a/include/linux/cleancache.h b/include/linux/cleancache.h
index b23611f43cfb..bda5ec0b4b4d 100644
--- a/include/linux/cleancache.h
+++ b/include/linux/cleancache.h
@@ -5,6 +5,10 @@
#include <linux/exportfs.h>
#include <linux/mm.h>

+#define CLEANCACHE_NO_POOL -1
+#define CLEANCACHE_NO_BACKEND -2
+#define CLEANCACHE_NO_BACKEND_SHARED -3
+
#define CLEANCACHE_KEY_MAX 6

/*
diff --git a/mm/cleancache.c b/mm/cleancache.c
index aa10f9a3bc88..5e357a3cd897 100644
--- a/mm/cleancache.c
+++ b/mm/cleancache.c
@@ -12,6 +12,7 @@
*/

#include <linux/module.h>
+#include <linux/spinlock.h>
#include <linux/fs.h>
#include <linux/exportfs.h>
#include <linux/mm.h>
@@ -19,10 +20,11 @@
#include <linux/cleancache.h>

/*
- * cleancache_ops is set by cleancache_ops_register to contain the pointers
+ * cleancache_ops is set by cleancache_register_ops to contain the pointers
* to the cleancache "backend" implementation functions.
*/
static struct cleancache_ops *cleancache_ops __read_mostly;
+static DEFINE_SPINLOCK(cleancache_ops_lock);

/*
* Counters available via /sys/kernel/debug/cleancache (if debugfs is
@@ -34,104 +36,32 @@ static u64 cleancache_failed_gets;
static u64 cleancache_puts;
static u64 cleancache_invalidates;

-/*
- * When no backend is registered all calls to init_fs and init_shared_fs
- * are registered and fake poolids (FAKE_FS_POOLID_OFFSET or
- * FAKE_SHARED_FS_POOLID_OFFSET, plus offset in the respective array
- * [shared_|]fs_poolid_map) are given to the respective super block
- * (sb->cleancache_poolid) and no tmem_pools are created. When a backend
- * registers with cleancache the previous calls to init_fs and init_shared_fs
- * are executed to create tmem_pools and set the respective poolids. While no
- * backend is registered all "puts", "gets" and "flushes" are ignored or failed.
- */
-#define MAX_INITIALIZABLE_FS 32
-#define FAKE_FS_POOLID_OFFSET 1000
-#define FAKE_SHARED_FS_POOLID_OFFSET 2000
-
-#define FS_NO_BACKEND (-1)
-#define FS_UNKNOWN (-2)
-static int fs_poolid_map[MAX_INITIALIZABLE_FS];
-static int shared_fs_poolid_map[MAX_INITIALIZABLE_FS];
-static char *uuids[MAX_INITIALIZABLE_FS];
-/*
- * Mutex for the [shared_|]fs_poolid_map to guard against multiple threads
- * invoking umount (and ending in __cleancache_invalidate_fs) and also multiple
- * threads calling mount (and ending up in __cleancache_init_[shared|]fs).
- */
-static DEFINE_MUTEX(poolid_mutex);
-/*
- * When set to false (default) all calls to the cleancache functions, except
- * the __cleancache_invalidate_fs and __cleancache_init_[shared|]fs are guarded
- * by the if (!cleancache_ops) return. This means multiple threads (from
- * different filesystems) will be checking cleancache_ops. The usage of a
- * bool instead of a atomic_t or a bool guarded by a spinlock is OK - we are
- * OK if the time between the backend's have been initialized (and
- * cleancache_ops has been set to not NULL) and when the filesystems start
- * actually calling the backends. The inverse (when unloading) is obviously
- * not good - but this shim does not do that (yet).
- */
-
-/*
- * The backends and filesystems work all asynchronously. This is b/c the
- * backends can be built as modules.
- * The usual sequence of events is:
- * a) mount / -> __cleancache_init_fs is called. We set the
- * [shared_|]fs_poolid_map and uuids for.
- *
- * b). user does I/Os -> we call the rest of __cleancache_* functions
- * which return immediately as cleancache_ops is false.
- *
- * c). modprobe zcache -> cleancache_register_ops. We init the backend
- * and set cleancache_ops to true, and for any fs_poolid_map
- * (which is set by __cleancache_init_fs) we initialize the poolid.
- *
- * d). user does I/Os -> now that cleancache_ops is true all the
- * __cleancache_* functions can call the backend. They all check
- * that fs_poolid_map is valid and if so invoke the backend.
- *
- * e). umount / -> __cleancache_invalidate_fs, the fs_poolid_map is
- * reset (which is the second check in the __cleancache_* ops
- * to call the backend).
- *
- * The sequence of event could also be c), followed by a), and d). and e). The
- * c) would not happen anymore. There is also the chance of c), and one thread
- * doing a) + d), and another doing e). For that case we depend on the
- * filesystem calling __cleancache_invalidate_fs in the proper sequence (so
- * that it handles all I/Os before it invalidates the fs (which is last part
- * of unmounting process).
- *
- * Note: The acute reader will notice that there is no "rmmod zcache" case.
- * This is b/c the functionality for that is not yet implemented and when
- * done, will require some extra locking not yet devised.
- */
+static void cleancache_register_ops_sb(struct super_block *sb, void *unused)
+{
+ switch (sb->cleancache_poolid) {
+ case CLEANCACHE_NO_BACKEND:
+ __cleancache_init_fs(sb);
+ break;
+ case CLEANCACHE_NO_BACKEND_SHARED:
+ __cleancache_init_shared_fs(sb);
+ break;
+ }
+}

/*
* Register operations for cleancache. Returns 0 on success.
*/
int cleancache_register_ops(struct cleancache_ops *ops)
{
- int i;
-
- mutex_lock(&poolid_mutex);
+ spin_lock(&cleancache_ops_lock);
if (cleancache_ops) {
- mutex_unlock(&poolid_mutex);
+ spin_unlock(&cleancache_ops_lock);
return -EBUSY;
}
- for (i = 0; i < MAX_INITIALIZABLE_FS; i++) {
- if (fs_poolid_map[i] == FS_NO_BACKEND)
- fs_poolid_map[i] = ops->init_fs(PAGE_SIZE);
- if (shared_fs_poolid_map[i] == FS_NO_BACKEND)
- shared_fs_poolid_map[i] = ops->init_shared_fs
- (uuids[i], PAGE_SIZE);
- }
- /*
- * We MUST set cleancache_ops _after_ we have called the backends
- * init_fs or init_shared_fs functions. Otherwise the compiler might
- * re-order where cleancache_ops is set in this function.
- */
- barrier();
cleancache_ops = ops;
- mutex_unlock(&poolid_mutex);
+ spin_unlock(&cleancache_ops_lock);
+
+ iterate_supers(cleancache_register_ops_sb, NULL);
return 0;
}
EXPORT_SYMBOL(cleancache_register_ops);
@@ -139,42 +69,28 @@ EXPORT_SYMBOL(cleancache_register_ops);
/* Called by a cleancache-enabled filesystem at time of mount */
void __cleancache_init_fs(struct super_block *sb)
{
- int i;
-
- mutex_lock(&poolid_mutex);
- for (i = 0; i < MAX_INITIALIZABLE_FS; i++) {
- if (fs_poolid_map[i] == FS_UNKNOWN) {
- sb->cleancache_poolid = i + FAKE_FS_POOLID_OFFSET;
- if (cleancache_ops)
- fs_poolid_map[i] = cleancache_ops->init_fs(PAGE_SIZE);
- else
- fs_poolid_map[i] = FS_NO_BACKEND;
- break;
- }
+ int pool_id = CLEANCACHE_NO_BACKEND;
+
+ if (cleancache_ops) {
+ pool_id = cleancache_ops->init_fs(PAGE_SIZE);
+ if (pool_id < 0)
+ pool_id = CLEANCACHE_NO_POOL;
}
- mutex_unlock(&poolid_mutex);
+ sb->cleancache_poolid = pool_id;
}
EXPORT_SYMBOL(__cleancache_init_fs);

/* Called by a cleancache-enabled clustered filesystem at time of mount */
void __cleancache_init_shared_fs(struct super_block *sb)
{
- int i;
-
- mutex_lock(&poolid_mutex);
- for (i = 0; i < MAX_INITIALIZABLE_FS; i++) {
- if (shared_fs_poolid_map[i] == FS_UNKNOWN) {
- sb->cleancache_poolid = i + FAKE_SHARED_FS_POOLID_OFFSET;
- uuids[i] = sb->s_uuid;
- if (cleancache_ops)
- shared_fs_poolid_map[i] = cleancache_ops->init_shared_fs
- (sb->s_uuid, PAGE_SIZE);
- else
- shared_fs_poolid_map[i] = FS_NO_BACKEND;
- break;
- }
+ int pool_id = CLEANCACHE_NO_BACKEND_SHARED;
+
+ if (cleancache_ops) {
+ pool_id = cleancache_ops->init_shared_fs(sb->s_uuid, PAGE_SIZE);
+ if (pool_id < 0)
+ pool_id = CLEANCACHE_NO_POOL;
}
- mutex_unlock(&poolid_mutex);
+ sb->cleancache_poolid = pool_id;
}
EXPORT_SYMBOL(__cleancache_init_shared_fs);

@@ -204,19 +120,6 @@ static int cleancache_get_key(struct inode *inode,
}

/*
- * Returns a pool_id that is associated with a given fake poolid.
- */
-static int get_poolid_from_fake(int fake_pool_id)
-{
- if (fake_pool_id >= FAKE_SHARED_FS_POOLID_OFFSET)
- return shared_fs_poolid_map[fake_pool_id -
- FAKE_SHARED_FS_POOLID_OFFSET];
- else if (fake_pool_id >= FAKE_FS_POOLID_OFFSET)
- return fs_poolid_map[fake_pool_id - FAKE_FS_POOLID_OFFSET];
- return FS_NO_BACKEND;
-}
-
-/*
* "Get" data from cleancache associated with the poolid/inode/index
* that were specified when the data was put to cleanache and, if
* successful, use it to fill the specified page with data and return 0.
@@ -231,26 +134,20 @@ int __cleancache_get_page(struct page *page)
{
int ret = -1;
int pool_id;
- int fake_pool_id;
struct cleancache_filekey key = { .u.key = { 0 } };

- if (!cleancache_ops) {
- cleancache_failed_gets++;
+ if (!cleancache_ops)
goto out;
- }

VM_BUG_ON_PAGE(!PageLocked(page), page);
- fake_pool_id = page->mapping->host->i_sb->cleancache_poolid;
- if (fake_pool_id < 0)
+ pool_id = page->mapping->host->i_sb->cleancache_poolid;
+ if (pool_id < 0)
goto out;
- pool_id = get_poolid_from_fake(fake_pool_id);

if (cleancache_get_key(page->mapping->host, &key) < 0)
goto out;

- if (pool_id >= 0)
- ret = cleancache_ops->get_page(pool_id,
- key, page->index, page);
+ ret = cleancache_ops->get_page(pool_id, key, page->index, page);
if (ret == 0)
cleancache_succ_gets++;
else
@@ -273,26 +170,21 @@ EXPORT_SYMBOL(__cleancache_get_page);
void __cleancache_put_page(struct page *page)
{
int pool_id;
- int fake_pool_id;
struct cleancache_filekey key = { .u.key = { 0 } };

- if (!cleancache_ops) {
- cleancache_puts++;
+ if (!cleancache_ops)
return;
- }

VM_BUG_ON_PAGE(!PageLocked(page), page);
- fake_pool_id = page->mapping->host->i_sb->cleancache_poolid;
- if (fake_pool_id < 0)
+ pool_id = page->mapping->host->i_sb->cleancache_poolid;
+ if (pool_id < 0)
return;

- pool_id = get_poolid_from_fake(fake_pool_id);
+ if (cleancache_get_key(page->mapping->host, &key) < 0)
+ return;

- if (pool_id >= 0 &&
- cleancache_get_key(page->mapping->host, &key) >= 0) {
- cleancache_ops->put_page(pool_id, key, page->index, page);
- cleancache_puts++;
- }
+ cleancache_ops->put_page(pool_id, key, page->index, page);
+ cleancache_puts++;
}
EXPORT_SYMBOL(__cleancache_put_page);

@@ -309,24 +201,21 @@ void __cleancache_invalidate_page(struct address_space *mapping,
{
/* careful... page->mapping is NULL sometimes when this is called */
int pool_id;
- int fake_pool_id = mapping->host->i_sb->cleancache_poolid;
struct cleancache_filekey key = { .u.key = { 0 } };

if (!cleancache_ops)
return;

- if (fake_pool_id >= 0) {
- pool_id = get_poolid_from_fake(fake_pool_id);
- if (pool_id < 0)
- return;
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ pool_id = mapping->host->i_sb->cleancache_poolid;
+ if (pool_id < 0)
+ return;

- VM_BUG_ON_PAGE(!PageLocked(page), page);
- if (cleancache_get_key(mapping->host, &key) >= 0) {
- cleancache_ops->invalidate_page(pool_id,
- key, page->index);
- cleancache_invalidates++;
- }
- }
+ if (cleancache_get_key(mapping->host, &key) < 0)
+ return;
+
+ cleancache_ops->invalidate_page(pool_id, key, page->index);
+ cleancache_invalidates++;
}
EXPORT_SYMBOL(__cleancache_invalidate_page);

@@ -342,19 +231,19 @@ EXPORT_SYMBOL(__cleancache_invalidate_page);
void __cleancache_invalidate_inode(struct address_space *mapping)
{
int pool_id;
- int fake_pool_id = mapping->host->i_sb->cleancache_poolid;
struct cleancache_filekey key = { .u.key = { 0 } };

if (!cleancache_ops)
return;

- if (fake_pool_id < 0)
+ pool_id = mapping->host->i_sb->cleancache_poolid;
+ if (pool_id < 0)
return;

- pool_id = get_poolid_from_fake(fake_pool_id);
+ if (cleancache_get_key(mapping->host, &key) < 0)
+ return;

- if (pool_id >= 0 && cleancache_get_key(mapping->host, &key) >= 0)
- cleancache_ops->invalidate_inode(pool_id, key);
+ cleancache_ops->invalidate_inode(pool_id, key);
}
EXPORT_SYMBOL(__cleancache_invalidate_inode);

@@ -365,32 +254,19 @@ EXPORT_SYMBOL(__cleancache_invalidate_inode);
*/
void __cleancache_invalidate_fs(struct super_block *sb)
{
- int index;
- int fake_pool_id = sb->cleancache_poolid;
- int old_poolid = fake_pool_id;
-
- mutex_lock(&poolid_mutex);
- if (fake_pool_id >= FAKE_SHARED_FS_POOLID_OFFSET) {
- index = fake_pool_id - FAKE_SHARED_FS_POOLID_OFFSET;
- old_poolid = shared_fs_poolid_map[index];
- shared_fs_poolid_map[index] = FS_UNKNOWN;
- uuids[index] = NULL;
- } else if (fake_pool_id >= FAKE_FS_POOLID_OFFSET) {
- index = fake_pool_id - FAKE_FS_POOLID_OFFSET;
- old_poolid = fs_poolid_map[index];
- fs_poolid_map[index] = FS_UNKNOWN;
- }
- sb->cleancache_poolid = -1;
- if (cleancache_ops)
- cleancache_ops->invalidate_fs(old_poolid);
- mutex_unlock(&poolid_mutex);
+ int pool_id;
+
+ pool_id = sb->cleancache_poolid;
+ if (pool_id < 0)
+ return;
+
+ sb->cleancache_poolid = CLEANCACHE_NO_POOL;
+ cleancache_ops->invalidate_fs(pool_id);
}
EXPORT_SYMBOL(__cleancache_invalidate_fs);

static int __init init_cleancache(void)
{
- int i;
-
#ifdef CONFIG_DEBUG_FS
struct dentry *root = debugfs_create_dir("cleancache", NULL);
if (root == NULL)
@@ -402,10 +278,6 @@ static int __init init_cleancache(void)
debugfs_create_u64("invalidates", S_IRUGO,
root, &cleancache_invalidates);
#endif
- for (i = 0; i < MAX_INITIALIZABLE_FS; i++) {
- fs_poolid_map[i] = FS_UNKNOWN;
- shared_fs_poolid_map[i] = FS_UNKNOWN;
- }
return 0;
}
module_init(init_cleancache)
--
1.7.10.4

2015-02-23 10:31:41

by Vladimir Davydov

Subject: Re: [PATCH 4/4] cleancache: remove limit on the number of cleancache enabled filesystems

Rechecking this patch, I find it rather difficult to review, because it
not only gets rid of fake_pool_id, but also rearranges the code of the
cleancache methods. Here is an updated patch, which attempts to be less
intrusive:
---
From: Vladimir Davydov <[email protected]>
Subject: [PATCH v2] cleancache: remove limit on the number of cleancache
enabled filesystems

The limit equals 32 and is imposed by the number of entries in the
fs_poolid_map and shared_fs_poolid_map. Nowadays it is insufficient,
because with containers on board a Linux host can have hundreds of
active fs mounts.

These maps were introduced by commit 49a9ab815acb8 ("mm: cleancache:
lazy initialization to allow tmem backends to build/run as modules") in
order to allow compiling cleancache drivers as modules. Real pool ids
are stored in these maps while super_block->cleancache_poolid points to
an entry in the map, so that on cleancache registration we can walk over
all (if there are <= 32 of them, of course) cleancache-enabled super
blocks and assign real pool ids.

Actually, there is absolutely no need for these maps, because we can
iterate over all super blocks immediately using iterate_supers. This is
not racy, because cleancache_init_fs is called from mount_fs with
super_block->s_umount held for writing, while iterate_supers takes this
semaphore for reading, so if we call iterate_supers after setting
cleancache_ops, all super blocks that had been created before
cleancache_register_ops was called will be assigned pool ids by the
action function of iterate_supers while all newer super blocks will
receive them in cleancache_init_fs.

This patch therefore removes the maps and hence the artificial limit on
the number of cleancache enabled filesystems.

Signed-off-by: Vladimir Davydov <[email protected]>
---
Changes in v2:
- do not rearrange code in cleancache_{get,put,invalidate}_page
- use cmpxchg instead of spinlock to synchronize concurrent
cleancache_ops updates
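
For readers unfamiliar with the cmpxchg idiom, the following user-space
sketch shows the same "first caller wins" pattern. It uses C11
atomic_compare_exchange_strong as a stand-in for the kernel's cmpxchg;
struct demo_ops and demo_register_ops are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cleancache_ops. */
struct demo_ops {
	int unused;
};

/* NULL until a backend registers; zero-initialised as a static object. */
static _Atomic(struct demo_ops *) demo_registered_ops;

/*
 * Install ops only if nothing is registered yet. The compare and the
 * store happen as one atomic step, so two concurrent callers cannot
 * both succeed and no lock is needed.
 */
static int demo_register_ops(struct demo_ops *ops)
{
	struct demo_ops *expected = NULL;

	assert(ops != NULL);
	if (!atomic_compare_exchange_strong(&demo_registered_ops,
					    &expected, ops))
		return -EBUSY;
	return 0;
}
```

The first call returns 0 and every later call returns -EBUSY, leaving
the originally registered pointer in place.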

fs/super.c | 2 +-
include/linux/cleancache.h | 4 +
mm/cleancache.c | 223 ++++++++------------------------------------
3 files changed, 45 insertions(+), 184 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 65a53efc1cf4..ed5a9b9c3206 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -224,7 +224,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
s->s_maxbytes = MAX_NON_LFS;
s->s_op = &default_op;
s->s_time_gran = 1000000000;
- s->cleancache_poolid = -1;
+ s->cleancache_poolid = CLEANCACHE_NO_POOL;

s->s_shrink.seeks = DEFAULT_SEEKS;
s->s_shrink.scan_objects = super_cache_scan;
diff --git a/include/linux/cleancache.h b/include/linux/cleancache.h
index b23611f43cfb..bda5ec0b4b4d 100644
--- a/include/linux/cleancache.h
+++ b/include/linux/cleancache.h
@@ -5,6 +5,10 @@
#include <linux/exportfs.h>
#include <linux/mm.h>

+#define CLEANCACHE_NO_POOL -1
+#define CLEANCACHE_NO_BACKEND -2
+#define CLEANCACHE_NO_BACKEND_SHARED -3
+
#define CLEANCACHE_KEY_MAX 6

/*
diff --git a/mm/cleancache.c b/mm/cleancache.c
index aa10f9a3bc88..fbdaf9c77d7a 100644
--- a/mm/cleancache.c
+++ b/mm/cleancache.c
@@ -19,7 +19,7 @@
#include <linux/cleancache.h>

/*
- * cleancache_ops is set by cleancache_ops_register to contain the pointers
+ * cleancache_ops is set by cleancache_register_ops to contain the pointers
* to the cleancache "backend" implementation functions.
*/
static struct cleancache_ops *cleancache_ops __read_mostly;
@@ -34,104 +34,27 @@ static u64 cleancache_failed_gets;
static u64 cleancache_puts;
static u64 cleancache_invalidates;

-/*
- * When no backend is registered all calls to init_fs and init_shared_fs
- * are registered and fake poolids (FAKE_FS_POOLID_OFFSET or
- * FAKE_SHARED_FS_POOLID_OFFSET, plus offset in the respective array
- * [shared_|]fs_poolid_map) are given to the respective super block
- * (sb->cleancache_poolid) and no tmem_pools are created. When a backend
- * registers with cleancache the previous calls to init_fs and init_shared_fs
- * are executed to create tmem_pools and set the respective poolids. While no
- * backend is registered all "puts", "gets" and "flushes" are ignored or failed.
- */
-#define MAX_INITIALIZABLE_FS 32
-#define FAKE_FS_POOLID_OFFSET 1000
-#define FAKE_SHARED_FS_POOLID_OFFSET 2000
-
-#define FS_NO_BACKEND (-1)
-#define FS_UNKNOWN (-2)
-static int fs_poolid_map[MAX_INITIALIZABLE_FS];
-static int shared_fs_poolid_map[MAX_INITIALIZABLE_FS];
-static char *uuids[MAX_INITIALIZABLE_FS];
-/*
- * Mutex for the [shared_|]fs_poolid_map to guard against multiple threads
- * invoking umount (and ending in __cleancache_invalidate_fs) and also multiple
- * threads calling mount (and ending up in __cleancache_init_[shared|]fs).
- */
-static DEFINE_MUTEX(poolid_mutex);
-/*
- * When set to false (default) all calls to the cleancache functions, except
- * the __cleancache_invalidate_fs and __cleancache_init_[shared|]fs are guarded
- * by the if (!cleancache_ops) return. This means multiple threads (from
- * different filesystems) will be checking cleancache_ops. The usage of a
- * bool instead of a atomic_t or a bool guarded by a spinlock is OK - we are
- * OK if the time between the backend's have been initialized (and
- * cleancache_ops has been set to not NULL) and when the filesystems start
- * actually calling the backends. The inverse (when unloading) is obviously
- * not good - but this shim does not do that (yet).
- */
-
-/*
- * The backends and filesystems work all asynchronously. This is b/c the
- * backends can be built as modules.
- * The usual sequence of events is:
- * a) mount / -> __cleancache_init_fs is called. We set the
- * [shared_|]fs_poolid_map and uuids for.
- *
- * b). user does I/Os -> we call the rest of __cleancache_* functions
- * which return immediately as cleancache_ops is false.
- *
- * c). modprobe zcache -> cleancache_register_ops. We init the backend
- * and set cleancache_ops to true, and for any fs_poolid_map
- * (which is set by __cleancache_init_fs) we initialize the poolid.
- *
- * d). user does I/Os -> now that cleancache_ops is true all the
- * __cleancache_* functions can call the backend. They all check
- * that fs_poolid_map is valid and if so invoke the backend.
- *
- * e). umount / -> __cleancache_invalidate_fs, the fs_poolid_map is
- * reset (which is the second check in the __cleancache_* ops
- * to call the backend).
- *
- * The sequence of event could also be c), followed by a), and d). and e). The
- * c) would not happen anymore. There is also the chance of c), and one thread
- * doing a) + d), and another doing e). For that case we depend on the
- * filesystem calling __cleancache_invalidate_fs in the proper sequence (so
- * that it handles all I/Os before it invalidates the fs (which is last part
- * of unmounting process).
- *
- * Note: The acute reader will notice that there is no "rmmod zcache" case.
- * This is b/c the functionality for that is not yet implemented and when
- * done, will require some extra locking not yet devised.
- */
+static void cleancache_register_ops_sb(struct super_block *sb, void *unused)
+{
+ switch (sb->cleancache_poolid) {
+ case CLEANCACHE_NO_BACKEND:
+ __cleancache_init_fs(sb);
+ break;
+ case CLEANCACHE_NO_BACKEND_SHARED:
+ __cleancache_init_shared_fs(sb);
+ break;
+ }
+}

/*
* Register operations for cleancache. Returns 0 on success.
*/
int cleancache_register_ops(struct cleancache_ops *ops)
{
- int i;
-
- mutex_lock(&poolid_mutex);
- if (cleancache_ops) {
- mutex_unlock(&poolid_mutex);
+ if (cmpxchg(&cleancache_ops, NULL, ops))
return -EBUSY;
- }
- for (i = 0; i < MAX_INITIALIZABLE_FS; i++) {
- if (fs_poolid_map[i] == FS_NO_BACKEND)
- fs_poolid_map[i] = ops->init_fs(PAGE_SIZE);
- if (shared_fs_poolid_map[i] == FS_NO_BACKEND)
- shared_fs_poolid_map[i] = ops->init_shared_fs
- (uuids[i], PAGE_SIZE);
- }
- /*
- * We MUST set cleancache_ops _after_ we have called the backends
- * init_fs or init_shared_fs functions. Otherwise the compiler might
- * re-order where cleancache_ops is set in this function.
- */
- barrier();
- cleancache_ops = ops;
- mutex_unlock(&poolid_mutex);
+
+ iterate_supers(cleancache_register_ops_sb, NULL);
return 0;
}
EXPORT_SYMBOL(cleancache_register_ops);
@@ -139,42 +62,28 @@ EXPORT_SYMBOL(cleancache_register_ops);
/* Called by a cleancache-enabled filesystem at time of mount */
void __cleancache_init_fs(struct super_block *sb)
{
- int i;
+ int pool_id = CLEANCACHE_NO_BACKEND;

- mutex_lock(&poolid_mutex);
- for (i = 0; i < MAX_INITIALIZABLE_FS; i++) {
- if (fs_poolid_map[i] == FS_UNKNOWN) {
- sb->cleancache_poolid = i + FAKE_FS_POOLID_OFFSET;
- if (cleancache_ops)
- fs_poolid_map[i] = cleancache_ops->init_fs(PAGE_SIZE);
- else
- fs_poolid_map[i] = FS_NO_BACKEND;
- break;
- }
+ if (cleancache_ops) {
+ pool_id = cleancache_ops->init_fs(PAGE_SIZE);
+ if (pool_id < 0)
+ pool_id = CLEANCACHE_NO_POOL;
}
- mutex_unlock(&poolid_mutex);
+ sb->cleancache_poolid = pool_id;
}
EXPORT_SYMBOL(__cleancache_init_fs);

/* Called by a cleancache-enabled clustered filesystem at time of mount */
void __cleancache_init_shared_fs(struct super_block *sb)
{
- int i;
+ int pool_id = CLEANCACHE_NO_BACKEND_SHARED;

- mutex_lock(&poolid_mutex);
- for (i = 0; i < MAX_INITIALIZABLE_FS; i++) {
- if (shared_fs_poolid_map[i] == FS_UNKNOWN) {
- sb->cleancache_poolid = i + FAKE_SHARED_FS_POOLID_OFFSET;
- uuids[i] = sb->s_uuid;
- if (cleancache_ops)
- shared_fs_poolid_map[i] = cleancache_ops->init_shared_fs
- (sb->s_uuid, PAGE_SIZE);
- else
- shared_fs_poolid_map[i] = FS_NO_BACKEND;
- break;
- }
+ if (cleancache_ops) {
+ pool_id = cleancache_ops->init_shared_fs(sb->s_uuid, PAGE_SIZE);
+ if (pool_id < 0)
+ pool_id = CLEANCACHE_NO_POOL;
}
- mutex_unlock(&poolid_mutex);
+ sb->cleancache_poolid = pool_id;
}
EXPORT_SYMBOL(__cleancache_init_shared_fs);

@@ -204,19 +113,6 @@ static int cleancache_get_key(struct inode *inode,
}

/*
- * Returns a pool_id that is associated with a given fake poolid.
- */
-static int get_poolid_from_fake(int fake_pool_id)
-{
- if (fake_pool_id >= FAKE_SHARED_FS_POOLID_OFFSET)
- return shared_fs_poolid_map[fake_pool_id -
- FAKE_SHARED_FS_POOLID_OFFSET];
- else if (fake_pool_id >= FAKE_FS_POOLID_OFFSET)
- return fs_poolid_map[fake_pool_id - FAKE_FS_POOLID_OFFSET];
- return FS_NO_BACKEND;
-}
-
-/*
* "Get" data from cleancache associated with the poolid/inode/index
* that were specified when the data was put to cleanache and, if
* successful, use it to fill the specified page with data and return 0.
@@ -231,7 +127,6 @@ int __cleancache_get_page(struct page *page)
{
int ret = -1;
int pool_id;
- int fake_pool_id;
struct cleancache_filekey key = { .u.key = { 0 } };

if (!cleancache_ops) {
@@ -240,17 +135,14 @@ int __cleancache_get_page(struct page *page)
}

VM_BUG_ON_PAGE(!PageLocked(page), page);
- fake_pool_id = page->mapping->host->i_sb->cleancache_poolid;
- if (fake_pool_id < 0)
+ pool_id = page->mapping->host->i_sb->cleancache_poolid;
+ if (pool_id < 0)
goto out;
- pool_id = get_poolid_from_fake(fake_pool_id);

if (cleancache_get_key(page->mapping->host, &key) < 0)
goto out;

- if (pool_id >= 0)
- ret = cleancache_ops->get_page(pool_id,
- key, page->index, page);
+ ret = cleancache_ops->get_page(pool_id, key, page->index, page);
if (ret == 0)
cleancache_succ_gets++;
else
@@ -273,7 +165,6 @@ EXPORT_SYMBOL(__cleancache_get_page);
void __cleancache_put_page(struct page *page)
{
int pool_id;
- int fake_pool_id;
struct cleancache_filekey key = { .u.key = { 0 } };

if (!cleancache_ops) {
@@ -282,12 +173,7 @@ void __cleancache_put_page(struct page *page)
}

VM_BUG_ON_PAGE(!PageLocked(page), page);
- fake_pool_id = page->mapping->host->i_sb->cleancache_poolid;
- if (fake_pool_id < 0)
- return;
-
- pool_id = get_poolid_from_fake(fake_pool_id);
-
+ pool_id = page->mapping->host->i_sb->cleancache_poolid;
if (pool_id >= 0 &&
cleancache_get_key(page->mapping->host, &key) >= 0) {
cleancache_ops->put_page(pool_id, key, page->index, page);
@@ -308,18 +194,13 @@ void __cleancache_invalidate_page(struct address_space *mapping,
struct page *page)
{
/* careful... page->mapping is NULL sometimes when this is called */
- int pool_id;
- int fake_pool_id = mapping->host->i_sb->cleancache_poolid;
+ int pool_id = mapping->host->i_sb->cleancache_poolid;
struct cleancache_filekey key = { .u.key = { 0 } };

if (!cleancache_ops)
return;

- if (fake_pool_id >= 0) {
- pool_id = get_poolid_from_fake(fake_pool_id);
- if (pool_id < 0)
- return;
-
+ if (pool_id >= 0) {
VM_BUG_ON_PAGE(!PageLocked(page), page);
if (cleancache_get_key(mapping->host, &key) >= 0) {
cleancache_ops->invalidate_page(pool_id,
@@ -341,18 +222,12 @@ EXPORT_SYMBOL(__cleancache_invalidate_page);
*/
void __cleancache_invalidate_inode(struct address_space *mapping)
{
- int pool_id;
- int fake_pool_id = mapping->host->i_sb->cleancache_poolid;
+ int pool_id = mapping->host->i_sb->cleancache_poolid;
struct cleancache_filekey key = { .u.key = { 0 } };

if (!cleancache_ops)
return;

- if (fake_pool_id < 0)
- return;
-
- pool_id = get_poolid_from_fake(fake_pool_id);
-
if (pool_id >= 0 && cleancache_get_key(mapping->host, &key) >= 0)
cleancache_ops->invalidate_inode(pool_id, key);
}
@@ -365,32 +240,18 @@ EXPORT_SYMBOL(__cleancache_invalidate_inode);
*/
void __cleancache_invalidate_fs(struct super_block *sb)
{
- int index;
- int fake_pool_id = sb->cleancache_poolid;
- int old_poolid = fake_pool_id;
+ int pool_id;

- mutex_lock(&poolid_mutex);
- if (fake_pool_id >= FAKE_SHARED_FS_POOLID_OFFSET) {
- index = fake_pool_id - FAKE_SHARED_FS_POOLID_OFFSET;
- old_poolid = shared_fs_poolid_map[index];
- shared_fs_poolid_map[index] = FS_UNKNOWN;
- uuids[index] = NULL;
- } else if (fake_pool_id >= FAKE_FS_POOLID_OFFSET) {
- index = fake_pool_id - FAKE_FS_POOLID_OFFSET;
- old_poolid = fs_poolid_map[index];
- fs_poolid_map[index] = FS_UNKNOWN;
- }
- sb->cleancache_poolid = -1;
- if (cleancache_ops)
- cleancache_ops->invalidate_fs(old_poolid);
- mutex_unlock(&poolid_mutex);
+ pool_id = sb->cleancache_poolid;
+ sb->cleancache_poolid = CLEANCACHE_NO_POOL;
+
+ if (cleancache_ops && pool_id >= 0)
+ cleancache_ops->invalidate_fs(pool_id);
}
EXPORT_SYMBOL(__cleancache_invalidate_fs);

static int __init init_cleancache(void)
{
- int i;
-
#ifdef CONFIG_DEBUG_FS
struct dentry *root = debugfs_create_dir("cleancache", NULL);
if (root == NULL)
@@ -402,10 +263,6 @@ static int __init init_cleancache(void)
debugfs_create_u64("invalidates", S_IRUGO,
root, &cleancache_invalidates);
#endif
- for (i = 0; i < MAX_INITIALIZABLE_FS; i++) {
- fs_poolid_map[i] = FS_UNKNOWN;
- shared_fs_poolid_map[i] = FS_UNKNOWN;
- }
return 0;
}
module_init(init_cleancache)
--
1.7.10.4
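
For readers following the diff above, here is a minimal userspace sketch of the pool-id scheme it introduces. It is not kernel code. The `model_*` and `toy_*` names are hypothetical, the superblock list is a plain array standing in for iterate_supers, and the three sentinel values are assumptions (all the scheme needs is that they are negative). A mount with no backend records a sentinel in the superblock, late registration replaces the sentinel with a real pool id, and every callback is a no-op while the id is negative.

```c
/* Userspace model of the per-superblock pool id; not kernel code. */
#include <assert.h>
#include <stddef.h>

/* Assumed sentinel values; the scheme only needs them to be negative. */
#define CLEANCACHE_NO_POOL            (-1)
#define CLEANCACHE_NO_BACKEND         (-2)
#define CLEANCACHE_NO_BACKEND_SHARED  (-3)

struct model_sb {               /* hypothetical stand-in for struct super_block */
	int cleancache_poolid;
};

struct model_ops {              /* hypothetical stand-in for struct cleancache_ops */
	int (*init_fs)(void);
	int (*init_shared_fs)(void);
};

static const struct model_ops *model_backend;  /* NULL until a backend registers */
static int model_next_pool;                    /* next id the toy backend hands out */
static int model_puts;                         /* pages that reached the backend */

static int toy_init_fs(void) { return model_next_pool++; }
static const struct model_ops toy_ops = { toy_init_fs, toy_init_fs };

/* Mount time: ask the backend for a pool, or remember which kind we need. */
static void model_init_fs(struct model_sb *sb, int shared)
{
	int pool_id = shared ? CLEANCACHE_NO_BACKEND_SHARED
			     : CLEANCACHE_NO_BACKEND;

	assert(sb != NULL);
	if (model_backend) {
		pool_id = shared ? model_backend->init_shared_fs()
				 : model_backend->init_fs();
		if (pool_id < 0)
			pool_id = CLEANCACHE_NO_POOL;
	}
	sb->cleancache_poolid = pool_id;
}

/* Late registration: fix up every superblock that is still waiting. */
static void model_register(const struct model_ops *ops,
			   struct model_sb *sbs, size_t n)
{
	model_backend = ops;
	for (size_t i = 0; i < n; i++) {
		switch (sbs[i].cleancache_poolid) {
		case CLEANCACHE_NO_BACKEND:
			model_init_fs(&sbs[i], 0);
			break;
		case CLEANCACHE_NO_BACKEND_SHARED:
			model_init_fs(&sbs[i], 1);
			break;
		}
	}
}

/* Any callback: a negative pool id means there is nothing to do. */
static void model_put_page(struct model_sb *sb)
{
	if (model_backend && sb->cleancache_poolid >= 0)
		model_puts++;
}
```

Calling model_init_fs on two superblocks before model_register leaves both with a sentinel id, and model_put_page does nothing for them. After model_register(&toy_ops, ...) each holds a non-negative id and puts reach the toy backend. No fixed-size map is involved, so the number of superblocks is not bounded by the scheme itself.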

2015-02-23 16:13:13

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH 0/4] cleancache: remove limit on the number of cleancache enabled filesystems

On Sun, Feb 22, 2015 at 09:31:51PM +0300, Vladimir Davydov wrote:
> Hi,
>
> Currently, maximal number of cleancache enabled filesystems equals 32,
> which is insufficient nowadays, because a Linux host can have hundreds
> of containers on board, each of which might want its own filesystem.
> This patch set targets at removing this limitation - see patch 4 for
> more details. Patches 1-3 prepare the code for this change.

Hey Vladimir,

Thank you for posting these patches. I was wondering whether you had
tested the different orders in which the filesystems and tmem
drivers can be loaded? Patch #4 deleted a nice chunk of
documentation that outlines those combinations.

Thank you!
>
> Thanks,
>
> Vladimir Davydov (4):
> ocfs2: copy fs uuid to superblock
> cleancache: zap uuid arg of cleancache_init_shared_fs
> cleancache: forbid overriding cleancache_ops
> cleancache: remove limit on the number of cleancache enabled
> filesystems
>
> Documentation/vm/cleancache.txt | 4 +-
> drivers/xen/tmem.c | 16 ++-
> fs/ocfs2/super.c | 4 +-
> fs/super.c | 2 +-
> include/linux/cleancache.h | 13 +-
> mm/cleancache.c | 270 +++++++++++----------------------------
> 6 files changed, 94 insertions(+), 215 deletions(-)
>
> --
> 1.7.10.4
>

2015-02-24 10:34:32

by Vladimir Davydov

Subject: Re: [PATCH 0/4] cleancache: remove limit on the number of cleancache enabled filesystems

On Mon, Feb 23, 2015 at 11:12:22AM -0500, Konrad Rzeszutek Wilk wrote:
> Thank you for posting these patches. I was wondering whether you had
> tested the different orders in which the filesystems and tmem
> drivers can be loaded? Patch #4 deleted a nice chunk of
> documentation that outlines those combinations.

Yeah, I admit the synchronization between cleancache_register_ops and
cleancache_init_fs is far from obvious. I should have updated the comment
instead of merely dropping it, sorry. What about the following patch,
which adds a comment arguing the correctness of the
register_ops-vs-init_fs synchronization? It is meant to be applied
incrementally on top of patch #4.
---
diff --git a/mm/cleancache.c b/mm/cleancache.c
index fbdaf9c77d7a..8fc50811119b 100644
--- a/mm/cleancache.c
+++ b/mm/cleancache.c
@@ -54,6 +54,57 @@ int cleancache_register_ops(struct cleancache_ops *ops)
if (cmpxchg(&cleancache_ops, NULL, ops))
return -EBUSY;

+ /*
+ * A cleancache backend can be built as a module and hence loaded after
+ * a cleancache enabled filesystem has called cleancache_init_fs. To
+ * handle such a scenario, here we call ->init_fs or ->init_shared_fs
+ * for each active super block. To differentiate between local and
+ * shared filesystems, we temporarily initialize sb->cleancache_poolid
+ * to CLEANCACHE_NO_BACKEND or CLEANCACHE_NO_BACKEND_SHARED
+ * respectively in case there is no backend registered at the time
+ * cleancache_init_fs or cleancache_init_shared_fs is called.
+ *
+ * Since filesystems can be mounted concurrently with cleancache
+ * backend registration, we have to be careful to guarantee that all
+ * cleancache enabled filesystems that have been mounted by the time
+ * cleancache_register_ops is called have got, and all mounted later
+ * will get, cleancache_poolid. This is assured by the following
+ * statements tied together:
+ *
+ * a) iterate_supers skips only those super blocks that have started
+ * ->kill_sb
+ *
+ * b) if iterate_supers encounters a super block that has not finished
+ * ->mount yet, it waits until it is finished
+ *
+ * c) cleancache_init_fs is called from ->mount and
+ * cleancache_invalidate_fs is called from ->kill_sb
+ *
+ * d) we call iterate_supers after cleancache_ops has been set
+ *
+ * From a) it follows that if iterate_supers skips a super block, then
+ * either the super block is already dead, in which case we do not need
+ * to bother initializing cleancache for it, or it was mounted after we
+ * initiated iterate_supers. In the latter case, it must have seen
+ * cleancache_ops set according to d) and initialized cleancache from
+ * ->mount by itself according to c). This proves that we call
+ * ->init_fs at least once for each active super block.
+ *
+ * From b) and c) it follows that if iterate_supers encounters a super
+ * block that has already started ->init_fs, it will wait until ->mount
+ * and hence ->init_fs has finished, then check cleancache_poolid, see
+ * that it has already been set and therefore do nothing. This proves
+ * that we call ->init_fs no more than once for each super block.
+ *
+ * Combined together, the last two paragraphs prove the function's
+ * correctness.
+ *
+ * Note that various cleancache callbacks may proceed before this
+ * function is called or even concurrently with it, but since
+ * CLEANCACHE_NO_BACKEND is negative, they will all result in a noop
+ * until the corresponding ->init_fs has been actually called and
+ * cleancache_ops has been set.
+ */
iterate_supers(cleancache_register_ops_sb, NULL);
return 0;
}
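
The "at least once" and "at most once" claims in the comment above can be illustrated with a small userspace sketch. It is single-threaded, so it replays the two orderings (mount before registration, mount after registration) without modelling the waiting that iterate_supers does on a mount in progress. All `sketch_*` and `SKETCH_*` names are hypothetical. C11 atomic_compare_exchange_strong stands in for the kernel's cmpxchg, and a plain array stands in for the list of super blocks.

```c
/* Userspace sketch of register_ops-vs-init_fs ordering; not kernel code. */
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_NO_POOL     (-1)  /* assumed value: backend refused, or unmounted */
#define SKETCH_NO_BACKEND  (-2)  /* assumed value: mounted before any backend */

struct sketch_ops {
	int (*init_fs)(void);
};

struct sketch_sb {
	int poolid;
	int init_calls;  /* how many times ->init_fs ran for this sb */
};

/* Stand-in for the global cleancache_ops pointer; starts out NULL. */
static _Atomic(const struct sketch_ops *) sketch_ops_ptr;

static int sketch_backend_next_id;
static int sketch_backend_init_fs(void) { return sketch_backend_next_id++; }
static const struct sketch_ops sketch_backend = { sketch_backend_init_fs };

/* Mount path: initialize directly if a backend is already visible. */
static void sketch_mount(struct sketch_sb *sb)
{
	const struct sketch_ops *ops = atomic_load(&sketch_ops_ptr);

	sb->init_calls = 0;
	sb->poolid = SKETCH_NO_BACKEND;
	if (ops) {
		int id = ops->init_fs();

		sb->init_calls++;
		sb->poolid = id < 0 ? SKETCH_NO_POOL : id;
	}
}

/* Registration path: publish ops first, then walk the super blocks. */
static int sketch_register(const struct sketch_ops *ops,
			   struct sketch_sb *sbs, size_t n)
{
	const struct sketch_ops *expected = NULL;

	assert(ops != NULL);
	/* Only the first backend wins; later callers get -EBUSY. */
	if (!atomic_compare_exchange_strong(&sketch_ops_ptr, &expected, ops))
		return -EBUSY;

	for (size_t i = 0; i < n; i++) {
		/* Skip super blocks that already initialized themselves. */
		if (sbs[i].poolid == SKETCH_NO_BACKEND) {
			int id = ops->init_fs();

			sbs[i].init_calls++;
			sbs[i].poolid = id < 0 ? SKETCH_NO_POOL : id;
		}
	}
	return 0;
}
```

A super block mounted before sketch_register is initialized by the walk, and one mounted afterwards initializes itself in sketch_mount because the ops pointer is already published. In both orderings init_calls ends at exactly 1. A second sketch_register call returns -EBUSY and touches nothing.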