changes since v9:
- rebase to 5.18-rc3
- cachefiles: extract cachefiles_in_ondemand_mode() helper; add barrier
pair between enqueuing and flushing requests; make the xarray
structures non-conditionally defined in struct cachefiles_cache
(patch 2) (David Howells)
- cachefiles: use refcount_t for unbind_pincount; run "cachefiles_open = 0;"
cleanup only when unbind_pincount is decreased to 0 (patch 3)
(David Howells)
- cachefiles: rename CACHEFILES_IOC_CREAD ioctl to
CACHEFILES_IOC_READ_COMPLETE (patch 5) (David Howells)
- cachefiles: fix the error message when the argument to the 'bind'
command is invalid (patch 6) (David Howells)
- cachefiles: update the documentation polished by David (patch 8)
- erofs: tweak the code arrangement of erofs_fscache_meta_readpage()
(patch 17) (Gao Xiang)
- erofs: add comment on error cases (patch 20) (Gao Xiang)
- update Tested-by tags in the cover letter
Kernel Patchset
---------------
Git tree:
https://github.com/lostjeffle/linux.git jingbo/dev-erofs-fscache-v10
Gitweb:
https://github.com/lostjeffle/linux/commits/jingbo/dev-erofs-fscache-v10
User Guide for E2E Container Use Case
-------------------------------------
User guide:
https://github.com/dragonflyoss/image-service/blob/fscache/docs/nydus-fscache.md
Video:
https://youtu.be/F4IF2_DENXo
User Daemon for Quick Test
--------------------------
Git tree:
https://github.com/lostjeffle/demand-read-cachefilesd.git main
Gitweb:
https://github.com/lostjeffle/demand-read-cachefilesd
Tested-by: Zichen Tian <[email protected]>
Tested-by: Jia Zhu <[email protected]>
RFC: https://lore.kernel.org/all/[email protected]/t/
v1: https://lore.kernel.org/lkml/[email protected]/T/
v2: https://lore.kernel.org/all/[email protected]/t/
v3: https://lore.kernel.org/lkml/[email protected]/T/
v4: https://lore.kernel.org/lkml/[email protected]/T/#t
v5: https://lore.kernel.org/lkml/[email protected]/T/
v6: https://lore.kernel.org/lkml/[email protected]/T/
v7: https://lore.kernel.org/lkml/[email protected]/T/
v8: https://lore.kernel.org/all/[email protected]/T/
v9: https://lore.kernel.org/lkml/[email protected]/T/
[Background]
============
Nydus [1] is an image distribution service optimized for distribution
over the network. It accelerates container images by pulling data from
the remote only when needed (a.k.a. on-demand reading), and it also
supports chunk-based deduplication, compression, etc.
erofs (Enhanced Read-Only File System) is a filesystem designed for
read-only scenarios. (Documentation/filesystems/erofs.rst)
Over the past months we've been focusing on supporting the Nydus image
service with the in-kernel erofs format [2]. In that case, each container
image is organized as one bootstrap (metadata) and (optionally) multiple
data blobs in erofs format. A large number of container images will be
stored on one machine.
To accelerate container startup (fetching the container image from the
remote and then starting the container), we hope that the bootstrap and
blob files can support on-demand read. That is, erofs can be mounted and
accessed even when the bootstrap/data blob files have not been fully
downloaded. Once the data is available locally, access has native
performance.
That means we have to manage the cache state of the bootstrap/data blob
files (if cache hit, read directly from the local cache; if cache miss,
fetch the data somehow). It would be painful and may be dumb for erofs to
implement the cache management itself. Thus we prefer fscache/cachefiles
to do the cache management instead.
The on-demand read feature is implemented in the fscache subsystem in a
generic way, so that it can benefit other use cases and/or filesystems as
well.
[1] https://nydus.dev
[2] https://sched.co/pcdL
[Overall Design]
================
Please refer to patch 8 ("cachefiles: document on-demand read mode") for
more details.
When working in the original mode, cachefiles mainly serves as a local cache
for a remote networking fs, while in on-demand read mode, cachefiles can work
in scenarios where on-demand read semantics are needed, e.g. container image
distribution.
The essential difference between these two modes is that, in the original
mode, on a cache miss the netfs itself fetches data from the remote and then
writes the fetched data into the cache file, while in on-demand read mode, a
user daemon is responsible for fetching the data and feeding it to the
kernel fscache side.
The on-demand read mode relies on a simple protocol used for communication
between the kernel and the user daemon.
The proposed implementation relies on the anonymous fd mechanism to avoid
depending on the format of the cache file. When a fscache cache file is
opened for the first time, an anon_fd associated with the cache file is sent
to the user daemon. With the given anon_fd, the user daemon can fetch and
write data into the cache file in the background, even when the kernel has
not triggered a cache miss. Besides, a write() syscall on the anon_fd ends up
in the cachefiles kernel module, which writes the data to the cache file in
the latest cache file format.
1. cache miss
On a cache miss, the cachefiles kernel module notifies the user daemon with
the anon_fd, along with the requested file range. When notified, the user
daemon needs to fetch the data of the requested file range and then write
the fetched data into the cache file through the given anonymous fd. When it
has finished processing the request, the user daemon needs to notify the
kernel.
After notifying the user daemon, the kernel read routine waits until the
request has been handled by the user daemon. When it is woken up by the
notification from the user daemon, i.e. the corresponding hole has been
filled by the user daemon, it retries the read of the same file range.
2. cache hit
Once the data is ready in the cache file, netfs reads from the cache file
directly.
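
The cache-miss flow above can be sketched from the daemon's point of view.
This is only a minimal userspace sketch, not the daemon from this series:
struct read_req is a hypothetical flattened view of a READ request (not the
uapi layout), and fetch_fn/complete_fn as well as the demo_* helpers are
assumed stand-ins. In a real daemon the completion step would be the
CACHEFILES_IOC_READ_COMPLETE ioctl on the anonymous fd.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical, flattened view of one READ request (not the uapi layout). */
struct read_req {
	uint32_t msg_id;	/* identifies the request when completing it */
	uint32_t object_id;	/* identifies the cache object / anonymous fd */
	uint64_t off;		/* start of the missing file range */
	uint64_t len;		/* length of the missing file range */
};

/* Hypothetical callbacks supplied by the daemon. */
typedef int (*fetch_fn)(uint64_t off, void *buf, size_t len);
typedef int (*complete_fn)(int anon_fd, uint32_t msg_id);

/*
 * Handle one cache miss: fetch the range, write it through the anonymous
 * fd, then report to the kernel that the request has been handled.
 */
int handle_read_req(int anon_fd, const struct read_req *req,
		    fetch_fn fetch, complete_fn complete)
{
	size_t len = (size_t)req->len, done = 0;
	char *buf;
	int ret;

	assert(fetch && complete);

	buf = malloc(len ? len : 1);
	if (!buf)
		return -ENOMEM;

	ret = fetch(req->off, buf, len);
	while (!ret && done < len) {
		ssize_t n = pwrite(anon_fd, buf + done, len - done,
				   (off_t)(req->off + done));
		if (n < 0)
			ret = -errno;
		else if (n == 0)
			ret = -EIO;
		else
			done += (size_t)n;
	}
	free(buf);

	/* In this sketch, completion is reported only after a full write. */
	if (ret)
		return ret;
	return complete(anon_fd, req->msg_id);
}

/* Stand-in for a remote fetch: fills the buffer with a known pattern. */
int demo_fetch(uint64_t off, void *buf, size_t len)
{
	unsigned char *p = buf;

	for (size_t i = 0; i < len; i++)
		p[i] = (unsigned char)((off + i) & 0xff);
	return 0;
}

/* Stand-in for the completion ioctl: records the completed message id. */
uint32_t demo_last_completed;

int demo_complete(int anon_fd, uint32_t msg_id)
{
	(void)anon_fd;
	demo_last_completed = msg_id;
	return 0;
}
```

How a daemon should react when the fetch itself fails is a policy decision
that this sketch leaves open.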
[Advantages of fscache-based on-demand read]
============================================
1. Asynchronous prefetch
In the current mechanism, fscache is responsible for cache state management,
while the data plane (fetching data from local/remote on a cache miss) is
handled on the user daemon side, which can act even when no file system
request is driving it. In addition, if the cached data is already available
locally, fscache will use it instead of trapping to user space.
Therefore, unlike event-driven approaches, the fscache on-demand user daemon
can also fetch data (from the remote) asynchronously in the background, just
like most multi-threaded HTTP downloaders.
2. Flexible request amplification
Since the data plane can be independently controlled by the user daemon,
the user daemon can also fetch more data from the remote than the file
system actually requests for small I/O sizes. The data fetched in bulk then
becomes available at once, and fscache won't trap into the user daemon
again for it.
3. Support massive blobs
This mechanism can naturally support a large number of backing files,
and thus can benefit densely deployed scenarios. In our use cases,
one container image can be formed of one bootstrap (required) and
multiple chunk-deduplicated data blobs (optional).
For example, one container image for node.js will correspond to ~20
files in total. In a densely deployed environment, there could be hundreds
of containers and thus thousands of backing files on one machine.
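
As an illustration of the request amplification described in item 2 above,
a daemon could widen a small requested range to chunk boundaries before
fetching. The following is a minimal sketch under assumed names: struct
range, amplify(), the chunk size and blob_size are all hypothetical and not
part of this series.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical byte range inside one data blob. */
struct range {
	uint64_t off;
	uint64_t len;
};

/*
 * Widen [off, off + len) to chunk boundaries so that one fetch from the
 * remote covers more than the file system asked for. The result is
 * clamped to the blob size. Overflow of off + len is not handled here.
 */
struct range amplify(uint64_t off, uint64_t len, uint64_t chunk,
		     uint64_t blob_size)
{
	struct range r;
	uint64_t end;

	assert(chunk > 0 && len > 0 && off < blob_size);

	end = off + len;
	r.off = off - off % chunk;			/* round start down */
	end = (end + chunk - 1) / chunk * chunk;	/* round end up */
	if (end > blob_size)
		end = blob_size;
	r.len = end - r.off;
	return r;
}
```

With a 1 MiB chunk, a 200-byte request at offset 4196 would be widened to
the whole first chunk, so later reads in that chunk hit the cache.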
Jeffle Xu (21):
cachefiles: extract write routine
cachefiles: notify the user daemon when looking up cookie
cachefiles: unbind cachefiles gracefully in on-demand mode
cachefiles: notify the user daemon when withdrawing cookie
cachefiles: implement on-demand read
cachefiles: enable on-demand read mode
cachefiles: add tracepoints for on-demand read mode
cachefiles: document on-demand read mode
erofs: make erofs_map_blocks() generally available
erofs: add fscache mode check helper
erofs: register fscache volume
erofs: add fscache context helper functions
erofs: add anonymous inode caching metadata for data blobs
erofs: add erofs_fscache_read_folios() helper
erofs: register fscache context for primary data blob
erofs: register fscache context for extra data blobs
erofs: implement fscache-based metadata read
erofs: implement fscache-based data read for non-inline layout
erofs: implement fscache-based data read for inline layout
erofs: implement fscache-based data readahead
erofs: add 'fsid' mount option
.../filesystems/caching/cachefiles.rst | 174 ++++++
fs/cachefiles/Kconfig | 12 +
fs/cachefiles/Makefile | 1 +
fs/cachefiles/daemon.c | 117 +++-
fs/cachefiles/interface.c | 2 +
fs/cachefiles/internal.h | 78 +++
fs/cachefiles/io.c | 76 ++-
fs/cachefiles/namei.c | 16 +-
fs/cachefiles/ondemand.c | 503 ++++++++++++++++++
fs/erofs/Kconfig | 10 +
fs/erofs/Makefile | 1 +
fs/erofs/data.c | 26 +-
fs/erofs/fscache.c | 363 +++++++++++++
fs/erofs/inode.c | 4 +
fs/erofs/internal.h | 49 ++
fs/erofs/super.c | 105 +++-
fs/erofs/sysfs.c | 4 +-
include/linux/fscache.h | 1 +
include/linux/netfs.h | 1 +
include/trace/events/cachefiles.h | 176 ++++++
include/uapi/linux/cachefiles.h | 68 +++
21 files changed, 1708 insertions(+), 79 deletions(-)
create mode 100644 fs/cachefiles/ondemand.c
create mode 100644 fs/erofs/fscache.c
create mode 100644 include/uapi/linux/cachefiles.h
--
2.27.0
Introduce the 'fsid' mount option to enable on-demand read semantics, in
which case erofs will be mounted from data blobs. Users can specify
the name of the primary data blob with this mount option.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
---
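
For reference, the mount option from the commit message above could be
exercised from userspace roughly as follows. This is a sketch only:
build_fsid_opts() and mount_erofs_fscache() are hypothetical helper names,
"none" is a placeholder source (the diff below uses get_tree_nodev() when
fsid is set, so no block device is involved), and it assumes a cachefiles
daemon in on-demand mode is already serving the named blob.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Build the "fsid=<blob>" option string; returns 0 on success, -1 on error. */
int build_fsid_opts(char *buf, size_t size, const char *fsid)
{
	int n;

	assert(buf && fsid);
	if (!strlen(fsid))
		return -1;

	n = snprintf(buf, size, "fsid=%s", fsid);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Mount erofs in fscache mode on @target, backed by the blob named @fsid. */
int mount_erofs_fscache(const char *target, const char *fsid)
{
	char opts[256];

	if (build_fsid_opts(opts, sizeof(opts), fsid))
		return -1;

	/* The source string is a placeholder; no block device is opened. */
	return mount("none", target, "erofs", MS_RDONLY, opts);
}
```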
fs/erofs/super.c | 31 ++++++++++++++++++++++++++++++-
fs/erofs/sysfs.c | 4 ++--
2 files changed, 32 insertions(+), 3 deletions(-)
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index f68ba929100d..4a623630e1c4 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -371,6 +371,8 @@ static int erofs_read_superblock(struct super_block *sb)
if (erofs_sb_has_ztailpacking(sbi))
erofs_info(sb, "EXPERIMENTAL compressed inline data feature in use. Use at your own risk!");
+ if (erofs_is_fscache_mode(sb))
+ erofs_info(sb, "EXPERIMENTAL fscache-based on-demand read feature in use. Use at your own risk!");
out:
erofs_put_metabuf(&buf);
return ret;
@@ -399,6 +401,7 @@ enum {
Opt_dax,
Opt_dax_enum,
Opt_device,
+ Opt_fsid,
Opt_err
};
@@ -423,6 +426,7 @@ static const struct fs_parameter_spec erofs_fs_parameters[] = {
fsparam_flag("dax", Opt_dax),
fsparam_enum("dax", Opt_dax_enum, erofs_dax_param_enums),
fsparam_string("device", Opt_device),
+ fsparam_string("fsid", Opt_fsid),
{}
};
@@ -518,6 +522,16 @@ static int erofs_fc_parse_param(struct fs_context *fc,
}
++ctx->devs->extra_devices;
break;
+ case Opt_fsid:
+#ifdef CONFIG_EROFS_FS_ONDEMAND
+ kfree(ctx->opt.fsid);
+ ctx->opt.fsid = kstrdup(param->string, GFP_KERNEL);
+ if (!ctx->opt.fsid)
+ return -ENOMEM;
+#else
+ errorfc(fc, "fsid option not supported");
+#endif
+ break;
default:
return -ENOPARAM;
}
@@ -604,6 +618,7 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
sb->s_fs_info = sbi;
sbi->opt = ctx->opt;
+ ctx->opt.fsid = NULL;
sbi->devs = ctx->devs;
ctx->devs = NULL;
@@ -690,6 +705,11 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
static int erofs_fc_get_tree(struct fs_context *fc)
{
+ struct erofs_fs_context *ctx = fc->fs_private;
+
+ if (IS_ENABLED(CONFIG_EROFS_FS_ONDEMAND) && ctx->opt.fsid)
+ return get_tree_nodev(fc, erofs_fc_fill_super);
+
return get_tree_bdev(fc, erofs_fc_fill_super);
}
@@ -739,6 +759,7 @@ static void erofs_fc_free(struct fs_context *fc)
struct erofs_fs_context *ctx = fc->fs_private;
erofs_free_dev_context(ctx->devs);
+ kfree(ctx->opt.fsid);
kfree(ctx);
}
@@ -779,7 +800,10 @@ static void erofs_kill_sb(struct super_block *sb)
WARN_ON(sb->s_magic != EROFS_SUPER_MAGIC);
- kill_block_super(sb);
+ if (erofs_is_fscache_mode(sb))
+ generic_shutdown_super(sb);
+ else
+ kill_block_super(sb);
sbi = EROFS_SB(sb);
if (!sbi)
@@ -789,6 +813,7 @@ static void erofs_kill_sb(struct super_block *sb)
fs_put_dax(sbi->dax_dev);
erofs_fscache_unregister_cookie(&sbi->s_fscache);
erofs_fscache_unregister_fs(sb);
+ kfree(sbi->opt.fsid);
kfree(sbi);
sb->s_fs_info = NULL;
}
@@ -938,6 +963,10 @@ static int erofs_show_options(struct seq_file *seq, struct dentry *root)
seq_puts(seq, ",dax=always");
if (test_opt(opt, DAX_NEVER))
seq_puts(seq, ",dax=never");
+#ifdef CONFIG_EROFS_FS_ONDEMAND
+ if (opt->fsid)
+ seq_printf(seq, ",fsid=%s", opt->fsid);
+#endif
return 0;
}
diff --git a/fs/erofs/sysfs.c b/fs/erofs/sysfs.c
index f3babf1e6608..c1383e508bbe 100644
--- a/fs/erofs/sysfs.c
+++ b/fs/erofs/sysfs.c
@@ -205,8 +205,8 @@ int erofs_register_sysfs(struct super_block *sb)
sbi->s_kobj.kset = &erofs_root;
init_completion(&sbi->s_kobj_unregister);
- err = kobject_init_and_add(&sbi->s_kobj, &erofs_sb_ktype, NULL,
- "%s", sb->s_id);
+ err = kobject_init_and_add(&sbi->s_kobj, &erofs_sb_ktype, NULL, "%s",
+ erofs_is_fscache_mode(sb) ? sbi->opt.fsid : sb->s_id);
if (err)
goto put_sb_kobj;
return 0;
--
2.27.0
Add a refcount to avoid the deadlock in on-demand read mode. The
on-demand read mode will pin the corresponding cachefiles object for
each anonymous fd. The cachefiles object is unpinned when the anonymous
fd gets closed. When the user daemon exits and the fd of the
"/dev/cachefiles" device node gets closed, it will wait for all
cachefiles objects to be withdrawn. Then if any anonymous fd gets closed
after the fd of the device node, the user daemon will hang forever,
waiting for all objects to be withdrawn.
To fix this, add a refcount indicating whether any object is pinned by
anonymous fds. The cachefiles cache gets unbound and withdrawn when the
refcount drops to 0. This doesn't change the behaviour of the
original mode, in which the cachefiles cache gets unbound and
withdrawn as soon as the fd of the device node gets closed.
Signed-off-by: Jeffle Xu <[email protected]>
---
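
The pin-count scheme from the commit message above can be modelled in
userspace as follows. This is only an illustrative sketch using C11 atomics
instead of the kernel's refcount_t; struct cache_model and the cache_*()
names are hypothetical, and the "unbound" flag stands in for the unbind and
kfree() done in cachefiles_put_unbind_pincount().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the cache with its unbind pin count. */
struct cache_model {
	atomic_int unbind_pincount;
	bool unbound;		/* set once the last reference is dropped */
};

/* Opening the device node takes the initial reference. */
void cache_open(struct cache_model *c)
{
	atomic_init(&c->unbind_pincount, 1);
	c->unbound = false;
}

/* Each anonymous fd handed to the daemon pins the cache. */
void cache_get(struct cache_model *c)
{
	int old = atomic_fetch_add(&c->unbind_pincount, 1);

	assert(old > 0);	/* never resurrect a dead cache */
}

/* Closing the device node or an anonymous fd drops one reference. */
void cache_put(struct cache_model *c)
{
	if (atomic_fetch_sub(&c->unbind_pincount, 1) == 1)
		c->unbound = true;	/* last reference: unbind now */
}
```

In this model, closing the device node while an anonymous fd is still open
only drops the count to 1; the unbind happens when that fd is closed.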
fs/cachefiles/daemon.c | 19 ++++++++++++++++---
fs/cachefiles/internal.h | 3 +++
fs/cachefiles/ondemand.c | 3 +++
3 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/fs/cachefiles/daemon.c b/fs/cachefiles/daemon.c
index d5417da7f792..5b1d0642c749 100644
--- a/fs/cachefiles/daemon.c
+++ b/fs/cachefiles/daemon.c
@@ -111,6 +111,7 @@ static int cachefiles_daemon_open(struct inode *inode, struct file *file)
INIT_LIST_HEAD(&cache->volumes);
INIT_LIST_HEAD(&cache->object_list);
spin_lock_init(&cache->object_list_lock);
+ refcount_set(&cache->unbind_pincount, 1);
xa_init_flags(&cache->reqs, XA_FLAGS_ALLOC);
xa_init_flags(&cache->ondemand_ids, XA_FLAGS_ALLOC1);
@@ -164,6 +165,20 @@ static void cachefiles_flush_reqs(struct cachefiles_cache *cache)
xa_destroy(&cache->ondemand_ids);
}
+void cachefiles_put_unbind_pincount(struct cachefiles_cache *cache)
+{
+ if (refcount_dec_and_test(&cache->unbind_pincount)) {
+ cachefiles_daemon_unbind(cache);
+ cachefiles_open = 0;
+ kfree(cache);
+ }
+}
+
+void cachefiles_get_unbind_pincount(struct cachefiles_cache *cache)
+{
+ refcount_inc(&cache->unbind_pincount);
+}
+
/*
* Release a cache.
*/
@@ -179,14 +194,12 @@ static int cachefiles_daemon_release(struct inode *inode, struct file *file)
if (cachefiles_in_ondemand_mode(cache))
cachefiles_flush_reqs(cache);
- cachefiles_daemon_unbind(cache);
/* clean up the control file interface */
cache->cachefilesd = NULL;
file->private_data = NULL;
- cachefiles_open = 0;
- kfree(cache);
+ cachefiles_put_unbind_pincount(cache);
_leave("");
return 0;
diff --git a/fs/cachefiles/internal.h b/fs/cachefiles/internal.h
index 4f5150a96849..e5c612888f84 100644
--- a/fs/cachefiles/internal.h
+++ b/fs/cachefiles/internal.h
@@ -109,6 +109,7 @@ struct cachefiles_cache {
char *rootdirname; /* name of cache root directory */
char *secctx; /* LSM security context */
char *tag; /* cache binding tag */
+ refcount_t unbind_pincount;/* refcount to do daemon unbind */
struct xarray reqs; /* xarray of pending on-demand requests */
struct xarray ondemand_ids; /* xarray for ondemand_id allocation */
u32 ondemand_id_next;
@@ -171,6 +172,8 @@ extern int cachefiles_has_space(struct cachefiles_cache *cache,
* daemon.c
*/
extern const struct file_operations cachefiles_daemon_fops;
+extern void cachefiles_get_unbind_pincount(struct cachefiles_cache *cache);
+extern void cachefiles_put_unbind_pincount(struct cachefiles_cache *cache);
/*
* error_inject.c
diff --git a/fs/cachefiles/ondemand.c b/fs/cachefiles/ondemand.c
index 64fc312b16d3..7946ee6c40be 100644
--- a/fs/cachefiles/ondemand.c
+++ b/fs/cachefiles/ondemand.c
@@ -14,6 +14,7 @@ static int cachefiles_ondemand_fd_release(struct inode *inode,
object->ondemand_id = CACHEFILES_ONDEMAND_ID_CLOSED;
xa_erase(&cache->ondemand_ids, object_id);
cachefiles_put_object(object, cachefiles_obj_put_ondemand_fd);
+ cachefiles_put_unbind_pincount(cache);
return 0;
}
@@ -169,6 +170,8 @@ static int cachefiles_ondemand_get_fd(struct cachefiles_req *req)
load->fd = fd;
req->msg.object_id = object_id;
object->ondemand_id = object_id;
+
+ cachefiles_get_unbind_pincount(cache);
return 0;
err_put_fd:
--
2.27.0
Add tracepoints for on-demand read mode. Currently the following
tracepoints are added:
OPEN request / COPEN reply
CLOSE request
READ request / CREAD reply
write through anonymous fd
release of anonymous fd
Signed-off-by: Jeffle Xu <[email protected]>
Acked-by: David Howells <[email protected]>
---
fs/cachefiles/ondemand.c | 7 ++
include/trace/events/cachefiles.h | 174 ++++++++++++++++++++++++++++++
2 files changed, 181 insertions(+)
diff --git a/fs/cachefiles/ondemand.c b/fs/cachefiles/ondemand.c
index 3470d4e8f0cb..a41ae6efc545 100644
--- a/fs/cachefiles/ondemand.c
+++ b/fs/cachefiles/ondemand.c
@@ -30,6 +30,7 @@ static int cachefiles_ondemand_fd_release(struct inode *inode,
xa_unlock(&cache->reqs);
xa_erase(&cache->ondemand_ids, object_id);
+ trace_cachefiles_ondemand_fd_release(object, object_id);
cachefiles_put_object(object, cachefiles_obj_put_ondemand_fd);
cachefiles_put_unbind_pincount(cache);
return 0;
@@ -55,6 +56,7 @@ static ssize_t cachefiles_ondemand_fd_write_iter(struct kiocb *kiocb,
if (ret < 0)
return ret;
+ trace_cachefiles_ondemand_fd_write(object, file_inode(file), pos, len);
ret = __cachefiles_write(object, file, pos, iter, NULL, NULL);
if (!ret)
ret = len;
@@ -93,6 +95,7 @@ static long cachefiles_ondemand_fd_ioctl(struct file *filp, unsigned int ioctl,
if (!req)
return -EINVAL;
+ trace_cachefiles_ondemand_cread(object, id);
complete(&req->done);
return 0;
}
@@ -166,6 +169,7 @@ int cachefiles_ondemand_copen(struct cachefiles_cache *cache, char *args)
clear_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags);
else
set_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags);
+ trace_cachefiles_ondemand_copen(req->object, id, size);
out:
complete(&req->done);
@@ -213,6 +217,7 @@ static int cachefiles_ondemand_get_fd(struct cachefiles_req *req)
object->ondemand_id = object_id;
cachefiles_get_unbind_pincount(cache);
+ trace_cachefiles_ondemand_open(object, &req->msg, load);
return 0;
err_put_fd:
@@ -426,6 +431,7 @@ static int cachefiles_ondemand_init_close_req(struct cachefiles_req *req,
return -ENOENT;
req->msg.object_id = object_id;
+ trace_cachefiles_ondemand_close(object, &req->msg);
return 0;
}
@@ -452,6 +458,7 @@ static int cachefiles_ondemand_init_read_req(struct cachefiles_req *req,
req->msg.object_id = object_id;
load->off = read_ctx->off;
load->len = read_ctx->len;
+ trace_cachefiles_ondemand_read(object, &req->msg, load);
return 0;
}
diff --git a/include/trace/events/cachefiles.h b/include/trace/events/cachefiles.h
index 93df9391bd7f..d8d4d73fe7b6 100644
--- a/include/trace/events/cachefiles.h
+++ b/include/trace/events/cachefiles.h
@@ -673,6 +673,180 @@ TRACE_EVENT(cachefiles_io_error,
__entry->error)
);
+TRACE_EVENT(cachefiles_ondemand_open,
+ TP_PROTO(struct cachefiles_object *obj, struct cachefiles_msg *msg,
+ struct cachefiles_open *load),
+
+ TP_ARGS(obj, msg, load),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, msg_id )
+ __field(unsigned int, object_id )
+ __field(unsigned int, fd )
+ __field(unsigned int, flags )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->msg_id = msg->msg_id;
+ __entry->object_id = msg->object_id;
+ __entry->fd = load->fd;
+ __entry->flags = load->flags;
+ ),
+
+ TP_printk("o=%08x mid=%x oid=%x fd=%d f=%x",
+ __entry->obj,
+ __entry->msg_id,
+ __entry->object_id,
+ __entry->fd,
+ __entry->flags)
+ );
+
+TRACE_EVENT(cachefiles_ondemand_copen,
+ TP_PROTO(struct cachefiles_object *obj, unsigned int msg_id,
+ long len),
+
+ TP_ARGS(obj, msg_id, len),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, msg_id )
+ __field(long, len )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->msg_id = msg_id;
+ __entry->len = len;
+ ),
+
+ TP_printk("o=%08x mid=%x l=%lx",
+ __entry->obj,
+ __entry->msg_id,
+ __entry->len)
+ );
+
+TRACE_EVENT(cachefiles_ondemand_close,
+ TP_PROTO(struct cachefiles_object *obj, struct cachefiles_msg *msg),
+
+ TP_ARGS(obj, msg),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, msg_id )
+ __field(unsigned int, object_id )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->msg_id = msg->msg_id;
+ __entry->object_id = msg->object_id;
+ ),
+
+ TP_printk("o=%08x mid=%x oid=%x",
+ __entry->obj,
+ __entry->msg_id,
+ __entry->object_id)
+ );
+
+TRACE_EVENT(cachefiles_ondemand_read,
+ TP_PROTO(struct cachefiles_object *obj, struct cachefiles_msg *msg,
+ struct cachefiles_read *load),
+
+ TP_ARGS(obj, msg, load),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, msg_id )
+ __field(unsigned int, object_id )
+ __field(loff_t, start )
+ __field(size_t, len )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->msg_id = msg->msg_id;
+ __entry->object_id = msg->object_id;
+ __entry->start = load->off;
+ __entry->len = load->len;
+ ),
+
+ TP_printk("o=%08x mid=%x oid=%x s=%llx l=%zx",
+ __entry->obj,
+ __entry->msg_id,
+ __entry->object_id,
+ __entry->start,
+ __entry->len)
+ );
+
+TRACE_EVENT(cachefiles_ondemand_cread,
+ TP_PROTO(struct cachefiles_object *obj, unsigned int msg_id),
+
+ TP_ARGS(obj, msg_id),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, msg_id )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->msg_id = msg_id;
+ ),
+
+ TP_printk("o=%08x mid=%x",
+ __entry->obj,
+ __entry->msg_id)
+ );
+
+TRACE_EVENT(cachefiles_ondemand_fd_write,
+ TP_PROTO(struct cachefiles_object *obj, struct inode *backer,
+ loff_t start, size_t len),
+
+ TP_ARGS(obj, backer, start, len),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, backer )
+ __field(loff_t, start )
+ __field(size_t, len )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->backer = backer->i_ino;
+ __entry->start = start;
+ __entry->len = len;
+ ),
+
+ TP_printk("o=%08x iB=%x s=%llx l=%zx",
+ __entry->obj,
+ __entry->backer,
+ __entry->start,
+ __entry->len)
+ );
+
+TRACE_EVENT(cachefiles_ondemand_fd_release,
+ TP_PROTO(struct cachefiles_object *obj, int object_id),
+
+ TP_ARGS(obj, object_id),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, object_id )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->object_id = object_id;
+ ),
+
+ TP_printk("o=%08x oid=%x",
+ __entry->obj,
+ __entry->object_id)
+ );
+
#endif /* _TRACE_CACHEFILES_H */
/* This part must be outside protection */
--
2.27.0
Implement the data plane of reading metadata from the primary data blob
over fscache.
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/erofs/data.c | 19 +++++++++++++++----
fs/erofs/fscache.c | 25 +++++++++++++++++++++++++
2 files changed, 40 insertions(+), 4 deletions(-)
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 14b64d960541..bb9c1fd48c19 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -6,6 +6,7 @@
*/
#include "internal.h"
#include <linux/prefetch.h>
+#include <linux/sched/mm.h>
#include <linux/dax.h>
#include <trace/events/erofs.h>
@@ -35,14 +36,20 @@ void *erofs_bread(struct erofs_buf *buf, struct inode *inode,
erofs_off_t offset = blknr_to_addr(blkaddr);
pgoff_t index = offset >> PAGE_SHIFT;
struct page *page = buf->page;
+ struct folio *folio;
+ unsigned int nofs_flag;
if (!page || page->index != index) {
erofs_put_metabuf(buf);
- page = read_cache_page_gfp(mapping, index,
- mapping_gfp_constraint(mapping, ~__GFP_FS));
- if (IS_ERR(page))
- return page;
+
+ nofs_flag = memalloc_nofs_save();
+ folio = read_cache_folio(mapping, index, NULL, NULL);
+ memalloc_nofs_restore(nofs_flag);
+ if (IS_ERR(folio))
+ return folio;
+
/* should already be PageUptodate, no need to lock page */
+ page = folio_file_page(folio, index);
buf->page = page;
}
if (buf->kmap_type == EROFS_NO_KMAP) {
@@ -63,6 +70,10 @@ void *erofs_bread(struct erofs_buf *buf, struct inode *inode,
void *erofs_read_metabuf(struct erofs_buf *buf, struct super_block *sb,
erofs_blk_t blkaddr, enum erofs_kmap_type type)
{
+ if (erofs_is_fscache_mode(sb))
+ return erofs_bread(buf, EROFS_SB(sb)->s_fscache->inode,
+ blkaddr, type);
+
return erofs_bread(buf, sb->s_bdev->bd_inode, blkaddr, type);
}
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index ac02af8cce3e..23d7e862eed8 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -59,7 +59,32 @@ static int erofs_fscache_read_folios(struct fscache_cookie *cookie,
return ret;
}
+static int erofs_fscache_meta_readpage(struct file *data, struct page *page)
+{
+ int ret;
+ struct folio *folio = page_folio(page);
+ struct super_block *sb = folio_mapping(folio)->host->i_sb;
+ struct erofs_map_dev mdev = {
+ .m_deviceid = 0,
+ .m_pa = folio_pos(folio),
+ };
+
+ ret = erofs_map_dev(sb, &mdev);
+ if (ret)
+ goto out;
+
+ ret = erofs_fscache_read_folios(mdev.m_fscache->cookie,
+ folio_mapping(folio), folio_pos(folio),
+ folio_size(folio), mdev.m_pa);
+ if (!ret)
+ folio_mark_uptodate(folio);
+out:
+ folio_unlock(folio);
+ return ret;
+}
+
static const struct address_space_operations erofs_fscache_meta_aops = {
+ .readpage = erofs_fscache_meta_readpage,
};
int erofs_fscache_register_cookie(struct super_block *sb,
--
2.27.0
Implement fscache-based data readahead. Also register an individual
bdi for each erofs instance to enable readahead.
Signed-off-by: Jeffle Xu <[email protected]>
---
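
The way the readahead loop in the diff below splits one request along
extent boundaries (count = min(bytes left in the extent, bytes left in the
request)) can be modelled in userspace as follows. This is a sketch, not
the kernel code: struct extent, find_extent() and split_request() are
hypothetical, with find_extent() standing in for erofs_map_blocks().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical logical extent: [la, la + llen). */
struct extent {
	uint64_t la;
	uint64_t llen;
};

/* Stand-in for the block mapping lookup: find the extent covering @pos. */
static const struct extent *find_extent(const struct extent *ext, size_t n,
					uint64_t pos)
{
	for (size_t i = 0; i < n; i++)
		if (pos >= ext[i].la && pos - ext[i].la < ext[i].llen)
			return &ext[i];
	return NULL;
}

/*
 * Split [start, start + len) into per-extent chunks and store the chunk
 * sizes in @counts. Returns the number of chunks produced; stops early
 * if a position is not covered by any extent or @max is reached.
 */
size_t split_request(const struct extent *ext, size_t n, uint64_t start,
		     uint64_t len, uint64_t *counts, size_t max)
{
	uint64_t done = 0;
	size_t nr = 0;

	assert(ext && counts);

	while (done < len && nr < max) {
		uint64_t pos = start + done;
		const struct extent *e = find_extent(ext, n, pos);
		uint64_t left, count;

		if (!e)
			break;

		left = e->llen - (pos - e->la);	/* bytes left in this extent */
		count = left < len - done ? left : len - done;
		counts[nr++] = count;
		done += count;
	}
	return nr;
}
```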
fs/erofs/fscache.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++
fs/erofs/super.c | 4 +++
2 files changed, 94 insertions(+)
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index 5b779812a5ee..a402d8f0a063 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -162,12 +162,102 @@ static int erofs_fscache_readpage(struct file *file, struct page *page)
return ret;
}
+static void erofs_fscache_unlock_folios(struct readahead_control *rac,
+ size_t len)
+{
+ while (len) {
+ struct folio *folio = readahead_folio(rac);
+
+ len -= folio_size(folio);
+ folio_mark_uptodate(folio);
+ folio_unlock(folio);
+ }
+}
+
+static void erofs_fscache_readahead(struct readahead_control *rac)
+{
+ struct inode *inode = rac->mapping->host;
+ struct super_block *sb = inode->i_sb;
+ size_t len, count, done = 0;
+ erofs_off_t pos;
+ loff_t start, offset;
+ int ret;
+
+ if (!readahead_count(rac))
+ return;
+
+ start = readahead_pos(rac);
+ len = readahead_length(rac);
+
+ do {
+ struct erofs_map_blocks map;
+ struct erofs_map_dev mdev;
+
+ pos = start + done;
+ map.m_la = pos;
+
+ ret = erofs_map_blocks(inode, &map, EROFS_GET_BLOCKS_RAW);
+ if (ret)
+ return;
+
+ offset = start + done;
+ count = min_t(size_t, map.m_llen - (pos - map.m_la),
+ len - done);
+
+ if (!(map.m_flags & EROFS_MAP_MAPPED)) {
+ struct iov_iter iter;
+
+ iov_iter_xarray(&iter, READ, &rac->mapping->i_pages,
+ offset, count);
+ iov_iter_zero(count, &iter);
+
+ erofs_fscache_unlock_folios(rac, count);
+ ret = count;
+ continue;
+ }
+
+ if (map.m_flags & EROFS_MAP_META) {
+ struct folio *folio = readahead_folio(rac);
+
+ ret = erofs_fscache_readpage_inline(folio, &map);
+ if (!ret) {
+ folio_mark_uptodate(folio);
+ ret = folio_size(folio);
+ }
+
+ folio_unlock(folio);
+ continue;
+ }
+
+ mdev = (struct erofs_map_dev) {
+ .m_deviceid = map.m_deviceid,
+ .m_pa = map.m_pa,
+ };
+ ret = erofs_map_dev(sb, &mdev);
+ if (ret)
+ return;
+
+ ret = erofs_fscache_read_folios(mdev.m_fscache->cookie,
+ rac->mapping, offset, count,
+ mdev.m_pa + (pos - map.m_la));
+ /*
+ * For the error cases, the folios will be unlocked when
+ * .readahead() returns.
+ */
+ if (!ret) {
+ erofs_fscache_unlock_folios(rac, count);
+ ret = count;
+ }
+ } while (ret > 0 && ((done += ret) < len));
+}
+
static const struct address_space_operations erofs_fscache_meta_aops = {
.readpage = erofs_fscache_meta_readpage,
};
const struct address_space_operations erofs_fscache_access_aops = {
.readpage = erofs_fscache_readpage,
+ .readahead = erofs_fscache_readahead,
};
int erofs_fscache_register_cookie(struct super_block *sb,
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index c6755bcae4a6..f68ba929100d 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -619,6 +619,10 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
sbi->opt.fsid, true);
if (err)
return err;
+
+ err = super_setup_bdi(sb);
+ if (err)
+ return err;
} else {
if (!sb_set_blocksize(sb, EROFS_BLKSIZ)) {
erofs_err(sb, "failed to set erofs blksize");
--
2.27.0
Extract the generic routine of writing data to cache files, and make it
generally available.
This will be used by the following patch implementing on-demand read
mode. Since it's called inside the cachefiles module, make the interface
generic and unrelated to netfs_cache_resources.
It is worth noting that ki->inval_counter is not initialized after
this cleanup. This should not make any visible difference, since
inval_counter is no longer used in the write completion routine, i.e.
cachefiles_write_complete().
Signed-off-by: Jeffle Xu <[email protected]>
Acked-by: David Howells <[email protected]>
---
fs/cachefiles/internal.h | 10 +++++++
fs/cachefiles/io.c | 61 +++++++++++++++++++++++-----------------
2 files changed, 45 insertions(+), 26 deletions(-)
diff --git a/fs/cachefiles/internal.h b/fs/cachefiles/internal.h
index c793d33b0224..e80673d0ab97 100644
--- a/fs/cachefiles/internal.h
+++ b/fs/cachefiles/internal.h
@@ -201,6 +201,16 @@ extern void cachefiles_put_object(struct cachefiles_object *object,
*/
extern bool cachefiles_begin_operation(struct netfs_cache_resources *cres,
enum fscache_want_state want_state);
+extern int __cachefiles_prepare_write(struct cachefiles_object *object,
+ struct file *file,
+ loff_t *_start, size_t *_len,
+ bool no_space_allocated_yet);
+extern int __cachefiles_write(struct cachefiles_object *object,
+ struct file *file,
+ loff_t start_pos,
+ struct iov_iter *iter,
+ netfs_io_terminated_t term_func,
+ void *term_func_priv);
/*
* key.c
diff --git a/fs/cachefiles/io.c b/fs/cachefiles/io.c
index 9dc81e781f2b..50a14e8f0aac 100644
--- a/fs/cachefiles/io.c
+++ b/fs/cachefiles/io.c
@@ -277,36 +277,33 @@ static void cachefiles_write_complete(struct kiocb *iocb, long ret)
/*
* Initiate a write to the cache.
*/
-static int cachefiles_write(struct netfs_cache_resources *cres,
- loff_t start_pos,
- struct iov_iter *iter,
- netfs_io_terminated_t term_func,
- void *term_func_priv)
+int __cachefiles_write(struct cachefiles_object *object,
+ struct file *file,
+ loff_t start_pos,
+ struct iov_iter *iter,
+ netfs_io_terminated_t term_func,
+ void *term_func_priv)
{
- struct cachefiles_object *object;
struct cachefiles_cache *cache;
struct cachefiles_kiocb *ki;
struct inode *inode;
- struct file *file;
unsigned int old_nofs;
- ssize_t ret = -ENOBUFS;
+ ssize_t ret;
size_t len = iov_iter_count(iter);
- if (!fscache_wait_for_operation(cres, FSCACHE_WANT_WRITE))
- goto presubmission_error;
fscache_count_write();
- object = cachefiles_cres_object(cres);
cache = object->volume->cache;
- file = cachefiles_cres_file(cres);
_enter("%pD,%li,%llx,%zx/%llx",
file, file_inode(file)->i_ino, start_pos, len,
i_size_read(file_inode(file)));
- ret = -ENOMEM;
ki = kzalloc(sizeof(struct cachefiles_kiocb), GFP_KERNEL);
- if (!ki)
- goto presubmission_error;
+ if (!ki) {
+ if (term_func)
+ term_func(term_func_priv, -ENOMEM, false);
+ return -ENOMEM;
+ }
refcount_set(&ki->ki_refcnt, 2);
ki->iocb.ki_filp = file;
@@ -314,7 +311,6 @@ static int cachefiles_write(struct netfs_cache_resources *cres,
ki->iocb.ki_flags = IOCB_DIRECT | IOCB_WRITE;
ki->iocb.ki_ioprio = get_current_ioprio();
ki->object = object;
- ki->inval_counter = cres->inval_counter;
ki->start = start_pos;
ki->len = len;
ki->term_func = term_func;
@@ -369,11 +365,24 @@ static int cachefiles_write(struct netfs_cache_resources *cres,
cachefiles_put_kiocb(ki);
_leave(" = %zd", ret);
return ret;
+}
-presubmission_error:
- if (term_func)
- term_func(term_func_priv, ret, false);
- return ret;
+static int cachefiles_write(struct netfs_cache_resources *cres,
+ loff_t start_pos,
+ struct iov_iter *iter,
+ netfs_io_terminated_t term_func,
+ void *term_func_priv)
+{
+ if (!fscache_wait_for_operation(cres, FSCACHE_WANT_WRITE)) {
+ if (term_func)
+ term_func(term_func_priv, -ENOBUFS, false);
+ return -ENOBUFS;
+ }
+
+ return __cachefiles_write(cachefiles_cres_object(cres),
+ cachefiles_cres_file(cres),
+ start_pos, iter,
+ term_func, term_func_priv);
}
/*
@@ -484,13 +493,12 @@ static enum netfs_io_source cachefiles_prepare_read(struct netfs_io_subrequest *
/*
* Prepare for a write to occur.
*/
-static int __cachefiles_prepare_write(struct netfs_cache_resources *cres,
- loff_t *_start, size_t *_len, loff_t i_size,
- bool no_space_allocated_yet)
+int __cachefiles_prepare_write(struct cachefiles_object *object,
+ struct file *file,
+ loff_t *_start, size_t *_len,
+ bool no_space_allocated_yet)
{
- struct cachefiles_object *object = cachefiles_cres_object(cres);
struct cachefiles_cache *cache = object->volume->cache;
- struct file *file = cachefiles_cres_file(cres);
loff_t start = *_start, pos;
size_t len = *_len, down;
int ret;
@@ -577,7 +585,8 @@ static int cachefiles_prepare_write(struct netfs_cache_resources *cres,
}
cachefiles_begin_secure(cache, &saved_cred);
- ret = __cachefiles_prepare_write(cres, _start, _len, i_size,
+ ret = __cachefiles_prepare_write(object, cachefiles_cres_file(cres),
+ _start, _len,
no_space_allocated_yet);
cachefiles_end_secure(cache, saved_cred);
return ret;
--
2.27.0
Implement the data plane of reading data from data blobs over fscache
for non-inline layout.
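The offset translation at the heart of this read path can be sketched in
user space as below. This is only an illustration: struct demo_extent and
demo_logical_to_blob() are hypothetical names, not erofs APIs, and it
assumes that "la" is the logical start of the mapped extent and "pa" its
start inside the data blob.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the mapping fields used by the read path. */
struct demo_extent {
	uint64_t la;	/* logical start of the extent (cf. m_la) */
	uint64_t pa;	/* start of the extent inside the blob (cf. m_pa) */
	uint64_t llen;	/* logical length of the extent (cf. m_llen) */
};

/*
 * Translate a logical file position into an offset inside the data blob.
 * Returns 0 on success, -1 if pos lies outside the extent.
 */
static int demo_logical_to_blob(const struct demo_extent *ext, uint64_t pos,
				uint64_t *pstart)
{
	if (pos < ext->la || pos - ext->la >= ext->llen)
		return -1;
	/* Same arithmetic as "pstart = mdev.m_pa + (pos - map.m_la)". */
	*pstart = ext->pa + (pos - ext->la);
	return 0;
}
```

The bounds check is an addition made for the sketch; the kernel code
relies on the mapping routines to return a consistent extent.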
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
---
fs/erofs/fscache.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
fs/erofs/inode.c | 4 ++++
fs/erofs/internal.h | 2 ++
3 files changed, 57 insertions(+)
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index 23d7e862eed8..b3af72af7c88 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -83,10 +83,61 @@ static int erofs_fscache_meta_readpage(struct file *data, struct page *page)
return ret;
}
+static int erofs_fscache_readpage(struct file *file, struct page *page)
+{
+ struct folio *folio = page_folio(page);
+ struct inode *inode = folio_mapping(folio)->host;
+ struct super_block *sb = inode->i_sb;
+ struct erofs_map_blocks map;
+ struct erofs_map_dev mdev;
+ erofs_off_t pos;
+ loff_t pstart;
+ int ret;
+
+ DBG_BUGON(folio_size(folio) != EROFS_BLKSIZ);
+
+ pos = folio_pos(folio);
+ map.m_la = pos;
+
+ ret = erofs_map_blocks(inode, &map, EROFS_GET_BLOCKS_RAW);
+ if (ret)
+ goto out_unlock;
+
+ if (!(map.m_flags & EROFS_MAP_MAPPED)) {
+ folio_zero_range(folio, 0, folio_size(folio));
+ goto out_uptodate;
+ }
+
+ mdev = (struct erofs_map_dev) {
+ .m_deviceid = map.m_deviceid,
+ .m_pa = map.m_pa,
+ };
+
+ ret = erofs_map_dev(sb, &mdev);
+ if (ret)
+ goto out_unlock;
+
+ pstart = mdev.m_pa + (pos - map.m_la);
+ ret = erofs_fscache_read_folios(mdev.m_fscache->cookie,
+ folio_mapping(folio), folio_pos(folio),
+ folio_size(folio), pstart);
+
+out_uptodate:
+ if (!ret)
+ folio_mark_uptodate(folio);
+out_unlock:
+ folio_unlock(folio);
+ return ret;
+}
+
static const struct address_space_operations erofs_fscache_meta_aops = {
.readpage = erofs_fscache_meta_readpage,
};
+const struct address_space_operations erofs_fscache_access_aops = {
+ .readpage = erofs_fscache_readpage,
+};
+
int erofs_fscache_register_cookie(struct super_block *sb,
struct erofs_fscache **fscache,
char *name, bool need_inode)
diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index e8b37ba5e9ad..8d3f56c6469b 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -297,6 +297,10 @@ static int erofs_fill_inode(struct inode *inode, int isdir)
goto out_unlock;
}
inode->i_mapping->a_ops = &erofs_raw_access_aops;
+#ifdef CONFIG_EROFS_FS_ONDEMAND
+ if (erofs_is_fscache_mode(inode->i_sb))
+ inode->i_mapping->a_ops = &erofs_fscache_access_aops;
+#endif
out_unlock:
erofs_put_metabuf(&buf);
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index fa488af8dfcf..c8f6ac910976 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -639,6 +639,8 @@ int erofs_fscache_register_cookie(struct super_block *sb,
struct erofs_fscache **fscache,
char *name, bool need_inode);
void erofs_fscache_unregister_cookie(struct erofs_fscache **fscache);
+
+extern const struct address_space_operations erofs_fscache_access_aops;
#else
static inline int erofs_fscache_register_fs(struct super_block *sb)
{
--
2.27.0
Implement the data plane of reading data from data blobs over fscache
for inline layout.
For the leading non-inline part, the data plane for non-inline layout is
reused; only the tail-packing part needs special handling.
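The tail-packing handling amounts to a copy followed by zero-filling the
rest of the page. A minimal user-space sketch follows, assuming a 4096-byte
page; demo_fill_inline_page() and DEMO_PAGE_SIZE are hypothetical names
used for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/*
 * Copy len bytes of inline data that start offset bytes into the metadata
 * block, then zero the remainder of the destination page.
 * Returns 0 on success, -1 if the range does not fit in one block.
 */
static int demo_fill_inline_page(unsigned char *dst,
				 const unsigned char *meta_block,
				 size_t offset, size_t len)
{
	if (offset > DEMO_PAGE_SIZE || len > DEMO_PAGE_SIZE - offset)
		return -1;
	memcpy(dst, meta_block + offset, len);
	memset(dst + len, 0, DEMO_PAGE_SIZE - len);
	return 0;
}
```

The range check is an extra safeguard added for the sketch; it is not
taken from the patch.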
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
---
fs/erofs/fscache.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index b3af72af7c88..5b779812a5ee 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -83,6 +83,33 @@ static int erofs_fscache_meta_readpage(struct file *data, struct page *page)
return ret;
}
+static int erofs_fscache_readpage_inline(struct folio *folio,
+ struct erofs_map_blocks *map)
+{
+ struct super_block *sb = folio_mapping(folio)->host->i_sb;
+ struct erofs_buf buf = __EROFS_BUF_INITIALIZER;
+ erofs_blk_t blknr;
+ size_t offset, len;
+ void *src, *dst;
+
+ /* For tail packing layout, the offset may be non-zero. */
+ offset = erofs_blkoff(map->m_pa);
+ blknr = erofs_blknr(map->m_pa);
+ len = map->m_llen;
+
+ src = erofs_read_metabuf(&buf, sb, blknr, EROFS_KMAP);
+ if (IS_ERR(src))
+ return PTR_ERR(src);
+
+ dst = kmap_local_folio(folio, 0);
+ memcpy(dst, src + offset, len);
+ memset(dst + len, 0, PAGE_SIZE - len);
+ kunmap_local(dst);
+
+ erofs_put_metabuf(&buf);
+ return 0;
+}
+
static int erofs_fscache_readpage(struct file *file, struct page *page)
{
struct folio *folio = page_folio(page);
@@ -108,6 +135,11 @@ static int erofs_fscache_readpage(struct file *file, struct page *page)
goto out_uptodate;
}
+ if (map.m_flags & EROFS_MAP_META) {
+ ret = erofs_fscache_readpage_inline(folio, &map);
+ goto out_uptodate;
+ }
+
mdev = (struct erofs_map_dev) {
.m_deviceid = map.m_deviceid,
.m_pa = map.m_pa,
--
2.27.0
Add the erofs_fscache_read_folios() helper, which reads from fscache. It
supports on-demand read semantics: on a cache miss, it makes the backend
prepare the data, and once the data is ready, it reads from the cache.
This helper can then be used to implement .readpage()/.readahead() with
on-demand read semantics.
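The helper's loop can be pictured with the user-space sketch below. It is
not the kernel implementation: the callbacks and struct demo_backend are
hypothetical, and the only behaviour modelled is that each "prepare" step
may shrink the range it accepts, so the caller keeps going until the whole
range is consumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callbacks standing in for prepare_read and the cache read. */
typedef int (*demo_prepare_fn)(size_t start, size_t *len, void *priv);
typedef int (*demo_read_fn)(size_t start, size_t len, void *priv);

struct demo_backend {
	size_t max_chunk;	/* largest range one prepare call accepts */
	size_t bytes_read;	/* total bytes handed to the read callback */
	unsigned int calls;	/* number of read callbacks issued */
};

static int demo_prepare(size_t start, size_t *len, void *priv)
{
	struct demo_backend *b = priv;

	(void)start;
	if (*len > b->max_chunk)
		*len = b->max_chunk;	/* the backend may shorten the range */
	return 0;
}

static int demo_read(size_t start, size_t len, void *priv)
{
	struct demo_backend *b = priv;

	(void)start;
	b->bytes_read += len;
	b->calls++;
	return 0;
}

/* Read [pstart, pstart + len) chunk by chunk; -1 stands in for -EIO. */
static int demo_read_range(size_t pstart, size_t len,
			   demo_prepare_fn prepare, demo_read_fn read_chunk,
			   void *priv)
{
	size_t done = 0;

	while (done < len) {
		size_t chunk = len - done;
		int ret = prepare(pstart + done, &chunk, priv);

		if (ret || chunk == 0)
			return -1;
		ret = read_chunk(pstart + done, chunk, priv);
		if (ret)
			return ret;
		done += chunk;
	}
	return 0;
}
```

Treating a zero-length chunk as an error keeps the loop from spinning
forever, which is the same reason the helper warns on subreq.len == 0.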
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
---
fs/erofs/fscache.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 54 insertions(+)
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index 26f038d9c4e1..ac02af8cce3e 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -5,6 +5,60 @@
#include <linux/fscache.h>
#include "internal.h"
+/*
+ * Read data from fscache and fill the read data into page cache described by
+ * @start/len, which shall be both aligned with PAGE_SIZE. @pstart describes
+ * the start physical address in the cache file.
+ */
+static int erofs_fscache_read_folios(struct fscache_cookie *cookie,
+ struct address_space *mapping,
+ loff_t start, size_t len,
+ loff_t pstart)
+{
+ enum netfs_io_source source;
+ struct netfs_io_request rreq = {};
+ struct netfs_io_subrequest subreq = { .rreq = &rreq, };
+ struct netfs_cache_resources *cres = &rreq.cache_resources;
+ struct super_block *sb = mapping->host->i_sb;
+ struct iov_iter iter;
+ size_t done = 0;
+ int ret;
+
+ ret = fscache_begin_read_operation(cres, cookie);
+ if (ret)
+ return ret;
+
+ while (done < len) {
+ subreq.start = pstart + done;
+ subreq.len = len - done;
+ subreq.flags = 1 << NETFS_SREQ_ONDEMAND;
+
+ source = cres->ops->prepare_read(&subreq, LLONG_MAX);
+ if (WARN_ON(subreq.len == 0))
+ source = NETFS_INVALID_READ;
+ if (source != NETFS_READ_FROM_CACHE) {
+ erofs_err(sb, "failed to fscache prepare_read (source %d)",
+ source);
+ ret = -EIO;
+ goto out;
+ }
+
+ iov_iter_xarray(&iter, READ, &mapping->i_pages,
+ start + done, subreq.len);
+ ret = fscache_read(cres, subreq.start, &iter,
+ NETFS_READ_HOLE_FAIL, NULL, NULL);
+ if (ret) {
+ erofs_err(sb, "failed to fscache_read (ret %d)", ret);
+ goto out;
+ }
+
+ done += subreq.len;
+ }
+out:
+ fscache_end_operation(cres);
+ return ret;
+}
+
static const struct address_space_operations erofs_fscache_meta_aops = {
};
--
2.27.0
Implement the data plane of on-demand read mode.
The early implementation [1] placed the entry to
cachefiles_ondemand_read() in fscache_read(). However, fscache_read()
can only detect whether the requested file range is a complete cache
miss, whereas we need to notify the user daemon whenever there is a hole
inside the requested file range.
Thus the entry is now placed in cachefiles_prepare_read(). When working
in on-demand read mode, once a hole is detected, the read routine sends
a READ request to the user daemon. The user daemon needs to fetch the
data and write it to the cache file. After sending the READ request, the
read routine blocks until the READ request has been handled by the
user daemon. Then it retries the read on the same file range. If no
progress has been made, the read routine fails.
A new NETFS_SREQ_ONDEMAND flag is introduced to indicate that an
on-demand read should be done when a cache miss is encountered.
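The retry-once behaviour described above can be modelled in user space as
follows. Everything here is a hypothetical simulation (struct demo_cache,
demo_ondemand_read(), demo_prepare_read()); it only mirrors the control
flow of clearing the on-demand flag before retrying, so that a second miss
is not sent to the daemon again.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_source { DEMO_READ_FROM_CACHE, DEMO_INVALID_READ, DEMO_MISS };

struct demo_cache {
	bool has_data;		/* whether the requested range is present */
	bool daemon_fills;	/* whether the simulated daemon writes the data */
	bool daemon_fails;	/* whether the READ request itself errors out */
	unsigned int read_requests;
};

/* Stand-in for sending a READ request and waiting for its completion. */
static int demo_ondemand_read(struct demo_cache *c)
{
	c->read_requests++;
	if (c->daemon_fails)
		return -1;
	if (c->daemon_fills)
		c->has_data = true;
	return 0;
}

static enum demo_source demo_prepare_read(struct demo_cache *c)
{
	bool ondemand = true;	/* models NETFS_SREQ_ONDEMAND being set */

retry:
	if (c->has_data)
		return DEMO_READ_FROM_CACHE;
	if (ondemand) {
		if (demo_ondemand_read(c))
			return DEMO_INVALID_READ;
		ondemand = false;	/* retry only once */
		goto retry;
	}
	return DEMO_MISS;	/* no progress after the retry */
}
```

In the sketch a caller would treat anything other than
DEMO_READ_FROM_CACHE as a failed read.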
Signed-off-by: Jeffle Xu <[email protected]>
[1] https://lore.kernel.org/all/[email protected]/ #v8
Acked-by: David Howells <[email protected]>
---
fs/cachefiles/internal.h | 9 ++++
fs/cachefiles/io.c | 15 ++++++-
fs/cachefiles/ondemand.c | 77 +++++++++++++++++++++++++++++++++
include/linux/netfs.h | 1 +
include/uapi/linux/cachefiles.h | 17 ++++++++
5 files changed, 117 insertions(+), 2 deletions(-)
diff --git a/fs/cachefiles/internal.h b/fs/cachefiles/internal.h
index da388ba127eb..6cba2c6de2f9 100644
--- a/fs/cachefiles/internal.h
+++ b/fs/cachefiles/internal.h
@@ -292,6 +292,9 @@ extern int cachefiles_ondemand_copen(struct cachefiles_cache *cache,
extern int cachefiles_ondemand_init_object(struct cachefiles_object *object);
extern void cachefiles_ondemand_clean_object(struct cachefiles_object *object);
+extern int cachefiles_ondemand_read(struct cachefiles_object *object,
+ loff_t pos, size_t len);
+
#else
static inline ssize_t cachefiles_ondemand_daemon_read(struct cachefiles_cache *cache,
char __user *_buffer, size_t buflen)
@@ -307,6 +310,12 @@ static inline int cachefiles_ondemand_init_object(struct cachefiles_object *obje
static inline void cachefiles_ondemand_clean_object(struct cachefiles_object *object)
{
}
+
+static inline int cachefiles_ondemand_read(struct cachefiles_object *object,
+ loff_t pos, size_t len)
+{
+ return -EOPNOTSUPP;
+}
#endif
/*
diff --git a/fs/cachefiles/io.c b/fs/cachefiles/io.c
index 50a14e8f0aac..000a28f46e59 100644
--- a/fs/cachefiles/io.c
+++ b/fs/cachefiles/io.c
@@ -403,6 +403,7 @@ static enum netfs_io_source cachefiles_prepare_read(struct netfs_io_subrequest *
enum netfs_io_source ret = NETFS_DOWNLOAD_FROM_SERVER;
loff_t off, to;
ino_t ino = file ? file_inode(file)->i_ino : 0;
+ int rc;
_enter("%zx @%llx/%llx", subreq->len, subreq->start, i_size);
@@ -415,7 +416,8 @@ static enum netfs_io_source cachefiles_prepare_read(struct netfs_io_subrequest *
if (test_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags)) {
__set_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags);
why = cachefiles_trace_read_no_data;
- goto out_no_object;
+ if (!test_bit(NETFS_SREQ_ONDEMAND, &subreq->flags))
+ goto out_no_object;
}
/* The object and the file may be being created in the background. */
@@ -432,7 +434,7 @@ static enum netfs_io_source cachefiles_prepare_read(struct netfs_io_subrequest *
object = cachefiles_cres_object(cres);
cache = object->volume->cache;
cachefiles_begin_secure(cache, &saved_cred);
-
+retry:
off = cachefiles_inject_read_error();
if (off == 0)
off = vfs_llseek(file, subreq->start, SEEK_DATA);
@@ -483,6 +485,15 @@ static enum netfs_io_source cachefiles_prepare_read(struct netfs_io_subrequest *
download_and_store:
__set_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags);
+ if (test_bit(NETFS_SREQ_ONDEMAND, &subreq->flags)) {
+ rc = cachefiles_ondemand_read(object, subreq->start,
+ subreq->len);
+ if (!rc) {
+ __clear_bit(NETFS_SREQ_ONDEMAND, &subreq->flags);
+ goto retry;
+ }
+ ret = NETFS_INVALID_READ;
+ }
out:
cachefiles_end_secure(cache, saved_cred);
out_no_object:
diff --git a/fs/cachefiles/ondemand.c b/fs/cachefiles/ondemand.c
index 11b1c15ac697..3470d4e8f0cb 100644
--- a/fs/cachefiles/ondemand.c
+++ b/fs/cachefiles/ondemand.c
@@ -10,8 +10,25 @@ static int cachefiles_ondemand_fd_release(struct inode *inode,
struct cachefiles_object *object = file->private_data;
struct cachefiles_cache *cache = object->volume->cache;
int object_id = object->ondemand_id;
+ struct cachefiles_req *req;
+ XA_STATE(xas, &cache->reqs, 0);
+ xa_lock(&cache->reqs);
object->ondemand_id = CACHEFILES_ONDEMAND_ID_CLOSED;
+
+ /*
+ * Flush all pending READ requests since their completion depends on
+ * anon_fd.
+ */
+ xas_for_each(&xas, req, ULONG_MAX) {
+ if (req->msg.opcode == CACHEFILES_OP_READ) {
+ req->error = -EIO;
+ complete(&req->done);
+ xas_store(&xas, NULL);
+ }
+ }
+ xa_unlock(&cache->reqs);
+
xa_erase(&cache->ondemand_ids, object_id);
cachefiles_put_object(object, cachefiles_obj_put_ondemand_fd);
cachefiles_put_unbind_pincount(cache);
@@ -57,11 +74,35 @@ static loff_t cachefiles_ondemand_fd_llseek(struct file *filp, loff_t pos,
return vfs_llseek(file, pos, whence);
}
+static long cachefiles_ondemand_fd_ioctl(struct file *filp, unsigned int ioctl,
+ unsigned long arg)
+{
+ struct cachefiles_object *object = filp->private_data;
+ struct cachefiles_cache *cache = object->volume->cache;
+ struct cachefiles_req *req;
+ unsigned long id;
+
+ if (ioctl != CACHEFILES_IOC_READ_COMPLETE)
+ return -EINVAL;
+
+ if (!test_bit(CACHEFILES_ONDEMAND_MODE, &cache->flags))
+ return -EOPNOTSUPP;
+
+ id = arg;
+ req = xa_erase(&cache->reqs, id);
+ if (!req)
+ return -EINVAL;
+
+ complete(&req->done);
+ return 0;
+}
+
static const struct file_operations cachefiles_ondemand_fd_fops = {
.owner = THIS_MODULE,
.release = cachefiles_ondemand_fd_release,
.write_iter = cachefiles_ondemand_fd_write_iter,
.llseek = cachefiles_ondemand_fd_llseek,
+ .unlocked_ioctl = cachefiles_ondemand_fd_ioctl,
};
/*
@@ -388,6 +429,32 @@ static int cachefiles_ondemand_init_close_req(struct cachefiles_req *req,
return 0;
}
+struct cachefiles_read_ctx {
+ loff_t off;
+ size_t len;
+};
+
+static int cachefiles_ondemand_init_read_req(struct cachefiles_req *req,
+ void *private)
+{
+ struct cachefiles_object *object = req->object;
+ struct cachefiles_read *load = (void *)req->msg.data;
+ struct cachefiles_read_ctx *read_ctx = private;
+ int object_id = object->ondemand_id;
+
+ /* Stop enqueuing requests when daemon has closed anon_fd. */
+ if (object_id <= 0) {
+ WARN_ON_ONCE(object_id == 0);
+ pr_info_once("READ: anonymous fd closed prematurely.\n");
+ return -EIO;
+ }
+
+ req->msg.object_id = object_id;
+ load->off = read_ctx->off;
+ load->len = read_ctx->len;
+ return 0;
+}
+
int cachefiles_ondemand_init_object(struct cachefiles_object *object)
{
struct fscache_cookie *cookie = object->cookie;
@@ -417,3 +484,13 @@ void cachefiles_ondemand_clean_object(struct cachefiles_object *object)
cachefiles_ondemand_send_req(object, CACHEFILES_OP_CLOSE, 0,
cachefiles_ondemand_init_close_req, NULL);
}
+
+int cachefiles_ondemand_read(struct cachefiles_object *object,
+ loff_t pos, size_t len)
+{
+ struct cachefiles_read_ctx read_ctx = {pos, len};
+
+ return cachefiles_ondemand_send_req(object, CACHEFILES_OP_READ,
+ sizeof(struct cachefiles_read),
+ cachefiles_ondemand_init_read_req, &read_ctx);
+}
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index c7bf1eaf51d5..057d04efaf79 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -159,6 +159,7 @@ struct netfs_io_subrequest {
#define NETFS_SREQ_SHORT_IO 2 /* Set if the I/O was short */
#define NETFS_SREQ_SEEK_DATA_READ 3 /* Set if ->read() should SEEK_DATA first */
#define NETFS_SREQ_NO_PROGRESS 4 /* Set if we didn't manage to read any data */
+#define NETFS_SREQ_ONDEMAND 5 /* Set if it's from on-demand read mode */
};
enum netfs_io_origin {
diff --git a/include/uapi/linux/cachefiles.h b/include/uapi/linux/cachefiles.h
index 37a0071037c8..78caa73e5343 100644
--- a/include/uapi/linux/cachefiles.h
+++ b/include/uapi/linux/cachefiles.h
@@ -3,6 +3,7 @@
#define _LINUX_CACHEFILES_H
#include <linux/types.h>
+#include <linux/ioctl.h>
/*
* Fscache ensures that the maximum length of cookie key is 255. The volume key
@@ -13,6 +14,7 @@
enum cachefiles_opcode {
CACHEFILES_OP_OPEN,
CACHEFILES_OP_CLOSE,
+ CACHEFILES_OP_READ,
};
/*
@@ -48,4 +50,19 @@ struct cachefiles_open {
__u8 data[];
};
+/*
+ * @off indicates the starting offset of the requested file range
+ * @len indicates the length of the requested file range
+ */
+struct cachefiles_read {
+ __u64 off;
+ __u64 len;
+};
+
+/*
+ * Reply for READ request
+ * @arg for this ioctl is the @id field of READ request.
+ */
+#define CACHEFILES_IOC_READ_COMPLETE _IOW(0x98, 1, int)
+
#endif
--
2.27.0
A new fscache-based mode is going to be introduced for erofs, in which
on-demand read semantics are implemented through fscache.
As the first step, register an fscache volume for each erofs filesystem.
That means data blobs cannot be shared among erofs filesystems. In a
following iteration, we are going to introduce domain semantics, in
which several erofs filesystems can belong to one domain, and data
blobs can be shared among the erofs filesystems of one domain.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
---
fs/erofs/Kconfig | 10 ++++++++++
fs/erofs/Makefile | 1 +
fs/erofs/fscache.c | 37 +++++++++++++++++++++++++++++++++++++
fs/erofs/internal.h | 16 ++++++++++++++++
fs/erofs/super.c | 5 +++++
5 files changed, 69 insertions(+)
create mode 100644 fs/erofs/fscache.c
diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
index f57255ab88ed..85490370e0ca 100644
--- a/fs/erofs/Kconfig
+++ b/fs/erofs/Kconfig
@@ -98,3 +98,13 @@ config EROFS_FS_ZIP_LZMA
systems will be readable without selecting this option.
If unsure, say N.
+
+config EROFS_FS_ONDEMAND
+ bool "EROFS fscache-based on-demand read support"
+ depends on CACHEFILES_ONDEMAND && (EROFS_FS=m && FSCACHE || EROFS_FS=y && FSCACHE=y)
+ default n
+ help
+ This permits EROFS to use fscache-backed data blobs with on-demand
+ read support.
+
+ If unsure, say N.
diff --git a/fs/erofs/Makefile b/fs/erofs/Makefile
index 8a3317e38e5a..99bbc597a3e9 100644
--- a/fs/erofs/Makefile
+++ b/fs/erofs/Makefile
@@ -5,3 +5,4 @@ erofs-objs := super.o inode.o data.o namei.o dir.o utils.o pcpubuf.o sysfs.o
erofs-$(CONFIG_EROFS_FS_XATTR) += xattr.o
erofs-$(CONFIG_EROFS_FS_ZIP) += decompressor.o zmap.o zdata.o
erofs-$(CONFIG_EROFS_FS_ZIP_LZMA) += decompressor_lzma.o
+erofs-$(CONFIG_EROFS_FS_ONDEMAND) += fscache.o
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
new file mode 100644
index 000000000000..7a6d0239ebb1
--- /dev/null
+++ b/fs/erofs/fscache.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022, Alibaba Cloud
+ */
+#include <linux/fscache.h>
+#include "internal.h"
+
+int erofs_fscache_register_fs(struct super_block *sb)
+{
+ struct erofs_sb_info *sbi = EROFS_SB(sb);
+ struct fscache_volume *volume;
+ char *name;
+ int ret = 0;
+
+ name = kasprintf(GFP_KERNEL, "erofs,%s", sbi->opt.fsid);
+ if (!name)
+ return -ENOMEM;
+
+ volume = fscache_acquire_volume(name, NULL, NULL, 0);
+ if (IS_ERR_OR_NULL(volume)) {
+ erofs_err(sb, "failed to register volume for %s", name);
+ ret = volume ? PTR_ERR(volume) : -EOPNOTSUPP;
+ volume = NULL;
+ }
+
+ sbi->volume = volume;
+ kfree(name);
+ return ret;
+}
+
+void erofs_fscache_unregister_fs(struct super_block *sb)
+{
+ struct erofs_sb_info *sbi = EROFS_SB(sb);
+
+ fscache_relinquish_volume(sbi->volume, NULL, false);
+ sbi->volume = NULL;
+}
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 05a97533b1e9..e4f6a13f161f 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -74,6 +74,7 @@ struct erofs_mount_opts {
unsigned int max_sync_decompress_pages;
#endif
unsigned int mount_opt;
+ char *fsid;
};
struct erofs_dev_context {
@@ -146,6 +147,9 @@ struct erofs_sb_info {
/* sysfs support */
struct kobject s_kobj; /* /sys/fs/erofs/<devname> */
struct completion s_kobj_unregister;
+
+ /* fscache support */
+ struct fscache_volume *volume;
};
#define EROFS_SB(sb) ((struct erofs_sb_info *)(sb)->s_fs_info)
@@ -618,6 +622,18 @@ static inline int z_erofs_load_lzma_config(struct super_block *sb,
}
#endif /* !CONFIG_EROFS_FS_ZIP */
+/* fscache.c */
+#ifdef CONFIG_EROFS_FS_ONDEMAND
+int erofs_fscache_register_fs(struct super_block *sb);
+void erofs_fscache_unregister_fs(struct super_block *sb);
+#else
+static inline int erofs_fscache_register_fs(struct super_block *sb)
+{
+ return 0;
+}
+static inline void erofs_fscache_unregister_fs(struct super_block *sb) {}
+#endif
+
#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
#endif /* __EROFS_INTERNAL_H */
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 724d5ff0d78c..fd8daa447237 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -602,6 +602,10 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
if (erofs_is_fscache_mode(sb)) {
sb->s_blocksize = EROFS_BLKSIZ;
sb->s_blocksize_bits = LOG_BLOCK_SIZE;
+
+ err = erofs_fscache_register_fs(sb);
+ if (err)
+ return err;
} else {
if (!sb_set_blocksize(sb, EROFS_BLKSIZ)) {
erofs_err(sb, "failed to set erofs blksize");
@@ -768,6 +772,7 @@ static void erofs_kill_sb(struct super_block *sb)
erofs_free_dev_context(sbi->devs);
fs_put_dax(sbi->dax_dev);
+ erofs_fscache_unregister_fs(sb);
kfree(sbi);
sb->s_fs_info = NULL;
}
--
2.27.0
Notify the user daemon that the cookie is going to be withdrawn, providing
a hint that the associated anonymous fd can be closed.
Note that this is only a hint. The user daemon may close the associated
anonymous fd when receiving the CLOSE request, in which case it will
receive another anonymous fd when the cookie gets looked up again. Or it
may ignore the CLOSE request and keep writing data through the anonymous
fd. However, the next time the cookie gets looked up, the user daemon
will still receive a new anonymous fd.
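From the daemon's point of view, the hint semantics could look like the
sketch below. It is a simulation under stated assumptions: struct
demo_daemon, demo_handle_msg() and the honor_close policy flag are
hypothetical, and the opcode enum is redefined locally rather than taken
from the uapi header.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copy of the opcode order, redefined for the sketch. */
enum demo_opcode { DEMO_OP_OPEN, DEMO_OP_CLOSE };

#define DEMO_MAX_OBJECTS 16

struct demo_daemon {
	bool fd_open[DEMO_MAX_OBJECTS];	/* indexed by object id */
	bool honor_close;		/* policy: act on CLOSE hints or not */
};

static int demo_handle_msg(struct demo_daemon *d, enum demo_opcode op,
			   unsigned int object_id)
{
	if (object_id >= DEMO_MAX_OBJECTS)
		return -1;

	switch (op) {
	case DEMO_OP_OPEN:
		/* A lookup always hands over a fresh anonymous fd. */
		d->fd_open[object_id] = true;
		return 0;
	case DEMO_OP_CLOSE:
		/* Only a hint: the daemon is free to keep the fd. */
		if (d->honor_close)
			d->fd_open[object_id] = false;
		return 0;
	}
	return -1;
}
```

Either policy is acceptable here, since CLOSE has no reply and the next
lookup delivers a new anonymous fd regardless.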
Signed-off-by: Jeffle Xu <[email protected]>
Acked-by: David Howells <[email protected]>
---
fs/cachefiles/interface.c | 2 ++
fs/cachefiles/internal.h | 5 +++++
fs/cachefiles/ondemand.c | 38 +++++++++++++++++++++++++++++++++
include/uapi/linux/cachefiles.h | 1 +
4 files changed, 46 insertions(+)
diff --git a/fs/cachefiles/interface.c b/fs/cachefiles/interface.c
index ae93cee9d25d..a69073a1d3f0 100644
--- a/fs/cachefiles/interface.c
+++ b/fs/cachefiles/interface.c
@@ -362,6 +362,8 @@ static void cachefiles_withdraw_cookie(struct fscache_cookie *cookie)
spin_unlock(&cache->object_list_lock);
}
+ cachefiles_ondemand_clean_object(object);
+
if (object->file) {
cachefiles_begin_secure(cache, &saved_cred);
cachefiles_clean_up_object(object, cache);
diff --git a/fs/cachefiles/internal.h b/fs/cachefiles/internal.h
index e5c612888f84..da388ba127eb 100644
--- a/fs/cachefiles/internal.h
+++ b/fs/cachefiles/internal.h
@@ -290,6 +290,7 @@ extern int cachefiles_ondemand_copen(struct cachefiles_cache *cache,
char *args);
extern int cachefiles_ondemand_init_object(struct cachefiles_object *object);
+extern void cachefiles_ondemand_clean_object(struct cachefiles_object *object);
#else
static inline ssize_t cachefiles_ondemand_daemon_read(struct cachefiles_cache *cache,
@@ -302,6 +303,10 @@ static inline int cachefiles_ondemand_init_object(struct cachefiles_object *obje
{
return 0;
}
+
+static inline void cachefiles_ondemand_clean_object(struct cachefiles_object *object)
+{
+}
#endif
/*
diff --git a/fs/cachefiles/ondemand.c b/fs/cachefiles/ondemand.c
index 7946ee6c40be..11b1c15ac697 100644
--- a/fs/cachefiles/ondemand.c
+++ b/fs/cachefiles/ondemand.c
@@ -229,6 +229,12 @@ ssize_t cachefiles_ondemand_daemon_read(struct cachefiles_cache *cache,
goto err_put_fd;
}
+ /* CLOSE request has no reply */
+ if (msg->opcode == CACHEFILES_OP_CLOSE) {
+ xa_erase(&cache->reqs, id);
+ complete(&req->done);
+ }
+
return n;
err_put_fd:
@@ -300,6 +306,13 @@ static int cachefiles_ondemand_send_req(struct cachefiles_object *object,
/* coupled with the barrier in cachefiles_flush_reqs() */
smp_mb();
+ if (opcode != CACHEFILES_OP_OPEN && object->ondemand_id <= 0) {
+ WARN_ON_ONCE(object->ondemand_id == 0);
+ xas_unlock(&xas);
+ ret = -EIO;
+ goto out;
+ }
+
xas.xa_index = 0;
xas_find_marked(&xas, UINT_MAX, XA_FREE_MARK);
if (xas.xa_node == XAS_RESTART)
@@ -356,6 +369,25 @@ static int cachefiles_ondemand_init_open_req(struct cachefiles_req *req,
return 0;
}
+static int cachefiles_ondemand_init_close_req(struct cachefiles_req *req,
+ void *private)
+{
+ struct cachefiles_object *object = req->object;
+ int object_id = object->ondemand_id;
+
+ /*
+ * It's possible that object id is still 0 if the cookie looking up
+ * phase failed before OPEN request has ever been sent. Also avoid
+ * sending CLOSE request for CACHEFILES_ONDEMAND_ID_CLOSED, which means
+ * anon_fd has already been closed.
+ */
+ if (object_id <= 0)
+ return -ENOENT;
+
+ req->msg.object_id = object_id;
+ return 0;
+}
+
int cachefiles_ondemand_init_object(struct cachefiles_object *object)
{
struct fscache_cookie *cookie = object->cookie;
@@ -379,3 +411,9 @@ int cachefiles_ondemand_init_object(struct cachefiles_object *object)
return cachefiles_ondemand_send_req(object, CACHEFILES_OP_OPEN,
data_len, cachefiles_ondemand_init_open_req, NULL);
}
+
+void cachefiles_ondemand_clean_object(struct cachefiles_object *object)
+{
+ cachefiles_ondemand_send_req(object, CACHEFILES_OP_CLOSE, 0,
+ cachefiles_ondemand_init_close_req, NULL);
+}
diff --git a/include/uapi/linux/cachefiles.h b/include/uapi/linux/cachefiles.h
index 521f2fe4fe9c..37a0071037c8 100644
--- a/include/uapi/linux/cachefiles.h
+++ b/include/uapi/linux/cachefiles.h
@@ -12,6 +12,7 @@
enum cachefiles_opcode {
CACHEFILES_OP_OPEN,
+ CACHEFILES_OP_CLOSE,
};
/*
--
2.27.0
On Mon, Apr 25, 2022 at 08:21:39PM +0800, Jeffle Xu wrote:
> Implement the data plane of reading metadata from primary data blob
> over fscache.
>
> Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Thanks,
Gao Xiang
> ---
> fs/erofs/data.c | 19 +++++++++++++++----
> fs/erofs/fscache.c | 25 +++++++++++++++++++++++++
> 2 files changed, 40 insertions(+), 4 deletions(-)
>
> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
> index 14b64d960541..bb9c1fd48c19 100644
> --- a/fs/erofs/data.c
> +++ b/fs/erofs/data.c
> @@ -6,6 +6,7 @@
> */
> #include "internal.h"
> #include <linux/prefetch.h>
> +#include <linux/sched/mm.h>
> #include <linux/dax.h>
> #include <trace/events/erofs.h>
>
> @@ -35,14 +36,20 @@ void *erofs_bread(struct erofs_buf *buf, struct inode *inode,
> erofs_off_t offset = blknr_to_addr(blkaddr);
> pgoff_t index = offset >> PAGE_SHIFT;
> struct page *page = buf->page;
> + struct folio *folio;
> + unsigned int nofs_flag;
>
> if (!page || page->index != index) {
> erofs_put_metabuf(buf);
> - page = read_cache_page_gfp(mapping, index,
> - mapping_gfp_constraint(mapping, ~__GFP_FS));
> - if (IS_ERR(page))
> - return page;
> +
> + nofs_flag = memalloc_nofs_save();
> + folio = read_cache_folio(mapping, index, NULL, NULL);
> + memalloc_nofs_restore(nofs_flag);
> + if (IS_ERR(folio))
> + return folio;
> +
> /* should already be PageUptodate, no need to lock page */
> + page = folio_file_page(folio, index);
> buf->page = page;
> }
> if (buf->kmap_type == EROFS_NO_KMAP) {
> @@ -63,6 +70,10 @@ void *erofs_bread(struct erofs_buf *buf, struct inode *inode,
> void *erofs_read_metabuf(struct erofs_buf *buf, struct super_block *sb,
> erofs_blk_t blkaddr, enum erofs_kmap_type type)
> {
> + if (erofs_is_fscache_mode(sb))
> + return erofs_bread(buf, EROFS_SB(sb)->s_fscache->inode,
> + blkaddr, type);
> +
> return erofs_bread(buf, sb->s_bdev->bd_inode, blkaddr, type);
> }
>
> diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
> index ac02af8cce3e..23d7e862eed8 100644
> --- a/fs/erofs/fscache.c
> +++ b/fs/erofs/fscache.c
> @@ -59,7 +59,32 @@ static int erofs_fscache_read_folios(struct fscache_cookie *cookie,
> return ret;
> }
>
> +static int erofs_fscache_meta_readpage(struct file *data, struct page *page)
> +{
> + int ret;
> + struct folio *folio = page_folio(page);
> + struct super_block *sb = folio_mapping(folio)->host->i_sb;
> + struct erofs_map_dev mdev = {
> + .m_deviceid = 0,
> + .m_pa = folio_pos(folio),
> + };
> +
> + ret = erofs_map_dev(sb, &mdev);
> + if (ret)
> + goto out;
> +
> + ret = erofs_fscache_read_folios(mdev.m_fscache->cookie,
> + folio_mapping(folio), folio_pos(folio),
> + folio_size(folio), mdev.m_pa);
> + if (!ret)
> + folio_mark_uptodate(folio);
> +out:
> + folio_unlock(folio);
> + return ret;
> +}
> +
> static const struct address_space_operations erofs_fscache_meta_aops = {
> + .readpage = erofs_fscache_meta_readpage,
> };
>
> int erofs_fscache_register_cookie(struct super_block *sb,
> --
> 2.27.0
On Mon, Apr 25, 2022 at 08:21:42PM +0800, Jeffle Xu wrote:
> Implement fscache-based data readahead. Also registers an individual
^ register
> bdi for each erofs instance to enable readahead.
>
> Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Thanks,
Gao Xiang
> ---
> fs/erofs/fscache.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++
> fs/erofs/super.c | 4 +++
> 2 files changed, 94 insertions(+)
>
> diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
> index 5b779812a5ee..a402d8f0a063 100644
> --- a/fs/erofs/fscache.c
> +++ b/fs/erofs/fscache.c
> @@ -162,12 +162,102 @@ static int erofs_fscache_readpage(struct file *file, struct page *page)
> return ret;
> }
>
> +static void erofs_fscache_unlock_folios(struct readahead_control *rac,
> + size_t len)
> +{
> + while (len) {
> + struct folio *folio = readahead_folio(rac);
> +
> + len -= folio_size(folio);
> + folio_mark_uptodate(folio);
> + folio_unlock(folio);
> + }
> +}
> +
> +static void erofs_fscache_readahead(struct readahead_control *rac)
> +{
> + struct inode *inode = rac->mapping->host;
> + struct super_block *sb = inode->i_sb;
> + size_t len, count, done = 0;
> + erofs_off_t pos;
> + loff_t start, offset;
> + int ret;
> +
> + if (!readahead_count(rac))
> + return;
> +
> + start = readahead_pos(rac);
> + len = readahead_length(rac);
> +
> + do {
> + struct erofs_map_blocks map;
> + struct erofs_map_dev mdev;
> +
> + pos = start + done;
> + map.m_la = pos;
> +
> + ret = erofs_map_blocks(inode, &map, EROFS_GET_BLOCKS_RAW);
> + if (ret)
> + return;
> +
> + offset = start + done;
> + count = min_t(size_t, map.m_llen - (pos - map.m_la),
> + len - done);
> +
> + if (!(map.m_flags & EROFS_MAP_MAPPED)) {
> + struct iov_iter iter;
> +
> + iov_iter_xarray(&iter, READ, &rac->mapping->i_pages,
> + offset, count);
> + iov_iter_zero(count, &iter);
> +
> + erofs_fscache_unlock_folios(rac, count);
> + ret = count;
> + continue;
> + }
> +
> + if (map.m_flags & EROFS_MAP_META) {
> + struct folio *folio = readahead_folio(rac);
> +
> + ret = erofs_fscache_readpage_inline(folio, &map);
> + if (!ret) {
> + folio_mark_uptodate(folio);
> + ret = folio_size(folio);
> + }
> +
> + folio_unlock(folio);
> + continue;
> + }
> +
> + mdev = (struct erofs_map_dev) {
> + .m_deviceid = map.m_deviceid,
> + .m_pa = map.m_pa,
> + };
> + ret = erofs_map_dev(sb, &mdev);
> + if (ret)
> + return;
> +
> + ret = erofs_fscache_read_folios(mdev.m_fscache->cookie,
> + rac->mapping, offset, count,
> + mdev.m_pa + (pos - map.m_la));
> + /*
> + * For the error cases, the folios will be unlocked when
> + * .readahead() returns.
> + */
> + if (!ret) {
> + erofs_fscache_unlock_folios(rac, count);
> + ret = count;
> + }
> + } while (ret > 0 && ((done += ret) < len));
> +}
> +
> static const struct address_space_operations erofs_fscache_meta_aops = {
> .readpage = erofs_fscache_meta_readpage,
> };
>
> const struct address_space_operations erofs_fscache_access_aops = {
> .readpage = erofs_fscache_readpage,
> + .readahead = erofs_fscache_readahead,
> };
>
> int erofs_fscache_register_cookie(struct super_block *sb,
> diff --git a/fs/erofs/super.c b/fs/erofs/super.c
> index c6755bcae4a6..f68ba929100d 100644
> --- a/fs/erofs/super.c
> +++ b/fs/erofs/super.c
> @@ -619,6 +619,10 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
> sbi->opt.fsid, true);
> if (err)
> return err;
> +
> + err = super_setup_bdi(sb);
> + if (err)
> + return err;
> } else {
> if (!sb_set_blocksize(sb, EROFS_BLKSIZ)) {
> erofs_err(sb, "failed to set erofs blksize");
> --
> 2.27.0
Introduce an anonymous inode for data blobs so that erofs can cache
metadata directly within that inode.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
---
fs/erofs/fscache.c | 39 ++++++++++++++++++++++++++++++++++++---
fs/erofs/internal.h | 6 ++++--
2 files changed, 40 insertions(+), 5 deletions(-)
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index dfff245b006b..26f038d9c4e1 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -5,12 +5,17 @@
#include <linux/fscache.h>
#include "internal.h"
+static const struct address_space_operations erofs_fscache_meta_aops = {
+};
+
int erofs_fscache_register_cookie(struct super_block *sb,
- struct erofs_fscache **fscache, char *name)
+ struct erofs_fscache **fscache,
+ char *name, bool need_inode)
{
struct fscache_volume *volume = EROFS_SB(sb)->volume;
struct erofs_fscache *ctx;
struct fscache_cookie *cookie;
+ int ret;
ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
if (!ctx)
@@ -20,15 +25,40 @@ int erofs_fscache_register_cookie(struct super_block *sb,
name, strlen(name), NULL, 0, 0);
if (!cookie) {
erofs_err(sb, "failed to get cookie for %s", name);
- kfree(name);
- return -EINVAL;
+ ret = -EINVAL;
+ goto err;
}
fscache_use_cookie(cookie, false);
ctx->cookie = cookie;
+ if (need_inode) {
+ struct inode *const inode = new_inode(sb);
+
+ if (!inode) {
+ erofs_err(sb, "failed to get anon inode for %s", name);
+ ret = -ENOMEM;
+ goto err_cookie;
+ }
+
+ set_nlink(inode, 1);
+ inode->i_size = OFFSET_MAX;
+ inode->i_mapping->a_ops = &erofs_fscache_meta_aops;
+ mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
+
+ ctx->inode = inode;
+ }
+
*fscache = ctx;
return 0;
+
+err_cookie:
+ fscache_unuse_cookie(ctx->cookie, NULL, NULL);
+ fscache_relinquish_cookie(ctx->cookie, false);
+ ctx->cookie = NULL;
+err:
+ kfree(ctx);
+ return ret;
}
void erofs_fscache_unregister_cookie(struct erofs_fscache **fscache)
@@ -42,6 +72,9 @@ void erofs_fscache_unregister_cookie(struct erofs_fscache **fscache)
fscache_relinquish_cookie(ctx->cookie, false);
ctx->cookie = NULL;
+ iput(ctx->inode);
+ ctx->inode = NULL;
+
kfree(ctx);
*fscache = NULL;
}
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index b1f19f058503..5867cb63fd74 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -99,6 +99,7 @@ struct erofs_sb_lz4_info {
struct erofs_fscache {
struct fscache_cookie *cookie;
+ struct inode *inode;
};
struct erofs_sb_info {
@@ -632,7 +633,8 @@ int erofs_fscache_register_fs(struct super_block *sb);
void erofs_fscache_unregister_fs(struct super_block *sb);
int erofs_fscache_register_cookie(struct super_block *sb,
- struct erofs_fscache **fscache, char *name);
+ struct erofs_fscache **fscache,
+ char *name, bool need_inode);
void erofs_fscache_unregister_cookie(struct erofs_fscache **fscache);
#else
static inline int erofs_fscache_register_fs(struct super_block *sb)
@@ -643,7 +645,7 @@ static inline void erofs_fscache_unregister_fs(struct super_block *sb) {}
static inline int erofs_fscache_register_cookie(struct super_block *sb,
struct erofs_fscache **fscache,
- char *name)
+ char *name, bool need_inode)
{
return -EOPNOTSUPP;
}
--
2.27.0
... so that it can be used by the fscache mode introduced in the
following patches.
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
---
fs/erofs/data.c | 4 ++--
fs/erofs/internal.h | 2 ++
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 780db1e5f4b7..bc22642358ec 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -110,8 +110,8 @@ static int erofs_map_blocks_flatmode(struct inode *inode,
return 0;
}
-static int erofs_map_blocks(struct inode *inode,
- struct erofs_map_blocks *map, int flags)
+int erofs_map_blocks(struct inode *inode,
+ struct erofs_map_blocks *map, int flags)
{
struct super_block *sb = inode->i_sb;
struct erofs_inode *vi = EROFS_I(inode);
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 5298c4ee277d..fe9564e5091e 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -486,6 +486,8 @@ void *erofs_read_metabuf(struct erofs_buf *buf, struct super_block *sb,
int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *dev);
int erofs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 len);
+int erofs_map_blocks(struct inode *inode,
+ struct erofs_map_blocks *map, int flags);
/* inode.c */
static inline unsigned long erofs_inode_hash(erofs_nid_t nid)
--
2.27.0
As in multi-device mode, erofs can be mounted from one mandatory primary
data blob and multiple optional extra data blobs.
Register an fscache context for each extra data blob.
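The primary/extra blob selection described above can be modelled in user
space roughly as follows. This is only an illustrative sketch, not the
kernel code: struct sketch_blob, struct sketch_sb and sketch_pick_blob()
are hypothetical names, and the "device id N maps to extras[N - 1]"
indexing is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-blob fscache context. */
struct sketch_blob {
	const char *name;
};

/* Hypothetical stand-in for the per-superblock blob table. */
struct sketch_sb {
	struct sketch_blob *primary;	/* mandatory primary data blob */
	struct sketch_blob **extras;	/* optional extra data blobs */
	size_t nr_extras;
};

/*
 * Device id 0 selects the primary blob; id N (N >= 1) selects
 * extras[N - 1]. Returns NULL when the id is out of range.
 */
struct sketch_blob *sketch_pick_blob(const struct sketch_sb *sb,
				     unsigned int deviceid)
{
	assert(sb != NULL);

	if (deviceid == 0)
		return sb->primary;
	if (deviceid > sb->nr_extras)
		return NULL;
	return sb->extras[deviceid - 1];
}
```

With one primary and two extra blobs, ids 0, 1 and 2 resolve to a blob
and id 3 resolves to NULL.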
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
---
fs/erofs/data.c | 3 +++
fs/erofs/internal.h | 2 ++
fs/erofs/super.c | 8 +++++++-
3 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index bc22642358ec..14b64d960541 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -199,6 +199,7 @@ int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *map)
map->m_bdev = sb->s_bdev;
map->m_daxdev = EROFS_SB(sb)->dax_dev;
map->m_dax_part_off = EROFS_SB(sb)->dax_part_off;
+ map->m_fscache = EROFS_SB(sb)->s_fscache;
if (map->m_deviceid) {
down_read(&devs->rwsem);
@@ -210,6 +211,7 @@ int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *map)
map->m_bdev = dif->bdev;
map->m_daxdev = dif->dax_dev;
map->m_dax_part_off = dif->dax_part_off;
+ map->m_fscache = dif->fscache;
up_read(&devs->rwsem);
} else if (devs->extra_devices) {
down_read(&devs->rwsem);
@@ -227,6 +229,7 @@ int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *map)
map->m_bdev = dif->bdev;
map->m_daxdev = dif->dax_dev;
map->m_dax_part_off = dif->dax_part_off;
+ map->m_fscache = dif->fscache;
break;
}
}
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 386658416159..fa488af8dfcf 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -49,6 +49,7 @@ typedef u32 erofs_blk_t;
struct erofs_device_info {
char *path;
+ struct erofs_fscache *fscache;
struct block_device *bdev;
struct dax_device *dax_dev;
u64 dax_part_off;
@@ -482,6 +483,7 @@ static inline int z_erofs_map_blocks_iter(struct inode *inode,
#endif /* !CONFIG_EROFS_FS_ZIP */
struct erofs_map_dev {
+ struct erofs_fscache *m_fscache;
struct block_device *m_bdev;
struct dax_device *m_daxdev;
u64 m_dax_part_off;
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 61dc900295f9..c6755bcae4a6 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -259,7 +259,12 @@ static int erofs_init_devices(struct super_block *sb,
}
dis = ptr + erofs_blkoff(pos);
- if (!erofs_is_fscache_mode(sb)) {
+ if (erofs_is_fscache_mode(sb)) {
+ err = erofs_fscache_register_cookie(sb, &dif->fscache,
+ dif->path, false);
+ if (err)
+ break;
+ } else {
bdev = blkdev_get_by_path(dif->path,
FMODE_READ | FMODE_EXCL,
sb->s_type);
@@ -710,6 +715,7 @@ static int erofs_release_device_info(int id, void *ptr, void *data)
fs_put_dax(dif->dax_dev);
if (dif->bdev)
blkdev_put(dif->bdev, FMODE_READ | FMODE_EXCL);
+ erofs_fscache_unregister_cookie(&dif->fscache);
kfree(dif->path);
kfree(dif);
return 0;
--
2.27.0
Until now, erofs has been a purely block-device-based filesystem.
A new fscache-based mode is going to be introduced for erofs to support
scenarios where on-demand read semantics are needed, e.g. container
image distribution. In this case, erofs can be mounted from data blobs
through fscache.
Add a helper that checks which mode erofs works in, and tweak the code
in preparation for the upcoming fscache mode.
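For readers following along, the mode check and its effect on the statfs
id can be modelled in user space roughly as follows. This is only a
sketch under stated assumptions: the sketch_* types and functions are
hypothetical names, and the ondemand_built_in parameter stands in for
IS_ENABLED(CONFIG_EROFS_FS_ONDEMAND).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures. */
struct sketch_block_device {
	uint32_t bd_dev;
};

struct sketch_super_block {
	struct sketch_block_device *s_bdev;	/* NULL when mounted from blobs */
};

/* fscache mode: on-demand support is built in and there is no bdev. */
bool sketch_is_fscache_mode(const struct sketch_super_block *sb,
			    bool ondemand_built_in)
{
	assert(sb != NULL);
	return ondemand_built_in && sb->s_bdev == NULL;
}

/*
 * The filesystem id falls back to 0 in fscache mode, since there is
 * no block device to derive it from. The sketch assumes s_bdev is
 * non-NULL whenever fscache mode is not active.
 */
uint64_t sketch_statfs_id(const struct sketch_super_block *sb,
			  bool ondemand_built_in)
{
	if (sketch_is_fscache_mode(sb, ondemand_built_in))
		return 0;
	return sb->s_bdev->bd_dev;
}
```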
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
---
fs/erofs/internal.h | 5 +++++
fs/erofs/super.c | 44 +++++++++++++++++++++++++++++---------------
2 files changed, 34 insertions(+), 15 deletions(-)
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index fe9564e5091e..05a97533b1e9 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -161,6 +161,11 @@ struct erofs_sb_info {
#define set_opt(opt, option) ((opt)->mount_opt |= EROFS_MOUNT_##option)
#define test_opt(opt, option) ((opt)->mount_opt & EROFS_MOUNT_##option)
+static inline bool erofs_is_fscache_mode(struct super_block *sb)
+{
+ return IS_ENABLED(CONFIG_EROFS_FS_ONDEMAND) && !sb->s_bdev;
+}
+
enum {
EROFS_ZIP_CACHE_DISABLED,
EROFS_ZIP_CACHE_READAHEAD,
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 0c4b41130c2f..724d5ff0d78c 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -259,15 +259,19 @@ static int erofs_init_devices(struct super_block *sb,
}
dis = ptr + erofs_blkoff(pos);
- bdev = blkdev_get_by_path(dif->path,
- FMODE_READ | FMODE_EXCL,
- sb->s_type);
- if (IS_ERR(bdev)) {
- err = PTR_ERR(bdev);
- break;
+ if (!erofs_is_fscache_mode(sb)) {
+ bdev = blkdev_get_by_path(dif->path,
+ FMODE_READ | FMODE_EXCL,
+ sb->s_type);
+ if (IS_ERR(bdev)) {
+ err = PTR_ERR(bdev);
+ break;
+ }
+ dif->bdev = bdev;
+ dif->dax_dev = fs_dax_get_by_bdev(bdev,
+ &dif->dax_part_off);
}
- dif->bdev = bdev;
- dif->dax_dev = fs_dax_get_by_bdev(bdev, &dif->dax_part_off);
+
dif->blocks = le32_to_cpu(dis->blocks);
dif->mapped_blkaddr = le32_to_cpu(dis->mapped_blkaddr);
sbi->total_blocks += dif->blocks;
@@ -586,21 +590,28 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
sb->s_magic = EROFS_SUPER_MAGIC;
- if (!sb_set_blocksize(sb, EROFS_BLKSIZ)) {
- erofs_err(sb, "failed to set erofs blksize");
- return -EINVAL;
- }
-
sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
if (!sbi)
return -ENOMEM;
sb->s_fs_info = sbi;
sbi->opt = ctx->opt;
- sbi->dax_dev = fs_dax_get_by_bdev(sb->s_bdev, &sbi->dax_part_off);
sbi->devs = ctx->devs;
ctx->devs = NULL;
+ if (erofs_is_fscache_mode(sb)) {
+ sb->s_blocksize = EROFS_BLKSIZ;
+ sb->s_blocksize_bits = LOG_BLOCK_SIZE;
+ } else {
+ if (!sb_set_blocksize(sb, EROFS_BLKSIZ)) {
+ erofs_err(sb, "failed to set erofs blksize");
+ return -EINVAL;
+ }
+
+ sbi->dax_dev = fs_dax_get_by_bdev(sb->s_bdev,
+ &sbi->dax_part_off);
+ }
+
err = erofs_read_superblock(sb);
if (err)
return err;
@@ -857,7 +868,10 @@ static int erofs_statfs(struct dentry *dentry, struct kstatfs *buf)
{
struct super_block *sb = dentry->d_sb;
struct erofs_sb_info *sbi = EROFS_SB(sb);
- u64 id = huge_encode_dev(sb->s_bdev->bd_dev);
+ u64 id = 0;
+
+ if (!erofs_is_fscache_mode(sb))
+ id = huge_encode_dev(sb->s_bdev->bd_dev);
buf->f_type = sb->s_magic;
buf->f_bsize = EROFS_BLKSIZ;
--
2.27.0
Enable on-demand read mode by adding an optional parameter to the "bind"
command.
On-demand mode will be turned on when this parameter is "ondemand", i.e.
"bind ondemand". Otherwise cachefiles will work in the original mode.
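The argument handling described above can be sketched in user space as
follows. This is a minimal model, not the kernel implementation:
sketch_parse_bind_arg() and SKETCH_ONDEMAND_MODE are hypothetical names,
and ondemand_built_in stands in for
IS_ENABLED(CONFIG_CACHEFILES_ONDEMAND).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the on-demand bit in cache->flags. */
#define SKETCH_ONDEMAND_MODE (1UL << 0)

/*
 * Parse the optional argument to "bind".
 * An empty argument keeps the original mode; "ondemand" turns on
 * on-demand mode when support is built in; anything else is rejected.
 * Returns 0 on success and -EINVAL for an unsupported argument.
 */
int sketch_parse_bind_arg(const char *args, bool ondemand_built_in,
			  unsigned long *flags)
{
	assert(args != NULL && flags != NULL);

	if (*args == '\0')
		return 0;	/* plain "bind": original mode */

	if (ondemand_built_in && strcmp(args, "ondemand") == 0) {
		*flags |= SKETCH_ONDEMAND_MODE;
		return 0;
	}

	return -EINVAL;
}
```

In this model "bind" and "bind ondemand" succeed, while "bind bogus", or
"bind ondemand" without on-demand support built in, fail with -EINVAL.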
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/cachefiles/daemon.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/fs/cachefiles/daemon.c b/fs/cachefiles/daemon.c
index 5b1d0642c749..aa4efcabb5e3 100644
--- a/fs/cachefiles/daemon.c
+++ b/fs/cachefiles/daemon.c
@@ -755,11 +755,6 @@ static int cachefiles_daemon_bind(struct cachefiles_cache *cache, char *args)
cache->brun_percent >= 100)
return -ERANGE;
- if (*args) {
- pr_err("'bind' command doesn't take an argument\n");
- return -EINVAL;
- }
-
if (!cache->rootdirname) {
pr_err("No cache directory specified\n");
return -EINVAL;
@@ -771,6 +766,18 @@ static int cachefiles_daemon_bind(struct cachefiles_cache *cache, char *args)
return -EBUSY;
}
+ if (IS_ENABLED(CONFIG_CACHEFILES_ONDEMAND)) {
+ if (!strcmp(args, "ondemand")) {
+ set_bit(CACHEFILES_ONDEMAND_MODE, &cache->flags);
+ } else if (*args) {
+ pr_err("Invalid argument to the 'bind' command\n");
+ return -EINVAL;
+ }
+ } else if (*args) {
+ pr_err("'bind' command doesn't take an argument\n");
+ return -EINVAL;
+ }
+
/* Make sure we have copies of the tag string */
if (!cache->tag) {
/*
--
2.27.0
On Mon, Apr 25, 2022 at 08:21:22PM +0800, Jeffle Xu wrote:
> changes since v9:
> - rebase to 5.18-rc3
> - cachefiles: extract cachefiles_in_ondemand_mode() helper; add barrier
> pair between enqueuing and flushing requests; make the xarray
> structures non-conditionally defined in struct cachefiles_cache
> (patch 2) (David Howells)
> - cachefiles: use refcount_t for unbind_pincount; run "cachefiles_open = 0;"
> cleanup only when unbind_pincount is decreased to 0 (patch 3)
> (David Howells)
> - cachefiles: rename CACHEFILES_IOC_CREAD ioctl to
> CACHEFILES_IOC_READ_COMPLETE (patch 5) (David Howells)
> - cachefiles: fix the error message when the argument to the 'bind'
> command is invalid (patch 6) (David Howells)
> - cachefiles: update the documentation polished by David (patch 8)
> - erofs: tweak the code arrangement of erofs_fscache_meta_readpage()
> (patch 17) (Gao Xiang)
> - erofs: add comment on error cases (patch 20) (Gao Xiang)
> - update Tested-by tags in the cover letter
>
>
> Kernel Patchset
> ---------------
> Git tree:
>
> https://github.com/lostjeffle/linux.git jingbo/dev-erofs-fscache-v10
>
Having come to an agreement with David on IRC, I will push this series
out to -next later for wider testing, aiming for 5.19.
Thanks,
Gao Xiang