changes since v2:
- fscache,erofs: erofs now calls fscache_read() directly instead of
going through the netfs library to read data from the cache, to avoid
potential conflicts with the upcoming netfs library refactoring [1]
(patch 12) (David Howells)
- erofs: Implement fscache-based readahead. The current implementation
is still rough and synchronous; it needs to be improved in the next
iteration.
- cachefiles_ondemand: use an xarray instead of an IDR to manage pending
read requests (patch 5) (Matthew Wilcox)
- This patch set is also available at:
https://github.com/lostjeffle/linux/commits/jingbo/dev-erofs-fscache
[1] https://lore.kernel.org/all/[email protected]/t/#mfbb2053476760d8fac723c57dad529192a5084c6
RFC: https://lore.kernel.org/all/[email protected]/t/
v1: https://lore.kernel.org/lkml/[email protected]/T/
v2: https://lore.kernel.org/all/[email protected]/t/
[Background]
============
Nydus is a remote container snapshotter specially optimised for
container image distribution over the network. It has recently been
accepted as a sub-project of containerd[1]. Nydus is an effective
container image acceleration solution, since it pulls data from the
remote only when it is actually needed, a.k.a. on-demand reading.
erofs (Enhanced Read-Only File System) is a filesystem specially
optimised for read-only scenarios. (Documentation/filesystems/erofs.rst)
Recently we have been focusing on erofs in the container image
distribution scenario [2], trying to combine it with nydus. In this
case, erofs can be mounted from one bootstrap file (metadata) with
(optional) multiple data blob files (data) stored on another local
filesystem. (All these files are actually image files in erofs disk
format.)
To accelerate container startup (fetching the container image from the
remote and then starting the container), we want the bootstrap blob file
to support demand read. That is, erofs can be mounted and accessed
even when the bootstrap/data blob files have not been fully downloaded.
That means we have to manage the cache state of the bootstrap/data blob
files (on a cache hit, read directly from the local cache; on a cache
miss, fetch the data somehow). It would be painful, and arguably unwise,
for erofs to implement the cache management itself, so we prefer to let
fscache/cachefiles do it. Besides, the demand-read feature is general
and can benefit other use cases if it is implemented at the fscache
level.
[1] https://d7y.io/en-us/blog/containerd_accepted_nydus-snapshotter.html
[2] https://sched.co/pcdL
[Overall Design]
================
The upper fs uses a backing file on the local fs as the local cache
(exactly the "cachefiles" way), and relies on fscache to detect whether
data is ready (cache hit/miss). Since fscache currently detects cache
hit/miss by looking for holes in the backing files, our demand-read
mechanism also relies on hole detection.
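As a rough userspace illustration of the hole-based hit/miss check
described above, the sketch below uses lseek() with SEEK_DATA/SEEK_HOLE.
The helper name range_is_cached() is made up for this example, and this
is not the in-kernel fscache/cachefiles code; it only assumes a Linux
system whose filesystem reports holes through lseek().

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux values, used only if the libc headers did not expose them. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/*
 * Hypothetical helper: return true if [off, off + len) of the file is
 * entirely backed by data (a "cache hit"), and false if any part of the
 * range is a hole, lies beyond EOF, or lseek() fails (a "cache miss").
 */
static bool range_is_cached(int fd, off_t off, off_t len)
{
	off_t data, hole;

	if (len <= 0)
		return false;

	/* The first data at or after 'off' must start exactly at 'off'. */
	data = lseek(fd, off, SEEK_DATA);
	if (data != off)
		return false;

	/* The next hole (EOF counts as a hole) must not start in range. */
	hole = lseek(fd, off, SEEK_HOLE);
	if (hole < 0)
		return false;
	return hole >= off + len;
}
```

For a freshly created, fully sparse backing file this helper reports a
miss for every range; once the range has been written, it reports a hit.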
1. initial phase
At the very beginning, the user daemon creates the backing files
(bootstrap/data blob files) under the corresponding directory (under
<root>/cache/<volume>/<fan>/) in advance. These backing files are
completely sparse files (with zero disk usage). Since these backing
files are all read-only and their sizes are known prior to mounting, the
user daemon sets the size of each file when creating it.
2. cache miss
When a file range (of a bootstrap/data blob file) is accessed for the
first time, a cache miss is triggered and .issue_op() is called to fetch
the data somehow.
In the demand-read case, we rely on a user daemon to fetch the data
from local/remote. In this case, .issue_op() just packages the file
range into a message and informs the user daemon. The user daemon needs
to poll and wait on the devnode (/dev/cachefiles_demand). Once woken,
the user daemon reads the devnode to get the file range information, and
then fetches the data corresponding to the file range somehow, e.g.
downloads it from the remote over the network. Once the data is ready,
the user daemon writes the fetched data into the backing file and then
informs the cachefiles backend by writing to the devnode. The cachefiles
backend, which has been blocked in the previous .issue_op() call, is
then woken up. By then the data is ready in the backing file, and the
upper fs reissues the read request to the backing file.
3. cache hit
Once the data is ready in the backing file, the upper fs reads from
the backing file directly.
[Advantage of fscache-based demand-read]
========================================
1. Asynchronous Prefetch
In the current mechanism, fscache is responsible for cache state
management, while the data plane (fetching data from local/remote on a
cache miss) is handled on the user daemon side.
If the data is already in the backing file, the upper fs (e.g. erofs)
reads from the backing file directly and no longer traps to user space.
Thus the user daemon can fetch data (from the remote) asynchronously in
the background, which speeds up later accesses to the backing file.
2. Support massive blob files
This mechanism also supports a large number of backing files, and thus
benefits densely deployed scenarios.
In our use case, one container image corresponds to one bootstrap file
(required) and multiple data blob files (optional). For example, one
container image for node.js corresponds to ~20 files in total. In a
densely deployed environment, there could be as many as hundreds of
containers and thus thousands of backing files on one machine.
[Test]
======
You can run a quick test with
https://github.com/lostjeffle/demand-read-cachefilesd
Jeffle Xu (22):
fscache: export fscache_end_operation()
fscache: add a method to support on-demand read semantics
cachefiles: extract generic function for daemon methods
cachefiles: detect backing file size in on-demand read mode
cachefiles: introduce new devnode for on-demand read mode
erofs: use meta buffers for erofs_read_superblock()
erofs: export erofs_map_blocks()
erofs: add mode checking helper
erofs: register global fscache volume
erofs: add cookie context helper functions
erofs: add anonymous inode managing page cache of blob file
erofs: add erofs_fscache_read_page() helper
erofs: register cookie context for bootstrap blob
erofs: implement fscache-based metadata read
erofs: implement fscache-based data read for non-inline layout
erofs: implement fscache-based data read for inline layout
erofs: register cookie context for data blobs
erofs: implement fscache-based data read for data blobs
erofs: implement fscache-based data readahead for hole
erofs: implement fscache-based data readahead for non-inline layout
erofs: implement fscache-based data readahead for inline layout
erofs: add 'uuid' mount option
Documentation/filesystems/netfs_library.rst | 18 +
fs/cachefiles/Kconfig | 13 +
fs/cachefiles/daemon.c | 243 +++++++++--
fs/cachefiles/internal.h | 12 +
fs/cachefiles/io.c | 60 +++
fs/cachefiles/main.c | 27 ++
fs/cachefiles/namei.c | 60 ++-
fs/erofs/Makefile | 3 +-
fs/erofs/data.c | 18 +-
fs/erofs/fscache.c | 451 ++++++++++++++++++++
fs/erofs/inode.c | 6 +-
fs/erofs/internal.h | 30 ++
fs/erofs/super.c | 106 ++++-
fs/fscache/internal.h | 11 -
fs/nfs/fscache.c | 8 -
include/linux/fscache.h | 39 ++
include/linux/netfs.h | 4 +
include/uapi/linux/cachefiles_ondemand.h | 14 +
18 files changed, 1050 insertions(+), 73 deletions(-)
create mode 100644 fs/erofs/fscache.c
create mode 100644 include/uapi/linux/cachefiles_ondemand.h
--
2.27.0
Introduce one anonymous inode for managing the page cache of the
corresponding blob file. erofs can then read directly from the address
space of the anonymous inode on a cache hit.
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/erofs/fscache.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
fs/erofs/internal.h | 3 ++-
2 files changed, 44 insertions(+), 4 deletions(-)
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index c043d7709d65..3addd9aa549c 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -6,6 +6,9 @@
static struct fscache_volume *volume;
+static const struct address_space_operations erofs_fscache_blob_aops = {
+};
+
static int erofs_fscache_init_cookie(struct erofs_fscache_context *ctx,
char *path)
{
@@ -36,8 +39,34 @@ void erofs_fscache_cleanup_cookie(struct erofs_fscache_context *ctx)
ctx->cookie = NULL;
}
+static int erofs_fscache_get_inode(struct erofs_fscache_context *ctx,
+ struct super_block *sb)
+{
+ struct inode *const inode = new_inode(sb);
+
+ if (!inode)
+ return -ENOMEM;
+
+ set_nlink(inode, 1);
+ inode->i_size = OFFSET_MAX;
+
+ inode->i_mapping->a_ops = &erofs_fscache_blob_aops;
+ mapping_set_gfp_mask(inode->i_mapping,
+ GFP_NOFS | __GFP_HIGHMEM | __GFP_MOVABLE);
+ ctx->inode = inode;
+ return 0;
+}
+
+static inline
+void erofs_fscache_put_inode(struct erofs_fscache_context *ctx)
+{
+ iput(ctx->inode);
+ ctx->inode = NULL;
+}
+
static int erofs_fscache_init_ctx(struct erofs_fscache_context *ctx,
- struct super_block *sb, char *path)
+ struct super_block *sb, char *path,
+ bool need_inode)
{
int ret;
@@ -47,6 +76,15 @@ static int erofs_fscache_init_ctx(struct erofs_fscache_context *ctx,
return ret;
}
+ if (need_inode) {
+ ret = erofs_fscache_get_inode(ctx, sb);
+ if (ret) {
+ erofs_err(sb, "failed to get anonymous inode");
+ erofs_fscache_cleanup_cookie(ctx);
+ return ret;
+ }
+ }
+
return 0;
}
@@ -54,10 +92,11 @@ static inline
void erofs_fscache_cleanup_ctx(struct erofs_fscache_context *ctx)
{
erofs_fscache_cleanup_cookie(ctx);
+ erofs_fscache_put_inode(ctx);
}
struct erofs_fscache_context *erofs_fscache_get_ctx(struct super_block *sb,
- char *path)
+ char *path, bool need_inode)
{
struct erofs_fscache_context *ctx;
int ret;
@@ -66,7 +105,7 @@ struct erofs_fscache_context *erofs_fscache_get_ctx(struct super_block *sb,
if (!ctx)
return ERR_PTR(-ENOMEM);
- ret = erofs_fscache_init_ctx(ctx, sb, path);
+ ret = erofs_fscache_init_ctx(ctx, sb, path, need_inode);
if (ret) {
kfree(ctx);
return ERR_PTR(ret);
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 1f5bc69e8e9f..bb5e992fe0df 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -99,6 +99,7 @@ struct erofs_sb_lz4_info {
struct erofs_fscache_context {
struct fscache_cookie *cookie;
+ struct inode *inode;
};
struct erofs_sb_info {
@@ -626,7 +627,7 @@ int erofs_init_fscache(void);
void erofs_exit_fscache(void);
struct erofs_fscache_context *erofs_fscache_get_ctx(struct super_block *sb,
- char *path);
+ char *path, bool need_inode);
void erofs_fscache_put_ctx(struct erofs_fscache_context *ctx);
#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
--
2.27.0
Similar to the multi-device mode, erofs can be mounted from multiple
blob files (one bootstrap blob file and optionally multiple data blob
files). In this case, each device slot contains the path of the
corresponding data blob file.
This patch registers the corresponding cookie context for each data
blob file.
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/erofs/internal.h | 1 +
fs/erofs/super.c | 27 +++++++++++++++++++--------
2 files changed, 20 insertions(+), 8 deletions(-)
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 548f928b0ded..5d514c7b73cc 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -53,6 +53,7 @@ struct erofs_device_info {
struct block_device *bdev;
struct dax_device *dax_dev;
u64 dax_part_off;
+ struct erofs_fscache_context *ctx;
u32 blocks;
u32 mapped_blkaddr;
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 8c5783c6f71f..f058a04a00c7 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -250,6 +250,7 @@ static int erofs_init_devices(struct super_block *sb,
down_read(&sbi->devs->rwsem);
idr_for_each_entry(&sbi->devs->tree, dif, id) {
struct block_device *bdev;
+ struct erofs_fscache_context *ctx;
ptr = erofs_read_metabuf(&buf, sb, erofs_blknr(pos),
EROFS_KMAP);
@@ -259,15 +260,24 @@ static int erofs_init_devices(struct super_block *sb,
}
dis = ptr + erofs_blkoff(pos);
- bdev = blkdev_get_by_path(dif->path,
- FMODE_READ | FMODE_EXCL,
- sb->s_type);
- if (IS_ERR(bdev)) {
- err = PTR_ERR(bdev);
- break;
+ if (erofs_bdev_mode(sb)) {
+ bdev = blkdev_get_by_path(dif->path,
+ FMODE_READ | FMODE_EXCL,
+ sb->s_type);
+ if (IS_ERR(bdev)) {
+ err = PTR_ERR(bdev);
+ break;
+ }
+ dif->bdev = bdev;
+ dif->dax_dev = fs_dax_get_by_bdev(bdev, &dif->dax_part_off);
+ } else {
+ ctx = erofs_fscache_get_ctx(sb, dif->path, false);
+ if (IS_ERR(ctx)) {
+ err = PTR_ERR(ctx);
+ break;
+ }
+ dif->ctx = ctx;
}
- dif->bdev = bdev;
- dif->dax_dev = fs_dax_get_by_bdev(bdev, &dif->dax_part_off);
dif->blocks = le32_to_cpu(dis->blocks);
dif->mapped_blkaddr = le32_to_cpu(dis->mapped_blkaddr);
sbi->total_blocks += dif->blocks;
@@ -694,6 +704,7 @@ static int erofs_release_device_info(int id, void *ptr, void *data)
{
struct erofs_device_info *dif = ptr;
+ erofs_fscache_put_ctx(dif->ctx);
fs_put_dax(dif->dax_dev);
if (dif->bdev)
blkdev_put(dif->bdev, FMODE_READ | FMODE_EXCL);
--
2.27.0
This patch implements the data plane of reading metadata from bootstrap
blob file over fscache.
Note that currently it only supports the scenario where the backing
file has no holes. Once it hits a hole in the backing file, erofs fails
the IO with -EOPNOTSUPP for now. A following patch will fix this by
implementing the demand-read mode.
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/erofs/data.c | 11 +++++++++--
fs/erofs/fscache.c | 24 ++++++++++++++++++++++++
fs/erofs/internal.h | 3 +++
3 files changed, 36 insertions(+), 2 deletions(-)
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 6e2a28242453..1bff99576883 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -31,15 +31,22 @@ void erofs_put_metabuf(struct erofs_buf *buf)
void *erofs_read_metabuf(struct erofs_buf *buf, struct super_block *sb,
erofs_blk_t blkaddr, enum erofs_kmap_type type)
{
- struct address_space *const mapping = sb->s_bdev->bd_inode->i_mapping;
+ struct address_space *mapping;
+ struct erofs_sb_info *sbi = EROFS_SB(sb);
erofs_off_t offset = blknr_to_addr(blkaddr);
pgoff_t index = offset >> PAGE_SHIFT;
struct page *page = buf->page;
if (!page || page->index != index) {
erofs_put_metabuf(buf);
- page = read_cache_page_gfp(mapping, index,
+ if (erofs_bdev_mode(sb)) {
+ mapping = sb->s_bdev->bd_inode->i_mapping;
+ page = read_cache_page_gfp(mapping, index,
mapping_gfp_constraint(mapping, ~__GFP_FS));
+ } else {
+ page = erofs_fscache_read_cache_page(sbi->bootstrap,
+ index);
+ }
if (IS_ERR(page))
return page;
/* should already be PageUptodate, no need to lock page */
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index f4aade711664..a29d2ecff58b 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -42,9 +42,33 @@ static int erofs_fscache_read_page(struct fscache_cookie *cookie,
return ret;
}
+static int erofs_fscache_readpage_blob(struct file *data, struct page *page)
+{
+ int ret;
+ struct erofs_fscache_context *ctx =
+ (struct erofs_fscache_context *)data;
+
+ ret = erofs_fscache_read_page(ctx->cookie, page, page_offset(page));
+ if (ret)
+ SetPageError(page);
+ else
+ SetPageUptodate(page);
+
+ unlock_page(page);
+ return ret;
+}
+
static const struct address_space_operations erofs_fscache_blob_aops = {
+ .readpage = erofs_fscache_readpage_blob,
};
+struct page *erofs_fscache_read_cache_page(struct erofs_fscache_context *ctx,
+ pgoff_t index)
+{
+ DBG_BUGON(!ctx->inode);
+ return read_mapping_page(ctx->inode->i_mapping, index, ctx);
+}
+
static int erofs_fscache_init_cookie(struct erofs_fscache_context *ctx,
char *path)
{
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 277dcd5888ea..fca706cfaf72 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -633,6 +633,9 @@ struct erofs_fscache_context *erofs_fscache_get_ctx(struct super_block *sb,
char *path, bool need_inode);
void erofs_fscache_put_ctx(struct erofs_fscache_context *ctx);
+struct page *erofs_fscache_read_cache_page(struct erofs_fscache_context *ctx,
+ pgoff_t index);
+
#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
#endif /* __EROFS_INTERNAL_H */
--
2.27.0
Hi David,
On Wed, Feb 09, 2022 at 02:00:46PM +0800, Jeffle Xu wrote:
...
>
>
> Jeffle Xu (22):
> fscache: export fscache_end_operation()
> fscache: add a method to support on-demand read semantics
> cachefiles: extract generic function for daemon methods
> cachefiles: detect backing file size in on-demand read mode
> cachefiles: introduce new devnode for on-demand read mode
...
>
> Documentation/filesystems/netfs_library.rst | 18 +
> fs/cachefiles/Kconfig | 13 +
> fs/cachefiles/daemon.c | 243 +++++++++--
> fs/cachefiles/internal.h | 12 +
> fs/cachefiles/io.c | 60 +++
> fs/cachefiles/main.c | 27 ++
> fs/cachefiles/namei.c | 60 ++-
Would you mind reviewing this version? We followed the advice you gave
on v2: it reuses almost all of the cachefiles code, except that the
cachefiles file size has a slightly different meaning and there is a new
daemon devnode.
I think it could serve as the first step towards implementing
fscache-based on-demand read.
Thanks,
Gao Xiang